
Stochastic Differential Equations: A Systems-Theoretic Approach

(DRAFT)

Maxim Raginsky

May 1, 2023
Contents

I Preliminaries 1

1 Introduction 2
1.1 Notes and further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Brownian Motion and Diffusion Processes 3


2.1 Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Definition and construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Basic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.3 Multidimensional Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Diffusion processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 The Kolmogorov equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Notes and further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Stochastic Integrals and Stochastic Differential Equations 16


3.1 From Kolmogorov to Itô: an internalist model of a diffusion process . . . . . . . . . . 17
3.1.1 Filtrations, martingales, and all that . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 A martingale associated with a diffusion process . . . . . . . . . . . . . . . . 19
3.1.3 Enter the stochastic integral . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.4 Back to diffusion processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.5 Multidimensional diffusion processes . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Stochastic integration with respect to a Brownian motion . . . . . . . . . . . . . . . 26
3.2.1 Stochastic differentials and Itô’s differentiation rule . . . . . . . . . . . . . . . 27
3.2.2 Multidimensional Itô calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Stochastic differential equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Coordinate changes and the Stratonovich integral . . . . . . . . . . . . . . . . 37
3.4 SDEs and models of engineering systems with random disturbances . . . . . . . . . . 40
3.4.1 The Wong–Zakai theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.2 Other models of physical noise processes . . . . . . . . . . . . . . . . . . . . . 43
3.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6 Notes and further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4 Stochastic calculus in path space 47
4.1 Cameron–Martin–Girsanov theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.1 Motivation: importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.2 Removing a constant drift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1.3 A general Girsanov theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.4 Absolute continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1.5 Weak solutions of SDEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2 The Feynman–Kac formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.1 A killing interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Notes and further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

II Applications 60

5 Stochastic control and estimation 61


5.1 Linear estimation: the Kalman–Bucy filter . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Nonlinear filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2.1 The innovations approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2.2 The reference measure approach . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Optimal control of diffusion processes . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.1 The linear quadratic regulator problem . . . . . . . . . . . . . . . . . . . . . . 75
5.3.2 Fleming’s logarithmic transformation and the Schrödinger bridge problem . . 77
5.4 Optimal control with partial observations . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4.1 The linear-quadratic-Gaussian (LQG) control problem . . . . . . . . . . . . . 80
5.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6 Optimization 84
6.1 Langevin dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.1.1 Convergence to equilibrium and the spectral gap . . . . . . . . . . . . . . . . 86
6.1.2 The relative entropy and the logarithmic Sobolev inequality . . . . . . . . . . 89
6.1.3 Simulated annealing in continuous time . . . . . . . . . . . . . . . . . . . . . 90
6.2 Constrained minimization and stochastic neural nets . . . . . . . . . . . . . . . . . . 92
6.3 Free energy minimization and optimal control . . . . . . . . . . . . . . . . . . . . . . 95
6.3.1 Free energy minimization in path space . . . . . . . . . . . . . . . . . . . . . 97
6.3.2 The optimal control interpretation . . . . . . . . . . . . . . . . . . . . . . . . 99

7 Sampling and generative models 100


7.1 Generative modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.2 Time reversal of diffusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.2.1 An optimal control interpretation . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.2.2 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.3 Sample-based construction and score matching . . . . . . . . . . . . . . . . . . . . . 107
7.4 Stochastic thermodynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

A Probability Facts 112
A.0.1 Convergence concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

B Analysis Facts 113

Part I

Preliminaries

Chapter 1

Introduction

To Be Written

1.1 Notes and further reading


Mortensen [Mor69] gives a good discussion of the engineering modeling aspects of stochastic calculus.

Chapter 2

Brownian Motion and Diffusion Processes

2.1 Brownian motion


2.1.1 Definition and construction
The standard one-dimensional Brownian motion (also known as the Wiener process) is a random
process W = (Wt )t≥0 with the following properties:

1. W0 = 0, and the increments Wt1 , Wt2 − Wt1 , Wt3 − Wt2 , . . . for 0 < t1 < t2 < t3 < . . . are
independent.

2. Wt − Ws is Gaussian with zero mean and variance |t − s|.

3. The sample paths t 7→ Wt are continuous almost surely.

The first two items imply that W is a zero-mean Gaussian process with covariance function
E[Wt Ws ] = t ∧ s. There are a number of ways one could construct a Brownian motion. We will
illustrate one that has the advantage of being fairly direct and elementary¹ in the sense of requiring
the minimum of fancy probabilistic machinery.

¹ “Elementary” here does not mean “simple;” it just means that no advanced techniques are necessary beyond basic properties of Gaussian random variables and some real analysis.
To start with, we note that it suffices to construct the process only on the unit interval, i.e., for
t ∈ [0, 1]. Once this is done, we can generate an infinite sequence of independent copies of W on
[0, 1], i.e., W (0) , W (1) , W (2) , . . . and then take

\[
W_t = \begin{cases} W^{(0)}_t, & 0 \le t < 1 \\ W^{(0)}_1 + W^{(1)}_{t-1}, & 1 \le t < 2 \\ W^{(0)}_1 + W^{(1)}_1 + W^{(2)}_{t-2}, & 2 \le t < 3 \\ \;\vdots & \end{cases} \tag{2.1.1}
\]

– indeed, it is not hard to verify that the resulting process has the properties 1–3 above. We can now
motivate the construction of W := (Wt)t∈[0,1] formally as follows: Let ξ = (ξt)t∈[0,1] be a white noise
process, i.e., the time derivative of W: ξt = Ẇt. Let us pick a complete orthonormal basis (ψn)n∈Z+
of the Hilbert space L2 [0, 1] and expand ξ in this basis:

\[
\xi_t = \sum_{n=0}^\infty Z_n \psi_n(t), \qquad t \in [0,1]. \tag{2.1.2}
\]

This is a (formal) Karhunen–Loève expansion of the white noise. Here, the coefficients Zn are
random and given by
\[
Z_n = \langle \xi, \psi_n \rangle = \int_0^1 \xi_t\, \psi_n(t)\, dt. \tag{2.1.3}
\]

Evidently, E[Zn ] = 0 and, since E[ξt ξs ] = δ(t − s), they are mutually uncorrelated:
\begin{align}
E[Z_m Z_n] &= \int_0^1 \int_0^1 E[\xi_t \xi_s]\, \psi_m(t)\, \psi_n(s)\, ds\, dt \tag{2.1.4}\\
&= \int_0^1 \psi_m(t)\, \psi_n(t)\, dt \tag{2.1.5}\\
&= \langle \psi_m, \psi_n \rangle \tag{2.1.6}\\
&= \delta_{mn}. \tag{2.1.7}
\end{align}
Finally, since the white noise ξ is a Gaussian process and (2.1.3) expresses each Zn as a limit of
linear operations on ξ, the coefficients Zn are Gaussian random variables. Putting together all of
the above, we see that the coefficients Z0 , Z1 , . . . are independent, identically distributed (i.i.d.)
standard Gaussian (zero mean, unit variance) random variables. This suggests that we should be
able to obtain a Brownian motion W by passing the expansion (2.1.2) of the white noise ξ through
an integrator: W_t = \int_0^t \xi_s\, ds. This is indeed the case:
Theorem 1. Let (ψn)n∈Z+ be a complete orthonormal basis of L²[0, 1], and let (Zn)n∈Z+ be a
sequence of i.i.d. standard Gaussian random variables. Then the infinite series
\[
W_t = \sum_{n=0}^\infty Z_n \Psi_n(t), \qquad \text{where } \Psi_n(t) := \int_0^t \psi_n(s)\, ds, \tag{2.1.8}
\]
converges uniformly on [0, 1] a.s., and the limit is a Brownian motion.


Remark 1. The phrase “almost sure uniform convergence” here means that, with probability one,
\[
\lim_{n\to\infty} \sup_{t\in[0,1]} |W^n_t - W_t| = 0, \tag{2.1.9}
\]
where we have defined the partial sums
\[
W^n_t := \sum_{i=0}^n Z_i \Psi_i(t). \tag{2.1.10}
\]

Note that each Wtn , being a finite linear combination of i.i.d. standard Gaussian random variables,
is itself a Gaussian random variable with zero mean and variance
\[
E[(W^n_t)^2] = \sum_{i=0}^n |\Psi_i(t)|^2. \tag{2.1.11}
\]

4
Proof (Sketch). For each t ∈ [0, 1], let It denote the indicator function of the interval [0, t], i.e.,
It (s) = 1{s≤t} . Then It ∈ L2 [0, 1] and Ψn (t) = hIt , ψn i. Therefore, for m, n ∈ Z+ and t ∈ [0, 1] we
have
 2 
m
 X
E[(Wtm − Wtn )2 ] = E  Zi hIt , ψi i  (2.1.12)

i=n+1
m
X
= |hIt , ψi i|2 (2.1.13)
i=n+1
m,n→∞
−−−−−→ 0. (2.1.14)
That is, for each t, (Wtn )n∈Z+ is a Cauchy sequence in quadratic mean and therefore has a limit
Wt , both in quadratic mean and almost surely. Since each Wtn is Gaussian, so is Wt . It is also
straightforward if tedious to show that (Wt )t∈[0,1] is a Gaussian process by working with the limits
of the joint characteristic functions of Wtn ’s. Finally, using Parseval’s relation, we can write
  
\begin{align}
E[W_t W_s] &= E\Bigg[\Bigg(\sum_{m=0}^\infty Z_m \Psi_m(t)\Bigg)\Bigg(\sum_{n=0}^\infty Z_n \Psi_n(s)\Bigg)\Bigg] \tag{2.1.15}\\
&= \sum_{n=0}^\infty \langle I_t, \psi_n\rangle \langle I_s, \psi_n\rangle \tag{2.1.16}\\
&= \langle I_t, I_s\rangle \tag{2.1.17}\\
&= t \wedge s. \tag{2.1.18}
\end{align}
Thus, the limit W = (Wt)t∈[0,1] is a Gaussian process that has the properties 1 and 2 of the Brownian
motion. It remains to show that it has a.s. continuous sample paths.
While the a.s. uniform convergence holds for any choice of the basis functions ψn , the proof in
the general case is rather difficult. We will give a direct construction, due to Lévy and Ciesielski,
that makes use of a particular basis, the Haar wavelet basis. To introduce this basis, it is convenient
to work with two integer indices instead of one; apart from the straightforward relabeling, the ideas
are exactly the same. The Haar wavelets ψk,n with n = 0, 1, 2, . . . and k = 1, 3, 5, . . . , 2n − 1 are
defined as follows:

\[
\psi_{1,0}(t) := 1, \qquad \psi_{k,n}(t) := \begin{cases} +2^{(n-1)/2}, & (k-1)2^{-n} \le t < k2^{-n} \\ -2^{(n-1)/2}, & k2^{-n} \le t < (k+1)2^{-n} \\ 0, & \text{otherwise.} \end{cases} \tag{2.1.19}
\]

It is not hard to show that they form a complete orthonormal basis of L2 [0, 1]. Let us also generate a
doubly indexed sequence (Zk,n : n = 0, 1, . . . ; k = 1, 3, . . . , 2n − 1) of i.i.d. standard Gaussian random
variables. We now define
\begin{align}
W^n_t &:= \sum_{i=0}^n \sum_{k=1,3,\dots,2^i-1} Z_{k,i}\, \Psi_{k,i}(t), \tag{2.1.20}\\
W_t &:= \sum_{n=0}^\infty \sum_{k=1,3,\dots,2^n-1} Z_{k,n}\, \Psi_{k,n}(t), \tag{2.1.21}
\end{align}

where the Schauder functions \Psi_{k,n}(t) = \int_0^t \psi_{k,n}(s)\, ds are given by
\[
\Psi_{1,0}(t) = t, \qquad \Psi_{k,n}(t) = \begin{cases} 2^{(n-1)/2}\,(t - (k-1)2^{-n}), & (k-1)2^{-n} \le t < k2^{-n} \\ (k+1)2^{-(n+1)/2} - 2^{(n-1)/2}\, t, & k2^{-n} \le t < (k+1)2^{-n} \\ 0, & \text{otherwise.} \end{cases} \tag{2.1.22}
\]

So, for n ≥ 1, Ψk,n are little ‘tents’ of height 2^{−(n+1)/2} supported on the intervals [(k−1)2^{−n}, (k+1)2^{−n}],
which are disjoint for distinct values of k = 1, 3, . . . , 2^n − 1. Therefore,
\begin{align}
\sup_{t\in[0,1]} |W^n_t - W^{n-1}_t| &= \sup_{t\in[0,1]} \Bigg|\sum_{k=1,3,\dots,2^n-1} Z_{k,n}\,\Psi_{k,n}(t)\Bigg| \notag\\
&= 2^{-(n+1)/2} \max_{k=1,3,\dots,2^n-1} |Z_{k,n}|. \tag{2.1.23}
\end{align}


Since the maximum of m independent standard Gaussian random variables is on the order of √(log m),
we see that the maximum of |W^n_t − W^{n−1}_t| over t ∈ [0, 1] is on the order of √(n2^{−n}), so we expect
the series \sum_{n\ge 1} \sup_{t\in[0,1]} |W^n_t − W^{n−1}_t| to be summable; from this, it will be easy to show that
W^n_t → W_t as n → ∞ uniformly in t ∈ [0, 1], and that the limit is a.s. continuous. We now make this
argument precise. Using (2.1.23), for any r > 0 we can write
" #
p
P sup |Wtn − Wtn−1 | > r 2−n n log 2
t∈[0,1]
 
p
=P max |Zk,n | > r 2n log 2 (2.1.24)
k=1,3,...,2n −1
X h p i
≤ P |Zk,n | > r 2n log 2 (2.1.25)
k=1,3,...,2n −1
p
≤ 2n · P[Z > r 2n log 2], (2.1.26)

where Z is a standard Gaussian random variable. We now use the Gaussian tail bound P[Z > r] ≤ e^{−r²/2} to write
\[
P\Bigg[\sup_{t\in[0,1]} |W^n_t - W^{n-1}_t| > r\sqrt{2^{-n} n \log 2}\Bigg] \le 2^n \exp(-n r^2 \log 2) = 2^{n(1-r^2)}, \tag{2.1.27}
\]

so, if we take r > 1, then


\[
\sum_{n=1}^\infty P\Bigg[\sup_{t\in[0,1]} |W^n_t - W^{n-1}_t| > r\sqrt{2^{-n} n \log 2}\Bigg] \le \sum_{n=1}^\infty 2^{n(1-r^2)} < \infty, \tag{2.1.28}
\]

and therefore
" #
P sup |Wtn − Wtn−1 | for all sufficiently large n = 1 (2.1.29)
t∈[0,1]

by the Borel–Cantelli lemma. Therefore, almost surely

\[
\sum_{n=1}^\infty \sup_{t\in[0,1]} |W^n_t - W^{n-1}_t| < \infty. \tag{2.1.30}
\]

Since each of the random functions t 7→ Wtn is continuous, Lemma 7 in Appendix B tells us that the
sequence W n converges uniformly a.s. to W , which is itself continuous.
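To make the construction concrete, here is a minimal numerical sketch (in Python, an illustrative choice) of the truncated Lévy–Ciesielski series (2.1.20)–(2.1.21); the function names and the truncation level n_max are ours, not part of the construction.

```python
import numpy as np

def schauder(t, k, n):
    # Schauder function Psi_{k,n} from (2.1.22): a tent of height 2^{-(n+1)/2}
    # peaking at k 2^{-n}, supported on [(k-1)2^{-n}, (k+1)2^{-n}].
    left, mid, right = (k - 1) / 2**n, k / 2**n, (k + 1) / 2**n
    up = 2 ** ((n - 1) / 2) * (t - left)
    down = 2 ** ((n - 1) / 2) * (right - t)
    return np.where((t >= left) & (t < mid), up,
                    np.where((t >= mid) & (t < right), down, 0.0))

def brownian_path(t, n_max, rng):
    # Truncated series (2.1.21): the (1,0) term contributes Z_{1,0} * t,
    # and level n adds the tents indexed by odd k < 2^n.
    W = rng.standard_normal() * t
    for n in range(1, n_max + 1):
        for k in range(1, 2**n, 2):
            W = W + rng.standard_normal() * schauder(t, k, n)
    return W

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1025)
W = brownian_path(t, n_max=12, rng=rng)   # one approximate sample path on [0, 1]
```

By (2.1.23), the neglected levels contribute at most on the order of \sum_{n > n_{\max}} \sqrt{n 2^{-n}} uniformly in t, so the truncation error decays geometrically in n_max.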

2.1.2 Basic properties


Although Brownian motion has continuous sample paths, they are nowhere differentiable. The proof
of this fact is somewhat involved, but the basic intuition is that, since the increments Wt+h − Wt
for h > 0 are zero-mean Gaussian with variance h, they are on the order of √h, so the quotient
h^{−1}(Wt+h − Wt) will blow up as h ↓ 0. In fact, the length (or the total variation) of W is infinite
on any interval [0, t]. To see this, let us first examine its quadratic variation on [0, t]. To that end,
consider the sequence of random variables
\[
Q^n_t := \sum_{0 \le k < 2^n t} \big(W_{(k+1)2^{-n}} - W_{k2^{-n}}\big)^2, \qquad n = 0, 1, \dots. \tag{2.1.31}
\]

By the properties of Brownian motion, Q^n_t is a sum of independent random variables, each with
mean 1/2^n and variance 2/2^{2n}. (This follows from the fact that, for Z ∼ N(0, σ²), E[Z²] = σ² and
E[Z⁴] = 3σ⁴.) Therefore, if we consider the sequence t_n := 2^{−n}⌈2^n t⌉ = E[Q^n_t], which evidently
converges to t, we can write

\begin{align}
\mathrm{Var}[Q^n_t] &= E[(Q^n_t - t_n)^2] \tag{2.1.32}\\
&= 2^n t_n \left(\frac{3}{2^{2n}} - \frac{1}{2^{2n}}\right) \tag{2.1.33}\\
&= \frac{2 t_n}{2^n} \tag{2.1.34}\\
&< \frac{2t}{2^n}, \tag{2.1.35}
\end{align}
so, for any ε > 0, by Chebyshev’s inequality
\[
\sum_{n=0}^\infty P[|Q^n_t - t_n| \ge \varepsilon] \le \sum_{n=0}^\infty \frac{\mathrm{Var}[Q^n_t]}{\varepsilon^2} < \infty. \tag{2.1.36}
\]

Thus, by the Borel–Cantelli lemma, |Qnt − tn | < ε for all but finitely many values of n a.s.; since
ε > 0 was arbitrary, it follows that Qnt a.s. converges to t as n → ∞. The fact that W has infinite
variation on [0, t] can now be derived as follows. First of all, it is easy to see that
\[
Q^n_t \le \max_{0\le k<2^n t} |W_{(k+1)2^{-n}} - W_{k2^{-n}}| \cdot \sum_{0\le k<2^n t} |W_{(k+1)2^{-n}} - W_{k2^{-n}}|. \tag{2.1.37}
\]

As we have already shown, in the limit as n → ∞, the left-hand side converges to t; moreover, since
W has continuous sample paths, max0≤k<2n t |W(k+1)2−n − Wk2−n | converges a.s. to 0 as n → ∞.
Therefore, the sum on the right-hand side has to diverge to infinity as n → ∞.
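Both limits are easy to observe numerically. The sketch below stands in for a Brownian path with a scaled random walk on a fine grid (an approximation, not the true path) and evaluates the dyadic sums of squared and absolute increments.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2**16                               # fine simulation grid on [0, 1]
dW = rng.standard_normal(N) * np.sqrt(1.0 / N)
W = np.concatenate(([0.0], np.cumsum(dW)))
for n in (4, 8, 12):
    d = np.diff(W[::N // 2**n])         # increments over the dyadic grid k 2^{-n}
    print(n, np.sum(d**2), np.sum(np.abs(d)))
# The quadratic sums hover near t = 1, while the absolute sums grow
# like 2^{n/2}, illustrating the infinite total variation.
```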

A key property of Brownian motion is the following. Suppose that we have been observing the
evolution of W over the interval [0, s], i.e., we have recorded the trajectory W_{[0,s]} := (W_r)_{r∈[0,s]}, and
we wish to predict the value W_t at some t > s. Our prediction, which we denote by Ŵt, will be a
function of the observations W_{[0,s]}, and we want it to be optimal in the mean-square sense, i.e.,

\[
E[(\hat W_t - W_t)^2] = \inf_F\, E[(F(W_{[0,s]}) - W_t)^2], \tag{2.1.38}
\]

where (leaving aside various technicalities) the infimum is over all “reasonable” functions F that map
W[0,s] to a prediction of Wt . The minimum mean-square predictor of Wt is given by the conditional
expectation of Wt given the observations, i.e.,

Ŵt = E[Wt |W[0,s] ], (2.1.39)

which we can write, using a convenient shorthand, as Ŵt = E[Wt |Fs ], where Fs is the σ-algebra
generated by W[0,s] . (Roughly speaking, we can think of Fs as all the “information” about W[0,s] .)
Then it is not hard to see that Ŵt = E[Wt |Fs ] = Ws :

\begin{align}
E[W_t|\mathcal F_s] &= E[W_t - W_s + W_s|\mathcal F_s] \tag{2.1.40}\\
&= E[W_t - W_s|\mathcal F_s] + E[W_s|\mathcal F_s] \tag{2.1.41}\\
&= W_s, \tag{2.1.42}
\end{align}

where we have used the fact that, for t > s, the increment Wt − Ws is independent of W[0,s] and has
zero mean, while E[Ws |Fs ] = Ws because Ws is a deterministic function of the available information
W[0,s] . This property has a name: We say that the Brownian motion is a martingale, a random
process with the property that the best prediction (in the minimum mean-square sense) of its value
at some time t given the past observations of the process for all times 0 ≤ r ≤ s is equal to the most
recent observation Ws .
Another key property of Brownian motion is that it is a Markov process, i.e., for any selection of
times t1 < t2 < · · · < tn−1 < tn , the conditional probability distribution of Wtn given (Wti : 1 ≤ i <
n) is equal to the conditional distribution of Wtn given just Wtn−1 . This can be derived from the
properties of Brownian motion: By expressing the joint distribution of (Wti : 1 ≤ i ≤ n) in terms of
the joint distribution of (Wt1 , Wt2 − Wt1 , . . . , Wtn − Wtn−1 ) and using the fact that the increments
Wti − Wti−1 are independent and Gaussian with zero mean and variance ti − ti−1 , for any a < b and
any x1 , . . . , xn−1 , we have
\begin{align}
P[a \le W_{t_n} \le b \,|\, W_{t_i} = x_i,\ 1 \le i < n] &= \frac{1}{\sqrt{2\pi(t_n - t_{n-1})}} \int_a^b \exp\left(-\frac{(y - x_{n-1})^2}{2(t_n - t_{n-1})}\right) dy \tag{2.1.43}\\
&= P[a \le W_{t_n} \le b \,|\, W_{t_{n-1}} = x_{n-1}]. \tag{2.1.44}
\end{align}

This property can be stated in terms of transition densities: For any x ∈ R, any 0 ≤ s < t, and any
bounded measurable function ϕ : R → R,
\begin{align}
E[\varphi(W_t)|W_s = x] &= \frac{1}{\sqrt{2\pi(t-s)}} \int_{-\infty}^\infty \varphi(y) \exp\left(-\frac{(y-x)^2}{2(t-s)}\right) dy \tag{2.1.45}\\
&=: \int_{-\infty}^\infty \varphi(y)\, p_{s,t}(x, y)\, dy, \tag{2.1.46}
\end{align}

where ps,t (x, ·), for 0 ≤ s < t, is the conditional probability density of Wt given Ws = x. But W has
stationary increments, so it suffices to work with
\[
p_t(x, y) := p_{0,t}(x, y) \equiv \frac{1}{\sqrt{2\pi t}}\, e^{-\frac{(y-x)^2}{2t}} \tag{2.1.47}
\]
since evidently ps,t (x, y) = pt−s (x, y). For example, because W0 = 0,
\[
E[\varphi(W_t)] = \int_{-\infty}^\infty \varphi(y)\, p_t(0, y)\, dy; \tag{2.1.48}
\]

more generally,
\[
E[\varphi(W_t)|W_s = x] = \int_{-\infty}^\infty \varphi(y)\, p_{t-s}(x, y)\, dy. \tag{2.1.49}
\]

Let us see what happens when we differentiate both sides of this equality w.r.t. the time t:
\begin{align}
\frac{d}{dt} E[\varphi(W_t)|W_s = x] &= \int_{-\infty}^\infty \varphi(y)\, \frac{\partial}{\partial t} p_{t-s}(x, y)\, dy \tag{2.1.50}\\
&= \int_{-\infty}^\infty \varphi(y) \cdot \frac{1}{2}\left(\frac{(y-x)^2}{(t-s)^2} - \frac{1}{t-s}\right) p_{t-s}(x, y)\, dy \tag{2.1.51}\\
&= \frac{1}{2}\int_{-\infty}^\infty \varphi(y)\, \frac{\partial^2}{\partial y^2} p_{t-s}(x, y)\, dy. \tag{2.1.52}
\end{align}

Since ϕ is arbitrary, we conclude that the transition density pt−s (x, y), as a function of the ‘forward
space’ variable y and the ‘forward time’ variable t, is a solution of the second-order linear partial
differential equation (PDE)

\[
\frac{\partial}{\partial t} p_{t-s}(x, y) = \frac{1}{2} \frac{\partial^2}{\partial y^2} p_{t-s}(x, y), \qquad t > s, \tag{2.1.53}
\]
with the initial condition limt→s pt−s (x, y) = δ(y − x). Integrating by parts twice and assuming that
ϕ is twice differentiable and sufficiently well-behaved at infinity, we can write
\begin{align}
\frac{d}{dt} E[\varphi(W_t)|W_s = x] &= \frac{1}{2}\int_{-\infty}^\infty \varphi''(y)\, p_{t-s}(x, y)\, dy \tag{2.1.54}\\
&= \frac{1}{2}\, E[\varphi''(W_t)|W_s = x], \tag{2.1.55}
\end{align}
or, in integral form,
\[
E[\varphi(W_t)|W_s = x] = \varphi(x) + \frac{1}{2}\int_s^t E[\varphi''(W_r)|W_s = x]\, dr. \tag{2.1.56}
\]
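The computation (2.1.50)–(2.1.53) can also be verified symbolically; here is a small sketch using sympy (an assumed computer-algebra dependency).

```python
import sympy as sp

tau = sp.symbols('tau', positive=True)          # tau stands for t - s
x, y = sp.symbols('x y', real=True)
p = sp.exp(-(y - x)**2 / (2 * tau)) / sp.sqrt(2 * sp.pi * tau)
# (2.1.53): the time derivative minus half the second space derivative vanishes.
print(sp.simplify(sp.diff(p, tau) - sp.diff(p, y, 2) / 2))   # prints 0
```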

2.1.3 Multidimensional Brownian motion


We will be considering processes whose sample paths evolve in the d-dimensional space Rd . The
basic object will be the standard d-dimensional Brownian motion, which is simply a vector-valued
process (Wt )t≥0 , where Wt = (Wt1 , . . . , Wtd )T is a vector of d independent scalar Brownian motions.

All the properties carry over straightforwardly from the one-dimensional case. Among other things,
W has the Gaussian transition densities
\[
p_t(x, y) = \frac{1}{(2\pi t)^{d/2}} \exp\left(-\frac{|y - x|^2}{2t}\right), \qquad x, y \in \mathbb{R}^d,\ t > 0, \tag{2.1.57}
\]
where |x| = √(xᵀx) denotes the Euclidean norm of the vector x = (x¹, . . . , xᵈ)ᵀ. These densities are
the solutions of the d-dimensional heat equation
\[
\frac{\partial}{\partial t} p_t(x, y) = \frac{1}{2} \Delta_y\, p_t(x, y) \tag{2.1.58}
\]
with the initial condition limt→0 pt (x, y) = δ(y − x), where ∆y = (∇y · ∇y ) is the Laplace operator
and the partial differentiation is w.r.t. the coordinates of y. Consequently, for any twice-differentiable
function ϕ : Rd → R well-behaved at infinity,
\[
E[\varphi(W_t)|W_s] = \varphi(W_s) + \frac{1}{2}\int_s^t E[\Delta\varphi(W_r)|W_s]\, dr, \qquad t > s. \tag{2.1.59}
\]

2.2 Diffusion processes


We can extract the following from the discussion in Section 2.1: If W = (Wt )t≥0 is the standard
d-dimensional Brownian motion, then it has continuous sample paths and, for all t ≥ 0 and all h > 0,
we have
\begin{align}
E[W_{t+h} - W_t \,|\, \mathcal F_t] &= 0 \tag{2.2.1}\\
E[(W_{t+h} - W_t)(W_{t+h} - W_t)^{\mathsf T} \,|\, \mathcal F_t] &= h I_d, \tag{2.2.2}
\end{align}
where I_d is the d × d identity matrix and F_t is the σ-algebra generated by the trajectory (W_r : 0 ≤ r ≤ t). Moreover, using the properties of Gaussian random vectors, it can be shown that
\[
\lim_{h\downarrow 0} \frac{1}{h}\, P[|W_{t+h} - W_t| > r \,|\, \mathcal F_t] = 0, \qquad \forall r > 0. \tag{2.2.3}
\]

In fact, it can be shown that a process (Wt )t≥0 with continuous sample paths that has these properties
is a Brownian motion.
Generalizing these observations led Andrey Kolmogorov to the following definition: A d-
dimensional random process (Xt )t≥0 with continuous sample paths is a (time-homogeneous) Markov
diffusion process if, for any t ≥ 0 and any h > 0,
\[
\lim_{h\downarrow 0} \frac{1}{h}\, P[|X_{t+h} - X_t| > r \,|\, \mathcal F_t] = 0, \qquad \forall r > 0, \tag{2.2.4}
\]

and if there exist a vector-valued function f : Rd → Rd (the drift) and a matrix-valued function
a : Rd → Rd×d (the diffusion matrix ), such that, for all t ≥ 0 and all h > 0,
\begin{align}
&\lim_{h\downarrow 0} \frac{1}{h}\, E[X_{t+h} - X_t \,|\, \mathcal F_t] = f(X_t), \notag\\
&\lim_{h\downarrow 0} \frac{1}{h}\, E[(X_{t+h} - X_t)(X_{t+h} - X_t)^{\mathsf T} \,|\, \mathcal F_t] = a(X_t). \tag{2.2.5}
\end{align}

It follows from the above that a(x) is symmetric and positive semidefinite for each x ∈ Rd . The
Brownian motion is a special case of this, with f (x) ≡ 0 and a(x) ≡ Id .

Theorem 2. Let (X_t)_{t≥0} be a d-dimensional Markov diffusion process with drift f and diffusion
matrix a. Then, for any twice differentiable function ϕ : R^d → R,
\[
E[\varphi(X_t)|X_s = x] = \varphi(x) + \int_s^t E[A\varphi(X_r)|X_s = x]\, dr, \tag{2.2.6}
\]
where A is the linear second-order differential operator given by
\[
A\varphi(x) := f(x)^{\mathsf T} \nabla\varphi(x) + \frac{1}{2}\,\mathrm{tr}\{a(x)\nabla^2\varphi(x)\}. \tag{2.2.7}
\]

The operator A is called the infinitesimal generator of the diffusion process. While the full rigorous
proof is beyond the scope of these notes, we can give a sketch of the main idea. Define the function
Fs (x, t) := E[ϕ(Xt )|Xs = x] for t ≥ s and x ∈ Rd , and fix some h > 0. Then, expanding ϕ(Xt+h ) in
a Taylor series around Xt , we can write

\begin{align}
F_s(x, t+h) &= E[\varphi(X_{t+h})|X_s = x] \notag\\
&= E\Big[\varphi(X_t) + \nabla\varphi(X_t)^{\mathsf T}(X_{t+h} - X_t) + \tfrac{1}{2}(X_{t+h} - X_t)^{\mathsf T}\nabla^2\varphi(X_t)(X_{t+h} - X_t) + R_{t,h} \,\Big|\, X_s = x\Big] \tag{2.2.8}\\
&= F_s(x, t) + E\Big[\nabla\varphi(X_t)^{\mathsf T}(X_{t+h} - X_t) + \tfrac{1}{2}(X_{t+h} - X_t)^{\mathsf T}\nabla^2\varphi(X_t)(X_{t+h} - X_t) + R_{t,h} \,\Big|\, X_s = x\Big], \tag{2.2.9}
\end{align}

where Rt,h is the remainder term that scales as o(|Xt+h −Xt |2 ). Using the law of iterated expectation,
the fact that Ft ⊇ Fs , and (2.2.5), we can write
\begin{align}
&E\Big[\nabla\varphi(X_t)^{\mathsf T}(X_{t+h} - X_t) + \tfrac{1}{2}(X_{t+h} - X_t)^{\mathsf T}\nabla^2\varphi(X_t)(X_{t+h} - X_t) + R_{t,h} \,\Big|\, X_s = x\Big] \tag{2.2.10}\\
&= E\Big[E\Big[\nabla\varphi(X_t)^{\mathsf T}(X_{t+h} - X_t) + \tfrac{1}{2}(X_{t+h} - X_t)^{\mathsf T}\nabla^2\varphi(X_t)(X_{t+h} - X_t) + R_{t,h} \,\Big|\, \mathcal F_t\Big] \,\Big|\, X_s = x\Big] \tag{2.2.11}\\
&= E\Big[\nabla\varphi(X_t)^{\mathsf T}(h f(X_t) + o(h)) + \tfrac{1}{2}\,\mathrm{tr}\{\nabla^2\varphi(X_t)(h a(X_t) + o(h))\} + R_{t,h} \,\Big|\, X_s = x\Big]. \tag{2.2.12}
\end{align}

Rearranging and using the definition of A gives

\[
\frac{1}{h}\big(F_s(x, t+h) - F_s(x, t)\big) = E[A\varphi(X_t)|X_s = x] + \frac{1}{h}\, E[R_{t,h}|X_s = x] + \frac{o(h)}{h}. \tag{2.2.13}
\]

Taking the limit as h ↓ 0, we get

\[
\frac{d}{dt}\, E[\varphi(X_t)|X_s = x] = E[A\varphi(X_t)|X_s = x] + \lim_{h\downarrow 0}\frac{1}{h}\, E[R_{t,h}|X_s = x]. \tag{2.2.14}
\]

Using (2.2.4) it can be shown that the last term on the right-hand side vanishes, and we get (2.2.6)
by integrating.
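As a quick illustration of the generator (2.2.7), the following sketch evaluates Aϕ symbolically in one dimension; the drift, diffusion coefficient, and test function are illustrative choices.

```python
import sympy as sp

x = sp.symbols('x', real=True)
f = -x                     # illustrative drift
a = sp.Integer(2)          # illustrative (constant) diffusion coefficient
phi = x**2                 # test function
# One-dimensional specialization of (2.2.7): A phi = f phi' + (1/2) a phi''.
A_phi = f * sp.diff(phi, x) + a * sp.diff(phi, x, 2) / 2
print(sp.expand(A_phi))    # -> 2 - 2*x**2
```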

2.3 The Kolmogorov equations
So far, Brownian motion is the only diffusion process whose existence we have proved constructively.
While its definition is framed in the externalist picture, the explicit construction using wavelets pro-
vides the complementary internalist description. This is not the case with general diffusion processes:
Kolmogorov’s definition is purely externalist, dealing as it does with conditional expectations, and
also local in time and in space, prescribing the first- and second-order statistics of the increments
of the process on infinitesimally small time intervals and in arbitrarily small neighborhoods of the
current state of the process. In order to give the complementary internalist description of a general
diffusion process, we will need the machinery of Itô calculus, and it will turn out that all such
processes can be built up from Brownian motion. For now, though, we will show that the linear
second-order differential operator A in (2.2.7) plays a key role in constraining the behavior of the
transition probabilities of the process.
As a reminder, any time-homogeneous continuous-time Markov process (Xt )t≥0 admits a proba-
bilistic description in terms of its transition probability functions Pt (x, A), for t > 0, x ∈ Rd , and
Borel sets A:

P[Xs+t ∈ A|Xs = x] = Pt (x, A) (2.3.1)

or, more generally,


\[
E[\varphi(X_{s+t})|X_s = x] = \int_{\mathbb{R}^d} \varphi(y)\, P_t(x, dy). \tag{2.3.2}
\]

The transition probability functions obey the Chapman–Kolmogorov equation,
\[
P_t(x, A) = \int_{\mathbb{R}^d} P_{t-s}(y, A)\, P_s(x, dy), \qquad 0 < s < t, \tag{2.3.3}
\]

which expresses mathematically the intuitive picture that, in order to compute the probability that
the process reaches the set A from the point x in time t, we need to consider the probabilities of
intermediate positions after some fixed time s < t. Under appropriate regularity conditions, the
transition probability functions can be expressed in terms of transition densities pt (x, y):
\[
P_t(x, A) = \int_A p_t(x, y)\, dy, \tag{2.3.4}
\]

where pt (x, ·) is the conditional probability density of Xs+t given Xs = x for any s ≥ 0. While
establishing the existence of transition densities is a delicate matter, we will sweep these technical
issues under the rug and, assuming that the diffusion process of interest is sufficiently well-behaved
to admit transition densities, derive two PDEs that they must satisfy. These are known as the
forward and the backward Kolmogorov equations.
We will assume that the transition densities pt (x, y) are once continuously differentiable in t and
twice continuously differentiable in x and y, and that the first and second derivatives of pt (x, y) w.r.t.
x are continuous in t. We will start by deriving the forward Kolmogorov equation. Let ϕ : Rd → R
be an infinitely differentiable function supported on a compact subset of Rd . Then, on the one hand
we have
\[
\frac{d}{dt}\, E[\varphi(X_t)|X_s = x] = \int_{\mathbb{R}^d} \varphi(y)\, \frac{\partial}{\partial t} p_{t-s}(x, y)\, dy. \tag{2.3.5}
\]

On the other hand, Theorem 2 tells us that
\[
\frac{d}{dt}\, E[\varphi(X_t)|X_s = x] = E[A\varphi(X_t)|X_s = x]. \tag{2.3.6}
\]
Using the definition (2.2.7) of A and integrating by parts, we get
\begin{align}
E[A\varphi(X_t)|X_s = x] &= \int_{\mathbb{R}^d} A\varphi(y)\, p_{t-s}(x, y)\, dy \tag{2.3.7}\\
&= \int_{\mathbb{R}^d} f(y)^{\mathsf T}\nabla\varphi(y)\, p_{t-s}(x, y)\, dy + \frac{1}{2}\int_{\mathbb{R}^d} \mathrm{tr}\{a(y)\nabla^2\varphi(y)\}\, p_{t-s}(x, y)\, dy \tag{2.3.8}\\
&= \int_{\mathbb{R}^d} \Bigg(\sum_{i=1}^d f_i(y)\,\frac{\partial}{\partial y^i}\varphi(y) + \frac{1}{2}\sum_{i,j=1}^d a_{ij}(y)\,\frac{\partial^2}{\partial y^i \partial y^j}\varphi(y)\Bigg)\, p_{t-s}(x, y)\, dy \tag{2.3.9}\\
&= \int_{\mathbb{R}^d} \varphi(y)\Bigg(-\sum_{i=1}^d \frac{\partial}{\partial y^i}\big(f_i(y)\, p_{t-s}(x, y)\big) + \frac{1}{2}\sum_{i,j=1}^d \frac{\partial^2}{\partial y^i \partial y^j}\big(a_{ij}(y)\, p_{t-s}(x, y)\big)\Bigg)\, dy. \tag{2.3.10}
\end{align}

The right-hand sides of (2.3.5) and (2.3.10) are equal, and, since ϕ was arbitrary, it must be the
case that the transition density pt−s (x, y) is a solution of the PDE
\[
\frac{\partial}{\partial t} p_{t-s}(x, y) = -\sum_{i=1}^d \frac{\partial}{\partial y^i}\big(f_i(y)\, p_{t-s}(x, y)\big) + \frac{1}{2}\sum_{i,j=1}^d \frac{\partial^2}{\partial y^i \partial y^j}\big(a_{ij}(y)\, p_{t-s}(x, y)\big), \qquad t > s, \tag{2.3.11}
\]

with the initial condition limt→s pt−s (x, y) = δ(y − x). This is Kolmogorov’s forward equation (also
known as the Fokker–Planck equation); we can also use it to describe the evolution of the marginal
density ρt of Xt when the initial state X0 has a given density ρ0 . In terms of the transition densities,
we have
\[
\rho_t(y) = \int_{\mathbb{R}^d} \rho_0(x)\, p_t(x, y)\, dx; \tag{2.3.12}
\]

differentiating both sides w.r.t. t and using (2.3.11), we get


\[
\frac{\partial}{\partial t} \rho_t(y) = -\sum_{i=1}^d \frac{\partial}{\partial y^i}\big(f_i(y)\, \rho_t(y)\big) + \frac{1}{2}\sum_{i,j=1}^d \frac{\partial^2}{\partial y^i \partial y^j}\big(a_{ij}(y)\, \rho_t(y)\big), \qquad t > 0, \tag{2.3.13}
\]

with the given initial condition ρ0 at t = 0.
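Equation (2.3.13) can also be integrated numerically. Below is a minimal finite-difference sketch in one dimension on a truncated grid, assuming the density is negligible at the boundary; the drift, diffusion coefficient, grid, and step sizes are illustrative choices, and the explicit scheme is only stable for sufficiently small time steps.

```python
import numpy as np

def fokker_planck_step(rho, f_vals, a_vals, dx, dt):
    # One explicit Euler step of (2.3.13): drho/dt = -(f rho)' + (1/2)(a rho)''.
    drift_term = -np.gradient(f_vals * rho, dx)
    diff_term = np.gradient(np.gradient(0.5 * a_vals * rho, dx), dx)
    return rho + dt * (drift_term + diff_term)

y = np.linspace(-5.0, 5.0, 401)
dx = y[1] - y[0]
rho = np.exp(-(y - 1.0)**2 / 2) / np.sqrt(2 * np.pi)   # initial density rho_0
f_vals, a_vals = -y, np.ones_like(y)                    # f(y) = -y, a(y) = 1
for _ in range(5000):
    rho = fokker_planck_step(rho, f_vals, a_vals, dx, dt=1e-4)
print(rho.sum() * dx)   # total mass stays approximately 1
```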


To derive Kolmogorov’s backward equation, we start with the Chapman–Kolmogorov equation
for the transition densities: for 0 ≤ s < t and x, y ∈ Rd ,
\[
p_{t-s}(x, y) = \int_{\mathbb{R}^d} p_{t-s-h}(z, y)\, p_h(x, z)\, dz, \tag{2.3.14}
\]

where h > 0 is sufficiently small, so that t − s > h. Expanding the function z ↦ p_{t−s−h}(z, y) to
second order in a neighborhood of x gives
\[
p_{t-s-h}(z, y) = p_{t-s-h}(x, y) + \nabla_x p_{t-s-h}(x, y)^{\mathsf T}(z - x) + \frac{1}{2}(z - x)^{\mathsf T}\nabla^2_x p_{t-s-h}(x, y)(z - x) + o(|x - z|^2). \tag{2.3.15}
\]

Substituting this into (2.3.14), we obtain
\begin{align}
p_{t-s}(x, y) &= \int_{\mathbb{R}^d} p_{t-s-h}(x, y)\, p_h(x, z)\, dz + \int_{\mathbb{R}^d} \nabla_x p_{t-s-h}(x, y)^{\mathsf T}(z - x)\, p_h(x, z)\, dz \notag\\
&\qquad + \frac{1}{2}\int_{\mathbb{R}^d} (z - x)^{\mathsf T}\nabla^2_x p_{t-s-h}(x, y)(z - x)\, p_h(x, z)\, dz + o(h) \tag{2.3.16}\\
&= p_{t-s-h}(x, y) + \nabla_x p_{t-s-h}(x, y)^{\mathsf T}\, E[X_h - X_0|X_0 = x] \notag\\
&\qquad + \frac{1}{2}\,\mathrm{tr}\Big\{E[(X_h - X_0)(X_h - X_0)^{\mathsf T}|X_0 = x]\, \nabla^2_x p_{t-s-h}(x, y)\Big\} + o(h). \tag{2.3.17}
\end{align}
Rearranging, dividing by h, and using (2.2.5) gives
\begin{align}
-\frac{1}{h}\big[p_{t-(s+h)}(x, y) - p_{t-s}(x, y)\big] &= \frac{1}{h}\,\nabla_x p_{t-(s+h)}(x, y)^{\mathsf T}\, E[X_h - X_0|X_0 = x] \notag\\
&\qquad + \frac{1}{2h}\,\mathrm{tr}\Big\{E[(X_h - X_0)(X_h - X_0)^{\mathsf T}|X_0 = x]\, \nabla^2_x p_{t-(s+h)}(x, y)\Big\} + \frac{o(h)}{h} \tag{2.3.18}\\
&= \frac{1}{h}\,(h f(x) + o(h))^{\mathsf T}\,\nabla_x p_{t-(s+h)}(x, y) \notag\\
&\qquad + \frac{1}{2}\,\mathrm{tr}\Big\{\frac{1}{h}(h a(x) + o(h))\, \nabla^2_x p_{t-(s+h)}(x, y)\Big\} + \frac{o(h)}{h}. \tag{2.3.19}
\end{align}
Taking the limit as h → 0 and using the assumed continuity of t 7→ pt (x, y), we get the backward
Kolmogorov equation
\[
\frac{\partial}{\partial s} p_{t-s}(x, y) = -\sum_{i=1}^d f_i(x)\,\frac{\partial}{\partial x^i} p_{t-s}(x, y) - \frac{1}{2}\sum_{i,j=1}^d a_{ij}(x)\,\frac{\partial^2}{\partial x^i \partial x^j} p_{t-s}(x, y) \tag{2.3.20}
\]

for 0 ≤ s ≤ t with the terminal condition lims→t pt−s (x, y) = δ(y − x). Note that we can rewrite
this more succinctly using the definition of A:

pt−s (x, y) = −Apt−s (x, y), (2.3.21)
∂s
where the derivatives in A act on x while keeping y fixed.
What we have done is start with a diffusion process characterized by the drift f and the diffusion
matrix a and, assuming it is regular enough to admit transition densities, derive two PDEs that
must be obeyed by these transition densities. It turns out that we can also go the other way around
— that is, if pt (x, y) are smooth solutions of (2.3.11) and (2.3.20) for a given pair of f and a, then
they are the transition densities of a diffusion process with drift f and diffusion matrix a.

2.4 Problems
1. Prove that a scalar zero-mean Gaussian process (Vt )t≥0 with V0 = 0 has the covariance function
E[Vs Vt ] = s ∧ t if and only if the increments Vt − Vs are zero-mean Gaussian with variance |t − s|.
Hint: Use the identity a ∧ b = ½(a + b − |a − b|).

2. Let (Vt )t≥0 be a scalar zero-mean Gaussian process with covariance function E[Vs Vt ] = s ∧ t.
Prove that it has independent increments, i.e., for any n and 0 ≤ t1 < t2 < · · · < tn , the random
variables Vt2 − Vt1 , Vt3 − Vt2 , . . . , Vtn − Vtn−1 are independent.

3. Let (Wt )t≥0 be a standard one-dimensional Brownian motion. Prove that the following processes
are also Brownian motion processes:
(i) (Wt+a − Wa )t≥0 , where a ≥ 0 is a fixed constant.
(ii) (λ−1 Wλ2 t )t≥0 , where λ 6= 0 is a fixed constant.
(iii) (tW1/t )t>0 .

4. Let (X1 , . . . , Xn ) be an n-dimensional random vector with a joint pdf pX . Express pX in terms
of the joint pdf of (X1 , X2 − X1 , . . . , Xn − Xn−1 ). Use this fact to give a proof that Brownian motion
is a Markov process with transition densities
\[
p_t(x, y) = \frac{1}{\sqrt{2\pi t}} \exp\left(-\frac{(y - x)^2}{2t}\right).
\]

5. Let (Xt )t≥0 be a d-dimensional diffusion process. Show that, for t ≥ 0 and h > 0,
Cov[Xt+h − Xt |Xt ] = ha(Xt ) + o(h)

6. Let (Wt )t≥0 be a standard one-dimensional Brownian motion, and let ϕ : R → R be a strictly
monotone, twice continuously differentiable function. Show that Xt = ϕ(Wt ) is a one-dimensional
diffusion process with drift f(x) = ½ϕ′′(ϕ⁻¹(x)) and diffusion coefficient a(x) = |ϕ′(ϕ⁻¹(x))|², where
ϕ−1 is the inverse of ϕ, i.e., the unique solution y to the equation ϕ(y) = x.

7. Let (Xt )t≥0 be a d-dimensional diffusion process, and let ϕ : Rd × [0, ∞) → R be a function
of x ∈ Rd and t ∈ [0, ∞), which is twice continuously differentiable in x and once continuously
differentiable in t. Prove that, for any t > s ≥ 0,
\[
E[\varphi(X_t, t)|\mathcal F_s] = \varphi(X_s, s) + \int_s^t E\big[\dot\varphi(X_r, r) + A\varphi(X_r, r) \,\big|\, \mathcal F_s\big]\, dr,
\]
where F_s is the σ-algebra generated by (X_r)_{0≤r≤s}, ϕ̇(x, t) := ∂ϕ(x, t)/∂t, and A is the infinitesimal
generator of (X_t)_{t≥0}, which acts on functions of both x and t as
\[
A\varphi(x, t) = f(x)^{\mathsf T}\nabla_x \varphi(x, t) + \frac{1}{2}\,\mathrm{tr}\{a(x)\nabla^2_x \varphi(x, t)\}.
\]

8. Consider a scalar diffusion process with drift f(x) = −x and diffusion coefficient a(x) = 2 (this
is known as the standard Ornstein–Uhlenbeck process in one dimension). Suppose that X_0 has the
standard Gaussian density ρ_0(x) = (2π)^{−1/2} e^{−x²/2}. Using the Fokker–Planck equation, show that
the marginal densities ρ_t(x) of X_t are all equal to ρ_0.

2.5 Notes and further reading


The construction of the Brownian motion in Section 2.1 largely follows [Cla73].

Chapter 3

Stochastic Integrals and Stochastic Differential Equations

In Chapter 2, we have introduced diffusion processes from an externalist point of view in terms of
their drift and diffusion coefficients. This is a local description of a diffusion process which, in fact,
completely determines the transition probability functions of the process. These, in turn, allow us to
compute the probabilities of various events pertaining to the process, but tell us nothing about the
way one could go about generating a process with these transition probabilities. On the other hand,
an important first step in building models of physical and engineering systems with random effects is
to go from an externalist description to an internalist one involving latent variables not amenable to
direct observation. This passage from a given externalist description to a suitable internalist model
is often referred to as the realization problem, and one typically imposes some structural constraints
on the mechanism that relates the internal variables to the external variables.
You have no doubt encountered this type of modeling in the first course on probability theory:
Any real-valued random variable X with cdf F_X(x) = P[X ≤ x] can be realized as a function of
a random variable U uniformly distributed on the unit interval [0, 1], i.e., F_X^{−1}(U) and X have
the same cdf. This is readily verified using the fact that the cdf of U is F_U(u) = u for u ∈ [0, 1].
In this case, U is the internal or latent variable and X is the external or observed variable. Of
course, not every instance of X is necessarily generated in this way, but a representation of this
kind is nevertheless very useful. For instance, if we know the cdf F_X, then we can use it to generate
as many independent copies of X as we want; if we don’t know F_X, then we can attempt to learn it
by comparing the cdf of F̂(X) with the cdf of U for various candidates F̂ — this is precisely the idea
behind the Kolmogorov–Smirnov goodness-of-fit test.
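As a concrete instance of this realization, here is a short sketch for the exponential distribution, whose cdf inverts in closed form (the distribution is an illustrative choice):

```python
import numpy as np

# X = F_X^{-1}(U) with U uniform on [0, 1]; for F_X(x) = 1 - e^{-x} (Exp(1)),
# the inverse cdf is F_X^{-1}(u) = -log(1 - u).
rng = np.random.default_rng(5)
U = rng.uniform(size=100_000)          # the latent (internal) variable
X = -np.log(1.0 - U)                   # the observed (external) variable
print(X.mean(), X.var())               # both approximately 1 for Exp(1)
```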
Our goal in this chapter is twofold: We will first show that a wide class of diffusion processes can
be realized as deterministic functions of Brownian motion. That is, if (Xt )t≥0 is a d-dimensional
diffusion process satisfying certain regularity conditions, then it can be realized as a function of a
d-dimensional Brownian motion (Wt )t≥0 , i.e., for each t there exists a mapping Ft such that the
entire path (Xs )0≤s≤t can be obtained as Ft [X0 , (Ws )0≤s≤t ], where the (possibly random) starting
point X0 is independent of the Brownian motion. This Brownian motion plays the role of the latent
object; moreover, we will be able to write down the form of Ft explicitly in terms of the drift and
the diffusion coefficients of (Xt )t≥0 . To that end, we will need to introduce stochastic integrals in
the sense of Itô, and this will in turn lead us to the construction of stochastic differential equations.
Martingale theory will play an important role here as well. One can think of this result as solving

the realization problem for diffusion processes. We will then turn this idea around — we will start
with internalist models based on stochastic differential equations and show that, under appropriate
regularity conditions, their solutions are in fact diffusion processes. Given this, it is tempting to
view stochastic differential equations as limits of ordinary differential equations driven by wideband
noise. As we shall see, however, this will run into certain complications having to do with the fact
that the rules of Itô calculus are different from the rules of ordinary calculus.

3.1 From Kolmogorov to Itô: an internalist model of a diffusion process
As mentioned above, we will introduce Itô’s stochastic integrals in the context of solving the realization
problem for diffusion processes. To keep things simple, we will consider the one-dimensional case; the
extension to vector-valued processes is relatively straightforward once the main ideas are understood.
Suppose that we are observing some stochastic system over a time interval of finite length T . Our
measurement apparatus connected to the system produces a random trajectory X = (Xt )t∈[0,T ] , and
we will assume the act of measurement does not introduce any additional disturbances. We also
assume that the observed trajectory is a diffusion process with known drift f and diffusion coefficient
a, where both are sufficiently regular to ensure that the processes f (Xt ) and a(Xt ) have continuous
sample paths. For instance, since Xt , being a diffusion process, has continuous sample paths, the
above requirement will be met whenever both f and a are continuous functions of their argument.
Our goal is to show that X can be represented as a deterministic function of the initial condition
X0 and a Brownian motion W = (Wt )t∈[0,T ] independent of X0 subject to the reasonable causality
requirement that, for each t ∈ [0, T ], (Xτ )τ ∈[0,t] is determined only by X0 and (Wτ )τ ∈[0,t] .

3.1.1 Filtrations, martingales, and all that


The observed process X is defined on a probability space (Ω, F, P). The above causality requirement
also forces us to consider, for each time t ∈ [0, T ], a smaller σ-algebra Ft which is a subset of F
and which represents all the information available at time t. Since we are restricted to external
observations only, we will take Ft to be the σ-algebra generated by (Xτ )τ ∈[0,t] . In particular, Ft
contains all events of the form

E = {ω ∈ Ω : a1 ≤ Xt1 (ω) < b1 , . . . , an ≤ Xtn (ω) < bn } (3.1.1)

for all finite collections of times t1 , . . . , tn ∈ [0, t] and intervals [a1 , b1 ), . . . , [an , bn ), as well as all
other events that could be generated from these by taking complements and countable unions. For
example, if the system under observation is an electrical circuit that contains noise sources and
Xt is the random voltage across a designated pair of terminals at time t, then the event E in
(3.1.1) represents all the circumstances under which the voltages measured by an ideal voltmeter
at times 0 ≤ t1 , . . . , tn ≤ t fall within given ranges [a1 , b1 ), . . . , [an , bn ). The σ-algebra F0 contains
all available information about the initial condition X0 . It is not hard to see that these σ-algebras
are nested: If 0 ≤ t < t0 , then F0 ⊆ Ft ⊆ Ft0 ⊆ F. This represents the situation that we accumulate
information as time goes on. Any such increasing collection of σ-algebras is called a filtration, and
we say that a stochastic process Y = (Yt )t∈[0,T ] is adapted to the filtration (Ft )t∈[0,T ] if each Yt is
measurable w.r.t. Ft , i.e., the information available at time t completely determines Yt . In our current
setting, where each Ft is generated by (Xτ )τ ∈[0,t] , the process X is evidently adapted to the resulting

filtration (we then say that this filtration is generated by X). Moreover, we will see that the ‘latent’
Brownian motion W will also be adapted to the same filtration, which makes intuitive sense in
light of our causality requirement. This simply means that, for any 0 ≤ s < t ≤ T , the increment
Wt − Ws is a zero-mean Gaussian random variable with variance t − s, which is also independent of
Fs (in our case, this means that Wt − Ws is independent of (Xr )0≤r≤s ). In particular, we will have

E[Wt − Ws |Fs ] = 0 and E[(Wt − Ws )2 |Fs ] = t − s, t ≥ s. (3.1.2)

This naturally leads us to another important definition: A stochastic process (Yt )t∈[0,T ] adapted to a
filtration (Ft )t∈[0,T ] is a martingale w.r.t. this filtration if

E[Yt − Ys |Fs ] = 0, t ≥ s. (3.1.3)

We will often abbreviate this terminology by saying that Y is an Ft -martingale or that (Yt , Ft )t∈[0,T ]
is a martingale. It is important to keep in mind that a given process may be adapted to many
different filtrations, and that it may be a martingale w.r.t. one such filtration, but fail to be a
martingale w.r.t. another.
Now, the first equality in (3.1.2) tells us that any Brownian motion adapted to Ft is an Ft -
martingale. An important result, which we will use soon, is that any martingale with continuous
sample paths that satisfies (3.1.2) is, in fact, a Brownian motion:

Theorem 3 (Lévy’s characterization of Brownian motion). Let W = (Wt )t≥0 be a martingale w.r.t.
(Ft )t≥0 with continuous sample paths, such that W0 = 0 and

E[(Wt − Ws )2 |Fs ] = t − s, t ≥ s. (3.1.4)

Then W is an Ft -Brownian motion.

See Example 4 for the proof.


It will be convenient to adopt the notation dYt for the forward differentials Yt+dt −Yt for 0 ≤ t < T
and a small dt > 0 such that t + dt ≤ T . Then Y is a martingale w.r.t. Ft if it is adapted to Ft and
if E[dYt |Ft ] = 0 for all 0 ≤ t < T . From now on, we will be working with such forward differentials
whenever possible; among other things, this will allow us to neglect terms of order (dt)2 and higher.
In this notation, we can restate the condition (3.1.4) as E[(dWt )2 |Ft ] = dt.
There are multiple ways to construct martingales adapted to a given filtration Ft . For example,
let a random variable V be given. Then the process Yt := E[V |Ft ] is an Ft -martingale, an immediate
consequence of the properties of conditional expectation. Another way is to start with a process
Z = (Zt )t∈[0,T ] which is not necessarily adapted to Ft and then construct another process A by
setting A0 = 0 and

dAt := E[dZt |Ft ]. (3.1.5)

Then Mt := Zt − At is an Ft -martingale: It is obviously adapted to Ft , and

E[dMt |Ft ] = E[dZt |Ft ] − E[dAt |Ft ] = 0. (3.1.6)

For example, if Ft is the filtration generated by Zt , i.e., Ft = σ(Zτ : 0 ≤ τ ≤ t), then dMt =
dZt − E[dZt |Zτ , 0 ≤ τ ≤ t] contains the new information in dZt on top of Ft . The process Mt is
then called the innovations process for Zt .

Finally, an important concept is that of quadratic variation. We have already seen it in the case
of Brownian motion: If (Wt )t≥0 is a standard Brownian motion, then the limit
X  2
lim W(k+1)2−n − Wk2−n = t (3.1.7)
n→∞
0≤k<2n t

exists almost surely; in fact, it does not even depend on the specific sequence of partitions of the
interval [0, t], as long as they become increasingly fine in the limit. We can now consider any
square-integrable process Yt (this means that E[Yt2 ] < ∞ for all t ∈ [0, T ]) with continuous sample
paths and define its quadratic variation on [0, t] as
\[
[Y]_t := \lim_{n\to\infty} \sum_{0\le k<2^n t} \big(Y_{(k+1)2^{-n}} - Y_{k2^{-n}}\big)^2 \tag{3.1.8}
\]

(again, we can take the limit along any sequence of partitions of [0, t], as long as they become
increasingly fine in the limit). For example, if Wt is a standard Brownian motion, then [W ]t = t
almost surely. Also, the same argument that was used in Section 2.1 to show that Brownian motion
has infinite total variation on any finite interval can be applied to show that any process Yt whose
quadratic variation [Y ]t is a.s. finite and increasing with t (which will be the case whenever Yt is not
a.s. constant) must have infinite total variation on [0, t].
Now, quadratic variation has special properties when the underlying process is a martingale.
Thus, let Mt be a square-integrable Ft -martingale with continuous sample paths, and let us consider
the process Yt := Mt2 . Then, using the properties of conditional expectations and the fact that Mt
is a martingale, we have
\begin{align}
E[dY_t|\mathcal F_t] &= E[M^2_{t+dt} - M^2_t|\mathcal F_t] \tag{3.1.9}\\
&= E[(M_{t+dt} - M_t)^2 + 2(M_{t+dt} - M_t)M_t|\mathcal F_t] \tag{3.1.10}\\
&= E[(dM_t)^2|\mathcal F_t] + 2\, E[M_t\, dM_t|\mathcal F_t] \tag{3.1.11}\\
&= E[(dM_t)^2|\mathcal F_t] + 2\, M_t\, E[dM_t|\mathcal F_t] \tag{3.1.12}\\
&= E[(dM_t)^2|\mathcal F_t]. \tag{3.1.13}
\end{align}

Then the process A given by A0 = 0 and dAt := E[dYt |Ft ] is nondecreasing since (dMt )2 ≥ 0, and
is in fact precisely the quadratic variation [M ]t . In fact, it will be strictly increasing unless Mt is
a.s. constant. (Note, by the way, that the definition of quadratic variation does not depend on any
particular filtration, unlike our definition of At .)

3.1.2 A martingale associated with a diffusion process


We will now see why we needed the above interlude on martingales. Getting back to our diffusion
process X, we can write

E[dXt |Ft ] = f (Xt ) dt, 0≤t<T (3.1.14)

— indeed, this is just the restatement of E[Xt+dt − Xt |Ft ] = f (Xt ) dt + o(dt) that we have been using
before. Then the process M = (Mt )t∈[0,T ] defined by M0 = 0 and

dMt := dXt − f (Xt ) dt, (3.1.15)

or, in integral form,
\[
M_t = X_t - X_0 - \int_0^t f(X_s)\, ds, \tag{3.1.16}
\]

is an Ft -martingale. Indeed, as can be easily seen from (3.1.16), Mt is completely determined by


(Xs )s∈[0,t] , i.e., it is measurable w.r.t. Ft ; moreover, it follows from (3.1.14) that

E[dMt |Ft ] = 0, 0 ≤ t < T. (3.1.17)

Thus, our diffusion process X admits a decomposition of the form


\[
X_t = X_0 + \int_0^t f(X_s)\, ds + M_t, \qquad t \in [0, T], \tag{3.1.18}
\]

where M = (Mt )t∈[0,T ] is a martingale. But, in fact, we can say more about M .
First of all, since t 7→ f (Xt ) is continuous by hypothesis, Mt has continuous sample paths.
Moreover, M0 = 0, and

\begin{align}
E[(dM_t)^2|\mathcal F_t] &= E[(dX_t - f(X_t)\, dt)^2|\mathcal F_t] \tag{3.1.19}\\
&= E[(dX_t)^2|\mathcal F_t] - 2\,E[f(X_t)\, dX_t|\mathcal F_t]\, dt + E[f^2(X_t)|\mathcal F_t](dt)^2 \tag{3.1.20}\\
&= a(X_t)\, dt + f^2(X_t)(dt)^2 - 2f^2(X_t)(dt)^2 + f^2(X_t)(dt)^2 \tag{3.1.21}\\
&= a(X_t)\, dt, \tag{3.1.22}
\end{align}

which can be rewritten in integral form as
\[
E[(M_t - M_s)^2|\mathcal F_s] = E\bigg[\int_s^t a(X_r)\, dr \,\bigg|\, \mathcal F_s\bigg], \qquad t \ge s. \tag{3.1.23}
\]

This readily gives us the quadratic variation of Mt : d[M ]t = a(Xt ) dt or, in integral form,
\[
[M]_t = \int_0^t a(X_s)\, ds, \qquad 0 \le t \le T. \tag{3.1.24}
\]

The interesting observation here is that the square of the differential dMt is not negligible since it is
first order in dt. Thus, we have obtained our first key result:

Theorem 4. Let X = (Xt )t∈[0,T ] be a one-dimensional diffusion process whose drift f and diffusion
coefficient a are continuous functions of their argument. Then there exists a square-integrable
martingale M = (Mt )t∈[0,T ] with continuous sample paths, such that M0 = 0 and
\[
X_t = X_0 + \int_0^t f(X_s)\, ds + M_t, \qquad 0 \le t \le T. \tag{3.1.25}
\]

Moreover, the quadratic variation process [M]_t has the form
\[
[M]_t = \int_0^t a(X_s)\, ds. \tag{3.1.26}
\]

We can consider two special cases: If a(x) ≡ 0, then the evolution of Xt is deterministic, i.e.,
\[
X_t = X_0 + \int_0^t f(X_s)\, ds, \qquad 0 \le t \le T, \tag{3.1.27}
\]

which is equivalent to the ODE


\[
\frac{d}{dt} X_t = f(X_t), \qquad t \in [0, T], \tag{3.1.28}
\]
with (a possibly random) initial condition X0 . On the other hand, if a(x) = σ 2 for a positive constant
σ > 0, then

d[M ]t = E[(dMt )2 |Ft ] = σ 2 dt. (3.1.29)

Since X has continuous sample paths and the drift f is a continuous function, it follows that M is a
martingale w.r.t. (Ft )t∈[0,T ] that has continuous sample paths. Since M0 = 0, we conclude that the
normalized process (σ −1 Mt )t≥0 is a Brownian motion adapted to the filtration (Ft )t≥0 by Lévy’s
characterization of the Brownian motion (Theorem 3). In other words, we can now express X as a
deterministic function of the initial condition X0 and a Brownian motion W :
\[
X_t = X_0 + \int_0^t f(X_s)\, ds + \sigma W_t, \qquad 0 \le t \le T. \tag{3.1.30}
\]

As advertised, this expresses X as a deterministic function of X0 and an independent Brownian


motion W , and the path (Xs )s∈[0,t] depends only on X0 and on (Ws )s∈[0,t] . Moreover, the Brownian
motion W is adapted to the filtration Ft generated by X. We still have to consider the case when
the diffusion coefficient a depends on the ‘space variable’ x.

3.1.3 Enter the stochastic integral


So far, we have been proceeding in a logical fashion, arriving at the general representation of X in
Theorem 4, and at the more concrete representation (3.1.30) for a(x) equal to a constant σ 2 > 0, by
deductive reasoning starting from the axioms defining a diffusion process and some basic martingale
theory. In order to proceed beyond this simple case, we need to introduce a new tool, that of
stochastic integration. The basic ideas were developed by Itô (and, somewhat earlier, by Doeblin),
where the main point is that one can integrate any sufficiently regular square-integrable process
adapted to a filtration w.r.t. a Brownian motion adapted to the same filtration. However, we will
develop the concept of stochastic integration in a bit more generality, which will pay off later when
we start talking about filtering and estimation problems.
Let a filtration (Ft )t∈[0,T ] be given, and let M = (Mt )t∈[0,T ] be a square-integrable Ft -martingale
with continuous sample paths. We would like to be able to define a stochastic integral
\[
\int_0^T Z_t\, dM_t \tag{3.1.31}
\]

of any process Z = (Zt )t∈[0,T ] adapted to Ft and satisfying some additional regularity conditions.
Any such process will be called the integrand, and the martingale M will be called the integrator.
To that end, let us write the quadratic variation in the form d[M]_t = A_t\, dt, where A_t is a positive
process adapted to F_t. Then the space of admissible integrands in (3.1.31) will consist of all
processes Z = (Zt )t∈[0,T ] that are adapted to Ft and that also satisfy

\[
\int_0^T E[Z_t^2 A_t]\, dt < \infty. \tag{3.1.32}
\]

We will also use the shorthand I_T^M(Z) to denote (3.1.31). This is a definite integral, and we can
obtain the indefinite integral \int_0^t Z\, dM for any t ∈ [0, T] by setting \int_0^t Z\, dM = \int_0^T 1_{[0,t]} Z\, dM. Our
eventual goal is to show that, when the diffusion coefficient a is everywhere positive, the martingale
Mt in (3.1.25) can be expressed as

\[
M_t = \int_0^t \sqrt{a(X_s)}\, dW_s, \qquad 0 \le t \le T, \tag{3.1.33}
\]

where W is a Brownian motion adapted to Ft . (When a is not everywhere positive, we will still be
able to express Mt as an Itô integral, but then we will need to introduce additional ‘randomness’ to
construct the Brownian motion integrator, on top of that already available in X.)
We begin by defining the stochastic integral for the so-called elementary integrands. We say that
an Ft -adapted process Z is an elementary integrand if it can be represented in the form

\[
Z_t = \sum_{i=0}^{n-1} \zeta_i\, 1_{\{t_i \le t < t_{i+1}\}}, \tag{3.1.34}
\]

where 0 = t_0 < t_1 < · · · < t_n = T is a partition of [0, T ], and where each ζi is a random variable
measurable w.r.t. Fti , such that

\[
\sum_{i=0}^{n-1} E\big[\zeta_i^2 (M_{t_{i+1}} - M_{t_i})^2\big] \equiv \int_0^T E[Z_t^2 A_t]\, dt < \infty. \tag{3.1.35}
\]

We then define the integral (3.1.31) by

\[
I_T^M(Z) := \sum_{i=0}^{n-1} \zeta_i (M_{t_{i+1}} - M_{t_i}). \tag{3.1.36}
\]

Note that, since each ζi is Fti -measurable and M is a martingale, we have

E[ζi (Mti+1 − Mti )|Fti ] = ζi E[Mti+1 − Mti |Fti ] = 0, (3.1.37)

so we immediately deduce that I_T^M(Z) has zero mean:
\begin{align}
E[I_T^M(Z)] &= E\Bigg[\sum_{i=0}^{n-1} \zeta_i (M_{t_{i+1}} - M_{t_i})\Bigg] \tag{3.1.38}\\
&= \sum_{i=0}^{n-1} E[\zeta_i (M_{t_{i+1}} - M_{t_i})] \tag{3.1.39}\\
&= \sum_{i=0}^{n-1} E\big[E[\zeta_i (M_{t_{i+1}} - M_{t_i})|\mathcal F_{t_i}]\big] \tag{3.1.40}\\
&= \sum_{i=0}^{n-1} E\big[\zeta_i\, E[M_{t_{i+1}} - M_{t_i}|\mathcal F_{t_i}]\big] \tag{3.1.41}\\
&= 0. \tag{3.1.42}
\end{align}
Moreover, if i < j, then Ftj ⊇ Fti+1 , so
\[
E[\zeta_i \zeta_j (M_{t_{i+1}} - M_{t_i})(M_{t_{j+1}} - M_{t_j})|\mathcal F_{t_j}] = \zeta_i \zeta_j (M_{t_{i+1}} - M_{t_i})\, E[M_{t_{j+1}} - M_{t_j}|\mathcal F_{t_j}] = 0, \tag{3.1.43}
\]
which gives
\begin{align}
E\big[(I_T^M(Z))^2\big] &= \sum_{i,j=0}^{n-1} E\big[\zeta_i \zeta_j (M_{t_{i+1}} - M_{t_i})(M_{t_{j+1}} - M_{t_j})\big] \tag{3.1.44}\\
&= \sum_{i=0}^{n-1} E\big[\zeta_i^2 (M_{t_{i+1}} - M_{t_i})^2\big] \tag{3.1.45}\\
&= \sum_{i=0}^{n-1} E\bigg[\zeta_i^2 \int_{t_i}^{t_{i+1}} A_t\, dt\bigg] \tag{3.1.46}\\
&= \sum_{i=0}^{n-1} E\bigg[\int_{t_i}^{t_{i+1}} Z_t^2 A_t\, dt\bigg] \tag{3.1.47}\\
&= \int_0^T E[Z_t^2 A_t]\, dt. \tag{3.1.48}
\end{align}
One can then prove that, for any admissible integrand Z, there exists a sequence (Z m )m∈N of
elementary integrands, such that
\[
\lim_{m\to\infty} \int_0^T E[(Z_t^m - Z_t)^2 A_t]\, dt = 0, \tag{3.1.49}
\]

and then one can define the stochastic integral as


\[
\int_0^T Z_t\, dM_t := \lim_{m\to\infty} \int_0^T Z_t^m\, dM_t, \tag{3.1.50}
\]

where the limit is in mean-square sense. The detailed proofs can be found in many places, but the
main idea is that we can introduce on the space H0 of elementary integrands an inner product
\[
\langle Z, Z'\rangle := \int_0^T E[Z_t Z_t' A_t]\, dt, \tag{3.1.51}
\]

which is positive definite by virtue of our assumption that At > 0 for all t, and then consider the
Hilbert space H obtained by completing H0 w.r.t. the norm ‖Z‖ := √⟨Z, Z⟩. It can then be shown
that H is precisely the space of all admissible integrands. The stochastic integral has a number of
useful properties:
1. Linearity: \int_0^T (Z_t + Z_t')\, dM_t = \int_0^T Z_t\, dM_t + \int_0^T Z_t'\, dM_t.

2. Isometry: E\big[\big(\int_0^T Z_t\, dM_t\big)\big(\int_0^T Z_t'\, dM_t\big)\big] = \int_0^T E[Z_t Z_t' A_t]\, dt.

3. Martingale property: the process (I_t^M(Z))_{t∈[0,T]} is an F_t-martingale with continuous sample
paths.

Here, Z_t and Z_t' are arbitrary admissible integrands. All of these are readily verified when Z_t and Z_t'
are elementary integrands, and then for arbitrary admissible integrands by taking limits.
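The zero-mean and isometry properties are visible in simulation. The sketch below forms elementary Itô sums with integrand Z_t = W_t against the Brownian integrator (so A_t ≡ 1), for which the isometry predicts E[(∫₀¹ Z dW)²] = ∫₀¹ t dt = 1/2; the sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
reps, n = 5000, 1024
dW = rng.standard_normal((reps, n)) * np.sqrt(1.0 / n)
W_left = np.cumsum(dW, axis=1) - dW        # left endpoints W_{t_i} of each cell
I = np.sum(W_left * dW, axis=1)            # elementary sums sum_i W_{t_i} dW_i
print(I.mean(), np.mean(I**2))             # approximately 0 and 0.5
```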

Remark 2. The class of admissible integrands can be expanded even further by relaxing (3.1.32) to
the weaker requirement
\[
\int_0^T Z_t^2 A_t\, dt < \infty \quad \text{a.s.} \tag{3.1.52}
\]

The integral of such a Zt w.r.t. Mt is then defined as


\[
\int_0^T Z_t\, dM_t := \lim_{n\to\infty} \int_0^T Z_t^n\, dM_t \quad \text{in probability,} \tag{3.1.53}
\]

where the process Ztn is defined by


\[
Z_t^n := \begin{cases} Z_t, & \text{if } \int_0^t Z_s^2 A_s\, ds \le n \\ 0, & \text{otherwise.} \end{cases} \tag{3.1.54}
\]
There is a slight price to pay for this increased generality: The indefinite integral \int_0^t Z\, dM may no
longer be a martingale, but rather a local martingale. Add informal discussion later.

3.1.4 Back to diffusion processes


Now we are ready to return to the matter of characterizing the martingale Mt in (3.1.25) under the
assumption that the diffusion coefficient a of X is everywhere positive. Recall that the quadratic
variation process [M]_t satisfies d[M]_t = a(X_t)\, dt. It is then readily seen that Z_t := 1/\sqrt{a(X_t)} is an
admissible integrand for M_t:
\[
\int_0^T E[Z_t^2\, a(X_t)]\, dt = T. \tag{3.1.55}
\]

Let us then define the process W_t by setting W_0 = 0 and dW_t = dM_t/\sqrt{a(X_t)}, or, in integral form,
\[
W_t = I_t^M(Z) = \int_0^t \frac{1}{\sqrt{a(X_\tau)}}\, dM_\tau. \tag{3.1.56}
\]

This process is an Ft -martingale with continuous sample paths, and we can compute
\begin{align}
E[(dW_t)^2|\mathcal F_t] &= \frac{1}{a(X_t)}\, E[(dM_t)^2|\mathcal F_t] \tag{3.1.57}\\
&= \frac{1}{a(X_t)}\, d[M]_t \tag{3.1.58}\\
&= \frac{1}{a(X_t)} \cdot a(X_t)\, dt \tag{3.1.59}\\
&= dt, \tag{3.1.60}
\end{align}
which is equivalent to d[W]_t = dt. Consequently, W is a Brownian motion adapted to Ft by Lévy’s
theorem. Writing dM_t = \sqrt{a(X_t)}\, dW_t, we obtain the following refinement of Theorem 4 and the
main result of this section:
Theorem 5 (Doob). Let X = (Xt )t∈[0,T ] be a one-dimensional diffusion process whose drift f and
diffusion coefficient a are continuous functions of their argument, and a is everywhere positive. Then
there exists a Brownian motion W adapted to Ft , such that
\[
X_t = X_0 + \int_0^t f(X_s)\, ds + \int_0^t \sqrt{a(X_s)}\, dW_s. \tag{3.1.61}
\]

Remark 3. If a is only nonnegative, then we have to introduce additional independent randomness


into our internalist model. If a(Xt ) is positive a.e., then our original construction still goes through.
If not, then we define the function
\[
g(x) := \begin{cases} 1/\sqrt{a(x)}, & a(x) > 0 \\ 0, & \text{otherwise,} \end{cases} \tag{3.1.62}
\]
and define the process W by W0 = 0 and
\[
dW_t = g(X_t)\, dM_t + \big(1 - g(X_t)\sqrt{a(X_t)}\big)\, d\tilde W_t, \tag{3.1.63}
\]

where W̃ = (W̃t )t∈[0,T ] is a Brownian motion independent of X. This procedure requires enlarging
the filtration Ft to Gt containing, in addition to all events generated by (Xs )0≤s≤t , also all events
generated by (W̃s )0≤s≤t . It follows from Lévy’s theorem that W is a Gt -Brownian motion, and then
Xt can be expressed as (3.1.61).
The stochastic integral in (3.1.61), with the Brownian motion serving as the integrator, is the classic
stochastic integral of Itô. We will limit ourselves to this setting for the time being, although later we
will have many occasions to use more general integrator processes. The main point here is that the
process
\[
M_t = X_t - X_0 - \int_0^t f(X_s)\, ds \tag{3.1.64}
\]

is a martingale that admits an explicit representation as a stochastic integral w.r.t. a Brownian


motion, i.e.,
\[
M_t = \int_0^t \sqrt{a(X_s)}\, dW_s, \tag{3.1.65}
\]

and also is an innovations process for Xt .
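The representation (3.1.61) also suggests the standard way to generate such a process numerically: discretize time, draw Brownian increments, and accumulate the drift and the √a-weighted noise. This Euler-type sketch is an approximation under illustrative choices of f and a, not the exact construction.

```python
import numpy as np

def realize_diffusion(f, a, x0, T, n, rng):
    # Discretized version of X_t = X_0 + int f(X_s) ds + int sqrt(a(X_s)) dW_s.
    h = T / n
    X = np.empty(n + 1)
    X[0] = x0
    for k in range(n):
        dW = rng.standard_normal() * np.sqrt(h)
        X[k + 1] = X[k] + f(X[k]) * h + np.sqrt(a(X[k])) * dW
    return X

rng = np.random.default_rng(4)
X = realize_diffusion(f=lambda x: -x, a=lambda x: 1.0 + x**2,
                      x0=0.0, T=1.0, n=10_000, rng=rng)
```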

3.1.5 Multidimensional diffusion processes
The extension to a d-dimensional diffusion process (Xt )t∈[0,T ] with drift f : Rd → Rd and diffusion
matrix a : Rd → Rd×d is straightforward. As before, Ft denotes the σ-algebra generated by (Xs )s∈[0,t] .
A d-dimensional Brownian motion W = (Wt )t∈[0,T ] is a vector-valued process, Wt = (Wt1 , . . . , Wtd )T ,
where Wti are independent scalar Brownian motions adapted to Ft .

Theorem 6. Let X = (Xt )t∈[0,T ] be a d-dimensional diffusion process whose drift f : Rd → Rd
and diffusion matrix a : Rd → Rd×d are continuous functions of their argument, and a is everywhere
positive definite. Then there exists a d-dimensional Brownian motion W adapted to Ft , such that
Z t Z t
Xt = X0 + f (Xs ) ds + g(Xs ) dWs , (3.1.66)
0 0

where g(x) ∈ Rd×d is the positive square root of a(x), and the above formula holds coordinatewise,
i.e., for each i ∈ {1, . . . , d},
Z t d Z
X t
Xti = X0i + i
f (Xs ) ds + g ij (Xs ) dWsj , (3.1.67)
0 j=1 0

where f i (x) is the ith coordinate of f (x) and g ij (x) is the (i, j) entry of g(x).

3.2 Stochastic integration with respect to a Brownian motion

We will now take a closer look at stochastic integrals w.r.t. a Brownian motion. Let a probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t\in\mathbb{T}}, P)$ be given, where we have explicitly indicated the filtration $(\mathcal{F}_t)_{t\in\mathbb{T}}$. The set $\mathbb{T}$ will be either an interval $[0, T]$ for a finite $T > 0$ or the set of all $t \ge 0$. Let $W_t$ be an $\mathcal{F}_t$-Brownian motion. Since $[W]_t = t$, the space of admissible integrands will consist of all processes $(Z_t)_{t\in\mathbb{T}}$ that are adapted to $(\mathcal{F}_t)_{t\in\mathbb{T}}$ and satisfy
\[
\int_{\mathbb{T}} \mathbb{E}[Z_t^2] \, dt < \infty. \tag{3.2.1}
\]
We will refer to integrals w.r.t. $W_t$ as Itô integrals. The indefinite integral $\int_0^t Z_s\, dW_s$ is a zero-mean martingale with continuous sample paths, and we have the important Itô isometry property:
\[
\mathbb{E}\left[\left(\int_0^t Z_s\, dW_s\right)^2\right] = \int_0^t \mathbb{E}[Z_s^2]\, ds. \tag{3.2.2}
\]
In fact, as discussed in Remark 2, we can consider a wider class of admissible integrands $Z_t$ that are $\mathcal{F}_t$-adapted and satisfy
\[
\int_{\mathbb{T}} Z_t^2 \, dt < \infty \quad \text{a.s.} \tag{3.2.3}
\]
We will use this version from now on since it will allow us to state various results to a useful degree of generality, even though the resulting indefinite integrals may no longer be martingales.

It helps to see an example to appreciate how different Itô integration is from ordinary integral calculus. For example, let's take $Z_t = W_t$, which is clearly admissible on any finite interval $\mathbb{T} = [0, T]$. It is tempting to conjecture, by analogy with ordinary calculus, that
\[
\int_0^t W_s\, dW_s = \frac{1}{2}W_t^2. \tag{3.2.4}
\]
This, however, cannot be the case since $\int_0^t W_s\, dW_s$ must have zero mean, while $\mathbb{E}[W_t^2] = t$. In order to compute the Itô integral of $W_t$ w.r.t. $dW_t$, let us consider any sequence of increasingly fine partitions of $[0, t]$. For any such partition $(t_k)_{k=0}^m$ with $t_0 = 0$ and $t_m = t$, we can write
\[
\frac{1}{2}W_t^2 = \sum_{k=0}^{m-1} W_{t_k}(W_{t_{k+1}} - W_{t_k}) + \frac{1}{2}\sum_{k=0}^{m-1}(W_{t_{k+1}} - W_{t_k})^2. \tag{3.2.5}
\]
Then, taking the limit of both sides as the partitions get finer and finer, we see that the first term on the right-hand side converges to the integral $\int_0^t W_s\, dW_s$, while the second term converges to $\frac{1}{2}[W]_t = \frac{1}{2}t$, so
\[
\int_0^t W_s\, dW_s = \frac{1}{2}(W_t^2 - t). \tag{3.2.6}
\]
However, we have derived this using a bespoke argument that does not generalize. In order to have a general rule that will allow us to compute Itô integrals, we need to introduce stochastic differentials and Itô's differentiation rule.
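The partition-sum computation in (3.2.5) is easy to reproduce numerically. The following Python sketch (not part of the original text; the grid size and random seed are arbitrary choices) forms the left-endpoint Riemann sum for $\int_0^1 W\, dW$ on a simulated path and compares it with $\frac{1}{2}(W_1^2 - 1)$:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 10**5                                    # number of partition intervals of [0, 1]
dW = rng.standard_normal(m) * np.sqrt(1.0 / m)
W = np.concatenate(([0.0], np.cumsum(dW)))   # simulated Brownian path W_{t_k}
ito_sum = np.sum(W[:-1] * np.diff(W))        # left-endpoint sum over the partition
print(ito_sum, 0.5 * (W[-1]**2 - 1.0))       # the two values should nearly coincide
```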

3.2.1 Stochastic differentials and Itô's differentiation rule

Let $F_t$ and $G_t$ be $\mathcal{F}_t$-adapted processes, such that
\[
\int_{\mathbb{T}} |F_t| \, dt < \infty \quad \text{and} \quad \int_{\mathbb{T}} G_t^2 \, dt < \infty \quad \text{a.s.,} \tag{3.2.7}
\]
and let $X_0$ be an $\mathcal{F}_0$-measurable random variable. Then we can form the integral process
\[
X_t := X_0 + \int_0^t F_s \, ds + \int_0^t G_s \, dW_s, \qquad t \in \mathbb{T}, \tag{3.2.8}
\]
where the second term on the right-hand side is an ordinary integral, while the third term is an Itô integral. We can also write this as an Itô stochastic differential
\[
dX_t = F_t \, dt + G_t \, dW_t, \tag{3.2.9}
\]
which, together with the initial condition $X_0$, is always to be interpreted as shorthand for (3.2.8). Such processes are also referred to as Itô processes. We can integrate other processes w.r.t. $X_t$: if $V_t$ is an $\mathcal{F}_t$-adapted process, such that
\[
\int_{\mathbb{T}} |V_t F_t| \, dt < \infty \quad \text{and} \quad \int_{\mathbb{T}} V_t^2 G_t^2 \, dt < \infty \quad \text{a.s.,} \tag{3.2.10}
\]

then we can define the integral of $V_t$ w.r.t. $X_t$ in an obvious fashion:
\[
\int_0^t V_s \, dX_s := \int_0^t V_s F_s \, ds + \int_0^t V_s G_s \, dW_s. \tag{3.2.11}
\]
This gives another Itô process. It is not hard to see that the class of all Itô processes is closed under linear operations: if $X_t$ and $\tilde{X}_t$ are two Itô processes, then any linear combination $\alpha X_t + \beta\tilde{X}_t$, where $\alpha$ and $\beta$ are deterministic constants, is also an Itô process. In fact, as we will see shortly, the class of Itô processes is closed under multiplication, i.e., $Y_t = X_t\tilde{X}_t$ will also be an Itô process. This is a consequence of the fact that any sufficiently smooth functional of an Itô process is itself an Itô process. We begin by stating a simple one-dimensional version:

Theorem 7 (Itô's differentiation rule, one-dimensional version). Let $X_t$ be an integral process, as in (3.2.8), and consider the process $Y_t := \varphi(X_t, t)$, where $\varphi(x, t)$ is once continuously differentiable w.r.t. $t$ and twice continuously differentiable w.r.t. $x$. Then the following holds almost surely:
\[
Y_t = \varphi(X_0, 0) + \int_0^t \dot{\varphi}(X_s, s)\, ds + \int_0^t \varphi'(X_s, s)\, dX_s + \frac{1}{2}\int_0^t \varphi''(X_s, s)\, G_s^2\, ds, \tag{3.2.12}
\]
where $\dot{\varphi}(x, t) := \frac{\partial}{\partial t}\varphi(x, t)$, $\varphi'(x, t) := \frac{\partial}{\partial x}\varphi(x, t)$, and $\varphi''(x, t) := \frac{\partial^2}{\partial x^2}\varphi(x, t)$.

Proof. We will give a formal argument instead of a rigorous proof to illustrate the main idea. We are interested in computing the forward differential
\[
dY_t = Y_{t+dt} - Y_t = \varphi(X_{t+dt}, t + dt) - \varphi(X_t, t). \tag{3.2.13}
\]
Since $\varphi(x, t)$ is twice differentiable in $x$ and once in $t$, we can write out a Taylor expansion of $\varphi$ to second order in $x$ and to first order in $t$:
\begin{align*}
&\varphi(X_{t+dt}, t + dt) - \varphi(X_t, t) \tag{3.2.14}\\
&\quad= \varphi(X_t + dX_t, t + dt) - \varphi(X_t, t) \tag{3.2.15}\\
&\quad= \varphi'(X_t, t)\, dX_t + \dot{\varphi}(X_t, t)\, dt + \tfrac{1}{2}\varphi''(X_t, t)(dX_t)^2 + \text{higher-order terms.} \tag{3.2.16}
\end{align*}
Since $dX_t = F_t\, dt + G_t\, dW_t$, we can express $(dX_t)^2$ further as
\begin{align*}
(dX_t)^2 &= F_t^2(dt)^2 + 2F_t G_t\, dW_t\, dt + G_t^2(dW_t)^2 \tag{3.2.17}\\
&= F_t^2(dt)^2 + 2F_t G_t\, dW_t\, dt + G_t^2\, dt, \tag{3.2.18}
\end{align*}
where the equality $(dW_t)^2 = dt$ holds almost surely and in the mean-square sense. All the other terms are $o(dt)$, so we can neglect them and write
\[
(dX_t)^2 = G_t^2\, dt. \tag{3.2.19}
\]
Substituting this back into our expression for $dY_t$ gives
\[
dY_t = \dot{\varphi}(X_t, t)\, dt + \varphi'(X_t, t)\, dX_t + \frac{1}{2}\varphi''(X_t, t)G_t^2\, dt, \tag{3.2.20}
\]
and (3.2.12) follows by integration.

Thus, Itô's differentiation rule (or, simply, Itô's rule) tells us that, for any function $\varphi(x, t)$ which is $C^2$ in $x$ and $C^1$ in $t$ and for any Itô process $dX_t = F_t\, dt + G_t\, dW_t$, $Y_t = \varphi(X_t, t)$ is itself an Itô process, i.e.,
\[
dY_t = \left(\dot{\varphi}(X_t, t) + \varphi'(X_t, t)F_t + \frac{1}{2}\varphi''(X_t, t)G_t^2\right)dt + \varphi'(X_t, t)G_t\, dW_t. \tag{3.2.21}
\]
The presence of the last term on the right-hand side of (3.2.12), involving the second derivative of $\varphi(x, t)$ w.r.t. $x$, is the surprising part and the main difference between the rules of Itô calculus and those of ordinary calculus. In order to appreciate how different this really is, consider the case when $G_t \equiv 0$ for all $t$. Then, on a path-by-path basis, we have the ODE
\[
\frac{d}{dt}X_t(\omega) = F_t(\omega), \tag{3.2.22}
\]
and we can simply apply the chain rule to see what is happening with $Y_t = \varphi(X_t, t)$:
\begin{align*}
\frac{d}{dt}Y_t(\omega) &= \frac{d}{dt}\varphi(X_t(\omega), t) \tag{3.2.23}\\
&= \dot{\varphi}(X_t(\omega), t) + \varphi'(X_t(\omega), t)\,\frac{d}{dt}X_t(\omega), \tag{3.2.24}
\end{align*}
which we can also write as $dY_t = \dot{\varphi}(X_t, t)\, dt + \varphi'(X_t, t)\, dX_t$. Integrating, we obtain the expression
\[
\varphi(X_t, t) = \varphi(X_0, 0) + \int_0^t \dot{\varphi}(X_s, s)\, ds + \int_0^t \varphi'(X_s, s)\, dX_s, \tag{3.2.25}
\]
which is exactly what we would expect from the rules of ordinary calculus. The extra term in Itô's rule that contains $\varphi''$ is a consequence of the fact that $d[W]_t = dt$. Ultimately, we will need the multidimensional version of Itô's rule, but we can already look at a few examples to get an idea of what's involved.
Example 1 ($\int_0^t W\, dW$ revisited). The formula (3.2.6) for the Itô integral of $W$ w.r.t. $dW$ can now be seen as a direct consequence of Itô's rule. Indeed, since $W_t$ is evidently an Itô process with $F_t = 0$ and $G_t = 1$, we can apply Theorem 7 to $\varphi(x, t) = \frac{1}{2}x^2$. Then
\[
\dot{\varphi}(x, t) = 0, \qquad \varphi'(x, t) = x, \qquad \varphi''(x, t) = 1, \tag{3.2.26}
\]
and therefore
\[
\frac{1}{2}W_t^2 = \int_0^t W_s\, dW_s + \frac{1}{2}\int_0^t ds = \int_0^t W_s\, dW_s + \frac{1}{2}t. \tag{3.2.27}
\]
Rearranging, we get (3.2.6).

Example 2 (fundamental theorem of Itô calculus). We can generalize the above example to compute the Itô integral $\int_0^t \varphi'(W_s)\, dW_s$, where $\varphi : \mathbb{R} \to \mathbb{R}$ is a twice continuously differentiable function. In this case, Itô's rule gives the formula
\[
\varphi(W_t) = \varphi(0) + \int_0^t \varphi'(W_s)\, dW_s + \frac{1}{2}\int_0^t \varphi''(W_s)\, ds, \tag{3.2.28}
\]
or, upon rearranging,
\[
\int_0^t \varphi'(W_s)\, dW_s = \varphi(W_t) - \varphi(0) - \frac{1}{2}\int_0^t \varphi''(W_s)\, ds. \tag{3.2.29}
\]
We can also consider functions $\varphi(x, t)$. For example, let $\varphi(x, t) = e^{x - t/2}$. Then
\[
\dot{\varphi}(x, t) = -\frac{1}{2}e^{x - t/2} = -\frac{1}{2}\varphi(x, t), \qquad \varphi'(x, t) = \varphi''(x, t) = \varphi(x, t). \tag{3.2.30}
\]
Applying Itô's rule gives
\begin{align*}
e^{W_t - t/2} &= 1 - \frac{1}{2}\int_0^t e^{W_s - s/2}\, ds + \int_0^t e^{W_s - s/2}\, dW_s + \frac{1}{2}\int_0^t e^{W_s - s/2}\, ds \tag{3.2.31}\\
&= 1 + \int_0^t e^{W_s - s/2}\, dW_s, \tag{3.2.32}
\end{align*}
so, for this particular choice of $\varphi$, we can write
\[
\int_0^t \varphi(W_s, s)\, dW_s = \varphi(W_s, s)\Big|_0^t. \tag{3.2.33}
\]
Compare this with ordinary calculus, where the exponential function $x \mapsto e^x$ is the unique solution of the ODE $\varphi'(x) = \varphi(x)$, i.e.,
\[
\int_a^b \varphi(x)\, dx = \varphi(x)\Big|_a^b. \tag{3.2.34}
\]
We will later see that this is a special case of the so-called Doléans-Dade exponential of an Itô process.
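Since (3.2.32) exhibits $e^{W_t - t/2}$ as $1$ plus a zero-mean Itô integral, its expectation equals $1$ for every $t$. A quick sampling check (a sketch, not from the text; the value of $t$ and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
t, n = 4.0, 10**6
W_t = rng.standard_normal(n) * np.sqrt(t)  # W_t ~ N(0, t)
print(np.exp(W_t - 0.5 * t).mean())        # should be close to 1.0 (martingale mean)
```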

3.2.2 Multidimensional Itô calculus

Since many applications of interest involve multidimensional processes, we need to develop the appropriate framework for working with such processes. As before, we have the probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t\in\mathbb{T}}, P)$ with a filtration defined on it, as well as an $m$-dimensional Brownian motion process $W_t = (W^1_t, \dots, W^m_t)^{\mathsf T}$ adapted to this filtration. (That is, the $W^i_t$ are independent Brownian motions adapted to $\mathcal{F}_t$.) We then say that $X_t = (X^1_t, \dots, X^n_t)^{\mathsf T}$ is an $n$-dimensional Itô process adapted to $\mathcal{F}_t$ if there exist adapted processes $F^i_t, G^{ij}_t$ for $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, m\}$, such that
\[
\int_{\mathbb{T}} |F^i_t|\, dt < \infty \quad \text{and} \quad \int_{\mathbb{T}} |G^{ij}_t|^2\, dt < \infty \quad \text{a.s.} \quad \forall i, j, \tag{3.2.35}
\]
and
\[
X^i_t = X^i_0 + \int_0^t F^i_s\, ds + \sum_{j=1}^m \int_0^t G^{ij}_s\, dW^j_s, \qquad i = 1, \dots, n. \tag{3.2.36}
\]
If we assemble the processes $F^i_t$ into an $n$-dimensional vector process $F_t = (F^1_t, \dots, F^n_t)^{\mathsf T}$ and the $G^{ij}_t$ into an $n \times m$ matrix process $G_t$, then we can express (3.2.36) more succinctly as
\[
X_t = X_0 + \int_0^t F_s\, ds + \int_0^t G_s\, dW_s. \tag{3.2.37}
\]
We can now state the multidimensional version of Itô's rule:

Theorem 8. Let $\varphi : \mathbb{R}^n \times \mathbb{T} \to \mathbb{R}$ be such that $\varphi(x, t)$, where $x = (x^1, \dots, x^n)^{\mathsf T}$, is twice continuously differentiable in all $x^i$ and once continuously differentiable in $t$. Then $Y_t = \varphi(X_t, t)$ is an Itô process:
\begin{align*}
\varphi(X_t, t) = \varphi(X_0, 0) &+ \sum_{i=1}^n \sum_{j=1}^m \int_0^t \varphi_i(X_s, s)G^{ij}_s\, dW^j_s \\
&+ \int_0^t \Bigl(\dot{\varphi}(X_s, s) + \sum_{i=1}^n \varphi_i(X_s, s)F^i_s + \frac{1}{2}\sum_{i,j=1}^n \sum_{k=1}^m \varphi_{ij}(X_s, s)G^{ik}_s G^{jk}_s\Bigr)\, ds,
\end{align*}
where $\dot{\varphi}(x, s) := \frac{\partial}{\partial s}\varphi(x, s)$, $\varphi_i(x, s) := \frac{\partial}{\partial x^i}\varphi(x, s)$, and $\varphi_{ij}(x, s) := \frac{\partial^2}{\partial x^i\partial x^j}\varphi(x, s)$.

We can rewrite the formula of Theorem 8 using more compact vector notation as
\[
d\varphi(X_t, t) = \dot{\varphi}(X_t, t)\, dt + \partial\varphi(X_t, t)\, dX_t + \frac{1}{2}\operatorname{tr}\bigl\{\partial^2\varphi(X_t, t)\, dX_t(dX_t)^{\mathsf T}\bigr\}, \tag{3.2.38}
\]
where $\partial\varphi(x, t)$ denotes the row vector $(\varphi_1(x, t), \dots, \varphi_n(x, t))$, $\partial^2\varphi(x, t)$ is the $n \times n$ matrix with entries $\varphi_{ij}(x, t)$, and the products $dX^i_t\, dX^j_t$ are evaluated using the box rules
\[
(dt)^2 = dW^i_t\, dt = dt\, dW^i_t = 0, \qquad dW^i_t\, dW^j_t = \delta_{ij}\, dt. \tag{3.2.39}
\]

Example 3 (Itô product rule). Let $X^i_t$, $i \in \{1, 2\}$, be two scalar Itô processes. Then, for the function $\varphi(x^1, x^2) = x^1x^2$ we have
\[
\partial\varphi(x^1, x^2) = (x^2, x^1), \qquad \partial^2\varphi(x^1, x^2) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \tag{3.2.40}
\]
so Itô's rule gives
\begin{align*}
d(X^1_t X^2_t) &= (X^2_t, X^1_t)\begin{pmatrix} dX^1_t \\ dX^2_t \end{pmatrix} + \frac{1}{2}(dX^1_t, dX^2_t)\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} dX^1_t \\ dX^2_t \end{pmatrix} \tag{3.2.41}\\
&= X^1_t\, dX^2_t + X^2_t\, dX^1_t + dX^1_t\, dX^2_t, \tag{3.2.42}
\end{align*}
which is the Itô calculus counterpart of the product rule $d(UV) = U\, dV + V\, dU$ of ordinary calculus, where we pick up the extra term $dU\, dV$. Among other things, this shows that the class of Itô processes is closed under pointwise multiplication: this is easy to verify by starting with the representations $dX^i_t = F^i_t\, dt + G^i_t\, dW_t$, multiplying things out, and using the box rules.

The product rule can be written down in a more compact form in terms of the joint quadratic variation of $X^1$ and $X^2$, which is denoted by $[X^1, X^2]$ and defined as
\[
[X^1, X^2] := \frac{1}{4}\bigl([X^1 + X^2] - [X^1 - X^2]\bigr). \tag{3.2.43}
\]
Using this definition and Itô's box rules, it is not hard to show that
\[
dX^1_t\, dX^2_t = G^1_t G^2_t\, dt = d[X^1, X^2]_t, \tag{3.2.44}
\]
so the product rule now takes the form
\[
d(X^1_t X^2_t) = X^1_t\, dX^2_t + X^2_t\, dX^1_t + d[X^1, X^2]_t. \tag{3.2.45}
\]

Example 4 (Proof of Lévy's theorem). Let $(W_t)_{t\ge 0}$ be a martingale w.r.t. a filtration $(\mathcal{F}_t)_{t\ge 0}$ with $W_0 = 0$, and suppose that $W_t$ has continuous sample paths and $[W]_t = t$. Lévy's characterization of Brownian motion (Theorem 3) then says that $W_t$ is a Brownian motion w.r.t. $\mathcal{F}_t$. To see this, let us apply Itô's rule to $Z_t = e^{i\alpha W_t}$, where $\alpha$ is an arbitrary real parameter and $i = \sqrt{-1}$ is the imaginary unit:
\begin{align*}
dZ_t &= i\alpha Z_t\, dW_t - \frac{1}{2}\alpha^2 Z_t\, d[W]_t \tag{3.2.46}\\
&= i\alpha Z_t\, dW_t - \frac{1}{2}\alpha^2 Z_t\, dt. \tag{3.2.47}
\end{align*}
Therefore, for $t \ge s \ge 0$,
\[
e^{i\alpha W_t} - e^{i\alpha W_s} = i\alpha\int_s^t Z_r\, dW_r - \frac{\alpha^2}{2}\int_s^t Z_r\, dr. \tag{3.2.48}
\]
Using the martingale property of stochastic integrals gives
\[
\mathbb{E}[e^{i\alpha W_t} \mid \mathcal{F}_s] = e^{i\alpha W_s} - \frac{\alpha^2}{2}\int_s^t \mathbb{E}[e^{i\alpha W_r} \mid \mathcal{F}_s]\, dr. \tag{3.2.49}
\]
This can be easily solved for $\mathbb{E}[e^{i\alpha W_t} \mid \mathcal{F}_s]$:
\[
\mathbb{E}[e^{i\alpha W_t} \mid \mathcal{F}_s] = e^{i\alpha W_s}e^{-\frac{\alpha^2}{2}(t - s)}, \tag{3.2.50}
\]
which is equivalent to $\mathbb{E}[e^{i\alpha(W_t - W_s)} \mid \mathcal{F}_s] = e^{-\frac{\alpha^2}{2}(t - s)}$ for $t \ge s \ge 0$. Thus, conditionally on $\mathcal{F}_s$, the increment $W_t - W_s$ is Gaussian with zero mean and variance $t - s$. This proves that $(W_t)_{t\ge 0}$ is an $\mathcal{F}_t$-Brownian motion.

3.3 Stochastic differential equations

We began this chapter by presenting Doob's solution of the realization problem for a class of diffusion processes: if $(X_t)_{t\in[0,T]}$ is a $d$-dimensional diffusion process with drift $f(x)$ and positive definite diffusion matrix $a(x)$, then it can be realized as an Itô process
\[
X_t = X_0 + \int_0^t f(X_s)\, ds + \int_0^t g(X_s)\, dW_s, \qquad 0 \le t \le T, \tag{3.3.1}
\]
for some $d$-dimensional Brownian motion adapted to the filtration generated by $X_t$. Now, Eq. (3.3.1) expresses $X_t$ as an Itô process with $F_t = f(X_t)$ and $G_t = g(X_t)$ both depending on $X_t$, i.e.,
\[
dX_t = f(X_t)\, dt + g(X_t)\, dW_t, \qquad 0 \le t \le T, \tag{3.3.2}
\]
with the given initial condition $X_0$. Doob's representation theorem then allows us to say that $(X_t)_{t\in[0,T]}$ is a solution of the stochastic differential equation (SDE) (3.3.2).

This observation motivates the following general question: let an $m$-dimensional Brownian motion $(W_t)_{t\in[0,T]}$, an $n$-dimensional random vector $X_0$ independent of $(W_t)$, and functions $f : \mathbb{R}^n \to \mathbb{R}^n$ and $g : \mathbb{R}^n \to \mathbb{R}^{n\times m}$ be given. We would like to know whether there exists a solution of the stochastic differential equation
\[
dX_t = f(X_t)\, dt + g(X_t)\, dW_t, \qquad 0 \le t \le T, \tag{3.3.3}
\]
i.e., an adapted $n$-dimensional process $(X_t)_{t\in[0,T]}$, such that
\[
X_t = X_0 + \int_0^t f(X_s)\, ds + \int_0^t g(X_s)\, dW_s, \qquad 0 \le t \le T. \tag{3.3.4}
\]
Moreover, we would like to know whether $X_t$ is a diffusion process and to identify its drift and diffusion matrix. Thus, we are starting from an internalist description given by (3.3.3) in terms of the initial condition $X_0$ and an independent Brownian motion playing the role of the latent variables; we then ask whether this description is well-posed in the sense of admitting a well-behaved externalist description, i.e., a solution of (3.3.3) with the latent Brownian motion eliminated. As we will see next, this is possible under mild regularity conditions on $f$ and $g$.

A reminder about notation. Given a vector $v \in \mathbb{R}^n$ and a matrix $A \in \mathbb{R}^{n\times m}$, we will denote by $|v|$ and $\|A\|$ the Euclidean norm of $v$ and the operator norm of $A$, i.e.,
\[
|v| = \sqrt{v^{\mathsf T}v}, \qquad \|A\| = \sup\{|Au| : u \in \mathbb{R}^m,\ |u| = 1\}. \tag{3.3.5}
\]

Theorem 9. Suppose that the initial condition $X_0$ and the functions $f, g$ satisfy the following for some constant $0 \le K < \infty$:

1. $X_0$ is independent of $(W_t)_{t\in[0,T]}$, and $\mathbb{E}[|X_0|^2] < \infty$;

2. $f$ and $g$ are Lipschitz continuous:
\[
|f(x) - f(y)| + \|g(x) - g(y)\| \le K|x - y|, \qquad x, y \in \mathbb{R}^n; \tag{3.3.6}
\]

3. $f$ and $g$ are of at most linear growth:
\[
|f(x)|^2 + \|g(x)\|^2 \le K^2(1 + |x|^2), \qquad x \in \mathbb{R}^n. \tag{3.3.7}
\]

Then the SDE (3.3.3) has a unique solution $(X_t)_{t\in[0,T]}$ satisfying $\int_0^t \mathbb{E}[|X_s|^2]\, ds < \infty$ for all $t \in [0, T]$. Moreover, $X_t$ is a diffusion process with drift $f(x)$ and diffusion matrix $a(x) = g(x)g(x)^{\mathsf T}$.

Proof. The proof is based on Picard iteration, which is the same successive approximation procedure one uses to show existence and uniqueness of solutions of ODEs. To keep things simple, we will limit ourselves to the scalar case, i.e., $n = m = 1$; the general case is proved along the same lines. We will construct a sequence of processes $(X^{(n)})_{n=0,1,\dots}$ recursively as follows: $X^{(0)}_t = X_0$ for all $t \in [0, T]$, and for $n = 0, 1, \dots$
\[
X^{(n+1)}_t := X_0 + \int_0^t f(X^{(n)}_s)\, ds + \int_0^t g(X^{(n)}_s)\, dW_s, \qquad 0 \le t \le T. \tag{3.3.8}
\]
Each of these processes is evidently adapted. Moreover, using the linear growth condition (3.3.7), it can be verified that each $X^{(n)}_t$ has continuous sample paths and satisfies $\int_0^t \mathbb{E}[|X^{(n)}_s|^2]\, ds < \infty$ for all $0 \le t \le T$. Next, let $\Delta^n_t := \mathbb{E}[(X^{(n+1)}_t - X^{(n)}_t)^2]$. Then, since
\[
X^{(n+1)}_t - X^{(n)}_t = \int_0^t \bigl(f(X^{(n)}_s) - f(X^{(n-1)}_s)\bigr)\, ds + \int_0^t \bigl(g(X^{(n)}_s) - g(X^{(n-1)}_s)\bigr)\, dW_s, \tag{3.3.9}
\]
we can estimate
\begin{align*}
\Delta^n_t &\le 2\mathbb{E}\left[\left(\int_0^t \bigl(f(X^{(n)}_s) - f(X^{(n-1)}_s)\bigr)\, ds\right)^2\right] + 2\mathbb{E}\left[\left(\int_0^t \bigl(g(X^{(n)}_s) - g(X^{(n-1)}_s)\bigr)\, dW_s\right)^2\right] \tag{3.3.10}\\
&\le 2t\int_0^t \mathbb{E}\bigl[(f(X^{(n)}_s) - f(X^{(n-1)}_s))^2\bigr]\, ds + 2\int_0^t \mathbb{E}\bigl[(g(X^{(n)}_s) - g(X^{(n-1)}_s))^2\bigr]\, ds \tag{3.3.11}\\
&\le 2K^2(1 + T)\int_0^t \mathbb{E}\bigl[|X^{(n)}_s - X^{(n-1)}_s|^2\bigr]\, ds \tag{3.3.12}\\
&=: C_0\int_0^t \Delta^{n-1}_s\, ds, \tag{3.3.13}
\end{align*}
where we have used the Itô isometry and the Lipschitz continuity condition (3.3.6). Now,
\begin{align*}
\Delta^0_t &= \mathbb{E}[(X^{(1)}_t - X^{(0)}_t)^2] \tag{3.3.14}\\
&= \mathbb{E}\left[\left(\int_0^t f(X_0)\, ds + \int_0^t g(X_0)\, dW_s\right)^2\right] \tag{3.3.15}\\
&\le 2t\,\mathbb{E}\bigl[t|f(X_0)|^2 + |g(X_0)|^2\bigr] \tag{3.3.16}\\
&\le 2K^2(1 + T)\bigl(1 + \mathbb{E}[|X_0|^2]\bigr)\, t \tag{3.3.17}\\
&=: C_1 t. \tag{3.3.18}
\end{align*}
Therefore, repeated integration gives the estimate
\[
\Delta^n_t \le C_1\frac{(C_0 t)^n}{n!}. \tag{3.3.19}
\]
Now, for any $t \in [0, T]$ and any $k, n \in \{0, 1, \dots\}$, we can use the Cauchy–Schwarz inequality to write
\begin{align*}
|X^{(n+k)}_t - X^{(n)}_t| &\le \sum_{j=1}^k |X^{(n+j)}_t - X^{(n+j-1)}_t| \tag{3.3.20}\\
&\le \left(\sum_{j=1}^k \frac{1}{2^{n+j-1}}\right)^{1/2}\left(\sum_{j=1}^k 2^{n+j-1}|X^{(n+j)}_t - X^{(n+j-1)}_t|^2\right)^{1/2}, \tag{3.3.21}
\end{align*}

so
\begin{align*}
\mathbb{E}\bigl[|X^{(n+k)}_t - X^{(n)}_t|^2\bigr] &\le \sum_{j=1}^k \frac{1}{2^{n+j-1}}\cdot\sum_{j=1}^k 2^{n+j-1}\Delta^{n+j-1}_t \tag{3.3.22}\\
&\le 2C_1\sum_{j=1}^k \frac{(2C_0 t)^{n+j-1}}{(n+j-1)!} \tag{3.3.23}\\
&\le 2C_1\sum_{j=n-1}^{\infty}\frac{(2C_0 t)^j}{j!}, \tag{3.3.24}
\end{align*}
which means that
\[
\sup_{k\ge 0}\,\mathbb{E}\bigl[|X^{(n+k)}_t - X^{(n)}_t|^2\bigr] \le 2C_1\sum_{j=n-1}^{\infty}\frac{(2C_0 T)^j}{j!}, \tag{3.3.25}
\]
where the right-hand side converges to 0 as $n \to \infty$. Thus
\[
\sup_{t\in[0,T]}\sup_{k\ge 0}\,\mathbb{E}\bigl[|X^{(n+k)}_t - X^{(n)}_t|^2\bigr] \xrightarrow{\ n\to\infty\ } 0. \tag{3.3.26}
\]

This shows that, for every $t \in [0, T]$, the sequence $(X^{(n)}_t)_{n=0,1,\dots}$ converges in mean square, so we have obtained a process $(X_t)_{t\in[0,T]}$, such that
\[
\lim_{n\to\infty}\sup_{t\in[0,T]}\mathbb{E}\bigl[|X_t - X^{(n)}_t|^2\bigr] = 0. \tag{3.3.27}
\]
This process is adapted, satisfies $\int_0^t \mathbb{E}[X_s^2]\, ds < \infty$ for all $t \in [0, T]$, and solves (3.3.3). To show uniqueness, let $(\tilde{X}_t)_{t\in[0,T]}$ be another solution of (3.3.3). Then $Z_t := X_t - \tilde{X}_t$ satisfies $Z_0 = 0$ and
\[
Z_t = \int_0^t \bigl(f(X_s) - f(\tilde{X}_s)\bigr)\, ds + \int_0^t \bigl(g(X_s) - g(\tilde{X}_s)\bigr)\, dW_s, \tag{3.3.28}
\]
and the same argument as before gives
\[
\mathbb{E}Z_t^2 \le 2K^2(1 + T)\int_0^t \mathbb{E}Z_s^2\, ds, \qquad 0 \le t \le T. \tag{3.3.29}
\]
Gronwall's inequality then gives $\mathbb{E}Z_t^2 \le e^{2K^2(1+T)t}\,\mathbb{E}Z_0^2$, so $\mathbb{E}Z_t^2 = 0$ for all $t \in [0, T]$. Verification of the fact that $X_t$ is a diffusion process with drift $f$ and diffusion matrix $gg^{\mathsf T}$ is left as an exercise.
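The Picard iteration in the proof is a theoretical device; for actual computation one typically discretizes the SDE on a grid. Below is a minimal Euler–Maruyama sketch in Python (not from the text; the test equation $dX_t = -X_t\, dt + dW_t$ and all numerical parameters are illustrative assumptions):

```python
import numpy as np

def euler_maruyama(f, g, x0, T, m, rng):
    """Euler-Maruyama approximation of dX = f(X) dt + g(X) dW on [0, T] with m steps."""
    dt = T / m
    x = np.empty(m + 1)
    x[0] = x0
    for k in range(m):
        dW = rng.standard_normal() * np.sqrt(dt)   # Brownian increment over one step
        x[k + 1] = x[k] + f(x[k]) * dt + g(x[k]) * dW
    return x

rng = np.random.default_rng(3)
path = euler_maruyama(f=lambda x: -x, g=lambda x: 1.0, x0=1.0, T=5.0, m=5000, rng=rng)
print(path[-1])   # one sample of an approximate solution at time T
```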

3.3.1 Examples

Before proceeding further, let us see some examples of SDEs that admit explicit solutions.

Example 5 (Linear SDEs and the Ornstein–Uhlenbeck process). A (time-homogeneous) linear ODE is of the form
\[
\frac{d}{dt}x(t) = Fx(t), \tag{3.3.30}
\]
where $x(t)$ is an $n$-dimensional vector and $F$ is an $n \times n$ matrix. The solution, for a given initial condition at $t = 0$, is given by the matrix exponential: $x(t) = e^{tF}x(0)$. The stochastic analog takes the form
the form

dXt = F Xt dt + G dWt (3.3.31)

for an n-dimensional process Xt and an m-dimensional Brownian motion Wt , where F ∈ Rn×n


and G ∈ Rn×m are given matrices. Here, f (x) = F x and g(x) = G evidently satisfy the regularity
conditions of Theorem 9, so for any T > 0 there exists a unique diffusion process (Xt )t∈[0,T ] with drift
f (x) = F x and diffusion matrix a(x) = GGT that solves the above SDE for any square-integrable
initial condition X0 . We will now show that this process can be expressed in closed form in terms of
the initial condition X0 and the Brownian motion Wt only.
To that end, consider the vector-valued function ϕ(x, t) = e−tF x. Applying Itô’s rule coordinate-
wise, we can write the following for the process Yt = ϕ(Xt , t) = e−tF Xt :

dYt = −F ϕ(Xt , t) dt + e−tF dXt (3.3.32)


= −F ϕ(Xt , t) dt + e−tF F Xt dt + e−tF G dWt (3.3.33)
−tF
=e G dWt , (3.3.34)

where the term 12 dXt (dXt )T ∂ 2 ϕ(Xt , t) vanishes since ϕ(x, t) is linear in x. Since Y0 = X0 , we can
integrate the above SDE to get
Z t
Yt = X0 + e−sF G dWs , 0≤t≤T (3.3.35)
0

which finally gives


Z t
Xt = etF X0 + e(t−s)F G dWs . (3.3.36)
0

This process is known as the Ornstein–Uhlenbeck process. If X0 is a Gaussian random vector, then
Xt is a Gaussian process.
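Formula (3.3.36) also yields an exact simulation scheme: for the scalar case $dX_t = -\theta X_t\, dt + \sigma\, dW_t$, the increment over a step of length $h$ is Gaussian with known mean and variance, so no discretization error is incurred. A sketch (not from the text; $\theta$, $\sigma$, and the grid are illustrative assumptions), comparing the long-run empirical variance with the stationary value $\sigma^2/(2\theta)$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, h, m = 2.0, 1.0, 0.01, 10**6
decay = np.exp(-theta * h)                       # e^{hF} with F = -theta
noise_sd = sigma * np.sqrt((1 - np.exp(-2 * theta * h)) / (2 * theta))
x = np.empty(m)
x[0] = 0.0
for k in range(m - 1):
    # exact one-step transition implied by the closed-form solution (3.3.36)
    x[k + 1] = decay * x[k] + noise_sd * rng.standard_normal()
print(x[m // 2:].var(), sigma**2 / (2 * theta))  # empirical vs. stationary variance
```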
Example 6 (Stochastic exponential of Brownian motion). Consider the one-dimensional SDE
\[
dX_t = X_t\, dW_t, \qquad 0 \le t \le T, \tag{3.3.37}
\]
with initial condition $X_0 = 1$. The solution of this is given by $X_t = e^{W_t - t/2}$, the stochastic (or Doléans-Dade) exponential of the Brownian motion. To show this, consider any $C^2$ function $\varphi(x)$; Itô's rule then gives
\begin{align*}
d\varphi(X_t) &= \varphi'(X_t)\, dX_t + \frac{1}{2}\varphi''(X_t)(dX_t)^2 \tag{3.3.38}\\
&= \frac{1}{2}\varphi''(X_t)X_t^2\, dt + \varphi'(X_t)X_t\, dW_t. \tag{3.3.39}
\end{align*}
In particular, taking $\varphi(x) = \log x$, we get
\[
d\log X_t = -\frac{1}{2}\, dt + dW_t, \tag{3.3.40}
\]
which is readily integrated to give $\log X_t = W_t - \frac{t}{2}$. The claim then follows.

Example 7 (Geometric Brownian motion). Adding a linear drift to (3.3.37) and rescaling the Brownian motion gives
\[
dX_t = \mu X_t\, dt + \sigma X_t\, dW_t, \tag{3.3.41}
\]
where $\mu \in \mathbb{R}$ and $\sigma > 0$ are known as the drift and the volatility parameters. The solution of (3.3.41) with the initial condition $X_0 = 1$ is known as the geometric Brownian motion. To obtain an explicit formula, we once again consider $\log X_t$:
\begin{align*}
d\log X_t &= \frac{1}{X_t}\, dX_t - \frac{1}{2X_t^2}(dX_t)^2 \tag{3.3.42}\\
&= \left(\mu - \frac{\sigma^2}{2}\right)dt + \sigma\, dW_t, \tag{3.3.43}
\end{align*}
which gives
\[
X_t = \exp\left(\sigma W_t + \left(\mu - \frac{\sigma^2}{2}\right)t\right). \tag{3.3.44}
\]
Geometric Brownian motion is used in mathematical finance to model stock prices in the so-called Black–Scholes model.
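The closed form (3.3.44) can be checked against an Euler–Maruyama discretization of (3.3.41) driven by the same Brownian increments (a sketch; $\mu$, $\sigma$, and the grid are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, T, m = 0.1, 0.3, 1.0, 10**5
dt = T / m
dW = rng.standard_normal(m) * np.sqrt(dt)
x = 1.0
for dw in dW:                        # Euler-Maruyama on dX = mu X dt + sigma X dW
    x += mu * x * dt + sigma * x * dw
exact = np.exp(sigma * dW.sum() + (mu - 0.5 * sigma**2) * T)   # formula (3.3.44)
print(x, exact)                      # the two values should be close
```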

3.3.2 Coordinate changes and the Stratonovich integral

When working with dynamical systems modeled by ordinary differential equations, the choice of the coordinate system often makes a difference when it comes to ease of analysis. Consider an ODE
\[
\frac{d}{dt}x(t) = f(x(t)) \tag{3.3.45}
\]
describing the evolution of an $n$-dimensional state vector $x(t)$, and let a smooth (e.g., $C^1$) function $\varphi : \mathbb{R}^n \to \mathbb{R}^n$ be given. Then the evolution of $y(t) := \varphi(x(t))$ is easily described using the chain rule:
\begin{align*}
\frac{d}{dt}y(t) &= \frac{d}{dt}\varphi(x(t)) \tag{3.3.46}\\
&= \frac{\partial\varphi}{\partial x}(x(t))\,\frac{d}{dt}x(t) \tag{3.3.47}\\
&= \frac{\partial\varphi}{\partial x}(x(t))\,f(x(t)), \tag{3.3.48}
\end{align*}
where $\frac{\partial\varphi}{\partial x}$ is the Jacobian of $\varphi$, i.e., the $n \times n$ matrix of all its first-order partial derivatives. If $\varphi$ is invertible, then we obtain an equivalent model in terms of $y(t)$ only:
\[
\frac{d}{dt}y(t) = \tilde{f}(y(t)), \tag{3.3.49}
\]
where $\tilde{f}(y) := \frac{\partial\varphi}{\partial x}(\varphi^{-1}(y))\,f(\varphi^{-1}(y))$. This procedure is useful, for example, when passing to the new coordinates exposes some favorable properties of the dynamics or allows one to obtain an explicit solution.

At this point it should not be surprising that the situation is different when working with SDEs. Indeed, if we start with an $n$-dimensional SDE
\[
dX_t = f(X_t)\, dt + g(X_t)\, dW_t \tag{3.3.50}
\]
driven by an $m$-dimensional Brownian motion and consider the process $Y_t = \varphi(X_t)$, then Itô's rule gives
\[
dY^i_t = \frac{\partial\varphi^i}{\partial x}(X_t)\, dX_t + \frac{1}{2}(dX_t)^{\mathsf T}\,\frac{\partial^2\varphi^i}{\partial x^2}(X_t)\, dX_t, \qquad i = 1, \dots, n, \tag{3.3.51}
\]
where $\frac{\partial\varphi^i}{\partial x} = \bigl(\frac{\partial\varphi^i}{\partial x^1}, \dots, \frac{\partial\varphi^i}{\partial x^n}\bigr)$ is the row vector of the first-order partial derivatives of the $i$th coordinate of $y = \varphi(x)$, while $\frac{\partial^2\varphi^i}{\partial x^2} = \bigl(\frac{\partial^2\varphi^i}{\partial x^j\partial x^k}\bigr)_{1\le j,k\le n}$ is the $n \times n$ matrix of the second-order partial derivatives of $\varphi^i$. As before, only the first term, involving $\frac{\partial\varphi^i}{\partial x}$, matches the chain rule of ordinary calculus; we also have the other term involving second-order derivatives of $\varphi$.

Now, it is possible to define stochastic integrals in a manner that preserves the rules of ordinary calculus. This was done independently by Stratonovich and by Fisk. A nice way to introduce the Stratonovich integral, different from the original construction, is to see what needs to be modified in Itô's definition in order to get back the rules of ordinary calculus. (We will also see that this convenience comes at a price.) To keep things simple, we will consider the one-dimensional case. Let $\varphi : \mathbb{R} \to \mathbb{R}$ be a $C^1$ function; then Itô's rule gives
\[
\int_0^t \varphi(W_s)\, dW_s = \Phi(W_t) - \Phi(W_0) - \frac{1}{2}\int_0^t \varphi'(W_s)\, ds, \tag{3.3.52}
\]
where $\Phi(x) := \int_0^x \varphi(y)\, dy$ and the integral on the left-hand side is the Itô integral. Now, if we define another type of stochastic integral, namely
\[
\int_0^t \varphi(W_s) \circ dW_s := \int_0^t \varphi(W_s)\, dW_s + \frac{1}{2}\int_0^t \varphi'(W_s)\, ds, \tag{3.3.53}
\]
then the fundamental theorem of ordinary calculus holds by construction:
\[
\int_0^t \varphi(W_s) \circ dW_s = \Phi(W_t) - \Phi(W_0). \tag{3.3.54}
\]

This procedure may look somewhat contrived, but we can generalize it using Itô's product rule. To that end, let $X_t$ and $\tilde{X}_t$ be two Itô processes. Then the product rule tells us that
\[
X_t\tilde{X}_t - X_0\tilde{X}_0 = \int_0^t X_s\, d\tilde{X}_s + \int_0^t \tilde{X}_s\, dX_s + \int_0^t d[X, \tilde{X}]_s, \tag{3.3.55}
\]
where $[X, \tilde{X}]_t$ is the joint quadratic variation of $X_t$ and $\tilde{X}_t$. If we split the joint quadratic variation term equally between the first two Itô integrals on the right-hand side, then we can define
\[
\int_0^t X_s \circ d\tilde{X}_s := \int_0^t X_s\, d\tilde{X}_s + \frac{1}{2}\int_0^t d[X, \tilde{X}]_s, \tag{3.3.56}
\]
which gives us something that looks like the product rule of ordinary calculus:
\[
\int_0^t X_s \circ d\tilde{X}_s + \int_0^t \tilde{X}_s \circ dX_s = X_t\tilde{X}_t - X_0\tilde{X}_0. \tag{3.3.57}
\]
We can then take (3.3.56) as the definition of the Stratonovich integral of $X$ w.r.t. $\tilde{X}$, and it is easy to verify that (3.3.53) is indeed a special case of (3.3.56):
\begin{align*}
d[\varphi(W), W]_t &= d\varphi(W_t)\, dW_t \tag{3.3.58}\\
&= \left(\varphi'(W_t)\, dW_t + \frac{1}{2}\varphi''(W_t)\, dt\right)dW_t \tag{3.3.59}\\
&= \varphi'(W_t)\, dt. \tag{3.3.60}
\end{align*}
We can extend the above definition to $C^1$ functions of Itô processes; e.g., if $h$ is such a function, then we set
\[
\int_0^t h(X_s) \circ d\tilde{X}_s := \int_0^t h(X_s)\, d\tilde{X}_s + \frac{1}{2}\int_0^t h'(X_s)\, d[X, \tilde{X}]_s \tag{3.3.61}
\]
(this definition coincides with (3.3.56) if $h(X_t)$ happens to be an Itô process). Thus, in particular, if $\varphi$ is $C^2$ and if we take $\tilde{X} = X$, then
\begin{align*}
\int_0^t \varphi'(X_s) \circ dX_s &= \int_0^t \varphi'(X_s)\, dX_s + \frac{1}{2}\int_0^t \varphi''(X_s)\, d[X, X]_s \tag{3.3.62}\\
&= \varphi(X_t) - \varphi(X_0), \tag{3.3.63}
\end{align*}
since, by Itô's rule, the quantity on the right-hand side is the integral from 0 to $t$ of the total Itô differential $d\varphi(X_s)$.
Remark 4. The original definition of Fisk and Stratonovich for the integral $\int_0^t h(X_s) \circ dW_s$ was as the limit in probability of sums of the form
\[
\sum_{i=0}^{n-1} h\Bigl(\frac{1}{2}(X_{t_{i+1}} + X_{t_i})\Bigr)(W_{t_{i+1}} - W_{t_i}) \tag{3.3.64}
\]
for any sequence of partitions of $[0, t]$ that get increasingly finer as $n \to \infty$. In particular, if $X_t = W_t$ and $h(x) = x$, we have
\[
\frac{1}{2}(W_{t_{i+1}} + W_{t_i})(W_{t_{i+1}} - W_{t_i}) = \frac{1}{2}(W_{t_{i+1}}^2 - W_{t_i}^2), \tag{3.3.65}
\]
so the sum in (3.3.64) telescopes to $\frac{1}{2}W_t^2 - \frac{1}{2}W_0^2$.
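Numerically, the midpoint sums (3.3.64) and the left-endpoint sums defining the Itô integral settle on limits that differ by exactly $\frac{1}{2}[W]_t = \frac{t}{2}$. A sketch (path resolution and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
m = 10**5
dW = rng.standard_normal(m) * np.sqrt(1.0 / m)
W = np.concatenate(([0.0], np.cumsum(dW)))            # Brownian path on [0, 1]
ito = np.sum(W[:-1] * np.diff(W))                     # left endpoints (Ito)
strat = np.sum(0.5 * (W[:-1] + W[1:]) * np.diff(W))   # midpoints (Fisk-Stratonovich)
print(ito, 0.5 * (W[-1]**2 - 1.0))   # Ito limit: (W_1^2 - 1)/2
print(strat, 0.5 * W[-1]**2)         # Stratonovich limit: W_1^2/2 (exact, by telescoping)
```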
We can now write down Stratonovich SDEs. For example, let $X_t$ be a process described by a scalar Itô SDE
\[
dX_t = f(X_t)\, dt + g(X_t)\, dW_t, \tag{3.3.66}
\]
where $f$ and $g$ satisfy the regularity conditions of Theorem 9, and $g$ is also $C^1$. Then it can also be described by the Stratonovich SDE
\[
dX_t = \hat{f}(X_t)\, dt + g(X_t) \circ dW_t, \tag{3.3.67}
\]
where the drift $\hat{f}$ is related to $f$ and $g$ through
\[
\hat{f}(x) = f(x) - \frac{1}{2}g(x)g'(x) \tag{3.3.68}
\]
(the extra term subtracted from $f$ is usually called the Itô correction). This is easy to show: integrating (3.3.66) and using (3.3.61), we have
\begin{align*}
X_t &= X_0 + \int_0^t f(X_s)\, ds + \int_0^t g(X_s)\, dW_s \tag{3.3.69}\\
&= X_0 + \int_0^t f(X_s)\, ds + \int_0^t g(X_s) \circ dW_s - \frac{1}{2}\int_0^t g'(X_s)\, d[X, W]_s \tag{3.3.70}\\
&= X_0 + \int_0^t \left(f(X_s) - \frac{1}{2}g(X_s)g'(X_s)\right)ds + \int_0^t g(X_s) \circ dW_s \tag{3.3.71}\\
&= X_0 + \int_0^t \hat{f}(X_s)\, ds + \int_0^t g(X_s) \circ dW_s. \tag{3.3.72}
\end{align*}

The main benefit of working with Stratonovich SDEs is that they respect the chain rule of ordinary calculus. Indeed, if $\varphi$ is a $C^2$ function, then $Y_t = \varphi(X_t)$ satisfies
\begin{align*}
dY_t &= \left(\varphi'(X_t)f(X_t) + \frac{1}{2}\varphi''(X_t)g^2(X_t)\right)dt + \varphi'(X_t)g(X_t)\, dW_t \tag{3.3.73}\\
&= \left(\varphi'(X_t)f(X_t) + \frac{1}{2}\varphi''(X_t)g^2(X_t)\right)dt + \varphi'(X_t)g(X_t) \circ dW_t - \frac{1}{2}\bigl(\varphi'(X_t)g(X_t)\bigr)'g(X_t)\, dt \tag{3.3.74}\\
&= \varphi'(X_t)\bigl(\hat{f}(X_t)\, dt + g(X_t) \circ dW_t\bigr), \tag{3.3.75}
\end{align*}
where we have first used Itô's rule, then converted Itô differentials to Stratonovich differentials, and then collected terms to get the final expression. The main downside is that, unlike Itô integrals, Stratonovich integrals are not martingales.
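The correction (3.3.68) also shows up in numerical schemes: the Euler–Maruyama method converges to the Itô solution, while the Heun-type (midpoint) method converges to the Stratonovich solution. A sketch for $f = 0$, $g(x) = x$ (all numerical parameters are illustrative assumptions): the Itô solution $e^{W_t - t/2}$ has mean $1$, while the Stratonovich solution $e^{W_t}$ has mean $e^{t/2}$.

```python
import numpy as np

rng = np.random.default_rng(7)
T, m, n = 1.0, 400, 20000          # horizon, steps per path, number of paths
dt = T / m
x_ito = np.ones(n)                 # Euler-Maruyama -> Ito solution of dX = X dW
x_str = np.ones(n)                 # Heun (midpoint) -> Stratonovich solution of dX = X o dW
for _ in range(m):
    dW = rng.standard_normal(n) * np.sqrt(dt)
    x_ito = x_ito + x_ito * dW
    pred = x_str + x_str * dW                    # predictor
    x_str = x_str + 0.5 * (x_str + pred) * dW    # corrector: midpoint evaluation of g
print(x_ito.mean(), 1.0)                 # Ito:          E[X_1] = 1
print(x_str.mean(), np.exp(0.5))         # Stratonovich: E[X_1] = e^{1/2}
```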

3.4 SDEs and models of engineering systems with random disturbances

Our digression on the stochastic calculus of Stratonovich was motivated by the failure of the familiar change-of-variables formula when working with Itô differentials. On the one hand, this should not come as a surprise: one cannot interpret Itô SDEs as differential equations in the usual sense, not least because their solutions (being diffusion processes) do not have differentiable sample paths, unlike the solutions of ODEs where differentiability is expected more or less by design. On the other hand, the way we have arrived at Itô SDEs was via Doob's solution of the realization problem for diffusion processes specified through their transition probability functions. Thus, as far as the questions of describing the temporal evolution of means, covariances, and other expected values are concerned, the stochastic calculus of Itô is a convenient framework, especially given the various useful properties of the Itô integral.

On the other hand, again by appealing to the results of Doob and Itô, our internalist models of diffusion processes make fundamental use of Brownian motion, which is a mathematical idealization of various physical noise processes. Thus, one should think of diffusion processes as themselves idealizations of the state trajectories of physical or engineering systems subject to random excitations or disturbances. This was the original motivation of the early phenomenological work of Pontryagin, Andronov, and Vitt, and it was the motivation of the work of Stratonovich as well. In particular, realistic models of stochastic systems involve disturbance processes whose correlation times are much shorter than the characteristic time constants of the signals generated by the system. In such instances, we may hope that the signals can be described by solutions of ODEs of the form
\[
\frac{d}{dt}X_t = f(X_t) + g(X_t)\xi_t, \tag{3.4.1}
\]
where $\xi_t$ is a "physical" (e.g., wideband) noise process. If this process is sufficiently well-behaved, then it should be possible to obtain a solution of (3.4.1) using ordinary calculus. Such a solution will be (at least once) differentiable as a function of time, which is consistent with empirical observations of many types of physical and engineering systems. Of course, the catch is that the corresponding process will generally not be Markov due to the disturbance having nonzero memory. Thus, it is reasonable to consider a family of disturbance processes parametrized by something like a "maximum frequency" $\alpha > 0$ such that the processes become more and more "white-noise-like" as $\alpha \to \infty$, and then ask whether the solutions $X^{\alpha}_t$ of the corresponding family of ODEs
\[
\frac{d}{dt}X^{\alpha}_t = f(X^{\alpha}_t) + g(X^{\alpha}_t)\xi^{\alpha}_t \tag{3.4.2}
\]
converge in some sense to a diffusion process, i.e., a solution of an SDE. We would also like to know the drift and the diffusion matrix of this process and in what sense (Itô or Stratonovich) we should interpret this SDE.

3.4.1 The Wong–Zakai theorem

We start by giving one such result, which is relatively easy to state and to prove (at least in the scalar setting). Thus, consider the family of ODEs (3.4.2) on the finite time interval $[0, T]$, all having a common initial condition $X_0$. We assume that, for each value of $\alpha$, the disturbance process $(\xi^{\alpha}_t)_{t\in[0,T]}$ is sufficiently regular to guarantee that (3.4.2) has a unique solution in the usual sense. We also assume that, as $\alpha \to \infty$, the disturbance processes $(\xi^{\alpha}_t)$ become better and better approximations to white noise in the sense that there exists a Brownian motion $(W_t)_{t\in[0,T]}$, such that
\[
\sup_{t\in[0,T]} |W^{\alpha}_t - W_t| \xrightarrow{\ \alpha\to\infty\ } 0 \quad \text{a.s.}, \qquad W^{\alpha}_t := \int_0^t \xi^{\alpha}_s\, ds. \tag{3.4.3}
\]
Finally, we make the following assumptions on the functions $f$ and $g$:

1. $f$ and $g$ are both Lipschitz continuous and bounded;

2. $g$ is $C^1$, and the function $x \mapsto g(x)g'(x)$ is Lipschitz continuous (the prime denotes differentiation w.r.t. $x$);

3. there exists a constant $c > 0$, such that $g(x) \ge c$ for all $x$.
Theorem 10 (Wong–Zakai). Under the above assumptions, the solutions $X^{\alpha}_t$ of the ODEs (3.4.2) converge, as $\alpha \to \infty$, a.s. to the unique solution $X_t$ of the Stratonovich SDE
\[
dX_t = f(X_t)\, dt + g(X_t) \circ dW_t \tag{3.4.4}
\]
with the same initial condition, i.e.,
\[
\lim_{\alpha\to\infty}\sup_{t\in[0,T]} |X^{\alpha}_t - X_t| = 0 \quad \text{a.s.} \tag{3.4.5}
\]

Remark 5. It is important to keep in mind that, when we express (3.4.4) in the Itô form, the drift will pick up the Itô correction term:
\[
dX_t = \left(f(X_t) + \frac{1}{2}g(X_t)g'(X_t)\right)dt + g(X_t)\, dW_t. \tag{3.4.6}
\]
When $g$ is constant, i.e., $g(x) = \sigma$ for some $\sigma > 0$, the Stratonovich and the Itô SDEs coincide.
Proof. First, it is easy to verify that, under our assumptions on $f$ and $g$, the SDE (3.4.4) has a unique solution $X_t$. Define now the function
\[
\varphi(x) := \int_0^x \frac{1}{g(z)}\, dz, \tag{3.4.7}
\]
so that $\varphi'(x) = \frac{1}{g(x)}$ and $\varphi''(x) = -\frac{g'(x)}{g^2(x)}$. Then, for $Y^{\alpha}_t := \varphi(X^{\alpha}_t)$ and $Y_t := \varphi(X_t)$, we have
\[
\frac{d}{dt}Y^{\alpha}_t = \frac{f(X^{\alpha}_t)}{g(X^{\alpha}_t)} + \xi^{\alpha}_t \tag{3.4.8}
\]
and
\[
dY_t = \frac{f(X_t)}{g(X_t)}\, dt + dW_t, \tag{3.4.9}
\]
where (3.4.8) follows from the ordinary chain rule applied to the solution of the ODE (3.4.2), while (3.4.9) follows from Itô's rule. Integrating (3.4.8) and (3.4.9) and using the fact that $X^{\alpha}_0 = X_0$ for all $\alpha$, we get
\[
Y^{\alpha}_t - Y_t = \int_0^t \left(\frac{f(X^{\alpha}_s)}{g(X^{\alpha}_s)} - \frac{f(X_s)}{g(X_s)}\right)ds + W^{\alpha}_t - W_t. \tag{3.4.10}
\]
From our assumptions on $f$ and $g$ it easily follows that
\[
|\varphi(x) - \varphi(z)| = \left|\int_z^x \frac{1}{g(v)}\, dv\right| \ge \frac{1}{C_1}|x - z| \tag{3.4.11}
\]
and
\[
\left|\frac{f(x)}{g(x)} - \frac{f(z)}{g(z)}\right| \le \frac{|f(x) - f(z)|}{g(x)} + |f(z)|\,\frac{|g(x) - g(z)|}{g(x)g(z)} \le C_2|x - z| \tag{3.4.12}
\]
for some $C_1, C_2 > 0$. Using these inequalities together with (3.4.10), we obtain
\[
|X^{\alpha}_t - X_t| \le C_1C_2\int_0^t |X^{\alpha}_s - X_s|\, ds + C_1\sup_{0\le t\le T} |W^{\alpha}_t - W_t|, \tag{3.4.13}
\]
whence, by Gronwall's lemma, it follows that
\[
\sup_{0\le t\le T} |X^{\alpha}_t - X_t| \le C_1e^{C_1C_2T}\sup_{0\le t\le T} |W^{\alpha}_t - W_t|. \tag{3.4.14}
\]
Taking the limit of both sides as $\alpha \to \infty$, we obtain the claimed result.
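The theorem can also be illustrated numerically: drive the ODE $\dot{x} = x\,\xi^{\alpha}$ by the slope of a piecewise-linear interpolation of a Brownian path on a coarse grid, solve it with an ordinary ODE method, and compare with the Stratonovich solution of $dX = X \circ dW$, which is $e^{W_t}$ for $X_0 = 1$. This is only a sketch (the grid sizes, seed, and the choice $f = 0$, $g(x) = x$ are illustrative, and this $g$ is not bounded as the theorem assumes):

```python
import numpy as np

rng = np.random.default_rng(8)
T, coarse, refine, n = 1.0, 64, 50, 20000
dW = rng.standard_normal((n, coarse)) * np.sqrt(T / coarse)   # coarse Brownian increments
slope = dW / (T / coarse)        # xi^alpha: slope of the piecewise-linear interpolation
x = np.ones(n)
h = T / (coarse * refine)        # fine step for the ordinary (pathwise) ODE solver
for k in range(coarse):
    for _ in range(refine):      # forward Euler on dx/dt = x * xi^alpha
        x = x + x * slope[:, k] * h
W_T = dW.sum(axis=1)
print(x.mean(), np.exp(0.5))              # Stratonovich mean E[e^{W_1}] = e^{1/2}
print(np.mean(np.abs(x - np.exp(W_T))))   # paths track e^{W_T}, not e^{W_T - T/2}
```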

3.4.2 Other models of physical noise processes

The theorem of Wong and Zakai is only one particular result among many more general results of this nature. For example, one can consider a wide class of noise processes $\xi_t$ in (3.4.1) that are obtained from Brownian motion by linear filtering, i.e., there exist a vector Brownian motion $(W_t)$ and a deterministic family of matrices $h(t, s)$ of appropriate shape, such that
\[
\xi_t = \int_0^t h(t, s)\, dW_s. \tag{3.4.15}
\]
Each such process is evidently Gaussian, and many types of random phenomena in engineering systems can be modeled in this way. Here are a few examples:

1. Approximately white stationary noise. Let $\xi_t$ be a scalar Gaussian stationary process whose power spectral density $S(\omega)$ satisfies the so-called Paley–Wiener condition
\[
\int_{-\infty}^{\infty} \frac{|\log S(\omega)|}{1 + \omega^2}\, d\omega < \infty. \tag{3.4.16}
\]
Then there exist a scalar Brownian motion $(W_t)_{-\infty<t<\infty}$ and a deterministic function $h(t)$, such that $\xi_t$ can be represented as
\[
\xi_t = \int_{-\infty}^t h(t - s)\, dW_s. \tag{3.4.17}
\]
In the above representation, $\xi_t$ depends on the entire past $(W_s)_{-\infty<s\le t}$ of the Brownian motion. However, if $\xi_t$ has a very short correlation time, then $h(t - s)$ will decay rapidly as $t - s \to \infty$, and in that case we can approximate $\xi_t$ by cutting off the integration in (3.4.17) at $s = 0$.

2. The Ornstein–Uhlenbeck process. The diffusion process that solves the linear Itô SDE
\[
d\xi_t = F\xi_t\, dt + G\, dW_t, \qquad \xi_0 = 0, \tag{3.4.18}
\]
has the explicit representation
\[
\xi_t = \int_0^t e^{(t-s)F}G\, dW_s, \tag{3.4.19}
\]
as shown in Example 5.

3. Piecewise constant processes. Consider the process $\xi_t$ which is constant over intervals of length $\delta > 0$, and the values that it takes in nonoverlapping intervals are independent zero-mean Gaussian random vectors. Such a process can be represented in the form (3.4.15) with
\[
h(t, s) = \begin{cases} \frac{1}{\delta}I, & (n - 1)\delta \le s \le n\delta \le t < (n + 1)\delta, \quad n = 1, 2, \dots \\ 0, & \text{otherwise.} \end{cases} \tag{3.4.20}
\]

One can construct other examples of this type. The bottom line is that many types of noise processes $\xi_t$ can be represented in the form (3.4.15), where the matrices $h(t, s)$ are such that $\xi_t$ has a.s. piecewise continuous sample paths.

Now, just as we had done to prove the Wong–Zakai theorem, we can consider a family $(\xi^{\alpha}_t)$ of such processes, where $\alpha > 0$ is some sort of an "effective maximum frequency" parameter, and we assume that there exist constant matrices $A, B$ and a Brownian motion $(W_t)$, such that:
• for all $0 < t \le T$,
\[
\lim_{\alpha\to\infty}\frac{1}{t}\int_0^t\int_0^s \mathbb{E}[\xi^{\alpha}_s(\xi^{\alpha}_r)^{\mathsf T}]\, dr\, ds = A; \tag{3.4.21}
\]

• for all $0 < t \le T$,
\[
\lim_{\alpha\to\infty}\mathbb{E}\left|\int_0^t \xi^{\alpha}_s\, ds - BW_t\right|^2 = 0 \tag{3.4.22}
\]

(the matrices $A$ and $B$ are then related by $BB^{\mathsf T} = A + A^{\mathsf T}$).

Then, under some additional technical conditions, it can be shown that the solutions $X^{\alpha}_t$ of (3.4.2) for $0 \le t \le T$ with zero initial condition $X^{\alpha}_0 = 0$ converge as $\alpha \to \infty$ to the solution of the Stratonovich SDE
\[
dX_t = \hat{f}(X_t)\, dt + \hat{g}(X_t) \circ dW_t, \qquad X_0 = 0, \tag{3.4.23}
\]
with
\[
\hat{f}^i(x) = f^i(x) + \frac{1}{2}\sum_{j,k,\ell} g^{jk}(x)\,\frac{\partial}{\partial x^j}g^{ik}(x)\,(A^{k\ell} - A^{\ell k}) \tag{3.4.24}
\]
and
\[
\hat{g}^{ij}(x) = g^{ij}(x)B^{ij}. \tag{3.4.25}
\]
The convergence is in the mean-square sense, with $\mathbb{E}|X^{\alpha}_t - X_t|^2 = O(\alpha^{-1})$.

3.5 Problems

1. Let $(W_t)_{t\ge 0}$ be a standard Brownian motion adapted to a filtration $(\mathcal{F}_t)_{t\ge 0}$. Prove that $M_t = \exp\bigl(\mu W_t - \frac{\mu^2}{2}t\bigr)$ is an $\mathcal{F}_t$-martingale for any value of $\mu$.

2. Give a direct proof that $X_t = \exp\bigl(W_t - \frac{t}{2}\bigr)$ is the unique solution of the Itô SDE $dX_t = X_t\, dW_t$ with the initial condition $X_0 = 1$.
Hint: Use Itô's product rule.

3. Consider the Ornstein–Uhlenbeck SDE $dX_t = FX_t\, dt + G\, dW_t$, as in Example 5. For an initial condition $X_0$ independent of the Brownian motion $W_t$, its solution is given by
\[
X_t = e^{tF}X_0 + \int_0^t e^{(t-s)F}G\, dW_s. \tag{3.5.1}
\]
(i) Suppose that $X_0$ is a Gaussian random vector. Prove that $X_t$ is a Gaussian process and find its mean $\mu(t) := \mathbb{E}[X_t]$ and covariance matrix $K(s, t) := \mathbb{E}[(X_s - \mu(s))(X_t - \mu(t))^{\mathsf T}]$.

(ii) Suppose that $F$ is a Hurwitz matrix, i.e., all of its eigenvalues have negative real parts. Find the asymptotic mean $\bar{\mu} := \lim_{t\to\infty}\mu(t)$ and the asymptotic covariance matrix $\bar{K} := \lim_{t\to\infty}K(t, t)$.

4. Solve the one-dimensional Itô SDE
\[
dX_t = (X_t^3 - X_t)\, dt + (1 - X_t^2)\, dW_t \tag{3.5.2}
\]
with the initial condition $X_0 = 0$.
Hint: Try a solution of the form $X_t = \varphi(W_t)$ for a $C^2$ function $\varphi$ and use Itô's rule. Problem 6 in Section 2.4 may be useful as well.

5. Use Itô calculus to prove Theorem 2.

6. According to the Nyquist–Johnson model of thermal noise, the current $I_t$ and the voltage $V_t$ in a resistor with conductance $g$ (defined as the reciprocal of resistance) are related via the formula
\[
I_t = gV_t + \sqrt{2kgT}\,\xi_t, \tag{3.5.3}
\]
where $\xi_t$ is a white noise process (the formal derivative of the standard one-dimensional Brownian motion), $k$ is Boltzmann's constant, and $T$ is the absolute temperature. Thus, a Nyquist–Johnson resistor is modeled by an ideal (noiseless) resistor in series with a white-noise voltage source (or in parallel with a white-noise current source). A circuit consisting of a Nyquist–Johnson resistor in series with a linear capacitor with (possibly time-varying) capacitance $c(t)$ can be described by an Itô SDE
\[
d(c(t)V_t) = -gV_t\, dt + \sqrt{2kgT}\, dW_t, \qquad t \ge 0, \tag{3.5.4}
\]
where $V_t$ is the voltage across the capacitor, and we assume that $\mathbb{E}[V_0^2] < \infty$.

(i) Consider first the case when the capacitance is constant, i.e., $c(t) \equiv c > 0$ for all $t$, and find the steady-state energy stored in the capacitor,
\[
\bar{e} := \lim_{t\to\infty}\frac{1}{2}\mathbb{E}[cV_t^2]. \tag{3.5.5}
\]
(ii) Now suppose that the capacitor is time-varying, and its capacitance $c(t) > 0$ is evolving deterministically according to the ODE $\frac{d}{dt}c(t) = u(t)$, where $u(t)$ is a given function of time (we can think of it as a control input). Write down an Itô SDE for the voltage $V_t$ across the capacitor.

(iii) Based on the result of part (ii), write down an ODE for $e(t) := \frac{1}{2}\mathbb{E}[c(t)V_t^2]$, the average energy stored in the capacitor at time $t$, and solve it for $e(t)$.
Hint: You should end up with an ODE of the form
\[
\frac{d}{dt}e(t) = -a(t)e(t) + b(t),
\]
where $a(\cdot)$ and $b(\cdot)$ are some fixed functions of $t$; you can then use the method of integrating factors.

(iv) Consider now the case when $c(0) = c > 0$ and $u(t) = \lambda$ for some $\lambda > 0$. Find an explicit expression for $e(t)$ from part (iii) and for its steady-state value $\bar{e} := \lim_{t\to\infty}e(t)$. What happens to $\bar{e}$ as $\lambda \to 0$?

3.6 Notes and further reading
A detailed exposition of the system modeling philosophy involving externalist and internalist
descriptions can be found in [Wil89], and [Pic91] provides a discussion of stochastic realization theory.
The derivation of the martingale representation of a diffusion process in Section 3.1 is based on
Chapter 11 of [Nel67], which itself was based on the book of Doob [Doo53]. The (largely informal)
presentation of martingale theory and related concepts was inspired by the survey article of Wong
[Won73].
Interesting comparative discussions of Itô vs. Stratonovich integrals can be found in [Mor69]
(from an engineering perspective) and [Kam81] (from a physics perspective).
The Wong–Zakai theorem has its origins in [WZ65a, WZ65b]. The discussion of physical noise
processes obtained by linear filtering of Brownian motion is loosely based on [Cla66].

Chapter 4

Stochastic calculus in path space

So far, our probabilistic analysis of diffusion processes has been largely local: given a diffusion process $(X_t)$, we were able to write down expressions for expectations $\mathbb{E}[\varphi(X_t) \mid X_s = x]$ in terms of the drift and the diffusion coefficient of the process. However, early on we emphasized the fact that even the most basic diffusion process, the Brownian motion, has continuous sample paths. This is a global property of the entire path (although continuity is defined locally, in a neighborhood of every time $t$). In other words, when we talk about continuity of sample paths, we are viewing the process as an element of a function space. For instance, if $(W_t)_{t\in[0,T]}$ is a standard $d$-dimensional Brownian motion, then its path is a random element of the space $C([0, T]; \mathbb{R}^d)$ of continuous functions from $[0, T]$ into $\mathbb{R}^d$, and we can talk about expectations of real-valued functionals of the entire path, for example
\[
\mathbb{E}\left[\sup_{t\in[0,T]} |W_t|\right] \quad \text{or} \quad \mathbb{E}\left[\int_0^T \mathbf{1}_{\{W_t\ge 0\}}\, dt\right].
\]
In order to do this properly, we need to specify an appropriate sample space $\Omega$ and a $\sigma$-algebra of events on it. We have already selected $\Omega$ as $C([0, T]; \mathbb{R}^d)$, and to define the $\sigma$-algebra we will make use of the fact that $\Omega = C([0, T]; \mathbb{R}^d)$ comes equipped with a distance: if $\omega = (\omega_t)_{t\in[0,T]}$ and $\tilde{\omega} = (\tilde{\omega}_t)_{t\in[0,T]}$ are two continuous functions of $t \in [0, T]$ with values in $\mathbb{R}^d$, then we take
\[
d(\omega, \tilde{\omega}) := \sup_{t\in[0,T]}\max_{1\le i\le d} |\omega^i_t - \tilde{\omega}^i_t|,
\]
where $\omega^i_t$ is the $i$th coordinate of $\omega_t \in \mathbb{R}^d$. We then let $\mathcal{B}$ denote the smallest $\sigma$-algebra containing all sets of the form $\{\omega \in \Omega : d(\omega, \nu) < r\}$ for all $\nu \in \Omega$ and all $r > 0$. This is called the Borel $\sigma$-algebra, in analogy with the Borel $\sigma$-algebra on $\mathbb{R}^d$, which is the smallest $\sigma$-algebra containing all open balls in $\mathbb{R}^d$. There is also an alternative characterization of $\mathcal{B}$: it is the smallest $\sigma$-algebra on $\Omega$ containing all sets of the form
\[
C = \bigl\{\omega \in \Omega : \omega_{t_i} \in A_i,\ i = 1, \dots, n\bigr\}, \tag{4.0.1}
\]
for all $n$, all choices of times $0 < t_1 < \dots < t_n \le T$, and all Borel sets $A_i \subseteq \mathbb{R}^d$. Sets of this form are referred to as cylinder sets, and they usually have nice operational interpretations: suppose, for example, that $\omega_t = (\omega^1_t, \dots, \omega^d_t)$ is a collection of voltage waveforms measured at $d$ nodes of
an electric circuit. Then the set $C$ in (4.0.1) describes the situation when the vectors of voltages measured at times $t_1, \dots, t_n$ fall within some given sets $A_1, \dots, A_n \subset \mathbb{R}^d$.

With these definitions, an $\mathbb{R}^d$-valued random process $(X_t)_{t\in[0,T]}$ with continuous sample paths defined on an arbitrary probability space $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P})$ can be viewed as a random element of $\Omega$. To do this, we first define a mapping $X : \tilde{\Omega} \to C([0, T]; \mathbb{R}^d)$, such that $\omega := X(\tilde{\omega})$ maps each $t \in [0, T]$ to $X_t(\tilde{\omega})$. This then induces a probability measure $P$ on $(\Omega, \mathcal{B})$ by $P(A) := \tilde{P}(X^{-1}(A))$ for all $A \in \mathcal{B}$. Alternatively, we can introduce probability measures directly on $(\Omega, \mathcal{B})$. In particular, we can let $P$ denote the probability measure on $(\Omega, \mathcal{B})$ that corresponds to the standard $d$-dimensional Brownian motion; thus, for the set $C$ defined in (4.0.1), we will have
\[
P(C) = \int_{A_1\times\cdots\times A_n}\prod_{i=1}^n \frac{1}{(2\pi(t_i - t_{i-1}))^{d/2}}\exp\left(-\frac{|x_i - x_{i-1}|^2}{2(t_i - t_{i-1})}\right)dx_1\dots dx_n, \tag{4.0.2}
\]
where we have taken $(x_0, t_0) = (0, 0)$. It can also be shown that any measurable function $f : \Omega \to \mathbb{R}$ (i.e., $f$ is such that all the sets of the form $\{\omega \in \Omega : f(\omega) < a\}$ for $a \in \mathbb{R}$ belong to $\mathcal{B}$) can be approximated arbitrarily closely by sums of indicator functions of cylinder sets of the form (4.0.1). We will also say things like "$(W_t)_{t\in[0,T]}$ is a $P$-Brownian motion" to indicate this, with the understanding that changing the probability measure from $P$ to some other $Q$ will also change the probability law of the process.

4.1 Cameron–Martin–Girsanov theory

With these preliminaries out of the way, let us consider the following problem (in the scalar case, to keep things simple): we have a diffusion process $(X_t)_{t\in[0,T]}$, which, as a random element of $\Omega = C([0, T])$, has probability law $Q$. Suppose that $Q$ is absolutely continuous w.r.t. $P$, the probability law of the Brownian motion. This means that, for any $A \in \mathcal{B}$ such that $P(A) = 0$, we also have $Q(A) = 0$. A good example of such a process would be an Itô process of the form
\[
X_t = \int_0^t F_s\, ds + W_t, \tag{4.1.1}
\]
where $(W_t)$ is a $P$-Brownian motion and $(F_t)$ is an adapted process (and, in fact, this is the only possibility). Then, by absolute continuity, there exists some nonnegative function $g : \Omega \to \mathbb{R}_+$ (called the Radon–Nikodym derivative of $Q$ w.r.t. $P$ and often denoted by $\frac{dQ}{dP}$) such that
\[
\mathbb{E}_Q[\mathbf{1}_A(X)] = \mathbb{E}_P[\mathbf{1}_A(W)g(W)], \qquad A \in \mathcal{B}. \tag{4.1.2}
\]
A representation like this has obvious benefits, one being that the computation of expectations w.r.t. $P$ is (presumably) easier than working with $Q$. Of course, this hinges on the availability of an explicit expression for $g$. This is precisely the content of the so-called Cameron–Martin–Girsanov theory (or just Girsanov theory), although its reach goes far beyond Itô processes.

4.1.1 Motivation: importance sampling

Let's consider, by way of motivation, a very simple one-dimensional example. Let $W$ be a standard Gaussian random variable, i.e., $W \sim N(0, 1)$, and let $X = \mu + W$ for some $\mu \in \mathbb{R}$. Then, for any well-behaved (say, bounded) function $f$, we have
\[
\mathbb{E}[f(W)] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(x)e^{-\frac{1}{2}x^2}\, dx. \tag{4.1.3}
\]
Expanding the square $x^2 = ((x - \mu) + \mu)^2$ and rearranging, we can express this in another way:
\begin{align*}
\mathbb{E}[f(W)] &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(x)e^{-\mu x + \frac{1}{2}\mu^2}e^{-\frac{1}{2}(x - \mu)^2}\, dx \tag{4.1.4}\\
&= \mathbb{E}\bigl[f(X)e^{-\mu X + \frac{1}{2}\mu^2}\bigr], \tag{4.1.5}
\end{align*}
so we can "shift the mean" by a simple exponential transformation. Stated in this way, this idea borders on triviality, but it has practical consequences. For example, consider the function $f(x) = \mathbf{1}_{\{x\ge 20\}}$. Then
\[
\mathbb{E}[f(W)] = \mathbb{P}[W \ge 20] \le e^{-(20)^2/2} \approx 1.4\times 10^{-87}. \tag{4.1.6}
\]
Suppose we wanted to simulate the event $\{W \ge 20\}$; the usual way is to generate a large number $n$ of i.i.d. samples from $N(0, 1)$ and to look at the fraction of times the threshold 20 was exceeded:
\[
\frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{W_i\ge 20\}} \equiv \frac{1}{n}\sum_{i=1}^n f(W_i). \tag{4.1.7}
\]
Of course we know that, by the law of large numbers, this fraction converges to its expected value $\mathbb{P}[W \ge 20]$, but a simple calculation shows that we would have to generate a huge number of samples from $N(0, 1)$ before seeing even one instance of the threshold being exceeded. On the other hand, for $X \sim N(20, 1)$, $\mathbb{P}[X \ge 20] = \frac{1}{2}$, so with high probability we will see the threshold exceeded roughly half the time. In that case, the approximation
\[
\mathbb{E}[f(W)] \approx \frac{1}{n}\sum_{i=1}^n g(X_i), \qquad g(x) := f(x)e^{-\mu x + \frac{1}{2}\mu^2}, \tag{4.1.8}
\]
with $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, 1)$ for values of $\mu$ close to 20, will already be good for a "reasonable" value of $n$.
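A concrete sketch of this importance-sampling scheme (the sample size, seed, and the Mills-ratio benchmark $\phi(20)/20$ are choices made here for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)
mu, n = 20.0, 10**5
x = mu + rng.standard_normal(n)            # i.i.d. samples from N(mu, 1)
weights = np.exp(-mu * x + 0.5 * mu**2)    # the exponential reweighting factor in g(x)
est = np.mean((x >= 20.0) * weights)       # importance-sampling estimate of P[W >= 20]
benchmark = np.exp(-200.0) / (20.0 * np.sqrt(2 * np.pi))   # Mills-ratio approximation
print(est, benchmark)                      # both are about 2.8e-89
```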

4.1.2 Removing a constant drift

We begin by considering the simplest case of Brownian motion with a constant drift, i.e., $X_t = W_t + \mu t$, $0 \le t \le T$, for some $\mu \in \mathbb{R}$. Suppose that we want to compute an expectation of the form $\mathbb{E}[\varphi(X_{t_1}, X_{t_2}, \dots, X_{t_n})]$, where the times $0 < t_1 < t_2 < \dots < t_n \le T$ are given. Let $q(x_1, \dots, x_n)$ denote the probability density of the random vector $(X_{t_1}, \dots, X_{t_n})$, so that
\[
\mathbb{E}[\varphi(X_{t_1}, X_{t_2}, \dots, X_{t_n})] = \int_{\mathbb{R}^n} \varphi(x_1, \dots, x_n)q(x_1, \dots, x_n)\, dx_1\dots dx_n. \tag{4.1.9}
\]
Now, from the form of $X_t$ we can write $q$ as follows:
\[
q(x_1, \dots, x_n) = C\prod_{i=1}^n \exp\left(-\frac{(x_i - x_{i-1} - \mu(t_i - t_{i-1}))^2}{2(t_i - t_{i-1})}\right), \tag{4.1.10}
\]
where we have defined, as before, $(x_0, t_0) = (0, 0)$ and where
\[
C = \prod_{i=1}^n \frac{1}{\sqrt{2\pi(t_i - t_{i-1})}}
\]
is a normalization constant. Since
\[
\frac{(x_i - x_{i-1} - \mu(t_i - t_{i-1}))^2}{2(t_i - t_{i-1})} = \frac{(x_i - x_{i-1})^2}{2(t_i - t_{i-1})} - \mu(x_i - x_{i-1}) + \frac{\mu^2(t_i - t_{i-1})}{2}, \tag{4.1.11}
\]
we can express $q$ differently as
\begin{align*}
q(x_1, \dots, x_n) &= p(x_1, \dots, x_n)\cdot\exp\left(\sum_{i=1}^n\left[\mu(x_i - x_{i-1}) - \frac{\mu^2(t_i - t_{i-1})}{2}\right]\right) \tag{4.1.12}\\
&= p(x_1, \dots, x_n)\exp\left(\mu x_n - \frac{\mu^2}{2}t_n\right), \tag{4.1.13}
\end{align*}
where $p(x_1, \dots, x_n) = C\prod_{i=1}^n \exp\bigl(-\frac{(x_i - x_{i-1})^2}{2(t_i - t_{i-1})}\bigr)$ is the probability density of $(W_{t_1}, \dots, W_{t_n})$. Hence, we can write
\[
\mathbb{E}[\varphi(X_{t_1}, \dots, X_{t_n})] = \mathbb{E}\bigl[\varphi(W_{t_1}, \dots, W_{t_n})M_{t_n}\bigr], \tag{4.1.14}
\]
where on the right-hand side the expectation is w.r.t. the probability law of the Brownian motion $(W_t)_{t\in[0,T]}$ and where the process $M_t := \exp\bigl(\mu W_t - \frac{\mu^2}{2}t\bigr)$ is a martingale w.r.t. the filtration $(\mathcal{F}_t)$ generated by the Brownian motion. The martingale property of $(M_t)$ turns out to be rather important: since $t_n \le T$, we have $\mathbb{E}[M_T \mid \mathcal{F}_{t_n}] = M_{t_n}$, and therefore, using the law of iterated expectations twice, we get
\begin{align*}
\mathbb{E}\bigl[\varphi(W_{t_1}, \dots, W_{t_n})M_{t_n}\bigr] &= \mathbb{E}[\varphi(W_{t_1}, \dots, W_{t_n})\mathbb{E}[M_T \mid \mathcal{F}_{t_n}]] \tag{4.1.15}\\
&= \mathbb{E}[\mathbb{E}[\varphi(W_{t_1}, \dots, W_{t_n})M_T \mid \mathcal{F}_{t_n}]] \tag{4.1.16}\\
&= \mathbb{E}\bigl[\varphi(W_{t_1}, \dots, W_{t_n})M_T\bigr]. \tag{4.1.17}
\end{align*}
Thus, we have obtained a remarkable change-of-measure formula:
\[
\mathbb{E}[\varphi(X_{t_1}, \dots, X_{t_n})] = \mathbb{E}[\varphi(W_{t_1}, \dots, W_{t_n})M_T], \tag{4.1.18}
\]
where on the left-hand side the expectation is w.r.t. $Q$, the probability law of $(X_t)_{t\in[0,T]}$, while on the right-hand side the expectation is w.r.t. $P$, the probability law of the Brownian motion $(W_t)_{t\in[0,T]}$. We can think of this as removing the drift from $X_t$; note that, on the right-hand side, the "observable" $\varphi(\cdot)$ is multiplied by the exponential martingale $M_T$, which depends neither on the choice of $\varphi$ nor on the choice of the times $t_1, \dots, t_n$. Hence, if we consider both processes as defined on the path space $(\Omega, \mathcal{B})$, then for any measurable function $f : \Omega \to \mathbb{R}$ of the entire continuous path $\omega = (\omega_t)_{t\in[0,T]}$, we can write
\[
\mathbb{E}_Q[f(X)] = \mathbb{E}_P[f(W)M_T]. \tag{4.1.19}
\]
Since $f$ is arbitrary, this gives us a relation between the two path-space measures $Q$ and $P$, one corresponding to Brownian motion with linear drift and one without: for any $A \in \mathcal{B}$,
\[
Q(A) = \mathbb{E}_Q[\mathbf{1}_A] = \mathbb{E}_P[\mathbf{1}_A M_T]. \tag{4.1.20}
\]
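Formula (4.1.18) is straightforward to test by Monte Carlo for an observable of the terminal value (a sketch; $\mu$, $T$, the test function, and the sample size are arbitrary choices): sample $X_T = W_T + \mu T$ directly, and separately reweight Brownian samples by $M_T$.

```python
import numpy as np

rng = np.random.default_rng(10)
mu, T, n = 1.0, 2.0, 10**6
phi = lambda x: np.maximum(x, 0.0)          # an arbitrary test observable
W_T = rng.standard_normal(n) * np.sqrt(T)   # W_T ~ N(0, T)
direct = phi(W_T + mu * T).mean()           # sample X_T = W_T + mu*T under Q directly
M_T = np.exp(mu * W_T - 0.5 * mu**2 * T)    # exponential martingale at time T
weighted = (phi(W_T) * M_T).mean()          # reweighted Brownian samples under P
print(direct, weighted)                     # the two estimates should agree
```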

4.1.3 A general Girsanov theorem

We start by introducing $(W_t)_{t\in[0,T]}$, a $d$-dimensional $P$-Brownian motion, and take $(\mathcal{F}_t)_{t\in[0,T]}$ to be the filtration generated by it.

Theorem 11. Consider an Itô process
\[
X_t = \int_0^t F_s\, ds + W_t, \qquad 0 \le t \le T, \tag{4.1.21}
\]
where the drift $F_t$, adapted to $\mathcal{F}_t$, is such that the process
\[
M_t := \exp\left(-\int_0^t F_s^{\mathsf T}\, dW_s - \frac{1}{2}\int_0^t |F_s|^2\, ds\right) \tag{4.1.22}
\]
satisfies $\mathbb{E}_P[M_T] = 1$. If we let $Q$ denote the path-space measure defined by $\mathbb{E}_Q[\,\cdot\,] = \mathbb{E}_P[\,\cdot\,M_T]$, then $(X_t)_{t\in[0,T]}$ is a $Q$-Brownian motion.

Remark 6. The condition $\mathbb{E}_P[M_T] = 1$ ensures, among other things, that $Q$ is indeed a probability measure on the path space $C([0, T]; \mathbb{R}^d)$.

Proof. We will assume $d = 1$ to keep things simple; the proof for $d > 1$ follows along the same lines. Using Itô's formula, we can write
\[
dX_t = F_t\, dt + dW_t \tag{4.1.23}
\]
and
\[
dM_t = -M_tF_t\, dW_t. \tag{4.1.24}
\]
The first of these follows from the structure of $X_t$ as an Itô process, while the second is readily verified by computing $d\log M_t$:
\begin{align*}
d\log M_t &= \frac{dM_t}{M_t} - \frac{1}{2M_t^2}\, d[M]_t \tag{4.1.25}\\
&= -\frac{1}{M_t}\cdot M_tF_t\, dW_t - \frac{1}{2M_t^2}\cdot |F_t|^2M_t^2\, dt \tag{4.1.26}\\
&= -F_t\, dW_t - \frac{1}{2}|F_t|^2\, dt. \tag{4.1.27}
\end{align*}
Then Itô's product rule gives
\begin{align*}
d(X_tM_t) &= X_t\, dM_t + M_t\, dX_t + dX_t\, dM_t \tag{4.1.28}\\
&= -X_tM_tF_t\, dW_t + M_t(F_t\, dt + dW_t) - M_tF_t\, dt \tag{4.1.29}\\
&= (1 - X_tF_t)M_t\, dW_t. \tag{4.1.30}
\end{align*}
Hence, $M_t$ and $X_tM_t$ are both $\mathcal{F}_t$-adapted Itô processes with zero drift, and are therefore both $P$-martingales (strictly speaking, we have only shown that they are local martingales, but it can be proved that, under our assumptions, they are actually martingales). By contrast, since $X_t$ has a nonzero drift, it is not a $P$-martingale. We will now
show that this implies that $X_t$ is a $Q$-martingale. To that end, it suffices to show that, for any $T \ge t \ge s \ge 0$ and any event $A \in \mathcal{F}_s$,
\[
\mathbb{E}_Q[X_t\mathbf{1}_A] = \mathbb{E}_Q[X_s\mathbf{1}_A], \tag{4.1.31}
\]
which is equivalent to $\mathbb{E}_Q[X_t \mid \mathcal{F}_s] = X_s$. Using the definition of $Q$, the law of iterated expectations, and the fact that both $M_t$ and $X_tM_t$ are $P$-martingales, we can write
\begin{align*}
\mathbb{E}_Q[X_t\mathbf{1}_A] &= \mathbb{E}_P[X_tM_T\mathbf{1}_A] \tag{4.1.32}\\
&= \mathbb{E}_P[\mathbb{E}_P[X_tM_T\mathbf{1}_A \mid \mathcal{F}_t]] \tag{4.1.33}\\
&= \mathbb{E}_P[X_t\mathbb{E}_P[M_T \mid \mathcal{F}_t]\mathbf{1}_A] \tag{4.1.34}\\
&= \mathbb{E}_P[X_tM_t\mathbf{1}_A] \tag{4.1.35}\\
&= \mathbb{E}_P[X_sM_s\mathbf{1}_A] \tag{4.1.36}\\
&= \mathbb{E}_P[X_s\mathbb{E}_P[M_T \mid \mathcal{F}_s]\mathbf{1}_A] \tag{4.1.37}\\
&= \mathbb{E}_P[\mathbb{E}_P[X_sM_T\mathbf{1}_A \mid \mathcal{F}_s]] \tag{4.1.38}\\
&= \mathbb{E}_P[X_sM_T\mathbf{1}_A] \tag{4.1.39}
\end{align*}
(a good exercise is to go through this chain of equalities and to identify which property is used where). Now, the last expression is equal to $\mathbb{E}_Q[X_s\mathbf{1}_A]$ by the definition of $Q$, so we are done.
Finally, observe that the process Xt has quadratic variation [X]t = t, since the drift contributes
nothing to the quadratic variation. Hence, under Q, (Xt )t∈[0,T ] is a martingale with continuous
sample paths and [X]t = t, and therefore it is a Q-Brownian motion by Lévy’s theorem.

We can also give this result in a more symmetric form:

Theorem 12. Let $P$ denote the path-space probability law of the $d$-dimensional Itô process
\[
X_t = x + \int_0^t F_s\, ds + \int_0^t G_s\, dW_s, \qquad 0 \le t \le T, \tag{4.1.40}
\]
where $x \in \mathbb{R}^d$ is a deterministic initial condition, $(W_t)$ is a $d$-dimensional Brownian motion, and the random $d\times d$ matrices $G_t$ are invertible. Furthermore, let $\tilde{F}_t$ be any adapted drift, such that the process $H_t := G_t^{-1}(F_t - \tilde{F}_t)$ is well-defined, and the process
\[
M_t := \exp\left(-\int_0^t H_s^{\mathsf T}\, dW_s - \frac{1}{2}\int_0^t |H_s|^2\, ds\right) \tag{4.1.41}
\]
satisfies $\mathbb{E}_P[M_T] = 1$. Then for the path-space measure $Q$ given by $\mathbb{E}_Q[\,\cdot\,] = \mathbb{E}_P[\,\cdot\,M_T]$, the process
\[
\tilde{W}_t = W_t + \int_0^t H_s\, ds \tag{4.1.42}
\]
is a $Q$-Brownian motion, and $X_t$ can also be represented as
\[
X_t = x + \int_0^t \tilde{F}_s\, ds + \int_0^t G_s\, d\tilde{W}_s, \qquad 0 \le t \le T. \tag{4.1.43}
\]

Remark 7. It is easy to see by considering $\tilde{F}_t = 0$ and $G_t = I$ that Theorem 11 is a special case of this.

Proof. We have already done most of the work in proving Theorem 11. In particular, it is easy to show, using essentially the same arguments, that $M_t$ and $X_tM_t$ are both $P$-martingales; the fact that $\tilde{W}_t$ is a $Q$-Brownian motion is a consequence (in fact, the content) of Theorem 11. It only remains to verify that $X_t$ can be represented as an Itô process using $\tilde{F}_t$, $G_t$, and the $Q$-Brownian motion $\tilde{W}_t$. This is a routine computation:
\begin{align*}
dX_t &= F_t\, dt + G_t\, dW_t \tag{4.1.44}\\
&= \tilde{F}_t\, dt + G_tH_t\, dt + G_t\, dW_t \tag{4.1.45}\\
&= \tilde{F}_t\, dt + G_t(dW_t + H_t\, dt) \tag{4.1.46}\\
&= \tilde{F}_t\, dt + G_t\, d\tilde{W}_t, \tag{4.1.47}
\end{align*}
where the last line follows from the form of $\tilde{W}_t$.

4.1.4 Absolute continuity

Hidden in the statement of Girsanov's theorem is a very important fact about probability laws of Itô processes: the probability laws of any two Itô processes that differ only in their drift terms are absolutely continuous w.r.t. each other. This fact is often used in the computation of expectations. For example, consider a $d$-dimensional diffusion process $X = (X_t)_{t\in[0,T]}$ given by
\[
X_t = x + \int_0^t f(X_s)\, ds + W_t \tag{4.1.48}
\]
and let $Q$ denote its probability law on the path space $\Omega = C([0, T]; \mathbb{R}^d)$. Suppose we wanted to compute the expectation of some functional $h : \Omega \to \mathbb{R}$:
\[
\mathbb{E}_Q[h(X)] = \int h(x)\, Q(dx). \tag{4.1.49}
\]
This type of "functional" integration arises in a variety of applications of stochastic calculus. The difficulty here, apart from the infinite-dimensional nature of $x$, is that we do not even have an explicit expression for $Q$. However, with the help of Girsanov's theorem we can write this expectation as
\begin{align*}
\mathbb{E}_Q[h(X)] &= \mathbb{E}_P\left[h(x + W)\frac{dQ}{dP}(x + W)\right] \tag{4.1.50}\\
&= \mathbb{E}_P\left[h(x + W)\exp\left(\int_0^T f(x + W_t)^{\mathsf T}\, dW_t - \frac{1}{2}\int_0^T |f(x + W_t)|^2\, dt\right)\right]; \tag{4.1.51}
\end{align*}
this is still a functional integral, but now it involves "only" the probability law of the Brownian motion, and there are computational methods for approximating such functional integrals. To see how the above representation comes about, we simply apply Theorem 12 with $F_t = 0$, $\tilde{F}_t = f(X_t)$, $G_t = I$.

4.1.5 Weak solutions of SDEs

Consider an Itô SDE
\[
dX_t = f(X_t)\, dt + dW_t, \qquad 0 \le t \le T. \tag{4.1.52}
\]
Theorem 9 on the existence and uniqueness of solutions to SDEs showed that, under appropriate regularity conditions on the drift $f$, the above equation has a unique solution which can be expressed as a function of the initial condition $X_0$ and the given Brownian motion $(W_t)_{0\le t\le T}$ on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. This is known as a strong solution, and it has a clear system-theoretic interpretation: we can realize the process $(X_t)_{0\le t\le T}$ as the output of a system that takes the initial condition $X_0$ and the Brownian motion $(W_t)_{0\le t\le T}$ and produces the path $(X_t)_{0\le t\le T}$. Moreover, the system is causal in the sense that $X_t$ depends only on $X_0$ and on $(W_s)_{0\le s\le t}$.

However, this is not the only sense in which an SDE like (4.1.52) can have a solution. For example, we can write it down and then ask whether there exists some probability space $(\Omega, \mathcal{F}, \mathbb{P})$, such that on that space we can construct both a Brownian motion $(W_t)_{0\le t\le T}$ and a process $(X_t)_{0\le t\le T}$, such that
\[
X_t = \int_0^t f(X_s)\, ds + W_t, \qquad 0 \le t \le T. \tag{4.1.53}
\]
If such a construction is possible, then we say that (4.1.52) has a weak solution. It turns out that weak solutions exist whenever Girsanov theory applies. Let us take any probability space $(\Omega, \mathcal{F}, \mathbb{P})$, such that $(X_t)_{0\le t\le T}$ is a $\mathbb{P}$-Brownian motion. Now suppose that $f$ is such that
\[
\mathbb{E}_{\mathbb{P}}\left[\exp\left(\int_0^T f(X_t)^{\mathsf T}\, dX_t - \frac{1}{2}\int_0^T |f(X_t)|^2\, dt\right)\right] = 1, \tag{4.1.54}
\]
and consider a new probability measure $Q$ on $(\Omega, \mathcal{F})$, such that
\[
\frac{dQ}{d\mathbb{P}} = \exp\left(\int_0^T f(X_t)^{\mathsf T}\, dX_t - \frac{1}{2}\int_0^T |f(X_t)|^2\, dt\right). \tag{4.1.55}
\]
Then, by Theorem 12, the process
\[
W_t = X_t - \int_0^t f(X_s)\, ds \tag{4.1.56}
\]
is a $Q$-Brownian motion, and thus we have constructed the pair $(X_t, W_t)$, where $(W_t)_{0\le t\le T}$ is a Brownian motion and (4.1.53) holds. The catch with weak solutions is that, unlike the strong case where $(X_t)_{0\le t\le T}$ is generated causally from the given driving Brownian motion $(W_t)_{0\le t\le T}$, from (4.1.56) we see that the reverse is true: the Brownian motion $(W_t)_{0\le t\le T}$ is generated causally from the process $(X_t)_{0\le t\le T}$! Weak solutions are probabilistic in nature, unlike strong solutions, which are closer in spirit to solutions of differential equations.

54
4.2 The Feynman–Kac formula
As we already saw in Chapter 2, there is a fundamental connection between diffusion processes and
a certain type of second-order PDEs. For example, consider a function u(x, t) defined by
Z
1 2
u(x, t) := d/2
ϕ(x + ξ)e−|ξ| /2t dξ, (x, t) ∈ Rd × [0, ∞) (4.2.1)
(2πt) Rd

where ϕ : Rd → R is a given C 2 function. On the one hand, it is a matter of straightforward calculus


to show that it solves the PDE
∂ 1
u(x, t) = ∆u(x, t), u(x, 0) = ϕ(x). (4.2.2)
∂t 2
On the other, it also has a probabilistic representation
u(x, t) = E[ϕ(x + Wt )], (4.2.3)
where (Wt )t≥0 is a standard d-dimensional Brownian motion. This, of course, is not a coincidence
since A = 12 ∆ is the infinitesimal generator of the Brownian motion. More generally, consider a
d-dimensional diffusion process (Xt )t≥0 with drift f (x) and diffusion matrix a(x). Suppose that a
function u(x, t), which is C 2 in x and C 1 in t, solves the PDE

u(x, t) = Au(x, t), u(x, 0) = ϕ(x) (4.2.4)
∂t
for a given C 2 function ϕ : Rd → R, where
1
Au(x, t) = f (x)T ∇u(x, t) + tr{a(x)∇2 u(x, t) (4.2.5)
2
is the action of the infinitesimal generator A of (Xt )t≥0 on u(x, t) (as usual, ∇ and ∇2 act on
the space variable x, while keeping t fixed). Then it can be shown that it also has a probabilistic
representation u(x, t) = E[ϕ(Xt )|X0 = x]. (In fact, the converse can also be shown to hold under
additional regularity conditions.)
The relation between PDEs and diffusion processes can be pushed further. For example, let
u(x, t) be a solution of

u(x, t) = Au(x, t) − λu(x, t), u(x, 0) = ϕ(x) (4.2.6)
∂t
where λ > 0 is a fixed constant. Then u(x, t) = e−λt E[ϕ(Xt )|X0 = x], where Xt , as before, is a
diffusion process with generator A. To show this, let v(x, t) = eλt . Then v(x, 0) = u(x, 0) = ϕ(x),
and
∂ ∂
v(x, t) = λv(x, t) + eλt u(x, t) (4.2.7)
∂t ∂t
= λv(x, t) + eλt Au(x, t) − λu(x, t)

(4.2.8)
= eλt Au(x, t) (4.2.9)
= Av(x, t). (4.2.10)

Thus, u(x, t) = e−λt v(x, t) = e−λt E[ϕ(Xt )|X0 = x], as claimed. The following fundamental result,
however, shows that we can obtain probabilistic representations of solutions of a much broader class
of PDEs:

55
Theorem 13 (Feynman–Kac). Let u(x, t) be a C 2,1 solution of the PDE

u(x, t) = Au(x, t) + V (x)u(x, t), u(x, 0) = ϕ(x). (4.2.11)
∂t
Then it has the following probabilistic representation:
 ! 
Z t
u(x, t) = E ϕ(Xt ) exp V (Xs ) ds X0 = x , (4.2.12)

0

where (Xt )t≥0 is a diffusion proess with infinitesimal generator A.


Remark 8. As will evident from the proof, the result also holds, with obvious modifications, when
V depends on both x and t.
Proof. Fix x and t > 0 and consider the following process (Ms )0≤s≤t :
Z s 
Ms := u(Xs , t − s) exp V (Xr ) dr . (4.2.13)
0

Then M0 = u(X0 , t) and


! !
Z t Z t
Mt = u(Xt , 0) exp V (Xr ) dr = ϕ(Xt ) exp V (Xr ) dr . (4.2.14)
0 0

We want to show that (Ms )0≤s≤t is a martingale, which would immediately imply that
 ! 
Z t
u(x, t) = E[M0 |X0 = x] = E[Mt |X0 = x] = E ϕ(Xt ) exp V (Xs ) ds X0 = x . (4.2.15)

0
R 
s
Let Us := u(Xs , t − s) and Is := exp 0 V (Xr ) dr . For the latter, we have dIs = V (Xs )Is ds.
Furthermore, applying Itô’s rule to ϕ(x, s) = u(x, t − s) and using the fact that u(x, t) solves (4.2.11)
we obtain

dUs = −u̇(Xs , t − s) ds + Au(Xs , t − s) ds + u(Xs , t − s)g(Xs ) dWs (4.2.16)
∂x

= −V (Xs )u(Xs , t − s) ds + u(Xs , t − s)g(Xs ) dWs (4.2.17)
∂x

= −V (Xs )Us ds + u(Xs , t − s)g(Xs ) dWs . (4.2.18)
∂x
Now, Ms = Us Is , so Itô’s product rule gives

dMs = Us dIs + Is dUs (4.2.19)


 

= Us V (Xs )Is ds + Is −V (Xs )Us ds + u(Xs , t − s)g(Xs ) dWs (4.2.20)
∂x

= Is u(Xs , t − s)g(Xs ) dWs , (4.2.21)
∂x
which shows that Ms is indeed a martingale.

56
4.2.1 A killing interpretation
Let (Xt )t≥0 be a continuous-time d-dimensional Markov process, and let τ be a nonnegative-valued
random variable. We then define a Markov process (X̄t )t≥0 killed at time τ by adding a special
“death” state d and taking
(
Xt , t < τ
X̄t := (4.2.22)
d, t≥τ

In other words, the path of X̄t is obtained by following the path of Xt until the “killing time” τ , at
which point the process enters the death state and remains there. Given a function ϕ : Rd → R, we
extend it by taking ϕ(d) = 0. Then

E[ϕ(X̄t )] = E[ϕ(Xt )1{τ >t} ]. (4.2.23)

For example, suppose τ ∼ Exp(λ), i.e., τ has exponential distribution with parameter λ and is
independent of (Xt ). Then

E[ϕ(X̄t )] = E[ϕ(Xt )]P[τ > t] (4.2.24)


= e−λt E[ϕ(Xt )]. (4.2.25)

We can consider other choices of τ that depend on the path of Xt . For instance, let a nonnegative
function V : Rd → R be given and take
( )
Z t
τ := inf t≥0: V (Xs ) ds ≥ γ , (4.2.26)
0

where γ ∼ Exp(1) is independent of (Xt ). Then, letting (Ft ) be the filtration generated by (Xt ), we
have
h i
E[ϕ(X̄t )] = E ϕ(Xt )1{τ >t} (4.2.27)
h i
= E ϕ(Xt )1{γ>R t V (Xs ) ds} (4.2.28)
0
 " #

= E E ϕ(Xt )1{γ>R t V (Xs ) ds} Ft  (4.2.29)

0
 " #

= E ϕ(Xt )E 1{γ> t V (Xs ) ds} Ft  (4.2.30)
 R
0
 !
Z t
= E ϕ(Xt ) exp − V (Xs ) ds  . (4.2.31)
0

Hence, if (Xt )t≥0 is a diffusion process with deterministic initial condition X0 = x, then the
Feynman–Kac formula tells us that the solution of the PDE

u(x, t) = Au(x, t) − V (x)u(x, t), u(x, 0) = ϕ(x) (4.2.32)
∂t

57
is equal to the expectation E[ϕ(X̄t )|X0 = x], where X̄t is the process Xt killed at time τ defined in
(4.2.26).
R t We can think of killing as a variant of importance sampling, where the exponential weight
exp(− 0 V (Xs ) ds) is used to reject (or “kill”) those paths of (Xt ) that incur a large total cost at
time t, with the running cost measured by the potential function V .

4.3 Problems
1. Recall that Girsanov’s theorem says the following: If (Xt )0≤t≤1 is a d-dimensional P -Brownian
motion, then, for any d-dimensional adapted process (Ft )0≤t≤1 such that
 !
Z 1 Z 1
1
EP exp FtT dXt − |Ft |2 dt  = 1 (4.3.1)
0 2 0

the process
Z t
Wt := Xt − Fs ds, 0≤t≤1 (4.3.2)
0

is a Q-Brownian motion, where Q is the probability measure on the path space Ω = C([0, 1]; Rd )
defined by EQ [·] = EP [·MT ] with
Z 1 Z 1 !
1
MT = exp FtT dXt − |Ft |2 dt . (4.3.3)
0 2 0

For any deterministic vector v ∈ Rd , let


!
Z 1 Z 1
1
MTv := exp T
(Ft + v) dXt − 2
|Ft + v| dt , (4.3.4)
0 2 0

so that MT = MT0 . Using martingale methods and Lévy’s characterization of Brownian motion, it
can be shown that
Z t
EP [MTv ] = 1, ∀v ∈ Rd =⇒ Wt = Xt − Fs ds is a Q-Brownian motion. (4.3.5)
0

In this problem, we will prove a converse result: If (Ft )0≤t≤1 is an adapted process such that the
process Wt defined in (4.3.2) is a Q-Brownian motion with Q defined in (4.3.3), then (4.3.7) holds.
(i) Let v ∈ Rd be a deterministic vector. Prove that
1
log MTv = log MT + v T W1 − |v|2 . (4.3.6)
2

(ii) Use the result of (i) to prove that

EP [MTv ] = 1, ∀v ∈ Rd . (4.3.7)

Hint: Recall that, under Q, W1 ∼ N(0, Id ).

58
2. Let (Xt )0≤t≤1 , X0 = 0, be a one-dimensional diffusion process governed by the SDE

dXt = f (Xt ) dt + dWt , (4.3.8)

where the drift f : R → R satisfies the nonlinear differential equation

f 0 (x) + f 2 (x) = ax2 + bx + c (4.3.9)

for a ≥ 0 and arbitrary b, c. Prove that, for any bounded measurable function ϕ : R → R,
 !
Z 1
1
E[ϕ(X1 )] = e−c/2 E ϕ(W1 ) exp F (W1 ) − (aWt2 + bWt ) dt  , (4.3.10)
2 0
Rx
where F (x) := 0 f (y) dy, and where on the right-hand side (Wt )0≤t≤1 is a standard one-dimensional
Brownian motion.
Hint: Use Girsanov’s theorem to remove the drift from Xt , then apply Itô’s rule to F (Wt ).

3. Let f, F, a, b, c be as in Problem 2. Let u(x, t) be a C 2,1 solution of the PDE

∂ 1 ∂2 1
u(x, t) = 2
u(x, t) − (ax2 + bx)u(x, t), u(x, 0) = eF (x) , (4.3.11)
∂t 2 ∂x 2

Show that u(0, 1) = ec/2 .

4.4 Notes and further reading


My presentation of Girsanov’s theory, including the motivation through importance sampling, largely
follows that of [Ste01]. Importance sampling is a rich and interesting subject in its own right; see
[Buc04] for a clean exposition geared toward engineering system simulation. The first result on
shifting the drift from a Brownian motion via a change of measure was obtained by Cameron and
Martin [CM44] for deterministic drifts; the paper of Girsanov [Gir60] generalized it to arbitrary Itô
processes. The martingale proof of Girsanov’s theorem, which I got from Steele’s book, seems to
originate in the work of Beneš [Ben71], although this paper is not cited by Steele. The paper of
Van Schuppen and Wong [SW74] gives what is probably the most general Girsanov-type result. A
brief discussion of the system-theoretic interpretation of strong solutions of SDEs can be found in
Section 5.2 of [KS98].
The origins of the Feynman–Kac formula go back to [Kac49]. As Kac recounts in his delightful
autobiography Enigmas of Chance [Kac85], he attented a physics colloquium talk by Richard
Feynman in 1947, when both of them were at Cornell University. Feynman’s talk was about the
new approach to quantum mechanics he had developed, based on “sums over histories” or what is
now referred to as “Feynman path integrals” [FH65]. Feynman’s use of his approach to derive a
fundamental quantum-mechanical quantity known as the propagator, which is related to solutions of
Schrödinger’s equation. As Kac remembers it [Kac85, p. 116], the steps in Feynman’s derivation
were reminiscent of some arguments pertaining to the connection between Brownian motion and
the heat equation. See [Roe94, Sim05] for detailed expositions of the path (or functional) integral
approach.

59
Part II

Applications

60
Chapter 5

Stochastic control and estimation

Control, in very broad strokes, is about shaping the behavior of some system in order to attain a
given goal. For example, if we have a continuous-time deterministic system with input u(t) ∈ Rm ,
state x(t) ∈ Rn , and output y(t) ∈ Rp described by the state-space model

ẋ(t) = f (x(t), u(t)), (5.0.1a)


y(t) = h(x(t)) (5.0.1b)

then the following are all examples of control:

1. If we have perfect observation of the state, i.e., y(t) = h(x(t)), then we may seek a control
law u : [0, 1] → Rm to transfer the system’s state from a given initial point x(0) = x0 to a
given final point x(1) = x1 . (We say that we want to control the system from x0 to x1 ) in unit
time.) Whether this is at all possible depends on the dynamical law of the system, i.e., on
f (·, ·); this is a question of controllability.

2. If at least one such control law exists, we then may want to ask whether the transfer can be
accomplished in the best possible way—thus, we may want to minimize the control effort

1 1
Z
|u(t)|2 dt (5.0.2)
2 0

over all controls u : [0, 1] → Rm that accomplish the transfer from x0 to x1 in unit time. This
is a instance of optimal control.

3. More generally, we may want to choose a control law u to minimize a total cost of the form
Z 1
q(x(t), u(t)) dt + r(x(1)) (5.0.3)
0

over all “well-behaved” control laws u : [0, 1] → Rm , where q is the instantaneous cost and r
is the terminal cost. The interpretation of this is that, for a given initial condition x0 , we
choose u(·) to trade off the terminal cost r(x(1)) against the integrated cost of “getting there”
from x(0) = x0 . This can be done using state feedback control, i.e., there exists a function
k : Rn × [0, 1] → Rm such that the optimal control law u(t) is given by k(x(t), t), where
x(t) evolves according to the closed-loop dynamics ẋ(t) = f (x(t), k(x(t)), t). However, due

61
to the nature of solutions of deterministic ODEs, we can also look for optimal control laws
in open-loop form, i.e., just as functions of time. These are two complementary viewpoints,
formalized by dynamic programming and the maximum principle, respectively.

4. Even more generally, we may want to accomplish various goals while only having access to a
processed version of the state, i.e., the sensor output y(t) = h(x(t)). In this case, we may still
want to choose u : [0, T ] → Rm to minimize the total cost as in the above example, but with
the additional restriction that the control u(t) can only be chosen on the basis of the output
trajectory (y(s) : 0 ≤ s < t). This is the problem optimal control with partial observations,
and it is intimately connected with such concepts as observability, i.e., when the initial state
x(0) can be recovered from an observation trajectory up to time t.

The solutions to these problems are all known if the system in question is linear, i.e., if f (x, u) =
Ax + Bu and g(x) = Cx for some matrices A, B, C of appropriate shapes. (The systems don’t have
to be time-invariant, i.e., the system matrices may all depend on t.) For nonlinear systems, there
is no general theory, and in fact some questions can be answered in general only locally, e.g., we
may want to ask whether a given point x0 has an open neighborhood K, such that it is possible
to transfer the system from x0 to any given point x1 ∈ K in finite time. Nevertheless, we see that
there are a number of basic questions regarding control and estimation that one may want to pose
about dynamical systems.
Our interest here lies with stochastic systems modeled by SDEs, so some of the above questions
(such as controllability or observability) do not have obvious stochastic counterparts. Thus, we will
focus for the most part on the optimal control problem. In that context, we can also consider the
cases of full vs. partial observations, and this will force us to consider some estimation problems as
well. We will start with the latter.

5.1 Linear estimation: the Kalman–Bucy filter


We consider the following system of linear SDEs with time-varying coefficients:

dXt = A(t)Xt dt + B(t) dWt (5.1.1)


dYt = C(t)Xt dt + D(t) dVt (5.1.2)

where (Wt )t≥0 and (Vt )t≥0 are two independent Brownian motions that are also independent of the
initial condition X0 . We assume that X0 is Gaussian with mean m0 and covariance matrix K0 and
that Y0 = 0. Thus, both Xt and Yt are Gaussian processes. We also assume that, for each t, the
matrix D(t)D(t)T is invertible. For each t, let Ft be the σ-algebra generated by (Xs , Ys )0≤s≤t and
let FtY be the observation σ-algebra generated by the observations (Ys )0≤s≤t only. Our goal is to
show that the conditional mean X̂t := E[Xt |FtY ] is determined by a linear SDE with initial condition
X̂0 = m0 .
We start by recalling a fundamental result on minimum mean-square error estimation in the
jointly Gaussian setting:

Lemma 1. Let I be a parameter set, let (Vα )α∈I be a collection of random variables indexed by
the elements of I, and let V be the σ-algebra generated by the Vα ’s. Let U be a random variable,
such that U and (Vα )α∈A are jointly Gaussian. Then the conditional mean Û = E[U |V] is uniquely
characterized by the following properties:

62
• measurability—Û is a measurable function of (Vα )α∈I ;

• unbiasedness—E[U − Û ] = 0;

• orthogonality—E[(U − Û )Vα ] = 0 for all α ∈ I.

We will look for X̂t defined through a linear SDE of the form

dX̂t = A(t)X̂t dt + L(t)(dYt − C(t)X̂t dt) (5.1.3)

with initial condition X̂0 = m0 , where L(t) is a matrix that we need to determine. Since the solution
of (5.1.3) has the form
Z t 
Z t
X̂t = m0 + A(s) − L(s)C(s) X̂s ds + L(s) dYs , (5.1.4)
0 0

we see that X̂t is a measurable function of (Ys )0≤s≤t . Moreover, if we define the error process
X̃t := Xt − X̂t , then E[X̃0 ] = 0 and

dX̃t = A(t) − L(t)C(t) X̃t dt + B(t) dWt − L(t)D(t) dVt , (5.1.5)

which gives E[Xt − X̂t ] = E[X̃t ] = 0 for all t. Thus, X̂t satisfies the measurability and the
unbiasedness requirements of the lemma with U = Xt , I = [0, t], and V = FtY . RFinally, we would like
t
to show orthogonality. To that end, let us also define the process νt := Yt − 0 C(s)X̂s ds with the
initial condition ν0 = 0. Then the joint process (X̃t , νt ) is Gaussian since it has the initial condition
X̃0 ∼ N(0, K0 ), ν0 = 0, and is governed by the system of linear SDEs

dX̃t = A(t) − L(t)C(t) X̃t dt + B(t) dWt − L(t)D(t) dVt (5.1.6)
dνt = C(t)X̃t dt + D(t) dVt (5.1.7)

or, in more suggestive matrix notation,


! ! ! ! !
dX̃t A(t) − L(t)C(t) 0 X̃t B(t) −L(t)D(t) dWt
= dt + . (5.1.8)
dνt C(t) 0 νt 0 D(t) dVt

From this, it is not hard to write down the differential equations for the covariance and the
cross-covariance matrices KX̃ (t) := E[X̃t X̃tT ], KX̃ν (t) := E[X̃t νtT ], and Kν (t) := E[νt νtT ]:
T 
K̇X̃ (t) = KX̃ (t) A(t) − L(t)C(t) + A(t) − L(t)C(t) KX̃ (t)
+ B(t)B(t)T + L(t)D(t)D(t)T L(t)T (5.1.9)

K̇X̃ν (t) = A(t) − L(t)C(t) KX̃ν (t)T + KX̃ (t)C(t)T − L(t)D(t)D(t)T (5.1.10)
T T T T
K̇ν (t) = C(t)KX̃ν (t) + KX̃ν (t) C(t) + D(t)D(t) (5.1.11)

with the initial conditions KX̃ (0) = K0 , KX̃ν (0) = 0, Kν (0) = 0. The choice of
−1
L(t) = KX̃ (t)C(t)T D(t)D(t)T (5.1.12)

63
is rather fortuitous, since it simultaneously results in KX̃ν (t) = 0 for all t,
Z t
Kν (t) = D(s)D(s)T ds, (5.1.13)
0

and
−1
K̇X̃ (t) = KX̃ (t)A(t)T + A(t)KX̃ (t) − KX̃ (t)C(t)T D(t)D(t)T C(t)KX̃ (t) + B(t)B(t)T ,
(5.1.14)

where the latter is the Riccati differential equation with the initial condition KX̃ (0) = K0 . Thus, if
in (5.1.3) we take L(t) according to (5.1.12), where the error covariance matrix KX̃ (t) is determined
from (5.1.14), then we see that, for each t, the error X̃t = Xt − X̂t has zero mean and is orthogonal
to νt :

E[X̃t νtT ] = KX̃ν (t) = 0. (5.1.15)

From here, it is not hard to show that X̃t is orthogonal to all νs , 0 ≤ s ≤ t. Thus, by Lemma 1,
X̂t = E[Xt |Ftν ], where
Z t
νt = Yt − C(s)X̂s ds (5.1.16)
0

is the innovations process with covariance matrix given by (5.1.13). Recall, however, that we were
after the conditional mean of Xt given FtY , not Ftν , but it turns out that these two conditional
means are equal since the observations (Ys )0≤s≤t and the innovations (νs )0≤s≤t generate the same
σ-algebra:

FtY = Ftν . (5.1.17)

This result, known as causal equivalence, follows from the fact that X̂t can be expressed in two ways
as
Z t Z t
X̂t = X̂0 + A(s)X̂s ds + L(s) dνs (5.1.18)
0 0
Z t Z t

= X̂0 + A(s) − L(s)C(s) X̂s ds + L(s) dYs , (5.1.19)
0 0

which, in combination with (5.1.16), shows that we can generate the innovations trajectory (νs )0≤s≤t
from (Ys )0≤s≤t and vice versa. Thus, X̂t is the conditional mean E[Xt |FtY ].
The innovations process νt has a number of useful properties. For instance, it has independent
increments, i.e., for all 0 ≤ t1 ≤ t2 ≤ t3 ≤ t4 , the random vectors νt4 − νt3 and νt2 − νt1 are
independent. Moreover, for t ≥ s ≥ 0, the increments νt − νs are independent of FsY . To see this,
observe that, for t ≥ s,
Z t
νt − νs = Yt − Ys − C(r)X̂r dr (5.1.20)
s
Z t Z t
= C(r)X̃r dr + D(r) dVr , (5.1.21)
s s

64
and both (X̃r )r≥s and (Ṽr )r≥s are independent of FsY . The independent increments property of νt is
proved similarly. Moreover, we can compute the covariance matrix of νt − νs :
Z t
T
E[(νt − νs )(νt − νs ) ] = D(r)D(r)T dr. (5.1.22)
s

This can be seen as follows: For any deterministic vector v ∈ Rp (where p is the dimension of Yt ),
define the process Ztv := v T (νt − νs ). Then
dZtv = v T C(t)X̃t dt + v T D(t) dVt (5.1.23)
so we can easily compute the joint quadratic variation d[Z v , Z w ]t = v T D(t)D(t)T w dt. Itô’s product
rule then gives
d(Ztv Ztw ) = Ztv dZtw + Z w dZtv + d[Z v , Z w ]t (5.1.24)
= Ztv wT C(t)X̃t dt + wT D(t) dVt + Ztw v T C(t)X̃t dt + v T D(t) dVt + v T D(t)D(t)T w dt.
 

(5.1.25)

Since Ztv and X̃t are independent, taking expectations gives


Z t
E[v T (νt − νs )(νt − νs )T w] = v T D(r)D(r)T w dr. (5.1.26)
s

Since vpand w are arbitrary, we get (5.1.13). Now, since D(t)D(t)T is invertible, the process
Ŵt := (D(t)D(t)T )−1 νt has independent increments and d[Ŵ i , Ŵ j ]t = δij dt for all 1 ≤ i, j ≤ p, so
it is a standard p-dimensional Brownian motion adapted to Ftν (and hence to FtY ). We can therefore
express the filter process X̂t as
Z t Z t q
X̂t = m0 + A(s)X̂s ds + L(s) D(s)D(s)> dŴs . (5.1.27)
0 0

In particular, if D(t) is symmetric and positive definite (and thus the observation noise Vt is a
standard p-dimensional Brownian motion), then the observation process Yt obeys the SDE
dYt = C(t)X̂t dt + D(t) dŴt , (5.1.28)

where D(t) dŴt = dνt provides the new information, on top of C(t)X̂t dt.

5.2 Nonlinear filtering


A more general problem involves observations of the form
Z t
Yt = h(Xs ) ds + Vt , 0 ≤ t ≤ T, (5.2.1)
0

where (Vt ) is a standard Brownian motion and where (Xt ) is the n-dimensional signal (or state)
process governed by the SDE
dXt = f (Xt ) dt + g(Xt ) dWt . (5.2.2)
Here, (Wt ) is a standard n-dimensional Brownian motion independent of the initial condition X0 . In
what follows, we will impose the following assumptions:

65
1. The observation process (Yt ) is scalar.
2. The signal process (Xt ) and the measurement function h are such that
"Z #
T
E h2 (Xt ) dt < ∞. (5.2.3)
0

3. The Brownian motion (Vt ) is independent of the signal process (Xt ).


The first assumption is introduced mainly for convenience; in the multidimensional case, we can do
all the calculations on a per-coordinate basis. The third assumption can also be relaxed, but at the
expense ofR considerably more technical arguments. It is also possible to consider observation noise of
t
the form 0 g(s) dVs with g > 0 everywhere.
Let FtY := σ(Ys : 0 ≤ s ≤ t) denote the σ-algebra generated by the observations up to time t.
The problem of nonlinear filtering is to characterize, for each t, the conditional probability law of
Xt given FtY . To be more precise, let a function ϕ : Rn → R be given, and let πt (ϕ) denote the
conditional expectation
πt (ϕ) := E[ϕ(Xt )|FtY ]. (5.2.4)
The goal is to recursively compute πt (ϕ) for any sufficiently regular (say, bounded and continuous)
function ϕ. There are two approaches to this problem, the innovations approach and the reference
measure approach.

5.2.1 The innovations approach


The innovations approach was first proposed by Kailath in the linear case and then extended by
Frost and Kailath to the nonlinear setting. Given any process (ξt ), let ξˆt := E[ξt |FtY ]. In particular,
let Zt := h(Xt ) and define the innovations process
Z t
νt := Yt − Ẑs ds. (5.2.5)
0
Just as in the Kalman–Bucy setting, the increments νt+h − νt represent the “new” information about
the process (Zt ) contained in the observations Ys for times s between t and t + h. In fact, just like
in the linear case, we have the following:
Lemma 2. The innovations process (νt ) is an FtY -Brownian motion.
Proof. The innovations process (νt ) is an FtY -martingale:
E[dνt |FtY ] = E[dYt − Ẑt dt|FtY ] (5.2.6)
= E[(Zt − Ẑt ) dt + dVt |FtY ] (5.2.7)
= E[Zt − Ẑt |FtY ] dt + E[dVt |FtY ], (5.2.8)

where the first conditional expectation vanishes by the definition of Ẑt , while the second one vanishes
since dVt is zero-mean and independent of FtY . Moreover, using (5.2.1) and the definition of νt in
(5.2.5) can express (Yt ) in two ways as follows:
Z t Z t
Yt = Zs ds + Vt = Ẑs ds + νt . (5.2.9)
0 0

66
Hence, the quadratic variation [ν]t is equal to the quadratic variation [Y ]t , which is in turn equal to t.
Thus, since (νt ) is an FtY -martingale with continuous sample paths and [ν]t = t, it is an FtY -Brownian
motion by Lévy’s theorem.

We will also make use of the following martingale representation result:

Lemma 3 ([FKK72]). Every square-integrable FtY -martingale Mt can be represented as


Z t
Mt = E[M0 ] + ηs dνs . (5.2.10)
0

Next, we give a representation for the conditional mean ξˆt = E[ξt |FtY ], where ξt is a process of
the form
Z t
ξt = ξ0 + Fs ds + Mt , (5.2.11)
0

where Mt is an Ft -martingale, where Ft is the σ-algebra generated by (Xs , Ys )0≤s≤t . Then we will
specialize it to the case ξt = ϕ(Xt ).

Theorem 14. Suppose that the joint quadratic variation [M, V ]t between the martingale Mt in
(5.2.11) and the Brownian motion Vt in (5.2.1) vanishes. Then ξˆt admits the following representation:
Z t Z t
ξˆt = ξˆ0 + ˆ

F̂s ds + s Zs − ξs Ẑs dνs .
ξd (5.2.12)
0 0

Proof. First, we show that the process


Z t
M̄t := ξˆt − ξˆ0 − F̂s ds (5.2.13)
0

is a zero-mean FtY -martingale. Indeed, M̄0 = 0, and

E[dξˆt |FtY ] = E[dξt |FtY ] (5.2.14)


= E[Ft |FtY ] dt + E[dMt |FtY ] (5.2.15)
= E[F̂t |FtY ] dt, (5.2.16)

where in the last step we have used the fact that Mt is a FtY -martingale. Rearranging shows that
E[dM̄t |FtY ] = 0, as claimed. We can now invoke Lemma 3 to conclude that there exists a FtY -adapted
process (ηt ), such that
Z t Z t
ξˆt = ξˆ0 + F̂s ds + ηs dνs . (5.2.17)
0 0

It remains to identify the process ηt . We start with the observation that

E[(ξt − ξˆt )Yt |FtY ] = 0, (5.2.18)

67
which follows from the properties of conditional expectation. For ξt Yt , Itô’s product rule gives
Z t Z t Z t
ξt Yt = ξ0 Y0 + Ys dξs +
ξs dYs + d[ξ, Y ]s (5.2.19)
0 0 0
Z t Z t
= ξ0 Y0 + ξs (Zs ds + dVs ) + Ys (Fs ds + dMs ), (5.2.20)
0 0

where we have also used the fact that the quadratic variation term vanishes since [V, M ]t = 0 by
hypothesis. Analogously, for ξˆt Yt ,
Z t Z tZ t
ξˆt Yt = ξˆ0 Y0 + ξˆs dYs + ˆ
Ys dξs + ˆ Y ]s
d[ξ, (5.2.21)
0 0 0
Z t Z t Z t
= ξ0 Y0 + ξˆs (Zs ds + dVs ) + Ys (F̂s ds + ηs dνs ) + ηs ds, (5.2.22)
0 0 0

ˆ Y ]t = ηt dt. Taking conditional expectations


where we have used the (easily established) fact that d[ξ,
given Ft , we obtain
Y

Z t Z t
E[ξt Yt |FtY ] = E[ξs Zs |FsY ] ds + E[Ys Fs |FsY ] ds (5.2.23)
0 0
Z t Z t
= ξd
s Zs ds + Ys F̂s ds (5.2.24)
0 0

and
Z t Z t Z t
E[ξˆt Yt |FtY ] = ˆ
E[ξs Zs ] ds + Y
E[Ys F̂s |Fs ] ds + ηs ds (5.2.25)
0 0 0
Z t Z t Z t
= ξˆs Ẑs ds + Ys F̂s ds + ηs ds. (5.2.26)
0 0 0

ˆ
t Zt − ξt Ẑt .
Plugging these expressions into (5.2.18) shows that ηt = ξd

Let us now use Theorem 14 to derive the nonlinear filter. Thus, let ξt = ϕ(Xt ), for a C 2 function
ϕ. Then

ξˆt = E[ϕ(Xt )|FtY ] = πt (ϕ). (5.2.27)

With this in hand, we have the following:

Theorem 15. The nonlinear filter πt (ϕ) admits the following representation:
Z t Z t 
πt (ϕ) = π0 (ϕ) + πs (Aϕ) ds + πs (hϕ) − πs (h)πs (ϕ) dνs , (5.2.28)
0 0

where π0 (ϕ) = E[ϕ(X0 )], A is the infinitesimal generator of the signal diffusion process, and hϕ
denotes the function x 7→ h(x)ϕ(x).

68
Proof. By Itô’s rule, the process ξt = ϕ(Xt ) can be expressed as
Z t Z t
ξt = ξ0 + Aϕ(Xs ) ds + ∇ϕ(Xs )T g(Xs ) dWs , (5.2.29)
0 0

This has the form (5.2.11) with Fs = Aϕ(Xs ) and dMs = ∇ϕ(Xs )T g(Xs ) dWs . Moreover, Lemma 4.2
in [FKK72] shows that [M, V ]t = 0 follows from the independence of (Xt ) and (Vt ). Hence, we can
apply Theorem 14, and (5.2.28) follows since, e.g., Ẑt = E[h(Xt )|FtY ] = πt (h), etc.

The expression (5.2.28) for the conditinoal expectation πt (ϕ) is known as the Kushner–Stratonovich
equation. It is a stochastic partial differential equation due to the presence of the generator A in
the right-hand side. Even though it gives the solution of the nonlinear filtering problem, it does
not lead to a finite-dimensional recursive representation apart from some special cases. To see why
this is so, let us consider the case when the signal process (Xt ) is scalar and we wish to compute
the conditional mean of Xt given FtY , i.e., X̂t = E[Xt |FtY ], which is equal to πt (ϕ) with ϕ(x) = x.
Then, as can be seen fron (5.2.28), we need to have expressions for πt (Aϕ) = πt (f ) = f\ (Xt ), for
\ \
πt (h) = Ẑt = h(Xt ), and for πt (hϕ) = Zt Xt = h(Xt )Xt , and each of these, in turn, comes with its
[
own Kushner–Stratonovich equation. This is known as the closure problem.

Example 8 (The Kalman–Bucy filter). Suppose that the signal process (Xt ) and the observation
process (Yt ) are both scalar and given by
Z t
Xt = X0 + aXs ds + bWt (5.2.30)
0
Z t
Yt = cXs ds + Vt (5.2.31)
0

where X0 ∼ N(m0 , σ 2 ), and we are interested in computing X̂t = E[Xt |FtY ]. Of course, we have
already worked out the more general, multidimensional and time-varying, version of this problem;
this example is meant to illustrate that the Kalman–Bucy filter is a special instance of the nonlinear
filter that does not suffer from the closure problem.
The Kushner–Stratonovich equation for X̂t takes the form
Z t Z t
X̂t = E[X0 ] + aX̂s ds + c2 − (X̂s )2 ) dνs
c(X (5.2.32)
s
0 0
Z t Z t
= E[X0 ] + aX̂s ds + c2 − (X̂s )2 )(dYs − cX̂s ds)
c(X (5.2.33)
s
0 0
Z t Z t
= E[X0 ] + aX̂s ds + cKt (dYs − cX̂s ds), (5.2.34)
0 0

where Kt := E[(Xt − X̂t )2 |FtY ] is the conditional error variance. Since (Xt , Yt ) is a Gaussian process,
the error covariance Kt is nonrandom, and can be precomputed by solving the ODE

d
Kt = 2aKt − c2 Kt2 + b2 , (5.2.35)
dt
which is just the scalar Riccati equation.

69
5.2.2 The reference measure approach

The alternative approach to nonlinear filtering starts from the seemingly trivial observation that,
when h ≡ 0 (i.e., when the path (Ys )0≤s≤t is completely uninformative about the signal Xt ),
πt (ϕ) = E[ϕ(Xt )] for all t. More generally, for any measurable function F of the signal path
X t := (Xs )0≤s≤t and the observation path Y t := (Ys )0≤s≤t ,

Z
Y
E[F (X t , Y t )|Ft ] = F (xt , Y t )P X (dxt ), (5.2.36)

that is, we simply marginalize out the signal process. But this is something we can arrange by
making a Girsanov change of measure!
To that end, let P denote the probability law of X T and Y T , and define a new probability law
P̃ by

!
Z T Z T
dP 1 2
:= exp Zt dYt − |Zt | dt , (5.2.37)
dP̃ 0 2 0

where, as we recall, Zt = h(Xt ). Under P̃ , the signal process X T has the same marginal probability
law as under P , but the observation process Y T is now a Brownian motion independent of X T .
Thus, we find ourselves in the situation described in the preceding paragraph, and we will now
capitalize on this. First, let us define, for each t ∈ [0, T ], the following functional of the paths X t
and Y t :

!
Z t Z t
1
Λt (X t , Y t ) := exp h(Xs ) dYs − |h(Xs )|2 ds . (5.2.38)
0 2 0

We will now prove the following result, known as the Kallianpur–Striebel formula:

R
ϕ(xt )Λt (xt , Y t )P X t (dxt )
πt (ϕ) = R (5.2.39)
Λt (xt , Y t )P X t (dxt )

The starting point is the conditional Bayes theorem: Let µ and ν be two probablity measures on a
common probability space (Ω, F) that are mutually absolutely continuous w.r.t. each other, and let
G ⊆ F be a σ-algebra. Then, for any bounded random variable X,

Eν [X dµ
dν |G]
Eµ [X|G] = , (5.2.40)
Eν [ dµ
dν |G]

where E[·] (respectively, Eν [·]) denotes expectation w.r.t. µ (respectively, ν). To prove (5.2.40), let

70
A be an arbitrary event in G. Then, using the properties of conditional expectations, we can write
"  # "  #
dµ dµ
Eν 1A Eν X G = Eν Eν 1A X G (5.2.41)
dν dν
 

= Eν 1A X (5.2.42)

= Eµ [1A X] (5.2.43)
= Eµ [1A Eµ [X|G]] (5.2.44)
 

= Eν 1A Eµ [X|G] (5.2.45)

"  #

= Eν 1A Eµ [X|G]Eν G . (5.2.46)

Since A was arbitrary, we conclude that


   
dµ dµ
Eν X G = Eµ [X|G]Eν G , (5.2.47)
dν dν
h i
and then (5.2.40) follows since Eν dµ
dν G is almost surely positive by absolute continuity. To obtain

(5.2.39), we apply (5.2.40) to µ = P , ν = P̃ , G = FtY , and X = ϕ(Xt ). Here, we also make use of
the fact that Λt is a martingale under P̃ (which is, again, a consequene of Girsanov’s theorem), so
that (denoting by Ẽ[·] the expectation w.r.t. P̃ )

Ẽ[ϕ(Xt )ΛT (X T , Y T )|FtY ] = Ẽ[ϕ(Xt )Λt (X t , Y t )|FtY ]. (5.2.48)

The Kallianpur–Striebel formula is often expressed in terms of the unnormalized filter σt , which is
defined as

σt (ϕ) := Ẽ[ϕ(Xt )Λt |FtY ] (5.2.49)


Z
= ϕ(xt )Λt (xt , Y t )P X t (dxt ), (5.2.50)

where ϕ is any bounded measurable function. This is simply the quantity in the numerator of
(5.2.39), but the denominator is also of this form with the constant function ϕ(x) ≡ 1. That is,
σt (ϕ)
πt (ϕ) = . (5.2.51)
σt (1)
We can also obtain a recursive representation for the unnormalized filter, known as the Duncan–
Mortensen–Zakai equation. Assume ϕ is C 2 . First, by Itô’s rule

dϕ(Xt ) = Aϕ(Xt ) dt + ∇ϕ(Xt )T g(Xt ) dWt ; (5.2.52)

moreover, dΛt = Λt Zt dYt , and the joint quadratic variation of ϕ(Xt ) and Λt vanishes since the
Brownian motion processes (Wt ) and (Vt ) are independent. Therefore, Itô’s product rule gives

d(ϕ(Xt )Λt ) = ϕ(Xt ) dΛt + Λt dϕ(Xt ) + dϕ(Xt ) dΛt (5.2.53)


= ϕ(Xt )Λt Zt dYt + Λt Aϕ(Xt ) dt + Λt ∇ϕ(Xt ) g(Xt ) dWt .
T
(5.2.54)

71
Integrating, we get
Z t Z t
ϕ(Xt )Λt = ϕ(X0 ) + Aϕ(Xs )Λs ds + ϕ(Xs )Zs Λs dYs (5.2.55)
0 0
Z t Z t
= ϕ(X0 ) + Aϕ(Xs )Λs ds + ϕ(Xs )h(Xs )Λs dYs . (5.2.56)
0 0

Taking expectations w.r.t. Ẽ conditioned on FtY and using the definition of σt yields
Z t Z t
σt (ϕ) = σ0 (ϕ) + σs (Aϕ) ds + σs (hϕ) dYs , (5.2.57)
0 0

where, just as before, hϕ is shorthand for the function x 7→ h(x)ϕ(x). This is the Duncan–Mortensen–
Zakai equation, and from there one can derive the Kushner–Stratonovich equation (5.2.28) by
applying Itô’s product rule to σσtt(ϕ)
(1) .

5.3 Optimal control of diffusion processes


Now that we have discussed the estimation and filtering problems, we move on to the problem
of optimal control. First, we need to develop a suitable stochastic analogue of the controlled
deterministic system (5.0.1). For the most part, we will consider the case of complete observations.
This suggests replacing the deterministic state dynamics (5.0.1a) with an SDE of the form

dXt = f (Xt , ut ) dt + g(Xt , ut ) dWt , (5.3.1)

where we now have the n-dimensional state process (Xt )0≤t≤1 , the m-dimensional input (or control)
process (ut )0≤t≤1 , and a k-dimensional Brownian motion (Wt )0≤t≤1 . Note that the control input
may, in principle, affect both the drift and the diffusion matrix.
Now, we have to be careful about specifying in which sense, if any, the SDE (5.3.1) has a solution.
This will, in turn, necessitate imposing some restrictions on the admissible control processes. The most
obvious, direct approach is to require that the functions f : Rn × Rm → Rn and g : Rn × Rm → Rn×k
satisfy

|f (x, u) − f (x0 , u0 )| + kg(x, u) − g(x0 , u0 )k ≤ K(|x − x0 | + |u − u0 |) (5.3.2)

and

|f (x, u)|2 + kg(x, u)k2 ≤ K(1 + |x|2 + |u|2 ) (5.3.3)

for some constant 0 ≤ K < ∞, for all x, x0 ∈ Rn and all u, u0 ∈ Rm . Then we can take a time-varying
state feedback control law u = k(x, t) for any function k : Rn × [0, 1] → Rm satisfying

|k(x, t) − k(x0 , t)| ≤ K 0 |x − x0 | (5.3.4)

and

|k(x, t)|2 ≤ K 0 (1 + |x|2 ) (5.3.5)

72
for some constant 0 ≤ K 0 < ∞, for all x, x0 ∈ Rn and all t ∈ [0, 1]. Under these assumptions, the
functions f k (x, t) := f (x, k(x, t)) and g k (x, t) := g(x, k(x, t)) will satisfy

|f k (x, t) − f k (x0 , t)| = |f (x, k(x, t)) − f (x0 , k(x0 , t))| (5.3.6)
0 0
≤ K(|x − x | + |k(x, t) − k(x , t)|) (5.3.7)
00 0
≤ K |x − x | (5.3.8)

and

|f k (x, t)|2 + kg k (x, t)k2 = |f (x, k(x, t))|2 + kg(x, k(x, t))k2 (5.3.9)
2 2
≤ K(1 + |x| + |k(x, t)| ) (5.3.10)
≤ K̃(1 + |x|2 ) (5.3.11)

for some other constant K̃, for all x, x0 ∈ Rn and all t ∈ [0, 1]. Under these assumptions, the SDE

dXtk = f k (Xtk , t) dt + g k (Xtk , t) dWt (5.3.12)

will have a unique strong solution (Xtk )0≤t≤1 for each initial condition X0 = x0 , and we can therefore
take Xt = Xtk and ut = k(Xtk ) in (5.3.1). Of course, these are fairly severe restrictions, and we may
often settle for the existence of weak solutions (see Section 4.1.5) that are unique in probability law.
That is, for each x ∈ Rn , there exists some probability space (Ω, F, (Ft )0≤t≤1 , P ), a P -Brownian
motion (Wt )0≤t≤1 adapted to the filtration (Ft )0≤t≤1 , and two Ft -adapted processes (Xt )0≤t≤1 and
(ut )0≤t≤1 , such that (5.3.1) has a solution with X0 = x, i.e.,
Z t Z t
Xt = x + f (Xs , us ) ds + g(Xs , us ) dWs , 0 ≤ t ≤ 1. (5.3.13)
0 0

Moreover, the probability law of (Xt , ut )0≤t≤1 on the path space Ω = C([0, 1]; Rn × Rm ) is unique.
The adaptedness requirement is entirely natural as it expresses that the controls are causal, i.e., do
not make use of any information from the future. The uniqueness of the probability law of (Xt , ut )
is needed to ensure that the expected cost is well-defined. When the control is of state feedback
form, i.e., u = k(x, t) for some k : Rn × [0, 1] → Rm , the above discussion simply means that the
SDE (5.3.12) has a weak solution unique in probability law.
With all of this in mind, we will simply say that there is a class U of admissible input (or control )
processes u = (ut )0≤t≤1 , such that the controlled SDE (5.3.1) has a solution, in an appropriate sense,
for each initial condiotion X0 = x. We can then define, for each admissible control process u, each
x ∈ Rn , and each t ∈ [0, 1], the expected cost-to-go
"Z #
1
J(x, t; u) := E q(Xs , us ) ds + r(X1 ) Xt = x (5.3.14)

t

and the value function

V (x, t) := min J(x, t; u). (5.3.15)


u∈U

We say that ū ∈ U is optimal if it achieves the value function, i.e., if J(x, t; ū) = V (x, t) for all
x, t. Just like in the deterministic case, the computation of the value function relies on a dynamic

73
programming argument. The underlying idea, known as Bellman’s optimality principle, is as follows:
If u = (ut )0≤t≤1 is an optimal control process, then (us )t≤s≤1 is optimal on the time interval [t, 1],
so replacing the trajectory (us )t≤s≤1 with some other admissible trajectory (ũs )t≤s≤1 will result in
larger overall expected cost. This idea can be used as the basis for a rigorous derivation of a certain
PDE that the value function V should satisfy, known as the Hamilton–Jacobi–Bellman equation,
given in (5.3.17) below. In general, solving the HJB equation is a difficult task, although in some
cases we can guess a solution based on some structural considerations (we will see examples of this
later on). If our guess turns out to be correct, then we have obtained an optimal control. This is the
content of the so-called verification theorem.
First, some notation. To each fixed u ∈ Rm , we will associate a second-order linear differential
operator Au defined by
1
Au ϕ(x, t) := f (x, u)T ∇ϕ(x, t) + tr{g(x, u)g(x, u)T ∇2 ϕ(x, t)}. (5.3.16)
2
We can think of Au as the infinitesimal generator of a diffusion process with drift f (·, u) and diffusion
matrix g(·, u)g(·, u)T .
Theorem 16. Suppose that V : Rn × [0, 1] → R is a C 2,1 solution of the Hamilton–Jacobi–Bellman
equation

V (x, t) + minm Au V (x, t) + q(x, u) = 0, (x, t) ∈ Rn × [0, 1]

(5.3.17)
∂t u∈R

subject to the terminal condition V (x, 1) = r(x). Then we have the following:
1. For any admissible control ũ, any x, and any t,

J(x, t; ũ) ≥ V (x, t). (5.3.18)

2. If there exists a function k : Rn × [0, 1] → Rm , such that

Ak(x,t) V (x, t) + q(x, k(x, t)) = minm Au V (x, t) + q(x, u) , ∀(x, t) ∈ Rn × [0, 1] (5.3.19)

u∈R

and the control ū corresponding to the state feedback law ū = k(x, t) is admissible, then
J(x, t; ū) = V (x, t) for all (x, t) ∈ Rn × [0, 1], and therefore ū is optimal.
Proof. For the first item, let ũ ∈ U, x ∈ Rn , and t ∈ [0, 1] be given. Let (X̃s )t≤s≤1 be a solution of

dX̃s = f (X̃s , ũs ) ds + g(X̃s , ũs ) dWs , t≤s≤1 (5.3.20)

with X̃t = x. By Itô’s rule,


Z t 
V (X̃1 , 1) = V (X̃t , t) + V̇ (X̃s , s) + Aũs V (X̃s , s) ds + Mt,1 (5.3.21)
0
Z t 
= V (x, t) + V̇ (X̃s , s) + Aũs V (X̃s , s) ds + Mt,1 , (5.3.22)
0
R1
where Mt,1 = t ∇V (X̃s , ũs )T g(X̃s , ũs ) dWs has zero mean. Since V solves the HJB equation
(5.3.17),

V̇ (x, s) + Au V (x, s) ≥ −q(x, u), ∀x, u, s. (5.3.23)

74
Therefore,
Z t  Z t
V̇ (X̃s , s) + Aũs V (X̃s , s) ds ≥ − q(X̃s , ũs ) ds. (5.3.24)
0 0

Since V (X̃1 , 1) = r(X̃1 ), we have


"Z #
1
E[r(X̃1 )|X̃t = x] ≥ V (x, t) − E q(X̃s , ũs ) ds X̃t = x , (5.3.25)

t

so we get (5.3.18) after rearranging terms. The same argument can be used to prove the second item
since now

V̇ (x, s) + Ak(x,s) V (x, s) = −q(x, k(x, s)), ∀x, s (5.3.26)

by hypothesis on k.

The verification theorem gives a sufficient condition for optimality, and its application hinges on the
ability to guess a solution of the HJB equation in a given setting (otherwise, we resort to numerical
methods). We next discuss a couple of examples when a solution can be obtained.

5.3.1 The linear quadratic regulator problem


Consider the controlled SDE

dXt = A(t)Xt dt + B(t)ut dt + F (t) dWt , 0 ≤ t ≤ 1, (5.3.27)

where A(t), B(t), and F (t) are matrices of appropriate shapes. This is a time-varying system, but
it is easy to generalize the above discussion to the time-varying case with f (x, u, t) and g(x, u, t)
instead of f (x, u) and g(x, u), q(x, u, t) instead of q(x, u), Aut instead of Au , etc. The HJB equation
will then take the form

V (x, t) + minm {Aut V (x, t) + q(x, u, t)} = 0, (x, t) ∈ Rn × [0, 1] (5.3.28)
∂t u∈R

with the terminal condition V (x, 1) = r(x).


We wish to find an admissible control for (5.3.27) to minimize the expected cost
"Z #
1
1 T T
 T

J(x, t; u) = E Xs P (s)Xs + us Q(s)us ds + X1 RX1 Xt = x (5.3.29)

2 t

where the matrices P (·), R are symmetric and positive semidefinite and the matrices Q(·) are
symmetric and positive definite. Thus, we have

f (x, u, t) = A(t)x + B(t)u, g(x, u, t) = F (t); (5.3.30)


1  1
q(x, u, t) = xT P (t)x + uT Q(t)u , r(x) = xT Rx, (5.3.31)
2 2

75
which comes with the family of infinitesimal generators
1
Aut ϕ(x, t) = A(t)x + B(t)u)T ∇ϕ(x, t) + tr{F (t)F (t)T ∇2 ϕ(x, t)} (5.3.32)
2
1
= xT A(t)T ∇ϕ(x, t) + tr{F (t)F (t)T ∇2 ϕ(x, t)} + uT B(t)T ∇ϕ(x, t). (5.3.33)
2
Now, for any ξ ∈ Rn ,
 
1 T 1
min u B(t) ξ + u Q(t)u = − ξ T B(t)Q(t)−1 B(t)T ξ
T T
(5.3.34)
u∈Rm 2 2

where the minimum is attained uniquely at ū(ξ, t) := −Q(t)−1 B(t)T ξ. Using this, we can write
down the HJB equation:
∂ 1
V (x, t) = −xT A(t)T ∇V (x, t) + ∇V (x, t)T B(t)Q(t)−1 B(t)T ∇V (x, t)
∂t 2
1 1
− tr{F (t)F (t)T ∇2 V (x, t)} − xT P (t)x (5.3.35)
2 2
with the terminal condition V (x, 1) = 12 xT Rx. Given the structure of the problem, it is reasonable
to guess the quadratic form for V (x, t):
1
V (x, t) = xT M (t)x + q(t), (5.3.36)
2
where M (t) are symmetric, positive semidefinite matrices and q(t) are nonnegaitve reals. The
terminal condition V (x, 1) = 12 xT Rx then leads to M (1) = R and q(1) = 0. Substituting

1
V̇ (x, t) = xT Ṁ (t)x + q̇(t), ∇V (x, t) = M (t)x, ∇2 V (x, t) = M (t) (5.3.37)
2
into (5.3.35) leads to
1 T 1  1
x Ṁ x + q̇ = − xT AT M + M A − M BQ−1 B T M + P − tr{F F T M } (5.3.38)
2 2 2
(we have omitted the time variable to avoid notational clutter). This gives us ODEs for M (t) and
for q(t):

Ṁ = −AT M − M A + M BQ−1 B T M − P, M (1) = R (5.3.39)

and
1
q̇ = − tr{F F T M }, q(1) = 0. (5.3.40)
2
The equation for M (t) is of the familiar Riccati form, while the one for q(t) can be solved in terms
of M (t):
Z 1
1
q(t) = tr{F (s)F (s)T M (s)} ds. (5.3.41)
2 t

76
Moreover, if we define the matrices K(t) := Q(t)−1 B(t)T M (t), then it is straightforward to show
that
−K(t)x
At V (x, t) + q(x, −K(t)x, t) = minm {Aut V (x, t) + q(x, u, t)}, (5.3.42)
u∈R

so the state feedback control ut = −K(t)x is optimal. The minimum value of the control cost, for a
given initial condition X0 = x, is then equal to
Z 1
1 1
V (x, 0) = xT K(0)x + tr{F (t)F (t)T K(t)} dt. (5.3.43)
2 2 0

5.3.2 Fleming’s logarithmic transformation and the Schrödinger bridge prob-


lem
Consider the n-dimensional controlled SDE

dXt = ut dt + dWt , 0 ≤ t ≤ 1. (5.3.44)

Given the initial condition X0 = x, we wish to find a control process u = (ut )0≤t≤1 to minimize the
expected cost
" Z #
1 1
J(x; u) := E |ut |2 dt + r(X1 ) X0 = x . (5.3.45)

2 0

The corresponding cost-to-go is then


" Z #
1 1 2

J(x, t; u) = E |us | ds + r(X1 ) Xt = x . (5.3.46)

2 t

Here, the effect of the control is to add


R a drift to a Brownian motion process; the overall control
1 1 2
cost consists of the ‘control effort’ 2 0 |ut | dt and a terminal cost r(X1 ). Thus, we wish to find a
drift that achieves the best trade-off between the expected control effort (how much perturbation is
added to the Brownian motion) and the expected terminal cost.
In this set-up, we have f (x, u) = u, g(x, u) = In , q(x, u) = 12 |u|2 , and Au ϕ = uT ∇ϕ + 12 ∆ϕ.
Since
n 1 o 1
minn uT ξ + |u|2 = − |ξ|2 , (5.3.47)
u∈R 2 2

the HJB equation takes the form

∂ 1 1
V (x, t) + ∆V (x, t) = |∇V (x, t)|2 , (x, t) ∈ Rn × [0, 1] (5.3.48)
∂t 2 2

with the terminal condition V (x, 1) = r(x). In order to apply the verification theorem, we need to
come up with a solution of (5.3.48), which, on the face of it, looks like a difficult task. However, as
first shown by W. Fleming, a simple logarithmic transformation suffices to reduce the problem to

77
solving a linear second-order PDE. To that end, let us consider the function h(x, t) := e−V (x,t) , or
V (x, t) = − log h(x, t). It is straightforward to compute

∂ ∂
h(x, t) = −h(x, t) V (x, t), (5.3.49)
∂t ∂t
∂ ∂
h(x, t) = −h(x, t) V (x, t), (5.3.50)
∂xi ∂xi
∂2 ∂  ∂ 
h(x, t) = − h(x, t) V (x, t)
∂x2i ∂xi ∂xi
∂ ∂ ∂2
=− h(x, t) V (x, t) − h(x, t) 2 V (x, t)
∂xi ∂xi ∂xi
 ∂ 2 ∂ 2 
h(x, t) V (x, t) − 2 V (x, t) , (5.3.51)

∂xi ∂xi

which gives

∆h(x, t) = h(x, t) |∇V (x, t)|2 − ∆V (x, t) .



∇h(x, t) = −h(x, t)∇v(x, t), (5.3.52)

Using these relations, we can obtain a PDE for h:


∂ ∂
h(x, t) = −h(x, t) V (x, t) (5.3.53)
∂t ∂t
1
= h(x, t) ∆V (x, t) − |∇V (x, t)|2

(5.3.54)
2
1
= − ∆h(x, t), (5.3.55)
2

with the terminal condition h(x, 1) = e−r(x) . It is convenient to reverse the time by defining
h̃(x, t) := h(x, 1 − t), so that h̃(x, 0) = e−r(x) and

∂ 1
h̃(x, t) + ∆h̃(x, t) = 0. (5.3.56)
∂t 2
This can be solved immediately using the Feynman–Kac formula:

h̃(x, t) = E[e−r(x+Wt ) ], (5.3.57)

where (Wt )0≤t≤1 is a standard n-dimensional Brownian motion. Since Wt ∼ N(0, tIn ), we can express
this as a Gaussian integral: If Z ∼ N(0, In ), then
√ 1
Z √
−r(x+ tZ) −r(x+ tz) −|z|2
h̃(x, t) = E[e ]= e e dz. (5.3.58)
(2π)n/2 Rn

Consequently,

V (x, t) = − log h(x, t) (5.3.59)


= − log h̃(x, 1 − t) (5.3.60)

= − log E[e−r(x+ 1−tZ)
], (5.3.61)

78
and the state feedback control ut = −∇V (x, t) is optimal. The minimum expected cost is given by

V (x, 0) = − log E[e−r(x+Z) ]. (5.3.62)

An interesting application of this control problem is as follows: Let a strictly positive probability
density p on Rn be given. Consider all adapted control processes u = (ut )0≤t≤1 , such that p is the
probability density of
Z 1
X1 = ut dt + W1 , (5.3.63)
0

R1
and among these find the control that minimizes the expected effort E[ 12 0 |ut |2 dt]. This is an
instance of the so-called Schrödinger bridge problem, where the goal is to steer the stochastic evolution
of a system of particles between t = 0 and t = 1 to a given distribution at t = 1, while making sure
that the trajectories of the particles are as close to Brownian as possible. To see how this problem
2
connects to the above optimal control problem, let γ(x) = (2π)1n/2 e−|x| /2 denote the standard
Gaussian density on Rn and consider the function

p(x) 1 n
r(x) := − log = − log p(x) + |x|2 + log(2π). (5.3.64)
γ(x) 2 2

(since p > 0 everywhere, r is well-defined). Then E[e−r(Z) ] = 1 for Z ∼ N(0, In ), and therefore
" Z #
1 1
min E |ut |2 dt + r(X1 ) = − log E[e−r(Z) ] = 0. (5.3.65)
u 2 0

Now, if u = (ut )0≤t≤1 is any control that ensures X1 ∼ p, then


" Z # " Z # Z
1 1 1 1
E |ut |2 dt + r(X1 ) = E |ut |2 dt + r(x)p(x) dx (5.3.66)
2 0 2 0 Rn
" Z # Z
1 1 2 p(x)
=E |ut | dt − p(x) log dx (5.3.67)
2 0 Rn γ(x)
" Z #
1 1 2
=E |ut | dt − D(pkγ), (5.3.68)
2 0

where D(pkγ) is the relative entropy (or Kullback–Leibler divergence) between the densities p and γ,
so we get the following lower bound on the required control effort:
" Z #
1 1
u gives X1 ∼ p ⇒ E |ut |2 dt ≥ D(pkγ). (5.3.69)
2 0

Moreover, the control that achieves the minimum in (5.3.65) also solves our optimum steering
problem, i.e., attains equality in (5.3.69); this will be shown in a homework problem.

79
5.4 Optimal control with partial observations
We now consider the setting where we only have noisy observations of the state in (5.3.1), i.e., we
have the system
dXt = f (Xt , ut ) dt + g(Xt , ut ) dWt (5.4.1a)
dYt = h(Xt , t) dt + dVt , (5.4.1b)
where (Wt ) and (Vt ) are two independent Brownian motion processes. We still face the problem of
minimizing the expected cost
"Z #
1
J(u) = E q(Xt , ut ) dt + r(X1 ) , (5.4.2)
0

where the initial condition X0 is now random. But now the control u has to be adapted to the
filtration (FtY ) generated by the observation process (Yt ). In fact, we have to be a bit more careful
here, since the choice of the control influences the state process and hence the observation process,
which means that the relevant filtration will generally depend on u as well. In other words, we
should be more precise and indicate the dependence on the control everywhere, as in
dXtu = f (Xtu , ut ) dt + g(Xtu , ut ) dWt (5.4.3a)
dYtu = h(Xtu , ut ) dt + dVt , (5.4.3b)

and the control u is constrained to be adapted to the filtration (FtY,u ) generated by (Ytu ).
The obvious step here is to make use of the nonlinear filter πtu (·), where for any function ψ(x, u)
we have
πtu [ψ(·, ut )] = E[ψ(Xtu , ut )|FtY,u ]. (5.4.4)
Using this definition and the law of iterated expectations, we can express the expected cost J(u) as
"Z #
1
J(u) = E πtu (q(·, ut )) dt + π1u (r) , (5.4.5)
0

where u is any (FtY,u )-adapted control. This reframes the problem of controlling the hidden state
Xtu on the basis of observations Ytu into one with the fully observed state given by πtu , and the
control can now be based on the filter output. This idea is known as the separation principle, where
the problem of choosing the control is separated from the problem of estimating the hidden state by
filtering the noisy observations. However, any gain due to this reframing is illusory, since now we
are confronted with the problem of computing the nonlinear filter, which is already quite difficult
(and generally infinite-dimensional). Thus, in general, one has to resort to numerical approximations.
The case of linear SDEs for Xt and Yt , quadratic costs, and Gaussian initial condition is a pleasant
exception, although even there we have to be careful.

5.4.1 The linear-quadratic-Gaussian (LQG) control problem


Let us consider the following special case of (5.4.1):
dXt = A(t)Xt dt + B(t)ut dt + F (t) dWt (5.4.6a)
dYt = H(t)Xt dt + G(t) dVt , (5.4.6b)

80
with Gaussian initial condition X0 ∼ N(m0 , K0 ). Here, as before, (Wt ) and (Vt ) are two independent
Brownian motion processes of appropriate dimensions that are also independent of X0 . Also, we
assume that the matrices G(t)G(t)T are positive definite for all t, just like in the Kalman–Bucy
filter setting. We are not indicating the dependence on u explicitly, as we did in (5.4.3) to keep the
notation clean, but it should be kept in mind. The goal is to minimize the expected cost
"Z #
1
1 
J(u) = E XtT P (t)Xt dt + uT T
t Q(t)ut dt + X1 RX1 , (5.4.7)
2 0

where the matrices P (t), Q(t), R satisfy the same conditions as in Section 5.3.1, but now the control
u must be adapted to FtY,u . We will derive the solution shortly, but, in a nutshell, the idea is to
obtain an estimate X̂t of the state Xt using the Kalman filter, and then solve the fully observed
LQR problem for controlling X̂t , with the solution given in the estimated state feedback form
ut = −K(t)X̂t . This is, of course, a manifestation of the separation principle: The task of estimating
the state and the task of choosing the control are separated.
To begin the analysis, let us first consider the case of zero controls (obtained by setting ut ≡ 0 in
(5.4.6a)) which is described by the state equation

dXt0 = A(t)Xt0 dt + F (t) dWt (5.4.8)

and the observation equation

dYt0 = H(t)Xt0 dt + G(t) dVt (5.4.9)

with the initial conditions X00 = X0 , Y00 = 0. Let FtY,0 denote the σ-algebra generated by (Ys0 )0≤s≤t .
Then we can write down the Kalman–Bucy filter equation for X̂t0 = E[Xt0 |FtY,0 ]:

dX̂t0 = A(t)X̂t0 dt + L(t) dYt0 − H(t)X̂t0 dt



(5.4.10)

with the initial condition X̂00 = m0 , and let Π(t) be the error covariance E[(X̂t0 − Xt0 )(X̂t0 − Xt0 )T ].
If we now define the processes Xt1 , Yt1 by

dXt1 = A(t)Xt1 + B(t)ut dt,



(5.4.11)
dYt1 = H(t)Xt1 dt (5.4.12)

with X01 = 0 and Y01 = 0, then evidently

Xt = Xt0 + Xt1 , Yt = Yt0 + Yt1 . (5.4.13)

We then have the following result:


Lemma 4. Suppose that the control u is adapted to FtY,0 , and the σ-algebras FtY,0 and FtY,u are
equal for all t. Then:
1. The conditional distribution of Xt given FtY,u is Gaussian, with conditional mean X̂t =
E[Xt |FtY,u ] = X̂t0 + Xt1 and conditional covariance Π(t).

2. The conditional mean X̂t satisfies the SDE

dX̂t = A(t)X̂t + B(t)ut dt + L(t) dYt0 − H(t)X̂t0 dt .


 
(5.4.14)

81
Proof. The proof is straightforward: Since the σ-algebras FtY,0 and FtY,u are equal, conditional
expectations given FtY,u and FtY,0 are equal. By the same token, since the control ut is adapted to
FtY,0 , so are Xt1 and Yt1 . Everything then follows from the analysis of the Kalman–Bucy filter.

Let u satisfy the conditions of the Lemma. We can rewrite the expected cost of u in terms of X̂t
and the error X̃t = Xt − X̂t . For the latter, we have

Xt − X̂t = Xt0 + Xt1 − (X̂t0 + Xt1 ) = Xt0 − X̂t0 , (5.4.15)

so X̃t is zero-mean and independent of X̂t and ut . Then


"Z # "Z #
1  1
T T T T T
J(u) = E X̂t P (t)X̂t + ut Q(t)ut dt + X̂1 RX̂1 +E X̃t P (t)X̃t dt + X̃1 RX̃1 (5.4.16)
0 0
| {z }
=:J 0 (u)
Z 1
ˆ
= J(u) + tr{Π(t)P (t)} dt + tr{Π(1)R}, (5.4.17)
0

where the last two terms are nonrandom and determined by the error covariance matrices Π(t). The
modified cost J 0 (u) involves the Kalman–Bucy filter output X̂t and the control ut , and this is an
instance of Rthe linear quadratic regulator since the forward differential of the innovations process
t
νt0 := Yt0 − 0 H(s)X̂s0 ds can be written as
p
dνt0 = (G(t)G(t)T )−1 dŴt0 (5.4.18)

for some Brownian motion process Ŵt0 adapted to FtY,0 .

5.5 Problems
1. Consider a linear SDE

dXt = A(t)Xt dt + G(t) dWt (5.5.1)

for an n-dimensional diffusion process (Xt )t≥0 driven by an m-dimensional Brownian motion (Wt )t≥0 ,
where A(t) ∈ Rn×n and G(t) ∈ Rn×m are time-varying matrices. Show that, for any 0 ≤ s ≤ t,
Z t
Xt = Φ(t, s)Xs + Φ(t, r)G(r) dWr , (5.5.2)
s

where Φ(t, s) is the fundamental matrix of the linear time-varying system ẋ(t) = A(t)x(t), i.e., it
solves the matrix ODE
d
Φ(t, s) = A(t)Φ(t, s), t≥s (5.5.3)
dt

with the initial condition Φ(s, s) = In .

82
2. Let X̃t and νt be the state estimation error and the innovation processes of the Kalman–Bucy
filter, cf. Section 5.1. Prove that X̃t is orthogonal to (νs )0≤s≤t , i.e.,

E[X̃t νsT ] = 0, 0 ≤ s ≤ t. (5.5.4)

Hint: Use (5.1.15) and Problem 1.

3. Consider the Schrödinger bridge problem discussed in Section 5.3.2. For t > 0, let γt (x) :=
2
1
(2πt)n/2
e−|x| /2t .

(i) For the SDE

dXt = −∇V (Xt , t) dt + dWt , (5.5.5)

where V (x, t) is the solution of the HJB equation (5.3.48), show that the density pt (x) =
(1/Ct )γt (x)e−V (x,t) , where Ct are the normalization constants, solves the forward Kolmogorov
equation
∂  1
pt (x) = ∇ · ∇V (x, t)pt (x) + ∆pt (x), 0≤t≤1 (5.5.6)
∂t 2
with the initial condition p0 (x) = δ(x).
Note: that normalization Ct constant is important!
p(x)
(ii) If the terminal cost r(x) is equal to − log γ(x) , where γ(x) = γ1 (x) is the standard Gaussian
n
density on R , show that X1 (with initial condition X0 = 0) has density p and thus show that
the feedback control ut = −∇V (Xt , t) attains equality in (5.3.69).

83
Chapter 6

Optimization

Function optimization is a basic task in a variety of engineering applications and settings. A general
optimization problem can be formulated as follows:

minimize f (x)
subject to x ∈ C

where f is a real-valued function defined on some set X and where C ⊆ X is a constraint set. Some
examples are:

1. Minimizing a linear function f (x) = cT x of x ∈ Rn , where c ∈ Rn is a fixed vector, subject to


linear inequality constraints of the form

aT
i x ≤ bi , 1 ≤ i ≤ m. (6.0.1)

This is the classical problem of linear programming, common in such settings as resource
allocation.

2. Statistical estimation problems involve minimizing quadratic functions of the form f (x) =
xT P x + cT x subject to various constraints, e.g., linear inequality constraints as in (6.0.1).

3. Stochastic optimization problems, where f (x) has the form of an expectation w.r.t. some
random variable Z, i.e., f (x) = E[F (x, Z)]. For example, if ξ is a random couple (ξ, Y ), where
ξ is a vector of features and Y is a vector of responses, the goal is to predict Y on the basis of
ξ, and we have a family of candidate predictors Ŷ = h(ξ, x), then we could take

F (x, Z) = F (x, (ξ, Y )) = |Y − h(ξ, x)|2 .

4. Optimal control problems of the sort considered in Chapter 5 are infinite-dimensional opti-
mization problems: The set X is (a subset of) some function space.

5. Other infinite-dimensional optimization problems, e.g., when X is a set of probability measures


on some sample space Ω, such as the path space Ω = C([0, T ]; Rd ).

In this chapter, we will explore the use of SDEs in the context of certain types of optimization
problems.

84
6.1 Langevin dynamics
Consider the problem of globally minimizing a differentiable function f : Rn → R, i.e., the goal is to
find a point x̄ ∈ Rn such that f (x̄) ≤ f (x) for all x ∈ Rn . Assuming that at least one such global
minimizer x̄ exists (at the very least, we would assume that f is bounded from below), it would
have to satisfy ∇f (x̄) = 0. (Note that this is only a necessary condition for (local) optimality; if f is
twice differentiable, then x̄ is a (local) minimizer if and only if ∇f (x̄) = 0 and the Hessian ∇2 f (x̄)
is positive semidefinite.) This observation underlies the use of gradient descent methods: Pick some
initial point x0 ∈ Rn and generate the iterates

xk+1 = xk − ηk ∇f (xk ), k = 0, 1, . . . (6.1.1)

where (ηk )k≥0 is a decreasing sequence of positive step sizes. If the step sizes are sufficiently small,
then we can consider a continuous-time idelization of (6.1.1), the gradient flow

ẋ(t) = −∇f (x(t)), t≥0 (6.1.2)

A simple calculation shows that the value of f (x(t)) decreases along the trajectory of (6.1.2):
d
f (x(t)) = ∇f (x(t))T ẋ(t) (6.1.3)
dt
= −|∇f (x(t))|2 (6.1.4)
≤ 0. (6.1.5)

However, unless the function f has additional properties, such as convexity, there can be no guarantee
that x(t) will converge, let alone converge to a global minimizer. In fact, depending on the initial
condition x(0), the trajectory x(t) may get trapped in the neighborhood of some local minimizer.
Intuitively, the reason for this behavior is that gradient descent methods are, by their nature,
local: At each time t, the negative gradient −∇f (x(t)) points in the direction of greatest initial
decrease of f (thus the name “steepest descent” that is often attached to gradient methods), but
only the points x(t) in a small vicinity of the trajectory of (6.1.2) can be explored in this way. A
way to get around this is to inject some stochasticity into the dynamics and consider, instead of
(6.1.2), the SDE

dXt = −∇f (Xt ) dt + 2ε dWt , (6.1.6)

where ε > 0 is a small parameter, typically referred to as the temperature. The SDE (6.1.6) is called
the (overdamped) Langevin dynamics with potential f . The name comes from the work of Paul
Langevin on Brownian motion, where the quadratic potential f (x) = c|x|2 was used to model the
frictional forces in the surrounding medium. To provide some motivation for the form of (6.1.6),
consider the following stochastic relaxation of the problem of minimizing f : Instead of searching for
a global minimizer x̄, let us sample a random point X̄ from the density
1  1 
π ε (x) := ε exp − f (x) , (6.1.7)
Z ε
where Z ε := Rn e−(1/ε)f (x) dx is the normalization constant (we assume that f (x) is sufficiently
R

well-behaved as |x| → ∞ for Z ε to be finite). This explains why we have used the term ‘temperature’
to refer to the parameter ε: In statistical physics, probability densities of the form (6.1.7) are referred

85
to as Gibbs densities and are used to model the behavior of large systems in thermal equilibrium,
where f (x) is the ‘energy’ of a ‘system configuration’ x ∈ Rn and ε is the absolute temperature.
Under some regularity conditions on f , it can be shown that
C
E[f (X̄)] ≤ min f (x) + nε log (6.1.8)
x nε
for some constnat C > 0 that depends on f . Moreover, π ε is an equilibrium density of (6.1.6), i.e.,
if X0 ∼ π ε , then Xt ∼ π ε for all t. To see this, let us look at the Fokker–Planck equation for the
density ρt (x) of Xt when X0 is sampled from some density ρ0 :
∂ 
ρt (x) = ∇ · ∇f (x)ρt (x) + ε∆ρt (x). (6.1.9)
∂t
Since ∇ log π ε (x) = − 1ε ∇f (x), it follows that π ε is a stationary solution of (6.1.9). Moreover, under
certain regularity conditions on f , it is possible to show that the probability density of Xt converges
to π ε as t → ∞ for any initial density ρ0 of X0 . While the precise argument requires a great deal of
functional analysis, we can give a brief sketch illustrating the basic ideas.

6.1.1 Convergence to equilibrium and the spectral gap


Let pt (x0 , x) denote the transition density of (6.1.6), i.e.,
Z
E[ϕ(Xt )|X0 = x0 ] = ϕ(x)pt (x0 , x) dx (6.1.10)
Rn

for all smooth functions ϕ : Rn → R. It satisfies the Fokker–Planck equation


∂ 
pt (x0 , x) = ∇ · ∇f (x)pt (x0 , x) + ε∆pt (x0 , 0) (6.1.11)
∂t
with the initial condition limt→0 pt (x0 , x) = δ(x − x0 ). (As usual, the operators ∇ and ∆ act on the
x variable, while x0 remains fixed.) Then, for sufficiently well-behaved f , it can be shown that we
can express pt (x0 , x) as

X
pt (x0 , x) = π ε (x) e−λi t ψi (x)ψi (x0 ), (6.1.12)
i=0

where λi ≥ 0 and ψi are the eigenvalues and the normalized eigenfunctions of the Sturm–Liouville
equation
ε∇ · π ε (x)∇ψ(x) + λπ ε (x)ψ(x) = 0,

(6.1.13)
i.e., both the function ψ and the constant λ must be solved for, and, since (6.1.13) is linear in ψ, we
require that
Z
π ε (x)|ψ(x)|2 = 1. (6.1.14)
Rn

The basic idea is to attempt to solve (6.1.11) using separation of variables. To that end, consider an
ansatz of the form h(x, t) = g(t)π ε (x)ψ(x). Then
∂ d
h(x, t) = π ε (x)ψ(x) g(t), (6.1.15)
∂t dt

86
while
!
 
∇f (x) · h(x, t) + ε∇h(x, t) = g(t) ψ(x) π ε (x)∇f (x) + ε∇π ε (x) + επ ε (x)∇ψ(x) (6.1.16)
| {z }
=0
= εg(t)π ε (x)∇ψ(x). (6.1.17)

Thus, if h(x, t) is a solution of (6.1.11), then it has to satisfy


1 d 1
ε∇ · π ε (x)∇ψ(x)

g(t) = ε (6.1.18)
g(t) dt π (x)ψ(x)
The left-hand side of (6.1.18) is a function of t only, while the right-hand side is a function of x only.
Thus, they are both equal to some constant −λ, so we can take g(t) = e−λt , which implies that ψ
and λ must satisfy (6.1.13).
Problems of this type are studied in functional analysis, and in particular in the theory of
Schrödinger operators, named this way because they appear in the context of Schrödinger’s equation
of quantum mechanics. Using the theory of Schrödinger operators, it is possible to show that there
are only countably many solutions of (6.1.13), i.e., the eigenfunction-eigenvalue pairs (ψi , λi ). It
is immediate that ψ0 (x) ≡ 1 is an eigenfunction with λ0 = 0, and it can be shown that all other
eigenvalues λ1 ≤ λ2 ≤ . . . are positive. The first nonzero eigenvalue λ1 is important because it
controls the rate at which pt (x0 , x) converges to π ε (x) as t → ∞. To see this, let us write


X
ε −λ t

|pt (x0 , x) − π (x)| = e i ψi (x)ψi (x0 ) (6.1.19)
i=1


X
−λ1 t −(λi −λ1 )t

=e e ψi (x)ψi (x0 ) (6.1.20)
i=1

Now,


X
ε −λi 2

|p1 (x, x) − π (x)| =
e ψi (x) , (6.1.21)
i=1

so if we define

k(x) := eλ1 |p1 (x, x) − π ε (x)| (6.1.22)


X∞
= e−(λi −λ1 ) ψi2 (x) (6.1.23)
i=1

then, for t > 1,


v
∞ u∞ ∞
X −(λ −λ )t uX X

e i 1
ψi (x)ψi (x0 ) ≤
t e −(λ i −λ 1 )t 2
ψi (x) e−(λj −λ1 )t ψj2 (x0 ) (6.1.24)
i=1 i=1 j
p
≤ k(x)k(x0 ) (6.1.25)
≤ sup k(x), (6.1.26)
x∈Rn

87
where the first inequality uses Cauchy–Schwarz. Thus, for t > 1,

|pt (x0 , x) − π ε (x)| ≤ Ke−λ1 t , K := sup k(x) (6.1.27)


x∈Rn

where we assume that f is such that the constant K defined above is finite. Then, if ρ0 is the initial
density of X0 , we have
Z
ε ε

|ρt (x) − π (x)| = pt (x0 , x)ρ0 (x0 ) dx0 − π (x) (6.1.28)
n
ZR
≤ ρ0 (x0 )|pt (x0 , x) − π ε (x)| dx0 (6.1.29)
Rn
−λ1 t
≤ Ke . (6.1.30)

The constant λ1 can be characterized as follows. Let S be the space of all C 1 functions ϕ : Rn → R,
such that
Z Z
ε
π (x)ϕ(x) dx = 0 and π ε (x)|ϕ(x)|2 dx < ∞. (6.1.31)
Rn Rn

It is easy to see that S is a vector space, and all the eigenfunctions ψ1 , ψ2 , . . . are the elements of S.
Now, if we consider (6.1.13) for (ψi , λi ), i ≥ 1, multiply both sides by ψi and integrate, we get
Z
ψi (x)∇ · π ε (x)∇ψi (x) dx

λi = −ε (6.1.32)
Z Rn

=ε π ε (x)|∇ψi (x)|2 dx, (6.1.33)


Rn

where the second step follows after integrating by parts (and assuming that ψi , π ε decay sufficiently
rapidly as |x| → ∞). Thus, for any ϕ ∈ S,

ε 2
R
n π (x)|∇ϕ(x)| dx
λ1 ≤ ε RR ε 2
, (6.1.34)
Rn π (x)|ϕ(x)| dx

but, since equality holds for ϕ = ψ1 , we obtain

ε 2
R
n π (x)|∇ϕ(x)| dx
λ1 = ε min RR ε 2
. (6.1.35)
Rn π (x)|ϕ(x)| dx
ϕ∈S

The constant λ1 is known as the Poincaré constant (or spectral gap) associated to the Langevin
diffusion (6.1.6), and there exist many methods for obtaining upper and lower bounds for it. Generally,
it will scale rather poorly with problem dimension n and inverse temperature 1/ε—e.g., if f has
multiple minima as well as saddle points (i.e., the points x such that ∇f (x) = 0 but the Hessian
∇2 f (x) has both positive and negative eigenvalues), then the spectral gap will scale like e−n in n
and like e−1/ε in ε. When f has enough ‘curvature,’ however (in particular, if it has no local minima,
and thus all minima are global), much better dependence on n and 1/ε can be obtained.

88
6.1.2 The relative entropy and the logarithmic Sobolev inequality
Another way to quantify the convergence of ρt to π ε is through an information-theoretic quantity
known as the relative entropy or the Kullback–Leibler divergence. For our purposes, the following
definition can be given: Let ρ be a probability density on Rn . Then the relative entropy between ρ
and π ε is given by
Z
ε ρ(x)
D(ρkπ ) := ρ(x) log dx. (6.1.36)
Rn π ε (x)

Detailed discussions of this quantity can be found in many texts on information theory; here we note
the important property that D(ρkπ ε ) is always nonnegative, and it is equal to zero if and only if
ρ = π ε . Another important quantity is the relative Fisher information between ρ and π ε , defined as
Z
ε
ρ(x) 2
I(ρkπ ) := ρ(x) ∇ log ε dx, (6.1.37)

Rn π (x)

which is evidently nonnegative and vanishes if and only if ∇ log πρε ≡ 0.


Let us now consider the Langevin diffusion (6.1.6) with X0 having a density ρ0 . The evolution
of the densities ρt is governed by the Fokker–Planck equation (6.1.9), and it can be shown that ρt
is positive everywhere if ρ0 is. It will be convenient to express the Fokker–Planck equation in the
following form:


ρt (x) = −∇ · jt (x), (6.1.38)
∂t

where

jt (x) := −ρt (x)∇f (x) − ε∇ρt (x) (6.1.39)

is referred to as the probability flux or current. In equilibrium, i.e., when ρt = π ε for all t, the
probability flux vanishes. Hence, writing ρt = π ε · πρtε , we have

ρt (x)  ε ε
 ρt (x)
jt (x) = − π (x)∇f (x) + ε∇π (x) − επ ε (x)∇ ε (6.1.40)
π ε (x) | {z } π (x)
=0
ρt (x)
= −επ ε (x)∇ ε (6.1.41)
π (x)
ρt (x)
= −ερt (x)∇ log ε . (6.1.42)
π (x)

Let us now differentiate D(ρt kπ ε ) with respect to time:


Z
d d ρt (x)
D(ρt kπ ε ) = ρt (x) log dx (6.1.43)
dt dt Rn π ε (x)
Z Z
∂ ρt (x) ∂ ρt (x) 
= ρt (x) · log ε dx + ρt log ε dx. (6.1.44)
Rn ∂t π (x) Rn ∂t π (x)

89
The second integral is identically zero:
Z Z
∂ ρt (x)  ∂
ρt log ε dx = ρt log ρt (x) dx (6.1.45)
Rn ∂t π (x) n ∂t
ZR

= ρt ρt (x) dx (6.1.46)
Rn ∂t
Z
d
= ρt (x) dx (6.1.47)
dt Rn
= 0. (6.1.48)
Thus, using (6.1.38) and integrating by parts, we can write
Z
d ∂ ρt (x)
D(ρt kπ ε ) = ρt (x) · log ε dx (6.1.49)
dt Rn ∂t π (x)
Z   ρt (x)
=− ∇ · jt (x) log ε dx (6.1.50)
Rn π (x)
Z
ρt (x)
= jt (x)T ∇ log ε dx (6.1.51)
Rn π (x)
Z ρt (x) 2
= −ε ρt (x) ∇ log ε dx. (6.1.52)

Rn π (x)
Finally, using the definition of the relative Fisher information, we obtain the following important
formula:
d
D(ρt kπ ε ) = −εI(ρt kπ ε ). (6.1.53)
dt
This shows that the relative entropy D(ρt kπ ε ) between the density ρt of Xt and the equilibrium
density π ε decreases with t; however, just as in the case of gradient descent, further conditions are
needed on π ε to ensure that it converges to 0. One such sufficient condition is as follows: There
exists a constant c > 0, such that

D(ρkπ ε ) ≤ I(ρkπ ε ) (6.1.54)
2
for all densities ρ. When (6.1.54) holds, we say that π ε satisfies a logarithmic Sobolev inequality with
1 2
constant c. For example, if f (x) = 12 |x|2 , so that π ε (x) ∝ e− 2ε |x| , (6.1.54) holds with c = 1. So, if
π ε satisfies the log-Sobolev inequality with constant c, then
d 2
D(ρt kπ ε ) ≤ − D(ρt kπ ε ), (6.1.55)
dt c
and in this case the relative entropy between ρt and π ε decays exponentially:
D(ρt kπ ε ) ≤ e−2t/c D(ρ0 kπ ε ). (6.1.56)

6.1.3 Simulated annealing in continuous time


As mentioned earlier, the expected value
Z Z
1 1
¯ε
f := ε
f (x)π (x) dx = ε f (x)e− ε f (x) dx (6.1.57)
Rn Z Rn

90
converges to the global minimum of f as the temperature parameter ε → 0. Moreover, for all
sufficiently large t (e.g., t  c, where c is the log-Sobolev constant of π ε ), the probability density ρt
of Xt will be sufficiently close to π ε . This suggests that we could eventually approach the minimum
value of f by running a Langevin dynamics with time-dependent temperature ε(t):
p
dXt = −∇f (Xt ) dt + 2ε(t) dWt . (6.1.58)

The idea is that, on the one hand, the temperature ε(t) would vary sufficiently slowly with time so
that E[f (Xt )] ≈ f¯ε(t) for t large enough, but would also monotonically converge to 0 as t → ∞, so
that

f¯ε(t) → f¯0 ≡ min f, as t → ∞. (6.1.59)

This is the basis of so-called simulated annealing, first proposed in the context of discrete-time
optimization by Kirkpatrick, Gelatt, and Vecchi in 1983. The term “annealing” refers to a process in
metallurgy where an alloy is first heated and then slowly cooled in order to increase its hardness.
The rate of decrease of ε(t) is called the cooling schedule or protocol.
We can reason about this heuristically as follows. Let the cooling protocol be given with ε(0) > 0,
and let πt denote the density π ε(t) . Then we have the following for the time derivative of the relative
entropy D(ρt kπt ), where ρt is the density of Xt in (6.1.59):
Z Z
d ∂ ρt (x) ∂ ρt (x)
D(ρt kπt ) = ρt (x) · log dx + ρt (x) log dx (6.1.60)
dt Rn ∂t πt (x) Rn ∂t πt (x)
Z

= −ε(t)I(ρt kπt ) − ρt (x) log πt (x) dx. (6.1.61)
Rn ∂t

To further analyze the second integral, it is convenient to work with the inverse temperature
β(t) := ε−1 (t). Then we can write

∂ ∂
log πt (x) = β̇(t) log π 1/β (x) , (6.1.62)

∂t ∂β β=β(t)

and further
∂ d
log π 1/β (x) = −f (x) − log Z 1/β (6.1.63)
∂β dβ
= f¯1/β − f (x) (6.1.64)

Therefore,
Z
d 1
f (x) − f¯1/β ρt (x) dx.

D(ρt kπt ) = − I(ρt kπt ) + β̇(t) (6.1.65)
dt β(t) Rn

Now suppose that each π ε satisfies a log-Sobolev inequality, and let c(t) denote the log-Sobolev
constant πt . In that case, we can further estimate

d 2
D(ρt kπt ) ≤ − D(ρt kπt ) + β̇(t)E[f (Xt ) − f¯1/β(t) ]. (6.1.66)
dt c(t)

91
Suppose that there exists a constant C > 0 that depends on f and n, such that
C
E[f (Xt ) − f¯1/β(t) ] ≤ D(ρt kπt ) (6.1.67)
β(t)

(this may be shown in many cases provided f is sufficiently well-behaved). Then we want to choose
β̇(t) such that

1 β̇(t)
. , (6.1.68)
c(t) β(t)

at least for all t large enough. In many cases, the log-Sobolev constant of π ε will scale like e−c/ε = ecβ
for some constant c > 0. We can then take β(t) ∼ log(1 + t), which suggests taking β(t) ∼ log t for
all t sufficiently large.

6.2 Constrained minimization and stochastic neural nets


So far, we have focused on unconstrained optimization. When constraints are present, one has to
make appropriate modifications to the gradient descent method to make sure that the points along
its trajectory satisfy the constraints. An interesting approach was suggested by J. Hopfield. Suppose
we wish to minimize a function f : Rn → R subject to the constraint x ∈ [0, 1]n . Consider the
following continuous-time dynamics:

ẋ(t) = −∇f (y(t)) (6.2.1a)


y(t) = σ(x(t)), (6.2.1b)

where σ : Rn → [0, 1]n has the form


T
σ(x1 , . . . , xn ) := σ(x1 ), . . . , σ(xn ) (6.2.2)

for some strictly increasing, continuous function σ : R → [0, 1]. For example, we could take
1  v 
σ(v) = 1 + tanh , (6.2.3)
2 a
where a > 0 is a tunable parameter. We can think of (6.2.1) as an autonomous dynamical system
with state x(t) ∈ Rn and output y(t) ∈ Rn . The ODE (6.2.1a) can be expressed purely in terms of
x(t) as

ẋ(t) = −∇f (σ(x(t)). (6.2.4)

The effect of σ in (6.2.1) is to “squish” each coordinate of x(t) to the interval [0, 1], thus ensuring
that each point along the output trajectory y(t) satisfies the constraints. Moreover, it is not hard to
see that f (y(t)) is a decreasing function of t:
d
f (y(t)) = ∇f (y(t))T ẋ(t) (6.2.5)
dt
= −|∇f (y(t))|2 (6.2.6)
≤ 0. (6.2.7)

92
However, just as in the unconstrained case, the trajectory y(t) can get stuck near a local minimum.
Hence, it makes sense to look for a suitable stochastic relaxation, just like we had done in the
unconstrained case with the Langevin dynamics. One approach, which we will describe next, was
suggested by E. Wong.
Let us modify the deterministic dynamics (6.2.1) as follows:

dXt = −∇f (Yt ) dt + α(Xt ) dWt (6.2.8a)


Yt = σ(Xt ) (6.2.8b)

where the matrix-valued function α : Rn → Rn×n is to be chosen so that the output process (Yt )t≥0
is a diffusion process with equilibrium density given by
1  1 
π ε (y) = exp − f (y) 1{y∈[0,1]n } , (6.2.9)
Zε ε
where the normalization constant Z ε now takes the form
Z
1
ε
Z = e− ε f (y) dy. (6.2.10)
[0,1]n

(Of course, the process (Yt ) is confined to the cube [0, 1]n due to the effect of σ.) Let us assume that
σ is twice differentiable; since it is strictly increasing, its derivative σ 0 is positive everywhere. We
will now show that the following simple choice gives us what we want:
s

 2ε
ij

0
, i=j
α (x) = σ (xi ) (6.2.11)

0,

else

Suppose then that the (as yet undetermined) matrix α(x) is diagonal, and let αi (x) denote its
entry in position (i, i). If (Yt ) is a diffusion process, then it will satisfy an SDE

dYt = m(Yt ) dt + β(Yt ) dWt , (6.2.12)

for some drift m : Rn → Rn and diffusion matrix β : Rn → Rn×n . Let us only consider the case
when β(y) is a diagonal matrix (later we will see that we will have this by construction). We first
determine the conditions that m and β must satisfy to give us the equilibrium density we want.
We will neglect the multiplier 1{y∈[0,1]n } in (6.2.9) since Yt ∈ [0, 1]n by construction. Since β(y) is
diagonal, the Fokker–Planck equation for (6.2.12) takes the form
n
!
∂ X ∂ i 1 ∂ i

ρt (y) = − m (y)ρt (y) − B (y)ρt (y) , (6.2.13)
∂t ∂yi 2 ∂yi
i=1

where B(y) := β(y)2 . let us denote the ith coordinate of ∇f (x) by ∂i f (x). If we want π ε (y) ∝
e−(1/ε)f (y) to be an equilibrum density, m and B must be such that

1 1 ∂ 1
mi (y)e− ε f (y) − B i (y)e− ε f (y) = 0,

i = 1, . . . , n (6.2.14)
2 ∂yi

93
which simplifies to
!
1 1 i 1 ∂ i
mi (y) = − B (y)∂i f (y) + B (y) , i = 1, . . . , n (6.2.15)
2 ε 2 ∂yi

Now, Itô’s rule gives


1
dYti = σ 0 (Xti ) dXti + Ai (Xt )σ 00 (Xt ) dt (6.2.16)
2
 1 
= − ∂i f (Yt )σ 0 (Xti ) + σ 00 (Xti )Ai (Xt ) dt + σ 0 (Xti )αi (Xt ) dWti , (6.2.17)
2
where A(x) := α(x)2 . Then, for each i we should have

β i (σ(x)) = σ 0 (xi )αi (x), (6.2.18)

so that B i (σ(x)) = (σ 0 (xi ))2 Ai (x) and therefore

1 σ 00 (xi ) i
mi (y) = −∂i f (y)σ 0 (xi ) + B (y). (6.2.19)
2 (σ 0 (xi ))2

Let us now define the function σ̌(v) := σ 0 (σ −1 (v))—recall that σ is strictly increasing and thus
invertible. Then it is readily verified that
σ 00 (σ −1 (v))
(log σ̌)0 (v) = , (6.2.20)
(σ 0 (σ −1 (v)))2

and, since xi = σ −1 (y i ), we can write


1
mi (y) = −∂i f (y)σ̌(y i ) + (log σ̌)0 (y i )B i (y). (6.2.21)
2
Comparing (6.2.15) and (6.2.21) gives
1 i
B (y) = εσ̌(y i ), (6.2.22)
2
which, together with σ 0 (xi ) = σ̌(y i ), leads to

2εσ̌(y i ) 2ε
Ai (x) = 0 i 2
= 0 i , (6.2.23)
(σ (x )) σ (x )
as claimed. For example, with the hyperbolic tangent choice of σ, as in (6.2.3), we have
  
0 1 2 v 1
σ (v) = 1 − tanh = , (6.2.24)
2a a 2a cosh2 av

so the stochastic dynamics of (Xt , Yt ) becomes


√  Xi 
t
dXti = −∇f (Yt ) dt + 2 aε cosh dWti , (6.2.25)
! a
1  Xi 
i t
Yt = 1 + tanh , i = 1, . . . , n. (6.2.26)
2 a

94
6.3 Free energy minimization and optimal control
In Section 6.1, the Gibbs densities (6.1.7), for ε > 0 sufficiently close to zero, were motivated through
a stochastic relaxation of the global minimization problem for a given f : Rn → R. However, there
is also a sense in which such densities are themselves solutions of an optimization problem in the
space of probability distributions. While the origin of this idea is in statistical physics, where it is
known as the Gibbs variational principle, it has wide applicability beyond physics.
To see how the density π ε could arise as a solution of an optimization problem, let us define the
following functional on the space of all (sufficiently well-behaved) probability densities ρ on Rn :
Z Z
Gε (ρ) := f (x)ρ(x) dx + ε ρ(x) log ρ(x) dx. (6.3.1)
Rn Rn

The first term is the expectation of f w.r.t. ρ. If we maintain the analogy with statistical physics,
then f plays the role of an energy function, so the first term is the expected energy, which we will
denote by E(ρ). The second term can be written as −εS(ρ), where
Z
S(ρ) := − ρ(x) log ρ(x) dx (6.3.2)
Rn

is the entropy of ρ. The quantity Gε (ρ) = E(ρ) − εS(ρ) is then known as the Gibbs free energy at
temperature ε. In statistical physics, it has the intepretation of the energy available for conversion
to useful work, since E(ρ) is the total average energy, while εS(ρ) is the heat dissipated into the
environment. Now, using the definition (6.1.7) of π ε and the formula for the relative entropy D(ρkπ),
we can write
Z
ε ρ(x)
G (ρ) = ε ρ(x) log −(1/ε)f (x) dx (6.3.3)
Rn e
= εD(ρkπ ε ) − ε log Z ε . (6.3.4)

Since D(ρkπ ε ) ≥ 0, Gε (ρ) is never smaller than −ε log Z ε ; moreover, this minimum value is attained
if and only if ρ = π ε .
Now we will discuss a simple modification of this construction that brings many benefits. Instead
of considering the negative entropy −S(ρ), let us fix a ‘reference’ density ρ0 and define the functional

Fε (ρ) := E(ρ) + εD(ρkρ0 ). (6.3.5)

This is also a free energy functional, and in fact we may view it as a Gibbs free energy if in (6.3.1)
we replace f with f − ε log ρ0 . A calculation similar to the one for Gε shows that the minimum value
of Fε is equal to
Z
−ε log ρ0 (x)e−(1/ε)f (x) =: −ε log Z̃ ε , (6.3.6)
Rn

and is uniquely minimized by

e−(1/ε)f (x) ρ0 (x)


π̃ ε (x) = R −(1/ε)f (x) ρ (x) dx
. (6.3.7)
Rn e 0

95
The advantage of using relative entropy instead of negative entropy, apart from the fact that the
former is always nonnegative, is that Z ε will be infinite if f is positive and bounded, whereas Z̃ ε
will be finite since the integration is performed w.r.t. the reference pdf ρ0 rather than the Lebesgue
measure on Rn . Thus, f must grow sufficiently rapidly as |x| → ∞ in order to π ε to be well-defined,
whereas π̃ ε will exist under much milder requirements on f .
Many problems are free energy minimization in disguise. As an example, consider the topic of
Bayesian inference: Let (X, Y ) be a random couple, where we observe some ‘evidence’ Y correlated
with some hidden ‘state’ X. If X has the prior density ρ0 and if Y conditioned on X = x has density
q(y|x), then, upon observing Y = y, we update the prior ρ0 (x) to the posterior density ρ(x|y), given
by

q(y|x)ρ0 (x)
ρ(x|y) = R . (6.3.8)
Rn q(y|x)ρ0 (x) dx

If we define the y-dependent energy f (x; y) := − log q(y|x), then the posterior ρ(x|y) has the Gibbs
form

ρ(x|y) ∝ e−f (x;y) ρ0 (x), (6.3.9)

and therefore minimizes the free energy


Z
F(ρ; y) := − ρ(x) log q(y|x) dx + D(ρkρ0 ). (6.3.10)
Rn

Now, even though we have formulated everything in terms of densities on Rn , the Gibbs variational
principle can be formulated more broadly as a minimization problem over probability measures on
an abstract measurable space (Ω, F), where F is a given σ-algebra. To that end, we first expand the
definition of the relative entropy to any pair of probability measures µ, ν on Ω:
(R

D(µkν) := Ω log dν dν, µ  ν (6.3.11)
+∞, otherwise

where the notation µ  ν indicates that µ is absolutely continuous w.r.t. ν, and dµ dν denotes the
Radon–Nikodym derivative of µ w.r.t. ν. It is not hard to see that this abstract definition reduces
to our earlier one if µ and ν are probability measures on Rn that have densities ρ and π. At any
rate, let us now fix a measurable function f : Ω → R and a reference probability measure ν on Ω
and and define the free energy

Fε (µ) := Eµ [f ] + εD(µkν). (6.3.12)

Exactly the same argument as before can be used to show that the minimum value of Fε is equal to
−ε log Eν [e−(1/ε)f ], and is uniquely attained by µ̄ with

dµ̄ e−(1/ε)f
= . (6.3.13)
dν Eν [e−(1/ε)f ]

(A minimal regularity condition is needed here to ensure that the expected value Eµ [e−(1/ε)f ] exists
and is finite.) We will now consider the case of a particular Ω, namely the path space C([0, T ]; Rn ).

96
6.3.1 Free energy minimization in path space
Consider an n-dimensional diffusion process (Xt )0≤t≤T given by
Z t
Xt = x + f (Xs , s) ds + Wt , 0≤t≤T (6.3.14)
0

and let P x denote the probability law of its trajectory X = (Xt )0≤t≤T viewed as a random element
of Ω = C([0, T ]; Rn ). Let us define the following energy (or cost) function on Ω:
Z T
H(X) := q0 (Xt , t) dt + r(XT ), (6.3.15)
0

where q0 : Rn × [0, T ] → R and r : Rn × R are some given functions such that


h i
Z(x) := EP x e−H(X) < ∞. (6.3.16)

(this will be the case, e.g., if both q0 and r are bounded). We would like to minimize the free energy

F(Qx ) := EQx [H] + D(Qx kP x ) (6.3.17)

over all probability measures Qx on Ω, for which X0 = x almost surely. From our discussion above,
x
it follows that the minimizing probability measure Q̄ is given by
x 
dQ̄ exp − H
= , (6.3.18)
dP x Z(x)

which amounts to reweighting the paths X ∼ P x using importance weights proportional to e−H(X) —
that is, for any bounded measurable function ϕ : Ω → R,
1 h i
EQ̄x [ϕ(X)] = EP x ϕ(X)e−H(X) . (6.3.19)
Z(x)
While this could, in principle, be done, we will now show that we can instead generate samples from
x
Q̄ directly by adding a drift to (6.3.14), which is what we would expect on the basis of Gisranov’s
theorem. Moreover, we will see that this drift can be Viewed as a solution to a certain optimal
control problem.
In what follows, we will denote by Ex [·] (respectively, Ēx [·]) expectation w.r.t. P x (respectively,
x x
Q̄ ). Since Q̄ is absolutely continuous w.r.t. P x , by Girsanov’s theorem there exists an adapted
drift process ū = (ūt )0≤t≤T , such that
x Z T !
1 T
Z
dQ̄ T 2
= exp ūt dWt − |ūt | dt , (6.3.20)
dP x 0 2 0

and thus the process


Z t 
X̄t = x + f (X̄s , s) + ūs ds + W̄t , 0≤t≤T (6.3.21)
0
x
where W̄t is a standard n-dimensional Brownian motion, will have the desired law Q̄ .

97
We will now look for the drift of the form ūt = −∇V (Xt , t), where V : Rn × [0, T ] → R is a C 2,1
function to be determined. To see why this ansatz makes sense, let us considder an arbitrary such V .
Then, with Xt defined according to (6.3.14), Itô’s rule giVes
Z T  Z T
V (XT , T ) = V (x, 0) + V̇ (Xt , t) + AV (Xt , t) dt + ∇V (Xt , t)T dWt , (6.3.22)
0 0

where A = f T ∇ + 12 ∆ is the infinitesimal generator of (6.3.14). Thus,


Z T Z T
1
− T
∇V (Xt , t) dWt − |∇V (Xt , t)|2 dt
0 2 0
Z T  1 
= V (x, 0) − V (XT , T ) + V̇ (Xt , t) + AV (Xt , t) − |∇V (Xt , t)|2 dt, (6.3.23)
0 2
which suggests that the function V must be such that
Z T
1 
V (x, 0) − V (XT , T ) + V̇ (Xt , t) + AV (Xt , t) − |∇V (Xt , t)|2 dt
0 2
Z T
= − log Z(x) − q0 (Xt , t) dt − r(XT ), (6.3.24)
0

or, gathering all the integrals on one side,


Z T
1 
V̇ (Xt , t) + AV (Xt , t) + q0 (Xt , t) − |∇V (Xt , t)|2 dt
0 2
 
= − V (x, 0) − log Z(x) + V (XT , T ) − r(XT ) . (6.3.25)

The best-case scenario is that both sides of the above equation are identically zero. This will be the
case if the function V simultaneously obeys the following:
1. It should solve the PDE
∂ 1
V (x, t) + AV (x, t) + q0 (x, t) = |∇V (x, t)|2 , (x, t) ∈ Rn × [0, T ] (6.3.26)
∂t 2
subject to the terminal condition V (x, T ) = r(x).

2. V (x, 0) = − log Z(x).


Let us now show that the function
" ! #
Z T
V (x, t) = − log E exp − r(XT ) − q0 (Xs , s) ds Xt = x (6.3.27)

t

fulfills both of these requirements. First of all, it is clear that V (x, T ) = r(x) and V (x, 0) = − log Z(x).
It remains to show that V in (6.3.27) solves (6.3.26). To that end, we will make use of the logarithmic
transformation V (x, t) = − log h(x, t). It is then a matter of straightforward calculus to show that,
if V solves (6.3.26), then h solves the linear PDE

h(x, t) + Ah(x, t) − q0 (x, t)h(x, t) = 0, (x, t) ∈ Rn × [0, T ] (6.3.28)
∂t

98
with the terminal condition h(x, T ) = e−r(x) . The Feynman–Kac formula then gives the following
expression for h(x, t):
" Z T ! #

h(x, t) = E exp − r(XT ) − q0 (Xs , s) ds Xt = x . (6.3.29)

t

The expression for (6.3.27) follows immediately.

6.3.2 The optimal control interpretation


The PDE (6.3.26) is, in fact, a Hamilton–Jacobi–Bellman equation for the following optimal control
problem:
"Z #
T 
1 
minimize J(x; u) := E |ut |2 + q0 (Xt , t) dt + r(XT ) X0 = x (6.3.30)

0 2

over all adapted controls u = (ut )0≤t≤T subject to



dXt = f (Xt ) + ut dt + dWt , 0 ≤ t ≤ T. (6.3.31)
To see this, use (5.3.47) to write
1
AV (x, t) + q0 (x, t) − |∇V (x, t)|2
2
n 1 1 o
= minn (f (x) + u)T ∇V (x, t) + ∆V (x, t) + q0 (x, t) + |u|2 (6.3.32)
u∈R 2 2
n o
= minn A V (x, t) + q(x, u, t) ,
u
(6.3.33)
u∈R

where Au = (f + u)T ∇ + 12 ∆ is the infinitesimal generator of the controlled diffusion (6.3.31) and
where we have defined the time-dependent state-action cost
1
q(x, u, t) := q0 (x, t) + |u|2 . (6.3.34)
2
Then we can rewrite (6.3.26) as
∂ n o
V (x, t) + minn Au V (x, t) + q(x, u, t) = 0, (x, t) ∈ Rn × [0, T ] (6.3.35)
∂t u∈R

with the terminal condition V (x, T ) = r(x), which is the HJB equation we were after. Consequently,
we obtain the following variational formula:
" !#
Z T
− log Ex exp − r(XT ) − q0 (Xt , t) dt
0
"Z #
T 1 
= min E |ut |2 + q0 (Xt , t) dt + r(XT ) X0 = x , (6.3.36)

u 0 2

where the expectation on the left-hand side is taken w.r.t. the reference process (Xt ) in (6.3.14),
while the right-hand side involves minimization, over all admissible controls, of the expected cost
J(u) w.r.t. the controlled process (6.3.31). Also note that the expression on the left-hand side has the
− log E[exp(·)] form, while on the right-hand side there is no exponentiation inside the expectation.

99
Chapter 7

Sampling and generative models

We started Chapter 2 with a discussion of the stochastic realization problem, where the goal is to
represent a given random object X as a deterministic function of some ‘latent’ or ‘internal’ random
object W . This is the idea behind the concept of a strong solution of an SDE: Subject to appropriate
regularity conditions on the drift f and on the diffusion matrix g, the process
Z t Z t
Xt = x + f (Xs ) ds + g(Xs ) dWs (7.0.1)
0 0

can be viewed as the output of a causal system with initial condition x0 and Brownian motion input
(Wt ). (A more general version of the problem is to find, for a given process (Yt )0≤t≤T , the functions
f , g, and h, such that Yt = h(Xt ).) We can also use this as a recipe for generating sample trajectories
of diffusion processes with prescribed f and g.
In our discussion of the Schrödinger bridge problem in Chapter 5, we were interested in obtaining
a sample from a given probability density p on Rn by controlling a diffusion process over a finite
time horizon, from t = 0 to t = 1. This is also an instance of a stochastic realization problem, where
we look for a system that takes a Brownian motion (Wt )0≤t≤1 and produces a sample X from p
while requiring minimum effort: Among all processes of the form
Z t
Xt = us ds + Wt , 0≤t≤1 (7.0.2)
0

with X1 ∼ p, we select the one where the expected value


"Z #
1
1 2
E |ut | dt (7.0.3)
2 0

is the smallest. A more general variant of the Schrödinger bridge entails starting from a random
initial condition X0 ∼ p0 and finding a drift process (ut )0≤t≤1 , such that
Z 1
X1 = X0 + ut dt + W1 (7.0.4)
0

has a given density p1 and (7.0.3) is, again, minimized.


This formulation is related to the controllability problem mentioned at the beginning of Chapter 5:
Given a deterministic controlled system ẋ = f (x, u) with input u(t) ∈ Rn and state x(t) ∈ Rn , find

100
a control u : [0, 1] → Rm to transfer the system from aR given initial state x(0) = x0 to a given
1
final state x(1) = x1 while minimizing the control cost 12 0 |u(t)|2 dt. The difference is that, in the
stochastic case, we are interested in optimally controlling a diffusion process to go from a given
initial density p0 to a given final density p1 .

7.1 Generative modeling


This control perspective can be brought to bear on the problem of generative modeling: Given a
target density p, construct a diffusion process (Xt )0≤t≤T , such that X0 has some reference density q
(we can think of it as a ‘prior’) and XT has the target density p. We can think of this in two ways:

1. Controlling the initial density ρ0 = q to the target density ρT = p subject to a suitable


optimality criterion. This is an optimal control problem in the space of densities.

2. Realizing (Xt )0≤t≤T as a strong solution of some SDE such that XT will have the target density
p when X0 has the prior density q, again subject to a suitable optimality criterion.

In a sense, Schrödinger’s bridge already gives us a framework for this. However, the solution for
random (as opposed to deterministic) initial conditions cannot be given in closed form. Moreover,
instead of full knowledge of the target density p, we may have access to a large number of independent
samples X01 , . . . , X0N ∼ p. Thus, we have to settle for suboptimal solutions that are learned on the
basis of these samples. In this chapter, we will take a look at one particularly effective approach to
this, referred to as diffusion models in the machine learning communiy.
Here is the basic idea. Let us first consider the setting when the target p is known; we will return
to the sample-based version later. Suppose that we have a diffusion process governed by an SDE

dXt = f (Xt , t) dt + σ(t) dWt , (7.1.1)

where the drift f and the diffusion coefficient σ are such that, if we were to initialize this process
with a sample X0 from the target p and run it for some time T , the resulting density ρT of XT would
be very close to the prior q. For example, we could simply take f ≡ 0, in which case the density of
Z T
XT = X0 + σ(t) dWt , (7.1.2)
0

with X0 independent of the Brownian motion (Wt ), is given by the convolution of ρ0 and a Gaussian
density:
Z  
1 1 2
ρT (x) = ρ0 ∗ γΣ(T ) (x) = ρ0 (ξ) exp − 2 |x − ξ| dξ, (7.1.3)
(2πΣ(T ))n/2 Rn 2Σ (T )
RT
where Σ2 (T ) := 0 σ 2 (t) dt. If σ(t) increases with t quickly enough, then we can take q to be the
2
Gaussian density γΣ(T ) . Another choice is f (x, t) = − σ 2(t) x, which gives the time-inhomogeneous
Ornstein–Uhlenbeck process

σ 2 (t)
dXt = − Xt dt + σ(t) dWt . (7.1.4)
2

101
Then we have
! !
1 T 2
Z T
1 T 2
Z Z
XT = exp − σ (t) dt X0 + exp − σ (s) ds σ(t) dWs
2 0 0 2 t
 1   1 Z T 1 
2 2
= exp − Σ (T ) X0 + exp − Σ (T ) exp Σ2 (t) σ(t) dWt , (7.1.5)
2 2 0 2
so, if T is sufficiently large, we can take q to be the standard n-dimensional Gaussian density
2
γ(x) = (2π)1n/2 e−|x| /2 . What these two examples have in common is that both f and σ are of very
simple form completely unrelated to the target density p, and the overall effect is to add enough
noise to X0 so that the contribution of p to the resulting density at time T is, for all practical
purposes, negligible. Thus, we now have the means of taking a sample from p and transforming it
into an approximate sample from q. It is useful to think of this as an approximate multidimensional
analogue of the procedure where we take a sample X from a given univariate probability distribution
and convert it into a sample from the uniform distribution on [0, 1] by means of the transformation
X 7→ FX (X), where FX is the cdf of X. The next step is to develop a diffusion process analogue
of the reverse step U 7→ FX−1 (U ) of generating a sample X using the inverse cdf FX−1 . The natural
candidate for this is the time-reversed version of (Xt )0≤t≤T , i.e., the process X̄t := XT −t . Of course,
in order to arrive at p at time T we would have to start with ρT , not with q. Here, as before, we
rely on the fact that ρT is very close to q, so once again we are settling for an approximation.

7.2 Time reversal of diffusions


Consider an n-dimensional diffusion process governed by an Itô SDE
dXt = f (Xt , t) dt + σ(t) dWt , 0≤t≤T (7.2.1)
where the drift depends on both space and time, while the diffusion coefficient depends only on time.
If X0 has density ρ0 , then the evolution of the density ρt of Xt is described by the Fokker–Planck
(or Kolmogorov’s forward) equation
∂  σ 2 (t)
ρt (x) = −∇ · f (x, t)ρt (x) + ∆ρt (x), (x, t) ∈ Rn × [0, T ]. (7.2.2)
∂t 2
The time-reversed process X̄t is defined for t ∈ [0, T ] by X̄t := XT −t . It is not hard to show that,
since Xt is a Markov process, so is X̄t . A more interesting result is that, under some regularity
conditions on ρ0 , f , and σ, X̄t will also be a diffusion process. We give a heuristic derivation in lieu
of a rigorous treatment (see Notes at the end of the chapter for references). We will assume that f
and σ are sufficiently regular so that (7.2.1) has a unique strong solution, and that the initial density
ρ0 is such that (7.2.2) has a solution which is everywhere positive.
Since X̄t has density ρ̄t := ρT −t , we have
∂  σ 2 (T − t)
ρ̄t (x) = ∇ · f (x, T − t)ρ̄t (x) − ∆ρ̄t (x). (7.2.3)
∂t 2
This almost looks like a Fokker–Planck equation except for the minus sign in front of the Laplacian.
However, since ∆ = ∇ · ∇, we can add and subtract
σ 2 (T − t)∆ρ̄t (x) = σ 2 (T − t)∇ · ∇ρ̄t (x) = σ 2 (T − t)∇ · ρ̄t (x)∇ log ρ̄t (x)

(7.2.4)

102
on the right-hand side to get

∂   σ 2 (T − t)
− f (x, T − t) + σ 2 (T − t)∇ log ρ̄t (x) ρ̄t (x) +

ρ̄t (x) = −∇ · ∆ρ̄t (x). (7.2.5)
∂t 2
We can now read the drift and the diffusion coefficients of X̄t directly off (7.2.5):

f¯(x, t) := −f (x, T − t) + σ 2 (T − t)∇ log ρT −t (x) (7.2.6a)


σ̄(t) := σ(T − t). (7.2.6b)

This gives us the following SDE for X̄t :

dX̄t = f¯(X̄t , t) dt + σ̄(t) dW̄t , 0 ≤ t ≤ T. (7.2.7)

The form of f¯ in (7.2.6a) deserves further comments. Let us first consider the deterministic case
σ(t) ≡ 0. Then all the randomness comes from the initial condition X0 ∼ ρ0 , and the density ρt
obeys the PDE
∂ 
ρt (x) = −∇ · f (x, t)ρt (x) , (7.2.8)
∂t
often referred to as the Liouville equation, which does not have any second-order terms on the
right-hand side. In this instance, f¯(x, t) = −f (x, T − t), which is consistent with the easily checked
fact that the time reversal x̄(t) := x(T − t) of the deterministic dynamics
d
x(t) = f (x(t), t), 0≤t≤T (7.2.9)
dt
is described by
d
x̄(t) = −f (x̄(t, T − t)), 0 ≤ t ≤ T. (7.2.10)
dt
When σ(t) > 0, though, ordinary time reversal is not enough, and we pick up a second term
proportional to ∇ log ρT −t (x), which is called the score function (or just the score) of ρT −t (x). For
now, it will suffice to think of it as the additional force we have to apply in order to counteract the
stochastic effects of the Brownian motion. Later on, we will endow this additional term with an
operational interpretation through the lens of stochastic thermodynamics.

7.2.1 An optimal control interpretation


An alternative way to obtain the drift f¯(x, t) in (7.2.6a) is to consider a controlled diffusion

dX̄tu = f¯0 (X̄tu , t) dt + σ̄(t)ut dt + σ̄(t) dW̄t ,



0≤t≤T (7.2.11)

where f¯0 (x, t) := −f (x, T − t), and u = (ut )0≤t≤T is an n-dimensional control (with all the usual
admissibilty caveats applied). When there is no control, i.e., ut ≡ 0 for all t, we obtain what we
might call the naive time reversal, i.e., the process
Z t Z t
0 0
X̄t = X̄0 + ¯0 0
f (X̄s , s) ds + σ̄(s) dW̄s (7.2.12)
0 0

103
that results if we don’t account for the effect of the Brownian motion and just go with the drift
that we read off the time-reversed Liouville equation. The catch is that, even if we initialize it
with X̄ 0 ∼ ρT , we cannot expect the resulting densities ρ̄0t of X̄t0 to track the desired time-reversed
densities ρ̄t = ρT −t . This provides the rationale for introducing the control term into (7.2.11). As
we show next, this problem is an instance of free energy minimization of the type considered in
Section 6.3. Note, however, that Eq. (7.2.11) has a time-dependent diffusion coefficient, whereas
(6.3.31) has σ(t)] ≡ 1.
As a starting point, let us recall the PDE (7.2.3). Using the chain rule, the divergence term on
the right-hand side can be written as
 
∇ · f (x, T − t)ρ̄t (x) = ∇ · f (x, T − t) ρ̄t (x) + f (x, T − t)T ∇ρ̄t (x) (7.2.13)
≡ − ∇ · f¯ (x, t) ρ̄t (x) − f¯ (x, t) ∇ρ̄t (x),
0 0
 T
(7.2.14)

which, upon rearranging, gives


∂ σ̄ 2 (t)
ρ̄t (x) + f¯0 (x, t)T ∇ρ̄t (x) + ∆ρ̄t (x) + ∇ · f¯0 (x, t) ρ̄t (x) = 0.

(7.2.15)
∂t 2
2
Recognizing the second-order differential operator A0t := (f¯0 )T ∇ + σ̄ 2(t) ∆ as the infinitesimal
generator of the uncontrolled process X̄t0 , we can express (7.2.15) in the following more suggestive
form:

ρ̄t (x) + A0t ρ̄t (x) + ∇ · f¯0 (x, t) ρ̄t (x) = 0, (x, t) ∈ Rn × [0, T ]

(7.2.16)
∂t
with the terminal condition ρ̄T (x) = ρ0 (x). Written like this, (7.2.16) has the same form as (6.3.26),
and an application of the Feynman–Kac formula gives the following path-space representation of
ρ̄t (x):
" Z T ! #

0
∇ · f¯ (X̄s , s) ds X̄t = x
0 0 0

ρ̄t (x) = E ρ0 (X̄T ) exp (7.2.17)

t

We can now proceed in the same manner as in Section 6.3.2 to obtain an optimal control formulation
of the time-reversal problem.
Specifically, if we define the running cost
1
q(x, u, t) := |u|2 − ∇ · f¯0 (x, t) (7.2.18)
2
and the terminal cost

r(x) := − log ρ0 (x), (7.2.19)

then we readily obtain the representation of − log ρ̄t (x) as the value function
"Z #
T
u u u
V (x, t) := min E q(X̄s , us , s) ds + ρ(X̄T ) X̄t = x . (7.2.20)
u t

Moreover, the value function satisfies the HJB equation


∂ n o
V (x, t) + minn Aut V (x, t) + q(x, u, t) = 0, (x, t) ∈ Rn × [0, T ] (7.2.21)
∂t u∈R

104
with the terminal conditon V (x, t) = r(x) = − log ρ0 (x), where

σ̄ 2 (t)
Aut := f¯0
T
∇ + σ̄(t)uT ∇ + ∆ (7.2.22)
2
for each u ∈ Rn . Carrying out the minimization explicitly gives

∂ σ̄ 2 (t)
V (x, t) + A0t V (x, t) − ∇ · f¯0 (x, t) = |∇V (x, t)|2 , (x, t) ∈ Rn × [0, T ] (7.2.23)
∂t 2
with V (x, T ) = − log ρ0 (x), and the optimal control is given by

ut = −σ̄(t)∇V (x, t) = σ̄(t)∇ log ρT −t (x), (7.2.24)

i.e., σ̄(t)ut matches exactly the extra score-dependent term in (7.2.6a).

7.2.2 Error analysis


The main benefit of the control-theoretic interpretation is that we can now quanfity the errors
incurred due to (a) initializing the time-reversed process with a density different from ρ̄0 = ρT and
(b) using a suboptimal feedback control k(x, t) instead of σ̄(t)∇ log ρT −t (x).
x
We start with the latter. It will be convenient to write k(x, t) as σ̄(t)k̂(x, t). Let P̄ denote the
probability law of the process
Z t Z t
X̄tx = x + f¯(X̄sx , s) ds + σ̄(s) dW̄s (7.2.25)
0 0
Z t Z t
¯0 x x

=x+ f (X̄s , s) + σ̄(s)∇ρT −s (X̄s ) ds + σ̄(s) dW̄s 0≤t≤T (7.2.26)
0 0
x
i.e., of X̄t conditioned on X̄0 = x, and let P̂ denote the probability law of the process
Z t Z t
x ¯0 x x

X̂t = x + f (X̂s , s) + σ̄(s)k̂(X̂s , s) ds + σ̄(s) dW̄s , 0 ≤ t ≤ T. (7.2.27)
0 0

Since these two processes differ only in their drift terms, Girsanov’s theorem gives
x
!
Z T Z T
dP̂ T 1 k̂ (X̄tx , t) − ∇ log ρT −t (X̄tx ) 2 dt .
k̂(X̄tx , t) − ∇ log ρT −t (X̄tx ) dW̄t −

x = exp
dP̄ 0 2 0
(7.2.28)
x x
Thus, we have the following for the relative entropy between P̄ and P̂ :
Z x
x x x dP̄
D(P̄ kP̂ ) = dP̄ log x (7.2.29)
dP̂
Z x
x dP̂
= − dP̄ log x (7.2.30)
dP̄
"Z #
T
1 k̂ x x 2

=E (X̄t , t) − ∇ log ρT −t (X̄t ) dt . (7.2.31)
0 2

105
Now let P̄ denote the probability law of (X̄t )0≤t≤T with X̄0 ∼ ρT and let P̂ denote the probability
law of (X̂t )0≤t≤T with X̂0 ∼ q, i.e.,
Z Z
x x
P̄ = ρT (x)P̄ dx, P̂ = q(x)P̂ dx. (7.2.32)
Rn Rn

Then we can use the chain rule for the relative entropy to write
Z
x x
D(P̄ kP̂ ) = D(ρT kq) + ρT (x)D(P̄ kP̂ )
Rn
"Z #
Z T
1 k̂ x 2
(X̄t , t) − ∇ log ρT −t (X̄tx ) dt dx

= D(ρT kq) + ρT (x)E
R n 0 2
Z T " #
1 2
= D(ρT kq) + E k̂ (X̄t , t) − ∇ log ρT −t (X̄t ) dt. (7.2.33)
2 0

The first term on the right-hand side quantifies the error due to misspecification of the initial density,
ρT vs. q, while the second term quantifies the error due to using a suboptimal feedback control.
Moreover, if we denote by ρ̂T the density of X̂T , then by the data processing inequality we have
" #
1 T
Z
2
D(ρ0 kρ̂T ) ≤ D(ρT kq) + E k̂ (X̄t , t) − ∇ log ρT −t (X̄t ) dt. (7.2.34)
2 0

As an example, let us consider the time-reversed version of the Ornstein–Uhlenbeck process in


(7.1.4) when initialized with X0 ∼ ρ0 , i.e.,
1 
dX̄t = σ̄ 2 (t) X̄t + ∇ log ρT −t (X̄t ) dt + σ̄(t) dW̄t , 0≤t≤T (7.2.35)
2
where ρt is the density of
Z t 1

1 2 (t) Σ2 (t)−Σ2 (s)
Xt = e− 2 Σ X0 + e− 2 σ(s) dWs (7.2.36)
0
Rt
with Σ2 (t) := 0 σ 2 (s) ds. In this instance, we have

σ 2 (t) σ̄ 2 (t)
f (x, t) = − x, f¯0 (x, t) = x (7.2.37)
2 2
A simple calculation shows that q = γ, the standard n-dimensional Gaussian density, is the
equilibrium density of (7.1.4):

σ 2 (t) σ 2 (t) σ 2 (t)


f (x, t) − ∇ log q(x) = − x− ∇ log γ(x) = 0. (7.2.38)
2 2 2
Thus, exactly the same calculations as in Section 6.1.2 yield
d d
D(ρt kq) = D(ρt kγ) (7.2.39)
dt dt
σ 2 (t)
=− I(ρt kγ), (7.2.40)
2

106
where I(ρt kγ) is the relative Fisher information between ρt and γ. Since γ satisfies the log-Sobolev
inequality
1
D(ρkγ) ≤ I(ρkγ), (7.2.41)
2
we obtain
!
Z t
2
D(ρt kγ) ≤ exp − σ (s) ds D(ρ0 kγ) (7.2.42)
0

= exp − Σ2 (t) D(ρ0 kγ).



(7.2.43)

If, for example, σ(t) = 2, then the initialization error term in (7.2.33), with q = γ, will scale like
e−2T and can therefore be made negligible for sufficiently large T . On the other hand, increasing T
will also make the second term larger, so there is a trade-off between these two terms.

7.3 Sample-based construction and score matching


So far, our discussion was confined to the ideal situation when the target density p was given.
However, in most applications of interest this is not the case; instead, only a large number of
independent samples from p is available. This presents a difficulty since the implementation of
the time-reversed process X̄t , even when initialized with the target-independent prior q, requires
knowledge of the score functions ∇ log ρt (x). The solution is to learn the score from the data via the
procedure known as score matching.
The following basic idea underlies score matching: Let ρ be a probability density on Rn , and let
s(x) denote its score, i.e., s(x) = ∇ log ρ(x). For any C 1 vector-valued function ŝ : Rn → Rn , let us
define the score estimation error

L(ŝ) := Eρ [|s(X) − ŝ(X)|2 ] (7.3.1)


Z
= ρ(x)|s(x) − ŝ(x)|2 . (7.3.2)
Rn

Expanding the squared norm, we get


Z
ρ(x) |s(x)|2 − 2s(x)T ŝ(x) + |ŝ(x)|2 dx

L(ŝ) = (7.3.3)
n
ZR Z Z
2
= ρ(x)|s(x)| ρ(x) dx − 2 T
ρ(x)s(x) ŝ(x) dx + ρ(x)|ŝ(x)|2 ) dx. (7.3.4)
Rn Rn Rn

The first term on the right-hand side does not depend on ŝ, so we focus on the other two. The
second term is of particular interest, since it is not given by an expectation, with respect to ρ, of a
1
function that involves only ŝ. However, we can use the fact that s(x) = ρ(x) ∇ρ(x) and integration
by parts to get
Z Z
T
ρ(x)s(x) ŝ(x) dx = ∇ρ(x)T ŝ(x) dx (7.3.5)
Rn n
RZ

=− ∇ · ŝ(x) ρ(x) dx, (7.3.6)
Rn

107
assuming both ρ and ŝ are decaying sufficiently rapidly at infinity. We can thus express L(ŝ) as a
sum

L(ŝ) = L̃(ŝ) + C, (7.3.7)

where C is a constant that depends only on ρ and where


h i
L̃(ŝ) := Eρ |ŝ(X)|2 + 2∇ · ŝ(X) (7.3.8)

can be approximated using independent samples X1 , . . . , XN ∼ ρ as


N
1 Xh i
L̃(ŝ) ≈ L̃N (ŝ) := |ŝ(Xi )|2 + 2∇ · ŝ(Xi ) . (7.3.9)
N
i=1

We now apply these ideas to the problem of estimating the scores ∇ log ρt . Let S be a family
of candidate score models, e.g., neural nets ŝ(·, ·; θ) : Rn × [0, T ] → Rn parametrized by a vector
of weights θ. Given N independent samples X01 , . . . , X0N from the target density p, we generate N
independent diffusion trajectories
Z t Z t
i i i
Xt := X0 + f (Xs , s) ds + σ(s) dWsi , 0 ≤ t ≤ T, i = 1, . . . , N (7.3.10)
0 )

and then obtain the score network weights θ̂ by minimizing the empirical loss
Z T N
1 Xh i
LN (θ) := |ŝ(Xti , t; θ)|2 + 2∇ · ŝ(Xti , t; θ) dt (7.3.11)
0 N
i=1

over θ. To obtain an approximate sample X̂ from q, we then proceed as follows: Generate a sample
X̂0 from the prior q and then take
Z T Z T
2

X̂ = X̂0 + − f (X̂t , T − t) + σ̄ (T − t)ŝ(X̂t , T − t) dt + σ(T − t) dW̄t , (7.3.12)
0 0

where ŝ(x, t) ≡ ŝ(x, t; θ̂) is the estimated score.

7.4 Stochastic thermodynamics


The optimal control interpretation of time reversal of diffusion processes suggests that it is useful to
think of the extra σ 2 (T − t)∇ log ρT −t (x) term in (7.2.6a) as the additional force we need to apply in
order to counteract the effect of the Brownian motion. From this perspective, we are controlling the
state of diffusion process, so all the theory of Chapter 5 applies. However, we have also mentioned
a complementary perspective, according to which the subject of control is actually a probability
density that evolves according to the Fokker–Planck equation. This viewpoint can be phrased as a
generalization of classical thermodynamics to stochastic systems.
To fix ideas, let us consider a time-inhomogeneous n-dimensional Langevin diffusion process of
the form

dXt = −∇U (Xt , t) dt + 2 dWt , 0≤t≤T (7.4.1)

108
where U : Rn × [0, T ] → R is a smooth (e.g., C 2,1 ) time-varying potential. Let X0 be the random
initial condition with density ρ0 , and let ρt denote, as before, the density of ρt . The Fokker–Planck
equation
∂ 
ρt (x) = ∇ · ∇U (x, t)ρt (x) + ∆ρt (x) (7.4.2)
∂t
governs the evolution of ρt viewed as a state of a dynamical system defined over densities. We can
define two functions of state:
Z
Et (ρ) := ρ(x)U (x, t) dx, (7.4.3)
Rn

the (average) energy at time t, and


Z
S(ρ) := − ρ(x) log ρ(x) dx, (7.4.4)
Rn

the entropy. Their difference, Gt (ρ) := Et (ρ) − S(ρ), is the free energy. We are interested in the free
energy difference GT (ρT ) − G0 (ρ0 ).
Let `(x, t) := log ρt (x). Then Itô’s rule gives
Z T Z T
U (XT , T ) = U (X0 , 0) + U̇ (Xt , t) dt + At U (Xt , t) dt + MTU (7.4.5)
0 0

and
Z T Z T
`(XT , T ) = `(X0 , 0) + ˙ t , t) dt +
`(X At `(Xt , t) dt + MT` , (7.4.6)
0 0


where U̇ (x, t) := ∂t ˙ t) := ∂ `(x, t), At := −∇U (·, t)T ∇ + ∆ is the (time-dependent)
U (x, t), `(x, ∂t
infinitesimal generator of (7.4.1), and MTU , MT` are zero-mean martingales (they can be given
explicitly as Itô integrals, but we will not need that). We further have

At U (x, t) = −|∇U (x, t)|2 + ∆U (x, t), (7.4.7)


At `(x, t) = −∇U (x, t) ∇`(x, t) + ∆`(x, t).
T
(7.4.8)

Here, ∇`(x, t) ≡ ∇ log ρt (x) is the score at time t. Moreover, using (7.4.2), we have

˙ t) = 1 ∂
`(x, ρt (x) (7.4.9)
ρt (x) ∂t
1   1
= ∇U (x, t)T ∇ρt (x) + (∆U (x, t)ρt (x) + ∆ρt (x) (7.4.10)
ρt (x) ρt (x)
1
= ∇U (x, t)T ∇`(x, t) + ∆U (x, t) + ∆ρt (x). (7.4.11)
ρt (x)
Since
 1 
∆`(x, t) = ∇ · ∇ρt (x) (7.4.12)
ρt (x)
1
= −|∇`(x, t)|2 + ∆ρt (x), (7.4.13)
ρt (x)

109
we can write
˙ t) = ∇U (x, t)T ∇`(x, t) + ∆U (x, t) + ∆`(x, t) + |∇`(x, t)|2 .
`(x, (7.4.14)
Substituting these into (7.4.5) and (7.4.6) and simplifying, we get
U (XT , T ) − U (X0 , 0)
Z T Z T  
= U̇ (Xt , t) dt + − |∇U (Xt , t)|2 + ∆U (Xt , t) dt + MTU (7.4.15)
0 0
and
`(XT , T ) − `(X0 , 0)
Z T 
= ∇U (Xt , t)T ∇`(Xt , t) + ∆U (Xt , t) + ∆`(Xt , t) + |∇`(Xt , t)|2 dt
0
Z T 
+ − ∇U (Xt , t)T ∇`(Xt , t) + ∆`(Xt , t) dt + MT` (7.4.16)
0
Z T 
= ∆U (Xt , t) + 2∆`(Xt , t) + |∇`(Xt , t)|2 dt + MT` . (7.4.17)
0
Adding these together and taking expectations gives
GT (ρT ) − G0 (ρ0 )
Z T Z T
= E[U̇ (Xt , t)] dt + E[2∆U (Xt , t) + 2∆`(Xt , t) − |∇U (Xt , t)|2 + |∇`(Xt , t)|2 ] dt. (7.4.18)
0 0
Let us now examine the expectation in the second integral. Using integration by parts, we have
Z
E[∆U (Xt , t)] = ρt (x)∆U (x, t) dx (7.4.19)
n
RZ

=− ∇ρt (x)T ∇U (x, t) dx (7.4.20)


R n
Z
=− ρt (x)∇`(x, t)T ∇U (x, t) dx (7.4.21)
Rn
= −E[∇`(Xt , t)T U `(Xt , t)], (7.4.22)
Z
E[∆`(Xt , t)] = ρt (x)∆`(x, t) dx (7.4.23)
n
RZ

=− ∇ρt (x)T ∇`(x, t) dx (7.4.24)


n
ZR
=− ρt (x)|∇`(x, t)|2 dx (7.4.25)
Rn
= −E[|∇`(Xt )|2 ]. (7.4.26)
Thus,
E[2∆U (Xt , t) + 2∆`(Xt , t) − |∇U (Xt , t)|2 + |∇`(Xt , t)|2 ]
= −E[|∇U (Xt , t)|2 + 2∇U (Xt , t)T ∇`(Xt ) + |∇`(Xt , t)|2 ]. (7.4.27)
= −E[|∇U (Xt , t) + ∇`(Xt , t)|2 ]. (7.4.28)

110
Putting everything together and rearranging, we obtain the following relation:
GT (ρT ) − G0 (ρ0 ) = W(ρ0 → ρT ) − D(ρ0 → ρT ), (7.4.29)
where we have defined the expected work
Z T Z
W(ρ0 → ρT ) := ρt (x)U̇ (x, t) dx dt (7.4.30)
0 Rn

and expected dissipation


Z T Z
D(ρ0 → ρT ) := ρt (x)|∇U (x, t) + ∇ log ρt (x)|2 dx dt, (7.4.31)
0 Rn

with the notation ρ0 → ρT indicating that, unlike the free energy difference which is a function of
initial and final states ρ0 and ρT , both W and D depend on the entire state trajectory (ρt )0≤t≤T .
The relation (7.4.29) is a key formula in stochastic thermodynamics. Its immediate consequence
is the inequality
W(ρ0 → ρT ) ≥ GT (ρT ) − G0 (ρ0 ), (7.4.32)
which can be thought of as a stochastic version of the second law of thermodynamics, which says
that the expected work in effecting the transfer from one density-valued state to another can never
be smaller than the difference in free energies between the final state and the initial state. Moreover,
equality in (7.4.32) is attained if and only if the expected dissipation is identically zero, i.e., if and
only if ∇ log ρt (x) = −∇U (x, t) for almost all x and t.
We can now apply the relation between work, free energy, and dissipation both to (7.4.1) (with
X0 ∼ ρ0 ) and to its time reversal
 √
dX̄t = ∇U (X̄t , T − t) + 2∇ log ρT −t (X̄t ) dt + 2 dW̄t , 0≤t≤T (7.4.33)
which we can also express in the time-varying Langevin form as

dX̄t = −∇Ū (X̄t , t) dt + 2 dW̄t (7.4.34)
with Ū (x, t) := −U (x, T − t) − 2 log ρT −t (x). Then, with ρ̄t ≡ ρT −t denoting the density of X̄t , we
see that
GT (ρ̄T ) − G0 (ρ̄0 ) = G0 (ρ0 ) − GT (ρT ) (7.4.35)
and
Z T Z
D(ρ̄0 → ρ̄T ) = ρ̄t (x)|∇Ū (x, t) + ∇ log ρ̄t (x)|2 dx dt (7.4.36)
0 Rn
Z T Z
= ρt (x)|∇U (x, t) + ∇ log ρt (x)|2 dx dt (7.4.37)
0 Rn
= D(ρ0 → ρT ), (7.4.38)
which gives the following relation between the work done in transferring ρ0 to ρT along (7.4.1) and
the work done in transfering ρT to ρ0 along (7.4.33):
W(ρ0 → ρT ) + W(ρT → ρ0 ) = 2D(ρ0 → ρT ). (7.4.39)

111
Appendix A

Probability Facts

Lemma 5 (Borel–Cantelli). Let A0 , A1 , . . . be a sequence of events.


P
1. If n≥0 P[An ] < ∞, then
 
P lim sup An = 0. (A.0.1)
n→∞

P
2. If n≥0 P[An ] = ∞ and the events (An ) are independent, then
 
P lim sup An = 1. (A.0.2)
n→∞

A.0.1 Convergence concepts

112
Appendix B

Analysis Facts

Lemma 6. Let f0 , f1 , . . . be a sequence of real-valued continuous functions on the interval [0, 1] that
converges uniformly to some function f : [0, 1] → R, i.e.,

lim sup |fn (t) − f (t)| = 0. (B.0.1)


n→∞ t∈[[0,1]

Then f is itself a continuous function.

Lemma 7. Let f0 , f1 , . . . be a sequence of continuous functions on [0, 1], such that



X
sup |fn (t) − fn−1 (t)| < ∞. (B.0.2)
n=1 t∈[0,1]

Then fn converge uniformly to a continuous function f : [0, 1] → R.

113
Bibliography

[Ben71] Vaclav E. Beneš. Existence of optimal stochastic control laws. SIAM Journal on Control,
9(3):446–472, August 1971.

[Buc04] James Antonio Bucklew. Introduction to Rare Event Simulation. Springer, 2004.

[Cla66] J. M. C. Clark. The Representation of Non-Linear Stochastic Systems with Applications to


Filtering. PhD thesis, Electrical Engineering Department, Imperial College, London, 1966.

[Cla73] J. M. C. Clark. An introduction to stochastic differential equations on manifolds. In D. Q.


Mayne and R. W. Brockett, editors, Geometric Methods in System Theory, pages 131–149.
Reidel, Dordrecht, Holland, 1973.

[CM44] R. H. Cameron and W. T. Martin. Transformations of Wiener integrals under translations.


Annals of Mathematics, 45:386–396, 1944.

[Doo53] J. L. Doob. Stochastic Processes. Wiley, 1953.

[FH65] Richard P. Feynman and Arthur R. Hibbs. Quantum Mechanics and Path Integrals.
McGraw-Hill, 1965.

[FKK72] M. Fujisaki, G. Kallianpur, and H. Kunita. Stochastic differential equations for the
nonlinear filtering problem. Osaka Journal of Mathematics, 9:19–40, 1972.

[Gir60] Igor V. Girsanov. On transforming a certainn class of stochastic processes by absolutely


continuous substitution of measures. Theory of Probability and Its Applications, 5:285–301,
1960.

[Kac49] Mark Kac. On distributions of certain wiener functionals. Transactions of American


Mathematicall Society, 65:1–13, 1949.

[Kac85] Mark Kac. Enigmas of Chance: An Autobiography. University of California Press, Berkeley,
CA, 1985.

[Kam81] N. G. van Kampen. Itô vs. Stratonovich. Journal of Statistical Physics, 24(1):175–187,
1981.

[KS98] Ioannis Karatzas and Steven E. Shreve. Brownian Motion and Stochastic Calculus. Springer,
2nd edition, 1998.

[Mor69] Richard E. Mortensen. Mathematical problems of modeling stochastic nonlinear dynamical


systems. Journal of Statistical Physics, 1(2):271–296, 1969.

114
[Nel67] E. Nelson. Dynamical Theories of Brownian Motion. Princeton University Press, 1967.

[Pic91] G. Picci. Stochastic realization theory. In A. C. Antoulas, editor, Mathematical System


Theory: The Influence of R. E. Kalman, pages 213–229. Springer, 1991.

[Roe94] Gert Roepstorff. Path Integral Approach to Quantum Physics. Springer, 1994.

[Sim05] Barry Simon. Functional Integration and Quantum Physics. American Mathematical
Society, 2nd edition, 2005.

[Ste01] J. Michael Steele. Stochastic Calculus and Financial Applications. Springer, 2001.

[SW74] Jan H. van Schuppen and Eugene Wong. Transformation of local martingales under a
change of measure. Annals of Probability, 2(5):879–888, 1974.

[Wil89] Jan C. Willems. Models for dynamics. In U. Kirchgraber and H. O. Walther, editors,
Dynamics Reported, volume 2, pages 171–269. Wiley, 1989.

[Won73] Eugene Wong. Recent progress in stochastic processes – A survey. IEEE Transactions on
Information Theory, IT-19(3):262–275, May 1973.

[WZ65a] Eugene Wong and Moshe Zakai. On the convergence of ordinary integrals to stochastic
integrals. Annals of Mathematical Statistics, 46:1560–1564, 1965.

[WZ65b] Eugene Wong and Moshe Zakai. On the relation between ordinary and stochastic differential
equations. International Journal of Engineering Science, 3:213–229, 1965.

115

You might also like