
ECE 555: Stochastic Control

M.-A. Belabbas, Spring 2015, UIUC

Contents

1 Review of probability theory 7


1.1 Sets, measures and probability spaces . . . . . . . . . . . . . . . . . . . . 7
1.1.1 Probability spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Independence and conditioning . . . . . . . . . . . . . . . . . . . 11
1.2.2 Probability density functions . . . . . . . . . . . . . . . . . . . . . 12
1.2.3 Transformation of densities . . . . . . . . . . . . . . . . . . . . . . 13
1.2.4 The Gaussian density . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.5 Moments and generating functions . . . . . . . . . . . . . . . . . 16

2 Basic notions from discrete-time stochastic processes 19


2.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 Stochastic processes and filtrations . . . . . . . . . . . . . . . . . 19
2.1.2 Simple random walk . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 First-passage time of a simple random walk . . . . . . . . . . . . . . . . . 25

3 Poisson counters and stochastic differential equations 29


3.1 Poisson counters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.1 Continuous-time Markov process . . . . . . . . . . . . . . . . . . . 30
3.1.2 Poisson counters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Poisson driven differential equations . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Itō stochastic differential equations . . . . . . . . . . . . . . . . . 34
3.2.2 Solution in the sense of Itō . . . . . . . . . . . . . . . . . . . . . . 34
3.2.3 Computer simulations . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 The Itō rule for changes of variable . . . . . . . . . . . . . . . . . . . . . 38
3.4 Finite-state, continuous-time jump processes . . . . . . . . . . . . . . . . . 39
3.5 Computing expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.1 Expectation rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 The Fokker-Planck equation for jump processes . . . . . . . . . . . . . . . 44
3.6.1 Derivation of the density equation . . . . . . . . . . . . . . . . . . 44
3.6.2 Some examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


3.7 The Backwards equation . . . . . . . . . . . . . . . . . . . . . . . . . . . 52


3.8 Computing correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4 Dynamic programming and optimal control 59


4.1 Dynamic programming in discrete-time . . . . . . . . . . . . . . . . . . . 60
4.2 Markov decision processes . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Infinite-time horizon problems . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4 Value iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.5 Dynamic programming in continuous-time . . . . . . . . . . . . . . . . . 77
4.6 Controlled jump processes . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.7 Infinite-horizon: uniformization . . . . . . . . . . . . . . . . . . . . . . . . 83

5 Wiener processes and stochastic differential equations 87


5.1 Diffusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1.1 Gaussian distribution and particles in water . . . . . . . . . . . . 87
5.1.2 The Gaussian distribution . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Brownian motion and Poisson counters . . . . . . . . . . . . . . . . . . . 90
5.2.1 Time correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Stochastic differential equations and the Itō rule . . . . . . . . . . . . . . 94
5.4 The expectation rule and examples . . . . . . . . . . . . . . . . . . . . . 98
5.5 Finite difference approximations . . . . . . . . . . . . . . . . . . . . . . . 101
5.6 First passage time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.6.1 Differentiability of Wiener processes . . . . . . . . . . . . . . . . . 104
5.7 The Fokker-Planck equation . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.7.1 Fokker-Planck for jump-diffusion processes . . . . . . . . . . . . . 108
5.8 Stratonovich calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.8.1 Change of variables in Stratonovich calculus . . . . . . . . . . . . . 113

6 System Concepts 115


6.1 Notions from deterministic systems . . . . . . . . . . . . . . . . . . . . . 116
6.1.1 Controllability and Observability . . . . . . . . . . . . . . . . . . 117
6.1.2 LTI systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2 Power Spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.2.1 Power spectral density . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.3 Stochastic Realization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

7 Linear and Nonlinear filtering 133


7.1 Conditional density for discrete Markov processes . . . . . . . . . . . . . 134
7.1.1 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.1.2 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.1.3 Estimating the state with the complete sequence of observations . 137
7.2 Conditional density for continuous-time Markov chains . . . . . . . . . . 140
7.2.1 White noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.2.2 Unnormalized conditional density evolution . . . . . . . . . . . . 141


7.3 Nonlinear filtering: the Duncan-Mortensen-Zakai equation . . . . . . . . 145


7.4 Linear filtering: the Kalman-Bucy filter . . . . . . . . . . . . . . . . . . . 148
7.4.1 Conditional expectation and MMSE estimation . . . . . . . . . . 148
7.4.2 Conditional expectation and least squares . . . . . . . . . . . . . 150
7.4.3 Kalman filter as an LMSE estimator . . . . . . . . . . . . . . . . . 152

8 Ergodicity and Markov Processes 155


8.1 Birkhoff’s ergodic theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

Lecture 1
Review of probability theory

We cover the basics of probability theory, including probability spaces, σ-fields, multivariate random variables, densities, and changes of variables.

1.1 Sets, measures and probability spaces

1.1.1 Probability spaces

1.1. We denote by Z the set of integers, by N the set of positive integers, by Q the set of rational numbers and by R the set of real numbers.

1.2. A set containing a finite number of elements is said to be finite. A set containing
an infinite number of elements that can be put in one to one correspondence with the
positive integers is said to be countable. According to this definition, the sets Z and Q
are countable, as one can exhibit a one-to-one correspondence between their elements and
the positive integers.

1.3. If A and B are subsets of a set S, we denote by A ∩ B and A ∪ B their intersection and union, respectively. We denote by A^c = S − A the complement of A in S.

1.4. The set of all subsets of a set S is called the power set and is usually denoted as 2S .

1.5. If the set S is finite, so is its power set. Thus in that case, S and its power set are very similar objects. However, if S is countably infinite, its power set is not countable. Indeed, the power set of N can be put in one-to-one correspondence with R (by binary expansion of real numbers), and R can be shown to be uncountable (by what is called a "diagonalization argument"; for your own edification, you can check the paper of Cantor [1] or Rudin's book on Real Analysis [3], but this is beyond the scope of this course).


1.6. A.N. Kolmogorov introduced in 1933 [2] a precise mathematical model and axioms
for the subject of probability, which are now the foundation on which the subject is built.
We outline it here. The first step is to introduce the notion of probability space, which we
do now.

1.7. We start with an arbitrary set Ω whose elements are the possible states of the world
and subsets of Ω, called events, which correspond to a collection of states of the world.
For example, when throwing a die, the set Ω can be taken to be Ω = {1, 2, . . . , 6} and the
subset A = {1, 3, 5} is the event: "the outcome of the throw is odd".
A measure µ on Ω is a positive real-valued function defined on a subset S of the power set of Ω,

µ : S ⊆ 2^Ω → [0, 1],

which describes the 'chance' that an event occurs. Reasoning for the case of Ω finite, it is reasonable to ask that µ(Ω) = 1, µ(∅) = 0 (where ∅ is the empty set), and that µ be additive on disjoint sets, or more generally

µ(A ∪ B) = µ(A) + µ(B) − µ(A ∩ B).

When the cardinality of Ω is infinite, we have mentioned that its power set is uncountable. Requiring that µ be additive on uncountable unions is too restrictive however. We thus require that µ be additive only on countable unions, that is, if A_i, i ∈ N, are disjoint subsets of Ω, then

µ(∪_{i=0}^{∞} A_i) = ∑_{i=0}^{∞} µ(A_i).

We define a σ-field or σ-algebra of Ω as a collection of subsets that is closed under complementation and countable unions of subsets. That is, S is a σ-field if
1. A ∈ S ⇔ (Ω − A) ∈ S
2. A_i ∈ S for i ∈ N ⇒ ∪_{i=0}^{∞} A_i ∈ S

1.8. A probability space is given by a set Ω, a σ-field S of subsets of Ω, and a countably


additive measure µ on Ω. In case Ω is finite, we can handle all situations by taking S
to be the set of all subsets of Ω and define µ by assigning values for it on individual
elements of Ω with the convention that the value of µ on a subset is the sum of the values
of µ on the elements of the subset.

1.9. Example Take Ω to be the set of binary numbers on n digits, that is, an element of Ω is of the form (a_1 . . . a_n) where a_i ∈ {0, 1}. The set Ω has cardinality 2^n and its power set cardinality 2^{2^n}. Given α ∈ [0, 1], we can define a class of measures on Ω by letting

µ((a_1, . . . , a_n)) = α^{∑_i a_i} (1 − α)^{n − ∑_i a_i}.


Using the binomial identity, we obtain

1 = (1 − α + α)^n = ∑_{k=0}^{n} C(n, k) α^k (1 − α)^{n−k},

where C(n, k) denotes the binomial coefficient, and hence the probability measure is normalized. We can extend the definition of µ from Ω to 2^Ω via the relation µ(A ∪ B) = µ(A) + µ(B) if A ∩ B = ∅.
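As a quick numerical sanity check (a minimal sketch; the parameter values and the helper name mu are my own choices, not notation from the notes), one can enumerate all binary strings of length n and verify that the masses above sum to one:

    from itertools import product

    def mu(a, alpha):
        # mass assigned to the binary string a = (a_1, ..., a_n)
        k = sum(a)
        return alpha**k * (1 - alpha)**(len(a) - k)

    n, alpha = 5, 0.3
    total = sum(mu(a, alpha) for a in product([0, 1], repeat=n))
    print(total)  # ~1.0, up to floating-point error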

1.10. Borel sets


In this course, we are mostly interested in phenomena whose description involves real variables. As we mentioned, R is uncountable, and this is the source of technical difficulties in probability theory.
We say that a subset U of R is open if for any x_0 ∈ U, there exists an ε > 0 such that {x | |x − x_0| < ε} ⊂ U. Finite unions and intersections of open sets are open. Countably infinite unions of open sets are again open, but this does not hold for countably infinite intersections. Indeed,

∩_{n=1}^{∞} {x | −1/n < x < 1 + 1/n} = [0, 1].

The Borel sets are the subsets of R generated by countable unions and complementation of open intervals of R. They form, by construction, a σ-algebra.
For the purpose of this course, it is often enough to consider the probability space (R, B, µ), where B is the Borel σ-field and µ is the measure obtained by extending the usual notion of measure given by the length of an interval.
From the Borel σ-algebra, one then constructs the larger σ-algebra of Lebesgue measurable sets. We do not go into these details here, but one should remember that the Lebesgue σ-field is larger than the Borel σ-field, and that both are strict subsets of the power set of R.


1.2 Random variables


Random variables

Consider the pair (M, M) where M is a σ-algebra over M. A random variable is a function

X : Ω → M.

We will, unless otherwise mentioned, take M to be the Borel σ-field if M = R, or 2^M if M is discrete.

1.11. For the die-tossing example, a random variable is, for ω ∈ Ω,

f(ω) = 1 if ω ∈ {1, 3, 5} and 0 otherwise.

In this case, the measurement space is binary: M = {0, 1} and comes with the σ-field 2^M = {∅, {0}, {1}, {0, 1}}.

Measurability

A random variable X : Ω → M is measurable for (Ω, S) and (M, M) if and only if

X −1 (A) ∈ S for all A ∈ M.

In words, a random variable between Ω and M with σ-fields S and M is measurable if


and only if the preimage of every set in M under X is a set in S.

1.12. Even though you will often hear "a random variable X is measurable," it is clear
from the definition that measurability is dependent on the underlying σ-fields.

1.13. From an engineering point of view, one can think of a σ-field as the resolution of the measurement instrument at our disposal and of the measurable functions as the questions one can answer using this instrument. Indeed, assume that the states of a system belong to the set Ω = {1, 2, . . . , 10}. If your measurement device can only answer whether the state is at most 5, then observe that you can also answer whether it is larger than 5 (this is the complementation requirement for sets in a σ-field.) Hence the σ-field is {∅, Ω, {1, 2, . . . , 5}, {6, 7, . . . , 10}}. The question "is the state 2 or 4" is clearly not answerable by this instrument, and indeed the corresponding random variable is not measurable. If we buy a second instrument that can tell us whether the state is even or odd, the new σ-field at our disposal is the one generated by taking unions and complements of the sets ∅, Ω, {1, 3, 5, 7, 9}, {2, 4, 6, 8, 10}, {1, 2, . . . , 5} and {6, 7, . . . , 10}. Now observe that the set {2, 4} is in the σ-field generated by the sets above, and we can thus answer the question of whether the state is 2 or 4.
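A small computational illustration of this point (a sketch; OMEGA and generate_sigma_field are names introduced here, not notation from the notes): generate the σ-field from the instruments' answerable sets by brute-force closure under complements and unions, and test whether {2, 4} belongs to it.

    OMEGA = frozenset(range(1, 11))

    def generate_sigma_field(generators):
        # close a family of subsets of OMEGA under complementation and (finite) unions;
        # for a finite OMEGA this is the sigma-field generated by the family
        sets = {frozenset(), OMEGA} | {frozenset(g) for g in generators}
        changed = True
        while changed:
            changed = False
            snapshot = list(sets)
            for a in snapshot:
                if OMEGA - a not in sets:
                    sets.add(OMEGA - a)
                    changed = True
                for b in snapshot:
                    if a | b not in sets:
                        sets.add(a | b)
                        changed = True
        return sets

    F1 = generate_sigma_field([range(1, 6)])                   # "is the state at most 5?"
    print(frozenset({2, 4}) in F1)                             # False: not answerable
    F2 = generate_sigma_field([range(1, 6), range(2, 11, 2)])  # add "is the state even?"
    print(frozenset({2, 4}) in F2)                             # True: {1,...,5} ∩ {2,4,...,10}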


1.14. The concept of measurability is helpful in the study of stochastic processes, as we


will see in the next lecture. The idea is that by controlling the σ-field Σ, we can make sure
that the random variable X does not reveal information one has no real access to (for
example, future values of the process). We will come back to this point later in greater
detail.

Adapted σ-field

An important σ-field is the so-called σ-field adapted to a random variable X . It is


obtained by taking countable unions, complements, and their iterations, of all the sets

Ai = X −1 (Bi )

where Bi is in the σ-field M.


The adapted σ-field is in some sense the smallest σ-field of Ω for which the random
variable X is measurable.

1.2.1 Independence and conditioning

1.15. Given a probability space (Ω, S, µ), we say that two events A, B ∈ S are independent
if
µ(A ∩ B) = µ(A)µ(B).

1.16. To make sense of the above definition, consider again the case of throwing a die, that is, Ω = {1, 2, . . . , 6}. The events 'the outcome is odd' (call it A) and 'the outcome is less than or equal to 3' (call it B) are not independent. Indeed, A ∩ B = {1, 3} and thus µ(A ∩ B) = 1/3 ≠ 1/4 = µ(A)µ(B). Intuitively, we know that these events are not independent since, not knowing whether the outcome is less than or equal to 3, the chance that it is odd is 1/2. But if we know that the outcome is less than or equal to 3, then the chance that it is odd is 2/3. If instead we take B to be 'the outcome is less than or equal to 4', then A and B are independent.

1.17. The conditional probability of B given A is defined as

µ(B | A) = µ(A ∩ B) / µ(A).

1.18. It is an easy exercise to show that if S is a σ-field, then for any given A ∈ S, the collection of sets {B ∩ A | B ∈ S} is also a σ-field. It is the σ-field on which the conditional measure µ(· | A) is defined.


1.19. Bayes’ rule is obtained from the point above:

µ(A|B) = µ(B |A)µ(A)/µ(B).

1.20. The notion of independence of events can be applied to random variables: we say
that two random variables X and Y are independent if their adapted σ-fields are made of
independent sets.

1.21. Given two random variables, X,Y , their joint cumulative distribution function is
given by
ϕ(x, y) = µ(X < x,Y < y),
where X < x refers to the set {s ∈ Ω|X (s ) < x }.

1.22. If the random variables X and Y are independent, their joint-cumulative distribu-
tion factors as
ϕ(x, y) = ϕ1 (x)ϕ2 (y)
with ϕ1 (x) = µ(X < x) and ϕ2 (y) = µ(Y < y).

1.2.2 Probability density functions


We will consider in many cases real-valued random variables. These thus define, as we
have seen in the previous section, a probability measure on the real line with its Borel
σ-field. We will only consider measures on the Borel σ-algebra that are given by density
functions.

Density function

For a random variable X, a density function f_X(x) on R is a positive, real-valued function such that

∫_R f_X(x) dx = 1

and

P(X < x) = ∫_{−∞}^{x} f_X(s) ds.

Equivalently,

P_X(A) = ∫_A f_X(x) dx.


Independence

Two random variables X and Y are independent if their joint-density

fX,Y (x, y) = fX (x)fY (y).

1.23. Recall the marginalization procedure: given a joint density, you can obtain the density for one variable by integrating out the other:

f_X(x) = ∫ f_{X,Y}(x, y) dy.

1.24. Let us return to the die tossing example with X the random variable which is 1 if
the outcome is odd and 0 otherwise; Y the random variable which is 1 if the outcome is
strictly larger than 3 and 0 otherwise.
The joint density of X and Y is thus

f_{X,Y}(0, 0) = 1/6; f_{X,Y}(1, 0) = 2/6; f_{X,Y}(0, 1) = 2/6; f_{X,Y}(1, 1) = 1/6.

Observe that

∑_{i,j} f_{X,Y}(i, j) = 1.

The density of X can be obtained through marginalization:

f_X(0) = ∑_j f_{X,Y}(0, j) = (1 + 2)/6 = 1/2,
f_X(1) = ∑_j f_{X,Y}(1, j) = (2 + 1)/6 = 1/2,

as expected.
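The same bookkeeping can be carried out in a few lines (a minimal sketch with ad hoc variable names):

    from fractions import Fraction

    outcomes = range(1, 7)
    X = lambda w: 1 if w % 2 == 1 else 0   # 1 if the outcome is odd
    Y = lambda w: 1 if w > 3 else 0        # 1 if the outcome is strictly larger than 3

    joint = {}
    for w in outcomes:
        joint[(X(w), Y(w))] = joint.get((X(w), Y(w)), Fraction(0)) + Fraction(1, 6)

    print(joint)                               # {(1,0): 1/3, (0,0): 1/6, (0,1): 1/3, (1,1): 1/6}
    print(sum(joint.values()))                 # 1
    print(sum(joint[(0, y)] for y in (0, 1)))  # marginal f_X(0) = 1/2
    print(sum(joint[(1, y)] for y in (0, 1)))  # marginal f_X(1) = 1/2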

1.2.3 Transformation of densities


We recall the transformation rule for densities in R^n.
1.25. Let X be a real valued random variable with density fX (x) and let ϕ : R 7−→ R be
a differentiable function. Then Y ≜ ϕ(X ) is also a random variable.


1.26. We can relate the densities f_X and f_Y as follows. First, assume that ϕ is one-to-one. Let ε > 0 be small and x_0 ∈ R. The probability of the event A = {x | |x − x_0| < ε} is, up to first order, f_X(x_0) ε. From basic considerations about probabilities of an event, it is clear that we want the image of this event under ϕ (that is, ϕ(A)) to have the same probability. Observe that the length of the interval ϕ(A) is given, up to first order, by |dϕ/dx|_{x_0} ε. Putting the above together, we have that, up to first order,

f_Y(y) = f_X(ϕ^{−1}(y)) |dϕ/dx|^{−1},

with the derivative evaluated at ϕ^{−1}(y).
1.27. If ϕ is not one-to-one, we simply take the sum over the inverse images ϕ^{−1}(y):

f_Y(y) = ∑_{x | y=ϕ(x)} f_X(x) |dϕ/dx|^{−1},

with the understanding that dϕ/dx is evaluated at the appropriate inverse image.

1.28. If X and Y are R^n-valued random variables related by Y = ϕ(X), then the formula becomes:

f_Y(y) = f_X(ϕ^{−1}(y)) |det(∂ϕ/∂x)|^{−1}, with the Jacobian evaluated at x = ϕ^{−1}(y),

where ∂ϕ/∂x is the Jacobian of ϕ.
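A Monte Carlo sanity check of the scalar version of this rule (a sketch; the map ϕ(x) = x² on (0, 1), where it is one-to-one, and the sample size are arbitrary choices):

    import random

    # X uniform on (0,1); Y = phi(X) = X^2. The transformation rule gives
    # f_Y(y) = f_X(sqrt(y)) * |dphi/dx|^{-1} = 1 / (2*sqrt(y)) on (0,1).
    N = 200_000
    samples = [random.random()**2 for _ in range(N)]

    a, b = 0.25, 0.5
    empirical = sum(a < y < b for y in samples) / N
    # the integral of 1/(2 sqrt(y)) from a to b is sqrt(b) - sqrt(a)
    predicted = b**0.5 - a**0.5
    print(empirical, predicted)   # both close to ~0.207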

1.2.4 The Gaussian density

1.29. Consider the integral I = ∫_R e^{−x²} dx. We can evaluate it by first expressing its square as

I² = ∫_R ∫_R e^{−x² − y²} dx dy

and changing to polar coordinates to obtain

I² = ∫_0^{∞} ∫_0^{2π} r e^{−r²} dθ dr = π.


1.30. A random variable is said to be Gaussian if it is distributed according to

f(x) = (1/√(2πσ)) e^{−x²/2σ},

where σ is a positive real (it plays the role of the variance). Using the point above, we conclude that the above density is normalized.

1.31. Because the integral of e^{−x_1²/2σ_1} ⋯ e^{−x_n²/2σ_n} over R^n is the product of integrals of the type of the one above, we obtain that

∫_R ⋯ ∫_R e^{−x_1²/2σ_1} ⋯ e^{−x_n²/2σ_n} dx_1 ⋯ dx_n = √((2π)^n σ_1 ⋯ σ_n).

1.32. We introduce the vector x = [x_1, . . . , x_n]′ and the diagonal matrix D with positive eigenvalues d_1, . . . , d_n. We have the following identity:

∫_{R^n} e^{−x′(2D)^{−1}x} dx_1 ⋯ dx_n = √((2π)^n det(D)).

1.33. Let Q be a symmetric positive definite matrix. It is a fundamental fact from linear algebra that there exists an orthogonal matrix Θ such that Θ′QΘ is diagonal. If we set z = Θ′x, then the Jacobian of z with respect to x is Θ′, whose determinant has absolute value one. Thus the change of variables formula, along with the fact that the determinant of Q and the determinant of Θ′QΘ are the same, tells us that

∫_{R^n} e^{−z′Θ′(2D)^{−1}Θz} dz_1 ⋯ dz_n = ∫_{R^n} e^{−z′(2Q)^{−1}z} dz_1 ⋯ dz_n = √((2π)^n det(Q)).

1.34. The random variable Z is a multi-dimensional or multivariate Gaussian if its density is given by

f_Z(z) = (1/√((2π)^n det(Q))) e^{−z′(2Q)^{−1}z},

which is well normalized as we have just shown.

1.35. Translating z by a constant value m does not change the value of the integral
above. We say that W is a multivariate Gaussian with mean m (and covariance Q ) if it
is distributed as


f_W(w) = (1/√((2π)^n det(Q))) e^{−(w−m)′(2Q)^{−1}(w−m)}.

1.36. We record the identities, which can be derived easily following the steps outlined in this section:

m = ∫_{R^n} w (1/√((2π)^n det(Q))) e^{−(w−m)′(2Q)^{−1}(w−m)} dw

and

Q = ∫_{R^n} (w − m)(w − m)′ (1/√((2π)^n det(Q))) e^{−(w−m)′(2Q)^{−1}(w−m)} dw.

1.2.5 Moments and generating functions


We here restrict ourselves to real-valued random variables.

1.37. We denote by E X the expectation of X:

E X = ∫_R x f_X(x) dx.

More generally, for any function g : R → R, we define the expectation of g(X) as

E g(X) = ∫_R g(x) f_X(x) dx.

1.38. The variance of X is defined as

var X = E(X − E X)².

1.39. Given a positive integer p, we define the pth moment of X as

E X^p = ∫_R x^p f_X(x) dx.

1.40. The moment generating function of X is defined as the following expectation:

m_X(t) ≜ E(e^{tX}).

Notice that this is rather similar to the Laplace transform of the density.


1.41. Using the Taylor expansion of the exponential, we see that

m_X(t) = 1 + t m_1 + t² m_2/2! + t³ m_3/3! + ⋯

where m_i is the ith moment of X.

1.42. The characteristic function of X is defined as

c_X(t) ≜ E(e^{itX}).

This is, up to a constant factor, the Fourier transform of the density of X.

1.43. Whereas the moment generating function can fail to exist, the characteristic func-
tion always exists.

1.44. If X and Y are two independent random variables, then

m_{X+Y}(t) = m_X(t) m_Y(t),

and similarly for the characteristic function.
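As a worked example (using the convention of 1.30 that σ denotes the variance), let X be Gaussian with mean zero. Completing the square in the exponent gives

m_X(t) = ∫_R (1/√(2πσ)) e^{tx − x²/2σ} dx = e^{σt²/2},

so that, comparing with 1.41, m_1 = 0 and m_2 = σ. Moreover, if X and Y are independent Gaussians with variances σ_1 and σ_2, the product rule above gives m_{X+Y}(t) = e^{(σ_1 + σ_2)t²/2}, the moment generating function of a Gaussian of variance σ_1 + σ_2.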

Lecture 2
Basic notions from discrete-time stochastic processes

We cover aspects of the theory of stochastic processes in discrete time; for each time n, a stochastic process x(n) is a random variable.

2.1 Basic concepts


We have seen in the previous lectures that given a probability space (Ω, Σ, P ), and an
observation space (M, ΣM ), if a random variable X is measurable, we can define a σ-
algebra on Ω by taking X −1 (A) where A ∈ ΣM . This σ-field is called the adapted σ-field
of X . A filtration somehow extends this idea to a family of random variables, where each
random variable is allowed access to an increasing amount of information. This encodes
the idea that as time passes, we know more about a stochastic process.

2.1.1 Stochastic processes and filtrations

2.1. Given a probability space (Ω, Σ, P ), a random process is a collection of random


variables Xt , where t = 0, 1, . . . , n, . . . or t ∈ [0, ∞). The parameter t is called the time
parameter. The random variables are assumed to be valued in the same space: either R
or the positive integers.

Sample paths

A realization of the random variables X (1), . . . , X (n) is called a sample path.

2.2. Stochastic processes are used to model a wide variety of natural phenomena that
are either deterministic, but too complex to model exactly, or genuinely random. Typical
examples are the number of customers in a queue at time t , number of packets arriving
at a router at time t , stock prices, the outcomes of a fair game, etc.


2.3. Consider flipping a coin n times and let X (t ) be the result of the t th flip, then X (t ) is
a random process with Ω = {H H H . . . , H T H H . . . , . . . ,T T T T . . .} and we take Σ = 2Ω .
An element of Ω is a sample path.

2.4. Consider tossing a die 3 times. In that case Ω = {111, 112, . . . , 666}. We can define
the stochastic process X (t ) = 1 if the outcome of the t th toss is odd and 0 if it is even.
The possible sample paths for X are thus 000, 001, . . . where in the first case, the three
tosses are even, etc.

2.5. The events in the σ-field Σ contain information about the whole process. In particular, they contain information about future outcomes. Indeed, in the case of the three tosses of a die above, a random variable that is measurable with respect to 2^Ω will allow us to decide future outcomes. We would like to find a way to force a random variable to use the information that is available only up to time t. Consistent with what was done earlier, we encode the information a random variable has access to into a σ-field.
earlier, we encode the information a random variable has access to into a σ-field.

2.6. The notion of filtration is introduced to control the amount of information a random
variable has access to at time t.

Filtration

A filtration Ft on (Ω, Σ) is a collection of σ-fields on (Ω, Σ) such that Ft ⊆ Σ, ∀t and

Fs ⊆ Ft if s ≤ t .

2.7. Example Let us consider again the case of tossing a coin n times. Define

F_1 = {Ω, ∅, A_1, A_2}, where A_1 = {HH. . .H, H. . .HT, . . . , HTT. . .T} and A_2 = {TH. . .H, T. . .HT, . . . , TTT. . .T}.

In words, F_1 contains, besides Ω and ∅, two sets: the set A_1 of all samples that start with H and the set A_2 of all samples that start with T. It is easy to see that these two sets are disjoint, and that their union is Ω. Hence F_1 is a σ-field.
Observe that F1 is the σ-field adapted to X (1), for X (1) being the outcome of the first
toss.
Define the four disjoint events:

B1 = {H H H H . . . , H H T H . . . , . . . , H H T T . . .}
B2 = {H T H H . . . , H T T T . . . , . . .}
B3 = {T H H H . . . ,T H T T . . . , . . .}
B4 = {T T H H . . . ,T T T T . . . , . . .}


Hence, B_1 contains all the events such that the first two tosses yield HH, B_2 all the events such that the first toss yields H and the second T, etc. The events B_i are clearly
disjoint.
We define F2 as

F2 = {Ω, ∅, B 1, B 2, B 3, B 4, B 1 ∪ B 2, B 1 ∪ B 2 ∪ B 3, . . .}

where we take all the possible twofold and three fold unions of the Bi . Then F2 is a σ-field
(which has a finite number of elements). Observe that A 1 = B 1 ∪ B 2 and A2 = B 3 ∪ B 4 ,
hence
F1 ⊂ F2 .
The opposite is not true, e.g. B_1 ∉ F_1.
Hence a random variable that is measurable with respect to F2 will have access to more
information than a random variable that is measurable with respect to F1 .
In this fashion, we can construct a filtration Fi . Observe that we have again that F2 is
the σ-field adapted to X (1), X (2).

2.8. We thus see that a filtration can be used to control the amount of information a
random variable has access to at time t .

2.9. A filtration thus formalizes the intuitive notion that a random variable Z (t ) defined
on process up to time t can only be a function of the observations X (1), X (2), . . . , X (t ). Z
is also sometimes called past-measurable. Ft is simply the σ-field adapted to X (1), . . . , X (t ),
and we rewrite Z = f (X (1), X (2), . . . , X (t )) for an arbitrary function f as "Z (t ) is adapted
to Ft ". Do not fail to grasp this point, as the language of filtration is the universal
language when talking about stochastic processes in most of the probability/statistics
literature.

2.10. Talking about filtrations allows us to make the definitions independent of the choice of a particular X.

2.11. Given a probability space (Ω, Σ, P) and a filtration F_t, we say that a process X(t) is F_t-adapted if, for every l, (X(1), . . . , X(l)) is F_l-measurable.

2.1.2 Simple random walk


We have already encountered this random walk in earlier examples:


2.12. Consider flipping a biased coin with P(X_i = 1) = p and P(X_i = −1) = 1 − p. Let S_0 ∈ Z (often, S_0 = 0). Define

S(n) = S_0 + ∑_{i=1}^{n} X_i.

This process is called the simple random walk.

2.13. In general, we can define a random walk for the X_k being i.i.d. but not necessarily Bernoulli. The definition of S(n) is the same as above.

Sample mean

A familiar stochastic process, though often not thought of as such, is the sample mean

X̄(n) = (1/n) ∑_{i=1}^{n} X_i,

where the X_i are i.i.d.

Gambler’s ruin

Let us answer the following question about the simple random walk S(t): given two numbers b < 0 < a, what is the probability that S(t) reaches a before it reaches b? It is called the Gambler's ruin problem because it can be used to model the following scenario: you have an amount |b| of money and are playing a fair game against someone with an amount a of money. Assume that at each round of the game, you lose or win one dollar with equal probability. What is the chance that you get ruined before your opponent?

2.14. To solve this problem we will, similarly as when we derived an equation for the
mean, seek to derive an equation for this probability. Define pk to be the probability that
S (t ) reaches a before it reaches b given that S (0) = k .

2.15. The probability of reaching a before b starting from k can be written as

p_k = (1/2) p_{k+1} + (1/2) p_{k−1}.

Indeed, starting from k, in order to reach a or b, we have to cross either k − 1 or k + 1 at the next step. We go to k − 1 with probability 1/2. From k − 1, the probability of reaching a before b is independent of the previous positions (because the odds of winning a bet do not depend on how much money you have) and equal, by definition, to p_{k−1}. We have a similar reasoning for the other term.


2.16. Formally, if we define A to be the event "reaching a before reaching b", we have

P(A | X(t − 1) = k) = P(A, X(t) = k − 1 | X(t − 1) = k) + P(A, X(t) = k + 1 | X(t − 1) = k)
= P(A | X(t) = k − 1, X(t − 1) = k) P(X(t) = k − 1 | X(t − 1) = k)
  + P(A | X(t) = k + 1, X(t − 1) = k) P(X(t) = k + 1 | X(t − 1) = k)
= (1/2) P(A | X(t) = k − 1) + (1/2) P(A | X(t) = k + 1).

The last equation can be rewritten as

p_k = (1/2) p_{k+1} + (1/2) p_{k−1}.

2.17. We used in the derivation several tools that appear frequently in this type of computation:
• First, marginalize with respect to a chosen variable. The choice of this variable is application-dependent.
• Then, condition on the past.
These two steps will often allow us to find a recursive equation for a quantity of interest.

2.18. In order to derive the equation, we relied on the fact that

P (A|X (t ), X (t − 1)) = P (A|X (t )).

In words, only the previous known state of the process matters, and not its whole history.
This fact is intuitively obvious in the case of gambling: whether you lost or won does not
affect the odds that a future coin flip will come up heads or tails. This property appears
often in the study of stochastic processes: it is called the Markov property. We will study
processes having this property, called Markov chains, in more details later in the course.

2.19. A general solution to the equation

p_k = (1/2) p_{k−1} + (1/2) p_{k+1}

is given by

p_k = α + βk.


2.20. We clearly have p_a = 1 and p_b = 0. Plugging these into the general solution, we obtain:

p_k = (k − b) / (a − b).

Starting from k = 0, the probability of ruining your opponent is p_0 = |b|/(a + |b|), and the probability that you are ruined first is a/(a + |b|). We observe that if |b| ≫ a, i.e. you are far richer than your opponent, the probability that you are ruined before your opponent decreases, as was expected.
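A quick simulation is consistent with this formula (a sketch; the values of a, b, k and the number of trials are arbitrary):

    import random

    def reaches_a_before_b(k, a, b):
        # simple symmetric random walk started at k, stopped at a or b
        s = k
        while b < s < a:
            s += random.choice((-1, 1))
        return s == a

    a, b, k, trials = 5, -3, 0, 100_000
    estimate = sum(reaches_a_before_b(k, a, b) for _ in range(trials)) / trials
    print(estimate, (k - b) / (a - b))   # both ~ 0.375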


2.2 First-passage time of a simple random walk


The first passage time of a simple random walk is a random variable defined as follows:
for X (t ) a symmetric simple random walk, define the random variable

τ(x) = min {t : X (t ) = x } .

First passage times arise naturally in many contexts in which an event is triggered when a process reaches a certain value. Examples arise in finance, where certain contracts or options may be activated when an indicator reaches a given value, in biology, where a reaction starts when the concentration of certain compounds is high enough, etc.

2.21. We have the equation

X(t) = X(t − 1) + N(t),

where

N(t) = +1 with probability 1/2, and −1 with probability 1/2,

the N(t) are independent, and X(0) = 0.

2.22. Recall that the probability generating function of τ(x) is given by

M_{τ(x)}(z) = ∑_{n=1}^{∞} z^n P(τ(x) = n).

2.23. We point out the two facts:


• The time necessary to go from l to l + 1 does not depend on the past states of the
process. This again is the Markov property.
• The time necessary to go from 0 to 1 is the same as the time necessary to go from
l to l + 1 for any l .
We use these facts right away.

Computing τ = τ(1).

Define τ to be the first passage time at 1. Starting from zero, the process goes to 1 with probability 1/2, and thus P(τ = 1) = 1/2. With probability 1/2, the process goes to −1. If the process is at −1, it will take τ′ steps to go back to 0 and then τ′′ further steps to go from 0 to 1. Hence

τ = 1 with probability 1/2, and τ = 1 + τ′ + τ′′ with probability 1/2,

where τ′ and τ′′ are independent random variables, distributed identically to τ.


2.24. The mgf of τ is thus:

M_τ(z) = ∑_{n=1}^{∞} z^n P(τ = n) = (1/2) z + (1/2) ∑_{n=1}^{∞} z^n P(1 + τ′ + τ′′ = n).

Observe that, setting ñ = n − 1,

∑_{n≥1} z^n P(τ′ + τ′′ + 1 = n) = ∑_{n≥1} z^n P(τ′ + τ′′ = n − 1)
= ∑_{ñ≥0} z^{ñ+1} P(τ′ + τ′′ = ñ)
= z ∑_{ñ≥0} z^ñ P(τ′ + τ′′ = ñ)
= z ∑_{ñ≥1} z^ñ P(τ′ + τ′′ = ñ),

since P(τ′ + τ′′ = 0) = 0.
Observe that ∑_{n=1}^{∞} z^n P(τ′ + τ′′ = n) is the mgf of the sum τ′ + τ′′ of two independent random variables, which is the product of their mgfs. Because τ, τ′ and τ′′ are all identically distributed, their mgfs are the same. We hence have

M_τ(z) = (1/2) z + (1/2) z M_τ(z)².     (2.1)


2.25. Solving Equation (2.1) for M_τ(z), we obtain M_τ(z) = (1 ± √(1 − z²))/z. Observe that for z ∈ (0, 1), M_τ(z) must take values in (0, 1), hence

M_τ(z) = (1 − √(1 − z²))/z.

We plot this function in Figure 2-1.
We next draw some conclusions from this analysis:

2.26. First, observe from the explicit expression above that lim_{z→1⁻} M_τ(z) = 1. On the other hand,

lim_{z→1⁻} M_τ(z) = lim_{z→1⁻} ∑_{n=1}^{∞} z^n P(τ = n) = ∑_{n=1}^{∞} P(τ = n) = P(τ < ∞),

so that P(τ < ∞) = 1.

Figure 2-1: M_τ(z) as a function of z ∈ [0, 1].

We deduce from this that every state is visited infinitely often. This is a consequence of the two facts highlighted above. Indeed, because P(τ < ∞) = 1, the simple random walk, starting from 0, will visit the state 1 in finite time almost surely. Because of the Markov property, once we are at state 1, we can consider that it is the starting point of the random walk: hence, repeating the same reasoning, we will visit 2 in finite time almost surely, then 3, etc. Because the walk is symmetric, the same applies to states −1, −2, etc.

Expectation

Recall that

E(τ) = (d/dz) M_τ(z) |_{z=1}.

Indeed,

(d/dz) M_τ(z) |_{z=1} = (d/dz) ∑_{n=1}^{∞} z^n P(τ = n) |_{z=1} = ( ∑_{n=1}^{∞} n z^{n−1} P(τ = n) )_{z=1} = ∑_{n=1}^{∞} n P(τ = n) = E(τ).


2.27. Differentiating the explicit formula gives M_τ′(z) = (1/z²)(1/√(1 − z²) − 1), which diverges as z → 1⁻. We thus observe that E(τ) = ∞: the expected time to visit 1 starting from 0 is infinite.

τ(x)
We can derive the distribution of τ(x) using the above remarks and the generating function for τ. Indeed, we can write

τ(x) = τ_1 + τ_2 + . . . + τ_x

where the τ_i are independent, identically distributed variables with the same distribution as τ. Indeed, in order to get from 0 to x, we must get from 0 to 1, which takes τ_1 steps, then from 1 to 2, which takes τ_2 steps, and so on. Hence

M_{τ(x)}(z) = (M_τ(z))^x.

2.28. A skeptical mind may balk at the derivation of M_τ(z), finding the equation established to get its mgf arbitrary. For example, we can as well say that

τ = 1 with probability 1/2,
τ = 2 + τ′ + τ′′ + τ′′′ with probability 1/4,
τ = 2 + τ′′′′ with probability 1/4,

where in the first case the walk goes to the right at the first time step, in the second case the walk goes twice to the left, and in the third case the walk goes once to the left, then once to the right. We have that τ, τ′, τ′′, τ′′′, τ′′′′ are all i.i.d.
A quick computation, in the spirit of the one of the previous point, yields:

M_τ(z) = (1/2) z + (1/4) z² M_τ(z)³ + (1/4) z² M_τ(z).

While finding the roots of this equation, which is cubic in M_τ, may be hard, it is simpler to check that replacing M_τ(z) by (1 − √(1 − z²))/z satisfies the equation. Hence we find the same moment generating function, as was expected.
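This check is immediate to carry out numerically (a sketch evaluating both functional equations at a few points of (0, 1)):

    from math import sqrt

    def M(z):
        return (1 - sqrt(1 - z**2)) / z

    for z in (0.1, 0.5, 0.9):
        lhs = M(z)
        eq1 = 0.5*z + 0.5*z*M(z)**2                        # equation from point 2.24
        eq2 = 0.5*z + 0.25*z**2*M(z)**3 + 0.25*z**2*M(z)   # equation from point 2.28
        print(abs(lhs - eq1) < 1e-12, abs(lhs - eq2) < 1e-12)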

Lecture 3
Poisson counters and stochastic differential equations

A stochastic process can be thought of as a random variable whose domain is a set of paths. We have seen in the previous lectures that this point of view does not offer much insight into the questions we are interested in, such as stopping times or the computation of expected values of a process at a given time. This will be even more true now that we begin our study of continuous-time stochastic processes. We start by introducing Poisson counters. These are positive integer-valued, monotone increasing functions whose inter-jump times (the times between changes in value) are independent and identically distributed according to an exponential distribution. We then introduce what is meant by a solution to a stochastic differential equation driven by such a random path. From there, we derive the Itō rule for jump processes and the expectation rule. Combining both, we derive the Fokker-Planck equation for Poisson driven stochastic differential equations.



Figure 3-1: A sample path for a Poisson counter N (t ) of rate λ. The times between jumps
are distributed according to an exponential with parameter λ. The paths are
continuous from the right, and the limit from the left exists.

3.1 Poisson counters

3.1.1 Continuous-time Markov process


A continuous time, countable state Markov process is a collection of paths x(t ) which
are right-continuous, with limits from the left, which take values in a countable set X
and which are such that for all τ > 0 and x i ∈ X the probability that x(t + τ) = x i given
the knowledge of x(σ) for all σ < t is the same as the probability that x(t + τ) = x i given
x(t ).

3.1.2 Poisson counters

3.1. Let x(t) be a non-decreasing process which takes on values in the positive integers N. A sample path for x(t) is depicted in Figure 3-1.
Denote by p_n(t) the quantity

p_n(t) = probability that x(t) = n.

Let λ > 0 be a positive real number. We say that x(t) jumps at time t if, for all sufficiently small ε > 0, x(t + ε) − x(t − ε) = 1. Assume that the probability that x(t) jumps during the time interval dt is given by λ dt.

3.2. We have the following informal derivation for the evolution of p_n(t). Consider the set of all possible sample paths that fit the description above (integer valued, non-decreasing). The variation at time t of p_n(t) is the variation in the number of paths x(·) that are equal to n at t, divided by the total number of paths. Up to first order, there is an increase in such paths stemming from paths that were at n − 1 before t and jump at t. (Up to first order because we discard paths that jump 'twice in a row', that is, twice within a 'dt'.) Similarly, there is a decrease in the number of paths stemming from paths that are equal to n and jump to n + 1. That is, we have

variation in #paths that are n at t = #paths that are n − 1 and jump − #paths that are n and jump.

Because the probability of a jump during dt is λdt, the number of paths that jump to n from n − 1 is equal to the number of paths that are at n − 1 multiplied by λdt, that is

variation in #paths that are n at t = (#paths that are n − 1) λdt − (#paths that are n) λdt.

Dividing both sides by the total number of paths, we obtain

dp_n(t) = p_{n−1} λdt − p_n λdt.

We can summarize the above in the following system

p˙n (t ) = −λp n (t ) + λp n−1 (t ); p 0 (0) = 1; pi (0) = 0, i > 0 (3.1)

The constant λ is called the counting rate and the process x(t ) a Poisson counter. You can
take (3.1) as the definition of the evolution of the probabilities for a Poisson counter.

3.3. We can write the equations (3.1) as an infinite-dimensional system

d/dt [ p_0 ]   [ −λ   0    0  ⋯ ] [ p_0 ]
     [ p_1 ] = [  λ  −λ    0  ⋯ ] [ p_1 ]
     [ p_2 ]   [  0   λ   −λ  ⋯ ] [ p_2 ]
     [  ⋮  ]   [  ⋮   ⋮    ⋮  ⋱ ] [  ⋮  ].

3.4. From the above formulation, it is clear that these equations can be solved one by one, starting with p_0(t) = e^{−λt}, to obtain

p_0(t) = e^{−λt}
p_1(t) = λt e^{−λt}
p_2(t) = (λt)²/2! e^{−λt}
⋮
p_k(t) = (λt)^k/k! e^{−λt}
⋮


Figure 3-2: Plots of p_k(t) for k = 0, 1, 2 and 5. The decaying exponential corresponds to k = 0, and the time at which p_k(t) attains its maximum increases as k increases.

3.5. Observe that the probabilities p_k(t) are indeed normalized since

∑_{i=0}^{∞} p_i(t) = e^{−λt} ∑_{k=0}^{∞} (λt)^k/k! = e^{−λt} e^{λt} = 1.

3.6. We see that a Poisson counter x(t) strictly increases with time. We can compute its expectation as follows:

E x(t) = ∑_{k=0}^{∞} k p_k(t)
       = ∑_{k=0}^{∞} k λ^k t^k/k! e^{−λt}
       = ∑_{k=1}^{∞} k λ^k t^k/k! e^{−λt}
       = λt ∑_{k=1}^{∞} λ^{k−1} t^{k−1}/(k − 1)! e^{−λt}
       = λt ∑_{k′=0}^{∞} λ^{k′} t^{k′}/k′! e^{−λt}
       = λt e^{−λt} e^{λt} = λt,

where we set k′ = k − 1.


3.7. The higher moments can be evaluated recursively as follows:

E x(t)^p = e^{−λt} ∑_{k=0}^{∞} k^p λ^k t^k/k!
         = e^{−λt} ∑_{k=1}^{∞} k^p λ^k t^k/k!
         = e^{−λt} λt ∑_{k=1}^{∞} k^{p−1} λ^{k−1} t^{k−1}/(k − 1)!
         = e^{−λt} λt ∑_{k′=0}^{∞} (k′ + 1)^{p−1} λ^{k′} t^{k′}/k′!
         = e^{−λt} λt ∑_{k′=0}^{∞} ∑_{r=0}^{p−1} C(p − 1, r) k′^r λ^{k′} t^{k′}/k′!
         = e^{−λt} λt ∑_{r=0}^{p−1} C(p − 1, r) e^{λt} E x(t)^r
         = λt ∑_{r=0}^{p−1} C(p − 1, r) E x(t)^r.
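For instance, taking p = 2 in the recursion gives

E x(t)² = λt (C(1, 0) + C(1, 1) E x(t)) = λt (1 + λt),

where C(p, r) again denotes the binomial coefficient, so that var x(t) = E x(t)² − (E x(t))² = λt.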

3.8. What is the distribution of the times between jumps for a Poisson process? To evaluate it, first notice that the distribution of the time between jumps does not depend on the current value of the counter. Hence, we can focus on finding the distribution of the first jump of a process that starts at 0. We have, for t > 0,

P(jump before t) = 1 − P(no jump before t) = 1 − P(x(t) = 0) = 1 − e^{−λt}.

Hence, if q(t) is the probability density of the time to the first jump, we have ∫_0^t q(s) ds = 1 − e^{−λt}. From that equation, we conclude that

q(t) = λ e^{−λt} for t > 0, and 0 otherwise.

This distribution is called the exponential distribution.
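This also gives a direct way of simulating a Poisson counter: draw i.i.d. exponential inter-jump times and count the jumps that occur before time t. A minimal sketch (parameter values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    lam, t, trials = 2.0, 3.0, 100_000

    def counter_value(lam, t):
        # number of jumps of a rate-lam Poisson counter by time t
        total, jumps = 0.0, 0
        while True:
            total += rng.exponential(1.0 / lam)   # inter-jump time ~ exponential(lam)
            if total > t:
                return jumps
            jumps += 1

    values = np.array([counter_value(lam, t) for _ in range(trials)])
    print(values.mean())         # ~ lam * t = 6
    print((values == 0).mean())  # ~ exp(-lam*t) = p_0(t)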


3.2 Poisson driven differential equations

3.2.1 Itō stochastic differential equations

3.9. Consider the following ordinary differential equation in R^n:

ẋ = f(t, x(t)).     (3.2)

Its solution is obtained via integration:

x(t) = x(0) + ∫_0^t f(σ, x(σ)) dσ.
If f is continuous in both its arguments and if there exists a constant C > 0 such that

∥f(σ, x) − f(σ, y)∥ < C ∥x − y∥,

the latter condition being referred to by saying that f is Lipschitz with constant C in its state argument, then it is known that for each initial condition x(0) there exists a unique solution to (3.2).

3.10. We are interested in making sense of differential equations as above when there is a stochastic term, that is, we want to make sense of

x(t) = x(0) + ∫_0^t f(σ, x(σ)) dσ + ∫_0^t g(σ, x(σ)) dN(σ)     (3.3)

where N(σ) is a Poisson counter (N is a stochastic function of time, hence the last term is called a stochastic integral). Keep in mind that in the above expression x(t) is a random variable.

3.11. We introduce the useful shorthand notation

f(t_1^+) = lim_{t→t_1, t>t_1} f(t) and f(t_1^−) = lim_{t→t_1, t<t_1} f(t),

that is, f(t_1^+) is the limit at t_1 from the right and f(t_1^−) the limit from the left.

3.2.2 Solution in the sense of Itō

3.12. A function x(t ) is a solution of (3.3) in the sense of Itō if


1. On an interval where N is constant, x(t ) obeys ẋ = f (t, x(t )).


2. If N (t ) jumps at time t1 , x(t ) behaves in a neighborhood t according to

x(t1 ) = x(t1+ ) = x(t1− ) + g (x(t1− ), t1 )

Functions x(t ) obeying the conditions above are said to be càdlàg (for continu à droite,
limite à gauche) or corlol (for continuous to the right, limit on the left).

3.13. In words, x(t ) is a solution in the sense of Itō if it behaves like a usual differential
equation (with ẋ = f (t, x)) when there is no jump in N (t ), and when there is a jump in
N (t ) at t = t1 , x(t ) jumps from its value right before t1 by an amount g (t1, x(t1− )). We
additionally require x(t ) to be continuous from the right with a well-defined limit from
the left.

3.14. Let us illustrate the above definition on a simple example. Consider the equation

dx = dN (t ), (3.4)
with initial condition x(0) = 0. When there is no jump in N , x(t ) remains constant
because f (t, x) = 0. When there is a jump in N at t1 , x(t ) jumps from its value before
the jump x(t1− ) to x(t1− ) + 1 since g = 1. It is clear that x(t ) = N (t ) in this case, which is
a comforting fact in view of (3.4).

3.15. Consider the equation

dx = −xdt + dN, x(0) = 1

with N (t ) a Poisson counter of rate λ. When there is no jump, x(t ) is a decaying expo-
nential, when N (t ) jumps, x(t ) jumps by 1, since g (x) is in this case 1. We illustrate a
sample path for this equation below

A sample path for the equation dx = −x dt + dN.


3.16. Consider the equation

dx = xdt + xdN ; x(0) = 1. (3.5)

When there is no jump, x(t ) obeys ẋ = x and when there is a jump, x(t ) jumps by its
value right before the jump, that is x(t ) doubles in size at each jump.
If t_1, t_2, . . . are the jump times for N(t), the solution is

x(t) = e^t for 0 ≤ t ≤ t_1,
x(t) = 2e^t for t_1 < t ≤ t_2,
x(t) = 4e^t for t_2 < t ≤ t_3,
⋮

3.17. From (3.5), one should resist the urge to integrate the equation treating N (t ) as a
usual function and conclude that x(t ) = C e N (t )+t for a constant C . The ’solution’ obtained
in this fashion is not the Itō solution.

3.2.3 Computer simulations


How to simulate a stochastic differential equation

dx = f (x)dt + g (x)dN

on a computer? We present two approaches.

3.18. If the vector field f(x) is well-behaved (e.g. not stiff), direct integration such as the Euler method can be used for f(x), and the interpretation of the probability of a jump during a time interval ∆t, as presented earlier, can be used to handle g(x)dN. Explicitly, choose a small time-step ∆ > 0 (no larger than 10^{−3}). The probability of a jump between t and t + ∆ is λ∆. In view of this, the following integration scheme can be used (a sketch in code is given after the second approach below).
1. Sample w uniformly at random on [0, 1].
2. If w ≤ λ∆ (there is a jump):

x(t + ∆) = x(t) + f(x(t))∆ + g(x(t)),

where g(x(t)) is the Itō jump term.
3. Otherwise (no jump):

x(t + ∆) = x(t) + f(x(t))∆.

3.19. The second approach consists of first drawing the next jump time, which we know is distributed according to an exponential with parameter λ. The method is
1. Sample T from an exponential distribution with parameter λ.
2. Integrate the equation with initial condition x(t) for T seconds.
3. Set x(t + T) ← x(t + T) + g(x(t + T)).
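A minimal sketch of the first scheme above, applied to the equation dx = −x dt + dN of point 3.15 (the rate, step size and horizon are arbitrary choices):

    import random

    lam, dt, T = 1.0, 1e-3, 10.0   # counter rate, time step, horizon
    f = lambda x: -x               # drift
    g = lambda x: 1.0              # jump amplitude

    x, t, path = 1.0, 0.0, []
    while t < T:
        if random.random() <= lam * dt:        # a jump occurs in [t, t + dt)
            x = x + f(x) * dt + g(x)
        else:                                  # no jump: plain Euler step
            x = x + f(x) * dt
        t += dt
        path.append((t, x))

    print(path[-1])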


3.3 The Itō rule for changes of variable


In this section, we start with bona fide calculus of stochastic processes by introducing the Itō rule for jump processes.

3.20. Consider the stochastic differential equation in R^k:

dx = f(t, x) dt + ∑_{i=1}^{n} g_i(t, x) dN_i     (3.6)

where the vector fields f and g_i are differentiable and the N_i are independent Poisson counters of rates λ_i.
Let ψ : R^k → R^{k′} be a differentiable function; we seek to find an equation for dψ. To fix notation, we write dψ = h(t, x, N_i); hence we seek to find an appropriate h. Given a particular realization of the processes N_i(t), we obtain a particular sample path x(t). An appropriate h is one such that, for identical realizations of the Poisson processes, ψ evaluated along the solution of (3.6) and the solution of dψ = h(t, x, N_i) are the same, where solutions are taken in the Itō sense.
In between jumps, x(t) evolves according to dx = f(t, x)dt, and thus ψ(t, x) evolves according to dψ = (∂ψ/∂t) dt + ⟨∂ψ/∂x, f(t, x)⟩ dt. If one counter, say N_i, jumps at time t, recall that x(t) is càdlàg with x(t^+) = x(t^−) + g_i(t, x(t^−)). Hence we require ψ(t, x(t^+)) = ψ(t, x(t^−) + g_i(t, x(t^−))). From there, and because the probability of having two counters jump at the same time is zero, we conclude that

dψ(t, x) = (∂ψ/∂t) dt + ⟨∂ψ/∂x, f(t, x)⟩ dt + ∑_i (ψ(t, x + g_i(t, x)) − ψ(t, x)) dN_i.

3.21. Consider the stochastic differential equation in R:

dx = −x dt + x dN.

Let ψ(x) = x². The Itō equation for ψ is

dψ = −2x² dt + ((x + x)² − x²) dN = −2x² dt + 3x² dN.


3.4 Finite-state, continuous-time jump processes


Poisson-driven equations are extremely versatile modelling tools. We show here how to
write down the equations of a continuous-time, finite state Markov process using Poisson
driven stochastic equations.

3.22. Let X = {x_1, x_2, . . . , x_n} be a finite set and consider an X-valued stochastic process x(t). Define p_i(t) to be the probability that x(t) = x_i. Assume that these probabilities obey the equations

ṗ_i = ∑_j a_{ij} p_j.


3.23. Conservation of probability (that is, ∑_i p_i(t) = 1) requires that the columns of A sum to zero:

∑_i a_{ij} = 0 for all j.

Positivity of each entry of p(t) requires that

a_{ij} ≥ 0 for i ≠ j.

These two requirements together imply that a_{ii} ≤ 0. Matrices A satisfying the two conditions above are called intensity matrices or infinitesimal generators.

3.24. Consider the three-state continuous-time Markov chain with p_i = P(x(t) = i) evolving according to

d/dt [ p_1 ]   [ −3    1    0  ] [ p_1 ]
     [ p_2 ] = [  2   −1   1.5 ] [ p_2 ]
     [ p_3 ]   [  1    0  −1.5 ] [ p_3 ].
We can represent the Markov chain graphically as a directed graph on the states 1, 2, 3, with an edge from state j to state i labeled by the rate a_{ij}: here, an edge of rate 2 from 1 to 2, of rate 1 from 1 to 3, of rate 1 from 2 to 1, and of rate 1.5 from 3 to 2.

A stationary distribution p ∞ for A is a distribution such that Ap ∞ = 0.

3.25. The above representation of the Markov chain focuses on the evolution of the
probability that x(t ) be at a certain state. One might want to also have a representation


of the paths that x(t ) follows. We can do so using Poisson driven stochastic equations.
Starting slow, consider the following equation

dx = −2xdN, x(0) = 1 (3.7)

where N is a Poisson counter of intensity λ.


A solution in the sense of Itō for the above equation is such that x(t ) is constant when
N does not jump, and when N jumps, x(t ) goes to −x(t ). Hence a solution will have
x(t ) ∈ {−1, 1}.

3.26. Recall that, up to first order, the probability of a jump during the time interval ∆ is λ∆. Hence, using a reasoning similar to the one in 3.2, the probability that x(t) = 1 is modified by an amount −λ∆p_1, corresponding to jumps from 1 to −1, and by an amount λ∆p_{−1}, corresponding to jumps from −1 to 1. That is, dp_1 = −λp_1 dt + λp_{−1} dt. We can easily derive a similar equation for p_{−1} and we obtain

ṗ = [ −λ   λ ] p.     (3.8)
    [  λ  −λ ]

3.27. We refer to (3.7) as a sample path description and to (3.8) as a probabilistic description
of a continuous-time, finite-state Markov process

3.28. Given a probabilistic description of a finite-state continuous-time (FSCT) Markov chain, we can associate to it many different sample path representations. We single out here two such representations.

3.29. Given a FSCT Markov chain with n states s_1, . . . , s_n evolving according to ṗ = Ap, where p_i is the probability of being in state s_i, we associate a vector-valued jump process, which jumps between the unit vectors of R^n. Precisely, we associate to s_i the unit vector e_i. Observe that if x(t) = e_i, then

x(t) + (e_k − e_l) e_l′ x(t) = x(t) if l ≠ i, and e_k if l = i.

3.30. Set G_{ij} = (e_i − e_j) e_j′. From the above we conclude that the Poisson driven process

dx = ∑_{i,j} G_{ij} x dN_{ij}     (3.9)

evolves in the set {e_1, e_2, . . . , e_n}. If we set the rates of the Poisson counters N_{ij} appearing in (3.9) to be

λ_{ij} = a_{ij}, i ≠ j,

the resulting sample path representation has P(x(t) = e_i) = p_i(t).

3.31. Instead of assigning to each state in the FSCT Markov chain a unit vector in R^n, we can assign to it a real number and use Lagrange interpolation to define a Poisson driven equation that only evolves in this finite set of numbers. Precisely, to each s_i, assign pairwise distinct real numbers z_i. Define, for z ∈ {z_1, . . . , z_n},

ϕ_{ij}(z) = 0 if z ≠ z_j, and ϕ_{ij}(z) = z_i − z_j if z = z_j.

For example, the Lagrange polynomial

ϕ_{ij}(z) = (z_i − z_j) ∏_{k ≠ j, 1 ≤ k ≤ n} (z − z_k)/(z_j − z_k)

fits the bill.

3.32. We now define the Poisson counter driven equation

dx = ∑_{i,j} ϕ_{ij}(x) dN_{ij}

with N_{ij} a Poisson counter of rate λ_{ij} = a_{ij} (for i ≠ j). It is a sample path representation of ṗ = Ap.

3.33. Let us illustrate both representations on the FSCT Markov chain with probabilities evolving according to

     [ −2   0   3 ]
ṗ =  [  0  −1   0 ] p.
     [  2   1  −3 ]


3.5 Computing expectations

3.5.1 Expectation rule

3.34. We have now at our disposal a versatile modelling tool—that is, Poisson driven
stochastic differential equations—to describe in continuous time systems in which dis-
crete events occur at random times. We have also shown how to use that tool to represent
sample paths corresponding to a finite state, continuous-time Markov chain.

3.35. As was the case with discrete-time stochastic processes, recall that this sample path description can be, from a certain point of view, understood as a convenient way to put a measure on the usually hard to handle space of paths (that is, functions of time). In the case of Poisson driven equations, this path space would be all the functions x(t) for which there exists a sequence of jumps so that x(t) is a possible solution in the sense of Itō.

3.36. While we do not want to work explicitly with the measure induced by a stochastic
differential equation, we nevertheless want to evaluate quantities that depend on it. We
show in this section and the next how to do so.

3.37. To this end, let us focus on the Itō equation (3.6). Recall that if N_i(t) is a Poisson counter of rate λ_i,

E(N_i(t) − λ_i t) = 0.

The above property is often referred to in the literature as saying that the compensated Poisson process N_i(t) − λ_i t is a martingale with respect to its own filtration.
Let ∆ > 0; the probability that N_i(t) jumps between t and t + ∆ is independent of x(t). Hence, we have that

E(x(t + ∆) − x(t)) = E ∫_t^{t+∆} f(σ, x(σ)) dσ + ∑_i E ∫_t^{t+∆} g_i(σ, x(σ)) dN_i(σ).

Letting ∆ → 0, we obtain that d E x = E f(t, x(t)) dt + ∑_i E g_i(t, x(t)) λ_i dt. Hence,

(d/dt) E x(t) = E f(t, x) + ∑_i E g_i(t, x) λ_i.

3.5.2 Examples
We illustrate the expectation rule on a couple of examples. In particular, we show how
it can be used in conjunction with the Itō rule to evaluate higher moments of x(t ).


3.38. Consider the PSDE

dx = −x dt + dN

where N(t) is a counter of rate λ. Let x(0) = 0. Using the expectation rule, we have

(d/dt) E x(t) = −E x(t) + λ,

from which, since x(0) = 0, we conclude that E x(t) = λ(1 − e^{−t}). What is the variance of x(t)? Recall that var x(t) = E(x²(t)) − (E x(t))². In order to evaluate E x²(t), we use the Itō rule and the expectation rule. First, the Itō rule yields

dx² = −2x² dt + ((x + 1)² − x²) dN = −2x² dt + (2x + 1) dN.

Taking expectations on both sides, we get

(d/dt) E x²(t) = −2 E x²(t) + (2 E x(t) + 1) λ.

Combining the two moment equations, the variance satisfies (d/dt) var x(t) = −2 var x(t) + λ, whose solution is

var x(t) = (var x(0) − λ/2) e^{−2t} + λ/2,

that is, var x(t) = (λ/2)(1 − e^{−2t}) for the deterministic initial condition x(0) = 0.

3.39. Consider the following system,

dx = −x dt + z dt
dz = −2z dN

with z(0) ∈ {−1, 1}. Observe that z(t) ∈ {−1, 1} for all t. What is the variance of x(t)? Using the Itō rule, we get

dx² = −2x² dt + 2xz dt.

Hence, to use the expectation rule with the above equation, we need E(xz). Hence, we use the Itō rule again to get

d(xz) = (−xz + z²) dt − 2xz dN.

Since z²(t) = 1, this equation did not introduce any new terms, and we can thus now use the expectation rule. We obtain the system of equations

(d/dt) E x² = −2 E x² + 2 E(xz)
(d/dt) E(xz) = −(1 + 2λ) E(xz) + 1.

This is a linear system of ODEs of the type ẋ = Ax + b where

A = [ −2       2     ],  b = [ 0 ].
    [  0  −(1 + 2λ)  ]       [ 1 ]
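Setting the time derivatives to zero gives the stationary values E(xz)_∞ = 1/(1 + 2λ) and E x²_∞ = E(xz)_∞ = 1/(1 + 2λ). Moreover, (d/dt) E z = −2λ E z and (d/dt) E x = −E x + E z, so E z(t) → 0 and E x(t) → 0; the stationary variance of x(t) is therefore 1/(1 + 2λ).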


3.6 The Fokker-Planck equation for jump processes


We derive here an equation for the density of x(t). In general, the situation is a bit complicated because the support of the density can be finite, countable, or uncountable, depending not only on the stochastic equation at hand, but also on the initial condition. For example, the equation

dx = −(x − 1) dN_1 − (x + 1) dN_2

is such that x(t) ∈ {−1, 1} if x(0) ∈ {−1, 1}, but this does not hold if x(0) is initialized outside of that set.
The equation

dx = x dt + dN; x(0) = a > 0

has a density whose support is lower bounded by a.

3.6.1 Derivation of the density equation

3.40. Keeping in mind that for a given PSDE (Poisson-Driven Stochastic Equation),
determining what the support of the density of x(t ) is can itself be challenging, we show
how to derive an equation for the density assuming it exists.
We consider the general time-invariant PSDE:

n
dx = f (x)dt + g i (x)dNi .
i =1
Let At be a σ-field and assume that there exists a differentiable function ρ(t, x) such that
for all A ∈ At ,

P (x(t ) ∈ A) = ρ(t, x)dx
A

R R
Let ψ : k 7−→ be a smooth function; it is commonly referred to as a test function
in this context. From the Itō rule, we obtain
⟨ ⟩ ∑
∂ψ
dψ = , f (x) dt + (ψ(x + g i (x)) − ψ(x)) dNi
∂x i

If we compute the expectation of ψ using the expectation rule, we obtain


⟨ ⟩ ∑
∂ψ
d
dt
E
ψ(x) = E∂x
, f (x) + E
(ψ(x + g i (x)) − ψ(x)) λ i
i

Using ρ(t, x), we know that Eψ(x) = ψ(x)ρ(t, x)dx. The time derivative of the expec-

44
3.6 The Fokker-Planck equation for jump processes

tation is thus ∫ ∫

d
dt
E
ψ(x) =
d
dt
ψ(x)ρ(t, x)dx = ψ(x)
∂t
ρ(t, x)dx

Putting the last two relations together, we obtain


∫ ⟨ ⟩ ∑
∂ ∂ψ
ψ(x) ρ(t, x)dx =
∂t
E ∂x
, f (x) + E
(ψ(x + g i (x)) − ψ(x)) λ i
i
∫ ⟨ ⟩ ∫ ∑
∂ψ
= , f (x) ρ(t, x)dx + (ψ(x + g i (x)) − ψ(x)) λ i ρ(t, x)dx
∂x i

Because ρ(t, x)dx = 1, then for sufficiently large |x | we have that ρ(t, x) = 0 almost
∑ ∂ψ
everywhere. Using integration by parts, and recalling that ⟨ ∂ψ∂x , f (x)⟩ = ∂xi fi (x) we
obtain that
∫ ∑ ∫
∂ψ ∞ ∂
⟨ , f (x)⟩ρ(t, x)dx = ψ(x)ρ(t, x)fi (x) −∞ − ψ(x) (f (x)ρ(t, x))dx .
∂x i | {z } ∂x
0

Putting this back in the previous equation, we have


∫ ∫ [ ∑ ] ∑∫
∂ ∂
ψ(x) ρ(t, x)dx = −ψ (f ρ) − λ i ψ ρ(t, x) dx + λ i ψ(x + g i (x))ρ(t, x)dx
∂t ∂x
(3.10)
To continue with the derivation, we set g˜i (x) = x + g i (x). Let us restrict ourselves to
the case of g i (x) being such that the g˜i are one-to-one. Setting z = g˜i (x), we obtain by a
change of variables
∫ ∫ ( ) −1
∂g i
ψ(x + g i (x))ρ(t, x)dx = ψ(z )ρ(t, g˜i−1 (z )) det I + dz

∂x

3.41. Because ψ is arbitrary, we can replace the integral equation (3.10) by the differential
equation

 ( ) −1 
∂ ∂   ∑ ∂g
ρ(t, x) = − f (x)ρ(t, x) + λi ρ(t, g˜ (x)) det I +
−1 i
− ρ(t, x)
∂t ∂x  i ∂x x+gi (x) 
i

This is the density equation for PSDE, provided that ρ(t, x) exists and is smooth enough,
and provided that g˜i is one-to-one. If it is not one-to-one, one can easily modify the
argument to take into account the set of inverse images of g˜i−1 (x). We do not write it
explicitly here.

45
3 Poisson counters and stochastic differential equations

An intuitive explanation of the Fokker-Planck equation

While it might appear unnatural at first, the Fokker-Planck equation makes sense on an
intuitive level and can be obtained using simple, though hardly rigorous, means. We give
R
such an explanation here, first in the case of a process evolving in before extending it
to more general processes.

3.42. Consider first the equation

dx = f (x)dt + g (x)dN

R
and let x 0 ∈ be fixed. Let ε > 0 be small and consider the interval I = [x 0 −ε/2, x 0 +ε/2].
To save space, we write b = x 0 + ε/2 and a = x 0 − ε/2. We can interpret the density for x
around a given point as the number of sample paths that are around that point, that is:

number of sample paths x(t ) that are in I at time t


ρ(t, x 0 )ε " = "
total number of sample paths at t

where the quote signs around the equality sign indicate that we are not attempting to
make sense of these quantities rigorously.
The density equation can be obtained by taking the difference in the number of paths
that enter I and the number of paths that leave I . To this end, we analyze all the ways
in which paths can leave or enter I .
There are the sample paths which do not jump around time t . These evolve according
to f (x)dt . We thus need to evaluate ’how many’ paths enter I following f (x) minus how
many leave I . For example, if x 0 = 0 and f (x) = −x, it is clear that paths are entering I
from the left and the right of the interval and hence ρ(t, 0) increases under the effect of
this term. For a short time interval ∆, a sample path will move under the effect of f by an
amount f ∆. We thus have that the number of paths entering/leaving at b during a short
time interval ∆ is the number of path that are within a distance f (b)∆ from b. There are
ρ(t, b)f (b)∆ such paths. The number of paths leaving/entering at a is f (a)ρ(t, a)∆. Now
regarding the sign to give to each contribution, observe that if f (b) points to the right
(that is, is positive), it points away from x 0 and thus its effect is to decrease the number
of paths in I . Reciprocally, if f (a) points to the right, its contribution is to increase the
number of paths in I . Hence, the contributions at the boundaries have opposite signs.
Putting these together, we find that

d ρ(t, x 0 )ε = −f (x 0 + ε/2)ρ(t, x 0 + ε/2)∆ + f (x 0 − ε/2)ρ(t, x 0 − ε/2)∆.

Dividing both sides by ε∆ and taking limits, we find that

∂ ∂
ρ(t, x) = − (ρ(t, x)f (x)).
∂t ∂x
Due to jumps of path that are in I right before t , the number of paths in I decreases

46
3.6 The Fokker-Planck equation for jump processes

by ρ(t, x)λ.
We now need to account for paths that are outside of I right before t , jump at t
and whose jumps are so that they land in t . To do so, consider a couple of examples:
assume that g (x) = 1. The jumps are all of magnitude 1. Hence, the paths that were
at (x 0 − 1) ± ε/2 right before t and jump are contributing to the total number of paths.
That terms is ρ(t, x − 1)λε. More generally, all the paths that are in the inverse image of
I under x + g (x) (the effect of the jump) contribute to increasing the number of paths
in I . For example, assume that g (x) = x, then the paths that are in x/2 ± ε/4 and
jump contribute to paths in I . In general, we thus have to look at the inverse image
under the function x + g (x) of the interval/region I . The volume of this interval/region
is related to the volume of I by the absolute value of the determinant of the Jacobian
of x + g (x)—a fact
( usually
) seen in multivariable calculus. This corresponds to the term
−1 ∂
ρ(t, g˜ (x)) det ∂x g˜ λ

3.43. Putting all these effects together, that is the effect of the drift f (x), the effect
of path that enter I through jumps and paths that leave I through jumps, we have the
Fokker-Planck equation.

3.44. The derivation of the effect of the drift term can be made more rigorous and
R
extended to the case of a density in n using Stokes’ theorem. It usually goes under the
name of Liouville equation or continuity equation. The idea is as follows. Consider the
differential equation
ẋ = f (x)
with f a smooth vector field and let ρ(t, x) to be a density (you might think of these as
a density of particules or a compressible fluid) that evolves following the vector field. By
this, we mean that if there is a distribution of initial conditions ρ(0, x) for the differential
equation ẋ = f (x), and if we solve the equation for each initial condition, the distribution
of solutions at time t is given by ρ(t, x).

3.45. We have the following balance equation: if S is a closed surface enclosing a volume
R
V in n and n(x) the unit normal vector of S at x that points outside, then
∫ ∫
d
ρ(t, x)dx = (ρ(t, x)f (x)) · n dS .
dt V S

The above equation says that the difference in what is inside V is what comes in minus
what comes out through the boundary S . Stokes’ theorem says that
∫ ∫

(ρ(t, x)f (x)) · n dS = (ρ(t, x)f (x))dx .
S V ∂x

47
3 Poisson counters and stochastic differential equations

ρ(t, x)

2x 2x

x
a
2
x0 b a x0 b 2x 0
2 2

Figure 3-3: In the interval I = [a, b] centered at x 0 , there are roughly ρ(t, x 0 )(b −a) sample
paths at time t . We consider the sample path equation dx = −xdt +2xdN . We
depict in dashed-line examples of paths that enter I at t and in dotted-line
examples of path that exit I at t . There are two mechanisms through which
paths enter/exit I : through the drift term (these are the paths around a and
b) and through jumps. The density equation can be obtained by taking the
number of entering paths minus the number of exiting paths.

Putting the above two together, we have


∫ ∫
∂ ∂
ρ(t, x)dV = (ρ(t, x)f (x)dV.
V ∂t V ∂x

Because the above holds for any V , we obtain the same relation as above.

3.6.2 Some examples

3.46. Consider the PSDE


dx = −xdt + dN 1 − dN 2
where N 1 and N2 are standard Poisson counter with rate λ.
In this case, g˜1 (x) = x + 1, which is clearly invertible with inverse g˜1 (x)−1 = x − 1.
Similarly, we find that g˜2−1 (x) = x + 1. We conclude that the determinants of the Jacobians
appearing in the formula are both one. Hence, the density for ρ evolves according to

∂ ∂
ρ(t, x) = (ρx) + λ ρ(t, x − 1) − 2λ ρ(t, x) + λ ρ(t, x + 1)
∂t ∂x
This equation can be solved using the Fourier transform.

3.47. PSDE appear very frequently in the modelling of queues. In that context, knowl-

48
3.6 The Fokker-Planck equation for jump processes

edge of how the density for the length of the queue evolves is very useful to understand
queue dynamics and insure quality of service. We now investigate a simple example.
Models in which the size of the queue is a real number are often referred to as fluid queue
models.
Consider the following queue, in which tasks (or customers) are being processed at a
rate µ. The arrival of new tasks in the queue is modelled as a Poisson process N (t ) with
rate λ. Denote by 1R+ (x) the indicator function for the positive reals, that is
{
1 if x > 0
1R+ (x) =
0 otherwise

A sample path description for the length of the queue is given by

1
dx = − 1R+ (x)dt + dN .
µ
We illustrate a sample path for the evolution of the queue below:

x(t)

The probability density for x satisfies

∂ 1 ∂
ρ(t, x) = (1R+ (x)ρ(t, x)) + λ (ρ(t, x − 1) − ρ(t, x))
∂t µ ∂x

First, observe that there might be a non-zero chance that the queue is empty. Taking
expectations of the sample path equation, we see that

d
dt
E
x =−
1
µ
E
1R+ (x) + λ.

E
Hence, in steady-state 1R+ (x) = λ µ. Observe that this says that the probability that
the queue is not empty is λ µ. (In general, recall that the expectation of the indicator
function of a set A is equal to the measure of that set A) Hence, in steady state, the
probability that the queue is empty is 1 − λ µ. Of course, the above ceases to make sense
if λ µ > 1, in which case the assumption that the steady-state exists is incorrect. This

49
3 Poisson counters and stochastic differential equations

agrees with our intuition that if the arrival rate λ is larger than the service rate µ, the
size of the queue will grow indefinitely.
Let us now focus on finding the steady-state density for the size of the queue. We know
that the steady-state has a δ measure at the origin of strength 1 − λ µ. We thus look for
a steady-state solution of the type

ρ(x) = (1 − λ µ)δ(x) + ρ1 (x) (3.11)


where ρ1 (x) is a piecewise differentiable density.
Because there is a non-zero probability at zero and because the queue increases by
units of 1, we might expect to have a discontinuity for ρ1 (x) at x = 1. This discontinuity
is also hinted at by the form of the density equation. Keeping this in mind, let us try to
solve (3.11) one unit interval at a time. For x ∈ (0, 1), because ρ(x − 1) = 0 the density
equation reduces to
1 d
ρ = λ ρ.
µ dx
The solution is of the form ρ = βe λ µx . We can evaluate β as follows: at the steady-state,
the drift-term (that is, −1/µ11R+ dt ) adds probability to x = 0 at the rate ρ(0+ )/µ = β/µ,
while the Poisson counter removes probability at a rate ρ(0)λ, that is (1 − λ µ)λ. At the
steady-state, these are equal and hence

β = (1 − λ µ)λ µ.

We thus have a solution for the density over [0, 1). Now that x = 1, the density equation
is
1 dρ
− λ ρ = λ(1 − λ µ)δ(x − 1).
µ dx
Hence, at x = 1, the solution jumps by an amount −λ µ(1 − λ µ). Now, on the interval
(1, 2), we have

1 d
ρ − λ ρ = (1 − λ µ)λ µe −λ µe λ µx , x(1+ ) = λ µ(1 − λ µ)(e λ µ − 1)
µ dx

This equation can be solved using standard means. The complete solution is obtained
by continuing to integrate the equation piece by piece as we have done above.

3.48. Consider again the system


{
dx = −xdt + zdt
dz = −2zdN

with z (0) ∈ {−1, 1}. This system combines discrete and continuous variables. The density

50
3.6 The Fokker-Planck equation for jump processes

ρ(t, x, z ) is thus defined only for z = 1 and z = −1. In this case,


[ ] [ ]
−(x + z ) 0
f (x, z ) = and g (x, z ) = .
0 −2z

Furthermore, g˜ = [x, z ]′ + g (x, z) = [x, −z ]′. Hence, g˜−1 = [x, −z ]′ . Finally, observe that
the determinant det(I + ∂g /∂x) = | − 1| = 1. The density equation is thus
( [ ])
d ∂ −(x + z )
ρ(t, x, z ) = − ρ(t, x, z ) + λ (ρ(t, x, −z ) − ρ(t, x, z ))
dt ∂x 0

Because z takes on the values −1 or +1, we can write the above equation as a system
of equation with ρ+ (t, x) = ρ(t, x, 1) and ρ− = ρ(t, x, −1):
{∂ ∂
∂t ρ+ (t, x) = ∂x ((x − 1)ρ+ (t, x)) + λ(ρ− (t, x) − ρ+ (t, x))
∂ ∂
∂t ρ− (t, x) = ∂x ((x + 1)ρ− (t, x)) + λ(ρ+ (t, x) − ρ− (t, x))
It us a good exercise to derive the above system of equation using the intuitive approach
outlined above.

51
3 Poisson counters and stochastic differential equations

3.7 The Backwards equation


We present here the evolution equation for a process that can be interpreted as be-
ing obtained by a time-reversal. This equation is useful, among other applications, in
smoothing of time series. We will start with the discrete-time case and generalize the
idea to the continuous-time case.

3.49. We are given a discrete-time, finite state Markov process whose probabilities evolve
according to
p(t + 1) = Ap(t )
where A is a given stochastic matrix. 1 For example, think of p as containing the probability
of sunny, overcast or rainy weather. We can think of the entry i j of A as being, e.g.
the probability that it is rainy(corresponding to i ) tomorrow given that it is overcast
(corresponding to state j ) today. The backwards equation is the process telling us: what
is the probability that it was rainy yesterday given that it is overcast today.

3.50. Let us assume that there exists a backwards process, that is a matrix A˜ such that
˜ ).
p(t − 1) = Ap(t
Assuming that we are at state ωi at time k + 1, we know that we could have ended up
at this state coming from any state ω j at time k , with probability

p(x(t + 1) = ωi |x(t ) = ω j )
∑ .
j p(x(t + 1) = ωi |x(t ) = ω j )

We thus see that


A˜ = A′D
where
∑i a1i 0 ... 0 
−1
 ∑ 
 0 i a 2i . . . 0 
D =  . .. .. 
 .. .

. 

 0 ... 0 i a ni 

is such that A˜ is a stochastic matrix.


We call the process with transition matrix A′ the unnormalized backwards process.

3.51. The corresponding situation in continuous-time, with forward evolution equation


p˙ = Ap is
p˙ = (A′ + D)p

1
Recall that A is stochastic if the columns of A sum to one and it has positive entries. If p is a probability
vector (that is the entries of p sum to one and are positive), then Ap is also a probability vector if A is
stochastic.

52
3.7 The Backwards equation

where D is the diagonal matrix that renders A′ infinitesimally stochastic.

53
3 Poisson counters and stochastic differential equations

3.8 Computing correlations


We show how to use the machinery developed in the previous sections to evaluate quan-
tities of interest. Consider a FSCT x(t ) and let τ > 0. We are concerned with evaluating
E
quantities (x(t )x(t + τ)).
Temporal correlations as above are of course very helpful in decision making; we also
recall that the power spectrum is the Fourier transform of such temporal correlations
when the process is stationary. We will come back to this in a later section. First, we
show how to evaluate these correlations. Explicitly, if we denote by q i j (t, τ) the probability
that x(t ) = x i and x(t + τ) = x j then

E
(x(t )x(t + τ)) = q i j (t, τ)x i x j .
ij

3.52. In the case of a finite-state process, one might take a direct approach as we now
explain. First, assume that the probabilistic description of the process is given by

p˙ = Ap

where A is an infinitesimally stochastic matrix and p(0) = p 0 is given. Then p(x(t ) = i ) is


the i th entry of p(t ) = e At p 0 .

3.53. Recall that

p(x(t ) = x i , x(t + τ) = x j ) = p(x(t + τ) = x j |x(t ) = x i )p(x(t ) = x i ).

But p(x(t + τ) = x j |x(t ) = x i ) is nothing else than the j i th entry of e Aτ . Set ϕ(t ) = e Aτ ,
we thus have that ∑
E(x(t )x(t + τ)) = pi (t )ϕ j i (τ)x i x j .
ij

3.54. We used two ingredients to evaluate this correlation: we transformed a joint-


probability into a conditional probability and used the matrix exponential of the transi-
tion matrix to obtain the conditional probability. Using in addition the Markov property,
E
one can easily evaluate quantities such as (x(t1 )x(t2 ) · · · x(tn )).
The second approach we consider, which is also applicable to general jump process (by
which we mean processes that do not necessarily evolve in a finite state-space) is based
on the sample path description. For ease of notation, we now consider the equivalent
E
problem of evaluating (x(t )x(τ) for τ > t .

54
3.8 Computing correlations

3.55. Consider the general Itō equation



dx(t ) = f (x)dt + g i (x)dNi (t ).
i

The idea is to write this equation at time τ, multiply by x(t ) and take expectations.
Because whether N (τ) jumps or not is independent of the value of N (t ) for t < τ, the
E E
expectation (x(t )dN (τ)) is simply (x(t ))λd τ if N has rate λ. Precisely,

d

E
(x(t )x(τ)) = E (xt f (x(τ)) + E (x(t )gi (x(τ))) λi .
i

3.56. Consider the Itō equation

dx = −xdt + dN

with x(0) = 0 and N is a Poisson counter of rate λ. We evaluate limt →∞ x(t )x(t + τ). E
We evaluated the mean and variance of x for this process earlier and found

Ex(t ) = e −λt + λ
λ
Ex 2(t ) =
2
− λe −2t

We now write an equation for x(t + τ):

d τ x(t + τ) = −x(t + τ)d τ + dN (t + τ).

Multiplying both sides by x(t ) and taking expectations, we obtain

d

E E
x(t )x(t + τ) = − (x(t )x(t + τ)) + x(t )λ. E
This is a linear differential equation that can easily be solved explicitly.

3.57. Consider the stochastic process z (t ) which takes values +1 and −1 and whose
probability vector p(t ) = (p(z (t ) = 1, p(z (t )) = −1)′ evolves according to
[ ]
−a b
p˙ = p.
a −b

Let x(t ) be such that x(t ) = α when z (t ) = 1 and x(t ) = β otherwise.


1. Find conditions such that
E
lim x(t ) = 0.
t →∞

55
3 Poisson counters and stochastic differential equations

2. Find conditions such that in addition to the above

lim
t →∞
E(x(t )x(t + τ)) = e −|τ| .

3.58. We start by exhibiting a sample path description for y(t ). From the previous
sections, we deduce that

dx = (α − x)dN α β + ( β − x)dN βα
evolves in {α, β} if initialized in that set. Indeed, if x = α and dN α β jumps, then x does
not change. If dN βα jumps, then x(t − ) changes by an amount ( β − α) and thus x(t ) = β.
The rate of change from α to β is the rate of change from 1 to −1 for z (t ). Hence we
find that the rates of N α β and N βα are respectively λ α β = b and λ βα = a.
Using the expectation rule, we have

d
dt
E
x(t ) = a β + αb − (a + b) x(t ). E
E
Hence limt →∞ x(t ) = a β+αb
a+b and thus the required condition is a β + αb = 0.

E
3.59. Let us now evaluate x 2 (t ), which is necessary for answering the second question
as we will see. To this end, we use the Itō rule to get
( ) ( )
dx 2 = (x + (α − x))2 − x 2 dN α β + (x + ( β − x))2 − x 2 dN βα

Now, the expectation rule yields

( ) ( )
d
dt
E E
(x 2 (t )) = α 2 − (x 2 ) b + β 2 − (x 2 ) a E
E
= a β 2 + b α 2 − (a + b) (x 2 )

whose solution is
E(x 2(t )) = C e −(a+b)t + a βa ++ bb α
2 2
.

3.60. To compute the correlation, we write the sample path equation with time variable
τ:

dx(τ) = (α − x(τ))dN α β (τ) + ( β − x(τ))dN βα (τ).


Assume that τ > t and let t → ∞ with the condition established above met (that is, a β +
E
αb = 0. We multiply by x(t ) and take expectation remembering that (x(t )dN α β (τ)) =

56
3.8 Computing correlations

E
a (x(t ))d τ and similarly for other combinations:

d

E E
(x(t )x(τ)) = −(a + b) (x(t )x(τ))

and thus
E(x(t )x(τ)) = E(x 2(t ))e −(a+b)(τ−t )

3.61. If we now assume that τ < t , we can use the same reasoning (starting from (x 2 (τ)) E
and integrating to t ) to get

E(x(t )x(τ)) = E(x 2(τ))e −(a+b)(t −τ) .

3.62. We now go back to our original notation and change τ for t + τ, then E(x(t )x(τ))
E
becomes (x(t )x(t + τ)) and for
( )
a β 2 + b α 2 −(a+b)τ
E
(x(t )x(t + τ)) = C e −(a+b)t
+
a +b
e for τ > 0
( )
−(a+b)(t +τ) a β 2 + b α 2 (a+b)τ
= Ce + e for τ < 0
a +b

Now if we let t → ∞ above, we obtain

E(x(t )x(t + τ)) = a βa ++ bb α


2 2
e −(a+b)|τ| .

Hence, taking the parameters to satisfy a +b = 1 and a β 2 +b α 2 = 1 answers the problem.

57
Lecture 4
Dynamic programming and optimal control

We now address optimal control of discrete and continuous-time Markov chains as well as
Itō differential equations. We start by establishing a few known result regarding optimal
control of deterministic systems.

59
4 Dynamic programming and optimal control

4.1 Dynamic programming in discrete-time

4.1. At the basis of dynamic programming is the so-called principle of optimality. As any
good principle, it is almost a tautology:
From any point on an optimal trajectory, the remaining trajectory is optimal for the problem
initiated at this point

4.2. Example: Shortest path. We illustrate the use of this principle on the problem of
finding the shortest path in a graph. Consider the following graph:
3 4
A1 2 B1 1 C1

3
5

1 2
3 3
3 7 2 3
S A2 6 B2 5 C2 T
2 1
3

7 4
6 2
A3 B3 C3

We want to find the shortest path from S to T. The principle of optimality tells us that
if S Ai B j C k T is a shortest path, then C k T is the shortest path from C k to T . Now there
is only one path from C γ to T for γ = 1, 2, 3. Let us record the shortest distance in a
function V . Hence

V (C 1 ) = 2; V (C 2 ) = 3 and V (C 3 ) = 1.
Applying the principle of optimality again, we know that the path B 1C γT is a shortest
path from B 1 to T only if C γT is a shortest path from C γ to T. Now because there is only
one path from C γ to T, it is the shortest. There are on the other hand three path from B 1
to T , passing through either C 1 , C 2 or C 3 . The shortest path from B 1 to T will be such
that the distance from B 1 to C γ plus the distance from C γ to T is minimized. This latter
distance is V (C γ ). Hence, the shortest distance from B 1 to T is min [4 + 2, 1 + 3, 3 + 1] = 4.
Hence V (B 1 ) = 4.. Now, a similar analysis yields V (B 2 ) and V (B 3 ). We summarize the
result:
V (B 1 ) = 4,V (B 2 ) = 5 and V (B 3 ) = 3.
Applying the principle of optimality again, the shortest path from A1 to T will contain
the shortest path from B β to T. We have recorded the lengths of these path in the function
V . Hence V (A1 ) is easily evaluated at the minimum of the distance from A1 to B β plus
V (B β ). We get
V (A 1 ) = 7,V (A2 ) = 7 and V (A3 ) = 7.

60
4.1 Dynamic programming in discrete-time

Hence the shortest distance from A α to T is seven for all α. Applying the principle of
optimality one last time, we get that the shortest path from S to T will contain the shortest
path from A α to T. Hence we found that a shortest path from S to T is S A 1B 2C 2T , with
length 8.

4.3. The main operation we performed in the derivation above was of the type
 
V (A α ) = min distance between (A α − B β ) + V (B β ) .

We now formalize this approach and derive Bellman’s equation.

4.4. We derive an equation from the above principle in the case of an n stage optimization
problem. Let x t be the state of the process at time t , u t be our control at time t and
assume that the process obeys an evolution equation

x t +1 = a(x t , ut , t )

with given initial condition x 0 . Consider the cost function


n−1
J = c (x t , ut , t ) + J n (x n ).
t =0

Denote by V (t, x) the optimal cost-to-go starting from x at t , that is,


n−1
V (t, x) = inf c (x s , u s , s ) + J n (x n ) for x t = x
ut ,...,u n
s =t

In the example of the shortest path problem, V is nothing else than the shortest distance
from x to the target point.

4.5. Observe that J is the sum over the time variable of functions of (x t , u t , t ). We say
that the cost function is separable to emphasize the fact that it can be written as such a
sum. The function V is called the value function.

61
4 Dynamic programming and optimal control

4.6. We now apply the principle of optimality to the value function:


n−1
V (t, x) = inf c (x s , u s , s ) + J n (x n )
ut ,...,u n
s =t
 ∑
n−1 

= inf inf c (x t , ut , t ) + c (x s , u s , s ) + J n (x n )
ut ut +1,...,u n  
 s =t +1
  ∑n−1  

= inf c (x t , ut , t ) + inf   c (x s , u s , s ) + J n (x n ) 
ut  ut +1,...,u n   
 s =t +1
∑ 
Observe that when x t +1 = a(x t , u t , t ) is given, inf ut +1,...,un n−1 s =t +1 c (x s , u s , s ) + J n (x n ) is
V (t + 1, a(x, u, t )). Since x t +1 = a(x t , u t , t ), we obtain that the value function obeys the
recursion
V (t, x) = inf [c (x, u, t ) + V (a(x, u, t ), t + 1)]
u

for t < n and with final boundary condition V (n, x) = J n (x). This equation is called the
Bellman equation.

4.7.

Example 4.1. You have started your life with a lofty inheritance of M dollars placed in a trust
fund that gives you a yearly income of q M dollars, where q is the interest rate. You are averse to
working, but can add to the capital in order to increase your income. You cannot take money out
of the capital however. How much money you should save or use at time t in order to maximize
your spending over your lifetime, which you estimate to be n.

We first derive the evolution equation for the income. Our income in year 0 is x 0 =
q M . If we spend u 0 , we can add x 0 − u 0 to the capital, and our income in year 1 is
q (M + (x 0 − u 0 ) = x 0 + (x 0 − u 0 ). In year 2, if we have consume u 1 the previous year and
added x 1 − u 1 to the capital, our income is q (M + (x 0 − u 0 ) + (x 1 − u 1 )) = x 1 + (x 1 − u 1 ).
We can thus cast the problem as a dynamic programming problem where the evolution
equation is
x t +1 = x t + q (x t − ut )
and the objective is to maximize

J = s umtn−1
=0 u t .

We define the value function V (t, x) as being the maximal consumption we can afford
if our capital is x at time t . Bellman’s equation tells us that
 
V (t, x) = max u t + V (t + 1, x + q (x − ut )) .
ut

Since by time n, you can not consume anymore, V (n, x) = 0. Solving the problem

62
4.1 Dynamic programming in discrete-time

backwards in time, we have first

V (n − 1, x) = max(u n − 1).
u n−1

Because u n−1 is bounded above by x (you cannot spend more than we earn), we obtain
V (n − 1, x) = x.
Next, we have

V (n − 2, x) = max (u n−2 + V (n − 1, x + q (x − u n−2 )) = max (x + u n−2 + q (x − u n−2 ))


0≤u n−2 ≤x 0≤u n−2 ≤x

Because the function we seek to maximize is linear in u, the optimal u will be either 0 or
x. Hence V (n − 1, x) = max(1 + θ, 2)x . The first entry in the max corresponding to u = 0
and the second to u = x.
We postulate that V (t, x) = kt x, that is V (t, x) is linear in x with a coefficient kt . We
will try to find a recurrence equation for kt . We have that
 
V (t, x) = max ut + kt +1 (x + q (x − ut )) = max(kt +1 (1 + q ), kt +1 + 1)x
ut

where we again used the fact that the maximum is obtained at u = 0 or u = x. Observe
that the right-hand-side is of the form kt x, and thus our hypothesis is verified and we
have the recurrence

kt = max(kt +1 (1 + q ), kt +1 + 1) = kt +1 + (max(q kt +1, 1))


with kn−1 = 1.
Because q < 1, kn−2 = 2. In fact, we see that kn−i will be i until q (kt −i ∗ )) > 1. Let i ∗ be
the smallest integer such that q i ∗ > 1, that is i ∗ > 1/q . For all times before i ∗ , we save all
of our income (because the first argument in the max is the largest), and for years after
that, we consume all our income.
For example, if your are optimistic and take n = 120 and q = 5%, you should save for
the first 100 years, and then spend as much as you can.

63
4 Dynamic programming and optimal control

4.2 Markov decision processes


We now consider Markov decision processes. Instead of having a deterministic evolution
for the dynamic as in the previous section, we have probability transitions

P (x t +1 |x t , u t ).

We wish to find the controls u 0, · · · , u n that minimize the cost


∑ n 
E  c (x t , u t , t ) .
 t =0 

4.8. Observe that the expectation is taken with respect to a probability transition which
is itself dependent on the sequence of controls that is applied.

4.9. We so far assume perfect state observation, that is we observe the state x t at time
t and make a control choice ut based on that observation. Observe that, by convention,
the control u 0 takes us from time 0 to time 1, hence an n stage optimization problem
starting from an initial condition x 0 and ending at time n will require n decisions to be
made (or controls to be applied) u 0, . . . , u n−1 .

4.10. It is customary in this context to call the optimal control sequence u an optimal
policy, and use the letter π to denote it. That is, π = (u 0, . . . , u n−1 ). We denote by πt the
optimal policy from t on: πt = (u t , . . . , u n−1 ).

4.11. As before, we let



n
Jt = [c (x s , u s , s )] .
s =t
We denote by
E
V (t, x) = inf ( Jt |x t = x, πt )
πt

the optimal expected cost to go given that we start at x at time t . Because the process is
Markov, the expected cost-to-go only depends on the current state and control and not
on the past history of the process.

64
4.2 Markov decision processes

4.12. We now derive a recursion for the value function.


∑ n 
V (t, x) =
πt
E
inf  c (x s , u s , s ) | x t = x, πt 

 s =t 
∑ n
= inf
πt
E * c (x t , ut , t ) + c (x s , u s , s )+ | x t = x, πt +
, s =t +1 -
∑ n
= inf
ut ,πt +1
E *c (x, u t , t ) + c (x s , u s , s ) | x t = x, πt +
, s =t +1 -
 ∑ n 
= *
inf c (x, u t , t ) + inf 
ut
E
πt +1


c (x s , u s , s ) | x t = x, πt  +
 -
, s =t +1
For the last transition, observe that x t = x is fixed and u t = u affects the transition
probability from x t to x t +1 , hence the term c (x, ut , t ) is not affected by the expectation.

∑n 
4.13. We now focus on the last term E s =t +1 c (x s , u s , s ) | x t = x, πt . Let αi range over
the possible values of the state of the Markov chain. We can write this expectation as
∑ ∑
αt +1 · · · αt +n P (x t +1 = αt +1, · · · , x n = αn |x t = x, πt ) J t +1, where the sums range over all
possible states for x t +i (the sums can be replaced by integrals in the continuous state-
space case.) We now use Bayes’ rule with respect to x t +1 to obtain
∑ ∑ 
··· P (x t +2, · · · , x n | x t +1, x t = x, πt )P (x t +1 | x t = x, πt ) Jt +1 .
αt +1 αt +n

We omitted the αt in the above equation to make it more readable. From now on, we
will explicitly write them only when they help the understanding. We now make two
observations:
• P (x t +1 |x t , πt ) only depends on u t and not on controls ut +1 etc. since they affect
transitions happening after t + 1. Hence

P (x t +1 |x t , πt ) = P (x t +1 |x t , ut )

• The Markov property and the definition of u t tells us that

P (x t +2 · · · x n | x t +1, x t = x, πt ) = P (x t +2 · · · x n | x t +1, πt +1 ),

that is we can omit x t = x and u t in the conditioning.

65
4 Dynamic programming and optimal control

4.14. Putting all of this together, we get that


 ∑ n 
V (t, x) = inf *c (x, u, t ) + inf 
u πt +1
E
s =t +1
c (x s , u s , s ) | x t = x, πt  +
 -
,
( ∑ ∑ )
= inf c (x, u, t ) + ··· inf Jt +1P (x t +2 · · · |x t +1, πt +1 )P (x t +1 |x t , u)
u πt +1


= inf *.c (x, u, t ) +
u
E
inf ( Jt +1 |x t +1 = x αt +1, πt +1 )P (x t +1 = x αt +1 |x t , u)+/
πt +1
,( αt +1 -
∑ )
= inf c (x, u, t ) + V (t + 1x t +1 )P (x t +1 |x t , u)
u
E
= inf (c (x, u, t ) + V (t + 1, x t +1 )|x t , u))
u

We thus obtain

V (t, x) = inf (c (x, u, t ) +


u
E [V (t + 1, xt +1)|xt = x, ut = u])

4.15. As an example, consider the finite state Markov process

pt +1 = A(u)pt

where p(0) = p 0 is given and the transition matrix depends on a control variable u.
We label the states x 1, x 2, . . . , x n . We have the same cost function as above. Recall that
according to the convention used in this course (that is, the columns of A sum to one),
the j th column of A represents the probability distribution of the next state in the process
given that we are currently at state x j . That is, Ai j (u) = P (x t +1 = i | x t = j, u). Hence, if
we write the function V (t + 1, x) as a row vector

V (t + 1) = [V (t + 1, x 1 ),V (t + 1, x 2 ), . . . ,V (t + 1, x n )] ,

the expectation of V (t + 1, x) entering in the recursion can be written as

E(V (t + 1, x)|xt = x j , u) = V (t + 1)A(u):, j .


where A(u):, j denotes the j th column of A(u). Hence, we can write in a slightly more
explicit manner the Bellman equation as
 
V (t, x = x j ) = inf c (t, x, u) + V (t + 1)A:, j .
u

Using this equation, we can evaluate V (t, x) for all x.

4.16. Consider the following stochastic system:

66
4.2 Markov decision processes

pt +1 = A(u)pt
where [ ]
.5 + u .7
A(u) =
.5 − u .3
 ′
where pt = p(x(t ) = a, p(x t ) = b
We seek to maximize the probability of being in state x 1 at time 3 and while exerting
the least effort in expectation. We thus introduce the function


3
J = u t2 − ηp(x t = a)
t =0

where η is a positive real parameter and u ∈ [−.2, .2]. We start from state x(0) = a.
Define the value function V (t, x) as being the least achievable cost from state x at time
t . Then clearly, V (3, a) = −η and V (3, b) = 0. The value function obeys the recursion
[ ]
V (t, x) = inf u 2 +
u
E (V (t + 1, x1)|xt = x, u) .
 
Let us fix η = 1. Hence V (2, a) = inf u u 2 + (.5 + u)(−1) + (.5 − u)0. The minimum 
is obatined at u = 0.2 and V (2, a) = −0.66. Similarly, V (2, b) = inf u u 2 + .3(−1) + .7 .
The minimal value is obtained at u = 0, for which V (2, b) = .4 Observe that if we are in
state b, our transitions are independent of u, and thus we can conclude that the optimal
control when we are in state b is zero, for all t : u(t, b) = 0.
The optimal controls for time 1 and 0 are obtained in a similar fashion.

67
4 Dynamic programming and optimal control

Optimal stopping time


A very useful class of problems can be recast as optimal stopping problems. We set-up
the optimality equation in this case, but stop short of proving properties about stopping
times, as they require the use of Martingales and Doob’s stopping theorem, which is
beyond the scope of this course.
In this section, we also change our point of view from the rather defensive "cost min-
imization", a point of view prevalent in engineering, to the seemingly more pleasing
"reward maximization", the prevalent point of view in economics and social sciences.
They are of course related by a sign change.

4.17. The general set-up is the one of an uncontrolled Markov chain

pt +1 = Apt ,

but now we have the option to stop the process at anytime. If we stop at time t and we
are at state x, the reward is r (t, x).

4.18. A stopping time τ is a past-measurable random variable (that is a random variable


that depends only on past observations) which takes on value in the positive integers.

4.19. A typical reward would emphasize being at a certain state while penalizing long
E
waits. We seek to maximize (r (t, x)) over all stopping times 0 ≤ τ ≤ n.

4.20. Let V (t, x) be the optimal expected reward if we start from x at time t . This reward
E
is either r (t, x), if we decide to stop the process at t , or it is pt +1V (t + 1, x). The Bellman
equation is thus
E
V (t − 1, x) = max(r (t − 1, x), [V (t, x) | x t = x]

4.21. Starting with the boundary condition V (n, x) = r (n, x), we can evaluate V (t, x) for
t < n. The optimal stopping rule is thus the first time the expected reward is less or
equal than the current reward:

τ ∗ = {min : V (t, x t ) ≤ r (t, x t )}.


t

4.22. Example: The secretary problem. We now illustrate the above on the well-
known secretary problem. Consider the following game: your opponent has n tickets in
his hands with one number written on each ticket. You have no information about the
numbers. The deck is shuffled. Your opponent shows you the first ticket and you have to
decide on the spot if you select it or not. If you do not select it, he shows you the second

68
4.2 Markov decision processes

ticket and you again have to decide, etc. Once you discard a ticket, it cannot be selected
again. You win the game if you select the ticket with the highest number written on it.
The name of this example comes from the problem of interviewing candidates for a
secretary (or any other job) position. If you have n candidates, you want to pick the best
one by hiring her/him on the spot.
How to maximize your chance of winning/your probability of hiring the best candidate?
1. Observe that any stopping strategy that does not make use of the observation and
decides before hand to pick the k th ticket has a chance 1/n of winning.
2. It might not be obvious at first that one could do better, but consider the following
scheme: you discard the first n/2 tickets (assuming wlog that n is even), but record
the number on them, then pick the first ticket whose value is larger than any of the
first n/2. Notice that picking a ticket whose value is smaller than any of the first
n/2 uncovered tickets is surely a losing move. What is the probability of winning
of such a scheme?
3. If the ticket with highest number (let us call it the largest ticket) was in the first n/2
tickets, we lose. This happens with probability 1/2. If the second largest ticket is in
the first n/2 ticket, and the first one is not, we win. This happens with probability
1
4 . Hence with this simple scheme, we already have increased our probability of
winning to a constant factor of 14 ! We have the bounds

1 1
≥ P (winning) ≥ .
2 4
4. Let us find the optimal stopping time using dynamic programming. The first step
is to realize that there is an underlying Markov process in the problem, which
corresponds to how the tickets are shuffled. Define x t so that x t = 1 if the largest
ticket so far if the t th ticket, and 0 otherwise. Because we do not know anything
about the tickets a priori, this process is indeed Markov: the probability that the
current ticket is larger than the previous tickets is 1/t . We thus have a two-state
Markov process with probabilities P (x t ) being independent:

1 t −1
P (x t = 1) = , P (x t = 0) = . (4.1)
t t
5. The probability that we have the largest ticket in hand is zero if x t = 0. If x t = 1,
the probability that we have the largest ticket in hand, given our past observation
is t /n. To see this, we use Bayes’ rule with the following events: At is the event “I
have the largest ticket in hand at time t " and Bt is the event “I have the largest of
the first t tickets at time t . Hence, the probability of winning if we are at x t = 1
(which corresponds to event Bt ):

P (At ) 1/n
P (At | Bt ) = P (Bt | At ) =1 .
P (Bt ) 1/t

69
4 Dynamic programming and optimal control

The reward is in this case the probability of winning. Hence


{ t
if x t = 1
r (t, x) = n
0 if x t = 0

6. The value function V (t, x) is the expected probability of winning given that I am
in state x at time t . Hence, if if have the largest ticket so far x = 1 at time t = n
(end -time), I have won and thus V (n, x = 1) = 1. Reciprocally, if I do not have the
largest ticket x = 0 at the end, I have lost with certainty: V (n, 0) = 0.
7. We can now set-up the Bellman recursion V (t − 1, x) = max(r (t − 1, x), V (t, x))E
where the expectation is taken with respect to the distribution given in (4.1). This
yields:




E
V (t − 1, 0) = max(0, V (t, x)) = t −1 t V (t, 0) + t V (t, 1)
1


E
 V (t − 1, 1) = max(r (t − 1, 1), V (t, x)) = max( t −1 , t −1V (t, 0) + 1V (t, 1))
 n t t
t −1
 = max( n ,V (t − 1, 0))
(4.2)
and the optimal strategy is to take another ticket if x t = 0, or if x t = 1 and t/n <
V (t, 0). We stop and keep the ticket we have if x t = 1 and nt ≥ V (t, 0).
8. Because V (t, 1) ≥ V (t, 0) (from the second equation in (4.2)), we conclude from
the first equation in (4.2) that V (t, 0) > V (t + 1, 0).
Hence, the first argument in the max defining V (t, 1) is increasing to one whereas
the second is decreasing to zero. There exists thus a t ∗ at which they will cross,
that is there exists a stopping time t ∗ such that the optimal strategy is to discard
any ticket for the first t ∗ steps and then accept the first one that is larger than the
first t ∗ .
9. We can find this t ∗ by solving the Bellman recursion. For t > t ∗ , we have that
V (t, 1) = t/n, hence using the first equation in (4.2):

1t t −1 1 t −1
V (t − 1, 0) = + V (t, 0) = + V (t, 0).
tn t n t
This yields

V (t − 1, 0) 1 V (t, 0) 1 1 1
= + = ··· = + +···+ .
t −1 n(t − 1) t n(t − 1) nt n(n − 1)
Hence
t −1 ∑ 1
n−1
V (t − 1, 0) = , for t ≥ t ∗
n s =t −1 s

Let us now focus on large n, t . Recall that t ∗ is the smallest integer such that
t∗ ∗ ∗ ∑n−1 1
n ≥ V (t , 0). We deduce that t is the smallest integer such that s =t ∗ s ≤ 1. We

70
4.2 Markov decision processes

can approximate this sum by log(n/t ∗ ) and thus

t ∗ ≃ n/e ≃ n/2.8

The probability of winning is V (0, 0) = V (t ∗, 0) ≃ t ∗ /n = 1/e ≃ 0.36

71
4 Dynamic programming and optimal control

4.3 Infinite-time horizon problems


We now consider the case of an infinite time-horizon with a possibly discounted cost.

4.23. We use the same set-up as before, that is we have a discrete-time, controlled Markov
process with given probability transitions P (x t +1 |x t , u).

4.24. Let γ be a real number and 0 < γ ≤ 1. Recall that πt = [u t , · · · , u n ] is the policy or
control law starting from time t . We write π for π0 . Consider the cost function

n
J (n, x 0, π) = γt c (x t , ut ).
t =0

Because x t is a random variable, so is J .

4.25. Define
V (n, x 0 ) = inf
π
E  J (n, x0, π) | x0, π0 ,
the optimal expected cost for an horizon n and with initial condition x 0 .

4.26. Observe that we can write



n
J (n, x 0, π0 ) = γt c (x t , ut )
t =0

n
= c (x 0, u 0 ) + γt c (x t , ut )
t =1

n−1
= c (x 0, u 0 ) + γ * c (x t +1, ut +1 )+
, t =0 -
= c (x 0, u 0 ) + γ J (n − 1, x 1, π1 )

4.27. Using the above, we have the following recursion for V :


 
V (n, x 0 ) = inf
π
E
J (n, x 0, π) | x 0, π0
 
E
= inf c (x 0, u 0 ) + γ J (n − 1, x 1, π1 ) | x 0, π0
π
[  ]
= inf c (x 0, u 0 ) + γ inf
u0
E
J (n − 1, x 1, π1 ) | x 0, π0
π1

72
4.3 Infinite-time horizon problems

Using a similar approach to the one used in a previous lecture, we can show that
  
inf
π1
E E
J (n − 1, x 1, π1 ) | x 0, π0 = [V (n − 1, x 1 ) | x 0, u 0 ] .

The idea is to write explicitly the expectation and condition on the state x n+1 . The
Bellman equation is thus

E
V (n, x) = inf [c (x, u) + γ (V (n − 1, x 1 ) | x 0, u)] .
u

4.28. The infinite-horizon cost or steady-state cost is defined as

V (x) = lim V (n, x).


n→∞

4.29. There are three widely studied situations under which the limit will exist:
1. 0 < γ < 1 and |c (x, u)| < M for a given constant M and all admissible x, u.
2. 0 < γ ≤ 1 and c (x, u) ≥ 0.
3. 0 < γ ≤ 1 and c (x, u) ≤ 0.
The first case is called discounted programming, and is the one we will focus on. The other
cases are not much more difficult to handle (use the monotone convergence theorem to
exchange the order of integration), and go under the name of negative and positive
programming respectively. Note that the nomenclature is based on a definition of the
problem in which one tries to maximize a reward, not minimize a cost—the problems
are clearly equivalent, it suffices to take the reward to be minus the cost and change
minimization to maximization.

4.30. Consider the following example: Let

x t +1 = x t + u t + θt

where θt are independent, uniformly distributed on [−1, 1] random variables. We seek


to find the control law that minimizes the cost function


J (x, π) = γt (ut2 + x t2 ).
t =0

4.31. The value function V (x) satisfies the equation (note that the costs are not neces-
sarily bounded, but are positive.)

73
4 Dynamic programming and optimal control

[ ]
E
V (x) = inf x 2 + u 2 + γ [V (x 1 )|x 0 = x, u 0 = u]
u
[ ∫ 1 ]
= inf x + u + γ
2 2
V (x + u + θ)dθ/2
u −1

4.32. The equation above does not lend itself to easy manipulations to find V (x). We
present in the next section a method to solve equations such as the one above, called
value iteration. Another common approach is to try a parametric form for the value
function and use the equation above to fit the parameters. The form of the cost function
is often a good starting point.

4.33. Let us try a value function V (x) = ax 2 + bx + c with parameters a, b, c ∈ R. We first


establish a few relations:
∫ 1
θdθ/2 = 0
−1
∫ 1
θ 2dθ/2 = 1/3
−1

Using these, we get


[ ∫ 1 ( ) ]
ax + bx + c = inf x + u + γ
2 2 2
a(x + u + θ + 2ux + 2uθ + 2xθ) + b(x + u + θ) + c dθ/2
2 2 2
u −1
[ ( )]
1
= inf x + u + γ a(x + u + + 2ux) + b(x + u) + c
2 2 2 2
u 3

Differentiating the right-hand-side of the last equation with respect to u and setting the
result to zero, we get
2ax + b
u(x) = − .
2(1 + γa)
If we plug u(x) back in the previous equation, we see that the right-hand-side is a
quadratic polynomial in x. It thus suffices to equate the coefficients to the left and to the
right to obtain the value function.

74
4.4 Value iteration

4.4 Value iteration


Value iteration is an iterative method often used to solve the Bellman equation in infinite
horizon. It goes as follows:

4.34. Consider the Bellman equation

E
V (x) = inf (c (x, u) + γ (V (x 1 )|x 0 = x, u 0 = u) .
u

Choose V0 (x) arbitrarily and evaluate recursively:

u
E
Vn (x) = inf (c (x, u) + γ (Vn−1 (x 1 )|x 0 = x, u 0 = u) . (4.3)

This recursion is called value iteration; we can show that under some assumptions, Vn (x)
will converge to the value function V .

4.35. In case of a finite horizon, value iteration is nothing more than the usual dynamic
programming recursion if one takes V0 = 0.

4.36. We will show that it converges for the case of positive costs c with discounted
infinite horizon cost and give more general conditions for convergence, without proof,
later.

4.37. The idea is to show that the right-hand-side of (4.4) is a contraction and appeal to
fixed point theorems. To this end, define the operator

E
T (v (·))(x) : v (x) 7−→ inf (c (x, u) + γ (v (x 1 )|x 0 = x, u 0 = u) .
u

Observe that T takes a function as argument and outputs a function. We assume that
|c (x, u)| is bounded for all admissible x, u.

4.38. The value function V (x) is the solution of V = T (V ) and we can rewrite the value
iteration algorithm as
Vn = T (Vn−1 ). (4.4)
Hence, if we can show that T is a contraction, appealing to fixed point theorems, we
obtain that V = T V has a unique solution and the sequence Vn of (4.4) converges to this
solution.

R
4.39. In order to show that T is a contraction, first observe that if d ∈ , then T (V +d ) =
T (V ) + γd . Now let v 1 (x) and v 2 (x) be such that v 1 (x) ≤ v 2 (x) for all admissible x. Then
clearly T (v 1 )(x) ≤ T (v 2 )(x)—this property is called monotonicity.

75
4 Dynamic programming and optimal control

4.40. Recall that the infinity norm of a function is

∥ f (x)∥∞ = sup |f (x)|.


x

Let v 1 (x) and v 2 (x) be continuous functions and let

d = ∥v 1 (x) − v 2 (x)∥∞ .

It is easy to see that v 1 (x) − d ≤ v 2 (x) ≤ v 1 (x) + d , hence applying T to both sides and
using the monotonicity property:

T (v 1 (x)) − γd ≤ T (v 2 ) ≤ T (v 1 (x)) + γd .
We can rewrite this equation as

∥T (v 1 ) − T (v 2 (x))∥∞ ≤ γ∥v 2 − v 1 ∥∞

Hence, T is a contraction for the infinity norm (when γ < 1) and the value iteration
converges to a unique solution.

4.41. More generally, assume one of the following holds:


1. 0 < γ < 1 and |c (x, u)| < M
2. 0 < γ ≤ 1 and c (x, u) ≥ 0 and u takes values in a finite set.
3. 0 < γ ≤ 1 and c (x, u) ≤ 0.
Then the value iterations of (4.4) converge to V (x).

76
4.5 Dynamic programming in continuous-time

4.5 Dynamic programming in continuous-time

4.42. The principle of dynamic programming in continuous time take a form similar to
the one in discrete time. We show here how to derive it in the case that all the functions
involved are well-behaved, and the Taylor approximations made hold. We can obtain a
more formal derivation at little more cost by using the maximum principle of Pontryagin.
We will however not do this here.

4.43. Recall the discrete-time dynamic programming equation for x t +1 = f (x t , u t , t ) and



V (t, x) = ns=t c (x s , u s , s ) with initial condition x t = x. We had the following recursion:

V (t, x) = inf [c (x t , u t , t ) + V (t + 1, x t +1 )] .
u

4.44. Now consider the system


ẋ = f (x, u, t )
∫T
with cost function J = 0 c (x, u, t )dt . We define
∫ T
J (x, t ) = c (x, u, s )ds
t

under the condition that x(t ) = x. If we approximate the differential equation using Euler
integration with a step δ, that is

x t +1 ≃ x t + f (x t , u, t )δ.

We similarly approximate the cost as a (Riemann) sum



J ≃ c (x t , ut , t )δ.
t

4.45. For the discrete-time system obtained via this approximation, the Bellam principle
reads
V (t, x) = inf [c (x(t ), u(t ), t )δ + V (t + δ, x(t + δ))]
u

We expand terms in the right-hand-side up to first order:


[ ]
∂V ∂V
V (t, x) = inf c (x(t ), u(t ), t )δ + V (t, x) + (t, x(t ))f (x(t ), u(t ), t )δ + δ
u ∂x ∂t

77
4 Dynamic programming and optimal control

The terms V (t, x) cancel out. Dividing both sides by δ, we obtain


[ ]
∂V ∂V
= − inf c (x(t ), u(t ), t ) + (t, x(t ))f (x(t ), u(t ), t ) . (4.5)
∂t u ∂x

Equation (4.5) is the Bellman equation of dynamic programming in continuous time.

4.46. The case of a discounted cost of the form


∫ T
J = e −at c (x, u, t )dt
0

is treated similarly. Observe that e −at ≃ 1−at up to first order. Hence in the discrete-time
approximation, it corresponds to a discount factor γ = 1−aδ. Recalling that the optimal-
ity condition in the case of a discounted cost was V (t, x) = inf u [c (x, u, t ) + γV (t + 1, x t +1 )],
we deduce that the equivalent relation for the continuous-time case is
[ ]
∂V ∂V
= − inf c (x(t ), u(t ), t ) − aV + (t, x(t ))f (x(t ), u(t ), t ) . (4.6)
∂t u ∂x

We recover (4.5) in case a = 0.

4.47. In case you find the above derivation of the Bellman equation unsatisfying, we can
easily show that if a control u has a value function V which satisfies (4.6) for all x and t ,
then u is optimal. To this end, consider the cost function
∫ T
J = c (x, u, t ) + e −aT JT (x(T ))
0

which also allows for a final cost.


Assume u obeys the Bellman equation (4.6), and let v be a control policy different
from u. We want to show the cost incurred by using the control v is superior to the cost
obtained by using u. Recall that the cost incurred by u is V (0, x(0).

4.48. We will establish an inequality for the cost incurred by v when compared to the cost
incurred by y (that is, the value function), as follows: consider the trajectory obtained
from using v , we have
( )
d −at −at −at ∂V ∂V
e V (t, x) = −ae V + e + f (x, v, t ) .
dt ∂t ∂x

78
4.5 Dynamic programming in continuous-time

We add and subtract e −at c (x, v, t ) from the right-hand-side to obtain


[ ( )]
d −at −at ∂V ∂V
e V (t, x) = −e c (x, v, t ) − c (x, v, t ) − aV + + f (x, v, t ) .
dt ∂t ∂x

By definition of u, c (x, v, t ) − aV + ∂V ∂V
∂t + ∂x f (x, v, t ) > 0. To see this, call K (v ) =
c (x(t ), v (t ), t ) − aV + ∂V ∂
∂x (t, x(t ))f (x(t ), v (t ), t ). Hence, the Bellman equation reads ∂t V =
infw K (w). Since u = arg infw K (w) we have that K (u) ≤ K (v ) and thus ∂t∂ V + K (v ) ≥ 0.

d at
− e V (t, x) ≤ e −at c (x, v, t ).
dt
Integrating the above between 0 and T and rearranging terms, we obtain
∫ T
−aT
V (0, x(0)) ≤ e JT (x(T )) + e −at c (x, v, t )dt .
0

The quantity on the right is the cost incurred by v and the quantity on the left the cost
incurred by u. This hence proves our claim.

4.49. Example: LQ controller. We can apply the above relations to find the equation
of optimal least square control for linear dynamics. Consider the system

ẋ = Ax + Bu.

Let Q and R be symmetric positive definite matrices. We want to find the control u that
minimizes ∫ T
(x ′Rx + u ′Q u) dt
0
The dynamic programming equation is
[ ]
′ ′ ∂V ∂V ′
0 = inf x Rx + u Q u + + (Ax + Bu) . (4.7)
u ∂t ∂x

Differentiating the term in brackets in (4.7) with respect to u, we get that the minimizing
control obeys
∂V 1 ∂V
2Q u ∗ + B ′ ⇔ u ∗ = − Q −1B ′
∂x 2 ∂x

We try a value function of the form V (t, x) = x K (t )x where K is symmetric. We have

∂V ∂V
= x ′K̇ x and = 2K x .
∂t ∂x

79
4 Dynamic programming and optimal control

Hence u ∗ = −Q −1B ′K x . Plugging this back into (4.7), we get

0 = x ′Rx + x ′K BQ −1B ′K x + x ′K̇ x + 2x ′K (Ax − BQ −1B ′K x)

Observe that 2x ′K Ax = x ′K Ax +x ′A′K x. Because the above holds for all x, we conclude
that K obeys
K̇ = −R + K BQ −1B ′K − K A − A′K
with boundary condition K (T ). The equation above is called the Riccati equation.

80
4.6 Controlled jump processes

4.6 Controlled jump processes


We first consider the case of a finite-state continuous-time Markov chain.

4.50. The type of problems we consider is the following: let S = {s1, . . . , sn } be the finite
state space and let x(t ) be a Markov process evolving in S with probabilities obeying

˙ ) = A(u)p(t ), p(0) = p 0
p(t

where u is a control term. We seek to minimize the expected value of a cost functional
∫ T
c (x, u)dt + C (x(T )).
0

4.51. Set [∫ ]
T
V (t, x) = inf
u
E c (x, u)dt + C (x(T ))|x(t ) = x, u .
t

Take δ small, we express inf u[t ..T ] as inf u[t ..t +δ] inf u[t +δ..T ] to obtain
[ ∫ T ]
V (t, x) ≃ inf
u[t,T ]
E c (x, u)dt + C (x(T )) | x(t ) = x, u
c (x(t ), u(t ))δ +
t +δ
[ (∫ T )]
≃ inf c (x(t ), u(t ))δ + inf
u[t,t +δ]
E
c (x, u)dt + C (x(T )) | x(t ) = x, u
u[t +δ,T ] t +δ

where we also used the fact that c (x, u, t ) is now determined by our current observation
x(t ) and control u(t ).

∫T
4.52. We introduce the function ϕ(t, x t )) = t c (x, u)dt +C (x(T )) where by a slight abuse
of notation, we denote by x t a random path x(.) starting at t . We will also write x(t ) = si
E
to denote random paths that start at x(t ) = si . We now evaluate (ϕ(t +δ, x t +δ |x(t ) = x, u).
Observe first that the expectation above is taken with respect to path that start at t
and end at T . We will split it up into paths from t to t + δ and then paths from t + δ to
T . We have

E (ϕ(t + δ, x t +δ )|x(t ) = x, u) = ϕ(t + δ, x t +δ = si )p(x(t + δ) = si , x t +δ |x(t ) = x, u). (4.8)
si ∈S

Similarly to what we did in the discrete-time case, we use the Markov property to obtain

p(x(t + δ) = si , x t +δ |x(t ) = x, u) = p(x t +δ |x(t + δ) = si , x(t ) = x, u)p(x(t + δ) = si |x(t ) = x, u)


= p(x t +δ |x(t + δ) = si , u t +δ )p(x(t + δ) = si |x(t ) = x, ut )

81
4 Dynamic programming and optimal control

4.53. We now focus on the second term of the last expression. Given p(t ), we have the
approximation
p(t + δ) ≃ p(t ) + A(u(t ))p(t )δ.
valid up to first order. Hence, if we know that x(t ) = si , p(t + δ) given that x(t ) = si is
A(u)e i δ + e i . We can thus rewrite (4.8) for the case x = s j as


E(ϕ(t + δ, xt +δ )|x(t ) = s j , u) = ϕ(t + δ, x t +δ = si )(A(u)e j δ + e j )i p(x t +δ |x(t + δ) = si , ut +δ )
i

where we denote by (v )i the i th entry of the vector v . If we denote by ϕ′(t + δ, x) the row
vector [ϕ(t + δ, x t +δ = s1 ), . . . , ϕ(t + δ, x t +δ = sn )], we can rewrite the previous equation for
all j simultaneously as

Et ..T (ϕ(t + δ, xt +δ ) | x(t ), u) ≃ Et +δ..T [ϕ′(t + δ, xt +δ )A(u)δ + ϕ′(t + δ, xt +δ ) | xt +δ, ut +δ ]

4.54. Using the above relation, we can simplify the second term in our last expression
for the value function V (t, x) as follows:

[ ]
V (t, x) ≃ inf
u[t,t +δ]
c (x(t ), u(t ))δ + inf
u[t +δ,T ]
Et +δ..T [ϕ (t + δ, xt +δ )A(u)δ + ϕ (t + δ, xt +δ )] | xt +δ, ut +δ
′ ′

≃ inf [c (x(t ), u(t ))δ + V (t + δ, x t +δ )A(u)δ + V ′(t + δ, x t +δ )]



u[t,t +δ]

This yields

V (t, x) − V (t + δ, x t +δ )
≃ inf [c (x, u) + V ′(t + δ, x t +δ )A(u)]
δ u[t,t +δ]

Now recall that the sample paths x(t ) are right continuous, hence limδ→0 x(t + δ) = x(t ).
Moreover, the paths are piecewise constant. Hence, taking the limit as δ → 0, there is
no ∂V
∂x appearing on the left. We get

∂V ′(t, x)
= − inf [c (x, u) + V ′A(u)] .
∂t u

82
4.7 Infinite-horizon: uniformization

4.7 Infinite-horizon: uniformization


The infinite horizon case can be solved using techniques from the discrete-time case using
the procedure known as uniformization. We present it here:

4.55. The Bellman equation is in the infinite horizon discounted cost case:

0 = inf [c (x, u) − αV + A′(u)V ] ,


u

where α is the discount factor.

4.56. Recall that A is an infinitesimally stochastic matrix (or infinitesimal propagator for
the stochastic process), that is its columns sum to zero and its off-diagonal entries are all
positive. Let
m = sup |Aii (u)|.
i,u

Set
1
A˜ = (A + mI )
m
where I is the identity matrix of appropriate dimensions. Observe that the elements of A˜
are all positive and smaller than one by definition of m. Moreover, the sum of the entries
of each column is 1. Hence A˜ is a bona fide stochastic matrix.

4.57. We now add (m + α)V (x) to both sides of the Bellman equation

(m + α)V = inf [c (x, u) − αV + A′(u)V ] (m + α)V.


u

Dividing both sides by m + α, we get (we set c˜(x, u) = c (x, u)/(m + α))

V (x) = inf [c˜(x, u) + (A(u) + m)′V ]


u
[ ]
1 ′
= inf c˜(x, u) + (A(u) + m) V
u m+α
[ m ˜′ ′ ]
= inf c˜(x, u) + A (u) V
u
 m + α 
= inf c˜(x, u) + γA˜′(u)V
u

where γ = m+α
m
. The above equation is the Bellman equation for a discrete-time infinite
˜
horizon Markov decision process with transition matrix A.

4.58. We have thus shown that continuous-time, discounted cost infinite-horizon Markov
decision processes could be solved by considering a discrete-time infinite-horizon prob-

83
4 Dynamic programming and optimal control

lem. Hence every tools that can be used in the discrete-time case (most notably, value
iteration) can be used in this continuous-time case.

4.59. We now explain informally what the uniformization procedure does and why it
works. Observe first that we are working exclusively in the infinite-horizon/steady-state
case. The value function V (x) does not depend on time in this case. It is thus reasonable
to expect that the exact timing of the jumps betweens states will not matter in that case,
but only the probability transitions of one state to another.

4.60. To elaborate on the previous point, consider the continuous time Markov process
 −a1 − a 2 b1 c1 
p˙ =  a1 −b 1 − b 2 c2  p.
 
a2 b2 −c 1 − c 2 
There are three states, s1, s2 and s3 . We know how to associate a sample path equation
to this process from Part 2. For example, let us associate −1 to s1 , 0 to s2 and 1 to s 3 . A
sample path equation is, e.g.

dx = (x − 1)dN 1 + (x + 1)dN 2 + · · ·

where we omitted the terms describing transitions starting at a state other than 0. We
have seen in Part 2 that if one sets the rate of N 1 at b 1 and the rate of N2 at b 2 , and
similarly for the terms not shown here, then the probability that x(t ) = −1 would be
p 1 (t ), p(x(t ) = 0) = p 2 (t ) and p(x(t ) = 1) = p 3 (t ). In that sense, the Itō equation is a
sample path realization of the Markov chain.

4.61. If for the sample path equation above, x(t ) = 0, what is the probability that x(t )
jumps t0 1? We know that x will jump to one if the Poisson counter N1 (t ) jumps before
the Poisson counter N 2 (t ). Hence this probability is P (T1 < T2 ) where T1 is the elapsed
time to the next jump of N 1 and similarly for T2 . Since N1 and N2 are independent and
exponentially distributed (with parameters b 1 and b 2 respectively), we have
∫ ∞ ∫ t2
P (T1 ≤ T2 ) = b 1b 2e −b 1t1 e −b 2t2 dt 1dt2
∫0 ∞ 0
= b 2e −b 2t2 (1 − e −b 1t2 )dt2
∫0 ∞
= b 2 (e −b 2t2 − e −(b 1 +b 2 )t2 )
0
b2
= 1−
b1 + b2
b1
=
b1 + b2

84
4.7 Infinite-horizon: uniformization

4.62. The above can be generalized to n possible jumps from state x = 0, where the
probability to jump from 0 to a particular state is proportional to the rate of the counter
associated to that state. Said otherwise, the probability of going from i to j is a j i /|aii |.

4.63. We can also evaluate the time that the chain will remain at state 0. If we let T be
a random variable representing the time before the next jump, we have that

P (T > t ) = P (T1 > t and T2 > t )


= P (T1 > t )P (T2 > t )
∫ ∞ ∫ ∞
−b 1 s
= b 1e ds b 2e −b 2 s ds
t t
−(b 1 +b 2 )t
= e

From the above, we conclude that T is distributed as an exponential random variable


with parameter b 1 + b 2 .

4.64. We can summarize the points above by saying that once the chain reaches the
state i , it will remain at that state for a time T ≃ −aii e aii t and then jump to state j with
probability a j i /|aii |.

4.65. In light of the above, we can intuitively interpret uniformization as saying that in the
infinite horizon case, the exact timings of the jumps do not matter and we might as well
have them happen synchronously (that is, according to a given clock). The transitions
from state to state, however, do matter. It is easy to see that the probability that we
jump from i to j is the same for the continuous time chain described by A and for the
"uniformized" discrete-time chain A. ˜ Indeed, in the continuous time case, given that the
chain is at state i , when the chain jump, it will jump to state j with probability a j i /|aii |
as we have seen above. From the description of A˜ it is easy to see that given that the
chain jumps (to a state different from i ), the probability of landing at state j is similarly
a j i /|aii |.

85
Lecture 5
Wiener processes and stochastic differential equations

We now start the study of stochastic differential equations driven by Brownian motion
(or Wiener process). We will derive an Itō rule for this process, an expectation rule and
a Fokker-Planck (density) equation.

5.1 Diffusions

5.1.1 Gaussian distribution and particles in water


In the same way that the Poisson process was related to the exponential distribution, the
Wiener process is intimately related to the Gaussian distribution. The central role of
the Gaussian distribution in stochastic processes stems, at least, from the central limit
theorem and the fact that it is the Green’s function for the heat equation. The first fact
is usually covered in introductory probability courses, we thus only explain the second.

5.1. The study of what became to be known as Brownian motion started with the obser-
vation that some particles, that are large enough to be seen through a microscope, but
light enough to not sink when put in a body of water, would undergo what appears to be
a random motion. This phenomenon was observed many centuries ago. For example,
the following excerpt from a work by Lucretius

"Observe what happens when sunbeams are admitted into a building and shed light on its
shadowy places. You will see a multitude of tiny particles mingling in a multitude of ways...
their dancing is an actual indication of underlying movements of matter that are hidden from
our sight... It originates with the atoms which move of themselves [i.e., spontaneously]. Then
those small compound bodies that are least removed from the impetus of the atoms are set in
motion by the impact of their invisible blows and in turn cannon against slightly larger bodies.
So the movement mounts up from the atoms and gradually emerges to the level of our senses, so
that those bodies are in motion that we see in sunbeams, moved by blows that remain invisible."

87
5 Wiener processes and stochastic differential equations

5.2. In 1827, the botanist R. Brown observed particles of pollen in suspension in water,
and described this phenomena in more details, but was unable to identify the source of
these "invisible blows". The first to provide a mathematical analysis of the phenomenon
was the Danish scientist Thorvard Thiele. This work was followed by the works of
Bachelier (who analyzed the fluctuation of the stock market back in 1900), Einstein (who
used this diffusion process to evaluate other quantities of interests), Smoluchowski, etc..

5.3. Going back to the grains of pollen in suspension in water, what happens at the
microscopic level is that the grains of pollen, which are quite large compared to molecules
of water, are being bombarded at a very high rate by the molecules of water. In fact, the
rate of collision is estimated to be around 1021 collisions per second.

5.4. If studying the motion of the grains of pollen using classical mechanics is in theory
possible, the sheer size of the numbers involved make the success of such an approach
rather slim. On the flip side, the sheer size of these numbers hint to the fact that a
statistical analysis might be quite accurate.

5.5. Making the switch to a statistical thinking, consider dropping a large number of
grains of pollen in water and observing how the density of the grains evolves. The pollen
will undergo a diffusion in the water. Quite remarkably, a wide range of physical situations
involving diffusion are described by the same equation for the density/concentration. We
mention that diffusion does not only related to material quantities, but energy can also
be diffused.

5.6. Let ρ(t, x) be the density of grains of pollen at time t and position x. We can verify
experimentally that the density obeys the equation

∂ 1 ∂2
ρ(t, x) = ρ(t, x).
∂t 2 ∂x 2
This equation is called the heat equation or diffusion equation.

5.7. Observe that ψ(t, x) = √ 1 e −x /2t satisfies the heat equation for all t > 0. More
2

2πt
is true: if ρ(0, x) is a twice-differentiable initial density profile, then the density for any
t > 0 can be expressed at

1 −(x−y)2 /2t
ρ(t, x) = ρ(0, y) √ e dy .
R 2πt

5.8. The above generalizes to higher-dimension as follows: let Q be a positive definite

88
5.2 Brownian motion and Poisson counters

matrix with entries q i j . Consider the generalized diffusion equation

∂ ∑ ∂ ∂
ρ(t, x) = qi j ρ(t, x).
∂t ij
∂x i ∂x j

Physically, this corresponds to diffusion in a non-isotropic material. A solution is given


by
1 ′ −1
ψ(t, x) = √ e −x Q x/2t .
Q (2πt )n
Similarly to the one dimensional case, a solution for a given initial density profile ρ(0, x)
is ∫
1 ′ −1
ρ(t, x) = ρ(0, y) √ e −(x−y) Q (x−y)/2t dy
Rn Q (2πt )n

5.1.2 The Gaussian distribution


We have seen that the Gaussian distribution appears naturally in the study of diffusion
processes. This is not surprising in view of the central limit theorem and the microscopic
interpretation of diffusion. Let us now evaluate the moments of the Gaussian distribution.

5.9. Because the Gaussian distribution is symmetric about the origin, its odd moments
vanish. The even moments can be evaluated using integration by parts:


Ex p
= √
1
x p e −x /2σ dx
2

2πσ ∫R
1 1 p+1 x −x 2 /2σ
= √ x e dx
2πσ R p + 1 σ
=
1
(p + 1)σ
E
x p+2 .

5.10. We can evaluate all the moments starting from Ex 2 = σ using the relation Ex p =
E
(p − 1)σ x p−2 . We have

Ex 2 = σ
Ex 4 = 3σ 2
Ex 6 = 5 · 3σ 3
..
.
p! ( σ ) p/2
Ex p =
(p/2)! 2

89
5 Wiener processes and stochastic differential equations

5.2 Brownian motion and Poisson counters


We have shown in the previous lecture that the phenomenon of diffusion was described
at the macroscopic level by the Gaussian distribution. Similarly to what we did in the
second part of this course, we want to find a stochastic equation that describes the
behavior of individual paths for particles. In the background, there is again the idea that
there is a probability space –that is the set of all possible sample paths— and we want
to put a measure on that space. While a direct approach is possible in that case—see
the derivation of the Wiener measure— we will as before never write down this measure
explicitly and instead learn how to work with a differential description of the random
paths. This approach allows us to establish all properties of the measure and evaluate
all quantities related to it.

5.11. We start from what we know: Poisson driven stochastic differential equations.
R
Consider a diffusion in . A particle gets hit from the left and from the right by molecules
of water. Because the number of hits is very high, we will ignore the motion of the particle
due to its inertia and solely consider its motion due to being hit. We assume that the
particle jumps by a small amount each time it gets hit. If we let x be the position of
the particle, an equation such as dx = dN 1 − dN 2 where N 1 and N 2 are independent
Poisson counters describes qualitatively the situation. We now need to let the rate of the
Poisson counter increase (to at least 1021 , which we will approximate by ∞.) We will
of course need to scale the size of the jumps appropriately, so that the motion becomes
independent of the rate λ.

λ
5.12. Let N1 (t ) and N 2 (t ) be independent Poisson counters of rate 2. We define the
process

1
dx λ (t ) = (dN 1 (t ) − dN 2 (t ))
s (λ)
with x λ (0) = 0. We want to find s (λ) so that the above relation yields at the macroscopic
level the statistics of a Gaussian distribution. Note that it is a priori not clear that this is
possible at all.

5.13. To fix s (λ), let us evaluate the second moment of x: using the Itō rule, we have
( ) ( )
1 2 1 2
dx = (x +
2
) − x dN 1 + (x −
2
) − x dN 2 .
2
s (λ) s (λ)

Using the expectation rule, we get

90
5.2 Brownian motion and Poisson counters

λ 2
d
dt
Ex2 =
2 s (λ)2
λ
=
s (λ)2

Hence, if we take s (λ) = λ, we see that as λ → ∞, the variance is independent of λ
E
(in fact, it is independent even at finite λ). More is true, x 2 = t . That is, the variance
of the motion increases linearly with time, as is required by the diffusion equation!

5.14. Let us evaluate the higher moments of x(t ). We have


( ) ( )
1 p 1 p
dx = (x + √ ) − x dN 1 + (x − √ ) − x dN 2 .
p p p
λ λ
Recall that the binomial formula states that
n ( )
∑ n
(x + y) =
n
x k y n−k .
k
k =0

Using a symmetry argument, one can easily conclude that all the odd moments vanish.
Hence we get ( ) ( )
d xp
=
p Ex p−2
+
1 p
Ex p−4 + · · · E
dt 2 λ 4
where the omitted terms are in powers of 1/λ. We thus find by integrating the above
relation and taking the limit as λ goes to infinity

1 t
λ→∞
p
E
lim x (t ) =
2 0
p(p − 1) lim x p−2dt
λ→∞
E
We can thus obtain the moments recursively starting from p = 2:


1 t
lim
λ→∞
Ex (t ) =
2
2 0
2dt = t

6 t
lim
λ→∞
E x (t ) =
4
3 0
tdt = 3t 2

lim E x 6 (t ) = 5 · 3 t 3
λ→∞
..
.
p! ( t ) 2
p

lim
λ→∞
Ex p
(t ) = (p − 1)(p − 3) · · · 3 · 1 t p/2
=
(p/2)! 2

91
5 Wiener processes and stochastic differential equations

Hence in the limit, the moments of x(t ) match the moments of the Gaussian distribu-
tion.

5.15. In fact, if we write the density equation for the process, we obtain
[ ( ) ( )]
∂ λ 1 1
ρ(t, x) = ρ t, x + √ − 2ρ(t, x) + ρ t, x − √
∂t 2 λ λ

In the limit as λ → ∞, the right-hand side tends to the second derivative 1 of ρ(t, x)
with respect to x and thus we recover the heat-equation:

∂ 1 ∂2
ρ(t, x) = ρ(t, x).
∂t 2 ∂x 2

5.2.1 Time correlations


Even more so than in the case of Poisson counters, time correlations for Brownian process
are central to the theory. We present the fundamental facts here.

5.16. Let τ > 0. Using the same approach as in Part 2, we have that

1
d τ x(t )x(t + τ) = x(t ) √ (dN 1 − dN 2 )
λ
and thus by taking expectations on both sides and recalling that Ni (t + τ) is independent
of x(t ), we get
d
E 1
E
(x(t )x(t + τ)) = x(t ) √ (dN 1 − dN 2 ) = 0 E
dt λ
Thus E(x(t )x(t + τ)) = Ex 2(t ). In general, one has the relation
Ex(t )x(s ) = Ex 2(min(t, s )).

5.17. The random variable x(t ) − x(τ), which we can express as

1 1
x(t )−x(τ) = √ (dN 1 (t )−dN 2 (t ))−(dN 1 (τ)−dN 2 (τ)) = √ (dN 1 (t ) − dN 1 (τ)) − (dN 2 (t ) − dN 2 (τ))
λ λ
depends only on |t −τ|. Indeed, recall that the Poisson counters have the Markov property,
and hence each of the two terms in the last expression only depend on |t − τ|.

5.18. We can generalize the above by observing that if the intervals [t, τ] and [s, σ] do

1
Recall that d
dx 2
f (x) = limh→0 1
h2
(f (x − h) − 2f (x) + f (x + h)).

92
5.2 Brownian motion and Poisson counters

not intersect, then the random variables

x(t ) − x(τ) and x(s ) − x(σ)

are independent. This independent increment property is a fundamental characteristic of


the process, which is a consequence of its definition as a difference of Poisson counters.

5.19. We can evaluate the variance of x(t ) − x(τ) as λ → ∞ as follows:

( )
lim
λ→ ∞
E (x(t ) − x(τ))2 = lim
λ→∞
E x 2 (t ) − 2x(t )x(τ)x 2 (τ)
= t − 2 min(t, τ) + τ
= |t − τ|

E
where we used the fact that lim x 2 (t ) = t and point 16 above.

93
5 Wiener processes and stochastic differential equations

5.3 Stochastic differential equations and the Itō rule


We investigate in this lecture the behavior of sample path equations that contain the
stochastic process x(t ) introduced in the previous lectures. We call this process Brownian
motion and denote it by dw(t ), that is

1
dw(t ) = lim √ [dN 1 − dN 2 ]
λ→∞ λ

where N 1 and N2 are independent Poisson counters of rate λ2 .


Note that what we are doing so far is proving properties of the process without proving
its existence. The latter, which involves the use of Caratheodory extension theorem, is
beyond the scope of this notes.
In this lecture, we want to make sense of equations of the type

dx = f (x)dt + g i (x)dw i (5.1)
i

where the w i are independent Brownian motions.

The Itō rule

5.20. We rewrite (5.1) as


∑ 1
dx = f (x)dt + √ (dNi − dN −i )
i λ

with a limit for λ → 0.


R
Let ψ : n 7−→ Rbe a twice differentiable function. We use the Itō rule for jump
processes to obtain

⟨ ⟩ ∑[ ]
∂ψ 1 √
dψ = , f (x) dt + ψ(x + √ g i (x)) − ψ(x) (dNi )/ λ
∂x i λ
∑[ 1
]

+ ψ(x − √ g i (x)) − ψ(x) dN −i / λ
i λ

5.21. Since we are to take a limit as λ → 0, it is helpful to look at a series development


of ψ. We have in general that
⟨ ⟩ ⟨ 2 ⟩
∂ψ 1 ∂ ψ
ψ(x + δ) = ψ(x) + ,δ + δ, 2 δ + h.o.t .
∂x 2 ∂x

94
5.3 Stochastic differential equations and the Itō rule

5.22. Using the above point, we have


⟨ ⟩ ⟨ ⟩
1 1 ∂ 1 ∂2
ψ(x + √ g i (x)) = ψ(x) + √ ψ, g i (x) + g i (x), 2 ψ(x)g i (x) + h.o.t .
λ λ ∂x 2λ ∂x

and a similar expression for ψ(x − √1 g i (x)).


λ

5.23. Putting the above in our expression for d ψ, we obtain


⟨ ⟩ ∑ 1 [⟨ ∂ ⟩ ]
∂ψ
dψ = , f (x) dt + √ ψ, g i (x) (dNi − dN −i )
∂x i λ ∂x
∑[ 1 ⟨ ∂2
⟩ ]
+ g i (x), 2 ψ(x)g i (x) (dNi + dN −i ) + h.o.t . (5.2)
i
2λ ∂x

5.24. Before proceeding further, we need to understand the process dz = (dNi + dN −1 )/λ
in the limit as λ → ∞. We first evaluate its expectation. Using the expectation rule, we
get that
d
dt
z =1 E
E
and thus z (t ) = t . We now derive the variance of the process. First, using the Itō rule
for jump processes, we get that
 ( )2 
 1 2 
dz =  z +
2
− z  (dNi + dN −i ) = (2z /λ + 1/λ 2 )(dNi + dN −i ).
 λ 

Hence,
d z2 E 1
E 1
= 2 z + = 2t + .
dt λ λ
Solving the above differential equation, we obtain

Ez 2 = t 2 + λt .
Hence, the variance of z (t ) is Ez 2 − (Ez )2 = t/λ.
5.25. The point above shows that, in the limit as λ → ∞, the process z (t ) evolves as t
with vanishingly small variance. Said otherwise, the process becomes deterministic. We
have thus established the very important relation

z (t ) = t or (dNi + dN −i )/λ = dt

95
5 Wiener processes and stochastic differential equations

in the limit as λ → ∞.

5.26. We gathering all the above, we have that in the limit



(dNi − dN −i )/ λ = dw i and (dNi + dN −i )/λ = dt

Using these two relations in (5.2), we obtain the Itō rule for stochastic differential equa-
tion with Wiener processes:
⟨ ⟩ ∑ [⟨ ∂ψ ⟩ ⟨ ⟩ ]
∂ψ 1 ∂2 ψ(x)
dψ = , f (x) dt + , g i (x) dw i + g i (x), g i (x) dt
∂t i
∂x 2 ∂x 2

5.27. Observe that from the above we can deduce the highly informal rule

dw 2 = dt .

The idea is to try to derive the right-hand-side of d ψ if we want that the solutions
obtained by first solving dx = f (x)dt + g (x)dw and then taking the function ψ is the
same as solving directly the equation for ψ. Starting from

dx = f (x)dt + g (x)dw,

one can derive the Itō rule by keeping the terms in dt of order no more than one for ψ.
A Taylor series of ψ yields
⟨ ⟩ ⟨ 2 ⟩
∂ψ 1 ∂ ψ
ψ(x + δ) = ψ(x) + ,δ + δ, 2 δ + h.o.t .
∂x 2 ∂x
Now for the case that interests us, δ = f (x)dt + g (x)dw and hence
⟨ ⟩ ⟨ ⟩
∂ψ 1 ∂2 ψ
ψ(x + δ) = ψ(x) + , f (x)dt + g (x)dw + f (x)dt + g (x)dw, 2 (f (x)dt + g (x)dw)
∂x 2 ∂x
⟨ ⟩ ⟨ ⟩ ⟨ ⟩
∂ψ 1 ∂ ψ
2 ∂2 ψ
= ψ(x) + , f (x)dt + g (x)dw + f (x), 2 f (x) dt + f (x), 2 g (x)dtdw
2
∂x 2 ∂x ∂x
⟨ ⟩
1 ∂ ψ
2
+ g (x), 2 g (x) dw 2 + h.o.t .
2 ∂x

Comparing the last relation to the Itō rule we have derived in the previous point, we
see that it implies that dw 2 = dt , and thus dtdw is of order 32 in dt and is ignored as
well as dt 2 .

96
5.3 Stochastic differential equations and the Itō rule

Examples
We now look at some examples

5.28. The Ornstein-Uhlenbeck process appears widely in physics to described situation


where a relaxation takes place. In contrast to the Wiener process, which does not admit
a limiting density (recall that its variance diverges as t → ∞), the Ornstein-Uhlenbeck
process will admit a limiting density that describes a balancing effect between additive
noise and relaxation. We will derive in a later lecture the density equation for this (and
more general) process. Precisely, the process is

dx = −axdt + bdw + cdt


where a, b and c are real constants.

5.29. Consider the process


dx = −xdt + xdw .
Using the Itō rule, we find that x 2 obeys

dx 2 = 2x(−xdt + xdw) + x 2dt = −x 2dt + 2x 2dw .

97
5 Wiener processes and stochastic differential equations

5.4 The expectation rule and examples


We present in this lecture the expectation rule for stochastic differential equations with
a Wiener process and illustrate it on several examples. We consider the stochastic differ-
ential equation
dx = f (x)dt + g (x)dw

5.30. The expectation rule is easily obtained by recalling that dw = limλ √1 (dN 1 − dN 2 ).
λ
Hence

E
d x = E(f (x)dt + g (x)dw)
= E(f (x)dt + g (x) lim √
1
(dN 1 − dN 2 ))
λ λ
[ ]
= E f (x)dt + E(g (x))E 1
lim √ (dN 1 − dN 2 ))
λ λ
= E f (x)dt
where we used the fact that N1 and N2 are independent of x(t ). We can summarize this
by saying that  
E
g (x)dw = 0.

5.31. Let us revisit the Ornstein-Uhlenbeck process. It is described by the equation

dx = −xdt + αdw .

We obtain that
d
dt
E
(x) = − (x).E
E
Hence in steady-state, x = 0. Using the Itō rule, we can evaluate higher-order moments
of the process as follows:

1
dx 2 = 2x(−xdt + αdw) + α 2 2 = (α 2 − 2x 2 )dt + 2αxdw
2
Using the expectation rule, we get that

d
dt
E E
x 2 = α 2 − 2 (x 2 ).

Hence, the steady-state variance is α 2 /2. The third moment is obtained similarly:

α2
dx 3 = 3x 2 (−xdt + αdw) + 6xdt = (3α 2x − 3x 3 )dt + 3αx 2dw .
2

98
5.4 The expectation rule and examples

E
Using the expectation rule, we conclude that in steady-state x 3 = 0. Let p be a positive
integer, we have
( )
p(p − 1)α p−2 p(p − 1)α p−2
dx = px (−xdt + αdw) +
p p−1
x dt = x − px dt + αpx p−1dw .
p
2 2

The expectation rule yields

p(p − 1)α p−2


d
dt
xp = E 2
x E
− p xp. E
In steady-state (that is when the right-hand-side of the above equation is zero), we
recover the moments of a Gaussian distribution. Hence, the steady-state distribution of
the process exists and is Gaussian. Physically, this process might represent the position
of a particle under the influence of a quadratic potential V = 21 x 2 and some random
disturbances.

5.32. Linear systems. Let x ∈ Rn, A ∈ Rn×n, B ∈ Rn and consider the Itō equation
dx = Axdt + Bdw .

This describes the dynamics of a linear system with a noise input. The expectation of x
is easily obtained using the expectation rule:

d
dt
E E E
x(t ) = A x(t ) ⇒ x(t ) = e At x(0).E
E
We now focus on the covariance of x. Define Σ(t ) = x(t )x ′(t ). We will find an equation
for the i, j th entry of Σ(t ) using the Itō and expectation rules. To this end, observe that
[ ] [ ∑ ] [ ]
xi ( l Ail xl ) b
d = ∑ dt + i dw .
xj A x
l jl l b j

Now, using the Itō rule, we immediately get


 ∑  [ (∑ ) ] 1 ⟨[ b i ] [0 1] [ b i ]⟩
d (x i x j ) = x i  * + 
A j l xl dt + b j dw  +x j Ail xl dt + b i dw + , dt
, l - 

2 bj 1 0 bj

Taking expectations on both sides and dividing by dt , we get


∑ ∑
Σ̇i j = A j l Σil + Ail Σl j + b i b j
l l

In summary

Σ̇ = AΣ + ΣA′ + BB ′ .

99
5 Wiener processes and stochastic differential equations

The above equation is called a Lyapunov equation. Its explicit solution is


∫ t

Σ(t ) = e As BB ′e A s ds .
0

100
5.5 Finite difference approximations

5.5 Finite difference approximations


We now discuss the problem is simulating a stochastic differential equation with a Wiener
process in its dynamics. We recall first that a SDE can be interpreted as a convenient
way to put a measure on a set of functions. The set of realizable functions for an SDE
with Wiener process is the set of continuous functions.

5.33. Consider the approximation scheme x(t + τ) = x(t ) + dx(t ), which applied to the
stochastic equation above yields

x((k + 1)τ) = x(k τ) + τ f (x(k τ)) + τg (x(k τ))w(k τ)

where τ > 0 is the time increment in the approximation and w(k τ) are independent
Gaussians.
This approximates x(τ) as

x(τ) = x(0) + τ f (x(0)) + τg (x(0))w(0) (5.3)

where w(0) is a Gaussian random variable with zero mean and variance τ.

5.34. We compare this approximation to the refined one obtained by going first through
τ/2:
τ τ
x(τ/2) = x(0) + f (x(0)) + g (x(0))w(0)
2 2
and
τ τ
x(τ) = x(τ/2) + f (x(τ/2)) + g (x(τ/2))w(τ/2).
2 2
If we expand f and g in their Taylor series we can express x(τ) in terms of x(0) up to
first order in τ/2 as
τ τ τ
x(τ) = x(0) + τ f (x(0)) + g (x(0))w(0) + g (x(0))w( ) + . . . (5.4)
2 2 2

5.35. It is understood that the quality of the solution increases as we take more interme-
diate steps between 0 and τ, but since we keep on adding random variables w(τ/k ), we
need to make sure that the variance at x(τ) remains constant, or, in other words, that
the statistical properties of x(τ) do not depend on the time-step chosen.

5.36. To this end, let us evaluate the variances of x(τ) obtained from the two rela-
tions (5.3) and (5.4).
We have

E [x(τ) − Ex(τ))(x(τ) − Ex(τ))′] = τ2 g (x(0))g ′(x(0))E(w 2(0)) (5.5)

101
5 Wiener processes and stochastic differential equations

for (5.3) and

τ2 ( )
E E E
[(x(τ) − x(τ))(x(τ) − x(τ)) ] = g (x(0))g ′(x(0))

4
E w 2 (0) + w 2 (τ/2) . (5.6)

Hence, if we want the variances of (5.5) and (5.6) to be the same, we need the variances
of w(0) and w(τ/2) in (5.6) to be twice the variance of w(0) in (5.6). If we do not change
the variance of w(k τ) when τ decreases, the limit as τ → 0 that we will take below will
yield a deterministic process, as the noise variance goes to zero.

5.37. We summarize this by writing



x((k + 1)τ) = x(k τ) + τ f (x(k τ)) + τg (x(k τ))w(k τ) (5.7)

and in this way we can keep the variances of w(k τ) to be fixed as τ varies. If w(t ) is a
standard Brownian motion, that is with variance t , then we can take w(k τ) in (5.7) to be
independent Gaussians with zero mean and variance 1.

5.38. For example, the equation dx = dw implies that x(t ) ∼ N (0, t ). The approximation
scheme √
x(k τ) = x((k − 1)τ) + τw(τ)
with w(τ) ∼ N (0, 1) yields an exact solution for any τ.

102
5.6 First passage time

5.6 First passage time


We investigate in this lecture first passage times for a simple Brownian motion w(t ). To
be more precise, let a > 0 and consider a Wiener process w(t ) with w(0) = 0. We define

τa = min {t | w(t ) = a} ,

that is τa is the first time at which w(t ) reaches a. We furthermore define

M (t ) = max(w(t ) for 0 ≤ s ≤ t ),

that is M (t ) is the largest value taken by w(t ) over the interval [0, t ]. We will derive
the distributions of M (t ) and τa . It should be clear that both M (t ) and τa are random
variables, and both are past-measurable. That is, we only need to know the process
w([0..t ]) in order to assign a value to these variables. Said otherwise, these random
variables are measurable with respect to the σ-field adapted to w(t ). We first emphasize
that using the material from the previous lecture, it should be clear that one can evaluate
the distributions by generating many random paths and recording, in case we care about
τa , the first time at which the process reaches a, or recording the maximum value of the
process over [0, t ] for the second case.
We claim that the following holds:

P (M (t ) ≥ a) = P (τa ≤ t ) = 2P (W (t ) > a). (5.8)

5.39. The first equality is rather easy to establish. Indeed, consider the event {M (t ) ≥ a}.
If the event happened, then because w(t ) is continuous with probability one (we will
derive this below, without using the material of this section since causality loops are
better avoided) we conclude that the event {τa ≤ t } happened. Reciprocally, if τa ≤ t ,
then we know w(t ) has reached a before t and hence M (t ) > a. This establishes the first
equality.

5.40. Let us focus on the second relation. First recall that w(t ) is symmetric, in the sense
that p(w(t ) > 0) = p(w(t ) < 0) = 1/2. The process is also Markov, or informally speaking,
future states only depend on the present state, and not on past states. This means that
if w(t ) = a, then the probability that w(t + τ) > a is the same as the probability that
w(t + τ) < a for τ > 0. We thus have, for s < t , that

1
P (w(t ) − w(τa ) > 0 | τa = s ) = P (w(t ) − w(τa ) < 0 | τa = s ) = .
2
If we integrate the above equation from s = 0..t with respect to the density for τa we
obtain

103
5 Wiener processes and stochastic differential equations

1
P (w(t ) − w(τa ) > 0 ∩ τa < t ) = P (w(t ) − w(τ) < 0 ∩ τa < t ) = P (τa < t ).
2

5.41. Finally, observe that the event A = {w(t ) − w(τa ) > 0} ∩ {τa < t } is the same as
the event B = {w(t ) > a}. To see this, observe that if A has taken place, then clearly
w(t ) > w(τa ) and w(τa ) = a. Reciprocally, if B as taken place, because w(t ) is continuous
w(t ) > a and τa < t . We conclude that

P (τa < t ) = 2P (w(t ) > a).

5.6.1 Differentiability of Wiener processes


One of the most well-know properties of the process w(t ), is that it is continuous ev-
erywhere with probability one and differentiable nowhere with probability one. We will
prove here a point-wise version of the second statement, that is prove that it is non-
differentiable with probability one at any point. Without loss of generality, we consider
continuity and differentiability at t = 0.

5.42. Let us assume that w(t ) is differentiable at t = 0. This means, since w(0) = 0,
that the quotients w(t )/t are bounded for all t close enough to zero. Hence if w(t ) is
differentiable at 0, there exists 0 < K < ∞ and ε > 0 such that

w(t ) < K t for all 0 ≤ t ≤ ε.

We want to show that, for K and ε fixed, the event {w(t ) < K t for all 0 ≤ t ≤ ε} has
probability zero. It is the same as showing that the event A = {∃t ∗ ∈ [0, ε] | w(t ∗ ) > K t ∗ }
has probability one.
Recall that M (t ) = max{w(s ) | 0 ≤ s ≤ t }. Observe that if B = {M (t ) > K t } happens,
this means that there exists 0 ≤ t ∗ ≤ t such that w(t ∗ ) ≥ K t ≥ K t ∗ . Said otherwise
B ⊆ A and thus P (B) ≤ P (A). We will show that as t → 0, P (B) tends to one and thus
P (A) = 1. From (5.8), we have that

P (M (t ) ≥ K t ) = 2P (w(t ) ≥ K t ) (5.9)

holds.
Recall the definition of the error function:
∫ t
2
e −x dx .
2
erf(t ) = √
π 0

It is related to the cumulative distribution function of a Gaussian with zero mean and

104
5.6 First passage time

∫x
√ 1 e −s /(2σ)ds ,
2
variance σ as follows: if Φ(x) = ∞ then
2πσ

1 1 √
Φ(x) = + erf(x/ 2σ)
2 2
Because w(t ) is Gaussian with variance t and zero mean, we can rewrite (5.9) in terms
of erf as
1 1 √ √ √
P (M (t ) ≥ K t ) = 2(1 − P (w(t ) < K t )) = 2 − 2( + erf(K t/ 2t ) = 1 − erf(K t/ 2)
2 2
and hence √ √
P (M (t ) ≥ K t ) = 1 − erf(K t/ 2).
The series development of erf(x) is
( )
2 x3 x5 x7
erf(x) = √ x − + − +···
π 3 · 1! 5 · 2! 7 · 3!
and hence as t → 0, P (M (t ) ≥ K t ) → 1.

5.43. We conclude from the above that for any fixed K < ∞ and ε > 0, we can make
P (w(t ) < K t for all 0 ≤ t ≤ ε) as close to zero as desired. Hence w(t ) is not differentiable
at 0 with probability 1.

105
5 Wiener processes and stochastic differential equations

5.7 The Fokker-Planck equation


As we have done for stochastic differential equations driven by Poisson processes, we can
derive a partial differential equation that describes the evolution of the density ρ(t, x).
The derivation below of course assume the existence of an integrable density for x(t ).
The equation for the density we derive here also goes under the name of Kolmogorov
forward equation.
We consider a stochastic differential equation of the type

dx = f (x)dt + g (x)dw . (5.10)

The procedure is similar to the procedure used in the case of a Poisson process: we use
the Itō rule and the expectation rule to evaluate the expectation of a test function ψ.


5.44. First, we have that Eψ(x) = ψ(x)ρ(t, x)dx . Hence,
Rn

∂ ρ(t, x)
d
dt
Eψ(x) = ψ(x)
∂t
dx . (5.11)
Rn

5.45. Using the Itō rule, we obtain


⟨ ⟩ ⟨ ⟩
∂ψ 1 ∂2 ψ
dψ = − , f (x)dt + g (x)dw + g, 2 g dt
∂x 2 ∂x

5.46. Taking expectation on both sides of the above, we get


[ ] [⟨ ⟩]
∂ψ ∂2 ψ
d
dt
E
ψ(x) =
∂x
f +E 1
2
E
g, 2 g .
∂x

Writing down the expectations in terms of ρ(t, x), we obtain


∫ ∫ ⟨ ⟩
∂ψ ∂2 ψ
d
E
ψ(x) = f (x)ρ(t, x)dx +
1
g, 2 g ρ(t, x)dx (5.12)
dt Rn ∂x 2 Rn ∂x
| {z } | {z }
A B

5.47. Integrating A by part, we get



  ∂
A = ψ(x)f (x)ρ(t, x) ∞ − ψ(x) (f (x)ρ(t, x)) dx
∂x
∫ Rn

= − ψ(x) (f (x)ρ(t, x)) dx
Rn ∂x

106
5.7 The Fokker-Planck equation

since ρ(t, x) tends to zero when ∥x ∥ is large.

5.48. Recall that f (x) and g (x) are vector valued. We denote by g i (x) the i th entry of
g (x). Integrating B by part twice, we get

∫ ⟨ ⟩
∂2 ψ
B = g, 2 g ρ(t, x)dx
Rn ∂x
∫ ∑∑ 2
∂ ψ
= g i (x)g j (x)ρ(t, x)dx
Rn i j
∂x i ∂x j
∑ ∑ [ [ ∂ψ ] ∫
∂ψ ∂ 
]
= g i g j ρ(t, x) − g i g j ρ(t, x)dx
i j
∂x i ∞ Rn ∂x i ∂x j

∑ ∑ [[ ∂ 
] ∫
∂ ∂ 
]
= ψ g i g j ρ(t, x) + ψ g i g j ρ(t, x)dx
i j
∂x j ∞ Rn ∂x i ∂x j
∫ ∑∑ ∂ ∂ 
= ψ g i g j ρ(t, x) dx
Rn i j
∂x i ∂x j

where we first integrated with respect to x i and used the fact that ρ(t, x) decays fast.

5.49. Putting Equation (5.11) and (5.12) together with the calculated values for A and
B, and appealing to the fact that ψ was arbitrary, we conclude that the integrands are
the same and thus
⟨ ⟩
∂ ∂ 1 ∑∑ ∂ ∂  
ρ(t, x) = − , ρ(t, x)f + g i (x)g j (x)ρ(t, x) . (5.13)
∂t ∂x 2 i j ∂x i ∂x j

5.50. Consider the Itō equation

dx = dx .
Comparing to (5.10), we see that it corresponds to having f (x) = 0 and g (x) = 1. Hence
in this case, the density equation for x(t ) reads as

∂ 1 ∂2 ρ
ρ(t, x) = .
∂t 2 ∂x 2

5.51. We now consider the Ornstein Uhlenbeck process

dx = −xdt + αdw .

107
5 Wiener processes and stochastic differential equations

A direct application of (5.13) with f (x) = −x and g = α yields (here all the variables are
scalar)

∂ ∂x ρ(t, x) 1 2 ∂2 ρ(t, x)
ρ(t, x) = + α
∂t ∂x 2 ∂x 2

5.52. Let us now look at a two-dimensional process


[ ] [ ][ ] [ ][ ]
dx 1 0 −x 2 x 1 0 1 x1
= dt + dw .
dx 2 x1 0 x2 −1 0 x 2
Observe that if dw were a usual differential (by which we mean a smooth transforma-
tion of dt , by opposition to its square root) the above equation would describe motion
R
on the circle x 12 + x 22 = constant in 2 . Because of the noise term, this is not the case.
You will explore in a problem set the correction terms that need to be added so as to
have a genuine stochastic process on the circle.
Comparing to equation (5.10), we see that it corresponds to having
[ 2] [ ]
−x 2 x2
f (x) = and g (x) = .
x 12 −x 1

The Fokker-Planck equation is thus

⟨ ⟩  
∂ ρ(t, x 1, x 2 ) ∂ 1  ∂2 (g 12 ρ) ∂2 (g 1 g 2 ρ) ∂2 (g 22 ρ) 
= − , ρf (x) +  +2 + 
∂t ∂x 2  ∂x 12 ∂x 1 ∂x 2 ∂x 22 

∂(x 22 ρ) ∂(x 12 ρ) 1  ∂2 (x 22 ρ) 
∂2 (x 1x 2 ρ(t, x)) ∂2 (x 12 ρ) 
= − +  − 2 + 
∂x 1 ∂x 2 2  ∂x 12 ∂x 1 ∂x 2 ∂x 22 

5.7.1 Fokker-Planck for jump-diffusion processes


A jump diffusion processes is a process that combines terms involving a Brownian motion
and terms involving Poisson counters. The density equation for these processes simply
add the terms related to the jump processes and the terms related to the diffusion pro-
cesses. First, observe that both density equations (for jump and for diffusion processes)
∂ ρf
include the term − ∂x . As we have seen, this Liouville term was describing the effect of
the non-stochastic drift f (x) on the density, and thus its appearance in both equations
was to be expected. In addition, we see that each density equation has terms that de-
scribe the changes in density caused by the jumps and the Brownian motion respectively.
If a process involves both, we simply include both in the density equation. The proof is
straightforward. Hence the equation

108
5.7 The Fokker-Planck equation


m ∑
p
dx = f (x)dt + g i (x)dNi + hi (x)dw i
i =1 j =1

where w i (t ) are independent Brownian motions and Ni (t ) are independent Poisson coun-
ters of rates λ i , admits the following density equation:

⟨ ⟩ ∑ m  ( ) −1  ∑p ∑ n
∂ ρ(t, x)
=−

, f (x)ρ + λi ρ(t, g˜−1 (x)) det ∂ g˜i − ρ(t, x) + 1 ∂2 k l
h h ρ
∂t ∂x  i ∂x  2 j =1 k,l =1 ∂x k xl j j
i =1
(5.14)
k
where h j is the k th entry of the vector h j and n is the dimension of x. We recall that
g˜i is defined as
g˜i (x) = x + g i (x)
and the above formula holds for g˜i one-to-one (and hence having a well-defined inverse)
—a straightforward modification can be applied if g˜i is not one-to-one (namely, consider
the inverse to be set-valued and integrate over this set, if the inverse image is a discrete
set, that integration is simply a sum.)

5.53. Consider the equation

dx = xdt + xdN 1 − x/2dN 2 + bxdw


where N 1 and N2 are independent processes of rates λ 1 and λ 2 . This process could be
use to represent, e.g., fluctuations in stock prices where some discrete events might affect
the prices non-continuously. Observe that here, the discontinuous fluctuations involve
doubling or halving the stock price (through the terms xdN 1 and −x/2dN 2 respectively.
By extending the state of the system, one can easily write down an equation with a richer
set of discontinuous variations.
We have that g˜1 = 2x and g˜1−1 = x/2. Similarly, g˜2 = x/2 and g˜2−1 = 2x. The density
equation is thus
[ ]
∂ρ ∂x ρ 1 b 2 ∂2 (x 2 ρ)
=− + λ 1 ρ(t, x/2) − ρ(t, x) + λ 2 [2ρ(t, 2x) − ρ(t, x)] +
∂t ∂x 2 2 ∂x 2

5.54. Consider the system of equations

dx = (−1 + 2z )xdt + bdw


dz = −2zdN

where z (0) = 1. This is a model for a switched linear system with Brownian noise. Indeed,

109
5 Wiener processes and stochastic differential equations

we know that z (t ) will take on two possible values z (t ) = ±1. For the case z = 1, the
dynamics is dx = xdt + bdw and for z = −1, it is dx = −3xdt + bdw. We could apply the
general formula (5.14) to obtain the density, but in this case, it is easier to proceed by
first noticing that since z takes on finite values, it is useful to use ρ+ (t, x) = ρ(t, x, z = 1)
and ρ− (t, x) = ρ(t, x, z = −1). Now one can deduce that the density at time t and x for
x = 1 will change according to the first and diffusion terms, and the jump terms (gains
from paths being at t and x but with z = −1 and just have jumped — which happens at
rate λ, and losses to z = −1 which happens at the same rate). The determinant involved
is easily seen to be equal to 1. Putting these together, we obtain the system

 ∂ ρ+ (t,x)

 ∂t = − ∂x∂xρ + b 2 ∂2 ρ − +
2 ∂x 2 + λ [ρ (t, x) − ρ (t, x)]
 ∂ ρ− (t,x)
= ∂3x ρ
+ b 2 ∂2 ρ + −
 ∂t ∂x 2 ∂x 2 + λ [ρ (t, x) − ρ (t, x)]

110
5.8 Stratonovich calculus

5.8 Stratonovich calculus


Stratonovich calculus is an alternative to It0̄ calculus. Contrary to Itō calculus, it obeys
the usual chain rule from (non-stochastic) calculus (that is, there is no correction term).
We present here the basic idea and show how to go from the Itō formulation to the
Stratonovich and back. We will use Stratonovich calculus extensively when we deals
with estimation theory.

5.55. Given the Itō stochastic differential equation

dx = f (x)dt + g (x)dw,

we can write, at last formally, the solution as


∫ t ∫ t
x(t ) − x(t0 ) = f (x)dt + g (x)dw
0 0
∫t
and focus on attaching a meaning to 0 g (x)dw.
This expression is called a stochastic
integral, and the study of stochastic differential equations is often done from the point of
view of stochastic integral (which is not the one we adopted in this course, though both
are ultimately equivalent of course)
Given w(t ) as∫brownian motion sample path, and a function x(t ) we define the stochas-
t
tic Itō integral 0 x(t )dw(t ) as a Riemann-Stieltjes integral as follows: divide the interval
[0, t ] into n subintervals using the intermediate points 0 ≤ t1 ≤ t2 ≤ · · · ≤ t . We define
the Itō stochastic integral as
∫ t ∑
n
x(t )dw(t )dt = lim x(ti ) [w(ti +t ) − w(ti )]
0 n→∞
i =1

where the limit is understood as a mean-square limit. Keep in mind that w(t ) is a random
path, and thus the integral above is a random variable.
In the above definition of the Itō integral, the fact that we took the value of x(t ) at
the beginning of the discretization interval is important. While in the case of a usual
integration with respect to a ’nice’ function with bounded variations,the choice of where
x(·) is evaluated in a discretization interval ti −1, ti ] does not matter in the limit, because
w(ti ) is highly irregular, the choice does matter.

5.56. An alternate definition of the stochastic integral is due to Stratonovic and goes as
follows:
∫ ∑ x(ti ) + x(ti +1 )
¯ ) = lim
g (t )dw(t [w(ti +1 ) − w(ti )]
n→∞
i
2
Again, if w(t ) were a bounded variation function, this would of course yield the same

111
5 Wiener processes and stochastic differential equations

result as the discretization scheme used in the definition of the Itō integral.

5.57. We now look at how one can go from one integral to other. To this end, consider
the Itō equation
dx = f (x)dt + g (x)dw

where g (·) is a differentiable function and let fs (x), g s (x) be such that
¯
dx = fs (x)dt + g s (x)dw

has the same solution x(t ) as the Itō equation. We want to relate fs and g s to f and g .
We introduce the shorthand notation w i for w(ti ), x i for x(ti ), etc.
To establish the relation between the two, we start from the Stratonovic integral

∑[ (
x(ti ) + x(ti −1 )
) ]
x(t ) ≃ fs (ti )(ti − ti −1 ) + g s [w(ti ) − w(ti −1 )]
i
2

We focus on the second term, and write x i = x i −1 + dx i .


∑ ( dx i )
gs + x i −1 [w i − w i −1 ]
i
2

We can approximate dx i by

dx i = fs (x i −1 )(ti − ti −1 ) + g s (x i −1 ) [w i − w i −1 ] .

Putting this in the previous relation, we have

∑ ( ) ∑ ( )
dx i dx i
gs + x i −1 [w i − w i −1 ] ≃ g s x i −1 + [w i − w i −1 ]
i
2 i
2
∑[ 1 ∂g s
]
≃ g s (x i −1 ) + dx i [w i − w i −1 ]
i
2 ∂x
∑[ 1 ∂g s  
]
≃ g (x i −1 ) + fs (x i −1 )(ti − ti −1 ) + g s (x i −1 ) [w i − w i −1 ]
i
2 ∂x
× [w i − w i −1 ]

Recall that w i − w i −1 is distributed as a N (0, ti − ti −1 ). Letting n → ∞, we can replace


the terms w i − w i −1 )2 by dt (this is another proof of the relation dw 2 = dt ). Hence, up
to first order in dt , we have

112
5.8 Stratonovich calculus

∑ ( ) ∑[ ]
dx i + x i −1 1 ∂g s
gs [w i − w i −1 ] ≃ g s (x i −1 ) + g s (x i −1 ) [ti − ti −1 ]
i
2 i
2 ∂x
Putting the above together, we find that
∫ ( ) ∫
1 ∂g
fs (x) + g s (x) dt + g s (x)dw = fs (x) + g s (x)dw
Itō 2 ∂x Strat.

5.58. We can thus write the following equivalences:

¯ = (f (x) + 1 ∂g
Stratonovic : dx = f (x)dt + g (x)dw ←→ Ito : dx g )dt + g (x)dw
2 ∂x
and

¯ = (f (x) − 1 ∂g g )dt + g (x)dw


Itō : dx = f (x)dt + g (x)dw ←→ Stratonovic : dx
2 ∂x

5.8.1 Change of variables in Stratonovic calculus


We have seen that to manipulate Itō processes, one needed to add correction terms when
taking differentials. Precisely, we have for ψ a differentiable function
⟨ ⟩
∂ψ 1 ∂2 ψ
dψ = (f (x)dt + g (x)dw) + g, 2 g .
∂x 2 ∂x

The Stratonovic differential, on the other hand, behaves like the usual differential from
calculus:
d¯ψ =
∂ψ ¯ 
f (x)dt + g (x)dw
∂x
We show that the above relation holds below:

5.59. Consider the Stratonovic equation


¯ = f (x)dt + g (x)dw
dx ¯

and equivalent Itō equation

1 dg
dx = f (x)dt + g (x)dt + g (x)dw (5.15)
2 dx

5.60. Let ψ be a one-to-one function with inverse ϕ and set y = ψ(x) (and thus x = ϕ(y)).

113
5 Wiener processes and stochastic differential equations

The derivative of ψ and ϕ are thus related by

d ψ/dx = (d ϕ/dy)−1 .

Furthermore,

d 2ψ d −1 −1 d −1 ( ) −1 ( )
= (d ϕ/dy) = (d ϕ/dy) = d 2
ϕ/dy 2
d ψ/dx = d 2
ϕ/dy 2
dx 2 dx (d ϕ/dy)2 dx (d ϕ/dy)2 (d ϕ/dy)3

5.61. Let us write (5.15) in terms of y: first, from the Itō rule, we get
( )
dψ 1 dg 1 d 2ψ
d ψ(x) = f (x)dt + g (x)dt + g (x)dw + g 2 2 dt
dx 2 dx 2 dx

We let f¯(y) = f (ϕ(y)) and g¯(y) = g (ϕ(y)) and using the relations derived above, we
obtain

 ( ) −1 ( ) −1 ( ) −1 ( ) −3 2 

( ) −1
d ϕ 1 d ϕ d ¯
g d ϕ 1 d ϕ d ϕ  dt + d ϕ
dy =  f¯ + g¯ − g¯2
g¯dw
 dy 2 dy dy ∂y 2 dy dy 
2 dy

5.62. Now we convert the last equation back to a Stratonovic equation. Observe that
 ( ) −1  ( ) −2 ( ) −1
d  d ϕ 
g¯ = −

g¯ +
dϕ d g¯
dy  dy  dy dy dy

Hence the correction term


 ( ) −1   ( ) −1 

1d  d ϕ g¯  d ϕ g¯
2 dy  dy   dy 

cancels out the second and third terms in the above expression for dy. We thus have the
Stratonovic equation
( ) −1
dϕ 
dy = f¯dt + g¯dw .
dy
This shows that the Stratonovic differential obeys the same rules as the usual differen-
tial from calculus.

114
Lecture 6
System Concepts

We start in this chapter the study of control system with noise. After a brief overview
of controllability and observability for time-varying linear systems, we focus on linear
systems driven noise. It is well-known that studying linear dynamics in the Fourier and
Laplace transform domain is quite fruitful. We will see that a direct extension of this
theory to linear systems driven by Brownian motion is not possible, for the reason that
a typical Brownian motion sample paths does not have a finite L 2 norm. We will see
that one can nevertheless have a rather complete set of results if instead of focusing on
the frequency analysis of the energy in a signal (which is in the L 2 norm), we focus on
a frequency analysis of the power in the signal. This analysis is possible for stationary,
ergodic processes, two definitions that will be explained in this chapter. Building on
these, we will introduce the power spectrum of a signal, relate it to the Fourier transform
of the autocorrelation of the signal (that is the Wiener-Khinchin theorem) and close this
part by presenting stochastic realization theory. That is, we will characterize which power
spectra can be realized by a linear system.

115
6 System Concepts

6.1 Notions from deterministic systems


We start by recalling some notions from the theory of deterministic systems.

6.1. A general system studied in control theory is of the form

ẋ = f (t, x, u(x)); y = h(x); x∈ Rn, y ∈ Rk


where f (t, ·, ·) is called the control vector field and u(t, x) the control. If u = u(t ), it is
called open-loop and if u = u(x), it is called feedback control. A system is said to be
time invariant if neither f nor u depend on t explicitly. Of course, for a given initial
condition x 0 , one can write u(x) = u(x(t )) = ũ(t ) and write a feedback control as an open
loop control. The reciprocal does not hold however: that is one cannot necessarily write
u(t ) = ũ(x(t )).

6.2. A subclass of systems which enjoys a relatively complete theory is the class of linear
systems [though many open questions remain even for linear systems]. These are systems
of the type

ẋ = A(t )x + B(t )u; y(t ) = C (t )x (6.1)


If A, B and C are constant matrices, the system is called a linear time invariant system.

6.3. The solutions of the homogeneous system

ẋ = Ax

of course play a role in the analysis of (6.1). Because the system is linear, it is a good
idea to get a handle of the solutions for a basis of vectors of initial conditions. This is
exactly what the fundamental solution does for the canonical basis: if Φ(t, σ) satisfies
the equation
d
Φ(t, σ) = A(t )Φ(t, σ); Φ(σ, σ) = I
dt
then Φ(t, σ) is called a fundamental solution of ẋ = Ax.
The transition matrix has the following property

Φ(T , 0) = Φ(T , t )Φ(t, 0).

6.4. In case A is constant, we have

A2 A3
Φ(t, σ) = e A(t −σ) = I + A(t − σ) + (t − σ)2 + (t − σ)3 + . . .
2! 3!

116
6.1 Notions from deterministic systems

This does not hold if A is time-varying but there exists a similar iterated expansion called
the Peano-Baker series that handles that case.

6.5. From the fundamental solution of ẋ = Ax, we can obtain the solution of (6.1) for
any initial condition x(σ):
∫ t
x(t ) = Φ(t, σ)x(σ) + Φ(t, σ)B(σ)u(σ)dσ.
σ

6.6. From the above, we immediately get that


∫ t
y(t ) = C (t )Φ(t, σ)x(σ) + C (t )Φ(t, σ)B(σ)u(σ)dσ.
σ

The function
T (t, σ) = C (t )Φ(t, σ)B(σ)
is sometimes called the weighting pattern of (6.1).

6.1.1 Controllability and Observability

6.7. A linear system is said to be controllable if for any x(σ) and t > σ, there exists a
control u(s ) defined for s ∈ [σ, t ] such that u(s ) drives the system (6.1) from x(σ) to zero.
It is important to realize that controllability is a question about the range space of an
R
operator that maps controls u to n , the state space. As it stands, the operator has an
infinite-dimensional domain and our first order of business is thus to find an equivalent
operator (in the sense that it has the same range space, since this is what we are concerned
about) with a finite-dimensional domain.

6.8. Consider the following mapping, which maps functions (controls) to vectors (think
of the state at a given time)
∫ t
L(u(t )) = B(t )u(t )dt .
σ

Consider the matrix ∫ t


Q (σ, t ) = B(s )B ′(s )ds .
σ

6.9. We claim that the range space of L and the range space of Q are the same:
1. Let y 1 be in the range space of Q . Hence there exists x 1 such that Q (σ, t )x 1 = y 1 .

117
6 System Concepts

Set u 1 (t ) = B ′(t )x 1 . Then


∫ t ∫ t
L(u 1 ) = B(s )u 1 (s )ds = B(s )B ′(s )x 1 = Q (σ, t )x 1 = y 1
σ σ

2. Reciprocally, assume that y 1 is not in the range space of Q . Then, because the
complement of the range space is of codimension zero, there exists x 1 such that
Q x 1 = 0 and x 1′ y 1 , 0. Observe that
∫ t ∫ t
0= x 1′ Q x 1 = x 1′ B(s )B ′(s )x 1ds = ∥B ′(s )x 1 ∥ 2ds
σ σ
and hence B(s )x 1 = 0. Assume by contradiction that y 1 is in the range space of L.
Hence there exists u such that
∫ t
x 1′ B(s )u(s ) = x 1′ y 1 , 0.
σ

which is a contradiction. This proves the claim.

6.10. Using the above result, we can give conditions for a system to be controllable.
Consider the system of (6.1), and define the controllability gramian of this system to be
∫ t1
W (t0, t1 ) = Φ(t0, t )B(t )B ′(t )Φ′(t0, t )dt .
t0

We claim that we can drive the system from x 0 at t0 to x 1 at t1 if and only if x 0 −Φ(t0, t1 )x 1
is in the range space of W (t0, t1 ).
Hence the system is controllable if W (t0, t1 ) is of full rank.
1. The idea is to first get rid of the drift term Ax and then use the result above relating
range spaces of two operators. To this end, set

z (t ) = Φ(t0, t )x(t ).

2. Recall that Φ(t0, t )Φ(t, t0 ) = I and hence

d
0= (Φ(t0, t )Φ(t, t0 ))
dt
= Φ̇(t0, t )Φ(t, t0 ) + Φ(t0, t )Φ̇(t, t0 )
= Φ̇(t0, t )Φ(t, t0 ) + Φ(t0, t )A(t )Φ(t, t0 )

Hence
Φ̇(t0, t ) = −Φ(t0, t )A(t )
3. We now evaluate ż :

118
6.1 Notions from deterministic systems

ż = Φ̇(t0, t )x(t ) + Φ(t0, t )(Ax + Bu)


= −Φ(t0, t )A(t )x + Φ(t0, t )A(t )x + Φ(t0, t )Bu
= Φ(t0, t )Bu

We thus conclude that


∫ t1
z (t1 ) − z (t0 ) = Φ(t0, t )B(t )u(t )dt
t0

4. Thus, we can drive the system from x 0 to x 1 , or equivalently∫to z (0) = x 0 to z 1 =


t
Φ(t0, t1 )x 1 if and only if x(t0 )−Φ(t0, t1 )x 1 is in the range space of t01 Φ(t0, t )B(t )u(t )dt .
We have seen that this range space is the same as the range space of W (t0, t1 ).

6.11. It is not hard to see that the controllability gramian W (t0, t1 ) obeys the differential
equation
d
W (t0, t ) = A′(t )W (t0, t ) + W (t0, t )A(t ) + B(t )B ′(t )
dt
with W (t0, t0 ) = 0.
Indeed, we recall that the general solution of a matrix equation

Ṁ = A1 (t )M (t ) + M (t )A2 (t ) + B(t )

is given by
∫ t

M (t ) = Φ(t, t0 )M (t0 )Φ (t, t0 ) + Φ1 (t, σ)B(σ)Φ′2 (t, σ)dσ
t0
where Φ1 and Φ2 are the fundamental solutions of ẋ = A1 (t )x and ẋ = A′2 (t )x respectively.

Observability

6.12. We say that a system is observable at time t if we can determine x(t ) from the
knowledge of y(s ), s ∈ [t0, t1 ] with t0 < t1 < t .

6.13. We can treat observability as we treated controllability to obtain a necessary and


sufficient condition for which a system is observable. In fact, the condition takes a similar
rank condition with the Gramian
∫ t1
M (t1, t0 ) = Φ′(t, t0 )c ′(t )c (t )Φ(t, t0 )dt .
t0

119
6 System Concepts

The matrix M (t0, t1 ) is called the observability gramian of (6.1).

6.1.2 LTI systems

6.14. We now focus on linear time invariant systems:

ẋ = Ax + Bu; y = Cx (6.2)
where A, B and C are constant.

6.15. In the case of an LTI system, we have that Φ(t0, t ) = e A(t0 −t ) and thus
∫ t1

W (t0, t1 ) = e A(t0 −t )BB ′e A (t0 −t )dt .
t0

6.16. Using the series development of e At and the Cayley-Hamilton Theorem, it is not
hard to show that the range space and null space of W coincide with the range space
and null space of

WT = [B, AB, A 2B, · · · , An−1B][B, AB, A 2B, · · · , An−1B]′ .

Now WT is of full rank if and only if [B, AB, . . . , An−1B] is. Hence we conclude that a
LTI system is controllable if and only if

[B, AB, A2B, . . . , An−1B] is of full rank.

6.17. Using a similar approach, we conclude that a LTI system is observable if and only
if
[C ′, A′C ′, (A′)2C ′, . . . , (A′)n−1C ′] is of full rank.

Transfer functions

6.18. We define the transfer function of (6.2) by

G (s ) = C (I s − A)−1B .

It is a matrix of rational functions, whose poles are the eigenvalues of A.

6.19. Recall that if the real parts of the eigenvalues of A are less than σ, then all x(t )e −σt

120
6.1 Notions from deterministic systems

where ẋ = Ax approaches zero as t → ∞. We can thus define the Laplace transform of


x(t ) for s such that Re(s ) > σ)

6.20. The Laplace transform of the matrix exponential e At is


∫ ∞
e −s t e At dt = (I s − A)−1 .
0

6.21. Assume that A is stable (also called Hurwitz), that is that its eigenvalues have
negative real part. From the knowledge of the transfer function, we can easily evaluate
the response of the system to a sinusoidal input: precisely, if

u(t ) = u 0 cos(ωt ),

then asymptotically
x(t ) = x 1 cos(ωt ) + x 2 sin(ωt )
where x 1 = u 0 Re(G (iω)) and x 2 = u 0 Im(G (iω))

Stability

6.22. Stability of LTI systems can also be investigated through Lyapunov or energy
function. The idea is that if a system is stable, it should somehow dissipate its internal
energy, that is if E is the energy of the system, with E = 0 corresponding to the system
in its lowest energy level, d /dt (E) ≤ 0. One can show that for LTI systems, stability is
equivalent to the existence of a quadratic energy function. Precisely, if A is stable, there
exists Q a symmetric positive definite matrix such that

d ′
x Q x ≤ 0.
dt

6.23. We evaluate the total derivative of the above equation to obtain a condition on Q :

d ′
(x Q x) = (Ax)′Q x + x ′Q (Ax)
dt
= x ′(A′Q + Q A)x

Hence, A is stable if and only if there exists Q such that

A′Q + Q A < 0

121
6 System Concepts

6.2 Power Spectrum


We start with recalling some definitions from harmonic analysis.

Fourier transforms

RC R
6.24. We define the space of functions L 2 ( , ) or simply L 2 ( ) to be the space of
complex-valued, square integrable functions:
{ ∫ }
R R
L ( ) = f (x) : 7−→ such that
2
C |f (x)| dx < ∞ .
2
R

6.25. We define L 2 ([−π, π]) to be the space of periodic functions, with period 2π, which
are square integrable over a period:
{ ∫ π }
2
C
L ([−π, π]) = f (x) : [−π, π] 7−→ such that f (x) = f (x + 2π) and |f (x)| dx < ∞ .
2
−π

R
6.26. The Fourier Transform of f (t ) ∈ L 2 ( ) is defined as

fˆ(ω) = F (f )(ω) = f (t )e −iωt dt
R

R
6.27. The inverse Fourier Transform of fˆ(ω) ∈ L 2 ( ) is given by

−1 ˆ 1
f (t ) = F ( f )(t ) = fˆ(ω)e iωt dt
2π R

6.28. Function in L 2 ([−π, π]) can be represented using a Fourier series:


∫ π
1
cn = f (t )e −int dt
2π −π
and


f (t ) = c n e int .
n=−∞

122
6.2 Power Spectrum

6.29. We recall a few properties of the Fourier transform


1. If f (t ) is real-valued, then fˆ(−ω) = fˆ∗ (ω).
2. Time/frequency scaling: F (f (at ))(ω) = |a| 1
F (f (t ))(ω/a).
3. Time-shift: F (f (t − t0 ))(ω) = E − j ωt 0 F (f (t ))(ω).
4. Differentiation: F (df /dt )(ω) = j ωF (f )(ω)
5. Convolution: F (f 1 ⋆ f 2 ) = F (f1 )F (f2 ) where f1 ⋆ f2 is the convolution product of
f1 and f 2 1


6.30. In many situations, R f 2 (t )dt is proportional to the energy in a signal or system.
Think of f representing the amplitude of a sound wave or the voltage across the plates of
a capacitor. As a consequence of Parseval’s relation, given below, the Fourier transform
F (ω) of f (t ) can be thought of as describing how this energy is distributed amongst
frequencies present in the signal.

6.31. The Parseval/Plancherel relation for a general complex-valued signal is


∫ ∫
∗ 1
f (t )g (t )dt = fˆ(ω) gˆ ∗ (ω)d ω
R 2π R

∫ ∫
If f (t ) is real-valued, then fˆ(−ω) = fˆ (ω) and R |f (t )|2dt = 2π 1
R | fˆ(ω)| d ω
2

6.2.1 Power spectral density

6.32. While fourier theory has proven very useful in the study of deterministic linear
systems, it cannot be applied to linear systems driven by noise without changes. Indeed,
consider the LTI stochastic system

dx = Axdt + Bdw; dy = C xdt . (6.3)

We have seen in a previous lesson that the autocorrelation matrix Σ(t ) = E(x(t )x ′(t )
obeyed the differential equation

Σ̇(t ) = AΣ(t ) + Σ(t )A′ + BB ′ .

We denote by Σ the steady-state value of Σ(t ), i.e. Σ = limt →∞ Σ(t ).

6.33. An explicit expression for Σ is


∫ ∞

Σ= e At BB ′e A t dt .
0


1
Precisely, (f1 ⋆ f 2 )(x) = R f1 (y − x)f2 (y)dy

123
6 System Concepts

Recalling the definition of the controllability Gramian for the pair A, B, we conclude that
the covariance matrix of x(t ) is full rank if and only if the system is controllable.

E
6.34. Now let us focus our attention to the output y(t ). We have that (y(t )y ′(t )) = C ΣC ′
in steady-state. If y(t ) is scalar, then the variance of y(t ) is C ΣC ′ in steady-state. This
variance is nonzero if the system is controllable, and can be zero if the system is not. In
the first former case, we have signals with a constant variance in steady-state. Hence

E y 2 (t )dt diverges
R
R
and the vast majority of signals y(t ) are not in L 2 ( ). In the latter case, the signal is zero
almost everywhere. In both cases, Fourier theory is of no help in analysis the signal.

6.35. Physically speaking, we see that we cannot apply Fourier theory since the signal y(t )
has infinite energy. However, observe that the following integral, where y(t ) is assumed
to be in steady-state,
∫ T
Ey 2 (t )dt = 2T C ΣC ′
−T
converges. Hence, we can make sense of the average power of y(t ) in the time interval
[−T ,T ]:
∫ T
1
2T
E y 2 (t )dt = C ΣC ′ .
−T
Generalized harmonic analysis deals with quantities as the one above. In order to
continue further, we need to introduce two definitions related to stochastic processes.

6.36. A stochastic process x(t ) is said to be stationary if for all positive integer k , and
real numbers t1, t2, . . . , tk , τ, the following holds:

P (x(t1 ), x(t2 ), . . . , x(tk )) = P (x(t1 + τ), x(t2 + τ), . . . , x(tk + τ))


where P is the joint distribution of the random variables x(t1 ), . . . , x(tk ).

6.37. A stochastic process x(t ) is said to be wide-sense stationary or weak-sense stationary


if its first and second moments are time independent. Precisely, x(t ) is weak-stationary
R
if for all τ, t1, t2 ∈

Ex(t ) = Ex(t + τ) and E(x(t1)x(t1 + τ)) = E(x(t2)x(t2 + τ))


The first requirement implies that the expectation of x(t ) is constant and the second
one implies that the autocorrelation of the process is only a function of τ, that is there

124
6.2 Power Spectrum

exists ϕ(τ) such that


E(x(t )x(t + τ)) = ϕ(τ).

6.38. We show below, by computing its autocorrelation explicitly, that the process y(t )
defined in (6.3) is weak-sense stationary in steady-state if A is a stable matrix.

6.39. A stochastic process is said to be ergodic if we can compute ensemble averages


E
as path averages for almost all paths. Precisely, recall that x(t ) is an expectation taken
over the value of all possible sample paths at time t . If x(t ) is ergodic, we can evaluate
E x(t ), which is necessarily time-independent, as the average value over a generic sample
path:
∫ T
E
x(t ) is ergodic ⇒ x(t ) = lim
1
T →∞ 2T −T
x(t )dt .

Furthermore, if x(t ) is ergodic, we can compute its autocorrelation as


∫ T
x(t ) is ergodic ⇒ E
x(t )x(t + τ) = lim
1
T →∞ 2T
x(t )x(t + τ)dt . (6.4)
−T
We thus see that the process needs to be in this case at least weak-sense stationary.

6.40. Ergodicity is a property that involves the sample paths of the random process,
whereas stationarity does not. In some sense, ergodicity says that all sample paths are
essentially the same and contain all statistical properties of the process. This is perhaps
best understood by exhibiting a process that is stationary by not ergodic. Consider the
process
x(t ) = Y with Y ∼ N (0, 1).
The sample paths of this process are all constant and with a value Y , which is sampled
according to a N (0, 1). The process is obviously stationary since the distributions of x(t )
for all t are the same. The expectation of x(t ) is simply the expectation of Y and is thus
zero.
The sample average of any given path will, however not be zero unless Y = 0, which
happens with probability zero. Indeed,
∫ T
1
lim Y dt = Y.
T →∞ 2T −T

Thus x(t ) is not ergodic.

6.41. We now return to the spectral analysis of y(t ). We define the power spectrum of y(t )
as

125
6 System Concepts

1 ∫ T 2
−iωt
Φy (ω) = lim y(t )e dt (6.5)
T →∞ 2T −T

6.42. The following result relates the power spectrum of a real stationary, ergodic pro-
cess to its autocorrelation function. The result goes under the name of Wiener-Kinchin
theorem. Precisely, it says that

E
Φy (ω) = F ϕ(τ) = F (y(t )y(t + τ)). (6.6)
E
If the process is not real, the above equality with F (y(t )y ∗ (t + τ)) on the right-hand
side holds. To prove that this equality holds, we start from the definition of the power
spectrum, which we write as
∫ T ∫ T
1 −iωt
Φy (ω) = lim y(t )e dt y(s )e iωs ds .
T →∞ 2T −T −T
Combining these integrals, we get
∫ T ∫ T
1
Φy (ω) = lim y(t )y(s )e −iω(t −s )dtds
T →∞ 2T −T −T
We now set τ = t − s and express the above double integral in terms of s and τ. We get
∫ T ∫ T
1
Φy (ω) = lim y(s + τ)y(s )e −iωτ ds d τ
T →∞ 2T −T −T

Using (6.4), we can replace the integral with respect to s by the autocorrelation of y:
∫ T
Φy (ω) = lim
T →∞ −T
E[y(s )y(s + τ)]e −iωτ d τ
which proves the result.

126
6.3 Stochastic Realization

6.3 Stochastic Realization


Stochastic realization theory is about designing systems driven by noise whose power
spectra or, equivalently autocorrelation functions, satisfy some user-chosen properties.
We focus here on the problem of linear realizations and strongly characterize all power
spectral that can be realized by a LTI system driven by noise. We further give an explicit
method to construct the system.

6.43. Consider the LTI system driven by noise

dx = Axdt + Bdw; dy = C xdt (6.7)

We assume that the system is controllable and that A is a stable matrix.

6.44. Using the expectation rule, we see that the expectation of x(t ) is

Ex(t ) = e At Ex(0).
Since A is stable, Ex(t ) is zero in steady-state.
6.45. We have shown in the previous lesson that Σ(t ) = Ex(t )x(t + τ) obeys the equation
Σ̇ = AΣ + ΣA′ + BB ′ .

An explicit solution of the above equation is


∫ t

Σ(t ) = e As BB ′e A s ds .
0

Hence limt →∞ Σ(t ) exists if A is stable. Said differently, if A is stable, a steady-state


covariance for (6.7) exists. We denote it by Σ.

6.46. We now evaluate the auto-correlation of x(t ). First, assume that τ > 0. We have
that
ds x(t + s ) = Ax(t + s )ds + Bdw t +s .
We multiply the above equation by x ′(t ) to the right and get

ds x(t + s )x ′(t ) = Ax(t + s )x ′(t )ds + Bx ′(t )dw t +s .

Taking expectations on both side, because dw t +s is independent of x ′(t ) when s > 0,


we have
d
dt
E
x(t + s )x ′(t ) = Ax(t + s )x ′(t ).

127
6 System Concepts

We know the initial condition for the above equation in steady-state at s = 0: this is
Ex(t )x ′(t ) = Σ. Hence
E ′
x(t )x ′(t + τ) = Σe A τ for τ ≥ 0.

6.47. Let us now look at the case τ < 0. The key point in the derivation above was that
dw t +s and x(t ) were independent. We start by writing

ds x(t + τ + s ) = Ax(t + τ + s )ds + Bdw t +τ+s .

We then multiply both sides on the right by x ′(t + τ):

ds x(t + τ + s )x ′(t + τ) = Ax(t + τ + s )x ′(t + τ)ds + Bx ′(t + τ)dw t +τ+s .

Now for s > 0, dw t +τ+s and x(t + τ) are independent. The expectation rule applied to
the above equation thus yields

d
ds
E E
x(t + τ + s )x ′(t + τ) = A x(t + τ + s )x ′(t + τ).

E
Again, we know that in steady-state, x(t + τ)x ′(t + τ) = Σ. Hence, integrating the
above equation from s = 0 to s = −τ, we obtain

E(x(t )x ′(t + τ) = e −Aτ Σ for τ < 0.

6.48. We claim without proof that the processes x(t ) and y are ergodic.

6.49. We can evaluate the power spectrum of (6.7) from the autocorrelation function
using the Wiener-Khinchin theorem. To this end, first recall that
∫ ∞
e As e −s t dt = (I s − A)−1 . (6.8)
0

Now according to (6.6),


∫ 0 ∫ ∞
−iωt −Aτ ′
Φx (ω) = e e Σd τ + e −iωτ Σe A τ d τ.
−∞ 0

To evaluate the first term, we first set τ′ = −τ to obtain, using (6.8)


∫ ∞
′ ′
e iωτ e Aτ Σd τ′ = (−I iω − A)−1 Σ.
0

128
6.3 Stochastic Realization

For the second term, we obtain using again (6.8)


∫ ∞

e −iωτ Σe A τ d τ = Σ(I iω − A′)−1 .
0

Hence we conclude that

Φx (ω) = Σ(I iω − A′)−1 + (−I iω − A)−1 Σ (6.9)

6.50. Using the equation AΣ + ΣA′ + BB ′ = 0, we can reexpress Φx (ω) in a slightly more
useful form. To this end, add and subtract Σiω to the previous equation to obtain (we
omit the identity matrices when adding scalar and matrices below)

−(A + iω)Σ − Σ(A′ − iω) = BB ′


Pre- and post-multiplying (A + iω)−1 and (A′ − iω)−1 respectively yields

−Σ(A′ − iω)−1 − (A + iω)−1 Σ = (A + iω)−1BB ′(A′ − iω)−1

Observe that the left-hand-side of the above equation is the power spectrum (6.9) and
thus we have proved that

Φx (ω) = (A + iω)−1BB ′(A′ − iω)−1

6.51. From the expression for the power spectrum of x, we can easily deduce the power
spectrum of y to be
Φy (ω) = C (A + iω)−1BB ′(A′ − iω)−1C ′ .
Because y is a scalar, then C (A + iω)−1B = B ′(A′ − iω)−1C ′, both being scalar quantities.
Set
ψ(ω) = B ′(A′ − iω)−1C ′ .
We have thus proved that the power spectrum of y has the following properties:
1. It is a rational function which can be expressed as

Φy (ω) = ψ(ω)ψ(−ω)

with ψ(ω) having roots in the left-hand side of the complex plane (recall that A is
stable).
2. It is even, that is Φy (ω) = Φy (−ω).
3. It is positive.
Knowing the above, we now seek to answer the following question:" what kind of power
spectra can we realize with a linear system?" The answer is given by spectral factorization

129
6 System Concepts

lemma, whose statement follows.

6.52. Spectral factorization Lemma: Let q (s ) be an even, real, proper rational function that
is non-negative on s = iω. Then there exists r (s ), real, proper rational, and having no
poles in the half plane ℜ(s ) > 0 and such that

q (s ) = r (s )r (−s ).

We omit the proof.

6.53. The spectral factorization lemma thus tells us that we can realize every power
spectrum q (s ) with the conditions above as a linear system.

6.54. Let r (s ) be a real, proper rational function having no poles in the left half-plane:

q 0 + q 1 s + . . . + q n−1 s n−1
r (s ) = .
p 0 + p 1 s + . . . + p n−1 s n−1 + s n

It is a well-known fact from linear systems theory that we can express this rational func-
tion as

r (s ) = C (I s − A)−1B
with
 0 1 0 ··· 0  0  q 0  ′
  0  q 
 0 0 1 · · · 0 
. .. .    .1 
A =  .. . ..  ; B =  ..  ;
  .  C =  .. 
 ..  0  
 0 0 . 0 1 
 1 q n−2 
−p 0 −p 1 · · · −p n−2 −p n−1  q n−1 
The above is called the canonical controllable realization.

6.55. Hence, given a q (s ) that satisfies the conditions of the spectral factorization Lemma,
we can find r (s ) and define a LTI system driven by noise that has q (iω) as power spectrum.
Given an autocorrelation function ϕ(τ), we can take its Fourier transform to obtain
Φ(ω) and we know that we can design a linear system with this autocorrelation only if
Φ(ω) satisfies the conditions of the spectral factorization Lemma.

6.56. Example: we are given the power spectrum

1 1
Φ(ω) = + .
1+ω 2 9 + ω2

130
6.3 Stochastic Realization

We can rewrite it as
√ √
10 + 2ω2 √ ( 5 + iω) √ ( 5 − iω)
Φ(ω) = = 2 2
(1 + ω2 )(9 + ω2 ) (1 + iω)(3 + iω) (1 − iω)(3 − iω)

We deduce that the following system is such that the power spectrum of y(t ) is Φ(ω):
[ ] [ ][ ] [ ] [√ √ ] [x 1 ]
dx 1 0 1 x1 0
= dt + dw; dy = 10, 2 dt .
dx 2 −1 −3 x 2 1 x2

6.57. Because ϕ(τ) is the inverse Fourier transform of a positive function (Φ(ω) is positive
by definition, see (6.5)), there should be some restrictions as to which functions of τ can
be autocorrelations of linear stationary systems. We can characterize these functions as
follows: let u(t ) be any function in L 2 . We have

Φ(ω)|û(ω)|2d ω ≥ 0 (6.10)
R
We use the following facts to find a characterization of the feasible autocorrelations:
1. From the Wiener-Khinchin, we have that Φ(ω) is the Fourier transform of ϕ(τ) =
E x(t )x(t + τ). ∫ ∫
2. The Parseval relation tells us that fˆ(ω) gˆ (ω) = 1/2π f (t )g (t )dt .
3. The (inverse) Fourier transform of a product of functions is the convolution of their
(inverse) Fourier transforms.
We now apply Parseval’s relation to (6.10) where the integrand is written as the product
(Φ(ω)û(ω))û(ω)∗ and recall that the product becomes a convolution:
∫ ∫
ϕ(t − τ)u(τ)u(t )d τdt ≥ 0.
R R
Functions ϕ(s ) that satisfy the above equality for all u are called positive definite functions
in the sense of Bôchner. We will not continue with their study here, but mention that there
is a large literature dealing with such functions.

131
Lecture 7
Linear and Nonlinear filtering

By filtering of a signal, we mean a procedure by which we produce an estimate of a signal


x(t ) for which we have an evolution model (here, a stochastic differential equation) and
a noisy measurement or observation y t .. We study in this section nonlinear filtering and
linear filtering. The qualification of linear or nonlinear refers to the type of underlying
model for the process x t and the observation process y t . We will focus on finding an
evolution equation for the conditional density of x given the observation process. To
make matters precise, consider the model

dx = f (x)dt + g (x)dw
dy = h(x)dt + d ν

where w and ν are independent Brownian motions. We have derived earlier in the course
an evolution equation for the density ρ(t, x) of x(t ): the Fokker-Planck equation. We take
a slightly unusual route and first deriving the evolution equation of a nonlinear filter
and deduce from it the linear filtering (Kalman) equations. We will start by establish-
ing an evolution equation for the conditional density of a discrete-time Markov process
given an observation process. We also elaborate on smoothing in this section. We then
pass to the limit to obtain a stochastic differential equation for the conditional density
of a continuous-time Markov process. Said otherwise, we derive a discrete-state space
version of the Duncan-Mortensen-Zakai (DMZ) equation describing the unnormalized
conditional density. We derive both the Itō and Stratonovic formulations of this equation.
We then extend the results to continuous state-spaces and obtain the DMZ stochastic par-
tial differential equation. Lastly, we focus on linear systems and obtain the Kalman-Bucy
filter from the DMZ equation.

133
7 Linear and Nonlinear filtering

7.1 Conditional density for discrete Markov processes

7.1.1 Filtering

7.1. Let Ω = {ω1, . . . , ωn } ⊂ R. Consider the discrete-time Markov chain


pt +1 = Apt

where A is a stochastic matrix and pt = p(x t = ωi ). A simple recursion yields p(t ) =


At p(0) and one can use this equation to make a guess as to what state x(t ) is in, e.g. by
choosing the largest entry of p(t ). In order to improve the estimate of the state, we want
to incorporate observations of the form

y(t ) ∼ c (y |x(t ), t )
where c (y |x, t ) is the distribution of the observation conditioned on begin in state x(t ).
A very common model is the additive Gaussian noise model

y(t ) = x(t ) + n(t )

where the n(t ) are independent, zero mean Gaussians with variance σ. In that case,
c (y |x(t ), t ) = √ 1 e −(y−x(t )) /2σ .
2

2πσ
Using the observation model and the transition probability matrix, we derive an equa-
tion for the evolution of the conditional probability p(x t |y 1, y 1, . . . , y t ).

7.2. We start with an initial distribution p 0 for the state of x 0 and we assume that we make
a first observation y 1 at time t = 1. What is the probability of x 1 given this observation?
Using Bayes’ rule, we obtain

p(x 1 = ωi |y 1 ) = p(y 1 |x 1 = ωi )p(x 1 = ωi )/p(y 1 )


1
= p(y 1 |x 1 = ωi )p(x 1 = ωi )
p(y 1 )
1
= c (y 1 |ωi )(Ap(0))i
p(y 1 )

where (p)i refers to the ith entry of the vector p.

134
7.1 Conditional density for discrete Markov processes

7.3. We now write a vector equation for the above relation. To this end, we define
 p(x 1 = ω1 |y 1 )
 
 p(x 1 = ω2 |y 1 )
p(x 1 |y 1 ) =  .. 
 . 
p(x 1 = ωn |y 1 )

and
c (y 1 |ω1 ) 0 ... 0 
 
 0 c (y 1 |ω2 ) . . . 0 
B(y 1 ) =  .. .. ..  .
 . . . 
 0 ... 0 c (y 1 |ωn )
In words, B is a diagonal matrix with diagonal entries are the probability of observation
y 1 given the possible states ωi .
We thus have
p(x 1 |y 1 ) = B(y 1 )Ap 0

7.4. Let us now evaluate p(x 2 |y 1, y 2 ). We have



p(x 2 = ωi |y 1, y 2 ) = p(x 2 = ωi , x 1 = ω j |y 1, y 2 )
j
1 ∑
= p(y 2 |x 2 = ωi , x 1 = ω j , y 1 )p(x 2 = ωi , x 1 = ω j |y 1 )
p(y 2 |y 1 ) j
1 ∑ 1
= p(y 2 |x 2 = ωi , x 1 = ω j , y 1 ) p(y 1 |x 2 = ωi , x 1 = ω j )
p(y 2 |y 1 ) j p(y 1 )
p(x 2 = ωi , x 1 = ω j )
1 ∑ 1
= p(y 2 |x 2 = ωi , x 1 = ω j , y 1 ) p(y 1 |x 2 = ωi , x 1 = ω j )
p(y 2 |y 1 ) j
p(y 1 )
p(x 2 = ωi |x 1 = ω j )p(x 1 = ωi )

We started by writing the p(x 2 |y 1, y 2 ) as a marginal distribution of p(x 2, x 1 |y 1, y 2 ), then


used Bayes’ rule. Using the Markov property, we have that

p(y 2 |x 2 = ωi , x 1 = ω j , y 1 ) = p(y 2 |x 2 = ωi )

since the knowledge of x 1 does not affect y 2 if x 2 is known. Similarly,

p(y 1 |x 2 = ωi , x 1 = ω j ) = p(y 1 |x 1 = ω j ).

135
7 Linear and Nonlinear filtering

We thus have, if we recall that p(y 1, y 2 ) = p(y 2 |y 1 )p(y 1 )

1 ∑
p(x 2 = ωi |y 1, y 2 ) = c (y 2 |ωi )c (y 1 |ω j )a j i p(x 1 = ωi )
p(y 1, y 2 ) j

In matrix form, we have the update

p(x 2 |y 1, y 2 ) = B(y 2 )AB (y 1 )Ap 0 .

7.5. We can generalize the above formula to obtain the filtering or conditional density
equation:
1
p(x k |y 1, . . . y k ) = B(y k )AB(y k −1 )A · · · B(y 1 )Ap 0 (7.1)
p(y 1, . . . , y k )

7.6. Observe that the normalizing constant in Equation (7.1) is nothing less that the
probability of observing the sequence of observables y 1, . . . , y k . In practice, one evaluates
the products B(y k )AB(y k −1 ) . . . ... without normalizing the vectors until the last step. The
normalizing constant (i.e. the sum of the entries of the vector B(y k )AB(y k −1 ) . . . is then
p(y 1, . . . , y k ).

7.7. The distinction between normalized density equation (that is Equation (7.1) above)
and its unnormalized counterpart (which is nothing else than (7.1) above without the
normalizing term p(y 1, . . . , y k )) is rather inconsequential now, but when dealing with con-
tinuous state-spaces, working with unnormalized densities greatly simplifies the evolution
equations. We get to this in the following sections.

7.1.2 Smoothing
Smoothing consists of estimating the value of the unknown signal x(t ) given future ob-
servations. Smoothing cannot, rather obviously, be performed in real time. Taking an
approach similar to the one taken for deriving the filtering equation (7.1), we derive an
equation for the update of p(x k |y k +1 ...y n ).

136
7.1 Conditional density for discrete Markov processes

7.8. As before, we start with a baby step: we evaluate p(x 0 = ωi |y 1 ). We have



p(x 0 = ωi |y 1 ) = p(x 0 = ωi , x 1 = ω j |y 1 )
j
1 ∑
= p(y 1 |x 0 = ωi , x 1 = ω j )p(x 1 = ω j , x 0 = ωi )
p(y 1 ) j
1 ∑
= p(y 1 |x 0 = ωi , x 1 = ω j )p(x 0 = ωi |x 1 = ω j )p(x 1 = ω j ).
p(y 1 ) j

Using the fact that

p(y 1 |x 0 = ωi , x 1 = ω j ) = p(y 1 |x 1 = ω j ) = c (y 1 |x = ω j ),

we have the matrix equation

1 ˜
p(x 0 |y 1 ) = AB(y 1 )p(x 1 )
p(y 1 )

where A˜ = A′D −1 with D the diagonal matrix with entries the sum of the columns of A′.
That is D is such that A˜ is a stochastic matrix.

7.9. Similarly to what was done in the previous section, we can show that in general

1 ˜ ˜ (y k −1 )AB(y
˜ k )p(x k ).
p(x 0 |y 1, . . . , y k ) = AB(y 1 ) . . . AB
N
where N is a normalizing constant. This is the smoothing or noncausal condition estima-
tion equation. Note that p(y 1, . . . , y N ) may be different if evaluated used the smoothing
or filtering equation. This is because the equation requires the use of an initial condition,
p(0) or p(n) that affects the result. If n grows very large, and the chain “mixes well”, the
effect of the initial conditions will disappear.

7.1.3 Estimating the state with the complete sequence of observations


We now focus on deriving an equation for the probability of x k given the complete se-
quence of observable y 1, . . . , y n :

p(x k |y 1, y 2, . . . , y n ) (7.2)

7.10. We derive first an update equation for p(y k +1, y k +2, . . . , y n |x k ), or the probability of
observing the remainder of the observation sequence given that we know the state at k .
This quantity will prove useful below to estimate (7.2). Following a standard notation,

137
7 Linear and Nonlinear filtering

we define
β(x k = ωi ) = p(y k +1, y k +2, . . . , y n |x k = ωi ).
We let β(x k ) denote the row vector whose i th entry is β(x k = ωi ). Observe that β(x k ) is
not a probability vector and its entries thus do not necessarily sum to one. We have


p(y 1, y 2, . . . y n |x 0 = ωi ) = p(y 1, , y 2, . . . y n, x 1 = ω j |x 0 = ωi )
j

= p(y 1, y 2, . . . y n |x 0 = ωi , x 1 = ω j )p(x 1 = ω j |x 0 = ωi )
j

= p(y 1, y 2, . . . y n |x 1 = ω j )p(x 1 = ω j |x 0 = ωi )
j

= p(y 2, . . . y n |y 1, x 1 = ω j )p(y 1 |x 1 = ω j )p(x 1 = ω j |x 0 = ωi )
j

= p(y 2, . . . y n |x 1 = ω j )p(y 1 |x 1 = ω j )p(x 1 = ω j |x 0 = ωi )
j

= β(x 1 = ω j )c (y 1 |ω j )a j i
j

In matrix/vector notation, this becomes

β(x 0 ) = β(x 1 )B(y 1 )A′

We thus see that the update for β is similar to the update for p(x 0 |y 1, . . .) except that
the unnormalized backwards equation is used (i.e. we use A′ instead of A.
˜ Observe that an
initial value of β(n + 1) is needed.

7.11. We now define


α(x k = ωi ) = p(y 1, y 2, . . . , y k , x k = ωi )
and α(x k ) is the row vector whose entries are given by the above for varying i . Because
of the rule for conditional probability p(x, y) = p(x |y)p(y), we see that α(x k ) is nothing
but the unnormalized version p(x k |y 1, . . . , y k ). Hence

α(x k ) = B(y k )AB(y k −1 )A · · · B(y 1 )Ap 0 .

7.12. We have now the tools to address the problem of finding the most likely state given
the complete observation sequence:

138
7.1 Conditional density for discrete Markov processes

1
p(x k = ωi |y 1, . . . , y n ) = p(x k = ωi , y 1, . . . , y n )
p(y 1, . . . y n )
1
= p(y k +1, . . . , y n |x k = ωi , y 1, . . . , y k )p(x k = ωi , y 1, . . . , y k )
p(y 1, . . . y n )
1
= p(y k +1, . . . , y n |x k = ωi )α(x k = ωi ).
p(y 1, . . . y n )

We thus have
1
p(x k = ωi |y 1, . . . , y n ) = β(x k = ωi )α(x k = ωi )
p(y 1, . . . y N )
and we have seen how to evaluate all the quantities involved. Notice that if we only need
the most likely state, the knowledge of the normalizing constant is not necessary.

139
7 Linear and Nonlinear filtering

7.2 Conditional density for continuous-time Markov chains


We now show how to extend the results of the previous lesson to continuous-time Markov
chains. First, we need to discuss how to extend the discrete-time observation model
y t = x t + nt to a continuous-time model; this leads us to introduce the often encountered
white noise process.

7.2.1 White noise

7.13. The idea of continuous observations requires some explanation. First, recall that
if w t is a standard Brownian motion, we derived that

w t ∼ N (0, t )

and moreover w t − w s (t > s ) and w τ − w σ (τ > σ) are independent random variables


with distributions N (0, t − s ) and N (0, τ − σ) if [s, t ] ∩ [σ, τ] = ∅. A white noise process
is the limit
1
ẇ = lim [w t +τ − w t ] .
τ→0 τ
Recall that we have proved that w t was not differentiable, and thus the above cannot be
interpreted as a stochastic process in the usual sense. However, it can be interpreted in
the sense of distributions as a Dirac distribution.

7.14. To elaborate on the previous point, we first recall that in the sense of distributions,

1
lim √ e −x /2τ = δ(x)
2

τ→0 πτ

Now we can informally rewrite ẇ as

1 w t +τ − w t
ẇ = lim √ √
τ→0 τ τ
1
= lim √ N (0, 1)
τ→0 τ

7.15. Observe that a white noise process is uncorrelated, in the sense that

E(ẇ(t1)ẇ(t2)) = δ(t1 − t2).


Indeed, for any t1 , t2 , we can find a τ > 0 small enough so that [t1, t1 + τ]∩[t2, t2 + τ] = ∅.

140
7.2 Conditional density for continuous-time Markov chains

Hence if t1 , t2 , [ ]
lim
τ→0
E 1
τ 1
1
(w t +τ − w t1 ) (w t2 +τ − w t2 ) = 0.
τ

7.16. From the above, we conclude that the power spectrum of white noise, which by
E
the Wiener-Khinchin theorem is the Fourier transform of (ẇ(t )ẇ(t + τ)) is the Fourier
transform of δ(τ), which is a constant. Hence we have shown that white noise has a flat
power spectrum, that is
Φẇ (ω) = 1.
The name white noise actually comes from this characterization of the process.

7.17. A noisy observation for a process x(t ) is called AWGN or additive white Gaussian
noise if it is of the form
ẏ = x(t ) + ẇ .
This notation is the most frequently used in the engineering literature. We will also
use the notation
dy = xdt + dw .
We point out that this latter equation is the one that should be used to simulate a white
noise process.

7.18. The possibility of instantaneous observations with fixed variance leads to perfect
observations via a simple average. The observation equation above should be interpreted
as follows: you can make a measurement of the system in a time τ (which you can assume
to be very small), your observation is then
∫ (∫ τ ∫ τ )
1 τ 1 1 1 1
dy = y(τ) = dx + dw ⇒ ẏ = x + w τ .
τ 0 τ τ 0 0 τ τ

The variance of the noise term can be scaled by multiplying dw by a constant σ0 .

7.2.2 Unnormalized conditional density evolution

7.19. We have the continuous-time Markov chain

p˙ = Ap

where A is an infinitesimal generator describing the evolution of the probability vector


p(x(t ) = ωi ). We have in addition the observation model

ẏ = x(t ) + ẇ(t )

141
7 Linear and Nonlinear filtering

7.20. We now derive a stochastic equation for the evolution of the vector p(x t +τ |ẏ[0..t +
τ]). To this end, assume that we know the conditional density p(x t |y[0..t ]). We can
approximate p(x t +τ |y([0..t + τ]) by making one discrete time-step of length τ and use the
results of the previous section. The probability transition matrix is then e Aτ . We have

p(x t +τ |y([0..t + τ]) ≃ B(ẏ([t ..t + τ])e Aτ p(x t |y([0..t ]).

7.21. Recalling the √


definition of white noise from √point 14, the entries of the matrix
τ −(ẏ−ωi )2 τ/2 τ
B(ẏ([t ..t + τ]) are √ e . First, observe that √ is a common factor to all entries
2π 2π
of the matrix and we can thus take it as being part of the normalizing constant.

7.22. We can approximate, up to first order, the exponential as

e −(ẏ−ωi )
2 τ/2
≃ 1 − (ẏ − ωi )2 τ/2 + h.o.t .

Observe that we used here Stratonovič calculus, as there are no correction terms in the
first order approximation.

7.23. Expanding the square, we get


τ[ 2 ]
e −(ẏ−ωi )
2 τ/2
≃1− ẏ + ωi2 − 2ωi ẏ + h.o.t . (7.3)
2

7.24. We introduce the diagonal matrix H :


ω1 0 · · · 0 
 
 0 ω2 · · · 0 
H =  . .. . . (7.4)
 .. . .. 
 0 · · · 
ωn 

Using (7.3) and (7.4), we can approximate B(ẏ) as


τ τ
B(ẏ([t ..t + τ]) = (I − ẏ 2 I − H 2 + τH ẏ) (7.5)
2 2

7.25. If we approximate e Aτ as I + Aτ, we get using (7.5) (we write pt for p(x t |y([0..t ]))
τ τ
pt +τ ≃ (I − ẏ 2 I − H 2 + τH ẏ)(I + Aτ)pt
2 2

142
7.2 Conditional density for continuous-time Markov chains

Rearranging terms, we get up to first order in τ:


τ τ
pt +τ − pt = − I ẏ 2 pt + Apt τ − H 2 pt + ẏH pt τ (7.6)
2 2

7.26. In (7.6), the term − 2τ ẏ 2 is independent of p and thus simply rescales pt . To see
this, consider the differential equations

ẋ = f (t )I x + A(t )x

and
ż = A(t )z (t ).
It is easy to verify that their respective solutions x(t ) and z (t ) are related by
∫ t
x(t ) = exp( f (s )ds )z (t ).
0
∫t
Hence the term − 2τ ẏ 2 rescales the solution by exp( 0 − 2τ ẏ 2ds ). Since we are not keeping
track of the normalizing constants, we ignore this term in (7.6).

7.27. If we replace ẏ τ by d¯y (recall that we are using Stratonovic calculus), we obtain

1
d¯pt = (A − H 2 )pt dt + H pt d¯y (7.7)
2
The above equation is the unnormalized conditional density equation. if the observation
model is
ẏ = h(x)dt + ẇ,
it is easy to see that the only modification to the above equation is in H , and
h(ω1 ) 0 ··· 0 
 
 0 h(ω2 ) · · · 0 
H =  . .. ..  .
 .. . . 

 0 ··· h(ωn )

7.28. In order to obtain the Itō version of this equation, we can revisit our approximation

143
7 Linear and Nonlinear filtering

to e −(ẏ−h(ωi ))
2 τ/2
. We have

e −(ẏ−h(ωi ))
2 τ/2
= e −(ẏ
2 +h 2 (ω
i )−2ẏh(ωi ))τ/2

= e −ẏ
2 τ/2
e −h
2 (ω )τ/2
i
e ẏh(ωi )τ
= e −ẏ
2 τ/2
e −h
2 (ω )τ/2
i
e dyh(ωi )

We now expand the last term up to second order to obtain

1
e dyh(ωi ) ≃ 1 + h(ωi )dy + h 2 (ωi )(dy)2
2
1
≃ 1 + h(ωi )dy + h 2 (ωi )(h(x)dt + dw)2
2
1
≃ 1 + h(ωi )dy + h 2 (ωi )(h 2 (x)dt 2 + 2h(ωi )dt + dw 2 )
2
1 2
≃ 1 + h(ωi )dy + h (ωi )dt
2

where we used the fact that dw 2 = dt and ignored the terms in order higher than one
in dt . Thus the Itō correction term is 12 H 2dt and we conclude that the Itō formulation
of (7.7) is simply

dpt = Apt + H pt dy

7.29. We can also obtain the Itō version directly from the Stratonovic by first rewrit-
ing (7.7) as
1
d¯pt = (A − H 2 )pdt + H p(h(x)dt + dw).
2
Observe that term factor multiplying dw is H p. Using results from a previous lesson,
we know that the Itō formulation can be obtained from the Stratonovic formulation by
dH p
adding 12 dp H pdt = 12 H 2 pdt .

144
7.3 Nonlinear filtering: the Duncan-Mortensen-Zakai equation

7.3 Nonlinear filtering: the Duncan-Mortensen-Zakai


equation
We now establish the conditional density equation for a continuous-time, continuous-
state space process driven by Brownian motion. This equation goes under the name of
Zakai equation or Duncan-Mortensen-Zakai equation (DMZ). We obtain it by analogy
with the discrete state space case.

7.30. Recall that if we have a continuous-time jump process with finite state space

dx = g i (x)dNi ,
i

then the density equation for the process is

p˙ = Ap

where A is an infinitesimally stochastic matrix.

7.31. The equivalent for a continuous state-space of the operator A, which describes the
density given sample path equation, is the Fokker-Planck operator:

∂f ρ 1 ∑ ∂2 g i g j ρ
dx = f (x)dt + g (x)dw −
7 →− +
∂x 2 i, j ∂x i ∂x j

7.32. We formally define the operator L as

∂f ρ 1 ∑ ∂2 g i g j ρ
Lρ = − +
∂x 2 i, j ∂x i ∂x j

7.33. The conditional density equation is thus, in Stratovonic form:

∂ρ 1
= (L − h 2 (x))ρ + d¯yh(x)ρ
∂t 2

7.34. The conditional density equation in Itō form is

∂ρ
= L ρ + d¯yh(x)ρ
∂t

145
7 Linear and Nonlinear filtering

7.35. Let us look at an example: consider the following problem:


{
ẋ = dw
ẏ = xdt + bdw
Because the factors in front of the stochastic differential are constants, the equation is
the same in both Itō and Stratononic forms.

7.36. We will now derive an estimation procedure for x(t ). To this end, we write the
DMZ equation in Stratonovic form

∂ ρ 1 ∂2 ρ 1 2
= − x ρ + ẏx ρ
∂t 2 ∂x 2 2

7.37. Let us try with the parametric family of solutions ρ(t, x) = exp(a(t )x 2 +b(t )x +c (t )).
Plugging this value of ρ in the DMZ equation, we get

∂ ( )
ρ(t, x) = ȧx 2 + ḃx + c˙ e a(t )x +b(t )x+c (t )
2

∂t
for the time derivative.
Differentiating with respect to the state variable x, we obtain

∂ρ
= (2ax + b)ρ
∂x
∂2 ρ
= 2a ρ + (2ax + b)2 ρ
∂x 2
Putting the above relations together, we get
( )
1 ∂2 1 2 1( )
2 ax 2 +bx+c
− x = 2a + (2ax + b) 2
− x e
2 ∂x 2 2 2

7.38. We can now equate the coefficients of x 2, x 1 and x 0 in the previous equation. We
find
1
ȧ = 2a 2 −
2
ḃ = 2ab + ẏ
1
c˙ = a + b 2
2

146
7.3 Nonlinear filtering: the Duncan-Mortensen-Zakai equation

7.39. The equations above give us the conditional density of x(t ) given the observation
process ẏ. We can rewrite the equations in a more enlightening form by rewriting the
Gaussian as

2 +bx+c 1
e −(x−x̂) /2σ
2
e ax =√
2πσ
If we take the above as a definition of σ and x̂, we find the relations

2σ = −1/a
x̂/σ = b
1
−x̂ 2 /2σ − log(2πσ) = c
2
Using these relations, we can easily find the evolution equations for σ and x̂. For σ,
we have

1 ȧ
σ̇ =
2 a2
1 2a 2 − 1/2
=
2 a2
1
= (2 − 1/2a 2 )
2
= 1 − σ2

A similar computation yields


d x̂
= −σx̂ + σ ẏ
dt
These are the equations of the Kalman-Bucy filter.

147
7 Linear and Nonlinear filtering

7.4 Linear filtering: the Kalman-Bucy filter


We now derive the Kalman-Bucy filter in two different ways. First, simply by plugging
in an exponential in the conditional density equation and finding what relations the
parameters of the exponential have to satisfy. This is what we did in the last part of
the last lesson. The second way is somehow more traditional, and goes by looking for
a MMSE estimate of the state given the observation. The approach used there is useful
when deriving the separation principle in the study of LQG (Linear Quadratic Gaussian)
control – we cover that in the next lecture.

7.40. We consider the linear system


{
dx = Axdt + Bdw
(7.8)
dy = C xdt + d ν

The conditional density equation for the system (7.8) is


⟨ ⟩ ( )′ ( )
∂ρ ∂ 1 ∂ ′ ∂ 1

=− , Ax ρ + BB ρ − ⟨C x, C x⟩ ρ + ẏ, C x ρ (7.9)
∂t ∂x 2 ∂x ∂x 2
We assume a solution of the conditional density equation of the form
( )
1 1 −1
ρ(t, x) = √ exp − (x − x̂)Σ (x − x̂)
(2π)n det(Σ) 2

If we substitute ρ in (7.9), we obtain after some manipulations:


{
Σ̇ = = AΣ + ΣA′ + BB ′ − ΣC ′C Σ
′ ′
dt = (A − ΣC C )x̂ + ΣC ẏ
d x̂

These are the equations of the celebrated Kalman-Bucy filter.

7.4.1 Conditional expectation and MMSE estimation


We derive here a basic fact related to estimating a quantity given imperfect observations.
Let x be a random variable, with observation y ∼ p(y |x). We want to find the best
estimator for h(x) given the observation y in the MMSE sense (Minimum Mean Square
Error), that is we try to minimize
∫ ∫
E
J = ∥ĥ(y) − h(x)∥ =
2
∥ĥ(y) − h(x)∥ 2 p(x, y)dxdy .

148
7.4 Linear filtering: the Kalman-Bucy filter

7.41. We start by establishing a simple relation that often goes under thename of mean-
variance decomposition. Let z be a vector-valued random variable, we have

E(∥z ∥2) = E∥z − Ez ∥2 + ∥Ez ∥2 .


Applying the above for z = x − x̂ where x̂ is a deterministic function, we have

E(∥x − x̂ ∥2) = E∥x − Ex ∥2 + ∥x̂ − Ex ∥2 .


We call the above the mean-variance decomposition.

7.42. We can rewrite the cost J of an MMSE estimation as


∫ ∫
J = ∥ĥ(y) − h(x)∥ 2 p(x, y)dxdy
∫ ∫
= ∥ĥ(y) − h(x)∥ 2 p(x |y)p(y)dxdy
∫ [∫ ]
= ∥ĥ(y) − h(x)∥ p(x |y)dx p(y)dy
2

= Ey e (h(y))

where e (h(y)) is the conditional error e (h(y)) = ∥ĥ(y) − h(x)∥ 2 p(x |y)dx . Hence, given
E
measurements y m , we have e (h(y m )) = (∥x̂(y) − x ∥ 2 |y = y m ). Hence, for each y m , we have
to assign a value ĥ(y m ).

7.43. Applying the mean-variance decomposition to the above (with z = ĥ −h) conditional
expectation we have

e (h(y)) = E(∥ĥ(y) − h(x)∥2 |y = y m )


= E(∥h(x) − E(h(x)|y)∥2 |y = y m ) + ∥ĥ(y) − E(h(x)|y)∥2
To minimize e (h(y)) we therefore pick

ĥ(y) = E(h(x)|y).

7.44. Example. Consider the distribution


{
1 if (x, y) ∈ A
p(x, y) =
0 otherwise

where A is the triangle in R2 with vertices at (0, 0), (1, 0) and (1, 1) as depicted below.

149
7 Linear and Nonlinear filtering

y
(1,1)

x
(0,0) (1,0)

The conditional distribution of x given a measurement y is uniform on the interval


[y, 1]. From the above, we know that the mmse estimator of x given the observation y is
the conditional expectation of x given y:

x̂(y) = E(x |y) = 1 +2 y .

7.4.2 Conditional expectation and least squares


We now give an alternate derivation of the above important fact: the mmse estimator is the
conditonal expectation.
Informally speaking, we see that when we condition a random variable X on a σ-field
G which is coarser than the σ-field that makes X measurable, some information is lost.
For the example of the die tossing experiment, if we take G = {Ω, ∅, {1, 3, 5}, {2, 4, 6}},
E
whereas X gives us the value of the outcome, (X |G) gives us the expected value of X
given that the outcome is odd or even. Hence it only allows us to say whether X was
odd or even.
This point is in accordance with the engineering intuition, but perhaps presented dif-
ferently. In engineering, conditioning is often used when one makes measurements, and
the reasoning is that conditioning gives us information. This is indeed the case since
we start with a complete ignorance of the realization of the random variable X . Hence,
E (X |Y ) is a first step, towards knowing X , which is the ultimate goal.
But coming back to the first paragraph, we see that any random variable Y measurable
with respect to G will allow us to decide whether the outcome is odd or even. Why choose
the random variable which is the expected value of X on the odd and even subsets?
The following provides a partial justification: the conditional expectation is the closest
random variable to X , under the constraint of being measurable on G.

7.45. Orthogonal projection Let V be an inner product space, with inner product ⟨·, ·⟩
and W be a closed subspace of V . Let x ∈ V . The orthogonal projection of x onto W is
the vector πW x of W satisfying:

x − ⟨πW x, w⟩ = 0 for all w ∈ W.

7.46. Orthogonality principle The oft-encountered orthogonality principle states that

150
7.4 Linear filtering: the Kalman-Bucy filter

the vector πW x is the closest vector to x in W . It is easily verified: assume there is a


vector w 1 ∈ W closer to x than πW x, that is

⟨x − w 1, x − w 1 ⟩ ≤ ⟨x − πW x, x − πW x⟩.

We thus have

0 ≥ ⟨x − πW x, x − πW x⟩ − ⟨x − w 1, x − w 1 ⟩
≥ ⟨w 1 − πW x, 2x − w1 − πW x⟩
≥ ⟨w 1 − πW x, x − πW x⟩ + ⟨w 1 − πW x, x − w1⟩

Because w 1 − πW x ∈ W , the first term in the last expression vanishes by definition of


πW x. Adding and subtracting πW x to the second term, we have

0 ≤ ⟨w 1 − πW x, x − w1⟩
≤ ⟨w 1 − πW x, x − πW x + πW x − w1⟩
≤ −∥w 1 − πW x ∥ 2

where we again used the fact that x − πW x is orthogonal to all vectors in W . The last
equation tells us that w 1 = πW x.

7.47. Least squares and orthogonal projection Let W be a linear subspace of an inner
product space V . Let x ∈ V . We seek the closest point to x in W :

w ∗ = arg min ∥x − w ∥ 2 .
w ∈W

7.48. In case V and W are finite dimensional, we can find an explicit formula for πW x.
Let w 1, . . . , w m be a basis for W . Then it is enough to check that ⟨x − πW x, w⟩ = 0 on a
basis of W . This yields
⟨x, w i ⟩ = ⟨πW x, w⟩.
Now, because πW x ∈ W , we can write it as a linear combination of the w i ’s: πW x =

ai w i .

7.49. Theorem The conditional expectation X → E(X |G) is the orthogonal projection
from L 2 (Σ) onto L 2 (G).

Proof. Since G ⊂ Σ, we see that L 2 (G) is a subspace of L 2 (Σ).

7.50. MMSE Estimator From the above Theorem, we can immediately deduce that the
MMSE estimator, given observations y, is the conditional expectation of x given y:

151
7 Linear and Nonlinear filtering

x M M S E (y m ) = E(X |Y = y m ).

7.51. The Gaussian case


Let X ∼ N (µ, Σ) be a vector distributed according to a multivariate Gaussian. Consider
the following partition of X , and the corresponding partitions of µ and Σ:
[ ] [ ] [ ]
X1 µ1 Σ11 Σ12
X = ,µ= ,Σ = .
X2 µ2 Σ21 Σ22

Assume that you can only measure X 2 , and you are asked to estimate X 1 given your
observation of the realization of X 2 . The mmse estimator of X 1 given X 2 is the conditional
expectation of X 1 given X 2 :

X 1mmse (x 2 ) = E(X1 |X2 = x2).


By definition of the conditional expectation, we have

E
(X1 |X 2 ) = x 1 p(x 1 |x 2 )dx 1


with p(x 1 |x 2 ) = p(x 1, x 2 ) p(x1 2 ) and p(x 2 ) = p(x 1, x 2 )dx 1 .
A straightforward, but lengthy, computation shows that E(X1 |X2 = x2) is disitributed
as a Gaussian with mean µ̄ and covariance Σ̄ obeying

−1
µ̄ = µ1 + Σ12 Σ22 (x 2 − µ2 )
−1
Σ̄ = Σ11 − Σ12 Σ22 Σ21

We see that if X1 and X 2 are uncorrelated, i.e. Σ12 = 0, then the conditional expectation
of X 1 given X 2 is simply X 2 . This should ot come as a surprise, since if X 1 and X 2 are
uncorrelated, then observing X 2 should not affect the mmse estimation of X 1 .
We also observe that Σ̄ is smaller than Σ̄11 , and that the amount by which it is smaller
increases if Σ12 increases (i.e. if X 2 and X 1 are strongly correlated) and decreases if Σ22
is larger (i.e. if X 2 has a large variance, observing it does not help the estimation of X 1 ).
These relations are at the basis of the discrete-time Kalman filter.

7.4.3 Kalman filter as an LMSE estimator

7.52. It is customary to start with a linear equation driven by white noise:

ẋ = A(t )x(t ) + B(t )ẇ; ẏ = C (t )x(t ) + v̇ .

152
7.4 Linear filtering: the Kalman-Bucy filter

7.53. We seek an unbiased estimator of x(t ) given past observations y. We look at the
class of estimators that obey the following equation:

ż = F (t )z (t ) + H (t )ẏ .

E E
7.54. If we want x(t ) = x(t ) to hold, we need to impose some conditions on F and
E E
H . First, we need z (0) = x(0). Second, taking expectations, we get equation ẋ and ż

A(t )Ex(t ) = F (t )Ez (t ) + H (t )C (t )Ex(t ).

We thus find A(t ) = F (t ) + H (t )C (t ) or F (t ) = A(t ) − H (t )C (t ) which we rewrite as

F (t ) = A(t ) + G (t )C (t )

where G (t ) is a gain matrix that is to be determined. For the sake of readability, we do


not indicate the time dependence anymore.

7.55. Let us introduce the error vector e = x − z . It obeys the equation

e˙ = ẋ − ż = (A + GC )e + G v̇ + B ẇ .

The covariance of the error Σee = E(ee ′) hence follows (see Lesson 4)
Σ̇ee = (A + GC )Σee + Σee (A + GC )′ + GG ′ + BB ′ .

7.56. Our objective is to find a value for G (t ) above so that the error covariance is small.
We show here that we can do that in a rather strong sense: we choose G so as to minimize
Σe e (t ) according to the Löwner partial order on positive definite matrices 1 .
To this end, consider the auxiliary Riccati equation

S˙ = AS + S A′ − SC ′C S + BB ′ (7.10)

We add zero to the right-hand-side, using the creative definition of zero as 0 = GC S +


S (GC )′ − GC S − S (GC )′ to obtain

S˙ = AS + S A′ − SC ′C S + BB ′ + GC S + S (GC )′ − GC S − S (GC )′
= (A + GC )S + S (A + GC )′ − SC ′C S + BB ′ − GC S − S (GC )′

7.57. We now, evaluate Σ̇ee − S˙ :

1
This partial order is such that A ≤ B ⇔ x ′Ax ≤ x ′Bx, ∀x ∈ Rn .

153
7 Linear and Nonlinear filtering

Σ̇ee − S˙ = (A + GC )(Σee − S ) + (Σee − S )(A + GC )′ + SC ′C S + GC S + S (GC )′ + GG ′


= (A + GC )(Σee − S ) + (Σee − S )(A + GC )′ + (G + SC ′)(G + SC ′)′

In the above equation, S (t ) is a known driving term (it obeys equation (7.10)). We can
thus write the value of Σee (t ) − S (t ) explicitly:
∫ t
Σee (t ) − S (t ) = ΦA+GC (t )(G + SC ′)(G + SC ′)′Φ′A+GC (t )dt
0
where ΦA+GC (t ) is the fundamental solution of ẋ = (A + GC )x.
We conclude from the above expression that Σee (t ) − S (t ) is positive definite for all
possible choice of G (observe that the integrand is a positive definite matrix). The least
value, in the Löwner partial order (and any other sensible order!) that we can achieve
is zero however, obtained by taking G = −SC ′.
In that case, Σee (t ) = S (t ).

7.58. We thus conclude that the causal, unbiased linear estimator of x that minimizes
the variance of the error is

ż = (A − Σee C ′C )z + Σee C ′ẏ


with
Σ̇ee = AΣee + Σee A′ − Σee C ′C Σee + BB ′ .
This is the Kalman(-Bucy) filter, whose formulations we had already obtained directly
from the conditional density equation.
This derivation, however, makes clear that the Kalman gain Σee is the covariance of the
estimation error.

154
Lecture 8
Ergodicity and Markov Processes

We prove the ergodic theorem and show how it applies to Markov processes.

155
8 Ergodicity and Markov Processes

8.1 Birkhoff’s ergodic theorem

8.1. Let (X, B, µ) be a measure space. We call a map T : X 7−→ X measurable if for all
A ∈ B, T −1 (A) ⊂ B. We call a measurable map measure preserving if for all A ∈ B,
R
µ(T −1 (A)) = µ(A). We call a function f : X 7−→ integrable if

|f (x)|d µ < ∞.
X

8.2. We call a set A invariant for T , or simply invariant, if T −1 (A) = A. We claim that
invariant sets form a σ-field.

Lemma 8.1. Let X be a topological space and B a σ-field of sets of X . Let T : X 7−→ X be a
measurable map. The invariant sets of T form a σ-field.

Proof. We need to show that the countable union of invariant sets is invariant and that
the complement of an invariant set is invariant. Let Ai , i ≥ 0 be a collection of invariant
sets. Then T −1 (∪i Ai ) = {x ∈ X | T (x) ∈ ∪i Ai } = ∪i {x ∈ X | T (x) ∈ Ai = ∪i T −1 (Ai ).
This proves the first part.
For the second part, denote by Ac the complement of A in X . Then T −1 (X ) = T −1 (A ∪
Ac ) = T −1 (A) ∪ T −1 (Ac ) = X , where the last equality comes from T −1 (X ) = X . Because
T −1 (A) = A by assumption, we have A ∪ T −1 (Ac ) = X and thus T −1 (Ac ) = Ac .

We consider here a discrete-time dynamical system:

x t +1 = T (x t )

8.3. We call a map T ergodic if its σ-field I of invariant sets is {∅, X }. We now state
Birkhoff’s Ergodic Theorem.

Theorem 8.1. Let (X, B, µ) be a measure space and T : X 7→ X be an ergodic measure-


preserving map. Let f be an integrable function on X . Then for almost all x ∈ X ,

1∑
n
lim f (T (x)) =
i
f d µ.
n→∞ n X
j =0

If T is not ergodic, then


1∑
n
lim
n→∞ n
f (T i (x)) = E(f | I).
j =0

The proof requires two preliminary results.

156
8.1 Birkhoff’s ergodic theorem

8.4. The following Lemma is essential.

Lemma 8.2 (Maximal inequality). Let (X, B, µ) be a measure space and T : X 7−→ X be a
R
measure preserving map. Let f : X 7→ be an integrable function. Set f 0 = 0 and


n−1
fn (x) = f + f ◦ T + · · · + f ◦ T n−1 = f (T i (x)). (8.1)
i =0

Furthermore, set
Fn (x) = max f j (x) (8.2)
0≤ j ≤n

and define the set An := {x ∈ X | Fn (x) > 0}. Then An ∈ B and



f d µ ≥ 0.
An

Proof. Because f is integrable, so are the fn and Fn . Hence An = Fn−1 (0, ∞) is measurable.
For the second part, first note that Fn ◦ T ≥ f j ◦ T for 0 ≤ j ≤ n by definition of Fn .
Hence
Fn ◦ T + f ≥ f j ◦ T + f = f j +1
where the last equality follows from the definition of f j . Because this holds for all 0 ≤
j ≤ n, we have
Fn ◦ T + f ≥ max f j ≥ max f j .
1≤ j ≤n+1 1≤ j ≤n

Note that we cannot conclude in general that the previous inequality holds for j = 0, but
if we restrict ourselves to the set An defined above, we have

Fn ◦ T + f ≥ Fn , on the set An .

Let us integrate both sides of the above inequality:


∫ ∫ ∫
f dµ ≥ Fn d µ − Fn ◦ T d µ.
An An An

We make the following observations: Because


∫ f0 = 0,
∫ we have from the definition of Fn
that Fn (x) = 0 for x ∈ An . We thus have An Fn d µ = X Fn d µ. Similarly, since Fn ≥ 0, we
c
∫ ∫
have that An Fn ◦ T d µ ≤ X Fn ◦ T d µ. Putting these two facts together, we obtain
∫ ∫ ∫
f ≥ Fn − Fn ◦ T d µ.
An X X
∫ ∫
Finally, since T is measure preserving, X Fn ◦ T d µ = X Fn d µ. Hence the right-hand
side of the above inequality is zero, which proves the result.

157
8 Ergodicity and Markov Processes

8.5.

Lemma 8.3. Let g be an integrable, real-valued function on (X, B, µ). Let α ∈ R and set
 
 1∑ 
n−1

 
B α = x ∈ X | sup g (T j x) > α
 .
 n≥1 n j =0

 

Then for all A ∈ B with T −1 (A) = A, we have that



g d µ ≥ α µ(B α ∩ A).
B α ∩A
∑n−1
Proof. We first assume that X = A We set g n = j =0 g (T j x). Observe that we can write
B α as


B α = ∪n=1 {x ∈ X | g n (x) > nα}.
Define f = g − α and fn and Fn as in Eqs. (8.1) and (8.2) respectively. Then we have

B α = ∪n=1 {x ∈ X | fn (x) > 0}.

We claim that we can furthermore write that



B α = ∪n=1 {x ∈ X | Fn (x) > 0}.

To see this, note that if x is such that fn (x) > 0 for some n > 0, then Fn (x) > fn (x) > nα.
Reciprocally, if Fn (x) > 0, then there exists a j such that f j (x) > 0. These two statements
show that the last two definitions of B α are equivalent.
As the reader can observe, the last expression for B α begs the use of the maximal
inequality. To this end, let
Bn := {x ∈ X | Fn (x) > 0}
and denote by 1Bn its indicator function. Observe that

Bn ⊆ Bn+1 .

Furthermore, because g is integrable, so is f . In fact, we have that for all n ≥ 1,

f 1Bn ≤ |f |

and clearly limn→∞ f 1Bn = f 1B α . We thus have


∫ ∫ ∫ ∫
lim f d µ = lim f 1Bn d µ = lim f 1Bn = f dµ
n→∞ B n→∞ X n→∞ X X
n

where we used the dominated converge theorem (see Th. ?? below) for the last equality.

158
8.1 Birkhoff’s ergodic theorem

Finally, the maximal inequality of Lemma 8.2 says that for all n ≥ 1,

f dµ ≥ 0
Bn

and thus Bα f d ν ≥ 0 or, equivalently,

g d µ ≥ a µ(B α ).

In case X , A, note that since T (A) = A, we can just apply the result just derived to the
space (X, BA, µ) where BA is the σ-field obtained by intersecting all sets in B with A.

8.6. We now prove Birkhoff’s theorem.


∑n−1
proof of Birkhoff’s ergodic Theorem. We first need to show that limn→∞ 1
n j =0 f (T i x) con-
verges. To this end, let
1∑
n−1
˜
fn (x) = f (T j x).
n j =0

Then we have the relation


n+1 ˜ 1
fn+1 = f˜n ◦ T + f (x). (8.3)
n n
Now set
f + (x) := lim sup f˜n (x) and f − (x) := lim inf f˜n (x).
n→∞ n→∞

Using Eq. (8.3), we see that

f + (T x) = f + (x) and f − (x) = f− (T x). (8.4)

We thus ∫have to show that f + = f − almost everywhere, and moreover that they are both
equal to X f d µ.
R
Let α, β ∈ and define

D α, β := {x ∈ X | f− (x) < β and f + (x) > α}.

By definition, we have that f + (x) ≥ f− (x). We thus want to show that the set {x ∈ X |
f + (x) > f− (x)} is of measure zero. We can write this set as the (countable) union, over
β < α and α, β rational numbers, of D α, β . It thus suffices to show that µ(D α, β ) = 0 if
β < α.
From Eq. (8.4), we have that
T −1D α, β = D α, β .

159
8 Ergodicity and Markov Processes

Furthermore, if we let as before

B α = {x ∈ X | sup f˜n (x) ≥ α}


n≥1

then we have that D α, β ∩ B α = D α, β . To see that this last relation holds, note that
supn≥1 f˜n (x) ≥ lim sup f˜n (x). We can apply Lemma 8.3 with A = D α, β to obtain
∫ ∫
f dµ = f d µ ≥ α µ(D α, β ∩ B α ) = α µ(D α, β ). (8.5)
D α, β D α, β ∩B α

Now for an integrable function g and real numbers α1, β1 define the set

E α1, β1 := {x ∈ X | g − (x) < β1, g + (x) > α1 }.

Applying the result obtained in Eq. (8.5) to g and E α1, β1 , we get



g d µ ≥ α1 µ(E α1, β1 ). (8.6)
E α 1, β1

If we let g = −f , α1 = − β and β by β1 = −α. Then β1 < α1 , g + = −f − and g − = −f + .


But note that

E α1, β1 = {x ∈ X | −f + (x) < −α, −f − (x) > − β} = {x ∈ X | f + (x) > α, f − (x) < β} = D α, β .

Hence Eq. (8.6) is equivalent to



f d µ ≤ β µ(D α, β ).
D α, β

The previous equation together with Eq. (8.5) yields that α ≤ β. This is only possible
if µ(D α, β ) = 0. We have thus shown that f + = f− almost everywhere, and thus that
˜
∫limn→∞ fn (x) is well-defined for almost all x ∈ X . It remains to show that it is equal to
X f d µ. ∫ ∫
We first show that f + is integrable, and then show that X f +d µ = X f d µ.
To show that f + is integrable, we use Fatou’s lemma (see Th. ?? below). To this end,
note that
∫ ∑
n−1 ∫ ∫
˜ 1
| fn |d µ ≤ |f ◦ T |d µ = |f |d µ.
j

j =1
n

Since fn converges pointwise to f + , we conclude from Fatou’s Lemma that


∫ ∫
+
f dµ ≤ |f |d µ
X X

and thus f + (and also f − ) is integrable.

160
8.1 Birkhoff’s ergodic theorem

+

Finally, it remains to show that f = X f d µ. To do so, we shall use Lemma 8.3 again.
First, introduce the set
q + q +1
C n,q = {x ∈ X | ≤f ≤ }.
n n
Because we have shown that f + = f + ◦ T , then T −1C n,q = C n,q . Furthermore, for all
δ > 0,
C n,q ∩ Bq /n−δ = C n,q
by definition of B α . From Lemma 8.3 (with A = C n,q ,

q
f d µ ≥ ( − δ)µ(C n,q )
C n,q n
∫ q
Since this holds for any δ > 0, we have that C n,q f d µ ≥ n µ(C n,q ). From the definition of
∫ q +1
C n,q , we also have that C n,q f +d µ ≤ n µ(C n,q ) and thus
∫ ∫
+ 1 q 1
f d µ ≤ µ(C n,q ) + µ(C n,q ) ≤ µ(C n,q ) + f d µ.
C n,q n n n C n,q

Since we can write X as a disjoint union

X = ∪q ∈ZC n,q ,

we have that ∑
µ(C n,q ) = µ(X )
q ∈Z

Hence summing of q , we obtain


∫ ∫
+ 1
f dµ ≤ f d µ.
X n X

Because the previous inequality holds for all n > 0, we obtain the inequality
∫ ∫
+
f dµ ≤ f d µ. (8.7)
X X

Now set g = −f , it is clear that g + − = −f − = −f + almost everywhere. The same


reasoning as above applies to g and Eq. (8.7) applied to g yields
∫ ∫ ∫ ∫ ∫
+ +
(−f )d µ ≤ −f d µ =⇒ f dµ = f −d µ ≥ f d µ. (8.8)
X X X X X

161
8 Ergodicity and Markov Processes

Putting Eqs. (8.7) and (8.8) together, we obtain


∫ ∫
1∑
n
+
f d µ = lim f (T x) =
j
f dµ
X n j =0 X

almost everywhere, as required. Finally, recall that f + is T -invariant (i.e. f + ◦ T = f + ).


Because f + is also measurable, we conclude that it is measurable with respect to the
σ-field of T -invariant sets, which we denoted I. Hence for any set A ∈ I, we have
∫ ∫
+
f dµ = f dµ
A A

and thus f ∗ = E(f | I). Finally, if I = {∅, X }, then f ∗ is constant almost everywhere
and ∫
1∑
n
lim f (T x) =
j
f d µ.
n→∞ n X
i =0

8.7. Given a measure preserving map T : X 7−→ X , we define by M its set of invariant
measures: hence a measure µ ∈ M if µ(T −1 (A) = µ(A). It is easy to see that invariant
measures form a convex subset of the set of all measures. Indeed, for α1, α2 ≥ 0 and such
that α1 + α2 = 1, we have

(α1 µ1 + α2 µ2 )(T −1 (A)) = (α1 µ1 + α2 µ2 )(A).

Recall that an extreme point of a convex set is a point µ that cannot be expressed as a

convex combination i αi µi of µi without having for some i that αi = 1.
We say that a measure µ is ergodic for the transformation T for any A ∈ I, the σ-field
of invariant sets of T , µ(A) = 0 or µ(A) = 1. We have the following result:

Proposition 8.1. Let (X, B, µ) be a measure space and T : X 7−→ X a transformation on X .


Denote by M its set of invariant measures. Then a measure µ ∈ M is ergodic if and only if it is
an extreme point of M.

Proof. We first assume that µ ∈ M is not extremal and show that µ is not ergodic. Because
µ is assumed non-extremal, there exists α ∈ (0, 1) and µ1, µ2 ∈ M, µ1 , µ2 such that
µ = α µ1 + (1 − α)µ2 . If µ were ergodic, then for all A ∈ I, µ(A) = 0 or µ(A) = 1. Because
α is positive, we conclude that either µ1 (A) = µ2 (A) = 0 or µ1 (A) = µ2 (A) = 1. Hence µ1
and µ2 agree on I. Now let f be an integrable function. From the ergodic theorem, we
know that
∑ n
+
f (x) := lim
n→∞
1
n
E
f (T i (x)) = µi (f | I),
i =1

162
8.1 Birkhoff’s ergodic theorem

where the expectation is over any measure µi ∈ M. From the last relation, we conclude
that f + (x) is measurable with respect to I. Integrating f + over X , we have
∫ ∫
+
f (x)d µi = f d µi .
X X

Now since f + is I-measurable, and since the µi , i = 1, 2 agree on I, we conclude that


∫ ∫ ∫ ∫
+ +
f d µ1 = f d µ1 = f d µ2 = f d µ2 .
X X X X

Since f is arbitrary, equating the first and last term of the above equation yields that
µ1 = µ2 , which is a contradiction. Hence µ is not ergodic.
We now assume that µ is not ergodic and show that it can be expressed as the non-
trivial convex combination of two invariant measures. Since µ is not ergodic, there exists
a set F ∈ I with 0 < µ(F ) < 1. We define the measures µ1 and µ2 as follows: for A ∈ B,

µ(A ∩ F ) µ(A ∩ F c )
µ1 (A) = and µ2 (A) =
µ(F ) µ(F c )
where F c = Ω − F is the complement of F in Ω. We note that µi are invariant; we show
it for µ1 , a similar proof holds for µ2 :

µ(T −1 (A ∩ F ) µ(T −1 (A) ∩ µ(T −1 (F )) µ(T −1 (A) ∩ µ(F )


µ1 (T −1 (A)) = = = = µ1 (A)
µ(T −1 (F )) µ(T −1 (F )) µ(F )

where we use the fact that T −1 (F ) = F . Finally, we have

µ = µ1 (F )µ1 + (1 − µ(F ))µ2

and thus µ is not an extreme point of M.

8.8. We state the Dominated Convergence Theorem and Fatou’s Lemma.


Theorem 8.2 (Dominated Convergence Theorem). Let fn : X 7→ R
be a sequence of
measurable functions on a measure space (X, B, µ) that converges pointwise to a function f :
R
X 7→ . Then if there exists an integrable function g : X 7→ R
such that |fn | < |g | almost
everywhere on X , then f is integrable and
∫ ∫
lim fn d µ = f dµ
n→∞ X X
R
Theorem 8.3 (Fatou’s Lemma). Let fn : X 7→ be a sequence of measurable, non-negative
functions. Set f = lim inf n→∞ fn . Then f is measurable and
∫ ∫
f d µ ≤ lim inf X fn d µ
X n→∞

163
8 Ergodicity and Markov Processes

Introduction to the abstract theory of Markov processes


We now give a more abstract introduction to the theory of stochastic processes.

8.9. Recall that a stochastic process is a collection of random variables x_t(ω) on a
probability space (Ω, B, P) indexed by a time variable t ∈ T. A discrete-time process is
characterized by T = ℕ or ℤ and a continuous-time process by T = ℝ or ℝ⁺. We assume
that the x_t are ℝⁿ-valued random variables:
$$x_t : \Omega \to \mathbb{R}^n.$$
In fact, we can think of Ω as (ℝⁿ)^T and of B as the σ-field on Ω generated by products of sets F_i
in the Borel σ-field of ℝⁿ. Hence, x_t(ω) can be thought of as the "value of the random path
ω at time t". Note that the set-up allows for more general situations, but keeping this simple
scenario in mind can be helpful. In general, if F is the σ-field of Borel sets in ℝⁿ, then
B is generated by the sets
$$\{\omega \in \Omega \mid x_{t_1}(\omega) \in F_1, \ldots, x_{t_k}(\omega) \in F_k\} \quad \text{for } k \in \mathbb{N},\ F_i \in \mathcal{F}.$$

8.10. Having described the state space Ω and a σ-field which makes our random variables
x_t measurable, we now set out to put a measure on Ω. Describing explicitly a measure
on a set of paths – an infinite-dimensional space – is rarely done without first passing through
finite-dimensional probability distributions and using Kolmogorov's consistency theorem,
which we state below. The idea is that while it might be difficult to describe a generic
set in Ω and its measure, we can easily look at "time slices" and characterize the paths
x_t(ω) such that x_{t_1}(ω) ∈ F_1 for F_1 ∈ F. Generalizing this idea, we might want to try to
define P by giving finite-dimensional probability measures on ℝ^{nk} and declaring
$$\mu_{t_1,\ldots,t_k}(F_1 \times F_2 \times \cdots \times F_k) = P(x_{t_1} \in F_1, \ldots, x_{t_k} \in F_k).$$
One can easily construct such finite-dimensional measures: the simplest example would
be to make all time slices independent, $\mu_{t_1,\ldots,t_k}(F_1 \times \cdots \times F_k) = \prod_i \mu_{t_i}(F_i)$. This
does not yield many interesting processes. Another way to obtain µ is by specifying a
distribution for the increments x_{t+δ} − x_t of the path. This is what was done for the definition
of Brownian motion earlier in the notes.
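As an illustration of specifying a process through the distribution of its increments, the following sketch samples a scalar Brownian path at an arbitrary finite set of times by drawing independent Gaussian increments, and then uses the samples to estimate a finite-dimensional probability of the form µ_{t1,t2}(F1 × F2) by Monte Carlo. Python with NumPy is an assumed environment, and the times and sets below are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_at(times, n_paths):
    """Sample standard Brownian motion (started at 0) at the given increasing times."""
    times = np.asarray(times, dtype=float)
    dt = np.diff(np.concatenate(([0.0], times)))            # lengths of the increments
    incs = rng.normal(scale=np.sqrt(dt), size=(n_paths, len(times)))
    return incs.cumsum(axis=1)                               # columns are x_{t_1}, ..., x_{t_k}

# Estimate P(x_1 in [0, inf), x_2 in [-1, 1]) by Monte Carlo.
x = sample_at([1.0, 2.0], n_paths=200_000)
p_hat = np.mean((x[:, 0] >= 0.0) & (np.abs(x[:, 1]) <= 1.0))
print(p_hat)
```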
The measures µ_{t_1,...,t_k} need to satisfy some consistency conditions. The first is that for any
σ ∈ S_k, where S_k is the permutation group on k elements, we need to have
$$\mu_{t_{\sigma(1)},\ldots,t_{\sigma(k)}}(F_{\sigma(1)} \times F_{\sigma(2)} \times \cdots \times F_{\sigma(k)}) = \mu_{t_1,\ldots,t_k}(F_1 \times F_2 \times \cdots \times F_k). \tag{8.9}$$


This condition encodes the natural fact that the probability that x_{t_1}(ω) ∈ F_1 and x_{t_2}(ω) ∈
F_2 should be the same as the probability that x_{t_2}(ω) ∈ F_2 and x_{t_1}(ω) ∈ F_1. The second
consistency condition encodes the also very natural requirement that x_t(ω) ∈ ℝⁿ should


be an almost sure event: for any l ≥ 1, we require that

$$\mu_{t_1,\ldots,t_k,t_{k+1},\ldots,t_{k+l}}(F_1 \times F_2 \times \cdots \times F_k \times \mathbb{R}^n \times \cdots \times \mathbb{R}^n) = \mu_{t_1,\ldots,t_k}(F_1 \times F_2 \times \cdots \times F_k). \tag{8.10}$$

The Kolmogorov consistency Theorem says that if µ satisfies the above two require-
ments, then µ can be used to define a measure on all of Ω.

Theorem 8.4 (Kolmogorov Consistency Theorem). For k ∈ ℕ and t_1, t_2, …, t_k ∈ T, let
µ_{t_1,…,t_k} be probability measures on ℝ^{nk} satisfying the conditions of Eqs. (8.9) and (8.10). Then
there exists a probability space (Ω, B, P) and a stochastic process x_t on Ω such that
$$P(x_{t_1}(\omega) \in F_1, \ldots, x_{t_k}(\omega) \in F_k) = \mu_{t_1,\ldots,t_k}(F_1 \times \cdots \times F_k)$$
for all k ∈ ℕ, t_1, …, t_k ∈ T and F_i Borel sets.

8.11. We now turn our attention to Markov processes. Informally speaking, these are
processes for which the future states only depend on the present state, and not on past
states. Hence a Markov process should satisfy the relation

$$P(x_t(\omega) \in F \mid x_{t_1} \in F_1, \ldots, x_{t_k} \in F_k) = P(x_t(\omega) \in F \mid x_{t_1} \in F_1)$$
whenever t ≥ t_1 ≥ t_2 ≥ ⋯ ≥ t_k.
R
Recall for a random variable X : M 7→ , we define its adapted σ-field as the σ-field
obtained by taking countable unions and complements of the sets X −1 (F ), where F is
R
any Borel set in . It is the "smallest" σ-field on M for which X is measurable. Thinking
of a random process as a path-valued random variable x : ω ∈ Ω 7→ x(ω), we defined
the filtration Ft1t2 as the smallest σ-subfield of B that makes the collection of random
variables x s , t1 ≤ s ≤ t2 measurable. We also write Ft for Ft ∞ and F t for F−∞
t . Note that

Ft t is the smallest σ-field that makes x t measurable.


We can thus say that a process is Markov if for 0 ≤ s ≤ t and F a Borel set, we have
$$P(x_t(\omega) \in F \mid \mathcal{F}_0^s) = P(x_t(\omega) \in F \mid \mathcal{F}_s^s).$$

8.12. Let t_1 ≤ t_2 ≤ t_3 and let f, g be integrable functions. The Markov property for x_t
implies that
$$\mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_1}^{t_1} \times \mathcal{F}_{t_2}^{t_2}) = \mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2}).$$
The so-called time-reversed Markov property states that
$$\mathbb{E}(f(x_{t_1}) \mid \mathcal{F}_{t_2}^{t_2} \times \mathcal{F}_{t_3}^{t_3}) = \mathbb{E}(f(x_{t_1}) \mid \mathcal{F}_{t_2}^{t_2}).$$

We will show that the Markov property and the time-reversed Markov property are equivalent. In fact, we show that they are equivalent to the following:

$$\mathbb{E}(f(x_{t_1})\, g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2}) = \mathbb{E}(f(x_{t_1}) \mid \mathcal{F}_{t_2}^{t_2})\, \mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2}).$$

Lemma 8.4. Let x_t be a Markov process with distribution P on (Ω, B), let f and g be measurable functions and t_1 < t_2 < t_3. Then the following relations hold:
1. $\mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_1}^{t_1} \times \mathcal{F}_{t_2}^{t_2}) = \mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2})$.
2. $\mathbb{E}(f(x_{t_1}) \mid \mathcal{F}_{t_2}^{t_2} \times \mathcal{F}_{t_3}^{t_3}) = \mathbb{E}(f(x_{t_1}) \mid \mathcal{F}_{t_2}^{t_2})$.
3. $\mathbb{E}(f(x_{t_1})\, g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2}) = \mathbb{E}(f(x_{t_1}) \mid \mathcal{F}_{t_2}^{t_2})\, \mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2})$.

Proof. The first relation is equivalent to the Markov property of x_t. We now show that
relation 3 holds:
$$\begin{aligned}
\mathbb{E}(g(x_{t_3}) f(x_{t_1}) \mid \mathcal{F}_{t_2}^{t_2}) &= \mathbb{E}\bigl(\mathbb{E}(g(x_{t_3}) f(x_{t_1}) \mid \mathcal{F}_{t_2}^{t_2} \times \mathcal{F}_{t_1}^{t_1}) \mid \mathcal{F}_{t_2}^{t_2}\bigr) \\
&= \mathbb{E}\bigl(f(x_{t_1})\, \mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2} \times \mathcal{F}_{t_1}^{t_1}) \mid \mathcal{F}_{t_2}^{t_2}\bigr) \\
&= \mathbb{E}\bigl(f(x_{t_1})\, \mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2}) \mid \mathcal{F}_{t_2}^{t_2}\bigr) \\
&= \mathbb{E}\bigl(f(x_{t_1}) \mid \mathcal{F}_{t_2}^{t_2}\bigr)\, \mathbb{E}\bigl(g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2}\bigr),
\end{aligned}$$
where we used the Markov property (relation 1) to go from the second to the third line.


We now use relation 3 to show that relation 1 holds. Let h be an integrable function.
We have
$$\begin{aligned}
\mathbb{E}\bigl(f(x_{t_1}) h(x_{t_2})\, \mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_1}^{t_1} \times \mathcal{F}_{t_2}^{t_2})\bigr) &= \mathbb{E}\bigl(f(x_{t_1}) h(x_{t_2}) g(x_{t_3})\bigr) \\
&= \mathbb{E}\bigl(h(x_{t_2})\, \mathbb{E}(f(x_{t_1}) g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2})\bigr) \\
&= \mathbb{E}\bigl(h(x_{t_2})\, \mathbb{E}(f(x_{t_1}) \mid \mathcal{F}_{t_2}^{t_2})\, \mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2})\bigr) \\
&= \mathbb{E}\bigl(h(x_{t_2}) f(x_{t_1})\, \mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2})\bigr),
\end{aligned}$$
where we used relation 3 to go from the second to the third line. Since f and h are arbitrary,
we have that $\mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_2}^{t_2}) = \mathbb{E}(g(x_{t_3}) \mid \mathcal{F}_{t_1}^{t_1} \times \mathcal{F}_{t_2}^{t_2})$ almost everywhere. We can similarly
show that relations 2 and 3 are equivalent.

8.13. We now show how to construct a Markov process via a transition probability function.
In words, it is a function that gives us a measure for the state of the process t seconds in
the future, given that we are in a given state x now. Precisely, it is a measure P_t(x, ·) for
all t ∈ T, x ∈ ℝⁿ:
$$(t, x) \longmapsto P_t(x, \cdot) \text{ is a measure on } \mathbb{R}^n.$$
For a Borel set F of ℝⁿ, P_t(x, F) is the probability that x_{s+t} ∈ F given that x_s = x. Said


otherwise, if the process x_t starts at x at time zero, then
$$P(x_{t_1} \in F) = P_{t_1}(x, F).$$

What is the joint probability P(x_{t_1} ∈ F_1, x_{t_2} ∈ F_2) for t_1 ≤ t_2 and F_i Borel sets? We
have to take into account all the transitions from a state x_1 ∈ F_1 to a state x_2 ∈ F_2. Since
P_{t_2−t_1}(x_1, F_2) is the probability of being in F_2 at time t_2 given that we are at x_1 at time t_1,
we see that we need to sum such terms over all possible states x_1 ∈ F_1. We hence have
$$P(x_{t_1} \in F_1, x_{t_2} \in F_2) = \int_{F_1} P_{t_1}(x, dx_1)\, P_{t_2 - t_1}(x_1, F_2).$$

We obtain in a similar fashion that
$$P_x(x_{t_1} \in F_1, x_{t_2} \in F_2, \ldots, x_{t_l} \in F_l) := \int_{F_1} \int_{F_2} \cdots \int_{F_{l-1}} P_{t_1}(x, dx_1)\, P_{t_2 - t_1}(x_1, dx_2) \cdots P_{t_l - t_{l-1}}(x_{l-1}, F_l),$$
where we use the subscript x to indicate the initial state of the process. If the initial state
of the process is not x, but is instead distributed according to a distribution µ(dx), we
have
$$P_\mu(x_{t_1} \in F_1, x_{t_2} \in F_2, \ldots, x_{t_l} \in F_l) := \int_{\mathbb{R}^n} \mu(dx)\, P_x(x_{t_1} \in F_1, x_{t_2} \in F_2, \ldots, x_{t_l} \in F_l). \tag{8.11}$$
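For a finite state space the integrals above become sums against transition matrices, which makes the construction easy to check numerically. The following sketch computes P_x(x_{t_1} ∈ F_1, x_{t_2} ∈ F_2) for a discrete-time chain by composing one-step kernels exactly as in the formula above; the chain P, the initial state, and the sets F_i are arbitrary illustrative choices, and Python with NumPy is an assumed environment.

```python
import numpy as np

# One-step transition matrix of a 3-state chain (rows sum to 1); illustrative values.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

def joint_prob(x0, times, sets):
    """P_x(x_{t_1} in F_1, ..., x_{t_l} in F_l) for a finite-state chain.

    times: increasing integer times; sets: lists of allowed states at those times.
    """
    v = np.zeros(P.shape[0]); v[x0] = 1.0                 # point mass at the initial state
    prev = 0
    for t, F in zip(times, sets):
        v = v @ np.linalg.matrix_power(P, t - prev)        # propagate the distribution
        mask = np.zeros_like(v); mask[list(F)] = 1.0
        v = v * mask                                       # keep only the mass inside F
        prev = t
    return v.sum()

print(joint_prob(x0=0, times=[1, 3], sets=[[0, 1], [2]]))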

8.14. The transition probability function obeys the so-called Chapman-Kolmogorov
equation: for any 0 ≤ s ≤ t,
$$P_t(x, F) = \int_{\mathbb{R}^n} P_s(x, dx_s)\, P_{t-s}(x_s, F).$$
The idea is the same as earlier: to evaluate the probability of a transition from x to F,
we can go through an arbitrary intermediate time s and consider all possible transitions:
$$P_t(x, F) = P(x_0 = x,\ x_t \in F,\ x_s \in \mathbb{R}^n) = \int_{\mathbb{R}^n} P_s(x, dx_s)\, P_{t-s}(x_s, F).$$
The Chapman-Kolmogorov equation's importance lies in the following fact: it shows
that the expectation operators E_t for functions evaluated on a Markov process form a
semi-group. Indeed, let f be an integrable function. The expectation of f(x_t) for the
process starting at x_0 = x is
$$\mathbb{E} f(x_t) = \int_{\mathbb{R}^n} P_t(x, dx_1)\, f(x_1).$$


We introduce the operator E_t acting on integrable functions and defined as
$$E_t f(x) = \mathbb{E} f(x_t), \tag{8.12}$$
where x is the initial state of the process at time 0. We have from the Chapman-Kolmogorov
equation that
$$\begin{aligned}
E_t f(x) &= \int_{\mathbb{R}^n} P_t(x, dx_1)\, f(x_1) \\
&= \int_{\mathbb{R}^n} \int_{\mathbb{R}^n} P_s(x, dx_s)\, P_{t-s}(x_s, dx_1)\, f(x_1) \\
&= \int_{\mathbb{R}^n} P_s(x, dx_s)\, E_{t-s} f(x_s) \\
&= E_s (E_{t-s}(f))(x).
\end{aligned}$$
Hence
$$E_{t+s} = E_t \circ E_s$$
and the semi-group property indeed holds.
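As a rough numerical check of the Chapman-Kolmogorov relation behind this semi-group property, one can discretize the state space and verify that composing two Gaussian transition kernels of Brownian motion reproduces the kernel for the total time. The sketch below compares $\int p_s(x, y)\, p_t(y, z)\, dy$ with $p_{s+t}(x, z)$ on a truncated grid; the grid, times, and evaluation points are illustrative assumptions, and Python with NumPy is an assumed environment.

```python
import numpy as np

def p(t, x, y):
    """Transition density of scalar Brownian motion: a N(x, t) density evaluated at y."""
    return np.exp(-(y - x) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

grid = np.linspace(-10, 10, 2001)            # truncated state space for the intermediate point
dy = grid[1] - grid[0]
s, t = 0.3, 0.7
x, z = 0.5, -0.2

# Chapman-Kolmogorov: integrate over the intermediate state y.
lhs = np.sum(p(s, x, grid) * p(t, grid, z)) * dy
rhs = p(s + t, x, z)
print(lhs, rhs)                              # agree up to discretization/truncation error
```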

Lemma 8.5. Let x t be a Markov process and Et the associated semi-group defined in Eq. (8.12).
Then the following properties hold:
1. If f is a positive function, so is Et f .
2. Et is a contraction for the L ∞ norm.

Proof. The proof of the first item is obvious from the definition of E_t. For the second
item, we have
$$\|E_t f\|_\infty = \sup_{x \in \mathbb{R}^n} \Bigl| \int_{\mathbb{R}^n} P_t(x, dx_t)\, f(x_t) \Bigr| \le \sup_{x \in \mathbb{R}^n} \Bigl( \int_{\mathbb{R}^n} P_t(x, dx_t) \Bigr) \sup_{y \in \mathbb{R}^n} |f(y)| = \|f\|_\infty,$$
where we used the fact that P_t(x, dx) integrates to one.

8.15. We can introduce an operator similar to E_t but that acts on measures instead of
on functions. Assume that the Markov process is initialized randomly at time 0, according
to a distribution µ. What is the distribution of the state at time t? Denote by µ_t the
distribution of the state at time t. We have already seen that for F a Borel set, we have
$$\mu_t(F) = P(x_t \in F) = \int_{\mathbb{R}^n} \mu(dx)\, P_t(x, F).$$
We define
$$(M_t \mu)(F) = \int_{\mathbb{R}^n} \mu(dx)\, P_t(x, F).$$
From the Chapman-Kolmogorov equation, we conclude that the operators Mt form a


semi-group:
Mt +s = Mt ◦ Ms .

8.16. If the semi-group of operators E_t for a given Markov process is continuous at t = 0,
i.e. such that
$$\lim_{t \to 0} \|E_t f - f\| = 0,$$
then the process is said to be a Feller process. From the Feller property, one can conclude
(we will not go into the details here) that there exist densities p_t(x, x_1) such that
$$P_t(x, dx_1) = p_t(x, x_1)\, dx_1.$$

The Brownian motion process introduced earlier in these notes is the Markov process
with transition density
$$p_t(x, x_1) = \frac{1}{(2\pi t)^{n/2}} \exp\Bigl(-\frac{\|x - x_1\|^2}{2t}\Bigr).$$

Furthermore, one can make sense of the infinitesimal generators of M_t and E_t. We define
$$L \mu = \lim_{t \to 0} \frac{1}{t}(M_t \mu - \mu).$$
Note that L is nothing more than the Fokker-Planck operator we introduced in
an earlier lecture. Its dual L* is
$$L^* f = \lim_{t \to 0} \frac{1}{t}(E_t f - f).$$
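For scalar Brownian motion with the transition density above, the generator acting on smooth functions is (1/2) d²/dx². The sketch below approximates (E_t f − f)/t by integrating f against the Gaussian kernel for a small t and compares the result with (1/2) f''(x); the test function, grid, and value of t are illustrative assumptions, and Python with NumPy is an assumed environment.

```python
import numpy as np

def f(x):
    return np.sin(x)            # smooth test function

def f_second(x):
    return -np.sin(x)           # its second derivative

def Et_f(x, t, half_width=10.0, n=200_001):
    """E_t f(x) = integral of p_t(x, y) f(y) dy for scalar Brownian motion (Riemann sum)."""
    y = np.linspace(x - half_width, x + half_width, n)
    dy = y[1] - y[0]
    kernel = np.exp(-(y - x) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)
    return np.sum(kernel * f(y)) * dy

x, t = 0.7, 1e-3
print((Et_f(x, t) - f(x)) / t)  # approximates (L* f)(x) for small t
print(0.5 * f_second(x))        # (1/2) f''(x), the generator applied to f
```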

8.17. A Markov process is said to be stationary if its law is time-independent. This
means that for any h > 0, t_i ∈ ℝ and F_i Borel sets, the relation
$$P(x_{t_1 + h} \in F_1, \ldots, x_{t_k + h} \in F_k) = P(x_{t_1} \in F_1, \ldots, x_{t_k} \in F_k)$$
holds.
Recall that $P(x_t \in F) = M_t \mu(F) = \int_{\mathbb{R}^n} \mu(dx)\, P_t(x, F)$. Thus if the process is stationary,
we conclude that
$$M_t \mu = M_{t+s} \mu = \mu \quad \text{for all } s, t,$$
and hence
$$L \mu = 0.$$
Hence the invariant measures for the process are in the kernel of the Fokker-Planck operator.
Note that if we initialize the process with an invariant measure µ at t = 0, we obtain a
stationary process with distribution P_µ whose marginal at each time t ∈ ℝ is µ, i.e.


P_µ(x_t ∈ F) = µ(F). We denote by M̄ the set of all stationary distributions of the process:
$$\bar{M} := \{\mu \mid M_t \mu = \mu \text{ for all } t \in \mathbb{R}\}.$$
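For a finite-state jump process, the operator L acts on probability vectors through the process's rate matrix, so the statement that invariant measures lie in the kernel of L can be checked directly. The sketch below finds a stationary distribution of a continuous-time chain as a normalized kernel vector of Qᵀ and verifies that it is preserved by the transition semi-group e^{tQ}; the rate matrix Q is an arbitrary illustrative choice, and Python with NumPy and SciPy is an assumed environment.

```python
import numpy as np
from scipy.linalg import expm    # SciPy assumed available for the matrix exponential

# Rate matrix Q of a 3-state jump process: non-negative off-diagonal rates, rows sum to 0.
Q = np.array([[-1.0,  1.0,  0.0],
              [ 0.5, -1.5,  1.0],
              [ 0.2,  0.3, -0.5]])

# A stationary distribution satisfies pi Q = 0, i.e. pi^T lies in the kernel of Q^T.
eigvals, eigvecs = np.linalg.eig(Q.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals))])
pi = pi / pi.sum()               # normalize to a probability vector

print(pi)                        # stationary distribution
print(pi @ expm(2.0 * Q))        # M_t applied to pi returns pi (here t = 2)
```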

8.18. We denote by τs the time-shift operator on Ω. It is defined as follows:

τs (x t (ω)) = x t +s (ω).

Note that the time-shift operator is a measure-preserving transformation on (Ω, B, P_µ),
for µ any stationary distribution. We can equivalently say that the P_µ for µ ∈ M̄ are invariant
measures for τ_s. We can thus apply the ergodic theorem for τ_s (the translation
from a discrete-time transformation T : X → X to a continuous-time transformation
τ_s : Ω → Ω is straightforward): for a measurable function f,

$$\lim_{t \to \infty} \frac{1}{t} \int_0^t f(\tau_s\, x_{t_0}(\omega))\, ds = \mathbb{E}(f(x_{t_0}(\omega)) \mid \mathcal{I}),$$

where t0 is arbitrary because the process is stationary, and I is the σ-field of invariant
sets for the shift-operator. Observe that since the marginal of P µ for t0 is µ, the expec-
tation on the right-hand side of the above equation is taken with respect to µ. We call
a distribution P_µ for a Markov process ergodic if for all A ∈ I,
P_µ(A) = 0 or P_µ(A) = 1. We have seen in Prop. 8.1 that P_µ is ergodic if and only
if it is an extreme point of the set of invariant measures of the shift-operator. We can
characterize ergodic measures as follows:

Theorem 8.5. A stationary distribution µ for a Markov process is an extremal point of M̄ (the
set of stationary distributions) if and only if P µ is ergodic (an extremal point of the set M of
invariant measures of the shift-operator.)

Proof. Note that the map that sends µ to P µ , described in Eq. (8.11) is linear. Hence, if
P µ is an extreme point of M, then µ is an extreme point of M̄ (it can be shown via a
simple contradiction argument).
We now show that if µ is an extremal point of M̄, then P_µ is an extremal point of M, i.e.
P_µ is ergodic. Arguing by contraposition, suppose P_µ is not ergodic and let H ∈ I
be a non-trivial invariant set for the shift operator; recall that we set F_t := F_t^∞ and
F^t := F_{−∞}^t. Because H is invariant for the shift operator, it belongs both to the future F_t and
to the past F^{−t} for arbitrarily large t. From Lem. 8.4, we thus have that
$$P_\mu(H \mid \mathcal{F}_0^0) = P_\mu(H \cap H \mid \mathcal{F}_0^0) = P_\mu(H \mid \mathcal{F}_0^0)\, P_\mu(H \mid \mathcal{F}_0^0).$$
Hence P_µ(H | F_0^0) = 0 or P_µ(H | F_0^0) = 1, and there exists a Borel set F such that
H = {ω : x_t(ω) ∈ F for all t ∈ ℝ}. Since H is non-trivial, 0 < µ(F) < 1. Furthermore,
note that if the process starts in F (resp. F^c) at time 0, it never leaves F (resp. F^c), in
the sense that x_t(ω) ∈ F (resp. F^c) for all t > 0. Hence P_t(x, F) = 1 for x ∈ F (resp. P_t(x, F^c) = 1 for x ∈ F^c) for all
t > 0. The normalized restrictions of µ to F and F^c are therefore both stationary distributions,
and µ is their non-trivial convex combination. Hence µ is not extremal.

