Dynamic Programming and Optimal Control
Script
Prof. Raffaello D’Andrea
Lecture notes
Dieter Baldinger Thomas Mantel Daniel Rohrer
HS 2010
Contents

1 Introduction
  1.1 Class Objective
  1.2 Key Ingredients
  1.3 Open Loop versus Closed Loop Control
  1.4 Discrete State and Finite State Problem
  1.5 The Basic Problem

2 Dynamic Programming Algorithm
  2.1 Principle of Optimality
  2.2 The DPA
  2.3 Chess Match Strategy Revisited
  2.4 Converting non-standard problems to the Basic Problem
    2.4.1 Time Lags
    2.4.2 Correlated Disturbances
    2.4.3 Forecasts
  2.5 Deterministic, Finite State Systems
    2.5.1 Convert DP to Shortest Path Problem
    2.5.2 DP algorithm
    2.5.3 Forward DP algorithm
  2.6 Converting Shortest Path to DP
  2.7 Viterbi Algorithm
  2.8 Shortest Path Algorithms
    2.8.1 Label Correcting Methods
    2.8.2 A* Algorithm
  2.9 Multi-Objective Problems
    2.9.1 Extended Principle of Optimality
  2.10 Infinite Horizon Problems
  2.11 Stochastic, Shortest Path Problems
    2.11.1 Main Result
    2.11.2 Sketch of Proof
    2.11.3 Proof of B
  2.12 Summary of previous lecture
  2.13 How do we solve Bellman's Equation?
    2.13.1 Method 1: Value iteration (VI)
    2.13.2 Method 2: Policy Iteration (PI)
    2.13.3 Third Method: Linear Programming
    2.13.4 Analogies and Connections
  2.14 Discounted Problems

3 Continuous Time Optimal Control
  3.1 The Hamilton Jacobi Bellman (HJB) Equation
  3.2 Aside on Notation
    3.2.1 The Minimum Principle
  3.3 Extensions
    3.3.1 Fixed Terminal State
    3.3.2 Free initial state, with cost
  3.4 Linear Systems and Quadratic Costs
    3.4.1 Summary
  3.5 General Problem Formulation

Bibliography
Chapter 1
Introduction
1.1 Class Objective
The class objective is to make multiple decisions in stages to minimize a cost
that captures undesirable outcomes.
1.2 Key Ingredients
1. Underlying discrete time system:

   x_{k+1} = f_k(x_k, u_k, w_k),   k = 0, 1, ..., N-1

   k: discrete time index
   x_k: state
   u_k: control input, decision variable
   w_k: disturbance or noise, random parameters
   N: time horizon
   f_k: function, captures the system evolution

2. Additive cost function:

   g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k)

   where g_N(x_N) is the terminal cost and the sum is the accumulated stage cost; g_k is a given nonlinear function.

   · The cost is a function of the control applied.
   · Because the w_k are random, one typically considers the expected cost:

     E_{w_k} [ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) ]
Example 1: Inventory Control
Keeping an item stocked in a warehouse. Too little, you run out (bad). Too much, cost of storage and misuse of capital (bad).

x_k: stock in the warehouse at the beginning of the k-th time period
u_k: stock ordered and immediately delivered at the beginning of the k-th time period
w_k: demand during the k-th period, with some given probability distribution

Dynamics:

  x_{k+1} = x_k + u_k - w_k

Excess demand is backlogged and corresponds to negative values of the stock.

Cost:

  E [ R(x_N) + \sum_{k=0}^{N-1} ( r(x_k) + c u_k ) ]

r(x_k): penalizes too much stock or negative stock
c u_k: cost of the items
R(x_N): terminal cost from items at the end that can't be sold, or demand that can't be met

Objective: minimize the cost subject to u_k ≥ 0.
1.3 Open Loop versus Closed Loop Control
Open Loop: Come up with the control inputs u_0, ..., u_{N-1} before k = 0. In open loop the objective is to calculate the sequence {u_0, ..., u_{N-1}}.

Closed Loop: Wait until time k to make the decision. This assumes x_k is measurable. Closed loop will always give better (or equal) performance, but is computationally much more expensive. In closed loop the objective is to calculate the optimal rule u_k = µ_k(x_k); Π = {µ_0, ..., µ_{N-1}} is a policy or control law.
Example 2:

  µ_k(x_k) = { s_k - x_k   if x_k < s_k
             { 0           otherwise

where s_k is some threshold.
1.4 Discrete State and Finite State Problem
When the state x_k takes on discrete values or is finite in size, it is often convenient to express the dynamics in terms of transition probabilities:

  P_{ij}(u, k) := Prob( x_{k+1} = j | x_k = i, u_k = u )

i: start state, j: possible future state, u: control input, k: time.

This is equivalent to x_{k+1} = w_k, where w_k has the distribution

  Prob( w_k = j | x_k = i, u_k = u ) := P_{ij}(u, k).
Example 3: Optimizing Chess Playing Strategies
· Two game chess match with an opponent; the objective is to come up with a strategy that maximizes the chance to win.
· Each game can have 2 outcomes:
  a) Win by one player: 1 point for the winner, 0 points for the loser.
  b) Draw: 0.5 points for each player.
· If the games are tied 1-1 at the end of 2 games, go into sudden death mode until someone wins.
· Decision variable for the player, two playing styles:
  1) Timid play: draw with probability p_d, lose with probability 1 - p_d
  2) Bold play: win with probability p_w, lose with probability 1 - p_w
· Assume p_d > p_w as a necessary condition for the problem to make sense.
Problem: What playing style should be chosen? Since it doesn't make sense to play timid if we are tied 1-1 at the end of 2 games, this is a 2-stage finite problem.

Transition Probability Graph: the graphs below show all possible outcomes.

[Figure 1.1: First Game: transition probability graphs for (a) Timid Play and (b) Bold Play.]
[Figure 1.2: Second Game: transition probability graphs for (a) Timid Play and (b) Bold Play.]
Closed Loop Strategy: play timid iff the player is ahead.
The probability of winning is:

  p_d p_w + p_w ( (1 - p_d) p_w + p_w (1 - p_w) ) = p_w^2 (2 - p_w) + p_w (1 - p_w) p_d

For {p_w = 0.45, p_d = 0.9} and {p_w = 0.5, p_d = 1} the probabilities to win are 0.54 and 0.625, respectively.
Open Loop Strategy Possibilities:
[Figure 1.3: Closed Loop Strategy: transition probability graph with the WIN/LOSE outcomes for each score.]
1) Timid for the first 2 games: p_d^2 p_w
2) Bold in both: p_w^2 (3 - 2 p_w)
3) Bold in the first, timid in the second game: p_w p_d + p_w (1 - p_d) p_w
4) Timid in the first, bold in the second game: p_w p_d + p_w (1 - p_d) p_w

Clearly 1) is not the optimal OL strategy, because p_d^2 p_w ≤ p_d p_w + ...

The best open loop strategy yields:

  p_w^2 + p_w (1 - p_w) max(2 p_w, p_d)

If p_d > 2 p_w, the optimal OL strategy is 3) or 4). It can be shown that if p_w ≤ 0.5, then the open loop probability of winning is ≤ 0.5.
1.5 The Basic Problem
Summarize basic problem setup:
• x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1, ..., N-1, with
  x_k ∈ S_k (state space), u_k ∈ C_k (control space), w_k ∈ D_k (disturbance space).
• u_k ∈ U(x_k) ⊂ C_k: constrained not only as a function of time, but also of the current state.
• w_k ∼ P_R(· | x_k, u_k): the noise distribution can depend on the current state and the applied control.
Consider policies, or control laws,

  Π = {µ_0, µ_1, ..., µ_{N-1}}

where µ_k maps the state x_k into controls u_k = µ_k(x_k), such that µ_k(x_k) ∈ U(x_k) ∀ x_k ∈ S_k. The set of all such admissible policies is denoted Π.

• Given a policy π ∈ Π, the expected cost of starting at state x_0 is

  J_π(x_0) := E_{w_k} [ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, µ_k(x_k), w_k) ]

• Optimal policy π*: J_{π*}(x_0) ≤ J_π(x_0) ∀ π ∈ Π
• Optimal cost: J*(x_0) := J_{π*}(x_0)
Chapter 2
Dynamic Programming
Algorithm
At the heart of the DP algorithm is the following very simple and intuitive
idea.
2.1 Principle of Optimality
Let π* = {µ*_0, µ*_1, ..., µ*_{N-1}} be an optimal policy. Assume that in the process of using π*, a state x_i occurs at time i. Consider the subproblem whereby at time i we are at state x_i and we want to minimize

  E_{w_k} [ g_N(x_N) + \sum_{k=i}^{N-1} g_k(x_k, µ_k(x_k), w_k) ].

Then the truncated policy {µ*_i, µ*_{i+1}, ..., µ*_{N-1}} is optimal for this subproblem.

The proof is simple: prove by contradiction. If the truncated policy were not optimal, you could find a different policy that would give a lower cost for the subproblem. Applying that policy to the original problem from time i would therefore also give a lower cost, which contradicts that π* was an optimal policy.
Example 4: Deterministic Scheduling Problem
• Have 4 machines A, B, C, D, that are used to make something.
• A must occur before B. C before D.
The solution is obtained by calculating the optimal cost for each node,
beginning at the bottom of the tree. See figure 2.1.
[Figure 2.1: Problem of Example 4: the tree of machine orderings from the initial condition (I.C.) through A, C, AB, AC, CA, CD, ... to the complete schedules, with arc costs and the optimal cost for each node written above it (in circles).]
2.2 The DPA
For every initial state x_0, the optimal cost J*(x_0) is equal to J_0(x_0), given by the last step of the following recursive algorithm, which proceeds backwards in time from N-1 to 0:

Initialization: J_N(x_N) = g_N(x_N) ∀ x_N ∈ S_N

Recursion:

  J_k(x_k) = \min_{u_k ∈ U_k(x_k)} E_{w_k} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) ]

where the expectation is taken with respect to P_R(· | x_k, u_k).

Furthermore, if u*_k = µ*_k(x_k) minimizes the recursion equation for each x_k and k, the policy π* = {µ*_0, ..., µ*_{N-1}} is optimal.

Comments
• For each recursion step, we have to perform the optimization over all possible values x_k ∈ S_k, since we don't know a priori which states we will actually visit.
• This pointwise optimization is what gives us µ*_k. A minimal implementation sketch of the recursion is given below.
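The sketch below implements the DPA for finite state, control and disturbance spaces. The function names and the data interfaces (P_w, f, g, g_N passed in as plain Python callables) are assumptions chosen for illustration, not part of the notes.

```python
# A minimal sketch of the DP algorithm for finite sets.
#   P_w(k, x, u)  -> list of (w, prob) pairs for the disturbance
#   f(k, x, u, w) -> next state
#   g(k, x, u, w) -> stage cost,  g_N(x) -> terminal cost
#   controls(k, x) -> iterable of admissible controls U_k(x)

def dpa(N, states, controls, P_w, f, g, g_N):
    """Return cost-to-go tables J[k][x] and an optimal policy mu[k][x]."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:                          # initialization: J_N = g_N
        J[N][x] = g_N(x)
    for k in range(N - 1, -1, -1):            # backward recursion
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls(k, x):
                val = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                          for (w, p) in P_w(k, x, u))
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu
```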
Proof 1: (Read section 1.5 in [1] if you are mathematically inclined).
• Denote π_k := {µ_k, µ_{k+1}, ..., µ_{N-1}}.
• Denote

  J*_k(x_k) = \min_{π_k} E_{w_k, ..., w_{N-1}} [ g_N(x_N) + \sum_{i=k}^{N-1} g_i(x_i, µ_i(x_i), w_i) ],

  the optimal cost when, starting at time k, we find ourselves at state x_k.
• J*_N(x_N) = g_N(x_N), finally.
• We will show that J*_k = J_k generated by the DPA, which gives us the desired result when k = 0.

Induction: J*_N(x_N) = J_N(x_N), ∴ true for k = N.
Assume true for k+1: J*_{k+1}(x_{k+1}) = J_{k+1}(x_{k+1}) ∀ x_{k+1} ∈ S_{k+1}.
Then, since π_k = {µ_k, π_{k+1}}, we have

  J*_k(x_k) = \min_{(µ_k, π_{k+1})} E_{w_k, ..., w_{N-1}} [ g_k(x_k, µ_k(x_k), w_k) + g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, µ_i(x_i), w_i) ]

by the principle of optimality:

  = \min_{µ_k} E_{w_k} [ g_k(x_k, µ_k(x_k), w_k) + \min_{π_{k+1}} E_{w_{k+1}, ..., w_{N-1}} [ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, µ_i(x_i), w_i) ] ]

by the definition of J*_{k+1} and the update equation:

  = \min_{µ_k} E_{w_k} [ g_k(x_k, µ_k(x_k), w_k) + J*_{k+1}( f_k(x_k, µ_k(x_k), w_k) ) ]

by the induction hypothesis:

  = \min_{µ_k} E_{w_k} [ g_k(x_k, µ_k(x_k), w_k) + J_{k+1}( f_k(x_k, µ_k(x_k), w_k) ) ]

  = \min_{u_k ∈ U_k(x_k)} E_{w_k} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) ]

  = J_k(x_k)

In other words: minimizing over the function µ_k is the same as minimizing pointwise over the control value u_k at each state x_k.

J_k(x_k) is called the cost-to-go at state x_k.
J_k(·) is called the cost-to-go function.
2.3 Chess Match Strategy Revisited
Recall
· Timid play: prob. tie = p_d, prob. loss = 1 - p_d
· Bold play: prob. win = p_w, prob. loss = 1 - p_w
· 2 game match, plus tie breaker if necessary

Objective: find the policy which maximizes the probability of winning. We will solve this with DP, replacing min by max. Assume p_d > p_w.

Define x_k = difference between our score and the opponent's score at the end of game k. Recall: 1 point for a win, 0 for a loss and 0.5 for a tie.
Define J_k(x_k) = probability of winning the match at time k if the state is x_k.
Start of the recursion:

  J_2(x_2) = { 1    if x_2 > 0
             { p_w  if x_2 = 0
             { 0    if x_2 < 0

Recursive equation:

  J_k(x_k) = max [ p_d J_{k+1}(x_k) + (1 - p_d) J_{k+1}(x_k - 1)   (timid),
                   p_w J_{k+1}(x_k + 1) + (1 - p_w) J_{k+1}(x_k - 1)   (bold) ]
Convince yourself that this is equivalent to the formal definition:

  J_k(x_k) = \max_{u_k} E_{w_k} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k) ) ]

Note: there is only a terminal cost in this problem.
  J_1(x_1) = max [ p_d J_2(x_1) + (1 - p_d) J_2(x_1 - 1),  p_w J_2(x_1 + 1) + (1 - p_w) J_2(x_1 - 1) ]

· If x_1 = 1:  max [ p_d + (1 - p_d) p_w  (timid),  p_w + (1 - p_w) p_w  (bold) ]
  Which is bigger? Timid - Bold = (p_d - p_w)(1 - p_w) > 0, ∴ timid is optimal, and J_1(1) = p_d + (1 - p_d) p_w.

· If x_1 = 0:  max [ p_d p_w + (1 - p_d)·0  (timid),  p_w + (1 - p_w)·0  (bold) ]
  The optimum is p_w, so J_1(0) = p_w, and bold is the optimal strategy.

· If x_1 = -1:  max [ 0, p_w^2 ]
  J_1(-1) = p_w^2, and the optimal strategy is bold.
  J_0(0) = max [ p_d J_1(0) + (1 - p_d) J_1(-1),  p_w J_1(1) + (1 - p_w) J_1(-1) ]
         = max [ p_d p_w + (1 - p_d) p_w^2,  p_w ( p_d + (1 - p_d) p_w ) + (1 - p_w) p_w^2 ]
         = max [ p_d p_w + (1 - p_d) p_w^2,  p_d p_w + (1 - p_d) p_w^2 + (1 - p_w) p_w^2 ]

∴ J_0(0) = p_d p_w + (1 - p_d) p_w^2 + (1 - p_w) p_w^2, and the optimal strategy at 0-0 is bold.

Optimal strategy: if ahead, play timid; otherwise play bold.
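As a sanity check, the recursion above can be evaluated numerically. The short Python sketch below reproduces the chess-match calculation; the probabilities p_w = 0.45 and p_d = 0.9 are the example values used earlier, everything else is illustrative scaffolding.

```python
# Chess match DP: J_k(x) = probability of winning the match when the score
# difference after game k is x; maximize over {timid, bold} at each stage.

def chess_match(p_w, p_d, N=2):
    # terminal condition J_N(x): win if ahead, tie-break with bold play if level
    J = {x: (1.0 if x > 0 else p_w if x == 0 else 0.0) for x in range(-N, N + 1)}
    policies = []
    for k in range(N - 1, -1, -1):
        Jk, muk = {}, {}
        for x in range(-k, k + 1):          # reachable score differences after k games
            timid = p_d * J[x] + (1 - p_d) * J[x - 1]
            bold = p_w * J[x + 1] + (1 - p_w) * J[x - 1]
            Jk[x] = max(timid, bold)
            muk[x] = "timid" if timid > bold else "bold"
        J = Jk
        policies.insert(0, muk)
    return J[0], policies

val, policies = chess_match(0.45, 0.9)
print(val)          # ~0.5366, matching p_d p_w + (1-p_d) p_w^2 + (1-p_w) p_w^2
print(policies[1])  # before game 2: timid at +1, bold at 0 and -1
```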
2.4 Converting non-standard problems to the Basic Problem
2.4.1 Time Lags
Assume the update equation is of the following form:

  x_{k+1} = f_k(x_k, x_{k-1}, u_k, u_{k-1}, w_k)

Define y_k = x_{k-1} and s_k = u_{k-1}. Then

  ( x_{k+1}, y_{k+1}, s_{k+1} ) = ( f_k(x_k, y_k, s_k, w_k), x_k, u_k ) =: \tilde f_k

Let \tilde x_k = (x_k, y_k, s_k), so that \tilde x_{k+1} = \tilde f_k(\tilde x_k, u_k, w_k).
The control is u_k = µ_k(x_k, u_{k-1}, x_{k-1}).
This can be generalized to more than one time lag.
2.4.2 Correlated Disturbances
If the disturbances are not independent, but can be modeled as the output of a system driven by independent disturbances → colored noise.

Example 5:

  w_k = C_k y_{k+1}
  y_{k+1} = A_k y_k + ξ_k
A_k, C_k are given, and {ξ_k} is an independent sequence. As usual, x_{k+1} = f_k(x_k, u_k, w_k). Then

  ( x_{k+1}, y_{k+1} ) = ( f_k( x_k, u_k, C_k ( A_k y_k + ξ_k ) ),  A_k y_k + ξ_k )   and   u_k = µ_k(x_k, y_k),

which is now in the standard form. In general, y_k cannot be measured and must be estimated.
2.4.3 Forecasts
When the state information includes knowledge of probability distributions: at the beginning of each period k, we receive information about the probability distribution of w_{k+1}. In particular, assume w_{k+1} could have one of the probability distributions {Q_1, Q_2, ..., Q_m}, with a priori probabilities p_1, ..., p_m. At time k, we receive the forecast i that Q_i is used to generate w_{k+1}. Model this as follows: y_{k+1} = ξ_k, where ξ_k is a random variable taking value i with probability p_i. In particular, w_k has probability distribution Q_{y_k}.

Then

  ( x_{k+1}, y_{k+1} ) = ( f_k(x_k, u_k, w_k), ξ_k ).   New state \tilde x_k = (x_k, y_k).

Since y_k is known at time k, we have a Basic Problem formulation. The new disturbance \tilde w_k = (w_k, ξ_k) depends on the current state, which is allowed. The DPA takes the following form:

  J_N(x_N, y_N) = g_N(x_N)

  J_k(x_k, y_k) = \min_{u_k} E_{w_k, ξ_k} [ g_k(x_k, u_k, w_k) + J_{k+1}( f_k(x_k, u_k, w_k), ξ_k ) | y_k ]
               = \min_{u_k} E_{w_k} [ g_k(x_k, u_k, w_k) + \sum_{i=1}^{m} p_i J_{k+1}( f_k(x_k, u_k, w_k), i ) | y_k ]

where the conditional expectation simply means that w_k has probability distribution Q_{y_k}: y_k ∈ {1, ..., m}, and the expectation over w_k is taken with respect to the distribution Q_{y_k}.
2.5 Deterministic, Finite State Systems
Recall the Basic Problem:

  x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, ..., N-1,   with g_k(x_k, u_k, w_k) the cost at stage k.

Consider problems where
1. x_k ∈ S_k, with S_k a finite set,
2. there are no disturbances w_k.

We assume, without loss of generality, that there is only one way to go from state i ∈ S_k to j ∈ S_{k+1} (if there is more than one way, pick the one with the lowest cost at stage k).
2.5.1 Convert DP to Shortest Path Problem
[Figure 2.2: General Shortest Path Problem: stages 1 through N, from start node S to terminal node T.]
a^k_{ij} = cost to go from state i ∈ S_k to state j ∈ S_{k+1} at time k. This is equal to ∞ if there is no way to go from i ∈ S_k to j ∈ S_{k+1}.
a^N_{iT} = terminal cost of state i ∈ S_N.

In other words,

  a^k_{ij} = g_k(i, u^{ij}_k),  where  j = f_k(i, u^{ij}_k),   and   a^N_{iT} = g_N(i).
2.5.2 DP algorithm
  J_N(i) = a^N_{iT},   i ∈ S_N
  J_k(i) = \min_{j ∈ S_{k+1}} [ a^k_{ij} + J_{k+1}(j) ],   i ∈ S_k,  k = 0, ..., N-1

This solves the shortest path problem.
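A minimal Python sketch of this backward recursion on a staged graph follows; the three-stage graph, its arc costs and the node names are made up for illustration and are not the example from the notes.

```python
import math

# stages[k] lists the states in S_k; a[k][(i, j)] is the arc cost a^k_{ij}
# (missing arcs count as infinity); aT[i] is the terminal cost a^N_{iT}.

def shortest_path_dp(stages, a, aT):
    N = len(stages) - 1
    J = {i: aT[i] for i in stages[N]}                 # J_N(i) = a^N_{iT}
    for k in range(N - 1, -1, -1):                    # backward recursion
        J = {i: min(a[k].get((i, j), math.inf) + J[j] for j in stages[k + 1])
             for i in stages[k]}
    return J                                          # J_0(i): shortest cost from i to T

stages = [["S"], ["A", "C"], ["AB", "AC"]]
a = [{("S", "A"): 5, ("S", "C"): 3},
     {("A", "AB"): 2, ("A", "AC"): 3, ("C", "AC"): 4}]
aT = {"AB": 1, "AC": 2}
print(shortest_path_dp(stages, a, aT))   # {'S': 8}
```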
2.5.3 Forward DP algorithm
By inspection, the problem is symmetric: the shortest path from S to T is the same as from T to S, motivating the following algorithm, where \tilde J_k(j) is the optimal cost to arrive at state j:

  \tilde J_N(j) = a^0_{Sj},   j ∈ S_1
  \tilde J_k(j) = \min_{i ∈ S_{N-k}} [ a^{N-k}_{ij} + \tilde J_{k+1}(i) ],   j ∈ S_{N-k+1},  k = 1, ..., N-1
  \tilde J_0(T) = \min_{i ∈ S_N} [ a^N_{iT} + \tilde J_1(i) ]

and \tilde J_0(T) = J_0(S).
2.6 Converting Shortest Path to DP
[Figure 2.3: Another path problem, from a start node to an end node, in which cycles are allowed.]
As an example for a mental picture, one could imagine cities on a map.
• Let {1, 2, ..., N, T} be the nodes of the graph, and a_{ij} the cost to move from i to j, with a_{ij} = ∞ if there is no edge. Here i and j denote nodes, as opposed to the previous section where they denoted states.
• Assume that all cycles have non-negative cost. This isn't an issue if all edges have cost ≥ 0.
• Note that with the above assumption, an optimal path has at most N moves (it visits each node at most once).

Set up a problem where we require exactly N moves; degenerate moves are allowed (a_{ii} = 0).

  J_k(i) = optimal cost of getting from i to T in N-k moves
  J_N(i) = a_{iT}  (can be infinite, of course)
  J_k(i) = \min_j [ a_{ij} + J_{k+1}(j) ]

(the optimal (N-k)-move cost is a_{ij} plus the optimal (N-k-1)-move cost from j).

• Notice that degenerate moves are allowed (remove them in the end).
• Terminate the procedure if J_k(i) = J_{k+1}(i) ∀ i.
2.7 Viterbi Algorithm
This is a powerful combination of DP and Bayes' rule for optimal estimation.

• Given a Markov chain with state transition probabilities

  p_{ij} = P( x_{k+1} = j | x_k = i ),   1 ≤ i, j ≤ M

• p(x_0) = initial probability of the starting state.
• We can only indirectly observe the state via measurements:

  r(z; i, j) = P( meas = z | x_k = i, x_{k+1} = j )   ∀ k,

  where r is the likelihood function.
Objective: Given the measurements Z_N = {z_1, ..., z_N}, construct \hat X_N = {\hat x_0, ..., \hat x_N} that maximizes P_R(X_N | Z_N) over all X_N = {x_0, ..., x_N}: the most likely state sequence.

• Recall P_R(X_N, Z_N) = P_R(X_N | Z_N) P_R(Z_N). For a given Z_N, maximizing P_R(X_N, Z_N) over X_N gives the same result as maximizing P_R(X_N | Z_N) over X_N.

  P_R(X_N, Z_N) = P_R(x_0, ..., x_N, z_1, ..., z_N)
  = P_R(x_1, ..., x_N, z_1, ..., z_N | x_0) P_R(x_0)
  = P_R(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1) P_R(x_1, z_1 | x_0) P_R(x_0)
  = P_R(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1) P_R(z_1 | x_0, x_1) P_R(x_1 | x_0) P_R(x_0)
  = P_R(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1) r(z_1; x_0, x_1) p_{x_0, x_1} P_R(x_0)

One more step:

  P_R(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1)
  = P_R(x_3, ..., x_N, z_3, ..., z_N | x_0, x_1, z_1, z_2, x_2) P_R(x_2, z_2 | x_0, x_1, z_1)
  = P_R(x_3, ..., x_N, z_3, ..., z_N | x_0, x_1, z_1, z_2, x_2) P_R(z_2 | x_0, x_1, z_1, x_2) P_R(x_2 | x_0, x_1, z_1)
  = P_R(x_3, ..., x_N, z_3, ..., z_N | x_0, x_1, z_1, z_2, x_2) r(z_2; x_1, x_2) p_{x_1, x_2}

Keep going, and one gets:

  P_R(X_N, Z_N) = P_R(x_0) \prod_{k=1}^{N} p_{x_{k-1}, x_k} r(z_k; x_{k-1}, x_k)

Assume that all quantities are > 0 (if some are = 0, the algorithm can be modified). Since the above is strictly positive (by assumption) and the log function is monotonically increasing in its argument, maximizing P_R(X_N, Z_N) is equivalent to

  \min_{X_N} [ -\log P_R(x_0) + \sum_{k=1}^{N} -\log( p_{x_{k-1}, x_k} r(z_k; x_{k-1}, x_k) ) ]

This is a deterministic, finite state shortest path problem, solved with forward DP: at time k, we can calculate the cost to arrive at any state; we don't have to wait until the end to solve the problem.
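The Python sketch below implements the Viterbi recursion in the negative-log form just derived, as a forward DP over the cost to arrive. The two-state Markov chain and the measurement likelihood are illustrative assumptions, not data from the notes.

```python
import math

# p0[i] = P_R(x_0 = i), p[i][j] = p_ij, r(z, i, j) = likelihood of measurement z
# on the transition i -> j; all assumed strictly positive.

def viterbi(p0, p, r, Z):
    n = len(p0)
    D = [-math.log(p0[i]) for i in range(n)]   # cost-to-arrive at each state
    parent = []
    for z in Z:
        newD, back = [], []
        for j in range(n):
            costs = [D[i] - math.log(p[i][j] * r(z, i, j)) for i in range(n)]
            best = min(range(n), key=lambda i: costs[i])
            newD.append(costs[best])
            back.append(best)
        D = newD
        parent.append(back)
    # backtrack the most likely state sequence x_0, ..., x_N
    x = [min(range(n), key=lambda i: D[i])]
    for back in reversed(parent):
        x.insert(0, back[x[0]])
    return x

p0 = [0.6, 0.4]
p = [[0.7, 0.3], [0.4, 0.6]]
r = lambda z, i, j: 0.9 if z == j else 0.1     # noisy observation of the new state
print(viterbi(p0, p, r, [0, 0, 1, 1, 1]))
```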
2.8 Shortest Path Algorithms
Look at alternatives to DP for problems that are finite and deterministic.
Path ≡ Length ≡ Cost
2.8.1 Label Correcting Methods
Assume a_{ij} ≥ 0; the arc length a_{ij} is the cost to go from node i to node j.

[Figure 2.4: Diagram of the label correcting algorithm: the OPEN bin, the REMOVE step, and the two tests d_i + a_{ij} < d_j and d_i + a_{ij} < d_T.]

Let d_i be the length of the shortest path to i found so far.

Step 0: Place node S in the OPEN bin, set d_S = 0 and d_j = ∞ ∀ j.
Step 1: Remove a node i from OPEN, and execute Step 2 for all children j of i.
Step 2: If d_i + a_{ij} < min(d_j, d_T), set d_j = d_i + a_{ij} and set i to be the parent of j. If j ≠ T, place j in OPEN if it is not already there.
Step 3: If OPEN is empty, done. If not, go back to Step 1.
Example 6: Deterministic Scheduling Problem (revisited)
[Figure 2.5: Deterministic Scheduling Problem: graph from the initial condition through the admissible orderings of A, B, C, D (A before B, C before D) to the terminal node T, with arc costs.]
  Iteration #   Remove   OPEN                      d_T   OPTIMAL
  0             -        S(0)                      ∞     -
  1             S        A(5), C(3)                ∞     -
  2             C        A(5), CA(7), CD(9)        ∞     -
  3             CD       A(5), CA(7), CDA(12)      ∞     -
  4             CDA      A(5), CA(7)               14    CDAB
  5             CA       A(5), CAB(9), CAD(11)     14    CDAB
  6             CAD      A(5), CAB(9)              14    CDAB
  7             CAB      A(5)                      10    CABD
  8             A        AB(7), AC(8)              10    CABD
  9             AC       AB(7)                     10    CABD
  10            AB       -                         10    CABD

Done; the optimal cost is 10 and the optimal path is CABD.
Different ways to remove items from OPEN give different, well known, algorithms.

Depth-First Search: last in, first out. This is what we did in the example. It finds a feasible path quickly, and is also good if you have limited memory.
Best-First Search: remove the best label. This is Dijkstra's method. The remove step is more expensive, but it can give good performance.

Breadth-First Search: first in, first out. Bellman-Ford.
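A minimal Python sketch of Steps 0-3 with last-in, first-out (depth-first) removal follows. The small graph is made up for illustration and is not the one from Example 6.

```python
import math

# children[i] = {j: a_ij} with a_ij >= 0; 'S' and 'T' are the start and terminal nodes.

def label_correcting(children, S="S", T="T"):
    d = {S: 0.0}                        # d_i: best path length to i found so far
    d_T = math.inf
    parent = {}
    OPEN = [S]                          # Step 0
    while OPEN:                         # Step 3
        i = OPEN.pop()                  # Step 1 (LIFO -> depth-first)
        for j, a_ij in children.get(i, {}).items():
            if d[i] + a_ij < min(d.get(j, math.inf), d_T):   # Step 2
                d[j] = d[i] + a_ij
                parent[j] = i
                if j == T:
                    d_T = d[j]
                elif j not in OPEN:
                    OPEN.append(j)
    # reconstruct the optimal path from the parent pointers
    path, node = [T], T
    while node != S:
        node = parent[node]
        path.insert(0, node)
    return d_T, path

children = {"S": {"A": 5, "C": 3}, "A": {"B": 2}, "C": {"A": 4, "D": 6},
            "B": {"T": 1}, "D": {"T": 2}}
print(label_correcting(children))       # (8.0, ['S', 'A', 'B', 'T'])
```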
2.8.2 A* Algorithm

This is the workhorse of many AI applications and of path planning.

Basic idea: replace the test d_i + a_{ij} < d_T by d_i + a_{ij} + h_j < d_T, where h_j is a lower bound on the shortest distance from j to T. Indeed, if d_i + a_{ij} + h_j ≥ d_T, it is clear that a path going through j will not be optimal.
2.9 Multi-Objective Problems
Example 7: Motivation: care about time and fuel.
[Figure 2.6: Possibilities in the time-fuel plane: inferior and non-inferior points.]
A vector x = (x_1, x_2, ..., x_M) ∈ S is non-inferior if there is no other y ∈ S such that y_l ≤ x_l, l = 1, ..., M, with strict inequality for at least one of these l.

Given a problem with M cost functions f_1(x), ..., f_M(x), x ∈ X is a non-inferior solution if the vector ( f_1(x), ..., f_M(x) ) is a non-inferior vector of the set { ( f_1(y), ..., f_M(y) ) | y ∈ X }.

Reasonable goal: find all non-inferior solutions, then use another criterion to pick which one you actually want to use.
How this applies to deterministic, finite state DP (which is equivalent to a shortest path problem):

  x_{k+1} = f_k(x_k, u_k)   (dynamics)
  g^l_N(x_N) + \sum_{k=0}^{N-1} g^l_k(x_k, u_k),   l = 1, ..., M   (M cost functions)
2.9.1 Extended Principle of Optimality
If {u_k, ..., u_{N-1}} is a non-inferior control sequence for the tail subproblem that starts at x_k, then {u_{k+1}, ..., u_{N-1}} is also non-inferior for the tail subproblem that starts at f_k(x_k, u_k). Simple proof: by contradiction.

Algorithm. First define what we will do the recursion over:

· F_k(x_k): the set of M-tuples (vectors of size M) of cost-to-go at x_k which are non-inferior.
· F_N(x_N) = { ( g^1_N(x_N), ..., g^M_N(x_N) ) }. Only one element in the set for each x_N.
· Given F_{k+1}(x_{k+1}) ∀ x_{k+1}, generate for each state x_k the set of vectors ( g^1_k(x_k, u_k) + c_1, ..., g^M_k(x_k, u_k) + c_M ) such that (c_1, ..., c_M) ∈ F_{k+1}( f_k(x_k, u_k) ). These are all possible costs that are consistent with F_{k+1}(x_{k+1}). Then, to obtain F_k(x_k), simply extract all non-inferior elements (see the sketch below).

[Figure 2.7: Possible sets for F_{k+1}.]

When we calculate F_0(x_0), we will have all non-inferior solutions.
2.10 Infinite Horizon Problems
Consider the time (or iteration) invariant case:

  x_{k+1} = f(x_k, u_k, w_k),   x_k ∈ S,  u_k ∈ U,  w_k ∼ P(· | x_k, u_k)

  J_π(x_0) = E [ \sum_{k=0}^{N-1} g(x_k, µ_k(x_k), w_k) ],   no terminal cost.

Write down the DP algorithm:

  J_N(x_N) = 0
  J_k(x_k) = \min_{u_k ∈ U} E_{w_k} [ g(x_k, u_k, w_k) + J_{k+1}( f(x_k, u_k, w_k) ) ]   ∀ k

Question: What happens as N → ∞? Does the problem become easier? Yes. Reason: we lose the notion of time. For a very large class of problems, we have the Bellman Equation:

  J*(x) = \min_{u} E_{w} [ g(x, u, w) + J*( f(x, u, w) ) ]   ∀ x ∈ S

The Bellman Equation involves solving for the optimal cost-to-go function J*(x) ∀ x ∈ S; u = µ(x) gives the optimal policy (µ(·) is obtained from the solution of the Bellman Equation: for every x there is a minimizing u). We will look at:

• efficient methods for solving the Bellman Equation,
• technical conditions on when this can be done.
2.11 Stochastic, Shortest Path Problems
  x_{k+1} = w_k,   x_k ∈ S, a finite set
  P_R( w_k = j | x_k = i, u_k = u ) = p_{ij}(u),   u_k ∈ U(x_k), a finite set
We have a finite number of states. The transition from one state to the next is dictated by p_{ij}(u): the probability that the next state is j given that the current state is i. u is the control input; we can control what these transition probabilities are, with a finite set of options u ∈ U(i). The problem data is time (or iteration) independent.

Cost: given the initial state i and a policy π = {µ_0, µ_1, ...},

  J_π(i) = \lim_{N→∞} E [ \sum_{k=0}^{N-1} g(x_k, µ_k(x_k)) | x_0 = i ].

• Optimal cost from state i: J*(i)
• Stationary policy π = {µ, µ, ...}. Denote J_µ(i) as the resulting cost. {µ, µ, ...} is simply referred to as µ. µ is optimal if

  J_µ(i) = J*(i) = \min_{π} J_π(i)

Assumptions:
• Existence of a cost-free termination state t:

  p_{tt}(u) = 1 ∀ u,   g(t, u) = 0 ∀ u.

  This is a sufficient condition to make the cost meaningful. Think of this as a destination state.
• ∃ an integer m such that for all admissible policies

  ρ_π = \max_{i=1,...,n} P_R( x_m ≠ t | x_0 = i, π ) < 1,

  i.e. under any policy there is a positive probability of reaching t within m steps from any state. This is a strong assumption, which is only required for the proofs.
2.11.1 Main Result
A) Given any initial conditions J_0(1), ..., J_0(n), the sequence

  J_{k+1}(i) = \min_{u ∈ U(i)} [ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) ]   ∀ i

converges to the optimal cost J*(i) for each i.
B) The optimal cost satisfies Bellman's Equation:

  J*(i) = \min_{u ∈ U(i)} [ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J*(j) ]   ∀ i

which has a unique solution.
2.11.2 Sketch of Proof
A0) First prove that the cost is bounded.
Recall that ∃ m such that ∀ policies π,

  ρ_π := \max_i P_R( x_m ≠ t | x_0 = i, π ) < 1

Since all problem data is finite, ρ := \max_π ρ_π < 1.

  ∴ P_R( x_{2m} ≠ t | x_0 = i, π ) = P_R( x_{2m} ≠ t | x_m ≠ t, x_0 = i, π ) · P_R( x_m ≠ t | x_0 = i, π ) ≤ ρ^2

Generally, P_R( x_{km} ≠ t | x_0 = i, π ) ≤ ρ^k.
Furthermore, the cost incurred between the periods km and (k+1)m - 1 is at most

  ρ^k · m · \max_{i,u} |g(i, u)| =: ρ^k M,   with M := m · \max_{i,u} |g(i, u)|

  ∴ J_π(i) ≤ \sum_{k=0}^{∞} ρ^k M = M / (1 - ρ),   which is finite.

A1)

  J_π(x_0) = \lim_{N→∞} E [ \sum_{k=0}^{N-1} g(x_k, µ_k(x_k)) ]
           = E [ \sum_{k=0}^{mK-1} g(x_k, µ_k(x_k)) ] + \lim_{N→∞} E [ \sum_{k=mK}^{N-1} g(x_k, µ_k(x_k)) ]

By the previous bound, we know that

  | \lim_{N→∞} E [ \sum_{k=mK}^{N-1} g(x_k, µ_k(x_k)) ] | ≤ ρ^K M / (1 - ρ)

As expected, we can make the tail as small as we want.
A2) Recall that we can view J_0 as a terminal cost function, with J_0(i) given. Bound its expected value:

  | E( J_0(x_{mK}) ) | = | \sum_{i=1}^{n} P_R( x_{mK} = i | x_0, π ) J_0(i) |
                       ≤ [ \sum_{i=1}^{n} P_R( x_{mK} = i | x_0, π ) ] \max_i |J_0(i)|
                       ≤ ρ^K \max_i |J_0(i)|

A3) "Sandwich":

  E( J_0(x_{mK}) ) + E [ \sum_{k=0}^{mK-1} g(x_k, µ_k(x_k)) ]
  = E( J_0(x_{mK}) ) + J_π(x_0) - \lim_{N→∞} E [ \sum_{k=mK}^{N-1} g(x_k, µ_k(x_k)) ]

Recall that if a = b + c, then

  a ≤ b + |c| ≤ b + \bar c   and   a ≥ b - |c| ≥ b - \bar c,   for any \bar c ≥ |c|.

It follows that

  - ρ^K \max_i |J_0(i)| - ρ^K M / (1 - ρ) + J_π(x_0)
  ≤ E [ J_0(x_{mK}) + \sum_{k=0}^{mK-1} g(x_k, µ_k(x_k)) ]
  ≤ ρ^K \max_i |J_0(i)| + ρ^K M / (1 - ρ) + J_π(x_0)

A4) We take the minimum over all policies; the middle term is exactly our DP recursion of part A after mK steps. Taking limits, we get

  \lim_{K→∞} J_{mK}(x_0) = J*(x_0)

Now we are almost done. But since

  | J_{mK+1}(x_0) - J_{mK}(x_0) | ≤ ρ^K M

we have

  \lim_{k→∞} J_k(x_0) = J*(x_0)
Summary
A1: bound the tail over all policies
A2: bound contribution from initial condition J
0
(i), over all policies
A3: “sandwich” type of bounds, middle term is DP recursion
A4: optimized over all policies, took limit.
2.11.3 Proof of B

We prove that the optimal cost satisfies Bellman's equation. In Part A we showed that J_k(·) → J*(·), where

  J_{k+1}(i) = \min_{u ∈ U(i)} [ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) ];

just take limits on both sides. To prove uniqueness, use a solution of the Bellman equation as the initial condition of the DP iteration.
2.12 Summary of previous lecture
Dynamics

  x_{k+1} = w_k,   P_R{ w_k = j | x_k = i, u_k = u } = p_{ij}(u),   u ∈ U(i), a finite set
  x_k ∈ S finite,  S = {1, 2, ..., n, t}
  p_{tt}(u) = 1 ∀ u ∈ U(t)

Cost: given π = {µ_0, µ_1, ...},

  J_π(i) = \lim_{N→∞} E [ \sum_{k=0}^{N-1} g(x_k, µ_k(x_k)) | x_0 = i ]   ∀ i ∈ S
  g(t, u) = 0 ∀ u ∈ U(t)
  J*(i) = \min_π J_π(i)   (optimal cost)

Note: J_π(t) = 0 ∀ π ⇒ J*(t) = 0.
Result

A) Given any initial conditions J_0(1), ..., J_0(n), the sequence

  J_{k+1}(i) = \min_{u ∈ U(i)} [ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) ],   i ∈ S \ {t} = {1, ..., n}

converges to J*(i).

Note: there is a slight shortcut here; one can also include the terminal state t, provided we pick J_0(t) = 0. This does not change the equations.

B)

  J*(i) = \min_{u ∈ U(i)} [ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J*(j) ],   i ∈ S \ {t}

Bellman's Equation. It also gives the optimal policy, which is in fact stationary.
2.13 How do we solve Bellman’s Equation?
2.13.1 Method 1: Value iteration (VI)
Use the DP recursion of result A:

  J_{k+1}(i) = \min_{u ∈ U(i)} [ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) ],   i ∈ S \ {t}

until it converges. J_0(i) can be set to a guess; if the guess is good, it will speed up convergence. How do we know when we are close to converging? Exploit the problem structure to get bounds; see [1]. A minimal sketch in code follows.
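The sketch below runs value iteration for a stochastic shortest path problem. The data layout (termination state t stored at index 0 with J(t) = 0) and the small two-state example are illustrative assumptions.

```python
# U[i]: allowed controls at state i (i = 1..n); g[(i, u)]: stage cost;
# P[(i, u)]: transition probabilities to states 0..n (index 0 is t).

def value_iteration(n, U, g, P, tol=1e-10, max_iter=10_000):
    J = [0.0] * (n + 1)                      # J[0] is the termination state t
    for _ in range(max_iter):
        J_new = [0.0] * (n + 1)
        for i in range(1, n + 1):
            J_new[i] = min(g[(i, u)] + sum(P[(i, u)][j] * J[j] for j in range(n + 1))
                           for u in U[i])
        if max(abs(J_new[i] - J[i]) for i in range(n + 1)) < tol:
            return J_new
        J = J_new
    return J

# two states; control 'a' is cheap but slow to terminate, 'b' costs more
U = {1: ["a", "b"], 2: ["a"]}
g = {(1, "a"): 1.0, (1, "b"): 2.0, (2, "a"): 1.0}
P = {(1, "a"): [0.1, 0.5, 0.4], (1, "b"): [0.8, 0.2, 0.0], (2, "a"): [0.5, 0.0, 0.5]}
print(value_iteration(2, U, g, P))
```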
2.13.2 Method 2: Policy Iteration (PI)
Iterate over policies instead of values. We need the following result:

C) For any stationary policy µ, the costs J_µ(i) are the unique solutions of

  J(i) = g(i, µ(i)) + \sum_{j=1}^{n} p_{ij}(µ(i)) J(j),   i ∈ S \ {t}
Furthermore, given any initial conditions J_0(i), the sequence

  J_{k+1}(i) = g(i, µ(i)) + \sum_{j=1}^{n} p_{ij}(µ(i)) J_k(j)

converges to J_µ(i) for each i.

Proof: trivial. Consider the problem where the only allowable control at state i is µ(i), and apply parts A and B. Special case of the general theorem.

Algorithm for PI (from now on, i ∈ S \ {t} = {1, 2, ..., n}):

Stage 1: Given µ^k (the stationary policy at iteration k, not the policy at time k), solve for J_{µ^k}(i) by solving

  J(i) = g(i, µ^k(i)) + \sum_{j=1}^{n} p_{ij}(µ^k(i)) J(j)   ∀ i

n equations, n unknowns (Result C).

Stage 2: Improve the policy:

  µ^{k+1}(i) = \arg\min_{u ∈ U(i)} [ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_{µ^k}(j) ]   ∀ i

Iterate; quit when J_{µ^{k+1}}(i) = J_{µ^k}(i) ∀ i. A minimal sketch of the two stages in code is given below.
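The following Python sketch implements the two stages, using the same (assumed) data layout as the value-iteration sketch above; Stage 1 solves the n linear equations with numpy.

```python
import numpy as np

def policy_iteration(n, U, g, P):
    mu = {i: U[i][0] for i in range(1, n + 1)}        # arbitrary initial policy
    while True:
        # Stage 1: solve J(i) = g(i, mu(i)) + sum_j p_ij(mu(i)) J(j), n equations
        A = np.eye(n)
        b = np.zeros(n)
        for i in range(1, n + 1):
            b[i - 1] = g[(i, mu[i])]
            for j in range(1, n + 1):                 # J(t) = 0, so j = 0 is skipped
                A[i - 1, j - 1] -= P[(i, mu[i])][j]
        J = np.linalg.solve(A, b)

        # Stage 2: improve the policy
        def q(i, u):
            return g[(i, u)] + sum(P[(i, u)][j] * J[j - 1] for j in range(1, n + 1))
        mu_new = {i: min(U[i], key=lambda u: q(i, u)) for i in range(1, n + 1)}

        if mu_new == mu:                              # converged: costs no longer improve
            return J, mu
        mu = mu_new
```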
Theorem: the above terminates after a finite number of steps, and converges to the optimal policy.

Proof (two steps):
1) We will first show that J_{µ^k}(i) ≥ J_{µ^{k+1}}(i) ∀ i, k.
2) We will show that what we converge to satisfies Bellman's Equation.

1) For fixed k, consider the following recursion in N:

  J_{N+1}(i) = g(i, µ^{k+1}(i)) + \sum_{j=1}^{n} p_{ij}(µ^{k+1}(i)) J_N(j)   ∀ i,   with J_0(i) = J_{µ^k}(i).
By result C), J_N → J_{µ^{k+1}} as N → ∞. Now,

  J_0(i) = g(i, µ^k(i)) + \sum_j p_{ij}(µ^k(i)) J_0(j)
         ≥ g(i, µ^{k+1}(i)) + \sum_j p_{ij}(µ^{k+1}(i)) J_0(j) = J_1(i)

because µ^{k+1}(i) minimizes the right hand side in Stage 2. Then

  J_1(i) ≥ g(i, µ^{k+1}(i)) + \sum_j p_{ij}(µ^{k+1}(i)) J_1(j) = J_2(i)

since J_1(i) ≤ J_0(i). Keep going, and get

  J_0(i) ≥ J_1(i) ≥ ... ≥ J_N(i) ≥ ...

Take the limit:

  J_{µ^k}(i) ≥ J_{µ^{k+1}}(i)   ∀ i

Since the number of stationary policies is finite, we will eventually have J_{µ^k}(i) = J_{µ^{k+1}}(i) ∀ i for some finite k.

2) It follows from Stage 2 that, once converged,

  J_{µ^{k+1}}(i) = J_{µ^k}(i) = \min_{u ∈ U(i)} [ g(i, u) + \sum_j p_{ij}(u) J_{µ^k}(j) ]

but this is Bellman's Equation! ∴ we have converged to the optimal policy.

Discussion

Complexity:
Stage 1: a linear system of equations of size n, complexity ≈ O(n^3)
Stage 2: n minimizations over p choices (p different values of u that can be used), complexity ≈ O(p n^2)
Put together: O(n^2 (n + p)).
Worst case number of iterations: a search over all policies, p^n. But in practice it converges very quickly.

Why does Policy Iteration converge so quickly relative to Value Iteration?
Rewrite Value Iteration in the same two-stage form:

Stage 2: µ^k(i) = \arg\min_{u ∈ U(i)} [ g(i, u) + \sum_j p_{ij}(u) J_k(j) ]
Stage 1: J_{k+1}(i) = g(i, µ^k(i)) + \sum_j p_{ij}(µ^k(i)) J_k(j)

and iterate.
2.13.3 Third Method: Linear Programming
Recall Bellman's Equation:

  J*(i) = \min_{u ∈ U(i)} [ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J*(j) ],   i = 1, ..., n

and Value Iteration:

  J_{k+1}(i) = \min_{u ∈ U(i)} [ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) ],   i = 1, ..., n

We showed that Value Iteration (V.I.) converges to the optimal cost-to-go J* for all initial "guesses" J_0.

Assume we start V.I. with any J_0 that satisfies

  J_0(i) ≤ \min_{u ∈ U(i)} [ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_0(j) ],   i = 1, ..., n

It follows that J_1(i) ≥ J_0(i) for all i.

  ∴ J_1(i) ≤ \min_{u ∈ U(i)} [ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_1(j) ],   i = 1, ..., n
  ∴ J_2(i) ≥ J_1(i) ∀ i

In general: J_{k+1}(i) ≥ J_k(i) ∀ i, ∀ k, and J_k → J*. ∴ J_0(i) ≤ J*(i) ∀ i.
Now let J solve the following problem:

  \max \sum_i J(i)   subject to   J(i) ≤ g(i, u) + \sum_j p_{ij}(u) J(j)   ∀ i, ∀ u ∈ U(i)

It is clear that J(i) ≤ J*(i) ∀ i, by the previous analysis. Since J* satisfies the constraints, it follows that J* achieves the maximum.
This is a Linear Program!
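A minimal sketch of this linear program using scipy.optimize.linprog follows (linprog minimizes, so the objective is negated). The problem data layout is the same assumed one as in the earlier sketches; the two-state example is made up.

```python
import numpy as np
from scipy.optimize import linprog

def solve_bellman_lp(n, U, g, P):
    c = -np.ones(n)                                   # maximize sum_i J(i)
    A_ub, b_ub = [], []
    for i in range(1, n + 1):
        for u in U[i]:
            row = np.zeros(n)
            row[i - 1] = 1.0
            for j in range(1, n + 1):
                row[j - 1] -= P[(i, u)][j]            # J(i) - sum_j p_ij(u) J(j) <= g(i,u)
            A_ub.append(row)
            b_ub.append(g[(i, u)])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)
    return res.x                                      # J*(1), ..., J*(n)

U = {1: ["a", "b"], 2: ["a"]}
g = {(1, "a"): 1.0, (1, "b"): 2.0, (2, "a"): 1.0}
P = {(1, "a"): [0.1, 0.5, 0.4], (1, "b"): [0.8, 0.2, 0.0], (2, "a"): [0.5, 0.0, 0.5]}
print(solve_bellman_lp(2, U, g, P))                   # should match value iteration
```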
2.13.4 Analogies and Connections
Say I want to solve

  J = G + PJ,   J ∈ R^n, G ∈ R^n, P ∈ R^{n×n}.

Direct way to solve:

  (I - P)J = G,   J = (I - P)^{-1} G.

This is exactly what we do in Stage 1 of policy iteration: solve for the cost associated with a specific policy.

Why is (I - P)^{-1} guaranteed to exist?

For a given policy, let \bar P ∈ R^{(n+1)×(n+1)} be the probability matrix that captures our Markov chain:

  \bar P = [ P   P_{(·)t} ]
           [ 0   1        ]

p_{ij}: probability that the next state is j given that the current state is i.

Facts:
• \bar P is a right stochastic matrix: all rows sum to 1, all elements ≥ 0.
• Perron-Frobenius Theorem: the eigenvalues of \bar P have absolute value ≤ 1, and at least one equals 1.
• Assumption on the terminal state: P^N → 0 as N → ∞. We will eventually reach the termination state!

Therefore
- the eigenvalues of P have absolute value < 1,
- (I - P)^{-1} exists.

Furthermore:

  (I - P)^{-1} = I + P + P^2 + ...

Proof:

  (I - P)(I + P + P^2 + ...) = I + (P + P^2 + ...) - (P + P^2 + ...) = I

Therefore one way to solve for J is as follows:

  J_1 = G + P J_0
  J_2 = G + P J_1 = G + P G + P^2 J_0
  ...
  J_N = (I + P + ... + P^{N-1}) G + P^N J_0

  ∴ J_N → (I - P)^{-1} G as N → ∞!
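A small numerical illustration of this identity and of the iteration J_N → (I - P)^{-1} G, with a made-up substochastic matrix P (its rows sum to less than 1 because of leakage to the terminal state):

```python
import numpy as np

P = np.array([[0.5, 0.4],
              [0.2, 0.3]])          # spectral radius < 1 (assumed example)
G = np.array([1.0, 2.0])

J_direct = np.linalg.solve(np.eye(2) - P, G)   # J = (I - P)^{-1} G

J = np.array([10.0, -5.0])          # arbitrary initial guess J_0
for _ in range(200):
    J = G + P @ J                   # J_{N+1} = G + P J_N

print(J_direct, J)                  # the two agree to machine precision
```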
Analogy
Value Iteration: one step of the update.
Policy Iteration: an infinite number of updates, or solving the system of equations exactly.

• What is truly amazing is that various combinations of policy iteration and value iteration all converge to the solution of Bellman's Equation.
• Recall value iteration

  J_{k+1}(i) = \min_{u ∈ U(i)} [ g(i, u) + \sum_j p_{ij}(u) J_k(j) ],   i = 1, ..., n

In practice, you would implement it as follows:

  \bar J(i) ← \min_{u ∈ U(i)} [ g(i, u) + \sum_j p_{ij}(u) J(j) ],   i = 1, ..., n
  J(i) ← \bar J(i),   i = 1, ..., n
You don't have to do this! You can also do

  J(i) ← \min_{u ∈ U(i)} [ g(i, u) + \sum_j p_{ij}(u) J(j) ],   i = 1, ..., n

updating J(i) in place. This is a Gauss-Seidel update: a generic technique for solving iterative equations.

• It gets even better: Asynchronous Policy Iteration.
  – Any number of value updates in between policy updates
  – Any number of states updated at each value update
  – Any number of states updated at each policy update.
Under some mild assumptions, all of these converge to J*.
2.14 Discounted Problems
  J_π(i) = \lim_{N→∞} E [ \sum_{k=0}^{N-1} α^k g(x_k, µ_k(x_k)) | x_0 = i ],   α < 1,  i ∈ {1, ..., n}

No explicit termination state is required. No assumption on the transition probabilities is required.

Bellman's Equation for this problem:

  J*(i) = \min_{u ∈ U(i)} [ g(i, u) + α \sum_{j=1}^{n} p_{ij}(u) J*(j) ]   ∀ i

How do we show this?

• Define an associated problem with states {1, 2, ..., n, t}. From state i ≠ t, when u is applied we incur cost g(i, u); the next state is j with probability α p_{ij}(u), and t with probability 1 - α (since \sum_j p_{ij}(u) = 1).
• Clearly, since α < 1, we have a non-zero probability of making it to state t, so our assumption on reaching the termination state is satisfied.

Suppose we use the same policy in the discounted problem as in the associated problem. Note that

  P_R( x_{k+1} = j | x_k = i, x_{k+1} ≠ t, u ) = α p_{ij}(u) / ( α p_{i1} + α p_{i2} + ... + α p_{in} ) = α p_{ij}(u) / α = p_{ij}(u)
So as long as the termination state has not yet been reached, the state evolution is governed by the same probabilities. The expected cost of the k-th stage of the associated problem is g(x_k, µ_k(x_k)) times the probability that t has not been reached, α^k; therefore we have α^k g(x_k, µ_k(x_k)), the discounted stage cost.
Connections: for a given policy, we have

  \bar P = [ α P   (1 - α) \mathbf{1} ]
           [ 0     1                 ]

where \mathbf{1} is the column vector of all ones; clearly (α P)^N = α^N P^N → 0.
Chapter 3
Continuous Time Optimal
Control
Consider the following system:

  \dot x(t) = f( x(t), u(t) ),   0 ≤ t ≤ T,   x(0) = x_0,   no noise!

• State x(t) ∈ R^n
• Time t ∈ R, T is the terminal time
• Control u(t) ∈ U ⊂ R^m, with U the control constraint set

Assume:
• f is continuously differentiable with respect to x (a less stringent requirement: Lipschitz)
• f is continuous with respect to u
• u(t) is piecewise continuous

See Appendix A in [1] for details.
Assume: existence and uniqueness of solutions. The next two examples show that this is a genuine assumption.

Example 8:

  \dot x(t) = x(t)^{1/3},   x(0) = 0

Solutions: x(t) = 0 ∀ t and x(t) = (2t/3)^{3/2}. Not unique!
Example 9:

  \dot x(t) = x(t)^2,   x(0) = 1

Solution: x(t) = 1/(1 - t), finite escape time, x(1) = ∞. The solution does not exist on an interval that includes 1, e.g. [0, 2].
Objective: minimize

  h(x(T)) + \int_0^T g( x(t), u(t) ) dt

where g and h are continuously differentiable with respect to x, and g is continuous with respect to u. This is very similar to the discrete time problem (\sum → \int, x_{k+1} → \dot x) except for the technical assumptions.
3.1 The Hamilton Jacobi Bellman (HJB) Equation
This is the continuous time analog of the DP algorithm. We derive it informally by discretizing the problem and taking limits. Not a rigorous derivation, but it captures the main ideas.

• Divide the time horizon into N pieces, define δ = T/N.
• x_k := x(kδ), u_k := u(kδ), k = 0, 1, ..., N.
• Approximate the differential equation by

  ( x_{k+1} - x_k ) / δ = f(x_k, u_k),   i.e.   x_{k+1} = x_k + f(x_k, u_k) δ

• Approximate the cost function:

  h(x_N) + \sum_{k=0}^{N-1} g(x_k, u_k) δ

• Define J*(t, x) = optimal cost-to-go at time t and state x for the continuous problem.
• Define \tilde J*(t, x) = discrete approximation of the optimal cost-to-go.

Apply the DP algorithm:
– terminal condition: \tilde J*(Nδ, x) = h(x)
– recursion:

  \tilde J*(kδ, x) = \min_{u ∈ U} [ g(x, u) δ + \tilde J*( (k+1)δ, x + f(x, u) δ ) ],   k = 0, ..., N-1
• Do a Taylor expansion of \tilde J*, since δ → 0:

  \tilde J*( (k+1)δ, x + f(x,u)δ ) = \tilde J*(kδ, x) + (∂\tilde J*(kδ, x)/∂t) δ + (∂\tilde J*(kδ, x)/∂x)^T f(x, u) δ + o(δ),

  where \lim_{δ→0} o(δ)/δ = 0 ("little-oh notation": o(δ) collects the terms that are quadratic or higher in δ).

• Substitute back into the DP recursion and divide by δ:

  0 = \min_{u ∈ U} [ g(x, u) + ∂\tilde J*(kδ, x)/∂t + (∂\tilde J*(kδ, x)/∂x)^T f(x, u) + o(δ)/δ ]

Now let t = kδ, and let δ → 0. Assuming \tilde J* → J*, we have

  0 = \min_{u ∈ U} [ g(x, u) + ∂J*(t, x)/∂t + (∂J*(t, x)/∂x)^T f(x, u) ]   ∀ x, t        (3.1)
  J*(T, x) = h(x)

HJB equation (3.1):
– a partial differential equation, very difficult to solve
– the u = µ(t, x) that minimizes the R.H.S. of the HJB equation is the optimal policy.
Example 10: Consider the system

  \dot x(t) = u(t),   |u(t)| ≤ 1

The cost is (1/2) x^2(T), only a terminal cost.

Intuitive solution:

  u(t) = µ(t, x) = -sgn(x) = { -1 if x > 0;  0 if x = 0;  1 if x < 0 }

What is the cost-to-go associated with this policy?

  V(t, x) = (1/2) ( max{ 0, |x| - (T - t) } )^2

Verify that this is indeed the cost-to-go associated with the policy outlined above.

[Figure 3.1: V(t, x) as a function of x for fixed t. Figure 3.2: its derivative ∂V/∂x. Figure 3.3: V(t, x) and ∂V/∂t as functions of t for fixed x, for the cases |x| ≤ T and |x| > T.]

  ∂V(t, x)/∂x = sgn(x) · max{ 0, |x| - (T - t) }
  ∂V(t, x)/∂t = max{ 0, |x| - (T - t) }

Does V(t, x) satisfy the HJB equation?

• First check: does it satisfy the boundary condition? V(T, x) = (1/2) x^2 = h(x). Yes.
• Second check:

  \min_{|u| ≤ 1} [ ∂V(t, x)/∂t + (∂V(t, x)/∂x) f(x, u) ] = \min_{|u| ≤ 1} ( 1 + sgn(x) u ) max{ 0, |x| - (T - t) } = 0,

  by choosing u = -sgn(x).

∴ V(t, x) satisfies the HJB equation, ∴ V(t, x) = J*(t, x). Furthermore, u = -sgn(x) is an optimal solution. It is not unique!

Note: verifying that V(t, x) satisfies HJB is not trivial, even for this simple example. Imagine solving for it!

Another issue: the cost (1/2) x^2(T) gives the same optimal policy as the cost |x(T)|, so different costs can give the same optimal policy, but some costs are "nicer" to work with than others.
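As a numerical sanity check of this example (a sketch, not part of the notes), one can verify the HJB equation for V(t, x) by finite differences at a few points away from the kinks of the max; the test points and step sizes are arbitrary.

```python
import numpy as np

T = 1.0
V = lambda t, x: 0.5 * max(0.0, abs(x) - (T - t)) ** 2

def hjb_residual(t, x, h=1e-6):
    # min over |u| <= 1 of dV/dt + dV/dx * u, with derivatives by central differences
    Vt = (V(t + h, x) - V(t - h, x)) / (2 * h)
    Vx = (V(t, x + h) - V(t, x - h)) / (2 * h)
    return min(Vt + Vx * u for u in np.linspace(-1.0, 1.0, 201))

for (t, x) in [(0.2, 1.5), (0.5, -2.0), (0.7, 0.9)]:
    print(t, x, hjb_residual(t, x))     # all residuals are ~0
```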
3.2 Aside on Notation
Let F(t, x) be a continuously differentiable function. Then:

1. ∂F(t, x)/∂t : the partial derivative of F with respect to the first argument.
2. ∂F(t, \bar x(t))/∂t = ∂F(t, x)/∂t |_{x = \bar x(t)} : shorthand notation.
3. dF(t, \bar x(t))/dt = ∂F(t, \bar x(t))/∂t + ∂F(t, \bar x(t))/∂x · \dot{\bar x}(t) : the total derivative.

Example 11: For F(t, x) = t x:

  ∂F(t, x)/∂t = x,   ∂F(t, \bar x(t))/∂t = \bar x(t),   dF(t, \bar x(t))/dt = \bar x(t) + t \dot{\bar x}(t)
Lemma 3.2.1: Let F(t, x, u) be a continuously differentiable function and let U be a convex set. Assume that µ*(t, x) := \arg\min_{u ∈ U} F(t, x, u) is continuously differentiable.
Then:

1)  ∂[ \min_{u ∈ U} F(t, x, u) ] / ∂t = ∂F(t, x, µ*(t, x)) / ∂t   ∀ t, x
2)  ∂[ \min_{u ∈ U} F(t, x, u) ] / ∂x = ∂F(t, x, µ*(t, x)) / ∂x   ∀ t, x
Example 12: Let F(t, x, u) = (1 + t)u^2 + ux + 1, t ≥ 0, with U the real line (no constraint on u). Then

  \min_u F(t, x, u):   2(1 + t)u + x = 0   →   u = -x / (2(1 + t))
  ∴ µ*(t, x) = -x / (2(1 + t)),
  \min_u F(t, x, u) = (1 + t) x^2 / (4(1 + t)^2) - x^2 / (2(1 + t)) + 1 = -x^2 / (4(1 + t)) + 1

1)  ∂[ \min_u F(t, x, u) ] / ∂t = x^2 / (4(1 + t)^2);
    ∂F(t, x, µ*(t, x)) / ∂t = u^2 |_{u = µ*(t, x)} = x^2 / (4(1 + t)^2)   ✓

2)  ∂[ \min_u F(t, x, u) ] / ∂x = -x / (2(1 + t));
    ∂F(t, x, µ*(t, x)) / ∂x = u |_{u = µ*(t, x)} = -x / (2(1 + t))   ✓
Proof 2: Proof of Lemma 3.2.1 when u is unconstrained (U = R^m). Let

  G(t, x) = \min_{u ∈ U} F(t, x, u) = F(t, x, µ*(t, x)).

Then

  ∂G(t, x)/∂t = ∂F(t, x, µ*(t, x))/∂t + ∂F(t, x, µ*(t, x))/∂u · ∂µ*(t, x)/∂t,

where the second term is zero because µ*(t, x) minimizes F over u (so ∂F/∂u = 0 there). The same argument gives the result for ∂G(t, x)/∂x. □
3.2.1 The Minimum Principle
HJB gives us a lot of information: the optimal cost-to-go for all times and all possible states, and it also gives the optimal feedback law u = µ*(t, x). What if we only care about the optimal control trajectory for a specific initial condition x(0) = x_0? Can we exploit the fact that we are asking for much less to simplify the mathematical conditions?

Starting point: HJB

  0 = \min_{u ∈ U} [ g(x, u) + ∂J*(t, x)/∂t + (∂J*(t, x)/∂x)^T f(x, u) ]   ∀ t, x
  J*(T, x) = h(x)   ∀ x

• Let µ*(t, x) be the corresponding optimal strategy (feedback law).
• Let

  F(t, x, u) = g(x, u) + ∂J*(t, x)/∂t + (∂J*(t, x)/∂x)^T f(x, u)

So the HJB equation gives us

  G(t, x) = \min_{u ∈ U} F(t, x, u) = 0

Apply the Lemma:

1)  ∂G(t, x)/∂t = 0 = ∂^2 J*(t, x)/∂t^2 + ( ∂^2 J*(t, x)/∂x∂t )^T f(x, µ*(t, x))   ∀ t, x
2)  ∂G(t, x)/∂x = 0 = ∂g(x, µ*(t, x))/∂x + ∂^2 J*(t, x)/∂x∂t + ∂^2 J*(t, x)/∂x^2 · f(x, µ*(t, x)) + ∂f(x, µ*(t, x))/∂x · ∂J*(t, x)/∂x   ∀ t, x

Consider a specific optimal trajectory:

  u*(t) = µ*(t, x*(t)),   \dot x*(t) = f(x*(t), u*(t)),   x*(0) = x_0

Along this trajectory, the two equations become

1)  0 = d/dt [ ∂J*(t, x*(t))/∂t ]
2)  0 = ∂g(x*(t), u*(t))/∂x + d/dt [ ∂J*(t, x*(t))/∂x ] + ∂f(x*(t), u*(t))/∂x · ∂J*(t, x*(t))/∂x
Define

  p(t) = ∂J*(t, x*(t))/∂x,   p_0(t) = ∂J*(t, x*(t))/∂t

1)  \dot p_0(t) = 0  →  p_0(t) = constant for 0 ≤ t ≤ T
2)  \dot p(t) = - ∂f(x*(t), u*(t))/∂x · p(t) - ∂g(x*(t), u*(t))/∂x,   0 ≤ t ≤ T

Since ∂J*(T, x)/∂x = ∂h(x)/∂x, we get p(T) = ∂h(x*(T))/∂x.

Put all of this together. Define the Hamiltonian

  H(x, u, p) = g(x, u) + p^T f(x, u).

Let u*(t) be an optimal control trajectory and x*(t) the resulting state trajectory. Then

  \dot x*(t) = ∂H/∂p (x*(t), u*(t), p(t)),   x*(0) = x_0
  \dot p(t) = - ∂H/∂x (x*(t), u*(t), p(t)),   p(T) = ∂h(x*(T))/∂x
  u*(t) = \arg\min_{u ∈ U} H(x*(t), u, p(t))
  H(x*(t), u*(t), p(t)) = constant   ∀ t ∈ [0, T]

(H(·) = constant comes from p_0(t) = constant.)
Some remarks:
• Set of 2n ODE, with split boundary conditions. Not trivial to solve.
• Necessary condition, but not sufficient. Can have multiple solutions,
not all of them may be optimal.
• If f (x, u) is linear, U is convex, h and g are convex, then the condition
is necessary and sufficient.
Example 13 (Resource Allocation): Some robots are sent to Mars to build habitats for a later exploration by humans.

x(t): number of reconfigurable robots, which can build habitats or replicate themselves; x(0) is given, the number of robots that arrive on Mars. y(t): number of habitats built.

  \dot x(t) = u(t) x(t),   x(0) = x_0
  \dot y(t) = (1 - u(t)) x(t),   y(0) = 0
  0 ≤ u(t) ≤ 1
Objective: given the terminal time T, find the control input u*(t) that maximizes y(T), the number of habitats built.

Note: y(T) = \int_0^T (1 - u(t)) x(t) dt

Solution:

  g(x, u) = (1 - u) x,   f(x, u) = u x
  H(x, u, p) = (1 - u) x + p u x

  \dot p(t) = - ∂H(x*(t), u*(t), p(t))/∂x = -1 + u*(t) - p(t) u*(t)
  p(T) = 0   (since h(x) ≡ 0)

  u*(t) = \arg\max_{0 ≤ u ≤ 1} ( x*(t) + ( p(t) x*(t) - x*(t) ) u )

so (since x*(t) > 0)

  u = 0 if p(t) < 1,   u = 1 if p(t) > 1.

Since p(T) = 0, for t close to T we will have u*(t) = 0 and therefore \dot p(t) = -1.
Therefore at time t = T - 1 we have p(t) = 1, and that is where the switch occurs:

  \dot p(t) = -p(t),   0 ≤ t ≤ T - 1,   p(T - 1) = 1
  ∴ p(t) = exp(-t) exp(T - 1),   0 ≤ t ≤ T - 1

Conclusion:

  u*(t) = { 1 for 0 ≤ t ≤ T - 1;   0 for T - 1 ≤ t ≤ T }
How to use this in practice:
1. If you can solve HJB, you get a feedback law u = µ(x). Very convenient, just a controller: measure the state and apply the control input.
2. Solve for the optimal trajectory and use a feedback law (probably linear) to keep you on that trajectory.
3. Solve for the optimal trajectory online after measuring the state. Do this often.
[Figure 3.4: Different approaches to find a solution: the relationship between HJB, the Minimum Principle, optimal solutions ("viscosity solutions", hard to show rigorously) and non-optimal solutions ("local minima", calculus of variations).]
3.3 Extensions
(We drop the x*(t), J* notation for simplicity.)
3.3.1 Fixed Terminal State
Case where x(T) is given. Clearly there is then no need for a terminal cost.

Recall the co-state p(t) = ∂J(t, x(t))/∂x and p(T) = \lim_{t→T} ∂J(t, x(t))/∂x; we can no longer use the terminal cost h to constrain p(T). But we don't need constraints on p, since the fixed terminal state supplies the missing boundary conditions:

  \dot x(t) = f(x(t), u(t)),   \dot p(t) = - ∂H(x(t), u(t), p(t))/∂x   (2n ODEs)
  x(0) = x_0,   x(T) = x_T   (2n boundary conditions)
Example 14:

  \dot x(t) = u(t),   x(0) = 0, x(1) = 1
  g(x, u) = (1/2)(x^2 + u^2),   cost = (1/2) \int_0^1 ( x^2(t) + u^2(t) ) dt

Hamiltonian: H(x, u, p) = (1/2)(x^2 + u^2) + p u

We get

  \dot x(t) = u(t),   \dot p(t) = -x(t)
  u(t) = \arg\min_u [ (1/2)( x^2(t) + u^2 ) + p(t) u ],   therefore u(t) = -p(t)
  ∴ \dot x(t) = -p(t),  \dot p(t) = -x(t),  \ddot x(t) = x(t)
  x(t) = A cosh(t) + B sinh(t)
  x(0) = 0 ⇒ A = 0,   x(1) = 1 ⇒ B = 1/sinh(1)

  x(t) = sinh(t)/sinh(1) = (e^t - e^{-t}) / (e^1 - e^{-1})

Exercise: show that the Hamiltonian is constant along this trajectory.
3.3.2 Free initial state, with cost
x(0) is not fixed, but there is an initial cost l(x(0)). One can show that the resulting condition is p(0) = - ∂l(x(0))/∂x.

Example 15:

  \dot x(t) = u(t),   x(1) = 1, x(0) is free
  g(x, u) = (1/2)(x^2 + u^2),   l(x) = 0   (no initial cost, given)

Apply the Minimum Principle, as before:

  \dot x(t) = u(t),   \dot p(t) = -x(t),   u(t) = -p(t),   \ddot x(t) = x(t),   x(t) = A cosh(t) + B sinh(t)
  \dot x(0) = u(0) = -p(0) = 0   ∴ B = 0

  x(t) = cosh(t)/cosh(1) = (e^t + e^{-t}) / (e^1 + e^{-1}),   x(0) ≈ 0.65
Free Terminal Time: Result: the Hamiltonian equals 0 along the optimal trajectory. We gain an extra degree of freedom in choosing T, and we lose a degree of freedom because H ≡ 0.

Time Varying System and Cost: What happens if f = f(x, u, t) and g = g(x, u, t)? Result: everything stays the same, except that the Hamiltonian is no longer constant along the trajectory. Hint: augment the state with time, \dot t = 1 and \dot x = f(x, u, t).
Singular Problems: motivated via an example.

Tracking problem: z(t) = 1 - t^2, 0 ≤ t ≤ 1. Minimize (1/2) \int_0^1 ( x(t) - z(t) )^2 dt subject to |\dot x(t)| ≤ 1.

Apply the Minimum Principle:

  \dot x(t) = u(t),   |u(t)| ≤ 1,   x(0), x(1) are free
  g(x, u, t) = (1/2)( x - z(t) )^2
  H(x, u, p, t) = (1/2)( x - z(t) )^2 + p u

Co-state equation:

  \dot p(t) = -( x(t) - z(t) ),   p(0) = 0, p(1) = 0

Optimal u:

  u(t) = \arg\min_{|u| ≤ 1} H(x(t), u, p(t), t)
  u(t) = { -1 if p(t) > 0;   1 if p(t) < 0;   undetermined if p(t) = 0 }

The problem is singular if the Hamiltonian is not a function of u over a non-trivial time interval.

Try the following: p(0) = 0, \dot p(t) = 0 for 0 ≤ t ≤ T, with T to be determined. Then

  p(t) = 0,   0 ≤ t ≤ T
  ∴ x(t) = z(t),   0 ≤ t ≤ T
  ∴ \dot x(t) = \dot z(t) = -2t = u(t)
One guess: pick T = 1/2 (where \dot z(t) = -2t hits the constraint |u| ≤ 1). This can't be the solution: for t > 1/2,

  x(t) - z(t) > 0   ∴ \dot p < 0   ∴ p(1) < 0,

so we can't satisfy the boundary condition p(1) = 0.

Explore this instead: switch before T = 1/2.

  x(t) = z(t),   0 ≤ t ≤ T < 1/2
  \dot x(t) = -1,   T < t ≤ 1
  ∴ x(t) = z(T) - (t - T) = 1 - T^2 - t + T,   T < t ≤ 1

  \dot p(t) = -( x(t) - z(t) ) = -( 1 - T^2 - t + T - 1 + t^2 ) = T^2 - T - t^2 + t,   T < t ≤ 1

  p(1) = \int_T^1 ( T^2 - T - t^2 + t ) dt
       = T^2 - T - 1/3 + 1/2 - T^3 + T^2 + T^3/3 - T^2/2 = 0

This simplifies to (multiply by 6):

  0 = -4T^3 + 9T^2 - 6T + 1 = (T - 1)(T - 1)(1 - 4T)
  ∴ T = 1 or T = 1/4

T = 1/4 satisfies all the constraints and we are done!
One can easily verify that p(t) > 0 for 1/4 < t < 1, giving u(t) = -1 as required.
3.4 Linear Systems and Quadratic Costs
Look at the infinite horizon, LTI (linear time invariant) system:

  x_{k+1} = A x_k + B u_k,   k = 0, 1, ...
  cost = \sum_{k=0}^{∞} ( x_k^T Q x_k + u_k^T R u_k ),   R > 0, Q ≥ 0, R = R^T, Q = Q^T

Informally, the cost-to-go is time invariant: it only depends on the state and not on when we get there.

  J(x) = \min_u [ x^T Q x + u^T R u + J(Ax + Bu) ]
Wednesday 8
th
December, 2010 Linear Systems and Quadratic Costs
Conjecture that the optimal cost-to-go is quadratic in x: J(x) = x^T K x, where K = K^T, K ≥ 0. Then

  x^T K x = x^T Q x + x^T A^T K A x + \min_u [ u^T R u + u^T B^T K B u + x^T A^T K B u + u^T B^T K A x ]

Since R > 0 and B^T K B ≥ 0, we have R + B^T K B > 0. Setting the gradient with respect to u to zero:

  2 ( R + B^T K B ) u + 2 B^T K A x = 0   →   u = - ( R + B^T K B )^{-1} B^T K A x

Substitute back in: all terms are of the form x^T (·) x. Therefore we must have

  K = Q + A^T K A + A^T K B ( R + B^T K B )^{-1} ( R + B^T K B ) ( R + B^T K B )^{-1} B^T K A - 2 A^T K B ( R + B^T K B )^{-1} B^T K A

  K = A^T ( K - K B ( R + B^T K B )^{-1} B^T K ) A + Q,   K ≥ 0

This is the discrete algebraic Riccati equation (D.A.R.E.).
Summary
• Optimal cost-to-go: J(x) = x^T K x
• Optimal feedback strategy: u = F x,   F = - ( R + B^T K B )^{-1} B^T K A

Questions
1. Can we always solve for K?
2. Is the closed loop system x_{k+1} = (A + BF) x_k stable?
Example 16:

  x_{k+1} = 2 x_k + 0 · u_k,   cost = \sum_{k=0}^{∞} ( x_k^2 + u_k^2 )
  A = 2, B = 0, Q = 1, R = 1

Solve for K:

  K = 4 (K - 0) + 1   ⇔   -3K = 1   →   K = -1/3

K does not satisfy the K ≥ 0 constraint. There is no solution to this problem: the cost is infinite.
The problem with this example is that (A, B) is not stabilizable.

Stabilizable: one can find a matrix F such that A + BF is stable, i.e. ρ(A + BF) < 1 (the eigenvalues of A + BF have magnitude < 1).
Example 17:

  x_{k+1} = 0.5 x_k + 0 · u_k,   cost = \sum_{k=0}^{∞} ( x_k^2 + u_k^2 )
  A = 0.5, B = 0, Q = 1, R = 1

Solve for K:

  K = 0.25 K + 1   →   K = 4/3

Cost-to-go = (4/3) x_k^2, and F = 0.
Example 18:

  x_{k+1} = 2 x_k + u_k,   cost = \sum_{k=0}^{∞} ( x_k^2 + u_k^2 )
  A = 2, B = 1, Q = 1, R = 1

(A, B) is stabilizable. Solve for K:

  K = 4 ( K - K^2 / (1 + K) ) + 1
  0 = K^2 - 4K - 1   →   K = (4 ± \sqrt{20}) / 2 = 2 ± \sqrt{5}

Pick K = 2 + \sqrt{5} ≈ 4.236 and solve for F:

  F = -(1 + K)^{-1} · 2K = -2K / (1 + K) ≈ -1.618

A + BF = 2 - 1.618 ≈ 0.382, which is stable. Our optimizing strategy stabilizes the system, as expected.
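The same numbers can be reproduced by simply iterating the Riccati recursion until it converges; a scalar Python sketch for the data of Example 18 (the number of iterations is an arbitrary choice):

```python
# Iterate K <- A (K - K B (R + B K B)^-1 B K) A + Q, then form F.

A, B, Q, R = 2.0, 1.0, 1.0, 1.0

K = 0.0
for _ in range(100):
    K = A * (K - K * B * (R + B * K * B) ** -1 * B * K) * A + Q

F = -(R + B * K * B) ** -1 * B * K * A
print(K, F, A + B * F)    # K ~ 2 + sqrt(5) ~ 4.236, F ~ -1.618, A + BF ~ 0.382
```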
Example 19:

  x_{k+1} = 2 x_k + u_k,   cost = \sum_{k=0}^{∞} u_k^2
  A = 2, B = 1, Q = 0, R = 1

Solve for K:

  K = 4 ( K - K^2 (1 + K)^{-1} )   →   K ∈ {0, 3}

K = 0 is clearly the optimal thing to do, but it leads to an unstable system. K = 3, however, while not being optimal, leads to F = -(1 + 3)^{-1} · 3 · 2 = -1.5, A + BF = 0.5, which is stable.

Modify the cost to \sum_{k=0}^{∞} ( u_k^2 + ε x_k^2 ), ε > 0, ε ≪ 1. Then A = 2, B = 1, Q = ε, R = 1. Solve for K:

  K(ε) = { 3 + 4ε/3,  -ε/3 }   (to first order in ε)

The K = 3 solution is the limiting case as we put an arbitrarily small cost on the state, which would otherwise grow unbounded.
Example 20:

  x_{k+1} = 0.5 x_k + u_k,   cost = \sum_{k=0}^{∞} u_k^2
  A = 0.5, B = 1, Q = 0, R = 1

Solve for K:

  K ∈ {0, -0.75}

Here K = 0 makes perfect sense: it gives the optimal strategy u = 0 and a stable closed loop system.
If Q = ε:

  K(ε) → { 4ε/3,  -0.75 - ε/3 }   (to first order in ε)

so K = 0 is a well behaved solution in the limit ε → 0.
We also need the concept of detectability. Let Q be decomposed as Q = C^T C (this can always be done). (A, C) is detectable if ∃ L such that A + LC is stable. Detectability gives C x_k → 0 ⇒ x_k → 0, and note that C x_k → 0 ⇔ x_k^T Q x_k → 0.
3.4.1 Summary
Given

  x_{k+1} = A x_k + B u_k,   k = 0, 1, ...
  cost = \sum_{k=0}^{∞} ( x_k^T Q x_k + u_k^T R u_k ),   Q ≥ 0, R > 0

with (A, B) stabilizable and (A, C) detectable, where C is any matrix that satisfies C^T C = Q. Then:

1. ∃ a unique solution K ≥ 0 of the D.A.R.E.
2. The optimal cost-to-go is J(x) = x^T K x.
3. The optimal feedback strategy is u = F x.
4. The closed loop system is stable.
3.5 General Problem Formulation
• Finite horizon, time varying, with disturbances:

  x_{k+1} = A_k x_k + B_k u_k + w_k,   k = 0, ..., N-1
  E(w_k) = 0,   E(w_k w_k^T) finite

• Cost:

  E [ x_N^T Q_N x_N + \sum_{k=0}^{N-1} ( x_k^T Q_k x_k + u_k^T R_k u_k ) ]
  Q_k = Q_k^T ≥ 0 (eigenvalues ≥ 0),   R_k = R_k^T > 0
Apply DP to solve problem:
J
N
(x
N
) = x
T
N
Q
N
x
N
J
k
(x
k
) = min
u
k
E
_
x
T
k
Q
k
x
k
+ u
T
k
R
k
u
k
+ J
k+1
(A
k
x
k
+ B
k
u
k
+ w
k
)
_
Let’s do the first step of this recursion. Or equivalently, N = 1:
J
0
(x
0
) = min
u
0
E
_
x
T
0
Q
0
x
0
+ u
T
0
R
0
u
0
+ (A
0
x
0
+ B
0
u
0
+ w
0
)
T
Q
1
(. . .)
_
consider last term:
E
_
(A
0
x
0
+ B
0
u
0
)
T
Q
1
(A
0
x
0
+ B
0
u
0
) + 2(A
0
x
0
+ B
0
u
0
)
T
Q
1
w
0
+ w
T
0
Q
1
w
0
_
= (A
0
x
0
+ B
0
u
0
)
T
Q
1
(. . .) + E(w
T
0
Q
1
w
0
)
∴ J
0
(x
0
) = min
u
0
_
x
T
0
Q
0
x
0
+ u
T
0
R
0
u
0
+ (A
0
x
0
+ B
0
u
0
)
T
Q
1
(. . .) + E(w
T
0
Q
1
w
0
)
_
Strategy is the same as if there was no noise (although it does give a different
cost). Certainty equivalence (works only for some problems, not all).
Solve for minimizing u
0
: differentiate , set to 0:
2R
0
u
0
+ 2 B
T
0
Q
1
B
0
u
0
+ 2 B
T
0
Q
1
A
0
x
0
= 0
u
0
= −(R
0
+ B
T
0
Q
1
B
0
)
−1
B
T
0
Q
1
A
0
x
0
=: F
0
x
0
Optimal feedback strategy is a linear function of the state.
Substitute and solve for J_0(x_0):
    x_0^T Q_0 x_0 + u_0^T (R_0 + B_0^T Q_1 B_0) u_0 + x_0^T A_0^T Q_1 A_0 x_0 + 2 x_0^T A_0^T Q_1 B_0 u_0 + E(w_0^T Q_1 w_0)
    = x_0^T K_0 x_0 + E(w_0^T Q_1 w_0)
where
    K_0 = Q_0 + A_0^T (K_1 − K_1 B_0 (R_0 + B_0^T K_1 B_0)^{−1} B_0^T K_1) A_0,   K_1 = Q_1.
The cost at k = 1 is quadratic in x_1; at k = 0 it is quadratic in x_0 plus a constant.
Can extend this to any horizon; this gives the Discrete Riccati Equation (DRE):
    K_k = Q_k + A_k^T (K_{k+1} − K_{k+1} B_k (R_k + B_k^T K_{k+1} B_k)^{−1} B_k^T K_{k+1}) A_k
    K_N = Q_N
Feedback law:
    u_k = F_k x_k
    F_k = −(R_k + B_k^T K_{k+1} B_k)^{−1} B_k^T K_{k+1} A_k
Cost:
    J_k(x_k) = x_k^T K_k x_k + ∑_{j=k}^{N−1} E(w_j^T K_{j+1} w_j)
No noise, time invariant, infinite horizon (N → ∞): we recover the previous results and the DARE. In fact, the above iterative method is one way to solve the DARE: iterate backwards until it converges ([1] has a proof of convergence; it is not trivial).
Time invariant, infinite horizon with noise: the cost goes to infinity. Approach: divide the cost by N and let N → ∞; the resulting average cost per stage is E(w^T K w).
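A minimal sketch of this backward iteration (an addition, NumPy assumed): for time-invariant data, iterate the DRE starting from K_N = Q until successive iterates agree. On the data of Example 18 it converges to K = 2 + √5.

```python
import numpy as np

def solve_dare_by_iteration(A, B, Q, R, tol=1e-10, max_iter=10_000):
    """Iterate K <- Q + A'(K - K B (R + B'KB)^{-1} B'K) A until convergence."""
    K = Q.copy()                                           # K_N = Q_N
    for _ in range(max_iter):
        S = np.linalg.solve(R + B.T @ K @ B, B.T @ K)      # (R + B'KB)^{-1} B'K
        K_next = Q + A.T @ (K - K @ B @ S) @ A
        if np.max(np.abs(K_next - K)) < tol:
            return K_next
        K = K_next
    return K

# Example 18 data: converges to [[4.2361]] = 2 + sqrt(5).
A = np.array([[2.0]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])
print(solve_dare_by_iteration(A, B, Q, R))
```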
Example 21: System given:
    z̈(t) = u(t)
Objective: apply a force to move a mass from any starting point to z = 0, ż = 0. Implement on a computer that can only update information once per second.
1. Discretize the problem:
       ż(t) = ż(0) + u(0) t,                     0 ≤ t < 1
       z(t) = z(0) + ż(0) t + (1/2) u(0) t^2,    0 ≤ t < 1
   Let x_1(k) = z(k), x_2(k) = ż(k). Then
       x(k + 1) = A x(k) + B u(k),   k = 0, 1, . . .
       A = [1 1; 0 1],   B = [0.5; 1]
2. Cost = ∑_{k=0}^∞ (x_1^2(k) + u^2(k)). Therefore
       Q = [1 0; 0 0],   R = 1
3. Is the system stabilizable? Can we make A + BF stable for some F?
   Yes: F = [−1  −1.5] makes the eigenvalues of A + BF equal to 0.
4. Q can be decomposed as follows:
       Q = [1 0; 0 0] = [1; 0] · [1 0]
   Therefore
       C = [1 0],   Q = C^T C
   Is (A, C) detectable? Yes: L = [−2; −1] makes the eigenvalues of A + LC equal to 0.
5. Solve the DARE. Use the MATLAB command dare:
       K = [2 1; 1 1.5]
   Optimal feedback matrix:
       F = [−0.5  −1.0]
6. “Physical” interpretation: a spring and a damper; the spring has coefficient 0.5, the damper has coefficient 1.0.
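As a cross-check of step 5 (my addition), the same K and F come out of SciPy's solve_discrete_are for this double-integrator model, and the closed-loop eigenvalues have magnitude 0.5.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.5], [1.0]])
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
R = np.array([[1.0]])

K = solve_discrete_are(A, B, Q, R)                    # expect [[2, 1], [1, 1.5]]
F = -np.linalg.solve(R + B.T @ K @ B, B.T @ K @ A)    # expect [[-0.5, -1.0]]
print(np.round(K, 3))
print(np.round(F, 3))
print(np.abs(np.linalg.eigvals(A + B @ F)))           # closed loop is stable
```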
Bibliography
[1] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume I. Athena Scientific, 3rd edition, 2005.