(ed), MIT Press (2006)
Optimal Control Theory

Emanuel Todorov

University of California San Diego

Optimal control theory is a mature mathematical discipline with numerous applications in both science and engineering. It is emerging as the computational framework of choice for studying the neural control of movement, in much the same way that probabilistic inference is emerging as the computational framework of choice for studying sensory information processing. Despite the growing popularity of optimal control models, however, the elaborate mathematical machinery behind them is rarely exposed and the big picture is hard to grasp without reading a few technical books on the subject. While this chapter cannot replace such books, it aims to provide a self-contained mathematical introduction to optimal control theory that is sufficiently broad and yet sufficiently detailed when it comes to key concepts. The text is not tailored to the field of motor control (apart from the last section, and the overall emphasis on systems with continuous state) so it will hopefully be of interest to a wider audience. Of special interest in the context of this book is the material on the duality of optimal control and probabilistic inference; such duality suggests that neural information processing in sensory and motor areas may be more similar than currently thought. The chapter is organized in the following sections:

1. Dynamic programming, Bellman equations, optimal value functions, value and policy iteration, shortest paths, Markov decision processes.

2. Hamilton-Jacobi-Bellman equations, approximation methods, finite and infinite horizon formulations, basics of stochastic calculus.

3. Pontryagin's maximum principle, ODE and gradient descent methods, relationship to classical mechanics.

4. Linear-quadratic-Gaussian control, Riccati equations, iterative linear approximations to nonlinear problems.

5. Optimal recursive estimation, Kalman filter, Zakai equation.

6. Duality of optimal control and optimal estimation (including new results).

7. Optimality models in motor control, promising research directions.

1 Discrete control: Bellman equations

Let $x \in \mathcal{X}$ denote the state of an agent's environment, and $u \in \mathcal{U}(x)$ the action (or control) which the agent chooses while at state $x$. For now both $\mathcal{X}$ and $\mathcal{U}(x)$ are finite sets. Let $\mathrm{next}(x,u) \in \mathcal{X}$ denote the state which results from applying action $u$ in state $x$, and $\mathrm{cost}(x,u) \geq 0$ the cost of applying action $u$ in state $x$. As an example, $x$ may be the city where we are now, $u$ the flight we choose to take, $\mathrm{next}(x,u)$ the city where that flight lands, and $\mathrm{cost}(x,u)$ the price of the ticket. We can now pose a simple yet practical optimal control problem: find the cheapest way to fly to your destination. This problem can be formalized as follows: find an action sequence $(u_0, u_1, \ldots, u_{n-1})$ and corresponding state sequence $(x_0, x_1, \ldots, x_n)$ minimizing the total cost

$$J(x_\cdot, u_\cdot) = \sum_{k=0}^{n-1} \mathrm{cost}(x_k, u_k)$$

where $x_{k+1} = \mathrm{next}(x_k, u_k)$ and $u_k \in \mathcal{U}(x_k)$. The initial state $x_0 = x^{\mathrm{init}}$ and destination state $x_n = x^{\mathrm{dest}}$ are given. We can visualize this setting with a directed graph where the states are nodes and the actions are arrows connecting the nodes. If $\mathrm{cost}(x,u) = 1$ for all $(x,u)$ the problem reduces to finding the shortest path from $x^{\mathrm{init}}$ to $x^{\mathrm{dest}}$ in the graph.

1.1 Dynamic programming

Optimization problems such as the one stated above are efficiently solved via dynamic programming (DP). DP relies on the following obvious fact: if a given state-action sequence is optimal, and we were to remove the first state and action, the remaining sequence is also optimal (with the second state of the original sequence now acting as initial state). This is the Bellman optimality principle. Note the close resemblance to the Markov property of stochastic processes (a process is Markov if its future is conditionally independent of the past given the present state). The optimality principle can be reworded in similar language: the choice of optimal actions in the future is independent of the past actions which led to the present state. Thus optimal state-action sequences can be constructed by starting at the final state and extending backwards. Key to this procedure is the optimal value function (or optimal cost-to-go function)

$$v(x) = \text{"minimal total cost for completing the task starting from state } x\text{"}$$

This function captures the long-term cost for starting from a given state, and makes it possible to find optimal actions through the following algorithm:

    Consider every action available at the current state,
    add its immediate cost to the optimal value of the resulting next state,
    and choose an action for which the sum is minimal.

The above algorithm is "greedy" in the sense that actions are chosen based on local information, without explicit consideration of all future scenarios. And yet the resulting actions are optimal. This is possible because the optimal value function contains all information about future scenarios that is relevant to the present choice of action. Thus the optimal value function is an extremely useful quantity, and indeed its calculation is at the heart of many methods for optimal control.

The above algorithm yields an optimal action $u = \pi(x) \in \mathcal{U}(x)$ for every state $x$. A mapping from states to actions is called a control law or control policy. Once we have a control law $\pi : \mathcal{X} \to \mathcal{U}(\mathcal{X})$ we can start at any state $x_0$, generate action $u_0 = \pi(x_0)$, transition to state $x_1 = \mathrm{next}(x_0, u_0)$, generate action $u_1 = \pi(x_1)$, and keep going until we reach $x^{\mathrm{dest}}$.

Formally, an optimal control law $\pi$ satisfies

$$\pi(x) = \arg\min_{u \in \mathcal{U}(x)} \left\{ \mathrm{cost}(x,u) + v(\mathrm{next}(x,u)) \right\} \tag{1}$$

The minimum in (1) may be achieved for multiple actions in the set $\mathcal{U}(x)$, which is why $\pi$ may not be unique. However the optimal value function $v$ is always uniquely defined, and satisfies

$$v(x) = \min_{u \in \mathcal{U}(x)} \left\{ \mathrm{cost}(x,u) + v(\mathrm{next}(x,u)) \right\} \tag{2}$$

Equations (1) and (2) are the Bellman equations.

If for some $x$ we already know $v(\mathrm{next}(x,u))$ for all $u \in \mathcal{U}(x)$, then we can apply the Bellman equations directly and compute $\pi(x)$ and $v(x)$. Thus dynamic programming is particularly simple in acyclic graphs where we can start from $x^{\mathrm{dest}}$ with $v\left(x^{\mathrm{dest}}\right) = 0$, and perform a backward pass in which every state is visited after all its successor states have been visited. It is straightforward to extend the algorithm to the case where we are given non-zero final costs for a number of destination states (or absorbing states).
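The backward pass just described can be sketched in a few lines. The flight graph, state names and costs below are illustrative assumptions, not from the text:

```python
# Backward-pass dynamic programming on a small acyclic "flight" graph.
# next_state[(x, u)] and cost[(x, u)] play the roles of next(x,u) and cost(x,u).
next_state = {("A", "u1"): "B", ("A", "u2"): "C",
              ("B", "u1"): "D", ("C", "u1"): "D"}
cost = {("A", "u1"): 100, ("A", "u2"): 50,
        ("B", "u1"): 30, ("C", "u1"): 120}

# Visit states in reverse topological order so every successor is done first.
order = ["D", "B", "C", "A"]
v = {"D": 0.0}          # v(x_dest) = 0
policy = {}
for x in order[1:]:
    actions = [u for (s, u) in next_state if s == x]
    u_best = min(actions, key=lambda u: cost[(x, u)] + v[next_state[(x, u)]])
    policy[x] = u_best
    v[x] = cost[(x, u_best)] + v[next_state[(x, u_best)]]

print(v["A"], policy["A"])   # prints: 130.0 u1
```

Each state is processed only after its successors, so the Bellman equations (1) and (2) can be applied directly, exactly as in the text.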

1.2 Value iteration and policy iteration

The situation is more complex in graphs with cycles. Here the Bellman equations are still valid, but we cannot apply them in a single pass. This is because the presence of cycles makes it impossible to visit each state only after all its successors have been visited. Instead the Bellman equations are treated as consistency conditions and used to design iterative relaxation schemes – much like partial differential equations (PDEs) are treated as consistency conditions and solved with corresponding relaxation schemes. By "relaxation scheme" we mean guessing the solution, and iteratively improving the guess so as to make it more compatible with the consistency condition.

The two main relaxation schemes are value iteration and policy iteration. Value iteration uses only (2). We start with a guess $v^{(0)}$ of the optimal value function, and construct a sequence of improved guesses:

$$v^{(i+1)}(x) = \min_{u \in \mathcal{U}(x)} \left\{ \mathrm{cost}(x,u) + v^{(i)}(\mathrm{next}(x,u)) \right\} \tag{3}$$

This process is guaranteed to converge to the optimal value function $v$ in a finite number of iterations. The proof relies on the important idea of contraction mappings: one defines the approximation error $e\bigl(v^{(i)}\bigr) = \max_x \bigl| v^{(i)}(x) - v(x) \bigr|$, and shows that the iteration (3) causes $e\bigl(v^{(i)}\bigr)$ to decrease as $i$ increases. In other words, the mapping $v^{(i)} \to v^{(i+1)}$ given by (3) contracts the "size" of $v^{(i)}$ as measured by the error norm $e\bigl(v^{(i)}\bigr)$.

Policy iteration uses both (1) and (2). It starts with a guess $\pi^{(0)}$ of the optimal control law, and constructs a sequence of improved guesses:

$$\begin{aligned}
v^{(i)}(x) &= \mathrm{cost}\bigl(x, \pi^{(i)}(x)\bigr) + v^{(i)}\bigl(\mathrm{next}\bigl(x, \pi^{(i)}(x)\bigr)\bigr) \\
\pi^{(i+1)}(x) &= \arg\min_{u \in \mathcal{U}(x)} \left\{ \mathrm{cost}(x,u) + v^{(i)}(\mathrm{next}(x,u)) \right\}
\end{aligned} \tag{4}$$

The first line of (4) requires a separate relaxation to compute the value function $v^{(i)}$ for the control law $\pi^{(i)}$. This function is defined as the total cost for starting at state $x$ and acting according to $\pi^{(i)}$ thereafter. Policy iteration can also be proven to converge in a finite number of iterations. It is not obvious which algorithm is better, because each of the two nested relaxations in policy iteration converges faster than the single relaxation in value iteration. In practice both algorithms are used depending on the problem at hand.
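Value iteration (3) on a graph with a cycle can be sketched as follows. The two-state graph, its costs, and the sweep limit are illustrative assumptions:

```python
# Value iteration on a cyclic graph: state "A" has a self-loop ("wait"),
# so no single backward pass is possible; we relax eq. (3) instead.
next_state = {("A", "go"): "B", ("A", "wait"): "A",   # self-loop: a cycle
              ("B", "go"): "G", ("B", "back"): "A"}
cost = {("A", "go"): 2.0, ("A", "wait"): 1.0,
        ("B", "go"): 1.0, ("B", "back"): 2.0}
states = ["A", "B"]

v = {s: 0.0 for s in states}
v["G"] = 0.0                       # absorbing goal state
for _ in range(100):               # relaxation sweeps, eq. (3)
    v_new = dict(v)
    for x in states:
        v_new[x] = min(cost[(x, u)] + v[next_state[(x, u)]]
                       for (s, u) in next_state if s == x)
    if all(abs(v_new[s] - v[s]) < 1e-9 for s in states):
        break
    v = v_new
```

Starting from the all-zero guess, the sweeps converge in a handful of iterations to $v(A) = 3$, $v(B) = 1$, which indeed satisfy the Bellman equation (2) for this graph.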

1.3 Markov decision processes

The problems considered thus far are deterministic, in the sense that applying action $u$ at state $x$ always yields the same next state $\mathrm{next}(x,u)$. Dynamic programming easily generalizes to the stochastic case where we have a probability distribution over possible next states:

$$p(y \mid x, u) = \text{"probability that } \mathrm{next}(x,u) = y\text{"}$$

In order to qualify as a probability distribution the function $p$ must satisfy

$$\sum_{y \in \mathcal{X}} p(y \mid x, u) = 1, \qquad p(y \mid x, u) \geq 0$$

In the stochastic case the value function equation (2) becomes

$$v(x) = \min_{u \in \mathcal{U}(x)} \left\{ \mathrm{cost}(x,u) + E\left[ v(\mathrm{next}(x,u)) \right] \right\} \tag{5}$$

where $E$ denotes expectation over $\mathrm{next}(x,u)$, and is computed as

$$E\left[ v(\mathrm{next}(x,u)) \right] = \sum_{y \in \mathcal{X}} p(y \mid x, u)\, v(y)$$

Equations (1, 3, 4) generalize to the stochastic case in the same way as equation (2) does.

An optimal control problem with discrete states and actions and probabilistic state transitions is called a Markov decision process (MDP). MDPs are extensively studied in reinforcement learning – which is a sub-field of machine learning focusing on optimal control problems with discrete state. In contrast, optimal control theory focuses on problems with continuous state and exploits their rich differential structure.

2 Continuous control: Hamilton-Jacobi-Bellman equations

We now turn to optimal control problems where the state $x \in \mathbb{R}^{n_x}$ and control $u \in \mathcal{U}(x) \subseteq \mathbb{R}^{n_u}$ are real-valued vectors. To simplify notation we will use the shortcut $\min_u$ instead of $\min_{u \in \mathcal{U}(x)}$, although the latter is implied unless noted otherwise. Consider the stochastic differential equation

$$dx = f(x, u)\, dt + F(x, u)\, dw \tag{6}$$

where $dw$ is $n_w$-dimensional Brownian motion. This is sometimes called a controlled Ito diffusion, with $f(x,u)$ being the drift and $F(x,u)$ the diffusion coefficient. In the absence of noise, i.e. when $F(x,u) = 0$, we can simply write $\dot{x} = f(x,u)$. However in the stochastic case this would be meaningless because the sample paths of Brownian motion are not differentiable (the term $dw/dt$ is infinite). What equation (6) really means is that the integral of the left hand side is equal to the integral of the right hand side:

$$x(t) = x(0) + \int_0^t f(x(s), u(s))\, ds + \int_0^t F(x(s), u(s))\, dw(s)$$

The last term is an Ito integral, defined for square-integrable functions $g(t)$ as

$$\int_0^t g(s)\, dw(s) = \lim_{n \to \infty} \sum_{k=0}^{n-1} g(s_k)\,\bigl(w(s_{k+1}) - w(s_k)\bigr), \quad \text{where } 0 = s_0 < s_1 < \cdots < s_n = t$$

We will stay away from the complexities of stochastic calculus to the extent possible. Instead we will discretize the time axis and obtain results for the continuous-time case in the limit of infinitely small time step.

The appropriate Euler discretization of (6) is

$$x_{k+1} = x_k + \Delta\, f(x_k, u_k) + \sqrt{\Delta}\, F(x_k, u_k)\, \varepsilon_k$$

where $\Delta$ is the time step, $\varepsilon_k \sim \mathcal{N}(0, I_{n_w})$ and $x_k = x(k\Delta)$. The $\sqrt{\Delta}$ term appears because the variance of Brownian motion grows linearly with time, and thus the standard deviation of the discrete-time noise should scale as $\sqrt{\Delta}$.
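This discretization is the Euler-Maruyama scheme, and it can be sketched in code as follows. The drift, diffusion, time step and horizon in the example are illustrative assumptions, not from the text:

```python
# Euler-Maruyama simulation of dx = f(x,u) dt + F(x,u) dw.
import numpy as np

def simulate(x0, u_seq, f, F, dt, rng):
    """Iterate x_{k+1} = x_k + dt*f(x_k,u_k) + sqrt(dt)*F(x_k,u_k) @ eps_k."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for u in u_seq:
        eps = rng.standard_normal(x.shape)        # eps_k ~ N(0, I)
        x = x + dt * f(x, u) + np.sqrt(dt) * F(x, u) @ eps
        traj.append(x.copy())
    return np.array(traj)

# Example: controlled scalar system dx = (u - x) dt + 0.1 dw
rng = np.random.default_rng(0)
traj = simulate([0.0], [1.0] * 100,
                f=lambda x, u: u - x,
                F=lambda x, u: 0.1 * np.eye(1),
                dt=0.01, rng=rng)
```

Note the `sqrt(dt)` scaling of the noise term, which is exactly the $\sqrt{\Delta}$ factor discussed above.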

To define an optimal control problem we also need a cost function. In finite-horizon problems, i.e. when a final time $t_f$ is specified, it is natural to separate the total cost into a time-integral of a cost rate $\ell(x, u, t) \geq 0$, and a final cost $h(x) \geq 0$ which is only evaluated at the final state $x(t_f)$. Thus the total cost for a given state-control trajectory $\{x(t), u(t) : 0 \leq t \leq t_f\}$ is defined as

$$J(x(\cdot), u(\cdot)) = h(x(t_f)) + \int_0^{t_f} \ell(x(t), u(t), t)\, dt$$

Keep in mind that we are dealing with a stochastic system. Our objective is to find a control law $u = \pi(x, t)$ which minimizes the expected total cost for starting at a given $(x, t)$ and acting according to $\pi$ thereafter.

In discrete time the total cost becomes

$$J(x_\cdot, u_\cdot) = h(x_n) + \Delta \sum_{k=0}^{n-1} \ell(x_k, u_k, k\Delta)$$

where $n = t_f / \Delta$ is the number of time steps (assume that $t_f / \Delta$ is an integer).

2.1 Derivation of the HJB equations

We are now ready to apply dynamic programming to the time-discretized stochastic problem. The development is similar to the MDP case except that the state space is now infinite: it consists of $n+1$ copies of $\mathbb{R}^{n_x}$. The reason we need multiple copies of $\mathbb{R}^{n_x}$ is that we have a finite-horizon problem, and therefore the time when a given $x \in \mathbb{R}^{n_x}$ is reached makes a difference.

The state transitions are now stochastic: the probability distribution of $x_{k+1}$ given $x_k, u_k$ is the multivariate Gaussian

$$x_{k+1} \sim \mathcal{N}\bigl(x_k + \Delta\, f(x_k, u_k),\ \Delta\, S(x_k, u_k)\bigr), \quad \text{where } S(x, u) = F(x, u)\, F(x, u)^T$$

The Bellman equation for the optimal value function $v$ is similar to (5), except that $v$ is now a function of space and time. We have

$$v(x, k) = \min_u \left\{ \Delta\, \ell(x, u, k\Delta) + E\left[ v(x + \Delta\, f(x, u) + \xi,\ k+1) \right] \right\} \tag{7}$$

where $\xi \sim \mathcal{N}(0, \Delta\, S(x, u))$ and $v(x, n) = h(x)$.

Consider the second-order Taylor-series expansion of $v$, with the time index $k+1$ suppressed for clarity:

$$v(x + \delta) = v(x) + \delta^T v_x(x) + \tfrac{1}{2} \delta^T v_{xx}(x)\, \delta + o\bigl(\delta^3\bigr)$$

$$\text{where } \delta = \Delta\, f(x, u) + \xi, \qquad v_x = \frac{\partial}{\partial x} v, \qquad v_{xx} = \frac{\partial^2}{\partial x\, \partial x^T} v$$

Now compute the expectation of the optimal value function at the next state, using the above Taylor-series expansion and only keeping terms up to first order in $\Delta$. The result is:

$$E[v] = v(x) + \Delta\, f(x, u)^T v_x(x) + \tfrac{1}{2} \Delta\, \mathrm{tr}\bigl(S(x, u)\, v_{xx}(x)\bigr) + o\bigl(\Delta^2\bigr)$$

The trace term appears because

$$E\left[ \xi^T v_{xx}\, \xi \right] = E\left[ \mathrm{tr}\bigl(\xi \xi^T v_{xx}\bigr) \right] = \mathrm{tr}\bigl(\mathrm{Cov}[\xi]\, v_{xx}\bigr) = \Delta\, \mathrm{tr}(S\, v_{xx})$$

Note the second-order derivative $v_{xx}$ in the first-order approximation to $E[v]$. This is a recurrent theme in stochastic calculus. It is directly related to Ito's lemma, which states that if $x(t)$ is an Ito diffusion with coefficient $\sigma$, then

$$dg(x(t)) = g_x(x(t))\, dx(t) + \tfrac{1}{2} \sigma^2 g_{xx}(x(t))\, dt$$

Coming back to the derivation, we substitute the expression for $E[v]$ in (7), move the term $v(x)$ outside the minimization operator (since it does not depend on $u$), and divide by $\Delta$. Suppressing $x, u, k$ on the right hand side, we have

$$\frac{v(x, k) - v(x, k+1)}{\Delta} = \min_u \left\{ \ell + f^T v_x + \tfrac{1}{2} \mathrm{tr}(S\, v_{xx}) + o(\Delta) \right\}$$

Recall that $t = k\Delta$, and consider the optimal value function $v(x, t)$ defined in continuous time. The left hand side in the above equation is then

$$\frac{v(x, t) - v(x, t + \Delta)}{\Delta}$$

In the limit $\Delta \to 0$ the latter expression becomes $-\frac{\partial}{\partial t} v$, which we denote $-v_t$. Thus for $0 \leq t \leq t_f$ and $v(x, t_f) = h(x)$, the following holds:

$$-v_t(x, t) = \min_{u \in \mathcal{U}(x)} \left\{ \ell(x, u, t) + f(x, u)^T v_x(x, t) + \tfrac{1}{2} \mathrm{tr}\bigl(S(x, u)\, v_{xx}(x, t)\bigr) \right\} \tag{8}$$

Similarly to the discrete case, an optimal control $\pi(x, t)$ is a value of $u$ which achieves the minimum in (8):

$$\pi(x, t) = \arg\min_{u \in \mathcal{U}(x)} \left\{ \ell(x, u, t) + f(x, u)^T v_x(x, t) + \tfrac{1}{2} \mathrm{tr}\bigl(S(x, u)\, v_{xx}(x, t)\bigr) \right\} \tag{9}$$

Equations (8) and (9) are the Hamilton-Jacobi-Bellman (HJB) equations.

2.2 Numerical solutions of the HJB equations

The HJB equation (8) is a non-linear (due to the min operator) second-order PDE with respect to the unknown function $v$. If a differentiable function $v$ satisfying (8) for all $(x, t)$ exists, then it is the unique optimal value function. However, non-linear differential equations do not always have classic solutions which satisfy them everywhere. For example, consider the equation $|\dot{g}(t)| = 1$ with boundary conditions $g(0) = g(1) = 0$. The slope of $g$ is either $+1$ or $-1$, and so $g$ has to change slope (discontinuously) somewhere in the interval $0 \leq t \leq 1$ in order to satisfy the boundary conditions. At the points where that occurs the derivative $\dot{g}(t)$ is undefined. If we decide to admit such "weak" solutions, we are faced with infinitely many solutions to the same differential equation. In particular when (8) does not have a classic solution, the optimal value function is a weak solution but there are many other weak solutions. How can we then solve the optimal control problem? The recent development of non-smooth analysis and the idea of viscosity solutions provide a reassuring answer. It can be summarized as follows: (i) "viscosity" provides a specific criterion for selecting a single weak solution; (ii) the optimal value function is a viscosity solution to the HJB equation (and thus it is the only viscosity solution); (iii) numerical approximation schemes which take the limit of solutions to discretized problems converge to a viscosity solution (and therefore to the optimal value function). The bottom line is that in practice one need not worry about the absence of classic solutions.

Unfortunately there are other practical issues to worry about. The only numerical methods guaranteed to converge to the optimal value function rely on discretizations of the state space, and the required number of discrete states is exponential in the state-space dimensionality $n_x$. Bellman called this the curse of dimensionality. It is a problem which most likely does not have a general solution. Nevertheless, the HJB equations have motivated a number of methods for approximate solution. Such methods rely on parametric models of the optimal value function, or the optimal control law, or both. Below we outline one such method.

Consider an approximation $\tilde{v}(x, t; \theta)$ to the optimal value function, where $\theta$ is some vector of parameters. Particularly convenient are models of the form

$$\tilde{v}(x, t; \theta) = \sum_i \phi_i(x, t)\, \theta_i$$

where $\phi_i(x, t)$ are some predefined basis functions, and the unknown parameters $\theta$ appear linearly. Linearity in $\theta$ simplifies the calculation of derivatives:

$$\tilde{v}_x(x, t; \theta) = \sum_i \frac{\partial \phi_i}{\partial x}(x, t)\, \theta_i$$

and similarly for $\tilde{v}_{xx}$ and $\tilde{v}_t$. Now choose a large enough set of states $(x, t)$ and evaluate the right hand side of (8) at those states (using the approximation to $v$ and minimizing over $u$). This procedure yields target values for the left hand side of (8). Then adjust the parameters $\theta$ so that $-\tilde{v}_t(x, t; \theta)$ gets closer to these target values. The discrepancy being minimized by the parameter adjustment procedure is the Bellman error.
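The linear-in-parameters fit can be sketched as follows. The Gaussian basis functions, the collocation grid, and the stand-in target values are illustrative assumptions; in a real application the targets would come from minimizing the right hand side of (8) at each collocation state:

```python
# Least-squares fit of theta for a value-function model linear in theta.
import numpy as np

def features(x, t, centers, width=0.5):
    """Gaussian basis functions phi_i(x,t) centered on a grid of (x,t) points."""
    d = ((x - centers[:, 0]) ** 2 + (t - centers[:, 1]) ** 2) / width ** 2
    return np.exp(-d)

# Collocation states on a small (x, t) grid.
centers = np.array([[x, t] for x in np.linspace(-1, 1, 5)
                           for t in np.linspace(0, 1, 5)])
X = np.array([features(x, t, centers) for x, t in centers])

# Illustrative random targets stand in for the minimized RHS of (8).
rng = np.random.default_rng(0)
targets = rng.standard_normal(len(centers))

# Because the model is linear in theta, the fit is a linear least-squares solve.
theta, *_ = np.linalg.lstsq(X, targets, rcond=None)
bellman_error = np.linalg.norm(X @ theta - targets)
```

The point of the sketch is the last two lines: linearity in $\theta$ turns the Bellman-error minimization at the collocation states into an ordinary least-squares problem.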

2.3 In…nite-horizon formulations

Thus far we focused on finite-horizon problems. There are two infinite-horizon formulations used in practice, both of which yield time-invariant forms of the HJB equations. One is the discounted-cost formulation, where the total cost for an (infinitely long) state-control trajectory is defined as

$$J(x(\cdot), u(\cdot)) = \int_0^\infty \exp(-\alpha t)\, \ell(x(t), u(t))\, dt$$

with $\alpha > 0$ being the discount factor. Intuitively this says that future costs are less costly (whatever that means). Here we do not have a final cost $h(x)$, and the cost rate $\ell(x, u)$ no longer depends on time explicitly. The HJB equation for the optimal value function becomes

$$\alpha\, v(x) = \min_{u \in \mathcal{U}(x)} \left\{ \ell(x, u) + f(x, u)^T v_x(x) + \tfrac{1}{2} \mathrm{tr}\bigl(S(x, u)\, v_{xx}(x)\bigr) \right\} \tag{10}$$

Another alternative is the average-cost-per-stage formulation, with total cost

$$J(x(\cdot), u(\cdot)) = \lim_{t_f \to \infty} \frac{1}{t_f} \int_0^{t_f} \ell(x(t), u(t))\, dt$$

In this case the HJB equation for the optimal value function is

$$\lambda = \min_{u \in \mathcal{U}(x)} \left\{ \ell(x, u) + f(x, u)^T v_x(x) + \tfrac{1}{2} \mathrm{tr}\bigl(S(x, u)\, v_{xx}(x)\bigr) \right\} \tag{11}$$

where $\lambda \geq 0$ is the average cost per stage, and $v$ now has the meaning of a differential value function.

Equations (10) and (11) do not depend on time, which makes them more amenable to numerical approximations in the sense that we do not need to store a copy of the optimal value function at each point in time. From another point of view, however, (8) may be easier to solve numerically. This is because dynamic programming can be performed in a single backward pass through time: initialize $v(x, t_f) = h(x)$ and simply integrate (8) backward in time, computing the spatial derivatives numerically along the way. In contrast, (10) and (11) call for relaxation methods (such as value iteration or policy iteration) which in the continuous-state case may take an arbitrary number of iterations to converge. Relaxation methods are of course guaranteed to converge in a finite number of iterations for any finite state approximation, but that number may increase rapidly as the discretization of the continuous state space is refined.

3 Deterministic control: Pontryagin’s maximum principle

Optimal control theory is based on two fundamental ideas. One is dynamic programming and the associated optimality principle, introduced by Bellman in the United States. The other is the maximum principle, introduced by Pontryagin in the Soviet Union. The maximum principle applies only to deterministic problems, and yields the same solutions as dynamic programming. Unlike dynamic programming, however, the maximum principle avoids the curse of dimensionality. Here we derive the maximum principle indirectly via the HJB equation, and directly via Lagrange multipliers. We also clarify its relationship to classical mechanics.

3.1 Derivation via the HJB equations

For deterministic dynamics $\dot{x} = f(x, u)$ the finite-horizon HJB equation (8) becomes

$$-v_t(x, t) = \min_u \left\{ \ell(x, u, t) + f(x, u)^T v_x(x, t) \right\} \tag{12}$$

Suppose a solution to the minimization problem in (12) is given by an optimal control law $\pi(x, t)$ which is differentiable in $x$. Setting $u = \pi(x, t)$ we can drop the min operator in (12) and write

$$0 = v_t(x, t) + \ell(x, \pi(x, t), t) + f(x, \pi(x, t))^T v_x(x, t)$$

This equation is valid for all $x$, and therefore can be differentiated w.r.t. $x$ to obtain (in shortcut notation)

$$0 = v_{tx} + \ell_x + \pi_x^T \ell_u + \bigl( f_x^T + \pi_x^T f_u^T \bigr) v_x + v_{xx} f$$
Regrouping terms, and using the identity $\dot{v}_x = v_{xx} \dot{x} + v_{tx} = v_{xx} f + v_{tx}$, yields

$$0 = \dot{v}_x + \ell_x + f_x^T v_x + \pi_x^T \bigl( \ell_u + f_u^T v_x \bigr)$$

We now make a key observation: the term in the brackets is the gradient w.r.t. $u$ of the quantity being minimized w.r.t. $u$ in (12). That gradient is zero (assuming unconstrained minimization), which leaves us with

$$-\dot{v}_x(x, t) = \ell_x(x, \pi(x, t), t) + f_x(x, \pi(x, t))^T v_x(x, t) \tag{13}$$

This may look like a PDE for $v$, but if we think of $v_x$ as a vector $p$ instead of a gradient of a function which depends on $x$, then (13) is an ordinary differential equation (ODE) for $p$. That equation holds along any trajectory generated by $\pi(x, t)$. The vector $p$ is called the costate vector.

We are now ready to formulate the maximum principle. If $\{x(t), u(t) : 0 \leq t \leq t_f\}$ is an optimal state-control trajectory (obtained by initializing $x(0)$ and controlling the system optimally until $t_f$), then there exists a costate trajectory $p(t)$ such that (13) holds with $p$ in place of $v_x$ and $u$ in place of $\pi$. The conditions on $\{x(t), u(t), p(t)\}$ are

$$\begin{aligned}
\dot{x}(t) &= f(x(t), u(t)) \\
-\dot{p}(t) &= \ell_x(x(t), u(t), t) + f_x(x(t), u(t))^T p(t) \\
u(t) &= \arg\min_u \left\{ \ell(x(t), u, t) + f(x(t), u)^T p(t) \right\}
\end{aligned} \tag{14}$$

The boundary condition is $p(t_f) = h_x(x(t_f))$, and $x(0), t_f$ are given. Clearly the costate is equal to the gradient of the optimal value function evaluated along the optimal trajectory.

The maximum principle can be written in more compact and symmetric form with the help of the Hamiltonian function

$$H(x, u, p, t) = \ell(x, u, t) + f(x, u)^T p \tag{15}$$

which is the quantity we have been minimizing w.r.t. $u$ all along (it was about time we gave it a name). With this definition, (14) becomes

$$\begin{aligned}
\dot{x}(t) &= \frac{\partial}{\partial p} H(x(t), u(t), p(t), t) \\
-\dot{p}(t) &= \frac{\partial}{\partial x} H(x(t), u(t), p(t), t) \\
u(t) &= \arg\min_u H(x(t), u, p(t), t)
\end{aligned} \tag{16}$$

The remarkable property of the maximum principle is that it is an ODE, even though we derived it starting from a PDE. An ODE is a consistency condition which singles out specific trajectories without reference to neighboring trajectories (as would be the case in a PDE). This is possible because the extremal trajectories which solve (14) make $H_u = \ell_u + f_u^T p$ vanish, which in turn removes the dependence on neighboring trajectories. The ODE (14) is a system of $2 n_x$ scalar equations subject to $2 n_x$ scalar boundary conditions. Therefore we can solve this system with standard boundary-value solvers (such as Matlab's bvp4c). The only complication is that we would have to minimize the Hamiltonian repeatedly. This complication is avoided for a class of problems where the control appears linearly in the dynamics and quadratically in the cost rate:

$$\begin{aligned}
\text{dynamics:} \quad & \dot{x} = a(x) + B(x)\, u \\
\text{cost rate:} \quad & \ell(x, u, t) = \tfrac{1}{2} u^T R(x)\, u + q(x, t)
\end{aligned}$$

In such problems the Hamiltonian is quadratic and can be minimized explicitly:

$$\begin{aligned}
H(x, u, p, t) &= \tfrac{1}{2} u^T R(x)\, u + q(x, t) + (a(x) + B(x)\, u)^T p \\
\arg\min_u H &= -R(x)^{-1} B(x)^T p
\end{aligned}$$

The computational complexity (or at least the storage requirement) for ODE solutions based on the maximum principle grows linearly with the state dimensionality $n_x$, and so the curse of dimensionality is avoided. One drawback is that (14) could have multiple solutions (one of which is the optimal solution) but in practice that does not appear to be a serious problem. Another drawback of course is that the solution to (14) is valid for a single initial state, and if the initial state were to change we would have to solve the problem again. If the state change is small, however, the solution change should also be small, and so we can speed up the search by initializing the ODE solver with the previous solution.

The maximum principle can be generalized in a number of ways including: terminal state constraints instead of "soft" final costs; state constraints at intermediate points along the trajectory; free (i.e. optimized) final time; first exit time; control constraints. It can also be applied in model-predictive control settings where one seeks an optimal state-control trajectory up to a fixed time horizon (and approximates the optimal value function at the horizon). The initial portion of this trajectory is used to control the system, and then a new optimal trajectory is computed. This is closely related to the idea of a rollout policy – which is essential in computer chess programs for example.
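As a concrete sketch, consider the illustrative scalar problem $\dot{x} = u$, $\ell = (x^2 + u^2)/2$, $h = 0$, $x(0) = 1$, $t_f = 1$ (an assumption for this example, not a problem from the text). Minimizing the Hamiltonian gives $u = -p$, and the resulting two-point boundary-value problem can be handed to a standard solver, here SciPy's `solve_bvp` (assuming SciPy is available):

```python
# Solving the BVP form of (14): xdot = -p, pdot = -x,
# with boundary conditions x(0) = 1 and p(t_f) = h_x = 0.
import numpy as np
from scipy.integrate import solve_bvp

def odes(t, y):
    x, p = y
    return np.vstack([-p, -x])            # xdot = -p, pdot = -x

def bcs(ya, yb):
    return np.array([ya[0] - 1.0,         # x(0) = 1
                     yb[1]])              # p(t_f) = h_x(x(t_f)) = 0

t = np.linspace(0.0, 1.0, 20)
sol = solve_bvp(odes, bcs, t, np.zeros((2, t.size)))
x_tf = sol.sol(1.0)[0]                    # analytic value: 1/cosh(1) ~ 0.648
```

For this linear problem the closed-form solution is $x(t) = \cosh t - \tanh(1)\sinh t$, so $x(t_f) = 1/\cosh(1)$, which the numerical solution reproduces.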

3.2 Derivation via Lagrange multipliers

The maximum principle can also be derived for discrete-time systems, as we show next. Note that the following derivation is actually the more standard one (in continuous time it relies on the calculus of variations). Consider the discrete-time optimal control problem

$$\begin{aligned}
\text{dynamics:} \quad & x_{k+1} = f(x_k, u_k) \\
\text{cost rate:} \quad & \ell(x_k, u_k, k) \\
\text{final cost:} \quad & h(x_n)
\end{aligned}$$

with given initial state $x_0$ and final time $n$. We can approach this as a regular constrained optimization problem: find sequences $(u_0, u_1, \ldots, u_{n-1})$ and $(x_0, x_1, \ldots, x_n)$ minimizing $J$ subject to constraints $x_{k+1} = f(x_k, u_k)$. Constrained optimization problems can be solved with the method of Lagrange multipliers. As a reminder, in order to minimize a scalar function $g(x)$ subject to $c(x) = 0$, we form the Lagrangian $L(x, \lambda) = g(x) + \lambda\, c(x)$ and look for a pair $(x, \lambda)$ such that $\frac{\partial L}{\partial x} = 0$ and $\frac{\partial L}{\partial \lambda} = 0$.

In our case there are $n$ constraints, so we need a sequence of $n$ Lagrange multipliers $(\lambda_1, \lambda_2, \ldots, \lambda_n)$. The Lagrangian is

$$L(x_\cdot, u_\cdot, \lambda_\cdot) = h(x_n) + \sum_{k=0}^{n-1} \left[ \ell(x_k, u_k, k) + (f(x_k, u_k) - x_{k+1})^T \lambda_{k+1} \right]$$

Define the discrete-time Hamiltonian

$$H^{(k)}(x, u, \lambda) = \ell(x, u, k) + f(x, u)^T \lambda$$

and rearrange the terms in the Lagrangian to obtain

$$L = h(x_n) - x_n^T \lambda_n + x_0^T \lambda_0 + \sum_{k=0}^{n-1} \left[ H^{(k)}(x_k, u_k, \lambda_{k+1}) - x_k^T \lambda_k \right]$$

Now consider differential changes in $L$ due to changes in $u$ which in turn lead to changes in $x$. We have

$$dL = (h_x(x_n) - \lambda_n)^T dx_n + \lambda_0^T dx_0 + \sum_{k=0}^{n-1} \left[ \left( \frac{\partial}{\partial x} H^{(k)} - \lambda_k \right)^T dx_k + \left( \frac{\partial}{\partial u} H^{(k)} \right)^T du_k \right]$$

In order to satisfy $\frac{\partial}{\partial x_k} L = 0$ we choose the Lagrange multipliers $\lambda$ to be

$$\begin{aligned}
\lambda_k &= \frac{\partial}{\partial x} H^{(k)} = \ell_x(x_k, u_k, k) + f_x(x_k, u_k)^T \lambda_{k+1}, \quad 0 \leq k < n \\
\lambda_n &= h_x(x_n)
\end{aligned}$$

For this choice of $\lambda$ the differential $dL$ becomes

$$dL = \lambda_0^T dx_0 + \sum_{k=0}^{n-1} \left( \frac{\partial}{\partial u} H^{(k)} \right)^T du_k \tag{17}$$

The first term in (17) is 0 because $x_0$ is fixed. The second term becomes 0 when $u_k$ is the (unconstrained) minimum of $H^{(k)}$. Summarizing the conditions for an optimal solution, we arrive at the discrete-time maximum principle:

$$\begin{aligned}
x_{k+1} &= f(x_k, u_k) \\
\lambda_k &= \ell_x(x_k, u_k, k) + f_x(x_k, u_k)^T \lambda_{k+1} \\
u_k &= \arg\min_u H^{(k)}(x_k, u, \lambda_{k+1})
\end{aligned} \tag{18}$$

with $\lambda_n = h_x(x_n)$, and $x_0, n$ given.

The similarity between the discrete-time (18) and the continuous-time (14) versions of the maximum principle is obvious. The costate $p$, which before was equal to the gradient $v_x$ of the optimal value function, is now a Lagrange multiplier $\lambda$. Thus we have three different names for the same quantity. It actually has yet another name: influence function. This is because $\lambda_0$ is the gradient of the minimal total cost w.r.t. the initial condition $x_0$ (as can be seen from (17)) and so $\lambda_0$ tells us how changes in the initial condition influence the total cost. The minimal total cost is of course equal to the optimal value function, thus $\lambda$ is the gradient of the optimal value function.

3.3 Numerical optimization via gradient descent

From (17) it is clear that the quantity

@

@u

H

(k)

= /

u

(x

k

, u

k

, /) +f

T

u

(x

k

, u

k

) `

k+1

(19)

is the gradient of the total cost with respect to the control signal. This also holds in the

continuous-time case. Once we have a gradient, we can optimize the (open-loop) control

sequence for given initial state via gradient descent. Here is the algorithm:

1. Given a control sequence, iterate the dynamics forward in time to …nd the correspond-

ing state sequence. Then iterate (18) backward in time to …nd the Lagrange multiplier

sequence. In the backward pass use the given control sequence instead of optimizing

the Hamiltonian.

12

2. Evaluate the gradient using (19), and improve the control sequence with any gradient

descent algorithm. Go back to step 1, or exit if converged.
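The two-step procedure above can be sketched in a few lines of code. The scalar system and costs below are hypothetical choices for illustration (not from the text): dynamics x_{k+1} = x_k + dt(u_k - sin x_k), cost rate (dt/2)u^2, and final cost (1/2)(x_n - 1)^2. The backward pass implements (18) with the given controls, and the gradient is (19).

```python
import numpy as np

dt, n = 0.1, 50

def f(x, u):   return x + dt * (u - np.sin(x))
def f_x(x, u): return 1.0 - dt * np.cos(x)   # df/dx
def f_u(x, u): return dt                     # df/du
def l_u(x, u): return dt * u                 # dl/du (cost rate 0.5*dt*u^2)
def l_x(x, u): return 0.0                    # dl/dx
def h(x):      return 0.5 * (x - 1.0) ** 2   # final cost
def h_x(x):    return x - 1.0

def total_cost(x0, u):
    x, c = x0, 0.0
    for uk in u:
        c += 0.5 * dt * uk ** 2
        x = f(x, uk)
    return c + h(x)

def gradient(x0, u):
    # Step 1: forward pass for states, backward pass (18) for multipliers.
    xs = [x0]
    for uk in u:
        xs.append(f(xs[-1], uk))
    lam = h_x(xs[-1])                        # lambda_n = h_x(x_n)
    g = np.zeros_like(u)
    for k in range(n - 1, -1, -1):
        g[k] = l_u(xs[k], u[k]) + f_u(xs[k], u[k]) * lam   # (19)
        lam = l_x(xs[k], u[k]) + f_x(xs[k], u[k]) * lam    # (18), given u_k
    return g

x0, u = 0.0, np.zeros(n)
c0 = total_cost(x0, u)
for it in range(200):                        # Step 2: plain gradient descent.
    u -= 0.5 * gradient(x0, u)
c1 = total_cost(x0, u)
print(c0, c1)                                # the total cost decreases
```

In practice the plain descent step would be replaced by one of the second-order methods discussed next, but the forward/backward structure of the gradient computation is the same.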

As always, gradient descent in high-dimensional spaces is much more efficient if one uses a second-order method (conjugate gradient descent, Levenberg-Marquardt, Gauss-Newton, etc). Care should be taken to ensure stability of the second-order optimization, e.g. via line-search or trust-region methods. To avoid local minima – which correspond to extremal trajectories that are not optimal – one could use multiple restarts with random initialization of the control sequence. Note that extremal trajectories satisfy the maximum principle, and so an ODE solver can get trapped in the same suboptimal solutions as a gradient descent method.

3.4 Relation to classical mechanics

We now return to the continuous-time maximum principle, and note that (16) resembles the Hamiltonian formulation of mechanics, with p being the generalized momentum (a fifth name for the same quantity). To see where the resemblance comes from, recall the Lagrange problem: given x(0) and x(t_f), find curves \{x(t) : 0 \le t \le t_f\} which optimize

J(x(\cdot)) = \int_0^{t_f} L(x(t), \dot x(t)) \, dt \qquad (20)

Applying the calculus of variations, one finds that extremal curves (either maxima or minima of J) satisfy the Euler-Lagrange equation

\frac{d}{dt} \frac{\partial}{\partial \dot x} L(x, \dot x) - \frac{\partial}{\partial x} L(x, \dot x) = 0

Its solutions are known to be equivalent to the solutions of Hamilton's equation

\dot x = \frac{\partial}{\partial p} H(x, p) \qquad (21)
-\dot p = \frac{\partial}{\partial x} H(x, p)

where the Hamiltonian is defined as

H(x, p) = p^T \dot x - L(x, \dot x) \qquad (22)

The change of coordinates p = \frac{\partial}{\partial \dot x} L(x, \dot x) is called a Legendre transformation. It may seem strange that H(x, p) depends on \dot x when \dot x is not explicitly an argument of H. This is because the Legendre transformation is invertible, i.e. \dot x can be recovered from (x, p) as long as the matrix \frac{\partial^2}{\partial \dot x \partial \dot x} L is non-singular.

Thus the trajectories satisfying Hamilton's equation (21) are solutions to the Lagrange optimization problem (20). In order to explicitly transform (20) into an optimal control problem, define a control signal u and deterministic dynamics \dot x = f(x, u). Then the cost rate \ell(x, u) = -L(x, f(x, u)) = -L(x, \dot x) yields an optimal control problem equivalent to (20). The Hamiltonian (22) becomes H(x, p) = p^T f(x, u) + \ell(x, u), which is the optimal-control Hamiltonian (15). Note that we can choose any dynamics f(x, u), and then define the corresponding cost rate \ell(x, u) so as to make the optimal control problem equivalent to the Lagrange problem (20). The simplest choice is f(x, u) = u.


The function L is interpreted as an energy function. In mechanics it is

L(x, \dot x) = \frac{1}{2} \dot x^T M(x) \dot x - g(x)

The first term is kinetic energy (with M(x) being the inertia matrix), and the second term is potential energy due to gravity or some other force field. When the inertia is constant, applying the Euler-Lagrange equation to the above L yields Newton's second law M \ddot x = -g_x(x), where the force -g_x(x) is the gradient of the potential field. If the inertia is not constant (joint-space inertia for a multi-joint arm, for example) application of the Euler-Lagrange equation yields extra terms which capture nonlinear interaction forces. Geometrically, they contain Christoffel symbols of the Levi-Civita connection for the Riemannian metric given by \langle y, z \rangle_x = y^T M(x) z. We will not discuss differential geometry in any detail here, but it is worth noting that it affords a coordinate-free treatment, revealing intrinsic properties of the dynamical system that are invariant with respect to arbitrary smooth changes of coordinates. Such invariant quantities are called tensors. For example, the metric (inertia in our case) is a tensor.
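The equivalence between the Euler-Lagrange/Hamiltonian picture and Newton's law is easy to check numerically. The example below is a hypothetical one-dimensional system (not from the text): L(x, \dot x) = (m/2)\dot x^2 - g(x) with g(x) = (k/2)x^2, for which Hamilton's equations (21) with H = p^2/(2m) + g(x), p = m\dot x, should reproduce the trajectory of M\ddot x = -g_x(x).

```python
import numpy as np

m, k, dt, steps = 2.0, 3.0, 1e-4, 50000

# Hamilton's equations: xdot = dH/dp = p/m,  -pdot = dH/dx = k*x
x_h, p_h = 1.0, 0.0
# Newton's second law: m*xddot = -k*x
x_n, v_n = 1.0, 0.0
for _ in range(steps):
    # symplectic Euler for both integrations (momentum/velocity updated first)
    p_h -= dt * k * x_h
    x_h += dt * p_h / m
    v_n -= dt * (k / m) * x_n
    x_n += dt * v_n

print(x_h, x_n)   # the two trajectories coincide (analytically cos(sqrt(k/m)*t))
```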

4 Linear-quadratic-Gaussian control: Riccati equations

Optimal control laws can rarely be obtained in closed form. One notable exception is

the LQG case where the dynamics are linear, the costs are quadratic, and the noise (if

present) is additive Gaussian. This makes the optimal value function quadratic, and allows

minimization of the Hamiltonian in closed form. Here we derive the LQG optimal controller

in continuous and discrete time.

4.1 Derivation via the HJB equations

Consider the following stochastic optimal control problem:

dynamics:   dx = (A x + B u) dt + F dw
cost rate:  \ell(x, u) = \frac{1}{2} u^T R u + \frac{1}{2} x^T Q x
final cost: h(x) = \frac{1}{2} x^T Q_f x

where R is symmetric positive-definite, Q and Q_f are symmetric, and u is now unconstrained. We set S = F F^T as before. The matrices A, B, F, R, Q can be made time-varying without complicating the derivation below.

In order to solve for the optimal value function we will guess its parametric form, show that it satisfies the HJB equation (8), and obtain ODEs for its parameters. Our guess is

v(x, t) = \frac{1}{2} x^T V(t) x + a(t) \qquad (23)

where V(t) is symmetric. The boundary condition v(x, t_f) = h(x) implies V(t_f) = Q_f and a(t_f) = 0. From (23) we can compute the derivatives which enter into the HJB equation:

v_t(x, t) = \frac{1}{2} x^T \dot V(t) x + \dot a(t)
v_x(x, t) = V(t) x
v_{xx}(x, t) = V(t)


Substituting these expressions in (8) yields

-\frac{1}{2} x^T \dot V(t) x - \dot a(t) = \min_u \Big\{ \frac{1}{2} u^T R u + \frac{1}{2} x^T Q x + (A x + B u)^T V(t) x + \frac{1}{2} \mathrm{tr}(S V(t)) \Big\}

The Hamiltonian (i.e. the term inside the min operator) is quadratic in u, and its Hessian R is positive-definite, so the optimal control can be found analytically:

u = -R^{-1} B^T V(t) x \qquad (24)

With this u, the control-dependent part of the Hamiltonian becomes

\frac{1}{2} u^T R u + (B u)^T V(t) x = -\frac{1}{2} x^T V(t) B R^{-1} B^T V(t) x

After grouping terms, the HJB equation reduces to

-\frac{1}{2} x^T \dot V(t) x - \dot a(t) = \frac{1}{2} x^T \Big( Q + A^T V(t) + V(t) A - V(t) B R^{-1} B^T V(t) \Big) x + \frac{1}{2} \mathrm{tr}(S V(t))

where we replaced the term 2 A^T V with A^T V + V A to make the equation symmetric. This is justified because x^T A^T V x = x^T V^T A x = x^T V A x.

Our guess of the optimal value function is correct if and only if the above equation holds for all x, which is the case when the x-dependent terms are matched:

-\dot V(t) = Q + A^T V(t) + V(t) A - V(t) B R^{-1} B^T V(t) \qquad (25)
-\dot a(t) = \frac{1}{2} \mathrm{tr}(S V(t))

Functions V, a satisfying (25) can obviously be found by initializing V(t_f) = Q_f, a(t_f) = 0 and integrating the ODEs (25) backward in time. Thus (23) is the optimal value function with V, a given by (25), and (24) is the optimal control law (which in this case is unique).

The first line of (25) is called a continuous-time Riccati equation. Note that it does not depend on the noise covariance S. Consequently the optimal control law (24) is also independent of S. The only effect of S is on the total cost. As a corollary, the optimal control law remains the same in the deterministic case – called the linear-quadratic regulator (LQR).
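The backward integration of (25) can be sketched directly. The matrices below (a double integrator with illustrative Q, R, Q_f) are hypothetical choices, and simple Euler steps stand in for a proper ODE solver:

```python
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])   # double integrator
B = np.array([[0.0], [1.0]])
Q = np.eye(2) * 0.1
R = np.array([[1.0]])
Qf = np.eye(2) * 10.0

dt, tf = 1e-3, 5.0
V = Qf.copy()                            # boundary condition V(t_f) = Q_f
for _ in range(int(tf / dt)):
    # -Vdot = Q + A'V + VA - V B R^{-1} B' V, stepped backward in time
    Vdot = -(Q + A.T @ V + V @ A - V @ B @ np.linalg.solve(R, B.T @ V))
    V -= dt * Vdot
L = np.linalg.solve(R, B.T @ V)          # feedback gain of the control law (24)
print(L)
```

A quick sanity check is that V stays symmetric positive-definite and that the closed-loop matrix A - B L is stable (eigenvalues with negative real part).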

4.2 Derivation via the Bellman equations

In practice one usually works with discrete-time systems. To obtain an optimal control law for the discrete-time case one could use an Euler approximation to (25), but the resulting equation is missing terms quadratic in the time step \Delta, as we will see below. Instead we apply dynamic programming directly, and obtain an exact solution to the discrete-time LQR problem. Dropping the (irrelevant) noise and discretizing the problem, we obtain

dynamics:   x_{k+1} = A x_k + B u_k
cost rate:  \frac{1}{2} u_k^T R u_k + \frac{1}{2} x_k^T Q x_k
final cost: \frac{1}{2} x_n^T Q_f x_n

where n = t_f / \Delta, and the correspondence to the continuous-time problem is

x_k \leftrightarrow x(k \Delta), \quad A \leftrightarrow (I + \Delta A), \quad B \leftrightarrow \Delta B, \quad R \leftrightarrow \Delta R, \quad Q \leftrightarrow \Delta Q \qquad (26)

The guess for the optimal value function is again quadratic

v(x, k) = \frac{1}{2} x^T V_k x

with boundary condition V_n = Q_f. The Bellman equation (2) is

\frac{1}{2} x^T V_k x = \min_u \Big\{ \frac{1}{2} u^T R u + \frac{1}{2} x^T Q x + \frac{1}{2} (A x + B u)^T V_{k+1} (A x + B u) \Big\}

As in the continuous-time case the Hamiltonian can be minimized analytically. The resulting optimal control law is

u = -\big( R + B^T V_{k+1} B \big)^{-1} B^T V_{k+1} A x

Substituting this u in the Bellman equation, we obtain

V_k = Q + A^T V_{k+1} A - A^T V_{k+1} B \big( R + B^T V_{k+1} B \big)^{-1} B^T V_{k+1} A \qquad (27)

This completes the proof that the optimal value function is in the assumed quadratic form. To compute V_k for all k we initialize V_n = Q_f and iterate (27) backward in time.

The optimal control law is linear in x, and is usually written as

u_k = -L_k x_k \qquad (28)

where L_k = \big( R + B^T V_{k+1} B \big)^{-1} B^T V_{k+1} A

The time-varying matrix L_k is called the control gain. It does not depend on the sequence of states, and therefore can be computed offline. Equation (27) is called a discrete-time Riccati equation. Clearly the discrete-time Riccati equation contains more terms than the continuous-time Riccati equation (25), and so the two are not identical. However one can verify that they become identical in the limit \Delta \to 0. To this end replace the matrices in (27) with their continuous-time analogues (26), and after rearrangement obtain

V_k - V_{k+1} = \Delta \Big( Q + A^T V_{k+1} + V_{k+1} A - V_{k+1} B \big( R + \Delta B^T V_{k+1} B \big)^{-1} B^T V_{k+1} \Big) + o(\Delta^2)

where o(\Delta^2) absorbs terms that are second-order in \Delta. Dividing by \Delta and taking the limit \Delta \to 0 yields the continuous-time Riccati equation (25).
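The discrete-time recursion (27) and gains (28) are a few lines of code. The system below is a hypothetical discretized double integrator; A, B, Q, R, Q_f are illustrative choices following the correspondence (26):

```python
import numpy as np

dt = 0.01
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2) * dt * 0.1
R = np.array([[1.0]]) * dt
Qf = np.eye(2) * 10.0

n = 500
V = Qf.copy()                      # V_n = Q_f
gains = [None] * n
for k in range(n - 1, -1, -1):     # iterate (27) backward in time
    G = np.linalg.solve(R + B.T @ V @ B, B.T @ V @ A)   # control gain L_k
    gains[k] = G
    V = Q + A.T @ V @ A - A.T @ V @ B @ G

# Simulate the closed loop u_k = -L_k x_k from some initial state.
x = np.array([1.0, 0.0])
for k in range(n):
    x = A @ x - B @ (gains[k] @ x)
print(x)                           # the state is driven toward the origin
```

Note that all gains are computed offline, before the state trajectory is known, exactly as the text says.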

4.3 Applications to nonlinear problems

Apart from solving LQG problems, the methodology described here can be adapted to yield

approximate solutions to non-LQG optimal control problems. This is done iteratively, as

follows:

1. Given a control sequence, apply it to the (nonlinear) dynamics and obtain a corresponding state sequence.


2. Construct a time-varying linear approximation to the dynamics and a time-varying

quadratic approximation to the cost; both approximations are centered at the state-

control sequence obtained in step 1. This yields an LQG optimal control problem

with respect to the state and control deviations.

3. Solve the resulting LQG problem, obtain the control deviation sequence, and add it to the given control sequence. Go to step 1, or exit if converged. Note that multiplying the deviation sequence by a number smaller than 1 can be used to implement line-search.

Another possibility is to use differential dynamic programming (DDP), which is based on the same idea but involves a second-order rather than a first-order approximation to the dynamics. In that case the approximate problem is not LQG, however one can assume a quadratic approximation to the optimal value function and derive Riccati-like equations for its parameters. DDP and iterative LQG (iLQG) have second-order convergence in the neighborhood of an optimal solution. They can be thought of as the analog of Newton's method in the domain of optimal control. Unlike general-purpose second-order methods which construct Hessian approximations using gradient information, DDP and iLQG obtain the Hessian directly by exploiting the problem structure. For deterministic problems they converge to state-control trajectories which satisfy the maximum principle, but in addition yield local feedback control laws. In our experience they are more efficient than either ODE or gradient descent methods. iLQG has been generalized to stochastic systems (including multiplicative noise) and to systems subject to control constraints.
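A minimal sketch of the three-step iteration above, for a hypothetical scalar nonlinear problem (the system, costs, and all numbers are illustrative, not from the text). Each pass rolls out a nominal trajectory, linearizes the dynamics around it, does an LQR-style backward pass on the deviations, and line-searches on the deviation sequence as suggested in step 3:

```python
import numpy as np

# x_{k+1} = x + dt*(u - sin(x)); cost 0.5*dt*u^2 per step, final 25*(x_n - pi/2)^2
dt, n, target = 0.05, 60, np.pi / 2

def step(x, u): return x + dt * (u - np.sin(x))

def cost(x0, u):
    x, c = x0, 0.0
    for uk in u:
        c += 0.5 * dt * uk ** 2
        x = step(x, uk)
    return c + 25.0 * (x - target) ** 2

x0, u = 0.0, np.zeros(n)
c_init = c_prev = cost(x0, u)
for it in range(30):
    # 1. roll out the nominal trajectory
    xs = [x0]
    for uk in u:
        xs.append(step(xs[-1], uk))
    # 2. time-varying linearization around the nominal; LQR backward pass on
    #    deviations, with value-to-go 0.5*s2*dx^2 + s1*dx
    a = [1.0 - dt * np.cos(x) for x in xs[:-1]]
    b = dt
    s2, s1 = 50.0, 50.0 * (xs[-1] - target)
    du_open, K = np.zeros(n), np.zeros(n)
    for k in range(n - 1, -1, -1):
        h = dt + b * s2 * b                  # Hessian in du
        g = dt * u[k] + b * s1               # gradient in du at du = 0
        du_open[k] = -g / h
        K[k] = -(b * s2 * a[k]) / h          # feedback on the state deviation
        s1, s2 = a[k] * (s1 + s2 * b * du_open[k]), a[k] * s2 * (a[k] + b * K[k])
    # 3. line search: scale the open-loop deviation, keep only improvements
    for alpha in (1.0, 0.5, 0.25, 0.125, 0.0625):
        x, u_new = x0, np.zeros(n)
        for k in range(n):
            u_new[k] = u[k] + alpha * du_open[k] + K[k] * (x - xs[k])
            x = step(x, u_new[k])
        c_new = cost(x0, u_new)
        if c_new < c_prev:
            u, c_prev = u_new, c_new
            break
print(c_init, c_prev)   # total cost decreases across iterations
```

This is only a first-order (iLQG-flavored) scheme; DDP would additionally carry second derivatives of the dynamics in the backward pass.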

5 Optimal estimation: Kalman filter

Optimal control is closely related to optimal estimation, for two reasons: (i) the only way to achieve optimal performance in the presence of sensor noise and delays is to incorporate an optimal estimator in the control system; (ii) the two problems are dual, as explained below. The most widely used optimal estimator is the Kalman filter. It is the dual of the linear-quadratic regulator – which in turn is the most widely used optimal controller.

5.1 The Kalman filter

Consider the partially-observable linear dynamical system

dynamics:    x_{k+1} = A x_k + w_k \qquad (29)
observation: y_k = H x_k + v_k

where w_k \sim N(0, S) and v_k \sim N(0, P) are independent Gaussian random variables, the initial state has a Gaussian prior distribution x_0 \sim N(\hat x_0, \Sigma_0), and A, H, S, P, \hat x_0, \Sigma_0 are known. The states are hidden and all we have access to are the observations. The objective is to compute the posterior probability distribution \hat p_k of x_k given observations y_{k-1}, \ldots, y_0:

\hat p_k = p(x_k \,|\, y_{k-1}, \ldots, y_0)
\hat p_0 = N(\hat x_0, \Sigma_0)


Note that our formulation is somewhat unusual: we are estimating x_k before y_k has been observed. This formulation is adopted here because it simplifies the results and also because most real-world sensors provide delayed measurements.

We will show by induction (moving forward in time) that \hat p_k is Gaussian for all k, and therefore can be represented by its mean \hat x_k and covariance matrix \Sigma_k. This holds for k = 0 by definition. The Markov property of (29) implies that the posterior \hat p_k can be treated as a prior over x_k for the purposes of estimation after time k. Since \hat p_k is Gaussian and (29) is linear-Gaussian, the joint distribution of x_{k+1} and y_k is also Gaussian. Its mean and covariance given the prior \hat p_k are easily computed:

E \begin{bmatrix} x_{k+1} \\ y_k \end{bmatrix} = \begin{bmatrix} A \hat x_k \\ H \hat x_k \end{bmatrix}, \quad
\mathrm{Cov} \begin{bmatrix} x_{k+1} \\ y_k \end{bmatrix} = \begin{bmatrix} S + A \Sigma_k A^T & A \Sigma_k H^T \\ H \Sigma_k A^T & P + H \Sigma_k H^T \end{bmatrix}

Now we need to compute the probability of x_{k+1} conditional on the new observation y_k. This is done using an important property of multivariate Gaussians summarized in the following lemma:

Let p and q be jointly Gaussian, with means \bar p and \bar q and covariances \Sigma_{pp}, \Sigma_{qq} and \Sigma_{pq} = \Sigma_{qp}^T. Then the conditional distribution of p given q is Gaussian, with mean and covariance

E[p \,|\, q] = \bar p + \Sigma_{pq} \Sigma_{qq}^{-1} (q - \bar q)
\mathrm{Cov}[p \,|\, q] = \Sigma_{pp} - \Sigma_{pq} \Sigma_{qq}^{-1} \Sigma_{qp}
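The lemma can be checked numerically against an equivalent route through the joint precision (inverse covariance) matrix, whose p-block is the inverse of the conditional covariance. The block values below are arbitrary illustrative choices:

```python
import numpy as np

Spp = np.array([[2.0, 0.3], [0.3, 1.0]])
Spq = np.array([[0.5, 0.1], [0.2, 0.4]])
Sqq = np.array([[1.5, 0.2], [0.2, 1.2]])
joint = np.block([[Spp, Spq], [Spq.T, Sqq]])

# Lemma: Cov[p|q] = Spp - Spq Sqq^{-1} Sqp
cov_cond = Spp - Spq @ np.linalg.solve(Sqq, Spq.T)

# Equivalent route: invert the p-block of the joint precision matrix
prec = np.linalg.inv(joint)
cov_cond2 = np.linalg.inv(prec[:2, :2])
print(np.max(np.abs(cov_cond - cov_cond2)))   # ~0
```

The conditional-mean gain satisfies the matching identity \Sigma_{pq}\Sigma_{qq}^{-1} = -\Lambda_{pp}^{-1}\Lambda_{pq} in terms of the precision blocks.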

Applying the lemma to our problem, we see that \hat p_{k+1} is Gaussian with mean

\hat x_{k+1} = A \hat x_k + A \Sigma_k H^T \big( P + H \Sigma_k H^T \big)^{-1} (y_k - H \hat x_k) \qquad (30)

and covariance matrix

\Sigma_{k+1} = S + A \Sigma_k A^T - A \Sigma_k H^T \big( P + H \Sigma_k H^T \big)^{-1} H \Sigma_k A^T \qquad (31)

This completes the induction proof. Equation (31) is a Riccati equation. Equation (30) is usually written as

\hat x_{k+1} = A \hat x_k + K_k (y_k - H \hat x_k)

where K_k = A \Sigma_k H^T \big( P + H \Sigma_k H^T \big)^{-1}

The time-varying matrix K_k is called the filter gain. It does not depend on the observation sequence and therefore can be computed offline. The quantity y_k - H \hat x_k is called the innovation. It is the mismatch between the observed and the expected measurement. The covariance \Sigma_k of the posterior probability distribution p(x_k \,|\, y_{k-1}, \ldots, y_0) is the estimation error covariance. The estimation error is x_k - \hat x_k.
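The recursion (30)-(31) is short in code. Below is a minimal sketch on a simulated position/velocity system; A, H and the noise covariances are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])               # only position is observed
S = np.eye(2) * 1e-4                     # process noise covariance
P = np.array([[0.01]])                   # observation noise covariance

xhat = np.zeros(2)                       # \hat x_0
Sig = np.eye(2)                          # \Sigma_0
x = np.array([0.0, 0.5])                 # true initial state

for k in range(200):
    y = H @ x + rng.multivariate_normal(np.zeros(1), P)     # delayed-style measurement
    K = A @ Sig @ H.T @ np.linalg.inv(P + H @ Sig @ H.T)    # filter gain K_k
    xhat = A @ xhat + K @ (y - H @ xhat)                    # (30): predict + innovation
    Sig = S + A @ Sig @ A.T - K @ H @ Sig @ A.T             # (31): Riccati update
    x = A @ x + rng.multivariate_normal(np.zeros(2), S)     # true dynamics

print(np.abs(x - xhat))   # the estimate tracks the true state
```

Note that K and \Sigma never touch the data, so they could be precomputed offline, as stated above.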

The above derivation corresponds to the discrete-time Kalman filter. A similar result holds in continuous time, and is called the Kalman-Bucy filter. It is possible to write down the Kalman filter in equivalent forms which have numerical advantages. One such approach is to propagate the matrix square root of \Sigma. This is called a square-root filter, and involves Riccati-like equations which are more stable because the dynamic range of the elements of \Sigma is reduced. Another approach is to propagate the inverse covariance \Sigma^{-1}. This is called an information filter, and again involves Riccati-like equations. The information filter can represent numerically very large covariances (and even infinite covariances – which are useful for specifying "non-informative" priors).

Instead of filtering one can do smoothing, i.e. obtain state estimates using observations from the past and from the future. In that case the posterior probability of each state given all observations is still Gaussian, and its parameters can be found by an additional backward pass known as Rauch recursion. The resulting Kalman smoother is closely related to the forward-backward algorithm for probabilistic inference in hidden Markov models (HMMs).

The Kalman filter is optimal in many ways. First of all it is a Bayesian filter, in the sense that it computes the posterior probability distribution over the hidden state. In addition, the mean \hat x is the optimal point estimator with respect to multiple loss functions. Recall that optimality of point estimators is defined through a loss function \ell(x, \hat x) which quantifies how bad it is to estimate \hat x when the true state is x. Some possible loss functions are \|x - \hat x\|^2, \|x - \hat x\|, \delta(x - \hat x), and the corresponding optimal estimators are the mean, median, and mode of the posterior probability distribution. If the posterior is Gaussian then the mean, median and mode coincide. If we choose an unusual loss function for which the optimal point estimator is not \hat x, even though the posterior is Gaussian, the information contained in \hat x and \Sigma is still sufficient to compute the optimal point estimator. This is because \hat x and \Sigma are sufficient statistics which fully describe the posterior probability distribution, which in turn captures all information about the state that is available in the observation sequence. The set of sufficient statistics can be thought of as an augmented state, with respect to which the partially-observed process has a Markov property. It is called the belief state or alternatively the information state.

5.2 Beyond the Kalman filter

When the estimation problem involves non-linear dynamics or non-Gaussian noise the posterior probability distribution rarely has a finite set of sufficient statistics (although there are exceptions such as the Benes system). In that case one has to rely on numerical approximations. The most widely used approximation is the extended Kalman filter (EKF). It relies on local linearization centered at the current state estimate and closely resembles the LQG approximation to non-LQG optimal control problems. The EKF is not guaranteed to be optimal in any sense, but in practice it often yields good results – especially when the posterior is single-peaked. There is a recent improvement, called the unscented filter, which propagates the covariance using deterministic sampling instead of linearization of the system dynamics. The unscented filter tends to be superior to the EKF and requires a comparable amount of computation. An even more accurate, although computationally more expensive, approach is particle filtering. Instead of propagating a Gaussian approximation it propagates a cloud of points sampled from the posterior (without actually computing the posterior). Key to its success is the idea of importance sampling.

Even when the posterior does not have a finite set of sufficient statistics, it is still a well-defined scalar function over the state space, and as such must obey some equation. In discrete time this equation is simply a recursive version of Bayes' rule – which is not too revealing. In continuous time, however, the posterior satisfies a PDE which resembles the HJB equation. Before we present this result we need some notation. Consider the stochastic differential equations

dx = f(x) dt + F(x) dw \qquad (32)
dy = h(x) dt + dv

where w(t) and v(t) are Brownian motion processes, x(t) is the hidden state, and y(t) is the observation sequence. Define S(x) = F(x) F(x)^T as before. One would normally think of the increments of y(t) as being the observations, but in continuous time these increments are infinite and so we work with their time-integral.

Let p(x, t) be the probability distribution of x(t) in the absence of any observations. At t = 0 it is initialized with a given prior p(x, 0). For t > 0 it is governed by the first line of (32), and can be shown to satisfy

p_t = -f^T p_x + \frac{1}{2} \mathrm{tr}(S p_{xx}) + \Big( -\sum_i \frac{\partial}{\partial x_i} f_i + \frac{1}{2} \sum_{ij} \frac{\partial^2}{\partial x_i \partial x_j} S_{ij} \Big) p

This is called the forward Kolmogorov equation, or alternatively the Fokker-Planck equation. We have written it in expanded form to emphasize the resemblance to the HJB equation. The more usual form is

\frac{\partial}{\partial t} p = -\sum_i \frac{\partial}{\partial x_i} (f_i p) + \frac{1}{2} \sum_{ij} \frac{\partial^2}{\partial x_i \partial x_j} (S_{ij} p)

Let \tilde p(x, t) be an unnormalized posterior over x(t) given the observations \{y(s) : 0 \le s \le t\}; "unnormalized" means that the actual posterior \hat p(x, t) can be recovered by normalizing: \hat p(x, t) = \tilde p(x, t) / \int \tilde p(z, t) dz. It can be shown that some unnormalized posterior \tilde p satisfies Zakai's equation

d\tilde p = \Big( -\sum_i \frac{\partial}{\partial x_i} (f_i \tilde p) + \frac{1}{2} \sum_{ij} \frac{\partial^2}{\partial x_i \partial x_j} (S_{ij} \tilde p) \Big) dt + h^T \tilde p \, dy

The first term on the right reflects the prior and is the same as in the Kolmogorov equation (except that we have multiplied both sides by dt). The second term incorporates the observation and makes Zakai's equation a stochastic PDE. After certain manipulations (conversion to Stratonovich form and a gauge transformation) the second term can be integrated by parts, leading to a regular PDE. One can then approximate the solution to that PDE numerically via discretization methods. As in the HJB equation, however, such methods are only applicable in low-dimensional spaces due to the curse of dimensionality.

6 Duality of optimal control and optimal estimation

Optimal control and optimal estimation are closely related mathematical problems. The best-known example is the duality of the linear-quadratic regulator and the Kalman filter. To see that duality more clearly, we repeat the corresponding Riccati equations (27) and (31) side by side:

control:   V_k = Q + A^T V_{k+1} A - A^T V_{k+1} B \big( R + B^T V_{k+1} B \big)^{-1} B^T V_{k+1} A
filtering: \Sigma_{k+1} = S + A \Sigma_k A^T - A \Sigma_k H^T \big( P + H \Sigma_k H^T \big)^{-1} H \Sigma_k A^T

These equations are identical up to a time reversal and some matrix transposes. The correspondence is given in the following table:

control:   V        Q    A      B      R    k
filtering: \Sigma   S    A^T    H^T    P    n - k        (33)

The above duality was first described by Kalman in his famous 1960 paper introducing the discrete-time Kalman filter, and is now mentioned in most books on estimation and control. However its origin and meaning are not apparent. This is because the Kalman filter is optimal from multiple points of view and can be written in multiple forms, making it hard to tell which of its properties have a dual in the control domain.

Attempts to generalize the duality to non-LQG settings have revealed that the fundamental relationship is between the optimal value function and the negative log-posterior. This is actually inconsistent with (33), although the inconsistency has not been made explicit before. Recall that V is the Hessian of the optimal value function. The posterior is Gaussian with covariance matrix \Sigma, and thus the Hessian of the negative log-posterior is \Sigma^{-1}. However in (33) we have V corresponding to \Sigma and not to \Sigma^{-1}. Another problem is that while A^T in (33) makes sense for linear dynamics \dot x = A x, the meaning of "transpose" for general non-linear dynamics \dot x = f(x) is unclear. Thus the duality described by Kalman is specific to linear-quadratic-Gaussian systems and does not generalize.
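The correspondence table can be verified numerically: running one step of the control Riccati (27) with (Q, A, B, R) replaced by (S, A^T, H^T, P) should reproduce one step of the filter Riccati (31). The matrices below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)) * 0.3
H = rng.standard_normal((2, 3))
S = np.eye(3) * 0.2                       # process noise covariance
P = np.eye(2) * 0.5                       # observation noise covariance

def control_step(V, A, B, Q, R):          # one step of (27), backward in time
    G = np.linalg.solve(R + B.T @ V @ B, B.T @ V @ A)
    return Q + A.T @ V @ A - A.T @ V @ B @ G

def filter_step(Sig, A, H, S, P):         # one step of (31), forward in time
    K = A @ Sig @ H.T @ np.linalg.inv(P + H @ Sig @ H.T)
    return S + A @ Sig @ A.T - K @ H @ Sig @ A.T

X = np.eye(3)
lhs = control_step(X, A.T, H.T, S, P)     # the substitution from table (33)
rhs = filter_step(X, A, H, S, P)
print(np.max(np.abs(lhs - rhs)))          # ~0
```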

6.1 General duality of optimal control and MAP smoothing

We now discuss an alternative approach which does generalize. In fact we start with the general case and later specialize it to the LQG setting to obtain something known as a minimum-energy estimator. As far as we know, the general treatment presented here is a novel result. The duality we establish is between maximum a posteriori (MAP) smoothing and deterministic optimal control for systems with continuous state. For systems with discrete state (i.e. HMMs) an instance of such duality is the Viterbi algorithm – which finds the most likely state sequence using dynamic programming.

Consider the discrete-time partially-observable system

p(x_{k+1} \,|\, x_k) = \exp(-a(x_{k+1}, x_k)) \qquad (34)
p(y_k \,|\, x_k) = \exp(-b(y_k, x_k))
p(x_0) = \exp(-c(x_0))

where a, b, c are the negative log-probabilities of the state transitions, observation emissions, and initial state respectively. The states are hidden and we only have access to the observations (y_1, y_2, \ldots, y_n). Our objective is to find the most probable sequence of states (x_0, x_1, \ldots, x_n), that is, the sequence which maximizes the posterior probability

p(x_{0..n} \,|\, y_{1..n}) = \frac{p(y_{1..n} \,|\, x_{0..n}) \, p(x_{0..n})}{p(y_{1..n})}

The term p(y_{1..n}) does not affect the maximization and so it can be dropped. Using the


Markov property of (34) we have

p(y_{1..n} \,|\, x_{0..n}) \, p(x_{0..n}) = p(x_0) \prod_{k=1}^{n} p(x_k \,|\, x_{k-1}) \, p(y_k \,|\, x_k)
 = \exp(-c(x_0)) \prod_{k=1}^{n} \exp(-a(x_k, x_{k-1})) \exp(-b(y_k, x_k))
 = \exp\Big( -c(x_0) - \sum_{k=1}^{n} \big( a(x_k, x_{k-1}) + b(y_k, x_k) \big) \Big)

Maximizing \exp(-J) is equivalent to minimizing J. Therefore the most probable state sequence is the one which minimizes

J(x_{0..n}) = c(x_0) + \sum_{k=1}^{n} \big( a(x_k, x_{k-1}) + b(y_k, x_k) \big) \qquad (35)

This is beginning to look like a total cost for a deterministic optimal control problem. However we are still missing a control signal. To remedy that we will define the passive dynamics as the expected state transition:

f(x_k) = E[x_{k+1} \,|\, x_k] = \int z \exp(-a(z, x_k)) \, dz

and then define the control signal as the deviation from the expected state transition:

x_{k+1} = f(x_k) + u_k \qquad (36)

The control cost is now defined as

r(u_k, x_k) = a(f(x_k) + u_k, x_k), \quad 0 \le k < n

and the state cost is defined as

q(x_0, 0) = c(x_0)
q(x_k, k) = b(y_k, x_k), \quad 0 < k \le n

The observation sequence is fixed, and so q is well-defined as long as it depends explicitly on the time index k. Note that we could have chosen any f, however the present choice will make intuitive sense later.

With these definitions, the control system with dynamics (36), cost rate

\ell(x_k, u_k, k) = r(u_k, x_k) + q(x_k, k), \quad 0 \le k < n

and final cost q(x_n, n) achieves total cost (35). Thus the MAP smoothing problem has been transformed into a deterministic optimal control problem. We can now bring any method for optimal control to bear on MAP smoothing. Of particular interest is the maximum principle – which can avoid the curse of dimensionality even when the posterior of the partially-observable system (34) does not have a finite set of sufficient statistics.

6.2 Duality of LQG control and Kalman smoothing

Let us now specialize these results to the LQG setting. Consider again the partially-observable system (29) discussed earlier. The posterior is Gaussian, therefore the MAP smoother and the Kalman smoother yield identical state estimates. The negative log-probabilities from (34) now become

a(x_k, x_{k-1}) = \frac{1}{2} (x_k - A x_{k-1})^T S^{-1} (x_k - A x_{k-1}) + a_0
b(y_k, x_k) = \frac{1}{2} (y_k - H x_k)^T P^{-1} (y_k - H x_k) + b_0
c(x_0) = \frac{1}{2} (x_0 - \hat x_0)^T \Sigma_0^{-1} (x_0 - \hat x_0) + c_0

where a_0, b_0, c_0 are normalization constants. Dropping all terms that do not depend on x, and using the fact that x_k - A x_{k-1} = u_{k-1} from (36), the quantity (35) being minimized by the MAP smoother becomes

J(x_{0..n}, u_{0..n-1}) = \sum_{k=0}^{n-1} \frac{1}{2} u_k^T R u_k + \sum_{k=0}^{n} \Big( \frac{1}{2} x_k^T Q_k x_k + x_k^T q_k \Big)

where

R = S^{-1}
Q_0 = \Sigma_0^{-1}, \quad q_0 = -\Sigma_0^{-1} \hat x_0
Q_k = H^T P^{-1} H, \quad q_k = -H^T P^{-1} y_k, \quad 0 < k \le n

Thus the linear-Gaussian MAP smoothing problem is equivalent to a linear-quadratic optimal control problem. The linear cost term x_k^T q_k was not previously included in the LQG derivation, but it is straightforward to do so.

Let us now compare this result to the Kalman duality (33). Here the estimation system has dynamics x_{k+1} = A x_k + w_k and the control system has dynamics x_{k+1} = A x_k + u_k. Thus the time-reversal and the matrix transpose of A are no longer needed. Furthermore the covariance matrices now appear inverted, and so we can directly relate costs to negative log-probabilities. Another difference is that in (33) we had R \leftrightarrow P and Q \leftrightarrow S, while these two correspondences are now reversed.

A minimum-energy interpretation can help better understand the duality. From (29) we have x_k - A x_{k-1} = w_{k-1} and y_k - H x_k = v_k. Thus the cost rate for the optimal control problem is of the form

w^T S^{-1} w + v^T P^{-1} v

This can be thought of as the energy of the noise signals w and v. Note that an estimate for the states implies estimates for the two noise terms, and the likelihood of the estimated noise is a natural quantity to optimize. The first term above measures how far the estimated state is from the prior; it represents a control cost because the control signal pushes the estimate away from the prior. The second term measures how far the predicted observation (and thus the estimated state) is from the actual observation. One can think of this as a minimum-energy tracking problem with reference trajectory specified by the observations.

7 Optimal control as a theory of biological movement

To say that the brain generates the best behavior in can, subject to the constraints imposed

by the body and environment, is almost trivial. After all, the brain has evolved for the sole

purpose of generating behavior advantageous to the organism. It is then reasonable to

expect that, at least in natural and well-practiced tasks, the observed behavior will be close

to optimal. This makes optimal control theory an appealing computational framework

for studying the neural control of movement. Optimal control is also a very successful

framework in terms of explaining the details of observed movement. However we have

recently reviewed this literature [12] and will not repeat the review here. Instead we will

brie‡y summarize existing optimal control models from a methodological perspective, and

then list some research directions which we consider promising.

Most optimality models of biological movement assume deterministic dynamics and

impose state constraints at di¤erent points in time. These constraints can for example

specify the initial and …nal posture of the body in one step of locomotion, or the positions

of a sequence of targets which the hand has to pass through. Since the constraints guarantee

accurate execution of the task, there is no need for accuracy-related costs which specify what

the task is. The only cost is a cost rate which speci…es the "style" of the movement. It

has been de…ned as (an approximation to) metabolic energy, or the squared derivative of

acceleration (i.e. jerk), or the squared derivative of joint torque. The solution method is

usually based on the maximum principle. Minimum-energy models are explicitly formulated

as optimal control problems, while minimum-jerk and minimum-torque-change models are

formulated in terms of trajectory optimization. However they can be easily transformed

into optimal control problems by relating the derivative being minimized to a control signal.

Here is an example. Let q(t) be the vector of generalized coordinates (e.g. joint angles)

for an articulated body such as the human arm. Let t (t) be the vector of generalized forces

(e.g. joint torques). The equations of motion are

t = ' (q) • q +n(q, _ q)

where ' (q) is the con…guration-dependent inertia matrix, and n(q, _ q) captures non-linear

interaction forces, gravity, and any external force …elds that depend on position or velocity.

Unlike mechanical devices, the musculo-skeletal system has order higher than two because

the muscle actuators have their own states. For simplicity assume that the torques t

correspond to the set of muscle activations, and have dynamics

_ t =

1

c

(u ÷t)

where u(t) is the control signal sent by the nervous system, and c is the muscle time constant

(around 40 msec). The state vector of this system is

x = [q; _ q; t]

We will use the subscript notation x

[1]

= q, x

[2]

= _ q, x

[3]

= t. The general …rst-order

24

dynamics x = f (x, u) are given by

_ x

[1]

= x

[2]

_ x

[2]

= '

_

x

[1]

_

1

_

x

[3]

÷n

_

x

[1]

, x

[2]

__

_ x

[3]

=

1

c

_

u ÷x

[3]

_
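As a concrete illustration, the first-order dynamics above can be coded directly. The sketch below assumes a hypothetical single-joint arm, so M(q) reduces to a scalar inertia and n(q, q̇) lumps gravity and viscous damping; all numerical constants are made up for illustration, not taken from the text.

```python
import numpy as np

# Hypothetical single-joint arm: scalar inertia I, mass m, length l,
# viscous damping b, gravity g, muscle time constant c (~40 msec).
I, m, l, b, g, c = 0.05, 1.0, 0.3, 0.1, 9.81, 0.04

def f(x, u):
    """First-order dynamics xdot = f(x, u) for the state x = [q, qdot, tau]."""
    q, qdot, tau = x
    M = I                                   # inertia is configuration-independent for 1 DOF
    n = m * g * l * np.sin(q) + b * qdot    # gravity plus damping
    return np.array([qdot,                  # qdot
                     (tau - n) / M,         # qddot = M^-1 (tau - n)
                     (u - tau) / c])        # first-order muscle filter
```

With these (arbitrary) parameters, a unit control applied at rest drives only the muscle state, at rate 1/c, while gravity alone accelerates the joint when it is away from the vertical.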

Note that these dynamics are affine in the control signal and can be written as

ẋ = a(x) + Bu

Now we can specify a desired movement time t_f, an initial state x_0, and a final state x_f. We can also specify a cost rate, such as

control energy:  ℓ(x, u) = (1/2) ‖u‖²
torque-change:  ℓ(x, u) = (1/2) ‖τ̇‖² = (1/(2c²)) ‖u − x_[3]‖²

In both cases the cost is quadratic in u and the dynamics are affine in u. Therefore the Hamiltonian can be minimized explicitly. Focusing on the minimum-energy model, we have

H(x, u, p) = (1/2) ‖u‖² + (a(x) + Bu)ᵀ p
π(x, p) = arg min_u H(x, u, p) = −Bᵀ p

We can now apply the maximum principle, and obtain the ODE

ẋ = a(x) − B Bᵀ p
−ṗ = a_x(x)ᵀ p

with boundary conditions x(0) = x_0 and x(t_f) = x_f. If instead of a terminal constraint we wish to specify a final cost h(x), then the boundary condition x(t_f) = x_f is replaced with p(t_f) = h_x(x(t_f)). Either way we have as many scalar variables as boundary conditions, and the problem can be solved numerically using an ODE two-point boundary value solver. When a final cost is used the problem can also be solved using iterative LQG approximations.
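To make the two-point boundary value formulation concrete, here is a minimal sketch using SciPy's solve_bvp. It replaces a(x) with a linear stand-in Ax, a double integrator rather than the arm model, so the maximum-principle ODE becomes ẋ = Ax − BBᵀp, −ṗ = Aᵀp with the state pinned at both endpoints; the dynamics, movement time and targets are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_bvp

# Linear stand-in a(x) = A x: a double integrator (position, velocity)
# with one control channel, moving from rest at 0 to rest at 1 in t_f = 1.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
x0, xf, tf = np.array([0.0, 0.0]), np.array([1.0, 0.0]), 1.0

def ode(t, y):
    # y stacks the state x (rows 0:2) and the costate p (rows 2:4)
    x, p = y[:2], y[2:]
    xdot = A @ x - B @ (B.T @ p)    # xdot = a(x) - B B^T p
    pdot = -A.T @ p                 # -pdot = a_x(x)^T p
    return np.vstack([xdot, pdot])

def bc(ya, yb):
    # terminal-constraint version: x(0) = x0 and x(tf) = xf
    return np.concatenate([ya[:2] - x0, yb[:2] - xf])

t = np.linspace(0.0, tf, 50)
sol = solve_bvp(ode, bc, t, np.zeros((4, t.size)))
u = -(B.T @ sol.y[2:])              # recovered optimal control u = -B^T p
```

For this instance the minimum-energy control is known to be linear in time (u(t) = 6 − 12t), which provides a convenient sanity check on the solver output.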

Some optimal control models have considered stochastic dynamics, and used accuracy costs rather than state constraints to specify the task (state constraints cannot be enforced in a stochastic system). Such models have almost exclusively been formulated within the LQG setting. Control under sensory noise and delays has also been considered; in that case the model involves a sensory-motor loop composed of a Kalman filter and a linear-quadratic regulator. Of particular interest in stochastic models is control-multiplicative noise (also called signal-dependent noise). It is a well-established property of the motor system, and appears to be the reason for speed-accuracy trade-offs such as Fitts' law. Control-multiplicative noise can be formalized as

dx = (a(x) + Bu) dt + σ B D(u) dw

where D(u) is a diagonal matrix with the components of u on its main diagonal. In this system, each component of the control signal is polluted with Gaussian noise whose standard deviation is proportional to that component. The noise covariance is then

S = σ² B D(u) D(u)ᵀ Bᵀ
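A quick simulation illustrates the defining property of this noise model: the standard deviation of each noise component grows in proportion to the corresponding control component. The sketch below draws Euler–Maruyama increments of the term σ B D(u) dw, taking B = I and arbitrary values of σ and dt for illustration.

```python
import numpy as np

# Monte-Carlo check that sigma * B * D(u) * dw produces noise whose
# standard deviation scales linearly with the control magnitude (B = I).
rng = np.random.default_rng(0)
sigma, dt, n = 0.5, 1e-3, 200_000

def one_step_noise(u):
    # one Euler-Maruyama increment: D(u) dw multiplies dw elementwise by u
    dw = rng.normal(0.0, np.sqrt(dt), size=(n, u.size))
    return sigma * dw * u

small = one_step_noise(np.array([1.0]))
large = one_step_noise(np.array([4.0]))
ratio = large.std() / small.std()   # should be close to 4
```

Quadrupling the control quadruples the noise standard deviation, which is exactly the scaling behind the speed-accuracy trade-off discussed above.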

With these definitions, one can verify that

tr(S A) = σ² tr(D(u)ᵀ Bᵀ A B D(u)) = σ² uᵀ Bᵀ A B u

for any matrix A. Now suppose the cost rate is

ℓ(x, u) = (1/2) uᵀ R u + q(x, t)

Then the Hamiltonian for this stochastic optimal control problem is

(1/2) uᵀ (R + σ² Bᵀ v_xx(x, t) B) u + q(x, t) + (a(x) + Bu)ᵀ v_x(x, t)

If we think of the matrix v_xx(x, t) as given, the above expression is the Hamiltonian for a deterministic optimal control problem with cost rate in the same form as above, and modified control-energy weighting matrix:

R̃(x, t) = R + σ² Bᵀ v_xx(x, t) B

Thus, incorporating control-multiplicative noise in an optimal control problem is equivalent to increasing the control energy cost. The cost increase required to make the two problems equivalent is of course impossible to compute without first solving the stochastic problem (since it depends on the unknown optimal value function). Nevertheless this analysis affords some insight into the effects of such noise. Note that in the LQG case v is quadratic, its Hessian v_xx is constant, and so the optimal control law under control-multiplicative noise can be found in closed form.
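The claim that the noise acts as an extra control-energy cost can be checked numerically: whenever the Hessian v_xx is positive semidefinite, the correction σ²Bᵀv_xx B added to R is positive semidefinite as well, so R̃ penalizes every control direction at least as much as R does. The matrices below are made-up examples, not taken from any specific model.

```python
import numpy as np

# Effective control penalty under multiplicative noise:
# Rtilde = R + sigma^2 * B^T Vxx B, with illustrative matrices.
sigma = 0.5
R = np.eye(2)
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.5]])
Vxx = np.diag([4.0, 1.0, 2.0])     # PSD Hessian of a quadratic value function
Rtilde = R + sigma**2 * (B.T @ Vxx @ B)

# The correction Rtilde - R inherits positive semidefiniteness from Vxx,
# so the noise can only increase the control-energy cost.
eigs = np.linalg.eigvalsh(Rtilde - R)
```

Here the eigenvalues of R̃ − R are non-negative, confirming the "noise as energy cost" interpretation for this example.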

7.1 Promising research directions

There are plenty of examples where motor behavior is found to be optimal under a reasonable cost function. Similarly, there are plenty of examples where perceptual judgements are found to be optimal under a reasonable prior. There is little doubt that many additional examples will accumulate over time, and reinforce the principle of optimal sensory-motor processing. But can we expect future developments that are conceptually novel? Here we summarize four under-explored research directions which may lead to such developments.

Motor learning and adaptation. Optimal control has been used to model behavior in well-practiced tasks where performance is already stable. But the processes of motor learning and adaptation, which are responsible for reaching stable performance, have rarely been modeled from the viewpoint of optimality. Such modeling should be straightforward given the numerous iterative algorithms for optimal controller design that exist.

Neural implementation of optimal control laws. Optimal control modeling has been restricted to the behavioral level of analysis; the control laws used to predict behavior are mathematical functions without an obvious neural implementation. In order to bridge the gap between behavior and single neurons, we will need realistic neural networks trained to mimic the input-output behavior of optimal control laws. Such networks will have to operate in closed loop with a simulated body.

Distributed and hierarchical control. Most existing models of movement control are monolithic. In contrast, the motor system is distributed and includes a number of anatomically distinct areas which presumably have distinct computational roles. To address this discrepancy, we have recently developed a hierarchical framework for approximately optimal control. In this framework, a low-level feedback controller transforms the musculo-skeletal system into an augmented system for which high-level optimal control laws can be designed more efficiently [13].

Inverse optimal control. Optimal control models are presently constructed by guessing the cost function, obtaining the corresponding optimal control law, and comparing its predictions to experimental data. Ideally we would be able to do the opposite: record data, and automatically infer a cost function for which the observed behavior is optimal. There are reasons to believe that most sensible behaviors are optimal with respect to some cost function.

Recommended further reading

The mathematical ideas introduced in this chapter are developed in more depth in a number of well-written books. The standard reference on dynamic programming is Bertsekas [3]. Numerical approximations to dynamic programming, with emphasis on discrete state-action spaces, are introduced in Sutton and Barto [10] and formally described in Bertsekas and Tsitsiklis [2]. Discretization schemes for continuous stochastic optimal control problems are developed in Kushner and Dupuis [6]. The classic Bryson and Ho [5] remains one of the best treatments of the maximum principle and its applications (including applications to minimum-energy filters). A classic treatment of optimal estimation is Anderson and Moore [1]. A comprehensive text covering most aspects of continuous optimal control and estimation is Stengel [11]. The advanced subjects of non-smooth analysis and viscosity solutions are covered in Vinter [14]. The differential-geometric approach to mechanics and control (including optimal control) is developed in Bloch [4]. An intuitive yet rigorous introduction to stochastic calculus can be found in Oksendal [7]. The applications of optimal control theory to biological movement are reviewed in Todorov [12] and also in Pandy [8]. The links to motor neurophysiology are explored in Scott [9]. A hierarchical framework for optimal control is presented in Todorov et al [13].

Acknowledgements

This work was supported by US National Institutes of Health grant NS-045915, and US National Science Foundation grant ECS-0524761. Thanks to Weiwei Li for proofreading the manuscript.


References

[1] Anderson, B. and Moore, J. (1979) Optimal Filtering. Prentice Hall, Englewood Cliffs, NJ.

[2] Bertsekas, D. and Tsitsiklis, J. (1996) Neuro-dynamic Programming. Athena Scientific, Belmont, MA.

[3] Bertsekas, D. (2000) Dynamic Programming and Optimal Control (2nd ed). Athena Scientific, Belmont, MA.

[4] Bloch, A. (2003) Nonholonomic Mechanics and Control. Springer, New York.

[5] Bryson, A. and Ho, Y. (1969) Applied Optimal Control. Blaisdell Publishing, Waltham, MA.

[6] Kushner, H. and Dupuis, P. (2001) Numerical Methods for Stochastic Control Problems in Continuous Time (2nd ed). Springer.

[7] Oksendal, B. (1995) Stochastic Differential Equations (4th ed). Springer, Berlin.

[8] Pandy, M. (2001) Computer modeling and simulation of human movement. Annual Review of Biomedical Engineering 3: 245-273.

[9] Scott, S. (2004) Optimal feedback control and the neural basis of volitional motor control. Nature Reviews Neuroscience 5: 534-546.

[10] Sutton, R. and Barto, A. (1998) Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

[11] Stengel, R. (1994) Optimal Control and Estimation. Dover, New York.

[12] Todorov, E. (2004) Optimality principles in sensorimotor control. Nature Neuroscience 7: 907-915.

[13] Todorov, E., Li, W. and Pan, X. (2005) From task parameters to motor synergies: A hierarchical framework for approximately optimal control of redundant manipulators. Journal of Robotic Systems 22: 691-719.

[14] Vinter, R. (2000) Optimal Control. Birkhauser, Boston.


1

Discrete control: Bellman equations

Let x 2 X denote the state of an agent’ environment, and u 2 U (x) the action (or s control) which the agent chooses while at state x. For now both X and U (x) are …nite sets. Let next (x; u) 2 X denote the state which results from applying action u in state x, and cost (x; u) 0 the cost of applying action u in state x. As an example, x may be the city where we are now, u the ‡ ight we choose to take, next (x; u) the city where that ‡ ight lands, and cost (x; u) the price of the ticket. We can now pose a simple yet practical optimal control problem: …nd the cheapest way to ‡ to your destination. This problem y can be formalized as follows: …nd an action sequence (u0 ; u1 ; un 1 ) and corresponding state sequence (x0 ; x1 ; xn ) minimizing the total cost Xn 1 J (x ; u ) = cost (xk ; uk )

k=0

where xk+1 = next (xk ; uk ) and uk 2 U (xk ). The initial state x0 = xinit and destination state xn = xdest are given. We can visualize this setting with a directed graph where the states are nodes and the actions are arrows connecting the nodes. If cost (x; u) = 1 for all (x; u) the problem reduces to …nding the shortest path from xinit to xdest in the graph.

1.1

Dynamic programming

Optimization problems such as the one stated above are e¢ ciently solved via dynamic programming (DP). DP relies on the following obvious fact: if a given state-action sequence is optimal, and we were to remove the …rst state and action, the remaining sequence is also optimal (with the second state of the original sequence now acting as initial state). This is the Bellman optimality principle. Note the close resemblance to the Markov property of stochastic processes (a process is Markov if its future is conditionally independent of the past given the present state). The optimality principle can be reworded in similar language: the choice of optimal actions in the future is independent of the past actions which led to the present state. Thus optimal state-action sequences can be constructed by starting at the …nal state and extending backwards. Key to this procedure is the optimal value function (or optimal cost-to-go function) v (x) = "minimal total cost for completing the task starting from state x" This function captures the long-term cost for starting from a given state, and makes it possible to …nd optimal actions through the following algorithm: Consider every action available at the current state, add its immediate cost to the optimal value of the resulting next state, and choose an action for which the sum is minimal. The above algorithm is "greedy" in the sense that actions are chosen based on local information, without explicit consideration of all future scenarios. And yet the resulting actions are optimal. This is possible because the optimal value function contains all information about future scenarios that is relevant to the present choice of action. Thus the optimal value function is an extremely useful quantity, and indeed its calculation is at the heart of many methods for optimal control. 2

The proof relies on the important idea of contraction mappings: one de…nes the approximation error e v (i) = maxx v (i) (x) v (x) . Value iteration uses only (2). This is because the presence of cycles makes it impossible to visit each state only after all its successors have been visited. u0 ). It starts with a guess (0) of the optimal control 3 . 1. u) + v (i) (next (x. and keep going until we reach xdest . but we cannot apply them in a single pass. and shows that the iteration (3) causes e v (i) to decrease as i increases. Formally. transition to state x1 = next (x0 . and satis…es v (x) = min fcost (x. If for some x we already know v (next (x. By "relaxation scheme" we mean guessing the solution. Policy iteration uses both (1) and (2). Instead the Bellman equations are treated as consistency conditions and used to design iterative relaxation schemes –much like partial di¤erential equations (PDEs) are treated as consistency conditions and solved with corresponding relaxation schemes.The above algorithm yields an optimal action u = (x) 2 U (x) for every state x. We start with a guess v (0) of the optimal value function. In other words. and iteratively improving the guess so as to make it more compatible with the consistency condition. generate action u0 = (x0 ). Here the Bellman equations are still valid. u) + v (next (x.2 Value iteration and policy iteration The situation is more complex in graphs with cycles. A mapping from states to actions is called control law or control policy. The two main relaxation schemes are value iteration and policy iteration. then we can apply the Bellman equations directly and compute (x) and v (x). Thus dynamic programming is particularly simple in acyclic graphs where we can start from xdest with v xdest = 0. generate action u1 = (x1 ). an optimal control law satis…es (x) = arg min fcost (x. However the optimal value function v is always uniquely de…ned. u))g (2) u2U (x) Equations (1) and (2) are the Bellman equations. 
Once we have a control law : X ! U (X ) we can start at any state x0 . and perform a backward pass in which every state is visited after all its successor states have been visited. u))g u2U (x) (1) The minimum in (1) may be achieved for multiple actions in the set U (x). and construct a sequence of improved guesses: n o v (i+1) (x) = min cost (x. u) + v (next (x. u)) for all u 2 U (x). which is why may not be unique. the mapping v (i) ! v (i+1) given by (3) contracts the "size" of v (i) as measured by the error norm e v (i) . u)) (3) u2U (x) This process is guaranteed to converge to the optimal value function v in a …nite number of iterations. It is straightforward to extend the algorithm to the case where we are given non-zero …nal costs for a number of destination states (or absorbing states).

In practice both algorithms are used depending on the problem at hand. 4) generalize to the stochastic case in the same way as equation (2) does. This function is de…ned as the total cost for starting at state x and the control law acting according to (i) thereafter. In contrast.3 Markov decision processes The problems considered thus far are deterministic. where E denotes expectation over next (x. MDPs are extensively studied in reinforcement learning –which is a sub-…eld of machine learning focusing on optimal control problems with discrete state. u))] = p (yjx. because each of the two nested relaxations in policy iteration converges faster than the single relaxation in value iteration. and constructs a sequence of improved guesses: v (i) (x) = cost x. u) = y" In order to qualify as a probability distribution the function p must satisfy X p (yjx. Policy iteration can also be proven to converge in a …nite number of iterations. u) 0 In the stochastic case the value function equation (2) becomes v (x) = min fcost (x. n (i+1) (x) = arg min cost (x. and is computed as X E [v (next (x. 1. optimal control theory focuses on problems with continuous state and exploits their rich di¤erential structure. u) + v u2U (x) (x) o (i) (next (x.law. u)) (i) (i) (4) The …rst line of (4) requires a separate relaxation to compute the value function v for (i) . u). To simplify notation we will use the shortcut minu instead of 4 . Dynamic programming easily generalizes to the stochastic case where we have a probability distribution over possible next states: p (yjx. u))]g u2U (x) (5) Equations (1. u) = 1 y2X p (yjx. It is not obvious which algorithm is better. An optimal control problem with discrete states and actions and probabilistic state transitions is called a Markov decision process (MDP). u). 3. (i) (x) + v (i) next x. in the sense that applying action u at state x always yields the same next state next (x. u) = "probability that next (x. u) + E [v (next (x. 
u) v (y) y2X 2 Continuous control: Hamilton-Jacobi-Bellman equations We now turn to optimal control problems where the state x 2 Rnx and control u 2 U (x) Rnu are real-valued vectors.

u). k ) k=0 where n = tf = is the number of time steps (assume that tf = is integer). t) 0. u) being the drift and F (x. when F (x. u (t) . In …nite-horizon problems. u ( )) = h (x (tf )) + Z tf ` (x (t) . u (s)) dw (s) x (t) = x (0) + 0 0 The last term is an Ito integral. 5 . i.e. In the absence of noise. u. i. uk ) + F (xk . u ) = h (xn ) + ` (xk . Instead we will discretize the time axis and obtain results for the continuous-time case in the limit of in…nitely small time step. de…ned for square-integrable functions g (t) as Z t g (s) dw (s) = lim 0 n!1 where 0 = s0 < s2 < Xn 1 k=0 g (sk ) (w (sk+1 ) < sn = t w (sk )) We will stay away from the complexities of stochastic calculus to the extent possible. u) = 0. u (t) : 0 t tf g is de…ned as J (x ( ) . and a …nal cost h (x) 0 which is only evaluated at the …nal state x (tf ). we can simply write x = f (x. when a …nal time tf is speci…ed.minu2U (x) . uk ) "k p where is the time step. u) the di¤usion coe¢ cient. In discrete time the total cost becomes Xn 1 J (x . t) dt 0 Keep in mind that we are dealing with a stochastic system. Consider the stochastic di¤erential equation dx = f (x. This is sometimes called a controlled Ito di¤ usion. The appropriate Euler discretization of (6) is p xk+1 = xk + f (xk . t) which minimizes the expected total cost for starting at a given (x. u) dt + F (x. Thus the total cost for a given state-control trajectory fx (t) . with f (x.e. To de…ne an optimal control problem we also need a cost function. although the latter is implied unless noted otherwise. t) and acting according thereafter. However in the stochastic _ case this would be meaningless because the sample paths of Brownian motion are not di¤erentiable (the term dw=dt is in…nite). u) dw (6) where dw is nw -dimensional Brownian motion. Our objective is to …nd a control law u = (x. uk . Inw ) and xk = x (k ). it is natural to separate the total cost into a time-integral of a cost rate ` (x. 
The term appears because the variance of Brownian motion grows linearly with time. "k N (0. What equation (6) really means is that the integral of the left hand side is equal to the integral of the right hand side: Z t Z t f (x (s) . and thus the standard deviation p of the discrete-time noise should scale as . u (s)) ds + F (x (s) .

The development is similar to the MDP case except that the state space is now in…nite: it consists of n + 1 copies of Rnx . k + 1)]g (7) where N (0. Suppressing x. u) + . k on the right hand side. uk is the multivariate Gaussian xk+1 N (xk + f (xk . k + 1) n o 1 = min ` + f T vx + 2 tr (Svxx ) + o ( ) u 6 . n) = h (x) Consider the second-order Taylor-series expansion of v. except that v is now a function of space and time. and divide by . u)T vx (x) + 1 tr ( S (x. u)) and v (x. vxx = @x@x v Now compute the expectation of the optimal value function at the next state. then dg (x (t)) = gx (x (t)) dx (t) + 1 2 gxx (x (t)) dt 2 The trace term appears because h i h E T vxx = E tr T vxx = tr (Cov [ ] vxx ) = tr ( Svxx ) Coming back to the derivation. The result is: E [v] = v (x) + f (x. u. we substitute the expression for E [v] in (7). move the term v (x) outside the minimization operator (since it does not depend on u). k ) + E [v (x + u f (x. The state transitions are now stochastic: the probability distribution of xk+1 given xk . We have v (x. k) v (x. with the time index k +1 suppressed for clarity: v (x + ) = v (x) + where = T vx (x) + f (x. u) + . u) The Bellman equation for the optimal value function v is similar to (5). and therefore the time when a given x 2 Rnx is reached makes a di¤erence. which states s that if x (t) is an Ito di¤usion with coe¢ cient . uk ) . uk )) where S (x. T S (xk . k) = min f ` (x. S (x.1 Derivation of the HJB equations We are now ready to apply dynamic programming to the time-discretized stochastic problem. u. The reason we need multiple copies of Rnx is that we have a …nite-horizon problem. u) = F (x. u) F (x. 1 T vxx (x) + o 3 2 @ @2 vx = @x v. u) vxx (x)) + o 2 i 2 Note the second-order derivative vxx in the …rst-order approximation to E [v]. we have v (x.2. using the above Taylor-series expansion and only keeping terms up to …rst-order in . This is a recurrent theme in stochastic calculus. It is directly related to Ito’ lemma.

t) is a value of u which achieves the minimum in (8): n o (x. consider the equation jg (t)j = 1 with boundary conditions g (0) = g (1) = 0. (iii) numerical approximation schemes which take the limit of solutions to discretized problems converge to a viscosity solution (and therefore to the optimal value function). an optimal control (x. If we decide to admit such "weak" solutions. u)T vx (x. (ii) the optimal value function is a viscosity solution to the HJB equation (and thus it is the only viscosity solution). u. which we denote vt . Nevertheless. t) de…ned in continuous time. u. t) + f (x. t + ) @ In the limit ! 0 the latter expression becomes @t v. If a di¤erentiable function v satisfying (8) for all (x. the HJB equations have motivated a number of methods for approximate solution. or the optimal control law. then it is the unique optimal value function. and consider the optimal value function v (x. u) vxx (x. The slope of g _ is either +1 or 1. and the required number of discrete states is exponential in the statespace dimensionality nx . Thus for 0 t tf and v (x. the optimal value function is a weak solution but there are many other weak solutions. The only numerical methods guaranteed to converge to the optimal value function rely on discretizations of the state space. However. 7 . t)) (9) 2 u2U (x) Equations (8) and (9) are the Hamilton-Jacobi-Bellman (HJB) equations. Such methods rely on parametric models of the optimal value function. Below we outline one such method. t) v (x. The left hand side in the above equation is then v (x. At the points where that occurs the derivative g (t) is unde…ned. the following holds: n o 1 vt (x. u) vxx (x.2 Numerical solutions of the HJB equations The HJB equation (8) is a non-linear (due to the min operator) second-order PDE with respect to the unknown function v.Recall that t = k . u)T vx (x. How can we then solve the optimal control problem? 
The recent development of non-smooth analysis and the idea of viscosity solutions provide a reassuring answer. tf ) = h (x). we are faced with _ in…nitely many solutions to the same di¤erential equation. t) + 1 tr (S (x. or both. In particular when (8) does not have a classic solution. t)) (8) u2U (x) Similarly to the discrete case. It is a problem which most likely does not have a general solution. 2. and so g has to change slope (discontinuously) somewhere in the interval 0 t 1 in order to satisfy the boundary conditions. non-linear di¤erential equations do not always have classic solutions which satisfy them everywhere. t) + f (x. Unfortunately there are other practical issues to worry about. For example. t) = arg min ` (x. t) + 2 tr (S (x. It can be summarized as follows: (i) "viscosity" provides a speci…c criterion for selecting a single weak solution. t) exists. t) = min ` (x. The bottom line is that in practice one need not worry about the absence of classic solutions. Bellman called this the curse of dimensionality.

Here we do not have a …nal cost h (x). ) to the optimal value function. u ( )) = lim ` (x (t) . This procedure yields target values for the left hand side of (8). Linearity in simpli…es the calculation of derivatives: X i vx (x. Particularly convenient are models of the form X i v (x. u) vxx (x)) (10) 2 u2U (x) Another alternative is the average-cost-per-stage formulation. One is the discounted-cost formulation. ) gets closer to these target values. (8) may be easier 8 . where the total cost for an (in…nitely long) state-control trajectory is de…ned as Z 1 J (x ( ) . which makes them more amenable to numerical approximations in the sense that we do not need to store a copy of the optimal value function at each point in time. where e vector of parameters. Then adjust the parameters so that vt (x. Form another point of view. u) + f (x.3 In…nite-horizon formulations Thus far we focused on …nite-horizon problems. and the unknown parameters linearly.Consider an approximation v (x. u)T vx (x) + 1 tr (S (x. The HJB equation for the optimal value function becomes o n v (x) = min ` (x. u) + f (x. u ( )) = exp ( t) ` (x (t) . t. ) = e x (x. Equations (10) and (11) do not depend on time. however. t) i i appear and similarly for vxx and vt . Now choose a large enough set of states (x. and v now has the meaning of a di¤erential value function. u (t)) dt 0 with > 0 being the discount factor. both of which yield time-invariant forms of the HJB equations. u) vxx (x)) u2U (x) (11) where 0 is the average cost per stage. 2. u) no-longer depends on time explicitly. ) = e (x. t. with total cost Z 1 tf J (x ( ) . t. t) i i is some where i (x. The discrepancy being e minimized by the parameter adjustment procedure is the Bellman error. Intuitively this says that future costs are less costly (whatever that means). u)T vx (x) + 2 tr (S (x. t) are some prede…ned basis functions. u (t)) dt tf !1 tf 0 In this case the HJB equation for the optimal value function is o n 1 = min ` (x. 
There are two in…nite-horizon formulations used in practice. t. t) and evaluate e e the right hand side of (8) at those states (using the approximation to v and minimizing over u). and the cost rate ` (x.

but that number may increase rapidly as the discretization of the continuous state space is re…ned. (x. t) + f (x. t) + ` (x. t) This equation is valid for all x. t)) vx (x. 3 Deterministic control: Pontryagin’ maximum principle s Optimal control theory is based on two fundamental ideas. tf ) = h (x) and simply integrate (8) backward in time. The other is the maximum principle. t) _ (13) 9 . yields _ _ T 0 = vx + `x + fx vx + _ T x T `u + fu vx We now make a key observation: the term in the brackets is the gradient w. 3. Here we derive the maximum principle indirectly via the HJB equation. and therefore can be di¤erentiated w.t.1 Derivation via the HJB equations For deterministic dynamics x = f (x.r. We also clarify its relationship to classical mechanics. (x.r. and yields the same solutions as dynamic programming. That gradient is zero (assuming unconstrained minimization). t))T vx (x. t) we can drop the min operator in (12) and write 0 = vt (x. t) + f (x. This is because dynamic programming can be performed in a single backward pass through time: initialize v (x. the maximum principle avoids the curse of dimensionality.t.to solve numerically. t) (12) u Suppose a solution to the minimization problem in (12) is given by an optimal control law (x. u of the quantity being minimized w. which leaves us with T vx (x. and directly via Lagrange multipliers. Relaxation methods are of course guaranteed to converge in a …nite number of iterations for any …nite state approximation. In contrast. (x. x to obtain (in shortcut notation) 0 = vtx + `x + T x `u T + fx + T T x fu vx + vxx f Regrouping terms. introduced by Bellman in the United States. t) = min ` (x. The maximum principle applies only to deterministic problems. computing the spatial derivatives numerically along the way. t) . t) which is di¤erentiable in x. u.r. t) . and using the identity vx = vxx x + vtx = vxx f + vtx . Unlike dynamic programming. u)T vx (x. t) = `x (x. 
u) the …nite-horizon HJB equation (8) becomes _ n o vt (x. t) + fx (x. (x. (10) and (11) call for relaxation methods (such as value iteration or policy iteration) which in the continuous-state case may take an arbitrary number of iterations to converge.t. One is dynamic programming and the associated optimality principle. however. introduced by Pontryagin in the Soviet Union. u in (12). Setting u = (x.

The maximum principle can be written in more compact and symmetric form with the help of the Hamiltonian function H (x. u (t) : 0 t tf g is an optimal state-control trajectory (obtained by initializing x (0) and controlling the system optimally until tf ). u all along (it was about time we gave it a name). u. u (t) . Therefore we can solve this system with standard boundary-value solvers (such as Matlab’ bvp4c). u)T p (t) u T fx (x (t) . t) + f (x. t) + _ o n u (t) = arg min ` (x (t) . s The only complication is that we would have to minimize the Hamiltonian repeatedly. but if we think of vx as a vector p instead of a gradient of a function which depends on x. and x (0) . t) + (a (x) + B (x) u)T p 2 arg min H = u R (x) 1 B (x)T p 10 . With this de…nition. then (13) is an ordinary di¤erential equation (ODE) for p. t) = 2 uT R (x) u + q (x.t. t) In such problems the Hamiltonian is quadratic and can be minimized explicitly: H (x. The ODE (14) is a system of 2nx scalar equations subject to 2nx scalar boundary conditions. u (t)) p (t) (14) The boundary condition is p (tf ) = hx (x (tf )). t) = 1 uT R (x) u + q (x. p (t) .r. even though we derived it starting from a PDE. tf are given. u (t) . p. p. t) = ` (x. If fx (t) . The vector p is called the costate vector. t). That equation holds along any trajectory generated by (x. (14) becomes x (t) = _ p (t) = _ @ @p H @ @x H (x (t) . u. u (t) . u. p (t) . p (t)g are x (t) = f (x (t) . An ODE is a consistency condition which singles out speci…c trajectories without reference to neighboring trajectories (as would be the case in a PDE). t) The remarkable property of the maximum principle is that it is an ODE. This complication is avoided for a class of problems where the control appears linearly in the dynamics and quadratically in the cost rate: dynamics: cost rate: x = a (x) + B (x) u _ 1 ` (x.This may look like a PDE for v. u (t)) _ p (t) = `x (x (t) . t) (x (t) . 
Clearly the costate is equal to the gradient of the optimal value function evaluated along the optimal trajectory. then there exists a costate trajectory p (t) such that (13) holds with p in place of vx and u in place of . We are now ready to formulate the maximum principle. u)T p (15) which is the quantity we have been minimizing w. p (t) . The conditions on fx (t) . u (t) . t) u (16) u (t) = arg min H (x (t) . u. u. u. T This is possible because the extremal trajectories which solve (14) make Hu = `u + fu p vanish. t) + f (x (t) . which in turn removes the dependence on neighboring trajectories.

The computational complexity (or at least the storage requirement) of ODE methods based on the maximum principle grows linearly with the state dimensionality n_x, and so the curse of dimensionality is avoided. One drawback is that (14) could have multiple solutions (one of which is the optimal solution), but in practice that does not appear to be a serious problem. Another drawback is that the solution to (14) is valid for a single initial state; if the initial state were to change we would have to solve the problem again. If the state change is small, however, the solution change should also be small, and so we can speed up the search by initializing the ODE solver with the previous solution. This can be exploited in model-predictive control settings, where one seeks an optimal state-control trajectory up to a fixed time horizon (and approximates the optimal value function at the horizon). The initial portion of this trajectory is used to control the system, and then a new optimal trajectory is computed. This is closely related to the idea of a rollout policy – which is essential in computer chess programs for example.

The maximum principle can be generalized in a number of ways, including: terminal state constraints instead of "soft" final costs; free (i.e. optimized) final time; first exit time; control constraints; state constraints at intermediate points along the trajectory.

3.2 Derivation via Lagrange multipliers

The maximum principle can also be derived via Lagrange multipliers; note that the following derivation is actually the more standard one (in continuous time it relies on the calculus of variations), and for discrete-time systems it is elementary, as we show next. As a reminder, constrained optimization problems can be solved with the method of Lagrange multipliers: in order to minimize a scalar function g(x) subject to c(x) = 0, we form the Lagrangian L(x, \lambda) = g(x) + \lambda^T c(x) and look for a pair (x, \lambda) such that \frac{\partial}{\partial x} L = 0 and \frac{\partial}{\partial \lambda} L = 0.

Consider the discrete-time optimal control problem

dynamics:   x_{k+1} = f(x_k, u_k)
cost rate:  \ell(x_k, u_k, k)
final cost: h(x_n)

with given initial state x_0 and final time n. We can approach this as a regular constrained optimization problem: find sequences (u_0, \ldots, u_{n-1}) and (x_0, x_1, \ldots, x_n) minimizing J = h(x_n) + \sum_{k=0}^{n-1} \ell(x_k, u_k, k) subject to the constraints x_{k+1} = f(x_k, u_k). In our case there are n constraints, so we need a sequence of n Lagrange multipliers (\lambda_1, \ldots, \lambda_n). The Lagrangian is

L(x, u, \lambda) = h(x_n) + \sum_{k=0}^{n-1} ( \ell(x_k, u_k, k) + (f(x_k, u_k) - x_{k+1})^T \lambda_{k+1} )

Define the discrete-time Hamiltonian

H^{(k)}(x, u, \lambda) = \ell(x, u, k) + f(x, u)^T \lambda

and rearrange the terms in the Lagrangian to obtain

L = h(x_n) - \lambda_n^T x_n + \lambda_0^T x_0 + \sum_{k=0}^{n-1} ( H^{(k)}(x_k, u_k, \lambda_{k+1}) - \lambda_k^T x_k )

Now consider differential changes in L due to changes in u, which in turn lead to changes in x. We have

dL = (h_x(x_n) - \lambda_n)^T dx_n + \lambda_0^T dx_0 + \sum_{k=0}^{n-1} ( (\frac{\partial}{\partial x} H^{(k)} - \lambda_k)^T dx_k + (\frac{\partial}{\partial u} H^{(k)})^T du_k )

In order to satisfy \frac{\partial}{\partial x_k} L = 0 we choose the Lagrange multipliers to be

\lambda_n = h_x(x_n)
\lambda_k = \frac{\partial}{\partial x} H^{(k)} = \ell_x(x_k, u_k, k) + f_x(x_k, u_k)^T \lambda_{k+1},   0 \le k < n

For this choice of \lambda the differential dL becomes

dL = \lambda_0^T dx_0 + \sum_{k=0}^{n-1} (\frac{\partial}{\partial u} H^{(k)})^T du_k    (17)

The first term in (17) is 0 because x_0 is fixed. The second term becomes 0 when each u_k is the (unconstrained) minimum of H^{(k)}. Summarizing the conditions for an optimal solution, we arrive at the discrete-time maximum principle:

x_{k+1} = f(x_k, u_k)
\lambda_k = \ell_x(x_k, u_k, k) + f_x(x_k, u_k)^T \lambda_{k+1}    (18)
u_k = \arg\min_u H^{(k)}(x_k, u, \lambda_{k+1})

with \lambda_n = h_x(x_n) and x_0 given. The similarity between the discrete-time (18) and the continuous-time (14) versions of the maximum principle is obvious. The costate p, which before was equal to the gradient v_x of the optimal value function, is now a Lagrange multiplier \lambda; thus we have three different names for the same quantity. It actually has yet another name: influence function. This is because \lambda_0 is the gradient of the minimal total cost with respect to the initial condition x_0 (as can be seen from (17)), and so \lambda_0 tells us how changes in the initial condition influence the total cost. The minimal total cost is of course equal to the optimal value function, thus \lambda is the gradient of the optimal value function. This also holds in the continuous-time case.

3.3 Numerical optimization via gradient descent

From (17) it is clear that the quantity

\frac{\partial}{\partial u} H^{(k)} = \ell_u(x_k, u_k, k) + f_u(x_k, u_k)^T \lambda_{k+1}    (19)

is the gradient of the total cost with respect to the control signal. Once we have a gradient, we can optimize the (open-loop) control sequence for a given initial state via gradient descent. Here is the algorithm:

1. Given a control sequence, iterate the dynamics forward in time to find the corresponding state sequence. Then iterate (18) backward in time to find the Lagrange multiplier sequence. In the backward pass, use the given control sequence instead of optimizing the Hamiltonian.

2. Evaluate the gradient using (19), and improve the control sequence with any gradient descent algorithm.

3. Go back to step 1, or exit if converged.

As always, gradient descent in high-dimensional spaces is much more efficient if one uses a second-order method (conjugate gradient descent, Gauss-Newton, Levenberg-Marquardt, etc). Care should be taken to ensure stability; stability of second-order optimization can be ensured via line-search or trust-region methods. To avoid local minima – which correspond to extremal trajectories that are not optimal – one could use multiple restarts with random initialization of the control sequence. Note that extremal trajectories satisfy the maximum principle, and so an ODE solver can get trapped in the same suboptimal solutions as a gradient descent method.
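As a concrete illustration of this procedure, here is a minimal sketch (added for illustration; the scalar dynamics, costs and step size are arbitrary choices, not from the text) that applies the forward pass, the backward multiplier pass (18), and the gradient (19) to a linear-quadratic toy problem:

```python
# Gradient descent on an open-loop control sequence via the discrete-time
# maximum principle: forward pass for states, backward pass (18) for the
# multipliers, gradient (19), then a plain gradient step.
# Toy problem: x_{k+1} = a x_k + b u_k,
# J = sum_k 0.5*(r u_k^2 + q x_k^2) + 0.5*qf x_n^2.

a, b, q, r, qf = 0.9, 0.5, 1.0, 1.0, 1.0
n, x0, rate = 20, 1.0, 0.05

def cost(u):
    x, J = x0, 0.0
    for uk in u:
        J += 0.5 * (r * uk**2 + q * x**2)
        x = a * x + b * uk
    return J + 0.5 * qf * x**2

def gradient(u):
    # forward pass: state sequence
    xs = [x0]
    for uk in u:
        xs.append(a * xs[-1] + b * uk)
    # backward pass (18): lam_n = qf*x_n, lam_k = q*x_k + a*lam_{k+1}
    lam = qf * xs[-1]
    g = [0.0] * n
    for k in reversed(range(n)):
        g[k] = r * u[k] + b * lam   # gradient (19)
        lam = q * xs[k] + a * lam
    return g

u = [0.0] * n
J0 = cost(u)
for _ in range(5000):
    g = gradient(u)
    u = [uk - rate * gk for uk, gk in zip(u, g)]

print(cost(u) < J0)   # True: the cost has decreased
```

For this convex problem plain gradient descent drives the gradient (19) to zero; for nonlinear dynamics the same two passes deliver the gradient, but only a local extremum is guaranteed, as discussed above.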
3.4 Relation to classical mechanics

Note that (16) resembles the Hamiltonian formulation of mechanics, with p being the generalized momentum (a fifth name for the same quantity). To see where the resemblance comes from, recall the Lagrange problem: given x(0) and x(t_f), find curves \{x(t) : 0 \le t \le t_f\} which optimize

J(x(\cdot)) = \int_0^{t_f} L(x(t), \dot{x}(t)) \, dt    (20)

Applying the calculus of variations, one finds that extremal curves (either maxima or minima of J) satisfy the Euler-Lagrange equation

\frac{d}{dt} \frac{\partial}{\partial \dot{x}} L(x, \dot{x}) - \frac{\partial}{\partial x} L(x, \dot{x}) = 0

Its solutions are known to be equivalent to the solutions of Hamilton's equation

\dot{x} = \frac{\partial}{\partial p} H(x, p)
\dot{p} = -\frac{\partial}{\partial x} H(x, p)    (21)

where the Hamiltonian is defined as

H(x, p) = p^T \dot{x} - L(x, \dot{x})    (22)

The change of coordinates p = \frac{\partial}{\partial \dot{x}} L(x, \dot{x}) is called a Legendre transformation. It may seem strange that H(x, p) depends on \dot{x} when \dot{x} is not explicitly an argument of H. This is because the Legendre transformation is invertible, i.e. \dot{x} can be recovered from (x, p), as long as the matrix \frac{\partial^2}{\partial \dot{x} \partial \dot{x}} L is non-singular. Thus the trajectories satisfying Hamilton's equation (21) are solutions to the Lagrange optimization problem (20).

In order to explicitly transform (20) into an optimal control problem, define a control signal u and deterministic dynamics \dot{x} = f(x, u). Note that we can choose any dynamics f(x, u), and then define the corresponding cost rate \ell(x, u) = -L(x, f(x, u)) so as to make the optimal control problem equivalent to the Lagrange problem (20). The simplest choice is f(x, u) = u. The Hamiltonian (22) then becomes H(x, p) = p^T f(x, u) + \ell(x, u), which is the optimal-control Hamiltonian (15).
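To make the Legendre transformation concrete, here is a small worked example (added for illustration, using the standard point-mass Lagrangian rather than an example from the text):

```latex
% Point mass: L(x, \dot{x}) = \tfrac{1}{2} m \dot{x}^2 - g(x)
% Legendre transformation:
p = \frac{\partial L}{\partial \dot{x}} = m \dot{x}
  \quad\Longrightarrow\quad
  \dot{x} = \frac{p}{m}
% Hamiltonian (22):
H(x, p) = p \, \dot{x} - L(x, \dot{x})
        = \frac{p^2}{m} - \frac{p^2}{2m} + g(x)
        = \frac{p^2}{2m} + g(x)
% Hamilton's equations (21):
\dot{x} = \frac{\partial H}{\partial p} = \frac{p}{m}, \qquad
\dot{p} = -\frac{\partial H}{\partial x} = -g_x(x)
% which together recover Newton's second law  m \ddot{x} = -g_x(x).
```

Here \frac{\partial^2 L}{\partial \dot{x}^2} = m is non-singular, so \dot{x} is indeed recoverable from (x, p).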

In mechanics the function L is interpreted as an energy function:

L(x, \dot{x}) = \frac{1}{2} \dot{x}^T M(x) \dot{x} - g(x)

The first term is kinetic energy (with M(x) being the inertia matrix), and the second term is potential energy due to gravity or some other force field. When the inertia is constant, applying the Euler-Lagrange equation to the above L yields Newton's second law M \ddot{x} = -g_x(x), where the force -g_x(x) is the gradient of the potential field. If the inertia is not constant (joint-space inertia for a multi-joint arm, for example), application of the Euler-Lagrange equation yields extra terms which capture nonlinear interaction forces; they contain Christoffel symbols of the Levi-Civita connection for the Riemannian metric given by \langle y, z \rangle_x = y^T M(x) z. We will not discuss differential geometry in any detail here, but it is worth noting that it affords a coordinate-free treatment, revealing intrinsic properties of the dynamical system that are invariant with respect to arbitrary smooth changes of coordinates. Such invariant quantities are called tensors; geometrically, the metric (inertia in our case) is a tensor.

4 Linear-quadratic-Gaussian control: Riccati equations

Optimal control laws can rarely be obtained in closed form. One notable exception is the LQG case, where the dynamics are linear, the costs are quadratic, and the noise (if present) is additive Gaussian. This makes the optimal value function quadratic and allows minimization of the Hamiltonian in closed form. Here we derive the LQG optimal controller in continuous and discrete time.
4.1 Derivation via the HJB equations

Consider the following stochastic optimal control problem:

dynamics:   dx = (Ax + Bu) \, dt + F \, dw
cost rate:  \ell(x, u) = \frac{1}{2} u^T R u + \frac{1}{2} x^T Q x
final cost: h(x) = \frac{1}{2} x^T Q^f x

where R is symmetric positive-definite, Q and Q^f are symmetric, and u is now unconstrained. The matrices A, B, F, R, Q can be made time-varying without complicating the derivation below. We set S = F F^T as before. In order to solve for the optimal value function we will guess its parametric form, show that it satisfies the HJB equation (8), and obtain ODEs for its parameters. Our guess is

v(x, t) = \frac{1}{2} x^T V(t) x + a(t)    (23)

where V(t) is symmetric. The boundary condition v(x, t_f) = h(x) implies V(t_f) = Q^f and a(t_f) = 0. From (23) we can compute the derivatives which enter into the HJB equation:

v_t(x, t) = \frac{1}{2} x^T \dot{V}(t) x + \dot{a}(t)
v_x(x, t) = V(t) x
v_{xx}(x, t) = V(t)

Substituting these expressions in (8) yields

-\frac{1}{2} x^T \dot{V}(t) x - \dot{a}(t) = \min_u \{ \frac{1}{2} u^T R u + \frac{1}{2} x^T Q x + (Ax + Bu)^T V(t) x + \frac{1}{2} \mathrm{tr}(S V(t)) \}

The Hamiltonian (i.e. the term inside the min operator) is quadratic in u, and its Hessian R is positive-definite, so the optimal control can be found analytically:

u = -R^{-1} B^T V(t) x    (24)

With this u, the control-dependent part of the Hamiltonian becomes

\frac{1}{2} u^T R u + (Bu)^T V(t) x = -\frac{1}{2} x^T V(t) B R^{-1} B^T V(t) x

After grouping terms, the HJB equation reduces to

-\frac{1}{2} x^T \dot{V}(t) x - \dot{a}(t) = \frac{1}{2} x^T ( Q + A^T V(t) + V(t) A - V(t) B R^{-1} B^T V(t) ) x + \frac{1}{2} \mathrm{tr}(S V(t))

where we replaced the term 2 A^T V with A^T V + V A to make the equation symmetric. This is justified because x^T A^T V x = x^T V^T A x = x^T V A x. Our guess of the optimal value function is correct if and only if the above equation holds for all x, which is the case when the x-dependent terms are matched:

-\dot{V}(t) = Q + A^T V(t) + V(t) A - V(t) B R^{-1} B^T V(t)    (25)
-\dot{a}(t) = \frac{1}{2} \mathrm{tr}(S V(t))

Functions V, a satisfying (25) can obviously be found by initializing V(t_f) = Q^f, a(t_f) = 0 and integrating the ODEs (25) backward in time. Thus (23) is the optimal value function with V, a given by (25), and (24) is the optimal control law (which in this case is unique). The first line of (25) is called a continuous-time Riccati equation. Note that it does not depend on the noise covariance S; consequently the optimal control law (24) is also independent of S. As a corollary, the optimal control law remains the same in the deterministic case – called the linear-quadratic regulator (LQR). The only effect of S is on the total cost, as we will see below.
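To see (25) in action, here is a minimal numerical sketch (added for illustration; the scalar parameter values are arbitrary choices, not from the text). For a scalar system the Riccati ODE is integrated backward with Euler steps; over a long horizon V(0) approaches the positive root of the algebraic equation q + 2aV - b^2 V^2 / r = 0:

```python
# Backward Euler integration of the scalar continuous-time Riccati
# equation (25): -dV/dt = q + 2*a*V - (b*V)^2 / r, with V(tf) = qf.
import math

a, b, q, r, qf = 1.0, 1.0, 1.0, 1.0, 0.0
tf, steps = 10.0, 10000
dt = tf / steps

V = qf
for _ in range(steps):
    Vdot = -(q + 2 * a * V - (b * V) ** 2 / r)  # right-hand side of (25)
    V -= dt * Vdot                              # step from t to t - dt

# Infinite-horizon limit: positive root of (b^2/r) V^2 - 2 a V - q = 0
V_inf = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
print(V, V_inf)   # V(0) is close to the stationary value 1 + sqrt(2)
```

With these parameters the stationary solution is 1 + sqrt(2), and the backward integration settles onto it well before t = 0, illustrating the familiar fact that finite-horizon LQR gains become stationary away from the terminal time.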
4.2 Derivation via the Bellman equations

In practice one usually works with discrete-time systems. To obtain an optimal control law for the discrete-time case one could use an Euler approximation to (25), but the resulting equation is missing terms quadratic in the time step \Delta. Instead we apply dynamic programming directly. Dropping the (irrelevant) noise and discretizing the problem, we obtain

dynamics:   x_{k+1} = A x_k + B u_k
cost rate:  \frac{1}{2} u_k^T R u_k + \frac{1}{2} x_k^T Q x_k
final cost: \frac{1}{2} x_n^T Q^f x_n

where n = t_f / \Delta and the correspondence to the continuous-time problem is

x(k\Delta) \leftrightarrow x_k,   (I + \Delta A) \leftrightarrow A,   \Delta B \leftrightarrow B,   \Delta R \leftrightarrow R,   \Delta Q \leftrightarrow Q    (26)

The guess for the optimal value function is again quadratic:

v(x, k) = \frac{1}{2} x^T V_k x

with boundary condition V_n = Q^f. As in the continuous-time case the Hamiltonian can be minimized analytically. The Bellman equation (2) is

\frac{1}{2} x^T V_k x = \min_u \{ \frac{1}{2} u^T R u + \frac{1}{2} x^T Q x + \frac{1}{2} (Ax + Bu)^T V_{k+1} (Ax + Bu) \}

The resulting optimal control law is

u = -(R + B^T V_{k+1} B)^{-1} B^T V_{k+1} A x

Substituting this u in the Bellman equation, we obtain

V_k = Q + A^T V_{k+1} A - A^T V_{k+1} B (R + B^T V_{k+1} B)^{-1} B^T V_{k+1} A    (27)

This completes the proof that the optimal value function is in the assumed quadratic form. Equation (27) is called a discrete-time Riccati equation. To compute V_k for all k we initialize V_n = Q^f and iterate (27) backward in time. The optimal control law is linear in x, and is usually written as

u_k = -L_k x_k,   L_k = (R + B^T V_{k+1} B)^{-1} B^T V_{k+1} A    (28)

The time-varying matrix L_k is called the control gain. It does not depend on the sequence of states, and therefore can be computed offline. Clearly the discrete-time Riccati equation (27) contains more terms than the continuous-time Riccati equation (25), and so the two are not identical. However, one can verify that they become identical in the limit \Delta \to 0. To this end replace the matrices in (27) with their continuous-time analogues (26), and after rearrangement obtain

V_k - V_{k+1} = \Delta ( Q + A^T V_{k+1} + V_{k+1} A - V_{k+1} B (R + \Delta B^T V_{k+1} B)^{-1} B^T V_{k+1} ) + o(\Delta^2)

where o(\Delta^2) absorbs terms that are second-order in \Delta. Taking the limit \Delta \to 0 yields the continuous-time Riccati equation (25).

4.3 Applications to nonlinear problems

Apart from solving LQG problems, the methodology described here can be adapted to yield approximate solutions to non-LQG optimal control problems. This is done iteratively, as follows:

1. Given a control sequence, apply it to the (nonlinear) dynamics and obtain a corresponding state sequence.

2. Construct a time-varying linear approximation to the dynamics and a time-varying quadratic approximation to the cost; both approximations are centered at the state-control sequence obtained in step 1. This yields an LQG optimal control problem with respect to the state and control deviations.

3. Solve the resulting LQG problem, obtain the control deviation sequence, and add it to the given control sequence. Go to step 1, or exit if converged.

Note that multiplying the deviation sequence by a number smaller than 1 can be used to implement linesearch. If the cost is not quadratic in the controls, the approximate problem is not LQG; however, one can still assume a quadratic approximation to the optimal value function and derive Riccati-like equations for its parameters. Another possibility is to use differential dynamic programming (DDP), which is based on the same idea but involves a second-order rather than a first-order approximation to the dynamics. Unlike general-purpose second-order methods which construct Hessian approximations using gradient information, DDP and iterative LQG (iLQG) obtain the Hessian directly by exploiting the problem structure. They can be thought of as the analog of Newton's method in the domain of optimal control. DDP and iLQG have second-order convergence in the neighborhood of an optimal solution. For deterministic problems they converge to state-control trajectories which satisfy the maximum principle, but in addition yield local feedback control laws. In our experience they are more efficient than either ODE or gradient descent methods. iLQG has been generalized to stochastic systems (including multiplicative noise) and to systems subject to control constraints.

5 Optimal estimation: Kalman filter

Optimal control is closely related to optimal estimation, for two reasons: (i) the only way to achieve optimal performance in the presence of sensor noise and delays is to incorporate an optimal estimator in the control system; (ii) the two problems are dual, as explained below. The most widely used optimal estimator is the Kalman filter. It is the dual of the linear-quadratic regulator – which in turn is the most widely used optimal controller.

5.1 The Kalman filter

Consider the partially-observable linear dynamical system

dynamics:    x_{k+1} = A x_k + w_k    (29)
observation: y_k = H x_k + v_k

where w_k \sim N(0, S) and v_k \sim N(0, P) are independent Gaussian random variables, and A, H, S, P are known. The initial state has a Gaussian prior distribution x_0 \sim N(\hat{x}_0, \Sigma_0), with \hat{x}_0, \Sigma_0 known. The states are hidden, and all we have access to are the observations. The objective is to compute the posterior probability distribution \hat{p}_k of x_k given the observations y_{k-1} \cdots y_0:

\hat{p}_k = p(x_k | y_{k-1} \cdots y_0)
\hat{p}_0 = N(\hat{x}_0, \Sigma_0)

Note that our formulation is somewhat unusual: we are estimating x_k before y_k has been observed. This formulation is adopted here because it simplifies the results, and also because most real-world sensors provide delayed measurements.

We will show by induction (moving forward in time) that \hat{p}_k is Gaussian for all k, and therefore can be represented by its mean \hat{x}_k and covariance matrix \Sigma_k. This holds for k = 0 by definition. The Markov property of (29) implies that the posterior \hat{p}_k can be treated as a prior over x_k for the purposes of estimation after time k. Since \hat{p}_k is Gaussian and (29) is linear-Gaussian, the joint distribution of x_{k+1} and y_k is also Gaussian. Its mean and covariance given the prior \hat{p}_k are easily computed:

E \begin{bmatrix} x_{k+1} \\ y_k \end{bmatrix} = \begin{bmatrix} A \hat{x}_k \\ H \hat{x}_k \end{bmatrix}
Cov \begin{bmatrix} x_{k+1} \\ y_k \end{bmatrix} = \begin{bmatrix} S + A \Sigma_k A^T & A \Sigma_k H^T \\ H \Sigma_k A^T & P + H \Sigma_k H^T \end{bmatrix}

Now we need to compute the probability of x_{k+1} conditional on the new observation y_k. This is done using an important property of multivariate Gaussians, summarized in the following lemma: let p and q be jointly Gaussian, with means \mu_p and \mu_q and covariances \Sigma_{pp}, \Sigma_{qq} and \Sigma_{pq} = \Sigma_{qp}^T. Then the conditional distribution of p given q is Gaussian, with mean and covariance

E[p | q] = \mu_p + \Sigma_{pq} \Sigma_{qq}^{-1} (q - \mu_q)
Cov[p | q] = \Sigma_{pp} - \Sigma_{pq} \Sigma_{qq}^{-1} \Sigma_{qp}

Applying the lemma to our problem, we see that \hat{p}_{k+1} is Gaussian with mean

\hat{x}_{k+1} = A \hat{x}_k + A \Sigma_k H^T (P + H \Sigma_k H^T)^{-1} (y_k - H \hat{x}_k)    (30)

and covariance matrix

\Sigma_{k+1} = S + A \Sigma_k A^T - A \Sigma_k H^T (P + H \Sigma_k H^T)^{-1} H \Sigma_k A^T    (31)

This completes the induction proof. The quantity y_k - H \hat{x}_k is called the innovation; it is the mismatch between the observed and the expected measurement. The estimation error is x_k - \hat{x}_k, and the covariance \Sigma_k of the posterior p(x_k | y_{k-1} \cdots y_0) is the estimation error covariance. Equation (30) is usually written as

\hat{x}_{k+1} = A \hat{x}_k + K_k (y_k - H \hat{x}_k),   K_k = A \Sigma_k H^T (P + H \Sigma_k H^T)^{-1}

The time-varying matrix K_k is called the filter gain. It does not depend on the observation sequence, and therefore can be computed offline. Equation (31) is a Riccati equation. The above derivation corresponds to the discrete-time Kalman filter. A similar result holds in continuous time, and is called the Kalman-Bucy filter.
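A minimal scalar sketch of the recursion (30)-(31) follows (added for illustration; the parameter values are arbitrary choices, not from the text). The error covariance converges to the fixed point of the Riccati recursion, and the empirical squared estimation error matches it:

```python
# Scalar Kalman filter implementing (30)-(31):
# x_{k+1} = a x_k + w,  y_k = h x_k + v,  w ~ N(0,s), v ~ N(0,p).
import random

a, h, s, p = 1.0, 1.0, 0.1, 1.0   # dynamics, observation, noise variances
xhat, Sig = 0.0, 1.0              # prior mean and variance
random.seed(0)

x = random.gauss(xhat, Sig ** 0.5)   # simulated hidden state
sq_err, n = 0.0, 2000
for _ in range(n):
    y = h * x + random.gauss(0.0, p ** 0.5)
    K = a * Sig * h / (p + h * h * Sig)        # filter gain K_k
    xhat = a * xhat + K * (y - h * xhat)       # (30): mean update
    Sig = s + a * a * Sig - K * h * Sig * a    # (31): covariance update
    x = a * x + random.gauss(0.0, s ** 0.5)    # true state transition
    sq_err += (x - xhat) ** 2

# For a = h = 1, the fixed point of (31) solves Sig^2 - s*Sig - s*p = 0:
Sig_star = (s + (s * s + 4 * s * p) ** 0.5) / 2
print(Sig, sq_err / n)
```

Since K_k and Sigma_k do not depend on the data, the gain sequence could equally be precomputed offline, exactly as noted above.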

It is possible to write down the Kalman filter in equivalent forms which have numerical advantages. One such approach is to propagate the matrix square root of \Sigma. This is called a square-root filter, and it involves Riccati-like equations which are more stable because the dynamic range of the elements is reduced. Another approach is to propagate the inverse covariance \Sigma^{-1}. This is called an information filter, and it again involves Riccati-like equations. The information filter can represent numerically very large covariances (and even infinite covariances – which are useful for specifying "non-informative" priors).

Instead of filtering one can do smoothing, i.e. obtain state estimates using observations from the past and from the future. In that case the posterior probability of each state given all observations is still Gaussian, and its parameters can be found by an additional backward pass known as the Rauch recursion. The resulting Kalman smoother is closely related to the forward-backward algorithm for probabilistic inference in hidden Markov models (HMMs).

The Kalman filter is optimal in many ways. First of all, it is a Bayesian filter, in the sense that it computes the posterior probability distribution over the hidden state, which in turn captures all information about the state that is available in the observation sequence. In addition, the mean \hat{x} is the optimal point estimator with respect to multiple loss functions. Recall that optimality of point estimators is defined through a loss function \ell(\hat{x}, x), which quantifies how bad it is to estimate \hat{x} when the true state is x. Some possible loss functions are \|\hat{x} - x\|^2, \|\hat{x} - x\|, \delta(\hat{x} - x), and the corresponding optimal estimators are the mean, the median, and the mode of the posterior probability distribution. If the posterior is Gaussian then the mean, median and mode coincide. If we choose an unusual loss function for which the optimal point estimator is not \hat{x}, the information contained in \hat{x} and \Sigma is still sufficient to compute the optimal point estimator. This is because \hat{x} and \Sigma are sufficient statistics which fully describe the posterior probability distribution. The set of sufficient statistics can be thought of as an augmented state, with respect to which the partially-observed process has a Markov property; it is called the belief state, or alternatively the information state.

5.2 Beyond the Kalman filter

When the estimation problem involves non-linear dynamics or non-Gaussian noise, the posterior probability distribution rarely has a finite set of sufficient statistics (although there are exceptions, such as the Benes system). In that case one has to rely on numerical approximations. The most widely used approximation is the extended Kalman filter (EKF). It relies on local linearization centered at the current state estimate, and closely resembles the LQG approximation to non-LQG optimal control problems. The EKF is not guaranteed to be optimal in any sense, but in practice it often yields good results – especially when the posterior is single-peaked. There is a recent improvement, called the unscented filter, which propagates the covariance using deterministic sampling instead of linearization of the system dynamics. The unscented filter tends to be superior to the EKF and requires a comparable amount of computation. An even more accurate, although computationally more expensive, approach is particle filtering. Instead of propagating a Gaussian approximation it propagates a cloud of points sampled from the posterior (without actually computing the posterior). Key to its success is the idea of importance sampling.

Even when the posterior does not have a finite set of sufficient statistics, it is still a well-defined scalar function over the state space, and as such must obey some equation. In

discrete time this equation is simply a recursive version of Bayes' rule – which is not too revealing. In continuous time, however, the posterior satisfies a PDE which resembles the HJB equation. One can then approximate the solution to that PDE numerically via discretization methods. As in the HJB equation, however, such methods are only applicable in low-dimensional spaces due to the curse of dimensionality.

Consider the stochastic differential equations

dx = f(x) \, dt + F(x) \, dw
dy = h(x) \, dt + dv

where w(t) and v(t) are Brownian motion processes, x(t) is the hidden state, and y(t) is the observation sequence. One would normally think of the increments of y(t) as being the observations, but in continuous time these increments are infinite, and so we work with their time-integral y(t). Define S(x) = F(x) F(x)^T as before.

Let p(x, t) be the probability distribution of x(t) in the absence of any observations. At t = 0 it is initialized with a given prior p(x, 0). For t > 0 it can be shown to satisfy

p_t = -f^T p_x + \frac{1}{2} \mathrm{tr}(S p_{xx}) + ( -\sum_i \frac{\partial f_i}{\partial x_i} + \frac{1}{2} \sum_{ij} \frac{\partial^2 S_{ij}}{\partial x_i \partial x_j} ) p    (32)

This is called the forward Kolmogorov equation, or alternatively the Fokker-Planck equation. We have written it in expanded form to emphasize the resemblance to the HJB equation. The more usual form is

\frac{\partial}{\partial t} p = -\sum_i \frac{\partial}{\partial x_i} (f_i p) + \frac{1}{2} \sum_{ij} \frac{\partial^2}{\partial x_i \partial x_j} (S_{ij} p)

Now let \tilde{p}(x, t) be an unnormalized posterior over x(t) given the observations \{y(s) : 0 \le s \le t\}. Here "unnormalized" means that the actual posterior \hat{p}(x, t) can be recovered by normalizing: \hat{p}(x, t) = \tilde{p}(x, t) / \int \tilde{p}(z, t) \, dz. It can be shown that some unnormalized posterior \tilde{p} satisfies Zakai's equation

d\tilde{p} = ( -\sum_i \frac{\partial}{\partial x_i} (f_i \tilde{p}) + \frac{1}{2} \sum_{ij} \frac{\partial^2}{\partial x_i \partial x_j} (S_{ij} \tilde{p}) ) \, dt + h^T \tilde{p} \, dy

The first term on the right reflects the prior and is the same as in the Kolmogorov equation (except that we have multiplied both sides by dt). The second term incorporates the observation and makes Zakai's equation a stochastic PDE. After certain manipulations (conversion to Stratonovich form and a gauge transformation) the second term can be integrated by parts, leading to a regular PDE.

6 Duality of optimal control and optimal estimation

Optimal control and optimal estimation are closely related mathematical problems. The best-known example is the duality of the linear-quadratic regulator and the Kalman filter. To see that duality more clearly, we repeat the corresponding Riccati equations (27) and (31) side by side:

control:   V_k = Q + A^T V_{k+1} A - A^T V_{k+1} B (R + B^T V_{k+1} B)^{-1} B^T V_{k+1} A
filtering: \Sigma_{k+1} = S + A \Sigma_k A^T - A \Sigma_k H^T (P + H \Sigma_k H^T)^{-1} H \Sigma_k A^T

These equations are identical up to a time reversal and some matrix transposes; that is, the correspondence is given in the following table:

control:   V        Q   A     B     R   n - k
filtering: \Sigma   S   A^T   H^T   P   k        (33)

The above duality was first described by Kalman in his famous 1960 paper introducing the discrete-time Kalman filter, and is now mentioned in most books on estimation and control. However its origin and meaning are not apparent. This is because the Kalman filter is optimal from multiple points of view and can be written in multiple forms, making it hard to tell which of its properties have a dual in the control domain. Attempts to generalize the duality to non-LQG settings have revealed that the fundamental relationship is between the optimal value function and the negative log-posterior. Recall that V is the Hessian of the optimal value function. The posterior is Gaussian with covariance matrix \Sigma, and thus the Hessian of the negative log-posterior is \Sigma^{-1}. However in (33) we have V corresponding to \Sigma and not to \Sigma^{-1}. This is actually inconsistent, although the inconsistency has not been made explicit before. Another problem is that while A^T in (33) makes sense for linear dynamics \dot{x} = Ax, the meaning of "transpose" for general non-linear dynamics \dot{x} = f(x) is unclear. Thus the duality described by Kalman is specific to linear-quadratic-Gaussian systems and does not generalize.
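The correspondence (33) is easy to verify numerically. The sketch below (added for illustration; scalar quantities, so the transposes are trivial, and the parameter values are arbitrary) runs the control recursion (27) with the substitutions from (33) and checks that it reproduces the filtering recursion (31) under time reversal:

```python
# Verify Kalman's duality (33) for a scalar system: the control Riccati
# recursion (27) with Q->S, B->H, R->P and terminal condition V_n = Sigma_0
# reproduces the filtering recursion (31) time-reversed.
A, H, S, P = 0.8, 1.3, 0.2, 0.5   # illustrative values
Sigma0, n = 1.0, 30

# filtering recursion (31), forward in time
Sig = [Sigma0]
for _ in range(n):
    s = Sig[-1]
    Sig.append(S + A * s * A - (A * s * H) ** 2 / (P + H * s * H))

# control recursion (27) with the dual matrices, backward in time
V = [0.0] * (n + 1)
V[n] = Sigma0
for k in reversed(range(n)):
    v = V[k + 1]
    V[k] = S + A * v * A - (A * v * H) ** 2 / (P + H * v * H)

# time reversal: V_{n-k} should equal Sigma_k
print(max(abs(V[n - k] - Sig[k]) for k in range(n + 1)))   # 0.0
```

In the scalar case the two recursions are literally the same function, so the sequences agree exactly; in the matrix case the transposes in (33) are what make them agree.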
6.1 General duality of optimal control and MAP smoothing

We now discuss an alternative approach which does generalize. As far as we know, the general treatment presented here is a novel result. The duality we establish is between maximum a posteriori (MAP) smoothing and deterministic optimal control for systems with continuous state. For systems with discrete state (i.e. HMMs), an instance of such duality is the Viterbi algorithm – which finds the most likely state sequence using dynamic programming. In fact we start with the general case, and later specialize it to the LQG setting to obtain something known as a minimum-energy estimator.

Consider the discrete-time partially-observable system

p(x_{k+1} | x_k) = \exp(-a(x_{k+1}, x_k))
p(y_k | x_k) = \exp(-b(y_k, x_k))    (34)
p(x_0) = \exp(-c(x_0))

where a, b, c are the negative log-probabilities of the state transitions, observation emissions, and initial state respectively. The states are hidden, and we only have access to the observations (y_1, y_2, \ldots, y_n). Our objective is to find the most probable sequence of states (x_0, x_1, \ldots, x_n), that is, the sequence which maximizes the posterior probability

p(x | y) = \frac{p(y | x) \, p(x)}{p(y)}

The term p(y) does not affect the maximization and so it can be dropped. Using the

Markov property of (34) we have

p(y | x) \, p(x) = p(x_0) \prod_{k=1}^{n} p(x_k | x_{k-1}) \, p(y_k | x_k)
               = \exp( -c(x_0) - \sum_{k=1}^{n} ( a(x_k, x_{k-1}) + b(y_k, x_k) ) )

Maximizing \exp(-J) is equivalent to minimizing J. Therefore the most probable state sequence is the one which minimizes

J(x) = c(x_0) + \sum_{k=1}^{n} ( a(x_k, x_{k-1}) + b(y_k, x_k) )    (35)

This is beginning to look like a total cost for a deterministic optimal control problem. However we are still missing a control signal. To remedy that, we will define the passive dynamics as the expected state transition:

f(x_k) = E[x_{k+1} | x_k] = \int z \exp(-a(z, x_k)) \, dz

and then define the control signal as the deviation from the expected state transition:

x_{k+1} = f(x_k) + u_k    (36)

Note that we could have chosen any f; however, the present choice will make intuitive sense later. The control cost is now defined as

r(u_k, x_k) = a(f(x_k) + u_k, x_k),   0 \le k < n

and the state cost is defined as

q(x_0, 0) = c(x_0)
q(x_k, k) = b(y_k, x_k),   0 < k \le n

With these definitions, the control system with dynamics (36), cost rate \ell(x_k, u_k, k) = r(u_k, x_k) + q(x_k, k) for 0 \le k < n, and final cost q(x_n, n) achieves total cost (35). The observation sequence is fixed, and so q is well-defined as long as it depends explicitly on the time index k. Thus the MAP smoothing problem has been transformed into a deterministic optimal control problem. We can now bring any method for optimal control to bear on MAP smoothing. Of particular interest is the maximum principle – which can avoid the curse of dimensionality even when the posterior of the partially-observable system (34) does not have a finite set of sufficient statistics.

6.2 Duality of LQG control and Kalman smoothing

Let us now specialize these results to the LQG setting. Consider again the partially-observable system (29) discussed earlier. The negative log-probabilities from (34) now become

a(x_k, x_{k-1}) = \frac{1}{2} (x_k - A x_{k-1})^T S^{-1} (x_k - A x_{k-1}) + a_0
b(y_k, x_k) = \frac{1}{2} (y_k - H x_k)^T P^{-1} (y_k - H x_k) + b_0
c(x_0) = \frac{1}{2} (x_0 - \hat{x}_0)^T \Sigma_0^{-1} (x_0 - \hat{x}_0) + c_0

where a_0, b_0, c_0 are normalization constants. Here the estimation system has dynamics x_{k+1} = A x_k + w_k, and the dual control system has dynamics x_{k+1} = A x_k + u_k. Dropping all terms that do not depend on x, and using the fact that x_k - A x_{k-1} = u_{k-1} from (36), the quantity (35) being minimized by the MAP smoother becomes

J(x, u) = \sum_{k=0}^{n-1} \frac{1}{2} u_k^T R u_k + \sum_{k=0}^{n} ( \frac{1}{2} x_k^T Q_k x_k + x_k^T q_k )

where

R = S^{-1}
Q_0 = \Sigma_0^{-1},   q_0 = -\Sigma_0^{-1} \hat{x}_0
Q_k = H^T P^{-1} H,   q_k = -H^T P^{-1} y_k,   0 < k \le n

Thus the linear-Gaussian MAP smoothing problem is equivalent to a linear-quadratic optimal control problem. The linear cost term x_k^T q_k was not previously included in the LQG derivation, but it is straightforward to do so. The posterior is Gaussian, and therefore the MAP smoother and the Kalman smoother yield identical state estimates.

Let us now compare this result to the Kalman duality (33). Here the costs are directly the negative log-probabilities, and so we can directly relate costs to negative log-probabilities. The time-reversal and the matrix transpose of A are no longer needed. Another difference is that in (33) we had R \leftrightarrow P and Q \leftrightarrow S, while these two correspondences are now reversed. Furthermore, the covariance matrices now appear inverted.

A minimum-energy interpretation can help better understand the duality. From (29) we have x_k - A x_{k-1} = w_{k-1} and y_k - H x_k = v_k. Note that an estimate for the states implies estimates for the two noise terms. Thus the cost rate for the optimal control problem is of the form

w^T S^{-1} w + v^T P^{-1} v

This can be thought of as the energy of the noise signals w and v, and the likelihood of the estimated noise is a natural quantity to optimize. The first term measures how far the estimated state is from the prior; it represents a control cost because the control signal pushes the estimate away from the prior. The second term measures how far the predicted observation (and thus the estimated state) is from the actual observation. One can think of this as a minimum-energy tracking problem, with reference trajectory specified by the observations.
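The claim that the MAP smoother and the Kalman smoother coincide in the linear-Gaussian case can be checked numerically. In this sketch (added for illustration; all parameter values are arbitrary, and a standard Rauch-Tung-Striebel smoother is used for the comparison) the quadratic objective J from above is minimized by plain gradient descent:

```python
# Scalar check: minimizing the MAP objective J reproduces the Kalman
# (Rauch-Tung-Striebel) smoother means. Model: x_{k+1} = a x_k + w, y_k = h x_k + v.
import random

a, h, s, p = 0.9, 1.0, 0.5, 0.5   # illustrative parameters
x0hat, Sig0, n = 0.0, 1.0, 15
random.seed(1)

# simulate observations y_1 ... y_n
x = random.gauss(x0hat, Sig0 ** 0.5)
ys = []
for _ in range(n):
    x = a * x + random.gauss(0.0, s ** 0.5)
    ys.append(h * x + random.gauss(0.0, p ** 0.5))

# --- Kalman filter forward pass + RTS backward pass ---
mf, Pf, mp, Pp = [x0hat], [Sig0], [None], [None]
for y in ys:
    m_pred, P_pred = a * mf[-1], a * a * Pf[-1] + s
    K = P_pred * h / (h * h * P_pred + p)
    mf.append(m_pred + K * (y - h * m_pred))
    Pf.append((1 - K * h) * P_pred)
    mp.append(m_pred)
    Pp.append(P_pred)
ms = mf[:]                                  # smoothed means
for k in reversed(range(n)):
    C = Pf[k] * a / Pp[k + 1]
    ms[k] = mf[k] + C * (ms[k + 1] - mp[k + 1])

# --- MAP smoothing: gradient descent on the quadratic J(x) ---
z = [0.0] * (n + 1)
for _ in range(20000):
    g = [0.0] * (n + 1)
    g[0] = (z[0] - x0hat) / Sig0
    for k in range(1, n + 1):
        d = (z[k] - a * z[k - 1]) / s       # transition-energy term
        g[k] += d + h * (h * z[k] - ys[k - 1]) / p
        g[k - 1] -= a * d
    z = [zk - 0.05 * gk for zk, gk in zip(z, g)]

print(max(abs(zk - mk) for zk, mk in zip(z, ms)))   # ~ 0
```

Since the posterior is Gaussian, its mode (the MAP estimate) equals its mean (the smoother output), so the two trajectories agree to numerical precision.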

7 Optimal control as a theory of biological movement

To say that the brain generates the best behavior it can, subject to the constraints imposed by the body and environment, is almost trivial. After all, the brain has evolved for the sole purpose of generating behavior advantageous to the organism. It is then reasonable to expect that, at least in natural and well-practiced tasks, the observed behavior will be close to optimal. This makes optimal control theory an appealing computational framework for studying the neural control of movement. Optimal control is also a very successful framework in terms of explaining the details of observed movement. However, we have recently reviewed this literature [12] and will not repeat the review here. Instead we will briefly summarize existing optimal control models from a methodological perspective, and then list some research directions which we consider promising.

Most optimality models of biological movement assume deterministic dynamics and impose state constraints at different points in time. These constraints can for example specify the initial and final posture of the body in one step of locomotion, or the positions of a sequence of targets which the hand has to pass through. Since the constraints guarantee accurate execution of the task, there is no need for accuracy-related costs which specify what the task is. The only cost is a cost rate which specifies the "style" of the movement. It has been defined as (an approximation to) metabolic energy, or the squared derivative of acceleration (i.e. jerk), or the squared derivative of joint torque. Minimum-energy models are explicitly formulated as optimal control problems; the solution method is usually based on the maximum principle. Minimum-jerk and minimum-torque-change models are formulated in terms of trajectory optimization. However, they can be easily transformed into optimal control problems by relating the derivative being minimized to a control signal.

Here is an example. Let q(t) be the vector of generalized coordinates (e.g. joint angles) for an articulated body such as the human arm, and let τ(t) be the vector of generalized forces (e.g. joint torques). The equations of motion are

    τ = M(q) q̈ + n(q, q̇)

where M(q) is the configuration-dependent inertia matrix, and n(q, q̇) captures non-linear interaction forces, gravity, and any external force fields that depend on position or velocity. Unlike mechanical devices, the musculo-skeletal system has order higher than two, because the muscle actuators have their own states. For simplicity, assume that the torques τ correspond to the set of muscle activations and have dynamics

    τ̇ = (1/c)(u − τ)

where u(t) is the control signal sent by the nervous system, and c is the muscle time constant (around 40 msec). The state vector of this system is x = [q; q̇; τ], and we will use the subscript notation x[1] = q, x[2] = q̇, x[3] = τ.
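The first-order muscle filter above is easy to simulate with a forward-Euler step. The following minimal sketch (the step input u and the integration step dt are arbitrary illustrative choices, not values from the chapter) shows the activation lagging the control signal with time constant c:

```python
# Forward-Euler simulation of the first-order muscle filter
#   tau_dot = (u - tau) / c
# with time constant c = 40 msec. The step input u = 1 and the
# step size dt are illustrative choices.

def simulate_muscle(u, c=0.040, dt=0.001, T=0.400, tau0=0.0):
    """Integrate tau_dot = (u - tau)/c from tau0 for T seconds."""
    tau = tau0
    trace = [tau]
    for _ in range(int(T / dt)):
        tau += dt * (u - tau) / c
        trace.append(tau)
    return trace

trace = simulate_muscle(u=1.0)
# After one time constant (40 msec) the activation has covered
# about 63% of the gap to u; after ten time constants it has
# essentially converged.
print(trace[40], trace[-1])
```

This lag is what makes the system third-order overall: the control u does not act on the torques instantaneously.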

The general first-order dynamics ẋ = f(x, u) are given by

    ẋ[1] = x[2]
    ẋ[2] = M(x[1])⁻¹ (x[3] − n(x[1], x[2]))
    ẋ[3] = (1/c)(u − x[3])

Note that these dynamics are affine in the control signal, and can be written as

    ẋ = a(x) + B u

Now we can specify a desired movement time t_f, an initial state x_0, and a final state x_f. We can also specify a cost rate, such as

    control energy:  ℓ(x, u) = (1/2) ‖u‖²
    torque-change:   ℓ(x, u) = (1/2) ‖τ̇‖² = (1/(2c²)) ‖u − x[3]‖²

In both cases the cost is quadratic in u and the dynamics are affine in u, so the Hamiltonian can be minimized explicitly. Focusing on the minimum-energy model, we have

    H(x, u, p) = (1/2) ‖u‖² + (a(x) + B u)ᵀ p
    π(x, p) = argmin_u H(x, u, p) = −Bᵀ p

We can now apply the maximum principle and obtain the ODE

    ẋ = a(x) − B Bᵀ p
    ṗ = −a_x(x)ᵀ p

with boundary conditions x(0) = x_0 and x(t_f) = x_f. If instead of a terminal constraint we wish to specify a final cost h(x), then the boundary condition x(t_f) = x_f is replaced with p(t_f) = h_x(x(t_f)). Either way we have as many scalar variables as boundary conditions, and the problem can be solved numerically using an ODE two-point boundary-value solver. When a final cost is used, the problem can also be solved using iterative LQG approximations.

Some optimal control models have considered stochastic dynamics, and used accuracy costs rather than state constraints to specify the task (state constraints cannot be enforced in a stochastic system). Such models have almost exclusively been formulated within the LQG setting. Control under sensory noise and delays has also been considered; in that case the model involves a sensory-motor loop composed of a Kalman filter and a linear-quadratic regulator. Of particular interest in stochastic models is control-multiplicative noise (also called signal-dependent noise). It is a well-established property of the motor system, and appears to be the reason for speed-accuracy trade-offs such as Fitts' law. Control-multiplicative noise can be formalized as

    dx = (a(x) + B u) dt + σ B D(u) dω

where D(u) is a diagonal matrix with the components of u on its main diagonal. In this system, each component of the control signal is polluted with Gaussian noise whose standard deviation is proportional to that component. The noise covariance is then

    S = σ² B D(u) D(u)ᵀ Bᵀ
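The defining property of control-multiplicative noise, a noise standard deviation proportional to the control magnitude, can be checked with a one-step Euler-Maruyama simulation of the scalar special case dx = u dt + σ u dω (a sketch only; σ, dt and the controls are illustrative values):

```python
import random

# One Euler-Maruyama step of the scalar control-multiplicative model
#   dx = u*dt + sigma*u*dw,   dw ~ N(0, dt).
# The empirical standard deviation of the increment should be
# sigma * |u| * sqrt(dt). All numeric values are illustrative.

def increments(u, sigma=0.5, dt=0.01, n=200_000, seed=0):
    rng = random.Random(seed)
    sqrt_dt = dt ** 0.5
    return [u * dt + sigma * u * rng.gauss(0.0, sqrt_dt) for _ in range(n)]

def std(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

# Doubling the control doubles the noise standard deviation,
# which is the signature of signal-dependent noise.
s1 = std(increments(u=1.0))
s2 = std(increments(u=2.0))
print(s1, s2)
```

Larger controls thus buy speed at the price of noise, which is the mechanism behind the speed-accuracy trade-offs mentioned above.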

Now suppose the cost rate is

    ℓ(x, u) = (1/2) uᵀ R u + q(x, t)

Using the fact that tr(S X) = σ² tr(D(u)ᵀ Bᵀ X B D(u)) = σ² uᵀ Bᵀ X B u for any matrix X, one can verify that the Hamiltonian for this stochastic optimal control problem is

    (1/2) uᵀ (R + σ² Bᵀ v_xx(x, t) B) u + q(x, t) + (a(x) + B u)ᵀ v_x(x, t)

If we think of the matrix v_xx(x, t) as a given, the above expression is the Hamiltonian for a deterministic optimal control problem, with cost rate in the same form as above and modified control-energy weighting matrix

    R̃(x, t) = R + σ² Bᵀ v_xx(x, t) B

Thus, incorporating control-multiplicative noise in an optimal control problem is equivalent to increasing the control energy cost. The cost increase required to make the two problems equivalent is of course impossible to compute without first solving the stochastic problem (since it depends on the unknown optimal value function). Nevertheless this analysis affords some insight into the effects of such noise. Note that in the LQG case v is quadratic and its Hessian v_xx is constant, so the optimal control law under control-multiplicative noise can be found in closed form.

7.1 Promising research directions

There are plenty of examples where motor behavior is found to be optimal under a reasonable cost function, just as there are plenty of examples where perceptual judgements are found to be optimal under a reasonable prior. There is little doubt that many additional examples will accumulate over time, and reinforce the principle of optimal sensory-motor processing. But can we expect future developments that are conceptually novel? Here we summarize four under-explored research directions which may lead to such developments.

Neural implementation of optimal control laws. Optimal control modeling has been restricted to the behavioral level of analysis: the control laws used to predict behavior are mathematical functions without an obvious neural implementation. In order to bridge the gap between behavior and single neurons, we will need realistic neural networks trained to mimic the input-output behavior of optimal control laws. Such networks will have to operate in closed loop with a simulated body.

Distributed and hierarchical control. Most existing models of movement control are monolithic. In contrast, the motor system is distributed and includes a number of anatomically distinct areas which presumably have distinct computational roles. To address this discrepancy, we have recently developed a hierarchical framework for approximately optimal control. In this framework, a low-level feedback controller transforms the musculo-skeletal system into an augmented system for which high-level optimal control laws can be designed more efficiently [13].

Motor learning and adaptation. Optimal control has been used to model behavior in well-practiced tasks where performance is already stable. But the processes of motor learning and adaptation, which are responsible for reaching stable performance, have rarely been modeled from the viewpoint of optimality. Such modeling should be straightforward given the numerous iterative algorithms for optimal controller design that exist.

Inverse optimal control. Optimal control models are presently constructed by guessing the cost function, obtaining the corresponding optimal control law, and comparing its predictions to experimental data. Ideally we would be able to do the opposite: record data, and automatically infer a cost function for which the observed behavior is optimal. There are reasons to believe that most sensible behaviors are optimal with respect to some cost function.
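Returning briefly to the control-multiplicative noise analysis: since v_xx is constant in the LQG case, the closed-form control law can be written down directly, and it makes the "equivalent to increasing the control energy cost" interpretation concrete. A minimal scalar sketch (all numeric values are invented for illustration):

```python
# Scalar LQG case with control-multiplicative noise. The value
# function is quadratic, v(x) = (1/2)*X*x**2, so v_x = X*x and the
# Hessian v_xx = X is constant. Minimizing the Hamiltonian
#   (1/2)*(R + sigma**2 * B*X*B)*u**2 + (a(x) + B*u)*X*x
# over u gives the closed-form feedback law
#   u = -B*X*x / (R + sigma**2 * B*B*X)
# All numbers below are made up for illustration.

def feedback_gain(R=1.0, B=1.0, X=4.0, sigma=0.0):
    """Scalar optimal gain g such that u = -g*x."""
    return B * X / (R + sigma ** 2 * B * B * X)

gains = [feedback_gain(sigma=s) for s in (0.0, 0.5, 1.0)]
print(gains)  # the gain shrinks monotonically as sigma grows
```

The gain shrinks with σ exactly as it would under a larger control-energy weight R, so the controller becomes more conservative as signal-dependent noise grows.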

Recommended further reading

The mathematical ideas introduced in this chapter are developed in more depth in a number of well-written books. The standard reference on dynamic programming is Bertsekas [3]. Numerical approximations to dynamic programming, with emphasis on discrete state-action spaces, are introduced in Sutton and Barto [10] and formally described in Bertsekas and Tsitsiklis [2]. The classic Bryson and Ho [5] remains one of the best treatments of the maximum principle and its applications (including applications to minimum-energy filters). A classic treatment of optimal estimation is Anderson and Moore [1]. A comprehensive text covering most aspects of continuous optimal control and estimation is Stengel [11]. An intuitive yet rigorous introduction to stochastic calculus can be found in Oksendal [7]. Discretization schemes for continuous stochastic optimal control problems are developed in Kushner and Dupuis [6]. The advanced subjects of non-smooth analysis and viscosity solutions are covered in Vinter [14]. The differential-geometric approach to mechanics and control (including optimal control) is developed in Bloch [4]. The applications of optimal control theory to biological movement are reviewed in Todorov [12] and also in Pandy [8]. The links to motor neurophysiology are explored in Scott [9]. A hierarchical framework for optimal control is presented in Todorov et al [13].

Acknowledgements

This work was supported by US National Institutes of Health grant NS-045915 and US National Science Foundation grant ECS-0524761. Thanks to Weiwei Li for proofreading the manuscript.

References

[1] Anderson, B. and Moore, J. (1979) Optimal Filtering. Prentice Hall, Englewood Cliffs, NJ.
[2] Bertsekas, D. and Tsitsiklis, J. (1996) Neuro-dynamic Programming. Athena Scientific, Belmont, MA.
[3] Bertsekas, D. (2000) Dynamic Programming and Optimal Control (2nd ed). Athena Scientific, Belmont, MA.
[4] Bloch, A. (2003) Nonholonomic Mechanics and Control. Springer, New York.
[5] Bryson, A. and Ho, Y. (1969) Applied Optimal Control. Blaisdell Publishing, Waltham, MA.
[6] Kushner, H. and Dupuis, P. (2001) Numerical Methods for Stochastic Control Problems in Continuous Time (2nd ed). Springer, New York.
[7] Oksendal, B. (1995) Stochastic Differential Equations (4th ed). Springer, Berlin.
[8] Pandy, M. (2001) Computer modeling and simulation of human movement. Annual Review of Biomedical Engineering 3: 245-273.
[9] Scott, S. (2004) Optimal feedback control and the neural basis of volitional motor control. Nature Reviews Neuroscience 5: 534-546.
[10] Sutton, R. and Barto, A. (1998) Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
[11] Stengel, R. (1994) Optimal Control and Estimation. Dover, New York.
[12] Todorov, E. (2004) Optimality principles in sensorimotor control. Nature Neuroscience 7: 907-915.
[13] Todorov, E., Li, W. and Pan, X. (2005) From task parameters to motor synergies: A hierarchical framework for approximately optimal control of redundant manipulators. Journal of Robotic Systems 22: 691-719.
[14] Vinter, R. (2000) Optimal Control. Birkhauser, Boston, MA.
