
Approximate Dynamic Programming

- Report

Nishanth A Rao, Nishanth Shetty, Sai Krishna B V

November 29, 2019


Contents

1 Calculus of Variations 3
1.1 Calculus of Variations . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Functional . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Increment of a functional . . . . . . . . . . . . . . . . . 3
1.1.3 Fundamental Theorem of Calculus of Variations . . . . 3
1.1.4 Euler Equation . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Necessary conditions for optimal control . . . . . . . . . . . . 4
1.3 Linear Regulator Problems . . . . . . . . . . . . . . . . . . . . 5
1.4 Linear Tracking Problems . . . . . . . . . . . . . . . . . . . . 7
1.5 Pontryagin’s Minimum Principle . . . . . . . . . . . . . . . . . 9

2 Dynamic Programming 12
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 Principle of Optimality . . . . . . . . . . . . . . . . . . 13
2.1.2 Curse Of Dimensionality . . . . . . . . . . . . . . . . . 14
2.1.3 Recurrence relation of Dynamic Programming . . . . . 14
2.1.4 Characteristics of Dynamic Programming solution . . . 16
2.2 Discrete Linear Regulator Problem . . . . . . . . . . . . . . . 16
2.3 Hamilton-Jacobi-Bellman equation . . . . . . . . . . . . . . . 18
2.4 Continuous Linear Regulator problem . . . . . . . . . . . . . . 20

3 Approximate Dynamic Programming 22


3.1 Stochastic Dynamic Programming . . . . . . . . . . . . . . . . 22
3.2 DP Algorithm for Stochastic Finite Horizon Problems . . . . . 23
3.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Approximation in Value Space . . . . . . . . . . . . . . . . . . 24
3.3.1 Issues with Approximation in Value Space . . . . . . . 24
3.4 Parametric Approximation . . . . . . . . . . . . . . . . . . . . 25

3.4.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . 25

Chapter 1

Calculus of Variations

1.1 Calculus of Variations


1.1.1 Functional
A functional J is a rule of correspondence that assigns to each function x
in a certain class Ω a unique real number. Intuitively, it is a “function of a
function”.

1.1.2 Increment of a functional


    ∆J(x, δx) = δJ(x, δx) + g(x, δx) · ‖δx‖    (1.1)

where δJ is linear in δx (δx is the variation of the function x). If

    lim_{‖δx‖→0} g(x, δx) = 0

then J is said to be differentiable on x, and δJ is the variation of J evaluated
for the function x.

1.1.3 Fundamental Theorem of Calculus of Variations


If x* is an extremal, the variation of J must vanish on x*, i.e.,

    δJ(x*, δx) = 0 for all admissible δx

1.1.4 Euler Equation
A necessary condition for x* to be an extremal for the functional J of the form

    J(x) = ∫_{t0}^{tf} g(x(t), ẋ(t), t) dt    (1.2)

is given by the Euler equation

    ∂g/∂x(x*(t), ẋ*(t), t) − d/dt[∂g/∂ẋ(x*(t), ẋ*(t), t)] = 0    (1.3)

with the boundary conditions

    [∂g/∂ẋ(x*(tf), ẋ*(tf), tf)]ᵀ δxf + [ g(x*(tf), ẋ*(tf), tf)
        − (∂g/∂ẋ(x*(tf), ẋ*(tf), tf))ᵀ ẋ*(tf) ] δtf = 0    (1.4)
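As a quick illustration of the Euler equation (a standard textbook example, not taken from this report), consider finding the curve of minimum length between two fixed points, i.e. the functional

    J(x) = ∫_{t0}^{tf} √(1 + ẋ²(t)) dt

Here ∂g/∂x = 0 and ∂g/∂ẋ = ẋ/√(1 + ẋ²), so (1.3) reduces to d/dt[ẋ/√(1 + ẋ²)] = 0. Hence ẋ(t) is constant along an extremal, and x*(t) is the straight line joining the two boundary points.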


1.2 Necessary conditions for optimal control


Find an admissible control u∗ that causes the system

ẋ(t) = a(x(t), u(t), t) (1.5)

to follow an admissible trajectory x∗ that minimizes the performance measure


    J(u) = h(x(tf), tf) + ∫_{t0}^{tf} g(x(t), u(t), t) dt    (1.6)

This performance measure can also be written as


    J(u) = ∫_{t0}^{tf} { g(x(t), u(t), t) + d/dt[h(x(t), t)] } dt + h(x(t0), t0)    (1.7)
As the states are constrained by the set of differential equations in (1.5), we
form the augmented functional by adjoining the constraint with the costate
p(t). Applying the chain rule of differentiation, we get:
    Ja(u) = ∫_{t0}^{tf} { g(x(t), u(t), t) + [∂h/∂x(x(t), t)]ᵀ ẋ(t) + ∂h/∂t(x(t), t)
            + pᵀ(t)[a(x(t), u(t), t) − ẋ(t)] } dt    (1.8)

For convenience, define the Hamiltonian function as follows:

    H(x(t), u(t), p(t), t) ≜ g(x(t), u(t), t) + pᵀ(t)[a(x(t), u(t), t)]    (1.9)

Thus the necessary conditions that must be met are as follows:

    ẋ*(t) =  ∂H/∂p (x*(t), u*(t), p*(t), t)
    ṗ*(t) = −∂H/∂x (x*(t), u*(t), p*(t), t)        for all t ∈ [t0, tf]    (1.10)
       0  =  ∂H/∂u (x*(t), u*(t), p*(t), t)

and,
    [∂h/∂x(x*(tf), tf) − p*(tf)]ᵀ δxf + [ H(x*(tf), u*(tf), p*(tf), tf)
        + ∂h/∂t(x*(tf), tf) ] δtf = 0    (1.11)

1.3 Linear Regulator Problems


The linear regulator system forms an important class of optimal control
problems. Consider the plant described by the linear time-varying state
equations:
ẋ(t) = A(t)x(t) + B(t)u(t) (1.12)
The performance measure to be minimized is

    J = (1/2) xᵀ(tf) H x(tf) + (1/2) ∫_{t0}^{tf} [ xᵀ(t)Q(t)x(t) + uᵀ(t)R(t)u(t) ] dt    (1.13)

Note that H and Q are real symmetric positive semi-definite matrices, and R
is a real symmetric positive definite matrix. This way, the optimal solution
is guaranteed to minimize the performance measure globally.
Physically, the performance measure is interpreted as follows: The state
vector is maintained close to the origin without an excessive expenditure of
control effort.

Thus, for the linear quadratic regulator problem, the Hamiltonian is defined as
follows:

    H(x(t), u(t), p(t), t) = (1/2) xᵀ(t)Q(t)x(t) + (1/2) uᵀ(t)R(t)u(t)
        + pᵀ(t)A(t)x(t) + pᵀ(t)B(t)u(t)    (1.14)
and the necessary conditions for optimality are:

    ẋ*(t) = A(t)x*(t) + B(t)u*(t)    (1.15)

    ṗ*(t) = −∂H/∂x = −Q(t)x*(t) − Aᵀ(t)p*(t)    (1.16)

    0 = ∂H/∂u = R(t)u*(t) + Bᵀ(t)p*(t)    (1.17)
Solving for u∗ (t) from 1.17, we get

u∗ (t) = −R−1 (t)BT (t)p∗ (t) (1.18)

Thus from 1.15, we have


ẋ∗ (t) = A(t)x∗ (t) − B(t)R−1 (t)BT (t)p∗ (t) (1.19)
Rewriting the above 2n linear homogeneous differential equations in a matrix
form, we have:
    \begin{bmatrix} ẋ*(t) \\ ṗ*(t) \end{bmatrix} =
    \begin{bmatrix} A(t) & −B(t)R⁻¹(t)Bᵀ(t) \\ −Q(t) & −Aᵀ(t) \end{bmatrix}
    \begin{bmatrix} x*(t) \\ p*(t) \end{bmatrix}

The solution to these equations is of the form:

    \begin{bmatrix} x*(tf) \\ p*(tf) \end{bmatrix} =
    \begin{bmatrix} ϕ11(tf, t) & ϕ12(tf, t) \\ ϕ21(tf, t) & ϕ22(tf, t) \end{bmatrix}
    \begin{bmatrix} x*(t) \\ p*(t) \end{bmatrix}    (1.20)

where ϕij are n × n matrices.


Solving for p∗ (t) gives:
p∗ (t) = [ϕ22 (tf , t) − Hϕ12 (tf , t)]−1 [Hϕ11 (tf , t) − ϕ21 (tf , t)] x∗ (t) (1.21)
where the boundary condition p*(tf) = Hx*(tf) has been used. Kalman
has shown that the required inverse exists for all t ∈ [t0, tf]. Rewriting the
above equation as:

    p*(t) ≜ K(t)x*(t)    (1.22)

we see that p*(t) is a linear function of the states of the system, and the
matrix K(t) is symmetric. Substituting back in (1.18), we get:

    u*(t) = −R⁻¹(t)Bᵀ(t)K(t)x(t) ≜ F(t)x(t)    (1.23)

This indicates that the optimal control law is a linear, time-varying combination
of the system states. However, it is important to note that measurements of all
of the state variables must be available to implement the optimal control law.
In order to calculate the matrix K(t), we need the state transition matrix.
Computing the time-varying state transition matrix is a very time-consuming
and tedious task. However, there is an alternative approach: it can be shown
that the matrix K(t) satisfies the matrix differential equation

    K̇(t) = −K(t)A(t) − Aᵀ(t)K(t) − Q(t) + K(t)B(t)R⁻¹(t)Bᵀ(t)K(t)    (1.24)

along with the boundary condition K(tf) = H. Equation (1.24) is called the
Riccati equation.
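To make the backward solution of (1.24) concrete, the following Python sketch integrates the Riccati equation backwards from K(tf) = H with simple fixed-step Euler steps and then forms the feedback gain F(t) of (1.23). It is only an illustrative sketch (constant A, B, Q, R are assumed, and the matrices, horizon, and step count are invented for the example); it is not code from the report.

import numpy as np

def riccati_backward(A, B, Q, R, H, tf, steps=1000):
    # Integrate K'(t) = -K A - A^T K - Q + K B R^{-1} B^T K backwards from K(tf) = H
    dt = tf / steps
    K = H.copy()
    Rinv = np.linalg.inv(R)
    gains = []                          # F(t) = -R^{-1} B^T K(t), collected from tf down to t0
    for _ in range(steps):
        Kdot = -K @ A - A.T @ K - Q + K @ B @ Rinv @ B.T @ K
        K = K - dt * Kdot               # one backward Euler step in time
        gains.append(-Rinv @ B.T @ K)
    return K, gains[::-1]               # K(t0) and the gains reordered forward in time

# Example data (assumed purely for illustration): a double integrator
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
H = np.zeros((2, 2))
K0, F = riccati_backward(A, B, Q, R, H, tf=5.0)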

1.4 Linear Tracking Problems


When the desired value of the state is not the origin, the linear quadratic
regulator problem becomes a linear tracking problem. The performance measure
to be minimized in this case is:

    J = (1/2) [x(tf) − r(tf)]ᵀ H [x(tf) − r(tf)]
        + (1/2) ∫_{t0}^{tf} { [x(t) − r(t)]ᵀ Q(t) [x(t) − r(t)] + uᵀ(t)R(t)u(t) } dt

      ≜ (1/2) ‖x(tf) − r(tf)‖²_H + (1/2) ∫_{t0}^{tf} { ‖x(t) − r(t)‖²_{Q(t)} + ‖u(t)‖²_{R(t)} } dt
where r(t) is the desired/reference value of the state vector; H and Q are
real symmetric positive semi-definite matrices, and R is real symmetric and
positive definite. The Hamiltonian is

    H(x(t), u(t), p(t), t) = (1/2) ‖x(t) − r(t)‖²_{Q(t)} + (1/2) ‖u(t)‖²_{R(t)}
        + pᵀ(t)A(t)x(t) + pᵀ(t)B(t)u(t)    (1.25)

and the necessary conditions to be satisfied are:

    ẋ*(t) = A(t)x*(t) + B(t)u*(t)    (1.26)

    ṗ*(t) = −∂H/∂x = −Q(t)x*(t) − Aᵀ(t)p*(t) + Q(t)r(t)    (1.27)

    0 = ∂H/∂u = R(t)u*(t) + Bᵀ(t)p*(t)    (1.28)
Thus, we get
u∗ (t) = −R−1 (t)BT (t)p∗ (t) (1.29)
and,

    \begin{bmatrix} ẋ*(t) \\ ṗ*(t) \end{bmatrix} =
    \begin{bmatrix} A(t) & −B(t)R⁻¹(t)Bᵀ(t) \\ −Q(t) & −Aᵀ(t) \end{bmatrix}
    \begin{bmatrix} x*(t) \\ p*(t) \end{bmatrix} +
    \begin{bmatrix} 0 \\ Q(t)r(t) \end{bmatrix}

where Q(t)r(t) is a forcing function. These differential equations are linear
and time-varying but not homogeneous. Thus the solution to the above equation
is of the form:

    \begin{bmatrix} x*(tf) \\ p*(tf) \end{bmatrix} =
    ϕ(tf, t) \begin{bmatrix} x*(t) \\ p*(t) \end{bmatrix} +
    ∫_{t}^{tf} ϕ(tf, τ) \begin{bmatrix} 0 \\ Q(τ)r(τ) \end{bmatrix} dτ    (1.30)

If ϕ is partitioned and the integral is replaced by the 2n×1 vector

    ∫_{t}^{tf} ϕ(tf, τ) \begin{bmatrix} 0 \\ Q(τ)r(τ) \end{bmatrix} dτ =
    \begin{bmatrix} f1(t) \\ f2(t) \end{bmatrix}

then solving for p*(t) yields

    p*(t) = [ϕ22(tf, t) − Hϕ12(tf, t)]⁻¹ [Hϕ11(tf, t) − ϕ21(tf, t)] x*(t)
        + [ϕ22(tf, t) − Hϕ12(tf, t)]⁻¹ [Hf1(t) − Hr(tf) − f2(t)]
      ≜ K(t)x*(t) + s(t)    (1.31)
along with the boundary conditions:
p∗ (tf ) = Hx∗ (tf ) − Hr (tf ) (1.32)
Thus, the optimal control law is given by:

    u*(t) = −R⁻¹(t)Bᵀ(t)K(t)x(t) − R⁻¹(t)Bᵀ(t)s(t)
          ≜ F(t)x(t) + v(t)    (1.33)

where F(t) is the feedback gain matrix and v(t) is the command signal.
v(t) depends on the system parameters and on future values of the reference
signal, so the optimal control has an anticipatory quality: the present strategy
must be determined on the basis of where we are now and where we intend to
go. Again, we face the issue of determining the state transition matrix in order
to find the matrix K(t). However, there is an alternative route: it can be shown
that K(t) satisfies the same Riccati equation as the regulator problem, and
s(t) is then obtained from an additional linear differential equation:

K̇(t) = −K(t)A(t) − AT (t)K(t) − Q(t) + K(t)B(t)R−1 (t)BT (t)K(t) (1.34)

and

    ṡ(t) = −[Aᵀ(t) − K(t)B(t)R⁻¹(t)Bᵀ(t)] s(t) + Q(t)r(t)    (1.35)

along with the boundary conditions

K (tf ) = H
(1.36)
s (tf ) = −Hr (tf )
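As a rough illustration of how (1.34)–(1.36) are used, the sketch below integrates K(t) and s(t) backwards for a scalar system and then forms the control u*(t) = F(t)x(t) + v(t) of (1.33). The scalar plant, weights, and reference signal are invented purely for the example; this is a sketch, not the report's implementation.

import numpy as np

# Scalar tracking example (all numbers assumed for illustration): x' = a x + b u
a, b, q, r, h_term, tf, steps = -1.0, 1.0, 4.0, 1.0, 0.0, 5.0, 500
dt = tf / steps
t_grid = np.linspace(0.0, tf, steps + 1)
ref = np.sin(t_grid)                      # reference trajectory r(t)

K = np.zeros(steps + 1); s = np.zeros(steps + 1)
K[-1] = h_term                            # K(tf) = H
s[-1] = -h_term * ref[-1]                 # s(tf) = -H r(tf)
for i in range(steps, 0, -1):             # integrate (1.34) and (1.35) backwards
    Kdot = -2 * a * K[i] - q + (b**2 / r) * K[i] ** 2
    sdot = -(a - (b**2 / r) * K[i]) * s[i] + q * ref[i]
    K[i - 1] = K[i] - dt * Kdot
    s[i - 1] = s[i] - dt * sdot

F = -(b / r) * K                          # feedback gain F(t)
v = -(b / r) * s                          # command signal v(t); note it depends on future r(t)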

1.5 Pontryagin’s Minimum Principle


So far, it has been assumed that there are no constraints on the admissible
controls and the states of a system. Physically, however, controls generally
have magnitude limitations. For example, the input voltage to a motor cannot
exceed a certain limit, beyond which the motor can be damaged. Thus we need to
modify the necessary conditions that an optimal control must satisfy, and
generalize the fundamental theorem, leading to Pontryagin's minimum principle.

By definition, the control u* causes the functional J to have a relative
minimum if

    J(u) − J(u*) = ∆J ≥ 0

for all admissible controls sufficiently close to u*. If u = u* + δu, then the
increment in J can be expressed as

∆J(u∗ , δu) = δJ(u∗ , δu) + higher-order terms;
The higher-order terms approach zero as the norm of δu approaches zero. δu
is arbitrary only if the extremal control is strictly within the boundary; in
this case, the boundary has no effect on the problem solution. However, if the
extremal control lies on a boundary during at least one subinterval [t1, t2] of
the interval [t0, tf], then there are admissible control variations δû whose
negatives (−δû) are not admissible. If only these variations are considered, a
necessary condition for u* to minimize J is that δJ(u*, δû) ≥ 0. For variations
δũ which lie strictly within the boundary, it is necessary that δJ(u*, δũ) = 0.
Thus, considering all admissible variations with ‖δu‖ small enough so that the
sign of ∆J is determined by δJ, the necessary condition for u* to minimize J is

    δJ(u*, δu) ≥ 0

Rewriting the increment of J using the Hamiltonian, the fact that p*(t) is
selected such that the coefficient of δx(t) in the integral is identically zero,
and the boundary conditions being satisfied, we get:
    ∆J(u*, δu) = ∫_{t0}^{tf} [∂H/∂u(x*(t), u*(t), p*(t), t)]ᵀ δu(t) dt
        + higher-order terms    (1.37)

The integrand is the first-order approximation to the change in H caused by a
change in u alone, i.e.,

    [∂H/∂u(x*(t), u*(t), p*(t), t)]ᵀ δu(t) ≐ H(x*(t), u*(t) + δu(t), p*(t), t) − H(x*(t), u*(t), p*(t), t)    (1.38)
Thus, for a sufficiently small neighborhood of u* (‖δu‖ ≤ β, where β is a small
positive number), a necessary condition for u* to be a minimizing control is:

    ∫_{t0}^{tf} [ H(x*(t), u*(t) + δu(t), p*(t), t) − H(x*(t), u*(t), p*(t), t) ] dt ≥ 0    (1.39)
⇒ H (x∗ (t), u∗ (t) + δu(t), p∗ (t), t) ≥ H (x∗ (t), u∗ (t), p∗ (t), t) (1.40)

for all t ∈ [t0, tf] and for all admissible controls. Thus, an optimal control
must minimize the Hamiltonian; this is Pontryagin's minimum principle. Note
that an optimal control must satisfy Pontryagin's minimum principle; however,
there may be controls that satisfy the minimum principle that are not optimal.

In summary, a control u∗ ∈ U , which causes the system

ẋ(t) = a(x(t), u(t), t) (1.41)

to follow an admissible trajectory that minimizes the performance measure


    J(u) = h(x(tf), tf) + ∫_{t0}^{tf} g(x(t), u(t), t) dt,    (1.42)

is sought. In terms of the Hamiltonian

    H(x(t), u(t), p(t), t) ≜ g(x(t), u(t), t) + pᵀ(t)[a(x(t), u(t), t)],    (1.43)

necessary conditions for u∗ to be an optimal control are

    ẋ*(t) =  ∂H/∂p (x*(t), u*(t), p*(t), t)
    ṗ*(t) = −∂H/∂x (x*(t), u*(t), p*(t), t)                     for all t ∈ [t0, tf]    (1.44)
    H(x*(t), u*(t), p*(t), t) ≤ H(x*(t), u(t), p*(t), t)        for all admissible u(t)
and the boundary conditions

    [∂h/∂x(x*(tf), tf) − p*(tf)]ᵀ δxf + [ H(x*(tf), u*(tf), p*(tf), tf)
        + ∂h/∂t(x*(tf), tf) ] δtf = 0    (1.45)

The minimum principle can also be applied to problems in which the admis-
sible controls are not bounded. This can be done by viewing the unbounded
control region as having arbitrarily large bounds, thus ensuring that the op-
timal control will not be constrained by the boundaries. Thus in this case,
for u∗ (t) to minimize the Hamiltonian, it is necessary that

    ∂H/∂u (x*(t), u*(t), p*(t), t) = 0    (1.46)
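As a standard illustration of the minimum principle with bounded controls (an added example, not from the original report), suppose the Hamiltonian is linear in a scalar control constrained by |u(t)| ≤ 1:

    H(x, u, p, t) = g0(x, t) + pᵀ(t)[A x(t) + b u(t)],    |u(t)| ≤ 1

Minimizing H pointwise over the admissible set gives the bang-bang law

    u*(t) = −sgn(bᵀ p*(t)),

and u*(t) is left undetermined by the principle on any interval where bᵀp*(t) = 0 (a singular interval). The stationarity condition (1.46) gives no information here, which is why the minimum principle, rather than ∂H/∂u = 0, is needed when the controls are constrained.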

Chapter 2

Dynamic Programming

2.1 Introduction
Before the work of Pontryagin and Bellman, obtaining the optimal control
for a given system required the minimization of a functional using variational
techniques.
With Pontryagin’s Minimum Principle, the task of finding an optimal
control boils down to minimizing a Hamiltonian.
Given an initial state of the system x0 at time t = t0 ,
using Pontryagin’s minimum principle we can obtain an optimal control that
generates a state trajectory which minimizes the cost functional.
Richard Bellman had other ideas,

“In place of determining the optimal sequence of decisions from the fixed
state of the system, we wish to determine the optimal decision to be made
at any state of the system.”

This idea of Bellman required the solution of an entire family of minimization
problems, i.e., minimizing the cost functional for all possible values of t and x.
This would require a lot of computation.
Example 1:
Consider a discrete time system,

xk+1 = f (xk , uk ) k = 0, 1, 2, ...., T − 1

where xk lives in a finite set X consisting of N elements, uk lives in a finite
set U consisting of M elements, and T, N, M are fixed positive integers.

Suppose a cost is assigned to each transition from xk to xk+1, and there is also
a terminal cost function. For each trajectory, the total cost accumulated is the
sum of the transition costs at each time step plus the terminal cost at xT.

For a given initial state x0, a naive approach to minimizing the cost would be
to start from x0 and enumerate all possible trajectories going forward up to
time T. The computational effort required for this is O(M^T · T): there are M
possible controls at each stage and T stages, so there are M^T trajectories, and
computing the total cost of a trajectory by adding the transition costs at every
stage requires T additions. Since there are N possible initial states, the total
computation required would be O(N · M^T · T).

An alternative approach is to go backward in time: starting at k = T, the
terminal cost for each xk is known. At k = T−1, for each xk, the control (and
hence the corresponding xk+1) with the least cost (one-step transition cost +
terminal cost) has to be determined. If there is more than one minimizing
choice, pick one at random. Repeating the above for k = T−2, ..., 0, we obtain
the control that minimizes the cost. At each stage k, for each state xk and each
control uk, the transition cost has to be added to the already computed minimum
cost starting from xk+1. Hence the computational effort with this backward
approach is O(N · M · T).

This backward approach finds the optimal policy for every initial state and also
for any intermediate state, and hence satisfies Bellman's objective of finding an
optimal policy for every state. This recursive scheme serves as an example of
the method of dynamic programming.
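The following Python sketch implements this backward recursion for a small instance with made-up dynamics and costs (the state/control sets, transition function, and cost functions are invented for illustration; only the structure of the algorithm follows the text). The final assertion checks it against brute-force enumeration for one initial state.

import itertools

# Small made-up instance: states 0..N-1, controls 0..M-1, horizon T
N, M, T = 5, 2, 4
f = lambda x, u: (x + u) % N          # transition function x_{k+1} = f(x_k, u_k)
g = lambda x, u: (x - 2) ** 2 + u     # transition (stage) cost
h = lambda x: abs(x - 2)              # terminal cost

# Backward dynamic programming: O(N*M*T) work
J = [[0.0] * N for _ in range(T + 1)]
policy = [[0] * N for _ in range(T)]
for x in range(N):
    J[T][x] = h(x)
for k in range(T - 1, -1, -1):
    for x in range(N):
        costs = [g(x, u) + J[k + 1][f(x, u)] for u in range(M)]
        policy[k][x] = min(range(M), key=costs.__getitem__)
        J[k][x] = costs[policy[k][x]]

# Brute force for one initial state: all M**T control sequences, O(M**T * T) work
def trajectory_cost(x0, controls):
    x, total = x0, 0.0
    for u in controls:
        total += g(x, u)
        x = f(x, u)
    return total + h(x)

x0 = 0
brute = min(trajectory_cost(x0, seq) for seq in itertools.product(range(M), repeat=T))
assert abs(brute - J[0][x0]) < 1e-9   # DP matches exhaustive enumeration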

2.1.1 Principle of Optimality


Given a sequential decision-making problem, a good strategy is to solve the
fundamental parts of the problem first and then combine these solutions to
solve a bigger subproblem; doing this recursively, the solution to the actual
problem is obtained. This is the Bottom-Up Approach to solving a sequential
decision-making problem. From Example 1, we can observe that the Bottom-Up
Approach required fewer computations than the brute-force approach. This
feature of the Bottom-Up Approach was called the Principle of Optimality by
Richard Bellman, who further defined the optimal policy as:

An optimal policy has the property that whatever the initial state
and initial decision are, the remaining decisions must constitute
an optimal policy with regard to the state resulting from the first
decision.

2.1.2 Curse Of Dimensionality


Using dynamic programming does reduce the number of computations required
compared with direct enumeration, but when dealing with systems of higher
dimensions, the number of computations required will still be very high.
Dynamic programming does not provide an analytical solution; it provides the
solution in tabular form. For high-dimensional systems, the look-up in this
table also becomes computationally expensive due to the limited availability
of high-speed memory. Bellman phrased this as the Curse of Dimensionality.

2.1.3 Recurrence relation of Dynamic Programming


Consider an nth-order time-invariant system described by

    ẋ(t) = a(x(t), u(t))    (2.1)

It is desired to determine the control law which minimizes the performance
measure

    J = h(x(tf)) + ∫_0^{tf} g(x(t), u(t)) dt    (2.2)

where tf is assumed to be fixed. Discretizing the state equation and the
performance measure (with N∆t = tf and a first-order Euler approximation), we
obtain

    x(k + 1) = aD(x(k), u(k))    (2.3)

where aD(x(k), u(k)) ≜ x(k) + ∆t·a(x(k), u(k)), and

    J = h(x(N)) + Σ_{k=0}^{N−1} gD(x(k), u(k))    (2.4)

where gD(x(k), u(k)) ≜ ∆t·g(x(k), u(k)).
Now the discrete problem requires an optimal control law

u*(x(0), 0), u*(x(1), 1), ..., u*(x(N − 1), N − 1),
to be determined for the system given by (2.3) with the performance measure
given by (2.4).
The cost associated with the final state value x(N) is

    J_{N,N}(x(N)) ≜ h(x(N))    (2.5)

The cost of operation during the final stage is

    J_{N−1,N}(x(N − 1), u(N − 1)) ≜ gD(x(N − 1), u(N − 1)) + J_{N,N}(x(N))    (2.6)

and the optimal cost is

    J*_{N−1,N}(x(N − 1)) ≜ min_{u(N−1)} { gD(x(N − 1), u(N − 1))
        + J_{N,N}(aD(x(N − 1), u(N − 1))) }    (2.7)

The cost of operation over the last two stages is given by

    J_{N−2,N}(x(N − 2), u(N − 2), u(N − 1)) = gD(x(N − 2), u(N − 2))
        + J_{N−1,N}(x(N − 1), u(N − 1))    (2.8)

    J*_{N−2,N}(x(N − 2)) ≜ min_{u(N−2), u(N−1)} { gD(x(N − 2), u(N − 2))
        + J_{N−1,N}(x(N − 1), u(N − 1)) }    (2.9)

The principle of optimality states that, for a two-stage process, irrespective of
the initial state x(N − 2) and the initial decision u(N − 2), the remaining
decision u(N − 1) must be optimal with respect to the value of x(N − 1). Hence
the minimum cost for the last two stages can be written as

    J*_{N−2,N}(x(N − 2)) = min_{u(N−2)} { gD(x(N − 2), u(N − 2))
        + J*_{N−1,N}(aD(x(N − 2), u(N − 2))) }    (2.10)


Continuing backward in this manner, we obtain the optimal cost for a K-stage
process,

    J*_{N−K,N}(x(N − K)) = min_{u(N−K)} { gD(x(N − K), u(N − K))
        + J*_{N−(K−1),N}(aD(x(N − K), u(N − K))) }    (2.11)

The above equation is the recurrence relation of dynamic programming: the
optimal policy for a K-stage process is obtained from the optimal policy for a
(K−1)-stage process.
Another observation from the above recurrence relation is that the minimum
cost for the final K stages of an N-stage process, with state value x(N−K) at
the beginning of the (N−K)th stage, is also the minimum possible cost for a
K-stage process with initial state numerically equal to x(N−K). This is called
the imbedding principle.
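In practice the recurrence (2.11) is evaluated on a discretized state grid. The sketch below does this for a made-up scalar example (the dynamics, costs, grid, and quantized control set are all assumptions chosen only to show the mechanics; intermediate states are snapped to the nearest grid point rather than interpolated).

import numpy as np

# Made-up scalar example of the recurrence (2.11): x' = u with |u| <= 1, quadratic costs
dt, N = 0.1, 10                               # N stages, tf = N * dt
xs = np.linspace(-2.0, 2.0, 81)               # state grid
us = np.linspace(-1.0, 1.0, 21)               # quantized control set

a_D = lambda x, u: x + dt * u                 # discretized dynamics, as in (2.3)
g_D = lambda x, u: dt * (x**2 + u**2)         # discretized stage cost, as in (2.4)
h = lambda x: 10.0 * x**2                     # terminal cost

nearest = lambda x: int(np.abs(xs - x).argmin())   # snap a state to its nearest grid point

J = h(xs)                                     # J*_{N,N} on the grid
policy = np.zeros((N, xs.size))
for k in range(N - 1, -1, -1):                # backward sweep of the recurrence
    J_new = np.empty_like(J)
    for i, x in enumerate(xs):
        costs = [g_D(x, u) + J[nearest(a_D(x, u))] for u in us]
        best = int(np.argmin(costs))
        J_new[i], policy[k, i] = costs[best], us[best]
    J = J_new                                 # J now holds the optimal cost-to-go from stage k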

2.1.4 Characteristics of Dynamic Programming solution
• Absolute Minimum: As a direct search is used to solve the recurrence
relation, the solution obtained is the global minimum.

• Presence of Constraints: If there are constraints on the state and control
variables, the number of allowable state-action pairs to be considered is
reduced, and hence the number of computations required is reduced.

• Form of Optimal Control: Dynamic programming yields the optimal control in
feedback form,

    u*(t) = f(x(t), t)    (2.12)

but it does not yield an analytical expression for f. In order to obtain the
optimal control for a given state, a table look-up needs to be performed.

2.2 Discrete Linear Regulator Problem


Consider a discrete system described by

    x(k + 1) = Ax(k) + Bu(k)    (2.13)

with the performance measure

    J = (1/2) xᵀ(N)Hx(N) + (1/2) Σ_{k=0}^{N−1} [ xᵀ(k)Q(k)x(k) + uᵀ(k)R(k)u(k) ]    (2.14)

H and Q(k) are real symmetric positive semi-definite matrices, and R(k) is a
real symmetric positive definite matrix.
The recurrence relation of dynamic programming is used to obtain the opti-
mal control that minimizes the cost.
    J_{N,N}(x(N)) = (1/2) xᵀ(N)Hx(N) = J*_{N,N}(x(N)) ≜ (1/2) xᵀ(N)P(0)x(N)    (2.15)

where P(0) ≜ H.
The cost over the final interval is given by

    J_{N−1,N}(x(N − 1), u(N − 1)) = (1/2) xᵀ(N − 1)Qx(N − 1)
        + (1/2) uᵀ(N − 1)Ru(N − 1) + (1/2) xᵀ(N)P(0)x(N)    (2.16)
and the minimum cost is

    J*_{N−1,N}(x(N − 1)) = min_{u(N−1)} { (1/2) xᵀ(N − 1)Qx(N − 1)
        + (1/2) uᵀ(N − 1)Ru(N − 1)
        + (1/2) [Ax(N − 1) + Bu(N − 1)]ᵀ P(0) [Ax(N − 1) + Bu(N − 1)] }    (2.17)
To minimize J_{N−1,N} with respect to u(N − 1), control values which satisfy

    ∂J_{N−1,N}/∂u(N − 1) = 0    (2.18)

are considered. Evaluating the partial derivative indicated by (2.18) gives

Ru(N − 1) + BT P(0)[Ax(N − 1) + Bu(N − 1)] = 0 (2.19)


To ensure that the controls satisfying the above equation actually minimize
J_{N−1,N}, compute the second partial derivative:

    ∂²J_{N−1,N}/∂u²(N − 1) = R + Bᵀ P(0) B    (2.20)
From the definitions of R and H (and hence P(0)), we can conclude that the
second partial derivative is positive definite. Hence the controls satisfying
(2.19) do minimize J_{N−1,N}.

From (2.19), we get the equation for the optimal control as,

u∗ (N − 1) = −[R + BT P(0)B]−1 BT P(0)Ax(N − 1) (2.21)


Defining F(N − 1) as

    F(N − 1) ≜ −[R + Bᵀ P(0) B]⁻¹ Bᵀ P(0) A,

the optimal control can be written as

    u*(N − 1) = F(N − 1) x(N − 1)    (2.22)
Using the above equation for u*(N − 1) in (2.17), we obtain the minimum cost
for the final stage as

    J*_{N−1,N}(x(N − 1)) ≜ (1/2) xᵀ(N − 1) P(1) x(N − 1)    (2.23)

where

    P(1) = [A + BF(N − 1)]ᵀ P(0) [A + BF(N − 1)] + Fᵀ(N − 1) R F(N − 1) + Q

Comparing (2.23) and (2.15), it can be concluded that J*_{N−1,N} is of exactly
the same form as J*_{N,N}. This can be generalized to the previous stages to get

    u*(N − K) = F(N − K) x(N − K)
    J*_{N−K,N}(x(N − K)) = (1/2) xᵀ(N − K) P(K) x(N − K)    (2.24)
From the above equation, the minimum cost for an N-stage process with initial
state x0 is given by

    J*_{0,N}(x0) = (1/2) x0ᵀ P(N) x0    (2.25)
From (2.24), it can be observed that the optimal control at each stage for a
linear regulator problem is a linear combination of the states. Hence the
optimal policy is a linear state-feedback policy, and the feedback is
time-varying. In the case of an infinite horizon problem, when the system is
completely controllable and time-invariant, H = 0, and R and Q are constant
matrices, the feedback will be time-invariant.
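A minimal numpy sketch of this backward sweep, assuming constant A, B, Q, R and illustrative matrix values: it iterates the relation for P(K) used above and stores the gains F(N − K). It is an example of the technique, not code from the report.

import numpy as np

def discrete_lqr_gains(A, B, Q, R, H, N):
    # Backward sweep of Section 2.2: returns gains F(0), ..., F(N-1) and P(N)
    P = H.copy()                                             # P(0) = H
    gains = []
    for _ in range(N):
        F = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # eq. (2.21)
        Acl = A + B @ F
        P = Acl.T @ P @ Acl + F.T @ R @ F + Q                # update P(K) -> P(K+1)
        gains.append(F)
    return gains[::-1], P                                    # gains ordered F(0), ..., F(N-1)

# Illustrative data (assumed): a discretized double integrator
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])
H = np.eye(2)
F_seq, P_N = discrete_lqr_gains(A, B, Q, R, H, N=50)
u0 = F_seq[0] @ np.array([1.0, 0.0])     # u*(0) = F(0) x(0)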

2.3 Hamilton-Jacobi-Bellman equation


To solve an optimal control problem for a continuous-time system, one
approach would be to discretize the system equation and the performance
measure, and apply the principle of optimality to obtain the recurrence
relation. An alternative approach to the same problem leads to the HJB
equation, which is a partial differential equation.
Consider a process described by the state equation,

ẋ(t) = a(x(t), u(t), t) (2.26)


which is to be controlled to minimize the performance measure

    J = h(x(tf)) + ∫_{t0}^{tf} g(x(τ), u(τ), τ) dτ    (2.27)

where t0 and tf are fixed.

The minimum cost function is

    J*(x(t), t) = min_{u(τ), t ≤ τ ≤ tf} { h(x(tf)) + ∫_{t}^{tf} g(x(τ), u(τ), τ) dτ }    (2.28)

Subdividing the interval and using the principle of optimality, the above
equation reduces to

    J*(x(t), t) = min_{u(τ), t ≤ τ ≤ t+∆t} { ∫_{t}^{t+∆t} g dτ + J*(x(t + ∆t), t + ∆t) }    (2.29)

J*(x(t + ∆t), t + ∆t) is the minimum cost for the time interval t + ∆t ≤ τ ≤ tf.
Assuming the partial derivatives of J* exist and are bounded, using the Taylor
series expansion of J*(x(t + ∆t), t + ∆t) about the point (x(t), t) and taking
the limit ∆t → 0 simplifies the above expression to

    0 = J*_t(x(t), t) + min_{u(t)} { g(x(t), u(t), t) + J*_xᵀ(x(t), t)[a(x(t), u(t), t)] }    (2.30)
with the boundary condition,

J ∗ (x(tf ), tf ) = h(x(tf ), tf ) (2.31)


Defining the Hamiltonian as

    H(x(t), u(t), J*_x, t) ≜ g(x(t), u(t), t) + J*_xᵀ(x(t), t)[a(x(t), u(t), t)]    (2.32)

and

    H(x(t), u*(x(t), J*_x, t), J*_x, t) = min_{u(t)} H(x(t), u(t), J*_x, t)    (2.33)

Using these definitions, (2.30) can be rewritten as

    0 = J*_t(x(t), t) + H(x(t), u*(x(t), J*_x, t), J*_x, t)    (2.34)

Equation (2.34) is the Hamilton-Jacobi-Bellman equation, the continuous-time
analog of Bellman's recurrence relation.

2.4 Continuous Linear Regulator problem


The process to be controlled is described by the state equation,

ẋ(t) = A(t)x(t) + B(t)u(t) (2.35)

and the performance measure to be minimized is


    J = (1/2) xᵀ(tf)Hx(tf) + (1/2) ∫_{t0}^{tf} [ xᵀ(t)Q(t)x(t) + uᵀ(t)R(t)u(t) ] dt    (2.36)
H and Q are real symmetric and positive semi-definite matrices, R is a real,
symmetric positive definite matrix.
To use the Hamilton-Jacobi-Bellman equation to solve the above problem, the
Hamiltonian is formed:

    H(x(t), u(t), J*_x, t) = (1/2) xᵀ(t)Q(t)x(t) + (1/2) uᵀ(t)R(t)u(t)
        + J*_xᵀ(x(t), t)[A(t)x(t) + B(t)u(t)]    (2.37)

A necessary condition to minimize H is ∂H/∂u = 0:

    ∂H/∂u (x(t), u(t), J*_x, t) = R(t)u(t) + Bᵀ(t)J*_x(x(t), t) = 0    (2.38)

Since

    ∂²H/∂u² = R(t)    (2.39)

and by definition R is a positive definite matrix, the solution to (2.38) will
globally minimize the Hamiltonian. Solving (2.38) gives

u∗ (t) = −R−1 (t)BT (t)Jx∗ (x(t), t) (2.40)

Substituting for u(t) in (2.37) gives the following HJB equation:

    0 = J*_t + (1/2) xᵀQx − (1/2) J*_xᵀ B R⁻¹ Bᵀ J*_x + J*_xᵀ A x    (2.41)

Assume the solution to the above HJB equation to be of the form

    J*(x(t), t) = (1/2) xᵀ(t) K(t) x(t)    (2.42)

where K(t) is a real symmetric positive definite matrix, which is to be
determined by solving the HJB equation. Substituting the assumed solution in
the HJB equation gives

    0 = (1/2) xᵀK̇x + (1/2) xᵀQx − (1/2) xᵀKBR⁻¹BᵀKx + xᵀKAx    (2.43)

Splitting KA into its symmetric and antisymmetric parts (so that
xᵀKAx = (1/2) xᵀ(KA + AᵀK)x) and noting that the equation must hold for all
x(t), the above reduces to

0 = K̇(t)+Q(t)−K(t)B(t)R−1 (t)BT (t)K(t)+K(t)A(t)+AT (t)K(t) (2.44)

and the boundary condition is,

K(tf ) = H (2.45)
As a result of the assumed solution, the HJB equation reduces to a set of
ordinary nonlinear differential equations. Solving these gives K(t), from which
the optimal control law can be obtained:

u∗ (t) = −R−1 (t)BT (t)K(t)x(t) (2.46)

Chapter 3

Approximate Dynamic Programming

3.1 Stochastic Dynamic Programming


The stochastic system differs from the deterministic one in that it now includes
a random “disturbance” wk (e.g., physical noise, market uncertainties, demand
for inventory, unpredictable breakdowns, etc.):

xk+1 = fk (xk , uk , wk ), k = 0, 1, ..., N − 1 (3.1)

An important difference is that we optimize not over control sequences u0 , ..., uN −1 ,


but rather over policies (also called closed-loop control laws, or feedback poli-
cies) that consist of a sequence of functions

π = {µ0 , ..., µN −1 } (3.2)


where µk maps states xk into controls uk = µk (xk ), and satisfies the control
constraints, i.e., is such that µk(xk) ∈ Uk(xk) for all xk ∈ Sk. Using a policy
µk rather than a fixed control uk is not particularly helpful in the
deterministic case, but it is in the stochastic case; in the deterministic case
we can choose the entire control sequence at the start, since we know all the
transitions in advance.
For a given initial state x0, the objective is to minimize over all
π = {µ0, µ1, ..., µN−1} the cost

    Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) }    (3.3)

and the optimal cost for state x0 is

    J*(x0) = min_π Jπ(x0)    (3.4)

3.2 DP Algorithm for Stochastic Finite Horizon Problems
The DP algorithm for the stochastic finite horizon optimal control problem
has a similar form to its deterministic version, and shares several of its major
characteristics:

1. Using tail subproblems to break down the minimization over multiple


stages to single stage minimizations.

2. Generating backwards for all k and xk the values Jk∗ (xk ), which give
the optimal cost-to-go starting at stage k at state xk .

3. Obtaining an optimal policy by minimization in the DP equations.

4. A structure that is suitable for approximation in value space, whereby


we replace Jk∗ by approximations J˜k , and obtain a suboptimal policy
by the corresponding minimization.

3.2.1 Algorithm
Start with

    J*_N(xN) = gN(xN),    (3.5)

and for k = 0, ..., N − 1, let

    J*_k(xk) = min_{uk ∈ Uk(xk)} E{ gk(xk, uk, wk) + J*_{k+1}(fk(xk, uk, wk)) }    (3.6)

If u∗k = µ∗k (xk ) minimizes the right side of this equation for each xk and k,
the policy π ∗ = {µ∗0 , ..., µ∗N −1 } is optimal.
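A small Python sketch of (3.5)–(3.6) for a finite-state, finite-disturbance instance (the dynamics, costs, and disturbance distribution are invented for illustration); the expectation is taken exactly over the disturbance probabilities.

import numpy as np

# Made-up finite instance: states 0..N_S-1, controls 0..N_U-1, horizon N
N_S, N_U, N = 6, 2, 5
w_vals, w_probs = [0, 1], [0.7, 0.3]                 # disturbance distribution
f = lambda x, u, w: min(N_S - 1, max(0, x + u - w))  # dynamics f_k(x, u, w)
g = lambda x, u, w: (x - 3) ** 2 + u + w             # stage cost g_k(x, u, w)
g_N = lambda x: 2 * abs(x - 3)                       # terminal cost g_N(x)

J = np.zeros((N + 1, N_S))
mu = np.zeros((N, N_S), dtype=int)                   # policy pi* = {mu_0, ..., mu_{N-1}}
J[N] = [g_N(x) for x in range(N_S)]                  # eq. (3.5)
for k in range(N - 1, -1, -1):                       # backward sweep of eq. (3.6)
    for x in range(N_S):
        q = [sum(p * (g(x, u, w) + J[k + 1][f(x, u, w)])
                 for w, p in zip(w_vals, w_probs))   # exact expectation over w_k
             for u in range(N_U)]
        mu[k, x] = int(np.argmin(q))
        J[k, x] = q[mu[k, x]]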

3.3 Approximation in Value Space
Here we approximate the optimal cost-to-go functions J*_k with some other
functions J̃_k. We then replace J*_k in the DP equation with J̃_k. In
particular, at state xk, we use the control obtained from the minimization

    µ̃k(xk) ∈ arg min_{uk ∈ Uk(xk)} E{ gk(xk, uk, wk) + J̃_{k+1}(fk(xk, uk, wk)) }    (3.7)

This defines a suboptimal policy {µ̃0, ..., µ̃_{N−1}}. Approximation in value
space can be implemented as a one-step lookahead (the future costs are
approximated by J̃_{k+1} after a single step) or a multistep lookahead (the
future costs are approximated by J̃_{k+ℓ} after ℓ > 1 steps).
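The sketch below turns (3.7) into code: a one-step lookahead policy that uses a hand-picked approximation J̃ instead of the exact cost-to-go. The system, costs, disturbance model, and the quadratic form of J̃ are all assumptions made for illustration.

import numpy as np

# Made-up scalar instance
w_vals, w_probs = [-1, 0, 1], [0.25, 0.5, 0.25]
f = lambda x, u, w: x + u + w                  # dynamics
g = lambda x, u, w: x**2 + 0.5 * u**2          # stage cost
J_tilde = lambda x: 2.0 * x**2                 # assumed cost-to-go approximation

def one_step_lookahead(x, controls=(-1.0, 0.0, 1.0)):
    # Return the control minimizing E{ g(x,u,w) + J_tilde(f(x,u,w)) }, eq. (3.7)
    def q(u):
        return sum(p * (g(x, u, w) + J_tilde(f(x, u, w)))
                   for w, p in zip(w_vals, w_probs))
    return min(controls, key=q)

u = one_step_lookahead(3.0)    # suboptimal policy mu_tilde evaluated at x = 3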

3.3.1 Issues with Approximation in Value Space


How to construct J̃_{k+1} for k = 0, 1, ..., N − 1?

1. Problem Approximation - Take J̃_{k+1} to be the exact cost of a different
or related problem. The problem is thus approximated and simplified to make it
suitable to be solved with exact dynamic programming.

2. Parametric Approximation - Introduce a parametric class of functions
involving some parameter that is chosen to ensure a good approximation.

3. Neural Networks - Trained to compute the approximation J̃_{k+1}. The
universal approximation property of neural networks is useful here.

How to simplify the E{.} operation?


1. Certainty Equivalence - Replace a stochastic quantity with a determin-
istic one. This affects the optimality of the control law in general but
greatly simplifies computation.

2. Adaptive simulation - Reducing the number of simulated variables and
discarding some heuristically determined “bad” states in simulation can greatly
simplify computation.

3. Monte Carlo Tree Search - Decides which controls to simulate more


closely than others based on some criteria.

How to simplify the minimization operation?


Discretisation over the control set is done to avoid the possibility of minimiz-
ing over an infinite set of actions.

3.4 Parametric Approximation


An approximation architecture is a class of functions J̃(x, r) that depend on
x and a vector r = (r1, ..., rm) of m “tunable” scalar parameters (or weights).
We choose r to make J̃(x, r) close to some target cost function J(x).
The training algorithm chooses r; it typically uses least-squares optimization
(regression) to fit J̃(x, r) to a data set of state-cost pairs.

3.4.1 Neural Networks


Neural networks can be used to approximate the optimal cost-to-go functions
J*_k. We consider parametric architectures J̃(x, v, r) of the form

    J̃(x, v, r) = rᵀφ(x, v)    (3.8)

where v and r are parameter vectors and φ(x, v) is a vector of features of x.
v and r are selected so that J̃(x, v, r) approximates some cost function. The
process is to collect a training set of state-cost pairs (xˢ, βˢ), s = 1, ..., q,
and to find a function of the form (3.8) that matches the training set in a
least-squares sense, i.e., (v, r) minimizes

    Σ_{s=1}^{q} ( J̃(xˢ, v, r) − βˢ )²    (3.9)

A neural network architecture provides a parametric class of functions
J̃(x, v, r) of the form (3.8) that can be used in the optimization framework
just described.
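As a concrete instance of (3.8)–(3.9) (illustrative, not from the report): with the inner parameters v fixed, the architecture is linear in r, so the least-squares fit reduces to ordinary linear regression on the features φ(x, v). The feature map, the target costs, and the data below are all made up for the example.

import numpy as np

# Assumed feature map phi(x, v): v fixes the centers/width of radial basis features
v_centers, v_width = np.linspace(-2.0, 2.0, 8), 0.5
phi = lambda x: np.exp(-((x - v_centers) ** 2) / (2 * v_width**2))

# Made-up training set of state-cost pairs (x^s, beta^s), s = 1, ..., q
rng = np.random.default_rng(0)
xs = rng.uniform(-2.0, 2.0, size=200)
betas = xs**2 + 0.1 * rng.standard_normal(200)        # noisy target costs

# Least-squares fit of r in J_tilde(x, v, r) = r^T phi(x, v), i.e. minimize (3.9) over r
Phi = np.stack([phi(x) for x in xs])                  # q x m feature matrix
r, *_ = np.linalg.lstsq(Phi, betas, rcond=None)

J_tilde = lambda x: r @ phi(x)                        # trained approximation
print(J_tilde(1.0))                                   # close to 1.0**2 on this data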

