Contents

1 Calculus of Variations
  1.1 Calculus of Variations
      1.1.1 Functional
      1.1.2 Increment of a functional
      1.1.3 Fundamental Theorem of Calculus of Variations
      1.1.4 Euler Equation
  1.2 Necessary conditions for optimal control
  1.3 Linear Regulator Problems
  1.4 Linear Tracking Problems
  1.5 Pontryagin's Minimum Principle

2 Dynamic Programming
  2.1 Introduction
      2.1.1 Principle of Optimality
      2.1.2 Curse of Dimensionality
      2.1.3 Recurrence relation of Dynamic Programming
      2.1.4 Characteristics of Dynamic Programming solution
  2.2 Discrete Linear Regulator Problem
  2.3 Hamilton-Jacobi-Bellman equation
  2.4 Continuous Linear Regulator problem

3 Approximate Dynamic Programming
      3.4.1 Neural Networks
Chapter 1
Calculus of Variations
1.1.4 Euler Equation
A necessary condition for x^* to be an extremal of a functional J of the form
\[
J(x) = \int_{t_0}^{t_f} g\left(x(t), \dot{x}(t), t\right)\,dt \tag{1.2}
\]
is that x^* satisfy the Euler equation
\[
\frac{\partial g}{\partial x}\left(x^*(t), \dot{x}^*(t), t\right) - \frac{d}{dt}\left[\frac{\partial g}{\partial \dot{x}}\left(x^*(t), \dot{x}^*(t), t\right)\right] = 0.
\]
1.2 Necessary conditions for optimal control

For convenience, define the Hamiltonian function as
\[
H\left(x(t), u(t), p(t), t\right) \triangleq g\left(x(t), u(t), t\right) + p^T(t)\,a\left(x(t), u(t), t\right),
\]
where a(·) is the right-hand side of the state equation \(\dot{x}(t) = a(x(t), u(t), t)\). The necessary conditions for optimality are then
\[
\begin{aligned}
\dot{x}^*(t) &= \frac{\partial H}{\partial p}\left(x^*(t), u^*(t), p^*(t), t\right) \\
\dot{p}^*(t) &= -\frac{\partial H}{\partial x}\left(x^*(t), u^*(t), p^*(t), t\right) \\
0 &= \frac{\partial H}{\partial u}\left(x^*(t), u^*(t), p^*(t), t\right)
\end{aligned} \tag{1.10}
\]
for all \(t \in [t_0, t_f]\),
and the boundary condition
\[
\left[\frac{\partial h}{\partial x}\left(x^*(t_f), t_f\right) - p^*(t_f)\right]^T \delta x_f + \left[H\left(x^*(t_f), u^*(t_f), p^*(t_f), t_f\right) + \frac{\partial h}{\partial t}\left(x^*(t_f), t_f\right)\right] \delta t_f = 0. \tag{1.11}
\]
1.3 Linear Regulator Problems

Consider the linear system
\[
\dot{x}(t) = A(t)x(t) + B(t)u(t) \tag{1.12}
\]
with the quadratic performance measure
\[
J = \frac{1}{2}x^T(t_f)\,H\,x(t_f) + \frac{1}{2}\int_{t_0}^{t_f} \left[ x^T(t)Q(t)x(t) + u^T(t)R(t)u(t) \right] dt. \tag{1.13}
\]
Note that H and Q are real symmetric positive semi-definite matrices, and R is a real symmetric positive definite matrix. Under these conditions, the optimal solution is guaranteed to minimize the performance measure globally.
Physically, the performance measure expresses the requirement that the state vector be kept close to the origin without an excessive expenditure of control effort.
Thus, for the linear quadratic regulator problem, the Hamiltonian is defined as follows:
\[
H\left(x(t), u(t), p(t), t\right) = \frac{1}{2}x^T(t)Q(t)x(t) + \frac{1}{2}u^T(t)R(t)u(t) + p^T(t)A(t)x(t) + p^T(t)B(t)u(t) \tag{1.14}
\]
and the necessary conditions for optimality are:
\[
\dot{x}^*(t) = A(t)x^*(t) + B(t)u^*(t) \tag{1.15}
\]
\[
\dot{p}^*(t) = -\frac{\partial H}{\partial x} = -Q(t)x^*(t) - A^T(t)p^*(t) \tag{1.16}
\]
\[
0 = \frac{\partial H}{\partial u} = R(t)u^*(t) + B^T(t)p^*(t) \tag{1.17}
\]
Solving for u^*(t) from (1.17), we get
\[
u^*(t) = -R^{-1}(t)B^T(t)p^*(t). \tag{1.18}
\]
It can be shown that p^*(t) = K(t)x(t); that is, p^*(t) is a linear function of the states of the system, and the matrix K(t) is symmetric. Substituting back in (1.18), we get:
\[
u^*(t) = -R^{-1}(t)B^T(t)K(t)x(t) \triangleq F(t)x(t). \tag{1.23}
\]
This indicates that the optimal control law is a linear, time-varying combination of the system states. However, it is important to note that measurements of all of the state variables must be available in order to implement the optimal control law.
In order to calculate the matrix K(t), we would need to compute the state transition matrix, and computing the time-varying state transition matrix is a very time-consuming and tedious task. However, there is an alternative approach: it can be shown that the matrix K(t) satisfies the matrix differential equation
\[
\dot{K}(t) = -K(t)A(t) - A^T(t)K(t) - Q(t) + K(t)B(t)R^{-1}(t)B^T(t)K(t) \tag{1.24}
\]
along with the boundary condition K(t_f) = H. Equation (1.24) is called the Riccati equation.
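As an illustration (not part of the original text), (1.24) can be integrated backward in time from K(t_f) = H with any standard ODE solver. The following is a minimal sketch in Python using scipy; the matrices A, B, Q, R, H and the horizon are made-up constants chosen only for demonstration.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical double-integrator example: xdot = A x + B u
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)          # state weighting (positive semi-definite)
R = np.array([[1.0]])  # control weighting (positive definite)
H = np.eye(2)          # terminal weighting
t0, tf = 0.0, 5.0

def riccati_rhs(t, k_flat):
    # Right-hand side of (1.24); K is flattened for the ODE solver.
    K = k_flat.reshape(2, 2)
    Kdot = -K @ A - A.T @ K - Q + K @ B @ np.linalg.inv(R) @ B.T @ K
    return Kdot.ravel()

# Integrate backward from tf to t0 with the boundary condition K(tf) = H.
sol = solve_ivp(riccati_rhs, (tf, t0), H.ravel(), dense_output=True)

# Feedback gain F(t) = -R^{-1} B^T K(t), as in (1.23).
def gain(t):
    K = sol.sol(t).reshape(2, 2)
    return -np.linalg.inv(R) @ B.T @ K

print(gain(0.0))  # optimal gain at the initial time
```

Because the solver stores a dense interpolant of K(t), the time-varying gain F(t) can then be evaluated at any time in [t_0, t_f] without recomputing the state transition matrix.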
1.4 Linear Tracking Problems

The performance measure for the tracking problem is
\[
\begin{aligned}
J &= \frac{1}{2}\left[x(t_f) - r(t_f)\right]^T H \left[x(t_f) - r(t_f)\right] + \frac{1}{2}\int_{t_0}^{t_f} \left\{ \left[x(t) - r(t)\right]^T Q(t) \left[x(t) - r(t)\right] + u^T(t)R(t)u(t) \right\} dt \\
&\triangleq \frac{1}{2}\left\| x(t_f) - r(t_f) \right\|_H^2 + \frac{1}{2}\int_{t_0}^{t_f} \left\{ \left\| x(t) - r(t) \right\|_{Q(t)}^2 + \left\| u(t) \right\|_{R(t)}^2 \right\} dt
\end{aligned}
\]
where r(t) is the desired (reference) value of the state vector; H and Q are real symmetric positive semi-definite matrices, and R is real symmetric and positive definite. The Hamiltonian is
\[
H\left(x(t), u(t), p(t), t\right) = \frac{1}{2}\left\| x(t) - r(t) \right\|_{Q(t)}^2 + \frac{1}{2}\left\| u(t) \right\|_{R(t)}^2 + p^T(t)A(t)x(t) + p^T(t)B(t)u(t) \tag{1.25}
\]
and the necessary conditions to be satisfied are:
\[
\dot{x}^*(t) = A(t)x^*(t) + B(t)u^*(t) \tag{1.26}
\]
\[
\dot{p}^*(t) = -\frac{\partial H}{\partial x} = -Q(t)x^*(t) - A^T(t)p^*(t) + Q(t)r(t) \tag{1.27}
\]
\[
0 = \frac{\partial H}{\partial u} = R(t)u^*(t) + B^T(t)p^*(t) \tag{1.28}
\]
Thus, we get
\[
u^*(t) = -R^{-1}(t)B^T(t)p^*(t) \tag{1.29}
\]
and
\[
\begin{bmatrix} \dot{x}^*(t) \\ \dot{p}^*(t) \end{bmatrix} =
\begin{bmatrix} A(t) & -B(t)R^{-1}(t)B^T(t) \\ -Q(t) & -A^T(t) \end{bmatrix}
\begin{bmatrix} x^*(t) \\ p^*(t) \end{bmatrix} +
\begin{bmatrix} 0 \\ Q(t)r(t) \end{bmatrix}
\]
where Q(t)r(t) is a forcing function. These differential equations are linear and time-varying but not homogeneous; thus the solution to the above equation is of the form
\[
\begin{bmatrix} x^*(t_f) \\ p^*(t_f) \end{bmatrix} =
\varphi(t_f, t) \begin{bmatrix} x^*(t) \\ p^*(t) \end{bmatrix} +
\int_t^{t_f} \varphi(t_f, \tau) \begin{bmatrix} 0 \\ Q(\tau)r(\tau) \end{bmatrix} d\tau \tag{1.30}
\]
The optimal control takes the form u^*(t) = F(t)x^*(t) + v(t), where F(t) is the feedback gain matrix and v(t) is the command signal. v(t) depends on the system parameters and on the future values of the reference signal; thus, it may be said that the optimal control has an anticipatory quality: we must determine our present strategy on the basis of where we are now and where we intend to go. Again, we face the same issue of determining the state transition matrix in order to find the matrix K(t). However, there is an alternative route: it can be shown that K(t) satisfies the same Riccati equation, and s(t) can then be obtained from a related linear differential equation, with the boundary conditions
\[
K(t_f) = H \quad \text{and} \quad s(t_f) = -H\,r(t_f). \tag{1.36}
\]
1.5 Pontryagin's Minimum Principle

For u^* to be an optimal (minimizing) control, the increment of J must be non-negative for every admissible control u:
\[
J(u) - J(u^*) = \Delta J \ge 0.
\]
The increment can be expanded as
\[
\Delta J(u^*, \delta u) = \delta J(u^*, \delta u) + \text{higher-order terms},
\]
where the higher-order terms approach zero as the norm of δu approaches zero. δu is arbitrary only if the extremal control lies strictly within the boundary; in this case, the boundary has no effect on the problem solution. However, if the extremal control lies on a boundary during at least one subinterval [t_1, t_2] of the interval [t_0, t_f], then there exist admissible control variations δû whose negatives (−δû) are not admissible. If only these variations are considered, a necessary condition for u^* to minimize J is that δJ(u^*, δû) ≥ 0. For variations δũ which lie strictly within the boundary, it is necessary that δJ(u^*, δũ) = 0. Thus, considering all admissible variations with ‖δu‖ small enough that the sign of ∆J is determined by δJ, the necessary condition for u^* to minimize J is
\[
\delta J(u^*, \delta u) \ge 0.
\]
Rewriting the increment of J using the Hamiltonian, the fact that p^*(t) is selected so that the coefficient of δx(t) in the integral is identically zero, and the boundary conditions being satisfied, we get:
\[
\Delta J(u^*, \delta u) = \int_{t_0}^{t_f} \left[\frac{\partial H}{\partial u}\left(x^*(t), u^*(t), p^*(t), t\right)\right]^T \delta u(t)\, dt + \text{higher-order terms}. \tag{1.37}
\]
The integrand is the first-order approximation to the change in H caused by a change in u alone:
\[
\left[\frac{\partial H}{\partial u}\left(x^*(t), u^*(t), p^*(t), t\right)\right]^T \delta u(t) \doteq H\left(x^*(t), u^*(t) + \delta u(t), p^*(t), t\right) - H\left(x^*(t), u^*(t), p^*(t), t\right). \tag{1.38}
\]
Thus, for a sufficiently small neighborhood of u^* (‖δu‖ ≤ β, where β is a small positive number), a necessary condition for u^* to be a minimizing control is:
\[
\int_{t_0}^{t_f} \left[ H\left(x^*(t), u^*(t) + \delta u(t), p^*(t), t\right) - H\left(x^*(t), u^*(t), p^*(t), t\right) \right] dt \ge 0 \tag{1.39}
\]
\[
\Rightarrow\ H\left(x^*(t), u^*(t) + \delta u(t), p^*(t), t\right) \ge H\left(x^*(t), u^*(t), p^*(t), t\right) \tag{1.40}
\]
for all t ∈ [t_0, t_f] and for all admissible controls. Thus, an optimal control must minimize the Hamiltonian; this is Pontryagin's minimum principle. Note that an optimal control must satisfy the minimum principle; however, there may be controls that satisfy the minimum principle but are not optimal.
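To make the pointwise Hamiltonian minimization concrete, consider a sketch (not from the original text) for a scalar control constrained to |u| ≤ 1 with a Hamiltonian that is linear in u, say H = g_0(x, t) + p·u. Minimizing H over the admissible set gives the familiar bang-bang law u^* = −sign(p). A minimal Python illustration, with made-up names:

```python
import numpy as np

def minimize_hamiltonian(p, u_min=-1.0, u_max=1.0):
    """Pointwise minimizer of H = g0 + p*u over the admissible interval.

    Since H is linear in u, the minimum is attained on the boundary of
    the admissible set: u* = u_min if p > 0, u* = u_max if p < 0.
    """
    if p > 0:
        return u_min
    if p < 0:
        return u_max
    return 0.0  # singular case: H is independent of u

# Example: a costate that changes sign produces a control switch.
for p in [2.0, 0.5, -0.1, -3.0]:
    print(p, minimize_hamiltonian(p))
```

The point of the sketch is that the minimum principle selects u^* by comparing Hamiltonian values over the whole admissible set, which need not coincide with a stationary point when the control is constrained.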
In summary, the necessary conditions include the state and costate equations together with
\[
H\left(x^*(t), u^*(t), p^*(t), t\right) \le H\left(x^*(t), u(t), p^*(t), t\right) \quad \text{for all admissible } u(t) \tag{1.44}
\]
and the boundary conditions
\[
\left[\frac{\partial h}{\partial x}\left(x^*(t_f), t_f\right) - p^*(t_f)\right]^T \delta x_f + \left[H\left(x^*(t_f), u^*(t_f), p^*(t_f), t_f\right) + \frac{\partial h}{\partial t}\left(x^*(t_f), t_f\right)\right] \delta t_f = 0. \tag{1.45}
\]
The minimum principle can also be applied to problems in which the admissible controls are not bounded. This can be done by viewing the unbounded control region as having arbitrarily large bounds, thus ensuring that the optimal control will not be constrained by the boundaries. In this case, for u^*(t) to minimize the Hamiltonian it is necessary that
\[
\frac{\partial H}{\partial u}\left(x^*(t), u^*(t), p^*(t), t\right) = 0. \tag{1.46}
\]
Chapter 2
Dynamic Programming
2.1 Introduction
Before the work of Pontryagin and Bellman, obtaining the optimal control for a given system required the minimization of a functional using variational techniques. With Pontryagin's Minimum Principle, the task of finding an optimal control reduces to minimizing a Hamiltonian: given an initial state x_0 of the system at time t = t_0, the minimum principle yields an optimal control that generates a state trajectory minimizing the cost functional. Richard Bellman approached the problem differently, seeking an optimal policy for every possible state rather than a single optimal trajectory.
Suppose a cost is assigned to each transition from some x_k to x_{k+1}, and there is also a terminal cost function. For each trajectory, the total cost accumulated is the sum of the transition costs at each time step plus the terminal cost at x_T. For a given initial state x_0, a naive approach to minimizing the cost would be to start from x_0 and enumerate all possible trajectories going forward up to time T.
The computational effort required to implement this solution is O(M^T T): with M possible transitions at each of the T stages, there are on the order of M^T trajectories, and computing the total cost of a trajectory by adding the transition cost at every stage requires T additions. There are N possible initial states, hence the total computation required is O(N M^T T).
An alternative approach is to go backward in time: starting at k = T, the terminal cost for each x_k is known. At k = T−1, for each x_k the corresponding x_{k+1} with the least cost (one-step transition cost plus terminal cost) has to be determined; if there is more than one such path, choose one at random. Repeating the above for k = T−2, ..., 0, we obtain the control that minimizes the cost. At each stage k, for each state x_k and each control u_k, the transition cost has to be added to the already-computed minimum cost starting from x_{k+1}. Hence the computational effort of this backward approach is O(NMT); a sketch of the recursion is given after the next paragraph.
This backward approach finds the optimal policy for every initial state, and also for any intermediate state; hence it satisfies Bellman's objective of finding an optimal policy for every state. This recursive scheme serves as an example of the method of dynamic programming.
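The following is a minimal sketch of the backward recursion just described, for a finite problem with N states, M controls, and horizon T; the transition table `step`, the cost tables, and all sizes are made-up placeholders used only for illustration.

```python
import numpy as np

# Made-up problem sizes and tables for illustration only.
N, M, T = 10, 3, 5
rng = np.random.default_rng(0)
step = rng.integers(0, N, size=(N, M))      # step[x, u] = next state
trans_cost = rng.random((N, M))             # cost of taking u in state x
terminal = rng.random(N)                    # terminal cost h(x_T)

J = terminal.copy()                         # cost-to-go at stage k+1
policy = np.zeros((T, N), dtype=int)        # policy[k, x] = optimal u

for k in range(T - 1, -1, -1):              # backward in time: k = T-1, ..., 0
    Q = trans_cost + J[step]                # Q[x, u] = g(x, u) + J(next state)
    policy[k] = np.argmin(Q, axis=1)        # best control for every state x
    J = Q[np.arange(N), policy[k]]          # new cost-to-go at stage k

print(J)          # minimum total cost from every initial state
print(policy[0])  # optimal first control for every initial state
```

Each of the T stages touches every (state, control) pair exactly once, which is precisely the O(NMT) operation count stated above.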
2.1.1 Principle of Optimality

"An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision."
2.1.3 Recurrence relation of Dynamic Programming

The discretized performance measure is
\[
J = h(x(N)) + \sum_{k=0}^{N-1} g_D\left(x(k), u(k)\right), \tag{2.4}
\]
where
\[
\sum_{k=0}^{N-1} g_D\left(x(k), u(k)\right) = \sum_{k=0}^{N-1} \Delta t\, g\left(x(k), u(k)\right).
\]
Now the discrete problem requires an optimal control law
\[
u^*(x(0), 0),\ u^*(x(1), 1),\ \ldots,\ u^*(x(N-1), N-1)
\]
to be determined for the system given by (2.3) with the performance measure given by (2.4).
The cost of reaching the final state value x(N) is
\[
J_{N,N}(x(N)) \triangleq h(x(N)). \tag{2.5}
\]
The cost of operation during the final stage is
\[
J_{N-1,N}\left(x(N-1), u(N-1)\right) \triangleq g_D\left(x(N-1), u(N-1)\right) + J_{N,N}(x(N)). \tag{2.6}
\]
The optimal cost is
\[
J^*_{N-1,N}(x(N-1)) \triangleq \min_{u(N-1)} \left\{ g_D\left(x(N-1), u(N-1)\right) + J_{N,N}(x(N)) \right\}. \tag{2.7}
\]
Going back one more stage, the cost of operation over the final two stages is
\[
J_{N-2,N}\left(x(N-2), u(N-2), u(N-1)\right) = g_D\left(x(N-2), u(N-2)\right) + J_{N-1,N}\left(x(N-1), u(N-1)\right), \tag{2.8}
\]
with optimal cost
\[
J^*_{N-2,N}(x(N-2)) \triangleq \min_{u(N-2),\, u(N-1)} \left\{ g_D\left(x(N-2), u(N-2)\right) + J_{N-1,N}\left(x(N-1), u(N-1)\right) \right\}, \tag{2.9}
\]
which, by the principle of optimality, reduces to
\[
J^*_{N-2,N}(x(N-2)) \triangleq \min_{u(N-2)} \left\{ g_D\left(x(N-2), u(N-2)\right) + J^*_{N-1,N}(x(N-1)) \right\}. \tag{2.10}
\]
The above equation is the recurrence relation of dynamic programming: the optimal policy for a K-stage process is obtained from the optimal policy for a (K−1)-stage process.
Another observation from the recurrence relation is that the minimum cost for the final K stages of an N-stage process, with state value x(N−K) at the beginning of the (N−K)th stage, is also the minimum possible cost for a K-stage process with initial state numerically equal to x(N−K). This is called the imbedding principle.

2.1.4 Characteristics of Dynamic Programming solution

The dynamic programming procedure yields the optimal control at each stage as a function of the current state, u^* = f(x, k), but it does not yield an analytical expression for f. In order to obtain the optimal control for a given state, a table look-up needs to be performed.
2.2 Discrete Linear Regulator Problem

H and Q(k) are real symmetric positive semi-definite matrices, and R(k) is a real symmetric positive definite matrix. The recurrence relation of dynamic programming is used to obtain the optimal control that minimizes the cost.
\[
J_{N,N}(x(N)) = \frac{1}{2}x^T(N)\,H\,x(N) = J^*_{N,N}(x(N)) \triangleq \frac{1}{2}x^T(N)\,P(0)\,x(N), \tag{2.15}
\]
where P(0) ≜ H. The cost over the final interval is given by
\[
J_{N-1,N}\left(x(N-1), u(N-1)\right) = \frac{1}{2}x^T(N-1)\,Q\,x(N-1) + \frac{1}{2}u^T(N-1)\,R\,u(N-1) + \frac{1}{2}x^T(N)\,P(0)\,x(N) \tag{2.16}
\]
and the minimum cost is
\[
J^*_{N-1,N}(x(N-1)) = \min_{u(N-1)} \left\{ \frac{1}{2}x^T(N-1)\,Q\,x(N-1) + \frac{1}{2}u^T(N-1)\,R\,u(N-1) + \frac{1}{2}\left[Ax(N-1) + Bu(N-1)\right]^T P(0) \left[Ax(N-1) + Bu(N-1)\right] \right\}. \tag{2.17}
\]
To minimize J_{N−1,N} with respect to u(N−1), control values which satisfy
\[
\frac{\partial J_{N-1,N}}{\partial u(N-1)} = 0 \tag{2.18}
\]
are considered. Evaluating the partial derivative indicated by (2.18) gives
\[
\frac{\partial J_{N-1,N}}{\partial u(N-1)} = R\,u(N-1) + B^T P(0)\left[A\,x(N-1) + B\,u(N-1)\right] = 0, \tag{2.19}
\]
and the second partial derivative is
\[
\frac{\partial^2 J_{N-1,N}}{\partial u^2(N-1)} = R + B^T P(0) B. \tag{2.20}
\]
From the definitions of R and H we can conclude that this second partial derivative is positive definite; hence the controls satisfying (2.19) do minimize J_{N−1,N}.
From (2.19), we get the equation for the optimal control as
\[
u^*(N-K) = -\left[R + B^T P(K-1)\,B\right]^{-1} B^T P(K-1)\,A\, x(N-K) \triangleq F(N-K)\, x(N-K)
\]
and the minimum cost as
\[
J^*_{N-K,N}(x(N-K)) = \frac{1}{2}x^T(N-K)\,P(K)\,x(N-K). \tag{2.24}
\]
From the above equation, the minimum cost for an N-stage process with initial state x_0 is given by
\[
J^*_{0,N}(x_0) = \frac{1}{2}x_0^T\,P(N)\,x_0. \tag{2.25}
\]
From (2.24), it can be observed that the optimal control at each stage of the linear regulator problem is a linear combination of the states; hence the optimal policy is a linear, time-varying state-feedback policy. In the case of an infinite-horizon problem, when the system is completely controllable and time-invariant, H = 0, and R and Q are constant matrices, the feedback gain becomes time-invariant.
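A minimal sketch of this backward recursion in Python follows; the matrices are made-up constants, and P and F are computed exactly as in the stage-by-stage argument above (P(0) = H, then one Riccati step per stage).

```python
import numpy as np

# Hypothetical discrete-time system x(k+1) = A x(k) + B u(k).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)           # state cost weight (per stage)
R = np.array([[1.0]])   # control cost weight
H = np.eye(2)           # terminal cost weight
N = 50                  # number of stages

P = H.copy()            # P(0) = H
gains = []              # F(N-K) for K = 1, ..., N
for _ in range(N):
    # Gain from minimizing the stage cost-to-go, as in the text:
    F = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    # Cost-to-go update: J* = (1/2) x^T P x under the minimizing control.
    P = Q + F.T @ R @ F + (A + B @ F).T @ P @ (A + B @ F)
    gains.append(F)

gains.reverse()         # gains[k] is now the feedback gain at stage k
x = np.array([[1.0], [0.0]])
u0 = gains[0] @ x       # optimal first control for this initial state
print(u0)
```

For a time-invariant infinite-horizon problem, the matrices P and F computed by this loop converge to constants as N grows, which is the time-invariant feedback mentioned above.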
2.3 Hamilton-Jacobi-Bellman equation

So far, we have discretized the system and the performance measure and applied the Principle of Optimality to obtain the recurrence relation. An alternative approach to the same problem leads to the Hamilton-Jacobi-Bellman (HJB) equation, which is a partial differential equation.
Consider a process described by the state equation
\[
\dot{x}(t) = a\left(x(t), u(t), t\right),
\]
with minimum cost function
\[
J^*(x(t), t) = \min_{\substack{u(\tau) \\ t \le \tau \le t_f}} \left\{ \int_t^{t_f} g\left(x(\tau), u(\tau), \tau\right) d\tau + h\left(x(t_f), t_f\right) \right\}.
\]
Subdividing the interval and using the Principle of Optimality, the above equation reduces to
\[
J^*(x(t), t) = \min_{\substack{u(\tau) \\ t \le \tau \le t + \Delta t}} \left\{ \int_t^{t+\Delta t} g\, d\tau + J^*\left(x(t + \Delta t), t + \Delta t\right) \right\}, \tag{2.29}
\]
where J^*(x(t+∆t), t+∆t) is the minimum cost for the time interval t+∆t ≤ τ ≤ t_f. Assuming the partial derivatives of J^* exist and are bounded, using the Taylor series expansion of J^*(x(t+∆t), t+∆t) about the point (x(t), t), and taking the limit ∆t → 0, the above expression simplifies to
\[
0 = J_t^*(x(t), t) + \min_{u(t)} \left\{ g\left(x(t), u(t), t\right) + J_x^{*T}(x(t), t)\left[a\left(x(t), u(t), t\right)\right] \right\} \tag{2.30}
\]
with the boundary condition
\[
J^*\left(x(t_f), t_f\right) = h\left(x(t_f), t_f\right). \tag{2.31}
\]
Defining the Hamiltonian
\[
H\left(x(t), u(t), J_x^*, t\right) \triangleq g\left(x(t), u(t), t\right) + J_x^{*T}(x(t), t)\left[a\left(x(t), u(t), t\right)\right] \tag{2.32}
\]
and
\[
H\left(x(t), u^*\left(x(t), J_x^*, t\right), J_x^*, t\right) = \min_{u(t)} H\left(x(t), u(t), J_x^*, t\right), \tag{2.33}
\]
we can rewrite (2.30) as
\[
0 = J_t^*(x(t), t) + H\left(x(t), u^*\left(x(t), J_x^*, t\right), J_x^*, t\right), \tag{2.34}
\]
which is the Hamilton-Jacobi-Bellman equation.

2.4 Continuous Linear Regulator problem

For the continuous linear regulator problem, with linear dynamics and the quadratic performance measure of Section 1.3, the Hamiltonian (2.32) becomes
\[
H\left(x(t), u(t), J_x^*, t\right) = \frac{1}{2}x^T(t)Q(t)x(t) + \frac{1}{2}u^T(t)R(t)u(t) + J_x^{*T}(x(t), t)\left[A(t)x(t) + B(t)u(t)\right]. \tag{2.37}
\]
Minimizing with respect to u,
\[
\frac{\partial H}{\partial u}\left(x(t), u(t), J_x^*, t\right) = R(t)u(t) + B^T(t)J_x^*(x(t), t) = 0. \tag{2.38}
\]
Since
\[
\frac{\partial^2 H}{\partial u^2} = R(t) \tag{2.39}
\]
and by definition R is a positive definite matrix, the solution of (2.38) globally minimizes the Hamiltonian. Solving (2.38) gives
\[
u(t) = -R^{-1}(t)B^T(t)J_x^*(x(t), t). \tag{2.40}
\]
Substituting for u(t) in (2.37) gives the following HJB equation:
\[
0 = J_t^* + \frac{1}{2}x^T Q x - \frac{1}{2}J_x^{*T} B R^{-1} B^T J_x^* + J_x^{*T} A x. \tag{2.41}
\]
Assume the solution of this HJB equation to be of the form
\[
J^*(x(t), t) = \frac{1}{2}x^T(t)K(t)x(t), \tag{2.42}
\]
where K(t) is a real symmetric positive definite matrix to be determined by solving the HJB equation. Substituting the assumed solution in the HJB equation gives
\[
0 = \frac{1}{2}x^T \dot{K} x + \frac{1}{2}x^T Q x - \frac{1}{2}x^T K B R^{-1} B^T K x + x^T K A x. \tag{2.43}
\]
Splitting KA into its symmetric and unsymmetric parts and simplifying reduces the above equation to
\[
0 = \dot{K}(t) + Q(t) - K(t)B(t)R^{-1}(t)B^T(t)K(t) + K(t)A(t) + A^T(t)K(t), \tag{2.44}
\]
with
\[
K(t_f) = H. \tag{2.45}
\]
Note that (2.44) is precisely the Riccati equation (1.24) obtained earlier by the variational approach, with the same boundary condition. As a result of the assumed solution, the HJB equation reduces to a set of ordinary nonlinear differential equations. Solving these gives K(t), from which the optimal control law is obtained as
\[
u^*(t) = -R^{-1}(t)B^T(t)K(t)x(t).
\]
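As a quick sanity check (not from the original text), one can verify symbolically in the scalar case that the quadratic ansatz (2.42) reduces the HJB equation (2.41) to the Riccati equation (2.44). A sketch using sympy, with scalar constants a, b, q, r standing in for A, B, Q, R:

```python
import sympy as sp

# Scalar case: verify that J* = (1/2) k(t) x^2 reduces (2.41) to (2.44).
t, x = sp.symbols('t x')
a, b, q, r = sp.symbols('a b q r', positive=True)  # scalar A, B, Q, R
k = sp.Function('k')(t)

Jstar = sp.Rational(1, 2) * k * x**2
Jt = sp.diff(Jstar, t)   # J*_t = (1/2) k'(t) x^2
Jx = sp.diff(Jstar, x)   # J*_x = k(t) x

hjb = Jt + sp.Rational(1, 2)*q*x**2 - sp.Rational(1, 2)*Jx*(b**2/r)*Jx + Jx*a*x
# The coefficient of x^2 must vanish for all x:
riccati = sp.simplify(sp.expand(hjb) / x**2) * 2
print(sp.Eq(riccati, 0))
# k'(t) + q - b**2*k(t)**2/r + 2*a*k(t) = 0, the scalar form of (2.44).
```

The factor 2·a·k(t) is the scalar trace of the symmetrization KA + AᵀK, which is exactly the step described above as splitting KA into its symmetric and unsymmetric parts.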
Chapter 3
Approximate Dynamic Programming
and the optimal cost for state x_0 is J_0^*(x_0).
2. Generating backwards, for all k and x_k, the values J_k^*(x_k), which give the optimal cost-to-go starting at stage k in state x_k.
3.2.1 Algorithm
Start with
\[
J_N^*(x_N) = g_N(x_N), \tag{3.5}
\]
and for k = 0, ..., N−1, let
\[
J_k^*(x_k) = \min_{u_k \in U_k(x_k)} \mathbb{E}\left\{ g_k(x_k, u_k, w_k) + J_{k+1}^*\left(f_k(x_k, u_k, w_k)\right) \right\}. \tag{3.6}
\]
If u_k^* = μ_k^*(x_k) minimizes the right side of this equation for each x_k and k, the policy π^* = {μ_0^*, ..., μ_{N−1}^*} is optimal.
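A minimal sketch of (3.5)-(3.6) for a finite problem, with the expectation over w_k taken as a weighted sum, follows; all tables and sizes here are made-up placeholders.

```python
import numpy as np

# Made-up finite problem: N stages, S states, U controls, W disturbances.
N, S, U, W = 5, 8, 3, 2
rng = np.random.default_rng(1)
f = rng.integers(0, S, size=(S, U, W))   # f[x, u, w] = next state
g = rng.random((S, U, W))                # stage cost g_k(x, u, w)
gN = rng.random(S)                       # terminal cost g_N(x)
p_w = np.array([0.7, 0.3])               # probabilities of the W disturbances

J = gN.copy()                            # J_N^*(x) = g_N(x), as in (3.5)
mu = np.zeros((N, S), dtype=int)         # mu[k, x]: optimal control
for k in range(N - 1, -1, -1):
    # E{ g + J_{k+1}(f) } for every (x, u) pair, as in (3.6).
    Q = np.einsum('xuw,w->xu', g + J[f], p_w)
    mu[k] = np.argmin(Q, axis=1)
    J = Q[np.arange(S), mu[k]]

print(J)       # J_0^*(x): optimal expected cost from every initial state
print(mu[0])   # optimal first-stage decisions
```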
3.3 Approximation in Value Space
Here we approximate the optimal cost-to-go functions J_k^* with some other functions J̃_k, and then replace J_k^* in the DP equation with J̃_k. In particular, at state x_k we use the control obtained from the minimization
\[
\tilde{\mu}_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} \mathbb{E}\left\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\left(f_k(x_k, u_k, w_k)\right) \right\}. \tag{3.7}
\]
This defines a suboptimal policy {μ̃_0, ..., μ̃_{N−1}}. Approximation in value space can be implemented as a one-step lookahead (the future costs are approximated by J̃_{k+1} after a single step) or a multistep lookahead (the future costs are approximated by J̃_{k+l} after l > 1 steps).
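A one-step lookahead policy per (3.7) can be sketched as follows; `J_tilde` is any approximation of the cost-to-go (here an arbitrary made-up function), and the model functions are hypothetical stand-ins for f_k and g_k.

```python
import numpy as np

def one_step_lookahead(x, k, controls, f, g, p_w, J_tilde):
    """Choose u minimizing E{ g_k(x,u,w) + J_tilde(f_k(x,u,w), k+1) }, per (3.7)."""
    best_u, best_q = None, np.inf
    for u in controls:
        q = sum(p * (g(x, u, w, k) + J_tilde(f(x, u, w, k), k + 1))
                for w, p in enumerate(p_w))
        if q < best_q:
            best_u, best_q = u, q
    return best_u

# Made-up model and approximation, for illustration only.
f = lambda x, u, w, k: (x + u + w) % 8        # next state
g = lambda x, u, w, k: abs(x - 4) + 0.1 * u   # stage cost
J_tilde = lambda x, k: abs(x - 4)             # crude cost-to-go approximation
print(one_step_lookahead(3, 0, [0, 1, 2], f, g, [0.7, 0.3], J_tilde))
```

The design choice is that only the first step uses the exact model; all further costs are summarized by J̃, so the quality of the policy degrades gracefully with the quality of the approximation.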
3. Neural Networks - Trained to compute the approximation J̃_{k+1}. The universal approximation property of neural networks is useful here.
to find a function J̃(x, v, r) of the given parametric form that matches the training set in a least-squares sense, i.e., a pair (v, r) that minimizes
\[
\sum_{s=1}^{q} \left( \tilde{J}(x_s, v, r) - \beta_s \right)^2. \tag{3.9}
\]
A neural network architecture provides a parametric class of functions J̃(x, v, r) of this form that can be used in the optimization framework just described.
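For a linear-in-parameters architecture J̃(x, r) = rᵀφ(x) (a stand-in for the parametric form referenced above; the feature map φ here is made up), minimizing (3.9) is an ordinary least-squares problem:

```python
import numpy as np

# Made-up training set: states x_s with noisy target costs beta_s.
rng = np.random.default_rng(2)
xs = rng.uniform(-1, 1, size=100)
betas = xs**2 + 0.05 * rng.standard_normal(100)

def phi(x):
    """Hypothetical feature map for a linear architecture J~(x, r) = r^T phi(x)."""
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

# Minimize sum_s (J~(x_s, r) - beta_s)^2 over r: linear least squares.
r, *_ = np.linalg.lstsq(phi(xs), betas, rcond=None)
J_tilde = lambda x: phi(np.atleast_1d(x)) @ r
print(r, J_tilde(0.5))
```

A neural network replaces the fixed feature map φ with learned internal features, and (3.9) is then minimized by gradient methods rather than in closed form.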