
Proceedings of the 7th Asian Control Conference, Hong Kong, China, August 27-29, 2009
Adaptive Dynamic Programming for Feedback Control


Frank L. Lewis, Fellow IEEE, and Draguna Vrabie

Abstract—Living organisms learn by acting on their environment, observing the resulting reward stimulus, and adjusting their actions accordingly to improve the reward. This action-based or Reinforcement Learning can capture notions of optimal behavior occurring in natural systems. We describe mathematical formulations for Reinforcement Learning and a practical implementation method known as Adaptive Dynamic Programming. These give us insight into the design of controllers for man-made engineered systems that both learn and exhibit optimal behavior. Relations are shown between ADP and adaptive control.

Manuscript received June 12, 2009. This work was supported by the National Science Foundation ECCS-0801330 and the Army Research Office W91NF-05-1-0314. F. Lewis and D. Vrabie are with the Automation and Robotics Research Institute, University of Texas at Arlington, 7300 Jack Newell Blvd. S., Fort Worth, TX 76118 USA (phone/fax: +817-272-5938; e-mail: lewis@uta.edu).

I. REINFORCEMENT LEARNING AND OPTIMALITY IN NATURE

Every living organism interacts with its environment and uses those interactions to improve its own actions in order to survive and increase. Charles Darwin showed that species modify their actions based on interactions with the environment over long time scales, leading to natural selection and survival of the fittest. Adam Smith showed that modification of the actions of corporate entities based on interactions on the scale of a global economy is responsible for the relative balance and wealth of nations. Ivan Pavlov used simple reinforcement and punishment stimuli to modify the behavior patterns of dogs by inducing conditional reflexes.

We call modification of actions based on interactions with the environment reinforcement learning (RL) [19]. There are many types of learning, including supervised learning, unsupervised learning, etc. Reinforcement learning refers to an actor or agent that interacts with its environment and modifies its actions, or control policies, based on stimuli received in response to its actions. RL algorithms are constructed on the idea that successful control decisions should be remembered, by means of a reinforcement signal, such that they become more likely to be used a second time. RL is strongly connected from a theoretical point of view with direct and indirect adaptive optimal control methods.

One class of reinforcement learning methods is based on the Actor-Critic structure [4], where an actor component applies an action or control policy to the environment, and a critic component assesses the value of that action. Based on this assessment of the value, various schemes may then be used to modify or improve the action in the sense that the new policy yields a value that is improved over the previous value. See Figure 1.

Figure 1. Reinforcement Learning with an Actor/Critic Structure. (The critic evaluates the current control policy using the reward/response from the environment and drives the policy update/improvement; the actor implements the control policy, applying control actions to the system/environment and observing the system output.)

The actor-critic structure implies two steps: policy evaluation by the critic followed by policy improvement. The policy evaluation step is performed by observing from the environment the results of current actions. Of particular interest are those actor-critic structures where the critic assesses the value of current policies based on some sort of optimality criteria [26], [27], [28], [29], [7], [23], [8], [22]. In such a scheme, reinforcement learning is a means of learning optimal behaviors by observing the response from the environment to nonoptimal control policies.

Approximate or Adaptive Dynamic Programming (ADP) [28], [29], also known as Neurodynamic Programming (NDP) [7], is a method for practical implementation of Actor-Critic RL structures that is based on temporal differences and value function approximation. In this paper we show that ADP is of direct importance for feedback control systems. In fact, ADP is an extension of adaptive control that yields optimal online controller design techniques.

The bulk of research in ADP has been conducted for systems that operate in discrete time (DT). We cover RL for DT systems in Section 2, then for continuous-time systems in Section 3. In Section 4 we show how to implement RL algorithms online using ADP techniques. There, comparisons are drawn with adaptive control.

II. REINFORCEMENT LEARNING FOR DISCRETE-TIME SYSTEMS

We consider here nonlinear discrete-time (DT) systems. First we recall optimal control; then, based on that, we present two RL algorithms, namely Policy Iteration (PI) and Value Iteration (VI). The LQR case is detailed to tie these ideas into standard control theory notions. In Section 4, we show how to implement PI and VI online in real time using data measured along the system trajectories.


A. Optimal Control for Discrete-Time Systems

Consider the class of discrete-time systems

$x_{k+1} = f(x_k) + g(x_k) u_k$    (1)

with state $x_k \in R^n$ and control input $u_k \in R^m$. A control policy is defined as a function $h(\cdot): R^n \to R^m$, i.e.

$u_k = h(x_k)$.    (2)

Thus, a control policy is simply a feedback controller. Define a performance measure or cost function

$V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i)$    (3)

with $0 < \gamma \le 1$ a discount factor and $u_k = h(x_k)$ a prescribed feedback control policy. The discount factor reflects the fact that we are less concerned about costs incurred further into the future. The function $r(x_k, u_k)$ is known as the utility and is a measure of the one-step cost of control. A standard form is

$r(x_k, u_k) = Q(x_k) + u_k^T R u_k$    (4)

with $Q(x), R$ positive definite, which we use at times for illustration.

We assume the system is stabilizable on some set $\Omega \subseteq R^n$, that is, there exists a control policy $u_k = h(x_k)$ such that the closed-loop system $x_{k+1} = f(x_k) + g(x_k) h(x_k)$ is asymptotically stable on $\Omega$. A control policy $u_k = h(x_k)$ is said to be admissible if it is stabilizing and yields a finite cost $V_h(x_k)$. For any admissible policy $u_k = h(x_k)$, we call $V_h(x_k)$ its cost or value.

By writing (3) as

$V_h(x_k) = r(x_k, u_k) + \gamma \sum_{i=k+1}^{\infty} \gamma^{i-(k+1)} r(x_i, u_i)$    (5)

one sees that a difference equation equivalent to (3) is given by

$V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}), \quad V_h(0) = 0$.    (6)

That is, instead of evaluating the infinite sum (3), one can solve the difference equation (6) to obtain the value of using a current policy $u_k = h(x_k)$.

This is a nonlinear Lyapunov equation known as the Bellman equation. It is defined in terms of the DT Hamiltonian

$H(x_k, h(x_k), \Delta V_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)$    (7)

where $\Delta V_k = \gamma V_h(x_{k+1}) - V_h(x_k)$ is the forward difference operator.

The optimal value can be written using the Bellman equation as

$V^*(x_k) = \min_{h(\cdot)} \left( r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) \right)$.    (8)

According to Bellman's Optimality Principle [17] one has

$V^*(x_k) = \min_{h(\cdot)} \left( r(x_k, h(x_k)) + \gamma V^*(x_{k+1}) \right)$.    (9)

This is known as the Bellman optimality equation, or the discrete-time Hamilton-Jacobi-Bellman (HJB) equation. One then has the optimal policy as

$h^*(x_k) = \arg\min_{h(\cdot)} \left( r(x_k, h(x_k)) + \gamma V^*(x_{k+1}) \right)$.    (10)

Since one must know the optimal policy at time k+1 to use (9) to determine the optimal policy at time k, Bellman's Principle yields a backwards-in-time procedure for solving the optimal control problem known as Dynamic Programming. DP methods are off-line methods for controls design that generally require full knowledge of the system dynamical equations. That is, $f(x), g(x)$ must be known.

B. Policy Iteration, Value Iteration, and Fixed Point Equations

In contrast to dynamic programming off-line designs, we seek reinforcement learning schemes for on-line learning in real time, ultimately without knowing the system dynamics $f(x), g(x)$. Therefore, we next show how to exploit the notion that the Bellman equation (6) and the Bellman optimality equation (9) are fixed point equations to develop forward-in-time methods for solving the optimal control problem.

A key concept in implementing learning optimal controllers online forward in time is the temporal difference error. Define the temporal difference error in terms of the Bellman equation as

$e_k = H(x_k, h(x_k), \Delta V_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)$.    (11)

Consider any given admissible policy $u_k = h(x_k)$ with value $V_h(x_k)$. Motivated, though not justified, by (10), determine a new policy from this value using the operation

$h'(x_k) = \arg\min_{h(\cdot)} \left( r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) \right)$.    (12)

This procedure is justified in [7], where it is shown that the new policy $h'(x_k)$ is improved in that it has value $V_{h'}(x_k)$ less than or equal to the old value $V_h(x_k)$. This is known as the one-step improvement property of rollout algorithms. That is, the step (12) has given an improved policy.

This suggests the following iterative method for determining the optimal control, which is known as Policy Iteration [16], [23], [7].

1) DT Policy Iteration (PI) Algorithm
Initialize. Select any admissible (i.e. stabilizing) control policy $h_0(x_k)$.
Policy Evaluation Step. Determine the value of the current policy using the Bellman Equation

$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})$    (13)


Policy Improvement Step. Determine an improved policy using

$h_{j+1}(x_k) = \arg\min_{h(\cdot)} \left( r(x_k, h(x_k)) + \gamma V_{j+1}(x_{k+1}) \right)$    (14)

If the utility has the special form (4) and the dynamics are (1), then the policy improvement step looks like

$h_{j+1}(x_k) = -\frac{\gamma}{2} R^{-1} g^T(x_k) \nabla V_{j+1}(x_{k+1})$    (15)

where $\nabla V(x) = \partial V(x) / \partial x$ is the gradient of the value function, interpreted here as a column vector.

Note that the initial policy in PI must be admissible, which requires that it be stabilizing. It has been shown by [16], [13] and others that this algorithm converges under certain conditions to the optimal value and control policy, that is, to the solution given by (9), (10).

Policy Iteration is based on the fact that the Bellman equation (6) is a fixed point equation. It can be shown that the Bellman Optimality Equation (9) is also a fixed point equation. As such, it has an associated contraction map which can be iterated using ANY initial policy, as formalized in the following Value Iteration Algorithm.

2) DT Value Iteration (VI) Algorithm
Initialize. Select any control policy $h_0(x_k)$, not necessarily admissible or stabilizing.
Value Update Step. Update the value using

$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1})$    (16)

Policy Improvement Step. Determine an improved policy using

$h_{j+1}(x_k) = \arg\min_{h(\cdot)} \left( r(x_k, h(x_k)) + \gamma V_{j+1}(x_{k+1}) \right)$    (17)

It is important to note that now the old value is used on the right-hand side of (16), in contrast to the PI step (13). It has been shown that VI converges under certain situations. Note that VI does not require an initial stabilizing policy.

The evaluation of the value of the current policy in PI using the Bellman Equation (13) amounts to determining the value of using the policy $h_j(x_k)$ starting in all current states $x_k$. This is called a full backup in [23] and can involve significant computation. On the other hand, the value update step (16) in VI involves far less computation and is called a partial backup in [23].

C. Policy Iteration and Value Iteration for the DT LQR

In Section 4 we show how to implement PI and VI online using data measured along the system trajectories. This yields online learning controllers that converge to the optimal control solution. To tie these notions to standard concepts in control systems, in this section we show that, in the LQR case, PI and VI are equivalent to underlying equations that are very familiar to the control systems engineer.

In the Linear Quadratic Regulator (LQR) case one has $x_{k+1} = A x_k + B u_k$ and the control policies are state variable feedbacks $u_k = h(x_k) = -K x_k$. Given a prescribed policy $K$, the cost function or value is

$V_h(x_k) = \sum_{i=k}^{\infty} \left( x_i^T Q x_i + u_i^T R u_i \right) = \sum_{i=k}^{\infty} x_i^T (Q + K^T R K) x_i \equiv V_K(x_k)$    (18)

which has utility $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$ with weighting matrices $Q = Q^T \ge 0$, $R = R^T > 0$. The closed-loop system is $x_{k+1} = (A - BK) x_k \equiv A_c x_k$. It is assumed that $(A, B)$ is stabilizable and $(A, \sqrt{Q})$ is detectable.

In the LQR case the value is quadratic in the current state, so that $V_K(x_k) = x_k^T P x_k$ for some matrix $P$. Therefore, the Bellman equation (6) for the LQR is

$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}$.    (19)

In terms of the feedback gain this can be written as

$x_k^T P x_k = x_k^T \left( Q + K^T R K + (A - BK)^T P (A - BK) \right) x_k$.    (20)

Since this must hold for all current states $x_k$, one has

$(A - BK)^T P (A - BK) - P + Q + K^T R K = 0$,    (21)

which is a Lyapunov equation when $K$ is fixed. The DT HJB equation or Bellman optimality equation is

$A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A = 0$,    (22)

i.e. the Riccati equation. The optimal feedback is

$u_k = -(R + B^T P B)^{-1} B^T P A \, x_k = -K x_k$.    (23)

For the LQR, Bellman's equation (6) is written as (19) and hence is equivalent to the Lyapunov equation (21). Therefore, the Policy Iteration Algorithm for the DT LQR is

$(A - B K_j)^T P_{j+1} (A - B K_j) - P_{j+1} + Q + K_j^T R K_j = 0$    (24)

with the policy update (14) given by

$K_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$.    (25)

This is exactly Hewer's algorithm [12] to solve the Riccati equation (22). Hewer proved that it converges under the stabilizability and detectability assumptions.

In the Value Iteration Algorithm, the value update step (16) for the LQR is

$P_{j+1} = (A - B K_j)^T P_j (A - B K_j) + Q + K_j^T R K_j$    (26)

and the policy update (17) is (25). Iteration of these two equations has been studied by Lancaster and Rodman [15], who showed that it converges to the Riccati equation solution under the stated assumptions.

Note that Policy Iteration involves the full solution of a Lyapunov equation (24) at each step and requires a stabilizing gain $K_j$ at each step. On the other hand, Value Iteration involves only a Lyapunov recursion (26) at each step, which is very easy to compute, and does not require a stabilizing gain. The recursion (26) can be performed even if $K_j$ is not stabilizing.

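To make the DT LQR iterations concrete, here is a minimal numerical sketch (editor-supplied, not from the paper; the matrices A, B, Q, R are illustrative choices) of Hewer's Policy Iteration (24)-(25) and the Value Iteration recursion (26) in Python/NumPy. Both should converge to the solution of the Riccati equation (22); PI uses an admissible initial gain (here $K_0 = 0$, admissible because the example A is Schur stable), while VI simply starts from $P_0 = 0$.

```python
import numpy as np

# Illustrative DT LQR data (editor-chosen example, not from the paper)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

def policy_iteration(A, B, Q, R, K0, n_iter=50):
    """Hewer's algorithm: solve the Lyapunov equation (24), then update the gain by (25)."""
    K = K0
    n = A.shape[0]
    for _ in range(n_iter):
        Ac = A - B @ K
        # Solve (A-BK)^T P (A-BK) - P + (Q + K^T R K) = 0 via vectorization
        M = np.kron(Ac.T, Ac.T) - np.eye(n * n)
        P = np.linalg.solve(M, -(Q + K.T @ R @ K).reshape(-1)).reshape(n, n)
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # policy update (25)
    return P, K

def value_iteration(A, B, Q, R, n_iter=500):
    """Lyapunov recursion (26) with gain update (25); no stabilizing initial gain needed."""
    n, m = A.shape[0], B.shape[1]
    P = np.zeros((n, n))
    K = np.zeros((m, n))
    for _ in range(n_iter):
        P = (A - B @ K).T @ P @ (A - B @ K) + Q + K.T @ R @ K   # value update (26)
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)       # policy update (25)
    return P, K

P_pi, K_pi = policy_iteration(A, B, Q, R, K0=np.zeros((1, 2)))  # A is Schur stable, so K0 = 0 is admissible
P_vi, K_vi = value_iteration(A, B, Q, R)
print(np.allclose(P_pi, P_vi, atol=1e-6))  # both approximate the Riccati solution (22)
```

As the comparison in the text suggests, the PI loop does more work per step (a full Lyapunov solve) but converges in a handful of iterations, while the VI loop is a cheap recursion that needs many more steps.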

On-line learning vs. off-line planning solution of the LQR. It is important to note the following key point. In going from the formulation (19) of the Bellman equation to the formulation (21), which is the Lyapunov equation, one has performed two steps. First, the system dynamics are substituted for $x_{k+1}$ to yield (20); next, the current state $x_k$ is cancelled to obtain (21). These two steps make it impossible to apply real-time online reinforcement learning methods to find the optimal control, which we shall show how to do in Section 4. Because of these two steps, optimal controls design in the Control Systems Community is almost universally an off-line procedure involving solutions of Riccati equations, where full knowledge of the system dynamics $(A, B)$ is required.

III. INTERVAL REINFORCEMENT LEARNING (IRL) FOR CONTINUOUS-TIME SYSTEMS

Several studies have been made of reinforcement learning and ADP for CT systems, including [1], [2], [3], [5], [9], [10], [21], [18]. Reinforcement Learning is considerably more difficult for continuous-time systems than for discrete-time systems, and its development has lagged. We shall now see why.

Consider the continuous-time nonlinear dynamical system

$\dot{x} = f(x) + g(x) u$    (27)

with state $x(t) \in R^n$, control input $u(t) \in R^m$, and the usual mild assumptions required for existence of unique solutions and an equilibrium point at $x = 0$, e.g. $f(0) = 0$ and $f(x) + g(x)u$ Lipschitz on a set $\Omega \subseteq R^n$ that contains the origin. We assume the system is stabilizable on $\Omega$, that is, there exists a continuous control function $u(t)$ such that the closed-loop system is asymptotically stable on $\Omega$.

Define the performance measure associated with the feedback control policy $u = \mu(x)$ as

$V^{\mu}(x(t)) = \int_t^{\infty} r(x(\tau), u(\tau)) \, d\tau$    (28)

with utility $r(x, u) = Q(x) + u^T R u$, with $Q(x)$ positive definite, i.e. $\forall x \ne 0,\ Q(x) > 0$ and $x = 0 \Rightarrow Q(x) = 0$, and $R \in R^{m \times m}$ a positive definite matrix. A policy is called admissible if it is continuous, stabilizes the system, and has a finite associated cost.

A. Standard Formulation of the CT Optimal Control Problem

If the policy is admissible and the cost is smooth, then an infinitesimal equivalent to (28) is the nonlinear Lyapunov equation

$0 = r(x, \mu(x)) + (\nabla V^{\mu})^T \left( f(x) + g(x) \mu(x) \right), \quad V^{\mu}(0) = 0$    (29)

where $\nabla V^{\mu}$ (a column vector) denotes the gradient of the cost function $V^{\mu}$ with respect to $x$.

This is the CT Bellman equation. It is defined based on the CT Hamiltonian function

$H(x, \mu(x), \nabla V^{\mu}) = r(x, \mu(x)) + (\nabla V^{\mu})^T \left( f(x) + g(x) \mu(x) \right)$.    (30)

A temporal difference error e(t) and a TD equation for CT systems could now be defined as

$e(t) \equiv H(x, \mu(x), \nabla V^{\mu}) = r(x, \mu(x)) + (\nabla V^{\mu})^T \left( f(x) + g(x) \mu(x) \right) = 0$.    (31)

This is a fixed point equation and allows the formulation of CT reinforcement learning schemes.

We now see the problem with CT systems immediately. Compare the CT Bellman Hamiltonian (30) to the DT Hamiltonian (7). The former contains the full system dynamics $f(x) + g(x)u$, while the DT Hamiltonian does not. This means that there is no hope of using the CT Bellman equation (29) as a basis for reinforcement learning unless the full dynamics are known.

The next algorithm shows how to implement Policy Iteration for CT systems based on the Hamiltonian function defined in (30) and the fixed point equation $H(x, \mu(x), \nabla V^{\mu}) = 0$.

1) CT Standard Policy Iteration (PI) Algorithm
Initialize. Select any admissible (i.e. stabilizing) control policy $\mu^{(0)}(x)$.
Policy Evaluation Step. Solve for $V^{\mu^{(i)}}(x(t))$ using

$H(x, \mu^{(i)}(x), \nabla V^{\mu^{(i)}}) = r(x, \mu^{(i)}(x)) + (\nabla V^{\mu^{(i)}})^T \left( f(x) + g(x) \mu^{(i)}(x) \right) = 0$    (32)

Policy Improvement Step. Determine an improved policy using

$\mu^{(i+1)} = \arg\min_u \left[ H(x, u, \nabla V^{\mu^{(i)}}) \right]$    (33)

which explicitly is

$\mu^{(i+1)}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \nabla V^{\mu^{(i)}}$    (34)

Note that the full dynamics $f(x) + g(x)u$ are needed to solve (32).

Another problem is noted when comparing the CT Bellman Hamiltonian (30) to the DT Hamiltonian (7). There are two occurrences of the value function in the latter, which allows one to perform Value Iteration as shown in (16), (17). However, the CT Hamiltonian has only one occurrence of the value function, so it is not at all clear how to perform value iteration for CT systems.
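As a concrete check of the point that the CT Bellman equation (29) involves the full dynamics, the brief sketch below (editor-supplied; the system and gain are illustrative) evaluates the CT Hamiltonian (30) for an LQR value $V(x) = x^T P x$, with P obtained from the Lyapunov equation associated with a stabilizing gain K. The Hamiltonian is zero at any state, but computing it requires A and B explicitly.

```python
import numpy as np

# Illustrative CT LQR data (editor-chosen, not from the paper)
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K = np.array([[1.0, 1.0]])             # a stabilizing gain for this A, B

# Value of the policy u = -Kx: solve the CT Lyapunov equation
#   (A - BK)^T P + P (A - BK) + Q + K^T R K = 0
Ac = A - B @ K
n = A.shape[0]
S = Q + K.T @ R @ K
M = np.kron(np.eye(n), Ac.T) + np.kron(Ac.T, np.eye(n))
P = np.linalg.solve(M, -S.reshape(-1)).reshape(n, n)

# CT Hamiltonian (30) at a sample state: needs f(x) + g(x)u = Ax + Bu explicitly
x = np.array([[1.0], [-0.5]])
u = -K @ x
gradV = 2.0 * P @ x                     # gradient of V(x) = x^T P x
H = (x.T @ Q @ x + u.T @ R @ u + gradV.T @ (A @ x + B @ u)).item()
print(abs(H) < 1e-10)                   # the CT Bellman equation (29) holds
```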


B. Interval Reinforcement Form of CT Optimal Control

The method for successfully confronting these problems was given by Vrabie [25], who defined a different temporal difference error for CT systems. It was shown there that some notions of learning mechanisms in the human brain based on multiple timescales motivate the following approach.

Write the cost (28) in the interval reinforcement form

$V^{\mu}(x(t)) = \int_t^{t+T} r(x(\tau), u(\tau)) \, d\tau + V^{\mu}(x(t+T))$    (35)

for any $T > 0$. This is exactly in the form of the DT Bellman equation (6). It is a fixed point equation. Therefore, one can define (35) as the Bellman equation for CT systems and the associated CT temporal difference error e(t) as

$e(t:t+T) = -V^{\mu}(x(t)) + \int_t^{t+T} r(x(\tau), u(\tau)) \, d\tau + V^{\mu}(x(t+T)) = 0$.    (36)

This does not involve the system dynamics.

According to Bellman's principle, the optimal value is given in terms of this construction as [17]

$V^*(x(t)) = \min_{u(t:t+T)} \left( \int_t^{t+T} r(x(\tau), u(\tau)) \, d\tau + V^*(x(t+T)) \right)$

where $u(t:t+T) = \{ u(\tau) : t \le \tau < t+T \}$. The optimal control is

$\mu^*(x(t)) = \arg\min_{u(t:t+T)} \left( \int_t^{t+T} r(x(\tau), u(\tau)) \, d\tau + V^*(x(t+T)) \right)$.

It is shown in [25] that the nonlinear Lyapunov equation (29) is exactly equivalent to the interval reinforcement form (35). That is, the positive definite solution of both is the value (28) of the policy $u = \mu(x)$.

Now it is direct to properly formulate policy iteration and value iteration for CT systems. We call these the interval reinforcement learning (IRL) formulations.

1) CT IRL Policy Iteration (PI) Algorithm
Initialize. Select any admissible (i.e. stabilizing) control policy $\mu^{(0)}(x)$.
Policy Evaluation Step. Solve for $V^{\mu^{(i)}}(x(t))$ using

$V^{\mu^{(i)}}(x(t)) = \int_t^{t+T} r(x(\tau), \mu^{(i)}(x(\tau))) \, d\tau + V^{\mu^{(i)}}(x(t+T))$, with $V^{\mu^{(i)}}(0) = 0$    (37)

Policy Improvement Step. Determine an improved policy using

$\mu^{(i+1)} = \arg\min_u \left[ H(x, u, \nabla V^{\mu^{(i)}}) \right]$    (38)

which explicitly is

$\mu^{(i+1)}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \nabla V^{\mu^{(i)}}$    (39)

2) CT IRL Value Iteration (VI) Algorithm
Initialize. Select any control policy $\mu^{(0)}(x)$, not necessarily stabilizing.
Value Update Step. Solve for $V^{\mu^{(i)}}(x(t))$ using

$V^{\mu^{(i)}}(x(t)) = \int_t^{t+T} r(x(\tau), \mu^{(i)}(x(\tau))) \, d\tau + V^{\mu^{(i-1)}}(x(t+T))$, with $V^{\mu^{(i)}}(0) = 0$    (40)

Policy Improvement Step. Determine an improved policy using

$\mu^{(i+1)} = \arg\min_u \left[ H(x, u, \nabla V^{\mu^{(i)}}) \right]$    (41)

which explicitly is

$\mu^{(i+1)}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \nabla V^{\mu^{(i)}}$    (42)

Note that neither algorithm requires knowledge of the internal system dynamics function f(.). That is, they work for partially unknown systems.

C. Interval Reinforcement Learning (IRL) for the CT LQR Case

In Section 4 we show how to implement IRL PI and IRL VI online using data measured along the system trajectories. This yields online learning controllers that converge to the optimal control solution. To tie these notions to standard concepts in control systems, in this section we derive the underlying equations for the CT LQR.

In the LQR case [17], one has the dynamics $\dot{x} = Ax + Bu$ and the cost

$V^u(x(t)) = \int_t^{\infty} \left( x^T(\tau) Q x(\tau) + u^T(\tau) R u(\tau) \right) d\tau = x^T(t) P x(t)$

for some $P$. Then the Bellman Equation (29) becomes

$x^T Q x + u^T R u + 2 x^T P (Ax + Bu) = 0$.    (43)

For a fixed admissible policy $u = -Kx$ this becomes

$x^T \left( Q + K^T R K + P A_c + A_c^T P \right) x = 0$    (44)

with $A_c = A - BK$.

Minimizing (43) with respect to $u$ one obtains $u = -R^{-1} B^T P x$, whence substitution into (43) and simplification yields the Bellman optimality equation or HJB equation

$x^T \left( A^T P + P A + Q - P B R^{-1} B^T P \right) x = 0$.

Standard usage now states that these equations hold for all initial conditions, so that one has the Lyapunov equation

$P A_c + A_c^T P = -(Q + K^T R K)$    (45)

and the Riccati equation

$A^T P + P A + Q - P B R^{-1} B^T P = 0$.    (46)
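Before specializing the algorithms, the interval reinforcement form (35) can be checked numerically for the CT LQR. The sketch below (editor-supplied; system, gain, and interval length are illustrative) computes P from the Lyapunov equation (45), simulates the closed loop, and confirms that $x^T(t)Px(t)$ equals the integral reinforcement over $[t, t+T]$ plus $x^T(t+T)Px(t+T)$, a relation that is evaluated from trajectory data alone once P is in hand.

```python
import numpy as np

# Illustrative CT LQR data (editor-chosen, not from the paper)
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K = np.array([[1.0, 1.0]])                 # stabilizing gain
Ac = A - B @ K

# P from the Lyapunov equation (45): P Ac + Ac^T P = -(Q + K^T R K)
n = A.shape[0]
S = Q + K.T @ R @ K
M = np.kron(np.eye(n), Ac.T) + np.kron(Ac.T, np.eye(n))
P = np.linalg.solve(M, -S.reshape(-1)).reshape(n, n)

# Simulate the closed loop with a small Euler step and accumulate the
# integral reinforcement over [t, t+T], as in (35)
dt, T = 1e-4, 0.5
x = np.array([[1.0], [-0.5]])
x_t = x.copy()
rho = 0.0                                   # integral of x^T Q x + u^T R u
for _ in range(int(T / dt)):
    u = -K @ x
    rho += float(x.T @ Q @ x + u.T @ R @ u) * dt
    x = x + (Ac @ x) * dt

lhs = float(x_t.T @ P @ x_t)                # V(x(t))
rhs = rho + float(x.T @ P @ x)              # integral reinforcement + V(x(t+T))
print(abs(lhs - rhs) < 1e-2)                # (35) holds up to integration error
```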


For the CT LQR, the Standard Policy Iteration Algorithm (32), (33) becomes

$P^i A_c + A_c^T P^i = -\left( Q + (K^i)^T R K^i \right)$, with $A_c = A - B K^i$    (47)

$K^{i+1} = R^{-1} B^T P^i$    (48)

This is exactly Kleinman's algorithm [14]. It requires a stabilizing initial gain and knowledge of the full system dynamics $A, B$. It is an offline planning technique that is based on solving the Riccati equation.

To obtain online optimal learning controllers using RL techniques, Vrabie used the interval form of the CT LQR Bellman equation

$x^T(t) P x(t) = \int_t^{t+T} \left( x^T(\tau) Q x(\tau) + u^T(\tau) R u(\tau) \right) d\tau + x^T(t+T) P x(t+T)$.

Therefore, the CT Policy Iteration and Value Iteration Algorithms may alternatively be formulated as follows.

1) IRL Policy Iteration (PI) Algorithm for CT LQR
Initialize. Select any admissible (i.e. stabilizing) control policy $\mu^{(0)}(x) = -K^0 x$.
Policy Evaluation Step. Solve for $P^i$ using $u^i = -K^i x$ and

$x^T(t) P^i x(t) = \int_t^{t+T} \left( x^T(\tau) Q x(\tau) + (u^i)^T(\tau) R u^i(\tau) \right) d\tau + x^T(t+T) P^i x(t+T)$    (49)

Policy Improvement Step. Determine an improved policy using

$K^{i+1} = R^{-1} B^T P^i$    (50)

2) IRL Value Iteration (VI) Algorithm for CT LQR
Initialize. Select any control policy $\mu^{(0)}(x) = -K^0 x$, not necessarily stabilizing.
Value Update Step. Solve for $P^i$ using

$x^T(t) P^i x(t) = \int_t^{t+T} \left( x^T(\tau) Q x(\tau) + (u^i)^T(\tau) R u^i(\tau) \right) d\tau + x^T(t+T) P^{i-1} x(t+T)$    (51)

Policy Improvement Step. Determine an improved policy using

$K^{i+1} = R^{-1} B^T P^i$    (52)

We shall see in the next section how to implement these algorithms online, without knowing the system dynamics matrix $A$, by using reinforcement learning techniques. That is, IRL techniques allow the solution of the CT Riccati equation online, without knowing the system internal dynamics, by using data measured along the system trajectories.

It is shown in [24] that the IRL Policy Iteration Algorithm for the CT LQR is exactly equivalent to Kleinman's algorithm (47), (48), yet it can be implemented online using the ADP methods in the next section. On the other hand, the IRL Value Iteration Algorithm for the CT LQR is equivalent to the recursion

$P^i = \int_0^{T} e^{A_i^T \tau} \left( Q + (K^i)^T R K^i \right) e^{A_i \tau} \, d\tau + e^{A_i^T T} P^{i-1} e^{A_i T}$

with $A_i = A - B R^{-1} B^T P^{i-1}$ (and $K^i = R^{-1} B^T P^{i-1}$). A stabilizing initial gain is not needed. This is a method of solving the CT Riccati equation by using iteration on a discrete-time Lyapunov recursion!
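The following is a small sketch of this Lyapunov-recursion view of IRL Value Iteration for the CT LQR (editor-supplied; the double-integrator data and the interval T are illustrative, and convergence without a stabilizing gain is assumed to hold for a sufficiently small T). Starting from $P^0 = 0$, the recursion is iterated with matrix exponentials and the limit is checked against the CT Riccati equation (46).

```python
import numpy as np
from scipy.linalg import expm

# Illustrative CT LQR data (editor-chosen): a double integrator, whose open loop
# is not asymptotically stable, so K^0 = 0 is not a stabilizing gain.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
Rinv = np.linalg.inv(R)
T = 0.05            # reinforcement interval, kept small for convergence

def gramian(Ai, S, T):
    """Block-exponential (Van Loan) evaluation of int_0^T exp(Ai^T s) S exp(Ai s) ds."""
    n = Ai.shape[0]
    M = np.zeros((2 * n, 2 * n))
    M[:n, :n] = -Ai.T
    M[:n, n:] = S
    M[n:, n:] = Ai
    F = expm(M * T)
    return F[n:, n:].T @ F[:n, n:]

P = np.zeros((2, 2))            # P^0 = 0: no stabilizing initial gain used
for _ in range(5000):
    K = Rinv @ B.T @ P           # K^i = R^{-1} B^T P^{i-1}
    Ai = A - B @ K
    S = Q + K.T @ R @ K
    P = gramian(Ai, S, T) + expm(Ai.T * T) @ P @ expm(Ai * T)

# The limit should satisfy the CT Riccati equation (46)
residual = A.T @ P + P @ A + Q - P @ B @ Rinv @ B.T @ P
print(np.max(np.abs(residual)) < 1e-6)
```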


where f ( x) = [j1 ( x) j 2 ( x) L j L ( x )] : R n ® R L is a To implement this online, the time is incremented at each


iteration by the period T. Each RLS iteration is performed
basis vector and e L ( x) converges uniformly to zero as the
using the measured data at each time increment given by
number of terms retained L ® ¥ . In the Weierstrass
( x(t ), x(t + T ), r (t : t + T ) ) , where
Theorem, standard usage takes a polynomial basis set. In the
t +T
neural network community, approximation results have been
shown for various other basis sets including sigmoid,
r (t :t +T ) = ò r ( x(t ), u (t ))dt
t
hyperbolic tangent, Gaussian radial basis functions, etc.
is the reinforcement measured on each time interval. The
There, standard results show that the NN approximation
reinforcement learning time interval T need not be the same
error e L ( x) is bounded by a constant on a compact set. L is
at every iteration. That is, T can be changed depending on
referred to as the number of hidden-layer neurons. how long it takes to get meaningful information from the
Many papers have been written about ADP for discrete- observations.
time systems. Therefore, consider ADP for CT systems
based on the new IRL formulation introduced by Vrabie B. ADP and Adaptive Control
[24]. Assuming the approximation Adaptive control can be performed either in a direct
V m ( x) = W T f ( x) (54) fashion, wherein the controller parameters are directly
estimated, or in an indirect fashion, wherein the system
the Interval Reinforcement Learning PI Algorithm (37), (38)
model parameters are first estimated and then the controller
can be written in terms of the policy evaluation step
t +T
is computed. One sees that reinforcement learning is an

ò ( Q ( x) + u ) indirect adaptive controller wherein the parameters of the


WiT f ( x (t )) = i
T
Rui dt + WiT f ( x(t + T )) (55)
Value (54) are estimated using (57). Then the control is
t
computed using (56). However, the optimal control is
and the policy improvement step
directly computed in terms of the learned parameters using
¶V (56), so this is actually a direct adaptive control scheme!
ui +1 = hi +1 ( x) = - 1 2 R -1 g T ( x) i
¶x The importance of reinforcement learning is that it
T (56)
é ¶f ( x(t )) ù provides an adaptive controller that converges to the optimal
-1 T
= - 2 R g ( x) ê
1
ú Wi control. This is new in the Control System Community,
ë ¶x(t ) û where adaptive controllers do not typically converge to
Write (55) as optimal control solutions. Indirect adaptive controllers have
t +T
been designed that first estimate system parameters and then
WiT [f ( x(t )) - f ( x(t + T )) ] = ò ( Q ( x) + u )
T
i Rui dt (57) solve Riccati equations, but these are clumsy. Reinforcement
t Learning provides Optimal Adaptive Controllers learned
This is a scalar equation, whereas the unknown parameter online.
vector Wi has L elements. Note that this is a two-time scale system wherein the
Note that this equation is exactly in the form of standard control action in an inner loop occurs at the sampling time,
parameter identification equations in adaptive control with but the performance is evaluated in an outer loop over a
longer horizon, corresponding to the convergence time
[f ( x(t )) - f ( x(t + T ))] a known regression vector and Wi
needed for RLS.
the unknown parameters to be identified. Therefore to It is important to note that, in the LQR case, the Riccati
implement IRL PI online, simply select an initial stabilizing equation (46) provides the optimal control solution. The
policy. Then at each iteration solve the policy evaluation dynamics (A,B) must be known for the solution of the
(57) using, for instance RLS. After convergence of the RLS, Lyapunov equation (45) and the Riccati equation (46). As
update the policy using (56). such these equations provide offline planning solutions for
Note that to implement this algorithm, one only requires controls design. On the other hand, the Lyapunov equation is
the input-coupling function g(x). The internal dynamics f(x) equivalent to fixed point equation (35), which can be
need not be known. One requires the persistence of evaluated online along the system trajectories using ADP
excitation of the regression matrix [f ( x(t )) - f ( x(t + T )) ] . techniques by measuring at each time the data set
The IRL Value Iteration algorithm is implemented online ( x(t ), x(t + T ), r (t : t + T ) ) , which consists of the current
in exactly similar fashion, but (57) is replaced by state, the next state, and the resulting integral utility incurred.
t +T
This corresponds to learning the optimal control online by
WiT f ( x(t )) = ò ( Q ( x) + u )
T
i Rui dt + WiT-1f ( x(t + T )) evaluating the performance of nonoptimal controllers.
t ADP actually solves the Riccati equation online without
with the old NN weights Wi held constant. Now, the knowing the dynamics f(x) by observing the data
regression vector is f ( x(t )) , which must be PE. ( x(t ), x(t + T ), r (t : t + T ) ) at each time along the system
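As a sketch of how the policy evaluation (57) and policy update (56) can be implemented from data, the following example (editor-supplied; system, basis, and tuning values are illustrative) runs IRL Policy Iteration for a CT LQR with the quadratic basis $\phi(x) = [x_1^2,\ x_1 x_2,\ x_2^2]$. Batch least squares over a few tuples $(x(t), x(t+T), \rho(t:t+T))$ stands in for the RLS estimator described above; the matrix A appears only to simulate the measurements and to verify the result, while B (that is, $g(x)$) is used in the policy update, as the algorithm requires.

```python
import numpy as np

# Illustrative CT LQR setup (editor-chosen, not from the paper)
A = np.array([[0.0, 1.0], [-2.0, -3.0]])   # used ONLY to simulate measurements / verify
B = np.array([[0.0], [1.0]])               # g(x) = B is assumed known, as in (56)
Q = np.eye(2)
R = np.array([[1.0]])
Rinv = np.linalg.inv(R)

def phi(x):
    """Quadratic basis phi(x) = [x1^2, x1*x2, x2^2] for the LQR value (54)."""
    return np.array([x[0] * x[0], x[0] * x[1], x[1] * x[1]])

def collect_tuple(K, x0, T=0.5, dt=1e-4):
    """Simulate the closed loop from x0 over [0, T]; return (x(t), x(t+T), rho)."""
    x = x0.copy()
    rho = 0.0
    for _ in range(int(T / dt)):
        u = -K @ x
        rho += float(x @ Q @ x + u @ R @ u) * dt      # integral reinforcement
        x = x + (A @ x + B @ u) * dt                  # measurement generation only
    return x0, x, rho

rng = np.random.default_rng(0)
K = np.zeros((1, 2))                                  # admissible here since A is stable
for _ in range(6):                                    # IRL Policy Iteration loops
    Phi, rhos = [], []
    for _ in range(10):                               # data tuples for the LS fit of (57)
        x0 = rng.uniform(-1.0, 1.0, size=2)
        xt, xT, rho = collect_tuple(K, x0)
        Phi.append(phi(xt) - phi(xT))                 # regression vector in (57)
        rhos.append(rho)
    W = np.linalg.lstsq(np.array(Phi), np.array(rhos), rcond=None)[0]
    P = np.array([[W[0], W[1] / 2.0], [W[1] / 2.0, W[2]]])
    K = Rinv @ B.T @ P                                # policy update (56)/(50)

# Verification only: the learned P should nearly satisfy the Riccati equation (46)
residual = A.T @ P + P @ A + Q - P @ B @ Rinv @ B.T @ P
print(np.max(np.abs(residual)) < 0.05)                # small residual up to simulation error
```

Drawing fresh initial states for each tuple is one simple way to meet the persistence-of-excitation requirement on the regression vectors; in a truly online setting, excitation would instead come from probing signals or changing operating conditions.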


REFERENCES
[1] M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779-791, 2005.
[2] M. Abu-Khalaf, F. L. Lewis, and J. Huang, "Policy iterations on the Hamilton-Jacobi-Isaacs equation for H-infinity state feedback control with input saturation," IEEE Trans. Automatic Control, vol. 51, no. 12, pp. 1989-1995, 2006.
[3] L. Baird, "Reinforcement learning in continuous time: advantage updating," Proc. International Conference on Neural Networks, Orlando, FL, June 1994.
[4] A. G. Barto, R. S. Sutton, and C. Anderson, "Neuron-like adaptive elements that can solve difficult learning control problems," IEEE Trans. Systems, Man, and Cybernetics, vol. SMC-13, pp. 834-846, 1983.
[5] R. Beard, G. Saridis, and J. Wen, "Approximate solutions to the time-invariant Hamilton-Jacobi-Bellman equation," Automatica, vol. 33, no. 12, pp. 2159-2177, Dec. 1997.
[6] R. E. Bellman, Dynamic Programming, Princeton, NJ: Princeton University Press, 1957.
[7] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, MA, 1996.
[8] X. Cao, Stochastic Learning and Optimization, Springer-Verlag, Berlin, 2009.
[9] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, pp. 219-245, 2000.
[10] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Trans. Neural Networks, vol. 18, no. 3, pp. 631-647, 2007.
[11] P. He and S. Jagannathan, "Reinforcement learning neural-network-based controller for nonlinear discrete-time systems with input constraints," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 37, no. 2, pp. 425-436, Apr. 2007.
[12] G. A. Hewer, "An iterative technique for the computation of steady state gains for the discrete optimal regulator," IEEE Trans. Automatic Control, pp. 382-384, 1971.
[13] R. Howard, Dynamic Programming and Markov Processes, MIT Press, Cambridge, MA, 1960.
[14] D. L. Kleinman, "On an iterative technique for Riccati equation computations," IEEE Trans. Automatic Control, vol. AC-13, no. 1, pp. 114-115, Feb. 1968.
[15] P. Lancaster and L. Rodman, Algebraic Riccati Equations, Oxford University Press, UK, 1995.
[16] R. J. Leake and R. W. Liu, "Construction of suboptimal control sequences," J. SIAM Control, vol. 5, no. 1, pp. 54-63, 1967.
[17] F. L. Lewis and V. Syrmos, Optimal Control, 2nd ed., John Wiley, New York, 1995.
[18] P. Mehta and S. Meyn, "Q-learning and Pontryagin's minimum principle," preprint, 2009.
[19] J. M. Mendel and R. W. MacLaren, "Reinforcement learning control and pattern recognition systems," in Adaptive, Learning, and Pattern Recognition Systems: Theory and Applications, ed. J. M. Mendel and K. S. Fu, pp. 287-318, Academic Press, New York, 1970.
[20] W. T. Miller, R. S. Sutton, and P. J. Werbos, ed., Neural Networks for Control, Cambridge: MIT Press, 1991.
[21] J. Murray, C. Cox, R. Saeks, and G. Lendaris, "Globally convergent approximate dynamic programming applied to an autolander," Proc. ACC, pp. 2901-2906, Arlington, VA, 2001.
[22] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley, New York, 2009.
[23] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[24] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, "Adaptive optimal control for continuous-time linear systems based on policy iteration," Automatica, vol. 45, pp. 477-484, 2009.
[25] D. Vrabie and F. L. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially-unknown nonlinear systems," Neural Networks, special issue: Goal-Directed Neural Systems, vol. 22, no. 3, pp. 237-246, 2009.
[26] P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavior Sciences, Ph.D. Thesis, Harvard Univ., 1974.
[27] P. J. Werbos, "Neural networks for control and system identification," Proc. IEEE Conf. Decision and Control, 1989.
[28] P. J. Werbos, "A menu of designs for reinforcement learning over time," in Neural Networks for Control, pp. 67-95, ed. W. T. Miller, R. S. Sutton, and P. J. Werbos, Cambridge: MIT Press, 1991.
[29] P. J. Werbos, "Approximate dynamic programming for real-time control and neural modeling," in Handbook of Intelligent Control, ed. D. A. White and D. A. Sofge, New York: Van Nostrand Reinhold, 1992.
[30] D. A. White and D. A. Sofge, ed., Handbook of Intelligent Control, New York: Van Nostrand Reinhold, 1992.