7th Asian Control Conference (ASCC), Hong Kong, China, August 27-29, 2009
Abstract—Living organisms learn by acting on their environment, observing the resulting reward stimulus, and adjusting their actions accordingly to improve the reward. This action-based or Reinforcement Learning can capture notions of optimal behavior occurring in natural systems. We describe mathematical formulations for Reinforcement Learning and a practical implementation method known as Adaptive Dynamic Programming. These give us insight into the design of controllers for man-made engineered systems that both learn and exhibit optimal behavior. Relations are shown between ADP and adaptive control.

Footnote: Manuscript received June 12, 2009. This work was supported by the National Science Foundation ECCS-0801330 and the Army Research Office W91NF-05-1-0314. F. Lewis and D. Vrabie are with the Automation and Robotics Research Institute, University of Texas at Arlington, 7300 Jack Newell Blvd. S., Fort Worth, TX 76118 USA (phone/fax: +817-272-5938; e-mail: lewis@uta.edu).

I. REINFORCEMENT LEARNING AND OPTIMALITY IN NATURE

Species modify their actions based on interactions with the environment over long time scales, leading to natural selection and survival of the fittest. Adam Smith showed that modification of the actions of corporate entities based on interactions on the scale of a global economy is responsible for the relative balance and wealth of nations. Ivan Pavlov used simple reinforcement and punishment stimuli to modify the behavior patterns of dogs by inducing conditional reflexes.

We call modification of actions based on interactions with the environment reinforcement learning (RL) [19]. There are many types of learning, including supervised learning, unsupervised learning, etc. Reinforcement learning refers to an actor or agent that interacts with its environment and modifies its actions, or control policies, based on stimuli received in response to its actions. RL algorithms are constructed on the idea that successful control decisions should be remembered, by means of a reinforcement signal, such that they become more likely to be used a second time. RL is strongly connected from a theoretical point of view with direct and indirect adaptive optimal control methods.

One class of reinforcement learning methods is based on the Actor-Critic structure [4], where an actor component applies an action or control policy to the environment and a critic component assesses the value of that action. Based on this assessment of the value, various schemes may then be used to modify or improve the action in the sense that the new policy yields a value that is improved over the previous value. See Figure 1.

Figure 1. The Actor-Critic structure. The critic evaluates the current control policy using the reward/response from the environment and provides a policy update/improvement to the actor; the actor implements the control policy, applying the control action to the system/environment and observing the system output.

The actor-critic structure implies two steps: policy evaluation by the critic followed by policy improvement. The policy evaluation step is performed by observing from the environment the results of current actions. Of particular interest are those actor-critic structures where the critic assesses the value of current policies based on some sort of optimality criteria [26], [27], [28], [29], [7], [23], [8], [22]. In such a scheme, reinforcement learning is a means of learning optimal behaviors by observing the response from the environment to nonoptimal control policies.

Approximate or Adaptive Dynamic Programming (ADP) [28], [29], also known as Neurodynamic Programming (NDP) [7], is a method for practical implementation of Actor-Critic RL structures that is based on temporal differences and value function approximation. In this paper we show that ADP is of direct importance for feedback control systems. In fact, ADP is an extension of adaptive control that yields optimal online controller design techniques.

The bulk of research in ADP has been conducted for systems that operate in discrete-time (DT). We cover RL for DT systems in Section 2, then for continuous-time systems in Section 3. In Section 4 we show how to implement RL algorithms online using ADP techniques. There, comparisons are drawn with adaptive control.

II. REINFORCEMENT LEARNING FOR DISCRETE-TIME SYSTEMS

We consider here nonlinear discrete-time (DT) systems. First we recall optimal control, then, based on that, present two RL algorithms, namely Policy Iteration (PI) and Value Iteration (VI). The LQR case is detailed to tie these ideas into standard control theory notions. In Section 4, we show how to implement PI and VI online in real-time using data measured along the system trajectories.

[...]
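The actor-critic loop just described, policy evaluation by the critic followed by policy improvement by the actor, can be sketched on a toy problem. The following is our own illustration, not from the paper: a hypothetical 2-state, 2-action Markov decision process with invented transition and reward data, where the critic solves the Bellman equation exactly and the actor acts greedily (here maximizing reward; the control developments below equivalently minimize cost).

```python
import numpy as np

# Toy actor-critic (policy iteration) on a hypothetical 2-state, 2-action MDP.
# P[a][s] is the next-state distribution under action a; r[a][s] the reward.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # transitions under action 0
     np.array([[0.1, 0.9], [0.7, 0.3]])]   # transitions under action 1
r = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
gamma = 0.9

policy = np.zeros(2, dtype=int)            # actor: initial policy
for _ in range(20):
    # Critic (policy evaluation): solve V = r_pi + gamma * P_pi V exactly
    Ppi = np.vstack([P[policy[s]][s] for s in range(2)])
    rpi = np.array([r[policy[s]][s] for s in range(2)])
    V = np.linalg.solve(np.eye(2) - gamma * Ppi, rpi)
    # Actor (policy improvement): act greedily against the critic's value
    Qsa = np.array([[r[a][s] + gamma * P[a][s] @ V for a in range(2)]
                    for s in range(2)])
    new_policy = Qsa.argmax(axis=1)
    if np.array_equal(new_policy, policy):  # value can no longer be improved
        break
    policy = new_policy
```

At convergence the policy is greedy with respect to its own value, which is exactly the fixed-point property exploited throughout the paper.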
That is, instead of evaluating the infinite sum (3), one can solve the difference equation (6) to obtain the value of using a current policy u_k = h(x_k).

This is a nonlinear Lyapunov equation known as the Bellman equation. It is defined in terms of the DT Hamiltonian

H(x_k, h(x_k), \Delta V_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)    (7)

where \Delta V_k = \gamma V_h(x_{k+1}) - V_h(x_k) is the forward difference operator.

The optimal value can be written using the Bellman equation as

V^*(x_k) = \min_{h(\cdot)} \big( r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) \big)    (8)

According to Bellman's Optimality Principle [17] one has [...]

This procedure is justified in [7], where it is shown that the new policy h'(x_k) is improved in that it has value V_{h'}(x_k) less than or equal to the old value V_h(x_k). This is known as the one-step improvement property of rollout algorithms. That is, the step (12) has given an improved policy.

This suggests the following iterative method for determining the optimal control, which is known as Policy Iteration [16], [23], [7].

1) DT Policy Iteration (PI) Algorithm

Initialize. Select any admissible (i.e. stabilizing) control policy h_0(x_k).

Policy Evaluation Step. Determine the value of the current policy using the Bellman Equation

V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_{j+1}(x_{k+1})    (13)
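The policy evaluation step amounts to solving a Lyapunov-type equation instead of summing the infinite cost series. A scalar sanity check, with all numbers our own and the discount taken as \gamma = 1:

```python
import numpy as np

# Scalar DT LQR check (hypothetical numbers, gamma = 1): the value of the
# fixed policy u_k = -K x_k from the Bellman/Lyapunov equation equals the
# infinite sum  V(x_0) = sum_i x_i (Q + K^2 R) x_i.
A, B, Q, R, K = 0.8, 1.0, 1.0, 1.0, 0.3
Ac = A - B * K                      # closed loop; |Ac| < 1, so K is admissible

# Bellman/Lyapunov route:  Ac^2 P - P + (Q + K^2 R) = 0
P = (Q + K**2 * R) / (1 - Ac**2)

# Direct route: truncate the infinite sum starting from x_0 = 1
x, V = 1.0, 0.0
for _ in range(2000):
    V += (Q + K**2 * R) * x * x
    x *= Ac
# V and P now agree to high precision
```

The same replacement of an infinite sum by an algebraic equation is what the Bellman equation (6) accomplishes in general.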
For DT dynamics of the form x_{k+1} = f(x_k) + g(x_k) u_k, the policy improvement step can be written explicitly as

h_{j+1}(x_k) = -\frac{\gamma}{2} R^{-1} g^T(x_k) \nabla V_{j+1}(x_{k+1})    (15)

where \nabla V(x) = \partial V(x) / \partial x is the gradient of the value function, interpreted here as a column vector.

Note that the initial policy in PI must be admissible, which requires that it be stabilizing. It has been shown by [16], [13] and others that this algorithm converges under certain conditions to the optimal value and control policy, that is, to the solution given by (9), (10).

Policy Iteration is based on the fact that the Bellman equation (6) is a fixed point equation. It can be shown that the Bellman Optimality Equation (9) is also a fixed point equation. As such, it has an associated contraction map which can be iterated using ANY initial policy, as formalized in the following Value Iteration Algorithm.

2) DT Value Iteration (VI) Algorithm

Initialize. Select any control policy h_0(x_k), not necessarily admissible or stabilizing.

Value Update Step. Update the value using

V_{j+1}(x_k) = r(x_k, h_j(x_k)) + \gamma V_j(x_{k+1})    (16)

Policy Improvement Step. Determine an improved policy using

h_{j+1}(x_k) = \arg\min_{h(\cdot)} \big( r(x_k, h(x_k)) + \gamma V_{j+1}(x_{k+1}) \big)    (17)
■

It is important to note that now the old value is used on the right-hand side of (16), in contrast to the PI step (13). It has been shown that VI converges under certain conditions. Note that VI does not require an initial stabilizing policy.

The evaluation of the value of the current policy in PI using the Bellman Equation (13) amounts to determining the value of using the policy h_j(x_k) starting in all current states x_k. This is called a full backup in [23] and can involve significant computation. On the other hand, the value update step (16) in VI involves far less computation and is called a partial backup in [23].

C. Policy Iteration & Value Iteration for the DT LQR

In Section 4 we show how to implement PI and VI online using data measured along the system trajectories. This yields online learning controllers that converge to the optimal control solution. To tie these notions to standard concepts in control systems, in this section we show that, in the LQR case, PI and VI are equivalent to underlying equations that are very familiar to the control systems community.

For the DT LQR, the value of a fixed stabilizing state feedback u_k = -K x_k is

V_K(x_k) = \sum_{i=k}^{\infty} x_i^T (Q + K^T R K) x_i    (18)

which has utility r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k with weighting matrices Q = Q^T \ge 0, R = R^T > 0. The closed-loop system is x_{k+1} = (A - BK) x_k \equiv A_c x_k. It is assumed that (A, B) is stabilizable and (A, \sqrt{Q}) is detectable.

In the LQR case the value is quadratic in the current state, so that V_K(x_k) = x_k^T P x_k for some matrix P. Therefore, the Bellman equation (6) for the LQR is

x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}.    (19)

In terms of the feedback gain this can be written as

x_k^T P x_k = x_k^T \big( Q + K^T R K + (A - BK)^T P (A - BK) \big) x_k.    (20)

Since this must hold for all current states x_k, one has

(A - BK)^T P (A - BK) - P + Q + K^T R K = 0,    (21)

which is a Lyapunov equation when K is fixed. The DT HJB equation or Bellman optimality equation is

A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A = 0,    (22)

i.e. the Riccati equation. The optimal feedback is

u_k = -(R + B^T P B)^{-1} B^T P A x_k = -K x_k.    (23)

For the LQR, Bellman's equation (6) is written as (19) and hence is equivalent to the Lyapunov equation (21). Therefore, the Policy Iteration Algorithm for the DT LQR is

(A - B K_j)^T P_{j+1} (A - B K_j) - P_{j+1} + Q + K_j^T R K_j = 0    (24)

with the policy update (14) as

K_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A.    (25)

This is exactly Hewer's algorithm [12] to solve the Riccati equation (22). Hewer proved that it converges under the stabilizability and detectability assumptions.

In the Value Iteration Algorithm, the policy evaluation step (16) for the LQR is

P_{j+1} = (A - B K_j)^T P_j (A - B K_j) + Q + K_j^T R K_j    (26)

and the policy update (17) is (25). Iteration of these two equations has been studied by Lancaster and Rodman [15], who showed that it converges to the Riccati equation solution under the stated assumptions.

Note that Policy Iteration involves full solution of a Lyapunov equation (24) at each step and requires a stabilizing gain K_j at each step. On the other hand, Value Iteration involves only a Lyapunov recursion (26) at each step, which is very easy to compute, and does not require a stabilizing gain. The recursion (26) can be performed even if K_j is not stabilizing.
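Both DT LQR iterations can be sketched numerically. The following is our own illustration on a hypothetical second-order system: Hewer's iteration per (24)-(25), and the Lyapunov recursion (26) with update (25) started from a non-stabilizing gain; both converge to the solution of the Riccati equation (22).

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

# DT LQR Policy Iteration (Hewer, (24)-(25)) and Value Iteration ((26) with
# (25)) on a hypothetical second-order system.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

# --- Policy Iteration: needs a stabilizing initial gain ---
K = np.array([[1.0, 2.0]])                  # eigenvalues of A - BK are 0.9, 0.9
for _ in range(50):
    Ac = A - B @ K
    # policy evaluation (24): Ac^T P Ac - P + Q + K^T R K = 0
    P_pi = solve_discrete_lyapunov(Ac.T, Q + K.T @ R @ K)
    # policy improvement (25)
    K = np.linalg.solve(R + B.T @ P_pi @ B, B.T @ P_pi @ A)

# --- Value Iteration: a Lyapunov recursion, K need not stabilize ---
P = np.zeros((2, 2))
K = np.zeros((1, 2))                        # not stabilizing, which VI allows
for _ in range(5000):
    Ac = A - B @ K
    P = Ac.T @ P @ Ac + Q + K.T @ R @ K     # recursion (26)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # update (25)

P_star = solve_discrete_are(A, B, Q, R)     # both P_pi and P approach P_star
```

Here `solve_discrete_lyapunov` performs the policy evaluation (24) in closed form using full knowledge of (A, B); in the online setting of Section 4 the same evaluation is instead done from measured data.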
On-line learning vs. off-line planning solution of the LQR. It is important to note the following key point. In going from the formulation (19) of the Bellman equation to the formulation (21), which is the Lyapunov equation, one has performed two steps. First, the system dynamics are substituted for x_{k+1} to yield (20); next, the current state x_k is cancelled to obtain (21). These two steps make it impossible to apply real-time online reinforcement learning methods to find the optimal control, which we shall show how to do in Section 4. Because of these two steps, optimal controls design in the Control Systems Community is almost universally an off-line procedure involving solutions of Riccati equations, where full knowledge of the system dynamics (A, B) is required.

III. INTERVAL REINFORCEMENT LEARNING (IRL) FOR CONTINUOUS-TIME SYSTEMS

Several studies have been made about reinforcement learning and ADP for CT systems, including [1], [2], [3], [5], [9], [10], [21], [18]. Reinforcement Learning is considerably more difficult for continuous-time systems than for discrete-time systems, and its development has lagged. We shall now see why.

Consider the continuous-time nonlinear dynamical system

\dot{x} = f(x) + g(x) u    (27)

with state x(t) \in R^n, control input u(t) \in R^m, and the usual mild assumptions required for existence of unique solutions and an equilibrium point at x = 0, e.g. f(0) = 0 and f(x) + g(x)u Lipschitz on a set \Omega \subseteq R^n that contains the origin. We assume the system is stabilizable on \Omega, that is, there exists a continuous control function u(t) such that the closed-loop system is asymptotically stable on \Omega.

Define the performance measure associated with the feedback control policy u = \mu(x) as

V^{\mu}(x(t)) = \int_t^{\infty} r(x(\tau), u(\tau)) \, d\tau    (28)

with utility r(x, u) = Q(x) + u^T R u, with Q(x) positive definite, i.e. \forall x \ne 0, Q(x) > 0 and x = 0 \Rightarrow Q(x) = 0, and R \in R^{m \times m} a positive definite matrix. A policy is called admissible if it is continuous, stabilizes the system, and has a finite associated cost.

A. Standard Formulation of CT Optimal Control Problem

If the policy is admissible and the cost is smooth, then an infinitesimal equivalent to (28) is the nonlinear Lyapunov equation

0 = r(x, \mu(x)) + (\nabla V^{\mu})^T \big( f(x) + g(x) \mu(x) \big), \quad V^{\mu}(0) = 0    (29)

where \nabla V^{\mu} (a column vector) denotes the gradient of the cost function V^{\mu} with respect to x.

This is the CT Bellman equation. It is defined based on the CT Hamiltonian function

H(x, \mu(x), \nabla V^{\mu}) = r(x, \mu(x)) + (\nabla V^{\mu})^T \big( f(x) + g(x) \mu(x) \big).    (30)

A temporal difference error e(t) and a TD equation for CT systems could now be defined as

e(t) \equiv H(x, \mu(x), \nabla V^{\mu}) = r(x, \mu(x)) + (\nabla V^{\mu})^T \big( f(x) + g(x) \mu(x) \big) = 0    (31)

This is a fixed point equation and allows the formulation of CT reinforcement learning schemes.

We now see the problem with CT systems immediately. Compare the CT Bellman Hamiltonian (30) to the DT Hamiltonian (7). The former contains the full system dynamics f(x) + g(x)u, while the DT Hamiltonian does not. This means that there is no hope of using the CT Bellman equation (29) as a basis for reinforcement learning unless the full dynamics are known.

The next algorithm shows how to implement Policy Iteration for CT systems based on the Hamiltonian function defined in (30) and the fixed point equation H(x, \mu(x), \nabla V^{\mu}) = 0.

1) CT Standard Policy Iteration (PI) Algorithm

Initialize. Select any admissible (i.e. stabilizing) control policy \mu^{(0)}(x).

Policy Evaluation Step. Solve for V^{\mu^{(i)}}(x(t)) using

H(x, \mu^{(i)}(x), \nabla V^{\mu^{(i)}}) = r(x, \mu^{(i)}(x)) + (\nabla V^{\mu^{(i)}})^T \big( f(x) + g(x) \mu^{(i)}(x) \big) = 0    (32)

Policy Improvement Step. Determine an improved policy using

\mu^{(i+1)} = \arg\min_u [ H(x, u, \nabla V^{\mu^{(i)}}) ]    (33)

which explicitly is

\mu^{(i+1)}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \nabla V^{\mu^{(i)}}    (34)
■

Note that the full dynamics f(x) + g(x)u are needed to solve (32).

Another problem is noted when comparing the CT Bellman Hamiltonian (30) to the DT Hamiltonian (7). There are two occurrences of the value function in the latter, which allows one to perform Value Iteration as shown in (16), (17). However, the CT Hamiltonian has only one occurrence of the value function, so it is not at all clear how to perform value iteration for CT systems.
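The role of the full dynamics in (29)-(32) can be seen numerically. In this sketch (our own hypothetical LQR instance, not from the paper), evaluating a fixed linear policy means solving a Lyapunov equation, which requires A and B; the CT Bellman identity (29) then holds exactly along the closed loop.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Check of the CT Bellman equation (29) on a hypothetical LQR instance:
# with V(x) = x^T P x (so grad V = 2 P x), the Hamiltonian
# r(x,u) + (grad V)^T (Ax + Bu) vanishes along the closed loop, but
# forming it requires the full dynamics A, B.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K = np.array([[1.0, 1.0]])                  # stabilizing: eig(A - BK) = -1, -2
Ac = A - B @ K

# policy evaluation = Lyapunov equation: Ac^T P + P Ac = -(Q + K^T R K)
P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.standard_normal((2, 1))
    u = -K @ x
    r = (x.T @ Q @ x + u.T @ R @ u).item()
    H = r + ((2 * P @ x).T @ (A @ x + B @ u)).item()
    assert abs(H) < 1e-8                    # (29) holds along the closed loop
```

Removing the line that forms `A @ x + B @ u` is impossible here, which is precisely the obstruction the text identifies: the TD error (31) cannot be computed from measured data alone.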
B. Interval Reinforcement Form of CT Optimal Control

The method for successfully confronting these problems was given by Vrabie [25], who defined a different temporal difference error for CT systems. It was shown there that some notions of learning mechanisms in the human brain based on multiple timescales motivate writing the value (28) of a policy u = \mu(x) in the interval reinforcement form

V^{\mu}(x(t)) = \int_t^{t+T} r(x(\tau), u(\tau)) \, d\tau + V^{\mu}(x(t+T))    (35)

This has the same form as the DT Bellman equation (6). It is a fixed point equation. Therefore, one can define (35) as the Bellman equation for CT systems and the associated CT temporal difference error e(t) as

e(t : t+T) = -V^{\mu}(x(t)) + \int_t^{t+T} r(x(\tau), u(\tau)) \, d\tau + V^{\mu}(x(t+T)) = 0    (36)

This does not involve the system dynamics.

According to Bellman's principle, the optimal value is given in terms of this construction as [17]

V^*(x(t)) = \min_{u(t:t+T)} \Big( \int_t^{t+T} r(x(\tau), u(\tau)) \, d\tau + V^*(x(t+T)) \Big)

where u(t : t+T) = \{ u(\tau) : t \le \tau < t+T \}. The optimal control is

\mu^*(x(t)) = \arg\min_{u(t:t+T)} \Big( \int_t^{t+T} r(x(\tau), u(\tau)) \, d\tau + V^*(x(t+T)) \Big).

It is shown in [25] that the nonlinear Lyapunov equation (29) is exactly equivalent to the interval reinforcement form (35). That is, the positive definite solution of both is the value (28) of the policy u = \mu(x).

Now it is direct to properly formulate policy iteration and value iteration for CT systems. We call these the interval reinforcement learning (IRL) formulations.

1) CT IRL Policy Iteration (PI) Algorithm

Initialize. Select any admissible (i.e. stabilizing) control policy \mu^{(0)}(x).

Policy Evaluation Step. Solve for V^{\mu^{(i)}}(x(t)) using

V^{\mu^{(i)}}(x(t)) = \int_t^{t+T} r(x(\tau), \mu^{(i)}(x(\tau))) \, d\tau + V^{\mu^{(i)}}(x(t+T)), \quad V^{\mu^{(i)}}(0) = 0    (37)

Policy Improvement Step. Determine an improved policy using

\mu^{(i+1)} = \arg\min_u [ H(x, u, \nabla V^{\mu^{(i)}}) ]    (38)

which explicitly is

\mu^{(i+1)}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \nabla V^{\mu^{(i)}}    (39)
■

2) CT IRL Value Iteration (VI) Algorithm

Initialize. Select any control policy \mu^{(0)}(x), not necessarily stabilizing.

Value Update Step. Solve for V^{\mu^{(i)}}(x(t)) using

V^{\mu^{(i)}}(x(t)) = \int_t^{t+T} r(x(\tau), \mu^{(i)}(x(\tau))) \, d\tau + V^{\mu^{(i-1)}}(x(t+T))    (40)

Policy Improvement Step. Determine an improved policy using

\mu^{(i+1)} = \arg\min_u [ H(x, u, \nabla V^{\mu^{(i)}}) ]    (41)

which explicitly is

\mu^{(i+1)}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \nabla V^{\mu^{(i)}}    (42)
■

Note that neither algorithm requires knowledge about the internal system dynamics function f(.). That is, they work for partially unknown systems.

C. Interval Reinforcement Learning (IRL) for CT LQR Case

In Section 4 we show how to implement IRL PI and IRL VI online using data measured along the system trajectories. This yields online learning controllers that converge to the optimal control solution. To tie these notions to standard concepts in control systems, in this section we derive the underlying equations for the CT LQR.

In the LQR case [17], one has the dynamics \dot{x} = A x + B u and the cost

V^{u}(x(t)) = \int_t^{\infty} \big( x^T(\tau) Q x(\tau) + u^T(\tau) R u(\tau) \big) \, d\tau = x^T(t) P x(t)

for some P. Then the Bellman Equation (29) becomes

x^T Q x + u^T R u + 2 x^T P (A x + B u) = 0    (43)

For a fixed admissible policy u = -K x this becomes

x^T \big( Q + K^T R K + P A_c + A_c^T P \big) x = 0    (44)

with A_c = A - B K. Minimizing (43) with respect to u one obtains u = -R^{-1} B^T P x, whence substitution into (43) and simplification yields the Bellman optimality equation or HJB equation

x^T \big( A^T P + P A + Q - P B R^{-1} B^T P \big) x = 0

Standard usage now states that these equations hold for all initial conditions, so that one has the Lyapunov equation

P A_c + A_c^T P = -(Q + K^T R K)    (45)

and the Riccati equation

A^T P + P A + Q - P B R^{-1} B^T P = 0    (46)

For the CT LQR, the Standard Policy Iteration Algorithm (32), (33) becomes
P^i A_c^i + (A_c^i)^T P^i = -\big( Q + (K^i)^T R K^i \big), \quad A_c^i = A - B K^i    (47)

K^{i+1} = R^{-1} B^T P^i    (48)

This is exactly Kleinman's algorithm [14]. It requires a stabilizing initial gain and knowledge of the full system dynamics A, B. It is an offline planning technique that is based on solving the Riccati equation.

To obtain online optimal learning controllers using RL techniques, Vrabie used the interval form of the CT LQR Bellman error

x^T(t) P x(t) = \int_t^{t+T} \big( x^T(\tau) Q x(\tau) + u^T(\tau) R u(\tau) \big) \, d\tau + x^T(t+T) P x(t+T)

Therefore, the CT Policy Iteration and Value Iteration Algorithms may alternatively be formulated as follows.

1) IRL Policy Iteration (PI) Algorithm for CT LQR

Initialize. Select any admissible (i.e. stabilizing) control policy \mu^{(0)}(x) = -K^0 x.

Policy Evaluation Step. Solve for P^i using u^i = -K^i x:

x^T(t) P^i x(t) = \int_t^{t+T} \big( x^T(\tau) Q x(\tau) + (u^i)^T(\tau) R u^i(\tau) \big) \, d\tau + x^T(t+T) P^i x(t+T)    (49)

Policy Improvement Step. Determine an improved policy using

K^{i+1} = R^{-1} B^T P^i    (50)
■

2) IRL Value Iteration (VI) Algorithm for CT LQR

Initialize. Select any control policy \mu^{(0)}(x) = -K^0 x, not necessarily stabilizing.

Policy Evaluation Step. Solve for P^i using

x^T(t) P^i x(t) = \int_t^{t+T} \big( x^T(\tau) Q x(\tau) + (u^i)^T(\tau) R u^i(\tau) \big) \, d\tau + x^T(t+T) P^{i-1} x(t+T)    (51)

Policy Improvement Step. Determine an improved policy using

K^{i+1} = R^{-1} B^T P^i    (52)
■

We shall see in the next section how to implement these algorithms online without knowing the system dynamics A by using reinforcement learning techniques. That is, IRL techniques allow the solution of the CT Riccati equation online, without knowing the system internal dynamics, by using data measured along the system trajectories.

It is shown in [24] that the IRL Policy Iteration Algorithm for CT LQR is exactly equivalent to Kleinman's algorithm. The IRL Value Iteration Algorithm, in turn, is equivalent to the recursion

P^i = \int_0^T e^{A_i^T \tau} \big( Q + (K^i)^T R K^i \big) e^{A_i \tau} \, d\tau + e^{A_i^T T} P^{i-1} e^{A_i T}

with

A_i = A - B R^{-1} B^T P^{i-1}

A stabilizing initial gain is not needed. This is a method of solving the CT Riccati equation by using iteration on a discrete-time Lyapunov recursion!

IV. ONLINE REINFORCEMENT LEARNING, ADP, AND ADAPTIVE CONTROL

In this section we shall see how to formulate RL algorithms as on-line real-time learning methods for solving the optimal control problem using data measured along system trajectories [23]. These methods are broadly called approximate dynamic programming (ADP) [27], [28], [29] or neurodynamic programming (NDP) [7]. By contrast, standard optimal control solutions using dynamic programming are backwards-in-time procedures. Therefore, they can be used for off-line planning but not online learning. Moreover, knowledge of the full system dynamical description is needed. An example is the Riccati equation solution approach for optimal controls design.

There are two key ingredients to formulating Policy Iteration and Value Iteration as online learning algorithms: the temporal difference (TD) error and value function approximation (VFA). We have discussed TD above in formulating the VI and PI algorithms.

A. ADP - On-Line Reinforcement Learning Optimal Control

For online learning, let us require that the temporal difference errors be zero at each time step. For DT systems one has the TD error (11). For CT systems, one must use the IRL TD error (36), as the formulation (31) does not work for online learning.

1) Value Function Approximation (VFA)

To provide a practical means for solving the TD equation, one may approximate the value function V_h(.) using a parametric approximator. This has been called Approximate Dynamic Programming (ADP) by Werbos [26], [27], [28], [29] and neurodynamic programming (NDP) by Bertsekas [7], both of whom used neural networks as the approximators.

Assume that the value (DT or CT) V^{\mu}(x) is sufficiently smooth. Then, according to the Weierstrass higher-order approximation theorem, there exists a dense basis set \{ \varphi_i(x) \} such that

V^{\mu}(x) = \sum_{i=1}^{\infty} w_i \varphi_i(x)
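The two ingredients of this section, the IRL TD error (36)/(49) and a parametric value approximator, can be sketched together for the CT LQR of Section III-C. The following is our own minimal illustration, not the authors' code: the quadratic basis \varphi(x) = (x_1^2, 2 x_1 x_2, x_2^2) plays the role of \{\varphi_i\}, the plant matrix A is used only to simulate the measured trajectories (here by crude Euler integration), and the learner uses only trajectory data and B, per (49)-(50).

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Sketch of IRL Policy Iteration (49)-(50) with a quadratic value basis for a
# hypothetical CT LQR. A is used only to SIMULATE the plant; the learner
# itself sees only state measurements and B.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])    # unknown to the learner
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
T, dt = 0.5, 1e-3

def rollout(x, K):
    """Closed-loop trajectory over [0, T]: return x(T) and the integral cost."""
    cost = 0.0
    for _ in range(int(T / dt)):
        u = -K @ x
        cost += (x.T @ Q @ x + u.T @ R @ u).item() * dt
        x = x + (A @ x + B @ u) * dt        # the "plant"; learner never sees A
    return x, cost

def phi(x):                                  # basis so that w . phi(x) = x^T P x
    x1, x2 = x.ravel()
    return np.array([x1 * x1, 2 * x1 * x2, x2 * x2])

K = np.array([[1.0, 1.0]])                  # admissible initial gain
rng = np.random.default_rng(1)
for _ in range(8):                          # IRL PI iterations
    Phi, d = [], []
    for _ in range(12):                     # several measured data intervals
        x0 = rng.standard_normal((2, 1))
        xT, cost = rollout(x0, K)
        Phi.append(phi(x0) - phi(xT))       # TD form of (49)
        d.append(cost)
    w, *_ = np.linalg.lstsq(np.array(Phi), np.array(d), rcond=None)
    P = np.array([[w[0], w[1]], [w[1], w[2]]])
    K = np.linalg.solve(R, B.T @ P)         # policy improvement (50)

P_star = solve_continuous_are(A, B, Q, R)   # P approaches the CARE solution
```

The least-squares solve over the basis weights is exactly the VFA step: the critic fits w from observed interval costs, and the actor update (50) needs only B. Up to the small integration error of the simulated trajectories, P converges to the CT Riccati solution without A ever being used by the learner.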