
Online Policy Iteration Based Algorithms to Solve the Continuous-Time Infinite Horizon Optimal Control Problem

Kyriakos Vamvoudakis, Draguna Vrabie and Frank Lewis, Fellow, IEEE

This work was supported by the National Science Foundation ECS-0501451, ECCS-0801330 and the Army Research Office W91NF-05-1-0314. K. Vamvoudakis, D. Vrabie and F. Lewis are with the Automation and Robotics Research Institute, University of Texas at Arlington, 7300 Jack Newell Blvd. S., Fort Worth, TX 76118 USA (phone/fax: +817-272-5938; e-mail: {kyriakos, dvrabie, lewis}@arri.uta.edu).

Abstract—In this paper we discuss two online algorithms based on policy iteration for learning the continuous-time (CT) optimal control solution when nonlinear systems with infinite horizon quadratic cost are considered. For the first time we present an online adaptive algorithm implemented on an actor/critic structure which involves synchronous continuous-time adaptation of both actor and critic neural networks. This is a version of Generalized Policy Iteration for CT systems. Convergence to the optimal controller based on the novel algorithm is proven while stability of the system is guaranteed. The characteristics and requirements of the new online learning algorithm are discussed in relation to the regular online policy iteration algorithm for CT systems which we have previously developed. The latter solves the optimal control problem by performing sequential updates on the actor and critic networks, i.e. while one is learning the other one is held constant. In contrast, the new algorithm relies on simultaneous adaptation of both actor and critic networks. To support the new theoretical result a simulation example is then considered.

I. INTRODUCTION

COST-EFFECTIVE solutions are regularly required in the control loops which are placed at the high levels of the hierarchy of a complex integrated control application, the so called "money-making" loops. Such a controller is given by the solution of an optimal control problem. Optimal control policies satisfy the specified system performance requirements while minimizing a structured cost index which describes the balance between desired performance and available control resources.

From a mathematical point of view the solution of the optimal control problem is connected with the solution of the underlying Hamilton-Jacobi-Bellman (HJB) equation. Until recently, due to the nature of this nonlinear differential equation, for continuous-time systems, which form the object of interest in this paper, only particular solutions were available. For this reason considerable effort has been made towards developing algorithms which approximately solve this equation (e.g. [1], [3], [10]). Far more results are available for the solution of the discrete-time HJB equation. A good but no longer exhaustive (due to the fast dynamics of the research field) overview is given in [4].

Some of the methods involve a computational intelligence technique known as Policy Iteration (PI) [7]. PI names a class of algorithms built as a two-step iteration: policy evaluation and policy improvement. Instead of attempting a direct approach to solving the HJB equation, the PI algorithm starts by evaluating the cost of a given initial admissible control policy and then uses this information to obtain a new, improved control policy (i.e. one which will have a lower associated cost). The two steps are repeated until the policy improvement step no longer changes the actual policy, thus convergence to the optimal controller is achieved. One must note that the cost can be evaluated only in the case of admissible control policies, admissibility being a condition for the control policy which initializes the algorithm.

In the linear CT system case, when quadratic indices are considered for the optimal stabilization problem, the HJB equation becomes the well known Riccati equation and the policy iteration method is in fact Newton's method [9], which requires iterative solutions of Lyapunov equations. In the nonlinear systems case, successful application of the PI method was limited until [3], where Galerkin spectral approximation methods were used to solve the nonlinear Lyapunov equations describing the policy evaluation step in the PI algorithm. Such methods are known to be computationally intensive and cannot handle control constraints.

The key to solving the CT nonlinear Lyapunov equations in practice was the use of neural networks [1], which can be trained to become approximate solutions of these equations. In fact the PI algorithm for CT systems can be built on an actor/critic structure which involves two neural networks: one, the critic, is trained to become an approximation of the Lyapunov equation solution at the policy evaluation step, while the second one is trained to approximate an improving policy at the policy improvement step. Actor/critic structures have been introduced and further developed by Werbos [15], [16], [17] with the purpose of solving the optimal control problem, together with the concept of Approximate Dynamic Programming (ADP) algorithms. Adaptive critics have been described in [11] for discrete-time systems and in [6] for continuous-time systems.

This paper is concerned with developing approximate online solutions, based on PI, for the infinite horizon optimal control problem in the case of continuous-time nonlinear systems. In our previous work [13], [14] we have developed



an online PI algorithm for continuous-time systems which converges to the optimal control solution without making explicit use of any knowledge on the internal dynamics of the system. The algorithm was based on sequential updates of the critic (policy evaluation) and actor (policy improvement) neural networks (i.e. while one is tuned the other one remains constant). We will now present for the first time an online adaptive algorithm which involves concurrent, or synchronous, tuning of both actor and critic neural networks (i.e. both neural networks are tuned at the same time). This approach is a version of Generalized Policy Iteration (GPI), as introduced in [12]. An "almost concurrent" version of PI has been described in [6], without providing explicit training laws for either the actor or the critic network, and without a convergence analysis.

The paper is organized as follows. Section II provides the formulation of the optimal control problem followed by a general description of the Policy Iteration algorithm. Section III discusses two online adaptive algorithms based on PI. It briefly outlines the online partially model free algorithm which has been presented in [13] and then introduces a new adaptive GPI algorithm suitable for online implementation. A comparative discussion is then given relating the structure and implementation issues of the new online algorithm to our previous online adaptive PI algorithm. Section IV presents a simulation result for the new GPI algorithm. Section V concludes the paper, pointing towards the next research directions.

II. THE OPTIMAL CONTROL PROBLEM AND THE POLICY ITERATION ALGORITHM

A. Optimal control and the continuous-time HJB equation

Consider the time-invariant dynamical system, affine in the input, given by

  $\dot{x}(t) = f(x(t)) + g(x(t))u(x(t)); \quad x(0) = x_0$   (1)

with $x(t)\in\mathbb{R}^n$, $f(x(t))\in\mathbb{R}^n$, $g(x(t))\in\mathbb{R}^{n\times m}$ and the input $u(t)\in U\subset\mathbb{R}^m$. We assume that $f(0)=0$, that $f(x)+g(x)u$ is Lipschitz continuous on a set $\Omega\subseteq\mathbb{R}^n$ that contains the origin, and that the dynamical system is stabilizable on $\Omega$, i.e. there exists a continuous control function $u(t)\in U$ such that the system is asymptotically stable on $\Omega$.

Define the infinite horizon integral cost

  $V(x_0) = \int_0^\infty r(x(t),u(t))\,dt$   (2)

where $r(x,u) = Q(x) + u^T R u$ with $Q(x)$ positive definite, i.e. $\forall x\neq 0,\ Q(x)>0$ and $x=0 \Rightarrow Q(x)=0$, and $R\in\mathbb{R}^{m\times m}$ a positive definite matrix.

Definition 1 [1] (Admissible policy) A control policy $\mu(x)$ is defined as admissible with respect to (2) on $\Omega$, denoted by $\mu\in\Psi(\Omega)$, if $\mu(x)$ is continuous on $\Omega$, $\mu(0)=0$, $\mu(x)$ stabilizes (1) on $\Omega$ and $V(x_0)$ is finite $\forall x_0\in\Omega$.

For any admissible control policy $\mu\in\Psi(\Omega)$, if the associated cost function

  $V^{\mu}(x_0) = \int_0^\infty r(x(t),\mu(x(t)))\,dt$   (3)

is $C^1$, then an infinitesimal version of (3) is

  $0 = r(x,\mu(x)) + V_x^{\mu T}(f(x)+g(x)\mu(x)), \quad V^{\mu}(0)=0$   (4)

where $V_x^{\mu}$ denotes the partial derivative of the value function $V^{\mu}$ with respect to $x$ (the value function does not depend explicitly on time). Equation (4) is a Lyapunov equation for nonlinear systems which, given a controller $\mu(x)\in\Psi(\Omega)$, can be solved for the value function $V^{\mu}(x)$ associated with it. Given that $\mu(x)$ is an admissible control policy, if $V^{\mu}(x)$ satisfies (4), with $r(x,\mu(x))\ge 0$, then $V^{\mu}(x)$ is a Lyapunov function for the system (1) with control policy $\mu(x)$.

The optimal control problem can now be formulated: Given the continuous-time system (1), the set $u\in\Psi(\Omega)$ of admissible control policies and the infinite horizon cost functional (2), find an admissible control policy such that the cost index (2) associated with the system (1) is minimized.

Defining the Hamiltonian of the problem

  $H(x,u,V_x^*) = r(x(t),u(t)) + V_x^{*T}(f(x(t))+g(x(t))u(t)),$   (5)

the optimal cost function $V^*(x)$ satisfies the HJB equation

  $0 = \min_{u\in\Psi(\Omega)}[H(x,u,V_x^*)].$   (6)

Assuming that the minimum on the right hand side of equation (6) exists and is unique, the optimal control function for the given problem is

  $u^*(x) = -\tfrac{1}{2}R^{-1}g^T(x)V_x^*(x).$   (7)

Inserting this optimal control policy in the Hamiltonian we obtain the formulation of the HJB equation in terms of $V_x^*$

  $0 = Q(x) + V_x^{*T}(x)f(x) - \tfrac{1}{4}V_x^{*T}(x)g(x)R^{-1}g^T(x)V_x^*(x), \quad V^*(0)=0.$   (8)

This is a necessary and sufficient condition for the optimal value function [8]. For the linear system case, considering a quadratic cost functional, the equivalent of this HJB equation is the well known Riccati equation.

In order to find the optimal control solution for the problem one only needs to solve the HJB equation (8) for the value function and then substitute the solution in (7) to obtain the optimal control. However, due to the nonlinear nature of the HJB equation, finding the solution of interest is generally difficult.
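For the linear-quadratic special case the reduction of (8) to the Riccati equation can be made explicit; the following short derivation is an illustrative sketch added here for clarity and is not part of the original development. Take $f(x)=Ax$, $g(x)=B$, $Q(x)=x^T Q x$ and assume $V^*(x)=x^T P x$ with $P=P^T>0$, so that $V_x^*(x)=2Px$. Substituting in (8) gives

  $0 = x^T Q x + 2x^T P A x - x^T P B R^{-1} B^T P x \quad \forall x,$

and, after symmetrizing the middle term, the algebraic Riccati equation

  $A^T P + P A - P B R^{-1} B^T P + Q = 0,$

while (7) becomes the familiar LQR feedback $u^*(x) = -R^{-1}B^T P x$.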
B. Policy iterations and adaptive critics

Policy Iteration describes a class of algorithms which perform a directed search in the space of admissible control policies, i.e. at each step of the iterative procedure the control policy is adapted in the sense of performance improvement, to determine the optimal control policy associated with optimal control performance. For the optimal control problem formulated in the previous section, the PI algorithm consists of solving iteratively the following two equations:

1. given $\mu^{(i)}(x)$, solve for $V^{\mu^{(i)}}(x(t))$ using

  $0 = r(x,\mu^{(i)}(x)) + (V_x^{\mu^{(i)}}(x))^T(f(x)+g(x)\mu^{(i)}(x)), \quad V^{\mu^{(i)}}(0)=0$   (9)

2. update the control policy using

  $\mu^{(i+1)} = \arg\min_{u\in\Psi(\Omega)}[H(x,u,V_x^{\mu^{(i)}})],$   (10)

which explicitly is

  $\mu^{(i+1)}(x) = -\tfrac{1}{2}R^{-1}g^T(x)V_x^{\mu^{(i)}}(x).$   (11)

To ensure convergence of the PI algorithm an initial admissible policy $\mu^{(0)}(x(t))\in\Psi(\Omega)$ is required. The algorithm then converges to the optimal control policy $\mu^*\in\Psi(\Omega)$ with corresponding cost

  $V^*(x_0) = \min_{\mu\in\Psi(\Omega)}\Big(\int_0^\infty r(x(t),\mu(x(t)))\,dt\Big).$

Proofs of convergence of the PI algorithm were given for the linear system case in [9], and for nonlinear systems in [3] and [1], where the cases of unconstrained and, respectively, constrained control policies were considered.
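In the linear-quadratic case the two steps (9) and (11) reduce to Kleinman's iteration [9]: policy evaluation becomes a Lyapunov equation and policy improvement becomes a gain update. The Python sketch below illustrates this special case under the stated LQR assumptions; it is not the implementation used in this paper, and the matrices A, B, Q, R and the stabilizing initial gain K0 are placeholders supplied by the user.

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    def policy_iteration_lqr(A, B, Q, R, K0, n_iter=20):
        """Kleinman's policy iteration for the CT LQR problem.

        Policy evaluation (9): solve the Lyapunov equation
            (A - B K)^T P + P (A - B K) + Q + K^T R K = 0.
        Policy improvement (11): K <- R^{-1} B^T P.
        K0 must be stabilizing (the admissible initial policy)."""
        K = K0
        for _ in range(n_iter):
            Ac = A - B @ K
            # solve_continuous_lyapunov(a, q) solves a X + X a^T = q
            P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
            K = np.linalg.solve(R, B.T @ P)   # policy improvement step
        return P, K

Each pass performs one policy evaluation (a Lyapunov solve) followed by one policy improvement, mirroring (9)-(11); in the nonlinear case the Lyapunov equation has no closed-form solution, which is what motivates the neural-network approximation discussed next.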
Policy iteration was first used for nonlinear systems in [3], together with Galerkin approximation methods for finding the solution of (9). In [1] the suitability of NN approximators for practical implementation on actor/critic structures was proven.

Fig. 1. Actor/Critic Structure (block diagram: the critic evaluates the cost function $V(x)$; the actor/controller $\mu(x)$ generates the input $u$ for the system $\dot{x}=f(x)+g(x)u$, $x(0)=x_0$).

In the actor/critic structure the Critic and the Actor functions are approximated by neural networks, and the PI algorithm consists of tuning each of the two neural networks in alternation so that (9) and (10) are solved. The critic neural network is tuned to evaluate the performance of the current control policy (which during this time is held constant) such that it approximately solves equation (9). Thus, assume there exist weights $W_1$ such that $V(x)$ is approximated by a neural network as

  $V(x) = W_1^T\phi_1(x) + \varepsilon(x)$   (12)

where $\phi_1(x):\mathbb{R}^n\to\mathbb{R}^N$ is the activation functions vector, with $N$ the number of neurons in the hidden layer, and $\varepsilon(x)$ is the NN approximation error. Select the activation functions to provide a complete basis set such that $V(x)$ and its derivative

  $\dfrac{\partial V}{\partial x} = \nabla\phi_1^T W_1 + \dfrac{\partial\varepsilon}{\partial x}$   (13)

are uniformly approximated. According to the Weierstrass higher-order approximation theorem, such a basis exists if $V(x)$ is sufficiently smooth. This means that, as $N\to\infty$, $\varepsilon\to 0$ uniformly.

Using the NN function approximation and considering a control policy $u(t)$, (9) becomes

  $H(x,W_1,u) = W_1^T\nabla\phi_1(f+gu) + Q(x) + u^T R u = \varepsilon_H$   (14)

where the residual error due to the function approximation error is

  $\varepsilon_H = (\nabla\varepsilon)^T(f+gu).$   (15)

In [1] it has been shown that, under the usual assumptions, as $N\to\infty$, $\varepsilon_H\to 0$.

Notice that, because the control policy improvement is given in closed form by (11), a training phase for the actor neural network is not required; the improvement merely consists of a parameter update.
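As a concrete illustration of the choice of basis in (12) (an example added here for clarity; the paper fixes $\phi_1$ only implicitly, in the simulation of Section IV), for a second-order system ($n=2$) a quadratic value function is represented exactly by the $N=3$ polynomial basis

  $\phi_1(x) = \begin{bmatrix} x_1^2 & x_1 x_2 & x_2^2 \end{bmatrix}^T,$

so that $V(x)=W_1^T\phi_1(x)$ with $W_1\in\mathbb{R}^3$; this parametrization is consistent with the three critic weights reported in the simulation of Section IV.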
III. CONTINUOUS-TIME ONLINE ADAPTIVE CRITIC SOLUTIONS FOR THE INFINITE HORIZON OPTIMAL CONTROL PROBLEM

Standard PI algorithms are offline methods which require complete knowledge of the system dynamics to obtain the solution (i.e. the functions $f(x),g(x)$ in (1) need to be known). In order to change the offline character of PI, and thus make it consistent with learning mechanisms in the mammalian brain, in [13] we have proposed a version of the PI algorithm suitable for online implementation, which can be viewed as reinforcement learning for CT systems [2], [5].

A. Online partially model free policy iteration for CT nonlinear systems

This PI algorithm is described as an iteration between the following two equations, which can be solved online based on data measured from the closed loop system. Thus, let $\mu^{(0)}(x(t))\in\Psi(\Omega)$ be an admissible policy, and $T>0$; then the iteration between

  $V^{\mu^{(i)}}(x(t)) = \int_t^{t+T} r(x(s),\mu^{(i)}(x(s)))\,ds + V^{\mu^{(i)}}(x(t+T)), \quad V^{\mu^{(i)}}(0)=0$   (16)

and
  $\mu^{(i+1)} = \arg\min_{u\in\Psi(\Omega)}[H(x,u,V_x^{\mu^{(i)}})]$   (17)

converges to the optimal control policy $\mu^*\in\Psi(\Omega)$. Convergence to the optimal controller is obtained, as was proven in [13], by showing that solving (16) is equivalent to solving (9). As in this algorithm the tuning of the critic and actor neural networks is based on data measured from the system, one can observe that knowledge of the internal system dynamics, $f(x)$, is no longer required for obtaining the optimal control solution. Thus a partially model free PI algorithm suitable for online implementation has been obtained.
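The policy evaluation step (16) can be carried out from measured data with a batch least-squares fit of the critic weights, since each measured interval gives one equation that is linear in the weights of a critic of the form $V^{\mu}(x)\approx W^T\phi_1(x)$. The sketch below is an illustration of this idea only, not the exact implementation of [13]; the basis phi, the sampled state pairs and the measured reward integrals are placeholders.

    import numpy as np

    def evaluate_policy_online(phi, x_samples, reward_integrals):
        """Least-squares policy evaluation based on (16).

        For each measured interval [t_k, t_k + T]:
            W^T [phi(x(t_k)) - phi(x(t_k + T))] = integral of r over the interval,
        which is linear in the critic weights W.

        phi              : callable, x -> basis vector of length N
        x_samples        : list of pairs (x(t_k), x(t_k + T))
        reward_integrals : measured values of int_{t_k}^{t_k+T} r(x, mu(x)) dt"""
        Phi = np.array([phi(x0) - phi(xT) for x0, xT in x_samples])   # (K, N)
        d = np.asarray(reward_integrals)                              # (K,)
        W, *_ = np.linalg.lstsq(Phi, d, rcond=None)
        return W        # critic weights for the current policy

Once the critic weights are fitted, the policy improvement (17) is applied in the closed form (11); only $g(x)$ enters (11), which is why the scheme is partially model free.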
B. Online Generalized PI algorithm with synchronous tuning of actor and critic neural networks

In the following we present for the first time an adaptive solution based on PI for the CT nonlinear optimal control problem which uses simultaneous continuous-time tuning of both actor and critic neural networks. First we present a result which proves that online CT tuning of the Critic NN can be successfully used to approximately solve equation (9) in the PI algorithm for a given fixed admissible control policy $u(t)$.

Let the weights of the critic NN, $W_1$, which solve (14) be unknown. Then the output of the critic neural network is

  $\hat{V}(x) = \hat{W}_1^T\phi_1(x)$   (18)

where $\hat{W}_1$ are the current known values of the critic NN weights. Recall that $\phi_1(x):\mathbb{R}^n\to\mathbb{R}^N$ is the activation functions vector, with $N$ the number of neurons in the hidden layer. The approximate Hamiltonian is then

  $H(x,\hat{W}_1,u) = \hat{W}_1^T\nabla\phi_1(f+gu) + Q(x) + u^T R u = e_1.$   (19)

It is desired to select $\hat{W}_1$ to minimize the squared error

  $E_1 = \tfrac{1}{2}e_1^T e_1.$

We select the tuning law for the critic weights as a normalized gradient descent

  $\dot{\hat{W}}_1 = -a_1\dfrac{\partial E_1}{\partial\hat{W}_1} = -a_1\dfrac{\sigma_1}{(\sigma_1^T\sigma_1+1)^2}\,[\sigma_1^T\hat{W}_1 + Q(x) + u^T R u]$   (20)

where $\sigma_1 = \nabla\phi_1(f+gu)$. This is a modified Levenberg-Marquardt algorithm, where $(\sigma_1^T\sigma_1+1)^2$ was used instead of $(\sigma_1^T\sigma_1+1)$.
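A discrete-time (Euler) realization of the normalized gradient law (20) can be written as follows. This is an illustrative sketch only; the basis Jacobian grad_phi, the learning rate a1 and the integration step dt are placeholders, not values prescribed by the paper.

    import numpy as np

    def critic_update(W1_hat, x, u, xdot, grad_phi, Q, R, a1, dt):
        """One Euler step of the normalized gradient-descent law (20).

        grad_phi(x) returns the N x n Jacobian of the activation vector phi_1,
        so sigma_1 = grad_phi(x) @ xdot, with xdot = f(x) + g(x) u."""
        sigma1 = grad_phi(x) @ xdot                     # sigma_1 = grad(phi_1)(f + g u)
        e1 = sigma1 @ W1_hat + Q(x) + u @ R @ u         # approximate Hamiltonian (19)
        norm = (sigma1 @ sigma1 + 1.0) ** 2             # Levenberg-Marquardt style normalization
        W1_dot = -a1 * sigma1 * e1 / norm               # right-hand side of (20)
        return W1_hat + dt * W1_dot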
Define the critic weight estimation error $\tilde{W}_1 = W_1 - \hat{W}_1$ and note that, from (14),

  $Q(x) + u^T R u = -W_1^T\nabla\phi_1(f+gu) + \varepsilon_H.$   (21)

Substitute (21) in (20) and, with the notations $\bar{\sigma}_1 = \sigma_1/(\sigma_1^T\sigma_1+1)$ and $m_s = 1+\sigma_1^T\sigma_1$, we obtain the dynamics of the critic weight estimation error as

  $\dot{\tilde{W}}_1 = -a_1\bar{\sigma}_1\bar{\sigma}_1^T\tilde{W}_1 + a_1\bar{\sigma}_1\dfrac{\varepsilon_H}{m_s}.$   (22)

To guarantee convergence of $\hat{W}_1$ to $W_1$, the next Persistence of Excitation (PE) assumption is required.

PE Assumption. Let the signal $\bar{\sigma}_1$ be persistently exciting over the interval $[t,t+T]$, i.e. there exist $\beta_1>0$, $\beta_2>0$, $T>0$ such that, for all $t$,

  $\beta_1 I \le S_0 \equiv \int_t^{t+T}\bar{\sigma}_1(\tau)\bar{\sigma}_1^T(\tau)\,d\tau \le \beta_2 I.$   (23)

The next result shows that the tuning algorithm is effective, in the sense that the weights $\hat{W}_1$ converge to the desired unknown weights $W_1$ which solve the Hamiltonian equation (14).

Theorem 1. Let $u(t)$ be any admissible bounded control input. Let tuning for the critic NN be provided by (20) and assume that $\bar{\sigma}_1$ is persistently exciting. Let the residual error in (14) be bounded, $\|\varepsilon_H\|<\varepsilon_{\max}$. Then the critic parameter error is bounded by

  $\|\tilde{W}_1(t)\| \le \beta_2 T\Big(\dfrac{1}{\beta_1} + 2a_1\Big)\varepsilon_{\max}.$   (24)

Proof. Consider the following Lyapunov function candidate

  $L(t) = \tfrac{1}{2}\mathrm{tr}\{\tilde{W}_1^T a_1^{-1}\tilde{W}_1\}.$   (25)

The derivative of $L$ is given by

  $\dot{L} = -\mathrm{tr}\Big\{\tilde{W}_1^T\dfrac{\sigma_1}{m_s^2}[\sigma_1^T\tilde{W}_1 - \varepsilon_H]\Big\}
  = -\mathrm{tr}\Big\{\tilde{W}_1^T\dfrac{\sigma_1\sigma_1^T}{m_s^2}\tilde{W}_1\Big\} + \mathrm{tr}\Big\{\tilde{W}_1^T\dfrac{\sigma_1}{m_s}\dfrac{\varepsilon_H}{m_s}\Big\}
  \le -\Big\|\dfrac{\sigma_1^T\tilde{W}_1}{m_s}\Big\|^2 + \Big\|\dfrac{\sigma_1^T\tilde{W}_1}{m_s}\Big\|\dfrac{\|\varepsilon_H\|}{m_s}
  = -\Big\|\dfrac{\sigma_1^T\tilde{W}_1}{m_s}\Big\|\Big[\Big\|\dfrac{\sigma_1^T\tilde{W}_1}{m_s}\Big\| - \dfrac{\|\varepsilon_H\|}{m_s}\Big].$   (26)

Then $\dot{L}\le 0$ if

  $\Big\|\dfrac{\sigma_1^T\tilde{W}_1}{m_s}\Big\| > \varepsilon_{\max} > \dfrac{\|\varepsilon_H\|}{m_s},$   (27)

since $m_s\ge 1$. This provides an effective practical bound for $\bar{\sigma}_1^T\tilde{W}_1$, since $L(t)$ decreases if (27) holds.

Define the dynamical system

  $\dot{\tilde{W}}_1 = -a_1\bar{\sigma}_1\bar{\sigma}_1^T\tilde{W}_1 + a_1\bar{\sigma}_1\dfrac{\varepsilon_H}{m_s}, \qquad y = \bar{\sigma}_1^T\tilde{W}_1$   (28)

with the output bounded by $\|y\|>\varepsilon_{\max}$, as just shown. The
PE condition on $\bar{\sigma}_1(t)$ is equivalent to the uniform complete observability of this system, i.e. bounded-input bounded-output implies that the state $\tilde{W}_1(t)$ is bounded. Specifically, it can be shown that

  $\|\tilde{W}_1(t)\| \le \beta_2 T\Big(\dfrac{1}{\beta_1} + 2a_1\Big)\varepsilon_{\max}.$   (29)

This completes the proof.

Remark 1. Note that, as $N\to\infty$, $\varepsilon_H\to 0$ uniformly [1]. This means that $\varepsilon_{\max}$ decreases as the number of hidden layer neurons in (18) increases.

Remark 2. This theorem requires the assumption that the control input $u(t)$ is bounded. In the upcoming theorem this restriction is removed.

Define the control policy in the form of an action neural network which computes the control input for the system to be controlled. We choose this control in the form suggested by (11); thus we have

  $u_2(x) = -\tfrac{1}{2}R^{-1}g^T(x)\nabla\phi_1^T\hat{W}_2,$   (30)

where $\hat{W}_2$ denotes the current known values of the actor NN weights.

Based on (8), define the approximate HJB equation

  $Q(x) + W_1^{*T}\nabla\phi_1(x)f(x) - \tfrac{1}{4}W_1^{*T}D_1(x)W_1^* = \varepsilon_{HJB}(x), \quad W_1^{*T}\phi_1(0) + \varepsilon(0) = 0$   (31)

with the notation $D_1(x) = \nabla\phi_1(x)g(x)R^{-1}g^T(x)\nabla\phi_1^T(x)$, where $W_1^*$ denotes the ideal unknown weights of the critic and actor neural networks.

We now present the main theorem which provides the tuning laws for the actor and critic neural networks such that convergence to the optimal controller is guaranteed.

Theorem 2. Let tuning for the critic NN be provided by

  $\dot{\hat{W}}_1 = -a_1\dfrac{\sigma_2}{(\sigma_2^T\sigma_2+1)^2}\,[\sigma_2^T\hat{W}_1 + Q(x) + u_2^T R u_2]$   (32)

where $\sigma_2 = \nabla\phi_1(f+gu_2)$, and assume that $\bar{\sigma}_2 = \sigma_2/(\sigma_2^T\sigma_2+1)$ is persistently exciting. Let the actor NN be tuned as

  $\dot{\hat{W}}_2 = -a_2\Big(\tfrac{1}{2}D_1(x)(\hat{W}_2-\hat{W}_1) - \tfrac{1}{4}D_1(x)\hat{W}_2\dfrac{\bar{\sigma}_2^T}{m_{s2}}\hat{W}_1\Big)$   (33)

where $m_{s2} = 1+\sigma_2^T\sigma_2$. Then the critic parameter error $\tilde{W}_1 = W_1^*-\hat{W}_1$ and the actor parameter error $\tilde{W}_2 = W_1^*-\hat{W}_2$ are uniformly ultimately bounded (UUB) and convergence to the approximate optimal critic is obtained.

Proof. The convergence proof is based on Lyapunov analysis. We consider the Lyapunov function

  $L(t) = V(x) + \tfrac{1}{2}\mathrm{tr}(\tilde{W}_1^T a_1^{-1}\tilde{W}_1) + \tfrac{1}{2}\mathrm{tr}(\tilde{W}_2^T a_2^{-1}\tilde{W}_2).$   (34)

With the chosen tuning laws one can then show that the errors $\tilde{W}_1$ and $\tilde{W}_2$ are UUB and convergence is obtained. For space reasons we will present the details of this proof in a future paper.
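For illustration, one possible Euler realization of the actor output (30) and the synchronous tuning laws (32) and (33) is sketched below. This is a sketch under the stated assumptions only, not the authors' implementation; f, g, grad_phi and Q are user-supplied callables, and the gains a1, a2 and the step dt are placeholders.

    import numpy as np

    def synchronous_update(W1_hat, W2_hat, x, f, g, grad_phi, Q, R, a1, a2, dt):
        """One Euler step of the synchronous actor/critic laws (30), (32), (33)."""
        Rinv = np.linalg.inv(R)
        Gphi = grad_phi(x)                                  # N x n Jacobian of phi_1
        D1 = Gphi @ g(x) @ Rinv @ g(x).T @ Gphi.T           # D_1(x) as defined after (31)

        u2 = -0.5 * Rinv @ g(x).T @ Gphi.T @ W2_hat         # actor output (30)
        sigma2 = Gphi @ (f(x) + g(x) @ u2)                  # sigma_2 = grad(phi_1)(f + g u_2)
        ms2 = sigma2 @ sigma2 + 1.0
        sigma2_bar = sigma2 / ms2

        e = sigma2 @ W1_hat + Q(x) + u2 @ R @ u2            # approximate Hamiltonian residual
        W1_dot = -a1 * sigma2 * e / ms2**2                  # critic law (32)
        W2_dot = -a2 * (0.5 * D1 @ (W2_hat - W1_hat)
                        - 0.25 * D1 @ W2_hat * (sigma2_bar @ W1_hat) / ms2)   # actor law (33)

        return W1_hat + dt * W1_dot, W2_hat + dt * W2_dot, u2

Note that both $f(x)$ and $g(x)$ appear in the update, i.e. full model knowledge is needed, consistent with the discussion in Section III.C below.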
Note that once the critic parameters have converged to the parameters of the approximate optimal critic, the approximate optimal controller can simply be obtained based on (7).

C. Algorithm Discussion

Next we briefly highlight the characteristics of the newly proposed algorithm in contrast with the algorithm outlined in Section III.A.

The main difference between the two algorithms is related to the manner in which the two neural network based approximators, i.e. the actor and the critic, learn. In the case of standard PI algorithms each of the two neural networks is tuned separately, in the sense that while one is learning the parameters of the other NN are held constant. In the case of the newly proposed algorithm the two neural networks are tuned concurrently, in a synchronous manner, based on continuous-time data from the environment. From this point of view the tuning laws of the actor and critic can be interpreted as dynamics of extra states of the closed loop actor/critic structure. The equilibrium point for the states described by the dynamics of the critic parameters is given by the parameter values which characterize the optimal critic. Thus, the stability of the extended system is equivalent to the convergence of the critic parameters to the parameters of the optimal cost function.

One should also note that in the case of the new algorithm no sampling of any kind is used during the learning process; the tuning of the actor and critic networks is done in continuous time while the controller (actor) also operates in continuous time. In contrast, the online version of the standard PI method includes a discrete-time component, characterized by the integration interval in (16). This means that the critic learns based on discretely sampled information while the actor operates in continuous time, and that, after convergence of the critic parameters is obtained, the actor is then tuned to match the closed form solution of (17), which is given by (11). In this case, based on the way in which the improved control law was defined, one can see that the critic parameters can simply be transferred to become the new actor parameters, which results in a controller with improved performance.

One also notes that, in contrast with our online version of the standard PI algorithm, where only knowledge of the input-to-state dynamics was required for implementation, in the case of the new adaptive algorithm full knowledge of the system dynamics is required. This points towards future development of the algorithm, such as including a third NN structure in the adaptive critic (see Fig. 1) with the purpose of system model identification.
IV. SIMULATION RESULTS

To support the new theoretical result we now offer a simulation example for the new online GPI algorithm. We consider the linear system with the quadratic cost function used in [6]

  $\dot{x} = \begin{bmatrix} -1 & -2 \\ 1 & -4 \end{bmatrix} x + \begin{bmatrix} 1 \\ -3 \end{bmatrix} u$

with $Q$ and $R$ in the cost function being identity matrices of appropriate dimensions. In this linear case the solution of the HJB equation is given by the solution of the algebraic Riccati equation (ARE), and the parameters of the optimal critic are

  $W_1^* = [\,0.3199 \;\; -0.1162 \;\; 0.1292\,]^T.$

Figure 2 shows the critic parameters, denoted by $\hat{W}_1 = [\,W_{11} \;\; W_{12} \;\; W_{13}\,]^T$, converging to the optimal values. In fact, after 200 s the critic parameters converged to

  $\hat{W}_1(t_f) = [\,0.3192 \;\; -0.1174 \;\; 0.1287\,]^T.$

Fig. 2. Convergence of the critic parameters to the parameters of the optimal critic (plot of the critic NN parameters $W_{11}$, $W_{12}$, $W_{13}$ versus time over 200 s).

The evolution of the system states is presented in Figure 3.

Fig. 3. Evolution of the system states for the duration of the experiment (plot of the system states versus time over 200 s).

One can see that after 200 s the PE condition on the control signal was removed, being no longer required.
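The reported optimal critic parameters can be cross-checked against the ARE solution. The short Python sketch below is an illustration added here, not part of the original experiment; it assumes the quadratic basis $\phi_1(x)=[x_1^2,\ x_1x_2,\ x_2^2]^T$, so that $W_1^* = [P_{11},\ 2P_{12},\ P_{22}]^T$ for $V^*(x)=x^TPx$.

    import numpy as np
    from scipy.linalg import solve_continuous_are

    A = np.array([[-1.0, -2.0],
                  [ 1.0, -4.0]])
    B = np.array([[ 1.0],
                  [-3.0]])
    Q = np.eye(2)
    R = np.eye(1)

    # Solve A^T P + P A - P B R^{-1} B^T P + Q = 0, giving V*(x) = x^T P x.
    P = solve_continuous_are(A, B, Q, R)

    # With phi_1(x) = [x1^2, x1*x2, x2^2]^T the critic weights are
    # W1* = [P11, 2*P12, P22]^T, which should be close to [0.3199, -0.1162, 0.1292]^T.
    W1_star = np.array([P[0, 0], 2.0 * P[0, 1], P[1, 1]])
    print(W1_star)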
V. CONCLUSION

In this paper we have proposed a new adaptive algorithm which solves the continuous-time optimal control problem for nonlinear systems that are affine in the inputs. The algorithm is built on an actor/critic structure and involves synchronous training of both actor and critic networks. This means that initiating the tuning process for the parameters of either the actor or the critic does not require complete convergence of the adaptation process of the critic or, respectively, the actor, as is regularly required in policy iteration algorithms.

Compared with our recent online policy iteration algorithm for CT systems, the new Generalized CT Policy Iteration requires complete knowledge of the system model. For this reason our research efforts will now be directed towards integrating a third neural network with the actor/critic structure, with the purpose of approximating the system dynamics in an online fashion.

REFERENCES

[1] M. Abu-Khalaf, F. L. Lewis, "Nearly Optimal Control Laws for Nonlinear Systems with Saturating Actuators Using a Neural Network HJB Approach", Automatica, vol. 41, no. 5, pp. 779-791, 2005.
[2] L. C. Baird III, "Reinforcement Learning in Continuous Time: Advantage Updating", Proc. of ICNN, Orlando, FL, June 1994.
[3] R. Beard, G. Saridis, J. Wen, "Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation", Automatica, vol. 33, no. 12, pp. 2159-2177, 1997.
[4] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, MA, 1996.
[5] K. Doya, "Reinforcement Learning in Continuous Time and Space", Neural Computation, vol. 12, no. 1, pp. 219-245, 2000.
[6] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-Time Adaptive Critics", IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 631-647, 2007.
[7] R. A. Howard, Dynamic Programming and Markov Processes, MIT Press, Cambridge, Massachusetts, 1960.
[8] F. L. Lewis, V. L. Syrmos, Optimal Control, John Wiley, 1995.
[9] D. Kleinman, "On an Iterative Technique for Riccati Equation Computations", IEEE Trans. on Automatic Control, vol. 13, pp. 114-115, February 1968.
[10] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive Dynamic Programming", IEEE Trans. on Systems, Man and Cybernetics, vol. 32, no. 2, pp. 140-153, 2002.
[11] D. Prokhorov, D. Wunsch, "Adaptive critic designs", IEEE Trans. on Neural Networks, vol. 8, no. 5, pp. 997-1007, 1997.
[12] R. S. Sutton, A. G. Barto, Reinforcement Learning – An Introduction, MIT Press, Cambridge, Massachusetts, 1998.
[13] D. Vrabie, F. Lewis, "Adaptive Optimal Control Algorithm for Continuous-Time Nonlinear Systems Based on Policy Iteration", Proc. IEEE CDC 2008, IEEE, 2008.
[14] D. Vrabie, O. Pastravanu, F. Lewis, M. Abu-Khalaf, "Adaptive Optimal Control for Continuous-Time Linear Systems Based on Policy Iteration", Automatica (to appear).
[15] P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavior Sciences, Ph.D. Thesis, 1974.
[16] P. J. Werbos, "Approximate dynamic programming for real-time control and neural modeling", in Handbook of Intelligent Control, ed. D. A. White and D. A. Sofge, New York: Van Nostrand Reinhold, 1992.
[17] P. Werbos, "Neural networks for control and system identification", Proc. IEEE CDC 1989, IEEE, 1989.
