Dynamic Multiobjective Control For Continuous-Time Systems Using Reinforcement Learning

This article has been accepted for publication in a future issue of this journal, but has not been
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAC.2018.2869462, IEEE
Transactions on Automatic Control
1
Dynamic Multiobjective Control for

Continuous-time Systems using Reinforcement
Learning
Victor G. Lopez, Student Member, IEEE, and Frank L. Lewis, Fellow, IEEE
ways that are difficult to express otherwise. Examples of these

Abstract—This paper presents an extension of the applications can be found in [6]. The study of multiobjective
reinforcement learning algorithms to design suboptimal control optimization control is therefore a natural extension of the usual
sequences for multiple performance functions in continuous-time analysis in the current literature [7].
systems. The first part of the paper provides the theoretical
development and studies the required conditions to obtain a
Multiobjective optimal control has been studied in [7]-[9]
state-feedback control policy that achieves Pareto optimal results where the concept of Pareto domination is employed to
for the multiobjective performance vector. Then, a policy compare the desirability of two vector functions. Using this
iteration algorithm is proposed that take into account practical notion, the control input is designed such that improving an
considerations to allow its implementation in real-time objective function unilaterally implies making another worse
applications for systems with partially unknown models. Finally, [10], [11]. Most of the papers in the literature deal with static
the multiobjective linear quadratic regulator problem is solved
using the proposed control scheme and employing a
multiobjective optimization. To the best of our knowledge, no
multiobjective optimization software to solve the static other paper presents a practical method of policy iteration to
optimization problem at each iteration. solve the multiobjective optimization problem for nonlinear
dynamical continuous-time systems.
Index Terms—Multiobjective optimization, nonlinear systems, Several methods exist to compute a Pareto optimum. These
Pareto optimality, reinforcement learning. include numerical methods embedded in software packages, as
well as analytic procedures such as the weighted sum, or
scalarization, technique [11]. The weighted sum method
I. INTRODUCTION consists of combining the different cost indices into a single
R EINFORCEMENT learning is a set of artificial

intelligence methods that has had an increasing success in
the last decade for providing a system with the ability to
scalar function by computing their convex sum. This is a
practical method in many applications, but it presents many
technical drawbacks. These include the presence of
improve its performance as it gains experience while unreachable Pareto optimal results, a strong, non-intuitive
attempting to achieve its goals [1]-[4]. In the last few years, dependence on the selected sum weights, and a large
reinforcement learning approaches have been adopted in computational burden as the number of desired solutions
control theory where the performance of a dynamical system is increases [11]. For these reasons, our paper presents a general
measured by means of a scalar function that represents the cost approach for multiobjective control that can be applied with
spent by the system along time. Reinforcement learning any of the existing multiobjective optimization methods.
techniques, properly defined for control of dynamical systems, The main motivation of our work is to design a control
are described in [5]. strategy that allows to solve optimization problems that cannot
Many engineering problems require describing the goals of a be expressed by a single cost function. When the objectives of
system by means of two or more performance indices, rather the system are conflicting with each other, a tradeoff must be
than the single cost function employed in classical optimal achieved. The controller is based on reinforcement learning
control. Using several performance indices provides more methods to avoid the difficult task of solving the
flexibility to represent the expected behavior of the system in Hamilton-Jacobi-Bellman equation [2] and to relax the need of
full knowledge of the dynamic model of the system.
Paper resubmitted on June 22, 2018. Work supported by U.S. NSF grant As main contribution of the paper, a reinforcement learning
ECCS-1405173, ONR grant N00014-17-1-2239, and China NSFC grant algorithm, based on policy iteration, is proposed to achieve an
#61633007. Author Victor G. Lopez is supported by the Mexican Council of
Science and Technology (Conacyt). online solution of the multiobjective optimization problem. It is
V. G. Lopez is with the UTA Research Institute, University of Texas at rigorously proven that, under the provided conditions, this
Arlington, Fort Worth, TX 76118, USA (e-mail: algorithm yields a single Pareto optimal solution, in a nontrivial
victor.lopezmejia@mavs.uta.edu).
F. L. Lewis is with the UTA Research Institute, University of Texas at
extension from the single-objective optimization problem.
Arlington , Fort Worth, TX 76118, USA and Qian Ren Consulting Professor, Furthermore, only partial knowledge of the system dynamics is
State Key Laboratory of Synthetical Automation for Process Industries, required to achieve optimal control. This algorithm is finally
Northeastern University, Shenyang China (e-mail: lewis@uta.edu).
0018-9286 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAC.2018.2869462, IEEE
2
formulated and analyzed for the specific case of linear systems. where x  n and u  m are the state vector and the control
The paper is organized as follows. Basic definitions for input of the system, respectively, and f is a continuously
multiobjective optimization and for the multiobjective optimal
differentiable function.
control problem are described in Section 2. Section 3 shows the
In the multiobjective optimization problem, the performance
basic transformations employed to obtain an iterative
of system (2) is evaluated with respect to M different
suboptimal control sequence. In Section 4, a policy iteration
performance indices
algorithm to solve the multiobjective optimization problem is 
designed, with considerations to allow its implementation in J j ( x(0), u ) = 0 L j ( x(0), u )dt , (3)
practical applications. Section 5 studies the linear systems case. j = 1, , M , where each L j is a continuously differentiable
Finally, Section 6 concludes with a numerical example.
function. The feedback control function u ( x) is said to be
II. BASIC DEFINITIONS admissible if it is continuous, stabilizes the dynamics (2) and
In this section, various definitions to develop multiobjective makes J j ( x, u ( x)) finite for all j = 1, , M . The class of
optimization algorithms for dynamical systems are reviewed. functions satisfying these properties is denoted as U 0 . Define
the vector J as J =  J1 , , J M  . It is our interest to find a
T
A. Pareto optimality
Multiobjective optimization deals with the problem of function u ( x)  U 0 such that vector J is minimized in the
minimizing two or more objective functions simultaneously Pareto sense.
[10]. In mathematical terms, this problem is expressed as
For a fixed control policy u ( x) , define the value functions
min V ( x) (1)
x X 
V j ( x(t )) = t L j ( x( ), u )d , j = 1, , M. (4)
where x  n
is selected inside a feasible set X and
Let V = V1 , ,VM  . A differential equivalent to the value
T
V : n → M is a vector function with M elements,
V ( x) = V1 ( x), , VM ( x) , with Vi ( x) , i = 1, function (4) is given by the Bellman equations
T
, M , the
functions to be minimized. In the general case, there does not 0 = L j ( x, u ) + V jT f ( x, u ) H j ( x, V j , u ) (5)
exist a solution x that achieves the minimization of all where V j is the gradient of V j , and H j ( x, V j , u) is the
functions Vi ( x) simultaneously, and the concepts of Pareto j th Hamiltonian function of the system. Note that the orbital
domination and Pareto optimality must be introduced. derivative of V j ( x) is given by
Definition 1. A vector W  M is said to Pareto dominate V j ( x) = V j x = − L j ( x, u ). (6)
vector V  M if W j  V j for all j = 1, , M , and W j  V j for Define also
at least one j , where V j and W j are the j -th entries of V * ( x(0)) inf J ( x(0), u ) (7)
uU 0
vectors V and W , respectively.  
The following definition states a specific notation that we use where, in general, V * is not unique, and V *  J with J the
throughout this paper. Pareto front of vector J .
Definition 2. Notation W  V for vectors W  M and Define the Pareto optimal vector H * as
V  M , indicate that W is not Pareto dominated by V , i.e., H * ( x, V ) = min0 H ( x, V , u) (8)
uU
either V = W or there is at least one entry j such that
were H =  H1 , , H M  and V = V1 , , VM  . H * is
T T
V j  W j . Notation W V means that W j  V j for all

Pareto optimal in the sense that, for each state vector x and
j = 1, , M .
vector V , H * ( x, V )  H ( x, V , u ) for every control policy
Employing these definitions, the concept of Pareto optimality
can be stated as follows. u U 0 .
Definition 3. A solution x* of problem (1) is said to be In general, it is possible to select different control inputs, u *1
Pareto optimal if V ( x* )  V ( x) for all x  X . and u *2 , such that H ( x, V , u*1 ) and H ( x, V , u*2 ) are both
The outcome V ( x* ) of a Pareto optimal solution x* is also Pareto optimal. For this reason, we make the following
assumption, useful for the analysis of the next section.
said to be Pareto optimal. In general, a multiobjective
optimization problem has multiple Pareto optimal outcomes, Assumption 1. If there exists u * such that H * ( x, V ) = 0
and the set of all Pareto optimal outcomes for a given problem for all x and a given function V ( x) , then select u = u * .
is regarded as the Pareto front. Here, we represent the Pareto Assumption 1 is a restriction on the procedure employed to
front as  , such that V ( x* )   . find a Pareto optimal vector in (8), and states that if the vector
of zeros is one of the possible Pareto optimal results for H * ,
B. Multiobjective performance of a dynamical system
then control policy u must be selected accordingly. The
Consider a general nonlinear system with dynamics consequences of Assumption 1 are studied in Lemmas 4 and 5
x = f ( x, u ) (2) in Section 3.
3
III. MULTIOBJECTIVE SUBOPTIMAL CONTROL SEQUENCES For this entry j , let V j* solve the Bellman equation for u * and
This section defines and analyzes transformations to design Vj solve the Bellman equation for u . Then,
suboptimal control policies in an iterative manner. This is an
extension for multiobjective optimization of the results in [3]. H j ( x, V , u )  H j ( x, V , u ) = H j ( x, V j , u ) = 0 . Now, by
*
j
*
j
*
Considering the multiobjective optimal control problem Lemma 1, H j ( x, V j* , u )  H j ( x, V j , u ) implies V j  V j* .

described in Section 2, define as the set of all continuously Theorem 1. Let the control policy u * be such that (8) holds,
differentiable functions V : n
→ M such that V (0) = 0 . and V * such that V j* solves the j th Bellman equation (5) for
Define also 0 as the subset of such that u ( x, V )  U 0 , u * , for every entry j = 1, , M . Then, V *  J with J the
i.e., the feedback control policies based on the vector function Pareto front of J as defined in (7).
V are admissible. Define the transformations T1 , T2 and T as
Proof. As u * makes H * Pareto optimal, then, by Lemma 2,
follows.
V * is also Pareto optimal. Now, for all entries of vector J , we
Definition 4.
have
1. Define the function T1 : 0 → U 0 as 
J j = 0 L j ( x, u )dt
,VM ) = u where V = V1 , ,VM 
T
T1 (V ) = T1 (V1 ,

and u = u ( x, V ) . = 0 H j ( x, V j* , u )dt − V j* ( x()) + V j* ( x(0))
2. Define the function T2 : U 0 → as T2 (u ) = V , As u *  U 0 , then V ( x()) = 0 . Therefore, J j = V j* ( x(0)) for
where V  M
and V j ( x) = J j ( x, u ) , j = 1, ,M . all entries j , and Pareto optimality of V * implies Pareto

3. Define the composite mapping T : 0 → as optimality of J . This is V *  J as in (7).
T (V1 , ,VM ) = T2 (T1 (V1 , ,VM )) = J ( x, u) , where The proof of Theorem 1 shows that V * = J when V * solves
u = u ( x, V ) . the Bellman equation for u * . Lemma 3 and Theorem 2 show
Our objective now is to use these transformations to design a that solving the Bellman equation, regardless of the control
control sequence that converges to an optimal policy function u , is a sufficient and necessary condition for a vector
u* ( x)  U 0 . We begin our analysis by studying some of the V to satisfy the equality V = T2 (u ) = J .
properties of the vector functions defined in Section 2, as well Lemma 3. V = T2 (u ) if and only if V j satisfies
as those of transformations T1 , T2 and T . H j ( x, V , u ) = 0 , for j = 1, ,M .
Lemma 1 allows to compare two vector functions, V and Proof. If H j ( x, V , u ) = 0 , then V j = V T f ( x, u )
W , with entries V j and W j as in (4), when the respective
= H j ( x, V , u ) − L j ( x, u ) = − L j ( x, u ) = J j , and integrating
Hamiltonian functions are known.
both sides of the equality along the same motions for all entries
Lemma 1. If H ( x, V , u)  H ( x, W , u) for given vector
of the vector, yields V = J = T2 (u ) . Conversely, if V = J , then
functions V , W , and control u , then W  V .
Proof. By Definition 2, we have that either V j = J j = − L j ( x, u ) , which implies H j = 0 .
H ( x, V , u) = H ( x, W , u) or there exists an entry j such Theorem 2. Let V  0
and W  . Now, W = T (V ) if
that H j ( x, V , u )  H j ( x, W , u ) . Assume the latter case and and only if H ( x, W , u ( x, V )) = 0 .
consider this same entry j . By definition of H j in (5), we Proof. Follows directly from Lemma 3 and Definition 4.
have L j ( x, u ) + V jT f ( x, u )  L j ( x, u ) + W jT f ( x, u ) , which Clearly, if V is such that H ( x, V , u) = 0 , then
H * ( x, V )  0 , which means that the vector of zeros does not
implies V jT f ( x, u )  W jT f ( x, u ) ; that is, V j  W j .
Integrating the inequality along the same motions yields Pareto dominates H * . However, this does not necessarily
W j  V j and, therefore, W  V . imply that all the elements of H * are nonpositive. Lemma 4
solves this inconvenience.
Lemma 2 and Theorem 1 relate the Pareto optimality of
vector V with the Pareto optimality of vector H in (8). Lemma 4. Let Assumption 1 hold. Then, H *j ( x, V )  0 for
Lemma 2. Consider a control policy u * such that (8) holds. all entries of H * .
If V j* ( x) , j = 1, , M , solves the Bellman equation (5) for u * , Proof. By Assumption 1, if the vector of zeros is Pareto
optimal, then H *j = 0 for all entries j . If H = 0 is not Pareto
then V * ( x) is Pareto optimal.
optimal, then by definition of Pareto optimality we have
Proof. Consider a control policy u such that
H *j ( x, V )  0 for all j with at least one strict inequality.
H ( x, V , u )  H ( x, V , u* ) . By Pareto optimality of
As studied below, Lemma 4 allows guaranteeing that all the
H j ( x, V j , u* ) , there exists an entry j such that
entries of a vector are at least as small as the entries of another
H j ( x, V j , u )  H j ( x, V j , u* ). (9) ( V W ) when an iterative algorithm is employed.
The following theorem shows the recursion required later in
4
this section to design a suboptimal control sequence. developed. This section presents the integral reinforcement
Theorem 3. Let V  0 and V = T (V ) , and let Assumption learning in multiobjective optimization form.
Notice that the jth value function (4) can be expressed as
1 hold. Then, H * ( x, V ) 0 implies V *  V V , with V * t +T
Pareto optimal. V j ( x(t )) = t L j ( x( ), u )d +V j ( x(t + T )) (10)
Proof. Take u = u* ( x, V ) . By Assumption 1 and Lemma 4, for any time interval T >0. Given the functions Vj(x) and Lj(x,u),
H ( x, V )  0 for every j = 1,
*
, M . Then, we can express equation (10) does not require knowledge about the system
j
dynamics (2). Lemma 6 shows that the solution Vj (x) of (10) is
V j = H ( x, V ) − L j ( x, u )  − L j ( x, u ) = J j .
*
j the value function (4) that solves equation (5) .
As V j = T (V ) = J j implies V j = J j , then V j  V j . Lemma 6. Assume the control policy u(x) stabilizes the
system dynamics (2). Then, the solution Vj (x) of equation (10)
Integrating the inequality we get V j  V j for all entries j . is equivalent to the solution of the Bellman equation (5).
In the single objective optimization problem, it is clear that Proof. If equation (5) holds for Vj, then
an iterative repetition of the operation in Theorem 3 leads the V j = V jT f ( x, u ) = − L j ( x, u ) . Integrating both sides of the
function vector V to the unique optimal value function V * . In
equation, we get
the multiobjective optimization case, Assumption 1 is required t +T t +T
to prevent leaping among different Pareto optima at each t L j ( x, u )d = − t V j ( x( ))d = −V j ( x(t + T )) + V j ( x(t ))
iteration, as proven in Lemma 5 and Theorem 4. which is the same equation as (10).
Lemma 5. Let V * be Pareto optimal and let Assumption 1 The following algorithm presents the multiobjective optimal
hold. If W * is any other Pareto optimal value function such that controller by reinforcement learning. The policy evaluation
V *  W * , then W *  T (V * ) . step consists of solving Equation (10). This corresponds to the
Proof. Assume W=T(V*). If Assumption 1 holds, by Lemma transformation T2 in Definition 4. The policy improvement step
is based on Equation (8), and corresponds to the transformation
4 we have H*j(x,∇V) ≤ 0 for all entries j. By Theorem 3, we
T1. Convergence of Algorithm 1 is proven in Theorem 6.
have Wj ≤ V*j for all j. As W*j >V*j for some j, for any other Pareto
Algorithm 1. Integral multiobjective policy iteration.
optimal vector W* , then W* cannot be reached.
1. Select an admissible control policy u0.
Theorem 4. If a Pareto optimal solution V*∈ 0 exists, then 2. Solve for Vk from the set of equations
V*=T(V*). Conversely, V=T(V) implies V=V*. t +T
V jk ( x(t )) = t L j ( x( ), u )d + V jk ( x(t + T )) . (11)
Proof. Consider two Pareto optimal vectors V* and W*. By
Theorem 3, if V = T (V * ) , then W *  V V * ; by Lemma 5 and 3. Update the control policy as
u k +1 = arg min H ( x, V k , u ), (12)
definition of Pareto optimality, V V implies V = V * . u
Conversely, if V=T(V), by Theorem 2 we have H*(x,∇V) = 0 Go to step 2. On convergence, stop.

and V solves the Bellman equations (5); by Lemma 2, V=V*. Theorem 6. Assume there exists an admissible control input
We finally formalize the idea of using the result in Theorem u for system (2). Perform Algorithm 1 such that Assumption 1
3 to build a sequence of successive approximations that holds in step 3. Then, Algorithm 1 converges to a Pareto
converge to a Pareto optimal solution V*. optimal solution V*. Moreover, the control policy u(x, ∇ V*)
Theorem 5. Take V 0 ∈  0 and V k+1=T(V k). Then optimizes the performance index vector J.
Proof. From equation (12) and Assumption 1, we have that
V *  V k +1 V 0 for a Pareto optimal solution V*.
Hj (x,∇Vjk, uk+1) ≤ 0 for j = 1, , M . As function Vjk+1 solves
Proof. The proof follows inductively from Theorem 3,
noting that the current estimate of the optimal value function at equation (11), then by Lemma 6 Hj (x,∇Vj k+1, uk+1) = 0. From
step k is V k=T2(uk) and taking the control policy at step k + 1 both results we get Hj (x,∇Vjk, uk+1) ≤ Hj (x,∇Vj k+1, uk+1). By
based on V k, i.e., uk+1=u(x,∇V k)=T1(V k). Convergence to a Lemma 1, this implies Vjk+1 ≤ Vjk and vector Vk+1 is not Pareto
single Pareto optimal result is provided by Theorem 4. dominated by V k. By Theorems 5, these properties hold for
every iteration until a Pareto optimal vector V* is obtained.
IV. INTEGRAL REINFORCEMENT LEARNING ALGORITHM FOR By Theorem 1, if (11) holds for V*, then V* = J*. Thus, V*
IMPLEMENTATION OF MULTIOBJECTIVE SUBOPTIMAL CONTROL guarantees a Pareto optimal performance of the system.
Remark 1. Step 3 in Algorithm 1 can be solved by any
In this section, we use the analysis of Section 3 to design an
multiobjective optimization method. In this paper, we avoid the
integral reinforcement learning (IRL) algorithm using the
use of the weighted-sum method because it reduces the problem
structure of policy iteration [1], [5], [12], that is shown to
to a single-objective formulation that is often restrictive and is
converge to a Pareto optimal solution of vector V, then used to
not suitable for general applications [11]. The technical
generate the optimal policy u(x,∇V*). Here, it is assumed that
drawbacks of the weighted sum method include its inability to
the state values of system (2) are known, even if part of its reach results in nonconvex sections of the Pareto optimal set, a
mathematical model is uncertain. strong, non-intuitive dependence on the sum weights, and a
In [12], an IRL algorithm that converges to the solution V* of large computational burden as the optimization problem grows
the Bellman equation for a single performance index was
5
in complexity. It is well known that the minimization of each individual Hj

Remark 2. Equation (11) avoids the use of the system with respect to K is achieved using the optimal gain matrix
dynamics (2) in the policy evaluation step of the algorithm, and K*=R−1 T
j B Pj. However, this optimization problem can be
equation (12) requires only partial knowledge of the characterized differently to be programed in a multi-objective
mathematical model of the system [12]. optimization solver. Theorem 7 shows that minimizing (20) by
In the following section it is shown how to use partial means of matrix K is equivalent to minimize the sum of the
knowledge of a linear system with a particular Pareto eigenvalues of the matrix KTRjK−PjBK−KTBTPj. To simplify
optimization solver in Algorithm 1. the notation, define the variables
S j = Q j + K T R j K + Pj ( A − BK ) + ( A − BK ) Pj
T
(21)
V. MULTIOBJECTIVE LINEAR QUADRATIC REGULATOR
and
Consider a system with linear dynamics
S ' j = K T R j K − Pj BK − K T BT Pj . (22)
x = Ax + Bu. (13)
The performance of the system is measured using M different The i th eigenvalue of a matrix S is denoted as λi(S).
performance indices with quadratic terms, given by Theorem 7. Let Hj = xTSj x, where Sj is the symmetric matrix
(21). Then, solving the minimization problem
J j = 0 ( xT Q j x + uT R j u )dt ,

(14) K * = arg min H j
K
j=1,…,M, where Qj >0 and Rj >0 are symmetric matrices.
Express each of the M value functions in quadratic form as is equivalent to solving the eigenvalue minimization problem
n
V j = xT Pj x, (15) K * = arg min  i ( S ' j ),
K i =1
j=1,…,M, with Pj = PjT  0 .
with Sj’ as in (22).
In order to apply the multiobjective IRL algorithm, express Proof. Take the optimal matrix K* such that H*j = xTS*j x, with
the functions (15) in the form (10); that is, Sj =Qj+K*TRjK*+Pj(A−BK*)+(A−BK*)TPj, is minimal; this
*
( x T Q j x + u T R j u ) d
t +T
xT (t ) Pj x(t ) = t means xTS*j x ≤ xTSj x for Sj in (21) using any matrix K. Now we
(16) can write xT(Sj−S*j )x ≥ 0 and, therefore, Sj−S*j is a positive
+ xT (t + T ) Pj x(t + T )
semidefinite matrix. Note that for the matrices (21) and (22),
Solving this equation becomes an easier task if we employ the we have Sj−S*j = Sj’− Sj’*. As all the eigenvalues of Sj’− Sj’* are
Kronecker product to express the term xTPjx as xTPjx = nonnegative, and the trace of a matrix is equal to the sum of its
vec(Pj)T(x  x), where vec(Pj) is the column vector obtained by eigenvalues, then tr(Sj’− Sj’*) ≥ 0, which implies tr(Sj’) ≥ tr(Sj’*).
stacking the columns of Pj. Moreover, as matrix Pj is We conclude that matrix K* generates the matrix Sj’* with
symmetric and the expression x  x includes all possible minimal sum of its eigenvalues.
products of the entries of x, each of the vectors vec(Pj) and x  x By Lemma 7, minimization of the Hamiltonian vector H can
include repeated terms. Represent these vectors after removing be achieved by finding the gain matrix K* such that, for given
all the redundant terms as p j and x , respectively, which matrices Pj, j=1,…,M, we have
consist of n(n+1)/2 components. Now, we can write  n 
  i ( K T R1 K − P1 BK − K T BT P1 ) 
xT Pj x = pTj x (17) i =1
K * = arg min  n  (23)
Using the expression (17), we rewrite equation (16) as  
  i ( K RM K − PM BK − K B PM ) 
K T T T
( xT Q j x + uT R j u ) d
t +T
pTj ( x (t ) − x (t + T )) = t (18)  i =1 
Remark 3. Problem (23) is expressed without knowledge of
and the goal is to find the values of p j that satisfy (18) given
matrix A of the system dynamics (13).
the measurements x(t) and x(t+T), and the employed control Algorithm 2 expresses the policy iteration procedure
input u. This objective can be achieved using recursive least presented in Algorithm 1, modified for the linear systems case.
squares after collecting several samples of equation (18) [5]. Algorithm 2.
The Hamiltonian functions for this system are 1. Select an admissible control policy u0 = K0x.
H j = xT Q j x + uT R j u + 2 xT Pj ( Ax + Bu ) . (19) 2. Solve the set of equations (11) for Vk.
The optimal control policy u* for system (13) is the input 3. Solve the multiobjective optimization problem (23)
u=−Kx that makes the vector H=[H1,…,HM ]T Pareto optimal. and update the control policies as uk+1 = Kk+1x.
Several methods can be used to determine u*. Here, we Go to step 2. On convergence, stop.
propose a general procedure that allows this problem to be
solved by any multiobjective optimization software package. VI. SIMULATION RESULTS
Substitute the policy u=−Kx in each of the Hamiltonian Algorithm 2 is now employed to achieve stabilization of the
functions (19), to obtain linearized double inverted pendulum in a cart [19], [20],
H j = xT Q j x + xT K T R j Kx represented by the dynamic equations (13), where
(20)
+ xT Pj ( A − BK ) x + xT ( A − BK ) Pj x
T
6
0 0 0 1 0 0  0 
0 0 0 0 1 0  0  0.4
Position
0 0 0 0 0 1  0  0.3 Angle 1
A= , B= , Angle 2
0 0 0 0 0 0  1  0.2
0 86.69 −21.61 0 0 0  6.64 
0 −40.31 39.45 0 0 0  0.08 
0.1
State trajectories
state x1 is the position of the cart, x2 and x3 are the angles of both 0
pendulums, and the remaining states are the velocities. As -0.1
performance objectives, i) regulation of all states is required

-0.2
and ii) the values of x2 and x3 must be as close to each other as
possible. The performance indices (14) can now be defined as -0.3
 200 0 0 0 0 0   1 0 0 0 0 0 -0.4
 0 200 0 0 0 0  0 1 −1 0 0 0  0 1 2 3
Time
4 5 6
 0 0 200 0 0 0  0 −1 1 0 0 0  Fig. 1. State trajectories of a linear system with multiobjective optimization.

Q1 =  , Q2 =  ,
 0 0 0 1 0 0 0 0 0 1 0 0  integer controls,” Journal of Process Control, vol. 20, pp.
 0 0 0 0 1 0 0 0 0 0 1 0  810-822, 2010.
 0 0 0 0 0 1  0 0 0 0 0 1 [9] A. Kumar and A. Vladimirsky, “An efficient method for
multiobjective optimal control and optimal control subject to
and R1 = R2 = 1 . The sample time per iteration is T = 0.05 . integral constraints,” Journal of Computational
Mathematics, vol. 28, No. 4, pp. 517-551, 2010.
The Matlab function for multiobjective optimization [10] S. Boyd and L. Vandenberghe, Convex Optimization,
fgoalattain is employed in to determine the feedback control Cambridge University press, New York, 2004.
matrix K at each iteration. fgoalattain allows to generate [11] M. Caramia and P. Dell’Olmo, “Multi-objective
optimization” in Multi-objective Management in Freight
different points in the Pareto front of the problem. The state Logistics. Increasing capacity, service level and safety with
trajectories for x1, x2 and x3 after implementation of Algorithm optimization algorithms, Springer-Verlag London, 2008.
2 are shown in Figure 1. All states are shown to be stabilized by [12] D. Vrabie, O. Pastravanu, M. Abu-Khalaf and F. L. Lewis,
“Adaptive optimal control for continuous-time linear
the controller. The final gain matrix K is systems based on policy iteration,” Automatica, vol. 45, pp.
K = 11.90 110.67 −165.9 13.30 4.20 −26.32 . 477-484, 2009.
[13] D. Vrabie, K. G. Vamvoudakis and F. L. Lewis, Optimal
Adaptive Control and Differential Games by Reinforcement
VII. CONCLUSIONS Learning Principles, The Institution of Engineering and
Technology, London, UK, 2013.
A sequence for suboptimal control with multiple objective [14] D. Liu, Q. Wei, D. Wang, X. Yang and H. Li, Adaptive
functions for general nonlinear systems was designed, Dynamic Programming with Applications in Optimal
guaranteeing its convergence to an optimal vector in the Pareto Control, Springer International Publishing, 2017.
[15] F.-Y. Wang, H. Zhang and D. Liu, “Adaptive dynamic
sense. The proposed policy iteration algorithm allows solving programming: An introduction,” IEEE Computational
the M Bellman equations independently, using only the Intelligence Magazine, pp. 39-47, 2009.
measurements of the system trajectories during a time interval. [16] R. Kamalapurkar, L. Andrews, P. Walters and W. E. Dixon,
“Model-based reinforcement learning for infinite-horizon
This control scheme can be applied in real-time without having approximate optimal tracking,” IEEE Transactions on
full knowledge of the mathematical model of the system. As a Neural Networks and Learning Systems, vol. 28, No. 3, pp.
case of study, the multiobjective LQR was solved. 753-758, 2017.
[17] T. Bian, Y. Jiang and Z.-P. Jiang, “Adaptive dynamic
programming and optimal control of nonlinear nonaffine
REFERENCES systems,” Automatica, vol. 50, pp. 2624-2632, 2014.
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An [18] Q. Yang and S. Jagannathan, “Reinforcement learning
introduction, The MIT Press, Cambridge, MA, 1998. controller design for affine nonlinear discrete-time systems
[2] F. L. Lewis, D. Vrabie and V. L. Syrmos, Optimal Control, using online approximators,” IEEE Trans. on Systems, Man
2nd ed., John Wiley & Sons, inc., New Jersey, 2012. and Cybernetics-Part B: Cybernetics, vol. 42, no. 2, pp.
[3] R. J. Leake and R. W. Liu, “Construction of suboptimal 377-390, 2012.
control sequences,” J. SIAM Control, vol. 5, No. 1, pp. [19] Q.-R. Li, W.-H. Tao, N. Sun, C.-Y. Zhang and L.-H. Yao,
54-63, 1967. “Stabilization control of double inverted pendulum system,”
[4] D. P. Bertsekas, “Dynamic programming and suboptimal presented at the 3rd Int. Conf. on Innovative Computing
control: a survey from ADP to MPC,” European Journal of Information and Control, Jun. 18-20, 2008.
Control, vol. 11, pp. 310-334, 2005. [20] J.-L. Zhang and W. Zhang, “LQR self-adjusting based
[5] K. G. Vamvoudakis, H. Modares, B. Kiumarsi and F. L. control for the planar double inverted pendulum,” Physics
Lewis, “Game theory-based control system algorithms with Procedia, vol. 24, Part C, pp. 1669-1676, 2012.
real-time reinforcement learning,” IEEE Control Systems [21] Y. Song, Y. Wang and C. Wen, “Adaptive fault-tolerant PI
Magazine, pp. 33-52, 2017. tracking control with guaranteed transient and steady-state
[6] G. P. Liu, J. B. Yang and J. F. Whidborne, Multiobjective performance,” IEEE Trans. on Automatic Control, vol. 62,
Optimisation and Control. Research Studies Press, 2003. no. 1, pp. 481-487, 2017.
[7] A. Gambier and E. Badreddin, “Multi-objective optimal [22] Y. Song, X. Huang and C. Wen, “Tracking control for a class
control: An overview,” presented at the IEEE Int. Conf. on of unknown nonsquare MIMO nonaffine systems: A
Control Applications, Oct. 1-3, 2007 . deep-rooted information based robust adaptive approach,”
[8] F. Logist, S. Sager, C. Kirches and J. F. Van Impe, “Efficient IEEE Trans. on Automatic Control, vol. 61, no. 10, pp.
multiple objective optimal control of dynamic systems using 3227-3233, 2016.

Dynamic Multiobjective Control For Continuous-Time Systems Using Reinforcement Learning

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Dynamic Multiobjective Control For Continuous-Time Systems Using Reinforcement Learning

Uploaded by

Copyright:

Available Formats

This article has been accepted for publication in a future issue of this journal, but has not been

Dynamic Multiobjective Control for

ways that are difficult to express otherwise. Examples of these

R EINFORCEMENT learning is a set of artificial

V j  W j . Notation W V means that W j  V j for all

Considering the multiobjective optimal control problem Lemma 1, H j ( x, V j* , u )  H j ( x, V j , u ) implies V j  V j* .

Conversely, if V=T(V), by Theorem 2 we have H*(x,∇V) = 0 Go to step 2. On convergence, stop.

in complexity. It is well known that the minimization of each individual Hj

pendulums, and the remaining states are the velocities. As -0.1

performance objectives, i) regulation of all states is required

 0 0 200 0 0 0  0 −1 1 0 0 0  Fig. 1. State trajectories of a linear system with multiobjective optimization.

You might also like