
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 28, NO. 10, OCTOBER 2017

Off-Policy Reinforcement Learning for Synchronization in Multiagent Graphical Games

Jinna Li, Member, IEEE, Hamidreza Modares, Tianyou Chai, Fellow, IEEE, Frank L. Lewis, Fellow, IEEE, and Lihua Xie, Fellow, IEEE

Abstract— This paper develops an off-policy reinforcement learning (RL) algorithm to solve the optimal synchronization of multiagent systems. This is accomplished by using the framework of graphical games. In contrast to traditional control protocols, which require complete knowledge of the agent dynamics, the proposed off-policy RL algorithm is a model-free approach, in that it solves the optimal synchronization problem without requiring any knowledge of the agent dynamics. A prescribed control policy, called the behavior policy, is applied to each agent to generate and collect data for learning. An off-policy Bellman equation is derived for each agent to learn the value function for the policy under evaluation, called the target policy, and to find an improved policy, simultaneously. Actor and critic neural networks along with a least-squares approach are employed to approximate the target control policies and value functions using the data generated by applying the prescribed behavior policies. Finally, an off-policy RL algorithm is presented that is implemented in real time and gives the approximate optimal control policy for each agent using only measured data. It is shown that the optimal distributed policies found by the proposed algorithm satisfy the global Nash equilibrium and synchronize all agents to the leader. Simulation results illustrate the effectiveness of the proposed method.

Index Terms— Graphical game, multiagent systems (MAS), neural network (NN), reinforcement learning (RL), synchronization.

I. INTRODUCTION

MULTIAGENT systems (MAS) have attracted compelling attention in the last two decades because of their potential applications in a variety of disciplines, including engineering, social science, and natural science [1]–[3]. The distributed synchronization problem of MAS has been considered with the goal of developing control protocols, based on the local information of each agent and its neighbors, that make all agents reach an agreement on certain quantities of interest or track a reference trajectory [3]–[6]. Most existing results on the synchronization of MAS do not impose optimality and are, therefore, generally far from optimal.

Optimal distributed control, considered in the literature [4], [7]–[12], is a desirable design because it minimizes a predefined performance function that captures the energy consumption, efficiency, and accuracy of the solution. However, these existing methods require either complete knowledge of the system dynamics [4], [7]–[10] or at least partial knowledge of the system dynamics [11]. Adaptive leader–follower synchronization is presented in [12] for MAS using the model reference adaptive control approach to avoid the requirement of
knowing the system dynamics. However, adaptive approaches are generally far from optimal, and this method still requires partial knowledge of the system dynamics.

Manuscript received March 8, 2016; revised July 3, 2016; accepted September 12, 2016. Date of publication March 8, 2016; date of current version September 15, 2017. This work was supported in part by the NSFC Project under Grant 61673280, Grant 61104093, Grant 61525302, Grant 61333012, Grant 61304028, Grant 61590922, and Grant 61503257, in part by the Open Project of State Key Laboratory of Synthetical Automation for Process Industries under Grant PAL-N201603, and in part by the Project of Liaoning Province under Grant LJQ2015088, Grant 2015020164, and Grant 2014020138.

J. Li is with the College of Information Engineering, Shenyang University of Chemical Technology, Shenyang 110142, China, and also with the State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China (e-mail: lijinna_721@126.com). H. Modares is with the Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65401 USA (e-mail: modaresh@mst.edu). T. Chai is with the State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China (e-mail: tychai@mail.neu.edu.cn). F. L. Lewis is with the UTA Research Institute, The University of Texas at Arlington, Arlington, TX 76118 USA, and also with Northeastern University, Shenyang 110819, China (e-mail: lewis@uta.edu). L. Xie is with the School of Electrical and Electronic Engineering, College of Engineering, Nanyang Technological University, Singapore 639798 (e-mail: elhxie@ntu.edu.sg).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2016.2609500

2162-237X © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

With the rapid development and extensive applications of digital sensor technology, extensive data carrying system information can be collected. It is desirable to use these data to develop model-free, data-based optimal control protocols [13]–[15]. The robust adaptive dynamic programming method is adopted in [15] to design decentralized optimal controllers for large-scale systems without requiring knowledge of the system dynamics. Note, however, that synchronization and distributed optimal controller design are beyond the scope of [15]. The reinforcement learning (RL) technique is a promising technique that can be used to learn optimal solutions to control problems in real time using only measured data along the system trajectories [16]–[18]. Distributed Q-learning (QD-learning) [19] and Q-learning [20], as data-based optimal control approaches, have been developed for learning optimal solutions to the control of MAS. However, these results are limited to Markov processes and discrete-time systems. To the best of our knowledge, the optimal synchronization of multiagent continuous-time (CT) systems with completely


unknown dynamics has not been considered yet. The data-driven optimal control problem of MAS is challenging because of the coupling of the agent dynamics that results from the data exchange between them.

In this paper, an off-policy RL algorithm is presented to learn optimal control protocols for CT MAS using only measured data. Off-policy RL and multiagent graphical games are brought together, where the dynamics and performance objective of each agent are affected by the agent itself and by its neighbors in a graphical topology. In off-policy RL [13], [21], [22], two different policies are used: the behavior policy, which is used to generate data for learning, and the target policy, which is evaluated and updated. This is in contrast to on-policy RL [8], [23], which requires the learning data to be generated by the same control policy as the one under evaluation. This greatly increases the information exploration ability during the learning process and results in data efficiency. Moreover, in off-policy RL, no dynamics model of the agents is required.

The contributions of this paper are as follows. Off-policy RL is used to solve the optimal synchronization problem in the framework of graphical games. No knowledge of the agent dynamics is required. To this end, a performance function is defined for each agent in terms of its local neighborhood tracking error and its control effort. It is shown that one must solve coupled Hamilton–Jacobi–Bellman (HJB) equations to minimize these performance functions. The solution to the HJB equations results in the synchronization of all agents to the leader while reaching a global Nash equilibrium. An off-policy RL algorithm is developed to approximate solutions to the HJB equations and learn the optimal control policy.

The organization of this paper is as follows. Section II introduces the graph theory concepts and some definitions that are used throughout this paper. Section III defines the optimal synchronization problem and investigates the global Nash equilibrium and the stability of the optimal solution. Section IV develops an off-policy RL algorithm to learn optimal controllers using data generated from the agents. Section V presents the simulation results. Finally, the conclusions are stated in Section VI.

Notations: R^n denotes the n-dimensional Euclidean space. ⊗ stands for the Kronecker product. Let X, X_i, and U be compact sets, and denote D ≜ {(δ, u, δ′) | δ, δ′ ∈ X, u ∈ U} and D_i ≜ {(δ_i, u, δ_i′) | δ_i, δ_i′ ∈ X_i, u ∈ U}. ⟨S_1(δ, u, δ′), S_2(δ, u, δ′)⟩_D ≜ ∫_D S_1^T(δ, u, δ′) S_2(δ, u, δ′) d(δ, u, δ′) denotes the inner product of the column vector functions S_1 and S_2.

II. PRELIMINARIES

In this section, we first introduce some notations and results from graph theory [4], [8]. The synchronization problem of MAS is then defined.

A. Communication Graph [4], [8]

Consider a graph G = (V, E) with a set of vertices V = {v_1, v_2, ..., v_N} and a set of edges or arcs E ⊆ V × V. E = [e_ij] is called the connectivity matrix, with e_ij > 0 if (v_j, v_i) ∈ E and e_ij = 0 otherwise. (v_j, v_i) ∈ E indicates that there exists an edge starting from vertex v_j to vertex v_i in a directed graph. A graph is called simple if there are no repeated edges or self-loops (v_i, v_i) ∈ E for ∀i. Denote the set of neighbors of node v_i as N_i = {v_j : (v_j, v_i) ∈ E}. D = diag(d_1, ..., d_N) is called the in-degree matrix, with the weighted degree d_i = Σ_j e_ij of node v_i (i.e., the ith row sum of E). Define the graph Laplacian matrix as L = D − E, in which the elements of every row sum to zero.

A directed path is a sequence of edges (v_i1, v_i2), (v_i2, v_i3), ..., (v_i(j−1), v_ij) with (v_i(l−1), v_il) ∈ E for l ∈ {2, ..., j}, which is a path starting from node v_i1 and ending at node v_ij. A directed graph is said to be strongly connected if there is a directed path from v_i to v_j for any distinct nodes v_i, v_j ∈ V.

B. Synchronization of Multiagent Systems

Consider N systems or agents with identical node dynamics

    ẋ_i = Ax_i + Bu_i                                        (1)

where x_i = x_i(t) ∈ R^n denotes the state vector and u_i = u_i(t) ∈ R^p (i = 1, 2, ..., N) denotes the control input. A and B are matrices of appropriate dimensions. The dynamics of the command generator or leader with state x_0 is given by

    ẋ_0 = Ax_0.                                              (2)

Assumption 1: The pair (A, B) is controllable.

The local neighborhood tracking error δ_i of agent i is defined as

    δ_i = Σ_{j∈N_i} e_ij (x_i − x_j) + g_i (x_i − x_0)        (3)

where g_i ≥ 0 is the pinning gain for agent i, with g_i > 0 if agent i has direct access to the leader and g_i = 0 otherwise.

Assumption 2: The graph is strongly connected and the leader is pinned to at least one node.

Let ξ = x − x̄_0 be the global synchronization error [8], where x̄_0 = 1 ⊗ x_0 and 1 = [1, 1, ..., 1]^T ∈ R^N. The global error vector of the MAS with the command generator is given from (3) by

    δ = ((L + G) ⊗ I_n) ξ                                    (4)

where G is a diagonal matrix with diagonal entries equal to the pinning gains g_i. Using (1), (2), and (3) yields the local neighborhood tracking error dynamics

    δ̇_i = Aδ_i + (d_i + g_i)Bu_i − Σ_{j∈N_i} e_ij Bu_j        (5)

which can be expressed in the compact form

    δ̇ = (I_N ⊗ A)δ + ((L + G) ⊗ B)u.                         (6)

This is the dynamics of the overall neighborhood errors, where δ = [δ_1^T δ_2^T ··· δ_N^T]^T and u = [u_1^T u_2^T ··· u_N^T]^T.

Synchronization Problem: Design local control protocols u_i in (1) to synchronize the states of all agents in G to the trajectory of the leader, i.e., lim_{t→∞} x_i(t) = x_0(t) for ∀i, i = 1, 2, ..., N, or, equivalently, lim_{t→∞} ξ(t) = lim_{t→∞} (x(t) − x̄_0) = 0.

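The constructions in (3)–(6) are easy to sanity-check numerically. Below is a minimal sketch in Python (numpy assumed); the three-agent graph, pinning gains, and matrices (A, B) are illustrative choices for the sketch, not values from the paper.

```python
import numpy as np

# Illustrative 3-agent graph (edge weights and pinning gains are assumptions).
E = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])           # connectivity matrix [e_ij]
D = np.diag(E.sum(axis=1))                # in-degree matrix, d_i = sum_j e_ij
Lap = D - E                               # graph Laplacian L = D - E
G = np.diag([1.0, 0.0, 0.0])              # only agent 1 is pinned to the leader
N, n = 3, 2
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
B = np.array([[0.0],
              [1.0]])

# Every row of the Laplacian sums to zero
assert np.allclose(Lap.sum(axis=1), 0.0)

# Assumption 2 holds for this graph, so sigma_min(L + G) > 0 (used in Remark 1)
assert np.linalg.svd(Lap + G, compute_uv=False).min() > 0

# Global error relation (4) and compact error dynamics (6)
xi = np.random.randn(N * n)               # stacked state errors x_i - x_0
u = np.random.randn(N * 1)                # stacked scalar control inputs
delta = np.kron(Lap + G, np.eye(n)) @ xi
delta_dot = np.kron(np.eye(N), A) @ delta + np.kron(Lap + G, B) @ u

# Block row i of the compact form (6) reproduces the local form (5),
# since (L+G)_{ii} = d_i + g_i and (L+G)_{ij} = -e_ij for j != i
i = 1
local = A @ delta[n*i:n*(i+1)] + (D[i, i] + G[i, i]) * (B @ u[i:i+1])
for j in range(N):
    local -= E[i, j] * (B @ u[j:j+1])
assert np.allclose(local, delta_dot[n*i:n*(i+1)])
```

The last check makes explicit why the per-agent errors (5) stack into the Kronecker form (6): the diagonal of L + G carries d_i + g_i while the off-diagonal entries carry −e_ij.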

Remark 1: Our objective is to make lim_{t→∞} ξ(t) = 0. In the subsequent development, we show how to make lim_{t→∞} δ(t) = 0. According to (4), one has ‖ξ(t)‖ ≤ (1/σ_min(L + G)) ‖δ(t)‖, where σ_min(L + G) denotes the smallest singular value of the matrix L + G. By Assumption 2, σ_min(L + G) > 0, so that lim_{t→∞} δ(t) = 0 ⇒ lim_{t→∞} ξ(t) = 0. Therefore, to solve the synchronization problem, one can design a control protocol for each agent i that guarantees the asymptotic stability of the local neighborhood tracking error dynamics (5). In Sections III and IV, it is shown how to design local control protocols that stabilize the error dynamics (5) in an optimal manner by minimizing a predefined performance function for each agent.

III. MULTIAGENT GRAPHICAL GAMES

In this section, the optimal synchronization of MAS on graphs is discussed in the framework of multiagent graphical games. It is shown how to find optimal protocols for every agent. It is also shown that the optimal response makes all agents synchronize to the leader and reach a global Nash equilibrium.

A. Graphical Game for Dynamical Multiagent Systems

Define a local quadratic performance index for each agent as

    J_i(δ_i(t_0), u_i, u_{−i}) = ∫_{t_0}^∞ [δ_i^T Q_i δ_i + u_i^T R_i u_i] dt        (7)

where Q_i and R_i are positive semidefinite and positive definite matrices, respectively, and (A, √Q_i) is observable. Minimizing (7) subject to (5) is a graphical game, since both the dynamics and the performance function of each agent depend on the agent and its neighbors [8]. In graphical games, the focus is on the global Nash equilibrium, which is defined as follows.

Definition 1 [24]: A global Nash equilibrium solution for an N-player game is given by an N-tuple of policies {u_1*, u_2*, ..., u_N*} if it satisfies

    J_i* ≜ J_i(δ_i(t_0), u_i*, u_{G−i}*) ≤ J_i(δ_i(t_0), u_i, u_{G−i}*)        (8)

for all i ∈ N and ∀u_i, u_{G−i} (u_{G−i} = {u_j : j ∈ V, j ≠ i}). The N-tuple of game values {J_1*, J_2*, ..., J_N*} is said to be a Nash equilibrium outcome of the N-player game.

From (5), one can see that the performance index (7) depends only on agent i and its neighbors. Thus, the global Nash equilibrium (8) can be written as J_i* ≜ J_i(δ_i(t_0), u_i*, u_{−i}*) ≤ J_i(δ_i(t_0), u_i, u_{−i}*), since J_i(δ_i(t_0), u_i*, u_{G−i}*) = J_i(δ_i(t_0), u_i*, u_{−i}*) and J_i(δ_i(t_0), u_i, u_{G−i}*) = J_i(δ_i(t_0), u_i, u_{−i}*), where u_{−i} = {u_j : j ∈ N_i}.

B. Coupled HJB Equations for Solving Graphical Games

Interpreting the control input u_i as a policy dependent on the local neighborhood tracking error δ_i(t), the value function corresponding to the performance index (7) is introduced as

    V_i(δ_i(t)) = ∫_t^∞ [δ_i^T Q_i δ_i + u_i^T R_i u_i] dτ.        (9)

By taking the derivative of V_i(δ_i(t)) with respect to time t along the trajectory of the local neighborhood tracking error δ_i(t), the Bellman equation is given in terms of the Hamiltonian function as

    H_i(δ_i, ∇V_i, u_i, u_{−i}) = δ_i^T Q_i δ_i + u_i^T R_i u_i
      + ∇V_i^T (Aδ_i + (d_i + g_i)Bu_i − Σ_{j∈N_i} e_ij Bu_j) = 0        (10)

where ∇V_i = ∂V_i/∂δ_i denotes the gradient. The optimal response of agent i to fixed policies u_{−i} can be derived by minimizing the Hamiltonian function with respect to u_i as

    u_i*(t) = arg min_{u_i} [H_i(δ_i, ∇V_i*, u_i, u_{−i})]        (11)

which yields

    u_i*(t) = −(1/2)(d_i + g_i) R_i^{−1} B^T ∇V_i*.        (12)

Let all neighbors of agent i select the control policies given by (12) and substitute (12) into (10); one then gets the following coupled cooperative game HJB equations:

    H_i(δ_i, ∇V_i*, u_i*, u_{−i}*)
      = δ_i^T Q_i δ_i + (1/4)(d_i + g_i)^2 (∇V_i*)^T B R_i^{−1} B^T ∇V_i*
      + (∇V_i*)^T (Aδ_i − (1/2)(d_i + g_i)^2 B R_i^{−1} B^T ∇V_i*
      + (1/2) Σ_{j∈N_i} e_ij (d_j + g_j) B R_j^{−1} B^T ∇V_j*) = 0.        (13)

We now show that, under a certain assumption, these coupled HJB equations can be simplified and resemble the coupled algebraic Riccati equations (AREs) that appear in standard linear quadratic multiplayer nonzero-sum games [24], [25].

Assumption 3: The cost function is quadratic and is given by V_i = δ_i^T P_i δ_i, where P_i is a positive definite matrix.

Using Assumption 3, (13) can be written in the form shown in Lemma 1.

Lemma 1: Under Assumption 3, the coupled cooperative game HJB equations (13) are equivalent to the coupled AREs

    2δ_i^T P_i^T (Aδ_i + Σ_{j∈N_i} e_ij (d_j + g_j) B R_j^{−1} B^T P_j δ_j)
      + δ_i^T Q_i δ_i − (d_i + g_i)^2 δ_i^T P_i^T B R_i^{−1} B^T P_i δ_i = 0        (14)

and the optimal response (12) becomes

    u_i*(t) = −(d_i + g_i) R_i^{−1} B^T P_i δ_i.        (15)

Proof: For the quadratic cost function V_i = δ_i^T P_i δ_i, one has ∇V_i = 2P_i δ_i. Substituting this into (12) leads to (15). On the other hand, substituting the optimal response (15) into the coupled cooperative game HJB equations (13) gives (14), and the proof is completed. ∎

Equation (14) is similar to the coupled AREs in [24] and [25] for standard nonzero-sum game problems.
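Both the minimization step (11)–(12) and the completing-the-square identity used later in the proof of Theorem 1 (cf. (18)) are finite-dimensional matrix algebra, so they can be verified numerically. The following sketch (numpy assumed; all matrices, gains, and the single-neighbor topology are illustrative assumptions, not values from the paper) checks that (15) coincides with (12) under Assumption 3, that it minimizes the Hamiltonian (10) over u_i, and that the difference of Hamiltonians matches the completed square.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 2
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, p))
Q = np.eye(n)
R = np.diag([1.0, 2.0])                  # symmetric positive definite
M = rng.standard_normal((n, n))
P = M @ M.T + n * np.eye(n)              # quadratic value V_i = delta^T P delta (Assumption 3)
di, gi, eij = 2.0, 1.0, 0.8              # in-degree, pinning gain, one neighbor weight
delta = rng.standard_normal(n)
gradV = 2 * P @ delta                    # gradient of the quadratic value

def H(ui, uj):
    """Hamiltonian (10) for a single neighbor j."""
    return (delta @ Q @ delta + ui @ R @ ui
            + gradV @ (A @ delta + (di + gi) * (B @ ui) - eij * (B @ uj)))

# (12) and (15) give the same optimal response under Assumption 3
u_star_12 = -0.5 * (di + gi) * np.linalg.solve(R, B.T @ gradV)
u_star_15 = -(di + gi) * np.linalg.solve(R, B.T @ (P @ delta))
assert np.allclose(u_star_12, u_star_15)

# u* minimizes H over u_i (R > 0 makes H strictly convex in u_i)
uj_fixed = rng.standard_normal(p)
for _ in range(200):
    u_pert = u_star_12 + 0.5 * rng.standard_normal(p)
    assert H(u_star_12, uj_fixed) <= H(u_pert, uj_fixed) + 1e-10

# Completing the square (cf. (18)): for any u_i, u_j and any neighbor response u_j*,
# H(u_i,u_j) - H(u_i*,u_j*) = (u_i-u_i*)^T R (u_i-u_i*)
#                           + (2/(d_i+g_i)) e_ij u_i*^T R (u_j-u_j*)
uj_star = rng.standard_normal(p)
ui, uj = rng.standard_normal(p), rng.standard_normal(p)
lhs = H(ui, uj) - H(u_star_12, uj_star)
rhs = ((ui - u_star_12) @ R @ (ui - u_star_12)
       + (2.0 / (di + gi)) * eij * (u_star_12 @ R @ (uj - uj_star)))
assert np.isclose(lhs, rhs)
```

The last identity is purely algebraic — it uses only B^T ∇V_i = −(2/(d_i + g_i)) R_i u_i* from (12) — which is why it holds for an arbitrary value gradient.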

Remark 2: As (15) shows, if the cost function V_i satisfies Assumption 3, then the derived optimal controller depends only on the local neighborhood error δ_i of agent i. Based on (3), it is concluded that the optimal response (15) is in fact distributed under Assumption 3.

Remark 3: It is noted that none of the upcoming analysis or proofs requires Assumption 3. However, if solutions can be found for (14), then these are also solutions of (13). In standard multiplayer nonzero-sum games, there is only one state dynamic equation, and it is known that the values are quadratic in the state [24], [25]. In graphical games, however, each agent has its own dynamics, and it has not been shown that the values are quadratic in the local states. That is, in general, Assumption 3 may not hold.

C. Stability and Global Nash Equilibrium for the Proposed Solution

To achieve the global Nash equilibrium, one needs to calculate the optimal response of every agent i by solving the N coupled partial differential HJB equations (13) for the N-player game problem. Theorem 1 shows that if all agents select their own optimal responses and the communication topology graph is strongly connected, then system (5) is asymptotically stable for all i (i = 1, 2, ..., N). Therefore, all agents synchronize. Meanwhile, all N agents are in global Nash equilibrium.

Theorem 1: Let Assumption 2 hold. Let V_i be smooth solutions to the HJB equations (13) and design the control policies u_i* as in (12). Then, the following holds.

1) System (5) is asymptotically stable, and therefore, by Assumption 2, all agents are synchronized to the leader.
2) [u_1*, u_2*, ..., u_N*] are global Nash equilibrium policies, and the corresponding Nash equilibrium outcomes are J_i*(δ_i(0)) = V_i (i = 1, 2, ..., N).

Proof:
1) Let V_i be Lyapunov function candidates. Taking the derivative of V_i with respect to time t along the trajectory of the local neighborhood tracking error δ_i(t), one has

    dV_i/dt = ∇V_i^T δ̇_i = ∇V_i^T (Aδ_i + (d_i + g_i)Bu_i − Σ_{j∈N_i} e_ij Bu_j).        (16)

Using (10), this becomes

    dV_i/dt + δ_i^T Q_i δ_i + u_i^T R_i u_i = H_i(δ_i, ∇V_i, u_i, u_{−i}).        (17)

On the other hand, based on (10) and (13) and using the same development as in [26], one can get

    H_i(δ_i, ∇V_i, u_i, u_{−i}) = (u_i − u_i*)^T R_i (u_i − u_i*)
      + (2/(d_i + g_i)) Σ_{j∈N_i} e_ij (u_i*)^T R_i (u_j − u_j*).        (18)

Therefore

    dV_i/dt + δ_i^T Q_i δ_i + u_i^T R_i u_i = (u_i − u_i*)^T R_i (u_i − u_i*)
      + (2/(d_i + g_i)) Σ_{j∈N_i} e_ij (u_i*)^T R_i (u_j − u_j*).        (19)

Selecting u_i = u_i* and u_j = u_j* gives

    dV_i/dt + δ_i^T Q_i δ_i + u_i^T R_i u_i = 0.        (20)

Since Q_i ≥ 0 and R_i > 0, dV_i/dt < 0 holds for all agents. Therefore, system (5) is asymptotically stable, and so all agents synchronize to the leader.

2) Since 1) holds for the selected control policies, δ_i(t) → 0 as t → ∞. For Lyapunov functions V_i(δ_i) (i = 1, 2, ..., N) satisfying V_i(0) = 0, we have V_i(δ_∞) = 0. Thus, the performance index (7) can be written as

    J_i(δ_i(0), u_i, u_{−i}) = ∫_0^∞ [δ_i^T Q_i δ_i + u_i^T R_i u_i] dt
      + V_i(δ_i(0)) + ∫_0^∞ V̇_i dt        (21)

or

    J_i(δ_i(0), u_i, u_{−i}) = ∫_0^∞ [δ_i^T Q_i δ_i + u_i^T R_i u_i] dt + V_i(δ_i(0))
      + ∫_0^∞ ∇V_i^T (Aδ_i + (d_i + g_i)Bu_i − Σ_{j∈N_i} e_ij Bu_j) dt.        (22)

If the V_i satisfy (13) and u_i*, u_{−i}*, given by (12), are the optimal control policies, then by completing the square one has

    J_i(δ_i(0), u_i, u_{−i})
      = V_i(δ_i(0)) + ∫_0^∞ [u_i^T R_i u_i + ∇V_i^T (d_i + g_i)Bu_i − ∇V_i^T Σ_{j∈N_i} e_ij Bu_j
        − (u_i*)^T R_i u_i* − ∇V_i^T (d_i + g_i)Bu_i* + ∇V_i^T Σ_{j∈N_i} e_ij Bu_j*] dt
      = V_i(δ_i(0)) + ∫_0^∞ [(u_i − u_i*)^T R_i (u_i − u_i*)
        − ∇V_i^T Σ_{j∈N_i} e_ij B(u_j − u_j*)] dt.        (23)

If u_i = u_i* and u_j = u_j*, then J_i(δ_i(0), u_i*, u_{−i}*) = V_i(δ_i(0)). Setting u_j = u_j* (j ∈ N_i) in (23) yields

    J_i(δ_i(0), u_i, u_{−i}*) = V_i(δ_i(0)) + ∫_0^∞ (u_i − u_i*)^T R_i (u_i − u_i*) dt

then it is clear that J_i*(δ_i(0), u_i*, u_{−i}*) ≤ J_i(δ_i(0), u_i, u_{−i}*) holds for all i (i = 1, 2, ..., N). Therefore, the global Nash equilibrium is reached, and the proof is completed. ∎

From the coupled cooperative game HJB equations (13), one can see that designing optimal control policies for the agents requires resolving two issues. The first is that the coupled cooperative game HJB equations are N nonlinear partial differential equations, which makes them hard or even impossible to solve analytically. The second is that the system matrices A and B need to be completely known to find the solutions. An off-policy RL algorithm is designed in Section IV to overcome these difficulties.

IV. OFF-POLICY REINFORCEMENT LEARNING ALGORITHM

In [8], the graphical game was solved, but full knowledge of all agent dynamics is needed. Off-policy RL allows the solution of optimality problems without any knowledge of the agent dynamics. This section presents an off-policy learning algorithm for the synchronization of MAS that does not require any knowledge of the dynamics. To this end, off-policy Bellman equations are first derived; then, an actor-critic neural network (NN) structure is used to evaluate the value function and find an improved control policy for each agent. Finally, an iterative off-policy RL algorithm is given to learn approximate optimal control policies that make the MAS reach the global Nash equilibrium and, meanwhile, guarantee the synchronization of all agents to the command generator. In off-policy RL, a behavior policy is applied to the system to generate the data for learning, and a different policy, called the target policy, is evaluated and updated using the measured data.

A. Derivation of Off-Policy Bellman Equations

In this section, a model-based RL algorithm [8] is first given, which is used for comparison with the results given by the off-policy RL algorithm. Then, off-policy Bellman equations are derived, and it is shown that they have the same solution as the coupled cooperative game HJB equations (13).

Definition 2 [8]: For a given system (5), a control input u_i is said to be admissible with respect to the cost (7) if u_i is continuous, u_i(0) = 0, u_i stabilizes system (5), and V_i is finite.

Algorithm 1 is presented to learn the optimal control policies by using knowledge of the system models.

Algorithm 1 Model-Based On-Policy Reinforcement Learning
1. Initialize the agents with admissible control policies u_i^(0) for ∀i and set s = 0;
2. Evaluate the policies by solving for V_i^(s+1):

    H_i(δ_i, ∇V_i^(s+1), u_i^(s), u_{−i}^(s)) = δ_i^T Q_i δ_i + [u_i^(s)]^T R_i u_i^(s)
      + [∇V_i^(s+1)]^T (Aδ_i + (d_i + g_i)Bu_i^(s) − Σ_{j∈N_i} e_ij Bu_j^(s)) = 0        (24)

where s denotes the iteration index;
3. Find improved control policies using

    u_i^(s+1) = −(1/2)(d_i + g_i) R_i^{−1} B^T ∇V_i^(s+1).        (25)

4. Stop when ‖V_i^(s) − V_i^(s+1)‖ ≤ ε with a small constant ε.

Remark 4: Vamvoudakis et al. [8] showed that, under a weak coupling assumption, Algorithm 1 converges to the solution of the coupled cooperative game HJB equations (13) if all agents update their control policies in terms of (25) at each iteration. This conclusion holds under the condition that the initial control policies are admissible.

Algorithm 1 provides a method to learn control policies that achieve the global Nash equilibrium and synchronization. However, Algorithm 1 requires knowledge of the agent dynamics during the iterative process. To obviate this requirement and present a model-free approach, off-policy Bellman equations are derived in the following.

An off-policy RL algorithm is then provided to learn the solutions of the coupled cooperative game HJB equations (13) and obtain distributed approximate optimal control policies. Introducing the auxiliary variables u_i^(s) and u_{−i}^(s) into the agent dynamics (5), one has

    δ̇_i = Aδ_i + (d_i + g_i)Bu_i^(s) − Σ_{j∈N_i} e_ij Bu_j^(s)
      + (d_i + g_i)B(u_i − u_i^(s)) − Σ_{j∈N_i} e_ij B(u_j − u_j^(s)).        (26)

In (26), the u_i are interpreted as behavior policies actually applied to the system. By contrast, the u_i^(s) are the target policies learned in Algorithm 1. Differentiating V_i^(s+1) along the trajectories of (26) yields

    dV_i^(s+1)(δ_i)/dt = [∇V_i^(s+1)]^T δ̇_i
      = [∇V_i^(s+1)]^T (Aδ_i + (d_i + g_i)Bu_i^(s) − Σ_{j∈N_i} e_ij Bu_j^(s))
      + [∇V_i^(s+1)]^T ((d_i + g_i)B(u_i − u_i^(s)) − Σ_{j∈N_i} e_ij B(u_j − u_j^(s))).        (27)

Using (24) and (25) in (27) gives

    dV_i^(s+1)(δ_i)/dt = −δ_i^T Q_i δ_i − [u_i^(s)]^T R_i u_i^(s)
      − 2[u_i^(s+1)]^T R_i (u_i − u_i^(s))
      + 2(d_i + g_i)^{−1} Σ_{j∈N_i} e_ij [u_i^(s+1)]^T R_i (u_j − u_j^(s)).        (28)
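The rewriting (26)–(28) can be exercised on the simplest instance: a scalar agent pinned directly to the leader (d_i = 0, g_i = 1, no neighbors), for which (24) and (25) have closed forms. Integrating (28) along a trajectory generated by an exploratory behavior policy must then reproduce the change in V_i^(s+1) — exactly the data-based relation developed next. A sketch (numpy assumed; all constants are illustrative):

```python
import numpy as np

# Scalar agent pinned to the leader: d_i + g_i = 1, no neighbors.
a, b, q, r = 1.0, 1.0, 1.0, 1.0
k = -2.0                                       # target policy u^(s) = k*delta (a + b*k < 0, admissible)
p_val = -(q + r * k**2) / (2 * (a + b * k))    # V^(s+1) = p*delta^2 solves (24); here p = 2.5
w = -b * p_val / r                             # improved policy u^(s+1) = w*delta, from (25)

dt, T = 1e-4, 1.0
delta = 1.0
v_start = p_val * delta**2
integral = 0.0
for step in range(int(T / dt)):
    t = step * dt
    u_beh = k * delta + 0.5 * np.sin(3 * t)    # exploratory behavior policy actually applied
    u_tgt = k * delta                          # target policy under evaluation
    u_imp = w * delta                          # improved policy
    # integrand obtained by integrating (28): (q*d^2 + r*u_s^2) - 2*r*u_imp*(u_s - u)
    integral += ((q * delta**2 + r * u_tgt**2)
                 - 2 * r * u_imp * (u_tgt - u_beh)) * dt
    delta += (a * delta + b * u_beh) * dt      # Euler step of the true dynamics (5)

lhs = v_start - p_val * delta**2               # V(delta(t)) - V(delta(t'))
assert abs(lhs - integral) < 1e-2              # off-policy Bellman relation holds along the data
```

Note that the system matrices a and b enter only through the simulated trajectory; the integrand itself uses measured δ and the three policies, which is precisely what makes the relation model-free.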

The integral RL idea [23] is now used to develop Bellman equations for evaluating the value of the target policies. Integrating both sides of (28) on the interval [t, t′] yields the following off-policy Bellman equations:

    V_i^(s+1)(δ_i) − V_i^(s+1)(δ_i′)
      = ∫_t^{t′} [δ_i(τ)^T Q_i δ_i(τ) + u_i^(s)(δ_i(τ))^T R_i u_i^(s)(δ_i(τ))] dτ
      − 2 ∫_t^{t′} u_i^(s+1)(δ_i(τ))^T R_i [u_i^(s)(δ_i(τ)) − u_i(δ_i(τ))] dτ
      + 2(d_i + g_i)^{−1} Σ_{j∈N_i} e_ij ∫_t^{t′} u_i^(s+1)(δ_i(τ))^T R_i [u_j^(s)(δ_j(τ)) − u_j(δ_j(τ))] dτ        (29)

where δ_i = δ_i(t) and δ_i′ = δ_i(t′).

Theorem 2 shows that the value function and the improved policy found by solving the off-policy Bellman equations (29) are identical to the value function found by (24) and the improved policy found by (25), simultaneously. Let ϑ_i(δ_i) be the set of all admissible control policies for agent i.

Theorem 2: Let V_i^(s+1)(δ_i) satisfy V_i^(s+1)(δ_i) ≥ 0, V_i^(s+1)(0) = 0, and u_i^(s+1)(δ_i) ∈ ϑ_i(δ_i). Then, the solution (V_i^(s+1)(δ_i), u_i^(s+1)(δ_i)) to (29) is identical to the solution V_i^(s+1)(δ_i) to (24) and u_i^(s+1)(δ_i) to (25), at the same time.

Proof:
1) Sufficiency: If V_i^(s+1)(0) = 0 and u_i^(s+1)(δ_i) satisfy (24) and (25), then, by taking the derivative of (29), one concludes that they are a solution to (29).
2) Necessity: The necessity proof is completed if the uniqueness of the solution of (29) is shown. It is now shown by contradiction that the solution to (29) is unique. Suppose that (29) has another solution (W_i(δ_i), v_i(δ_i)), where W_i(δ_i) ≥ 0, W_i(0) = 0, and v_i(δ_i) ∈ ϑ_i(δ_i). For any function p(t), one has [13]

    lim_{Δt→0} (1/Δt) ∫_t^{t+Δt} p(τ) dτ
      = lim_{Δt→0} (1/Δt) [∫_0^{t+Δt} p(τ) dτ − ∫_0^t p(τ) dτ]
      = (d/dt) ∫_0^t p(τ) dτ = p(t).        (30)

Taking the derivative of V_i^(s+1)(δ_i) in (29) and using (30) gives

    dV_i^(s+1)(δ_i)/dt
      = lim_{Δt→0} (1/Δt) [V_i^(s+1)(δ_i(t + Δt)) − V_i^(s+1)(δ_i(t))]
      = 2 lim_{Δt→0} (1/Δt) ∫_t^{t+Δt} u_i^(s+1)(δ_i(τ))^T R_i [u_i^(s)(δ_i(τ)) − u_i(δ_i(τ))] dτ
      − 2(d_i + g_i)^{−1} lim_{Δt→0} (1/Δt) Σ_{j∈N_i} e_ij ∫_t^{t+Δt} u_i^(s+1)(δ_i(τ))^T R_i [u_j^(s)(δ_j(τ)) − u_j(δ_j(τ))] dτ
      − lim_{Δt→0} (1/Δt) ∫_t^{t+Δt} [δ_i^T(τ) Q_i δ_i(τ) + u_i^(s)(δ_i(τ))^T R_i u_i^(s)(δ_i(τ))] dτ        (31)

which is equivalent to (29). Since (W_i(δ_i), v_i(δ_i)) is also assumed to be a solution to (29), we likewise have

    dW_i(δ_i)/dt = −δ_i^T Q_i δ_i − [u_i^(s)]^T R_i u_i^(s)
      − 2[v_i^(s+1)]^T R_i (u_i − u_i^(s))
      + 2(d_i + g_i)^{−1} Σ_{j∈N_i} e_ij [v_i^(s+1)]^T R_i (u_j − u_j^(s)).        (32)

Subtracting (32) from (31) yields

    d(V_i^(s+1)(δ_i) − W_i(δ_i))/dt
      = 2[u_i^(s+1) − v_i^(s+1)]^T R_i (u_i^(s) − u_i)
      − 2(d_i + g_i)^{−1} Σ_{j∈N_i} e_ij [u_i^(s+1) − v_i^(s+1)]^T R_i (u_j^(s) − u_j)        (33)

which holds for any u_i and u_{−i}. Letting u_i = u_i^(s) and u_j = u_j^(s) (j ∈ N_i) yields

    d(V_i^(s+1)(δ_i) − W_i(δ_i))/dt = 0.        (34)

Thus, V_i^(s+1)(δ_i) = W_i(δ_i) + c, where c is a real constant. Since V_i^(s+1)(0) = 0 and W_i(0) = 0, we have c = 0, and thus V_i^(s+1)(δ_i) = W_i(δ_i) for ∀δ_i. Since V_i^(s+1) = W_i holds for ∀δ_i, the left-hand side of (33) vanishes for all inputs, and hence the following holds for ∀u_i and u_j, j ∈ N_i:

    2[u_i^(s+1) − v_i^(s+1)]^T R_i [(u_i^(s) − u_i) − (d_i + g_i)^{−1} Σ_{j∈N_i} e_ij (u_j^(s) − u_j)] = 0.        (35)

Since this holds for arbitrary u_i ∈ ϑ_i(δ_i) and u_j ∈ ϑ_j(δ_j), it follows that u_i^(s+1) = v_i^(s+1). This contradicts the assumption that (W_i, v_i) is a different solution, and the proof is completed. ∎

Based on Theorem 2, the following model-free off-policy RL Algorithm 2 is presented.

Algorithm 2 Off-Policy RL for Multiagent Games
1. Begin with admissible initial control policies u_i^(0) for ∀i, and set s = 0;
2. Solve for V_i^(s+1) and u_i^(s+1) from the off-policy Bellman equation (29), where s denotes the iteration index;
3. Stop when ‖V_i^(s) − V_i^(s+1)‖ ≤ ε.

Remark 5: How to find admissible initial control policies for systems with completely unknown system dynamics has been clarified in [27, Remark 11]. If system (5) can be a priori

known to be itself stable, then the initial policies can be chosen as $u_i^{(0)} = 0$, which guarantees the admissibility of the initial policies without requiring any knowledge of the dynamics of system (5). Otherwise, suppose that system (5) has nominal models $A_N$ and $B_N$ satisfying $A = A_N + \Delta A$ and $B = B_N + \Delta B$, where $\Delta A$ and $\Delta B$ are the unknown parts of system (5). In this case, robust control methods, such as $H_\infty$ control with the nominal models $A_N$ and $B_N$, can be used to yield an admissible initial policy.

Remark 6: Since Theorem 2 shows that the solution of Algorithm 1 with (24) and (25) is equivalent to the solution of the off-policy Bellman equations (29), the convergence of Algorithm 2 is guaranteed, because [8] and [28] have proved that the RL Algorithm 1 converges. In contrast to (24) and (25), (29) does not need knowledge of the agent dynamics $(A, B)$.

Remark 7: Off-policy Algorithm 2 can learn the solution to the HJB equations (13) without requiring any knowledge of the agent dynamics, because the Bellman equations (29) do not contain the system matrices $A$ and $B$ of the agents. The information about the agent dynamics is instead embedded in the local neighborhood error $\delta_i$ and the control inputs $u_i$ and $u_{-i}$. The most prominent advantage of learning the optimal solution using (29) is that the resulting optimal control protocol does not suffer from model inaccuracies or simplifications made in identifying system models.

B. Using Actor-Critic Structure for Off-Policy Learning in Graphical Games

To solve for $V_i^{(s+1)}$ and $u_i^{(s+1)}$ in (29), a multiple actor-critic NN-based approach is developed. According to the Weierstrass high-order approximation theorem [29], [30], $V_i^{(s+1)}$ and $u_i^{(s+1)}$ can be approximately expressed as linear combinations of independent basis functions. Thus, the following critic NN and actor NN are constructed to approximate the optimal cost function $V_i^*$ and the optimal control law $u_i^*$:

$$\hat{V}_i^{(s)}(\delta_i) = \phi_i(\delta_i)^T W_{vi}^{(s)} \qquad (36)$$

$$\hat{u}_{il}^{(s)}(\delta_i) = \psi_{il}(\delta_i)^T W_{u_{il}}^{(s)} \qquad (37)$$

where $\phi_i(\delta_i) \in \mathbb{R}^{h_v}$ is the activation function vector with $h_v$ neurons in the critic NN hidden layer of agent $i$, and $\psi_{il}(\delta_i) \in \mathbb{R}^{h_{u_i}^l}$ $(l = 1, 2, \ldots, p)$ is the activation function vector with $h_{u_i}^l$ neurons in the $l$th subactor NN hidden layer of agent $i$. For all $s = 0, 1, \ldots$, $W_{vi}^{(s)}$ and $W_{u_{il}}^{(s)}$ are the weight vectors of the critic and actor NNs, respectively. Expression (37) can be rewritten in the compact form

$$\hat{u}_i^{(s)}(\delta_i) = \big[\hat{u}_{i1}^{(s)}(\delta_i)\ \ \hat{u}_{i2}^{(s)}(\delta_i)\ \cdots\ \hat{u}_{ip}^{(s)}(\delta_i)\big]^T = \big[\psi_{i1}(\delta_i)^T W_{u_{i1}}^{(s)}\ \ \psi_{i2}(\delta_i)^T W_{u_{i2}}^{(s)}\ \cdots\ \psi_{ip}(\delta_i)^T W_{u_{ip}}^{(s)}\big]^T. \qquad (38)$$

If we make Assumption 3 that the value function is quadratic, of the form $V_i = \delta_i^T P_i \delta_i$, for linear system (5) with multiple control inputs, then the activation function vectors of the critic NNs and actor NNs are in fact $\phi_i(\delta_i)^T = \delta_i^T \otimes \delta_i^T$ and $\psi_{il}(\delta_i)^T = \delta_i^T$, respectively.

To estimate the solution $(V_i^{(s+1)}(\delta_i), u_i^{(s+1)}(\delta_i))$, the weighted residual method is used here [13], [23]. To this end, $V_i^{(s+1)}(\delta_i)$ and $u_i^{(s+1)}(\delta_i)$ are, respectively, replaced by $\hat{V}_i^{(s+1)}(\delta_i)$ and $\hat{u}_i^{(s+1)}(\delta_i)$ to yield the following residual error, where $t' > t$ denotes the end of the integration interval and $\delta_i'(t) \triangleq \delta_i(t')$:

$$
\begin{aligned}
\sigma_i^{(s)}(\delta_i(t), u(t), \delta_i'(t))
&= \big(\phi_i(\delta_i'(t)) - \phi_i(\delta_i(t))\big)^T W_{vi}^{(s+1)} \\
&\quad + 2\sum_{l_1=1}^{p}\sum_{l_2=1}^{p} r_{l_1,l_2} \int_t^{t'} \big(\psi_{il_1}(\delta_i(\tau))^T W_{u_{il_1}}^{(s)} - u_{il_1}(\tau)\big)\, (\psi_{il_2}(\delta_i(\tau)))^T W_{u_{il_2}}^{(s+1)} \, d\tau \\
&\quad - 2(d_i + g_i)^{-1}\sum_{l_1=1}^{p}\sum_{l_2=1}^{p} r_{l_1,l_2} \sum_{j \in N_i} e_{ij} \int_t^{t'} \big(\psi_{jl_1}(\delta_j(\tau))^T W_{u_{jl_1}}^{(s)} - u_{jl_1}(\tau)\big)\, (\psi_{il_2}(\delta_i(\tau)))^T W_{u_{il_2}}^{(s+1)} \, d\tau \\
&\quad - \int_t^{t'} \delta_i^T(\tau) Q_i \delta_i(\tau) \, d\tau - \sum_{l_1=1}^{p}\sum_{l_2=1}^{p} r_{l_1,l_2} \int_t^{t'} W_{u_{il_1}}^{(s)T} \psi_{il_1}(\delta_i(\tau))\, (\psi_{il_2}(\delta_i(\tau)))^T W_{u_{il_2}}^{(s)} \, d\tau. \qquad (39)
\end{aligned}
$$

The above expression can be rewritten as

$$\sigma_i^{(s)}(\delta_i(t), u(t), \delta_i'(t)) = \bar{\rho}_i^{(s)}(\delta_i(t), u(t), \delta_i'(t))\, W_i^{(s+1)} - \pi_i^{(s)}(\delta_i(t)) \qquad (40)$$

where

$$W_i^{(s+1)} \triangleq \big[ W_{vi}^{(s+1)T}\ \ W_{u_{i1}}^{(s+1)T}\ \cdots\ W_{u_{ip}}^{(s+1)T} \big]^T$$

$$\pi_i^{(s)}(\delta_i) \triangleq \rho_Q(\delta_i) + \sum_{l_1=1}^{p}\sum_{l_2=1}^{p} r_{l_1,l_2}\, W_{u_{il_1}}^{(s)T} \rho_\psi^{l_1,l_2}(\delta_i)\, W_{u_{il_2}}^{(s)}$$

$$\bar{\rho}_i^{(s)}(\delta_i(t), u(t), \delta_i'(t)) \triangleq \big[ \rho_\phi(\delta_i(t), \delta_i'(t))^T\ \ 2\theta^{(s)1}(\delta_i(t), \delta_j(t), u(t))\ \cdots\ 2\theta^{(s)p}(\delta_i(t), \delta_j(t), u(t)) \big]$$

$$
\begin{aligned}
\theta^{(s)l_2}(\delta_i(t), \delta_j(t), u(t)) &\triangleq \sum_{l_1=1}^{p} r_{l_1,l_2} \Big[ W_{u_{il_1}}^{(s)T} \rho_\psi^{l_1,l_2}(\delta_i(t)) - \rho_{u,\psi}^{l_1,l_2}(\delta_i(t), u_i(t)) \\
&\qquad - 2(d_i + g_i)^{-1} \sum_{j \in N_i} e_{ij} \big( W_{u_{jl_1}}^{(s)T} \rho_\psi^{l_1,l_2}(\delta_i(t), \delta_j(t)) - \rho_{u,\psi}^{l_1,l_2}(\delta_i(t), u_j(t)) \big) \Big] \quad (l_2 = 1, 2, \ldots, p)
\end{aligned}
$$

with

$$\rho_\phi(\delta_i(t), \delta_i'(t)) \triangleq \phi_i(\delta_i'(t)) - \phi_i(\delta_i(t))$$

$$\rho_Q(\delta_i) \triangleq \int_t^{t'} \delta_i^T(\tau) Q_i \delta_i(\tau)\, d\tau$$

$$\rho_\psi^{l_1,l_2}(\delta_i(t)) \triangleq \int_t^{t'} \psi_{i,l_1}(\delta_i(\tau))\,(\psi_{i,l_2}(\delta_i(\tau)))^T\, d\tau$$

$$\rho_{u,\psi}^{l_1,l_2}(\delta_i(t), u_i(t)) \triangleq \int_t^{t'} u_{i,l_1}(\tau)\,(\psi_{i,l_2}(\delta_i(\tau)))^T\, d\tau$$
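As a concrete illustration of (36) and (37) under Assumption 3, the critic basis is the vector of independent quadratic monomials of the local error $\delta_i$ and each actor basis is $\delta_i$ itself. The following NumPy sketch is ours, not code from the paper; the helper names `phi`, `psi`, `V_hat`, and `u_hat` are hypothetical:

```python
import numpy as np

def phi(delta):
    """Critic basis under Assumption 3: the independent quadratic monomials
    of delta (the distinct entries of delta @ delta.T), cf. (36)."""
    d = np.asarray(delta, dtype=float)
    n = d.size
    return np.array([d[a] * d[b] for a in range(n) for b in range(a, n)])

def psi(delta):
    """Actor basis under Assumption 3: psi_il(delta) = delta, cf. (37)."""
    return np.asarray(delta, dtype=float)

def V_hat(delta, W_v):
    """Critic NN output V(delta) = phi(delta)^T W_v."""
    return float(phi(delta) @ W_v)

def u_hat(delta, W_u):
    """One actor-channel output u(delta) = psi(delta)^T W_u."""
    return float(psi(delta) @ W_u)

# a fourth-order local neighborhood error gives h_v = 10 critic neurons
delta = np.array([1.0, -2.0, 0.5, 0.0])
print(len(phi(delta)))                                 # → 10
print(V_hat(delta, np.ones(10)))                       # → 2.75
print(u_hat(delta, np.array([0.1, 0.2, -0.3, 0.4])))   # → -0.45
```

The weight vectors play the role of $W_{vi}^{(s)}$ and $W_{u_{il}}^{(s)}$; the learning updates of Section IV-B estimate them from data rather than fixing them as here.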
$$\rho_\psi^{l_1,l_2}(\delta_i(t), \delta_j(t)) \triangleq \int_t^{t'} \psi_{j,l_1}(\delta_j(\tau))\,(\psi_{i,l_2}(\delta_i(\tau)))^T\, d\tau$$

$$\rho_{u,\psi}^{l_1,l_2}(\delta_i(t), u_j(t)) \triangleq \int_t^{t'} u_{j,l_1}(\tau)\,(\psi_{i,l_2}(\delta_i(\tau)))^T\, d\tau$$

where $l_1, l_2 = 1, 2, \ldots, p$, and $m_i = h_v + \sum_{l=1}^{p} h_{u_i}^l$ is the length of the vector $\bar{\rho}_i^{(s)}$.

To compute the weight vector $W_i^{(s+1)}$, the residual error $\sigma_i^{(s)}(\delta_i, u, \delta_i')$ is projected onto $d\sigma_i^{(s)}(\delta_i, u, \delta_i')/dW_i^{(s+1)}$ and the projection is set to zero on domain $D_i$, that is

$$\big\langle d\sigma_i^{(s)}(\delta_i, u, \delta_i')/dW_i^{(s+1)},\ \sigma_i^{(s)}(\delta_i, u, \delta_i') \big\rangle_{D_i} = 0. \qquad (41)$$

Using (40) and the definition of the derivative, (41) becomes

$$W_i^{(s+1)} = \big\langle \bar{\rho}_i^{(s)}(\delta_i(t), u(t), \delta_i'(t)),\ \bar{\rho}_i^{(s)}(\delta_i(t), u(t), \delta_i'(t)) \big\rangle_{D_i}^{-1} \cdot \big\langle \bar{\rho}_i^{(s)}(\delta_i(t), u(t), \delta_i'(t)),\ \pi_i^{(s)}(\delta_i(t)) \big\rangle_{D_i}. \qquad (42)$$

Since $\langle \bar{\rho}_i^{(s)}, \bar{\rho}_i^{(s)} \rangle_{D_i}$ and $\langle \bar{\rho}_i^{(s)}, \pi_i^{(s)} \rangle_{D_i}$ can be, respectively, approximated as

$$\big\langle \bar{\rho}_i^{(s)}, \bar{\rho}_i^{(s)} \big\rangle_{D_i} \approx \frac{I_{D_i}}{M_i}\, Z_i^{(s)T} Z_i^{(s)} \qquad (43)$$

and

$$\big\langle \bar{\rho}_i^{(s)}, \pi_i^{(s)} \big\rangle_{D_i} \approx \frac{I_{D_i}}{M_i}\, Z_i^{(s)T} \eta_i^{(s)} \qquad (44)$$

we have

$$W_i^{(s+1)} = \big( Z_i^{(s)T} Z_i^{(s)} \big)^{-1} Z_i^{(s)T} \eta_i^{(s)} \qquad (45)$$

where

$$Z_i^{(s)} = \big[ \bar{\rho}_i^{(s)}(\delta_{i,1}, u_1, \delta_{i,1}')^T\ \cdots\ \bar{\rho}_i^{(s)}(\delta_{i,M_i}, u_{M_i}, \delta_{i,M_i}')^T \big]^T$$

$$\eta_i^{(s)} = \big[ \pi_i^{(s)}(\delta_{i,1}),\ \ldots,\ \pi_i^{(s)}(\delta_{i,M_i}) \big]^T$$

$$I_{D_i} \triangleq \int_{D_i} d\big(\delta_i(t), u(t), \delta_i'(t)\big)$$

$$\delta_{M_i} \triangleq \big\{ (\delta_{i,k}, u_k, \delta_{i,k}') \mid (\delta_{i,k}, u_k, \delta_{i,k}') \in D_i,\ k = 1, 2, \ldots, M_i \big\}$$

and $M_i$ is the size of the sample set $\delta_{M_i}$.

The data in the sample set $\delta_{M_i}$ are collected from the real application system on domain $D_i$. Exploratory prescribed behavior policies can be used to assure that the data set is rich. To compute the weight vector $W_i^{(s+1)}$, the trapezoidal rule is used to approximate the definite integrals appearing in (45). Thus, with $\Delta t = t' - t$ the length of each integration interval and primed quantities evaluated at the end of the interval, we have

$$\rho_\phi(\delta_{i,k}, \delta_{i,k}') \approx \phi_i(\delta_{i,k}') - \phi_i(\delta_{i,k})$$

$$\rho_Q(\delta_{i,k}) \approx \frac{\Delta t}{2}\big( \delta_{i,k}^T Q_i \delta_{i,k} + \delta_{i,k}'^T Q_i \delta_{i,k}' \big)$$

$$\rho_\psi^{l_1,l_2}(\delta_{i,k}) \approx \frac{\Delta t}{2}\big( \psi_{i,l_1}(\delta_{i,k})(\psi_{i,l_2}(\delta_{i,k}))^T + \psi_{i,l_1}(\delta_{i,k}')(\psi_{i,l_2}(\delta_{i,k}'))^T \big)$$

$$\rho_{u,\psi}^{l_1,l_2}(\delta_{i,k}, u_{i,k}) \approx \frac{\Delta t}{2}\big( u_{i,l_1,k}(\psi_{i,l_2}(\delta_{i,k}))^T + u_{i,l_1,k}'(\psi_{i,l_2}(\delta_{i,k}'))^T \big)$$

$$\rho_\psi^{l_1,l_2}(\delta_{i,k}, \delta_{j,k}) \approx \frac{\Delta t}{2}\big( \psi_{j,l_1}(\delta_{j,k})(\psi_{i,l_2}(\delta_{i,k}))^T + \psi_{j,l_1}(\delta_{j,k}')(\psi_{i,l_2}(\delta_{i,k}'))^T \big)$$

$$\rho_{u,\psi}^{l_1,l_2}(\delta_{i,k}, u_{j,k}) \approx \frac{\Delta t}{2}\big( u_{j,l_1,k}(\psi_{i,l_2}(\delta_{i,k}))^T + u_{j,l_1,k}'(\psi_{i,l_2}(\delta_{i,k}'))^T \big).$$

Remark 8: Solving (45) for the NN weights $W_i^{(s+1)}$ requires $Z_i^{(s)} \in \mathbb{R}^{M_i \times m_i}$ to be of full column rank. Therefore, the size $M_i$ of the sample set $\delta_{M_i}$ should be at least $m_i = h_v + \sum_{l=1}^{p} h_{u_i}^l$, i.e., $M_i \ge m_i$. This requires satisfaction of a proper persistence of excitation condition to assure that $\mathrm{Rank}(Z_i^{(s)}) = m_i$ [8], [13], [21]–[23].

Using the approximate control policy (37), the closed-loop system (5) is represented by

$$\dot{\delta}_i = A\delta_i + (d_i + g_i) B \hat{u}_i^{(s)} - \sum_{j \in N_i} e_{ij} B \hat{u}_j^{(s)}. \qquad (46)$$

Selecting a Lyapunov function candidate as the approximate value function $\hat{V}_i^{(s)}$, the following sufficient condition guarantees the stability of (46).

Theorem 3: The closed-loop system (46) is asymptotically stable under the approximate control policies $\hat{u}_i^{(s)}$ and $\hat{u}_{-i}^{(s)}$ if the following condition holds:

$$\delta_i^T Q_i \delta_i + u_i^{(s)T} R_i u_i^{(s)} > \big(\nabla V_i^{(s+1)}\big)^T \Big( (d_i + g_i) B \Delta u_i^{(s)} - \sum_{j \in N_i} e_{ij} B \Delta u_j^{(s)} \Big) \qquad (47)$$

where $\Delta u_i^{(s)} = \hat{u}_i^{(s)}(\delta_i) - u_i^{(s)}(\delta_i)$ $(i = 1, 2, \ldots, N)$ denotes the error between the approximate value of $u_i^{(s)}(\delta_i)$ and its real value.

Proof: Differentiating $V_i^{(s+1)}$ along the dynamics of agent $i$ in (46) yields

$$
\begin{aligned}
\frac{dV_i^{(s+1)}(\delta_i)}{dt} &= \big(\nabla V_i^{(s+1)}\big)^T \Big( A\delta_i + (d_i + g_i) B \hat{u}_i^{(s)} - \sum_{j \in N_i} e_{ij} B \hat{u}_j^{(s)} \Big) \\
&= \big(\nabla V_i^{(s+1)}\big)^T \Big( A\delta_i + (d_i + g_i) B \big(u_i^{(s)} + \Delta u_i^{(s)}\big) - \sum_{j \in N_i} e_{ij} B \big(u_j^{(s)} + \Delta u_j^{(s)}\big) \Big). \qquad (48)
\end{aligned}
$$

Using (24), this becomes

$$\frac{dV_i^{(s+1)}}{dt} = \big(\nabla V_i^{(s+1)}\big)^T \Big( (d_i + g_i) B \Delta u_i^{(s)} - \sum_{j \in N_i} e_{ij} B \Delta u_j^{(s)} \Big) - \delta_i^T Q_i \delta_i - u_i^{(s)T} R_i u_i^{(s)}. \qquad (49)$$

If (47) holds, then $dV_i^{(s+1)}/dt < 0$, which implies asymptotic stability of the closed-loop system (46). The proof is completed.

Remark 9: Referring to the definition of the performance index and the derived optimal control policy, one can appropriately choose activation function sets and their sizes

Algorithm 3 Data-Based Off-Policy RL for Multiagent Game This means that for any ε1 > 0, it follows:
 (s) 

1. Collect real system data (δ i,k , uik , u j k , δ i,k ) from agent i û − u(s)  ≤ ε1 . (50)
i i
for the sample set D (D = ∪ni=1 Di ) using different control Theorem 2 has shown that the equivalence between the
input u, where i = 1, 2, · · · , N and j ∈ Ni ; solutions (V(s) (s)
(0) i , ui ) derived by off-policy Bellman equa-
2. Set the initial critic NN weight vector Wiv , and choose (s) (s)
(0) tions (29) and (Vi , ui ) learned by Algorithm 1 given by (24)
the initial actor NN weight vector Wu il (l = 1, 2, · · · , p), (s)
(0) and (25). Actually, [8] and [28] showed that Vi in (24) and
such that ûi is admissible. Let s = 0; (s)
(s+1) ui with the form of (25) can, respectively, converge to the
3. Compute Z(i) and η(i)  to update Wi  in terms of (45); solutions of coupled co-operative game HJB equations (13)
 (s) (s−1) 
4. Let s = s + 1. If Wi − Wi  ≤  ( is a small and global Nash equilibrium solution u∗i , which means that for
positive number), then stop the iteration and employ Wi
(s) any ε2 > 0
 (s) 
to (37) in order to derive the final control policy. Otherwise, u − u∗  ≤ ε2 (if s → ∞). (51)
i i
go back to Step 3 and continue.
Combining (50) with (51), therefore, yields for any ε > 0
 (s)     
û − u∗  ≤ û(s) − u(s)  + u(s) − u∗  ≤ ε1 + ε2 = ε
i i i i i i
h lu i (l = 1, 2, . . . , p) for actor NNs, such that u(s)
i can be (52)
made arbitrarily small [13], [21], [23] based on the uniform (s)
if s → ∞. Therefore, =
lim û u∗i .
approximation property of NNs [31]. Under Assumption 3, s→∞ i
one clearly knows that the activation function vector of actor The proof is completed. 
NNs is ψil (δ i )T = δ iT (i = 1, 2, . . . , N; l = 1, 2, . . . , p) Remark 12: The most significant advantage of
due to the linear optimal control policy ui dependent on δ i . Algorithm 3 is that the knowledge of the agent dynamics
Thus, u(s)i can be made arbitrarily small based on the uniform
is not required for learning approximate optimal control
approximation property of NNs [31]. In this sense, the condi- protocols. No system identification is needed, and therefore,
tion in Theorem 3 can be satisfied, so the asymptotic stability the negative impacts brought by identifying inaccurate
of (46) is guaranteed. models are eliminated without compromising the accuracy of
optimal control protocols. This is in contrast to the existing
model-based optimal control for MAS [4], [7]–[12].
C. Reinforcement Learning Algorithm Using Remark 13: Off-policy RL has been developed for single-
Off-Policy Approach agent systems [13], [21], [22], [30]. However, to the best of
Summarizing the results given in Sections IV-A and IV-B our knowledge, this is the first time a model-free off-policy
yields the following off-policy RL algorithm for obtaining the RL is developed to learn optimal control protocols for MASs.
optimal control policy. Learning optimal control protocols for MASs is challenging
Remark 10: Algorithm 3 consists of two stages. In the and more complicated than for single-agent agent systems
first stage, the algorithm collects data obtained by applying because of the interplay between agents and the information
prescribed behavior control polices to the agents. This data flow between them.
include the local neighborhood tracking error and the control Remark 14: For the usage of model-free off-policy RL
inputs of the agent and its neighbors. In the second stage, presented in Algorithm 3, it may be challenging for higher
the weights of critic and actor NNs are updated in real time dimensional systems because of the requirement of heavy
to learn the value corresponding to the target policy under computational resources and slow convergence. However,
evaluation and find an improved target policy. The behavior recursive least squares or least mean squares can be used to
policy for each agent can be an exploratory policy to assure replace the batch least squares (45) to save data storage space
that the collected data are rich enough and the target policy for speeding up the convergence.
for each agent is learned using these collected data without
actually being applied to the agents. V. S IMULATION E XAMPLES
Remark 11: Note that the final control policy learned by This section verifies the effectiveness of the pro-
Algorithm 3 is static, while the optimal control policy is posed off-policy RL algorithm for the optimal control
dynamical in terms of (12) for the case of dynamical neighbor of MAS.
graphs. That is why the results of this paper cannot be Consider the five-node communication graph shown
extended to the case of dynamical neighbor graphs. in Fig. 1. The leader is pinned to node 3; thus, g3 = 1
Theorem 4: The approximate control policies û(s) and gi = 0 (i = 3). The weight of each edge is set to
i derived by
Algorithm 3 converge to the global Nash equilibrium solution be 1. Therefore, the graph Laplacian matrix corresponding to
(s) Fig. 1 becomes
u∗i as s → ∞, i.e., lim ûi = u∗i (i = 1, 2, . . . , N). ⎡ ⎤
s→∞
(s) 3 −1 −1 −1 0
Proof: It is clear that ûi is used to approximate the target ⎢ −1
⎢ 1 0 0 0 ⎥⎥
policy u(s) in off-policy Bellman equations (29). As stated ⎢ −1 −1 2
i
in Remark 9, if the activation function sets and their size L=⎢ 0 0 ⎥⎥. (53)
⎢ ⎥
h lu i (l = 1, 2, . . . , p) for actor NNs are chosen appropriately, ⎣ −1 0 0 2 −1 ⎦
(s) (s) (s) −1 0 −1
u i = ûi −ui can be made arbitrarily small [13], [21], [23]. 0 2

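As a quick consistency check on (53) (an illustrative sketch of ours, not from the paper), each row of a graph Laplacian sums to zero, the in-degrees $d_i$ are its diagonal entries, and only node 3 carries a pinning gain:

```python
import numpy as np

# graph Laplacian of Fig. 1, cf. (53)
L = np.array([[ 3, -1, -1, -1,  0],
              [-1,  1,  0,  0,  0],
              [-1, -1,  2,  0,  0],
              [-1,  0,  0,  2, -1],
              [-1,  0, -1,  0,  2]], dtype=float)

g = np.array([0, 0, 1, 0, 0], dtype=float)  # leader pinned to node 3: g3 = 1

assert np.allclose(L.sum(axis=1), 0), "each Laplacian row must sum to zero"
d = np.diag(L)            # in-degrees d_i
e = -(L - np.diag(d))     # nonnegative edge weights e_ij
print(d + g)              # coupling gains d_i + g_i → [3. 1. 3. 2. 2.]
```

The gains $d_i + g_i$ printed above are exactly the factors that weight each agent's own control channel in (46).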
[Fig. 1. Communication graph.]

[Fig. 2. Three representative critic NN weights.]

[Fig. 3. Three representative actor NN weights.]

The weighting matrices in the performance function (7) are specified as $Q_1 = Q_2 = Q_3 = Q_4 = Q_5 = I_4$ (the $4 \times 4$ identity matrix) and

$$R_1 = 1,\quad R_2 = 0.1,\quad R_3 = 0.2,\quad R_4 = 0.01,\quad R_5 = 0.02.$$

[Fig. 4. Synchronization of the first two states of all agents with the leader.]

[Fig. 5. Synchronization of the last two states of all agents with the leader.]

Each agent is assumed to be a fourth-order system; therefore, the local neighborhood error of each agent $i$ can be expressed as $\delta_i = [\delta_{i1}\ \delta_{i2}\ \delta_{i3}\ \delta_{i4}]^T$. Under Assumption 3, the activation function vectors for the critic NNs and the actor NNs are chosen as

$$\phi_i(\delta_i) = \big[\delta_{i1}^2\ \ \delta_{i1}\delta_{i2}\ \ \delta_{i1}\delta_{i3}\ \ \delta_{i1}\delta_{i4}\ \ \delta_{i2}^2\ \ \delta_{i2}\delta_{i3}\ \ \delta_{i2}\delta_{i4}\ \ \delta_{i3}^2\ \ \delta_{i3}\delta_{i4}\ \ \delta_{i4}^2\big]^T$$

$$\psi_{i1}(\delta_i) = [\delta_{i1}\ \ \delta_{i2}\ \ \delta_{i3}\ \ \delta_{i4}]^T \qquad (i = 1, 2, 3, 4, 5).$$

Consider the dynamics of agent $i$ $(i = 1, 2, 3, 4, 5)$ as

$$\dot{x}_i = \begin{bmatrix} -1 & 1 & 0 & 0 \\ -0.5 & 0.5 & 0 & 0 \\ 0 & 0 & -2 & 0 \\ 0 & 0 & 0 & -3 \end{bmatrix} x_i + \begin{bmatrix} 0 \\ 1 \\ 1 \\ 1 \end{bmatrix} u_i \qquad (54)$$

and the leader

$$\dot{x}_0 = \begin{bmatrix} -1 & 1 & 0 & 0 \\ -0.5 & 0.5 & 0 & 0 \\ 0 & 0 & -2 & 0 \\ 0 & 0 & 0 & -3 \end{bmatrix} x_0. \qquad (55)$$

Note that the leader is marginally stable, with poles at $s = -0.5$, $s = 0$, $s = -2$, and $s = -3$. The initial critic NN weights are set to zero, and the initial admissible controllers are, respectively, chosen as

$$u_1^{(0)} = [-0.1203\ \ -0.4374\ \ 0.3639\ \ 0.1591]\,\delta_1$$

$$u_2^{(0)} = [-0.1512\ \ 0.0918\ \ 0.1976\ \ 0.2376]\,\delta_2$$
$$u_3^{(0)} = [-0.3608\ \ -0.3585\ \ 0.3603\ \ 0.5886]\,\delta_3$$

$$u_4^{(0)} = [-0.2141\ \ -0.2674\ \ 0.2739\ \ 0.1529]\,\delta_4$$

$$u_5^{(0)} = [-0.0802\ \ 0.0577\ \ -0.7025\ \ 0.1208]\,\delta_5.$$

Setting the convergence criterion to $\epsilon = 10^{-4}$, Algorithm 3 is implemented. Figs. 2 and 3, respectively, show the convergence of the critic NN weights and the actor NN weights. Applying the learned control policy (37) to the agents, the first two state trajectories and the last two state trajectories of each agent are given in Figs. 4 and 5, respectively, which show that all five agents are synchronized with the leader.

VI. CONCLUSION

In this paper, an off-policy RL algorithm is presented to solve the synchronization problem of MAS in an optimal manner using only measured data. The integral RL idea is employed to derive off-policy Bellman equations to evaluate target policies and find improved target policies. Each agent applies a behavior policy to collect the data used to learn the solution to the off-policy Bellman equation. The behavior policy is different from the target policy, which is not actually applied to the systems but is evaluated and improved to find the optimal policy. The proposed approach does not require any knowledge of the agent dynamics. A simulation example shows the effectiveness of the proposed method.

REFERENCES

[1] Y. Xu and W. Liu, "Novel multiagent based load restoration algorithm for microgrids," IEEE Trans. Smart Grid, vol. 2, no. 1, pp. 152–161, Mar. 2011.
[2] C. Yu, M. Zhang, and F. Ren, "Collective learning for the emergence of social norms in networked multiagent systems," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2342–2355, Dec. 2014.
[3] R. Olfati-Saber, "Flocking for multi-agent dynamic systems: Algorithms and theory," IEEE Trans. Autom. Control, vol. 51, no. 3, pp. 401–420, Mar. 2006.
[4] K. H. Movric and F. L. Lewis, "Cooperative optimal control for multi-agent systems on directed graph topologies," IEEE Trans. Autom. Control, vol. 59, no. 3, pp. 769–774, Mar. 2014.
[5] T. M. Cheng and A. V. Savkin, "Decentralized control of multi-agent systems for swarming with a given geometric pattern," Comput. Math. Appl., vol. 61, no. 4, pp. 731–744, Feb. 2011.
[6] F. L. Lewis, H. Zhang, K. Hengster-Movric, and A. Das, Cooperative Control of Multi-Agent Systems: Optimal and Adaptive Design Approaches. London, U.K.: Springer-Verlag, 2014.
[7] W. B. Dunbar and R. M. Murray, "Distributed receding horizon control for multi-vehicle formation stabilization," Automatica, vol. 42, no. 4, pp. 549–558, Apr. 2006.
[8] K. G. Vamvoudakis, F. L. Lewis, and G. R. Hudas, "Multi-agent differential graphical games: Online adaptive learning solution for synchronization with optimality," Automatica, vol. 48, no. 8, pp. 1598–1611, Aug. 2012.
[9] E. Semsar-Kazerooni and K. Khorasani, "Multi-agent team cooperation: A game theory approach," Automatica, vol. 45, no. 10, pp. 2205–2213, 2009.
[10] W. Dong, "Distributed optimal control of multiple systems," Int. J. Control, vol. 83, no. 10, pp. 2067–2079, Aug. 2010.
[11] C. Wang, X. Wang, and H. Ji, "A continuous leader-following consensus control strategy for a class of uncertain multi-agent systems," IEEE/CAA J. Autom. Sinica, vol. 1, no. 2, pp. 187–192, Apr. 2014.
[12] Y. Liu and Y. Jia, "Adaptive leader-following consensus control of multi-agent systems using model reference adaptive control approach," IET Control Theory Appl., vol. 6, no. 13, pp. 2002–2008, Sep. 2012.
[13] B. Luo, H.-N. Wu, T. Huang, and D. Liu, "Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design," Automatica, vol. 50, no. 12, pp. 3281–3290, Dec. 2014.
[14] Z.-S. Hou and Z. Wang, "From model-based control to data-driven control: Survey, classification and perspective," Inf. Sci., vol. 235, pp. 3–35, Jun. 2013.
[15] T. Bian, Y. Jiang, and Z.-P. Jiang, "Decentralized adaptive optimal control of large-scale systems with application to power systems," IEEE Trans. Ind. Electron., vol. 62, no. 4, pp. 2439–2447, Apr. 2015.
[16] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 943–949, Aug. 2008.
[17] D. Liu, D. Wang, and H. Li, "Decentralized stabilization for a class of continuous-time nonlinear interconnected systems using online learning optimal control approach," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 2, pp. 418–428, Feb. 2014.
[18] D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. Stevenage, U.K.: IET Press, 2012.
[19] S. Kar, J. M. F. Moura, and H. V. Poor, "QD-learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1848–1862, Apr. 2013.
[20] M. Kaya and R. Alhajj, "A novel approach to multiagent reinforcement learning: Utilizing OLAP mining in the learning process," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 35, no. 4, pp. 582–590, Nov. 2005.
[21] B. Luo, T. Huang, H.-N. Wu, and X. Yang, "Data-driven H∞ control for nonlinear distributed parameter systems," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 11, pp. 2949–2961, Nov. 2015.
[22] H. Modares, F. L. Lewis, and Z.-P. Jiang, "H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2550–2562, Oct. 2015.
[23] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Netw., vol. 22, no. 3, pp. 237–246, Apr. 2009.
[24] T. Başar and G. J. Olsder, Dynamic Noncooperative Game Theory, 2nd ed. Philadelphia, PA, USA: SIAM, 1999.
[25] A. W. Starr and Y. C. Ho, "Nonzero-sum differential games," J. Optim. Theory Appl., vol. 3, no. 3, pp. 184–206, Mar. 1969.
[26] F. L. Lewis and V. L. Syrmos, Optimal Control. Hoboken, NJ, USA: Wiley, 1995.
[27] H. Modares and F. L. Lewis, "Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning," IEEE Trans. Autom. Control, vol. 59, no. 11, pp. 3051–3056, Nov. 2014.
[28] G. N. Saridis and C.-S. G. Lee, "An approximation theory of optimal control for trainable manipulators," IEEE Trans. Syst., Man, Cybern., vol. 9, no. 3, pp. 152–159, Mar. 1979.
[29] M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779–791, May 2005.
[30] Z.-P. Jiang and Y. Jiang, "Robust adaptive dynamic programming for linear and nonlinear systems: An overview," Eur. J. Control, vol. 19, no. 5, pp. 417–425, Sep. 2013.
[31] R. Courant and D. Hilbert, Methods of Mathematical Physics. New York, NY, USA: Wiley, 1953.

Jinna Li (M'12) received the M.S. and Ph.D. degrees from Northeastern University, Shenyang, China, in 2006 and 2009, respectively. She is currently an Associate Professor with the Shenyang University of Chemical Technology, Shenyang. From 2009 to 2011, she held a post-doctoral position with the Lab of Industrial Control Networks and Systems, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang. From 2014 to 2015, she was a Visiting Scholar granted by the China Scholarship Council with the Energy Research Institute, Nanyang Technological University, Singapore. From 2015 to 2016, she was a Domestic Young Core Visiting Scholar granted by the Ministry of Education of China with the State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University. Her current research interests include distributed optimization control, neural modeling, reinforcement learning, approximate dynamic programming, and data-based control.
Hamidreza Modares (M'15) received the B.S. degree from the University of Tehran, Tehran, Iran, in 2004, the M.S. degree from the Shahrood University of Technology, Shahrood, Iran, in 2006, and the Ph.D. degree from The University of Texas at Arlington, Arlington, TX, USA, in 2015. He was a Senior Lecturer with the Shahrood University of Technology from 2006 to 2009 and a Faculty Research Associate with The University of Texas at Arlington from 2015 to 2016. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO, USA. His current research interests include cyber-physical systems, reinforcement learning, distributed control, robotics, and pattern recognition.

Tianyou Chai (M'90–SM'97–F'08) received the Ph.D. degree in control theory and engineering from Northeastern University, Shenyang, China, in 1985. He became a Professor with Northeastern University in 1988. He is currently the Founder and Director of the State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University. He is the Director of the Department of Information Science, National Natural Science Foundation of China, Beijing, China. He has authored 144 peer-reviewed international journal papers. His current research interests include modeling, control, optimization, and integrated automation of complex industrial processes. He has developed control technologies with applications to various industrial processes. Dr. Chai is a member of the Chinese Academy of Engineering and a fellow of IFAC. For his contributions, he has won four prestigious awards of National Science and Technology Progress and National Technological Innovation, and the 2007 Industry Award for Excellence in Transitional Control Research from the IEEE Multi-Conference on Systems and Control.

Frank L. Lewis (S'70–M'81–SM'86–F'94) received the bachelor's degree in physics/electrical engineering and the M.S. degree in electrical engineering from Rice University, Houston, TX, USA, the M.S. degree in aeronautical engineering from the University of West Florida, Pensacola, FL, USA, and the Ph.D. degree from the Georgia Institute of Technology, Atlanta, GA, USA. He is currently a U.K. Chartered Engineer, an IEEE Control Systems Society Distinguished Lecturer, a University of Texas at Arlington (UTA) Distinguished Scholar Professor, a UTA Distinguished Teaching Professor, and the Moncrief-O'Donnell Chair with The University of Texas at Arlington Research Institute, Fort Worth, TX, USA. He is a Qian Ren Thousand Talents Consulting Professor with Northeastern University, Shenyang, China. He holds six U.S. patents and has authored 301 journal papers, 396 conference papers, 20 books, 44 chapters, and 11 journal special issues. His current research interests include feedback control, reinforcement learning, intelligent systems, and distributed control systems. Dr. Lewis is a fellow of the International Federation of Automatic Control and the U.K. Institute of Measurement and Control, and a Professional Engineer in Texas. He was a recipient of the IEEE Computational Intelligence Society Neural Networks Pioneer Award in 2012, the Distinguished Foreign Scholar Award from the Nanjing University of Science and Technology, the 111 Project Professorship at Northeastern University, China, the Outstanding Service Award from the Dallas IEEE Section, and the Engineer of the Year Award from the Fort Worth IEEE Section. He was listed in the Fort Worth Business Press Top 200 Leaders in Manufacturing. He was also a recipient of the 2010 IEEE Region Five Outstanding Engineering Educator Award and the IEEE Control Systems Society Best Chapter Award (as Founding Chairman of the DFW Chapter) in 1996.

Lihua Xie (F'07) received the B.E. and M.E. degrees in control engineering from the Nanjing University of Science and Technology, Nanjing, China, in 1983 and 1986, respectively, and the Ph.D. degree in electrical engineering from The University of Newcastle, Callaghan, NSW, Australia, in 1992. From 1986 to 1989, he was a Teaching Assistant and a Lecturer with the Department of Automatic Control, Nanjing University of Science and Technology. He joined Nanyang Technological University, Singapore, in 1992, where he is currently a Professor with the School of Electrical and Electronic Engineering, the Director of the Centre for E-City, and the Head of the Division of Control and Instrumentation. He was a Changjiang Visiting Professor with the South China University of Technology, Guangzhou, China, from 2006 to 2010. He has also held visiting appointments with the California Institute of Technology, Pasadena, CA, USA, The University of Melbourne, Melbourne, VIC, Australia, and Hong Kong Polytechnic University, Hong Kong. Dr. Xie was an Editor-at-Large of the Journal of Control Theory and Applications. He served as an Associate Editor of the IEEE Transactions on Automatic Control from 2005 to 2007, Automatica from 2007 to 2009, the IEEE Transactions on Circuits and Systems–II from 2006 to 2007, the Transactions of the Institute of Measurement and Control from 2008 to 2009, the International Journal of Control, Automation, and Systems from 2004 to 2006, and the Conference Editorial Board of the IEEE Control Systems Society from 2000 to 2004. He was a member of the Editorial Board of the IEE Proceedings on Control Theory and Applications from 2005 to 2006 and the IET Proceedings on Control Theory and Applications from 2007 to 2009. He is a fellow of IFAC and an IEEE Distinguished Lecturer. He was a member of the Board of Governors of the IEEE Control Systems Society in 2011.