Authorized licensed use limited to: University of Texas at Arlington. Downloaded on January 18,2021 at 18:26:56 UTC from IEEE Xplore. Restrictions apply.
LI et al.: OFF-POLICY RL FOR SYNCHRONIZATION 2435
unknown dynamics has not been considered yet. The data-driven optimal control problem of MAS is challenging because of the coupling of the agent dynamics as a result of data exchange between them.

In this paper, an off-policy RL algorithm is presented to learn optimal control protocols for CT MAS using only measured data. Off-policy RL and multiagent graphical games are brought together, where the dynamics and performance objective of each agent are affected by the agent itself and its neighbors in a graphical topology. In off-policy RL [13], [21], [22], two different policies are used: the behavior policy, which generates the data for learning, and the target policy, which is evaluated and updated. This is in contrast to on-policy RL [8], [23], which requires the learning data to be generated by the same control policy as the one under evaluation. Off-policy learning greatly increases the information exploration ability during the learning process and results in data efficiency. Moreover, in off-policy RL, no dynamics model of the agents is required.

The contributions of this paper are as follows. Off-policy RL is used to solve the optimal synchronization problem in the framework of graphical games. No knowledge of the agent dynamics is required. To this end, a performance function is defined for each agent in terms of its local neighborhood tracking error and its control effort. It is shown that coupled Hamilton-Jacobi-Bellman (HJB) equations must be solved to minimize these performance functions. The solution to the HJB equations results in synchronization of all agents to the leader while reaching a global Nash equilibrium. An off-policy RL algorithm is developed to approximate solutions to the HJB equations and learn the optimal control policy.

The organization of this paper is as follows. Section II introduces the graph theory concepts and some definitions that are used throughout this paper. Section III defines the optimal synchronization problem and investigates the global Nash equilibrium and the stability of the optimal solution. Section IV develops an off-policy RL algorithm to learn optimal controllers using data generated from the agents. Section V presents the simulation results. Finally, the conclusions are stated in Section VI.

Notations: R^n denotes the n-dimensional Euclidean space. ⊗ stands for the Kronecker product. Let X, X_i, and U be compact sets; denote D ≜ {(δ, u, δ′) | δ, δ′ ∈ X, u ∈ U} and D_i ≜ {(δ_i, u, δ_i′) | δ_i, δ_i′ ∈ X_i, u ∈ U}. <S_1(δ, u, δ′), S_2(δ, u, δ′)>_D ≜ ∫_D S_1ᵀ(δ, u, δ′) S_2(δ, u, δ′) d(δ, u, δ′) denotes the inner product of the column vector functions S_1 and S_2.

II. PRELIMINARIES

In this section, we first introduce some notation and results from graph theory [4], [8]. The synchronization problem of MAS is then defined.

A. Communication Graph [4], [8]

Consider a graph G = (V, E) with a set of vertices V = {v_1, v_2, ..., v_N} and a set of edges or arcs E ⊆ V × V. E = [e_ij] is called the connectivity matrix, with e_ij > 0 if (v_j, v_i) ∈ E and e_ij = 0 otherwise. (v_j, v_i) ∈ E indicates that there exists an edge from vertex v_j to vertex v_i in a directed graph. A graph is called simple if it has no repeated edges and no self-loops (v_i, v_i) ∈ E for any i. Denote the set of neighbors of node v_i as N_i = {v_j : (v_j, v_i) ∈ E}. D = diag(d_1, ..., d_N) is called the in-degree matrix, with the weighted degree d_i = Σ_j e_ij of node v_i (i.e., the i-th row sum of E). Define the graph Laplacian matrix as L = D − E; each of its rows sums to zero.

A directed path is a sequence of edges (v_i1, v_i2), (v_i2, v_i3), ..., (v_i(j−1), v_ij) with (v_i(l−1), v_il) ∈ E for l ∈ {2, ..., j}, i.e., a path starting from node v_i1 and ending at node v_ij. A directed graph is said to be strongly connected if there is a directed path from v_i to v_j for any distinct nodes v_i, v_j ∈ V.

B. Synchronization of Multiagent Systems

Consider N systems or agents with identical node dynamics

ẋ_i = Ax_i + Bu_i   (1)

where x_i = x_i(t) ∈ R^n denotes the state vector and u_i = u_i(t) ∈ R^p (i = 1, 2, ..., N) denotes the control input. A and B are matrices of appropriate dimensions. The dynamics of the command generator or leader, with state x_0, is given by

ẋ_0 = Ax_0.   (2)

Assumption 1: The pair (A, B) is controllable.

The local neighborhood tracking error δ_i of agent i is defined as

δ_i = Σ_{j∈N_i} e_ij (x_i − x_j) + g_i (x_i − x_0)   (3)

where g_i ≥ 0 is the pinning gain for agent i, with g_i > 0 if agent i has direct access to the leader and g_i = 0 otherwise.

Assumption 2: The graph is strongly connected and the leader is pinned to at least one node.

Let ξ = x − x̄_0 be the global synchronization error [8], where x̄_0 = 1 ⊗ x_0 and 1 = [1, 1, ..., 1]ᵀ ∈ R^N. The global error vector of the MAS with the command generator is given from (3) by

δ = ((L + G) ⊗ I_n) ξ   (4)

where G is a diagonal matrix whose diagonal entries are the pinning gains g_i. Using (1), (2), and (3) yields the local neighborhood tracking error dynamics

δ̇_i = Aδ_i + (d_i + g_i)Bu_i − Σ_{j∈N_i} e_ij Bu_j   (5)

which can be expressed in the compact form

δ̇ = (I_N ⊗ A)δ + ((L + G) ⊗ B)u.   (6)

This is the dynamics of the overall neighborhood errors, where δ = [δ_1ᵀ δ_2ᵀ ··· δ_Nᵀ]ᵀ and u = [u_1ᵀ u_2ᵀ ··· u_Nᵀ]ᵀ.

Synchronization Problem: Design local control protocols u_i in (1) that synchronize the states of all agents in G to the trajectory of the leader, i.e., lim_{t→∞} x_i(t) = x_0(t) for all i = 1, 2, ..., N, or equivalently lim_{t→∞} ξ(t) = lim_{t→∞} (x(t) − x̄_0(t)) = 0.
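As a quick numerical illustration of (3) and (4) (not part of the original development), the following sketch builds the graph matrices for a made-up 3-node graph and checks that the stacked local errors of (3) equal ((L + G) ⊗ I_n)ξ; the graph, pinning gains, and states are arbitrary illustrative choices.

```python
import numpy as np

# Hypothetical 3-node directed graph, used only for illustration.
E = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])          # connectivity matrix: e_ij > 0 iff (v_j, v_i) in E
D = np.diag(E.sum(axis=1))            # in-degree matrix, d_i = sum_j e_ij
L = D - E                             # graph Laplacian; every row sums to zero
assert np.allclose(L.sum(axis=1), 0)

G = np.diag([1., 0., 0.])             # pinning gains: only node 1 observes the leader
n = 2                                 # state dimension of each agent
rng = np.random.default_rng(0)
x = rng.standard_normal((3, n))       # agent states x_i
x0 = rng.standard_normal(n)           # leader state

# Local neighborhood tracking error, Eq. (3)
delta = np.zeros((3, n))
for i in range(3):
    delta[i] = sum(E[i, j] * (x[i] - x[j]) for j in range(3)) + G[i, i] * (x[i] - x0)

# Global form, Eq. (4): delta = ((L + G) kron I_n) xi
xi = (x - x0).reshape(-1)
delta_global = np.kron(L + G, np.eye(n)) @ xi
assert np.allclose(delta.reshape(-1), delta_global)
```

The identity holds because row i of L + G contributes (d_i + g_i) to agent i's own error and −e_ij to each neighbor's.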
2436 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 28, NO. 10, OCTOBER 2017
Remark 1: Our objective is to make lim_{t→∞} ξ(t) = 0. In the subsequent development, we show how to make lim_{t→∞} δ(t) = 0. According to (4), one has ‖ξ(t)‖ ≤ (1/σ_min(L + G))‖δ(t)‖, where σ_min(L + G) denotes the smallest singular value of the matrix L + G. By Assumption 2, σ_min(L + G) > 0, so that lim_{t→∞} δ(t) = 0 ⇒ lim_{t→∞} ξ(t) = 0.

Therefore, to solve the synchronization problem, one can design a control protocol for each agent i that guarantees asymptotic stability of the local neighborhood tracking error dynamics (5). Sections III and IV show how to design local control protocols that stabilize the error dynamics (5) in an optimal manner by minimizing a predefined performance function for each agent.

III. MULTIAGENT GRAPHICAL GAMES

In this section, optimal synchronization of MASs on graphs is discussed in the framework of multiagent graphical games. It is shown how to find optimal protocols for every agent. It is also shown that the optimal responses make all agents synchronize to the leader and reach a global Nash equilibrium.

The performance index of agent i is defined as

J_i(δ_i(t_0), u_i, u_{−i}) = ∫_{t_0}^∞ [δ_iᵀ Q_i δ_i + u_iᵀ R_i u_i] dt   (7)

where Q_i and R_i are positive semidefinite and positive definite matrices, respectively, and (A, Q_i) is observable.

Minimizing (7) subject to (5) is a graphical game, since both the dynamics and the performance function of each agent depend on the agent and its neighbors [8]. In graphical games, the focus is on the global Nash equilibrium, defined as follows.

Definition 1 [24]: A global Nash equilibrium solution for an N-player game is given by an N-tuple of policies {u_1∗, u_2∗, ..., u_N∗} if it satisfies

J_i∗ ≜ J_i(δ_i(t_0), u_i∗, u_{G−i}∗) ≤ J_i(δ_i(t_0), u_i, u_{G−i}∗)   (8)

for all i ∈ N and all u_i, u_{G−i}, where u_{G−i} = {u_j : j ∈ V, j ≠ i}. The N-tuple of game values {J_1∗, J_2∗, ..., J_N∗} is said to be a Nash equilibrium outcome of the N-player game.

From (5), one can see that performance index (7) depends only on agent i and its neighbors. Thus, the global Nash equilibrium condition (8) can be written as J_i∗ ≜ J_i(δ_i(t_0), u_i∗, u_{−i}∗) ≤ J_i(δ_i(t_0), u_i, u_{−i}∗), since J_i(δ_i(t_0), u_i∗, u_{G−i}∗) = J_i(δ_i(t_0), u_i∗, u_{−i}∗) and J_i(δ_i(t_0), u_i, u_{G−i}∗) = J_i(δ_i(t_0), u_i, u_{−i}∗), where u_{−i} = {u_j : j ∈ N_i}.

B. Coupled HJB Equations for Solving Graphical Games

Interpreting the control input u_i as a policy dependent on the local neighborhood tracking error δ_i(t), the value function corresponding to the performance index (7) is introduced as

V_i(δ_i(t)) = ∫_t^∞ [δ_iᵀ Q_i δ_i + u_iᵀ R_i u_i] dτ.   (9)

Taking the derivative of V_i(δ_i(t)) with respect to time t along the trajectory of the local neighborhood tracking error δ_i(t), the Bellman equation is given in terms of the Hamiltonian function as

H_i(δ_i, ∇V_i, u_i, u_{−i}) = δ_iᵀ Q_i δ_i + u_iᵀ R_i u_i + ∇V_iᵀ (Aδ_i + (d_i + g_i)Bu_i − Σ_{j∈N_i} e_ij Bu_j) = 0   (10)

where ∇V_i = ∂V_i/∂δ_i denotes the gradient with respect to δ_i. The optimal response of agent i to fixed policies u_{−i} can be derived by minimizing the Hamiltonian with respect to u_i

u_i∗(t) = arg min_{u_i} [H_i(δ_i, ∇V_i∗, u_i, u_{−i})]   (11)

which yields

u_i∗(t) = −(1/2)(d_i + g_i) R_i^{−1} Bᵀ ∇V_i∗.   (12)

Let all neighbors of agent i select the control policies given by (12) and substitute (12) into (10); one then obtains the following coupled co-operative game HJB equations:

δ_iᵀ Q_i δ_i + (∇V_i∗)ᵀ Aδ_i − (1/4)(d_i + g_i)² (∇V_i∗)ᵀ B R_i^{−1} Bᵀ ∇V_i∗ + (1/2) Σ_{j∈N_i} e_ij (d_j + g_j)(∇V_i∗)ᵀ B R_j^{−1} Bᵀ ∇V_j∗ = 0.   (13)

We now show that under a certain assumption, these coupled HJB equations can be simplified and resemble the coupled algebraic Riccati equations (AREs) that appear in standard linear quadratic multiplayer nonzero-sum games [24], [25].

Assumption 3: The cost function is quadratic and is given by V_i = δ_iᵀ P_i δ_i, where P_i is a positive definite matrix.

Using Assumption 3, (13) can be written in the form shown in Lemma 1.

Lemma 1: Under Assumption 3, the coupled co-operative game HJB equations (13) are equivalent to the coupled AREs

2δ_iᵀ P_iᵀ (Aδ_i + Σ_{j∈N_i} e_ij (d_j + g_j) B R_j^{−1} Bᵀ P_j δ_j) + δ_iᵀ Q_i δ_i − (d_i + g_i)² δ_iᵀ P_iᵀ B R_i^{−1} Bᵀ P_i δ_i = 0   (14)

and the optimal response (12) becomes

u_i∗(t) = −(d_i + g_i) R_i^{−1} Bᵀ P_i δ_i.   (15)

Proof: For the quadratic cost function V_i = δ_iᵀ P_i δ_i, one has ∇V_i = 2P_i δ_i. Substituting this into (12) leads to (15). On the other hand, substituting the optimal response (15) into the coupled co-operative game HJB equations (13) gives (14), which completes the proof.

Equation (14) is similar to the coupled AREs in [24] and [25] for standard nonzero-sum game problems.
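As a small consistency check (not from the paper), the sketch below verifies numerically that under Assumption 3 the generic response (12), with ∇V_i = 2P_iδ_i, coincides with the quadratic-case response (15); B, R_i, P_i, d_i, g_i, and δ_i are arbitrary placeholder values.

```python
import numpy as np

# Illustrative check of Eq. (15) against Eq. (12) under Assumption 3
# (quadratic value V_i = delta_i^T P_i delta_i, so grad V_i = 2 P_i delta_i).
# All numbers below are made-up placeholders, not taken from the paper.
n, p = 4, 2
rng = np.random.default_rng(1)
B = rng.standard_normal((n, p))
Ri = np.eye(p)
M = rng.standard_normal((n, n))
Pi = M @ M.T + n * np.eye(n)          # symmetric positive definite
di, gi = 2.0, 1.0
delta_i = rng.standard_normal(n)

grad_Vi = 2 * Pi @ delta_i                                          # gradient of V_i
u_from_12 = -0.5 * (di + gi) * np.linalg.solve(Ri, B.T @ grad_Vi)   # Eq. (12)
u_from_15 = -(di + gi) * np.linalg.solve(Ri, B.T @ (Pi @ delta_i))  # Eq. (15)
assert np.allclose(u_from_12, u_from_15)
```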
Remark 2: The optimal control protocols (15) are distributed under Assumption 3.

Remark 3: It is noted that none of the upcoming analyses or proofs require Assumption 3. However, if solutions can be found for (14), then these are also solutions of (13). In standard multiplayer nonzero-sum games, there is only one state dynamic equation, and it is known that the values are quadratic in the state [24], [25]. In graphical games, however, each agent has its own dynamics, and it has not been shown that the values are quadratic in the local states. That is, in general, Assumption 3 may not hold.

C. Stability and Global Nash Equilibrium for the Proposed Solution

To achieve a global Nash equilibrium, one needs to calculate the optimal response of every agent i by solving the N coupled partial differential HJB equations (13) of the N-player game. Theorem 1 shows that if all agents select their own optimal responses and the communication graph is strongly connected, then system (5) is asymptotically stable for all i (i = 1, 2, ..., N), so all agents synchronize. Meanwhile, the N agents are in global Nash equilibrium.

Theorem 1: Make Assumption 2. Let V_i be smooth solutions to the HJB equations (13), and design the control policies u_i∗ according to (12). Then:
1) system (5) is asymptotically stable, so all agents synchronize to the leader;
2) [u_1∗, u_2∗, ..., u_N∗] are global Nash equilibrium policies, and the corresponding Nash equilibrium outcomes are J_i∗(δ_i(0)) = V_i (i = 1, 2, ..., N).

Proof:
1) Let V_i be Lyapunov function candidates. Taking the derivative of V_i with respect to time t along the trajectories of (5), using (13), and completing the square gives

dV_i/dt + δ_iᵀ Q_i δ_i + u_iᵀ R_i u_i = (u_i − u_i∗)ᵀ R_i (u_i − u_i∗) + (2/(d_i + g_i)) Σ_{j∈N_i} e_ij u_i∗ᵀ R_i (u_j − u_j∗).   (19)

Selecting u_i = u_i∗ and u_j = u_j∗ gives

dV_i/dt + δ_iᵀ Q_i δ_i + u_iᵀ R_i u_i = 0.   (20)

Since the matrices Q_i ≥ 0 and R_i > 0, dV_i/dt < 0 holds for all agents. Therefore, system (5) is asymptotically stable, and all agents synchronize to the leader.

2) Since 1) holds for the selected control policies, δ_i(t) → 0 as t → ∞. For Lyapunov functions V_i(δ_i) (i = 1, 2, ..., N) satisfying V_i(0) = 0, we have V_i(δ_i(∞)) = 0. Thus, performance index (7) can be written as

J_i(δ_i(0), u_i, u_{−i}) = ∫_0^∞ [δ_iᵀ Q_i δ_i + u_iᵀ R_i u_i] dt + V_i(δ_i(0)) + ∫_0^∞ V̇_i dt   (21)

or

J_i(δ_i(0), u_i, u_{−i}) = V_i(δ_i(0)) + ∫_0^∞ [δ_iᵀ Q_i δ_i + u_iᵀ R_i u_i + ∇V_iᵀ (Aδ_i + (d_i + g_i)Bu_i − Σ_{j∈N_i} e_ij Bu_j)] dt.   (22)

If V_i satisfy (13) and u_i∗, u_{−i}∗, given by (12), are optimal control policies, then by completing the square as in (19) with u_j = u_j∗, one has

J_i(δ_i(0), u_i, u_{−i}∗) = V_i(δ_i(0)) + ∫_0^∞ (u_i − u_i∗)ᵀ R_i (u_i − u_i∗) dt;   (23)
then it is clear that J_i∗(δ_i(0), u_i∗, u_{−i}∗) < J_i(δ_i(0), u_i, u_{−i}∗) holds for all i (i = 1, 2, ..., N). Therefore, a global Nash equilibrium is reached, and the proof is completed.

From the coupled co-operative game HJB equations (13), one can see that designing optimal control policies for the agents requires resolving two issues. The first is that the coupled co-operative game HJB equations are N nonlinear partial differential equations, which makes them hard or even impossible to solve analytically. The second is that the system matrices A and B need to be completely known to find the solutions. An off-policy RL algorithm is designed in Section IV to overcome these difficulties.

IV. OFF-POLICY REINFORCEMENT LEARNING ALGORITHM

In [8], the graphical game was solved, but full knowledge of all agent dynamics is needed. Off-policy RL allows the solution of optimality problems without any knowledge of the agent dynamics. This section presents an off-policy learning algorithm for the synchronization of MAS that does not require any knowledge of the dynamics. To this end, off-policy Bellman equations are first derived; an actor-critic neural network (NN) structure is then used to evaluate the value function and find an improved control policy for each agent. Finally, an iterative off-policy RL algorithm is given to learn approximate optimal control policies that make the MAS reach a global Nash equilibrium while guaranteeing synchronization of all agents to the leader.

Algorithm 1 is presented to learn the optimal control policies using knowledge of the system models.

Algorithm 1 Model-Based On-Policy Reinforcement Learning
1. Initialize the agents with admissible control policies u_i^(0) for all i and set s = 0;
2. Evaluate policies by solving for V_i^(s+1):
H_i(δ_i, ∇V_i^(s+1), u_i^(s), u_{−i}^(s)) = δ_iᵀ Q_i δ_i + [u_i^(s)]ᵀ R_i u_i^(s) + [∇V_i^(s+1)]ᵀ (Aδ_i + (d_i + g_i)Bu_i^(s) − Σ_{j∈N_i} e_ij Bu_j^(s)) = 0   (24)
where s denotes the iteration index;
3. Find improved control policies u_i^(s+1):
u_i^(s+1) = −(1/2)(d_i + g_i) R_i^{−1} Bᵀ ∇V_i^(s+1).   (25)
4. Stop when ‖V_i^(s) − V_i^(s+1)‖ ≤ ε for a small constant ε.

Remark 4: Vamvoudakis et al. [8] showed that under a weak coupling assumption, Algorithm 1 converges to the solution of the coupled co-operative game HJB equations (13) if all agents update their control policies according to (25) at each iteration. This conclusion holds under the condition that the initial control policies are admissible.

Algorithm 1 provides a method to learn control policies that achieve the global Nash equilibrium and synchronization. However, Algorithm 1 requires knowledge of the agent dynamics during the iterative process. To obviate this requirement and obtain a model-free approach, off-policy Bellman equations are presented in the following.

An off-policy RL algorithm is then provided to learn the solutions of the coupled co-operative game HJB equations (13) and obtain distributed approximate optimal control policies. Introducing auxiliary variables u_i^(s) and u_{−i}^(s) into the agent dynamics (5), where u_i and u_j denote the behavior policies actually applied to the systems, one has

δ̇_i = Aδ_i + (d_i + g_i)Bu_i^(s) − Σ_{j∈N_i} e_ij Bu_j^(s) + (d_i + g_i)B(u_i − u_i^(s)) − Σ_{j∈N_i} e_ij B(u_j − u_j^(s)).   (26)

Differentiating V_i^(s+1) along the dynamics (26) yields

dV_i^(s+1)(δ_i)/dt = [∇V_i^(s+1)]ᵀ (Aδ_i + (d_i + g_i)Bu_i^(s) − Σ_{j∈N_i} e_ij Bu_j^(s)) + [∇V_i^(s+1)]ᵀ ((d_i + g_i)B(u_i − u_i^(s)) − Σ_{j∈N_i} e_ij B(u_j − u_j^(s))).   (27)

Using (24) and (25) in (27) gives

dV_i^(s+1)(δ_i)/dt = −δ_iᵀ Q_i δ_i − [u_i^(s)]ᵀ R_i u_i^(s) − 2[u_i^(s+1)]ᵀ R_i (u_i − u_i^(s)) + 2(d_i + g_i)^{−1} Σ_{j∈N_i} e_ij [u_i^(s+1)]ᵀ R_i (u_j − u_j^(s)).   (28)
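To give a concrete feel for Algorithm 1 (not part of the original text): when agent i has no neighbors (N_i empty), the coupled HJB (13) collapses under Assumption 3 to the single ARE AᵀP + PA + Q − (d+g)²PBR⁻¹BᵀP = 0, and steps 2-3 of Algorithm 1 reduce to the classical Kleinman policy iteration. The sketch below implements that reduced case; A, B, Q, R, and the gains are illustrative choices, not values from the paper.

```python
import numpy as np

# Single-agent, neighbor-free sketch of Algorithm 1 (model-based policy iteration).
def lyap(Ac, Qs):
    """Solve Ac^T P + P Ac + Qs = 0 via vectorization."""
    n = Ac.shape[0]
    I = np.eye(n)
    M = np.kron(I, Ac.T) + np.kron(Ac.T, I)
    return np.linalg.solve(M, -Qs.reshape(-1)).reshape(n, n)

A = np.array([[0., 1.], [-2., -3.]])   # open-loop stable, so K = 0 is admissible
B = np.array([[0.], [1.]])
Q, R = np.eye(2), np.eye(1)
dg = 1.0                               # d_i + g_i

K = np.zeros((1, 2))                   # admissible initial policy u = -K delta
for s in range(20):
    Ac = A - dg * B @ K                       # closed loop under u^(s)
    P = lyap(Ac, Q + K.T @ R @ K)             # policy evaluation, cf. (24)
    K = dg * np.linalg.solve(R, B.T @ P)      # policy improvement, cf. (25)

# Converged P should satisfy the decoupled ARE implied by (14)
residual = A.T @ P + P @ A + Q - dg**2 * P @ B @ np.linalg.solve(R, B.T) @ P
assert np.linalg.norm(residual) < 1e-8
```

The improvement step uses ∇V^(s+1) = 2P^(s+1)δ, so (25) becomes the gain update K^(s+1) = (d+g)R⁻¹BᵀP^(s+1).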
The integral RL idea [23] is now used to develop Bellman equations that evaluate the value of target policies. Integrating both sides of (28) over the interval [t, t′] yields the following off-policy Bellman equations:

V_i^(s+1)(δ_i′) − V_i^(s+1)(δ_i) = −∫_t^{t′} [δ_i(τ)ᵀ Q_i δ_i(τ) + u_i^(s)(δ_i(τ))ᵀ R_i u_i^(s)(δ_i(τ))] dτ − 2∫_t^{t′} [u_i^(s+1)(δ_i(τ))]ᵀ R_i [u_i(δ_i(τ)) − u_i^(s)(δ_i(τ))] dτ + 2(d_i + g_i)^{−1} Σ_{j∈N_i} e_ij ∫_t^{t′} [u_i^(s+1)(δ_i(τ))]ᵀ R_i [u_j(δ_j(τ)) − u_j^(s)(δ_j(τ))] dτ   (29)

where δ_i = δ_i(t) and δ_i′ = δ_i(t′).

This leads to the following model-free iteration.

Algorithm 2 Off-Policy RL for Multiagent Games
1. Begin with admissible initial control policies u_i^(0) for all i, and set s = 0;
2. Solve for V_i^(s+1) and u_i^(s+1) from off-policy Bellman equation (29), where s denotes the iteration index;
3. Stop when ‖V_i^(s) − V_i^(s+1)‖ ≤ ε.

Theorem 2 shows that the value function and improved policy found by solving the off-policy Bellman equations (29) are identical to those found by (24) and (25).

Theorem 2: (V_i^(s+1)(δ_i), u_i^(s+1)(δ_i)) is a solution of the off-policy Bellman equation (29) if and only if it satisfies (24) and (25).

Proof:
1) Sufficiency: If V_i^(s+1)(δ_i) and u_i^(s+1)(δ_i) satisfy (24) and (25), then taking the derivative of (29) along (26) recovers (28), so they are a solution of (29).
2) Necessity: Suppose (W_i(δ_i), v_i(δ_i)) is also a solution of (29). Dividing (29) by t′ − t and letting t′ → t gives

dW_i(δ_i)/dt = −δ_iᵀ Q_i δ_i − [u_i^(s)]ᵀ R_i u_i^(s) − 2v_iᵀ R_i (u_i − u_i^(s)) + 2(d_i + g_i)^{−1} Σ_{j∈N_i} e_ij v_iᵀ R_i (u_j − u_j^(s)).   (31)

Subtracting (31) from (28) yields

d(V_i^(s+1) − W_i)/dt = −2(u_i^(s+1) − v_i)ᵀ R_i (u_i − u_i^(s)) + 2(d_i + g_i)^{−1} Σ_{j∈N_i} e_ij (u_i^(s+1) − v_i)ᵀ R_i (u_j − u_j^(s)).

Since the behavior policies u_i and u_j can be chosen arbitrarily, this can hold for all data only if v_i = u_i^(s+1), and then W_i = V_i^(s+1) because both vanish at δ_i = 0. This completes the proof.
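The following numerical sketch (an illustration added here, not from the paper) checks the off-policy Bellman equation (29) on a scalar, neighbor-free example with d_i + g_i = 1: a behavior policy u = sin(5t) drives the system, while the target policy u^(s) = −k_sδ and its value p δ² are evaluated along the off-policy data. All numbers are made up.

```python
import numpy as np

# Scalar check of Eq. (29) with an empty neighbor set and d_i + g_i = 1.
a, b, q, r = -1.0, 1.0, 1.0, 1.0
k_s = 0.5                                    # target policy u^(s) = -k_s * delta
p = -(q + r * k_s**2) / (2 * (a - b * k_s))  # evaluation (24): V^(s+1) = p * delta^2
k_next = b * p / r                           # improved gain from (25), u^(s+1) = -k_next*delta

dt, T = 1e-4, 1.0
t = np.arange(0.0, T, dt)
delta = np.empty_like(t)
delta[0] = 1.0
u = np.sin(5 * t)                            # exploratory behavior policy generating the data
for k in range(len(t) - 1):                  # Euler simulation of delta' = a*delta + b*u
    delta[k + 1] = delta[k] + dt * (a * delta[k] + b * u[k])

lhs = p * delta[-1]**2 - p * delta[0]**2     # V^(s+1)(delta') - V^(s+1)(delta)
us = -k_s * delta                            # target policy evaluated along the data
u_next = -k_next * delta
rhs = (-np.sum(q * delta**2 + r * us**2) * dt
       - 2 * np.sum(u_next * r * (u - us)) * dt)
assert abs(lhs - rhs) < 1e-2                 # matches up to discretization error
```

Note that the behavior input u, not the target policy, generates the trajectory; (29) nevertheless evaluates the target pair (V^(s+1), u^(s+1)) exactly.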
If system (5) is known to be stable, then the initial policies can be chosen as u_i^(0) = 0, which guarantees admissibility of the initial policy without requiring any knowledge of the dynamics of system (5). Otherwise, suppose that system (5) has nominal models A_N and B_N satisfying A = A_N + ΔA and B = B_N + ΔB, where ΔA and ΔB are the unknown parts of system (5). In this case, robust control methods, such as H∞ control with the nominal models A_N and B_N, can be used to yield an admissible initial policy.

Remark 6: Since Theorem 2 shows that the solution of Algorithm 1 with (24) and (25) is equivalent to the solution of the off-policy Bellman equations (29), the convergence of Algorithm 2 is guaranteed, since [8] and [28] have proved that RL Algorithm 1 converges. Compared with (24) and (25), (29) does not need knowledge of the agent dynamics (A, B).

Remark 7: Off-policy Algorithm 2 can learn the solution of the HJB equations (13) without requiring any knowledge of the agent dynamics, because the Bellman equations (29) do not contain the system matrices A and B of the agents. The information about the agent dynamics is instead embedded in the local neighborhood error δ_i and the control inputs u_i and u_{−i}. The most prominent advantage of learning the optimal solution using (29) is that the resulting optimal control protocol does not suffer from model inaccuracy or simplifications made in identifying system models.

B. Using Actor-Critic Structure for Off-Policy Learning in Graphical Games

To solve for V_i^(s+1) and u_i^(s+1) in (29), a multiple actor-critic NN-based approach is developed. According to the Weierstrass high-order approximation theorem [29], [30], V_i^(s+1) and u_i^(s+1) can be approximately expressed as linear combinations of independent basis functions. Thus, the following N critic NNs and actor NNs are constructed to approximate the optimal cost function V_i∗ and the optimal control law u_i∗:

V̂_i^(s)(δ_i) = φ_i(δ_i)ᵀ W_vi^(s)   (36)
û_il^(s)(δ_i) = ψ_il(δ_i)ᵀ W_uil^(s)   (37)

where φ_i and ψ_il (l = 1, ..., p) are the critic and actor activation function vectors and W_vi^(s) and W_uil^(s) are the corresponding weight vectors. Note that under Assumption 3 the value function is of the form V_i = δ_iᵀ P_i δ_i for linear system (5) with multiple control inputs; the activation function vectors of the critic NNs and actor NNs are then in fact φ_i(δ_i)ᵀ = δ_iᵀ ⊗ δ_iᵀ and ψ_il(δ_i)ᵀ = δ_iᵀ, respectively.

To estimate the solution (V_i^(s+1)(δ_i), u_i^(s+1)(δ_i)), the weighted residual method is used here [13], [23]. To this end, V_i^(s+1)(δ_i) and u_i^(s+1)(δ_i) in (29) are replaced by V̂_i^(s+1)(δ_i) and û_i^(s+1)(δ_i), respectively, to yield the following residual error:

σ_i^(s)(δ_i(t), u(t), δ_i(t′)) = (φ_i(δ_i(t)) − φ_i(δ_i(t′)))ᵀ W_vi^(s+1)
+ 2 Σ_{l1=1}^p Σ_{l2=1}^p r_{l1,l2} ∫_t^{t′} [ψ_il1(δ_i(τ))ᵀ W_uil1^(s) − u_il1(δ_i(τ))] (ψ_il2(δ_i(τ)))ᵀ W_uil2^(s+1) dτ
− ∫_t^{t′} δ_i(τ)ᵀ Q_i δ_i(τ) dτ
− Σ_{l1=1}^p Σ_{l2=1}^p r_{l1,l2} ∫_t^{t′} (W_uil1^(s))ᵀ ψ_il1(δ_i(τ)) (ψ_il2(δ_i(τ)))ᵀ W_uil2^(s) dτ
− 2(d_i + g_i)^{−1} Σ_{l1=1}^p Σ_{l2=1}^p r_{l1,l2} Σ_{j∈N_i} e_ij ∫_t^{t′} ψ_il1(δ_i(τ))ᵀ W_uil1^(s+1) [ψ_jl2(δ_j(τ))ᵀ W_ujl2^(s) − u_jl2(δ_j(τ))] dτ   (39)

where r_{l1,l2} denotes the (l1, l2) entry of R_i. The above expression can be rewritten as

σ_i^(s)(δ_i(t), u(t), δ_i(t′)) = ρ̄_i^(s)(δ_i(t), u(t), δ_i(t′)) W_i^(s+1) − π_i^(s)(δ_i(t))   (40)

where

W_i^(s+1) ≜ [(W_vi^(s+1))ᵀ (W_ui1^(s+1))ᵀ ... (W_uip^(s+1))ᵀ]ᵀ
π_i^(s)(δ_i) ≜ ρ_Q(δ_i) + Σ_{l1=1}^p Σ_{l2=1}^p r_{l1,l2} (W_uil1^(s))ᵀ ρ_ψ^{l1,l2}(δ_i) W_uil2^(s)
ρ̄_i^(s)(δ_i(t), u(t), δ_i(t′)) ≜ [ρ_φ(δ_i(t), δ_i(t′))ᵀ  2θ^(s)1(δ_i(t), δ_j(t), u(t))  ...  2θ^(s)p(δ_i(t), δ_j(t), u(t))]

with ρ_Q(δ_i) ≜ ∫_t^{t′} δ_i(τ)ᵀ Q_i δ_i(τ) dτ, ρ_φ(δ_i(t), δ_i(t′)) ≜ φ_i(δ_i(t)) − φ_i(δ_i(t′)), and where θ^(s)l2 (l2 = 1, 2, ..., p) collects the coefficients multiplying W_uil2^(s+1) in (39), expressed through the correlation integrals

ρ_ψ^{l1,l2}(δ_i(t)) ≜ ∫_t^{t′} ψ_{i,l1}(δ_i(τ)) (ψ_{i,l2}(δ_i(τ)))ᵀ dτ
ρ_{u,ψ}^{l1,l2}(δ_i(t), u_i(t)) ≜ ∫_t^{t′} u_{i,l1}(τ) (ψ_{i,l2}(δ_i(τ)))ᵀ dτ
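As a quick illustration of the critic basis choice in (36) (added here; P and δ are arbitrary): with φ_i(δ_i) = δ_i ⊗ δ_i, any quadratic value δᵀPδ is exactly representable with the weight vector W_v = vec(P).

```python
import numpy as np

# The critic basis phi(delta) = delta kron delta represents delta^T P delta exactly.
rng = np.random.default_rng(3)
n = 3
P = rng.standard_normal((n, n))
P = (P + P.T) / 2                     # symmetric illustrative weight matrix
delta = rng.standard_normal(n)
Wv = P.reshape(-1)                    # critic weights: vectorized P
V_hat = np.kron(delta, delta) @ Wv    # Eq. (36): phi(delta)^T W_v
assert np.isclose(V_hat, delta @ P @ delta)
```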
and, for the neighbor terms with j ∈ N_i,

ρ_ψ^{l1,l2}(δ_i(t), δ_j(t)) ≜ ∫_t^{t′} ψ_{j,l2}(δ_j(τ)) (ψ_{i,l1}(δ_i(τ)))ᵀ dτ
ρ_{u,ψ}^{l1,l2}(δ_i(t), u_j(t)) ≜ ∫_t^{t′} u_{j,l2}(τ) (ψ_{i,l1}(δ_i(τ)))ᵀ dτ

where l1, l2 = 1, 2, ..., p. In practice, these correlation integrals are evaluated from the sampled data (δ_{i,k}, u_k, δ_{i,k}′) by the trapezoidal rule over each sampling interval. Here, m_i = h_v + Σ_{l=1}^p h_u^l is the length of the vector ρ̄_i^(s), with h_v and h_u^l denoting the numbers of critic and actor basis functions.

To compute the weight vector W_i^(s+1), the residual error σ_i^(s)(δ_i, u, δ_i′) is projected onto dσ_i^(s)(δ_i, u, δ_i′)/dW_i^(s+1), and the projection is set to zero on the domain D_i, that is,

⟨dσ_i^(s)(δ_i, u, δ_i′)/dW_i^(s+1), σ_i^(s)(δ_i, u, δ_i′)⟩_{D_i} = 0.   (41)

Using (40) and the definition of the derivative, (41) becomes

W_i^(s+1) = ⟨ρ̄_i^(s)(δ_i(t), u(t), δ_i(t′)), ρ̄_i^(s)(δ_i(t), u(t), δ_i(t′))⟩_{D_i}^{−1} ⟨ρ̄_i^(s)(δ_i(t), u(t), δ_i(t′)), π_i^(s)(δ_i(t))⟩_{D_i}.   (42)

The inner products ⟨ρ̄_i^(s), ρ̄_i^(s)⟩_{D_i} and ⟨ρ̄_i^(s), π_i^(s)⟩_{D_i} can be approximated from M_i samples as

⟨ρ̄_i^(s)(δ_i(t), u(t), δ_i(t′)), ρ̄_i^(s)(δ_i(t), u(t), δ_i(t′))⟩_{D_i} ≈ (|D_i|/M_i) (Z_i^(s))ᵀ Z_i^(s)   (43)

and

⟨ρ̄_i^(s)(δ_i(t), u(t), δ_i(t′)), π_i^(s)(δ_i(t))⟩_{D_i} ≈ (|D_i|/M_i) (Z_i^(s))ᵀ η_i^(s)   (44)

where Z_i^(s) ∈ R^{M_i×m_i} stacks ρ̄_i^(s) evaluated at the M_i samples and η_i^(s) stacks the corresponding values of π_i^(s). Thus, we have

W_i^(s+1) = ((Z_i^(s))ᵀ Z_i^(s))^{−1} (Z_i^(s))ᵀ η_i^(s).   (45)

Remark 8: Solving (45) for the NN weights W_i^(s+1) requires Z_i^(s) ∈ R^{M_i×m_i} to be of full column rank. Therefore, the size M_i of the sample set D_i should be no smaller than m_i = h_v + Σ_{l=1}^p h_u^l, i.e., M_i ≥ m_i. This requires satisfaction of a proper persistence of excitation condition to assure that Rank(Z_i^(s)) = m_i [8], [13], [21]-[23].

Using the approximate control policies (37), the closed-loop system (5) becomes

δ̇_i = Aδ_i + (d_i + g_i)B û_i^(s) − Σ_{j∈N_i} e_ij B û_j^(s).   (46)

Selecting the approximate value function V̂_i^(s) as a Lyapunov function candidate, the next sufficient condition is presented to guarantee the stability of (46).

Theorem 3: The closed-loop system (46) is asymptotically stable with the approximate control policies û_i^(s) and û_{−i}^(s) if the following condition holds:

δ_iᵀ Q_i δ_i + [u_i^(s)]ᵀ R_i u_i^(s) > [∇V_i^(s+1)]ᵀ ((d_i + g_i)B Δu_i^(s) − Σ_{j∈N_i} e_ij B Δu_j^(s))   (47)

where Δu_i^(s) = û_i^(s)(δ_i) − u_i^(s)(δ_i) (i = 1, 2, ..., N) denotes the error between the approximate value of u_i^(s)(δ_i) and its real value.

Proof: Differentiating V_i^(s+1) along the dynamics of (46), using (28), and applying condition (47) yields V̇_i^(s+1) < 0, which establishes asymptotic stability.
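The batch least-squares step (45) can be sketched as follows (synthetic data, added here for illustration): since the residual (40) is linear in the stacked weight vector W_i, zeroing its projection over M_i samples reduces to solving the normal equations ZᵀZ W = Zᵀη.

```python
import numpy as np

# Sketch of Eq. (45): batch least squares over a synthetic data matrix.
rng = np.random.default_rng(2)
m_i, M_i = 6, 40                         # weight-vector length and sample count
W_true = rng.standard_normal(m_i)
Z = rng.standard_normal((M_i, m_i))      # rows: rho_bar_i evaluated at each sample
assert np.linalg.matrix_rank(Z) == m_i   # full column rank (Remark 8), needs M_i >= m_i
eta = Z @ W_true                         # pi_i^(s) evaluated at each sample
W = np.linalg.solve(Z.T @ Z, Z.T @ eta)  # Eq. (45)
assert np.allclose(W, W_true)
```

In practice Z is built from the correlation integrals above; the rank condition is what the persistence-of-excitation requirement of Remark 8 enforces.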
Algorithm 3 Data-Based Off-Policy RL for Multiagent Games
1. Collect real system data (δ_{i,k}, u_{ik}, u_{jk}, δ_{i,k}′) from agent i for the sample set D (D = ∪_{i=1}^N D_i) using different control inputs u, where i = 1, 2, ..., N and j ∈ N_i;
2. Set the initial critic NN weight vector W_iv^(0), and choose the initial actor NN weight vectors W_uil^(0) (l = 1, 2, ..., p) such that û_i^(0) is admissible. Let s = 0;
3. Compute Z_i^(s) and η_i^(s) to update W_i^(s+1) according to (45);
4. Let s = s + 1. If ‖W_i^(s) − W_i^(s−1)‖ ≤ ε (ε is a small positive number), then stop the iteration and employ W_i^(s) in (37) to derive the final control policy; otherwise, go back to Step 3 and continue.

Remark 9: One can appropriately choose the activation function sets and their sizes h_u^l (l = 1, 2, ..., p) for the actor NNs such that Δu_i^(s) can be made arbitrarily small [13], [21], [23], based on the uniform approximation property of NNs [31]. Under Assumption 3, one clearly knows that the activation function vector of the actor NNs is ψ_il(δ_i)ᵀ = δ_iᵀ (i = 1, 2, ..., N; l = 1, 2, ..., p), since the linear optimal control policy u_i depends linearly on δ_i. Thus, Δu_i^(s) can be made arbitrarily small. In this sense, the condition in Theorem 3 can be satisfied, so the asymptotic stability of (46) is guaranteed.

C. Reinforcement Learning Algorithm Using Off-Policy Approach

Summarizing the results of Sections IV-A and IV-B yields off-policy RL Algorithm 3 for obtaining the optimal control policy.

Remark 10: Algorithm 3 consists of two stages. In the first stage, the algorithm collects data obtained by applying prescribed behavior control policies to the agents. These data include the local neighborhood tracking error and the control inputs of the agent and its neighbors. In the second stage, the weights of the critic and actor NNs are updated in real time to learn the value corresponding to the target policy under evaluation and to find an improved target policy. The behavior policy for each agent can be an exploratory policy that assures the collected data are rich enough, and the target policy for each agent is learned using these collected data without actually being applied to the agents.

Remark 11: Note that the final control policy learned by Algorithm 3 is static, while for the case of dynamical neighbor graphs the optimal control policy given by (12) is dynamical. That is why the results of this paper cannot be extended to the case of dynamical neighbor graphs.

Theorem 4: The approximate control policies û_i^(s) derived by Algorithm 3 converge to the global Nash equilibrium solutions u_i∗ as s → ∞, i.e., lim_{s→∞} û_i^(s) = u_i∗ (i = 1, 2, ..., N).

Proof: It is clear that û_i^(s) is used to approximate the target policy u_i^(s) in the off-policy Bellman equations (29). As stated in Remark 9, if the activation function sets and their sizes h_u^l (l = 1, 2, ..., p) for the actor NNs are chosen appropriately, Δu_i^(s) = û_i^(s) − u_i^(s) can be made arbitrarily small [13], [21], [23]. This means that for any ε_1 > 0, it follows that

‖û_i^(s) − u_i^(s)‖ ≤ ε_1.   (50)

Theorem 2 has shown the equivalence between the solutions (V_i^(s), u_i^(s)) derived from the off-policy Bellman equations (29) and those learned by Algorithm 1 via (24) and (25). Moreover, [8] and [28] showed that V_i^(s) in (24) and u_i^(s) of the form (25) converge, respectively, to the solutions of the coupled co-operative game HJB equations (13) and the global Nash equilibrium solution u_i∗, which means that for any ε_2 > 0

‖u_i^(s) − u_i∗‖ ≤ ε_2 (if s → ∞).   (51)

Combining (50) with (51) therefore yields, for any ε > 0,

‖û_i^(s) − u_i∗‖ ≤ ‖û_i^(s) − u_i^(s)‖ + ‖u_i^(s) − u_i∗‖ ≤ ε_1 + ε_2 = ε   (52)

if s → ∞. Therefore, lim_{s→∞} û_i^(s) = u_i∗. The proof is completed.

Remark 12: The most significant advantage of Algorithm 3 is that knowledge of the agent dynamics is not required for learning approximate optimal control protocols. No system identification is needed; therefore, the negative impacts brought by identifying inaccurate models are eliminated without compromising the accuracy of the optimal control protocols. This is in contrast to the existing model-based optimal control for MAS [4], [7]-[12].

Remark 13: Off-policy RL has been developed for single-agent systems [13], [21], [22], [30]. However, to the best of our knowledge, this is the first time a model-free off-policy RL is developed to learn optimal control protocols for MASs. Learning optimal control protocols for MASs is challenging and more complicated than for single-agent systems because of the interplay between agents and the information flow between them.

Remark 14: The model-free off-policy RL presented in Algorithm 3 may be challenging for higher dimensional systems because of the requirement of heavy computational resources and slow convergence. However, recursive least squares or least mean squares can be used to replace the batch least squares (45) to save data storage space and speed up the convergence.

V. SIMULATION EXAMPLES

This section verifies the effectiveness of the proposed off-policy RL algorithm for the optimal control of MAS.

Consider the five-node communication graph shown in Fig. 1. The leader is pinned to node 3; thus, g_3 = 1 and g_i = 0 (i ≠ 3). The weight of each edge is set to 1. Therefore, the graph Laplacian matrix corresponding to Fig. 1 becomes

L = [  3  −1  −1  −1   0
      −1   1   0   0   0
      −1  −1   2   0   0
      −1   0   0   2  −1
      −1   0  −1   0   2 ].   (53)
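The matrix (53) can be rebuilt from the edge set implied by Fig. 1 (unit edge weights; in-neighbor lists below are inferred from the rows of (53)), which also lets one check the pinned-graph condition of Assumption 2:

```python
import numpy as np

# Reconstructing the graph Laplacian (53); in-neighbor lists inferred from its rows.
E = np.zeros((5, 5))
in_neighbors = {0: [1, 2, 3], 1: [0], 2: [0, 1], 3: [0, 4], 4: [0, 2]}  # 0-indexed
for i, nbrs in in_neighbors.items():
    for j in nbrs:
        E[i, j] = 1.0
L = np.diag(E.sum(axis=1)) - E
L_expected = np.array([[ 3, -1, -1, -1,  0],
                       [-1,  1,  0,  0,  0],
                       [-1, -1,  2,  0,  0],
                       [-1,  0,  0,  2, -1],
                       [-1,  0, -1,  0,  2]], dtype=float)
assert np.allclose(L, L_expected)
assert np.allclose(L.sum(axis=1), 0)   # Laplacian rows sum to zero

G = np.diag([0., 0., 1., 0., 0.])      # leader pinned to node 3 (g_3 = 1)
assert np.linalg.matrix_rank(L + G) == 5  # L + G nonsingular, so sigma_min(L+G) > 0
```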
Fig. 1. Communication graph.

Fig. 4. Synchronization of the first two states of all agents with the leader.

Fig. 5. Synchronization of the last two states of all agents with the leader.
u_3^(0) = [−0.3608  −0.3585  0.3603  0.5886] δ_3
u_4^(0) = [−0.2141  −0.2674  0.2739  0.1529] δ_4
u_5^(0) = [−0.0802  0.0577  −0.7025  0.1208] δ_5.

Setting the value of the convergence criterion to 10^(−4), Algorithm 3 is implemented. Figs. 2 and 3, respectively, show the convergence of the critic NN weights and the actor NN weights. Applying the optimal control policy (37) to the agents, the first two state trajectories and the last two state trajectories of each agent are given in Figs. 4 and 5, respectively, which show that all five agents are synchronized with the leader.

VI. CONCLUSION

In this paper, an off-policy RL algorithm is presented to solve the synchronization problem of MAS in an optimal manner using only measured data. The integral RL idea is employed to derive off-policy Bellman equations that evaluate target policies and find improved target policies. Each agent applies a behavior policy to collect the data used to learn the solution of the off-policy Bellman equation. This behavior policy is different from the target policy, which is not actually applied to the systems but is instead evaluated and updated to find the optimal policy. The proposed approach does not require any knowledge of the agent dynamics. A simulation example shows the effectiveness of the proposed method.

REFERENCES

[1] Y. Xu and W. Liu, "Novel multiagent based load restoration algorithm for microgrids," IEEE Trans. Smart Grid, vol. 2, no. 1, pp. 152–161, Mar. 2011.
[2] C. Yu, M. Zhang, and F. Ren, "Collective learning for the emergence of social norms in networked multiagent systems," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2342–2355, Dec. 2014.
[3] R. Olfati-Saber, "Flocking for multi-agent dynamic systems: Algorithms and theory," IEEE Trans. Autom. Control, vol. 51, no. 3, pp. 401–420, Mar. 2006.
[4] K. H. Movric and F. L. Lewis, "Cooperative optimal control for multi-agent systems on directed graph topologies," IEEE Trans. Autom. Control, vol. 59, no. 3, pp. 769–774, Mar. 2014.
[5] T. M. Cheng and A. V. Savkin, "Decentralized control of multi-agent systems for swarming with a given geometric pattern," Comput. Math. Appl., vol. 61, no. 4, pp. 731–744, Feb. 2011.
[6] F. L. Lewis, H. Zhang, K. Hengster-Movric, and A. Das, Cooperative Control of Multi-Agent Systems: Optimal and Adaptive Design Approaches. London, U.K.: Springer-Verlag, 2014.
[7] W. B. Dunbar and R. M. Murray, "Distributed receding horizon control for multi-vehicle formation stabilization," Automatica, vol. 42, no. 4, pp. 549–558, Apr. 2006.
[8] K. G. Vamvoudakis, F. L. Lewis, and G. R. Hudas, "Multi-agent differential graphical games: Online adaptive learning solution for synchronization with optimality," Automatica, vol. 48, no. 8, pp. 1598–1611, Aug. 2012.
[9] E. Semsar-Kazerooni and K. Khorasani, "Multi-agent team cooperation: A game theory approach," Automatica, vol. 45, no. 10, pp. 2205–2213, 2009.
[10] W. Dong, "Distributed optimal control of multiple systems," Int. J. Control, vol. 83, no. 10, pp. 2067–2079, Aug. 2010.
[11] C. Wang, X. Wang, and H. Ji, "A continuous leader-following consensus control strategy for a class of uncertain multi-agent systems," IEEE/CAA J. Autom. Sinica, vol. 1, no. 2, pp. 187–192, Apr. 2014.
[12] Y. Liu and Y. Jia, "Adaptive leader-following consensus control of multi-agent systems using model reference adaptive control approach," IET Control Theory Appl., vol. 6, no. 13, pp. 2002–2008, Sep. 2012.
[13] B. Luo, H.-N. Wu, T. Huang, and D. Liu, "Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design," Automatica, vol. 50, no. 12, pp. 3281–3290, Dec. 2014.
[14] Z.-S. Hou and Z. Wang, "From model-based control to data-driven control: Survey, classification and perspective," Inf. Sci., vol. 235, pp. 3–35, Jun. 2013.
[15] T. Bian, Y. Jiang, and Z.-P. Jiang, "Decentralized adaptive optimal control of large-scale systems with application to power systems," IEEE Trans. Ind. Electron., vol. 62, no. 4, pp. 2439–2447, Apr. 2015.
[16] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 943–949, Aug. 2008.
[17] D. Liu, D. Wang, and H. Li, "Decentralized stabilization for a class of continuous-time nonlinear interconnected systems using online learning optimal control approach," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 2, pp. 418–428, Feb. 2014.
[18] D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. Stevenage, U.K.: IET Press, 2012.
[19] S. Kar, J. M. F. Moura, and H. V. Poor, "QD-learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1848–1862, Apr. 2013.
[20] M. Kaya and R. Alhajj, "A novel approach to multiagent reinforcement learning: Utilizing OLAP mining in the learning process," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 35, no. 4, pp. 582–590, Nov. 2005.
[21] B. Luo, T. Huang, H.-N. Wu, and X. Yang, "Data-driven H∞ control for nonlinear distributed parameter systems," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 11, pp. 2949–2961, Nov. 2015.
[22] H. Modares, F. L. Lewis, and Z.-P. Jiang, "H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2550–2562, Oct. 2015.
[23] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Netw., vol. 22, no. 3, pp. 237–246, Apr. 2009.
[24] T. Başar and G. J. Olsder, Dynamic Noncooperative Game Theory, 2nd ed. Philadelphia, PA, USA: SIAM, 1999.
[25] A. W. Starr and Y. C. Ho, "Nonzero-sum differential games," J. Optim. Theory Appl., vol. 3, no. 3, pp. 184–206, Mar. 1969.
[26] F. L. Lewis and V. L. Syrmos, Optimal Control. Hoboken, NJ, USA: Wiley, 1995.
[27] H. Modares and F. L. Lewis, "Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning," IEEE Trans. Autom. Control, vol. 59, no. 11, pp. 3051–3056, Nov. 2014.
[28] G. N. Saridis and C.-S. G. Lee, "An approximation theory of optimal control for trainable manipulators," IEEE Trans. Syst., Man, Cybern., vol. 9, no. 3, pp. 152–159, Mar. 1979.
[29] M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779–791, May 2005.
[30] Z.-P. Jiang and Y. Jiang, "Robust adaptive dynamic programming for linear and nonlinear systems: An overview," Eur. J. Control, vol. 19, no. 5, pp. 417–425, Sep. 2013.
[31] R. Courant and D. Hilbert, Methods of Mathematical Physics. New York, NY, USA: Wiley, 1953.

Jinna Li (M'12) received the M.S. and Ph.D. degrees from Northeastern University, Shenyang, China, in 2006 and 2009, respectively.
She is currently an Associate Professor with the Shenyang University of Chemical Technology, Shenyang. From 2009 to 2011, she held a post-doctoral position with the Lab of Industrial Control Networks and Systems, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang. From 2014 to 2015, she was a Visiting Scholar granted by the China Scholarship Council with the Energy Research Institute, Nanyang Technological University, Singapore. From 2015 to 2016, she was a Domestic Young Core Visiting Scholar granted by the Ministry of Education of China with the State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University. Her current research interests include distributed optimization control, neural modeling, reinforcement learning, approximate dynamic programming, and data-based control.