I. INTRODUCTION
Cooperative control of mobile autonomous multi-agent systems (MASs) has attracted considerable research interest in the control and robotics communities in recent years [1]. This task has a wide range of applications in reconnaissance, surveillance, and security [2], [3]. The ability to maintain the network topology and connectivity of a group of robots is crucial for tasks such as target localization, oceanic search and rescue, and undersea oil pipeline maintenance [4]–[6].

Among all cooperative control tasks for MASs, formation control is one of the most interesting research topics because of its broad impact. Many MASs, including unmanned aerial vehicles (UAVs), autonomous underwater vehicles (AUVs), and nonholonomic mobile robots, have been studied to address their corresponding formation control problems. These research efforts focus on leader-follower methods [7]–[9], virtual leader approaches [10], [11], and leaderless consensus methods [12]. Other results on formation control can be found in the survey [13]. The primary aim of formation control for MASs is to design an algorithm that ensures the group of agents achieves and maintains the desired geometric relationship among their states.

Formation control generally makes the mobile autonomous agents work together to cooperatively accomplish the formation task, which is often done by each agent communicating its state information to its neighbors. The leader-follower approach is one of the most popular methods due to its simplicity. The basic idea is that the leader is designated as the reference so that the other agents can be controlled to follow the corresponding trajectory. Meanwhile, the trajectory of each follower is designed to maintain the desired separation distance and relative bearing with respect to the leader. Another popular method is the consensus algorithm, which focuses on finding a common state for all of the agents and then driving each agent to a particular state relative to that common state. Based on this idea, research on consensus problems for multi-agent systems can be extended to the case of directed communication topologies [14].

Many research platforms use mobile autonomous robot systems to evaluate formation control algorithms. For example, a vision-based control method was used to drive mobile autonomous robots into the desired formation in [15]. The obstacle avoidance problem was tackled within formation control for mobile autonomous robots in [16]. A real-time observer that estimates the relative states of the mobile autonomous robots forming the formation was proposed in [17].

One of the main difficulties in formation control is collision avoidance while the mobile autonomous robots are moving. The major strategies for solving this problem are rule-based and optimization-based approaches. Among the rule-based approaches, a consensus-based algorithm was proposed in [18], where an artificial potential approach is used to generate the collision avoidance strategy.
C. Deep Q-Networks

The deep Q-network (DQN) is a popular method in RL and has already been proven to be a successful algorithm in the multi-agent scenario [34], [35]. Q-learning uses the action-value function to evaluate the policy that it learns. The corresponding action-value function can be described as $Q^{\pi}(s, a) = \mathbb{E}[R \mid s_t = s, a_t = a]$, where $\mathbb{E}$ denotes the expectation operator. This function can be computed recursively as $Q^{\pi}(s, a) = \mathbb{E}_{s'}\big[r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}[Q^{\pi}(s', a')]\big]$. DQN can learn the optimal action-value function $Q^*$, which corresponds to the optimal policy of the agent, by minimizing the following loss:

$L(\theta) = \mathbb{E}_{s,a,r,s'}\big[(Q^*(s, a \mid \theta) - y)^2\big]. \quad (6)$

The term $\nabla_a Q^{\mu}(s, a)$ requires the action domain $A$ to be continuous, and hence the policy on which each agent acts should also lie in the continuous domain.

The deep deterministic policy gradient (DDPG) is a variant of the deterministic policy method in which the policy $\mu$ and the critic $Q^{\mu}$ are approximated by deep neural networks. DDPG still uses a replay buffer of experiences to sample data and train the corresponding networks. It also uses target networks to avoid divergence, as in DQN. DDPG can learn complex policies for some tasks from low-dimensional observations. It adopts a straightforward actor-critic algorithm, which makes the learning process easy to implement for difficult problems and large networks.
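As an illustration only (not the authors' implementation), the loss in (6) can be written in a few lines of PyTorch; the network objects, batch layout, and the standard target y = r + γ max_a' Q(s', a') are assumptions based on the usual DQN formulation:

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch: states s [B, ds], integer actions a [B], rewards r [B],
    # next states s_next [B, ds], and done flags [B]
    s, a, r, s_next, done = batch
    # Q*(s, a | theta): the value of the action that was actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Target y computed with the frozen target network, as is standard in DQN
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    # Minimize (Q*(s, a | theta) - y)^2 as in (6)
    return F.mse_loss(q_sa, y)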
for itself. The communication network can be denoted by $G = (N, E)$, where $E$ represents the set of communication links; that is, $G$ is an undirected graph with vertex set $N$ and edge set $E \subseteq \{(i, j) : i, j \in N, i \neq j\}$. We assume that each agent can fully observe its own state, i.e., $s_i = o_i$. An edge $(i, j) \in E$ means that agents $i$ and $j$ can communicate with each other. Meanwhile, each agent can only use local state information to learn its policies. By augmenting the policy information of the other agents, one can use the extended actor-critic method to solve the formation control problem for multiple robots.

For $N$ robots, the policies of the system can be parameterized by $\theta = \{\theta_1, \ldots, \theta_N\}$, and we denote by $\pi = \{\pi_1, \ldots, \pi_N\}$ the policies of all the robots. Thus, the gradient of the expected return for each agent $i$ can be rewritten as follows:

$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{s \sim \mu, a_i \sim \pi_i}\big[\nabla_{\theta_i} \log \pi_i(a_i \mid s_i)\, Q_i^{\pi}(s, a_1, \ldots, a_N)\big]. \quad (9)$

From the action-value function $Q_i^{\pi}(s, a_1, \ldots, a_N)$ used in (9), we can observe that it requires the actions of all of the robots as its input; its output is the Q value for agent $i$. Each agent $i$ learns its Q value separately. Since the action-value function of each agent has its own structure, we can define different constraints in the form of (2) for each agent $i$. Now consider that each robot determines its own policy $\mu_{\theta_i}$ in the continuous domain with respect to the parameters $\theta_i$; the gradient can then be rewritten as follows:

$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{s, a \sim \mathcal{D}}\big[\nabla_{\theta_i} \mu_i(a_i \mid s_i)\, \nabla_{a_i} Q_i^{\mu}(s, a_1, \ldots, a_N)\big|_{a_i = \mu_i(s_i)}\big]. \quad (10)$

The replay buffer $\mathcal{D}$ stores the tuples $(s, s', a_1, \ldots, a_N, r_1, \ldots, r_N)$. This multi-tuple contains the actions of all agents and, hence, all the rewards after they execute their actions. Thus, the loss function can be given as follows:

$L(\theta_i) = \mathbb{E}_{s,a,r,s'}\big[(Q_i^{\mu}(s, a_1, \ldots, a_N) - y)^2\big] \quad (11)$

where $y = r_i + \gamma\, Q_i^{\mu'}(s', a_1', \ldots, a_N')\big|_{a_j' = \mu_j'(s_j')}$ and $\mu' = \{\mu_{\theta_1'}, \ldots, \mu_{\theta_N'}\}$ is the set of policies that the agents have learned in the target networks with parameters $\theta_i'$.

We can see that this algorithm still requires the assumption that agent $i$ knows the other agents' policies. This assumption is widely adopted in the velocity obstacle (VO) method, which is used to solve the collision avoidance problem. In order to remove this assumption, each agent $i$ estimates the policies of the other agents and takes $\hat{\mu}_i^j$ as the real policy of agent $j$. The target value is then computed with these estimates as

$\hat{y} = r_i + \gamma\, Q_i^{\mu}\big(s', \hat{\mu}_i^1(s_1'), \ldots, \hat{\mu}_i^N(s_N')\big) \quad (13)$

where $\hat{\mu}_i^j$ denotes the target network for the approximate policy $\hat{\mu}_i^j$. When training the network, the action-value function $Q_i^{\mu}$ can be updated, and the latest samples of each agent $j$ in the replay buffer can be used to perform a single gradient step to update the parameters of the critic network. Thus, we can remove the assumption that each agent needs to know the others' policies. The whole algorithm is shown in Algorithm 1.

Algorithm 1 Formation Control of Multirobots with Deep Reinforcement Learning
for episode = 1 to M do
    Receive the initial state s for each robot.
    for t = 1 to maximum iteration number do
        For each robot i, select the action a_i = μ_{θ_i}(o_i) + N_t with respect to the current policy.
        Each robot executes its action a_i.
        Each robot receives the reward r_i and moves to the next state s_i' according to the system dynamics (1).
        Store (s, a, r, s') in the replay buffer D.
        for robot i = 1 to N do
            s_i ← s_i' for each robot i.
            Sample a random minibatch of S samples (s^j, a^j, r^j, s'^j) from the replay buffer D.
            Set ŷ_j by (13).
            Update the critic by minimizing the loss function (12).
            Update the actor by the sampled policy gradient (10).
        end for
    end for
end for
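For concreteness, the sketch below shows one inner update of Algorithm 1 in PyTorch style, combining the estimated-policy target (13), the critic loss (11), and the sampled policy gradient (10). The class interfaces, buffer layout, and names are our assumptions, not the authors' code:

import torch
import torch.nn.functional as F

def update_robot_i(i, critic_i, critic_i_target, actor_i, target_policy_estimates,
                   batch, gamma=0.99):
    # batch: per-agent state lists s, s_next; per-agent actions; per-agent rewards
    s, actions, rewards, s_next = batch
    # Target value (13): feed agent i's *estimated* target policies of all agents
    with torch.no_grad():
        a_next = [mu_hat(s_next[j]) for j, mu_hat in enumerate(target_policy_estimates)]
        y_hat = rewards[i] + gamma * critic_i_target(s_next, a_next)
    # Critic update: squared TD error of the centralized action-value function
    critic_loss = F.mse_loss(critic_i(s, actions), y_hat)
    # Actor update via (10): replace agent i's stored action by its current
    # policy output and ascend the critic (i.e., descend its negation)
    a_pg = [a.detach() for a in actions]
    a_pg[i] = actor_i(s[i])
    actor_loss = -critic_i(s, a_pg).mean()
    return critic_loss, actor_loss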
IV. EXPERIMENT

In this part, we demonstrate our experiments in detail. We first use the Gazebo simulation environment to train the DRL agents and obtain the network model; then we deploy the trained algorithm on real robots to show its effectiveness. The robot systems used here are four Pioneer 3DX robots. Each is a differential-drive robot with two active wheels and two velocity commands: linear and angular velocity. Each is equipped with an NVIDIA TX1 so that the robots can communicate with each other using the Robot Operating System (ROS). The parameters used in the algorithm are shown in Table I. After training the network, we can use the proposed algorithm to generate the paths that achieve formation control for the multirobot system.
The action of each agent is converted to the velocity commands of its nonholonomic mobile robot by

$\begin{bmatrix} v_i \\ w_i \end{bmatrix} = \begin{bmatrix} \cos(\theta_i) & \sin(\theta_i) \\ -(1/d)\sin(\theta_i) & (1/d)\cos(\theta_i) \end{bmatrix} \begin{bmatrix} a_{xi} \\ a_{yi} \end{bmatrix}$

where $[a_{xi}, a_{yi}]^T = a_i$ in (1). This is a simple kinematic equation. With it, we can transfer the action of each agent directly to the velocity commands used by the nonholonomic mobile robot.
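As a hedged sketch of how such a command could be issued on the Pioneer robots (the paper states only that ROS is used; the node and topic names below are illustrative), the transform can be applied and published as a standard Twist message. The wheel parameter d = 0.6 m follows Table II:

import math
import rospy
from geometry_msgs.msg import Twist

def action_to_cmd(ax, ay, theta, d=0.6):
    # Kinematic mapping from the planar action a_i = [ax, ay] to (v_i, w_i)
    v = math.cos(theta) * ax + math.sin(theta) * ay
    w = (-math.sin(theta) * ax + math.cos(theta) * ay) / d
    return v, w

rospy.init_node('formation_controller')          # illustrative node name
pub = rospy.Publisher('/robot1/cmd_vel', Twist, queue_size=1)

def send(ax, ay, theta):
    v, w = action_to_cmd(ax, ay, theta)
    msg = Twist()
    msg.linear.x = v        # linear velocity command
    msg.angular.z = w       # angular velocity command
    pub.publish(msg)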
Fig. 2. The learning curve for the formation control problem. The curve shows the mean of the average reward over 1000 episodes.

Since our formation control is based on the consensus algorithm, we first seek the consensus state of all the robots, denoted by $s^r = [x_c^r, y_c^r, \theta_c^r]^T$. The notations $[x_c^r, y_c^r]$ and $\theta_c^r$ denote the reference position and orientation, respectively, of the formation center of the multirobot team. Since $s^r$ changes dynamically, each robot maintains a local variable $s_i^r = [x_{ci}, y_{ci}, \theta_{ci}]^T$, which is the value of the state $s^r$ sensed by robot $i$. The objective of consensus is to make sure that the value of $s_i^r$ tracks the value of $s^r$, $i = 1, \ldots, N$.
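The text does not spell out the consensus update itself; as an illustrative sketch only, a standard discrete-time consensus step over the communication graph G, driving the local estimates s_i^r toward agreement, could look like:

import numpy as np

def consensus_step(s_local, neighbors, eps=0.2):
    # s_local: list of N local estimates s_i^r = [x_ci, y_ci, theta_ci]
    # neighbors: adjacency list of the undirected graph G = (N, E)
    # eps: step size; for stability it should stay below 1 / (max node degree)
    s = [np.asarray(v, dtype=float) for v in s_local]
    updated = []
    for i, s_i in enumerate(s):
        # Move toward the neighbors' estimates (standard consensus protocol)
        correction = sum(s[j] - s_i for j in neighbors[i])
        updated.append(s_i + eps * correction)
    return updated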
In order to perform the formation control, a constraint in the form of (2) needs to be satisfied for each robot $i$, which means that each robot needs to determine its own desired position $(x_i^d, y_i^d)$ relative to the sensed consensus state:

$\begin{bmatrix} x_i^d \\ y_i^d \end{bmatrix} = \begin{bmatrix} x_{ci} \\ y_{ci} \end{bmatrix} + \begin{bmatrix} \cos(\theta_{ci}) & -\sin(\theta_{ci}) \\ \sin(\theta_{ci}) & \cos(\theta_{ci}) \end{bmatrix} \begin{bmatrix} \tilde{x}_i^f \\ \tilde{y}_i^f \end{bmatrix} \quad (15)$
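A minimal NumPy sketch of (15), computing a robot's desired position from its sensed formation center and its fixed formation offset (the offsets match Table II; the function and variable names are ours):

import numpy as np

def desired_position(s_ri, offset):
    # s_ri: sensed consensus state [x_ci, y_ci, theta_ci] of robot i
    # offset: formation offset [x_f, y_f] of robot i relative to the center
    x_c, y_c, th = s_ri
    rot = np.array([[np.cos(th), -np.sin(th)],
                    [np.sin(th),  np.cos(th)]])    # rotation by theta_ci
    return np.array([x_c, y_c]) + rot @ np.asarray(offset)   # eq. (15)

# Example with the offsets of Table II (a square formation around the center):
offsets = {1: (1.5, 1.5), 2: (-1.5, 1.5), 3: (1.5, -1.5), 4: (-1.5, -1.5)}
xd, yd = desired_position([0.0, 0.0, np.pi / 2], offsets[1])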
[Figures: position trajectories of robot1–robot4 in the x and y directions versus time, and their velocities in the x and y directions versus time.]
Fig. 5. The velocity in the x direction of each robot.

Kinematic constraints of each nonholonomic mobile robot need to be considered when the experiments are conducted. In many works, the kinematic constraints of the mobile robot are hard to encode, which may increase the computational complexity [38]. However, it is quite easy to incorporate these kinematic constraints into the RL framework. Some typical kinematic constraints are

$a < v_{\max} \quad (17)$

and

$|\theta_{t+1} - \theta_t| < \Delta t \cdot v_{\max} \quad (18)$

where (17) represents the physical limit that the velocity of the robot cannot exceed the maximum velocity the hardware can provide, and (18) specifies a maximum turning rate that corresponds to the maximum velocity of the robot. The parameters used in our experiment are shown in Table II.

TABLE II
PARAMETERS USED IN THE EXPERIMENT

Parameter Name    Parameter Value
x̃_1^f              1.5 m
ỹ_1^f              1.5 m
x̃_2^f             -1.5 m
ỹ_2^f              1.5 m
x̃_3^f              1.5 m
ỹ_3^f             -1.5 m
x̃_4^f             -1.5 m
ỹ_4^f             -1.5 m
d_i                0.6 m
v_max              1 m/s

With these kinematic constraints incorporated in the experiment, the trajectories of all four robots are shown in Fig. 7. The arrow direction denotes the direction in which each robot moves.
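One simple way to incorporate (17) and (18) into the RL framework is to clip every sampled action before it is executed. A sketch under assumed values (the control period Δt is not stated in the text, and v_max = 1 m/s follows Table II):

import numpy as np

V_MAX = 1.0   # m/s, from Table II
DT = 0.1      # control period in seconds; illustrative, not given in the text

def clip_action(a, theta_prev, theta_cmd):
    # (17): the commanded speed cannot exceed the hardware limit v_max
    a = np.asarray(a, dtype=float)
    speed = np.linalg.norm(a)
    if speed > V_MAX:
        a = a * (V_MAX / speed)
    # (18): bound the heading change per step by dt * v_max
    dtheta = np.clip(theta_cmd - theta_prev, -DT * V_MAX, DT * V_MAX)
    return a, theta_prev + dtheta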