I. INTRODUCTION
Cooperative control of mobile autonomous multi-agent systems (MASs) has attracted considerable research interest in the control and robotics communities in recent years [1]. This task has a wide range of applications in reconnaissance, surveillance, and security [2], [3]. The ability to maintain the network topology and connectivity of a group of robots is crucial for tasks such as target localization, oceanic search and rescue, and undersea oil pipeline maintenance [4]–[6].

Among all cooperative control tasks for MASs, formation control is one of the most interesting research topics because of its broad impact. Many MASs, including unmanned aerial vehicles (UAVs), autonomous underwater vehicles (AUVs), and nonholonomic mobile robots, have been studied to address their corresponding formation control problems. These research efforts focus on leader-follower methods [7]–[9], virtual leader approaches [10], [11], and leaderless consensus methods [12]. Other results on formation control can be found in the survey [13]. The primary aim of formation control for MASs is to design an algorithm that ensures the group of agents achieves and maintains the desired geometric relationship among their states.

Formation control generally makes the mobile autonomous agents work together to cooperatively accomplish the formation task, which is often done by each agent communicating its state information to its neighbors. The leader-follower approach is one of the most popular methods due to its simplicity. The basic idea is that the leader is designated as the reference so that the other agents can be controlled to follow the corresponding trajectory. Meanwhile, the trajectory of each follower is designed to maintain the desired separation distance and relative bearing with respect to the leader. Another popular method is the consensus algorithm, which focuses on finding a common state for all of the agents and then driving each agent to a particular state relative to that common state. Based on this idea, research on consensus problems for multi-agent systems can be extended to the case of directed communication topologies [14].

Many research platforms use mobile autonomous robot systems to evaluate formation control algorithms. For example, a vision-based control method was used to drive mobile autonomous robots into the desired formation in [15]. The obstacle avoidance problem was tackled within formation control for mobile autonomous robots in [16]. A real-time observer that estimates the relative states of the mobile autonomous robots forming the formation was proposed in [17].

One of the main difficulties in formation control is collision avoidance while the mobile autonomous robots are moving. The major strategies for solving this problem are rule-based and optimization-based approaches. Among the rule-based approaches, a consensus-based algorithm was proposed in [18], where an artificial potential approach is used to generate the collision avoidance strategy.
C. Deep Q-Networks

The deep Q-network (DQN) is a popular method in RL and has already been proven to be a successful algorithm in the multi-agent scenario [34], [35]. Q-learning uses the action-value function to evaluate the policy that it learns. The corresponding action-value function can be described as $Q^{\pi}(s, a) = \mathbb{E}[R \mid s_t = s, a_t = a]$, where $\mathbb{E}$ denotes the expectation operator. This function can be computed recursively as $Q^{\pi}(s, a) = \mathbb{E}_{s'}\big[r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}[Q^{\pi}(s', a')]\big]$. DQN can learn the optimal action-value function $Q^*$, which corresponds to the optimal policy of the agent, by minimizing the following loss:

$L(\theta) = \mathbb{E}_{s,a,r,s'}\big[(Q^*(s, a \mid \theta) - y)^2\big]. \quad (6)$

The term $\nabla_a Q^{\mu}(s, a)$ requires the action domain $A$ to be continuous, and hence the policy on which each agent acts should also lie in the continuous domain.

The deep deterministic policy gradient (DDPG) is a variant of the deterministic policy method in which the policy $\mu$ and the critic $Q^{\mu}$ are approximated by deep neural networks. DDPG still uses a replay buffer of experiences to sample data and train the corresponding networks. It also uses target networks to avoid divergence, as in DQN. DDPG can learn complex policies for some tasks from low-dimensional observations. It adopts a straightforward actor-critic algorithm, which makes the learning process easy to implement for difficult problems and large networks.
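As an illustration only (not the authors' implementation), the loss in (6) can be written in a few lines of PyTorch; the network objects, batch layout, and the standard target y = r + γ max_a' Q(s', a') are assumptions based on the usual DQN formulation:

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch: states s [B, ds], integer actions a [B], rewards r [B],
    # next states s_next [B, ds], and done flags [B]
    s, a, r, s_next, done = batch
    # Q*(s, a | theta): the value of the action that was actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Target y computed with the frozen target network, as is standard in DQN
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    # Minimize (Q*(s, a | theta) - y)^2 as in (6)
    return F.mse_loss(q_sa, y)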
for itself. The communication network can be denoted by $G = (N, E)$, where $E$ represents the set of communication links; that is, $G$ is an undirected graph with vertex set $N$ and edge set $E \subseteq \{(i, j) : i, j \in N, i \neq j\}$. We assume that each agent can fully observe its own state, i.e., $s_i = o_i$. An edge $(i, j) \in E$ means that agents $i$ and $j$ can communicate with each other. Meanwhile, each agent can only use local state information to learn its policies. By augmenting the policy information of the other agents, one can use the extended actor-critic method to solve the formation control problem for multiple robots.

For $N$ robots, the policies of the system can be parameterized by $\theta = \{\theta_1, \ldots, \theta_N\}$, and we denote by $\pi = \{\pi_1, \ldots, \pi_N\}$ the policies of all the robots. Thus, the gradient of the expected return for each agent $i$ can be rewritten as follows:

$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{s \sim \mu, a_i \sim \pi_i}\big[\nabla_{\theta_i} \log \pi_i(a_i \mid s_i)\, Q_i^{\pi}(s, a_1, \ldots, a_N)\big]. \quad (9)$

From the action-value function $Q_i^{\pi}(s, a_1, \ldots, a_N)$ used in (9), we can observe that it requires the actions of all of the robots as its input; its output is the Q value for agent $i$. Each agent $i$ learns its Q value separately. Since the action-value function of each agent has its own structure, we can define different constraints in the form of (2) for each agent $i$. Now consider that each robot determines its own policy $\mu_{\theta_i}$ in the continuous domain with respect to the parameters $\theta_i$; the gradient can then be rewritten as follows:

$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{s, a \sim \mathcal{D}}\big[\nabla_{\theta_i} \mu_i(a_i \mid s_i)\, \nabla_{a_i} Q_i^{\mu}(s, a_1, \ldots, a_N)\big|_{a_i = \mu_i(s_i)}\big]. \quad (10)$

The replay buffer $\mathcal{D}$ stores the tuples $(s, s', a_1, \ldots, a_N, r_1, \ldots, r_N)$. This multi-tuple contains the actions of all agents and, hence, all the rewards after they execute their actions. Thus, the loss function can be given as follows:

$L(\theta_i) = \mathbb{E}_{s,a,r,s'}\big[(Q_i^{\mu}(s, a_1, \ldots, a_N) - y)^2\big] \quad (11)$

where $y = r_i + \gamma\, Q_i^{\mu'}(s', a_1', \ldots, a_N')\big|_{a_j' = \mu_j'(s_j')}$ and $\mu' = \{\mu_{\theta_1'}, \ldots, \mu_{\theta_N'}\}$ is the set of policies that the agents have learned in the target networks with parameters $\theta_i'$.

We can see that this algorithm still requires the assumption that agent $i$ knows the other agents' policies. This assumption is widely adopted in the velocity obstacle (VO) method, which is used to solve the collision avoidance problem. In order to remove this assumption, each agent $i$ estimates the policies of the other agents and takes $\hat{\mu}_i^j$ as the real policy of agent $j$. The target value is then computed with these estimates as

$\hat{y} = r_i + \gamma\, Q_i^{\mu}\big(s', \hat{\mu}_i^1(s_1'), \ldots, \hat{\mu}_i^N(s_N')\big) \quad (13)$

where $\hat{\mu}_i^j$ denotes the target network for the approximate policy $\hat{\mu}_i^j$. When training the network, the action-value function $Q_i^{\mu}$ can be updated, and the latest samples of each agent $j$ in the replay buffer can be used to perform a single gradient step to update the parameters of the critic network. Thus, we can remove the assumption that each agent needs to know the others' policies. The whole algorithm is shown in Algorithm 1.

Algorithm 1 Formation Control of Multirobots with Deep Reinforcement Learning
for episode = 1 to M do
    Receive the initial state s for each robot.
    for t = 1 to maximum iteration number do
        For each robot i, select the action a_i = μ_{θ_i}(o_i) + N_t with respect to the current policy.
        Each robot executes its action a_i.
        Each robot receives the reward r_i and moves to the next state s_i' according to the system dynamics (1).
        Store (s, a, r, s') in the replay buffer D.
        for robot i = 1 to N do
            s_i ← s_i' for each robot i.
            Sample a random minibatch of S samples (s^j, a^j, r^j, s'^j) from the replay buffer D.
            Set ŷ_j by (13).
            Update the critic by minimizing the loss function (12).
            Update the actor by the sampled policy gradient (10).
        end for
    end for
end for
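For concreteness, the sketch below shows one inner update of Algorithm 1 in PyTorch style, combining the estimated-policy target (13), the critic loss (11), and the sampled policy gradient (10). The class interfaces, buffer layout, and names are our assumptions, not the authors' code:

import torch
import torch.nn.functional as F

def update_robot_i(i, critic_i, critic_i_target, actor_i, target_policy_estimates,
                   batch, gamma=0.99):
    # batch: per-agent state lists s, s_next; per-agent actions; per-agent rewards
    s, actions, rewards, s_next = batch
    # Target value (13): feed agent i's *estimated* target policies of all agents
    with torch.no_grad():
        a_next = [mu_hat(s_next[j]) for j, mu_hat in enumerate(target_policy_estimates)]
        y_hat = rewards[i] + gamma * critic_i_target(s_next, a_next)
    # Critic update: squared TD error of the centralized action-value function
    critic_loss = F.mse_loss(critic_i(s, actions), y_hat)
    # Actor update via (10): replace agent i's stored action by its current
    # policy output and ascend the critic (i.e., descend its negation)
    a_pg = [a.detach() for a in actions]
    a_pg[i] = actor_i(s[i])
    actor_loss = -critic_i(s, a_pg).mean()
    return critic_loss, actor_loss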
IV. EXPERIMENT

In this part, we demonstrate our experiments in detail. We first use the Gazebo simulation environment to train the DRL agents and obtain the network model; then we deploy the trained algorithm on real robots to show its effectiveness. The robot systems used here are four Pioneer 3DX robots. Each is a differential-drive robot with two active wheels and two velocity commands: linear and angular velocity. Each is equipped with an NVIDIA TX1 so that the robots can communicate with each other using the Robot Operating System (ROS). The parameters used in the algorithm are shown in Table I. After training the network, we can use the proposed algorithm to generate the paths that achieve formation control for the multirobot system.
The action of each agent is converted to the velocity commands of its nonholonomic mobile robot by

$\begin{bmatrix} v_i \\ w_i \end{bmatrix} = \begin{bmatrix} \cos(\theta_i) & \sin(\theta_i) \\ -(1/d)\sin(\theta_i) & (1/d)\cos(\theta_i) \end{bmatrix} \begin{bmatrix} a_{xi} \\ a_{yi} \end{bmatrix}$

where $[a_{xi}, a_{yi}]^T = a_i$ in (1). This is a simple kinematic equation. With it, we can transfer the action of each agent directly to the velocity commands used by the nonholonomic mobile robot.
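As a hedged sketch of how such a command could be issued on the Pioneer robots (the paper states only that ROS is used; the node and topic names below are illustrative), the transform can be applied and published as a standard Twist message. The wheel parameter d = 0.6 m follows Table II:

import math
import rospy
from geometry_msgs.msg import Twist

def action_to_cmd(ax, ay, theta, d=0.6):
    # Kinematic mapping from the planar action a_i = [ax, ay] to (v_i, w_i)
    v = math.cos(theta) * ax + math.sin(theta) * ay
    w = (-math.sin(theta) * ax + math.cos(theta) * ay) / d
    return v, w

rospy.init_node('formation_controller')          # illustrative node name
pub = rospy.Publisher('/robot1/cmd_vel', Twist, queue_size=1)

def send(ax, ay, theta):
    v, w = action_to_cmd(ax, ay, theta)
    msg = Twist()
    msg.linear.x = v        # linear velocity command
    msg.angular.z = w       # angular velocity command
    pub.publish(msg)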
Fig. 2. The learning curve for the formation control problem. The curve shows the mean of the average reward over 1000 episodes.

Since our formation control is based on the consensus algorithm, we first seek the consensus state of all the robots, denoted by $s^r = [x_c^r, y_c^r, \theta_c^r]^T$. The notations $[x_c^r, y_c^r]$ and $\theta_c^r$ denote the reference position and orientation, respectively, of the formation center of the multirobot team. Since $s^r$ changes dynamically, each robot maintains a local variable $s_i^r = [x_{ci}, y_{ci}, \theta_{ci}]^T$, which is the value of the state $s^r$ sensed by robot $i$. The objective of consensus is to make sure that the value of $s_i^r$ tracks the value of $s^r$, $i = 1, \ldots, N$.
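The text does not spell out the consensus update itself; as an illustrative sketch only, a standard discrete-time consensus step over the communication graph G, driving the local estimates s_i^r toward agreement, could look like:

import numpy as np

def consensus_step(s_local, neighbors, eps=0.2):
    # s_local: list of N local estimates s_i^r = [x_ci, y_ci, theta_ci]
    # neighbors: adjacency list of the undirected graph G = (N, E)
    # eps: step size; for stability it should stay below 1 / (max node degree)
    s = [np.asarray(v, dtype=float) for v in s_local]
    updated = []
    for i, s_i in enumerate(s):
        # Move toward the neighbors' estimates (standard consensus protocol)
        correction = sum(s[j] - s_i for j in neighbors[i])
        updated.append(s_i + eps * correction)
    return updated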
In order to perform the formation control, a constraint in the form of (2) needs to be satisfied for each robot $i$, which means that each robot needs to determine its own desired position $(x_i^d, y_i^d)$ relative to the sensed consensus state:

$\begin{bmatrix} x_i^d \\ y_i^d \end{bmatrix} = \begin{bmatrix} x_{ci} \\ y_{ci} \end{bmatrix} + \begin{bmatrix} \cos(\theta_{ci}) & -\sin(\theta_{ci}) \\ \sin(\theta_{ci}) & \cos(\theta_{ci}) \end{bmatrix} \begin{bmatrix} \tilde{x}_i^f \\ \tilde{y}_i^f \end{bmatrix} \quad (15)$
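A minimal NumPy sketch of (15), computing a robot's desired position from its sensed formation center and its fixed formation offset (the offsets match Table II; the function and variable names are ours):

import numpy as np

def desired_position(s_ri, offset):
    # s_ri: sensed consensus state [x_ci, y_ci, theta_ci] of robot i
    # offset: formation offset [x_f, y_f] of robot i relative to the center
    x_c, y_c, th = s_ri
    rot = np.array([[np.cos(th), -np.sin(th)],
                    [np.sin(th),  np.cos(th)]])    # rotation by theta_ci
    return np.array([x_c, y_c]) + rot @ np.asarray(offset)   # eq. (15)

# Example with the offsets of Table II (a square formation around the center):
offsets = {1: (1.5, 1.5), 2: (-1.5, 1.5), 3: (1.5, -1.5), 4: (-1.5, -1.5)}
xd, yd = desired_position([0.0, 0.0, np.pi / 2], offsets[1])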
[Figures: position trajectories of robot1–robot4 in the x and y directions versus time, and their velocities in the x and y directions versus time.]
Fig. 5. The velocity in the x direction of each robot.

Kinematic constraints of each nonholonomic mobile robot need to be considered when the experiments are conducted. In many works, the kinematic constraints of the mobile robot are hard to encode, which may increase the computational complexity [38]. However, it is quite easy to incorporate these kinematic constraints into the RL framework. Some typical kinematic constraints are

$a < v_{\max} \quad (17)$

and

$|\theta_{t+1} - \theta_t| < \Delta t \cdot v_{\max} \quad (18)$

where (17) represents the physical limit that the velocity of the robot cannot exceed the maximum velocity the hardware can provide, and (18) specifies a maximum turning rate that corresponds to the maximum velocity of the robot. The parameters used in our experiment are shown in Table II.

TABLE II
PARAMETERS USED IN THE EXPERIMENT

Parameter Name    Parameter Value
x̃_1^f              1.5 m
ỹ_1^f              1.5 m
x̃_2^f             -1.5 m
ỹ_2^f              1.5 m
x̃_3^f              1.5 m
ỹ_3^f             -1.5 m
x̃_4^f             -1.5 m
ỹ_4^f             -1.5 m
d_i                0.6 m
v_max              1 m/s

With these kinematic constraints incorporated in the experiment, the trajectories of all four robots are shown in Fig. 7. The arrow direction denotes the direction in which each robot moves.
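One simple way to incorporate (17) and (18) into the RL framework is to clip every sampled action before it is executed. A sketch under assumed values (the control period Δt is not stated in the text, and v_max = 1 m/s follows Table II):

import numpy as np

V_MAX = 1.0   # m/s, from Table II
DT = 0.1      # control period in seconds; illustrative, not given in the text

def clip_action(a, theta_prev, theta_cmd):
    # (17): the commanded speed cannot exceed the hardware limit v_max
    a = np.asarray(a, dtype=float)
    speed = np.linalg.norm(a)
    if speed > V_MAX:
        a = a * (V_MAX / speed)
    # (18): bound the heading change per step by dt * v_max
    dtheta = np.clip(theta_cmd - theta_prev, -DT * V_MAX, DT * V_MAX)
    return a, theta_prev + dtheta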