Learning Vision-Based Cohesive Flight in Drone Swarms
Abstract
This paper presents a data-driven approach to learning vision-based collective behavior from a simple flocking algorithm. We simulate a swarm of quadrotor drones and formulate the controller as a regression problem in which we generate 3D velocity commands directly from raw camera images. The dataset is created by simultaneously acquiring omnidirectional images and computing the corresponding control command from the flocking algorithm. We show that a convolutional neural network trained on the visual inputs of the drone can learn not only robust collision avoidance but also coherence of the flock, in a sample-efficient manner. The neural controller effectively learns to localize other agents in the visual input, which we show by visualizing the regions with the most influence on the motion of an agent. This weakly supervised saliency map can be computed efficiently and may be used as a prior for subsequent detection and relative localization of other agents. We remove the dependence on sharing positions among flock members by taking only local visual information into account for control. Our work can therefore be seen as the first step towards a fully decentralized, vision-based flock without the need for communication or visual markers to aid detection of other agents.

Figure 1: Vision-based flock of nine drones during migration. Our visual swarm controller operates in a fully decentralized manner and provides collision-free, coherent collective motion without the need to share positions among agents. The behavior of an agent depends only on its omnidirectional visual inputs (colored rectangle, see Fig. 3 for details) and the migration point (blue circle and arrows). Collision avoidance (red arrows) and coherence (green arrows) between flock members are learned entirely from visual inputs.

1 Introduction

Collective motion of animal groups such as flocks of birds is an awe-inspiring natural phenomenon that has profound implications for the field of aerial swarm robotics (Floreano and Wood 2015; Zufferey et al. 2011). Animal groups in nature operate in a completely self-organized manner: the interactions between individuals are purely local, and decisions are made by the animals themselves. By taking inspiration from decentralization in biological systems, we can develop powerful robotic swarms that are 1) robust to failure and 2) highly scalable, since the number of agents can be increased or decreased depending on the workload.

One of the most appealing characteristics of collective animal behavior for robotics is that decisions are made based on local information such as visual perception. As of today, however, most multi-agent robotic systems rely either on entirely centralized control (Mellinger and Kumar 2011; Kushleyev et al. 2013; Preiss et al. 2017; Weinstein et al. 2018) or on wireless communication of positions (Vásárhelyi et al. 2014; Virágh et al. 2014; Vásárhelyi et al. 2018) obtained from a motion capture system or a global navigation satellite system (GNSS). The main drawbacks of these approaches are, respectively, the introduction of a single point of failure and the reliance on unreliable data links. Centralized control bears a significant risk since the agents lack the autonomy to make their own decisions in failure cases such as a communication outage. The possibility of failure is even higher in dense urban environments, where GNSS measurements are often unreliable and imprecise.
Vision is arguably the most promising sensory modality for achieving a maximum level of autonomy in robotic systems, particularly considering the recent advances in computer vision and deep learning (Krizhevsky, Sutskever, and Hinton 2012; LeCun, Bengio, and Hinton 2015; He et al. 2016). Apart from being lightweight and having relatively low power consumption, even cheap commodity cameras provide an unparalleled information density compared to sensors of similar cost. These characteristics make cameras particularly desirable for the deployment of an aerial multi-robot system. The difficulty when using cameras for robot control lies in the interpretation of the visual information, a hard problem that this paper addresses directly.
In this work, we propose a reactive control strategy based only on local visual information. We formulate the swarm interactions as a regression problem in which we predict control commands as a nonlinear function of the visual input of a single agent. To the best of our knowledge, this is the first successful attempt to learn vision-based swarm behaviors such as collision-free navigation in an end-to-end manner directly from raw images.
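To make the formulation concrete, here is a minimal sketch of such an image-to-velocity regression in Python (PyTorch). The architecture and all names (e.g. VelocityRegressor, train_step) are illustrative assumptions rather than the network used in this paper; the sketch only shows the pattern of predicting a 3D velocity command from an image and minimizing the mean squared error against the commands of the flocking algorithm.

```python
import torch
import torch.nn as nn

class VelocityRegressor(nn.Module):
    """Hypothetical CNN mapping a camera image to a 3D velocity command."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 3)  # (vx, vy, vz)

    def forward(self, img):
        return self.head(self.features(img).flatten(1))

model = VelocityRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(images, target_velocities):
    # Targets are the velocity commands computed by the flocking algorithm.
    optimizer.zero_grad()
    loss = loss_fn(model(images), target_velocities)
    loss.backward()
    optimizer.step()
    return loss.item()
```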
2 Related Work

We classify the related work into three main categories. Sec. 2.1 considers literature in which a flock of drones is controlled in a fully decentralized manner. Sec. 2.2 covers recent data-driven advances in vision-based drone control. Finally, Sec. 2.3 combines ideas from the previous sections into approaches that are both vision-based and decentralized.
2.1 Decentralized flocking with drones

Flocks of autonomous drones such as quadrotors and fixed-wings are the focus of recent research in swarm robotics. Early work presents ten fixed-wing drones deployed in an outdoor environment (Hauert et al. 2011). Their collective motion is based on Reynolds flocking (Reynolds 1987) with a migration term that allows the flock to navigate towards a desired goal (a minimal sketch of such a rule set follows below). Owing to the use of a nonholonomic platform, the authors study the interplay between the communication range and the maximum turning rate of the agents.
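For concreteness, the sketch below shows one common form of a Reynolds-style rule set with a migration term: separation from close neighbors, cohesion toward the flock centroid, velocity alignment, and attraction to a goal point. The weights and thresholds are illustrative assumptions, not the exact controller of (Hauert et al. 2011) or of this paper.

```python
import numpy as np

def reynolds_step(pos, vel, goal, r_sep=1.0,
                  w_sep=1.5, w_coh=1.0, w_ali=1.0, w_mig=0.5):
    """One flocking update for all agents; pos and vel are (N, 3) arrays."""
    n = len(pos)
    cmd = np.zeros_like(vel)
    for i in range(n):
        offsets = pos - pos[i]                 # vectors from agent i to all agents
        dists = np.linalg.norm(offsets, axis=1)
        others = np.arange(n) != i
        close = others & (dists < r_sep)
        # Separation: push away from neighbors that are too close.
        sep = -offsets[close].sum(axis=0) if close.any() else np.zeros(3)
        # Cohesion: pull toward the centroid of the other agents.
        coh = offsets[others].mean(axis=0)
        # Alignment: match the mean velocity of the other agents.
        ali = vel[others].mean(axis=0) - vel[i]
        # Migration: steer toward the shared goal point.
        mig = goal - pos[i]
        cmd[i] = w_sep * sep + w_coh * coh + w_ali * ali + w_mig * mig
    return cmd
```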
Thus far, the largest decentralized quadrotor flock consisted of 30 autonomous agents flying in an outdoor environment (Vásárhelyi et al. 2018). The underlying algorithm has many free parameters, which necessitates the use of optimization methods. To this end, the authors employ an evolutionary algorithm to find the best flocking parameters according to a fitness function that relies on several order parameters. The swarm can operate in a pre-defined confined space by incorporating repulsive virtual agents.
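As a toy illustration of this kind of parameter search (not the specific evolutionary algorithm of Vásárhelyi et al. 2018), the sketch below runs a simple (1+λ) evolution strategy over a flocking-parameter vector; the fitness callable is assumed to simulate the swarm and return a scalar score built from order parameters.

```python
import numpy as np

def evolve(fitness, dim=4, pop=32, gens=50, sigma=0.3, seed=0):
    """Keep the best parameter vector; sample `pop` Gaussian mutations per generation."""
    rng = np.random.default_rng(seed)
    best = rng.uniform(0.0, 2.0, dim)  # e.g. (w_sep, w_coh, w_ali, w_mig)
    best_fit = fitness(best)
    for _ in range(gens):
        for cand in best + sigma * rng.standard_normal((pop, dim)):
            f = fitness(cand)
            if f > best_fit:
                best, best_fit = cand.copy(), f
    return best, best_fit
```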
The commonality of all the mentioned approaches, and of others such as (Vásárhelyi et al. 2014; Dousse et al. 2017), is the ability to share GNSS positions wirelessly among flock members. However, there are many situations in which wireless communication is unreliable or GNSS positions are too imprecise. Position imprecision cannot be tolerated where the environment requires small inter-agent distances, for example when traversing narrow passages in urban environments. In such situations, tall buildings may deflect the signal, and communication outages occur because the wireless bands are over-utilized.
2.2 Vision-based single drone control

Vision-based control of a single flying robot is facilitated by several recent advances in the field of machine learning. In particular, the controllers are based on three types of learning methods: imitation learning, supervised learning, and reinforcement learning.

Imitation learning is used in (Ross et al. 2013) to control a drone in a forest environment based on human pilot demonstrations. The authors motivate the importance of following suboptimal control policies in order to cover more of the state space. The reactive controller can avoid trees by adapting the heading of the drone; the limiting factor is ultimately the field of view of a single front-facing camera.
A supervised learning approach (Loquercio et al. 2018) features a convolutional network that predicts a steering angle and a collision probability for drone navigation in urban environments. Steering angle prediction is formulated as a regression problem by minimizing the mean squared error between predictions and ground-truth annotated examples from a dataset geared towards autonomous driving research. The probability of collision is learned by minimizing the binary cross-entropy on labeled images collected while riding a bicycle through urban environments. The drone is controlled directly by the steering angle, whereas its forward velocity is modulated by the collision probability.
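The sketch below illustrates such a two-headed design: a shared convolutional trunk with a steering head trained with mean squared error and a collision head trained with binary cross-entropy. Layer sizes and names are assumptions for illustration, not the exact network of (Loquercio et al. 2018).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SteeringCollisionNet(nn.Module):
    """Hypothetical shared trunk with a regression and a classification head."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.steer = nn.Linear(64, 1)      # steering angle (regression)
        self.collision = nn.Linear(64, 1)  # collision logit (classification)

    def forward(self, img):
        h = self.trunk(img)
        return self.steer(h), self.collision(h)

def combined_loss(pred_steer, pred_coll, gt_steer, gt_coll):
    # MSE for the steering angle, BCE for the collision probability.
    return (F.mse_loss(pred_steer, gt_steer)
            + F.binary_cross_entropy_with_logits(pred_coll, gt_coll))

# At deployment, the forward velocity can be scaled down as the predicted
# collision probability rises, e.g. v = v_max * (1 - torch.sigmoid(logit)).
```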
An approach based on reinforcement learning (Sadeghi and Levine 2017) shows that a neural network trained entirely in a simulated environment can generalize to flights in the real world. In contrast with the previous methods, which are based only on supervised learning, the authors additionally employ reinforcement learning to derive a robust control policy from simulated data. A 3D modeling suite is used to render various hallway configurations with randomized visual conditions. Surprisingly, the control policy trained entirely in the simulated environment is able to navigate real-world corridors and rarely leads to collisions.

The work described above and other similar methods, for instance (Giusti et al. 2016; Gandhi, Pinto, and Gupta 2017; Smolyanskiy et al. 2017), use a data-driven approach to control a flying robot in real-world environments. A shortcoming of these methods is that the learned controllers operate only in two-dimensional space, which bears similar characteristics to navigation with ground robots. Moreover, these approaches do not demonstrate the ability of the controllers to coordinate a multi-agent system.

2.3 Vision-based multi-drone control

The control of multiple agents based on visual inputs has been achieved with relative localization techniques (Saska et al. 2017) for a group of three quadrotors. Each agent is equipped with a camera and a circular marker that enables the detection of other agents and the estimation of relative distance. The system relies only on local information obtained from the onboard cameras in near real-time.

Thus far, decentralized vision-based drone control has been realized by mounting visual markers on the drones (Saska et al. 2014; Faigl et al. 2013; Krajník et al. 2014). Although this simplifies the relative localization problem significantly, the marker-based approach would not be desirable for real-world deployment of flying robots. The visual markers used are relatively large and bulky, which unnecessarily
Figure 2: Overview of our method. (2a) Dataset generation: the dataset is generated using a simple flocking algorithm (see Sec. 3.1) deployed on simulated quadrotor drones (see Sec. 3.2). The agents are initialized in random non-overlapping positions and assigned random linear trajectories while velocity targets are computed and images are acquired from all mounted cameras (see Sec. 3.3). (2b) Training phase: during training (see Sec. 3.4), we employ a convolutional neural network to learn the mapping between visual inputs and control outputs by minimizing the error between predicted and target velocity commands. (2c) Vision-based control: during the last phase (see Sec. 3.5), we use the trained network as a reactive controller for emulating vision-based swarm behavior. The behavior of each agent is entirely vision-based and depends only on its visual input at a specific time step. Initial flight experiments have confirmed that, by using PX4 and ROS, we can use the exact same software infrastructure for both simulation and reality.
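A possible shape for the data-collection loop of phase (2a) is sketched below. The callables (step_sim, get_images, get_targets) are hypothetical stand-ins for the simulator interface; the point is that every training label comes for free from the flocking algorithm, so no manual annotation is required.

```python
def collect_dataset(step_sim, get_images, get_targets, n_steps=500):
    """Collect (image, velocity-target) pairs over one simulated episode.

    get_images  -> one omnidirectional image per agent at the current step
    get_targets -> the flocking algorithm's velocity command per agent
    """
    samples = []
    for _ in range(n_steps):
        samples.extend(zip(get_images(), get_targets()))
        step_sim()  # advance the simulation along the random trajectories
    return samples
```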
[Figure 4 panels (a)–(h): top-view trajectory plots with axes x [m] and y [m], inter-agent distance [m] over time steps, and order parameter over time steps, with traces labeled "Position" and "Vision"; see the caption below.]
Figure 4: Flocking with a common migration goal (left column) and opposing migration goals (right column). First two rows: top view of a swarm migrating using the position-based (4a and 4b) and vision-based (4c and 4d) controllers. The trajectory of each agent is shown in a different color. The colored squares, triangles, and circles show the agent configuration during the first, middle, and last time step, respectively. The gray square and gray circle denote the spawn area and the migration point, respectively. For the flock with opposing migration goals (4b and 4d), the waypoint on the right is given to a subset of five agents (solid lines), whereas the waypoint on the left is given to a subset of four agents (dotted lines). Third row: inter-agent minimum and maximum distances from (7) over time (4e and 4f) while using the position-based and vision-based controllers. The mean minimum distance between any pair of agents is denoted by a solid line, whereas the mean maximum distance is shown as a dashed line. The colored shaded regions span the minimum and maximum distance between any pair of agents. Fourth row: order parameter from (8) over time (4g and 4h) during flocking. Note that the curve for the position-based flock does not continue until the last time step, since the vision-based flock takes longer to reach the migration point.
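For reference, a common way to compute the quantities plotted in the third and fourth rows is sketched below; the exact definitions of (7) and (8) appear in the paper and may differ in detail. Here the order parameter is taken as the mean pairwise velocity correlation, which equals 1 for a perfectly aligned flock.

```python
import numpy as np

def distance_extrema(pos):
    """Minimum and maximum pairwise distance at one time step; pos is (N, 3)."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    iu = np.triu_indices(len(pos), k=1)  # unique pairs only
    return d[iu].min(), d[iu].max()

def order_parameter(vel, eps=1e-9):
    """Mean pairwise correlation of normalized velocities; vel is (N, 3)."""
    v = vel / (np.linalg.norm(vel, axis=1, keepdims=True) + eps)
    n = len(v)
    dots = v @ v.T  # cosine similarities, diagonal entries equal 1
    return (dots.sum() - n) / (n * (n - 1))
```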