Deep Reinforcement Learning Enabled Multi-UAV Scheduling for Disaster Data Collection With Time-Varying Value

Pengfu Wan, Gangyan Xu, Member, IEEE, Jiawei Chen, and Yaoming Zhou

Abstract— The congestion and disruption of information infrastructures frequently happen during disasters, which hinders the understanding of disaster scenarios and thus impedes rapid response activities. Exploiting their high flexibility and efficiency, this paper proposes to use UAVs as temporary and mobile relays for disaster data collection. However, different from many existing data collection scenarios in industrial sectors, the disaster data value varies with UAV arrival time and service time in terms of its importance for disaster response, which makes the scheduling of UAVs challenging. To address this problem, this paper proposes an attention-based Deep Reinforcement Learning (DRL) method for multi-UAV scheduling considering time-varying data value. Specifically, the problem is modeled as a specific team orienteering problem with time-varying value. Then the relationships between UAV route selection and service time at each node are analyzed, based on which the computing efficiency of solution algorithms can be improved. After that, an attention-based DRL method is developed, with a calibrated attention model and decoding method. Finally, systematic computational experiments are conducted to evaluate the performance of the proposed method, which demonstrates its superiority over popular methods in UAV scheduling, especially for large-scale and complex scenarios.

Index Terms— Disaster response, unmanned aerial vehicle, multi-UAV scheduling, data collection, deep reinforcement learning, time-varying value.

Manuscript received 17 April 2023; revised 10 October 2023 and 9 December 2023; accepted 16 December 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 72174042 and Grant 72101223, in part by the Natural Science Foundation of Guangdong Province under Grant 2023A1515011402, in part by the Natural Science Foundation of Shenzhen Municipality under Grant JCYJ20230807140406013, and in part by the Startup Fund of The Hong Kong Polytechnic University. The Associate Editor for this article was J. Li. (Corresponding author: Gangyan Xu.)
Pengfu Wan, Gangyan Xu, and Jiawei Chen are with the Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University, Hong Kong (e-mail: pengfu.wan@connect.polyu.hk; gangyan.xu@polyu.edu.hk; superlaser-jw.chen@connect.polyu.hk).
Yaoming Zhou is with the Department of Industrial Engineering and Management, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: iezhou@sjtu.edu.cn).
Digital Object Identifier 10.1109/TITS.2023.3345280

I. INTRODUCTION

TIMELY disaster data is essential for efficient responses and thus vital for saving lives and preventing economic losses [1], [2], [3]. Nevertheless, information infrastructures are vulnerable and may be congested or even destroyed by disasters, which impedes the disaster data collection process and leaves many blind spots in disaster areas [4]. To address such problems, many alternative approaches have been proposed, e.g., data crowd-sourcing [5], social media mining [6], and deploying mobile communication units [7]. Recently, with the technological advancement of the Unmanned Aerial Vehicle (UAV), it has been proposed to use UAVs as temporary and mobile relays for disaster data collection, with pilot applications in the 2015 Nepal earthquake, the 2021 Henan flash flood, the 2022 Luding earthquake, and other events.

UAV-based data collection is emerging in various scenarios, e.g., large-scale wireless sensor networks [8], communication networks [9], [10], [11], infrastructure inspection [12], and construction process monitoring [13], in which UAVs demonstrate the advantages of high flexibility and speed, economic efficiency, and ease of deployment. Meanwhile, many works have been conducted to improve collection quality and efficiency [14], [15], [16], among which UAV route planning and scheduling has attracted much attention, with objectives such as minimizing UAV energy consumption [17], [18], flight time [19], task completion time [20], and data packet losses [21]. These works provide extensive knowledge and valuable insights about UAV-based data collection. However, they cannot be directly adopted for disaster data collection, which has four distinct features that make the problem more complex and challenging.

Firstly, the value of disaster data is highly dependent on collection time and varies with time. On the one hand, since the timeliness of data is vital in disasters, the data value decreases dramatically with time. On the other hand, disaster data value peaks at the time collection starts and decreases dramatically afterward. This is because the most critical data (e.g., rescue and relief demand data) is usually sent by victims at the very beginning; thereafter, they tend to use the temporary relay (UAV) for non-critical or even disaster-irrelevant communications. Incorporating such time-varying characteristics of data value greatly increases the complexity of the problem and makes efficient scheduling of UAVs more challenging.

Secondly, disaster data volumes and values usually differ across regions, depending on their populations and degrees of damage. As a result, the disaster area contains heterogeneous data points (regions), which differ from the settings of previous works in industrial scenarios. Such heterogeneity further affects the decisions on UAVs' service time at different points and makes the scheduling problem more complex.

Thirdly, due to the scarcity of UAVs in disasters, the fleet cannot cover all affected regions but only parts of them, as in the case of the 2021 Henan flood. This differentiates our work from previous ones in industrial data collection, where the problems can be modeled and solved as variants of Vehicle Routing Problems (VRPs). In this work, besides efficient routing decisions, the subset of regions to be covered must also be decided.

Fourthly, decisions in emergency situations should be made efficiently. Nevertheless, considering the complexities of the problem discussed above, and its relatively large scale in terms of the number of demand regions and UAVs, it is a challenge to design efficient decision-making algorithms.

Taking the above features and challenges into consideration, this work develops a Deep Reinforcement Learning (DRL) based multi-UAV scheduling method to maximize the value of disaster data collected, and ultimately to support efficient and effective disaster responses. The contributions of this work lie in the following four aspects:

• An integrated mathematical model is developed that captures the features of time-varying data value and partial coverage of demand regions for multi-UAV scheduling in disaster data collection.
• Through analyzing the interaction between UAV arrival time and service time on the potential disaster data value collected, an approximate analytical solution is developed to accelerate the UAV scheduling process.
• Through embedding the interdependent decision processes of UAV routes and service time at each region, a new attention-based DRL framework is developed that can support real-time decision-making for multi-UAV scheduling in disaster data collection.
• Systematic experimental case studies are conducted that verify the advantages of the proposed method over existing learning-based and heuristics-based methods in different scenarios. The results can also be adopted as benchmarks for future research.

The rest of the paper is structured as follows. Related works are reviewed in Section II. The mathematical model of the multi-UAV based disaster data collection problem is presented and analyzed in Section III. Section IV discusses the DRL-based solution method. Experimental case studies and results analysis are given in Section V. Finally, Section VI concludes the paper and points out future work.

II. RELATED WORK

In this section, the relevant works are reviewed from three streams: UAV-assisted data collection, team orienteering problems with time-varying value, and reinforcement learning for combinatorial optimization.

A. UAV-Assisted Data Collection

UAV-assisted data collection has attracted a lot of attention in recent years, and extensive work has been done across different areas [14], [22]. However, in practical applications, UAV-assisted data collection faces several challenges, including the selection of the data collection mode [23], sensor deployment [24], and UAV speed control [25]. In particular, as an important issue in this field, UAV scheduling has attracted much attention with different performance measures, such as energy consumption, coverage range, collection cost, and efficiency [26]. For example, considering the high energy consumption of sensors in data transmission, Baek et al. [27] proposed an energy-efficient UAV routing method that maximizes the minimum residual energy of sensors. Li et al. [28] considered the cooperation among vehicles and UAVs in 6G-based IoT networks, and designed data collection routes to improve the coverage ratio and reduce collection costs. Wang et al. [29] took multiple objectives of UAV data collection into account and proposed two schemes for flight cycle minimization and energy efficiency maximization.

Different techniques have also been developed for solving UAV scheduling problems in data collection, e.g., graph-theory-based, optimization-based, and learning-based methods [14]. Specifically, graph-theory-based methods convert the geographical space into graphs and generate data collection routes based on graph analysis [30], and are widely applied in space division problems [31]. Optimization-based methods can obtain optimal or near-optimal routes using optimization techniques such as branch and bound [32], dynamic programming [33], and successive convex optimization [34]. However, they cannot cope with complex and large-scale problems, and are thus inappropriate for disaster scenarios that require efficient decision-making [14]. Recently, learning-based methods that adopt supervised learning [35] and reinforcement learning [36] techniques have been emerging, which can deal with dynamic and uncertain environments as well as large-scale problems. There are currently two mainstream reinforcement learning frameworks for UAV-based data collection. One adopts grid-world models to depict data collection scenarios and uses value-based methods [37], [38], [39]. The other applies policy-based methods, which allow more flexible scenario settings and UAV action spaces [40], [41].

However, previous works focus on trajectory design, and UAV route scheduling issues are still not well covered. Besides, existing works mainly consider the amount of data collected, while in a disaster scenario both the value and the amount of data collected should be considered, which is more complex and challenging.

B. Team Orienteering Problem With Time-Varying Value

The Team Orienteering Problem (TOP) [42] refers to the problem in which, given a set of nodes with different values and a fixed amount of time, each team member decides its path to visit these nodes such that the total value of all paths is maximized. Different from VRPs [43], members in TOP only need to visit a subset of the nodes, which makes it more appropriate for scenarios with insufficient resources, such as healthcare logistics [44] and disaster response [45].

Driven by various practical cases, especially in emergency response systems, TOP with time-varying value has also been studied [46], [47]. Generally, the time-varying properties can be classified into three groups. The first is the arrival-time-dependent value, where the value varies with the arrival time of members at each node.


For example, Erkut and Zhang [48] created a competitive scenario for salespeople in which sales potentials decreased linearly with time, and Ekici and Retharekar [49] extended similar models to multiple-agent problems. The second is the service-time-dependent value, which means the value attained is a function of the service time. For example, Erdoğan and Laporte [50] provided a model in which vehicles can continue collecting the remaining value by making several passes, so that the total value is affected by the service times. The third group is the compound-time-dependent value, which is a combination of arrival-time- and service-time-dependent values. Such a scenario was considered in [47] within a disaster rescue setting, where the rescue success rate is related to ambulance arrival time and rescue time. Yu et al. [51] further extended it to robust models where service time is uncertain. In these works, it is also assumed that the accumulated data value decreases linearly with arrival time. It is worth noting that the compound-time-dependent value is the most complicated one, and cases related to emergency response and disaster data collection always fall into this type [52].

Existing works mainly solve TOP with time-varying value using exact [47] and various heuristic algorithms [51]. Although these approaches perform well in small and medium sized problems and in uncertain environments, they cannot cope with large-scale problems efficiently. In addition, compared to existing works in humanitarian logistics, the time-varying models of disaster data collection are built on instant data value rather than accumulated data value, and have not been well investigated.

C. Reinforcement Learning for Combinatorial Optimization

RL methods have been widely adopted in many CO problems, including the maximum cut problem [53], [54], the bin packing problem [55], [56], and the minimum vertex cover problem [53], [57]. In addition, route scheduling problems, such as the Traveling Salesman Problem (TSP), VRP, and TOP, which have been widely studied and well solved using exact or heuristic methods [58], [59], [60], [61], have received increasing attention from the RL research community to cope with dynamic scenarios and large-scale problems [62], [63], [64]. For instance, Lin et al. [65] considered time window constraints and battery charging demand for VRP and built an end-to-end DRL framework. Li et al. [66] developed a DRL method based on a vehicle selection decoder to deal with VRP with heterogeneous capacities.

RL methods for CO consist mainly of value-based and policy-based approaches. Value-based approaches learn a value function to evaluate possible states and actions, then construct solutions based on these evaluations, such as Deep Q Learning (DQN) [53], [67]. In contrast, policy-based approaches directly determine the next action given the current state, including REINFORCE [63], [68] and Proximal Policy Optimization (PPO) [54], [69].

Although RL methods for CO problems are efficient and relatively mature, most applications only consider one decision variable, while both traveling routes and service times have to be decided in multi-UAV scheduling for disaster data collection problems. In addition, as service time is a continuous variable, it is difficult to determine directly in RL methods.

III. PROBLEM MODELLING

In this section, the model of the multi-UAV scheduling problem for disaster data collection is developed first, then the interactions between UAV route design and service time decisions are analyzed.

A. Mathematical Model

After disasters strike, many regions within the affected area may lose their communication networks, which hinders victims inside from sending out their rescue/relief demand data. The amount and value of these disaster-related data differ across regions, depending on the size, population, and damage of each region, as illustrated in Fig. 1. A fleet of UAVs is then dispatched from a central depot to visit the regions and hover over them as temporary relays to collect the disaster data sent by victims.

Fig. 1. Multi-UAV scheduling for disaster data collection.

Given the set of disaster regions C and letting n = |C|, the set of all nodes within the affected area can be represented as N = {0} ∪ C ∪ {n+1}, where nodes {0} and {n+1} represent the same depot from which UAVs depart and to which they return. A mixed-integer nonlinear programming model for this problem is formulated in (1) to (9), and the notations adopted in the model are shown in Table I.

$$\max \sum_{i \in N} \sum_{j \in C} \sum_{k \in V} x_{ijk}\, f(a_{jk}, s_j) \tag{1}$$

$$\text{s.t.} \quad \sum_{i=0, i \neq m}^{n} x_{i,m,k} - \sum_{j=1, j \neq m}^{n+1} x_{m,j,k} = 0 \quad \forall m \in C,\ k \in V \tag{2}$$

$$\sum_{k \in V} \sum_{i=0, i \neq m}^{n} x_{i,m,k} \le 1 \quad \forall m \in C \tag{3}$$

$$\sum_{m=1}^{n+1} x_{0,m,k} \le 1 \quad \forall k \in V \tag{4}$$

$$a_{i,k} + t_{i,j} + s_i - a_{j,k} \le M(1 - x_{i,j,k}) \quad \forall i, j \in N,\ k \in V \tag{5}$$

$$a_{0,k} = 0 \quad \forall k \in V \tag{6}$$

$$a_{n+1,k} \le T_{\max} \quad \forall k \in V \tag{7}$$

$$s_i \ge 0 \quad \forall i \in N \tag{8}$$

$$x_{i,j,k} \in \{0, 1\} \quad \forall i, j \in N,\ k \in V \tag{9}$$

The objective function (1) maximizes the data value collected by all UAVs, which depends on the routes of the UAVs and the data collection time (service time) at each region. Constraint (2) means every UAV should leave each disaster region m ∈ C it visits, which guarantees flow conservation. Constraint (3) ensures every disaster region is visited at most once. Inequality (4) limits each UAV to at most one departure from the depot. The visiting sequence within each route is specified in constraint (5), where M is a large positive constant used to linearize the inequality. If x_{i,j,k} = 1, UAV k visits region j after visiting region i, and the arrival time at region j must be greater than or equal to the sum of the arrival time at region i, the traveling time from region i to region j, and the service time at region i. If x_{i,j,k} = 0, inequality (5) is always satisfied since M is large enough. Constraint (6) initializes the departure time, while constraint (7) is the endurance limitation of the UAVs. Constraints (8) and (9) define the ranges of the decision variables s_i and x_{ijk}.
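To make the route-level constraints concrete, the following sketch propagates arrival times along candidate routes exactly as constraint (5) prescribes when x_{i,j,k} = 1, and checks the single-visit, departure-time, endurance, and non-negativity conditions (3), (6), (7), (8). It is a minimal illustration rather than the authors' implementation; the straight-line travel-time assumption and the airspeed value come from the experiment settings reported later in Section V, and all function and variable names are ours.

```python
import math

def travel_time(p, q, speed=0.05):
    """Travel time between two points, assuming straight-line flight
    at the fixed airspeed used in the experiments (0.05 unit/min)."""
    return math.dist(p, q) / speed

def check_routes(routes, service, coords, depot, t_max=200.0):
    """Validate a candidate solution against constraints (2)-(9).

    routes  : list over UAVs, each a list of region indices (depot excluded)
    service : dict region index -> service time s_i >= 0
    coords  : dict region index -> (x, y) position
    depot   : (x, y) position of nodes 0 and n+1
    Returns the arrival times per UAV, or raises on a violation.
    """
    visited = set()
    arrivals = []
    for route in routes:
        a = 0.0                       # constraint (6): departure time is 0
        pos = depot
        times = {}
        for i in route:
            if i in visited:          # constraint (3): each region at most once
                raise ValueError(f"region {i} visited twice")
            if service.get(i, 0.0) < 0:          # constraint (8)
                raise ValueError(f"negative service time at region {i}")
            visited.add(i)
            a = a + travel_time(pos, coords[i])  # constraint (5): a_j = a_i + s_i + t_ij
            times[i] = a
            a = a + service.get(i, 0.0)
            pos = coords[i]
        a = a + travel_time(pos, depot)
        if a > t_max:                 # constraint (7): endurance limit
            raise ValueError(f"route exceeds Tmax: {a:.1f} > {t_max}")
        arrivals.append(times)
    return arrivals

if __name__ == "__main__":
    coords = {1: (0.2, 0.3), 2: (0.7, 0.8), 3: (0.5, 0.1)}
    routes = [[1, 3], [2]]            # two UAVs; partial coverage is allowed
    service = {1: 20.0, 2: 30.0, 3: 15.0}
    print(check_routes(routes, service, coords, depot=(0.0, 0.0)))
```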


TABLE I
NOTATION TABLE

B. Time-Varying Data Value

According to the analysis in Section I and Section II, the disaster data value v_{ik} collected by UAV k at node i is a function of a_{ik} and s_i, denoted as v_{ik} = f(a_{ik}, s_i).

1) Arrival-Time-Dependent Value: Due to the timeliness requirement in disasters, the earlier the disaster data are collected, the more value they may contain. Following investigations on the survival rate with rescue time [70], the instant data value of node i (with original instant data value β_i) is considered to decrease linearly with a_{ik}, denoted as b_{ik}:

$$b_{ik} = \beta_i - \alpha \cdot a_{ik} \quad (\alpha > 0) \tag{10}$$

where α is the decreasing rate of the data value.

2) Service-Time-Dependent Value: Different from the service-time-dependent value models adopted in emergency rescue scenarios, the data value decreasing process over s_i contains three stages: (i) During the initial stage right after the UAV arrives, a large amount of informative and critical disaster data is quickly collected. Since this stage is usually very short, the data value is assumed to be constant at b_{ik}. (ii) After a certain time, the data value begins to decrease at an increasing rate, as new disaster data is generated at a decreasing rate while non-critical data begin to show up [71]. (iii) When most critical disaster data have been collected, the instant data value stays at a very low level. Based on the above analysis, a logistic curve f_{ik} is appropriate for approximating such a process. In this work, the midpoint of the logistic curve is proportional to β_i with ratio g, because if a node contains more data value, the UAV should take more time to collect its data. Integrating with (10), the instant data value f_{ik} after the data collection starts can be depicted as (11):

$$f_{ik} = \frac{b_{ik}}{1 + e^{x - g\beta_i}} \quad (g > 0,\ x \in [0, s_i]) \tag{11}$$

Based on Equations (10) and (11), the collected data value v_{ik} is obtained by integrating the logistic function over the service time s_i, as in (12) and illustrated in Fig. 2:

$$v_{ik} = \int_{0}^{s_i} f_{ik}\, dx = \int_{0}^{s_i} \frac{\beta_i - \alpha \cdot a_{ik}}{1 + e^{x - g\beta_i}}\, dx \tag{12}$$

Fig. 2. Data value with UAV arrival time and service time.
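For illustration, the sketch below evaluates (10)-(12) numerically for a single node. The parameter values (α = 0.005, g = 0.5, β_i in [20, 30]) follow the experiment settings of Section V; the function names are our own and do not come from the paper's code.

```python
import math
from scipy.integrate import quad  # numerical integration for Eq. (12)

ALPHA, G = 0.005, 0.5   # decay rate alpha and logistic-midpoint ratio g

def instant_value_at_arrival(beta_i, a_ik, alpha=ALPHA):
    """Eq. (10): value remaining when UAV k arrives at node i at time a_ik."""
    return max(beta_i - alpha * a_ik, 0.0)

def instant_value_during_service(x, beta_i, a_ik, alpha=ALPHA, g=G):
    """Eq. (11): logistic decay of the instant value over elapsed service time x."""
    b_ik = instant_value_at_arrival(beta_i, a_ik, alpha)
    return b_ik / (1.0 + math.exp(x - g * beta_i))

def collected_value(beta_i, a_ik, s_i, alpha=ALPHA, g=G):
    """Eq. (12): data value collected by integrating (11) over [0, s_i]."""
    v, _ = quad(instant_value_during_service, 0.0, s_i, args=(beta_i, a_ik, alpha, g))
    return v

if __name__ == "__main__":
    beta, arrival = 25.0, 40.0
    for s in (5.0, 10.0, 20.0, 40.0):
        print(f"s_i = {s:5.1f} min -> v_ik = {collected_value(beta, arrival, s):8.2f}")
```

The diminishing returns over service time, which drive the service-time analysis that follows, are visible directly in the printed values.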


C. Model Analysis

The TOP with time-varying data value is more complex than the basic TOP (which is already NP-hard), since not only the traveling routes with the arrival time at each node but also the service time of the UAV at each node must be determined. It can hardly be solved by heuristic methods within a short time, let alone to exact optimality. Besides, according to our preliminary work, it makes DRL-based methods difficult to converge, especially for large-scale problems. Therefore, this part further analyzes the model to make it easier to solve.

Consider one UAV that has just arrived at a node on its route: let the UAV choose the next node to be visited first, and then decide the service time at the current node. After finishing the data collection process at the current node, the UAV flies to the chosen next node and updates the arrival time of that node. This process is repeated until the UAV returns to the depot, as illustrated in Fig. 3.

Fig. 3. Decision-making process for service time.

Specifically, the service time of the UAV at the current node is decided upon the selection of the next node. According to the feature of time-varying data value, the time for the UAV to leave the current node depends on the expected value to be collected in the future. To simplify the analysis, we first ignore the impact of the other nodes in the route; the decision on the service time at the current node can then be made by comparing the potential value gained at the current node with that at the next node.

Fig. 4 shows the detailed comparison process. Take UAV k as an example: given the current node and the next node as node 1 and node 2, the service time s_1 has to be decided. Based on constraint (5), a_{1k} + s_1 + t_{12} is the arrival time at node 2. Based on formula (10), the decayed data value in terms of the arrival time at these two nodes can be represented as follows:

$$b_{1k} = \beta_1 - \alpha \cdot a_{1k} \tag{13}$$

$$b_{2k} = \beta_2 - \alpha \cdot (a_{1k} + s_1 + t_{12}) \tag{14}$$

Fig. 4. Data value comparison between two adjacent nodes.

The instant data value at the time the UAV leaves node 1 is derived from formula (11) and represented as:

$$f_{1k}(x = s_1) = \frac{b_{1k}}{1 + e^{s_1 - g\beta_1}} \tag{15}$$

Assuming s_2 remains unchanged and using formulas (12) and (14), the extra data value obtained by UAV k at node 2 is formulated as (16):

$$-\frac{\partial v_{2k}}{\partial s_1} = -\frac{\partial b_{2k}}{\partial s_1} \int_{0}^{s_2} \frac{1}{1 + e^{x - g\beta_2}}\, dx = \alpha \int_{0}^{s_2} \frac{1}{1 + e^{x - g\beta_2}}\, dx \tag{16}$$

Furthermore, since the service time at node 2 is unchanged, the data value collected from node 2 only depends on the arrival time at node 2. Based on formula (12), an earlier arrival time improves collection efficiency; thus formula (16) expresses the additional data value collected per unit of arrival-time advancement.

The optimal service time s_1^* can then be obtained by solving equation (17), where s_2 is the largest service time at node 2 permitted by T_max:

$$\frac{b_{1k}}{1 + e^{s_1 - g\beta_1}} = \alpha \int_{0}^{s_2} \frac{1}{1 + e^{x - g\beta_2}}\, dx \tag{17}$$

If the service time s_1 is smaller than s_1^*, the UAV can still gain more data value at the current node, so it should not leave. If s_1 is larger than s_1^*, staying at the current node is not rational, since more data value can be collected at the next node. Thus s_1^* is the optimal solution when only balancing the collected data value between two adjacent nodes.

The above analysis can be extended to a multi-node situation that covers the entire route. Assume there are m nodes 1, 2, ..., i, ..., m-1, m in one UAV route, and the current node is i. Different from the previous scenario that only considers the next node i+1, the UAV needs to consider all subsequent nodes i+2, ..., m. According to equation (17), the reason for the UAV to leave the current node is to obtain more data value at the next node. In a multi-node situation, the UAV always collects no less data value than in the two-node situation. Considering that more data value may be obtained at future nodes, the optimal service time at the current node i in the multi-node sequence should be less than or equal to s_i^*. Hence, the former result s_i^* is a mathematical upper bound on the optimal service time given the route, and the UAV should leave the current node on or before time a_{ik} + s_i^*. Since the time allocation of multiple UAVs is independent of each other and only depends on their own routes, the above analysis can be directly adopted in multi-UAV scenarios.
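A small numerical sketch of this bound: the left-hand side of (17) decreases in s_1 while the right-hand side is a constant once s_2 is fixed, so s_1^* can be found by bisection. The helper names and the parameter values are illustrative assumptions consistent with the settings used later in the experiments, not the authors' code.

```python
import math
from scipy.integrate import quad

ALPHA, G = 0.005, 0.5

def lhs(s1, b1k, beta1, g=G):
    """Left-hand side of Eq. (17): instant value at node 1 when leaving at s1."""
    return b1k / (1.0 + math.exp(s1 - g * beta1))

def rhs(s2, beta2, alpha=ALPHA, g=G):
    """Right-hand side of Eq. (17): marginal value per unit of earlier arrival at node 2."""
    integral, _ = quad(lambda x: 1.0 / (1.0 + math.exp(x - g * beta2)), 0.0, s2)
    return alpha * integral

def service_time_upper_bound(b1k, beta1, beta2, s2, tol=1e-6, s_max=200.0):
    """Solve Eq. (17) for s1*; the UAV should leave node 1 no later than a_1k + s1*."""
    target = rhs(s2, beta2)
    if lhs(0.0, b1k, beta1) <= target:   # leaving immediately is already preferable
        return 0.0
    lo, hi = 0.0, s_max
    while hi - lo > tol:                 # bisection on the monotone left-hand side
        mid = 0.5 * (lo + hi)
        if lhs(mid, b1k, beta1) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    beta1, beta2, arrival1 = 25.0, 28.0, 30.0
    b1k = beta1 - ALPHA * arrival1       # Eq. (13)
    print("upper bound s1* =", round(service_time_upper_bound(b1k, beta1, beta2, s2=60.0), 2))
```

In the DRL framework below, this upper bound is what allows the continuous service-time decision to be resolved analytically once the routing decision has been made.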


IV. METHODOLOGY

With the above analysis, this section develops a DRL-based solution method. In real-world scenarios, models that match the disaster scenarios are pre-trained, enabling efficient decisions to be made quickly when a disaster occurs. The overall model and framework of our DRL-based method are introduced first, then the attention model and the decoding process are discussed.

A. Overview of the DRL-Based Method

To solve the multi-UAV scheduling problem with the DRL-based method, a Markov Decision Process (MDP) model of the problem is built first. Here, the problem is modeled as a finite MDP M = <S, A, P, R>. S is a finite set of states, including the information of the set of nodes N and the UAVs V. The node information contains the locations and the original data values β of the nodes, and the UAV information includes the current positions and the used time of the UAVs. A is a finite set of actions consisting of the node set N. Unavailable nodes that cannot fulfill constraints (3) and (7) are masked when choosing actions. P is a transition function that denotes the probability of the next state given the current state and action. R is the reward function v_{ik}, which is determined by the arrival time a_{ik} and the service time s_i according to formula (12). The optimization objective of the MDP model is consistent with formula (1).

To reduce variance and improve training efficiency, the bootstrap method is widely applied in developing DRL algorithms, such as Actor-Critic. However, a delayed-reward problem arises here, since the service time of the current node cannot be determined until the next node is selected, which poses a challenge for the adoption of the bootstrap method. Therefore, the REINFORCE algorithm is chosen, which generates complete sequences during sampling and returns the total rewards at the end. Denote the policy of one UAV as π with parameters θ; in a problem instance s, the actions chosen by the UAV under π(θ) are a = (a_1, a_2, ..., a_n). The whole policy π is then decomposed as:

$$\pi(a|s, \theta) = \prod_{t=1}^{n} \pi(a_t \mid s, a_{1:t-1}, \theta) \tag{18}$$

The purpose of the algorithm is to optimize the parameters θ to maximize the final rewards. Denote G(a) as the total data value obtained by the solution a. Based on the Monte Carlo policy gradient, ∇J(θ) is obtained as the performance measure of the policy π(θ):

$$\nabla J(\theta) = \mathbb{E}_{\pi}\left[G(a)\, \nabla \log \pi(a|s, \theta)\right] \tag{19}$$

In addition, one efficient way to reduce the gradient variance and increase the convergence speed is to use a baseline b(s). Since the baseline does not vary with the action a, adding a baseline does not affect the validity of equation (19). The policy gradient with baseline can then be formulated as:

$$\nabla J(\theta) = \mathbb{E}_{\pi}\left[(G(a) - b(s))\, \nabla \log \pi(a|s, \theta)\right] \tag{20}$$

Specifically, an exponential moving average baseline with low computational cost is developed. In each iteration, the baseline b(s) is updated with a decay factor γ. In the training process, it is impossible to enumerate all possible actions and expected rewards, so batch gradient descent and Adam are adopted. The interaction between the agent (UAV) and the environment is illustrated in Fig. 5. In the following, the attention model and the decoding method are further discussed to introduce our end-to-end framework.

Fig. 5. Agent-environment interaction mode.
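The gradient estimator (20) together with the exponential moving average baseline update of Algorithm 1 can be written compactly in PyTorch. This is a schematic sketch under our own naming; the toy policy in the usage part only exercises the update and does not reproduce the paper's encoder-decoder.

```python
import torch

def reinforce_update(log_probs, returns, baseline, optimizer, gamma=0.9):
    """One REINFORCE step with an exponential moving average baseline.

    log_probs : tensor [B], sum of log pi(a_t | s, a_1:t-1) per instance (Eq. (18))
    returns   : tensor [B], total collected data value G(a) per instance
    baseline  : float or None, running baseline b(s)
    gamma     : decay factor of the moving average
    """
    batch_mean = returns.mean().item()
    baseline = batch_mean if baseline is None else gamma * batch_mean + (1 - gamma) * baseline

    advantage = returns - baseline                     # (G(a) - b(s)) in Eq. (20)
    loss = -(advantage.detach() * log_probs).mean()    # gradient matches Eq. (20) in expectation

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return baseline, loss.item()

if __name__ == "__main__":
    theta = torch.nn.Parameter(torch.zeros(4))         # toy policy parameters
    optimizer = torch.optim.Adam([theta], lr=1e-3)
    baseline = None
    for _ in range(3):
        logits = theta.expand(8, 4)                    # batch of 8 fake decisions
        dist = torch.distributions.Categorical(logits=logits)
        actions = dist.sample()
        log_probs = dist.log_prob(actions)
        returns = torch.rand(8) * 100                  # stand-in for collected data value
        baseline, loss = reinforce_update(log_probs, returns, baseline, optimizer)
        print(f"baseline={baseline:.2f}  loss={loss:.4f}")
```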
B. The Attention Model

Although the global environment information is embedded as input when making decisions, agents should automatically extract the most crucial information and ignore noise from useless parts; thus the attention model is adopted. It consists of three main parts: graph embedding, context embedding, and the decoding mechanism.

1) Graph Embedding: The graph information here includes the location information of the nodes X and their maximum instant data values β when the disaster happens. As all the graph information is static, the graph embedding only needs to be calculated once per instance. For each node in the graph, the node information is mapped into a high-dimensional vector space h with parameters W and l. The embedding of node i is represented as:

$$h_i = W_i [X_i; \beta_i] + l_i \tag{21}$$

A single-node embedding cannot reflect the relationships among nodes, so an attention mechanism is applied to encode the node embeddings further. Denote the weights of key, query, and value as W^k, W^q, and W^v, respectively. For node i, its key k_i, query q_i, and value v_i are computed as:

$$k_i = W^k h_i, \quad q_i = W^q h_i, \quad v_i = W^v h_i \tag{22}$$

For any other node j in the graph, the attention weight of node j on node i is calculated by taking the dot product between q_i and k_j. Denote d_k as the dimension of the key; the attention-based node embedding h_i' is then a weighted sum of values based on softmax attention scores:

$$h_i' = \sum_{j} \mathrm{softmax}\!\left(\frac{q_i^{T} k_j}{\sqrt{d_k}}\right) v_j \tag{23}$$

In addition, to capture different feature subspaces of the graph information, the input embedding is divided into m parts and multi-head attention (MHA) is adopted. For each part, the sub-input embedding forms a head based on formula (23). Denote W^o as the parameter matrix weighting the subspaces; the multi-head embedding of node i is formulated as:

$$\mathrm{MHA}_i = W^o\, \mathrm{Concat}(h_i^{1}, h_i^{2}, \ldots, h_i^{m}) \tag{24}$$

Each attention layer consists of two sublayers: an MHA layer and a feed-forward layer, and each sublayer includes a skip connection. The final node embedding is the output of the encoding layers, which are composed of multiple stacked attention layers. The graph embedding is the mean of the final node embeddings.

2) Context Embedding: The context embedding consists of three parts: the graph embedding, the current state embedding, and the depot embedding. In the current state embedding, the remaining flight time of the UAV is added to the current state.
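A minimal PyTorch sketch of the encoder described by (21)-(24) is given below. The embedding dimension, number of heads, and module names are our own assumptions rather than the paper's reported hyperparameters, and the skip connections and feed-forward sublayers of the full encoder are omitted for brevity.

```python
import torch
import torch.nn as nn

class NodeAttentionEncoder(nn.Module):
    """Embeds nodes via Eq. (21) and one scaled dot-product attention pass, Eqs. (22)-(23).
    Multi-head mixing in the spirit of Eq. (24) is done by splitting the embedding into heads."""

    def __init__(self, d_model=128, heads=8):
        super().__init__()
        assert d_model % heads == 0
        self.heads, self.d_k = heads, d_model // heads
        self.embed = nn.Linear(3, d_model)          # [x, y, beta] -> h_i   (Eq. (21))
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)   # W^o in Eq. (24)

    def forward(self, nodes):                       # nodes: [B, n, 3] = (x, y, beta)
        B, n, _ = nodes.shape
        h = self.embed(nodes)                       # Eq. (21)
        def split(t):                               # [B, n, d_model] -> [B, heads, n, d_k]
            return t.view(B, n, self.heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(h)), split(self.w_k(h)), split(self.w_v(h))   # Eq. (22)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        attn = torch.softmax(scores, dim=-1)        # softmax attention weights
        h_prime = attn @ v                          # Eq. (23), per head
        h_prime = h_prime.transpose(1, 2).reshape(B, n, -1)
        h_prime = self.w_o(h_prime)                 # Eq. (24): mix the heads
        graph_embedding = h_prime.mean(dim=1)       # mean of the node embeddings
        return h_prime, graph_embedding

if __name__ == "__main__":
    nodes = torch.rand(2, 10, 3)                    # 2 instances, 10 regions, (x, y, beta)
    node_emb, graph_emb = NodeAttentionEncoder()(nodes)
    print(node_emb.shape, graph_emb.shape)          # [2, 10, 128] and [2, 128]
```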


As UAVs should finally return to the depot (equation (2)), the depot embedding is included in the context embedding. Similar to the graph embedding, both the current state embedding and the depot embedding are encoded by MHA. With the context embedding h_c and query weight W^q, the query for the context embedding q_c is formulated as:

$$q_c = W^q h_c \tag{25}$$

3) Decoding Mechanism: The probabilities of the next action are obtained by the decoding mechanism. Since the context embedding includes information about the current state, q_c is used to query the keys k_i of the other nodes and calculate the attention weight between them. Note that not all nodes are available, since visited nodes cannot be revisited (equation (3)) and nodes that would make the UAV exceed its maximum flight time are prohibited (equation (7)). The set of unavailable nodes forms a mask, and the attention weights of unavailable nodes are masked. The attention weight u_i of node i is then:

$$u_i = \begin{cases} \dfrac{q_c^{T} k_i}{\sqrt{d_k}} & \text{if node } i \text{ is available} \\ -\infty & \text{otherwise} \end{cases} \tag{26}$$

To reduce the range of u_i, the tanh function is used to clip the result. The probabilities of all actions are obtained through the softmax function over the attention weights.

C. Decoding Method

Different from single-agent problems, multiple UAVs with different states are considered in this work. One feasible method is to compare the actions of different UAVs and choose the most appropriate decision. Unlike generating joint actions, UAVs in this competitive mechanism compete with each other, and only one action of one UAV is chosen at a time. The main advantage of this method is that it avoids conflicts when UAVs tend to select the same node. As the current states of the UAVs differ, a context embedding h_c^j has to be calculated for each UAV j; the query q_c^j and the attention weight u_i^j between UAV j and node i are then calculated as:

$$q_c^j = W^q h_c^j \tag{27}$$

$$u_i^j = \begin{cases} \dfrac{(q_c^j)^{T} k_i}{\sqrt{d_k}} & \text{if node } i \text{ is available for UAV } j \\ -\infty & \text{otherwise} \end{cases} \tag{28}$$

Over all the attention weights calculated above, the probability of action a_{ij} (choosing UAV j and node i) is:

$$p(a_{ij}) = \frac{e^{u_i^j}}{\sum_{j}\sum_{i} e^{u_i^j}} \tag{29}$$

Greedy decoding is adopted, and the action with the largest probability is chosen. After obtaining the new action, the service time of the previous node is calculated if the previous node is not the depot. Then the environment information, including arrival times and UAV positions, is updated based on constraint (5), and the new states of the UAVs are observed. Such interactions are repeated until all UAVs return to the depot; the total value is then obtained according to formula (1), and the policy is updated according to the REINFORCE framework. The pseudo-code of REINFORCE with an exponential moving average baseline is presented in Algorithm 1.

Algorithm 1 REINFORCE With Exponential Moving Average Baseline
  Input: number of epochs N, steps per epoch T, batch size B, decay factor γ
  Initialize network weights of encoder and decoder;
  for epoch = 1, 2, ..., N do
      for step = 1, 2, ..., T do
          generate B instances X_1, X_2, ..., X_B;
          for i = 1, 2, ..., B do
              while termination condition is not satisfied do
                  select actions through current policy;
                  if previous node is not the depot then
                      calculate service time of previous node;
                  end
                  update the environment and observe current states;
              end
              record actions a_1, a_2, ..., a_B and states s_1, s_2, ..., s_B;
              compute total rewards G_1, G_2, ..., G_B;
          end
          if no baseline is generated before then
              b ← avg(G_1, G_2, ..., G_B);
          else
              b ← γ · avg(G_1, G_2, ..., G_B) + (1 − γ) · b;
          end
          ∇θ ← Σ_{i=1}^{B} (G_i − b) ∇θ log π(a_i | s_i, θ);
      end
  end
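A compact sketch of the masked, competitive decoding step in (26)-(29) is given below. The clipping constant C = 10 for the tanh clipping is an assumption (the text only states that tanh is used), and the greedy argmax over all (UAV, node) pairs mirrors the competitive mechanism described above; names and shapes are ours.

```python
import torch

def decode_step(q_c, keys, avail, clip=10.0):
    """One competitive decoding step over all (UAV, node) pairs.

    q_c   : [m, d] context queries q_c^j, one per UAV j            (Eq. (27))
    keys  : [n, d] node keys k_i shared by all UAVs
    avail : [m, n] boolean mask, True if node i is available for UAV j
    Returns the probabilities p(a_ij) and the greedy (uav, node) choice.
    """
    d_k = keys.shape[-1]
    u = (q_c @ keys.T) / d_k ** 0.5            # compatibilities, Eq. (28)
    u = clip * torch.tanh(u)                   # tanh clipping of the logits (C assumed = 10)
    u = u.masked_fill(~avail, float("-inf"))   # mask infeasible pairs, Eqs. (26)/(28)
    p = torch.softmax(u.flatten(), dim=0).view_as(u)   # joint softmax over UAVs and nodes, Eq. (29)
    flat_idx = int(torch.argmax(p))            # greedy decoding
    uav, node = divmod(flat_idx, p.shape[1])
    return p, uav, node

if __name__ == "__main__":
    m, n, d = 3, 6, 16                         # 3 UAVs, 6 candidate nodes
    q_c, keys = torch.rand(m, d), torch.rand(n, d)
    avail = torch.ones(m, n, dtype=torch.bool)
    avail[0, 2] = False                        # e.g. node 2 would violate Tmax for UAV 0
    probs, uav, node = decode_step(q_c, keys, avail)
    print(f"chosen: UAV {uav} -> node {node}, p = {probs[uav, node]:.3f}")
```

Repeating this step, recomputing the mask and the per-UAV context queries after every environment update, produces the complete routes whose total value feeds the REINFORCE update of Algorithm 1.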


V. EXPERIMENTAL CASE STUDY

A. Experiment Setting

In this work, a square disaster area of size [0, 1] × [0, 1] is built for the experimental case study. A set of disaster regions with data to be collected is randomly generated within the area. Their instant data values β are also randomly generated in the range [20, 30]. In the experiments, the data value decreasing rate α is set to 0.005, implying that the data value decreases to 0 after 200 minutes, which is consistent with the understanding of the golden window (around 3 hours) in emergencies. Meanwhile, g is set to 0.5 in formula (11). Besides, a fixed airspeed of 0.05 unit distance per minute is set for the UAVs. To verify the performance of the proposed method on problems of different scales, the number of disaster regions N_region is set from 5 to 500 while the number of UAVs N_UAV ranges from 1 to 20. In the following, each scenario is numbered; e.g., 'U10-N50' denotes the scenario with 10 UAVs and 50 disaster regions.

TABLE II
HYPERPARAMETER TABLE

The hyperparameters adopted in the training process are presented in Table II. It is worth mentioning that the proposed method requires a relatively small number of parameters and has low computational complexity. Specifically, for all scenarios, the number of parameters remains constant at 99,328, and for scenarios such as 'U20-N150' the FLOPs (Floating Point Operations) of the proposed method amount to 39.882G, which is relatively small. It should be noted that, when dealing with large-scale problems, the batch size can be substituted with either 256 or 128. All experiments are conducted on a Dell Precision Tower 5820 server with a TITAN RTX GPU. All algorithms are implemented in Python.

B. Benchmark Methods

To verify the performance of the proposed DRL method, five algorithms are taken for comparison, as listed below. All algorithms share the same service time calculation process as our proposed DRL method and differ only in the routing decisions.

• SCIP: Solving Constraint Integer Programs (SCIP) is widely adopted for mixed-integer nonlinear programming models [47], [72] and can obtain exact optimal results. It is adopted to verify whether the proposed method can generate near-optimal solutions.
• Greedy Algorithm: The Greedy Algorithm follows a greedy strategy that chooses UAVs and nodes based on the "optimal matching" principle [73]. It is widely adopted in the literature as a benchmark method.
• Tabu Search: Tabu Search uses a tabu table to avoid local optima, and thus works well for TOP [51], VRP [74], and data collection problems [75].
• Genetic Algorithm: The Genetic Algorithm is a metaheuristic that has proved effective for TOP [76] and multi-UAV data collection frameworks [77].
• DRL: A state-of-the-art DRL method [64] that combines a transformer and a pointer network model is adopted for comparison, which has proved effective for orienteering problems.
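As an illustration of the kind of baseline the Greedy Algorithm represents, the sketch below repeatedly assigns the feasible (UAV, region) pair with the largest estimated immediate value gain, using the same value model (10)-(12) and a fixed per-region service time. This is only one plausible reading of the "optimal matching" rule, not the benchmark implementation from [73]; all constants and names are ours.

```python
import math

ALPHA, G, SPEED, T_MAX = 0.005, 0.5, 0.05, 200.0

def value(beta, arrival, s, steps=200):
    """Collected value, Eqs. (10)-(12), via a simple trapezoidal rule."""
    b = max(beta - ALPHA * arrival, 0.0)
    dx = s / steps
    f = lambda x: b / (1.0 + math.exp(x - G * beta))
    return dx * (0.5 * f(0.0) + sum(f(k * dx) for k in range(1, steps)) + 0.5 * f(s))

def greedy_schedule(coords, betas, n_uav, depot=(0.0, 0.0), service=15.0):
    """Myopically match UAVs to regions by immediate value gain."""
    pos = {k: depot for k in range(n_uav)}      # current position per UAV
    clock = {k: 0.0 for k in range(n_uav)}      # elapsed time per UAV
    routes = {k: [] for k in range(n_uav)}
    unvisited = set(coords)
    while True:
        best = None
        for k in range(n_uav):
            for i in unvisited:
                t_go = math.dist(pos[k], coords[i]) / SPEED
                t_back = math.dist(coords[i], depot) / SPEED
                if clock[k] + t_go + service + t_back > T_MAX:
                    continue                    # would violate the endurance limit (7)
                gain = value(betas[i], clock[k] + t_go, service)
                if best is None or gain > best[0]:
                    best = (gain, k, i, t_go)
        if best is None:                        # no feasible assignment remains
            return routes
        _, k, i, t_go = best
        routes[k].append(i)
        clock[k] += t_go + service
        pos[k] = coords[i]
        unvisited.remove(i)

if __name__ == "__main__":
    import random
    random.seed(0)
    coords = {i: (random.random(), random.random()) for i in range(1, 21)}
    betas = {i: random.uniform(20, 30) for i in coords}
    print(greedy_schedule(coords, betas, n_uav=2))
```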


C. Computational Results

In the experiments, each scenario is run 20 times with randomly generated seeds, and the average performance is taken for comparison.

1) Ablation Experiments for Warm Start and Attention Mechanism: Considering the difficulty of directly training medium- and large-scale instances, the trained network of small-scale instances is used as the initial network, which is denoted as warm start in Fig. 6. Compared to directly training 'U5-N20' instances, using the 'U2-N10' network as initialization provides a faster convergence curve. In addition, to demonstrate the necessity of the attention mechanism in our algorithm, the result of the ablation experiment is also shown in Fig. 6. Compared to our proposed algorithm, the variant without the multi-head attention mechanism can hardly learn anything, resulting in a curve with little upward trend.

Fig. 6. Training returns of episodes.

2) Comparison in Small-Scale Problems: For small-scale problems, SCIP is adopted for comparison to verify whether the proposed method can generate near-optimal solutions. Besides, considering the timeliness requirement in disasters, the maximum computing time for SCIP is set to 7200 s.

TABLE III
COMPARISON IN SMALL-SCALE PROBLEMS

Table III shows the results. For 'U1-N5' and 'U2-N5', SCIP performs better, and the gap between the two methods is quite small. For 'U1-N8' and 'U2-N8', our DRL method performs better than SCIP. The results show that the proposed method obtains approximately optimal solutions. In addition, the proposed DRL method generates approximately optimal solutions almost immediately, unlike SCIP, which requires a relatively long time to produce results.

3) Comparison in Medium-Scale and Large-Scale Problems: The experimental results for medium-scale and large-scale problems are shown in Table IV. In general, the proposed method performs better than the Greedy Algorithm and DRL. For some medium-scale instances (e.g., the first 4 rows in Table IV), the results of Tabu Search and the Genetic Algorithm are close to or even better than those of the proposed method. However, the performance of these heuristic algorithms worsens as the scale of the scenarios increases. When the number of disaster regions is enlarged to no less than 200 while fixing the number of UAVs at 20 (see the lower part of Table IV), the superiority of the proposed method becomes more dominant. In addition, Fig. 7 visualizes the performance of the algorithms in large-scale scenarios.

TABLE IV
COMPARISON IN MEDIUM-SCALE AND LARGE-SCALE PROBLEMS

4) Advantages of the Proposed Method in Long-Sequence and Large-Scale Problems: Table V examines the impact of route sequence length, which is determined by the ratio of N_region to N_UAV, on algorithmic efficiency. The ratio N/U gives the average number of disaster regions allocated to each UAV in different scenarios; the higher the ratio, the more disaster regions are visited by each UAV. Table V shows the performance of the different algorithms using the results of the Greedy Algorithm as the benchmark. According to Table V and Fig. 7, the higher the ratio, the better the performance of both learning-based methods, which shows that they have advantages over heuristic methods when dealing with long route sequence problems. In comparison, there is no evident growing trend in the performance of the heuristic methods.

TABLE V
IMPACT OF DIFFERENT ROUTE SEQUENCE LENGTHS ON EXPERIMENTAL PERFORMANCE

TABLE VI
IMPACT OF DIFFERENT PROBLEM SCALES ON EXPERIMENTAL PERFORMANCE


In Table VI, the product of the number of disaster regions and the number of UAVs, N·U, is regarded as an indicator of the scale of each problem, varying from 500 to 10000. The percentage results of the four methods listed in Table VI are again benchmarked against the Greedy Algorithm. As the scale of the instances increases, the performance gap between our proposed method and the compared algorithms becomes increasingly large (as shown in the last column of Table VI). This phenomenon illustrates that in some small-scale and medium-scale problems, other methods, especially heuristic methods, can quickly traverse the neighborhood and continuously update their best solutions. However, in large-scale problems, the neighborhood of the current solution grows exponentially, which dramatically reduces their iterative efficiency. By contrast, the efficiency of the proposed DRL method is minimally affected by the scale of the problem, and its performance on large-scale problems is also very good.

Fig. 7. Comparative trend curves across different problem scales under 20 UAVs.

D. Discussions

According to the above experimental results, several advantages and managerial implications can be concluded.

Firstly, the proposed method is effective in realizing efficient and high-quality scheduling of UAVs for disaster data collection, for both small-scale and large-scale problems. In practice, the proposed method can be adopted for real-time decision-making in disasters.

Secondly, the proposed method performs better in long-sequence problems where each UAV visits many regions in one trip. In the future, with the improved capacity and prolonged endurance of UAVs, the long-sequence feature will become more prominent as UAVs become capable of visiting more nodes in one trip. Therefore, the proposed method will become more attractive in many scenarios as UAV technologies develop.

Thirdly, in real-life applications, given the number of regions in an area (e.g., the area of responsibility) and the number of UAVs, the multi-UAV scheduling model can be pre-trained using a simulation-based environment and then directly applied in practice. Our experimental results have shown that the performance of the trained model is superior to the heuristic methods and the state-of-the-art DRL method when facing different cases with the same scenario settings.

VI. CONCLUSION

To address the problem of UAV-assisted disaster data collection, this paper proposed an efficient multi-UAV scheduling method considering the feature of time-varying data value. First of all, a TOP-based mathematical model is developed to maximize the data value collected by all UAVs. Besides, the features of the time-varying data value in terms of both UAV arrival time and service time are analyzed and modeled. Meanwhile, to accelerate the solution algorithms, the relationships between UAV route selection and service time are analyzed. An attention-based DRL method is then proposed, which can obtain high-quality solutions in near real-time. Several typical scenarios are simulated and different algorithms are tested, which verifies the advantages of the proposed method in both computing efficiency and solution quality.

This work can be extended in the following directions. Firstly, the proposed mathematical model with time-varying value can be further developed to incorporate more complex scenarios, such as multiple disaster types, dynamic and uncertain environments, and non-uniform spatial distributions of disasters. Secondly, the dynamics of the scenarios can be considered with regard to the development of disasters and changes in UAV numbers. Thirdly, the method proposed in this paper can be further improved for higher robustness to different scenarios. Fourthly, multi-agent reinforcement learning frameworks can be studied to deal with decentralized scenarios.

REFERENCES

[1] I. Nourbakhsh, R. Sargent, A. Wright, K. Cramer, B. McClendon, and M. Jones, "Mapping disaster zones," Nature, vol. 439, no. 7078, pp. 787–788, Feb. 2006.
[2] M. Zook, M. Graham, T. Shelton, and S. Gorman, "Volunteered geographic information and crowdsourcing disaster relief: A case study of the Haitian earthquake," World Med. Health Policy, vol. 2, no. 2, pp. 7–33, Jul. 2010.
[3] M. Morton and J. L. Levy, "Challenges in disaster data collection during recent disasters," Prehospital Disaster Med., vol. 26, no. 3, pp. 196–201, Jun. 2011.
[4] I. Junglas and B. Ives, "Recovering IT in a disaster: Lessons from Hurricane Katrina," MIS Quart. Executive, vol. 6, no. 1, pp. 39–51, 2007.
[5] H. To, S. H. Kim, and C. Shahabi, "Effectively crowdsourcing the acquisition and analysis of visual data for disaster response," in Proc. IEEE Int. Conf. Big Data, Oct. 2015, pp. 697–706.
[6] P. R. Spence, K. A. Lachlan, and A. M. Rainear, "Social media and crisis research: Data collection and directions," Comput. Hum. Behav., vol. 54, pp. 667–672, Jan. 2016.
[7] T. Sakano et al., "Disaster-resilient networking: A new vision based on movable and deployable resource units," IEEE Netw., vol. 27, no. 4, pp. 40–46, Jul. 2013.
[8] S. Wang, Y. Long, Y. Zhou, and G. Xu, "Multi-UAV route planning for data collection from heterogeneous IoT devices," in Proc. IEEE Int. Conf. Ind. Eng. Eng. Manag. (IEEM), Dec. 2022, pp. 1556–1560.
[9] X. Pang, M. Sheng, N. Zhao, J. Tang, D. Niyato, and K.-K. Wong, "When UAV meets IRS: Expanding air-ground networks via passive reflection," IEEE Wireless Commun., vol. 28, no. 5, pp. 164–170, Oct. 2021.
[10] T. Ma et al., "UAV-LEO integrated backbone: A ubiquitous data collection approach for B5G Internet of Remote Things networks," IEEE J. Sel. Areas Commun., vol. 39, no. 11, pp. 3491–3505, Nov. 2021.


[11] X. Pang, W. Mei, N. Zhao, and R. Zhang, "Intelligent reflecting surface assisted interference mitigation for cellular-connected UAV," IEEE Wireless Commun. Lett., vol. 11, no. 8, pp. 1708–1712, Aug. 2022.
[12] Y. Tan, S. Li, H. Liu, P. Chen, and Z. Zhou, "Automatic inspection data collection of building surface based on BIM and UAV," Autom. Construct., vol. 131, Nov. 2021, Art. no. 103881.
[13] K. Asadi et al., "An integrated UGV-UAV system for construction site data collection," Autom. Construct., vol. 112, Apr. 2020, Art. no. 103068.
[14] Z. Wei et al., "UAV-assisted data collection for Internet of Things: A survey," IEEE Internet Things J., vol. 9, no. 17, pp. 15460–15483, Sep. 2022.
[15] I. Jawhar, N. Mohamed, J. Al-Jaroodi, D. P. Agrawal, and S. Zhang, "Communication and networking of UAV-based systems: Classification and associated architectures," J. Netw. Comput. Appl., vol. 84, pp. 93–108, Apr. 2017.
[16] D. Liu et al., "Opportunistic UAV utilization in wireless networks: Motivations, applications, and challenges," IEEE Commun. Mag., vol. 58, no. 5, pp. 62–68, May 2020.
[17] W. Ejaz, A. Ahmed, A. Mushtaq, and M. Ibnkahla, "Energy-efficient task scheduling and physiological assessment in disaster management using UAV-assisted networks," Comput. Commun., vol. 155, pp. 150–157, Apr. 2020.
[18] X. Pang, J. Tang, N. Zhao, X. Zhang, and Y. Qian, "Energy-efficient design for mmWave-enabled NOMA-UAV networks," Sci. China Inf. Sci., vol. 64, no. 4, Apr. 2021, Art. no. 140303.
[19] J. Gong, T.-H. Chang, C. Shen, and X. Chen, "Flight time minimization of UAV for data collection over wireless sensor networks," IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 1942–1954, Sep. 2018.
[20] Z. Wang, G. Zhang, Q. Wang, K. Wang, and K. Yang, "Completion time minimization in wireless-powered UAV-assisted data collection system," IEEE Commun. Lett., vol. 25, no. 6, pp. 1954–1958, Jun. 2021.
[21] Y. Emami, B. Wei, K. Li, W. Ni, and E. Tovar, "Deep Q-networks for aerial data collection in multi-UAV-assisted wireless sensor networks," in Proc. Int. Wireless Commun. Mobile Comput. (IWCMC), Jun. 2021, pp. 669–674.
[22] M. T. Nguyen et al., "UAV-assisted data collection in wireless sensor networks: A comprehensive survey," Electronics, vol. 10, no. 21, p. 2603, Oct. 2021.
[23] S. R. Yeduri, N. S. Chilamkurthy, O. J. Pandey, and L. R. Cenkeramaddi, "Energy and throughput management in delay-constrained small-world UAV-IoT network," IEEE Internet Things J., vol. 10, no. 9, pp. 7922–7935, May 2023.
[24] Q. Wu, P. Sun, and A. Boukerche, "Unmanned aerial vehicle-assisted energy-efficient data collection scheme for sustainable wireless sensor networks," Comput. Netw., vol. 165, Dec. 2019, Art. no. 106927.
[25] X. Li, J. Tan, A. Liu, P. Vijayakumar, N. Kumar, and M. Alazab, "A novel UAV-enabled data collection scheme for intelligent transportation system through UAV speed control," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 4, pp. 2100–2110, Apr. 2021.
[26] B. Alzahrani, O. S. Oubbati, A. Barnawi, M. Atiquzzaman, and D. Alghazzawi, "UAV assistance paradigm: State-of-the-art in applications and challenges," J. Netw. Comput. Appl., vol. 166, Sep. 2020, Art. no. 102706.
[27] J. Baek, S. I. Han, and Y. Han, "Energy-efficient UAV routing for wireless sensor networks," IEEE Trans. Veh. Technol., vol. 69, no. 2, pp. 1741–1750, Feb. 2020.
[28] T. Li, W. Liu, Z. Zeng, and N. N. Xiong, "DRLR: A deep-reinforcement-learning-based recruitment scheme for massive data collections in 6G-based IoT networks," IEEE Internet Things J., vol. 9, no. 16, pp. 14595–14609, Aug. 2022.
[29] T. Wang, X. Pang, J. Tang, N. Zhao, X. Zhang, and X. Wang, "Time and energy efficient data collection via UAV," Sci. China Inf. Sci., vol. 65, no. 8, Aug. 2022, Art. no. 182302.
[30] R. Penicka, J. Faigl, and M. Saska, "Physical orienteering problem for unmanned aerial vehicle data collection planning in environments with obstacles," IEEE Robot. Autom. Lett., vol. 4, no. 3, pp. 3005–3012, Jul. 2019.
[31] S. Aggarwal and N. Kumar, "Path planning techniques for unmanned aerial vehicles: A review, solutions, and challenges," Comput. Commun., vol. 149, pp. 270–299, Jan. 2020.
[32] M. Samir, S. Sharafeddine, C. M. Assi, T. M. Nguyen, and A. Ghrayeb, "UAV trajectory planning for data collection from time-constrained IoT devices," IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 34–46, Jan. 2020.
[33] H. Hu, K. Xiong, G. Qu, Q. Ni, P. Fan, and K. B. Letaief, "AoI-minimal trajectory planning and data collection in UAV-assisted wireless powered IoT networks," IEEE Internet Things J., vol. 8, no. 2, pp. 1211–1223, Jan. 2021.
[34] W. Chen, S. Zhao, Q. Shi, and R. Zhang, "Resonant beam charging-powered UAV-assisted sensing data collection," IEEE Trans. Veh. Technol., vol. 69, no. 1, pp. 1086–1090, Jan. 2020.
[35] J. Chen et al., "Efficient data collection in large-scale UAV-aided wireless sensor networks," in Proc. 11th Int. Conf. Wireless Commun. Signal Process. (WCSP), Oct. 2019, pp. 1–5.
[36] L. Liu, K. Xiong, J. Cao, Y. Lu, P. Fan, and K. B. Letaief, "Average AoI minimization in UAV-assisted data collection with RF wireless power transfer: A deep reinforcement learning scheme," IEEE Internet Things J., vol. 9, no. 7, pp. 5216–5228, Apr. 2022.
[37] S. Fu et al., "Energy-efficient UAV-enabled data collection via wireless charging: A reinforcement learning approach," IEEE Internet Things J., vol. 8, no. 12, pp. 10209–10219, Jun. 2021.
[38] P. Tong, J. Liu, X. Wang, B. Bai, and H. Dai, "Deep reinforcement learning for efficient data collection in UAV-aided Internet of Things," in Proc. IEEE Int. Conf. Commun. Workshops (ICC Workshops), Jun. 2020, pp. 1–6.
[39] K. K. Nguyen, T. Q. Duong, T. Do-Duy, H. Claussen, and L. Hanzo, "3D UAV trajectory and data collection optimisation via deep reinforcement learning," IEEE Trans. Commun., vol. 70, no. 4, pp. 2358–2371, Apr. 2022.
[40] M. Sun, X. Xu, X. Qin, and P. Zhang, "AoI-energy-aware UAV-assisted data collection for IoT networks: A deep reinforcement learning method," IEEE Internet Things J., vol. 8, no. 24, pp. 17275–17289, Dec. 2021.
[41] Y. Wang et al., "Trajectory design for UAV-based Internet of Things data collection: A deep reinforcement learning approach," IEEE Internet Things J., vol. 9, no. 5, pp. 3899–3912, Mar. 2022.
[42] I.-M. Chao, B. L. Golden, and E. A. Wasil, "The team orienteering problem," Eur. J. Oper. Res., vol. 88, no. 3, pp. 464–474, Feb. 1996.
[43] H. Qin, X. Su, T. Ren, and Z. Luo, "A review on the electric vehicle routing problems: Variants and algorithms," Frontiers Eng. Manag., vol. 8, no. 3, pp. 370–389, Sep. 2021.
[44] R. Aringhieri, S. Bigharaz, D. Duma, and A. Guastalla, "Novel applications of the team orienteering problem in health care logistics," in Optimization in Artificial Intelligence and Data Sciences. Rome, Italy: Springer, 2022, pp. 235–245.
[45] S. Saeedvand, H. S. Aghdasi, and J. Baltes, "Novel hybrid algorithm for team orienteering problem with time windows for rescue applications," Appl. Soft Comput., vol. 96, Nov. 2020, Art. no. 106700.
[46] V. F. Yu, P. Jewpanya, S.-W. Lin, and A. N. P. Redi, "Team orienteering problem with time windows and time-dependent scores," Comput. Ind. Eng., vol. 127, pp. 213–224, Jan. 2019.
[47] Q. Yu, Y. Adulyasak, L.-M. Rousseau, N. Zhu, and S. Ma, "Team orienteering with time-varying profit," INFORMS J. Comput., vol. 34, no. 1, pp. 262–280, Jan. 2022.
[48] E. Erkut and J. Zhang, "The maximum collection problem with time-dependent rewards," Nav. Res. Logistics, vol. 43, no. 5, pp. 749–763, Aug. 1996.
[49] A. Ekici and A. Retharekar, "Multiple agents maximum collection problem with time dependent rewards," Comput. Ind. Eng., vol. 64, no. 4, pp. 1009–1018, Apr. 2013.
[50] G. Erdoğan and G. Laporte, "The orienteering problem with variable profits," Networks, vol. 61, no. 2, pp. 104–116, Mar. 2013.
[51] Q. Yu, C. Cheng, and N. Zhu, "Robust team orienteering problem with decreasing profits," INFORMS J. Comput., vol. 34, no. 6, pp. 3215–3233, Nov. 2022.
[52] S. A. Shah, D. Z. Seker, S. Hameed, and D. Draheim, "The rising role of big data analytics and IoT in disaster management: Recent advances, taxonomy and prospects," IEEE Access, vol. 7, pp. 54595–54614, 2019.
[53] E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song, "Learning combinatorial optimization algorithms over graphs," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
[54] Q. Cappart, T. Moisan, L.-M. Rousseau, I. Prémont-Schwarz, and A. A. Cire, "Combining reinforcement learning and constraint programming for combinatorial optimization," in Proc. AAAI Conf. Artif. Intell., 2021, vol. 35, no. 5, pp. 3677–3687.
[55] H. Hu, X. Zhang, X. Yan, L. Wang, and Y. Xu, "Solving a new 3D bin packing problem with deep reinforcement learning method," 2017, arXiv:1708.05930.
[56] Q. Cai, W. Hang, A. Mirhoseini, G. Tucker, J. Wang, and W. Wei, "Reinforcement learning driven heuristic optimization," 2019, arXiv:1906.06639.
[57] S. Manchanda, A. Mittal, A. Dhawan, S. Medya, S. Ranu, and A. Singh, "GCOMB: Learning budget-constrained combinatorial algorithms over billion-sized graphs," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 20000–20011.
[58] J. Li, M. Zhou, Q. Sun, X. Dai, and X. Yu, "Colored traveling salesman problem," IEEE Trans. Cybern., vol. 45, no. 11, pp. 2390–2401, Nov. 2015.
[59] X. Meng, J. Li, X. Dai, and J. Dou, "Variable neighborhood search for a colored traveling salesman problem," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 4, pp. 1018–1026, Apr. 2018.
[60] X. Xu, J. Li, and M. Zhou, "Delaunay-triangulation-based variable neighborhood search to solve large-scale general colored traveling salesman problems," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 3, pp. 1583–1593, Mar. 2021.
[61] X. Xu, J. Li, M. Zhou, and X. Yu, "Precedence-constrained colored traveling salesman problem: An augmented variable neighborhood search approach," IEEE Trans. Cybern., vol. 52, no. 9, pp. 9797–9808, Sep. 2022.
[62] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, "Neural combinatorial optimization with reinforcement learning," 2016, arXiv:1611.09940.
[63] W. Kool, H. Van Hoof, and M. Welling, "Attention, learn to solve routing problems!" 2018, arXiv:1803.08475.
[64] R. Gama and H. L. Fernandes, "A reinforcement learning approach to the orienteering problem with time windows," Comput. Oper. Res., vol. 133, Sep. 2021, Art. no. 105357.
[65] B. Lin, B. Ghaddar, and J. Nathwani, "Deep reinforcement learning for the electric vehicle routing problem with time windows," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 8, pp. 11528–11538, Aug. 2022.
[66] J. Li et al., "Deep reinforcement learning for solving the heterogeneous capacitated vehicle routing problem," IEEE Trans. Cybern., vol. 52, no. 12, pp. 13572–13585, Dec. 2022.
[67] T. Barrett, W. Clements, J. Foerster, and A. Lvovsky, "Exploratory combinatorial optimization with reinforcement learning," in Proc. AAAI Conf. Artif. Intell., 2020, vol. 34, no. 4, pp. 3243–3250.
[68] M. Nazari, A. Oroojlooy, L. Snyder, and M. Takác, "Reinforcement learning for solving the vehicle routing problem," in Proc. Adv. Neural Inf. Process. Syst., vol. 31, 2018, pp. 1–11.
[69] L. Duan et al., "A multi-task selected learning approach for solving 3D flexible bin packing problem," 2018, arXiv:1804.06896.
[70] Z.-C. Li and Q. Liu, "Optimal deployment of emergency rescue stations in an urban transportation corridor," Transportation, vol. 47, no. 1, pp. 445–473, Feb. 2020.
[71] L. Zhuang, J. He, Z. Yong, X. Deng, and D. Xu, "Disaster information acquisition by residents of China's earthquake-stricken areas," Int. J. Disaster Risk Reduction, vol. 51, Dec. 2020, Art. no. 101908.
[72] S. Vigerske and A. Gleixner, "SCIP: Global optimization of mixed-integer nonlinear programs in a branch-and-cut framework," Optim. Methods Softw., vol. 33, no. 3, pp. 563–593, May 2018.
[73] Y. Pang, Y. Zhang, Y. Gu, M. Pan, Z. Han, and P. Li, "Efficient data collection for wireless rechargeable sensor clusters in harsh terrains using UAVs," in Proc. IEEE Global Commun. Conf., Dec. 2014, pp. 234–239.
[74] Y. Long, G. Xu, J. Zhao, B. Xie, and M. Fang, "Dynamic Truck–UAV collaboration and integrated route planning for resilient urban emergency response," IEEE Trans. Eng. Manag., pp. 1–13, 2023.
[75] O. Ghdiri, W. Jaafar, S. Alfattani, J. B. Abderrazak, and H. Yanikomeroglu, "Energy-efficient multi-UAV data collection for IoT networks with time deadlines," in Proc. IEEE Global Commun. Conf., Dec. 2020, pp. 1–6.
[76] F. S. Moosavi Heris, S. F. Ghannadpour, M. Bagheri, and F. Zandieh, "A new accessibility based team orienteering approach for urban tourism routes optimization (a real life case)," Comput. Oper. Res., vol. 138, Feb. 2022, Art. no. 105620.
[77] S. Alfattani, W. Jaafar, H. Yanikomeroglu, and A. Yongacoglu, "Multi-UAV data collection framework for wireless sensor networks," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Dec. 2019, pp. 1–6.

Pengfu Wan received the B.S. degree in industrial engineering from Nanjing University, Nanjing, China, in 2021, and the M.Sc. degree in engineering enterprise management from The Hong Kong University of Science and Technology, Hong Kong, in 2022. He is currently pursuing the Ph.D. degree with the Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University, Hong Kong. His research interests include data-driven optimization and control, reinforcement learning, and emergency management.

Gangyan Xu (Member, IEEE) received the B.S. degree in automation and the M.E. degree in systems engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2009 and 2012, respectively, and the Ph.D. degree in systems engineering from The University of Hong Kong, Hong Kong, in 2016. He is currently an Assistant Professor with The Hong Kong Polytechnic University, Hong Kong. Prior to that, he was an Assistant Professor with the Harbin Institute of Technology, Shenzhen, China; a Research Fellow with Nanyang Technological University, Singapore; and a Research Assistant with the City University of Hong Kong, Hong Kong. His research interests include data-driven optimization and control, intelligent transportation systems, resilient engineering, and emergency management. Dr. Xu is an Editorial Board Member of Advanced Engineering Informatics and a Special Corresponding Expert of Frontiers of Engineering Management.

Jiawei Chen received the B.S. degree in automation from the Harbin Institute of Technology, Shenzhen, China, in 2022. She is currently pursuing the Ph.D. degree with the Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University, Hong Kong. Her research interests include data-driven optimization and control, intelligent transportation systems, and reinforcement learning.

Yaoming Zhou received the B.Eng. degree in mechatronics from Zhejiang University, Hangzhou, China, in 2014, and the Ph.D. degree in operations research from The University of Hong Kong, Hong Kong, in 2018. From 2018 to 2019, he was a Senior Algorithm Engineer with Alibaba. Since 2019, he has been an Associate Professor with the Department of Industrial Engineering and Management, Shanghai Jiao Tong University, Shanghai, China. His research interests include the modeling and analysis of transportation systems, and the integration of operations research, data analytics, and artificial intelligence and their application to transportation.