

Reinforcement Learning and Particle Swarm


Optimization Supporting Real-Time Rescue
Assignments for Multiple Autonomous
Underwater Vehicles
Jiehong Wu, Member, IEEE, Chengxin Song, Jian Ma, Jinsong Wu, Senior Member, IEEE,
and Guangjie Han, Senior Member, IEEE
Abstract— Rescue assignment strategies are crucial for multiple Autonomous Underwater Vehicle (multi-AUV) systems in three-dimensional (3-D) complex underwater environments. Considering the requirements of rescue missions, multi-AUV systems need to be cost-effective, fast-rescuing, and less concerned about the relationship between rescue missions. Real-time rescue plays a vital role in a multi-AUV system with the characteristics mentioned above. In this paper, we propose an efficient Reward acting on Reinforcement Learning and Particle Swarm Optimization (R-RLPSO) approach to provide a real-time rescue assignment strategy for the multi-AUV system in the 3-D underwater environment. This strategy consists of the following three parts. Firstly, we present a reward-based real-time rescue assignment algorithm. Secondly, we propose an Attraction Rescue Area containing a Rescue Area; for the waypoints in each Attraction Rescue Area, the reward is calculated by a linear reward function. Thirdly, to speed up the convergence of R-RLPSO and mark the rescue states of the Attraction Rescue Areas and Rescue Areas, we develop a Reward Coefficient based on the reward of all Attraction Rescue Areas and Rescue Areas. Finally, simulation results show that the system based on R-RLPSO is more cost-effective and time-saving than those based on the comparison algorithms ISOM and IACO.

Index Terms— Multi-AUV system, real-time, reward-based rescue assignment, RL, PSO.

Fig. 1. The description of the rescue missions for multi-AUV system.

Manuscript received 22 August 2019; revised 1 September 2020; accepted 17 February 2021. Date of publication 11 March 2021; date of current version 8 July 2022. This work was supported in part by the Aeronautical Science Foundation of China under Grant 2018ZC54013; in part by the National Natural Science Foundation of China under Grant 62072072 and Grant 61971206; in part by the Chile CONICYT FONDECYT Regular under Grant 1181809; and in part by the Chile CONICYT FONDEF under Grant 16I10466. The Associate Editor for this article was C. F. Mecklenbräuker. (Corresponding author: Jiehong Wu.)

Jiehong Wu, Chengxin Song, and Jian Ma are with the School of Computer Science, Shenyang Aerospace University, Shenyang 110136, China (e-mail: wujiehong@sau.edu.cn; 2928239360@qq.com; majian_panpan@163.com).

Jinsong Wu is with the School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin 541004, China, and also with the Department of Electrical Engineering, Universidad de Chile, Santiago 9170124, Chile (e-mail: wujs@ieee.org).

Guangjie Han is with the Department of Information and Communication System, Hohai University, Changzhou 213022, China (e-mail: hanguangjie@163.com).

Digital Object Identifier 10.1109/TITS.2021.3062500

I. INTRODUCTION

MULTIPLE autonomous underwater vehicle (multi-AUV) systems have been investigated in recent years [1]. The fundamental problem of rescue assignment for a multi-AUV system is how to delegate the entire set of rescue missions to multiple mission sets so that each AUV can pass through one of these mission sets along an optimized rescue route [2]. In marine environment monitoring, multi-AUV systems have been widely used in rescue missions due to their low operating costs, secure group communications, and their ability to detect regions where general detection tools are not accessible [3]. Meanwhile, to improve rescue efficiency, much attention should be paid not only to a single AUV but also to multiple ones. Shipwreck accidents usually result from unfavorable factors such as obstruction threats and ocean storms. When numerous shipwreck accidents occur, it is necessary to perform rescue missions. Fig. 1 gives a description of the rescue missions for the multi-AUV system in the 3-dimensional (3-D) underwater environment. If two shipwreck accidents T1 and T2 occur in the 3-D marine environment, the locations of T1 and T2 can be obtained through radar detection. When the multi-AUV system receives rescue commands from the rescue center, two AUVs move from the base station S and rush to the locations of T1 and T2 to perform rescue missions. After the two AUVs complete all rescue missions according to the optimal rescue strategy, they return to the substations G1 and G2, respectively. To remain cost-effective and time-saving, the multi-AUV system should follow the steps of the optimal rescue strategy. Besides, unpredictable factors in 3-D underwater
environments such as submarine reefs and different types of obstacles make rescue missions more complicated. For security reasons, multi-AUV systems should consider obstruction threats when rescue missions are performed. Fig. 1 illustrates that AUV1 encounters a cuboid obstacle before approaching rescue mission T1, at which point AUV1 needs to change the scheduled rescue route. To maintain rescue efficiency, the multi-AUV system still needs to make sure that the new rescue route is cost-effective and time-saving.

Selecting an optimal rescue strategy is important for the multi-AUV system to complete the rescue missions; this issue is a nondeterministic polynomial (NP) complete problem [4]. As the number of rescue missions increases, it becomes difficult to find an optimal rescue strategy [5]. Related algorithms mainly include negotiation algorithms [6], auction algorithms [7]–[9], genetic algorithms [10]–[12], ant colony algorithms [13]–[15], and neural networks [16], [17]. Although researchers have proposed many approaches for task assignment, these algorithms may not be suitable for actual underwater rescue scenarios. The most critical issue for rescue missions is how to speed up the rescue and decrease resource consumption. Negotiation and auction algorithms for rescue assignment are required to handle complicated relations between AUVs and rescue missions. Besides, negotiation and auction algorithms generally comprise two separate processes, rescue assignment and path planning, which makes them unsuitable for real-time rescue missions. Meanwhile, these algorithms mainly concentrate on rescue assignment and pay less attention to path planning. For underwater rescue missions, path planning is also an important aspect. Biologically heuristic intelligent optimization algorithms have been widely used in path planning. When genetic algorithms and ant colony algorithms are applied to rescue missions, they exhibit high computational complexity. Besides, these algorithms require prior experience. In particular, the ant colony algorithm needs to determine the reachability between two rescue missions, which may easily lead to rescue failure [18].

To overcome the weaknesses of the algorithms mentioned above, it is necessary to provide a real-time strategy that solves real-time rescue assignment for the multi-AUV system. Therefore, we propose the approach of Reward acting on Reinforcement Learning and Particle Swarm Optimization (R-RLPSO) to resolve rescue assignment for the multi-AUV system in the 3-D complex underwater environment. Compared with negotiation and auction algorithms, R-RLPSO not only pays attention to path planning but also merges rescue task assignment and path planning into one process, which reduces time and suits real-time rescue tasks. Compared with biologically heuristic algorithms, R-RLPSO does not need to consider the relationship between tasks; the assignment is determined by the locations of the rescue points, which reduces computational complexity and improves efficiency. In R-RLPSO, the rescue state of each rescue mission is represented by a reward, which is obtained based on reinforcement learning (RL). Particle swarm optimization (PSO) is utilized to produce a rescue route with obstacle avoidance. The quality of the rescue route is evaluated by the cost function. The cost function includes two parameters: one is the length of the rescue route, and the other is the reward within the Attraction Rescue Areas and Rescue Areas. R-RLPSO is a real-time algorithm that can respond rapidly to rescue missions. Besides, it takes into account the overall rescue efficiency of the multi-AUV system and ignores the relations between rescue missions.

The main contribution of this paper is to provide a real-time underwater rescue algorithm for multi-AUV systems. The algorithm mainly includes the following innovations:

1) We propose a reward-based real-time rescue assignment algorithm, R-RLPSO, based on RL and PSO to solve rescue missions for the multi-AUV system in the 3-D underwater environment.

2) We propose the concept of the Attraction Rescue Area. Meanwhile, we propose a linear reward function based on the proposed Attraction Rescue Area. For the waypoints in an Attraction Rescue Area, the reward value is calculated by the linear reward function.

3) We propose a Reward Coefficient based on the reward of all Attraction Rescue Areas and Rescue Areas, aiming to speed up the convergence of R-RLPSO and mark the current reward states of the rescue missions.

4) To make the simulation match the actual rescue environment, we construct rescue missions in the 3-D underwater environment, including submarine reefs and different types of obstacles.

The rest of the paper is organized as follows: Section II analyzes the related work on task assignment. Section III describes the problem statement. Section IV gives the proposed algorithm. Section V shows the simulation results and analysis. Section VI concludes the paper.

II. RELATED WORKS

The core process of searching for the optimal rescue assignment strategy belongs to task assignment, which has been extensively studied. There have been mainly three types of methods: linear programming, market mechanisms, and intelligence algorithms [19].

In the early study of task assignment, linear programming was a classic method of solving task assignments. Darrah et al. [20] studied multi-AUV dynamic task assignment using mixed integer linear programming. Although this method can accurately provide a strategy for task assignment, its computational complexity is high and does not meet the requirements of real-time rescue missions. Zu et al. [21] applied the Hungarian algorithm to solve the task assignment. It resolved how a robot obtains missions and realizes them at a minimal cost, but the computational complexity is quite high, which cannot meet real-time requirements in complex scenarios.

The advantage of market-mechanism-based approaches is that the calculation is simple. However, before a multi-AUV system performs rescue missions, it needs to determine an optimal rescue assignment strategy and, after that, an efficient implementation strategy for the rescue missions. Meanwhile, these approaches mainly focus on rescue assignment rather than on how to complete the rescue missions. Charles et al. [22] introduced a Bayesian formulation for
auction-based task assignment in a heterogeneous multi-agent team, which mainly considers the process of task assignment and ignores the path planning part. Gyeongtaek et al. [23] proposed a new market-based decentralized algorithm for task assignment with a limited communication range. The authors used market-based methods to consider task assignment in a dynamic environment, which accounts for the interactions between tasks, but many constraints arise when the number of rescue missions increases. Sangwoo et al. [6] proposed an intersection-based algorithm for path generation and a negotiation-based algorithm for task assignment, since these algorithms can generate admissible paths at a smaller computing cost. It is undeniable that the negotiation-based algorithm can find the optimal rescue assignment strategy when the number of AUVs equals that of the rescue missions. However, when the numbers of AUVs and rescue missions are unbalanced, the approach proposed in [5] can only find a suboptimal strategy. In addition, the approach proposed in [6] does not consider rescue assignment in more complex scenarios.

In recent years, researchers have paid more attention to the study of intelligence algorithms in task assignment. The related intelligence algorithms mainly include neural networks and bio-heuristic intelligence optimization algorithms. Neural networks have been widely studied in recent years. As a representative of neural networks in task assignment, the self-organizing map (SOM) has been successfully applied to underwater task assignment. Zhu et al. [17] proposed an integrated biologically inspired self-organizing map (BISOM) algorithm, which can solve task assignment and path planning for a multi-AUV system in the 3-D underwater environment with obstacle avoidance. Zhu et al. [2] also addressed the task assignment issue for multi-AUV systems, integrating velocity synthesis and the SOM method in the 3-D underwater environment. Although the SOM-based methods in [17] and [2] provide a strategy for underwater task assignment, they can cause a sudden change of AUV velocity. Especially in the initial rescue stage of the multi-AUV system, it is practically impossible for the multi-AUV system to attain such a large velocity in the actual rescue environment.

Path optimization has a significant impact on rescue missions. A multi-AUV system not only needs to follow an optimal rescue strategy but must also consider how to perform rescue missions safely and efficiently. The literature above mainly focuses on task assignment, while path optimization weighs less. The bio-heuristic intelligence optimization algorithms mainly include particle swarm optimization (PSO), the genetic algorithm (GA), and ant colony optimization (ACO). Wang et al. [24] proposed a novel Collection Path Ant Colony Optimization (CPACO) to solve task assignment; the CPACO algorithm can achieve global optimization and reduce processing time. However, this method does not consider the 3-D environment, which is the real environment underwater. Li et al. [11] proposed an improved genetic algorithm (IGA) to solve the task assignment of a multi-robot system in which n robots are used to search and spy on a given area quickly and safely. Lin et al. [25] proposed a PSO-based DCAA algorithm to solve task assignment. Gyeongtaek et al. [26] introduced an optimal task assignment for cooperative timing missions that require the participation of multiple agents based on PSO. Wang et al. [27] proposed an improved PSO to solve task assignment and online path planning for AUV swarm cooperation. However, when two pre-allocated tasks are close to each other, this method may cause the failure of task assignment due to falling into local minima.

For rescue missions in the 3-D underwater environment, the core issue is that the multi-AUV system completes real-time rescue missions at a minimal cost. Although the literature mentioned above can solve rescue assignment, these methods rarely comprehensively consider how to quicken the rescue speed, decrease cost, be less concerned about the relationship between rescue missions, and achieve real-time performance. Inspired by the excellent performance of bio-heuristic intelligence optimization algorithms in underwater task assignment, this paper presents the R-RLPSO algorithm to solve real-time rescue missions. R-RLPSO provides a strategy of real-time rescue missions for the multi-AUV system, which can be performed more efficiently and flexibly in the 3-D underwater environment. The R-RLPSO algorithm adopts a model-based reinforcement learning strategy in the rescue area, and the PSO algorithm is used to find the probabilistic optimal solution globally.

III. PROBLEM STATEMENT

A. The Innovative Concepts and Descriptions

TABLE I
THE CONCEPTS AND DESCRIPTIONS

As shown in Table I, this section formulates several innovative concepts of this paper, which are useful for understanding our scheme in the later sections.

The degree of mission completion is defined as follows:

$$G = \frac{D}{D_S} \times 100\%, \tag{1}$$

where $D$ and $D_S$ represent the number of Rescue Areas the AUVs cross and the total number of Rescue Areas, respectively.
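As a concrete reading of (1), the following minimal sketch computes the degree of mission completion; the function name and types are our own, not from the paper:

```python
def mission_completion(crossed: int, total: int) -> float:
    """Degree of mission completion G from Eq. (1): the percentage of
    Rescue Areas that the AUVs have crossed."""
    if total <= 0:
        raise ValueError("the total number of Rescue Areas must be positive")
    return crossed / total * 100.0

# Example: all 7 Rescue Areas of the later simulation crossed -> G = 100%.
print(mission_completion(7, 7))  # 100.0
```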

B. Degree of Mission Completion

We assume that there are some shipwreck accidents in the 3-D underwater environment, and the N rescue missions can be expressed as $T = \{T_1, \ldots, T_i, \ldots, T_N\}$. For each rescue mission $T_i$ in $T$, we assume that the rescue mission $T_i$ is a sphere object with the center point located at $T_i(x_i, y_i, z_i)$ and with coverage radius $R_i$. Then, each Rescue Area in the rescue missions can be expressed by (2):

$$(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2 < R_i^2, \tag{2}$$

where $p(x, y, z)$ represents a waypoint of the rescue route. The rescue route consists of several waypoints in the 3-D underwater environment. These waypoints play a substantial role in determining the optimal rescue route. For each rescue mission $T_i$, the description of the Degree of Mission Completion is as follows: some of the waypoints on the rescue route fall within the Rescue Area so as to satisfy (2).

C. Model of Mission Assignment

The rescue assignment of the multi-AUV system is a complex NP complete problem. The fundamental problem of rescue assignment is how to divide the entire set of rescue missions into multiple mission sets so that each AUV can perform its predetermined mission set. Besides, the multi-AUV system needs to be cost-effective, achieve a high rescue speed, and be less concerned about the relationship between rescue missions. Fig. 2 shows the environment model of rescue missions. The multi-AUV system consists of three AUVs, which can be represented as $V = \{V_1, V_2, V_3\}$. Each AUV of the multi-AUV system moves from the starting point S; after completing all rescue missions, they reach three target points, which are described as $G = \{G_1, G_2, G_3\}$. There are seven sphere objects representing the rescue missions. The rescue missions can be represented as $T = \{T_1, T_2, \ldots, T_7\}$. Each rescue mission $T_i$ in $T$ can only be completed by a single $V_i$ in $V$. As shown in Fig. 2, the rescue costs of the multi-AUV system for performing various rescue strategies are different. If the rescue cost of each AUV of the multi-AUV system is set as $C = \{C_k \mid k = 1, 2, 3\}$, the optimal rescue strategy is the one ensuring that the total rescue cost $C_{sum} = \sum_{k=1}^{3} C_k$ is minimal. In addition, the AUVs should avoid the sphere and cuboid objects in the process of performing rescue missions.

Fig. 2. The environment model of rescue missions.

IV. PROPOSED ALGORITHM

Real-time performance plays a vital role in underwater rescue missions. The multi-AUV system needs to find the optimal rescue strategy in the 3-D marine environment. To achieve real-time rescue, each AUV needs to search for the optimal rescue strategy independently during the rescue process. However, conventional rescue algorithms, such as negotiation algorithms and auction algorithms, perform rescue assignment offline. The offline procedure divides the rescue process into two steps. The first step is to select the optimal rescue strategy, and the second step is to ensure the rescue missions are performed safely and efficiently. The shortcomings of these methods are that they rely on environment experience when selecting a rescue strategy, and such rescue assignment algorithms are not suitable for real-time rescue missions. However, as a representative of biologically heuristic intelligent algorithms, the PSO algorithm has high performance in continuous space and can support real-time rescue missions for the multi-AUV system.

A. Particle Swarm Optimization

PSO is an evolutionary algorithm based on population evolution [28], originally derived from the study of the foraging behavior of birds. In PSO, the experience stored by the global optimum particle represents the current potential rescue route, and each particle gradually reaches the global optimum regions according to its own experience and that of the other particles. During the k-th evolution, a particle updates its velocity vector $v_i^{(k)}$ and position vector $x_i^{(k)}$ through (3):

$$v_i^{(k+1)} = w v_i^{(k)} + c_1 r_1 \big(P_{p\_best}^{(k)} - x_i^{(k)}\big) + c_2 r_2 \big(P_{g\_best}^{(k)} - x_i^{(k)}\big), \qquad x_i^{(k+1)} = x_i^{(k)} + v_i^{(k+1)}, \tag{3}$$

where $v_i^{(k)}$ represents the speed of the i-th control point in the k-th evolution, $x_i^{(k)}$ represents the position of the i-th control point in the k-th evolution, $w$ is the inertia weighting factor, $c_1$ and $c_2$ are learning factors, and $r_1$ and $r_2$ are random numbers uniformly distributed between 0 and 1. $P_{p\_best}^{(k)}$ is the best experience memorized by each particle in the k-th evolution, and $P_{g\_best}^{(k)}$ is the global optimum experience of the particle swarm in the k-th evolution.

B. R-RLPSO Algorithm

The basic PSO algorithm selects the global optimum particle through a cost function, and the cost function can be transformed into a penalty function [29]. The role of the penalty function is to turn a multi-constraint problem into an unconstrained one, so as to ensure that the multi-AUV system can complete real-time rescue missions successfully. In the R-RLPSO algorithm, we present a new cost function $COST\_F$ consisting of two parts: one is $c\_path$, and the other is $c\_reward$. The $c\_reward$ acts as the penalty term of the cost function $COST\_F$. Therefore, $COST\_F$ can be represented as (4):

$$COST\_F = \alpha \, c\_path - \beta \, c\_reward, \tag{4}$$

where $COST\_F$ is the cost function of the R-RLPSO algorithm, and $\alpha$ and $\beta$ are weight coefficients. For each AUV in the multi-AUV system, $c\_path$ is the length of the rescue route, and $c\_reward$ is the reward value of the waypoints falling within all the Attraction Rescue Areas and Rescue Areas.
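The update rule (3) and the cost function (4) translate directly into code. The sketch below is only illustrative: the NumPy helpers `pso_step` and `cost_f` and their argument names are our own assumptions, and the defaults α = 2 and β = 10 are taken from the settings later reported in Section V-A:

```python
import numpy as np

def pso_step(x, v, p_best, g_best, w, c1, c2):
    """One evolution of Eq. (3) for a swarm of candidate rescue routes.

    x, v   : (n_particles, dim) positions and velocities of route control points
    p_best : (n_particles, dim) best position memorized by each particle
    g_best : (dim,) global optimum position found by the swarm so far
    """
    r1 = np.random.rand(*x.shape)  # r1, r2 ~ U(0, 1), as in Eq. (3)
    r2 = np.random.rand(*x.shape)
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
    return x + v_new, v_new

def cost_f(c_path, c_reward, alpha=2.0, beta=10.0):
    """Eq. (4): COST_F = alpha * c_path - beta * c_reward. A larger
    accumulated reward lowers the cost, so routes whose waypoints are
    attracted into the rescue areas win the particle selection."""
    return alpha * c_path - beta * c_reward
```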

Fig. 3. The process of obtaining c_reward by reinforcement learning.

C. Path Generation of R-RLPSO Algorithm

Before introducing the specific calculation of $c\_reward$, it is necessary to analyze why the proposed cost function $COST\_F$ enables real-time rescue missions for the multi-AUV system.

Fig. 3 shows the process of obtaining $c\_reward$ from trial-and-error through interaction with the environment. In underwater rescue missions, the environment represents the Attraction Rescue Areas and Rescue Areas, and the agent represents the multi-AUV system. $s_t$ is the current state, and $a_t$ describes the joint actions of the multi-AUV system.

The process of obtaining $c\_reward$ for the multi-AUV system is a Markov decision process [30]. Fig. 2 shows that the multi-AUV system consists of three AUVs and seven rescue missions. Therefore, the rescue process can be defined as the tuple $\langle S, A_1, A_2, A_3, P, \gamma_1, \gamma_2, \gamma_3 \rangle$ with

$$\gamma_i : S \times A \times S \to \mathbb{R}, \quad P : S \times A \times S \to [0, 1], \quad A = A_1 \times A_2 \times A_3, \tag{5}$$

where $S$ is the set of states of the multi-AUV system, $A_1$, $A_2$, and $A_3$ are the action sets of the multi-AUV system, $\gamma_i$ is the reward function in the multi-AUV system, and $P$ is the state transition probability function. The state transitions are the result of the joint actions $a_t$ of the AUVs in the multi-AUV system, with $a_{1,t} \in A_1$, $a_{2,t} \in A_2$, and $a_{3,t} \in A_3$; the joint action can be represented as $a_t = [a_{1,t}, a_{2,t}, a_{3,t}]$, which also denotes the action set of the multi-AUV system.

In this paper, both the Attraction Rescue Areas and the Rescue Areas are sphere objects. The process of performing rescue missions is to find the best waypoints by the R-RLPSO algorithm in the 3-D complex underwater environment. Meanwhile, if we imagine each Rescue Area as a spherical magnet, completing a rescue mission means making as many waypoints as possible attracted to the spherical magnet. However, this attraction cannot be infinitely large; otherwise, it would cause excessive distortion of the rescue route. The process for the Rescue Area is achieved through the policy π of reinforcement learning, as shown in (6):

$$Q_i^{\pi}(s, a) = E\Big\{\sum_{j=0}^{\infty} \alpha^j \gamma_{i,j+1} \,\Big|\, s_0 = s, a_0 = a, \pi\Big\}, \tag{6}$$

where $\alpha^j$ is the discount factor at step $j$, and $\gamma_{i,j+1}$ is the reward of $V_i$ at step $j+1$. $Q_i^{\pi}(s, a)$ represents the policy value of $V_i$, which is the value of the attraction level of the Rescue Area to the waypoints. When the R-RLPSO algorithm converges, $Q_i^{\pi}(s, a)$ must reach the upper limit of attraction. In this case, the optimal policy $\pi^*$ can be represented as (7):

$$\pi^* = \arg\max_a Q^*(s, a). \tag{7}$$

In (4), the values of $c\_path$ and $c\_reward$ satisfy $c\_path \gg c\_reward$; the multi-AUV system follows the rescue strategy along the optimal policy $\pi^*$, and some of the waypoints can be attracted into the Rescue Areas. With the continuous evolution of the R-RLPSO algorithm, each particle represents a potential rescue route. The rescue route is generated by the optimal particle, and the optimal particle is determined by the least cost of the cost function $COST\_F$ in R-RLPSO. If $c\_reward$ is more significant, $COST\_F$ will be smaller, and the rescue route of this particle has a larger probability of being selected as the optimal rescue route. When the policy π adopted by the multi-AUV system tends to the optimal policy $\pi^*$, $c\_reward$ will gradually become more substantial. Meanwhile, the optimal policy $\pi^*$ ensures that more waypoints are attracted to the Rescue Areas. There are different types of obstacles in the rescue space, and the rescue route of the optimal particle ensures rescue safety. In addition, the multi-AUV system will automatically select the optimal rescue mission combination based on $COST\_F$, while being less concerned with the relationship between rescue missions. Therefore, the R-RLPSO algorithm can quickly generate a rescue path and complete real-time rescue missions in the 3-D complex underwater environment.

D. Mission Area Processing of R-RLPSO Algorithm

The calculation of $c\_reward$ is crucial for the rescue missions of the multi-AUV system. In this paper, we propose a linear reward function to calculate the reward of all Attraction Rescue Areas. Meanwhile, to speed up the convergence of R-RLPSO and mark the current reward states, we give a Reward Coefficient based on the reward of all Attraction Rescue Areas and Rescue Areas.

Fig. 4 shows that each virtual Attraction Rescue Area contains a Rescue Area. $R_0$ and $R_1$ represent the radius of the Rescue Area and the radius of the virtual Attraction Rescue Area, respectively. However, each Attraction Rescue Area is not a solid sphere coverage; its coverage is shown shaded in blue. The waypoint $p(x, y, z)$ can be attracted within the virtual Attraction Rescue Area and Rescue Area. The rescue route of the AUV is divided into many waypoints. For each Rescue Area, according to the Degree of Mission Completion, it should be ensured that at least one waypoint $p(x, y, z)$ meets (1). For each waypoint $p(x, y, z)$ of an AUV in the multi-AUV system, the distance $dist$ between the center point $T_i(x_i, y_i, z_i)$ of the Rescue Area and the current waypoint $p(x, y, z)$ can be calculated as (8):

$$dist = \sqrt{(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2}. \tag{8}$$
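To make (2) and (8) concrete, a small helper that classifies a waypoint against one rescue mission is sketched below; the function names, the string labels, and the example mission center are our own illustration, not code or data from the paper:

```python
import math

def dist(p, center):
    """Eq. (8): Euclidean distance between a waypoint p = (x, y, z)
    and the Rescue Area center T_i = (x_i, y_i, z_i)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, center)))

def classify_waypoint(p, center, r0, r1):
    """Locate a waypoint relative to one mission: inside the Rescue
    Area (Eq. (2), dist < R0), inside the virtual Attraction Rescue
    Area (R0 <= dist <= R1), or outside both."""
    d = dist(p, center)
    if d < r0:
        return "rescue_area"
    if d <= r1:
        return "attraction_area"
    return "outside"

# Example with an assumed mission center; the radii R0 = 3 m and
# R1 = 10 m follow the scale used in the simulations of Section V.
print(classify_waypoint((148.97, 141.10, 2.20), (150.0, 140.0, 2.0), 3.0, 10.0))
```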

Fig. 4. The virtual attraction rescue area.

Then, for each rescue mission, the reward value γ can be expressed as (9):

$$\gamma = \begin{cases} 0, & dist > R_1, \\ \varepsilon, & dist < R_0, \\ \big(1 - (dist - R_0)/(R_1 - R_0)\big)\varepsilon, & R_0 \le dist \le R_1. \end{cases} \tag{9}$$

Expression (9) shows that the reward γ of the waypoint $p(x, y, z)$ is divided into three parts according to the distance $dist$. When $dist > R_1$, the waypoint $p(x, y, z)$ is outside the range of the Attraction Rescue Area; thus the waypoint is not attracted by the Attraction Rescue Area, and the reward value γ is equal to zero. When $dist < R_0$, the waypoint $p(x, y, z)$ is within the Rescue Area; according to the Degree of Mission Completion, the AUV has completed this rescue mission, and the reward γ is equal to ε, where ε is a constant. When $R_0 \le dist \le R_1$, the waypoint $p(x, y, z)$ is within the Attraction Rescue Area but the rescue mission has not been completed. The proposed linear reward function obtains the reward according to the value of $dist$: when $dist$ tends to $R_1$, the waypoint $p(x, y, z)$ approaches the outside of the Attraction Rescue Area, and the reward γ tends to zero; when $dist$ tends to $R_0$, the waypoint approaches the Rescue Area, which means that the current rescue mission is about to be completed, and the reward γ tends to ε. For each rescue mission, there is not only a single waypoint $p(x, y, z)$ within the Attraction Rescue Area and Rescue Area. Therefore, the reward of each rescue mission should be the summation of the rewards of a certain number of waypoints on the rescue route. These waypoints are determined by measuring the overall rescue efficiency of the multi-AUV system. Each AUV in the multi-AUV system produces an optimal rescue strategy through $COST\_F$, which only measures the summation of the reward over all rescue missions; this meets the goal of being less concerned about the relationship between rescue missions.

Proposing such a linear reward function for the case $R_0 \le dist \le R_1$ is novel and efficient. If this situation were not considered, the reward γ of the waypoint $p(x, y, z)$ would become a Boolean problem. The disadvantage of that approach is that when $dist$ tends to $R_0$ but the waypoint does not fall within the Rescue Area, the reward value γ would still equal zero even though the waypoint $p(x, y, z)$ is already near the Rescue Area. The proposed linear reward function can capture the state of such a waypoint and calculate the reward γ according to $dist$. Through the cost function $COST\_F$, the number of waypoints in the Attraction Rescue Area will increase, so that a more substantial reward is accumulated. This indicates that the AUV tends toward the Rescue Area, and the rescue route marked by these waypoints is selected as the final rescue route.

However, the rescue route generated by the R-RLPSO algorithm still has a large random moving amplitude before convergence. As shown in (9), when γ > 0, the current waypoint $p(x, y, z)$ is at least within the Attraction Rescue Area. The optimal particle selection is based on the cost function $COST\_F$ in the R-RLPSO algorithm. Since the waypoint is already within the Attraction Rescue Area, the reward γ is appropriately increased, and $COST\_F$ goes down more dramatically. Therefore, the rescue route marked by such waypoints has a higher probability of being selected as the optimal rescue route. Based on this idea, we give a Reward Coefficient based on the total reward of all Attraction Rescue Areas and Rescue Areas to increase the degree of the reward, which speeds up the convergence of R-RLPSO and marks the current reward states of the rescue missions. Suppose there are N rescue missions in the 3-D underwater environment. For $V_i$ in the multi-AUV system, the reward at the j-th iteration can be expressed as $[\gamma_1^{(j)}, \ldots, \gamma_k^{(j)}, \ldots, \gamma_N^{(j)}]$. For the k-th rescue mission, if some of the waypoints on the rescue route are within the Attraction Rescue Area and the Rescue Area, then $\gamma_k^{(j)} > 0$. Considering the actual role of the Reward Coefficient, we believe it is meaningless to calculate the Reward Coefficient when $\gamma_k^{(j)} \le 0$; therefore, when calculating the Reward Coefficient, we set $\gamma_k^{(j)} = 0$ whenever the reward value $\gamma_k^{(j)} \le 0$. The total reward $W_{sum}^{(j)}$ of the N rescue missions at the j-th iteration can be shown as (10):

$$W_{sum}^{(j)} = \sum_{i=1}^{N} \gamma_i^{(j)}. \tag{10}$$

Then, the Reward Coefficient vector $W^{(j)}$ at the j-th iteration can be shown as (11):

$$W^{(j)} = \Big[\frac{\gamma_1^{(j)}}{W_{sum}^{(j)}}, \ldots, \frac{\gamma_k^{(j)}}{W_{sum}^{(j)}}, \ldots, \frac{\gamma_N^{(j)}}{W_{sum}^{(j)}}\Big]. \tag{11}$$

For the waypoint $p(x, y, z)$ at the k-th rescue mission, the reward $\gamma_k^{(j+1)}$ at the (j+1)-th iteration can be obtained from (12):

$$dist = \sqrt{(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2}. \tag{12}$$

Then, the reward value γ can be expressed as (13):

$$\gamma_k^{(j+1)} = \begin{cases} 0, & dist > R_1, \\ \big(1 + W^{(j)}(k)\big)\varepsilon, & dist < R_0, \\ \big(1 + W^{(j)}(k)\big)\big(1 - (dist - R_0)/(R_1 - R_0)\big)\varepsilon, & R_0 \le dist \le R_1, \end{cases} \tag{13}$$

where $W^{(j)}(k)$ is the Reward Coefficient of the k-th rescue mission at the j-th iteration.
The proposed Reward Coefficient vector is necessary for performing rescue missions in the multi-AUV system for two reasons. One is that the waypoints of a rescue route within the Attraction Rescue Area and Rescue Area make the cost function $COST\_F$ fall more significantly; therefore, such a route is more easily selected as the optimal rescue route by the cost function $COST\_F$. The other, more important, reason is that the Reward Coefficient vector $W^{(j)}$ at the j-th iteration is passed to the reward of the (j+1)-th iteration. Reward transmission means that the reward is related not only to the current iteration but also to the next one. Due to the limitation of the cost function $COST\_F$, it is impossible for a $V_i$ in the multi-AUV system to perform all rescue missions. It will have a particular "preference" for some rescue missions, and this "preference" will make part of the Reward Coefficient vector $W^{(j)}$ equal to zero. Meanwhile, the "preference" part has a larger reward weight, and the more significant weight in the Reward Coefficient vector $W^{(j)}$ is exactly the reward weight of the Rescue Area and the Attraction Rescue Area. As shown in (11), this reward weight is passed to the next iteration, which we call the "strength to strength" of the reward.

However, the reward $\gamma_k^{(j+1)}$ does not always increase; it should be punished in two situations. In the first, waypoints accumulate excessively in an Attraction Rescue Area and Rescue Area. In the second, no waypoints fall within an Attraction Rescue Area and Rescue Area. The reason for the first phenomenon is that the Attraction Rescue Area and Rescue Area use a greedy method to attract waypoints. To avoid the rescue route distortion caused by an excessive accumulation of waypoints in the Attraction Rescue Area and Rescue Area, the reward $\gamma_k$ has to be punished. For each rescue mission, the upper limit of the waypoints is set to κ and the penalty value is set to $\varepsilon_1$, where $\varepsilon_1$ is a constant. The reward $\gamma_k^{(j+1)}$ is then penalized if the number of waypoints in the Attraction Rescue Area and Rescue Area exceeds the upper limit κ, as shown in (14):

$$\gamma_k^{(j+1)} = \begin{cases} \gamma_k^{(j+1)} - \varepsilon_1, & \eta > \kappa, \\ \gamma_k^{(j+1)}, & \eta \le \kappa, \end{cases} \tag{14}$$

where η represents the number of waypoints within each Attraction Rescue Area and Rescue Area. The reason for the second phenomenon is that the AUV abandons performing the related rescue missions. It is impossible for a single AUV to complete all rescue missions in the 3-D underwater environment; therefore, it is normal for an AUV to abandon some rescue missions. As shown in (11), each AUV has a particular "preference" for some rescue missions. For a single AUV, "non-preference" rescue missions should not have waypoints. However, for "preference" rescue missions, the reward $\gamma_k^{(j+1)}$ needs to be punished if there are no waypoints. For the AUV at the k-th Rescue Area, if the situation occurs that $W^{(j)}(k) > 0$ at the j-th iteration but $W^{(j+1)}(k) = 0$ at the (j+1)-th iteration, then the reward value $\gamma_k^{(j+1)}$ needs to be punished, as shown in (15):

$$\gamma_k^{(j+1)} = \begin{cases} \gamma_k^{(j+1)} - \varepsilon_1, & W^{(j)}(k) > 0 \text{ and } W^{(j+1)}(k) = 0, \\ 0, & \text{otherwise}, \end{cases} \tag{15}$$

where $W^{(j)}(k)$ is the Reward Coefficient of the k-th rescue mission at the j-th iteration, and $\varepsilon_1$ is the penalty value of the reward.

V. PERFORMANCE

To demonstrate the effectiveness of R-RLPSO in underwater rescue missions, several simulations are constructed in MATLAB R2016b on a personal computer configured with an Intel Core i3-7100U @3.9 GHz and 8 GigaBytes (GB) of RAM. In our simulations, the number of rescue missions is 7, and there are four cuboid objects and six sphere objects in the 3-D underwater environment. We use the sailing distance of the AUVs as they pass through the task areas to evaluate the quality of the given algorithms.

A. Simulation Environment Settings of R-RLPSO Algorithm

The simulations were constructed in a three-dimensional underwater environment. The data for the underwater environment were real data downloaded from the National Marine Science Data Center; the environment includes obstacles and an uneven seabed. The parameters for R-RLPSO are designed as follows: the inertia weight factor w linearly decreases from 0.9 to 0.4; the learning factor c1 linearly decreases from 2.5 to 0.5; and the learning factor c2 linearly increases from 0.5 to 2.5. The parameters α and β in $COST\_F$ are 2 and 10, respectively. The maximum number of iterations is 50, and the number of particles is 300. Besides, the reward ε is 0.1, the penalty reward value $\varepsilon_1$ is 0.5, the upper limit κ is 10, and the radius of the Attraction Rescue Area is 10 meters (m). The background of the simulation is as follows: seven shipwreck accidents occur, and the multi-AUV system needs to move from the base station S, whose coordinate is (5, 5, 5), and perform the seven rescue missions $T = \{T_1, T_2, \ldots, T_7\}$. After the rescue missions are completed, the AUV swarm needs to reach the substations $G_1$, $G_2$, and $G_3$, whose coordinates are (180, 180, 0), (130, 180, 0), and (180, 130, 0), respectively. Meanwhile, we set different types of obstacles along the rescue route, including cuboid obstacles $C = \{C_i \mid i = 1, 2, 3, 4\}$, sphere obstacles $S = \{S_i \mid i = 1, 2, \ldots, 6\}$, and submarine reefs. Each Rescue Area is represented by a sphere object. Each Attraction Rescue Area is also a virtual sphere object, which contains the Rescue Area. According to the Degree of Mission Completion, the AUV passing through the predetermined Rescue Area indicates that the rescue mission is completed. The description of the rescue missions is shown in Table II. As shown in Tables II–IV, P represents the center point of the rescue mission, and R represents the coverage area. For the obstacles in the environment model, the descriptions of the sphere obstacles and cuboid obstacles are shown in Table III and Table IV, respectively. In Table III, P and R represent the center and the radius of the sphere objects, respectively. In Table IV, P represents the body center of the cuboid obstacles.
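Taken together, the reward rule (9)/(13), the Reward Coefficient (10)–(11), and the penalty rules (14)–(15) can be sketched as below, using the parameter values reported above (ε = 0.1, ε1 = 0.5, κ = 10). The structure and names are our own illustration rather than the authors' code, and we read the "otherwise" branch of (15) as leaving the reward unchanged:

```python
def linear_reward(d, r0, r1, eps=0.1, w_k=0.0):
    """Eqs. (9) and (13): zero outside the Attraction Rescue Area,
    (1 + w_k) * eps inside the Rescue Area, and a linear ramp in
    between. With w_k = 0 this reduces to the plain reward of Eq. (9);
    w_k = W^(j)(k) gives the amplified reward of Eq. (13)."""
    if d > r1:
        return 0.0
    if d < r0:
        return (1.0 + w_k) * eps
    return (1.0 + w_k) * (1.0 - (d - r0) / (r1 - r0)) * eps

def reward_coefficients(rewards):
    """Eqs. (10)-(11): normalize the per-mission rewards gamma_k^(j)
    (values <= 0 are clamped to zero first) by their total W_sum^(j)."""
    clamped = [max(g, 0.0) for g in rewards]
    w_sum = sum(clamped)
    if w_sum == 0.0:
        return [0.0] * len(rewards)
    return [g / w_sum for g in clamped]

def apply_penalties(gamma_next, n_waypoints, w_prev, w_next,
                    eps1=0.5, kappa=10):
    """Eq. (14): subtract eps1 when more than kappa waypoints pile up
    inside one Attraction Rescue Area and Rescue Area. Eq. (15):
    subtract eps1 when a mission had a positive Reward Coefficient at
    iteration j (w_prev > 0) but holds no waypoints at iteration j+1
    (w_next == 0), i.e., a "preference" mission was abandoned."""
    if n_waypoints > kappa:             # Eq. (14): over-accumulation
        gamma_next -= eps1
    if w_prev > 0.0 and w_next == 0.0:  # Eq. (15): abandoned preference
        gamma_next -= eps1
    return gamma_next

# Example: three missions; the second carries most of the reward at
# iteration j, so its coefficient amplifies its reward at j+1.
W = reward_coefficients([0.02, 0.30, 0.0])   # -> [0.0625, 0.9375, 0.0]
print(linear_reward(2.0, r0=3.0, r1=10.0, w_k=W[1]))  # waypoint inside Rescue Area
```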

Fig. 5. The process of rescue missions by R-RLPSO algorithm.

TABLE II
THE DESCRIPTION OF RESCUE MISSIONS (M)

TABLE III
THE DESCRIPTION OF SPHERE OBSTACLES (M)

B. Rescue Mission Process Analysis of R-RLPSO Algorithm

TABLE IV
THE DESCRIPTION OF CUBOID OBSTACLES (M)

TABLE V
THE DISTANCE BETWEEN AUV SWARM AND RESCUE MISSIONS AFTER THE FIRST STAGE (M)

TABLE VI
THE DISTANCE BETWEEN AUV AND RESCUE MISSIONS AFTER THE SECOND STAGE (M)

TABLE VII
THE CLOSEST DISTANCE BETWEEN AUV AND RESCUE MISSION T3 DURING THE THIRD STAGE (M)

Fig. 5 shows the process of the rescue missions under the R-RLPSO algorithm. Fig. 5(a) shows the initial state before the rescue missions, where C1–C4 represent the cuboid obstacles, S1–S6 represent the sphere obstacles, and T1–T7 represent the rescue missions. Three AUVs move from the base station to perform the rescue missions. We assume that each AUV completes its rescue in 100 steps. In the simulation, the rescue process of each AUV is divided into four stages. Fig. 5(b) shows the first stage of the rescue process. At this stage, the AUV swarm is far away from the rescue missions, and no rescue missions are completed. The three AUVs will face three missions: T1, T4, and T6. After the first stage of the rescue process, the distances between the AUV swarm and the rescue missions are shown in Table V. Besides, AUV1 will face the cuboid obstacle C2, and AUV2 avoids the sphere obstacle S1 and the cuboid obstacle C2. Fig. 5(c) shows the second stage of the rescue process. At this stage, each AUV of the swarm performs one of these missions. Table V shows that AUV1 is the closest to T6, AUV2 is the closest to T1, and AUV3 is the closest to T6 after the first stage of the rescue process. However, the simulation result shows that, under the action of the R-RLPSO algorithm, the AUV swarm does not rush to its nearest missions: AUV1 tends to T1, AUV2 tends to T4, and AUV3 tends to T6. Here, the most crucial point is that the AUV swarm is not pre-assigned its rescue missions and the algorithm has lightweight computational complexity, which meets the real-time requirement. Especially for AUV2, Table V shows that its closest distance, from the rescue mission T1, is 27.62 m, while its farthest distance, from T4, is 41.78 m. However, considering the overall efficiency of performing the rescue missions, AUV2 still selects T4 rather than T1. After the second stage, the distances between the AUV swarm and the rescue missions are shown in Table VI. In addition, AUV1 avoids the cuboid obstacle C2 and AUV3 avoids the sphere obstacle S3. Fig. 5(d) shows the third stage of the rescue process. At this stage, all AUVs perform one of these missions. Table VI shows that AUV1 is the closest to T2, AUV2 is also the closest to T2, and AUV3 is the closest to T7 after the second stage. However, the simulation result shows that AUV1 tends to T2, AUV2 tends to T5, and AUV3 tends to T7. In addition, AUV1 avoids the sphere obstacle S4, and AUV2 avoids the cuboid obstacle C3. Fig. 5(d) shows that AUV1 has completed T3 at the end of the third stage. Table VII shows the closest distance between the AUV swarm and rescue mission T3 during the third stage. The result indicates that AUV1 is only 2.34 m away from the center point of T3. This waypoint occurs at step 79, and the coordinate of the waypoint is p(148.97, 141.10, 2.20). Table II shows that the radius of T3 is 3 m. Therefore, AUV1 has completed T3 during the third stage. The closest distances of AUV2 and AUV3 to T3 are 44.94 m and 31.34 m, respectively. Fig. 5(e) shows that the AUV swarm has performed all rescue missions, and the AUVs return to their respective substations.

C. Algorithm Performance Analysis of R-RLPSO Algorithm

The reward plays a vital role in the rescue assignment. Each AUV automatically selects rescue missions based on the current reward. Fig. 6(a) shows the reward of the rescue missions for AUV1. The simulation result shows that AUV1 is scheduled to perform T1, T2, and T3 in the initial rescue state. The reward of the other rescue missions is zero. The reason for this phenomenon is the role of $COST\_F$: the multi-AUV system needs to be cost-effective and time-saving during the rescue process. T1, T2, and T3 show temporary reward stability during the iterations; the reason is that AUV1 does not find better waypoints for performing the rescue missions. Then, the reward value of T3 gradually increases over the iterations. The reward value of T2 increases rapidly and stabilizes after a brief decrease. The reward of T1 decreases at the beginning of the iterations, but it rises and stabilizes in the later iterations. When AUV1 performs rescue missions, it should not pay too much attention to a single rescue mission, because that can distort the rescue route. Considering the rescue cost, AUV1 should take a comprehensive measurement over all rescue missions. When the reward of T1 decreases, the reward values of T2 and T3 must increase. This shows that the AUV can comprehensively measure the situation of each rescue mission to find the best locations of the waypoints.

In Fig. 7, the blue dotted line shows the total reward of the rescue missions for AUV1.

Fig. 7. The total reward of rescue missions for multi-AUV system.

Fig. 8. The best cost of R-RLPSO algorithm.

Fig. 6. The reward of rescue missions for each AUV.

The total reward consists of the rewards of T1, T2, and T3. The simulation result shows that the total reward gradually increases during the iterations. According to (3), when the reward increases, the waypoints of AUV1 are continually evolving to find the optimal rescue locations. Fig. 6(b) shows the process of the reward for AUV2. The simulation result shows that AUV2 is scheduled to perform T4 and T5. Meanwhile, the reward of T4 gradually increases, the reward of T5 decreases only slightly, and in the other iterations both are increasing. However, at the beginning of the iterations, AUV2 also senses the rescue mission T1, with a reward of 0.3249. The reason for this phenomenon is that some of the waypoints of AUV2 fall within the Attraction Rescue Area of T1. However, by measuring the overall rescue cost, AUV2 quickly abandons the rescue mission T1, and the reward of T1 drops to 0.0366 in the next iteration. This shows that AUV2 can adaptively adjust the rescue strategy among multiple rescue missions. In Fig. 7, the black dotted line shows the total reward of the rescue missions for AUV2. It also shows that the total reward gradually increases during the iterations; meanwhile, it changes similarly to AUV1. Fig. 6(c) shows the process of the reward for AUV3, and the total reward of AUV3 is shown in Fig. 7. The red dotted line shows that this total reward also gradually increases during the iterations.

Fig. 8 shows the best cost of the R-RLPSO algorithm, where the best cost is calculated by $COST\_F$.

TABLE VIII
THE WAYPOINTS OF MULTI-AUV SYSTEM FALLING IN ATTRACTION RESCUE AREAS AND RESCUE AREAS (M)

Fig. 9. Simulation of rescue assignment algorithms IACO and ISOM.

The experience of the individual particles and of the global optimum particle considers two aspects: one is the value of $c\_path$, and the other is whether the current reward value is better than the existing empirical reward value. If both aspects are met, then $COST\_F$ is updated. The simulation result shows that the best cost of the multi-AUV system decreases during the iterations. Besides, the decrease of the cost curve represents that the total reward is increasing, and the increase in the total reward value means that the multi-AUV system is attracted to perform the rescue missions.

To describe the role of the proposed Attraction Rescue Area, the motion steps of each AUV are numbered from 1 to 100. Table VIII shows the waypoints of the multi-AUV system falling in the Attraction Rescue Areas and Rescue Areas. For each rescue mission $T_i$ in $T = \{T_1, T_2, \ldots, T_7\}$, the first row represents the numbers of the motion steps, and the second row represents the distances of the waypoints from the center of the Rescue Area. When AUV1 approaches the rescue mission T1, the result shows that AUV1 falls into the Attraction Rescue Area at step 38 and detaches after step 43. At steps 40 and 41, AUV1 performs rescue mission T1 successfully; the distances from the rescue center T1 are 2.30 m and 2.98 m, respectively. When AUV1 approaches the rescue mission T2, the result shows that six waypoints fall into the Attraction Rescue Area. The first step is 56, and AUV1 detaches after step 61. There are two waypoints performing the rescue mission, at steps 58 and 59, respectively; the distances from the rescue center T2 are 2.35 m and 2.13 m, respectively. Table VIII shows that AUV1 falls within the Attraction Rescue Area of T3 at step 75, and there are two waypoints performing T3 at steps 78 and 79. When AUV2 approaches the rescue mission T4, the result shows that ten points fall into the Attraction Rescue Area and Rescue Area, the first at step 41; AUV2 detaches after step 50. There are three waypoints performing rescue mission T4, and the related distances are 2.28 m, 0.95 m, and 1.74 m, respectively. Table VIII shows the AUV2 rescue route falling in the Attraction Rescue Area of the rescue mission T5; there is only one waypoint in the Rescue Area, and its distance from the rescue center is 2.67 m. Table VIII also shows the waypoints of AUV3 falling in T6 and T7, respectively. The simulation result shows that AUV3 falls into the Attraction Rescue Areas at steps 32 and 63, respectively. There are two waypoints in each Attraction Rescue Area that complete the rescue mission successfully. Thus, a degree of mission completion of 100% is achieved in this scenario.

D. Rescue Assignment by ISOM and IACO Algorithm

With the rapid development of artificial intelligence, intelligent algorithms have been widely used in underwater rescue assignment. The related algorithms mainly include neural networks and bio-heuristic intelligence optimization algorithms.

Therefore, we choose a classic algorithm, an improved ant colony optimization (IACO), and a neural network algorithm, an improved self-organizing map (ISOM), as our comparison algorithms.

TABLE IX
THE ROUTE LENGTH OF R-RLPSO, ISOM, AND IACO ALGORITHM (M)

Fig. 10. Route length of R-RLPSO, IACO and ISOM algorithm through the Rescue Areas.

The ACO is a classic task assignment algorithm, initially proposed by Dorigo [31] and used to solve the traveling salesman problem (TSP). However, the classic ACO pays much attention to the task assignment while ignoring the application in the real environment. In this paper, we use an improved ant colony optimization (IACO) based on random perturbation to solve the rescue assignment in the 3-D underwater environment. Compared with ACO, we added D. Zhu's latest obstacle avoidance algorithm [32] to obtain IACO, which not only performs rescue missions but also avoids different types of obstacles. Kohonen first proposed the SOM algorithm [33], and it has been widely used in task assignment. However, the classic SOM does not consider avoiding obstacles when performing missions. To use the classic SOM neural network for underwater rescue assignment, we add the obstacle avoidance algorithm [32] into the SOM algorithm, yielding the ISOM algorithm, to perform rescue missions.

The multi-AUV system utilizes the same underwater environment for the algorithms R-RLPSO, IACO, and ISOM. The parameters of the IACO algorithm are set as follows: the number of ants is 20, the volatilization rate is 0.15, and the information heuristic factor is 0.12. We set a constant value Q = 10 and the heuristic factor H = Q/DIST, where DIST represents the distance between two rescue missions. To avoid tortuous waypoints that are not in line with the actual AUV rescue route, we use the gradient descent method to smooth the rescue route. Fig. 9(a) shows the rescue assignment by the IACO algorithm. The parameters of ISOM are set as follows: the learning rate is 0.5, the discount factor is 0.9, the neighborhood radius is 1, and the maximum number of iterations is 100. Fig. 9(b) shows the simulation result of the ISOM algorithm for the rescue missions.

The simulation results show that all AUVs using the IACO and ISOM algorithms perform rescue missions consistent with the R-RLPSO algorithm: AUV1 rescues T1, T2, and T3; AUV2 rescues T4 and T5; AUV3 rescues T6 and T7. Both ISOM and IACO can avoid the sphere objects, cuboid objects, and submarine reefs to ensure safety during the rescue missions. In the IACO algorithm, the cost value of AUV1 is 249.7438, the cost value of AUV2 is 215.9746, and that of AUV3 is 216.0894. The simulation results show that the time consumption of R-RLPSO is only 26.8524 s, while the IACO time consumption is 31.8696 s and the ISOM time consumption is 32.1239 s. Table IX shows the route lengths of the algorithms R-RLPSO, ISOM, and IACO in the process of the rescue missions. It can be seen that, in path planning, the distance of our algorithm is similar to those of the IACO algorithm and the ISOM neural network algorithm. This also shows that path planning is a mature technology without much room for improvement; nevertheless, overall we have improved, and this advantage is even greater when the task is performed from a distance.

Compared with IACO and ISOM, the route length of the multi-AUV system in the rescue missions is the shortest, indicating that the AUVs can complete the rescue missions at a small rescue cost, which reflects the superiority of the algorithm. The result shows that the route length of the R-RLPSO algorithm is the smallest of the three algorithms. Compared to the ISOM algorithm, the route length of R-RLPSO is reduced by 3.7331 m, 0.4039 m, and 0.2475 m, respectively. Compared to the IACO algorithm, the route length of R-RLPSO is reduced by 4.8523 m, 2.0876 m, and 1.5176 m, respectively. Besides, the route length of the ISOM algorithm is shorter than that of the IACO algorithm.

Fig. 10 shows the route lengths of R-RLPSO, IACO and ISOM through the Rescue Areas, where the x-axis represents the rescue missions from T1 to T7. The simulation result shows that the R-RLPSO algorithm has a shorter route length in the Rescue Areas and accelerates the subsequent rescue missions. In addition, the reason why the path length of the IACO algorithm exceeds 6 m is that the AUV does not pass through the Rescue Area along a straight line.

Although ISOM and IACO can perform rescue assignment, they have flaws when used in underwater rescue missions. For ISOM, if the initial location of the AUV is far from the rescue missions, ISOM can cause a sudden change in the speed of the AUV, which is impractical in actual rescue missions [16]. The IACO algorithm seems good; however, we found in the simulation that the prior distance experience DIST between different rescue missions has a significant impact on this algorithm. The rescue result of Fig. 9(a) is generated with $DIST_{min}$ = 40 m and $DIST_{max}$ = 105 m, where $DIST_{min}$ represents the minimum distance that can be traversed between two rescue missions, and $DIST_{max}$ represents the maximum distance that can be traversed between two rescue missions.

Fig. 11. The rescue mission assignment by the IACO algorithm.

If we change to DIST_min = 30 m and DIST_max = 100 m, the rescue result of the multi-AUV system is shown in Fig. 11. Fig. 11 shows that AUV1 and AUV3 completed their rescue missions successfully. For AUV2, because there is no admissible relation between the starting point S and T4, AUV2 is prevented from performing rescue mission T4. Meanwhile, to keep the rescue cost of the multi-AUV system low, AUV2 has to pass through rescue mission T1, which has already been rescued by AUV1, and thus fails T4. The location of a Rescue Area can easily be detected by radar, but prior experience is required to set the values of DIST_min and DIST_max, and improper experience can easily lead to failure of rescue missions.
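To make this dependence concrete, the sketch below shows how a [DIST_min, DIST_max] window could gate the candidate transitions; the function names are hypothetical, but the effect mirrors Fig. 11: when the window shrinks, a leg such as S to T4 can become inadmissible, and the mission is then never assigned.

```python
def admissible(leg_length, dist_min=30.0, dist_max=100.0):
    """A leg between two missions is usable only if its length lies
    inside the prior distance window [dist_min, dist_max]."""
    return dist_min <= leg_length <= dist_max

def feasible_next_missions(current, unvisited, dist,
                           dist_min=30.0, dist_max=100.0):
    """Candidate missions an ant may move to from `current`; if every
    leg into some mission is gated out, that mission can never be
    assigned, which is the failure mode observed for T4 in Fig. 11."""
    return [m for m in unvisited
            if admissible(dist[(current, m)], dist_min, dist_max)]
```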
The R-RLPSO is a real-time rescue algorithm that does not consider the complex relationships between rescue missions: as long as the location information of the rescue missions is obtained, the multi-AUV system can find the most cost-effective rescue strategy and generate the rescue route in a short time, and its low computing time guarantees the real-time performance of the rescue task.
VI. CONCLUSION

In this paper, we have provided the R-RLPSO algorithm to achieve real-time rescue assignments for the multi-AUV system in the 3-D complex underwater environment. Compared with existing algorithms, the obvious advantage of the R-RLPSO algorithm is that it ensures the rescue missions are completed under the premises of cost-effectiveness, rapid rescuing, and little concern about the relationships between rescue missions. With the R-RLPSO algorithm, the multi-AUV system can adaptively select rescue missions to find the optimal rescue strategy, which meets the needs of real-time rescue in actual scenarios.

Our future works may focus on the following.

1) Establishing a communication mechanism in the multi-AUV system. When one of the AUVs fails to perform its rescue missions, it will automatically transmit its subsequent rescue mission queue to the nearest AUV. When the nearest neighbor AUV receives the rescue mission queue, it will add the queue to its own rescue missions and then weigh the overall rescue efficiency to execute them; a minimal handoff sketch follows this list.

2) Building a more intelligent real-time multi-AUV rescue system. The heuristic information is produced by a neural network in the process of rescue missions and includes two aspects: one is the prediction of the local environment ahead, and the other is the locations of the rescue areas within the detectable range.
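As a first cut at item 1), the handoff might look like the sketch below; the AUV data structure, the failure handling, and the nearest-neighbor rule are all our assumptions, since this mechanism is left as future work.

```python
import math
from dataclasses import dataclass, field

@dataclass
class AUV:
    name: str
    position: tuple                               # (x, y, z)
    queue: list = field(default_factory=list)     # pending rescue missions

    def replan(self):
        # Placeholder: re-order the merged queue for overall rescue
        # efficiency (e.g., by re-running the assignment algorithm).
        pass

def nearest_peer(failed, peers):
    """Peer AUV closest to the failed one (3-D Euclidean distance)."""
    return min(peers, key=lambda p: math.dist(failed.position, p.position))

def hand_off(failed, peers):
    """Transmit the unfinished mission queue to the nearest AUV, which
    merges it with its own missions and re-plans."""
    receiver = nearest_peer(failed, peers)
    receiver.queue.extend(failed.queue)
    failed.queue.clear()
    receiver.replan()
    return receiver
```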
REFERENCES

[1] S. M. Zadeh, D. M. W. Powers, K. Sammut, and A. M. Yazdani, “A novel versatile architecture for autonomous underwater vehicle's motion planning and task assignment,” Soft Comput., vol. 22, no. 5, pp. 1687–1710, Mar. 2018.
[2] D. Zhu, H. Huang, and S. X. Yang, “Dynamic task assignment and path planning of multi-AUV system based on an improved self-organizing map and velocity synthesis method in three-dimensional underwater workspace,” IEEE Trans. Cybern., vol. 43, no. 2, pp. 504–514, Apr. 2013.
[3] W. K. Zhang, G. X. Wang, G. H. Xu, C. Liu, and X. Shen, “Development of control system in abdominal operating ROV,” Chin. J. Ship Res., vol. 12, no. 2, pp. 124–132, 2017.
[4] J. Faigl, P. Vana, and J. Deckerova, “Fast heuristics for the 3-D multi-goal path planning based on the generalized traveling salesman problem with neighborhoods,” IEEE Robot. Autom. Lett., vol. 4, no. 3, pp. 2439–2446, Jul. 2019.
[5] S. MahmoudZadeh, D. M. W. Powers, and A. M. Yazdani, “A novel efficient task-assign route planning method for AUV guidance in a dynamic cluttered environment,” in Proc. IEEE Congr. Evol. Comput. (CEC), Jul. 2016, pp. 678–684.
[6] S. Moon, E. Oh, and D. H. Shim, “An integral framework of task assignment and path planning for multiple unmanned aerial vehicles in dynamic environments,” J. Intell. Robot. Syst., vol. 70, nos. 1–4, pp. 303–313, Apr. 2013.
[7] W. Yao, N. Qing, N. Wan, and Y. Liu, “An iterative strategy for task assignment and path planning of distributed multiple unmanned aerial vehicles,” Aerosp. Sci. Technol., vol. 86, pp. 455–464, Mar. 2019.
[8] J. Zhang, G. Wang, X. Yao, Y. Song, and F. Zhao, “Research on task assignment optimization algorithm based on multi-agent,” in Proc. Chin. Automat. Congr. (CAC), Xi'an, China, Nov. 2018, pp. 2179–2183.
[9] G. Ferri, A. Munafo, A. Tesei, and K. LePage, “A market-based task allocation framework for autonomous underwater surveillance networks,” in Proc. OCEANS, Aberdeen, U.K., Jun. 2017.
[10] A. Alvarez, A. Caiti, and R. Onken, “Evolutionary path planning for mobile robot navigation,” IEEE J. Ocean Eng., vol. 29, no. 2, pp. 418–429, Apr. 2004.
[11] X. Bai, W. Yan, S. S. Ge, and M. Cao, “An integrated multi-population genetic algorithm for multi-vehicle task assignment in a drift field,” Inf. Sci., vol. 453, pp. 227–238, Jul. 2018.
[12] S. Li, X. Xu, and L. Zuo, “Task assignment of multi-robot systems based on improved genetic algorithms,” in Proc. IEEE Int. Conf. Mechatronics Autom. (ICMA), Beijing, China, Aug. 2015, pp. 1430–1435.
[13] Z. Xu, Y. Li, and X. Feng, “Constrained multi-objective task assignment for UUVs using multiple ant colonies system,” in Proc. ISECS Int. Colloq. Comput., Commun., Control, Manage., Guangzhou, China, Aug. 2008, pp. 462–466.
[14] X. Qin et al., “Task allocation of multi-robot based on improved ant colony algorithm,” Space Control Technol. Appl., vol. 44, no. 5, pp. 55–59, Oct. 2018.
[15] G. Li, L. Boukhatem, and J. Wu, “Adaptive quality-of-service-based routing for vehicular ad hoc networks with ant colony optimization,” IEEE Trans. Veh. Technol., vol. 66, no. 4, pp. 3249–3264, Apr. 2017.
[16] X. Cao, D. Zhu, and S. X. Yang, “Multi-AUV target search based on bioinspired neurodynamics model in 3-D underwater environments,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 11, pp. 2364–2374, Nov. 2016.
[17] D. Zhu, X. Cao, B. Sun, and C. Luo, “Biologically inspired self-organizing map applied to task assignment and path planning of an AUV system,” IEEE Trans. Cognit. Develop. Syst., vol. 10, no. 2, pp. 304–313, Jun. 2018.
[18] I. Younas, F. Kamrani, M. Bashir, and J. Schubert, “Efficient genetic algorithms for optimal assignment of tasks to teams of agents,” Neurocomputing, vol. 314, pp. 409–428, Nov. 2018.
[19] S. MahmoudZadeh, D. M. W. Powers, K. Sammut, and A. Yazdani, “Toward efficient task assignment and motion planning for large-scale underwater missions,” Int. J. Adv. Robot. Syst., vol. 13, no. 5, pp. 1–13, 2016.
[20] M. A. Darrah, W. Niland, and B. M. Stolarik, “Multiple UAV dynamic task allocation using mixed integer linear programming in a SEAD mission,” in Proc. Infotech, 2006, p. 7164.
[21] L.-N. Zu, Y.-T. Tian, J.-C. Fu, and J.-F. Liu, “Algorithm of task-allocation based on realizing at the lowest cost in multimobile robot system,” in Proc. Int. Conf. Mach. Learn. Cybern., Shanghai, China, Aug. 2004, pp. 152–156.
[22] E. P. Charles and C. Henrik, “A Bayesian formulation for auction-based task allocation in heterogeneous multi-agent teams,” Proc. SPIE, vol. 8047, no. 23, May 2011, Art. no. 804710.
[23] G. Oh, Y. Kim, J. Ahn, and H.-L. Choi, “Market-based distributed task assignment of multiple unmanned aerial vehicles for cooperative timing mission,” J. Aircr., vol. 54, no. 6, pp. 2298–2310, 2017.
[24] L. Wang and Z. Wang, “Collection path ant colony optimization for multi-agent static task allocation,” J. Inf. Comput. Sci., vol. 9, no. 18, pp. 5689–5696, 2012.
[25] L. Lin, S. Qibo, W. Shangguang, and Y. Fangchun, “Research on PSO based multiple UAVs real-time task assignment,” in Proc. 25th Chin. Control Decis. Conf. (CCDC), Guiyang, China, May 2013, pp. 1530–1536.
[26] G. Oh, Y. Kim, J. Ahn, and H.-L. Choi, “PSO-based optimal task allocation for cooperative timing missions,” IFAC-PapersOnLine, vol. 49, no. 17, pp. 314–319, Aug. 2016.
[27] H. Wang, J. Yuan, H. Lv, and Q. Li, “Task allocation and online path planning for AUV swarm cooperation,” in Proc. OCEANS, Aberdeen, U.K., Jun. 2017, pp. 1–6.
[28] Y. Shi and R. C. Eberhart, “Parameter selection in particle swarm optimization,” in Proc. Int. Conf. Evol. Program., Berlin, Germany, 1998, pp. 592–600.
[29] S. MahmoudZadeh, A. M. Yazdani, K. Sammut, and D. M. W. Powers, “Online path planning for AUV rendezvous in dynamic cluttered undersea environment using evolutionary algorithms,” Appl. Soft Comput., vol. 70, pp. 929–945, Sep. 2018.
[30] W. Song, Y. Zhou, X. Hu, S. Duan, and H. Lai, “Memristive neural network based reinforcement learning with reward shaping for path finding,” in Proc. 5th Int. Conf. Inf., Cybern., Comput. Social Syst. (ICCSS), Hangzhou, China, Aug. 2018, pp. 200–205.
[31] M. Dorigo, M. Birattari, and T. Stutzle, “Ant colony optimization,” IEEE Comput. Intell. Mag., vol. 1, no. 4, pp. 28–39, Nov. 2006.
[32] B. Sun, D. Zhu, and S. X. Yang, “An optimized fuzzy control algorithm for three-dimensional AUV path planning,” Int. J. Fuzzy Syst., vol. 20, no. 2, pp. 597–610, Feb. 2018.
[33] T. Kohonen, “The self-organizing map,” Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.

Jiehong Wu (Member, IEEE) received the Ph.D. degree in computer architecture from Northeastern University in 2008. She was sponsored by the Chinese Government as a Visiting Scholar with Wright State University, Dayton, OH, USA, in 2011. She is currently a Professor and a Prominent Teacher with Shenyang Aerospace University, China. Her main research interests include UAV/AUV/UUV system communication security, autonomous obstacle avoidance and defense, and power consumption optimization.

Chengxin Song received the B.S. degree from Shenyang Aerospace University in 2017, where he is currently pursuing the master's degree. His current research interests include intelligent path planning and energy optimization on AUVs.

Jian Ma received the B.S. degree from Shenyang Aerospace University in 2018, where he is currently pursuing the master's degree. His main research interests include multi-UAV group algorithms and UAV network communication security.

Jinsong Wu (Senior Member, IEEE) received the Ph.D. degree from the Department of Electrical and Computer Engineering, Queen's University, Kingston, ON, Canada. He was the Founder and the Founding Chair of the IEEE Technical Committee on Green Communications and Computing (TCGCC). He was elected as a Vice-Chair, Technical Activities, of the IEEE Environmental Engineering Initiative, a pan-IEEE effort under the IEEE Technical Activities Board (TAB). He is also the Co-Founder and the Founding Vice-Chair of the IEEE Technical Committee on Big Data (TCBD).

Guangjie Han (Senior Member, IEEE) received the Ph.D. degree from Northeastern University, Shenyang, China, in 2004. He is currently a Professor with the Department of Information and Communication System, Hohai University, Changzhou, China, and a Distinguished Professor with the Dalian University of Technology, Dalian, China.