You are on page 1of 17

Applied Soft Computing 110 (2021) 107602

Contents lists available at ScienceDirect

Applied Soft Computing


journal homepage: www.elsevier.com/locate/asoc

3D robotic navigation using a vision-based deep reinforcement


learning model

P. Zieliński , U. Markowska-Kaczmar
Department of Computational Intelligence, Wroclaw University of Science and Technology, 27 Wybrzeże Wyspiańskiego st., 50-370 Wrocław, Poland

graphical abstract

article info a b s t r a c t

Article history: In this paper, we address a problem of vision-based 3D robotic navigation using deep reinforcement
Received 29 July 2020 learning for an Autonomous Underwater Vehicle (AUV). Our research offers conclusions from the
Received in revised form 13 April 2021 experimental study based on one of the RoboSub 2018 competition tasks. However, it can be
Accepted 5 June 2021
generalized to any navigation task consisting of movement from a starting point to the front of the
Available online 15 June 2021
next station. The presented reinforcement learning-based model predicts the robot’s steering settings
MSC: using the data acquired from the robot’s sensors. Its Vision Module may be based on a built-in
65D19 convolutional network or a pre-trained TinyYOLO network so that a comparison of various levels of
97R40 features’ complexity is possible. To enable evaluation of the proposed solution, we prepared a test
environment imitating the real conditions. It provides the ability to steer the agent simulating the
Keywords:
AUV and calculate values of rewards, used for training the model by evaluating its decisions. We study
Deep reinforcement learning
A2C the solution in terms of the reward function form, the model’s hyperparameters and the exploited
PPO camera images processing method, and provide an analysis of the correctness and speed of the model’s
Vision-based navigation functioning. As a result, we obtain a valid model able to steer the robot from the starting point to the
YOLO destination based on visual cues and inputs from other sensors.
CNN © 2021 Elsevier B.V. All rights reserved.

1. Introduction in an underwater environment independently, constitute with


their remotely operated counterparts (ROVs) a group, universally
Autonomous Underwater Vehicles (AUVs), i.e. unmanned ve- referred to as Unmanned Underwater Vehicles (UUVs) and are
hicles and robots, able to navigate and perform various tasks widely used both in industry and research applications, espe-
cially in particularly difficult or dangerous tasks. To operate fully
∗ Corresponding author. autonomously, the robot has to move between each key point,
E-mail address: p.zielinski@pwr.edu.pl (P. Zieliński). performing various tasks [1]. Its controller should plan the robot’s

https://doi.org/10.1016/j.asoc.2021.107602
1568-4946/© 2021 Elsevier B.V. All rights reserved.
P. Zieliński and U. Markowska-Kaczmar Applied Soft Computing 110 (2021) 107602

movement efficiently, avoiding existing obstacles and optimizing control, [29] for obstacle avoidance, [30] for navigating a robot to
the path to minimize its length and required time. Therefore, the a target given by an image and [15] for 2D target tracking. These
problem of underwater robot navigation is a crucial part of an papers focus only on two-dimensional problems; the problem
autonomous controller, on which the ability to perform further of obstacle avoidance in a 3D environment was recently tackled
tasks depends most. by [31].
Underwater robots are often designed and applied for school To sum up, there are several brand new survey papers. [32]
and academic competitions, where they have to perform tasks discusses in detail state of the art deep reinforcement learning,
similar to those of professional vehicles. However, their environ- imitation learning, and transfer learning in robot control. [33] de-
ment of operation is often limited to smaller areas, such as lakes, scribes the fundamental background behind sim-to-real transfer
basins, or special facilities. One of these competitions, RoboSub, in deep reinforcement learning and reviews the main methods
is organized yearly in San Diego, on a TRANSDEC testing facility; being utilized. [34] overviews commonly perceived challenges in
AUVs prepared for this competition perform a series of tasks, deep RL and how they can be addressed.
imitating current research conducted in the local US Navy unit. In this research, we concentrate on an analysis of a vision-
The main aim of this research is to explore the feasibility of based approach to the problem of 3D robotic navigation, where
applying deep reinforcement learning for training a vision-based we compare the proposed deep reinforcement learning-based
navigation controller of an autonomous underwater vehicle. The model’s effectiveness depending on the visual feature embed-
research was inspired by the RoboSub competition, attended by dings used by the controller. Furthermore, we present a study
one of the authors, where the ability to navigate between each of several reward functions and their influence on the process
task is essential. Navigation can be defined as the ability to of training and achieved results. The solution presented in the
determine the robot’s position and to plan a suitable and safe paper has been investigated in terms of the efficiency of the
path to the goal location [2]. A crucial aspect of robotic navigation training process and the correctness of functioning, with varying
is the ability to sense objects around the robot and to know its environmental conditions and for various hyperparameters’ set-
relative pose. The ability to navigate an agent in a 3D environ- tings. The research is based on a specifically designed simulation
ment from a starting point towards the target object without environment, in which training and evaluation of models were
specifying the expected path, using only visual observations and conducted.
sensors readings, seems attractive not only for AUVs, but also for We summarize our contributions as follows:
similar tasks performed by unmanned aerial vehicles (UAVs), or
intelligent agents in 3D games. • We present a novel, vision-based approach for 3D naviga-
Depending on the nature of the environment, various strate- tion, which utilizes data from the robot’s sensors and an
gies might be used for path planning. With an accurate model object detection model for target object location. The pro-
of the world and tractable action space, classical optimization posed solution is trained as a deep reinforcement learning
methods perform robustly, for example, Dijkstra algorithm [3], A* model and can steer the robot to navigate from the starting
algorithm [4], Artificial potential field method [5]. In case of more point to the target object.
complex environments or large action spaces, the performance of • We analyze the reward function used for training the model.
these methods degrades rapidly. Therefore, some studies present We propose four relevant components and prove that each
evolutionary-based approaches to solve this problem, such as one is necessary to achieve the correct behavior of the
multiobjective evolutionary algorithm [6], a combination of an controller.
evolutionary algorithm and game theory [7] or hybrid Particle • We test various image processing methods and show how
Swarm (PSO) and Modified Bat algorithms [8]. the level of visual features’ complexity processed by the
However, evolutionary algorithms are often difficult to tune, model influences its performance.
which is why currently reinforcement learning (RL) is used in • We evaluate the solution regarding its success rate and
many fields of robotics. Thanks to deep neural networks’ capacity achieved reward value, as well as the model’s prediction
for processing high-dimensional data, deep reinforcement learn- speed.
ing (DRL) models can use raw sensor data without manual feature • We provide a simulation environment for similar experi-
engineering. Among many of the methods, the most frequently ments.
tested one in the context of AUV navigation is Deep Deterministic
Policy Gradients (DDPG [9]): [10–14]. You et al. [15] leverage The rest of the paper is organized as follows. Section 2 states
DDPG for 2D target tracking in an unmanned combat air vehicle. the problem of AUV navigation. The proposed solution is shown
Authors in [16,17] and [18] applied Deep Double-Q Network in Section 3, while Section 4 presents the experiments con-
(D3QN [19]) for path planning, and in [17] hierarchical Deep Q- ducted during the research and discusses their outcomes. Finally,
Network (h-DQN [20]) was used for the same task. In [21], target Section 5 shows our conclusions.
search tasks were solved using Asynchronous Advantage Actor–
Critic (A3C [22]) model. Sporyshev et al. [23] solve a problem 2. Problem description and formulation
of search and surveillance mission by multiple cooperative AUVs
using Proximal Policy Optimization (PPO [24]). This DRL method In the research we consider one of the tasks from the RoboSub
is also used in [25] for mapping from the images produced by the 2018 competition, which consists of navigating from the start-
cameras to the velocity of AUVs and in [26] for a motion planning ing area (dock) towards the gate, indicating the entrance to the
system, which avoids obstacles and can reach multiple target competition task sequence. Fig. 1 illustrates point configuration
points. In [27], authors present how to train an agent to tackle for this task; point P0 refers to the starting point, whereas P∗ is
a hybrid objective of 3D path-following and collision avoidance the target position. It is assumed that initially, the target object
for an AUV. is not in the robot’s field of vision and is located several meters
The aforementioned DRL methods utilize inputs from ad- away from the starting point. The navigation problem is defined
vanced sensors, such as sonars, and did not rely on visual cues. as the movement from the dock towards the gate, stopping at a
Lately, a combination of DRL and deep learning vision methods point, where the gate fits the sight of the camera and the robot is
become more and more popular in the robot navigation domain. situated centered, in front of it. This scenario provides necessary
For example, [28] used such an approach for robotic arm motion robustness, since a situation where the robot ends its navigation
2
P. Zieliński and U. Markowska-Kaczmar Applied Soft Computing 110 (2021) 107602

Fig. 2. Settings vector components: longitudinal velocity vz , lateral velocity vx ,


vertical velocity vy and yaw velocity ωy .

Fig. 1. Plan view of an example configuration of a starting point P0 and a [ ]


destination point P∗ for the examined task. rad
Xω = ωx , ωy , ωz : ωx , ωy , ωz ∈ R
{ }
2
(5)
s

in front of the next task’s station is common for other competition Xϕ = ϕx , ϕy , ϕz : ϕx , ϕy , ϕz ∈ R [◦ ]


{ }
(6)
tasks from RoboSub.
The goal of the robot’s controller is to choose the velocity set- Finally, using a depth sensor of the robot, we utilize the
tings so that the robot, starting from point P0 , reaches destination current depth value Xd , measured in meters (Eq. (7)).
point P∗ in the shortest possible time. Point P0 is defined as the
Xd = d : d ∈ R, d ≥ 0 [m] (7)
place where the robot ended the previous task, or from where it
starts the challenge, whereas point P∗ indicates the target, which The settings vector Y estimated by the model consists of four
is the place of executing the next task; in the case of the entrance real values in the range [−1, 1] (according to Eq. (8)). All these
gate, it is any point which allows the robot to go under the gate by values are illustrated in Fig. 2:
moving forward. Both points are given as a four-element vector
(Eq. (1)), relative to the center of the competition facility. • longitudinal velocity setting vz (vz = 1 stands for maximal
forward velocity),
P = x, y, z , ϕy : x, y, z , ϕy ∈ R • lateral velocity setting vx (vx = 1 stands for maximal velocity
{ }
(1)
to the right),
where:
• vertical velocity setting vy (vy = 1 stands for maximal
x, y, z – position on axes X , Y , Z , emergence speed),
ϕy – heading angle (angle about vertical axis – we ignore the • yaw velocity setting ωy (ωy = 1 stands for maximum clock-
remaining angles since the AUV was designed not to be able wise angular velocity).
to rotate around them).
Y = vz , vx , vy , ωy : vz , vx , vy , ωy ∈ R
{ }
(8)
To solve this navigation problem, the robot’s controller has to
In the original AUV, these settings are treated as target values
find the best move to the target using the information delivered
of velocities in the respective directions; they were used by the
from sensors. Formally, we can say that the main task of the
robot’s internal controller to calculate the control vector of the
controller model is to estimate the settings vector Y based on the
thrusters. In this research we assume, that this internal controller
input data X received from sensors, as it is expressed by Eq. (2).
is able to set the actuation of the thrusters using the given
model velocity setpoint; this design demands from the model to learn
X−
−−→ Y (2)
to handle the inaccuracies of the internal controller, as well as
The AUV was equipped with three high-definition color cam- the delay needed to reach the set velocity.
eras. Each one generates an image tensor Ximg of shape 1280 ×
720 × 3, containing integers in the range [0, 255] (Eq. (3)). 3. The method
Ximg = x(i,j,k) ∈ [0, 255] = Z 1280×720×3
{ }
(3)
The high-level structure of the controller is presented in Fig. 3.
Our model utilizes one of the cameras, mounted on the front of It is based on a deep reinforcement learning model. It receives
the robot and facing forward. an input vector X that contains information from all sensors and
Apart from the cameras, one of the crucial sensors of the returns a settings vector Y.
AUV is an AHRS (Attitude and Heading Reference System), an in- The image from the front camera is processed by the Vision
tegrated sensor composed of a gyroscope, an accelerometer and Module first. This module is responsible for feature extraction
a compass, which allows accurate orientation and both linear from an image and dimensionality reduction. Its output is con-
and angular acceleration measurement. The controller analyses catenated with data from other sensors (i.e. linear acceleration,
following three-element vectors (with real values per each axis): angular velocity, current rotation and depth) in the next module
linear acceleration vector Xa (Eq. (4)), angular velocity vector Xω that also performs further processing. A recurrent network is
(Eq. (5)) and rotation vector Xϕ (Eq. (6)) with values in the range applied as the final module of the controller (Time-Series Analysis
(−180, 180]. Module) to discover the temporal relationship between streams
[m] of data (images and other sensors inputs). The output of this
Xa = ax , ay , az : ax , ay , az ∈ R
{ }
(4) module is the settings vector Y.
s2
3
P. Zieliński and U. Markowska-Kaczmar Applied Soft Computing 110 (2021) 107602

Fig. 3. High-level controller structure, showing its main modules; X and Y are the controller’s input and output (Eq. (2)), Ximg is the image tensor (Eq. (3)), Xa
refers to linear acceleration vector (Eq. (4)), Xω is the angular velocity vector (Eq. (5)), Xϕ is the orientation vector (Eq. (6)) and Xd refers to the current depth value
(Eq. (7)).

One of the crucial characteristics of the problem is its high


complexity. The controller has to process multidimensional in-
put data: a stream of images and numerical values. The most
straightforward approach is to train the entire model at once
using raw data from the environment, expecting it to discover the
existing relationships between particular input values and correct
settings, leading to the target point. On the other hand, parts of
the model could be extracted and trained separately, which could
simplify the training task. In our solution we investigate both of
these approaches.
The controller’s model was trained in the actor–critic strat- Fig. 4. The Vision Module elements; Method block refers to a particular vision
egy [35], based on the A2C (Advantage Actor–Critic) framework processing method chosen in the configuration (this defines which of the tensors
are passed over); Fully connected layers are optional, trainable layers, which
[22]. This approach utilizes two models; the actor (policy gradient
process the output of preceding blocks; the module receives a vector Xv is , which
method) is used to control the agent behavior, while the critic is the image tensor Ximg scaled to shape 416 × 416 × 3, and outputs a vector
(action-value function method) evaluates the actor’s decision. Yv is .
Both models are trained in parallel: the actor policy πθ (s, a) is
trained to maximize the value of rewards received for actions
taken by the agent and the critic function Qω∗ (s, a) is optimized applied. It is an object detection model architecture, which fea-
to minimize reward function approximation error. Here, s refers tures high processing speed and satisfactory accuracy. It performs
to the state vector (in our case, equivalent to the input vector X image analysis in a single-shot manner. As an output, it estimates
containing the information from the environment: visual obser- the coordinates of the target object’s bounding box on the image.
vations Ximg and sensors readings Xa , Xω , Xϕ , and Xd ), and a is We used TinyYOLO — a small version of the architecture (Fig. 5).
the action vector (which refers to the output settings vector Y). In our vision module, YOLO delivers three different levels of
With the A2C, the model is trained using multiple agents, which output information: basic features, raw features, or a bounding
interact with individual environments, using their copies of the box; it is trained in advance, separately from the controller’s
global network. Agents train their individual network’s weights model, and reused as a feature extractor. Further data processing
independently, and at the end of each set of episode, the global is available using additional, trainable fully-connected layers. The
network is updated synchronously. Vision Module returns a vector Yv is . Below, both methods are
The original implementation of the A2C uses a vanilla policy described in detail.
gradient method for controlling the agent. In our research we de-
CNN. As mentioned before, the CNN used in our solution has an
cided to replace it with PPO (Proximal Policy Optimization) [24].
architecture based on [36]. It consists of the following layers, each
A classic gradient ascent method is here enhanced using trust
followed by a ReLU activation function:
regions, which limit the maximal update step size. The size of
the trust region is set using the clipped surrogate objective func- • convolutional layer (32 filters 8 × 8 with stride 4 × 4),
tion. This approach efficiently restricts policy weights changes, • convolutional layer (64 filters 4 × 4 with stride 2 × 2),
improving the stability of the training process. • convolutional layer (64 filters 3 × 3 with stride 1 × 1),
Next, the solution’s modules are described in detail. • flattening layer,
• fully-connected layer (512 neurons).
3.1. Vision module
The output of the CNN is a tensor of shape 48 × 48 × 64. It is then
The Vision Module is presented in detail in Fig. 4. As an input, flattened and can be processed using fully connected layers. In a
it receives the data from the camera mounted on the front of the configuration which uses the CNN, the whole controller model is
trained end-to-end.
robot. First the original input Ximg of shape 1280 × 720 × 3 is
rescaled to shape 416 × 416 × 3 of integers in the range [0, 255]. YOLO. Its architecture is presented in Fig. 5. The model consists
Because the AUV’s camera resolution is constant, we decided not of several convolutional layers; each of the first six layers is fol-
to crop the image additionally, as the deformation resulting from lowed by a max-pooling layer. In each convolutional layer Batch
resizing is invariable. Normalization is applied. The network uses the ReLU activation
As it is shown in Fig. 4, the Vision Module offers two methods function. The final output of the network is the bounding box of
of image processing. They can be applied in various configura- the detected object. To provide a deeper analysis of the model’s
tions (using both of them or separately); the element assigned predictions, we added the ability to process its hidden feature
as Method is responsible for the appropriate choice of the op- maps (in Fig. 5: basic features and raw features, referring to the
tion. The first module is a convolutional network (CNN), based level of features’ complexity). Due to their shape, these feature
on an architecture from [36], which is trained end-to-end with maps need to be flattened and can be further analyzed using
the whole model. In the second one, YOLO [37] architecture is trainable layers of the controller’s model. In theory, this can yield
4
P. Zieliński and U. Markowska-Kaczmar Applied Soft Computing 110 (2021) 107602

Fig. 5. Object detection model architecture: Conv refers to a convolutional layer; Max-pool is a max-pooling layer; batch-norm indicates batch normalization; ReLU
and Sigmoid are activation function types; input vector Xv is is the image tensor Ximg scaled to shape 416 × 416 × 3.

a better understanding of the image content, as compared to plain λcoord – position component relevance coefficient,
bounding box parameters. S – number of rows and columns in YOLO grid,
obj obj
The YOLO network is an object detection model, i.e. it esti- 1i – step function (1i = 1 if cell i contains the
mates the position and the size of each object on the image. It is obj
detected object; 1i = 0 if it does not),
based on a grid of size 13 × 13: the input image is divided into
(x∗i , y∗i ) – target bounding box position,
169 cells and for each of them the network predicts the location
(xi , yi ) – predicted bounding box coordinates,
of one object (its bounding box’s position and size). Because of
the grid shape, the input image size should be set to 416 × 416 • Lsize – bounding box size error (comparing predicted and true
(for each cell to be of shape 32 × 32 pixels). width and height of the bounding box) – Eq. (14):
Each of the convolutional layers perform feature extraction, fi-
nally returning an intermediate tensor Ibbox of shape 13 × 13 × 5, S 2 [( ]
i.e. a prediction of a bounding box for each cell in the grid, in a
∑ obj
√ ∗ √ )2 (√ ∗ √ )2
Lsize = λcoord 1i wi − wi + hi − hi
form of five-element vector [C , x, y, w, h], where C is the detec-
i=0
tion confidence (Eq. (10)), x and y are coordinates of the bounding
box center and w and h are its width and height (all scaled to (14)
range [0, 1]). Finally, the model’s output Ybbox is estimated by
(wi∗ , h∗i ) – real width and height of the bounding box,
choosing the cell with the highest confidence value, according to
Eq. (9). (wi , hi ) – predicted size of the detection,
[ ]
Ybbox = arg max(Ibbox ) = Ĉ , x̂, ŷ, ŵ, ĥ (9) • Lobj , Lnoobj – positive and negative detection error, based on
C comparing the confidence of the detection Ci with the
ground truth Ci∗ (Ci∗ = 0 for negative detection; Ci∗ = 1
C = p (obj) × IoU true
pred (10) for positive) – Eqs. (15) and (16):
S2
∑ obj
)2
Ci∗ − Ci
(
|pred ∩ true| Lobj = 1i (15)
IoU true
pred = (11) i=0
|pred ∪ true|
where:
S 2

p (obj) – probability, that the bounding box contains the


∑ noobj
)2
Lnoobj = λnoobj Ci∗ − Ci
(
1i (16)
object,
i=0
IoU true
pred – Intersection over Union between ground truth and
estimated bounding box. λnoobj – negative detection loss relevance factor,
noobj noobj
Our YOLO model applies a modified loss function. It was 1i – step function (1i = 1 if cell i does not
noobj
designed to detect a single type of objects, therefore classification contain the detected object; 1i = 0 if it does).
error component has been removed. The assumed loss function is
Considering the task, which should be fulfilled by the con-
described in Eq. (12):
troller, the YOLO model has to recognize a single object in the
L = Lpos + Lsize + Lobj + Lnoobj (12) field of view (the gate), therefore we always use only the detec-
where: tion, which has the highest confidence C . It is worth noticing that
in case of no object in the field of view (which will happen often
• Lpos – detected object position error (based on the difference during the AUV operation, especially in the initial part of the run),
between the target and predicted bounding box position) –
the YOLO model will return a bounding box prediction with a low
Eq. (13):
confidence value (close to 0). We assume that in such a case, the
S2
∑ [( )2 )2 ] controller model will be able to learn to take appropriate actions
obj
Lpos = λcoord x∗i − xi + y∗i − yi
(
1i (13) to discover the target object location (for example, rotate to look
i=0 for the target object in other parts of the environment).
5
P. Zieliński and U. Markowska-Kaczmar Applied Soft Computing 110 (2021) 107602

3.2. Data processing module • rotation reward Rr , assessing agent’s angular position around
yaw axis in relation to the normal of target object front
The data collected from the robot’s sensors (AHRS and depth plane (illustrated in Fig. 6b and formally described by Eq.
sensor) are concatenated with the Vision Module’s output, cre- (18)); rotation component’s aim is to align the agent with
ating the input for the Data Processing Module. To predict the the target object, so that it can approach it by going forward;
correct values of velocity settings, the model needs to learn the
relation between the sensors’ readings, the Vision Module output, Rr = cos γtarget − γagent
( )
(18)
and the correct settings of the AUV’s velocity. Therefore, we add
a sequence of trainable, fully-connected layers, which process the where:
concatenated input of the Module, trained jointly with the entire γ – normal angular position around yaw axis;
DRL model. If we increase the number of layers and neurons in
this sequence, the model should be able to learn more complex • velocity reward Rv , evaluating agent’s movement direction
relations between the input parameters, but on the other hand, and speed, described by Eq. (19); it should make the model
it can make the training process more difficult. learn to move towards the gate as fast as possible;
⏐ ⏐
⏐vagent ⏐
Rv = cos ∢ vagent , Ttarget
[ ( )]
3.3. Time-series analysis module ∗ (19)
vmax
Due to a strong and long-term time dependency occurring in where:
the considered problem, actions taken by the controller should
∢ vagent , Ttarget – angle between agent velocity vector
( )
be based not only on the current but also previous observations
and a vector connecting agent’s center of mass and
of the environment. We decided to use a recurrent network —
target ⏐point (see Fig. 6c),
an LSTM layer, which is trained together with the controller’s ⏐
⏐vagent ⏐ – length of the velocity vector, normalized to
model. It performs analysis of the output of the previous module, maximal agent’s speed vmax ;
conditioning output on current observations and hidden state (of
length 256), passed to consecutive LSTM cells in subsequent time • angular velocity reward Rav , determining the correctness of
steps. The size of the hidden state in the Time-Series Analysis current agent’s velocity in relation to its rotation, given in
Module was based on our preliminary studies, which are not Eq. (20); it rewards angular movement towards the target
described here. angular position around yaw axis (i.e. parallel to target
The output of the module is the settings vector Y, used to object’s normal), discouraging the model from turning away
control the robot’s movement. from the target: the difference of tanh allows comparing
the current angle and angular velocity so that the rotation
3.4. Reward function towards the target angle returns values close to 0, and dif-
ferent in other cases; we then apply cos on the difference to
One of the key aspects of reinforcement learning is the reward reward zeroing (i.e. adjusting the angular velocity to the tar-
function used for training models [35]. It should accurately eval- get angle); the normalization of the difference with tanh(1)
is added to prevent the reward function from returning −1
uate agent actions, promoting those leading to the success and
for all misaligned angles and angular velocities;
discouraging the model from making incorrect decisions. Ideally,
tanh (∆γ ) − tanh ωY
[ ( )]
the optimal reward function should enable the model to learn the
Rav = cos π ∗ (20)
correct behavior in the environment. In our research this would tanh 1
mean reaching the target point using an optimal path, without
where:
unnecessary deviations, preserving natural orientation and ap-
propriate speed. Therefore, we propose four rewards and examine ∆γ = γtarget − γagent – difference between agent and
how they influence the model’s training and performance: target normals’ angular positions around the yaw axis
(same as in Eq. (18)), normalized to the range [−1, 1],
• position reward Rp , based on the distance of the agent from
ωY – normalized yaw angular velocity.
the target point (e.g. the center of the gate), calculated
individually for each dimension according to Eq. (17) (see To prevent such behavior when the agent concentrates on
Fig. 6a); this component should help the model learn to collecting small rewards for a long time, we introduced two re-
minimize the distance to the target point (position reward strictions. The first one limits the length of training episodes and
increases as the agent approaches the target): as the agent resets the environment after exceeding the maximum number of
approaches the target position, the function will rise to 1, steps Lepisode . The second one relies on discounting positive re-
reaching kmin,d at a distance equal to lmin,d - this allows wards using function Dexp , described in Eq. (21), which decreases
explicit shaping of this reward function component in each exponentially from the first step, reaching value 0.1 at the half of
direction; the episode length.
( )
ln 0.1
( )
⏐ ⏐ ln kmin,d
Rp,d = exp ⏐dtarget − dagent ⏐ ∗ (17) Dexp = exp step ∗ (21)
lmin,d 1
L
2 episode
where: The agent receives several distinct rewards and penalties.
d – the axis along which the distance is measured (X , When reaching the target, it is awarded a high reward, which
Y is formally described by Eq. (22). The fact of successful com-
⏐ or Z ),
pletion of the task is determined using a box surrounding the

⏐dtarget − dagent ⏐ – absolute distance between agent and
target position, target object with some margin; agent hitting the box indicates
kmin,d ∈ (0, 1) – function value achieved at distance successful finish and, therefore, causes resetting the environment
lmin,d ; during research, kmin,d and lmin,d were set empir- and starting a new training episode.
ically; Rsuccess = 50 ∗ Dlin ∗ Rr ∗ wr + Rp,av g ∗ wp
( )
(22)
6
P. Zieliński and U. Markowska-Kaczmar Applied Soft Computing 110 (2021) 107602

Fig. 6. Parameters used to calculate each reward component; (a) position reward (evaluates the distance from the target position along each axis: ∆x, ∆y and ∆z)
(Eq. (17)); (b) rotation reward (calculated using the difference between the yaw angular orientations of the agent’s normal (Nagent ) and the target’s normal (Ntarget ):
γagent and γtarget ) (Eq. (18)); (c) velocity reward (assesses agent’s linear velocity vagent (its direction and magnitude), comparing it to a vector connecting the agent
and the target Ttarget ) (Eq. (19)); (d) angular velocity reward (evaluates the agent’s angular velocity around the yaw axis (ωY ) to encourage aligning with the target’s
normal) (Eq. (20)).

its hyperparameters on the process of training, the efficiency of


1( ) operating and the achieved results. In this section, we describe
Rp,av g = Rp,x + Rp,y + Rp,z (23) the platform used for the research, experimental procedure and
3
experiments conducted to evaluate the proposed solution. The
code for training and evaluating the object detection and con-
kmin,lin − 1
Dlin = ∗ step + 1 (24) troller models is available on demand from the corresponding
lmin,lin author.
where:
4.1. Simulation environment
wr , wp – weights for rotation and position components
(inferred experimentally in preliminary studies using grid
To review the applicability of the proposed solution it was
search method),
necessary to prepare a simulation environment, resembling real
Rp,av g – average position reward for each axis,
conditions on the test facility accurately. The application needs
Dlin – linear discount factor, taking the value kmin,lin at step =
to provide tools for controlling agents inserted into the environ-
lmin,lin (promotes reaching the target in a low number of
ment, as well as simulating the physics of their movement and
steps).
the readouts of sensors and cameras.
The agent is punished when hitting an obstacle (Rcol_penalty = It is necessary to mention some details on the competition
−1) or when making a fatal mistake, such as untimely emergence course, which should be simulated in the environment. The fa-
or missing the target (Rfatal = −10). A fatal mistake causes im- cility is divided into four quarters, each containing the same set
mediate resetting of the environment and starting a new training of objects. The region in which a team performs their run is
episode. We did not consider collisions to be a fatal mistake, since drawn randomly, and the robot’s starting orientation is drawn
during our research we noticed multiple runs, in which the agent with a coin flip (see Fig. 7). To address the possibility of varying
would collide with the target object. Treating such an event as a conditions on the facility, the simulation has to provide necessary
fatal error increased the instability of training and did not allow randomization of environment parameters (e.g. water hue and
the model to learn the correct behavior. opacity, sun position and brightness etc.). It would ensure our
model preserves robustness to applying in different conditions.
4. Results and discussion Since we assume that the robot is equipped with an internal
controller, which controls the speed of the robot’s thrusters ac-
The controller model has been tested to verify its capability of cording to the velocity setpoint provided by the DRL model, we
solving the aforementioned problem and analyze the influence of used a simplified dynamic model of the AUV. In the simulation
7
P. Zieliński and U. Markowska-Kaczmar Applied Soft Computing 110 (2021) 107602

Algorithm 1 Dataset generation


Input: env ironment, number of collected images, target object, agent
object
Output: dataset of annotated images
1: for k = 1, 2 to n do
2: Reset(env ironment);
3: xr , yr , zr ← PutAgent(agent , target);
4: AddNoiseObjects();
5: αr , βr , γr ← AdjustRotation(agent, target);
6: bbox ← CalculateBoundingBox(agent, target);
7: image ← CollectImage();
8: dataset ← Add(image, bbox);
9: end for
10: return dataset
Fig. 7. Test facility layout; each region is marked with letters A-D; robot’s
starting orientations are indicated with green and blue arrows.
Table 1
Object detection model hyperparameters.
Hyperparameter Value
Mini-batch size 30
Learning rate 1 × 10−4
Model saving frequency 10 000 steps
Position component relevance coefficient (λcoord ) 5
Negative detection loss relevance factor (λnoobj ) 0.5
Number of rows and columns in YOLO grid (S) 13

Fig. 8. Example images of the conditions in the environment: the real image
(on the left) and an example from the simulation (on the right).

we consider its inertia and drag; their parameters were chosen


empirically to roughly resemble the real robot’s behavior.
During the research we have used Unity software, basing
the environment on RoboSub facility model made available by
Coleman University team. One of its crucial features is Unity ML-
Agents toolkit [38], providing tools for inserting agents, that can
be controlled using custom DRL models. The environment might
be used as a versatile base for other AUV-related projects. It is one
of the significant contributions of this research; we provide access
to the simulation environment on GitHub.1 Example images of Fig. 9. Plots of loss function and IoU for object detection model training.
the conditions in the real and simulated environment are shown
in Fig. 8.
(x, y, w, h) (center coordinates x and y, width w and height h),
4.2. Object detection model
enclosing the gate on the image.
The image generation method is presented in Algorithm 1.
Our detection model, designed with TensorFlow [39], was
Using this approach, we collected a dataset of 2000 positive and
trained to detect a gate, which is used as the target object
2000 negative training examples. Half of the images have been
in the first competition task. Our research does not focus on
noised with other objects. Similarly, we generated a separate set
YOLO hyperparameters’ analysis; we use default configuration
of 400 images as our test dataset.
and present observations of the training process. To evaluate the
model we used Intersection over Union (Eq. (11)) between the Training and evaluation. To train the object detection model we
object’s ground truth bounding box and the model’s prediction, used an online image augmentation method, applying various
and loss function value [37] (Eq. (12)). randomized operations on an image (horizontal flipping, random
crop, gaussian blur and noise, contrast normalization and random
Dataset to train YOLO. We used the simulation environment to
affine transformation). For drawing mini-batches (each of size
generate a set of images presenting the gate. The image gener-
30), we used a subset of 1000 images (a buffer, drawn randomly
ation function resets the environment and then, after drawing
every time the model is saved). The use of a buffer was required
the facility’s quarter and objects’ positions, it adjusts the agent’s
to enable online image augmentation in the limited memory of
orientation, so that the gate is visible on the camera image,
the test platform. Then, for each of the following 10 000 training
adding small, random position deviations. To make the dataset
steps, a 30-image mini-batch was drawn from the buffer for
noisy, the function can fill the object’s surroundings with various
training the model. After this amount of steps, the buffer was re-
objects used in other competition tasks. Furthermore, the condi-
drawn and the cycle was repeated. The model has been trained
tions (such as water hue and opacity, sun position and brightness,
using default YOLO hyperparameters (Table 1).
etc.) were modified on-line during the dataset generation.
Our training strategy was based on learning steps, followed by
Our simulation enables automatic annotation calculation. By
evaluation, model weights storing and redrawing of the training
comparing relative position of the agent and the target ob-
subset. Fig. 9 presents the learning curve. According to the re-
ject, it generates the coordinates of a bounding box as a vector
sults, the model achieves the best level of performance between
6th and 7th model save (that is where the IoU metric reaches
1 https://github.com/piotlinski/TransdecEnvironment. its maximum value on both train and test dataset). Empirical
8
P. Zieliński and U. Markowska-Kaczmar Applied Soft Computing 110 (2021) 107602

Table 2
Test platform configuration.
CPU AMD Ryzen 7 2700
RAM 32 GB
GPU Nvidia RTX 2070
OS Ubuntu Linux 18.04

Table 3
Default parameters used in experiments.
Hyperpa- Value Description
rameter
Ttraining 24 [h] Model training period (hours)
Fig. 10. Histogram of IoU for the test dataset using the selected object detection
nenv s 12 Number of agents
model. It is visible, that the selected model returns accurate predictions for the
ntrain_steps 50 000 Number of steps between each
training dataset, with only a few examples having a low (<0.3) IoU metric value.
evaluation and model saving
nev al_steps 200 Number of evaluation steps
idec 10 Interval between each model’s
comparison of saved models’ performance confirms, that the 70k- decision
Lepisode 4000 Maximum number of steps per
step model yields best results on a test video generated with our episode
simulation, which is why this model has been used for further |Ct | 256 LSTM cell state vector length
research. Fig. 10 shows a histogram of intersection over union val- Lminibatch 30 Number of steps in one minibatch
ues for the test dataset using the selected object detection model. nminibatches 4 Number of minibatches used for
model training
Additionally, we present some examples of model predictions lr 0.00025 Constant learning rate value
compared to the ground truth in Fig. 11. εPPO 0.2 PPO clipping threshold

kmin,x 0.3
4.3. Navigation controller
kmin,y 0.3
Expected position reward value
kmin,z 0.3
kmin achieved at distance lmin
To study the proposed solution we used an implementation lmin,x 0.8
for each dimension (Eq. (17))
of the deep reinforcement learning model from Stable Baselines lmin,y 0.5
lmin,z 2.0
library [40], which is a fork of OpenAI’s Baselines [41]. Training
and evaluation were conducted using the simulation prepared for wr 0.2 Rotation component weight in
the project. We measure the performance using 4 metrics: success reward (Eq. (22))
wp 0.1 Position component weight in
• mean reward gathered by the agent R̄, averaged for each success reward (Eq. (22))
agent and environment instance, as in Eq. (25): kmin,lin 0.5 Linear discount factor
lmin,lin Lepisode parameters (Eq. (24))
Ne T
1 ∑∑
R̄ = ri (t) (25)
Ne
i t =0
and missing the target qM ), together with the fraction of all
where: of runs, during which a collision occurred (qC ):
Ne – total number of agents, |S | |E | |M | |C |
T – learning episode length, qS = ; qE = ; qM = ; qC = (27)
|T | |T | |T | |T |
ri (t) – value of a reward received by the agent in ith
environment, in tth time step; where:

• average episode length L̄, i.e. number of steps taken by the |S | – number of episodes which ended successfully,
agent before reaching the target or resetting the environ- |E | – number of episodes failed by emerging,
ment; good results are indicated by large reward achieved |M | – number of episodes where the agent missed the
in small number of steps; small reward received in short target,
episodes can be caused by incorrect agent behavior, such as |T | – total number of episodes conducted in the eval-
untimely emergence or missing the target, both of which are uation; |T | = |S | + |E | + |M |,
considered fatal mistakes in the competition; on the other |C | – number of episodes during which a collision
hand, a large number of steps taken in one episode may sug- occurred.
gest, that the agent is wandering around the environment; We applied a unified training procedure to compare various
• speed of predictions it¯ s , calculated as an average number of model designs directly. Using actor–critic architecture and A2C
steps per second, as shown in Eq. (26): method, we conducted a parallel training of 12 agents in a con-
Nsteps stant period of 24 h on our test platform described by Table 2. Our
it¯ s = (26) training strategy assumes a cyclic model evaluation every 50 000
Tev al
training steps. Testing was conducted for each agent by perform-
where: ing 200 steps according to model prediction in each of the 12
Nsteps – number of steps taken by the agent during environments and registering average reward and episode length.
evaluation, The evaluation sequence, used for assessing the best model from
each training procedure (i.e. the one achieving the highest av-
Tev al – total evaluation time in seconds;
erage reward per test episode during training) was repeated 5
• episode statistics (Eq. (27)), i.e. total fraction of successful times, with forced reset of each simulation instance. All metrics
approaches (ended by reaching the target) qS and failed were averaged over each evaluation episode (or summed, in case
episodes (distinguishing between untimely emergence qE of episode statistics).
9
P. Zieliński and U. Markowska-Kaczmar Applied Soft Computing 110 (2021) 107602

Fig. 11. Example images from the dataset used to train the YOLO model, with drawn ground-truth boxes (red) and predictions (green). The two images on the left
contain added ‘noise’ objects, whereas the images on the right were not noised.

Table 4 Table 5
Hyperparameters examined in the research. The controller model configuration used for reward function evaluation.
Hyperparame- Values Description Hyperparameter Values
ter Visual feature embeddings bounding box
Visual feature {bounding box, It defines the visual features Vision FC layers None
embeddings raw features, embedding method in the Vision Data Processing FC layers [64, 64, 64, 64]
basic features, Module (Fig. 4)
built-in conv}

Vision FC [l1 , l2 , l3 , . . .] They assign the number of


layers neurons in fully connected layers
(32)
in the Vision Module; None The complex form of reward function is a weighted sum of
refers to no additional hidden
layers in the Vision Module
elementary reward functions. The weights have been chosen em-
Data [l1 , l2 , l3 , . . .] They define the number of pirically, based on the agent’s behavior preliminary tests, which
Processing FC neurons in subsequent FC layers are not described here. Their values are as follows:
layers in the Data Processing Module
(Fig. 4) • wv = 0.8 (velocity reward weight),
• wav = 0.2 (angular velocity reward weight),
• wp = 0.1 (position reward weight),
• wr = 0.2 (rotation reward weight),
The experiments are focused on studying the influence of
• b = 0.2 (velocity reward bias).
visual feature embeddings choice and reward function form on
the model’s performance, speed and the process of training. Each The reward function form from Eq. (32) was proposed based
experiment utilizes a model trained with default parameters from on the following insights. Conditioning the reward on the velocity
Table 3, whereas model hyperparameters examined in our re- component Rv should guarantee that the agent moves towards
search and the possible settings are shown in Table 4. the target, whereas additional bias b should enable agent’s move-
During research, we examined the configuration of the Data ment evaluation even in case of zero velocity (which occurs often
Processing Module’s hidden fully-connected layers to choose the in the starting phase of the training; lack of bias led to random
optimal settings. The process of tuning these hyperparameters is wandering and collecting small rewards). Since the lengthwise
described in Appendix. position component Rp,Z is intuitively the most important one,
it was used to condition rewards calculated along the other two
4.3.1. Experiment 1: The influence of reward function components axes (Eq. (31)). Additionally, we scale the function to the range
on model training [−1, 1] to stabilize the training process.
The main aim of this experiment is to examine the importance The experiment consists of two phases. First, to examine the
of the four aforementioned reward function components. We process of training, we trained a model using each form of the
researched four elementary forms of reward function: reward function. We assumed a constant configuration of the
model (see Table 5). Then, the best models for each reward
• analyzing only agent’s position Rpos (Eq. (28)), function form were evaluated using an averaged reward function,
given in Eq. (33).
1 ( )
Rpos = ∗ Dexp ∗ Rp,x + Rp,y + Rp,z + Rr (28) 1(
4
)
Rev = Rp,X + Rp,Y + Rp,Z + Rr + Rv + Rav (33)
6
• assessing only agent’s velocity Rvel (Eq. (29)),
The results of the training are shown in Fig. 12. We use the
1 following denominations: position-only refers to reward func-
Rv el = ∗ Dexp ∗ (Rv + Rav ) (29)
2 tion from Eq. (28), velocity-only is the form defined in Eq. (29),
noangvel refers to complex form without angular velocity com-
• complex form without angular velocity Rno_ang vel (Eq. (30)), ponent from Eq. (30) and complex is the full form from Eq. (32).
Dexp ∗ (Rv + b) wv + wp ∗ Rp,agg + wr ∗ Rr Here, the rewards were calculated according to the form used for
Rno_ang v el = ∗ training the model (e.g. position-only form for the position-only
1+b wv + wp + wr model).
(30) Using the model with the highest value of average reward
achieved in test sequences, we conducted the evaluation proce-
dure. All the relevant metrics are shown in Fig. 13: (a) shows
(
Rp,agg = Rp,Z ∗ Rp,X + Rp,Y
)
(31) average reward (Eq. (33)) achieved by the model, (b) presents av-
erage episode length during evaluation and (c) presents episode
• complex form, using all components Rcomplex (Eq. (32)). statistics, i.e. stacked fractions of episodes, which ended with
emerging, missing the target or success, with a separate column
Dexp ∗ (Rv + b) wv + wav ∗ Rav + wp ∗ Rp,agg + wr ∗ Rr showing a total fraction of runs with collision. Since we assumed
Rcomplex = ∗
1+b wv + wav + wp + wr a constant model architecture, model’s speed was not analyzed.
10
P. Zieliński and U. Markowska-Kaczmar Applied Soft Computing 110 (2021) 107602

Fig. 12. Training results for each of the reward function forms — average reward achieved in testing (performed every ntrain_steps of training, according to the training
procedure); smoothing was not applied.

reward function is easily visible; position- and velocity-based


reward functions do not lead to successful training of the model,
achieving low values of average reward both during training
and evaluation phase. On the other hand, the complex reward
function forms allow the model to reach high rewards, which
indicates learning how to act in the environment.
This conclusion is confirmed by the values of the other metrics
and observations of the agent’s movement. Models trained using
simple forms of reward function do not reach the target. Instead,
they miss it or emerge above the surface. The model trained
using the complex reward function without the angular velocity
component performs significantly better, but it still misses the
target in most runs. Observations show, that, due to the lack
of the angular velocity reward compound (Eq. (20)), this agent
has problems with aligning to the target normal. The complex
form, which includes all of the proposed reward components,
allows training a model, which achieves the highest fraction of
successful runs and collects the biggest reward during evaluation,
with almost no untimely emergence. However, here one should
notice that almost 10% of runs included colliding with another
object (most probably the gate).
Analysis of average episode lengths is consistent with the
other metrics. Best models achieve high reward value in relatively
short episodes, which means they manage to operate fairly well
in the environment. Position- and velocity-based reward forms
yield models reaching low reward value in short episodes, which
indicates their poor performance.

4.3.2. Experiment 2: The influence of visual features embedding


choice on the model performance
During our research we compared the process of training
and the performance of the controller, depending on the used
image processing method in the Vision Module. Here, we used
the complex reward function form (Eq. (32)), with default model
parameters from Table 3 and researched configurations listed in
Table 6.
We denominate them as follows. First term refers to the visual
feature embedding: default is the best bounding box-based model
Fig. 13. Performance evaluation results for each of the reward function forms: (as concluded in Appendix), conv refers to applying the built-in
(a) mean reward achieved during evaluation; (b) average episode length; (c) CNN, while raw and basic signify models utilizing feature maps
episode statistics: here, the main column presents stacked fractions of episodes extracted from the YOLO network.
when agent emerged, missed the target, or achieved success (all of them summing
up to 100%), while a separate column shows a total fraction of runs when a
For each method based on complex feature maps (conv, raw
collision occurred. and basic) we analyzed first a model without Vision FC layers;
in this case we used more neurons in Data Processing FC layers:
[256, 64, 64, 64] for CNN-based models and [1024, 256, 64, 64]
Analysis. The difference between models trained using complex for those processing YOLO feature maps, where the bigger size is
(Eqs. (30) and (32)) and simple (Eqs. (28) and (29)) forms of required by a significantly longer value vector. Such plain models
11
P. Zieliński and U. Markowska-Kaczmar Applied Soft Computing 110 (2021) 107602

Table 6
All configurations of the model investigated in the experiment; each row shows a set of hyperparameters of a separate examined model, according to Table 4;
column name is the denomination used further in the experiment (for example, model named conv-bb-fc uses built-in CNN output and YOLO network bounding
box prediction as visual feature embeddings, it has one FC layer with 256 neurons in the Vision Module and four FC layers with 64 neurons in the Data Processing
Module)
Name Visual feature embeddings Vision FC layers Data Processing FC layers
default bounding box None [64, 64, 64, 64]
conv built-in conv None [256, 64, 64, 64]
conv-fc built-in conv [256] [64, 64, 64, 64]
conv-bb (built-in conv + bounding box) None [256, 64, 64, 64]
conv-bb-fc (built-in conv + bounding box) [256] [64, 64, 64, 64]
raw raw features None [1024, 256, 64, 64]
raw-fc raw features [1024, 256] [64, 64, 64, 64]
raw-bb (raw features + bounding box) None [1024, 256, 64, 64]
raw-bb-fc (raw features + bounding box) [1024, 256] [64, 64, 64, 64]
basic basic features None [1024, 256, 64, 64]
basic-fc basic features [1024, 256] [64, 64, 64, 64]
basic-bb (basic features + bounding box) None [1024, 256, 64, 64]
basic-bb-fc (basic features + bounding box) [1024, 256] [64, 64, 64, 64]

are denominated with no additional suffix (i.e. conv, raw and feature maps extracted from the YOLO network are over 35%
basic). slower than the default approach and the basic ‘conv’ model.
Apart from that, we used models with additional Vision FC Apart from the bigger size of the Data Processing Module FC
layers and default number of neurons in Data Processing FC layers or the addition of Vision Module FC layers, the delay is
layers ([64, 64, 64, 64]); then, for built-in convolutional layers we likely to be caused by the extraction of a high-dimensional tensor
applied an additional hidden layer with 256 neurons, and for of feature maps from the YOLO network in each step.
YOLO-based feature processing, two layers of [1024, 256] neurons During our experiments, we observed a small decrease in the
were used. These models are marked with a suffix -fc. number of iterations per second when using bounding box pre-
Moreover, for each of the complex feature maps-based models, dictions from the YOLO network. The difference ranges between
we examined a version using a combination of the feature maps’ 5 and 10%. However, it is worth mentioning, that the effect of
data and the bounding box prediction from the YOLO model. We adding Vision Module FC layers on the prediction speed is rather
denominate these models with a suffix -bb (abbr. bounding box). small, with the slowdown below 5%.
The training process’ results for each visual feature embedding To discuss the behavior of the controller model, which
method (conv, raw and basic) are shown in Fig. 14. After that, we achieved the best results, we present three example attempts
evaluated the models achieving the highest testing performance to the discussed navigation task. We plotted the robot’s position
and gathered the metrics in Fig. 15. during each run together with linear velocity settings output by
the controller (Eq. (8)), averaged throughout the fragment be-
Analysis. When looking at the models without Vision FC layers, it is clear that the only successful alternative to the default bounding box-based configuration uses the basic features from the YOLO network. The 'basic' model achieved over 40% successful runs. Models based on raw features and on the built-in convolution are not able to reach correct behavior in the environment and oscillate around a reward of 0.0 throughout the entire training process. Including additional, built-in convolutional layers greatly increases the complexity of the model, which makes its training significantly more difficult. On the other hand, the 'raw' features extracted from the YOLO network are probably too specific to the detection task, which is why the model is unable to use them for navigation. This shows that the data stored in the penultimate layer cannot be used in the proposed solution. It is worth noticing that concatenating the bounding box prediction without Vision FC layers did not improve the model's performance at all. This is probably caused by the huge size of the feature tensor, which leads the model to ignore the appended 5-element bounding box vector.
Although adding Vision FC layers did not significantly improve the performance of the models based on raw features or on the built-in convolution, the additional dimensionality reduction allowed the models with bounding box vector concatenation to achieve much better results. In particular, the 'conv-bb-fc' model reached 25% successful runs, with an average reward close to 10.00. However, the additional Vision FC layers cause a small worsening of the basic features-based models' performance.
What draws attention is the visible difference in prediction speed between the vision processing methods. Compared to the end-to-end approach in models with built-in convolutional layers, the utilization of the external YOLO-based network suffers from a noticeable decline in prediction speed: all models based on feature maps extracted from the YOLO network are over 35% slower than the default approach and the basic 'conv' model. Apart from the bigger size of the Data Processing Module FC layers or the addition of Vision Module FC layers, the delay is likely to be caused by the extraction of a high-dimensional tensor of feature maps from the YOLO network in each step.
During our experiments, we observed a small decrease in the number of iterations per second when using bounding box predictions from the YOLO network. The difference ranges between 5 and 10%. However, it is worth mentioning that the effect of adding Vision Module FC layers on the prediction speed is rather small, with the slowdown below 5%.
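The iterations-per-second figures can be obtained with a straightforward wall-clock measurement around the prediction–step loop. The sketch below is a generic illustration, not the benchmarking code used in the paper; it assumes a Gym-style environment and a Stable Baselines-style predict() method.

```python
import time

def iterations_per_second(model, env, n_steps=1000):
    """Average number of controller iterations (predict + environment step)
    executed per second over n_steps interactions."""
    obs = env.reset()
    start = time.perf_counter()
    for _ in range(n_steps):
        action, _states = model.predict(obs)   # Stable Baselines-style API (assumed)
        obs, _reward, done, _info = env.step(action)
        if done:
            obs = env.reset()
    return n_steps / (time.perf_counter() - start)
```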
To discuss the behavior of the controller model which achieved the best results, we present three example attempts at the discussed navigation task. We plotted the robot's position during each run together with the linear velocity settings output by the controller (Eq. (8)), averaged throughout the fragment between each plotted vector. We distinguished three general types of acquired paths: successful runs with a satisfactory (Fig. 16a) or distorted path (Fig. 16b) and failed attempts (Fig. 16c).
Subjectively satisfactory paths, which constitute roughly half of all successful attempts, could be observed when the robot was able to notice the target object during the initial part of the run. This allowed it to keep the target in the field of vision throughout the entire run and navigate on a smooth path with steady velocity settings.
The inability to observe the target object at the beginning often resulted in a chaotic movement of the robot in the initial phase, which can be interpreted as searching for the target. After finding the target object, the controller usually started to navigate the robot towards it; however, the resulting path was not as smooth as in the previous group of runs.
Among the unsuccessful attempts, we can identify two main causes. First, the model could continue to search for the target object for too long without finding it, ultimately exceeding the fixed episode length without reaching the target. The other cause results from the lack of knowledge about the orientation of the target object (the detection model only provides a bounding box of the detection); the controller could attempt to navigate to the target object from a wrong position (i.e. from the side) and hence pass by the target object.

Fig. 14. Training results for each visual feature embedding method — average reward achieved in testing (performed every ntrain_steps of training, according to the training procedure); smoothing was not applied; (a) methods based on features extracted using the built-in CNN; (b) methods based on raw features extracted from the YOLO network; (c) methods based on basic features extracted from the YOLO network.

5. Conclusions

In this paper we presented our research on the proposed autonomous underwater vehicle navigation controller featuring deep reinforcement learning. We analyzed several methods of visual feature embedding extraction and showed their influence on the model's training and functioning. The final model can predict the robot's steering settings based on the data from sensors, successfully navigating between two predefined points in a 3D environment. This shows that deep reinforcement learning with a vision-based approach, previously used in 2D navigation problems [29,30], might be applied in an autonomous underwater vehicle controller.
Among all the achievements of the research, we would like to emphasize the analysis of the visual feature extraction approaches applied in our model. As opposed to previous approaches [28–30], we show how various levels of feature complexity influence the feasibility of training the DRL model successfully; the utilization of the bounding box prediction from the YOLO network, trained in a supervised manner on data from the simulation, allows the model to achieve significantly better results than an end-to-end solution that uses solely built-in, trainable layers. Furthermore, we compare various levels of feature maps extracted from the YOLO network and demonstrate the advantage of the 'basic' features over the 'raw' maps extracted from the penultimate layer of the network.
Our paper shows a comparison of various model architectures (especially in the context of the configuration of hidden, fully-connected layers) and their influence on the performance and the process of training. On that basis, we chose the combination that provides the best results with respect to the assumed metrics. Furthermore, we analyzed several reward function forms and four aspects assessed during the agent's operation, leading us to the final form, which allows the model to learn fairly correct behavior.
The research can be extended in many directions. First of all, the reward function could be enhanced. Currently, the robot does not always move stably and sometimes rotates throughout navigation. Further analysis of the reward function form might also help to reduce the number of collisions, which occur even with the most successful solutions. Additionally, the final condition could be specified better, to guarantee a correct orientation of the robot at the target position and to enable it to stop at the target point. It is also important to mention the instability of training DRL models, which is easily visible in the learning curves in Figs. 12 and 14. This could be further analyzed by repeating the runs multiple times and analyzing the statistics for each configuration.

Fig. 16. Robot paths and velocities acquired by using a trained ‘default’ model;
‘velocity’ refers to controller settings aligned as in Fig. 2; (a) successful run with
satisfactory path; (b) successful run with distorted path; (c) failed run; position
axes are measured in meters.

An interesting aspect to consider is transfer learning, i.e. fine-tuning the YOLO network during the controller training, which could improve the performance of the best solutions. Additionally, the positive results of the 'basic' features-based models suggest that utilizing a pre-trained feature extractor, such as VGG [42] or ResNet [43], in the Vision Module could lead to training a successful model. Another direction of the research could be applying the solution to other competition tasks; an important test for our solution would be to use the model in a real AUV's navigation problem, which would verify its ability to steer a true robot.
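As a rough illustration of the pre-trained feature extractor idea mentioned above, a frozen ImageNet backbone could take the place of the YOLO-based extractor in the Vision Module; the snippet below sketches this with ResNet50 in tf.keras. It is a hypothetical variant, not something evaluated in this work, and the input resolution and embedding size are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Frozen ResNet50 backbone pre-trained on ImageNet (weights are downloaded on first use).
backbone = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                           input_shape=(224, 224, 3))
backbone.trainable = False  # only the navigation head would be trained

camera_image = layers.Input(shape=(224, 224, 3), name="camera_image")
x = tf.keras.applications.resnet50.preprocess_input(camera_image)
x = backbone(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
visual_embedding = layers.Dense(256, activation="relu", name="visual_embedding")(x)

vision_module = Model(camera_image, visual_embedding)
```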
Our approach has big development potential. In particular, our model could be extended to process data from additional sensors (such as the robot's other cameras, proximity sensors or sonars); this should improve the model's performance, especially the ability to avoid collisions and find the target object quickly. An interesting direction would be applying the solution to the problem of navigation of other types of robots, such as small unmanned ground vehicles or unmanned aerial vehicles, which would, however, require a different reward function form.

Fig. 15. Performance evaluation results: various methods of visual feature embeddings extraction; default refers to models without Vision FC layers, bb indicates combining embeddings with bounding box prediction, whereas fc signifies including Vision FC layers; (a) mean reward achieved during evaluation; (b) average episode length; (c) number of model iterations per second; (d) episode statistics, as in Fig. 13.

Abbreviations

The following abbreviations are used in this manuscript:

A2C Advantage Actor–Critic
AUV Autonomous Underwater Vehicle
CNN Convolutional Neural Network
DL Deep Learning
DRL Deep Reinforcement Learning
FC Fully-Connected
IoU Intersection over Union
LSTM Long Short-Term Memory
PPO Proximal Policy Optimization
RL Reinforcement Learning
YOLO You Only Look Once


Table 7
Data processing module hyperparameters’ tuning — the model configuration.
Hyperparameter               Values
Visual feature embeddings    bounding box
Vision FC layers             None
Data Processing FC layers    {[16, 16], [64, 64], [256, 256], [64, 64], [64, 64, 64], [64, 64, 64, 64]}

Fig. 17. Training results: Data Processing Module with 2 hidden FC layers with
varying number of neurons; plot shows average reward achieved in testing
(performed every ntrain_steps of training, according to the training procedure);
smoothing was not applied.

CRediT authorship contribution statement

P. Zieliński: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization. U. Markowska-Kaczmar: Conceptualization, Methodology, Validation, Formal analysis, Resources, Writing - original draft, Writing - review & editing, Supervision, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We thank members of KN Robocik for the design and construction of the robot and help with designing the environment. Acknowledgments are directed also to Nick Cantrell, a former RoboSub competitor, who shared a model of the Transdec facility. We thank the RoboNation – AUVSI Foundation for designing and conducting the RoboSub competition.

Fig. 18. Performance evaluation results: Data Processing Module with two hidden FC layers; (a) mean reward achieved during evaluation; (b) average episode length; (c) number of model iterations per second; (d) episode statistics, as in Fig. 13.

Appendix. Data processing module hyperparameters’ tuning

To choose the optimal Data Processing Module structure, especially the number of FC layers and the number of neurons in each layer, we investigated how they influence the process of training and the model's performance. Here, we utilized the complex reward function form from Eq. (32). The model's hyperparameters were set as in Table 7, with the default ones as in Table 3.
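For illustration, the grid from Table 7 can be expressed directly in code; the helper below builds one candidate Data Processing Module per configuration so that each can be trained and evaluated as described in the following paragraphs. This is a simplified sketch (the input and output dimensions are assumptions), not the training code used in the study.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Candidate configurations from Table 7: stage one varies the layer width,
# stage two varies the depth while keeping 64 neurons per layer.
stage_one = [[16, 16], [64, 64], [256, 256]]
stage_two = [[64, 64], [64, 64, 64], [64, 64, 64, 64]]

def build_data_processing_module(fc_layers, input_dim=13, n_outputs=4):
    """FC stack for one candidate; input_dim stands for the concatenated
    bounding box embedding and sensor readings, n_outputs for the steering settings."""
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for units in fc_layers:
        x = layers.Dense(units, activation="relu")(x)
    outputs = layers.Dense(n_outputs, activation="tanh")(x)
    return tf.keras.Model(inputs, outputs)

# Instantiate the stage-one candidates ('2x16', '2x64', '2x256')
# and the stage-two candidates ('2x64', '3x64', '4x64').
stage_one_models = {f"2x{cfg[0]}": build_data_processing_module(cfg) for cfg in stage_one}
stage_two_models = {f"{len(cfg)}x64": build_data_processing_module(cfg) for cfg in stage_two}
```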
The examination was conducted in two stages. First, we compared models with 2 hidden layers in the Data Processing Module, with a varying number of neurons. We researched models with 2 × 16, 2 × 64 and 2 × 256 neurons. The results of the training are shown in Fig. 17. It is visible that each model is able to learn how to act in the environment and achieve high values of mean reward in the training phase.
The best of the models saved during the training were then evaluated. The results of the evaluation are shown in Fig. 18. One can notice that all models achieve similar results in the evaluation phase as well. However, the '2x64' model reaches the best performance: it achieves the highest value of average reward both during training and evaluation, with the highest fraction of successes and mean episode length.
The second stage relied on examining various numbers of hidden layers in the Data Processing Module, each containing the same number of neurons. The previous part shows that the model with two 64-neuron layers performs best, which is why we decided to test configurations of 2, 3 and 4 layers with 64 neurons in each
of them. The results of training were gathered in Fig. 19. As before, the difference between each model is not easily noticeable; each one achieves a high value of average reward during training.
After training, the models which achieved the highest average reward value were evaluated. The evaluation metrics are shown in Fig. 20. The results again do not show significant differences between the models. They achieve comparable values of rewards and fractions of successes; however, the '4x64' model shows slightly better performance than the other ones.
Fig. 19. Training results: Data Processing Module with 2, 3 and 4 hidden FC layers with 64 neurons each; plot shows average reward achieved in testing (performed every ntrain_steps of training, according to the training procedure); smoothing was not applied.

Fig. 20. Performance evaluation results: Data Processing Module with 2, 3 and 4 hidden FC layers with 64 neurons each; (a) mean reward achieved during evaluation; (b) average episode length; (c) number of model iterations per second; (d) episode statistics, as in Fig. 13.

Modifications of the Data Processing FC layers' configuration do not have a radical effect on the model's training process. Each of the analyzed settings allows the model to learn how to act in the environment. However, it is worth noticing that the configuration using 4 layers with 64 neurons achieves the best results, both during the training and the evaluation phase. Similarly, changes in these hyperparameters do not significantly alter the model's performance during the evaluation process, although the '4x64' model performs best again, achieving over 50% of successful runs.
It is visible that the number of hidden layers in the Data Processing Module does not influence the speed of the model. Possibly, the differences are much smaller than the delays occurring in one of the other modules or during data transmission between the model and the environment.
Due to the best performance of the model with 4 hidden FC layers of 64 neurons each in the Data Processing Module, we based our experiments on this Data Processing FC layers setting.

References

[1] C.W. Warren, A technique for autonomous underwater vehicle route


planning, in: Proceedings of the Symposium on Autonomous Underwater
Vehicle Technology, Vol. 15, 1990, pp. 201–205, http://dx.doi.org/10.1109/
48.107148, (3).
[2] H. Choset, K.M. Lynch, S. Hutchinson, G. Kantor, W. Burgard, L.
Kavraki, S. Thrun, Principles of Robot Motion Theory, Algorithms, and
Implementations, MIT, 2005.
[3] H.I. Kang, B. Lee, K. Kim, Path planning algorithm using the particle swarm
optimization and the improved dijkstra algorithm, in: 2008 IEEE Pacific-
Asia Workshop on Computational Intelligence and Industrial Application,
Vol. 2, 2008, pp. 1002–1004, http://dx.doi.org/10.1109/PACIIA.2008.376.
[4] A.K. Guruji, H. Agarwal, D. Parsediya, Time-efficient A* algorithm for
robot path planning, Proc. Technol. 23 (2016) 144–149, http://dx.doi.org/
10.1016/j.protcy.2016.03.010, 3rd International Conference on Innovations
in Automation and Mechatronics Engineering 2016, ICIAME 2016 05-
06 February, 2016. URL http://www.sciencedirect.com/science/article/pii/
S2212017316300111.
[5] F.A. Cosío, M.P. Castañeda, Autonomous robot navigation using adaptive
potential fields, Math. Comput. Modelling 40 (9) (2004) 1141–1156, http://
dx.doi.org/10.1016/j.mcm.2004.05.001, URL http://www.sciencedirect.com/
science/article/pii/S0895717704003097.
[6] H. Li, M. Liu, K. Liu, Bio-inspired geomagnetic navigation method for
autonomous underwater vehicle, J. Syst. Eng. Electron. 28 (6) (2017)
1203–1209, http://dx.doi.org/10.21629/JSEE.2017.06.18.
[7] B.J. Dzienkowski, C. Strode, U. Markowska-Kaczmar, Employing game
theory and computational intelligence to find the optimal strategy of an
autonomous underwater vehicle against a submarine, in: FedCSIS, 2016,
pp. 31–40.
[8] I.K. Ibraheem, F.H. Ajeil, Multi-objective path planning of an autonomous
mobile robot in static and dynamic environments using a hybrid PSO-
MFB optimisation algorithm, 2018, CoRR . arXiv:1805.00224. URL http:
//arxiv.org/abs/1805.00224.
[9] T.P. Lillicrap, J.J. Hunt, A.e. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver,
D. Wierstra, Continuous control with deep reinforcement learning, 2015,
arXiv e-prints arXiv:1509.02971.
[10] K. Wu, M.A. Esfahani, S. Yuan, H. Wang, Depth-based obstacle avoidance
through deep reinforcement learning, in: Proceedings of the 5th Interna-
tional Conference on Mechatronics and Robotics Engineering, in: ICMRE’19,
ACM, New York, NY, USA, 2019, pp. 102–106, http://dx.doi.org/10.1145/
3314493.3314495, URL http://doi.acm.org/10.1145/3314493.3314495.
[11] I. Carlucho, M. De Paula, S. Wang, B.V. Menna, Y. Petillot, G.G. Acosta, AUV position tracking control using end-to-end deep reinforcement learning, OCEANS (2019) http://dx.doi.org/10.1109/OCEANS.2018.8604791, 2018 MTS/IEEE Charleston [8604791].


[12] R. Yu, Z. Shi, C. Huang, T. Li, Q. Ma, Deep reinforcement learning based optimal trajectory tracking control of autonomous underwater vehicle, in: 2017 36th Chinese Control Conference (CCC), 2017, pp. 4958–4965, http://dx.doi.org/10.23919/ChiCC.2017.8028138.
[13] Y. Huo, Y. Li, X. Feng, Model-free recurrent reinforcement learning for AUV horizontal control, IOP Conf. Ser.: Mater. Sci. Eng. 428 (2018) 012063, http://dx.doi.org/10.1088/1757-899x/428/1/012063.
[14] C. Wang, L. Wei, Z. Wang, M. Song, N. Mahmoudian, Reinforcement learning-based multi-AUV adaptive trajectory planning for under-ice field estimation, Sensors 18 (11:3859) (2019).
[15] S. You, M. Diao, L. Gao, F. Zhang, H. Wang, Target tracking strategy using deep deterministic policy gradient, Appl. Soft Comput. 95 (2020) 106490, http://dx.doi.org/10.1016/j.asoc.2020.106490, URL http://www.sciencedirect.com/science/article/pii/S1568494620304294.
[16] C. Yan, X. Xiang, C. Wang, Towards real-time path planning through deep reinforcement learning for a UAV in dynamic environments, J. Intell. Robot. Syst. (2019) http://dx.doi.org/10.1007/s10846-019-01073-3.
[17] Y. Sun, X. Ran, G. Zhang, H. Xu, X. Wang, AUV 3D path planning based on the improved hierarchical deep Q network, J. Mar. Sci. Eng. 8 (2) (2020) http://dx.doi.org/10.3390/jmse8020145, URL https://www.mdpi.com/2077-1312/8/2/145.
[18] X. Zhou, Y. Gao, L. Guan, Towards goal-directed navigation through combining learning based global and local planners, Sensors (Basel) (2019) http://dx.doi.org/10.3390/s19010176, URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6339171/.
[19] Z. Wang, N. de Freitas, M. Lanctot, Dueling network architectures for deep reinforcement learning, 2015, CoRR abs/1511.06581. arXiv:1511.06581. URL http://arxiv.org/abs/1511.06581.
[20] T.D. Kulkarni, K.R. Narasimhan, A. Saeedi, J.B. Tenenbaum, Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation, in: Proceedings of the 30th International Conference on Neural Information Processing Systems, in: NIPS'16, Curran Associates Inc., Red Hook, NY, USA, 2016, pp. 3682–3690.
[21] X. Cao, C. Sun, M. Yan, Target search control of AUV in underwater environment with deep reinforcement learning, IEEE Access 7 (2019) 96549–96559, http://dx.doi.org/10.1109/ACCESS.2019.2929120.
[22] V. Mnih, A. Puigdomènech Badia, M. Mirza, A. Graves, T.P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, 2016, arXiv e-prints arXiv:1602.01783.
[23] M. Sporyshev, A. Scherbatyuk, Reinforcement learning approach for cooperative AUVs in underwater surveillance operations, in: 2019 IEEE Underwater Technology (UT), 2019, pp. 1–4, http://dx.doi.org/10.1109/UT.2019.8734293.
[24] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, 2017, arXiv e-prints arXiv:1707.06347.
[25] Y. Liu, F. Wang, Z. Lv, K. Cao, Y. Lin, Pixel-to-action policy for underwater pipeline following via deep reinforcement learning, in: 2018 IEEE International Conference of Intelligent Robotic and Control Engineering (IRCE), 2018, pp. 135–139, http://dx.doi.org/10.1109/IRCE.2018.8492943.
[26] Y. Sun, J. Cheng, G. Zhang, H. Xu, Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning, J. Intell. Robot. Syst. (2019) http://dx.doi.org/10.1007/s10846-019-01004-2.
[27] S.T. Havenstrøm, A. Rasheed, O. San, Deep reinforcement learning controller for 3D path following and collision avoidance by autonomous underwater vehicles, Front. Robot. AI 7 (2021) 211, http://dx.doi.org/10.3389/frobt.2020.566037, URL https://www.frontiersin.org/article/10.3389/frobt.2020.566037.
[28] F. Zhang, J. Leitner, M. Milford, B. Upcroft, P.I. Corke, Towards vision-based deep reinforcement learning for robotic motion control, 2015, ArXiv abs/1511.03791.
[29] L. Xie, S. Wang, A. Markham, N. Trigoni, Towards monocular vision based obstacle avoidance through deep reinforcement learning, in: Robotics: Science and Systems Workshop 2017: New Frontiers for Deep Learning in Robotics; Conference Date: 15-07-2017 through 15-07-2017, 2017.
[30] J. Kulhanek, E. Derner, T. de Bruin, R. Babuska, Vision-based navigation using deep reinforcement learning, in: 2019 European Conference on Mobile Robots (ECMR), IEEE, 2019, http://dx.doi.org/10.1109/ecmr.2019.8870964.
[31] J. Roghair, K. Ko, A.E.N. Asli, A. Jannesari, A vision based deep reinforcement learning algorithm for UAV obstacle avoidance, 2021, arXiv:2103.06403.
[32] J. Hua, L. Zeng, G. Li, Z. Ju, Learning for a robot: Deep reinforcement learning, imitation learning, transfer learning, Sensors 21 (4) (2021) http://dx.doi.org/10.3390/s21041278, URL https://www.mdpi.com/1424-8220/21/4/1278.
[33] W. Zhao, J.P. Queralta, T. Westerlund, Sim-to-real transfer in deep reinforcement learning for robotics: a survey, in: 2020 IEEE Symposium Series on Computational Intelligence (SSCI), 2020.
[34] J. Ibarz, J. Tan, C. Finn, M. Kalakrishnan, P. Pastor, S. Levine, How to train your robot with deep reinforcement learning: lessons we have learned, Int. J. Robot. Res. (2021) http://dx.doi.org/10.1177/0278364920987859.
[35] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, second ed., The MIT Press, 2018, URL http://incompleteideas.net/book/the-book-2nd.html.
[36] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533, http://dx.doi.org/10.1038/nature14236.
[37] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788, http://dx.doi.org/10.1109/CVPR.2016.91.
[38] A. Juliani, V.-P. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, D. Lange, Unity: A general platform for intelligent agents, 2018, ArXiv abs/1809.02627.
[39] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015, Software available from tensorflow.org. URL http://tensorflow.org/.
[40] A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, Stable baselines, 2018, https://github.com/hill-a/stable-baselines.
[41] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, P. Zhokhov, OpenAI baselines, 2017, https://github.com/openai/baselines.
[42] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015, URL http://arxiv.org/abs/1409.1556.
[43] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778, http://dx.doi.org/10.1109/CVPR.2016.90.

