Article history:
Received 29 July 2020
Received in revised form 13 April 2021
Accepted 5 June 2021
Available online 15 June 2021

MSC:
65D19
97R40

Keywords:
Deep reinforcement learning
A2C
PPO
Vision-based navigation
YOLO
CNN

Abstract: In this paper, we address a problem of vision-based 3D robotic navigation using deep reinforcement learning for an Autonomous Underwater Vehicle (AUV). Our research offers conclusions from an experimental study based on one of the RoboSub 2018 competition tasks; however, it can be generalized to any navigation task consisting of movement from a starting point to the front of the next station. The presented reinforcement learning-based model predicts the robot's steering settings using the data acquired from the robot's sensors. Its Vision Module may be based on a built-in convolutional network or a pre-trained TinyYOLO network, so that a comparison of various levels of feature complexity is possible. To enable evaluation of the proposed solution, we prepared a test environment imitating the real conditions. It provides the ability to steer the agent simulating the AUV and to calculate the values of rewards used for training the model by evaluating its decisions. We study the solution in terms of the reward function form, the model's hyperparameters and the exploited camera image processing method, and provide an analysis of the correctness and speed of the model's functioning. As a result, we obtain a valid model able to steer the robot from the starting point to the destination based on visual cues and inputs from other sensors.

© 2021 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.asoc.2021.107602
P. Zieliński and U. Markowska-Kaczmar Applied Soft Computing 110 (2021) 107602
movement efficiently, avoiding existing obstacles and optimizing the path to minimize its length and required time. Therefore, the problem of underwater robot navigation is a crucial part of an autonomous controller, on which the ability to perform further tasks depends most.

Underwater robots are often designed and applied for school and academic competitions, where they have to perform tasks similar to those of professional vehicles. However, their environment of operation is often limited to smaller areas, such as lakes, basins, or special facilities. One of these competitions, RoboSub, is organized yearly in San Diego, at the TRANSDEC testing facility; AUVs prepared for this competition perform a series of tasks imitating current research conducted in the local US Navy unit.

The main aim of this research is to explore the feasibility of applying deep reinforcement learning for training a vision-based navigation controller of an autonomous underwater vehicle. The research was inspired by the RoboSub competition, attended by one of the authors, where the ability to navigate between each task is essential. Navigation can be defined as the ability to determine the robot's position and to plan a suitable and safe path to the goal location [2]. A crucial aspect of robotic navigation is the ability to sense objects around the robot and to know its relative pose. The ability to navigate an agent in a 3D environment from a starting point towards the target object without specifying the expected path, using only visual observations and sensor readings, seems attractive not only for AUVs but also for similar tasks performed by unmanned aerial vehicles (UAVs) or intelligent agents in 3D games.

Depending on the nature of the environment, various strategies might be used for path planning. With an accurate model of the world and a tractable action space, classical optimization methods perform robustly, for example, the Dijkstra algorithm [3], the A* algorithm [4], or the artificial potential field method [5]. In the case of more complex environments or large action spaces, the performance of these methods degrades rapidly. Therefore, some studies present evolutionary-based approaches to solve this problem, such as a multiobjective evolutionary algorithm [6], a combination of an evolutionary algorithm and game theory [7], or hybrid Particle Swarm (PSO) and Modified Bat algorithms [8].

However, evolutionary algorithms are often difficult to tune, which is why currently reinforcement learning (RL) is used in many fields of robotics. Thanks to deep neural networks' capacity for processing high-dimensional data, deep reinforcement learning (DRL) models can use raw sensor data without manual feature engineering. Among the many methods, the most frequently tested one in the context of AUV navigation is Deep Deterministic Policy Gradients (DDPG [9]): [10–14]. You et al. [15] leverage DDPG for 2D target tracking in an unmanned combat air vehicle. Authors in [16,17] and [18] applied the Deep Double-Q Network (D3QN [19]) for path planning, and in [17] a hierarchical Deep Q-Network (h-DQN [20]) was used for the same task. In [21], target search tasks were solved using the Asynchronous Advantage Actor–Critic (A3C [22]) model. Sporyshev et al. [23] solve a problem of search and surveillance missions by multiple cooperative AUVs using Proximal Policy Optimization (PPO [24]). This DRL method is also used in [25] for mapping from the images produced by the cameras to the velocity of AUVs, and in [26] for a motion planning system, which avoids obstacles and can reach multiple target points. In [27], the authors present how to train an agent to tackle a hybrid objective of 3D path-following and collision avoidance for an AUV.

The aforementioned DRL methods utilize inputs from advanced sensors, such as sonars, and do not rely on visual cues. Lately, a combination of DRL and deep learning vision methods has become increasingly popular in the robot navigation domain. For example, [28] used such an approach for robotic arm motion control, [29] for obstacle avoidance, [30] for navigating a robot to a target given by an image, and [15] for 2D target tracking. These papers focus only on two-dimensional problems; the problem of obstacle avoidance in a 3D environment was recently tackled by [31].

To sum up, there are several recent survey papers. [32] discusses in detail the state of the art in deep reinforcement learning, imitation learning, and transfer learning in robot control. [33] describes the fundamental background behind sim-to-real transfer in deep reinforcement learning and reviews the main methods being utilized. [34] overviews commonly perceived challenges in deep RL and how they can be addressed.

In this research, we concentrate on an analysis of a vision-based approach to the problem of 3D robotic navigation, where we compare the proposed deep reinforcement learning-based model's effectiveness depending on the visual feature embeddings used by the controller. Furthermore, we present a study of several reward functions and their influence on the process of training and the achieved results. The solution presented in the paper has been investigated in terms of the efficiency of the training process and the correctness of functioning, with varying environmental conditions and for various hyperparameter settings. The research is based on a specifically designed simulation environment, in which training and evaluation of models were conducted.

We summarize our contributions as follows:

• We present a novel, vision-based approach for 3D navigation, which utilizes data from the robot's sensors and an object detection model for target object location. The proposed solution is trained as a deep reinforcement learning model and can steer the robot to navigate from the starting point to the target object.
• We analyze the reward function used for training the model. We propose four relevant components and prove that each one is necessary to achieve the correct behavior of the controller.
• We test various image processing methods and show how the level of visual features' complexity processed by the model influences its performance.
• We evaluate the solution regarding its success rate and achieved reward value, as well as the model's prediction speed.
• We provide a simulation environment for similar experiments.

The rest of the paper is organized as follows. Section 2 states the problem of AUV navigation. The proposed solution is shown in Section 3, while Section 4 presents the experiments conducted during the research and discusses their outcomes. Finally, Section 5 shows our conclusions.

2. Problem description and formulation

In the research we consider one of the tasks from the RoboSub 2018 competition, which consists of navigating from the starting area (dock) towards the gate indicating the entrance to the competition task sequence. Fig. 1 illustrates the point configuration for this task; point P0 refers to the starting point, whereas P∗ is the target position. It is assumed that initially the target object is not in the robot's field of vision and is located several meters away from the starting point. The navigation problem is defined as the movement from the dock towards the gate, stopping at a point where the gate fits the sight of the camera and the robot is situated centered in front of it. This scenario provides necessary robustness, since a situation where the robot ends its navigation
Fig. 3. High-level controller structure, showing its main modules; X and Y are the controller’s input and output (Eq. (2)), Ximg is the image tensor (Eq. (3)), Xa
refers to linear acceleration vector (Eq. (4)), Xω is the angular velocity vector (Eq. (5)), Xϕ is the orientation vector (Eq. (6)) and Xd refers to the current depth value
(Eq. (7)).
Fig. 5. Object detection model architecture: Conv refers to a convolutional layer; Max-pool is a max-pooling layer; batch-norm indicates batch normalization; ReLU and Sigmoid are activation function types; input vector Xv is the image tensor Ximg scaled to shape 416 × 416 × 3.
a better understanding of the image content, as compared to plain bounding box parameters.

The YOLO network is an object detection model, i.e. it estimates the position and the size of each object on the image. It is based on a grid of size 13 × 13: the input image is divided into 169 cells and for each of them the network predicts the location of one object (its bounding box's position and size). Because of the grid shape, the input image size should be set to 416 × 416 (for each cell to be of shape 32 × 32 pixels).

Each of the convolutional layers performs feature extraction, finally returning an intermediate tensor Ibbox of shape 13 × 13 × 5, i.e. a prediction of a bounding box for each cell in the grid, in the form of a five-element vector [C, x, y, w, h], where C is the detection confidence (Eq. (10)), x and y are the coordinates of the bounding box center, and w and h are its width and height (all scaled to the range [0, 1]). Finally, the model's output Ybbox is estimated by choosing the cell with the highest confidence value, according to Eq. (9).

Ybbox = argmax_C(Ibbox) = [Ĉ, x̂, ŷ, ŵ, ĥ]  (9)

C = p(obj) ∗ IoU_pred^true  (10)

IoU_pred^true = |pred ∩ true| / |pred ∪ true|  (11)

λcoord – position component relevance coefficient,
S – number of rows and columns in the YOLO grid,
1_i^obj – step function (1_i^obj = 1 if cell i contains the detected object; 1_i^obj = 0 if it does not),
(x_i^∗, y_i^∗) – target bounding box position,
(x_i, y_i) – predicted bounding box coordinates,

• Lsize – bounding box size error (comparing predicted and true width and height of the bounding box) – Eq. (14):

Lsize = λcoord ∑_{i=0}^{S²} 1_i^obj [(√(w_i^∗) − √(w_i))² + (√(h_i^∗) − √(h_i))²]  (14)

(w_i^∗, h_i^∗) – real width and height of the bounding box,
(w_i, h_i) – predicted size of the detection,

• Lobj, Lnoobj – positive and negative detection error, based on comparing the confidence of the detection C_i with the ground truth C_i^∗ (C_i^∗ = 0 for a negative detection; C_i^∗ = 1 for a positive one) – Eqs. (15) and (16):

Lobj = ∑_{i=0}^{S²} 1_i^obj (C_i^∗ − C_i)²  (15)

Lnoobj = ∑_{i=0}^{S²} 1_i^noobj (C_i^∗ − C_i)²  (16)
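To make the output-decoding step concrete, the cell selection of Eq. (9) and the intersection over union of Eq. (11) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names and the corner-based (x_min, y_min, x_max, y_max) box format are our own assumptions, and the 13 × 13 grid is represented as a nested list.

```python
def iou(box_a, box_b):
    """Intersection over Union (Eq. (11)) of two axis-aligned boxes
    given as (x_min, y_min, x_max, y_max)."""
    ix = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    iy = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    inter = max(0.0, ix) * max(0.0, iy)  # zero if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def decode_output(i_bbox):
    """Eq. (9): from the grid of per-cell predictions, return the
    [C, x, y, w, h] vector with the highest confidence C."""
    cells = [cell for row in i_bbox for cell in row]
    return max(cells, key=lambda cell: cell[0])
```

For example, feeding `decode_output` a 13 × 13 grid in which a single cell holds a nonzero confidence returns that cell's five-element vector, mirroring the argmax over C in Eq. (9).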
3.2. Data processing module

The data collected from the robot's sensors (AHRS and depth sensor) are concatenated with the Vision Module's output, creating the input for the Data Processing Module. To predict the correct values of the velocity settings, the model needs to learn the relation between the sensors' readings, the Vision Module output, and the correct settings of the AUV's velocity. Therefore, we add a sequence of trainable, fully-connected layers, which process the concatenated input of the Module and are trained jointly with the entire DRL model. If we increase the number of layers and neurons in this sequence, the model should be able to learn more complex relations between the input parameters, but on the other hand, it can make the training process more difficult.

3.3. Time-series analysis module

Due to a strong and long-term time dependency occurring in the considered problem, actions taken by the controller should be based not only on the current but also previous observations of the environment. We decided to use a recurrent network — an LSTM layer, which is trained together with the controller's model. It performs analysis of the output of the previous module, conditioning its output on the current observations and the hidden state (of length 256), passed to consecutive LSTM cells in subsequent time steps. The size of the hidden state in the Time-Series Analysis Module was based on our preliminary studies, which are not described here.

The output of the module is the settings vector Y, used to control the robot's movement.

3.4. Reward function

One of the key aspects of reinforcement learning is the reward function used for training models [35]. It should accurately evaluate agent actions, promoting those leading to success and discouraging the model from making incorrect decisions. Ideally, the optimal reward function should enable the model to learn the correct behavior in the environment. In our research this would mean reaching the target point using an optimal path, without unnecessary deviations, preserving natural orientation and appropriate speed. Therefore, we propose four reward components and examine how they influence the model's training and performance:

• position reward Rp, based on the distance of the agent from the target point (e.g. the center of the gate), calculated individually for each dimension according to Eq. (17) (see Fig. 6a); this component should help the model learn to minimize the distance to the target point (the position reward increases as the agent approaches the target): as the agent approaches the target position, the function rises to 1, reaching kmin,d at a distance equal to lmin,d; this allows explicit shaping of this reward function component in each direction;

Rp,d = exp(|dtarget − dagent| ∗ ln(kmin,d) / lmin,d)  (17)

where:
d – the axis along which the distance is measured (X, Y or Z),
|dtarget − dagent| – absolute distance between agent and target position,
kmin,d ∈ (0, 1) – function value achieved at distance lmin,d; during research, kmin,d and lmin,d were set empirically;

• rotation reward Rr, assessing the agent's angular position around the yaw axis in relation to the normal of the target object's front plane (illustrated in Fig. 6b and formally described by Eq. (18)); the rotation component's aim is to align the agent with the target object, so that it can approach it by going forward;

Rr = cos(γtarget − γagent)  (18)

where:
γ – normal angular position around the yaw axis;

• velocity reward Rv, evaluating the agent's movement direction and speed, described by Eq. (19); it should make the model learn to move towards the gate as fast as possible;

Rv = cos[∢(vagent, Ttarget)] ∗ |vagent| / vmax  (19)

where:
∢(vagent, Ttarget) – angle between the agent's velocity vector and a vector connecting the agent's center of mass and the target point (see Fig. 6c),
|vagent| – length of the velocity vector, normalized to the maximal agent's speed vmax;

• angular velocity reward Rav, determining the correctness of the agent's current velocity in relation to its rotation, given in Eq. (20); it rewards angular movement towards the target angular position around the yaw axis (i.e. parallel to the target object's normal), discouraging the model from turning away from the target: the difference of tanh allows comparing the current angle and angular velocity, so that rotation towards the target angle returns values close to 0, and different values in other cases; we then apply cos to the difference to reward zeroing it (i.e. adjusting the angular velocity to the target angle); the normalization of the difference with tanh(1) is added to prevent the reward function from returning −1 for all misaligned angles and angular velocities;

Rav = cos[π ∗ (tanh(∆γ) − tanh(ωY)) / tanh(1)]  (20)

where:
∆γ = γtarget − γagent – difference between the agent's and target's normals' angular positions around the yaw axis (same as in Eq. (18)), normalized to the range [−1, 1],
ωY – normalized yaw angular velocity.

To prevent behavior in which the agent concentrates on collecting small rewards for a long time, we introduced two restrictions. The first one limits the length of training episodes and resets the environment after exceeding the maximum number of steps Lepisode. The second one relies on discounting positive rewards using the function Dexp, described in Eq. (21), which decreases exponentially from the first step, reaching the value 0.1 at half of the episode length.

Dexp = exp(step ∗ ln(0.1) / (Lepisode / 2))  (21)

The agent receives several distinct rewards and penalties. When reaching the target, it is awarded a high reward, which is formally described by Eq. (22). Successful completion of the task is determined using a box surrounding the target object with some margin; the agent hitting the box indicates a successful finish and, therefore, causes resetting the environment and starting a new training episode.

Rsuccess = 50 ∗ Dlin ∗ (Rr ∗ wr + Rp,avg ∗ wp)  (22)
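The four reward components and the exponential discount (Eqs. (17)–(21)) reduce to a few scalar formulas. The sketch below illustrates them in plain Python; the function names are ours, and passing the velocity as an angle plus a magnitude (rather than as the vectors of Eq. (19)) is our simplification, not the paper's implementation.

```python
import math

def position_reward(d_target, d_agent, k_min, l_min):
    # Eq. (17): equals 1 at the target and decays to k_min at distance l_min
    # (k_min in (0, 1), so ln(k_min) is negative and the exponent decays).
    return math.exp(abs(d_target - d_agent) * math.log(k_min) / l_min)

def rotation_reward(gamma_target, gamma_agent):
    # Eq. (18): cosine of the yaw misalignment between agent and target normals.
    return math.cos(gamma_target - gamma_agent)

def velocity_reward(angle_to_target, speed, v_max):
    # Eq. (19): cosine of the angle between the velocity vector and the
    # agent-to-target vector, scaled by the normalized speed.
    return math.cos(angle_to_target) * speed / v_max

def angular_velocity_reward(delta_gamma, omega_y):
    # Eq. (20): rewards a yaw angular velocity that cancels the misalignment;
    # the tanh(1) normalization keeps the argument of cos within (-pi, pi).
    return math.cos(math.pi * (math.tanh(delta_gamma) - math.tanh(omega_y))
                    / math.tanh(1.0))

def exp_discount(step, l_episode):
    # Eq. (21): exponential discount reaching 0.1 at half the episode length.
    return math.exp(step * math.log(0.1) / (0.5 * l_episode))
```

A quick sanity check of the shaping: `position_reward` returns exactly `k_min` at distance `l_min`, and `exp_discount(Lepisode / 2, Lepisode)` returns 0.1, matching the definitions above.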
Fig. 6. Parameters used to calculate each reward component; (a) position reward (evaluates the distance from the target position along each axis: ∆x, ∆y and ∆z)
(Eq. (17)); (b) rotation reward (calculated using the difference between the yaw angular orientations of the agent’s normal (Nagent ) and the target’s normal (Ntarget ):
γagent and γtarget ) (Eq. (18)); (c) velocity reward (assesses agent’s linear velocity vagent (its direction and magnitude), comparing it to a vector connecting the agent
and the target Ttarget ) (Eq. (19)); (d) angular velocity reward (evaluates the agent’s angular velocity around the yaw axis (ωY ) to encourage aligning with the target’s
normal) (Eq. (20)).
Fig. 8. Example images of the conditions in the environment: the real image
(on the left) and an example from the simulation (on the right).
Table 2
Test platform configuration.
CPU AMD Ryzen 7 2700
RAM 32 GB
GPU Nvidia RTX 2070
OS Ubuntu Linux 18.04
Table 3
Default parameters used in experiments.
Hyperparameter | Value | Description
Ttraining | 24 [h] | Model training period (hours)
nenvs | 12 | Number of agents
ntrain_steps | 50 000 | Number of steps between each evaluation and model saving
neval_steps | 200 | Number of evaluation steps
idec | 10 | Interval between each model's decision
Lepisode | 4000 | Maximum number of steps per episode
|Ct| | 256 | LSTM cell state vector length
Lminibatch | 30 | Number of steps in one minibatch
nminibatches | 4 | Number of minibatches used for model training
lr | 0.00025 | Constant learning rate value
εPPO | 0.2 | PPO clipping threshold
kmin,x; kmin,y; kmin,z | 0.3; 0.3; 0.3 | Expected position reward value kmin achieved at distance lmin for each dimension (Eq. (17))
lmin,x; lmin,y; lmin,z | 0.8; 0.5; 2.0 | Distances lmin at which kmin is reached, per dimension (Eq. (17))
wr | 0.2 | Rotation component weight in success reward (Eq. (22))
wp | 0.1 | Position component weight in success reward (Eq. (22))
kmin,lin; lmin,lin | 0.5; Lepisode | Linear discount factor parameters (Eq. (24))

Fig. 10. Histogram of IoU for the test dataset using the selected object detection model. It is visible that the selected model returns accurate predictions for the test dataset, with only a few examples having a low (<0.3) IoU metric value.

comparison of saved models' performance confirms that the 70k-step model yields the best results on a test video generated with our simulation, which is why this model has been used for further research. Fig. 10 shows a histogram of intersection over union values for the test dataset using the selected object detection model. Additionally, we present some examples of model predictions compared to the ground truth in Fig. 11.

4.3. Navigation controller

To study the proposed solution we used an implementation of the deep reinforcement learning model from the Stable Baselines library [40], which is a fork of OpenAI's Baselines [41]. Training and evaluation were conducted using the simulation prepared for the project. We measure the performance using 4 metrics:

• mean reward gathered by the agent R̄, averaged for each agent and environment instance, as in Eq. (25):

R̄ = (1/Ne) ∑_{i=1}^{Ne} ∑_{t=0}^{T} ri(t)  (25)

where:
Ne – total number of agents,
T – learning episode length,
ri(t) – value of the reward received by the agent in the ith environment, in the tth time step;

• average episode length L̄, i.e. the number of steps taken by the agent before reaching the target or resetting the environment; good results are indicated by a large reward achieved in a small number of steps; a small reward received in short episodes can be caused by incorrect agent behavior, such as untimely emergence or missing the target, both of which are considered fatal mistakes in the competition; on the other hand, a large number of steps taken in one episode may suggest that the agent is wandering around the environment;

• speed of predictions īts, calculated as the average number of steps per second, as shown in Eq. (26):

īts = Nsteps / Teval  (26)

where:
Nsteps – number of steps taken by the agent during evaluation,
Teval – total evaluation time in seconds;

• episode statistics (Eq. (27)), i.e. the total fraction of successful approaches (ended by reaching the target) qS and failed episodes (distinguishing between untimely emergence qE and missing the target qM), together with the fraction of all runs during which a collision occurred (qC):

qS = |S| / |T|;  qE = |E| / |T|;  qM = |M| / |T|;  qC = |C| / |T|  (27)

where:
|S| – number of episodes which ended successfully,
|E| – number of episodes failed by emerging,
|M| – number of episodes where the agent missed the target,
|T| – total number of episodes conducted in the evaluation; |T| = |S| + |E| + |M|,
|C| – number of episodes during which a collision occurred.

We applied a unified training procedure to compare various model designs directly. Using the actor–critic architecture and the A2C method, we conducted parallel training of 12 agents for a constant period of 24 h on our test platform described by Table 2. Our training strategy assumes a cyclic model evaluation every 50 000 training steps. Testing was conducted for each agent by performing 200 steps according to model predictions in each of the 12 environments and registering the average reward and episode length. The evaluation sequence, used for assessing the best model from each training procedure (i.e. the one achieving the highest average reward per test episode during training), was repeated 5 times, with a forced reset of each simulation instance. All metrics were averaged over each evaluation episode (or summed, in the case of episode statistics).
Fig. 11. Example images from the dataset used to train the YOLO model, with drawn ground-truth boxes (red) and predictions (green). The two images on the left
contain added ‘noise’ objects, whereas the images on the right were not noised.
Table 4
Hyperparameters examined in the research.
Hyperparameter | Values | Description
Visual feature embeddings | {bounding box, raw features, basic features, built-in conv} | It defines the visual features embedding method in the Vision Module (Fig. 4)

Table 5
The controller model configuration used for reward function evaluation.
Hyperparameter | Values
Visual feature embeddings | bounding box
Vision FC layers | None
Data Processing FC layers | [64, 64, 64, 64]
Fig. 12. Training results for each of the reward function forms — average reward achieved in testing (performed every ntrain_steps of training, according to the training
procedure); smoothing was not applied.
Table 6
All configurations of the model investigated in the experiment; each row shows a set of hyperparameters of a separate examined model, according to Table 4;
column name is the denomination used further in the experiment (for example, model named conv-bb-fc uses built-in CNN output and YOLO network bounding
box prediction as visual feature embeddings, it has one FC layer with 256 neurons in the Vision Module and four FC layers with 64 neurons in the Data Processing
Module)
Name Visual feature embeddings Vision FC layers Data Processing FC layers
default bounding box None [64, 64, 64, 64]
conv built-in conv None [256, 64, 64, 64]
conv-fc built-in conv [256] [64, 64, 64, 64]
conv-bb (built-in conv + bounding box) None [256, 64, 64, 64]
conv-bb-fc (built-in conv + bounding box) [256] [64, 64, 64, 64]
raw raw features None [1024, 256, 64, 64]
raw-fc raw features [1024, 256] [64, 64, 64, 64]
raw-bb (raw features + bounding box) None [1024, 256, 64, 64]
raw-bb-fc (raw features + bounding box) [1024, 256] [64, 64, 64, 64]
basic basic features None [1024, 256, 64, 64]
basic-fc basic features [1024, 256] [64, 64, 64, 64]
basic-bb (basic features + bounding box) None [1024, 256, 64, 64]
basic-bb-fc (basic features + bounding box) [1024, 256] [64, 64, 64, 64]
are denominated with no additional suffix (i.e. conv, raw and basic).

Apart from that, we used models with additional Vision FC layers and the default number of neurons in the Data Processing FC layers ([64, 64, 64, 64]); for built-in convolutional layers we applied an additional hidden layer with 256 neurons, and for YOLO-based feature processing, two layers of [1024, 256] neurons were used. These models are marked with the suffix -fc.

Moreover, for each of the complex feature maps-based models, we examined a version using a combination of the feature maps' data and the bounding box prediction from the YOLO model. We denominate these models with the suffix -bb (abbr. bounding box).

The training process' results for each visual feature embedding method (conv, raw and basic) are shown in Fig. 14. After that, we evaluated the models achieving the highest testing performance and gathered the metrics in Fig. 15.

Analysis. When looking at models without Vision FC layers, it is obvious that the only successful alternative to the default bounding box-based configuration uses the basic features from the YOLO network. The 'basic' model achieved over 40% of successful runs. Raw features- and built-in convolution-based models are not able to reach correct behavior in the environment and oscillate around 0.0 reward throughout the entire process of training. Including additional, built-in convolutional layers greatly increases the complexity of the model, which makes its training significantly more difficult. On the other hand, the 'raw' features extracted from the YOLO network are probably too specific for the detection task, which is why the model is unable to use them for navigation. This shows that the data stored in the penultimate layer cannot be used in the proposed solution. It is worth noticing that concatenating the bounding box prediction with no Vision FC layers did not improve the model's performance at all. This is probably caused by the huge size of the features tensor, leading the model to ignore the appended 5-element bounding box vector.

Although adding Vision FC layers did not improve the performance of raw features- and built-in convolution-based models significantly, the additional dimensionality reduction allowed the models with bounding box vector concatenation to achieve much better results. In particular, the 'conv-bb-fc' model reached 25% of successful runs, with an average reward close to 10.00. However, additional Vision FC layers cause a small worsening of basic features-based models' performance.

What draws attention is the visible difference in prediction speed between the vision processing methods. Compared to the end-to-end approach in models with built-in convolutional layers, the utilization of the external YOLO-based network suffers from a noticeable decline in prediction speed: all models based on feature maps extracted from the YOLO network are over 35% slower than the default approach and the basic 'conv' model. Apart from the bigger size of the Data Processing Module FC layers or the addition of Vision Module FC layers, the delay is likely to be caused by the extraction of a high-dimensional tensor of feature maps from the YOLO network in each step.

During our experiments, we observed a small decrease in the number of iterations per second when using bounding box predictions from the YOLO network. The difference ranges between 5 and 10%. However, it is worth mentioning that the effect of adding Vision Module FC layers on the prediction speed is rather small, with the slowdown below 5%.

To discuss the behavior of the controller model which achieved the best results, we present three example attempts at the discussed navigation task. We plotted the robot's position during each run together with the linear velocity settings output by the controller (Eq. (8)), averaged throughout the fragment between each plotted vector. We distinguished three general types of acquired paths: successful runs with a satisfactory (Fig. 16a) or distorted path (Fig. 16b), and failed attempts (Fig. 16c).

Subjectively satisfactory paths, which constitute roughly half of all successful attempts, could be observed when the robot was able to notice the target object during the initial part of the run. This allowed it to keep the target in the field of vision throughout the entire run and navigate on a smooth path with steady velocity settings.

The inability to observe the target object at the beginning often resulted in chaotic movement of the robot in the initial phase, which can be interpreted as searching for the target. After finding the target object, the controller usually started to navigate the robot towards it; however, the resulting path was not as smooth as in the previous group of runs.

Among unsuccessful attempts, we can identify two main causes. First, the model could continue to search for the target object for too long without finding it, ultimately exceeding the fixed episode length without reaching the target. The other cause results from the lack of knowledge about the orientation of the target object (the detection model only provides a bounding box of the detection); the controller could attempt to navigate to the target object from a wrong position (i.e. from the side) and hence pass by the target object.

5. Conclusions

In this paper we presented the research on the proposed autonomous underwater vehicle navigation controller, featuring deep reinforcement learning. We analyze several methods of visual feature embeddings extraction and show their influence
Fig. 14. Training results for each of the visual feature embedding methods — average reward achieved in testing (performed every ntrain_steps of training, according to the training procedure); smoothing was not applied; (a) methods based on features extracted using the built-in CNN; (b) methods based on raw features extracted from the YOLO network; (c) methods based on basic features extracted from the YOLO network.
on the model's training and functioning. The final model can predict the robot's steering settings based on the data from sensors, successfully navigating between two predefined points in a 3D environment. This shows that deep reinforcement learning with a vision-based approach, which was previously used in 2D navigation problems [29,30], might be applied in an autonomous underwater vehicle controller.

Among all of the achievements of the research, we would like to emphasize the analysis of the visual feature extraction approaches applied in our model. As opposed to previous approaches [28–30], we show how various levels of feature complexity influence the feasibility of training the DRL model successfully; the utilization of the bounding box prediction from the YOLO network, trained in a supervised manner on data from the simulation, allows the model to achieve significantly better results than an end-to-end solution, which uses solely built-in, trainable layers. Furthermore, we compare various levels of feature maps extracted from the YOLO network and demonstrate the advantage of 'basic' features over the 'raw' maps extracted from the penultimate layer of the network.

Our paper shows a comparison of various model architectures (especially in the context of the configuration of the hidden, fully-connected layers) and their influence on the performance and the process of training. On that basis, we chose the combination that provides the best results with respect to the assumed metrics. Furthermore, we analyzed several reward function forms and four aspects assessed during the agent's operation, leading us to the final form, which allows the model to learn fairly correct behavior.

The research can be extended in many directions. First of all, the reward function could be enhanced. Currently, the robot does not always move stably and sometimes rotates throughout navigation. Further analysis of the reward function form might also help to reduce the number of collisions, which occur even with the most successful solutions. Additionally, the final condition could be specified better, to guarantee a correct orientation of the robot in the target position and enable it to stop at the target point. It is important to mention the instability of training DRL models, which is clearly visible in the learning curves in Figs. 12 and 14. This could be further analyzed by repeating the runs multiple times and analyzing the statistics for each configuration. An interesting aspect to consider is transfer learning, i.e. fine-tuning the YOLO network during the controller training, which could improve the performance of the best solutions. Additionally, the positive results of the 'basic' features-based models suggest that utilizing a pre-trained feature extractor, such as VGG [42] or ResNet [43], in the vision module could lead to training a successful model. Another direction of the research could be applying the solution to other competition tasks; an important test for our solution would be to use the model in a real AUV navigation problem, which would verify its ability to steer a true robot.

Our approach has big development potential. In particular, our model could be extended to process data from additional sensors (such as the robot's other cameras, proximity sensors or sonars);

Fig. 16. Robot paths and velocities acquired by using a trained 'default' model; 'velocity' refers to the controller settings aligned as in Fig. 2; (a) successful run with a satisfactory path; (b) successful run with a distorted path; (c) failed run; position axes are measured in meters.

Abbreviations

The following abbreviations are used in this manuscript:

A2C   Advantage Actor–Critic
AUV   Autonomous Underwater Vehicle
CNN   Convolutional Neural Network
DL    Deep Learning
DRL   Deep Reinforcement Learning
FC    Fully-Connected
IoU   Intersection over Union
LSTM  Long Short-Term Memory
PPO   Proximal Policy Optimization
RL    Reinforcement Learning
YOLO  You Only Look Once
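To make the 'conv-bb-fc' idea discussed above concrete, the sketch below shows one way a visual embedding could be concatenated with the 5-element bounding box vector and reduced by a single Vision FC layer before reaching the Data Processing Module. All dimensions, the activation function, and the bounding-box layout are illustrative assumptions for this example, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_relu(x, w, b):
    """One fully-connected (FC) layer with a ReLU activation."""
    return np.maximum(0.0, x @ w + b)

# Illustrative inputs: a flattened visual embedding from the Vision
# Module's convolutional part, and a 5-element bounding box vector
# (here assumed as [confidence, x, y, width, height]).
embedding = rng.standard_normal(256)
bbox = np.array([0.9, 0.4, 0.5, 0.1, 0.2])

# 'conv-bb-fc'-style fusion: concatenate, then reduce dimensionality
# with a Vision FC layer so the bounding box is not ignored.
fused = np.concatenate([embedding, bbox])          # shape (261,)
w = rng.standard_normal((fused.size, 64)) * 0.01   # toy initialization
b = np.zeros(64)
vision_out = fc_relu(fused, w, b)                  # shape (64,)
print(vision_out.shape)
```

Without the reducing FC layer, the 5 bounding-box entries are dominated by the 256-element embedding, which is consistent with the observation that the extra dimensionality reduction helped the concatenation-based variants most.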
Table 7
Data Processing Module hyperparameters' tuning — the model configuration.

Hyperparameter               Values
Visual feature embeddings    bounding box
Vision FC layers             None
Data Processing FC layers    {[16, 16], [64, 64], [256, 256], [64, 64, 64], [64, 64, 64, 64]}

Fig. 17. Training results: Data Processing Module with 2 hidden FC layers with a varying number of neurons; the plot shows the average reward achieved in testing (performed every n_train_steps of training, according to the training procedure); smoothing was not applied.
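The FC-layer configurations listed in Table 7 can be described simply by a list of layer widths, e.g. [64, 64] or [64, 64, 64]. The sketch below builds such a stack from a width list; the ReLU activation and the toy initialization are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def build_fc_stack(in_dim, widths, seed=0):
    """Create (weight, bias) pairs for a stack of FC layers
    described by a width list such as [64, 64, 64]."""
    rng = np.random.default_rng(seed)
    layers, d = [], in_dim
    for width in widths:
        layers.append((rng.standard_normal((d, width)) * 0.01,
                       np.zeros(width)))
        d = width
    return layers

def forward(x, layers):
    """Apply the FC stack with ReLU activations."""
    for w, b in layers:
        x = np.maximum(0.0, x @ w + b)
    return x

# The configurations examined in the two tuning stages (cf. Table 7).
configs = [[16, 16], [64, 64], [256, 256], [64, 64, 64], [64, 64, 64, 64]]
x = np.ones(32)  # illustrative input dimensionality
for widths in configs:
    out = forward(x, build_fc_stack(32, widths))
    print(widths, out.shape)
```

Encoding the architecture as a plain width list is what makes the two-stage search below straightforward: stage one varies the width with the depth fixed at 2, and stage two varies the depth with the width fixed at 64.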
To choose the optimal Data Processing Module structure, especially the number of FC layers and the number of neurons in each layer, we investigated how they influence the process of training and the model's performance. Here, we utilized the complex reward function form from Eq. (32). The model's hyperparameters were set as in Table 7, with the default ones as in Table 3.

The examination was conducted in two stages. First, we compared models with 2 hidden layers in the Data Processing Module, with a varying number of neurons. We researched models with 2 × 16, 2 × 64 and 2 × 256 neurons. The results of the training are shown in Fig. 17. It is visible that each model is able to learn how to act in the environment and achieve high values of mean reward in the training phase.

The best of the models saved during the training were then evaluated. The results of the evaluation are shown in Fig. 18. One can notice that all models achieve similar results in the evaluation phase as well. However, the '2x64' model reaches the best performance: it achieves the highest value of average reward both during training and evaluation, with the highest fraction of successes and mean episode length.

The second stage relied on examining various numbers of hidden layers in the Data Processing Module, each containing the same number of neurons. The previous part shows that the model with two 64-neuron layers performs best, which is why we decided to test configurations of 2, 3 and 4 layers with 64 neurons in each

Acknowledgments
References
[12] R. Yu, Z. Shi, C. Huang, T. Li, Q. Ma, Deep reinforcement learning based optimal trajectory tracking control of autonomous underwater vehicle, in: 2017 36th Chinese Control Conference (CCC), 2017, pp. 4958–4965, http://dx.doi.org/10.23919/ChiCC.2017.8028138.
[13] Y. Huo, Y. Li, X. Feng, Model-free recurrent reinforcement learning for AUV horizontal control, IOP Conf. Ser.: Mater. Sci. Eng. 428 (2018) 012063, http://dx.doi.org/10.1088/1757-899x/428/1/012063.
[14] C. Wang, L. Wei, Z. Wang, M. Song, N. Mahmoudian, Reinforcement learning-based multi-AUV adaptive trajectory planning for under-ice field estimation, Sensors 18 (11:3859) (2019).
[15] S. You, M. Diao, L. Gao, F. Zhang, H. Wang, Target tracking strategy using deep deterministic policy gradient, Appl. Soft Comput. 95 (2020) 106490, http://dx.doi.org/10.1016/j.asoc.2020.106490, URL http://www.sciencedirect.com/science/article/pii/S1568494620304294.
[16] C. Yan, X. Xiang, C. Wang, Towards real-time path planning through deep reinforcement learning for a UAV in dynamic environments, J. Intell. Robot. Syst. (2019) http://dx.doi.org/10.1007/s10846-019-01073-3.
[17] Y. Sun, X. Ran, G. Zhang, H. Xu, X. Wang, AUV 3D path planning based on the improved hierarchical deep Q network, J. Mar. Sci. Eng. 8 (2) (2020) http://dx.doi.org/10.3390/jmse8020145, URL https://www.mdpi.com/2077-1312/8/2/145.
[18] X. Zhou, Y. Gao, L. Guan, Towards goal-directed navigation through combining learning based global and local planners, Sensors (Basel) (2019) http://dx.doi.org/10.3390/s19010176, URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6339171/.
[19] Z. Wang, N. de Freitas, M. Lanctot, Dueling network architectures for deep reinforcement learning, 2015, CoRR abs/1511.06581. arXiv:1511.06581. URL http://arxiv.org/abs/1511.06581.
[20] T.D. Kulkarni, K.R. Narasimhan, A. Saeedi, J.B. Tenenbaum, Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation, in: Proceedings of the 30th International Conference on Neural Information Processing Systems, in: NIPS'16, Curran Associates Inc., Red Hook, NY, USA, 2016, pp. 3682–3690.
[21] X. Cao, C. Sun, M. Yan, Target search control of AUV in underwater environment with deep reinforcement learning, IEEE Access 7 (2019) 96549–96559, http://dx.doi.org/10.1109/ACCESS.2019.2929120.
[22] V. Mnih, A. Puigdomènech Badia, M. Mirza, A. Graves, T.P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, 2016, arXiv e-prints arXiv:1602.01783.
[23] M. Sporyshev, A. Scherbatyuk, Reinforcement learning approach for cooperative AUVs in underwater surveillance operations, in: 2019 IEEE Underwater Technology (UT), 2019, pp. 1–4, http://dx.doi.org/10.1109/UT.2019.8734293.
[24] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, 2017, arXiv e-prints arXiv:1707.06347.
[25] Y. Liu, F. Wang, Z. Lv, K. Cao, Y. Lin, Pixel-to-action policy for underwater pipeline following via deep reinforcement learning, in: 2018 IEEE International Conference of Intelligent Robotic and Control Engineering (IRCE), 2018, pp. 135–139, http://dx.doi.org/10.1109/IRCE.2018.8492943.
[26] Y. Sun, J. Cheng, G. Zhang, H. Xu, Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning, J. Intell. Robot. Syst. (2019) http://dx.doi.org/10.1007/s10846-019-01004-2.
[27] S.T. Havenstrøm, A. Rasheed, O. San, Deep reinforcement learning controller for 3D path following and collision avoidance by autonomous underwater vehicles, Front. Robot. AI 7 (2021) 211, http://dx.doi.org/10.3389/frobt.2020.566037, URL https://www.frontiersin.org/article/10.3389/frobt.2020.566037.
[28] F. Zhang, J. Leitner, M. Milford, B. Upcroft, P.I. Corke, Towards vision-based deep reinforcement learning for robotic motion control, 2015, ArXiv abs/1511.03791.
[29] L. Xie, S. Wang, A. Markham, N. Trigoni, Towards monocular vision based obstacle avoidance through deep reinforcement learning, in: Robotics: Science and Systems Workshop 2017: New Frontiers for Deep Learning in Robotics; Conference Date: 15-07-2017 through 15-07-2017, 2017.
[30] J. Kulhanek, E. Derner, T. de Bruin, R. Babuska, Vision-based navigation using deep reinforcement learning, in: 2019 European Conference on Mobile Robots (ECMR), IEEE, 2019, http://dx.doi.org/10.1109/ecmr.2019.8870964.
[31] J. Roghair, K. Ko, A.E.N. Asli, A. Jannesari, A vision based deep reinforcement learning algorithm for UAV obstacle avoidance, 2021, arXiv:2103.06403.
[32] J. Hua, L. Zeng, G. Li, Z. Ju, Learning for a robot: Deep reinforcement learning, imitation learning, transfer learning, Sensors 21 (4) (2021) http://dx.doi.org/10.3390/s21041278, URL https://www.mdpi.com/1424-8220/21/4/1278.
[33] W. Zhao, J.P. Queralta, T. Westerlund, Sim-to-real transfer in deep reinforcement learning for robotics: a survey, in: 2020 IEEE Symposium Series on Computational Intelligence (SSCI), 2020.
[34] J. Ibarz, J. Tan, C. Finn, M. Kalakrishnan, P. Pastor, S. Levine, How to train your robot with deep reinforcement learning: lessons we have learned, Int. J. Robot. Res. (2021) http://dx.doi.org/10.1177/0278364920987859.
[35] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, second ed., The MIT Press, 2018, URL http://incompleteideas.net/book/the-book-2nd.html.
[36] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533, http://dx.doi.org/10.1038/nature14236.
[37] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788, http://dx.doi.org/10.1109/CVPR.2016.91.
[38] A. Juliani, V.-P. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, D. Lange, Unity: A general platform for intelligent agents, 2018, ArXiv abs/1809.02627.
[39] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015, Software available from tensorflow.org. URL http://tensorflow.org/.
[40] A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, Stable baselines, 2018, https://github.com/hill-a/stable-baselines.
[41] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, P. Zhokhov, OpenAI baselines, 2017, https://github.com/openai/baselines.
[42] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015, URL http://arxiv.org/abs/1409.1556.
[43] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778, http://dx.doi.org/10.1109/CVPR.2016.90.