
International Journal of Control, Automation, and Systems 21(2) (2023) 563-574
ISSN: 1598-6446, eISSN: 2005-4092
http://dx.doi.org/10.1007/s12555-021-0642-7
http://www.springer.com/12555

Navigation of Mobile Robots Based on Deep Reinforcement Learning: Reward Function Optimization and Knowledge Transfer

Weijie Li, Ming Yue*, Jinyong Shangguan, and Ye Jin

Abstract: This paper presents an end-to-end online learning navigation method based on deep reinforcement learning (DRL) for mobile robots, whose objective is to enable a mobile robot to avoid obstacles and reach the target point in an unknown environment. Specifically, double deep Q-networks (Double DQN), dueling deep Q-networks (Dueling DQN), and prioritized experience replay (PER) are combined into the prioritized experience replay-double dueling deep Q-network (PER-D3QN) algorithm to realize high-efficiency navigation of mobile robots. Moreover, considering the problem of sparse rewards in the traditional reward function, an artificial potential field is introduced into the reward function to guide the robot to fulfill the navigation task through the change of potential energy. Furthermore, in order to accelerate the training of mobile robots in complex environments, a knowledge transfer training method is proposed, which migrates the knowledge learned in a simple environment to a complex one and continues learning on the basis of the prior knowledge. Finally, the performance is validated in a three-dimensional simulator, which shows that the mobile robot can obtain higher rewards, achieve higher success rates, and require less time for navigation, indicating that the proposed approaches are feasible and efficient.

Keywords: Deep reinforcement learning (DRL), knowledge transfer, mobile robot, navigation, reward function.

1. INTRODUCTION

Autonomous navigation of mobile robots is an important ability which ensures that the platform reaches the target point from the starting point without collision in an environment with a number of obstacles. Typically, the steps of traditional navigation methods include simultaneous localization and mapping (SLAM) [1], trajectory planning [2] and tracking control [3,4]. However, SLAM is time-consuming and requires high accuracy and density of LIDAR measurements. Under the condition of sparse range information and no obstacle map, it is still challenging for mobile robots to complete autonomous navigation. As a result, the novel navigation approach of end-to-end online learning based on deep reinforcement learning (DRL) has attracted extensive attention from scholars. Nevertheless, the performance of traditional DRL algorithms, such as deep Q-networks (DQN), is often affected by the over-estimation of Q-values and other problems [5]. At the same time, if the reward function is not designed properly, the performance of the algorithm will be greatly reduced. In addition, when mobile robots are trained in complex environments, the training will be more time-consuming because it is difficult to obtain positive rewards. The above problems cause difficulties in the navigation of mobile robots, thus it is of great significance to explore them.

Due to its demonstrated power in complex estimation problems, deep learning has been applied in robotics, such as indoor navigation [6-8], robot manipulators [9,10] and optimized control [11]. However, the above deep learning-based methods all heavily rely on labelled data, which is time-consuming to collect, so it makes sense to find methods that do not require labelled data. In this case, reinforcement learning [12-14], which can reduce the burden of collecting labelled data by interacting with the environment to obtain training samples, has been widely used in the field of robotics. Nonetheless, traditional reinforcement learning algorithms such as Q-learning cannot handle tasks with continuous states. For this reason, DQN, a DRL algorithm which solves the continuous state problem by replacing the Q-table in Q-learning with a deep neural network, is used for mobile robots to perform path planning [15] and autonomous navigation [16], among others.

Manuscript received July 30, 2021; revised December 14, 2021 and February 27, 2022; accepted March 12, 2022. Recommended by
Associate Editor Jongho Lee under the direction of Editor Hyo-Sung Ahn. This work was supported by the National Natural Science
Foundation of China (Nos. U2013601 and 61873047), the Fundamental Research Funds for the Central Universities (DUT22GF205) and the
National Key Research and Development Program (2022YFB4701302).

Weijie Li, Ming Yue, Jinyong Shangguan, and Ye Jin are with the School of Automotive Engineering, Dalian University of Technology,
Dalian 116024, China (e-mails: weijie45@mail.dlut.edu.cn, yueming@dlut.edu.cn, {shangguanjinyong, jinye98}@mail.dlut.edu.cn). Ming
Yue is also with the Ningbo Institute of Dalian University of Technology, Ningbo 315016, China.
* Corresponding author.

©ICROS, KIEE and Springer 2023



Nevertheless, there is a problem of over-estimation of Q-values in the original DQN. To address this difficulty, Van Hasselt et al. [5] proposed the double DQN algorithm, which settles the matter by separating action selection from value estimation when calculating the target Q-values. In addition, Schaul et al. [17] adopted the method of prioritized experience replay (PER) in the sample extraction process and improved the efficiency of training by setting priorities for samples. What's more, Wang et al. [18] came up with the dueling DQN, which improved the neural network by dividing it into two parts, a value function and an advantage function, for better policy evaluation. To take full advantage of the respective strengths of the above algorithms, double DQN, dueling DQN and PER are integrated in this work to create the prioritized experience replay-double dueling DQN (PER-D3QN), which is applied to the navigation of the mobile robot.

Furthermore, the design of the reward function is crucial in the whole training process of DRL, thus the reward function is also a focus of research. To address the issue of sparse rewards, Pathak et al. [19] used curiosity as an intrinsic reward signal for the agent to actively explore the environment and learn a better policy. However, this method is hard to use in a large-scale environment. In such cases, Wu et al. [20] took random network distillation rewards as an intrinsic reward, which is effective in large-scale environments. However, manually designing the reward function still poses puzzles. To this end, Dong et al. [21] brought up a method based on the Lyapunov function to design the reward function, which guides reward shaping without handcrafted engineering. What's more, a number of researchers have attempted to study reward-free exploration methods for reinforcement learning [22,23]. In the exploration stage, the agent collects samples without using a pre-specified reward function; the reward function is then given and the agent uses the samples to compute a nearly optimal strategy. Considering the flexibility and ease of use of the artificial potential field and the inspiration from the above works, the artificial potential field is introduced into the reward function here. The potential energy change of each step for the mobile robot is used as a reward to guide the mobile robot towards the target point.

In addition, the training of the model is not only affected by the algorithm, but also related to the environment. When the agent is trained in a relatively complex environment, there may be the problem of slow or even difficult convergence. To address this problem, curriculum learning [24-26] is applied to the training process of agents, which assigns different weights to training samples according to their difficulty. Unfortunately, the setting of weights is a difficult issue. In this case, Florensa et al. [27] decomposed the complex problem into multiple subproblems and searched the maze path with dynamic programming. In fact, these methods still do not take into account the use of prior knowledge and still require more time for training. Hence, transfer learning has received a lot of attention because of its ability to save training resources. For example, Gao et al. [28] proposed an incremental training method which transferred knowledge from a 2D to a 3D environment, thus improving the efficiency of training and the convergence of the model. Similarly, Liu and Jin [29] designed a transfer reinforcement learning method for multi-stage knowledge acquisition, which combined transfer and reinforcement learning methods to obtain collision avoidance knowledge more effectively. Inspired by the above works, a knowledge transfer training method is put forward. After the training of the mobile robot in a relatively simple environment, its model and parameters are transferred to a similar but more complex environment, and the training starts from this point, so as to accelerate the training in the complex environment.

To sum up, the main contributions of this paper are as follows:

1) Double DQN, dueling DQN and PER are combined to form the PER-D3QN algorithm and applied to the navigation of the mobile robot, by which the performance of navigation can be effectively enhanced.

2) The artificial potential field variation is employed in the reward function and taken as a reward to guide the mobile robot to move towards the target point, through which the training time can be reduced.

3) A knowledge transfer training method is introduced to take advantage of prior knowledge, by which the convergence speed of the algorithm in a relatively complex environment can be improved.

The rest of this paper is organized as follows: Section 2 improves the reward function and designs the navigation model of the mobile robot. Section 3 introduces the algorithm and the training and learning of the model in detail. The effectiveness of the proposed methods is verified by simulation in Section 4. Finally, Section 5 concludes this paper.

2. REWARD FUNCTION OPTIMIZATION AND NAVIGATION MODEL DESIGN

2.1. Problem statement

The objective of this study is to make the mobile robot capable of autonomous navigation, so that the mobile robot can avoid unknown obstacles and reach the target point. The navigation problem of the mobile robot can be regarded as a series of Markov decision processes (MDP), which is shown in Fig. 1.

Fig. 1. Schematic diagram of MDP.

The MDP can be represented by a quintuple (S, A, R, P, γ), where S and A are the state space and action space of the system, respectively; R denotes the reward function; P represents the state transition probability of the system after action a is picked; γ is a discount factor, reflecting that the influence of the current decision on the future is gradually reduced. In an MDP, the mapping from a state to an action is based on a strategy π(a | s). The best strategy is the one that obtains the highest reward from any initial state. Given a strategy π, the Q-value is defined as the expectation of the cumulative discounted reward for taking an action a in a state s, and the formula is expressed as follows:

Q_{\pi}(s, a; \theta) = E\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right], (1)

where Q_{\pi}(s, a; \theta) represents the Q-value corresponding to action a under state s based on strategy π, and θ denotes the parameters of the network; t represents the iteration time; r(s_t, a_t) denotes the reward obtained by the agent for picking action a_t in state s_t; s_0 and a_0 are the initial state and initial action, respectively.

The aim of the MDP is to find a strategy that maximizes the cumulative discounted reward. As a consequence, the Q-learning algorithm can find the near-optimal strategy by iterating to the optimal Q-value through the Bellman formula as follows:

Q^{*}(s_t, a_t; \theta) = r(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta), (2)

where Q^{*}(s_t, a_t; \theta) represents the optimal Q-value corresponding to action a_t in state s_t, and \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta) denotes the maximum Q-value at state s_{t+1}.

In this work, the state information is composed of the sensing information of the LIDAR and the position of the robot, and the action space is the displacement in five different directions. Through the design of the above state space and action space, the goal is to find a nearly optimal strategy through iterative training of the network under the guidance of the reward function, so as to obtain higher rewards and improve the success rate of navigation.

2.2. Reward function optimization

After the agent selects and performs an action, the environment gives corresponding feedback information according to the reward function to judge whether the action was good. The reward function is often designed on the grounds of the environment and task. A reward is a scalar quantity, with a positive value representing a reward and a negative value denoting a penalty.

As for the navigation problem of mobile robots, the reward function can guide the behavior of the agent, and its design is critical to both the effectiveness and efficiency of model training. For the traditional reward function, the agent usually gets a fixed positive reward value when the task is completed, and a fixed negative reward value when it fails. In this regard, rewards are given only at the end of the task, which leads to the problem of reward sparsity. As a result, it is tricky to quantitatively understand the influence of actions selected during task execution on the agent's future, and thus the training of the model is slow and the optimal strategy is difficult to find. The traditional reward function can be expressed as follows:

r = \begin{cases} r_{+}, & d_g < l_a, \\ r_{-}, & d_o < l_c, \end{cases} (3)

where r is the reward value obtained; r+ is a fixed positive reward value and r− is a fixed negative reward value; dg is the distance of the mobile robot to the target point, and do is the minimum distance of the mobile robot to the obstacles; la is the target point threshold, and lc is the collision threshold. Notice that when dg is less than la, it is believed that the target point has been reached and the navigation is successful; when do is less than lc, it suggests that a collision has occurred and the navigation fails. In addition, the size of the robot itself needs to be taken into account when determining la and lc. For la, an additional allowable error needs to be attached, since it is difficult for the robot to reach a point precisely.

To address the problem of slow model training caused by the traditional reward function, this paper optimizes the reward function and puts forward a new form of reward function. The proposed reward function is divided into two parts: the first part gives a fixed reward when the mobile robot reaches a final state. The mobile robot gets a large positive reward after successfully navigating to the target point; when the mobile robot collides, the navigation fails and the reward function gives a large negative reward. The second part gives a small reward value according to the state change after the mobile robot selects an action and reaches the next state. Here, the artificial potential field is introduced, and the change in potential energy of the mobile robot at each step is used as a reward to guide the mobile robot towards the target point. The designed artificial potential field is as follows:

U = U_a(d_g) + U_r(d_o), (4)

where Ua and Ur are the attractive and repulsive potential fields, respectively, which can be determined by

U_a(d_g) = \frac{1}{2}\mu d_g^{2}, (5)

U_r(d_o) = \begin{cases} \frac{1}{2}\sigma\left(1 - \frac{d_o}{l_r}\right), & d_o \le l_r, \\ 0, & d_o > l_r, \end{cases} (6)

where μ and σ are the attractive and repulsive gains, respectively, and lr is the range threshold of the repulsive potential field, whose appropriate values are chosen by a trial-and-error method. Note that if do exceeds this threshold, the mobile robot will not be affected by the repulsive potential field. What's more, in the case of known obstacles, all obstacles in the vicinity of the agent need to be considered when calculating the repulsive potential field. However, the navigation task in this study is based on an unknown environment. Therefore, the adopted approach is to calculate the repulsive potential field from the distance of the agent to the nearest obstacle (i.e., the minimum distance detected by the LIDAR).

Finally, the presented reward function is designed as follows:

r = \begin{cases} r_{+}, & d_g < l_a, \\ r_{-}, & d_o < l_c, \\ U_{t-1} - U_t, & \text{otherwise}, \end{cases} (7)

where Ut−1 and Ut are the total potential energies of the previous and current steps, respectively. For the determination of the values of r+ and r−, two aspects could be considered: 1) r+ should be slightly larger than r−, which ensures that the robot actively learns to reach the target point. 2) The size of the potential field change should be taken into account. If r+ and r− are too large, the effect of the potential field will be greatly diminished, and if they are too small, the ability to guide the robot to avoid obstacles and converge to the target point will be weakened. To express this clearly, the flow chart of reward calculation is shown in Fig. 2, where goal and collide are the markers for the arrival at the target point and the collision, respectively.

Fig. 2. Flow chart of reward calculation.

2.3. State and action space

The state space is used to describe the information of the mobile robot in the environment. In DRL, the state is usually used as the input of the model. During the process of navigation, the position of the mobile robot relative to the target and obstacles is always changing, which is uncertain. The state input into the model in this study comes from the real-time sensing information of the LIDAR and odometer, including the position of the mobile robot relative to surrounding objects and azimuth information. Fig. 3 demonstrates the schematic diagram of the state space of the mobile robot. In the figure, it is assumed that the coordinate of the mobile robot is (xr, yr), the pentagram represents the target point whose coordinate is (xg, yg), and the circle denotes the nearest obstacle to the mobile robot detected by the LIDAR. Specifically, the defined state space includes: 24 pieces of distance information dL = [d1, d2, ..., d24]^T obtained uniformly by the LIDAR in all directions; the angle φ0 between the robot orientation and the line connecting the robot to the target point; the distance dg between the robot and the target point; the minimum distance do detected by the LIDAR; and the index n which corresponds to the minimum distance detected by the LIDAR. Regarding the dimensionality of the detection information of the LIDAR, a moderate value is chosen in order to reduce the computational effort while ensuring satisfactory detection results. Then the state space can be expressed as

S = \left[ d_L^{T},\ \varphi_0,\ d_g,\ d_o,\ n \right]^{T}. (8)

Fig. 3. Schematic diagram of the state space of the mobile robot.

On the other hand, the action space is used to depict the actions that the agent needs to implement in the environment. The outputs of the network are the corresponding Q-values of the action space. Theoretically, the number of actions in the action space can be chosen arbitrarily, but in practice, an excessive number of actions might lead to a surge in computation and thus excessive training time. If the number of actions is set too small, the navigation route of the mobile robot will become insufficiently smooth and will deviate noticeably from a practical navigation route. In order to avoid the above problems, the number of actions in the action space needs to be reasonably designed. Through continuous trials, the action space designed in this case includes five actions. The linear velocity of the mobile robot is fixed as υ0, and the angular velocities corresponding to the different actions are (2ω0, ω0, 0, −ω0, −2ω0). The directions corresponding to the five actions are as follows: large left turn, small left turn, forward, small right turn and large right turn. Therefore, the action space can be expressed as

A = [(\upsilon_0, 2\omega_0), (\upsilon_0, \omega_0), (\upsilon_0, 0), (\upsilon_0, -\omega_0), (\upsilon_0, -2\omega_0)]^{T}. (9)

2.4. Exploration strategy

For the exploration-exploitation dilemma, ε-greedy is introduced as the exploration strategy, which improves on the greedy strategy, where ε ∈ [0, 1] is an exploration factor. The greedy policy selects only the action that maximizes the value function each time. This strategy only considers exploitation rather than exploration, so the values of untried actions are hard to update and such actions will not be picked when an action is chosen. Obviously, the greedy strategy will be trapped in a local optimum with high probability, such that the optimal strategy cannot be found. On the contrary, the ε-greedy strategy gives consideration to both exploration and exploitation. An action is randomly selected by the agent with probability ε, and the action corresponding to the maximum Q-value, argmax_a Q(s, a; θ), is selected with probability 1 − ε. As the training goes on, ε decreases, which means more exploration in the early part of the training and more exploitation in the later part. Here, the mobile robot tries more different actions to update the Q-values in the early stage of training, so as to find a better policy. In the later stage of training, the whole navigation process needs to be considered to maximize the expected value. Finally, the ε-greedy strategy can be expressed as follows:

a = \begin{cases} \text{random}, & c < \varepsilon, \\ \arg\max_{a} Q(s, a; \theta), & \text{otherwise}, \end{cases} (10)

where c ∈ [0, 1] is a random number. When c falls within [0, ε), the action is selected randomly, and when it falls within [ε, 1], it is selected according to the agent's own knowledge.

3. DRL-BASED NAVIGATION IMPLEMENTATION

3.1. PER-D3QN algorithm

3.1.1 Preliminary

To realize the navigation of the mobile robot, Double DQN, Dueling DQN and PER are combined to form the PER-D3QN algorithm. First, for the DQN algorithm, since a maximum operation is included in estimating the action value, the agent tends to overestimate the Q-values. To solve this problem, Double DQN is proposed, which eliminates the overestimation problem by decoupling the two steps of selecting the action corresponding to the target Q-value and calculating the target Q-value. In the case of Dueling DQN, the dueling architecture is introduced into the network, which contains a value function part and an advantage function part, where the value is related to the state only and the advantage is associated with both the state and the action. The final output of the network is a combination of value and advantage, which can effectively improve the convergence efficiency of the network. As far as PER is concerned, the temporal-difference error is used as the priority of the sample when storing the experience. In this way, unlike the random sampling in DQN, PER draws samples from the experience replay buffer based on their priorities, which can accelerate the training of the model.

3.1.2 Algorithm implementation

Fig. 4 illustrates the schematic diagram of the navigation learning process of the mobile robot based on the PER-D3QN algorithm. The mobile robot inputs the state information into the current value network and chooses the action corresponding to the maximum Q-value, argmax_a Q(s, a; θ), according to the output of the current value network. The robot performs the action and gets a reward. Then, the current state s, the selected action a, the state of the next moment s′ and the reward r are put into the experience replay buffer as a sample of experience. During network training, sample data are selected according to the sample priorities, and the parameters of the current value network are copied to the target value network every n time steps. Finally, the loss function is calculated using the Q-values of the current and the target value networks. The parameters of the current value network will be close to optimal after sufficient training.
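A minimal sketch of the target-value and priority computation described above and in Algorithm 1: the greedy action for the next state is selected with the current network θ and evaluated with the target network θ⁻, and the absolute temporal-difference error serves as the PER priority. The batched, array-based interface is an assumption for illustration.

```python
import numpy as np

GAMMA = 0.99  # discount factor (Table 1)

def double_dqn_targets(q_online_next, q_target_next, rewards, dones):
    """y = r if the episode ends, otherwise
    y = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    best_actions = np.argmax(q_online_next, axis=1)                      # select with theta
    q_eval = q_target_next[np.arange(len(best_actions)), best_actions]  # evaluate with theta^-
    return rewards + GAMMA * q_eval * (1.0 - dones)

def per_priorities(targets, q_online_taken, eps=1e-6):
    """PER priority: absolute TD error, with a small constant to keep it non-zero."""
    return np.abs(targets - q_online_taken) + eps
```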

Fig. 4. Schematic diagram of the navigation learning process of the mobile robot based on the PER-D3QN algorithm.

The network architecture used by the PER-D3QN algorithm is shown in Fig. 5, which includes one input layer, three hidden layers and one output layer. The input layer, which corresponds to the state space, has 28 units. The first two hidden layers are fully connected layers of 64 units. The third hidden layer is a dueling architecture, divided into a layer with one unit and a layer with five units; these two layers estimate the state value and the advantages of each action, respectively. The final output layer, which contains five units, corresponds to the action space and outputs the matched Q-values for each action.

Fig. 5. Network architecture of the PER-D3QN algorithm.
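The paper states that the value and advantage streams are combined into the final Q-values but does not spell out the aggregation term; the sketch below uses the standard Dueling DQN combination of Wang et al. [18], Q(s, a) = V(s) + A(s, a) − mean_a A(s, a), as an assumed formulation.

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Combine the one-unit value stream V(s) and the five-unit advantage
    stream A(s, .) into Q-values using the aggregation of [18]."""
    advantages = np.asarray(advantages, dtype=np.float32)
    return float(value) + advantages - advantages.mean()
```

Subtracting the mean advantage removes the ambiguity between V and A, so the five outputs remain directly comparable Q-values for the five actions.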
The execution process of the navigation model can be divided into the following steps: 1) The mobile robot obtains state information through its sensors. 2) The model is called with the state information as input data, and the action that the mobile robot should perform in the current state is obtained through model calculation. 3) The action is executed to obtain the state at the next moment, and it is judged whether the target point is reached or a collision occurs. 4) The positions of the mobile robot and the target point are reset if a collision occurs, and the position of the target point is reset if the target point is reached; the process then returns to step 1).

The pseudo-code of the PER-D3QN algorithm for navigation of the mobile robot is shown in Algorithm 1.

3.2. Knowledge transfer

When faced with different environments, models often need to be trained from scratch. However, the training of DRL is extremely time-consuming and it may be difficult for the model to converge in a relatively complex environment. To solve this problem, a knowledge transfer training method is put forward. The process of knowledge transfer is shown in Fig. 6. First, the mobile robot is trained in a relatively simple environment. The model converges after a certain period of training, which means that the mobile robot possesses the knowledge of navigation in a simple environment. Then, the parameters of the trained model are migrated to the more complex environment and taken as the initial weights of the network. In this way, the mobile robot is initially equipped with certain knowledge and can conduct exploration and training more efficiently, thus improving the training efficiency. Meanwhile, although the environments are similar, there are still some differences, and the prior knowledge cannot be fully adapted to the new environment, such that it is necessary to reassign some exploration ability to the agent. After several attempts, ε is set to a moderate value of 0.5. This study first trains the mobile robot in a static obstacle environment and then transfers the trained model to a dynamic obstacle environment after the convergence of the model. The mobile robot can learn the ability to avoid obstacles in the dynamic obstacle environment after a short period of training.

Fig. 6. Process of knowledge transfer.
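A sketch of the transfer step of Fig. 6, assuming the networks expose get/set-weight accessors (the exact interface depends on the DRL framework used): the parameters learned in the simple environment initialize both value networks in the complex environment, and the exploration factor is reset to 0.5 before training continues.

```python
def transfer_knowledge(simple_env_net, current_net, target_net, epsilon=0.5):
    """Initialize the networks for the complex environment with the weights
    learned in the simple environment, then restore some exploration ability."""
    weights = simple_env_net.get_weights()   # assumed accessor on the trained model
    current_net.set_weights(weights)         # current value network
    target_net.set_weights(weights)          # target value network starts identical
    return epsilon                           # exploration factor reset to 0.5 (Section 3.2)
```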

Algorithm 1: PER-D3QN algorithm for navigation of the mobile robot.
1: Initialize experience replay buffer D
2: Initialize current value network and target value network with random weights
3: for episode = 1 to M do
4:     Initialize state s
5:     for t = 1 to T do
6:         Input state s into the current value network and use the ε-greedy method to select an action a from the current output Q-values
7:         Execute action a, get the reward r, the next state s′ and the termination flag done
8:         Store experience (s, a, r, s′, done) with priority in the experience replay buffer
9:         s = s′
10:        for j = 1 to K do
11:            Calculate the target value: y_j = r_j if done, otherwise y_j = r_j + γ Q(s_j, argmax_a Q(s_j, a; θ); θ⁻), where done is the marker of the end of the episode
12:            Calculate the temporal-difference error: δ_j = y_j − Q(s_{j−1}, a_{j−1}; θ)
13:            Update the priority of the experience: p_j = |δ_j|
14:            Accumulate weight changes: Δ = Δ + δ_j ∇_θ Q(s_{j−1}, a_{j−1}; θ)
15:        end for
16:        Update the parameters of the current value network: θ = θ + η δ_j ∇Q(s_{j−1}, a_{j−1}; θ)
17:        if t % n = 1 then
18:            Update the parameters of the target value network: θ⁻ = θ
19:        end if
20:    end for
21:    if ε > ε_min then
22:        ε = ε · e_decay
23:    end if
24: end for

Fig. 7. Simulation environments: (a) static obstacle environment (Scenario I); (b) dynamic obstacle environment (Scenario II).

4. SIMULATION RESULTS AND DISCUSSIONS

4.1. Simulation environments and parameter settings

To verify the autonomous navigation ability of the system, two 4 m × 4 m obstacle environments are established in the 3D simulator Gazebo, as shown in Fig. 7: Fig. 7(a) is a static obstacle environment (Scenario I) and Fig. 7(b) is a dynamic obstacle environment (Scenario II). The cylinders with a radius of 0.15 m in the environments represent the obstacles, and those in the dynamic obstacle environment move counterclockwise around the central point of the environment at a certain speed; the square frame represents an enclosure, and hitting the enclosure is equivalent to colliding with an obstacle; the square in the enclosure is the target point; the lines in the enclosure represent the laser beams emitted by the LIDAR.

During the training simulation process, the mobile robot starts from different initial points and navigates to target points randomly generated in the open area of the environment. Through the interaction between the mobile robot and the environment and the restriction of the rules in the navigation process, the quality of the mobile robot's decisions is improved continuously. The parameter settings for the model training during the simulation are shown in Table 1.

4.2. Results and analysis

4.2.1 Results of reward function optimization

Firstly, to validate the effectiveness of the optimized reward function, the training of the mobile robot is completed in Scenario I based on the DQN algorithm. The simulation results are shown in Fig. 8, where the optimized reward function introduces the artificial potential field and guides the mobile robot to move to the target point while avoiding obstacles through the change of potential field. The traditional reward function rewards the robot only at the end of the task. It is difficult to reach the target point by using the traditional reward function in this environment, which leads to too few positive rewards and slow learning of the mobile robot to navigate to the target point.

Table 1. Parameter settings for the model training.


Parameters Values Remarks
Action space size 5 Optional action of the mobile robot
State space size 28 Dimension of the input state
γ 0.99 Discount factor of cumulative reward
ε 1 Exploration rate of action
εmin 0.05 Minimum exploration rate
edecay 0.99 Exploration decay rate of action
Experience replay memory 1000000 Stores historical experience data
Mini-batch size 64 Size of extracted empirical data
Episode length 6000 Maximum steps per episode
Reset target 2000 Update the target network every n steps
α 0.00025 Learning rate of neural network
Hidden layers size 3 Size of hidden layers in networks
Activation function ReLU Activation function
Loss MSE Loss function
Optimizer RMSProp Neural network optimization algorithm
r+ 250 Reward given for reaching the target point
r− −200 Reward given in case of collision
µ 100 Attractive gain
σ 100 Repulsive gain
υ0 0.15 m/s Linear velocity of the mobile robot
ω0 0.75 rad/s Angular velocity of the mobile robot
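One convenient way to carry the settings of Table 1 into code is a single configuration dictionary, as sketched below; the key names are illustrative and not taken from the authors' implementation.

```python
# Hyperparameters of Table 1 gathered in one place (key names are assumptions).
TRAIN_CONFIG = {
    "action_size": 5,            # optional actions of the mobile robot
    "state_size": 28,            # dimension of the input state
    "gamma": 0.99,               # discount factor
    "epsilon": 1.0,              # initial exploration rate
    "epsilon_min": 0.05,         # minimum exploration rate
    "epsilon_decay": 0.99,       # exploration decay rate
    "replay_memory": 1_000_000,  # experience replay capacity
    "batch_size": 64,            # mini-batch size
    "episode_length": 6000,      # maximum steps per episode
    "target_update": 2000,       # target network update interval (steps)
    "learning_rate": 0.00025,    # neural network learning rate
    "r_pos": 250, "r_neg": -200, # terminal rewards
    "mu": 100, "sigma": 100,     # attractive / repulsive gains
    "v0": 0.15, "w0": 0.75,      # linear [m/s] and angular [rad/s] velocities
}
```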

Fig. 8. Rewards for the optimized and the traditional reward functions.

Over the course of 800 episodes of training, the mobile robot learns collision avoidance, but still does not learn well how to converge to the target point due to the lack of positive rewards for reaching it. Through the guidance of the optimized reward function, the mobile robot learns better navigation knowledge in about 300 episodes and is able to avoid obstacles and approach the target point. It should be noted that since the reward functions used are different, we cannot simply compare the magnitudes of the rewards received, but instead judge the ability of each reward function to guide the agent by analyzing the trend of the curve. In the subsequent simulations, the proposed algorithm adopts the optimized reward function.

4.2.2 Results of algorithm improvement

On the other hand, to verify the better performance of the PER-D3QN algorithm, the DQN algorithm and the PER-D3QN algorithm are used to train the navigation of the mobile robot in Scenario I, respectively. As shown in Fig. 9, the convergence rates of the two algorithms are approximately the same, with the models converging in about 400 episodes. The reward of the PER-D3QN algorithm after convergence is about 2500, while that of the DQN algorithm is about 1000 to 1500. This indicates that the mobile robot with the PER-D3QN algorithm successfully navigates to the target point more often during an episode, and it navigates more efficiently with less likelihood of collision.

After training, the models trained for 1000 episodes are tested in Scenario I to compare the navigation success rate and navigation time of the two algorithms. First, to validate the navigation success rate, five rounds of tests are conducted separately, each with 100 navigations, and the success rates of the five rounds are calculated and averaged. The criterion for successful navigation is that the agent reaches the target point without collision. Considering the size of the robot body, the robot is considered to have reached the target point when its distance from the target point is less than 0.2 m. For the test of navigation time, the starting point is fixed at (0, 0) and the target point is fixed at (1.2, 1.2). The tests are conducted in five rounds, with five navigations in each round, and the time taken for each navigation is recorded and the average value calculated.
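For reference, the success-rate and time statistics reported in Tables 2-7 can be computed as in the sketch below; run_navigation is a hypothetical helper that executes one navigation attempt and returns whether it succeeded and how long it took.

```python
def success_rate(run_navigation, rounds=5, trials_per_round=100):
    """Average success rate over 5 rounds of 100 navigations each."""
    rates = []
    for _ in range(rounds):
        successes = sum(1 for _ in range(trials_per_round) if run_navigation()[0])
        rates.append(successes / trials_per_round)
    return sum(rates) / rounds

def mean_navigation_time(run_navigation, rounds=5, trials_per_round=5):
    """Mean time over 5 rounds of 5 runs between the fixed start and target points."""
    times = [run_navigation()[1]
             for _ in range(rounds) for _ in range(trials_per_round)]
    return sum(times) / len(times)
```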

The final test results are shown in Tables 2-4.

Fig. 9. Rewards for the PER-D3QN and the DQN algorithms.

Table 2. Navigation success rates of the DQN and PER-D3QN algorithms.
Examples    Test 1  Test 2  Test 3  Test 4  Test 5  Mean
DQN         83%     86%     92%     85%     87%     86.6%
PER-D3QN    98%     94%     97%     95%     96%     96.0%

Table 3. Navigation time of the DQN algorithm (time/s).
Examples    Test 1   Test 2   Test 3   Test 4   Test 5
Round 1     16.867   16.458   16.454   16.453   17.057
Round 2     16.270   16.662   16.459   16.457   16.542
Round 3     16.666   16.261   16.456   16.257   16.468
Round 4     16.671   16.057   16.454   16.453   15.858
Round 5     16.157   16.455   16.453   16.469   16.456
Mean (time/s)        16.451

Table 4. Navigation time of the PER-D3QN algorithm (time/s).
Examples    Test 1   Test 2   Test 3   Test 4   Test 5
Round 1     13.461   13.052   13.044   12.867   13.644
Round 2     13.261   13.451   13.449   13.653   13.452
Round 3     13.039   13.051   13.046   13.048   13.051
Round 4     13.049   12.851   12.846   13.045   12.849
Round 5     13.053   13.051   13.047   13.052   13.451
Mean (time/s)        13.155

The navigation success rates of the DQN and PER-D3QN algorithms are shown in Table 2, where the average navigation success rate of the DQN algorithm is 86.6% and that of the PER-D3QN algorithm is 96.0%. Compared with the DQN algorithm, the average navigation success rate of the PER-D3QN algorithm is improved by 9.4%. The navigation times of the DQN and PER-D3QN algorithms are shown in Tables 3 and 4, respectively. The average navigation time of the DQN algorithm is 16.451 s and that of the PER-D3QN algorithm is 13.155 s, so the average navigation time of PER-D3QN is reduced by 20.0% compared to DQN. It can be seen that the success rate of navigation can be effectively improved and the time required for navigation can be reduced by drawing samples according to priority, solving the problem of over-estimation, and introducing the dueling architecture into the network.

4.2.3 Results of knowledge transfer

Finally, the effect of the knowledge transfer training method is verified. Firstly, PER-D3QN is used to train the mobile robot in Scenario I. After the convergence of the model, the model trained for 1000 episodes is transferred to Scenario II. The mobile robot is trained based on the migrated model in Scenario II, and the exploration coefficient ε is set to 0.5 to explore the dynamic environment. By contrast, the PER-D3QN algorithm is also adopted to train the mobile robot from scratch in Scenario II. The simulation results are shown in Fig. 10. During the process of knowledge transfer, the mobile robot achieves convergence after 400 episodes of training in Scenario I, and the reward remains stable around 3000. At this time, the mobile robot possesses the navigation ability in Scenario I.

Fig. 10. Rewards for the knowledge transfer training method and the training method without transfer.

At the 1000th episode, due to the transfer to Scenario II, the reward suddenly drops to about 1200. At this point, the mobile robot is equipped with the knowledge from Scenario I, which includes the ability to approach the target point and the ability to avoid obstacles to a certain extent. However, at the beginning of the transfer to Scenario II, the actions chosen by the mobile robot based on the prior knowledge sometimes cannot completely avoid the dynamic obstacles. After 500 episodes of exploration and training, the parameters of the model are adjusted, and the mobile robot is able to navigate in Scenario II. When the mobile robot is trained in Scenario II throughout, it still does not learn to navigate well after 2000 episodes of training, because it is more prone to collide and it is harder to reach the target point, resulting in fewer positive rewards and a lack of positive incentives for the mobile robot. The comparison testifies that the knowledge transfer training method can effectively shorten the training time and accelerate the convergence of the models in complex environments.

What's more, the trained models are tested in Scenario II. The success rates and times of navigation of the model trained for 800 episodes after transfer and the model trained for 1800 episodes without transfer are contrasted in the same way as in the tests of the algorithm performance. The data obtained from the tests are shown in Tables 5-7.

Table 5. Navigation success rates of the training methods without transfer and with transfer.
Examples          Test 1  Test 2  Test 3  Test 4  Test 5  Mean
Without transfer  87%     90%     87%     86%     89%     87.8%
With transfer     94%     91%     95%     97%     95%     94.4%

Table 6. Navigation time of the training method without transfer (time/s).
Examples    Test 1   Test 2   Test 3   Test 4   Test 5
Round 1     17.667   13.453   14.661   12.850   16.670
Round 2     19.270   16.660   18.672   13.452   16.659
Round 3     19.271   16.057   17.256   19.261   15.654
Round 4     14.852   16.661   18.663   17.459   19.265
Round 5     15.859   18.467   18.462   16.659   19.268
Mean (time/s)        16.925

Table 7. Navigation time of the knowledge transfer training method (time/s).
Examples    Test 1   Test 2   Test 3   Test 4   Test 5
Round 1     14.057   14.652   14.263   15.058   14.050
Round 2     14.854   14.655   13.659   14.656   14.054
Round 3     14.853   14.658   14.053   14.260   14.841
Round 4     14.944   14.460   14.853   15.057   14.459
Round 5     14.655   14.057   11.658   15.270   14.052
Mean (time/s)        14.404

The navigation success rates of the training methods without transfer and with transfer are shown in Table 5, where the average navigation success rate is 87.8% for the training method without transfer and 94.4% for that with transfer. The knowledge transfer training method has a 6.6% higher average navigation success rate than the training method without transfer. The navigation times of the training method without transfer and the knowledge transfer training method are shown in Tables 6 and 7, respectively. The average navigation time of the training method without transfer is 16.925 s and that of the knowledge transfer training method is 14.404 s, so the average navigation time of the knowledge transfer training method is reduced by 14.9% compared to the training method without transfer. In summary, the knowledge transfer training method can improve the training efficiency and effectiveness, and thus the mobile robot can learn superior navigation strategies efficiently. On the other hand, it is observed from the simulation results that the rewards decrease slightly after a rough convergence and then reach relative stability. One possible reason for this phenomenon is the overfitting of the neural network.

5. CONCLUSION

In this paper, a mobile robot navigation model based on an improved DQN algorithm is proposed to achieve autonomous navigation in unfamiliar environments. Simulation results demonstrate that the proposed algorithm is able to achieve a higher reward, a higher navigation success rate and a shorter navigation time relative to the traditional DQN algorithm. To solve the problems of slow convergence of the model and the existence of local iterations caused by the traditional reward function, this study introduces an artificial potential field into the reward function. The reward function with the artificial potential field is shown to be effective in improving the training speed and reducing the problem of local iterations by guiding the mobile robot. Finally, a knowledge transfer training method is proposed, and it is shown from the simulation results that the knowledge transfer training method can effectively improve the training efficiency when training in a relatively complex environment, compared to training directly. Also, during the training process, the rewards converge and then drop slightly before remaining stable, which may be due to network overfitting. In subsequent research work, methods to mitigate network overfitting will be explored, and control algorithms to achieve linear and angular velocity tracking control of the mobile robot will be considered.

REFERENCES

[1] M. Tang, Z. Chen, and F. Yin, "An improved adaptive unscented FastSLAM with genetic resampling," International Journal of Control, Automation, and Systems, vol. 19, no. 4, pp. 1677-1690, 2021.
[2] B. Li, T. Acarman, Y. Zhang, L. Zhang, C. Yaman, and Q. Kong, "Tractor-trailer vehicle trajectory planning in narrow environments with a progressively constrained optimal control approach," IEEE Transactions on Intelligent Vehicles, vol. 5, no. 3, pp. 414-425, 2019.
[3] N. Sun, Y. Fang, and H. Chen, "A continuous robust antiswing tracking control scheme for underactuated crane systems with experimental verification," Journal of Dynamic Systems, Measurement, and Control, vol. 138, no. 4, p. 041002, 2016.
[4] Y. Li, S. Tong, and T. Li, "Observer-based adaptive fuzzy tracking control of MIMO stochastic nonlinear systems with unknown control directions and unknown dead zones," IEEE Transactions on Fuzzy Systems, vol. 23, no. 4, pp. 1228-1241, 2014.
[5] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," Proc. of the AAAI Conference on Artificial Intelligence, vol. 30, pp. 2094-2100, 2016.
[6] S. J. Lee, H. Choi, and S. S. Hwang, "Real-time depth estimation using recurrent CNN with sparse depth cues for SLAM system," International Journal of Control, Automation, and Systems, vol. 18, no. 1, pp. 206-216, 2020.
[7] C. Chen, P. Zhao, C. X. Lu, W. Wang, A. Markham, and N. Trigoni, "Deep-learning-based pedestrian inertial navigation: Methods, data set, and on-device inference," IEEE Internet of Things Journal, vol. 7, no. 5, pp. 4431-4441, 2020.
[8] T. Ran, L. Yuan, and J. Zhang, "Scene perception based visual navigation of mobile robot in indoor environment," ISA Transactions, vol. 109, pp. 389-400, 2021.
[9] L. Van Truong, S. D. Huang, V. T. Yen, and P. Van Cuong, "Adaptive trajectory neural network tracking control for industrial robot manipulators with deadzone robust compensator," International Journal of Control, Automation, and Systems, vol. 18, no. 9, pp. 2423-2434, 2020.
[10] S. Jung, "A neural network technique of compensating for an inertia model error in a time-delayed controller for robot manipulators," International Journal of Control, Automation, and Systems, vol. 18, no. 7, pp. 1863-1871, 2020.
[11] Y. Li, Y. Liu, and S. Tong, "Observer-based neuro-adaptive optimized control of strict-feedback nonlinear systems with state constraints," IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 7, pp. 3131-3145, 2021.
[12] H.-S. Ahn, O. Jung, S. Choi, J.-H. Son, D. Chung, and G. Kim, "An optimal satellite antenna profile using reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 41, no. 3, pp. 393-406, 2010.
[13] S. Li, L. Ding, H. Gao, Y.-J. Liu, N. Li, and Z. Deng, "Reinforcement learning neural network-based adaptive control for state and input time-delayed wheeled mobile robots," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 50, no. 11, pp. 4171-4182, 2018.
[14] Y. Li, Y. Fan, K. Li, W. Liu, and S. Tong, "Adaptive optimized backstepping control-based RL algorithm for stochastic nonlinear systems with state constraints and its application," IEEE Transactions on Cybernetics, vol. 52, no. 10, pp. 10542-10555, 2021.
[15] L. Jiang, H. Huang, and Z. Ding, "Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge," IEEE/CAA Journal of Automatica Sinica, vol. 7, no. 4, pp. 1179-1189, 2019.
[16] M. M. Ejaz, T. B. Tang, and C.-K. Lu, "Vision-based autonomous navigation approach for a tracked robot using deep reinforcement learning," IEEE Sensors Journal, vol. 21, no. 2, pp. 2230-2240, 2020.
[17] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," Proc. of 4th International Conference on Learning Representations, ICLR 2016, pp. 1-21, 2016.
[18] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, "Dueling network architectures for deep reinforcement learning," Proc. of International Conference on Machine Learning, PMLR, pp. 1995-2003, 2016.
[19] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, "Curiosity-driven exploration by self-supervised prediction," Proc. of International Conference on Machine Learning, PMLR, pp. 2778-2787, 2017.
[20] K. Wu, H. Wang, M. A. Esfahani, and S. Yuan, "BND*-DDQN: Learn to steer autonomously through deep reinforcement learning," IEEE Transactions on Cognitive and Developmental Systems, vol. 13, no. 2, pp. 249-261, 2021.
[21] Y. Dong, X. Tang, and Y. Yuan, "Principled reward shaping for reinforcement learning via Lyapunov stability theory," Neurocomputing, vol. 393, pp. 83-90, 2020.
[22] N. Sajid, P. J. Ball, T. Parr, and K. J. Friston, "Active inference: Demystified and compared," Neural Computation, vol. 33, no. 3, pp. 674-712, 2021.
[23] C. Jin, A. Krishnamurthy, M. Simchowitz, and T. Yu, "Reward-free exploration for reinforcement learning," Proc. of International Conference on Machine Learning, PMLR, pp. 4870-4879, 2020.
[24] N. Jiang, S. Jin, and C. Zhang, "Hierarchical automatic curriculum learning: Converting a sparse reward navigation task into dense reward," Neurocomputing, vol. 360, pp. 265-278, 2019.
[25] X. Yao, X. Feng, J. Han, G. Cheng, and L. Guo, "Automatic weakly supervised object detection from high spatial resolution remote sensing images via dynamic curriculum learning," IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 1, pp. 675-685, 2020.
[26] C. Gong, D. Tao, S. J. Maybank, W. Liu, G. Kang, and J. Yang, "Multi-modal curriculum learning for semi-supervised image classification," IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3249-3260, 2016.

[27] C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel, "Reverse curriculum generation for reinforcement learning," Proc. of Conference on Robot Learning, PMLR, pp. 482-495, 2017.
[28] J. Gao, W. Ye, J. Guo, and Z. Li, "Deep reinforcement learning for indoor mobile robot path planning," Sensors, vol. 20, no. 19, p. 5493, 2020.
[29] X. Liu and Y. Jin, "Reinforcement learning-based collision avoidance: Impact of reward function and knowledge transfer," Artificial Intelligence for Engineering Design, Analysis and Manufacturing, vol. 34, no. 2, pp. 207-222, 2020.

Weijie Li received his B.S. degree in ve-


hicle engineering from Nanchang Univer-
sity, Nanchang, China, in 2020. He is cur-
rently pursuing a master's degree with the
Department of Automotive Engineering,
Dalian University of Technology, Dalian,
China. His main research interests in-
clude autonomous navigation and con-
trol approaches with application to mobile
robots.

Ming Yue received his B.S. degree from


Zhengzhou University, in 1998, and his
M.S. and Ph.D. degrees from Harbin
Institute of Technology, China, in 2004
and 2008, respectively. He was a visiting
scholar with Aerospace Engineering of
University of Michigan, USA, from 2013
to 2014. Currently, he is a Full Professor
with the Department of Automotive Engi-
neering of Dalian University of Technology, China. His main
research interests include control approaches, such as adaptive
control, sliding mode control, model predictive control, and
so on, with application to wheeled mobile robots, intelligent
vehicles, and complicated mechatronic systems.

Jinyong Shangguan received his B.S. and M.S. degrees from Liaocheng University, Liaocheng, China, in 2017 and 2019, respectively. He is currently working toward a Ph.D. degree with the Department of Automotive Engineering of Dalian University of Technology, Dalian, China. His current research interests include energy management, observer design, system-level optimization, vehicle dynamics, and intelligent control for battery electric buses.

Ye Jin received his B.S. degree from Dalian Jiaotong University, in 2020, and currently, he is working as a postgraduate student in Mechanics at Dalian University of Technology, Dalian, China. His main research interests include path planning, such as the A-star algorithm, the dynamic window algorithm, and so on, with application to wheeled mobile robots.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
