http://dx.doi.org/10.1007/s12555-021-0642-7 http://www.springer.com/12555
Abstract: This paper presents an end-to-end online learning navigation method based on deep reinforcement learning (DRL) for mobile robots, which enables a mobile robot to avoid obstacles and reach the target point in an unknown environment. Specifically, double deep Q-networks (Double DQN), dueling deep Q-networks (Dueling DQN), and prioritized experience replay (PER) are combined to form the prioritized experience replay-double dueling deep Q-networks (PER-D3QN) algorithm, which realizes high-efficiency navigation of mobile robots. Moreover, considering the problem of sparse rewards in the traditional reward function, an artificial potential field is introduced into the reward function to guide robots to fulfill the navigation task through the change of potential energy. Furthermore, in order to accelerate the training of mobile robots in complex environments, a knowledge transfer training method is proposed, which migrates the knowledge learned in a simple environment to a complex one and continues learning quickly on the basis of this prior knowledge. Finally, the performance is validated in a three-dimensional simulator, which shows that the mobile robot obtains higher rewards, achieves higher navigation success rates, and requires less navigation time, indicating that the proposed approaches are feasible and efficient.
Keywords: Deep reinforcement learning (DRL), knowledge transfer, mobile robot, navigation, reward function.
Manuscript received July 30, 2021; revised December 14, 2021 and February 27, 2022; accepted March 12, 2022. Recommended by
Associate Editor Jongho Lee under the direction of Editor Hyo-Sung Ahn. This work was supported by the National Natural Science
Foundation of China (Nos. U2013601 and 61873047), the Fundamental Research Funds for the Central Universities (DUT22GF205) and the
National Key Research and Development Program (2022YFB4701302).
Weijie Li, Ming Yue, Jinyong Shangguan, and Ye Jin are with the School of Automotive Engineering, Dalian University of Technology,
Dalian 116024, China (e-mails: weijie45@mail.dlut.edu.cn, yueming@dlut.edu.cn, {shangguanjinyong, jinye98}@mail.dlut.edu.cn). Ming
Yue is also with the Ningbo Institute of Dalian University of Technology, Ningbo 315016, China.
* Corresponding author.
…values in the original DQN. To solve this difficulty, Van Hasselt et al. [5] proposed the double DQN algorithm, which settles the matter by separating action selection from value estimation when calculating the Q-values. In addition, Schaul et al. [17] adopted prioritized experience replay (PER) in the sample extraction process and improved the efficiency of training by assigning priorities to samples. Furthermore, Wang et al. [18] came up with the dueling DQN, which improves the neural network by dividing it into two parts, a value function and an advantage function, for better policy evaluation. To take full advantage of the respective strengths of the above algorithms, double DQN, dueling DQN, and PER are integrated in this work to create the prioritized experience replay-double dueling DQN (PER-D3QN), which is applied to the navigation of the mobile robot.

Furthermore, the design of the reward function is also crucial in the whole training process of DRL, so the reward function has likewise been a focus of research. To address the issue of sparse rewards, Pathak et al. [19] used curiosity as an intrinsic reward signal for the agent to actively explore the environment and learn a better policy; however, this method is hard to use in a large-scale environment. For such cases, Wu et al. [20] took random network distillation rewards as the intrinsic reward, which is effective in large-scale environments. Even so, manually designing the reward function still poses puzzles. To this end, Dong et al. [21] brought up a method based on the Lyapunov function to design the reward function, which guides the reward shaping without handcrafted engineering. Moreover, a number of researchers have attempted to study reward-free exploration methods for reinforcement learning [22,23]: in the exploration stage, the agent collects samples without using a pre-specified reward function; the reward function is then given, and the agent uses the collected samples to compute a nearly optimal strategy. Considering the flexibility and ease of use of the artificial potential field, and inspired by the above works, the artificial potential field is drawn into the reward function here. The potential energy change at each step of the mobile robot is used as a reward to guide the mobile robot towards the target point.

In addition, the training of the model is affected not only by the algorithm but also by the environment. When the agent is trained in a relatively complex environment, convergence may be slow or may even fail. To figure out this problem, curriculum learning [24-26] has been applied to the training process of agents, which assigns different weights to training samples according to their difficulty. Unfortunately, the setting of the weights is itself a difficult issue. In this case, Florensa et al. [27] decomposed the complex problem into multiple subproblems and searched the maze path according to dynamic programming. In fact, these methods still do not take into account the use of the prior knowledge and still require more time for training. Hence, transfer learning has received a lot of attention because of its ability to save training resources. For example, Gao et al. [28] proposed an incremental training method which transferred knowledge from a 2D to a 3D environment, thus improving the efficiency of training and the convergence of the model. Similarly, Liu and Jin [29] designed a transfer reinforcement learning method for multi-stage knowledge acquisition, which combined transfer and reinforcement learning methods to obtain collision avoidance knowledge more effectively. Inspired by the above works, a knowledge transfer training method is put forward here: after the mobile robot is trained in a relatively simple environment, its model and parameters are transferred to a similar but more complex environment, and training starts from this point, so as to accelerate the training in the complex environment.

To sum up, the main contributions of this paper are as follows:

1) Double DQN, dueling DQN, and PER are combined to form the PER-D3QN algorithm and applied to the navigation of the mobile robot, by which the performance of navigation can be effectively enhanced.

2) The variation of the artificial potential field is employed in the reward function and taken as a reward to guide the mobile robot towards the target point, through which the training time can be reduced.

3) A knowledge transfer training method is introduced to take advantage of prior knowledge, by which the convergence speed of the algorithm in a relatively complex environment can be improved.

The rest of this paper is organized as follows: Section 2 improves the reward function and designs the navigation model of the mobile robot. Section 3 introduces the algorithm and the training and learning of the model in detail. The effectiveness of the proposed methods is verified by simulation in Section 4. Finally, Section 5 concludes this paper.

2. REWARD FUNCTION OPTIMIZATION AND NAVIGATION MODEL DESIGN

2.1. Problem statement
The objective of this study is to make the mobile robot capable of autonomous navigation, so that it can avoid unknown obstacles and reach the target point. The navigation problem of the mobile robot can be regarded as a series of Markov decision processes (MDP), as shown in Fig. 1.

The MDP can be represented by a quintuple (S, A, R, P, γ), where S and A are the state space and action space of the system, respectively; R denotes the reward function; P represents the state transition probability of the system after action a is picked; and γ is a discount factor, reflecting that …
Fig. 4. Schematic diagram of the navigation learning process of the mobile robot based on the PER-D3QN algorithm. [The figure shows the environment exchanging (s, a) and (r, s′) with the agent, the current value network and the target value network (parameters copied every n time steps), the experience replay buffer, and the action selections argmax_a Q(s, a; θ) and argmax_a′ Q(s′, a′; θ).]
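As a concrete illustration of the value networks in Fig. 4, the following is a minimal PyTorch sketch of a dueling Q-network of the kind used in PER-D3QN; the layer sizes and the state/action dimensions are illustrative assumptions, not the values used in this work.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling architecture [18]: a shared trunk splits into a state-value
    stream V(s) and an advantage stream A(s, a), recombined as
    Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        v = self.value(h)      # shape (batch, 1)
        a = self.advantage(h)  # shape (batch, n_actions)
        # Subtracting the mean advantage keeps V and A identifiable.
        return v + a - a.mean(dim=1, keepdim=True)

# Current and target value networks as in Fig. 4 (dimensions hypothetical);
# the target network is a periodically refreshed copy of the current one.
current_net = DuelingQNetwork(state_dim=26, n_actions=5)
target_net = DuelingQNetwork(state_dim=26, n_actions=5)
target_net.load_state_dict(current_net.state_dict())  # copy parameters
```

The mean-subtraction in the forward pass is the aggregation rule proposed in [18]; it prevents the value and advantage streams from drifting freely against each other.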
Algorithm 1: PER-D3QN algorithm for navigation of the mobile robot.
1: Initialize experience replay buffer D
2: Initialize current value network and target value network with random weights
3: for episode = 1 to M do
4:   Initialize state s
5:   for t = 1 to T do
6:     Input state s into the current value network and use the ε-greedy method to select an action a from the current output Q-values
7:     Execute action a; get the reward r, the next state s′, and the termination flag done
8:     Store experience (s, a, r, s′, done) with priority in experience replay buffer D
9:     s = s′
10:    for j = 1 to K do
11:      Calculate target value:
         y_j = r_j, if done,
         y_j = r_j + γ Q(s_j, argmax_a Q(s_j, a; θ); θ⁻), otherwise,
         where done is the marker of the end of the episode
12:      Calculate temporal-difference error: δ_j = y_j − Q(s_{j−1}, a_{j−1}; θ)
13:      Update the priority of experience: p_j = |δ_j|
14:      Accumulate weight changes: Δ = Δ + δ_j ∇_θ Q(s_{j−1}, a_{j−1}; θ)
15:    end for
16:    Update the parameters of the current value network: θ = θ + ηΔ
17:    if t % n == 1 then
18:      Update the parameters of the target value network: θ⁻ = θ
19:    end if
20:  end for
21:  if ε > ε_min then
22:    ε = ε · e_decay
23:  end if
24: end for
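To make steps 11-14 of Algorithm 1 concrete, the sketch below computes the double-DQN target, the temporal-difference error, and the new priorities for one sampled minibatch. It assumes the networks defined earlier and a buffer that yields batched tensors; importance-sampling corrections are omitted, as in the listing above, and all shapes are illustrative assumptions.

```python
import torch

def per_d3qn_update(batch, current_net, target_net, optimizer, gamma=0.99):
    """One learning step for steps 11-16 of Algorithm 1 (a sketch)."""
    # s, s_next: float tensors (B, state_dim); a: int64 (B,);
    # r, done: float tensors (B,), with done in {0.0, 1.0}.
    s, a, r, s_next, done = batch
    # Step 11: double-DQN target. The current network selects the action,
    # the target network evaluates it (decoupled selection and estimation).
    with torch.no_grad():
        a_star = current_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)
        y = r + gamma * q_next * (1.0 - done)  # reduces to y = r when done
    # Step 12: temporal-difference error at the stored (s, a) pair.
    q_sa = current_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    delta = y - q_sa
    # Step 13: new priorities for the replay buffer, p_j = |delta_j|.
    new_priorities = delta.abs().detach()
    # Steps 14 and 16: the gradient accumulation and parameter update are
    # realized here through the squared TD error and the optimizer.
    loss = (delta ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return new_priorities
```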
Fig. 7. Simulation environments: (a) static obstacle environment (Scenario I); (b) dynamic obstacle environment (Scenario II). [Both panels show the enclosure, the obstacles, the target, and the scanning range of the LIDAR.]

4. SIMULATION RESULTS AND DISCUSSIONS

4.1. Simulation environments and parameter settings
To verify the autonomous navigation ability of the system, two 4 m × 4 m obstacle environments are established in the 3D simulator Gazebo, as shown in Fig. 7: Fig. 7(a) is a static obstacle environment (Scenario I) and Fig. 7(b) is a dynamic obstacle environment (Scenario II). The cylinders with a radius of 0.15 m represent the obstacles, and those in the dynamic obstacle environment move counterclockwise around the central point of the environment at a certain speed; the square frame represents an enclosure, and hitting the enclosure is equivalent to colliding with an obstacle; the square in the enclosure is the target point; and the lines in the enclosure represent the laser beams emitted by the LIDAR.

During the training simulation process, the mobile robot starts from different initial points and navigates to target points randomly generated in the open area of the environment. Through the interaction between the mobile robot and the environment, and the restriction of the rules in the navigation process, the quality of the mobile robot's decisions improves continuously. The parameter settings for the model training during the simulation are shown in Table 1.
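The training procedure just described can be summarized by the following sketch; `env` is a hypothetical wrapper around the Gazebo scenario (its methods `reset`, `step`, `start_points`, and `sample_free_target` are assumptions for illustration, not an existing API), and `agent` bundles the PER-D3QN components.

```python
import random

def run_training(env, agent, episodes=2000, max_steps=500):
    """Episode loop: each episode starts the robot from a different initial
    point with a target randomly generated in the open area (a sketch)."""
    for episode in range(episodes):
        state = env.reset(start=random.choice(env.start_points),
                          target=env.sample_free_target())
        for t in range(max_steps):
            action = agent.select_action(state)   # epsilon-greedy on Q-values
            next_state, reward, done = env.step(action)
            agent.store(state, action, reward, next_state, done)
            agent.learn()                          # PER-D3QN update, Algorithm 1
            state = next_state
            if done:                               # target reached or collision
                break
        agent.decay_epsilon()
```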
[Figure: training reward curves of the optimized and traditional reward functions (legend: Optimized, Traditional; vertical axis: Reward).]

…wards received, but judge the ability of the reward function to guide the agent by analyzing the trend of the curve. In the subsequent simulations, the proposed algorithm will adopt the optimized reward function.
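As described in the introduction, the optimized reward uses the change of artificial-potential-field energy between consecutive steps as a dense guidance signal. The exact field definition belongs to Section 2 and is not reproduced here; the sketch below uses a standard attractive/repulsive potential as an assumed stand-in, with hypothetical gain constants.

```python
import math

K_ATT, K_REP, D0 = 1.0, 0.5, 0.6  # hypothetical gains and influence radius (m)

def potential_energy(pos, target, obstacles):
    """Attractive potential toward the target plus repulsive potential from
    obstacles closer than D0 (a standard APF form, assumed here)."""
    u = 0.5 * K_ATT * math.dist(pos, target) ** 2
    for obs in obstacles:
        d = math.dist(pos, obs)
        if 0.0 < d < D0:
            u += 0.5 * K_REP * (1.0 / d - 1.0 / D0) ** 2
    return u

def potential_reward(prev_pos, pos, target, obstacles):
    # Reward = decrease in potential energy over one step: positive when the
    # robot moves down the field (toward the target, away from obstacles).
    return (potential_energy(prev_pos, target, obstacles)
            - potential_energy(pos, target, obstacles))
```

In the full reward function, this dense term would be combined with the sparse terminal rewards for reaching the target or colliding.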
4.2.2 Results of algorithm improvement

[Figure: training reward curves of PER-D3QN and DQN (legend: PER-D3QN, DQN; vertical axis: Reward).]

Table 3. Navigation time of the DQN algorithm.
Examples   Test 1 (time/s)   Test 2 (time/s)   Test 3 (time/s)   Test 4 (time/s)   Test 5 (time/s)
Round 1    16.867            16.458            16.454            16.453            17.057
Round 2    16.270            16.662            16.459            16.457            16.542
Table 5. Navigation success rates of the training methods without transfer and with transfer.
Examples           Test 1   Test 2   Test 3   Test 4   Test 5   Mean
Without transfer   87%      90%      87%      86%      89%      87.8%
With transfer      94%      91%      95%      97%      95%      94.4%

Table 6. Navigation time of the training method without transfer.
Examples   Test 1 (time/s)   Test 2 (time/s)   Test 3 (time/s)   Test 4 (time/s)   Test 5 (time/s)
Round 1    17.667            13.453            14.661            12.850            16.670
Round 2    19.270            16.660            18.672            13.452            16.659
Round 3    19.271            16.057            17.256            19.261            15.654
Round 4    14.852            16.661            18.663            17.459            19.265
Round 5    15.859            18.467            18.462            16.659            19.268
Mean (time/s): 16.925

Table 7. Navigation time of the knowledge transfer training method.
Examples   Test 1 (time/s)   Test 2 (time/s)   Test 3 (time/s)   Test 4 (time/s)   Test 5 (time/s)
Round 1    14.057            14.652            14.263            15.058            14.050
Round 2    14.854            14.655            13.659            14.656            14.054
Round 3    14.853            14.658            14.053            14.260            14.841
Round 4    14.944            14.460            14.853            15.057            14.459
Round 5    14.655            14.057            11.658            15.270            14.052
Mean (time/s): 14.404

…II, the reward suddenly changes to about 1200. At this point, the mobile robot is equipped with the knowledge gained in Scenario I, which includes the ability to approach the target point and, to a certain extent, the ability to avoid obstacles. However, at the beginning of the transfer to Scenario II, the actions chosen by the mobile robot on the basis of the prior knowledge sometimes cannot completely avoid the dynamic obstacles. After 500 episodes of exploration and training, the parameters of the model are adjusted and the mobile robot is able to navigate in Scenario II. When the mobile robot is instead trained in Scenario II throughout, it still does not learn to navigate well after 2000 episodes of training, because it is more prone to collisions and finds it harder to reach the target point, resulting in fewer positive rewards and a lack of positive incentives for the mobile robot. The comparison testifies that the knowledge transfer training method can effectively shorten the training time and accelerate the convergence of the models in complex environments.
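Operationally, the knowledge transfer training method amounts to initializing training in Scenario II from the model trained in Scenario I rather than from random weights. A minimal sketch, assuming the PyTorch networks defined earlier and hypothetical file names:

```python
import torch

# After training in the simple environment (Scenario I), save the model.
torch.save(current_net.state_dict(), "per_d3qn_scenario1.pt")

# To train in the complex environment (Scenario II), restore the weights
# into both value networks and continue training from this prior knowledge
# instead of from a random initialization.
current_net.load_state_dict(torch.load("per_d3qn_scenario1.pt"))
target_net.load_state_dict(current_net.state_dict())
```

In practice one would also keep a moderate exploration rate ε after the transfer, since, as noted above, roughly 500 further episodes of exploration are needed before the transferred model handles the dynamic obstacles.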
Furthermore, the trained models are tested in Scenario II. The navigation success rates and navigation times of the model trained for 800 episodes after transfer and the model trained for 1800 episodes without transfer are contrasted in the same way as in the tests of the algorithm performance. The data obtained from the tests are shown in Tables 5-7. The navigation success rates of the training methods without transfer and with transfer are shown in Table 5: the average navigation success rate is 87.8% for the training method without transfer and 94.4% for that with transfer, so the knowledge transfer training method has a 6.6% higher average navigation success rate. The navigation times of the training method without transfer and of the knowledge transfer training method are shown in Tables 6 and 7, respectively. The average navigation time of the training method without transfer is 16.925 s, while that of the knowledge transfer training method is 14.404 s, a reduction of 14.9%. In summary, the knowledge transfer training method improves the training efficiency and effectiveness, so the mobile robot can learn superior navigation strategies efficiently. On the other hand, it is observed from the simulation results that the rewards decrease slightly after a rough convergence and then reach a relative stability. One possible reason for this phenomenon is overfitting of the neural network.

5. CONCLUSION

In this paper, a mobile robot navigation model based on an improved DQN algorithm is proposed to achieve autonomous navigation in unfamiliar environments. Simulation results demonstrate that the proposed algorithm achieves a higher reward, a higher navigation success rate, and a shorter navigation time relative to the traditional DQN algorithm. To solve the problems of slow convergence of the model and the existence of local iterations caused by the traditional reward function, this study introduces an artificial potential field into the reward function. The reward function with the artificial potential field is shown to be effective in improving the training speed and reducing the problem of local iterations by guiding the mobile robot. Finally, a knowledge transfer training method is proposed, and the simulation results show that it can effectively improve the training efficiency when training in a relatively complex environment, compared to training directly. Also, during the training process, the rewards converge and then drop slightly before remaining stable, which may be due to network overfitting. In subsequent research work, methods to mitigate network overfitting will be explored, and control algorithms to achieve linear and angular velocity tracking control of the mobile robot will be considered.
REFERENCES

[1] M. Tang, Z. Chen, and F. Yin, “An improved adaptive unscented FastSLAM with genetic resampling,” International Journal of Control, Automation, and Systems, vol. 19, no. 4, pp. 1677-1690, 2021.
[2] B. Li, T. Acarman, Y. Zhang, L. Zhang, C. Yaman, and Q. Kong, “Tractor-trailer vehicle trajectory planning in narrow environments with a progressively constrained optimal control approach,” IEEE Transactions on Intelligent Vehicles, vol. 5, no. 3, pp. 414-425, 2019.
[3] N. Sun, Y. Fang, and H. Chen, “A continuous robust antiswing tracking control scheme for underactuated crane systems with experimental verification,” Journal of Dynamic Systems, Measurement, and Control, vol. 138, no. 4, p. 041002, 2016.
[4] Y. Li, S. Tong, and T. Li, “Observer-based adaptive fuzzy tracking control of MIMO stochastic nonlinear systems with unknown control directions and unknown dead zones,” IEEE Transactions on Fuzzy Systems, vol. 23, no. 4, pp. 1228-1241, 2014.
[5] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” Proc. of the AAAI Conference on Artificial Intelligence, vol. 30, pp. 2094-2100, 2016.
[6] S. J. Lee, H. Choi, and S. S. Hwang, “Real-time depth estimation using recurrent CNN with sparse depth cues for SLAM system,” International Journal of Control, Automation, and Systems, vol. 18, no. 1, pp. 206-216, 2020.
[7] C. Chen, P. Zhao, C. X. Lu, W. Wang, A. Markham, and N. Trigoni, “Deep-learning-based pedestrian inertial navigation: Methods, data set, and on-device inference,” IEEE Internet of Things Journal, vol. 7, no. 5, pp. 4431-4441, 2020.
[8] T. Ran, L. Yuan, and J. Zhang, “Scene perception based visual navigation of mobile robot in indoor environment,” ISA Transactions, vol. 109, pp. 389-400, 2021.
[9] L. Van Truong, S. D. Huang, V. T. Yen, and P. Van Cuong, “Adaptive trajectory neural network tracking control for industrial robot manipulators with deadzone robust compensator,” International Journal of Control, Automation, and Systems, vol. 18, no. 9, pp. 2423-2434, 2020.
[10] S. Jung, “A neural network technique of compensating for an inertia model error in a time-delayed controller for robot manipulators,” International Journal of Control, Automation, and Systems, vol. 18, no. 7, pp. 1863-1871, 2020.
[11] Y. Li, Y. Liu, and S. Tong, “Observer-based neuro-adaptive optimized control of strict-feedback nonlinear systems with state constraints,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 7, pp. 3131-3145, 2021.
[12] H.-S. Ahn, O. Jung, S. Choi, J.-H. Son, D. Chung, and G. Kim, “An optimal satellite antenna profile using reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 41, no. 3, pp. 393-406, 2010.
[13] S. Li, L. Ding, H. Gao, Y.-J. Liu, N. Li, and Z. Deng, “Reinforcement learning neural network-based adaptive control for state and input time-delayed wheeled mobile robots,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 50, no. 11, pp. 4171-4182, 2018.
[14] Y. Li, Y. Fan, K. Li, W. Liu, and S. Tong, “Adaptive optimized backstepping control-based RL algorithm for stochastic nonlinear systems with state constraints and its application,” IEEE Transactions on Cybernetics, vol. 52, no. 10, pp. 10542-10555, 2021.
[15] L. Jiang, H. Huang, and Z. Ding, “Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge,” IEEE/CAA Journal of Automatica Sinica, vol. 7, no. 4, pp. 1179-1189, 2019.
[16] M. M. Ejaz, T. B. Tang, and C.-K. Lu, “Vision-based autonomous navigation approach for a tracked robot using deep reinforcement learning,” IEEE Sensors Journal, vol. 21, no. 2, pp. 2230-2240, 2020.
[17] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” Proc. of 4th International Conference on Learning Representations, ICLR 2016, pp. 1-21, 2016.
[18] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” Proc. of International Conference on Machine Learning, PMLR, pp. 1995-2003, 2016.
[19] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” Proc. of International Conference on Machine Learning, PMLR, pp. 2778-2787, 2017.
[20] K. Wu, H. Wang, M. A. Esfahani, and S. Yuan, “BND*-DDQN: Learn to steer autonomously through deep reinforcement learning,” IEEE Transactions on Cognitive and Developmental Systems, vol. 13, no. 2, pp. 249-261, 2021.
[21] Y. Dong, X. Tang, and Y. Yuan, “Principled reward shaping for reinforcement learning via Lyapunov stability theory,” Neurocomputing, vol. 393, pp. 83-90, 2020.
[22] N. Sajid, P. J. Ball, T. Parr, and K. J. Friston, “Active inference: Demystified and compared,” Neural Computation, vol. 33, no. 3, pp. 674-712, 2021.
[23] C. Jin, A. Krishnamurthy, M. Simchowitz, and T. Yu, “Reward-free exploration for reinforcement learning,” Proc. of International Conference on Machine Learning, PMLR, pp. 4870-4879, 2020.
[24] N. Jiang, S. Jin, and C. Zhang, “Hierarchical automatic curriculum learning: Converting a sparse reward navigation task into dense reward,” Neurocomputing, vol. 360, pp. 265-278, 2019.
[25] X. Yao, X. Feng, J. Han, G. Cheng, and L. Guo, “Automatic weakly supervised object detection from high spatial resolution remote sensing images via dynamic curriculum learning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 1, pp. 675-685, 2020.
[26] C. Gong, D. Tao, S. J. Maybank, W. Liu, G. Kang, and J. Yang, “Multi-modal curriculum learning for semi-supervised image classification,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3249-3260, 2016.
[27] C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel, “Reverse curriculum generation for reinforcement learning,” Proc. of Conference on Robot Learning, PMLR, pp. 482-495, 2017.
[28] J. Gao, W. Ye, J. Guo, and Z. Li, “Deep reinforcement learning for indoor mobile robot path planning,” Sensors, vol. 20, no. 19, p. 5493, 2020.
[29] X. Liu and Y. Jin, “Reinforcement learning-based collision avoidance: Impact of reward function and knowledge transfer,” Artificial Intelligence for Engineering Design, Analysis and Manufacturing, vol. 34, no. 2, pp. 207-222, 2020.

Ye Jin received his B.S. degree from Dalian Jiaotong University in 2020, and he is currently a postgraduate student in mechanics at Dalian University of Technology, Dalian, China. His main research interests include path planning, such as the A-star algorithm and the dynamic window algorithm, with application to wheeled mobile robots.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.