ABSTRACT In this paper, path planning and sorting of packages for an Omnidirectional-Wheel conveyor are presented using Reinforcement Learning (RL). The Q-learning, Double Q-learning, Deep Q-learning, and Double Deep Q-learning algorithms are investigated. The RL algorithms enable the conveyor to self-learn the packages' paths and sort them without using conventional control or path planning theories. The RL algorithms are applied to two case studies on conveyor structures with different numbers of cells to compare and evaluate their performance in large- and small-scale structures. To explore the proposed methods' response to external environment effects, two types of collisions between multiple packages were considered; the proposed RL algorithms were able to resolve both types successfully. A comparative study between the RL algorithms for path planning showed that the Q-learning and Double Q-learning algorithms outperformed their deep-learning versions in both case studies. Furthermore, the proposed RL methods are compared experimentally to classic control and path planning theories using a hardware prototype of one of the presented case studies. The hardware experimental results showed that the proposed RL methods were as successful as the conventional methods in path planning and sorting, with much less processing time. Two types of sorting scenarios (Type I and II) were tested for the same package type and for multiple types. For Type I sorting, the Q-learning algorithm performed better than the Q-learning with weights approach, achieving better mean and minimum rewards, while the maximum rewards remained the same for both techniques. As for Type II sorting, only the Q-learning with weights approach was able to achieve it and converge in a reasonable time.
INDEX TERMS Deep Q-learning (DQN), double deep Q-learning (DDQN), double Q-learning,
omnidirectional-wheel conveyor, path planning, Q-learning, reinforcement learning, sorting.
FIGURE 3. Cognitive conveyor based on belt-driven swivel rollers [16].

Signals are then exchanged between available units in a specific logical time.
the conveyor structure consists of 13 cells, which is the same case implemented in hardware. In the 13-cell layout shown in Fig. 11, and due to the limited space and number of cells, only one package is used in every episode. The package location is generated at the start of every episode at the green cell, and the goal is for the package to reach the red cell.

A. Q-LEARNING ALGORITHM
The first algorithm used in this work is the Q-learning algorithm. The basic Q-learning algorithm was introduced in [26]. Q-learning allows the system to learn what would be the best action to perform in the next time slot based on its current state. This is performed as shown in the following equation:

Q(S, A) = Q(S, A) × (1 − α) + α[R + γ argmax Q(S', a)]   (1)

where S is the current state, A is the current action, α is the learning rate, γ is the discount factor, and the "argmax Q(S', a)" term represents the best state-action pair in the next state. If the state S is the final state in the episode, then the equation becomes:

Q(S, A) = Q(S, A) × (1 − α) + αR   (2)

Every type of package should have a different Q(S, A) function. This is attributed to the fact that every type of package will have a different destination.
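As a concrete illustration of the update rule in Equations (1) and (2), the short Python sketch below applies the tabular update. The state encoding (cell indices), the four-direction action set, and the α/γ values are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Assumed encoding: states are cell indices of the conveyor grid,
# actions are the four transfer directions of an omnidirectional cell.
N_STATES = 13                 # e.g. the 13-cell case study
ACTIONS = ["up", "down", "left", "right"]
ALPHA, GAMMA = 0.1, 0.9       # learning rate and discount factor (illustrative values)

Q = np.zeros((N_STATES, len(ACTIONS)))

def q_update(s, a, r, s_next, done):
    """Apply Eq. (1), or Eq. (2) when the episode terminates."""
    if done:
        target = r                              # Eq. (2): no bootstrap term
    else:
        target = r + GAMMA * np.max(Q[s_next])  # Eq. (1): best value in the next state
    Q[s, a] = (1 - ALPHA) * Q[s, a] + ALPHA * target
```

In line with the note above, one such table would be instantiated per package type, since each type has its own destination.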
B. DOUBLE Q-LEARNING
The second algorithm used for path planning is the Double Q-learning algorithm. The basic version of this algorithm was first introduced in [28] to eliminate the maximization bias in Q-learning [27]. It follows the same approach as the Q-learning algorithm, but uses two Q-tables instead of one to reduce overestimation. At each time step only one table, chosen randomly, is updated. Actions are then concluded in the same way as in Q-learning, but from a table that results from the summation of the two q-tables. The tables are updated using equation 3 for q-table Q1:

Q1(S, A) = Q1(S, A) × (1 − α) + α[R + γ Q2(S', argmax Q1(S', a))]   (3)

For the final state in the episode, equation 3 reduces to equation 4 as shown below:

Q1(S, A) = Q1(S, A) × (1 − α) + αR   (4)

For q-table Q2 the calculations are performed using equations 5 and 6:

Q2(S, A) = Q2(S, A) × (1 − α) + α × [R + γ Q1(S', argmax Q2(S', a))]   (5)

Q2(S, A) = Q2(S, A) × (1 − α) + αR   (6)

To enhance the accuracy and performance of the Double Q-learning algorithm, every package has its own q-tables that get updated as per the previous equations.
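A corresponding sketch of the Double Q-learning update pair in Equations (3)–(6), under the same assumed state/action encoding as the earlier snippet; which of the two tables is updated is chosen at random, and the greedy action is read from their sum, as described above.

```python
import numpy as np

N_STATES, N_ACTIONS = 13, 4
ALPHA, GAMMA = 0.1, 0.9
Q1 = np.zeros((N_STATES, N_ACTIONS))
Q2 = np.zeros((N_STATES, N_ACTIONS))

def double_q_update(s, a, r, s_next, done):
    """Randomly update Q1 via Eqs. (3)/(4) or Q2 via Eqs. (5)/(6)."""
    if np.random.rand() < 0.5:    # update Q1, evaluate the chosen action with Q2
        target = r if done else r + GAMMA * Q2[s_next, np.argmax(Q1[s_next])]
        Q1[s, a] = (1 - ALPHA) * Q1[s, a] + ALPHA * target
    else:                         # update Q2, evaluate the chosen action with Q1
        target = r if done else r + GAMMA * Q1[s_next, np.argmax(Q2[s_next])]
        Q2[s, a] = (1 - ALPHA) * Q2[s, a] + ALPHA * target

def greedy_action(s):
    """Actions are concluded from the sum of the two tables, as described above."""
    return int(np.argmax(Q1[s] + Q2[s]))
```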
C. DEEP Q-LEARNING (DQN)
The third algorithm used for the task of path planning is Deep Q-learning. Its basic version was introduced in [4] as an attempt to combine RL with neural networks [30]. It has been used in several control applications, such as controlling robots' motion in the real world using camera inputs [29], and has a high potential to be used in text generation, finance, Industry 4.0, and intelligent transportation systems [31]. The DQN algorithm is applied in this work for conveyor path planning without the use of a convolution layer, due to the low complexity of the problem in addition to its low-latency requirement. This algorithm operates exactly as Q-learning, with the addition of an experience replay memory D that stores transitions consisting of the current state, current action, current reward, and next state. Another network is added, called the target network Q', and at each time step a next-state sample from the experience replay memory is chosen randomly. This state is then evaluated by the target network and used in updating the original network according to the following equation:

Q(S, A; θ) = Q(S, A; θ) × (1 − α) + α × [R + γ argmax Q'(S', a; θ⁻)]   (7)

For the final state, equation 7 reduces to:

Q(S, A; θ) = Q(S, A; θ) × (1 − α) + αR   (8)

where every C time-steps the target network Q' is updated to have the exact values of the original network Q.

D. DOUBLE DEEP Q-LEARNING (DDQN)
The basic DDQN algorithm was first introduced in [32]; it uses the target network as a second value function without the need to use a completely new network. Therefore, the target network is used to estimate the action of the greedy policy, while remaining a periodic copy of the online network. In this algorithm, a basic multi-layer neural network was used instead of a convolutional neural network. The equation for updating the network is given by:

Q(S, A; θ) = Q(S, A; θ) × (1 − α) + α[R + γ Q'(S', argmax Q(S', a; θ); θ⁻)]   (9)
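The only difference between the DQN target (Equations 7 and 8) and the DDQN target (Equation 9) is which network selects the bootstrap action. The sketch below computes both targets for a single transition; the small fully connected network, the one-hot state input, and the use of PyTorch are assumptions for illustration, since the paper only states that a basic multi-layer network without convolution layers was used.

```python
import torch
import torch.nn as nn

N_STATES, N_ACTIONS, GAMMA = 38, 4, 0.9   # illustrative sizes for the 38-cell layout

def make_net():
    # Basic multi-layer network: one-hot state in, one Q-value per action out.
    return nn.Sequential(nn.Linear(N_STATES, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

online, target = make_net(), make_net()
target.load_state_dict(online.state_dict())   # target starts as a copy of the online net

def td_target(r, s_next, done, double=False):
    """Return the bootstrapped target for one transition (Eq. 7 vs. Eq. 9)."""
    if done:
        return torch.tensor(r)                 # Eq. (8): terminal state, reward only
    with torch.no_grad():
        q_target = target(s_next)
        if double:                             # DDQN, Eq. (9): online net picks the action,
            a_star = online(s_next).argmax()   # target net evaluates it
            return r + GAMMA * q_target[a_star]
        return r + GAMMA * q_target.max()      # DQN, Eq. (7): target net does both
```

Every C time-steps the copy step `target.load_state_dict(online.state_dict())` would be repeated, matching the periodic target-network update described above.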
V. OMNIDIRECTIONAL-WHEEL PACKAGE SORTING USING RL
The sorting task is implemented on the 38-cell case study to allow more variation and space for the movement of different types of packages and to increase the complexity of the sorting problem. Two types of sorting are considered in this work: the first type (Type I) is to sort five different types of packages to five different target cells. The second type (Type II) is to sort five packages of the same type to five different target cells, where any package can occupy any target cell. For every type there were two different sorting configurations. For the Type I sorting task the standard Q-learning algorithm [26] was used, together with a variation of it inspired by the episodic semi-gradient SARSA algorithm [27] (the Q-learning with weights approach). The Q-learning with weights approach adjusts the q-table slightly: the q-table consists of the element-wise multiplication of two other tables. The first table holds the value of the state-action pairs; each state-action pair can have a certain value, and for the simulations all state-action pairs had a value equal to 1. The second table holds the weight associated with every state-action pair. Only the weights are updated at each time-step.
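A minimal sketch of how the Q-learning with weights table could be composed, following the description above (a value table fixed at 1 and a weight table that is updated every time-step). The weight update shown mirrors the form of Equation (1) and is an assumption, since the paper does not spell out the exact rule.

```python
import numpy as np

N_STATES, N_ACTIONS = 38, 4
ALPHA, GAMMA = 0.1, 0.9

values = np.ones((N_STATES, N_ACTIONS))    # state-action values, fixed to 1 in the simulations
weights = np.zeros((N_STATES, N_ACTIONS))  # only these are updated at each time-step

def q_table():
    """Effective q-table: element-wise product of the value and weight tables."""
    return values * weights

def weight_update(s, a, r, s_next, done):
    # Assumed update: the same form as Eq. (1), applied to the weights only.
    target = r if done else r + GAMMA * np.max(q_table()[s_next])
    weights[s, a] = (1 - ALPHA) * weights[s, a] + ALPHA * target
```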
There were two types of collisions, similar to the cases that took place in the path planning approach, but this time, due to the large number of packages, if any type of collision happens all the packages involved in the collision will not move for one time-step. A flow chart explaining the steps of the sorting algorithm is shown in Fig. 13. The RL agent starts by examining the current state of the system, knowing the current positions of the packages and their desired sorting locations. Based on this information, the agent chooses the actions that would lead to the highest rewards and checks the possibility of collision between packages. If no collision will happen, the actions are taken; if collisions would happen, all packages that are expected to collide retain their positions for one time-step, and the algorithm re-evaluates the system again during the next time-step.
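The per-time-step logic summarized by the flow chart in Fig. 13 can be sketched as follows. The interfaces (package positions as grid coordinates, one greedy policy callable per package) and the conservative collision test are illustrative assumptions; only the hold-for-one-time-step rule is taken from the description above.

```python
from itertools import combinations

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "stay": (0, 0)}

def sorting_step(positions, policies):
    """One time-step of the sorting loop: propose greedy moves, hold every package
    involved in a predicted collision, then move the rest.
    `positions` maps package id -> (row, col); `policies` maps package id -> callable
    returning an action name for a position (assumed interfaces)."""
    proposals = {pkg: policies[pkg](pos) for pkg, pos in positions.items()}
    targets = {pkg: (pos[0] + ACTIONS[proposals[pkg]][0],
                     pos[1] + ACTIONS[proposals[pkg]][1])
               for pkg, pos in positions.items()}

    # Conservative collision check: two packages heading to the same cell, or a
    # package heading into a currently occupied cell, are all held for this step.
    held = set()
    for a, b in combinations(targets, 2):
        if (targets[a] == targets[b]
                or targets[a] == positions[b]
                or targets[b] == positions[a]):
            held.update({a, b})

    return {pkg: positions[pkg] if pkg in held else targets[pkg] for pkg in positions}
```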
The tabular algorithms were simulated five times for the two path planning case studies of the conveyor with 38 and 13 cells. Every training run consisted of 100,000 episodes. The first 70,000 episodes were considered a training phase for the algorithm, in which exploration was enabled. The last 30,000 episodes were considered the testing phase, in which exploration was disabled.

The value function approximation algorithms were simulated only once, as the neural network takes a long time to train. The DQN and DDQN algorithms were therefore both simulated for 150,000 episodes. Every run is organized as follows: 120,000 episodes for training, and the remaining 30,000 episodes were considered a test for the algorithms, with exploration disabled.
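The training/testing split described above maps onto a simple episode loop that switches exploration off once the training phase ends. A minimal sketch, where the ε value, the environment interface (reset/step), and the freezing of updates during testing are assumptions:

```python
import numpy as np

TRAIN_EPISODES, TEST_EPISODES = 70_000, 30_000   # tabular split used above
EPSILON = 0.1                                    # illustrative exploration rate

def run(env, q_table, update_fn):
    for episode in range(TRAIN_EPISODES + TEST_EPISODES):
        explore = episode < TRAIN_EPISODES       # exploration only in the training phase
        s, done = env.reset(), False
        while not done:
            if explore and np.random.rand() < EPSILON:
                a = np.random.randint(q_table.shape[1])   # random exploratory action
            else:
                a = int(np.argmax(q_table[s]))            # greedy action
            s_next, r, done = env.step(a)
            if explore:                          # assumed: no learning updates while testing
                update_fn(s, a, r, s_next, done)
            s = s_next
```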
TABLE 3. Results for path planning in 38-cells case study.
TABLE 4. Algorithms' performance for path planning in 38-cells case study.
FIGURE 18. Total average rewards over all episodes for path planning in 38-cells case study.
FIGURE 19. Policy maps for Q-Learning Algorithm path planning (38-cells) case study.
The best algorithm in terms of handling collision type one is the Q-learning algorithm, with 0.277126 collisions per episode overall, and the worst is the DQN, with 0.197 collisions per episode. As for collision type two, the best algorithm was the Q-learning, with 0.00113 collisions per episode overall, and the worst algorithm is the DDQN, with 0.0323 collisions per episode.

The graphs of total average rewards over all episodes of the 38-cell case study are shown in Fig. 18 and summarized in Table 4. The results in Table 4 indicate that both Q-Learning and Double Q-Learning are better options for the path planning problem addressed in this case study, despite being simpler than the other two RL algorithms under investigation. This is mainly due to the fact that they reached a state of convergence within a considerably short amount of time, compared with DQN and DDQN, which did not converge within the allowed training time to be able to perform path planning within a short response time. Thus, despite the higher complexity of the two latter algorithms, their longer running times were not an advantage for this case study. This is evident also when considering rewards, where Q-Learning and Double Q-Learning were able to achieve almost the same total maximum average rewards as the DQN and DDQN within a shorter convergence time, and this maximum reward was not much affected even when reducing the number of episodes by half or a third.

The policy maps for the green, red, and yellow packages using the Q-Learning algorithm are shown in Fig. 19. They show that the algorithm could not converge successfully to a valid path between source and destination for all cells in the case of the green package type. The map for the red package type shows that the Q-Learning algorithm had converged successfully
FIGURE 20. Policy map of Double Q-Learning Algorithm path planning (38-cells) case study.
TABLE 5. Parameters for algorithms used in sorting test cases.
TABLE 6. Results for 38-cells sorting test cases.
FIGURE 21. Total average rewards over all episodes for path planning in 13-cells case study.
The slowest in convergence is the DQN algorithm. Results also indicate that, as the 13-cell case study is much smaller than the 38-cell one, the episodes required until convergence, and also the time, are reduced by almost one third. In addition, all algorithms under test converged to valid paths between the packages' starting points and destinations. Also, the simpler the scale of the target problem for path planning is, the more favorable it is to use simpler algorithms such as Q-Learning and Double Q-Learning over complex ones such as DQN and DDQN, as their complexity doesn't add much enhancement to the problem solution compared to the cost of much longer run times.

Policy maps for the Q-learning and Double Q-Learning algorithms in the 13-cells path planning case study are shown in Fig. 22. Both algorithms were able to converge and find the shortest path successfully.

VIII. HARDWARE EXPERIMENTAL RESULTS
A hardware prototype for the 13-cell conveyor, manufactured by members of our research group, was used to verify the RL methods for path planning. The manufactured Omnidirectional-Wheel conveyor prototype is shown in Fig. 23, and it has the following mechanical specs: it consists
FIGURE 25. Hardware test results of Double Q-learning Algorithm.

The hardware experiments also showed the RL methods' ability to operate in real-time without considerable delays compared to other control methods.

IX. CONCLUSION
In this paper, four different RL algorithms were implemented to test their performance as alternatives to the control theory algorithms used in path planning and sorting of packages on an Omnidirectional-Wheel conveyor. Simulation and experimental results have shown that the idea of using RL algorithms to control the conveyor instead of traditional control methods has much potential, and proved to be as efficient as control theory methods with faster run time and less complexity.

To prove that the RL algorithms can solve small and large scale path planning problems, two case studies have been considered in this work: one for a relatively large sized conveyor of 38 cells and another one with 13 cells. Both case studies were successful in simulation, and the small sized conveyor path planning was verified successfully using experimental work. For the 38-cell conveyor path planning, Q-learning was the best algorithm in terms of collisions and convergence time, followed by Double Q-learning. The DQN algorithm was not able to converge during the testing time and is better avoided, since the Q-learning and Double Q-learning algorithms were able to perform well in this application. The same results were concluded for path planning in the 13-cell conveyor, where the Q-learning and Double Q-learning algorithms outperformed the DQN and the DDQN, mainly because of their faster convergence time.

REFERENCES
[1] T. Sun, Y. Zhang, H. Zhang, P. Wang, Y. Zhao, and G. Liu, "Three-wheel driven omnidirectional reconfigurable conveyor belt design," in Proc. Chin. Autom. Congr. (CAC), Hangzhou, China, 2019, pp. 101–105, doi: 10.1109/CAC48633.2019.8997050.
[2] C. Uriarte, A. Asphandiar, H. Thamer, A. Benggolo, and M. Freitag, "Control strategies for small-scaled conveyor modules enabling highly flexible material flow systems," Proc. CIRP, vol. 79, pp. 433–438, Dec. 2019.
[3] F. Farahnakian, M. Ebrahimi, M. Daneshtalab, P. Liljeberg, and J. Plosila, "Q-learning based congestion-aware routing algorithm for on-chip network," in Proc. IEEE 2nd Int. Conf. Netw. Embedded Syst. Enterprise Appl., 2011, pp. 1–7, doi: 10.1109/NESEA.2011.6144949.
[4] H. Hu, X. Jia, Q. He, S. Fu, and K. Liu, "Deep reinforcement learning based AGVs real-time scheduling with mixed rule for flexible shop floor in Industry 4.0," Comput. Ind. Eng., vol. 149, Nov. 2020, Art. no. 106749.
[5] G. Faraci, A. Raciti, S. A. Rizzo, and G. Schembra, "Green wireless power transfer system for a drone fleet managed by reinforcement learning in smart industry," Appl. Energy, vol. 259, Feb. 2020, Art. no. 114204.
[6] B. Kim and J. Pineau, "Socially adaptive path planning in human environments using inverse reinforcement learning," Int. J. Social Robot., vol. 8, no. 1, pp. 54–66, 2016.
[7] B. Recht, "A tour of reinforcement learning: The view from continuous control," Annu. Rev. Control, Robot., Auton. Syst., vol. 2, pp. 253–279, Oct. 2019.
[8] F. Fuchs, Y. Song, E. Kaufmann, D. Scaramuzza, and P. Dürr, "Super-human performance in Gran Turismo Sport using deep reinforcement learning," IEEE Robot. Autom. Lett., vol. 6, no. 3, pp. 4257–4264, Jul. 2021, doi: 10.1109/LRA.2021.3064284.
[9] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio, "An actor-critic algorithm for sequence prediction," in Proc. ICLR, 2017, pp. 1–17.
[10] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," 2017, arXiv:1712.01815.
[11] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[12] K. Balakrishnan, P. Chakravarty, and S. Shrivastava, "An A* curriculum approach to reinforcement learning for RGBD indoor robot navigation," 2021, arXiv:2101.01774.
[13] Y. Zhu, R. Mottaghi, E. Kolve, and J. Lim, "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2017, pp. 3357–3364, doi: 10.1109/ICRA.2017.7989381.
[14] S. H. Mayer, Development of a Completely Decentralized Control System for Modular Continuous Conveyors, vol. 73. Karlsruhe, Germany: KIT Scientific, 2009.
[15] T. Krühn and L. Overmeyer, Dezentrale, Verteilte Steuerung Flächiger Foerdersysteme für den Innerbetrieblichen Materialfluss (Berichte aus dem ITA). Garbsen, Germany: TEWISS-Technik und Wissen, 2015.
[16] L. Overmeyer, K. Ventz, S. Falkenberg, and T. Kruhn, "Interfaced multidirectional small-scaled modules for intralogistics operations," Logistics Res., vol. 2, no. 3, pp. 123–133, 2010.
[17] Z. Seibold, Logical Time for Decentralized Control of Material Handling Systems. Karlsruhe, Germany: KIT Scientific, 2016.
[18] L. Lamport, "Time, clocks, and the ordering of events in a distributed system," in Concurrency: The Works of Leslie Lamport, 2019, pp. 179–196.
[19] R. E. Korf, "Depth-first iterative-deepening: An optimal admissible tree search," Artif. Intell. J., vol. 27, pp. 97–109, 1985. [Online]. Available: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.288
[20] Y. Li, "Reinforcement learning applications," 2019, arXiv:1908.06973.
[21] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, "Q-learning algorithms: A comprehensive classification and applications," IEEE Access, vol. 7, pp. 133653–133667, 2019, doi: 10.1109/ACCESS.2019.2941229.
[22] S. Y. Luis, D. G. Reina, and S. L. T. Marín, "A multiagent deep reinforcement learning approach for path planning in autonomous surface vehicles: The Ypacaraí lake patrolling case," IEEE Access, vol. 9, pp. 17084–17099, 2021, doi: 10.1109/ACCESS.2021.3053348.
[23] J. Xin, H. Zhao, D. Liu, and M. Li, "Application of deep reinforcement learning in mobile robot path planning," in Proc. Chin. Autom. Congr. (CAC), 2017, pp. 7112–7116, doi: 10.1109/CAC.2017.8244061.
[24] C. Yan and X. Xiang, "A path planning algorithm for UAV based on improved Q-learning," in Proc. 2nd Int. Conf. Robot. Autom. Sci. (ICRAS), 2018, pp. 1–5, doi: 10.1109/ICRAS.2018.8443226.
[25] R. Hafner and M. Riedmiller, "Reinforcement learning on an omnidirectional mobile robot," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., Las Vegas, NV, USA, Oct. 2003, pp. 418–423, doi: 10.1109/IROS.2003.1250665.
[26] C. J. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, nos. 3–4, p. 279, 1992.
[27] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[28] H. V. Hasselt, "Double Q-learning," in Proc. Adv. Neural Inf. Process. Syst., J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. New York, NY, USA: Curran, 2010, pp. 2613–2621.
[29] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "A brief survey of deep reinforcement learning," in Proc. Deep Learn. Image Understand., 2017, pp. 1–16.
[30] V. Francois-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, "An introduction to deep reinforcement learning," Found. Trends Mach. Learn., vol. 11, nos. 3–4, pp. 219–354, Dec. 2018.
[31] Y. Li, "Deep reinforcement learning: An overview," 2017, arXiv:1701.07274.
[32] H. V. Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. 13th AAAI Conf. Artif. Intell., 2016, pp. 2094–2100.

LAMIA A. SHIHATA received the B.Sc. degree from the Mechanical Engineering, Design and Production Engineering Department, Faculty of Engineering, Ain Shams University, Cairo, Egypt, and the master's and Ph.D. degrees in industrial engineering, in 1996 and 2003, respectively. She holds the position of Associate Professor at the German University in Cairo teaching various topics in quality control, production and operations management, design of experiments, facilities planning, lean manufacturing, and project management. She trains and consults industrial entities in the field of quality control and six sigma. She is a reviewer in several esteemed journals and the Co-Founder of the Quality Assurance Unit in the faculty.

EMAN AZAB (Senior Member, IEEE) received the B.Sc. degree (Hons.) in electronics and communication engineering from the Faculty of Engineering, Cairo University, in 2006, the M.Sc. and Ph.D. degrees in electronics engineering from the German University in Cairo, Egypt, in 2008 and 2012, respectively, and the postdoctoral degree from TU Darmstadt, in 2015, with a focus on ion-beam diagnostics and research work at GSI Helmholtzzentrum, Germany. She holds the position of Assistant Professor at the German University in Cairo teaching various topics in analog, mixed signal electronics, and electrical engineering, since 2016. She leads and is part of multiple research teams in the fields of sensor technologies, IC design, and Industry 4.0. She has numerous publications at well recognized international conferences and journals in the field. She is serving as the Secretary for the IEEE Women in Engineering Egypt Section.

WALID ZAHER received the B.Sc. degree in electronics from the Faculty of Information Engineering Technology, German University in Cairo (GUC), in 2020, and the M.Sc. degree from GUC, in 2022, with a focus on the topic of controlling the path planning tasks of omni-directional conveyors using machine learning algorithms, specifically reinforcement learning algorithms.

MAGGIE MASHALY (Senior Member, IEEE) received the B.Sc. degree (Hons.) in information engineering and technology and the master's degree in networking from the German University in Cairo, Egypt, in 2010 and 2011, respectively, and the Ph.D. degree in the area of cloud computing and computer networks from the University of Stuttgart, Germany, in 2017. She holds the position of Assistant Professor at the German University in Cairo teaching various topics in machine learning, data engineering, computer engineering, and networks. She leads multiple research teams in the fields of cloud computing, edge computing, machine learning, and Industry 4.0, has numerous publications at well recognized international conferences and journals in the field, and also acts as a reviewer and a Program Committee Member at many of them. She is a Board Member of the IEEE Women in Engineering Egypt Section.

ARSANY W. YOUSSEF received the B.Sc. and M.Sc. degrees (Hons.) in mechatronics engineering from the Faculty of Engineering and Materials Science, German University in Cairo (GUC), in 2018 and 2021, respectively. His bachelor's degree was in the aerodynamics control of flapping wings UAVs, with a publication in this field in 2020. His M.Sc. degree topic was in the industrial automation of smart factories (Industry 4.0). He worked as a part-time Teaching Assistant at the German International University (GIU), Berlin, Germany, in 2020. He holds the position of Assistant Lecturer at GUC teaching various topics in mechatronics engineering and industrial automation, since September 2021.