ABSTRACT In this paper, path planning and sorting of packages for an Omnidirectional-Wheel conveyor are presented using Reinforcement Learning (RL). The Q-learning, Double Q-learning, Deep Q-learning, and Double Deep Q-learning algorithms are investigated. The RL algorithms enable the conveyor to self-learn the packages' paths and sort them without using conventional control or path planning theories. The RL algorithms are applied to two case studies on conveyor structures with different numbers of cells to compare and evaluate their performance in large- and small-scale structures. To explore the proposed methods' response to external environment effects, two types of collisions between multiple packages were considered; the proposed RL algorithms were able to resolve both types successfully. A comparative study between the RL algorithms for path planning showed that the Q-learning and Double Q-learning algorithms outperformed their deep-learning versions in both case studies. Furthermore, the proposed RL methods are compared experimentally to classic control and path planning theories using a hardware prototype of one of the presented case studies. The hardware experimental results showed that the proposed RL methods were as successful as the conventional methods in path planning and sorting, with much less processing time. Two types of sorting scenarios (Type I and II) were tested for the same package type and for multiple types. For Type I sorting, the Q-learning algorithm performed better than the Q-learning with weights approach, achieving better mean and minimum rewards, while the maximum rewards remained the same for both techniques. As for Type II sorting, only the Q-learning with weights approach was able to achieve it and converge in a reasonable time.
INDEX TERMS Deep Q-learning (DQN), double deep Q-learning (DDQN), double Q-learning,
omnidirectional-wheel conveyor, path planning, Q-learning, reinforcement learning, sorting.
FIGURE 3. Cognitive conveyor based on belt-driven swivel rollers [16].

Signals are then exchanged between available units in a specific logical time.
the conveyor structure consists of 13 cells, which is the same case implemented in hardware. In the 13-cell layout shown in Fig. 11, and due to the limited space and number of cells, only one package is used in every episode. The package location is generated at the start of every episode at the green cell, and the goal is for the package to reach the red cell.

A. Q-LEARNING ALGORITHM
The first algorithm used in this work is the Q-learning algorithm. The basic Q-learning algorithm was introduced in [26]. Q-learning allows the system to learn what would be the best action to perform in the next time slot based on its current state. This is performed as shown in the following equation:

Q(S, A) = Q(S, A) × (1 − α) + α[R + γ argmax Q(S', a)]   (1)

where S is the current state, A is the current action, α is the learning rate, γ is the discount factor, and the "argmax Q(S', a)" term represents the best state-action pair in the next state. If the state S is the final state in the episode, then the equation becomes:

Q(S, A) = Q(S, A) × (1 − α) + αR   (2)

Every type of package should have a different Q(S, A) function. This is attributed to the fact that every type of package will have a different destination.
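As a concrete illustration of the update rule in Equations (1) and (2), the short Python sketch below applies the tabular update. The state encoding (cell indices), the four-direction action set, and the α/γ values are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Assumed encoding: states are cell indices of the conveyor grid,
# actions are the four transfer directions of an omnidirectional cell.
N_STATES = 13                 # e.g. the 13-cell case study
ACTIONS = ["up", "down", "left", "right"]
ALPHA, GAMMA = 0.1, 0.9       # learning rate and discount factor (illustrative values)

Q = np.zeros((N_STATES, len(ACTIONS)))

def q_update(s, a, r, s_next, done):
    """Apply Eq. (1), or Eq. (2) when the episode terminates."""
    if done:
        target = r                              # Eq. (2): no bootstrap term
    else:
        target = r + GAMMA * np.max(Q[s_next])  # Eq. (1): best value in the next state
    Q[s, a] = (1 - ALPHA) * Q[s, a] + ALPHA * target
```

In line with the note above, one such table would be instantiated per package type, since each type has its own destination.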
B. DOUBLE Q-LEARNING
The second algorithm used for path planning is the Double Q-learning algorithm. The basic version of this algorithm was first introduced in [28] to eliminate the maximization bias in Q-learning [27]. It follows the same approach as the Q-learning algorithm, but uses two Q-tables instead of one to reduce overestimation. At each time step only one table, chosen randomly, is updated. Actions are then concluded in the same way as in Q-learning, but from a table that results from the summation of the two q-tables. The tables are updated using equation 3 for q-table Q1:

Q1(S, A) = Q1(S, A) × (1 − α) + α[R + γ Q2(S', argmax Q1(S', a))]   (3)

For the final state in the episode, equation 3 reduces to equation 4 as shown below:

Q1(S, A) = Q1(S, A) × (1 − α) + αR   (4)

For q-table Q2 the calculations are performed using equations 5 and 6:

Q2(S, A) = Q2(S, A) × (1 − α) + α × [R + γ Q1(S', argmax Q2(S', a))]   (5)

Q2(S, A) = Q2(S, A) × (1 − α) + αR   (6)

To enhance the accuracy and performance of the Double Q-learning algorithm, every package has its own q-tables that get updated as per the previous equations.
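A corresponding sketch of the Double Q-learning update pair in Equations (3)–(6), under the same assumed state/action encoding as the earlier snippet; which of the two tables is updated is chosen at random, and the greedy action is read from their sum, as described above.

```python
import numpy as np

N_STATES, N_ACTIONS = 13, 4
ALPHA, GAMMA = 0.1, 0.9
Q1 = np.zeros((N_STATES, N_ACTIONS))
Q2 = np.zeros((N_STATES, N_ACTIONS))

def double_q_update(s, a, r, s_next, done):
    """Randomly update Q1 via Eqs. (3)/(4) or Q2 via Eqs. (5)/(6)."""
    if np.random.rand() < 0.5:    # update Q1, evaluate the chosen action with Q2
        target = r if done else r + GAMMA * Q2[s_next, np.argmax(Q1[s_next])]
        Q1[s, a] = (1 - ALPHA) * Q1[s, a] + ALPHA * target
    else:                         # update Q2, evaluate the chosen action with Q1
        target = r if done else r + GAMMA * Q1[s_next, np.argmax(Q2[s_next])]
        Q2[s, a] = (1 - ALPHA) * Q2[s, a] + ALPHA * target

def greedy_action(s):
    """Actions are concluded from the sum of the two tables, as described above."""
    return int(np.argmax(Q1[s] + Q2[s]))
```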
C. DEEP Q-LEARNING (DQN)
The third algorithm used for the task of path planning is Deep Q-learning. Its basic version was introduced in [4] as an attempt to combine RL with neural networks [30]. It has been used in several control applications, such as controlling robots' motion in the real world using camera inputs [29], and has a high potential to be used in text generation, finance, Industry 4.0, and intelligent transportation systems [31]. The DQN algorithm is applied in this work for conveyor path planning without the use of a convolution layer, due to the low complexity of the problem in addition to its low-latency requirement. This algorithm operates exactly as Q-learning, with the addition of an experience replay memory D that stores transitions consisting of the current state, current action, current reward, and next state. Another network is added, called the target network Q', and at each time step a next-state sample from the experience replay memory is chosen randomly. This state is then evaluated by the target network and used in updating the original network according to the following equation:

Q(S, A; θ) = Q(S, A; θ) × (1 − α) + α × [R + γ argmax Q'(S', a; θ⁻)]   (7)

For the final state, equation 7 reduces to:

Q(S, A; θ) = Q(S, A; θ) × (1 − α) + αR   (8)

where every C time-steps the target network Q' is updated to have the exact values of the original network Q.

D. DOUBLE DEEP Q-LEARNING (DDQN)
The basic DDQN algorithm was first introduced in [32]; it uses the target network as a second value function without the need to use a completely new network. Therefore, the target network is used to estimate the action of the greedy policy, while remaining a periodic copy of the online network. In this algorithm, a basic multi-layer neural network was used instead of a convolutional neural network. The equation for updating the network is given by:

Q(S, A; θ) = Q(S, A; θ) × (1 − α) + α[R + γ Q'(S', argmax Q(S', a; θ); θ⁻)]   (9)
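The only difference between the DQN target (Equations 7 and 8) and the DDQN target (Equation 9) is which network selects the bootstrap action. The sketch below computes both targets for a single transition; the small fully connected network, the one-hot state input, and the use of PyTorch are assumptions for illustration, since the paper only states that a basic multi-layer network without convolution layers was used.

```python
import torch
import torch.nn as nn

N_STATES, N_ACTIONS, GAMMA = 38, 4, 0.9   # illustrative sizes for the 38-cell layout

def make_net():
    # Basic multi-layer network: one-hot state in, one Q-value per action out.
    return nn.Sequential(nn.Linear(N_STATES, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

online, target = make_net(), make_net()
target.load_state_dict(online.state_dict())   # target starts as a copy of the online net

def td_target(r, s_next, done, double=False):
    """Return the bootstrapped target for one transition (Eq. 7 vs. Eq. 9)."""
    if done:
        return torch.tensor(r)                 # Eq. (8): terminal state, reward only
    with torch.no_grad():
        q_target = target(s_next)
        if double:                             # DDQN, Eq. (9): online net picks the action,
            a_star = online(s_next).argmax()   # target net evaluates it
            return r + GAMMA * q_target[a_star]
        return r + GAMMA * q_target.max()      # DQN, Eq. (7): target net does both
```

Every C time-steps the copy step `target.load_state_dict(online.state_dict())` would be repeated, matching the periodic target-network update described above.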
V. OMNIDIRECTIONAL-WHEEL PACKAGE SORTING USING RL
The sorting task is implemented on the 38-cell case study to allow more variation and space for the movement of different types of packages and to increase the complexity of the sorting problem. Two types of sorting are considered in this work: the first type (Type I) is to sort five different types of packages to five different target cells. The second type (Type II) is to sort five packages of the same type to five different target cells, where any package can occupy any target cell. For every type there were two different sorting configurations. For the Type I sorting task the standard Q-learning algorithm [26] was used, together with a variation of it inspired by the episodic semi-gradient SARSA algorithm [27] (the Q-learning with weights approach). The Q-learning with weights approach adjusts the q-table slightly: the q-table consists of the element-wise multiplication of two other tables. The first table holds the value of the state-action pairs; each state-action pair can have a certain value, and for the simulations all state-action pairs had a value equal to 1. The second table holds the weight associated with every state-action pair. Only the weights are updated at each time-step.
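A minimal sketch of how the Q-learning with weights table could be composed, following the description above (a value table fixed at 1 and a weight table that is updated every time-step). The weight update shown mirrors the form of Equation (1) and is an assumption, since the paper does not spell out the exact rule.

```python
import numpy as np

N_STATES, N_ACTIONS = 38, 4
ALPHA, GAMMA = 0.1, 0.9

values = np.ones((N_STATES, N_ACTIONS))    # state-action values, fixed to 1 in the simulations
weights = np.zeros((N_STATES, N_ACTIONS))  # only these are updated at each time-step

def q_table():
    """Effective q-table: element-wise product of the value and weight tables."""
    return values * weights

def weight_update(s, a, r, s_next, done):
    # Assumed update: the same form as Eq. (1), applied to the weights only.
    target = r if done else r + GAMMA * np.max(q_table()[s_next])
    weights[s, a] = (1 - ALPHA) * weights[s, a] + ALPHA * target
```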
There were two types of collisions, similar to the cases that took place in the path planning approach, but this time, due to the large number of packages, if any type of collision happens all the packages involved in the collision will not move for one time-step. A flow chart explaining the steps of the sorting algorithm is shown in Fig. 13. The RL agent starts by examining the current state of the system, knowing the current positions of the packages and their desired sorting locations. Based on this information, the agent chooses the actions that would lead to the highest rewards and checks the possibility of collision between packages. If no collision will happen, the actions are taken; if collisions would happen, all packages that are expected to collide retain their positions for one time-step, and the algorithm re-evaluates the system again during the next time-step.
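The per-time-step logic summarized by the flow chart in Fig. 13 can be sketched as follows. The interfaces (package positions as grid coordinates, one greedy policy callable per package) and the conservative collision test are illustrative assumptions; only the hold-for-one-time-step rule is taken from the description above.

```python
from itertools import combinations

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "stay": (0, 0)}

def sorting_step(positions, policies):
    """One time-step of the sorting loop: propose greedy moves, hold every package
    involved in a predicted collision, then move the rest.
    `positions` maps package id -> (row, col); `policies` maps package id -> callable
    returning an action name for a position (assumed interfaces)."""
    proposals = {pkg: policies[pkg](pos) for pkg, pos in positions.items()}
    targets = {pkg: (pos[0] + ACTIONS[proposals[pkg]][0],
                     pos[1] + ACTIONS[proposals[pkg]][1])
               for pkg, pos in positions.items()}

    # Conservative collision check: two packages heading to the same cell, or a
    # package heading into a currently occupied cell, are all held for this step.
    held = set()
    for a, b in combinations(targets, 2):
        if (targets[a] == targets[b]
                or targets[a] == positions[b]
                or targets[b] == positions[a]):
            held.update({a, b})

    return {pkg: positions[pkg] if pkg in held else targets[pkg] for pkg in positions}
```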
The tabular algorithms were simulated five times for the two path planning case studies of the conveyor with 38 and 13 cells. Every training run consisted of 100,000 episodes. The first 70,000 episodes were considered a training phase for the algorithm, in which exploration was enabled. The last 30,000 episodes were considered the testing phase, in which exploration was disabled.

The value function approximation algorithms were simulated only once, as the neural network takes a long time to train. The DQN and DDQN algorithms were therefore both simulated for 150,000 episodes. Every run is organized as follows: 120,000 episodes for training, and the remaining 30,000 episodes were considered a test for the algorithms, with exploration disabled.
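The training/testing split described above maps onto a simple episode loop that switches exploration off once the training phase ends. A minimal sketch, where the ε value, the environment interface (reset/step), and the freezing of updates during testing are assumptions:

```python
import numpy as np

TRAIN_EPISODES, TEST_EPISODES = 70_000, 30_000   # tabular split used above
EPSILON = 0.1                                    # illustrative exploration rate

def run(env, q_table, update_fn):
    for episode in range(TRAIN_EPISODES + TEST_EPISODES):
        explore = episode < TRAIN_EPISODES       # exploration only in the training phase
        s, done = env.reset(), False
        while not done:
            if explore and np.random.rand() < EPSILON:
                a = np.random.randint(q_table.shape[1])   # random exploratory action
            else:
                a = int(np.argmax(q_table[s]))            # greedy action
            s_next, r, done = env.step(a)
            if explore:                          # assumed: no learning updates while testing
                update_fn(s, a, r, s_next, done)
            s = s_next
```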
TABLE 3. Results for path planning in 38-cells case study.
TABLE 4. Algorithms' performance for path planning in 38-cells case study.
FIGURE 18. Total average rewards over all episodes for path planning in 38-cells case study.
FIGURE 19. Policy maps for Q-Learning Algorithm path planning (38-cells) case study.
The best algorithm in terms of handling collision type one is the Q-learning algorithm, with 0.277126 collisions per episode overall, and the worst is the DQN, with 0.197 collisions per episode. As for collision type two, the best algorithm was the Q-learning, with 0.00113 collisions per episode overall, and the worst algorithm is the DDQN, with 0.0323 collisions per episode.

The graphs of total average rewards over all episodes of the 38-cell case study are shown in Fig. 18 and summarized in Table 4. The results in Table 4 indicate that both Q-Learning and Double Q-Learning are better options for the path planning problem addressed in this case study, despite being simpler than the other two RL algorithms under investigation. This is mainly due to the fact that they reached a state of convergence within a considerably short amount of time, compared with DQN and DDQN, which did not converge within the allowed training time to be able to perform path planning within a short response time. Thus, despite the higher complexity of the two latter algorithms, their longer running times were not an advantage for this case study. This is evident also when considering rewards, where Q-Learning and Double Q-Learning were able to achieve almost the same total maximum average rewards as the DQN and DDQN within a shorter convergence time, and this maximum reward was not much affected even when reducing the number of episodes by half or a third.

The policy maps for the green, red, and yellow packages using the Q-Learning algorithm are shown in Fig. 19. They show that the algorithm could not converge successfully to a valid path between source and destination for all cells in the case of the green package type. The map for the red package type shows that the Q-Learning algorithm had converged successfully
FIGURE 20. Policy map of Double Q-Learning Algorithm path planning (38-cells) case study.
TABLE 5. Parameters for algorithms used in sorting test cases.
TABLE 6. Results for 38-cells sorting test cases.
FIGURE 21. Total average rewards over all episodes for path planning in 13-cells case study.
The slowest in convergence is the DQN algorithm. Results also indicate that, as the 13-cell case study is much smaller than the 38-cell one, the episodes required until convergence, and also the time, are reduced by almost one third. In addition, all algorithms under test converged to valid paths between the packages' starting points and destinations. Also, the simpler the scale of the target problem for path planning is, the more favorable it is to use simpler algorithms such as Q-Learning and Double Q-Learning over complex ones such as DQN and DDQN, as their complexity doesn't add much enhancement to the problem solution compared to the cost of much longer run times.

Policy maps for the Q-learning and Double Q-Learning algorithms in the 13-cells path planning case study are shown in Fig. 22. Both algorithms were able to converge and find the shortest path successfully.

VIII. HARDWARE EXPERIMENTAL RESULTS
A hardware prototype for the 13-cell conveyor, manufactured by members of our research group, was used to verify the RL methods for path planning. The manufactured Omnidirectional-Wheel conveyor prototype is shown in Fig. 23, and it has the following mechanical specs: it consists
FIGURE 25. Hardware test results of Double Q-learning Algorithm.

The hardware experiments also showed the RL methods' ability to operate in real-time without considerable delays compared to other control methods.

IX. CONCLUSION
In this paper, four different RL algorithms were implemented to test their performance as alternatives to the control theory algorithms used in path planning and sorting of packages on an Omnidirectional-Wheel conveyor. Simulation and experimental results have shown that the idea of using RL algorithms to control the conveyor instead of traditional control methods has much potential, and proved to be as efficient as control theory methods with faster run time and less complexity.

To prove that the RL algorithms can solve small and large scale path planning problems, two case studies have been considered in this work: one for a relatively large sized conveyor of 38 cells and another one with 13 cells. Both case studies were successful in simulation, and the small sized conveyor path planning was verified successfully using experimental work. For the 38-cell conveyor path planning, Q-learning was the best algorithm in terms of collisions and convergence time, followed by Double Q-learning. The DQN algorithm was not able to converge during the testing time and is better avoided, since the Q-learning and Double Q-learning algorithms were able to perform well in this application. The same results were concluded for path planning in the 13-cell conveyor, where the Q-learning and Double Q-learning algorithms outperformed the DQN and the DDQN, mainly because of their faster convergence time.

REFERENCES
[1] T. Sun, Y. Zhang, H. Zhang, P. Wang, Y. Zhao, and G. Liu, "Three-wheel driven omnidirectional reconfigurable conveyor belt design," in Proc. Chin. Autom. Congr. (CAC), Hangzhou, China, 2019, pp. 101–105, doi: 10.1109/CAC48633.2019.8997050.
[2] C. Uriarte, A. Asphandiar, H. Thamer, A. Benggolo, and M. Freitag, "Control strategies for small-scaled conveyor modules enabling highly flexible material flow systems," Proc. CIRP, vol. 79, pp. 433–438, Dec. 2019.
[3] F. Farahnakian, M. Ebrahimi, M. Daneshtalab, P. Liljeberg, and J. Plosila, "Q-learning based congestion-aware routing algorithm for on-chip network," in Proc. IEEE 2nd Int. Conf. Netw. Embedded Syst. Enterprise Appl., 2011, pp. 1–7, doi: 10.1109/NESEA.2011.6144949.
[4] H. Hu, X. Jia, Q. He, S. Fu, and K. Liu, "Deep reinforcement learning based AGVs real-time scheduling with mixed rule for flexible shop floor in Industry 4.0," Comput. Ind. Eng., vol. 149, Nov. 2020, Art. no. 106749.
[5] G. Faraci, A. Raciti, S. A. Rizzo, and G. Schembra, "Green wireless power transfer system for a drone fleet managed by reinforcement learning in smart industry," Appl. Energy, vol. 259, Feb. 2020, Art. no. 114204.
[6] B. Kim and J. Pineau, "Socially adaptive path planning in human environments using inverse reinforcement learning," Int. J. Social Robot., vol. 8, no. 1, pp. 54–66, 2016.
[7] B. Recht, "A tour of reinforcement learning: The view from continuous control," Annu. Rev. Control, Robot., Auton. Syst., vol. 2, pp. 253–279, Oct. 2019.
[8] F. Fuchs, Y. Song, E. Kaufmann, D. Scaramuzza, and P. Dürr, "Super-human performance in Gran Turismo Sport using deep reinforcement learning," IEEE Robot. Autom. Lett., vol. 6, no. 3, pp. 4257–4264, Jul. 2021, doi: 10.1109/LRA.2021.3064284.
[9] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio, "An actor-critic algorithm for sequence prediction," in Proc. ICLR, 2017, pp. 1–17.
[10] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," 2017, arXiv:1712.01815.
[11] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[12] K. Balakrishnan, P. Chakravarty, and S. Shrivastava, "An A* curriculum approach to reinforcement learning for RGBD indoor robot navigation," 2021, arXiv:2101.01774.
[13] Y. Zhu, R. Mottaghi, E. Kolve, and J. Lim, "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2017, pp. 3357–3364, doi: 10.1109/ICRA.2017.7989381.
[14] S. H. Mayer, Development of a Completely Decentralized Control System for Modular Continuous Conveyors, vol. 73. Karlsruhe, Germany: KIT Scientific, 2009.
[15] T. Krühn and L. Overmeyer, Dezentrale, Verteilte Steuerung Flächiger Foerdersysteme für den Innerbetrieblichen Materialfluss (Berichte aus dem ITA). Garbsen, Germany: TEWISS-Technik und Wissen, 2015.
[16] L. Overmeyer, K. Ventz, S. Falkenberg, and T. Kruhn, "Interfaced multidirectional small-scaled modules for intralogistics operations," Logistics Res., vol. 2, no. 3, pp. 123–133, 2010.
[17] Z. Seibold, Logical Time for Decentralized Control of Material Handling Systems. Karlsruhe, Germany: KIT Scientific, 2016.
[18] L. Lamport, "Time, clocks, and the ordering of events in a distributed system," in Concurrency: The Works of Leslie Lamport, 2019, pp. 179–196.
[19] R. E. Korf, "Depth-first iterative-deepening: An optimal admissible tree search," Artif. Intell. J., vol. 27, pp. 97–109, 1985. [Online]. Available: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.288
[20] Y. Li, "Reinforcement learning applications," 2019, arXiv:1908.06973.
[21] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, "Q-learning algorithms: A comprehensive classification and applications," IEEE Access, vol. 7, pp. 133653–133667, 2019, doi: 10.1109/ACCESS.2019.2941229.
[22] S. Y. Luis, D. G. Reina, and S. L. T. Marín, "A multiagent deep reinforcement learning approach for path planning in autonomous surface vehicles: The Ypacaraí lake patrolling case," IEEE Access, vol. 9, pp. 17084–17099, 2021, doi: 10.1109/ACCESS.2021.3053348.
[23] J. Xin, H. Zhao, D. Liu, and M. Li, "Application of deep reinforcement learning in mobile robot path planning," in Proc. Chin. Autom. Congr. (CAC), 2017, pp. 7112–7116, doi: 10.1109/CAC.2017.8244061.
[24] C. Yan and X. Xiang, "A path planning algorithm for UAV based on improved Q-learning," in Proc. 2nd Int. Conf. Robot. Autom. Sci. (ICRAS), 2018, pp. 1–5, doi: 10.1109/ICRAS.2018.8443226.
[25] R. Hafner and M. Riedmiller, "Reinforcement learning on an omnidirectional mobile robot," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., Las Vegas, NV, USA, Oct. 2003, pp. 418–423, doi: 10.1109/IROS.2003.1250665.
[26] C. J. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, nos. 3–4, p. 279, 1992.
[27] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[28] H. V. Hasselt, "Double Q-learning," in Proc. Adv. Neural Inf. Process. Syst., J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. New York, NY, USA: Curran, 2010, pp. 2613–2621.
[29] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "A brief survey of deep reinforcement learning," in Proc. Deep Learn. Image Understand., 2017, pp. 1–16.
[30] V. Francois-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, "An introduction to deep reinforcement learning," Found. Trends Mach. Learn., vol. 11, nos. 3–4, pp. 219–354, Dec. 2018.
[31] Y. Li, "Deep reinforcement learning: An overview," 2017, arXiv:1701.07274.
[32] H. V. Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. 13th AAAI Conf. Artif. Intell., 2016, pp. 2094–2100.

LAMIA A. SHIHATA received the B.Sc. degree from the Mechanical Engineering, Design and Production Engineering Department, Faculty of Engineering, Ain Shams University, Cairo, Egypt, and the master's and Ph.D. degrees in industrial engineering, in 1996 and 2003, respectively. She holds the position of Associate Professor at the German University in Cairo teaching various topics in quality control, production and operations management, design of experiments, facilities planning, lean manufacturing, and project management. She trains and consults industrial entities in the field of quality control and six sigma. She is a reviewer in several esteemed journals and the Co-Founder of the Quality Assurance Unit in the faculty.

EMAN AZAB (Senior Member, IEEE) received the B.Sc. degree (Hons.) in electronics and communication engineering from the Faculty of Engineering, Cairo University, in 2006, the M.Sc. and Ph.D. degrees in electronics engineering from the German University in Cairo, Egypt, in 2008 and 2012, respectively, and the postdoctoral degree from TU Darmstadt, in 2015, with a focus on ion-beam diagnostics and research work at GSI Helmholtzzentrum, Germany. She holds the position of Assistant Professor at the German University in Cairo teaching various topics in analog, mixed signal electronics, and electrical engineering, since 2016. She leads and is part of multiple research teams in the fields of sensor technologies, IC design, and Industry 4.0. She has numerous publications at well recognized international conferences and journals in the field. She is serving as the Secretary for the IEEE Women in Engineering Egypt Section.

WALID ZAHER received the B.Sc. degree in electronics from the Faculty of Information Engineering Technology, German University in Cairo (GUC), in 2020, and the M.Sc. degree from GUC, in 2022, with a focus on the topic of controlling the path planning tasks of omni-directional conveyors using machine learning algorithms, specifically reinforcement learning algorithms.

MAGGIE MASHALY (Senior Member, IEEE) received the B.Sc. degree (Hons.) in information engineering and technology and the master's degree in networking from the German University in Cairo, Egypt, in 2010 and 2011, respectively, and the Ph.D. degree in the area of cloud computing and computer networks from the University of Stuttgart, Germany, in 2017. She holds the position of Assistant Professor at the German University in Cairo teaching various topics in machine learning, data engineering, computer engineering, and networks. She leads multiple research teams in the fields of cloud computing, edge computing, machine learning, and Industry 4.0, has numerous publications at well recognized international conferences and journals in the field, and also acts as a reviewer and a Program Committee Member at many of them. She is a Board Member of the IEEE Women in Engineering Egypt Section.

ARSANY W. YOUSSEF received the B.Sc. and M.Sc. degrees (Hons.) in mechatronics engineering from the Faculty of Engineering and Materials Science, German University in Cairo (GUC), in 2018 and 2021, respectively. His bachelor's degree was in the aerodynamics control of flapping wings UAVs, with a publication in this field in 2020. His M.Sc. degree topic was in the industrial automation of smart factories (Industry 4.0). He worked as a part-time Teaching Assistant at the German International University (GIU), Berlin, Germany, in 2020. He holds the position of Assistant Lecturer at GUC teaching various topics in mechatronics engineering and industrial automation, since September 2021.