
International Journal of Cognitive Computing in Engineering 2 (2021) 47–56


Performance characterization of reinforcement learning-enabled evolutionary algorithms for integrated school bus routing and scheduling problem

Eda Koksal∗, Abhishek R. Hegde, Haresh P. Pandiarajan, Bharadwaj Veeravalli
Electrical and Computer Engineering Department, National University of Singapore, 117583, Singapore

Keywords:
Reinforcement learning
Ant colony optimization
Genetic algorithm
Particle swarm optimization
School bus routing and scheduling
Combinatorial optimization

Abstract

This paper focuses on a bi-objective school bus scheduling optimization problem, a subset of the vehicle fleet scheduling problem. In the literature, the school bus routing and scheduling problem is proven to be NP-Hard. The processed data supplied by our framework is used by evolutionary algorithms, with the aid of reinforcement learning, to search for a near-optimum schedule. These algorithms are named the reinforcement learning-enabled genetic algorithm (RL-enabled GA), the reinforcement learning-enabled particle swarm optimization algorithm (RL-enabled PSO), and the reinforcement learning-enabled ant colony optimization algorithm (RL-enabled ACO). This paper investigates the performance characterization of reinforcement learning-enabled evolutionary algorithms for the integrated school bus routing and scheduling problem. The efficiency of the conventional algorithms is improved, and the near-optimal schedule is achieved in a significantly shorter duration with the active guidance of the reinforcement learning algorithm. We carry out an extensive performance evaluation with experiments on a geospatial dataset comprising road networks, trip trajectories of buses, and the addresses of students. Both the conventional and the reinforcement learning-integrated algorithms improve the travel time of buses and students. More than 50% saving over the constructive heuristic algorithm is achieved by the conventional and the reinforcement learning-enabled ant colony optimization algorithms from the 92nd and 54th iterations, respectively. Similarly, the saving by the conventional genetic algorithm is 41.34% at the 500th iteration, while the reinforcement learning-enabled genetic algorithm reaches more than 50% improvement from the 281st iteration. Lastly, more than 10% saving is achieved by the conventional and the reinforcement learning-enabled particle swarm algorithms from the 432nd and 28th iterations, respectively.

∗ Corresponding author.
E-mail address: eda_koksal@u.nus.edu (E. Koksal).

https://doi.org/10.1016/j.ijcce.2021.02.001
Received 26 November 2020; Received in revised form 19 January 2021; Accepted 4 February 2021
Available online 8 February 2021
2666-3074/© 2021 The Authors. Publishing Services by Elsevier B.V. on behalf of KeAi Communications Co. Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Introduction

Smart City Infrastructure (SCI) includes cloud-based Internet of Things (IoT) systems as an imperative component. These cloud-based IoT systems comprise a variety of sensors, devices, and compute and storage elements utilized by a variety of applications. The data collected, processed, and shared through services deployed in a cloud environment is valuable and crucial to smart cities. As an essential infrastructure for smart cities, this study focuses on intelligent transportation systems, particularly on the transportation of students.

The transportation of students distributed across a designated area to the relevant schools is modeled as the School Bus Routing and Scheduling (SBRS) problem, which is a subset of the Vehicle Routing Problem (VRP). The SBRS problem aims to utilize school buses efficiently, to reduce the utilization of resources, and to reduce the inconvenience of students and parents.

SBRS has been classified as an NP-Hard combinatorial optimization problem. The state-of-the-art algorithms rely on meta-heuristics to solve such combinatorial optimization problems in polynomial time complexity (Kang et al., 2015, Pacheco, Caballero, Laguna and Molina, 2013).

The dominant algorithms to tackle combinatorial optimization problems are Evolutionary Algorithms (EA) such as the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Ant Colony Optimization (ACO). According to the formal definition, a meta-heuristic is an iterative process that guides a subordinate heuristic for exploring and exploiting the search space to find efficient near-optimal solutions (Osman and Laporte, 1996, Voß, Martello, Osman and Roucairol, 2012). Yet, these traditional algorithms suffer from two major drawbacks. On the one

hand, the state-of-the-art algorithms suffer from a generalization drawback, since they require expert knowledge and algorithmic decisions. On the other hand, the computational time increases while handling large-scale problem instances, even when expert knowledge is assumed to be available (Bengio, Lodi and Prouvost, 2020, Mazyavkina, Sviridov, Ivanov and Burnaev, 2020).

From the perspective of computation time, achieving adequate performance while solving a combinatorial optimization problem with a complex dataset represents a challenging task. Thus, in this paper, we augment an agent-based Reinforcement Learning (RL) approach to guide the combinatorial optimization algorithms. Our hypothesis is that the guidance of RL aids the algorithm in achieving a near-optimum schedule within a smaller number of iterations. The augmentation process dynamically controls the hyper-parameters by introducing additional exploitation/exploration factors along with the conventional operators and acting according to the convergence rate. This facilitates escaping from getting trapped in any local optima, which in turn reduces the time to convergence.

The agent learns and takes decisions to improve the performance of evolutionary algorithms. In order to validate the hypothesis, in this paper we conduct a rigorous performance characterization study involving these algorithms (GA, PSO, and ACO) and observe certain key characteristics such as suitability, convergence rate, functionality, and efficiency.

In our context, certain factors influence and magnify the complexity of our problem. To represent our problem, we assume that nodes represent road intersection coordinates, and edges represent street coordinates. Given that, our framework supplies the following data: 58440 edges, 27179 nodes, the distances, and the travel time prediction between these various edges/nodes. Firstly, the possible locations to pick up students are assigned. However, the intermediate nodes/edges between these possible pick-up locations are not known beforehand. Furthermore, the intermediate nodes vary depending on the time of the day (peak hour vs. non-peak hour). Particularly, high variation is observed in the scheduling of the fleet during peak hours. Along with the complexity introduced by the data, the proposed algorithm is a service of our framework. SBRS is our test case, with additional constraints imposed by the transport operators and the schools. However, the algorithm is developed to schedule generic bus fleets with significantly varying daily demand. Therefore, the convergence speed to a near-optimum is as crucial for this problem as the efficiency of the result.

In our study, a geospatial real-world based dataset is employed to evaluate the approach in terms of functionality and efficiency. The results of the RL-enabled GA, RL-enabled PSO, and RL-enabled ACO algorithms are compared with the existing schedule provided by transport operators, the schedule generated by the constructive heuristic, and the schedules generated by the conventional GA, PSO, and ACO algorithms.

In this paper, the performance characterization of reinforcement learning-enabled evolutionary algorithms for the integrated school bus routing and scheduling problem is investigated. The results demonstrate that the RL-integrated algorithms perform significantly better than the conventional algorithms and the schedule produced by the constructive heuristic algorithm. The objective function is improved by more than 50% by the RL-enabled ACO and RL-enabled GA algorithms. Furthermore, RL-enabled ACO and RL-enabled GA accomplish these improvements with significantly fewer generations. Lastly, these methods are almost industry-ready, and their effectiveness is proved on a real-world case study.

The rest of the paper is organized as follows: Section 2 overviews the SBRS problem and the methodologies in combinatorial optimization problems. Section 3 explains the problem formulation of SBRS and the proposed methodology. In Section 4, the detailed results of our experiments are reported. Finally, the conclusion and future work are drawn in Section 5.

Related work

The SBRS problem is a subset of the VRP, and it is well studied in the literature. There are various studies for the subset problems of the VRP. Briefly, to tackle the supply chain problem, a biogeography-based optimization (BBO) algorithm, a cuckoo optimization algorithm (COA), Simulated Annealing (SA), gray wolf, and invasive weed optimization algorithms have been applied (Babaee Tirkolaee, Goli, Pahlevan and Malekalipour Kordestanizadeh, 2019, Goli and Davoodi, 2018, Goli, Zare, Tavakkoli-Moghaddam and Sadegheih, 2020, Sangaiah, Tirkolaee, Goli and Dehnavi-Arani, 2020). In another study, Multi-Objective Mixed-Integer Linear Programming (MOMILP) has been applied to optimize multi-objective production planning based on a fuzzy set, due to the uncertainty of seasonal demand (Tirkolaee, Goli and Weber, 2019). In addition to the fuzzy set, to tackle demand uncertainty, a neural network with a runner root algorithm (RRA) has been applied to predict future dairy demand. Later, an exact solution method has been applied to solve the multi-objective product portfolio problem with the predicted demand data (Goli, Zare, Tavakkoli-Moghaddam and Sadeghieh, 2019). Moreover, a hybrid ANN has also been proposed and improved with Bat and Firefly EAs to predict air travel demand (Mostafaeipour, Goli and Qolipour, 2018). Lastly, a mathematical model has been developed for locating temporary relief centers and routing relief vehicles under critical conditions; a variable neighborhood search (VNS) algorithm is applied to tackle this optimization problem (Davoodi and Goli, 2019).

This section focuses on the review of the SBRS problem and describes various methods from the literature. The problem consists of sub-problems: bus stop selection, bus route generation, and bus scheduling (Kim, Kim and Park, 2012, Park and Kim, 2010).

Bus stop selection

This sub-problem is to assign students to a bus stop and to determine the subset of bus stops that needs to be visited by buses. Generally, this sub-problem can be transformed into an NP-Hard problem (Fisher, Jaikumar and Van Wassenhove, 1986).

The state-of-the-art algorithms to tackle the bus stop selection sub-problem rely on heuristic algorithms. This sub-problem has been studied under three strategies: location-allocation-routing (LAR), allocation-routing-location (ARL), and integration of the sub-processes (Riera-Ledesma and Salazar-González, 2012, Sarubbi et al., 2016, Schittekat et al., 2013). LAR determines the bus stops first, and ARL determines the routes first.

Bus route generation and bus scheduling

Bus route generation and bus scheduling sub-problems are also combinatorial optimization problems. Due to their complexity, these two sub-problems are considered separately and consecutively in the literature. Yet, optimizing these sub-problems individually might not lead to an overall near-optimum solution (Shafahi, Wang and Haghani, 2018). Hence, in this study, these sub-problems are integrated. The review of the studies that tackle the integrated routing and scheduling problems is consolidated in Table 1.

Improvement Approaches

To tackle combinatorial optimization problems in polynomial time complexity, meta-heuristic algorithms such as EAs are the dominant approach. An EA is an iterative process that guides a subordinate heuristic for exploring and exploiting the search space to find efficient near-optimal solutions (Osman and Laporte, 1996, Voß, Martello, Osman and Roucairol, 2012). However, EAs are known for their sensitivity regarding the choice of the values of the parameters (Karafotias, Hoogendoorn and


Table 1
Literature classification on the integrated school bus routing and scheduling.

Reference Methodology Description

(Stodola, Mazal, Podhorec and Litvaj, 2014) Integrated ACO, homogeneous vehicle fleet
(Arias-Rojas, Jiménez and Montoya-Torres, 2012) Individual ACO for allocation; then the problem is converted to a Travelling Salesman Problem
(Kang et al., 2015) Integrated GA for the integrated problem
(Kim and Son, 2012) Integrated PSO to minimize the number of vehicles
(Alinezhad et al., 2018) Integrated PSO, homogeneous vehicle fleet
(Kiriş and Özcan, 2020) Individual GA-based approach with k-means
(Mahmoudzadeh and Wang, 2020) Individual Cluster-based approach with dynamic demand characteristics

Table 2
Literature classification on the improvement approaches.

Reference Methodology Description

(Karafotias, Smit and Eiben, 2012) Adaptive The parameters are predicted by Artificial Neural Network (ANN) based on diversity
and fitness values with online calibration.
(Gong, Tang, Li and Zhang, 2019) Deterministic Separate subpopulations for GA are created with fixed parameters.
(Böttcher, Doerr and Neumann, 2010) Adaptive The mutation probability of GA is altered based on an equation.
(Ratnaweera, 2002, Zheng, Ma, Zhang and Qian, 2003) Adaptive Linear time varying inertia weight w of PSO is adapted.
(Naka, Genji, Yura and Fukuyama, 2001) Adaptive Nonlinear time varying inertia weight w is applied.
(Ratnaweera, Halgamuge and Watson, 2004) Adaptive Linearly varying c1 and c2 coefficients are used.
(Lessing, Dumitrescu and Stützle, 2004) Adaptive Dynamic heuristic matrix is applied.
(Chusanapiputt, Nualhong, Jantarang and Phoomvuthisarn, 2006) Adaptive Based on pheromone dispersion, the 𝛼, 𝛽, and 𝜌 are adapted.
(Li and Li, 2007) Adaptive Time varying 𝛼 and 𝛽 is applied.
(Martens et al., 2007) Self-Adaptive The ant determines the value of parameters by decision rules.
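As an illustration of the "Adaptive" rows above, the linearly time-varying PSO parameters (Ratnaweera-style inertia weight and acceleration coefficients) can be sketched in a few lines. This is a minimal sketch: the boundary values (w from 0.9 to 0.4, c1 from 2.5 to 0.5, and c2 the reverse) are common choices in the PSO literature, not values reported in this paper.

```python
# Hedged sketch of deterministic/adaptive parameter control for PSO:
# the inertia weight w and the acceleration coefficients c1, c2 are
# varied linearly over the run instead of being kept constant.
# The boundary values are assumed, illustrative choices.

def linear_inertia_weight(iteration, max_iterations, w_start=0.9, w_end=0.4):
    """Decrease w linearly so early iterations explore and late ones exploit."""
    frac = iteration / max_iterations
    return w_start - (w_start - w_end) * frac

def linear_acceleration(iteration, max_iterations, c_start=2.5, c_end=0.5):
    """Vary c1 (cognitive) down and c2 (social) up linearly over time."""
    frac = iteration / max_iterations
    c1 = c_start - (c_start - c_end) * frac
    c2 = c_end + (c_start - c_end) * frac
    return c1, c2
```

Early in the run the swarm favors its own exploration (large w and c1); toward the final iteration it favors the global best (small w, large c2).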

Table 3
The table shows the list of notations and the terminology of the problem statement with their summary.

Notation Description

𝐵𝑆 A set of all bus stops.
𝑆 A set of all the students, where si indicates the ith student.
𝑃 A set of selected roadsides, where 0 indicates the school, and at each roadside j, for j = 1, 2, …, np, one or more students are waiting for the bus.
𝐵 A set of all available buses, where m is the total number of buses and bi indicates the ith bus.
𝑓𝑏 Travel time of all buses (TTB).
𝑓𝑠 Travel time of all students (TTS).
𝜑 Unity function to normalize and map the data into a scale bounded from 0 to 1.
𝜔𝑏 , 𝜔𝑠 The weights of the objective functions for the travel time of buses and students.
𝐻𝑖 The number of students to be picked up from roadside i, for i = 1, 2, …, np.
𝜂 The maximum utilization of buses.
𝐶𝑖 The capacity of the ith bus.
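Using the notation of Table 3, the bi-objective cost minimized in the problem formulation can be sketched in a few lines. Interpreting the unity function 𝜑 as min-max normalization is an assumption for illustration, as are the example travel-time bounds and values; the paper only states that 𝜑 maps the data into a scale bounded from 0 to 1.

```python
# Hedged sketch of the weighted-sum objective f = w_b*phi(f_b) + w_s*phi(f_s)
# using Table 3's notation. Treating phi as min-max normalization and the
# numeric bounds below are illustrative assumptions, not values from the paper.

def phi(value, lower, upper):
    """Unity function: map a travel-time value into the [0, 1] scale."""
    return (value - lower) / (upper - lower)

def objective(ttb, tts, bounds_b, bounds_s, w_b=0.5, w_s=0.5):
    """Weighted sum of normalized travel time of buses (TTB) and students (TTS)."""
    assert 0.0 <= w_b <= 1.0 and abs(w_b + w_s - 1.0) < 1e-9  # weights sum to 1
    return w_b * phi(ttb, *bounds_b) + w_s * phi(tts, *bounds_s)

# Illustrative numbers only: a schedule with TTB = 300 min and TTS = 900 min,
# normalized against assumed fleet-wide bounds.
f = objective(300.0, 900.0, bounds_b=(200.0, 600.0), bounds_s=(600.0, 1800.0))
```

The decision-makers can sweep w_b and w_s to examine the effect of different weightings on the preferred schedule.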

Eiben, 2015). Since EAs are iterative algorithms, if their parameters' values are not chosen correctly, the overall result might be inefficient. The most popular EAs are the GA, PSO, and ACO algorithms. The studies that improve the efficiency of these algorithms during the computation, to increase the quality of the near-optimum values, are consolidated in Table 2.

School bus route generation and scheduling problem statement and constructive heuristic

The SBRS problem was modeled by Newton and Thomas (Newton and Thomas, 1974). There are various configurations, and this NP-Hard optimization problem can be approached in various ways; refer to Section 2.

In this study, we considered two stakeholders: transport operators and students. The transport operators would attempt to minimize their cost; on the other hand, students would expect minimization of their travel time. Even though the nature of these objective functions is supportive at the beginning of the algorithm, they conflict with each other when the algorithms are converging. The notations and the terminology used are summarized in Table 3.

Problem statement - formulation of the SBRS problem

The aim is to minimize the travel time of buses and students. The data range of these two functions is normalized and mapped into another scale bounded from 0 to 1, based on the work of Bowerman et al. (Bowerman, Hall and Calamai, 1995). Based on our assumption mentioned above, the problem is formulated as follows:

min 𝑓 = 𝜔_b ∗ 𝜑(𝑓_b) + 𝜔_s ∗ 𝜑(𝑓_s)    (1)

subject to

∑_{j=1}^{n_i} 𝐻_{r_i(j)} ≤ 𝜂 ∗ 𝐶_i, for i = 1, 2, …, m    (2)

∑_{i=1}^{m} n_i = n_p    (3)

∑_{i=1}^{m} ∑_{j=1}^{n_i} 𝐻_{r_i(j)} = n_s    (4)

r_i(j) ≠ r_k(j), ∀ i, k ∈ 𝐵, ∀ j ∈ 𝑃    (5)

𝑇_{b_i} ≤ 𝑇_max, for i = 1, 2, …, m    (6)

𝜔_b, 𝜔_s ∈ [0, 1], and 𝜔_b + 𝜔_s = 1    (7)

Objective function Eq. (1) combines these two functions with the weighted-sum approach. The effect of different weightings can be examined by the decision-makers to find the desired solution. Our bus fleet capacity is not homogeneous, and some spare seats are reserved by Eq. (2). Eqs. (3) and (4) enforce picking up all students and visiting all selected roadsides. Eq. (5) prevents each pick-up point from being assigned to

Table 4
The table shows the list of notations and the terminology of reinforcement learning algorithm with their summary.

Notation Description

𝑄(𝑆, 𝐴) Q-Table; stores the expected long-term impact of taking a specific action from a specific state.
𝐸(𝑆, 𝐴) E-Table; eligibility trace mechanism that signifies the influence of the action taken from a specific state on the gained reward.
𝜎(𝑡) The reward function.
𝜛 The learning rate; controls the influence of the target on the current Q-values.
𝛿 The target, temporal difference (TD) error.
𝛾 The discount rate; determines the current value of future rewards.
𝜆 The trace decay; determines the fallback rate of the eligibility trace.
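A minimal sketch of how the Table 4 quantities interact: 𝜖-greedy action selection over the Q-Table, followed by the temporal-difference and eligibility-trace updates described by Eqs. (10), (11), and (12) in the text. The dictionary-based tables, the 𝜖 value, and incrementing E for the visited state-action pair are illustrative assumptions, not details stated in the paper.

```python
import random

# Hedged sketch of the agent loop over the Q-Table Q(S, A) and the
# E-Table E(S, A) from Table 4. The updates follow the paper's form:
# delta = reward + gamma*Q(S', A') - Q(S, A), then Q += rate*delta*E,
# then E decays by gamma*lambda. The visit increment on E and the
# epsilon value are standard-practice assumptions.

ACTIONS = range(5)  # five actions: raise/lower either parameter, or maintain

def select_action(Q, state, epsilon=0.1):
    """Epsilon-greedy policy: random action with probability epsilon,
    otherwise the action with the highest Q-value for this state."""
    if random.random() < epsilon:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(Q, E, s, a, reward, s_next, a_next,
           learning_rate=0.1, gamma=0.9, trace_decay=0.8):
    """One update sweep of every state-action entry."""
    delta = reward + gamma * Q[(s_next, a_next)] - Q[(s, a)]  # TD error
    E[(s, a)] = E.get((s, a), 0.0) + 1.0  # assumed visit bookkeeping
    for key in list(Q):
        Q[key] += learning_rate * delta * E.get(key, 0.0)
        E[key] = gamma * trace_decay * E.get(key, 0.0)
```

Each generation, the agent would pick an action for the current parameter state, run one iteration of the EA, compute the reward, and call `update` before moving to the next state.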

different buses. Eq. (6) enforces the upper bound on the duration of a route. Lastly, Eq. (7) imposes that the weights of the objective functions be within range.

The two objective functions 𝑓_b and 𝑓_s are converted to a linear combination, as shown in Eq. (1). Still, the problem falls within the variants of the Travelling Salesman Problem (TSP), which is also NP-Hard, due to the requirements of the problem. Initially, owing to the requirement of identifying intermediate nodes/edges, the travel time prediction within a 15-minute time interval is challenging. Furthermore, the starting node/edge is unknown, and the two-way tour distance and time between any two nodes/edges are not strictly equal. Furthermore, the proposed service is a part of our framework, and SBRS is a test case. In generic cases, the service has to respond and achieve a near-optimum schedule with dynamic demand.

Bus stop selection

In this study, the LAR strategy is followed. Note that the number of bus stops is greater than the number of students. Furthermore, students cannot cross the street, and bus drivers have to drive on the left side of the road. Thus, it is crucial to determine the relative position between the pick-up point and its nearest road segment before deciding the intermediate nodes/edges. The student is assigned to a pick-up point based on the angle and the equation between two lines, by considering geographical coordinates.

Initial state generation process – constructive heuristic

This constructive heuristic method allocates buses to roadsides for the first time to generate initial solutions. A school-centered system is built to scan roadsides and allocate buses in a radar mode. We developed this constructive heuristic from the idea of sectoring of Thangiah and Nygard (Thangiah and Nygard, 1992) and Corberán et al. (Corberán, Fernández, Laguna and Marti, 2002). Mostly, the pick-up points close to each other are to be visited by the same bus; thus, the pick-up points are sectored. Initial solutions are obtained by changing the phasing. Phasing is the starting state of the scanning in radar mode, measured clockwise from the north line.

School bus route generation and scheduling

EAs are iterative and generic population-based meta-heuristic algorithms to achieve a near-optimum solution for combinatorial optimization problems within polynomial time complexity (Kang et al., 2015, Pacheco, Caballero, Laguna and Molina, 2013).

However, the sensitivity regarding the choice of the values of the parameters is a known drawback of EAs (Karafotias, Hoogendoorn and Eiben, 2015). Since EAs are iterative algorithms, if their parameters' values are not selected correctly, the algorithm might not achieve a near-optimal solution. Especially if the dataset of the problem is complex, attaining an adequate performance becomes a challenging task.

In this study, an agent-based RL approach is integrated to guide EAs. Three popular EAs are applied for the SBRS problem: the GA, PSO, and ACO algorithms.

Reinforcement learning

Reinforcement learning is an area of machine learning, and it is an agent-based approach. In a given state of the environment, the goal-directed agent learns and takes an action based on its policy and receives a reward from the environment. Based on the received feedback, the agent changes its state in the environment. In the RL context, the aim is to maximize the expected sum of future rewards (Sutton and Barto, 2018); refer to Table 4 for notations.

Solving combinatorial optimization problems with a complex dataset can be a challenging task, particularly when adequate performance is required from the perspective of computation time. This paper introduces reinforcement learning-enabled combinatorial optimization algorithms to solve the integrated bus route generation and scheduling problem. RL plays the role of guiding the optimization algorithms on-the-fly to aid in achieving a near-optimal result within a smaller number of iterations. RL aims to explore the search space of the problem and learn from this experience the best performing policy.

In this study, RL is integrated with the GA, PSO, and ACO algorithms to improve their performance. These EAs are sensitive regarding the choice of the values of their parameters. Each of these algorithms has two parameters that influence its performance. The following method is applied to these algorithms to integrate RL.

The continuous range of both parameters is discretized into d intervals. There are d² different combinations of these two parameters, and these combinations represent different states. Each state defines the range for the value of the probability parameters. Later, a random value (using a uniform distribution) is assigned to these parameters from the range that is set by the RL. There are five actions (increasing or decreasing either parameter, or maintaining the current values) to find the new state based on the current state and the reward.

The learning process is traced by an estimated state-action table, the Q-Table Q(S, A), for each state and the possible five actions from this state. The action is selected from the Q-Table based on the 𝜖-greedy policy, which either chooses a random action with a probability 𝜖 or chooses the action with the highest Q-value.

𝜎(𝑡) = (1/2) ∗ (1/𝑓_{t+1}) + (𝜔_b/2) ∗ Ω_B + (𝜔_s/2) ∗ Ω_S    (8)

Ω_B = 1 − 𝑇_bt/MaxTTB,  Ω_S = 1 − 𝑇_st/MaxTTS    (9)

Furthermore, an eligibility trace mechanism, the E-Table E(S, A), is introduced as a temporary memory to record the occurrence of events and the influence of actions on the next states. The E-Table signifies the influence of the action taken from a specific state on the gained reward.

Unlike the conventional algorithms, where the parameters are set as constants, the current state assigns the values of both parameters at each generation. The algorithm applies the selected values of the parameters for only one iteration and produces the new solution. The reward (objective) is calculated by RL. The expected long-term impact and the influence of the action from the current state are updated based on the


Table 5 Each chromosome is evaluated by the fitness function f calculated by


The table shows the list of notations and the termi- the school bus routing process, refer to Eq. 1. Furthermore, Elite Pre-
nology of genetic algorithm with their summary. serve Strategy is often used. At each generation, the chromosomes with
Notation Description high fitness function value f are replaced by the chromosomes with low
fitness function values.
𝑝𝑐 The probability parameter of crossover
𝑝𝑚 The probability parameter of mutation
The 3-point crossover and 3-point mutation operator are applied,
𝑑 The discretization parameter and the number of new chromosomes produced is 2k+1 − 2 = 14 and
k! − 1 = 5, respectively.
The performance of GA depends on its operators. The GA operators
reward. The reward function introduces the aim implicitly and explic- are controlled by the probability parameters, which are a measure of the
itly, refer to Eq. (8) and (9). likelihood that the operator will be applied to the chromosomes. These
Lastly, Q-Table and E-Table are updated based on the temporal dif- parameters are the probability parameter of mutation pm and crossover
ference between two consecutive generations, refer to Eq. (10), (11), and pc . Therefore, RL guidance for GA can be integrated by controlling its
(12). The influence of the temporal difference on the current Q-values is parameters, named RL-enabled GA (Koksal Ahmed, Li, Veeravalli and
controlled by the learning rate ϖ. The temporal difference is influence by Ren, 2020). An adaptive learning-based controller is chosen to improve
the improvement scale problem of EAs. The improvement of the fitness the GA’s efficiency by finding an optimum value for the probability pa-
value will decay in nature when EA is evolving. Consequently, we only rameters of GA on-the-fly.
considered the fitness function value of the current generation and con-
trolled the learning rate of RL to decrease the influence of the temporal Reinforcement learning-enabled particle swarm optimization algorithm
difference on the learned Q-values to prevent fitness decay.
( ) ( ) PSO algorithm is inspired by the organism behavior such as bird
𝛿 = 𝜎(𝑡) + 𝛾 ∗ 𝑄 𝑆𝑡+1 , 𝐴𝑡+1 − 𝑄 𝑆𝑡 , 𝐴𝑡 (10) flocking (Kennedy and Eberhart, 1995a, Kennedy and Eberhart, 1995b).
With the physical movements of the individuals in the swarm, the aim
𝑄(𝑆, 𝐴) = 𝑄(𝑆, 𝐴) + (𝜛 ∗ 𝛿 ∗ 𝐸 (𝑆, 𝐴)) (11) is to find a near-optimum solution. In the natural world, the birds are
inspired by their personal knowledge, and the swarm knowledge to find
the food source. While the observable area of a bird is limited, all birds
𝐸 (𝑆, 𝐴) = 𝛾 ∗ 𝜆 ∗ 𝐸 (𝑆, 𝐴) (12) in a swarm can be aware of the larger area. The terminology of PSO is
In this paper, we augment an agent-based RL approach to guide as follows and refer to Table 6 for notations.
the combinatorial optimization algorithms. Firstly, EAs are generic Each particle is represented as a position vector XV that stores all
population-based metaheuristic algorithms. Although these algorithms selected roadsides Pj where students are mapped in the discrete do-
carry a good population to produce the next population, there is no guar- main. Besides the selected roadside, the corresponding information of
antee for improvement as the process is based on a random approach. the roadside is also stored like source node, destination node, the num-
Secondly, EAs are known for their sensitivity regarding the choice of the ber of students, the roadside ID, and the assigned bus ID.
values of their parameters (Karafotias, Hoogendoorn and Eiben, 2015). The particle position XV vector is converted to an intermediate dis-
Therefore, if the parameters’ values are not chosen correctly or kept crete vector Y by applying the rule represented in Eq. (13). The particle
constant for each iteration, the overall result might be inefficient. Our position XV vector is compared with the personal best position of the
hypothesis is that the guidance of RL aids the algorithm in achieving a particle 𝜁 and the global best particle position vector G, where p to be
near-optimum schedule within a smaller number of iterations. The aug- the index of the particle and d to be the index of the student.
mentation process dynamically controls the hyper-parameters by intro- ( )
⎧ 1, if 𝑥𝑣iter
pd
−1 = 𝐺 iter−1
𝑑
ducing additional exploitation/exploration factors along with the con- ⎪
⎪ ( )
ventional operators and acting according to the convergence rate. This
iter ⎪ −1, if 𝑥𝑣iter
pd
−1 = 𝜁 iter−1
pd
facilitates escaping from getting trapped in any local-optima, which in 𝑦pd = ⎨ (13)
( )
return results in a reduced convergence rate. ⎪
⎪−1 or 1 randomly, if 𝑥𝑣pd = 𝜁pd = 𝐺𝑑
iter −1 iter−1 iter−1


Reinforcement learning-enabled genetic algorithm ⎩ 0, otherwise
Every particle also consists of a velocity vector the same length as
GA is a class of optimization algorithms inspired by the process of natural selection and was proposed by Holland (Holland, 1975). Through biologically inspired operators such as mutation, crossover, and selection, the possible solutions are represented as genes; refer to Table 5 for the notations. The terminology of GA is as follows.

One gene represents a selected roadside $P_j$ where students are mapped. A single gene stores the number of students at this roadside ($H_j$), the source $src_j$ and destination $dest_j$ nodes for the direction of the road segment, and the ID of the assigned bus $b_i$. One solution is represented in the form of a chromosome.

The constructive heuristic generates a group of different chromosomes as the first population. During each generation, GA selects individual chromosomes at random from the previous population to produce the new chromosomes for the next generation. GA applies genetic operations to generate different chromosomes. A complete GA includes three basic genetic operations: selection, crossover, and mutation. Selection chooses the individual chromosomes that contribute to the population at the next generation. Crossover combines two chromosomes to form new chromosomes in the next generation. Lastly, mutation applies random changes to a chromosome to generate new chromosomes.

the position vector with floating-point values representing the velocity with which the particle moves in the search space. After the intermediate vector $Y$ is generated, the PSO algorithm proceeds to the update of the velocity vector $V$ and the position vector $\lambda$ of the particle in the continuous domain. Firstly, the distance $d_1$ between the position vector $XV$ and the personal best solution of the particle $\zeta$ is measured by Eq. (14). The distance $d_2$ between the position vector $XV$ and the global best solution of the swarm $G$ is measured by Eq. (15).

$d_1 = -1 - y_{pd}^{iter}$ (14)

$d_2 = 1 - y_{pd}^{iter}$ (15)

$v_{pd}^{iter} = w \cdot v_{pd}^{iter-1} + c_1 \cdot r_1 \cdot d_1 + c_2 \cdot r_2 \cdot d_2$ (16)

$\lambda_{pd}^{iter} = y_{pd}^{iter} + v_{pd}^{iter}$ (17)

Secondly, the velocity vector $V$ for each particle is updated based on the distances $d_1$, $d_2$ and the velocity vector of the previous iteration,
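The GA loop described above can be sketched as a short program. This is a minimal illustration, not the paper's implementation: the gene layout (stop, students, src, dest, bus), the toy load-balancing fitness, and all parameter values are assumptions for demonstration only.

```python
import random

random.seed(0)  # reproducible demo

BUSES = list(range(4))  # hypothetical fleet of 4 buses

def random_gene(stop_id):
    # Gene: (roadside id, number of students, src node, dest node, assigned bus).
    return (stop_id, random.randint(1, 5), 2 * stop_id, 2 * stop_id + 1,
            random.choice(BUSES))

def random_chromosome(n_stops=10):
    # One chromosome (solution): one gene per roadside with students.
    return [random_gene(s) for s in range(n_stops)]

def fitness(chrom):
    # Toy objective standing in for total travel time: balance bus loads
    # (lower is better). In the paper, fitness comes from the routing process.
    loads = [sum(1 for g in chrom if g[4] == b) for b in BUSES]
    return max(loads) - min(loads)

def select(pop, k=3):
    # Tournament selection: the best of k randomly drawn chromosomes.
    return min(random.sample(pop, k), key=fitness)

def crossover(a, b):
    # One-point crossover combining two parent chromosomes.
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chrom, pm=0.1):
    # Mutation: reassign each gene's bus with probability pm.
    return [(s, h, src, dst, random.choice(BUSES)) if random.random() < pm
            else (s, h, src, dst, bus) for (s, h, src, dst, bus) in chrom]

def ga(pop_size=20, generations=50):
    pop = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        pop = [mutate(crossover(select(pop), select(pop)))
               for _ in range(pop_size)]
    return min(pop, key=fitness)

best = ga()
```

Lower fitness here means more evenly loaded buses; in the paper, the chromosome would instead be evaluated by the school bus routing process.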
E. Koksal, A.R. Hegde, H.P. Pandiarajan et al. International Journal of Cognitive Computing in Engineering 2 (2021) 47–56
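The discrete PSO update of Eqs. (14)-(17) can be sketched for a single dimension of one particle. The parameter values $w$, $c_1$, $c_2$ and the threshold $\alpha$ used to map $\lambda$ back into the discrete domain are illustrative assumptions, not the tuned values from the experiments.

```python
import random

random.seed(42)  # reproducible demo

w, c1, c2, alpha = 0.7, 1.5, 1.5, 0.5  # illustrative parameter values

def update_dimension(y, v):
    d1 = -1 - y   # Eq. (14): distance toward the personal-best marker (-1)
    d2 = 1 - y    # Eq. (15): distance toward the global-best marker (+1)
    r1, r2 = random.random(), random.random()
    v = w * v + c1 * r1 * d1 + c2 * r2 * d2   # Eq. (16): velocity update
    lam = y + v                               # Eq. (17): continuous position
    # Threshold back to the discrete domain {-1, 0, 1}, in the spirit of Eq. (18).
    if lam > alpha:
        return 1, v
    if lam < -alpha:
        return -1, v
    return 0, v

y, v = 0, 0.0
for _ in range(10):
    y, v = update_dimension(y, v)
```

In the full algorithm this update runs over every dimension of every particle's vectors, and the resulting discrete values select between the global best, the personal best, or any vehicle, per Eq. (19).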
Table 6
The list of notations and the terminology of the particle swarm optimization algorithm with their summary.

Notation: Description
$XV_p$: The position vector of particle $p$; $xv_{pn}$ is the index of the $n$th bus stop with students to be scheduled.
$V_p$: The velocity vector of particle $p$.
$\zeta_p$: The position vector of particle $p$'s personal best solution; $\zeta_{pn}$ is the index of the $n$th bus stop with students to be scheduled.
$G$: The position vector of the global best solution for the entire swarm.
$Y_p^{iter}$: The intermediate discrete vector of particle $p$ for iteration $iter$.
$\lambda_p^{iter}$: The position vector of particle $p$ in the continuous domain for iteration $iter$.
$w$: The inertia weight, the coefficient of the previous velocity.
$c_1$, $c_2$: Acceleration coefficients.

refer to Eq. (16). At last, the position vector $\lambda$ of the particle in the continuous domain is calculated with the intermediate vector $Y$ and the velocity vector $V$; refer to Eq. (17).

After the position vector $\lambda$ of the particle in the continuous domain is updated, the intermediate vector $Y$ and the position vector $XV$ in the discrete domain are updated; refer to Eqs. (18) and (19). The vector $XV$ represents the bus assignment of the students, and these processes repeat for each iteration until PSO finds a near-optimum bus assignment. At last, the new bus assignment is directed to the school bus routing process to find the intermediate route between the bus stops. For each particle, the final travel time $f$ is calculated by the school bus routing process; refer to Eq. (1). The personal best position of the particle $\zeta$ and the global best position $G$ are updated based on Eqs. (20) and (21).

$y_{pd}^{iter} = \begin{cases} 1, & \text{if } \lambda_{pd}^{iter} > \alpha \\ -1, & \text{if } \lambda_{pd}^{iter} < -\alpha \\ 0, & \text{otherwise} \end{cases}$ (18)

$xv_{pd}^{iter} = \begin{cases} G_{d}^{iter-1}, & \text{if } y_{pd}^{iter} = 1 \\ \zeta_{pd}^{iter-1}, & \text{if } y_{pd}^{iter} = -1 \\ \text{any vehicle}, & \text{otherwise} \end{cases}$ (19)

$(\zeta_{best})_p^{iter} = \begin{cases} xv_p^{iter}, & \text{if } f(xv_p^{iter}) < f((\zeta_{best})_p^{iter-1}) \\ (\zeta_{best})_p^{iter-1}, & \text{otherwise} \end{cases}$ (20)

$(G_{best})^{iter} = \begin{cases} (G_{best})^{iter-1}, & \text{if } \forall p\ f((\zeta_{best})_p^{iter}) > f((G_{best})^{iter-1}) \\ \min_p (\zeta_{best})_p^{iter}, & \text{otherwise} \end{cases}$ (21)

Eq. (16) updates the velocity vector $V$, and it has three components, each of which aids the particles of the swarm in moving toward the globally optimal solution. The performance of PSO depends on these individual components.

The first component ($w \cdot v_{pd}^{iter-1}$), named the inertia component, is the previous velocity term and acts as a memory of the previous movement direction. The aim is to guide the particle to move along the same direction. The inertia weight $w$ has an impact on the convergence of the algorithm. When $w \geq 1$, the velocity increases over time, accelerating the particles toward the maximum velocity; this leads the swarm to diverge. When $w < 1$, the particle decelerates. Thus, the inertia weight $w$ enforces either exploration of the search space or local exploitation.

The second component ($c_1 \cdot r_1 \cdot d_1$), named the cognitive component, is the distance of the particle from its current position to its own personal best position. This term introduces the effect of returning to the personal best position of the particle, and hence it is also named the "nostalgia" of the particle.

The third component ($c_2 \cdot r_2 \cdot d_2$), named the social component, is the distance of the particle from its current position to the global best position. This component acts more like a group norm that individual particles try to attain.

There are two random parameters, $r_1$ and $r_2$, in the range [0, 1]. They help randomize the influence of the cognitive and social components. These parameters are drawn for each index of the velocity vector of each particle at each iteration. The acceleration coefficients ($c_1 \cdot r_1$) and ($c_2 \cdot r_2$) determine the importance given to the personal and swarm experience.

Consequently, the performance of PSO depends on the inertia weight $w$ and the acceleration coefficients $c_1$ and $c_2$. Similar to the RL-enabled GA, RL is integrated into PSO to guide the swarm to a near-optimum solution, named RL-enabled PSO.

Reinforcement learning-enabled ant colony optimization algorithm

ACO is an optimization algorithm modeled on the actions of an artificial ant colony and was proposed by Dorigo et al. (Dorigo, Maniezzo and Colorni, 1996). Through the cooperation of an ant colony, the aim is to find a near-optimum solution to discrete optimization problems. In the natural world, ants communicate with each other using pheromone, an aromatic essence. When ants find food while foraging, they return to their colony laying down pheromone trails. The concentration of the pheromone depends on the length of the tour. If the pheromone concentration is higher, other ants make a probabilistic decision to follow the path. As more ants pass through the same path, the pheromone accumulates. However, the pheromone evaporates on all trails; hence only the shortest path remains. The terminology of ACO is as follows; refer to Table 7 for the notations.

An ant colony consists of a pheromone matrix $\xi_{jk}$ to trace the pheromone level of the selected roadsides to which students are mapped. It represents whether a cluster can be formed between the $j$th stop and the $k$th stop. Here, let $k$ be a first bus stop that was visited by the $i$th bus. There is also a heuristic matrix $\psi_{jk}$, the reciprocal of $D_{jk}$, which is the minimum distance between the locations of students $j$ and $k$. The heuristic term is used to find the nearest neighboring stop within the tour constructed by a bus. Furthermore, the ant colony also consists of an initial stop selection vector $\Theta$ that traces the probability of a stop being selected as the first stop in the tour.

$\xi_{ij_0}^{iter+1} = (1 - \rho)\,\xi_{ij_0}^{iter} + \rho\,\Delta\xi_{ij_0}^{iter}$ (22)

$\Delta\xi_{ij_0} = \begin{cases} \frac{(F_w - F_b) + (F_w - F_s)}{F_w}, & \text{if } j_0 \in \text{first stop of } T_b \text{ or } T_g,\ i \in T_b \text{ or } T_g \\ 0, & \text{otherwise} \end{cases}$ (23)

(24)

$Prob_{jk}^{h} = \frac{\xi_{jk}^{\alpha} \cdot \psi_{jk}^{\beta}}{\sum_{k \in T} \xi_{jk}^{\alpha} \cdot \psi_{jk}^{\beta}}$ (25)
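The global pheromone update of Eqs. (22) and (23) evaporates every entry and deposits a reward, scaled by how much the best and iteration-best fitness improve on the worst, only on the reinforced (stop, first-stop) pairs. A minimal sketch, with the dictionary storage and variable names as assumptions:

```python
rho = 0.1  # illustrative evaporation factor

def delta(F_w, F_b, F_s):
    # Eq. (23): reward term from the worst (F_w), global-best (F_b),
    # and iteration-best (F_s) fitness values.
    return ((F_w - F_b) + (F_w - F_s)) / F_w

def global_update(xi, reinforced_pairs, F_w, F_b, F_s):
    # Eq. (22): evaporate everywhere, deposit only on reinforced pairs.
    d = delta(F_w, F_b, F_s)
    for pair in xi:
        deposit = d if pair in reinforced_pairs else 0.0
        xi[pair] = (1 - rho) * xi[pair] + rho * deposit
    return xi

# Toy usage: reinforce the pair (0, 1) but not (0, 2).
xi = {(0, 1): 1.0, (0, 2): 1.0}
xi = global_update(xi, {(0, 1)}, F_w=100.0, F_b=40.0, F_s=60.0)
```

With these toy fitness values the reward term is 1.0, so the reinforced pair keeps its level while the other pair decays by the evaporation factor.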
Table 7
The list of notations and the terminology of the ant colony optimization algorithm with their summary.

Notation: Description
$\xi_{jk}$: The pheromone level between the $j$th stop and the $k$th stop.
$D_{jk}$: The minimum distance between the locations $j$ and $k$.
$\psi_{jk}$: The heuristic matrix, the reciprocal of $D_{jk}$.
$C_i$: The capacity of the $i$th bus.
$\Theta$: The first stop selection vector.
$\alpha$: The relative influence of the pheromone concentration.
$\beta$: The relative influence of the heuristic value.
$\rho$: The evaporation factor.
$T_h$: The solution constructed by ant $h$.
$F_b$: The fitness of the global-best solution.
$F_s$: The fitness of the iteration-best solution.
$F_w$: The fitness of the worst solution.
$T_b$: The global best tour.
$T_s$: The iteration best tour.
$n_j$: The number of stops in the bus tour.
$Prob_{jk}^{h}$: The probability of assigning student location $P_j$ to bus $k$ by ant $h$.

Each artificial ant in the colony has a vector that stores the tour $R_i$ for each of the available $m$ buses. The tour vector $R_i$ of each bus represents a sequence of roadsides $P(j)$ visited by the feasible $i$th bus. $P(j)$ is a selected roadside where students are mapped. The length of the tour vector $R_i$ is bounded by the capacity constraint $C_i$ of the $i$th bus.

The constructive heuristic generates a group of different solutions. The pheromone matrix $\xi_{jk}$ is initialized based on the best and the worst solutions generated by the constructive heuristic; refer to Eqs. (22) and (23) for the global update rule. Lastly, the first stop selection vector $\Theta$ is initialized based on the bus assignment of the constructive heuristic.

Until the termination iteration number is reached, the following processes are applied. Firstly, the first stop of every tour vector of each ant is determined. The first stop selection vector $\Theta$ is utilized as a discrete probability distribution to choose the first stop for each bus. After the first stop selection, the tour is constructed for all ants. From the unvisited student locations, one student is sampled at a time. Then the probability of assigning student location $P_j$ to any feasible bus tour is calculated; refer to Eqs. (24) and (25), where $k_0$ refers to the first stop of the bus tour.

The idea of assigning a student to a bus is based on two factors. Firstly, the pheromone value suggests whether the bus stop $j$ can be clustered with the first bus stop $k_0$ or not. Secondly, the heuristic value represents the nearest neighborhood between bus stop $j$ and the bus stop already visited by the bus. This process continues until all students are assigned to a bus.

At last, the pheromone matrix $\xi$ and the first stop selection vector $\Theta$ for all ants are first updated using the local update rule; refer to Eqs. (26) and (27). Secondly, the pheromone matrix $\xi$ and the first stop selection vector $\Theta$ are updated using the global update rules; refer to Eqs. (22) and (28).

$\xi_{ij_0}^{iter+1} = (1 - \rho)\,\xi_{ij_0}^{iter} + \rho\,\Delta\xi_0, \quad \text{if edge } i \in \text{bus tour},\ j_0 \in \text{first stop of bus tour}$ (26)

$\Theta_j^{iter+1} = (1 - \rho)\,\Theta_j^{iter} + \rho\,\Theta_0, \quad \text{if for all first stops } j \in T_l$ (27)

$\Theta_j^{iter+1} = (1 - \rho)\,\Theta_j^{iter} + \rho\,\Delta\Theta_j^{iter}$ (28)

$\Delta\Theta_j^{iter} = \begin{cases} \frac{(F_w - F_b) + (F_w - F_s)}{F_w \cdot n_j}, & \text{if } j \in \text{first stop of } T_b \text{ or } T_g,\ i \in T_b \text{ or } T_g \\ 0, & \text{otherwise} \end{cases}$ (29)

The local update is applied only to those bus stops and first stops present in an ant tour. Hence, the effect of evaporation influences only those pairs of bus stops and first stops belonging to the respective bus tour.

The probability of assigning a student to any feasible bus tour is calculated by Eqs. (24) and (25). These equations contain two parameters that influence the performance of ACO: they determine the relative influence of the pheromone concentration ($\alpha$) and the heuristic matrix ($\beta$). In this study, RL determines the values of these two parameters based on the reward calculated by Eq. (8).

School bus routing

In this study, the two-way tour distance between two nodes/edges is not strictly equal. Additionally, the starting roadside is unknown. Furthermore, the intermediate nodes/edges of the routes need to be calculated based on the travel time prediction within a 15-minute time interval. These characteristics make our problem a variant of TSP and prevent us from using deterministic path query algorithms such as Dijkstra's algorithm (Dijkstra, 1959). In the literature, SA is a commonly used meta-heuristic algorithm for TSP (Adzhar and Salleh, 2014, Fan and Machemehl, 2006, Spada, Bierlaire and Liebling, 2005).

SA approximates the global schedule for the traveling tour in a large search space. It accepts a candidate solution with a certain probability. Furthermore, since a simple order change cannot approach the global minimum in one step, SA prevents getting trapped in a local minimum and settles on a near-optimal solution.

Thus, the SA algorithm is integrated with the GA, PSO, and ACO algorithms and their RL-enabled versions. During each iteration, SA calculates the intermediate nodes/edges and the route of the students that are allocated to the same buses.

Experimental design

We attempt to carry out an extensive performance evaluation of our proposed strategies RL-enabled GA, RL-enabled PSO, and RL-enabled ACO. Each algorithm is compared with the baseline mechanism with static values for the parameters.

We tested our algorithms using the time datasets on a real-world case of a school. Our framework retrieves the Global Positioning System (GPS) records of 1000 private buses via 3G/4G. These data records are processed by the Travel Time Prediction (TTP) service, which employs machine learning algorithms for prediction (Ren, Han, Li and Veeravalli, 2017). Our framework feeds our algorithms with the predicted travel time and retrieves the schedule for the requested vehicle fleets. Even if the demand variations might not be significant for some schools, in generic cases the demand varies significantly. Thus, the convergence speed to a near-optimum is as crucial as the efficiency of the results.

The dataset used in our experiments comprises 58440 edges, 27179 nodes, 1000 bus trip trajectories, and 14 school buses with 330 students. The algorithm aims to assign these 330 students while considering the intermediate nodes among these 58440 edges and 27179 nodes for every 15-minute interval.

We conducted a brute-force experiment to determine the static values of the parameters for the conventional versions of GA, PSO, and ACO, as suggested in the literature (Eiben and Smith, 2003). After some initial experimentation, the best results for the conventional GA were mostly between 0.65 and 0.75 for both the $p_c$ and $p_m$ parameters. The values of these operator probabilities for RL-enabled GA vary between 0.15 and 0.95 for both parameters. The best results for the conventional PSO were mostly between 0.3-0.4 and 1.2-1.4 for $c_1$ and $c_2$, respectively. The values of $c_1$ and $c_2$ for RL-enabled PSO vary between 1 and 5. Lastly, the best results for the conventional ACO were mostly between 2-3 and 5-6 for $\alpha$ and $\beta$, respectively. The values of $\alpha$ and $\beta$ for RL-enabled ACO vary between 1 and 10.

At last, the computation time might be affected by the platform; thus the number of iterations is more accurate to quantify the performance
Fig. 1. The performance comparison of conventional GA, PSO, and ACO algorithms with RL-enabled GA, RL-enabled PSO, and RL-enabled ACO to improve TTB and TTS.

of the algorithm. Therefore, to capture the effect of computational time, an equivalent way is to measure the number of iterations.

Performance evaluation and discussions

The near-optimal schedules obtained by the studied algorithms are compared with the current existing school bus schedules, which were created by the transport operators, to evaluate the performance. The travel time of the existing schedule is calculated based on the time dataset. The constructive heuristic is a school-centered system that minimizes the travel time of buses and students compared to the existing schedule. The constructive heuristic is superior to the existing schedule found in practice, and it could save TTB by 5.78% and TTS by 12.70%; refer to our previous study (Koksal Ahmed, Li, Veeravalli and Ren, 2020).

Fig. 1 presents the fitness function value and the trend comparison between the conventional GA, PSO, and ACO algorithms and the RL-guided versions. The best schedule of each generation is considered, and it can be observed that the objective function is minimized quickly at first and then the improvement slows down.

When the conventional GA, PSO, and ACO algorithms are compared with each other, the performance of the ACO algorithm is the best. Firstly, ACO is more robust than GA. Secondly, even though GA carries a good population to produce a new population, there is no guarantee of improvement, as the process is based on a random approach. ACO, however, applies the global and local update approaches that improve the solution in each generation (Sariff and Buniyamin, 2009).

The PSO algorithm runs similarly to GA in that both are considered population-based search methods. However, the major difference is that GA is a discrete technique while PSO is a continuous technique. Even though PSO is converted to the discrete space while running, this does not improve the performance of PSO compared to GA. Furthermore, in high-dimensional space, the PSO algorithm easily falls into a local optimum, and the convergence rate is low in the iterative process, as expected (Li, Du and Nian, 2014).

Table 8
Generation number to achieve the improvement percentage of GA, PSO, ACO, RL-enabled GA, RL-enabled PSO, and RL-enabled ACO with the initial schedule as a base. (∗: within 500 generations the algorithm could not achieve the corresponding improvement percentage.)

Improvement %     10%   20%   30%   40%   50%   Max Fitness Value
ACO                 1     1     2     4    92   64.29
RL-enabled ACO      1     1     1    10    54   68.67
GA                  6    44   135   429     ∗   41.34
RL-enabled GA       4     9   182   230   281   53.93
PSO               432     ∗     ∗     ∗     ∗   10.4
RL-enabled PSO     28     ∗     ∗     ∗     ∗   13.05

When we compare the conventional algorithms with the RL-guided versions, we observe improvement in all cases. The highest improvement is observed between GA and RL-enabled GA. With the guidance of RL, the randomness impact of GA was reduced, and the algorithm was able to control exploitation and exploration. The performance of RL-enabled GA is higher than that of GA; for instance, a 20% improvement was seen from the 44th generation onwards by GA, while the same improvement rate was achieved from the 9th generation by RL-enabled GA. Furthermore, RL-enabled GA achieved a 50% improvement from the 281st generation onward, while GA could not achieve this improvement level; refer to Table 8.

The impact of RL guidance on ACO is lower compared to GA due to the global and local update approaches. However, the performance and the final result of RL-enabled ACO are better compared to conventional ACO. For instance, a 50% improvement was achieved from the 92nd generation onward by ACO, while the same improvement rate was achieved from the 54th generation by RL-enabled ACO; refer to Table 8.

The impact of RL guidance on the PSO algorithm could not improve the final result significantly as compared to the GA and ACO algorithms. However, the performance impact can be seen; for instance, a 10% improvement was achieved from the 432nd iteration onward by the PSO algorithm
while the RL-enabled PSO algorithm achieved the same improvement level from the 28th iteration onward; refer to Table 8.

The average travel time of one bus under the existing schedule provided by the transport operators is 26.4 minutes. The achieved average travel time of one bus by the conventional GA, PSO, and ACO is 22.2, 23.9, and 20.7 mins, and by the RL-enabled GA, RL-enabled PSO, and RL-enabled ACO it is 21.1, 23.8, and 19.3 mins, respectively. Similarly, the average travel time of one student under the existing schedule provided by the transport operators is 16.8 minutes. The achieved average travel time of one student by the conventional GA, PSO, and ACO is 11.5, 14.1, and 10.4 mins, and by the RL-enabled GA, RL-enabled PSO, and RL-enabled ACO it is 10.7, 13.9, and 11 mins, respectively.

Conclusions

This study aimed to design an efficient methodology to achieve a near-optimum schedule for the SBRS problem. In our context, certain factors influence and magnify the complexity of our problem. To this end, we augment an agent-based RL approach to guide the combinatorial optimization algorithms. Our hypothesis is that with the guidance of RL, the algorithm may achieve a near-optimum schedule within a smaller number of iterations. The augmentation process dynamically controls the hyper-parameters by introducing additional exploitation/exploration factors along with the conventional operators and acting according to the convergence rate.

In this paper, we have validated and demonstrated the usefulness of fusing RL into the conventional EAs, which showed a significant improvement. From the literature survey, this work is the first of its kind to validate and demonstrate the above-mentioned hypothesis with a complex real-world dataset. In this study, we considered three popular algorithms, namely the GA, PSO, and ACO algorithms. To improve the efficiency of these conventional algorithms, the RL algorithm is integrated to guide them. We attempt to carry out extensive performance evaluations.

The results indicate that both the conventional and the reinforcement learning-integrated algorithms improve the travel time of the buses and the students. More than 50% saving over the constructive heuristic is achieved by the conventional and the reinforcement learning-enabled ant colony optimization algorithms from the 92nd and 54th iterations, respectively. Similarly, the saving by the conventional and the reinforcement learning-enabled genetic algorithms is 41.34% at the 500th iteration and more than 50% from the 281st iteration, respectively. Lastly, more than 10% saving by the conventional and the reinforcement learning-enabled particle swarm optimization algorithms is achieved from the 432nd and 28th iterations, respectively.

The work reported in this paper is almost industry-ready, and we have conclusively demonstrated its effectiveness based on a real-world case study using real-life data. As such, our Intelligent Transportation System (ITS) framework can be readily deployed over the cloud. An extension of this work could consider the city's congestion conditions. Furthermore, our generic cases could be extended to multiple schools, factories, Central Business District (CBD) zones, or even city scale. We could expect a significant potential improvement from our algorithms because of the increase of the application scale in both temporal and spatial dimensions.

Declaration of Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors are grateful for the support of the NUS SINGA scholarship, and for the datasets (i.e., vehicle GPS records) and domain knowledge provided by SOLO Pte Ltd. Also, we are grateful for the help of Dr. Zengxiang Li, IHPC A∗STAR Singapore, for his support.

References

Adzhar, N., & Salleh, S. (2014). Simulated annealing technique for routing in a rectangular mesh network. Modelling and Simulation in Engineering, 2014.
Alinezhad, H., Yaghoubi, S., Hoseini Motlagh, S. M., Allahyari, S., & Saghafi Nia, M. (2018). An improved particle swarm optimization for a class of capacitated vehicle routing problems. International Journal of Transportation Engineering, 5(4), 331–347.
Arias-Rojas, J. S., Jiménez, J. F., & Montoya-Torres, J. R. (2012). Solving of school bus routing problem by ant colony optimization. Revista EIA, (17), 193–208.
Babaee Tirkolaee, E., Goli, A., Pahlevan, M., & Malekalipour Kordestanizadeh, R. (2019). A robust bi-objective multi-trip periodic capacitated arc routing problem for urban waste collection using a multi-objective invasive weed optimization. Waste Management & Research, 37(11), 1089–1101.
Bengio, Y., Lodi, A., & Prouvost, A. (2020). Machine learning for combinatorial optimization: a methodological tour d'horizon. European Journal of Operational Research.
Böttcher, S., Doerr, B., & Neumann, F. (2010). Optimal fixed and adaptive mutation rates for the LeadingOnes problem. Paper presented at the International Conference on Parallel Problem Solving from Nature.
Bowerman, R., Hall, B., & Calamai, P. (1995). A multi-objective optimization approach to urban school bus routing: Formulation and solution method. Transportation Research Part A: Policy and Practice, 29(2), 107–123.
Chusanapiputt, S., Nualhong, D., Jantarang, S., & Phoomvuthisarn, S. (2006). Selective self-adaptive approach to ant system for solving unit commitment problem. Paper presented at the Proceedings of the 8th annual conference on Genetic and evolutionary computation.
Corberán, A., Fernández, E., Laguna, M., & Marti, R. (2002). Heuristic solutions to the problem of routing school buses with multiple objectives. Journal of the Operational Research Society, 53(4), 427–435.
Davoodi, S. M. R., & Goli, A. (2019). An integrated disaster relief model based on covering tour using hybrid Benders decomposition and variable neighborhood search: Application in the Iranian context. Computers & Industrial Engineering, 130, 370–380.
Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1(1), 269–271.
Dorigo, M., Maniezzo, V., & Colorni, A. (1996). Ant system: optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 26(1), 29–41.
Eiben, A. E., & Smith, J. E. (2003). Introduction to evolutionary computing. Springer.
Fan, W., & Machemehl, R. B. (2006). Using a simulated annealing algorithm to solve the transit route network design problem. Journal of Transportation Engineering, 132(2), 122–132.
Fisher, M. L., Jaikumar, R., & Van Wassenhove, L. N. (1986). A multiplier adjustment method for the generalized assignment problem. Management Science, 32(9), 1095–1103.
Goli, A., & Davoodi, S. M. R. (2018). Coordination policy for production and delivery scheduling in the closed loop supply chain. Production Engineering, 12(5), 621–631.
Goli, A., Zare, H. K., Tavakkoli-Moghaddam, R., & Sadeghieh, A. (2019). Hybrid artificial intelligence and robust optimization for a multi-objective product portfolio problem. Case study: The dairy products industry. Computers & Industrial Engineering, 137, Article 106090.
Goli, A., Zare, H. K., Tavakkoli-Moghaddam, R., & Sadegheih, A. (2020). Multiobjective fuzzy mathematical model for a financially constrained closed-loop supply chain with labor employment. Computational Intelligence, 36(1), 4–34.
Gong, M., Tang, Z., Li, H., & Zhang, J. (2019). Evolutionary multitasking with dynamic resource allocating strategy. IEEE Transactions on Evolutionary Computation, 23(5), 858–869.
Holland, J. (1975). Adaptation in artificial and natural systems (p. 232). Ann Arbor: The University of Michigan Press.
Kang, M., Kim, S.-K., Felan, J. T., Choi, H. R., & Cho, M. (2015). Development of a genetic algorithm for the school bus routing problem. International Journal of Software Engineering and Its Applications, 9(5), 107–126.
Karafotias, G., Hoogendoorn, M., & Eiben, A. (2015). Parameter control in evolutionary algorithms: Trends and challenges. IEEE Transactions on Evolutionary Computation, 19(2), 167–187.
Karafotias, G., Smit, S. K., & Eiben, A. (2012). A generic approach to parameter control. Paper presented at the European Conference on the Applications of Evolutionary Computation.
Kennedy, J., & Eberhart, R. (1995a). A new optimizer using particle swarm theory. In Proceedings of the sixth international symposium on micro machine and human science.
Kennedy, J., & Eberhart, R. (1995b). Particle swarm optimization. In Proceedings of ICNN'95-international conference on neural networks.
Kim, B.-I., Kim, S., & Park, J. (2012). A school bus scheduling problem. European Journal of Operational Research, 218(2), 577–585.
Kim, B.-I., & Son, S.-J. (2012). A probability matrix based particle swarm optimization for the capacitated vehicle routing problem. Journal of Intelligent Manufacturing, 23(4), 1119–1126.
Kiriş, S. B., & Özcan, T. (2020). Metaheuristics approaches to solve the employee bus routing problem with clustering-based bus stop selection. In Artificial Intelligence and Machine Learning Applications in Civil, Mechanical, and Industrial Engineering (pp. 216–239). IGI Global.
Koksal Ahmed, E., Li, Z., Veeravalli, B., & Ren, S. (2020). Reinforcement learning enabled genetic algorithm for school bus scheduling. Journal of Intelligent Transportation Systems.
Lessing, L., Dumitrescu, I., & Stützle, T. (2004). A comparison between ACO algorithms for the set covering problem. In Proceedings of the International Workshop on Ant Colony Optimization and Swarm Intelligence.
Li, M., Du, W., & Nian, F. (2014). An adaptive particle swarm optimization algorithm based on directed weighted complex network. Mathematical Problems in Engineering, 2014.
Li, Y., & Li, W. (2007). Adaptive ant colony optimization algorithm based on information entropy: Foundation and application. Fundamenta Informaticae, 77(3), 229–242.
Mahmoudzadeh, A., & Wang, X. B. (2020). Cluster based methodology for scheduling a university shuttle system. Transportation Research Record, 2674(1), 236–248.
Martens, D., De Backer, M., Haesen, R., Vanthienen, J., Snoeck, M., & Baesens, B. (2007). Classification with ant colony optimization. IEEE Transactions on Evolutionary Computation, 11(5), 651–665.
Mazyavkina, N., Sviridov, S., Ivanov, S., & Burnaev, E. (2020). Reinforcement learning for combinatorial optimization: A survey. arXiv preprint arXiv:2003.03600.
Mostafaeipour, A., Goli, A., & Qolipour, M. (2018). Prediction of air travel demand using a hybrid artificial neural network (ANN) with Bat and Firefly algorithms: A case study. The Journal of Supercomputing, 74(10), 5461–5484.
Naka, S., Genji, T., Yura, T., & Fukuyama, Y. (2001). Practical distribution state estimation using hybrid particle swarm optimization. In Proceedings of the 2001 IEEE Power Engineering Society Winter Meeting.
Newton, R. M., & Thomas, W. H. (1974). Bus routing in a multi-school system. Computers & Operations Research, 1(2), 213–222.
Osman, I. H., & Laporte, G. (1996). Metaheuristics: A bibliography. Springer.
Pacheco, J., Caballero, R., Laguna, M., & Molina, J. (2013). Bi-objective bus routing: an application to school buses in rural areas. Transportation Science, 47(3), 397–411.
Park, J., & Kim, B.-I. (2010). The school bus routing problem: A review. European Journal of Operational Research, 202(2), 311–319.
Ratnaweera, A. (2002). Particle swarm optimization with self-adaptive acceleration coefficients. In Proceedings of the international conference on fuzzy systems & knowledge discovery (FSKD 2002), Singapore.
Ratnaweera, A., Halgamuge, S. K., & Watson, H. C. (2004). Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients. IEEE Transactions on Evolutionary Computation, 8(3), 240–255.
Ren, S., Han, L., Li, Z., & Veeravalli, B. (2017). Spatial-temporal traffic speed bands data analysis and prediction. In Proceedings of the 2017 IEEE international conference on industrial engineering and engineering management (IEEM).
Riera-Ledesma, J., & Salazar-González, J.-J. (2012). Solving school bus routing using the multiple vehicle traveling purchaser problem: A branch-and-cut approach. Computers & Operations Research, 39(2), 391–404.
Sangaiah, A. K., Tirkolaee, E. B., Goli, A., & Dehnavi-Arani, S. (2020). Robust optimization and mixed-integer linear programming model for LNG supply chain planning problem. Soft Computing, 24(11), 7885–7905.
Sariff, N. B., & Buniyamin, N. (2009). Comparative study of genetic algorithm and ant colony optimization algorithm performances for robot path planning in global static environments of different complexities. In Proceedings of the 2009 IEEE international symposium on computational intelligence in robotics and automation (CIRA).
Sarubbi, J. F., Mesquita, C. M., Wanner, E. F., Santos, V. F., & Silva, C. M. (2016). A strategy for clustering students minimizing the number of bus stops for solving the school bus routing problem. Paper presented at the NOMS 2016 IEEE/IFIP Network Operations and Management Symposium.
Schittekat, P., Kinable, J., Sörensen, K., Sevaux, M., Spieksma, F., & Springael, J. (2013). A metaheuristic for the school bus routing problem with bus stop selection. European Journal of Operational Research, 229(2), 518–528.
Shafahi, A., Wang, Z., & Haghani, A. (2018). A matching-based heuristic algorithm for school bus routing problems. arXiv preprint arXiv:1807.05311.
Spada, M., Bierlaire, M., & Liebling, T. M. (2005). Decision-aiding methodology for the school bus routing and scheduling problem. Transportation Science, 39(4), 477–490.
Stodola, P., Mazal, J., Podhorec, M., & Litvaj, O. (2014). Using the ant colony optimization algorithm for the capacitated vehicle routing problem. In Proceedings of the 16th international conference on mechatronics-mechatronika 2014.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Thangiah, S. R., & Nygard, K. E. (1992). School bus routing using genetic algorithms. In Proceedings of the applications of artificial intelligence X: Knowledge-Based Systems.
Tirkolaee, E. B., Goli, A., & Weber, G.-W. (2019). Multi-objective aggregate production planning model considering overtime and outsourcing options under fuzzy seasonal demand. In Advances in Manufacturing II (pp. 81–96). Springer.
Voß, S., Martello, S., Osman, I. H., & Roucairol, C. (2012). Meta-heuristics: Advances and trends in local search paradigms for optimization. Springer Science & Business Media.
Zheng, Y.-L., Ma, L.-H., Zhang, L.-Y., & Qian, J.-X. (2003). On the convergence analysis and parameter selection in particle swarm optimization. In Proceedings of the 2003 international conference on machine learning and cybernetics.