CHAPTER 1
INTRODUCTION
The two-stage hybrid flow shop scheduling problem (THFSP) consists of two stages, at
least one of which comprises parallel machines. This NP-hard problem is a special case of the
hybrid flow shop scheduling problem (HFSP). A number of results have been obtained in
both single-factory and multi-factory settings. Various methods, including exact, heuristic, and
metaheuristic methods, have been applied to solve the THFSP in a single-factory setting, for
example by using an exact method and several heuristics to minimize makespan.
With the further development of globalization, production has shifted from single
factories to multiple factories. As a result, distributed scheduling problems across multiple
factories have become a main topic in production scheduling in recent years. The
distributed two-stage hybrid flow shop scheduling problem (DTHFSP) is considered in
this work.
The integration of RL and metaheuristics can enable, among other things, the dynamic
selection of search operators or the adaptive adjustment of parameter settings. As a result,
integrating RL with a metaheuristic can improve the performance of the latter, making it an
effective approach to obtaining high-quality solutions.
The DTHFSP with fuzzy processing time is studied, and a novel algorithm called the
QTLBO is constructed through the integration of the Q-learning algorithm and the TLBO to
minimize makespan.
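Fuzzy processing times in such problems are commonly modeled as triangular fuzzy numbers. The following is a minimal sketch of the fuzzy arithmetic a makespan computation needs; the component-wise max and the graded-mean ranking used here are common conventions in the fuzzy-scheduling literature, assumed for illustration rather than taken from this work.

```python
# Sketch: triangular fuzzy numbers (TFNs) (a, b, c) for uncertain
# processing times. Addition is component-wise; the fuzzy max is
# approximated component-wise, a common convention.

def tfn_add(x, y):
    """Component-wise addition of two TFNs (a, b, c)."""
    return tuple(xi + yi for xi, yi in zip(x, y))

def tfn_max(x, y):
    """Approximate fuzzy max: component-wise maximum."""
    return tuple(max(xi, yi) for xi, yi in zip(x, y))

def defuzzify(x):
    """Graded mean value (a + 2b + c) / 4, one common ranking criterion."""
    a, b, c = x
    return (a + 2 * b + c) / 4

# Two jobs on one machine: fuzzy completion time of the second job.
p1, p2 = (2, 3, 5), (1, 2, 4)
c2 = tfn_add(p1, p2)          # (3, 5, 9)
print(c2, defuzzify(c2))
```

Ranking candidate schedules by the defuzzified makespan is what allows a crisp "best solution" to be chosen despite the uncertainty.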
CHAPTER 2
LITERATURE SURVEY
1) Title: “A Review of Reinforcement Learning Based Intelligent Optimization for
Manufacturing Scheduling”
Authors and Year: Ling Wang, Zixiao Pan, and Jingjing Wang, 2021.
Description: The paper explores the integration of machine learning techniques into
metaheuristics for solving combinatorial optimization problems. It delves into three levels of
integration: problem level integration, high-level integration between meta-heuristics, and
low-level integration within a meta-heuristic. The authors provide a comprehensive review of
how machine learning techniques can be used in various elements of meta-heuristics, such as
algorithm selection, fitness evaluation, initialization, evolution, parameter setting, and
cooperation. They discuss the advantages, limitations, requirements, and challenges of
implementing machine learning at each level of integration. The paper also identifies research
gaps and proposes future research directions in this domain.
3) Title: “Effective heuristics and metaheuristics to minimize total flowtime for the
distributed permutation flow shop problem”
Authors and Year: Quan-Ke Pan, Liang Gao, Ling Wang, Jing Liang, Xin-Yu Li, 2019.
Description: This paper focuses on addressing the distributed permutation flow shop
scheduling problem (DPFSP) through the application of
heuristics and metaheuristics. The research explores the use of various algorithms, including
artificial bee colony, scatter search, iterated local search, and iterated greedy, to optimize
scheduling in flow shop environments. By proposing new heuristics and metaheuristics, the
study aims to minimize the total flowtime in manufacturing processes, ultimately enhancing
efficiency and productivity. The paper presents computational results, comparisons, and
experimental findings that demonstrate the effectiveness of the proposed approaches in
improving scheduling outcomes in dynamic manufacturing settings.
CHAPTER 3
3.1 PROBLEM STATEMENT AND OBJECTIVES
3.1.1 Problem Statement:
3.1.2 Objectives:
Problem Specificity: Tailor the Q-table structure and reward function to effectively capture
the DTHFSP problem characteristics.
Parameter Tuning: Optimize QTLBO's parameters (learning rate, discount factor, etc.) for
the best performance on the specific DTHFSP instances considered.
3.2 METHODOLOGY
Q-Learning:
• Q-learning is used to solve the distributed two-stage hybrid flow shop scheduling
problem (DTHFSP) with fuzzy processing time.
• The Q-learning algorithm is implemented using 9 states and 4 actions.
• The algorithm structure is dynamically adjusted through adaptive action selection.
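The text specifies 9 states and 4 actions but does not give the exact state encoding or reward function, so the following is a hypothetical sketch of ε-greedy action selection over such a 9×4 Q-table; the state meanings and ε value are illustrative assumptions.

```python
import random

N_STATES, N_ACTIONS = 9, 4   # counts taken from the text; encoding unknown
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def select_action(state, epsilon=0.1):
    """Epsilon-greedy: with probability epsilon pick a random action
    (exploration), otherwise pick the highest-valued action for the
    current state (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    row = q_table[state]
    return max(range(N_ACTIONS), key=row.__getitem__)
```

With an all-zero table every action looks equally good; as Q-values are learned, `select_action` increasingly exploits them, which is how the algorithm structure is adjusted adaptively.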
Reinforcement Learning (RL):
Metaheuristic:
• Do forever:
Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)]
s ← s′
Initialization:
1. Q-Table Creation:
o Create a data structure called a Q-table. This table stores the Q-values, which
represent the expected long-term reward of taking a specific action (a) in a
particular state (s).
o Initialize all entries in the Q-table to zero. This signifies that the agent has no
initial knowledge about the value of any state-action pair.
Learning Loop:
2. State Observation: At each time step, the agent observes the current state (s) of the
environment. This state could represent various factors depending on the problem,
such as the position of a robot in a maze or the current resources in a game.
3. Action Selection: The agent selects an action (a), typically with an ε-greedy policy:
with probability ε it explores a random action, and otherwise it exploits the action
with the highest Q-value in the current state.
4. Action Execution: The agent performs the selected action in the environment.
5. Reward Observation: The environment returns a reward (r) and transitions to a new
state (s′).
6. Q-Value Update: The entry Q(s, a) is updated toward r + γ max_a′ Q(s′, a′) using
the learning rate α.
7. Repeat:
o The agent continues by returning to step 2 and observing the new state,
repeating the learning process until it converges or reaches a stopping
criterion.
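The loop above is standard tabular Q-learning. A minimal runnable sketch follows; the toy chain environment, reward, and parameter values are illustrative assumptions, not the DTHFSP environment.

```python
import random

def q_learning(n_states=5, n_actions=2, alpha=0.1, gamma=0.9,
               epsilon=0.3, episodes=200):
    """Tabular Q-learning on a toy chain: action 1 moves right, action 0
    moves left; reaching the last state yields reward 1 and ends the episode."""
    Q = [[0.0] * n_actions for _ in range(n_states)]  # Q-table, all zeros
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=Q[s].__getitem__)
            # execute the action in the toy environment
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            # observe the reward
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Q-value update
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next  # move to the new state and repeat
    return Q

Q = q_learning()
# After training, moving right should dominate near the goal state.
```

The same update rule drives QTLBO, with states and actions describing the search process instead of a physical environment.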
1. Initialization:
The algorithm begins by setting up the initial population of solutions for the
scheduling problem. These solutions represent different ways to schedule the jobs
across the multiple factories.
2. Sorting Step:
The solutions in the population are ranked based on a specific criterion, typically
their makespan (total completion time). The solution with the shortest
makespan is considered the best.
3. Q-Learning Step:
The Q-learning agent decides which of the following four phases to apply next:
Teacher Phase
Learner Phase
Teacher's Self-Learning Phase
Learner's Self-Learning Phase
4. Phase Execution:
Based on the decision from the Q-learning step, one of the four phases is
implemented.
5. Stopping Check:
The algorithm checks whether a stopping criterion has been met. This criterion
might be a certain number of iterations or achieving a desired makespan.
6. End:
If the stopping condition is met, the algorithm terminates and returns the best
solution found so far, which represents the best schedule found for the DTHFSP.
7. Loop:
If the stopping condition is not met, the algorithm returns to step 3 and repeats
the process.
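The steps above can be sketched as a skeleton driver. Every callable here (the phase implementations, the agent's phase selection and update) is a hypothetical placeholder standing in for the paper's components, not the actual QTLBO code.

```python
import random

def qtlbo(evaluate, init_population, phases, select_phase, update_agent,
          max_iters=100):
    """Skeleton of the QTLBO loop. `phases` holds four callables (teacher,
    learner, teacher self-learning, learner self-learning); `select_phase`
    and `update_agent` stand in for the Q-learning agent."""
    pop = init_population()                  # step 1: initialization
    for _ in range(max_iters):
        pop.sort(key=evaluate)               # step 2: rank by makespan
        idx = select_phase(pop)              # step 3: Q-learning picks a phase
        pop = phases[idx](pop)               # step 4: run the chosen phase
        update_agent(idx, pop)               # reward feedback to the agent
    pop.sort(key=evaluate)
    return pop[0]                            # best schedule found

# Toy usage: minimize the sum of a list, a stand-in for (defuzzified) makespan.
best = qtlbo(
    evaluate=sum,
    init_population=lambda: [[random.randint(0, 9) for _ in range(4)]
                             for _ in range(10)],
    phases=[lambda p: p] * 4,                # identity phases, placeholders
    select_phase=lambda p: random.randrange(4),
    update_agent=lambda i, p: None,
    max_iters=5,
)
```

In the fuzzy setting, `evaluate` would rank solutions by a defuzzified makespan rather than a crisp sum.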
3.3.1 Results:
3.3.2 Discussion:
Further research is needed to explore how QTLBO scales to extremely large and
complex distributed networks. Additionally, advancements in explainable AI could improve
transparency in the system's decision-making process.
2) Fuzzy Processing Time Handling: This method can account for uncertain job processing
times, a common issue with new or complex tasks. This flexibility allows for more
realistic scheduling in dynamic environments.
3) Distributed Scheduling Advantage: Designed specifically for distributed two-stage
hybrid flow shops, it can optimize scheduling across multiple factories or production
lines, improving overall production network coordination and resource allocation.
4) Adaptability through Learning: The reinforcement learning aspect of Q-learning allows
the system to learn and adapt its scheduling decisions over time. This is beneficial in
environments where job characteristics, processing times, or machine availability change
frequently.
5) Potential for Continuous Improvement: As the system gathers more data and interacts
with the scheduling environment, it can continuously refine its decision-making,
potentially leading to long-term efficiency gains.
6) Exploration and Exploitation Balance: Q-learning can balance exploration (trying new
scheduling strategies) with exploitation (focusing on proven effective ones). This balance
can help the system discover even better solutions over time.
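One common way to realize this balance, assumed here purely for illustration, is to decay the exploration rate ε over time so that early iterations explore broadly and later ones exploit learned Q-values.

```python
import math

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, rate=0.01):
    """Exponential decay of the exploration rate from eps_start toward
    eps_end: more exploration early, more exploitation later."""
    return eps_end + (eps_start - eps_end) * math.exp(-rate * step)

print(decayed_epsilon(0))     # 1.0 at the start: pure exploration
print(decayed_epsilon(500))   # close to eps_end: mostly exploitation
```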
3.4.2 Disadvantages:
CHAPTER 4
In future work, we will attempt to solve the distributed scheduling problem with
uncertainty by using various metaheuristics. Previous works have mainly used a few
kinds of RL algorithms, particularly Q-learning. Related to this, the integration of other RL
algorithms with metaheuristics for production scheduling deserves further study.
REFERENCES
[1] M. Karimi-Mamaghan, M. Mohammadi, P. Meyer, A. M. Karimi-Mamaghan, and E.-G.
Talbi, Machine learning at the service of meta-heuristics for solving combinatorial
optimization problems: A state-of-the-art, Eur. J. Oper. Res., vol. 296, no. 2, pp. 393–422,
2022.
[2] J. Wang, D. M. Lei, and J. C. Cai, An adaptive artificial bee colony with
reinforcement learning for distributed three-stage assembly scheduling with maintenance,
Appl. Soft Comput., vol. 117, p. 108371, 2021.
[3] L. Wang, Z. X. Pan, and J. J. Wang, A review of reinforcement learning based intelligent
optimization for manufacturing scheduling, Complex Syst. Model. Simul., vol. 1, no. 4, pp. 257–
270, 2021.
[4] J. Q. Li, J. K. Li, L. J. Zhang, H. Y. Sang, Y. Y. Han, and Q. D. Chen, Solving type-2
fuzzy distributed hybrid flowshop scheduling using an improved brain storm optimization
algorithm, Int. J. Fuzzy Syst., vol. 23, pp. 1194–1212, 2021.
[5] Z. S. Shao, W. S. Shao, and D. C. Pi, Effective heuristics and metaheuristics for the
distributed fuzzy blocking flowshop scheduling problem, Swarm Evol. Comput., vol. 59, p.
100747, 2020.
[6] J. Wang, X. D. Wang, F. Chu, and J. B. Yu, An energy-efficient two-stage hybrid flow
shop scheduling problem in glass production, Int. J. Prod. Res., vol. 58, no. 8, pp. 2283–
2314, 2020.
[8] B. Fan, W. Yang, and Z. Zhang, Solving the two-stage hybrid flow shop scheduling
problem based on mutant firefly algorithm, J. Amb. Intel. Hum. Comp., vol. 10, no. 3, pp.
979–990, 2019.