
Q-Learning-Based Teaching-Learning Optimization for Distributed Two-Stage Hybrid Flow Shop Scheduling with Fuzzy Processing Time

CHAPTER 1

INTRODUCTION
A two-stage hybrid flow shop scheduling problem (THFSP) consists of two stages, at least one of which contains parallel machines. This NP-hard problem is a special case of the hybrid flow shop scheduling problem (HFSP). A number of results have been obtained in both single-factory and multi-factory settings. Exact, heuristic, and metaheuristic methods have all been applied to solve the THFSP in the single-factory setting, for example by combining an exact method with several heuristics to minimize makespan.

With the further development of globalization, production has shifted from single factories to multiple factories. As a result, distributed scheduling problems in multiple factories have become a main topic of production scheduling in recent years. The distributed two-stage hybrid flow shop scheduling problem (DTHFSP) is the problem considered in this work.

Teaching-learning-based optimization (TLBO) is a population-based algorithm inspired by the way a teacher passes knowledge on to students in a classroom. TLBO has a simple structure with few parameters and is easy to understand and implement.

The integration of RL and metaheuristics can enable the dynamic selection of search operators or the adaptive adjustment of parameter settings, among other benefits. As a result, integrating RL with a metaheuristic can improve the performance of the latter, making it an effective approach to obtaining high-quality solutions.

In this work, the DTHFSP with fuzzy processing time is studied, and a novel algorithm called QTLBO is constructed by integrating the Q-learning algorithm with TLBO to minimize makespan.


CHAPTER 2

LITERATURE SURVEY
1) Title: “A Review of Reinforcement Learning Based Intelligent Optimization for
Manufacturing Scheduling”

Authors and Year: Ling Wang, Zixiao Pan, and Jingjing Wang, 2021.

Description: This paper provides a comprehensive review of how Reinforcement Learning (RL) can be utilized for intelligent optimization in manufacturing scheduling. It delves into the
design of state and action in RL for scheduling, summarizes RL-based algorithms, reviews
applications for different scheduling problems, discusses integration with meta-heuristics, and
outlines future research directions. The authors highlight the increasing importance of RL in
shop scheduling optimization and provide insights into the advancements and challenges in this
field.

2) Title: “Machine learning at the service of meta-heuristics for solving combinatorial optimization problems”

Authors and Year: M. Karimi-Mamaghan, M. Mohammadi, P. Meyer, et al., 2022.

Description: The paper explores the integration of machine learning techniques into
metaheuristics for solving combinatorial optimization problems. It delves into three levels of integration: problem-level integration, high-level integration between meta-heuristics, and low-level integration within a meta-heuristic. The authors provide a comprehensive review of
how machine learning techniques can be used in various elements of meta-heuristics, such as
algorithm selection, fitness evaluation, initialization, evolution, parameter setting, and
cooperation. They discuss the advantages, limitations, requirements, and challenges of
implementing machine learning at each level of integration. The paper also identifies research
gaps and proposes future research directions in this domain.

3) Title: “Effective heuristics and metaheuristics to minimize total flowtime for the
distributed permutation flow shop problem”

Authors and Year: Quan-Ke Pan, Liang Gao, Ling Wang, Jing Liang, and Xin-Yu Li, 2019.

Description: This paper focuses on addressing the distributed permutation flow shop scheduling problem (DPFSP) through the application of


heuristics and metaheuristics. The research explores the use of various algorithms, including
artificial bee colony, scatter search, iterated local search, and iterated greedy, to optimize
scheduling in flow shop environments. By proposing new heuristics and metaheuristics, the
study aims to minimize the total flowtime in manufacturing processes, ultimately enhancing
efficiency and productivity. The paper presents computational results, comparisons, and
experimental findings that demonstrate the effectiveness of the proposed approaches in
improving scheduling outcomes in dynamic manufacturing settings.


CHAPTER 3
3.1 PROBLEM STATEMENT AND OBJECTIVES
3.1.1 Problem Statement:

To design and implement “Q-Learning-Based Teaching-Learning Optimization for Distributed Two-Stage Hybrid Flow Shop Scheduling with Fuzzy Processing Time”.

• Input: A set of factories and a set of jobs with fuzzy processing times (an illustrative representation is sketched after this list).

• Process: Q-learning algorithm, teaching-learning optimization.

• Output: Minimized makespan, computational results.
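
For illustration only, the following minimal Python sketch shows one possible way to represent such an instance, assuming triangular fuzzy processing times stored as (low, mid, high) tuples; all class and field names are hypothetical and not taken from the paper.

from dataclasses import dataclass
from typing import List, Tuple

# Assumed representation of a fuzzy processing time: a triangular fuzzy number (low, mid, high).
TFN = Tuple[float, float, float]

@dataclass
class Job:
    job_id: int
    stage1_time: TFN   # fuzzy processing time at stage 1
    stage2_time: TFN   # fuzzy processing time at stage 2

@dataclass
class Factory:
    factory_id: int
    stage1_machines: int   # parallel machines at stage 1
    stage2_machines: int   # parallel machines at stage 2

@dataclass
class Instance:
    factories: List[Factory]
    jobs: List[Job]

# Example: 2 factories and 3 jobs with fuzzy processing times.
instance = Instance(
    factories=[Factory(0, 2, 1), Factory(1, 1, 2)],
    jobs=[Job(0, (3, 4, 6), (2, 3, 5)),
          Job(1, (5, 6, 8), (1, 2, 3)),
          Job(2, (2, 3, 4), (4, 5, 7))],
)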

3.1.2 Objectives:

Integration of Reinforcement Learning (RL) and Metaheuristic:

• Objective: Combine the strengths of both approaches:


o RL's ability to learn from interactions with the environment (DTHFSP in this
case)
o The metaheuristic's (Teaching-Learning-Based Optimization, TLBO) population-based search for efficient exploration and exploitation.
• Explanation: QTLBO leverages Q-learning to guide the learning process within
TLBO's framework. This allows the algorithm to adapt its search strategy dynamically
based on the encountered problem landscape.

Q-learning-based Teaching-Learning Optimization (QTLBO):

• Objective: Develop a novel optimization algorithm that utilizes Q-learning to enhance TLBO's performance in minimizing the DTHFSP makespan.
• Explanation: QTLBO introduces a Q-table that stores the expected future rewards
(reduced makespan) associated with different learning modes in the learner phase. The
algorithm learns to select the most effective learning mode based on the current state,
accelerating convergence towards optimal schedules.
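
As a rough, illustrative sketch (not the paper's exact design), a Q-table over 9 states and 4 learning modes with epsilon-greedy mode selection might look as follows; the epsilon value and the state encoding are assumptions.

import numpy as np

N_STATES, N_ACTIONS = 9, 4                      # states and learning modes (see Section 3.2.1)
q_table = np.zeros((N_STATES, N_ACTIONS))       # expected future reward per (state, learning mode)

def select_mode(state: int, epsilon: float = 0.1) -> int:
    # Explore a random learning mode with probability epsilon, otherwise exploit the best-known one.
    if np.random.rand() < epsilon:
        return int(np.random.randint(N_ACTIONS))
    return int(q_table[state].argmax())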

Teacher Phase, Learner Phase, and Self-Learning Phases:


• Objective: Design distinct phases within QTLBO to mimic a classroom setting:


o Teacher Phase: A "teacher" solution guides learners (other solutions) towards
better makespan values.
o Learner Phase: Learners interact and share knowledge, potentially improving
their individual makespan.
o Self-Learning Phases: The teacher and the learners can also explore the solution space independently, potentially discovering new optima.
• Explanation: These phases create a structured learning environment within QTLBO. The teacher phase provides initial direction, the learner phase fosters collaboration, and the self-learning phases encourage independent exploration. (A sketch of the canonical teacher-phase update is given after this list.)
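
For orientation, the canonical teacher-phase update of the original continuous TLBO is sketched below; the paper's QTLBO operates on discrete schedules, so its actual operators differ, and this block is only a reference point for the underlying idea.

import numpy as np

def teacher_phase(population: np.ndarray, fitness: np.ndarray) -> np.ndarray:
    # Canonical continuous TLBO teacher phase: move every learner toward the teacher.
    # population: (n_learners, n_vars) array; fitness: lower is better (e.g. makespan).
    teacher = population[fitness.argmin()]         # the best solution acts as the teacher
    mean = population.mean(axis=0)                 # mean of the class
    tf = np.random.randint(1, 3)                   # teaching factor, randomly 1 or 2
    r = np.random.rand(*population.shape)          # uniform random numbers in [0, 1)
    return population + r * (teacher - tf * mean)  # candidate solutions (kept only if they improve)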

Effectiveness of QTLBO Strategies:

• Objective: Demonstrate the benefits of QTLBO compared to traditional TLBO or other existing DTHFSP optimization algorithms. This may involve:
o Comparative Analysis: Evaluate performance metrics such as average makespan, convergence speed, and solution quality.
o Statistical Significance: Employ statistical tests to confirm that QTLBO's
improvements are statistically significant.
• Explanation: Evaluating QTLBO's effectiveness helps gauge its suitability for
DTHFSP optimization.

Problem Specificity: Tailor the Q-table structure and reward function to effectively capture
the DTHFSP problem characteristics.

Parameter Tuning: Optimize QTLBO's parameters (learning rate, discount factor, etc.) for optimal performance on the specific DTHFSP instances considered (an illustrative parameter set is sketched below).

Computational Efficiency: Consider trade-offs between solution quality and computational cost, especially for large-scale DTHFSP problems.
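
Purely as an illustration of the tunable parameters mentioned above, a hypothetical QTLBO configuration might look as follows; none of these values come from the paper, and all would need tuning per instance.

# Illustrative QTLBO parameter set; all values are assumptions, not results from the paper.
params = {
    "population_size": 50,    # number of learners in the class
    "max_iterations": 500,    # stopping criterion
    "alpha": 0.1,             # Q-learning learning rate
    "gamma": 0.9,             # Q-learning discount factor
    "epsilon": 0.1,           # exploration rate for action selection
}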


3.2 METHODOLOGY

QTLBO merges Q-learning with TLBO, leveraging Q-learning's adaptive decision-making and TLBO's collaborative optimization. This integration enables efficient
exploration of complex search spaces while promoting collective knowledge sharing, making
it a robust and adaptable solution for real-world optimization tasks.

3.2.1 Technologies Used:

➢ Q-Learning:

• Q-learning is used to solve the distributed two-stage hybrid flow shop scheduling
problem (DTHFSP) with fuzzy processing time.
• The Q-learning algorithm is implemented using 9 states and 4 actions.
• The algorithm structure is dynamically adjusted through adaptive action selection.

➢ Teaching-Learning-Based Optimization (TLBO):


• TLBO is a search process that acts on a population of learners.
• It has a teacher phase and a learner phase.
• In the teacher phase, the best solution passes its knowledge to learners.
• TLBO modifies solutions based on the knowledge passed by the teacher.

➢ Reinforcement Learning (RL):

• RL is integrated with the TLBO algorithm to optimize scheduling.


• The Q-learning algorithm is used to dynamically adjust the algorithm structure.
• 9 states, 4 actions, a reward, and adaptive action selection are implemented in the RL component.
• QTLBO provides effective strategies for the distributed two-stage hybrid flow shop
scheduling problem.
• RL algorithms, particularly Q-learning, are commonly used in production
scheduling.

➢ Metaheuristic:

• A metaheuristic (TLBO) is used within the QTLBO algorithm to solve the DTHFSP.

• Previous TLBO variants did not consider the integration of Q-learning with the metaheuristic.


3.2.2 Q-Learning Algorithm:

Q-learning is a fundamental reinforcement learning algorithm that enables an agent to learn an optimal policy for making decisions in an unknown environment. It is a model-free approach, meaning the agent does not need a pre-built model of the environment's dynamics. The basic steps are as follows:

• For each s, a, initialize the table entry Q(s, a) to zero.

• Observe the current state s.

• Do forever:

Select an action a and execute it.

Receive the immediate reward r.

Observe the new state s'.

Update the table entry for Q(s, a) as follows:

Q(s, a) = r + gamma * max_a' Q(s', a')

s <- s'

Initialization:

1. Q-Table Creation:
o Create a data structure called a Q-table. This table stores the Q-values, which
represent the expected long-term reward of taking a specific action (a) in a
particular state (s).
o Initialize all entries in the Q-table to zero. This signifies that the agent has no
initial knowledge about the value of any state-action pair.

Learning Loop:

2. State Observation: At each time step, the agent observes the current state (s) of the
environment. This state could represent various factors depending on the problem,
such as the position of a robot in a maze or the current resources in a game.


3. Action Selection (Exploration vs. Exploitation):


o The agent selects an action (a) to take in the current state. This selection involves
balancing exploration and exploitation:
▪ Exploration: Trying new actions to learn about the environment and
potentially discover better options. A common strategy is the
epsilon-greedy approach, where with some probability (epsilon) the agent randomly chooses an action, and with probability (1 - epsilon) it
selects the action with the highest Q-value in the current state
(exploitation).
▪ Exploitation: Choosing the action that is currently believed to lead to
the highest reward based on the learned Q-values.
4. Action Execution and Reward Observation:
o The agent takes the chosen action (a) and observes the immediate reward (r)
received from the environment. This reward reflects the immediate consequence
of the action.
5. State Transition:
o The environment transitions to a new state (s') as a result of the action taken.
This new state represents the updated environment after the action's effect.
6. Q-Value Update:
o The Q-value for the previous state-action pair (s, a) is updated using the Bellman equation. This equation combines the immediate reward (r), the estimated optimal future reward from the new state (max_a' Q(s', a')), and a learning rate (alpha) that controls how much the agent learns from new experiences (a runnable sketch of the full loop is given after step 7):
o Q(s, a) = (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))

▪ alpha (learning rate): Determines the weight given to the new experience
(r + gamma * max_a' Q(s', a')) compared to the previous Q-value. A
higher alpha leads to faster but potentially less stable learning, while a
lower alpha leads to slower but more stable learning.
▪ gamma (discount factor): Controls the importance of future rewards. A
higher gamma means the agent values future rewards more and plans for
longer-term goals.


7. Repeat:
o The agent continues by returning to step 2 and observing the new state, repeating
the learning process until it converges or reaches a stopping criterion.
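
Putting steps 1 through 7 together, a minimal generic tabular Q-learning loop could look like the Python sketch below; the environment object with reset() and step() methods is an assumed placeholder interface, not something defined in the paper.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Generic tabular Q-learning (steps 1-7 above); env must expose reset() and step(action).
    q = np.zeros((n_states, n_actions))             # step 1: Q-table initialized to zero
    for _ in range(episodes):
        s = env.reset()                             # step 2: observe the current state
        done = False
        while not done:
            if np.random.rand() < epsilon:          # step 3: epsilon-greedy action selection
                a = int(np.random.randint(n_actions))
            else:
                a = int(q[s].argmax())
            s_next, r, done = env.step(a)           # steps 4-5: execute, get reward and new state
            q[s, a] += alpha * (r + gamma * q[s_next].max() - q[s, a])   # step 6: Bellman update
            s = s_next                              # step 7: continue from the new state
    return q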

3.2.3 Flow Chart:

1. Initialization:

The algorithm begins by setting up the initial population of solutions for the
scheduling problem. These solutions represent different ways to schedule the jobs
across the multiple factories.

2. Sorting Step:

The solutions in the population are ranked based on a specific criterion, most likely their makespan (total completion time). The solution with the shortest makespan is considered the best.


3. Q-Learning Step:

This step uses a Q-learning algorithm to dynamically determine which of the following four phases to enter next:

▪ Teacher Phase
▪ Learner Phase
▪ Teacher's Self-Learning Phase
▪ Learner's Self-Learning Phase

4. Execution of the Chosen Action:

Based on the decision from the Q-learning step, one of the four phases is
implemented:

▪ Teacher Phase: A pair of solutions (teachers) is chosen from the population, and a new solution (learner) is created by combining aspects of these teachers.
▪ Learner Phase: A solution (learner) is chosen, and another solution
(peer) is randomly selected from the population. The learner is then
improved by incorporating aspects of the peer.
▪ Teacher's Self-Learning Phase: A solution (teacher) is chosen, and it's
improved based on its own information and past performance.
▪ Learner's Self-Learning Phase: A solution (learner) is chosen, and it's
improved based on its own information and past performance.
5. Stopping Condition?

The algorithm checks whether a stopping criterion has been met. This criterion might be a certain number of iterations or achieving a desired makespan.

6. End:

If the stopping condition is met, the algorithm terminates and returns the best
solution found so far, which represents the optimal schedule for the DTHFSP.


7. Loop:

If the stopping condition is not met, the algorithm returns to step 3 and repeats
the process.
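
A high-level skeleton of the loop described in steps 1 to 7 is sketched below; every helper passed in (the phase operators, the fuzzy makespan function, the state and Q-table logic) is a placeholder assumed for illustration, not the paper's actual implementation.

from typing import Callable, List, Sequence

def qtlbo(initial_population: List,
          phases: Sequence[Callable[[List], List]],
          makespan: Callable[[object], float],
          observe_state: Callable[[List], int],
          select_mode: Callable[[int], int],
          update_q: Callable[[int, int, float, int], None],
          max_iterations: int = 500):
    # Skeleton of the QTLBO loop in steps 1-7; the four phase operators, the
    # (defuzzified) makespan, and the state/reward logic are supplied by the caller.
    population = sorted(initial_population, key=makespan)               # steps 1-2: init and sort
    best = makespan(population[0])
    state = observe_state(population)
    for _ in range(max_iterations):                                     # step 7: loop until stopping (step 5)
        action = select_mode(state)                                     # step 3: Q-learning picks a phase
        population = sorted(phases[action](population), key=makespan)   # step 4: apply the chosen phase
        new_best = makespan(population[0])
        reward = 1.0 if new_best < best else 0.0                        # assumed reward: improvement found
        next_state = observe_state(population)
        update_q(state, action, reward, next_state)                     # Q-table update (see Section 3.2.2)
        state, best = next_state, min(best, new_best)
    return population[0]                                                # step 6: best schedule found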

3.3 RESULTS AND DISCUSSION

3.3.1 Results:

• The Q-learning algorithm is integrated with TLBO for dynamic selection of the algorithm structure.
• Fuzzy completion times are calculated with a fuzzy addition operator (see the sketch below).
• A ranking operator compares fuzzy makespans to decide the elite solution.
• The initial population is randomly produced.
• Solutions are sorted in ascending order of makespan.
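
Since processing times are fuzzy, one common convention (assumed here; the paper may use different operators) is to represent times as triangular fuzzy numbers, add them component-wise, and rank them by a defuzzified value such as the centroid.

from typing import Tuple

TFN = Tuple[float, float, float]   # triangular fuzzy number (low, mid, high)

def fuzzy_add(a: TFN, b: TFN) -> TFN:
    # Component-wise addition of triangular fuzzy numbers.
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def defuzzify(a: TFN) -> float:
    # Centroid of a triangular fuzzy number, used here to rank fuzzy makespans.
    return sum(a) / 3.0

# Example: a fuzzy completion time built from two fuzzy processing times.
c = fuzzy_add((3, 4, 6), (2, 3, 5))                     # -> (5, 7, 11)
elite = min([(5, 7, 11), (6, 7, 9)], key=defuzzify)     # ranking by centroid -> (6, 7, 9)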

3.3.2 Discussion:

QTLBO presents a promising new approach for scheduling in manufacturing, particularly for distributed production environments. Its ability to handle fuzzy processing
times and adapt to dynamic changes offers significant advantages over traditional methods.
However, the computational cost, data dependence, and limited explainability require careful
consideration.

Further research is needed to explore how QTLBO scales to extremely large and
complex distributed networks. Additionally, advancements in explainable AI could improve
transparency in the system's decision-making process.

3.4 ADVANTAGES & DISADVANTAGES


3.4.1 Advantages:

1) Improved Efficiency: Q-learning-based teaching-learning optimization can outperform traditional methods in complex scheduling problems, leading to potentially shorter
production times and increased output.
2) Fuzzy Processing Time Handling: This method can account for uncertain job processing
times, a common issue with new or complex tasks. This flexibility allows for more realistic
scheduling in dynamic environments.


3) Distributed Scheduling Advantage: Designed specifically for distributed two-stage hybrid flow shops, it can optimize scheduling across multiple factories or production lines,
improving overall production network coordination and resource allocation.
4) Adaptability through Learning: The reinforcement learning aspect of Q-learning allows
the system to learn and adapt its scheduling decisions over time. This is beneficial in
environments where job characteristics, processing times, or machine availability change
frequently.
5) Potential for Continuous Improvement: As the system gathers more data and interacts
with the scheduling environment, it can continuously refine its decision-making, potentially
leading to long-term efficiency gains.
6) Exploration and Exploitation Balance: Q-learning can balance exploration (trying new
scheduling strategies) with exploitation (focusing on proven effective ones). This balance
can help the system discover even better solutions over time.

3.4.2 Disadvantages:

1. Computational Cost: Q-learning-based teaching-learning optimization is computationally expensive compared to traditional methods. Running the algorithm, especially for large-scale scheduling problems, can require significant computing power and resources.
2. Data Dependence: The effectiveness of Q-learning heavily relies on the quality and
quantity of data used for training. Insufficient data or data with inaccuracies can lead the
system to make suboptimal scheduling decisions.
3. Limited Explainability: While the system can learn effective schedules, it can be difficult
to understand the exact reasoning behind its decisions. This lack of transparency can be a
drawback for debugging or justifying specific scheduling choices.
4. Potential for Convergence to Local Optima: The learning process might get stuck in
finding a good but not necessarily the best solution (local optimum). This can limit the
overall efficiency gains achievable.
5. Integration Challenges: Implementing this method might require modifying or integrating
with existing factory scheduling software. The complexity and cost of such integration can
be a significant hurdle to adoption.
6. Scalability Concerns: While designed for distributed scheduling, the effectiveness of this
method for very large or intricate distributed networks is still under investigation. Its
scalability to extremely complex scenarios needs further exploration.


CHAPTER 4

CONCLUSION AND FUTURE WORK


This study provides a new path to integrate RL with the TLBO. Unlike existing works,
this study applies an RL algorithm named Q-learning to dynamically adjust the algorithm
structure of the QTLBO. In the QTLBO, a group of teachers was used. The teacher phase,
learner phase, teacher’s self-learning phase, and learner’s self-learning phase were designed.
Then, the Q-learning algorithm was implemented with 9 states, 4 actions defined as combinations of the above phases, a reward, and an adaptive action selection, which together dynamically adjust the algorithm structure. A number of experiments were conducted. The
computational results demonstrate that the new strategies of the QTLBO are effective and that
the QTLBO provides promising results on the considered DTHFSP.

Distributed scheduling has also been extensively considered; however, distributed scheduling with uncertainty has not been fully studied.

In future works, we will attempt to solve the distributed scheduling problem with uncertainty by using various metaheuristics. Among RL algorithms, previous works have mainly used Q-learning; related to this, the integration of other RL algorithms with metaheuristics for production scheduling is also worth investigating.


REFERENCES
[1] M. Karimi-Mamaghan, M. Mohammadi, P. Meyer, A. M. Karimi-Mamaghan, and E.-G. Talbi, Machine learning at the service of meta-heuristics for solving combinatorial optimization problems: A state-of-the-art, Eur. J. Oper. Res., vol. 296, no. 2, pp. 393–422, 2022.

[2] J. Wang, D. M. Lei, and J. C. Cai, An adaptive artificial bee colony with reinforcement learning for distributed three-stage assembly scheduling with maintenance, Appl. Soft Comput., vol. 117, p. 108371, 2021.

[3] L. Wang, Z. X. Pan, and J. J. Wang, A review of reinforcement learning based intelligent optimization for manufacturing scheduling, Complex Syst. Model. Simul., vol. 1, no. 4, pp. 257–270, 2021.

[4] J. Q. Li, J. K. Li, L. J. Zhang, H. Y. Sang, Y. Y. Han, and Q. D. Chen, Solving type-2
fuzzy distributed hybrid flowshop scheduling using an improved brain storm optimization
algorithm, Int. J. Fuzzy Syst., vol. 23, pp. 1194–1212, 2021.

[5] Z. S. Shao, W. S. Shao, and D. C. Pi, Effective heuristics and metaheuristics for the distributed fuzzy blocking flowshop scheduling problem, Swarm Evol. Comput., vol. 59, p. 100747, 2020.

[6] J. Wang, X. D. Wang, F. Chu, and J. B. Yu, An energy-efficient two-stage hybrid flow shop scheduling problem in a glass production, Int. J. Prod. Res., vol. 58, no. 8, pp. 2283–2314, 2020.

[7] D. M. Lei, B. Su, and M. Li, Cooperated teaching-learning-based optimisation for distributed two-stage assembly flow shop scheduling, Int. J. Prod. Res., vol. 59, no. 23, pp. 7232–7245, 2020.

[8] B. Fan, W. Yang, and Z. Zhang, Solving the two-stage hybrid flow shop scheduling
problem based on mutant firefly algorithm, J. Amb. Intel. Hum. Comp., vol. 10, no. 3, pp. 979–
990, 2019.
