
Journal of Remanufacturing (2023) 13:161–185

https://doi.org/10.1007/s13243-023-00126-z

RESEARCH

Material flow control in Remanufacturing Systems with random failures and variable processing times

Felix Paschko1,2 · Steffi Knorn2 · Abderrahim Krini1 · Markus Kemke3

Received: 7 February 2023 / Accepted: 18 April 2023 / Published online: 16 May 2023
© The Author(s) 2023

Abstract
Material flow control in remanufacturing is an important issue in the field of disassembly. This paper deals with the potential of autonomous material release decisions for remanufacturing systems to balance the uncertainties related to changing bottlenecks, to maximise throughput (TH) and to minimise work-in-process (WIP). The goal is to achieve the highest possible throughput rate using real-time data while keeping costs to a minimum. Unlike traditional production systems, remanufacturing must consider and handle high uncertainties in the process. Up to now, classical methods such as CONWIP, Material Requirement Planning (MRP) and Kanban have been used for material flow control. However, these methods do not perform well in a system with high variation and uncertainties such as remanufacturing as they aim to find solutions for static environments. Crucial for optimal production in stochastic environments is finding the optimum pull or release rate, which can vary over time, in terms of maximising TH and minimising WIP. We propose a deep reinforcement learning approach that acts on the environment and can adapt to changing conditions. This ensures that changing bottlenecks are taken care of and that there is a minimum WIP in the system.

Keywords Remanufacturing · Disassembly · Reinforcement Learning · Value Stream Optimisation · Material flow control

* Felix Paschko
FelixPaschko@web.de
Steffi Knorn
knorn@tu-berlin.de
Abderrahim Krini
Abderrahim.Krini@bosch.com
Markus Kemke
Markus.Kemke@bosch.com
1 Robert Bosch Automotive Steering GmbH, Engineering Remanufacturing, Richard-Bullinger-Straße 77, 73527 Schwäbisch Gmünd, Germany
2 Technische Universität Berlin, Chair of Measurement and Control, Straße des 17. Juni 135, 10623 Berlin, Germany
3 Robert Bosch Automotive Steering GmbH, IT in Manufacturing, Richard-Bullinger-Straße 77, 73527 Schwäbisch Gmünd, Germany


Introduction

In the age of Industry 4.0, development cycles are becoming shorter and shorter, and
products are becoming more and more individualised. This poses great challenges to
today’s intelligent production systems. These challenges are now also arriving in reman-
ufacturing and are increasing the uncertainties due to the increase in variability. Despite
increasing challenges, the systems must be flexible and continuously adapt to changing
conditions. However, this must not compromise throughput (TH ) and robustness against
external disturbances. Overall, this results in constantly changing production environ-
ments that production planning and control must deal with [4].
In remanufacturing, end-of-life (EOL) products are the raw material (Cores) and as
a result, the remanufacturing process is subject to large uncertainties. This makes the
planning of remanufacturing increasingly difficult [7]. EOL products have experienced
different loads in the field, leading to significant differences in quality. Before and in the
practical remanufacturing process, it is difficult to evaluate the quality of EOL products
exactly. To make it even more difficult, the processing time, remanufacturing probabil-
ity, random failures of cores or components and costs are uncertain under the influence
of human behaviour and the environment. The uncertain quality further increases the
uncertainties in processing time, reprocessing probability and cost, making it a two-fold
uncertainty [38].
In general, there exist two sets of methods to control such problems: pull and push
methods. All pull methods were developed with a similar goal to control the WIP in the
system at a low level while having no loss of TH . Pull methods set limits on WIP and
observe the resulting TH . In contrast, push methods try to control the TH and observe
the WIP level. The big advantage of pull methods is achieving the same average TH with
a lower average WIP [10]. Maximising TH and minimising WIP are conflicting goals
that must be combined. Especially in systems with additional variability this can lead to
major challenges. In this paper, we not only want to adapt the WIP of the remanufactur-
ing system, but also to further optimise the pull approach.
With the help of Industry 4.0 technologies, it is possible to regularly readjust the WIP as the conditions of the production system change, or to influence it via the pull or material release, without reducing TH. By observing real-time conditions, the information can be used to adjust the policy. The focus is on the development of policy architectures that make the pull approach suitable for production systems with lower volume, high product mix and high variability [5, 15, 29, 30].
An exploratory study based on simulation experiments is used to support our research.
An important aspect is to maintain an optimal WIP range, which is determined based on
the dynamic behaviour during the simulation. Using an innovative approach based on Deep
Reinforcement Learning, we show that we can significantly reduce WIP values without
reducing TH . We do not need any prior information about processing time variations, fail-
ure probabilities, etc. and the system automatically adapts to the changing states.
Previous papers in this area have focused on sequencing of disassembly, production
schedules for disassembly, balancing of disassembly lines or disassembly Petri nets.
However, the literature does not deal with material flow control in disassembly, with
predominantly manual work processes combined with the uncertainties of core quality,
random failures, and highly variable processing times. Our proposed approach can be
seen as an operational decision model that leads to productive use of workers and cost
minimisation in disassembly. We focus on the following uncertainties in the paper:


(1) The disassembly station has its own processing time per quality class. This leads to a
changing bottleneck within the remanufacturing system due to the differences between
the classes.
(2) During disassembly, cores and components fail randomly. Different failure rates are
tested, and the adaptability of the RL agent is demonstrated.

The reinforcement learning approach is compared with a pure CONWIP approach and therefore with different fixed WIP limits. The objective of this paper is the development and implementation of an autonomous and self-learning algorithm addressing material flow control with stochastic uncertainties in remanufacturing systems.
The paper is structured as follows: Sect. 2 contains a literature review. Objectives of
production planning and control are identified and, in this context, literature focusing
on remanufacturing and literature on reinforcement learning for material flow control
are examined. The next part explains the state-of-the-art in reinforcement learning and
describes the approach in general. Section 4 discusses the remanufacturing system under
consideration and the uncertainties regarding the changing bottleneck. This is followed
by the optimisation objective and assumptions, the state space, the action space, and the
developed reward functions. The algorithm used is briefly explained in the next section and
leads into the results and comparison section. The paper concludes with a summary of rein-
forcement learning approaches in the field of remanufacturing and ends in the conclusion.

Literature review

Production planning and control

The goal of Production planning and control (PPC) is to make the best use of available
production factors, such as material, machine or work hours, as conditions change [14, 24].
Production control is responsible for the on-time delivery of the production plan. The effi-
cient utilisation of existing capacity is crucial to competitive manufacturing that delivers
on time [20].
A valuable alternative seems to be decentralised control, in which smaller units, such as
workstations, workers, etc., independently take decisions based on their local knowledge.
With this, the production efficiency can be improved [23], and the system is more robust as
it can handle dynamics and complexity better [11]. We focus on the task of material flow
control as part of production control, which can be used in combination with other autono-
mous units.

Material flow control in remanufacturing

The literature review [28] classifies the latest approaches in the field of disassembly planning and differentiates them according to certain characteristics that influence their complexity. The proposed solutions are disassembly-specific, such as disassembly priority graphs, disassembly trees, mathematical approaches to disassembly optimisation or disassembly Petri nets. Production control, as understood in this publication, describes the allocation of disassembly operations to available resources. Real-world problems, however, have a more dynamic character: machines can fail, completion dates or priorities can change [16], and failure rates of cores and components can change. Previous


approaches in the literature aimed to find a generally optimal policy, to look in more detail
at production release, e.g., of disassembly, or to adapt procedures such as Kanban to
remanufacturing.
The authors of [6] investigate different disassembly release mechanisms for components
by examining the effects of different delay buffers or lead times in disassembly and assem-
bly. According to [6], the timing of the material flow through the individual stations is
determined by the disassembly, the release of disassembled components and by the control
of the individual workstations. The authors of [36] developed a technique to schedule the disassembly so that the resources are fully utilised. The developed planning technique creates a disassembly schedule that minimises the total processing times and hence the cost of disassembly, leading to an optimal process schedule. The developed algorithm was extended by a selective disassembly process plan with the aim of optimising the selection of components of a product [34, 35]. The algorithm based on the disassembly process plan was further developed to solve the problem of disassembling high-mix / low-volume batches of electronic devices. In this process, the sequence of multiple and single product batches is determined by disassembly and removal operations to minimise the idle time of machines and the production margin [33].
[17] proposes a multi-Kanban mechanism for a disassembly environment. The focus
is on a disassembly line with single-variety products, different component requirements,
products that have multiple precedence relationships, and random workstation failure.
The performance is measured by the inventory level, the level of satisfied demand, and
the customer waiting time. A simulation showed that the disassembly line using the proposed multi-Kanban mechanism outperforms the traditional push system.
According to [17], the increase in efficiency must be facilitated with the help of a different
control method.
[13] compared the performance of the Dynamic Kanban System for Disassembly Line (DKSDL) with the classical Kanban method. A number of uncertainties is listed and an approach to deal with these uncertainties dynamically is proposed. The authors of [13] found that their approach is superior to the conventional method as well as the modified Kanban system for disassembly line (MKSDL) previously developed by the authors.
[21] developed a multiple quality class approach for end-of-life products to better deal with varying working hours and thus better balance the disassembly line. [2] proposed a
decision support framework for disassembly systems. Specific quality criteria for electronic
braking systems are defined, which classify the cores and place them in one of six qual-
ity classes. For each of these quality classes, there are specific processing times as well
as information on the economic feasibility of disassembly. The publication shows that a
quality-related approach leads to better compliance with the target cycle time for each qual-
ity class. In [2], the authors prioritise the decision whether further disassembly offers a
benefit based on quality and economic benefit to prevent unnecessary work on the disas-
sembly. Products that have experienced greater stress during their life cycle show a signifi-
cant influence on the type and duration of disassembly work required.
Publications in the field of control methods for disassembly systems almost all exclude
defective parts as well as failed processes due to the otherwise significantly increasing
complexity [12]. It is generally known that in remanufacturing the variance in the quality
of the cores plays a significant role with regard to the uncertainties in the disassembly pro-
cess and thus an increase in failure probability [1].
In [8], the authors stated that a failure of a single disassembly process can significantly
disrupt the flow within a disassembly system. A failure of components or a core would lead
to difficulties, such as idling of a bottleneck workstation. By considering or observing the


core conditions and the resulting parameters in the disassembly process, uncertainties can
be reduced in the control phase, failures or idle times can be avoided and a higher on-time
delivery can be achieved. In a production environment where cores with uncertain states
are processed, a control system must be able to deal with defective operations or failures
during the process steps. However, research has only addressed this problem to a limited extent. The goal of fully utilising the workstations leads to unnecessary work and no produc-
tive use of resources. The knowledge of where the uncertainties are in remanufacturing
should be considered in the control system.
With the increasing automation of disassembly, there will be a need for adaptive control
solutions in the future [37]. A regulated workflow leads to hybrid lines being used opti-
mally. Thus, this leads to a better productive use of labour hours. We will take these points
into account in our proposal. The approach incorporates the uncertainties into its decision-
making and takes appropriate action. We thus enable an increase in productivity in the
form of maximised TH , minimised WIP and optimum machine and worker utilisation.

Background – reinforcement learning in production flow control

The field of material flow control with reinforcement learning is still quite young. The sys-
tematic literature review [19] summarises the previous papers in the field of Deep Rein-
forcement Learning in Production Planning and Control. The authors divide the areas into
production planning, production control, production logistics and implementation chal-
lenges. [19] only mentions one publication in connection with WIP control, see [32]. This
shows that there is a need for action in this area, especially since reinforcement learning
will play a major role in the future regarding the dynamic control of production systems
that considers the state.
[38] used a Q-learning agent that takes care of the release in a CONWIP environment. However, the Q-learning agent limits the state to discrete values and thus cannot handle continuous values. The action space is also discrete: do nothing, release a production authorisation and capture a production authorisation. Thus, [38] did not control the WIP limit but the release rate, as the agent cannot handle continuous values from the production environment.
[32] is based on the previously published paper [27]. In their first paper, the authors
chose a Deep Q-Network (DQN) agent to dynamically adjust the WIP . Here, [27] considers
two parallel simulations with two different agents. One agent has the goal of maximising
TH regardless of the WIP , like a usual push. The other agent with the goal of setting the
WIP appropriately close to the critical WIP (w0) is rewarded if it has a low WIP with the
same or close to the same TH . This leads to a better result than the classic push method. By
using this method, a WIP close to w0 is achieved.
[32] extended their previous approach and replaced the DQN agent with a Proximal
Policy Optimisation agent (PPO agent) and applied this approach to the same flow shop
system as in [27]. The reward function is as follows:

Reward = (number of parts produced)π − (number of parts produced)maxTH + (WIPmaxTH − WIPπ)   (1)
The suffix π represents the trained policy. The suffix maxTH denotes the policy where the orders are
released according to their due date as in a push system. The approach was compared by
simulations with other methods, i.e., statistical throughput control (STC), and achieved bet-
ter results than the other agent in terms of maximising TH and minimising WIP and good
results compared to the STC method.


In summary, many classic methods were used and improved by additional dynamic
methods or adapted to the systems. It can be observed that deterministic processing times
are often used, and random failures are not considered in remanufacturing. Mainly fixed
failure rates are used. Here, we propose a different approach. Instead of choosing an agent
with only the goal of maximising TH as a reference, we propose to use the observations
available in actual operations and to develop the reward function based on the real measur-
able key performance indicators. In contrast to [32] we choose the pull or release of further
material into the production system as the action space and thus control the pull signal or
the material release.
We present a simulation-based control for a disassembly system including manual and
automated workstations with fluctuating processing times due to different core quality and
random failures of cores or components. Further, we have used a discrete-event simulation
model as a digital twin to simulate the production processes that are controlled by the RL
agent.
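
To make the digital-twin idea concrete, the following is a minimal sketch of one workstation process in a discrete-event simulation. SimPy is used here purely as an example library, and the buffer names, hooks and the small demo at the end are illustrative assumptions; the paper does not state which simulation tool or software architecture was used.

```python
import random
import simpy

def workstation(env, in_buffer, out_buffer, mean, std):
    """Pull the next core from the input buffer (FIFO), process it for a
    normally distributed time, then pass it to the downstream buffer."""
    while True:
        core = yield in_buffer.get()
        yield env.timeout(max(0.0, random.gauss(mean, std)))
        yield out_buffer.put(core)

env = simpy.Environment()
buffer1, buffer2 = simpy.Store(env), simpy.Store(env)               # assumed buffer objects
env.process(workstation(env, buffer1, buffer2, mean=100, std=10))   # WS1 parameters from Table 1
buffer1.put("core_1")                                               # release one core into the line
env.run(until=500)                                                  # advance the simulation clock
```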

Changing bottleneck and WIP effect

Variability of the processing time leads to a changing bottleneck between different work-
stations. If the changing bottleneck is not considered or not included in the control of the
material flow, one of the workstations will run empty because there is no more material
available. On the other hand, an excessively high release rate may result in too much material inside
the remanufacturing system waiting to be processed. We focus on three KPIs in this paper:
Throughput, WIP and Cycle Time, defined as follows:

• Throughput (TH ): How much material leaves the remanufacturing system per time
(units/time).
• Work-in-Process (WIP ): The amount of material (units) that has been released and not
yet completed.
• Cycle time (CT ): Time that passes between the release of the material and the comple-
tion of the material.

The cycle time is significantly influenced by the two parameters TH and WIP and hence
results from the TH achieved and the WIP in the system. Little’s Law relates the three KPIs
[29]:
WIP = CT ∗ TH (2)
Little’s Law can be changed to the critical WIP formula, which uses the bottleneck rate rconstraint and the raw process time T0 to calculate the w0 of the system under consideration [10].

w0 = rconstraint ∗ T0   (3)
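
As a purely deterministic illustration of formula (3), the critical WIP can be computed from the mean processing times in Table 1, here for quality class 1 only and assuming five remanufactured components per core at WS4. Because variability, failures and the quality mix are ignored, the value only illustrates the mechanics of the formula; it is not the w0 of the stochastic system considered later.

```python
# Illustrative only: deterministic critical WIP for quality class 1 (Table 1 means),
# assuming 5 remanufactured components per core at WS4.
mean_time_per_core = {
    "WS1": 100,       # pre-cleaning and pre-sorting [s]
    "WS2": 300,       # analysis [s]
    "WS3": 250,       # disassembly, quality class 1 [s]
    "WS4": 5 * 40,    # rework and cleaning: 5 components x 40 s [s]
}

T0 = sum(mean_time_per_core.values())                 # raw process time
r_constraint = 1 / max(mean_time_per_core.values())   # bottleneck (constraint) rate [cores/s]
w0 = r_constraint * T0                                # formula (3)

print(f"T0 = {T0} s, r = {r_constraint:.4f} cores/s, w0 = {w0:.2f} cores")
# T0 = 850 s, r = 0.0033 cores/s, w0 = 2.83 cores
```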
The formulas (2) and (3) are difficult to apply to use cases that are stochastic [29]. The
idealised production curve which shows the relation between TH and WIP can be repre-
sented as in Fig. 1.
Fig. 1  Idealised work-in-process and throughput relation (based on [18])

Figure 1 shows the idealised existing relationship between TH and WIP. The production curve shows the performance curve over the different WIP limits. From a certain point of the WIP, w0, a limit is reached from which the performance of a production system no longer changes significantly. This is synonymous with the realisation that above a certain WIP, no interruption of work is possible [18].
In remanufacturing, however, the uncertainties that complicate the calculation of the w0
or do not lead to the optimum must be considered. In the next section we propose our opti-
misation objective and what assumptions are made.

Disassembly and experimental setting

The considered remanufacturing system is used to test the performance of the RL agent.
Here, the corresponding processing times as well as the effects of the uncertainties on key
performance indicators are considered. The cores at the start of the remanufacturing sys-
tem do not count towards the WIP here, as our focus is on the released cores. The simula-
tion time is 225,000 s. The remanufacturing system is structured as shown in Fig. 2.
The mean value µ and the standard deviation σ of the processing time ( PTi ) at each
workstation (WSi) can be seen in Table 1. The disassembly system consists of four
workstations (WSi ) and a buffer in front of each workstation ( Bufferi ). The processing times
( PTi ) are normally distributed and depend on the core quality during disassembly. With
decreasing core quality (increasing index), the standard deviation and thus the fluctuations
increase.

Fig. 2  Remanufacturing system


Table 1  Processing time for the individual workstations

Workstation                                          Time (mean, standard deviation)
Workstation 1 (WS1): Pre-cleaning and pre-sorting    PT1 = (100 s, 10 s) per core
Workstation 2 (WS2): Analysis                        PT2 = (300 s, 20 s) per core
Workstation 3 (WS3): Disassembly                     PT3,q1 = (250 s, 5 s) per core
                                                     PT3,q2 = (320 s, 10 s) per core
                                                     PT3,q3 = (450 s, 30 s) per core
Workstation 4 (WS4): Rework and Cleaning             PT4,q1 = (40 s, 5 s) per component
                                                     PT4,q2 = (80 s, 10 s) per component
                                                     PT4,q3 = (100 s, 30 s) per component

Since the processing time of WS3 cannot be predicted with certainty before the analysis
at WS2, the processing time of WS3 fluctuates around the processing time of WS2. Thus,
a changing bottleneck is created here between WS2 and WS3, which affects the material
release. There is a risk of releasing too many cores into the disassembly creating a conges-
tion in front of WS3 due to poor quality of the arriving cores. On the other hand, too few
cores can be released, resulting in a loss of TH . Not only the processing times cause dif-
ficulties, but also the random failures of cores after analysis or components during disas-
sembly. The analysis at WS2 identifies the load on the cores during the field life and sorts
out cores that are above the limit of reuse. Component damage can usually only be detected
during disassembly and thus leads to failures late in the process. Failures depend on quality
and differ per class. In our model, the quality of the cores can be modelled as a p-dimen-
sional state vector:

q = (q1, q2, q3, …)^T   (4)

Without loss of generality, we consider three quality classes in our simulation model.
Here, 1 corresponds to very good quality and as the index increases, the quality decreases.
The representation allows classifying the cores into different quality classes. The probabil-
ity that a core or individual components will have to be scrapped depends on the condition
q. Accordingly, the probability of remanufacturing cores and components differs in q. The
probabilities may change in the process of disassembly. The general assumptions for a bet-
ter understanding of the remanufacturing system are described in the assumptions and opti-
misation goals section below.
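
For illustration, the two uncertainties described above — quality-dependent processing times and random failures — could be sampled as in the following sketch. The quality-class mix P_QUALITY is an assumed placeholder (the paper does not report it), while the (mean, std) pairs follow Table 1 and the quality-related failure rates follow the scenario definitions given later.

```python
import random

P_QUALITY = [0.4, 0.35, 0.25]                           # assumed share of q1, q2, q3
PT_WS3 = {1: (250, 5), 2: (320, 10), 3: (450, 30)}      # (mean, std) per core [s], Table 1
PT_WS4 = {1: (40, 5), 2: (80, 10), 3: (100, 30)}        # (mean, std) per component [s]
FAIL_WS3 = {1: 0.01, 2: 0.05, 3: 0.15}                  # quality-related failure rates

def sample_core(n_components=5):
    """Draw a core: quality class, WS3 disassembly time and component survival."""
    q = random.choices([1, 2, 3], weights=P_QUALITY)[0]
    mu, sigma = PT_WS3[q]
    t_disassembly = max(0.0, random.gauss(mu, sigma))
    components_ok = [random.random() >= FAIL_WS3[q] for _ in range(n_components)]
    return q, t_disassembly, components_ok
```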

Material flow control in remanufacturing system

Assumptions and optimisation goals

Following the systems description above, we make the following assumptions:

• The system is constantly supplied with material, so a shortage of raw material (in this
case cores) is impossible.


• We focus on a flow shop system. Hence, the workstations are arranged in sequence and
the cores must all pass through the same workstations.
• The workstations immediately pick up the waiting material (cores or components) as
soon as the current processing is completed and start the next processing (assuming
the material is available in the buffer).
• We consider one product (several products are possible if they do not require a tool change) with five components that are remanufactured.
• Cores or components are handled individually (one-piece flow), no batch produc-
tion.
• The cores are handled from the buffer according to the FIFO (first in first out) principle.

The goal is to release the material so that the production system is close to w0. The
release of further material here offers a direct influence on the production system and the
possibility to react in a timely manner to changes in the state space. The following formulas (5) and (6)
show the overall optimisation goal of the proposed material flow control:
min ∑_{t=0}^{T} WIP   (5)

max ∑_{t=0}^{T} TH   (6)

With the help of the RL agent, it should be ensured that, in a stochastic environment, important KPIs are met: maximising TH, minimising the average WIP, a low CT, a high service level and fast order processing.
In the following, the state space, action space and the reward function are discussed.

State space

In our case, the state vector has a total of five dimensions. In our simulation, the sample time ts, i.e., the fixed time step at which the agent takes an action, is 50 s. We propose to observe the following states of the environment with respect to the optimisation objective and the reward function:

• WIP in the remanufacturing system (WIPreman(t))
• Average TH of the remanufacturing system per sample time (TH(t))
• Average failure rate per sample time ncore(t) at WS2 (core)
• Average failure rate per sample time ncomp(t) at WS3 (component)
• Average material release rate per sample time (MR(t))

The average values are calculated as a cumulative value from simulation start t0 to observation time t and passed on to the RL agent per sample time ts. The rates are passed to the RL agent as an average value per sample time ts. The WIPreman(t) is provided as a current value and may, for example, have the value 4.8. As soon as the cores are disassembled into their components, each component counts towards the WIP with the value 1/Mcompreman, where Mcompreman is the number of components that are remanufactured.


Overall, all important key figures that influence the current WIP are contained in the state space. Classically, the WIP is determined at time t by the material input (INmat(t)), TH(t) and the starting WIP (WIP(t0)), as the following formula (7) shows:

WIP(t) = INmat(t) − TH(t) + WIP(t0)   (7)

In remanufacturing, formula (7) must be extended by two more values:

WIPreman(t) = INcore(t) − THreman(t) − Ncores(t) − Ncomp(t)/Mcompreman + WIPreman(t0)   (8)

where INcore(t) represents the input of cores, Ncores(t) the number of failures of cores and Ncomp(t) the number of failures of components at time t. The initial WIP(t0) is set to 0, as
we start with an empty disassembly line. The calculations of the individual rates are:
ncore(t) = ( ∑_{t0=0}^{t} Ncoretotal(t) / INcoretotal(t) ) [units] / ts [time]   (9)

ncomp(t) = ( ∑_{t0=0}^{t} Ncomptotal(t) / INcomptotal(t) ) [units] / ts [time]   (10)

MR(t) = ( ∑_{t0=0}^{t} INmat(t) ) [units] / ts [time]   (11)

TH(t) = ( ∑_{t0=0}^{t} THDis(t) ) [units] / ts [time]   (12)
Using the state space, the RL agent gets the current WIPreman and the average rates per
sample time ts.
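
A minimal sketch of how this five-dimensional observation could be assembled from cumulative simulation counters is given below. The normalisation follows one plausible reading of formulas (8)–(12); the function signature and counter bookkeeping are assumptions for illustration.

```python
TS = 50        # sample time ts [s]
M_COMP = 5     # components remanufactured per core

def observation(in_cores, th_reman, fail_cores, fail_comps,
                in_cores_total, in_comps_total, t):
    """Return [WIP_reman, avg TH, core failure rate, component failure rate, avg MR]."""
    # cf. formula (8): released cores minus output and failures; a disassembled
    # component counts 1/M_COMP towards the WIP (initial WIP is zero)
    wip_reman = in_cores - th_reman - fail_cores - fail_comps / M_COMP

    elapsed_steps = max(t / TS, 1)                 # number of elapsed sample times
    avg_th = th_reman / elapsed_steps              # cf. formula (12), per sample time
    avg_mr = in_cores / elapsed_steps              # cf. formula (11), per sample time
    core_fail_rate = (fail_cores / max(in_cores_total, 1)) / elapsed_steps   # cf. (9)
    comp_fail_rate = (fail_comps / max(in_comps_total, 1)) / elapsed_steps   # cf. (10)

    return [wip_reman, avg_th, core_fail_rate, comp_fail_rate, avg_mr]
```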

Action space

The RL agent has a discrete action space. At each sample time ts it makes the decision
whether to release one core or not. The discrete action space is A1 = {0, 1}. If it decides for
0, the agent does not release another core and if 1, it releases one core. This enables a core
to arrive at WS2 at the necessary time and to be available there before idle time. The mate-
rial flow should be coordinated with the processing rates. The action space can be seen as
a pull signal.
The random failures after analysis and disassembly make this difficult, as these cannot be predicted exactly and must be compensated for accordingly. At the same time, the processing time variations further complicate material release control, as there is the risk of causing a queue before disassembly due to frequent release and poor core quality, or the risk of idling the disassembly station if no core is available. Also, a buffer could build up in front of the other workstations because of releasing too many cores.
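
The following fragment sketches how the discrete pull signal could be applied in the simulation loop; release_core, advance and observe are assumed hooks of the simulation environment, not functions defined in the paper.

```python
TS = 50   # sample time ts [s]

def control_step(sim, policy, state):
    """One decision step: the policy returns 0 (do nothing) or 1 (release one core)."""
    action = policy(state)           # action space A1 = {0, 1}
    if action == 1:
        sim.release_core()           # assumed hook: put one core into the first buffer
    sim.advance(TS)                  # assumed hook: run the simulation for one sample time
    return sim.observe()             # assumed hook: next state for the agent
```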


Reward function

The reward function is the key element that is used by the agent to calculate its policy. In our
case, we propose to shape the reward function into different components depending on the
state of the remanufacturing system. The RL agent activates the following reward functions by
fulfilling the conditions that are required at time t and the current state of the system. Figure 3
illustrates the sequence of activation. It is also possible to fall back into one of the previous
reward functions as soon as the requirement for an activation point is no longer fulfilled. For example, R3 cannot be activated before R2.
The first target is to get the average utilisation of the bottleneck or one of the changing bot-
tlenecks above a certain transition value. Care must be taken to use a transition value aligned
with the existing system. An appropriate value can be determined with the help of simulations
or by means of process knowledge. The following formula (13) is used to reward increasing
utilisation with a weight factor 𝜔1:
R1(t) = ω1 ∗ ( 1 − (Umax − UWSbottleneck(t)) / Umax )^0.4   (13)

aR2 = { 0, for UWSbottleneck ≤ Utilactivate
      { 1, for UWSbottleneck > Utilactivate   (14)

The reward function R1 thus rewards the increase of the average utilisation of the bottleneck
towards 1 and thus 100%. Here, Utilactivate is 0.3. If Utilactivate is reached, the system activates
aR2 (aR2 = 1), enabling the next step of the reward function, R2. If Utilactivate is not reached, aR2 has the
value zero and R2 is not activated. This first reward function is mainly important for the start-
up phase. The following are the key reward functions for finding w0.
After Utilactivate has been achieved, the next goal is to bring MR(t) and TH(t) of the reman-
ufacturing system into a defined range. This is important so that not too little and not too
much material is released. Depending on the parameter values of the system, the limit must be
adjusted. Formula (15) shows the reward for the different ranges.

R2(t) = aR2 ∗ ω2 ∗ { −(1 − (MR(t)/TH(t)) / LL),   for MR(t)/TH(t) < LL
                   { 1 − (MR(t)/TH(t)) / UL,       for MR(t)/TH(t) > UL       (15)
                   { 2,                            for LL ≤ MR(t)/TH(t) ≤ UL

Fig. 3  Sequence reward function and activation points


If the ratio is above or below the target range at time t, it is penalised according to the
distance to the limit. If the ratio is between the Lower Limit LL (here, 0.8) and the Upper Limit UL (in
our case 1), the reward jumps to 2.
The next target is to get near the maximum possible or planned TH(t) of the remanu-
facturing system. The maximum possible TH(t) can be extracted from experience and
simulations. In the simulation, this is measurable via the average TH of the remanufac-
turing system or the bottleneck workstation. However, the demand rate, if lower than the
maximum TH rate, can also be used as a value. The effectivity of the remanufacturing
system is measured as follows [9].
effectivity production = actual TH [units] / planned TH [units]   (16)

In the case of a changing bottleneck, the TH value of the entire remanufacturing sys-
tem can be used. If a bottleneck has been identified or can be extracted from the long-
term data, this value can be used for the planned TH value. The planned TH for the total
system or in the case of an identified bottleneck can be calculated as follows:
THreman system = WIP [units] / cycle time [time]   (17)

PRbottleneck = 1 [unit] / PTbottleneck [time]   (18)

If, for example, the agent does not release any material or releases too little, then pro-
duction is not effective, and the strategy applied does not enable the maximum TH . The
RL agent should enable the fulfilment of the production plan by maximising TH and
implement it effectively [3]. To activate the next reward function R3 the RL agent must
exceed a certain transition value via the release of cores. We define a transition value of
0.95. From the simulation results, the selected transition value is very suitable for the
transition to R3. Value aR3 is defined as:

aR3 = { 0, for TH(t) / planned TH(t) ≤ 0.95
      { 1, for TH(t) / planned TH(t) > 0.95   (19)

The difficulty is that the TH(t) value only changes after some time. Depending on the length of the remanufacturing system and on how many cores or components fail, the TH(t) value in the state space will only change after the agent has performed several actions and the first core or component has passed through the system or the bottleneck workstation. As soon as the transition value has been reached, aR3 is set to 1 and activates R3.
R3 is the last function that enables finding w0 of the remanufacturing system and
includes the efficiency of the remanufacturing system. The function makes it possible to
differentiate between various strategies, as it explicitly considers MR(t) and TH(t) . With
this part of the reward function, we measure if the agent is working at a minimum WIP
level. The efficiency is calculated as follows [9]:


efficiency = actual TH [units] / actual INmat [units]   (20)

With (20), the third reward function can be formulated:

R3(t) = aR2 ∗ aR3 ∗ ω3 ∗ ( TH(t) / MR(t) )   (21)

With increasing TH(t) and no increase of MR(t), the agent receives a higher positive reward. This last formula allows differentiating between strategies and finding the one that achieves the global optimum: maximising TH and minimising WIP. The overall reward function is as follows:
Rtotal (t) = R1 (t) + R2 (t) + R3 (t) (22)
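
To summarise the staged reward shaping, the sketch below combines R1–R3 with the activation flags. The weight factors ω1–ω3 are placeholders (the paper does not report their values), the transition values follow the text, and the out-of-range branches of R2 follow the reconstruction of formula (15) above.

```python
W1, W2, W3 = 1.0, 1.0, 1.0                         # weight factors (assumed values)
UTIL_ACTIVATE, LL, UL, TH_ACTIVATE = 0.3, 0.8, 1.0, 0.95

def total_reward(u_bottleneck, mr, th, planned_th, u_max=1.0):
    # R1, formula (13): reward rising average utilisation of the bottleneck
    r1 = W1 * (1 - (u_max - u_bottleneck) / u_max) ** 0.4
    a_r2 = 1.0 if u_bottleneck > UTIL_ACTIVATE else 0.0            # formula (14)

    # R2, formula (15): keep the release-to-throughput ratio MR/TH inside [LL, UL]
    ratio = mr / th if th > 0 else 0.0
    if ratio < LL:
        r2 = -(1 - ratio / LL)
    elif ratio > UL:
        r2 = 1 - ratio / UL
    else:
        r2 = 2.0
    r2 *= a_r2 * W2

    # R3, formulas (19)-(21): efficiency term, active once TH nears the planned TH
    a_r3 = 1.0 if planned_th > 0 and th / planned_th > TH_ACTIVATE else 0.0
    r3 = a_r2 * a_r3 * W3 * (th / mr if mr > 0 else 0.0)

    return r1 + r2 + r3                                            # formula (22)
```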
As explained at the beginning of this section, the reward functions are activated in the
defined order as shown in Fig. 3. The next section provides the information about the Rein-
forcement Learning Algorithm used in this paper.

RL algorithm—Proximal Policy Optimization (PPO)

Reinforcement Learning can obtain the optimal policy by learning from the interaction with
the environment. Normally the RL problem is modelled as a Markov decision process (MDP)
with a tuple < S, A, P, R, E >, where:

• S: set of all states


• A: set of executable actions of the agent
• P: transition distribution, P^a_{ss'} = P(S_{t+1} = s' | S_t = s, A_t = a)
• R: reward function, where rt represents the reward obtained after taking an action at time t
• E: set of states that have already been reached

The RL agent selects an action based on its policy, which is a probability distribution of actions under a given state. This is represented by πθ(at | st). The goal of the RL agent is to maximise the cumulative reward defined as:

Rt = rt + γ rt+1 + γ² rt+2 + … = ∑_{k=0}^{∞} γ^k rt+k   (23)

The parameter 𝛾 is the discount factor to calculate the cumulative reward, where 𝛾 ∈ [0, 1].
To find an optimal policy, RL uses a policy gradient method. It samples interactions between
the agent and the environment. With that it calculates the current policy gradient directly. The
current policy then can be optimised with the gradient information [31].
The optimal policy and the value of policy πθ are designed in the form of:
π* = argmax_π E_{τ∼π(τ)}[R(τ)]   (24)

L(θ) = E_{τ∼πθ(τ)}[R(τ)] = ∑_{τ∼πθ(τ)} Pθ(τ) R(τ)   (25)


Pθ(τ) = ∏_{t=1}^{T} πθ(at | st) P(st+1 | st, at) describes the occurrence probability of the current trajectory. θ is the parameter of the current model. The gradient of the objective function L(θ) is approximated as in (26) since the environment transition distribution is independent of the model parameter.

∇θ L(θ) ≈ ∑_{τ∼πθ(τ)} R(τ) ∇θ log πθ(τ) ≈ (1/N) ∑_{n=1}^{N} ∑_{t=1}^{T} R(τ^n) ∇θ log πθ(a_t^n | s_t^n)   (26)

The policy gradient algorithm is divided into two steps of continuous iterative update:

• Using 𝜋𝜃 to interact with the environment. Obtain the observed data for calculating
∇𝜃 L(𝜃).
• Update θ with the gradient, using the learning rate α, where θ = θ + α ∇θ L(θ).

We have chosen Proximal Policy Optimisation (PPO) to control the material flow. The
algorithm was published by John Schulman et al. in 2017 [25]. PPO is implemented by
policy gradient estimation and gradient ascent optimization [31]. The policy estimation is:
L^PG(θ) = Êt[ log πθ(at | st) Ât ]   (27)

To improve the training efficiency and reuse the sampled data, PPO uses πθ(at | st) / πθold(at | st) to substitute log πθ(at | st) to support off-policy training. The clipping mechanism is added to the objective function to punish the excessive policy change when πθ(at | st) / πθold(at | st) is far from 1. The final objective function is:

L^CLIP(θ) = Êt[ min( rt(θ) Ât, clip(rt(θ), 1 − ε, 1 + ε) Ât ) ]   (28)

where rt(θ) represents the action selection probability ratio of the new and old policies:

rt(θ) = πθ(at | st) / πθold(at | st)   (29)

The clip function limits the ratio between old and new to the interval [1 − 𝜖, 1 + 𝜖],
where 𝜖 is a hyperparameter. Only large changes in the direction of policy improvement are
removed.
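
As a minimal numerical illustration of the clipped objective (28), assuming log-probabilities and advantage estimates are already available from a rollout, the surrogate could be computed as follows:

```python
import numpy as np

def ppo_clip_objective(log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective L_CLIP, formulas (28)-(29)."""
    ratio = np.exp(log_probs - old_log_probs)          # r_t(theta)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # the element-wise minimum removes only overly large steps in the improving direction
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```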

Results and comparison

Simulation scenarios

We compare our approach directly with different WIP limits to show that we can preserve
TH and decrease WIP at the same time. In doing so, the RL agent should automatically
adapt to the changing environment. In the CONWIP procedure, a new WIP limit matching the changed conditions would have to be determined after every change. We will show that the RL approach is able to control the material flow by control-
ling the release rate, achieving a higher TH than the classical methods. We simulated dif-
ferent scenarios:


1. Constant failure rates at analysis (WS2, 10%) and disassembly (WS3, 1%) and quality-related variable processing times at disassembly (WS3).
2. Failure depending on quality class (Q1 = 1%, Q2 = 5%, Q3 = 15%) and quality-related variable processing times at disassembly (WS3).
3. Failure after half of the processing time at the analysis workstation (WS2).
   a  Constant failure rate (1%) and quality-related variable processing times at disassembly (WS3).
   b  Quality-related failure rate (Q1 = 1%, Q2 = 5%, Q3 = 15%) and quality-related variable processing times at disassembly (WS3).

4. No failures, but with quality-related variable processing times at disassembly (WS3).


5. A changing environment generating different bottlenecks between workstation WS2 and
WS3.

We can observe from the results that the developed RL agent outperforms the defined
CONWIP limits. The selected formulas mentioned in the reward function measure how
effectively and efficiently the production system works. The TH and the material released
play a key role. We simulated the RL agent and the individual WIP limits and took the
cumulative reward to compare the different approaches. The RL agent performs best, as
it requires less input for the maximum TH or achieves a higher TH compared to the lower
WIP limits and is thus closest to the w0 of the system.
The next section contains the results of the different simulations.

Comparison RL agents against fixed CONWIP

In the first scenario, we consider a constant failure rate (10%) at analysis (WS2) and dis-
assembly (WS3) (1%) and quality-related variable processing times at disassembly (WS3).
Figure 4 shows the results of the simulation.
Figure 4 shows the cumulative reward of the individual limits and the RL agent. The
selected limit of 5 is quite close to the critical WIP, but the RL agent generates a higher
reward over the simulation time. The other limits are far from the w0 and perform poorly

Fig. 4  Performance methods fixed failure rate and quality-related processing time at disassembly WS3


overall in the selected scenario. It is easy to see that the higher the chosen limit, the lower the reward. This is mainly because the higher input no longer has any influence on the TH, but only reduces the reward and thus also the efficiency. In this
case w0 is between 5 and 10, with the RL agent achieving an average WIP of 5.8.
Next, we change the failure rates at disassembly (WS3) in the scenario and define them
for each quality class and keep the processing time variability at WS3. The chosen failure
rates at disassembly are for Q1 = 1%, Q2 = 5% and Q3 = 15%. The failure rate at the WS2
remains the same. Figure 5 below shows the simulation results.
Figure 5 illustrates the strength of the RL agents in finding w0 using the learned policy.
The RL agent again achieves a higher cumulative reward than the rest of the WIP limits. As
in Scenario 1, a downward trend in the higher limit values is evident. The RL agent in this
case achieves an average WIP of 5.7.
In the next scenario, we change the failure point at WS2. The core fails within the analy-
sis after half of the processing time (after 150 s) or can be used in disassembly. This can
happen in remanufacturing if the cores are checked for load parameters at the beginning of
the analysis and then, e.g., software modifications are made. Figure 6 shows the simulation
results for this scenario.
This scenario changes the challenges for the RL agent, since the RL agent must decide whether to release another core so that WS2 is supplied even in the event of a failure. Higher
limits create a buffer before the analysis at WS2 and thus ensure that cores are always avail-
able. However, it can be seen from the reward function that the productivity of the limits
is much worse than that of the RL agent. Also, rewarding the utilisation of the analysis
does not allow for a higher reward for the higher limits. Here, the RL agent also achieves a
clearer distance to CONWIP 5 and sets itself slightly apart.
The fourth scenario is dealing with the moved failure point at WS2 and with the quality-
related failure rates at WS3 (Q1 = 1%, Q2 = 5% and Q3 = 15%). Figure 7 shows the results.
The RL agent performs best in all simulated scenarios so far and releases material
according to w0 and the corresponding challenges regarding failures. In contrast to the
other limits, the RL agent can react flexibly to the respective situation. In remanufactur-
ing, the error rates at analysis WS2 and disassembly WS3 cause the most difficulties here,
as these lead to a loss of material, especially at later workstations. With a fixed limit, the

Fig. 5  Performance methods quality-related failure rates and processing time at disassembly WS3


Fig. 6  Performance methods fixed failure rate at WS2 and WS3, quality-related processing time at WS3 and
moved failure point at WS2

newly released material takes too long to reach the required workstation in the event of
a material failure.
Another simulation compares the performance of the RL agent in an environment
without failures of cores and components, but with quality-related processing time fluc-
tuations. Although this is rather unusual in remanufacturing systems, it allows comparing the agent under conditions like those in a flow shop production with a changing bottle-
neck. For example, it can be shown that the RL agent is variably adaptable to changing
environments. Figure 8 shows the comparison in this scenario.
The RL agent also achieves a higher reward here than the rest of the limits. Even if
the RL agent achieves only a slightly higher reward than CONWIP 5, it is closer to w0
and influences the efficiency of the remanufacturing system through its material release.
To demonstrate the advantages of the RL agent in terms of adaptivity, we have simu-
lated an additional scenario in which bottlenecks alternate between WS2 and WS3 with
a constant failure rate (as in the other scenarios) at each of these workstations. Table 2
shows the changing processing time at WS3.

Fig. 7  Performance methods quality-related failure rate, quality-related processing time at WS3 and moved
failure point at WS2


Fig. 8  Performance methods without failures and with quality-based processing time at WS3

This requires the RL agent to adapt the release so that the corresponding buffers before
the bottleneck do not overflow with stock. The closer the bottleneck is to the end, the more
difficult it is to utilise and supply the bottleneck with sufficient material. Figure 9 shows the
results of the simulation.
The extended scenario shows the potential of controlling the material flow in changing
conditions with an RL agent as opposed to setting a defined limit. The agent can adjust
its behaviour regarding material release as soon as it detects changes in the state space.
This usually happens faster than with later adjustments by workers or planners or control
software.
In the following, we will show the different plots of the simulation that reflect the adap-
tivity. The agent has achieved a significantly higher reward in this scenario than the other
limits. The distance to CONWIP 5 is also greater and accordingly shows that CONWIP 5
releases too little material for the system in this scenario.

Table 2  Processing time for WS2 and WS3 for the adaptivity scenario

Workstation 2 (WS2): Analysis (failure rate 10%)
  PT2 = (300 s, 20 s) per core

Workstation 3 (WS3): Disassembly (failure rate 1%)
  For 0 s < t < 75,000 s:         PT3,q1 = (250 s, 5 s), PT3,q2 = (320 s, 10 s), PT3,q3 = (450 s, 30 s)
  For 75,000 s < t < 125,000 s:   PT3,q1 = (150 s, 5 s), PT3,q2 = (250 s, 10 s), PT3,q3 = (300 s, 30 s)
  For 125,000 s < t < 225,000 s:  PT3,q1 = (320 s, 5 s), PT3,q2 = (400 s, 10 s), PT3,q3 = (450 s, 30 s)
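
One way the time-varying WS3 parameters from Table 2 could be encoded in the simulation is sketched below; the lookup helper is an assumption for illustration, not part of the published model (values in seconds, (mean, std) per quality class).

```python
WS3_SCHEDULE = [
    (0,       75_000,  {1: (250, 5), 2: (320, 10), 3: (450, 30)}),
    (75_000,  125_000, {1: (150, 5), 2: (250, 10), 3: (300, 30)}),
    (125_000, 225_000, {1: (320, 5), 2: (400, 10), 3: (450, 30)}),
]

def ws3_processing_params(t, quality):
    """Return the (mean, std) of the WS3 processing time active at simulation time t."""
    for start, end, params in WS3_SCHEDULE:
        if start <= t < end:
            return params[quality]
    return WS3_SCHEDULE[-1][2][quality]
```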


Fig. 9  Performance of the different methods in a changing environment

The key characteristic is the WIP . The most important thing here is that no excessive
fluctuations occur and thus a constant material flow is guaranteed by the RL agent. Fig-
ure 10 shows the WIPreman over the simulation time t .
The WIP has a stable course over the simulation time with only a few outliers, which
can, however, be caused by the failures at the individual workstations. The stability and
flexibility are clearly shown by the average WIP over the simulation time as shown in
Fig. 11.
The average WIPreman shown is from the last simulation scenario. At the beginning of the
simulation, the WIP increases towards 5.8. From time 75,000 s, the processing times at
WS3 change and the average WIP falls slightly towards 5.7. Thus, the average WIP falls,
and the agent adapts the material release. Subsequently, the average WIP rises again

Fig. 10  WIP over simulation time


Fig. 11  Average WIP over simulation time

from 150,000 s because the bottleneck shifts again due to the renewed adaptation of the
processing times at WS3.
The agent’s cumulative reward shows that the agent adapts to the changing states.
The adaptation is positively rewarded. The agent achieves a high TH at the bottleneck
and changes the input via the material release. Figure 12 shows the cumulative reward
of the RL agent.
From the individual transitions at 75,000 s and 150,000 s, we see that the agent does
not get much reward at the beginning because it allows too little TH at the bottleneck and
generates too much input via the release. After it adjusts the policy, it eventually gets a
constant positive reward.

Fig. 12  Cumulative Reward RL Agent over simulation time


Fig. 13  Input/Output Ratio over simulation time and changing states

Figure 13 illustrates the ratio of input to output. The figure shows how the change in
processing times changes the ratio of input to output. If the bottleneck is further back in the
production system, the agent must release more material so that the bottleneck is constantly
supplied with material. If the bottleneck is further forward in the system, the agent releases
the material in longer periods of time so that the buffer in front of the bottleneck does not
overflow.
The transitions occur because of the changing condition at 75,000 s and 150,000 s. In
addition to the input/output ratio, the average material release rate of the RL agent shows
the adaptivity in Fig. 14.
At the beginning of the simulation the RL agent releases more material, between
75,000 s and 150,000 s it decreases the material release and after 150,000 s it increases
it again and releases more material per ts. At the beginning, the material release rate

Fig. 14  Average material release rate RL Agent over simulation time


Fig. 15  Queue analysis WS2 over simulation time t

moves towards 0.15 units/ts. After the change of states, the RL agent decreases its material
release, and the average value moves towards 0.165 units/ts. After readjusting the environ-
mental conditions, the RL agent adapts again and accelerates its material release rate back
towards 0.15 units/ts. The RL agent protects the supply with material of the changing bot-
tlenecks, this can be seen from the buffer stocks at WS2 (Fig. 15) and WS3 (Fig. 16).
The buffers are usually supplied with material so that the bottleneck can work at
any time. It should be noted in Figs. 15 and 16 that the bottlenecks change. At
the beginning and end of the simulation, WS3 is the bottleneck, while in the middle of
the simulation WS2 is the bottleneck. This explains that at certain simulation times the
buffer stock of the workstations drops to 0. Figure 16 shows that due to failures at WS2 ,
the buffer stock briefly falls to 0 before WS3. At the same time, the RL agent tries to
cover these failures by releasing more cores, as can be seen in Fig. 15.

Fig. 16  Queue disassembly WS3 over simulation time t


With this behaviour, the RL agent receives the highest reward compared to the lim-
its. Thus, a lower WIP risks the utilisation and material supply of the bottleneck and
a higher WIP overfills the buffers with material and thus reduces the flexibility of the
production system.
In all scenarios, the RL agent has a 40–50% lower WIP compared to CONWIP 10 and CONWIP 15; only the WIP limit of 5 comes close. CONWIP 5
always carries the risk, which can also be seen in the reward, that the bottleneck runs
empty due to the lower WIP or that there is no material at the bottleneck to process due
to random failures.
The RL agent has a different material release rate in all different scenarios. This
shows that the PPO agent adapts to the other states or characteristics. The PPO agent
illustrates the possibilities that reinforcement learning offers in production optimisation.
The adaptivity of reinforcement learning agents as well as the combination with classi-
cal methods expands the optimisation potential of multi-objective optimisation.
The next section explains areas of use of reinforcement learning in the field of
remanufacturing, gives a summary and points out future areas of research.

Conclusion and further research

[22] points out that capacity control, in this case the utilisation of the production system,
is an important area of research in production control and should become more impor-
tant in the future. The reinforcement learning approach is a way for remanufacturing to
deal with increasing complexity. The pure CONWIP approach with its simple imple-
mentation allows comparing and evaluating the RL approach. For remanufacturing, the
proposed RL approach can also be used for further operational decisions, such as disas-
sembly decisions based on various observations. The approach can also be applied to
new productions or series productions that, for example, assemble the products in a flow
shop with one-piece flow. Instead of assuming failure rates for cores and components as we do, the rework rate at certain workstations or at a defined bottleneck workstation would then be considered in the series process. For production lines that can only accommodate a certain number of workpiece carriers (otherwise machine downtimes would occur), rein-
forcement learning can be used to determine the optimal number of workpiece carriers
and control the release of the workpiece carriers according to the line conditions.
In summary, our developed approach serves as a possible solution to control the
uncertainties in remanufacturing and to increase the productivity of disassembly lines.
The results show an optimisation of the system and a minimisation of costs. Since set-
ting WIP limits takes time and resources, and ultimately no guarantee can be given that
this is the critical WIP of the production system, our proposed approach allows for self-
adaptation to changing environmental conditions. The RL agent can also be combined
with different agents or control algorithms. By observing and incorporating other deci-
sions, the RL agent can adapt its decision-making and, in contrast to supervised learn-
ing methods, continues to improve itself during production. The proposed approach can
also be modified so that in case of demand uncertainties, the bottleneck rate is equal to
the demand rate and the material is released accordingly.


With the help of digital twins and real-time data as well as predictions, reinforcement
learning can be the next step in dealing with individualisation and increasing complex-
ity, allowing production to increase productivity.

Author contributions Felix Paschko and Steffi Knorn wrote the main manuscript text and prepared all figures
and tables. Felix Paschko, Steffi Knorn, Abderrahim Krini and Markus Kemke reviewed the manuscript.

Funding Open Access funding enabled and organized by Projekt DEAL. We acknowledge support by the
German Research Foundation and the Open Access Publication Fund of TU Berlin.

Data availability All data generated or analysed during this study are included in this published article.

Declarations
Competing interest I declare that the authors have no competing interests as defined by Springer, or other
interests that might be perceived to influence the results and/or discussion reported in this paper.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Com-
mons licence, and indicate if changes were made. The images or other third party material in this article
are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
1. Altekin FT, Akkan C (2012) Task-failure-driven rebalancing of disassembly lines. Int J Prod Res 50:4955–4976. https://doi.org/10.1080/00207543.2011.616915
2. Colledani M, Battaïa O (2016) A decision support system to manage the quality of End-of-Life products in disassembly systems. CIRP Ann 65:41–44. https://doi.org/10.1016/j.cirp.2016.04.121
3. Drucker PF (1963) Managing for business effectiveness. Harv Bus Rev 41:53–60
4. ElMaraghy H, AlGeddawy T, Azab A et al (2012) Change in Manufacturing – Research and Industrial Challenges. Enabling Manufacturing Competitiveness and Economic Sustainability. Springer, Berlin, Heidelberg, pp 2–9
5. Fernandes NO, do Carmo-Silva S (2006) Generic POLCA—A production and materials flow control mechanism for quick response manufacturing. Int J Prod Econ 104:74–84. https://doi.org/10.1016/j.ijpe.2005.07.003
6. Guide VDR, Jayaraman V, Srivastava R (1999) The effect of lead time variation on the performance of disassembly release mechanisms. Comput Ind Eng 36:759–779. https://doi.org/10.1016/S0360-8352(99)00164-3
7. Guide VR, Kraus ME, Srivastava R (1997) Scheduling policies for remanufacturing. Int J Prod Econ 48:187–204. https://doi.org/10.1016/S0925-5273(96)00091-6
8. Gungor A, Gupta SM (2001) A solution approach to the disassembly line balancing problem in the presence of task failures. Int J Prod Res 39:1427–1467. https://doi.org/10.1080/00207540110052157
9. Heinen E (1991) Industriebetriebslehre als entscheidungsorientierte Unternehmensführung. In: Heinen E, Picot A (eds) Industriebetriebslehre. Gabler Verlag, Wiesbaden, pp 1–71
10. Hopp WJ, Spearman ML (2008) Factory Physics, 3rd edn. Waveland Press, United States of America
11. Hülsmann M (ed) (2007) Understanding autonomous cooperation and control in logistics. The impact of autonomy on management, information, communication and material flow. Springer, Berlin, Heidelberg, New York
12. Kim H-J, Harms R, Seliger G (2007) Automatic Control Sequence Generation for a Hybrid Disassembly System. IEEE Trans Automat Sci Eng 4:194–205. https://doi.org/10.1109/TASE.2006.880538
13. Kizilkaya EA, Gupta SM (2004) Modeling operational behavior of a disassembly line. In: Gupta SM (ed) Environmentally Conscious Manufacturing IV. SPIE, pp 79–93. https://doi.org/10.1117/12.580419
14. Lödding H (2016) Verfahren der Fertigungssteuerung. Grundlagen, Beschreibung, Konfiguration, 3rd edn. VDI-Buch. Springer Vieweg, Berlin, Heidelberg
15. Lödding H, Yu K-W, Wiendahl H-P (2003) Decentralized WIP-oriented manufacturing control (DEWIP). Prod Plann Control 14:42–54. https://doi.org/10.1080/0953728021000078701
16. Madureira A, Pereira I, Falcao D (2013) Dynamic adaptation for scheduling under rush manufacturing orders with case-based reasoning. In: International Conference on Algebraic and Symbolic Computation, pp 330–344
17. McGovern SM, Gupta SM (2006) Computational complexity of a reverse manufacturing line. In: Proceedings of the SPIE International Conference on Environmentally Conscious Manufacturing VI, pp 1–12. https://doi.org/10.1117/12.686371
18. Nyhuis P, Wiendahl H-P (2012) Logistische Kennlinien. Grundlagen, Werkzeuge und Anwendungen, 3. Aufl. 2012. VDI-Buch. Springer, Berlin, Heidelberg
19. Panzer M, Bender B, Gronau N (2021) Deep Reinforcement Learning In Production Planning And Control: A Systematic Literature Review. Institutionelles Repositorium der Leibniz Universität Hannover, Hannover
20. Pinedo ML (2018) Scheduling. Theory, algorithms, and systems, softcover reprint of the hardcover 5th edition 2016. Springer, Cham, Heidelberg, New York, Dordrecht, London
21. Riggs RJ, Battaïa O, Hu SJ (2015) Disassembly line balancing under high variety of end of life states using a joint precedence graph approach. J Manuf Syst 37:638–648. https://doi.org/10.1016/j.jmsy.2014.11.002
22. Samsonov V, Ben Hicham K, Meisen T (2022) Reinforcement Learning in Manufacturing Control: Baselines, challenges and ways forward. Eng Applic Artif Intell 112:104868. https://doi.org/10.1016/j.engappai.2022.104868
23. Scholz-Reiter B, Beer Cd, Freitag M et al (2008) Dynamik logistischer Systeme. In: Nyhuis P (ed) Beiträge zu einer Theorie der Logistik. Springer, Berlin, Heidelberg, pp 109–138
24. Schuh G (2006) Produktionsplanung und -steuerung. Grundlagen, Gestaltung und Konzepte, 3., völlig neu bearb. Aufl. VDI-Buch. Springer, Berlin
25. Schulman J, Wolski F, Dhariwal P et al (2017) Proximal policy optimization algorithms. https://doi.org/10.48550/arXiv.1707.06347
26. Silva T, Azevedo A (2019) Production flow control through the use of reinforcement learning. Procedia Manuf 38:194–202. https://doi.org/10.1016/j.promfg.2020.01.026
27. Slama I, Ben-Ammar O, Masmoudi F et al (2019) Disassembly scheduling problem: literature review and future research directions. IFAC-PapersOnLine 52:601–606. https://doi.org/10.1016/j.ifacol.2019.11.225
28. Spearman ML, Hopp WJ, Woodruff DL (1990) CONWIP: a pull alternative to kanban. Int J Prod Res 28:879–894. https://doi.org/10.1080/00207549008942761
29. Suri R (1998) Quick response manufacturing. A companywide approach to reducing lead times, 1st edn. Productivity Press/CRC Press, New York
30. Sutton RS, Barto AG (2018) Reinforcement Learning. An Introduction, 2nd edn. Adaptive Computation and Machine Learning series. MIT Press Ltd, Massachusetts
31. Tomé De Andrade e Silva M, Azevedo A (2022) Self-adapting WIP parameter setting using deep reinforcement learning. Comput Oper Res 144:105854. https://doi.org/10.1016/j.cor.2022.105854
32. Veerakamolmal P, Gupta SM (1998) High-mix/low-volume batch of electronic equipment disassembly. Comput Ind Eng 35:65–68. https://doi.org/10.1016/S0360-8352(98)00021-7
33. Veerakamolmal P, Gupta SM (1998) Optimal analysis of lot-size balancing for multiproducts selective disassembly. International Journal of Flexible Automation and Integrated Manufacturing 6(3):245–269
34. Veerakamolmal P, Gupta SM (1999) Analysis of design efficiency for the disassembly of modular electronic products. J Electron Manuf 09:79–95. https://doi.org/10.1142/S0960313199000301
35. Veerakamolmal P, Gupta SM, McLean CR (1997) Disassembly process planning. In: Proceedings of the International Conference on Engineering Design and Automation, pp 18–21
36. Wurster M, Michel M, May MC et al (2022) Modelling and condition-based control of a flexible and hybrid disassembly system with manual and autonomous workstations using reinforcement learning. J Intell Manuf 33:575–591. https://doi.org/10.1007/s10845-021-01863-3
37. Xanthopoulos AS, Chnitidis G, Koulouriotis DE (2019) Reinforcement learning-based adaptive production control of pull manufacturing systems. J Ind Prod Eng 36:313–323. https://doi.org/10.1080/21681015.2019.1647301
38. Zhao J, Peng S, Li T et al (2019) Energy-aware fuzzy job-shop scheduling for engine remanufacturing at the multi-machine level. Front Mech Eng 14:474–488. https://doi.org/10.1007/s11465-019-0560-z

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
