
Available online at www.sciencedirect.com
ScienceDirect
Procedia Manufacturing 38 (2019) 194–202
www.elsevier.com/locate/procedia

29th International Conference on Flexible Automation and Intelligent Manufacturing (FAIM 2019), June 24-28, 2019, Limerick, Ireland

Production flow control through the use of reinforcement learning

Tomé Silva, Américo Azevedo

FEUP, Faculty of Engineering, University of Porto, Porto, Portugal
FEUP, Faculty of Engineering, University of Porto and INESC TEC, Porto, Portugal

This paper introduces a new research focus for the problem of flow control. Most of the research on this topic to date comes in the form of heuristics and flow control protocols, among which we can highlight Kanban and CONWIP. These protocols have as common ground the fact that both impact flow by limiting the amount of WIP (work in process) that circulates through a production route. These limits are not static, in the sense that one limit defined for a given period will not suffice for all possible conditions the future may entail. Therefore, we need strategies to find which values for the WIP caps are best (according to an optimization target), given a production system state and a customer demand level. We propose the use of a reinforcement learning (RL) agent and frame the problem as a reinforcement learning problem, showing that for a simulated system it is possible to reduce WIP levels by up to 43% without losses in throughput (TH). As an introduction to the flow control problem, comparisons between push and pull systems are made using discrete event simulations. We simulated a CONWIP and a push protocol and compared them in terms of cycle-time, throughput and customer lead-time. The work points out that within the field of industrial management research, terms such as cycle-time, customer lead-time, and lead-time are sometimes used interchangeably, which may lead to unnecessary confusion and hinder understanding of the subject matter. Specifically, we show that cycle-time reduction does not lead directly to customer lead-time reduction in a make-to-order environment.
© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the Flexible Automation and Intelligent Manufacturing 2019 (FAIM 2019)
Keywords: Artificial Intelligence; Reinforcement Learning; Simulation; Flow control

1. Introduction
In the literature, flow control is often associated with flow shop scheduling problems and their variants, where we have n machines and m jobs and each job goes through each machine sequentially. Specific operations will be performed on each machine, and those operations have to happen in that order, from machine i to n. Examples of these problems have been extensively studied for decades and research still perseveres in this field [2] [3] [1].
The flow control that we will analyze in this paper is related to the flow of parts that go through a product route within a production system; we are not interested in the scheduling of parts to workstations, but in ways of
understanding if the flow of a part should be stopped. It is similar to the idea of Input/Output control suggested by
Wight [4] and later picked up by Fry and Smith [5]. The big difference is that Wight introduced the idea of monitoring
WIP (work-in-process) and controlling throughput (TH), and the goal in this paper is the opposite, to monitor TH and
control WIP. At first sight it might seem like the two approaches will have the same result as they both put a cap on
WIP, but there is a nuance in using Input/Output control, because controlling TH is not easy. TH is a function of capacity and utilization, meaning that to adjust throughput we would have to decrease the production rate of various workstations. Due to the complexity of production systems, making these adjustments can potentially lead to unpredictable
changes in the overall throughput of a plant. Nevertheless, both ideas must deal with variability. If variability is high,
then by limiting WIP we may decrease throughput further than what we initially anticipated. The goal would be to
keep throughput as high as possible with minimum WIP. The paper is organized as follows: in section 2, we will show
how WIP affects lead-times and TH. In section 3, we present the advantages of having WIP caps that automatically
adjust to production conditions and finally in section 4 we will explain the approach of using RL to reduce WIP levels
effectively without hurting TH.

2. WIP effect on lead-times and TH

Throughout this paper we will use definitions such as customer lead-time, cycle-time, throughput and WIP. Some of these definitions are well established and universal within research communities; however, some of them may differ depending on who you ask. Therefore, we define these terms as we want the reader to understand them for the rest of the paper:
• Customer lead-time (CLT): Time between accepting an order from a customer and the time it is delivered to the client;
• Throughput (TH): Rate at which parts exit the system (units/time);
• Cycle-Time (CT): Time from the release of an order to the shop floor to the time that same order is released to finished goods inventory;
• Input-Rate: Rate at which orders are released into the system (units/time).
In 1961 J. D. C. Little [6] provided a proof for the queuing formula, which is now more commonly known as Little's law. Under certain stationarity conditions of the processes, Little's law states that, in the limit as time goes to infinity, the relation CT = WIP/TH holds. From direct observation of this formula it becomes apparent that to decrease cycle-time we can either decrease WIP or increase the throughput of the system. Increasing throughput may be done by increasing capacity on bottlenecks, and decreasing WIP may be achieved by opting for order release strategies that consider the state of the system, such as most pull strategies [7], [8].
When using Little's law one should be careful: this formula gives us the relation between these three quantities, but it does not tell us, for example, what is going to happen if we decrease WIP. If we decreased WIP knowing that TH would remain the same, then we could say with confidence that CT would decrease; but if WIP is decreased, will TH stay the same? We have no guarantee of that being true: the amount of WIP that we allow will affect the TH of the system.
Some authors, when advocating for the reduction of WIP as a measure to control customer lead-time, seem to equate cycle-time reduction with customer lead-time reduction, or to assume that reducing the first will reduce the latter. The problem is not that their conclusions are wrong; the problem is that both terms are used interchangeably when the authors are referring to different quantities. The effects of WIP on cycle-time have already been extensively studied, both analytically and by the use of simulations (see e.g. [9], [8]).
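As a brief numeric illustration (the values below are invented and not taken from the paper's system), Little's law only ties the three quantities together; it says nothing about whether TH survives a WIP reduction:

wip = 12.0                       # average work in process (units)
th = 4.0                         # throughput (units per hour)
ct = wip / th                    # cycle-time implied by Little's law
print(f"CT = {ct:.1f} hours")
# Halving WIP only halves CT if TH is unchanged, which the law itself
# does not guarantee:
print(f"CT at half the WIP, same TH: {(wip / 2) / th:.1f} hours")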
Let us consider a plant that works in a make-to-order environment. Demand comes in the form of orders and customer lead-time is quoted at the time of the order. Demand is then turned into a production plan by a scheduling mechanism. In this step we just converted order arrivals into a potential release rate. From the previous step an order release date is determined, and orders that are due in the same time bucket are sequenced by yet another procedure, so as to determine the sequence in which orders released in the same time bucket will enter the shop floor. Now, to synchronize TH with the input rate, let us assume that the company uses a CONWIP strategy, where another order can only be released into the system if the WIP on its route is less than a fixed amount [8]. In the production system just hypothesized there is a schedule and a sequence, but what dictates the release rate is the CONWIP strategy. This means that if TH for some reason is less than expected, the rate of release will also be less than expected, hence some orders scheduled for a given time bucket will have to be postponed to another time bucket.

If we imagine another plant that works under exactly the same circumstances, but without any WIP cap strategy, where the release rate is exactly as defined by the schedule and sequence, i.e. there is no synchronization between TH and release rate, this is a traditional push system. In this scenario orders will always be released on schedule and wait as work in process. The question that arises is: in which scenario is customer lead-time expected to be lower? The conceptual answer to this question is illustrated in the diagram in Fig. 1.

Fig. 1. Waiting time of an order within a make-to-order production system for the CONWIP strategy and the push system

In Fig. 1 we divide the customer lead-time into three components: time in the pre-shop floor, cycle-time, and transport to the client. Starting with the CONWIP system: for practically any schedule, regardless of whether it can be accomplished or not, in a CONWIP system with a limited amount of WIP allowed, if TH is the same the cycle-time will be lower, by Little's law. In the push system, orders will be released into the system as established in the schedule, independently of whether WIP levels are very high or low, and therefore cycle-time will be higher, again by Little's law. In practice, in the push system the time spent in the pre-shop floor is zero, as all the orders are released at the beginning of the time bucket and then wait in the buffers between workstations to be processed as WIP. On the other hand, in the CONWIP environment orders will be released until the WIP cap is met; after that, a following order will only be released when another one comes out. Therefore, in the CONWIP system orders, instead of waiting as materials in workstation buffers, wait as information before the production system, in what we call the pre-shop floor. The only significant difference between the systems is the form in which orders wait. With the CONWIP strategy we can postpone the commitment of materials to a given order and, in that way, gain more flexibility to adjust the production schedule or to accommodate unpredicted changes. Also, maintaining high levels of WIP can create all sorts of problems, such as space management in the production area and material deterioration. It also means that if we can operate with fewer orders as WIP, we have less money invested in inventory, which makes a company more efficient. This is an example where having a lower cycle-time does not mean that we will be able to provide products to our customers faster.
To test the hypothesis just presented, a simple simulation can be performed to verify the conjectured behavior of the system. For this simulation our production system has two workstations, and only one type of product is produced on the production line in Fig. 2a.
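A minimal sketch of such a simulation, in plain Python, is given below. The interarrival distribution, the uniform processing-time bounds and the function name simulate_line are assumptions for illustration only; the paper does not report the exact parameters of its discrete event model.

import random

def simulate_line(n_orders, wip_cap=None, seed=0,
                  interarrival=1.3, p1=(0.5, 1.5), p2=(0.5, 1.5)):
    """Two-workstation flow line in the spirit of Fig. 2a.
    wip_cap=None -> push: orders are released as soon as they arrive.
    wip_cap=w    -> CONWIP: order i is released only after order i-w has
                    left workstation 2.
    Returns (average customer lead-time, throughput)."""
    rng = random.Random(seed)
    arrivals, releases, c1, c2 = [], [], [], []
    t = 0.0
    for i in range(n_orders):
        t += rng.expovariate(1.0 / interarrival)       # order arrival (demand)
        arrivals.append(t)
        release = arrivals[i]                          # push release rule
        if wip_cap is not None and i >= wip_cap:       # CONWIP release rule
            release = max(release, c2[i - wip_cap])    # wait for a free slot
        releases.append(release)
        start1 = max(release, c1[i - 1]) if i else release
        c1.append(start1 + rng.uniform(*p1))           # completion at workstation 1
        start2 = max(c1[i], c2[i - 1]) if i else c1[i]
        c2.append(start2 + rng.uniform(*p2))           # completion at workstation 2
    clt = sum(c2[i] - arrivals[i] for i in range(n_orders)) / n_orders
    th = n_orders / (c2[-1] - releases[0])
    return clt, th

print(simulate_line(10_000))               # push: no WIP cap
print(simulate_line(10_000, wip_cap=3))    # CONWIP with a cap of 3 orders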

Fig. 2. (a) Simulated flow line; (b) effects of different WIP cap strategies on customer lead-time; (c) effects of different WIP cap strategies on TH

Looking at Fig. 2b, we can observe the effects of both variability and WIP on customer lead-time for the simulated system. Variability is represented by the different uniform distributions of the processing times at the two workstations.
The first apparent behavior is that when WIP is too low, average customer lead-time will be very high. This is not new; other authors have made the same observation. We need a certain amount of WIP in the system to guarantee that utilization is near capacity. Hopp and Spearman [10] coined this idea as critical WIP. Working with less than critical WIP would not make sense if the resulting TH of that practice does not match demand. That would mean that we would be under-utilizing our system when there were enough orders to make it work at full capacity. We can also see that once a certain level of WIP is reached, adding more WIP will not decrease customer lead-time further. For the case with less variability (blue line), that level of WIP is 3 units. The interpretation is that for this scenario a WIP of 3 is enough to keep the system at the highest utilization possible; hence, adding more WIP brings no gain in TH. But most importantly, and this was the purpose of the simulation, it shows that a higher WIP cap does not increase customer lead-time for a given demand (arrival rate) and production system capacity.
Inspecting the effects of variability on customer lead-time, the first thing to notice is that all the lines are pushed upwards as variability increases, which means that the average customer lead-time will be higher when variability is high. The other interesting behavior is that the higher the variability, the higher the amount of WIP at which adding more units of work in process yields no further improvement in TH. Or, put differently, the amount of WIP necessary for customer lead-time to be as low as possible increases with higher variability. More WIP is needed for a smaller customer lead-time to be achieved, even though cycle-time increases with WIP. In Fig. 2c we confirm the idea already presented that we need to allow a certain level of WIP so that it is possible to output orders at a higher rate. The higher the variability of a production system, the lower the overall throughput and the higher the levels of WIP necessary to obtain a given throughput. As observed before, TH does not increase forever as we increase the WIP cap; there is a point where no marginal gain in TH is obtained by adding one more unit of WIP into the system. In Fig. 2c, in the red color scheme, we see that to achieve the highest throughput possible at the lowest variability we need a WIP cap of 3, while for the highest level of variability that value grows to 8.
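Under the same assumptions, a sweep over WIP caps and two hypothetical variability levels with the simulate_line sketch above reproduces the qualitative behavior just described (higher variability pushes customer lead-time up and requires a larger cap before TH saturates):

low_var  = dict(p1=(0.9, 1.1), p2=(0.9, 1.1))    # tight uniform processing times
high_var = dict(p1=(0.2, 1.8), p2=(0.2, 1.8))    # wide uniform processing times

for label, params in [("low variability", low_var), ("high variability", high_var)]:
    for cap in [1, 2, 3, 4, 6, 8, 12]:
        clt, th = simulate_line(10_000, wip_cap=cap, **params)
        print(f"{label:17s} cap={cap:2d}  CLT={clt:7.2f}  TH={th:5.3f}")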

3. Determining the right levels of WIP

This simulation demonstrates the importance of having the right levels of WIP. We know that higher WIP levels lead to higher cycle-times, but not necessarily to higher customer lead-times. The best chance to really reduce customer lead-time is to reduce variability in any way we deem possible and to increase the capacity of the system.
In the real world there is a limit to how much variability can be controlled. If we have reached a point where we have reduced variability, or the efforts to reduce it further do not yield any improvement, what is left for managers to do?
A possible answer could be: if there is demand for it, we should keep our equipment at high levels of utilization. To do this we have to make sure our bottlenecks do not stop more than they are really required to. That means that we need to have healthy levels of WIP. Empirically, as we have shown through simulations, that level of WIP will depend on the variability, arrival rate, and effective capacity of our system. Pull and mixed strategies such as Kanban and CONWIP, and other manufacturing flow control mechanisms such as CWIPL [11] and G-MaxWIP [12], all try to control flow by inducing an effect on the amount of WIP that is accepted into the system.
Nevertheless, regardless of the pull strategy we decide is best for our production environment, having a fixed WIP cap for long periods of time is not a realistic scenario. Those limits must change as our system's effective capacity, variability and arrival rate also change. Therefore, it is of utmost importance to have a procedure where the right level of WIP is calculated for the current state of the system. We need a procedure not only to enforce WIP limits, but to change them according to state conditions, automatically.

4. Manufacturing Flow control – Reinforcement learning approach

“Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical
reward signal” [13]. This is a very simple definition that encompasses in a very generic form the idea of reinforcement
learning.
Reinforcement learning is different from the supervised learning used in machine learning and pattern recognition. There are many different supervised learning algorithms, some of them currently amid rapid development, such as artificial neural networks (ANNs). Supervised learning is fundamentally different because, as the name indicates, it needs an external supervisor to tell it what action is best in any given situation; the algorithm then learns how to generalize by looking at examples that were classified by this external agent. The performance of these algorithms is constrained, as we might expect, by the performance of this external agent. Furthermore, to use supervised learning we need a large database for our problem, and as the reader can imagine, building databases for complex problems may prove infeasible or even impossible. In fields where humans also lack a thorough understanding of how to perform a task, there is no external agent qualified to supervise the learning of these algorithms. It is for this and other reasons, pointed out for example by Silver et al. [14], that there is interest in the use of reinforcement learning mechanisms that do not require human supervision.

4.1. The Agent-Environment Interface

In reinforcement learning an agent interacts with the environment at a sequence of discrete time steps, $t = 1, 2, 3, \dots$. At each of these time steps the agent receives a representation of the environment state, $s_t \in \mathbb{S}$, where $\mathbb{S}$ is the set of all possible states. Knowing $s_t$, the agent takes an action $a_t \in \mathbb{A}(s_t)$, where $\mathbb{A}(s_t)$ is the set of possible actions available in state $s_t$. As a result of the previously taken action, in the next time step $t+1$ the agent receives a numerical reward, $r_{t+1} \in \mathbb{R}$, and the environment is now in a new state, $s_{t+1}$. The decision step just described then repeats itself in a cycle. During this cycle the agent is continually mapping states to actions according to which action has the highest probability of bringing the better reward. This mapping is called the agent's policy and is denoted as $\pi_t$. Hence, $\pi_t(s_t)$ gives us the action $a_t$ that leads to the highest expected reward when in state $s_t$. This is a very general framework: here we are considering that this policy is dependent on $t$, i.e. depending on $t$, being in a given state may lead to a different action than being in that same state at $t+1$.
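A minimal sketch of this agent-environment cycle in Python is shown below; the dummy environment and random policy are placeholders for illustration, not the production-flow environment used later in the paper.

import random

class DummyEnv:
    """Toy stand-in for a simulated production system (illustrative only)."""
    def reset(self):
        self.t = 0
        return 0.0                                     # initial state s_1
    def step(self, action):
        self.t += 1
        reward = -abs(action - 0.5)                    # arbitrary reward signal
        done = self.t >= 10
        return random.random(), reward, done           # s_{t+1}, r_{t+1}, done

class RandomAgent:
    """Placeholder policy pi_t: acts at random, never learns."""
    def act(self, state):
        return random.random()
    def observe(self, s, a, r, s_next):
        pass                                           # a learning agent would update here

env, agent = DummyEnv(), RandomAgent()
state, done = env.reset(), False
while not done:                                        # the agent-environment cycle
    action = agent.act(state)                          # a_t = pi_t(s_t)
    state_next, reward, done = env.step(action)        # environment transitions
    agent.observe(state, action, reward, state_next)
    state = state_next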

4.2. Modeling Reinforcement learning problems

Reinforcement learning problems can be modelled as Markov Decision Processes (MDPs) if the action and state spaces follow the Markov property. This is the case when, to know the state of the process at $t+1$, $s_{t+1}$, we only need to know the state and action taken at time $t$, $s_t$ and $a_t$. In other words, $s_{t+1}$ depends only on the previous state and action; we do not need the information of all the previous states until we get to $s_t$. This basically assumes that the state representation compactly retains all the information necessary to predict a future state. A good example, taken from Sutton & Barto [13], is that of a cannonball: the current position and velocity vector of a cannonball are all that we need to know the future of its trajectory.
A Markov decision process is a tuple of five components $(\mathbb{S}, \mathbb{A}, \{P_{sa}\}, \gamma, R)$ where: $\mathbb{S}$ is a set of states; $\mathbb{A}$ is a set of actions; $P_{sa}$ are the state transition probabilities (these give us the probability of transitioning to state $s'$ when in state $s$ and performing action $a$); $\gamma \in [0,1]$ is called the discount factor (it represents the value of future rewards); $R$ is the reward function.
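A toy, fully tabular MDP can make this tuple concrete. All states, actions, probabilities and rewards below are invented for illustration; the point is only that the next-state distribution depends on the current state and action alone.

import random

S = ["low_wip", "high_wip"]                  # states
A = ["cap_small", "cap_large"]               # actions
gamma = 0.95                                 # discount factor

# P[(s, a)] is the distribution over the next state s'
P = {("low_wip", "cap_small"):  {"low_wip": 0.9, "high_wip": 0.1},
     ("low_wip", "cap_large"):  {"low_wip": 0.4, "high_wip": 0.6},
     ("high_wip", "cap_small"): {"low_wip": 0.7, "high_wip": 0.3},
     ("high_wip", "cap_large"): {"low_wip": 0.1, "high_wip": 0.9}}

# R[(s, a)] is the expected immediate reward
R = {("low_wip", "cap_small"): 1.0, ("low_wip", "cap_large"): 0.2,
     ("high_wip", "cap_small"): 0.5, ("high_wip", "cap_large"): -0.5}

def step(s, a):
    """Sample s' and the reward given only (s, a): the Markov property in action."""
    dist = P[(s, a)]
    s_next = random.choices(list(dist), weights=list(dist.values()))[0]
    return s_next, R[(s, a)]

print(step("low_wip", "cap_large"))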

4.3. Solving the Flow control problem

We start by defining the state space. The environment that we are going to use, and learn a policy for, is the same shown in Fig. 2a. This is a simple system, purposely so: we first want to test the hypothesis of using reinforcement learning on this problem, and starting with a simple problem will allow us to quickly debug any issues and open the door for more complex flow control problems and possibly real case scenarios.

4.3.1. State Space


Our state vector has 7 dimensions, composed of the following continuous variables: average WIP level; average interarrival time; standard deviation of the interarrival time; average utilization of workstation 1; average utilization of workstation 2; average processing time of workstation 1; and average processing time of workstation 2. These quantities are all computed using exponential moving averages.
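A minimal sketch of how such a state vector could be maintained is shown below; the smoothing factor alpha and the class name StateTracker are assumptions, as the paper does not report the exact smoothing constant used.

class StateTracker:
    """Keeps the 7-dimensional state as exponential moving averages."""
    FEATURES = ["wip", "interarrival_mean", "interarrival_std",
                "util_ws1", "util_ws2", "proc_time_ws1", "proc_time_ws2"]

    def __init__(self, alpha=0.1):             # alpha is an assumed smoothing factor
        self.alpha = alpha
        self.state = {f: 0.0 for f in self.FEATURES}

    def update(self, observations):
        """observations: dict with the latest raw value of each feature."""
        for name, value in observations.items():
            self.state[name] += self.alpha * (value - self.state[name])
        return [self.state[f] for f in self.FEATURES]   # the state vector s_t

tracker = StateTracker()
s_t = tracker.update({"wip": 6.0, "interarrival_mean": 1.3, "interarrival_std": 0.4,
                      "util_ws1": 0.8, "util_ws2": 0.75,
                      "proc_time_ws1": 1.0, "proc_time_ws2": 1.1})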

4.3.2. Action Space


For the purpose of this paper we bounded the maximum WIP cap at 100 parts, because we know that it is statistically very unlikely for the WIP to grow past this amount in the simulated system shown in Fig. 2a. Therefore, we allowed the following vector of possible actions: [1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, 19, 30, 40, 50, 60, 70, 80, 90, 100]. From this vector we can see that the action space is more sparsely defined after 20 units; again, as the designers of the system for which we want to learn a policy, we know that the best policies will be found for WIP cap levels mostly below 20 units, so we do not need as much refinement above 20 units. Actions are taken at every decision epoch, which is triggered at intervals of 30 units of time. This simulates the decision of changing the WIP cap according to a predefined schedule, which is usually a management decision taken according to a company's capabilities when it comes to changing procedures. This can be considered a hyperparameter in our framework.
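In code, the discretized action space and the periodic decision epochs could look like the sketch below. The 30-time-unit epoch is the one stated above; the methods set_wip_cap and advance on the simulation environment are hypothetical names used only for illustration.

ACTIONS = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, 19, 30,
           40, 50, 60, 70, 80, 90, 100]      # candidate WIP caps
DECISION_EPOCH = 30.0                        # time units between WIP-cap decisions

def run_episode(env, agent, horizon=3000.0):
    """Let the agent pick a WIP cap every DECISION_EPOCH time units.
    env.set_wip_cap() and env.advance() are hypothetical simulator calls."""
    t, state = 0.0, env.reset()
    while t < horizon:
        action_index = agent.act(state)                  # index into ACTIONS
        env.set_wip_cap(ACTIONS[action_index])           # apply the chosen WIP cap
        state, reward, _ = env.advance(DECISION_EPOCH)   # simulate 30 time units
        agent.observe(reward)                            # feed the reward back
        t += DECISION_EPOCH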

4.3.3. Reward Function


Defining a reward function is often the most important part of a learning problem, and we must make sure that the reward signal we decide to implement translates into the desired behavior we want to observe from our policy. In other words, we may be maximizing the expected total reward at every episode, but our system may still be performing badly; this is often a sign that the reward signal does not effectively represent the desired behavior.
As we have discussed previously, we have a multi-objective goal: we want to maximize throughput while minimizing WIP. To aggravate the problem, these two goals conflict with each other, therefore we know that there exists a Pareto front of non-dominated solutions. The shape of this Pareto front can be observed in Fig. 2c as the red-hued curves. For a given level of variability there is a curve with all the possible trade-offs between TH and WIP, on which we observe the best possible TH for a given WIP level. We can only operate below this curve, and

we know that if variability changes this curve is pushed down, i.e. for a given WIP level the maximum possible TH
will be lower. By default, we know the two policies that maximize the two objectives individually. The policy that
maximizes TH is always to choose the highest possible WIP cap (infinity), and the policy that minimizes the average
WIP is to always set the WIP cap to 1. This means that we can devise a reward structure where we provide rewards
when our agent performs better than an agent that maximizes TH. In other words, we will reward our agent when it
makes decisions that result in the same TH as the agent that tries to maximize TH, but with lower levels of WIP.
Because the policy that maximizes TH (maxTH) is given for free, the only added complexity is that every time we
want to evaluate the performance of our agent we have to run two simulations, one for each of the policies.
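One plausible way to express this reward structure in code is sketched below; the tolerance parameter and the exact functional form are assumptions, since the paper does not report the precise reward formula.

def reward(th_agent, th_maxth, wip_agent, wip_maxth, tolerance=0.01):
    """Reward the agent when it matches the maxTH baseline throughput
    (within a tolerance) while carrying less WIP; penalize TH losses."""
    th_kept = th_agent >= (1.0 - tolerance) * th_maxth   # same TH as the maxTH run
    wip_saved = max(wip_maxth - wip_agent, 0.0)          # WIP reduction vs. baseline
    return wip_saved if th_kept else -1.0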

4.4. The optimization algorithm

For the optimization of this learning problem we used the DQN algorithm introduced by Mnih et al. [15]. This algorithm uses Q-value functions, which represent the future expected cumulative reward for a given state and action, assuming that we follow a given policy $\pi$, as in:

$Q^\pi(s, a) = E[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s, a_t = a, \pi \,]$   (1)

The goal is now to find a policy π that maximizes the expected sum of rewards. When we have a state space that
is continuous, we will have an infinite number of value functions, therefore we need to use a value function
approximator as we cannot compute all the value functions analytically. If we have this estimator at decision time we
can just solve for our policy:

$\pi^*(s) = \arg\max_a Q_\phi(s, a)$   (2)

where $Q_\phi(s, a)$ is a value function estimator, parameterized by $\phi$, which in our case will be the vector of weights of a neural network. The job of the DQN algorithm is to find the parameter vector $\phi$. The basic idea of most reinforcement learning algorithms is to estimate value functions by an iterative update that uses the Bellman equation as a target for that update. In DQN, the Bellman operator depends on the estimator itself, as in:

$Q(s, a) = r(s, a) + \gamma \max_{a'} Q_\phi(s', a')$   (3)

where $s'$ represents the state at time $t+1$ and $a'$ an action available at $s'$. In contrast, $s$ represents the state at time $t$ and $a$ the action taken when in that state $s$.
Even though the Bellman operator depends on the estimator itself, in DQN the Bellman operator is still used as the target for the update, therefore the cost function for the NN estimator is:

$L(\phi) = E_{(s,a,r,s')}\big[\big(r(s, a) + \gamma \max_{a'} Q_\phi(s', a') - Q_\phi(s, a)\big)^2\big]$   (4)

Notice that in eq. (4) we take the expectation of the error over state transitions. As in any supervised learning problem, we need sample data with ground truths; these ground truths come in the form of the Bellman operator. But because the Bellman operator is dependent on the function approximator, we have a situation where we are learning not just the approximator, but also the ground truths. This fact makes the distribution of our sample data change constantly, which can make the learning process unstable. Intuitively, it can be perceived as trying to hit a target that is always moving.
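A minimal PyTorch sketch of the update implied by eq. (4) is given below. The network width, learning rate and discount factor are assumptions, and the target network, replay buffer and terminal-state handling of the full DQN algorithm are omitted for brevity.

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 7, 21                 # 7-dimensional state, 21 WIP-cap actions
GAMMA = 0.99                                 # assumed discount factor

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, N_ACTIONS))          # Q_phi(s, .)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next):
    """One gradient step on L(phi) = E[(r + gamma * max_a' Q_phi(s',a') - Q_phi(s,a))^2].
    s, s_next: float tensors (B, 7); a: long tensor (B,); r: float tensor (B,)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q_phi(s, a)
    with torch.no_grad():                                    # the "moving" Bellman target
        target = r + GAMMA * q_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()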

4.5. Results and performance evaluation

After training, the policy was evaluated by controlling 150 simulations; we used a random policy and the maxTH policy of section 4.3.3 as baselines for performance comparison. The random policy takes a random action at every decision step from the set of possible actions [1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, 19, 30, 40, 50, 60, 70, 80, 90, 100], and the maxTH policy sets the WIP cap to 10,000 units at every step. The 10,000 WIP cap simulates not having a WIP cap at all.

Fig. 3. Performance comparison between the learned policy, the THmax policy and a random policy

Looking at Fig. 3 we can see how each policy managed the trade-off between the WIP level and TH. In the graph, the policy "Model" stands for the policy learned through the use of the reinforcement learning algorithm; notice as well that TH in this graph is being measured as parts produced. Our policy was able to achieve practically the same TH (a difference of -0.2%) with 43.4% less WIP when compared to the system with no WIP cap. This attests to the success of our approach. Although we cannot guarantee that our policy is in fact capable of determining exactly the critical WIP, our data shows that it at least pushes the WIP levels towards it. Another interesting behavior is that the random policy was also able to decrease the average WIP level without any loss in TH. This behavior is part of a bias that we unintentionally introduced when discretizing the action space. The highest possible WIP cap in our action space is set to 100. WIP levels above 100 have very little probability of occurring, but given enough simulations it will eventually happen. This means that both our model and the random policy have a small free advantage compared to the THmax policy.

5. Conclusion

This paper presents the problem of improving the performance of flow control mechanisms such as CONWIP, Kanban, and other pull-based strategies by establishing adjustable limits to the amount of WIP allowed within the production system. Those systems need their WIP cap adjusted periodically, and in this paper we propose a new approach using reinforcement learning. A reinforcement learning algorithm can be used to learn what the best WIP cap is for a given period by interacting with a discrete event simulation of the system we want to manage. The authors believe that the use of RL has greater potential for industrial management problems that involve very complex systems than methods that require supervision during the learning process. Complex problems, like those that require understanding of complex manufacturing systems, such as flow control, may be an interesting test bed for the use of this learning technique.
We show that an RL agent can learn how to maintain the TH achieved by a push system with lower amounts of WIP. Our approach automatically makes decisions on how WIP caps should be changed to ensure that TH is high and WIP levels are low. The proposed approach calls for real case applications, which will allow the authors to understand its scalability and practicability.

References
[1] S. M. Johnson, “Optimal two- and three-stage production schedules with setup times included,” Nav. Res. Logist. Q., vol. 1, no. 1, pp.
61–68, Mar. 1954.
[2] W. Li, X. Luo, D. Xue, and Y. Tu, “A heuristic for adaptive production scheduling and control in flow shop production,” Int. J. Prod.
Res., vol. 49, no. 11, pp. 3151–3170, 2011.
[3] Y. Peng and D. McFarlane, “Adaptive agent-based manufacturing control and its application to flow shop routing control,” Prod. Plan.
Control, vol. 15, no. 2, pp. 145–155, 2004.
[4] O. W. Wight, “Input/output control: a real handle on lead times,” Prod. Invent. Manag., vol. 11, pp. 9–10, 1970.
[5] T. Fry and A. Smith, “A procedure for implementing input/output control: a case study,” Prod. Invent. Manag., vol. 28, no. 3, pp. 50–52, 1987.
[6] J. D. C. Little, “A Proof for the Queuing Formula: L = λW,” Oper. Res., vol. 9, no. 3, pp. 383–387, Jun. 1961.
[7] Y. Sugimori, K. Kusonoki, F. Cho, and S. Uchikawa, “Toyota production system and Kanban system Materialization of just-in-time
and respect-for-human system,” Int. J. Prod. Res., vol. 15, no. 6, pp. 553–564, Jan. 1977.
[8] M. L. Spearman, D. L. Woodruff, and W. J. Hopp, “CONWIP: a pull alternative to kanban,” Int. J. Prod. Res., vol. 28, no. 5, pp. 879–
894, May 1990.
[9] H. Jodlbauer, “A time-continuous analytic production model for service level, work in process, lead time and utilization,” Int. J. Prod.
Res., vol. 46, no. 7, pp. 1723–1744, Apr. 2008.
[10] W. J. Hopp and M. L. Spearman, Factory physics. 1995.
[11] M. M. Sepehri and N. Nahavandi, “Critical WIP loops: a mechanism for material flow control in flow lines,” Int. J. Prod. Res., vol. 45,
no. 12, pp. 2759–2773, Jun. 2007.
[12] A. Grosfeld-Nir and M. Magazine, “Gated MaxWIP: A strategy for controlling multistage production systems,” Int. J. Prod. Res., vol.
40, no. 11, pp. 2557–2567, Jan. 2002.
[13] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[14] D. Silver et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, p. 354, Oct. 2017.
[15] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
