Inventory Management in Supply Chains
Abstract

A major issue in supply chain inventory management is the coordination of the inventory policies adopted by different supply chain actors, such as suppliers, manufacturers, and distributors, so as to smooth material flow and minimize costs while responsively meeting customer demand. This paper presents an approach to manage inventory decisions at all stages of the supply chain in an integrated manner. It allows an inventory order policy to be determined, which is aimed at optimizing the performance of the whole supply chain. The approach consists of three techniques: (i) Markov decision processes (MDP) and (ii) an artificial intelligence algorithm to solve MDPs, which is based on (iii) simulation modeling. In particular, the inventory problem is modeled as an MDP and a reinforcement learning (RL) algorithm is used to determine a near optimal inventory policy under an average reward criterion. RL is a simulation-based stochastic technique that proves very efficient, particularly when the MDP size is large. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Supply chain; Inventory management; Markov decision processes; Reinforcement learning
improving customer service, increasing product variety, and lowering costs [2].

An effective management and control of the material flow across the boundaries between companies and their customers is vital to the success of companies, but is a difficult task due to the demand amplification effect, known as the 'Forrester effect' [3]. The latter depends on factors such as the supply chain structure, the time lags involved in accomplishing actions (e.g. from the order release to fulfillment), and the poor decision making concerning information and material flows. Recent empirical studies [4] demonstrate that inventory management policies can have a destabilizing effect due to the increase in the volatility of demand as it passes up through the chain. For example, Towill [5] claims that the demand amplification experienced across each business interface is about 2:1.

Lee et al. [6] describe the Bullwhip effect occurring in supply chains as the considerable increase of the order variability relative to the variability of buyers' demand. They identify the main mechanisms that destabilize supply chains, i.e. order batching, price fluctuation, capacity shortfalls that lead to over-ordering and cancellation, and the updating of demand forecasts.

A tight coordination among the inventory policies of the different actors in the supply chain can reduce the ripple effect on demand. To this end an appropriate information infrastructure is necessary that allows all the actors within a SC to make decisions that are synchronized and coherent with each other. Such an infrastructure is referred to as networked inventory management information systems (NIMISs) [2]. However, the exploitation of NIMISs requires the adoption of suitable inventory management policies. For instance, Kelle and Milne [7] provide quantitative tools to study the effect of an (s, S) policy on the supply chain and show that small frequent orders and cooperation among the SC partners can reduce demand variability. Towill [5] investigates the impact of different strategies, such as JIT, vendor integration, and time-based management, on the reduction of demand amplification. Wikner [8] stresses that the Forrester effect is lowered through the fine tuning of existing ordering policies, the reduction of delays, the removal of the distribution stage in the SC, the change of local decision rules, and a better use of the information flow through the supply chain. Jones and Riley [9] and Hoekstra and Romme [10] address the optimal positioning of stocks in the chain and suggest the use of strategic stocks to de-couple push from pull operations. Stalk and Hout [11] and Blackburn [12] focus on time compression and the integration of operations with both customers and suppliers.

Studies on supply chain inventory management generally identify three stages, namely supply, production, and distribution [13], yet the focus is usually put on the coordination between only two of them [14,13]. Coherently, Thomas and Griffin [15] classify the models for coordinated supply chain management into buyer-vendor coordination, production-distribution coordination, and inventory-distribution coordination.

To our knowledge, there are only a few approaches that simultaneously analyze inventory decisions at more than two stages under an operational perspective. In this paper, we propose an approach to coordinate inventory management in a supply chain made up of three stages, i.e. supply, production, and distribution. A model based on Markov decision processes (MDPs) and reinforcement learning (RL) is proposed to simultaneously design the inventory reorder policies of all the SC stages. After a brief description of MDPs and the RL algorithm (Sections 2 and 3), in Section 4 we define the considered supply chain and the attendant MDP model. Results obtained by the inventory policy determined through the proposed approach are discussed in Section 5.

2. Markov decision processes

A Markov decision process is a sequential decision-making stochastic process characterized by five elements [16]: decision epochs, states, actions, transition probabilities, and rewards. An agent (decision maker) controls the path of the stochastic process. In fact, at certain points in time in the path, this agent intervenes and takes decisions which affect the course of the future path. These points are called decision epochs and the decisions are called actions. At each decision epoch, the system occupies a decision-making state.
This state may be described by a vector. As a result of taking an action in a state, the decision-maker receives a reward (which may be positive or negative) and the system goes to the next state with a certain probability, which is called the transition probability. A decision rule is a function for selecting an action in each state, while a policy is a collection of such decision rules over the state space. Implementing a policy generates a sequence of rewards. The MDP problem is to choose a policy that maximizes a function of this reward sequence (optimality criterion). Possible choices for these functions include the expected total discounted reward and the long-run average reward.

In this article we use the average reward criterion. The average reward, or gain, of a stationary policy π, starting at state i and continuing with policy π, is defined as follows:

$$g^{\pi}(i) = \lim_{N \to \infty} \frac{1}{N}\, E\left[\sum_{t=1}^{N} r(X_t, Y_t)\right],$$

where r(X_t, Y_t) represents the reward received when using action Y_t in state X_t, Y_t being the action prescribed by policy π in state X_t.

MDPs have been widely applied to inventory control problems [17]. For example, they can be used for determining optimal reorder points and quantities. In such a case decision epochs occur periodically, according to an inventory review policy, and the system state is a function of the inventory position at the review time. In a given state, actions correspond to the amount of stock to be ordered (with "not ordering" being a possible action). The transition probabilities substantially depend on the ordered quantity, the supply rate, and the demand process until the next decision epoch. A decision rule specifies the quantity to be ordered at the review time, while a policy consists in a mapping of the replenishment orders onto the possible inventory positions. Inventory managers (the decision makers) seek the optimal policy, namely a policy that maximizes a profit index (e.g. revenues minus ordering costs and inventory holding costs) over the decision-making horizon.

Semi-Markov decision processes (SMDPs) extend MDPs. In fact, differently from MDPs, where decisions are allowed only at predetermined discrete points in time, in SMDPs the decision maker can choose an action any time the system state changes. Moreover, SMDPs model the system evolution in continuous time, and the time spent by the system in a particular state follows a probability distribution. In SMDPs, the action choice not only determines the joint probability distribution of the subsequent state, but also the time between decision epochs. In general, the system state may change several times between decision epochs, but only the state at the decision epochs is relevant to the decision maker. What happens between two subsequent decision epochs provides no relevant information to the decision maker. Therefore, two processes can be distinguished: (1) the semi-Markov decision process represents the evolution of the system state at the decision epochs, and (2) the natural process describes the evolution of states continually throughout time. The two distinct processes coincide at decision epochs. The reward function associated with SMDPs is more complex. When the decision maker chooses action a in state s, he first receives a lump sum reward, and he further accrues a reward at a rate c(j, s, a) as long as the natural process occupies state j.

For i ∈ S, when action a ∈ A_i is chosen (for any state i, A_i denotes the set of possible actions that can be taken in i), and if the next state is j, let r(i, j, a) represent the reward obtained and t(i, j, a) the time spent during the state transition. Also let i_k represent the state visited in the kth epoch and a_k the action taken in that epoch. Then the average reward (gain) of an SMDP starting at state i and continuing with policy π can be given as

$$g^{\pi}(i) = \frac{\lim_{N \to \infty} E\left[\sum_{k=1}^{N} r(i_k, i_{k+1}, a_k) \mid i_1 = i\right] / N}{\lim_{N \to \infty} E\left[\sum_{k=1}^{N} t(i_k, i_{k+1}, a_k) \mid i_1 = i\right] / N}.$$

Modeling the inventory control problem through SMDPs rather than MDPs presents several advantages. It allows inventory policies to be considered in which the review time intervals are not required to be constant, and it makes it possible to have the system accrue rewards (or incur costs) between decision epochs depending on the natural process (inventory holding and pipeline costs are examples of such costs).
3. Reinforcement learning

Traditional approaches to solve MDPs and SMDPs, such as value iteration, policy iteration, modified policy iteration, and linear programming, become very difficult to apply as the state space and the action space grow, due to the huge computational effort required.

Reinforcement learning [18,19] is an artificial intelligence technique that has been successfully utilized for solving complex MDPs that model realistic systems. This technique is a way of teaching agents the optimal control policy [20]; it is based on simulation and value iteration, the latter being a traditional method to solve MDPs and SMDPs.

The RL model is based on the interaction of two elements, i.e. the learning agent and the environment, and two mechanisms, namely exploitation and exploration.

The learning agent selects actions by trial and error (exploration) and based on its knowledge of the environment (exploitation). The environment responds to these actions with an immediate reward, which is called the reinforcement signal, and evolves into a different state. A good action either results in a high immediate reward directly or leads the system to states where high rewards are obtainable. Using this information (i.e. the reward received), the agent updates its knowledge of the environment and selects the next action. The agent's knowledge consists of an R-value for each state-action pair: each R-value is a measure of the goodness of an action in a state. The updating algorithm (which is based on value iteration) ensures that a good environmental response, obtained as a consequence of taking an action in a state, results in increasing the attendant action value, while a poor response results in lowering it. Thus, as good actions are rewarded and bad actions are punished over time, some action values tend to grow and others tend to diminish.

When the system visits a state, the learning agent chooses the action with the highest action value. Sometimes the learning agent chooses a random action. This is called exploration, and it ensures that all actions are taken in all states. The learning phase ends when a trend appears in all R-values such that it is clear which is the best action in each state. The vector that maps every state into the associated optimal action represents the learned optimal policy.

3.1. SMART algorithm

The Semi-Markov average reward technique (SMART) can be implemented with a simulator of the system [20]. The environmental response to each action is captured by simulating the system with different actions in all states. Information about the response is obtained from the immediate rewards received and the time spent in each transition from one decision-making state to another. The updating of the knowledge base, which has to happen when the system moves from one decision-making state to a new decision-making state, basically means changing the action value of the action taken in the old state (this process is called learning). To implement this change, apart from the response, one also needs a variable called the learning rate, which is gradually decayed to 0 as the learning progresses. The probability of exploration is similarly decayed to 0. The decaying scheme may be as follows: a_m = M/m, where a_m is the value of the variable (learning rate or exploration probability) at the mth iteration and M is some predetermined constant. Typically M is about 0.1 for the exploration probability and 0.01 for the learning rate. Fig. 1 depicts the steps of the adopted algorithm.
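Fig. 1 is not reproduced here; the following minimal sketch only illustrates the flavour of a SMART-style update as described in the text and in [20]: R-values updated with a decayed learning rate, a running estimate of the reward rate maintained on non-exploratory moves, and exploration decayed according to a_m = M/m. The data structures, function names, and constants are ours, not the authors'.

```python
import random
from collections import defaultdict

R = defaultdict(float)          # one R-value per state-action pair (Section 3)
rho = 0.0                       # running estimate of the average reward rate
cum_reward = cum_time = 0.0
M_EXPLORE, M_LEARN = 0.1, 0.01  # constants of the decay scheme a_m = M / m

def choose_action(state, actions, m):
    """Exploit the best known action; explore with a decayed probability."""
    if random.random() < M_EXPLORE / m:
        return random.choice(actions), False          # exploratory move
    return max(actions, key=lambda a: R[(state, a)]), True

def smart_update(state, action, reward, tau, next_state, next_actions, m, greedy):
    """Update R(state, action) after a simulated transition that earned `reward`
    and took `tau` time units, then refresh the reward-rate estimate."""
    global rho, cum_reward, cum_time
    alpha = M_LEARN / m                                # decayed learning rate
    best_next = max(R[(next_state, b)] for b in next_actions)
    R[(state, action)] = ((1 - alpha) * R[(state, action)]
                          + alpha * (reward - rho * tau + best_next))
    if greedy:                                         # update rho on greedy moves only
        cum_reward += reward
        cum_time += tau
        rho = cum_reward / cum_time
```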
4. Supply chain inventory problems and reinforcement learning

In this section it is shown how reinforcement learning can be used to address supply chain inventory problems. First we describe the considered model of a supply chain, which includes the main stages identified in the literature (supply, production, and distribution), as well as the logic of the attendant material and order flows. Then we code the described model into an SMDP that can be solved through the SMART algorithm.
Table 2
Actual and coded inventory positions

Actual IP_i | < -8 | [-8; -6[ | [-6; -4[ | [-4; -2[ | [-2; 0[ | [0; 2[ | [2; 4[ | [4; 6[ | [6; 8[ | >= 8
Coded IP_i  |  1   |    2     |    3     |    4     |    5    |   6    |   7    |   8    |   9    |  10
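As an illustration of the coding in Table 2, the following minimal sketch maps an actual inventory position onto its coded value (the bin boundaries come from the table; the function name and the example values are ours):

```python
def code_inventory_position(actual_ip: float) -> int:
    """Map an actual inventory position onto the coded value 1-10 of Table 2.
    Bins are 2 units wide, half-open on the right, clipped below -8 and above 8."""
    if actual_ip < -8:
        return 1
    if actual_ip >= 8:
        return 10
    return int((actual_ip + 8) // 2) + 2      # [-8, -6) -> 2, ..., [6, 8) -> 9

# The supply chain state is the vector of coded positions at the three stages.
state = tuple(code_inventory_position(ip) for ip in (-7.5, 0.0, 6.2))
print(state)   # (2, 6, 9)
```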
own stage, i.e. each must take an action that ranges from ordering nothing up to a maximum equal to the stock point capacity plus the current backorder plus the estimated consumption during the transportation lead time minus the stock on hand. We have assumed a value of 30 for such a maximum, which simulation has shown to be high enough never to be reached. The needed capacity can be determined by measuring the maximum stock level that is reached at each stage, by simulating the system under the learnt inventory policy.

The action space size and the state space size, respectively equal to 29,791 and 1000, yield 29,791,000 action values. Their estimated values define the near optimal inventory policy.
Even though decision epochs occur at predetermined points in time, the considered decision process is an SMDP. In fact, the system state may change, and costs (rewards) are incurred (accrued), between two subsequent decision epochs. The SMDP has been solved by the SMART algorithm. The learning phase has been simulated with the commercial simulation package ARENA [21]. The learning process, which has required a length of 1,500,000 time units, has taken about 2 h on a Pentium II 450 PC.
Pentium II 450. for (T , T , T ).
The target stock S has been de"ned equal to the
G
demand during the reorder interval time ¹ (¹ d)
G G
5. Results plus the stock necessary to cover the demand dur-
ing the transportation lead time (LT ):
G
A near-optimal supply chain inventory policy,
which will be referred to as SMART policy, has S "(¹ #LT )d.
G G G
been determined through the proposed approach.
The safety stock (SS) has been added at the last
This policy, which can be thought of as an (s, S)
stage to cope with customer demand uncertainty.
policy where both s and S vary with the system
Therefore, the order quantity OQ at every stage
state, is relatively simple to be implemented, as it G
is given by
requires the knowledge of just the optimal action to
be taken in each of the system states. OQ "(S #SS)!IP ; O Q "S !IP ;
The e!ectiveness of the SMART policy has been
evaluated against a periodic order policy that ad- O Q "S !IP .
160 I. Giannoccaro, P. Pontrandolfo / Int. J. Production Economics 78 (2002) 153}161
periodic order policy, which has been used as [7] P. Kelle, A. Milne, The e!ect of (s, S) ordering policy on the
a benchmark. Also, the SMART policy proves supply chain, International Journal of Production Econ-
quite robust with respect to slight changes in de- omics 59 (1999) 113}122.
[8] J. Wikner, D.R. Towill, M. Naim, Smoothing supply chain
mand. dynamics, International Journal of Production Economics
It is expected that the superiority of the SMART 22 (1991) 231}248.
policy would be greater for more complex cases. [9] T.C. Jones, D.W. Riley, Using inventory for competitive
In fact, centralized but simpler policies (such as advantage through supply chain management, Interna-
the POQ based utilized as a benchmark) cannot tional Journal of Physical Distribution and Materials
Management 17 (2) (1987) 94}104.
adapt to complex environments as the SMART [10] S. Hoekstra, J. Romme, Integral Logistics Structures: De-
policy does. This depends on (i) the ability of veloping Customer-Oriented Goods Flows, McGraw-Hill,
simulation modeling of capturing detailed features London, 1992.
of the system as well as (ii) the capability of [11] G.H. Stalk, T.M. Hout, Competing against Time, How
MDPs of describing time dependencies between Time-Based Competition Is Reshaping Global Competi-
tion, Free Press, New York, 1990.
decisions. [12] J.D. Blackburn, Time-based Competition: The Next
Further research should address the issue of hav- Battleground in American Manufacturing, Irwin, Home-
ing the supply chain actors actually implementing wood, IL, 1991.
the optimal policy determined through the pro- [13] S. ErenguK c, A.J. Vakharia, Integrated production/distribu-
posed approach. This is quite a di$cult task, given tion planning in supply chains, European Journal of Op-
erational Research 115 (1999) 219}236.
that the supply chain actors are likely to belong to [14] C. Forza, Achieving superior operating performance from
diverse "rms. Therefore, having them actually share integrated pipeline management: An empirical study, In-
a unique reward function needs a way (e.g. appro- ternational Journal of Physical Distribution and Logistics
priate incentive mechanisms) to fairly split the Management 26 (9) (1996) 36}63.
higher rewards that the optimal policy would [15] D.J. Thomas, P.M. Gri$n, Coordinated supply chain
management, European Journal of Operational Research
guarantee. 94 (1) (1996) 1}15.
[16] M. Puterman, Markov Decision Processes: Discrete
Stochastic Programming, Wiley Interscience, New York,
References 1994.
[17] E. Porteus, Stochastic inventory theory, in: D.P. Heyman,
[1] M. Christopher, Logistic and Supply Chain Management, M.J. Sobel (Eds.), Handbooks of Operations Research,
Pitman Publishing, London, 1992. North-Holland, Amsterdam, 1990.
[2] M. Verwijmeren, P. Van der Vlist, K. van Donselaar, [18] R.L. Sutton, A.G. Barto, Reinforcement Leaning } An
Networked inventory management information systems: Introduction, MIT Press, Cambridge, MA, 1998.
Materializing supply chain management, International [19] D. Bertsekas, J. Tsitsiklis, Neuro-Dynamic Programming,
Journal of Physical Distribution and Logistics Manage- Athena Scienti"c, Belmont, MA, 1996.
ment 26 (6) (1996) 16}31. [20] T.A. Das, A. Gosavi, S. Mahadevan, N. Marchalleck, Solv-
[3] J.W. Forrester, Industrial Dynamics, MIT Press, Cam- ing semi-Markov decision problems using average reward
bridge, MA, 1961. reinforcement learning, Management Science 45 (4) (1999)
[4] M.P. Baganha, M. Cohen, The stabilizing e!ect of inven- 560}574.
tory in supply chains, Operations Research 46 (3) (1998) [21] W.D. Kelton, R.P. Sadowski, D.A. Sadowsky, Simulation
S72}S73. with Arena, McGraw-Hill, New York, 1998.
[5] D. Towill, Industrial dynamics modeling of supply chains, [22] J.A. Muckstadt, R.O. Roundy, 1993, Analysis of multistage
Logistics Information Management 9 (1996) 43}56. production systems, in: S.C. Graves, A.H.G. Rinnooy Kan,
[6] H.L. Lee, V. Padmanabhan, S. Whang, The bullwhip e!ect P.H. Zipkin (Eds.), Handbooks in Operations Research
in the supply chains, Sloan Management Review 38 (3) and Management Science, Vol. 4, North-Holland, Amster-
(1997) 93}102. dam, 1993.