
Int. J. Production Economics 78 (2002) 153–161

Inventory management in supply chains: a reinforcement learning approach

Ilaria Giannoccaro, Pierpaolo Pontrandolfo*

Dipartimento di Ingegneria Meccanica e Gestionale, Politecnico di Bari, Viale Japigia 182, 70123 Bari, Italy

Received 8 August 2000; received in revised form 5 September 2000

* Corresponding author. Tel.: +39-80-5962-763; fax: +39-80-5962-788. E-mail address: pontrandolfo@poliba.it (P. Pontrandolfo).

Abstract

A major issue in supply chain inventory management is the coordination of the inventory policies adopted by different supply chain actors, such as suppliers, manufacturers, and distributors, so as to smooth material flow and minimize costs while responsively meeting customer demand. This paper presents an approach to manage inventory decisions at all stages of the supply chain in an integrated manner. It allows an inventory order policy to be determined, which is aimed at optimizing the performance of the whole supply chain. The approach consists of three techniques: (i) Markov decision processes (MDP) and (ii) an artificial intelligence algorithm to solve MDPs, which is based on (iii) simulation modeling. In particular, the inventory problem is modeled as an MDP and a reinforcement learning (RL) algorithm is used to determine a near-optimal inventory policy under an average reward criterion. RL is a simulation-based stochastic technique that proves very efficient particularly when the MDP size is large. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Supply chain; Inventory management; Markov decision processes; Reinforcement learning

1. Introduction

A supply chain (SC) is a network of organizations that are involved in the different processes and activities that produce value in the form of products and services in the hands of the ultimate consumer [1]. Such activities are mainly the procurement of materials, the transformation of these materials into intermediate and finished products, and the distribution of finished products to the end customer. Supply chain management (SCM) is concerned with the integrated management of the flows of goods and information throughout the supply chain, so as to ensure that the right goods are delivered in the right place and quantity at the right time.

The SCM literature covers different areas, such as forecasting, procurement, production, distribution, inventory, transportation, and customer service, under several perspectives, i.e. strategic, tactical, and operational. Supply chain inventory management (SCIM), which is the main concern of this paper, is an integrated approach to the planning and control of inventory throughout the entire network of co-operating organizations, from the source of supply to the end user. SCIM is focused on the ultimate customer demand and aims at improving customer service, increasing product variety, and lowering costs [2].


An effective management and control of the material flow across the boundaries between companies and their customers is vital to the success of companies, but it is a difficult task due to the demand amplification effect, known as the 'Forrester effect' [3]. The latter depends on factors such as the supply chain structure, the time lags involved in accomplishing actions (e.g. from order release to fulfillment), and poor decision making concerning information and material flows. Recent empirical studies [4] demonstrate that inventory management policies can have a destabilizing effect due to the increase in the volatility of demand as it passes up through the chain. For example, Towill [5] claims that the demand amplification experienced across each business interface is about 2:1.

Lee et al. [6] describe the Bullwhip effect occurring in supply chains as the considerable increase of the order variability relative to the variability of buyers' demand. They identify the main mechanisms that destabilize supply chains, i.e. order batching, price fluctuation, capacity shortfalls that lead to over-ordering and cancellation, and the updating of demand forecasts.

A tight coordination among the inventory policies of the different actors in the supply chain can reduce the ripple effect on demand. To this end an appropriate information infrastructure is necessary, one that allows all the actors within a SC to make decisions that are synchronized and coherent with each other. Such an infrastructure is referred to as a networked inventory management information system (NIMIS) [2]. However, the exploitation of NIMISs requires the adoption of suitable inventory management policies. For instance, Kelle and Milne [7] provide quantitative tools to study the effect of an (s, S) policy on the supply chain and show that small frequent orders and cooperation among the SC partners can reduce demand variability. Towill [5] investigates the impact of different strategies, such as JIT, vendor integration, and time-based management, on the reduction of demand amplification. Wikner [8] stresses that the Forrester effect is lowered through the fine tuning of existing ordering policies, the reduction of delays, the removal of the distribution stage in the SC, the change of local decision rules, and a better use of the information flow through the supply chain. Jones and Riley [9] and Hoekstra and Romme [10] address the optimal positioning of stocks in the chain and suggest the use of strategic stocks to de-couple push from pull operations. Stalk and Hout [11] and Blackburn [12] focus on time compression and the integration of operations with both customers and suppliers.

Studies on supply chain inventory management generally identify three stages, namely supply, production, and distribution [13], yet the focus is usually put on the coordination between only two of them [14,13]. Coherently, Thomas and Griffin [15] classify the models for coordinated supply chain management into buyer–vendor coordination, production–distribution coordination, and inventory–distribution coordination.

To our knowledge, there are only a few approaches that simultaneously analyze inventory decisions at more than two stages under an operational perspective. In this paper, we propose an approach to coordinate inventory management in a supply chain made up of three stages, i.e. supply, production, and distribution. A model based on Markov decision processes (MDPs) and reinforcement learning (RL) is proposed to simultaneously design the inventory reorder policies of all the SC stages. After a brief description of MDPs and the RL algorithm (Sections 2 and 3), in Section 4 we define the considered supply chain and the attendant MDP model. Results obtained by the inventory policy determined through the proposed approach are discussed in Section 5.

2. Markov decision processes

A Markov decision process is a sequential decision-making stochastic process characterized by five elements [16]: decision epochs, states, actions, transition probabilities, and rewards. An agent (decision maker) controls the path of the stochastic process. In fact, at certain points in time in the path, this agent intervenes and takes decisions which affect the course of the future path. These points are called decision epochs and the decisions are called actions. At each decision epoch, the system
occupies a decision-making state. This state may be described by a vector. As a result of taking an action in a state, the decision maker receives a reward (which may be positive or negative) and the system goes to the next state with a certain probability, which is called the transition probability. A decision rule is a function for selecting an action in each state, while a policy is a collection of such decision rules over the state space. Implementing a policy generates a sequence of rewards. The MDP problem is to choose a policy that maximizes a function of this reward sequence (optimality criterion). Possible choices for these functions include the expected total discounted reward or the long-run average reward.

In this article we use the average reward criterion. The average reward or gain of a stationary policy π, starting at state i and continuing with policy π, is defined as follows:

g^{\pi}(i) = \lim_{N \to \infty} \frac{1}{N} E\left[ \sum_{t=1}^{N} r(X_t, Y_t) \right],

where r(X_t, Y_t) represents the reward received when using action Y_t in state X_t, Y_t being the action prescribed by policy π in state X_t.

MDPs have been widely applied to inventory control problems [17]. For example, they can be used for determining optimal reorder points and quantities. In such a case decision epochs occur periodically, according to an inventory review policy, and the system state is a function of the inventory position at the review time. In a given state, actions correspond to the amount of stock to be ordered (with 'not ordering' being a possible action). The transition probabilities substantially depend on the ordered quantity, the supply rate, and the demand process until the next decision epoch. A decision rule specifies the quantity to be ordered at the review time, while a policy consists in a mapping of the replenishment orders onto the possible inventory positions. Inventory managers (the decision makers) seek the optimal policy, namely a policy that maximizes a profit index (e.g. revenues minus ordering costs and inventory holding costs) over the decision-making horizon.

Semi-Markov decision processes (SMDPs) extend MDPs. In fact, differently from MDPs, where decisions are allowed only at predetermined discrete points in time, in SMDPs the decision maker can choose an action any time the system state changes. Moreover, SMDPs model the system evolution in continuous time, and the time spent by the system in a particular state follows a probability distribution. In SMDPs, the action choice not only determines the joint probability distribution of the subsequent state, but also the time between decision epochs. In general, the system state may change several times between decision epochs, but only the state at the decision epochs is relevant to the decision maker. What happens between two subsequent decision epochs provides no relevant information to the decision maker. Therefore, two processes can be distinguished: (1) the semi-Markov decision process represents the evolution of the system state at the decision epochs, and (2) the natural process describes the evolution of states continually throughout time. The two distinct processes coincide at decision epochs. The reward function associated with SMDPs is more complex. When the decision maker chooses action a in state s, he first receives a lump sum reward; furthermore, he accrues a reward at a rate c(j, s, a) as long as the natural process occupies state j.

For i ∈ S, when action a ∈ A_i is chosen (for any state i, A_i denotes the set of possible actions that can be taken in i), and if the next state is j, let r(i, j, a) represent the reward obtained and t(i, j, a) the time spent during the state transition. Also let i_k represent the state visited in the kth epoch and a_k the action taken in that epoch. Then the average reward (gain) of an SMDP starting at state i and continuing with policy π can be given as

g^{\pi}(i) = \frac{\lim_{N \to \infty} E\left[ \sum_{k=1}^{N} r(i_k, i_{k+1}, a_k) \mid i_1 = i \right] / N}{\lim_{N \to \infty} E\left[ \sum_{k=1}^{N} t(i_k, i_{k+1}, a_k) \mid i_1 = i \right] / N}.

Modeling the inventory control problem through SMDPs rather than MDPs presents several advantages. It allows inventory policies to be considered in which review time intervals are not required to be constant, and it makes it possible to have the system accrue rewards (or incur costs) between decision epochs depending on the natural process (inventory holding and pipeline costs are examples of such costs).
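To make the average reward criterion concrete, the following minimal Python sketch estimates the gain of a fixed policy by simulating an SMDP trajectory and dividing the cumulative reward (lump-sum rewards plus rewards accrued at a rate between epochs) by the cumulative time. It is illustrative only and not taken from the paper; the `step` interface and the toy two-state process are assumptions.

```python
import random

def estimate_gain(step, policy, start_state, num_transitions=100_000, seed=0):
    """Monte Carlo estimate of the long-run average reward (gain) of a fixed
    policy on a semi-Markov decision process.

    `step(state, action, rng)` must return (next_state, lump_reward,
    accrued_reward, sojourn_time): the lump-sum reward earned at the
    transition, the reward accrued while the natural process evolved, and
    the (positive) time elapsed between the two decision epochs.
    """
    rng = random.Random(seed)
    total_reward, total_time, state = 0.0, 0.0, start_state
    for _ in range(num_transitions):
        action = policy(state)
        state, lump, accrued, tau = step(state, action, rng)
        total_reward += lump + accrued
        total_time += tau
    return total_reward / total_time   # approximates E[reward] / E[time]

# Toy example: a two-state SMDP with exponential sojourn times (hypothetical).
def toy_step(state, action, rng):
    tau = rng.expovariate(1.0 if state == 0 else 0.5)   # sojourn time
    lump = 10.0 * action                                 # lump-sum reward
    accrued = -2.0 * tau                                 # holding-type cost rate
    return 1 - state, lump, accrued, tau

print(estimate_gain(toy_step, policy=lambda s: 1, start_state=0))
```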

3. Reinforcement learning

Traditional approaches to solving MDPs and SMDPs, such as value iteration, policy iteration, modified policy iteration, and linear programming, become very difficult to apply as the state space and action space grow, due to the huge computational effort required.

Reinforcement learning [18,19] is an artificial intelligence technique that has been successfully utilized for solving complex MDPs that model realistic systems. This technique is a way of teaching agents the optimal control policy [20]; it is based on simulation and value iteration, the latter being a traditional method to solve MDPs and SMDPs.

The RL model is based on the interaction of two elements, i.e. the learning agent and the environment, and two mechanisms, namely exploitation and exploration.

The learning agent selects actions by trial and error (exploration) and based on its knowledge of the environment (exploitation). The environment responds to these actions with an immediate reward, which is called the reinforcement signal, and evolves to a different state. A good action either results in a high immediate reward directly or leads the system to states where high rewards are obtainable. Using this information (i.e. the reward received), the agent updates its knowledge of the environment and selects the next action. The agent's knowledge consists of an R-value for each state-action pair: each R-value is a measure of the goodness of an action in a state. The updating algorithm (which is based on value iteration) ensures that a good environmental response, obtained as a consequence of taking an action in a state, results in increasing the attendant action value, while a poor response results in lowering it. Thus, as the good actions are rewarded and the bad actions are punished over time, some action values tend to grow and others tend to diminish.

When the system visits a state, the learning agent chooses the action with the highest action value. Sometimes the learning agent chooses a random action. This is called exploration, which ensures that all actions are taken in all states. The learning phase ends when a trend appears in all R-values such that it is clear which is the best action in each state. The vector that maps every state into the associated optimal action represents the learned optimal policy.

3.1. SMART algorithm

The Semi-Markov average reward technique (SMART) can be implemented with a simulator of the system [20]. The environmental response to each action is captured by simulating the system with different actions in all states. Information about the response is obtained from the immediate rewards received and the time spent in each transition from one decision-making state to another. The updating of the knowledge base, which has to happen when the system moves from one decision-making state to a new decision-making state, basically means changing the action value of the action taken in the old state (this process is called learning). To implement this change, apart from the response one also needs a variable called the learning rate, which is gradually decayed to 0 as the learning progresses. The probability of exploration is also similarly decayed to 0. The decaying scheme may be as follows: a_m = M/m, where a_m is the value of the variable (learning rate or exploration probability) at the mth iteration and M is some predetermined constant. Typically M is about 0.1 for the exploration probability and 0.01 for the learning rate. Fig. 1 depicts the steps of the adopted algorithm.

4. Supply chain inventory problems and reinforcement learning

In this section it is shown how reinforcement learning can be used to address supply chain inventory problems. First we describe the considered model of a supply chain, which includes the main stages identified in the literature (supply, production, and distribution), as well as the logic of the attendant material and order flows. Then we code the described model into an SMDP that can be solved through the SMART algorithm.

Fig. 1. The SMART algorithm.
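As a complement to Fig. 1, the following Python sketch shows the core of a SMART-style update loop as described in Section 3.1 and in [20]. It is an illustrative reconstruction, not the authors' code: the environment interface `simulate_transition`, the `actions` function, and the constants are assumptions.

```python
import random
from collections import defaultdict

def smart(simulate_transition, actions, start_state,
          max_steps=1_500_000, M_explore=0.1, M_learn=0.01, seed=0):
    """SMART-style learning loop (sketch). R-values are updated each time the
    system moves between decision-making states, using a learning rate and an
    exploration probability both decayed as a_m = M / m."""
    rng = random.Random(seed)
    R = defaultdict(float)            # R-value for each (state, action) pair
    rho = 0.0                         # running estimate of the average reward
    total_reward = total_time = 0.0
    state = start_state
    for m in range(1, max_steps + 1):
        alpha = M_learn / m           # decayed learning rate
        p_explore = M_explore / m     # decayed exploration probability
        greedy = max(actions(state), key=lambda a: R[state, a])
        explore = rng.random() < p_explore
        action = rng.choice(list(actions(state))) if explore else greedy
        # reward earned and time elapsed until the next decision-making state
        next_state, reward, tau = simulate_transition(state, action, rng)
        best_next = max(R[next_state, b] for b in actions(next_state))
        R[state, action] = (1 - alpha) * R[state, action] + \
            alpha * (reward - rho * tau + best_next)
        if not explore:               # refresh the gain estimate on greedy steps
            total_reward += reward
            total_time += tau
            rho = total_reward / total_time
        state = next_state
    return R, rho
```

The learned policy is then read off as the greedy action in every state, i.e. the vector mapping each state to the action with the highest R-value, as noted at the end of Section 3.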

Fig. 2. The supply chain model.

4.1. A supply chain inventory model

A simple supply chain model is considered to show the way in which the proposed approach can be utilized to determine a near-optimal inventory order policy.

The supply chain model consists of three stages, namely supply, production and distribution (Fig. 2). It is assumed that a decision maker (actor) exists at every stage, who has the responsibility of managing inventory at that stage.

At fixed time intervals, each actor reviews the stock at his stage and, according to a certain inventory policy, decides whether to issue an order to the upstream stage. Once the order is placed, the delivering process begins as long as the upstream stock is sufficient to cover the request. Otherwise the order is backordered and waits until the upstream stock reaches the ordered quantity. In particular, backordering customer demand at the distribution stage involves penalty costs, which grow with the waiting time. Even though the time interval at which periodic inventory reviews occur does not change from stage to stage, actors may adopt diverse decisions in terms of both how much and even when to order: in fact, any actor may decide not to order at a certain point in time even if the others do order.

The inventory management process is characterized by the time and cost variables reported in Table 1. Production costs have not been considered, as in the model they do not depend on the specific inventory policy. Similarly, production times are almost not influenced by the inventory policy and in any case they do not substantially differentiate the performance of one policy from another.

Each time actor i issues an order, she incurs the ordering cost Co_i, which includes the transportation cost. Ordering costs are assumed to be independent of the order size. The inventory holding cost h_i is the cost of keeping a stock unit per time unit at stock point i. The pipeline cost Cp_i is the cost per time unit of a stock unit in transit to stock point i.

Table 1
Cost and time variables

Cost variables
  Unit price P                                          1000
  Ordering cost Co_i (i = 1, 2, 3)                      80, 80, 80
  Unit inventory cost (per t.u.) h_i (i = 1, 2, 3)      10, 5, 3
  Unit pipeline cost (per t.u.) Cp_i (i = 1, 2, 3)      10, 5, 3
  Unit penalty cost for late delivery (per t.u.) Cb     50

Time variables
  Stock review time interval (constant) It              10
  Transportation time (uniform distribution) T_i        1–3
  Demand: mean inter-arrival time (exponential) d       1

When final demand cannot be immediately satisfied, the system incurs a penalty cost Cb times the waiting time until demand is fulfilled. Although estimating the penalty cost is often difficult, it proves crucial when responsiveness to market demand is a key performance, which is especially true in time-based competition.

4.2. The SMDP model

The discussed SC inventory management process, resulting from a given inventory policy, is a stochastic process. This section explains the way in which the inventory process has been mapped into an SMDP and solved through the proposed approach.

The SMDP definition involves the choice of the reward function to be maximized. This choice is linked to the hypotheses on the cost structure of the inventory model. In this regard, two basic options are available, namely averaging vs. discounting the cost. When the time value of money is considered, costs must be discounted rather than averaged. In the considered case, the average reward criterion has been chosen as this is more frequently used in common inventory models (both single stage and multi-echelon). Also, such a criterion simplifies the hypotheses because there is no need to assume a discount factor. Furthermore, averaging is more appropriate when the performance is analyzed over a time horizon that is theoretically infinite. Finally, the reward function has to be associated with the whole supply chain performance, given that an integrated inventory management policy has to be determined. Therefore, the average reward or gain over the long run has been utilized, which is defined as follows:

gain = (Total Reward)/(Total Time),

where

Total Reward = price × Tot sell − Tot Cost,
Tot sell = products sold at Total Time,
Tot Cost = total costs incurred at Total Time.

To complete the mapping of the inventory management problem into the SMDP, the decision epochs, the system state variable, and the possible actions in every system state have to be identified.

A decision epoch occurs at each stock review time interval, when the actors at all the three stages make decisions on inventory. The decision agents are three, as many as the actors in the supply chain. As a decision agent must make decisions solely based on the system state, the state variable must provide him with any information that is relevant to an integrated inventory management. In particular, she needs to know the inventory position of the whole supply chain, that of her own stage only being inadequate. Also, an integrated inventory management requires that the three decision makers share a unique reward function, given that the performance index must refer to the supply chain as a whole. Thus, the system state variable is given by the following vector:

(IP_1, IP_2, IP_3),

which describes the global SC inventory position as the inventory position IP_i at every stage i.

The inventory position IP_i at a given stage depends on scheduled receipts (SR_i), on-hand inventory (OH_i), and backorders (BO_i) as follows:

IP_i = OH_i + SR_i − BO_i.

From the above equation it follows that IP_i is not bounded, which would imply an infinite size of the associated MDP. Therefore, every IP_i has been coded so as to let it assume a limited number of values (Table 2).

Table 2
Actual and coded inventory positions

Actual IP_i   < −8   [−8; −6[   [−6; −4[   [−4; −2[   [−2; 0[   [0; 2[   [2; 4[   [4; 6[   [6; 8[   ≥ 8
Coded IP_i      1       2          3          4          5         6        7        8        9      10
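A small Python sketch of this coding, reflecting IP_i = OH_i + SR_i − BO_i and the intervals of Table 2 (the function names are illustrative, not the authors'):

```python
def inventory_position(on_hand, scheduled_receipts, backorders):
    """IP_i = OH_i + SR_i - BO_i for a single stage."""
    return on_hand + scheduled_receipts - backorders

def code_ip(ip):
    """Map an actual (unbounded) inventory position onto the 10 coded values
    of Table 2: (..., -8) -> 1, [-8, -6) -> 2, ..., [6, 8) -> 9, [8, ...) -> 10."""
    if ip < -8:
        return 1
    if ip >= 8:
        return 10
    return 2 + (ip + 8) // 2      # two-unit-wide buckets between -8 and 8

def system_state(stages):
    """Coded global state (IP_1, IP_2, IP_3) from per-stage (OH, SR, BO) data."""
    return tuple(code_ip(inventory_position(*s)) for s in stages)

# Example with hypothetical per-stage data (distribution, production, supply).
print(system_state([(4, 2, 0), (1, 0, 3), (6, 5, 0)]))   # -> (9, 5, 10)
```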

At every decision epoch, all decision agents must select the replenishment order quantity for their own stage, i.e. each must take an action that ranges from ordering nothing up to a maximum equal to the stock point capacity plus the current backorder plus the estimated consumption during the transportation lead time minus the stock on hand. We have assumed a value of 30 for such a maximum, which simulation has shown to be high enough never to be reached. The needed capacity can be determined by measuring the maximum stock level that is reached at each stage when simulating the system under the learnt inventory policy.

The action space size and the state space size, respectively equal to 29,791 and 1000 (each of the three agents chooses an order quantity between 0 and 30, i.e. 31^3 = 29,791 joint actions, and each of the three coded inventory positions takes one of 10 values, i.e. 10^3 = 1000 states), yield 29,791,000 action values. Their estimated values define the near-optimal inventory policy.

Even though decision epochs occur at predetermined points in time, the considered decision process is an SMDP. In fact, the system state may change, and costs (rewards) may be incurred (accrued), between two subsequent decision epochs. The SMDP has been solved by the SMART algorithm. The learning phase has been simulated with the commercial simulation package ARENA [21]. The learning process, which has required a length of 1,500,000 time units, has taken about 2 h on a PC Pentium II 450.

5. Results

A near-optimal supply chain inventory policy, which will be referred to as the SMART policy, has been determined through the proposed approach. This policy, which can be thought of as an (s, S) policy where both s and S vary with the system state, is relatively simple to implement, as it requires the knowledge of just the optimal action to be taken in each of the system states.

The effectiveness of the SMART policy has been evaluated against a periodic order policy that adopts an integrated perspective, as it is based on the minimization of the total SC costs. Such a policy is defined by two vectors that specify the stock review time intervals (T_1, T_2, T_3) and the target levels (S_1, S_2, S_3) at each stage. At the stock review time interval T_i, an order is placed to raise the inventory position up to the target level S_i.

In particular, the vector (T_1, T_2, T_3) has been determined by solving the non-linear programming problem that minimizes the average SC cost, subject to the following relaxed constraints [22]:

T_i ≥ T_{i−1} ≥ 0 for i = 1, 2, 3.

The echelon inventory concept is utilized for computing holding costs, so that the average SC cost is given by

C = \sum_{i=1}^{3} \left( \frac{Co_i}{T_i} + \frac{1}{2} d H_i T_i \right),

the echelon unit holding cost H_i at the ith stage being the incremental holding cost of the ith stage with respect to the upstream stage (H_i = h_i − h_{i+1}).

The vector (6, 8, 8) has been obtained as the solution for (T_1, T_2, T_3).

The target stock S_i has been defined equal to the demand during the reorder interval time T_i (T_i d) plus the stock necessary to cover the demand during the transportation lead time (LT_i):

S_i = (T_i + LT_i) d.

A safety stock (SS) has been added at the last stage to cope with customer demand uncertainty. Therefore, the order quantity OQ_i at every stage is given by

OQ_1 = (S_1 + SS) − IP_1;  OQ_2 = S_2 − IP_2;  OQ_3 = S_3 − IP_3.
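As an illustration of how the benchmark policy parameters can be derived from the data of Table 1, the following Python sketch searches integer review intervals that minimize the average SC cost above and then computes the target levels. It is a reconstruction under stated assumptions (integer T_i, echelon costs H_i = h_i − h_{i+1} with H_3 = h_3 for the most upstream stage, demand rate d = 1, and mean lead time LT_i = 2 taken from the uniform 1–3 transportation time), not the authors' code.

```python
from itertools import product

Co = (80, 80, 80)        # ordering costs (Table 1)
h  = (10, 5, 3)          # unit holding costs per time unit (Table 1)
d  = 1.0                 # demand rate (one unit per time unit on average)
LT = 2.0                 # assumed mean transportation lead time (uniform 1-3)

# Echelon holding costs: H_i = h_i - h_{i+1}; H_3 = h_3 assumed for the last stage.
H = (h[0] - h[1], h[1] - h[2], h[2])

def average_sc_cost(T):
    """C = sum_i (Co_i / T_i + 0.5 * d * H_i * T_i)."""
    return sum(Co[i] / T[i] + 0.5 * d * H[i] * T[i] for i in range(3))

# Search integer review intervals satisfying the nested constraint T1 <= T2 <= T3.
candidates = [T for T in product(range(1, 21), repeat=3) if T[0] <= T[1] <= T[2]]
T_best = min(candidates, key=average_sc_cost)
S = tuple((T_best[i] + LT) * d for i in range(3))            # S_i = (T_i + LT_i) d

print(T_best, round(average_sc_cost(T_best), 2), S)
```

With these data the search returns (6, 8, 8), matching the review intervals reported above; adding the safety stock at the distribution stage then yields the order quantities OQ_1 = (S_1 + SS) − IP_1, OQ_2 = S_2 − IP_2, OQ_3 = S_3 − IP_3.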

Table 3
SMART policy vs. the benchmark policy

Gain                 SMART   Benchmark   Δ (gain)
k = 1 (CV = 100%)      856       826       3.63%
k = 5 (CV = 45%)       881       852       3.40%
k = 10 (CV = 32%)      884       855       3.39%

The performance achieved by the two policies has been measured through simulation runs with a time length of 100,000 time units. In particular, three demand patterns have been considered, all characterized by Erlang distributions with the same demand rate but different variances. The three patterns are indeed characterized by three diverse values of the k parameter of the Erlang distribution. As is known, the relationships between the k parameter, the mean μ, the variance σ², and the coefficient of variation CV of demand are as follows:

μ = kθ,  σ² = kθ²,  CV = σ/μ = 1/√k,

θ being the scale parameter of the distribution (for instance, k = 5 gives CV = 1/√5 ≈ 45%).
The results are depicted in Table 3.

The benchmark policy has been adapted to the demand variance by adjusting the safety stock SS. On the contrary, the inventory policy learned for the k = 1 case has been used for the other demand patterns (k = 5 and 10). This has allowed the robustness of the proposed approach to be verified.

It can be observed that the performance increases with k for both the SMART and the benchmark policies, which was expected, given that when k increases demand uncertainty diminishes. Less obvious is that the SMART policy performs better than the benchmark does even for k = 5 and 10, namely when demand is different from that experienced during the learning (k = 1). In fact, while the learned policy can surely deal with the same demand pattern used during the learning phase, it could show a performance decline for new patterns. Based on the results, we can then conclude that not only is the SMART policy more efficient, but it is also robust as long as demand undergoes slight changes.

The higher efficiency of the SMART policy is mainly due to the fact that the decision rule on whose basis actions (i.e. replenishment orders) are taken is more sophisticated. In fact, there is neither a unique reorder point nor a unique reorder quantity. Rather, replenishment orders are placed as complex functions of the inventory position: differently from the benchmark policy, reorder points as well as ordered quantities vary with the global inventory position.

Furthermore, the SMART policy considers the stochastic nature of the environment, namely demand and lead time variability, whereas the benchmark policy is determined based on a deterministic demand equal to the average and copes with uncertainty through a safety stock.

6. Conclusions

In this paper the SCM problem has been addressed with particular emphasis on inventory management. Supply chain management is widely recognized as a vital source of competitive advantage, yet SCM techniques, especially in the inventory area, are very difficult to put into practice, given the high need for information communication and processing involved. To this end many efforts have lately been devoted to the design of appropriate networked inventory management information systems (NIMISs).

Despite the efforts focused on the implementation of NIMISs, relatively less attention has been given to defining an appropriate logic for managing inventory, thus missing the opportunity of exploiting the potential of such information systems. In particular, integrated approaches to manage inventory decisions at all stages of the supply chain need to be developed.

In this paper an approach has been proposed which addresses this problem. It is based on three techniques, namely Markov decision processes, reinforcement learning, and simulation. MDPs make it possible to model sequential decision-making problems under uncertainty. RL and simulation allow MDPs to be solved in a wider range of cases than conventional methods (e.g. dynamic and linear programming) do.

The approach has been tested on a supply chain model consisting of the supply, manufacturing, and distribution stages. The integrated inventory policy determined through the proposed approach (SMART policy) outperforms a centralized
periodic order policy, which has been used as a benchmark. Also, the SMART policy proves quite robust with respect to slight changes in demand.

It is expected that the superiority of the SMART policy would be greater for more complex cases. In fact, centralized but simpler policies (such as the POQ-based policy utilized as a benchmark) cannot adapt to complex environments as the SMART policy does. This depends on (i) the ability of simulation modeling to capture detailed features of the system as well as (ii) the capability of MDPs to describe time dependencies between decisions.

Further research should address the issue of having the supply chain actors actually implement the optimal policy determined through the proposed approach. This is quite a difficult task, given that the supply chain actors are likely to belong to diverse firms. Therefore, having them actually share a unique reward function needs a way (e.g. appropriate incentive mechanisms) to fairly split the higher rewards that the optimal policy would guarantee.

References

[1] M. Christopher, Logistics and Supply Chain Management, Pitman Publishing, London, 1992.
[2] M. Verwijmeren, P. van der Vlist, K. van Donselaar, Networked inventory management information systems: Materializing supply chain management, International Journal of Physical Distribution and Logistics Management 26 (6) (1996) 16–31.
[3] J.W. Forrester, Industrial Dynamics, MIT Press, Cambridge, MA, 1961.
[4] M.P. Baganha, M. Cohen, The stabilizing effect of inventory in supply chains, Operations Research 46 (3) (1998) S72–S73.
[5] D. Towill, Industrial dynamics modeling of supply chains, Logistics Information Management 9 (1996) 43–56.
[6] H.L. Lee, V. Padmanabhan, S. Whang, The bullwhip effect in supply chains, Sloan Management Review 38 (3) (1997) 93–102.
[7] P. Kelle, A. Milne, The effect of (s, S) ordering policy on the supply chain, International Journal of Production Economics 59 (1999) 113–122.
[8] J. Wikner, D.R. Towill, M. Naim, Smoothing supply chain dynamics, International Journal of Production Economics 22 (1991) 231–248.
[9] T.C. Jones, D.W. Riley, Using inventory for competitive advantage through supply chain management, International Journal of Physical Distribution and Materials Management 17 (2) (1987) 94–104.
[10] S. Hoekstra, J. Romme, Integral Logistics Structures: Developing Customer-Oriented Goods Flows, McGraw-Hill, London, 1992.
[11] G.H. Stalk, T.M. Hout, Competing against Time: How Time-Based Competition Is Reshaping Global Competition, Free Press, New York, 1990.
[12] J.D. Blackburn, Time-based Competition: The Next Battleground in American Manufacturing, Irwin, Homewood, IL, 1991.
[13] S. Erengüç, A.J. Vakharia, Integrated production/distribution planning in supply chains, European Journal of Operational Research 115 (1999) 219–236.
[14] C. Forza, Achieving superior operating performance from integrated pipeline management: An empirical study, International Journal of Physical Distribution and Logistics Management 26 (9) (1996) 36–63.
[15] D.J. Thomas, P.M. Griffin, Coordinated supply chain management, European Journal of Operational Research 94 (1) (1996) 1–15.
[16] M. Puterman, Markov Decision Processes: Discrete Stochastic Programming, Wiley Interscience, New York, 1994.
[17] E. Porteus, Stochastic inventory theory, in: D.P. Heyman, M.J. Sobel (Eds.), Handbooks in Operations Research, North-Holland, Amsterdam, 1990.
[18] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[19] D. Bertsekas, J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[20] T.K. Das, A. Gosavi, S. Mahadevan, N. Marchalleck, Solving semi-Markov decision problems using average reward reinforcement learning, Management Science 45 (4) (1999) 560–574.
[21] W.D. Kelton, R.P. Sadowski, D.A. Sadowski, Simulation with Arena, McGraw-Hill, New York, 1998.
[22] J.A. Muckstadt, R.O. Roundy, Analysis of multistage production systems, in: S.C. Graves, A.H.G. Rinnooy Kan, P.H. Zipkin (Eds.), Handbooks in Operations Research and Management Science, Vol. 4, North-Holland, Amsterdam, 1993.
