1 Department of Industrial and Management Engineering, Pohang University of
Science and Technology, 77 Cheongam-Ro, Nam-Gu, Pohang, Gyeongbuk 37673,
Rep. of Korea, Email: hjhjpark94@postech.ac.kr
2 Department of Industrial and Management Engineering, Pohang University of
Science and Technology, 77 Cheongam-Ro, Nam-Gu, Pohang, Gyeongbuk 37673,
Rep. of Korea, Email: dgchoi@postech.ac.kr
3 School of Business, Ewha Womans University, 52, Ewhayeodae-Gil, Seodaemun-Gu,
Seoul, 03760, Rep. of Korea, Email: dmin@ewha.ac.kr , Office: +82-2-3277-3923
* Corresponding author, Email: dmin@ewha.ac.kr , Office: +82-2-3277-3923
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4272441
Adaptive Inventory Replenishment using Structured Reinforcement Learning by Exploiting an Optimal Policy Structure
Abstract
We study an inventory replenishment problem under unknown and switching demand. We design a structured reinforcement learning algorithm that efficiently adapts the replenishment policy to changing demand without any prior knowledge. Our proposed method integrates the known structural properties
of an optimal inventory replenishment policy with reinforcement learning. By
exploiting the optimal policy structure, we tune reinforcement learning to characterize the inventory replenishment policy and approximate the value function.
In particular, we propose two methods for stochastic approximation on the gra-
dient of the objective function. These novel reinforcement learning algorithms
ensure an efficient convergence rate and lower algorithmic complexity for solving
practical problems. The numerical results demonstrate that the proposed algorithms adaptively update the policy to changing demand and raise operational
efficiency compared to a static replenishment policy. We also conduct a case
study for a retail shop in South Korea to validate the practical feasibility of
the proposed method. Understanding the optimal policy structure is beneficial
for designing reinforcement learning algorithms that can address the inventory replenishment problem.
Keywords: Structural Properties, Stochastic Approximation
1. Introduction
Inventory management is a fundamental problem in operations research and related fields. The inventory replenishment problem aims to control ordering decisions so that operational costs are minimized over time. In addition,
because the ordering decision in the current period affects inventory levels in
subsequent periods, the problem is regarded as a sequential decision-making
problem. These problems are formulated as a Markov decision process and the
solution is traditionally derived in the form of a policy (Puterman, 2014). The
optimal policies and related structures for some classes of inventory replenish-
ment problems have been analytically characterized in previous studies. For
instance, the (s, S) replenishment policy, whose parameters represent the re-
order and order-up-to levels, is an optimal policy under inventory management
incurring setup costs (Scarf, 1960). This replenishment policy orders up to the
order-up-to level when the inventory level drops to the reorder level.
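To make the rule concrete, here is a minimal sketch of the (s, S) decision rule (illustrative code, not from the paper; the trigger condition x <= s is our reading of "drops to the reorder level"):

```python
def order_quantity(x: float, s: float, S: float) -> float:
    """(s, S) rule: order up to S when the inventory level x drops to the reorder level s."""
    return S - x if x <= s else 0.0
```

For example, with reorder level 4 and order-up-to level 10, an inventory level of 3 triggers an order of 7 units, while a level of 5 triggers nothing.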
In this study, we examine a multi-item inventory replenishment problem.
The problem covers single-item and multi-item inventory systems. Further, the inventory systems periodically review the inventory level
and have setup costs (joint setup costs for multi-item cases). With these problem
definitions, the optimal policy of the single-item case is known as the (s, S)
replenishment policy. Although no general optimal policy structure is known for the multi-item case, the (s, c, S) joint replenishment policy is a practical alternative: it orders all items whose inventory levels are less than their can-order levels up to their order-up-to levels whenever the inventory level of at least one item drops to its reorder level.
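The can-order logic just described can be sketched as follows (illustrative; function and argument names are ours):

```python
def joint_order(x, s, c, S):
    """(s, c, S) rule: if any item n has x^n <= s^n, order every item with x^m <= c^m
    up to its order-up-to level S^m; otherwise order nothing."""
    if not any(xn <= sn for xn, sn in zip(x, s)):
        return [0.0] * len(x)
    return [Sn - xn if xn <= cn else 0.0 for xn, cn, Sn in zip(x, c, S)]
```

Here the second item piggybacks on the first item's order only when it sits below its own can-order level.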
Previous studies analytically derive the optimal replenishment policy
under the assumption that the demand distribution is known a priori and is
stationary, which means that a decision-maker should update the optimal pol-
icy whenever observing a change in demand. However, the demand observed in
practical inventory systems is commonly ill-specified and has a non-stationary
nature such as regime-switching movements (Chen, 2021). These properties
incur delays in identifying demand changes and updating the inventory policies. Furthermore, conventional methods such as dynamic programming are
less responsive to changes in demand because of their immense computational
requirements at every regime-switching point (Keskin et al., 2022). To overcome
these limitations, data-driven approaches have been employed to deal with the
unknown and non-stationary nature of several uncertainties (Chen, 2021; Huh
and Rusmevichientong, 2009; Keskin et al., 2022; Shi et al., 2016).
Reinforcement learning (RL), a sub-branch of machine learning suitable for
optimizing sequential decision-making, has been leveraged to address practical
inventory control problems (Giannoccaro and Pontrandolfo, 2002; Gijsbrechts
et al., 2022; Jiang and Sheng, 2009; Oroojlooyjadid et al., 2021). Recent RL
methods tend to approximate the policy or value functions using parametric
functions such as neural networks. As they can automatically characterize a policy from data, RL methods require no prior knowledge of the demand distribution; instead, they learn and optimize the policy by observing demand on-the-fly. Despite these good properties, however, the high computational cost and
slow convergence rate are drawbacks of RL methods (Kunnumkal and Topaloglu,
2008). Therefore, they fail to adapt a policy to the changing environment and
become inefficient in a large inventory system.
In this paper, we propose a novel RL method that provides good learning
behavior by exploiting the structural properties of an optimal policy. For example, the proposed method directly learns the parameters of a structured replenishment policy such as the (s, S) policy for a single-item problem rather
than learning action-value functions. Thanks to these benefits, it rapidly char-
acterizes observed demand and adapts the replenishment policy to switching
demand. We show that the proposed method provides near-optimal policies
for single- and multi-item problems by exploiting the policy structures. Ad-
ditionally, a case study analysis shows that the proposed method has better
operational efficiency in the retail industry.
The main contributions of this study can be summarized as follows:
- We develop a structured RL algorithm that optimizes inventory replenishment policies without prior knowledge of the demand distribution.
- The well-designed RL algorithm adaptively updates the policy in response to the switching demand distribution, allowing us to achieve operational efficiency.
The remainder of this paper is organized as follows. Section 2 reviews related works and Section 3 describes our inventory management
problem. Section 4 discusses the methodology of the proposed structured RL
algorithm for inventory management. Section 5 presents an experimental study
2. Relevant Literature
Much of the literature has studied the existence and structural properties of
optimal policies to address inventory replenishment problems. As an example, Scarf (1960) proved that the optimal policy in a system with a setup cost is the (s, S) replenishment policy
and that the corresponding value function has a K-convexity structure. For multi-
item inventory systems that require various items to be ordered jointly, Balintfy
(1964) introduced an (s, c, S) joint replenishment policy that is a reasonable
policy structure. Ignall (1969) showed that the (s, c, S) joint replenishment
policy is optimal for a simple two-item inventory system. Girlich and Barche
(1991) generalized the optimal policy for a multi-item case and analyzed the
optimality of the (σ, S) replenishment policy under a certain condition of the
Wiener demand process. Under this replenishment policy, if the vector of the
inventory level x whose elements represent the inventory level of each item
belongs to the reorder set σ, the inventory is replenished up to the corresponding
element of the order-up-to level vector S. However, the joint order replenishment
problem is well known to be NP-hard (Cohen-Hillel and Yedidsion, 2018), and thus,
optimizing its policy is quite difficult.
Characterizing the demand distribution is hard in practice, and various fac-
tors such as technological advances (e.g., expansion of electric vehicles) and
economic turmoil (e.g., the COVID-19 pandemic) change the demand distribu-
tion irregularly; thus, inventory management studies have attempted to extend
the problem using unknown and switching demand distributions. Earlier stud-
ies assumed that complete information on when and how the demand distri-
bution changes is known. For example, Song and Zipkin (1993) showed that a world-dependent policy, whose parameters change with the demand regime, is optimal. Without such complete information, data-driven methods have more recently been adopted to solve inventory management problems. Huh and Rus-
mevichientong (2009) proposed an adaptive gradient-based algorithm to opti-
mize the order-up-to level of the base-stock policy for an inventory system with
lost sales. Their approach traced a virtual constraint-free order-up-to level and
attained the minimum expected cost without making an assumption about the
demand distribution. Shi et al. (2016) proposed an algorithm to optimize the
base-stock policy for a multi-item inventory system with capacity constraints.
Their algorithm defined an additional virtual order-up-to level that represents
the target inventory level when the capacity constraint is relaxed. Chen and
Chao (2020) proposed an online learning algorithm to optimize the base-stock
policy for a multi-item inventory system with stockout substitution. The algo-
rithm estimated the demand distributions and substitution probability by con-
ducting a novel exploration phase. Further, data-driven methods must consider
an unknown demand distribution as well as regime-switching demand patterns
when knowledge on switching points is lacking. Chen (2021) proposed a non-
parametric learning algorithm for inventory management with lost sales under
regime-switching demand. The algorithm placed excessive orders to uncover the true realized demand from censored observations and estimated the demand distribution over a recent batch of periods to detect unknown change points. Keskin et al. (2022) developed an online learning algorithm for joint
inventory and pricing problems in a regime-switching environment.
As with data-driven inventory management, RL is a promising method for
addressing inventory management problems in which the demand distribution
is unknown and switching over time. Over the past decade, various RL meth-
ods have been applied to tackle inventory management problems. Giannoccaro and Pontrandolfo (2002) applied RL to manage inventory decisions across the supply chain. To coordinate decisions, they defined the joint state/action
space of the agents and employed a Q-learning algorithm to optimize a coordi-
nated ordering policy. Likewise, Oroojlooyjadid et al. (2021) proposed a deep RL method for a multi-echelon supply chain problem. Another study formulated inventory management as bandit-based optimization and proposed two efficient algorithms to
solve the problem.
To the best of our knowledge, few studies investigate how to use the RL
method to optimize an inventory replenishment policy, considering the struc-
tural properties of an optimal policy and incomplete information on changing
demand. Jiang and Sheng (2009) employed a case-based RL to optimize the
order-up-to level of the (s, S) policy for a multi-echelon inventory system with
switching demand. However, that study does not fully optimize the parameters of the replenishment policy as we do in this study. As they only consider
a part of the replenishment policy such as either the order-up-to level or the
reorder level, their findings on optimal policies are limited. We also contribute
to the literature by investigating a multi-item inventory system.
Lastly, we leverage the known structural properties of an optimal replenish-
ment policy (e.g., K-convexity and (s, S) replenishment policy), which enables
the proposed method to take advantage of the fast convergence rate and efficient
algorithmic complexity. In the literature, several studies integrate the policy
structure within the RL framework for signal processing (Sharma et al., 2020),
power transmission scheduling (Fu and van der Schaar, 2012), and Markov decision problems more broadly. Our work differs from these existing studies. We first propose a model that fully integrates the struc-
tural properties into the RL framework to solve an inventory replenishment
problem. As a result, the proposed method derives a near-optimal replenish-
ment policy without any prior knowledge of the demand distribution and rapidly
adapts the policy in response to the switching environment.
3. Problem Description
We consider an inventory system with N items in which a decision-maker determines how many of the items are to be replenished and when. That is, the problem has the
form of a single-item inventory system if N = 1. We use the following notations
in the analysis:
c_O^n: unit ordering cost for an item n = 1, ..., N
c_H^n: unit holding cost for an item n = 1, ..., N
c_B^n: unit backlogging cost for an item n = 1, ..., N
K: constant setup cost
x_t^n: inventory level of an item n = 1, ..., N in period t
a_t^n: order amount for an item n = 1, ..., N at the beginning of period t
d_t^n: demand for an item n = 1, ..., N in period t
δ: joint order discountable ratio (0 ≤ δ < 1)
The time period is distinguished by index t. Referring to the literature (Chen
and Chao, 2020; Shi et al., 2016), this problem considers a zero lead time, zero
purchasing revenue for all items, and a periodic review inventory system over
an infinite time horizon; i.e., inventory of each item is reviewed at fixed and
constant time intervals. In addition, we assume a full backlogged system in
which excessive demand is carried over to the following period and expressed
as a negative inventory level. A decision-maker observes the current inventory
level xnt at the beginning of period t and decides whether to make an order
for replenishment. Thereafter, stochastic demand dnt arises during the period.
For simplicity, we use x′ to denote the next-period inventory level. Vectors x = (x^1, ..., x^N), a = (a^1, ..., a^N), and d = (d^1, ..., d^N) collect the per-item quantities, and the system dynamics follow x_{t+1}^n = x_t^n + a_t^n − d_t^n for all n = 1, ..., N. The problem has a
well-defined cost structure that is convex in terms of the holding and backlogging
amounts for each period, which is represented as L^n(x) = c_B^n [−x]^+ + c_H^n [x]^+, where [x]^+ = max{0, x}. Additionally, the joint order discountable setup cost
with the joint order discountable setup cost are denoted as in equation (1):

r_t(x, a, x′) = K_h 1{h > 0} + Σ_{n∈[N]} {c_O^n a^n + L^n(x′^n)}    (1)

where h denotes the number of items ordered jointly and K_h denotes the setup cost after the joint order discount δ is applied.
The discountable part of the setup cost is proportionally saved as the number
of joint orders increases. All the unit costs are non-negative constants, and the condition c_B^n > c_O^n for all n = 1, ..., N should hold so that the do-nothing policy is not optimal.
Although the cost is regarded as negative feedback, we transform the per-period
costs into a “reward” without loss of generality.
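The per-period cost structure can be sketched as follows (illustrative code with names of our choosing; for simplicity the joint-order discount on the setup cost is omitted here, so K is charged in full whenever any item is ordered):

```python
def period_cost(x_next, a, c_o, c_h, c_b, K):
    """Per-period cost: ordering cost, the convex holding/backlogging cost
    L^n(x) = c_B^n [-x]^+ + c_H^n [x]^+, and a setup cost K charged when any
    item is ordered (the joint-order discount is omitted in this sketch)."""
    setup = K if any(q > 0 for q in a) else 0.0
    item_costs = sum(co * q + cb * max(-xn, 0.0) + ch * max(xn, 0.0)
                     for xn, q, co, ch, cb in zip(x_next, a, c_o, c_h, c_b))
    return setup + item_costs
```

A negative next-period inventory level represents backlogged demand and is charged at the backlogging rate.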
The solution to this problem is expressed with a policy and the optimal
policy is attained under the long-term average reward criterion by minimizing the objective function ρ_π = lim_{T→∞} (1/T) E_π[Σ_{t=1}^T r_t]. The optimal policy of a
single-item problem (N = 1) is the well-known (s, S) replenishment policy.
Moreover, if there are two items (N = 2), the optimal policy becomes the
(s, c, S) joint replenishment policy (Ignall, 1969). Although no optimal policy structure is known for a problem that has more than two items (i.e., N > 2), the (s, c, S) joint
replenishment policy is practically implementable and performs reasonably well.
Moreover, we exploit the (s, c, S) joint replenishment policy structure because
the structure outperforms other joint replenishment policies under the periodic
review inventory system with irregular demand (Johansen and Melchiors, 2003).
To reflect reality, we consider that the demand distribution changes over
time, which means that the parameters of the demand distribution sequen-
tially change. For example, assuming a two-parameterized demand distribution,
switching demand is represented by the sequence of parameters αn (ltn ), β n (ltn ) ,
where ltn = max{k ∈ Z : Tkn ≤ t} and Tkn is the period in which k − th demand
rin
switches to αn (ltn ), β n (ltn ) for item n. Under these circumstances, Song and
Zipkin (1993) proposed the world-dependent policy, whose parameters change
for each demand phase, and proved its optimality for a single-item inventory
system, as in Theorem 1.
Although the world-dependent policy does not guarantee optimality for a multi-
item inventory system, it is practically reasonable to consider an (s(ω), c(ω), S(ω))
joint replenishment policy for further analysis. As we have no prior information
on when and how these parameters change, however, the analytical methods
are not applicable for attaining an optimal policy. Therefore, we consider a
data-driven learning algorithm to tackle the problems.
4. Structured Reinforcement Learning Algorithm
This section presents the proposed structured RL method for solving the inventory replenishment problem. We introduce a mathematical basis of the pro-
posed method and then present its two types: structured RL with full stochastic
approximation (SRL-FSA) and partial stochastic approximation (SRL-PSA).
Under a deterministic (s, S) replenishment policy with a certain fixed order-up-to level S, transitioning to states that have higher in-
ventory levels than the order-up-to level is impossible; furthermore, the fixed
reorder level s limits the lower bound of the transition state. Therefore, replenishment policies with distinct parameters induce different communicating classes.
In contrast, the stochastic replenishment policy introduced below randomizes the order-up-to levels; thus, the transitioned state (i.e., inventory level) under the policy can be
comprehensive for all states, which means that any stationary policy within the
stochastic policy structure induces a single communicating class. Specifically, the
decision rules of the proposed stochastic (s, S) replenishment and (s, c, S) joint replenishment policies are represented by equations (2) and (3), respectively:

a = S + ϵ − x if u > f(x, s), and a = 0 otherwise    (2)

where u ∼ Uniform[0, 1] and the noise variable ϵ ∼ N(0, σ²). Further, f(x, y) = 1/(1 + e^{−(x−y)/τ}) is the y-shifted sigmoid function for the mixing probability of the stochastic policy.

a^n = S^n + ϵ^n − x^n if u_1^n > f(x^n, s^n), or if some item triggers a replenishment and u_2^n > f(x^n, c^n); a^n = 0 otherwise    (3)

where u_1^n, u_2^n ∼ Uniform[0, 1] for all n = 1, ..., N and the noise variable ϵ ∼ N(0, σ² I_N). Hereafter, the perturbed order-up-to level S^n + ϵ^n is denoted as S̃^n.
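The stochastic decision rule of equation (2) can be sampled as follows (a sketch; function names and hyperparameter defaults are ours):

```python
import math
import random

def f_shifted(x: float, y: float, tau: float = 1.0) -> float:
    """y-shifted sigmoid f(x, y) = 1 / (1 + exp(-(x - y) / tau)) of equation (2)."""
    return 1.0 / (1.0 + math.exp(-(x - y) / tau))

def stochastic_action(x, s, S, tau=1.0, sigma=0.5, rng=random):
    """Order up to the perturbed level S + eps when u > f(x, s); do nothing otherwise."""
    u = rng.random()
    if u > f_shifted(x, s, tau):
        return S + rng.gauss(0.0, sigma) - x  # eps ~ N(0, sigma^2)
    return 0.0
```

Shrinking τ sharpens the sigmoid toward a step function, so the stochastic rule approaches the deterministic (s, S) rule, matching the convergence remark below.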
This stochastic replenishment policy acts as an exploration process while
following the forms of original replenishment policies. From Propositions 1 and
2, the stochastic replenishment policies converge to the original deterministic re-
plenishment policies by adjusting the precision hyperparameters; the proofs are straightforward and provided in Appendix A.
In addition, we use the relative value function approximation technique to
approximate the average reward value. The relative value of a certain repre-
sentative state should be zero, and state ‘0’ (i.e., x = 0) is regarded as the
distinguished state. This characteristic allows us to ignore the intercept coefficient of the polynomial function, and the approximated value function passes
through the origin. The relative value function approximation for a single-item
problem is given as V(x; s, S, w) = w^T Φ(x), where w ∈ R^4 and the polynomial basis is Φ(x) = [φ_i(x)]^T_{i=1,...,4} with φ_i(x) = x^i. Equations (4) and (5) show the online
updates of the relative value function for a single-item problem:
w_{t+1} = w_t + γ_1(t){r(x, a, x′) − ρ_t + V_t(x′; s, S, w_t) − V_t(x; s, S, w_t)}Φ(x)    (4)

ρ_{t+1} = ρ_t + γ_2(t){r(x, a, x′) + V_t(x′; s, S, w_t) − V_t(x; s, S, w_t) − ρ_t}    (5)
The approximated relative value function evaluates a policy under an average
reward criterion (Singh, 1994). To achieve the stable convergence of the rela-
tive value function update (equation (4)), the step size satisfies the conditions Σ_t γ_1(t) = ∞ and Σ_t γ_1(t)² < ∞. The relative value function approx-
imation can be extended to a multi-item problem. We use the same update
procedure as in equations (4) and (5). However, the approximator has a dif-
ferent structure to accommodate a multidimensional polynomial function, constructed as a product of the per-item polynomial bases over n = 1, ..., N.
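The online relative value update of equation (4), with the degree-4 polynomial basis passing through the origin, can be sketched as follows (single-item illustration; names are ours):

```python
def phi(x: float):
    """Polynomial basis Phi(x) = (x, x^2, x^3, x^4); V(x) = w . Phi(x), so V(0) = 0."""
    return [x ** i for i in range(1, 5)]

def td_update(w, rho, x, x_next, r, gamma1, gamma2):
    """One step of the relative value update (equation (4)) together with the
    running average-reward estimate rho."""
    v = sum(wi * p for wi, p in zip(w, phi(x)))
    v_next = sum(wi * p for wi, p in zip(w, phi(x_next)))
    delta = r - rho + v_next - v
    w_new = [wi + gamma1 * delta * p for wi, p in zip(w, phi(x))]
    rho_new = rho + gamma2 * (r + v_next - v - rho)
    return w_new, rho_new
```

Because the basis contains no constant term, the distinguished state x = 0 keeps a relative value of exactly zero, as the text requires.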
For inventory systems with continuous demand, the exponential family is practically convenient and
a generic way to fit the demand distribution. Within the exponential family,
we adopt the Gamma distribution, which captures similar shapes of a general
demand distribution (e.g., right-skewed) and has desirable properties such as differentiability and integrability. Although we develop the proposed algorithm un-
der this distributional assumption for practical convenience, the algorithm does
not necessarily require the true distribution and learns the optimal policy for
any demand distribution.
For a given optimal (s, S) replenishment policy, the transition probability
is separately defined in terms of whether the current inventory level exceeds
the reorder level s. P_0(x′|x) denotes the state transition probability for the case in which the current inventory level exceeds the reorder level s, and P_1(x′|S) denotes the other case. Given the Gamma distributional assumption, i.e., h(d; α, β) = (β^α/Γ(α)) d^{α−1} e^{−βd}, the state transition probabilities follow from a change of variables of the probability distribution: P_0(x′|x, α, β) = h(x − x′; α, β) = (β^α/Γ(α)) (x − x′)^{α−1} e^{β(x′−x)} and P_1(x′|S̃, α, β) = h(S̃ − x′; α, β) = (β^α/Γ(α)) (S̃ − x′)^{α−1} e^{β(x′−S̃)}.
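These Gamma transition densities can be evaluated directly; the following sketch uses only the standard library (names are ours):

```python
import math

def gamma_pdf(d: float, alpha: float, beta: float) -> float:
    """h(d; alpha, beta) = beta^alpha / Gamma(alpha) * d^(alpha-1) * exp(-beta * d), d > 0."""
    if d <= 0.0:
        return 0.0
    return beta ** alpha / math.gamma(alpha) * d ** (alpha - 1.0) * math.exp(-beta * d)

def p0(x_next, x, alpha, beta):
    """No replenishment: x' = x - d, hence P0(x'|x) = h(x - x'; alpha, beta)."""
    return gamma_pdf(x - x_next, alpha, beta)

def p1(x_next, S_tilde, alpha, beta):
    """Order up to the perturbed level: x' = S~ - d, hence P1(x'|S~) = h(S~ - x'; alpha, beta)."""
    return gamma_pdf(S_tilde - x_next, alpha, beta)
```

The change of variables only shifts the argument of the demand density, which is why both cases reuse the same pdf.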
We denote the transition probability from the current state x to the next
state x′ corresponding to the policy parameter (s, S) by P_{x,x′}(s, S). The transition probability for the stochastic (s, S) replenishment policy is P_{x,x′}(s, S) =
f (x, s)P0 (x′ |x, α, β) + (1 − f (x, s))P1 (x′ |S̃, α, β). Here, the sigmoid function
f(x, s) is regarded as a mixing probability, and the noise ϵ of S̃ is realized before the policy parameters are updated. Therefore, the partial derivatives of
the transition probability with respect to the policy parameters (s, S) are well
derived as follows:
∂/∂S P_{x,x′}(s, S) = β(1 − f(x, s)) [P_1(x′|S̃, α − 1, β) − P_1(x′|S̃, α, β)]    (6)

∂/∂s P_{x,x′}(s, S) = (∂/∂s f(x, s)) [P_0(x′|x, α, β) − P_1(x′|S̃, α, β)]    (7)
For the multi-item problem, the transition probability and its gradient are presented in Appendix E. Likewise, the transition probability has a bounded first derivative and is a twice differentiable function of the policy parameters.
We develop P0 (x′ |x) and P1 (x′ |S) under the assumption of continuous de-
mand distribution. However, practical inventory systems may observe discrete
values of demand. Therefore, we extend the transition probabilities to ac-
commodate discrete demand. Suppose we have a count variable dt that rep-
resents demand at period t. By observing the demand for the past υ peri-
ods, the empirical probability that demand is j at period t is obtained by
p̂_{t,j} = (1/υ) Σ_{i=1}^{υ} 1{d_{t−i} = j}. For simplicity, we drop the time index t of
the variables and regard them as lying in the same time period. We then transform
the probability p̂ into a probability density function by piecewise linear approximation. The piecewise linear function defined on intervals d̄_j ∈ [j, j + 1], where j ∈ {0, 1, ..., d_max − 1}, becomes p̄_j(d̄_j) := ∆p̂_j (d̄_j − j) + p̂_j, where ∆p̂_j = p̂_{j+1} − p̂_j. To ensure that the integral of the interpolated probability density function equals one, the piecewise linear function is normalized by applying a normalizing constant, K = (1/2)(p̂_0 + p̂_{d_max} + 2 Σ_{i=1}^{d_max−1} p̂_i).
When replacing the demand distribution by h(d; p̄) in P_0(x′|x) and P_1(x′|S), the derivative ∂/∂S P_1(x′|S̃, p̄) is unfortunately not always available in closed form. To tackle this issue, we numerically obtain the derivative by perturbing the order-up-to level and computing the partial derivative, ∂/∂S P_1(x′|S̃, p̄) ≈ [P_1(x′|S̃ + ε, p̄) − P_1(x′|S̃, p̄)]/ε. Under the piecewise linear density, this derivative becomes ∂/∂S P_1(x′|S̃, p̄) = (1/K) ∆p̂_j for each interval x′ ∈ [S̃ − j − 1, S̃ − j].
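The empirical pmf, the trapezoid normalizing constant K, and the interval derivative (1/K)∆p̂_j can be sketched as follows (illustrative; names are ours):

```python
def empirical_pmf(history, d_max):
    """p_hat[j]: fraction of the last len(history) periods with demand exactly j."""
    n = len(history)
    return [sum(1 for d in history if d == j) / n for j in range(d_max + 1)]

def normalizing_constant(p_hat):
    """K = (1/2)(p_hat[0] + p_hat[d_max] + 2 * sum of interior values): the integral
    of the piecewise linear interpolation (trapezoid rule)."""
    return 0.5 * (p_hat[0] + p_hat[-1] + 2.0 * sum(p_hat[1:-1]))

def density(d, p_hat, K):
    """Normalized piecewise linear density p_bar_j(d) / K on the interval [j, j+1]."""
    j = int(d)
    return (p_hat[j] + (p_hat[j + 1] - p_hat[j]) * (d - j)) / K

def dP1_dS(x_next, S_tilde, p_hat, K):
    """Interval derivative (1/K) * Delta p_hat_j, valid while the implied demand
    S~ - x' stays inside the interval [j, j + 1]."""
    j = int(S_tilde - x_next)
    return (p_hat[j + 1] - p_hat[j]) / K
```

Dividing by K guarantees the interpolated density integrates to one over [0, d_max].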
The partial derivatives of the transition probability with respect to the policy
parameters (s, S) are then given as follows:
∂/∂S P_{x,x′}(s, S) = (1 − f(x, s)) ∂/∂S P_1(x′|S̃, p̄)    (8)

∂/∂s P_{x,x′}(s, S) = (∂/∂s f(x, s)) [P_0(x′|x, p̄) − P_1(x′|S̃, p̄)]    (9)
4.1.4. Gradient and policy updates
The average reward objective corresponding to the policy parameters is given as g(s, S) = Σ_{x∈X} π(x; s, S) r(x). If Proposition 3 is established, the gradient
of the average reward objective function is given in a closed form as in equation
(10); the proof is shown in Marbach and Tsitsiklis (2001):
∇g(s, S) = Σ_{x∈X} π(x; s, S) Σ_{x′∈X} ∇P_{x,x′}(s, S) V(x′; s, S)    (10)

The gradient of the objective function is the expectation with respect to the corresponding policy distribution, ∇g(s, S) = E_π[Σ_{x′∈X} ∇P_{x,x′}(s, S) V(x′; s, S)], which admits the online sampling approximation ∇g(s, S) ≈ Σ_{x′∈X} ∇P_{x,x′}(s, S) V(x′; s, S) at the currently visited state x. We finally
update the policy parameters in the descent direction:

S^n ← S^n − b_1(t) Σ_{x′∈X^N} ∂/∂S^n P_{x,x′}(s, c, S) V(x′; s, c, S)  ∀n = 1, ..., N    (11)

c^n ← c^n − b_2(t) Σ_{x′∈X^N} ∂/∂c^n P_{x,x′}(s, c, S) V(x′; s, c, S)  ∀n = 1, ..., N    (12)

s^n ← s^n − b_3(t) Σ_{x′∈X^N} ∂/∂s^n P_{x,x′}(s, c, S) V(x′; s, c, S)  ∀n = 1, ..., N    (13)
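The resulting descent step, with the projection operator Ω_t keeping s ≤ S, can be sketched as follows (illustrative; grad_S and grad_s stand in for the sampled gradient estimates, and all names are ours):

```python
def policy_update(s, S, grad_S, grad_s, b1, b2):
    """Descent step on (s, S); the projection Omega_t enforces s <= S after the update."""
    S_new = S - b1 * grad_S
    s_new = min(s - b2 * grad_s, S_new)  # Omega_t[.]: clip the reorder level at S
    return s_new, S_new
```

Clipping s at S preserves the (s, S) policy structure even when a noisy gradient step would otherwise push the reorder level above the order-up-to level.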
At every period, the parameters of the underlying demand distribution are adaptively updated; thereafter, the
immediate reward is used to update the relative value function. The policy
parameters are updated in the descent direction of the expected gradient of the
objective. Appendix F provides the details of how to derive the gradient.
In the overall learning procedure, the described SRL-FSA algorithm fully ap-
proximates all the parts of the gradient of the objective function with stochastic
samples. This algorithm estimates the expected gradient using only online sam-
ples (i.e., it does not compute integration of the transition probability using
the polynomial structure of the value function). Lastly, the decaying updates
of the hyperparameters are applied to ensure the convergence from the stochas-
tic policy to the corresponding deterministic policy. Algorithm 1 summarizes
the proposed SRL-FSA algorithm. Here, a projection operator Ωt [·] maintains
the upper bound of s by S to exploit the (s, S) replenishment policy structure,
which has the relationship s ≤ S.
Algorithm 1 SRL-FSA for single-item inventory management
1: while satisfying the stopping criteria do
2: given xt , then take action at with noise ϵt following the stochastic (s, S)
replenishment policy; then, observe the transitioned state xt+1 and corre-
sponding reward rt
For the multi-item problem, the learning procedure is adjusted to exploit the structure of the (s, c, S) joint replenishment policy, and the parameter tuples of all items have to be considered to optimize the policy of each item.
Appendix I provides the details and Algorithm 3 in Appendix H presents the
proposed SRL-FSA for a multi-item problem.
Owing to the polynomial structure of the value function, it is possible to directly take the integral of the value function and compute the gradient of the objective
function. The SRL-FSA algorithm conducts a sample approximation for double
integration: integrations with respect to policy and the gradient of the transition
probability. It has the advantage of being well applied to various problems but
its convergence rate is less efficient. To overcome this drawback, we develop the
SRL-PSA algorithm, which computes the gradient integration of the transition
probability using the polynomial structure of the value function. The SRL-PSA
algorithm provides a better convergence rate than the SRL-FSA algorithm by partially approximating some parts of the gradient.
Like the aforementioned algorithms, the SRL-PSA algorithm observes de-
mand and the next period inventory level by placing replenishment decisions.
The underlying demand distribution is adaptively updated using demand ob-
servations and the reward is used to update the relative value function. The
expected gradient of the objective function is estimated not only using online
samples but also by directly taking the integration thanks to the polynomial
structure of the value function. Appendix G presents the details for deriving
the gradient. Thereafter, the replenishment policy parameters are updated in
the descent direction of the estimated gradient. The decaying update of the hyperparameters is likewise applied to ensure convergence to the corresponding deterministic policy.
Algorithm 2 SRL-PSA for single-item inventory management
1: while satisfying the stopping criteria do
2: given x_t, take action a_t with noise ϵ_t following the stochastic (s, S) replenishment policy; then, observe the transitioned state x_{t+1} and corresponding reward r_t
3: attain the realized demand d_t; then, adaptively estimate the distributional parameters α̂_{t+1} and β̂_{t+1}
4: update the relative value function:
w_{t+1} = w_t + γ_1(t){r_t − ρ_t + V_t(x_{t+1}; s_t, S_t) − V_t(x_t; s_t, S_t)}Φ(x_t)
ρ_{t+1} = ρ_t + γ_2(t){r_t + V_t(x_{t+1}; s_t, S_t) − V_t(x_t; s_t, S_t) − ρ_t}
5: update the policy parameters:
S_{t+1} = S_t − b_1(t) β̂_{t+1}(1 − f(x_t, s_t)) Σ_{i=1}^{4} w_{t+1,i} {E_{Ŷ∼h(d; α̂_{t+1}−1, β̂_{t+1})}[(S̃_t − Ŷ)^i] − E_{Y∼h(d; α̂_{t+1}, β̂_{t+1})}[(S̃_t − Y)^i]}
6: s_{t+1} = Ω_t[s_t − b_2(t) (∂/∂y f(x_t, y)|_{y=s_t}) Σ_{i=1}^{4} w_{t+1,i} E_{Y∼h(d; α̂_{t+1}, β̂_{t+1})}[(x_t − Y)^i − (S̃_t − Y)^i]]
5. Numerical Experiments
In the stationary system, all items follow predefined demand distributions, which remain the same over time. In the regime-switching system,
we consider a non-stationary Gamma demand distribution, which starts from
its distributional parameters (2, 0.5) and changes over time in the following sequence: (4, √2/2) → (1.5, √3/4) → (8, 1) → (1.25, √10/8) → (16, √2). The
demand distribution changes to the next at every equal interval H (i.e., the
switching period Tk = kH). This regime-switching system is designed based on
a previous study (Bayraktar and Ludkovski, 2010), and the proposed scenario
considers demand seasonality.
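The regime-switching demand scenario can be simulated as follows (a sketch; note that Python's random.gammavariate is parameterized by shape and scale, so the rate parameter β of the paper maps to scale 1/β):

```python
import random

# Regime schedule from the experiments: (alpha, beta) pairs with beta a rate parameter.
REGIMES = [(2.0, 0.5), (4.0, 2.0 ** 0.5 / 2.0), (1.5, 3.0 ** 0.5 / 4.0),
           (8.0, 1.0), (1.25, 10.0 ** 0.5 / 8.0), (16.0, 2.0 ** 0.5)]

def demand(t: int, H: int, rng=random) -> float:
    """Sample Gamma demand for period t; the regime switches every H periods
    (T_k = k * H) and is held at the last regime afterwards."""
    alpha, beta = REGIMES[min(t // H, len(REGIMES) - 1)]
    return rng.gammavariate(alpha, 1.0 / beta)
```

Under the first regime, for example, the mean demand is α/β = 2/0.5 = 4 per period.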
For a multi-item problem, we consider the same demand distributions and
all the items follow i.i.d. predefined distributions. To the best of our knowledge,
no method derives an optimal policy for continuous state/decision inventory
systems. Therefore, we use a full enumeration method that searches all the grid
values of the admissible policy parameters to confirm near-optimality. Moreover,
we benchmark using the online actor–critic (AC) method, which has been widely
used for managing inventory systems in the literature (Gijsbrechts et al., 2022).
The benchmark AC method uses a simple neural network for approximating a
replenishment policy (i.e., actor) without considering structural properties.
By following the setting of the previous study (De Moor et al., 2022), we con-
sider the same unit costs for both the single-item and the multi-item problems:
the unit ordering cost c_O^n = 0.3, unit holding cost c_H^n = 0.1, unit backlogging
cost c_B^n = 0.5 ∀n=1,...,N, and setup cost K = 0.1. The joint order discount
ratio is u = 0.9 for the multi-item problem. The description for the setting of
algorithmic hyperparameters is presented in Appendix N.
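A per-period cost with these units can be sketched as below. The joint-order discount mechanics (discounting ordering cost when two or more items are ordered together) are our assumption for illustration; the paper's exact reward definition appears in its model section, which is not reproduced here.

```python
def period_cost(inv_after, orders, c_o=0.3, c_h=0.1, c_b=0.5, K=0.1, u=0.9):
    """Cost for one period of an N-item system: setup cost K is charged
    once if any item is ordered, the ordering cost is discounted by the
    joint order ratio u when two or more items are ordered together, and
    each item pays holding on positive / backlogging on negative stock."""
    n_ordered = sum(q > 0 for q in orders)
    setup = K if n_ordered > 0 else 0.0
    discount = u if n_ordered >= 2 else 1.0
    ordering = discount * c_o * sum(orders)
    hold_back = sum(c_h * max(x, 0.0) + c_b * max(-x, 0.0) for x in inv_after)
    return setup + ordering + hold_back
```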
The learned policy parameters (s, S) of SRL-FSA and SRL-PSA are (4.06, 6.85) and (4.35, 6.72), respectively. These results
are close to those under the near-optimal policy (s∗ , S ∗ ) = (4.60, 6.90) using the
full enumeration method with 0.1-unit grid-searching. SRL-FSA and SRL-PSA
both result in a long-run average cost, averaged over multiple replications,
similar to that of the full enumeration. Specifically, the average cost of the
proposed structured RL algorithm differs from that of the full enumeration
by less than 1%. Table 1 summarizes the learned policy and long-run average
cost for the single-item inventory problem. Figure 2 shows that the performance
of both SRL-FSA and SRL-PSA rapidly converges to a near-optimal solution
compared with the benchmarking method (i.e., AC). In addition, SRL-PSA
has a superior and more stable convergence rate than SRL-FSA, meaning that
it provides a good policy within fewer iterations. Here, the full enumeration
requires a considerably long computation time even for a small-sized problem
despite providing a near-optimal policy. Therefore, the full enumeration is not
applicable in a practical situation, and the good convergence behavior of the
proposed methods is meaningful for a practical large-sized problem.
Figure 2: Convergence graphs for the average cost of the proposed algorithm and
benchmarking heuristic under a single-item inventory system with the Gamma demand
distribution.
Table 1: Performance of the proposed structured RL algorithm compared with the full
enumeration heuristic under a single-item inventory system with the Gamma demand
distribution
Algorithm    s     S     Average cost   Difference (%)
Full enum.   4.60  6.90  1.754          -
SRL-FSA      4.06  6.85  1.758          0.23
SRL-PSA      4.35  6.72  1.756          0.11
We observe similar results for a single-item inventory system with the truncated-
normal demand distribution. Appendix O and Appendix P summarize the
learned policy and long-run average costs with additional descriptions (see Ta-
ble O.4 and Figure P.8). The results verify that well-designed RL algorithms
are promising and applicable for inventory replenishment problems with various
and even unknown demand distributions.
Figure 3 demonstrates how the proposed structured RL algorithm behaves
when the demand distributions are non-stationary and change
over time (i.e., regime-switching systems). It rapidly adapts the replenishment
policy to switching demand distributions. This good learning behavior enables
the proposed SRL-FSA and SRL-PSA algorithms to lower the long-run aver-
age cost by 1.7% and 3.2%, respectively, compared with the static replenishment
policy. Here, the static policy is a near-optimal policy obtained using the full
enumeration method for the initial static inventory system. We consider this
static inventory system as the situation in which there is a significant time lag
in observing the change in the demand distributions; hence, a decision-maker
believes that the demand distribution is “static” over time. This observation
supports the argument that the proposed structured RL algorithm can improve
In this section, we present the results from the numerical analysis for a two-
item inventory system that requires finding the (s, c, S) joint replenishment pol-
icy. In this two-item inventory system, each item follows the same i.i.d Gamma
(a) SRL-FSA (b) SRL-PSA
Figure 3: Trajectory of the updated policy parameters using the proposed algorithms
under a single-item inventory system with the regime-switching demand distribution.
demand distribution. Figure 4 demonstrates that the proposed structured RL
algorithm behaves well and learns the joint replenishment policy for the two-
item inventory replenishment problem. Table 2 compares the joint replenish-
ment policies and shows that the learned policies are close to the near-optimal
policy under the full enumeration method. Furthermore, the cost difference
between the proposed structured RL algorithm and full enumeration heuristic
is less than 2%. Similar to the aforementioned analysis, SRL-PSA again has a
better convergence rate. Appendix P provides the results (see Figure P.9).
switching demand distributions. This good learning behavior enables the pro-
posed SRL-FSA and SRL-PSA algorithms to successfully reduce the average
costs by 7.8% and 13.4%, respectively, compared to the static replenishment
Table 2: Performance of the proposed structured RL algorithm compared with the full
enumeration heuristic under a two-item inventory system with the Gamma demand
distribution
Algorithm    s^1   c^1   S^1   s^2   c^2   S^2   Average cost   Difference (%)
Full enum.   4.40  6.10  6.50  4.40  6.10  6.50  3.463          -
SRL-FSA      2.71  4.51  6.54  2.79  4.00  6.29  3.518          1.56
SRL-PSA      3.44  3.87  6.01  3.52  4.04  5.81  3.499          1.03
(a) SRL-FSA (b) SRL-PSA
Figure 5: Trajectory of the updated policy parameters using the proposed algorithms
under a two-item inventory system with the regime-switching demand distribution.
up to four items. We find that the proposed structured RL algorithm has lower
complexity than the existing AC method, which does not consider structural
properties, and thus we expect better scalability (see the details in Appendix
R). Figure 6 presents the extent to which the proposed structured RL algorithm
lowers the long-run average cost compared with the AC method. Greater
decreases in the long-run average cost are achieved as the number of items
increases.
6. Case study
We conduct a case study for a retail shop in South Korea to examine the
applicability of the proposed structured RL method in the health & beauty re-
tail industry. The retail shop operates 150 stores nationwide, each store selling
Figure 6: Performance improvement of the SRL-FSA and SRL-PSA algorithms com-
pared with the benchmark AC when extending the number of items.
approximately 10,000 items on average. The case study considers two seasonal
items offered by the retail shop: sunblock and deodorant. Figure Q.10 in Ap-
pendix Q illustrates the daily sales of sunblock and deodorant for 865 and 1,000
days, showing strong seasonality with peaks in summer.
The retail shop’s current practice is to review items daily and decide whether
to replenish them. When the inventory level drops below the average demand
in lead time, the retail shop orders items based on the average demand and its
safety stock. Using this case study, we aim to confirm whether the proposed
method reduces the inventory cost more than the retail shop’s current replenish-
ment policy. For comparison, we also consider other baseline benchmarks, such
as the static replenishment policy, Q-learning, and post-decision state learning.
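The current practice described above can be sketched as a baseline rule: review daily, order when inventory falls below the average lead-time demand, and order up to the average demand plus safety stock. The z·σ form of the safety stock is our assumption for illustration; the shop's exact formula is not disclosed.

```python
def current_practice_order(inv, mu_d, sigma_d, lead_time=1, z=1.65):
    """Baseline reorder rule from the case study description: order when
    inventory drops below mean lead-time demand, up to mean demand plus a
    z * sigma safety stock (the z * sigma form is our assumption)."""
    reorder_level = mu_d * lead_time
    if inv < reorder_level:
        target = mu_d * lead_time + z * sigma_d
        return target - inv
    return 0.0
```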
Cachon, 2006). The retail prices of sunblock and deodorant are 19,000 KRW
and 11,500 KRW, respectively. Table Q.7 summarizes the unit costs.
Figure 7 demonstrates that the proposed structured RL methods reasonably
learn the replenishment policy in response to the switching demand. This de-
sirable behavior enables the proposed methods to lower the inventory costs by
more than 10% and 30% compared to the current replenishment practices for
sunblock and deodorant, respectively. Other benchmark methods may possi-
bly lower the inventory costs more than the current practice, but the proposed
method outperforms them as well. Table 3 summarizes the results. Compared
to the current practice, SRL-PSA saves 2.884 million KRW in sunblock inven-
tory costs over 865 days, which is approximately 2,500 USD. When considering
150 stores, this indicates that the retail shop can expect to save 150,000 USD
in sunblock inventory costs annually. Similarly, the proposed method can save
approximately 70,000 USD in deodorant inventory costs annually. Given the
business scale of chain stores, extending these savings across the remaining
10,000 items underscores the practical value of our method.
(a) SRL-FSA for sunblock (b) SRL-PSA for sunblock
mance, as follows. First, although there is a difference in the peak periods for
sunblock and deodorant, the retail shop ignores this demand characteristic. For
example, sunblock has an earlier and longer peak period, and the demand is sig-
nificantly large during these periods. However, the existing practice considers
the same peak periods for both items. On the other hand, the proposed method
automatically detects the changes in demand without having to define the peak
period and derives a customized policy for individual items. Second, the retail
shop determines the reorder level based on the average demand over specific
months or seasons. This approach is highly likely to lag the actual demand, with
the resulting reorder level being unresponsive to the actual demand. Meanwhile,
the proposed method learns and updates both the reorder and order-up-to levels
in immediate response to the recent demand observations. This adaptive learn-
ing behavior is likely to fare better in today's retail business, where demand is
highly irregular and unpredictable. Moreover, we expect that the achievement
of the proposed method is more significant for multi-item inventory systems.
7. Concluding Remarks
system. Lastly, most existing methods fail to find a “good” solution for a
multi-item inventory replenishment problem within a reasonable computation time,
whereas this study provides efficient methods for handling multi-item inventory
systems. The scalability of the proposed method also demonstrates the applicability
of data-driven methods to practical inventory systems.
We acknowledge some of the limitations of our study and suggest potential
future research extensions. First, we consider a single-echelon, single-supplier,
full backlogging, and no lead time inventory system. Therefore, extending our
model to incorporate more realistic assumptions is encouraged. Although con-
sidering variants does not significantly diminish our main findings, replicating
our findings in various models is an important extension of this study. Second,
although an optimal replenishment policy structure is known for basic prob-
lems, a rational policy structure can be developed for more complex systems.
Another possible avenue for future research is to use structured RL in other
applications (e.g., a Markov decision process model in which an optimal policy
is characterized by a few parameters). For example, the travel industry consid-
ers a booking limit problem, which controls the amount of capacity sold to any
particular class in a given period. The optimal policy is parameterized and its
structural properties are then known (van Ryzin and Talluri, 2005).
References
Chen, B. and Chao, X. (2020). Dynamic inventory control with stockout substitution and demand learning. Management Science, 66(11):5108–5127.
De Moor, B. J., Gijsbrechts, J., and Boute, R. N. (2022). Reward shaping to
improve the performance of deep reinforcement learning in perishable inventory
management. European Journal of Operational Research, 301(2):535–545.
Djonin, D. V. and Krishnamurthy, V. (2007). Q-learning algorithms for constrained Markov decision processes with randomized monotone policies: Application to MIMO transmission control. IEEE Transactions on Signal Processing, 55(5):2170–2181.
Fu, F. and van der Schaar, M. (2012). Structure-aware stochastic control
for transmission scheduling. IEEE Transactions on Vehicular Technology,
61(9):3931–3945.
Gijsbrechts, J., Boute, R. N., Van Mieghem, J. A., and Zhang, D. J. (2022). Can deep reinforcement learning improve inventory management? Performance on lost sales, dual sourcing, and multi-echelon problems. Manufacturing & Service Operations Management.
Ignall, E. (1969). Optimal continuous review policies for two product inventory systems with joint setup costs. Management Science, 15:278–283.
Johansen, S. G. and Melchiors, P. (2003). Can-order policy for the periodic-
review joint replenishment problem. Journal of the Operational Research Soci-
ety, 54(3):283–290.
Keskin, N. B., Li, Y., and Song, J.-S. (2022). Data-driven dynamic pricing and
ordering with perishable inventory in a changing environment. Management
Science, 68(3):1938–1958.
Kunnumkal, S. and Topaloglu, H. (2008). Exploiting the structural properties of the underlying Markov decision problem in the Q-learning algorithm. INFORMS Journal on Computing, 20(2):288–301.
Oroojlooyjadid, A., Nazari, M., Snyder, L. V., and Takáč, M. (2021). A deep Q-network for the beer game: Deep reinforcement learning for inventory optimization.
Roy, A., Borkar, V., Karandikar, A., and Chaporkar, P. (2021). Online reinforcement learning of optimal threshold policies for Markov decision processes. IEEE Transactions on Automatic Control, 67(7):3722–3729.
Scarf, H. (1960). The optimality of (s, S) policies in the dynamic inventory problem. Mathematical Methods in the Social Sciences.
Shi, C., Chen, W., and Duenyas, I. (2016). Nonparametric data-driven algo-
rithms for multiproduct inventory systems with censored demand. Operations
Research, 64(2):362–370.
Singh, S. P. (1994). Reinforcement learning algorithms for average-payoff Markovian decision processes. In AAAI, volume 94, pages 700–705.
Song, J.-S. and Zipkin, P. (1993). Inventory control in a fluctuating demand
environment. Operations Research, 41(2):351–370.
Terwiesch, C. and Cachon, G. (2006). Matching supply with demand: An intro-
duction to operations management. McGraw-Hill.
Appendix A. Proofs of Propositions 1 and 2
As σ² → 0, the perturbed order-up-to level S̃ converges to the deterministic
order-up-to level S with probability 1. This statement can be shown by Chebyshev's inequality as follows:
P(|S̃ − S| ≥ a) ≤ σ²/a²
for any a > 0.
In addition, the sigmoid function f(x, y) = 1/(1 + e^{−(x−y)/τ}) converges to the
step function as τ → 0:
f(x, y) → step(x; y) := 1 if x > y, 0 otherwise.
Then, the stochastic (s, S) and (s, c, S) replenishment policies converge to the
deterministic (s, S) and (s, c, S) replenishment policies, respectively. □
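Both limits in this proof are easy to check numerically; a small sketch (function names are ours):

```python
import math

def f(x, y, tau):
    """Sigmoid mixing probability from Appendix A."""
    return 1.0 / (1.0 + math.exp(-(x - y) / tau))

def step(x, y):
    """Deterministic replenishment indicator 1{x > y}."""
    return 1.0 if x > y else 0.0

def max_gap(tau, y=4.0, xs=(2.0, 3.9, 4.1, 6.0)):
    """Largest pointwise deviation of f from the step function on a grid
    of states away from the threshold; shrinks as tau -> 0."""
    return max(abs(f(x, y, tau) - step(x, y)) for x in xs)
```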
Appendix B. Convergence property of the proposed algorithms
Lemma 1. Suppose the objective function g(s, S) satisfies Proposition 4 and the
sequence is updated in the opposite direction of the gradient of the objective
function with update rates b_p(k) satisfying Σ_{k=1}^{∞} b_p(k) = ∞ and
Σ_{k=1}^{∞} b_p(k)² < ∞ ∀p=1,2; then, the sequence satisfies lim_{t→∞} ∇g(s_t, S_t) = 0.
Proof The value function is K-convex, which implies that the corresponding
points is at least not zero, which implies that the converged point cannot be
the extreme point. Therefore, the sequence of policy (st , St ) updated by the
algorithm converges to a stationary point with probability 1.
We can construct the algorithm with a learning-rate schedule that ensures the
conditions presented in Lemma 1 hold. To prove Theorem 2, it suffices to
show Proposition 4.
Proof The proof is presented in Appendix D.
Having shown Theorem 2, the gradient-based algorithm has the convergence
property in the single-item case. Moreover, this proof can be easily extended to
the multi-item case by showing that the objective function is twice differentiable
with respect to the can-order level c^n for all items n = 1, ..., N; thus, the
proposed structured RL algorithm has the convergence property under the problem
examined in this study. □
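A schedule such as b_p(k) = 1/k satisfies the conditions of Lemma 1 (divergent sum, convergent squared sum); a quick numerical illustration (the function name is ours):

```python
def partial_sums(n):
    """Partial sums of b(k) = 1/k and b(k)**2 = 1/k**2 up to n.
    The first keeps growing (divergent harmonic series) while the
    second converges to pi**2 / 6, as Lemma 1 requires."""
    s = sum(1.0 / k for k in range(1, n + 1))
    sq = sum(1.0 / k**2 for k in range(1, n + 1))
    return s, sq
```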
Any density function for the demand distribution has bounded first and sec-
ond derivatives with respect to the corresponding random variable; then, the
transition rule for replenishment P1 (x′ |S) also has bounded first and second
derivatives with respect to the parameter S. In addition, the sigmoid function
has bounded first and second derivatives; thus, the sigmoid mixing probability
f (x, s) also has bounded first and second derivatives with respect to the param-
eter s. Hence, the transition probability Px,x′ (s, S), which is the composition
of those functions, has bounded first and second derivatives with respect to the
Appendix D. Proof of Proposition 4
Before we describe the proof of Proposition 4, we make the following two
assumptions.
riodic. Furthermore, the representative state x∗ is the positive recurrent state
for the Markov chain.
(s, S), the transition probability Px,x′ (s, S) has bounded first and second deriva-
tives. Furthermore, the reward function has bounded first and second derivatives
with respect to (s, S).
These assumptions hold under our problem definition. Assumption 1 holds
by leveraging a stochastic replenishment policy. Assumption 2 also holds from
Proposition 3 and the fact that the reward function does not depend on the
policy parameters when the current and next states are given.
Now, we show that the average reward objective function g(s, S) is twice
differentiable and has bounded first and second derivatives.
The average reward under the continuous state can be approximated by the
average cost for the infinite-state Markov chain:
g(s, S) ≈ Σ_{x∈X} π(x; s, S) r̂(x)
The approximated Markov chain satisfies the following balance equations:
Σ_{x∈X} π(x; s, S) P_{x,x′}(s, S) = π(x′; s, S)
Σ_{x∈X} π(x; s, S) = 1
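For a finite truncation of the state space, these balance equations determine π as the solution of a linear system; a sketch for a small chain (the function name is ours):

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 by replacing one balance
    equation with the normalization constraint."""
    n = P.shape[0]
    A = np.transpose(P) - np.eye(n)   # (P^T - I) pi = 0
    A[-1, :] = 1.0                    # replace last row with sum-to-one
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)
```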
The balance equation implies that any initial state vector converges to a certain
constant state vector by transitioning infinitely many times with the transition
of matrix A(s, S) are composed with the entries of the transition probability
Px,x′ (s, S). From Proposition 3, the matrix A(s, S) is also twice differentiable
and has bounded first and second derivatives. Since the corresponding station-
ary distribution is unique, matrix A(s, S) is invertible and the stationary
distribution can be represented using Cramer's rule as follows:
π(x; s, S) = C(s, S) / det A(s, S)
where C(s, S) is a vector whose entries are a polynomial function of the entries
of A(s, S). The entries of C(s, S) are twice differentiable and have bounded
first and second derivatives; further, det A(s, S) is twice differentiable and
has bounded first and second derivatives. Since matrix A(s, S) is invertible,
|det A(s, S)| is bounded away from zero. These statements imply that
the stationary distribution is twice differentiable and has bounded first and second
derivatives. Given the characteristics of the stationary distribution, the objective
function composed of the stationary distribution is also twice differentiable and has
bounded first and second derivatives. □
For each item n, the partial derivatives of the transition probability with respect
to the policy parameters (s, c, S) are well defined as follows:

∂/∂S^n P_{x,x′}(s, c, S) =
β^n [ (1 − f(x^n, c^n))(1 − ∏_{k≠n} f(x^k, s^k)) + (1 − f(x^n, s^n)) ∏_{k≠n} f(x^k, s^k) ]
× ( P_1^n(x^{n′}|S̃^n, α^n − 1, β^n) − P_1^n(x^{n′}|S̃^n, α^n, β^n) ) ∏_{k≠n} P^k_{x,x^{k′}}(s, c^k, S^k)

∂/∂c^n P_{x,x′}(s, c, S) =
( 1 − ∏_{k≠n} f(x^k, s^k) ) (∂/∂c^n) f(x^n, c^n) ( P_0^n(x^{n′}|x^n, α^n, β^n) − P_1^n(x^{n′}|S̃^n, α^n, β^n) )
× ∏_{k≠n} P^k_{x,x^{k′}}(s, c^k, S^k)

∂/∂s^n P_{x,x′}(s, c, S) =
(∂/∂s^n) f(x^n, s^n) [ ( P_0^n(x^{n′}|x^n, α^n, β^n) − P_1^n(x^{n′}|S̃^n, α^n, β^n) ) ∏_{k≠n} f(x^k, s^k) P^k_{x,x^{k′}}(s, c^k, S^k)
+ P^n_{x,x^{n′}}(s, c^n, S^n) Σ_{k≠n} ( f(x^k, s^k) − f(x^k, c^k) ) ( P_0^k(x^{k′}|x^k, α^k, β^k) − P_1^k(x^{k′}|S̃^k, α^k, β^k) ) ∏_{j∈[N]\{n,k}} f(x^j, s^j) P^j_{x,x^{j′}}(s, c^j, S^j) ]
The partial derivative of the objective with respect to the policy parameters
(s, S) is computed, and the proposed algorithm updates the policy parameters
in the descent direction of the gradient. First, the partial derivative of the
objective function with respect to the order-up-to level S is computed as follows:
∂/∂S g(s, S) ≈ β(1 − f(x, s)) ∫_{x′∈X} ( P_1(x′|S̃, α − 1, β) − P_1(x′|S̃, α, β) ) V(x′; s, S) dx′
Using the stochastic approximation framework, the partial derivative equation
can be approximated by the following formulation with an online sample as:
∂/∂S g(s, S) ≈ β(1 − f(x, s)) × (−1)^η V(x′; s, S)
where η is a random sample from a Bernoulli trial that has equal probabil-
ities (i.e., η ∼ Bern(·; 0.5)). In addition, a sample for the next state x′ is
extracted from the joint distribution, which is composed of P1 (x′ |S̃, α − 1, β)
and P1 (x′ |S̃, α, β) by mixing the Bernoulli sample (i.e., x′ ∼ (1 − η)P1 (x′ |S̃, α −
1, β) + ηP1 (x′ |S̃, α, β)).
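The Bernoulli-mixing construction yields, up to a factor of 1/2, an unbiased one-sample estimator of the difference of two expectations. A sketch with two normal distributions standing in for the pair P1(x′|S̃, α − 1, β) and P1(x′|S̃, α, β) (the distributions and function name are our illustrative choices):

```python
import numpy as np

def mixed_difference_estimate(mu_a, mu_b, n, rng):
    """Average of (-1)^eta * z with eta ~ Bern(0.5), z ~ P_a = N(mu_a, 1)
    when eta = 0 and z ~ P_b = N(mu_b, 1) when eta = 1; its expectation
    is (E_a[Z] - E_b[Z]) / 2."""
    eta = rng.integers(0, 2, size=n)
    z = np.where(eta == 0, rng.normal(mu_a, 1.0, size=n),
                 rng.normal(mu_b, 1.0, size=n))
    return float(np.mean((-1.0) ** eta * z))
```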
For another parameter s, the partial derivative is computed as follows:
∂/∂s g(s, S) ≈ (∂/∂s) f(x, s) ∫_{x′∈X} ( P_0(x′|x, α, β) − P_1(x′|S̃, α, β) ) V(x′; s, S) dx′
Using the stochastic approximation framework, the partial derivative equation
can be approximated by the following formulation with an online sample as:
∂/∂s g(s, S) ≈ (∂/∂s) f(x, s) × (−1)^η V(x′; s, S)
where η is a random sample from a Bernoulli trial with equal probabilities. In
addition, a sample for the next state x′ is extracted from the joint distribution,
which is composed of P0 (x′ |x, α, β) and P1 (x′ |S̃, α, β) by mixing the Bernoulli
sample (i.e., x′ ∼ (1 − η)P0 (x′ |x, α, β) + ηP1 (x′ |S̃, α, β)).
The policy parameters (s, S) are updated using the SRL-FSA algorithm as follows:
S_{t+1} = S_t − b_1(t) β̂_{t+1} (1 − f(x_t, s_t)) (−1)^{η_S} V_{t+1}(z_S; s_t, S_t)    (F.1)
s_{t+1} = s_t − b_2(t) (∂/∂y) f(x_t, y)|_{y=s_t} (−1)^{η_s} V_{t+1}(z_s; s_t, S_t)    (F.2)
Here, i.i.d Bernoulli samples ηS , ηs ∼ Bern(·; 0.5); further, random samples
zS ∼ (1 − ηS )P1 (x′ |S̃t , α̂t+1 − 1, β̂t+1 ) + ηS P1 (x′ |S̃t , α̂t+1 , β̂t+1 ) and zs ∼ (1 −
system
Part of the gradient of the objective function can be directly derived by
computing the integration. Then, the partial derivative of the objective function
The expectation term is a polynomial function of the n-th moment of the
Gamma distribution; thus, it can be easily presented in closed
form (we skip this representation for brevity).
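Using the Gamma raw moments E[Y^j] = α(α+1)···(α+j−1)/β^j, the terms E[(S − Y)^i] expand by the binomial theorem; a sketch of this closed form (function names are ours):

```python
from math import comb

def gamma_raw_moment(alpha, beta, j):
    """E[Y^j] for Y ~ Gamma(shape=alpha, rate=beta)."""
    m = 1.0
    for k in range(j):
        m *= (alpha + k) / beta
    return m

def shifted_moment(S, alpha, beta, i):
    """E[(S - Y)^i] via the binomial expansion over raw moments."""
    return sum(comb(i, j) * S**(i - j) * (-1)**j * gamma_raw_moment(alpha, beta, j)
               for j in range(i + 1))
```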
Then, the partial derivative of the objective function with respect to the
reorder level s is expressed as follows:
∂/∂s g(s, S) ≈ (∂/∂s) f(x, s) Σ_{i=1}^{4} w_i E_{Y∼h(d; α, β)}[ (x − Y)^i − (S̃ − Y)^i ]
Similar to the update equation of S, it can be easily presented in closed form.
Then, the policy parameters (s, S) are updated using the SRL-PSA algorithm
as follows:
S_{t+1} = S_t − b_1(t) β̂_{t+1} (1 − f(x_t, s_t)) × Σ_{i=1}^{4} w_{t+1,i} [ E_{Ŷ∼h(d; α̂_{t+1}−1, β̂_{t+1})}(S̃_t − Ŷ)^i − E_{Y∼h(d; α̂_{t+1}, β̂_{t+1})}(S̃_t − Y)^i ]    (G.1)
s_{t+1} = s_t − b_2(t) (∂/∂y) f(x_t, y)|_{y=s_t} Σ_{i=1}^{4} w_{t+1,i} E_{Y∼h(d; α̂_{t+1}, β̂_{t+1})}[ (x_t − Y)^i − (S̃_t − Y)^i ]    (G.2)
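One such update step can be written directly in terms of these moments; a sketch in the spirit of (G.1) and (G.2), with illustrative weights w and a simple min() projection standing in for the operator Ω (both our assumptions, not the paper's exact implementation):

```python
import math
from math import comb

def f(x, y, tau=0.5):
    """Sigmoid mixing probability."""
    return 1.0 / (1.0 + math.exp(-(x - y) / tau))

def df_dy(x, y, tau=0.5):
    """Derivative of f with respect to its threshold argument y."""
    p = f(x, y, tau)
    return -p * (1.0 - p) / tau

def raw_moment(alpha, beta, j):
    """E[Y^j] for Y ~ Gamma(shape=alpha, rate=beta)."""
    m = 1.0
    for k in range(j):
        m *= (alpha + k) / beta
    return m

def shifted_moment(S, alpha, beta, i):
    """E[(S - Y)^i] via the binomial expansion."""
    return sum(comb(i, j) * S**(i - j) * (-1)**j * raw_moment(alpha, beta, j)
               for j in range(i + 1))

def srl_psa_update(x, s, S, S_tilde, w, alpha, beta, b1, b2):
    """One moment-based parameter update in the spirit of (G.1)-(G.2)."""
    grad_S = sum(w[i - 1] * (shifted_moment(S_tilde, alpha - 1, beta, i)
                             - shifted_moment(S_tilde, alpha, beta, i))
                 for i in range(1, 5))
    S_new = S - b1 * beta * (1.0 - f(x, s)) * grad_S
    grad_s = sum(w[i - 1] * (shifted_moment(x, alpha, beta, i)
                             - shifted_moment(S_tilde, alpha, beta, i))
                 for i in range(1, 5))
    s_new = min(S_new, s - b2 * df_dy(x, s) * grad_s)   # keep s <= S
    return s_new, S_new
```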
This section provides the SRL-FSA and SRL-PSA algorithms for a multi-
item inventory system. Algorithm 3 and Algorithm 4 summarize the corresponding
algorithms, where the projection operator Ω_t^{c^n}[·] maintains the upper bound
of c^n by S^n and Ω_t^{s^n}[·] maintains the upper bound of s^n by c^n according to
the (s, c, S) joint replenishment policy structure that requires s^n ≤ c^n ≤ S^n
∀n=1,...,N . To stabilize policy learning in the multi-item problem, we consider a
batch update technique that updates the parameters using several offline sam-
ples.
In Algorithm 4, the main part for the gradient with respect to the can-
Algorithm 3 SRL-FSA for multi-item inventory management
1: while satisfying the stopping criteria do
2:   given x_t, take action a_t with noise ϵ_t following the stochastic (s, c, S)
     replenishment policy; then, observe the transitioned state x_{t+1} and
     corresponding reward r_t
3:   attain the realized demand d_t for all the items; then, adaptively estimate
     the corresponding distributional parameters α̂^n_{t+1} and β̂^n_{t+1} ∀n=1,...,N
4:   update the relative value function:
     w_{t+1} = w_t + γ_1(t)[ r_t − ρ_t + V_t(x_{t+1}; s_t, c_t, S_t) − V_t(x_t; s_t, c_t, S_t) ] Φ(x_t)
     ρ_{t+1} = ρ_t + γ_2(t)[ r_t + V_t(x_{t+1}; s_t, c_t, S_t) − V_t(x_t; s_t, c_t, S_t) − ρ_t ]
5:   update the policy parameters for all the items:
6:   for item n = 1, ..., N do
     sampling η^n_S ∼ Bern(·; 0.5), then z^n_S ∼ (1 − η^n_S) P_1(x′|S̃^n_t, α̂^n_{t+1} − 1, β̂^n_{t+1}) + η^n_S P_1(x′|S̃^n_t, α̂^n_{t+1}, β̂^n_{t+1})
     S^n_{t+1} = S^n_t − b_1(t) β̂^n_{t+1} [ (1 − f(x^n_t, c^n_t))(1 − ∏_{k≠n} f(x^k_t, s^k_t)) + (1 − f(x^n_t, s^n_t)) ∏_{k≠n} f(x^k_t, s^k_t) ] (−1)^{η^n_S} V_{t+1}(x^{n′} = z^n_S, x^{−n} = x^{−n}_{t+1}; s_t, c_t, S_t)
     c^n_{t+1} = Ω_t^{c^n}[ c^n_t − b_2(t) (1 − ∏_{k≠n} f(x^k_t, s^k_t)) (∂/∂y) f(x^n_t, y)|_{y=c^n_t} (−1)^{η^n_c} V_{t+1}(x^{n′} = z^n_c, x^{−n} = x^{−n}_{t+1}; s_t, c_t, S_t) ]
     θ^n_t = e^n × ∏_{k≠n} f(x^k_t, s^k_t)
     s^n_{t+1} = Ω_t^{s^n}[ s^n_t − b_3(t) (∂/∂y) f(x^n_t, y)|_{y=s^n_t} (−1)^{Σ_k 1{τ^n=k} η^k_s} V_{t+1}(x′ = z^n_s; s_t, c_t, S_t) ]
7: update the hyperparameters:
   decaying σ_t and τ_t
Algorithm 4 SRL-PSA for multi-item inventory management
1: while satisfying the stopping criteria do
2:   given x_t, take action a_t with noise ϵ_t following the stochastic (s, c, S)
     replenishment policy; then, observe the transitioned state x_{t+1} and
     corresponding reward r_t
3:   attain the realized demand d_t for all the items; then, adaptively estimate
     the corresponding distributional parameters α̂^n_{t+1} and β̂^n_{t+1} ∀n=1,...,N
4:   update the relative value function:
     w_{t+1} = w_t + γ_1(t)[ r_t − ρ_t + V_t(x_{t+1}; s_t, c_t, S_t) − V_t(x_t; s_t, c_t, S_t) ] Φ(x_t)
     ρ_{t+1} = ρ_t + γ_2(t)[ r_t + V_t(x_{t+1}; s_t, c_t, S_t) − V_t(x_t; s_t, c_t, S_t) − ρ_t ]
5:   update the policy parameters for all the items:
6:   for item n = 1, ..., N do
     A_1^{nk} = (1 − f(x^n_t, s^n_t)) f(x^k_t, c^k_t) + f(x^n_t, s^n_t) f(x^k_t, s^k_t) ∀k∈[N]\n
     A_2^{nk} = (1 − f(x^n_t, s^n_t))(1 − f(x^k_t, c^k_t)) + f(x^n_t, s^n_t)(1 − f(x^k_t, s^k_t)) ∀k∈[N]\n
     S^n_{t+1} = S^n_t − b_1(t) β̂^n_{t+1} [ (1 − f(x^n_t, c^n_t))(1 − ∏_{k≠n} f(x^k_t, s^k_t)) + (1 − f(x^n_t, s^n_t)) ∏_{k≠n} f(x^k_t, s^k_t) ] Σ_{1≤i_1+...+i_N≤4} w_{i_1...i_N} ( E_{ŷ^n∼h(d; α̂^n_{t+1}−1, β̂^n_{t+1})}[(S̃^n_t − Ŷ^n)^{i_n}] − E_{y^n∼h(d; α̂^n_{t+1}, β̂^n_{t+1})}[(S̃^n_t − Y^n)^{i_n}] ) ∏_{k≠n} E_{y^k∼h(d; α̂^k_{t+1}, β̂^k_{t+1})}[ A_1^{nk}(x^k − y^k)^{i_k} − A_2^{nk}(S̃^k_t − y^k)^{i_k} ]
     ∆c^n_t = Σ_{1≤i_1+...+i_N≤4} w_{i_1...i_N} E_{y^n∼h(d; α̂^n_{t+1}, β̂^n_{t+1})}[ (x^n_t − Y^n)^{i_n} − (S̃^n_t − Y^n)^{i_n} ] ∏_{k≠n} E_{y^k∼h(d; α̂^k_{t+1}, β̂^k_{t+1})}[ A_1^{nk}(x^k − y^k)^{i_k} − A_2^{nk}(S̃^k_t − y^k)^{i_k} ]
     c^n_{t+1} = Ω_t^{c^n}[ c^n_t − b_2(t) (1 − ∏_{k≠n} f(x^k_t, s^k_t)) (∂/∂y) f(x^n_t, y)|_{y=c^n_t} ∆c^n_t ]
     s^n_{t+1} = Ω_t^{s^n}[ s^n_t − b_3(t) (∂/∂y) f(x^n_t, y)|_{y=s^n_t} ( ∆c^n_t ∏_{k≠n} f(x^k_t, s^k_t) + Σ_{k≠n} ∆c^k_t ( f(x^k_t, s^k_t) − f(x^k_t, c^k_t) ) ∏_{j∈[N]\{n,k}} f(x^j_t, s^j_t) ) ]
8: update the hyperparameters:
   decaying σ_t and τ_t
order level ∆cn is reused to compute the gradient with respect to the reorder
level sn , which improves the computational efficiency.
extended to the multi-item case:
∇g(s, c, S) ≈ ∫_{x′∈X^N} ∇P_{x,x′}(s, c, S) V(x′; s, c, S) dx′
The partial derivative with respect to each of the policy parameters (s, c, S)
for all the items is computed, and the proposed algorithm updates the policy
parameters in the descent direction of the gradient. First, the partial derivative
of the objective function with respect to S^n, the order-up-to level of item n, is
∂/∂S^n g(s, c, S) ≈ β^n [ (1 − f(x^n, c^n))(1 − ∏_{k≠n} f(x^k, s^k)) + (1 − f(x^n, s^n)) ∏_{k≠n} f(x^k, s^k) ]
× (−1)^{η^n} V(x^{n′} = z^n, x^{−n}; s, c, S),
where η^n is a random sample from a Bernoulli trial with equal probabilities.
In addition, a sample for the next state of item n (x^{n′}) is extracted
from the joint distribution, which is composed of P_1^n(x^{n′}|S̃^n, α^n − 1, β^n) and
P_1^n(x^{n′}|S̃^n, α^n, β^n) by mixing the Bernoulli sample (i.e., z^n ∼ (1 − η^n)P_1^n(x^{n′}|S̃^n, α^n − 1, β^n) + η^n P_1^n(x^{n′}|S̃^n, α^n, β^n)).
Next, the partial derivative with respect to the can-order level c^n is approximated as
∂/∂c^n g(s, c, S) ≈ ( 1 − ∏_{k≠n} f(x^k, s^k) ) (∂/∂c^n) f(x^n, c^n) × (−1)^{η^n} V(x^{n′} = z^n, x^{−n}; s, c, S),
where η n is a random sample from a Bernoulli trial with equal probabili-
ties. In addition, a sample for the next state of item n (i.e., xn ′ ) is ex-
tracted from the joint distribution, which is composed of P0n (xn ′ |xn , αn , β n ) and
P1n (xn ′ |S̃ n , αn , β n ) by mixing the Bernoulli sample (i.e., z n ∼ P̄ n (xn ′ |xn , s, c, S, η n ) =
(1 − η n )P0n (xn ′ |xn , αn , β n ) + η n P1n (xn ′ |S̃ n , αn , β n )).
The stochastic approximation of the partial derivative of the objective function
with respect to s^n is computed as follows:
∂/∂s^n g(s, c, S) ≈ (∂/∂s^n) f(x^n, s^n) × (−1)^{Σ_k 1{τ^n=k} η^k} V(z; s, c, S)
where η^k is a random sample from a Bernoulli trial with equal probabilities and
τ^n is randomly sampled from a Multinoulli trial that has the categorical weight
vector θ^n as follows:
θ^n = e^n × ∏_{k≠n} f(x^k, s^k) + Σ_{k≠n} e^k × ( f(x^k, s^k) − f(x^k, c^k) ) ∏_{j∈[N]\{n,k}} f(x^j, s^j)
Here, e^n denotes the N-dimensional standard basis vector whose n-th element is
one, while the others are zero. The categorical weight is normalized (1^T θ^n = 1)
by dividing by the sum of the elements of the weight vector. In addition, a sample
for the next state vector x′ is extracted from the nested joint distribution, which
is composed of P̄^k(x^{k′}|x^k, s, c, S, η^k) ∏_{j≠k} P^j_{x,x^{j′}}(s, c^j, S^j) ∀k by activating the
The policy parameters (s, c, S) update the formula using the SRL-FSA al-
gorithm as follows:
n
St+1 =
ot
Y Y
Stn − b1 (t)β̂t+1
n
(1 − f (xnt , cnt ))(1 − f (xkt , skt )) + (1 − f (xnt , snt )) f (xkt , skt )
k̸=n k̸=n
n ′
× (−1) Vt+1 (xn ′ = zSn , x−n = x−n
ηS
t+1 ; st , ct , St )
tn
(I.1)
c^n_{t+1} = c^n_t − b2(t) (1 − ∏_{k≠n} f(x^k_t, s^k_t)) ∂f(x^n_t, y)/∂y|_{y=c^n_t} (−1)^{η^n_c} × V_{t+1}(x^{n′} = z^n_c, x^{−n′} = x^{−n}_{t+1}; s_t, c_t, S_t)   (I.2)
s^n_{t+1} = s^n_t − b3(t) ∂f(x^n_t, y)/∂y|_{y=s^n_t} (−1)^{Σ_k 1{τ^n = k} η^k_s} V_{t+1}(x′ = z^n_s; s_t, c_t, S_t)   (I.3)
Here, the i.i.d. Bernoulli samples are η^n_S, η^n_c, η^n_s ∼ Bern(·; 0.5) ∀n = 1, ..., N. The random samples z^n_S and z^n_c are extracted as described above; further, z^n_s ∼ P̂(x′|x, s_t, c_t, S_t, τ^n) = Σ_k 1{τ^n = k} P̄^k(x^{k′}|x^k, s_t, c_t, S_t, η^k) ∏_{j≠k} P^j_{x^j_t, x^{j′}_t}(s^j_t, c^j_t, S^j_t), where τ^n ∼ Multi(θ^n_t).
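As an illustration, one SRL-FSA step for the order-up-to level S^n, following (I.1), might look as follows; the value-function callback V, the threshold function f, and all concrete numbers here are assumptions of this sketch, and the samples z^n_S and x_{t+1} are taken as given:

```python
import numpy as np

rng = np.random.default_rng(0)

def srl_fsa_S_update(n, S, s, c, x, f, V, z_S, x_next, b1, beta_hat):
    """One SRL-FSA step for S^n, cf. (I.1): the bracketed switching term
    times (-1)^{eta^n_S} V_{t+1}(x^{n'} = z^n_S, x^{-n'} = x^{-n}_{t+1})."""
    N = len(x)
    prod_fs = np.prod([f(x[k], s[k]) for k in range(N) if k != n])
    bracket = ((1.0 - f(x[n], c[n])) * (1.0 - prod_fs)
               + (1.0 - f(x[n], s[n])) * prod_fs)
    eta_S = int(rng.integers(0, 2))            # eta^n_S ~ Bern(0.5)
    x_eval = np.array(x_next, dtype=float)
    x_eval[n] = z_S                            # substitute the sampled component
    grad_est = beta_hat * bracket * (-1.0) ** eta_S * V(x_eval, s, c, S)
    S_new = np.array(S, dtype=float)
    S_new[n] = S[n] - b1 * grad_est
    return S_new

# Toy instantiation (all numbers and callbacks are illustrative assumptions).
f = lambda x, a: 1.0 / (1.0 + np.exp(-(a - x)))   # order prob. rises as x drops below a
V = lambda x, s, c, S: float(np.sum(x ** 2))      # placeholder value function
S_new = srl_fsa_S_update(0, [5.0, 6.0], [1.0, 1.0], [3.0, 3.0],
                         [2.0, 2.0], f, V, z_S=4.0, x_next=[2.0, 3.0],
                         b1=0.01, beta_hat=1.0)
```

Only the n-th component of S changes per step; the other items' parameters are updated by their own analogous steps.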
be extended to the multi-item case. The partial derivatives with respect to the policy parameter (s, c, S) pairs for all the items are computed, and the proposed algorithm updates in the descent direction of the gradient. First, the partial derivative of the objective function with respect to S^n is as follows:
∂g(s, c, S)/∂S^n ≈ β^n [(1 − f(x^n, c^n))(1 − ∏_{k≠n} f(x^k, s^k)) + (1 − f(x^n, s^n)) ∏_{k≠n} f(x^k, s^k)] × Σ_{1≤i_1+⋯+i_N≤4} w_{i_1⋯i_N} (E_{ŷ^n∼h(d;α^n−1,β^n)}[(S̃^n − Ŷ^n)^{i_n}] − E_{y^n∼h(d;α^n,β^n)}[(S̃^n − Y^n)^{i_n}]) × ∏_{k≠n} E_{y^k∼h(d;α^k,β^k)}[A^{nk}_1 (x^k − y^k)^{i_k} − A^{nk}_2 (S̃^k − y^k)^{i_k}]
∂g(s, c, S)/∂c^n ≈ (1 − ∏_{k≠n} f(x^k, s^k)) ∂f(x^n, c^n)/∂c^n × ∆c^n
∆c^n = Σ_{1≤i_1+⋯+i_N≤4} w_{i_1⋯i_N} E_{y^n∼h(d;α^n,β^n)}[(x^n − Y^n)^{i_n} − (S̃^n − Y^n)^{i_n}] × ∏_{k≠n} E_{y^k∼h(d;α^k,β^k)}[A^{nk}_1 (x^k − y^k)^{i_k} − A^{nk}_2 (S̃^k − y^k)^{i_k}]
Here, ∆c^n is an incremental factor for item n, and it is reused to compute the partial derivative of the objective function with respect to s^n as follows:
∂g(s, c, S)/∂s^n ≈ ∂f(x^n, s^n)/∂s^n × [∆c^n ∏_{k≠n} f(x^k, s^k) + Σ_{k≠n} ∆c^k (f(x^k, s^k) − f(x^k, c^k)) ∏_{j∈[N]\{n,k}} f(x^j, s^j)]
To sum up, the policy parameters (s, c, S) are updated by the SRL-PSA algorithm as follows:
S^n_{t+1} = S^n_t − b1(t) β̂^n_{t+1} [(1 − f(x^n_t, c^n_t))(1 − ∏_{k≠n} f(x^k_t, s^k_t)) + (1 − f(x^n_t, s^n_t)) ∏_{k≠n} f(x^k_t, s^k_t)] × Σ_{1≤i_1+⋯+i_N≤4} w_{i_1⋯i_N} (E_{ŷ^n∼h(d;α̂^n_{t+1}−1,β̂^n_{t+1})}[(S̃^n_t − Ŷ^n)^{i_n}] − E_{y^n∼h(d;α̂^n_{t+1},β̂^n_{t+1})}[(S̃^n_t − Y^n)^{i_n}]) × ∏_{k≠n} E_{y^k∼h(d;α̂^k_{t+1},β̂^k_{t+1})}[A^{nk}_1 (x^k_t − y^k)^{i_k} − A^{nk}_2 (S̃^k_t − y^k)^{i_k}]   (J.1)
c^n_{t+1} = c^n_t − b2(t) (1 − ∏_{k≠n} f(x^k_t, s^k_t)) ∂f(x^n_t, y)/∂y|_{y=c^n_t} × ∆c^n_t   (J.2)
s^n_{t+1} = s^n_t − b3(t) ∂f(x^n_t, y)/∂y|_{y=s^n_t} × [∆c^n_t ∏_{k≠n} f(x^k_t, s^k_t) + Σ_{k≠n} ∆c^k_t (f(x^k_t, s^k_t) − f(x^k_t, c^k_t)) ∏_{j∈[N]\{n,k}} f(x^j_t, s^j_t)]   (J.3)
This section provides the SRL-FSA and SRL-PSA algorithms for a single-item inventory system with discrete demand. Algorithm 5 and Algorithm 6 summarize the discrete variants of the SRL-FSA and SRL-PSA algorithms, respectively.
Algorithm 5 Discrete variant SRL-FSA for single-item inventory management
1: while satisfying the stopping criteria do
2:   given x_t, take action a_t with noise ϵ_t following the stochastic (s, S) replenishment policy; then, observe the transitioned state x_{t+1} and corresponding reward r_t
3:   attain the realized demand d_t; then, adaptively update the empirical probability p̂_{t+1} and corresponding interpolated density p̄_{t+1}
4:   update the relative value function:
       w_{t+1} = w_t + γ1(t)[r_t − ρ_t + V_t(x_{t+1}; s_t, S_t) − V_t(x_t; s_t, S_t)]Φ(x_t)
       ρ_{t+1} = ρ_t + γ2(t)[r_t + V_t(x_{t+1}; s_t, S_t) − V_t(x_t; s_t, S_t) − ρ_t]
5:   update the policy parameters:
       sampling µ ∼ Multi(·; (1/λ)|∆p̂_{t+1}|), then z_S ∼ U(S̃_t − µ − 1, S̃_t − µ)
       S_{t+1} = S_t − b1(t)(1 − f(x_t, s_t))(−1)^{1{p̂_{t+1,µ+1} < p̂_{t+1,µ}}} V_{t+1}(z_S; s_t, S_t)
       sampling η ∼ Bern(·; 0.5), then z_s ∼ ηP1(x′|S̃_t, p̄_{t+1}) + (1 − η)P0(x′|x_t, p̄_{t+1})
       s_{t+1} = Ω_t[s_t − b2(t) ∂f(x_t, y)/∂y|_{y=s_t} (−1)^η V_{t+1}(z_s; s_t, S_t)]
6:   update the hyperparameters:
       decaying σ_t and τ_t
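The loop structure of Algorithm 5 can be sketched on a toy single-item system; the Poisson demand, quadratic feature map Φ, logistic threshold f, and cost terms below are illustrative stand-ins rather than the paper's exact choices, and the sampling of z_s from the P1/P0 mixture is crudely approximated:

```python
import numpy as np

# Toy single-item run of the Algorithm 5 loop (all modeling choices assumed).
rng = np.random.default_rng(1)
f = lambda x, a: 1.0 / (1.0 + np.exp(-(a - x)))   # P(order) rises as x drops below a
Phi = lambda y: np.array([1.0, y, y * y])         # quadratic features for V = w.Phi

w, rho = np.zeros(3), 0.0
s_t, S_t, x = 2.0, 6.0, 3.0
for t in range(200):
    # step 2: noisy (s, S) action, observe transition and cost
    order = (S_t - x) if rng.random() < f(x, s_t) else 0.0
    d = rng.poisson(2.0)                          # step 3: realized demand
    x_next = max(x + order - d, 0.0)
    r = 0.5 * x_next + 4.0 * (order > 0)          # holding + fixed ordering cost
    # step 4: TD update of the relative value function
    g1, g2 = 0.001 / (t // 10 + 1) ** 0.6, 0.01
    v_next, v_cur = w @ Phi(x_next), w @ Phi(x)
    w = w + g1 * (r - rho + v_next - v_cur) * Phi(x)
    rho = rho + g2 * (r + v_next - v_cur - rho)
    # step 5: update s via the Bernoulli-mixed next-state sample
    eta = int(rng.integers(0, 2))
    z_s = (S_t - rng.poisson(2.0)) if eta else x_next   # crude P1 / P0 stand-in
    grad_f = f(x, s_t) * (1.0 - f(x, s_t))              # d/dy f(x, y) at y = s_t
    b2 = 0.01 / (t // 20 + 1) ** 0.7
    s_t = float(np.clip(s_t - b2 * grad_f * (-1.0) ** eta * (w @ Phi(z_s)),
                        0.0, S_t - 1.0))                # Omega_t as a clip projection
    x = x_next
```

The S update of step 5 is omitted here for brevity; it follows the same pattern with the Multinoulli sample µ and z_S.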
Algorithm 6 Discrete variant SRL-PSA for single-item inventory management
1: while satisfying the stopping criteria do
2:   given x_t, take action a_t with noise ϵ_t following the stochastic (s, S) replenishment policy; then, observe the transitioned state x_{t+1} and corresponding reward r_t
3:   attain the realized demand d_t; then, adaptively update the empirical probability p̂_{t+1} and corresponding interpolated density p̄_{t+1}
     update the policy parameters:
       S_{t+1} = S_t − b1(t)(1/K)(1 − f(x_t, s_t)) Σ_{j=0}^{d_max−1} ∆p̂_{t+1,j} Σ_{i=1}^{4} (w_{t+1,i}/(i+1)) Σ_{k=0}^{i} (S̃_t − j)^k (S̃_t − j − 1)^{i−k}
       s_{t+1} = Ω_t[s_t − b2(t) ∂f(x_t, y)/∂y|_{y=s_t} Σ_{i=1}^{4} w_{t+1,i} E_{y∼h(d;p̄_{t+1})}[(x_t − Y)^i − (S̃_t − Y)^i]]
Using the stochastic approximation framework, the partial derivative is approximated by the following formulation with an online sample:

∂g(s, S)/∂S ≈ (1 − f(x, s))(−1)^{1{p̂_{µ+1} < p̂_µ}} V(x′; s, S)
where µ is a random sample from a Multinoulli trial that has probability (1/λ)|∆p̂| on the integer range 0 to d_max − 1 (i.e., µ ∼ Multi(·; (1/λ)|∆p̂|)). Here, λ (:= Σ_{j=0}^{d_max−1} |∆p̂_j|) denotes a normalizing constant that makes the piecewise-constant function ∂P1(x′|S̃, p̄)/∂S = (1/K)|∆p̂| a Multinoulli distribution by satisfying the property of probability. Additionally, a sample for the next state x′ is extracted from the continuous uniform distribution on the range [S̃ − µ − 1, S̃ − µ]; here, µ is fixed after it is sampled from the Multinoulli distribution.
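The two-stage sampling of µ and z_S described above can be sketched as follows, assuming ∆p̂_j is the forward difference p̂_{j+1} − p̂_j of the empirical pmf (the concrete pmf below is illustrative):

```python
import numpy as np

def sample_mu_and_zS(S_tilde, p_hat, rng):
    """mu ~ Multi(.; |dp|/lambda) over {0, ..., d_max - 1}, then
    z_S ~ Uniform(S_tilde - mu - 1, S_tilde - mu)."""
    dp = np.diff(p_hat)                 # assumed: delta p_hat_j = p_hat_{j+1} - p_hat_j
    lam = np.abs(dp).sum()              # normalizing constant lambda
    mu = int(rng.choice(len(dp), p=np.abs(dp) / lam))
    z_S = float(rng.uniform(S_tilde - mu - 1, S_tilde - mu))
    return mu, z_S

rng = np.random.default_rng(0)
p_hat = np.array([0.1, 0.3, 0.4, 0.2])  # empirical pmf on {0, ..., 3} (illustrative)
mu, z_S = sample_mu_and_zS(10.0, p_hat, rng)
```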
The partial derivative of another parameter s is computed as follows:
∂
∂s
g(s, S) ≈
∂
∂s
f (x, s)
Z
′
x ∈X
er
P0 (x′ |x, p̄) − P1 (x′ |S̃, p̄) V (x′ ; s, S)dx′
the Bernoulli sample (i.e., x′ ∼ (1 − η)P0 (x′ |x, p̄) + ηP1 (x′ |S̃, p̄)).
Finally, the discrete variant of the SRL-FSA algorithm updates the policy (s, S) as follows:
S_{t+1} = S_t − b1(t)(1 − f(x_t, s_t))(−1)^{1{p̂_{t+1,µ+1} < p̂_{t+1,µ}}} V_{t+1}(z_S; s_t, S_t)   (L.1)
s_{t+1} = s_t − b2(t) ∂f(x_t, y)/∂y|_{y=s_t} (−1)^η V_{t+1}(z_s; s_t, S_t)   (L.2)
Here, the Multinoulli sample is µ ∼ Multi(·; (1/λ)|∆p̂_{t+1}|) and the Bernoulli sample is η ∼ Bern(·; 0.5); moreover, the random samples are z_S ∼ Uniform(S̃_t − µ − 1, S̃_t − µ) and z_s ∼ ηP1(x′|S̃_t, p̄_{t+1}) + (1 − η)P0(x′|x_t, p̄_{t+1}).
Appendix M. Policy update in a discrete variant SRL-PSA for a single-item inventory system
In the case of the SRL-PSA algorithm, the partial gradient of the objective
function can be obtained by computing the integral. The partial derivative of
the objective function with respect to the order-up-to level S is expressed as
follows:
∂g(s, S)/∂S ≈ (1/K)(1 − f(x, s)) Σ_{j=0}^{d_max−1} ∆p̂_j ∫_{S̃−j−1}^{S̃−j} V(z; s, S) dz
           = (1/K)(1 − f(x, s)) Σ_{j=0}^{d_max−1} ∆p̂_j Σ_{i=1}^{4} (w_i/(i+1)) [z^{i+1}]_{S̃−j−1}^{S̃−j}
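The bracketed antiderivative over a unit interval is what later telescopes into the inner sum of (M.1), since (b^{i+1} − a^{i+1})/(i+1) = Σ_{k=0}^{i} b^k a^{i−k}/(i+1) when b − a = 1. A quick exact check of this identity:

```python
from fractions import Fraction

def integral_z_pow(i, a):
    """Exact integral of z^i over [a, a + 1] via the antiderivative."""
    return (Fraction(a + 1) ** (i + 1) - Fraction(a) ** (i + 1)) / (i + 1)

def telescoped(i, a):
    """The closed form appearing in (M.1): sum_{k=0}^{i} b^k a^(i-k) / (i + 1)."""
    b = a + 1
    return sum(Fraction(b) ** k * Fraction(a) ** (i - k)
               for k in range(i + 1)) / (i + 1)

# The two expressions agree for every degree used by the quartic approximator.
for i in range(1, 5):
    for a in [0, 1, 2, 5]:
        assert integral_z_pow(i, a) == telescoped(i, a)
```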
Additionally, the partial derivative with respect to the reorder level s is given as follows:
∂g(s, S)/∂s ≈ ∂f(x, s)/∂s Σ_{i=1}^{4} w_i E_{y∼h(d;p̄)}[(x − Y)^i − (S̃ − Y)^i]
The expectation is regarded as a polynomial function of the moments of the interpolated demand distribution and, thus, it is easily presented in closed form.
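Concretely, E[(x − Y)^i] expands into raw moments of Y via the binomial theorem, E[(x − Y)^i] = Σ_k C(i, k) x^{i−k} (−1)^k E[Y^k]. A minimal sketch with an assumed discrete stand-in for the interpolated demand distribution:

```python
from math import comb

def expected_shift_pow(x, pmf, i):
    """E[(x - Y)^i] via the binomial expansion over the raw moments of Y."""
    moments = [sum(p * y ** k for y, p in pmf.items()) for k in range(i + 1)]
    return sum(comb(i, k) * x ** (i - k) * (-1) ** k * moments[k]
               for k in range(i + 1))

# Assumed discrete stand-in for the interpolated demand distribution.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}
val = expected_shift_pow(4.0, pmf, 3)
# Cross-check against the direct expectation E[(4 - Y)^3].
direct = sum(p * (4.0 - y) ** 3 for y, p in pmf.items())
assert abs(val - direct) < 1e-12
```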
The SRL-PSA algorithm updates the policy (s, S) as follows:
S_{t+1} = S_t − b1(t)(1/K)(1 − f(x_t, s_t)) Σ_{j=0}^{d_max−1} ∆p̂_{t+1,j} Σ_{i=1}^{4} (w_{t+1,i}/(i+1)) Σ_{k=0}^{i} (S̃_t − j)^k (S̃_t − j − 1)^{i−k}   (M.1)
s_{t+1} = s_t − b2(t) ∂f(x_t, y)/∂y|_{y=s_t} Σ_{i=1}^{4} w_{t+1,i} E_{y∼h(d;p̄_{t+1})}[(x_t − Y)^i − (S̃_t − Y)^i]   (M.2)
lim_{t→∞} b_p(t)/γ_1(t) = 0 ∀p = 1, 2. For comparison purposes, we use the same step sizes for both the SRL-FSA and the SRL-PSA algorithms. The step sizes for the relative value function are γ_1(t) = 0.001/(⌊t/10⌋ + 1)^{0.6} and γ_2(t) = 0.01; further, the step sizes for the policy parameters are b_1(t) = 0.01/(⌊t/5⌋ + 1) and b_2(t) = 0.01/(⌊t/20⌋ + 1)^{0.7}. The precision hyperparameters for the stochastic replenishment policy are defined in a decaying form as τ_t = τ_0/(⌊t/10⌋ + 1)^{0.8} and σ_t = σ_0/(⌊t/10⌋ + 1)^{0.8}.
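Written out as code, these schedules look as follows; τ_0 and σ_0 are left as parameters since only their decay is specified. The policy step sizes b_1 and b_2 decay faster than the critic step size γ_1, consistent with the two-timescale condition lim_{t→∞} b_p(t)/γ_1(t) = 0:

```python
import math

gamma1 = lambda t: 0.001 / (math.floor(t / 10) + 1) ** 0.6   # critic step size
gamma2 = lambda t: 0.01                                      # average-cost step size
b1 = lambda t: 0.01 / (math.floor(t / 5) + 1)                # step size for S
b2 = lambda t: 0.01 / (math.floor(t / 20) + 1) ** 0.7        # step size for s
tau = lambda t, tau0: tau0 / (math.floor(t / 10) + 1) ** 0.8     # precision decay
sigma = lambda t, sigma0: sigma0 / (math.floor(t / 10) + 1) ** 0.8
```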
Table O.4 compares the optimized policy parameters of the proposed algo-
rithm with those of the full enumeration heuristic under the single-item inven-
tory system with a truncated-normal demand distribution.
Table O.4: Comparison of the proposed structured RL algorithm with the full enumeration heuristic under the single-item inventory system with a truncated-normal demand distribution

             Parameter           Metric
Algorithm    s      S      Average cost   Difference (%)
Full enum.   3.00   4.00   1.151          -
SRL-FSA      2.69   3.75   1.155          0.35
SRL-PSA      2.79   4.00   1.152          0.09
Table O.5 and Table O.6 compare the proposed algorithm with the static replenishment policy.

Table O.5: Comparison of the proposed structured RL algorithm with the static replenishment policy
Figure P.8 presents the convergence graph of the proposed algorithm and the benchmarking heuristic under the truncated-normal demand distribution.
Table O.6: Comparison of the proposed structured RL algorithm with the static replenishment policy under the two-item inventory system with regime-switching demand

             Metric
Algorithm    Average cost   Improve rate (%)
Static       5.870          -
SRL-FSA      5.446          7.79
SRL-PSA      5.176          13.41
Figure P.9 illustrates the convergence graph of the proposed algorithm and the
benchmarking heuristic under the two-item inventory system.
Figure P.8: Convergence graphs for the average cost of the proposed algorithm and
benchmarking heuristic under the single-item inventory system with a truncated-
normal demand distribution.
Figure P.9: Convergence graphs for the average cost of the proposed algorithm and the benchmarking heuristic under the two-item inventory system with a Gamma demand distribution.
Appendix Q. Sales trend of the items and their unit cost settings
Figure Q.10: Daily sales for a Korean retail shop. (a) Sunblock; (b) Deodorant.
Table Q.7: Unit costs for the case study

             Unit cost (KRW)
Item         cO       cH     cB       K
Sunblock     14,250   9.5    19,000   38,000
Deodorant    8,625    5.75   11,500   23,000
Appendix R. Algorithmic complexity analysis of the proposed methods
The multivariate fourth-degree polynomial approximation of the value function dominates all the other operations. Although the benchmarking heuristic (i.e., AC) also has quartic complexity when approximating the value function, it has distinct complexity for its policy approximator, denoted as W(N) (e.g., a neural network or another complex approximator whose model size is much larger than that of the polynomial value function). Therefore, the algorithmic complexity of the proposed algorithm is more efficient than that of the benchmarking heuristic (i.e., the AC). The analysis shows that the proposed method has better algorithmic complexity than typical RL algorithms that do not consider any structural property.
Despite these advantages of the proposed method, the quartic complexity damages the scalability for the general size of the multi-item model. To address the scalability issue, the model complexity of the polynomial approximator is relaxed by removing all the interaction terms among different items. We can conduct this relaxation in the belief that the true value function has a tractable structure (e.g., multi-dimensional K-convexity) for basic multi-item inventory management problems. With this modification, the space complexity reduces from quartic to linear; further, the computational burden, which arises from the policy updates, reduces to quadratic. The improved algorithmic complexity is summarized in Table R.9. The algorithmic complexities of the modified algorithms still outperform the comparative baselines, which verifies that the proposed structured RL method can be scalable.
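The quartic-to-linear reduction can be made concrete by counting monomials: a full fourth-degree polynomial in N state variables has C(N + 4, 4) ≈ N^4/24 terms, whereas dropping all cross-item interactions leaves only 4N + 1 single-item terms. A small sketch (the helper names are ours):

```python
from math import comb

def full_poly_terms(N, degree=4):
    """Monomials of total degree <= degree in N variables: C(N + degree, degree)."""
    return comb(N + degree, degree)

def relaxed_poly_terms(N, degree=4):
    """No cross-item interactions: a constant plus `degree` powers per item."""
    return 1 + degree * N

print(full_poly_terms(10), relaxed_poly_terms(10))   # -> 1001 41
```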
Table R.9: Algorithmic complexity of structured RL with relaxed value function approximation

             Complexity
Algorithms   Time            Space
Full enum.   O(2^N)          O(N)
AC           O(W(N) + N)     O(W(N) + N)
SRL-FSA      O(N^2)          O(N)
SRL-PSA      O(N^2)          O(N)