
Adaptive Inventory Replenishment using Structured Reinforcement Learning by Exploiting an Optimal Policy Structure

Hyungjun Park1, Dong Gu Choi2, Daiki Min3∗

1 Department of Industrial and Management Engineering, Pohang University of
Science and Technology, 77 Cheongam-Ro, Nam-Gu, Pohang, Gyeongbuk 37673,
Rep. of Korea, Email: hjhjpark94@postech.ac.kr

2 Department of Industrial and Management Engineering, Pohang University of

Science and Technology, 77 Cheongam-Ro, Nam-Gu, Pohang, Gyeongbuk 37673,
Rep. of Korea, Email: dgchoi@postech.ac.kr

3 School of Business, Ewha Womans University, 52, Ewhayeodae-Gil, Seodaemun-Gu,
Seoul, 03760, Rep. of Korea, Email: dmin@ewha.ac.kr , Office: +82-2-3277-3923
* Corresponding author, Email: dmin@ewha.ac.kr , Office: +82-2-3277-3923

Abstract

We consider an inventory replenishment problem with unknown and switching demand. We design a structured reinforcement learning algorithm that efficiently adapts the replenishment policy to changing demand without any prior knowledge. Our proposed method integrates the known structural properties of an optimal inventory replenishment policy with reinforcement learning. By exploiting the optimal policy structure, we tune reinforcement learning to characterize the inventory replenishment policy and approximate the value function. In particular, we propose two methods for stochastic approximation on the gradient of the objective function. These novel reinforcement learning algorithms ensure an efficient convergence rate and lower algorithmic complexity for solving practical problems. The numerical results demonstrate that the proposed algorithms adaptively update the policy to changing demand and raise operational efficiency compared to a static replenishment policy. We also conduct a case study for a retail shop in South Korea to validate the practical feasibility of the proposed method. Understanding the optimal policy structure is beneficial for designing reinforcement learning algorithms that can address the inventory replenishment problem. These well-designed reinforcement learning algorithms are particularly promising when we require policy updates based on observations without precise knowledge of switching demand. These research findings could be extended to address the various inventory decisions in which optimal policy structures are available.


Keywords: Inventory Replenishment Policy, Reinforcement Learning, Structural Properties, Stochastic Approximation

1. Introduction

Operations management scholars and practitioners have for decades exam-


ined the inventory replenishment problem to achieve operational efficiency in

related fields. The inventory replenishment problem aims to control order-
ing decisions so that operational costs are minimized over time. In addition,
because the ordering decision in the current period affects inventory levels in
subsequent periods, the problem is regarded as a sequential decision-making

problem. These problems are formulated as a Markov decision process and the
solution is traditionally derived in the form of a policy (Puterman, 2014). The

optimal policies and related structures for some classes of inventory replenish-
ment problems have been analytically characterized in previous studies. For
instance, the (s, S) replenishment policy, whose parameters represent the re-
order and order-up-to levels, is an optimal policy under inventory management
incurring setup costs (Scarf, 1960). This replenishment policy orders up to the
order-up-to level when the inventory level drops to the reorder level.
In this study, we examine a multi-item inventory replenishment problem.

To exploit the known structure of an optimal policy, we define the problem


by assuming instantaneous replenishment (i.e., no lead time) in the inventory

systems. Further, the inventory systems periodically review the inventory level
and have setup costs (joint setup costs for multi-item cases). With these problem
definitions, the optimal policy of the single-item case is known as the (s, S)
replenishment policy. Although no general structure of this optimal policy for
the multi-item case exists, an (s, c, S) joint replenishment policy is regarded as


an implementable policy structure (Ignall, 1969) whose parameters represent
the reorder, can-order, and order-up-to levels. The joint replenishment policy

orders all items whose inventory levels are below their can-order levels up to their order-up-to levels whenever at least one item's inventory drops to its reorder level.
Inventory replenishment problems analytically develop the optimal policy
under the assumption that the demand distribution is known a priori and is

stationary, which means that a decision-maker should update the optimal pol-
icy whenever observing a change in demand. However, the demand observed in
practical inventory systems is commonly ill-specified and has a non-stationary
nature such as regime-switching movements (Chen, 2021). These properties
incur delays in identifying demand changes and updating the inventory poli-

cies. Furthermore, conventional methods such as dynamic programming are
less responsive to changes in demand because of their immense computational
requirements at every regime-switching point (Keskin et al., 2022). To overcome

these limitations, data-driven approaches have been employed to deal with the
unknown and non-stationary nature of several uncertainties (Chen, 2021; Huh

and Rusmevichientong, 2009; Keskin et al., 2022; Shi et al., 2016).
Reinforcement learning (RL), a sub-branch of machine learning suitable for
optimizing sequential decision-making, has been leveraged to address practical
inventory control problems (Giannoccaro and Pontrandolfo, 2002; Gijsbrechts
et al., 2022; Jiang and Sheng, 2009; Oroojlooyjadid et al., 2021). Recent RL
methods tend to approximate the policy or value functions using parametric
functions such as neural networks. As they can automatically characterize a

reasonable demand distribution and adapt a policy to switching the regime, a


near-optimal policy is attained even when full information is unavailable. In
addition, the online learning nature does not require demand data in advance;

instead, RL methods learn and optimize the policy by observing demand on-the-
fly. Despite these good properties, however, the high computational cost and
slow convergence rate are drawbacks of RL methods (Kunnumkal and Topaloglu,

2008). Therefore, they fail to adapt a policy to the changing environment and
become inefficient in a large inventory system.
In this paper, we propose a novel RL method that provides good learning
behavior by exploiting the structural properties of an optimal policy. For ex-

ample, Scarf (1960) introduced K-convexity to prove the optimality of an (s, S)


policy. We use this K-convexity to design the RL method for an inventory
replenishment problem. In addition, the proposed method directly updates a
replenishment policy such as the (s, S) policy for a single-item problem rather

than learning action-value functions. Thanks to these benefits, it rapidly char-
acterizes observed demand and adapts the replenishment policy to switching
demand. We show that the proposed method provides near-optimal policies
for single- and multi-item problems by exploiting the policy structures. Ad-
ditionally, a case study analysis shows that the proposed method has better

operational efficiency in the retail industry.
The main contributions of this study can be summarized as follows:
• We develop a structured RL algorithm that optimizes inventory replenishment policies without prior knowledge of the demand distribution.
• The well-designed RL algorithm adaptively updates the policy in response to the switching demand distribution, allowing us to achieve operational efficiency.
• We obtain near-optimal policies for inventory systems with extended item sizes and various demand distributions.
• The proposed algorithm has efficient algorithmic complexity and a superior convergence rate by exploiting the structural property of an optimal policy.
The remainder of this paper is organized as follows. Section 2 reviews related work and Section 3 describes our inventory management
problem. Section 4 discusses the methodology of the proposed structured RL
algorithm for inventory management. Section 5 presents an experimental study

to verify the advantages of our proposed method. Section 6 demonstrates the


practical feasibility of the proposed method. Lastly, Section 7 concludes by
discussing the research implications and presenting future research directions.

2. Relevant Literature

Much of the literature has studied the existence and structural properties of
optimal policies to address inventory replenishment problems. As an example

of an earlier study, Scarf (1960) considered an inventory system without a setup


cost and showed that the order-up-to (base-stock) policy is optimal and that
the corresponding value function is convex. In addition, they showed that the
optimal policy in a system with a setup cost is the (s, S) replenishment policy

and that the corresponding value function has K-convexity structure. For multi-
item inventory systems that require various items to be ordered jointly, Balintfy
(1964) introduced an (s, c, S) joint replenishment policy that is a reasonable
policy structure. Ignall (1969) showed that the (s, c, S) joint replenishment
policy is optimal for a simple two-item inventory system. Girlich and Barche

(1991) generalized the optimal policy for a multi-item case and analyzed the
optimality of the (σ, S) replenishment policy under a certain condition of the
Wiener demand process. Under this replenishment policy, if the vector of the

inventory level x whose elements represent the inventory level of each item
belongs to the reorder set σ, the inventory is replenished up to the corresponding

element of the order-up-to level vector S. However, the joint order replenishment
problem is well known as NP-hard (Cohen-Hillel and Yedidsion, 2018), and thus,
optimizing its policy is quite difficult.
Characterizing the demand distribution is hard in practice, and various fac-
tors such as technological advances (e.g., expansion of electric vehicles) and
economic turmoil (e.g., the COVID-19 pandemic) change the demand distribu-
tion irregularly; thus, inventory management studies have attempted to extend

the problem using unknown and switching demand distributions. Earlier stud-
ies assumed that complete information on when and how the demand distri-
bution changes is known. For example, Song and Zipkin (1993) showed that

the situation-dependent policy is optimal for inventory systems with switching


demand; however, their analysis assumed complete information.
To address the use of this impractical assumption, data-driven methods have

recently been adopted to solve inventory management problems. Huh and Rus-
mevichientong (2009) proposed an adaptive gradient-based algorithm to opti-
mize the order-up-to level of the base-stock policy for an inventory system with
lost sales. Their approach traced a virtual constraint-free order-up-to level and

attained the minimum expected cost without making an assumption about the
demand distribution. Shi et al. (2016) proposed an algorithm to optimize the
base-stock policy for a multi-item inventory system with capacity constraints.
Their algorithm defined an additional virtual order-up-to level that represents

the target inventory level when the capacity constraint is relaxed. Chen and
Chao (2020) proposed an online learning algorithm to optimize the base-stock
policy for a multi-item inventory system with stockout substitution. The algo-
rithm estimated the demand distributions and substitution probability by con-
ducting a novel exploration phase. Further, data-driven methods must consider

an unknown demand distribution as well as regime-switching demand patterns
when knowledge on switching points is lacking. Chen (2021) proposed a non-
parametric learning algorithm for inventory management with lost sales under

regime-switching demand. The algorithm placed excessive orders to figure out
true realized demand on the censored observation and estimate the demand

distribution in the recent batch time period to distinguish unknown changing
points. Keskin et al. (2022) developed an online learning algorithm for joint
inventory and pricing problems in a regime-switching environment.
As with data-driven inventory management, RL is a promising method for
addressing inventory management problems in which the demand distribution
is unknown and switching over time. Over the past decade, various RL meth-
ods have been applied to tackle inventory management problems. Giannoccaro

and Pontrandolfo (2002) proposed a RL algorithm to solve a multi-stage supply


chain inventory model. They integrated multiple RL agents into all stages and
decided the cooperative decisions to minimize the overall operational costs of

the supply chain. To coordinate decisions, they defined the joint state/action
space of the agents and employed a Q-learning algorithm to optimize a coordi-
nated ordering policy. Likewise, Oroojlooyjadid et al. (2021) proposed a deep

Q-learning-based algorithm to overcome incomplete information in supply chain


management when each agent only observes local information. Gijsbrechts et al.
(2022) employed actor-critic-based deep RL method to construct a near-optimal
replenishment policy for perishable inventory management. De Moor et al.

(2022) demonstrated that a deep RL framework can improve the performance


of lost sales, dual sourcing, and multi-echelon inventory management problems,
among others. Recently, Preil and Krapp (2022) formulated supply chain man-
agement as bandit-based optimization and proposed two efficient algorithms to

solve the problem.
To the best of our knowledge, few studies investigate how to use the RL
method to optimize an inventory replenishment policy, considering the struc-
tural properties of an optimal policy and incomplete information on changing
demand. Jiang and Sheng (2009) employed a case-based RL to optimize the

order-up-to level of the (s, S) policy for a multi-echelon inventory system with
switching demand. However, the previous study fails to fully optimize the pa-
rameters of replenishment policy as we do in this study. As they only consider

a part of the replenishment policy such as either the order-up-to level or the
reorder level, their findings on optimal policies are limited. We also contribute

to the literature by investigating a multi-item inventory system.
Lastly, we leverage the known structural properties of an optimal replenish-
ment policy (e.g., K-convexity and (s, S) replenishment policy), which enables
the proposed method to take advantage of the fast convergence rate and efficient
algorithmic complexity. In the literature, several studies integrate the policy
structure within the RL framework for signal processing (Sharma et al., 2020),
power transmission scheduling (Fu and van der Schaar, 2012), and Markov de-

cision process problems under several structural assumptions (Kunnumkal and


Topaloglu, 2008; Djonin and Krishnamurthy, 2007; Roy et al., 2021). The inven-
tory replenishment policy has different structural characteristics from those in

these existing studies. We first propose a model that fully integrates the struc-
tural properties into the RL framework to solve an inventory replenishment
problem. As a result, the proposed method derives a near-optimal replenish-

ment policy without any prior knowledge of the demand distribution and rapidly
adapts the policy in response to the switching environment.

3. Problem Description

The objective of inventory management is to minimize the per-period cost


over the planning horizon by determining the sequence of ordering decisions.
Under our illustrative inventory replenishment problem, we must decide how
many of each of the N items to replenish and when. That is, the problem has the

form of a single-item inventory system if N = 1. We use the following notations
in the analysis:
• $c_O^n$: unit ordering cost for item $n = 1, ..., N$
• $c_H^n$: unit holding cost for item $n = 1, ..., N$
• $c_B^n$: unit backlogging cost for item $n = 1, ..., N$
• $K$: constant setup cost
• $x_t^n$: inventory level of item $n = 1, ..., N$ in period $t$
• $a_t^n$: order amount for item $n = 1, ..., N$ at the beginning of period $t$
• $d_t^n$: demand for item $n = 1, ..., N$ in period $t$
• $\delta$: joint order discountable ratio ($0 \le \delta < 1$)

The time period is distinguished by index t. Referring to the literature (Chen
and Chao, 2020; Shi et al., 2016), this problem considers a zero lead time, zero
purchasing revenue for all items, and a periodic review inventory system over
an infinite time horizon; i.e., inventory of each item is reviewed at fixed and
constant time intervals. In addition, we assume a full backlogged system in
which excessive demand is carried over to the following period and expressed
as a negative inventory level. A decision-maker observes the current inventory

level $x_t^n$ at the beginning of period $t$ and decides whether to make an order for replenishment. Thereafter, stochastic demand $d_t^n$ arises during the period. For simplicity, we use $x'$ to denote the following inventory level. Vectors $x = (x^1, x^2, ..., x^N)^T \in \mathbb{R}^N$ and $a = (a^1, a^2, ..., a^N)^T \in \mathbb{R}^N$ represent the inventory levels and replenishment amounts of the corresponding $N$ items. We consider demand to be independent across time periods and items. The corresponding system dynamics follow $x_{t+1}^n = x_t^n + a_t^n - d_t^n \;\forall n = 1, ..., N$. The problem has a well-defined cost structure that is convex in the holding and backlogging amounts for each period, represented as $L^n(x) = c_B^n[-x]^+ + c_H^n[x]^+$, where $[x]^+ = \max\{0, x\}$. Additionally, the joint order discountable setup cost

is imposed, which can be discounted when replenishment occurs by ordering


different types of items jointly. The immediate convex cost function and function
with the joint order discountable setup cost are denoted as in equation (1):

$$r_t(x, a, x') = hKv\,\mathbb{1}\{h > 0\} + \sum_{n\in[N]} \left\{c_O^n a^n + L^n(x'^n)\right\} \qquad (1)$$

where $h = \sum_{n\in[N]} \mathbb{1}\{a^n > 0\}$ and the joint-order discount factor is $v = 1 - \delta + \delta/h$. The discountable part of the setup cost is proportionally saved as the number of joint orders increases. All the unit costs are non-negative constants, and the condition $c_B^n > c_O^n \;\forall n = 1, ..., N$ should hold so that the do-nothing policy is not optimal.

Although the cost is regarded as negative feedback, we transform the per-period
costs into a “reward” without loss of generality.
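To make the cost structure concrete, the following minimal Python sketch evaluates the per-period cost in equation (1) for given order and next-inventory vectors. The function name and numeric values are ours, and the discount factor $v = 1 - \delta + \delta/h$ follows the reconstruction above; this is an illustrative sketch, not the authors' code.

```python
import numpy as np

def per_period_cost(a, x_next, c_o, c_h, c_b, K, delta):
    """Per-period cost of equation (1): joint-order discounted setup cost
    plus ordering, holding, and backlogging costs for each item."""
    a, x_next = np.asarray(a, float), np.asarray(x_next, float)
    h = int(np.sum(a > 0))                               # number of items ordered jointly
    v = 1.0 - delta + delta / h if h > 0 else 0.0        # joint-order discount factor
    setup = h * K * v if h > 0 else 0.0
    holding_backlog = c_b * np.maximum(-x_next, 0) + c_h * np.maximum(x_next, 0)
    return setup + float(np.sum(c_o * a + holding_backlog))

# Example with the unit costs used later in the paper (c_O=0.3, c_H=0.1, c_B=0.5, K=0.1, delta=0.9).
print(per_period_cost(a=[2.0, 0.0], x_next=[1.5, -0.5],
                      c_o=0.3, c_h=0.1, c_b=0.5, K=0.1, delta=0.9))
```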
The solution to this problem is expressed with a policy and the optimal

policy is attained under the long-term average reward criterion by minimizing the objective function $\rho^\pi = \lim_{T\to\infty} \frac{1}{T} E_\pi\left[\sum_{t=1}^{T} r_t\right]$. The optimal policy of a
single-item problem (N = 1) is the well-known (s, S) replenishment policy.
Moreover, if there are two items (N = 2), the optimal policy becomes the
(s, c, S) joint replenishment policy (Ignall, 1969). Although no optimal policy structure is known for a problem with more than two items (i.e., N > 2), the (s, c, S) joint
replenishment policy is practically implementable and performs reasonably well.
Moreover, we exploit the (s, c, S) joint replenishment policy structure because
the structure outperforms other joint replenishment policies under the periodic

review inventory system with irregular demand (Johansen and Melchiors, 2003).
To reflect reality, we consider that the demand distribution changes over

time, which means that the parameters of the demand distribution sequen-
tially change. For example, assuming a two-parameterized demand distribution,

switching demand is represented by the sequence of parameters $(\alpha^n(l_t^n), \beta^n(l_t^n))$, where $l_t^n = \max\{k \in \mathbb{Z} : T_k^n \le t\}$ and $T_k^n$ is the period in which the $k$-th demand switches to $(\alpha^n(l_t^n), \beta^n(l_t^n))$ for item $n$. Under these circumstances, Song and
Zipkin (1993) proposed the world-dependent policy, whose parameters change
for each demand phase, and proved its optimality for a single-item inventory

system, as in Theorem 1.

Theorem 1. A world-dependent $(s(\omega), S(\omega))$ policy with a world state $\omega$ is optimal for the single-item inventory model with demand-switching and fixed cost.
Although the world-dependent policy does not guarantee optimality for a multi-

item inventory system, it is practically reasonable to consider an (s(ω), c(ω), S(ω))
joint replenishment policy for further analysis. As we have no prior information
on when and how these parameters change, however, the analytical methods
are not applicable for attaining an optimal policy. Therefore, we consider a
data-driven learning algorithm to tackle the problems.

4. Structured Reinforcement Learning Algorithm

In this section, we explain the structured RL algorithm proposed to solve the

inventory replenishment problem. We introduce a mathematical basis of the pro-
posed method and then present its two types: structured RL with full stochastic

approximation (SRL-FSA) and partial stochastic approximation (SRL-PSA).

4.1. Preliminaries for optimizing an inventory replenishment policy

4.1.1. Stochastic replenishment policy


As aforementioned, the optimal (or reasonable) policy is known as the (s, S)
replenishment policy and (s, c, S) joint replenishment policy for single- and
multi-item inventory systems, respectively. However, these inventory policies
cannot satisfy the property that all stationary policies take the same single

communicating class in a Markov chain, which is regarded as a desirable characteristic for ensuring the convergence of RL. Since the replenishment policies have

a certain fixed order-up-to level S, transitioning to states that have higher in-
ventory levels than the order-up-to level is impossible; furthermore, the fixed
reorder level s limits the lower bound of the transition state. Therefore, replen-
ishment policies with distinct parameters take different communicating classes.

To satisfy the desirable property of the same single communicating class,


we define a novel policy family, namely, the stochastic replenishment policy.
The stochastic policy randomly derives both the reorder and the order-up-to

levels; thus, the transitioned state (i.e., inventory level) under the policy can reach all states, which means that any stationary policy within the stochastic policy structure takes a single communicating class. Specifically, the
decision rules of the proposed stochastic (s, S) replenishment and (s, c, S) joint

replenishment policies are represented by equations (2) and (3), respectively:

$$a = \begin{cases} S + \epsilon - x & \text{if } u > f(x, s) \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $u \sim \mathrm{Uniform}[0, 1]$ and the noise variable $\epsilon \sim N(0, \sigma^2)$. Further, $f(x, y) = \frac{1}{1 + e^{-(x - y)/\tau}}$ is the $y$-shifted sigmoid function for the mixing probability of the stochastic replenishment policy.



$$a^n = \begin{cases} S^n + \epsilon^n - x^n & \text{if } u_1^n > \prod_{-n} f(x^{-n}, s^{-n}) \text{ and } u_2^n > f(x^n, c^n) \\ S^n + \epsilon^n - x^n & \text{if } u_1^n \le \prod_{-n} f(x^{-n}, s^{-n}) \text{ and } u_2^n > f(x^n, s^n) \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

where $u_1^n, u_2^n \sim \mathrm{Uniform}[0, 1] \;\forall n = 1, ..., N$, $\prod_{-n}$ denotes the product over the other items, and the noise variable $\epsilon \sim N(0, \sigma^2 I_N)$. Hereafter, the perturbed order-up-to level $S^n + \epsilon^n$ is denoted as $\tilde{S}^n$.
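As a concrete illustration, the following Python sketch samples a single-item action from the stochastic (s, S) policy of equation (2). The function names and hyperparameter values are ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid_shifted(x, y, tau):
    """y-shifted sigmoid f(x, y) = 1 / (1 + exp(-(x - y) / tau))."""
    return 1.0 / (1.0 + np.exp(-(x - y) / tau))

def stochastic_sS_action(x, s, S, tau=0.5, sigma=0.1):
    """Sample an order quantity from the stochastic (s, S) policy (equation (2))."""
    u = rng.uniform()
    if u > sigmoid_shifted(x, s, tau):      # inventory is (probably) below the reorder level
        eps = rng.normal(0.0, sigma)        # perturbed order-up-to level S_tilde = S + eps
        return S + eps - x
    return 0.0                              # otherwise, do not order

# As tau and sigma decay toward zero, this recovers the deterministic (s, S) rule.
print(stochastic_sS_action(x=2.0, s=4.0, S=7.0))
```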
This stochastic replenishment policy acts as an exploration process while
pe
following the forms of original replenishment policies. From Propositions 1 and
2, the stochastic replenishment policies converge to the original deterministic re-
plenishment policies by adjusting the precision hyperparameters, and the proofs
are straightforward (proof is provided in Appendix A).

Proposition 1. The stochastic (s, S) replenishment policy converges to the corresponding deterministic (s, S) replenishment policy as $\tau \to 0$ and $\sigma^2 \to 0$.

Proposition 2. The stochastic (s, c, S) joint replenishment policy converges to the deterministic (s, c, S) joint replenishment policy as $\tau \to 0$ and $\sigma^2 \to 0$.

4.1.2. Relative value function approximation



The optimal value function of the single-item inventory replenishment prob-


lem is K-convex with one inflection point; we use this structural property to
approximate the value function. However, no existing differentiable function

satisfactorily represents the K-convex value function. Therefore, we approxi-


mate the value function as a fourth-degree polynomial function, which fits the
value function well while also attaining reasonable model complexity.
In addition, we use the relative value function approximation technique to

approximate the average reward value. The relative value of a certain representative state should be zero, and state '0' (i.e., $x = 0$) is regarded as the distinguished state. This characteristic allows us to ignore the intercept coefficient of the polynomial function, and the approximated value function passes through the origin. The relative value function approximation for a single-item problem is given as $V(x; s, S, w) = w^T \Phi(x)$, where $w \in \mathbb{R}^4$ and the polynomial basis is $\Phi(x) = [\phi_i(x)]^T_{i=1,...,4}$; i.e., $\phi_i(x) = x^i$. Equations (4) and (5) show the online
updates of the relative value function for a single-item problem:

$$w_{t+1} = w_t + \gamma_1(t)\{r(x, a, x') - \rho_t + V_t(x'; s, S, w_t) - V_t(x; s, S, w_t)\}\Phi(x) \qquad (4)$$
$$\rho_{t+1} = \rho_t + \gamma_2(t)\{r(x, a, x') + V_t(x'; s, S, w_t) - V_t(x; s, S, w_t) - \rho_t\} \qquad (5)$$

The approximated relative value function evaluates a policy under an average
reward criterion (Singh, 1994). To achieve the stable convergence of the rela-
tive value function update (equation (4)), the step size satisfies the conditions
of $\sum_{t}^{\infty} \gamma_1(t) = \infty$ and $\sum_{t}^{\infty} \gamma_1(t)^2 < \infty$. The relative value function approx-
imation can be extended to a multi-item problem. We use the same update
procedure as in equations (4) and (5). However, the approximator has a dif-
ferent structure to accommodate a multidimensional polynomial function. The

basis and approximation functions are expressed as $V(x; s, c, S, w) = w^T \Phi(x)$, where $w \in \mathbb{R}^M$, $M = {}_{N+4}C_4 - 1$, and the multidimensional polynomial basis is $\Phi(x) = \left[\prod_{n=1,...,N} \phi_{i_n}(x^n)\right]^T_{\{(i_1,...,i_N):\, 1 \le \sum_n i_n \le 4\}} \in \mathbb{R}^M$.
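A minimal Python sketch of the online updates in equations (4) and (5) for the single-item case is given below. The placeholder transitions and the particular step-size schedules are ours, chosen only to satisfy the stated conditions.

```python
import numpy as np

def phi(x):
    """Fourth-degree polynomial basis Phi(x) = (x, x^2, x^3, x^4), so that V(0) = 0."""
    return np.array([x, x**2, x**3, x**4])

def relative_value_update(w, rho, x, r, x_next, gamma1, gamma2):
    """One step of the online updates in equations (4) and (5)."""
    v, v_next = float(w @ phi(x)), float(w @ phi(x_next))
    w_new = w + gamma1 * (r - rho + v_next - v) * phi(x)   # equation (4)
    rho_new = rho + gamma2 * (r + v_next - v - rho)        # equation (5)
    return w_new, rho_new

# Toy usage with placeholder transitions; the decaying step sizes satisfy
# sum_t gamma(t) = infinity and sum_t gamma(t)^2 < infinity.
w, rho = np.zeros(4), 0.0
for t in range(1, 500):
    w, rho = relative_value_update(w, rho, x=1.0, r=-0.5, x_next=0.5,
                                   gamma1=1.0 / t, gamma2=0.1 / t)
print(w, rho)
```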

4.1.3. Transition probabilities


This section describes why we consider a policy gradient and explains how

to update the policy by exploiting the structural property of an optimal policy.


We first explain how to find the policy gradient of the objective function under
a replenishment policy. Then, we derive the formula that updates the policy
parameters in the descent direction of the gradient. The proposed algorithm

has a convergence property; Appendix B presents the details of the proof.


The proposed policy optimization requires a distributional assumption to
exploit the structural property of an optimal policy. For an inventory system
with continuous demand, the exponential family is practically convenient and

a generic way to fit the demand distribution. Within the exponential family,
we adopt the Gamma distribution, which captures similar shapes of a general
demand distribution (e.g., right-skewed) and has desirable properties such as
differentiable and integrable. Although we develop the proposed algorithm un-
der this distributional assumption for practical convenience, the algorithm does

not necessarily require the true distribution and learns the optimal policy for
any demand distribution.
For a given optimal (s, S) replenishment policy, the transition probability

is separately defined in terms of whether the current inventory level exceeds the reorder level $s$. $P_0(x'|x)$ denotes the state transition probability when the current inventory level exceeds the reorder level $s$, and $P_1(x'|S)$ denotes the other case. Given the Gamma distributional assumption, i.e., $h(d; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} d^{\alpha-1} e^{-\beta d}$, $P_0(x'|x, \alpha, \beta) = h(x - x'; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}(x - x')^{\alpha-1} e^{\beta(x' - x)}$ and $P_1(x'|\tilde{S}, \alpha, \beta) = h(\tilde{S} - x'; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}(\tilde{S} - x')^{\alpha-1} e^{\beta(x' - \tilde{S})}$ give each of the state transition probabilities via a change of variables of the probability distribution.
We denote the transition probability from the current state x to the next
state x′ corresponding to the policy parameter (s, S) by Px,x′ (s, S). The tran-

sition probability for the stochastic (s, S) replenishment policy is Px,x′ (s, S) =
f (x, s)P0 (x′ |x, α, β) + (1 − f (x, s))P1 (x′ |S̃, α, β). Here, the sigmoid function
f (x, s) is regarded as a mixing probability, and the noise ϵ of S̃ is realized be-

fore the policy parameters are updated. Therefore, the partial derivatives of
the transition probability with respect to the policy parameters (s, S) are well
derived as follows:


$$\frac{\partial}{\partial S} P_{x,x'}(s, S) = \beta(1 - f(x, s))\left(P_1(x'|\tilde{S}, \alpha - 1, \beta) - P_1(x'|\tilde{S}, \alpha, \beta)\right) \qquad (6)$$
$$\frac{\partial}{\partial s} P_{x,x'}(s, S) = \frac{\partial}{\partial s} f(x, s)\left(P_0(x'|x, \alpha, \beta) - P_1(x'|\tilde{S}, \alpha, \beta)\right) \qquad (7)$$

Proposition 3. The transition probability $P_{x,x'}(s, S)$ is a bounded and twice differentiable function of the parameters $(s, S)$.

Proof. The proof is provided in Appendix C.
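To illustrate equations (6) and (7) under the Gamma assumption, the sketch below evaluates the transition density $P_1$ and checks the analytic derivative in equation (6) against a finite difference. It is a verification aid under our assumptions, not part of the algorithm; all names and numeric values are ours.

```python
import math

def gamma_pdf(d, alpha, beta):
    """Gamma density h(d; alpha, beta) = beta^alpha / Gamma(alpha) * d^(alpha-1) * exp(-beta d)."""
    return beta**alpha / math.gamma(alpha) * d**(alpha - 1.0) * math.exp(-beta * d)

def P1(x_next, S_tilde, alpha, beta):
    """Next-state density when ordering up to S_tilde: demand d = S_tilde - x_next."""
    return gamma_pdf(S_tilde - x_next, alpha, beta)

def dP1_dS_analytic(x_next, S_tilde, alpha, beta):
    """Analytic derivative from equation (6), without the (1 - f(x, s)) prefactor."""
    return beta * (P1(x_next, S_tilde, alpha - 1.0, beta) - P1(x_next, S_tilde, alpha, beta))

x_next, S_tilde, alpha, beta = 3.0, 7.0, 2.0, 0.5
eps = 1e-6
finite_diff = (P1(x_next, S_tilde + eps, alpha, beta) - P1(x_next, S_tilde, alpha, beta)) / eps
print(dP1_dS_analytic(x_next, S_tilde, alpha, beta), finite_diff)  # the two values should agree closely
```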


For the multi-item problem, the transition probability and its gradient are presented in Appendix E. Likewise, its transition probability is a bounded and twice differentiable function of the policy parameters.
We develop P0 (x′ |x) and P1 (x′ |S) under the assumption of continuous de-
mand distribution. However, practical inventory systems may observe discrete
values of demand. Therefore, we extend the transition probabilities to ac-

commodate discrete demand. Suppose we have a count variable dt that rep-
resents demand at period t. By observing the demand for the past υ peri-
ods, the empirical probability that demand is j at period t is obtained by


$\hat{p}_{t,j} = \frac{1}{\upsilon}\sum_{i=1}^{\upsilon} \mathbb{1}\{d_{t-i} = j\}$. For simplicity, we drop the time index $t$ of the variables and regard them as lying in the same time period. We then transform

the probability p̂ into the probability density function by piecewise linear ap-
proximation. The piecewise linear function defined on intervals d¯j ∈ [j, j + 1],
where j ∈ [0, 1, ..., dmax − 1], becomes p̄j (d¯j ) := ∆p̂j (d¯j − j) + p̂j , where
pe
∆p̂j = p̂j+1 − p̂j . To ensure that the integral of the interpolated probabil-
ity density function equals one, the piecewise
 linear function is normalized
 by
dmax −1
applying a normalizing constant, K = 12 p̂0 + p̂dmax + 2 i=1
P
p̂i .
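The sketch below shows one way to build the interpolated density $\bar{p}$ and its normalizing constant from an empirical demand pmf, following the construction above. The demand history, window length, and function names are made up for illustration.

```python
import numpy as np

def empirical_pmf(demand_history, d_max):
    """Empirical probability p_hat_j that demand equals j, from the last v observations."""
    d = np.asarray(demand_history)
    return np.array([np.mean(d == j) for j in range(d_max + 1)])

def piecewise_linear_density(p_hat):
    """Return the normalizing constant K and a callable normalized density p_bar on [0, d_max]."""
    d_max = len(p_hat) - 1
    K = 0.5 * (p_hat[0] + p_hat[d_max] + 2.0 * p_hat[1:d_max].sum())   # trapezoidal total mass
    def p_bar(d):
        j = min(int(np.floor(d)), d_max - 1)                           # interval [j, j+1]
        return ((p_hat[j + 1] - p_hat[j]) * (d - j) + p_hat[j]) / K
    return K, p_bar

history = [2, 3, 3, 4, 2, 5, 3, 4, 2, 3]            # last v = 10 demand observations (illustrative)
p_hat = empirical_pmf(history, d_max=6)
K, p_bar = piecewise_linear_density(p_hat)
print(K, p_bar(3.5))
```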
When replacing the demand distribution by h(d; p̄) in P0 (x′ |x) and P1 (x′ |S),
ot

$\frac{\partial}{\partial S} P_1(x'|\tilde{S}, \bar{p})$ is unfortunately not always available in closed form. To tackle this issue, we numerically obtain the derivative by perturbing the order-up-to level and computing the partial derivative, $\frac{\partial}{\partial S} P_1(x'|\tilde{S}, \bar{p}) \approx \frac{P_1(x'|\tilde{S}+\varepsilon, \bar{p}) - P_1(x'|\tilde{S}, \bar{p})}{\varepsilon}$. Then, the infinitesimally perturbed increment of $P_1(x'|\tilde{S}, \bar{p})$ (i.e., $\frac{\partial}{\partial S} P_1(x'|\tilde{S}, \bar{p})$) is equivalent to $-\frac{\partial}{\partial x} P_1(x|\tilde{S}, \bar{p})\big|_{x=x'} = -\frac{1}{K}\frac{d}{dx}\bar{p}_j(\tilde{S} - x)\big|_{x=x'}$; thus, the partial derivative becomes $\frac{\partial}{\partial S} P_1(x'|\tilde{S}, \bar{p}) = \frac{1}{K}\Delta\hat{p}_j$ for each interval $x' \in [\tilde{S} - j - 1, \tilde{S} - j]$.

The partial derivatives of the transition probability with respect to the policy
parameters (s, S) are then given as follows:
$$\frac{\partial}{\partial S} P_{x,x'}(s, S) = (1 - f(x, s))\frac{\partial}{\partial S} P_1(x'|\tilde{S}, \bar{p}) \qquad (8)$$
$$\frac{\partial}{\partial s} P_{x,x'}(s, S) = \frac{\partial}{\partial s} f(x, s)\left(P_0(x'|x, \bar{p}) - P_1(x'|\tilde{S}, \bar{p})\right) \qquad (9)$$
4.1.4. Gradient and policy updates

The average reward objective corresponding to the policy parameters is given as $g(s, S) = \sum_{x\in X} \pi(x; s, S) r(x)$. If Proposition 3 is established, the gradient of the average reward objective function is given in closed form as in equation (10); the proof is shown in Marbach and Tsitsiklis (2001):
$$\nabla g(s, S) = \sum_{x\in X} \pi(x; s, S) \sum_{x'\in X} \nabla P_{x,x'}(s, S) V(x'; s, S) \qquad (10)$$
The gradient of the objective function is thus the expectation with respect to the corresponding policy distribution, $\nabla g(s, S) = E_\pi\left[\sum_{x'\in X} \nabla P_{x,x'}(s, S) V(x'; s, S)\right]$. Then, the expectation under a stationary policy is replaced by the following sampling approximation: $\nabla g(s, S) \approx \sum_{x'\in X} \nabla P_{x,x'}(s, S) V(x'; s, S)$. We finally reach the policy updates using the gradient descent method:
$$\begin{pmatrix} S \\ s \end{pmatrix} \leftarrow \begin{pmatrix} S \\ s \end{pmatrix} - \begin{pmatrix} \sum_{x'\in X} \frac{\partial}{\partial S} P_{x,x'}(s, S) V(x'; s, S) \\[4pt] \sum_{x'\in X} \frac{\partial}{\partial s} P_{x,x'}(s, S) V(x'; s, S) \end{pmatrix} \qquad (11)$$

As with the single-item problem, we use the sampling approximation to obtain the expectation of the gradient of the objective function for the multi-item problem: $\nabla g(s, c, S) \approx \sum_{x'\in X^N} \nabla P_{x,x'}(s, c, S) V(x'; s, c, S)$. Here, $X^N = \underbrace{X \times ... \times X}_{N}$ denotes the $N$-times Cartesian product of the single-item state space $X$. Then, the policy update is represented as follows:
$$\begin{pmatrix} S^n \\ c^n \\ s^n \end{pmatrix} \leftarrow \begin{pmatrix} S^n \\ c^n \\ s^n \end{pmatrix} - \begin{pmatrix} \sum_{x'\in X^N} \frac{\partial}{\partial S^n} P_{x,x'}(s, c, S) V(x'; s, c, S) \\[4pt] \sum_{x'\in X^N} \frac{\partial}{\partial c^n} P_{x,x'}(s, c, S) V(x'; s, c, S) \\[4pt] \sum_{x'\in X^N} \frac{\partial}{\partial s^n} P_{x,x'}(s, c, S) V(x'; s, c, S) \end{pmatrix} \quad \forall n = 1, ..., N \qquad (12)$$

Although we demonstrate the convergence of the iterative updates (see Theo-


rem 2 in Appendix B), the updates may or may not converge to an optimal policy because of the limiting nature of RL. Therefore, the proposed method's optimality verification is complemented by an extensive numerical study in Section 5.

4.2. Structured reinforcement learning with full stochastic approximation

Based on the mathematical background in Section 4.1, this section sum-



marizes the proposed structured RL method. We first observe demand and


transition to the next inventory level after placing replenishment decisions at
the current inventory level. Using the demand observations, the parameters
of the underlying demand distribution are adaptively updated; thereafter, the

immediate reward is used to update the relative value function. The policy
parameters are updated in the descent direction of the expected gradient of the
objective. Appendix F provides the details of how to derive the gradient.
In the overall learning procedure, the described SRL-FSA algorithm fully ap-
proximates all the parts of the gradient of the objective function with stochastic

ev
samples. This algorithm estimates the expected gradient using only online sam-
ples (i.e., it does not compute integration of the transition probability using
the polynomial structure of the value function). Lastly, the decaying updates

r
of the hyperparameters are applied to ensure the convergence from the stochas-
tic policy to the corresponding deterministic policy. Algorithm 1 summarizes

the proposed SRL-FSA algorithm. Here, a projection operator $\Omega_t[\cdot]$ keeps $s$ bounded above by $S$, exploiting the (s, S) replenishment policy structure, which requires $s \le S$.
Algorithm 1 SRL-FSA for single-item inventory management
1: while satisfying the stopping criteria do
2:   given $x_t$, take action $a_t$ with noise $\epsilon_t$ following the stochastic (s, S) replenishment policy; then, observe the transitioned state $x_{t+1}$ and the corresponding reward $r_t$
3:   attain the realized demand $d_t$; then, adaptively estimate the distributional parameters $\hat{\alpha}_{t+1}$ and $\hat{\beta}_{t+1}$
4:   update the relative value function:
     $w_{t+1} = w_t + \gamma_1(t)\{r_t - \rho_t + V_t(x_{t+1}; s_t, S_t) - V_t(x_t; s_t, S_t)\}\Phi(x_t)$
     $\rho_{t+1} = \rho_t + \gamma_2(t)\{r_t + V_t(x_{t+1}; s_t, S_t) - V_t(x_t; s_t, S_t) - \rho_t\}$
5:   update the policy parameters:
     sample $\eta_S \sim \mathrm{Bern}(\cdot; 0.5)$, then $z_S \sim (1 - \eta_S)P_1(x'|\tilde{S}_t, \hat{\alpha}_{t+1} - 1, \hat{\beta}_{t+1}) + \eta_S P_1(x'|\tilde{S}_t, \hat{\alpha}_{t+1}, \hat{\beta}_{t+1})$
     $S_{t+1} = S_t - b_1(t)\hat{\beta}_{t+1}(1 - f(x_t, s_t))(-1)^{\eta_S} V_{t+1}(z_S; s_t, S_t)$
     sample $\eta_s \sim \mathrm{Bern}(\cdot; 0.5)$, then $z_s \sim (1 - \eta_s)P_0(x'|x_t, \hat{\alpha}_{t+1} - 1, \hat{\beta}_{t+1}) + \eta_s P_1(x'|\tilde{S}_t, \hat{\alpha}_{t+1}, \hat{\beta}_{t+1})$
     $s_{t+1} = \Omega_t\left[s_t - b_2(t)\,\partial_y f(x_t, y)|_{y=s_t}(-1)^{\eta_s} V_{t+1}(z_s; s_t, S_t)\right]$
6:   update the hyperparameters:
     decay $\sigma_t$ and $\tau_t$
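The following Python sketch illustrates one sampled policy update (step 5 of Algorithm 1) for the single-item case; the sampling distributions are taken from equations (6) and (7). The estimated Gamma parameters, value-function weights, step sizes, and helper names are stand-ins of ours, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x, y, tau):
    """y-shifted sigmoid f(x, y) = 1 / (1 + exp(-(x - y) / tau))."""
    return 1.0 / (1.0 + np.exp(-(x - y) / tau))

def df_dy(x, y, tau):
    """Derivative of f(x, y) with respect to the shift y: -f(1 - f)/tau."""
    v = f(x, y, tau)
    return -v * (1.0 - v) / tau

def V(x, w):
    """Fourth-degree polynomial relative value approximation V(x) = w^T (x, x^2, x^3, x^4)."""
    return float(w @ np.array([x, x**2, x**3, x**4]))

def srl_fsa_policy_step(s, S, x, w, alpha, beta, S_tilde, b1, b2, tau):
    """One sampled policy update in the spirit of step 5 of Algorithm 1."""
    # Order-up-to level: Monte Carlo estimate of equation (6).
    eta_S = rng.integers(2)                                   # Bernoulli(0.5)
    shape_S = alpha - 1.0 if eta_S == 0 else alpha
    z_S = S_tilde - rng.gamma(shape_S, 1.0 / beta)            # next state drawn from P1
    S_new = S - b1 * beta * (1.0 - f(x, s, tau)) * (-1.0) ** eta_S * V(z_S, w)
    # Reorder level: Monte Carlo estimate of equation (7).
    eta_s = rng.integers(2)
    z_s = x - rng.gamma(alpha, 1.0 / beta) if eta_s == 0 else S_tilde - rng.gamma(alpha, 1.0 / beta)
    s_new = s - b2 * df_dy(x, s, tau) * (-1.0) ** eta_s * V(z_s, w)
    return min(s_new, S_new), S_new                           # projection Omega_t enforces s <= S

w = np.array([-0.2, 0.05, 0.0, 0.0])                          # placeholder value-function weights
print(srl_fsa_policy_step(s=4.0, S=7.0, x=2.5, w=w, alpha=2.0, beta=0.5,
                          S_tilde=7.1, b1=0.01, b2=0.01, tau=0.5))
```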

The structured RL algorithm for a multi-item inventory system is similar to Algorithm 1. An additional policy parameter, the can-order level, should be adjusted to exploit the structure of the (s, c, S) joint replenishment policy, and the pairs of parameters have to be considered to optimize the policy of each item. Appendix I provides the details and Algorithm 3 in Appendix H presents the proposed SRL-FSA for a multi-item problem.

4.3. Structured reinforcement learning with partial stochastic approximation


The polynomial approximation of the relative value function allows us to

take the integral of the value function and compute the gradient of the objective
function. The SRL-FSA algorithm conducts a sample approximation for double
integration: integrations with respect to policy and the gradient of the transition

r
probability. It has the advantage of being well applied to various problems but
its convergence rate is less efficient. To overcome this drawback, we develop the

SRL-PSA algorithm, which computes the gradient integration of the transition
probability using the polynomial structure of the value function. The SRL-PSA
algorithm provides a better convergence rate than the SRL-FSA algorithm by partially approximating some parts of the gradient.
Like the aforementioned algorithms, the SRL-PSA algorithm observes de-
mand and the next period inventory level by placing replenishment decisions.
The underlying demand distribution is adaptively updated using demand ob-

servations and the reward is used to update the relative value function. The
expected gradient of the objective function is estimated not only using online
samples but also by directly taking the integration thanks to the polynomial

structure of the value function. Appendix G presents the details for deriving
the gradient. Thereafter, the replenishment policy parameters are updated in
the descent direction of the estimated gradient. The decaying update of the

hyperparameters ensures the convergence to the corresponding deterministic


policy. Algorithm 2 presents the SRL-PSA algorithm.
Algorithm 2 SRL-PSA for single-item inventory management
1: while satisfying the stopping criteria do
2:   given $x_t$, take action $a_t$ with noise $\epsilon_t$ following the stochastic (s, S) replenishment policy; then, observe the transitioned state $x_{t+1}$ and the corresponding reward $r_t$
3:   attain the realized demand $d_t$; then, adaptively estimate the distributional parameters $\hat{\alpha}_{t+1}$ and $\hat{\beta}_{t+1}$
4:   update the relative value function:
     $w_{t+1} = w_t + \gamma_1(t)\{r_t - \rho_t + V_t(x_{t+1}; s_t, S_t) - V_t(x_t; s_t, S_t)\}\Phi(x_t)$
     $\rho_{t+1} = \rho_t + \gamma_2(t)\{r_t + V_t(x_{t+1}; s_t, S_t) - V_t(x_t; s_t, S_t) - \rho_t\}$
5:   update the policy parameters:
     $S_{t+1} = S_t - b_1(t)\hat{\beta}_{t+1}(1 - f(x_t, s_t))\sum_{i=1}^{4} w_{t+1,i}\left[E_{\hat{Y}\sim h(d;\hat{\alpha}_{t+1}-1,\hat{\beta}_{t+1})}\left[(\tilde{S}_t - \hat{Y})^i\right] - E_{Y\sim h(d;\hat{\alpha}_{t+1},\hat{\beta}_{t+1})}\left[(\tilde{S}_t - Y)^i\right]\right]$
     $s_{t+1} = \Omega_t\left[s_t - b_2(t)\,\partial_y f(x_t, y)|_{y=s_t}\sum_{i=1}^{4} w_{t+1,i} E_{Y\sim h(d;\hat{\alpha}_{t+1},\hat{\beta}_{t+1})}\left[(x_t - Y)^i - (\tilde{S}_t - Y)^i\right]\right]$
6:   update the hyperparameters:
     decay $\sigma_t$ and $\tau_t$

We extend Algorithm 2 to Algorithm 4 in Appendix H for a multi-item problem. As part of Algorithm 4, Appendix J provides the details of how to obtain the gradient. The two algorithms for discrete demand are presented in Appendix K. Notably, these variants require no assumption about the underlying demand distribution, which provides better generalizability and practical applicability.
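Because the value function is a fourth-degree polynomial, the expectations in the SRL-PSA update can be evaluated in closed form. The sketch below computes $E[(c - Y)^i]$ for $Y \sim \mathrm{Gamma}(\alpha, \beta)$ via a binomial expansion and Gamma moments, and uses it for the order-up-to update of Algorithm 2; the helper names and numeric values are ours, offered as an illustrative sketch only.

```python
import math
import numpy as np

def gamma_moment(k, alpha, beta):
    """E[Y^k] for Y ~ Gamma(alpha, beta) with rate parametrization."""
    return math.gamma(alpha + k) / (math.gamma(alpha) * beta**k)

def expected_power(c, i, alpha, beta):
    """Closed-form E[(c - Y)^i] via the binomial expansion."""
    return sum(math.comb(i, k) * c**(i - k) * (-1.0)**k * gamma_moment(k, alpha, beta)
               for k in range(i + 1))

def srl_psa_S_update(S, s, x, w, S_tilde, alpha, beta, b1, tau):
    """Order-up-to update of Algorithm 2, using the closed-form expectations."""
    mix = 1.0 / (1.0 + np.exp(-(x - s) / tau))                # f(x, s)
    grad = sum(w[i - 1] * (expected_power(S_tilde, i, alpha - 1.0, beta)
                           - expected_power(S_tilde, i, alpha, beta))
               for i in range(1, 5))
    return S - b1 * beta * (1.0 - mix) * grad

w = np.array([-0.2, 0.05, 0.0, 0.0])                          # placeholder value-function weights
print(srl_psa_S_update(S=7.0, s=4.0, x=2.5, w=w, S_tilde=7.1,
                       alpha=2.0, beta=0.5, b1=0.01, tau=0.5))
```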

5. Numerical Experiments

To demonstrate the efficacy of the proposed structured RL algorithm, we



conduct a series of numerical experiments for various scenarios. Specifically, the


numerical study analyzes the following three research insights: (i) an inventory
replenishment policy may be optimized without any prior knowledge of the

demand distribution, (ii) the policy can reasonably adapt to non-stationary


(i.e., regime-switching) demand, and (iii) the proposed algorithm has scalability
and can be extended to any size of multi-item problem.

5.1. Experimental design


We consider an inventory system arising from continuous-type demand and
replenishment being placed with a real-valued order unit. We set two types of

inventory systems depending on whether the demand distribution changes over


time: a static system and a regime-switching system. The demand of the static
inventory system follows the Gamma(2, 0.5) and/or truncated Normal(3, 1²)
distributions, which remain the same over time. In the regime-switching system, we consider a non-stationary Gamma demand distribution, which starts from its distributional parameters (2, 0.5) and changes over time in the following sequence: $(4, \sqrt{2}/2) \to (1.5, \sqrt{3}/4) \to (8, 1) \to (1.25, \sqrt{10}/8) \to (16, \sqrt{2})$. The demand distribution changes to the next at every equal interval $H$ (i.e., the switching period $T_k = kH$). This regime-switching system is designed based on a previous study (Bayraktar and Ludkovski, 2010), and the proposed scenario considers demand seasonality.
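For reproducibility, a regime-switching demand stream of this kind can be generated as follows. Assuming a shape-rate parametrization, the listed parameter sequence keeps the variance $\alpha/\beta^2$ fixed at 8 while the mean $\alpha/\beta$ shifts, mimicking seasonality; the interval length H and the generator itself are our own illustrative sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

# (shape alpha, rate beta) regimes; variance alpha / beta**2 stays at 8 while the mean alpha / beta shifts.
regimes = [(2.0, 0.5), (4.0, np.sqrt(2) / 2), (1.5, np.sqrt(3) / 4),
           (8.0, 1.0), (1.25, np.sqrt(10) / 8), (16.0, np.sqrt(2))]
H = 1000                                               # switching interval (illustrative)

def demand_at(t):
    """Demand in period t under the regime active at time t."""
    alpha, beta = regimes[min(t // H, len(regimes) - 1)]
    return rng.gamma(alpha, 1.0 / beta)                # numpy uses shape and scale = 1 / rate

means = [a / b for a, b in regimes]
variances = [a / b**2 for a, b in regimes]
print(round(demand_at(0), 2), np.round(means, 2), np.round(variances, 2))  # variances are all 8
```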
For a multi-item problem, we consider the same demand distributions and

all the items follow i.i.d predefined distributions. To the best of our knowledge,
no method derives an optimal policy for continuous state/decision inventory

systems. Therefore, we use a full enumeration method that searches all the grid
values of the admissible policy parameters to confirm near-optimality. Moreover,
we benchmark using the online actor–critic (AC) method, which has been widely
used for managing inventory systems in the literature (Gijsbrechts et al., 2022).
The benchmark AC method uses a simple neural network for approximating a
replenishment policy (i.e., actor) without considering structural properties.
By following the setting of the previous study (De Moor et al., 2022), we con-

sider the same unit costs for both the single-item and the multi-item problems:
the unit ordering cost $c_O^n = 0.3$, unit holding cost $c_H^n = 0.1$, unit backlogging cost $c_B^n = 0.5 \;\forall n = 1, ..., N$, and setup cost $K = 0.1$. The joint order discountable ratio is $\delta = 0.9$ for the multi-item problem. The description of the setting of
algorithmic hyperparameters is presented in Appendix N.

5.2. Findings for the single-item inventory replenishment problem



Figure 1 illustrates the trajectory of a single-item replenishment policy when


demand follows a Gamma distribution. It shows that the proposed structured
RL algorithm successfully learns the (s, S) policy. The policies obtained by SRL-

FSA and SRL-PSA are (4.06, 6.85) and (4.35, 6.72), respectively. These results
are close to those under the near-optimal policy (s∗ , S ∗ ) = (4.60, 6.90) using the
full enumeration method with 0.1-unit grid-searching. SRL-FSA and SRL-PSA
both result in a long-run average cost, which is the average of multiple replica-

tions, similar to that of the full enumeration. Specifically, the average cost of the
proposed structured RL algorithm is different from that of the full enumeration
by less than 1%. Table 1 summarizes the learned policy and long-run average
cost for the single-item inventory problem. Figure 2 shows that the performance
of both SRL-FSA and SRL-PSA rapidly converges to a near-optimal solution

compared with the benchmarking method (i.e., AC). In addition, SRL-PSA
has a superior and more stable convergence rate than SRL-FSA, meaning that
it provides a good policy within fewer iterations. Here, the full enumeration

requires a considerably long computation time even for a small-sized problem
despite providing a near-optimal policy. Therefore, the full enumeration is not

applicable in a practical situation, and the good convergence behavior of the
proposed methods is meaningful for a practical large-sized problem.
Figure 1: Trajectory of the (s, S) policy learned by the proposed algorithms for a single-item inventory system with the Gamma demand distribution. (a) SRL-FSA; (b) SRL-PSA.



Figure 2: Convergence graphs for the average cost of the proposed algorithm and
benchmarking heuristic under a single-item inventory system with the Gamma demand
distribution.
Table 1: Performance of the proposed structured RL algorithm compared with the full enumeration heuristic under a single-item inventory system with the Gamma demand distribution

Algorithm    s     S     Average cost   Difference (%)
Full enum.   4.60  6.90  1.754          -
SRL-FSA      4.06  6.85  1.758          0.23
SRL-PSA      4.35  6.72  1.756          0.11

We observe similar results for a single-item inventory system with the truncated-
normal demand distribution. Appendix O and Appendix P summarize the
learned policy and long-run average costs with additional descriptions (see Ta-

ble O.4 and Figure P.8). The results verify that well-designed RL algorithms
are promising and applicable for inventory replenishment problems with various
and even unknown demand distributions.
Figure 3 demonstrates the extent to which the proposed structured RL al-
gorithm behaves when the demand distributions are non-stationary and change
over time (i.e., regime-switching systems). It rapidly adapts the replenishment
policy to switching demand distributions. This good learning behavior enables
the proposed SRL-FSA and SRL-PSA algorithms to lower the long-run aver-
age cost by 1.7% and 3.2% respectively compared with the static replenishment
ot

policy. Here, the static policy is a near-optimal policy obtained using the full
enumeration method for the initial static inventory system. We consider this
tn

static inventory system as the situation in which there is a significant time lag
in observing the change in the demand distributions; hence, a decision-maker
believes that the demand distribution is “static” over time. This observation
supports the argument that the proposed structured RL algorithm can improve
rin

operational efficiency by adaptively and automatically updating the policy to


switching demands. Appendix O summarizes the details of these results.

5.3. Findings for the multi-item inventory replenishment problem


ep

In this section, we present the results from the numerical analysis for a two-
item inventory system that requires finding the (s, c, S) joint replenishment pol-
icy. In this two-item inventory system, each item follows the same i.i.d Gamma
(a) SRL-FSA (b) SRL-PSA
Figure 3: Trajectory of the updated policy parameters using the proposed algorithms
under a single-item inventory system with the regime-switching demand distribution.
demand distribution. Figure 4 demonstrates that the proposed structured RL

algorithm behaves well and learns the joint replenishment policy for the two-
item inventory replenishment problem. Table 2 compares the joint replenish-

ment policies and shows that the learned policies are close to the near-optimal
policy under the full enumeration method. Furthermore, the cost difference
between the proposed structured RL algorithm and full enumeration heuristic
is less than 2%. Similar to the aforementioned analysis, SRL-PSA again has a better convergence rate. Appendix P provides the results (see Figure P.9).
Figure 4: Trajectory of the updated policy parameters using the proposed algorithms under a two-item inventory system with the Gamma demand distribution. (a) SRL-FSA; (b) SRL-PSA.

Figure 5 illustrates the good behavior of the proposed structured RL algo-


rithms in a multi-item inventory system with non-stationary demand. As in
the single-item case, the algorithms rapidly adapt the replenishment policy to

switching demand distributions. This good learning behavior enables the pro-
posed SRL-FSA and SRL-PSA algorithms to successfully reduce the average
costs by 7.8% and 13.4%, respectively, compared to the static replenishment
Table 2: Performance of the proposed structured RL algorithm compared with the full enumeration heuristic under a two-item inventory system with the Gamma demand distribution

Algorithm    s1    c1    S1    s2    c2    S2    Average cost   Difference (%)
Full enum.   4.40  6.10  6.50  4.40  6.10  6.50  3.463          -
SRL-FSA      2.71  4.51  6.54  2.79  4.00  6.29  3.518          1.56
SRL-PSA      3.44  3.87  6.01  3.52  4.04  5.81  3.499          1.03

policy. We summarize the resulting average cost in Table O.6 in Appendix O.

Figure 5: Trajectory of the updated policy parameters using the proposed algorithms under a two-item inventory system with the regime-switching demand distribution. (a) SRL-FSA; (b) SRL-PSA.

To examine scalability, we extend the analysis to a problem that includes



up to four items. We find that the proposed structured RL algorithm has lower
complexity than the existing AC method, which does not consider structural
properties, and thus we expect better scalability (see the details in Appendix

R). Figure 6 presents the extent to which the proposed structured RL algo-
rithm lowers the long-run average cost compared with the AC method. Greater
decreases in the long-run average cost are achieved as the number of items in-

creases, suggesting that the proposed structured RL algorithm is more effective


than previous methods for larger problems.

6. Case study

We conduct a case study for a retail shop in South Korea to examine the
applicability of the proposed structured RL method in the health & beauty re-
tail industry. The retail shop operates 150 stores nationwide, each store selling
Figure 6: Performance improvement of the SRL-FSA and SRL-PSA algorithms com-
pared with the benchmark AC when extending the number of items.
approximately 10,000 items on average. The case study considers two seasonal
items offered by the retail shop: sunblock and deodorant. Figure Q.10 in Ap-

pendix Q illustrates the daily sales of sunblock and deodorant for 865 and 1,000
days, showing strong seasonality with peaks in summer.

The retail shop’s current practice is to review items daily and decide whether
to replenish them. When the inventory level drops below the average demand
in lead time, the retail shop orders items based on the average demand and its
safety stock. Using this case study, we aim to confirm whether the proposed
method reduces the inventory cost more than the retail shop’s current replenish-
ment policy. For comparison, we also consider other baseline benchmarks, such
as the static replenishment policy, Q-learning, and post-decision state learning.

For the numerical evaluations, we determine unit costs by applying a certain percentage of the retail price, following the approach used in the book by Terwiesch and Cachon (2006). The retail prices of sunblock and deodorant are 19,000 KRW
and 11,500 KRW, respectively. Table Q.7 summarizes the unit costs.
Figure 7 demonstrates that the proposed structured RL methods reasonably
learn the replenishment policy in response to the switching demand. This de-

sirable behavior enables the proposed methods to lower the inventory costs by
more than 10% and 30% compared to the current replenishment practices for
sunblock and deodorant, respectively. Other benchmark methods may possi-

bly lower the inventory costs more than the current practice, but the proposed
method outperforms them as well. Table 3 summarizes the results. Compared
to the current practice, SRL-PSA saves 2.884 million KRW in sunblock inven-
tory costs over 865 days, which is approximately 2,500 USD. When considering

150 stores, this indicates that the retail shop can expect to save 150,000 USD
in sunblock inventory costs annually. Similarly, the proposed method can save
approximately 70,000 USD in deodorant inventory costs annually. In terms of
the business scale of the chain stores, applying these savings to the other 10,000 items further justifies the feasibility of our method.
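As a rough sanity check of the reported annual saving for sunblock (this back-of-the-envelope annualization is ours and assumes the saving accrues uniformly over the 865-day horizon):
$$2{,}500\ \text{USD} \times \tfrac{365}{865} \approx 1{,}055\ \text{USD per store-year}, \qquad 1{,}055 \times 150\ \text{stores} \approx 158{,}000\ \text{USD} \approx 150{,}000\ \text{USD}.$$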

Figure 7: Trajectory of the updated policy. (a) SRL-FSA for sunblock; (b) SRL-PSA for sunblock; (c) SRL-FSA for deodorant; (d) SRL-PSA for deodorant.

Table 3: Comparison of inventory costs (M KRW)

Item        Practice   Static   Q-learning   PDS learning   SRL-FSA   SRL-PSA
Sunblock    21.587     19.942   23.620       19.795         19.353    18.703
Deodorant   5.632      4.498    4.932        4.343          4.204     4.179

We further investigate why the proposed methods achieve better perfor-



mance, as follows. First, although there is a difference in the peak periods for
sunblock and deodorant, the retail shop ignores this demand characteristic. For
example, sunblock has an earlier and longer peak period, and the demand is sig-
nificantly large during these periods. However, the existing practice considers

the same peak periods for both items. On the other hand, the proposed method
automatically detects the changes in demand without having to define the peak
period and derives a customized policy for individual items. Second, the retail
shop determines the reorder level based on the average demand over specific
months or seasons. This approach is highly likely to lag the actual demand, with

the resulting reorder level being unresponsive to the actual demand. Meanwhile,
the proposed method learns and updates both the reorder and order-up-to levels
in immediate response to the recent demand observations. This adaptive learn-

ing behavior is likely to fare better in today's retail business, where demand is highly irregular and unpredictable. Moreover, we expect that the achievement of the proposed method is even more significant for multi-item inventory systems.

7. Concluding Remarks

In this study, we examine a multi-item inventory replenishment problem with unknown and switching demand. We propose a novel structured RL algorithm that adaptively and automatically learns an optimal replenishment policy without any prior knowledge of demand. We integrate the known structural properties of an optimal replenishment policy, which enables the proposed method to adapt the replenishment policy in response to changing demand. We also analytically show that the proposed method has efficient algorithmic complexity and provides scalability by exploiting the structure of an optimal policy. The numerical analysis shows that the proposed algorithms perform well, and the case study confirms their operational efficiency under a practical inventory system. These well-designed RL algorithms are particularly promising when policy updates must be based on observations without precise knowledge of the switching demand.


This study contributes to related research fields and practice. First, the proposed method does not require any prior knowledge of the demand distribution, thereby allowing online learning to support the management of newly released products whose history is not available. Second, we show that well-designed and adaptive learning methods are operationally efficient in an inventory replenishment system. Lastly, most existing methods fail to find a “good” solution for a multi-item inventory replenishment problem within a reasonable computation time, whereas this study provides efficient methods for handling multi-item inventory systems. The scalability of the proposed method also demonstrates the applicability of data-driven methods to practical inventory systems.
We acknowledge some limitations of our study and suggest potential future research extensions. First, we consider a single-echelon, single-supplier, full-backlogging, and zero-lead-time inventory system; extending our model to incorporate more realistic assumptions is therefore encouraged. Although considering such variants does not significantly diminish our main findings, replicating our findings in various models is an important extension of this study. Second, although an optimal replenishment policy structure is known only for basic problems, a rational policy structure can be developed for more complex systems. Another possible avenue for future research is to use structured RL in other applications (e.g., a Markov decision process model in which an optimal policy is characterized by a few parameters). For example, the travel industry considers a booking limit problem, which controls the amount of capacity sold to any particular class in a given period; the optimal policy is parameterized, and its structural properties are known (van Ryzin and Talluri, 2005).

References
Balintfy, J. L. (1964). On a basic class of multi-item inventory problems. Management Science, 10(2):287–297.

Bayraktar, E. and Ludkovski, M. (2010). Inventory management with partially observed nonstationary demand. Annals of Operations Research, 176(1):7–39.

Bertsekas, D. P., Hager, W., and Mangasarian, O. (1999). Nonlinear Programming. Athena Scientific, Belmont, Massachusetts, USA.

Chen, B. (2021). Data-driven inventory control with shifting demand. Production and Operations Management, 30(5):1365–1385.

Chen, B. and Chao, X. (2020). Dynamic inventory control with stockout substitution and demand learning. Management Science, 66(11):5108–5127.

Cohen-Hillel, T. and Yedidsion, L. (2018). The periodic joint replenishment problem is strongly NP-hard. Mathematics of Operations Research, 43(4):1269–1289.

De Moor, B. J., Gijsbrechts, J., and Boute, R. N. (2022). Reward shaping to improve the performance of deep reinforcement learning in perishable inventory management. European Journal of Operational Research, 301(2):535–545.

Djonin, D. V. and Krishnamurthy, V. (2007). Q-learning algorithms for constrained Markov decision processes with randomized monotone policies: Application to MIMO transmission control. IEEE Transactions on Signal Processing, 55(5):2170–2181.

Fu, F. and van der Schaar, M. (2012). Structure-aware stochastic control for transmission scheduling. IEEE Transactions on Vehicular Technology, 61(9):3931–3945.

Giannoccaro, I. and Pontrandolfo, P. (2002). Inventory management in supply chains: A reinforcement learning approach. International Journal of Production Economics, 78(2):153–161.

Gijsbrechts, J., Boute, R. N., Van Mieghem, J. A., and Zhang, D. J. (2022). Can deep reinforcement learning improve inventory management? Performance on lost sales, dual-sourcing, and multi-echelon problems. Manufacturing & Service Operations Management, published online.

Girlich, H.-J. and Barche, V. (1991). On optimal strategies in inventory systems with Wiener demand process. International Journal of Production Economics, 23(1-3):105–110.

Huh, W. T. and Rusmevichientong, P. (2009). A nonparametric asymptotic analysis of inventory planning with censored demand. Mathematics of Operations Research, 34(1):103–123.

Ignall, E. (1969). Optimal continuous review policies for two product inventory systems with joint setup costs. Management Science, 15:278–283.

Jiang, C. and Sheng, Z. (2009). Case-based reinforcement learning for dynamic inventory control in a multi-agent supply-chain system. Expert Systems with Applications, 36(3):6520–6526.

Johansen, S. G. and Melchiors, P. (2003). Can-order policy for the periodic-review joint replenishment problem. Journal of the Operational Research Society, 54(3):283–290.

Keskin, N. B., Li, Y., and Song, J.-S. (2022). Data-driven dynamic pricing and ordering with perishable inventory in a changing environment. Management Science, 68(3):1938–1958.

Kunnumkal, S. and Topaloglu, H. (2008). Exploiting the structural properties of the underlying Markov decision problem in the Q-learning algorithm. INFORMS Journal on Computing, 20(2):288–301.

Marbach, P. and Tsitsiklis, J. N. (2001). Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46:191–209.

Oroojlooyjadid, A., Nazari, M., Snyder, L. V., and Takáč, M. (2021). A deep Q-network for the beer game: Deep reinforcement learning for inventory optimization. Manufacturing & Service Operations Management, 24(1):285–304.

Preil, D. and Krapp, M. (2022). Bandit-based inventory optimisation: Reinforcement learning in multi-echelon supply chains. International Journal of Production Economics, 252:108578.

Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Roy, A., Borkar, V., Karandikar, A., and Chaporkar, P. (2021). Online reinforcement learning of optimal threshold policies for Markov decision processes. IEEE Transactions on Automatic Control, 67(7):3722–3729.

Scarf, H. (1960). The optimality of (s, S) policies in the dynamic inventory problem. In Mathematical Methods in the Social Sciences.

Sharma, N., Mastronarde, N., and Chakareski, J. (2020). Accelerated structure-aware reinforcement learning for delay-sensitive energy harvesting wireless sensors. IEEE Transactions on Signal Processing, 68:1409–1424.

Shi, C., Chen, W., and Duenyas, I. (2016). Nonparametric data-driven algorithms for multiproduct inventory systems with censored demand. Operations Research, 64(2):362–370.

Singh, S. P. (1994). Reinforcement learning algorithms for average-payoff Markovian decision processes. In AAAI, volume 94, pages 700–705.

Song, J.-S. and Zipkin, P. (1993). Inventory control in a fluctuating demand environment. Operations Research, 41(2):351–370.

Terwiesch, C. and Cachon, G. (2006). Matching Supply with Demand: An Introduction to Operations Management. McGraw-Hill.

van Ryzin, G. J. and Talluri, K. T. (2005). An introduction to revenue management. In Emerging Theory, Methods, and Applications, pages 142–194. INFORMS.
Appendix A. Proofs of Propositions 1 and 2
As σ² → 0, the perturbed order-up-to level S̃ converges to the deterministic order-up-to level S with probability 1. This follows from Chebyshev's inequality,
\[
\mathbb{P}\big(|\tilde{S} - S| \ge a\big) \le \frac{\sigma^2}{a^2} \quad \text{for any } a > 0.
\]
In addition, the sigmoid function f(x, y) = 1/(1 + e^{-(x-y)/τ}) converges to the step function as τ → 0:
\[
f(x, y) \to \mathrm{step}(x; y) := \begin{cases} 1 & \text{if } x > y, \\ 0 & \text{otherwise.} \end{cases}
\]
Then, the stochastic (s, S) and (s, c, S) replenishment policies converge to the deterministic (s, S) and (s, c, S) replenishment policies, respectively. □
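Both limits are easy to verify numerically. The short sketch below is an illustration only; the Gaussian form of the perturbation and the helper names are our assumptions (the proof itself uses only the variance σ²):

```python
import numpy as np

def f(x, y, tau):
    """Sigmoid mixing probability f(x, y) = 1 / (1 + exp(-(x - y)/tau))."""
    return 1.0 / (1.0 + np.exp(-(x - y) / tau))

# sigmoid -> step function as tau -> 0
x, s = 3.2, 3.0
for tau in (1.0, 0.1, 0.01, 0.001):
    print(tau, f(x, s, tau))          # approaches step(x; s) = 1 since x > s

# concentration of the perturbed order-up-to level S~ around S as sigma^2 -> 0
rng = np.random.default_rng(0)
S, a = 10.0, 0.5
for sigma in (1.0, 0.1, 0.01):
    S_tilde = S + sigma * rng.standard_normal(100_000)     # Gaussian perturbation assumed
    freq = np.mean(np.abs(S_tilde - S) >= a)               # empirical P(|S~ - S| >= a)
    print(sigma, freq, "<=", sigma**2 / a**2)              # Chebyshev bound
```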
Appendix B. Convergence property of the proposed algorithms

From Theorem 2, the proposed structured RL method has the convergence property.

Lemma 1. Suppose the objective function g(s, S) satisfies Proposition 4 and the sequence is updated in the direction opposite to the gradient of the objective function with update rates b_p(k) that satisfy \(\sum_{k=1}^{\infty} b_p(k) = \infty\) and \(\sum_{k=1}^{\infty} b_p(k)^2 < \infty\) for p = 1, 2; then the sequence satisfies \(\lim_{t \to \infty} \nabla g(s_t, S_t) = 0\).

Proof. The proof is presented in Bertsekas et al. (1999).

Theorem 2. The sequence (s_t, S_t) generated by the algorithm converges to a limit point.

Proof. The value function is K-convex, which implies that the corresponding objective function diverges as the policy parameters approach either positive or negative infinity. From Lemma 1, the sequence of policies (s_t, S_t) converges to a point where the gradient of the objective function is zero. The gradient does not vanish in either limit toward infinity, which implies that the converged point cannot be such an extreme point. Therefore, the sequence of policies (s_t, S_t) updated by the algorithm converges to a stationary point with probability 1.

We can construct the algorithm with a learning rate that ensures the conditions presented in Lemma 1 hold. To prove Theorem 2, it thus suffices to show Proposition 4.

Proposition 4. The objective function g(s, S) is twice differentiable and has bounded first and second derivatives.

Proof. The proof is presented in Appendix D.

Having shown Theorem 2, the gradient-based algorithm has the convergence property in the single-item case. Moreover, this proof can easily be extended to the multi-item case by showing that the objective function is twice differentiable with respect to the can-order level c^n for all items n = 1, ..., N; thus, the proposed structured RL algorithm has the convergence property under the problem examined in this study. □

Appendix C. Proof of Proposition 3

Any density function considered for the demand distribution has bounded first and second derivatives with respect to the corresponding random variable; hence, the transition rule for replenishment P1(x′|S) also has bounded first and second derivatives with respect to the parameter S. In addition, the sigmoid function has bounded first and second derivatives; thus, the sigmoid mixing probability f(x, s) also has bounded first and second derivatives with respect to the parameter s. Hence, the transition probability P_{x,x′}(s, S), which is a composition of these functions, has bounded first and second derivatives with respect to the parameters (s, S). □
Appendix D. Proof of Proposition 4
Before we describe the proof of Proposition 4, we make the following two assumptions.

Assumption 1. The Markov chain corresponding to every stationary policy is such that all states belong to a single communicating class and are aperiodic. Furthermore, the representative state x* is a positive recurrent state of the Markov chain.

Assumption 2. For any states x, x′ ∈ X and feasible policy parameters (s, S), the transition probability P_{x,x′}(s, S) has bounded first and second derivatives. Furthermore, the reward function has bounded first and second derivatives with respect to (s, S).

These assumptions hold under our problem definition. Assumption 1 holds by leveraging a stochastic replenishment policy. Assumption 2 also holds from Proposition 3 and the fact that the reward function does not depend on the policy parameters when the current and next states are given.

Now, we show that the average reward objective function g(s, S) is twice differentiable and has bounded first and second derivatives. The average reward under the continuous state can be approximated by the average cost of an infinite-state Markov chain:
\[
g(s, S) \approx \sum_{x \in X} \pi(x; s, S)\, \hat{r}(x).
\]
The approximating Markov chain satisfies the following balance equations:
\[
\sum_{x \in X} \pi(x; s, S)\, P_{x,x'}(s, S) = \pi(x'; s, S), \qquad \sum_{x \in X} \pi(x; s, S) = 1.
\]
The balance equations imply that any initial state distribution converges to a fixed distribution after infinitely many transitions under the transition probability. Thus, the balance equations can be expressed in the matrix-vector form A(s, S) π(x; s, S) = v, where A(s, S) is a matrix depending on the parameters (s, S) and v is a constant vector. The entries of the matrix A(s, S) are composed of the entries of the transition probability P_{x,x′}(s, S). From Proposition 3, the matrix A(s, S) is therefore twice differentiable and has bounded first and second derivatives. Since the corresponding stationary distribution is unique, the matrix A(s, S) is invertible, and the stationary distribution can be represented using Cramer's rule as
\[
\pi(x; s, S) = \frac{C(s, S)}{\det\!\big(A(s, S)\big)},
\]
where C(s, S) is a vector whose entries are polynomial functions of the entries of A(s, S). The entries of C(s, S) are twice differentiable and have bounded first and second derivatives; further, det(A(s, S)) is twice differentiable and has bounded first and second derivatives. Since the matrix A(s, S) is invertible, |det(A(s, S))| is bounded away from zero. These statements imply that the stationary distribution is twice differentiable and has bounded first and second derivatives. Given these characteristics of the stationary distribution, the objective function composed of the stationary distribution is also twice differentiable and has bounded first and second derivatives. □
bounded first and second derivatives. □

Appendix E. Transition probability and its gradient for a multi-item inventory system
For a multi-item problem, the transition probability corresponding to the policy parameters (s, c, S) is denoted as P_{x,x′}(s, c, S). The following expression gives the transition probability under the stochastic (s, c, S) replenishment policy:
\[
P_{x,x'}(s,c,S) = P^{n}_{x,x^{n\prime}}(s, c^{n}, S^{n}) \prod_{k \neq n} P^{k}_{x,x^{k\prime}}(s, c^{k}, S^{k}),
\]
\[
P^{n}_{x,x^{n\prime}}(s, c^{n}, S^{n}) =
\Big(1 - \prod_{k \neq n} f(x^{k}, s^{k})\Big)
\Big[ f(x^{n}, c^{n})\, P^{n}_{0}(x^{n\prime}|x^{n}, \alpha^{n}, \beta^{n}) + (1 - f(x^{n}, c^{n}))\, P^{n}_{1}(x^{n\prime}|\tilde{S}^{n}, \alpha^{n}, \beta^{n}) \Big]
+ \prod_{k \neq n} f(x^{k}, s^{k})
\Big[ f(x^{n}, s^{n})\, P^{n}_{0}(x^{n\prime}|x^{n}, \alpha^{n}, \beta^{n}) + (1 - f(x^{n}, s^{n}))\, P^{n}_{1}(x^{n\prime}|\tilde{S}^{n}, \alpha^{n}, \beta^{n}) \Big].
\]
For each item n, the partial derivatives of the transition probability with respect to the policy parameters (s, c, S) are well defined as follows:
\[
\frac{\partial}{\partial S^{n}} P_{x,x'}(s,c,S) =
\beta^{n} \Big[ (1 - f(x^{n}, c^{n}))\Big(1 - \prod_{k \neq n} f(x^{k}, s^{k})\Big) + (1 - f(x^{n}, s^{n})) \prod_{k \neq n} f(x^{k}, s^{k}) \Big]
\big( P^{n}_{1}(x^{n\prime}|\tilde{S}^{n}, \alpha^{n} - 1, \beta^{n}) - P^{n}_{1}(x^{n\prime}|\tilde{S}^{n}, \alpha^{n}, \beta^{n}) \big)
\prod_{k \neq n} P^{k}_{x,x^{k\prime}}(s, c^{k}, S^{k}),
\]
\[
\frac{\partial}{\partial c^{n}} P_{x,x'}(s,c,S) =
\Big(1 - \prod_{k \neq n} f(x^{k}, s^{k})\Big)
\big( P^{n}_{0}(x^{n\prime}|x^{n}, \alpha^{n}, \beta^{n}) - P^{n}_{1}(x^{n\prime}|\tilde{S}^{n}, \alpha^{n}, \beta^{n}) \big)
\frac{\partial}{\partial c^{n}} f(x^{n}, c^{n})
\prod_{k \neq n} P^{k}_{x,x^{k\prime}}(s, c^{k}, S^{k}),
\]
\[
\frac{\partial}{\partial s^{n}} P_{x,x'}(s,c,S) =
\frac{\partial}{\partial s^{n}} f(x^{n}, s^{n})
\bigg[
\big( P^{n}_{0}(x^{n\prime}|x^{n}, \alpha^{n}, \beta^{n}) - P^{n}_{1}(x^{n\prime}|\tilde{S}^{n}, \alpha^{n}, \beta^{n}) \big)
\prod_{k \neq n} f(x^{k}, s^{k})\, P^{k}_{x,x^{k\prime}}(s, c^{k}, S^{k})
+ P^{n}_{x,x^{n\prime}}(s, c^{n}, S^{n})
\sum_{k \neq n} \big( f(x^{k}, s^{k}) - f(x^{k}, c^{k}) \big)
\big( P^{k}_{0}(x^{k\prime}|x^{k}, \alpha^{k}, \beta^{k}) - P^{k}_{1}(x^{k\prime}|\tilde{S}^{k}, \alpha^{k}, \beta^{k}) \big)
\prod_{j \in [N]\setminus\{n,k\}} f(x^{j}, s^{j})\, P^{j}_{x,x^{j\prime}}(s, c^{j}, S^{j})
\bigg].
\]

Appendix F. Policy update in SRL-FSA for a single-item inventory system

The partial derivative of the objective function with respect to the policy parameters (s, S) is computed, and the proposed algorithm updates the policy parameters in the descent direction of the gradient. First, the partial derivative of the objective function with respect to the order-up-to level S is
\[
\frac{\partial}{\partial S} g(s,S) \approx \beta (1 - f(x,s)) \int_{x' \in X} \big( P_1(x'|\tilde{S}, \alpha - 1, \beta) - P_1(x'|\tilde{S}, \alpha, \beta) \big) V(x'; s, S)\, dx'.
\]
Using the stochastic approximation framework, this partial derivative can be approximated with an online sample as
\[
\frac{\partial}{\partial S} g(s,S) \approx \beta (1 - f(x,s)) (-1)^{\eta}\, V(x'; s, S),
\]
where η is a random sample from a Bernoulli trial with equal probabilities (i.e., η ∼ Bern(·; 0.5)). In addition, a sample for the next state x′ is drawn from the mixture of P_1(x′|S̃, α−1, β) and P_1(x′|S̃, α, β) obtained by mixing with the Bernoulli sample, i.e., x′ ∼ (1−η) P_1(x′|S̃, α−1, β) + η P_1(x′|S̃, α, β).
For the other parameter s, the partial derivative is
\[
\frac{\partial}{\partial s} g(s,S) \approx \frac{\partial}{\partial s} f(x,s) \int_{x' \in X} \big( P_0(x'|x, \alpha, \beta) - P_1(x'|\tilde{S}, \alpha, \beta) \big) V(x'; s, S)\, dx'.
\]
Using the stochastic approximation framework, this partial derivative can be approximated with an online sample as
\[
\frac{\partial}{\partial s} g(s,S) \approx \frac{\partial}{\partial s} f(x,s) \times (-1)^{\eta}\, V(x'; s, S),
\]
where η is a random sample from a Bernoulli trial with equal probabilities. In addition, a sample for the next state x′ is drawn from the mixture of P_0(x′|x, α, β) and P_1(x′|S̃, α, β) obtained by mixing with the Bernoulli sample, i.e., x′ ∼ (1−η) P_0(x′|x, α, β) + η P_1(x′|S̃, α, β).
The SRL-FSA algorithm then updates the policy parameters (s, S) as follows:
\[
S_{t+1} = S_t - b_1(t)\, \hat{\beta}_{t+1} (1 - f(x_t, s_t)) (-1)^{\eta_S}\, V_{t+1}(z_S; s_t, S_t) \tag{F.1}
\]
\[
s_{t+1} = s_t - b_2(t)\, \frac{\partial}{\partial y} f(x_t, y)\Big|_{y=s_t} (-1)^{\eta_s}\, V_{t+1}(z_s; s_t, S_t) \tag{F.2}
\]
Here, the i.i.d. Bernoulli samples are η_S, η_s ∼ Bern(·; 0.5), and the random samples are z_S ∼ (1−η_S) P_1(x′|S̃_t, α̂_{t+1}−1, β̂_{t+1}) + η_S P_1(x′|S̃_t, α̂_{t+1}, β̂_{t+1}) and z_s ∼ (1−η_s) P_0(x′|x_t, α̂_{t+1}, β̂_{t+1}) + η_s P_1(x′|S̃_t, α̂_{t+1}, β̂_{t+1}).
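A minimal sketch of one SRL-FSA update step for the single-item case, following (F.1)-(F.2), is given below. It assumes Gamma demand with shape α̂ > 1 and rate β̂, a Gaussian perturbation of the order-up-to level, and a quartic polynomial value approximation; all helper names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x, y, tau):
    """Smoothed order indicator f(x, y) = 1 / (1 + exp(-(x - y)/tau))."""
    return 1.0 / (1.0 + np.exp(-(x - y) / tau))

def value(x, w):
    """Quartic polynomial approximation of the relative value function; w = (w0, ..., w4)."""
    return np.polyval(w[::-1], x)

def srl_fsa_step(x_t, s, S, w, alpha_hat, beta_hat,
                 b1=0.01, b2=0.01, tau=0.5, sigma=0.1):
    """One SRL-FSA policy update (single item), cf. (F.1)-(F.2).
    Gamma demand with shape alpha_hat (> 1) and *rate* beta_hat is assumed."""
    S_tilde = S + sigma * rng.standard_normal()            # perturbed order-up-to level

    # --- order-up-to level S, Eq. (F.1) ---
    eta_S = rng.integers(0, 2)                             # Bernoulli(0.5)
    shape = alpha_hat - 1 if eta_S == 0 else alpha_hat     # mixture component
    z_S = S_tilde - rng.gamma(shape, 1.0 / beta_hat)       # next state under replenishment
    grad_S = beta_hat * (1 - sigmoid(x_t, s, tau)) * (-1) ** eta_S * value(z_S, w)
    S_new = S - b1 * grad_S

    # --- reorder level s, Eq. (F.2) ---
    eta_s = rng.integers(0, 2)
    d = rng.gamma(alpha_hat, 1.0 / beta_hat)
    z_s = (x_t - d) if eta_s == 0 else (S_tilde - d)       # mix of P0 and P1
    f_val = sigmoid(x_t, s, tau)
    df_ds = -f_val * (1 - f_val) / tau                     # d f(x, y)/dy evaluated at y = s
    grad_s = df_ds * (-1) ** eta_s * value(z_s, w)
    s_new = s - b2 * grad_s
    return s_new, S_new
```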

Appendix G. Policy update in SRL-PSA for a single-item inventory system

Part of the gradient of the objective function can be derived directly by computing the integral. The partial derivative of the objective function with respect to the order-up-to level S is expressed as
\[
\frac{\partial}{\partial S} g(s,S) \approx \beta (1 - f(x,s)) \sum_{i=1}^{4} w_i \Big( \mathbb{E}_{\hat{Y} \sim h(d;\alpha-1,\beta)}\big[(\tilde{S} - \hat{Y})^i\big] - \mathbb{E}_{Y \sim h(d;\alpha,\beta)}\big[(\tilde{S} - Y)^i\big] \Big).
\]
Each expectation is a polynomial function of the moments of the Gamma distribution and can therefore be written in closed form (we skip this representation for brevity).
The partial derivative of the objective function with respect to the reorder level s is
\[
\frac{\partial}{\partial s} g(s,S) \approx \frac{\partial}{\partial s} f(x,s) \sum_{i=1}^{4} w_i\, \mathbb{E}_{Y \sim h(d;\alpha,\beta)}\big[(x - Y)^i - (\tilde{S} - Y)^i\big].
\]
Similar to the update for S, it can easily be written in closed form. The SRL-PSA algorithm then updates the policy parameters (s, S) as follows:
\[
S_{t+1} = S_t - b_1(t)\, \hat{\beta}_{t+1} (1 - f(x_t, s_t)) \sum_{i=1}^{4} w_{t+1,i} \Big( \mathbb{E}_{\hat{Y} \sim h(d;\hat{\alpha}_{t+1}-1,\hat{\beta}_{t+1})}\big[(\tilde{S}_t - \hat{Y})^i\big] - \mathbb{E}_{Y \sim h(d;\hat{\alpha}_{t+1},\hat{\beta}_{t+1})}\big[(\tilde{S}_t - Y)^i\big] \Big) \tag{G.1}
\]
\[
s_{t+1} = s_t - b_2(t)\, \frac{\partial}{\partial y} f(x_t, y)\Big|_{y=s_t} \sum_{i=1}^{4} w_{t+1,i}\, \mathbb{E}_{Y \sim h(d;\hat{\alpha}_{t+1},\hat{\beta}_{t+1})}\big[(x_t - Y)^i - (\tilde{S}_t - Y)^i\big]. \tag{G.2}
\]
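The closed form skipped above can be recovered with a binomial expansion of (S̃ − Y)^i together with the raw moments of the Gamma distribution, E[Y^k] = Γ(α + k)/(Γ(α) β^k) under the rate parameterization assumed here. A sketch with hypothetical helper names (it requires α̂ > 1 for the α̂ − 1 term):

```python
import numpy as np
from math import comb
from scipy.special import gammaln

def gamma_raw_moment(k, alpha, beta):
    """E[Y^k] for Y ~ Gamma(shape=alpha, rate=beta) (rate parameterization assumed)."""
    return np.exp(gammaln(alpha + k) - gammaln(alpha)) / beta**k

def expected_power(S_tilde, i, alpha, beta):
    """Closed form of E[(S_tilde - Y)^i] via binomial expansion and Gamma raw moments."""
    return sum(comb(i, k) * S_tilde**(i - k) * (-1)**k * gamma_raw_moment(k, alpha, beta)
               for k in range(i + 1))

def grad_S_psa(x_t, s, S_tilde, w, alpha_hat, beta_hat, tau=0.5):
    """Partial-stochastic-approximation gradient w.r.t. S, cf. the first display above.
    w = (w_1, ..., w_4) are the polynomial value-function weights."""
    f_xs = 1.0 / (1.0 + np.exp(-(x_t - s) / tau))
    total = 0.0
    for i, w_i in enumerate(w, start=1):
        total += w_i * (expected_power(S_tilde, i, alpha_hat - 1, beta_hat)
                        - expected_power(S_tilde, i, alpha_hat, beta_hat))
    return beta_hat * (1 - f_xs) * total
```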

Appendix H. Algorithms for multi-item inventory system
This section provides the SRL-FSA and SRL-PSA algorithms for a multi-item inventory system. Algorithm 3 and Algorithm 4 summarize the corresponding SRL-FSA and SRL-PSA algorithms, respectively. We explain the theoretical basis of these algorithms in Appendix I and Appendix J, respectively.

In Algorithm 3, the projection operator Ω^{c^n}_t[·] maintains the upper bound of c^n at S^n, and Ω^{s^n}_t[·] maintains the upper bound of s^n at c^n, according to the (s, c, S) joint replenishment policy structure that requires s^n ≤ c^n ≤ S^n for all n = 1, ..., N. To stabilize policy learning in the multi-item problem, we consider a batch update technique that updates the parameters using several offline samples. In Algorithm 4, the main part of the gradient with respect to the can-order level, Δc^n, is reused to compute the gradient with respect to the reorder level s^n, which improves the computational efficiency.
Algorithm 3 SRL-FSA for multi-item inventory management
1: while satisfying the stopping criteria do
2: given x_t, take action a_t with noise ε_t following the stochastic (s, c, S) replenishment policy; then observe the transitioned state x_{t+1} and the corresponding reward r_t
3: attain the realized demand d_t for all items; then adaptively estimate the corresponding distributional parameters α̂^n_{t+1} and β̂^n_{t+1} for all n = 1, ..., N
4: update the relative value function:
   w_{t+1} = w_t + γ_1(t)(r_t − ρ_t + V_t(x_{t+1}; s_t, c_t, S_t) − V_t(x_t; s_t, c_t, S_t)) Φ(x_t)
   ρ_{t+1} = ρ_t + γ_2(t)(r_t + V_t(x_{t+1}; s_t, c_t, S_t) − V_t(x_t; s_t, c_t, S_t) − ρ_t)
5: update the policy parameters for all the items:
6: for each item n = 1, ..., N do
   sample η^n_S ∼ Bern(·; 0.5) and z^n_S ∼ (1 − η^n_S) P_1(x′|S̃^n_t, α̂^n_{t+1} − 1, β̂^n_{t+1}) + η^n_S P_1(x′|S̃^n_t, α̂^n_{t+1}, β̂^n_{t+1}); update S^n_{t+1} by (I.1)
   sample η^n_c ∼ Bern(·; 0.5) and z^n_c ∼ P̄^n(x^{n′}|x^n_t, s_t, c_t, S_t, η^n_c); update c^n_{t+1} by (I.2) and apply the projection Ω^{c^n}_t[·]
   compute the categorical weight θ^n_t; sample i.i.d. η^k_s ∼ Bern(·; 0.5), τ^n ∼ Multi(θ^n_t), and z^n_s ∼ P̂(x′|x_t, s_t, c_t, S_t, τ^n); update s^n_{t+1} by (I.3) and apply the projection Ω^{s^n}_t[·]
7: update the hyperparameters: decay σ_t and τ_t
Algorithm 4 SRL-PSA for multi-item inventory management
1: while satisfying the stopping criteria do
2: given x_t, take action a_t with noise ε_t following the stochastic (s, c, S) replenishment policy; then observe the transitioned state x_{t+1} and the corresponding reward r_t
3: attain the realized demand d_t for all items; then adaptively estimate the corresponding distributional parameters α̂^n_{t+1} and β̂^n_{t+1} for all n = 1, ..., N
4: update the relative value function as in step 4 of Algorithm 3
5: update the policy parameters for all the items:
6: for each item n = 1, ..., N do
   compute A^{nk}_1 and A^{nk}_2 for all k ∈ [N]\{n} (as defined in Appendix J) and the incremental factor Δc^n_t; update S^n_{t+1} by (J.1) and c^n_{t+1} by (J.2), applying the projection Ω^{c^n}_t[·]
7: for each item n = 1, ..., N do
   update s^n_{t+1} by (J.3), reusing the stored Δc^k_t and applying the projection Ω^{s^n}_t[·]
8: update the hyperparameters: decay σ_t and τ_t

Appendix I. Policy update in SRL-FSA for a multi-item inventory system

As in the single-item case, the full stochastic approximation method can be extended to the multi-item case:
\[
\nabla g(s,c,S) \approx \int_{x' \in X^{N}} \nabla P_{x,x'}(s,c,S)\, V(x'; s,c,S)\, dx'.
\]
The partial derivative with respect to each of the policy parameters (s, c, S) of every item is computed, and the algorithm updates the policy parameters in the descent direction of the gradient. First, the partial derivative of the objective function with respect to S^n, the order-up-to level of item n, is
\[
\frac{\partial}{\partial S^{n}} g(s,c,S) \approx
\beta^{n} \Big[ (1 - f(x^{n}, c^{n}))\Big(1 - \prod_{k \neq n} f(x^{k}, s^{k})\Big) + (1 - f(x^{n}, s^{n})) \prod_{k \neq n} f(x^{k}, s^{k}) \Big]
\times (-1)^{\eta^{n}}\, V(x^{n\prime} = z^{n}, x^{-n}; s, c, S),
\]
where η^n is a random sample from a Bernoulli trial with equal probabilities. In addition, a sample for the next state of item n is drawn from the mixture of P_1^n(x^{n′}|S̃^n, α^n−1, β^n) and P_1^n(x^{n′}|S̃^n, α^n, β^n) obtained by mixing with the Bernoulli sample, i.e., z^n ∼ (1−η^n) P_1^n(x^{n′}|S̃^n, α^n−1, β^n) + η^n P_1^n(x^{n′}|S̃^n, α^n, β^n).
The stochastic approximation of the partial derivative with respect to c^n is
\[
\frac{\partial}{\partial c^{n}} g(s,c,S) \approx
\Big(1 - \prod_{k \neq n} f(x^{k}, s^{k})\Big) \frac{\partial}{\partial c^{n}} f(x^{n}, c^{n}) \times (-1)^{\eta^{n}}\, V(x^{n\prime} = z^{n}, x^{-n}; s, c, S),
\]
where η^n is a Bernoulli sample with equal probabilities and z^n ∼ P̄^n(x^{n′}|x^n, s, c, S, η^n) = (1−η^n) P_0^n(x^{n′}|x^n, α^n, β^n) + η^n P_1^n(x^{n′}|S̃^n, α^n, β^n).
The stochastic approximation of the partial derivative with respect to s^n is
\[
\frac{\partial}{\partial s^{n}} g(s,c,S) \approx \frac{\partial}{\partial s^{n}} f(x^{n}, s^{n}) \times (-1)^{\sum_{k} 1\{\tau^{n} = k\}\, \eta^{k}}\, V(z; s, c, S),
\]
where each η^k is a Bernoulli sample with equal probabilities and τ^n is sampled from a Multinoulli trial with categorical weight vector
\[
\theta^{n} = e_{n} \times \prod_{k \neq n} f(x^{k}, s^{k}) + \sum_{k \neq n} e_{k} \times \big( f(x^{k}, s^{k}) - f(x^{k}, c^{k}) \big) \prod_{j \in [N]\setminus\{n,k\}} f(x^{j}, s^{j}).
\]
Here, e_n denotes the standard basis vector of size N whose n-th element is one and whose other elements are zero. The categorical weight is normalized so that 1^T θ^n = 1 by dividing by the sum of its elements. In addition, a sample for the next state vector x′ is drawn from the nested mixture composed of P̄^k(x^{k′}|x^k, s, c, S, η^k) ∏_{j≠k} P^j_{x,x^{j′}}(s, c^j, S^j) for each k, activated by the Multinoulli sample, i.e.,
\[
z \sim \hat{P}(x'|x, s, c, S, \tau^{n}) = \sum_{k} 1\{\tau^{n} = k\}\, \bar{P}^{k}(x^{k\prime}|x^{k}, s, c, S, \eta^{k}) \prod_{j \neq k} P^{j}_{x,x^{j\prime}}(s, c^{j}, S^{j}).
\]
The SRL-FSA algorithm then updates the policy parameters (s, c, S) as follows:
\[
S^{n}_{t+1} = S^{n}_{t} - b_{1}(t)\, \hat{\beta}^{n}_{t+1} \Big[ (1 - f(x^{n}_{t}, c^{n}_{t}))\Big(1 - \prod_{k \neq n} f(x^{k}_{t}, s^{k}_{t})\Big) + (1 - f(x^{n}_{t}, s^{n}_{t})) \prod_{k \neq n} f(x^{k}_{t}, s^{k}_{t}) \Big]
(-1)^{\eta^{n}_{S}}\, V_{t+1}(x^{n\prime} = z^{n}_{S}, x^{-n} = x^{-n}_{t+1}; s_{t}, c_{t}, S_{t}) \tag{I.1}
\]
\[
c^{n}_{t+1} = c^{n}_{t} - b_{2}(t) \Big(1 - \prod_{k \neq n} f(x^{k}_{t}, s^{k}_{t})\Big) \frac{\partial}{\partial y} f(x^{n}_{t}, y)\Big|_{y=c^{n}_{t}} (-1)^{\eta^{n}_{c}}\, V_{t+1}(x^{n\prime} = z^{n}_{c}, x^{-n} = x^{-n}_{t+1}; s_{t}, c_{t}, S_{t}) \tag{I.2}
\]
\[
s^{n}_{t+1} = s^{n}_{t} - b_{3}(t) \frac{\partial}{\partial y} f(x^{n}_{t}, y)\Big|_{y=s^{n}_{t}} (-1)^{\sum_{k} 1\{\tau^{n} = k\}\, \eta^{k}_{s}}\, V_{t+1}(x' = z^{n}_{s}; s_{t}, c_{t}, S_{t}) \tag{I.3}
\]
Here, the i.i.d. Bernoulli samples are η^n_S, η^n_c, η^n_s ∼ Bern(·; 0.5) for all n = 1, ..., N. The random samples are z^n_S ∼ (1−η^n_S) P_1(x′|S̃^n_t, α̂^n_{t+1}−1, β̂^n_{t+1}) + η^n_S P_1(x′|S̃^n_t, α̂^n_{t+1}, β̂^n_{t+1}) and z^n_c ∼ P̄^n(x^{n′}|x^n_t, s_t, c_t, S_t, η^n_c) = (1−η^n_c) P^n_0(x′|x^n_t, α̂^n_{t+1}, β̂^n_{t+1}) + η^n_c P^n_1(x′|S̃^n_t, α̂^n_{t+1}, β̂^n_{t+1}). Further, z^n_s ∼ P̂(x′|x_t, s_t, c_t, S_t, τ^n) = Σ_k 1{τ^n = k} P̄^k(x^{k′}|x^k_t, s_t, c_t, S_t, η^k_s) ∏_{j≠k} P^j_{x_t,x^{j′}}(s_t, c^j_t, S^j_t), where τ^n ∼ Multi(θ^n_t).
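The categorical weight θ^n and the Multinoulli draw of τ^n can be computed directly from the sigmoid values. A small illustrative sketch is shown below; the helper names and the toy numbers are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x, y, tau=0.5):
    return 1.0 / (1.0 + np.exp(-(np.asarray(x, float) - np.asarray(y, float)) / tau))

def categorical_weight(n, x, s, c, tau=0.5):
    """Categorical weight vector theta^n that selects which item's transition is perturbed
    when differentiating with respect to s^n (requires s^k <= c^k so entries are nonnegative)."""
    N = len(x)
    f_s = sigmoid(x, s, tau)                    # f(x^k, s^k) for all k
    f_c = sigmoid(x, c, tau)                    # f(x^k, c^k) for all k
    theta = np.zeros(N)
    others = [k for k in range(N) if k != n]
    theta[n] = np.prod(f_s[others])             # e_n term: no other item triggers an order
    for k in others:
        rest = [j for j in others if j != k]
        theta[k] = (f_s[k] - f_c[k]) * np.prod(f_s[rest])
    return theta / theta.sum()                  # normalize so that 1^T theta^n = 1

theta = categorical_weight(0, x=[5.0, 2.0, 7.0], s=[3.0, 3.0, 4.0], c=[4.0, 4.5, 5.0])
tau_n = rng.choice(len(theta), p=theta)         # Multinoulli sample tau^n
print(theta, tau_n)
```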

Appendix J. Policy update in SRL-PSA for a multi-item inventory system

As in the single-item case, the partial stochastic approximation method can be extended to the multi-item case. The partial derivative with respect to each of the policy parameters (s, c, S) of every item is computed, and the proposed algorithm updates in the descent direction of the gradient. First, the partial derivative of the objective function with respect to S^n is
\[
\frac{\partial}{\partial S^{n}} g(s,c,S) \approx
\beta^{n} \Big[ (1 - f(x^{n}, c^{n}))\Big(1 - \prod_{k \neq n} f(x^{k}, s^{k})\Big) + (1 - f(x^{n}, s^{n})) \prod_{k \neq n} f(x^{k}, s^{k}) \Big]
\sum_{1 \le i_{1} + \dots + i_{N} \le 4} w_{i_{1} \dots i_{N}}
\Big( \mathbb{E}_{\hat{y}^{n} \sim h(d;\alpha^{n}-1,\beta^{n})}\big[(\tilde{S}^{n} - \hat{Y}^{n})^{i_{n}}\big] - \mathbb{E}_{y^{n} \sim h(d;\alpha^{n},\beta^{n})}\big[(\tilde{S}^{n} - Y^{n})^{i_{n}}\big] \Big)
\prod_{k \neq n} \mathbb{E}_{y^{k} \sim h(d;\alpha^{k},\beta^{k})}\big[ A^{nk}_{1} (x^{k} - y^{k})^{i_{k}} - A^{nk}_{2} (\tilde{S}^{k} - y^{k})^{i_{k}} \big],
\]
where A^{nk}_1 = (1 − f(x^n, s^n)) f(x^k, c^k) + f(x^n, s^n) f(x^k, s^k) and A^{nk}_2 = (1 − f(x^n, s^n))(1 − f(x^k, c^k)) + f(x^n, s^n)(1 − f(x^k, s^k)).
For the parameter c^n, the partial derivative of the objective function is
\[
\frac{\partial}{\partial c^{n}} g(s,c,S) \approx \Big(1 - \prod_{k \neq n} f(x^{k}, s^{k})\Big) \frac{\partial}{\partial c^{n}} f(x^{n}, c^{n}) \times \Delta c^{n},
\]
\[
\Delta c^{n} = \sum_{1 \le i_{1} + \dots + i_{N} \le 4} w_{i_{1} \dots i_{N}}\,
\mathbb{E}_{y^{n} \sim h(d;\alpha^{n},\beta^{n})}\big[(x^{n} - Y^{n})^{i_{n}} - (\tilde{S}^{n} - Y^{n})^{i_{n}}\big]
\prod_{k \neq n} \mathbb{E}_{y^{k} \sim h(d;\alpha^{k},\beta^{k})}\big[ A^{nk}_{1} (x^{k} - y^{k})^{i_{k}} - A^{nk}_{2} (\tilde{S}^{k} - y^{k})^{i_{k}} \big].
\]
Here, Δc^n is an incremental factor for item n, and it is reused to compute the partial derivative of the objective function with respect to s^n:
\[
\frac{\partial}{\partial s^{n}} g(s,c,S) \approx \frac{\partial}{\partial s^{n}} f(x^{n}, s^{n})
\bigg[ \Delta c^{n} \prod_{k \neq n} f(x^{k}, s^{k}) + \sum_{k \neq n} \Delta c^{k} \big( f(x^{k}, s^{k}) - f(x^{k}, c^{k}) \big) \prod_{j \in [N]\setminus\{n,k\}} f(x^{j}, s^{j}) \bigg].
\]
To sum up, the SRL-PSA algorithm updates the policy parameters (s, c, S) as follows:
\[
S^{n}_{t+1} = S^{n}_{t} - b_{1}(t)\, \hat{\beta}^{n}_{t+1} \Big[ (1 - f(x^{n}_{t}, c^{n}_{t}))\Big(1 - \prod_{k \neq n} f(x^{k}_{t}, s^{k}_{t})\Big) + (1 - f(x^{n}_{t}, s^{n}_{t})) \prod_{k \neq n} f(x^{k}_{t}, s^{k}_{t}) \Big]
\sum_{1 \le i_{1} + \dots + i_{N} \le 4} w_{i_{1} \dots i_{N}}
\Big( \mathbb{E}_{\hat{y}^{n} \sim h(d;\hat{\alpha}^{n}_{t+1}-1,\hat{\beta}^{n}_{t+1})}\big[(\tilde{S}^{n}_{t} - \hat{Y}^{n})^{i_{n}}\big] - \mathbb{E}_{y^{n} \sim h(d;\hat{\alpha}^{n}_{t+1},\hat{\beta}^{n}_{t+1})}\big[(\tilde{S}^{n}_{t} - Y^{n})^{i_{n}}\big] \Big)
\prod_{k \neq n} \mathbb{E}_{y^{k} \sim h(d;\hat{\alpha}^{k}_{t+1},\hat{\beta}^{k}_{t+1})}\big[ A^{nk}_{1} (x^{k}_{t} - y^{k})^{i_{k}} - A^{nk}_{2} (\tilde{S}^{k}_{t} - y^{k})^{i_{k}} \big] \tag{J.1}
\]
\[
c^{n}_{t+1} = c^{n}_{t} - b_{2}(t) \Big(1 - \prod_{k \neq n} f(x^{k}_{t}, s^{k}_{t})\Big) \frac{\partial}{\partial y} f(x^{n}_{t}, y)\Big|_{y=c^{n}_{t}} \times \Delta c^{n}_{t} \tag{J.2}
\]
\[
s^{n}_{t+1} = s^{n}_{t} - b_{3}(t) \frac{\partial}{\partial y} f(x^{n}_{t}, y)\Big|_{y=s^{n}_{t}}
\bigg[ \Delta c^{n}_{t} \prod_{k \neq n} f(x^{k}_{t}, s^{k}_{t}) + \sum_{k \neq n} \Delta c^{k}_{t} \big( f(x^{k}_{t}, s^{k}_{t}) - f(x^{k}_{t}, c^{k}_{t}) \big) \prod_{j \in [N]\setminus\{n,k\}} f(x^{j}_{t}, s^{j}_{t}) \bigg] \tag{J.3}
\]

Appendix K. Algorithms for discrete demand
This section provides the SRL-FSA and SRL-PSA algorithms for a single-item inventory system with discrete demand. Algorithm 5 and Algorithm 6 summarize the discrete variants of the SRL-FSA and SRL-PSA algorithms, respectively. We explain the theoretical basis of these algorithms in Appendix L and Appendix M, respectively.
Appendix L. Policy update in a discrete variant SRL-FSA for a single-item inventory system

We first transform a discrete demand distribution into the corresponding continuous distribution by linear interpolation. The partial derivative of the objective function with respect to the order-up-to level S is then derived as follows:
\[
\frac{\partial}{\partial S} g(s,S) \approx (1 - f(x,s)) \int_{x' \in X} \Big( \frac{\partial}{\partial S} P_1(x'|\tilde{S}, \bar{p}) \Big) V(x'; s, S)\, dx'.
\]
Algorithm 5 Discrete-variant SRL-FSA for single-item inventory management
1: while satisfying the stopping criteria do
2: given x_t, take action a_t with noise ε_t following the stochastic (s, S) replenishment policy; then observe the transitioned state x_{t+1} and the corresponding reward r_t
3: attain the realized demand d_t; then adaptively update the empirical probability p̂_{t+1} and the corresponding interpolated density p̄_{t+1}
4: update the relative value function:
   w_{t+1} = w_t + γ_1(t)(r_t − ρ_t + V_t(x_{t+1}; s_t, S_t) − V_t(x_t; s_t, S_t)) Φ(x_t)
   ρ_{t+1} = ρ_t + γ_2(t)(r_t + V_t(x_{t+1}; s_t, S_t) − V_t(x_t; s_t, S_t) − ρ_t)
5: update the policy parameters:
   sample μ ∼ Multi(·; (1/λ)|Δp̂_{t+1}|) and z_S ∼ U(S̃_t − μ − 1, S̃_t − μ);
   S_{t+1} = S_t − b_1(t)(1 − f(x_t, s_t)) (−1)^{1{p̂_{t+1,μ+1} < p̂_{t+1,μ}}} V_{t+1}(z_S; s_t, S_t)
   sample η ∼ Bern(·; 0.5) and z_s ∼ η P_1(x′|S̃_t, p̄_{t+1}) + (1 − η) P_0(x′|x_t, p̄_{t+1});
   s_{t+1} = Ω_t[ s_t − b_2(t) (∂/∂y) f(x_t, y)|_{y=s_t} (−1)^η V_{t+1}(z_s; s_t, S_t) ]
6: update the hyperparameters: decay σ_t and τ_t

Algorithm 6 Discrete-variant SRL-PSA for single-item inventory management
1: while satisfying the stopping criteria do
2: given x_t, take action a_t with noise ε_t following the stochastic (s, S) replenishment policy; then observe the transitioned state x_{t+1} and the corresponding reward r_t
3: attain the realized demand d_t; then adaptively update the empirical probability p̂_{t+1} and the corresponding interpolated density p̄_{t+1}
4: update the relative value function as in step 4 of Algorithm 5
5: update the policy parameters:
   S_{t+1} = S_t − b_1(t)(1/K)(1 − f(x_t, s_t)) Σ_{j=0}^{d_max−1} Δp̂_{t+1,j} Σ_{i=1}^{4} (w_{t+1,i}/(i+1)) Σ_{k=0}^{i} (S̃_t − j)^k (S̃_t − j − 1)^{i−k}
   s_{t+1} = Ω_t[ s_t − b_2(t) (∂/∂y) f(x_t, y)|_{y=s_t} Σ_{i=1}^{4} w_{t+1,i} E_{Y∼h(d;p̄_{t+1})}[(x_t − Y)^i − (S̃_t − Y)^i] ]
6: update the hyperparameters: decay σ_t and τ_t
Using the stochastic approximation framework, the partial derivative is approximated with an online sample as
\[
\frac{\partial}{\partial S} g(s,S) \approx (1 - f(x,s)) (-1)^{1\{\hat{p}_{\mu+1} < \hat{p}_{\mu}\}}\, V(x'; s, S),
\]
where μ is a random sample from a Multinoulli trial that assigns probability (1/λ)|Δp̂| to the integers 0 to d_max − 1 (i.e., μ ∼ Multi(·; (1/λ)|Δp̂|)). Here, λ := Σ_{j=0}^{d_max−1} |Δp̂_j| is a normalizing constant that turns the piecewise constant function ∂/∂S P_1(x′|S̃, p̄) = (1/K)|Δp̂| into a Multinoulli distribution satisfying the properties of a probability distribution. Additionally, a sample for the next state x′ is drawn from the continuous uniform distribution on [S̃ − μ − 1, S̃ − μ], where μ is fixed after it is sampled from the Multinoulli distribution.
The partial derivative with respect to the other parameter s is
\[
\frac{\partial}{\partial s} g(s,S) \approx \frac{\partial}{\partial s} f(x,s) \int_{x' \in X} \big( P_0(x'|x, \bar{p}) - P_1(x'|\tilde{S}, \bar{p}) \big) V(x'; s, S)\, dx'.
\]
We approximate this partial derivative with an online sample as
\[
\frac{\partial}{\partial s} g(s,S) \approx \frac{\partial}{\partial s} f(x,s) \times (-1)^{\eta}\, V(x'; s, S),
\]
where η is a random sample from a Bernoulli trial with equal probabilities (i.e., η ∼ Bern(·; 0.5)). Additionally, a sample for the next state x′ is drawn from the mixture of P_0(x′|x, p̄) and P_1(x′|S̃, p̄) obtained by mixing with the Bernoulli sample (i.e., x′ ∼ (1 − η) P_0(x′|x, p̄) + η P_1(x′|S̃, p̄)).
Finally, the discrete variant of the SRL-FSA algorithm updates the policy (s, S) as follows:
\[
S_{t+1} = S_t - b_1(t)(1 - f(x_t, s_t)) (-1)^{1\{\hat{p}_{t+1,\mu+1} < \hat{p}_{t+1,\mu}\}}\, V_{t+1}(z_S; s_t, S_t) \tag{L.1}
\]
\[
s_{t+1} = s_t - b_2(t)\, \frac{\partial}{\partial y} f(x_t, y)\Big|_{y=s_t} (-1)^{\eta}\, V_{t+1}(z_s; s_t, S_t) \tag{L.2}
\]
Here, the Multinoulli sample is μ ∼ Multi(·; (1/λ)|Δp̂_{t+1}|) and the Bernoulli sample is η ∼ Bern(·; 0.5); moreover, the random samples are z_S ∼ Uniform(S̃_t − μ − 1, S̃_t − μ) and z_s ∼ η P_1(x′|S̃_t, p̄_{t+1}) + (1 − η) P_0(x′|x_t, p̄_{t+1}).
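A minimal sketch of the order-up-to-level update (L.1), including the Multinoulli draw of μ and the uniform draw of z_S, is shown below. The indexing convention Δp̂_j = p̂_{j+1} − p̂_j, the quartic value approximation, and all helper names are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def discrete_fsa_S_update(S, S_tilde, x_t, s, p_hat, w, b1=0.01, tau=0.5):
    """One order-up-to-level update of the discrete-variant SRL-FSA, cf. (L.1).
    p_hat is the empirical pmf over demand values 0..d_max (assumed not flat, so lambda > 0);
    the relative value function is a quartic polynomial with weights w."""
    delta_p = np.diff(p_hat)                                 # assumed: delta_p[j] = p_hat[j+1] - p_hat[j]
    lam = np.abs(delta_p).sum()                              # normalizing constant lambda
    mu = rng.choice(len(delta_p), p=np.abs(delta_p) / lam)   # Multinoulli sample mu
    z_S = rng.uniform(S_tilde - mu - 1, S_tilde - mu)        # uniform on [S~ - mu - 1, S~ - mu]
    sign = -1.0 if p_hat[mu + 1] < p_hat[mu] else 1.0        # (-1)^{1{p_hat[mu+1] < p_hat[mu]}}
    f_xs = 1.0 / (1.0 + np.exp(-(x_t - s) / tau))            # sigmoid f(x_t, s)
    V_zS = np.polyval(w[::-1], z_S)                          # quartic value approximation
    return S - b1 * (1 - f_xs) * sign * V_zS
```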
Appendix M. Policy update in a discrete variant SRL-PSA for a single-item inventory system

In the case of the SRL-PSA algorithm, the partial gradient of the objective function can be obtained by computing the integral. The partial derivative of the objective function with respect to the order-up-to level S is
\[
\frac{\partial}{\partial S} g(s,S) \approx (1 - f(x,s)) \frac{1}{K} \sum_{j=0}^{d_{\max}-1} \Delta \hat{p}_j \int_{\tilde{S}-j-1}^{\tilde{S}-j} V(z; s, S)\, dz
= (1 - f(x,s)) \frac{1}{K} \sum_{j=0}^{d_{\max}-1} \Delta \hat{p}_j \sum_{i=1}^{4} \frac{w_i}{i+1} \Big[ z^{i+1} \Big]_{\tilde{S}-j-1}^{\tilde{S}-j}.
\]
Additionally, the partial derivative with respect to the reorder level s is
\[
\frac{\partial}{\partial s} g(s,S) \approx \frac{\partial}{\partial s} f(x,s) \sum_{i=1}^{4} w_i\, \mathbb{E}_{Y \sim h(d;\bar{p})}\big[(x - Y)^i - (\tilde{S} - Y)^i\big].
\]
Each expectation is a polynomial function of the moments of the interpolated demand distribution and is thus easily written in closed form.
The discrete variant of the SRL-PSA algorithm updates the policy (s, S) as follows:
\[
S_{t+1} = S_t - b_1(t) \frac{1}{K} (1 - f(x_t, s_t)) \sum_{j=0}^{d_{\max}-1} \Delta \hat{p}_{t+1,j} \sum_{i=1}^{4} \frac{w_{t+1,i}}{i+1} \sum_{k=0}^{i} (\tilde{S}_t - j)^k (\tilde{S}_t - j - 1)^{i-k} \tag{M.1}
\]
\[
s_{t+1} = s_t - b_2(t)\, \frac{\partial}{\partial y} f(x_t, y)\Big|_{y=s_t} \sum_{i=1}^{4} w_{t+1,i}\, \mathbb{E}_{Y \sim h(d;\bar{p}_{t+1})}\big[(x_t - Y)^i - (\tilde{S}_t - Y)^i\big]. \tag{M.2}
\]

Appendix N. Hyperparameter setting for the numerical study
By running preliminary experiments, we tune the hyperparameters (e.g., the decaying step sizes) of the proposed structured RL algorithms while satisfying the desirable conditions \(\sum_{k} b_p(k) = \infty\), \(\sum_{k} b_p(k)^2 < \infty\), and \(\lim_{t\to\infty} b_p(t)/\gamma_1(t) = 0\) for p = 1, 2. For comparison purposes, we use the same step sizes for both the SRL-FSA and SRL-PSA algorithms. The step sizes for the relative value function are \(\gamma_1(t) = 0.001/(\lfloor t/10 \rfloor + 1)^{0.6}\) and \(\gamma_2(t) = 0.01\); further, the step sizes for the policy parameters are \(b_1(t) = 0.01/(\lfloor t/5 \rfloor + 1)\) and \(b_2(t) = 0.01/(\lfloor t/20 \rfloor + 1)^{0.7}\). The precision hyperparameters for the stochastic replenishment policy are defined in a decaying form as \(\tau_t = \tau_0/(\lfloor t/10 \rfloor + 1)^{0.8}\) and \(\sigma_t = \sigma_0/(\lfloor t/10 \rfloor + 1)^{0.8}\).
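For reference, the schedules above can be written directly as functions of t; the sketch below mirrors the stated formulas (the initial values τ0 and σ0 are placeholders, as they are not reported here). Note that b_1 and b_2 satisfy the conditions of Lemma 1, since their decay exponents lie in (0.5, 1].

```python
import math

def gamma1(t):                 # step size for the relative value function
    return 0.001 / (math.floor(t / 10) + 1) ** 0.6

def gamma2(t):                 # step size for the average-reward estimate
    return 0.01

def b1(t):                     # step size for the order-up-to level
    return 0.01 / (math.floor(t / 5) + 1)

def b2(t):                     # step size for the reorder level
    return 0.01 / (math.floor(t / 20) + 1) ** 0.7

def tau(t, tau0=1.0):          # sigmoid precision (tau0 is a placeholder)
    return tau0 / (math.floor(t / 10) + 1) ** 0.8

def sigma(t, sigma0=1.0):      # perturbation scale (sigma0 is a placeholder)
    return sigma0 / (math.floor(t / 10) + 1) ** 0.8
```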

Appendix O. Comparison of the proposed algorithm with benchmark heuristics

Table O.4 compares the optimized policy parameters of the proposed algorithm with those of the full enumeration heuristic under the single-item inventory system with a truncated-normal demand distribution.

Table O.4: Comparison of the proposed structured RL algorithm with the full enumeration heuristic under the single-item inventory system with a truncated-normal demand distribution

Algorithm    s      S      Average cost   Difference (%)
Full enum.   3.00   4.00   1.151          -
SRL-FSA      2.69   3.75   1.155          0.35
SRL-PSA      2.79   4.00   1.152          0.09

Table O.5 and Table O.6 compare the proposed algorithm with the static replenishment policy under the single- and two-item regime-switching systems, respectively.

Table O.5: Comparison of the proposed structured RL algorithm with the static replenishment policy for a single-item inventory system with regime-switching demand

Algorithm   Average cost   Improvement rate (%)
Static      2.899          -
SRL-FSA     2.851          1.68
SRL-PSA     2.808          3.24

Table O.6: Comparison of the proposed structured RL algorithm with the static replenishment policy under the two-item inventory system with regime-switching demand

Algorithm   Average cost   Improvement rate (%)
Static      5.870          -
SRL-FSA     5.446          7.79
SRL-PSA     5.176          13.41
Appendix P. Convergence graphs of the proposed algorithms

Figure P.8 presents the convergence graph of the proposed algorithm and the benchmarking heuristic under the truncated-normal demand distribution, and Figure P.9 illustrates the convergence graph of the proposed algorithm and the benchmarking heuristic under the two-item inventory system.

Figure P.8: Convergence graphs for the average cost of the proposed algorithm and benchmarking heuristic under the single-item inventory system with a truncated-normal demand distribution.

Figure P.9: Convergence graphs for the average cost of the proposed algorithm and the benchmarking heuristic under the two-item inventory system with a Gamma demand distribution.
Appendix Q. Sales trend of the items and their unit cost settings

Figure Q.10: Daily sales for a Korean retail shop. (a) Sunblock; (b) Deodorant.

Table Q.7: Unit costs for the case study (KRW)

Item        cO       cH     cB       K
Sunblock    14,250   9.5    19,000   38,000
Deodorant   8,625    5.75   11,500   23,000
Appendix R. Algorithmic complexity analysis of the proposed methods
Table R.8 summarizes the algorithmic complexity for a system with a general number of items. First, the full enumeration heuristic is inefficient owing to its exponential algorithmic complexity. The time and space complexity of the proposed algorithms is quartic because the model size of the multivariate fourth-degree polynomial approximation of the value function dominates all other operations. Although the benchmarking heuristic (i.e., AC) also has quartic complexity when approximating the value function, it carries an additional complexity term W(N) for its policy approximator (e.g., a neural network or another approximator whose model size is much larger than that of the polynomial value function). Therefore, the algorithmic complexity of the proposed algorithms is more favorable than that of the benchmarking heuristic (i.e., AC). The analysis shows that the proposed method has better algorithmic complexity than typical RL algorithms that do not exploit any structural property.

Table R.8: Algorithmic complexity of structured RL and comparative baselines

Algorithm    Time              Space
Full enum.   O(2^N)            O(N)
AC           O(W(N) + N^4)     O(W(N) + N^4)
SRL-FSA      O(N^4)            O(N^4)
SRL-PSA      O(N^4)            O(N^4)

Despite the favorable algorithmic complexity of the proposed structured RL method, the quartic complexity limits scalability for a multi-item model of general size. To address the scalability issue, we relax the model complexity of the polynomial approximator by removing all interaction terms among different items. We can perform this relaxation on the belief that the true value function has a tractable structure (e.g., multi-dimensional K-convexity) for basic multi-item inventory management problems. With this modification, the space complexity reduces from quartic to linear, and the computational burden, which arises from the policy updates, reduces to quadratic. The improved algorithmic complexity is summarized in Table R.9. The algorithmic complexities of the modified algorithms still outperform the comparative baselines, which verifies that the proposed structured RL method can be made scalable.
Table R.9: Algorithmic complexity of structured RL with relaxed value function approximation

Algorithm    Time             Space
Full enum.   O(2^N)           O(N)
AC           O(W(N) + N)      O(W(N) + N)
SRL-FSA      O(N^2)           O(N)
SRL-PSA      O(N^2)           O(N)
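The difference between the quartic and the relaxed value-function approximations is easy to see by counting basis terms. The sketch below compares a full degree-4 polynomial basis in N state variables with a separable per-item basis; the exact counts depend on which terms an implementation keeps, but the growth orders match Tables R.8 and R.9:

```python
from math import comb

def full_quartic_features(N):
    """Monomials of total degree <= 4 in N state variables: C(N + 4, 4), i.e., O(N^4)."""
    return comb(N + 4, 4)

def separable_quartic_features(N):
    """Per-item quartic terms only (no cross-item interactions): 4N + 1, i.e., O(N)."""
    return 4 * N + 1

for N in (2, 5, 10, 50):
    print(N, full_quartic_features(N), separable_quartic_features(N))
```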