Keywords: Forecasting; Inventory; e-commerce; Retailing; Model selection

Abstract

Retailers supply a wide range of stock keeping units (SKUs), which may differ, for example, in terms of demand quantity, demand frequency, demand regularity, and demand variation. Given this diversity in demand patterns, it is unlikely that any single model for demand forecasting can yield the highest forecasting accuracy across all SKUs. To save costs through improved forecasting, there is thus a need to match any given demand pattern to its most appropriate prediction model. To this end, we propose an automated model selection framework for retail demand forecasting. Specifically, we consider model selection as a classification problem, where classes correspond to the different models available for forecasting. We first build labeled training data based on the models' performances in previous demand periods with similar demand characteristics. For future data, we then automatically select the most promising model via classification based on the labeled training data. The performance is measured by economic profitability, taking into account asymmetric shortage and inventory costs. In an exploratory case study using data from an e-grocery retailer, we compare our approach to established benchmarks. We find promising results, but also that no single approach clearly outperforms its competitors, underlining the need for case-specific solutions.
© 2021 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.ijforecast.2021.05.010
Please cite this article as: M. Ulrich, H. Jahnke, R. Langrock et al., Classification-based model selection in retail demand forecasting. International Journal
of Forecasting (2021), https://doi.org/10.1016/j.ijforecast.2021.05.010.
M. Ulrich, H. Jahnke, R. Langrock et al. International Journal of Forecasting xxx (xxxx) xxx
Fig. 1. Demand patterns for the SKUs of garden parsley, apples, wraps, and honeydew melons in 2017 for one of five fulfillment centers considered.
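The demand characteristics along which Fig. 1 is discussed — demand frequency, average demand quantity, and demand variation — can be computed directly from a raw demand series. The sketch below uses invented series and operationalizes "demand variation" as the coefficient of variation, which is one common choice rather than necessarily the paper's exact definition:

```python
# Sketch: per-SKU demand-pattern descriptors as discussed around Fig. 1.
# The demand series are hypothetical; "demand variation" is taken here to be
# the coefficient of variation (std/mean), one common operationalization.
from statistics import mean, pstdev

def demand_descriptors(demand):
    """Return (frequency, mean quantity, variation) for one SKU's series."""
    freq = sum(1 for d in demand if d > 0) / len(demand)    # share of periods with demand
    avg = mean(demand)                                       # average demand per period
    cv = pstdev(demand) / avg if avg > 0 else float("nan")   # demand variation (CV)
    return freq, avg, cv

# Hypothetical series: a slow mover (parsley-like) vs. a fast mover (apple-like)
parsley = [0, 4, 6, 5, 0, 7, 5, 4]
apples = [650, 700, 710, 1400, 680, 690, 720, 1350]

f1, a1, v1 = demand_descriptors(parsley)
f2, a2, v2 = demand_descriptors(apples)
```

Such descriptors are exactly the kind of information a model selection procedure can condition on when different SKUs call for different forecasting models.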
As an illustrative example, Fig. 1 displays the demand patterns of four SKUs considered in this study, namely garden parsley, apples, wraps, and honeydew melons. Each observation here corresponds to one demand period, i.e. one day of delivery. First, it can be seen that apples and honeydew melons show a higher demand frequency than wraps and garden parsley. For the latter, in fact, there was no demand in 2% of the demand periods. Second, the average demand quantity varies from ∼5 units per demand period for garden parsley to ∼700 units for apples. Third, garden parsley and wraps show demand variations below 100%, whereas the same figure exceeds 100% for apples and honeydew melons. Finally, while the demand variation for apples is largely due to regular weekly peaks on Mondays, there is no such demand regularity for garden parsley, wraps, and honeydew melons. Given this immense diversity in demand patterns, it comes as no surprise that there is consensus in the literature that no single model can universally outperform all other natural candidate models for demand forecasting (Fildes, 1989; Ulrich et al., 2021). Both the class of statistical models (e.g. linear vs. quantile regression) and also any associated distributional assumption (e.g. normal vs. gamma) may have to be adapted to suit any particular demand pattern at hand (Ulrich et al., 2021). For retailers, there is thus a need to identify, for all relevant SKUs, which type of forecasting model is most promising with regard to cost minimization.

A natural strategy for identifying a suitable forecasting model is to assess any candidate model's historical performance, i.e. considering its accuracy in previous demand periods. To this end, based on an extensive literature review, Fildes (1989) distinguishes individual selection, aggregate selection, and combination forecasts as general strategies for assessing past performance and subsequently producing predictions. Individual selection allocates a specific model to each SKU, while aggregate selection uses the model that performs best on average, across all SKUs. The former strategy is more flexible but less robust than the latter; in other words, aggregate selection is more stable (low variance) in terms of model selection, but involves a higher risk of misspecification (high bias). Combination forecasts build the prediction as a weighted average of each candidate model's forecast, usually trying to optimize the models' weights, although unweighted combinations are surprisingly competitive (referred to as the "forecast combination puzzle"; Claeskens et al., 2016). Rather than matching an individual model to each single SKU, as in individual selection, SKUs can also be clustered a priori based on similarities in demand patterns, such that subsequently, […]
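The three selection strategies distinguished above — individual selection, aggregate selection, and combination forecasts — can be sketched in a few lines. All cost histories, model names, and forecast values below are invented for illustration only:

```python
# Sketch of the three selection strategies distinguished by Fildes (1989):
# individual selection, aggregate selection, and (equal-weight) combination.
# hist_costs[model][sku] = hypothetical average historical cost of a model on an SKU.
hist_costs = {
    "linear":   {"apples": 10.0, "parsley": 9.0},
    "quantile": {"apples": 8.0,  "parsley": 12.0},
}

def individual_selection(costs, sku):
    """Pick, per SKU, the model with the lowest historical cost on that SKU."""
    return min(costs, key=lambda m: costs[m][sku])

def aggregate_selection(costs):
    """Pick the single model with the lowest cost averaged across all SKUs."""
    return min(costs, key=lambda m: sum(costs[m].values()) / len(costs[m]))

def combination_forecast(forecasts):
    """Equal-weight ('unweighted') average of the candidate models' forecasts."""
    return sum(forecasts) / len(forecasts)

best_apples = individual_selection(hist_costs, "apples")  # per-SKU choice (low bias)
best_overall = aggregate_selection(hist_costs)            # one model for all (low variance)
combo = combination_forecast([100.0, 120.0])              # averages away model-specific error
```

The bias–variance trade-off discussed above is visible even in this toy setup: the per-SKU choice adapts to each SKU's history, while the aggregate choice is identical for every SKU.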
[…] explored in the demand forecasting literature include, inter alia, various types of (artificial) neural networks, such as back propagation, fuzzy, and recurrent (Ainscough & Aronson, 1999; Kuo, 2001; Zhang & Qi, 2005; Aburto & Weber, 2007; Salinas et al., 2020), support vector machines (Gür Ali & Yaman, 2013; Di Pillo et al., 2016), and generally many different classification methods (Ahmed et al., 2010; Ferreira et al., 2016; Baardman et al., 2018; Huber & Stuckenschmidt, 2020).

2.3. Model selection in the retailing literature

The literature on demand forecasting suggests that no single model universally outperforms other widely applied candidate models (Fildes, 1989; Wolpert & Macready, 1997; Ulrich et al., 2021). Across different SKUs, the associated best-performing model may differ, for example, in the model class (e.g. linear vs. quantile regression), or in the distributional assumption (e.g. normal vs. gamma). For any given retailer, there is thus a need to find a mapping from the demand pattern of any given SKU to the associated most promising combination of forecasting model and (potentially) associated distribution, with the aim of minimizing the overall costs.

A common strategy for selecting the best forecasting model is to assess the candidate models' historical performances, i.e. considering their accuracy in previous demand periods. This is most commonly done using either individual selection, aggregate selection, or combination forecasts (Fildes, 1989). Under individual selection, a model is chosen separately for each SKU under consideration, by assessing, for any given SKU, the candidate models' past performances exclusively for that SKU. Aggregate selection is used if the decision maker selects a single model to be applied to forecasting demand for all SKUs of interest, by evaluating the past performances across all these SKUs. In either of the two approaches, one further needs to determine the length of the temporal resolution, e.g. one day, one week, or one month, for both the evaluation and the actual application (before re-assessment). In principle, the potential to minimize costs increases with shorter periods. However, this increased potential comes at the cost of increased variability in the model selection. Irrespective of the chosen temporal resolution, stability is the main advantage of aggregate selection over individual selection. Indeed, in ex-ante model selections, individual selection only rarely generates lower costs than the simpler aggregate selection (Fildes, 1989). Combination forecasts combine a set of forecasting models by building the prediction as a weighted average, either with equal weights (Fildes, 1989) or by optimizing each model's weights (Claeskens et al., 2016). Claeskens et al. (2016) identify that an unweighted combination forecast outperforms an optimally weighted one surprisingly often, as the estimation of the weights introduces a source of variance.

With respect to the selection of candidate models to be included in a combination forecast, Kourentzes et al. (2018) argue that the construction of a pool of reasonable models for forecasting is of crucial importance, and propose a heuristic for forecast pooling. Clemen (1989), Genre et al. (2013), and Claeskens et al. (2016) show that simple weighting schemes, e.g. the arithmetic mean, often yield equally good or better forecasts than more complex weighting schemes; individual model weightings are more promising only if there is ex-ante evidence that the different models generate notably different forecasting accuracies (Fildes & Petropoulos, 2015). The combination forecast reduces the risk of SKU-specific model misspecification, i.e. bias, compared to when using aggregate selection, while at the same time offering higher stability than individual selection. Thus, depending on the demand structure of the SKU, the benefits of robustness and stability from using combination forecasts may exceed the potential gains through model selection (Barrow & Kourentzes, 2016). In any given application, any of the three approaches may yield the highest forecasting accuracy (Fildes, 1989).

An alternative strategy that more explicitly addresses the heterogeneity in demand patterns is to cluster items with similar demand patterns, and then assign each cluster to an appropriate forecasting model (Eaves & Kingsman, 2004; Williams, 1984). Tanaka (2010), Ferreira et al. (2016), and Schneider and Gupta (2016) apply clustering approaches for new SKUs without historical demand. Existing SKUs are clustered based on similarity measures of data set characteristics, e.g. demand patterns, brand, price, and customer reviews, and new SKUs are assigned to the most adequate of these clusters. Boylan et al. (2008) propose a categorization framework for non-normal demand patterns, including intermittent demand (infrequent demand occurrences), slow-moving demand (low average demand), erratic demand (highly variable demand), lumpy demand (intermittent and highly variable demand), and clumped demand (intermittent, but constant demand). Potential problems with this type of clustering of seemingly homogeneous SKUs include the selection of break points between clusters, e.g. between low and high demand variability, and the selection of the number of clusters. Both decisions are made based on the historical data available, and as such may not always generalize well to new data sets. This is particularly problematic for demand patterns that may vary over time, for which partitioning the SKUs into clusters requires constant monitoring and regular adaptations.

At the SKU level in retailing, demand forecasts aim to generate predictions for a large number of SKUs over a short forecasting horizon (Fildes et al., 2019). Instead of building clusters manually, meta-learning methods allow the algorithm to learn the structure of the problem in an automated way based on the given data. The meta-learner gains meta-knowledge by linking the features to the forecasting accuracy of the forecasting methods available, and applies the developed knowledge to select the best expected forecasting method for a given feature set. Thus, a meta-learner usually consists of a feature set, base forecasters, and learned knowledge of how to select the best forecasting method given the features and base forecasters (see, e.g., Lemke et al., 2015; and Ma & Fildes, 2020). Contributions in the literature have demonstrated improved forecasting accuracy using meta-learning frameworks compared to the application of base forecasters (see, e.g., Prudêncio & Ludermir,
2004; Wang et al., 2009; Lemke & Gabrys, 2010; Montero-Manso et al., 2020; and Ma & Fildes, 2020).

However, none of the existing contributions derives predictions for high service-level targets, accounting for the asymmetric costs for shortage and excess inventory that we find in retail practice. This limitation will be overcome with the approach developed below.

3. Classification-based model selection

We frame the model selection problem as a classification task, where the classes correspond to the different forecasting models available. Classification is a supervised learning approach where an algorithm learns the optimal mapping of input variables onto a target variable (corresponding to a class). After training the algorithm based on historical input variables with known class labels, the mapping can be applied to new input variables associated with previously unseen data. Referring to the definition of meta-learning by Ma and Fildes (2020), our approach includes all three components of meta-learning: features that explain customer demand, a pool of base forecasters, and a meta-learner that links the features to a forecasting method. In our approach, the decision tree corresponds to the meta-learner, as it selects the best expected model from labeled data.

In our setting, p input variables relevant for demand forecasting—e.g. price, known demand, or day of the week—are combined into a p-dimensional feature vector x, and any class m ∈ M = {linear regression, quantile regression, …} corresponds to a candidate model for forecasting demand. In the first stage of our classification-based model selection (CMS), we label data from past demand periods according to an evaluation of the hypothetical performance of forecasting models within those demand periods. Specifically, in the training data, each demand period t comprises the associated feature vector x_t and a label m_t ∈ M resulting from hypothetical model performance. Based on the training data, we then set up a classifier, i.e. a mapping

φ : Ω → M;  x ↦ φ(x) = m,   (4)

with Ω denoting the p-dimensional space of all possible feature vectors. For future data, in the second stage we apply φ to automatically select the most promising forecasting model from M based on the feature vector x. In essence, the classification algorithm thus selects the demand forecasting model to apply by identifying which models worked best in past demand periods similar to the one currently considered.

3.1. Generating labeled data

In the first stage indicated above, we need to complement each feature vector in the training data, corresponding to certain characteristics of the associated demand period, by a label indicating the most adequate model. At first sight, it might appear natural to simply choose as the label, for each past demand period, the model that would have (ex-post) produced the lowest total costs from (3) if it had been applied to forecast demand in that period. However, at a very high service level, say 97%, a model that generally tends to underestimate demand variance, e.g. Poisson regression (Ulrich et al., 2021), might be the best-performing model in the majority of demand periods—simply because its competitors' predictions in most instances will more drastically overshoot the actual demand—but would be extremely costly in those demand periods where it is not the best-performing model, due to undershooting the actual demand, which comes at a much higher price. As a consequence, it would be myopic to ex-post deem a model to be the most adequate choice for any given demand period just because it happened to have produced the lowest costs on that particular occasion, when in fact the expected costs under that model were very high.

It is worth remembering that labeling the feature vector of some past demand periods is only an intermediate step towards solving the underlying economic problem. When a decision on q is to be made at some time t, the decision maker seeks to choose the best forecasting model m for predicting the probability distribution of the uncertain demand at this period t. The resulting cumulative distribution function, F̂_t^m, would subsequently be used to determine the optimal inventory level q̂_t^m with respect to the newsvendor problem, as stated in Eq. (1). Hence, a decision maker would choose model m if the inventory level q̂_t^m resulting from the application of that model yields minimal expected total costs,

E[C_t(q̂_t^m)] = h · E[(q̂_t^m − D_t)^+] + b · E[(D_t − q̂_t^m)^+].   (5)

Within the labeling process, model performance thus ought to be measured according to (5), rather than based only on the actual costs realized at time t. Since the true probability distribution that is needed to calculate (5) is unknown, even for past demand periods, we need to estimate expected total costs using information from our data set. We propose to do so by evaluating the overall performance of model m, for any given past demand period t, on a set Ψ_t of demand periods similar to the one under consideration. This idea is similar in spirit to nonparametric estimation techniques (e.g. kernel-based nonparametric regression), where in order to inform the estimator of a quantity of interest (e.g. the value of the regression function at a given covariate value), a neighborhood of sufficiently similar data points is considered (defined, e.g., using bandwidth and kernel functions).

Specifically, for the data from the training set, the labeling process in our approach involves the following steps:

1. estimation of each candidate forecasting model m based on historical data;
2. for each demand period t and each m, we obtain
   – an estimate F̂_t^m of the demand distribution F_t, which we apply to derive the cost-optimal q̂_t^m according to the newsvendor problem,
   – or, alternatively, a direct prediction of the cost-optimal q̂_t for the nonparametric quantile regression models;
3. from those forecasts, and given cost parameter values for shortage b and excess inventory h, the model-specific costs for each demand period are calculated according to Eq. (3);
4. for each demand period t, we identify a set Ψ_t of demand periods similar to the one demand period considered, and we select as the label m_t the model that realized the lowest overall costs across all elements from Ψ_t:

m_t = argmin_{m ∈ M} Σ_{t′ ∈ Ψ_t} c_{t′}(q̂_{t′}^m).

The main choice to be made is clearly how to select the demand periods to be included in Ψ_t. To this end, we define a quantile threshold v ∈ [0, 1] (e.g. 10% of all observations for v = 0.1), and from all available demand periods, we then include those in Ψ_t that are among the 100 · v% with the smallest distances to the feature vector associated with the demand period at time t. With this approach, the size of Ψ_t depends on the total number of feature vectors available in the training data, which is desirable: for larger training data sets, there will be more feature vectors in the vicinity of the feature vector at time t, in which case it makes sense to build a larger Ψ_t to increase robustness.

The quantile threshold v can be optimized via hyperparameter tuning, i.e. by validating each v from a set of possible quantile thresholds using out-of-sample data. Selecting the most adequate value of v via hyperparameter tuning thus requires an additional partition of the training data into a calibration and a validation set. Specifically, we propose to split the training data available such that 90% of the data are used to train the classifier, for any given quantile threshold v, with the remaining 10% of the data used to evaluate the performance of the classifier across the different values of v. The parameter v is then fixed at the value that led to the best performance in the validation data set—as measured by the total costs resulting from applying the classifier for the given v—and subsequently used to build the final classification algorithm based on the complete training set.

To measure distances between feature vectors, various metrics can be used, such as the Manhattan, Minkowski, or Mahalanobis distance; see Prasath et al. (2017) for a review, including a performance comparison in multiple applications. We chose to use the Euclidean distance metric, as it is most widely used, due to its simplicity and intuitive interpretation as the straight-line distance between two points (Hu et al., 2016). To avoid the dominance of features with values that are relatively large in magnitude (e.g. mean demand), and to generally avoid any scale dependence, we normalize features to values between 0 and 1, corresponding to the lowest value and the highest value of the feature in the training set. Categorical features such as the "day of the week" are rewritten using binary features (e.g. Monday vs. not Monday, and likewise for the other days), and are identified as numerical using the values 0 for "no" or 1 for "yes". To avoid the relative dominance of the binary features (e.g. five specifications for the feature "day of the week"), we divide the corresponding feature values by the number of dummy variables used to specify the feature (see the Appendix for an example).

3.2. Selection of a classifier

After building the labeled training data, as described in the previous section, with a fixed v, chosen for example via hyperparameter tuning, the CMS approach we suggest then proceeds by training a classifier φ, which in our context is a mapping from the set of potential feature vectors x ∈ Ω to the set of candidate forecasting models m ∈ M; cf. Eq. (4). In principle, various classifiers can be used for this purpose, e.g. logistic regression, linear discriminant analysis, k-nearest-neighbors, or support vector machines (Press & Wilson, 1978; Kotsiantis et al., 2006). In retail practice, an important requirement of operational management is transparency regarding how forecasts are obtained, as this allows the inclusion of expert knowledge in case of exceptional circumstances (cf. Trusov et al., 2006; Fildes et al., 2009; Davydenko & Fildes, 2013). Decision trees thus constitute an obvious choice, as they are particularly intuitive and easy to interpret (Goodwin & Wright, 2014). Unlike, say, k-nearest-neighbors or support vector machines, they do not only produce a class as output, but also allow insights into the drivers of the model choice (Armstrong, 2001; Schwartz et al., 2014). In this contribution, we thus focus on decision trees, noting that the performance of the CMS approach might be further improved by additionally optimizing with respect to the choice of the classifier applied.

A decision tree is an algorithm comprising a sequence of decisions to build a tree structure, which, when applied to a new feature vector, ultimately leads to its classification. Decisions are commonly binary, and in what follows we restrict our presentation to this type of tree. Fig. 2 shows an example of a decision tree involving binary decisions. The root node at the top represents the full data set. For any new feature vector to be classified, the classification is performed by traversing through the nodes from top to bottom, making binary decisions at every node until arriving at a leaf node, which represents the final class decision, in our case the forecast model to be applied for the given feature characteristics. Every node between the root node at the top and the leaf nodes at the bottom is called a decision node. Each decision node comprises information on (i) the split dimension (i.e. based on which feature the decision is made) and (ii) the split point (i.e. which value defines the threshold, either side of which corresponds to a different decision; for binary variables this naturally reduces to a yes-or-no decision). In the example shown in Fig. 2, a feature vector including known orders greater than 100 units, on a Saturday, leads to the selection of the quantile regression as the forecasting model.

When training the decision tree, each node is optimized with respect to (i) and (ii) using some criterion, e.g. the Gini impurity,

GI = Σ_{m=1}^{M} p_m (1 − p_m),

where p_m denotes the proportion of class m observations within a given set of points, hence the probability that a randomly selected unit will be from class m. For each
class resulting from a decision, this criterion measures its degree of impurity. If the class is pure, then p_m ∈ {0, 1} for all m, such that GI = 0. Optimization with respect to (i) and (ii) then seeks to find, for each decision node (starting at the top), the split dimension and the associated split point such that the weighted sum of the Gini impurities of the resulting classes is minimized (with the weights equal to the overall proportion of the classes). As decision trees are prone to overfitting, trees are usually restricted in their growth, or pruned after maximum growth (Thomassey & Fiordaliso, 2006).

4. Case study

4.1. E-grocery data

The data set available to us covers 111 SKUs within the categories of fruits, vegetables, and bake-off products from five regional fulfillment centers. As most SKUs are supplied in each fulfillment center, the data include 546 time series. The business processes of the e-grocery retailer considered here comprise new types of customer demand data that are not as such available in store retailing. In particular, during the online shopping process, no inventory information is shown to the customer. Customers are thus able to add any SKU with infinite stock availability to the shopping basket. This allows for a comprehensive monitoring of true customer preferences. As a consequence, in contrast to store retailing, where point-of-sale data are distorted by stock-outs, the customer demand information available to the e-grocery retailer is effectively uncensored. In our case study, we consider demand as monitored during the online shopping process of the customer before any restricted availability of SKUs is revealed during check-out. Compared to sales data, demand is thus higher than sales for SKUs affected by stock-outs, and lower than sales for SKUs selected as substitutes. In addition, customers can order up to 14 days in advance. As a consequence, the e-grocery retailer can include the known demand information at the replenishment decision time in the final replenishment order. Assuming that most customers do not cancel a placed order, known customer demand for an SKU can be considered as a minimum demand value. The e-grocery retailer considered here replenishes daily, with a lead time of three days, which includes the processes of stowing and picking within the warehouse. We thus generate three-day-ahead forecasts on a daily basis, e.g. on Wednesday to forecast Saturday demand.

4.2. Feature selection

We generally use features (i.e. explanatory variables) to predict q_t. In the CMS approach, we additionally use the features to identify a set of similar demand periods. The following six features were pre-selected based on business considerations:

(1) price (a continuous variable);
(2) seasonality as captured by the trigonometric variables sin(2πt/365), cos(2πt/365), sin(4πt/365), and cos(4πt/365);
(3) day of the week (five dummies);
(4) ID of the FC (four dummies);
(5) marketing activities (one dummy);
(6) known orders (a count variable).

We cover the supply side of seasonality (price), the demand side of seasonality (as per (2)), the demand profile of the week (the day of the week), geographical differences (the ID of the FC), and marketing activities (a binary flag for whether or not marketing activities were in place). With respect to marketing activities, using a single dummy variable is a rather simple parameterization that might be improved by including additional variables indicating possible marketing measures available in online retailing, such as newsletter marketing or promoting price discounts at the landing page. However, for our case, we simply do not have more specific information in the data set available. We further include the known orders as an additional feature that is not available in traditional store retailing. As customers are able to place orders up to 14 days in advance, known orders represent the corresponding customer demand information available to the retailer at the time of the replenishment decision, i.e. three days prior to the demand period considered. Between the replenishment decision and the […]
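The feature handling described earlier — min–max normalization of continuous features using the training-set range, dummy coding of categorical features, and dividing each dummy by the number of dummies of its feature so that no feature dominates the Euclidean distance — can be sketched as follows. All names, ranges, and values here are hypothetical, not the paper's actual data:

```python
# Sketch of the feature preprocessing described above: continuous features are
# min-max normalized to [0, 1] using the training-set range, categorical
# features are dummy-coded, and each dummy is divided by the number of dummies
# of its feature to avoid dominating the Euclidean distance.
# All names and values are illustrative only.

def min_max_normalize(value, train_min, train_max):
    """Scale a continuous feature to [0, 1] based on the training-set range."""
    return (value - train_min) / (train_max - train_min)

def scaled_dummies(level, levels):
    """Dummy-code a categorical feature, dividing each dummy by len(levels)."""
    return [(1.0 if level == lv else 0.0) / len(levels) for lv in levels]

def feature_vector(price, known_orders, day, train_ranges, days):
    """Assemble one demand period's (partial) feature vector."""
    x = [
        min_max_normalize(price, *train_ranges["price"]),
        min_max_normalize(known_orders, *train_ranges["known_orders"]),
    ]
    x.extend(scaled_dummies(day, days))
    return x

days = ["Mon", "Tue", "Wed", "Thu", "Fri"]  # five day-of-week dummies, as in the paper
ranges = {"price": (0.5, 2.5), "known_orders": (0, 200)}  # hypothetical training ranges
x = feature_vector(price=1.5, known_orders=50, day="Wed",
                   train_ranges=ranges, days=days)
```

Feature vectors preprocessed this way can then be compared via straight-line (Euclidean) distance when building the neighborhood sets Ψ_t.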
Table 4
Demand forecasting model that realizes the lowest total costs for the SKU of honeydew melons for each month in 2017 for one selected fulfillment center, and the percentage savings in costs compared to QuantRegForest.

SKU              Month–year  Best model      Savings
Honeydew melons  01–17       QuantRegForest  0%
Honeydew melons  02–17       GAMLSS.gamma    −49%
Honeydew melons  03–17       Reg             −13%
Honeydew melons  04–17       QuantRegForest  0%
Honeydew melons  05–17       ARIMAX          −42%
Honeydew melons  06–17       ARIMAX          −17%
Honeydew melons  07–17       QuantReg        −22%
Honeydew melons  08–17       GAMLSS.negBin   −32%
Honeydew melons  09–17       GAMLSS.Poisson  −39%
Honeydew melons  10–17       GAMLSS.Poisson  −10%
Honeydew melons  11–17       GAMLSS.Poisson  −34%
Honeydew melons  12–17       GAMLSS.Poisson  −16%
Honeydew melons  2017                        −18%

4.5. Building labeled data

For the CMS approach to be applicable, we need labeled training data (cf. Section 3.1). To that end, for any 13 months of data, with the sliding-window approach to classifier training and subsequent classification (see Section 4.3), the demand periods from the first 12 months are labeled. This is achieved by identifying, for each of the demand periods to be labeled, a set Ψ_t of demand periods similar to the one considered (from the same 12 months of data). Specifically, we include all those demand periods in Ψ_t whose feature vectors are among the 100 · v% nearest to the feature vector of the demand period at time t. The individual v for each SKU and month was chosen by validating, separately for each SKU, each v from a set of possible quantile thresholds (10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90%) using out-of-sample data. Each feature vector is then labeled with the model that realizes the lowest total costs overall for Ψ_t.

4.6. Training the decision tree

As described in Section 4.3, for each month from 11/2017–10/2018, the decision tree is trained based on the preceding 12 months of labeled data, and then applied to automatically select the forecasting model for each demand period in the month considered. Thus, all demand periods from 11/2017–10/2018 are used as out-of-sample test data. To tailor the decision tree to the training data, we use the R function rpart (Therneau & Atkinson, 2018) with the default options. Thus, in particular, we use the Gini impurity measure to determine the split points. These default options also reduce the risk of overfitting when training the tree.

4.7. Benchmark approaches

Based on our literature review in Section 2.3, we con- […] for evaluation and for the actual application (before re-assessment); we consider daily, weekly, and monthly model selection. Considering the concept of pooling by Kourentzes et al. (2018)—i.e. to pre-select only reasonable models for forecasting—we exclude GAMLSS.Poisson from the combination forecast, because for this model the realized costs are significantly higher compared to the benchmarks, and the service level strongly deviates from the service-level target (Ulrich et al., 2021). We further use all base forecasters (i.e. fixed individual models, without selection) as additional benchmarks. Overall, we apply 15 benchmarks to assess the performance of our CMS approach: daily/weekly/monthly individual selection, daily/weekly/monthly aggregate selection, combination forecast, ARIMAX, GAMLSS.gamma, GAMLSS.negBin, GAMLSS.normal, GAMLSS.Poisson, Reg, QuantReg, and QuantRegForest.

4.8. Results

The boxplots shown in Fig. 3 display the SKU-specific percentage differences between the costs obtained under the seven selection and combination benchmarks on the one hand, and the CMS approach on the other hand, for the data from the test period 11/2017–10/2018 under the 97% service-level target.

Overall, the CMS approach and the combination forecast performed about equally well. In particular, they achieved the same median total costs across SKUs. Individual and aggregate selection yielded 2%–13% higher median total costs than CMS, depending on the time resolution considered. Selection at the daily level performed particularly poorly, due to the relatively strong negative impact of the Poisson model: for high service-level targets, the Poisson performs best in many demand periods but is extremely costly in those demand periods where it is undershooting. To test our results for formal significance, we additionally applied the multiple comparison with the best (MCB) test (Koning et al., 2005; Makridakis et al., 2020a). The performance of CMS was deemed to be significantly better than those of all benchmarks except for weekly aggregate selection and the combination forecast. Within the CMS approach, QuantReg was most often selected as the forecasting model (28% of all cases), followed by GAMLSS.negBin (23%) and GAMLSS.gamma (17%).

Table 5 displays the complete results, including the performance of the base forecasters, i.e. the individual models. QuantReg and GAMLSS.negBin, as fixed models, did indeed perform as well as CMS and the combination forecast in terms of median costs. The combination forecast, QuantReg, weekly individual selection, and CMS generated the lowest costs for 18%, 15%, 10%, and 9% of all SKUs, respectively. CMS, the combination forecast, GAMLSS.gamma, and GAMLSS.negBin achieved service levels of 96%. That is, they deviated from the service-level target by only 1%. As expected, GAMLSS.
sider individual selection, aggregate selection, and an Poisson substantially undershot the desired service level.
equally weighted combination forecast as benchmark Overall, the results obtained in this case study indicate
methods. In principle, individual selection and aggregate that CMS generates modest improvements in terms of
selection can be applied for any temporal resolution, both median costs and service-level realization compared to
Fig. 3. Boxplots of the SKU-specific percentage differences in realized costs for individual selection (IS), aggregate selection (AS), and combination
forecast (combination) with the different temporal resolutions considered (day, week, and month) compared to the CMS method. Across the seven
benchmarks, there are ten percentage differences that lie outside the range depicted here.
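For concreteness, the quantities plotted in Fig. 3 are simple relative cost differences per SKU between a benchmark and CMS. The following is a minimal illustrative sketch; the costs and variable names are hypothetical, not taken from the case study:

```python
# Illustrative sketch: SKU-specific percentage differences in realized
# costs of a benchmark method relative to the CMS approach, as
# summarized in the boxplots of Fig. 3. All numbers are hypothetical.

def pct_cost_diff(benchmark_costs, cms_costs):
    """Per-SKU percentage cost difference; positive values mean the
    benchmark realized higher costs than CMS on that SKU."""
    return [100.0 * (b - c) / c for b, c in zip(benchmark_costs, cms_costs)]

# Hypothetical total costs per SKU over a test period
cms = [120.0, 80.0, 200.0]
benchmark = [132.0, 76.0, 210.0]

diffs = pct_cost_diff(benchmark, cms)
# roughly +10%, -5%, and +5% for the three SKUs
```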
Table 5
Results of the CMS approach and the benchmark methods for the 111 SKUs analyzed in the test period 11/2017–10/2018. "Share best method" gives the percentage of SKUs for which a given method realized the lowest costs. "Diff. median costs" is the percentage difference between the median total costs (over all SKUs) under a method and those obtained when using CMS. "Realized SL" indicates the service level achieved.

Method                        Diff. median costs  Share best method  Realized SL
Model selection approaches
  CMS                          0%                  9%                96%
  Daily individual            13%                  4%                92%
  Weekly individual            3%                 10%                95%
  Monthly individual           7%                  4%                95%
  Daily aggregate              2%                  1%                95%
  Weekly aggregate             2%                  8%                95%
  Monthly aggregate            2%                  5%                95%
Combination forecast
  Combination (equal weights)  0%                 18%                96%
Base forecasters
  ARIMAX                      27%                  1%                95%
  GAMLSS.gamma                 4%                  5%                96%
  GAMLSS.negBin                0%                  9%                96%
  GAMLSS.normal                2%                  2%                95%
  GAMLSS.Poisson              25%                  5%                88%
  Reg                          2%                  4%                95%
  QuantReg                     0%                 15%                95%
  QuantRegForest              17%                  2%                94%
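To make the mechanics behind these results concrete, the labeling step of Section 4.5 can be sketched as follows. This is an illustrative sketch with toy feature vectors, a Euclidean distance, and invented costs; it is not the authors' implementation (they work in R), and the neighborhood fraction v shown here is arbitrary:

```python
# Illustrative sketch of the labeling step (Section 4.5): each demand
# period is labeled with the forecasting model that realizes the lowest
# total costs over its neighborhood Psi_t of similar demand periods.
# Feature vectors, costs, and v are toy values, not the case-study data.
import math

def label_periods(features, model_costs, v):
    """features: one feature vector per demand period.
    model_costs: dict mapping model name -> realized cost per period.
    v: fraction of nearest periods forming the neighborhood Psi_t.
    Returns the cost-minimizing model label for each period."""
    n = len(features)
    k = max(1, round(v * n))  # size of Psi_t: the 100*v % nearest periods
    labels = []
    for t in range(n):
        # rank all periods by Euclidean distance to period t
        order = sorted(range(n), key=lambda s: math.dist(features[t], features[s]))
        psi_t = order[:k]
        # label = model with the lowest total cost over Psi_t
        best = min(model_costs, key=lambda m: sum(model_costs[m][s] for s in psi_t))
        labels.append(best)
    return labels

features = [(1.0, 0.2), (1.1, 0.3), (5.0, 2.0), (5.2, 1.9)]
model_costs = {
    "GAMLSS.Poisson": [1.0, 1.2, 9.0, 8.5],  # cheap on low-demand periods
    "QuantReg":       [3.0, 3.1, 2.0, 2.2],  # cheap on high-demand periods
}
labels = label_periods(features, model_costs, v=0.5)
```

A classifier is then trained on the resulting (feature vector, label) pairs and predicts the most promising model for the next month's unlabeled demand periods; the authors fit a decision tree via rpart with default options, whose Gini-impurity splitting also corresponds to the defaults of, e.g., scikit-learn's DecisionTreeClassifier.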
the established selection methods. Whether a cost reduction of about 2%–3% compared to the best of these benchmark approaches is practically relevant, and in particular whether this can offset the additional effort involved in implementing and continually running the much more complex CMS approach, clearly depends on the overall magnitude of the sales. The combination forecast and the two base forecasters, GAMLSS.negBin and QuantReg, performed about equally well in our case study. Thus, our case study does not (and clearly cannot) ultimately determine which of these approaches is superior. More empirical comparisons are required, and it should be
emphasized here that both the CMS approach (see the discussion below) and the combination forecast (via optimization of the weights) can be further refined and improved. Our case study thus merely provides a first impression of the potential of the CMS approach relative to existing benchmarks, under the conditions present in our setting.

The setting in our case study involves a fairly high service-level target of 97%. When re-running the analysis for the much lower (and in our case economically disadvantageous) service-level target of 75%, the differences in performance between the selection and combination methods considered were much smaller (0%–3% difference in median costs for the 75% service level vs. 0%–13% for the 97% service level).1 This is to be expected, because in such a case the different forecasting models applied are not as sensitive as when predicting the extreme right tail of the demand distribution, due to the lower asymmetry in the cost parameters.

5. Discussion

In this paper, we presented classification-based model selection (CMS) as a general new approach to automated model selection in retail demand forecasting. CMS offers a flexible framework for addressing the practical challenges associated with the ever more complex demand patterns encountered in retail practice. Specifically, we proposed leveraging the large amounts of demand data being collected, using a data-driven (algorithmic) approach to the model selection exercise. Compared to the application of more complex meta-learning methods that pool SKUs and aim to find the best combination of base forecasters (see, e.g., Ma & Fildes, 2020; Montero-Manso et al., 2020), CMS benefits from better interpretability and, as a consequence, potentially higher acceptance by decision makers. The approach allows expert adjustments for information not captured by the data—a common practice in retail forecasting (Fildes et al., 2019). Compared to the established individual selection approach, the key idea of our framework is that model performance is evaluated not on the most recent demand periods, but rather on the most similar past demand periods. However, as general limitations, considering each SKU individually does not allow us to capture cross-sectional patterns, and limited data may result in noisy elasticities (Ma & Fildes, 2020).

Our exploratory case study constitutes a proof of concept of the proposed CMS approach and, more generally, a comparison of several conceptually different approaches to model selection in an e-grocery setting. At this point, however, there is no clear evidence indicating that our approach is superior to existing ones, and in particular to combination forecasts. While in our case study we used e-grocery data, the approach is also applicable to traditional store retailing data, or in fact to other fields of forecasting. The main objective of the present paper was to lay out the general idea of using classifiers to match demand periods to forecasting models. As a consequence, for clarity of presentation we abstained from optimizing each individual analysis step. In particular, we made rather ad hoc choices regarding (i) the distance measure used in the labeling process, (ii) which features to include in the forecasting models, and—perhaps most importantly—(iii) which classification algorithm to apply for selecting models. To some extent these choices were made based on business considerations, e.g. when pre-selecting features known to be relevant in retail practice, or when using decision trees for interpretability. Other choices concerning the implementation were driven primarily by the need for conciseness of presentation. For example, much effort could go into feature selection: rather than using the rather simple AIC-based selection (based only on the linear model), LASSO regularization could have been used (see, e.g., Uniejewski et al., 2019). The inclusion of additional numerical features capturing the characteristics of the time series data, e.g. concerning stability, lumpiness, correlation, etc., may further improve the performance (Ma & Fildes, 2020). However, it is difficult to capture such properties embedded in time series data by hand, as the space of possible features is large. A machine learning framework for automatically learning a suitable feature representation from raw time series, as proposed by Ma and Fildes (2020), could in principle be implemented within our CMS approach, but would further increase the computational complexity. Finally, including information on feature relevance might improve the selection of demand periods similar to the one being considered.

The given contribution should thus be regarded as a starting point for future research into automated model selection in demand forecasting, as it seems highly likely that optimizing steps (i)–(iii) above will further improve the performance of the approach. We also envisage possible extensions on a higher conceptual level. First, there may be alternative strategies for labeling the training data, e.g. strategies that more explicitly accommodate the misclassification costs (thus replacing the use of v, and hence of a neighborhood, with a cost function within the classification). Second, many of the classifiers that could potentially be applied within the CMS approach in fact provide probabilistic information on class membership (here corresponding to a pointer to the most promising model for demand forecasting). This information can easily be translated into weights of the associated candidate models within a combination forecast. Overall, given its conceptual appeal and some promising first results, we are confident that the CMS approach will prove a useful addition to the suite of demand forecasting approaches. However, our empirical comparison of various model selection approaches also provides further evidence that there is no one-size-fits-all method, indicating that demand forecasting in practice will generally need to be tailored to the specific business setting.

1 Detailed results for the 75% service level are reported in the Supplementary Material.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References

Genre, V., Kenny, G., Meyler, A., & Timmermann, A. (2013). Combining expert forecasts: Can anything beat the simple average? International Journal of Forecasting, 29(1), 108–121. http://dx.doi.org/10.1016/j.ijforecast.2012.06.004.

Goodwin, P., & Wright, G. (2014). Decision analysis for management judgment (5th ed.). Wiley.

Gür Ali, Ö., & Yaman, K. (2013). Selecting rows and columns for training support vector regression models with large retail datasets. European Journal of Operational Research, 226(3), 471–480. http://dx.doi.org/10.1016/j.ejor.2012.11.013.

Hu, L. Y., Huang, M. W., Ke, S. W., & Tsai, C. F. (2016). The distance function effect on k-nearest neighbor classification for medical datasets. SpringerPlus, 5(1304). http://dx.doi.org/10.1186/s40064-016-2941-7.

Huber, J., & Stuckenschmidt, H. (2020). Daily retail demand forecasting using machine learning with emphasis on calendric special days. International Journal of Forecasting, 36(4), 1420–1438. http://dx.doi.org/10.1016/j.ijforecast.2020.02.005.

Koenker, R., & Hallock, K. F. (2001). Quantile regression. Journal of Economic Perspectives, 15(4), 143–156. http://dx.doi.org/10.1257/jep.15.4.143.

Koning, A. J., Franses, P. H., Hibon, M., & Stekler, H. O. (2005). The M3 competition: Statistical tests of the results. International Journal of Forecasting, 21(3), 397–409. http://dx.doi.org/10.1016/j.ijforecast.2004.10.003.

Kotsiantis, S. B., Zaharakis, I. D., & Pintelas, P. E. (2006). Supervised machine learning: A review of classification techniques. Artificial Intelligence Review, 26(3), 159–190. http://dx.doi.org/10.1007/s10462-007-9052-3.

Kourentzes, N., Barrow, D., & Petropoulos, F. (2018). Another look at forecast selection and combination: Evidence from forecast pooling. International Journal of Production Economics, 209, 226–235. http://dx.doi.org/10.1016/j.ijpe.2018.05.019.

Kuo, R. J. (2001). Sales forecasting system based on fuzzy neural network with initial weights generated by genetic algorithm. European Journal of Operational Research, 129(3), 496–517. http://dx.doi.org/10.1016/S0377-2217(99)00463-4.

Lemke, C., Budka, M., & Gabrys, B. (2015). Metalearning: A survey of trends and technologies. Artificial Intelligence Review, 44(1), 117–130. http://dx.doi.org/10.1007/s10462-013-9406-y.

Lemke, C., & Gabrys, B. (2010). Meta-learning for time series forecasting and forecast combination. Neurocomputing, 73(10), 2006–2016. http://dx.doi.org/10.1016/j.neucom.2009.09.020.

Ma, S., & Fildes, R. (2020). Retail sales forecasting with meta-learning. European Journal of Operational Research, 288(1), 111–128. http://dx.doi.org/10.1016/j.ejor.2020.05.038.

Maciejowska, K., Nowotarski, J., & Weron, R. (2016). Probabilistic forecasting of electricity spot prices using factor quantile regression averaging. International Journal of Forecasting, 32(3), 957–965. http://dx.doi.org/10.1016/j.ijforecast.2014.12.004.

Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020a). The M4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1), 54–74. http://dx.doi.org/10.1016/j.ijforecast.2019.04.014.

Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020b). The M5 accuracy competition: Results, findings and conclusions. Working paper.

Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020c). The M5 uncertainty competition: Results, findings and conclusions. Working paper.

Makridakis, S., & Winkler, R. L. (1983). Averages of forecasts: Some empirical results. Management Science, 29(9), 987–996. http://dx.doi.org/10.1287/mnsc.29.9.987.

Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7, 983–999. http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf.

Montero-Manso, P., Athanasopoulos, G., Hyndman, R. J., & Talagala, T. S. (2020). FFORMA: Feature-based forecast model averaging. International Journal of Forecasting, 36(1), 86–92. http://dx.doi.org/10.1016/j.ijforecast.2019.02.011.

Nahmias, S. (1994). Demand estimation in lost sales inventory systems. Naval Research Logistics, 41(6), 739–757. http://dx.doi.org/10.1002/1520-6750(199410)41:6<739::AID-NAV3220410605>3.0.CO;2-A.

Prasath, V. B. S., Alfeilat, H. A. A., Lasassmeh, O., Hassanat, A. B. A., & Tarawneh, A. S. (2017). Distance and similarity measures effect on the performance of k-nearest neighbor classifier - a review (pp. 1–50). arXiv preprint. https://arxiv.org/abs/1708.04321.

Press, S. J., & Wilson, S. (1978). Choosing between logistic regression and discriminant analysis. Journal of the American Statistical Association, 73(364), 699–705. http://dx.doi.org/10.1080/01621459.1978.10480080.

Prudêncio, R. B., & Ludermir, T. B. (2004). Meta-learning approaches to selecting time series models. Neurocomputing, 61(1–4), 121–137. http://dx.doi.org/10.1016/j.neucom.2004.03.008.

Ramaekers, K., & Janssens, G. K. (2008). On the choice of a demand distribution for inventory management models. European Journal of Industrial Engineering, 2(4), 479–491. http://dx.doi.org/10.1504/EJIE.2008.018441.

Rădăşanu, A. C. (2016). Inventory management, service level and safety stock. Journal of Public Administration, 9(9), 145–153. https://pdfs.semanticscholar.org/98b7/5ecc0cd653b5620180aab79f4024f9726016.pdf.

Salinas, D., Flunkert, V., Gasthaus, J., & Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3), 1181–1191. http://dx.doi.org/10.1016/j.ijforecast.2019.07.001.

Schneider, M. J., & Gupta, S. (2016). Forecasting sales of new and existing products using consumer reviews: A random projections approach. International Journal of Forecasting, 32(2), 243–256. http://dx.doi.org/10.1016/j.ijforecast.2015.08.005.

Schwartz, E. M., Bradlow, E. T., & Fader, P. S. (2014). Model selection using database characteristics: Developing a classification tree for longitudinal incidence data. Marketing Science, 33(2), 188–205. http://dx.doi.org/10.1287/mksc.2013.0825.

Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310. http://dx.doi.org/10.1214/10-STS330.

Steiner, W. J., Brezger, A., & Belitz, C. (2007). Flexible estimation of price response functions using retail scanner data. Journal of Retailing and Consumer Services, 14(6), 383–393. http://dx.doi.org/10.1016/j.jretconser.2007.02.008.

Tanaka, K. (2010). A sales forecasting model for new-released and nonlinear sales trend products. Expert Systems with Applications, 37(11), 7387–7393. http://dx.doi.org/10.1016/j.eswa.2010.04.032.

Taylor, J. W. (2007). Forecasting daily supermarket sales using exponentially weighted quantile regression. European Journal of Operational Research, 178(1), 154–167. http://dx.doi.org/10.1016/j.ejor.2006.02.006.

Teunter, R. H., Babai, M. Z., & Syntetos, A. A. (2010). ABC classification: Service levels and inventory costs. Production and Operations Management, 19(3), 343–352. http://dx.doi.org/10.1111/j.1937-5956.2009.01098.x.

Therneau, T. M., & Atkinson, E. J. (2018). rpart: Recursive partitioning and regression trees (version 4.1-13). https://cran.r-project.org/package=rpart.

Thomassey, S., & Fiordaliso, A. (2006). A hybrid sales forecasting system based on clustering and decision trees. Decision Support Systems, 42(1), 408–421. http://dx.doi.org/10.1016/j.dss.2005.01.008.

Trusov, M., Bodapati, A. V., & Cooper, L. G. (2006). Retailer promotion planning: Improving forecast accuracy and interpretability. Journal of Interactive Marketing, 20(3–4), 71–81. http://dx.doi.org/10.1002/dir.20068.

Ulrich, M., Jahnke, H., Langrock, R., Pesch, R., & Senge, R. (2021). Distributional regression for demand forecasting in e-grocery. European Journal of Operational Research, 294(3), 831–842. http://dx.doi.org/10.1016/j.ejor.2019.11.029.

Uniejewski, B., Marcjasz, G., & Weron, R. (2019). Understanding intraday electricity markets: Variable selection and very short-term price forecasting using LASSO. International Journal of Forecasting, 35(4), 1533–1547. http://dx.doi.org/10.1016/j.ijforecast.2019.02.001.

Wang, X., Smith-Miles, K., & Hyndman, R. (2009). Rule induction for forecasting method selection: Meta-learning the characteristics of univariate time series. Neurocomputing, 72(10), 2581–2594. http://dx.doi.org/10.1016/j.neucom.2008.10.017.

Weber, A., Steiner, W. J., & Lang, S. (2017). A comparison of semiparametric and heterogeneous store sales models for optimal category pricing. OR Spectrum, 39(2), 403–445. http://dx.doi.org/10.1007/s00291-016-0459-6.

Williams, T. M. (1984). Stock control with sporadic and slow-moving demand. Journal of the Operational Research Society, 35(10), 939–948. https://www.jstor.org/stable/2582137.

Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82. http://dx.doi.org/10.1109/4235.585893.

Zhang, G. P., & Qi, M. (2005). Neural network forecasting for seasonal and trend time series. European Journal of Operational Research, 160(2), 501–514. http://dx.doi.org/10.1016/j.ejor.2003.08.037.

Zhang, W., Quan, H., & Srinivasan, D. (2018). Parallel and reliable probabilistic load forecasting via quantile regression forest and quantile determination. Energy, 160, 810–819. http://dx.doi.org/10.1016/j.energy.2018.07.019.

Zipkin, P. H. (2000). Foundations of inventory management (1st ed.). Boston: McGraw-Hill.