
Computers & Industrial Engineering 121 (2018) 1–7


A support vector machine for model selection in demand forecasting applications
Marco A. Villegas a,*, Diego J. Pedregal a, Juan R. Trapero b

a Department of Business Administration, ETSI Industriales, Universidad de Castilla-La Mancha, Ciudad Real 13071, Spain
b Department of Business Administration, Faculty of Chemical Science and Technology, Universidad de Castilla-La Mancha, Ciudad Real 13071, Spain

Keywords: Demand forecasting; Supply chain; SVM; Time series analysis; Model selection

Abstract

Time series forecasting has been an active research area for decades, receiving considerable attention from very different domains, such as econometrics, statistics, engineering, mathematics, medicine and social sciences. Moreover, with the emergence of the big data era, the automatic identification of appropriate techniques remains a compulsory intermediate stage of any big data implementation with predictive analytics purposes. Extensive research on model selection and combination has revealed the benefits of such techniques in terms of forecast accuracy and reliability. Several criteria for model selection have been proposed and used for decades with very good results; the Akaike information criterion and the Schwarz Bayesian criterion are two of the most popular. However, research on the combination of several criteria along with other sources of information in a unified methodology remains scarce.

This study proposes a new model selection approach that combines different criteria using a support vector machine (SVM). Given a set of candidate models, rather than considering any individual criterion, an SVM is trained at each forecasting origin to select the best model. This methodology will be particularly interesting for scenarios with highly volatile demand because it allows changing the model when it does not fit the data sufficiently well, thereby reducing the risk of misusing modeling techniques in the automatic processing of large datasets.

The effects of the proposed approach are empirically explored using a set of representative forecasting methods and a dataset of 229 weekly demand series from a leading household and personal care manufacturer in the UK. Our findings suggest that the proposed approach results in more robust predictions, with lower mean forecasting errors and biases than the base forecasts.

1. Introduction

Companies have traditionally adopted forecasting techniques to support decision making on a daily basis, bringing data from different sources into a common data infrastructure. However, the primary focus was mainly the development of reporting tools (Davenport & Harris, 2007). In recent years, so-called business analytics has introduced a new approach in this domain by leveraging the latest progress in both computer science (e.g., data mining algorithms) and hardware technology (e.g., cloud computing and in-memory technology), thus enabling the integration of data sources and business operations at a higher level of abstraction (Sheikh, 2013).

With the emergence of the big data era, the automatic identification of appropriate data techniques is a compulsory intermediate stage of any big data implementation with predictive analytics purposes. Research on model selection and combination has revealed the benefits of such techniques in terms of forecast accuracy and reliability. However, the application of artificial intelligence techniques to this problem is still scarce. Forecasting models are of a strategic nature given that they guide business decisions, ranging from inventory scheduling to strategic management (Petropoulos, Makridakis, Assimakopoulos, & Nikolopoulos, 2014). Focusing on a supply chain context, automatic model selection is a necessity due to the high number of products whose demand should be forecast (Fildes & Petropoulos, 2015).

The forecasting and operational research literature has addressed this problem using different approaches. A first approach could be aggregate selection, in which a single source of forecasts is chosen for all the time series (Fildes, 1989), rather than individual selection, where a method appropriate for each series is selected. However, aggregate selection cannot distinguish the individual characteristics of each time series (such as trend and/or seasonality), and in general, individual selection outperforms aggregate selection, although with an associated higher complexity level and computational burden (Fildes & Petropoulos, 2015).


Corresponding author.
E-mail address: marco.villegas@uclm.es (M.A. Villegas).

https://doi.org/10.1016/j.cie.2018.04.042
Received 1 February 2017; Received in revised form 3 February 2018; Accepted 21 April 2018
Available online 04 May 2018
0360-8352/ © 2018 Elsevier Ltd. All rights reserved.

Regarding individual selection, different criteria for selecting the most adequate model can be found in the literature. For instance, information criteria such as the Akaike information criterion (AIC) or the Schwarz Bayesian criterion (SBC) are typically used (Liang, 2014). These information criteria produce a value that represents the tradeoff between goodness of fit and the number of parameters. Billah, King, Snyder, and Koehler (2006) compared different information criteria to select the most appropriate exponential smoothing model on simulated data and a subset of the time series from the well-known M3 forecasting competition, where the AIC slightly outperformed the remaining information criteria considered (Makridakis & Hibon, 2000).

The identification of the best forecasting model has also been addressed depending on the time series features. Initially, Pegels (1969) presented nine possible exponential smoothing methods in graphical form, considering all combinations of trend and cyclical effects in additive and multiplicative form. Collopy and Armstrong (1992) developed a rule-based forecasting (RBF) selection procedure based on a set of 99 rules for selecting and combining methods according to 18 time series features. To automate this procedure, Adya, Collopy, Armstrong, and Kennedy (2001) developed automated heuristics to detect, using simple statistics, six features that had previously been identified judgmentally in RBF, achieving a similar performance in terms of forecasting accuracy. Petropoulos et al. (2014) analyzed via regression analysis the main determinants of forecasting accuracy involving 14 popular forecasting methods (and combinations of them), seven time series features and the forecasting horizon as a strategic decision. Wang, Pedrycz, and Liu (2015) proposed a rather different approach for long-term forecasting based on dynamic time warping of information granules. Yu, Dai, and Tang (2016) focused on finding an empirical decomposition (intrinsic mode functions) whose individually forecast components are later aggregated into an ensemble result as the final prediction.

An alternative for selecting among forecasts is evaluating the performance of the methods in a hold-out sample (Fildes & Petropoulos, 2015; Poler & Mula, 2011), where forecasts are computed for single or multiple origins (cross-validation), typically using a rolling-origin process (Tashman, 2000). Some pragmatic approaches have also been used, such as the forward selection employed by Kim, Dekker, and Heij (2017) for determining the order of an AR model.

Finally, another option is to explore combination procedures (Clemen, 1989). In fact, Fildes and Petropoulos (2015) concluded that a combination could outperform individual or aggregate selection for non-trended data. Different combination operators (mode, median and mean) to compute neural network ensembles were analyzed by Kourentzes, Barrow, and Crone (2014), where the mode was found to provide the most accurate forecasts. In addition to forecasting models based on time series alone, the automatic identification algorithms developed for causal models should be mentioned. For instance, marketing analytics models to forecast sales under the presence of promotions were analyzed by Trapero, Kourentzes, and Fildes (2015). Additionally, models capable of incorporating data from other companies in a supply chain collaboration context with information sharing were explored by Trapero, Kourentzes, and Fildes (2012).

In addition to traditional time series modeling techniques, artificial intelligence (AI) algorithms have proven to be quite effective as a means to build higher-level methodologies that face big data challenges in an effective manner, relying upon both traditional and AI low-level techniques. An initial attempt was conducted by Garcia, Villalba, and Portela (2012), where multiple time series were classified according to the simple autocorrelation function (SACF) and partial autocorrelation function (PACF) to reduce the number of forecasting ARIMA models to be fitted. However, the forecasting implications of that procedure in terms of out-of-sample accuracy were not described.

Despite the abundant literature on model selection, the possibility of combining several selection criteria has been under-investigated, as well as the benefits of taking advantage of additional sources of information (SACF and PACF values, fitted parameters, unit root tests, and so forth) in an integrated approach that takes them all into account.

This study proposes a new model selection approach that combines different criteria with additional information from the time series itself, as well as the responses and fitted parameters of the alternative models. Given a set of candidate models, rather than considering any individual criterion, a support vector machine (SVM) is trained at each forecasting origin to select the best model using all this information.

The effects of this approach are explored for the 229 stock keeping units (SKUs) of a leading household and personal care manufacturer in the UK. The data are highly volatile and have a small serial correlation. The experiment was developed by fitting a set of exponential smoothing and ARIMA models with different levels of complexity. After a feature selection process, 19 variables were included in the SVM training dataset as the most relevant variables. The results show that the proposed approach improves the out-of-sample forecasting accuracy with respect to a single model selection criterion. This methodology will be particularly interesting for scenarios with highly volatile demand because it allows the model to be changed when it does not fit the data sufficiently well, thereby reducing the risk of misusing modeling techniques in the automatic processing of large datasets.

The key contributions of this paper are as follows: (i) propose a novel model selection approach for time series forecasting based on SVM classification, (ii) compare base and ensemble forecast error characteristics out-of-sample, and (iii) investigate the effects of the ensemble on forecasting errors, as measured in terms of median, mean, bias and variance.

The remainder of this paper is organized as follows. Section 2 introduces the forecasting models and the use of the SVM for automatic model selection. Section 3 presents an empirical evaluation of the approach in a demand planning case study with real data. Section 4 analyzes the results, followed by some final considerations and afterthoughts.

2. Methods

2.1. Forecasting models

Let z_t be the mean-corrected output demand data sampled at a weekly rate, a_t be a white noise sequence (i.e., serially uncorrelated with zero mean and constant variance), θi be a set of parameters to estimate, and B be the backshift operator in the sense that B^l z_t = z_{t−l}. Then, considering that no seasonality is present in the data, the forecasting models considered in this paper are the following:

M1: z_t = a_t    (1)

M2: z_t = (1 + θ1 B + θ2 B²) a_t    (2)

M3 (ETS): (1 − B) z_t = (1 + θ1 B) a_t    (3)

M4: (1 − B) z_t = (1 + θ1 B + θ2 B²) a_t    (4)

Mean: mean of forecasts M1 to M4    (5)

Median: median of forecasts M1 to M4    (6)

Model M1 is white noise, model M2 is an MA(2), model M3 is an IMA(1,1) that is actually treated as a simple exponential smoothing model, or an ETS(A,N,N) in the nomenclature of Hyndman, Koehler, Ord, and Snyder (2008) (where E, T, S, A and N stand for error, trend, seasonal, additive and none, respectively), M4 is an IMA(1,2), and Mean and Median are combination methods. In essence, two stationary models, three non-stationary models and two combinations of models are considered.
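For concreteness, the following rough R sketch shows how this candidate set and its two combinations could be fitted and forecast for a single mean-corrected series. It is an illustration only, not the authors' implementation, which according to Section 3 relied on the ECOTOOL and SSpace MATLAB toolboxes; the series name z and the 4-week horizon are placeholders.

# Illustrative sketch only: fit M1-M4 to one mean-corrected weekly series z
# and produce 4-step-ahead forecasts plus the Mean/Median combinations.
fit_candidates <- function(z, h = 4) {
  m2 <- arima(z, order = c(0, 0, 2), include.mean = FALSE)  # MA(2)
  m3 <- arima(z, order = c(0, 1, 1))                        # IMA(1,1), ~ETS(A,N,N)
  m4 <- arima(z, order = c(0, 1, 2))                        # IMA(1,2)
  fc <- cbind(
    M1 = rep(0, h),   # white noise: point forecast of mean-corrected data is zero
    M2 = as.numeric(predict(m2, n.ahead = h)$pred),
    M3 = as.numeric(predict(m3, n.ahead = h)$pred),
    M4 = as.numeric(predict(m4, n.ahead = h)$pred)
  )
  cbind(fc, Mean = rowMeans(fc), Median = apply(fc, 1, median))
}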


Note that some models are nested versions of other models. For example, model M1 is a parametrically efficient version of the remaining models if θ1 = 0 and θ2 = 0 in model M2, θ1 = −1 in M3, or θ1 = −1 and θ2 = 0 in M4. Similarly, the ETS model M3 is a particular version of M4 with θ2 = 0. Finally, models M2 and M4 are not nested with any other models.

These types of constraints have been taken into account in the estimation procedure, since when approximate constraints are found, the preferred models are the most parsimonious ones. This is particularly important when dealing with estimated roots that are close to unity. Specifically, when θ1 < −0.992 in model M3, the model is switched to M1 for forecasting purposes. Similarly, if any root in the MA polynomial of model M4 is smaller than −0.992, the model is switched to an MA(1), which is not any of the models considered above. No unit roots were detected when estimating model M2.

For the 229 SKU time series considered, at least one of the models M1 to M4 above is correct in statistical terms, in the sense that one of the models filters out all the serial correlation present in the data. Fig. 1 shows, for each product, the minimum over the four models M1-M4 of the Ljung-Box Q statistic that tests for the absence of serial correlation at eight lags, approximately two months of data (Ljung & Box, 1978). Recalling that the maximum number of parameters is two, a conservative value for the degrees of freedom to perform the test is 6.

Fig. 1. Minimum Ljung-Box Q statistic for each product.

The critical values for the Q test at confidence levels of 90%, 95% and 99% on a Chi-squared distribution with 6 degrees of freedom are 10.64, 12.59 and 16.81, respectively, and are marked by horizontal lines in Fig. 1. Most of the values are well below the 90% confidence limit. This means that, depending on the level of confidence, for 93.89%, 97.82% and 100% of the SKU series at least one of the models is correctly specified, in the sense of not leaving significant serial correlation below the confidence limits mentioned above. Therefore, the models fulfill the requirement of being a sufficient representation of the data while simultaneously preserving parsimony.
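As an illustration only, and assuming that the residuals of the four fitted models for one product are available in a list, the per-product quantity plotted in Fig. 1 and the critical values quoted above can be reproduced in R as follows.

# Sketch: minimum Ljung-Box Q statistic over the four candidate models
# for one product, using 8 lags as in the paper.
min_q_statistic <- function(residual_list) {
  q <- sapply(residual_list, function(e)
    Box.test(e, lag = 8, type = "Ljung-Box")$statistic)
  min(q)
}

# Critical values marked as horizontal lines in Fig. 1
# (Chi-squared with a conservative 6 degrees of freedom):
qchisq(c(0.90, 0.95, 0.99), df = 6)  # 10.64, 12.59, 16.81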
2.2. Support vector machines

The forecasting models shown in the previous section are data specific and should be changed for each particular application. The novelty of this paper relies on the use of SVMs as a means to select the best model among the candidates for each time series at each forecasting horizon.

The SVM classifier is basically a binary classification algorithm that searches for an optimal hyperplane as a decision function in a high-dimensional feature space (Shawe-Taylor & Cristianini, 2004). Consider the training data set {x_k, y_k}, where x_k ∈ R^n are the training examples (k = 1, 2, …, m) and y_k ∈ {−1, 1} are the class labels. The training examples are first mapped into another space, referred to as the feature space and possibly of a much higher dimension than n, via the mapping function Φ. Then, a decision function of the form f(x) = ⟨w, Φ(x)⟩ + b in the feature space is computed by maximizing the distance between the set of points Φ(x_k) and the hyperplane parameterized by (w, b) while being consistent on the training set. The class label of x is obtained by considering the sign of f(x). In the non-separable case, the misclassified examples are quadratically penalized through a constant C, the cost parameter, and the optimization problem takes the form

min_{w,ξ} (1/2) ‖w‖² + C ∑_{k=1..m} ξ_k²

subject to y_k f(x_k) ≥ 1 − ξ_k², ∀k. Using Lagrangian theory, the optimal vector w is known to have the form w = ∑_{k=1..m} α_k* y_k Φ(x_k), where α_k* is the solution of the following quadratic optimization problem:

max_α W(α) = ∑_{k=1..m} α_k − (1/2) ∑_k ∑_l α_k α_l y_k y_l ( K(x_k, x_l) + (1/C) δ_{k,l} )    (7)

subject to ∑_{k=1..m} y_k α_k = 0 and α_k ≥ 0, ∀k, where δ_{k,l} is the Kronecker symbol and K(x_k, x_l) = ⟨Φ(x_k), Φ(x_l)⟩ is the kernel matrix of the training examples. The extension to multiclass classification with j levels, j > 2, can be performed using the "one-against-one" approach, in which j(j−1)/2 binary classifiers are trained; the appropriate class is then found using a voting scheme (Meyer, Dimitriadou, Hornik, Weingessel, & Leisch, 2015).

The function K is also known as the kernel function, which computes inner products in the feature space directly from the inputs x. This function is supposed to capture the appropriate similarity measure between the arguments while being computationally much less expensive than explicitly computing the mapping Φ and the inner product. Although the design of kernel functions is a very active research area, there are some popular kernels that have been tested in a variety of domains and applications with good results. The polynomial kernel is defined as K(x, z) = p(⟨x, z⟩), where p(·) is any polynomial with positive coefficients. In many cases, it also refers to the special case K_d(x, z) = (⟨x, z⟩ + R)^d, where R and d are parameters. Gaussian kernels (also known as radial basis function kernels) are the most widely used kernels and have been studied in many different applications. A Gaussian kernel is defined as

K(x, z) = exp( −‖x − z‖² / (2σ²) )    (8)

where σ is a parameter that controls the flexibility of the kernel. In this study, Gaussian kernels are extensively used, and the parameter σ is estimated via cross-validation, as explained in Section 2.3.

Although they have been extended to regression problems since their early days (Müller et al., 1997), SVMs were originally designed as a classification algorithm (Cortes & Vapnik, 1995) and have been extensively exploited in a wide variety of classification contexts, e.g., hand-written digit recognition, genomic DNA (Furey et al., 2000), text classification (Joachims, 2002), and sentiment analysis (Pang, Lee, & Vaithyanathan, 2002). Surprisingly, however, SVMs have not been applied to the problem of model selection in the context of multiple forecasting models. This paper constitutes a novel contribution in this area.
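Section 3 reports that the SVM was implemented with the R package e1071 (Meyer et al., 2015). The minimal sketch below shows how a multiclass Gaussian-kernel SVM of this kind can be trained with that package; note that e1071 parameterizes the kernel of Eq. (8) through gamma = 1/(2σ²) and applies the one-against-one voting scheme internally, and that the feature matrix and labels used here are synthetic placeholders rather than the paper's data.

library(e1071)

# Sketch: multiclass C-classification SVM with a Gaussian (RBF) kernel.
# X: m x p feature matrix; y: factor with levels "M1".."M4" (best-model labels).
rbf_svm <- function(X, y, C = 1, sigma = 1) {
  svm(x = X, y = y,
      type   = "C-classification",
      kernel = "radial",
      cost   = C,                   # penalty constant C of the soft margin
      gamma  = 1 / (2 * sigma^2))   # K(x,z) = exp(-gamma * ||x - z||^2), cf. Eq. (8)
}

# Toy usage with synthetic data (19 features, as in Section 2.3):
set.seed(1)
X <- matrix(rnorm(200 * 19), ncol = 19)
y <- factor(sample(paste0("M", 1:4), 200, replace = TRUE))
model <- rbf_svm(X, y)
table(predict(model, X), y)   # in-sample confusion table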
2.3. Feature selection and extraction

Reliable results with SVMs and other data-driven modeling techniques depend considerably on the quality of the data available for training. Apart from the correctness of the data itself, there are some other aspects regarding the dimensionality of the dataset. In fact, it is well known that as the number of variables increases, the amount of data required to provide a reliable analysis increases exponentially (Hira & Gillies, 2015). Many feature selection (removing variables that are irrelevant) and feature extraction (applying transformations to the existing variables to obtain new ones) techniques have been discussed to reduce the dimensionality of the data (Kira & Rendell, 1992), as well as some other approaches based on linear transformation and covariance analysis, such as principal component analysis (PCA) and linear discriminant analysis (LDA; Cao, Chua, Chong, Lee, & Gu, 2003; Duin & Loog, 2004).


For the experiments performed in this work, the dataset contained 14,885 records (65 origins and 229 products). A thorough range of features estimated from the signals and models was initially considered so as not to miss any relevant information. The process resulted in 39 features, including information criteria of the models, estimation information, formal statistical tests on residuals and forecasting results. The detailed list is as follows:

• AIC and SBC information criteria on forecasting models M1-M4 (8 features).
• Autocorrelation Ljung-Box Q statistic for 10 lags on the residuals of each model (4 features).
• P-values of Jarque-Bera Gaussianity tests for the residuals of each model (4 features; Jarque & Bera, 1987).
• P-values of heteroscedasticity tests for the residuals of each model (4 features). The tests are estimated as a variance ratio test of the first third of the sample on top of the last third of the sample.
• Estimated parameters (5 features): two parameters/features for model M2, one for model M3, and two for model M4.
• The four last SKU values available at each point in time (4 features).
• The four predictions for the next week provided by each forecasting method (M1 to M4, 4 features).
• All possible relative distances among the predictions provided by the forecasting methods (M1 to M4, 6 features).
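As an illustration only, a subset of these features could be assembled in R as sketched below; the helper name, the list of fitted arima objects and the forecast vector are hypothetical stand-ins for the authors' MATLAB pipeline, and the Jarque-Bera p-values are omitted to keep the sketch within base R.

# Sketch: a partial feature vector for one SKU at one forecast origin.
# fits: list of fitted arima objects (M2-M4); fc: named vector of the four
# one-week-ahead forecasts (M1-M4); z: in-sample demand history.
build_features <- function(fits, fc, z) {
  ic  <- unlist(lapply(fits, function(f) c(AIC = AIC(f), SBC = BIC(f))))
  lb  <- sapply(fits, function(f)                 # Ljung-Box Q, 10 lags
           Box.test(residuals(f), lag = 10, type = "Ljung-Box")$statistic)
  het <- sapply(fits, function(f) {               # variance ratio: first vs last third
           e <- residuals(f); k <- floor(length(e) / 3)
           var(e[1:k]) / var(e[(length(e) - k + 1):length(e)])
         })
  last4 <- tail(z, 4)                             # four most recent SKU values
  dists <- as.vector(dist(fc))                    # 6 pairwise distances among forecasts
  c(ic, lb, het, last4, fc, dists)
}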
After a long process of feature selection and extraction via cross-validation, the number of variables was reduced to 19, resulting in a matrix W of dimension 14,885 × 19. The final features selected are the last four items in the previous list, namely, the estimated parameters of all models, the four last SKU values, the predictions of all models and the distances or differences among predictions.

The vector of labels L_i is formed as a categorical variable that indicates the model with the lowest forecasting error at horizon t + i, for 1 ≤ i ≤ 4. For horizons t + 2, t + 3 and t + 4, the model with the lowest forecasting error is selected as the one that minimizes the total sum of squared errors over all the spanned weeks t + 1, …, t + i.

For each week k, with 4 < k < 65, an SVM with a radial basis function (RBF) kernel is trained using the training set Wtrain and the corresponding vector of labels L_i for each forecasting horizon. Wtrain is formed as a partition of matrix W including up to four weeks of history (h = 4), i.e., collecting records for weeks k − 4 to k − 1. Similar considerations apply to the shaping of vector L_i. Different values of h were also tested empirically, with h = 4 emerging as the optimal setting for the current dataset. For optimizing the σ and C parameters, a 5-fold cross-validation was performed.
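A schematic R version of this rolling training scheme is given below, with the 5-fold cross-validation over gamma (i.e., σ) and C delegated to e1071::tune.svm(); the objects W, week and labels, as well as the grid of candidate values, are placeholders and not the authors' actual settings.

library(e1071)

# Sketch: for each week k, train on the records of weeks k-4..k-1 and
# predict the best model for the records of week k.
# W: 14,885 x 19 feature matrix; week: origin index of each row;
# labels: factor L_i of best models for the chosen horizon.
rolling_svm_selection <- function(W, week, labels) {
  picks <- list()
  for (k in 5:64) {
    train <- week %in% (k - 4):(k - 1)
    test  <- week == k
    tuned <- tune.svm(x = W[train, ], y = labels[train],
                      gamma = 2^(-6:0), cost = 2^(0:6),
                      tunecontrol = tune.control(cross = 5))  # 5-fold CV
    picks[[as.character(k)]] <- predict(tuned$best.model, W[test, ])
  }
  picks
}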
3. Case study

The evaluation of the proposed models is conducted on the 229 demand series from a leading household and personal care manufacturer in the UK. This dataset was previously employed in Barrow and Kourentzes (2016). For each product, there are 173 weekly sales observations, from which 101 observations are used as the in-sample period and the remainder are reserved for out-of-sample evaluation. Therefore, a set of 69 forecast rounds of 4 weeks ahead was performed for each product. Fig. 2 shows some examples of the time series in the dataset, where the shaded area marks the out-of-sample period. No seasonality is visible in the sample SKUs by visual inspection, and no strong correlation patterns are observed.

Fig. 2. Example of some SKUs from the dataset.

A rolling forecasting experiment is performed by expanding the in-sample span one week at a time. All forecasting models are fitted in the in-sample partition using the available data up to time T, with 101 ≤ T ≤ 169, and tested in the out-of-sample partition using observations T + 1, …, T + 4. The out-of-sample forecasting errors are therefore calculated for the four forecasting horizons on each of the 229 products.

We measure the forecast error using the scaled error (sE_l) and scaled squared error (sSE_l) of the lead-time forecast according to the following formulas:

sE_l = ( ∑_{j=1..l} z_{T+j} − ∑_{j=1..l} ẑ_{T+j} ) / ( (1/T) ∑_{i=1..T} z_i )    (9)

sSE_l = ( ∑_{j=1..l} z_{T+j} − ∑_{j=1..l} ẑ_{T+j} )² / ( (1/T) ∑_{i=1..T} z_i )    (10)

where the denominator is the mean of the time series, ẑ_{T+j} stands for the forecast at time T + j, and l = 1, 2, 3, 4. Using these metrics has the following advantages: (i) they allow zero values at some periods of the series; (ii) they compute the forecasting performance for different lead times, given that in supply chain applications the lead time is unlikely to be equal to the forecast update interval (Silver, Pyke, & Thomas, 2017); and (iii) they make the results scale independent, and therefore we can summarize them across products and forecasting horizons. Scaled absolute errors were also calculated in addition to the squared errors in (10), but the results were very similar and are not reported. Such results are available from the authors.

To compute joint accuracy measurements for the full dataset, the scaled mean squared error (sMSE) and scaled median squared error (sMdSE) are calculated as the average and the median, respectively, of all the sSE_l values across all the time origins and all the SKUs. Similarly, forecast biases are examined by the scaled mean error (sME) and scaled median error (sMdE), calculated accordingly on the sE_l measurements.
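A direct R transcription of Eqs. (9) and (10) and of the aggregate summaries might look as follows; z, actual and forecast are placeholder names for the in-sample history and the out-of-sample actuals and forecasts over the lead time.

# Sketch of Eqs. (9)-(10): scaled (squared) error of the lead-time forecast.
scaled_error <- function(z, actual, forecast, l = length(actual)) {
  (sum(actual[1:l]) - sum(forecast[1:l])) / mean(z)
}
scaled_squared_error <- function(z, actual, forecast, l = length(actual)) {
  (sum(actual[1:l]) - sum(forecast[1:l]))^2 / mean(z)
}

# Summaries across all origins and SKUs (sse, se: vectors of collected values):
sMSE  <- function(sse) mean(sse)
sMdSE <- function(sse) median(sse)
sME   <- function(se)  mean(se)
sMdE  <- function(se)  median(se)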
The models are estimated by exact maximum likelihood using the ECOTOOL toolbox written in MATLAB (Pedregal & Trapero, 2012), except for M3, which was handled in SSpace. The SVM was implemented using the R package e1071 (Meyer et al., 2015).

4. Results and discussion


One key issue in this study is the agnostic point of view by which we assume that there is not necessarily a single stochastic process that underlies the observed data. In other words, several stochastic processes may be required to describe a single SKU.

This is particularly true for this case study, where there is little correlation structure in the data. Consequently, one might expect that the best model in terms of forecasting accuracy will change for different forecast origins and/or horizons.

Some evidence emerges from the in-sample properties of the models. For example, computing the SBC for all the SKUs with 101 observations and with the full sample, we observe a different model selection in 37% of the time series (Table 1). This proportion remains the same when AIC is considered instead. Detailed information about model selection is shown in the first four columns of Table 1, where the sample size is shown in parentheses. The SBC tends to select models with a higher number of parameters and unit roots as the sample size increases, whereas AIC keeps the selection invariant to the sample size. For the small sample size (101), the simplest model M1 is chosen in 55% of the cases, i.e., for more than half of the SKUs the best model is that there is no model. This proportion is reduced with the full sample (173), but M1 is still the best model according to SBC in 39.30% of the cases (36.18% according to AIC), followed by the exponential smoothing with 29.26% (28.63% for AIC) of the cases.

Table 1
Percentage of SKUs for which each model is the best according to SBC and AIC on different data partitions, and according to the out-of-sample forecast performance.

       SBC (101)   AIC (101)   SBC (173)   AIC (173)   Out-of-sample
M1     55.46%      51.05%      39.30%      36.18%      17.03%
M2     14.41%      13.99%       9.61%       9.33%      16.16%
M3     13.97%      13.67%      29.26%      28.63%      34.93%
M4     16.16%      21.29%      21.83%      25.86%      31.88%

The fifth column of Table 1 shows the best models contributing to the accuracy in the out-of-sample partition (samples between 101 and 173) according to sSE_4. The disagreements with the SBC values are 69% and 44% for the small and full sample sizes, respectively. Taken altogether, the information in Table 1 shows evidence of the little correlation structure observed in the data, which tends to become more important with longer time series.

Fig. 3 shows, for each SKU, the proportion of times out of the 69 forecasting origins in the rolling experiment at which the best model is actually the best according to the forecasting errors. For example, a value of 50% for a single SKU in that figure means that the best model was the best in 35 of the forecasting origins. Only in 4 SKUs was the winning model the best in more than 60% of the forecasting origins, and only in 28 SKUs were the proportions greater than 50%. This means that even when a model is the best at minimizing the forecasting error for a single SKU, it is rarely the best at more than 50% of the forecasting origins.

Fig. 3. Proportion of time origins at which the best model is the best for all SKUs.

To obtain deeper insights into the complexity of the problem, Fig. 4 shows a single SKU, where it can be observed that, taken up to observation 101, it may be considered stationary, and therefore either M1 or M2 may be appropriate candidates. However, the fact that a trend subsequently appears implies that such a model might no longer be optimal.

Fig. 4. Example of SKU.

Such intuitions are supported by Table 2, which shows the SBC and AIC for all models with samples up to 101 and with the full sample, in addition to the sMSEs. The preferred model according to SBC and AIC in the small sample is M1, with a Q(8) statistic of 3.72, which indicates that there is no correlation remaining in the residuals. This model is the worst for the full sample. Additionally, the best model for the full sample switches to M4 (Q(8) is 9.54), while the forecasting criterion also suggests that M4 is the best, with a slight margin over M3. Interestingly, the model that is the best considering all forecasting origins is model M4 only 53% of the time.

Table 2
SBC and AIC for all models in two different data partitions, and out-of-sample sMSE, for the SKU in Fig. 4.

       SBC (101)   AIC (101)   SBC (173)   AIC (173)   sMSE
M1     1.52        1.40        3.31        3.05        0.053
M2     1.61        1.56        2.89        2.80        0.038
M3     1.59        1.56        2.26        2.21        0.013
M4     1.64        1.58        2.19        2.12        0.012

This evidence is complemented by Fig. 5, which shows the best model according to its forecasting performance for each forecast origin in the out-of-sample span for the same SKU. At the very beginning, the best models tended to be M2 or M3. Subsequently, however, as the trend becomes more prominent, the best model switches to M4 most of the time, although not always.

Fig. 5. Best model for each out-of-sample forecast origin for SKU in Fig. 4.

The previous evidence shows that there is no single model that outperforms the rest for all SKUs, all forecasting origins and all forecasting horizons. Moreover, even for a single SKU, there is no consistent best model over time. At this point, the SVM-based ensemble approach is introduced to test the hypothesis that there is some pattern that would allow improving the forecast accuracy over all models and possible combinations of models. In this sense, the proposed approach might be considered a sophisticated combination method in itself.


Table 3 shows the scaled mean (median) squared errors for all forecasting models and methods, including a naïve model (each forecast is simply equal to the last observed value) that serves as a benchmark. The last row corresponds to the errors generated by selecting the best possible model for every forecasting step out of the model set considered. The minimum sMSE (sMdSE) for every forecasting horizon is highlighted in bold.

Table 3
Forecast accuracy for out-of-sample sets in sMSE (sMdSE).

            Out t + 1       Out t + 2       Out t + 3       Out t + 4
Naïve       0.184 (0.041)   0.558 (0.128)   1.114 (0.266)   1.856 (0.437)
M1          0.115 (0.032)   0.278 (0.075)   0.486 (0.134)   0.743 (0.195)
M2          0.109 (0.030)   0.255 (0.072)   0.447 (0.130)   0.689 (0.192)
M3          0.100 (0.026)   0.221 (0.054)   0.363 (0.087)   0.533 (0.123)
M4          0.102 (0.027)   0.230 (0.059)   0.380 (0.096)   0.555 (0.139)
Mean        0.101 (0.027)   0.226 (0.060)   0.374 (0.101)   0.549 (0.150)
Median      0.101 (0.027)   0.225 (0.059)   0.373 (0.101)   0.549 (0.150)
SVM-based   0.099 (0.026)   0.212 (0.052)   0.334 (0.081)   0.471 (0.110)
Baseline    0.071 (0.011)   0.149 (0.022)   0.234 (0.034)   0.327 (0.049)

Several facts emerge from Table 3. First, taken as a whole, all models outperform the naïve model by a wide margin, implying that all models capture, at least in some part of the experiment, the correlation structure of the data. Second, among the individual models M1 to M4, the best model is consistently the exponential smoothing model M3, with an advantage that increases with the forecasting horizon. Third, the combinations of methods (mean and median) do not manage to outperform the exponential smoothing, and both provide virtually the same results. Finally, and most importantly, the SVM-based approach is the overall best approach for all forecasting horizons, with errors that fall between the M3 model and the minimum possible forecasting error (the one that would result from always selecting the model with the minimum error). The advantages of the proposed method are once again appreciated more clearly for the higher forecast horizons.

Table 4 shows the bias in terms of the sE_l measurements from Eq. (9). The minimum sME (sMdE) for every forecasting horizon is highlighted in bold. All biases are small, considering that the highest bias in the table is 0.124 and the normalization imposed on the data implies a mean of 1. The conclusions about bias are quite different depending on whether we rely on sME or sMdE, but due to the robustness of the median, it is safer to use the sMdE values in parentheses. In essence, the bias replicates what was observed in the squared errors, i.e., the models with the smallest squared errors are simultaneously the models with the smallest bias. The best is the SVM-based method, followed by the exponential smoothing (M3), then model M4 and the combinations of models.

Table 4
Forecast bias multiplied by 10² for out-of-sample sets in sME (sMdE).

            Out t + 1          Out t + 2          Out t + 3          Out t + 4
M1           0.317 (−4.690)     0.785 (−5.252)     0.903 (−6.199)     1.024 (−7.295)
M2          −0.242 (−4.598)    −0.345 (−5.185)    −0.208 (−5.336)    −0.068 (−7.914)
M3          −1.259 (−4.006)    −2.367 (−3.909)    −3.826 (−3.101)    −5.282 (−3.627)
M4          −2.041 (−4.508)    −4.230 (−4.693)    −6.769 (−5.004)    −9.307 (−5.672)
Mean        −0.807 (−4.474)    −1.539 (−4.853)    −2.475 (−5.546)    −3.408 (−6.134)
Median      −0.832 (−4.589)    −1.715 (−5.068)    −2.669 (−5.094)    −3.614 (−6.241)
SVM-based    0.309 (−3.013)     0.155 (−2.403)    −0.610 (−1.310)    −2.365 (−2.652)
Baseline     0.002 (−1.246)    −0.713 (−1.123)    −1.961 (−0.939)    −3.193 (1.635)

The SVM-based approach is allowed to select among the different forecasting models at each forecast origin, and therefore it is more flexible to adapt to stochastic or structural changes in the SKUs. This fact explains why the SVM-based criterion outperforms all the considered alternatives in terms of forecast accuracy.

5. Conclusions

This study proposes a novel SVM-based approach for model selection. Since forecasting models shape business decisions at different levels within companies, this paper aims at enhancing the power of forecasting techniques by using AI techniques, SVM in particular, working on a wide feature space in the context of supply chain forecasting.

The procedure consists of selecting the best forecast model available from a pool of alternatives at each point in time, using an SVM trained in a feature space that embeds the most recent information, the forecasts, the relative performance and the fitted parameters of the models involved. To the authors' best knowledge, this work is the first in which an SVM is used in this context in this particular way.

The approach is empirically applied to a leading household and personal care manufacturer in the UK with 229 weekly SKUs to forecast, with a horizon of 1 to 4 weeks ahead. The findings suggest that (i) exponential smoothing techniques are very good in this context, both in terms of forecast accuracy and minimization of bias; (ii) simple combinations of forecasts (such as mean and median) do not help much in this regard; and (iii) SVM-based model selection certainly manages to improve the forecasting results in terms of both errors and bias.


Acknowledgments

This work was supported by the European Regional Development Fund and the Spanish Government (MINECO/FEDER, UE) under the project with reference DPI2015-64133-R, and by the Vicerrectorado de Investigación y Política Científica from UCLM by DOCM 31/07/2014 [2014/10340].

References

Adya, M., Collopy, F., Armstrong, J., & Kennedy, M. (2001). Automatic identification of time series features for rule-based forecasting. International Journal of Forecasting, 17(2), 143–157.
Barrow, D. K., & Kourentzes, N. (2016). Distributions of forecasting errors of forecast combinations: Implications for inventory management. International Journal of Production Economics, 177, 24–33.
Billah, B., King, M. L., Snyder, R. D., & Koehler, A. B. (2006). Exponential smoothing model selection for forecasting. International Journal of Forecasting, 22(2), 239–247.
Cao, L., Chua, K. S., Chong, W., Lee, H., & Gu, Q. (2003). A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine. Neurocomputing, 55(1), 321–336.
Clemen, R. T. (1989). Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5(4), 559–583.
Collopy, F., & Armstrong, J. S. (1992). Rule-based forecasting: Development and validation of an expert systems approach to combining time series extrapolations. Management Science, 38(10), 1394–1414.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Davenport, T. H., & Harris, J. G. (2007). Competing on analytics: The new science of winning. Harvard Business Press.
Duin, R., & Loog, M. (2004). Linear dimensionality reduction via a heteroscedastic extension of LDA: The Chernoff criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 732–739.
Fildes, R. (1989). Evaluation of aggregate and individual forecast method selection rules. Management Science, 35(9), 1056–1065.
Fildes, R., & Petropoulos, F. (2015). Simple versus complex selection rules for forecasting many time series. Journal of Business Research, 68(8), 1692–1701 (Special issue on simple versus complex forecasting).
Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M., & Haussler, D. (2000). Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10), 906–914.
Garcia, F. T., Villalba, L. J. G., & Portela, J. (2012). Intelligent system for time series classification using support vector machines applied to supply-chain. Expert Systems with Applications, 39(12), 10590–10599.
Hira, Z. M., & Gillies, D. F. (2015). A review of feature selection and feature extraction methods applied on microarray data. Advances in Bioinformatics, 2015.
Hyndman, R., Koehler, A. B., Ord, J. K., & Snyder, R. D. (2008). Forecasting with exponential smoothing: The state space approach. Springer Science & Business Media.
Jarque, C., & Bera, A. (1987). A test for normality of observations and regression residuals. International Statistical Review, 55, 163–172.
Joachims, T. (2002). Learning to classify text using support vector machines: Methods, theory and algorithms. Kluwer Academic Publishers.
Kim, T. Y., Dekker, R., & Heij, C. (2017). Spare part demand forecasting for consumer goods using installed base information. Computers & Industrial Engineering, 103, 201–215.
Kira, K., & Rendell, L. A. (1992). The feature selection problem: Traditional methods and a new algorithm. In Proceedings of AAAI, Vol. 2 (pp. 129–134).
Kourentzes, N., Barrow, D. K., & Crone, S. F. (2014). Neural network ensemble operators for time series forecasting. Expert Systems with Applications, 41(9), 4235–4244.
Liang, Y.-H. (2014). Forecasting models for Taiwanese tourism demand after allowance for Mainland China tourists visiting Taiwan. Computers & Industrial Engineering, 74, 111–119.
Ljung, G., & Box, G. (1978). On a measure of a lack of fit in time series models. Biometrika, 65(2), 297–303.
Makridakis, S., & Hibon, M. (2000). The M3-competition: Results, conclusions and implications. International Journal of Forecasting, 16, 451–476.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., & Leisch, F. (2015). e1071: Misc functions of the Department of Statistics, Probability Theory Group (formerly: E1071), TU Wien. R package version 1.6-7.
Müller, K.-R., Smola, A. J., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1997). Predicting time series with support vector machines. International Conference on Artificial Neural Networks (pp. 999–1004). Springer.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Vol. 10 (pp. 79–86). Association for Computational Linguistics.
Pedregal, D., & Trapero, J. (2012). The power of ECOTOOL MATLAB toolbox. Industrial Engineering: Innovative Networks (pp. 319–328). Springer.
Pegels, C. C. (1969). Exponential forecasting: Some new variations. Management Science, 15(5), 311–315.
Petropoulos, F., Makridakis, S., Assimakopoulos, V., & Nikolopoulos, K. (2014). 'Horses for courses' in demand forecasting. European Journal of Operational Research, 237(1), 152–163.
Poler, R., & Mula, J. (2011). Forecasting model selection through out-of-sample rolling horizon weighted errors. Expert Systems with Applications, 38(12), 14778–14785.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press.
Sheikh, N. (2013). Implementing analytics: A blueprint for design, development, and adoption. Newnes.
Silver, E., Pyke, D., & Thomas, D. (2017). Inventory and production management in supply chains (4th ed.). CRC Press, Taylor and Francis Group.
Tashman, L. J. (2000). Out-of-sample tests of forecasting accuracy: An analysis and review. International Journal of Forecasting, 16(4), 437–450.
Trapero, J. R., Kourentzes, N., & Fildes, R. (2012). Impact of information exchange on supplier forecasting performance. Omega, 40(6), 738–747 (Special issue on forecasting in management science).
Trapero, J. R., Kourentzes, N., & Fildes, R. (2015). On the identification of sales forecasting models in the presence of promotions. Journal of the Operational Research Society, 66(2), 299–307.
Wang, W., Pedrycz, W., & Liu, X. (2015). Time series long-term forecasting model based on information granules and fuzzy clustering. Engineering Applications of Artificial Intelligence, 41, 17–24.
Yu, L., Dai, W., & Tang, L. (2016). A novel decomposition ensemble model with extended extreme learning machine for crude oil price forecasting. Engineering Applications of Artificial Intelligence, 47, 110–121.
