Engineering Applications of Artificial Intelligence 55 (2016) 231–238

A clustering-based sales forecasting scheme by using extreme learning machine and ensembling linkage methods with applications to computer server

Chi-Jie Lu a,*, Ling-Jing Kao b

a Department of Industrial Management, Chien Hsin University of Science and Technology, Taiwan
b Department of Business Management, National Taipei University of Technology, Taiwan

* Correspondence to: Department of Industrial Management, Chien Hsin University of Science and Technology, Zhong-Li Dist., Taoyuan City 32097, Taiwan. E-mail addresses: jerrylu@uch.edu.tw, chijie.lu@gmail.com (C.-J. Lu).

ARTICLE INFO

Article history:
Received 13 July 2015
Received in revised form 11 March 2016
Accepted 29 June 2016
Available online 20 July 2016

Keywords:
Sales forecasting
Clustering
Ensemble learning
Linkage method
Extreme learning machine
Computer server

ABSTRACT

Sales forecasting has long been crucial for companies since it is important for financial planning, inventory management, marketing, and customer service. In this study, a novel clustering-based sales forecasting scheme that uses an extreme learning machine (ELM) and assembles the results of linkage methods is proposed. The proposed scheme first uses the K-means algorithm to divide the training sales data into multiple disjoint clusters. Then, for each cluster, the ELM is applied to construct a forecasting model. Finally, a test datum is assigned to the most suitable cluster identified according to the result of combining five linkage methods. The constructed ELM model corresponding to the identified cluster is utilized to perform the final prediction. Two real sales datasets of computer servers collected from two multinational electronics companies are used to illustrate the proposed model. Empirical results showed that the proposed clustering-based sales forecasting scheme statistically outperforms eight benchmark models, and hence demonstrates that the proposed approach is an effective alternative for sales forecasting.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Sales forecasting is crucial for a company for financial planning, inventory management, marketing, and customer service. For example, sales forecasting has been used to estimate the inventory level required to satisfy market demand and to avoid the problem of over- or under-stocking. An effective sales forecasting model can reduce the bullwhip effect, thereby improving a company's supply chain management and sales management efficacy and, ultimately, increasing profits. Inaccurate sales forecasting may cause product backlogs, inventory shortages, and unsatisfied customer demand (Luis and Richard, 2007; Thomassey, 2010; Lu et al., 2012; Lu, 2014). Therefore, it is important to develop an effective sales forecasting model that can generate accurate and robust forecasting results.

A number of sales forecasting studies have been proposed in the literature, and clustering-based forecasting models have been adopted to improve prediction accuracy (Tay and Cao, 2001; Prinzie and Van den Poel, 2006; Lai et al., 2009; Lu and Wang, 2010; Venkatesh et al., 2014; López et al., 2015). The fundamental idea of the clustering-based forecasting model is to use a clustering algorithm to partition the whole training dataset into multiple disjoint clusters and to construct a forecasting model for every cluster. The test data are assigned to a cluster by their similarity, and the forecasting model of that cluster is used to obtain the forecasting outcomes. Because data in the same cluster have similar patterns, the clustering-based forecasting model can produce better forecasting accuracy than a forecasting model built upon the complete dataset.

Even though Venkatesh et al. (2014) found that the clustering-based approach yielded much smaller forecasting errors than direct prediction on the entire sample without clustering, the choice of the clustering approach, the similarity measurement, and the predictor all affect the performance of a clustering-based forecasting model. In the literature, the self-organizing map (SOM), the growing hierarchical self-organizing map (GHSOM), and the K-means clustering approach have been applied to cluster data, while the support vector machine (SVM), support vector regression (SVR), case-based reasoning (CBR), neural networks, decision trees, and the autoregressive integrated moving average (ARIMA) model have been used as predictors (Tay and Cao, 2001; Cao, 2003; Chang and Lai, 2005; Chang et al., 2009; Lai et al., 2009; Huang and Tsai, 2009; Badge and Srivastava, 2010; Kumar and Patel, 2010; Lu and Wang, 2010; Zhang and Yang, 2012; Lu and Chang, 2014).

No matter which clustering approach is adopted, a linkage method must be selected to determine the similarity between objects so that a new observation can be assigned to the appropriate cluster. The single linkage, complete linkage, centroid linkage, median linkage, and Ward's linkage methods are five well-known and frequently used linkage methods in clustering analysis, but different linkage methods have different characteristics and will generate different similarity measurement results (Palit and Popovic, 2005; Hair et al., 2006; Nandi et al., 2015).

Most of the clustering-based forecasting models mentioned above use only one linkage method to calculate the similarity between the prediction target and the clusters. However, using only one linkage method in a clustering-based forecasting model cannot provide a stable and effective outcome. Therefore, to solve this problem, this study proposes the use of ensemble learning to assemble the results of different linkage methods. Ensemble learning is a paradigm in which several intermediate classifiers or predictors are generated and combined to finally obtain a single classifier or predictor. It can be used to avoid selecting the worst learning algorithm and to improve classification or prediction performance (Dietterich, 2000; Polikar, 2006; Yang et al., 2010; Galar et al., 2012). Among the various methods for creating an ensemble of classifiers, majority voting is the most widely used ensemble technique and is considered a simple and effective scheme (Lam and Suen, 1997; Shahzad and Lavesson, 2013). Yeon et al. (2010) also proved that majority voting is the optimal solution in the case of no concept drift. The majority voting scheme follows democratic rules, i.e., the class with the highest number of votes is the outcome. Majority voting does not assume prior knowledge about the problem at hand or about the classifiers, and it does not require any parameter tuning once the individual classifiers have been trained (Lam and Suen, 1997).

Instead of using conventional predictors such as ARIMA or artificial neural networks, this study uses the extreme learning machine (ELM) as the predictor because of its great potential and superior performance in practical applications (Huang et al., 2015). ELM is a novel learning algorithm for single-hidden-layer feedforward neural networks (SLFNs), which randomly selects the input weights and analytically determines the output weights of the SLFN (Huang et al., 2006). Different from traditional gradient-based learning algorithms for neural networks, ELM not only tends to reach the smallest training error but also the smallest norm of output weights. Thus, the ELM algorithm provides much better generalization performance with much faster learning speed and avoids many issues faced by traditional algorithms, such as the stopping criterion, learning rate, number of epochs, local minima, and over-tuning problems (Yeon et al., 2010). ELM has attracted much attention in recent years and has become an important forecasting method (Sun et al., 2008; Wong and Guo, 2010; Chen and Ou, 2011; Lu and Shao, 2012; Wang and Han, 2015).

In this study, the clustering-based sales forecasting scheme is implemented as follows. First, the K-means algorithm is used to partition the whole training sales dataset into multiple disjoint clusters. We adopted the K-means algorithm because it is one of the most popular clustering methods (Nandi et al., 2015) and is effective and efficient in most cases (Jain, 2010). Then, the ELM is applied to construct a forecasting model for each cluster. Next, for a given testing dataset, ensemble learning based on the majority voting scheme is utilized to combine the results of the five linkage methods, including single linkage, complete linkage, centroid linkage, median linkage, and Ward's linkage, to find the cluster to which each test datum belongs. Finally, the ELM model corresponding to the identified cluster is used to generate the final prediction result.

Two real, monthly aggregate sales datasets of computer servers collected from two multinational electronics companies were utilized as an illustrative example to evaluate the performance of the proposed model. The forecasting accuracy of the proposed approach was compared with three single forecasting models, i.e., the simple naïve forecast, the seasonal naïve forecast, and the pure ELM model, and with five clustering-based forecasting models using different linkage methods. The model comparison shows that the proposed approach provides much more accurate predictions. This study contributes to the literature by proposing ensemble linkage to avoid the problem caused by choosing a single linkage method, as well as by providing an application of the ELM model.

The rest of this paper is organized as follows. Section 2 gives a brief introduction to the extreme learning machine. The proposed clustering-based sales forecasting model is thoroughly described in Section 3. Section 4 presents the experimental results. The paper is concluded in Section 5.

2. Extreme learning machine

ELM is one kind of single-hidden-layer feedforward neural network (SLFN). It has a three-layer structure consisting of the input layer, the hidden layer, and the output layer. It endeavors to overcome the challenging issues of traditional SLFNs, such as slow learning speed, trivial parameter tuning, and poor generalization capability (Huang et al., 2015).

One key feature of ELM is that a researcher may randomly choose the input weights and hidden-node parameters. After the input weights and hidden-node parameters are chosen randomly, the SLFN becomes a linear system in which the output weights of the network can be analytically determined using a simple generalized inverse operation on the hidden-layer output matrix (Huang et al., 2006).

Consider $N$ arbitrary distinct samples $(\mathbf{x}_i, \mathbf{y}_i)$, where $\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T \in \mathbb{R}^n$ is the input data and $\mathbf{y}_i = [y_{i1}, y_{i2}, \ldots, y_{im}]^T \in \mathbb{R}^m$ is the target output. If an SLFN with $\eta$ hidden neurons and activation function $\theta(x)$ can approximate the $N$ samples with zero error, this means $\sum_{i=1}^{N} \| \mathbf{q}_i - \mathbf{y}_i \| = 0$, where $\mathbf{q}_i$, for $i = 1, 2, \ldots, N$, are the output values of the SLFN. The system can then be written compactly as

$$\mathbf{H}\mathbf{B} = \mathbf{Y} \tag{1}$$

where

$$\mathbf{H}_{N \times \eta} = \left[\theta(\mathbf{w}_i \cdot \mathbf{x}_j + b_i)\right] = \begin{bmatrix} \theta(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & \theta(\mathbf{w}_\eta \cdot \mathbf{x}_1 + b_\eta) \\ \vdots & \ddots & \vdots \\ \theta(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & \theta(\mathbf{w}_\eta \cdot \mathbf{x}_N + b_\eta) \end{bmatrix}_{N \times \eta}$$

represents the hidden-layer output matrix of the neural network. The $i$th column of $\mathbf{H}$ is the $i$th hidden node's output with respect to the inputs $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$. $\mathbf{B}_{\eta \times m} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_\eta]$ is the matrix of output weights, where $\boldsymbol{\beta}_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the weight vector connecting the $i$th hidden node and the output nodes. $\mathbf{w}_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$ is the weight vector connecting the $i$th hidden node and the input nodes; $\mathbf{w}_i \cdot \mathbf{x}_j$ denotes the inner product of $\mathbf{w}_i$ and $\mathbf{x}_j$; and $b_i$ is the threshold (bias) of the $i$th hidden node. $\mathbf{Y}_{N \times m} = [\mathbf{y}_1, \ldots, \mathbf{y}_N]$ is the matrix of targets.

Huang et al. (2006) have proven that the input weights $\mathbf{w}_i$ and the hidden-layer biases $b_i$ of an SLFN need not be adjusted and can be given arbitrarily. Under this assumption, the input weights $\mathbf{w}_i$ and hidden biases $b_i$ are randomly generated in the ELM algorithm, and the output weights can be determined as simply as finding the least-squares solution to the given linear system. The minimum-norm least-squares solution to the linear system in Eq. (1) is

$$\hat{\mathbf{B}} = \mathbf{H}^{\dagger}\mathbf{Y} \tag{2}$$

where $\mathbf{H}^{\dagger}$ is the Moore–Penrose generalized inverse of the matrix $\mathbf{H}$. The minimum-norm least-squares solution is unique and has the smallest norm among all the least-squares solutions (Huang et al., 2006).
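The two-step procedure above, random hidden-layer parameters followed by an analytic least-squares fit, is short enough to write out directly. The following is a minimal NumPy sketch of Eqs. (1)-(2) assuming a sigmoidal activation; it is illustrative only and not the implementation used in the paper.

```python
import numpy as np

def elm_fit(X, y, n_hidden, rng):
    """Train a basic ELM: random input weights and biases, analytic output weights."""
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_features))  # random input weights w_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                 # random hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))                  # hidden-layer output matrix H (sigmoid)
    B = np.linalg.pinv(H) @ y                                 # Eq. (2): B = H^dagger Y via Moore-Penrose pseudoinverse
    return W, b, B

def elm_predict(X, W, b, B):
    """Apply a trained ELM to new inputs."""
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ B
```

Because the only fitted quantity is the linear output layer, training reduces to one pseudoinverse, which is where the speed advantage over gradient-based SLFN training comes from.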
In ELM, the number of hidden nodes is a critical factor for the generalization ability of the network and is the only parameter that needs to be determined. This makes ELM easy to use effectively, since tedious and time-consuming parameter tuning is avoided. However, a shortcoming of ELM is that its output is usually unstable from run to run because the input weights and hidden biases are randomly chosen (Sun et al., 2008; Huang et al., 2015). To overcome this disadvantage for regression problems, an extended ELM (ELME), which runs the original ELM Q times and takes the average of the predictions as the final result, was proposed by Sun et al. (2008).

After selecting the activation function of the hidden neurons and the number of hidden neurons, the ELME method generally includes four main steps. First, for the $k$th ELM, the input weights $\mathbf{w}_i^k$ and hidden-layer biases $b_i^k$ are randomly initialized. Second, under the parameters $\mathbf{w}_i^k$ and $b_i^k$, the original ELM model is used to obtain a prediction result $\mathbf{q}^k = [q_1^k, q_2^k, \ldots, q_N^k]$. Third, using the same input data, activation function, and number of hidden neurons, the first two steps are repeated Q times with different random $\mathbf{w}_i^k$ and $b_i^k$, $k = 1, 2, \ldots, Q$, so that Q different prediction results $q_i^k$ are obtained. Fourth, the final prediction result $\bar{q}_i$ is the average of all single prediction results, $\bar{q}_i = \frac{1}{Q}\sum_{k=1}^{Q} q_i^k$. For more detailed information, please refer to Sun et al. (2008).
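Building on the elm_fit/elm_predict helpers sketched above, the ELME averaging procedure can be expressed in a few lines. This is a hedged illustration of the four steps just described; the default Q is an assumption rather than a value prescribed at this point in the paper.

```python
def elme_predict(X_train, y_train, X_test, n_hidden, Q=30, seed=0):
    """Run the basic ELM Q times with fresh random weights and average the Q predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(Q):
        W, b, B = elm_fit(X_train, y_train, n_hidden, rng)   # steps 1-2: random init + analytic fit
        preds.append(elm_predict(X_test, W, b, B))           # step 3: one of the Q prediction results
    return np.mean(preds, axis=0)                            # step 4: final prediction is the average
```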
3. Proposed clustering-based ensemble forecasting scheme

This study presents a clustering-based sales forecasting scheme based on ensemble learning and ELM with five different linkage methods. The proposed model consists of a training phase and a testing phase. The proposed forecasting scheme is depicted in Fig. 1, and each step of both phases is described in detail in the following subsections.

Fig. 1. Proposed clustering-based ensemble forecasting scheme.

3.1. Training phase

The purpose of the training phase is to construct an ELM for each cluster obtained by the K-means algorithm. To achieve this goal, four steps are taken.

Step 1: Collect and scale training data

In the first step, the training data $\mathbf{y} = [y_1, y_2, \ldots, y_N]$, which are sales records in our empirical study, are collected and scaled by the min-max normalization method into the range [0, 1]. The formula of min-max normalization is given below:

$$y_i^* = \frac{y_i - \min \mathbf{y}}{\max \mathbf{y} - \min \mathbf{y}}, \quad \forall i = 1, 2, \ldots, N \tag{3}$$

where $y_i^*$ is the scaled datum, $\min \mathbf{y}$ is the minimum value of $\mathbf{y}$, and $\max \mathbf{y}$ is the maximum value of $\mathbf{y}$. The purpose of scaling the data is to equalize the range of the variables and to reduce prediction errors.

Step 2: Define the target variable and its predictors

The normalized training data are re-organized to define the target variables and predictors. Let $y_i^*$ be the target variable and $p$ denote the lag order. Then its corresponding predictors $\mathbf{x}_i$ are lagged observations of $y_i^*$, that is, $\mathbf{x}_i = [y_{i-1}^*, y_{i-2}^*, \ldots, y_{i-p}^*] = [x_{i1}, x_{i2}, \ldots, x_{ip}]$.
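Steps 1-2 amount to a min-max rescale followed by the construction of a lagged design matrix. The sketch below is a minimal Python illustration under the assumption of a one-dimensional sales series and a lag order p chosen by the modeller; it is not the authors' code.

```python
import numpy as np

def minmax_scale(y):
    """Eq. (3): scale a sales series into [0, 1]; also return (min, max) for the inverse transform."""
    y_min, y_max = y.min(), y.max()
    return (y - y_min) / (y_max - y_min), (y_min, y_max)

def make_lagged(y_scaled, p):
    """Build predictors x_i = [y*_{i-1}, ..., y*_{i-p}] and targets y*_i for i > p."""
    n = len(y_scaled)
    X = np.column_stack([y_scaled[p - j - 1 : n - j - 1] for j in range(p)])
    y_target = y_scaled[p:]
    return X, y_target

# example usage (hypothetical series): y_s, (lo, hi) = minmax_scale(sales); X, y_t = make_lagged(y_s, p=5)
```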


Step 3: Cluster the training data by the K-means algorithm

Let the vector of target variables be $\mathbf{y}^* = [y_1^*, y_2^*, \ldots, y_N^*]$ and their predictors be $\mathbf{X} = [x_{ij}]$ for $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, p$. The K-means algorithm is applied to the predictors $\mathbf{X}$ to cluster the training data into $g$ disjoint groups. Let $d$ denote a group and $n_d$ denote the number of training data in group $d$, $d = 1, 2, \ldots, g$, $n_d < N$. The $d$th cluster contains the target variables $\mathbf{y}^{(d)} = [y_1^{(d)}, y_2^{(d)}, \ldots, y_{n_d}^{(d)}]$ and their corresponding predictors $\mathbf{X}^{(d)} = [x_{ij}^{(d)}]$, for $i = 1, 2, \ldots, n_d$ and $j = 1, 2, \ldots, p$.

Step 4: Construct an ELM model for each cluster

In this step, the forecasting model for each cluster is constructed by the ELM algorithm. Let ELM(d) denote the ELM forecasting model for group $d$. We use the ELME model described in Section 2 to avoid the unstable results of ELM (Sun et al., 2008; Wong and Guo, 2010). Using the sigmoidal function as the activation function, ELME models with the number of hidden nodes varying from 1 to 30 were constructed. For each number of hidden nodes, an ELME model was run 30 times and the average RMSE was calculated. The number of hidden nodes that gives the smallest average RMSE value is selected as the best parameter of the ELME model.

3.2. Testing phase

The objective of the testing phase is to assign each test datum to a group by ensembling the results of the five linkage methods, and to obtain the predicted value using the ELM of that group. To achieve this goal, five steps are taken.

Step 1: Collect and scale the testing dataset

Step 2: Define the target variable and its predictors

The first two steps of the testing phase are the same as those of the training phase. The testing data are collected, scaled, and re-arranged to form the target variables $y_i^*$ and predictors $\mathbf{x}_i$ as described in the training phase. That is, $\mathbf{y}^* = [y_1^*, y_2^*, \ldots, y_n^*]$ is the scaled target variable and $\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{ip}]$ contains the predictors of $y_i^*$.

Step 3: Apply the linkage methods to assign the testing data to a suitable cluster obtained in the training phase

In this stage, the five linkage methods (single linkage, complete linkage, centroid linkage, median linkage, and Ward's linkage) are applied one by one to measure the similarity between the testing data and the clusters obtained in the training phase. For each linkage method, the most appropriate cluster for a testing sample is the one with the minimum distance. We use the Euclidean distance to measure similarity because it is one of the most popular similarity measures in time series forecasting (Palit and Popovic, 2005).

Let $x_{kj}^{(d)}$ denote the $k$th value of the $j$th predictor in the cluster $d$ identified in the training phase, $k = 1, 2, \ldots, n_d$, $j = 1, 2, \ldots, p$. The Euclidean distance between the predictors $\mathbf{x}_i$ of the testing datum $y_i^*$ and the predictors of the training data $\mathbf{y}^{(d)}$ in cluster $d$ can be computed by

$$\lambda_{ik}^{(d)} = \sqrt{\sum_{j=1}^{p} \left( x_{ij} - x_{kj}^{(d)} \right)^2}, \quad \text{for } i = 1, 2, \ldots, n, \; d = 1, 2, \ldots, g. \tag{4}$$

In the single linkage (or nearest neighbor) method, the distance between two groups is defined by the Euclidean distance between the nearest points in the groups, so the single linkage distance $S_i^{(d)}$ between the test datum $y_i^*$ and the cluster $d$ is the minimum value of $\lambda_{ik}^{(d)}$, that is, $S_i^{(d)} = \min_k (\lambda_{ik}^{(d)})$. Then, the most appropriate group for the test datum $y_i^*$ is the cluster $SG_i$ that has the smallest $S_i^{(d)}$, that is, $SG_i = \arg\min_d (S_i^{(d)})$.

The complete linkage (or furthest neighbor) method has the same computation procedure as the single linkage method. However, unlike the single linkage method, complete linkage uses the farthest points in the groups to measure the similarity. In other words, the complete linkage distance $C_i^{(d)}$ between the test datum $y_i^*$ and the cluster $d$ is the maximum value of $\lambda_{ik}^{(d)}$, that is, $C_i^{(d)} = \max_k (\lambda_{ik}^{(d)})$. The most appropriate group for the test datum $y_i^*$ is the cluster $CG_i$ that has the smallest $C_i^{(d)}$, that is, $CG_i = \arg\min_d (C_i^{(d)})$.

Instead of using the farthest or nearest points in the groups to measure the similarity, the centroid linkage method measures the distance between groups using their cluster centroids. Because the median is less affected by outliers than the mean, the mean is often replaced by the median in the distance measurement; the distance measurement using the median is known as the median linkage method.

Let the mean of the $j$th predictor in group $d$ be $\alpha_j^{(d)} = \frac{1}{n_d}\sum_{i=1}^{n_d} x_{ij}^{(d)} = \mathrm{mean}(\{x_{ij}^{(d)}, \forall i = 1, 2, \ldots, n_d\})$ and the median of the $j$th predictor in group $d$ be $m_j^{(d)} = \mathrm{median}(\{x_{ij}^{(d)}, \forall i = 1, 2, \ldots, n_d\})$. Then, the centroid linkage distance $A_i^{(d)}$ between the test datum $y_i^*$ and the cluster $d$ identified in the training phase can be computed by $A_i^{(d)} = \sqrt{\sum_{j=1}^{p} (x_{ij} - \alpha_j^{(d)})^2}$. The most appropriate group for the test datum $y_i^*$ is the cluster $AG_i$ that has the smallest $A_i^{(d)}$, that is, $AG_i = \arg\min_d (A_i^{(d)})$.

Similarly, the median linkage distance of $y_i^*$ can be computed by $M_i^{(d)} = \sqrt{\sum_{j=1}^{p} (x_{ij} - m_j^{(d)})^2}$. The most appropriate group for the test datum $y_i^*$ is the cluster $MG_i$ that has the smallest $M_i^{(d)}$, that is, $MG_i = \arg\min_d (M_i^{(d)})$.

Instead of measuring the Euclidean distance, Ward's linkage method defines the distance between two clusters by the within-cluster variance of the two clusters (Palit and Popovic, 2005). The goal of Ward's linkage method is to minimize the total within-group variance. To obtain the Ward's linkage distance $W_i^{(d)}$ between the test datum $y_i^*$ and the cluster $d$, the first step is to include the test datum $y_i^*$ in the cluster $d$ and to compute the centroid $o_j^{(d)}$ of the new cluster $d$ as follows:

$$o_j^{(d)} = \frac{x_{ij} + \sum_{k=1}^{n_d} x_{kj}^{(d)}}{1 + n_d}, \quad \forall j = 1, 2, \ldots, p. \tag{5}$$

Finally, the Ward's linkage distance $W_i^{(d)}$ is calculated by

$$W_i^{(d)} = \sum_{j=1}^{p} \left( \left( x_{ij} - o_j^{(d)} \right)^2 + n_d \times \left( \alpha_j^{(d)} - o_j^{(d)} \right)^2 \right) \tag{6}$$

where $n_d$ is the number of data points in the group $d$. By Ward's linkage method, the most appropriate cluster $WG_i$ of the test datum $y_i^*$ is determined by $WG_i = \arg\min_d (W_i^{(d)})$.

Step 4: Apply majority voting to assemble the results of the linkage methods

Since different linkage methods have different characteristics, the testing data may be assigned to different clusters by different linkage methods. Therefore, in Step 4, we use majority voting to combine the results of the single, complete, centroid, median, and Ward's linkage methods. For the test datum $y_i^*$, the most suitable cluster $FG_i$ is the cluster that receives the most votes from the five linkage methods, that is, $FG_i = \mathrm{mode}(SG_i, CG_i, AG_i, MG_i, WG_i)$. Note that if the five assigned clusters $SG_i$, $CG_i$, $AG_i$, $MG_i$, and $WG_i$ are all different, or if two groups receive equal votes, the most suitable cluster $FG_i$ is randomly selected from the results of the five linkage methods.
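The testing-phase assignment in Steps 3-4 reduces to computing five distances per cluster and taking a vote. The following compact sketch assumes cluster_predictors is a list of per-cluster predictor arrays produced in the training phase; it is an illustration of Eqs. (4)-(6) and the voting rule, not the authors' implementation.

```python
import numpy as np
from collections import Counter

def assign_cluster(x, cluster_predictors, rng=None):
    """Assign one test point to a cluster by voting over the five linkage methods."""
    rng = np.random.default_rng() if rng is None else rng
    votes = []
    for method in ("single", "complete", "centroid", "median", "ward"):
        dists = []
        for Xd in cluster_predictors:               # Xd: (n_d, p) predictors of one training cluster
            lam = np.linalg.norm(Xd - x, axis=1)    # Eq. (4): Euclidean distance to every member
            if method == "single":
                dists.append(lam.min())
            elif method == "complete":
                dists.append(lam.max())
            elif method == "centroid":
                dists.append(np.linalg.norm(x - Xd.mean(axis=0)))
            elif method == "median":
                dists.append(np.linalg.norm(x - np.median(Xd, axis=0)))
            else:                                    # Ward's linkage, Eqs. (5)-(6)
                n_d, alpha = len(Xd), Xd.mean(axis=0)
                o = (x + n_d * alpha) / (1 + n_d)
                dists.append(np.sum((x - o) ** 2 + n_d * (alpha - o) ** 2))
        votes.append(int(np.argmin(dists)))          # this method's preferred cluster index
    ranked = Counter(votes).most_common()
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]                          # clear majority winner
    return int(rng.choice(votes))                    # tie or all different: pick randomly among the five
```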
Step 5: Obtain the predicted value using the ELM of the most suitable group

After determining the most suitable cluster for each testing sample, the ELM forecasting model corresponding to the selected cluster, ELM($FG_i$), obtained in the training phase is used to generate the predicted value $\bar{q}_i^*$ of the testing datum $y_i^*$. Because this predicted value is scaled, it is re-scaled by the inverse transformation of the min-max normalization method to obtain the final predicted value $\bar{q}_i$.

Note that the proposed research scheme can also be used to predict further values recursively. In a recursive forecasting process, the first predicted value is used as an input to predict the next value. However, as with all forecasting methods, the error will increase as the sequential prediction proceeds. In the proposed research scheme, the training phase only needs to be carried out once: as long as the ELM of each cluster has been obtained in the training phase, each predicted value is incorporated into the cluster used to calculate it.

4. Empirical study

4.1. Data description and performance criteria

The monthly aggregate sales volumes of two multinational computer server companies in Taiwan are used to illustrate and evaluate the proposed forecasting scheme. Computer server sales data were used to illustrate the proposed research scheme because, compared to personal computer sales, computer server sales data present relatively clear regularity, which is a desirable feature for all clustering-based forecasting methods. However, with the rapid development of technology and innovation, the periodicity of computer server sales data is not as stable as before. The cycle can be longer or shorter, and it can contain noise. In such situations, using only a single linkage method in a clustering-based forecasting method to determine the most suitable cluster cannot produce a reliable outcome. In this study, computer server sales data were therefore preferred as a means to demonstrate that the ensemble learning solution provided in this study can overcome this problem.

The computer server sales data applied in this study were collected from January 2003 to December 2012, and their time series plots are presented in Figs. 2 and 3. In total, there are 120 data points in each data set. Each data set was divided into training and testing sets: the first 96 data points (80% of the total sample) are used as training samples, while the remaining 24 data points (20% of the total sample) are employed as testing samples for measuring out-of-sample forecasting ability.

Fig. 2. The monthly aggregate sales volume of Company A from January 2003 to December 2012.

Fig. 3. The monthly aggregate sales amount of Company B from January 2003 to December 2012.

Because the proposed forecasting method consists of the K-means algorithm, ensemble learning, and ELM with five linkage methods, the proposed model is referred to as the EN-ELM model (the proposed clustering-based forecasting scheme). The performance of the proposed EN-ELM model was compared with those of three single models (pure ELM, NF, and SNF) and five clustering-based forecasting models (SL-ELM, FL-ELM, CL-ELM, ML-ELM, and WL-ELM). Pure ELM refers to the ELM model without data clustering; NF refers to the simple naïve forecast, which uses the previous month's sales volume as the predicted value; SNF refers to the seasonal naïve forecast; SL-ELM refers to ELM with single linkage; FL-ELM refers to ELM with complete linkage; CL-ELM refers to ELM with centroid linkage; ML-ELM refers to ELM with median linkage; and WL-ELM refers to ELM with Ward's linkage.

For the pure ELM, the proposed EN-ELM model, and the five clustering-based forecasting models, the five predictors employed in Lu (2014) were used to perform one-step-ahead monthly sales forecasting. These five predictors are the previous month's sales amount (T-1), the sales amount two months earlier (T-2), the sales amount three months earlier (T-3), the 3-month moving average (MA3), and the sales amount at the same time last year (T-12). These predictors were selected according to Lu (2014).

The prediction performance is evaluated using the following performance measures: the root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), root mean square percentage error (RMSPE), and normalized mean square error (NMSE). Lower values of RMSE, MAE, MAPE, RMSPE, and NMSE indicate that the predicted result is closer to the actual value. The definitions of these criteria are given as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - q_i)^2} \tag{7}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - q_i \right| \tag{8}$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left| \frac{y_i - q_i}{y_i} \right| \tag{9}$$

$$\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( \frac{y_i - q_i}{y_i} \right)^2} \tag{10}$$

$$\mathrm{NMSE} = \frac{\sum_{i=1}^{n} (y_i - q_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{11}$$
where $y_i$ and $q_i$ represent the actual and predicted values for month $i$, respectively; $n$ is the total number of testing data points; and $\bar{y}$ is the average of the $y_i$.
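These five criteria are straightforward to compute from the actual and predicted test values. The helper below is an illustrative NumPy sketch of Eqs. (7)-(11), not code from the study.

```python
import numpy as np

def forecast_errors(y, q):
    """Return RMSE, MAE, MAPE, RMSPE and NMSE (Eqs. 7-11) for actuals y and predictions q."""
    y, q = np.asarray(y, dtype=float), np.asarray(q, dtype=float)
    e = y - q
    rmse = np.sqrt(np.mean(e ** 2))
    mae = np.mean(np.abs(e))
    mape = np.mean(np.abs(e / y))
    rmspe = np.sqrt(np.mean((e / y) ** 2))
    nmse = np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "RMSPE": rmspe, "NMSE": nmse}
```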
4.2. Empirical results

To construct the pure ELM model, a sensitivity study was conducted to determine the optimal number of hidden nodes, i.e., the number that gives the smallest average RMSE value. For example, as shown in Fig. 4, the pure ELM model with 25 hidden nodes is optimal for Company A's sales data because it has the smallest average RMSE value. Because ELM was also used to construct the forecasting models for the proposed EN-ELM model and the five alternatives (SL-ELM, FL-ELM, CL-ELM, ML-ELM, and WL-ELM), the same sensitivity analysis as described for the pure ELM model was applied to determine the optimal number of hidden nodes for each of them.

Fig. 4. Sensitivity analysis for the pure ELM model of Company A (average RMSE versus number of hidden nodes).

In addition to the number of hidden nodes, the number of clusters in a clustering-based forecasting model is another factor influencing the forecasting result, because either too many or too few clusters can increase the testing error. To obtain a better forecasting result, a sensitivity analysis with 2 to 5 clusters was conducted for the proposed EN-ELM model and the five alternatives (SL-ELM, FL-ELM, CL-ELM, ML-ELM, and WL-ELM). The cluster number with the minimum testing error is the optimal cluster number for each model.

Table 1 depicts the forecasting results of all clustering-based models with different cluster numbers. By comparing the values of MAPE, RMSPE, RMSE, MAE, and NMSE, it was found that the best number of clusters for SL-ELM, FL-ELM, WL-ELM, and the proposed EN-ELM is 3, and the best number of clusters for CL-ELM and ML-ELM is 2. For better presentation, parentheses are used to indicate the best cluster number for each clustering-based forecasting model; for example, EN-ELM(3) refers to EN-ELM with 3 clusters.

Table 1
Sensitivity analysis of Company A's clustering-based forecasting models.

Model | Number of clusters | MAPE | RMSPE | RMSE | MAE | NMSE
SL-ELM | 2 | 3.59% | 4.81% | 122.48 | 93.72 | 0.111
SL-ELM | 3 | 3.42% | 4.53% | 114.73 | 87.39 | 0.094
SL-ELM | 4 | 6.14% | 8.24% | 213.21 | 158.09 | 0.364
SL-ELM | 5 | 8.27% | 10.72% | 263.69 | 203.76 | 0.556
FL-ELM | 2 | 3.72% | 4.35% | 115.94 | 97.51 | 0.101
FL-ELM | 3 | 2.84% | 3.59% | 98.46 | 74.69 | 0.069
FL-ELM | 4 | 5.04% | 7.56% | 207.45 | 128.82 | 0.339
FL-ELM | 5 | 6.51% | 9.17% | 295.97 | 187.97 | 0.682
CL-ELM | 2 | 3.51% | 4.11% | 104.37 | 88.60 | 0.086
CL-ELM | 3 | 3.62% | 4.14% | 105.83 | 89.80 | 0.080
CL-ELM | 4 | 6.53% | 11.29% | 264.01 | 156.47 | 0.557
CL-ELM | 5 | 8.47% | 10.96% | 271.30 | 205.38 | 0.634
ML-ELM | 2 | 3.31% | 4.21% | 101.37 | 81.12 | 0.084
ML-ELM | 3 | 3.82% | 3.96% | 98.39 | 95.46 | 0.067
ML-ELM | 4 | 3.44% | 3.95% | 96.56 | 86.91 | 0.063
ML-ELM | 5 | 5.43% | 6.41% | 194.37 | 152.55 | 0.286
WL-ELM | 2 | 3.71% | 4.18% | 106.89 | 96.11 | 0.088
WL-ELM | 3 | 2.93% | 3.54% | 92.35 | 75.76 | 0.061
WL-ELM | 4 | 3.07% | 4.10% | 103.29 | 78.06 | 0.081
WL-ELM | 5 | 6.73% | 9.06% | 231.33 | 171.69 | 0.440
Proposed En-ELM | 2 | 3.05% | 3.98% | 96.73 | 77.52 | 0.082
Proposed En-ELM | 3 | 2.09% | 2.69% | 66.04 | 52.25 | 0.031
Proposed En-ELM | 4 | 2.90% | 4.46% | 111.43 | 73.15 | 0.112
Proposed En-ELM | 5 | 5.08% | 7.25% | 198.88 | 134.83 | 0.352

The forecasting results for Company A's sales data are summarized in Table 2. As presented in Table 2, compared to its competing models, the proposed EN-ELM model has the lowest MAPE, RMSPE, RMSE, MAE, and NMSE values. This result indicates that the proposed EN-ELM has the smallest prediction errors.

Table 2
Summary of Company A's forecasting results by the proposed En-ELM model and eight competing methods.

Methods | MAPE | RMSPE | RMSE | MAE | NMSE
SNF | 4.00% | 5.53% | 138.24 | 100.96 | 0.136
NF | 15.06% | 17.71% | 446.27 | 379.38 | 1.415
Pure ELM | 3.94% | 5.49% | 133.37 | 96.34 | 0.126
SL-ELM(3) | 3.42% | 4.53% | 114.73 | 87.39 | 0.094
FL-ELM(3) | 2.84% | 3.59% | 98.46 | 74.69 | 0.069
CL-ELM(2) | 3.51% | 4.11% | 104.37 | 88.60 | 0.086
ML-ELM(2) | 3.31% | 4.21% | 101.37 | 81.12 | 0.084
WL-ELM(3) | 2.93% | 3.54% | 92.35 | 75.76 | 0.061
En-ELM(3) | 2.09% | 2.69% | 66.04 | 52.25 | 0.031

Because Company A's and Company B's sales data were analyzed by the same procedure, only the forecasting results of Company B are reported (Table 3), to save space. As both Tables 2 and 3 reveal, the proposed EN-ELM has the smallest prediction errors and outperforms the competing alternatives. Moreover, compared to SNF, NF, and pure ELM, all clustering-based models have better prediction accuracy. This finding is consistent with previous studies (Lu and Wang, 2010; Venkatesh et al., 2014; López et al., 2015), which stated that clustering-based forecasting models can be used to improve prediction performance.

Table 3
Summary of Company B's forecasting results by the proposed En-ELM model and eight competing methods.

Methods | MAPE | RMSPE | RMSE | MAE | NMSE
SNF | 4.36% | 6.66% | 601.52 | 398.71 | 0.191
NF | 8.60% | 12.10% | 1176.59 | 840.96 | 0.729
Pure ELM | 3.95% | 5.04% | 485.08 | 374.11 | 0.124
SL-ELM(4) | 3.98% | 5.01% | 476.38 | 373.43 | 0.120
FL-ELM(3) | 3.71% | 4.74% | 456.69 | 351.56 | 0.110
CL-ELM(4) | 3.79% | 4.95% | 497.44 | 368.10 | 0.130
ML-ELM(3) | 3.31% | 4.04% | 397.47 | 318.52 | 0.083
WL-ELM(3) | 3.57% | 4.78% | 464.39 | 339.73 | 0.114
En-ELM(3) | 2.22% | 2.58% | 249.72 | 210.88 | 0.033
4.3. Significance test

To further evaluate whether the differences shown in Tables 2 and 3 between the proposed EN-ELM model and the competing models are statistically significant, the Wilcoxon signed-rank test, a popular nonparametric technique, was employed. Please refer to Diebold and Mariano (1995) and Pollock et al. (2005) for details of the Wilcoxon signed-rank test.

Table 4 reports the Z statistics and p-values of the two-tailed Wilcoxon signed-rank test for the squared errors between the proposed EN-ELM model and its competing models. It shows that the forecasting error of the EN-ELM model is significantly lower than those of its competing models. Therefore, it can be concluded that the proposed EN-ELM model significantly outperforms the alternatives in sales forecasting.

Table 4
Wilcoxon signed-rank test results (a) between the En-ELM and eight competing models.

Models | NF | SNF | Pure ELM | SL-ELM | FL-ELM | CL-ELM | ML-ELM | WL-ELM
Company A, En-ELM | 2.086 (0.037)** | 4.086 (0.000)** | 2.084 (0.040)** | 1.829 (0.067)* | 1.657 (0.097)* | 1.857 (0.053)* | 1.743 (0.081)* | 1.686 (0.092)*
Company B, En-ELM | 3.086 (0.002)** | 3.657 (0.000)** | 2.529 (0.011)** | 2.472 (0.013)** | 2.114 (0.034)** | 2.220 (0.028)** | 1.714 (0.086)* | 2.000 (0.048)**

(a) p-values in parentheses.
* p-value < 0.10.
** p-value < 0.05.
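Such a paired comparison is easy to reproduce with SciPy; the sketch below assumes the per-month forecast errors of two models on the 24 test points are available as arrays (note that SciPy reports the signed-rank statistic W and the p-value, whereas Table 4 reports Z statistics).

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_squared_errors(errors_a, errors_b):
    """Two-tailed Wilcoxon signed-rank test on paired squared forecast errors."""
    result = wilcoxon(np.square(errors_a), np.square(errors_b), alternative="two-sided")
    return result.statistic, result.pvalue
```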
4.4. Effectiveness evaluation of ensemble learning in forecasting

To further validate the effectiveness of the proposed clustering-based forecasting scheme with ensemble learning, the K-means algorithm was replaced with partitioning around medoids (PAM) (Kaufman and Rousseeuw, 1990) and ELM was replaced with SVR (Vapnik, 1999) in the proposed forecasting scheme (Fig. 1), and the prediction accuracy of the proposed EN-ELM, EN-SVR (the K-means algorithm, ensemble learning, and SVR), P-EN-ELM (PAM, ensemble learning, and ELM), and P-EN-SVR (PAM, ensemble learning, and SVR) was compared. Even though the model with ensemble learning (EN-ELM) was shown in Section 4.3 to have better performance than the models with a single linkage method, for completeness the prediction accuracy of the new models (EN-SVR, P-EN-ELM, and P-EN-SVR) was still compared with that of models using the five single linkage methods.

PAM was chosen because it is not only a well-known squared-error-based clustering algorithm but also one of the most popular k-medoids clustering algorithms. Compared to the K-means algorithm, PAM is more robust to noise and outliers (Xu and Wunsch, 2005). SVR was chosen because it is an effective machine-learning algorithm for solving nonlinear regression estimation problems and has been successfully used in sales forecasting (Lu and Wang, 2010; Lu et al., 2012; Lu, 2014). The parameters of SVR were determined according to Lu (2014), while all linkage methods were applied as illustrated in Section 4.2.

The results of the model comparison are reported in Table 5. The first comparison group consists of the models using SVR as the forecaster, the second comparison group consists of the models using PAM as the clustering method, and the third comparison group consists of the models using PAM as the clustering method and SVR as the forecaster. As shown in Table 5, regardless of the clustering method and forecaster, the models using ensemble learning (EN-SVR, P-EN-ELM, and P-EN-SVR) always have better forecasting accuracy than the other models in the same comparison group.

Table 5
Model comparison between the En-SVR model and competing models with different clustering methods and predictors.

Comparison group | Methods | Company A MAPE | Company A RMSPE | Company B MAPE | Company B RMSPE
1 | Pure SVR | 3.92% | 5.45% | 4.05% | 5.07%
1 | SL-SVR | 3.35% | 4.49% | 3.96% | 5.11%
1 | FL-SVR | 2.85% | 3.54% | 3.77% | 4.65%
1 | CL-SVR | 3.47% | 4.08% | 3.78% | 4.86%
1 | ML-SVR | 3.36% | 4.18% | 3.34% | 4.07%
1 | WL-SVR | 2.86% | 3.56% | 3.53% | 4.83%
1 | En-SVR | 2.11% | 2.73% | 2.27% | 2.65%
2 | Pure ELM | 3.94% | 5.49% | 3.95% | 5.04%
2 | P-SL-ELM | 3.59% | 4.60% | 3.69% | 4.90%
2 | P-FL-ELM | 2.91% | 3.57% | 3.73% | 4.75%
2 | P-CL-ELM | 3.61% | 3.99% | 3.71% | 4.73%
2 | P-ML-ELM | 3.52% | 4.02% | 3.47% | 4.11%
2 | P-WL-ELM | 2.78% | 3.58% | 3.39% | 4.63%
2 | P-En-ELM | 2.20% | 2.86% | 2.32% | 2.78%
3 | Pure SVR | 3.92% | 5.45% | 4.05% | 5.07%
3 | P-SL-SVR | 3.64% | 4.65% | 3.93% | 5.07%
3 | P-FL-SVR | 3.06% | 3.87% | 3.83% | 4.99%
3 | P-CL-SVR | 3.44% | 3.95% | 3.76% | 4.64%
3 | P-ML-SVR | 3.48% | 4.08% | 3.38% | 4.05%
3 | P-WL-SVR | 3.11% | 3.51% | 3.84% | 4.73%
3 | P-En-SVR | 2.21% | 2.89% | 2.35% | 2.81%
Proposed model | En-ELM | 2.09% | 2.69% | 2.22% | 2.58%

Moreover, the proposed EN-ELM model has the lowest MAPE and RMSPE among all the models using ensemble learning. The reason why P-EN-ELM and P-EN-SVR do not achieve better forecasting accuracy might be that computer server sales data are relatively regular and contain few noises and outliers. This characteristic of computer server sales data fails to highlight the advantage of PAM, while it makes the data within clusters relatively homogeneous, which leads to the better forecasting performance of ELM. This finding is also consistent with Jain (2010), who shows that the K-means algorithm is effective and efficient in most cases, and with Huang et al. (2006), who show that ELM and SVR have similar prediction performance.

5. Conclusion

Business decisions, such as production, inventory management, financial planning, and sales, rely heavily on sales forecasting. Enhancing forecasting accuracy can improve a firm's decision-making efficacy and lower operational costs. This is particularly true for premium products, such as computer servers, which have high production costs.

To enhance forecasting accuracy, this study proposes a novel clustering-based sales forecasting scheme which uses the K-means algorithm for data clustering, ensemble learning with majority voting for integrating the results of five linkage methods (i.e., the single linkage, complete linkage, centroid linkage, median linkage, and Ward's linkage methods), and the extreme learning machine (ELM) as the predictor. Monthly aggregate sales data of two computer server companies in Taiwan are used to evaluate the performance of the proposed method.
Empirical results show that the proposed sales forecasting scheme produces the best forecasting results and statistically outperforms models without clustering (NF, SNF, and pure ELM) and models with a single linkage method (SL-ELM, FL-ELM, CL-ELM, ML-ELM, and WL-ELM). Replacing the K-means algorithm with PAM and ELM with SVR in the proposed forecasting scheme does not diminish the forecasting accuracy. Therefore, this study validates the concept that clustering-based forecasting models with ensemble learning can be used to improve the prediction performance of a single forecasting model. Since the sales data adopted in this study are relatively simple and limited in number, future research should investigate the performance of the proposed research scheme in more sophisticated cases.

Acknowledgments

This work was partially supported by the Ministry of Science and Technology, Taiwan, R.O.C. under Grant no. MOST 103-2221-E-231-003-MY2. The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.

References

Badge, J., Srivastava, N., 2010. Selection and forecasting of stock market patterns using K-means clustering. Int. J. Stat. Syst. 5, 23–27.
Cao, L.J., 2003. Support vector machines experts for time series forecasting. Neurocomputing 51, 321–339.
Chang, P.C., Lai, C.Y., 2005. A hybrid system combining self-organizing maps with case-based reasoning in wholesaler's new-release book forecasting. Expert Syst. Appl. 29 (1), 183–192.
Chang, P.C., Liu, C.H., Fan, C.F., 2009. Data clustering and fuzzy neural network for sales forecasting: a case study in printed circuit board industry. Knowl.-Based Syst. 22 (5), 344–355.
Chen, F.L., Ou, T.Y., 2011. Sales forecasting system based on Gray extreme learning machine with Taguchi method in retail industry. Expert Syst. Appl. 38 (3), 1336–1345.
Diebold, F.X., Mariano, R.S., 1995. Comparing predictive accuracy. J. Bus. Econ. Stat. 13, 253–263.
Dietterich, T.G., 2000. Ensemble methods in machine learning. Lect. Notes Comput. Sci. 1857, 1–15.
Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F., 2012. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev. 42 (4), 463–484.
Hair, J.F., Black, W.C., Babin, B., Anderson, R.E., Tatham, R.L., 2006. Multivariate Data Analysis, 6th ed. Prentice-Hall, New York.
Huang, C.L., Tsai, C.Y., 2009. A hybrid SOFM-SVR with a filter-based feature selection for stock market forecasting. Expert Syst. Appl. 36 (2), 1529–1539.
Huang, G., Huang, G.B., Song, S., You, K., 2015. Trends in extreme learning machines: a review. Neural Netw. 61, 32–48.
Huang, G.B., Zhu, Q.Y., Siew, C.K., 2006. Extreme learning machine: theory and applications. Neurocomputing 70 (1), 489–501.
Jain, A.K., 2010. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 31 (8), 651–666.
Kaufman, L., Rousseeuw, P., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New Jersey.
Kumar, M., Patel, N.R., 2010. Using clustering to improve sales forecasts in retail merchandising. Ann. Oper. Res. 174 (1), 33–46.
Lai, R.K., Fan, C.Y., Huang, W.H., Chang, P.C., 2009. Evolving and clustering fuzzy decision tree for financial time series data forecasting. Expert Syst. Appl. 36, 3761–3773.
Lam, L., Suen, S.Y., 1997. Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans 27 (5), 553–568.
López, K.L., Gagné, C., Castellanos-Dominguez, G., Orozco-Alzate, M., 2015. Training subset selection in hourly Ontario energy price forecasting using time series clustering-based stratification. Neurocomputing 156, 268–279.
Lu, C.J., 2014. Sales forecasting of computer products based on variable selection scheme and support vector regression. Neurocomputing 128, 491–499.
Lu, C.J., Chang, C.C., 2014. A hybrid sales forecasting scheme by combining independent component analysis with K-means clustering and support vector regression. Sci. World J. 2014. http://dx.doi.org/10.1155/2014/624017.
Lu, C.J., Lee, T.S., Lian, C.M., 2012. Sales forecasting for computer wholesalers: a comparison of multivariate adaptive regression splines and artificial neural networks. Decis. Support Syst. 54 (1), 584–596.
Lu, C.J., Shao, Y.E., 2012. Forecasting computer products sales by integrating ensemble empirical mode decomposition and extreme learning machine. Math. Probl. Eng. 2012. http://dx.doi.org/10.1155/2012/831201.
Lu, C.J., Wang, Y.W., 2010. Combining independent component analysis and growing hierarchical self-organizing maps with support vector regression in product demand forecasting. Int. J. Prod. Econ. 128 (2), 603–613.
Luis, A., Richard, W., 2007. Improved supply chain management based on hybrid demand forecasts. Appl. Soft Comput. 7 (1), 136–144.
Nandi, A.K., Fa, R., Abu-Jamous, B., 2015. Integrative Cluster Analysis in Bioinformatics. John Wiley & Sons, West Sussex, UK.
Palit, A.K., Popovic, D., 2005. Computational Intelligence in Time Series Forecasting: Theory and Engineering Applications (Advances in Industrial Control). Springer-Verlag Inc., New Jersey, New York.
Polikar, R., 2006. Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6 (3), 21–45.
Pollock, A.C., Macaulay, A., Thomson, M.E., Onkal, D., 2005. Performance evaluation of judgemental directional exchange rate predictions. Int. J. Forecast. 21, 473–489.
Prinzie, A., Van den Poel, D., 2006. Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM. Decis. Support Syst. 42 (2), 508–526.
Shahzad, R.K., Lavesson, N., 2013. Comparative analysis of voting schemes for ensemble-based malware detection. J. Wirel. Mob. Netw. Ubiquitous Comput. Dependable Appl. 4 (1), 98–117.
Sun, Z.L., Choi, T.M., Au, K.F., Yu, Y., 2008. Sales forecasting using extreme learning machine with applications in fashion retailing. Decis. Support Syst. 46 (1), 411–419.
Tay, F.E.H., Cao, L.J., 2001. Improved financial time series forecasting by combining support vector machines with self-organizing feature map. Intell. Data Anal. 5 (4), 339–354.
Thomassey, S., 2010. Sales forecasts in clothing industry: the key success factor of the supply chain management. Int. J. Prod. Econ. 128 (2), 470–483.
Vapnik, V.N., 1999. An overview of statistical learning theory. IEEE Trans. Neural Netw. 10, 988–999.
Venkatesh, K., Ravi, V., Prinzie, A., Van den Poel, D., 2014. Cash demand forecasting in ATMs by clustering and neural networks. Eur. J. Oper. Res. 232 (2), 383–392.
Wang, X., Han, M., 2015. Improved extreme learning machine for multivariate time series online sequential prediction. Eng. Appl. Artif. Intell. 40, 28–36.
Wong, W.K., Guo, Z.X., 2010. A hybrid intelligent model for medium-term sales forecasting in fashion retail supply chains using extreme learning machine and harmony search algorithm. Int. J. Prod. Econ. 128 (2), 614–624.
Xu, R., Wunsch, D., 2005. Survey of clustering algorithms. IEEE Trans. Neural Netw. 16 (3), 645–678.
Yang, P., Yang, Y.H., Zhou, B.B., Zomaya, A.Y., 2010. A review of ensemble methods in bioinformatics. Curr. Bioinform. 5 (4), 296–308.
Yeon, K., Song, M.S., Kim, Y., Choi, H., Park, C., 2010. Model averaging via penalized regression for tracking concept drift. J. Comput. Graph. Stat. 19 (2), 457–473.
Zhang, J., Yang, Y., 2012. BP neural network model based on the K-means clustering to predict the share price. In: Proceedings of the IEEE 2012 Fifth International Joint Conference on Computational Sciences and Optimization, pp. 181–184.