Meta-learning for time series forecasting
Neurocomputing
Available online 7 March 2010

Keywords: Forecasting; Forecast combination; Time series; Time series features; Meta-learning; Diversity

Abstract

In research on time series forecasting, a lot of uncertainty is still related to the task of selecting an appropriate forecasting method for a problem. Not only are individual algorithms available in great quantities; combination approaches have been equally popular in the last decades. The question of whether to choose the most promising individual method or a combination is alone not straightforward to answer. Usually, expert knowledge is needed to make an informed decision; however, in many cases this is not feasible due to a lack of resources such as time, money and manpower. This work identifies an extensive feature set describing both the time series and the pool of individual forecasting methods. The applicability of different meta-learning approaches is investigated, first to gain knowledge on which model works best in which situation, and later to improve forecasting performance. Results show the superiority of a ranking-based combination of methods over simple model selection approaches.
© 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2009.09.020
C. Lemke, B. Gabrys / Neurocomputing 73 (2010) 2006–2016 2007
methods. Discriminant analysis to select between three forecasting methods using 26 features is used in [41].

The phrase "meta-learning" in the context of time series was first used in [39] and represents a new term, adopted from the general machine learning community, for describing the process of automatically acquiring knowledge for time series forecasting model selection. Two case studies are presented in [39]: in the first one, a C4.5 decision tree is used to link six features to the performance of two forecasting methods; in the second one, the NOEMON approach [25] is used for ranking three methods. The most recent and comprehensive treatment of the subject can be found in [48], where time series are clustered according to their data characteristics and rules are generated judgementally as well as using a decision tree. The approach is then extended to determine weights for a combination of individual models based on data characteristics. Table 1 summarises some facts about the related work presented here for a better overview of the approaches and methods used. The feature calculations and meta-learning methods listed are implemented automatically if not otherwise stated.

Some time series features presented in this work are similar to ones used in the literature, but new and different features are introduced, extending previous work published in [28,30]. In particular, features concerning the diversity of the pool of algorithms are included, which is facilitated by adding a number of popular forecast combination algorithms to the method pool. In addition to the original question of which model to select, this work also tries to find evidence for features being useful in guiding the choice of whether to pick an individual model or a combination. In an initial exploratory experiment, decision trees are generated to find evidence for the existence of a link between time series characteristics and the performance of models. Leaving aside the requirement for interpretable rules and recommendations, four meta-learning techniques are compared in another empirical experiment, assessing potential performance improvements.

The paper is structured as follows: Section 2 presents the underlying empirical experiments and results. Section 3 begins treating the model selection problem as a classification task and describes an extensive number of time series characteristics, which are necessary to link the performance of algorithms to the nature of the time series. The different experiments using meta-learning techniques are evaluated in Section 4.

2. Performance of forecasting and forecast combination methods

This part of the work presents the empirical experiments that provide the basis for the further meta-learning analysis. The individual predictors are diverse, but are kept relatively simple and, more importantly, automatic. These methods often perform just as well as more complex methods [33], are more efficient in terms of computational requirements and are also more likely to be employed in practical applications, especially when no expert advice is available.

2.1. Data sets

Two data sets, both consisting of 111 time series, have been used in this study; they were obtained from the NN3 [13] and NN5 [14] neural network forecasting competitions. The NN3 data include monthly empirical business time series with 52–126 observations, while the NN5 series are daily time series of cash machine withdrawals with 735 observations each. The competition task was to predict the next 18 or 56 observations, respectively. While the NN3 data did not need specific preprocessing, the NN5 data included some missing values, which were substituted by the mean of the values of the corresponding weekday in the previous and the following week. The last 18 or 56 values of each series were not used for training the models, to enable an out-of-sample error evaluation.

2.2. Forecasting methods

Available forecasting algorithms can be roughly divided into a few groups. Simple approaches are often surprisingly robust and popular, for example those based on exponential smoothing [20,33]. Statisticians and econometricians tend to rely on complex ARIMA models and their derivatives [6]. The machine learning community mainly looks at neural networks, either using multi-layer perceptrons with time-lagged time series observations as inputs as, for example, in [50,16], or recurrent networks with a memory, see, for example, [26]. As not all of the algorithms provide native multi-step-ahead forecasting, some of them are implemented using two approaches: an iterative approach, where the last prediction is fed back to the model to obtain the next forecast, or a direct approach, where n different predictors are trained, one for each of the 1 to n steps-ahead problems. The selection of models used in this work is presented in the next paragraphs.

2.2.1. Simple forecasting models

Many algorithms for forecasting time series are considered simple, yet they are usually very popular and can be surprisingly powerful. In the latest extensive M3 competition [33], an exponential smoothing approach was considered a good match for the most successful complex method while providing a better trade-off between prediction accuracy and computational complexity. The simple methods used for this experimental study
Table 1
Time series model selection—overview of the literature.

| Year | Study | Features | Meta-learning method | Series | Forecasting methods |
|------|-------|----------|----------------------|--------|---------------------|
| 1992 | Collopy and Armstrong [12] | 18 (judgemental) | Rule base (judgemental) | 126 (M1) | Exp. smoothing (Holt and Brown), random walk, linear regression |
| 1996 | Vokurka et al. [47] | 5 | Rule base (partly automatic) | 126 (M1) | Exp. smoothing (single and Gardner), structural, and a combination of the three |
| 1997 | Arinze et al. [4] | 6 | Rule base | 67 | Exp. smoothing (Holt and Winter), adaptive filtering, three "hybrids" of the previous |
| 1997 | Shah [41] | 26 | Discriminant analysis | 203 (M1) | Exp. smoothing (single and Holt–Winter), structural |
| 2000 | Adya et al. [2] | 26 (mainly automatic) | Rule base (judgemental) | 3003 (M3) | Exp. smoothing (Holt and Brown), random walk, linear regression |
| 2004 | Prudencio and Ludermir [39] | 6/7 | Decision tree/NOEMON | 99/645 (M3) | Exp. smoothing, neural network / random walk, Holt's smoothing, auto-regressive |
| 2009 | Wang et al. [48] | 9 | Decision tree | 315 | Random walk, smoothing, ARIMA, neural network |
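The weekday-based substitution of missing values in the NN5 data (Section 2.1) is simple to state in code. The sketch below is illustrative rather than the authors' implementation; the function name and the NaN-based representation of missing values are assumptions.

```python
import numpy as np

def impute_weekday_mean(series, period=7):
    """Replace NaNs by the mean of the same weekday in the previous
    and the following week (period = 7 for daily data), as done for
    the NN5 series in Section 2.1."""
    y = np.asarray(series, dtype=float).copy()
    for t in np.flatnonzero(np.isnan(y)):
        neighbours = [y[t - period] if t - period >= 0 else np.nan,
                      y[t + period] if t + period < len(y) else np.nan]
        neighbours = [v for v in neighbours if not np.isnan(v)]
        if neighbours:  # leave the gap if neither neighbour exists
            y[t] = np.mean(neighbours)
    return y
```

If one of the two weekday neighbours is itself missing, the sketch falls back to the remaining one; the paper does not specify this corner case.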
are listed below, where \hat{y}_{t+1} denotes the one-step-ahead prediction and y_i the observation of the time series at time i.

For the moving average, the arithmetic mean of the last k observations is calculated according to Eq. (1). An appropriate time window is found by grid-searching k-values from 1 to 20 and choosing the k with the lowest mean squared error on a validation set prior to the test set:

\hat{y}_{t+1} = \frac{1}{k} \sum_{i=t-k+1}^{t} y_i    (1)

Single exponential smoothing is the simplest representative of the smoothing methods; it is calculated by adjusting the previous forecast by the error it produced. The parameter \alpha controls the extent of the adjustment and is again determined by minimising errors on a validation set that was separated from the training set:

\hat{y}_{t+1} = \alpha y_t + (1-\alpha)\hat{y}_t    (2)

Taylor's exponential smoothing is a more recently introduced exponential smoothing algorithm with a trend dampened by a factor \phi, using a multiplicative approach and a growth rate r_t [43]. It is given by Eq. (3), where h is the number of periods ahead to be forecasted and l_t denotes the estimated level of the series at time t. The parameters \alpha and \beta are smoothing constants taking values between zero and one, which are again determined by grid search:

\hat{y}_{t+h} = l_t \, r_t^{\sum_{i=1}^{h} \phi^i}
l_t = \alpha y_t + (1-\alpha)\,(l_{t-1} r_{t-1}^{\phi})
r_t = \beta (l_t / l_{t-1}) + (1-\beta)\, r_{t-1}^{\phi}    (3)

Polynomial regression fits a polynomial to the time series by regressing time series indices against time series values. In this experiment, a suitable order of the polynomial between two and six is grid-searched and the resulting curve is extrapolated into the future; Eq. (4) shows the example of a regression of order three, where the \omega_i are parameters estimated using the training set and t is the current time index:

\hat{y}_t = \omega_0 + \omega_1 t + \omega_2 t^2 + \omega_3 t^3    (4)

The Theta-model was introduced in [5]. It decomposes a series into short- and long-term components by applying a coefficient \theta to the second-order differences of the time series, thus modifying its curvature. Here, it is employed using the formulas given in [23]. The general equation for a theta-curve is

\hat{y}_{t+1}(\theta) = \hat{a}_t + \hat{b}_t t + \theta y_t    (5)

The values \hat{a}_t and \hat{b}_t are constants determined according to [23], and t is again the time index. In the original setup in [5], two curves are used and the obtained forecasts are averaged. The first forecast is calculated using \theta = 0, which results in a linear regression problem where the linear part of formula (5) is extrapolated into the future. The second forecast, for \theta = 2, is calculated using the formula

\hat{y}_{t+1}(2) = \alpha \sum_{i=0}^{t-1} (1-\alpha)^i y_{t-i}(2) + (1-\alpha)^n y_1(2)    (6)

2.2.2. ARIMA models

ARIMA models are described by the notation ARIMA(p,d,q) and consist of the following three parts:

- AR(p) denotes the autoregressive part of order p. Autoregression defines a regression y_t = \omega_0 + \omega_1 y_{t-1} + \dots + \omega_p y_{t-p} + e_t, where the target variable depends on p time-lagged values of itself, weighted by the weights \omega_i.
- I(d) defines the degree of differencing involved. Differencing is a method of removing non-stationarity in a time series by calculating the change between consecutive observations. The first difference of a time series is thus given by y'_t = y_t - y_{t-1}.
- MA(q) indicates the order of the moving average part of the model, which is given as a moving average of the error series e. It can be described as a regression against the past error values of the series: y_t = \omega_0 + \omega_1 e_{t-1} + \dots + \omega_q e_{t-q} + e_t.

The identification of an appropriate ARIMA model for a specific time series is not straightforward [34] and usually involves expert knowledge and intervention. Two automatic approaches have been implemented for this study:

- The original time series as well as two series representing its first and second differences are submitted to the automatic ARMA selection process of a MATLAB toolbox published in [15], subsequently choosing the approach that produced the lowest MSE on the validation set. The maximum number of time lags used is an input parameter of the toolbox and has been set to two, which is sufficient in practice according to [34]. Furthermore, the same authors state that it is almost never necessary to generate more than second-order differences of a time series, because data usually only involve non-stationarity of the first or second level.
- An alternative automatic approach for ARMA modelling is included in the State-Space-Models Toolbox [38]. In this case, an appropriate order for p and q was chosen using Akaike's information criterion (AIC), while a suitable order of differencing was determined by the log-likelihood of a model given the corresponding differenced series.

2.2.3. Structural time series

The structural approach formulates a time series as a number of directly interpretable components such as trends or cycles. Structural models fit into the statistical framework of state space models, which allows the usage of well-established algorithms like the Kalman filter and Kalman smoother [22]. The State-Space-Models Toolbox [38] has been used for the implementation of a local level model with a dummy seasonal component of either twelve or seven, depending on the data set used. A basic local level model consists of a random walk with noise, as described by the following formulas:

y_t = \mu_t + \varepsilon_t, \quad \varepsilon_t \sim NID(0, \sigma_{\varepsilon}^2)    (7)
\mu_t = \mu_{t-1} + \eta_t, \quad \eta_t \sim NID(0, \sigma_{\eta}^2)    (8)

where \varepsilon_t and \eta_t are normally and independently distributed error terms. The variable \mu_t represents a stochastic trend component in the more general structural time series definition; for the random walk, it simply denotes the last time series observation.
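As a concrete illustration of the simple models of Section 2.2.1, the sketch below implements Eqs. (1) and (2) together with the validation-based parameter search described there. It is a minimal sketch under assumed interfaces, not the original MATLAB code; the 18-observation validation window mirrors the NN3 horizon but is only a default here.

```python
import numpy as np

def moving_average_forecast(y, k):
    """Eq. (1): one-step-ahead forecast as the mean of the last k observations."""
    return np.mean(y[-k:])

def ses_forecast(y, alpha):
    """Eq. (2): single exponential smoothing, run over the series,
    returning the one-step-ahead forecast."""
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level

def grid_search(y, forecaster, grid, n_val=18):
    """Pick the parameter with the lowest MSE on a validation set held
    out before the test period, as described in Section 2.2.1."""
    train, val = y[:-n_val], y[-n_val:]

    def val_mse(p):
        history, errors = list(train), []
        for target in val:  # roll one step at a time through the validation set
            errors.append((forecaster(np.array(history), p) - target) ** 2)
            history.append(target)
        return np.mean(errors)

    return min(grid, key=val_mse)

# Example: choose k in 1..20 for the moving average
# best_k = grid_search(series, moving_average_forecast, range(1, 21))
```

The same `grid_search` helper can be reused for the smoothing constant of Eq. (2) with a grid of alpha values between zero and one.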
2.2.4. Neural networks

Guidelines can be found in [50], which is somewhat outdated but still very relevant in terms of the guidelines given. According to these, a feed-forward neural network was implemented. It has one hidden layer with 12 neurons, and training is carried out with a backpropagation algorithm with momentum. The input variables were the latest observations up to a lag of seven or 12, to catch weekly or yearly seasonality depending on the data set used. Ten neural networks have been trained and their predictions averaged to obtain the final forecasts.

Furthermore, a recurrent neural network of the Elman type has been employed. There seem to be no general guidelines in the literature on the architecture of a recurrent network for time series forecasting, but one thing seems to be common agreement: they need more hidden nodes than the feedforward neural network, because the temporal relationship has to be modelled as well. In these experiments, the number of hidden nodes was set to 24 (double the number of hidden nodes of the feedforward network).

2.3. Forecast combinations

Combinations of forecasts are motivated by the fact that all models of a real-world data generation process are likely to be misspecified, and picking only one of the available models is risky, especially if the data and consequently the performance of the models change over time. Combining is a reliable method of decreasing model risk and aims at improving forecast accuracy by exploiting the different strengths of various models while also compensating for their weaknesses. The following methods have been implemented for the experiments here:

2.4. Results

The following tables present result averages of the individual and combined methods for the two data sets. The symmetric mean absolute percentage error (SMAPE) has been used for evaluating and comparing the methods. It is a relative error measure that enables comparisons between different series and also a comparison with the results of the forecasting competitions that provided the data sets. It is given by

SMAPE = \frac{100}{n} \sum_{t=1}^{n} \frac{|y_t - \hat{y}_t|}{(y_t + \hat{y}_t)/2}    (9)

where n denotes the forecasting horizon. The standard deviation for each method is given to provide a measure of variability across the different time series.

Concerning the individual methods in the NN3 competition in Table 2, the feedforward neural networks perform quite well, together with the direct moving average. For the NN5 competition, results in general are considerably worse, which can be attributed to the longer period to be forecasted, where errors can accumulate quite quickly. The local level model with seasonality is the clear-cut winner here, closely followed by the direct neural network. ARIMA models suffer from outliers, indicated by high standard deviation values, and can only be fitted well in some cases. The three-cluster pooling approach outperforms the best individual method on both data sets, as can be observed in Table 3, and this has also been shown in previous work [29]. Had this approach been submitted to the original NN3 competition, it would have ranked sixth out of 26 participants. The result for the NN5 competition looks less convincing: there it would have ranked 12th out of 20 participants. This shows that the longer series might require more complex algorithms due to their length and complexity.

Table 3
Result averages of the combination methods: SMAPE and standard deviation σ over all series.

| Method | NN3 SMAPE | NN3 σ | NN5 SMAPE | NN5 σ |
|--------|-----------|-------|-----------|-------|
| 1. Simple average | 17.5 | 13.8 | 32.2 | 7.2 |
| 2. Simple trimmed average | 17.4 | 13.6 | 31.8 | 7.1 |
| 3. Outperformance | 16.6 | 12.4 | 29.5 | 6.9 |
| 4. Variance-based | 18.1 | 13.2 | 26.5 | 6.8 |
| 5. Pooling (2) | 16.4 | 12.1 | 29.0 | 9.7 |
| 6. Pooling (3) | 16.8 | 12.3 | 25.7 | 10.5 |
| 7. Regression | 20.4 | 32.5 | 27.5 | 12.3 |

Fig. 1. Histogram showing the number of times a particular method performs best for the NN3 data; left: individual methods, right: combinations.

Fig. 2. Histogram showing the number of times a particular method performs best for the NN5 data; left: individual methods, right: combinations.

A look at the histograms of the best performing methods in Figs. 1 and 2 shows some interesting facts as well: while the best performing individual methods are well spread over the NN3 data set, it is almost only the structural model and the direct neural network that perform best for NN5. Simple average combinations consequently perform badly on the NN5 data set, as there are many predictors that comparatively do not perform well. The regression combination does not have an outstanding average performance, but it performs best for the largest number of individual series. This illustrates why it might be beneficial to identify conditions in which one or the other method is more likely to perform well, which will be investigated in the next sections.

3. Time series features

3.1. General statistics

The Durbin–Watson statistic of the regression residuals e_t (see Table 4) tests them for first-order autocorrelation with the formula

d = \frac{\sum_t (e_t - e_{t-1})^2}{\sum_t e_t^2}    (10)
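Eqs. (9) and (10) translate directly into code. The sketch below assumes plain numeric arrays and is not tied to the authors' MATLAB implementation.

```python
import numpy as np

def smape(actual, forecast):
    """Eq. (9): SMAPE = 100/n * sum_t |y_t - yhat_t| / ((y_t + yhat_t)/2)."""
    y, f = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * np.mean(np.abs(y - f) / ((y + f) / 2.0))

def durbin_watson(residuals):
    """Eq. (10): d = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2.
    Values near 2 suggest little first-order autocorrelation in the residuals."""
    e = np.asarray(residuals, float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```

With this definition, a forecast of 110 against an actual value of 90 yields a SMAPE of 20%.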
If the original and surrogate series are significantly different, the time series is considered to be nonlinear.

The largest Lyapunov exponent is a measure of the separation rate of state-space trajectories that were initially close to each other, and it quantifies chaos in a time series. The average of the Lyapunov exponents calculated using software provided in [45] was added to the feature set. The features described are summarised in Table 4.

Table 4
Feature pool—general statistics.

| Abbreviation | Description |
|--------------|-------------|
| std | Standard deviation of detrended series |
| skew | Skewness of series |
| kurt | Kurtosis of series |
| length | Length of series |
| trend | Standard deviation(series)/standard deviation(detrended series) |
| dw | Durbin–Watson statistic of regression residuals |
| turn | Turning points |
| step | Step changes |
| pred | Predictability measure |
| nonlin | Nonlinearity measure |
| lyap | Largest Lyapunov exponent |

3.2. Frequency domain

A number of features have been extracted from the fast Fourier transform of the detrended series, as listed in Table 5. The frequencies at which the three maximum values of the power spectrum occur are intended to give an indication of seasons and cycles; the maximum value of the power spectrum should give an indication of the general strength of the strongest seasonal or cyclic component. The number of peaks in the power spectrum that have a value of at least 60% of the maximum value quantifies how many strong recurring components the time series has.

Table 5
Feature pool—frequency domain.

| Abbreviation | Description |
|--------------|-------------|
| ff[1–3] | Power spectrum frequencies of the three biggest values |
| ff[4] | Power spectrum: maximal value |
| ff[5] | Number of peaks not lower than 60% of the maximum |

3.3. Autocorrelations

Autocorrelation and partial autocorrelation give indications of the stationarity and seasonality of a time series; both measures have been included for lags one and two. Furthermore, domain knowledge on seasonality is exploited by including the partial autocorrelation of lag 12 for the NN3 data set, which consists of monthly data, and the partial autocorrelation of lag 7 for the NN5 data, whose daily series exhibit weekly seasonality. The features are summarised in Table 6.

3.4. Diversity features

When dealing with combinations of forecasts, it is crucial to look at the characteristics of the available individual forecasts. It is desirable to have a diverse pool of individual predictors, ideally with the strengths of one forecast compensating for the weaknesses of another. Additionally, if there is one extremely superior forecast in the pool of available methods, it is unlikely that it will be outperformed by a combination with other forecasts. Diversity is a well-known concept in ensemble learning, a term normally used to describe strategies for training a number of models sharing the same functional approach. A few concepts have been successful in this area, for example bagging [8], boosting [17] or negative correlation learning [31]. However, since the predictors used here are also diverse in their functional approach and most of them cannot be "trained" in a machine-learning fashion, these concepts are not applicable.

The standard way to look at diversity for a number of methods is to examine correlation coefficients. The feature pool here includes the mean and standard deviation of the error correlation coefficients of the forecast pool. Other diversity measures have mainly been discussed in the context of classification tasks, for example in [27,18]. One of the few publications dealing with diversity in a regression context is [9], where an error function e for training regression ensembles is introduced, following the formula

e = \frac{1}{M} \sum_i (\hat{y}_i - y)^2 - k \, \frac{1}{M} \sum_i (\hat{y}_i - \bar{\hat{y}})^2    (11)

where \hat{y}_i denotes the prediction of the i-th of M models, y the actual observation and \bar{\hat{y}} the mean of the outputs of all ensemble members. It has been shown that the first term of this equation contains the bias and variance error terms, while the second term additionally includes the covariance of the ensemble members. Hence, the parameter k controls the extent of the covariance impact on the error. To exploit these findings in the form of time series features, two values have been added to the feature set: the proposed error measure in its original form, and the quotient of the mean error (the first term of the equation) and the variability between the members in the method pool (the second term). In this way, the trade-off between individual accuracy and diversity can be measured.

The clustering combination method inspires a different approach to quantifying diversity. A k-means clustering algorithm is used to group the individual forecasts into three groups. The number of methods in the top performing cluster is then taken as a feature that identifies whether there are few or many equally well performing methods, or even just a single one. Additionally, the distance of the mean of the top performing cluster to the mean of the second best is added, to put the two performances into relation. For an overview of all diversity features, see Table 7.

Table 7
Feature pool—diversity.
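Two of the diversity features just described can be sketched as follows, assuming forecast errors and point forecasts are available as plain arrays; the interface and the value of k are free choices of this sketch, not the authors' settings.

```python
import numpy as np

def correlation_diversity(errors):
    """Mean and standard deviation of the pairwise error correlation
    coefficients of the forecast pool (errors: M x n array, one row of
    forecast errors per method)."""
    corr = np.corrcoef(np.asarray(errors, float))
    upper = corr[np.triu_indices_from(corr, k=1)]  # pairwise coefficients only
    return upper.mean(), upper.std()

def diversity_error(forecasts, actual, k=0.5):
    """Eq. (11): e = (1/M) sum_i (yhat_i - y)^2 - k (1/M) sum_i (yhat_i - ybar)^2.
    Returns the error in its original form and the accuracy/diversity
    quotient (first term divided by second term) used as the second feature."""
    f = np.asarray(forecasts, float)       # one forecast per model
    accuracy = np.mean((f - actual) ** 2)  # bias + variance term
    spread = np.mean((f - f.mean()) ** 2)  # variability around the ensemble mean
    return accuracy - k * spread, accuracy / spread
```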
4. Meta-learning

According to [46], the goal of meta-learning is to "…understand how learning itself can become flexible according to the domain or task under study." There have been different, more detailed interpretations of the term in the literature; in this work, meta-learning refers to the process of linking knowledge on the performance of so-called base-learners to the characteristics of the problem [40]. One difference to the general perception of meta-learning is that the base-learners used here are not necessarily machine learning algorithms, but include other approaches as well. Many meta-learning approaches are available, as reviewed in [46]; this section presents three different experiments: in the first step, decision trees using the meta-features described in the previous section have been built as a machine learning method giving readable results. The second experiment compares a number of approaches and evaluates possible performance improvements. In the last part of this section, the performance of one approach is tested under competition conditions for the NN5 data set.

4.1. Experiment one—decision trees for data exploration

Decision trees were built using the Matlab statistics toolbox, choosing the minimum-cost tree after a 10-fold cross-validation. Features were determined on the whole time series, as the nature of this work is exploratory. Using all available methods as class labels did not yield interpretable results, which is why, concentrating on the best performing approaches in the underlying experiments, the classification problem has been reduced to three different questions:

- When does an individual method perform better, when a combination, and when does it not matter? (three class labels)
- When should a structural model, a direct neural network, a pooling combination or a regression combination be used? (four class labels)
- How to decide between pooling and regression for the combination models? (two class labels)

Results given in this section do not claim to be universally applicable; they merely provide an insight into the existence of rules for the specific data set used. However, if the meta-features describe the series well and a series with similar characteristics can be found, it is probable that the guidelines are generalisable, but there is, of course, no guarantee.

In the figures given with the following results, the leaf to the left of a node represents the data that fulfils its condition; the leaf to the right-hand side represents the data that does not. The numbers following the methods in the leafs denote the number of times this particular method performed best on the data subset. The first classification problem concerned the use of combinations and individual methods; the generated tree can be seen in Fig. 3.

The first node of the tree divides the series into more and less chaotic series, indicated by the Lyapunov exponent. The majority of the less chaotic series to the left of the tree are best treated with an individual model; combination models are only suggested for series where the first and second best performing clusters do not have a huge performance difference. Looking at the right part of the tree, three more diversity measures appear in the nodes, together with the kurtosis and the turning point measure. Results include, amongst others, the suggestion that individual models should not be used for series with a high kurtosis; for these, a combination method is likely to perform well if the number of well performing methods is high or the correlation coefficients are similar, otherwise they tend to perform just as well as an individual model. The whole tree misclassifies 53 of the 222 data instances, which equals a rate of 23%, as opposed to a misclassification rate of 49% when picking the class with the most training examples (individual method).

The same experiment was run using only features from the diversity feature set, to see how these in particular affect combination performance. Additionally, the class label "undecided" was replaced with combinations: as they have the reputation of being less risky, it seems like a straightforward approach to pick them whenever their performance is equal to or better than that of the individual methods.

In the top node of the tree given in Fig. 4, instances with a smaller variability in correlation coefficients (diversification measure four) are sent to the left side of the tree, where combination methods work best. This is counter-intuitive, as a bigger diversity in individual forecasts should favour combinations; it does, however, illustrate the fact that diversity has to be traded off with individual accuracy and is not beneficial for combinations
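The tree-building setup of Section 4.1 used the MATLAB statistics toolbox; a rough scikit-learn analogue, selecting the cost-complexity pruning strength by cross-validation in place of MATLAB's minimum-cost tree, might look as follows. The function name, the label encoding and the specific pruning mechanism are assumptions of this sketch, not the authors' exact configuration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def fit_meta_tree(X, y, n_folds=10):
    """Fit a decision tree on meta-features X with class labels y
    (e.g. 'individual', 'combination', 'undecided'), choosing the
    cost-complexity pruning strength by cross-validated accuracy --
    an analogue of picking the minimum-cost tree in MATLAB."""
    base = DecisionTreeClassifier(random_state=0)
    # candidate pruning strengths from the cost-complexity pruning path
    ccp_alphas = base.cost_complexity_pruning_path(X, y).ccp_alphas
    search = GridSearchCV(base, {"ccp_alpha": np.unique(ccp_alphas)}, cv=n_folds)
    search.fit(X, y)
    return search.best_estimator_
```

The returned tree can then be inspected (e.g. with `sklearn.tree.export_text`) to obtain readable rules of the kind discussed above.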
Fig. 7. Average performance in relation to the number of clusters.

5. Conclusions

This work investigated meta-learning for time series prediction with the aim of linking problem-specific knowledge to well performing forecasting methods and applying them in similar situations. Initially, an extensive pool of features for describing the nature of time series was identified. Along with features that have been used in previous publications, several new ones have been added, for example a measure for nonlinearity and predictability and characteristics of the frequency spectrum. Furthermore, measures have not only been calculated for the time series themselves, but also for describing the behaviour of the pool of available individual forecasting methods. In that way, the following characteristics could be quantified: diversity in the method pool, the trade-off between individual accuracy and diversity, the size of the group of best performing individual methods and the distance to the group of second best performing methods.

Decision trees have been built to gain knowledge on which of the chosen features are important for method selection. Some time
series characteristics that led to good results in previous work did not seem to be significant for the time series and methods used here. However, the newly introduced diversity measures gave some interesting insights, quantifying some intuitive perceptions of the mechanisms that make a combination more successful. One of the lessons that has been illustrated quite well is that diversity alone is not the key to a successful combination of methods; individual accuracy matters as well. Other easily interpretable results for the given data set include that individual methods in general work better on less chaotic time series, and that the pooling approach in particular works well if there are many well performing methods with a good ratio of accuracy and diversity in the method pool. It is not claimed that all of these guidelines are universally applicable to all data sets; however, it was shown that they do exist and can be successfully exploited for meta-learning, as illustrated in the other empirical experiments.

Furthermore, neural networks, decision trees and support vector machines have been implemented, building meta-models in a leave-one-out cross-validation methodology, which did not lead to convincing results. A ranking approach to determine combination weights was, however, able to clearly improve upon the performance of the best individual predictors, showing that a combination strategy can outperform a model selection one even on the meta-level. The last experiment was designed in such a way that the result could have taken part in the NN5 competition, and it again clearly outperformed all individual predictors.

An interesting approach for a deeper understanding of the connection between the nature of a time series and the mechanisms that work best for forecasting could be the clustering of series in a self-organising map, as pursued in [48]. Future work will also be concerned with extending the data set used and with identifying and implementing further ranking algorithms, as they have proven to

References

[17] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119–139.
[18] B. Gabrys, D. Ruta, Classifier selection for majority voting, Information Fusion 6 (1) (2005) 63–81.
[19] E.S. Gardner, Exponential smoothing: the state of the art, Journal of Forecasting 4 (1) (1985) 1–28.
[20] E.S. Gardner, Exponential smoothing: the state of the art—part II, International Journal of Forecasting 22 (4) (2006) 637–666.
[21] T. Gautama, D. Mandic, M. Van Hulle, A novel method for determining the nature of time series, IEEE Transactions on Biomedical Engineering 51 (2004) 728–736.
[22] A. Harvey, Forecasting with unobserved components time series models, in: G. Elliott, C. Granger, A. Timmermann (Eds.), Handbook of Economic Forecasting, Elsevier, North-Holland, Amsterdam, 2006, pp. 327–408.
[23] R. Hyndman, B. Billah, Unmasking the theta method, International Journal of Forecasting 19 (2) (2003) 287–290.
[24] V.R.R. Jose, R.L. Winkler, Simple robust averages of forecasts: some empirical results, International Journal of Forecasting 24 (1) (2008) 163–169.
[25] A. Kalousis, T. Theoharis, Noemon: design, implementation and performance results of an intelligent assistant for classifier selection, Intelligent Data Analysis 5 (3) (1999) 319–337.
[26] T. Koskela, M. Lehtokangas, J. Saarinen, K. Kaski, Time series prediction with multilayer perceptron, FIR and Elman neural networks, in: Proceedings of the World Congress on Neural Networks, San Diego, California, 1996, pp. 491–496.
[27] L.I. Kuncheva, C.J. Whitaker, Measures of diversity in classifier ensembles, Machine Learning 51 (2) (2003) 181–207.
[28] C. Lemke, B. Gabrys, On the benefit of using time series features for choosing a forecasting method, in: Proceedings of the European Symposium on Time Series Prediction, Porvoo, Finland, 2008, pp. 1–10.
[29] C. Lemke, S. Riedel, B. Gabrys, Dynamic combination of forecasts generated by diversification procedures applied to forecasting of airline cancellations, in: Proceedings of the IEEE Symposium Series on Computational Intelligence, Nashville, USA, 2009, pp. 85–91.
[30] A. Lendasse (Ed.), ESTSP 2008: Proceedings, Multiprint Oy/Otamedia, 2008.
[31] Y. Liu, X. Yao, Ensemble learning via negative correlation, Neural Networks 12 (10) (1999) 1399–1404.
[32] P. Maforte dos Santos, T. Ludermir, R. Cavalcante, Selection of time series forecasting models based on performance information, in: Proceedings of the Fourth International Conference on Hybrid Intelligent Systems, Kitakyushu, Japan, 2004, pp. 366–371.
be a very promising strategy in this work. [33] S. Makridakis, M. Hibon, The M3-competition: results, conclusions and
implications, International Journal of Forecasting 16 (4) (2000) 451–476.
[34] S. Makridakis, S. Wheelwright, R. Hyndman, Forecasting: Methods and
Applications, third ed., John Wiley, New York, 1998.
References
[35] P. Newbold, C. Granger, Experience with forecasting univariate time series
and the combination of forecasts, Journal of the Royal Statistical Society.
[1] M. Adya, J. Armstrong, F. Collopy, M. Kennedy, An application of rule-based Series A (General) 137 (2) (1974) 131–165.
forecasting to a situation lacking domain knowledge, International Journal of [36] K. Ord, Commentaries on the M3-competition, International Journal of
Forecasting 16 (4) (2000) 477–484. Forecasting 17 (2001) 537–584.
[2] M. Adya, F. Collopy, J. Armstrong, M. Kennedy, Automatic identification of [37] C. Pegels, Exponential forecasting: some new variations, Management
time series features for rule-based forecasting, International Journal of Science 15 (5) (1969) 311–315.
Forecasting 17 (2) (2001) 143–157. [38] J.-Y. Peng, J.A.D. Aston, The ssm toolbox for matlab, Technical Report, Institute
[3] M. Aiolfi, A. Timmermann, Persistence in forecasting performance and of Statistical Science, Academia Sinica /http://www.stat.sinica.edu.tw/
conditional combination strategies, Journal of Econometrics 127 (1–2) jaston/software.htmlS, 2007.
(2006) 31–53. [39] R. Prudencio, T. Ludermir, Using machine learning techniques to combine
[4] B. Arinze, S.-L. Kim, M. Anandarajan, Combining and selecting forecasting forecasting methods, in: Proceedings of the 17th Australian Joint Conference
models using rule based induction, Computers & Operations Research 24 (5) on Artificial Intelligence in Cairns, Australia, 2004, pp. 1122–1127.
(1997) 423–433. [40] R.B. Prudencio, T.B. Ludermir, Meta-learning approaches to selecting time
[5] V. Assimakopoulos, K. Nikolopoulos, The theta model: a decomposition series models, Neurocomputing 61 (2004) 121–137.
approach to forecasting, International Journal of Forecasting 16 (4) (2000) [41] C. Shah, Model selection in univariate time series forecasting using
521–530. discriminant analysis, International Journal of Forecasting 13 (4) (1997)
[6] G. Box, G. Jenkins, Time Series Analysis, Holden-Day, San Francisco, 1970. 489–500.
[7] P. Brazdil, C. Soares, J. Pinto de Costa, Ranking learning algorithms: using IBL [42] J. Stock, M. Watson, A comparison of linear and nonlinear univariate models
and meta-learning on accuracy and time results, Machine Learning 50 (3) for forecasting macroeconomic time series, in: R. Engle, H. White (Eds.),
(2003) 251–277. Cointegration, Causality and Forecasting. A Festschrift in honour of Clive W.J.
[8] L. Breiman, Bagging predictors, Machine learning 24 (2) (1996) 123–140. Granger, Oxford University Press, Oxford, 2001, pp. 1–44.
[9] G. Brown, J. Wyatt, P. Tino, Managing diversity in regression ensembles, The [43] J.W. Taylor, Exponential smoothing with a damped multiplicative trend,
Journal of Machine Learning Research 6 (2005) 1621–1650. International Journal of Forecasting 19 (4) (2003) 715–725.
[10] D. Bunn, A Bayesian approach to the linear combination of forecasts, [44] A. Timmermann, Forecast combinations, in: G. Elliott, C. Granger, A.
Operational Research Quarterly 26 (2) (1975) 325–329. Timmermann (Eds.), Handbook of Economic Forecasting, Elsevier, 2006, pp.
[11] C. Chatfield, The Analysis of Time Series, sixth ed., Chapman & Hall, London, 135–196.
2003. [45] University of Goettingen, 2009. TSTOOL software package for nonlinear time
[12] F. Collopy, S.J. Armstrong, Rule-based forecasting: development and series analysis [online]. Available online: /http://www.dpi.physik.uni-goet
validation of an expert systems approach to combining time series tingen.de/tstool/S [03/06/2009].
extrapolations, Management Science 38 (10) (1992) 1394–1414. [46] R. Vilalta, Y. Drissi, A perspective view and survey of meta-learning, Artificial
[13] S. Crone, 2006/2007, NN3 Forecasting Competition [Online]. Available online: Intelligence Review 18 (2002) 77–95.
/http://www.neural-forecasting-competition.com/NN3/S [02/06/2009]. [47] R. Vokurka, B. Flores, S. Pearce, Automatic feature identification and graphical
[14] S. Crone, 2008, NN5 Forecasting Competition [Online]. Available online: support in rule-based forecasting: a comparison, International Journal of
/http://www.neural-forecasting-competition.com/NN5/S [02/06/2009]. Forecasting 12 (4) (1996) 495–512.
[15] Delft Center for Systems and Control, 2007. Matlab toolbox ARMASA [Online] [48] X. Wang, K. Smith-Miles, R. Hyndman, Rule induction for forecasting method
/http://www.dcsc.tudelft.nl/Research/SoftwareS [13/06/2007]. selection: meta-learning the characteristics of univariate time series,
[16] J. Faraway, C. Chatfield, Time series forecasting with neural networks: a Neurocomputing 72 (2009) 2581–2594.
comparative study using the air line data, Applied Statistics 47 (2) (1998) [49] D. Wolpert, The lack of a priori distinctions between learning algorithms,
231–250. Neural Computation 8 (7) (1996) 1341–1390.
[50] G. Zhang, B. Patuwo, M. Hu, Forecasting with artificial neural networks: the state of the art, International Journal of Forecasting 14 (1) (1998) 35–62.

Christiane Lemke received a graduate degree in Computer Science (Specialization: Intelligent Systems) from Brandenburg University of Applied Sciences in Germany in 2004, where she continued working in the area of robotics and image recognition for two years. Since 2006, she has been a Ph.D. student at Bournemouth University, UK. Her main research interests include the combination of time series forecasts and meta-learning.

Bogdan Gabrys received an M.Sc. degree in Electronics and Telecommunication (Specialization: Computer Control Systems) from the Silesian Technical University, Poland, in 1994 and a Ph.D. in Computer Science from Nottingham Trent University, UK, in 1998. After many years of working at different universities, Prof. Gabrys moved to Bournemouth University in January 2003, where he acts as Director of the Smart Technology Research Centre within the School of Design, Engineering and Computing. His current research interests include a wide range of machine learning and hybrid intelligent techniques encompassing data and information fusion, multiple classifier and prediction systems, processing and modelling of uncertainty in pattern recognition, diagnostic analysis and decision support systems.