
Information Sciences 540 (2020) 160–174


Big data time series forecasting based on pattern sequence similarity and its application to the electricity demand

R. Pérez-Chacón, G. Asencio-Cortés, F. Martínez-Álvarez, A. Troncoso
Data Science and Big Data Lab, Pablo de Olavide University, ES-41013 Seville, Spain

Article history: Received 4 February 2020; Received in revised form 4 May 2020; Accepted 6 June 2020; Available online 18 June 2020

Keywords: Big data; Time series; Forecasting; Electricity

Abstract: This work proposes a novel algorithm to forecast big data time series. Based on the well-established Pattern Sequence-based Forecasting algorithm, this new approach makes two major contributions to the literature. First, it improves the accuracy of the original algorithm's predictions and, second, it transfers the algorithm to the big data context, reaching meaningful results in terms of scalability. The algorithm uses the Apache Spark distributed computation framework and is a ready-to-use application with few parameters to adjust. Physical and cloud clusters have been used to carry out the experimentation, which consisted in applying the algorithm to real-world data from the Uruguayan electricity demand.

© 2020 Elsevier Inc. All rights reserved.

1. Introduction

The amount of available time series data has increased considerably in the last decades, given the recent interest in storing and analyzing huge amounts of data [6], as is the case with datasets extracted from smart meters over decades, from hundreds of buildings, or with a very high measurement frequency [24,25].
Time series forecasting algorithms can create models based on historical data and make predictions for given target variables of interest [44,43]. The computation time of these algorithms may increase notably when big data time series are evaluated. Therefore, single-core machine environments are no longer enough and need to be complemented with additional computational resources. For this reason, it becomes necessary to distribute the data and its computation across multiple nodes using a cluster of machines.
The Pattern Sequence-based Forecasting (PSF) algorithm [21] is an effective general-purpose multi-output approach particularly designed to deal with time series for prediction horizons of arbitrary length. This multi-output feature has turned PSF into a flexible tool that has been used in different fields of research. PSF is mainly based on the identification of certain patterns that are searched for throughout the whole dataset. Such patterns are calculated by means of any clustering technique (k-means in the original work) and, later, sequences of clusters are formed to characterize the target time series.
In this work, a new algorithm, hereinafter called bigPSF, is proposed. It is inspired by the pattern search strategy introduced in the original PSF algorithm, but it makes two major contributions to the literature. First, it is scalable, thanks to its distributed computation under the Apache Spark framework; that is, bigPSF is suitable for handling big data time series and mining millions of records with much shorter execution times than PSF. Second, some modifications have been made in bigPSF with respect to the original PSF, improving the prediction results and achieving higher accuracy. Hence, the algorithm mainly covers the volume and velocity dimensions of the well-established 4-Vs big data paradigm [6].

E-mail addresses: rpercha@upo.es (R. Pérez-Chacón), guaasecor@upo.es (G. Asencio-Cortés), fmaralv@upo.es (F. Martínez-Álvarez), atrolor@upo.es (A. Troncoso).

https://doi.org/10.1016/j.ins.2020.06.014
0020-0255/© 2020 Elsevier Inc. All rights reserved.
Although bigPSF is also a general-purpose algorithm, data related to electricity demand have been used to assess its performance. A case study of the original methodology, along with its latest methodological improvement for predicting electricity demand data, can be found in Ref. [19]. Compared to PSF and five other well-known prediction algorithms, bigPSF achieved higher accuracy than each of them and was able to deal with big data, exhibiting a linear behavior in terms of computation time.
The rest of the paper is structured as follows. Section 2 reviews the most relevant works related to PSF and to big data time series forecasting. Section 3 describes the proposed methodology. Section 4 reports all results and discusses the performance in terms of both errors and scalability. Finally, Section 5 summarizes the most significant achievements of the manuscript.

2. Related works

This section provides an overview of the most relevant related works. In particular, it is structured in two parts. First, works related to the PSF algorithm are reviewed and summarized, highlighting their main contributions to the literature. Second, works related to big data time series forecasting are reported and discussed.
The PSF algorithm was first published in 2011 [21]. It was developed to deal with time series and proposed, for the first time, the use of clustering methods to transform a time series into a sequence of discrete values. Recently, a robust implementation in R was also published [3]. PSF has also been successfully applied in the context of wind speed [5], solar power [45], and water forecasting [11]. Since its first publication, several authors have proposed modifications for its improvement. In Ref. [20], the authors modified PSF to forecast outliers in time series. Later, Fujimoto et al. [7] chose a different clustering method in order to find clusters with different shapes. In 2013, Koprinska et al. [14] combined PSF with artificial neural networks. Five variants of PSF were jointly used in Ref. [29]. Another relevant modification can be found in Ref. [13], in which the authors suggested calculating the prediction using weighted centroids and argued against using a voting system to find the optimal number of clusters given its expensive computational cost. PSF has also been adapted to impute missing values in time series [4]. In 2019, the PSF algorithm was adapted to deal with functional time series [19]. The authors combined a clustering algorithm for functional data, funHDDC [12], with PSF and assessed its performance on this kind of data.
The field of big data time series has evolved dramatically in recent years. Since one of the earliest works by Rakthanmanon et al. [26], in which the authors mined trillions of time series subsequences, many authors have proposed new strategies to deal with big data time series. In Ref. [31], a study was carried out on the scalable prediction of energy consumption by incremental time series clustering. In 2015, a new forecasting model was proposed in Ref. [32]. The approach consisted in combining two well-known soft computing algorithms and was applied to India stock index price data, obtaining a remarkable performance. An interesting study on the forecasting and understanding of time series related to the mobile traffic of large-scale cellular networks can be found in Ref. [46]. In Ref. [27], a scalable method was implemented for demand forecasting in the context of an e-commerce platform using distributed computing. The authors of Ref. [33] proposed an analysis of two big datasets containing time series of energy consumption of real houses in the UK and Canada. They applied clustering techniques and used a Bayesian network to predict energy use. Algorithms from Apache Spark's machine learning library (MLlib) have allowed the scientific community to develop new models based on them [38]. For instance, a linear regression model was developed to forecast big data time series for prediction horizons of arbitrary length [9]. Later, an ensemble approach of several linear models was proposed in Ref. [8]. Both static and dynamic ensembles were tested, achieving very promising results in terms of accuracy. In Ref. [30], an effective prediction of missing data in multivariate time series is made. Krome and Sander [15] proposed a method to predict the German day-ahead spot market using Apache Spark, by computing several datasets in parallel. The MapReduce-based Forecasting algorithm was introduced in Ref. [34]; in it, the authors reduced the time complexity of the weighted moving average algorithm to $O(N^2)$. Results on both real-world and synthetic datasets showed high scalability. An adaptation of one of the most used forecasting algorithms, the k weighted nearest neighbors [42], has also been published recently [35]. The authors proposed a distributed computation scheme to compute the neighbors. Experiments using Spanish electricity demand datasets showed the best results compared to other algorithms. Later, a multivariate version was also proposed [36], in which it was shown that the use of exogenous time series with a certain degree of correlation
with the target one can enhance the forecasting results. Deep learning approaches are currently being applied to the big data time series context. Hence, a scalable approach based on a feed-forward architecture was used to forecast electricity data, reporting competitive errors [39]. An early proposal for that work can also be found in Ref. [41]. Another interesting application to solar power can be found in Ref. [40], where Australian data were used to assess the performance, reaching high accuracy. In 2020, studies were carried out on the prediction of aggregated solar energy in Spanish solar plants [28] and on the prediction of wind speed time series for different cities in the USA [23].
In view of the current state of the art, it can be concluded that there is a gap in the literature and that the proposal of this new algorithm, bigPSF, is justified.

3. Methodology

This section describes the methodology proposed to predict big data time series of electricity consumption. The novel and distributed bigPSF algorithm, based on the existing PSF algorithm, is introduced. bigPSF is a forecasting algorithm able to handle big data time series in a scalable way. In addition to the scalability property, some modifications have been proposed in order to enhance the prediction results of the original PSF algorithm.
Section 3.1 describes PSF, detailing how its main parameters, such as the optimal number of clusters and the window size, are determined, as well as its fundamentals for making predictions. Next, bigPSF is introduced in Section 3.2.

3.1. The original PSF algorithm

PSF was designed to predict time series using small and medium-sized historical datasets, and it can be run on a single machine, e.g. a computer with a standard CPU. Fig. 1 illustrates the steps involved in the PSF algorithm, which are listed below:

1. The first step consists in detecting and treating possible outliers in the original dataset and performing a normalization to deal with values in very different ranges, typically found in some time series. One of the most used forms is unit-based normalization, transforming the original values into an interval ranging between 0 and 1.
2. PSF applies a clustering algorithm to transform the original time series values into a sequence of labels, each of them identifying as many samples as determined by the user (e.g. a whole day, week, and so forth). The optimal number of clusters (K) must be determined so that data can be correctly labeled. For this purpose, several cluster validity indices (CVIs) [18] could be applied, as done in the original PSF proposal. Nevertheless, some of these indices are not scalable and pose important computational challenges in the big data context. For this reason, values of K ranging from 2 to 15 are evaluated.
3. Once K is obtained, the clustering method (usually KMeans) is applied to the full dataset in order to label the historical data, turning the time series into a sequence of labels corresponding to the clusters they have been assigned to.
4. Then, the value of the second input parameter is calculated: the optimal window size (W). This parameter determines the length of the window extracted just before the target value and contains the sequence of W labels generated by the clustering process. A grid search varying K and W simultaneously is performed during the training phase to determine their optimal values.
5. With the optimal W length, a sequential search is performed on the entire dataset, searching for coincidences (or matches) between the pattern sequence, $S_W$, and the historical data. Once the matches are identified, the h values immediately following all the coincidences are extracted, where h is the prediction horizon of the forecasting problem, i.e. the number of samples predicted at every iteration by bigPSF. It is worth mentioning that bigPSF is a multi-step forecasting algorithm and more than one sample can be simultaneously predicted.
6. The next step consists in computing the prediction using the h values found immediately after $S_W$ in the historical data. The original PSF makes predictions by simply averaging all these values. That is, values for $t+1, t+2, \ldots, t+h$ are simultaneously predicted from instant $t$. The prediction of every $t+i$ (with $i = 1, \ldots, h$) is made by averaging all $t+i$ values following the occurrences of $S_W$ found in the historical data (a simplified sketch of these two steps follows the list).
7. Finally, outputs are denormalized by applying the reverse operation of the first step. If there are more values to predict,
the forecasted values are appended to the dataset and steps 5 and 6 are repeated until all forecasts have been completed.
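To make steps 5 and 6 concrete, the following is a minimal single-machine sketch of the matching-and-averaging core, not the authors' implementation; the names `labels` and `series`, and the recursive window reduction, are assumptions made for illustration.

```python
import numpy as np

def psf_forecast(labels, series, W, h):
    """Steps 5-6 of PSF: locate every occurrence of the last W labels in the
    historical label sequence, then average the h values that follow each
    occurrence. `labels` holds one cluster label per h-sized block and
    `series` holds the normalized raw values, h of them per label."""
    pattern = labels[-W:]  # pattern sequence S_W preceding the target
    matches = [i for i in range(len(labels) - W)
               if labels[i:i + W] == pattern]
    if not matches and W > 1:  # no coincidence: retry with a shorter window
        return psf_forecast(labels, series, W - 1, h)
    # h real values immediately following each matched sequence
    followers = [series[(i + W) * h:(i + W + 1) * h] for i in matches]
    return np.mean(followers, axis=0)  # original PSF: plain average
```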

Fig. 1. Flowchart of the original methodology.



3.2. bigPSF: A forecasting algorithm for big data time series

This section describes the bigPSF algorithm. Section 3.2.1 details the pre-processing applied to the data. Next, a new proposal to determine the optimal values of the input parameters using a grid search is described in Section 3.2.2. Section 3.2.3 describes the distributed and parallel KMeans++ clustering algorithm, included in Apache Spark's machine learning library (MLlib) [22]. Then, the different distributed steps of bigPSF are presented in Section 3.2.4: given the optimal values of the input parameters, the formation of the pattern matrix is reported; finally, the filtering of matching patterns, the weighting of neighbors, and the calculation of forecasts are explained.
A flowchart depicting each step of the bigPSF algorithm can be found in Fig. 2. The next sections detail each of these steps.

3.2.1. Data pre-processing


In the first step, the cleaning of outliers and the transformations of the initial dataset are carried out so that it can be processed by the clustering algorithm in MLlib. Data are stored in Resilient Distributed Dataset (RDD) variables, which Spark uses to distribute data across multiple machines. These variables allow operations on large amounts of data distributed in a cluster of machines, so that they can be processed in a parallel, fast and fault-tolerant mode.
First, the big data time series is stored in an RDD. Once this RDD is created, it is necessary to transform it by making groups of h values along with an index variable, where h is the number of future values to be predicted. Fig. 3 shows an example of this transformation for a prediction horizon of 24 (h = 24).
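A minimal PySpark sketch of this grouping is given below, assuming a plain-text input with one normalized reading per line in temporal order; the application name and file path are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigPSF-preprocessing").getOrCreate()
sc = spark.sparkContext

h = 24  # prediction horizon: number of future values per group

# Hypothetical input: one normalized consumption value per line, in time order.
raw = sc.textFile("hdfs:///data/demand.txt").map(float)

# Pair every value with its position, then group consecutive blocks of h
# values under a common index, as illustrated in Fig. 3.
blocks = (raw.zipWithIndex()
             .map(lambda vi: (vi[1] // h, (vi[1] % h, vi[0])))
             .groupByKey()
             .mapValues(lambda pairs: [v for _, v in sorted(pairs)]))
```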

3.2.2. Hyperparameter tuning: Optimal K and W values


As has been studied in the literature [17], the quadratic cost of the CVI functions means that new validation alternatives have to be sought in big data environments. In this work, a scalable grid search is proposed for the estimation of the input parameters of the bigPSF algorithm. This grid search is feasible due to the integer nature of the input parameters to be optimized.
Given the speed and high computing performance offered by Apache Spark's technology for big datasets, the distribution of the full dataset over a cluster of computers makes it possible to perform an exhaustive search for the optimal values of K and W. This study examines, by means of a complete grid of values, which combination of the number of clusters K and the window size W reports the best predictions in terms of accuracy on a validation dataset.
For this purpose, first, the complete dataset is divided into two RDDs corresponding to the training set (TRS) and the test
set (TES), in a 70%-30% proportion. Therefore, these two variables will also be distributed across the different nodes in the
cluster to be used. In turn, the TRS is divided into a training subset (STRS) and a test subset (STES), in a 70%-30% proportion,
to apply the grid of K and W values mentioned above. As it can be seen in Fig. 3, there are four RDD variables distributed in
the different nodes of the cluster: STRS and STES for the grid search of the optimal W and K values, and TRS and TES for the
calculation of the final forecasts.
The grid search consists in applying the proposed bigPSF algorithm for all possible pairs of values K and W as input parameters. In this way, once all the bigPSF steps are completed, the combination of K and W with the smallest error when predicting the STES subset using the STRS for training will be used in the final prediction phase.
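In outline, the grid search might look as follows; `run_bigpsf_mape` is a hypothetical helper standing for one complete bigPSF run (clustering, matching and forecasting) evaluated on the STES validation subset.

```python
import itertools

def grid_search(strs_rdd, stes_rdd):
    """Exhaustive search over all 14 x 10 = 140 (K, W) pairs, keeping the
    combination with the lowest MAPE on the STES validation subset."""
    best_mape, best_k, best_w = float("inf"), None, None
    for K, W in itertools.product(range(2, 16), range(1, 11)):
        mape = run_bigpsf_mape(strs_rdd, stes_rdd, K, W)  # hypothetical helper
        if mape < best_mape:
            best_mape, best_k, best_w = mape, K, W
    return best_k, best_w
```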

Fig. 2. A general scheme of the bigPSF algorithm, including all the necessary steps to make predictions.

Fig. 3. Illustrative representation of how the full dataset is divided into the TRS (70%) and TES (30%) sets. TRS is subsequently divided in order to obtain the model using, again, 70% for training (STRS) and 30% for validation (STES).

3.2.3. Clustering: parallel KMeans++ algorithm


One of the most widely used clustering algorithms is KMeans. This method selects a set of initial points randomly chosen from the dataset. Next, the algorithm processes all the points in the dataset and assigns each one to its closest center (initialization step), and then recalculates the centers by averaging all the points in each cluster (distance computation and centroid update step). These steps are executed iteratively until the algorithm converges, i.e. until no new assignments are made or until a preset number of iterations is reached.
Given its parallel properties, the KMeans variant called KMeans++ has been used in this work. This version specifies a procedure to initialize the cluster centers before proceeding with the standard KMeans optimization iterations: a first centroid is randomly selected and the next ones are chosen according to a specific probability [1]. Although this introduces an overhead in the initialization of the algorithm, it reduces the probability that a bad initialization leads to a bad clustering result. In addition, the total number of iterations needed for the algorithm to converge is reduced, which favors its use in big data environments.
A parallelized variant of the KMeans++ clustering [2] is implemented for distributed computing environments in Apache Spark's MLlib library, and it is the one chosen for the clustering step of the bigPSF algorithm.
As can be seen in Fig. 4, the input TRS set is distributed over the cluster as an RDD variable. Later, the K centroids are initialized using the parallelized KMeans++ method, where K is the optimal value obtained in the hyperparameter tuning described in Section 3.2.2. The final centroids will be those that minimize the Within Set Sum of Squared Errors (WSSSE) index, which is defined as follows:

$$\mathrm{WSSSE} = \sum_{j=1}^{K} \sum_{x_i \in C_j} d(x_i, c_j)^2 \qquad (1)$$

where $d(x_i, c_j)$ is the Euclidean distance between each observation $x_i$ belonging to cluster $C_j$ and its centroid $c_j$.
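Using MLlib's RDD-based API, the clustering step could be sketched as below, where `blocks` is the RDD of h-dimensional vectors built in the pre-processing sketch of Section 3.2.1 and the `k-means||` initialization mode selects the parallelized KMeans++ variant; the parameter values are assumptions.

```python
from pyspark.mllib.clustering import KMeans

K = 13  # optimal number of clusters found by the grid search (Section 4.3)

# Train on the h-dimensional vectors only (the grouping index is dropped).
model = KMeans.train(blocks.values(), K,
                     maxIterations=100,
                     initializationMode="k-means||")

wssse = model.computeCost(blocks.values())  # WSSSE index of Eq. (1)

# LabelRDD: each block index paired with the label of its nearest centroid.
label_rdd = blocks.mapValues(model.predict)
```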

Fig. 4. Clustering phase. Execution of the parallelized KMeans++ algorithm with K centroids. Two steps are depicted: the distributed initialization of the centroids and the distance computation for each of them.

Once KMeans++ has been applied to the TRS training set, an RDD variable, LabelRDD, will be formed containing the index along with the labels $l_i$ resulting from the clustering process, as shown at the top of Fig. 5. Each label $l_i$ corresponds to h values of the time series, where h is the prediction horizon. For instance, for $w_1 = \{l_1, l_2, l_3\}$, these h values are identified by $l_4$. The h values corresponding to the label $l_i$ occurring just after the pattern sequence will be retrieved and used to make the final prediction every time this pattern is found in the historical data.

3.2.4. Pattern Sequence-based Forecasting


Once the LabelRDD variable has been obtained as a result of the clustering step, it is necessary to transform the sequence of labels into a matrix of patterns in order to search for matching patterns using the scalable operations available in Apache Spark.
This matrix is generated by grouping all the possible sequences of labels of length W, where W is the optimal window size obtained in the hyperparameter tuning. The matrix is composed of tuples with an index, a sequence of W labels, and the label associated with the next h values of the corresponding sequence. Fig. 5 illustrates the construction of the RDD variable containing the pattern matrix from the sequence of labels obtained by the clustering algorithm. From the pattern matrix, the sequence of labels preceding the h values to predict is extracted, forming a pattern sequence to search for. Subsequently, the filter transformation of Apache Spark is applied to the pattern matrix RDD. This filter extracts all the coincidences with the search sequence and the corresponding indices.
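Continuing the previous sketches, the pattern matrix and the filter step might be expressed as follows. For readability, this simplified version materializes the label sequence on the driver before re-parallelizing it, whereas bigPSF keeps these operations fully distributed.

```python
W = 2  # optimal window size from the grid search (Section 4.3)

# Ordered sequence of cluster labels (collected here only for readability).
labels = label_rdd.sortByKey().values().collect()

# Pattern matrix RDD: (idGrouping, sequence of W labels, label of next h values).
pattern_matrix = sc.parallelize(
    [(i, tuple(labels[i:i + W]), labels[i + W])
     for i in range(len(labels) - W)])

# Pattern sequence preceding the h values to predict, and its matches.
search_seq = tuple(labels[-W:])
matches = pattern_matrix.filter(lambda row: row[1] == search_seq)
```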
Then, the real h values associated with each matching index are recovered from the original big data time series, and the forecast is obtained using a weighted average, as shown in Fig. 6. Specifically, the forecasts of the next h values of the big data time series, $\hat{y}$, are computed as follows:

$$\hat{y} = \sum_{i \in I} \alpha_i \, y_i \qquad (2)$$

where $I$ is the set of indices reported by the filter, $y_i$ is a vector composed of the real h values immediately after the $i$-th matching pattern found in the historical data, and $\alpha_i$ are the weights defined by:

$$\alpha_i = \frac{d_i}{\sum_{j \in I} d_j} \qquad (3)$$

where $d_i$ is an integer given by the difference between the index $i$ of the $i$-th matching pattern and the index corresponding to the h values to be predicted. These weights range from 0 to 1 and represent the temporal distance between the values to predict and the values used to calculate the average. The further back in time these values are, the lower the weight will be.
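A compact rendering of Eqs. (2) and (3) is given below, assuming the matched follower vectors $y_i$ and their temporal distances $d_i$ have already been gathered.

```python
import numpy as np

def weighted_forecast(follower_vectors, distances):
    """Eqs. (2)-(3): weight each vector of h real values by its normalized
    temporal distance and sum the weighted vectors."""
    d = np.asarray(distances, dtype=float)
    alpha = d / d.sum()              # Eq. (3): weights in [0, 1], summing to 1
    Y = np.vstack(follower_vectors)  # one row of h real values per match
    return alpha @ Y                 # Eq. (2): weighted average, length h
```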

Fig. 5. Pattern matrix RDD formation phase. An example with W = 3 and h = 24 (the h values being identified by the labels generated during the clustering process) is shown, representing how the pattern sequence is slid over the historical data and the labels are retrieved in order to create the pattern matrix RDD.

Fig. 6. Pattern Sequence-based Forecasting phase. First, the pattern sequence is found in the pattern matrix RDD (idGrouping 1 and 13 in this example). Second, the h values are retrieved ($l_4$ and $l_{16}$). Third, a prediction is made by weighting such values.

It is worth noting that if no match is found, the window size W, and consequently the number of labels forming the sequences, is reduced by one unit, which increases the chance of finding matches. This requires rebuilding the pattern matrix as explained above in order to filter again and find new matches, as sketched below.
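In terms of the previous sketch, this fallback reduces to a short loop; `pattern_matrix`, `search_seq` and `matches` are the names introduced above.

```python
def find_matches(sc, labels, W):
    """Rebuild the pattern matrix for a window of size W and filter it."""
    pm = sc.parallelize([(i, tuple(labels[i:i + W]), labels[i + W])
                         for i in range(len(labels) - W)])
    return pm.filter(lambda row: row[1] == tuple(labels[-W:]))

# Shrink the window until at least one coincidence appears.
while matches.isEmpty() and W > 1:
    W -= 1
    matches = find_matches(sc, labels, W)
```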
A complete pseudocode of the bigPSF approach summarizing all the steps is presented in Algorithm 1. It can be observed that the input variables required by bigPSF are the dataset, the optimal number of clusters K, the optimal window size W, and the prediction horizon h.
Next, some implementation remarks about the RDD variables used in Algorithm 1, as well as about which parts of the pseudocode use distributed computation, are given.

Fig. 7. Data structures: RDD variables for distributed computing in the bigPSF algorithm.

Fig. 8. Simplified numerical example for one prediction with W = 4, K = 9 and h = 24.

Fig. 7 provides a description of each of the steps involving the data structures used by the bigPSF algorithm. Fig. 7a shows the pattern matrix, or $RDD_1$. $RDD_1$ has three main fields: idGrouping, all the sets of W values existing in the historical set (wValues), and the labels immediately after such W values (hValues), where each label identifies the h values after each pattern sequence. $RDD_1$ is then filtered and only the rows matching the pattern sequence are kept. This set of matches is stored in $RDD_2$ (Fig. 7b), a considerably smaller distributed structure. For instance, for idGrouping = 7, the pattern sequence of the W labels retrieved is $l_7, \ldots, l_{w+5}, l_{w+6}$ and the next h values are identified by label $l_{w+7}$. Fig. 7c depicts $RDD_3$, in which the h = 24 real values are added in the rightmost column (hRealValues). These 24 values are mathematically expressed as $y_{h(w+6)+1}, \ldots, y_{h(w+7)}$. After these three operations are executed, and given the considerable decrease in data size, it is no longer necessary to process the information in any other RDD and, consequently, the operation on the master node is started. In Fig. 7d, the time distances between each match and the values at the time stamp to be predicted are calculated (column distance), creating $Array_1$. Later, these distances are transformed into weights, according to Eq. (3), forming $Array_2$ (Fig. 7e). Finally, in Fig. 7f, a weighted sum of the h real values found after each match is calculated by applying the reduce operation to all these vectors, so that Vector contains the final prediction.
A numerical example can be found in Fig. 8, in which the most relevant steps described in Fig. 7 are illustrated with actual values. The example is shown for W = 4, K = 9 and h = 24. Fig. 8a represents a sample $RDD_2$, with the pattern sequence $\{8, 7, 3, 7\}$ located at idGrouping = 2909. Four matches at positions idGrouping = $\{1453, 1817, 2139, 2230\}$ are identified, with associated labels $\{4, 4, 5, 4\}$ for the next h values. Fig. 8b shows the information that $Array_2$ may contain, with the weights already calculated for each of the vectors containing the h real values after the matches found. Finally, Fig. 8c shows how the final prediction is calculated by performing a weighted addition of the values shown in Fig. 8b.
Fig. 9 depicts the data pipeline through which Apache Spark manages the big data time series in a cluster of machines, as well as on the master or driver, depending on the operations being performed. Since the size of the dataset to be processed decreases considerably once the matching sequences are computed and the real h values are recovered, it becomes feasible to operate on the driver by means of variables such as Array or Vector. It must be noted that processing an RDD containing few data is very slow compared to performing the same operation on the driver or master, as will be discussed in Section 4.5.
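In code, this handoff amounts to a single collect, after which the Array and Vector stages of Fig. 7 run locally; a sketch under the naming of the previous examples:

```python
# The filtered RDD is very small after matching, so bring it to the driver;
# operating locally avoids the scheduling overhead of a nearly empty RDD.
local_matches = matches.collect()  # [(idGrouping, wValues, hLabel), ...]
# From here on, the Array and Vector stages of Fig. 7 are plain local
# computations (distances, weights, weighted sum) on the master node.
```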

4. Results

This section presents and discusses the experiments carried out to assess the performance of bigPSF for a 24-hour prediction horizon using a big data time series of electricity consumption from Uruguay. Furthermore, a comparative analysis against other approaches published in the literature is performed.

Fig. 9. Pipeline of data for the bigPSF algorithm in Apache Spark, showing the filter and map operations and how RDDs are processed in the driver.

Algorithm 1. The bigPSF algorithm

This section is organized as follows. Section 4.1 describes the dataset on which bigPSF has been tested. The quality measures used to evaluate its performance are presented in Section 4.2. Section 4.3 shows the results obtained for the hyperparameters using the grid search described in Section 3.2.2. Then, the performance of the bigPSF algorithm is compared to that of other forecasting methods in Section 4.4 and, finally, an analysis of the scalability on the Apache Spark framework is performed in Section 4.5.

4.1. Data description

The dataset consists of electricity consumption in Uruguay, measurements collected and aggregated from smart meters,
recorded on an hourly basis from 2007 to 2014. In total, the target dataset is composed of 70,128 samples. The average
hourly value is 1092.21 MW, with minimum and maximum hourly values of 609.87 MW and 1907.55 MW, respectively.
This time series showed a minimum percentage of outliers, and consequently the application of specific techniques to
deal with consumption peaks was not considered.
The data distribution for the different days of the week, along with the distribution for the full dataset, is illustrated in Fig. 10. The x-axis of each subfigure denotes the amount of electricity in MW, while the y-axis denotes the accumulated frequency of such values. Differences in mean and skewness can be appreciated across the days of the week.

Fig. 10. Data distribution for each day of the week, as well as for the full dataset. The x-axis represents the electricity consumption in MW and the y-axis the accumulated frequency, for each of the days included in the dataset.

In general, while working days are slightly right-skewed, it can be appreciated that weekends are slightly left-skewed, which means that lower consumption is reported on weekends, as expected.
The complete big data time series was loaded into Apache Spark and transformed into an RDD variable, initially formed by the following values: date (divided into year, month and day), electric load, and type of day (regular or holiday). The first transformation carried out on the original RDD consisted in the extraction and grouping of the 24 values (one record per hour) of each day. The initial 70,128 measurements (corresponding to a single time series, the total consumption in Uruguay for the period analyzed) were grouped into 2922 rows, each composed of all the measurements of one day. Then, the parallelized KMeans++ clustering algorithm was applied to this RDD, returning the numerical indices representing the labels. Finally, a new RDD was created by adding these labels to each row in order to record the cluster to which each row belongs.

4.2. Quality measures

Three well-established metrics in the context of electricity demand time series have been chosen to evaluate the performance of the bigPSF algorithm [16].
The Mean Absolute Percentage Error (MAPE) is one of the most widely used metrics due to the ease of interpretation of the relative error, and it is used as a guideline to measure the goodness of the prediction method:

$$\mathrm{MAPE}(\%) = \frac{100}{n} \sum_{h=1}^{n} \frac{|y_h - \hat{y}_h|}{y_h}$$

The Mean Absolute Error (MAE) is also used to quantify the accuracy of the forecasting algorithm by comparing predicted
versus actual values:

$$\mathrm{MAE} = \frac{1}{n} \sum_{h=1}^{n} |y_h - \hat{y}_h|$$

And finally, the Root Mean Squared Error (RMSE) is the square root of the average of the squared differences between predicted and actual values:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{h=1}^{n} |y_h - \hat{y}_h|^2}$$

In all the equations above, $y_h$ are the actual demand samples, $\hat{y}_h$ are the predicted samples, and $n$ is the number of samples. Note that absolute values are used due to the nature of the data to be treated, since electric companies are interested in measuring the total deviation of the prediction with respect to the total amount of energy.
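The three metrics can be implemented directly from the definitions above; the following is a straightforward NumPy rendering.

```python
import numpy as np

def mape(y, y_hat):
    """Mean Absolute Percentage Error, in percent."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 100.0 / len(y) * np.sum(np.abs(y - y_hat) / y)

def mae(y, y_hat):
    """Mean Absolute Error, in the units of the series (MW here)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    """Root Mean Squared Error, in the units of the series (MW here)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))
```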

4.3. Hyper-parameters: Optimal W and K

In this section, the results obtained by the grid search described in Section 3.2.2 are presented in order to determine which combination of the number of clusters K and window size W is optimal for obtaining the most effective final prediction.
To do this, bigPSF is applied using the STRS (corresponding to 70% of the original TRS training set) as training set and the STES (the remaining 30% of TRS) as validation set, for K ranging from 2 to 15 clusters and initial search window sizes W ranging from 1 to 10. The bigPSF algorithm thus selects the best values of K and W among 140 possible combinations of pairs of values. Specifically, in order to test all the options, about 500 predictions were made for each of these possibilities.
Table 1 shows the MAPE obtained by bigPSF when predicting the STES for all the combinations of pairs of values K and W. As can be seen, the smallest MAPE value corresponds to a number of clusters K = 13 and a search window size W = 2. Therefore, these two values will be used as input variables in the prediction phase over the complete TES test set.

4.4. Analysis of results

Once the optimal hyper-parameters of bigPSF are selected, the accuracy of the predictions according to the measures described in Section 4.2 is presented in this section. In addition, the bigPSF algorithm is compared to six approaches: Decision Tree (CART), Gradient Boosting Machine (GBM), PSF [21], 7-Pattern Sequence-based Forecasting (7-PSF) [19], which is a variant of PSF that considers seven different models, one associated with each day of the week, Autoregressive Integrated Moving Average (ARIMA), and Artificial Neural Network (ANN) models. All these algorithms have been applied under the same training conditions as bigPSF.
After the training process, the values of the input parameters of each algorithm turned out to be:

1. PSF and 7-PSF: number of clusters K = 2 and window length W = 4, as done in Ref. [19].
2. ANN: 24 input neurons, one hidden layer, backpropagation and sigmoid activation function, as done in Ref. [19].
3. ARIMA: order of the autoregressive component p = 0, of the integrated component d = 1, and of the moving average component q = 1, that is, an ARIMA(0,1,1) model, as done in Ref. [19].
4. CART: regression tree using the ANOVA criterion to split nodes, with no tree pruning, as these are the default settings of the rpart R package [37].
5. GBM: ensemble of 100 trees, interaction depth set to 1, shrinkage set to 0.1, and the fraction of training set observations randomly selected to propose the next tree in the expansion set to 0.5, as these are the default settings of the gbm R package [10].

Table 2 shows the MAE, MAPE and RMSE of all the above forecasting methods when predicting the TES test set. It can be seen that bigPSF clearly outperformed all of them. In particular, bigPSF reached a value of 57.15 MW for the MAE, which is the lowest value among all the considered methods. This value represents a 16.36% improvement over 7-PSF and a 40.38% improvement over the original PSF algorithm. Moreover, the improvement rises to 18% over GBM, 46.31% over ANN, 58.4% over CART and 61.01% over ARIMA. This metric is of utmost relevance in the energy context since it represents the actual deviation of the prediction in megawatts. Similarly, bigPSF obtained a MAPE of 4.70%. The errors achieved by 7-PSF, PSF, GBM, ANN, CART and ARIMA were 1.19, 1.68, 1.23, 2.08, 2.33 and 2.85 times greater than that of bigPSF, respectively. From these values, it can be concluded that bigPSF very significantly decreases the error expressed in relative terms.
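These percentages follow directly from Table 2; for instance, the MAE improvements can be reproduced with a few lines:

```python
# MAE values (in MW) from Table 2; bigPSF's MAE is 57.15 MW.
mae_mw = {"ARIMA": 146.56, "CART": 137.59, "ANN": 106.44,
          "GBM": 69.71, "PSF": 95.85, "7-PSF": 68.33}
for name, value in mae_mw.items():
    improvement = 100.0 * (value - 57.15) / value
    print(f"bigPSF improves {name} by {improvement:.2f}%")
# Prints, e.g., 40.38% for PSF and 16.36% for 7-PSF, matching the text.
```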
As for the RMSE values, bigPSF also obtained better results than the other methods. In particular, it improved on the second best result (7-PSF) by 34.29% and on the worst result (CART) by 65.47%.

Table 1
MAPE obtained by the bigPSF for the grid search during the training phase.

K=2 K=3 K=4 K=5 K=6 K=7 K=8 K=9 K = 10 K = 11 K = 12 K = 13 K = 14 K = 15


W=1 7.12 6.43 6.02 5.47 5.33 5.18 4.96 4.95 4.89 4.73 4.65 4.59 4.67 4.67
W=2 6.70 6.30 5.83 5.39 5.22 4.99 4.83 4.85 4.89 4.73 4.65 4.52 4.61 4.61
W=3 6.59 6.34 5.76 5.38 5.20 5.05 4.95 4.95 4.93 4.84 4.77 4.64 4.68 4.77
W=4 6.55 6.34 5.77 5.40 5.31 5.19 4.97 5.04 5.08 4.94 4.89 4.80 4.88 4.88
W=5 6.50 6.51 5.83 5.51 5.41 5.26 5.05 5.17 5.24 5.12 5.02 4.97 4.99 4.95
W=6 6.46 6.64 5.90 5.59 5.80 5.37 5.14 5.25 5.32 5.18 5.10 5.07 5.105 5.00
W=7 6.52 6.74 5.99 5.66 5.57 5.38 5.17 5.29 5.37 5.25 5.17 5.15 5.13 5.02
W=8 6.53 6.86 6.09 5.71 5.67 5.42 5.23 5.39 5.14 5.28 5.20 5.21 5.15 5.05
W=9 6.60 6.96 6.18 5.77 5.73 5.48 5.24 5.46 5.44 5.31 5.25 5.23 5.15 5.05
W = 10 6.70 7.04 6.23 5.84 5.73 5.50 5.26 5.48 5.48 5.34 5.26 5.24 5.19 5.05

Table 2
Summary of results obtained in terms of MAE, MAPE and RMSE.

Algorithm    MAE (MW)    MAPE (%)    RMSE (MW)

ARIMA        146.56      13.40       142.37
CART         137.59      10.97       177.34
ANN          106.44      9.78        141.20
GBM          69.71       5.79        93.75
PSF          95.85       7.94        123.92
7-PSF        68.33       5.63        93.18
bigPSF       57.15       4.70        61.23

These RMSE values indicate that the errors are smaller not only in absolute and relative terms, but also that the residuals are less spread out, minimizing the occurrence of particularly high errors.

4.5. Scalability analysis

All experiments were run on the Open Telekom Cloud (OTC), using an Elastic Cloud Server on which twenty-four different hardware scenarios were configured using m1.2xlarge instances with 8 vCPUs and 64 GB of RAM.
A total of 24 different configurations have been used for the scalability study. For this purpose, clusters for distributed computing were formed with 2 to 24 slaves (master node aside), configuring the execution with one core per node. Additionally, experiments were launched with a more powerful configuration of 48 cores, composed of a cluster of 24 slaves with two cores each.
Synthetic datasets have also been generated to better assess the scalability of the approach, by multiplying the size of the original dataset (N) by several factors such as 32, 64, 128 and 256. That is, the 256N dataset is composed of 256 copies of the original dataset, consecutively appended.
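The paper does not detail the generation procedure; one simple way to obtain such scaled datasets, given here only as an assumption, is to append copies of the original RDD:

```python
def scale_dataset(rdd, factor):
    """Return a series `factor` times longer, made of consecutive copies of
    the original dataset (e.g. factor=256 yields the 256N dataset)."""
    out = rdd
    for _ in range(factor - 1):
        out = out.union(rdd)
    return out
```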
Fig. 11 shows the differences in execution time between the original PSF, executed on a single-core machine (without a cluster), and its distributed bigPSF version with 1, 6, 12 and 48 cores, for datasets of sizes N, 32N, 64N, 128N and 256N. It can be noted that the PSF runtimes exhibit a quadratic computational cost. This cost is improved by the new distributed bigPSF algorithm, which achieves a linear cost even in its most basic configuration. Therefore, in addition to improving the computational cost of the algorithm, the execution times needed to predict the same datasets are substantially reduced. Due to the differences in scale between the PSF and bigPSF runtimes, it is worth mentioning that the curves corresponding to the different numbers of cores cannot be distinguished in this figure because they overlap.
Fig. 12 shows the PSF and bigPSF runtimes with an enlarged Y-axis scale, making it possible to appreciate the linear computational cost achieved by the bigPSF algorithm for different numbers of cores.
Fig. 13 displays the relationship between execution time and number of cores for bigPSF when using datasets of different sizes. From this figure, conclusions can be drawn about the scalability of bigPSF. It should be mentioned that increasing the number of slaves used to distribute the datasets does not necessarily imply a shorter execution time of the algorithm; the benefit also depends directly on the size of the time series being processed. In particular, as the size of the time series increases, the bigPSF algorithm becomes more efficient when using more slave nodes. That is, the smaller the series, the fewer slaves are needed, making the use of additional slaves inefficient. This increase in time is caused by the high number of reduce operations performed by Apache Spark when sending partial results to the master node.

Fig. 11. Execution times (expressed in seconds) for bigPSF for different sizes of datasets (expressed in N times the size of the original dataset) and number
of cores (up to 48). The PSF runtime is also shown to highlight the high scalability of bigPSF.

Fig. 12. Runtimes of bigPSF for different numbers of cores. In red, the first points of PSF (zoom-in of Fig. 11).

Fig. 13. Influence of the number of cores on the bigPSF algorithm. The execution time, expressed in seconds, is represented on the Y-axis, and the number of cores on the X-axis. Different length factors (N) are displayed.

As can be seen, from 12 cores onward the performance improvement of the distributed algorithm is insignificant, except for the 256N dataset.
Table 3 presents the optimal cluster configuration for each dataset, relating the number of nodes to the shortest execution time. In addition, it shows the computation time required for the grid search phase (training) and for the final execution (testing) for different dataset sizes. It can be seen that, for the original dataset, the time is approximately 19 min for training and 1.5 min for testing. By contrast, the 256N dataset took about 2 h to obtain the model and almost 10 min to test it. Both training and testing times exhibit a linear trend with a very gentle slope, highlighting the high scalability of bigPSF.
Despite having a priori more powerful configurations available, that is, configurations with more cores for processing, it was verified that finding the optimal runtime for the different datasets sometimes requires reducing the number of cores. Therefore, it can be concluded that increasing the computing capacity with a larger number of cores will not yield better results unless the algorithm is fed with bigger datasets.

5. Conclusions

This work proposes the bigPSF algorithm, based on distributed computing, to process and forecast big data time series. This highly scalable algorithm is capable of processing and extracting results from datasets containing millions of records in remarkably short times.
In a big data environment, where the consumption habits of the population can change over time, it is very important to calibrate the way the prediction is computed. In this sense, in a dataset containing samples from the last twenty years, for instance, more importance should be given to coincidences found in the last five years than to those found fifteen years ago.

Table 3
Optimal core configuration for each dataset size, and execution times for the grid search (training) and forecasting (testing) phases.

Dataset Cores Training time Testing time


N 2 0:18:54 0:01:30
2N 4 0:22:03 0:01:45
4N 4 0:29:24 0:02:20
8N 4 0:42:50 0:03:24
16N 4 1:07:25 0:05:21
32N 7 1:27:09 0:06:55
64N 12 1:34:17 0:07:29
128N 22 1:40:10 0:07:57
256N 24 1:59:42 0:09:30

For this reason, once the dataset is processed, bigPSF looks for patterns in the data history and weights them according to their temporal closeness. This is one of the improvements made with regard to the original PSF.
Another frequently discussed problem is the correct choice of the number of clusters when a clustering technique is applied. In previous works, a voting system among the results obtained by several cluster quality measures, such as Dunn, Silhouette and Davies-Bouldin, was used for small or medium datasets. However, the computation of these measures is not feasible for large datasets, and the interpretation of the curves in the graphs generated by these measures may be a subjective task. To solve this problem, and with the objective of obtaining more precise predictions, this work proposes a hyper-parameter optimization using a grid search method to determine the optimal K and W values, making it possible to correctly label the big data time series and to determine the optimal size of the window finally used for the prediction. This method is not based on human interpretation of graphs, since it analyzes the prediction errors for all the possibilities of K and W using a validation set, taking advantage of the power of distributed computing.
It should be noted that bigPSF is more accurate than PSF and even improves on the performance of other algorithms in the literature such as 7-PSF, ANN, ARIMA, CART and GBM. The scalability of bigPSF makes it feasible to use it in big data contexts where large time series are generated, as, for example, in emerging smart cities.
As future work, it is proposed to design a multivariate bigPSF algorithm, using datasets that include, in addition to electricity consumption, information such as temperature or demographic descriptors of the population under study. Finally, it is also intended to extract relevant information from the reported models and results so that it can be used for decision making.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to thank the Spanish Ministry of Science, Innovation and Universities for the support under pro-
ject TIN2017-88209-C2-1-R.

References

[1] D. Arthur, S. Vassilvitskii, K-Means++: The advantages of careful seeding, in: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2007,
pp. 1027–1035.
[2] B. Bahmani, B. Moseley, A. Vattani, R. Kumar, S. Vassilvitskii, Scalable k-means++, in: Proceedings of the VLDB Endowment, 2012, pp. 622–633.
[3] N. Bokde, G. Asencio-Cortés, F. Martínez-Álvarez, K. Kulat, PSF: Introduction to R Package for Pattern Sequence Based Forecasting Algorithm, The R
Journal 9 (1) (2017) 324–333.
[4] N. Bokde, M.W. Beck, F. Martínez-Álvarez, K. Kulat, A novel imputation methodology for time series based on pattern sequence forecasting, Pattern
Recognition Letters 116 (2018) 88–96.
[5] N. Bokde, A. Troncoso, G. Asencio-Cortés, K. Kulat, F. Martínez-Álvarez, Pattern sequence similarity based techniques for wind speed forecasting, in:
Proceedings of the International work-conference on Time Series, 2017, pp. 786–794.
[6] W. Chen, S. Mao, Y. Liu, Big data: a survey, Mobile Networks and Applications 19 (2) (2014) 171–209.
[7] Y. Fujimoto, Y. Hayashi, Pattern sequence-based energy demand forecast using photovoltaic energy records, in: Proceedings of the IEEE International
Conference on Renewable Energy Research and Applications, 2012, pp. 1–6.
[8] A. Galicia, R. Talavera-Llames, A. Troncoso, I. Koprinska, F. Martínez-Álvarez, Multi-step forecasting for big data time series forecasting based on
ensemble learning, Knowledge-Based Systems 163 (2018) 830–841.
[9] A. Galicia, J.F. Torres, F. Martínez-Álvarez, A. Troncoso, A novel Spark-based multi-step forecasting algorithm for big data time series, Information
Sciences 467 (2018) 800–818.
[10] B. Greenwell, B. Boehmke, J. Cunningham, GBM Developers, GBM: generalized boosted regression models, 2019. R package version 2.1.5.
[11] A. Gupta, N. Bokde, K.D. Kulat, Hybrid leakage management for water network using PSF algorithm and soft computing techniques, Water Resources
Management 32 (3) (2018) 1133–1151.
[12] J. Jacques, C. Preda, Model-based clustering of multivariate functional data, Computational Statistics and Data Analysis 71 (2014) 92–106.
[13] C.H. Jin, G. Pok, H.-W. Park, K.H. Ryu, Improved pattern sequence-based forecasting method for electricity load, IEEJ Transactions on Electrical and
Electronic Engineering 9 (6) (2014) 670–674.

[14] I. Koprinska, M. Rana, A. Troncoso, F. Martínez-Álvarez, Combining pattern sequence similarity with neural networks for forecasting electricity demand
time series, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2013, pp. 940–947.
[15] C. Krome, V. Sander, Time series analysis with Apache Spark and its applications to energy informatics, Energy Informatics 1 (2018) 337–341.
[16] Z. Liu, X. Sun, S. Wang, M. Pan, Y. Zhang, Z. Ji, Midterm power load forecasting model based on kernel principal component analysis and back
propagation neural network with particle swarm optimization, Big Data 7 (2) (2019) 130–138.
[17] J.M. Luna-Romera, J. García-Gutiérrez, M. Martínez-Ballesteros, J.C. Riquelme, An approach to validity indices for clustering techniques in big data,
Progress in Artificial Intelligence 7 (2) (2018) 81–94.
[18] J.M. Luna-Romera, M. Martínez-Ballesteros, J. García-Gutiérrez, J.C. Riquelme, External clustering validity index based on chi-squared statistical test,
Information Sciences 7 (2) (2018) 81–94.
[19] F. Martínez-Álvarez, A. Schmutz, G. Asencio-Cortés, J. Jacques, A novel hybrid algorithm to forecast functional time series based on pattern sequence
similarity with application to electricity demand, Energies 12 (1) (2019) 94–111.
[20] F. Martínez-Álvarez, A. Troncoso, J.C. Riquelme, J.S. Aguilar-Ruiz, Discovery of motifs to forecast outlier occurrence in time series, Pattern Recognition
Letters 32 (12) (2011) 1652–1665.
[21] F. Martínez-Álvarez, A. Troncoso, J.C. Riquelme, J.S. Aguilar-Ruiz, Energy time series forecasting based on pattern sequence similarity, IEEE Transactions
on Knowledge and Data Engineering 23 (8) (2011) 1230–1243.
[22] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D.B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M.J. Franklin, R. Zadeh, M.
Zaharia, A. Talwalkar, MLlib: Machine Learning in Apache Spark, Journal on Machine Learning Research 17 (1) (2016) 1235–1241.
[23] Z. Peng, S. Peng, L. Fu, B. Lu, J. Tang, K. Wang, W. Li, A novel deep learning ensemble model with data denoising for short-term wind speed forecasting,
Energy Conversion and Management 207 (2020) 112524.
[24] R. Perez-Chacon, R.L. Talavera-Llames, F. Martínez-Álvarez, A. Troncoso, Finding electric energy consumption patterns in big time series data, in:
Proceedings of the 13th International Conference on Distributed Computing and Artificial Intelligence, 2016, pp. 231–238.
[25] R. Pérez-Chacón, J.M. Luna-Romera, A. Troncoso, F. Martínez-Álvarez, J.C. Riquelme, Big data analytics for discovering electricity consumption patterns
in smart cities, Energies 11 (2018) 683.
[26] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, E. Keogh, Addressing big data time series: mining trillions of time series subsequences under dynamic time warping, ACM Transactions on Knowledge Discovery from Data 7 (3) (2013) 10:1–10:31.
[27] M. Seeger, D. Salinas, V. Flunkert, Bayesian intermittent demand forecasting for large inventories, in: Proceedings of the International Conference on
Neural Information Processing Systems, 2016, pp. 4653–4661.
[28] J. Segarra-Tamarit, E. Pérez, E. Moya, P. Ayuso, H. Beltrán, Deep learning-based forecasting of aggregated CSP production, Mathematics and Computers
in Simulation (2020).
[29] W. Shen, V. Babushkin, Z. Aung, W.L. Woon, An ensemble model for day-ahead electricity demand time series forecasting, in: Proceedings of the
International Conference on Future Energy Systems, 2013, pp. 51–62.
[30] W. Shi, Y. Zhu, P.S. Yu, J. Zhang, T. Huang, C. Wang, Y. Chen, Effective prediction of missing data on Apache Spark over multivariable time series, IEEE
Transactions on Big Data 4 (4) (2018) 473–486.
[31] Y. Simmhan, M.U. Noor, Scalable prediction of energy consumption using incremental time series clustering, in: Proceedings of the IEEE International
Conference on Big Data, 2013, pp. 29–36.
[32] P. Singh, Big data time series forecasting model: a fuzzy-neuro hybridize approach, Adaptation, Learning, and Optimization 19 (2015) 55–72.
[33] S. Singh, A. Yassine, Big data mining of energy time series for behavioral analytics and energy consumption forecasting, Energies 11 (2) (2018).
[34] A. Sinha, P.K. Jana, MRF: MapReduce based forecasting algorithm for time series data, Procedia Computer Science 132 (2018) 92–102.
[35] R. Talavera-Llames, R. Pérez-Chacón, A. Troncoso, F. Martínez-Álvarez, Big data time series forecasting based on nearest neighbors distributed
computing with Spark, Knowledge-Based Systems 161 (1) (2018) 12–25.
[36] R. Talavera-Llames, R. Pérez-Chacón, A. Troncoso, F. Martínez-Álvarez, MV-kWNN: a novel multivariate and multi-output weighted nearest neighbors
algorithm for big data time series forecasting, Neurocomputing 353 (2019) 56–73.
[37] T. Therneau, B. Atkinson, rpart: Recursive Partitioning and Regression Trees, R package version 4.1-15, 2019.
[38] P. Thongtra, A. Sapronova, Time-series data analytics using Spark and machine learning, in: Proceedings of the Foundations of Intelligent Systems,
2017, pp. 509–515.
[39] J.F. Torres, A. Galicia, A. Troncoso, F. Martínez-Álvarez, A scalable approach based on deep learning for big data time series forecasting, Integrated
Computer-Aided Engineering 25 (4) (2018) 335–348.
[40] J.F. Torres, A. Troncoso, I. Koprinska, Z. Wang, F. Martínez-Álvarez, Big data solar power forecasting based on deep learning and multiple data sources,
Expert Systems 36 (4) (2019) e12394.
[41] J.F. Torres, A. Troncoso, F. Martínez-Álvarez, Deep learning-based approach for time series forecasting with application to electricity load, Lecture Notes
in Computer Science 10338 (2017) 203–212.
[42] A. Troncoso, J.M. Riquelme-Santos, A. Gómez-Expósito, J.L. Martínez-Ramos, J.C. Riquelme, Electricity market price forecasting based on weighted
nearest neighbors techniques, IEEE Transactions on Power Systems 22 (3) (2007) 1294–1301.
[43] Ó. Trull, J.C. García-Díaz, A. Troncoso, Initialization methods for multiple seasonal Holt-Winters forecasting models, Mathematics 8 (2020) 268.
[44] Ó. Trull, J.C. García-Díaz, A. Troncoso, Stability of multiple seasonal Holt-Winters models applied to hourly electricity demand in Spain, Applied Sciences 10 (2020) 2630.
[45] Z. Wang, I. Koprinska, M. Rana, Pattern sequence-based energy demand forecast using photovoltaic energy records, in: Proceedings of the International
Conference on Artificial Neural Networks, 2017, pp. 486–494.
[46] F. Xu, Y. Lin, J. Huang, D. Wu, H. Shi, J. Song, Y. Li, Big data driven mobile traffic understanding and forecasting: a time series approach, IEEE
Transactions on Services Computing 9 (5) (2016) 796–805.
