Professional Documents
Culture Documents
A Hybrid Residual Dilated LSTM End Exponential
A Hybrid Residual Dilated LSTM End Exponential
Abstract—This work presents a hybrid and hierarchical deep disrupting electricity demand in a mid-term horizon include
arXiv:2004.00508v1 [eess.SP] 29 Mar 2020
learning model for mid-term load forecasting. The model com- unpredictable economic events and political decisions [2].
bines exponential smoothing (ETS), advanced Long Short-Term MTLF methods can be divided into statistical/econometric
Memory (LSTM) and ensembling. ETS extracts dynamically the
main components of each individual time series and enables models or machine learning/computational intelligence models
the model to learn their representation. Multi-layer LSTM is [3]. Typical examples of the former are ARIMA, EST, and
equipped with dilated recurrent skip connections and a spatial linear regression. ARIMA and ETS can deal with seasonal
shortcut path from lower layers to allow the model to better cap- time series but linear regression requires additional operations
ture long-term seasonal relationships and ensure more efficient such as decomposition or the extension of the model with
training. A common learning procedure for LSTM and ETS,
with a penalized pinball loss, leads to simultaneous optimization periodic components [4]. Statistical models have undoubted
of data representation and forecasting performance. In addition, advantages, being relatively simple, robust, efficient, and au-
ensembling at three levels ensures a powerful regularization. A tomatic, so that they can be used by non-expert users.
simulation study performed on the monthly electricity demand The flexibility of machine learning (ML) models has in-
time series for 35 European countries confirmed the high per- creased researchers’ interest in them as MTLF tools [5]. Of
formance of the proposed model and its competitiveness with
classical models such as ARIMA and ETS as well as state-of- these, neural networks (NNs) are the most explored because
the-art models based on machine learning. of their attractive features such as learning capability, uni-
versal approximation property, nonlinear modeling, massive
Index Terms—deep learning, exponential smoothing, Long
Short-Term Memory, mid-term load forecasting, recurrent neural parallelism and ease of specifying a loss function, to align it
networks, time series forecasting. better with forecasting goals. Some examples of using different
architectures of NNs for MLTF are: [6] where NN learns on
I. I NTRODUCTION historical demand and weather factors, [7] where Kohonen
NN was used, [8] where NNs were supported by fuzzy logic,
for learning deep sequence predictions was proposed in [21]. data flow and processing. Section 3 describes the experimental
It contains highways within the temporal structure of the net- framework used to evaluate the performance of the proposed
work for unimpeded information propagation, thus alleviating model. Finally, Section 4 concludes the work.
gradient vanishing problem. Hierarchical structure learning is
posed as a residual learning framework to prevent performance II. F ORECASTING M ODEL
degradation problems. Another example of using a new deep
learning solution for time series forecasting is the N-BEATS The proposed model is based on the winning submission
model proposed is [22]. Its architecture is based on backward to the M4 forecasting competition 2018 for monthly data and
and forward residual links and a deep stack of fully-connected point forecasts [27]. It is a hybrid and hierarchical forecasting
layers. N-BEATS has a number of desirable properties, being model that enables ETS and advanced LSTM to be mixed
interpretable, applicable without modification to a wide array into a common framework. The model architecture, its specific
of target domains, and fast to train. features and components are describe below.
In addition to point forecasting, deep learning also enables
probabilistic forecasting. A model proposed in [23], DeepAR, A. Framework and Features
produces accurate probabilistic forecasts, based on training
The proposed forecasting model is shown in Fig. 1. It is
an auto-regressive RNN on a large number of related time
composed of:
series. This model makes probabilistic forecasts in the form
of Monte Carlo samples that can be used to compute consistent • ETS – which is a Holt-Winters type multiplicative sea-
quantile estimates for all sub-ranges in the prediction horizon. sonal model. It is used for extracting two components
Another solution for probabilistic time series forecasting was from the time series: level and seasonality. ETS loads a
proposed in [24]. It combines state space models with deep set of time series (Y ), calculates the level and seasonal
learning. By parametrizing a per-time-series linear state space components individually for each series and returns sets
model with a jointly-learned RNN, the method retains the of levels (L) and seasonal components (S).
desired properties of state space models such as data efficiency • Preprocessing – the level and seasonal components are
and interpretability, while making use of the ability to learn used for deseasonalization and adaptive normalization of
complex patterns from raw data offered by deep learning the time series. The inputs to the preprocessing module
approaches. are: set of time series Y and sets of level and seasonal
To improve forecasting performance, RNN is also mixed components, L and S, respectively. The preprocessed
with other methods such as ETS. Such a model won the data are divided into input and output training data and
M4 forecasting competition in 2018 [25]. This competition returned in training set Ψ.
utilized 100,000 real-life time series, and incorporates all • RD-LSTM – which is residual dilated LSTM composed
major forecasting methods, including those based on AI and of four layers. Due to its recurrent nature, this model is
ML, as well as traditional statistical ones [26]. The winning capable of learning long-term dependencies in sequential
model, developed by Slawek Smyl [27], is a hybrid approach data. RD-LSTM learns in cross-learning mode on training
utilizing both statistical and ML features. It combined ETS set Ψ. The forecasts for all time series produced by RD-
with advanced LSTM, which is supported by such mechanisms LSTM are returned in set X̂.
as dilation, residual connections and attention [28], [29], [30]. • Postprocessing – the forecasts of the deseasonalized and
It produced the most accurate forecasts as well as the most normalized time series are ”reseasonalised” and renor-
precise prediction intervals. According to sMAPE, it was close malised. The inputs to the postprocessing module are:
to 10% more accurate than the combination benchmark of the forecasts X̂, and level and seasonality sets, L and S.
competition, which is a huge improvement. For monthly data The output is set Ŷ , containing the forecasts for each
(48,000 time series) it outperformed all other 60 submissions time series.
achieving the highest accuracy according to each of the three • Ensembling – the forecasts produced by individual mod-
performance measures. els are averaged. This enhances the robustness of the
In this work, we propose a forecasting model for MTLF method further, mitigating model and parameter un-
based on the winning submission to the M4 competition for certainty. The ensembling module receives the sets of
monthly data. It combines ETS, LSTM and ensembling. ETS forecasts produced by individual models, Ŷkr , aggregates
enables the model to capture the main components of the them and returns a set of forecasts for all time series,
individual time series, such as seasonality and level, while Ŷavg .
LSTM allows nonlinear trends and cross-learning. A common • Stochastic gradient descent (SGD) – the parameters of
learning procedure for LSTM and ETS, with a penalized both ETS and RD-LSTM are updated by the same overall
pinball loss, leads to simultaneous optimization of data rep- optimization procedure, SGD, with the overarching goal
resentation and forecasting performance. Ensembling at three of minimizing forecasting errors.
levels is a powerful regularization method which reduces the The proposed model has a hierarchical structure, i.e. the data
model variance. are exploited in a hierarchical manner. Both local and global
The rest of the work is organized as follows. Section 2 time series features are extracted. The global features are
describes the proposed forecasting model: its architecture, learned by RD-LSTM across many time series. The specific
features, components and implementation details as well as features of each individual time series are extracted by ETS.
3
subset level and model level. This reduces the variance related
to the stochastic nature of SGD, and also related to data and
parameter uncertainty.
B. Exponential Smoothing
The complex nature of a time series, e.g. nonstationarity,
non-linear trend and seasonal variations, makes forecasting
difficult and puts high demands on the models. A typical
approach in this case is to simplify the forecasting problem by
deseasonalization, detrending or decomposition. A time series
is usually decomposed into seasonal, trend and stochastic
components. The components expressing less complexity than
the original time series can be modeled independently using
simpler models. The most popular methods of decomposition
are [31]: additive decomposition, multiplicative decomposi-
tion, X11, SEAT, and STL. Although very useful, this ap-
proach has a drawback. It separates the preprocessing from
Fig. 1. Block diagram of the ETS+RD-LSTM forecasting system. the forecasting, which results in the final solution not being
optimal. Some classic statistical models, such as ETS, employ
a better way: the forecasting model has a built-in mechanism
Thus, each series has a partially unique and partially shared to deal with seasonality. The final model uses the optimal
model. decomposition of the time series.
Note the hybrid structure of the model, where statistical In our approach, we use ETS as the preprocessing tool. ETS
modeling is combined concurrently with ML algorithms. The extracts two components from the time series: level (smoothed
model combines ETS, advanced LSTM and ensembling. ETS value) and seasonality. Then we use these components to nor-
is focused on each individual series and enables the model to malize and deseasonalize the original time series. Preprocessed
capture its main components such as seasonality and level. time series are forecasted by RD-LSTM. ETS and RD-LSTM
These components are used for time series preprocessing, are optimised simultaneously using SGD. So the resulting
normalization and deseasonalization. forecasting model, including data preprocessing, is optimized
An advanced LSTM-based RNN allows non-linear trends as a whole. This distinctive feature of the proposed approach
and cross-learning. This is an extended, multilayer version needs to be emphasized.
of LSTM with residual dilated LSTM blocks. The dilated The ETS model used in this study was inspired by the
recurrent skip connections and spatial shortcut path from Holt-Winters multiplicative seasonal model. However, it has
lower layers, applied in this solution, allow the model to been simplified by the removal of the linear trend component.
better capture long-term seasonal relationships and ensure This is because the trend forecasting is the task of RD-LSTM,
more efficient training. The RD-LSTM model is trained on which is able to produce a non-linear trend which is more
many time series (cross-learning). To train deep NNs, which valuable in our case. The updating formulas for the ETS model
have many parameters, cross-learning is necessary. Moreover, with a seasonal cycle length of twelve (useful for monthly
it enables the method to capture the shared features and data) are as follows [27]:
components of the time series. yt
lt = α + (1 − α)lt−1
ETS and RD-LSTM are optimized simultaneously, i.e. the st
yt (1)
ETS parameters and the RD-LSTM weights are optimised by st+12 = β + (1 − β)st
SGD at the same time. The same overall learning procedure lt
optimizes the model, including data preprocessing. So, the where yt is the time series value at timepoint t, lt and st
learning process includes representation learning – searching are the level and seasonal components, respectively, and α,
for the most suitable representations of input and output data β ∈ [0, 1] are smoothing coefficients.
(individually for each time series), which ensures the most The level equation shows a weighted average between the
accurate forecasts. It is worth noting the dynamical character seasonally adjusted observation and the level for time t−1. The
of the training set which is related to representation learning. seasonal equation expresses a seasonal component for time
The training set is updated in each epoch of RD-LSTM t + 12 as a weighted average between a new estimate of the
learning. This is because SGD updates the ETS parameters in seasonality component (yt /lt ) and the past estimate (st ). Fig.
each epoch, and therefore the level and seasonal components, 2 depicts an example of the monthly electricity demand time
used for preprocessing, are updated as well. series and its level and seasonal components obtained from
Ensembling is seen as a much more powerful regularization (1).
technique than more popular alternatives, e.g. dropout or The ETS model parameters, twelve initial seasonal compo-
L2-norm penalty [22]. In our case, ensembling combines nents and two smoothing coefficients for each time series, were
individual forecasts at three levels: stage of training level, data adjusted together with RD-LSTM weights by SGD. Knowing
4
10 4
1.4 0.2
0.2
1.3
0 0
x out
x in
1
1.2
1
-0.2
y -0.2
1.1
-0.4
1 -0.6 -0.4
2 4 6 8 10 12 2 4 6 8 10 12
0.9 i i
0 12 24 36 48 60 72 84 96 108 120 132 144 156 168 180 192
t
0.05
10 4 0.06
1.6
0.04
x out
1.5
x in
N
N
0.02 0
1.4
0
l
1.3
-0.02 -0.05
2 4 6 8 10 12 2 4 6 8 10 12
1.2
i i
1.1
0 12 24 36 48 60 72 84 96 108 120 132 144 156 168 180 192
t
Fig. 3. Examples of the input (left panel) and output (right panel) vectors
1.2
created for the time series shown in Fig. 2. Upper panel shows x-vectors for
the first two cycles, lower panel shows x-vectors for the last two cycles. The
1
x-vectors created in the first training epoch in blue, in the last epoch in red.
s
0.8
0 12 24 36 48 60 72 84 96
t
108 120 132 144 156 168 180 192 Note that normalization is adaptive and local, and the
”normalizer” follows the series values. This allows us to
Fig. 2. Original time series and its level and seasonal components. include the current features of the series (lt∗ and st ) in the
input and output variables.
The preprocessed elements of the time series contained in
these parameters allows the level and seasonal components the successive input and output windows can be represented
to be calculated, which are then used for preprocessing: by vectors as follows:
deseasonalization and normalization. – first pair of input and output windows:
xin out
1 = [x1 x2 ...x12 ], x1 = [x13 x14 ...x24 ],
– second pair of input and output windows:
C. Pre- and Postprocessing
xin out
2 = [x2 x3 ...x13 ], x2 = [x14 x15 ...x25 ],
The level and seasonal components are calculated for all – ...,
points of each series, which are then used for deseasonalization – N -th pair of input and output windows:
and adaptive normalization during the on-the-fly preprocess- xin out
N = [xN xN +1 ...xN +11 ], xN = [xN +12 xN +13 ...xN +23 ].
ing. This is the most crucial element of the forecasting These vectors are included in the training subset for the
procedure as it determines its performance. The time series is i-th time series: Φi = {(xin out
t , xt ) : t = 1, 2, ..., N }. The
preprocessed in each training epoch using the updated values training subsets for all M time series are combined and form
of level and seasonal components. These updated values are the training set Ψ = {Φ1 , Φ2 , ..., ΦM } which is used for
calculated from (1), where the ETS parameters are increasingly RD-LSTM cross-learning. Note the dynamic character of the
fine tuned in each epoch by SGD. training set. It is updated in each epoch because the level and
The time series is preprocessed using rolling windows: input seasonal components in (2) are updated.
and output ones. Both windows have a length of twelve, Fig. 3 shows the input and output vectors for the time
which is equal to the length of both the seasonal cycle and series which is shown in Fig. 2. The upper panel shows
the forecast horizon. The input window, ∆in , contains twelve the corresponding input and output x-vectors representing the
consecutive elements of the time series which after preprocess- first two seasonal cycles of the time series. The lower panel
ing will be the RD-LSTM inputs. The corresponding output shows the input and output x-vectors representing the last
window, ∆out , contains the next twelve consecutive elements, two seasonal cycles. It can be seen from this figure that
which after preprocessing will be the RD-LSTM outputs. The the x-vectors express patterns of the time series fragments
time series fragments inside both windows are normalized after filtering out both level and seasonality. This pattern
by dividing them by the last value of the level in the input representation of the time series has been used successfully in
window, lt∗ , and then, divided further by the relevant seasonal earlier studies concerning ML forecasting models, especially
component. As a result of this operation, we obtain positive similarity based models [32]. Different definitions of the time
input and output values close to one. Finally, to limit the series patterns can be found in [33]. But these definitions
destructive impact of outliers on the forecasts, a squashing are fixed, while in this work we use dynamic patterns which
function, log(.), is applied. The resulting preprocessing can be change during learning (compare the patterns in the first and
expressed as follows: last training epochs in Fig. 3).
RD-LSTM operates on preprocessed time series values, xt .
yt In the postprocessing step, the forecasts generated by RD-
xt = log (2)
lt∗ st LSTM, x̂t , need to be ”unwound” in the following way:
where xt is the preprocessed t-th element of the time series,
ŷt = exp(x̂t )lt∗ st (3)
lt∗ is the last value of the level in input window ∆in , and st
is the t-th seasonal component. Note that both level lt∗ and seasonal component st in (3),
5
where the state activation function σc is a hyperbolic tangent clt = flt ⊗ clt−d + ilt ⊗ glt (10)
function.
Formulas related to the gates are as follows: hlt = olt ⊗ (σc (clt ) + hl−1
t ) (11)
E. Ensembling
The forecasts generated by the RD-LSTM model are en-
sembled at three levels:
1) Stage of training level. Averaging forecasts produced by
L most recent training epochs.
Fig. 5. RD-LSTM block. 2) Data subset level. Averaging forecasts from a pool of K
forecasting models which learns on the subsets of the
training set.
3) Model level. Averaging forecasts from R independent
runs of the pool of K models produced on the data
subset level.
The idea behind using the averaging forecasts produced in
a few most recent training epochs (first level of ensembling) is
that averaging has the effect of calming down the noisy SGD
optimization process. SGD uses mini-batches of training sam-
ples to estimate the actual gradients. The resulting searching
process operates on the approximated gradients which make
the trajectory noisy. Averaging the forecasts obtained in the
most recent epochs, when the algorithm converges around the
local minimum, may reduce the effect of stochastic searching
and produce more accurate forecasts.
Fig. 6. RD-LSTM network architecture. At the data subset level, the forecasts from K models which
learn on the subsets of the training set, Ψ1 , Ψ2 , ..., ΨK ,
are averaged. The training set, Ψ = {Φ1 , Φ2 , ..., ΦM }, is
olt = σg (Wlo hl−1
t + Vlo hlt−d + blo ) (15) composed of subsets Φi containing the training samples for the
i-th time series. To create the training Ψ-subsets, first, a set of
where superscript l indicates the layer number (from 2 to 4 M time series is split randomly into K subsets of similar size:
in our case, see below) and d is a dilation (3, 6 or 12 in our Θ1 , Θ2 , ..., ΘK . The k-th Ψ-subset contains the Φ-subsets for
case, see below). all time series excluding those in Θk , i.e. Ψk = Ψ\{Φi }i∈Θk .
The proposed RD-LSTM architecture, which is a result of Each of K models learns on its own training subset, Ψk , and
painstaking experimentation on M4 monthly data (48,000 time generates forecasts for the time series included in Ψk . Then
series), is depicted in Fig. 6. It is composed of four recurrent the K − 1 forecasts for each time series produced by the pool
layers and a linear unit LU. The first layer consists of the of K models are averaged.
standard LSTM block shown in Fig. 4. The subsequent three The last level of ensembling simply averages the forecasts
layers consist of RD-LSTM blocks (Fig. 5) with increasing for each time series generated in R independent runs of a pool
dilations d = 3, 6 and 12. The last element is a linear unit of K models. In each run, the training subsets Ψk are created
which transforms the output of the last layer, h4t , into the anew.
forecast of the output x-vector: Note that the diversity of learners, which is a key property
which governs the ensemble performance [36], has various
x̂out
t = Wx h4t + bx (16) sources in our proposed approach. They include (i) data
uncertainty: learning on mini-batches, learning on different
The learnable parameters of RD-LSTM are: input weights Ψ-subset of the training set, and (ii) parameter uncertainty:
W, recurrent weights V, and biases b. They are tuned in the learning using different initial values of the model parameters
cross-learning mode simultaneously with the ETS parameters in each run.
using SGD. The length of the cell and hidden states, m, The last two ensembling levels are depicted in Fig. 7. In
the same for all layers, was selected on the training set (see this figure, K = 4, so four training Ψ-subsets are created.
Section III for details). The set of time series included in the k-th Ψ-subsets in the r-
Note that an input is not a scalar but a vector representing a th run is denoted by Ykr in this figure, and the set of forecasts
sequence of the time series of length 12, i.e. a seasonal period. generated by the model in this case is denoted by Ŷkr . For each
It allows the RD-LSTM to be exposed to the immediate history time series R(K − 1) forecasts are averaged. This is shown
7
104
6000 6
5.5
5000 5
4.5
4000 4
E, GWh
E, GWh
3.5
3000 3
2.5
2000
2
1.5
1000
1
0.5
0 24 48 72 96 120 144 168 192 216 240 264 288 0 24 48 72 96 120 144 168 192 216 240 264 288
Months Months
• length of the cell and hidden states: m = 40, use Matlab R2018a implementation of MLP (function
• asymmetry parameter in pinball loss: τ = 0.4, feedforwardnet from Neural Network Toolbox).
• regularization parameter: λ = 50, • ANFIS – adaptive neuro-fuzzy inference system [41]. The
• ensembling parameters: L = 5, K = 4, R = 3. initial membership function parameters in the premise
The model was implemented in C++ relying on the DyNet parts of rules are determined using fuzzy c-means clus-
library [39]. It was compiled in Visual Studio 2017 (Windows tering. A hybrid learning method is applied for ANFIS
10) and run in parallel on an 8-core CPU (AMD Ryzen 7 training which uses a combination of the least-squares
1700, 3.0 GHz, 32 GB RAM). for consequent parameters and backpropagation gradient
descent method for premise parameters. The ANFIS
hyperparameters are: input pattern length and number of
A. Comparative Models
rules. The Matlab R2018a implementation of ANFIS was
The proposed model was compared with state-of-the-art used (function anfis from Fuzzy Logic Toolbox).
models based on ML as well as classical statistical models • k-NNw+ETS, FNM+ETS, N-WE+ETS, GRNN+ETS,
such as ARIMA and ETS. All ML models except LSTM use MLP+ETS, ANFIS+ETS – variants of the above models
a pattern representation of time series [14]. Patterns which with output patterns encoded with variables describing
express preprocessed repetitive sequences in a time series the current features of the time series. These features
ensure input and output data unification through trend filtering are the mean value of the series in the seasonal cycle
and variance equalization. Consequently, the relationship be- and its dispersion. To postprocess the forecasted output
tween input and output data is simplified and the transformed pattern, the coding variables are predicted for the next
forecasting problem can be solved using simple models. period using ETS. We use R implementation of ETS (see
The comparative models are described in detail in [14] and below).
outlined below. • LSTM – long short-term memory, where the responses
• k-NNw – k-nearest neighbor weighted regression model. are the training sequences with values shifted by one time
It estimates a vector-valued regression function as an step (a sequence-to-sequence regression LSTM network).
average of output patterns in a varying neighborhood of a For multiple time steps, after one step was predicted the
query pattern. A weighting function allows the model to LSTM state was updated. Previous prediction was used
take into account the similarity between the query pattern as input to LSTM, producing a forecast for the next time
and its nearest neighbors. The model hyperparameters step. LSTM was optimized using Adam (adaptive mo-
are: input pattern length and number of nearest neighbors ment estimation) optimizer. The length of the hidden state
k. Linear weighting function was used. was the only hyperparameter to be tuned. Other hyperpa-
• FNM – fuzzy neighborhood model. In this case, the rameters remain at their default values. The experiments
regression model is similar to k-NNw except that it were carried out using Matlab R2018a implementation of
aggregates not only k-nearest neighbors of the query LSTM (function trainNetwork from Neural Network
pattern but all training patterns. The weighting function Toolbox).
takes the form of a membership function (Gaussian- • ARIMA – ARIMA(p, d, q)(P, D, Q)12 model imple-
type function) which assigns to each training pattern a mented in function auto.arima in R environment
degree of membership to the query pattern neighborhood. (package forecast). This function implements auto-
The model hyperparameters are: input pattern length and matic ARIMA modeling which combines unit root tests,
membership function width. minimization of the Akaike information criterion (AICc)
• N-WE – NadarayaWatson estimator. It is a kernel regres- and maximum likelihood estimation to obtain the optimal
sion model belonging to the same category of nonpara- ARIMA model [31].
metric models as k-NNw and FNM. N-WE estimates the • ETS – exponential smoothing state space model [43] im-
regression function as a locally weighted average, using plemented in function ets (R package forecast). This
a kernel function (Gaussian) as a weighting function. implementation includes many types of ETS models de-
The model hyperparameters are: input pattern length and pending on how the seasonal, trend and error components
kernel bandwidth parameters. are taken into account. They can be expressed additively
• GRNN – general regression NN model. This is a four- or multiplicatively, and the trend can be damped or not.
layer NN with Gaussian nodes centered at training pat- As in the case of auto.arima, ets returns the optimal
terns. The node outputs express similarities between the model estimating its parameters using AICc.
query pattern and the training patterns. These outputs
are treated as the weights of the training patterns in the The models: k-NNw, FNM, N-WE and GRNN are also
regression model. The model hyperparameters are: input known as pattern similarity-based forecasting models (PSFMs)
pattern length and bandwidth parameter for nodes. because the forecast is constructed by aggregating the training
• MLP – multilayer perceptron [40] with a single hidden output patterns using similarity between the query pattern and
layer and sigmoidal neurons. It used for learning the training input patterns [14]. All the model hyperparameters
Levenberg-Marquardt method with Bayesian regulariza- mentioned above were selected on the training set in grid
tion to prevent overfitting. The MLP hyperparameters are: search procedures. The NN-based models, i.e. MLP, ANFIS
input pattern length and number of hidden nodes. We and LSTM, due to the stochastic nature of the learning
9
TABLE I TABLE II
R ESULTS COMPARISON AMONG PROPOSED AND COMPARATIVE MODELS . D ESCRIPTIVE STATISTICS OF PERCENTAGE ERRORS .
Model Median APE MAPE IQR RMSE Mean Median Std Skewness Kurtosis
k-NNw 2.89 4.99 3.85 368.79 k-NNw 1.87 1.08 10.43 5.14 44.56
FNM 2.88 4.88 4.26 354.33 FNM 2.03 1.22 9.34 4.16 35.14
N-WE 2.84 5.00 3.97 352.01 N-WE 1.91 1.18 10.82 5.41 48.94
GRNN 2.87 5.01 4.02 350.61 GRNN 1.87 1.16 10.60 5.34 48.11
k-NNw+ETS 2.71 4.47 3.52 327.94 k-NNw+ETS 1.25 0.20 9.00 4.40 37.30
FNM+ETS 2.64 4.40 3.46 321.98 FNM+ETS 1.26 0.11 8.80 4.75 41.71
N-WE+ETS 2.68 4.37 3.36 320.51 N-WE+ETS 1.26 0.17 8.68 4.63 40.75
GRNN+ETS 2.64 4.38 3.51 324.91 GRNN+ETS 1.26 0.11 8.61 4.42 38.38
MLP 2.97 5.27 3.84 378.81 MLP 1.37 0.68 11.88 7.52 109.64
MLP+ETS 3.11 4.80 4.12 358.07 MLP+ETS 1.71 1.03 7.32 1.55 11.83
ANFIS 3.56 6.18 4.87 488.75 ANFIS 2.51 1.43 11.37 4.35 34.93
ANFIS+ETS 3.54 6.32 4.26 464.29 ANFIS+ETS 1.30 0.40 12.65 0.96 39.37
LSTM 3.73 6.11 4.50 431.83 LSTM 3.12 1.81 9.49 2.86 22.21
ARIMA 3.32 5.65 5.24 463.07 ARIMA 2.35 1.03 13.62 9.01 119.20
ETS 3.50 5.05 4.80 374.52 ETS 1.04 0.31 7.97 1.89 13.52
ETS+RD-LSTM 2.74 4.48 3.55 347.24 ETS+RD-LSTM 1.11 0.27 10.07 6.37 63.61
processes return different results for the same data. In this downward trend observed in the previous period from 2010
study, 100 independent learning sessions for these models were to 2013. The reverse situation for FR caused a slight over-
performed and the final errors were calculated as averages over estimation of forecasts. For GB data, ETS+RD-LSTM with
100 trials. M AP E = 8.52% was the least accurate model, and for FR
data, with M AP E = 5.49%, it was one the most accurate
models.
B. Results
Table II shows basic descriptive statistics of the percentage
Table I shows the results of forecasting for the proposed errors (PEs) and Fig. 13 depicts the empirical probability
and comparative models: median of absolute percentage error density functions of PEs for the selected models. The PE
(APE), mean APE (MAPE), interquartile range of APE as distributions are similar to normal but tests for the assessment
a measure of the forecast dispersion, and root mean square of normality (JarqueBera test and Lilliefors test) do not
error (RMSE). These are values averaged over 35 countries. confirm this. In all cases, the forecasts are overestimated,
The lowest errors are for ”+ETS” PSBMs, where the coding having a positive PE mean. Note that ETS+RD-LSTM is one
variables are predicted using ETS. The errors for ETS+RD- of the least biased models. Positive values of skewness indicate
LSTM are slightly higher but lower than for all other models. the right-skewed PE distributions, and high kurtosis values
More detailed results are shown in Figs 9 and 10. Fig. 9 indicate leptokurtic distributions where the probability mass
depicts MAPE for each country. As can be seen from this is concentrated around the mean.
figure, ETS+RD-LSTM is one of the most accurate models
in most cases. Fig. 10 shows MAPE for each month of the
forecasted period. Note lower errors for months 8–10 and IV. C ONCLUSION
higher for months 1–4 and 12. ETS+RD-LSTM achieved In this work, we proposed and empirically validated a new
better results than most of the comparative models for the architecture for mid-term load forecasting. It was inspired by
months 6–12. For months 1 and 3 it achieved the highest errors the winning submission to the M4 forecasting competition
compared to other models. 2018, which combines ETS, advanced LSTM and ensembling.
Rankings of the models are shown in Fig. 11. These are The model has a hierarchical structure composed of a global
based on average ranks of the models in the rankings for part learned across many time series (weights of the LSTM)
individual countries. The left panel shows the ranking based with a time series specific part (ETS smoothing coefficients
on MAPE and the right panel shows the ranking based on and initial seasonal components). Using stochastic gradient
RMSE. Note the high position of ETS+RD-LSTM. It is in descent, it learns a mapping from input vectors to output
first position in MAPE ranking and third position in RMSE vectors and time series representations at the same time. ETS-
ranking. Note that in the latter case the difference between the inspired formulas, extract the main components of individual
first three positions is very small. time series to be used for deseasonalization and normalization.
Examples of forecasts produced by the models for six Preprocessed time series are forecasted using residual dilated
countries are depicted in Fig. 12. For PL data, most models LSTM. Due to the introduction of dilated recurrent skip con-
including ETS+RD-LSTM do not exceed M AP E = 2%, nections and a spatial shortcut path from lower layers, LSTM
which should be considered a very good result. Similar results is able to capture better long-term seasonal relationships and
were achieved for ES, IT and DE. In these cases MAPE ensure more efficient training. To deal with forecast bias we
for ETS+RD-LSTM were respectively: 1.61% (most accurate used a pinball loss function with a parameter controlling its
model), 2.12% (third most accurate model) and 2.29%. For asymmetry. In addition, we penalized the loss function to
GB the forecasts are underestimated. This results from the prevent overfitting. Three-level ensembling ensures powerful
fact that demand went up unexpectedly in 2014 despite the regularization reducing the model variance, which has sources
10
AT BA BE BG CH CY CZ
4 10 20 10
5
5 k-NNw
MAPE
5
2 5 10 5
FNM
0 0 0 0 0 0 0 N-WE
0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15
DE #model DK EE ES FI FR GB
5 10 10 4 10 10 GRNN
5
5 5 2 5 5 k-NNw+ETS
FNM+ETS
0 0 0 0 0 0 0
0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15
GR HR HU IE IS IT LT N-WE+ETS
10 5 4 5 5 5
2
GRNN+ETS
5 2 1
MLP
0 0 0 0 0 0 0
0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 MLP+ETS
LU LV ME MK NI NL NO
10 50 10 5 10
5 10 ANFIS
5 5 5 5 ANFIS+ETS
0 0 0 0 0 0 0 LSTM
0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15
PL PT RO RS SE SI SK ARIMA
10 5 5
5 10
2 2 ETS
5
1 1 5
ETS+RD-LSTM
0 0 0 0 0 0 0
0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15
12
k-NNw [4] E. H. Barakat, “Modeling of nonstationary time-series data. Part II.
10
FNM
N-WE
Dynamic periodic trends,“ Electr Power Energy Systems, vol. 23, pp.
GRNN 6368, 2001.
k-NNw+ETS
8 FNM+ETS [5] E. Gonzlez-Romera, M.A. Jaramillo-Morn, and D. Carmona-Fernndez,
N-WE+ETS
“Monthly electric energy demand forecasting with neural networks
MAPE
GRNN+ETS
6
MLP and Fourier series,“ Energy Conversion and Management, vol. 49, pp.
MLP+ETS
4 ANFIS 31353142, 2008.
ANFIS+ETS
LSTM [6] J. F. Chen, S. K. Lo, and Q. H. Do, “Forecasting monthly electricity
2 ARIMA
ETS
demands: An application of neural networks trained by heuristic algo-
0
ETS+RD-LSTM rithms,“ Information, vol. 8(1), 31, 2017.
1 2 3 4 5 6 7 8 9 10 11 12 [7] M. Gavrilas, I. Ciutea, and C. Tanasa, “Medium-term load forecasting
Months
with artificial neural network models,“ Proc. Conf. International Confer-
ence and Exhibition on Electricity Distribution, vol. 6, 2001.
Fig. 10. MAPE for each month of the forecasted period. [8] E. Doveh, P. Feigin, and L. Hyams, “Experience with FNN models for
medium term power demand predictions,“ IEEE Trans. Power Syst., vol.
14(2), pp. 538546, 1999.
ETS+RD-LSTM
k-NNw+ETS
FNM+ETS [9] P. Pełka and G. Dudek, “Medium-term electric energy demand forecasting
k-NNw+ETS
N-WE+ETS
FNM+ETS
ETS+RD-LSTM using generalized regression neural network,“ Proc. Conf. Information
N-WE+ETS
GRNN+ETS
N-WE
GRNN+ETS Systems Architecture and Technology ISAT 2018, Advances in Intelligent
N-WE
FNM FNM Systems and Computing, vol. 853, pp. 218227, Springer, Cham, 2019.
k-NNw ETS
GRNN ARIMA [10] T. Ahmad and H. Chen, “Potential of three variant machine-learning
MLP k-NNw
ARIMA MLP+ETS models for forecasting district level medium-term and long-term energy
MLP+ETS GRNN
ETS MLP demand in smart grid environment,“ Energy, vol. 160, pp. 10081020,
ANFIS+ETS ANFIS+ETS
ANFIS ANFIS 2018.
LSTM LSTM
[11] P. C. Chang, C. Y. Fan, and J. J. Lin, “Monthly electricity demand
0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14
MAPE rank RMSE rank forecasting based on a weighted evolving fuzzy neural network approach,“
Electrical Power and Energy Systems, vol. 33, pp. 1727, 2011.
Fig. 11. Rankings of the models. [12] S. Baek, “Mid-term Load Pattern Forecasting With Recurrent Artificial
Neural Network,“ in IEEE Access, vol. 7, pp. 172830172838, 2019.
[13] W. Zhao, F. Wang, and D. Niu, “The application of support vector
machine in load forecasting,“ Journal of Computers, vol. 7(7), pp.
in the stochastic nature of SGD, and also in data and parameter 16151622, 2012.
uncertainty. [14] G. Dudek and P. Pełka, “Pattern Similarity-based Machine Learn-
We applied the proposed model to the monthly electricity ing Methods for Mid-term Load Forecasting: A Comparative Study,“
arXiv:2003.01475, 2020.
demand forecasting for 35 European countries. The results [15] H. Hewamalage, C. Bergmeir, and K. Bandara, “Recurrent neural
demonstrated its state-of-the-art performance and competitive- networks for time series forecasting: Current status and future directions,“
ness with classical models such as ARIMA and ETS as well arXiv:1909.00590v3, 2019.
[16] K. Yan, X. Wang ,Y. Du ,N. Jin , H. Huang, and H. Zhou, “Multi-
as state-of-the-art models based on ML. Step Short-Term Power Consumption Forecasting with a Hybrid Deep
Learning Strategy,“ Energies, vol. 11(11), 3089, 2018.
R EFERENCES [17] J. Bedi and D. Toshniwal, “Empirical mode decomposition based deep
learning for electricity demand forecasting,“ IEEE Access, vol. 6, pp.
[1] F. Apadula, A. Bassini, A. Elli, and S. Scapin, “Relationships between 4914449156, 2018.
meteorological variables and monthly electricity demand,“ Appl Energy, [18] H. Zheng, J. Yuan, and L. Chen, “Short-Term Load Forecasting Using
vol. 98, pp. 346356, 2012. EMD-LSTM Neural Networks with a XGBboost Algorithm for Feature
[2] E. Dogan, “Are shocks to electricity consumption transitory or permanent? Importance Evaluation,“ Energies 2017, vol. 10(8), 1168, 2017.
Sub-national evidence from Turkey,“ Utilities Policy, vol. 41 pp. 7784, [19] A. Narayan and K. W. Hipel, “Long short term memory networks
2016. for short-term electric load forecasting,“ Proc. Conf. IEEE International
[3] L. Suganthi and A. A. Samuel, “Energy models for demand forecasting Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, pp.
A review,“ Renew Sust Energy Rev, vol. 16(2), pp. 12231240, 2002. 25732578, 2017.
11
3.2 5.5
1.3
3 5
E, GWh
E, GWh
E, GWh
2.6 4
1.1
2.4 3.5
1 2.2 3
0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12
Months Months Months
104 DE 104 ES 104 IT
k-NNw
5 2.5 3.2
FNM
N-WE
4.8 2.4
3 GRNN
k-NNw+ETS
4.6 2.3 FNM+ETS
2.8 N-WE+ETS
E, GWh
E, GWh
E, GWh
GRNN+ETS
4.4 2.2
MLP
2.6 MLP+ETS
4.2 2.1 ANFIS
ANFIS+ETS
2.4 LSTM
4 2
ARIMA
ETS
3.8 1.9 2.2 ETS+RD-LSTM
0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12 Real
Months Months Months
Fig. 12. Real and forecasted monthly electricity demand for selected countries.
0.12 Forecasting Part 1: Principles,“ Applied Soft Computing, vol. 37, pp.
0.1
FNM
N-WE+ETS
277287, 2015.
MLP+ETS [33] G. Dudek, “Pattern Similarity-based Methods for Short-term Load
0.08 LSTM
Forecasting Part 1: Principles,“ Applied Soft Computing, vol. 37, pp.
PDF(PE)
ETS
0.06 ETS+RD-LSTM 277287, 2015.
0.04
[34] S. Hochreiter and J. Schmidhuber, “Long short-term memory,“ Neural
Computation, vol. 9(8), pp.17351780, 1997.
0.02
[35] K. Greff, R. K. Srivastava, J. Koutnk, B. R. Steunebrink and J.
0 Schmidhuber, “LSTM: A Search Space Odyssey,“ in IEEE Transactions
-20 -15 -10 -5 0 5 10 15 20
PE on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 22222232,
Oct. 2017.
Fig. 13. PDF of the percentage errors. [36] F. Petropoulos, R. J. Hyndman, and C. Bergmeir, “Exploring the sources
of uncertainty: Why does bagging for time series forecasting work?,“
European Journal of Operational Research, vol. 268(2), pp. 545554,
2008.
[20] J. Toubeau, J. Bottieau, F. Valle, and Z. De Grve, “Deep learning-based [37] F. Chan and L. L. Pauwels, “Some theoretical results on forecast
multivariate probabilistic forecasting for short-term scheduling in power combinations,“ International Journal of Forecasting, vol. 34(1), pp. 6474,
markets,“ IEEE Transactions on Power Systems, vol. 34(2), pp. 12031215, 2018.
2019. [38] I. Takeuchi, Q. V. Le, T. D. Sears, and A. J. Smola, “Nonparametric
[21] T. Zia, and S. Razzaq, “Residual recurrent highway networks for learning quantile estimation“ Journal of Machine Learning Research, vol. 7, pp.
deep sequence prediction models,“ Journal of Grid Computing, Jun 2018. 12311264, 2006.
[22] B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio, “N-BEATS: [39] G. Neubig, C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A.
Neural basis expansion analysis for interpretable time series forecasting,“ Anastasopoulos, et al., “DyNet: The dynamic neural network toolkit,“
arXiv:1905.10437v4, 2020. arXiv:1701.03980, 2017
[23] D. Salinas, V. Flunkert, and J. Gasthaus, “DeepAR: Probabilistic Fore- [40] P. Pełka and G. Dudek, “Pattern-based forecasting monthly electricity
casting with Autoregressive Recurrent Networks,“ arXiv:1704.04110v3, demand using multilayer perceptron,“ Proc. Conf. Artificial Intelligence
2019. and Soft Computing ICAISC 2019, Lecture Notes in Artificial Intelli-
gence, vol. 11508, pp. 663672, Springer, Cham, 2019.
[24] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and
[41] P. Pełka and G. Dudek, “Neuro-Fuzzy System for Medium-term Electric
T. Januschowski, “Deep state space models for time series forecasting“
Energy Demand Forecasting,“ Proc. Conf. Information Systems Archi-
In NeurIPS, vol. 31, pp. 77857794, 2018.
tecture and Technology ISAT 2017, Advances in Intelligent Systems and
[25] S. Makridakis, E. Spiliotis, and V.Assimakopoulos, “The M4 Com- Computing, vol. 655, pp. 3847, Springer, Cham, 2018.
petition: Results, findings, conclusion and way forward,“ International [42] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, “Statistical and
Journal of Forecasting, vol. 34(4). pp. 802808, 2018. machine learning forecasting methods: Concerns and ways forward,“
[26] S. Makridakis, E. Spiliotis, and V.Assimakopoulos, “The M4 Compe- PLoS ONE, vol. 13(3), 2018.
tition: 100,000 time series and 61 forecasting methods,“ International [43] R. J. Hyndman, A. B. Koehler, J. K. Ord, R. D. Snyder, Forecasting
Journal of Forecasting, vol. 36(1): pp. 5474, 2020. with exponential smoothing: The state space approach, Springer, 2008.
[27] S. Smyl, “A hybrid method of exponential smoothing and recurrent
neural networks for time series forecasting,“ International Journal of
Forecasting, vol. 36(1), pp. 7585, 2020.
[28] S. Chang, Y. Zhang, W. Han., M. Yu, X. Guo, W. Tan, et al. “Dilated
recurrent neural networks,“ arXiv: 1710.02224, 2017.
[29] J. Kim, M. El-Khamy, and J. Lee, “Residual LSTM: Design of a deep
recurrent architecture for distant speech recognition, “ arXiv:1701.03360,
2017.
[30] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. W. Cottrell.
“A dual-stage attention-based recurrent neural network for time series
prediction,“ In IJCAI-17, pp. 26272633, 2017.
[31] R. J. Hyndman and G. Athanasopoulos, Forecasting: principles and
practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2
(2018) Accessed on 7 March 2020.
[32] G. Dudek, “Pattern Similarity-based Methods for Short-term Load