
A Hybrid Residual Dilated LSTM and Exponential Smoothing Model for Mid-Term Electric Load Forecasting

Grzegorz Dudek, Paweł Pełka, and Slawek Smyl

Abstract—This work presents a hybrid and hierarchical deep learning model for mid-term load forecasting. The model combines exponential smoothing (ETS), advanced Long Short-Term Memory (LSTM) and ensembling. ETS extracts dynamically the main components of each individual time series and enables the model to learn their representation. Multi-layer LSTM is equipped with dilated recurrent skip connections and a spatial shortcut path from lower layers to allow the model to better capture long-term seasonal relationships and ensure more efficient training. A common learning procedure for LSTM and ETS, with a penalized pinball loss, leads to simultaneous optimization of data representation and forecasting performance. In addition, ensembling at three levels ensures a powerful regularization. A simulation study performed on the monthly electricity demand time series for 35 European countries confirmed the high performance of the proposed model and its competitiveness with classical models such as ARIMA and ETS as well as state-of-the-art models based on machine learning.

Index Terms—deep learning, exponential smoothing, Long Short-Term Memory, mid-term load forecasting, recurrent neural networks, time series forecasting.

G. Dudek and P. Pełka are with the Department of Electrical Engineering, Czestochowa University of Technology, 42-200 Czestochowa, Al. Armii Krajowej 17, Poland, e-mail: dudek@el.pcz.czest.pl, p.pelka@el.pcz.czest.pl.
S. Smyl is with Uber Technologies, 555 Market St, 94104, San Francisco, CA, USA, e-mail: slaweks@hotmail.co.uk.
Manuscript received ..., ...; revised ..., ....

I. INTRODUCTION

Electricity demand forecasting is an essential tool in all sectors of the electric power industry. Mid-term load forecasting (MTLF), which involves forecasting the daily peak load for future months as well as monthly electricity demand, is necessary for power system operation and planning in such areas as maintenance scheduling, fuel reserve planning, hydro-thermal coordination, planning of electrical energy import and export and also security assessment. In deregulated power systems, MTLF is a basis for the negotiation of forward contracts. Forecast accuracy translates directly into financial performance for energy companies and energy market participants. The financial impact can be measured in millions of dollars for every point of forecasting accuracy gained. All the above reasons justify interest in new forecasting tools for MTLF.

In this work, we focus on monthly electricity demand forecasting. Monthly electricity demand time series express a nonlinear trend, yearly cycles and a random component. The trend is dependent on the rate of economic development in a country, while seasonal cycles are related to the climate, weather factors and the variability of seasons [1]. Factors disrupting electricity demand in a mid-term horizon include unpredictable economic events and political decisions [2].

MTLF methods can be divided into statistical/econometric models or machine learning/computational intelligence models [3]. Typical examples of the former are ARIMA, ETS, and linear regression. ARIMA and ETS can deal with seasonal time series but linear regression requires additional operations such as decomposition or the extension of the model with periodic components [4]. Statistical models have undoubted advantages, being relatively simple, robust, efficient, and automatic, so that they can be used by non-expert users.

The flexibility of machine learning (ML) models has increased researchers' interest in them as MTLF tools [5]. Of these, neural networks (NNs) are the most explored because of their attractive features such as learning capability, universal approximation property, nonlinear modeling, massive parallelism and ease of specifying a loss function, to align it better with forecasting goals. Some examples of using different architectures of NNs for MTLF are: [6] where NN learns on historical demand and weather factors, [7] where Kohonen NN was used, [8] where NNs were supported by fuzzy logic, [9] where generalized regression NN was used, [10] where NNs, linear regression and AdaBoost were used, [11] where weighted evolving fuzzy NNs were combined, and [12] where recurrent NNs were used. Among other machine learning MTLF models, the following can be mentioned: support vector machines [13], and pattern similarity-based models [14].

Recent trends in ML such as deep learning, especially deep recurrent NNs (RNNs), are very attractive for time series forecasting [15]. RNNs with connections between nodes forming a directed graph along a temporal sequence are able to exhibit temporal dynamic behavior using their internal state (memory) to process sequences of inputs. Recent works have reported that RNNs, such as the LSTM, provide high accuracy in forecasting and outperform most of the traditional statistical and ML methods such as ARIMA, support vector machine and shallow NNs [16]. There are many examples of an application of LSTMs to load forecasting: [17], [18], [19].

Many new ideas in the field of deep learning have been successfully applied to time series forecasting. For example, in [20] bidirectional LSTM is proposed for short-term scheduling in power markets. This solution has two benefits: long-range memory and bidirectional processing. It takes advantage of deep architectures, which are able to build up progressively higher level representations of data, by piling up RNN layers on top of each other.
A residual recurrent highway network for learning deep sequence predictions was proposed in [21]. It contains highways within the temporal structure of the network for unimpeded information propagation, thus alleviating the gradient vanishing problem. Hierarchical structure learning is posed as a residual learning framework to prevent performance degradation problems. Another example of using a new deep learning solution for time series forecasting is the N-BEATS model proposed in [22]. Its architecture is based on backward and forward residual links and a deep stack of fully-connected layers. N-BEATS has a number of desirable properties, being interpretable, applicable without modification to a wide array of target domains, and fast to train.

In addition to point forecasting, deep learning also enables probabilistic forecasting. A model proposed in [23], DeepAR, produces accurate probabilistic forecasts, based on training an auto-regressive RNN on a large number of related time series. This model makes probabilistic forecasts in the form of Monte Carlo samples that can be used to compute consistent quantile estimates for all sub-ranges in the prediction horizon. Another solution for probabilistic time series forecasting was proposed in [24]. It combines state space models with deep learning. By parametrizing a per-time-series linear state space model with a jointly-learned RNN, the method retains the desired properties of state space models such as data efficiency and interpretability, while making use of the ability to learn complex patterns from raw data offered by deep learning approaches.

To improve forecasting performance, RNN is also mixed with other methods such as ETS. Such a model won the M4 forecasting competition in 2018 [25]. This competition utilized 100,000 real-life time series, and incorporated all major forecasting methods, including those based on AI and ML, as well as traditional statistical ones [26]. The winning model, developed by Slawek Smyl [27], is a hybrid approach utilizing both statistical and ML features. It combined ETS with advanced LSTM, which is supported by such mechanisms as dilation, residual connections and attention [28], [29], [30]. It produced the most accurate forecasts as well as the most precise prediction intervals. According to sMAPE, it was close to 10% more accurate than the combination benchmark of the competition, which is a huge improvement. For monthly data (48,000 time series) it outperformed all other 60 submissions, achieving the highest accuracy according to each of the three performance measures.

In this work, we propose a forecasting model for MTLF based on the winning submission to the M4 competition for monthly data. It combines ETS, LSTM and ensembling. ETS enables the model to capture the main components of the individual time series, such as seasonality and level, while LSTM allows nonlinear trends and cross-learning. A common learning procedure for LSTM and ETS, with a penalized pinball loss, leads to simultaneous optimization of data representation and forecasting performance. Ensembling at three levels is a powerful regularization method which reduces the model variance.

The rest of the work is organized as follows. Section 2 describes the proposed forecasting model: its architecture, features, components and implementation details as well as data flow and processing. Section 3 describes the experimental framework used to evaluate the performance of the proposed model. Finally, Section 4 concludes the work.

II. FORECASTING MODEL

The proposed model is based on the winning submission to the M4 forecasting competition 2018 for monthly data and point forecasts [27]. It is a hybrid and hierarchical forecasting model that enables ETS and advanced LSTM to be mixed into a common framework. The model architecture, its specific features and components are described below.

A. Framework and Features

The proposed forecasting model is shown in Fig. 1. It is composed of:
• ETS – which is a Holt-Winters type multiplicative seasonal model. It is used for extracting two components from the time series: level and seasonality. ETS loads a set of time series (Y), calculates the level and seasonal components individually for each series and returns sets of levels (L) and seasonal components (S).
• Preprocessing – the level and seasonal components are used for deseasonalization and adaptive normalization of the time series. The inputs to the preprocessing module are: the set of time series Y and the sets of level and seasonal components, L and S, respectively. The preprocessed data are divided into input and output training data and returned in training set Ψ.
• RD-LSTM – which is a residual dilated LSTM composed of four layers. Due to its recurrent nature, this model is capable of learning long-term dependencies in sequential data. RD-LSTM learns in cross-learning mode on training set Ψ. The forecasts for all time series produced by RD-LSTM are returned in set X̂.
• Postprocessing – the forecasts of the deseasonalized and normalized time series are "reseasonalised" and renormalised. The inputs to the postprocessing module are: the forecasts X̂, and the level and seasonality sets, L and S. The output is set Ŷ, containing the forecasts for each time series.
• Ensembling – the forecasts produced by individual models are averaged. This enhances the robustness of the method further, mitigating model and parameter uncertainty. The ensembling module receives the sets of forecasts produced by individual models, Ŷ_k^r, aggregates them and returns a set of forecasts for all time series, Ŷ_avg.
• Stochastic gradient descent (SGD) – the parameters of both ETS and RD-LSTM are updated by the same overall optimization procedure, SGD, with the overarching goal of minimizing forecasting errors.

The proposed model has a hierarchical structure, i.e. the data are exploited in a hierarchical manner. Both local and global time series features are extracted. The global features are learned by RD-LSTM across many time series. The specific features of each individual time series are extracted by ETS.
Fig. 1. Block diagram of the ETS+RD-LSTM forecasting system.

Thus, each series has a partially unique and partially shared model.

Note the hybrid structure of the model, where statistical modeling is combined concurrently with ML algorithms. The model combines ETS, advanced LSTM and ensembling. ETS is focused on each individual series and enables the model to capture its main components such as seasonality and level. These components are used for time series preprocessing, normalization and deseasonalization.

An advanced LSTM-based RNN allows non-linear trends and cross-learning. This is an extended, multilayer version of LSTM with residual dilated LSTM blocks. The dilated recurrent skip connections and the spatial shortcut path from lower layers, applied in this solution, allow the model to better capture long-term seasonal relationships and ensure more efficient training. The RD-LSTM model is trained on many time series (cross-learning). To train deep NNs, which have many parameters, cross-learning is necessary. Moreover, it enables the method to capture the shared features and components of the time series.

ETS and RD-LSTM are optimized simultaneously, i.e. the ETS parameters and the RD-LSTM weights are optimised by SGD at the same time. The same overall learning procedure optimizes the model, including data preprocessing. So, the learning process includes representation learning – searching for the most suitable representations of input and output data (individually for each time series), which ensures the most accurate forecasts. It is worth noting the dynamical character of the training set, which is related to representation learning. The training set is updated in each epoch of RD-LSTM learning. This is because SGD updates the ETS parameters in each epoch, and therefore the level and seasonal components, used for preprocessing, are updated as well.

Ensembling is seen as a much more powerful regularization technique than more popular alternatives, e.g. dropout or L2-norm penalty [22]. In our case, ensembling combines individual forecasts at three levels: stage of training level, data subset level and model level. This reduces the variance related to the stochastic nature of SGD, and also related to data and parameter uncertainty.
B. Exponential Smoothing

The complex nature of a time series, e.g. nonstationarity, non-linear trend and seasonal variations, makes forecasting difficult and puts high demands on the models. A typical approach in this case is to simplify the forecasting problem by deseasonalization, detrending or decomposition. A time series is usually decomposed into seasonal, trend and stochastic components. The components, expressing less complexity than the original time series, can be modeled independently using simpler models. The most popular methods of decomposition are [31]: additive decomposition, multiplicative decomposition, X11, SEATS, and STL. Although very useful, this approach has a drawback. It separates the preprocessing from the forecasting, which results in the final solution not being optimal. Some classic statistical models, such as ETS, employ a better way: the forecasting model has a built-in mechanism to deal with seasonality. The final model uses the optimal decomposition of the time series.

In our approach, we use ETS as the preprocessing tool. ETS extracts two components from the time series: level (smoothed value) and seasonality. Then we use these components to normalize and deseasonalize the original time series. The preprocessed time series are forecasted by RD-LSTM. ETS and RD-LSTM are optimised simultaneously using SGD. So the resulting forecasting model, including data preprocessing, is optimized as a whole. This distinctive feature of the proposed approach needs to be emphasized.

The ETS model used in this study was inspired by the Holt-Winters multiplicative seasonal model. However, it has been simplified by the removal of the linear trend component. This is because trend forecasting is the task of RD-LSTM, which is able to produce a non-linear trend, which is more valuable in our case. The updating formulas for the ETS model with a seasonal cycle length of twelve (useful for monthly data) are as follows [27]:

    l_t = \alpha \frac{y_t}{s_t} + (1 - \alpha) l_{t-1},
    s_{t+12} = \beta \frac{y_t}{l_t} + (1 - \beta) s_t          (1)

where y_t is the time series value at timepoint t, l_t and s_t are the level and seasonal components, respectively, and α, β ∈ [0, 1] are smoothing coefficients.

The level equation shows a weighted average between the seasonally adjusted observation and the level for time t−1. The seasonal equation expresses a seasonal component for time t+12 as a weighted average between a new estimate of the seasonality component (y_t/l_t) and the past estimate (s_t). Fig. 2 depicts an example of the monthly electricity demand time series and its level and seasonal components obtained from (1).

The ETS model parameters, twelve initial seasonal components and two smoothing coefficients for each time series, were adjusted together with the RD-LSTM weights by SGD. Knowing these parameters allows the level and seasonal components to be calculated, which are then used for preprocessing: deseasonalization and normalization.
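To make the update rules (1) concrete, the following NumPy sketch computes the level and seasonal components for a single monthly series. It is only an illustration of the formulas: the function name ets_decompose is ours, and the twelve initial seasonal components are set by a simple heuristic here, whereas in the proposed model they are learnable parameters tuned by SGD together with α and β.

```python
import numpy as np

def ets_decompose(y, alpha, beta, m=12):
    """Level and seasonal components from the updating formulas (1).
    The initial seasonal components are set heuristically (assumption);
    in the paper they are learned by SGD."""
    y = np.asarray(y, dtype=float)
    seas = list(y[:m] / y[:m].mean())            # initial s_1..s_12 (assumption)
    level = np.empty(len(y))
    for t in range(len(y)):
        prev = level[t - 1] if t > 0 else y[0] / seas[0]   # initial level (assumption)
        level[t] = alpha * y[t] / seas[t] + (1 - alpha) * prev       # level update
        seas.append(beta * y[t] / level[t] + (1 - beta) * seas[t])   # s_{t+12}
    return level, np.array(seas)   # seas extends 12 steps beyond the history

# Toy usage: four years of synthetic monthly demand with trend and yearly cycle.
t = np.arange(48)
demand = 1e4 + 30 * t + 1500 * np.sin(2 * np.pi * t / 12)
level, seas = ets_decompose(demand, alpha=0.5, beta=0.05)
```

Note that the seasonal recursion produces components for twelve months beyond the end of the history, which is what the postprocessing step later uses to reseasonalise the forecast horizon.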
Fig. 2. Original time series and its level and seasonal components.

Fig. 3. Examples of the input (left panel) and output (right panel) vectors created for the time series shown in Fig. 2. The upper panel shows x-vectors for the first two cycles, the lower panel shows x-vectors for the last two cycles. The x-vectors created in the first training epoch are shown in blue, those in the last epoch in red.

C. Pre- and Postprocessing

The level and seasonal components are calculated for all points of each series, and are then used for deseasonalization and adaptive normalization during the on-the-fly preprocessing. This is the most crucial element of the forecasting procedure as it determines its performance. The time series is preprocessed in each training epoch using the updated values of the level and seasonal components. These updated values are calculated from (1), where the ETS parameters are increasingly fine tuned in each epoch by SGD.

The time series is preprocessed using rolling windows: input and output ones. Both windows have a length of twelve, which is equal to the length of both the seasonal cycle and the forecast horizon. The input window, ∆in, contains twelve consecutive elements of the time series which after preprocessing will be the RD-LSTM inputs. The corresponding output window, ∆out, contains the next twelve consecutive elements, which after preprocessing will be the RD-LSTM outputs. The time series fragments inside both windows are normalized by dividing them by the last value of the level in the input window, l_t*, and then divided further by the relevant seasonal component. As a result of this operation, we obtain positive input and output values close to one. Finally, to limit the destructive impact of outliers on the forecasts, a squashing function, log(.), is applied. The resulting preprocessing can be expressed as follows:

    x_t = \log\left(\frac{y_t}{l_{t^*} s_t}\right)          (2)

where x_t is the preprocessed t-th element of the time series, l_{t^*} is the last value of the level in the input window ∆in, and s_t is the t-th seasonal component.

Note that normalization is adaptive and local, and the "normalizer" follows the series values. This allows us to include the current features of the series (l_t* and s_t) in the input and output variables.
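A minimal sketch of transformation (2) for one pair of windows is given below, assuming level and seasonal arrays such as those produced by the ETS sketch above; the helper name preprocess_pair and the 0-based indexing are ours.

```python
import numpy as np

def preprocess_pair(y, level, seas, t, h=12):
    """Eq. (2) applied to the input window y[t:t+h] and the output window
    y[t+h:t+2h]. Both windows are normalised by the last level of the
    input window (l_t*) and by the seasonal component of each element."""
    l_star = level[t + h - 1]                              # l_t* of this window
    x_in = np.log(y[t:t + h] / (l_star * seas[t:t + h]))
    x_out = np.log(y[t + h:t + 2 * h] / (l_star * seas[t + h:t + 2 * h]))
    return x_in, x_out
```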
The preprocessed elements of the time series contained in the successive input and output windows can be represented by vectors as follows:
– first pair of input and output windows: x^in_1 = [x_1 x_2 ... x_12], x^out_1 = [x_13 x_14 ... x_24],
– second pair of input and output windows: x^in_2 = [x_2 x_3 ... x_13], x^out_2 = [x_14 x_15 ... x_25],
– ...,
– N-th pair of input and output windows: x^in_N = [x_N x_{N+1} ... x_{N+11}], x^out_N = [x_{N+12} x_{N+13} ... x_{N+23}].

These vectors are included in the training subset for the i-th time series: Φ_i = {(x^in_t, x^out_t) : t = 1, 2, ..., N}. The training subsets for all M time series are combined and form the training set Ψ = {Φ_1, Φ_2, ..., Φ_M}, which is used for RD-LSTM cross-learning. Note the dynamic character of the training set. It is updated in each epoch because the level and seasonal components in (2) are updated.

Fig. 3 shows the input and output vectors for the time series which is shown in Fig. 2. The upper panel shows the corresponding input and output x-vectors representing the first two seasonal cycles of the time series. The lower panel shows the input and output x-vectors representing the last two seasonal cycles. It can be seen from this figure that the x-vectors express patterns of the time series fragments after filtering out both level and seasonality. This pattern representation of the time series has been used successfully in earlier studies concerning ML forecasting models, especially similarity-based models [32]. Different definitions of the time series patterns can be found in [33]. But these definitions are fixed, while in this work we use dynamic patterns which change during learning (compare the patterns in the first and last training epochs in Fig. 3).

RD-LSTM operates on preprocessed time series values, x_t. In the postprocessing step, the forecasts generated by RD-LSTM, x̂_t, need to be "unwound" in the following way:

    \hat{y}_t = \exp(\hat{x}_t)\, l_{t^*} s_t          (3)

Note that both the level l_t* and the seasonal component s_t in (3), which are necessary to calculate ŷ_t from x̂_t, are known. They are determined from (1) on the basis of the time series history.
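The construction of the training subset Φi for one series and the unwinding step (3) can be sketched as follows; this is again an illustration only, with variable names of our choosing, and it assumes the level and seasonal arrays from the preprocessing described above.

```python
import numpy as np

def build_training_subset(y, level, seas, h=12):
    """Phi_i: all N pairs (x_in, x_out) for one series, using Eq. (2)."""
    pairs = []
    n_pairs = len(y) - 2 * h + 1                       # N
    for t in range(n_pairs):
        l_star = level[t + h - 1]                      # last level of the input window
        x_in = np.log(y[t:t + h] / (l_star * seas[t:t + h]))
        x_out = np.log(y[t + h:t + 2 * h] / (l_star * seas[t + h:t + 2 * h]))
        pairs.append((x_in, x_out))
    return pairs

def unwind_forecast(x_hat, l_star, seas_future):
    """Eq. (3): map a forecast in x-space back to monthly demands, using the
    last level of the input window and the seasonal components of the
    forecast horizon (available because the recursion (1) extends them)."""
    return np.exp(np.asarray(x_hat)) * l_star * np.asarray(seas_future)
```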
D. Residual Dilated LSTM

LSTM is a special kind of RNN capable of learning long-term dependencies in sequential data [34]. A common LSTM block is composed of a memory cell, which can maintain its state over time, and three non-linear "regulators", called gates, which control the flow of information inside the block. A typical LSTM block is shown in Fig. 4. In this diagram, h_t and c_t denote the hidden state and the cell state at time step t, respectively. The cell state contains information learned from the previous time steps. Information can be added to or removed from the cell state using the gates: input gate (i), forget gate (f) and output gate (o). At each time step t, the block uses the past state of the network, i.e. c_{t−1} and h_{t−1}, and the input x_t to compute the output h_t and the updated cell state c_t. The hidden and cell states are recurrently connected back to the block input. All of the gates are controlled by the hidden state of the past cycle and the input x-vector. Most modern studies incorporate many of the improvements that have been made to the LSTM architecture since its original formulation [35].

Fig. 4. LSTM block.

Detailed mathematical expressions describing the LSTM block are given below. The cell state at time step t is:

    c_t^1 = f_t^1 \otimes c_{t-1}^1 + i_t^1 \otimes g_t^1          (4)

where the operator ⊗ denotes the Hadamard product (element-wise product) and superscript 1 refers to the first layer of the RD-LSTM network, where we use the standard LSTM block.

The hidden state at time step t is given by:

    h_t^1 = o_t^1 \otimes \sigma_c(c_t^1)          (5)

where the state activation function σ_c is a hyperbolic tangent function.

Formulas related to the gates are as follows:

    f_t^1 = \sigma_g(W_f^1 x_t + V_f^1 h_{t-1}^1 + b_f^1)          (6)
    i_t^1 = \sigma_g(W_i^1 x_t + V_i^1 h_{t-1}^1 + b_i^1)          (7)
    g_t^1 = \sigma_c(W_g^1 x_t + V_g^1 h_{t-1}^1 + b_g^1)          (8)
    o_t^1 = \sigma_g(W_o^1 x_t + V_o^1 h_{t-1}^1 + b_o^1)          (9)

where W, V and b are input weights, recurrent weights and biases, respectively, and σ_g is a gate sigmoid activation function (1 + e^{−x})^{−1}.
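For readers who prefer code to gate equations, one step of (4)–(9) for the standard LSTM block can be written in NumPy as below. This is an illustrative reimplementation, not the DyNet code used in the paper; the weight containers and the lstm_step name are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, V, b):
    """One step of Eqs. (4)-(9); W, V, b are dicts keyed by gate name."""
    f = sigmoid(W['f'] @ x_t + V['f'] @ h_prev + b['f'])   # forget gate (6)
    i = sigmoid(W['i'] @ x_t + V['i'] @ h_prev + b['i'])   # input gate  (7)
    g = np.tanh(W['g'] @ x_t + V['g'] @ h_prev + b['g'])   # candidate   (8)
    o = sigmoid(W['o'] @ x_t + V['o'] @ h_prev + b['o'])   # output gate (9)
    c = f * c_prev + i * g                                 # cell state  (4)
    h = o * np.tanh(c)                                     # hidden state (5)
    return h, c

# Example shapes: input x-vector of length 12, state length m = 40 (Section III).
m, n_in = 40, 12
rng = np.random.default_rng(0)
W = {k: 0.1 * rng.standard_normal((m, n_in)) for k in 'figo'}
V = {k: 0.1 * rng.standard_normal((m, m)) for k in 'figo'}
b = {k: np.zeros(m) for k in 'figo'}
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(m), np.zeros(m), W, V, b)
```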
In our study, we also use a dilated residual version of LSTM (RD-LSTM). The dilated RNN architecture was proposed in [28] as a solution to tackle three major challenges of RNNs when learning on long sequences: complex dependencies, vanishing and exploding gradients, and efficient parallelization. It is characterized by multi-resolution dilated recurrent skip connections. Moreover, it reduces the number of parameters needed and enhances training efficiency significantly in tasks involving very long-term dependencies. A dilated LSTM block receives as input states not the last ones, c_{t−1} and h_{t−1}, but earlier states, c_{t−d} and h_{t−d}, where d > 1 is a dilation. So, to compute the current states of the LSTM block, the last d − 1 states are skipped. Usually multiple dilated recurrent layers are stacked with hierarchical dilations to construct a system which learns the temporal dependencies of different scales at different layers. In [28], it was shown that this solution can reliably improve the ability of recurrent models to learn long-term dependency in problems from different domains. It seems that dilated LSTM can be particularly useful for seasonal time series, where the relationships between the series elements have a cyclical character. This character can be incorporated into the model by dilations related to seasonality.

A residual version of LSTM was proposed in [29]. A standard memory cell, which learns long-term dependencies of sequential data, provides a temporal shortcut path to avoid vanishing or exploding gradients in the temporal domain. The residual LSTM provides an additional spatial shortcut path from lower layers for efficient training of deep LSTM architectures. To avoid a conflict between spatial and temporal-domain gradient flows, residual LSTM separates the spatial shortcut path from the temporal one. This gives greater flexibility to deal with vanishing or exploding gradients.

Fig. 5 describes the residual dilated LSTM block which was used in this study. We denote this block by RD-LSTM. In this figure h_t^{l−1} is a shortcut path from the (l−1)-th layer that is added to the updated cell state c_t processed by the function σ_c. Our implementation of the residual LSTM is a simplified version of the original one. The peephole connections are removed, as well as the linear transformations of the shortcut path h_t^{l−1} and the transformed cell state σ_c(c_t^l). These transformations are not necessary because in our case the dimensions of h_t^{l−1} and σ_c(c_t^l) match that of h_t^l.

Fig. 5. RD-LSTM block.

The mathematical expressions describing the RD-LSTM block are as follows:

    c_t^l = f_t^l \otimes c_{t-d}^l + i_t^l \otimes g_t^l          (10)
    h_t^l = o_t^l \otimes (\sigma_c(c_t^l) + h_t^{l-1})          (11)
    f_t^l = \sigma_g(W_f^l h_t^{l-1} + V_f^l h_{t-d}^l + b_f^l)          (12)
    i_t^l = \sigma_g(W_i^l h_t^{l-1} + V_i^l h_{t-d}^l + b_i^l)          (13)
    g_t^l = \sigma_c(W_g^l h_t^{l-1} + V_g^l h_{t-d}^l + b_g^l)          (14)
    o_t^l = \sigma_g(W_o^l h_t^{l-1} + V_o^l h_{t-d}^l + b_o^l)          (15)

where superscript l indicates the layer number (from 2 to 4 in our case, see below) and d is a dilation (3, 6 or 12 in our case, see below).
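Analogously, one step of the RD-LSTM block (10)–(15) differs from the standard block only in taking the recurrent states from d steps back and in adding the lower-layer output to the block output. The sketch below is again illustrative; the state-history buffers and function names are assumed implementation details, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rd_lstm_step(h_lower, h_hist, c_hist, d, W, V, b):
    """One step of Eqs. (10)-(15): dilated recurrence (states taken d steps
    back) plus a residual (spatial) shortcut from the lower layer.
    h_hist / c_hist are lists of this layer's past states, newest last."""
    h_d, c_d = h_hist[-d], c_hist[-d]                       # h_{t-d}, c_{t-d}
    f = sigmoid(W['f'] @ h_lower + V['f'] @ h_d + b['f'])   # (12)
    i = sigmoid(W['i'] @ h_lower + V['i'] @ h_d + b['i'])   # (13)
    g = np.tanh(W['g'] @ h_lower + V['g'] @ h_d + b['g'])   # (14)
    o = sigmoid(W['o'] @ h_lower + V['o'] @ h_d + b['o'])   # (15)
    c = f * c_d + i * g                                     # (10)
    h = o * (np.tanh(c) + h_lower)                          # (11): residual add
    return h, c

# Minimal usage with state length m = 40 and dilation d = 3 (second layer).
m, d = 40, 3
rng = np.random.default_rng(1)
W = {k: 0.1 * rng.standard_normal((m, m)) for k in 'figo'}
V = {k: 0.1 * rng.standard_normal((m, m)) for k in 'figo'}
b = {k: np.zeros(m) for k in 'figo'}
h_hist = [np.zeros(m) for _ in range(d)]   # zero-padded history for early steps
c_hist = [np.zeros(m) for _ in range(d)]
h, c = rd_lstm_step(rng.standard_normal(m), h_hist, c_hist, d, W, V, b)
```

In the full network of Fig. 6, three such blocks with d = 3, 6 and 12 would be stacked on top of the standard LSTM of (4)–(9), and the last hidden state would be mapped to the output x-vector by the linear unit (16).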
The proposed RD-LSTM architecture, which is a result of painstaking experimentation on M4 monthly data (48,000 time series), is depicted in Fig. 6. It is composed of four recurrent layers and a linear unit LU. The first layer consists of the standard LSTM block shown in Fig. 4. The subsequent three layers consist of RD-LSTM blocks (Fig. 5) with increasing dilations d = 3, 6 and 12. The last element is a linear unit which transforms the output of the last layer, h_t^4, into the forecast of the output x-vector:

    \hat{x}_t^{out} = W_x h_t^4 + b_x          (16)

Fig. 6. RD-LSTM network architecture.

The learnable parameters of RD-LSTM are: input weights W, recurrent weights V, and biases b. They are tuned in the cross-learning mode simultaneously with the ETS parameters using SGD. The length of the cell and hidden states, m, the same for all layers, was selected on the training set (see Section III for details).

Note that an input is not a scalar but a vector representing a sequence of the time series of length 12, i.e. a seasonal period. It allows the RD-LSTM to be exposed to the immediate history of the series directly. An output is a vector representing the whole forecasted sequence of length 12.

When the output x-pattern is determined from (16), the forecasted monthly demands are calculated from (3). In the latter equation x̂_t is the t-th component of vector x̂_avg^out.

E. Ensembling

The forecasts generated by the RD-LSTM model are ensembled at three levels:
1) Stage of training level. Averaging forecasts produced by the L most recent training epochs.
2) Data subset level. Averaging forecasts from a pool of K forecasting models which learn on subsets of the training set.
3) Model level. Averaging forecasts from R independent runs of the pool of K models produced on the data subset level.

The idea behind averaging the forecasts produced in a few most recent training epochs (first level of ensembling) is that averaging has the effect of calming down the noisy SGD optimization process. SGD uses mini-batches of training samples to estimate the actual gradients. The resulting search process operates on the approximated gradients, which make the trajectory noisy. Averaging the forecasts obtained in the most recent epochs, when the algorithm converges around the local minimum, may reduce the effect of stochastic searching and produce more accurate forecasts.

At the data subset level, the forecasts from K models which learn on subsets of the training set, Ψ_1, Ψ_2, ..., Ψ_K, are averaged. The training set, Ψ = {Φ_1, Φ_2, ..., Φ_M}, is composed of subsets Φ_i containing the training samples for the i-th time series. To create the training Ψ-subsets, first, the set of M time series is split randomly into K subsets of similar size: Θ_1, Θ_2, ..., Θ_K. The k-th Ψ-subset contains the Φ-subsets for all time series excluding those in Θ_k, i.e. Ψ_k = Ψ \ {Φ_i}_{i∈Θ_k}. Each of the K models learns on its own training subset, Ψ_k, and generates forecasts for the time series included in Ψ_k. Then the K − 1 forecasts for each time series produced by the pool of K models are averaged.

The last level of ensembling simply averages the forecasts for each time series generated in R independent runs of a pool of K models. In each run, the training subsets Ψ_k are created anew.

Note that the diversity of learners, which is a key property governing the ensemble performance [36], has various sources in our proposed approach. They include (i) data uncertainty: learning on mini-batches, learning on different Ψ-subsets of the training set, and (ii) parameter uncertainty: learning using different initial values of the model parameters in each run.

The last two ensembling levels are depicted in Fig. 7. In this figure, K = 4, so four training Ψ-subsets are created. The set of time series included in the k-th Ψ-subset in the r-th run is denoted by Y_k^r in this figure, and the set of forecasts generated by the model in this case is denoted by Ŷ_k^r. For each time series R(K − 1) forecasts are averaged. This is shown
as a joint operation for levels 2 and 3 in the figure, and can be expressed as:

    \hat{y}_{avg} = \frac{1}{R(K-1)} \sum_{r=1}^{R} \sum_{k=1}^{K-1} \hat{y}_{r,k}          (17)

where K is the size of the pool of models, R is the number of runs, and ŷ_{r,k} is the forecasted y-vector generated as the k-th one in the r-th run.

In this work, we use a simple average for ensembling, but other functions, such as the median, mode, or trimmed mean could be applied [36]. As shown in [37], a simple average of forecasts often outperforms more complicated weighting schemes.

Note that the K · R models created at the data subset level in R runs can be trained simultaneously. Thus, the proposed forecasting system is suitable to be implemented in parallel.

Fig. 7. Ensembling.
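The data subset and model levels of ensembling, together with the average (17), can be sketched as follows. The fit_and_forecast function is only a placeholder standing in for a trained ETS+RD-LSTM model; everything else (the random split into Θ-subsets, the exclusion rule Ψ_k = Ψ \ {Φ_i}_{i∈Θ_k}, and the averaging of R(K−1) forecasts per series) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, R, H = 35, 4, 3, 12          # number of series, pool size, runs, horizon

def fit_and_forecast(series_ids):
    """Placeholder for one model trained on subset Psi_k: it returns a
    forecast for every series it was trained on (dummy values here)."""
    return {int(i): rng.normal(size=H) for i in series_ids}

collected = {i: [] for i in range(M)}
for r in range(R):                                  # model level: R runs
    perm = rng.permutation(M)
    theta = np.array_split(perm, K)                 # Theta_1 .. Theta_K
    for k in range(K):                              # data subset level
        psi_k = np.setdiff1d(perm, theta[k])        # Psi_k excludes Theta_k
        for i, f in fit_and_forecast(psi_k).items():
            collected[i].append(f)

# Eq. (17): every series ends up with R*(K-1) forecasts to average.
y_avg = {i: np.mean(collected[i], axis=0) for i in range(M)}
assert all(len(v) == R * (K - 1) for v in collected.values())
```

The first ensembling level (averaging the L most recent training epochs) is not shown; it would simply keep the forecasts of the last L epochs of each model and average them before they enter collected.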
F. Loss Function

As a loss function we used the pinball loss operating on normalized and deseasonalized RD-LSTM outputs and actuals:

    L_t = \begin{cases} (x_t - \hat{x}_t)\,\tau & \text{if } x_t \geq \hat{x}_t \\ (\hat{x}_t - x_t)(1 - \tau) & \text{if } \hat{x}_t > x_t \end{cases}          (18)

where τ ∈ (0, 1) controls the function asymmetry.

When τ = 0.5 the loss function is symmetrical and penalizes positive and negative deviations equally. When the model tends to have a positive or negative bias, we can reduce the bias by introducing τ smaller or larger than 0.5, respectively. Thus, the asymmetric pinball function, penalizing positive and negative deviations differently, allows the method to deal with bias. It is worth noting that the pinball loss is commonly employed in quantile regression and probabilistic forecasting [38].

The experience gained during the M4 competition shows that the smoothness of the level time series influences the forecasting accuracy substantially. It turned out that when the input to the RD-LSTM was smooth, the RD-LSTM focused on predicting the trend, instead of overfitting on some spurious, seasonality-related patterns. A smooth level curve also means that the seasonal components absorbed the seasonality properly. To deal with the wiggliness of the level curve, a penalized loss function was introduced as follows:
• Calculate the logarithms of the quotients of the neighboring points of the level time series: d_t = log(l_{t+1}/l_t),
• Calculate differences of the above: e_t = d_{t+1} − d_t,
• Square and average them for each series.

The resulting penalized loss function related to a given time series takes the form:

    L = \frac{1}{T}\sum_{t=1}^{T} L_t + \frac{\lambda}{T-2}\sum_{t=1}^{T-2} \log^2\!\left(\frac{l_{t+2}\, l_t}{l_{t+1}^2}\right)          (19)

where T is the number of forecasts and λ is a regularization parameter which determines how much to penalize the wiggliness of the level curve.

The level wiggliness penalty affected the performance of the method significantly and contributed greatly to winning the M4 competition [27].
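The pinball loss (18) and the level-wiggliness penalty of (19) are restated below in NumPy, with the Section III values τ = 0.4 and λ = 50 as defaults. In the actual model these quantities are evaluated inside the C++/DyNet computation graph so that their gradients reach both the RD-LSTM weights and the ETS parameters; the sketch is for clarity only.

```python
import numpy as np

def pinball_loss(x, x_hat, tau=0.4):
    """Eq. (18), averaged over the forecast horizon."""
    diff = np.asarray(x) - np.asarray(x_hat)
    return np.mean(np.where(diff >= 0, tau * diff, (1 - tau) * (-diff)))

def level_wiggliness(level, lam=50.0):
    """Penalty term of Eq. (19): squared second differences of the log-level."""
    d = np.diff(np.log(level))          # d_t = log(l_{t+1} / l_t)
    e = np.diff(d)                      # e_t = d_{t+1} - d_t
    return lam * np.mean(e ** 2)

def penalized_loss(x, x_hat, level, tau=0.4, lam=50.0):
    """Eq. (19): pinball loss plus the level-smoothness penalty."""
    return pinball_loss(x, x_hat, tau) + level_wiggliness(level, lam)
```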
III. EXPERIMENTAL STUDY

This section presents the results of applying the proposed forecasting model to the monthly electricity demand forecasting for 35 European countries. The data were obtained from the ENTSO-E repository (www.entsoe.eu). The time series have different lengths: 24 years (11 countries), 17 years (6 countries), 12 years (4 countries), 8 years (2 countries), and 5 years (12 countries). The last year of data is 2014. The time series are presented in Fig. 8. As can be seen from this figure, monthly electricity demand time series exhibit different levels, nonlinear trends, strong annual cycles and variable variances. The shapes of the yearly cycles change over time.

Fig. 8. Monthly electricity demand time series for 35 European countries.

The forecasting model as well as the comparative models were applied to forecast twelve monthly demands for the last year of data, 2014. The data from previous years were used for hyperparameter selection and learning.

Model hyperparameters were selected on the training data. A typical procedure was learning the model on the time series fragments up to 2012, and validating it on 2013 (during the M4 competition a very strong correlation between errors for the test period and the last period of training data was observed). The selected hyperparameters were used to construct the model for 2014. The selected hyperparameters were as follows:
• number of epochs: 10,
• learning rate: 10^-3,
• length of the cell and hidden states: m = 40,
• asymmetry parameter in pinball loss: τ = 0.4,
• regularization parameter: λ = 50,
• ensembling parameters: L = 5, K = 4, R = 3.
The model was implemented in C++ relying on the DyNet library [39]. It was compiled in Visual Studio 2017 (Windows 10) and run in parallel on an 8-core CPU (AMD Ryzen 7 1700, 3.0 GHz, 32 GB RAM).

A. Comparative Models

The proposed model was compared with state-of-the-art models based on ML as well as classical statistical models such as ARIMA and ETS. All ML models except LSTM use a pattern representation of time series [14]. Patterns, which express preprocessed repetitive sequences in a time series, ensure input and output data unification through trend filtering and variance equalization. Consequently, the relationship between input and output data is simplified and the transformed forecasting problem can be solved using simple models.

The comparative models are described in detail in [14] and outlined below.
• k-NNw – k-nearest neighbor weighted regression model. It estimates a vector-valued regression function as an average of output patterns in a varying neighborhood of a query pattern. A weighting function allows the model to take into account the similarity between the query pattern and its nearest neighbors. The model hyperparameters are: input pattern length and number of nearest neighbors k. A linear weighting function was used.
• FNM – fuzzy neighborhood model. In this case, the regression model is similar to k-NNw except that it aggregates not only the k nearest neighbors of the query pattern but all training patterns. The weighting function takes the form of a membership function (Gaussian-type function) which assigns to each training pattern a degree of membership to the query pattern neighborhood. The model hyperparameters are: input pattern length and membership function width.
• N-WE – Nadaraya–Watson estimator. It is a kernel regression model belonging to the same category of nonparametric models as k-NNw and FNM. N-WE estimates the regression function as a locally weighted average, using a kernel function (Gaussian) as a weighting function. The model hyperparameters are: input pattern length and kernel bandwidth parameters.
• GRNN – general regression NN model. This is a four-layer NN with Gaussian nodes centered at the training patterns. The node outputs express similarities between the query pattern and the training patterns. These outputs are treated as the weights of the training patterns in the regression model. The model hyperparameters are: input pattern length and bandwidth parameter for nodes.
• MLP – multilayer perceptron [40] with a single hidden layer and sigmoidal neurons. It is trained using the Levenberg–Marquardt method with Bayesian regularization to prevent overfitting. The MLP hyperparameters are: input pattern length and number of hidden nodes. We use the Matlab R2018a implementation of MLP (function feedforwardnet from Neural Network Toolbox).
• ANFIS – adaptive neuro-fuzzy inference system [41]. The initial membership function parameters in the premise parts of the rules are determined using fuzzy c-means clustering. A hybrid learning method is applied for ANFIS training, which uses a combination of least squares for the consequent parameters and the backpropagation gradient descent method for the premise parameters. The ANFIS hyperparameters are: input pattern length and number of rules. The Matlab R2018a implementation of ANFIS was used (function anfis from Fuzzy Logic Toolbox).
• k-NNw+ETS, FNM+ETS, N-WE+ETS, GRNN+ETS, MLP+ETS, ANFIS+ETS – variants of the above models with output patterns encoded with variables describing the current features of the time series. These features are the mean value of the series in the seasonal cycle and its dispersion. To postprocess the forecasted output pattern, the coding variables are predicted for the next period using ETS. We use the R implementation of ETS (see below).
• LSTM – long short-term memory, where the responses are the training sequences with values shifted by one time step (a sequence-to-sequence regression LSTM network). For multiple time steps, after one step was predicted the LSTM state was updated. The previous prediction was used as input to the LSTM, producing a forecast for the next time step. LSTM was optimized using the Adam (adaptive moment estimation) optimizer. The length of the hidden state was the only hyperparameter to be tuned. Other hyperparameters remained at their default values. The experiments were carried out using the Matlab R2018a implementation of LSTM (function trainNetwork from Neural Network Toolbox).
• ARIMA – ARIMA(p, d, q)(P, D, Q)_12 model implemented in function auto.arima in the R environment (package forecast). This function implements automatic ARIMA modeling which combines unit root tests, minimization of the Akaike information criterion (AICc) and maximum likelihood estimation to obtain the optimal ARIMA model [31].
• ETS – exponential smoothing state space model [43] implemented in function ets (R package forecast). This implementation includes many types of ETS models depending on how the seasonal, trend and error components are taken into account. They can be expressed additively or multiplicatively, and the trend can be damped or not. As in the case of auto.arima, ets returns the optimal model, estimating its parameters using AICc.

The models k-NNw, FNM, N-WE and GRNN are also known as pattern similarity-based forecasting models (PSFMs) because the forecast is constructed by aggregating the training output patterns using the similarity between the query pattern and the training input patterns [14]. All the model hyperparameters mentioned above were selected on the training set in grid search procedures. The NN-based models, i.e. MLP, ANFIS and LSTM, due to the stochastic nature of their learning processes, return different results for the same data. In this study, 100 independent learning sessions for these models were performed and the final errors were calculated as averages over 100 trials.
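As an illustration of the pattern similarity-based baselines, a minimal Nadaraya–Watson estimator over patterns might look as follows; the bandwidth value and the function name are assumptions, and the actual comparative models follow the definitions in [14].

```python
import numpy as np

def nw_forecast(query, train_inputs, train_outputs, bandwidth=0.3):
    """Nadaraya-Watson estimator over patterns: a kernel-weighted average of
    training output patterns, weighted by the Gaussian similarity of the
    training input patterns to the query pattern (illustrative PSFM sketch)."""
    d2 = np.sum((train_inputs - query) ** 2, axis=1)   # squared distances
    w = np.exp(-d2 / (2 * bandwidth ** 2))             # Gaussian kernel weights
    return w @ train_outputs / w.sum()                 # weighted mean of y-patterns
```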
TABLE I
RESULTS COMPARISON AMONG PROPOSED AND COMPARATIVE MODELS.

Model          Median APE   MAPE   IQR    RMSE
k-NNw          2.89         4.99   3.85   368.79
FNM            2.88         4.88   4.26   354.33
N-WE           2.84         5.00   3.97   352.01
GRNN           2.87         5.01   4.02   350.61
k-NNw+ETS      2.71         4.47   3.52   327.94
FNM+ETS        2.64         4.40   3.46   321.98
N-WE+ETS       2.68         4.37   3.36   320.51
GRNN+ETS       2.64         4.38   3.51   324.91
MLP            2.97         5.27   3.84   378.81
MLP+ETS        3.11         4.80   4.12   358.07
ANFIS          3.56         6.18   4.87   488.75
ANFIS+ETS      3.54         6.32   4.26   464.29
LSTM           3.73         6.11   4.50   431.83
ARIMA          3.32         5.65   5.24   463.07
ETS            3.50         5.05   4.80   374.52
ETS+RD-LSTM    2.74         4.48   3.55   347.24

TABLE II
DESCRIPTIVE STATISTICS OF PERCENTAGE ERRORS.

Model          Mean   Median   Std     Skewness   Kurtosis
k-NNw          1.87   1.08     10.43   5.14       44.56
FNM            2.03   1.22     9.34    4.16       35.14
N-WE           1.91   1.18     10.82   5.41       48.94
GRNN           1.87   1.16     10.60   5.34       48.11
k-NNw+ETS      1.25   0.20     9.00    4.40       37.30
FNM+ETS        1.26   0.11     8.80    4.75       41.71
N-WE+ETS       1.26   0.17     8.68    4.63       40.75
GRNN+ETS       1.26   0.11     8.61    4.42       38.38
MLP            1.37   0.68     11.88   7.52       109.64
MLP+ETS        1.71   1.03     7.32    1.55       11.83
ANFIS          2.51   1.43     11.37   4.35       34.93
ANFIS+ETS      1.30   0.40     12.65   0.96       39.37
LSTM           3.12   1.81     9.49    2.86       22.21
ARIMA          2.35   1.03     13.62   9.01       119.20
ETS            1.04   0.31     7.97    1.89       13.52
ETS+RD-LSTM    1.11   0.27     10.07   6.37       63.61

B. Results

Table I shows the results of forecasting for the proposed and comparative models: the median of the absolute percentage error (APE), the mean APE (MAPE), the interquartile range of APE as a measure of the forecast dispersion, and the root mean square error (RMSE). These are values averaged over 35 countries. The lowest errors are for the "+ETS" PSFMs, where the coding variables are predicted using ETS. The errors for ETS+RD-LSTM are slightly higher but lower than for all other models.
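For completeness, the error measures reported in Table I can be computed for a single country as in the sketch below (APE in percent; the table reports these values averaged over the 35 countries).

```python
import numpy as np

def forecast_errors(y_true, y_pred):
    """APE-based measures of Table I for one country (12 monthly values)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ape = 100 * np.abs(y_true - y_pred) / y_true
    return {
        "Median APE": np.median(ape),
        "MAPE": ape.mean(),
        "IQR": np.subtract(*np.percentile(ape, [75, 25])),   # interquartile range of APE
        "RMSE": np.sqrt(np.mean((y_true - y_pred) ** 2)),
    }
```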
More detailed results are shown in Figs 9 and 10. Fig. 9 depicts MAPE for each country. As can be seen from this figure, ETS+RD-LSTM is one of the most accurate models in most cases. Fig. 10 shows MAPE for each month of the forecasted period. Note lower errors for months 8–10 and higher errors for months 1–4 and 12. ETS+RD-LSTM achieved better results than most of the comparative models for the months 6–12. For months 1 and 3 it achieved the highest errors compared to the other models.

Rankings of the models are shown in Fig. 11. These are based on the average ranks of the models in the rankings for individual countries. The left panel shows the ranking based on MAPE and the right panel shows the ranking based on RMSE. Note the high position of ETS+RD-LSTM. It is in first position in the MAPE ranking and third position in the RMSE ranking. Note that in the latter case the difference between the first three positions is very small.

Examples of forecasts produced by the models for six countries are depicted in Fig. 12. For PL data, most models including ETS+RD-LSTM do not exceed MAPE = 2%, which should be considered a very good result. Similar results were achieved for ES, IT and DE. In these cases MAPE for ETS+RD-LSTM was respectively: 1.61% (most accurate model), 2.12% (third most accurate model) and 2.29%. For GB the forecasts are underestimated. This results from the fact that demand went up unexpectedly in 2014 despite the downward trend observed in the previous period from 2010 to 2013. The reverse situation for FR caused a slight overestimation of the forecasts. For GB data, ETS+RD-LSTM with MAPE = 8.52% was the least accurate model, and for FR data, with MAPE = 5.49%, it was one of the most accurate models.

Table II shows basic descriptive statistics of the percentage errors (PEs) and Fig. 13 depicts the empirical probability density functions of PEs for the selected models. The PE distributions are similar to normal, but tests for the assessment of normality (Jarque–Bera test and Lilliefors test) do not confirm this. In all cases, the forecasts are overestimated, having a positive PE mean. Note that ETS+RD-LSTM is one of the least biased models. Positive values of skewness indicate right-skewed PE distributions, and high kurtosis values indicate leptokurtic distributions where the probability mass is concentrated around the mean.

IV. CONCLUSION

In this work, we proposed and empirically validated a new architecture for mid-term load forecasting. It was inspired by the winning submission to the M4 forecasting competition 2018, which combines ETS, advanced LSTM and ensembling. The model has a hierarchical structure composed of a global part learned across many time series (weights of the LSTM) and a time series specific part (ETS smoothing coefficients and initial seasonal components). Using stochastic gradient descent, it learns a mapping from input vectors to output vectors and time series representations at the same time. ETS-inspired formulas extract the main components of individual time series to be used for deseasonalization and normalization. Preprocessed time series are forecasted using the residual dilated LSTM. Due to the introduction of dilated recurrent skip connections and a spatial shortcut path from lower layers, the LSTM is able to better capture long-term seasonal relationships and ensure more efficient training. To deal with forecast bias we used a pinball loss function with a parameter controlling its asymmetry. In addition, we penalized the loss function to prevent overfitting. Three-level ensembling ensures powerful regularization reducing the model variance, which has sources in the stochastic nature of SGD, and also in data and parameter uncertainty.
Fig. 9. MAPE for each country.

Fig. 10. MAPE for each month of the forecasted period.

Fig. 11. Rankings of the models.

We applied the proposed model to the monthly electricity demand forecasting for 35 European countries. The results demonstrated its state-of-the-art performance and competitiveness with classical models such as ARIMA and ETS as well as state-of-the-art models based on ML.

REFERENCES

[1] F. Apadula, A. Bassini, A. Elli, and S. Scapin, "Relationships between meteorological variables and monthly electricity demand," Appl Energy, vol. 98, pp. 346–356, 2012.
[2] E. Dogan, "Are shocks to electricity consumption transitory or permanent? Sub-national evidence from Turkey," Utilities Policy, vol. 41, pp. 77–84, 2016.
[3] L. Suganthi and A. A. Samuel, "Energy models for demand forecasting – A review," Renew Sust Energy Rev, vol. 16(2), pp. 1223–1240, 2002.
[4] E. H. Barakat, "Modeling of nonstationary time-series data. Part II. Dynamic periodic trends," Electr Power Energy Systems, vol. 23, pp. 63–68, 2001.
[5] E. González-Romera, M. A. Jaramillo-Morán, and D. Carmona-Fernández, "Monthly electric energy demand forecasting with neural networks and Fourier series," Energy Conversion and Management, vol. 49, pp. 3135–3142, 2008.
[6] J. F. Chen, S. K. Lo, and Q. H. Do, "Forecasting monthly electricity demands: An application of neural networks trained by heuristic algorithms," Information, vol. 8(1), 31, 2017.
[7] M. Gavrilas, I. Ciutea, and C. Tanasa, "Medium-term load forecasting with artificial neural network models," Proc. Conf. International Conference and Exhibition on Electricity Distribution, vol. 6, 2001.
[8] E. Doveh, P. Feigin, and L. Hyams, "Experience with FNN models for medium term power demand predictions," IEEE Trans. Power Syst., vol. 14(2), pp. 538–546, 1999.
[9] P. Pełka and G. Dudek, "Medium-term electric energy demand forecasting using generalized regression neural network," Proc. Conf. Information Systems Architecture and Technology ISAT 2018, Advances in Intelligent Systems and Computing, vol. 853, pp. 218–227, Springer, Cham, 2019.
[10] T. Ahmad and H. Chen, "Potential of three variant machine-learning models for forecasting district level medium-term and long-term energy demand in smart grid environment," Energy, vol. 160, pp. 1008–1020, 2018.
[11] P. C. Chang, C. Y. Fan, and J. J. Lin, "Monthly electricity demand forecasting based on a weighted evolving fuzzy neural network approach," Electrical Power and Energy Systems, vol. 33, pp. 17–27, 2011.
[12] S. Baek, "Mid-term Load Pattern Forecasting With Recurrent Artificial Neural Network," IEEE Access, vol. 7, pp. 172830–172838, 2019.
[13] W. Zhao, F. Wang, and D. Niu, "The application of support vector machine in load forecasting," Journal of Computers, vol. 7(7), pp. 1615–1622, 2012.
[14] G. Dudek and P. Pełka, "Pattern Similarity-based Machine Learning Methods for Mid-term Load Forecasting: A Comparative Study," arXiv:2003.01475, 2020.
[15] H. Hewamalage, C. Bergmeir, and K. Bandara, "Recurrent neural networks for time series forecasting: Current status and future directions," arXiv:1909.00590v3, 2019.
[16] K. Yan, X. Wang, Y. Du, N. Jin, H. Huang, and H. Zhou, "Multi-Step Short-Term Power Consumption Forecasting with a Hybrid Deep Learning Strategy," Energies, vol. 11(11), 3089, 2018.
[17] J. Bedi and D. Toshniwal, "Empirical mode decomposition based deep learning for electricity demand forecasting," IEEE Access, vol. 6, pp. 49144–49156, 2018.
[18] H. Zheng, J. Yuan, and L. Chen, "Short-Term Load Forecasting Using EMD-LSTM Neural Networks with a XGBoost Algorithm for Feature Importance Evaluation," Energies, vol. 10(8), 1168, 2017.
[19] A. Narayan and K. W. Hipel, "Long short term memory networks for short-term electric load forecasting," Proc. Conf. IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, pp. 2573–2578, 2017.
Fig. 12. Real and forecasted monthly electricity demand for selected countries.

Fig. 13. PDF of the percentage errors.

[20] J. Toubeau, J. Bottieau, F. Vallée, and Z. De Grève, "Deep learning-based multivariate probabilistic forecasting for short-term scheduling in power markets," IEEE Transactions on Power Systems, vol. 34(2), pp. 1203–1215, 2019.
[21] T. Zia and S. Razzaq, "Residual recurrent highway networks for learning deep sequence prediction models," Journal of Grid Computing, Jun 2018.
[22] B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio, "N-BEATS: Neural basis expansion analysis for interpretable time series forecasting," arXiv:1905.10437v4, 2020.
[23] D. Salinas, V. Flunkert, and J. Gasthaus, "DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks," arXiv:1704.04110v3, 2019.
[24] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski, "Deep state space models for time series forecasting," in NeurIPS, vol. 31, pp. 7785–7794, 2018.
[25] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "The M4 Competition: Results, findings, conclusion and way forward," International Journal of Forecasting, vol. 34(4), pp. 802–808, 2018.
[26] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "The M4 Competition: 100,000 time series and 61 forecasting methods," International Journal of Forecasting, vol. 36(1), pp. 54–74, 2020.
[27] S. Smyl, "A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting," International Journal of Forecasting, vol. 36(1), pp. 75–85, 2020.
[28] S. Chang, Y. Zhang, W. Han, M. Yu, X. Guo, W. Tan, et al., "Dilated recurrent neural networks," arXiv:1710.02224, 2017.
[29] J. Kim, M. El-Khamy, and J. Lee, "Residual LSTM: Design of a deep recurrent architecture for distant speech recognition," arXiv:1701.03360, 2017.
[30] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. W. Cottrell, "A dual-stage attention-based recurrent neural network for time series prediction," in IJCAI-17, pp. 2627–2633, 2017.
[31] R. J. Hyndman and G. Athanasopoulos, Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia, OTexts.com/fpp2, 2018. Accessed on 7 March 2020.
[32] G. Dudek, "Pattern Similarity-based Methods for Short-term Load Forecasting – Part 1: Principles," Applied Soft Computing, vol. 37, pp. 277–287, 2015.
[33] G. Dudek, "Pattern Similarity-based Methods for Short-term Load Forecasting – Part 1: Principles," Applied Soft Computing, vol. 37, pp. 277–287, 2015.
[34] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9(8), pp. 1735–1780, 1997.
[35] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A Search Space Odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
[36] F. Petropoulos, R. J. Hyndman, and C. Bergmeir, "Exploring the sources of uncertainty: Why does bagging for time series forecasting work?," European Journal of Operational Research, vol. 268(2), pp. 545–554, 2018.
[37] F. Chan and L. L. Pauwels, "Some theoretical results on forecast combinations," International Journal of Forecasting, vol. 34(1), pp. 64–74, 2018.
[38] I. Takeuchi, Q. V. Le, T. D. Sears, and A. J. Smola, "Nonparametric quantile estimation," Journal of Machine Learning Research, vol. 7, pp. 1231–1264, 2006.
[39] G. Neubig, C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A. Anastasopoulos, et al., "DyNet: The dynamic neural network toolkit," arXiv:1701.03980, 2017.
[40] P. Pełka and G. Dudek, "Pattern-based forecasting monthly electricity demand using multilayer perceptron," Proc. Conf. Artificial Intelligence and Soft Computing ICAISC 2019, Lecture Notes in Artificial Intelligence, vol. 11508, pp. 663–672, Springer, Cham, 2019.
[41] P. Pełka and G. Dudek, "Neuro-Fuzzy System for Medium-term Electric Energy Demand Forecasting," Proc. Conf. Information Systems Architecture and Technology ISAT 2017, Advances in Intelligent Systems and Computing, vol. 655, pp. 38–47, Springer, Cham, 2018.
[42] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "Statistical and machine learning forecasting methods: Concerns and ways forward," PLoS ONE, vol. 13(3), 2018.
[43] R. J. Hyndman, A. B. Koehler, J. K. Ord, and R. D. Snyder, Forecasting with exponential smoothing: The state space approach, Springer, 2008.
