
Using Online Learning Spiral Recurrent Neural Network for NN5 Data Prediction

Huaien Gao1,2, Rudolf Sollacher2, Hans-Peter Kriegel1
1- University of Munich, Germany
2- Siemens AG, Corporate Technology, Germany

Abstract— A spiral recurrent neural network (SpiralRNN) has a special structure of the recurrent hidden layer which allows bounding the eigenvalues of the recurrent weight matrix. Thus, the network can learn characteristic temporal correlations online without running into dynamical instabilities. In this paper, SpiralRNN is employed to solve the financial time series prediction problem of the NN5 competition. We use these time series to demonstrate the performance of SpiralRNN on data with pronounced weekly and seasonal periodicities. These are taken into account by pre-processing the data and by providing additional sinusoidal input time series with the appropriate periodicities. The non-regular Easter holidays are taken into account by an additional Gaussian-shaped input signal centered at these holidays. The prediction performance is enhanced by a mixture-of-experts approach consisting of the combined output of 30 online learning SpiralRNNs, with weights inversely proportional to their temporally averaged one-step forecast error. The main advantages of this approach are the low configuration effort and the online learning capability. An evaluation based on a forecast of the last 56 values of the 111 time series is provided.

I. INTRODUCTION

Time series prediction is a common task in various industry sectors, such as robotic control and the financial market. The NN5 competition1 is one of the leading competitions with an emphasis on utilizing computational intelligence methods. The data in question come from the amount of money withdrawn from ATM machines across England. These data exhibit strong periodical (e.g. weekly, seasonal and yearly) behavior. The associated processes have deterministic and stochastic components. In general, they will not be stationary, as for example more tourists are visiting the area or a new shopping mall has opened. In this paper, we apply the online learning Spiral Recurrent Neural Network (SpiralRNN) [1], [2] to this prediction problem. Our approach focuses on the online learning capability and on an as low as possible configuration and preprocessing effort.

The remainder of this paper is arranged as follows: Section-II introduces the SpiralRNN structure; section-III discusses the adaptation of the SpiralRNN model to the prediction of the NN5 competition data; section-IV presents some evaluation results of forecasting the last 56 values of the 111 time series.

This paper has been presented to the special section of the time series competition at the World Congress on Computational Intelligence (WCCI) 2008 in Hong Kong.
1 http://www.neural-forecasting-competition.com/

II. SPIRAL RECURRENT NEURAL NETWORK

A. Hidden Units

A SpiralRNN [1], [2] is a recurrent neural network with a special recurrent layer structure which can be broken down into smaller units, namely “hidden units” or “spiral units”. Each hidden unit receives signals from the input neurons and provides processed signals to the output neurons. In addition, its hidden neurons receive signals from other hidden neurons in the same unit delayed by one time step. Fig-1(a) illustrates a typical hidden unit with three input neurons and three output neurons, where the hidden layer structure is only shown symbolically. Note that hidden neurons are fully connected to all input neurons and all output neurons. More details of the connections inside the hidden layer are shown in fig-1(b), where the connections from only one particular neuron to all other neurons in the hidden unit are displayed. With all neurons in the hidden unit aligned clockwise on a circle, the connection weights are defined such that the connection from one neuron to its first clockwise neighbor has value β1, the connection to its second clockwise neighbor has value β2, and so on. This definition is applied to all neurons, so that all connections from neurons to their respective first clockwise neighbors have an identical weight β1, all connections from neurons to their second clockwise neighbors have value β2, and so on.

The corresponding hidden-weight matrix M is shown in eq. (1). Its matrix elements are determined by a vector β ∈ R^((u−1)×1), where u refers to the number of hidden neurons in the hidden unit.

    M = [ 0      β1     β2     ...    βu−1
          βu−1   0      β1     ...    βu−2
          βu−2   βu−1   0      ...    βu−3
          ...                  ...    ...
          β1     β2     ...    βu−1   0    ],

    P = [ 0  1  0  ...  0
          0  0  1  ...  0
          ...        ...
          0  0  0  ...  1
          1  0  0  ...  0 ]                                     (1)

Furthermore, matrix M can be decomposed into iterated permutations described by a matrix P:

    M = β1 P + β2 P^2 + ... + βu−1 P^(u−1),   P ∈ R^(u×u)       (2)

It is obvious that matrix P^2 is also a permutation matrix, shifting a multiplier vector by two positions. Similarly, P^u
up-shifts the multiplier vector by u positions, and therefore:

    P^u = Id

Now, the eigenvalue λ̂k of any permutation matrix P^i (i ∈ N+) satisfies [3]:

    |λ̂k| = 1,   k = 1, ..., u

Therefore, the maximum absolute eigenvalue of matrix M is bounded, such that the relation in (3) holds.

    |λu| ≤ Σ_{i=1..u−1} |βi|                                    (3)

A suitable parameterization of the vector β by a predefined value γ ∈ R+ and a trainable vector ξ is the following:

    β = γ tanh(ξ)                                               (4)

Now, the matrix M can be rewritten as

    M = Σ_{i=1..u−1} γ tanh(ξi) P^i,

and the relation (3) simplifies to the following relation:

    |λu| ≤ γ Σ_{i=1..u−1} |tanh(ξi)| ≤ γ(u − 1)                 (5)

Fig. 1. (a) The structure of a hidden unit with 3 input neurons and 3 output neurons; (b) Structure of a hidden unit, where only the outgoing connections from one neuron are shown; connections from other neurons have the same structure and weights.

Fig. 2. The typical structure of SpiralRNNs. Note that all hidden units have the same basic topology (however the number of hidden neurons in the hidden units can differ), as shown in fig-1, and are separated from each other, whereas the input and output connections are fully connected to the hidden neurons.

B. SpiralRNNs

The construction of SpiralRNNs is generally based on spiral hidden units. It simply concatenates several hidden units together, and fully connects all hidden neurons to all input and output neurons. Note that hidden units are separated from each other, i.e. there are no interconnections between any hidden neuron of one hidden unit and any hidden neuron of another hidden unit (see fig-2). The hidden-weight matrix Whid of the entire network is a block-diagonal matrix with each sub-block corresponding to one particular hidden unit. Note that the sizes of different sub-blocks Mi can differ from each other.

For such a block-diagonal structure the constraint upon the eigenvalue spectrum of the hidden-weight matrix Whid can be easily derived:

    |λ| ≤ max{ ||β^(k)||taxi },   k ∈ [1, ..., nunits]          (6)

With this structure, SpiralRNN can have a bounded eigenvalue spectrum of the hidden-weight matrix as in echo state neural networks (ESN) [4] while still remaining trainable like the simple recurrent networks (SRN) introduced by Elman [5].

C. On-line training

On-line training of SpiralRNN is conducted with the extended Kalman filter (EKF) [6], [7], [8]. The EKF is an extension of the Kalman filter, an optimal linear estimator with the following equations:

    Pt† = Pt−1 + Qt
    Pt  = ( (Pt†)^−1 + H^T Rt^−1 H )^−1
    wt  = wt− + Pt H^T Rt^−1 ( ŷt − H wt− )

where wt is the parameter set to be optimized, and the matrices P, Q, R with initialization {1, 10^−8, 1} × Id2 are parameters of the Kalman filter. During the on-line training of SpiralRNN with the NN5 competition data, we have fixed the Q and R matrices at their initialization values. H is the gradient of the error w.r.t. the parameter set, which has a special form because of the definition of the SMAPE error value in the NN5 competition, as in equation-7, where Ft∗ is the data and yt∗ is the prediction.

    Esmape = (1/n) Σt |yt∗ − Ft∗| / ((yt∗ + Ft∗)/2) × 100%      (7)

2 Id refers to the identity matrix
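The EKF parameter update given by the three equations above can be sketched numerically. The following is an illustrative reimplementation under stated assumptions (plain NumPy, a fixed linear observation map H), not the authors' code; the function name `ekf_update` is ours:

```python
import numpy as np

def ekf_update(w, P, H, y_hat, Q, R):
    """One EKF step in the information form above.

    w     : current parameter vector, shape (n,)
    P     : parameter covariance, shape (n, n)
    H     : gradient/Jacobian of the output w.r.t. w, shape (m, n)
    y_hat : observed target, shape (m,)
    Q, R  : process and measurement noise covariances
    """
    P_dag = P + Q                                        # Pt† = Pt−1 + Qt
    P_new = np.linalg.inv(np.linalg.inv(P_dag)
                          + H.T @ np.linalg.inv(R) @ H)  # covariance update
    w_new = w + P_new @ H.T @ np.linalg.inv(R) @ (y_hat - H @ w)
    return w_new, P_new

# Toy check: estimating a single scalar parameter whose target is 2.0,
# with the paper's initialization P = 1, Q = 1e-8, R = 1 (times identity).
w, P = np.zeros(1), np.eye(1)
Q, R = 1e-8 * np.eye(1), np.eye(1)
H = np.ones((1, 1))
for _ in range(100):
    w, P = ekf_update(w, P, H, np.array([2.0]), Q, R)
# w has converged close to the target 2.0, and P has shrunk accordingly
```

In the paper's setting, H is not a fixed linear map but the SMAPE-based gradient of eq. (9) below, recomputed at every time step.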
As will be mentioned in a later section, we use the logarithm operator to transform the data into a reasonable range before the data are fed into the neural network. Therefore, the gradient H is calculated as in equation-9, with yt and Ft being the corresponding values of yt∗ and Ft∗ in logarithmic scale and et being the on-line training one-step forecast error.

    s = exp(yt) + exp(Ft)                                       (8)

    Ht = − (exp(yt)/s) · ( sign(et) + |et|/s ) · ∂et/∂wt        (9)

As there exist data with empty values, whenever such empty values would be the target for training, we do not implement the parameter update but still accumulate the gradient of the output w.r.t. the parameters, until the data values become available again.

III. TOWARDS NN5 COMPETITION

Theoretically, SpiralRNN can learn the dynamic characteristics of the given data by itself. Being aware of the features of the given data, additional inputs can help to find a more accurate solution and to speed up convergence. These efforts include: (1) providing more information as input to the neural network; (2) using a committee-of-experts approach on top of the neural network training. The use of these efforts is based on the characteristics of the given dataset of the NN5 competition.

A. Data characteristics

The time series data in the NN5 dataset exhibit at least the following features:

F1  A strong weekly period dominates the frequency spectrum, usually with higher values on Thursday and/or Friday;
F2  Important holidays such as the Christmas holidays (including the New Year holiday) and the Easter holidays have a visible impact on the data dynamics;
F3  Several of the time series, such as time series No. 9 and No. 89, show strong seasonal behavior;
F4  Some of the time series (like No. 26 and No. 48) show a sudden change in their statistics, e.g. a shift in the mean value.

B. Pre-processing and additional inputs

The data presented to the neural network are mapped to a useful range by the logarithm function. In order to avoid singularities due to original zero values, we replace them by small positive random values.

Additional sinusoidal inputs are provided as a representation of calendar information. These additional inputs include:

1) Weekly behavior, addressing feature F1. Refer to figure-3(a) and note that the period is equal to 7.
2) Christmas and seasonal behavior, addressing features F2 and F3. It is often observed from the dataset that, right after the Christmas holidays, withdrawal of money was rare; it then increased along the year and finally reached its summit right before Christmas. Seasonal features do not prevail in the dataset, but they do exist among several of the time series, e.g. time series No. 9 and No. 88. As both are regular features with a yearly period, it makes sense to provide an additional input as shown in figure-3(b), which has the period value 365.
3) Easter holiday bump, addressing feature F2. The Easter holidays did not have as much impact on the data dynamics as the Christmas holidays did, but they did provide a certain stimulation of ATM usage in some areas (shown in some time series). Furthermore, as the 56-step prediction interval includes the Easter holidays of the year 1998, the prediction over the holidays can be improved when the related data behavior is learnt. This additional input uses a Gaussian-shaped curve to emulate the Easter holiday bump, as in figure-3(c).

Fig. 3. Additional inputs of the neural networks: (a) Weekly-input, (b) Christmas-input, (c) Easter-input. On the X-axis are the time steps, and on the Y-axis is the additional input value.

C. Hybrid with expertization

SpiralRNN is capable of learning time series prediction with fast convergence; nevertheless, the learned weights
correspond to local minima of the error landscape, as mentioned in [9]. As computational complexity is not an issue for this competition, we apply a mixture-of-experts ansatz.

The committee of experts consists of SpiralRNN models with identical structure but different initializations of the parameter values. Each SpiralRNN model operates in parallel without any interference with the others. During the on-line training, a filtered value of the training error is recorded according to equation-10, with et referring to the one-step on-line training error and α = 0.01. The reciprocal of this filtered value at the end of on-line training determines the weight of the corresponding expert's vote in the committee.

    ε ← αε + (1 − α) et²                                        (10)

After the on-line training, the autonomous predictions of all models are combined based on their ε values. This procedure is shown in table-I.

0. Initialize the n experts;
1. For each SpiralRNN model k, implement on-line training with the data and make a prediction yt,k; meanwhile calculate the filtered error value εk;
2. Based on their ε values, combine the predictions, such that:
       yt = (1/φ) Σ_{k=1..n} yt,k/εk,   φ = Σ_{k=1..n} 1/εk

TABLE I
Committee of experts.

IV. RESULT

Some results from the prediction are shown in this section, indicating the performance of the SpiralRNN model. In figure-4, prediction and data are displayed together, where the X-axis is the time step and the Y-axis is the value. It is shown that the prediction has a period of 7, as does the data; furthermore, the prediction can not only recognize the main peak within the period but also the smaller bump.

Fig. 4. Comparison between result and data, in terms of weekly behavior. Dashed line with circles is the data and solid line with squares is the prediction.

Seasonal behavior of the data is also learnt and predicted, as shown in figure-5. The curves in both sub-plots begin with the values in the Christmas holidays (with time indices around 280 and 650). It is observed from the data in figure-5(a) that the data values behaved as an arch in the first season (90 days) after the Christmas holidays and continued with another arch in the second season. Figure-5(b) shows that the model is able to capture the swapping of seasons. Note that, in figure-5(b), there is overlap between the data and the prediction, where the prediction is displayed with the black solid line and the data is shown by the dashed line.

Fig. 5. Comparison between result and data (time series No. 9), in terms of seasonal behavior: (a) seasonal-data, (b) seasonal-result. Dashed line is the data and solid line is the prediction.

The Easter holidays can also be recognized by the trained model, as shown in figure-6. The Easter holidays from 1996 to 1998 are indexed at positions around 20, 375 and 755. In figure-6, the prediction for the Easter holidays of 1998 follows the data values of the Easter holidays of 1996 and 1997, predicting a spike in ATM withdrawals.

Fig. 6. Comparison between result and data (time series No. 97), in terms of Easter behavior. Dashed line is the data and solid line is the prediction.

Table-II shows the SMAPE errors and their variances for the hybrid approach on the testing dataset (i.e. the data from the last 56 time steps) with a varied number of members in the expert committee. The table shows that the number of experts does not alter the average result; this, on the other hand, saves the effort of utilizing a large number of experts and is favourable for the distributed sensor network application.

# experts    3       5       10      15      20      30
SMAPE        20.65   20.15   20.41   20.96   20.58   20.38
variance     2.45    2.82    2.78    3.30    3.16    3.30

TABLE II
Statistical results: the average SMAPE error value with its variance for the expert committee over all 111 time series, given different numbers of expert members.
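The committee mechanics of eq. (10) and Table I reduce to an inverse-error-weighted average, which eq. (7) then scores. A minimal sketch with our own illustrative helper names (assuming NumPy; not the competition code):

```python
import numpy as np

def filter_error(errors, alpha=0.01):
    """Exponentially filtered squared one-step error, eq. (10):
    eps <- alpha * eps + (1 - alpha) * e_t**2, run over a training record."""
    eps = 0.0
    for e in errors:
        eps = alpha * eps + (1.0 - alpha) * e ** 2
    return eps

def combine_experts(predictions, eps_values):
    """Step 2 of Table I: weight each expert's forecast by 1/eps_k."""
    inv = 1.0 / np.asarray(eps_values)           # expert weights
    phi = inv.sum()                              # normalization phi
    return (np.asarray(predictions) * inv[:, None]).sum(axis=0) / phi

def smape(y, f):
    """SMAPE of eq. (7), in percent (y: prediction, f: data)."""
    y, f = np.asarray(y, float), np.asarray(f, float)
    return 100.0 * np.mean(np.abs(y - f) / ((y + f) / 2.0))

# Toy committee: two well-trained experts and one poorly trained one.
preds = np.array([[10.0, 12.0, 11.0, 13.0],
                  [11.0, 13.0, 12.0, 14.0],
                  [30.0, 30.0, 30.0, 30.0]])
eps = [filter_error(errs) for errs in ([0.1, 0.2], [0.1, 0.3], [5.0, 6.0])]
y = combine_experts(preds, eps)
# The badly performing expert's large filtered error gives it a tiny vote,
# so the combined forecast stays close to the two good experts.
```

As in the paper, the vote uses the filtered error at the end of training; the weights 1/εk are normalized automatically through φ.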

REFERENCES
[1] H. Gao, R. Sollacher, and H.-P. Kriegel, “Spiral recurrent neural
network for online learning,” in 15th European Symposium On Artificial
Neural Networks Advances in Computational Intelligence and Learning,
Bruges (Belgium), April 2007.
[2] H. Gao and R. Sollacher, “Conditional prediction of time series using
spiral recurrent neural network,” in European Symposium on Artificial
Neural Networks Advances in Computational Intelligence and Learning,
2008.
[3] K. Wieand, “Eigenvalue distributions of random permutation matrices,”
The Annals of Probability, vol. 28, no. 4, pp. 1563–1587, 2000.
[4] H. Jaeger, “Adaptive nonlinear system identification with echo state
networks,” Advances in Neural Information Processing Systems, vol. 15,
pp. 593–600, 2003.
[5] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14,
no. 2, pp. 179–211, 1990.
[6] R. Kalman, “A new approach to linear filtering and prediction prob-
lems,” Transactions of the ASME–Journal of Basic Engineering, vol. 82,
pp. 35–45, 1960.
[7] F. Lewis, Optimal Estimation: With an Introduction to Stochastic
Control Theory. A Wiley-Interscience Publication, 1986, ISBN 0-471-83741-5.
[8] G. Welch and G. Bishop, “An introduction to the Kalman filter,”
University of North Carolina at Chapel Hill, Department of Computer
Science, Tech. Rep. Technical Report 95-041, 2002.
[9] R. Sollacher and H. Gao, “Efficient online learning with spiral recurrent
neural networks,” to appear in: International Joint Conference on
Neural Networks, 2008.