Abstract
Classification of multivariate time series (MTS) has been tackled with a large variety of methodologies and applied to a wide range of scenarios. Among the existing approaches, reservoir computing (RC) techniques, which implement a fixed and high-dimensional recurrent network to process sequential data, are computationally efficient tools to generate a vectorial, fixed-size representation of the MTS that can be further processed by standard classifiers. Despite their unrivaled training speed, MTS classifiers based on a standard RC architecture fail to achieve the same accuracy as other classifiers, such as those exploiting fully trainable recurrent networks. In this paper we introduce the reservoir model space, an RC approach to learn vectorial representations of MTS in an unsupervised fashion. Each MTS is encoded within the parameters of a linear model trained to predict a low-dimensional embedding of the reservoir dynamics. Our model space yields a powerful representation of the MTS and, thanks to an intermediate dimensionality reduction procedure, attains computational performance comparable to other RC methods. As a second contribution, we propose a modular RC framework for MTS classification, with an associated open-source Python library. By combining the different modules it is possible to seamlessly implement advanced RC architectures, including our proposed unsupervised representation, bidirectional reservoirs, and non-linear readouts, such as deep neural networks with both fixed and flexible activation functions. Several RC architectures implemented with the proposed framework are compared to other MTS classifiers, including state-of-the-art recurrent networks and time series kernels. Results obtained on benchmark and real-world MTS datasets show that RC classifiers are dramatically faster and, when implemented using our proposed representation, also achieve superior classification accuracy.
Keywords: Reservoir computing, model space, time series classification, recurrent neural networks
1. Introduction
The problem of classifying multivariate time series (MTS) consists in assigning each MTS to one of a fixed
number of classes. This is a fundamental task in many applications, including (but not limited to) health
monitoring [1, 2, 3, 4], civil engineering [5], action recognition [6], and speech analysis [7, 8]. The problem has
been tackled by a wealth of different approaches, spanning from the definition of tailored distance measures
over MTS [9, 10, 11], to the identification and modeling of short-time patterns in the form of dictionaries or
shapelets [12]. In this paper, we focus on MTS classifiers based on recurrent neural networks (RNNs), which
are models specifically conceived to process sequential data and to capture dependencies in time [7, 13].
While there are several variations of RNNs, they all share the same fundamental principle: the MTS is first
processed sequentially by a dynamic (possibly adaptable) model, and then the sequence of its internal states
generated over time is exploited to perform classification [14].
∗ Corresponding author.
Email addresses: filippo.m.bianchi@uit.no (Filippo Maria Bianchi), simone.scardapane@uniroma1.it (Simone
Scardapane), sigurd.lokse@uit.no (Sigurd Løkse), robert.jenssen@uit.no (Robert Jenssen)
Reservoir computing (RC) is a family of RNN models whose recurrent part is kept fixed and is either generated randomly or designed with custom topologies that facilitate the information flow [15, 16, 17, 18].
Despite this strong architectural simplification, the recurrent part of the model (the reservoir) provides a
rich pool of dynamic features which are suitable for solving a large variety of tasks. Indeed, RC models can
achieve excellent performance in many fields, including time series forecasting [19, 20, 21, 22, 23], process
modelling [17], and speech analysis [8]. Since the reservoir is fixed, one needs to train only the readout,
which provides an instant mapping between the internal representation of the reservoir and the task-specific
output [16, 18]. In machine learning, RC techniques were originally introduced under the name echo state
networks (ESNs) [15]; in this paper, we use the two terms interchangeably.
Several works have shown that RC models are a remarkably efficient solution for MTS classification [24,
25, 26, 27]. The MTS is fed into the reservoir that generates a sequence of states over time and, in the
simplest case, the last reservoir state (or the mean of all the states) becomes a representation of the input.
Afterwards, the representation is processed by a classification algorithm for vectorial data [24]. However,
despite its unrivalled training speed, this approach fails to achieve the same accuracy as competing state-of-the-art classifiers [28].
To improve the classification performance, in [29] the reservoir states associated with an MTS are projected onto the principal components of the reservoir states relative to each class; a new sample is classified by identifying the subspace that yields the lowest reconstruction error, in the mean-square sense. To learn more powerful representations from the sequence of reservoir states, an alternative approach, originally proposed in [30] and later applied to MTS classification and fault detection [31, 28, 32], advocates mapping the inputs into a “model-based” feature space where they are represented by statistical properties that better characterize each class. In this formulation, for each MTS a model is trained to predict the next input from the current reservoir state, and the model parameters become the MTS representation. As a drawback, this approach accounts only for those reservoir dynamics useful to predict the next input, and could neglect information that is meaningful for MTS characterization, hence hindering the quality of the representation. To overcome this limitation, in this paper we significantly extend the model-space criterion by proposing a new approach that relaxes the constraints imposed by the original formulation.
a real-world dataset of medical MTS. We show that RC classifiers are dramatically faster than the other
methods and, when implemented using our proposed representation, also achieve superior classification
accuracy.
Notation
We denote variables as lowercase letters (x); constants as uppercase letters (X); vectors as boldface
lowercase letters (x); matrices as boldface uppercase letters (X); tensors as calligraphic letters (X ). All
vectors are assumed to be columns. The operator $\|\cdot\|_p$ is the standard $\ell_p$ norm in Euclidean spaces. The
notation x(t) indicates time step t and x[n] sample n in the dataset.
where θdec are trainable parameters. In practice, g(·) can be implemented by another (feed-forward) neural
network or by a simpler linear model.
In the following, we describe two principal approaches for MTS classification with RNNs. The first is
based on fully trainable architectures, the second on RC, where the encoding is implemented by an RNN that
is left untrained.
where l(·, ·) is a generic loss function (e.g., cross-entropy over the labels). If the encoder has adaptable
weights, the gradient of (5) with respect to θenc and θdec can be computed by back-propagation through
time [14].
To regularize the parameter values during the training phase, a common procedure is to add an $\ell_2$ norm penalty on the weights of both the encoding and decoding functions, r(·) and g(·). The penalty term is controlled by a scalar parameter λ. Furthermore, it is also possible to include dropout regularization, which randomly drops connection weights with probability pdrop during training [35]. In our experiments, in the encoding function we apply a dropout variant specific to recurrent architectures [36].
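As a concrete illustration, the following minimal sketch shows how such a regularized trainable RNN classifier could be set up in Keras. The framework choice and all names are our own assumptions, not prescriptions of the paper; the `recurrent_dropout` argument implements the recurrent-specific dropout of [36].

```python
# Minimal sketch of a regularized trainable RNN classifier (assumed Keras setup).
import tensorflow as tf

F, H_units, n_classes = 10, 30, 2     # input variables, hidden units, classes
lam, p_drop = 1e-4, 0.1               # l2 penalty lambda and dropout probability

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, F)),                   # MTS of any length T
    tf.keras.layers.LSTM(H_units,
                         dropout=p_drop,               # dropout on input connections
                         recurrent_dropout=p_drop,     # recurrent dropout [36]
                         kernel_regularizer=tf.keras.regularizers.l2(lam)),
    tf.keras.layers.Dense(n_classes, activation="softmax",
                          kernel_regularizer=tf.keras.regularizers.l2(lam)),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```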
Despite the theoretical capability of basic RNNs to model any dynamical system, in practice their effectiveness is hampered by the difficulty of training their parameters [37, 38]. To ensure stability, the derivative of the recurrent function in an RNN must not exceed unity. However, as an undesired effect, the gradient of the loss shrinks when back-propagated in time through the network. Using RC models (described in the next section) is one way of avoiding this problem. Another common solution is the long short-term memory (LSTM) network [39]; differently from the vanilla RNN in (3), the LSTM exploits gating mechanisms to maintain its internal memory unaltered for long time intervals. However, the flexibility of the LSTM comes at the cost of a higher computational and architectural complexity. A popular variant is the gated recurrent unit (GRU) [40], which provides better memory conservation using fewer parameters than the LSTM, but whose state update requires an additional operation and hence a higher computational cost. Both LSTM and GRU still require back-propagating the gradient of the loss through time.
1 Note that MTS may have different lengths. For readability, in the paper we refer to the length of a single MTS as T,
in the processing units and, jointly with ρ, can shift the internal dynamics from a chaotic to a contractive
regime. Finally, Gaussian noise with standard deviation ξ can be added to the argument of the state
update function (3) for regularization purposes [15].
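To make the role of these hyperparameters concrete, the following NumPy sketch implements a fixed reservoir and the state update (3) with a tanh nonlinearity, spectral radius ρ, connectivity β, input scaling ω, and state noise ξ. It is a simplified illustration with names of our choosing, not the implementation used in the experiments.

```python
import numpy as np

def init_reservoir(F, R, rho=0.99, beta=0.25, omega=0.15, seed=0):
    """Randomly generate the fixed input and reservoir weight matrices."""
    rng = np.random.default_rng(seed)
    W_in = omega * rng.uniform(-1.0, 1.0, size=(R, F))     # input scaling omega
    W_r = rng.uniform(-1.0, 1.0, size=(R, R))
    W_r *= rng.random((R, R)) < beta                       # keep a fraction beta of connections
    W_r *= rho / np.max(np.abs(np.linalg.eigvals(W_r)))    # rescale spectral radius to rho
    return W_in, W_r

def reservoir_states(X, W_in, W_r, xi=0.01, seed=0):
    """State update h(t) = tanh(W_in x(t) + W_r h(t-1) + noise) over one MTS X (T x F)."""
    rng = np.random.default_rng(seed)
    T, R = X.shape[0], W_r.shape[0]
    H, h = np.zeros((T, R)), np.zeros(R)
    for t in range(T):
        h = np.tanh(W_in @ X[t] + W_r @ h + xi * rng.standard_normal(R))
        H[t] = h
    return H
```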
In ESNs, the decoder (commonly referred to as the readout) is usually a linear model:
$$\mathbf{y} = g(\mathbf{r}_X) = \mathbf{V}_o \mathbf{r}_X + \mathbf{v}_o. \qquad (6)$$
The decoder parameters $\theta_{dec} = \{\mathbf{V}_o, \mathbf{v}_o\}$ can be learned by minimizing a ridge regression loss function
$$\theta_{dec}^{*} = \arg\min_{\{\mathbf{V}_o, \mathbf{v}_o\}} \frac{1}{2} \left\| \mathbf{V}_o \mathbf{r}_X + \mathbf{v}_o - \mathbf{y} \right\|^2 + \lambda \left\| \mathbf{V}_o \right\|^2, \qquad (7)$$
which admits a closed-form solution [18]. The combination of an untrained reservoir and a linear readout
defines the basic ESN model [15].
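As an illustration, the closed-form solution of (7) takes a few lines of NumPy. This is a minimal sketch, assuming the representations rX are stacked row-wise and the labels are one-hot encoded; for brevity the bias is regularized together with Vo, a slight deviation from (7).

```python
import numpy as np

def ridge_readout(R_x, Y, lam=1.0):
    """Closed-form ridge regression readout of Eq. (7).
    R_x: (N x d) matrix of representations, Y: (N x C) one-hot labels."""
    Rb = np.hstack([R_x, np.ones((R_x.shape[0], 1))])   # absorb the bias v_o
    A = Rb.T @ Rb + lam * np.eye(Rb.shape[1])
    return np.linalg.solve(A, Rb.T @ Y)                 # stacked parameters [V_o; v_o]

# Classify a test representation r by taking the highest-scoring output:
# y_hat = np.argmax(np.append(r, 1.0) @ W, axis=-1)
```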
A powerful representation rX is the output model space representation [30], obtained by first processing
each MTS with a common reservoir and then fitting a linear model (one for each MTS), whose parameters
become the MTS representation. Specifically, a ridge regression model is trained to implement an output
function that performs one step-ahead prediction of each input MTS:
$$\mathbf{x}(t+1) = \mathbf{U}_o \mathbf{h}(t) + \mathbf{u}_o, \qquad (8)$$
with $\mathbf{U}_o \in \mathbb{R}^{F \times R}$ and $\mathbf{u}_o \in \mathbb{R}^{F}$. The parameters $\theta_o = [\mathrm{vec}(\mathbf{U}_o); \mathbf{u}_o] \in \mathbb{R}^{F(R+1)}$ become the representation $\mathbf{r}_X$ of the MTS, which is, in turn,
processed by the classifier in (6). In the following, we propose an alternative model space that yields a more
expressive representation of the input MTS.
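Before moving on, the following minimal NumPy sketch illustrates the output model space construction just described: one ridge regression model is fit per MTS to perform the one-step-ahead prediction (8), and its flattened parameters are returned as the representation. The reservoir states H are assumed to be precomputed (e.g., with the reservoir sketch above); names are our own.

```python
import numpy as np

def output_model_space(X, H, lam=5.0):
    """Output model space [30]: fit x(t+1) ~ U_o h(t) + u_o for one MTS.
    X: (T x F) input MTS, H: (T x R) reservoir states. Returns F(R+1) parameters."""
    Hb = np.hstack([H[:-1], np.ones((H.shape[0] - 1, 1))])  # h(1..T-1) plus bias column
    A = Hb.T @ Hb + lam * np.eye(Hb.shape[1])
    W = np.linalg.solve(A, Hb.T @ X[1:])                    # (R+1) x F, i.e. [U_o; u_o]
    return W.flatten()                                      # r_X = theta_o
```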
In this section we introduce the main contribution of this paper, the reservoir model space for representing
MTS. To make the model tractable, we reduce the dimensionality of reservoir features by means of a modified
version of PCA, which deals with data represented as matrices rather than vectors.
Specifically, for each MTS, a linear model is trained to predict the next reservoir state from the current one,
$$\mathbf{h}(t+1) = \mathbf{U}_h \mathbf{h}(t) + \mathbf{u}_h, \qquad (9)$$
and, thus, $\mathbf{r}_X = \theta_h = [\mathrm{vec}(\mathbf{U}_h); \mathbf{u}_h] \in \mathbb{R}^{R(R+1)}$ is our proposed representation of the input MTS.
In contrast to the output model space representation, the prediction model here describes a generative model of the reservoir sequence, rather than of the input sequence. The learned feature space can capture both the data and their generative process. As for the output model space, a classifier that processes the reservoir model representation combines the explanatory capability of generative models with the classification power of discriminative methods.
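The construction mirrors the output model space sketch above, with the next reservoir state (or its low-dimensional embedding, introduced below) replacing the next input as the prediction target. A minimal NumPy sketch, with names of our choosing:

```python
import numpy as np

def reservoir_model_space(H, lam=5.0):
    """Reservoir model space: fit h(t+1) ~ U_h h(t) + u_h for one MTS.
    H: (T x D) sequence of (possibly reduced) reservoir states.
    Returns the D(D+1) parameters used as the representation r_X."""
    Hb = np.hstack([H[:-1], np.ones((H.shape[0] - 1, 1))])  # h(1..T-1) plus bias column
    A = Hb.T @ Hb + lam * np.eye(Hb.shape[1])
    W = np.linalg.solve(A, Hb.T @ H[1:])                    # (D+1) x D, i.e. [U_h; u_h]
    return W.flatten()                                      # r_X = theta_h
```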
It is possible to see that $\mathbf{r}_X$ characterizes the following discriminative model:
$$p\left(\mathbf{h}(t+1) \mid \mathbf{h}(t); \mathbf{r}_X\right). \qquad (10)$$
At the same time, the reservoir encoding that operates according to (3) can be expressed as
$$\mathbf{h}(t) = f\left(\mathbf{x}(t), \mathbf{h}(t-1)\right). \qquad (11)$$
The state $\mathbf{h}(t-1)$ in (11) depends, in turn, on $\mathbf{x}(t-1)$ and $\mathbf{h}(t-2)$, and so on. By expressing those dependencies explicitly with the chain rule of probability and then plugging (11) in (10), one obtains
$$\prod_{t=1}^{T} p\left(\mathbf{x}(t+1) \mid \mathbf{x}(t), \mathbf{x}(t-1), \ldots, \mathbf{x}(1); \mathbf{r}_X\right) = p\left(\mathbf{x}(T), \mathbf{x}(T-1), \ldots, \mathbf{x}(1); \mathbf{r}_X\right). \qquad (12)$$
Eq. (12) is a generative model [43], which provides a powerful characterization of the inputs and also
induces a metric relationship between samples [44]. As a result, classification with the model space criterion
can be categorized as a hybrid discriminative/generative method, and the representation rX (learned in an unsupervised fashion) can be exploited in a variety of different tasks. Indeed, the model-based characterization of a MTS has proven effective in anomaly detection [31], classification [28, 32], and in building a kernel similarity matrix [30].
Figure 1: Schematic depiction of the procedure to generate the reservoir model space representation. For each input MTS X[n], a sequence of states H[n] is generated by a fixed reservoir. These are the frontal slices (dimension N) of the tensor H, although in the figure the lateral slices (dimension T) are shown. By means of dimensionality reduction, the number of reservoir features is reduced from R to D. A distinct linear model is trained to predict the columns of each frontal slice Ĥ[n] of Ĥ. The parameters θh[n] of the model trained on Ĥ[n] become the representation of X[n].
orderings are lost, as reservoir states relative to different samples and generated in different time steps are
mixed together. This may lead to a potential loss in the representation capability, as the existence of modes
of variation in time courses within individual samples is ignored [49]. To address this issue, we consider as individual samples the matrices $\mathbf{H}_n \in \mathbb{R}^{T \times R}$, obtained by slicing $\mathcal{H}$ across its first dimension. The sample covariance matrix in this case reads
$$\mathbf{S} = \frac{1}{N-1} \sum_{n=1}^{N} \left( \mathbf{H}_n - \bar{\mathbf{H}} \right)^T \left( \mathbf{H}_n - \bar{\mathbf{H}} \right). \qquad (14)$$
The first D leading eigenvectors of $\mathbf{S}$ are stacked in a matrix $\mathbf{E} \in \mathbb{R}^{R \times D}$ and the desired tensor of reduced dimensionality is obtained as $\hat{\mathcal{H}} = \mathcal{H} \times_3 \mathbf{E}$, where $\times_3$ denotes the 3-mode product.
Like $\mathbf{C}$, $\mathbf{S} \in \mathbb{R}^{R \times R}$ describes the variations of the variables in the reservoir. However, the whole time sequence of reservoir states generated by a MTS is now considered as a single observation. Accordingly, states pertaining to different MTS are no longer mixed together and their temporal ordering is preserved. We notice that an analogous approach was adopted to reduce the dimensionality of images [50], which are characterized by multiple spatial dimensions, whereas the dimensions in our samples represent space and time, respectively.
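A NumPy sketch of this reduction step, computing (14) and the 3-mode product, follows; the function name and the tensor layout (H stored as an N x T x R array) are our assumptions.

```python
import numpy as np

def reduce_states(H, D=75):
    """Dimensionality reduction of Eq. (14): each (T x R) slice H[n] is one sample.
    Returns the reduced tensor of shape (N x T x D)."""
    Hc = H - H.mean(axis=0)                        # center with the mean sequence H_bar
    # S = 1/(N-1) * sum_n (H_n - H_bar)^T (H_n - H_bar), an (R x R) matrix
    S = np.einsum('ntr,nts->rs', Hc, Hc) / (H.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(S)
    E = eigvecs[:, np.argsort(eigvals)[::-1][:D]]  # D leading eigenvectors, E (R x D)
    return H @ E                                   # 3-mode product H x_3 E
```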
After dimensionality reduction, the model in (9) becomes
$$\hat{\mathbf{h}}(t+1) = \mathbf{U}_h \hat{\mathbf{h}}(t) + \mathbf{u}_h, \qquad (15)$$
where $\hat{\mathbf{h}}(\cdot)$ are the columns of a frontal slice $\hat{\mathbf{H}}[n]$ of $\hat{\mathcal{H}}$, $\mathbf{U}_h \in \mathbb{R}^{D \times D}$, and $\mathbf{u}_h \in \mathbb{R}^{D}$. The representation now coincides with the vector of parameters $\mathbf{r}_X = \theta_h = [\mathrm{vec}(\mathbf{U}_h); \mathbf{u}_h] \in \mathbb{R}^{D(D+1)}$, whose dimensionality is controlled by the described dimensionality reduction procedure. A schematic description of the proposed unsupervised procedure to derive the reservoir model representation is shown in Fig. 1.
Apart from the model-based criterion, in recent years several additional approaches have been proposed in the RC literature for extending the basic ESN architecture. Most works focused on the design of more sophisticated reservoirs, readouts, or representations of the input MTS, and these can be considered independently of one another. To evaluate their efficacy in the context of MTS classification, we introduce
a unified framework for MTS classification that generalizes several RC architectures by combining four modules: i) a reservoir module, ii) a dimensionality reduction module, iii) a representation module, and iv) a readout module.
Figure 2: Overview of the RC framework for MTS classification. The encoder generates a representation rX of the MTS X, while the decoder assigns a label y, given the representation. Several models are obtained by selecting variants for each module.
A complete overview of all the combinations considered in this paper, including the proposed reservoir
model space representation, is given in Fig. 2. The input MTS X is processed by a reservoir, which is
either unidirectional or bidirectional, and generates the sequence of states H over time. An optional
dimensionality reduction step can be applied to reduce the number of reservoir features, and a new sequence
H̄ is obtained. Three different approaches can be chosen to generate a representation rX of the input from
the sequence of reservoir features: the last state of the sequence h(T ), the output state model θo (Sec. 2.2),
or the proposed reservoir state model θh (Sec. 3). The obtained representation rX is finally processed by a
decoder (readout), which is implemented by a classifier for real-valued vectors. We consider three different classifiers: the standard linear readout implemented by ridge regression, an SVM classifier, and a deep neural network (MLP). The classifier outputs the class label y to be assigned to the MTS X.
In the following, we describe the reservoir, dimensionality reduction, and readout modules, and we discuss the functionality of the variants implemented in our framework. A Python software library implementing the unified framework is publicly available online2. The library is regularly updated and several other options to configure the classifier are available, besides the ones described in this paper.
2 https://github.com/FilippoMB/Reservoir-Computing-framework-for-multivariate-time-series-classification
Of particular interest for the classification of MTS is the bidirectional reservoir, which is included in our framework as an optional replacement for the standard reservoir.
Bidirectional architectures have been successfully applied in RNNs to extract temporal features from the input sequences that also account for dependencies very far in time [7]. Within the RC framework, a bidirectional reservoir has been used in the context of time series prediction to collect future information, which is provided at the training stage to improve the accuracy of the model [17]. Since the future information is not available at test time, the model uses a standard reservoir to extract temporal features. In a classification setting, on the other hand, the whole time series is given at once in both the training and test stages. In [46] the last state of a bidirectional reservoir is used as a fixed-size representation for classifying the input MTS.
Bidirectionality is implemented by feeding into the same reservoir the input sequence both in straight and reverse order, so that the following states are generated:
$$\vec{\mathbf{h}}(t) = f\left(\mathbf{W}_{in} \mathbf{x}(t) + \mathbf{W}_r \vec{\mathbf{h}}(t-1)\right), \qquad \overleftarrow{\mathbf{h}}(t) = f\left(\mathbf{W}_{in} \overleftarrow{\mathbf{x}}(t) + \mathbf{W}_r \overleftarrow{\mathbf{h}}(t-1)\right), \qquad (16)$$
where $\overleftarrow{\mathbf{x}}(t)$ denotes the input sequence in reverse order. The bidirectional state is the concatenation $\mathbf{h}^b(t) = [\vec{\mathbf{h}}(t); \overleftarrow{\mathbf{h}}(t)]$.
The representation of the input as the last state generated by the reservoir in this case becomes $\mathbf{r}_X = \mathbf{h}^b(T)$. The main advantage of using a bidirectional reservoir in conjunction with the last-state representation is that, compared to unidirectional architectures, temporal dependencies spanning longer time intervals can be captured. Since the reservoir trades its internal stability for a vanishing memory of the past inputs [42], at time T the forward state $\vec{\mathbf{h}}(T)$ retains scarce information about the first inputs processed. On the other hand, the backward state $\overleftarrow{\mathbf{h}}(T)$ is more influenced by the first inputs and, therefore, $\mathbf{h}^b(T)$ summarizes well both recent and past information. This representation improves classification results especially when important information is also contained in the first part of the input.
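To make this concrete, a bidirectional reservoir can be sketched by reusing the reservoir_states() function from the sketch in the previous section (an assumption of our illustration): the same fixed weights process the MTS in both directions and the two state sequences are concatenated.

```python
import numpy as np

def bidirectional_states(X, W_in, W_r, **kwargs):
    """Bidirectional reservoir of Eq. (16): run the same fixed reservoir over the
    MTS in straight and reverse order and concatenate the state sequences."""
    H_fw = reservoir_states(X, W_in, W_r, **kwargs)        # forward pass
    H_bw = reservoir_states(X[::-1], W_in, W_r, **kwargs)  # pass over the reversed input
    return np.hstack([H_fw, H_bw])                         # (T x 2R) bidirectional states

# Last-state representation of size 2R:
# r_X = bidirectional_states(X, W_in, W_r)[-1]
```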
In the proposed reservoir model space formulation (see Sec. 3), a bidirectional reservoir modifies the representation $\mathbf{r}_X$, as the linear model in (9) becomes
$$\left[\vec{\mathbf{h}}(t+1); \overleftarrow{\mathbf{h}}(t+1)\right] = \mathbf{U}_h^b \left[\vec{\mathbf{h}}(t); \overleftarrow{\mathbf{h}}(t)\right] + \mathbf{u}_h^b, \qquad (17)$$
where $\mathbf{U}_h^b \in \mathbb{R}^{2R \times 2R}$ and $\mathbf{u}_h^b \in \mathbb{R}^{2R}$ are the new set of parameters. In this case, the linear model is trained to
optimize two distinct objectives: predicting the next state $\vec{\mathbf{h}}(t+1)$ and reproducing the previous one, encoded by $\overleftarrow{\mathbf{h}}(t+1)$ (or, equivalently, their low-dimensional embeddings). Therefore, by combining the reservoir model space with a bidirectional reservoir, the prediction model must also learn a memorization task, as it yields at each time step both the next and the previous reservoir state. We argue that such a model provides a more accurate representation of the input, as it models the temporal dependencies in both time directions at the same time. We notice that the bidirectional reservoir produces a similar effect also in the output model space (see Sec. 2.2), where the linear model jointly learns to reproduce the previous input and to predict the next one.
states [46]. If the MTS are represented by the last reservoir states, these are stacked into a matrix to which standard dimensionality reduction procedures are applied. However, as discussed in Sec. 3, when the whole set of representations forms a tensor, the dimensionality reduction technique should account for factors of variation across more than one dimension.
In contrast to the other modules, it is possible to implement an RC classifier without the dimensionality reduction module (as depicted by the skip connection in Fig. 2). However, as discussed in Sec. 3, dimensionality reduction is particularly important when implementing the proposed reservoir model space representation, to lower the computational complexity and reduce the risk of overfitting. Furthermore, dimensionality reduction becomes fundamental when using a bidirectional reservoir, as the state dimension is doubled and, accordingly, the size of the representation increases as well. In particular, when the last state is used as representation, its size is doubled, rX ∈ R^{2R}. In the output and reservoir model space representations, instead, the size becomes rX ∈ R^{F(2R+1)} and rX ∈ R^{2R(2R+1)}, respectively.
5. Experiments
In this section we test a variety of RC-based architectures for MTS classification implemented with
the proposed framework. We also compare against RNN classifiers trained with gradient descent (LSTM and GRU), a 1-NN classifier based on the Dynamic Time Warping (DTW) similarity, and SVM classifiers configured with pre-computed kernels for MTS. Depending on whether the input MTS in the RC-based model is represented by the last reservoir state (rX = h(T)), by the output space model (Sec. 2.2), or by the reservoir space model (Sec. 3), we refer to the models as lESN, omESN, and rmESN, respectively. Whenever we use a bidirectional reservoir, a deep readout, or an SVM readout, we add the prefix “bi-”, “dr-”, or “svm-”, respectively (e.g., bi-lESN or dr-bi-rmESN).
First, we introduce the MTS classification datasets under analysis and explain our experimental setup. In Sec. 5.1, we compare the performance obtained on several benchmark datasets by RC classifiers configured with different representations, by classifiers based on fully trained RNNs, and by the 1-NN classifier using DTW. In Sec. 5.2, we investigate whether the performance of the RC-based models improves when the representations are generated with a bidirectional reservoir and/or processed with deep readouts. Finally, in Sec. 5.3 we process a real-world dataset and further compare the classification performance against two time series kernels, using an SVM classifier.
Benchmark datasets. To provide an extensive evaluation of the performance of each classifier, we consider
several benchmark classification datasets for MTS taken from the UCR3 and UCI repositories4 . We excluded
from the analysis only those MTS datasets that contain too few training samples to train a neural network
model. We also included three univariate time series datasets, to show how the proposed approaches can be
seamlessly applied also to the univariate case. Details of the datasets are reported in Tab. 1.
Table 1: Details of the time series benchmark datasets. Columns 2 to 5 report the number of variables (#V), the number of samples in the training and test sets, and the number of classes (#C), respectively. Tmin is the length of the shortest MTS in the dataset and Tmax that of the longest.
Blood samples dataset. As a real-world case study, we analyze MTS of blood measurements obtained from the electronic health records of patients undergoing gastrointestinal surgery at the University Hospital of North Norway in 2004–2012.5 Each patient is represented by an MTS of 10 blood sample measurements collected for 20 days after surgery. We consider the problem of classifying patients with and without surgical site infections from their blood samples. The dataset consists of 883 MTS, of which 232 pertain to infected patients. The original MTS contain missing data, corresponding to measurements not collected for a given patient at certain time intervals; these are replaced by zero-imputation in a preprocessing step.
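As a minimal illustration of this preprocessing step (assuming the missing measurements are stored as NaN entries in each (T x F) array, an assumption of ours), zero-imputation is a one-liner in NumPy:

```python
import numpy as np

X = np.array([[36.5, np.nan], [np.nan, 4.2]])  # toy (T x F) MTS with missing entries
X_imputed = np.nan_to_num(X, nan=0.0)          # zero-imputation of missing values
```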
Experimental setup. For each dataset, we train the models 10 times using independent random parameter initializations, and each model is configured with the same hyperparameters in all experiments. Reservoirs are sensitive to the hyperparameter setting and, therefore, fine-tuning with independent validation procedures for each task is usually more important in RC models than in networks trained with gradient descent, such as LSTM and GRU. Nevertheless, we show that even when setting the hyperparameters according to unsupervised heuristics [63], the RC classifiers are robust and achieve superior performance, especially when configured with the proposed representation (rmESN).
To provide a significant comparison, lESN, omESN and rmESN always share the same randomly gen-
erated reservoir configured with the following hyperparameters: internal units R = 800; spectral radius
ρ = 0.99; non-zero connections percentage β = 0.25; input scaling ω = 0.15; noise level ξ = 0.01. When
classification is performed with a ridge regression readout, we set the regularization λ = 1.0. The ridge
3 www.cs.ucr.edu/~eamonn/time_series_data
4 archive.ics.uci.edu/ml/datasets.html
5 The dataset has been published in the AMIA Data Competition 2016
regression prediction models, used to generate the model-space representation in omESN and rmESN, are
instead configured with λ = 5.0.
In each RC-based method, we apply dimensionality reduction, as it provides important computational
advantages (in terms of both memory and CPU time), as well as a regularization that improves the overall
generalization capability and robustness of the models. To determine the optimal number of subspace
dimensions D, we evaluate how training time and average classification accuracy (computed with a k-fold
cross-validation procedure) of the RC classifiers vary on the benchmark datasets in Tab. 1. The average results are reported in Fig. 3. While the training time increases approximately linearly with D, it is possible to identify an “elbow” in the classification accuracy at D = 75, which is the value we select.
Figure 3: Average classification accuracy and training time of the RC classifiers as a function of the number of subspace dimensions D.
LSTM and GRU are configured with H = 30 hidden units; the decoding function is implemented as a neural network with a dense layer of 20 hidden units followed by a softmax layer; the dropout probability is pdrop = 0.1; the $\ell_2$ regularization parameter is λ = 0.0001; gradient descent is performed with the Adam algorithm [64] and the models are trained for 5000 epochs. Finally, the 1-NN classifier uses FastDTW [9], implemented with the related Python library6 (a minimal usage sketch is given below).
6 https://pypi.org/project/fastdtw/
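For completeness, the following sketch shows how such a 1-NN DTW classifier can be assembled with the fastdtw package; the helper name and the use of the Euclidean ground distance are our assumptions.

```python
import numpy as np
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

def one_nn_dtw(X_train, y_train, X_test):
    """1-NN classification with the FastDTW distance.
    X_train, X_test: lists of (T x F) arrays; the lengths T may differ."""
    y_pred = []
    for x in X_test:
        # distance to every training MTS; fastdtw returns (distance, warping path)
        dists = [fastdtw(x, z, dist=euclidean)[0] for z in X_train]
        y_pred.append(y_train[int(np.argmin(dists))])
    return np.array(y_pred)
```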