

CNN-FCM: System Modeling Promotes Stability of Deep
Learning in Time Series Prediction
Penghui Liu, Jing Liu1, Kai Wu

School of Artificial Intelligence, Xidian University, Xi'an 710071, China

Abstract: Time series data are usually non-stationary and evolve over time. Although deep learning has proven effective in dealing with sequential data, the stability of deep neural networks when coping with situations unseen during the training stage is also important. This paper addresses this problem with a fuzzy cognitive block (FCB), which embeds the learning of high-order fuzzy cognitive maps into the deep learning architecture. In this way, computers can automatically model the complex system that produces the observations rather than simply regress the available data. Accordingly, we design a deep neural network termed CNN-FCM, which combines the available convolutional network with FCB. To validate the advantages of our design and verify the effectiveness of FCB, twelve benchmark datasets are employed and classic deep learning architectures are introduced for comparison. The experimental results show that the performance of many currently popular deep learning architectures declines when handling data that deviate from the training set, whereas FCB plays an important role in promoting the performance of CNN-FCM in the corresponding experiments. We therefore conclude that system modeling can promote the stability of deep learning in time series prediction.

Keywords: Fuzzy cognitive maps; Deep neural networks; Time series prediction; System modeling.

I. Introduction
Deep learning has made remarkable progress in the past decades. Researchers in the
community strive to expand the application scope of deep learning and resolve problems in
various fields [1-8]. Convolutional neural networks (CNNs) and recurrent neural networks
(RNNs) are currently the most influential deep neural models and have been widely
applied in computer vision [49-50] and language processing. CNNs [9-14,51] are
designed to process input data with multiple arrays, while RNNs are designed to process
sequential data [15-18].
The time series prediction problem is a typical sequential-data task that can be
summarized as follows: suppose a time series X = {x1, x2, …, xt} is composed of sequential
observations ranked in time. The prediction task is to predict the value of xt+1 based on the
available observations, a procedure that can be formulated as xt+1 = f(x1, x2, …, xt-1, xt). Time
series data differ from other types of data due to the following characteristics [41]: (1) the data
may contain noise; (2) the system that produces the observations is unknown; (3) time series
are mostly non-stationary and evolve over time. These issues make the analysis and modeling
of time series data challenging. Conventional deep learning methods that simply regress over
the available data are likely to perform unstably when the input data deviate from the training
set. Because of these characteristics, coping with data that deviate from the training set (unseen
situations) is inevitable for predictors, and the corresponding stability is important.

1 Corresponding author. For additional information regarding this paper, please contact Jing Liu, e-mail: neouma@163.com,
https://faculty.xidian.edu.cn/LJ22/zh_CN/index.htm
Conventional deep learning architectures make predictions based on features or
patterns extracted from the original time series data. As the observations evolve over time, the
features or patterns employed may no longer be appropriate for further prediction.
For example, stock market data are highly complex and many external factors influence the
trend, so it is challenging to extract consistent features or patterns for prediction. Simply
regressing the available data and neglecting unseen situations is risky for time series
prediction. Rather than simply regress the available data, this paper attempts to embed
system modeling into the deep learning architecture. In this way, the complex system that
produces the time series data can be learned, situations unseen during the training stage
can be inferred from the learned system, and the stability of deep learning in time series
prediction can be promoted.
Fuzzy cognitive maps (FCMs) are effective tools for computers to model a complex
system [19-28]. The major process of employing FCMs for time series prediction can be
summarized as follows: (a) decompose signals from the original time series data; (b) learn FCMs
based on the signals decomposed in the previous stage and predict the fuzzy time series; (c) train
a regression model to predict the observation at the next time step based on the output of the FCMs.
Traditional FCM-based time series prediction methods usually rely on fuzzy c-means clustering,
wavelet transform, or other methods to extract decomposed signals [26, 23], and then gradient
descent methods or evolutionary algorithms are employed to learn the structure of the
FCMs and train a regression model [23, 29-31]. How to integrate the procedure of FCMs into
deep neural networks and coordinate the originally separated steps (a)-(c) remains a challenge.
Accordingly, this paper designs a deep neural network termed CNN-FCM that
combines the learning of fuzzy cognitive maps with deep learning. To realize this, we
propose a fuzzy cognitive block (FCB) that integrates the learning of high-order FCMs into our
architecture. FCB separates the learning of the high-order FCM by order and employs a number of fully
connected layers to perform the equivalent function. Based on FCB, computers can automatically
learn the complex system producing the corresponding observations. To investigate the
performance of our design and validate the effectiveness of FCB, classic deep neural networks
are introduced as a comparison and twelve benchmark datasets are employed in our experiments.
The experimental results illustrate the performance decline of the convolutional network and
other classic neural networks when unseen situations occur. CNN-FCM, however, outperforms
the other deep learning architectures, as FCB effectively promotes the stability of CNN-FCM
when predicting on datasets with large uncertainty.
The major contributions of this paper are summarized as follows: (1) This paper designs
a deep neural network with satisfactory stability on the time series prediction task. (2)
To the best of our knowledge, this paper is the first to apply FCMs to deep learning and
thereby promote stability in time series prediction. (3) This paper reveals that
system modeling can promote the stability of deep learning in time series prediction.
(4) CNN-FCM can be regarded as an FCM-based method, and this paper provides a
novel idea to realize end-to-end training of this type of method.
The rest of this paper is organized as follows: Section II reviews FCMs and
some classic deep learning models proposed for the time series prediction problem.
The details of FCB and CNN-FCM are presented in Section III. Section IV provides the
experimental results comparing CNN-FCM against other popular deep learning architectures.
Section V gives the conclusion.

II. Related work on fuzzy cognitive maps and deep learning

A. Fuzzy cognitive maps

In general, realistic systems can usually be regarded as networks composed of multiple
key components. For time series data, it is reasonable to learn the intrinsic relationships among
the latent variables that finally produce the observations. To this end, some
researchers introduce FCMs to describe the relationships between different latent variables.

Suppose an FCM consists of Nc latent variables whose activations are
denoted as C = [C1, C2, …, CNc]. The correlations among these latent variables can be
denoted by an Nc × Nc matrix as given in (1), where wi,j ∈ [-1, 1] represents the relationship
between the ith and jth variables. Fig. 1 provides an intuitive illustration of an FCM. Given the
activations of the variables (C) and the correlations of the variables (W), the activations at the next
time step can be obtained according to (2), where C_i^t denotes the activation of the ith latent variable
at time t and g(x) is the sigmoid function given in (3). This function predicts C_i^{t+1} based on the
one-order FCM and only considers C_i^t for prediction. To fully utilize the history information, the
high-order cognitive map given in (4) is usually considered, where L denotes the order.
The parameter λ in g(x) is set to 1, as is commonly done in the literature.

$$W = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,N_c} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,N_c} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N_c,1} & w_{N_c,2} & \cdots & w_{N_c,N_c} \end{bmatrix} \quad (1)$$

$$C_i^{t+1} = g\!\left(\sum_{j=1}^{N_c} w_{ji}\,C_j^{t}\right) \quad (2)$$

$$g(x) = \frac{1}{1+e^{-\lambda x}} \quad (3)$$

$$C_i^{t+1} = g\!\left(\sum_{j=1}^{N_c}\left(w_{ji}^{1}C_j^{t}+w_{ji}^{2}C_j^{t-1}+\cdots+w_{ji}^{L}C_j^{t-L+1}\right)+b\right) \quad (4)$$
Available FCM-based time series prediction approaches mostly consist of three
stages: (1) employ some method to obtain decomposed signals from the input; (2) learn the
correlations among the decomposed signals (latent variables) obtained in the previous stage
and predict the activation states of the latent variables at t+1; (3) employ a regression model to
predict the next observation based on the output of the FCM.
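As an illustration of the update rule in (4), the following NumPy sketch evaluates one step of a (high-order) FCM for a given set of weight matrices. The function and variable names (hfcm_step, W_list, C_history) are our own illustrative choices, not code from the paper.

```python
import numpy as np

def sigmoid(x, lam=1.0):
    # Activation function g(x) in (3); lambda is set to 1 as in the paper.
    return 1.0 / (1.0 + np.exp(-lam * x))

def hfcm_step(W_list, C_history, bias=0.0):
    """One step of a high-order FCM, Eq. (4).

    W_list    : list of L weight matrices [W1, ..., WL], each of shape (Nc, Nc),
                where W_l[j, i] is the influence of variable j at lag l-1 on variable i.
    C_history : list of L activation vectors [C^t, C^{t-1}, ..., C^{t-L+1}], each of shape (Nc,).
    bias      : scalar or (Nc,) bias term b in (4).
    """
    pre_activation = sum(W.T @ C for W, C in zip(W_list, C_history)) + bias
    return sigmoid(pre_activation)

# Toy usage with the 5-node FCM of Fig. 1 used as a one-order map (L = 1).
W = np.array([[0.0, 0.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.2, 0.0],
              [0.0, -0.3, 0.0, 0.5, 0.0],
              [0.1, 0.0, 0.0, 0.0, 0.1],
              [0.0, 0.0, 0.0, -0.4, 0.2]])
C_t = np.array([0.2, 0.5, 0.1, 0.7, 0.3])
C_next = hfcm_step([W], [C_t])
```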

$$W = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0.1 & 0 & 0 & 0.2 & 0 \\ 0 & -0.3 & 0 & 0.5 & 0 \\ 0.1 & 0 & 0 & 0 & 0.1 \\ 0 & 0 & 0 & -0.4 & 0.2 \end{bmatrix}$$

Fig. 1. An example FCM with 5 latent variables. (a) Weighted graph and (b) corresponding weight matrix.

B. Convolutional networks
Convolutional networks are commonly employed to extract features from inputs with
multiple dimensions. Although CNNs are applicable to sequential data, RNNs have long been
the default choice for signal processing problems. However, the temporal convolutional network
(TCN) proposed by Bai et al. has surpassed available RNNs across a diverse range of tasks
and datasets [32]. The corresponding architecture is shown in Fig. 2.

TCN is built on two principles to suit signal processing: (1) no
information leaks from the future into the past; (2) the input and output signals have the same
length. To realize these properties, Bai et al. employed dilated convolutions and zero
padding to construct the causal convolution shown in Fig. 2(a). As can be seen, the dilation
factor grows with the number of layers, so the receptive field expands exponentially and is
able to cover all values of the input sequence within a limited number of layers. To provide
stable gradients for training TCN, a residual structure is introduced into its design. The residual
block constructing TCN is illustrated in Fig. 2(b). The 1×1 convolution in Fig. 2(b) is used to
resolve the discrepancy between the input and output widths.

TCN ends with a fully connected layer as the head network that outputs the
prediction result. TCN can thus be trained end-to-end and automatically extracts
valuable information from the sequential data.

Fig. 2. The architecture of TCN. (a) A dilated causal convolution employed in TCN, with dilation factors d = 1, 2, 4 and filter size ks = 3. (b) Residual block constructing TCN.

C. Recurrent networks
RNNs are dedicated architectures for sequential data that maintain a vector of hidden
states as a memory carried through time. The output of an RNN at different discrete time
steps can be considered as the output of different neurons in a multilayer network with
weights shared across layers. The corresponding illustration of RNNs is given in Fig. 3. This
family of architectures has been popular in processing time series data, with prominent
applications in language processing. However, training RNNs has long been a challenge,
because RNNs become very deep feedforward networks when unfolded in time and suffer from
vanishing gradients. As a result, RNNs struggle to learn from very long time series, and many
variants have been proposed. The long short-term memory (LSTM) network handles this problem
by introducing a self-connection to the next time step with a weight of one [16]. The gated
recurrent unit (GRU) is a simplified variant of the LSTM and has performed well in many
different tasks [40].

Fig. 3. Unfolded recurrent network.

D. Comparison between FCM-based methods and deep learning methods

The previous parts have provided a basic review of FCM-based methods and deep learning
architectures for time series prediction. Deep learning architectures learn the correlation
between features of the time series data and the prediction results. Thanks to the one-stage
training process, features extracted by deep learning architectures usually have advantages
over conventional handcrafted features. However, available architectures also have
disadvantages when dealing with data that deviate from the training dataset. As such data have
not been considered in the training stage, the rationality of the learned mapping function on
them is uncertain, and the corresponding architectures lack reasoning capability. For
example, deep learning architectures are easily fooled when the input image is
combined with a designed noise [42-43]; the designed noise exploits flaws in the mapping
learned by deep learning. Similarly, some studies emphasize the impact of uncertainty in
deep learning and extend available architectures to learn the uncertainty of the input [44-45].
Instead, FCM-based methods learn the complex system that produces the observations.
The system learned by FCMs has satisfactory interpretability and can provide more
rational reasoning over data that deviate from the training dataset. Graph neural networks (GNNs)
[46] have become popular for enhancing the reasoning ability of computers. FCM-based
methods share some similarities with GNNs, as both operate on graphs. However,
FCMs learn the structure of the graph, while GNNs process hidden states on a
designed graph structure. Notably, echo state networks (ESNs) [47], a variant of RNNs,
employ random graphs to update the states of nodes, and the parameters of these graphs are not
updated during training. Different from GNNs and ESNs, the graph learned by a high-order FCM
can be regarded as a multi-layer network.

FCMs have advantages in system modeling, and embedding FCMs into deep neural
networks can promote the reasoning capability in time series prediction. The system learned
by FCMs can help provide relatively rational data processing when the input data deviate
from the training dataset.

Fig. 4. Structure of fuzzy cognitive block.

III. Details of fuzzy cognitive block and CNN-FCM

A. Fuzzy cognitive block: FCB


Based on the review of FCM-based methods and deep learning methods, we can see that
FCM-based methods learn a complex system that provides more rational reasoning over data
deviating from the training set. The complex system learned by FCM-based methods can thus fill
a gap of deep learning methods. Accordingly, we propose a fuzzy cognitive block (FCB) to
implement the learning of complex systems within a deep learning architecture. FCB learns the
complex system producing the observations within the deep learning architecture and helps the
network reason over data that deviate from the training set.
FCM-based methods usually employ signal processing algorithms to decompose the time
series data and then learn the correlation graph of the decomposed components. Deep learning has
advantages in feature extraction, so this paper employs TCN to perform the decomposition.
The decomposed components are denoted as {C1, C2, …, CNc}, and these
components act as the latent variables producing the observations. Nc denotes the
number of components extracted from the input data.
FCMs model the correlations among {C1, C2, …, CNc}. The formulation in (4) can be
rewritten as (5), where W^T denotes the transpose of the corresponding adjacency matrix
of the graph. The high-order FCM is thus essentially a separable multi-layer network
with the adjacency matrix of each layer denoted as Wi. As a fully connected layer is
mathematically equivalent to y = W^T x, the high-order FCM in (5) can be implemented by a
set of fully connected layers (FCLs). A block composed of such FCLs therefore
implements the function of an FCM; we term it the fuzzy cognitive block (FCB).

$$C^{t+1} = g\!\left(W_1^{T}C^{t}+W_2^{T}C^{t-1}+\cdots+W_L^{T}C^{t-L+1}+b\right) \quad (5)$$

Fig. 4 provides an intuitive illustration of FCB. As can be seen, the input of FCB is the
time series of the latent variables {C1, C2, …, CNc} extracted from the original data. Different
colors mark data with different time indices and indicate the order to which the
corresponding data belong. The activations of the latent variables at the next time step are
obtained according to (5), and the FCLs within FCB learn the correlations of the latent
variables at the different orders (W1, W2, …, WL).

During the training process, the parameters of the FCLs (Wi) are updated through
backpropagation, and the computer thereby learns the complex system that produces the observations.
As Wi in most realistic complex systems tends to be sparse, a weight decay term is generally
applied to the learning of FCMs.
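As a concrete sketch of this idea, the PyTorch module below implements (5) with one fully connected layer per order; the class name FuzzyCognitiveBlock and its arguments are our own illustrative choices under the assumptions stated in the comments, not code released with the paper.

```python
import torch
import torch.nn as nn

class FuzzyCognitiveBlock(nn.Module):
    """Sketch of FCB: one fully connected layer per order of the high-order FCM, Eq. (5)."""

    def __init__(self, n_concepts: int, order: int):
        super().__init__()
        self.order = order
        # One Nc x Nc linear map per order; the bias of the first layer plays the role of b in (5).
        self.layers = nn.ModuleList(
            [nn.Linear(n_concepts, n_concepts, bias=(l == 0)) for l in range(order)]
        )

    def forward(self, c_seq: torch.Tensor) -> torch.Tensor:
        # c_seq: (batch, order, n_concepts), ordered from C^{t-L+1} up to C^t along dim 1.
        pre = 0
        for l, layer in enumerate(self.layers):
            # Layer l acts on the activations at lag l (i.e., C^{t-l}).
            pre = pre + layer(c_seq[:, -1 - l, :])
        return torch.sigmoid(pre)  # activations of the latent variables at t+1

# Usage sketch: 10 latent variables, order-4 FCM, batch of 8 windows.
fcb = FuzzyCognitiveBlock(n_concepts=10, order=4)
c_next = fcb(torch.randn(8, 4, 10))
```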

B. CNN-FCM: Convolution network combined with high-order FCMs


Available deep learning methods mostly presume that the data distribution of the target
task is stable, and these models perform well in computer vision and language processing.
However, time series data are mostly non-stationary and evolve over time; for example, the
mean, variance and frequency of the data vary across different periods. Therefore, simply
regressing the available data is insufficient for time series prediction. As described in the
previous part, FCB embeds the learning of high-order FCMs into the deep learning architecture.
In this way, the complex system that produces the observations can be learned, which provides
more rational reasoning over data that deviate from the training set and thereby promotes the
stability of deep learning.
In this paper, instead of decomposing the time series data into handcrafted parts, we
employ TCN to extract the time series of the latent variables. The learning of the high-order
FCM is implemented within FCB based on the series extracted by TCN. Finally, a
regression model predicts the next observation from the output of FCB. The
corresponding architecture is termed CNN-FCM and is illustrated in Fig. 5. As can be seen,
there are Nc residual blocks in TCN. The output of each residual block is processed and
compressed by two convolutional layers to obtain the time series of the Nc latent variables, and
FCB learns the correlations between these latent variables. The order of the FCM within
CNN-FCM is related to the length of the input data: for input data of length 4L, the time series
of the latent variables has length L and the corresponding order of the FCM is L.

To fully utilize the activations of the latent variables, a regression model predicts the result
according to (6); linear regression is employed in this paper. Note that, as the
activations of an FCM usually range from -1 to 1, Leaky-ReLU is employed as the
activation function in the residual blocks of CNN-FCM. The structure of
CNN-FCM is similar to the architecture of TCN but differs mainly in the FCB. However,
CNN-FCM learns and simulates the complex system producing the observations
during the training stage and then implements the prediction based on the model learned
within FCB. In this way, CNN-FCM follows the same procedure as FCM-based time
series prediction methods.

Fig. 5. The architecture of CNN-FCM. The shape of a tensor is marked in the flow graph in the form (H×W×K), where H and W respectively denote the height and width of the input data and K denotes the number of channels.

$$y^{t+1} = f\!\left(C^{t+1}, C^{t}\right) \quad (6)$$
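To make (6) concrete, a minimal sketch of the prediction head is given below, assuming the linear-regression form described above; the module name RegressionHead and the concatenation of C^{t+1} and C^t into one input vector are illustrative assumptions on our part.

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Sketch of the linear regression f in (6): y^{t+1} = f(C^{t+1}, C^t)."""

    def __init__(self, n_concepts: int):
        super().__init__()
        # Linear regression over the concatenated activations at t+1 and t.
        self.linear = nn.Linear(2 * n_concepts, 1)

    def forward(self, c_next: torch.Tensor, c_now: torch.Tensor) -> torch.Tensor:
        # c_next: (batch, n_concepts) activations at t+1 from FCB;
        # c_now : (batch, n_concepts) activations at t from the TCN backbone.
        return self.linear(torch.cat([c_next, c_now], dim=-1)).squeeze(-1)

# Usage sketch with 10 latent variables.
head = RegressionHead(n_concepts=10)
y_pred = head(torch.randn(8, 10), torch.randn(8, 10))
```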

C. Training CNN-FCM
CNN-FCM outputs the prediction results according to (6), and the corresponding loss
function is formulated in (7), where WFCB and WLR are respectively the parameters of FCB and
of the regression model, yt is the ground-truth value and ŷt is the prediction. The former part of
(7) evaluates the RMSE loss of the prediction, while the latter part applies a weight decay term
to FCB and the regression model to avoid overfitting.

$$loss = \sqrt{\frac{1}{T}\sum_{t=0}^{T-1}\left(y_t-\hat{y}_t\right)^{2}}+\lambda\left(\left\|W_{FCB}\right\|+\left\|W_{LR}\right\|\right) \quad (7)$$

Generally, to avoid overfitting, researchers separate the available time series data into
two parts: a training dataset and a validation dataset. The training dataset is employed to train the
model, while the validation dataset is used to validate the performance of the trained model.
Finally, the model with the best performance on the validation dataset is selected to
conduct the prediction on the testing dataset. The validation dataset thus indicates when to stop
training.

We employ the most recent data as the validation dataset. In this case, the
validation dataset is more similar to the testing dataset and provides a proper indication,
as is generally done in previous studies. Details of the training procedure are
given in Algorithm 1.

Algorithm 1: Training CNN-FCM

Input:
  Nc: Number of latent variables in FCB;
  λ: Weight decay term in (7);
  X: Available time series data;
  // X = {Xtrain: Training dataset; Xvalidation: Validation dataset};
  E: Maximum number of epochs;
Output:
  θ*: Parameters of the trained model;
1: Initialize parameters θ in CNN-FCM;
2: minrmse ← 1.0e+10;
3: θ* ← θ;
4: for e = 1 to E do
5:   Update θ in CNN-FCM based on Xtrain to minimize (7);
6:   Evaluate ermse of CNN-FCM upon Xvalidation;
7:   if ermse < minrmse then
8:     minrmse ← ermse;
9:     θ* ← θ;
10:  end
11: end
12: Output θ*;
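The PyTorch-style sketch below mirrors Algorithm 1, assuming the model and the data loaders have been built elsewhere; build_cnn_fcm, train_loader and val_loader are placeholders, and for simplicity the weight decay is applied to all parameters via the optimizer, whereas (7) applies it only to WFCB and WLR.

```python
import copy
import math
import torch

def train_cnn_fcm(model, train_loader, val_loader, lam=5e-4, epochs=500, lr=1e-3):
    """Sketch of Algorithm 1: keep the parameters with the lowest validation RMSE."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=lam)
    best_rmse, best_state = float("inf"), copy.deepcopy(model.state_dict())
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:                      # update parameters on Xtrain
            optimizer.zero_grad()
            loss = torch.sqrt(torch.mean((model(x) - y) ** 2))  # RMSE part of (7)
            loss.backward()
            optimizer.step()
        model.eval()
        se, n = 0.0, 0
        with torch.no_grad():                          # evaluate RMSE on Xvalidation
            for x, y in val_loader:
                se += torch.sum((model(x) - y) ** 2).item()
                n += y.numel()
        rmse = math.sqrt(se / n)
        if rmse < best_rmse:                           # keep the best parameters so far
            best_rmse, best_state = rmse, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```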

IV. Experiments

A. Experimental setup
This section describes the experimental procedure followed in this paper.
First, to generalize to prediction tasks with different magnitudes, the values within each
benchmark are normalized into the range [-1, 1] according to (8). Second, each
benchmark is separated into three parts according to the time index: Xtrain,
Xvalidation, and Xtest. Third, the corresponding model is trained on Xtrain and validated on
Xvalidation. Fourth, the parameters with the lowest RMSE on Xvalidation are applied to the
prediction of Xtest. Finally, to evaluate the prediction results, the restoration of magnitude given
in (9) is conducted before calculating the RMSE defined in (10), where y' is
the output of the model and xmax and xmin are respectively the maximum and minimum values
of the corresponding dataset.

$$x' = 2\times\frac{x-x_{min}}{x_{max}-x_{min}}-1 \quad (8)$$

$$y = 0.5\times\left(y'+1\right)\left(x_{max}-x_{min}\right)+x_{min} \quad (9)$$

$$RMSE = \sqrt{\frac{1}{T}\sum_{t=0}^{T-1}\left(y_t-\hat{y}_t\right)^{2}} \quad (10)$$
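A minimal NumPy sketch of (8)-(10); the function names are our own choices.

```python
import numpy as np

def normalize(x, x_min, x_max):
    # Eq. (8): scale the raw series into [-1, 1].
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

def denormalize(y_prime, x_min, x_max):
    # Eq. (9): restore the original magnitude of the model output.
    return 0.5 * (y_prime + 1.0) * (x_max - x_min) + x_min

def rmse(y_true, y_pred):
    # Eq. (10): root-mean-square error on the restored values.
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))
```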

The performance of some classic deep learning architectures (TCN, LSTM, RNN, GRU)
is introduced for comparison. It is commonly acknowledged that hyperparameters play an
important role in deep learning and that determining proper hyperparameters is a challenging task.
To address this, the random seed is fixed to the same value during the grid search, and grid
search is applied to determine the proper hyperparameters of each architecture. For
CNN-FCM and TCN, this paper mainly considers the length of the input
and the number of layers; for RNN, LSTM and GRU, it mainly considers the length
of the input, the number of layers and the number of features in the hidden state.
The length of the input is selected from {8, 12, 16, 20} and the number of features in the
hidden state is selected from {5, 10, 15}. The number of layers in TCN and CNN-FCM is
selected from [1, 14]. As the number of layers of recurrent networks is usually set to a small
value, it is selected from [1, 5] for RNN, LSTM, and GRU. Moreover, the size of the kernels
employed within the residual blocks is set to 3. The maximum number of epochs (E) for
training CNN-FCM and the other deep learning architectures is set to 500 and 1000, respectively.
The model with the best performance on Xvalidation is saved for the prediction
task as described in Algorithm 1. λ in (7) is set to 0.0005.

B. Datasets
Twelve benchmarks for time series prediction are employed in this paper to validate the
effectiveness of the proposed model; the details are provided in Table I. The first eight
benchmarks are commonly employed datasets, and we mainly follow the separation given in
[23] to split Xtrain, Xvalidation, and Xtest. The final four benchmarks are recent financial
data provided by us; for these, Xvalidation is the latest 20% of Xtrain ∪ Xvalidation in each dataset,
and Xtrain ∪ Xvalidation occupies 75% of the corresponding benchmark.
The datasets in Table I are described as follows: (a) Sunspot
records the annual number of sunspots from 1700 to 1987. (b) The Mackey-Glass time
series [33] is generated from the first-order nonlinear differential-delay equation given in
(11). (c) The daily open price of the S&P500 [34] from June 1, 2016 to June 1, 2017. (d)
Monthly milk production in pounds from January 1962 to December 1975 [35]. (e) The
monthly closings of the Dow-Jones industrial index from August 1968 to August 1981 [36]. (f)
The highest radio frequency usable for broadcasting in Washington, DC, USA, over
the period May 1934 to April 1954 [37]. (g) CO2 (ppm) at Mauna Loa recorded from 1965
to 1980 [38]. (h) Monthly Lake Erie levels from 1921 to 1970 [39]. (i) The performance of the
Shenzhen Composite Index recorded from 2015.9.27 to 2018.9.27. (j) The performance of
CSI300 recorded from 2015.9.27 to 2019.9.27. (k) The close price of Pingan Bank recorded
from 2016.4.1 to 2020.4.30. (l) The close price of Kweichou_Moutai recorded from
2016.4.1 to 2020.4.30. The performance in the stock market is the rise and fall ratio of the
stock price p(t) relative to the initial value of the record, as formulated in (12).
$$\dot{x}(t) = \frac{0.2\,x(t-\tau)}{1+x^{10}(t-\tau)}-0.1\,x(t) \quad (11)$$

$$x(t) = \frac{p(t)-p(0)}{p(0)} \quad (12)$$
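For reference, the following sketch integrates (11) with a simple Euler scheme; the delay τ = 17, the step size and the warm-up length are common choices we assume here, not values stated in the paper.

```python
import numpy as np

def mackey_glass(n_samples=1000, tau=17, dt=1.0, x0=1.2, warmup=500):
    """Generate a Mackey-Glass series by Euler integration of Eq. (11).

    tau is the delay; tau = 17 is a commonly used value producing chaotic behaviour (assumed).
    """
    delay = int(round(tau / dt))
    total = n_samples + warmup
    x = np.zeros(total + delay)
    x[:delay] = x0                      # constant history before t = 0
    for t in range(delay, total + delay - 1):
        x_tau = x[t - delay]
        dx = 0.2 * x_tau / (1.0 + x_tau ** 10) - 0.1 * x[t]
        x[t + 1] = x[t] + dt * dx
    return x[delay + warmup:]           # drop the history and warm-up transient

series = mackey_glass()
```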

TABLE I
Twelve benchmarks for time series prediction and the corresponding size of each subset.

Dataset               Total Length  Xtrain  Xvalidation  Xtest
Sunspot               288           161     40           87
MG                    1000          400     100          500
S&P500                251           120     30           101
Milk                  168           108     26           34
DJ                    291           175     43           73
Radio                 240           184     36           20
CO2                   192           131     32           29
Lake                  600           368     92           140
SZ Composite Index    734           440     110          184
CSI300                977           586     146          245
Pingan Bank           995           597     149          249
Kweichou_Moutai       995           597     149          249

C. Comparison of CNN-FCM with other deep neural networks


In this part, we compare the performance of CNN-FCM with other deep neural
networks on time series prediction tasks and investigate the effectiveness of FCB. To provide
an intuitive illustration of the effectiveness of FCB, CNN-FCM is compared with TCN, as
CNN-FCM is designed on the basis of the TCN architecture. We also compare CNN-FCM with
other classic deep architectures: RNN, LSTM and GRU. The prediction results of CNN-FCM
obtained under grid search are shown in Fig. 6, together with the results given by TCN and
LSTM for an intuitive comparison. To obtain statistically meaningful results,
we fix the hyperparameters found by grid search and report the averaged results of 10
independent runs in Table II. To make a fair comparison, no weight decay is
applied to WLR for any model in Table II.

Fig. 6 shows the prediction results of the models obtained under grid search;
CNN-FCM has intuitively obvious advantages on S&P500, CO2, SZ Composite Index and
Pingan Bank. Moreover, the averaged results given in Table II show
that CNN-FCM provides the best performance in 8 out of 12 cases.

Fig. 6. Prediction results output by different architectures. (a)-(l) are respectively the results for the benchmarks indexed 1-12 in Table I.

Table II. The RMSE of different algorithms on the 12 benchmarks.

Model     Sunspot   MG      S&P500   Milk     DJ       Radio
CNN-FCM   17.9488   0.0013  20.8159  30.4738  25.1895  0.5665
TCN       22.4489   0.0008  51.2674  33.8579  25.2143  0.6020
RNN       19.2922   0.0005  27.8964  29.2529  26.2318  0.6128
LSTM      19.0064   0.0007  46.2659  32.7434  26.9359  0.5904
GRU       19.4079   0.0008  20.4066  36.0936  25.2110  0.8316

Model     CO2     Lake    SZ Composite Index  CSI300  Pingan Bank  Kweichou_Moutai
CNN-FCM   0.7305  0.3913  0.0257              0.0160  0.3694       24.4539
TCN       3.1202  0.4093  0.0655              0.0163  1.0655       165.1811
RNN       1.4190  0.3739  0.0382              0.0166  0.5126       48.8851
LSTM      2.1595  0.3842  0.0632              0.0159  0.5509       27.9986
GRU       1.5605  0.3853  0.0473              0.0171  0.7423       50.6338

To reasonably compare the performance obtained by the different models, a statistical
comparison using the Wilcoxon signed-rank test is conducted, and the corresponding results are
provided in Table III. Because the magnitudes of the benchmarks are different, the
observations in each dataset are normalized into [0, 1] before calculating the RMSE values
used for the Wilcoxon signed-rank test.
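As an illustration, such a paired test can be run as below; the RMSE arrays are placeholder values, not the figures reported in Table II.

```python
import numpy as np
from scipy.stats import wilcoxon

# Normalized RMSE of two models on the 12 benchmarks (placeholder values).
rmse_cnn_fcm = np.array([0.050, 0.012, 0.040, 0.120, 0.060, 0.030,
                         0.020, 0.050, 0.030, 0.020, 0.040, 0.030])
rmse_tcn     = np.array([0.070, 0.010, 0.090, 0.130, 0.061, 0.040,
                         0.080, 0.052, 0.070, 0.021, 0.100, 0.180])

# Paired Wilcoxon signed-rank test over the 12 benchmarks.
stat, p_value = wilcoxon(rmse_cnn_fcm, rmse_tcn)
print(p_value)
```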
As can be seen from the results in Table III, the p_values of the Wilcoxon signed-rank test
are small enough to conclude that CNN-FCM outperforms TCN, RNN, LSTM and GRU. CNN-FCM is
designed by introducing FCB into TCN, and the resulting improvement in performance is notable,
which indicates the effectiveness of FCB. Moreover, the comparison
between CNN-FCM and the other classic deep neural networks also reflects the advantages of our
design.

TABLE III
The Wilcoxon signed-rank test between CNN-FCM and the other models

CNN-FCM vs   TCN      RNN      LSTM     GRU
p_value      0.0037   0.0186   0.0096   0.0186

The twelve benchmarks are widely used real data sets. To investigate the effect of
system modeling in promoting the stability of deep learning, we further analyze the data
uncertainty in the twelve benchmarks. Equation (13), given in [48], is employed to learn the
uncertainty in each dataset, where σ(xi) denotes the uncertainty of the model towards the input
data xi. For data that deviate from the training data, or situations unseen at training time, the
predictor should learn a large σ so that the corresponding Lu can be minimized. Equation (13)
has previously been applied to model the uncertainty of bounding box regression in object detection [48].

In our paper, we employ Equation (13) to measure the data uncertainty of the datasets and
use Gaussian process regression to obtain the prediction results and σ. For an intuitive
illustration, Fig. 7 provides the results obtained by Gaussian process regression on SZ
Composite and MG. As can be seen, the uncertainty within MG and SZ Composite is different:
Xvalidation and Xtest of MG have a relatively stable pattern over time, whereas Xvalidation and Xtest
of SZ Composite show an obviously different trend from Xtrain. Thus, the
twelve benchmarks have different levels of data uncertainty.

$$L_u = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{\left\|f(x_i)-y_i\right\|^{2}}{2\,\sigma(x_i)^{2}}+\frac{1}{2}\log\sigma(x_i)^{2}\right) \quad (13)$$
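A sketch of how Lu in (13) can be evaluated from a Gaussian process fit; using scikit-learn's GaussianProcessRegressor with return_std=True is one way to obtain f(xi) and σ(xi), and the choice of inputs (e.g., how the series is windowed into samples) is our own illustrative assumption.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def data_uncertainty(train_x, train_y, test_x, test_y):
    """Evaluate Lu of Eq. (13) on test data using a GP fitted on training data.

    train_x, test_x: arrays of shape (n_samples, n_features); train_y, test_y: (n_samples,).
    """
    gp = GaussianProcessRegressor().fit(train_x, train_y)
    mean, std = gp.predict(test_x, return_std=True)     # f(x_i) and sigma(x_i)
    var = np.maximum(std, 1e-6) ** 2                     # guard against zero variance
    lu = np.mean((mean - test_y) ** 2 / (2.0 * var) + 0.5 * np.log(var))
    return float(lu)
```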

Fig. 7. Prediction results obtained by Gaussian process regression. (a) MG. (b) SZ Composite. The blue region marks the training data of the corresponding benchmark. The green region marks the 0.95 confidence region, which reflects the uncertainty of the prediction.

The uncertainty of Xtest in each benchmark, measured by Lu, is given in Table IV. As can
be seen, the data uncertainty of MG is the lowest. The five datasets with the largest data
uncertainty are Kweichou_Moutai, CO2, Pingan Bank, S&P500 and SZ Composite, and
CNN-FCM performs the best on 4 of these 5 datasets.
Moreover, from the RMSE values in Table II, the classic
architectures have lower RMSE than CNN-FCM on the MG dataset, in some cases by nearly half,
and MG has the least data uncertainty among the 12 benchmarks. Conversely, CNN-FCM shows a
significant reduction in RMSE compared with the other architectures on the datasets with the
largest data uncertainty; for example, CNN-FCM's RMSE on CO2 is about half of RNN's RMSE and a
quarter of TCN's RMSE. This indicates the effectiveness of FCB in promoting the stability of
our architecture in coping with data that deviate from the training set, which conforms to the
original intention of our design.

TABLE IV
Uncertainty of benchmarks

Dataset  Sunspot  MG       S&P500  Milk    DJ       Radio
Lu       -0.6012  -9.6647  2.3087  0.4348  -0.8824  -1.2041

Dataset  CO2     Lake     SZ Composite Index  CSI300   Pingan Bank  Kweichou_Moutai
Lu       4.0838  -1.2860  0.7415              -1.4965  3.2350       17.1262

Based on the experimental results above, CNN-FCM surpasses the classic
deep neural networks in Table II and is particularly competitive on datasets with large data
uncertainty. These phenomena indicate the advantage of CNN-FCM in resolving situations unseen at
training time. As value-based methods simply regress over the available data, prediction over data
that deviate from the training set is risky, because the behaviour of the mapping function learned
by the network is unknown there. FCMs, in contrast, learn the complex system that produces the
observations, and the learned system can provide more rational reasoning over data that deviate
from the training dataset.

D. Analysis of the Parameter Sensitivities in FCB


The previous experimental results show the advantages of our design and indicate the
effectiveness of FCB. However, the drawback of applying FCB is also obvious: the optimal Nc
and the length of the input are usually unknown. Consequently, determining the
hyperparameters of deep learning architectures with FCB is even more difficult.
In this part, we analyze the sensitivity of CNN-FCM to Nc and the
length of the input (Linput), while the other hyperparameters are held fixed. The corresponding
experimental results are provided in Fig. 8. As can be seen, selecting a proper (Nc, Linput) is
important, and the optimal combination varies across datasets. Determining the optimal
combination (Nc, Linput) seems inevitable for each dataset, as the corresponding
complex system producing the observations differs. Generally, the RMSE may increase when
Linput is large while Nc is small, in which case the learned FCMs are overly simple and the
receptive field of the residual blocks in CNN-FCM is insufficient. Notably, other
hyperparameters of the deep learning architecture may also influence the optimal
combination of Nc and Linput.

Fig. 8. Parameter sensitivities of CNN-FCM subject to the length of input (Linput) and Nc. Results given in (a)-(l) are respectively obtained on the datasets indexed 1-12 in Table I.

V. Conclusion
Research on the application of deep learning to time series prediction problems has been
extensive. Currently available deep learning architectures usually regress the available data but
neglect the stability of prediction over data that deviate from the training set. As time series data
usually evolve over time, the stability of models over such data is important.
This paper proposes a deep neural network termed CNN-FCM that integrates
system modeling in order to learn the complex system that produces the
observations. The learned complex system can provide relatively more rational data
processing over data that deviate from the training dataset, and in this way the stability of deep
learning in time series prediction can be promoted. Our experimental results on twelve
benchmarks for time series prediction verify the effectiveness of our design.
To the best of our knowledge, this paper is the first to apply FCMs to deep learning and
thereby promote the stability of deep learning in time series prediction. We also analyze the
challenge of selecting proper hyperparameters for FCMs, which will be further considered in
our future work.

Acknowledgments
This work was supported in part by the Key Project of Science and Technology
Innovation 2030 supported by the Ministry of Science and Technology of China under Grant
2018AAA0101302 and in part by the General Program of National Natural Science
Foundation of China (NSFC) under Grant 61773300.

References
[1] J. Redmon, S. Divvala, R. Girshick & A. Farhadi, “You only look once: Unified, real-time object
detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp.779-788, 2016.
[2] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu & A. C Berg, “Ssd: Single shot multibox
detector,” in European Conference on Computer Vision, pp.21-37, 2016.
[3] J. Redmon & A. Farhadi, “YOLO9000: better, faster, stronger,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp.7263-7271, 2017.
[4] S. Ren, K. He, R. Girshick & J. Sun, “Faster r-cnn: Towards real-time object detection with region
proposal networks,” in Advances in Neural Information Processing Systems, pp.91-99, 2015.
[5] J. Dai, Y. Li, K. He & J. Sun, “R-fcn: Object detection via region-based fully convolutional
networks,” in Advances in Neural Information Processing Systems, pp.379-387, 2016.
[6] J. Long, E. Shelhamer & T. Darrell, “Fully convolutional networks for semantic segmentation,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3431-3440,
2015.

[7] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy & A. L Yuille, “Deeplab: Semantic image
segmentation with deep convolutional nets,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol.40, no.4, pp.834-848, 2017.
[8] Y. Shao, C. Hardmeier, J. Tiedemann & J. Nivre, “Character-based joint segmentation and POS
tagging for Chinese using bidirectional RNN-CRF,” arXiv preprint arXiv:1704.01314, 2017.
[9] K. Simonyan & A. Zisserman, “Very deep convolutional networks for large-scale image recognition,”
arXiv preprint arXiv:1409.1556, 2014.
[10] K. He, X. Zhang, S. Ren & J. Sun, “Deep residual learning for image recognition,” in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp.770-778, 2016.
[11] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke & A.
Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2015.
[12] S. Ioffe & C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal
covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[13] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens & Z. Wojna, “Rethinking the inception architecture for
computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp.2818-2826, 2016.
[14] J. Hu, L. Shen & G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp.7132-7141, 2018.
[15] Y. Bengio, P. Simard & P. Frasconi, “Learning long-term dependencies with gradient descent is
difficult,” IEEE Transactions on Neural Networks, vol.5, no.2, pp.157-166, 1994.
[16] S. Hochreiter & J. Schmidhuber, “Long short-term memory,” Neural Computation, vol.9, no.8,
pp.1735-1780, 1997.
[17] S. El Hihi & Y. Bengio, “Hierarchical recurrent neural networks for long-term dependencies,” in
Advances in Neural Information Processing Systems, pp.493-499, 1996.
[18] R. Pascanu, T. Mikolov & Y. Bengio, “On the difficulty of training recurrent neural networks,” in
International Conference on Machine Learning, pp.1310-1318, 2013.
[19] W. Pedrycz, “The design of cognitive maps: A study in synergy of granular computing and
evolutionary optimization,” Expert Systems with Applications, vol.37, no.10, pp.7288–7294, 2010.
[20] W. Pedrycz, A. Jastrzebska & W. Homenda, “Design of fuzzy cognitive maps for modeling time
series,” IEEE Transactions on Fuzzy Systems, vol.24, no.1, pp.120–130, Feb. 2016.
[21] W. Froelich & W. Pedrycz, “Fuzzy cognitive maps in the modeling of granular time series,”
Knowledge-Based Systems, vol.115, pp.110–122, 2017.
[22] F. Vanhoenshoven, G. N´apoles, S. Bielen & K. Vanhoof, “Fuzzy cognitive maps employing ARIMA
components for time series forecasting,” in International Conference on Intelligent Decision
Technologies, 2017, pp.255–264.
[23] S. Yang & J. Liu, “Time-series forecasting based on high-order fuzzy cognitive maps and wavelet
transform,” IEEE Transactions on Fuzzy Systems, vol.24, no.1, pp.3391–3402, 2018.
[24] W. Stach, L. A. Kurgan & W. Pedrycz, “Numerical and linguistic prediction of time series with the
use of fuzzy cognitive maps,” IEEE Transactions on Fuzzy Systems, vol.16, no.1, pp.61–72, 2008.
[25] H. Song, C. Miao, W. Roel, Z. Shen & F. Catthoor, “Implementation of fuzzy cognitive maps based
on fuzzy neural network and application in prediction of time series,” IEEE Transactions on Fuzzy
Systems, vol.18, no.2, pp.233–250, 2010.
[26] W. Lu, J. Yang, X. Liu & W. Pedrycz, “The modeling and prediction of time series based on synergy
of high-order fuzzy cognitive map and fuzzy c-means clustering,” Knowledge-Based Systems, vol.70,
pp.242–255, 2014.
[27] E. I. Papageorgiou & K. Poczeta, “A two-stage model for time series prediction based on fuzzy
cognitive maps and neural networks,” Neurocomputing, vol.232, pp.113–121, 2017.
[28] H. J. Song, C. Y. Miao, R. Wuyts, Z. Q. Shen & M. D’Hondt, “An extension to fuzzy cognitive maps
for classification and prediction,” IEEE Transactions on Fuzzy Systems, vol.19, no.1, pp.116–135,
2011.
[29] E. I. Papageorgiou, “Learning algorithms for fuzzy cognitive maps - a review study,” IEEE
Transactions on Systems, Man, and Cybernetics, Part C, vol.42, no.2, pp.150–163, 2012.
[30] K. Wu & J. Liu, “Robust learning of large-scale fuzzy cognitive maps via the lasso from noisy time
series,” Knowledge-Based Systems, vol.113, pp.23–38, 2016.
[31] K. Wu & J. Liu, “Learning large-scale fuzzy cognitive maps based on compressed sensing and
application in reconstructing gene regulatory networks,” IEEE Transactions on Fuzzy Systems, vol.25,
no.6, pp.1546–1560, 2017.
[32] S. Bai, J Z. Kolter & V. Koltun, “An empirical evaluation of generic convolutional and recurrent
networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
[33] M. C. Mackey et al., “Oscillation and chaos in physiological control systems,” Science, vol.197,
no.4300, pp.287–289, 1977.
[34] “GSPC historical prices | S&P 500 stock, Yahoo!” Jun. 14, 2017. [Online]. Available:
https://finance.yahoo.com/quote/%5EGSPC/history?p =%5EGSPC
[35] “Monthly milk production: pounds per cow. Jan. 62-Dec. 75,” Dec. 19, 2017. [Online]. Available:
https://datamarket.com/data/set/22ox/monthlymilk-production-pounds-per-cow-jan-62-dec-75#!ds=22
ox&display =line
[36] “Monthly closings of the Dow-Jones industrial index, Aug. 1968-Aug. 1981,” Dec. 19, 2017. [Online].
Available:https://datamarket.com/data/set/22v9/monthly-closings-of-the-dow-jones-industrial-index-a
ug-1968-aug-1981#!ds=22v9&display=line
[37] “Monthly critical radio frequencies in Washington, D.C., May 1934-April 1954,” Dec. 19, 2017.
[Online]. Available:
https://datamarket.com/data/set/22u2/monthly-critical-radio-frequencies-in-washington-dc-may-1934-
april-1954-these-frequencies-reflect-the-highest-radio-frequency-thatcan-be-used-for-broadcasting#!d
s=22u2&display=line
[38] “Co2 (ppm) Mauna Loa, 1965–1980,” Dec. 19, 2017. [Online]. Available:
https://datamarket.com/data/set/22v1/co2-ppm-mauna-loa-1965–1980#!ds=2 2v1&display=line
[39] “Monthly Lake Erie levels 1921–1970,” Dec. 19, 2017. [Online]. Available:
https://datamarket.com/data/set/22pw/monthly-lake-erie-levels-1921–1970 #!ds=22pw&display=line
[40] R. Dey & F. M. Salemt, “Gate-variants of gated recurrent unit (GRU) neural networks,” in 2017 IEEE
60th international midwest symposium on circuits and systems, pp.1597-1600, 2017.
[41] M. Längkvist, L. Karlsson & A. Loutfi, “A review of unsupervised feature learning and deep learning
for time-series modeling,” Pattern Recognition Letters, vol.42, pp.11-24, 2014.
[42] I. J Goodfellow, J. Shlens & C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv
preprint arXiv:1412.6572, 2014.
[43] C. Xie, Y. Wu, L. V. D. Maaten, A. L. Yuille & K. He, “Feature denoising for improving adversarial
robustness,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp.501-509, 2019.

[44] B. Lakshminarayanan, A. Pritzel & C. Blundell, “Simple and scalable predictive uncertainty
estimation using deep ensembles,” Advances in Neural Information Processing Systems, pp.6402-6413,
2017.
[45] D. M. Blei, A. Kucukelbir & J. D. McAuliffe, “Variational inference: A review for statisticians,”
Journal of the American statistical Association, vol. 112, no. 518, pp.859-877, 2017.
[46] W. Hamilton, Z. Ying & J. Leskovec, “Inductive representation learning on large graphs,” In
Advances in Neural Information Processing Systems, pp.1024-1034, 2017.
[47] M. C. Ozturk, D. Xu & J. C. Principe, “Analysis and design of echo state networks,” Neural
Computation, vol. 19, no. 1, pp.111-138, 2007.
[48] Y. He, C. Zhu, J. Wang, M. Savvides & X. Zhang, “Bounding box regression with uncertainty for
accurate object detection,” In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp.2888-2897, 2019.
[49] S. Wen, M. Dong, Y. Yang, P. Zhou, T. Huang & Y. Chen, “End-to-End detection-segmentation
system for face labeling,” IEEE Transactions on Emerging Topics in Computational Intelligence, DOI:
10.1109/TETCI.2019.2947319, 2019.
[50] S. Wen, W. Liu, Y. Yang, P. Zhou, Z. Guo, Z. Yan, Y. Chen & T. Huang, “Multilabel image
classification via feature/label co-projection,” IEEE Transactions on Systems, Man, and Cybernetics:
Systems, DOI: 10.1109/TSMC.2020.2967071, 2020.
[51] S. Wen, H. Wei, Z. Yan, Z. Guo, Y. Yang, T. Huang & Y. Chen, “Memristor-based design of sparse
compact convolutional neural network,” IEEE Transactions on Network Science and Engineering,
DOI: 10.1109/TNSE.2019.2934357, 2019.
