
Engineering Applications of Artificial Intelligence 96 (2020) 103978


Robust empirical wavelet fuzzy cognitive map for time series forecasting
Ruobin Gao, Liang Du ∗, Kum Fai Yuen
School of Civil and Environmental Engineering, Nanyang Technological University, Singapore

ARTICLE INFO

Keywords: Fuzzy cognitive maps; High-order fuzzy cognitive maps; Empirical wavelet transformation; Time series forecasting

ABSTRACT

Fuzzy cognitive maps have achieved significant success in time series modeling and forecasting. However, fuzzy cognitive maps still struggle to handle nonstationarity and outliers. In this paper, we propose a novel time series forecasting model based on fuzzy cognitive maps and the empirical wavelet transformation. The empirical wavelet transformation decomposes the original time series into sub-series that capture information at different frequencies. Then, a high-order fuzzy cognitive map is trained to model the relationships among all the generated sub-series and the original time series. To enhance the robustness of high-order fuzzy cognitive maps against outliers, a novel learning method based on support vector regression is designed. Finally, the summation of the concept values of the high-order fuzzy cognitive map is divided by two to obtain the numerical predictions. A comprehensive empirical study on eight public time series validates the superiority of the proposed model compared with popular baseline models from the literature.

1. Introduction

Fuzzy modeling approaches can deal with the uncertainty of data and preserve high-level interpretability simultaneously, thus achieving significant success in widespread disciplines, such as control systems (Wang et al., 1996), fuzzy inference (Pozna et al., 2012), pattern recognition (Baraldi and Blonda, 1999a,b) and time series prediction (Chen and Hwang, 2000; Gao and Duru, 2020; Lee et al., 2013). Fuzzy cognitive maps (FCMs) combine the characteristics of fuzzy logic and neural networks and can model systems' states effectively (Kosko, 1986). Since the FCM's birth, researchers have expanded its application to various fields, such as gene regulatory network reconstruction (Acampora and Vitiello, 2015; Liu et al., 2015; Wu and Liu, 2017), time series prediction (Stach et al., 2008), game balancing systems (Dhanji and Singh, 2018), medical decision making (Iakovidis and Papageorgiou, 2010) and clinical diagnosis (John and Innocent, 2005). Recently, an evolutionary multi-task FCM was proposed to take advantage of the relationships among different tasks (Shen et al., 2020).

Time series forecasting is an active research topic, and various algorithms have been proposed, such as recurrent neural networks (Cao and Lin, 2008; Zemouri et al., 2003), RBF networks (Mohammadi et al., 2014), support vector machines (Sapankevych and Sankar, 2009) and hybrid models (Faruk, 2010; Hajirahimi and Khashei, 2019). Time series prediction is also a major strand of FCM research (Lu et al., 2014; Papageorgiou and Poczketa, 2017; Pedrycz et al., 2016; Salmeron and Froelich, 2016; Song et al., 2009, 2010; Stach et al., 2005a, 2008; Yang and Liu, 2018). Most FCMs in the time series prediction domain consist of two steps: the formulation of the FCM's structure and the learning of the weight matrix. Granularity (Stach et al., 2008) and membership value representation (Song et al., 2009) are two common choices to formulate the FCM's structure. Fuzzy c-means clustering has also shown success in formulating the FCM's structure (Lu et al., 2014). In addition, the wavelet transformation and empirical mode decomposition (EMD) have been developed to identify the FCM's structure and boost forecasting performance (Liu and Liu, 2020; Yang and Liu, 2018). To learn the weight matrix, most FCMs implement evolutionary algorithms (Chi and Liu, 2015; Koulouriotis et al., 2001; Parsopoulos et al., 2003; Stach et al., 2005b; Zou and Liu, 2017). In Parsopoulos et al. (2003), particle swarm optimization is applied to learn the FCM's weights. A real-coded genetic algorithm is developed for learning FCMs (Stach et al., 2005b), where the FCMs can almost perfectly represent the input data. Stach et al. (2008) first transform the original time series into the membership value space and then model these values using an FCM trained with a real-coded genetic algorithm. Particle swarm optimization is introduced to optimize a high-order FCM's (HFCM's) weights in Lu et al. (2014), where fuzzy c-means clustering is applied to transform the original time series into concepts. A fuzzy gray cognitive map learned by an evolutionary algorithm is proposed to model multivariate interval-valued time series (Hajek et al., 2020). In addition to evolutionary learning algorithms, ridge regression (Yang and Liu, 2018), Bayesian ridge regression (Liu and Liu, 2020) and an entropy-based algorithm (Feng et al., 2019) have been applied to optimize the FCM's weights.

∗ Corresponding author.
E-mail addresses: gaor0009@e.ntu.edu.sg (R. Gao), liang011@e.ntu.edu.sg (L. Du), kumfai.yuen@ntu.edu.sg (K.F. Yuen).

https://doi.org/10.1016/j.engappai.2020.103978
Received 2 July 2020; Received in revised form 14 September 2020; Accepted 24 September 2020
0952-1976/© 2020 Elsevier Ltd. All rights reserved.

Nomenclature

𝐴𝑖  The concept value of the FCM's 𝑖th node
𝐖  FCM's weight matrix
𝑤𝑖𝑗  The strength of node 𝑗's impact on node 𝑖
𝑁𝑐  The number of concepts
R  The set of all real numbers
𝜉  Slack variables of support vector regression
𝐶  Regularization coefficient of support vector regression
𝑃  The padding length when applying the EWT
𝐿  The length of historical data
ℎ  Model's order
𝛾  Transitional band ratio
𝛼  Lagrange multipliers
𝜎  Standard deviation
𝑔()  FCM's activation function
𝑔⁻¹()  The inverse of FCM's activation function
𝑌𝑖  Vector containing the 𝑔⁻¹() values for node 𝑖
𝜔𝑛  The 𝑛th detected frequency
𝜙̂𝑛(𝜔), 𝜓̂𝑛(𝜔)  Band-pass filters
ACF  Auto-correlation function
CPESS  Computer Peripheral Equipment and Software Sales
DGISR  Durable Goods Inventories/Sales Ratio
EMD  Empirical Mode Decomposition
EWT  Empirical Wavelet Transformation
FCM  Fuzzy Cognitive Map
FFT  Fast Fourier Transform
HFCM  High-order Fuzzy Cognitive Map
ISR  Inventories to Sales Ratio
KNN  𝑘-Nearest Neighbors
MASE  Mean Absolute Scaled Error
MLP  Multi-layer Perceptron
NSW  New South Wales
QLD  Queensland
RMSE  Root Mean Squared Error
SA  South Australia
SVR  Support Vector Regression
VIC  Victoria
WHFCM  Wavelet High-order Fuzzy Cognitive Map

Although FCMs have demonstrated strong abilities to model time series, FCM-based forecasting approaches cannot handle nonstationary time series, and evolutionary learning is not suitable for large-scale time series, as pointed out by Yang and Liu (2018). To overcome these two limitations, Yang and Liu (2018) propose the wavelet high-order FCM (WHFCM), which introduces the wavelet transformation to replace fuzzy time series and trains the WHFCM with ridge regression. However, the wavelet transformation suffers from several serious drawbacks. Firstly, the wavelet transformation has serious boundary effects on the approximation coefficients. Secondly, mode aliasing occurs in all detail components, degrading the transparency. Thirdly, the wavelet transformation may even detect wrong, non-existent components, which is unacceptable for further modeling. Finally, most existing learning algorithms focus on minimizing the mean square error, so the learned FCM is highly sensitive to outliers in the training data.

To solve the above problems, we propose a robust forecasting framework combining the empirical wavelet transformation (EWT) and HFCMs: the EWT is applied to decompose the original time series, and a robust learning framework is specifically designed for the HFCM. EWT was first introduced by Gilles (2013) as a novel adaptive signal decomposition method with established theoretical foundations and impressive effectiveness in analyzing non-stationary time series data. EWT directly analyzes the data in the Fourier domain and implements the spectrum separation using data-driven filter banks. Then, to enhance the model's robustness to outliers, a robust HFCM learning method based on 𝜖-support vector regression (𝜖-SVR) (Drucker et al., 1997) is specifically designed. The 𝜖-insensitive loss function ignores the errors within the 𝜖-tube and penalizes the errors outside the 𝜖-tube linearly. When applying the trained model to forecast test data, the EWT first decomposes the time series by adding one new test data point per time step, and then the transformed time series are fed into the HFCM to generate forecasts for each node. Finally, we sum the concept values of all nodes and divide the summation by two to obtain the numerical prediction. Eight publicly available time series are used to validate the superiority of the proposed model by comparing it with baseline models.

The proposed robust model exhibits originality in several aspects. First, the EWT is introduced into the HFCM's modeling process for the first time. Second, a robust FCM learning approach based on 𝜖-SVR is proposed, which makes the HFCM robust to outliers. Third, the EWT is implemented in a proper way, which preserves the causal relationships in the time series.

The rest of this paper is organized as follows. Section 2 introduces the preliminaries of FCMs in terms of time series prediction to make this study self-contained. Section 3 reviews the empirical wavelet representation of time series. Section 4 describes the proposed model in detail. Section 5 presents the experimental results and the comparison against baseline models. Finally, the conclusion is drawn in Section 6.

2. Fuzzy cognitive maps

FCMs are weighted directed graphs whose nodes and edges represent concepts and logical relations, respectively. For an FCM with 𝑁𝑐 concepts or nodes, we define the concept state values as a vector 𝐀,

$\mathbf{A} = [A_1, A_2, \ldots, A_{N_c}]$  (1)

where $A_i \in [0, 1]$ or $[-1, 1]$, $i = 1, 2, \ldots, N_c$. The state value 𝐴𝑖 represents the activation value of node 𝑖. The logical relations among the different nodes are represented by an $N_c \times N_c$ matrix 𝐖,

$\mathbf{W} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1 N_c} \\ w_{21} & w_{22} & \cdots & w_{2 N_c} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N_c 1} & w_{N_c 2} & \cdots & w_{N_c N_c} \end{bmatrix}$  (2)

where $w_{ij} \in [-1, 1]$ represents the strength of node 𝑗's impact on node 𝑖. A negative $w_{ij} = -|a|$ means that node 𝑗 has a negative impact on node 𝑖 with strength $|a|$; $w_{ij} = 0$ means that there is no logical relation between nodes 𝑗 and 𝑖; and a positive $w_{ij} = |a|$ means that node 𝑗 has a positive impact on node 𝑖 with strength $|a|$. The state value of a node at the $(t+1)$th iteration is influenced by the weight matrix 𝐖 and all the connected nodes' state values at the 𝑡th iteration. As a result, we can express the FCM's dynamics by the following equation:

$A_i(t+1) = g\left(\sum_{j=1}^{N_c} w_{ij} A_j(t)\right),$  (3)

where $A_j(t)$ is the state value of node 𝑗 at the 𝑡th iteration and 𝑔() is a nonlinear transformation function.

Many transformation functions are available for FCMs. When the state values lie in the range $[-1, 1]$, it is necessary to use the hyperbolic tangent function, defined as follows:

$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$  (4)

According to Eq. (3), FCMs can only model short-term temporal relationships. As a result, HFCMs were proposed to model long temporal dependencies (Stach et al., 2006). The modeling process of an ℎ-order HFCM is

$A_i(t+1) = g\left(\sum_{j=1}^{N_c} \left(\omega^{1}_{ij} A_j(t) + \omega^{2}_{ij} A_j(t-1) + \cdots + \omega^{h}_{ij} A_j(t-h+1)\right) + \omega_{i0}\right)$  (5)

where $\omega^{h}_{ij}$ is the strength of node 𝑗's impact on node 𝑖 at time step $t-h+1$ and $\omega_{i0}$ is the bias term.
specifically designed. The 𝜖-insensitive loss function ignores the errors and 𝜔𝑖0 is the bias term.


Fig. 1. EWT implementation.

Fig. 2. Band-pass filters of EWT.


3. Empirical wavelet representation of time series

EWT was first introduced by Gilles (2013) as a data-driven adaptive signal decomposition method with established theoretical foundations and impressive effectiveness in the analysis of non-stationary time series data. Since its birth, EWT has been widely used in signal processing and time series modeling (Bhattacharyya and Pachori, 2017; Deng et al., 2018; Hu and Wang, 2015; Liu et al., 2018; Liu and Chen, 2019). Different from the discrete wavelet transformation (DWT) and empirical mode decomposition (EMD) (Flandrin et al., 2004), EWT directly analyzes the signal in the Fourier domain after a fast Fourier transform (FFT) and implements the spectrum separation through band-pass filtering with a specific filter bank constructed in a data-driven manner.

A brief step-by-step workflow of the EWT is shown in Fig. 1. In EWT, limited freedom is provided in terms of selecting the wavelet. Only Littlewood–Paley and Meyer wavelets (Spencer, 1994) are employed here, due to the analytic convenience brought by their closed expressions in the Fourier domain. In Gilles (2013), the construction of these band-pass filters is expressed by Eqs. (6) and (7):

$\hat\phi_n(\omega) = \begin{cases} 1 & \text{if } |\omega| \le (1-\gamma)\omega_n \\ \cos\left[\frac{\pi}{2}\beta\left(\frac{1}{2\gamma\omega_n}\left(|\omega| - (1-\gamma)\omega_n\right)\right)\right] & \text{if } (1-\gamma)\omega_n \le |\omega| \le (1+\gamma)\omega_n \\ 0 & \text{otherwise,} \end{cases}$  (6)

$\hat\psi_n(\omega) = \begin{cases} 1 & \text{if } (1+\gamma)\omega_n \le |\omega| \le (1-\gamma)\omega_{n+1} \\ \cos\left[\frac{\pi}{2}\beta\left(\frac{1}{2\gamma\omega_{n+1}}\left(|\omega| - (1-\gamma)\omega_{n+1}\right)\right)\right] & \text{if } (1-\gamma)\omega_{n+1} \le |\omega| \le (1+\gamma)\omega_{n+1} \\ \sin\left[\frac{\pi}{2}\beta\left(\frac{1}{2\gamma\omega_n}\left(|\omega| - (1-\gamma)\omega_n\right)\right)\right] & \text{if } (1-\gamma)\omega_n \le |\omega| \le (1+\gamma)\omega_n \\ 0 & \text{otherwise.} \end{cases}$  (7)

The spectra of these data-driven band-pass filters are illustrated in Fig. 2. The detected frequencies themselves are not marked; the middle points between them are sorted in order and denoted as $\omega_n$ ($1 \le n \le N$). The transitional band is controlled by the parameter 𝛾. Each band-pass filter is delicately designed, as shown in Eqs. (6) and (7), so that Eq. (8) is satisfied:

$\sum_{k=-\infty}^{\infty}\left(\left|\hat\phi_1(\omega + 2k\pi)\right|^2 + \sum_{n=1}^{N}\left|\hat\psi_n(\omega + 2k\pi)\right|^2\right) = 1,$  (8)

which means that the set of constructed empirical scaling and wavelet functions $\{\phi_1(t), \psi_n(t)_{n=1}^{N}\}$ qualifies as a tight frame of $L^2(\mathbb{R})$, provided the transitional band coefficient satisfies $\gamma \le \min_n \frac{\omega_{n+1} - \omega_n}{\omega_{n+1} + \omega_n}$ to avoid spectrum overlapping. The most used function $\beta(x)$ in Eqs. (6) and (7) is given in Eq. (9), which satisfies Eq. (8):

$\beta(x) = x^4(35 - 84x + 70x^2 - 20x^3).$  (9)
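To make Eqs. (6), (7) and (9) concrete, here is a minimal NumPy rendering of $\beta(x)$ and the two filters, written by us under the stated definitions; it also checks the partition-of-unity property of Eq. (8) inside the first two bands. All names and parameter values are our own illustration.

```python
import numpy as np

def beta(x):
    # Eq. (9); clip so beta is 0 below the transition band and 1 above it.
    x = np.clip(x, 0.0, 1.0)
    return x**4 * (35 - 84 * x + 70 * x**2 - 20 * x**3)

def phi_hat(omega, wn, gamma):
    """Empirical scaling function of Eq. (6), evaluated on |omega|."""
    w = np.abs(omega)
    out = np.zeros_like(w)
    out[w <= (1 - gamma) * wn] = 1.0
    band = (w >= (1 - gamma) * wn) & (w <= (1 + gamma) * wn)
    out[band] = np.cos(np.pi / 2 * beta((w[band] - (1 - gamma) * wn) / (2 * gamma * wn)))
    return out

def psi_hat(omega, wn, wn1, gamma):
    """Empirical wavelet of Eq. (7) for the band [wn, wn1]."""
    w = np.abs(omega)
    out = np.zeros_like(w)
    out[(w >= (1 + gamma) * wn) & (w <= (1 - gamma) * wn1)] = 1.0
    hi = (w >= (1 - gamma) * wn1) & (w <= (1 + gamma) * wn1)
    out[hi] = np.cos(np.pi / 2 * beta((w[hi] - (1 - gamma) * wn1) / (2 * gamma * wn1)))
    lo = (w >= (1 - gamma) * wn) & (w <= (1 + gamma) * wn)
    out[lo] = np.sin(np.pi / 2 * beta((w[lo] - (1 - gamma) * wn) / (2 * gamma * wn)))
    return out

# Tight-frame check of Eq. (8) inside the first two bands: the squared
# filters sum to 1 (the shared beta argument makes cos^2 + sin^2 = 1).
omega = np.linspace(0, np.pi, 2049)
wn, wn1, gamma = 0.8, 2.0, 0.2   # gamma respects the bound stated above
total = phi_hat(omega, wn, gamma) ** 2 + psi_hat(omega, wn, wn1, gamma) ** 2
mask = omega <= (1 - gamma) * wn1
assert np.allclose(total[mask], 1.0, atol=1e-8)
```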

4. A novel robust empirical wavelet HFCM

Inspired by the idea of the wavelet-HFCM (Yang and Liu, 2018), we propose a novel robust HFCM based on the EWT. In this section, we describe the steps of the proposed model. The detailed workflow of the proposed model is presented in Fig. 3.


Fig. 3. The proposed algorithm.

Step 1: Pad the time series 𝑓(𝑡) with the help of 𝑘-nearest neighbors (KNN) regression (Altman, 1992) to avoid the boundary effect in the EWT.

It is well known that wavelet-based decomposition methods suffer from a boundary effect which may degrade the forecasting performance. To ensure the prediction process is causal and robust to the boundary effect, the KNN is trained to generate 𝑃 forecasts, which are used for padding. The value of 𝑃 can be determined by the autocorrelation function (ACF), because the ACF can reveal the time series' periodicity.

Step 2: Apply the EWT to the training set of 𝑓(𝑡), including the 𝑃 padded data points, and generate all sub-series $A_i(t)$, where $i = 1, 2, 3, \ldots, N_c - 1$.

Step 2.1: Perform the FFT on 𝑓(𝑡) and obtain the discrete version of its spectrum $F(\omega)$. The whole spectrum lies between 0 and $2\pi$ (symmetric with respect to $\pi$).

Step 2.2: Determine the corresponding frequencies $\omega_1, \omega_2, \ldots, \omega_{N-1}$ from the ranked local minima of the discrete spectrum. In the spectrum, a peak often denotes a component with a certain frequency.

Step 2.3: Define the transitional band ratio 𝛾 satisfying $\gamma \le \min_n \frac{\omega_{n+1} - \omega_n}{\omega_{n+1} + \omega_n}$ in order for the empirical wavelets to qualify as conventional wavelets in terms of expressive capability in $L^2(\mathbb{R})$ (for a detailed proof of this statement, please refer to Gilles, 2013).

Step 2.4: Establish the scaling and wavelet functions in the frequency domain using Eqs. (6) and (7), which are in essence a set of band-pass filters.

Step 2.5: Calculate the detail and approximation coefficients of the signal through inner products in the Fourier domain.

Step 2.6: Obtain all the sub-series $A_i(t)$, where $i = 1, 2, 3, \ldots, N_c - 1$, using the reverse operation of Step 2.5.
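A sketch of Step 1, assuming scikit-learn's KNeighborsRegressor; choose_padding_length is our simplified stand-in for inspecting the ACF, and the EWT of Step 2 would then be applied to the returned padded series. All helper names are ours.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def choose_padding_length(f, max_lag):
    """Pick P as the most significant nonzero ACF lag (the series' period)."""
    f = np.asarray(f, dtype=float) - np.mean(f)
    acf = np.correlate(f, f, mode="full")[len(f) - 1:]
    acf /= acf[0]
    return int(np.argmax(acf[1:max_lag + 1])) + 1

def knn_pad(f, P, n_neighbors=3):
    """Step 1: append P KNN forecasts to f(t) to soften the EWT boundary effect.

    The regressor maps each window of the last P values to the next value
    and is then rolled forward P steps, so the padding stays causal.
    """
    f = np.asarray(f, dtype=float)
    X = np.stack([f[i:i + P] for i in range(len(f) - P)])
    y = f[P:]
    knn = KNeighborsRegressor(n_neighbors=n_neighbors).fit(X, y)
    padded = list(f)
    for _ in range(P):
        padded.append(float(knn.predict(np.asarray(padded[-P:])[None, :])[0]))
    return np.asarray(padded)

# Example: a monthly-like series; P lands near lag 12, then pad before EWT.
t = np.arange(240)
f = np.sin(2 * np.pi * t / 12) + 0.1 * np.random.default_rng(1).normal(size=t.size)
P = choose_padding_length(f, max_lag=24)
padded = knn_pad(f, P)
```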
Step 3: Identify the structure of the HFCM.

Inspired by the WHFCM proposed by Yang and Liu (2018), each sub-series generated by the EWT serves as a node of the FCM. However, the structure of the WHFCM only learns the relationships among the sub-series; it fails to learn the relationship between each sub-series and the target variable. As a result, we propose adding another node to the HFCM to learn the relationship between each sub-series and the original time series. The state values of the new node, $A_{N_c}(t)$, represent the normalized original time series.

Step 4: Optimize the HFCM's weights using the proposed novel robust learning algorithm.

Wu and Liu (2016, 2017) claim that optimizing the HFCM's weight matrix can be decomposed into learning the local connections of the nodes individually. We adopt the same simplification to optimize the model's weights: a robust learning method learns the weights of these local connections, and all the learned weights are then combined to form the HFCM's final weight matrix. We first linearize the nonlinear activation function of the HFCM in Eq. (5):

$g^{-1}(A_i(t+1)) = \sum_{j=1}^{N_c} \left(\omega^{1}_{ij} A_j(t) + \omega^{2}_{ij} A_j(t-1) + \cdots + \omega^{h}_{ij} A_j(t-h+1)\right) + \omega_{i0}$  (10)

where $g^{-1}()$ is the inverse function of the nonlinear activation $g()$ and ℎ is the HFCM's order.

As a result, we can collect the historical time series of length 𝐿 and format them in the following matrix form:

$Y_i = \mathbf{X} W_i$  (12)


$\mathbf{X} = \begin{bmatrix} A_1(h) & \cdots & A_1(1) & A_2(h) & \cdots & A_2(1) & \cdots & A_{N_c}(h) & \cdots & A_{N_c}(1) & 1 \\ A_1(h+1) & \cdots & A_1(2) & A_2(h+1) & \cdots & A_2(2) & \cdots & A_{N_c}(h+1) & \cdots & A_{N_c}(2) & 1 \\ A_1(h+2) & \cdots & A_1(3) & A_2(h+2) & \cdots & A_2(3) & \cdots & A_{N_c}(h+2) & \cdots & A_{N_c}(3) & 1 \\ \vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots & \vdots \\ A_1(L-1) & \cdots & A_1(L-h) & A_2(L-1) & \cdots & A_2(L-h) & \cdots & A_{N_c}(L-1) & \cdots & A_{N_c}(L-h) & 1 \end{bmatrix}$  (11)

Box I.

where $Y_i$ is a vector containing the inverse transformation $g^{-1}(A_i(t+1))$ over all time steps, as shown in Eq. (13), 𝐗 is the matrix containing the state values of all nodes over all time steps, shown in Eq. (11) in Box I, and $W_i$ represents the relationship between all nodes and node 𝑖:

$Y_i = \begin{bmatrix} g^{-1}(A_i(h+1)) \\ g^{-1}(A_i(h+2)) \\ \vdots \\ g^{-1}(A_i(L)) \end{bmatrix}$  (13)

See Eq. (11) in Box I.

As a result, learning the connections of the 𝑖th node becomes a linear regression problem. Unlike the FCM learning methods proposed in Wu and Liu (2016, 2017), we adopt the 𝜖-insensitive loss function for the 𝑖th node:

$Loss = \begin{cases} 0, & |\hat Y_i(j) - Y_i(j)| < \epsilon \\ |\hat Y_i(j) - Y_i(j)| - \epsilon, & \text{otherwise} \end{cases}$  (14)

where $\hat Y_i(j)$ is the model's prediction for the 𝑗th sample; errors located within the 𝜖-tube are not considered. An HFCM with this loss function is robust to noise and outliers in two ways. First, the noise, uncertainty and perturbation of data points located within the 𝜖-tube are ignored. Second, for the data points whose errors are beyond the 𝜖-tube, a linear penalty is imposed, which is better than a quadratic loss when there are outliers in the training set. A quadratic loss, being the square of the errors, can mislead the HFCM so that it cannot learn the patterns correctly, whereas the 𝜖-insensitive loss penalizes all errors outside the tube linearly, so the outliers' errors cannot dominate the learning process. After introducing the slack variables, we arrive at the following optimization problem:

minimize $\frac{1}{2}\|w\|^2 + C \sum_{j=1}^{L-h} (\xi_j + \xi_j^*)$
subject to $Y_i(j) - X(j) W_i \le \epsilon + \xi_j$, $j = 1, 2, \ldots, L-h$,  (15)
$X(j) W_i - Y_i(j) \le \epsilon + \xi_j^*$, $j = 1, 2, \ldots, L-h$,
$\xi_j, \xi_j^* \ge 0$, $j = 1, 2, \ldots, L-h$,

where $\xi_j$ and $\xi_j^*$ are slack variables that cope with otherwise infeasible constraints, and the positive constant 𝐶 represents the trade-off between the smoothness and the degree to which deviations larger than 𝜖 are tolerated. The solution of this optimization problem is obtained by solving the dual problem:

$\min_{\alpha,\alpha^*} \ \frac{1}{2}\sum_{p=1}^{L-h}\sum_{q=1}^{L-h}(\alpha_p-\alpha_p^*)(\alpha_q-\alpha_q^*)X(p)X(q)^{T} - \sum_{p=1}^{L-h}(\alpha_p-\alpha_p^*)Y_i(p) + \epsilon\sum_{p=1}^{L-h}(\alpha_p+\alpha_p^*)$
subject to $0 \le \alpha_p, \alpha_p^* \le C$, $p = 1, 2, \ldots, L-h$,  (16)
$\sum_{p=1}^{L-h}(\alpha_p-\alpha_p^*) = 0$,

where 𝛼 and $\alpha^*$ are the Lagrange multipliers. This problem is known as 𝜖-SVR (Smola and Schölkopf, 2004) and is a standard convex quadratic program with linear constraints.
There are two hyper-parameters, 𝜖 and 𝐶, for each node. Different from the decomposition-based learning methods in Wu and Liu (2016) and Yang and Liu (2018), where the learning algorithm for each node shares the same hyper-parameters, we determine the hyper-parameters for each node specifically. Since an exhaustive grid search is time-consuming, we adopt the selection method of Cherkassky and Ma (2004) to determine 𝜖 and 𝐶 for each node. According to Cherkassky and Ma (2004), 𝐶 can be calculated by the following equation:

$C = \max\left(|\bar y - 3\sigma_y|, |\bar y + 3\sigma_y|\right)$  (17)

where $\bar y$ is the mean of $g^{-1}(A_i)$ and $\sigma_y$ is the standard deviation of $g^{-1}(A_i)$. It is well known that 𝜖 should be related to the noise level and the size of the data. According to the same selection method (Cherkassky and Ma, 2004), the value of 𝜖 can be calculated by:

$\epsilon = 3\sigma \sqrt{\ln(L-h)/(L-h)}$  (18)

where 𝐿 is the length of the time series, ℎ represents the model's order and 𝜎 is the standard deviation of the noise, which is estimated by:

$\hat\sigma^2 = \frac{(L-h)^{1/5}\, k}{(L-h)^{1/5}\, k - 1} \times \frac{1}{L-h} \sum_{i=1}^{L-h} (y_i - \hat y_i)^2$  (19)

where $\hat y_i$ is the estimated value generated by a 𝑘-nearest-neighbor fit. According to the conclusion in Cherkassky and Ma (2004), the number of nearest neighbors 𝑘 does not materially influence the noise variance estimate, and the authors suggest three as the default choice.
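These selection rules translate directly into code. A sketch under our reading of Eq. (19), where k denotes the number of nearest neighbors (three by default, per the text); the function and variable names are ours.

```python
import numpy as np

def select_eps_C(y, y_knn, k=3):
    """Per-node (epsilon, C) selection following Eqs. (17)-(19).

    y     : the node's linearised targets g^{-1}(A_i), length n = L - h.
    y_knn : in-sample k-nearest-neighbor estimates of y.
    """
    y, y_knn = np.asarray(y), np.asarray(y_knn)
    n = len(y)
    # Eq. (19): prefactor-corrected noise variance from the KNN residuals.
    factor = n ** 0.2 * k / (n ** 0.2 * k - 1.0)
    sigma = np.sqrt(factor * np.mean((y - y_knn) ** 2))
    # Eq. (18): epsilon grows with the noise level, shrinks with sample size.
    eps = 3.0 * sigma * np.sqrt(np.log(n) / n)
    # Eq. (17): C from the spread of the targets.
    C = max(abs(y.mean() - 3.0 * y.std()), abs(y.mean() + 3.0 * y.std()))
    return eps, C
```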

Step 5: Apply the trained model to generate forecasts on the test set.

When a new test data point is available, we first apply the KNN to pad the original time series with 𝑃 forecasts. Then, the EWT is implemented to decompose the padded time series. Next, we collect the decomposed sub-series and remove the last 𝑃 values, which are padding data. The trained HFCM is applied to generate its prediction for the next time step. Finally, we divide the summation of the HFCM's outputs by two to generate the forecast: since the $N_c - 1$ sub-series nodes sum back to the original data and the remaining node is the original series itself, the summation of all nodes counts the series twice, so it is divided by two to obtain the numerical prediction.

5. Applications

In this section, we first briefly introduce the datasets. Then, we describe the basic experimental setup, the error metrics, the normalization and the cross-validation procedure. After introducing this background, we present the experiments in two subsections: case studies that describe the modeling process, and comparisons against baselines.

5.1. Datasets


Table 1
Descriptive statistics of all datasets.
Dataset Max Min Median Mean Std Skewness Kurtosis
QLD 8915.400 5353.030 6802.155 6821.698 889.229 0.261 −0.959
NSW 13700.900 5973.270 8816.490 8984.025 1731.286 0.485 −0.477
SA 2994.290 647.550 1367.390 1483.873 474.340 0.952 0.485
VIC 9281.150 3320.320 4730.395 5075.310 1205.671 1.158 0.846
DGISR 2.110 1.310 1.570 1.584 0.130 0.876 1.424
CPESS 24760.000 10592.000 17093.000 17034.162 3420.402 0.186 −0.892
ISR 1.540 1.060 1.230 1.239 0.099 0.603 0.074
Miles traveled 288145.000 77442.000 196797.500 190420.381 57746.372 −0.164 −1.343

Eight publicly available datasets are used to validate the superiority of our proposed model. Four are electricity load series that are publicly available from the Australian Energy Market Operator website (AEMO, 2010). The load data are recorded every half hour, which amounts to 48 data points per day. Specifically, the data of January 2019 from New South Wales (NSW), Queensland (QLD), South Australia (SA) and Victoria (VIC) are chosen to validate the model's performance. The monthly Inventories to Sales Ratio (ISR), Computer Peripheral Equipment and Software Sales (CPESS) and Durable Goods Inventories/Sales Ratio (DGISR) from January 2000 to January 2020 can be downloaded from the Federal Reserve Economic Data (FRED, 2010a) website. The last time series is the vehicle miles traveled from January 1970 to December 2018, which is available in the transportation section of the FRED (2010b) website. The detailed descriptive statistics of these eight datasets are summarized in Table 1. According to the skewness values shown in Table 1, all datasets except the miles traveled are positively skewed, which indicates a longer right tail.

5.2. Experimental setup

The sub-series generated by the EWT are normalized into the range $[-1, 1]$ because we adopt the hyperbolic tangent function as the nonlinear transformation function. All the models are applied to the first difference of the original time series. Let the maximum and minimum of the time series be $x_{\max}$ and $x_{\min}$, respectively. The data are normalized into the range $[-1, 1]$ using

$x_{normalized} = 2 \times \frac{x - x_{\min}}{x_{\max} - x_{\min}} - 1$  (20)

where $x_{normalized}$ and 𝑥 represent the normalized and original time series, respectively.

Two error metrics are used to evaluate the models' performance. The first is the classical root mean square error (RMSE), defined as

$RMSE = \sqrt{\frac{1}{L} \sum_{j=1}^{L} (\hat x_j - x_j)^2},$  (21)

where 𝐿 is the size of the test set and $x_j$ and $\hat x_j$ are the raw data and the predictions, respectively. The other error metric implemented in this paper is the mean absolute scaled error (MASE) proposed by Hyndman and Koehler (2006), given in Eq. (22):

$MASE = \operatorname{mean}\left(\frac{|\hat x_j - x_j|}{\frac{1}{T-1}\sum_{t=2}^{T} |x_t - x_{t-1}|}\right)$  (22)

where 𝑇 represents the size of the training set. The denominator of the MASE is the mean absolute error of the in-sample naive forecast.
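For reference, the normalization of Eq. (20) and the two metrics of Eqs. (21) and (22) take only a few lines of NumPy; the helper names are our own.

```python
import numpy as np

def normalize(x, x_min, x_max):
    # Eq. (20): map the series into [-1, 1] for the tanh activation.
    return 2.0 * (np.asarray(x) - x_min) / (x_max - x_min) - 1.0

def denormalize(x_norm, x_min, x_max):
    # Inverse of Eq. (20), used to map predictions back to the raw scale.
    return (np.asarray(x_norm) + 1.0) / 2.0 * (x_max - x_min) + x_min

def rmse(y_true, y_pred):
    # Eq. (21): root mean squared error over the test set.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mase(y_true, y_pred, y_train):
    # Eq. (22): absolute errors scaled by the in-sample naive forecast's MAE,
    # so a value below 1 beats the "repeat the last value" benchmark.
    scale = np.mean(np.abs(np.diff(np.asarray(y_train))))
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))) / scale)
```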
The number of nodes is selected by last-block cross-validation (Bergmeir and Benítez, 2012) conducted via grid search (Hastie et al., 2009). The datasets are split into three subsets (the training, validation and test sets) according to the ratio 70%, 20% and 10%, respectively. The hyper-parameters which achieve the best performance on the validation set are selected as the final hyper-parameters. Such a last-block cross-validation procedure prevents the forecasting model from overfitting the training set.
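A minimal sketch of the chronological split used here; the function name is ours.

```python
def last_block_split(series, train=0.7, val=0.2):
    """Chronological 70/20/10 split; the newest block is held out for testing."""
    n = len(series)
    i, j = int(n * train), int(n * (train + val))
    return series[:i], series[i:j], series[j:]
```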
5.3. Case study of the ISR

In this section, we describe the modeling process using the ISR time series as an example. First, we calculate the first difference of the original time series. Then, we normalize the first-difference series into the range $[-1, 1]$. Hereafter, the last-block cross-validation is used to determine the number of nodes (Bergmeir and Benítez, 2012). The data is split into three sets (training, validation and test, accounting for 70%, 20% and 10% of the dataset, respectively) to accommodate cross-validation. To reduce the impact of the boundary effect caused by the wavelet-based transformation, the KNN model is first implemented to append predictions to the training data. The padding length is equal to the model's order, which is determined by the ACF. According to Fig. 4, the ACF at time lag 12 is the second largest after the value at lag zero. The EWT is applied to decompose the padded time series, and the generated sub-series, excluding the last padded data points, are used for modeling.

Fig. 4. ACF of ISR time series.

Fig. 5. The sub-series generated by EWT.

Once the HFCM's structure is identified, we determine 𝜖 and 𝐶 for each node according to Eqs. (17) and (18). Next, the proposed robust learning process can be implemented to optimize the HFCM's weights. After


the learning process, the model can be used to predict the validation
set. We implement the prediction process in a causal way to avert
the data leakage effect in time series prediction. When generating a
prediction, the KNN is first implemented to generate twelve predictions
for padding. Then, the EWT is used to decompose the time series by
adding new padded data points. Thereafter, the learned HFCM can
make one-step-ahead predictions for each concept. Finally, we generate numerical predictions by summing the predictions of all concepts and dividing the summation by two. This process is repeated until
the model generates predictions for the whole validation set. The hyper-
parameters which achieve the minimum RMSE on the validation set
are the optimal setting. For the ISR dataset, the decomposition level
selected by cross-validation is two. We retrain the model including
the validation set using the optimal hyper-parameters and apply the
learned model to predict test data in the same causal way. The illus-
tration of the sub-series for ISR data is shown in Fig. 5. According to
Fig. 5, we can see that the approximation series capture the seasonal
components correctly.
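For concreteness, a sketch of this causal validation loop. The padding, EWT and trained-HFCM components are passed in as callables (pad_fn, decomp_fn and predict_fn are our own stand-in names; pad_fn corresponds to the knn_pad sketch of Section 4, and decomp_fn to any EWT implementation returning the sub-series).

```python
import numpy as np

def rolling_forecast(train, test, pad_fn, decomp_fn, predict_fn, P):
    """Causal one-step-ahead evaluation (Section 5.3).

    pad_fn(history, P)   -> history extended by P KNN forecasts (Step 1)
    decomp_fn(padded)    -> (N_c - 1, len(padded)) array of EWT sub-series
    predict_fn(concepts) -> next state value of every node, shape (N_c,)
    """
    history = list(train)
    preds = []
    for obs in test:
        hist = np.asarray(history, dtype=float)
        padded = pad_fn(hist, P)              # re-pad at every step: causal
        subs = decomp_fn(padded)[:, :-P]      # drop the P padded tail values
        concepts = np.vstack([subs, hist[None, :]])
        node_preds = predict_fn(concepts)     # one-step forecast per node
        # The N_c - 1 sub-series sum back to the series, and the last node
        # is the series itself, so summing every node counts it twice.
        preds.append(node_preds.sum() / 2.0)
        history.append(obs)                   # reveal the true observation
    return np.asarray(preds)
```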
Fig. 6. ACF of QLD load time series.
5.4. Case study of QLD load data

Another example is described in detail in this section to further illustrate the proposed model. First, the first difference of the original load data is calculated and normalized into the range $[-1, 1]$. The normalized first-difference series is split into three sets (training, validation and test, accounting for 70%, 20% and 10% of the dataset, respectively) to accommodate cross-validation. Hereafter, the EWT is implemented to decompose the training data. To reduce the boundary effects, the KNN model is first implemented to append values to the training data. The padding length is equal to the model's order, which is determined by the ACF of the target time series. According to Fig. 6, the ACF at time lag 48 is the second largest after the value at lag zero. The EWT is applied to the padded time series, and the generated sub-series, excluding the last padded data points, are used for modeling. Finally, the sub-series and the original data represent the HFCM's concepts.

Once the HFCM's structure is identified, the proposed robust learning algorithm can be applied. First, 𝜖 and 𝐶 for each node are calculated according to Eqs. (17) and (18). Next, each individual local connection is learned using 𝜖-SVR. After the learning process, the model is used to forecast the validation set. We implement the prediction process in a causal way to avert data leakage in time series prediction. When generating a prediction, the KNN is first implemented to generate 48 predictions for padding. Then, the EWT is used to decompose the time series with the padded data. Once the decomposition is completed, the learned HFCM makes one-step-ahead predictions for each concept. Finally, we generate the final outputs by summing the predictions of all concepts and dividing the summation by two. This process is repeated until the model generates predictions for the whole validation set. The hyper-parameters which achieve the minimum RMSE on the validation set are the optimal setting. For the QLD load dataset, the EWT's decomposition level determined by cross-validation is two. We retrain the model, including the validation set, using the optimal hyper-parameters and apply the learned HFCM to predict the test data in the same way as when forecasting the validation set. The illustration of the sub-series for the QLD data is shown in Fig. 7. According to Fig. 7, the approximation and detail series correctly capture the seasonal components at two different frequencies.

Fig. 7. The sub-series generated by EWT on QLD dataset.

5.5. Comparison against baselines

In this section, we present the comparison against several popular baselines to highlight the superiority of our proposed model. We compare the proposed model with eight baseline models, including the classical forecasting model ARIMA (Hyndman and Athanasopoulos, 2018), SVR (Cortes and Vapnik, 1995), the Multi-layer Perceptron (MLP) (Haykin, 2004), fuzzy time series models (Chen, 1996; Yu, 2005; Cheng et al., 2009; Sadaei et al., 2014), and the WHFCM (Yang and Liu, 2018). The hyper-parameters of the baseline models are also optimized by cross-validation for a fair comparison. For the fuzzy time series models, the number of fuzzy sets varies from 5 to 50. For the WHFCM, the number of concepts $N_c$ varies from 2 to 7 and the regularization parameter 𝛼 varies in [1e-12, 1e-14, 1e-20]. The number of hidden nodes of the MLP varies from 1 to 10, and the MLP is trained using the Adam optimizer with learning rate 0.01 for 5000 epochs (Paszke et al., 2019). For ARIMA, the Kwiatkowski–Phillips–Schmidt–Shin test is used to determine the differencing order, and the Bayesian information criterion is optimized to determine the orders of the AR and MA parts. The hyper-parameters of the SVR with an RBF kernel are determined according to Cherkassky and Ma (2004). All models are implemented in the Python programming language, version 3.6.3. The MLP network is built using the PyTorch library, version 1.0.1 (Paszke et al., 2017). The SVR is established using the scikit-learn library (Pedregosa et al., 2011), and the stopping criterion of the 𝜖-SVR learning is controlled by the error tolerance term, which is set to 1e-3 following the suggestions from Pedregosa et al. (2011). Once the stopping criterion of the learning process is fixed, the two major steps (i.e., the EWT and learning by 𝜖-SVR) are deterministic, so it is not necessary to perform multiple runs to obtain fair results.


Table 2
Comparison with baseline models in terms of the RMSE.
Dataset QLD NSW SA VIC DGISR CPESS ISR Miles traveled
Naïve 157.964 285.775 56.300 172.240 0.090 2600.581 0.121 17795.662
ARIMA 148.053 151.288 50.255 103.116 0.133 2547.580 0.163 25588.544
SVR 55.643 103.771 38.595 89.469 0.042 1269.660 0.052 3790.114
MLP 67.822 141.013 36.195 88.103 0.043 1292.787 0.054 4033.232
Chen (1996) 86.490 135.659 42.998 106.541 0.084 2512.314 0.114 19183.827
Yu (2005) 82.207 135.190 41.445 107.979 0.092 2579.704 0.117 20971.135
Cheng et al. (2009) 86.043 133.520 43.379 108.045 0.079 2620.704 0.107 20628.459
Sadaei et al. (2014) 82.131 134.380 43.947 108.225 0.081 2371.970 0.098 20540.851
Yang and Liu (2018) 76.452 165.107 47.997 85.296 0.041 1874.989 0.050 5569.113
Proposed 53.597 97.702 33.346 86.185 0.032 1163.071 0.049 3423.869

Table 3
Comparison with baseline models in terms of the MASE.

Dataset QLD NSW SA VIC DGISR CPESS ISR Miles traveled
Naïve 1.187 1.414 0.968 1.250 1.140 1.091 1.000 1.634
ARIMA 1.146 0.660 0.703 0.670 1.625 0.970 1.295 2.471
SVR 0.396 0.410 0.580 0.596 0.561 0.474 0.437 0.348
MLP 0.492 0.565 0.587 0.601 0.563 0.498 0.416 0.358
Chen (1996) 0.666 0.572 0.676 0.744 1.067 1.077 0.931 1.863
Yu (2005) 0.614 0.566 0.636 0.735 1.138 1.065 0.958 2.082
Cheng et al. (2009) 0.625 0.560 0.676 0.757 0.972 1.086 0.852 2.021
Sadaei et al. (2014) 0.600 0.559 0.684 0.746 1.050 0.958 0.776 2.010
Yang and Liu (2018) 0.539 0.750 0.765 0.566 0.563 0.784 0.420 0.512
Proposed 0.381 0.392 0.542 0.572 0.392 0.449 0.360 0.324

Table 4
MASE statistics.

Model Max Min Median Mean Std Interquartile range
Naïve 1.6340 0.9680 1.1635 1.2105 0.2077 0.2228
ARIMA 2.4710 0.6600 1.0580 1.1925 0.5788 0.6828
SVR 0.5960 0.3480 0.4555 0.4753 0.0874 0.1593
MLP 0.6010 0.3580 0.5305 0.5100 0.0810 0.0975
Chen (1996) 1.8630 0.5720 0.8375 0.9495 0.3881 0.3960
Yu (2005) 2.0820 0.5660 0.8465 0.9742 0.4651 0.4528
Cheng et al. (2009) 2.0210 0.5600 0.8045 0.9436 0.4395 0.3373
Sadaei et al. (2014) 2.0100 0.5590 0.7610 0.9229 0.4396 0.3180
Yang and Liu (2018) 0.7840 0.4200 0.5645 0.6124 0.1269 0.2215
Proposed 0.5720 0.3240 0.3920 0.4265 0.0825 0.0965

Table 5
Parameter settings.

Dataset Order 𝑁𝑐 𝜖 𝐶
QLD 48 3 [0.0397, 0.0446, 0.0449] [0.9160, 0.7472, 0.7963]
NSW 48 3 [0.0181, 0.0406, 0.0419] [2.0238, 2.3276, 2.0493]
SA 48 3 [0.0646, 0.0478, 0.0479] [0.9378, 0.8594, 0.9356]
VIC 48 3 [0.0213, 0.0415, 0.0434] [0.5312, 0.2999, 0.5873]
DGISR 12 3 [0.1480, 0.2108, 0.1936] [1.7823, 0.8933, 1.7448]
CPESS 12 3 [0.2225, 0.1952, 0.2010] [1.4044, 0.8584, 0.9255]
ISR 12 3 [0.1492, 0.2152, 0.1957] [1.7750, 0.9256, 1.8575]
Miles traveled 12 3 [0.0998, 0.0998, 0.0964] [0.6850, 0.2420, 0.8876]

The comparison results using RMSE and MASE are presented in Table 2 and Table 3, respectively. Among the baseline models, ARIMA performs worse than the other, non-linear models because of its simple linear structure. The proposed model achieves the best performance in terms of RMSE and MASE on most time series. In addition, according to Tables 2 and 3, the SVR and MLP perform similarly, as both are one-hidden-layer networks (Romero and Toppo, 2007). Again, the proposed model outperforms the WHFCM (Yang and Liu, 2018) on most datasets, which indicates that the EWT is a better choice for time series modeling than the discrete wavelet transformation. Since the RMSE is scale dependent, it is not suitable for comparing across different datasets (Hyndman and Koehler, 2006). According to the suggestions by Hyndman and Koehler (2006), the MASE can be used to compare forecasting performance across datasets. As a result, the statistics of the MASE values are summarized in Table 4. According to Table 4, the proposed model achieves the lowest mean, median, minimum and maximum MASE. Although the MLP network has a smaller standard deviation, the proposed model's interquartile range is the lowest. Unlike the standard deviation, which takes all MASE values into account, the interquartile range only considers their positions after ranking, so it is insensitive to outliers, whereas the standard deviation is not. Therefore, we claim that the proposed model achieves the best performance on these eight datasets. Finally, the hyper-parameter settings of the proposed model are presented in Table 5 to facilitate reproducibility of the results. For illustration, the comparison of prediction curves on the test set is shown in Fig. 8.

6. Conclusion

In this paper, we have proposed a novel hybrid time series forecasting model based on the empirical wavelet transformation and HFCMs. The EWT is adopted to decompose the original time series and generate the concept values for the HFCM. Then, the decomposed sub-series are modeled and predicted by the HFCM. To reduce the impact of outliers, a robust learning scheme based on 𝜖-SVR is specifically designed to learn the weights of the HFCM. Finally, the summation of all concept values of the FCM is divided by two to obtain the predicted numerical value.

This paper performs a comprehensive empirical study to validate the superiority of the proposed model. According to the experiments, we conclude that small decomposition levels work well on these eight time series; the prediction accuracy degenerates as the decomposition level increases. A small decomposition level also implies a small number of nodes and high interpretability of the FCM. In addition, the proposed model achieves the best performance in terms of MASE and RMSE on these eight datasets, which validates its superiority over the baseline models. The comparison with the baseline models also demonstrates that the proposed learning algorithm can improve the HFCM's generalization ability. Finally, the proposed model outperforms the WHFCM, which indicates that the EWT is superior to the discrete wavelet transformation in time series forecasting, because the EWT is a data-driven signal decomposition algorithm.

Although using the EWT boosts the HFCM's performance, the authors believe there are many more approaches to boost the accuracy further. The combination of novel signal decomposition methods and FCMs has not been investigated thoroughly. To the best knowledge of the authors, variational mode decomposition (Dragomiretskiy and Zosso, 2013) has not yet been considered in the hybrid FCM field. As a result, the relative merit of other signal decomposition techniques remains an open topic for future research. In addition, instead of using a single decomposition method, an ensemble of various concept generation methods can be considered. Consequently, the FCM could learn the intra-relationships among the concepts generated by a specific decomposition method, and the inter-relationships among the concepts of different decomposition methods.

Fig. 8. Original time series and predictions of (a) QLD load, (b) NSW load, (c) SA load, (d) VIC load, (e) Durable Goods Inventories/Sales Ratio, (f) Computer Peripheral Equipment and Software Sales, (g) Inventories/Sales Ratio, (h) Miles traveled.

CRediT authorship contribution statement

Ruobin Gao: Theoretical development, Empirical study, Literature review, Finishing the manuscript. Liang Du: Development of the EWT decomposition, Revision of the manuscript. Kum Fai Yuen: Revision of the manuscript.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References

Acampora, G., Vitiello, A., 2015. Learning of fuzzy cognitive maps for modelling gene regulatory networks through big bang-big crunch algorithm. In: 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). IEEE, pp. 1–6.
AEMO, 2010. Australian energy market operator. https://aemo.com.au/, (Accessed 16 June 2010).
Altman, N.S., 1992. An introduction to kernel and nearest-neighbor nonparametric regression. Amer. Statist. 46 (3), 175–185.
Baraldi, A., Blonda, P., 1999a. A survey of fuzzy clustering algorithms for pattern recognition. I. IEEE Trans. Syst. Man Cybern. B 29 (6), 778–785.
Baraldi, A., Blonda, P., 1999b. A survey of fuzzy clustering algorithms for pattern recognition. II. IEEE Trans. Syst. Man Cybern. B 29 (6), 786–801.
Bergmeir, C., Benítez, J.M., 2012. On the use of cross-validation for time series predictor evaluation. Inform. Sci. 191, 192–213.
Bhattacharyya, A., Pachori, R.B., 2017. A multivariate approach for patient-specific EEG seizure detection using empirical wavelet transform. IEEE Trans. Biomed. Eng. 64 (9), 2003–2015.
Cao, J., Lin, X., 2008. Application of the diagonal recurrent wavelet neural network to solar irradiation forecast assisted with fuzzy technique. Eng. Appl. Artif. Intell. 21 (8), 1255–1263.
Chen, S.-M., 1996. Forecasting enrollments based on fuzzy time series. Fuzzy Sets and Systems 81 (3), 311–319.
Chen, S.-M., Hwang, J.-R., 2000. Temperature prediction using fuzzy time series. IEEE Trans. Syst. Man Cybern. B 30 (2), 263–275.
Cheng, C.-H., Chen, Y.-S., Wu, Y.-L., 2009. Forecasting innovation diffusion of products using trend-weighted fuzzy time-series model. Expert Syst. Appl. 36 (2), 1826–1832.
Cherkassky, V., Ma, Y., 2004. Practical selection of SVM parameters and noise estimation for SVM regression. Neural Netw. 17 (1), 113–126.
Chi, Y., Liu, J., 2015. Learning of fuzzy cognitive maps with varying densities using a multiobjective evolutionary algorithm. IEEE Trans. Fuzzy Syst. 24 (1), 71–81.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20 (3), 273–297.
Deng, W., Zhang, S., Zhao, H., Yang, X., 2018. A novel fault diagnosis method based on integrating empirical wavelet transform and fuzzy entropy for motor bearing. IEEE Access 6, 35042–35056.
Dhanji, P.K., Singh, S.K., 2018. Fuzzy cognitive maps based game balancing system in real time. Indones. J. Electr. Eng. Comput. Sci. 9 (2), 335–341.
Dragomiretskiy, K., Zosso, D., 2013. Variational mode decomposition. IEEE Trans. Signal Process. 62 (3), 531–544.
Drucker, H., Burges, C.J., Kaufman, L., Smola, A.J., Vapnik, V., 1997. Support vector regression machines. In: Advances in Neural Information Processing Systems. pp. 155–161.
Faruk, D.Ö., 2010. A hybrid neural network and ARIMA model for water quality time series prediction. Eng. Appl. Artif. Intell. 23 (4), 586–594.
Feng, G., Lu, W., Pedrycz, W., Yang, J., Liu, X., 2019. The learning of fuzzy cognitive maps with noisy data: A rapid and robust learning method with maximum entropy. IEEE Trans. Cybern.
Flandrin, P., Rilling, G., Goncalves, P., 2004. Empirical mode decomposition as a filter bank. IEEE Signal Process. Lett. 11 (2), 112–114.
FRED, 2010a. Federal reserve economic data. https://fred.stlouisfed.org/, (Accessed 16 June 2010).
FRED, 2010b. Federal reserve economic data of transportation. https://fred.stlouisfed.org/categories/33202, (Accessed 16 June 2010).
Gao, R., Duru, O., 2020. Parsimonious fuzzy time series modelling. Expert Syst. Appl. 113447.
Gilles, J., 2013. Empirical wavelet transform. IEEE Trans. Signal Process. 61 (16), 3999–4010.
Hajek, P., Froelich, W., Prochazka, O., 2020. Intuitionistic fuzzy grey cognitive maps for forecasting interval-valued time series. Neurocomputing.
Hajirahimi, Z., Khashei, M., 2019. Hybrid structures in time series modeling and forecasting: A review. Eng. Appl. Artif. Intell. 86, 83–106.
Hastie, T., Tibshirani, R., Friedman, J., 2009. Model assessment and selection. In: The Elements of Statistical Learning. Springer, pp. 219–259.
Haykin, S., 2004. Neural Networks: A Comprehensive Foundation. Prentice Hall.
Hu, J., Wang, J., 2015. Short-term wind speed prediction using empirical wavelet transform and Gaussian process regression. Energy 93, 1456–1466.
Hyndman, R.J., Athanasopoulos, G., 2018. Forecasting: Principles and Practice. OTexts.
Hyndman, R.J., Koehler, A.B., 2006. Another look at measures of forecast accuracy. Int. J. Forecast. 22 (4), 679–688.
Iakovidis, D.K., Papageorgiou, E., 2010. Intuitionistic fuzzy cognitive maps for medical decision making. IEEE Trans. Inf. Technol. Biomed. 15 (1), 100–107.
John, R.I., Innocent, P.R., 2005. Modeling uncertainty in clinical diagnosis using fuzzy logic. IEEE Trans. Syst. Man Cybern. B 35 (6), 1340–1350.
Kosko, B., 1986. Fuzzy cognitive maps. Int. J. Man-Mach. Stud. 24 (1), 65–75.
Koulouriotis, D., Diakoulakis, I., Emiris, D., 2001. Learning fuzzy cognitive maps using evolution strategies: a novel schema for modeling and simulating high-level behavior. In: Proceedings of the 2001 Congress on Evolutionary Computation, Vol. 1. IEEE, pp. 364–371.
Lee, C.-H., Chang, F.-Y., Lin, C.-M., 2013. An efficient interval type-2 fuzzy CMAC for chaos time-series prediction and synchronization. IEEE Trans. Cybern. 44 (3), 329–341.
Liu, W., Chen, W., 2019. Recent advancements in empirical wavelet transform and its applications. IEEE Access 7, 103770–103780.
Liu, J., Chi, Y., Zhu, C., 2015. A dynamic multiagent genetic algorithm for gene regulatory network reconstruction based on fuzzy cognitive maps. IEEE Trans. Fuzzy Syst. 24 (2), 419–431.
Liu, Z., Liu, J., 2020. A robust time series prediction method based on empirical mode decomposition and high-order fuzzy cognitive maps. Knowl.-Based Syst. 106105.
Liu, H., Mi, X.-w., Li, Y.-f., 2018. Wind speed forecasting method based on deep learning strategy using empirical wavelet transform, long short term memory neural network and Elman neural network. Energy Convers. Manage. 156, 498–514.
Lu, W., Yang, J., Liu, X., Pedrycz, W., 2014. The modeling and prediction of time series based on synergy of high-order fuzzy cognitive map and fuzzy c-means clustering. Knowl.-Based Syst. 70, 242–255.
Mohammadi, R., Ghomi, S.F., Zeinali, F., 2014. A new hybrid evolutionary based RBF networks method for forecasting time series: a case study of forecasting emergency supply demand time series. Eng. Appl. Artif. Intell. 36, 204–214.
Papageorgiou, E.I., Poczketa, K., 2017. A two-stage model for time series prediction based on fuzzy cognitive maps and neural networks. Neurocomputing 232, 113–121.
Parsopoulos, K.E., Papageorgiou, E.I., Groumpos, P., Vrahatis, M.N., 2003. A first study of fuzzy cognitive maps learning using particle swarm optimization. In: The 2003 Congress on Evolutionary Computation (CEC'03), Vol. 2. IEEE, pp. 1440–1447.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A., 2017. Automatic differentiation in PyTorch.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32. Curran Associates, Inc., pp. 8024–8035. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al., 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Pedrycz, W., Jastrzebska, A., Homenda, W., 2016. Design of fuzzy cognitive maps for modeling time series. IEEE Trans. Fuzzy Syst. 24 (1), 120–130.
Pozna, C., Minculete, N., Precup, R.-E., Kóczy, L.T., Ballagi, Á., 2012. Signatures: Definitions, operators and applications to fuzzy modelling. Fuzzy Sets and Systems 201, 86–104.
Romero, E., Toppo, D., 2007. Comparing support vector machines and feedforward neural networks with similar hidden-layer weights. IEEE Trans. Neural Netw. 18 (3), 959–963.
Sadaei, H.J., Enayatifar, R., Abdullah, A.H., Gani, A., 2014. Short-term load forecasting using a hybrid model with a refined exponentially weighted fuzzy time series and an improved harmony search. Int. J. Electr. Power Energy Syst. 62, 118–129.
Salmeron, J.L., Froelich, W., 2016. Dynamic optimization of fuzzy cognitive maps for time series forecasting. Knowl.-Based Syst. 105, 29–37.
Sapankevych, N.I., Sankar, R., 2009. Time series prediction using support vector machines: a survey. IEEE Comput. Intell. Mag. 4 (2), 24–38.
Shen, F., Liu, J., Wu, K., 2020. Evolutionary multitasking fuzzy cognitive map learning. Knowl.-Based Syst. 192, 105294.
Smola, A.J., Schölkopf, B., 2004. A tutorial on support vector regression. Stat. Comput. 14 (3), 199–222.
Song, H., Miao, C., Roel, W., Shen, Z., Catthoor, F., 2009. Implementation of fuzzy cognitive maps based on fuzzy neural network and application in prediction of time series. IEEE Trans. Fuzzy Syst. 18 (2), 233–250.
Song, H., Miao, C., Shen, Z., Roel, W., Maja, D., Francky, C., 2010. Design of fuzzy cognitive maps using neural networks for predicting chaotic time series. Neural Netw. 23 (10), 1264–1275.
Spencer, J., 1994. Ten Lectures on the Probabilistic Method, Vol. 64. SIAM.
Stach, W., Kurgan, L., Pedrycz, W., 2005a. Linguistic signal prediction with the use of fuzzy cognitive maps. In: Proc. Symp. Human-Centric Comput. pp. 64–71.
Stach, W., Kurgan, L., Pedrycz, W., 2006. Higher-order fuzzy cognitive maps. In: NAFIPS 2006 Annual Meeting of the North American Fuzzy Information Processing Society. IEEE, pp. 166–171.
Stach, W., Kurgan, L.A., Pedrycz, W., 2008. Numerical and linguistic prediction of time series with the use of fuzzy cognitive maps. IEEE Trans. Fuzzy Syst. 16 (1), 61–72.
Stach, W., Kurgan, L., Pedrycz, W., Reformat, M., 2005b. Genetic learning of fuzzy cognitive maps. Fuzzy Sets and Systems 153 (3), 371–401.
Wang, H.O., Tanaka, K., Griffin, M.F., 1996. An approach to fuzzy control of nonlinear systems: Stability and design issues. IEEE Trans. Fuzzy Syst. 4 (1), 14–23.
Wu, K., Liu, J., 2016. Robust learning of large-scale fuzzy cognitive maps via the lasso from noisy time series. Knowl.-Based Syst. 113, 23–38.
Wu, K., Liu, J., 2017. Learning large-scale fuzzy cognitive maps based on compressed sensing and application in reconstructing gene regulatory networks. IEEE Trans. Fuzzy Syst. 25 (6), 1546–1560.

