
Journal Pre-proof

IGAGCN: Information geometry and attention-based spatiotemporal graph convolutional networks for traffic flow prediction

Jiyao An, Liang Guo, Wei Liu, Zhiqiang Fu, Ping Ren, Xinzhi Liu, Tao Li

PII: S0893-6080(21)00231-8
DOI: https://doi.org/10.1016/j.neunet.2021.05.035
Reference: NN 4866

To appear in: Neural Networks

Received date: 27 January 2021
Revised date: 13 May 2021
Accepted date: 28 May 2021

Please cite this article as: J. An, L. Guo, W. Liu et al., IGAGCN: Information geometry and
attention-based spatiotemporal graph convolutional networks for traffic flow prediction. Neural
Networks (2021), doi: https://doi.org/10.1016/j.neunet.2021.05.035.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the
addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive
version of record. This version will undergo additional copyediting, typesetting and review before it
is published in its final form, but we are providing this version to give early visibility of the article.
Please note that, during the production process, errors may be discovered which could affect the
content, and all legal disclaimers that apply to the journal pertain.

© 2021 Published by Elsevier Ltd.



IGAGCN: information geometry and attention-based spatiotemporal graph convolutional networks for traffic flow prediction

Jiyao AN1*, Liang GUO1, Wei LIU1, Zhiqiang FU1, Ping REN1, Xinzhi LIU2, Tao LI1*

1 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
2 Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1

ABSTRACT

In this study, a novel spatiotemporal graph convolutional network model is proposed for traffic flow prediction in urban road networks by fully combining an information geometry approach with an attention-based mechanism. Accurate traffic flow prediction in real urban road networks is challenging due to the presence of dynamic spatiotemporal data and external factors in the urban environment. Moreover, the dynamic spatial and temporal dependencies of urban traffic flow data are very important for predicting traffic flow, and it has been shown that recent attention mechanisms capture these dynamic dependencies relatively well, whereas most existing algorithms do not fully consider them. Therefore, in the novel model, abbreviated as IGAGCN, the information geometry method is utilized to quantify the dynamic differences between the data distributions of different sensors. The attention mechanism is combined with the information geometry method: a matrix is derived by analyzing the distributions of sensor data, so that the dynamic spatiotemporal connections in traffic flow features better capture the spatial dependencies of traffic between different sensors in urban road networks. Furthermore, a parallel sub-model architecture is proposed to consider long time spans, where a dilated causal convolution in each sub-model is applied to short time spans. Two well-known data sets were employed to demonstrate that our proposed method obtains better performance and captures the dynamic spatial dependencies of traffic better than existing attention-only models. In addition, a real-world urban road network in Shenzhen, China, was studied to test and verify the proposed model.
Keywords: Spatiotemporal traffic data, Traffic flow prediction, Graph convolutional network, Information geometry, Attention mechanism

1. Introduction

1.1. Motivation

Urban traffic forecasting is a very important problem because of its impacts on mobility and socio-economic aspects of
people’s lives. In order to improve the efficiency of transportation, many countries have committed to developing intelligent
transportation systems that integrate communication, sensors, computers, and other technologies. Intelligent transportation in
urban road networks has rapidly become a crucial research issue. Traffic flows are important data for intelligent system analysis
because traffic flow can reflect the current traffic situation. If collected historical traffic flow data can be used to predict future
traffic flows, traffic departments could intervene in the traffic ahead of time, thereby alleviating traffic congestion and reducing
traffic accidents. However, the traffic flow prediction problem is very challenging for several reasons. First, traffic flow data are
collected at a fixed time and frequency by sensors or cameras deployed on the road, and thus the data collected by sensors at
specific locations within a certain time interval will affect each other dynamically, i.e. the spatial and temporal correlations in
traffic data are strong and highly dynamic. Modeling these nonlinear dynamic spatial and temporal patterns can be difficult. Second,
the data collected by sensors can fluctuate greatly, change suddenly, and may contain noise; thus, selecting the most effective
previous data to predict the future is challenging. Third, traffic can be strongly affected by external factors, such as bad weather or

1 This study was supported partly by the National Natural Science Foundation of China under Grant 61370097; the Natural Science Foundation of Hunan Province, China, under Grant 2018JJ2063; and the Innovation Project of Hunan Xiangjiang Artificial Intelligence Academy.
* Corresponding authors. Email addresses: Jiyao AN (jt_anbob@hnu.edu.cn); Tao LI (jt_litao@hnu.edu.cn)

unexpected crowds gathering, which are difficult to model using data collected by sensors. Thus, massive amounts of prior
knowledge must be collected and entered into models, and this might not be easy to obtain and model. Recently, the traffic flow
prediction problem has been studied by many researchers and reviewed in comprehensive surveys (see Vlahogianni et al., 2014 and
Li & Cyrus, 2018b, and references therein).
In recent decades, deep learning models have been employed to deal with complex, high-dimensional spatiotemporal data,

and researchers have increasingly applied this method to traffic forecasting using deep learning technology. Due to the great
success of convolutional networks in the field of image recognition, Zhang et al. (2016a) and Ma et al. (2017) abstracted traffic
flows as heatmap images to model their spatial dependencies; however, heatmaps are two-dimensional images and cannot reflect
the topology of traffic networks. Topological information in traffic networks can be treated as a priori information because roads

with upstream and downstream relationships strongly affect each other. The graph convolutional network (GCN) model proposed
by Bruna et al. (2014) was rapidly applied in the field of traffic flow prediction and has become the mainstream method. Re-
searchers usually integrate GCNs into recurrent neural networks (RNNs) to simultaneously capture spatial and temporal depend-
encies (Li et al., 2018a; Yao et al., 2019). However, iteratively training RNNs may lead to the accumulation of errors and com-
plex calculation problems, so some researchers prefer to integrate GCNs into convolution neural networks (CNNs) (Yu et al.,
2018; Guo et al., 2019). However, capturing long-term dependencies in a one-dimensional CNN is limited by the size of its receptive field, which is linearly related to the number of hidden layers. Therefore, combining GCNs and CNNs is problematic due
to the inadequate ability to capture long-term dependencies, or the use of an excessive number of hidden layers resulting in too
many parameters.
Models can simultaneously consider both the spatial and temporal dependencies in a traffic network to achieve good prediction performance, but such models typically learn a static spatial dependency because the spatial dependency between locations is assumed to depend only on the similarity of historical traffic. However, the spatiotemporal dependencies between different locations in traffic
networks may vary dynamically, and can sometimes be closely related. GeoMAN (Liang et al., 2018) uses a multiscale attention
mechanism to model the spatiotemporal dynamic correlations. In particular, GeoMAN uses local spatial attention and global spa-
tial attention to capture the dynamic spatial correlations between different sensors in the first layer, and a temporal attention

mechanism to model the dynamic temporal correlations between different times in the second layer. However, this type of com-
ponent model requires training for each time series, which incurs relatively large computational overheads. Spatiotemporal graph
convolutional networks (STGCNs) (Guo et al., 2019) integrate spatial and temporal attention modules to simultaneously capture
spatiotemporal dependencies. However, the parameters in the attention matrix are highly dependent on the training of the model
due to the lack of prior knowledge.
In several studies, deep learning models were improved by considering the probability distributions between data. Thus,
Hong et al. (2019b) proposed a learnable manifold alignment (LeMA) method for directly learning the graph structure and cap-
turing the data distribution through graph-based label propagation to identify more accurate decision boundaries. By studying the
data distribution between different classifications in the dichotomy problem, Wang et al. (2019) constructed an information dif-
ference matrix using an information geometry method (Amari & Nagaoka, 2000) and obtained better results with many
well-known classification data sets.
According to the previous studies described above, in order to improve the performance of the attention mechanism, we
propose an information geometry method based on the hypothesis that when the data distributions of road sensors at different time
points (assuming that they follow a Gaussian distribution) are closer, the mutual influence among them will probably be greater and
the traffic trends will be more similar. Thus, a data distribution differential matrix is constructed to improve the ability of the

attention mechanism to capture dynamic spatial dependencies, and it is simultaneously applied with graph convolutions and gat-
ed dilated causal convolution (Oord et al., 2016) to learn the dynamic spatiotemporal dependencies of traffic data.

1.2. Our contributions

To address the limitations highlighted above, we propose a new type of deep learning traffic prediction model based on in-
formation geometry and an attention mechanism. In particular, we use information geometry and an attention mechanism to preprocess the input graph-based traffic data so that the subsequent convolution operations can effectively capture the dynamic spatio-
temporal correlations in traffic networks and less prior knowledge is required. The proposed model is called an information ge-
ometry attention-based graph convolutional network (IGAGCN). The main contributions of this study are summarized as follows.
(1) A novel graph depth model, IGAGCN, is proposed for traffic flow prediction problems in urban road networks by uti-
lizing the information geometry method and an attention-based mechanism. The information geometry method is utilized to sim-
ultaneously quantify the raw distribution of data for different sensors at different times, and the attention-based mechanism with
information geometry is built to capture the dynamic spatiotemporal connections in traffic flow data features. To the best of our
knowledge, this is the first method to utilize the information geometry method and an attention mechanism to dynamically capture the spatial dependencies in traffic between different sensors.
(2) The proposed model fully considers the dependencies in traffic flow data over different time spans. Three parallel mod-
els are constructed to extract the weekly, daily, and hourly period trends. Moreover, multidilated causal convolution layers are
applied to capture the similarity among different short time-spans in the inputs of each sub-model. The data collected by sensors
can sometimes fluctuate and this decreases the impact of the previous value, so different time spans are employed to make the
data smoother and enhance the robustness of the model. In addition, the receptive field of the temporal dimension convolutional
operation is expanded by short spans and longer time dependencies are captured. Thus, a new structure for the traffic flow pre-
diction model is first proposed to fully utilize different time spans in traffic data regardless of whether they are long or short.
(3) Several experiments were conducted based on two real-world highway traffic data sets to verify the effectiveness of the
proposed model and to demonstrate that it performs better than existing baselines in prediction tasks. In addition, a real-world
urban road network in Shenzhen, China, was studied to test and verify the proposed model.
The remainder of this paper is organized as follows. In Section 2, we review previous research into traffic prediction and
information geometry. In Section 3, we consider the traffic prediction problem. In Section 4, we introduce the proposed hierar-
chical model (IGAGCN). In Section 5, we present the results obtained with well-known data sets and real-world experiments,
which demonstrate the effectiveness and advantages of the proposed method. Finally, we give our conclusions in Section 6.
2. Related research
Interest in the development of high-performance and intelligent traffic flow prediction systems has increased recently. The
main aim of traffic prediction is to predict the relative traffic value for a certain location at a certain time in the future based on
historical data. Numerous studies have investigated the traffic flow prediction problem (for comprehensive surveys, see
Vlahogianni et al., 2014 and Li & Cyrus, 2018b).
Previous studies of the traffic flow prediction problem were mainly data driven due to the rapid development of traffic data
collection and storage technology. The two main categories of data-driven methods are classical statistical models, including autoregressive integrated moving average (ARIMA) (Ahmed & Cook, 1979; Williams & Hoel, 2003) and vector autoregressive (VAR) (Zivot & Wang, 2006) models, and machine learning models, such as support vector machines and neural networks.

Yin et al. (2017) used a support vector machine and a parameter optimization algorithm to improve the efficiency and accuracy of traffic volume prediction. Various neural network-based models are reviewed in the following.

2.1. Neural network methods for traffic prediction

In recent years, models based on neural networks have provided a new and promising direction for traffic flow prediction
due to the rise of deep learning. Neural networks have achieved great success in the fields of image processing, natural language

processing, and computer vision. Many researchers have applied deep learning methods in the field of transportation with some
success. Huang et al. (2014) first introduced the deep learning method into traffic research by using a multitask deep belief network (DBN), which they applied to unsupervised feature learning for traffic flow data; a multitask regression layer was then

used on top of DBN for prediction. An autoencoder is a feedforward neural network comprising an encoder and decoder. By
stacking the autoencoder, the stacked autoencoder (SAE) can learn the multilayer representation of traffic flow data to obtain the
potential spatiotemporal characteristics (Lv et al., 2015). Chen et al. (2016) developed a deep stacked denoising autoencoder
model to learn the effects of human movement dependencies on traffic flows. Duan et al. (2016) used different hyperparameters
for traffic data during different time periods to demonstrate the dependence of the SAE model on hidden layers. More hidden
layers will lead to better results, but also higher computational costs. Compared with feedforward neural networks, the RNN is better at learning features and long-term dependencies from serialized and time series data, but it suffers from vanishing or exploding gradient problems during training. Long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) networks can
effectively solve this problem, but they are also affected by problems due to their complex gating mechanism and low efficiency.
Gated recurrent unit (GRU) neural networks (Chung et al., 2014) can improve the training efficiency and reduce the computa-
tional complexity by improving the gating mechanism in LSTM, but without reducing the effectiveness. Fu et al. (2016) compared the use of LSTM and GRU networks for traffic flow forecasting. Compared with the LSTM model, the GRU model is more efficient and accurate, and its convergence speed is higher. CNNs have achieved great success in image recognition
and some researchers have applied CNNs to traffic prediction. Ma et al. (2017) extracted traffic data into traffic heat maps and
used CNNs to learn spatial features. Yao et al. (2018) proposed a deep multiview model by combining CNN and LSTM models
to simultaneously capture spatiotemporal features, where they used a local CNN to model local spatial correlations and LSTM to
model the temporal correlations. The traditional convolution method extracts local patterns in data but can only be applied to
grid-based traffic data. Due to the characteristics of road networks, it is reasonable to abstract the road network as a directed
weighted graph and extract the network topology. Bruna et al. (2014) first proposed GCNs as neural networks that integrate
spectral theory, and they have achieved excellent results at high-dimensional data processing (Ullah et al., 2019; Hong et al.,
2020). Several studies have extended GCN- and RNN-based models to separately capture spatial and temporal patterns. Diffu-
sion convolution RNN (Li et al., 2018a) was proposed as a new form of convolution operation defined on a graph based on the
diffusive nature of traffic. In contrast to RNNs, STGCN (Yu et al., 2018) applies GCN and gated CNN structures to allow much
faster training with fewer parameters. Models based on graph convolution can perform better in the traffic domain, but in order
to further consider the dynamic temporal and spatial correlations of traffic data, Diao et al. (2019) extracted global and local
components from traffic flow data to allow the dynamic learning of a Laplacian matrix at a specific moment, and designed a La-
placian matrix estimator based on deep learning to learn dynamic patterns.
An attention mechanism is similar to a supervision mechanism and selects the most important information for the current
task from the input and ignores irrelevant information to improve the representational capacity of the original network. Atten-
tion-based networks have been used widely in many fields of deep learning, such as computer vision (Hu et al., 2018), natural
language processing (Yi et al., 2018), machine translation, and image recognition. Xu et al. (2015) and Serrano et al. (2019) visually demonstrated the effect of the attention mechanism on the deep learning model. GeoMAN (Liang et al., 2018) and atten-
tion-based spatiotemporal GCN (ASTGCN) (Guo et al., 2019) have developed the attention mechanism to model the dynamic
spatiotemporal correlations, and the spatiotemporal dynamic network (Yao et al., 2019) uses a periodically shifted attention
mechanism to handle temporal periodic similarity. In a recent study, Hou et al. (2021) integrated an attention model with CNNs for marketing intent detection, aiming to overcome the disconnection between local and global features in text. Chen and Shi (2021) investigated a novel end-to-end model for time series classification by using a deep
learning approach to construct a new multiscale attention CNN.

2.2. Information geometry in deep learning

Information geometry is an interdisciplinary area of probability and differential geometry, and it has many important applications in the field of deep learning, including model parameter reduction, deep model interpretability, feature selection, and dimensionality reduction.
Amari and Nagaoka (2000) established the theoretical background of the information geometry field, and Amari (2014) then
proposed an information geometry approach to clustering and related pattern matching problems, in which a general and unique
class of decomposable divergence functions is constructed in the manifold of positive definite matrices. In general, information
geometry involves studying geometric structures on the manifolds of the probability distributions. The geometric structures of
probability distributions are usually investigated in Riemannian space, where the basic idea is to represent the data as points on a
statistical manifold, which is a manifold of probability density functions. The theory of information geometry is derived from
two ideas comprising the random representation of information processes and the geometry of statistical models. Amari and Ka-
wanabe (1997) and Amari and Nagaoka (2000) pioneered and developed the application of statistical manifold methods in the
field of information processing, such as neural computing, and Amari (2010) first introduced the information geometry concepts
into deep learning models and architectures. Qian et al. (2017) used the geometric make-up of feature vectors obtained from a
deep CNN to rank images by encoding them as Fisher kernels. Furthermore, Zhao et al. (2018) proposed a confi-
dent-information-first principle for parameter reduction and model selection, in which they evaluated the confidence of parame-
ters based on their contributions to the expected Fisher information distance within the geometric manifold over the neighbor-
hood of the underlying real distribution. GeoSeq2Seq (Bay et al., 2018) combines information geometry and Seq2Seq into a
novel deep learning network model to predict the shortest routes between two nodes of a graph. In addition, Wang et al. (2018)
constructed classification problems with more distinguishing features by using information geometry and deep belief networks.
Recently, Sun et al. (2019) proposed a novel classification method for univariate time series based on the information geometry
structure, where the covariance matrix is used as the representation of a univariate time series, and the method then synthesizes
the local and global characteristics of the time series. Based on the manifold learning method, Hong et al. (2019a, 2019b) pro-
posed the LeMA and common subspace learning methods, and applied them to hyperspectral data to achieve better classifica-
tions. Many studies have investigated information geometry in time series classification and its applications, but the application
of information geometry to the GCN modeling problem has not been reported.
In general, it should be noted that the performance of systems can be enhanced by exploring an ensemble method with dif-
ferent deep learning architectures. Thus, in the present study, we propose a novel IGAGCN model to explore the possibility of
applying the information geometry method and an attention mechanism to GCN to improve the overall performance of traffic
network prediction.

3. Preliminaries
In the following, we provide abstractions and notations for traffic flow data, and define the traffic flow prediction problem.

3.1. Notations

Traffic data are defined as a directed weighted graph that changes over time. The graph is expressed as G = (V, E), where V is

a set of vertices and E is a set of edges; V actually denotes the sensors deployed in the traffic network that are responsible for
recording real-time traffic data, and E represents the distance information between these sensors. If the set size of V is N, the adjacency matrix A ∈ ℝ^{N×N} of G represents the adjacency relationship between the vertices of G. If (Vi, Vj) is an edge in set E, then Aij is the normalized distance, and 0 otherwise. We assume that each node records F features in a time slice, such as the speed or occupancy. We use x_t^i ∈ ℝ^F to denote the values of all F features of node i at time t, and X_t = (x_t^1, x_t^2, ..., x_t^N)^T ∈ ℝ^{N×F} denotes the values of all F features of all nodes at time t. Similarly, we use Y_t = (y_t^1, y_t^2, ..., y_t^N)^T ∈ ℝ^{N×1} to represent the target feature value predicted for all nodes at time t in the future. All notations and definitions are listed in Appendix A.

3.2. Problem statement

The traffic flow prediction problem can be summarized as follows: given τ time steps of historical data X = (X_1, X_2, ..., X_τ)^T ∈ ℝ^{N×F×τ}, we need to learn a function f that predicts the traffic flow sequences Y = (Y_1, Y_2, ..., Y_{T_p})^T ∈ ℝ^{N×T_p} for the next T_p steps:

f: [X_{t−τ, t}] → [Y_{t+1, t+T_p}]
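To make the notation concrete, the following minimal numpy sketch checks the tensor shapes of the mapping above; the sizes and the dummy predictor f are illustrative assumptions, not the paper's model:

```python
import numpy as np

N, F, tau, Tp = 4, 2, 12, 3           # sensors, features, history, horizon
rng = np.random.default_rng(0)

# Historical input X = (X_1, ..., X_tau)^T in R^{N x F x tau}.
X = rng.random((N, F, tau))

def f(X_hist):
    """Dummy stand-in for the learned predictor: repeats the last
    observed value of feature 0 for the next Tp steps."""
    last = X_hist[:, 0, -1]                      # shape (N,)
    return np.tile(last[:, None], (1, Tp))       # shape (N, Tp)

Y = f(X)                                         # predicted flow sequences
assert X.shape == (N, F, tau) and Y.shape == (N, Tp)
```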

4. Proposed approach
In this section, we describe our proposed data-driven deep learning IGAGCN model for predicting future traffic flows, where
the general structure of the model is constructed. Each module in the structure is then considered in detail step-by-step according to
the flow of the data. Finally, we summarize our proposed approach in Algorithm 1.

Fig. 1 illustrates the general structure of our proposed IGAGCN model, which comprises three parallel sub-models with the same structure. We add two models to capture the weekly and daily periodic features of traffic flow data, based on the ideas of the traffic prediction models ST-ResNet (Zhang et al., 2016a) and ASTGCN (Guo et al., 2019). We separately feed these two models with the data from the N nearest days and weeks, where the data have the same time interval as the prediction period; N is a hyperparameter. In Fig. 1, N is 2, purple indicates the weekly trend input, and green denotes the daily trend input. The general input is the most important segment of data and is directly adjacent to the predicted period. The formation and spread of traffic congestion are gradual, and thus this part of the data is vital for the prediction performance of the model because it inevitably affects the traffic flow in the near future.
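The weekly/daily/recent segmentation described above reduces to simple index arithmetic. In the sketch below, the function name, the 5-minute sampling interval (288 steps per day), and n = 2 past periods are illustrative assumptions:

```python
import numpy as np

def periodic_segments(series, t, Tp, n=2, steps_per_day=288):
    """Select the recent, daily-periodic and weekly-periodic input
    segments for predicting series[t : t + Tp].  n is the number of past
    days/weeks used (the hyperparameter N above)."""
    spw = 7 * steps_per_day
    recent = series[t - Tp:t]                                 # adjacent segment
    daily = [series[t - k * steps_per_day:t - k * steps_per_day + Tp]
             for k in range(1, n + 1)]                        # same slot, past days
    weekly = [series[t - k * spw:t - k * spw + Tp]
              for k in range(1, n + 1)]                       # same slot, past weeks
    return recent, daily, weekly

series = np.arange(5000.0)                   # dummy single-sensor reading
recent, daily, weekly = periodic_segments(series, t=4500, Tp=12)
assert recent.shape == (12,) and len(daily) == 2 and len(weekly) == 2
```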


Fig. 1. The proposed structure of IGAGCN.

Traffic data generally exhibit repetitive patterns because of people's daily routines. Fig. 2(a) shows the changes in traffic data on Mondays over seven weeks. The similarity of the traffic patterns on Mondays indicates that the traffic flow data contain weekly periodic patterns. Fig. 2(b) shows the changes in the traffic data for seven consecutive days; it is evident that the trend in daily traffic flow changes is similar.

(a) (b)

Fig. 2. Traffic data trends in different periods.

In each sub-model, the input data are first processed by the IGAGCN cell to capture spatiotemporal features (Fig. 3 shows the inner structure of the cell), and a residual operation is then applied. If the IGAGCN cell is the last block, a fully connected layer follows, with ReLU as the activation function, and we then obtain the output of each sub-model. Finally, we obtain the prediction result by fusing the outputs of all three sub-models.
Fig. 3. The inner structure of the IGAGCN cell. TAM is the temporal attention matrix, IGSTM is the information geometry spatial temporal attention matrix, and
DCCN is the dilated causal convolution operation.

Traffic flows differ in terms of their sensitivity to the input data in the three parts, so the degree of influence on the results also differs for the three sub-models. We use the following method to learn different weights:

Ŷ = W_w ⊙ Ŷ_w + W_d ⊙ Ŷ_d + W_r ⊙ Ŷ_r, (1)

where ⊙ is the Hadamard product, and W_w, W_d, and W_r are learning parameters used to adjust the degrees of the effects of the weekly, daily, and recent trends, respectively. L1Loss, i.e. the mean absolute error (MAE), is used as the loss function to minimize the difference between the predicted results and the true values:

L(θ) = |Y − Ŷ|, (2)

where θ denotes all of the learnable parameters in our model, and Y and Ŷ represent the ground-truth and predicted values of the traffic flow, respectively. The details of each layer are described in the following.
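A minimal numpy sketch of the weighted fusion in (1) and the L1 loss in (2); the sub-model outputs and learned weights are random stand-ins for illustration:

```python
import numpy as np

N, Tp = 4, 3
rng = np.random.default_rng(0)

# Outputs of the weekly, daily and recent sub-models (N sensors x Tp steps).
Yw, Yd, Yr = rng.random((3, N, Tp))

# W_w, W_d, W_r are learned element-wise weights; random stand-ins here.
Ww, Wd, Wr = rng.random((3, N, Tp))

# Eq. (1): fusion by Hadamard (element-wise) product.
Y_hat = Ww * Yw + Wd * Yd + Wr * Yr

# Eq. (2): L1 loss (mean absolute error) against the ground truth.
Y_true = rng.random((N, Tp))
loss = np.abs(Y_true - Y_hat).mean()

assert Y_hat.shape == (N, Tp) and loss >= 0.0
```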

4.1. Temporal attention

In the temporal dimension, the traffic conditions in different time periods are correlated, and the correlation also differs in different situations. We use the attention mechanism to adaptively attach different importance to the data:

P = V_p · tanh( ((X_h^{(r−1)})^T U_1) U_2 (U_3 X_h^{(r−1)}) + b_p ), (3)

where X_h^{(r−1)} = (X_1, X_2, ..., X_{T_{r−1}}) ∈ ℝ^{N×C_{r−1}×T_{r−1}} is the input of the rth spatiotemporal block and C_{r−1} is the number of channels of the input data in the rth layer; V_p, b_p ∈ ℝ^{T_{r−1}×T_{r−1}}, U_1 ∈ ℝ^N, U_2 ∈ ℝ^{C_{r−1}×N}, and U_3 ∈ ℝ^{C_{r−1}} are learnable parameters, and tanh is used as the activation function. The values of the elements in the temporal correlation matrix P represent the degrees of dependence between different time steps in the input time series, and they are determined by the varying inputs. The softmax function is then used to normalize P. We multiply the input by the normalized temporal attention matrix to obtain the new input, thereby dynamically adjusting the input by summarizing relevant information. This step corresponds to lines 5–6 in Algorithm 1.
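Eq. (3) can be sketched in numpy as follows; the shapes and random parameter values are illustrative assumptions, and a single (unbatched) input is used:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N, C, T = 4, 2, 6                     # sensors, channels, time steps
rng = np.random.default_rng(1)
X = rng.random((N, C, T))             # input of one spatiotemporal block

# Learnable parameters of eq. (3); random stand-ins for illustration.
U1 = rng.random(N)                    # R^N
U2 = rng.random((C, N))               # R^{C x N}
U3 = rng.random(C)                    # R^C
Vp = rng.random((T, T))               # R^{T x T}
bp = rng.random((T, T))

lhs = X.transpose(2, 1, 0) @ U1 @ U2  # (T,C,N)·U1 -> (T,C); ·U2 -> (T,N)
rhs = np.einsum('c,nct->nt', U3, X)   # U3·X -> (N,T)
P = Vp @ np.tanh(lhs @ rhs + bp)      # temporal correlation matrix (T,T)

P_norm = softmax(P, axis=1)           # row-normalize with softmax
X_new = np.einsum('nct,ts->ncs', X, P_norm)  # reweight input along time

assert P.shape == (T, T) and np.allclose(P_norm.sum(axis=1), 1.0)
```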

4.2. Information geometry spatial attention

In the spatial dimension, the traffic conditions at different locations affect each other, and their interactions are highly dynamic. Similarly, we apply the attention mechanism to adaptively capture the dynamic connections between different sensors in the spatial dimension:

S = V_s · tanh( (X_h^{(r−1)} W_1) W_2 (W_3 X_h^{(r−1)})^T + b_s ), (4)

where V_s, b_s ∈ ℝ^{N×N}, W_1 ∈ ℝ^{T_{r−1}}, W_2 ∈ ℝ^{C_{r−1}×T_{r−1}}, and W_3 ∈ ℝ^{C_{r−1}} are learnable parameters, and tanh is used as the activation function. The softmax function is then used to normalize S. This step corresponds to line 7 in Algorithm 1.

The attention matrix S is dynamically calculated based on the input of the current layer, and the element values in S represent the strengths of the connections between different sensors. In order to better identify the dynamic correlations between different sensors, we use knowledge from information geometry to measure the differences in the data distributions between sensors, which serves as prior information to strengthen the attention matrix. Fisher's information distance is used to measure these differences.

Suppose that there are two data distributions i and j that satisfy the Gaussian distribution, as follows:

θ(i) = (μ(i), σ(i)),  θ(j) = (μ(j), σ(j)). (5)

First, we define a half plane:

H = {(μ, σ) ∈ ℝ² | σ > 0}, (6)

and the Fisher information matrix (FIM) can be expressed as follows:

M(θ) = [m_ij(θ)]. (7)

The elements in the FIM are calculated by:

m_ij(θ) = E{ (∂ log p(x|θ) / ∂θ_i) · (∂ log p(x|θ) / ∂θ_j) }, (8)

where E is the expectation. Hence, we have the following:

M = (m_ij(μ, σ)) = [ 1/σ²  0 ; 0  2/σ² ]. (9)

The pair (H, M) can be regarded as a Riemannian manifold, and the metric of (H, M) is expressed as follows:

ds_F² = dθ^T M(θ) dθ = (dμ² + 2dσ²) / σ². (10)
𝜎2
We assume that a curve θ(t) joins θ(i) = θ(t_1) and θ(j) = θ(t_2), with t_1 ≤ t ≤ t_2. Then, the distance between the two distributions p(x|θ_i) and p(x|θ_j) along θ(t) is defined as follows:

D(θ_i, θ_j) := ∫_{t_1}^{t_2} √( (dθ/dt)^T M(θ) (dθ/dt) ) dt. (11)

We need to find the curve that minimizes the distance between p(x|θ_i) and p(x|θ_j), i.e. the integrated Fisher information distance, which can be calculated using the following formula (Wang et al., 2018):

D_F((μ_1, σ_1), (μ_2, σ_2)) = √2 ln( ( |((μ_1−μ_2)/√2, σ_1+σ_2)| + |((μ_1−μ_2)/√2, σ_1−σ_2)| ) / ( |((μ_1−μ_2)/√2, σ_1+σ_2)| − |((μ_1−μ_2)/√2, σ_1−σ_2)| ) )
= √2 ln( ( F((μ_1, σ_1), (μ_2, σ_2)) + (μ_1−μ_2)² + 2(σ_1² + σ_2²) ) / (4σ_1σ_2) ), (12)

where | · | denotes the standard vector norm in Euclidean space and the function F in (12) is given by:

F((μ_1, σ_1), (μ_2, σ_2)) = √( ((μ_1−μ_2)² + 2(σ_1−σ_2)²) ((μ_1−μ_2)² + 2(σ_1+σ_2)²) ). (13)


𝑁×𝑁
We define an information geometry matrix 𝐼𝐺 ∈ ℝ and use formula (14) to calculate the data distribution distance be-
tween different sensors, before calculating a normalized matrix IG′ with the following formula. The computational time costs due
Journal Pre-proof

to the matrix become large as the number of sensors increases, and we select k adjacent nodes for calculation based on the k-
nearest neighbor method.

𝑖,𝑗
1 − 𝐷𝐹 / 𝑚𝑎𝑥( 𝐷𝐹 ) 𝑖 ≠ 𝑗
IG𝑖,𝑗 = { (14)
1 𝑖=𝑗
′ 𝑒𝑥𝑝( 𝑂𝑖,𝑗 )
𝐼𝐺𝑖,𝑗 = ∑𝑁 (15)

of
𝑗=1 𝑒𝑥𝑝( 𝑂𝑖,𝑗 )

We then combine IG′ with the spatial attention matrix S:


𝑆𝐼𝐺 = 𝑓((1 − 𝜆) 𝑆 + 𝜆𝐼𝐺′), (16)
where f is the softmax function to ensure that the sum of the attention weights for each sensor is 1, and λ is a hyperparameter that

pro
determines the importance of the two matrices S and IG′. In the graph convolution operation, we combine the SIG matrix and
adjacent matrix A to dynamically adjust the weight values between sensors.

4.3. Graph convolution in the spatial dimension

It is more reasonable to abstract traffic data into graph- than grid-based data because graph-based data can express spatial
topological information between different nodes, and thus the connectivity and global nature of the transportation network can be
re-
demonstrated. We treat the feature of each nodes as the signal on the graph and then use graph convolution based on the spectral
graph to capture the spatial patterns in the traffic network.
According to spectral graph theory, a traffic graph is represented by its corresponding normalized Laplacian matrix, which is
1 1
defined as 𝐿 = 𝐼𝑛 − 𝐷 −2 𝐴𝐷 −2 ∈ ℝ𝑛×𝑛 , where A is the adjacent matrix, In is an identity matrix, and 𝐷 ∈ ℝ𝑛×𝑛 is the diagonal
degree matrix. After applying eigenvalue decomposition to the Laplacian matrix, L is reformulated as L = UΛUT, where Λ is the
lP
diagonal matrix comprising the eigenvalues of L, and U comprises the corresponding eigenvectors of the eigenvalues in Λ and is
also called the Fourier basis.
If we assume that 𝑥 ∈ ℝ𝑁 is the signal, then the graph convolution operation with a kernel Θ can be formulated as follows:
𝛩∗𝐺 𝑥 = 𝛩(𝐿) 𝑥 = 𝛩(𝑈𝛬𝑈 𝑇 )𝑥 = 𝑈𝛩(𝛬)𝑈 𝑇 𝑥, (17)
where *G denotes a graph convolution operation. After graph convolution, the signal x will be transformed into the spectral do-
rna

main and this operation is called the graph Fourier transform (Shuman et al., 2013; Simonovsky et al., 2017). However, the com-
plexity of eigenvalue decomposition on the Laplacian matrix is O(n2), and thus as the matrix becomes larger, the time consump-
tion is unacceptable. Defferrard et al. (2016) and Kipf and Welling (2016) reduced the complexity to linear by approximation with
the Chebyshev spectral filter, and the graph convolution can then be rewritten as follows:
𝛩∗𝐺 𝑥 = 𝛩(𝐿) 𝑥 ≈ ∑𝐾−1 ̃ )𝑥 (18)
𝑘=0 𝜃𝑘 𝑇𝑘 (𝐿

where parameter θ is a vector of polynomial coefficients, K is the kernel size of graph convolution and determines the maximum
Jou

radius of the convolution from central nodes, 𝐿̃ = 2𝐿/𝜆𝑛𝑚𝑎𝑥 , λmax denotes the largest eigenvalue of L, and 𝑇𝑘 (𝐿̃ ) is the Che-
byshev polynomial of order k.
We apply the matrix SIG, obtained by information geometry method, and the spatial attention layer to the graph convolution
operation in order to dynamically adjust the correlations between nodes, where this step corresponds to line 8 in Algorithm 1, and
the following formula is used:
𝛩∗𝐺 𝑥 = 𝛩(𝐿) 𝑥 ≈ ∑𝐾−1 ̃ ⊙ 𝑆𝐼𝐺) 𝑥 (19)
𝑘=0 𝜃𝑘 𝑇𝑘 (𝐿

4.4. Gated multilayer dilated causal convolution in the temporal dimension

Generally, the two methods used for feature extraction in the temporal dimension are RNN- and CNN-based models. The
Journal Pre-proof

RNN-based models are used widely in time series analysis, to obtain time features by retaining historical information through a
gating mechanism and iteration. The CNN-based models merge the information in adjacent time slices by applying a convolution
operation to obtain temporal features. The RNN-based models involve a complex gating mechanism and a long training time,
whereas the CNN-based models have advantages in terms of faster training speed and simpler structure, and they can also form a
multilayer structure through a hierarchical representation for training. In order to combine the advantages of both methods, gated

of
convolution is applied in our model to extract complex temporal dependencies. The gated convolution operation in the temporal
dimension of the rth layer can be expressed as follows:
(𝑟 −1)
𝑠𝑖𝑔𝑚𝑜𝑖𝑑 ( ̂
𝑔𝜃∗𝐺 𝜒 ))) ∈ ℝ𝐶𝑟 ×𝑁×𝑇𝑟 , (20)

pro
where * represents the convolution operation, Φ denotes the parameters of the convolution kernel, tanh is the activation function
for the outputs, and the sigmoid function can determine the ratio of the current input passed to the next layer. This step corresponds
to line 9 in Algorithm 1.

re-
Fig. 4. Visualization of a stack of dilated causal convolutional layers.

In contrast to the general convolution operation, we apply stacked dilated causal convolution to capture short time-span de-
lP
pendencies, as proposed for WaveNet (Oord et al., 2016). This type of network structure (Fig. 4) can also enlarge the receptive
field to capture longer historical information as the dilation process increases. The overall process of the IGAGCN algorithm is
summarized in Algorithm 1.
Algorithm 1 Prediction algorithm for IGAGCN sub-model
Input:
rna

T time steps traffic data:𝑋 ∈ ℝ𝐵×𝑁×𝐶𝑖𝑛 ×𝑇


Output:
Tp time steps predicted traffic data: 𝑌 ∈ ℝ𝐵×𝑁×𝐶𝑜𝑢𝑡×𝑇𝑝
1: initialize all parameters θ in model
2: for each epoch do
Jou

3: for each block do


4: assign X to Residual
5: calculate temporal attention matrix (TAM) using (3)
6: get XTAt by multiplying X with TAM
7: calculate information geometry spatial attention matrix (IGSAM) using (4), (12), and (13) with XTAt
8: get XS by k-order Chebyshev graph convolution using (17) with X and IGSAM
9: get XST by k-layers dilated causal convolution using XS
10: assign XST and add Residual to X′
11: update X by normalizing X′
Journal Pre-proof

12: end for


13: do fully connect to get Y′
14: calculate loss with target value and Y′ to update θ
15: end for
16: input test set Xts into well-trained model and get output Y

of
17: return Y

5. Experiments

pro
In this section, we first introduce the two real-world data sets used in our experiments. We then describe the set of parame-
ters and the evaluation metrics used in the experiments, before comparing our model with previous representative methods, such
as STGCN (Yu et al., 2018), GeoMAN (Liang et al., 2018), and ASTGCN (Guo et al., 2019).

Table 1. Data set profiles.

Attributes PeMSD4 PeMSD8

Time spans re- January–February 2018 July–August 2016

Detectors 3848 1979

Selected detectors (vertices) 307 170

Distance information between sensors (edges) 340 295

Selected features 3 3
lP
Sequence length 16,992 17,856

5.1. Data sets

The data were collected from the California Highway Performance Measurement System, which uses more than 39,000 de-
tectors deployed on Californian highways. The detectors collect data in real time every 30 s and aggregate the data every 5 min,
thereby acquiring 288 data points per day. In this experiment, we employed three types of traffic measurements: total flow, average
rna

speed, and average occupancy. The PeMSD4 and PeMSD8 data sets were collected from different regions, and the specific dif-
ferences are shown in Table 1.
The selection of detectors required that the distance between adjacent detectors was more than 3.5 miles and the missing
data were filled in a linear manner. In addition, the data were processed by zero-mean normalization.

5.2. Settings and hyperparameters


Jou

We implemented our model with the MXNet deep learning framework and trained the models on a server with Nvidia Tesla
V100 GPUs. To ensure that the performance comparisons were fair, the hyperparameter settings were roughly the same as those
used by ASTGCN. First, we predicted the traffic flow after 1 h, i.e. Tp = 12. The data from the same time interval two weeks be-
fore, one day before, and 2 h before were fed into the three sub-models (i.e. Tw = 24, Td = 12, and Th = 24). We set the batch size
to 64, learning rate to 0.001, and Chebyshev polynomial K to 3. In the IGAGCN cell, the trade-off parameter λ was set to 0.5, and
all the graph convolution layers and the subsequent temporal convolution layers used 64 convolution kernels. In temporal dilated
causal convolution, we set k = 2, a short time span of 10 min.
5.2.1. Evaluation metrics and baseline methods
We use the rooted mean square error (RMSE) and MAE for evaluation:
Journal Pre-proof

1
𝑀𝐴𝐸 = ∑𝑛𝑖=1|𝑦𝑖 − 𝑦̂𝑖 | (23)
𝑛

1
𝑅𝑀𝑆𝐸 = √ ∑𝑛𝑖=1(𝑦𝑖 − 𝑦̂ 𝑖 )2 , (24)
𝑛

where yi and ŷi represent the ground-truth and predicted values of the traffic flow, respectively.
Nine baseline methods were compared with our model. There were three time series regression models: (1) historical aver-

of
age (HA), (2) ARIMA (Williams & Hoel, 2003), and (3) VAR (Zivot & Wang 2006). There were six deep learning methods: (4)
LSTM network (Hochreiter & Schmidhuber, 1997) as a typical gated RNN model; (5) GRU network (Chung et al., 2014) as a
simplified LSTM model; (6) STGCN (Yu et al., 2018) as a GCN with a gating mechanism, where STGCN was the first GCN
model used in the traffic flow prediction field and led to the widespread use of GCNs in traffic flow prediction; (7) GeoMAN

pro
(Liang et al., 2018) as an encoder–decoder structure network, which was the first application of a multilayer attention mechanism
to the problem of spatiotemporal data prediction for modeling the dynamic spatiotemporal dependencies between various sensors,
where it achieved success in PM2.5 prediction; (8) ASTGCN (Guo et al., 2019), which combines an attention mechanism and
GCN to predict traffic flows; and (9) multicomponent spatiotemporal GCN (MSTGCN) (Guo et al., 2019), which has the same
structure as ASTGCN but without the attention mechanism, in order to demonstrate the effectiveness of the attention mechanism
in traffic flow prediction.
re-
We also deployed another version of IGAGCN with the same structure and settings as IGAGCN but the information geom-
etry matrix was not applied in the spatial attention layer, and we designated this gated attention GCN (GAGCN). This version
allowed us to verify the impact of the proposed information geometry matrix on the attention mechanism in IGAGCN.

5.3. Results and performance analysis


lP
Table 2 compares the performance of our proposed method and all of the other baseline methods with PeMSD4 and
PeMSD8. The methods based on deep learning performed better than the time series analysis methods (HA, ARIMA, and VAR),
indicating that traffic flow prediction is not a simple linear time series analysis problem. Among the neural network-based mod-
els, the traditional deep learning models (LSTM and GRU) did not perform better than the others because they only considered
the temporal dependencies and not the spatial correlations. The STGCN, GeoMAN, ASTGCN, MSTGCN, and our models con-
rna

sidered both the temporal and spatial dependencies, and obtained better results. In addition, models with an attention mechanism
(GeoMAN, ASTGCN, GAGCN, and IGAGCN) performed better than models that only use graph convolution, such as STGCN,
because they can capture dynamic spatial relationships. Furthermore, our proposed models (GAGCN and IGAGCN) both per-
formed better than the others. In addition, compared with the best baseline method, ASTGCN, the RMSE and MAE values were
2.6% and 4% lower under GAGCN with PeMSD4, respectively, and 2.7% and 4.9% lower with PeMSD8, while the RMSE and
MAE values were 4.7% and 7.6% lower under IGAGCN with PeMSD4, and 2.7% and 6.7% lower with PeMSD8. In order to
Jou

intuitively demonstrate the better performance of our models compared with ASTGCN, the RMSE and MAE values for the three
models in Table 2 are presented as a histogram (Fig. 5). The proposed models obtained obvious improvements with PeMSD4
(Fig. 5) because the PeMSD4 data set is based on almost twice as many sensors as the PeMSD8 data set, and thus more data are
available for learning and the differences in the data distributions are high. The GAGCN was effective at capturing short
time-span dependencies for traffic flow prediction, but IGAGCN with the information geometry matrix enhanced the attention
mechanism to better capture dynamic spatial and temporal dependencies.

Table 2. Comparison of average performance based on RMSE and MAE

Method PeMSD4 PeMSD8


Journal Pre-proof

RMSE MAE RMSE MAE


HA 54.14 36.76 44.03 29.52
ARIMA 68.13 32.11 43.30 24.04
VAR 51.73 33.76 31.21 21.41
LSTM 45.82 29.45 36.96 23.18
GRU 45.11 28.65 35.95 22.20

of
STGCN 38.41 27.28 30.78 20.99
GeoMAN 37.84 23.64 28.91 17.84
MSTGCN 35.64 22.73 26.47 17.47
ASTGCN 32.82 21.80 25.27 16.50

pro
GAGCN (ours) 32.12 20.83 24.59 15.69
IGAGCN (ours) 31.27 20.13 24.55 15.38

re-
lP

(a) RMSE values. (b) MAE values.


Fig. 5. Comparison of the performance of ASTGCN, GAGCN, and IGAGCN with two data sets.
rna

In order to further compare the performance of the models, we tested the ability of the three best models to predict future
data with different lengths (Fig. 6).
As the prediction interval increased, it became more difficult for the models to make predictions and the prediction error in-
creased in a roughly linear manner. Both of our models had lower errors in different prediction intervals, and IGAGCN per-
formed better than the other two models at all time steps. The better performance of IGAGCN was more obvious when the time
step was smaller, mainly because IGAGCN learns the data distribution for each sensor in the traffic flow data based on the in-
Jou

formation geometry method, but the data distributions of the sensors may be close as the forecast time step increases even though
their trends are quite different.
We recorded the RMSE curve for the test set in each epoch, in which all data from the training set were fed into the model
and all parameters were updated after each epoch. The decrease in RMSE tended to stabilize as the number of epochs increased,
and thus the curve became flat, indicating model convergence (Fig. 7). Our two proposed models converged more rapidly than
ASTGCN, indicating that our method was more effective at learning the nonlinear relationships among the data.
Journal Pre-proof

of
pro
(a) Evaluation based on PeMSD4.

re-
lP

(b) Evaluation based on PeMSD8.


Fig. 6. RMSE and MAE values with different prediction intervals.
rna
Jou

Fig. 7. The RMSE for three methods in different epochs. Fig. 8. Comparison with real value of traffic flow.

Fig. 8 shows the differences between real values and values predicted with ASTGCN and IGAGCN. We selected the real
values with large errors for the two models to better illustrate the difference. The curve based on the real values was highly irreg-
ular, but both models could roughly fit the trend, demonstrating that both models could capture the changes within a small range
Journal Pre-proof

and that they were usually less than the true values, thus these models were not sufficiently sensitive to changes in the data. By
contrast, IGAGCN performed better, and its predictions were closer to the real values, demonstrating that IGAGCN had a greater
capacity to learn the possible patterns in the data, which also shows the effectiveness of our model.

of
pro
re-
(a) (b)
lP
rna

(c) (d)
Fig. 9. Traffic data flow obtained from sensors and corresponding information geometry attention matrix.
Jou

We selected 10 sensors in the PeMSD4 data set and the changes in traffic data over 2 h are shown in Fig. 9(a). The corre-
sponding information geometric attention matrix is shown in the form of a heat map (Fig. 9(b)), where each row represents the
degree of connection between sensors. The changes in the data recorded by sensor 0 (light green line), sensor 4 (orange line), and
sensor 8 (brown line) clearly differ from those in the others, and the corresponding rows are all lighter colors in the heat map
(Fig. 9(a)). Two obvious areas are marked in the heat map and Fig. 9(c) and (d) show the corresponding changes in the sensor
traffic flow data for these two areas, which were very similar.
Journal Pre-proof

Table 3. Computational times required by the three models.

Average training time per epoch (s)


Model
PeMSD4 PeMSD8

ASTGCN 65.19 38.23


GAGCN 105.40 60.70

of
IGAGCN 110.20 64.41

5.3.1. Computational time


We compared the computational costs for IGAGCN with those for ASTGCN and GAGCN based on PeMSD4 and PeMSD8,
(Table 3). The GAGCN was slower than ASTGCN because the method used to obtain time dependencies of ASTGCN is an or-

pro
dinary convolution operation, whereas the multilayer gated dilated causal convolution method employed by GAGCN has more
parameters that require training to better obtain temporal dependencies. Compared with GAGCN, IGAGCN utilizes an infor-
mation geometry matrix for the spatial attention calculations. The computational complexity of calculating the information geo-
metric distance between sensors is O(n2), where n is determined by the number of sensors (i.e. the number of vertices in the
graph), and the k-nearest neighbor method reduces the complexity to O(k*n).

5.4. Application to a real-world traffic network re-


Shenzhen North Railway Station is a large transportation hub in Shenzhen City, which is a well-known highly developed
economic and technology city center in China. The traffic efficiency of Shenzhen North Railway Station is directly determined
by the quality of the surrounding traffic operation. Using a deep learning model to predict the traffic conditions in certain periods
in the future may help government departments to manage and regulate the traffic in a timely manner, and improve traffic opera-
lP
tion efficiency. We selected 12 roads around Shenzhen North Railway Station (Fig. 10) and predicted the ratio of the current
travel time relative to the travel time in the free flow state (TTI) values for these 12 roads based on public data provided by the
Shenzhen Government.
rna
Jou

Fig. 10. The dotted lines indicate the six avenues selected around Shenzhen North Railway Station. Each avenue has two directions, making a total of 12 roads.
Journal Pre-proof

of
pro
re-
Fig. 11. Comparison of true TTI values and TTI values predicted by our model.

The distance of each road is fixed, so TTI is equivalent to the ratio of the average speed in the free flow state relative to the
current average speed. The TTI is computed using (25) and traffic conditions are more congested when TTI is larger.
𝑆
lP
𝑇𝑐𝑢𝑟𝑟𝑒𝑛𝑡 𝑉𝑐𝑢𝑟𝑟𝑒𝑛𝑡 𝑉𝑓𝑟𝑒𝑒
TTI = = 𝑆 = (25)
𝑇𝑓𝑟𝑒𝑒 𝑉𝑐𝑢𝑟𝑟𝑒𝑛𝑡
𝑉𝑓𝑟𝑒𝑒

The TTI data were collected every 10 min from January to March 2019, and from October to 20 December 2019. In addi-
tion, we obtained the average road speeds and flows (inward and outward volumes) in the corresponding time periods for each
TTI based on the GPS trajectory data from surrounding online hailing vehicles as additional features. The missing data were re-
rna

placed by linear interpolation or weighted averaging. We used our model to predict the TTI values in December and the other
data were used as a training set. Fig. 11 shows the predicted results obtained by our model and the real TTI values. Thus, we
tested and verified the effective performance of the proposed IGAGCN in a real-world urban road network environment.

6. Conclusion
Jou

In this study, we investigated traffic flow prediction and proposed a full-time span traffic flow prediction scheme with an
information geometry-based approach based on an attention mechanism GCN model, which fully considers the dynamic data
distributions between different sensors. For the first time, our method utilizes an information geometry technique to enhance the
attention mechanism. In particular, we employ parallel sub-models to consider long time spans and fully utilize dilated causal
convolution in each sub-model in order to consider short time spans. Moreover, we derive a matrix by analyzing the sensor data
distributions, which is combined with the attention matrix to better capture the dynamic spatial features of traffic flows. We used
PeMS data sets to evaluate the proposed model and conducted tests based on a real-world urban road network to clearly demon-
strate the effectiveness of our proposed IGAGCN.
Journal Pre-proof

References
Ahmed, M. S., & Cook, A. R. (1979). Analysis of freeway traffic time-series data by using Box-Jenkins techniques. Transportation Research Record, no. 722.

Amari, S. (2010). Information geometry in optimization, machine learning and statistical inference. Frontiers of Electrical and Electronic Engineering in China,
5, 241–260.

of
Amari, S. (2014). Information geometry of positive measures and positive-definite matrices: decomposable dually flat structure. Entropy, 16, 2131–2145.

Amari, S., & Nagaoka, H. (2000). Methods of Information Geometry, Volume 191 of Translations of Mathematical Monographs. Oxford University Press.

pro
Amari, S., & Kawanabe, M. (1997). Information geometry of estimating functions in semi-parametric statistical models. Bernoulli, 3, 29–54.

Bay, A., & Sengupta, B. (2018). GeoSeq2Seq: Information Geometric Sequence-to-Sequence Networks. In ICLR (Workshop) , Vancouver, BC, Canada,
http://OpenReview.net.

Bruna, J., Zaremba, W., Szlam, A., & LeCun, Y. (2014). Spectral networks and locally connected networks on graphs. In ICLR 2014: International Conference
on Learning Representations (ICLR) , Banff, AB, Canada, http://OpenReview.net.

re-
Chen, Q., Song, X., Yamada, H., & Shibasaki, R. (2016). Learning deep representation from big and heterogeneous data for traffic accident inference. In
AAAI’16 Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 338–344.

Chen, W., & Shi, K. (2021). Multi-scale attention convolutional neural network for time series classification. Neural Networks, 136, 126–140.

Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop
lP
on Deep Learning, Montreal, QC, Canada, MIT Press.

Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS’16 Proceedings
of the 30th International Conference on Neural Information Processing Systems, 29: 3844–3852.

Diao, Z., Wang, X., Zhang, D., Liu, Y., Xie, K., & He, S. (2019). Dynamic spatial-temporal graph convolutional neural networks for traffic forecasting. Pro-
rna

ceedings of the AAAI Conference on Artificial Intelligence 33 (1): 890–897.

Duan, Y., Lv, Y., & Wang, F.-Y. (2016). Performance evaluation of the deep learning approach for traffic flow prediction at different times. In 2016 IEEE Inter-
national Conference on Service Operations and Logistics, and Informatics (SOLI), 223–227.

Fu, R., Zhang, Z., & Li, L. (2016). Using LSTM and GRU neural network methods for traffic flow prediction. In 2016 31st Youth Academic Annual Conference
of Chinese Association of Automation (YAC), 2016: 324–328.
Jou

Guo, S., Lin, Y., Feng, N., Song, C., & Wan, H. (2019). Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. Proceedings
of the AAAI Conference on Artificial Intelligence 33 (1): 922–929.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.

Hong, D., Gao, L., Yao, J., Zhang, B., Plaza, A., & Chanussot, J. (2020). Graph convolutional networks for hyperspectral image classification. IEEE Transactions
on Geoscience and Remote Sensing, 1–13.

Hong, D., Yokoya, N., Chanussot, J., & Zhu, X. X. (2019a). CoSpace: common subspace learning from hyperspectral-multispectral correspondences. IEEE
Transactions on Geoscience and Remote Sensing, 57, 4349–4359.
Journal Pre-proof

Hong, D., Yokoya, N., Ge, N., Chanussot, J., & Zhu X. X. (2019b). Learnable manifold alignment (LeMA): a semi-supervised cross-modality learning frame-
work for land cover and land use classification. ISPRS Journal of Photogrammetry and Remote Sensing, 147, 193–205.

Hou, Z., Ma, K., Wang, Y., Yu, J., Ji, K., Chen, Z., & Abraham, A. (2021). Attention-based learning of self-media data for marketing intention detection. Engi-
neering Applications of Artificial Intelligence, 98, 104118.

of
Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7132–7141.

Huang, W., Song, G., Hong, H., & Xie, K. (2014). Deep architecture for traffic flow prediction: deep belief networks with multitask learning. IEEE Transactions
on Intelligent Transportation Systems, 15, 2191–2201.

pro
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. In ICLR (Poster), San Juan, Puerto Rico,
http://OpenReview.net.

Li, Y., & Shahabi, C. (2018b). A brief overview of machine learning methods for short-term traffic forecasting and future directions. Sigspatial Special, 10, 3–9.

Li, Y., Yu, R., Shahabi, C., & Liu, Y. (2018a). Diffusion convolutional recurrent neural network: data-driven traffic forecasting. In International Conference on

re-
Learning Representations, Vancouver Convention Center, Vancouver Canada, http://OpenReview.net.

Liang, Y., Ke, S., Zhang, J., Yi, X., & Zheng, Y. (2018). GeoMAN: multi-level attention networks for geo-sensory time series prediction. In International Joint
Conference on Artificial Intelligence, 3428–3434.

Lv, Y., Duan, Y., Kang, W., Li, Z., & Wang, F.-Y. (2015). Traffic flow prediction with big data: a deep learning approach. IEEE Transactions on Intelligent
Transportation Systems 16, 865–873.
lP
Ma, X., Dai, Z., He, Z., Ma, J., Wang, Y., & Wang, Y. (2017). Learning traffic as images: a deep convolutional neural network for large-scale transportation net-
work speed prediction. Sensors, 17, 818.

Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: a generative
rna

model for raw audio. SSW, 125.

Qian, Y., Vazquez, E., & Sengupta, B. (2017). Differential geometric retrieval of deep features. In 2017 IEEE International Conference on Data Mining Work-
shops (ICDMW), 539–44.

Serrano, S., & Smith, N. A. (2019). Is attention interpretable. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
2931–2951.
Jou

Shuman, D., Narang, S. K., Frossard, P., Ortega, A., & Vandergheynst, P. (2013). The emerging field of signal processing on graphs: extending high-dimensional
data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30, 83–98

Simonovsky, M., & Komodakis, N. (2017). Dynamic edge conditioned filters in convolutional neural networks on graphs. In Computer Vision and Pattern
Recognition, 3693–3702.

Sun, J., Yang, Y., Liu, Y., Chen, C., Rao, W., & Bai, Y. (2019). Univariate time series classification using information geometry. Pattern Recognition, 95, 24–35.

Ullah, I., Manzo, M., Shah, M., & Madden, M. G. (2019). Graph convolutional networks: analysis, improvements and results. ArXiv Preprint,
ArXiv:1912.09592.
Journal Pre-proof

Vlahogianni, E. I., Karlaftis, M. G., & Golias, J. C. (2014). Short-term traffic forecasting: where we are and where we’re going. Transportation Research Part
C-Emerging Technologies, 43, 3–19.

Wang, M., Ning, Z.-H., Xiao, C., & Li, T. (2018). Sentiment classification based on information geometry and deep belief networks. IEEE Access, 6, 35206–
35213.

Williams, B. M., & Hoel, L. A. (2003). Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: theoretical basis and empirical results.

of
Journal of Transportation Engineering-ASCE, 129, 664–672.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: neural image caption generation with
visual attention. In International conference on machine learning, 2048–2057.

pro
Yao, H., Tang, X., Wei, H., Zheng, G., & Li, Z. (2019). Revisiting spatial-temporal similarity: a deep learning framework for traffic prediction. Proceedings of
the AAAI Conference on Artificial Intelligence 33 (1): 5668–5675.

Yao, H., Wu, F., Ke, J., Tang, X., Jia, Y., Lu, S., Gong, P., Ye, J., & Li, Z. (2018). Deep multi-view spatial-temporal network for taxi demand prediction. In AAAI,
2588–2595.

re-
Yi, J. Z., Liu, L., Feng, J., Peng, H., & Zheng, X. (2018). RNN-based sequence-preserved attention for dependency parsing. In AAAI, 5738–5746.

Yin, S., Jiang, Y., Tian, Y., & Kaynak, O. (2017). A data-driven fuzzy information granulation approach for freight volume forecasting. IEEE Transactions on
Industrial Electronics, 64, 1447–1456.

Yu, B., Yin, H., & Zhu, Z. (2018). Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In International Joint Con-
ference on Artificial Intelligence, 3634–3640.
lP
Yu, F., & Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. In ICLR 2016: International Conference on Learning Representations, San
Juan, Puerto Rico, http://OpenReview.net.

Zhang, J., Zheng, Y., & Qi, D. (2016a). Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI, 1655–1661.
rna

Zhang, J., Zheng, Y., Qi, D., Li, R., & Yi, X. (2016b). DNN-based prediction model for spatio-temporal data. In Proceedings of the 24th ACM SIGSPATIAL
International Conference on Advances in Geographic Information Systems, 92.

Zhao, X., Hou, Y., Song, D., & Li, W. (2018). A confident information first principle for parameter reduction and model selection of Boltzmann machines. IEEE
Transactions on Neural Networks, 29, 1608–1621.

Zivot, E., & Wang, J. (2006). Vector autoregressive models for multivariate time series. Modeling Financial Time Series with S-PLUS, 385–429.
Jou
Journal Pre-proof

Appendix

Notations and definition are provided in Table 4.


Table 4. Notations used in Section 3

Symbol Definition
G Traffic graph

of
v Set of vertices in G
E Set of edges in G
N Number of vertices
A Adjacent matrix of G

pro
F Number of features of traffic data
𝑥𝑡𝑖 Value of all features at time t for vertex i
𝑋𝑡 Value of all features at time t for all vertices
X Set of Xt in continuous time steps
𝑦𝑡𝑖 Target feature value to predict at time t for vertex i
𝑌𝑡 Target feature value to predict at time t for all vertices
Y Set of Yt in continuous time steps
𝜏 re- Set size of X (i.e. time steps)
𝑇𝑝 Set size of Y (i.e. time steps to predict)
lP
rna
Jou
Journal Pre-proof

Highlights of Neunet-D-21-00185final

of
1. Information geometry and attention based graph convolutional network (IGAGCN) is a deep
learning model

2. IGAGCN is proposed for traffic flow prediction

pro
3. Information geometry method is used to extract dynamic dependencies in traffic flow

4. IGAGCN fully considers the long-term and short-term span dependencies in traffic flow

5. Comprehensive experiments conducted based on three real-world traffic flow datasets

re-
lP
rna
Jou
Journal Pre-proof

Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships
that could have appeared to influence the work reported in this paper.

of
☐The authors declare the following financial interests/personal relationships which may be considered
as potential competing interests:

pro
re-
lP
rna
Jou

You might also like