
Article
Short-Term Online Forecasting for Passenger Origin–Destination
(OD) Flows of Urban Rail Transit: A Graph–Temporal Fused
Deep Learning Method
Han Zheng, Junhua Chen *, Zhaocha Huang, Kuan Yang and Jianhao Zhu

School of Traffic and Transportation, Beijing Jiaotong University, No. 3 Shang Yuan Cun, Hai Dian District,
Beijing 100044, China
* Correspondence: cjh@bjtu.edu.cn

Abstract: Predicting short-term passenger flow accurately is of great significance for the daily management and timely emergency response of rail transit networks. In this paper, we propose an attention-based Graph–Temporal Fused Neural Network (GTFNN) that can make online predictions of origin–destination (OD) flows in a large-scale urban transit network. In order to solve the key issue of passenger hysteresis in online flow forecasting, the proposed GTFNN takes the finished OD flow and a series of features, which are known or observable, as input and performs multi-step prediction. The model is constructed to capture both spatial and temporal characteristics. For learning spatial characteristics, a multi-layer graph neural network is proposed based on hidden relationships in the rail transit network, and the graph convolution is embedded into a Gated Recurrent Unit to learn spatial–temporal features. For learning temporal characteristics, a sequence-to-sequence structure embedded with the attention mechanism is proposed to enhance the ability to capture both local and global dependencies. Experiments based on real-world data collected from Chongqing's rail transit system show that GTFNN performs better than other methods, e.g., its SMAPE (Symmetric Mean Absolute Percentage Error) score is about 14.16%, which is roughly 5% to 20% better than that of the compared methods.

Keywords: urban rail system; short-term prediction; passenger origin–destination (OD); graph neural networks; graph–temporal fused; sequence to sequence

MSC: 68T07

Citation: Zheng, H.; Chen, J.; Huang, Z.; Yang, K.; Zhu, J. Short-Term Online Forecasting for Passenger Origin–Destination (OD) Flows of Urban Rail Transit: A Graph–Temporal Fused Deep Learning Method. Mathematics 2022, 10, 3664. https://doi.org/10.3390/math10193664

Academic Editor: Ioannis E. Livieris

Received: 14 September 2022; Accepted: 4 October 2022; Published: 6 October 2022

1. Introduction

1.1. Background

Urbanization introduces growing populations in cities and leads to significant mobility and sustainability challenges. Predicting the short-term passenger flow of urban rail transit systems, especially the origin–destination (OD) flow, is vital for fully understanding travel patterns and improving daily operational efficiency. In case of emergencies or special incidents, accurate online passenger flow forecasting can help to implement efficient response measures and enhance the service quality of public transport systems [1].
Nevertheless, traffic forecasting is challenging due to the complex temporal and spatial relationships in the data. More specifically, in the temporal dimension, both global features and local features influence the evolution of passenger flow. In the spatial dimension, the interrelationship between passenger flow and the complex rail network, including both similarities and correlations (i.e., similarities or correlations among the passenger flow series of different stations) [2], makes the prediction of passenger flow much more complex. In addition, passenger forecasting problems often contain a complex mix of inputs, including time-invariant features, known future inputs, and other exogenous time series that are only observed historically, without any prior information on how they interact with the target [3].
In addition, OD prediction has its own characteristics. In a large-scale rail transit network, online forecasting can only obtain data on the finished OD passenger flows, as passenger trips cannot be completed immediately. The unknown passenger flow includes the currently unfinished passenger flow as well as the future flow that will continuously enter the network. Meanwhile, the AFC (Auto Fare Collection) system has a delay time for updating OD flow information. Thus, the forecasting task of OD flow naturally takes the finished OD flow as input and the actual OD flow as output.
To solve the above issues, we need a method that simultaneously considers the spatial–temporal properties of OD passenger flow. The development of deep learning provides the possibility of finding a reasonable solution. The temporal fusion transformer method [3] has been explored, which combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. Furthermore, Graph Convolution Networks (GCNs), which can automatically learn representations on non-Euclidean data (e.g., graphs), have been proposed and utilized in many scenarios [4–7].
Inspired by the above, a Graph–Temporal Fused Neural Network (GTFNN) is pro-
posed in this paper. The most important feature of this model is the introduction of a graph
neural network component in the attention-based sequence-to-sequence structure to fuse
the hidden spatial relationships in a rail transit network into temporal relationships. This
realizes the prediction of time-series OD passenger flow, which is highly dependent on the
network structure.

1.2. Potential Contributions


This paper develops a method for forecasting OD flow of a network-level rail transit system, considering the hysteresis characteristic and hidden complex spatial–temporal relationships. The potential contributions can be summarized as:
1. Considering the hysteresis characteristic of OD flow, we proposed an online forecast-
ing framework that maps the historical observable finished flow and features to the
actual OD flow in the future.
2. To better improve the prediction accuracy, we built a feature system including
observable features and knowable features based on the temporal relationship. We
also used specialized feature selection components for weighting the features.
3. For capturing the hidden spatial relationships in the rail transit network, we intro-
duced a multi-layer graph neural network as a key component in the method to depict
the relationships among stations from different aspects.
4. Based on a sequence-to-sequence framework, a Graph–Temporal Fused Deep Learning
model was built. In addition, an attention mechanism was attached to the model to
fuse local and global temporal dependencies; then, we achieved the prediction of
short-term online OD passenger flow.
For rail transit systems, obtaining accurate OD passenger flow data in real time is an
important support for transportation organization. The train line planning, timetabling and
station passenger flow planning all depend on accurate OD prediction results. Especially,
the prediction of different OD passenger flow peaks can effectively guide the allocation of
rail transportation resources. Therefore, the OD prediction model studied in this paper has
strong practical significance.

2. Literature Reviews
We performed a literature review from three aspects: passenger flow prediction, graph
convolution neural networks, and time series forecasting with attention-based deep neural
networks. All three aspects are relevant to the study of this paper.

2.1. Passenger Flow Prediction


Passenger flow prediction is one of the most significant tasks in the field of intelligent
transportation. There are two major categories of prediction methods covered in the flow
prediction issue, which are statistical methods and machine learning methods [8]. Among
the statistical algorithms, the Auto-Regressive Integrated Moving Average (ARIMA) model
is representative. Williams et al. [9] and Lee and Fambro [10] applied the ARIMA model to
realize urban freeway traffic flow prediction.
In recent years, a large number of studies have focused on applying machine learning or deep learning algorithms in transportation applications, which have generated considerable achievements. Huang et al. [11] claimed that their research was the earliest application of deep learning to traffic flow prediction, through the proposed Deep Belief Networks. Specific to subway passenger flow, Ni et al. [12] trained a Linear Regression model with event occurrence information to tackle
abnormal prediction. Sun et al. [13] proposed Wavelet-SVM and discussed the idea to
predict different kinds of passenger flows. A combination of empirical mode decomposition
and MLP was proposed by Wei and Chen [1] to forecast the short-term metro flow. Later
on, a multi-scale radial basis function network [14] was proposed by the same authors to
improve their forecasting accuracy. Furthermore, some researchers [15] focused on using
the nonparametric regression model to predict flows of transfer stations instead of the
whole subway system.
In the early days, OD predictions were more concerned with potential relationships
in the time dimension. Some works [16–18] usually employed time series filtering models
(e.g., Kalman filter) to estimate the OD flow. Recently, many studies have fused spatial
relationships with temporal relationships. For instance, Liu et al. [19] proposed a Con-
textualized Spatial–temporal Network that incorporated local spatial context, temporal
evolution context, and global correlation context to forecast taxi OD demand. Shi et al. [20]
utilized long short-term memory (LSTM) units to extract temporal features for each OD pair
and then learned the spatial dependency of origins and destinations by a two-dimensional
graph convolutional network.
For ride-hailing applications, the origin and destination of a passenger are known once a taxi request is generated. However, in online metro systems, the destination of a passenger is unknown until the passenger reaches the destination station; thus, the operators cannot obtain the complete OD distribution immediately to forecast the future OD demand. To
address this issue, Gong et al. [21] used some indication matrices to mask and neglect the
potential unfinished metro orders. Lingbo Liu et al. [22] handled this task by learning a
mapping from the historical incomplete OD demands to the future complete OD demands.

2.2. Graph Convolution Neural Networks


To fit the required input format of Convolutional Neural Network and Recurrent
Neural Network, some works [23,24] divided the studied network system into regular grid
cells and transformed the raw traffic data to tensors. However, this preprocessing manner
has limitations in handling the traffic systems with irregular topologies, such as rail transit
systems and road networks.
Graph Convolutional Networks can be used to improve the generality of deep learning-
based forecasting methods in networks. For instance, Li et al. [25] modeled the traffic flow
as a diffusion process on a directed graph and captured the spatial dependency with bi-
directional random walks. Song et al. [26] developed a Spatial–Temporal Synchronous
Graph Convolutional Network (STSGCN), which captured the complex localized spatial–
temporal correlations through a spatial–temporal synchronous modeling mechanism. In
the article by Han et al. [27], graph convolution operations were applied to capture the
irregular spatial–temporal dependencies along with the metro network. Geng X. et al. [28]
developed a ST-MGCN, which incorporated a neighborhood graph (NGraph), a trans-
portation connectivity graph (TGraph), and a functional similarity graph (FGraph) for
ride-hailing demand prediction. Lingbo Liu et al. [22] designed a more flexible historical

ridership data network based on this and fully explored the inter-station flow similarity
and OD correlation for virtual graph construction.

2.3. Time Series Forecasting with Attention-Based Deep Neural Networks


Attention mechanisms are used in translation [29], image classification [30] or tabular
learning [31] to identify salient portions of input for each instance using the magnitude of
attention weights. Recently, they have been adapted for time series with interpretability
motivations [32–34], using LSTM-based [35] and transformer-based [34] architectures.
Deep neural networks have increasingly been used in time series forecasting, which
demonstrates stronger performance over traditional time-series models [32,36,37]. More
recently, transformer-based architectures have been explored in Li et al. [34], which proposes
the use of convolutional layers for local processing and a sparse attention mechanism to
increase the size of the receptive field during forecasting. Despite their simplicity, iterative
methods rely on the assumption that only the target needs to be recursively fed into
future inputs.
The Multi-horizon Quantile Recurrent Forecaster (MQRNN) [38] uses LSTM or con-
volutional encoders to generate context vectors. In Fan et al. [39], a multi-modal attention
mechanism was used with LSTM encoders to construct context vectors for a bi-directional
LSTM decoder. Despite performing better than LSTM-based iterative methods, inter-
pretability remains challenging for such standard direct methods.
In general, deep learning models involve a large number of input features. The signifi-
cance of features is difficult to measure, which is not beneficial to the large-scale application
of prediction models. Some methods are used for the significance analysis of different input
features in time series forecasting. For example, Interpretable Multi-Variable LSTMs [40]
partition the hidden state such that each variable contributes uniquely to its own memory
segment and weights memory segments to determine variable contributions. Methods
combining temporal importance and variable selection have also been considered [33],
which compute a single contribution coefficient based on attention weights from each. How-
ever, in addition to the shortcoming of modeling only one-step-ahead forecasts, existing
methods also focus on instance-specific (i.e., sample-specific) interpretations of attention
weights—without providing insights into global temporal dynamics. The usage of an atten-
tion mechanism can provide an improvement for this issue. Temporal Fusion Transformer
(TFT) [3] is able to analyze global temporal relationships and allows users to interpret
global behaviors of the model on the whole dataset—specifically in the identification of
any persistent patterns (e.g., seasonality or lag effects) and regimes present.
From the existing studies, it can be seen that the current deep learning techniques have
started to take into account the temporal and spatial fusion features into the prediction
models. To solve the prediction problem in this paper, a systematic consideration of the
hysteresis of OD traffic, the hidden spatial relationships in complex networks, the different
input features and their temporal characteristics, and how to ensure the capability to
capture both local and global dependencies in the temporal dimension is needed.

3. Methodology
3.1. Short-Term Online Framework Considering Hysteresis of Passenger OD Flow
The short-term online framework for forecasting passenger OD flow should consider the hysteresis of OD passenger flow, as well as the hidden temporal and spatial relationships.

3.1.1. Solution for Passenger Hysteresis


Passenger OD flow has a hysteresis characteristic: the destination of a passenger is unknown until the passenger reaches the destination station, which poses a challenge to short-term online forecasting. Passengers cannot finish their trips within a short forecasting time step (e.g., 15 min or even 5 min); furthermore, in each time step, we cannot obtain the actual or complete OD distribution for forecasting dynamically. This characteristic has attracted attention in existing research. Gong et al. [21] used indication matrices to mask and neglect the potentially unfinished trips in urban rail networks.
In order to handle this issue, this paper proposes a framework that maps the time-dependent finished trips to the actual entered trips based on Equation (1).

$$\sum_{t} ODflow^{en}_{t} = ODflow^{un}_{t} + \sum_{t} ODflow^{fi}_{t} \tag{1}$$

where $ODflow^{en}_{t}$ represents the actual or entered OD flow of time step $t$, which can be obtained from historical data but cannot be calculated instantaneously. $ODflow^{en}_{t}$ is the object we wish to forecast and the guidance of system management. $ODflow^{fi}_{t}$ represents the finished OD flow of time step $t$, which can be obtained from historical data and instantaneous observations. The gap between the two cumulative values of $ODflow^{en}_{t}$ and $ODflow^{fi}_{t}$ in the temporal dimension is caused by unfinished journeys. We denote this gap as $ODflow^{un}_{t}$, which represents the unfinished OD flow of time step $t$.
Based on this relationship, it is clear that one of the core aspects of the forecasting
process is how to establish a mapping between the observable finished passenger flow and
the actual passenger flow.
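To make the relationship in Equation (1) concrete, the following minimal NumPy sketch computes the unfinished flow at each step as the gap between cumulative entered and cumulative finished flows. The arrays `entered` and `finished` and their sizes are synthetic placeholders, not data from the paper.

```python
import numpy as np

# Minimal sketch of Equation (1): per time step, the unfinished OD flow equals the
# cumulative entered flow minus the cumulative finished flow observed so far.
rng = np.random.default_rng(0)
T, num_od_pairs = 8, 5
entered = rng.integers(0, 50, size=(T, num_od_pairs)).astype(float)
# Finished flow lags the entered flow because trips take time to complete.
finished = np.minimum(entered, rng.integers(0, 40, size=(T, num_od_pairs)))

cum_entered = entered.cumsum(axis=0)      # sum_t ODflow^en_t
cum_finished = finished.cumsum(axis=0)    # sum_t ODflow^fi_t
unfinished = cum_entered - cum_finished   # ODflow^un_t, the gap at each time step

# In online forecasting only `finished` (and features) are observable at time t;
# the model learns a mapping from the finished flow to the future entered flow.
print(unfinished[-1])
```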

3.1.2. Consideration for Temporal Relationships


Each OD pair $i, j$ is associated with a set of features and targets at each time step $t \in [0, m_{max}]$. To facilitate the following description, we hereby make conventions on the notation, model, and framework of the study. The framework of the forecasting task can be illustrated as Equation (2).

$$\hat{Y}^{A}_{pt,m} = \mathrm{GTFNN}\left(\vec{Y}^{A}_{pt,-h,-\pi}, \vec{X}^{F}_{pt,-h,-\pi}, O_{pt,-h,-\pi}, K_{pt,-h,m}, G\left(\vec{X}^{F}_{pt,-h,-\pi}, O^{s}_{pt,-h,-\pi}, K^{s}_{pt,-h,m}\right)\right) \tag{2}$$

In Equation (2), $\hat{Y}^{A}_{pt,m}$ is the $m$-step-ahead forecast result at prediction time $pt$, where $m$ is the length of the predicted sequence. $\mathrm{GTFNN}(\cdot)$ refers to the proposed forecasting model. $\vec{X}^{F}_{pt,-h,-\pi}$ is the vector of historical finished OD flow over the time range $[pt-h, pt-\pi]$; $h$ refers to the horizon of the input sequence, and $\pi$ refers to the blind spot for data updates caused by the information system update mechanism. $\vec{Y}^{A}_{pt,-h,-\pi}$ is the target vector of forecasting, a vector of historical actual or entered OD flow. $O_{pt,-h,-\pi}$ refers to the observed input features that can be obtained in historical data only, $K_{pt,-h,m}$ refers to the known input features that can be obtained over the whole range of time, and $G\left(\vec{X}^{F}_{pt,-h,-\pi}, O^{s}_{pt,-h,-\pi}, K^{s}_{pt,-h,m}\right)$ represents the graphic structures, called multi-layer networks, that help the framework extract hidden localized features among stations. Within Figure 1, we show the relationship of the above concepts on the time axis to facilitate the construction of the following model.
Theoretically, the OD pairs among stations should be fully connected, i.e., if the number of stations is $N$, the number of OD pairs is $N \cdot (N-1)$. However, the OD matrix is relatively sparse. Thus, we only consider the OD flow from station $i$ to the top-$\kappa$ stations that its passengers are most likely to reach, as well as the total OD flow to the remaining stations. The details of the mentioned notations are shown in Table 1.
The details of $O_{pt,-h,-\pi}$ and $K_{pt,-h,m}$ are shown in Table 2. These two sets mainly include sequenced features (e.g., $O^{s}_{pt,-h,-\pi}$ and $K^{s}_{pt,-h,m}$) and graphic features (e.g., $O^{g}_{pt,-h,-\pi}$ and $K^{g}_{pt,-h,m}$) across time. These features are designed to help the model learn the relationships among the different components of the historical data.
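As a rough illustration of the windowing implied by Equation (2), the sketch below slices the observable finished-flow window and the future target window around a prediction step pt. The helper `make_sample`, the array names, and all sizes are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Sketch of the windowing in Equation (2): at prediction step `pt` the model sees the
# finished flow over [pt-h+1, pt-pi] (pi is the update blind spot) and predicts the
# entered flow over [pt+1, pt+m].
def make_sample(finished_flow, entered_flow, pt, h=12, pi=2, m=4):
    """Return (input window of finished flow, target window of entered flow)."""
    x = finished_flow[pt - h + 1: pt - pi + 1]   # X^F_{pt,-h,-pi}
    y = entered_flow[pt + 1: pt + m + 1]         # Y^A_{pt,m}
    return x, y

T, n_stations, top_k = 100, 6, 3
finished_flow = np.zeros((T, n_stations, top_k))
entered_flow = np.zeros((T, n_stations, top_k))
x, y = make_sample(finished_flow, entered_flow, pt=50)
print(x.shape, y.shape)   # (10, 6, 3) and (4, 6, 3) for h=12, pi=2, m=4
```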
Figure 1. Illustration of the forecasting task in the temporal perspective.

Table 1. Key notations in the forecasting task.

(1) $\vec{X}^{F}_{pt,-h,-\pi}$: A vector of the input finished passenger flow sequence, where $pt$ is the time step of forecasting, $h$ refers to the length of the input sequence, and $\pi$ refers to the blind spot for data updates caused by the information system update mechanism, where $h > 0$ and $\pi < h$. $\vec{X}^{F}_{pt,-h,-\pi}$ represents the finished passenger flow of time steps $\{pt-h+1, pt-h+2, \cdots, pt-\pi\}$; $\vec{X}^{F}_{pt,-h,-\pi} \triangleq \left\{X^{F}_{pt-n+1}, X^{F}_{pt-n+2}, \cdots, X^{F}_{pt-\pi}\right\}$.

(2) $X^{F}_{t}$: The finished OD flows of time step $t$ with origin station $i \in [1, N]$; $X^{F}_{t} \triangleq \left\{x^{F}_{1,t}, x^{F}_{2,t}, \cdots, x^{F}_{i,t}, \cdots, x^{F}_{N,t}\right\} \in \mathbb{R}^{\kappa \times N}$.

(3) $x^{F}_{i,t}$: The top $\kappa-1$ finished OD pairs that originate from station $i$ at time step $t$; $x^{F}_{i\sim\kappa,t}$ represents the rest of the finished OD flow of station $i$; $x^{F}_{i,t} \triangleq \left\{x^{F}_{i\sim 1,t}, x^{F}_{i\sim 2,t}, \cdots, x^{F}_{i\sim j,t}, \cdots, x^{F}_{i\sim \kappa-1,t}, x^{F}_{i\sim \kappa,t}\right\} \in \mathbb{R}^{\kappa}$.

(4) $x^{F}_{i\sim j,t}$: The finished OD flow traveled from station $i$ to station $j$ at time step $t$.

(5) $\hat{Y}^{A}_{pt,m}$: A vector of the predicted actual (or entered) passenger flow sequence, where $pt$ is the predicted or query time step and $m$ refers to the length of the predicted sequence, where $m > 0$. $A$ indicates that the variable represents the actual (or entered) passenger flow of time steps $\{pt+1, pt+2, \cdots, pt+m\}$; $\hat{Y}^{A}_{pt,m} \triangleq \left\{Y^{A}_{pt+1}, Y^{A}_{pt+2}, \cdots, Y^{A}_{pt+m}\right\}$.

(6) $\vec{Y}^{A}_{pt,-h,-\pi}$: A vector of the historical actual (or entered) passenger flow sequence, where $pt$ is the time step of forecasting, $h$ refers to the length of the sequence, and $\pi$ refers to the blind spot for data updates caused by the information system update mechanism, where $h > 0$ and $\pi < h$. $\vec{Y}^{A}_{pt,-h,-\pi}$ represents the actual passenger flow of time steps $\{pt-h+1, pt-h+2, \cdots, pt-\pi\}$; $\vec{Y}^{A}_{pt,-h,-\pi} \triangleq \left\{Y^{A}_{pt-n+1}, Y^{A}_{pt-n+2}, \cdots, Y^{A}_{pt-\pi}\right\}$.

(7) $Y^{A}_{t}$: The actual (or entered) OD flows of time step $t$ with origin station $i \in [1, N]$; $Y^{A}_{t} \triangleq \left\{y^{A}_{1,t}, y^{A}_{2,t}, \cdots, y^{A}_{i,t}, \cdots, y^{A}_{N,t}\right\} \in \mathbb{R}^{\kappa \times N}$.

(8) $y^{A}_{i,t}$: The top $\kappa-1$ actual (or entered) OD pairs that originate from station $i$ at time step $t$; $y^{A}_{i\sim\kappa,t}$ represents the rest of the actual (or entered) OD flow of station $i$; $y^{A}_{i,t} \triangleq \left\{y^{A}_{i\sim 1,t}, y^{A}_{i\sim 2,t}, \cdots, y^{A}_{i\sim j,t}, \cdots, y^{A}_{i\sim \kappa-1,t}, y^{A}_{i\sim \kappa,t}\right\} \in \mathbb{R}^{\kappa}$.

(9) $y^{A}_{i\sim j,t}$: The actual (or entered) OD flow traveled from station $i$ to station $j$ at time step $t$.

(10) $O_{pt,-h,-\pi}$: The observed input features that can only be obtained in historical data. We define $O_{pt,-h,-\pi} \triangleq \left\{O^{s}_{pt,-h,-\pi}, O^{g}_{pt,-h,-\pi}\right\}$.

(11) $K_{pt,-h,m}$: The known input features that can be obtained in the whole range of time. We define $K_{pt,-h,m} \triangleq \left\{K^{s}_{pt,-h,m}, K^{g}_{pt,-h,m}\right\}$.

Table 2. Time-dependent features.

Observed input features (O), time range $[pt-h+1, pt-\pi]$:
- Finished OD passenger flow in the next time step: $o^{1}_{t}$ (set $O^{s}_{pt,-h,-\pi}$)
- Finished OD passenger flow in the next two time steps: $o^{2}_{t}$ (set $O^{s}_{pt,-h,-\pi}$)
- Max. passenger flow: $o^{3}_{t}$ (set $O^{g}_{pt,-h,-\pi}$)
- Min. passenger flow: $o^{4}_{t}$ (set $O^{g}_{pt,-h,-\pi}$)

Known input features (K), time range $[pt-h+1, pt+m]$:
- Hour of the day: $k^{1}_{t}$ (set $K^{s}_{pt,-h,m}$)
- Day of the week: $k^{2}_{t}$ (set $K^{s}_{pt,-h,m}$)
- Weather (sunny, rainy, and cloudy): $k^{3}_{t}$ (set $K^{s}_{pt,-h,m}$)
- Passenger flow in the latest update time step: $k^{4}_{t}$ (set $K^{g}_{pt,-h,m}$)
- Passenger flow in the two time steps before the latest update time: $k^{5}_{t}$ (set $K^{g}_{pt,-h,m}$)
- Passenger flow in the three time steps before the latest update time: $k^{6}_{t}$ (set $K^{g}_{pt,-h,m}$)
- Passenger flow in the four time steps before the latest update time: $k^{7}_{t}$ (set $K^{g}_{pt,-h,m}$)
- Passenger flow in the same time step of the previous day: $k^{8}_{t}$ (set $K^{g}_{pt,-h,m}$)
- Passenger flow in the same time step last week: $k^{9}_{t}$ (set $K^{g}_{pt,-h,m}$)
- Passenger flow in the same time step two weeks ago: $k^{10}_{t}$ (set $K^{g}_{pt,-h,m}$)

Unfold the features in Table 2 by spatial and temporal dimensions. Each element $k^{j}_{t} \in K^{g}_{pt,-h,m}$, $j \in \{4, 5, \cdots, 10\}$, can be denoted as:

$$o^{j}_{t} = \left\{ o^{j}_{1,t}, o^{j}_{2,t}, \cdots, o^{j}_{i,t}, \cdots, o^{j}_{N,t} \right\},\quad j \in [1,4] \tag{3}$$

$$k^{w}_{t} = \left\{ k^{w}_{1,t}, k^{w}_{2,t}, \cdots, k^{w}_{i,t}, \cdots, k^{w}_{N,t} \right\},\quad w \in [1,10] \tag{4}$$

Thus, each element $k^{w}_{t}$ can map to the station set of the network.
The input of the model is denoted as:

$$I = \left( \vec{X}^{F}_{pt,-h,-\pi}, O_{pt,-h,-\pi}, K_{pt,-h,m} \right) \tag{5}$$

$$I_t = \left\{ I^{1}_{t}, I^{2}_{t}, \cdots, I^{N}_{t} \right\} \tag{6}$$

$$I^{i}_{t} = \left( x^{F}_{i,t}, \left\{ o^{j}_{i,t} \right\}_{j \in [1,4]}, \left\{ k^{w}_{i,t} \right\}_{w \in [1,10]} \right) \tag{7}$$

Specifically, in Table 2, the features in set $K^{s}_{pt,-h,m}$ are dependent on the OD flow but are independent from the structure of networks. Thus, we define the input of the graph model in Equation (8).

$$IG_t = \left\{ IG^{1}_{t}, IG^{2}_{t}, \cdots, IG^{N}_{t} \right\} \tag{8}$$

$$IG^{i}_{t} = \left( x^{F}_{i,t}, \left\{ o^{j}_{i,t} \right\}_{j \in [1,4]}, \left\{ k^{w}_{i,t} \right\}_{w \in [4,10]} \right) \tag{9}$$

3.1.3. Consideration for Spatial Relationships

The spatial relationships studied in this paper refer to the relationships among stations, which can influence the prediction of OD flow. We summarized four classes of spatial relationships with six derived graphs that exist in the urban transit system. A summarization of these relationships is shown in Table 3. In general, each graph takes the form $G = (N, E, W)$; the illustrations (a)–(f) in the original table sketch the corresponding network structures.

Table 3. Multi-layer networks.

(1) Station–line–network relationship, station network (a): $G_{sn} = (N, E_{sn}, W_{sn})$, with weight of edge
$$W_{sn}(i,j) = \frac{Se(i,j)}{\sum_{k=1}^{N} Se(i,k)} \tag{10}$$

(2) Station–line–network relationship, transfer network (b): $G_{tn} = (N, E_{tn}, W_{tn})$, with weight of edge
$$W_{tn}(i,j) = \frac{Tr(i,j)}{\sum_{k=1}^{N} Tr(i,k)} \tag{11}$$

(3) Passenger flow characteristics relationship, time series similarity (c): $G_{ts} = (N, E_{ts}, W_{ts})$, with weight of edge
$$W_{ts}(i,j) = \frac{Ts(i,j)}{\sum_{k \in c} Ts(i,k)} \tag{12}$$

(4) Passenger flow characteristics relationship, peak hour factor similarity (d): $G_{ps} = (N, E_{ps}, W_{ps})$, with weight of edge
$$W_{ps}(i,j) = \frac{Ps(i,j)}{\sum_{k \in c} Ps(i,k)} \tag{13}$$

(5) Line planning relationship, line planning network (e): $G_{lp} = (N, E_{lp}, W_{lp})$, with weight of edge
$$W_{lp}(i,j) = \frac{Fre(l)\,Lp(i,j)}{\sum_{k \in SL(l)} Lp(i,k)\,Fre(l)} \tag{14}$$

(6) Correlation relationship, correlation of flow evolution (f): $G_{cf} = (N, E_{cf}, W_{cf})$, with weight of edge
$$W_{cf}(i,j) = \frac{D(i,j)}{\sum_{k=1}^{N} D(i,k)} \tag{15}$$

1. Station–line–network relationship

This is the basic topology relationship of the rail transit network that determines the connections between each pair of stations. We focused on two networks of this relationship:

a. Station network

The station network $G_{sn} = (N, E_{sn}, W_{sn})$ is directly constructed according to the connections of sections and stations of the studied rail transit network. An edge is formed to connect nodes $i$ and $j$ in $E_{sn}$ if the corresponding stations $i$ and $j$ are connected in the real network.

b. Transfer network

The transfer network $G_{tn} = (N, E_{tn}, W_{tn})$ is constructed from the connections of a station with its nearby transfer stations. An edge is formed to connect nodes $i$ and $j$ in $E_{tn}$ if the corresponding station $i$ and transfer station $j$ are connected by a station path along the station network, and the path cannot contain another transfer station.
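For illustration, the small NumPy sketch below builds row-normalized station-network and transfer-network weights in the spirit of Equations (10) and (11). The five-station toy line and the helper `row_normalize` are assumptions for demonstration, not part of the paper.

```python
import numpy as np

# Minimal sketch of Equations (10)-(11): Se(i, j) and Tr(i, j) are binary connection
# indicators, and each row of the weight matrix is normalized by its row sum.
N = 5
Se = np.zeros((N, N))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:     # physical sections of a toy line
    Se[i, j] = Se[j, i] = 1.0

def row_normalize(conn):
    """W(i, j) = conn(i, j) / sum_k conn(i, k), with empty rows left as zeros."""
    row_sums = conn.sum(axis=1, keepdims=True)
    return np.divide(conn, row_sums, out=np.zeros_like(conn), where=row_sums > 0)

W_sn = row_normalize(Se)                          # station network weights, Eq. (10)

Tr = np.zeros((N, N))
for i in [0, 1, 3, 4]:                            # stations that reach transfer station 2
    Tr[i, 2] = 1.0                                # without passing another transfer station
W_tn = row_normalize(Tr)                          # transfer network weights, Eq. (11)
print(W_sn[1], W_tn[0])
```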
2. Passenger flow characteristics relationship
When two stations are located in different areas but have the same function (e.g., office,
education, business districts), it makes sense that the evolution of the passenger flow will
be similar. We chose two kinds of feature to measure similarities.
a. Time series similarity
The daily passenger flow data along the time axis will form a time series. The similarity
among the time series belonging to different stations can construct the edges and weights
among stations. By these, the time series similarity Gts = ( N, Ets , Wts ) is built by a
similarity measurement with threshold. We illustrate the details about how to construct Gts
in the following part of measurement weights.
b. Peak hour factor similarity
Likewise, the peak hour factor was also chosen as a feature to measure similarity
between any pair of stations. The peak hour factor is calculated by the formula:

$$p_i = \frac{\max\left(\vec{Y}^{A}_{pt,-h_1,-h_2}\right)}{\mathrm{avg}\left(\vec{Y}^{A}_{pt,-h_1,-h_2}\right)} \tag{16}$$

where $h_1$ and $h_2$ are the operational beginning and end time points of a day. The function $\max(\cdot)$ finds the maximum OD flow in the vector $\vec{Y}^{A}_{pt,-h_1,-h_2}$, and the function $\mathrm{avg}(\cdot)$ calculates the average flow of off-peak hours.
By these, the peak hour factor similarity $G_{ps} = \left(N, E_{ps}, W_{ps}\right)$ is built by a similarity measurement based on the second norm. We illustrate the details about how to construct $G_{ps}$ in the following part on measuring weights.
3. Line planning relationship
Although urban rail transit has a natural physical topology, passengers can only move with the trains. Thus, the line planning has a huge impact on passenger OD, especially at a short-term resolution. A line plan is one of the most basic documents for rail transit operations. A line is often taken to be a route in a high-level infrastructure graph, ignoring precise details of platforms, junctions, etc. In addition, a line is a route in the network together with a stopping pattern for the stations along that route, as a line may either stop at or bypass a station on its route [41]. We define a line plan as a set of such routes, each with a series of way stations, a stopping pattern and a frequency, which together must meet certain targets such as providing minimal service at every station.

a. Line planning network

The line planning network $G_{lp} = \left(N, E_{lp}, W_{lp}\right)$ describes the connected relationships formed by the line planning. This correlation has a huge impact on passenger travel. $E_{lp}$ and $W_{lp}$ are determined by the stopping pattern and running frequency.
4. Correlation relationship
For representing potential large OD pairs or potential travel demand hidden in the
urban rail transit, we built a network to represent the correlation relationships.
a. Correlation of flow evolution
OD flow between every two stations is not uniform, and the direction of passenger flow implicitly represents the correlation of two stations. For instance, if (I) the majority of the inflow of station a streams to station b, or (II) the outflow of station a primarily comes from station b, we believe that stations a and b are highly correlated.
According to the above discussions, we defined the graphs by nodes $N$, edges $E$ and weights $W$. In this context, each node $n \in N$ represents a real station. The graphs share the same nodes but have their own edges and weights, that is, $E \triangleq \left\{E_{sn}, E_{tn}, E_{ts}, E_{ps}, E_{lp}, E_{cf}\right\}$ and $W \triangleq \left\{W_{sn}, W_{tn}, W_{ts}, W_{ps}, W_{lp}, W_{cf}\right\}$. Specifically, we denote $W \in \mathbb{R}^{6 \times N \times N}$ as the weights of all edges, and each $W_z \in W$ with $z \in \{sn, tn, ts, ps, lp, cf\}$. An overall design of the graphs is shown in Table 3.
We denote $W_z(i,j)$ as the weight of edge $(i,j)$. In Table 3, the weight-of-edge expressions summarize the calculation methods of the different weights.
For calculating Wsn (i, j) and Wtn (i, j), Se(i, j) represents the connection function of
nodes (i.e., station) i and j. Se(i, j) = 1 if there exists a section between node i and j, else
Se(i, j) = 0, and we set Se(i, i ) = 0. Likewise, Tr (i, j) represents the connection function of
a station with the nearby transfer stations. Tr (i, j) = 1 if there exists a path without another
transfer station between station i and transfer station j, or else Tr (i, j) = 0.
Theoretically, similarity exists between any two stations. However, in large-scale
networks, the number of station pairs is large, and the similarity of most of them is small,
which leads to a complex similarity graph. For this reason, we used a combination of
clustering method and threshold selection to first find the potential groups to which
the stations belong and then excluded the station pairs with small similarity. Thus, for
calculating Ts(i, j) in Wts (i, j), using the time series as inputs, we first obtained the similarity
relationships of different stations based on the clustering method [42] and obtained clusters
of stations, denoted as C. Then, for each c ∈ C, a predefined similarity threshold was set to
control the number of similarities. Based on the finite similarity relationships, we built the
edge set $E_{ts}$ and used Equations (17)–(20) to calculate $Ts(i,j)$ in category $c \in C$.

$$Ts(i,j) = \exp\left(-SBD\left(\underset{t \in Th}{\mathrm{sum}}\left(y^{A}_{i,t}\right), \underset{t \in Th}{\mathrm{sum}}\left(y^{A}_{j,t}\right)\right)\right) \tag{17}$$

$$\mathrm{sum}\left(y^{A}_{i,t}\right) = \sum_{j \in [1,\kappa]} y^{A}_{i \sim j,t} \tag{18}$$

$$SBD\left(\vec{r}_i, \vec{r}_j\right) = 1 - \max_{w}\left(\frac{CC_w\left(\vec{r}_i, \vec{r}_j\right)}{\left\|\vec{r}_i\right\| \cdot \left\|\vec{r}_j\right\|}\right) \tag{19}$$

$$CC_w\left(\vec{r}_i, \vec{r}_j\right) = \begin{cases} \sum_{l=1}^{2m-w} r_{i,l+w-m} \cdot r_{j,l}, & w \geq m \\ CC_{-w}\left(\vec{r}_j, \vec{r}_i\right), & w < m \end{cases} \tag{20}$$

In Equation (17), the function SBD [43] (shape-based distance) is set for measuring the distance between two temporal sequences of equal length. Specifically, the SBD is calculated by Equation (19), where $\vec{r}_i, \vec{r}_j \in \mathbb{R}^{I}$ are the flow time series of OD $i$ and $j$ and $\|\cdot\|$ refers to the second norm operator. $CC_w\left(\vec{r}_i, \vec{r}_j\right)$ represents the cross-correlation between $\vec{r}_i$ and $\vec{r}_j$. In addition, we set $Ts(i,i)$ to 0.
Analogously, we calculated $Ps(i,j)$ for $W_{ps}(i,j)$ by the cluster–threshold–calculation framework. The classical k-means method is used for clustering, and the similarity is measured by the second norm operator, as shown in Equation (21).

$$Ps(i,j) = \left\|P_i - P_j\right\| \tag{21}$$

where $P_i$ and $P_j$ are the peak hour factors of stations $i$ and $j$.


In Equation (14), the weight $W_{lp}(i,j)$ takes the line plans of urban transit into consideration. We set $l \in L$ as a line plan and $SL(l)$ as the stations in the route of $l$. $Lp(i,j) = 1$ represents that there exists an edge between stations $i$ and $j$ in terms of a train running line $l$, or else $Lp(i,j) = 0$, and $Fre(l)$ is the running frequency of the train running line $l$.
In Equation (15), $D(i,j)$ is the total number of passengers that traveled from station $j$ to station $i$ in the whole dataset.
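The sketch below illustrates, under simplifying assumptions, how the shape-based distance of Equations (17)–(20) can be turned into the normalized time-series similarity weights of Equation (12). The synthetic series, the `sbd` helper and the use of `np.correlate` are illustrative choices rather than the authors' implementation.

```python
import numpy as np

# Rough sketch of Equations (17)-(20) and (12): the shape-based distance (SBD) between
# two flow series becomes a similarity Ts(i, j) = exp(-SBD), row-normalized within a cluster.
def sbd(r_i, r_j):
    """Shape-based distance: 1 - max normalized cross-correlation over all shifts."""
    cc = np.correlate(r_i, r_j, mode="full")          # CC_w for every shift w
    denom = np.linalg.norm(r_i) * np.linalg.norm(r_j)
    return 1.0 - cc.max() / denom if denom > 0 else 1.0

rng = np.random.default_rng(1)
cluster_series = [rng.random(24) for _ in range(4)]   # hypothetical per-station series

n = len(cluster_series)
Ts = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:                                    # Ts(i, i) is set to 0
            Ts[i, j] = np.exp(-sbd(cluster_series[i], cluster_series[j]))

row_sums = Ts.sum(axis=1, keepdims=True)
W_ts = np.divide(Ts, row_sums, out=np.zeros_like(Ts), where=row_sums > 0)  # Eq. (12)
print(W_ts.round(3))
```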

3.2. Multi-Layer Graph Neural Networks Model (MGNN) for Structured Forecasting
3.2.1. Structure of the Multi-Layer Graph Neural Networks Model (MGNN)
In previous works [3,16,44], sequenced features and graphic features have both been
proven to be useful for traffic state prediction. One key issue in designing MGNN is how to
fuse the temporal features and the spatial features during the training of the model. In the
context of network-level passenger flow prediction, we specifically call the features that can
be represented by the graphs the spatial features, or otherwise, the temporal features. For
fusing the temporal features and the spatial features, we adopted the designs of the Graph Convolution Gated Recurrent Unit (GC-GRU) and Fully Connected Gated Recurrent Unit (FC-GRU) proposed by [3].

3.2.2. Convolution Operation for MGNN by GC-GRU


A GC-GRU [22] was introduced for spatial–temporal feature learning of MGNN. By
using this GC-GRU, we could effectively learn spatial temporal features from the OD flow
data. The convolution operation was designed as follows:
The parameters of this graph convolution are denoted as $\Theta$. Following the definition in Equation (9), the input of the graph convolution is set as $IG_t = \left\{IG^{1}_{t}, IG^{2}_{t}, \cdots, IG^{N}_{t}\right\}$. By the definition of convolution, the output feature $f\left(IG^{i}_{t}\right) \in \mathbb{R}^{d}$ of $IG^{i}_{t}$ is computed by:

$$
\begin{aligned}
f\left(IG^{i}_{t}\right) = \; & \Theta_{l}\, IG^{i}_{t} + \sum_{j \in \mathcal{N}_{sn}(i)} W_{sn}(i,j) \odot \Theta_{sn}\, IG^{j}_{t} + \sum_{j \in \mathcal{N}_{tn}(i)} W_{tn}(i,j) \odot \Theta_{tn}\, IG^{j}_{t} \\
& + \sum_{j \in \mathcal{N}_{ts}(i)} W_{ts}(i,j) \odot \Theta_{ts}\, IG^{j}_{t} + \sum_{j \in \mathcal{N}_{ps}(i)} W_{ps}(i,j) \odot \Theta_{ps}\, IG^{j}_{t} \\
& + \sum_{j \in \mathcal{N}_{lp}(i)} W_{lp}(i,j) \odot \Theta_{lp}\, IG^{j}_{t} + \sum_{j \in \mathcal{N}_{cf}(i)} W_{cf}(i,j) \odot \Theta_{cf}\, IG^{j}_{t}
\end{aligned} \tag{22}
$$

where $\odot$ is the Hadamard product and $\Theta \triangleq \left\{\Theta_{sn}, \Theta_{tn}, \Theta_{ts}, \Theta_{ps}, \Theta_{lp}, \Theta_{cf}\right\}$. Specifically, $\Theta_{l}\, IG^{i}_{t}$ is the self-loop for all graphs, and $\Theta_{l}$ contains the learnable parameters. $\Theta_{sn}$ denotes the parameters of the station network graph $G_{sn}$, and $\mathcal{N}_{sn}(i)$ represents the neighbor set of node $i$ in $G_{sn}$. The other notations $\Theta_{tn}$, $\Theta_{ts}$, $\Theta_{ps}$, $\Theta_{lp}$, $\Theta_{cf}$, $\mathcal{N}_{tn}(i)$, $\mathcal{N}_{ts}(i)$, $\mathcal{N}_{ps}(i)$, $\mathcal{N}_{lp}(i)$ and $\mathcal{N}_{cf}(i)$ have similar semantic meanings. $d$ is the dimensionality of the feature $f\left(IG^{i}_{t}\right)$.


In this manner, a node can dynamically receive information from highly correlated neighbor nodes. For convenience, we denote the graph convolution in Equation (22) as $IG_t * \Theta$ in the following.
Since the above-mentioned operation is conducted on the spatial dimension, we embedded the graph convolution in a Gated Recurrent Unit (GRU) to learn spatial–temporal features. Specifically, the reset gate $R_t = \left\{R^{1}_{t}, R^{2}_{t}, \cdots, R^{N}_{t}\right\}$, update gate $Z_t = \left\{Z^{1}_{t}, Z^{2}_{t}, \cdots, Z^{N}_{t}\right\}$, new information $N_t = \left\{N^{1}_{t}, N^{2}_{t}, \cdots, N^{N}_{t}\right\}$ and hidden state $H_t = \left\{H^{1}_{t}, H^{2}_{t}, \cdots, H^{N}_{t}\right\}$ are computed by:

$$H_t = \text{GC-GRU}\left(IG_t, H_{t-1}\right): \tag{23}$$

$$R_t = \sigma\left(\Theta_{rx} * IG_t + \Theta_{rh} * H_{t-1} + b_r\right) \tag{24}$$

$$Z_t = \sigma\left(\Theta_{zx} * IG_t + \Theta_{zh} * H_{t-1} + b_z\right) \tag{25}$$

$$N_t = \tanh\left(\Theta_{nx} * IG_t + R_t \odot \left(\Theta_{nh} * H_{t-1} + b_n\right)\right) \tag{26}$$

$$H_t = (1 - Z_t) \odot N_t + Z_t \odot H_{t-1} \tag{27}$$

where $\sigma$ is the sigmoid function, $*$ denotes the graph convolution of Equation (22), and $H_{t-1}$ is the hidden state of the last iteration $t-1$. $\Theta_{rx}$, $\Theta_{rh}$, $\Theta_{zx}$, $\Theta_{zh}$, $\Theta_{nx}$ and $\Theta_{nh}$ denote the graph convolution parameters. $b_r$, $b_z$ and $b_n$ are bias terms. The feature dimensions of $R^{i}_{t}$, $Z^{i}_{t}$, $N^{i}_{t}$ and $H^{i}_{t}$ are set to $d$.
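A compact PyTorch sketch of a GC-GRU cell in the spirit of Equations (22)–(27) is given below. It simplifies the formulation by concatenating the input and the previous hidden state before the graph convolution (rather than keeping separate $\Theta_{rx}$ and $\Theta_{rh}$ terms) and uses dense weighted adjacency matrices; the class and variable names are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GCGRUCell(nn.Module):
    """Sketch of Eqs. (22)-(27): a GRU whose transforms are multi-graph convolutions."""
    def __init__(self, in_dim, hid_dim, num_graphs):
        super().__init__()
        d = in_dim + hid_dim
        # One self-loop transform (Theta_l) plus one transform per graph (Eq. 22),
        # kept separately for the reset, update and candidate computations.
        self.theta = nn.ModuleDict({
            gate: nn.ModuleList([nn.Linear(d, hid_dim) for _ in range(num_graphs + 1)])
            for gate in ("r", "z", "n")
        })

    def graph_conv(self, gate, x, adjs):
        # adjs: list of (N, N) row-normalized weight matrices W_z; x: (N, d).
        out = self.theta[gate][0](x)                      # self-loop term
        for adj, lin in zip(adjs, self.theta[gate][1:]):
            out = out + adj @ lin(x)                      # weighted neighbor aggregation
        return out

    def forward(self, ig_t, h_prev, adjs):
        x = torch.cat([ig_t, h_prev], dim=-1)
        r = torch.sigmoid(self.graph_conv("r", x, adjs))          # Eq. (24)
        z = torch.sigmoid(self.graph_conv("z", x, adjs))          # Eq. (25)
        x_n = torch.cat([ig_t, r * h_prev], dim=-1)
        n = torch.tanh(self.graph_conv("n", x_n, adjs))           # Eq. (26)
        return (1 - z) * n + z * h_prev                           # Eq. (27)

# Toy usage: 6 graphs (station, transfer, time-series, peak-hour, line-plan, correlation).
N, in_dim, hid_dim = 4, 8, 16
adjs = [torch.rand(N, N).softmax(dim=-1) for _ in range(6)]
cell = GCGRUCell(in_dim, hid_dim, num_graphs=6)
h = cell(torch.rand(N, in_dim), torch.zeros(N, hid_dim), adjs)
print(h.shape)  # torch.Size([4, 16])
```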

3.2.3. The Combination of GC-GRU and FC-GRU


The proposed MGNN can conduct convolution on the graphic space, and FC-GRU can learn the inputs from a sequenced view. We combined the GC-GRU and FC-GRU and denoted the combination as GFGRU. Many GFGRUs are organized according to the Encoder–Decoder framework, and we built a Sequence-to-Sequence structure, as shown in Figure 2. Specifically, the inputs of GFGRU are $I_t$ and $\tilde{H}_{t-1}$, where $\tilde{H}_{t-1}$ is the output hidden state of the last iteration.

Figure 2. Illustrations for the Encoder–Decoder component of the GTFNN.

In GFGRU, GC-GRU utilizes the accumulated information in $\tilde{H}_{t-1}$ to update the hidden state, rather than taking the original $H_t$ as input. Thus, Equation (23) becomes Equation (28).

$$H_t = \text{GC-GRU}\left(IG_t, \tilde{H}_{t-1}\right) \tag{28}$$

For FC-GRU, we first transformed $I_t$ and $\tilde{H}_{t-1}$ into an embedded $I^{e}_{t} \in \mathbb{R}^{d}$ and $H^{e}_{t-1} \in \mathbb{R}^{d}$ with two fully connected (FC) layers. Then, we fed $I^{e}_{t}$ and $H^{e}_{t-1}$ into a common GRU [45] implemented with a full connection to generate a global hidden state $H^{f}_{t} \in \mathbb{R}^{d}$, which can be expressed as:

$$I^{e}_{t} = FC\left(I_t\right) \tag{29}$$

$$H^{e}_{t-1} = FC\left(\tilde{H}_{t-1}\right) \tag{30}$$

$$H^{f}_{t} = \text{FC-GRU}\left(I^{e}_{t}, H^{e}_{t-1}\right) \tag{31}$$

Finally, we incorporated $H_t$ and $H^{f}_{t}$ to generate a combined hidden state $\tilde{H}_t = \left\{\tilde{H}^{1}_{t}, \tilde{H}^{2}_{t}, \cdots, \tilde{H}^{N}_{t}\right\}$ with a fully connected layer:

$$\tilde{H}^{i}_{t} = FC\left(H^{i}_{t} \oplus H^{f}_{t}\right) \tag{32}$$

where $\oplus$ denotes an operator of feature concatenation.

3.3. Graph–Temporal Fused Neural Network (GTFNN)

3.3.1. Overview of the Proposed GTFNN

When building the GTFNN framework, we need to address two main issues: firstly, the model has a comprehensive but complex input, and the relationship (e.g., linear or non-linear) between the complex inputs and the forecasting model is difficult to determine; second, time series data naturally have local features (e.g., change-points, etc.) and global features (e.g., series trends and attention at different time positions), and the framework we designed needs to take both types of features into account to ensure better forecast performance. Thus, three important designs are considered in GTFNN:

1. Gating structure GRN

A gating structure GRN can decide if non-linear learning is required. It can skip over any unused components of the architecture, which provides adaptive depth and network complexity to accommodate a wide range of datasets and scenarios.

2. Feature selection layer

While the designed features may be available, their relevance and specific contribution to the output are typically unknown. The GTFNN also uses specialized components for the judicious selection of relevant features and a series of gating layers to suppress unnecessary components, enabling high performance in a wide range of regimes. We adopted the feature selection network proposed by Lim et al. [3] to tune the weights of input features at each time step dynamically. This feature selection network was designed based on the gating mechanism.

3. Sequence-to-sequence layer with attention mechanisms

For learning both local and global temporal relationships from time-varying inputs, a sequence-to-sequence layer is employed for local processing, whereas long-term dependencies are captured using an interpretable multi-head attention block. The GTFNN employs a self-attention mechanism to learn long-term relationships across different time steps [3], which is modified from multi-head attention in transformer-based architectures [29,34] to enhance explainability.

Figure 3 shows the high-level architecture of GTFNN, where individual components are described in detail in the subsequent sections.

Figure 3. The framework of Graph–Temporal Fused Neural Network (GTFNN).

3.3.2. Gating Structure GRN

With the motivation of giving the model the flexibility to apply non-linear processing only where needed, the Gated Residual Network (GRN) was chosen as a building block of GTFNN. The GRN takes in a primary input $I_t$ designed in Equation (6) and obtains:

$$GRN_{\omega}(I_t) = \mathrm{LayerNorm}\left(I_t + GLU_{\omega}(\eta_1)\right) \tag{33}$$

$$\eta_1 = W_{1,\omega}\,\eta_2 + b_{1,\omega} \tag{34}$$

$$\eta_2 = \mathrm{ELU}\left(W_{2,\omega}\,I_t + b_{2,\omega}\right) \tag{35}$$

where ELU is the Exponential Linear Unit activation function [46], $\eta_1 \in \mathbb{R}^{d_{model}}$ and $\eta_2 \in \mathbb{R}^{d_{model}}$ are intermediate layers, $\mathrm{LayerNorm}(\cdot)$ is standard layer normalization [47], and $\omega$ is an index to denote weight sharing. When $W_{2,\omega}\,a + b_{2,\omega} \gg 0$, the ELU activation acts as an identity function, and when $W_{2,\omega}\,a + b_{2,\omega} \ll 0$, the ELU activation generates a constant output, resulting in linear layer behavior. The structure of the GRN is shown in Figure 4.

Figure 4. The structure of GRN.

In Equation (33), Gated Linear Units (GLUs) [48] are selected as the gating components to provide the flexibility to suppress any part of the architecture that can be skipped in the scenario. Denoting $\gamma \in \mathbb{R}^{d_{model}}$ as the input of the GLU, we obtain Equation (36).

$$GLU_{\omega}(\gamma) = \sigma\left(W_{3,\omega}\,\gamma + b_{3,\omega}\right) \odot \left(W_{4,\omega}\,\gamma + b_{4,\omega}\right) \tag{36}$$

where $\sigma(\cdot)$ is the sigmoid function, $W_{(\cdot)} \in \mathbb{R}^{d_{model} \times d_{model}}$ and $b_{(\cdot)} \in \mathbb{R}^{d_{model}}$ are the weights and biases, $\odot$ is the element-wise Hadamard product, and $d_{model}$ is the hidden state size. The GLU allows GTFNN to control the extent to which the GRN contributes to the original input $I_t$, potentially skipping over the layer entirely if necessary, as the GLU outputs could all be close to 0 in order to suppress the nonlinear contribution [3].

3.3.3. Instance-Wise Feature Selection Layer

Instance-wise feature selection is provided by the feature selection networks applied to all input features. Entity embeddings [49] are used for categorical variables as feature representations, and linear transformations for continuous variables, transforming each input variable into a $d_{model}$-dimensional vector. All inputs make use of separate feature selection networks with distinct weights.

Let $\xi^{(j)}_{t} \in \mathbb{R}^{d_{model}}$ denote the transformed input of the $j$th feature at time $t$, with $\Xi_t = \left[\xi^{(1)\,T}_{t}, \ldots, \xi^{(m_{\chi})\,T}_{t}\right]^{T}$ being the flattened vector of all past inputs at time $t$. Feature selection weights are generated by feeding $\Xi_t$ through a GRN, followed by a Softmax layer:

$$v_{\chi_t} = \mathrm{Softmax}\left(GRN_{v_{\chi}}(\Xi_t)\right) \tag{37}$$

where $v_{\chi_t} \in \mathbb{R}^{m_{\chi}}$ is a vector of feature selection weights.

At each time step, an additional layer of non-linear processing is employed by feeding each $\xi^{(j)}_{t}$ through its own GRN:

$$\tilde{\xi}^{(j)}_{t} = GRN_{\tilde{\xi}(j)}\left(\xi^{(j)}_{t}\right) \tag{38}$$

where $\tilde{\xi}^{(j)}_{t}$ is the processed feature vector for variable $j$. We note that each variable has its own $GRN_{\xi(j)}$, with weights shared across all time steps $t$. Processed features are then weighted by their variable selection weights and combined:

$$\tilde{\xi}_t = \sum_{j=1}^{m_{\chi}} v^{(j)}_{\chi_t}\,\tilde{\xi}^{(j)}_{t} \tag{39}$$

where $v^{(j)}_{\chi_t}$ is the $j$th element of vector $v_{\chi_t}$.
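The following PyTorch sketch shows how the GRN/GLU gating of Equations (33)–(36) and the variable selection of Equations (37)–(39) fit together, following the TFT-style components the paper adopts from Lim et al. [3]. The module names, the optional output dimension and all shapes are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLU(nn.Module):                      # Eq. (36)
    def __init__(self, d_model):
        super().__init__()
        self.w3 = nn.Linear(d_model, d_model)
        self.w4 = nn.Linear(d_model, d_model)

    def forward(self, gamma):
        return torch.sigmoid(self.w3(gamma)) * self.w4(gamma)

class GRN(nn.Module):                      # Eqs. (33)-(35)
    def __init__(self, d_model, out_dim=None):
        super().__init__()
        out_dim = out_dim or d_model
        self.w2 = nn.Linear(d_model, d_model)
        self.w1 = nn.Linear(d_model, out_dim)
        self.skip = nn.Linear(d_model, out_dim) if out_dim != d_model else nn.Identity()
        self.glu = GLU(out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):
        eta2 = F.elu(self.w2(x))                          # Eq. (35)
        eta1 = self.w1(eta2)                              # Eq. (34)
        return self.norm(self.skip(x) + self.glu(eta1))   # Eq. (33)

class VariableSelection(nn.Module):        # Eqs. (37)-(39)
    def __init__(self, d_model, num_vars):
        super().__init__()
        self.weight_grn = GRN(num_vars * d_model, out_dim=num_vars)
        self.var_grns = nn.ModuleList([GRN(d_model) for _ in range(num_vars)])

    def forward(self, xi):                 # xi: (batch, num_vars, d_model)
        flat = xi.flatten(start_dim=1)
        v = torch.softmax(self.weight_grn(flat), dim=-1)        # Eq. (37)
        processed = torch.stack([g(xi[:, j]) for j, g in enumerate(self.var_grns)], dim=1)
        return (v.unsqueeze(-1) * processed).sum(dim=1)          # Eqs. (38)-(39)

vs = VariableSelection(d_model=8, num_vars=5)
out = vs(torch.rand(2, 5, 8))
print(out.shape)  # torch.Size([2, 8])
```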

3.3.4. Sequence-to-Sequence Layer with Attention Mechanisms


The local features of the time series that are identified in relation to their surrounding
values, such as anomalies, change-points etc., are significant. Thus, we applied a sequence-to-
sequence layer to build an Encoder–Decoder structure with feeding ξet−h:t into the encoder
and ξet+1:t+mmax into the decoder. This then generates a set of uniform temporal features that
serve as inputs into the decoder itself, denoted by φ(t, n) ∈ {φ(t, −h), . . . , φ(t, mmax )} with
n being a position index. We also employed a gated skip connection over this layer:

φ̃(t, n) = LayerNorm(ξ̃t+n + GLUφ̃(φ(t, n)))   (40)
where n ∈ [−h, mmax ] is the position index.
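Equations (40), (47) and (49) all apply the same gating pattern: a GLU on the layer output, added to a skip connection and layer-normalized. Below is a hedged PyTorch sketch of this gate-add-norm block; the name GateAddNorm and the shapes are our own illustrative choices, not the original code.

import torch
import torch.nn as nn

class GateAddNorm(nn.Module):
    """Gated skip connection of Eqs. (40), (47), (49): LayerNorm(skip + GLU(x))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 2 * d_model)    # produces the two GLU halves
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, skip):
        a, b = self.proj(x).chunk(2, dim=-1)
        return self.norm(skip + a * torch.sigmoid(b))  # GLU(x) = a * sigmoid(b)

For Equation (40), x would be the sequence-to-sequence output φ(t, n) and skip the transformed input ξ̃t+n.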
(1) Temporal self-attention layer
Following the sequence-to-sequence layer and the gate layer, we applied a modified
self-attention layer [3]. All temporal features are grouped into a matrix, i.e., Θ(t) =
[θ(t, −h), . . . , θ(t, mmax)]T, and interpretable multi-head attention [3] is applied at each forecast
time step, with the number of time steps feeding into the attention layer N = mmax + h + 1:

B(t) = InterpretableMultiHead(Θ(t), Θ(t), Θ(t)) = [β(t, −h), . . . , β(t, mmax)]   (41)
Decoder masking [29,34] is applied to the multi-head attention layer to ensure that
each temporal dimension can only attend to features preceding it.

InterpretableMultiHead(Q, K, V) = H̃ WH   (42)

H̃ = Ã(Q, K) V WV = [(1/nH) ∑h=1..nH A(Q WQ(h), K WK(h))] V WV   (43)

= (1/nH) ∑h=1..nH Attention(Q WQ(h), K WK(h), V WV)   (44)

A(Q, K) = Softmax(Q KT / √dattn)   (45)

Attention( Q, K, V ) = A( Q, K )V (46)
where V ∈ RN×dV, K ∈ RN×dattn and Q ∈ RN×dattn are the values, keys and queries of the attention mechanism; dV = dattn = dmodel/nH, and nH is the number of heads. A(·) is a normalization function. WK(h) ∈ Rdmodel×dattn and WQ(h) ∈ Rdmodel×dattn are head-specific weights for keys and queries, WV ∈ Rdmodel×dV are value weights shared across all heads, and WH ∈ Rdattn×dmodel is used for the final linear mapping.
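As a reading aid for Equations (42)–(46), the following PyTorch sketch implements interpretable multi-head attention with head-specific query/key projections, value weights shared across heads, and attention matrices averaged over heads. Decoder masking is omitted, and all names are illustrative assumptions rather than the exact GTFNN code.

import torch
import torch.nn as nn

class InterpretableMultiHead(nn.Module):
    """Interpretable multi-head attention (Eqs. (42)-(46)): shared value weights
    W_V, head-specific W_Q/W_K, head-averaged attention, final mapping W_H."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        d_attn = d_model // n_heads
        self.w_q = nn.ModuleList([nn.Linear(d_model, d_attn) for _ in range(n_heads)])
        self.w_k = nn.ModuleList([nn.Linear(d_model, d_attn) for _ in range(n_heads)])
        self.w_v = nn.Linear(d_model, d_attn)          # W_V shared across heads
        self.w_h = nn.Linear(d_attn, d_model)          # final linear mapping W_H
        self.scale = d_attn ** 0.5

    def forward(self, q, k, v):                        # each: [batch, N, d_model]
        values = self.w_v(v)                           # V W_V
        heads = [torch.softmax(wq(q) @ wk(k).transpose(-2, -1) / self.scale, dim=-1)
                 for wq, wk in zip(self.w_q, self.w_k)]        # Eq. (45)
        a_bar = torch.stack(heads).mean(dim=0)         # head-averaged attention, Eq. (43)
        return self.w_h(a_bar @ values)                # Eqs. (42) and (44)

In practice, a causal mask would be added to the attention scores so that each position can only attend to features preceding it, as stated above.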

The self-attention layer allows GTFNN to pick up long-range dependencies that may
be challenging for RNN-based architectures to learn. Following the self-attention layer, an
additional gating layer is also applied to facilitate training:
δ(t, n) = LayerNorm(θ(t, n) + GLUδ(β(t, n)))   (47)

(2) Position-wise feed-forward layer


We applied additional non-linear processing to the outputs of the self-attention layer.
This layer also makes use of GRNs:

ψ(t, n) = GRNψ (δ(t, n)) (48)

where the weights of GRNψ are shared across the entire layer. As shown in Figure 5, we also
applied a gated residual connection that skips over the entire transformer block, providing
a direct path to the sequence-to-sequence layer, yielding a simpler model if additional
complexity is not required, as shown below:
ψ̃(t, n) = LayerNorm(φ̃(t, n) + GLUψ̃(ψ(t, n)))   (49)

Figure 5. Illustration for feature selection network.

All the notation is summarized in Appendix A.
4. Numerical Experiments
4.1. Experiment Settings
4.1.1. Dataset
We collected a mass of trip transaction records from a real-world metro system and constructed a large-scale dataset, which is termed as CQMetro. The overview of the dataset is summarized in Table 4.
This dataset was built based on the rail transit system of Chongqing, China. The transaction records were collected from 1 January to 15 February 2019, with daily passenger flow of 1.72 million on average. The total number of OD pairs in the whole network is 14,314 pairs. Each record contains the information of entry/exit station and the corresponding timestamps. In this period, 170 stations operated normally, and they were connected by 224 physical edges (i.e., the sections between stations). For each station, we measured design features every 15 min. The data of the first 30 days were used for training, and the last 15 days were used for training and testing, while OD flows of the following day were used for validation. In particular, 1–3 January and 4–10 February 2019 are New Year's Day and Chinese New Year holidays.

Table 4. Dataset of CQMetro.

Dataset Notation
City Chongqing, China
Station 170
Physical Edge 448
Flow volume per day 1.72 M
OD pairs 14,314
Time Step 15 min
Input length 384 (4 days)
Output length 4 (60 min)
Forecasting horizon 2 (30 min)
Training Timespan 2880 (30 days)
Testing Timespan 1440 (15 days)
Validation Timespan 96 (1 day)

4.1.2. Details for Implementing GTFNN


We implemented our GTFNN with the deep learning framework PyTorch. The lengths
of input and output are listed in Table 4. Hyperparameter optimization was conducted
via random search with 60 iterations. The search ranges for all hyperparameters and
the selected optimal hyperparameters can be found in Table 5. We applied Adam [50]
to optimize our GTFNN for 200 epochs by minimizing the joint loss function between the predicted results and the corresponding ground truths. The joint loss function [38] is summed across every forecast time step:
L(Ω, W) = ∑yt∈Ω ∑τ=1..mmax QL(yt, ŷ(t − τ, τ)) / (M·mmax)   (50)

QL(y, ŷ) = |y − ŷ|   (51)


where Ω is the domain of training data containing M samples.
During training, dropout is applied before the gating layer and layer normalization.
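The objective in Equations (50) and (51) is simply an absolute-error term averaged over the M training samples and the mmax forecast steps. A minimal PyTorch version is sketched below; the assumed tensor shape [M, m_max, num_od_pairs] is an illustrative convention rather than the exact one used in our implementation.

import torch

def joint_loss(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """Eqs. (50)-(51): QL(y, y_hat) = |y - y_hat|, summed over samples and
    forecast steps and divided by M*m_max (here also averaged over OD pairs)."""
    return torch.abs(y_true - y_pred).mean()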

Table 5. Hyperparameter optimization.

Random search range:
State size: 10, 20, 40, 80, 160, 240, 320
Dropout rate: 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9
Minibatch size: 64, 128, 256
Learning rate: 0.0001, 0.001, 0.01
Decay ratio: 0.1
Max. gradient norm: 0.01, 1.0, 100.0
Feature dimensionality: 256
Random search iterations: 60

Optimal hyperparameters:
State size: 320
Dropout rate: 0.3
Minibatch size: 128
Learning rate: 0.001
Decay ratio: 0.1
Max. gradient norm: 100.0
Num. heads: 4
Feature dimensionality: 256
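For reference, a hyperparameter search of this kind can be reproduced with a plain random search over the ranges in Table 5. The loop below is only an illustrative sketch; train_and_validate is a hypothetical stand-in for one full training run of GTFNN.

import random

def train_and_validate(config: dict) -> float:
    """Hypothetical stand-in for one GTFNN training run; returns a validation loss."""
    return random.random()  # replace with the real training and validation loop

search_space = {
    "state_size": [10, 20, 40, 80, 160, 240, 320],
    "dropout_rate": [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9],
    "minibatch_size": [64, 128, 256],
    "learning_rate": [0.0001, 0.001, 0.01],
    "max_gradient_norm": [0.01, 1.0, 100.0],
}

best_score, best_config = float("inf"), None
for _ in range(60):                                   # 60 random search iterations
    config = {k: random.choice(v) for k, v in search_space.items()}
    score = train_and_validate(config)
    if score < best_score:
        best_score, best_config = score, config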

4.1.3. Evaluation Metrics


Following previous works [25,51], we evaluated the performance of methods with
Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE)
and Symmetric Mean Absolute Percentage Error (SMAPE), which are defined as:
MSE = (1/n) ∑i=1..n (X̂i − Xi)²   (52)

RMSE = √[(1/n) ∑i=1..n (X̂i − Xi)²]   (53)

MAE = (1/n) ∑i=1..n |X̂i − Xi|   (54)

SMAPE = (1/n) ∑i=1..n 2·|X̂i − Xi| / (|X̂i| + |Xi|)   (55)
where n is the number of testing time steps; for example, if we conduct the forecasting
every 60 min, then n = 4. X̂i and Xi denote the predicted ridership and the ground-truth
ridership, respectively. Note that X̂i and Xi have been transformed back to the original
scale with an inverted z-score normalization. Our GTFNN is developed to predict the
metro ridership of the next two steps. In the following experiments, we measure the errors of each time interval separately. For the 15 min granularity OD prediction, the true value may be 0; thus, SMAPE is used instead of MAPE to describe the relative accuracy of the prediction. Unlike MAPE (Mean Absolute Percentage Error), SMAPE values range from 0 to 200%. For all metrics, values closer to 0 indicate higher prediction accuracy.
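For clarity, the four metrics in Equations (52)–(55) can be computed as in the NumPy sketch below, where x_hat and x are arrays of predicted and ground-truth ridership on the original scale; the small denominator floor is our own safeguard for the case in which both values are exactly 0 and is not part of the definitions above.

import numpy as np

def evaluate(x_hat: np.ndarray, x: np.ndarray) -> dict:
    """MSE, RMSE, MAE and SMAPE as defined in Eqs. (52)-(55)."""
    err = x_hat - x
    mse = float(np.mean(err ** 2))
    mae = float(np.mean(np.abs(err)))
    denom = np.maximum(np.abs(x_hat) + np.abs(x), 1e-8)   # avoids 0/0 (assumption)
    smape = float(np.mean(2.0 * np.abs(err) / denom))
    return {"MSE": mse, "RMSE": mse ** 0.5, "MAE": mae, "SMAPE": 100.0 * smape}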

4.2. Comparison with State-of-the-Art Methods


In this section, we compared our GTFNN with four baseline methods:
Gradient Boosting Decision Trees (GBDT) [52]: GBDT is a weighted ensemble method that consists of a series of weak estimators. We implemented this method with the Python package scikit-learn. The number of boosting stages is set to 100, and the maximum depth of each estimator is 4. A gradient descent optimizer is applied to minimize the loss function.
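Under these settings (100 boosting stages, maximum depth 4), the GBDT baseline can be configured as in the scikit-learn sketch below; the toy data and the idea of fitting one regressor per target series are illustrative simplifications, not necessarily the exact setup used in our experiments.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_train = rng.random((1000, 8))      # e.g., lagged/known features of one OD pair
y_train = rng.random(1000)           # next-step OD flow (toy values)

gbdt = GradientBoostingRegressor(n_estimators=100, max_depth=4)
gbdt.fit(X_train, y_train)
y_pred = gbdt.predict(rng.random((4, 8)))   # predictions for 4 future samples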
Long Short-Term Memory (LSTM) [53]: This network is a simple Seq2Seq model, and
its core module consists of two fully connected LSTM layers. The hidden size of each LSTM
layer is set to 256. Its hyper-parameters are the same as ours.
Gated Recurrent Unit (GRU) [54]: With a similar architecture to the previous model,
this network replaces the original LSTM layers with GRU layers. The hidden size of GRU
is also set to 256. Its hyper-parameters are the same as ours.
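Both recurrent baselines follow the same pattern; a minimal PyTorch sketch with two stacked layers of hidden size 256 is given below (replacing nn.LSTM with nn.GRU yields the GRU variant). The output head and tensor shapes are illustrative assumptions.

import torch
import torch.nn as nn

class RecurrentBaseline(nn.Module):
    """Simple Seq2Seq-style baseline: two stacked LSTM layers (hidden size 256)
    and a linear head emitting the next `horizon` steps for all input series."""
    def __init__(self, n_features: int, horizon: int = 4, hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, horizon * n_features)
        self.horizon, self.n_features = horizon, n_features

    def forward(self, x):                          # x: [batch, seq_len, n_features]
        out, _ = self.rnn(x)
        y = self.head(out[:, -1])                  # decode from the last hidden state
        return y.view(-1, self.horizon, self.n_features)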
Graph-WaveNet [55]: This method develops an adaptive dependency matrix to capture
the hidden spatial dependency and utilizes a stacked dilated 1D convolution component to
handle very long sequences. We implemented this method with its official code.

4.3. Performance in Different Scenarios


In this part, we discuss the prediction results of different methods on weekends, weekdays and holidays and use four metrics, namely MSE, RMSE, MAE and SMAPE, for evaluation. Each metric is averaged over the predicted results of the 14,314 OD pairs, with extreme values excluded. The performances of all methods are summarized in Tables 6–8.

Table 6. The mean performances of models in forecasting weekdays OD flow.

Metrics GTFNN GBDT LSTM GRU Graph-WaveNet


MAE 0.83 1.71 4.57 4.63 2.91
MSE 4.42 7.06 15.49 16.34 10.42
RMSE 2.10 2.66 3.94 4.04 3.23
SMAPE 14.16% 19.41% 35.17% 34.52% 23.25%

Table 7. The mean performances of models in forecasting weekends OD flow.

Metrics GTFNN GBDT LSTM GRU Graph-WaveNet


MAE 0.78 1.58 4.12 4.26 2.13
MSE 4.02 6.53 13.09 13.19 9.63
RMSE 2.00 2.56 3.62 3.63 3.10
SMAPE 13.21% 18.35% 31.48% 31.54% 20.16%

Table 8. The mean performances of models in forecasting holiday OD flow.

Metrics GTFNN GBDT LSTM GRU Graph-WaveNet


MAE 4.41 6.60 19.97 20.30 8.89
MSE 35.08 44.44 85.76 85.11 67.32
RMSE 5.92 6.67 9.26 9.23 8.20
SMAPE 47.96% 60.83% 109.86% 85.44% 60.07%

The predictions for the weekdays are shown in Table 6. To predict the ridership at the
next four consecutive time intervals (60 min), the baseline LSTM obtains a SMAPE score of
35.17% on CQMetro, ranking last among all the methods. Compared to LSTM and GRU, the performance of GBDT and Graph-WaveNet was much better. The prediction ability of the GTFNN model is the best among all the compared models: its SMAPE is 14.16%, and its MAE is
only 0.83, and the average prediction error of each period is lower than 1 person. It can be
seen that GTFNN fully combines the advantages of the graph neural network model and
time series model and has good passenger flow prediction ability.
The predictions for the weekends are shown in Table 7. Similar to working days, the
GTFNN model still predicted better than the other models, with a SMAPE of 13.21% and
MAE, MSE and RMSE metrics of 0.78, 4.02 and 2.00, respectively. In general, the prediction accuracy of the different models for weekend passenger flows is higher than that for weekday passenger flows, with the SMAPE values of the different models decreasing by about 1–4%.
The predictions for the holidays are shown in Table 8. From the results, the prediction
results of different models for holidays are not very satisfactory. The GTFNN model still
predicted better than the other models, with SMAPE of 47.96% and MAE, MSE and RMSE
metrics of 4.41, 35.08 and 5.92, respectively.

4.4. The Rank of Features


Figure 6 shows the variable importance for the CQMetro dataset. From the model description in Section 2, the predicted features in the encoder include the observed input feature and known input feature classes. For the encoder, most known input features are more important than the observed input features, with k1t (hour of the day) being the most important, at more than 20%. The importance of "passenger flow for the first 2–4 periods" (k5t–k7t) exceeds 10%. Among the observed input features, ot3 (max. passenger flow) is the most important. In the decoder, the "passenger flow in the two time steps before the latest update time" (k5t) is much more important than all the other features, at more than 60%. The importance of each of the remaining features is less than 10%.
Figure 6. Variable importance for the CQMetro dataset. (a) Encoder variables importance. (b) Decoder variables importance.

5. Discussion
The results of the overall metrics analysis of Section 4.3 showed the excellent performance of the GTFNN model. To further analyze the forecasting performance in different scenarios, this section is designed from two aspects: the prediction results and characteristics of different ODs on weekends and weekdays are analyzed; furthermore, the comparison between the forecasting results of ordinary days and holidays is discussed to analyze the applicability of the studied model in forecasting OD passenger flow from different sources.

5.1. Comparison of Different ODs
The following four typical ODs were selected for further discussion and analysis.
(1) OD 1: This OD consists of a hub-type station and a station in the CBD. The selected hub-type station is located close to the city's high-speed rail passenger hub, mainly serving long-distance passengers entering and leaving the city, and it is an interchange station between the high-speed rail network and the urban rail transit. The other station is located in the CBD of the city, which is the most prosperous part of the city, with large passenger flow.
(2) OD 2: This OD consists of a station in the residential area and a station in the CBD. The selected station in the residential area is located in the main residential area of the city and mainly serves the commuting needs of passengers in the residential area. The station in the CBD is the same as the station in OD 1.
(3) OD 3: This OD consists of a station in the residential area and a station in the suburban area. The selected station in the residential area is the same as that in OD 2. The suburban-type station is the starting and ending station of the line, and the station is far away from the city hub, where trains need to make a turnaround, and the daily passenger flow is small.
(4) OD 4: This OD consists of a hub-type station and a station in the suburban area. The selected hub-type station and the station in the suburban area are the same as the stations in OD 1 and OD 3, respectively.

Figures 7–10 show the prediction results of the four pairs of different ODs on weekdays or weekends. The blue dashed line represents the prediction result, and the red solid line represents the actual flow. The x-axis represents the time step, and the y-axis represents the passenger flow. The time scale represents the first 15 min of the day; thus, the whole day is divided into 1440/15 = 96 periods (since the passenger flow at night is 0, it is not shown in the operation periods).

Figure 7. The examples of forecasting curves and the ground-truth curves for OD 1: (a) weekdays; (b) weekends.

Figure 8. The examples of forecasting curves and the ground-truth curves for residential areas to CBD OD flow: (a) weekdays; (b) weekends.

Figure 9. The examples of forecasting curves and the ground-truth curves for residential to suburban OD flow: (a) weekdays; (b) weekends.

Figure 10. The examples of forecasting curves and the ground-truth curves for hub to suburban OD flow: (a) weekdays; (b) weekends.

5.1.1. Forecasting of OD 1 in Weekdays and Weekends
The main purposes of the passenger flow between hubs and CBDs are business activities, shopping, consumption, and attending large events. Hubs and CBDs attract frequent economic activities and commercial activities, which lead to strong population mobility. Passenger flow in these areas is extremely large. As shown in Figure 7, the peak passenger flow exceeds 20 people/15 min on weekdays and 40 people/15 min on weekends. Table 9 shows the metrics of the forecasting results. The passenger flow forecast results on weekends are more accurate compared to those on weekdays, where MAE is reduced by 0.20 and SMAPE is reduced by 3.3%.

Table 9. The metrics of forecasting results for OD 1.

Passenger Flow Scenario 1 MAE MSE RMSE SMAPE
weekdays 0.38 0.52 0.72 11.51%
weekends 0.18 0.07 0.27 8.20%

5.1.2. Forecasting of OD 2 in Weekdays and Weekends
Jobs or shopping opportunities offered by enterprises located in the CBD attract people from near and far, which causes large passenger flow in OD 2. As shown in Figure 8, its traffic peak is close to 15 people/15 min on weekdays, while its traffic peak drops to 10 people/15 min on weekends due to the reduction of commuter traffic. In terms of the forecast metrics (Table 10), the forecast accuracy of passenger flow on weekdays is slightly higher than that on weekends. In particular, MAE is reduced by 0.02, and SMAPE is reduced by 2.24%.

Table 10. The metrics of forecasting results for residential areas to CBD OD flow.

Passenger Flow Scenario 2 MAE MSE RMSE SMAPE
weekdays 0.26 0.70 0.84 31.23%
weekends 0.28 0.77 0.88 33.47%

5.1.3. Forecasting of OD 3 in Weekdays and Weekends
Residential–suburban stations are more diverse in terms of the main purposes of passenger traffic. As usual, suburban stations have smaller passenger flows, but the residential station area selected in this case still has some passenger flows due to its physical proximity to the suburban station. As shown in Figure 9, its peak passenger flow is close to 10 passengers/15 min on weekdays, and it drops to 6 passengers/15 min due to the reduced economic activity in the city on weekends. In terms of forecasting metrics

(Table 11), the forecast accuracy of passenger flow on weekends is slightly higher than that
of weekdays. In particular, MAE is reduced by 0.04 and SMAPE is reduced by 2.24%.

Table 11. The metrics of forecasting results for residential to suburban OD flow.

Passenger Flow Scenario 3 MAE MSE RMSE SMAPE


weekdays 0.16 0.18 0.41 81.98%
weekends 0.12 0.10 0.32 76.42%

5.1.4. Forecasting of OD 4 in Weekdays and Weekends


The main purpose of passenger flow between hubs and suburban stations is to com-
mute to urban hubs or to engage in commercial activities. Since suburban stations are less
populated nearby, hubs are extremely attractive to suburban stations located at the edge of
the city. The urban functions of suburban areas are largely dependent on the existence of
urban hubs; therefore, suburban stations would have lower passenger volumes. As shown
in Figure 10, its passenger peak is just over 6 passengers/15 min on weekdays and drops to
less than 6 passengers/15 min due to the reduced urban economic activity on weekends.
In terms of forecasting metrics (Table 12), the accuracy of passenger flow forecasting on
weekdays is slightly higher than that on weekends. In particular, MAE is reduced by 0.02,
and SMAPE is reduced by 5.83%.

Table 12. The metrics of forecasting results for hub to suburban OD flow.

Passenger Flow Scenario 4 MAE MSE RMSE SMAPE


weekdays 0.12 0.09 0.29 106.84%
weekends 0.14 0.13 0.36 112.67%

5.1.5. Analyses of the Model Performance


From the above characteristics, we can analyze the following points.
(1) The predicted and actual values of OD 1 (on both weekdays and weekends) are larger than those of the other three pairs of ODs. Moreover, because of the large amount of passenger flow, the corresponding prediction ability for OD 1 is much better than for the other three pairs of ODs.
(2) The absolute MAE scores of all ODs are small regardless of the SMAPE difference. It
shows that the absolute prediction accuracy of this model is relatively high.
(3) Because the value of passenger flows in a single period is small, small errors can cause large relative errors. The MAE scores of the above four pairs of ODs have little difference, but the larger the passenger flow is, the smaller the SMAPE scores are.
(4) The SMAPE scores of OD 1 are much lower than the other three pairs of ODs.
In general, it could be seen from the above discussion that for OD prediction, the size
of OD passenger flow will affect the evaluation of the relative metrics of the prediction,
especially when the passenger flow is small. OD with small passenger flows also shows a
poor prediction ability when evaluated with SMAPE. This conclusion correlates with the finding obtained in Section 4.4 that "max. passenger flow" is the most important observed input feature.
In terms of absolute metrics, such as MAE and RMSE, the prediction accuracy of the GTFNN model is high both overall and locally, with MAE scores less than 0.4 for all of the OD pairs discussed above. We believe that for the urban rail passenger flow, when
the prediction error is less than 1, the impact on the overall line is relatively small, and it is
within the acceptable range.

5.2. Comparison between Ordinary Days and Holidays


As the prediction results in Section 4.3 show, the prediction performance on holidays is worse than that on ordinary days. From the perspective of

MAE, the MAE score of holidays (4.41) is 3.58 and 3.63 higher than that of weekdays (0.83)
and weekends (0.78), respectively. Likewise, from the perspective of MSE, the score of
holidays (35.08) is 30.66 and 31.06 higher than that of weekdays (4.42) and weekends (4.02)
and from the perspective of SMAPE, the score of holidays (47.96%) is 33.80% and 34.75%
higher than that of weekdays (14.16%) and weekends (13.21%). Although the passenger
flow on holidays is larger than that on ordinary days, it is clear that the passenger flow
characteristics of holidays are not yet well captured by the model. On the one hand, the
model does not consider how to capture the characteristics of holidays (i.e., uncommonly
large passenger flow) when designing. Furthermore, there are only a small amount of
holidays data in the training data, and the passenger flow characteristics vary between
holidays, which poses a great challenge to the model’s prediction during holidays.

6. Conclusions
In this work, we proposed a Graph–Temporal Fused Neural Network (GTFNN) to
address the network-level origin–destination (OD) flows online short-term forecasting
problem. In order to solve the key issue of online flow forecasting, the proposed GTFNN
has made efforts in the four aspects below.
(1) The GTFNN takes finished OD flow and a series of known and observable features as
the input and explores multi-step predictions.
(2) Unlike previous works that either focus on the spatial relationship or the temporal re-
lationship of OD flows evolution, the proposed method is constructed from capturing
both spatial and temporal characteristics.
(3) In order to learn spatial characteristics, a multi-layer graph neural network model is
proposed based on hidden relationships in the rail transit network. Then, we embedded
the graph convolution in a Gated Recurrent Unit to learn spatial–temporal features.
(4) Based on the sequence-to-sequence framework, a Graph–Temporal Fused Deep Learn-
ing model was built. In addition, an attention mechanism was attached to the model
to fuse local and global temporal dependencies to achieve the prediction of short-term
online OD passenger flow.
Experiments based on real-world data collected from Chongqing’s rail transit system
showed that the proposed model performed better than other models. For instance, on
weekdays for passenger flow forecasting scenarios, the SMAPE score of GTFNN was about 14.16%, which is 5% to 20% lower than that of the other methods. In addition, the
MAE score ranged from 0.1 to 0.8, which is suitable for applications. By comparing some
representative ODs, we found that it is more difficult to forecast ODs with small average
passenger flow values. OD forecasting for small passenger flows should be one of the next
research points.
The proposed model can also analyze weights of different features. The weights of
observed input features and known input features were different in the encoder, where the most important known input feature is "hour of the day", and the most important observed input feature is "max. passenger flow".
Obtaining accurate OD passenger flow data on time is vital to support transportation
organization in a rail transit system. The accurate OD prediction results allow operators
to understand the passenger demand between different ODs of the network at a certain
time point in the future, thereby supporting the dynamic and efficient deployment of
transportation organization resources. Well-designed train line planning, timetabling and
station passenger flow planning can be obtained. From the point of view of the trend of a rail
transit passenger flow prediction problem, the traditional station in/out volume prediction
can no longer meet the actual application demand. This kind of prediction only gives the station-level collection and dispersion volume, while the passenger flow distribution on the network remains unknown. Accurate prediction of OD will become a popular topic in the future.
The method that combines both temporal and spatial relationships into the prediction
system studied in this paper will be one of the supports to solve the accurate prediction of
OD. Thus, the OD prediction model studied in this paper has strong practical significance.
Mathematics 2022, 10, 3664 25 of 30

Nevertheless, there still exist some limitations in the proposed model. By comparing
the results of ordinary days and holidays (with uncommonly large passenger flow), we can
find that the method studied in this paper still cannot guarantee the accuracy in different
passenger flow scenarios. In the future, it is also necessary to optimize the model and algorithm for scenarios where sudden large passenger flows occur, so as to meet the needs of on-time forecasting. In addition, the six levels of graph neural networks used in the current
model have the same weights. The effect of these graphs on prediction accuracy has not
been studied. A good understanding of the importance of different graph relations can
help reduce the complexity of the model and improve the training efficiency of the model.

Author Contributions: Data curation, H.Z.; Funding acquisition, H.Z.; Methodology, H.Z., Z.H. and
K.Y.; Supervision, J.C.; Writing—original draft, H.Z., Z.H. and K.Y.; Writing—review and editing, J.C.
and J.Z. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the China Postdoctoral Science Foundation, grant number
2021T140003, and by the China Postdoctoral Science Foundation, grant number 2021M700186.
Data Availability Statement: Not applicable.
Acknowledgments: Thanks to Guofei Gao for his support and help in data processing and for
providing hardware.
Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations

CBD Central Business District


AFC Auto Fare Collection
ARIMA Auto-Regressive Integrated Moving Average
ELU Exponential Linear Unit
FC Fully Connect
FC-GRU Fully Connected Gated Recurrent Unit
FGraph Functional Similarity Graph
GBDT Gradient Boosting Decision Trees
GC-GRU Graph Convolution Gated Recurrent Unit
GCN Graph Convolution Networks
GLU Gated Linear Units
GRN Gated Residual Network
GRU Gated Recurrent Unit
GTFNN Graph–temporal Fused Neural Network
LSTM Long Short-Term Memory
MAE Mean Absolute Error
MAPE Mean Absolute Percentage Error
MGNN Multi-Layer Graph Neural Networks Model
MQRNN Multi-Horizon Quantile Recurrent Forecaster
MSE Mean Square Error
NGraph Neighborhood Graph
OD Origin–Destination
RMSE Root Mean Square Error
SBD Shape-Based Distance
SMAPE Symmetric Mean Absolute Percentage Error
STSGCN Spatial–Temporal Synchronous Graph Convolutional Networks
TGraph Transportation Connectivity Graph
TFT Temporal Fusion Transformer

Appendix A
Table A1 lists the notation used in this paper and the descriptions.
Mathematics 2022, 10, 3664 26 of 30

Table A1. Notation and the descriptions.

Index Notation Description


1 t The time step t ∈ [0, mmax ]
2 mmax The maximum of t
3 OD flow_t^en The actual or entered OD flow of time step t, which can be obtained in historical data but cannot be calculated instantaneously
4 OD flow_t^en The object we wish to forecast and the guidance of system management
5 OD flow_t^fi The finished OD flow of time step t, which can be obtained from historical data and instantaneous observations
6 Ŷ_pt,m^A The m-step-ahead (m is the length of the predicted sequence) forecast result at prediction time pt
7 GTFNN (·) The proposed forecasting model
→ →  The cross-correlation between → →
r i and r j
8 CCw r i , r j
Graph convolution
9 m The length of the predicted sequence
10 pt The prediction time
11 h The length of the input sequence
12 π The blind spot for data updates caused by the information system update mechanism
13 N The number of stations
→F
→F The finished passenger flow of time steps { pt − h + 1, pt − h + 2, · · · pt − π }, X pt,−h,−π ,
14 X pt,−h,−π n o
F F F
X pt −n+1 , X pt−n+2 , · · · X pt−π
F
15 XtF
The
n finished OD flows of time o step t with the origin station i ∈ [1, N ], Xt ,
F , xF , · · · , xF , · · · , xF κ×N
x1,t 2,t i,t N,t ∈ R
F ,
The top κ − 1 finished OD pairs that origin from station i at time step t, xi,t
16 F
xi,t n o
xiF∼1,t , xiF∼2,t , · · · , xiF∼ j,t , · · · , xiF∼κ −1,t , xiF∼κ,t ∈ Rκ
17 xiIC
∼κ,t The rest of the finished OD flow of station i
18 xiF∼ j,t The finished OD flow traveled from station i to station j at time step t
19 xiC∼κ,t The rest of the actual (or entered) OD flow of station i
20 A n + 1, pt + 2, · · · pt + m}o,
The variable represents the actual (or entered) passenger flow of time steps { pt
21 A
Ŷpt,m A , YA , YA , · · · YA
The vector of predicted actual (or entered) passenger flow sequence, Ŷpt,m pt+1 pt+2 pt+m

→A
The actual passenger flow of time steps { pt − h + 1, pt − h + 2, · · · pt − π }, where h > 0 and π <
22 Y pt,−h,−π →A n o
A
h, Y pt,−h,−π , Ypt , Y A , · · · Y A
−n+1 pt−n+2 pt−π
A
23 YtA n actual (or entered) ODoflows of time step t with the origin station i ∈ [1, N ], Yt ,
The
A A A A
y1,t , y2,t , · · · , yi,t , · · · , y N,t ∈ R κ × N

A ,
The top κ − 1 actual (or entered) OD pairs that origin from station i at time step t, yi,t
24 A
yi,t n o
yiA∼1,t , yiA∼2,t , · · · , yiA∼ j,t , · · · , yiA∼κ −1,t , yiA∼κ,t ∈ Rκ
25 yiA∼ j,t The actual (or entered) OD flow traveled from station i to station j at time step t
The observed input features that can only be obtained in historical data, O pt,−h,−π ,
26 O pt,−h,−π n
g
o
Ospt,−h,−π , O pt,−h,−π
n o
g
27 K pt,−h,m The known input features that can be obtained in the whole range of time, K pt,−h,m , K spt,−h,m , K pt,−h,m
28 Ospt,−h,−π The set of finished OD passenger flow
g
29 O pt,−h,−π The set of horizontal passenger flow
30 K spt,−h,m The sequenced known input features
g
31 K pt,−h,m The graphic known input features
32 O The observed input features
33 K The known input features
34 ot1 Finished OD passenger flow in next time step
35 ot2 Finished OD passenger flow in next two time steps
36 ot3 Max. passenger flow
37 ot4 Min. passenger flow
Mathematics 2022, 10, 3664 27 of 30

Table A1. Cont.

Index Notation Description


38 k1t Hour of the day
39 k2t Day of the week
40 k3t The weather (Sunny, rainy, and cloudy)
41 k4t Passenger flow in the latest update time step
42 k5t Passenger flow in the two time steps before the latest update time
43 k6t Passenger flow in the three time steps before the latest update time
44 k7t Passenger flow in the four time steps before the latest update time
45 k8t Passenger flow in the same time step of the previous day
46 k9t Passenger flow in the same time step last week
47 k10
t Passenger flow for the same time step two weeks ago
j j
48 ot The vector of o1,t j ∈ [1, 4] collected by stations
w The vector of kw
49 kt i,t w ∈ [1, 10] collected by stations
50 I The input of the model
51 It The part of input at time step t
52 Iti The input of station i at time step t
53 IGt The input of graph convolution at time step t
54 IGti The input of graph convolution of station i at time step t
55 Gsn The graph of station network
56 Gtn The graph of transfer network
57 Gts The graph of time series similarity
58 G ps The graph of peak hour factor similarity
59 Gl p The graph of line planning network
61 Gc f The graph of correlation of flow evolution
62 pi The peak hour factor
63 Wz The weight of graphs z ∈ {sn, tn, ts, ps, l p, c f }
64 Se(i, j) The connection function of node (i.e., station) i and j
65 Ts(i, j) The time series similarity between station i and j
The connection function of a station with the nearby transfer stations, Tr (i, j) = 1
66 Tr (i, j)
if there exists a path without other transfer station between station i and transfer station j, or else Tr (i, j) = 0

67 ri The passenger flow time series of OD i

68 rj The passenger flow time series of OD j
70 Ps(i, j) The peak hour factor similarity between station i and j
71 Pi The peak hour factor of station i
72 Pj The peak hour factor of station j
73 Fre(l ) The running frequency of the train running line
74 D (i, j) The total number of passengers that traveled from station j to station i in the whole dataset
75 Θsn
76 Θtn
77 Θts
The parameters of the corresponding networks
78 Θ ps
79 Θl p
80 Θc f
81 Nsn (i )
82 Ntn (i )
83 Nts (i )
The neighbor set of node i of the corresponding networks
84 N ps (i )
85 Nl p (i )
86 Nc f (i )
87 Rt The reset gate
88 Zt The update gate
89 Nt The new information
90 Ht The hidden state of GC-GRU
91 σ The sigmoid function
92 Ht−1 The hidden state at last t − 1 iteration of GC-GRU
g
93 Ht−1 The hidden state of the multi-layer graphic structures at the t − 1 iteration of GC-GRU
Mathematics 2022, 10, 3664 28 of 30

Table A1. Cont.

Index Notation Description


94 Θrx
95 Θzx
96 Θzh The graph convolution parameters of the corresponding networks
97 Θnx
98 Θnh
99 br ,
100 bz The bias terms of Rt , Zt , and Nt
101 bn
102 Rit The ith element of Rt , where i is the index of station
103 Zti The th element of Zt , where i is the index of station
104 Nti The ith element of Nt , where i is the index of station
105 Hti The ith element of Ht , where i is the index of station
106 H
et The combined hidden state at time step t
f
107 Ht The hidden state generated by FC-GRU at time step t
108 ⊕ The operator of feature concatenation
109 ELU The exponential linear unit activation function
110 η1 The intermediate layers of ELU
111 η2 The intermediate layers of ELU
112 ω The index to denote weight sharing
113 γ The input of the GLU
114 W1,ω
The weights of ELU
115 W2,ω
116 b4,ω The biases of ELU
117 ( j) The transformed input of the jth feature at time t
ξt
118 ( j) The processed feature vector for variable j
ξet
119 Ξt The flattened vector of all past inputs at time t
120 v χt The feature selection weights
121 ( j) The jth element of vector vχt
v χt
122 φe The gated skip connection
123 φ The set of uniform temporal features which serve as inputs into the decoder itself
124 Θ The temporal features matrix
125 V The value vector of the attention mechanism
126 K The key vector of the attention mechanism
127 Q The query vector of the attention mechanism
128 A(·) The normalization function
129 δ The gating layer
130 ψ The non-linear processing by GRNs
131 ψe The gated residual connection
132 Ω The domain of training data containing M samples
133 M The number of samples

References
1. Wei, Y.; Chen, M.C. Forecasting the short-term metro passenger flow with empirical mode decomposition and neural networks.
Transp. Res. Part C Emerg. Technol. 2012, 21, 148–162. [CrossRef]
2. Bai, L.; Yao, L.; Kanhere, S.S.; Wang, X.; Sheng, Q.Z. Stg2seq: Spatial-temporal graph to sequence model for multi-step passenger
demand forecasting. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19},
Macao, China, 10–16 August 2019; pp. 1981–1987.
3. Lim, B.; Ark, S.; Loeff, N.; Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J.
Forecast. 2021, 37, 1748–1764. [CrossRef]
4. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013,
arXiv:1312.6203.
5. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In
Advances in Neural Information Processing Systems; ACS: Washington, DC, USA, 2016.
6. Atwood, J.; Towsley, D. Diffusion-convolutional neural networks. Comput. Sci. 2015, 29. Available online: https://proceedings.
neurips.cc/paper/2016/hash/390e982518a50e280d8e2b535462ec1f-Abstract.html (accessed on 1 September 2022).
7. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2017, arXiv:1710.10903.
Mathematics 2022, 10, 3664 29 of 30

8. Vlahogianni, E.I.; Golias, J.C.; Karlaftis, M.G.; Banister, D.; Givoni, M. Short-term traffic forecasting: Overview of objectives and
methods. Transp. Rev. 2003, 24, 533–557. [CrossRef]
9. Williams, B.; Durvasula, P.; Brown, D. Urban freeway traffic flow prediction: Application of seasonal autoregressive integrated
moving average and exponential smoothing models. Transp. Res. Rec. 1998, 1644, 132–141. [CrossRef]
10. Lee, S.; Fambro, D.; Lee, S.; Fambro, D. Application of subset autoregressive integrated moving average model for short-term
freeway traffic volume forecasting. Transp. Res. Rec. J. Transp. Res. Board 1999, 1678, 179–188. [CrossRef]
11. Huang, W.; Song, G.; Hong, H.; Xie, K. Deep architecture for traffic flow prediction: Deep belief networks with multitask learning.
IEEE Trans. Intell. Transp. Syst. 2014, 15, 2191–2201. [CrossRef]
12. Ni, M.; He, Q.; Gao, J. Forecasting the subway passenger flow under event occurrences with social media. IEEE Trans. Intell.
Transp. Syst. 2016, 18, 1623–1632. [CrossRef]
13. Sun, Y.; Leng, B.; Guan, W. A novel wavelet-svm short-time passenger flow prediction in beijing subway system. Neurocomputing
2015, 166, 109–121. [CrossRef]
14. Li, Y.; Wang, X.; Sun, S.; Ma, X.; Lu, G. Forecasting short-term subway passenger flow under special events scenarios using
multiscale radial basis function networks—Sciencedirect. Transp. Res. Part C Emerg. Technol. 2017, 77, 306–328. [CrossRef]
15. Sun, Y.; Zhang, G.; Yin, H. Passenger flow prediction of subway transfer stations based on nonparametric regression model.
Discret. Dyn. Nat. Soc. 2014, 2014, 397154. [CrossRef]
16. Zhou, X.; Mahmassani, H.S. A structural state space model for real-time traffic origin-destination demand estimation and
prediction in a day-to-day learning framework. Transp. Res. Part B Methodol. 2007, 41, 823–840. [CrossRef]
17. Hazelton, M.L. Inference for origin–destination matrices: Estimation, prediction and reconstruction. Transp. Res. Part B 2008, 35,
667–676. [CrossRef]
18. Djukic, T. Dynamic od Demand Estimation and Prediction for Dynamic Traffic Management; Delft University of Technology: Delft, The
Netherlands, 2014.
19. Liu, L.; Qiu, Z.; Li, G.; Wang, Q.; Ouyang, W.; Lin, L. Contextualized spatial-temporal network for taxi origin-destination demand
prediction. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3875–3887. [CrossRef]
20. Shi, H.; Yao, Q.; Guo, Q.; Li, Y.; Liu, Y. Predicting Origin-Destination Flow via Multi-Perspective Graph Convolutional Network.
In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020.
21. Gong, Y.; Li, Z.; Zhang, J.; Liu, W.; Zheng, Y. Online spatio-temporal crowd flow distribution prediction for complex metro system.
IEEE Trans. Knowl. Data Eng. 2020, 34, 865–880. [CrossRef]
22. Liu, L.; Chen, J.; Wu, H.; Zhen, J.; Li, G.; Lin, L. Physical-virtual collaboration modeling for intra-and inter-station metro ridership
prediction. IEEE Trans. Intell. Transp. Syst. 2020, 23, 3377–3391. [CrossRef]
23. Yao, H.; Wu, F.; Ke, J.; Tang, X.; Jia, Y.; Lu, S.; Gong, P.; Ye, J.; Li, Z. Deep multi-view spatial-temporal network for taxi demand
prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LO, USA, 2–7 February 2018.
24. Dong, W.; Wei, C.; Jian, L.; Ye, J. Deepsd: Supply-demand prediction for online car-hailing services using deep neural networks. In
Proceedings of the 2017 IEEE 33rd International Conference on Data Engineering (ICDE), San Diego, CA, USA, 19–22 April 2017.
25. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv 2017,
arXiv:01926.
26. Song, C.; Lin, Y.; Guo, S.; Wan, H. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-
temporal network data forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA,
7–12 February 2020; pp. 914–921.
27. Han, Y.; Wang, S.; Ren, Y.; Wang, C.; Gao, P.; Chen, G. Predicting station-level short-term passenger flow in a citywide metro
network using spatiotemporal graph convolutional neural networks. Int. J. Geo Inf. 2019, 8, 243. [CrossRef]
28. Geng, X.; Li, Y.; Wang, L.; Zhang, L.; Yang, Q.; Ye, J.; Liu, Y. In Spatiotemporal multi-graph convolution network for ride-hailing
demand forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February
2019; pp. 3656–3663.
29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need.
arXiv 2017, arXiv:1706.03762.
30. Fei, W.; Jiang, M.; Chen, Q.; Yang, S.; Tang, X. Residual attention network for image classification. In Proceedings of the 2017 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
31. Arik, S.O.; Pfister, T. Tabnet: Attentive interpretable tabular learning. In Proceedings of the 33rd AAAI Conference on Artificial
Intelligence, Honolulu, HI, USA, 27 January–1 February 2019.
32. Alaa, A.M.; Schaar, M. Attentive state-space modeling of disease progression. In Advances in Neural Information Processing Systems;
ACS: Washington, DC, USA, 2019.
33. Choi, E.; Bahadori, M.T.; Schuetz, A.; Stewart, W.F.; Sun, J. Retain: Interpretable Predictive Model in Healthcare Using Reverse Time
Attention Mechanism; Curran Associates Inc.: Red Hook, NY, USA, 2016.
34. Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.X.; Yan, X. Enhancing the locality and breaking the memory bottleneck of
transformer on time series forecasting. In Advances in Neural Information Processing Systems; ACS: Washington, DC, USA, 2019.
35. Song, H.; Rajan, D.; Thiagarajan, J.J.; Spanias, A. Attend and diagnose: Clinical time series analysis using attention models. In Pro-
ceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, LO, USA, 2–7 February 2018.
Mathematics 2022, 10, 3664 30 of 30

36. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V.; Hyndman, R.J. The m4 competition: 100,000 time series and 61 forecasting
methods. Int. J. Forecast. 2020, 36, 54–74. [CrossRef]
37. Rangapuram, S.S.; Seeger, M.W.; Gasthaus, J.; Stella, L.; Wang, Y.; Januschowski, T. In Deep state space models for time series
forecasting. In Advances in Neural Information Processing Systems; ACS: Washington, DC, USA, 2018.
38. Wen, R.; Torkkola, K.; Narayanaswamy, B. A multi-horizon quantile recurrent forecaster. arXiv 2017, arXiv:1711.11053.
39. Fan, C.; Zhang, Y.; Pan, Y.; Li, X.; Zhang, C.; Yuan, R.; Wu, D.; Wang, W.; Pei, J.; Huang, H. Multi-horizon time series forecasting
with temporal attention learning. In Proceedings of the 25th ACM SIGKDD International Conference, Anchorage, AL, USA,
3–7 August 2019.
40. Guo, T.; Lin, T.; Antulov-Fantulin, N. Exploring interpretable lstm neural networks over multi-variable data. In Proceedings of
the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019.
41. Burggraeve, S.; Bull, S.H.; Vansteenwegen, P.; Lusby, R.M. Integrating robust timetabling in line plan optimization for railway
systems. Transp. Res. Part C Emerg. Technol. 2017, 77, 134–160. [CrossRef]
42. Zheng, H.; Cui, Z.; Zhang, X. Automatic discovery of railway train driving modes using unsupervised deep learning. ISPRS Int.
J. Geo Inf. 2019, 8, 294. [CrossRef]
43. Paparrizos, J.; Gravano, L. K-shape: Efficient and accurate clustering of time series. ACM SIGMOD Rec. 2015, 45, 69–76. [CrossRef]
44. Fang, S.; Zhang, Q.; Meng, G.; Xiang, S.; Pan, C. Gstnet: Global spatial-temporal network for traffic flow prediction. In Proceedings
of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp.
2286–2293.
45. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv
2014, arXiv:1412.3555.
46. Clevert, D.-A.; Unterthiner, T.; Hochreiter, S. In Fast and accurate deep network learning by exponential linear units (elus). In
Proceedings of the ICLR, San Juan, Puerto Rico, 2–4 May 2016.
47. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
48. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the 34th
International Conference on Machine Learning, Sydney, Australia, 6–11 August 2016.
49. Gal, Y.; Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of the
30th International Conference on Neural Information Processing Systems; Curran Associates Inc.: Barcelona, Spain, 2016; pp.
1027–1035.
50. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
51. Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; Li, H. T-gcn: A temporal graph convolutional network for traffic
prediction. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3848–3858. [CrossRef]
52. Friedman, J. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [CrossRef]
53. Gers, F.A.; Schmidhuber, E. Lstm recurrent networks learn simple context-free and context-sensitive languages. IEEE Trans.
Neural Netw. 2001, 12, 1333–1340. [CrossRef]
54. Jozefowicz, R.; Zaremba, W.; Sutskever, I. An empirical exploration of recurrent network architectures. In Proceedings of the
International Conference on Machine Learning, Lille, France, 6–11 July 2015.
55. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph wavenet for deep spatial-temporal graph modeling. In Proceedings of the
Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}, Macao, China, 10–16 August 2019.
