Article
Short-Term Online Forecasting for Passenger Origin–Destination
(OD) Flows of Urban Rail Transit: A Graph–Temporal Fused
Deep Learning Method
Han Zheng, Junhua Chen *, Zhaocha Huang, Kuan Yang and Jianhao Zhu
School of Traffic and Transportation, Beijing Jiaotong University, No. 3 Shang Yuan Cun, Hai Dian District,
Beijing 100044, China
* Correspondence: cjh@bjtu.edu.cn
Abstract: Predicting short-term passenger flow accurately is of great significance for daily manage-
ment and for a timely emergency response of rail transit networks. In this paper, we propose an
attention-based Graph–Temporal Fused Neural Network (GTFNN) that can make online predictions
of origin–destination (OD) flows in a large-scale urban transit network. In order to solve the key
issue of the passenger hysteresis in online flow forecasting, the proposed GTFNN takes finished OD
flow and a series of features, which are known or observable, as the input and performs multi-step
prediction. The model is constructed from capturing both spatial and temporal characteristics. For
learning spatial characteristics, a multi-layer graph neural network is proposed based on hidden
relationships in the rail transit network. Then, we embedded the graph convolution into a Gated
Recurrent Unit to learn spatial–temporal features. For learning temporal characteristics, a sequence-
to-sequence structure embedded with the attention mechanism is proposed to enhance its ability to
capture both local and global dependencies. Experiments based on real-world data collected from Chongqing's rail transit system show that the metrics of GTFNN are better than those of other methods; e.g., its SMAPE (Symmetric Mean Absolute Percentage Error) score is about 14.16%, an improvement of 5% to 20% over the compared methods.

Keywords: urban rail system; short-term prediction; passenger origin–destination (OD); graph neural networks; graph–temporal fused; sequence to sequence

MSC: 68T07

Citation: Zheng, H.; Chen, J.; Huang, Z.; Yang, K.; Zhu, J. Short-Term Online Forecasting for Passenger Origin–Destination (OD) Flows of Urban Rail Transit: A Graph–Temporal Fused Deep Learning Method. Mathematics 2022, 10, 3664. https://doi.org/10.3390/math10193664
only observed historically—without any prior information on how they interact with the
target [3].
In addition, OD prediction has its own characteristics. In a large-scale rail transit network, online forecasting can only obtain data on finished OD passenger flows, because passenger trips cannot finish immediately. The unknown passenger flow includes the current unfinished passenger flow as well as the future flow that will continuously enter the network. Meanwhile, the AFC (Auto Fare Collection) system has a delay time for updating OD flow information. Thus, the forecasting task of OD flow naturally takes the time-dependent finished OD flow as input and the actual OD flow as output.
To solve the above issues, we need a method that simultaneously considers spatial–
temporal properties of OD passenger flow. The development of deep learning provides
the possibility to find a reasonable solution. The temporal fusion transformer method [3]
is explored, which can combine high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. Furthermore, Graph Convolution Networks (GCN), which can automatically learn representations on non-Euclidean data (e.g., graphs), have been proposed and utilized in many scenarios [4–7].
Inspired by the above, a Graph–Temporal Fused Neural Network (GTFNN) is pro-
posed in this paper. The most important feature of this model is the introduction of a graph
neural network component in the attention-based sequence-to-sequence structure to fuse
the hidden spatial relationships in a rail transit network into temporal relationships. This
realizes the prediction of time-series OD passenger flow, which is highly dependent on the
network structure.
2. Literature Review
We performed a literature review from three aspects: passenger flow prediction, graph
convolution neural networks, and time series forecasting with attention-based deep neural
networks. All three aspects are relevant to the study of this paper.
ridership data network based on this and fully explored the inter-station flow similarity
and OD correlation for virtual graph construction.
3. Methodology
3.1. Short-Term Online Framework Considering Hysteresis of Passenger OD Flow
The short-term online framework for forecasting passenger OD flow should consider the hysteresis of passenger OD flow, as well as the hidden temporal and spatial relationships.
This issue has attracted the attention of existing research. Gong et al. [21] used indication matrices to mask and neglect the potential unfinished trips in urban rail networks.
In order to handle this issue, this paper proposes a framework that maps the time-dependent finished trips to the actual entered trips based on Equation (1):

ODflow_t^en = ODflow_t^fi + ODflow_t^un (1)

where ODflow_t^en represents the actual (or entered) OD flow of time step t, which can be obtained in historical data but cannot be calculated instantaneously. ODflow_t^en is the object we wish to forecast and the guidance for system management. ODflow_t^fi represents the finished OD flow of time step t, which can be obtained from historical data and instantaneous observations. The gap between the two cumulative values of ODflow_t^en and ODflow_t^fi in the temporal dimension is caused by unfinished journeys. We denote this gap as ODflow_t^un, the unfinished OD flow of time step t.
Based on this relationship, it is clear that one of the core aspects of the forecasting
process is how to establish a mapping between the observable finished passenger flow and
the actual passenger flow.
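As an illustrative sketch (not part of the original model), the accounting identity in Equation (1) can be demonstrated with hypothetical counts for a single OD pair: the unfinished flow at each time step is simply the gap between the entered and finished series.

```python
# Hypothetical cumulative OD counts for one OD pair over four time steps.
# entered[t]: trips that entered the network up to step t (known only in hindsight)
# finished[t]: trips already completed by step t (observable online)
entered = [120, 260, 410, 530]
finished = [95, 210, 350, 500]

# Equation (1): ODflow_en = ODflow_fi + ODflow_un, so the unfinished
# flow is the gap between the two cumulative series at each step.
unfinished = [e - f for e, f in zip(entered, finished)]
print(unfinished)  # [25, 50, 60, 30]
```

The forecasting model must learn this mapping from the observable `finished` series to the `entered` series, since `unfinished` is never directly observed online.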
Figure 1. Illustration of the forecasting task in the temporal perspective.
Theoretically, the OD pairs among stations should be fully connected, i.e., if the number of stations is N, the number of OD pairs is N·(N − 1). However, the OD matrix is relatively sparse. Thus, we only consider the OD flow from station i to the top-κ stations that its passengers are most likely to reach, as well as the total OD flow to the remaining stations. The details of the mentioned notations are shown in Table 1.

Table 1. Key notations in forecasting task.

Index 1. Notation: X⃗^F_{pt,−h,−π} ≜ {X^F_{pt−n+1}, X^F_{pt−n+2}, ··· X^F_{pt−π}}. Description: A vector of input finished passenger flow sequence, where pt is the time step of forecasting, h refers to the length of the input sequence, and π refers to the blind spot for data updates caused by the information system update mechanism, where h > 0 and π < h. X⃗^F_{pt,−h,−π} represents the finished passenger flow of time steps {pt − h + 1, pt − h + 2, ··· pt − π}.

Index 2. Notation: X^F_t ≜ {x^F_{1,t}, x^F_{2,t}, ··· , x^F_{i,t}, ··· , x^F_{N,t}} ∈ R^{κ×N}. Description: The finished OD flows of time step t with the origin station i ∈ [1, N].

Index 3. Notation: x^F_{i,t} ≜ {x^F_{i∼1,t}, x^F_{i∼2,t}, ··· , x^F_{i∼j,t}, ··· , x^F_{i∼κ−1,t}, x^F_{i∼κ,t}} ∈ R^κ. Description: The top κ − 1 finished OD pairs that originate from station i at time step t. x^F_{i∼κ,t} represents the rest of the finished OD flow of station i.

Index 4. Notation: x^F_{i∼j,t}. Description: The finished OD flow traveled from station i to station j at time step t.

Index 5. Notation: Ŷ_{pt,m} ≜ {Y^A_{pt+1}, Y^A_{pt+2}, ··· Y^A_{pt+m}}. Description: A vector of predicted actual (or entered) passenger flow sequence, where pt is the predicted or query time step and m refers to the length of the predicted sequence, where m > 0. A indicates that the variable represents the actual (or entered) passenger flow of time steps {pt + 1, pt + 2, ··· pt + m}.

Index 6. Notation: Y⃗^A_{pt,−h,−π} ≜ {Y^A_{pt−n+1}, Y^A_{pt−n+2}, ··· Y^A_{pt−π}}. Description: A vector of historical actual (or entered) passenger flow sequence, where pt is the time step of forecasting, h refers to the length of the sequence, and π refers to the blind spot for data updates caused by the information system update mechanism, where h > 0 and π < h. Y⃗^A_{pt,−h,−π} represents the actual passenger flow of time steps {pt − h + 1, pt − h + 2, ··· pt − π}.

Index 7. Notation: Y^A_t ≜ {y^A_{1,t}, y^A_{2,t}, ··· , y^A_{i,t}, ··· , y^A_{N,t}} ∈ R^{κ×N}. Description: The actual (or entered) OD flows of time step t with the origin station i ∈ [1, N].

Index 8. Notation: y^A_{i,t} ≜ {y^A_{i∼1,t}, y^A_{i∼2,t}, ··· , y^A_{i∼j,t}, ··· , y^A_{i∼κ−1,t}, y^A_{i∼κ,t}} ∈ R^κ. Description: The top κ − 1 actual (or entered) OD pairs that originate from station i at time step t. y^A_{i∼κ,t} represents the rest of the actual (or entered) OD flow of station i.

Index 9. Notation: y^A_{i∼j,t}. Description: The actual (or entered) OD flow traveled from station i to station j at time step t.
Table 1. Cont.

Index 11. Notation: K_{pt,−h,m} ≜ {K^s_{pt,−h,m}, K^g_{pt,−h,m}}. Description: The known input features that can be obtained in the whole range of time.

Unfold the features in Table 2 by spatial and temporal dimensions. Each element k^j_t ∈ K^g_{pt,−h,m}, j ∈ {4, 5, ··· 10}, can be denoted as:

o^j_t = {o^j_{1,t}, o^j_{2,t}, ··· , o^j_{i,t}, ··· , o^j_{N,t}}, j ∈ [1, 4] (3)

k^w_t = {k^w_{1,t}, k^w_{2,t}, ··· , k^w_{i,t}, ··· , k^w_{N,t}}, w ∈ [1, 10] (4)

Thus, each element k^w_t can map to the station set of the network.

The input of the model is denoted as:

I = {X⃗^F_{pt,−h,−π}, O_{pt,−h,−π}, K_{pt,−h,m}} (5)

I_t = {I^1_t, I^2_t, ··· , I^N_t} (6)

I^i_t = {x^F_{i,t}, {o^j_{i,t}}_{j∈[1,4]}, {k^w_{i,t}}_{w∈[1,10]}} (7)
Specifically, in Table 2, the features in set K^s_{pt,−h,m} are dependent on the OD flow but are independent from the structure of networks. Thus, we define the input of the graph model in Equation (8).

IG_t = {IG^1_t, IG^2_t, ··· , IG^N_t} (8)

IG^i_t = {x^F_{i,t}, {o^j_{i,t}}_{j∈[1,4]}, {k^w_{i,t}}_{w∈[4,10]}} (9)
3.1.3. Consideration for Spatial Relationships

The spatial relationships studied in this paper refer to the relationships among stations, which can influence the prediction of OD flow. We summarized four classes of spatial relationships with six derived graphs that exist in the urban transit system. A summarization of these relationships is shown in Table 3. Each graph takes the general form G = (N, E, W).

Table 3. Multi-layer networks.

Class: Station–line–network relationship
(a) Station network: G_sn = (N, E_sn, W_sn), with W_sn(i, j) = Se(i, j) / Σ_{k=1}^{N} Se(i, k) (10)
(b) Transfer network: G_tn = (N, E_tn, W_tn), with W_tn(i, j) = Tr(i, j) / Σ_{k=1}^{N} Tr(i, k) (11)

Class: Passenger flow characteristics relationship
(c) Time series similarity: G_ts = (N, E_ts, W_ts), with W_ts(i, j) = Ts(i, j) / Σ_{k∈c} Ts(i, k) (12)
(d) Peak hour factor similarity: G_ps = (N, E_ps, W_ps), with W_ps(i, j) = Ps(i, j) / Σ_{k∈c} Ps(i, k) (13)

Class: Line planning relationship
(e) Line planning network: G_lp = (N, E_lp, W_lp), with W_lp(i, j) = Fre(l)·Lp(i, j) / Σ_{k∈SL(l)} Lp(i, k)·Fre(l) (14)

Class: Correlation relationship
(f) Correlation of flow evolution: G_cf = (N, E_cf, W_cf), with W_cf(i, j) = D(i, j) / Σ_{k=1}^{N} D(i, k) (15)

1. Station–line–network relationship

This is the basic topological relationship of the rail transit network that determines the connections between each pair of stations. We focused on two networks of this relationship:

a. Station network

The station network G_sn = (N, E_sn, W_sn) is directly constructed according to the connections of sections and stations of the studied rail transit network. An edge is formed to connect nodes i and j in E_sn if the corresponding stations i and j are connected in the real network.

b. Transfer network

The transfer network G_tn = (N, E_tn, W_tn) is constructed by the connections of a station with its nearby transfer stations. An edge is formed to connect nodes i and j in E_tn if the corresponding station i and transfer station j are connected by a station path along the station network, and the path cannot contain another transfer station.

2. Passenger flow characteristics relationship
When two stations are located in different areas but have the same function (e.g., office,
education, business districts), it makes sense that the evolution of the passenger flow will
be similar. We chose two kinds of feature to measure similarities.
a. Time series similarity
The daily passenger flow data along the time axis will form a time series. The similarity
among the time series belonging to different stations can construct the edges and weights
among stations. By these, the time series similarity Gts = ( N, Ets , Wts ) is built by a
similarity measurement with threshold. We illustrate the details about how to construct Gts
in the following part of measurement weights.
b. Peak hour factor similarity
Likewise, the peak hour factor was also chosen as a feature to measure similarity
between any pair of stations. The peak hour factor is calculated by the formula:
p_i = max(Y⃗^A_{pt,−h1,−h2}) / avg(Y⃗^A_{pt,−h1,−h2}) (16)
where h1 and h2 are the operational beginning and end time points of a day. The function max(·) finds the maximum OD flow in the vector Y⃗^A_{pt,−h1,−h2}, and the function avg(·) calculates the average flow of off-peak hours.
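A minimal sketch of Equation (16) on a synthetic daily flow series follows; note that for simplicity the sketch averages over the whole operating day, whereas the paper averages over off-peak hours only.

```python
# Synthetic entered-flow series for one station across an operating day
# (one hypothetical value per time step between opening h1 and closing h2).
daily_flow = [40, 55, 180, 320, 150, 90, 85, 240, 300, 120, 60]

# Equation (16): p_i = max(flow) / avg(flow). A value well above 1
# indicates strongly peaked demand at this station.
peak_hour_factor = max(daily_flow) / (sum(daily_flow) / len(daily_flow))
print(round(peak_hour_factor, 3))
```

Stations with similar peak hour factors (e.g., two office-district stations that both peak sharply in the morning) end up connected in G_ps.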
By these, the peak hour factor similarity G ps = N, E ps , Wps is built by a similarity
measurement based on the second-norm. We illustrate the details about how to construct
G ps in the following part of measurement weights.
3. Line planning relationship
Although urban rail transit has a natural physical topology, passengers can only move
with the trains. Thus, the line planning has a huge impact on passenger OD, especially in
terms of short-term resolution. A line plan is one of the most basic documents for rail transit
operations. A line is often taken to be a route in a high-level infrastructure graph ignoring
precise details of platforms, junctions, etc. In addition, a line is a route in the network
together with a stopping pattern for the stations along that route, as a line may either stop
at or bypass a station on its route [41]. We define a line plan as a set of such routes, each
with a series of way stations, a stopping pattern and frequency, which together must meet
certain targets such as providing minimal service at every station.
a. Line planning network
The line planning network Gl p = N, El p , Wl p describes the connected relationships
formed by the line planning. This correlation has a huge impact on passenger travel.
El p and Wl p are determined by the stopping pattern and running frequency.
4. Correlation relationship
For representing potential large OD pairs or potential travel demand hidden in the
urban rail transit, we built a network to represent the correlation relationships.
a. Correlation of flow evolution
OD flow between every two stations is not uniform, and the direction of passenger
flow implicitly represents the correlation of two stations. For instance, if: (I) the majority of
inflow of station a stream to station b, or (II) the outflow of station a primarily comes from
station b, we believe that the stations a and b are highly correlated.
According to the above discussions, we defined the graphs by nodes N, edges E
and the weights W. In this context, we set node n ∈ N, where n ∈ N represents a
real station. The graphs share the same nodes but have their own edges and weights,
that is, E ≜ {E_sn, E_tn, E_ts, E_ps, E_lp, E_cf} and W ≜ {W_sn, W_tn, W_ts, W_ps, W_lp, W_cf}. Specifically, we denote W ∈ R^{6×N×N} as the weights of all edges, and for each W_z ∈ W, z ∈ {sn, tn, ts, ps, lp, cf}. An overall design of the graphs is shown in Table 3.
We denote Wz (i, j) as the weight of edge (i, j). In Table 3, the fifth column summarizes
the calculation methods of different weights.
For calculating Wsn (i, j) and Wtn (i, j), Se(i, j) represents the connection function of
nodes (i.e., station) i and j. Se(i, j) = 1 if there exists a section between node i and j, else
Se(i, j) = 0, and we set Se(i, i ) = 0. Likewise, Tr (i, j) represents the connection function of
a station with the nearby transfer stations. Tr (i, j) = 1 if there exists a path without another
transfer station between station i and transfer station j, or else Tr (i, j) = 0.
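The row normalization in Equations (10) and (11) can be sketched as follows, using a hypothetical four-station line so that Se(i, j) = 1 only for physically adjacent stations.

```python
# Hypothetical 4-station line 0-1-2-3: Se(i, j) = 1 for adjacent
# stations, 0 otherwise, and Se(i, i) = 0 by definition.
Se = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
N = len(Se)

# Equation (10): W_sn(i, j) = Se(i, j) / sum_k Se(i, k),
# i.e., each station spreads unit weight over its connected neighbors.
W_sn = [[Se[i][j] / sum(Se[i]) for j in range(N)] for i in range(N)]
print(W_sn[1])  # station 1 splits its weight between neighbors 0 and 2
```

The transfer-network weight W_tn of Equation (11) is computed identically, with Tr(i, j) in place of Se(i, j).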
Theoretically, similarity exists between any two stations. However, in large-scale
networks, the number of station pairs is large, and the similarity of most of them is small,
which leads to a complex similarity graph. For this reason, we used a combination of
clustering method and threshold selection to first find the potential groups to which
the stations belong and then excluded the station pairs with small similarity. Thus, for
calculating Ts(i, j) in Wts (i, j), using the time series as inputs, we first obtained the similarity
relationships of different stations based on the clustering method [42] and obtained clusters
of stations, denoted as C. Then, for each c ∈ C, a predefined similarity threshold was set to
control the number of similarities. Based on the finite similarity relationships, we built the
edge set Ets and used Equations (17)–(20) to calculate Ts(i, j) in category c ∈ C.
Ts(i, j) = exp(−SBD(sum_{t∈Th}(y^A_{i,t}), sum_{t∈Th}(y^A_{j,t}))) (17)

sum(y^A_{i,t}) = Σ_{j∈[1,κ]} y^A_{i∼j,t} (18)

SBD(r⃗_i, r⃗_j) = 1 − max_w (CC_w(r⃗_i, r⃗_j) / (||r⃗_i|| · ||r⃗_j||)) (19)

CC_w(r⃗_i, r⃗_j) = Σ_{l=1}^{2m−w} r_{i,l+w−m} · r_{j,l}, if w ≥ m; CC_{−w}(r⃗_j, r⃗_i), if w < m (20)
In Equation (17), the function SBD [43] (shape-based distance) is set for measuring the distance between two temporal sequences with equal length. Specifically, the SBD is calculated by Equation (19), where r⃗_i, r⃗_j ∈ R^I are the flow time series of OD i and j and ||·|| refers to the second norm operator. CC_w(r⃗_i, r⃗_j) represents the cross-correlation between r⃗_i and r⃗_j. In addition, we set Ts(i, i) to 0.
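A self-contained sketch of the SBD computation of Equations (19) and (20) follows; it iterates over all alignments directly rather than using the FFT-accelerated version of the published algorithm.

```python
import math

def sbd(x, y):
    # Shape-based distance (Equations (19)-(20)): 1 minus the maximum
    # normalized cross-correlation over all alignments of the two series.
    m = len(x)
    norm = math.sqrt(sum(v * v for v in x)) * math.sqrt(sum(v * v for v in y))
    best = -float("inf")
    for shift in range(-(m - 1), m):  # all relative alignments
        cc = sum(x[l + shift] * y[l]
                 for l in range(m) if 0 <= l + shift < m)
        best = max(best, cc / norm)
    return 1.0 - best

# Ts(i, j) = exp(-SBD), Equation (17): identical shapes give Ts = 1.
a = [1.0, 3.0, 2.0, 1.0]
print(round(sbd(a, a), 6))  # 0.0 — a series has zero distance to itself
```

Because the maximum is taken over all shifts w, two series with the same shape but a time offset also score an SBD near 0, which is what makes the measure suitable for comparing stations whose demand peaks at different hours.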
Analogously, we calculated Ps(i, j) for Wps (i, j) by the cluster–threshold–calculation
framework. The classical k-means method is used for clustering, and the similarity is
measured by the second norm operator, as shown in Equation (21).
Ps(i, j) = ||P_i − P_j|| (21)
3.2. Multi-Layer Graph Neural Networks Model (MGNN) for Structured Forecasting
3.2.1. Structure of the Multi-Layer Graph Neural Networks Model (MGNN)
In previous works [3,16,44], sequenced features and graphic features have both been
proven to be useful for traffic state prediction. One key issue in designing MGNN is how to
fuse the temporal features and the spatial features during the training of the model. In the
context of network-level passenger flow prediction, we specifically call the features that can
be represented by the graphs the spatial features, or otherwise, the temporal features. For
fusing the temporal features and the spatial features, we adopted the designs of the Graph Convolution Gated Recurrent Unit (GC-GRU) and Fully Connected Gated Recurrent Unit (FC-GRU) proposed by [3].
f(IG^i_t) = Θ_l IG^i_t + Σ_{j∈N_sn(i)} W_sn(i, j) Θ_sn IG^j_t + Σ_{j∈N_tn(i)} W_tn(i, j) Θ_tn IG^j_t + ··· (22)
In this manner, a node can dynamically receive information from its highly correlated neighbor nodes. For convenience, we denote the graph convolution in Equation (22) as IG_t ∗ Θ in the following.
Since the above-mentioned operation is conducted on a spatial dimension, we embedded the graph convolution in a Gated Recurrent Unit (GRU) to learn spatial–temporal features. Specifically, the reset gate R_t = {R^1_t, R^2_t, ··· , R^N_t}, update gate Z_t = {Z^1_t, Z^2_t, ··· , Z^N_t}, new information N_t = {N^1_t, N^2_t, ··· , N^N_t} and hidden state H_t = {H^1_t, H^2_t, ··· , H^N_t} are computed by:
H_t = GC-GRU(IG_t, H_{t−1}): (23)

R_t = σ(Θ_rx ∗ IG_t + Θ_rh ∗ H_{t−1} + b_r) (24)

Z_t = σ(Θ_zx ∗ IG_t + Θ_zh ∗ H_{t−1} + b_z) (25)

N_t = tanh(Θ_nx ∗ IG_t + R_t ⊙ (Θ_nh ∗ H_{t−1} + b_n)) (26)

H_t = (1 − Z_t) ⊙ N_t + Z_t ⊙ H_{t−1} (27)

where σ is the sigmoid function, ⊙ denotes element-wise multiplication, and H_{t−1} is the hidden state of the last iteration t − 1. Θ_rx, Θ_rh, Θ_zx, Θ_zh, Θ_nx and Θ_nh denote the graph convolution parameters. b_r, b_z and b_n are bias terms. The feature dimensions of R^i_t, Z^i_t, N^i_t and H^i_t are set to d.
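The gate computations of Equations (24)–(27) can be sketched in miniature. The snippet below uses scalar node features and a toy stand-in for the graph convolution; all parameter values and names are hypothetical, not those of the trained model.

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def gconv(theta, x, nbrs):
    # Toy stand-in for the graph convolution of Equation (22): scale each
    # node's own feature and add its weighted neighbors' features.
    return [theta * (x[i] + sum(w * x[j] for j, w in nbrs[i]))
            for i in range(len(x))]

def gc_gru_step(x, h_prev, nbrs, p):
    # One GC-GRU update following Equations (24)-(27) for scalar features.
    r = sigmoid([a + b + p["br"] for a, b in
                 zip(gconv(p["rx"], x, nbrs), gconv(p["rh"], h_prev, nbrs))])
    z = sigmoid([a + b + p["bz"] for a, b in
                 zip(gconv(p["zx"], x, nbrs), gconv(p["zh"], h_prev, nbrs))])
    n = [math.tanh(a + ri * (b + p["bn"])) for a, ri, b in
         zip(gconv(p["nx"], x, nbrs), r, gconv(p["nh"], h_prev, nbrs))]
    # Equation (27): convex combination of new information and old state.
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]

# Three stations on a path graph; each node lists (neighbor, edge weight).
nbrs = [[(1, 1.0)], [(0, 0.5), (2, 0.5)], [(1, 1.0)]]
params = {k: 0.1 for k in ("rx", "rh", "zx", "zh", "nx", "nh",
                           "br", "bz", "bn")}
h1 = gc_gru_step([1.0, 0.0, -1.0], [0.0, 0.0, 0.0], nbrs, params)
print(h1)  # updated hidden state, one scalar per node
```

The point of the design is visible even at this scale: each gate mixes a node's own signal with its neighbors' signals before the usual GRU update, so the recurrence propagates information along the rail network rather than treating stations independently.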
Decoder framework, and we built a Sequence-to-Sequence structure, as shown in Figure 2. Specifically, the inputs of GFGRU are I_t and H̃_{t−1}, where H̃_{t−1} is the output hidden state of the last iteration.

In GFGRU, GC-GRU utilizes the accumulated information in H̃_{t−1} to update the hidden state, rather than taking the original H_t as input. Thus, Equation (23) becomes Equation (28).

H_t = GC-GRU(IG_t, H̃_{t−1}) (28)

For FC-GRU, we first transformed the I_t and H̃_{t−1} to an embedded I^e_t ∈ R^d and H̃^e_{t−1} ∈ R^d with two fully connected (FC) layers. Then, we fed I^e_t and H̃^e_{t−1} into a common GRU [45] implemented with a full connection to generate a global hidden state H^f_t ∈ R^d, which can be expressed as:

I^e_t = FC(I_t) (29)

H̃^e_{t−1} = FC(H̃_{t−1}) (30)

H^f_t = FC-GRU(I^e_t, H̃^e_{t−1}) (31)

Finally, we incorporated H_t and H^f_t to generate a combined hidden state H̃_t = {H̃^1_t, H̃^2_t, ··· , H̃^N_t} with a fully connected layer:

H̃^i_t = FC(H^i_t ⊕ H^f_t) (32)

where ⊕ denotes an operator of feature concatenation.

3.3. Graph–Temporal Fused Neural Network (GTFNN)
3.3.1. Overall for the Proposed GTFNN

When building the GTFNN framework, we need to address two main issues: firstly, the model has a comprehensive but complex input. The relationship (e.g., linear or nonlinear) between the complex inputs and the forecasting model is difficult to determine; second, time series data naturally have local features (e.g., change-points, etc.) and global features (e.g., series trends and attention at different time positions), and the framework we designed
needs to take both types of features into account to ensure better forecast performance.
Thus, three important designs are considered in GTFNN:
1. Gating structure GRN
A gating structure GRN can decide if non-linear learning is required. It can skip over
any unused components of the architecture, which provide adaptive depth and network
complexity to accommodate a wide range of datasets and scenarios.
2. Feature selection layer
While the designed features may be available, their relevance and specific contribution
to the output are typically unknown. The GTFNN also uses specialized components for the
judicious selection of relevant features and a series of gating layers to suppress unnecessary
components, enabling high performance in a wide range of regimes. We adopted the feature
selection network proposed by Lim et al. [3] to tune the weights of input features at each time
step dynamically. This feature selection network was designed based on the gating mechanism.
3. Sequence-to-sequence layer with attention mechanisms
For learning both local and global temporal relationships from time-varying inputs, a
sequence-to-sequence layer is employed for local processing, whereas long-term dependen-
cies are captured using an interpretable multi-head attention block. The GTFNN employs a
self-attention mechanism to learn long-term relationships across different time steps [3],
which is modified from multi-head attention in transformer-based architectures [29,34] to
enhance explainability.
Figure 3 shows the high-level architecture of GTFNN, where individual components
are described in detail in the subsequent sections.
Figure 4. The structure of GRN.
In Equation (33), Gated Linear Units (GLUs) [48] are selected as the gating components to provide the flexibility to suppress any part of the architecture that can be skipped in the scenario. Denoting γ ∈ R^(d_model) as the input of the GLU, we obtain Equation (36).

$$\mathrm{GLU}_{\omega}(\gamma) = \sigma\big(W_{3,\omega}\,\gamma + b_{3,\omega}\big) \odot \big(W_{4,\omega}\,\gamma + b_{4,\omega}\big) \tag{36}$$

where σ(·) is the sigmoid function, W_(·) ∈ R^(d_model×d_model) and b_(·) ∈ R^(d_model) are the weights and biases, ⊙ is the element-wise Hadamard product, and d_model is the hidden state size. GLU allows GTFNN to control the extent to which the GRN contributes to the original input, potentially skipping over the layer entirely if necessary, as the GLU outputs could all be close to 0 in order to suppress the nonlinear contribution [3].

3.3.3. Instance-Wise Feature Selection Layer

Instance-wise feature selection is provided by the feature selection networks applied to all input features. Entity embeddings [49] are used for categorical variables as feature representations, and linear transformations for continuous variables, transforming each input variable into a d_model-dimensional vector. All inputs make use of separate feature selection networks with distinct weights.

Let ξ_t^(j) ∈ R^(d_model) denote the transformed input of the jth feature at time t, with Ξ_t = [ξ_t^(1)T, …, ξ_t^(m_χ)T]^T being the flattened vector of all past inputs at time t. Feature selection weights are generated by feeding Ξ_t through a GRN, followed by a Softmax layer:

$$v_{\chi t} = \mathrm{Softmax}\big(\mathrm{GRN}_{v_{\chi}}(\Xi_t)\big) \tag{37}$$

where v_{χt} ∈ R^(m_χ) is a vector of feature selection weights.

At each time step, an additional layer of non-linear processing is employed by feeding each ξ_t^(j) through its own GRN:

$$\widetilde{\xi}_t^{(j)} = \mathrm{GRN}_{\widetilde{\xi}(j)}\big(\xi_t^{(j)}\big) \tag{38}$$

where ξ̃_t^(j) is the processed feature vector for variable j. We note that each variable has its own GRN_ξ̃(j), with weights shared across all time steps t. Processed features are then weighted by their variable selection weights and are combined:

$$\widetilde{\xi}_t = \sum_{j=1}^{m_{\chi}} v_{\chi t}^{(j)}\, \widetilde{\xi}_t^{(j)} \tag{39}$$

where v_{χt}^(j) is the jth element of vector v_{χt}.
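As a rough sketch (not the authors' implementation), the instance-wise selection of Equations (37)-(39) can be written in NumPy. The `toy_grn` helper merely stands in for the GRN of Lim et al. [3]; the real GRN also includes GLU gating and a residual path, and all weights here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_grn(x, w1, w2):
    # Stand-in for a GRN: two dense layers with an ELU nonlinearity.
    # The real GRN of Lim et al. [3] also has GLU gating and a residual path.
    z = x @ w1
    h = np.where(z > 0, z, np.exp(np.minimum(z, 0.0)) - 1.0)  # ELU
    return h @ w2

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_model, m_chi = 8, 5                    # placeholder hidden size / feature count
xi = rng.normal(size=(m_chi, d_model))   # transformed inputs xi_t^(j)

# Eq. (37): selection weights from the flattened vector Xi_t
Xi = xi.reshape(-1)
v_chi = softmax(toy_grn(Xi,
                        rng.normal(size=(m_chi * d_model, d_model)),
                        rng.normal(size=(d_model, m_chi))))

# Eq. (38): each feature is processed by its own GRN (distinct weights)
xi_tilde = np.stack([
    toy_grn(xi[j],
            rng.normal(size=(d_model, d_model)),
            rng.normal(size=(d_model, d_model)))
    for j in range(m_chi)
])

# Eq. (39): weighted combination of the processed features
combined = (v_chi[:, None] * xi_tilde).sum(axis=0)
assert np.isclose(v_chi.sum(), 1.0) and combined.shape == (d_model,)
```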
Decoder masking [29,34] is applied to the multi-head attention layer to ensure that
each temporal dimension can only attend to features preceding it.
$$\mathrm{InterpretableMultiHead}(Q, K, V) = \widetilde{H}\, W_H \tag{42}$$

$$\widetilde{H} = \widetilde{A}(Q, K)\, V W_V = \left\{ \frac{1}{n_H} \sum_{h=1}^{n_H} A\big(Q W_Q^{(h)}, K W_K^{(h)}\big) \right\} V W_V \tag{43}$$

$$= \frac{1}{n_H} \sum_{h=1}^{n_H} \mathrm{Attention}\big(Q W_Q^{(h)}, K W_K^{(h)}, V W_V\big) \tag{44}$$

$$A(Q, K) = \mathrm{Softmax}\big(Q K^{T} / \sqrt{d_{attn}}\big) \tag{45}$$

$$\mathrm{Attention}(Q, K, V) = A(Q, K)\, V \tag{46}$$
where V ∈ R^(N×d_V), K ∈ R^(N×d_attn) and Q ∈ R^(N×d_attn) are the values, keys and queries of the attention mechanism. d_V = d_attn = d_model/n_H, and n_H is the number of heads. A(·) is a normalization function. W_K^(h) ∈ R^(d_model×d_attn) and W_Q^(h) ∈ R^(d_model×d_attn) are head-specific weights for keys and queries, W_V ∈ R^(d_model×d_V) are value weights shared across all heads, and W_H ∈ R^(d_attn×d_model)
is used for final linear mapping.
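A minimal NumPy sketch of Equations (42)-(46), with random placeholder weights and without decoder masking, may make the head-averaging concrete:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

N, d_model, n_heads = 6, 8, 2
d_attn = d_model // n_heads              # d_V = d_attn = d_model / n_H
Q = rng.normal(size=(N, d_model))
K = rng.normal(size=(N, d_model))
V = rng.normal(size=(N, d_model))

W_V = rng.normal(size=(d_model, d_attn))   # value weights shared by all heads
W_H = rng.normal(size=(d_attn, d_model))   # final linear mapping

# Eqs. (43)-(45): average the head-specific attention matrices; sharing V W_V
# across heads is what makes the averaged pattern interpretable.
A_bar = np.zeros((N, N))
for _ in range(n_heads):
    W_Q = rng.normal(size=(d_model, d_attn))  # head-specific query weights
    W_K = rng.normal(size=(d_model, d_attn))  # head-specific key weights
    scores = (Q @ W_Q) @ (K @ W_K).T / np.sqrt(d_attn)  # Eq. (45), pre-Softmax
    A_bar += softmax_rows(scores) / n_heads

H_tilde = A_bar @ (V @ W_V)      # Eq. (43)
out = H_tilde @ W_H              # Eq. (42): InterpretableMultiHead(Q, K, V)

assert out.shape == (N, d_model)
assert np.allclose(A_bar.sum(axis=1), 1.0)  # each row is a distribution
```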
The self-attention layer allows GTFNN to pick up long-range dependencies that may
be challenging for RNN-based architectures to learn. Following the self-attention layer, an
additional gating layer is also applied to facilitate training:
$$\delta(t, n) = \mathrm{LayerNorm}\big(\theta(t, n) + \mathrm{GLU}_{\delta}(\beta(t, n))\big) \tag{47}$$
where the weights of GRNψ are shared across the entire layer. As shown in Figure 5, we also
applied a gated residual connection that skips over the entire transformer block, providing
a direct path to the sequence-to-sequence layer, yielding a simpler model if additional
complexity is not required, as shown below:
$$\widetilde{\psi}(t, n) = \mathrm{LayerNorm}\big(\widetilde{\phi}(t, n) + \mathrm{GLU}_{\widetilde{\psi}}(\psi(t, n))\big) \tag{49}$$
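The gating layers of Equations (47) and (49) amount to a GLU-gated residual connection followed by layer normalization. A hedged NumPy sketch, with random placeholder weights and zero biases:

```python
import numpy as np

rng = np.random.default_rng(2)

def glu(x, w_a, b_a, w_b, b_b):
    # Eq. (36): GLU(x) = sigmoid(W_a x + b_a) * (W_b x + b_b)
    return 1.0 / (1.0 + np.exp(-(x @ w_a + b_a))) * (x @ w_b + b_b)

def layer_norm(x, eps=1e-6):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

d = 8
theta = rng.normal(size=d)   # skip-path input (placeholder)
beta = rng.normal(size=d)    # output of the gated block (placeholder)
w_a, w_b = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b_a, b_b = np.zeros(d), np.zeros(d)

# Eq. (47): when the sigmoid gate saturates near 0, GLU(beta) vanishes and
# delta reduces to LayerNorm(theta), i.e. the block is effectively skipped.
delta = layer_norm(theta + glu(beta, w_a, b_a, w_b, b_b))
assert delta.shape == (d,)
```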
Figure 5. Illustration for feature selection network.
All the notation is summarized in Appendix A.
4. Numerical Experiments
4.1. Experiment Settings
4.1.1. Dataset
We collected a mass of trip transaction records from a real-world metro system and constructed a large-scale dataset, which is termed as CQMetro. The overview of the dataset is summarized in Table 4.
This dataset was built based on the rail transit system of Chongqing, China. The transaction records were collected from 1 January to 15 February 2019, with a daily passenger flow of 1.72 million on average. The total number of OD pairs in the whole network is 14,314 pairs. Each record contains the information of the entry/exit station and the corresponding timestamps. In this period, 170 stations operated normally, and they were connected by 224 physical edges (i.e., the sections between stations). For each station, we measured design features every 15 min. The data of the first 30 days were used for training, and the last 15 days were used for testing, while OD flows of the following day were used for validation. In particular, 1–3 January and 4–10 February 2019 are New Year's Day and Chinese New Year holidays.
Table 4. Dataset of CQMetro.
Dataset Notation
City Chongqing, China
Station 170
Physical Edge 448
Flow volume per day 1.72 M
OD pairs 14,314
Time Step 15 min
Input length 384 (4 days)
Output length 4 (60 min)
Forecasting horizon 2 (30 min)
Training Timespan 2880 (30 days)
Testing Timespan 1440 (15 days)
Validation Timespan 96 (1 day)
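The split sizes in Table 4 can be sanity-checked with a few lines of arithmetic. The sliding-window sample count at the end is our own assumption about how training samples are formed, not a figure stated in the paper:

```python
# Sanity-checking the CQMetro split in Table 4 (15 min time steps).
STEPS_PER_DAY = 24 * 60 // 15           # 96 steps per day

train_steps = 30 * STEPS_PER_DAY        # training timespan: 2880 (30 days)
test_steps = 15 * STEPS_PER_DAY         # testing timespan: 1440 (15 days)
val_steps = 1 * STEPS_PER_DAY           # validation timespan: 96 (1 day)
input_len = 4 * STEPS_PER_DAY           # input length: 384 (4 days of history)
output_len = 60 // 15                   # output length: 4 steps (60 min)

assert (train_steps, test_steps, val_steps) == (2880, 1440, 96)
assert input_len == 384 and output_len == 4

# Sliding-window sample count over the training span; this windowing scheme
# is an assumption on our part, not something stated in the paper.
num_train_windows = train_steps - input_len - output_len + 1
assert num_train_windows == 2493
```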
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{X}_i - X_i\big)^2 \tag{52}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(\hat{X}_i - X_i\big)^2} \tag{53}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\big|\hat{X}_i - X_i\big| \tag{54}$$

$$\mathrm{SMAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{2\,\big|\hat{X}_i - X_i\big|}{\big|\hat{X}_i\big| + \big|X_i\big|} \tag{55}$$
where n is the number of testing time steps; for example, if we forecast the next 60 min at a 15 min granularity, then n = 4. X̂i and Xi denote the predicted ridership and the ground-truth
ridership, respectively. Note that X̂i and Xi have been transformed back to the original
scale with an inverse z-score normalization. Our GTFNN is developed to predict the
metro ridership of the next two steps. In the following experiments, we would measure
the errors of each time interval separately. For the 15 min granularity OD prediction, there
may be a true value of 0; thus, SMAPE is used instead of MAPE to describe the relative
accuracy of the prediction. Unlike MAPE (Mean Absolute Percentage Error), the SMAPE
values range from 0 to 200%. For all metrics, values closer to 0 indicate higher prediction accuracy.
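A straightforward NumPy implementation of Equations (52)-(55), together with the inverse z-score step mentioned above, might look as follows (the toy inputs are illustrative only, not data from CQMetro):

```python
import numpy as np

def smape(pred, true):
    # Eq. (55): symmetric MAPE, bounded between 0 and 200%.
    return 100.0 * np.mean(2.0 * np.abs(pred - true) / (np.abs(pred) + np.abs(true)))

def metrics(pred, true):
    err = pred - true
    mse = np.mean(err ** 2)                 # Eq. (52)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),               # Eq. (53)
        "MAE": np.mean(np.abs(err)),        # Eq. (54)
        "SMAPE": smape(pred, true),         # Eq. (55)
    }

# Predictions are mapped back to the original scale first
# (inverse z-score: x = x_norm * std + mean); toy numbers only.
mean, std = 10.0, 2.0
pred = np.array([0.0, 1.0, -0.5, 2.0]) * std + mean
true = np.array([0.5, 1.0, -1.0, 1.5]) * std + mean
m = metrics(pred, true)
assert m["RMSE"] >= m["MAE"] >= 0.0
```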
The predictions for the weekdays are shown in Table 6. To predict the ridership at the
next four consecutive time intervals (60 min), the baseline LSTM obtains a SMAPE score of
35.17% on CQMetro, ranking last among all the methods. Compared to LSTM and GRU,
the performance of GBDT and Graph-WaveNet were much better. The prediction ability of
the GTFNN model is the best compared with the above models. SMAPE is 14.16%, MAE is
only 0.83, and the average prediction error of each period is lower than 1 person. It can be
seen that GTFNN fully combines the advantages of the graph neural network model and
time series model and has good passenger flow prediction ability.
The predictions for the weekends are shown in Table 7. Similar to working days, the
GTFNN model still predicted better than the other models, with a SMAPE of 13.21% and
MAE, MSE and RMSE metrics of 0.78, 4.02 and 2.00, respectively. In general, the prediction
accuracy of different models for weekend passenger flows is higher than that of weekday
passenger flows, with the SMAPE values of the different models decreasing by about 1–4%.
The predictions for the holidays are shown in Table 8. From the results, the prediction
results of different models for holidays are not very satisfactory. The GTFNN model still
predicted better than the other models, with SMAPE of 47.96% and MAE, MSE and RMSE
metrics of 4.41, 35.08 and 5.92, respectively.
Figure 6. Variable importance for the CQMetro dataset. (a) Encoder variables importance. (b) Decoder variables importance.
5. Discussion
The results of the overall metrics analysis of Section 4.3 showed the excellent performance of the GTFNN model. To further analyze the forecasting performance in different scenarios, this section is designed from two aspects: the prediction results and characteristics of different ODs on weekends and weekdays are analyzed; furthermore, the comparison between the forecasting results of ordinary days and holidays is discussed to analyze the applicability of the studied model in forecasting OD passenger flow from different sources.
5.1. Comparison of Different ODs
The following four typical ODs were selected for further discussion and analysis.
(1) OD 1: This OD consists of a hub-type station and a station in CBD. The selected hub-type station is located close to the city's high-speed rail passenger hub, mainly serving long-distance passengers entering and leaving the city, and it is an interchange station between the high-speed rail network and the urban rail transit. The other station is located in the CBD of the city, which is the most prosperous part of the city, with large passenger flow.
(2) OD 2: This OD consists of a station in the residential area and a station in CBD. The selected station in the residential area is located in the main residential area of the city and mainly serves the commuting needs of passengers in the residential area. The station in CBD is the same as the station in OD 1.
(3) OD 3: This OD consists of a station in the residential area and a station in the suburban area. The selected station in the residential area is the same as that in OD 2. The suburban-type station is the starting and ending station of the line, and the station is far away from the city hub, where trains need to make a turnaround, and the daily passenger flow is small.
(4) OD 4: This OD consists of a hub-type station and a station in the suburban area. The selected hub-type station and the station in the suburban area are the same as the stations in OD 1 and OD 3, respectively.
Figures 7–10 show the prediction results of four pairs of different ODs on weekdays or weekends. The blue dashed line represents the prediction result, and the red solid line represents the actual flow. The x-axis represents the time step, and the y-axis represents the passenger flow. The time scale represents the first 15 min of the day; thus, the whole day is divided into 96 time steps.
Figure 7. The examples of forecasting curves and the ground-truth curves for OD 1: (a) weekdays; (b) weekends.
Figure 8. The examples of forecasting curves and the ground-truth curves for residential areas to CBD OD flow: (a) weekdays; (b) weekends.
Figure 9. The examples of forecasting curves and the ground-truth curves for residential to suburban OD flow: (a) weekdays; (b) weekends.
Figure 10. The examples of forecasting curves and the ground-truth curves for hub to suburban OD flow: (a) weekdays; (b) weekends.
5.1.1. Forecasting of OD 1 on Weekdays and Weekends
The main purposes of the passenger flow between hubs and CBDs are business activities, shopping, consumption, and attending large events. Hubs and CBDs attract frequent economic activities and commercial activities, which lead to strong population mobility. Passenger flow in these areas is extremely large. As shown in Figure 7, the peak passenger flow exceeds 20 people/15 min on weekdays and 40 people/15 min on weekends. Table 9 shows the metrics of the forecasting results. The passenger flow forecast results on weekends are more accurate compared to those on weekdays, where MAE is reduced by 0.20 and SMAPE is reduced by 3.3%.
Table 9. The metrics of forecasting results for OD 1.
Passenger Flow Scenario 1   MAE    MSE    RMSE   SMAPE
weekdays                    0.38   0.52   0.72   11.51%
weekends                    0.18   0.07   0.27   8.20%
5.1.2. Forecasting of OD 2 on Weekdays and Weekends
Jobs or shopping opportunities offered by enterprises located in CBD attract people from near and far, which causes large passenger flow in OD 2. As shown in Figure 8, its traffic peak is close to 15 people/15 min on weekdays, while its traffic peak drops to 10 people/15 min on weekends due to the reduction of commuter traffic. In terms of the forecast metrics (Table 10), the forecast accuracy of passenger flow on weekdays is slightly higher than that on weekends. In particular, MAE is reduced by 0.02, and SMAPE is reduced by 2.24%.
Table 10. The metrics of forecasting results for residential areas to CBD OD flow.
Passenger Flow Scenario 2   MAE    MSE    RMSE   SMAPE
weekdays                    0.26   0.70   0.84   31.23%
weekends                    0.28   0.77   0.88   33.47%
5.1.3. Forecasting of OD 3 on Weekdays and Weekends
Residential–suburban stations are more diverse in terms of the main purposes of passenger traffic. As usual, suburban stations have smaller passenger flows, but the residential station area selected in this case still has some passenger flows due to its physical proximity to the suburban station. As shown in Figure 9, its peak passenger flow is close to 10 passengers/15 min on weekdays, and drops to 6 passengers/15 min due to the reduced economic activity in the city on weekends. In terms of forecasting metrics (Table 11), the forecast accuracy of passenger flow on weekends is slightly higher than that of weekdays. In particular, MAE is reduced by 0.04 and SMAPE is reduced by 2.24%.
Table 11. The metrics of forecasting results for residential to suburban OD flow.
Table 12. The metrics of forecasting results for hub to suburban OD flow.
From the perspective of MAE, the MAE score of holidays (4.41) is 3.58 and 3.63 higher than that of weekdays (0.83)
and weekends (0.78), respectively. Likewise, from the perspective of MSE, the score of
holidays (35.08) is 30.66 and 31.06 higher than that of weekdays (4.42) and weekends (4.02)
and from the perspective of SMAPE, the score of holidays (47.96%) is 33.80% and 34.75%
higher than that of weekdays (14.16%) and weekends (13.21%). Although the passenger
flow on holidays is larger than that on ordinary days, it is clear that the passenger flow
characteristics of holidays are not yet well captured by the model. On the one hand, the
model does not consider how to capture the characteristics of holidays (i.e., uncommonly
large passenger flow) when designing. Furthermore, there is only a small amount of
holiday data in the training data, and the passenger flow characteristics vary between
holidays, which poses a great challenge to the model’s prediction during holidays.
6. Conclusions
In this work, we proposed a Graph–Temporal Fused Neural Network (GTFNN) to
address the network-level origin–destination (OD) flows online short-term forecasting
problem. In order to solve the key issue of online flow forecasting, the proposed GTFNN
has made efforts in the four aspects below.
(1) The GTFNN takes finished OD flow and a series of known and observable features as
the input and explores multi-step predictions.
(2) Unlike previous works that either focus on the spatial relationship or the temporal re-
lationship of OD flows evolution, the proposed method is constructed from capturing
both spatial and temporal characteristics.
(3) In order to learn spatial characteristics, a multi-layer graph neural network model is
proposed based on hidden relationships in the rail transit network. Then, we embedded
the graph convolution in a Gated Recurrent Unit to learn spatial–temporal features.
(4) Based on the sequence-to-sequence framework, a Graph–Temporal Fused Deep Learn-
ing model was built. In addition, an attention mechanism was attached to the model
to fuse local and global temporal dependencies to achieve the prediction of short-term
online OD passenger flow.
Experiments based on real-world data collected from Chongqing’s rail transit system
showed that the proposed model performed better than other models. For instance, on
weekdays for passenger flow forecasting scenarios, the SMAPE score of GTFNN was about
14.16%, which is 5% to 20% lower than that of the other methods. In addition, the
MAE score ranged from 0.1 to 0.8, which is suitable for applications. By comparing some
representative ODs, we found that it is more difficult to forecast ODs with small average
passenger flow values. OD forecasting for small passenger flows should be one of the next
research points.
The proposed model can also analyze weights of different features. The weights of
observed input features and known input features were different in the encoder, where
the most important feature of observed input features is “hour of the day”, and the most
important feature of known input features is “max. passenger flow”.
Obtaining accurate OD passenger flow data on time is vital to support transportation
organization in a rail transit system. The accurate OD prediction results allow operators
to understand the passenger demand between different ODs of the network at a certain
time point in the future, thereby supporting the dynamic and efficient deployment of
transportation organization resources. Well-designed train line planning, timetabling and
station passenger flow planning can be obtained. Looking at the trend of the rail transit passenger flow prediction problem, traditional station in/out volume prediction can no longer meet actual application demands: it only captures station-level collection and dispersion volumes, while the passenger flow distribution on the network remains unknown. Accurate prediction of OD will become a popular topic in the future.
The method that combines both temporal and spatial relationships into the prediction
system studied in this paper will be one of the supports to solve the accurate prediction of
OD. Thus, the OD prediction model studied in this paper has strong practical significance.
Nevertheless, there still exist some limitations in the proposed model. By comparing
the results of ordinary days and holidays (with uncommonly large passenger flow), we can
find that the method studied in this paper still cannot guarantee the accuracy in different
passenger flow scenarios. In the future, it is also necessary to optimize the model and
algorithm for scenarios where sudden large passenger flows occur, so as to meet the needs of
on-time forecasting. In addition, the six levels of graph neural networks used in the current
model have the same weights. The effect of these graphs on prediction accuracy has not
been studied. A good understanding of the importance of different graph relations can
help reduce the complexity of the model and improve the training efficiency of the model.
Author Contributions: Data curation, H.Z.; Funding acquisition, H.Z.; Methodology, H.Z., Z.H. and
K.Y.; Supervision, J.C.; Writing—original draft, H.Z., Z.H. and K.Y.; Writing—review and editing, J.C.
and J.Z. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the China Postdoctoral Science Foundation, grant number
2021T140003, and by the China Postdoctoral Science Foundation, grant number 2021M700186.
Data Availability Statement: Not applicable.
Acknowledgments: Thanks to Guofei Gao for his support and help in data processing and for
providing hardware.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
Appendix A
Table A1 lists the notation used in this paper and the descriptions.
22 Y→A_(pt,−h,−π): The actual passenger flow of time steps {pt−h+1, pt−h+2, ⋯, pt−π}, where h > 0 and π < h; Y→A_(pt,−h,−π) ≜ {Y^A_(pt−n+1), Y^A_(pt−n+2), ⋯, Y^A_(pt−π)}
23 Y^A_t: The actual (or entered) OD flows of time step t with the origin station i ∈ [1, N]; Y^A_t ≜ {y^A_(1,t), y^A_(2,t), ⋯, y^A_(i,t), ⋯, y^A_(N,t)} ∈ R^(κ×N)
24 y^A_(i,t): The top κ−1 actual (or entered) OD pairs that origin from station i at time step t; y^A_(i,t) ≜ {y^A_(i∼1,t), y^A_(i∼2,t), ⋯, y^A_(i∼j,t), ⋯, y^A_(i∼κ−1,t), y^A_(i∼κ,t)} ∈ R^κ
25 y^A_(i∼j,t): The actual (or entered) OD flow traveled from station i to station j at time step t
26 O_(pt,−h,−π): The observed input features that can only be obtained in historical data; O_(pt,−h,−π) ≜ {O^s_(pt,−h,−π), O^g_(pt,−h,−π)}
27 K_(pt,−h,m): The known input features that can be obtained in the whole range of time; K_(pt,−h,m) ≜ {K^s_(pt,−h,m), K^g_(pt,−h,m)}
28 O^s_(pt,−h,−π): The set of finished OD passenger flow
29 O^g_(pt,−h,−π): The set of horizontal passenger flow
30 K^s_(pt,−h,m): The sequenced known input features
31 K^g_(pt,−h,m): The graphic known input features
32 O: The observed input features
33 K: The known input features
34 o^1_t: Finished OD passenger flow in next time step
35 o^2_t: Finished OD passenger flow in next two time steps
36 o^3_t: Max. passenger flow
37 o^4_t: Min. passenger flow
References
1. Wei, Y.; Chen, M.C. Forecasting the short-term metro passenger flow with empirical mode decomposition and neural networks.
Transp. Res. Part C Emerg. Technol. 2012, 21, 148–162. [CrossRef]
2. Bai, L.; Yao, L.; Kanhere, S.S.; Wang, X.; Sheng, Q.Z. Stg2seq: Spatial-temporal graph to sequence model for multi-step passenger
demand forecasting. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19},
Macao, China, 10–16 August 2019; pp. 1981–1987.
3. Lim, B.; Arık, S.Ö.; Loeff, N.; Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J.
Forecast. 2021, 37, 1748–1764. [CrossRef]
4. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013,
arXiv:1312.6203.
5. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In
Advances in Neural Information Processing Systems; ACS: Washington, DC, USA, 2016.
6. Atwood, J.; Towsley, D. Diffusion-convolutional neural networks. Comput. Sci. 2015, 29. Available online: https://proceedings.
neurips.cc/paper/2016/hash/390e982518a50e280d8e2b535462ec1f-Abstract.html (accessed on 1 September 2022).
7. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2017, arXiv:1710.10903.
8. Vlahogianni, E.I.; Golias, J.C.; Karlaftis, M.G.; Banister, D.; Givoni, M. Short-term traffic forecasting: Overview of objectives and
methods. Transp. Rev. 2003, 24, 533–557. [CrossRef]
9. Williams, B.; Durvasula, P.; Brown, D. Urban freeway traffic flow prediction: Application of seasonal autoregressive integrated
moving average and exponential smoothing models. Transp. Res. Rec. 1998, 1644, 132–141. [CrossRef]
10. Lee, S.; Fambro, D.; Lee, S.; Fambro, D. Application of subset autoregressive integrated moving average model for short-term
freeway traffic volume forecasting. Transp. Res. Rec. J. Transp. Res. Board 1999, 1678, 179–188. [CrossRef]
11. Huang, W.; Song, G.; Hong, H.; Xie, K. Deep architecture for traffic flow prediction: Deep belief networks with multitask learning.
IEEE Trans. Intell. Transp. Syst. 2014, 15, 2191–2201. [CrossRef]
12. Ni, M.; He, Q.; Gao, J. Forecasting the subway passenger flow under event occurrences with social media. IEEE Trans. Intell.
Transp. Syst. 2016, 18, 1623–1632. [CrossRef]
13. Sun, Y.; Leng, B.; Guan, W. A novel wavelet-SVM short-time passenger flow prediction in Beijing subway system. Neurocomputing 2015, 166, 109–121. [CrossRef]
14. Li, Y.; Wang, X.; Sun, S.; Ma, X.; Lu, G. Forecasting short-term subway passenger flow under special events scenarios using multiscale radial basis function networks. Transp. Res. Part C Emerg. Technol. 2017, 77, 306–328. [CrossRef]
15. Sun, Y.; Zhang, G.; Yin, H. Passenger flow prediction of subway transfer stations based on nonparametric regression model.
Discret. Dyn. Nat. Soc. 2014, 2014, 397154. [CrossRef]
16. Zhou, X.; Mahmassani, H.S. A structural state space model for real-time traffic origin-destination demand estimation and
prediction in a day-to-day learning framework. Transp. Res. Part B Methodol. 2007, 41, 823–840. [CrossRef]
17. Hazelton, M.L. Inference for origin–destination matrices: Estimation, prediction and reconstruction. Transp. Res. Part B Methodol. 2001, 35, 667–676. [CrossRef]
18. Djukic, T. Dynamic OD Demand Estimation and Prediction for Dynamic Traffic Management; Delft University of Technology: Delft, The Netherlands, 2014.
19. Liu, L.; Qiu, Z.; Li, G.; Wang, Q.; Ouyang, W.; Lin, L. Contextualized spatial-temporal network for taxi origin-destination demand
prediction. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3875–3887. [CrossRef]
20. Shi, H.; Yao, Q.; Guo, Q.; Li, Y.; Liu, Y. Predicting Origin-Destination Flow via Multi-Perspective Graph Convolutional Network.
In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020.
21. Gong, Y.; Li, Z.; Zhang, J.; Liu, W.; Zheng, Y. Online spatio-temporal crowd flow distribution prediction for complex metro system.
IEEE Trans. Knowl. Data Eng. 2020, 34, 865–880. [CrossRef]
22. Liu, L.; Chen, J.; Wu, H.; Zhen, J.; Li, G.; Lin, L. Physical-virtual collaboration modeling for intra- and inter-station metro ridership prediction. IEEE Trans. Intell. Transp. Syst. 2020, 23, 3377–3391. [CrossRef]
23. Yao, H.; Wu, F.; Ke, J.; Tang, X.; Jia, Y.; Lu, S.; Gong, P.; Ye, J.; Li, Z. Deep multi-view spatial-temporal network for taxi demand prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
24. Wang, D.; Cao, W.; Li, J.; Ye, J. DeepSD: Supply-demand prediction for online car-hailing services using deep neural networks. In Proceedings of the 2017 IEEE 33rd International Conference on Data Engineering (ICDE), San Diego, CA, USA, 19–22 April 2017.
25. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv 2017, arXiv:1707.01926.
26. Song, C.; Lin, Y.; Guo, S.; Wan, H. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-
temporal network data forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA,
7–12 February 2020; pp. 914–921.
27. Han, Y.; Wang, S.; Ren, Y.; Wang, C.; Gao, P.; Chen, G. Predicting station-level short-term passenger flow in a citywide metro network using spatiotemporal graph convolutional neural networks. ISPRS Int. J. Geo Inf. 2019, 8, 243. [CrossRef]
28. Geng, X.; Li, Y.; Wang, L.; Zhang, L.; Yang, Q.; Ye, J.; Liu, Y. Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 3656–3663.
29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need.
arXiv 2017, arXiv:1706.03762.
30. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
31. Arik, S.O.; Pfister, T. TabNet: Attentive interpretable tabular learning. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021.
32. Alaa, A.M.; van der Schaar, M. Attentive state-space modeling of disease progression. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2019.
33. Choi, E.; Bahadori, M.T.; Schuetz, A.; Stewart, W.F.; Sun, J. RETAIN: Interpretable Predictive Model in Healthcare Using Reverse Time Attention Mechanism; Curran Associates Inc.: Red Hook, NY, USA, 2016.
34. Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.X.; Yan, X. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2019.
35. Song, H.; Rajan, D.; Thiagarajan, J.J.; Spanias, A. Attend and diagnose: Clinical time series analysis using attention models. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, LA, USA, 2–7 February 2018.
36. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V.; Hyndman, R.J. The M4 competition: 100,000 time series and 61 forecasting methods. Int. J. Forecast. 2020, 36, 54–74. [CrossRef]
37. Rangapuram, S.S.; Seeger, M.W.; Gasthaus, J.; Stella, L.; Wang, Y.; Januschowski, T. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2018.
38. Wen, R.; Torkkola, K.; Narayanaswamy, B. A multi-horizon quantile recurrent forecaster. arXiv 2017, arXiv:1711.11053.
39. Fan, C.; Zhang, Y.; Pan, Y.; Li, X.; Zhang, C.; Yuan, R.; Wu, D.; Wang, W.; Pei, J.; Huang, H. Multi-horizon time series forecasting with temporal attention learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019.
40. Guo, T.; Lin, T.; Antulov-Fantulin, N. Exploring interpretable LSTM neural networks over multi-variable data. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019.
41. Burggraeve, S.; Bull, S.H.; Vansteenwegen, P.; Lusby, R.M. Integrating robust timetabling in line plan optimization for railway
systems. Transp. Res. Part C Emerg. Technol. 2017, 77, 134–160. [CrossRef]
42. Zheng, H.; Cui, Z.; Zhang, X. Automatic discovery of railway train driving modes using unsupervised deep learning. ISPRS Int.
J. Geo Inf. 2019, 8, 294. [CrossRef]
43. Paparrizos, J.; Gravano, L. k-Shape: Efficient and accurate clustering of time series. ACM SIGMOD Rec. 2015, 45, 69–76. [CrossRef]
44. Fang, S.; Zhang, Q.; Meng, G.; Xiang, S.; Pan, C. Gstnet: Global spatial-temporal network for traffic flow prediction. In Proceedings
of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp.
2286–2293.
45. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv
2014, arXiv:1412.3555.
46. Clevert, D.-A.; Unterthiner, T.; Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In Proceedings of the ICLR, San Juan, Puerto Rico, 2–4 May 2016.
47. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
48. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017.
49. Gal, Y.; Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of the
30th International Conference on Neural Information Processing Systems; Curran Associates Inc.: Barcelona, Spain, 2016; pp.
1027–1035.
50. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
51. Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; Li, H. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3848–3858. [CrossRef]
52. Friedman, J. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [CrossRef]
53. Gers, F.A.; Schmidhuber, E. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Trans. Neural Netw. 2001, 12, 1333–1340. [CrossRef]
54. Jozefowicz, R.; Zaremba, W.; Sutskever, I. An empirical exploration of recurrent network architectures. In Proceedings of the
International Conference on Machine Learning, Lille, France, 6–11 July 2015.
55. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph WaveNet for deep spatial-temporal graph modeling. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019.