
Knowledge-Based Systems 214 (2021) 106746


STGSN — A Spatial–Temporal Graph Neural Network framework for time-evolving social networks

Shengjie Min a,d,∗, Zhan Gao b, Jing Peng c, Liang Wang c, Ke Qin a, Bo Fang d
a University of Electronic Science and Technology of China, China
b Sichuan University, China
c Sichuan Provincial Public Security Department, China
d ChinaCloud Information Technology Co., Ltd., China

Article info

Article history:
Received 13 July 2020
Received in revised form 11 October 2020
Accepted 1 January 2021
Available online 9 January 2021

Keywords:
Criminal Network Analysis
Social Network Analysis
Graph Neural Network
Spatial–Temporal Graph Neural Network
Attention network

Abstract

Social Network Analysis (SNA) has been a popular field of research since the early 1990s. Law enforcement agencies have been utilizing it as a tool for intelligence gathering and criminal investigation for decades. However, the graph nature of social networks highly restricts intelligence analysis tasks such as role prediction (node classification), social relation inference (link prediction), and criminal group discovery (community detection). In the past few years, many studies have focused on Graph Neural Networks (GNNs), which utilize deep learning methods to solve graph-related problems. However, we have rarely seen GNNs tackle time-evolving social network problems, especially in the criminology field. The existing studies have commonly overlooked the temporal-evolution characteristics of social networks. In this paper, we propose a graph neural network framework, namely the Spatial–Temporal Graph Social Network (STGSN), which models social networks from both spatial and temporal perspectives. Using a novel approach, we leverage the temporal attention mechanism to capture social networks' temporal features. We design a method that analyzes the temporal attention distribution to improve the interpretability of our method. In the end, we conduct extensive experiments on six public datasets to prove our method's effectiveness.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

1.1. Social network analysis and criminal network analysis

Social Network Analysis (SNA) [1] has emerged as a critical technique in modern sociology. Researchers generally represent a social network as a graph, which turns SNA problems into graph mining tasks. SNA-related tasks usually fall into three levels: node, link, and graph. Firstly, at the node level, centrality analysis aims to identify the key actors in the network via multiple metrics, such as degree, betweenness, eigenvector, etc. Likewise, node classification is mainly used to predict the roles of the nodes in the networks. Secondly, for link-related problems, link prediction infers the potential links between nodes. Thirdly, at the graph level, community detection determines clusters of nodes based on network structure and topology. Other research directions related to SNA also exist, such as information diffusion, community evolution, recommendations, etc.

Criminal Network Analysis (CNA) is imperative for future crime prediction and prevention. Researchers have successfully applied CNA in practice since the early 1990s [2–6]. Most of the social networks in the criminology field are incomplete due to a lack of intelligence, only partially captured evidence, the use of anti-detection strategies, etc. Therefore, it is crucial for law enforcement agencies to investigate missing objects, hidden connections, and potential groups. As one of the most vital techniques for intelligence analysis, CNA mostly conducts investigations on phone calls, financial transactions, physical meetups, etc.

However, graph-related problems in intelligence analysis, such as role prediction and relation inference, rely heavily on hand-crafted feature engineering, which requires in-depth domain knowledge and intensive labor. There are incredible challenges for the conventional methods when neighborhood information and the temporal evolution of the social networks are crucial for the predictions. Furthermore, the scarcity of reusability among different scenarios is another big obstacle. Therefore, the complexity of graph data has imposed a significant challenge upon the existing machine learning algorithms.

∗ Corresponding author at: University of Electronic Science and Technology of China, China.
E-mail address: kelvin.msj@gmail.com (S. Min).


1.2. Deep learning and graph neural networks

Deep learning has boosted research in many fields in the last decades, such as image classification, video processing, speech recognition, natural language processing, etc. In particular, researchers have successfully utilized Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to model spatial features and temporal dynamics, respectively. However, data like images, audio, and videos come in the form of pixels or chronologically ordered frames. At the same time, graph data is generated from the non-Euclidean domain and has no specific order. Intuitively, a graph may have an unfixed number of nodes with an indeterminate number of edges. Therefore, operations like convolutions are straightforward on images but complicated on graphs.

Recently, inspired by methods such as CNNs and RNNs, many studies have focused on extending deep learning methodologies to graph-related problems, called Graph Neural Networks (GNNs). The comprehensive survey in [7] categorizes GNNs into four groups: Recurrent Graph Neural Networks (RecGNNs), Convolutional Graph Neural Networks (ConvGNNs), Graph Auto-Encoders (GAEs), and Spatial–Temporal Graph Neural Networks (STGNNs). Early studies mainly fell into the RecGNNs group, which learns a node's representation by propagating neighbor information iteratively. Likewise, motivated by the success of CNNs, research efforts have been primarily put into ConvGNNs, which cast the notion of convolution into neural networks for graph data. On the other hand, GAEs encode the graph features into low-dimensional representations, from which they decode to obtain the graph information. Meanwhile, STGNNs aim to model both the spatial and temporal dynamicity of the graph.

1.3. Deep learning for social network analysis

Many studies have been conducted using deep learning frameworks to tackle SNA-related problems in recent years. The central problem in analyzing social networks is how to encode the network data into low-dimensional representations (vectors) that preserve the network structure and information effectively, so that downstream machine learning models can analyze social networks further. It has been demonstrated that deep learning methodology has potent capabilities for this crucial problem, and many successful methods have brought SNA to the next level, such as GCN [8], CTDNE [9], GraphSAGE [10], and GAT [11]. In study [12], the authors conduct a comprehensive review of the current studies utilizing deep learning models for social networks. It categorizes real-world social networks as homogeneous, heterogeneous, attributed, and dynamic. Our study mainly looks into dynamic social networks, which evolve over time with frequent addition/deletion of nodes and links. Although some existing studies have looked into learning dynamic networks, we novelly utilize the concept of STGNNs and the attention mechanism to model time-evolving social networks for better performance.

STGNNs consider both spatial and temporal dynamics when modeling the graph, while other GNNs mainly focus on modeling the spatial structure of networks. Many methods combine graph convolutions that capture spatial dependency with RNNs or CNNs that model the temporal dependence. Recent studies have put a lot of effort into STGNNs, such as for traffic forecasting [13–17], driver maneuver anticipation [18], and action recognition [19]. With the emerging STGNN methods having mostly looked into computer vision problems, there has been very little research conducted utilizing the concept of STGNNs for social networks.

From the spatial perspective, unlike the fixed-size images and grids that a CNN typically deals with, graphs have a more complex topological structure with no spatial locality like grids. Most recently, after some pioneering work in RecGNNs, ConvGNNs have become the most popular approach thanks to their efficiency and convenience. ConvGNNs fall into two categories: spectral-based and spatial-based. Spectral-based methods take the path of graph signal processing for graph convolution, and typical methods are the Spectral Convolutional Neural Network (Spectral CNN) [20], Chebyshev Spectral CNN (ChebNet) [21], Graph Convolutional Network (GCN) [8], etc. On the other hand, the spatial-based approach is similar to the convolutional operation of a conventional CNN on images. It mainly focuses on the spatial relations in networks, such as GraphSage [10], GAT [11], GaAN [17], GeniePath [22], DeepMGGE [23], etc.

In this study, the main aspect that we look into is the temporal dependency in social networks. Existing methods usually take RNN-based approaches to capture temporal dynamics. However, recurrent neural architectures follow a sequential path from past units to current ones, which leads to time-consuming iterative training and gradient explosion/vanishing issues. Attention, one of the most influential concepts in recent deep learning research, was initially invented to solve these sequential problems with much better performance. The study in [24] initially introduced an attention mechanism to memorize long sentences in neural machine translation. Then, it rapidly grew and was widely adopted in various fields. Previous studies include image caption generation [25], global/local attention [26], multi-head attention [27], the Simple Neural Attentive Learner (SNAIL) [28], the Self-Attention Generative Adversarial Network (SAGAN) [29], and the Graph Attention Model (GAM) [30].

In the criminology field, social networks can have a complex topological structure and also evolve. We cannot make any assumptions about which periods are more relevant for predictions. We have observed group interactions (behaviors) with various kinds of temporal patterns. As shown in Fig. 1, we demonstrate three examples in a meetup network, in which some may have a linear development trend, as shown in Cases 1 and 2. In contrast, others may produce a significant seasonality pattern, such as that shown in Case 3. With the temporal aspects of the network ignored, all three cases would produce similar prediction results even though they are entirely different. In our work, we focus our study on graphs that have both spatial and temporal characteristics. Inspired by the recent work on STGNNs, we aim to devise a spatial–temporal deep learning framework for time-evolving social networks. For spatial convolutions, we adopt the graph embedding approach over spectral methods for two main reasons. Firstly, we need to incorporate node features, as nodes in social networks have various attributes. Secondly, we deal with social networks that evolve, which means that the number of nodes and edges will change dynamically over time. One of our primary goals is to capture the temporal evolution and make it beneficial for subsequent tasks. Regarding temporal dependency, we novelly introduce the attention mechanism to model the temporal dynamics and design an approach to make the model more interpretable. Finally, our method can capture both the spatial and temporal aspects of time-evolving social networks.

Fig. 1. Temporal Characteristics in Criminal Networks (Meetup). In the following examples, we denote the target user, gang members, and ordinary users in yellow, red, and white, respectively. (1) Case 1: The meetups between the central user and gang members decrease over time and vanish in the end, which shows less group involvement and less potential danger. (2) Case 2: The meetups between the central user and gang members increase monthly, which is a strong signal that the user in question is joining the group as a new member. (3) Case 3: The target user contacts other gang members in a seasonal fashion, which indicates a high probability that he/she holds a specific kind of role or position in the group.

In summary, our main contributions are as follows:

1. We propose a novel framework of the Spatial–Temporal Graph Neural Network specifically for social network modeling, called STGSN. It models the spatial and temporal features of time-evolving social networks, which can be particularly useful for criminal network analysis.
2. To the best of our knowledge, our method is the first attempt that leverages attention mechanisms on graph embeddings to make the framework capable of modeling the temporal characteristics of social networks.


3. We discuss the strong expressive ability of the proposed method in depth by creating five temporal attention distribution categories and analyzing how they affect the downstream predictions.
4. We conduct extensive experiments on six public datasets with a series of prediction tasks. The results show that the model outperforms state-of-the-art baselines significantly.

2. Related work

2.1. Graph neural networks for social network analysis

DeepInf [31] is an end-to-end unified framework designed by Qiu et al., which utilizes graph convolution and attention mechanisms to incorporate user-specific features and network structures for predicting social influence. Zhang et al. propose a framework, SEAL [32], for link prediction. For each target link, SEAL extracts a local enclosing subgraph and uses a GNN to learn general graph structure features. Wang et al. propose an embedding model, namely Multiple Conditional Network Embedding (MCNE) [33]. Combined with a GNN based on the message-passing/receiving mechanism, the model introduces a binary mask, followed by an attention network to model correlations among multiple preferences. Liu et al. [34] point out that a single vector representation is not enough for network embedding, for instance, on an online shopping website where a customer may have bought items of disparate genres. Existing embedding techniques tend to fuse different aspects of a node into only a single vector representation, which can be problematic. This study proposes a polysemous embedding approach for modeling multiple aspects of nodes. Ioannidis et al. [35] also present a Graph Recurrent Neural Network (GRNN) where nodes engage in multi-relational scenarios. In study [36], motivated by the concept of GCNs, the authors devise a method, called RCNN, to identify the critical nodes (super-spreaders) in complex networks based on their message-spreading ability.

2.2. Spatial–temporal graph neural networks

Many studies have extensively applied the concept of STGNNs in the field of traffic forecasting. Li et al. [13] devise a deep learning framework called the Diffusion Convolutional Recurrent Neural Network (DCRNN) for traffic forecasting that combines spatial and temporal dependencies in traffic flow. The proposed model utilizes bidirectional random walks to capture the spatial correlation and an encoder–decoder architecture to seize the temporal dependence. Guo et al. [14] propose a novel attention-based spatial–temporal graph convolutional network (ASTGCN) for traffic forecasting problems. In this study, the attention mechanism aims to select the information that is relatively critical to the current task from both spatial and temporal perspectives. In study [15], Yu et al. propose a deep learning framework for traffic forecasting, called Spatial–Temporal Graph Convolutional Networks (STGCN), which integrates graph convolution layers and gated temporal convolution layers through spatial–temporal convolutional blocks. GaAN, proposed in study [17], unlike the existing multi-head attention mechanism, uses a convolutional sub-network to control each attention head's importance. The researchers build the Graph Gated Recurrent Unit (GGRU) by using GaAN as a building block to address the traffic speed forecasting problem. In study [9], researchers propose two algorithms, Continuous-Time Dynamic Network Embeddings (CTDNE) and TemporalWalk, which leverage a random walk strategy to incorporate temporal information into network embedding methods.

Motivated by the success of traffic forecasting, STGNNs have developed rapidly in other fields as well. In study [18], for driver maneuver anticipation, Jain et al. propose an approach named structural-RNN (S-RNN), which combines the power of spatial–temporal graphs and Recurrent Neural Networks (RNNs) by transforming arbitrary spatial–temporal graphs into a mixture of RNNs, and it effectively captures the interactions in the underlying spatial–temporal graphs.

Wu et al. [16] devise a novel CNN-based graph neural network architecture, namely Graph WaveNet. Motivated by WaveNet, the model adopts stacked dilated causal convolutions to capture temporal dependencies, which lets it handle very long sequences. Yan et al. [19] propose a novel Spatial–Temporal Graph Convolutional Networks (ST-GCN) model for skeleton-based action recognition. The model applies spatial and temporal graph convolutions on the skeleton sequences.

3. Methodology

We propose a deep learning framework for modeling both the spatial and temporal patterns of time-evolving social networks. The structure consists of three steps. Firstly, we utilize an embedding method to capture each node's spatial features for each time slice. Secondly, we propose an attention-based mechanism to aggregate the temporal memory of the graph so that the model can intentionally pay weighted attention to different historical steps; additionally, we focus on improving the interpretability of the model by looking into the temporal attention distribution. Thirdly, we assemble the above-mentioned neural networks for the downstream social network prediction tasks.

3.1. Framework overview

We first present the process of embedding a node's spatial features. Then, we illustrate Algorithm 1 for building the temporal attention network and discuss its expressive ability. Finally, we demonstrate how to put the pieces together and make the final prediction (see Table 1 for the notation).

Table 1
Summary of Notations.
Symbol           Definition
G                A social network graph
V                Set of nodes
E                Set of edges
X                Feature vector
W, B             Learning parameters
N(i)             The neighbors of node i
h_i^k            The embedding of the ith node within the kth layer
h_i^{t'}         The embedding of the ith node at the historical time step t'
h_i^{total}      The embedding of the ith node (all the time steps as a whole)
α                Attention weights
α^{⟨total,t'⟩}   The attention weight (time step t' to total)
e                The coefficient between two different embeddings
e^{⟨total,t'⟩}   The coefficient of the time step t' to the total embedding
C_i^t            Temporal attention context
ĥ_i^t            The final embedding of node i till time t
AGG()            Aggregation function
ATT()            Attention function

3.2. Graph spatial convolution

We construct a social network graph G = (V, E), where V = {v_1, v_2, v_3, ..., v_i} denotes the set of nodes (users) and E = {e_1, e_2, e_3, ..., e_i} defines the set of edges (relations) between them. Let X = {x_1, x_2, ..., x_i} be the set of feature vectors, where x_i represents the features of the ith user, such as age, origin, hobby, etc.

In our case, the nodes and edges in the social network keep changing as the network evolves. Therefore, instead of modeling the spatial dependency via the Graph Convolution Network (GCN) [8], we adopt the graph embedding approach, similar to GraphSage [10], to make our framework inductive. We aggregate the neighborhood features to the target node following Eq. (3). h_i^k represents the embedding of the ith node in the kth layer. N(i) is the set that contains the neighbors of node i within K hops. h_j^{k−1} denotes the feature embedding of neighbor j, while h_i^{k−1} denotes the embedding of node i in the (k − 1)th layer. W and B are the learning parameters that reflect the importance of the neighbors of node i and of node i itself in the previous layer during aggregation. Initially, when k = 0, h_i^0 equals x_i, as shown in Eq. (1), which is the feature vector of node i. Then, we aggregate all neighbors' information layer by layer until we reach the target node, node i. Finally, h_i^K is the final embedding of node i, denoted as z_i^K in Eq. (2).

h_i^0 = x_i    (1)

h_i^K = z_i^K    (2)

h_i^k = ReLU([ W_k · (Σ_{j∈N(i)} h_j^{k−1}) / |N(i)| ,  B_k h_i^{k−1} ]),  ∀k ∈ {1, 2, ..., K}    (3)
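To make the aggregation concrete, below is a minimal PyTorch sketch of one such layer, reading the bracket in Eq. (3) as the concatenation of the transformed neighborhood mean and the transformed self-embedding. The class name, the dense adjacency matrix, and the dimensions are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialAggLayer(nn.Module):
    """One layer of Eq. (3): concatenate W_k * mean of neighbor embeddings with B_k * self embedding."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # neighbor weights W_k
        self.B = nn.Linear(in_dim, out_dim, bias=False)  # self weights B_k

    def forward(self, h, adj):
        # h: (N, in_dim) embeddings h^{k-1}; adj: (N, N) dense 0/1 adjacency matrix.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)          # |N(i)|
        neigh_mean = adj @ h / deg                               # sum_j h_j^{k-1} / |N(i)|
        out = torch.cat([self.W(neigh_mean), self.B(h)], dim=1)  # bracket in Eq. (3)
        return torch.relu(out)

if __name__ == "__main__":
    N, d = 7, 16
    x = torch.randn(N, d)                    # h^0 = x (Eq. (1))
    adj = (torch.rand(N, N) > 0.7).float()
    layer1 = SpatialAggLayer(d, 8)
    layer2 = SpatialAggLayer(16, 8)          # input dim doubles after concatenation
    z = layer2(layer1(x, adj), adj)          # stacking K = 2 layers gives two-hop aggregation
    print(z.shape)                           # torch.Size([7, 16]); z plays the role of h^K = z^K
```

Stacking two such layers corresponds to the K = 2, two-hop aggregation illustrated in Fig. 2 below.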
Fig. 2. Node Embedding Example. Target node: v_1. Layer-0 neighbors: v_1, v_2, v_3, v_6, v_7. Layer-1 neighbors: v_2, v_3, v_4, v_5.

To give an illustration, as shown in Fig. 2, we have K = 2, which indicates that we aggregate the information from neighbors within two hops. However, the number of layers can be set dynamically for different applications; in our case, we only consider neighboring nodes within two hops. Taking node v_1 as an example, we embed the layer-0 neighbors v_2, v_3, v_6, v_7, and v_1 itself into the layer-1 neighbors v_2, v_3, v_4, and v_5, and then we further aggregate the layer-1 neighbors into the target node v_1 at layer 2. If we calculate node i's embedding across all the time steps as a whole, we denote it as h_i^{total}. In contrast, if the embedding is only for a specific time step t', we note it as h_i^{t'}.

3.3. Temporal attention neural network

As shown in Fig. 3, after the aggregation of the spatial features, we build an attention network on top of them to capture the temporal features as the network evolves.

Fig. 3. Temporal Attention Network.

Firstly, we calculate the global embedding of each node, denoted as h_i^{total}. Secondly, we carry out the aggregation at each time step independently. Let h_i^t be the embedding of node i at the present time t; h_i^0, h_i^1, ..., h_i^{t−1}, h_i^t represent the embeddings of node i at time steps 0, 1, ..., t − 1, t, respectively. Thirdly, we leverage the attention mechanism and calculate the coefficient between the embedding from each time step and the total embedding following Eq. (4), where t is the target time step in question and t' denotes a preceding time step. e_i^{⟨total,t'⟩} defines the magnitude of the importance of time step t' to the total embedding.

Intuitively, it determines how much attention the model should pay to time step t' when making predictions at time step t. We employ two training parameters: W^{t'} applied to h_i^{t'} and W^{total} applied to h_i^{total}. After applying a weight vector (a training parameter) a to the concatenation of W^{t'} h_i^{t'} and W^{total} h_i^{total}, we use LeakyReLU for non-linearity, as shown in Eq. (4). After that, we apply a softmax layer to produce the attention weight α_i^{⟨total,t'⟩} using Eq. (5). We multiply the hidden state h_i^{t'} of node i by its corresponding attention weight α_i^{⟨total,t'⟩} and training parameter W_i^{t'} for each time step. In the end, we sum them up and apply ELU as the non-linearity function to get the attention context C_i^t, as shown in Eq. (6).

e_i^{⟨total,t'⟩} = LeakyReLU( a^T [ W^{t'} h_i^{t'} , W^{total} h_i^{total} ] )    (4)

α_i^{⟨total,t'⟩} = Softmax( e_i^{⟨total,t'⟩} ) = exp( e_i^{⟨total,t'⟩} ) / Σ_{t'} exp( e_i^{⟨total,t'⟩} )    (5)

C_i^t = ELU( Σ_{t'=0}^{t} α_i^{⟨total,t'⟩} W_i^{t'} h_i^{t'} )    (6)

Algorithm 1 summarizes the whole procedure.

Algorithm 1 Algorithm for STGSN

Input: G = {V, E}, K = {# of hops of neighbors}, T = {# of time steps}
Output: H = {Node embeddings}
1: Initialization:
2: H^{total}, H^{t'} ← {}, {}
3: Spatial Aggregation:
4: for v_i ∈ G do
5:    Aggregate the K-hop neighborhood information globally to produce the total node embedding for node i.
6:    h_i^{total} = AGG(N(i))
7:    H^{total} ← H^{total} ∪ h_i^{total}
8: end for
9: Temporal Segmentation:
10: G^T ← {Segment the network in T time steps}
11: for g_t ∈ G^T do
12:    for v_i ∈ g_t do
13:       Aggregate the K-hop neighborhood information to produce node i's spatial embedding at time step t'.
14:       h_i^{t'} = AGG(N(i))
15:    end for
16:    H_i^{t'} ← H_i^{t'} ∪ h_i^{t'}
17:    H^{t'} ← H^{t'} ∪ H_i^{t'}
18: end for
19: Temporal Attention Network:
20: for v_i ∈ G do
21:    C_i^t ← {}
22:    Retrieve h_i^{total} and h_i^{t'} from H^{total} and H^{t'}, and calculate the temporal attention w.r.t. the time step t' to the total embedding.
23:    α^{⟨total,t'⟩} = ATT(h_i^{total}, h_i^{t'})
24:    C_i^t ← C_i^t + α^{⟨total,t'⟩} ∗ h_i^{t'}
25:    ĥ_i^t ← [C_i^t, h_i^{total}]
26:    H ← H ∪ ĥ_i^t
27: end for
28: return H
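The following is a hedged PyTorch sketch of Eqs. (4)–(6), i.e., of lines 19–26 of Algorithm 1. Using one linear map per time step for W^{t'} and a single attention vector a matches our reading of the equations; the exact dimensions and parameter sharing are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Attend over per-time-step embeddings h^{t'} w.r.t. the total embedding (Eqs. (4)-(6))."""
    def __init__(self, dim, num_steps):
        super().__init__()
        self.W_step = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(num_steps)])  # W^{t'}
        self.W_total = nn.Linear(dim, dim, bias=False)  # W^{total}
        self.a = nn.Linear(2 * dim, 1, bias=False)      # weight vector a

    def forward(self, h_steps, h_total):
        # h_steps: (T, N, dim) per-time-step embeddings; h_total: (N, dim) total embeddings.
        T = h_steps.shape[0]
        wt = self.W_total(h_total)                      # W^{total} h^{total}
        scores, projected = [], []
        for t in range(T):
            ws_h = self.W_step[t](h_steps[t])           # W^{t'} h^{t'}
            scores.append(F.leaky_relu(self.a(torch.cat([ws_h, wt], dim=1))))  # Eq. (4)
            projected.append(ws_h)
        alpha = torch.softmax(torch.stack(scores, dim=0), dim=0)   # Eq. (5): softmax over time steps
        context = sum(alpha[t] * projected[t] for t in range(T))   # weighted sum over t'
        return F.elu(context)                                      # Eq. (6): attention context C_i^t

if __name__ == "__main__":
    T, N, d = 5, 10, 16
    att = TemporalAttention(d, T)
    C = att(torch.randn(T, N, d), torch.randn(N, d))
    print(C.shape)  # torch.Size([10, 16])
```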
Another advantage of leveraging the attention mechanism is its interpretability. As mentioned above, existing research has intensively studied the temporal features of networks in the traffic forecasting field. For example, study [14] utilizes three time-series segments (recent, daily-period, and weekly-period) to intentionally focus on the last few minutes before the present and on the same periods on a daily and weekly basis. Unlike traffic forecasting, in our case, the temporal characteristics may vary dramatically from case to case. For social networks in the criminology field, law enforcement agencies need to know the prediction results, the ''what''. More importantly, they need to know the reasoning logic behind the prediction, the ''why''. Therefore, we must give the model a decent expressive ability to make the prediction (intelligence) model effective and practical. To achieve this goal, we categorize the temporal patterns into five classes (an illustrative detection sketch follows the list):

1. Increasing Trend: the time steps become more important as they are closer to the present.
2. Decreasing Trend: the historical time steps are more influential than the recent ones.
3. Seasonal: the time steps demonstrate a pattern of periodical fluctuation.
4. Key Player(s): one or several time slices show dominant influence over the others.
5. Random: there are no known patterns found, and all the time steps contribute randomly.
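The paper defines these five classes descriptively rather than algorithmically. Purely as an illustration, a rough heuristic such as the following (thresholds and decision rules are entirely our own assumptions) could bucket an attention-weight vector into one of the classes, e.g., for the visualization tool mentioned in Section 4.4:

```python
import numpy as np

def classify_attention(weights, trend_thr=0.6, key_thr=2.0):
    """Heuristically bucket a temporal attention distribution into the five classes."""
    w = np.asarray(weights, dtype=float)
    t = np.arange(len(w))
    # Key Player(s): any step far above the uniform share 1/T.
    if (w > key_thr / len(w)).any():
        return "Key Player(s)"
    # Trend: correlation between attention weight and time index.
    r = np.corrcoef(t, w)[0, 1]
    if r > trend_thr:
        return "Increasing Trend"
    if r < -trend_thr:
        return "Decreasing Trend"
    # Seasonal: one dominant non-zero frequency in the spectrum.
    spectrum = np.abs(np.fft.rfft(w - w.mean()))[1:]
    if spectrum.size and spectrum.max() > 2 * spectrum.mean():
        return "Seasonal"
    return "Random"

# Monotone weights like those of Fig. 4(a) fall into the increasing-trend bucket.
print(classify_attention([0.095, 0.10, 0.11, 0.12, 0.13, 0.135, 0.14, 0.145, 0.147]))
```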
3.4. Final aggregation

As shown in Eq. (7), after the temporal attention context is created, we build the final node embedding ĥ_i^t by aggregating the temporal attention context C_i^t and the node embedding h_i^{total}. In our case, we aggregate C_i^t and h_i^{total} by directly concatenating them. However, summing them up can also be a choice if it produces better performance.

In the end, ĥ_i^t is the ultimate representation of node i for the downstream prediction tasks.

ĥ_i^t = [ C_i^t , h_i^{total} ]    (7)

After the final aggregation, our method incorporates the spatial and temporal features into the node embedding ĥ_i^t, which can serve as an input beneficial to the downstream prediction tasks. For example, node classification tasks will take the node embedding to train the model, while link prediction algorithms will take two nodes' embeddings to build the model and make the predictions.
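As a sketch of Eq. (7) and of the downstream usage just described, a link prediction head could consume two concatenated final embeddings as below; the two-layer MLP scorer is our own choice and is not prescribed by the paper:

```python
import torch
import torch.nn as nn

def final_embedding(c_t, h_total):
    """Eq. (7): concatenate the temporal context C_i^t with the total embedding h_i^total."""
    return torch.cat([c_t, h_total], dim=1)

class LinkPredictor(nn.Module):
    """Score a candidate edge (i, j) from the two final node embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, h_i, h_j):
        return torch.sigmoid(self.mlp(torch.cat([h_i, h_j], dim=1)))

if __name__ == "__main__":
    N, d = 10, 16
    h_hat = final_embedding(torch.randn(N, d), torch.randn(N, d))   # (N, 2d)
    predictor = LinkPredictor(2 * d)
    print(predictor(h_hat[:1], h_hat[1:2]).item())  # probability of a link between nodes 0 and 1
```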
3.5. Parameters tuning

In this section, we discuss some parameter tunings for the proposed model.

Coefficient: When calculating the coefficient between two embeddings, there are a few different approaches. We can apply the dot product, as in Eq. (8), or a distance, following Eq. (9), on h_i^{t'} and h_i^{total}, which intuitively indicate how closely the two embeddings are related to each other. Another approach is to have a neural network learn the magnitude of the correlation between the two embeddings by using Eq. (4).

e_i^{⟨total,t'⟩} = h_i^{t'} · h_i^{total}    (8)

e_i^{⟨total,t'⟩} = √( ( h_i^{t'} − h_i^{total} )² )    (9)
Temporal Attention: After applying the attention on each friendship ratings change from time to time. Intuitively speaking,
time step, we build temporal attention as a context, namely Cit , the more two people meet up, the more intimate they become,
for the final embedding. There are some different methods for this and the higher the friendship ratings are. Ego-Comm is a perfect
context construction: concatenation, average sum, and pooling, example indicating temporal characteristics worthy considera-
etc. As shown in Eq. (6), one effective way is to sum the embed- tion when looking at social network prediction. For InVS13 and
dings from all time-slices and produce a final context vector Cit InVS15, the co-presence data are formatted as {t , i, j}. Each line
with the same size as htotal
i . The intuition is that the importance represents a contact occurring at time t between two nodes i
of all time steps combined has the same magnitude as the total and j. InVS13 has 100 users and 394,247 communications. In
embedding. On the other hand, the concatenation approach pays contrast, InVS15 has 232 users and 1,283,194 communications.
more attention to preceding time slices and equally treats all time They both have ten days of data over two weeks and also provide
steps (including the current one). the metadata indicating the user’s group affiliation: department
Attention Heads: To calculate the temporal attention Cit , we in this case. For communication networks, in IARadoslawEmail
can combine the α ⟨total,t ⟩ (attention weight), Wit (training pa-
′ ′
and DNC-Email, nodes are senders and recipients, while edges

rameter), and hti (embedding from time t ′ ) following Eq. (6). are email exchanges. IARadoslawEmail has 167 users and 82927
Inspired by the study of Multi-head Attention [27], we can utilize emails over ten months, DNC-Email has 2029 and 39264 ex-
a multi-head mechanism and try to achieve better performance changes over 18 months, and FB-Forum contains 899 users and
by following Eq. (10), in which k represents the number of atten- the 33720 interactive activities among them over 25 weeks.
tion heads employed. However, after a series of experiments, we In the experiments, we employ the area under the ROC curve
find that there is not much difference between single-head and (AUC) and F1 (micro), short for the micro-averaged F1 score,
multi-head setup w.r.t the effectiveness. Therefore, in our case, to measure performance on binary and multi-class classifica-
we employ single-head other than multi-head to get the same tion tasks, respectively. Meanwhile, we use Mean Absolute Error
level of performance with much less computational cost. (MAE) and Mean Squared Error (MSE) to compare the regres-
( ) sion tasks. MAE measures the absolute average distance between
K t
∑ ∑ ⟨total,t ′ ⟩ ′ ′ the real data and the predicted data, while MSE measures the
Cit = ELU αi Wit hti (10) squared-average distance between the real data and the predicted
k=1 t ′ =0 data.

Table 3
MAE and MSE of Algorithm (Regression on Ego-Comm).
Baseline {1 − 9} → 9 {10 − 18} → 18
MAE MSE MAE MSE
GCN 2.316 ± 0.037 8.391 ± 0.392 1.983 ± 0.015 6.222 ± 0.144
CTDNE 2.237 ± 0.117 7.294 ± 0.092 1.921 ± 0.017 6.291 ± 0.068
TemporalWalk 2.195 ± 0.051 7.019 ± 0.027 1.881 ± 0.022 5.760 ± 0.113
GraphSAGE (Mean) 2.234 ± 0.025 7.003 ± 0.241 1.926 ± 0.032 6.581 ± 0.078
GraphSAGE (MaxPool) 2.191 ± 0.100 6.809 ± 0.277 1.737 ± 0.019 5.415 ± 0.047
GraphSAGE (MeanPool) 2.185 ± 0.051 6.735 ± 0.291 1.950 ± 0.022 6.676 ± 0.106
GAT (Single-head) 2.222 ± 0.021 7.599 ± 0.322 1.990 ± 0.034 6.721 ± 0.288
GAT (Multi-head) 2.195 ± 0.048 7.371 ± 0.348 1.789 ± 0.062 5.631 ± 0.271
STGSN (No Attention) 2.559 ± 0.172 9.618 ± 0.393 2.036 ± 0.073 7.451 ± 0.311
STGSN (Single-head) 2.169 ± 0.011 6.913 ± 0.271 1.730 ± 0.028 5.302 ± 0.093
STGSN (Multi-head) 2.068 ± 0.021 6.265 ± 0.228 1.770 ± 0.025 5.581 ± 0.089

Table 4
F1(Micro) Score of Algorithm (Classification on InVS13).
Baseline {1 − 6} → 7 {1 − 7} → 8 {1 − 8} → 9 {1 − 9} → 10
F1 (Micro) F1 (Micro) F1 (Micro) F1 (Micro)
GCN 0.416 ± 0.021 0.425 ± 0.039 0.420 ± 0.037 0.438 ± 0.026
CTDNE 0.520 ± 0.012 0.494 ± 0.011 0.521 ± 0.017 0.529 ± 0.018
TemporalWalk 0.555 ± 0.011 0.411 ± 0.027 0.478 ± 0.012 0.560 ± 0.013
GraphSAGE (Mean) 0.551 ± 0.026 0.513 ± 0.022 0.568 ± 0.025 0.634 ± 0.028
GraphSAGE (MaxPool) 0.527 ± 0.029 0.511 ± 0.006 0.582 ± 0.019 0.641 ± 0.011
GraphSAGE (MeanPool) 0.503 ± 0.025 0.516 ± 0.031 0.574 ± 0.029 0.621 ± 0.013
GAT (Single-head) 0.434 ± 0.021 0.409 ± 0.017 0.474 ± 0.017 0.514 ± 0.016
GAT (Multi-head) 0.436 ± 0.051 0.410 ± 0.027 0.463 ± 0.034 0.597 ± 0.033
STGSN (No Attention) 0.4696 ± 0.031 0.502 ± 0.032 0.538 ± 0.024 0.609 ± 0.033
STGSN (Single-head) 0.589 ± 0.021 0.525 ± 0.025 0.590 ± 0.005 0.652 ± 0.021
STGSN (Multi-head) 0.511 ± 0.031 0.502 ± 0.036 0.542 ± 0.023 0.634 ± 0.029

Table 5
F1(Micro) Score of Algorithm (Classification on InVS15).
Baseline {1 − 6} → 7 {1 − 7} → 8 {1 − 8} → 9 {1 − 9} → 10
F1 (Micro) F1 (Micro) F1 (Micro) F1 (Micro)
GCN 0.382 ± 0.014 0.364 ± 0.016 0.367 ± 0.011 0.406 ± 0.019
CTDNE 0.576 ± 0.009 0.505 ± 0.012 0.473 ± 0.010 0.531 ± 0.017
TemporalWalk 0.579 ± 0.017 0.509 ± 0.021 0.494 ± 0.029 0.497 ± 0.011
GraphSAGE (Mean) 0.849 ± 0.002 0.559 ± 0.010 0.574 ± 0.003 0.636 ± 0.005
GraphSAGE (MaxPool) 0.838 ± 0.004 0.553 ± 0.012 0.566 ± 0.001 0.611 ± 0.013
GraphSAGE (MeanPool) 0.842 ± 0.006 0.546 ± 0.008 0.576 ± 0.006 0.613 ± 0.010
GAT (Single-head) 0.513 ± 0.016 0.462 ± 0.018 0.438 ± 0.013 0.457 ± 0.017
GAT (Multi-head) 0.518 ± 0.011 0.477 ± 0.014 0.456 ± 0.008 0.469 ± 0.012
STGSN (No Attention) 0.849 ± 0.012 0.527 ± 0.029 0.578 ± 0.013 0.573 ± 0.019
STGSN (Single-head) 0.852 ± 0.005 0.573 ± 0.010 0.578 ± 0.005 0.631 ± 0.023
STGSN (Multi-head) 0.852 ± 0.017 0.556 ± 0.020 0.575 ± 0.035 0.604 ± 0.026

4.2. Experiment setup

For meetup networks, Ego-Comm [37]: The nodes are users and their friends, while the edges indicate the face-to-face communications. There is only one node feature, the ID (we utilize one-hot encoding to encode the user's ID). We divide the dataset into 18 slices, in which one piece contains the activities for one month. We use the face-to-face communication data in months 1 − 9 to predict the friendship ratings given in the 9th month and use months 10 − 18 to predict the ratings provided in the 18th month. The friendship ratings scale from 0 to 10, and we set up this case as a regression task. We denote the two experiments as {1 − 9} → 9 and {10 − 18} → 18. To train the prediction task, we split the dataset into 80% for training (with 20% used for validation) and 20% for testing. InVS13 and InVS15 [38]: We segment the co-presence data into ten slices, and each piece represents the meetups for a working day. In the network, nodes are the users in the workplace, and the co-presence forms edges. The node features consist of the user's ID (one-hot encoding) and the department. The prediction goal is to use the node embedding to predict the extent of co-occurrences of given pairs on a given day. We categorize the number of co-presences following the 0.35 and 0.7 percentiles into three classes, Rare (0 − 0.35), Normal (0.35 − 0.7), and Frequent (0.7 − 1), to make the case a classification task. We denote the experiment as {1 − 7} → 7, indicating using the 1st to the 7th days' data for the 7th-day prediction. We conduct multiple rounds of experiments as follows: {1 − 7} → 7, {1 − 8} → 8, {1 − 9} → 9, {1 − 10} → 10. In this case, we use 80% of the data for training (with 20% used for validation) and 20% for testing.

For communication networks, IARadoslawEmail [39]: we divide it into ten slices (one slice per month) and then organize the link prediction tasks as follows: {1 − 5} → 6, {1 − 6} → 7, {1 − 7} → 8, and {1 − 8} → 9. We segment DNC-Email [41] daily and select the data from 2016-05-02 to 2016-05-09. For simplicity, we denote 2016-05-02, 2016-05-03, . . . , 2016-05-08 as 1, 2, 3, . . . , 7 and design the following experiments: {1 − 7} → 8, {1 − 8} → 9, {1 − 9} → 10, and {1 − 10} → 11. At last, FB-Forum [40] has 25 weeks of data ranging from week 19 to week 34. We slice the data weekly and ignore week 19, which has only 295 records, exceptionally few compared to other weeks. We set up the prediction tasks from week 20 to 27 as {1 − 4} → 5, {1 − 5} → 6, {1 − 6} → 7, and {1 − 7} → 8. Similarly, we use 80% of the data for training (with 20% used for validation) and 20% for testing.
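As an illustration of the slicing described above (monthly for Ego-Comm and IARadoslawEmail, daily for DNC-Email, weekly for FB-Forum), a temporal edge list in the {t, i, j} format can be segmented with pandas; the column names and toy rows are our own:

```python
import pandas as pd

# Toy temporal edge list in the {t, i, j} style described for InVS13/InVS15.
edges = pd.DataFrame({
    "t":   pd.to_datetime(["2016-05-02", "2016-05-02", "2016-05-03", "2016-05-09"]),
    "src": [1, 2, 1, 3],
    "dst": [2, 3, 3, 1],
})

# One graph snapshot per calendar day; each group becomes one time step g_t.
slices = {day: g[["src", "dst"]].to_numpy() for day, g in edges.groupby(edges["t"].dt.date)}
for day, edge_list in sorted(slices.items()):
    print(day, edge_list.tolist())
```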

Table 6
ROC-AUC Score of Algorithm (Classification on IARadoslawEmail).
Baseline {1 − 5} → 6 {1 − 6} → 7 {1 − 7} → 8 {1 − 8} → 9
ROC-AUC ROC-AUC ROC-AUC ROC-AUC
GCN 0.787 ± 0.012 0.819 ± 0.006 0.820 ± 0.004 0.778 ± 0.017
CTDNE 0.748 ± 0.019 0.751 ± 0.024 0.747 ± 0.009 0.761 ± 0.014
TemporalWalk 0.817 ± 0.015 0.796 ± 0.017 0.822 ± 0.011 0.814 ± 0.023
GraphSAGE (Mean) 0.852 ± 0.009 0.845 ± 0.011 0.853 ± 0.006 0.808 ± 0.015
GraphSAGE (MaxPool) 0.839 ± 0.014 0.845 ± 0.009 0.832 ± 0.010 0.820 ± 0.008
GraphSAGE (MeanPool) 0.830 ± 0.016 0.855 ± 0.014 0.845 ± 0.006 0.798 ± 0.023
GAT (Single-head) 0.821 ± 0.012 0.812 ± 0.008 0.795 ± 0.016 0.795 ± 0.021
GAT (Multi-head) 0.824 ± 0.009 0.825 ± 0.002 0.818 ± 0.007 0.832 ± 0.015
STGSN (No Attention) 0.850 ± 0.017 0.855 ± 0.018 0.837 ± 0.009 0.815 ± 0.013
STGSN (Single-head) 0.863 ± 0.004 0.857 ± 0.013 0.861 ± 0.012 0.842 ± 0.011
STGSN (Multi-head) 0.866 ± 0.013 0.860 ± 0.025 0.863 ± 0.023 0.825 ± 0.017

Table 7
ROC-AUC Score of Algorithm (Classification on FB-Forum).
Baseline {1 − 4} → 5 {1 − 5} → 6 {1 − 6} → 7 {1 − 7} → 8
ROC-AUC ROC-AUC ROC-AUC ROC-AUC
GCN 0.627 ± 0.023 0.546 ± 0.013 0.610 ± 0.028 0.556 ± 0.029
CTDNE 0.816 ± 0.016 0.831 ± 0.010 0.795 ± 0.014 0.828 ± 0.015
TemporalWalk 0.831 ± 0.017 0.863 ± 0.022 0.771 ± 0.018 0.855 ± 0.017
GraphSAGE (Mean) 0.777 ± 0.015 0.810 ± 0.011 0.812 ± 0.012 0.777 ± 0.021
GraphSAGE (MaxPool) 0.815 ± 0.008 0.847 ± 0.016 0.838 ± 0.013 0.901 ± 0.014
GraphSAGE (MeanPool) 0.784 ± 0.023 0.844 ± 0.014 0.783 ± 0.030 0.757 ± 0.023
GAT (Single-head) 0.703 ± 0.021 0.707 ± 0.035 0.644 ± 0.023 0.640 ± 0.024
GAT (Multi-head) 0.684 ± 0.016 0.663 ± 0.028 0.686 ± 0.026 0.760 ± 0.019
STGSN (No Attention) 0.845 ± 0.027 0.880 ± 0.008 0.841 ± 0.013 0.913 ± 0.011
STGSN (Single-head) 0.853 ± 0.017 0.880 ± 0.011 0.859 ± 0.009 0.916 ± 0.012
STGSN (Multi-head) 0.854 ± 0.031 0.930 ± 0.026 0.869 ± 0.025 0.841 ± 0.016

4.3. Baselines

In the experiments, we compare our method with five state-of-the-art methods: GCN [8], CTDNE [9], TemporalWalk [9], GraphSAGE [10], and GAT [11]. We apply three different aggregation strategies (Mean, MeanPool, MaxPool) on GraphSAGE, and single-head and multi-head attention mechanisms on GAT. The baselines are as follows:

1. GCN [8]: Like a CNN, GCN learns the target node's features by performing convolutional operations on the neighboring cells. As a typical spectral-based approach, the model generally assumes fixed graphs as input. It is not very suitable for modeling time-evolving networks that frequently encounter unseen nodes.
2. CTDNE [9]: CTDNE is a general random walk-based framework for incorporating temporal information into network embedding from continuous-time dynamic networks. It is worth comparing our method against it because it is also a deep learning model considering the network's temporal aspects.
3. TemporalWalk [9]: This model is a temporal walk version of CTDNE. When applying random walks, CTDNE (static random walks) specifies the number of walks to run per node in the graph. In contrast, TemporalWalk (temporal random walks) defines the number of context windows we are interested in obtaining.
4. GraphSAGE [10]: The model is a general inductive embedding framework that effectively aggregates features from a node's local neighbors and generates embeddings for previously unseen data. In our experiments, we apply this method to aggregate all neighbors as a whole without considering any temporal characteristics. For the aggregation strategy, we adopt Mean (average), MeanPool (element-wise mean), and MaxPool (element-wise max).
5. GAT [11]: Instead of paying equal attention to all the neighbors during the embedding process, GAT computes the node embeddings by applying an extra-layer attention strategy, which implicitly assigns different weights to different neighbors. For the GAT experiments, we use it for the spatial-neighbor embeddings with both single-head and multi-head attention.

4.4. Effectiveness of the proposed method

To evaluate the effectiveness of the proposed method, we compare the popular methods GCN [8], CTDNE [9], TemporalWalk [9], GraphSage [10], and GAT [11] to STGSN on all six datasets. As shown in Tables 3, 4, 5, 6, 7, 8, we can make the following observations:

1. STGSN outperforms the state-of-the-art methods when the network has temporal attributes. On all six datasets, STGSN has the best overall results. Except for one case of second place, STGSN wins all the cases. For meetup networks, in the case of Ego-Comm, it produces an MAE of 2.068 and an MSE of 6.265 for {1 − 9} → 9. For {10 − 18} → 18, STGSN offers the best results: MAE 1.730 and MSE 5.302. For InVS13 and InVS15, except for {1 − 9} → 10 in InVS15, where our method has the second-best F1 score (0.631) while GraphSage (Mean) produced the best (0.636), our method yields the best F1 scores in all other cases: 0.589, 0.525, 0.590, 0.652 for InVS13 and 0.852, 0.573, 0.578, 0.631 (2nd place) for InVS15. For all the communication networks, STGSN outperforms all of the baselines by a significant margin, producing the best ROC-AUC values: 0.866, 0.860, 0.863, and 0.842 for IARadoslawEmail; 0.854, 0.930, 0.869, and 0.916 for FB-Forum; and finally 0.971, 0.952, 0.959, and 0.959 for DNC-Email. We look into the results from the following aspects:

(a) STGSN is the best-performing algorithm on all datasets. It performs the best on both meetup and communication networks. The results prove that our method has powerful capabilities of modeling social networks with temporal characteristics.

Table 8
ROC-AUC Score of Algorithm (Classification on DNC-Email).
Baseline {1 − 7} → 8 {1 − 8} → 9 {1 − 9} → 10 {1 − 10} → 11
ROC-AUC ROC-AUC ROC-AUC ROC-AUC
GCN 0.677 ± 0.027 0.654 ± 0.019 0.6484 ± 0.021 0.623 ± 0.016
CTDNE 0.894 ± 0.013 0.891 ± 0.022 0.904 ± 0.017 0.911 ± 0.011
TemporalWalk 0.834 ± 0.015 0.817 ± 0.019 0.848 ± 0.007 0.868 ± 0.014
GraphSAGE (Mean) 0.848 ± 0.011 0.856 ± 0.017 0.831 ± 0.008 0.797 ± 0.025
GraphSAGE (MaxPool) 0.884 ± 0.013 0.914 ± 0.009 0.907 ± 0.014 0.927 ± 0.022
GraphSAGE (MeanPool) 0.860 ± 0.015 0.868 ± 0.016 0.869 ± 0.011 0.858 ± 0.012
GAT (Single-head) 0.813 ± 0.024 0.868 ± 0.018 0.798 ± 0.012 0.915 ± 0.012
GAT (Multi-head) 0.910 ± 0.017 0.885 ± 0.016 0.832 ± 0.023 0.830 ± 0.011
STGSN (No Attention) 0.940 ± 0.024 0.922 ± 0.027 0.914 ± 0.025 0.909 ± 0.016
STGSN (Single-head) 0.971 ± 0.010 0.945 ± 0.011 0.959 ± 0.014 0.959 ± 0.013
STGSN (Multi-head) 0.968 ± 0.018 0.952 ± 0.018 0.950 ± 0.009 0.919 ± 0.026

(b) It is worth pointing out that STGSN, in particular, has superior performance on communication networks: IARadoslawEmail, FB-Forum, and DNC-Email. This indicates that temporal patterns are crucial for communication networks (email and comment exchanges), for example, to answer questions like: will two people email each other in the next few days? The answer may heavily depend on whether they had email exchanges recently. In this case, the recent time steps are more important than the historical ones. It shows that our method can robustly capture the influential temporal patterns by slicing the network interactions into time steps and paying weighted attention to them.

(c) Attention plays a critical role in our framework. It is noticeable that STGSN without the attention mechanism, namely STGSN (No Attention), does not perform as well as the STGSN models with attention. It even produces the worst performance in the Ego-Comm case. Compared to methods like GraphSAGE that look at the network structure as a whole, learning the node representation in a time-slice fashion without attention is not much of an improvement. Moreover, if we apply the time segmentation incautiously, the node-aggregation approach may even yield unsatisfactory performance.

2. STGSN with single-head and multi-head attention shows little difference w.r.t. performance. In our experiments, we have tried both single-head and multi-head setups. The outcomes indicate that our method generally achieves the same performance level with multi-head attention and single-head attention, which implies that multiple attention heads have no better capability of seizing more hidden information about the temporal characteristics, contrary to what might be expected. In studies [42] and [43], the authors observe similar phenomena: multi-heads are not necessarily better than single-head attention, and pruning the vast majority of heads can still obtain a decent performance level. How multiple heads affect performance in social network analysis is a promising future research direction worth further exploring. We consider this topic beyond our research scope and decide to focus on our goal of modeling time-evolving social networks. As shown in Tables 3, 4, 5, 6, 7, 8, we list the scores produced by STGSN with both single-head and multi-head attention. In practice, we choose the single-head setting, which achieves the same level of performance as the multi-head setup but with much less computational overhead.

3. The temporal importance distribution varies from case to case in time-evolving social networks. In our experiments, we closely look into how attention distributions vary in different cases. It is noticeable that the temporal attentions are distributed differently in various scenarios. For example, Fig. 4(a) shows a friendship prediction in the 9th month based on the 1st to 9th months' face-to-face communications. The attention weights increase chronologically from 0.095 (Month 1) to 0.147 (Month 9), which shows an increasing-trend pattern: the more recent the month, the more decisive the influence it contributes. In contrast, Fig. 4(b) is one example from FB-Forum indicating a decreasing-trend pattern. Furthermore, Fig. 4(c) and 4(d) demonstrate a seasonal pattern. In Fig. 4(c), every second day shows a more substantial influence. Meanwhile, Fig. 4(d) shows a co-presence prediction on a Friday that is influenced mostly by the previous Fridays. Two Fridays have attention weights 0.192 and 0.126, respectively, which are significantly more influential than all other weekdays. At last, Fig. 4(e) and Fig. 4(f) show the key-player(s) pattern. As shown in Fig. 4(e), this example from FB-Forum has week 2 with the most prominent contribution. In this case, there is only one key player: week 2. In contrast, the InVS13 example in Fig. 4(f) has multiple key time steps: two Tuesdays and a Friday as key players.

Fig. 4. Attention Distribution Examples.

Unlike some models used in a low-risk environment, where a mistake will not have serious consequences, in our case, it is not enough for the police officers to know just the prediction (the what). The model must demonstrate how it comes to the prediction (the why). Therefore, when showing the prediction result, we also designed a visualization tool to show the temporal attention distribution (chart and pattern class), which has proven to be very helpful in practice. By doing this, the officers can use the prediction (what + why) as assistance and combine it with their domain knowledge to make the final call.

5. Conclusion and future work

In this paper, we propose a spatial–temporal graph neural network, STGSN, specifically for social networks. The proposed framework firstly utilizes the graph embedding approach to aggregate the neighborhood information for the target nodes. Secondly, we introduce a novel attention mechanism to build a temporal context, enabling the algorithm to pay weighted attention to different time steps. Thirdly, we devise a method to improve the interpretability of the model. Finally, the algorithm combines the spatial and temporal features for the downstream prediction tasks. We conducted extensive experiments on six public datasets. The results show that the proposed framework outperforms state-of-the-art methods by producing significantly better performance in the downstream prediction tasks.

Some possible future research directions are: devising a better graph representation method that fully considers the neighborhood's node IDs, node attributes, and edge attributes; and developing an optimized time-slicing method for better temporal segmentation.

CRediT authorship contribution statement

Shengjie Min: Conceptualization, Methodology, Writing - original draft, Software. Zhan Gao: Conceptualization, Writing - original draft, Software. Jing Peng: Writing - original draft, Software. Liang Wang: Software, Visualization, Investigation. Ke Qin: Writing - review & editing. Bo Fang: Data curation, Software, Visualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We thank the Sichuan Provincial Public Security Department experts, who provided insight and expertise that greatly assisted the research. The Sichuan Provincial Public Security Department supported this work under the Intelligence Command Operational Platform Program.

References

[1] Stanley Wasserman, Katherine Faust, Social Network Analysis, Cambridge University Press, 1994.
[2] Emilio Ferrara, Pasquale De Meo, Salvatore Catanese, Giacomo Fiumara, Detecting criminal organizations in mobile phone networks, Expert Syst. Appl. 41 (13) (2014) 5733–5750.
[3] Rafał Drezewski, Jan Sepielak, Wojciech Filipkowski, The application of social network analysis algorithms in a system supporting money laundering detection, Inform. Sci. 295 (2015) 18–32.
[4] Giulia Berlusconi, Francesco Calderoni, Nicola Parolini, Marco Verani, Carlo Piccardi, Link prediction in criminal networks: A tool for criminal intelligence analysis, PLoS One 11 (4) (2016) 1–21.
[5] Andrea Fronzetti Colladon, Elisa Remondi, Using social network analysis to prevent money laundering, Expert Syst. Appl. 67 (2017) 49–58.
[6] S. Min, G. Luo, Z. Gao, J. Peng, K. Qin, Resonance - An intelligence analysis framework for social connection inference via mining co-occurrence patterns over multiplex trajectories, IEEE Access 8 (2020) 24535–24548.
[7] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, Philip S. Yu, A comprehensive survey on graph neural networks, IEEE Trans. Neural Netw. Learn. Syst. (2020) 1–21.
[8] Thomas N. Kipf, Max Welling, Semi-supervised classification with graph convolutional networks, in: 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, 2017, pp. 1–14.


[9] Giang Hoang Nguyen, John Boaz Lee, Ryan A. Rossi, Nesreen K. Ahmed, Eunyee Koh, Sungchul Kim, Continuous-time dynamic network embeddings, in: Companion Proceedings of the The Web Conference 2018, 2018, pp. 969–976.
[10] Nesreen K. Ahmed, Ryan A. Rossi, Rong Zhou, John Boaz Lee, Xiangnan Kong, Theodore L. Willke, Hoda Eldardiry, Inductive representation learning in large attributed graphs, in: NIPS, 2017, pp. 1–11.
[11] Petar Veličković, Arantxa Casanova, Pietro Liò, Guillem Cucurull, Adriana Romero, Yoshua Bengio, Graph attention networks, in: 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, 2018, pp. 1–12.
[12] Qiaoyu Tan, Ninghao Liu, Xia Hu, Deep representation learning for social network analysis, Front. Big Data 2 (2019) 1–10.
[13] Yaguang Li, Rose Yu, Cyrus Shahabi, Yan Liu, Diffusion convolutional recurrent neural network: Data-driven traffic forecasting, in: International Conference on Learning Representations, 2018, pp. 1–16.
[14] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, Huaiyu Wan, Attention based spatial-temporal graph convolutional networks for traffic flow forecasting, Proc. AAAI Conf. Artif. Intell. 33 (2019) 922–929.
[15] Bing Yu, Haoteng Yin, Zhanxing Zhu, Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, 2018, pp. 3634–3640.
[16] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, Chengqi Zhang, Graph wavenet for deep spatial-temporal graph modeling, in: IJCAI International Joint Conference on Artificial Intelligence, 2019, pp. 1907–1913.
[17] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, Dit Yan Yeung, GaAN: Gated attention networks for learning on large and spatiotemporal graphs, in: 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, 2018, pp. 339–349.
[18] Ashesh Jain, Amir R. Zamir, Silvio Savarese, Ashutosh Saxena, Structural-RNN: Deep learning on spatio-temporal graphs, in: CVPR, 2016, pp. 5308–5317.
[19] Yong Li, Zihang He, Xiang Ye, Zuguo He, Kangrong Han, Spatial temporal graph convolutional networks for skeleton-based dynamic hand gesture recognition, Eurasip J. Image Video Process. 2019 (1) (2019).
[20] Joan Bruna, Wojciech Zaremba, Arthur Szlam, Yann LeCun, Spectral networks and deep locally connected networks on graphs, in: 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings, 2014, pp. 1–14.
[21] Michaël Defferrard, Xavier Bresson, Pierre Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, Comput. Mater. Sci. 152 (59) (2016) 60–69.
[22] Ziqi Liu, Chaochao Chen, Longfei Li, Jun Zhou, Xiaolong Li, Le Song, Yuan Qi, Geniepath: Graph neural networks with adaptive receptive paths, Proc. AAAI Conf. Artif. Intell. 33 (2019) 4424–4431.
[23] Shun Fu, Guoyin Wang, Shuyin Xia, Li Liu, Deep multi-granularity graph embedding for user identity linkage across social networks, Knowl.-Based Syst. 193 (2020) 105301.
[24] Dzmitry Bahdanau, Kyung Hyun Cho, Yoshua Bengio, Neural machine translation by jointly learning to align and translate, in: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 2015, pp. 1–15.
[25] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: 32nd International Conference on Machine Learning, ICML 2015, 2015, pp. 2048–2057.
[26] Minh Thang Luong, Hieu Pham, Christopher D. Manning, Effective approaches to attention-based neural machine translation, in: Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention is all you need, Adv. Neural Inf. Process. Syst. (2017) 5999–6009.
[28] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, Pieter Abbeel, A simple neural attentive meta-learner, in: 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, 2018, pp. 1–17.
[29] Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena, Self-attention generative adversarial networks, in: 36th International Conference on Machine Learning, ICML 2019, 2019, pp. 12744–12753.
[30] John Boaz Lee, Ryan Rossi, Xiangnan Kong, Graph classification using structural attention, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, New York, NY, USA, 2018, pp. 1666–1674.
[31] Jiezhong Qiu, Jian Tang, Hao Ma, Yuxiao Dong, Kuansan Wang, Jie Tang, DeepInf: Social influence prediction with deep learning, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 2110–2119.
[32] Muhan Zhang, Yixin Chen, Link prediction based on graph neural networks, Adv. Neural Inf. Process. Syst. (2018) 5165–5175.
[33] Hao Wang, Tong Xu, Qi Liu, Defu Lian, Enhong Chen, Dongfang Du, Han Wu, Wen Su, MCNE: An end-to-end framework for learning multiple conditional network representations of social network, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019, pp. 1064–1072.
[34] Ninghao Liu, Qiaoyu Tan, Yuening Li, Hongxia Yang, Jingren Zhou, Xia Hu, Is a single vector enough? Exploring node polysemy for network embedding, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019, pp. 932–940.
[35] Vassilis N. Ioannidis, Antonio G. Marques, Georgios B. Giannakis, A recurrent graph neural network for multi-relational data, in: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2019, pp. 8157–8161.
[36] En Yu Yu, Yue Ping Wang, Yan Fu, Duan Bing Chen, Mei Xie, Identifying critical nodes in complex networks via graph convolutional networks, Knowl.-Based Syst. 198 (2020).
[37] Jari Saramäki, E.A. Leicht, Eduardo López, Sam G.B. Roberts, Felix Reed-Tsochas, Robin I.M. Dunbar, Persistence of social signatures in human communication, Proc. Natl. Acad. Sci. USA 111 (3) (2014) 942–947.
[38] Mathieu Génois, Alain Barrat, Can co-location be used as a proxy for face-to-face contacts? EPJ Data Sci. 7 (1) (2018) 11.
[39] Radosław Michalski, Sebastian Palus, Przemysław Kazienko, Matching organizational structure and social network extracted from email communication, in: Lecture Notes in Business Information Processing, vol. 87, Springer Berlin Heidelberg, 2011, pp. 197–206.
[40] T. Opsahl, Triadic closure in two-mode networks: Redefining the global and local clustering coefficients, Social Networks (2011).
[41] DNC emails network dataset – KONECT, 2017.
[42] Paul Michel, Omer Levy, Graham Neubig, Are sixteen heads really better than one? Adv. Neural Inf. Process. Syst. 32 (2019) 1–11.
[43] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov, Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned, in: ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 2020, pp. 5797–5808.
