Signal Processing
journal homepage: www.elsevier.com/locate/sigpro
Article history:
Received 31 December 2020
Revised 10 September 2021
Accepted 14 September 2021
Available online 16 October 2021

Keywords:
Generative graph model
Graph neural network
Adversarial attack
Recommender systems

Abstract: There has been increased interest in applying machine learning techniques to relational, structured data based on an observed graph. Often, this graph is not fully representative of the true relationships amongst nodes. In these settings, building a generative model conditioned on the observed graph allows us to take the graph uncertainty into account. Various existing techniques either rely on restrictive assumptions, fail to preserve topological properties within the samples, or are prohibitively expensive for larger graphs. In this work, we introduce the node copying model for constructing a distribution over graphs. Sampling of a random graph is carried out by replacing each node's neighbors by those of a randomly sampled similar node. The sampled graphs preserve key characteristics of the graph structure without explicitly targeting them. Additionally, sampling from this model is extremely simple and scales linearly with the number of nodes. We show the usefulness of the copying model in three tasks. First, in node classification, a Bayesian formulation based on node copying achieves higher accuracy in sparse data settings. Second, we employ our proposed model to mitigate the effect of adversarial attacks on the graph topology. Last, incorporation of the model in a recommendation system setting improves recall over state-of-the-art methods.

© 2021 Elsevier B.V. All rights reserved.
∗ Corresponding authors. E-mail address: florence.robert-regol@mail.mcgill.ca (F. Regol).
1 Florence Regol and Soumyasundar Pal, both are first co-authors.
https://doi.org/10.1016/j.sigpro.2021.108335
0165-1684/© 2021 Elsevier B.V. All rights reserved.
F. Regol, S. Pal, J. Sun et al. Signal Processing 192 (2022) 108335

1. Introduction

In graph-related learning problems, models need to take into account relational structure. This additional information is encoded by a graph that represents data points as nodes and the relationships as edges. In practice, observed quantities are often noisy and the information used to construct a graph is no exception. The provided graph is likely to be incomplete and/or to contain spurious edges. For some graph learning tasks, this error or incompleteness is explicitly stated. For example, in the recommendation system setting, the task is to infer unobserved links representing users' preferences for items. In other cases, such as protein-protein interaction networks, graph uncertainty is implicit, arising because the graphs are constructed from noisy measurement data.

In these cases, we can view the graph as a random quantity. The modelling of random graphs as realisations of statistical models has been widely studied in network analysis [1]. Parametric random graph models have been used extensively to study complex systems and perform inference [2–4]. Parameters are usually estimated based on the observed graph. However, these models often fail to capture the structural properties of real-world networks and, as a result, are not suitable as generative models. In addition, inference of parameters can be prohibitively expensive for large-scale networks, and many parametric models cannot take into account node or edge attributes.

Recent interest in generative models for graph learning has resulted in auto-encoder based formulations [5,6]. There has also been an effort to combine the strengths of parametric random graph models and machine learning approaches [7]. In these models, the probability of the presence of an edge is related to the node embeddings of the incident nodes. Although the auto-encoder solutions excel in their targeted objective of link prediction, they fail to generate representative samples [8]. In [8,9], application of generative adversarial networks (GANs) is considered for graph generative models. [8] shows that the GAN-generated graphs do not deviate from the observed graph in terms of structural properties. These approaches are promising, but the required training time can be prohibitive for graphs of even a moderate size.

The main contribution of this paper is to introduce a novel generative model for graphs for the case when we have access to a single observed graph. The generative model is designed to allow the sampling of multiple graphs that can be used as an ensemble to improve performance in graph-based learning tasks. We call the approach "node copying". Based on a similarity metric between nodes, the node copying model generates random graph realizations by replacing the edges of a node in the observed graph with those of a randomly chosen similar node. The assumption made by this model is that similar nodes should have similar neighborhoods. The meaning of 'similar' varies depending on how the model is employed in a learning task and has to be defined according to the application setting. Most graph learning algorithms rely on and exploit the property of homophily, which implies that nodes with similar characteristics are more likely to be connected to one another [10]. The proposed node copying model preserves this homophily (one similar neighbourhood is swapped for another similar neighbourhood). The advantage is that the generative model permits sampling of an ensemble of graphs that can be used to improve the performance of learning algorithms, as illustrated in later sections of the paper.

By construction, each sampled graph possesses the same set of nodes as the observed graph and the identities of the nodes are retained throughout the sampling procedure. The identity of a node is specified by a label and/or a set of node features; hence, each node preserves its own features or labels (if available) in the sampled graphs. The proposed model is simple and the computational overhead is minimal. Despite its simplicity, we show that incorporating such a model can be beneficial in a range of graph-based learning tasks. The model is flexible in the sense that the choice of the similarity metric is adapted to the end-to-end learning goal. It can be combined with state-of-the-art methods to provide an improvement in performance. In summary, the proposed model: 1) generates useful samples which improve the performance in graph-based machine learning tasks and retain important structural properties; 2) is flexible so that it is applicable to various tasks (e.g., the bipartite graphs in recommendation systems are drastically different from the community-structured graphs considered in node classification); and 3) is fast both for model construction and random sample generation.

Extensive experiments demonstrate the effectiveness of our model for three different graph learning tasks. We show that the incorporation of the model in a Bayesian framework for node classification leads to improved performance over state-of-the-art methods for the scarce data scenario. The model has much lower computational complexity compared to other Bayesian alternatives. In the face of adversarial attacks, a node copying based strategy significantly mitigates the effect of the corruption. Lastly, the use of the node copying model in a personalized recommender system leads to improvement over a state-of-the-art graph-based method.

Preliminary results for application of the node copying model in semi-supervised node classification and in defense against adversarial attacks were published in [11] and [12] respectively. In this paper, we add theoretical results to shed light on the statistical properties of the sampled graphs in special cases, conduct thorough experimentation to examine the similarity of the sampled graphs to the observed graph, show the effectiveness of the proposed approach in a data-scarce classification problem and against diverse adversarial attacks, and extend the application of the copying model to the recommendation setting.

2. Related work

Parametric random graph models: There is a rich history of research and development of parametric models of random graphs. These models have been designed to generate graphs exhibiting specific characteristics and their theoretical properties have been studied in depth. They can yield samples representative of an observed graph provided that the model is capable of representing the particular graph structure and parameter inference is successful. The Barabasi-Albert model [2], exponential random graphs [3], exchangeable random graph models [13], and the large family of stochastic block models (SBMs) [14], including mixed-membership SBMs [4], degree-corrected SBMs [15], and overlapping SBMs [16,17], fall into this category. Configuration model [18,19] variants preserve the degree sequence of a graph while changing the connectivity pattern in the random samples. [20] provides a recent survey of various random graph models. A major drawback of these models is that they impose relatively strict structural assumptions in order to maintain tractability. As a result, they often cannot model characteristics like large clustering coefficients [21], small-world connectivity and exponential degree distributions [22], observed in many real-world networks. Additionally, most cannot readily take into account node or edge attribute information, and high-dimensional parameter inference can be prohibitively computationally expensive for larger graphs.

Learning-based models: Other generative graph models have emerged from the machine learning community, incorporating an auto-encoder structure. The models are commonly trained to accurately predict the links in the graph, and as a result, tend to fail to reproduce global structural properties of the observed graph. [5] introduces a variational auto-encoder based model parameterized by a graph convolutional network (GCN) to learn node embeddings. The link probabilities are derived from the dot product of the obtained node embeddings. [23] adopts a similar approach, but employs adversarial regularization to learn more robust embeddings. Both of these models exhibit impressive clustering of node embeddings. [24] adds a message passing component to the decoder that is based on intermediately learned graphs. This leads to improved representations and better link prediction. [7] combines the strengths of the parametric models and the graph-based learning methods, proposing the DGLFRM model, which aims to retain the interpretability of the overlapping SBM (OSBM) paired with the flexibility of the graph auto-encoder. The incorporation of the OSBM improves the ability of the model to capture block-based community structure.

An alternative approach is to use generative adversarial networks (GANs) as the basis for graph models. [9] models edge probability through an adversarial framework in the GraphGAN model. The NetGAN model represents the graph as a distribution on random walks [8]. Compared to auto-encoder based methods, the GAN based methods seem more capable of capturing the structural characteristics of an observed graph. The major disadvantage is that the models are extremely computationally demanding to train and the success of the training can be sensitive to the choice of hyperparameters.

Our focus in this paper is on learning a graph model based on a single observed graph. By contrast, there is a growing body of work that focuses on learning graph models that can reproduce graphs that have characteristics similar to a dataset of multiple training graphs. These approaches can preserve important structural attributes of the graph(s) in the dataset, but the sampled graphs do not retain node identity information. They cannot be applied in the node- and edge-oriented learning tasks we focus on. In this category, there have been variational auto-encoder approaches [25,26], GAN-based approaches [27], models based on iterative generation [28], and auto-regressive models [29,30].

3. Node copying model

We propose to build a random graph model by introducing random perturbations to an initial observed graph G_obs. Our aim is to generate graphs G that are close to G_obs in some structural sense, while preserving any metadata associated with a node's identity (e.g., node features, labels). Here, we consider that G_obs = {V_obs, E_obs} is a directed graph (the extension to the undirected case
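The sampling procedure just described can be sketched in a few lines. The snippet below is an illustrative sketch, not the authors' released code: it assumes the graph is stored as per-node out-neighbor lists and that a node embedding matrix is already available for defining similarity, with the replacement node drawn uniformly from the K nearest neighbors in embedding space.

```python
import numpy as np

def copying_candidates(emb, K):
    """For each node, the K nearest nodes (Euclidean distance in
    embedding space, excluding the node itself) are the candidate
    replacements, each with probability 1/K."""
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # a node never replaces itself here
    return np.argsort(dist, axis=1)[:, :K]  # (N, K) candidate indices

def sample_graph(neighbors, candidates, rng):
    """Sample one graph realization: each node i keeps its identity
    (features, label) but receives the out-neighbor list of a randomly
    chosen candidate zeta_i."""
    n = len(neighbors)
    zeta = candidates[np.arange(n), rng.integers(0, candidates.shape[1], n)]
    return [list(neighbors[z]) for z in zeta]

rng = np.random.default_rng(0)
neighbors = [[1, 2], [0], [0, 3], [2]]            # toy directed graph
emb = rng.normal(size=(4, 8))                     # stand-in embeddings
cand = copying_candidates(emb, K=2)
g_sampled = sample_graph(neighbors, cand, rng)    # one realization
```

Given the candidate lists, drawing one graph costs a single uniform draw per node, which is the linear-in-N sampling cost claimed above.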
…so that the sampled graphs in expectation have the same number of edges |E_obs| as G_obs:

p^cc_ij = |E_obs| p^cal_ij / (Σ_{i=1}^{N} Σ_{j=1}^{N} p^cal_ij).

We call this procedure 'calibration and correction' (cc).

Node copying: Since the baselines are unsupervised techniques, for a fair comparison we refrain from using node labels for the node copying model. Instead, we sample replacement nodes according to the distances between learned embeddings. The selected datasets are used as benchmarks for the node classification task and are expected to exhibit some homophilic property, which suggests that nodes with the same label are clustered together and share similar features. This indicates that it is sensible to construct a similarity metric using node embeddings that are derived from both topological and feature information. Specifically, for any node, we sample the replacement node uniformly from the K nearest neighbors according to the Euclidean distance between embeddings.

Experiment: The qualitative similarity of two graphs can be measured in multiple ways. To obtain an insight into which method can generate graphs that are the closest to an original graph, we report various graph statistics on the sampled graphs and compare them to the statistics of the observed graphs for the Cora, Polblogs, Bitcoin, and Amazon-photo datasets (homophilic graphs, used extensively for node classification) in Table 1. Details of all datasets used in this paper are included in Appendix A. For uniformity, the directed graph of Bitcoin has been made undirected by casting every in/out edge as an undirected edge.

Metrics: For a graph G, d(v) is the degree of node v and c_v denotes the class label of node v. We report the average and maximum degree, the proportion of cross-community edges: (1/|E|) Σ_{(i,j)∈E} 1[c_i ≠ c_j], the proportion of claws ('Y' structures): (1/|E|) Σ_{v∈V} C(d(v), 3), and the relative entropy of the edge distribution: (1/log N) Σ_{v∈V} −(d(v)/|E|) log(d(v)/|E|). By reporting various summary graph statistics, we can have a broad picture of the properties of the generated graphs.

In each case, 10 random trials of the model training are conducted and 10 random graphs are sampled based on each trial. The results are thus averaged over 10 × 10 = 100 sampled graphs. We observe that without the 'calibration and correction', the baseline node embedding algorithms fail to generate representative graphs, as the characteristics of the sampled graphs show large deviation from those of the observed graph, in particular for the degree statistics and the entropy of the edge distribution. Moreover, the sample graphs have a much lower proportion of claws. This is not unexpected, because any model which is based on independence of the links cannot model such local structures. On the other hand, the samples from the proposed node copying model show characteristics close to those of the observed graph. A similar behaviour is observed when sampling is performed using a different number of nearest neighbors for the sampling distribution (K = 5, 10, 15). In this experiment, sampling from the K = 5 nearest neighbors seems to be the best choice.

Connections to other random graph models: We further support this notion that the copying model preserves the structure of the observed graph with theoretical results. We show that if the observed graph itself is a sample from some parametric random graph model, then suitable copying procedures can ensure that the sampled graphs have the same marginal distribution. We prove this result for two popular random graph models, namely for the Stochastic Block Model in Theorem 4.1 and for the Erdős-Rényi family in Theorem 4.2.

Theorem 4.1. Let G_obs be a sample of a Stochastic Block Model (SBM) with N nodes, K communities and symmetric conditional probability matrix β ∈ R^{K×K}. We use a node copying procedure between nodes of the same label, i.e., a node j cannot be copied in place of node i if the labels c_i ≠ c_j. Then any sampled graph G_sa is marginally another realization from the same SBM.

Proof. For G_obs, we have p(A^obs_ij = 1 | c_i, c_j) = β_{c_i,c_j}. Now, based on the copying strategy above, the probability of observing an edge (i, j) in G_sa is given by:

p(A^sa_ij = 1 | c_i, c_j) = Σ_{v=1}^{N} p(A^obs_vj = 1 | c_i, c_j, ζ^i = v) p(ζ^i = v | c_i, c_v)
  = Σ_{v : c_i = c_v} p(A^obs_vj = 1 | c_i, c_j, ζ^i = v) p(ζ^i = v | c_i, c_v)   (since p(ζ^i = v | c_i ≠ c_v) = 0)
  = β_{c_i,c_j} = p(A^obs_ij = 1 | c_i, c_j),   (2)

since p(A^obs_vj = 1 | c_i, c_j, ζ^i = v) = β_{c_i,c_j} = p(A^obs_ij = 1 | c_i, c_j) for all {v : c_i = c_v} and Σ_{v : c_i = c_v} p(ζ^i = v | c_i, c_v) = 1. This is reliant on only copying within the same community, as p(ζ^i = v | c_i ≠ c_v) = 0.

Next we consider the joint distribution of two edges, i.e. p(A^sa_{i1 j1} = 1, A^sa_{i2 j2} = 1 | c_{i1}, c_{i2}, c_{j1}, c_{j2}). We need to consider the following two cases:

Case 1: i1 = i2.

p(A^sa_{i1 j1} = 1, A^sa_{i2 j2} = 1 | c_{i1}, c_{i2}, c_{j1}, c_{j2})
  = Σ_{v=1}^{N} p(ζ^{i1} = v | c_{i1}) p(A^obs_{v j1} = 1, A^obs_{v j2} = 1 | c_{i1} = c_{i2}, c_{j1}, c_{j2}, ζ^{i1} = v)
  = Σ_{v=1}^{N} p(ζ^{i1} = v | c_{i1}) p(A^obs_{v j1} = 1 | c_{i1}, c_{j1}, ζ^{i1} = v) × p(A^obs_{v j2} = 1 | c_{i1}, c_{j2}, ζ^{i1} = v),   (3)
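Theorem 4.1 can also be checked numerically. The snippet below is an illustrative simulation (not part of the paper's experiments): it draws many directed SBM graphs, applies label-constrained copying to each, and verifies that the empirical edge frequencies before and after copying agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(1)
labels = np.array([0, 0, 0, 1, 1, 1])              # two communities
beta = np.array([[0.8, 0.1], [0.1, 0.7]])          # symmetric SBM probabilities
N, trials = len(labels), 4000

freq_obs = np.zeros((N, N))
freq_sa = np.zeros((N, N))
P = beta[labels[:, None], labels[None, :]]         # per-pair edge probabilities
for _ in range(trials):
    A = (rng.random((N, N)) < P).astype(float)     # one directed SBM sample
    # copy only within the same community: zeta_i uniform over {v : c_v = c_i}
    zeta = np.array([rng.choice(np.where(labels == labels[i])[0])
                     for i in range(N)])
    A_sa = A[zeta, :]                              # row i becomes row zeta_i
    freq_obs += A
    freq_sa += A_sa

# maximum discrepancy between empirical edge probabilities
err = np.abs(freq_obs - freq_sa).max() / trials
```

With 4000 trials, `err` stays within a few standard errors of zero, consistent with the copied graphs being marginally from the same SBM.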
Table 1
Average statistics of 100 sampled graphs. Bolded entries indicate the closest values to the observed value. N/A indicates entries that were not reported due to processing limitations.

Cora:

| Method | Avg. degree | Max. degree | Cross com. % | Claws % (×10−4) | Edge dist. ent. (%) |
|---|---|---|---|---|---|
| Observed | 3.89 | 168 | 19.6 | 6.34 | 95.6 |
| GRAPHITE (cc) | 3.9 | 24 | 34.4 | 0.43 | 96 |
| GRAPHITE raw | 1.3 × 10³ | 1.4 × 10³ | 77.6 | 0.18 | 100 |
| GRAPHITE calibrated | 105 | 519 | 34.5 | 0.44 | 97.7 |
| GVAE (cc) | 3.9 | 22.1 | 37 | 0.302 | 96.2 |
| GVAE raw | 1.34 × 10³ | 1.44 × 10³ | 77.7 | 0.136 | 100 |
| GVAE calibrated | 125 | 587 | 37 | 0.309 | 97.9 |
| DGLFRM-B (cc) | 3.9 | 14.9 | 38.8 | 0.169 | 97.8 |
| COPYING K=5 | 3.5 | 105 | 18.4 | 3.2 | 96.9 |
| COPYING K=10 | 3.3 | 63.9 | 20.4 | 1.22 | 97.4 |
| COPYING K=15 | 3.2 | 58.1 | 22 | 1.1 | 97.5 |

Polblogs:

| Method | Avg. degree | Max. degree | Cross com. % | Claws % (×10−4) | Edge dist. ent. (%) |
|---|---|---|---|---|---|
| Observed | 27.4 | 351 | 9.42 | 10.1 | 90.3 |
| GRAPHITE (cc) | 27.3 | 64 | 18.1 | 0.91 | 98.9 |
| GRAPHITE raw | 614 | 712 | 39.2 | 0.67 | 100 |
| GRAPHITE calibrated | 256 | 530 | 18.1 | 0.91 | 99.2 |
| GVAE (cc) | 27.3 | 64.1 | 18.2 | 0.918 | 98.9 |
| GVAE raw | 614 | 714 | 39.4 | 0.673 | 100 |
| GVAE calibrated | 255 | 533 | 18.2 | 0.918 | 99.1 |
| DGLFRM-B (cc) | 27.3 | 58.8 | 18.1 | 0.874 | 99 |
| COPYING K=5 | 26 | 311 | 8.94 | 8.61 | 90.6 |
| COPYING K=10 | 25.5 | 290 | 8.83 | 7.72 | 90.8 |
| COPYING K=15 | 25.2 | 274 | 8.72 | 7.33 | 90.9 |

Bitcoin | Amazon-photo
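The summary statistics in Table 1 are straightforward to compute from an edge list. The helper below is a sketch following the metric definitions in the text (C(d, 3) claw counts, normalized degree entropy); it is illustrative, not the authors' evaluation code.

```python
import numpy as np
from math import comb, log

def graph_stats(edges, labels, N):
    """Summary statistics of an undirected graph given as an edge list."""
    deg = np.zeros(N, dtype=int)
    cross = 0
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
        cross += int(labels[i] != labels[j])
    E = len(edges)
    claw_frac = sum(comb(int(d), 3) for d in deg) / E      # 'Y' structures per edge
    nz = deg[deg > 0] / E
    ent = sum(-p * log(p) for p in nz) / log(N)            # rel. entropy of edge dist.
    return {"avg_deg": deg.mean(), "max_deg": int(deg.max()),
            "cross_frac": cross / E, "claw_frac": claw_frac, "edge_ent": ent}

# toy 4-node graph with two communities
stats = graph_stats([(0, 1), (1, 2), (2, 3), (0, 2)], [0, 0, 1, 1], N=4)
```

On the toy graph, half of the edges cross communities and the single degree-3 node contributes one claw, matching a hand count.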
…sampled from a SBM. Similarly, if we consider any arbitrary subset of the edges in G_sa, we can show that the joint distribution is mutually independent over the edges and is the same as that of G_obs. This proves the theorem.

Theorem 4.2. If G_obs is a sample of an Erdős-Rényi model with N nodes and connection probability θ ∈ [0, 1], then for any arbitrary distribution of ζ, G_sa is also a sample from the same model.

Proof. For G_obs, we have p(A^obs_ij = 1 | θ) = θ. Now, based on any arbitrary copying strategy, we have

p(A^sa_ij = 1 | θ) = Σ_{k=1}^{N} p(ζ^i = k | θ) p(A^obs_kj = 1 | θ, ζ^i = k) = θ Σ_{k=1}^{N} p(ζ^i = k | θ) = θ.

For predicting labels for the unlabelled nodes, we only need to draw samples from p(W|X, Y_L, G), rather than explicitly evaluate the probability, so the normalization constant 1/p(Y_L|X, G) can be ignored. The posterior over W induces a predictive marginal posterior for the unknown node labels Z as follows:

p(Z|X, Y_L, G_obs) = ∫ p(Z|W, G, X) p(W|X, Y_L, G) p(G|G_obs, X, Y_L) dW dG.   (7)

The integral in equation (7) cannot be computed in a closed form. Hence, Monte Carlo sampling is used for approximation:

p(Z|X, Y_L, G_obs) ≈ (1/(N_G S)) Σ_{i=1}^{N_G} Σ_{s=1}^{S} p(Z|W_{s,i}, G_i, X).
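The Monte Carlo approximation above reduces to averaging the classifier's softmax outputs over sampled graphs and weight samples. A schematic version is sketched below; `predict_softmax` is a hypothetical stand-in for an actual GCN forward pass, not the paper's implementation.

```python
import numpy as np

def bgcn_predict(predict_softmax, sampled_graphs, weight_samples, X):
    """Average p(Z | W_{s,i}, G_i, X) over N_G sampled graphs and the
    weight samples attached to each graph (sketch of the MC estimate)."""
    total, count = None, 0
    for G_i, W_list in zip(sampled_graphs, weight_samples):
        for W in W_list:                      # S weight samples for graph i
            p = predict_softmax(W, G_i, X)    # (N, num_classes) softmax matrix
            total = p if total is None else total + p
            count += 1
    return total / count

# toy stand-in 'classifier': class 1 logit is always 1.0 above class 0
def predict_softmax(W, G, X):
    logits = np.full((3, 2), W, dtype=float)
    logits[:, 1] += 1.0
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

probs = bgcn_predict(predict_softmax, sampled_graphs=[None, None],
                     weight_samples=[[0.0, 0.5], [1.0]], X=None)
```

Each row of `probs` remains a valid probability distribution because an average of softmax vectors still sums to one.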
Table 2
Accuracy of semi-supervised node classification (Cora, Citeseer, Pubmed, Co.-CS).

Table 3
Accuracy of semi-supervised node classification in the data-scarce setting.

Table 4
Accuracy of node classification for different sizes of the labeled set, and runtime (for a trial with 20 labels per class) of various BGCNs on the Cora dataset, averaged over 20 trials.
We conduct 50 trials; each trial corresponds to a random weight initialization and use of a random train/test split.

We compare the proposed BGCN based on node copying (BGCN-Copy) with the BGCN based on an MMSBM [33] (M-BGCN) and the SBM-GCN [34] to highlight the usefulness of the node copying model. We also include the node classification baselines GCN [39] and DFNET [42] (only for the Cora and Citeseer datasets due to run-time considerations). The average accuracies along with standard errors are reported in Table 2. For each setting, the highest average accuracy is written in bold and an asterisk (∗) is used if the algorithm offers better performance compared to the second best algorithm that is statistically significant at the 5% level using a Wilcoxon signed rank test. (The significance test results are reported in the same way in the two following sections.)

Data scarce setting: As in [34], we consider another node classification experiment where all the edges connected to the test set nodes are removed from the graph. We also have access to only 5 training examples from each class. We use a Multi Layer Perceptron (MLP) as a non-graph baseline algorithm. We conduct 20 trials for each dataset and report the average accuracies along with the standard error in Table 3.

Comparison of BGCN generative models with runtime analysis: Finally, we compare the node copying BGCN with BGCN variants that incorporate other graph generative models for p(G|G_obs, D), including the generative models considered in Section 4. Performance and runtime results on the Cora dataset are reported in Table 4. Implementation details of the BGCN variants are provided in Appendix B.

Discussion: From Table 2, we observe that the proposed BGCN either outperforms or offers comparable performance to the competing techniques in most cases. In the data-scarce setting, models which can incorporate the uncertainty in the graph show better performance in general as they have some capability to mitigate the effects of missing edges and fewer training labels. The proposed BGCN-Copy has the capacity of adding new edges during training, as do M-BGCN and SBM-GCN. From Table 3, we observe that the proposed algorithm shows significant improvement compared to its competitors.

Table 4 highlights that the node copying based BGCN provides much better results than VGAE or GRAPHITE and is much faster compared to the other alternatives. The proposed copying approach generalizes the BGCN to settings where the parametric model does not fit well. It also reveals that our approach is faster than M-BGCN, which makes it a better alternative to process larger datasets. The parametric inference for the MMSBM is challenging for larger graphs, whereas our approach scales linearly with the number of nodes.

6. Application – Node copying for defense against adversarial attacks

We present node classification in an adversarial setting as a second application. Generally, a topological attack corrupts the prediction of a node label by tampering with its neighborhood. Since the node copying model explicitly provides new neighborhoods to nodes, we can construct a defense algorithm based on sampled graphs in which the targeted node can possibly escape the effect of the attack.

Problem Setting: We consider attacks in which unlabeled nodes V_attacked (disjoint from the labeled set L) are subjected to an adversarial attack that can modify the neighborhood of the targeted nodes. The defense algorithm only has access to the resulting corrupted graph G_attacked, and the goal is to recover classification accuracy at the targeted nodes v ∈ V_attacked. The complete feature matrix X and the training labels Y_L of a (small) labeled set are assumed to be unperturbed. Following the categorizations of graph attacks in the survey [43],
this specific adversarial setting belongs to the edge-level, targeted, poisoning category.

Defense algorithm using node copying: Our proposed defense procedure is summarized in Algorithm 2.

Algorithm 2 Error correction using node copying.
1: Input: G_attacked, X, Y_L, V_attacked
2: Output: Y^Copying_attacked
3: Train a semi-supervised node classification algorithm using G_attacked, X, Y_L to learn model parameters W.
4: Train a node embedding algorithm using G_attacked, X to obtain embeddings {e_i}_{i=1}^{N}.
5: Compute the pairwise distance matrix D, where D_ij = ||e_i − e_j||_2.
6: for v ∈ V_attacked do
7:   for k = 1 : N_G do
8:     Sample ζ^v_k ∼ p(ζ^v | G_attacked, X, Y_L) according to (10).
9:     Copy node ζ^v_k in place of node v and compute the prediction of the learned classifier (in Step 3) at node v, ŷ_v^(ζ^v_k), using the parameters W.
10:   end for
11:   Compute ŷ_v^Copying = (1/N_G) Σ_{k=1}^{N_G} ŷ_v^(ζ^v_k)
12: end for
13: Form Y^Copying_attacked = {ŷ_v^Copying}_{v ∈ V_attacked}

We now detail the steps of the algorithm. We first train a GCN classification algorithm on the attacked graph (Step 3). Since our sole interest is in correcting the erroneous classifications at the nodes in V_attacked, we do not need to sample entire graphs for this application. Instead, we sample local neighbourhoods of the targeted nodes v ∈ V_attacked via the node copying model p(ζ^v | G_attacked, X). We note that the classifications of the attacked nodes are most likely incorrect. So, using the predicted classes to define similarities in the node copying model is not sensible, as the targeted nodes might be more similar to nodes from a different class. Instead, we use an unsupervised representation of the nodes for sampling the replacement nodes. We train a node embedding algorithm using (G_attacked, X) to obtain the node representations {e_i}_{i=1:N} (Step 4). A symmetric, pairwise distance matrix D ∈ R_+^{N×N} is formed, where D_ij denotes a distance between e_i and e_j (Step 5). In our experiments, we use the Graph Variational Auto Encoder (VGAE) [39] as the embedding algorithm and compute pairwise Euclidean distances to form D: D_ij = ||e_i − e_j||_2. For any targeted node, we sample the replacement node (Step 8) uniformly at random from the nodes which are close to the targeted node. Formally, we define:

p(ζ^v = m | G_attacked, X, Y_L) = 1/P, if D_{v,m} = D_{v,(ℓ)} for some 1 ≤ ℓ ≤ P; 0, otherwise.   (10)

Here D_{i,(j)} is the j-th order statistic of {D_{i,ℓ}}_{ℓ=1}^{N}. For each corrupted node, our approach replaces the classification obtained from the node classifier trained on the corrupted graph G_attacked by the ensemble of classifications of the same model with the sampled ζ^v s copied in place of node v (Step 9). We use the mean of the softmax probabilities as the ensemble aggregation function (Step 11). In many graph-based models [5,44], the node classification is primarily influenced by nodes within a few hops. As a result, a localized evaluation of the predictions at the targeted nodes can be computed efficiently.

We note that the node embeddings {e_i}_{i=1}^{N} of G_attacked, which are used to form p(ζ^v | G_attacked, X, Y_L), are also affected by the topological attack. This can potentially degrade the effectiveness of the proposed defense mechanism. If the attack is believed to be too strong, we can use a similarity metric that ignores any topological information instead of relying on the embeddings of the nodes of the attacked graph. However, our experimental results in Table 5 demonstrate that even for a quite severe attack (where 75% of the neighbors of the targeted nodes have been tampered with), the proposed defense strategy using node copying has impressive performance.

Fig. 2(b) shows how a topological attack can lead to an incorrect classification for a targeted node. Fig. 2(c) provides two examples of the node copying operation to generate new graphs by copying some ζ^v s in place of node v. Fig. 2(d) depicts how the softmax values derived from these generated graphs are combined to recover the original correct classification.

Experiments: Our proposed defense algorithm is evaluated against three graph adversarial attacks. Targeted DICE (Delete Internally Connect Externally) is the targeted version of the global attack from [45]. This algorithm attacks a node by randomly disconnecting β percent of its neighbors of the same class, and adds the same number of edges to random nodes of a different class. We present results with β = 50% and β = 75%. Nettack from [46] is another topological attack. We use the directed version of Nettack and set the number of perturbations to be equal to the degree of the attacked node. The last attack we consider is FGA [47], for which we set the number of perturbations to 20. All remaining parameters are set to the values provided in the respective papers.

In order to illustrate the effectiveness of the procedure, we compare the accuracy at the attacked nodes for the proposed copying algorithm with a standard GCN and a state-of-the-art defense algorithm GCN-SVD [48], for which we set the rank parameter K = 50. We consider a setting where V_trn is formed with 20 labels per class. The attacked set V_attacked is simulated by randomly sampling 40 nodes, excluding V_trn, and corrupting them with the attack.

Discussion: From Table 5, we see that the proposed corrective mechanism based on node copying improves the classification accuracy of the corrupted nodes. Interestingly, the proposed correction procedure does not involve sampling of the entire graph or retraining of the model; rather, we can perform computationally inexpensive, localized computations to improve accuracy at the targeted nodes. We observe that the GCN-SVD approach outperforms the Copying algorithm for Nettack, which is expected as it is tailored to provide efficient defense for this specific attack, but it performs significantly worse than a standard GCN against the FGA attack. In contrast, the proposed Copying defense offers consistent improvement over GCN across different attacks.

7. Application – Node copying for recommender systems

As the last application, we consider the task of graph-based personalized item recommendation. An intuitive solution to this problem is to form the item recommendation of a user based on other users that have many items in common. If two users share multiple items in their interaction histories, we presume that one item purchased/liked by one of them could be recommended to the other. A node copying model can directly apply this approach by using a similarity metric between users that is proportional to the number of shared items.

Problem Setting: Let U be the set of users and I be the set of items. G_obs is the partially observed bipartite graph built from previous user-item interactions. The task is to infer other unobserved interactions, which can be seen as a link prediction problem [49,50]. Alternatively, we can view this task as a ranking problem [51]. For each user u, for an observed interaction with item i and a non-observed interaction with item j, we have i >_u j in the training set. If both (u, i) and (u, j) are observed, we do
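Steps 6-13 of Algorithm 2 amount to a small ensemble loop. The sketch below assumes a trained classifier exposed as a softmax function and a precomputed distance matrix D; both are hypothetical stand-ins for the components named in the algorithm, with replacements drawn uniformly from the P nearest nodes as in eq. (10).

```python
import numpy as np

def correct_predictions(softmax_at, D, attacked, N_G, P, rng):
    """For each attacked node v, sample N_G replacement nodes uniformly
    from its P nearest nodes by D (eq. (10)), evaluate the classifier's
    softmax at v after each copy, and average (Steps 6-13 of Alg. 2)."""
    corrected = {}
    for v in attacked:
        order = np.argsort(D[v])
        nearest = order[order != v][:P]        # P closest nodes, excluding v
        zetas = rng.choice(nearest, size=N_G)  # Step 8
        preds = [softmax_at(v, z) for z in zetas]     # Step 9
        corrected[v] = np.mean(preds, axis=0)         # Step 11
    return corrected

# toy demo: the 'classifier' simply reports the copied node's one-hot class
classes = np.array([0, 0, 1, 1])
softmax_at = lambda v, z: np.eye(2)[classes[z]]
D = np.array([[0., 1., 5., 6.],
              [1., 0., 5., 6.],
              [5., 5., 0., 1.],
              [6., 6., 1., 0.]])
rng = np.random.default_rng(0)
out = correct_predictions(softmax_at, D, attacked=[0], N_G=4, P=2, rng=rng)
```

Only predictions at the attacked nodes are recomputed, mirroring the localized evaluation described in the text.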
Table 5
Accuracy of defense algorithms at the attacked node, averaged over 50 trials (Cora and Citeseer; attacks: DICE 50%, DICE 75%, Nettack, FGA).
Fig. 2. Summary of the node copying procedure. a) In the absence of the attack, the softmax of node v achieves the correct classification in G_obs. b) Node v is targeted by an attack and is now wrongly classified. c) We sample two replacement nodes for node v: ζ_v = j and ζ_v = k. The softmax vectors at node v after nodes j and k are copied in place of node v are denoted as ŷ_v^(j) and ŷ_v^(k), respectively. d) The error for node v is corrected by computing ŷ_v^Copying = (1/2)(ŷ_v^(j) + ŷ_v^(k)).
Using this interaction data {>_u}_trn = {(u, i, j) : (u, i) ∈ G_obs, (u, j) ∉ G_obs} for fitting a model, the task is to rank all (u, i, j) such that both (u, i) ∉ G_obs and (u, j) ∉ G_obs. The sets of pairs {(u, i) : (u, i) ∈ G_obs} and {(u, j) : (u, j) ∉ G_obs} are referred to as the positive and negative pools of interactions, respectively. We denote the test set as {>_u}_test = {(u, i, j) : (u, i) ∉ G_obs, (u, j) ∉ G_obs}. We consider a Bayesian Personalized Recommendation (BPR) framework [51] (details in Appendix C) along with the inference of the graph G.

Background: Many existing graph-based deep learning recommender system models [49,50,52] learn weights W to form an embedding e_u(W, G_obs) for the u-th user and e_i(W, G_obs) for the i-th item node. The ranking of any (u, i, j) triple such that both (u, i) ∉ G_obs and (u, j) ∉ G_obs is specified as p(i >_u j | G_obs, W) = σ(e_u · e_i − e_u · e_j), where σ(·) is the sigmoid function and · denotes the inner product. Assuming a suitable prior on W, the learning goal is to maximize the posterior of W on the training set. Here, Ŵ = arg max_W p(W | {>_u}_trn, G_obs) is learned by maximizing the BPR objective using Stochastic Gradient Descent (SGD). The pairwise ranking probabilities in the test set are then computed using Ŵ.

Ensemble BPR: We summarize the proposed Ensemble BPR algorithm in Algorithm 3. For the Ensemble BPR algorithm, we first obtain Ŵ as previously described (Step 3), then we evaluate the ranking by computing an expectation with respect to the random graph G, sampled from the node copying model.

Algorithm 3 Ensemble BPR with node copying.
1: Input: G_obs, {>_u}_trn
2: Output: p({>_u}_test | {>_u}_trn, G_obs)
3: Obtain Ŵ = arg max_W p(W | {>_u}_trn, G_obs) by minimizing the BPR loss.
4: for i = 1 to N_G do
5:    Sample graph G_i ∼ p(G | G_obs, {>_u}_trn) using the node copying model defined by (11).
6: end for
7: Approximate p({>_u}_test | {>_u}_trn, G_obs) using eq. (13).

In this setting, we assign a non-zero conditional probability to only the class of bipartite graphs since G_obs is bipartite. This is achieved by considering a node copying scheme where only user nodes can be copied to a user node. We consider that users that share items are similar, so we use the Jaccard index ρ between the sets of items for pairs of users to define the conditional probability distribution of ζ as follows:

p(ζ_j = m | G_obs) = ρ(j, m) / Σ_{i∈U} ρ(j, i), if j, m ∈ U, and 0 otherwise.   (11)

We compute the pairwise ranking probabilities in the test set as follows:

p({>_u}_test | {>_u}_trn, G_obs) = ∫∫ p({>_u}_test | G, W) p(W | {>_u}_trn, G_obs) p(G | G_obs, {>_u}_trn) dW dG.   (12)

We sample N_G graphs G_i from p(G | G_obs, {>_u}_trn) using node copying (Step 5) and then form a Monte Carlo approximation of eq. (12) as follows (Step 7):

p({>_u}_test | {>_u}_trn, G_obs) ≈ (1/N_G) Σ_{i=1}^{N_G} p({>_u}_test | G_i, Ŵ).   (13)

Implementation of eq. (13) does not require retraining of the model since Ŵ is already obtained by minimizing the BPR loss; it only involves evaluation of a trained model on multiple sampled graphs to compute the average pairwise ranking probability for the test set.

Sampled Graph BPR (SGBPR) — training with sampled graphs: We consider another approach in which we use the generated graphs during the training process as well. The SGBPR algorithm is summarized in Algorithm 4. Our motivation is to remove some of the potentially unobserved positive interactions in the training set from the negative interaction pool. We rely on the node copying model to sample graphs (Step 4), which potentially contain positive interactions between user-item pairs that are unobserved in G_obs. The inference of graph G is carried out from the copying model p(G | G_obs, {>_u}_trn) specified in eq. (11). We need to compute:

p({>_u}_test | {>_u}_trn, G_obs) = ∫∫ p({>_u}_test | G, W) p(W | {>_u}_trn, G_obs, f(G_obs)) p(G | G_obs, {>_u}_trn) dW dG.   (14)

Here, f(G_obs) is a function of the observed graph that returns a graph that is used to control the negative pool. There is flexibility in the choice of this function.
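As a concrete sketch of eqs. (11) and (13), the copying distribution and the Monte Carlo ranking average can be written as below, with each user's item set stored in a dict. This is an illustrative reconstruction, not the authors' code: the helper names are invented and the trained ranking model is abstracted as a callable `rank_fn`.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard index between two item sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def copying_distribution(user_items):
    """Eq. (11): for each user j, a distribution over users m with mass
    proportional to the Jaccard index rho(j, m); items get zero mass."""
    users = sorted(user_items)
    rho = np.array([[jaccard(user_items[j], user_items[m]) for m in users]
                    for j in users])
    return users, rho / rho.sum(axis=1, keepdims=True)

def sample_graph(user_items, rng=None):
    """One node copying draw: each user's item neighbourhood is replaced
    by the neighbourhood of a user sampled from eq. (11)."""
    rng = rng or np.random.default_rng()
    users, p = copying_distribution(user_items)
    return {u: set(user_items[users[rng.choice(len(users), p=p[j])]])
            for j, u in enumerate(users)}

def ensemble_ranking(rank_fn, graphs):
    """Eq. (13): average the model's pairwise ranking probabilities over
    the sampled graphs; no retraining is required."""
    return np.mean([rank_fn(g) for g in graphs], axis=0)
```

Copying only user neighbourhoods, as above, guarantees that every sampled graph remains bipartite.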
Table 6
Average recall and NDCG for recommender system experiment.
          ML100k     Ama.-CDs
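Table 6 reports Recall@k and NDCG@k. As a minimal sketch (the function names are illustrative), these per-user metrics, defined later in this section, can be computed as:

```python
import math

def recall_at_k(topk, true_items):
    """Recall@k = |I_u+ ∩ I_k(u)| / |I_u+|."""
    true = set(true_items)
    return len(true & set(topk)) / len(true)

def ndcg_at_k(topk, true_items):
    """NDCG@k = DCG_k / IDCG_k with gain D_k(n) = 1[i_n in I_u+] / log2(n + 1),
    where the ideal ranking lists the true preferred items first."""
    true = set(true_items)
    dcg = sum(1.0 / math.log2(n + 1)
              for n, item in enumerate(topk, start=1) if item in true)
    idcg = sum(1.0 / math.log2(n + 1)
               for n in range(1, min(len(true), len(topk)) + 1))
    return dcg / idcg
```

Averaging these quantities over all users yields the figures of the form reported in Table 6.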
Algorithm 4 Sampled Graph BPR with node copying.
1: Input: G_obs, {>_u}_trn
2: Output: p({>_u}_test | {>_u}_trn, G_obs)
3: for i = 1 to N_G do
4:    Sample graph G_i ∼ p(G | G_obs, {>_u}_trn) using the node copying model defined by (11).
5: end for
6: Compute Â_Ḡ using eq. (15) and G_b (equivalently A_{G_b}) using Â_Ḡ.
7: Compute f(G_obs) = G_obs ∪ G_b.
8: Obtain W̃ = arg max_W p(W | {>_u}_trn, G_obs, f(G_obs)) by minimizing the BPR loss.
9: Approximate p({>_u}_test | {>_u}_trn, G_obs) using eq. (16).

We use f(G_obs) = G_obs ∪ G_b. The union ∪ indicates that we take the union of the edge sets of the two graphs. We define Ḡ ≜ E_{G|G_obs}[G] as the graph with adjacency matrix equal to the expectation of the adjacency matrix over the generative graph distribution. The graph G_b is derived from this; it has a binary adjacency matrix obtained by comparing the adjacency matrix entries of Ḡ to a small positive threshold b. So, we have A_{G_b} = 1[A_Ḡ > b]. Due to analytical intractability, we approximate E_{G|G_obs}[G] using Monte Carlo. This amounts to approximating the adjacency matrix of Ḡ = E_{G|G_obs}[G], whose (s, w)-th entry can be estimated as:

Â_Ḡ(s, w) = (1/N_G) Σ_{i=1}^{N_G} A_{G_i}(s, w).   (15)

Here, G_i ∼ p(G | G_obs, {>_u}_trn) is sampled from the node copying model in Step 4. This allows us to construct an estimate f(G_obs) by first computing an approximate G_b (equivalently A_{G_b}) using Â_Ḡ (Step 6) and then using this G_b in the definition of f(G_obs) (Step 7).

In specifying p(W | {>_u}_trn, G_obs, f(G_obs)), we effectively aim to reduce the training set by eliminating unobserved edges from the negative pool. We have {>_u}_trn = D_S. We now introduce D̃_S = D_S \ {(u, i, j) : (u, i) ∈ G_obs, (u, j) ∈ G_b}, and set p(W | {>_u}_trn, G_obs, f(G_obs)) = p(W | D̃_S), which amounts to training the model based on positive and negative interactions in D̃_S. Using W̃ = arg max_W p(W | {>_u}_trn, G_obs, f(G_obs)) (Step 8), the posterior probability of {>_u}_test in eq. (14) is then approximated as:

p({>_u}_test | {>_u}_trn, G_obs) ≈ (1/N_G) Σ_{i=1}^{N_G} p({>_u}_test | G_i, W̃).   (16)

Experiments: For the recommender system, we use two datasets (ML100k and Amazon-CDs, with details in Table A.8 in Appendix A) that we preprocess by retaining users and items that have a specified minimum number of edges set by a threshold. We create training, validation and test sets by splitting the data 70/10/20%. We generate random splits for each of the 10 trials and report average Recall and Normalized Discounted Cumulative Gain (NDCG) at 10 and 20. Recall@k measures the proportion of the true (preferred) items from the top-k recommendation. For a user u ∈ U, the algorithm recommends an ordered set (in descending order of preference) of top-k items I_k(u) = {i_n}_{n=1}^{k} ⊂ I. There is a set of true preferred items I_u^+ for a user and the number of true positives is |I_u^+ ∩ I_k(u)|, so the recall is computed as follows: Recall@k = |I_u^+ ∩ I_k(u)| / |I_u^+|. Normalized Discounted Cumulative Gain (NDCG) [53] computes a score for recommendation I_k(u) which emphasizes higher-ranked true positives. D_k(n) = 1[i_n ∈ I_u^+] / log_2(n + 1) accounts for a relevancy score. NDCG@k = DCG_k / IDCG_k = Σ_{i_n ∈ I_k(u)} D_k(n) / Σ_{i_n ∈ I_{u,k}^+} D_k(n), where I_{u,k}^+ is the ordered set of top-k true preferred items in descending order of preference. We use the embedding model (MGCCF) presented in [52], which achieves state-of-the-art performance for this task.

Discussion: From Table 6, we observe that for the recommendation task, adaptation of the copying model results in a simple, intuitive 'EBPR' algorithm which improves recall over the baseline and does not involve any retraining. Use of the sampled graphs to train the model offers significantly better performance.

8. Conclusion

We present a novel generative model for graphs called node copying. The proposed model is based on the idea that neighbourhoods of similar nodes in a graph are similar and can be swapped. It is flexible and can incorporate many existing graph-based learning techniques for identifying node similarity. Sampling of graphs from this model is simple. The sampled graphs preserve important structural properties. We have demonstrated that the use of the model can improve performance for a variety of downstream learning tasks. Future work will investigate developing more general measures of similarity among the nodes of a graph and incorporating them in the copying model, and exploring potential extensions of our theoretical results to more general graphon models.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Description of the datasets

We conduct experiments on seven datasets. In the citation datasets (Cora [40], Citeseer [40], and Pubmed [55]), nodes represent research papers and the task is to classify them into topics. The graph is built by adding an undirected edge when one paper cites another, and features are derived from the keywords of the document. Coauthor CS is a coauthorship graph where each node is an author and two authors are connected if they have coauthored a paper. The node features represent keywords from each author's papers and the node labels are the active areas of research of the authors. Amazon-Photo [41] is a portion of the Amazon co-purchase dataset graph from [56]. In this case nodes are products, the features are based on the reviews, and the label is the product type. Items often bought together are linked in the graph. The bitcoin dataset is a transactional network snapshot taken from the dynamic graphs provided in the Elliptic Bitcoin Dataset [57]. In this dataset, nodes represent transactions and edges represent a directed bitcoin flow. Each transaction is labeled as being either fraudulent or legitimate. The Polblogs [58] dataset is a political blogs network.

Table A.7
Statistics of evaluation datasets for node classification and generative graph models.
               Cora     Citeseer   Pubmed   Coauthor CS   Amazon-Photo   Bitcoin   Polblogs
No. Classes    7        6          3        15            8              2         2
No. Features   1,433    3,703      500      6,805         745            166       N/A
No. Nodes      2,485    2,110      19,717   18,333        7,487          472       1,222
Edge Density   0.04%    0.04%      0.01%    0.01%         0.11%          0.11%     0.56%

Table A.8
Statistics of evaluation datasets for the recommender system experiments.

Appendix B. Details of the generative models used for BGCN

Appendix C. Bayesian personalized ranking

In the Bayesian personalized ranking framework of [51], our task is to maximize p(Θ | {>_u}_{D_S}) ∝ p({>_u}_{D_S} | Θ) p(Θ). Here Θ are the parameters of the model, and {>_u}_{D_S} are the observed preferences in the training data. We aim to identify the parameters that maximize this posterior over all users and all pairs of items. If users are assumed to act independently, then we can write:

p({>_u}_{D_S} | Θ) = Π_{(u,i,j)∈D_S} p(i >_u j | Θ).   (C.1)
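The BPR quantities above can be sketched as follows, with embeddings stored in dicts keyed by user and item ids. This is a hedged illustration under assumed names, not the authors' implementation.

```python
import math

def pairwise_prob(e_u, e_i, e_j):
    """p(i >_u j | Theta) = sigma(e_u . e_i - e_u . e_j),
    where sigma is the sigmoid and . the inner product."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return 1.0 / (1.0 + math.exp(-(dot(e_u, e_i) - dot(e_u, e_j))))

def bpr_log_likelihood(triples, user_emb, item_emb):
    """Eq. (C.1) in log form: sum over training triples (u, i, j) of
    log p(i >_u j | Theta), assuming users act independently."""
    return sum(math.log(pairwise_prob(user_emb[u], item_emb[i], item_emb[j]))
               for u, i, j in triples)
```

Maximizing this log likelihood plus a log prior on the embeddings recovers the maximum a posteriori objective that is optimized with SGD in the main text.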
[15] L. Peng, L. Carvalho, Bayesian degree-corrected stochastic blockmodels for community detection, Electron. J. Statist. 10 (2) (2016) 2746–2779.
[16] P. Latouche, E. Birmelé, C. Ambroise, Overlapping stochastic block models with application to the French political blogosphere, Annals Applied Statist. 5 (1) (2011) 309–336.
[17] K.T. Miller, T.L. Griffiths, M.I. Jordan, Nonparametric latent feature models for link prediction, in: Proc. Adv. Neural Info. Process. Syst., 2009, pp. 1276–1284.
[18] B. Fosdick, D. Larremore, J. Nishimura, J. Ugander, Configuring random graph models with fixed degree sequences, SIAM Rev. 60 (2) (2018) 315–355.
[19] G. Casiraghi, Analytical formulation of the block-constrained configuration model, arXiv preprint arXiv:1811.05337 (2018).
[20] M. Drobyshevskiy, D. Turdakov, Random graph modeling: a survey of the concepts, ACM Comput. Surv. 52 (6) (2019) 1–36.
[21] K. Bringmann, R. Keusch, J. Lengler, Geometric inhomogeneous random graphs, Theoretical Comput. Science 760 (2019) 35–54.
[22] V. Veitch, D.M. Roy, The class of random graphs arising from exchangeable random measures, arXiv preprint arXiv:1512.03099 (2015).
[23] S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, C. Zhang, Adversarially regularized graph autoencoder for graph embedding, in: Proc. Int. Joint Conf. on Artificial Intell., 2018, pp. 2609–2615.
[24] A. Grover, A. Zweig, S. Ermon, Graphite: Iterative generative modeling of graphs, in: Proc. Int. Conf. Machine Learning, 2019, pp. 2434–2444.
[25] M. Simonovsky, N. Komodakis, GraphVAE: Towards generation of small graphs using variational autoencoders, in: Proc. Int. Conf. Artificial Neural Networks and Machine Learning, 2018, pp. 412–422.
[26] T. Ma, J. Chen, C. Xiao, Constrained generation of semantically valid graphs via regularizing variational autoencoders, in: Proc. Adv. Neural Info. Process. Syst., 2018, pp. 7113–7124.
[27] N. De Cao, T. Kipf, MolGAN: An implicit generative model for small molecular graphs, in: Proc. Workshop on Theoretical Foundations and Applications of Deep Generative Models, Int. Conf. Machine Learning, 2018.
[28] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, P. Battaglia, Learning deep generative models of graphs, arXiv preprint arXiv:1803.03324 (2018).
[29] J. You, R. Ying, X. Ren, W. Hamilton, J. Leskovec, GraphRNN: Generating realistic graphs with deep auto-regressive models, in: Proc. Int. Conf. Machine Learning, 2018, pp. 5708–5717.
[30] R. Liao, Y. Li, Y. Song, S. Wang, C. Nash, W. Hamilton, D.K. Duvenaud, R. Urtasun, R. Zemel, Efficient graph generation with graph recurrent attention networks, in: Proc. Adv. Neural Info. Process. Syst., 2019, pp. 4257–4267.
[31] X. Dong, D. Thanou, P. Frossard, P. Vandergheynst, Learning Laplacian matrix in smooth graph signal representations, IEEE Trans. Sig. Proc. 64 (23) (2016) 6160–6173.
[32] V. Kalofolias, N. Perraudin, Large scale graph learning from smooth signals, in: Proc. Int. Conf. Learning Representations, 2019.
[33] Y. Zhang, S. Pal, M. Coates, D. Üstebay, Bayesian graph convolutional neural networks for semi-supervised classification, in: Proc. AAAI Conf. Artificial Intell., 2019, pp. 5829–5836.
[34] J. Ma, W. Tang, J. Zhu, Q. Mei, A flexible generative framework for graph-based semi-supervised learning, in: Proc. Adv. Neural Info. Process. Syst., 2019, pp. 3276–3285.
[35] P. Elinas, E.V. Bonilla, L. Tiao, Variational inference for graph convolutional networks in the absence of graph data and adversarial settings, in: Proc. Adv. Neural Info. Process. Syst., 2020, pp. 18648–18660.
[36] S. Pal, S. Malekmohammadi, F. Regol, Y. Zhang, Y. Xu, M. Coates, Non-parametric graph learning for Bayesian graph neural networks, in: Proc. Conf. Uncertainty Artificial Intell., 2020, pp. 1318–1327.
[37] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: Proc. Int. Conf. Machine Learning, 2016, pp. 1050–1059.
[38] A. Hasanzadeh, E. Hajiramezanali, S. Boluki, M. Zhou, N. Duffield, K. Narayanan, X. Qian, Bayesian graph neural networks with adaptive connection sampling, in: Proc. Int. Conf. Machine Learning, 2020, pp. 4094–4104.
[39] T. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: Proc. Int. Conf. Learning Representations, 2017.
[40] P. Sen, G. Namata, et al., Collective classification in network data, AI Magazine 29 (3) (2008) 93–106.
[41] O. Shchur, M. Mumme, A. Bojchevski, S. Günnemann, Pitfalls of graph neural network evaluation, in: Proc. Relational Representation Learning Workshop, Adv. Neural Info. Process. Syst., 2018.
[42] A. Wijesinghe, Q. Wang, DFNets: Spectral CNNs for graphs with feedback-looped filters, in: Proc. Adv. Neural Info. Process. Syst., 2019, pp. 6007–6018.
[43] L. Sun, J. Wang, P.S. Yu, B. Li, Adversarial attack and defense on graph data: a survey, arXiv preprint arXiv:1812.10528 (2018).
[44] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph attention networks, in: Proc. Int. Conf. Learning Representations, 2018.
[45] M. Waniek, T.P. Michalak, M.J. Wooldridge, T. Rahwan, Hiding individuals and communities in a social network, Nat. Hum. Behav. 2 (2) (2018) 139–147.
[46] D. Zügner, A. Akbarnejad, S. Günnemann, Adversarial attacks on neural networks for graph data, in: Proc. Int. Conf. on Knowledge Discovery & Data Mining, 2018, pp. 2847–2856.
[47] J. Chen, Y. Wu, X. Xu, Y. Chen, H. Zheng, Q. Xuan, Fast gradient attack on network embedding, arXiv preprint arXiv:1809.02797 (2018).
[48] N. Entezari, S.A. Al-Sayouri, A. Darvishzadeh, E.E. Papalexakis, All you need is low rank: defending against adversarial attacks on graphs, in: Proc. Int. Conf. on Web Search and Data Mining, 2020, pp. 169–177.
[49] X. Wang, X. He, M. Wang, F. Feng, T. Chua, Neural graph collaborative filtering, in: Proc. Conf. Research and Development in Info. Retrieval, 2019, pp. 165–174.
[50] R. Ying, R. He, K. Chen, P. Eksombatchai, W.L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: Proc. Int. Conf. on Knowledge Discovery & Data Mining, 2018, pp. 974–983.
[51] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Proc. Conf. Uncertainty in Artificial Intell., 2009, pp. 452–461.
[52] J. Sun, Y. Zhang, C. Ma, M. Coates, H. Guo, R. Tang, X. He, Multi-graph convolution collaborative filtering, in: Proc. IEEE Int. Conf. on Data Mining, 2019, pp. 1306–1311.
[53] K. Järvelin, J. Kekäläinen, IR evaluation methods for retrieving highly relevant documents, in: Proc. Int. ACM SIGIR Conf. Research and Development in Inf. Retrieval, 2000, pp. 41–48.
[54] K. Sun, P. Koniusz, J. Wang, Fisher-Bures adversary graph convolutional networks, in: Proc. Conf. Uncertainty in Artificial Intell., 2019.
[55] G. Namata, B. London, L. Getoor, B. Huang, Query-driven active surveying for collective classification, in: Proc. Workshop on Mining and Learning with Graphs, Int. Conf. Machine Learning, 2012.
[56] J. McAuley, C. Targett, Q. Shi, A. van den Hengel, Image-based recommendations on styles and substitutes, in: Proc. Int. ACM SIGIR Conf. Research and Development Info. Retrieval, 2015, pp. 43–52.
[57] M. Weber, G. Domeniconi, J. Chen, D.K.I. Weidele, C. Bellei, T. Robinson, C.E. Leiserson, Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics, in: Proc. KDD Deep Learning on Graphs: Methods and Applications Workshop, 2018.
[58] L.A. Adamic, N. Glance, The political blogosphere and the 2004 U.S. election: Divided they blog, in: Proc. Int. Workshop on Link Discovery, 2005, pp. 36–43.
[59] X. Zhang, C. Moore, M. Newman, Random graph models for dynamic networks, Eur. Phys. J. B 90 (2017) 200.