In conclusion, it remains an open problem to design a GCN model that effectively prevents over-smoothing and achieves state-of-the-art results with truly deep network structures. Due to this challenge, it is even unclear whether the network depth is a resource or a burden in designing new graph neural networks. In this paper, we give a positive answer to this open problem by demonstrating that the vanilla GCN (Kipf & Welling, 2017) can be extended to a deep model with two simple yet effective modifications. In particular, we propose Graph Convolutional Network via Initial residual and Identity mapping (GCNII), a deep GCN model that resolves the over-smoothing problem. At each layer, the initial residual constructs a skip connection from the input layer, while the identity mapping adds an identity matrix to the weight matrix. The empirical study demonstrates that the resulting deep model achieves new state-of-the-art results on various semi-supervised and full-supervised tasks.

The graph Laplacian admits the eigendecomposition L = UΛU^⊤, where Λ is a diagonal matrix of the eigenvalues of L and U ∈ R^{n×n} is a unitary matrix that consists of the eigenvectors of L. The graph convolution operation between a signal x and a filter g_γ(Λ) = diag(γ) is defined as g_γ(L) ∗ x = U g_γ(Λ) U^⊤ x, where the parameter γ ∈ R^n corresponds to a vector of spectral filter coefficients.

Vanilla GCN. (Kipf & Welling, 2017) and (Defferrard et al., 2016) suggest that the graph convolution operation can be further approximated by a K-th order polynomial of the Laplacian:

U g_θ(Λ) U^⊤ x ≈ U (Σ_{ℓ=0}^{K} θ_ℓ Λ^ℓ) U^⊤ x = Σ_{ℓ=0}^{K} θ_ℓ L^ℓ x,

where the θ_ℓ's are the polynomial filter coefficients.
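To make the polynomial form concrete, the following is a minimal NumPy sketch of applying such a filter directly in the vertex domain, without an eigendecomposition; the helper names and the toy setup are illustrative and not taken from the paper.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for an adjacency matrix A (no self-loops here)."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt

def polynomial_filter(L, x, theta):
    """Apply g_theta(L) * x ≈ sum_ℓ theta[ℓ] · L^ℓ x, accumulating
    theta[0]·x + theta[1]·Lx + theta[2]·L²x + ... term by term."""
    out = np.zeros_like(x, dtype=float)
    Lx = x.astype(float)          # holds L^ℓ x for the current ℓ
    for t in theta:
        out += t * Lx
        Lx = L @ Lx
    return out
```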
where H^(0) = f_θ(X). By decoupling feature transformation and propagation, PPNP and APPNP can aggregate information from multi-hop neighbors without increasing the number of layers in the neural network.

JKNet. The first deep GCN framework is proposed by (Xu et al., 2018). At the last layer, JKNet combines all previous representations H^(1), ..., H^(K) to learn representations of different orders for different graph substructures. (Xu et al., 2018) proves that 1) a K-layer vanilla GCN model simulates random walks of K steps in the self-looped graph G̃ and 2) by combining all representations from the previous layers, JKNet relieves the problem of over-smoothing.

DropEdge. A recent work (Rong et al., 2020) suggests that randomly removing some edges from G̃ retards the convergence speed of over-smoothing. Let P̃_drop denote the renormalized graph convolution matrix with some edges removed at random; the vanilla GCN equipped with DropEdge is defined as

H^(ℓ+1) = σ(P̃_drop H^(ℓ) W^(ℓ)).    (4)

3. GCNII Model

A residual connection to the previous layer only partially relieves the over-smoothing problem; the performance of the model still degrades as we stack more layers. We propose that, instead of using a residual connection to carry the information from the previous layer, we construct a connection to the initial representation H^(0). The initial residual connection ensures that the final representation of each node retains at least a fraction of α_ℓ from the input layer even if we stack many layers. In practice, we can simply set α_ℓ = 0.1 or 0.2, so that the final representation of each node consists of at least a fraction of the input feature. We also note that H^(0) does not necessarily have to be the feature matrix X. If the feature dimension d is large, we can apply a fully-connected neural network on X to obtain a lower-dimensional initial representation H^(0) before the forward propagation.

Finally, we recall that APPNP (Klicpera et al., 2019a) employs a similar approach to the initial residual connection in the context of Personalized PageRank. However, (Klicpera et al., 2019a) also shows that performing multiple non-linearity operations on the feature matrix leads to overfitting and thus results in a performance drop. Therefore, APPNP applies a linear combination between different layers and thus remains a shallow model. This suggests that the idea of the initial residual alone is not sufficient to extend GCN to a deep model.
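As a concrete illustration of the initial residual connection described above, the sketch below mixes the propagated representation with H^(0) at every layer. It is a minimal NumPy sketch, assuming an α-weighted combination (1 − α)P̃H^(ℓ) + αH^(0) followed by a linear transform and ReLU; the helper names, the default α = 0.1, and this exact form are illustrative assumptions rather than the paper's reference implementation.

```python
import numpy as np

def renormalized_conv(A):
    """P̃ = D̃^{-1/2} Ã D̃^{-1/2} with the renormalization trick Ã = A + I."""
    A_tilde = A + np.eye(A.shape[0])
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def initial_residual_layer(P, H, H0, W, alpha=0.1):
    """σ(((1 - α) P̃ H + α H0) W): the skip connection always goes back to
    the initial representation H0, not to the previous layer."""
    return np.maximum(((1 - alpha) * (P @ H) + alpha * H0) @ W, 0.0)
```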
W_ℓ to avoid over-fitting, while the latter is desirable in semi-supervised tasks where training data is limited.

• (Oono & Suzuki, 2020) theoretically proves that the node features of a K-layer GCN will converge to a subspace and incur information loss. In particular, the rate of convergence depends on s^K, where s is the maximum singular value of the weight matrices W^(ℓ), ℓ = 0, ..., K − 1. By replacing W^(ℓ) with (1 − β_ℓ)I_n + β_ℓ W^(ℓ) and imposing regularization on W^(ℓ), we force the norm of W^(ℓ) to be small. Consequently, the singular values of (1 − β_ℓ)I_n + β_ℓ W^(ℓ) will be close to 1. Therefore, the maximum singular value s will also be close to 1, which implies that s^K remains large and the information loss is relieved.

The principle behind setting β_ℓ is to ensure that the decay of the weight matrix adaptively increases as we stack more layers. In practice, we set β_ℓ = log(λ/ℓ + 1) ≈ λ/ℓ, where λ is a hyperparameter.

Connection to iterative shrinkage-thresholding. Recently, there has been work on optimization-inspired network structure design (Zhang & Ghanem, 2018; Papyan et al., 2017). The idea is that a feedforward neural network can be considered as an iterative optimization algorithm to minimize some function, and it has been hypothesized that better optimization algorithms might lead to better network structures (Li et al., 2018a). Thus, theories of numerical optimization algorithms may inspire the design of better and more interpretable network structures. As we will show next, the use of identity mappings in our structure is also well motivated from this perspective. We consider the LASSO objective:

min_{x ∈ R^n} (1/2)‖Bx − y‖₂² + λ‖x‖₁.

Similar to compressive sensing, we consider x as the signal we are trying to recover, B as the measurement matrix, and y as the signal we observe. In our setting, y is the original feature of a node, and x is the node embedding that the network tries to learn. As opposed to standard regression models, the design matrix B consists of unknown parameters and is learned through back-propagation. So, this is in the same spirit as the sparse coding problem, which has been used to design and to analyze CNNs (Papyan et al., 2017). Iterative shrinkage-thresholding algorithms are effective for solving the above optimization problem, in which the update in the (t + 1)-th iteration is

x_{t+1} = P_{μ_t λ}(x_t − μ_t B^⊤(Bx_t − y)),

where μ_t is the step size and P_β(·) denotes the entrywise soft-thresholding operator.

Now, if we reparameterize −B^⊤B by W, the above update formula becomes quite similar to the one used in our method. More specifically, we have x_{t+1} = P_{μ_t λ}((I + μ_t W)x_t + μ_t B^⊤ y), where the term μ_t B^⊤ y corresponds to the initial residual, and I + μ_t W corresponds to the identity mapping in our model (5). The soft-thresholding operator acts as the nonlinear activation function, which is similar to the effect of the ReLU activation. In conclusion, our network structure, especially the use of the identity mapping, is well motivated from iterative shrinkage-thresholding algorithms for solving LASSO.
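A compact ISTA sketch in NumPy makes the correspondence explicit: with W = −B^⊤B, the gradient step x_t − μB^⊤(Bx_t − y) is exactly (I + μW)x_t + μB^⊤y. The function names, step-size choice, and iteration count below are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, tau):
    """Entrywise soft-thresholding P_tau(z) = sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista_lasso(B, y, lam, mu=None, iters=200):
    """Minimize (1/2)||Bx - y||_2^2 + lam*||x||_1 with the ISTA iteration
    x_{t+1} = P_{mu*lam}(x_t - mu * B^T (B x_t - y)), i.e. an
    identity-mapping-plus-initial-residual update once W = -B^T B."""
    if mu is None:
        mu = 1.0 / np.linalg.norm(B, 2) ** 2   # step size <= 1 / ||B||_2^2
    x = np.zeros(B.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - mu * B.T @ (B @ x - y), mu * lam)
    return x
```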
4. Spectral Analysis

4.1. Spectral analysis of multi-layer GCN

We consider the following GCN model with residual connection:

H^(ℓ+1) = σ((P̃H^(ℓ) + H^(ℓ)) W^(ℓ)).    (6)

Recall that P̃ = D̃^{−1/2}ÃD̃^{−1/2} is the graph convolution matrix with the renormalization trick. (Wang et al., 2019) points out that equation (6) simulates a lazy random walk with the transition matrix (I_n + D̃^{−1/2}ÃD̃^{−1/2})/2. Such a lazy random walk eventually converges to the stationary state and thus leads to over-smoothing. We now derive the closed form of the stationary vector and analyze the rate of this convergence. Our analysis suggests that the convergence rate of an individual node depends on its degree, and we conduct experiments to back up this theoretical finding. In particular, we have the following theorem.

Theorem 1. Assume the self-looped graph G̃ is connected. Let h^(K) = ((I_n + D̃^{−1/2}ÃD̃^{−1/2})/2)^K · x denote the representation obtained by applying a K-layer renormalized graph convolution with residual connection to a graph signal x. Let λ_G̃ denote the spectral gap of the self-looped graph G̃, that is, the least nonzero eigenvalue of the normalized Laplacian L̃ = I_n − D̃^{−1/2}ÃD̃^{−1/2}. We have

1) As K goes to infinity, h^(K) converges to π = (⟨D̃^{1/2}1, x⟩ / (2m + n)) · D̃^{1/2}1, where 1 denotes the all-one vector.

2) The convergence rate is determined by

h^(K) = π ± (Σ_{i=1}^{n} x_i) · (1 − λ_G̃²/2)^K · 1.    (7)
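As a quick numerical sanity check of part 1) of Theorem 1, the snippet below propagates an arbitrary signal on a small toy graph with the lazy-walk operator from equation (6) and compares the result with π; the graph, the signal, and the number of iterations are arbitrary choices for illustration.

```python
import numpy as np

# Toy graph: a path on four nodes (not from the paper).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
A_t = A + np.eye(n)                        # self-looped graph G̃
d_t = A_t.sum(axis=1)                      # degrees d_i + 1
P_t = np.diag(d_t ** -0.5) @ A_t @ np.diag(d_t ** -0.5)
lazy = (np.eye(n) + P_t) / 2               # lazy-walk operator of equation (6)

x = np.array([1.0, -0.5, 2.0, 0.3])        # an arbitrary graph signal
h = x.copy()
for _ in range(400):                       # apply K = 400 propagation steps
    h = lazy @ h

m = A.sum() / 2                            # number of edges
pi = (np.sqrt(d_t) @ x) / (2 * m + n) * np.sqrt(d_t)   # π = <D̃^{1/2}1, x>/(2m+n) · D̃^{1/2}1
print(np.allclose(h, pi, atol=1e-6))       # True: h^(K) converges to π
```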
The proof of Theorem 1 can be found in the supplementary materials. There are two consequences of Theorem 1. First of all, it suggests that the K-th representation of GCN, h^(K), converges to the vector π = (⟨D̃^{1/2}1, x⟩ / (2m + n)) · D̃^{1/2}1. Such convergence leads to over-smoothing, as the vector π only carries two kinds of information: the degree of each node, and the inner product between the initial signal x and the vector D̃^{1/2}1.

Convergence rate and node degree. Equation (7) suggests that the convergence rate depends on the summation of the feature entries Σ_{i=1}^{n} x_i and the spectral gap λ_G̃. If we take a closer look at the relative convergence rate for an individual node j, we can express its final representation h^(K)(j) as

h^(K)(j) = √(d_j + 1) · (Σ_{i=1}^{n} √(d_i + 1) x_i) / (2m + n) ± (Σ_{i=1}^{n} x_i) · (1 − λ_G̃²/2)^K / √(d_j + 1).

This suggests that if a node j has a higher degree d_j (and hence a larger √(d_j + 1)), its representation h^(K)(j) converges faster to the stationary state π(j). Based on this fact, we make the following conjecture.

Conjecture 1. Nodes with higher degrees are more likely to suffer from over-smoothing.

We will verify Conjecture 1 on real-world datasets in our experiments.

4.2. Spectral analysis of GCNII

We consider the spectral domain of the self-looped graph G̃. Recall that a polynomial filter of order K on a graph signal x is defined as Σ_{ℓ=0}^{K} θ_ℓ L̃^ℓ x, where L̃ is the normalized Laplacian matrix of G̃ and the θ_ℓ's are the polynomial coefficients. (Wu et al., 2019) proves that a K-layer GCN simulates a polynomial filter of order K with fixed coefficients θ. As we shall prove later, such fixed coefficients limit the expressive power of GCN and thus lead to over-smoothing. On the other hand, we show that a K-layer GCNII model can express a K-th order polynomial filter with arbitrary coefficients.

Theorem 2. Consider the self-looped graph G̃ and a graph signal x. A K-layer GCNII can express a K-th order polynomial filter Σ_{ℓ=0}^{K} θ_ℓ L̃^ℓ x with arbitrary coefficients θ.

The proof of Theorem 2 can be found in the supplementary materials. Intuitively, the parameters β_ℓ allow GCNII to simulate the coefficients θ_ℓ of the polynomial filter.

Expressive power and over-smoothing. The ability to express a polynomial filter with arbitrary coefficients is essential for preventing over-smoothing. To see why this is the case, recall that Theorem 1 suggests a K-layer vanilla GCN simulates a fixed K-order polynomial filter P̃^K x, where P̃ is the renormalized graph convolution matrix. Over-smoothing is caused by the fact that P̃^K x converges to a distribution isolated from the input feature x, thus incurring vanishing gradients. DropEdge (Rong et al., 2020) slows down the rate of convergence, but will eventually fail as K goes to infinity.

On the other hand, Theorem 2 suggests that deep GCNII converges to a distribution that carries information from both the input feature and the graph structure. This property alone ensures that GCNII will not suffer from over-smoothing even if the number of layers goes to infinity. More precisely, Theorem 2 states that a K-layer GCNII can express h^(K) = Σ_{ℓ=0}^{K} θ_ℓ L̃^ℓ · x with arbitrary coefficients θ. Since the renormalized graph convolution matrix P̃ = I_n − L̃, it follows that a K-layer GCNII can express h^(K) = Σ_{ℓ=0}^{K} θ′_ℓ P̃^ℓ · x with arbitrary coefficients θ′. Note that with a proper choice of θ′, h^(K) can carry information from both the input feature and the graph structure even with K going to infinity. For example, APPNP (Klicpera et al., 2019a) and GDC (Klicpera et al., 2019b) set θ′_i = α(1 − α)^i for some constant 0 < α < 1. As K goes to infinity, h^(K) = Σ_{ℓ=0}^{K} θ′_ℓ P̃^ℓ · x converges to the Personalized PageRank vector of x, which is a function of both the adjacency matrix Ã and the input feature vector x. The difference between GCNII and APPNP/GDC is that 1) the coefficient vector θ in our model is learned from the input feature and the labels, and 2) we impose a ReLU operation at each layer.
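For instance, the APPNP/GDC-style choice θ′_ℓ = α(1 − α)^ℓ can be evaluated directly as a truncated polynomial in P̃; the NumPy sketch below is illustrative (α, K, and the function name are arbitrary choices, not taken from the paper).

```python
import numpy as np

def ppr_filter(P_tilde, x, alpha=0.1, K=64):
    """Truncated Personalized-PageRank-style filter
    h = sum_{ℓ=0}^{K} alpha * (1 - alpha)^ℓ * P̃^ℓ x,
    i.e. the fixed-coefficient polynomial filter θ'_ℓ = α(1-α)^ℓ
    mentioned above for APPNP/GDC."""
    h = np.zeros_like(x, dtype=float)
    p_pow_x = x.astype(float)              # holds P̃^ℓ x for the current ℓ
    for ell in range(K + 1):
        h += alpha * (1 - alpha) ** ell * p_pow_x
        p_pow_x = P_tilde @ p_pow_x
    return h
```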
5. Other Related Work

Spectral-based GCNs have been extensively studied over the past few years. (Li et al., 2018c) improves flexibility by learning a task-driven adaptive graph for the input data during training. (Xu et al., 2019) uses the graph wavelet basis instead of the Fourier basis to improve sparseness and locality. Another line of work focuses on attention-based GCN models (Veličković et al., 2018; Thekumparampil et al., 2018; Zhang et al., 2018), which learn the edge weights at each layer based on node features. (Abu-El-Haija et al., 2019) learns neighborhood mixing relationships by mixing neighborhood information at various distances, but still uses a two-layer model. In graph-level classification, (Li et al., 2019a) proposes residual connections and dilated convolutions to facilitate the training of deep models. (Gao & Ji, 2019; Lee et al., 2019) are devoted to extending pooling operations to graph neural networks. For unsupervised representation learning, (Velickovic et al., 2019) trains a graph convolutional encoder by maximizing mutual information. (Pei et al., 2020) builds structural neighborhoods in the latent space of graph embeddings.
Table 1. Dataset statistics.

Dataset      Classes   Nodes    Edges     Features
Cora         7         2,708    5,429     1,433
Citeseer     6         3,327    4,732     3,703
Pubmed       3         19,717   44,338    500
Chameleon    4         2,277    36,101    2,325
Cornell      5         183      295       1,703
Texas        5         183      309       1,703
Wisconsin    5         251      499       1,703
PPI          121       56,944   818,716   50

Table 2. Summary of classification accuracy (%) results on Cora, Citeseer, and Pubmed. The number in parentheses corresponds to the number of layers of the model.

Method        Cora              Citeseer          Pubmed
GCN           81.5              71.1              79.0
GAT           83.1              70.8              78.5
APPNP         83.3              71.8              80.1
JKNet         81.1 (4)          69.8 (16)         78.1 (32)
JKNet(Drop)   83.3 (4)          72.6 (16)         79.2 (32)
Incep(Drop)   83.5 (64)         72.7 (4)          79.5 (4)
GCNII         85.5 ± 0.5 (64)   73.4 ± 0.6 (32)   80.2 ± 0.4 (16)
GCNII*        85.3 ± 0.2 (64)   73.2 ± 0.8 (32)   80.3 ± 0.4 (16)
Table 3. Summary of classification accuracy (%) results with various depths.

Dataset    Method         2      4      8      16     32     64   (layers)
Cora       GCN            81.1   80.4   69.5   64.9   60.3   28.7
           GCN(Drop)      82.8   82.0   75.8   75.7   62.5   49.5
           JKNet          -      80.2   80.7   80.2   81.1   71.5
           JKNet(Drop)    -      83.3   82.6   83.0   82.5   83.2
           Incep          -      77.6   76.5   81.7   81.7   80.0
           Incep(Drop)    -      82.9   82.5   83.1   83.1   83.5
           GCNII          82.2   82.6   84.2   84.6   85.4   85.5
           GCNII*         80.2   82.3   82.8   83.5   84.9   85.3
Citeseer   GCN            70.8   67.6   30.2   18.3   25.0   20.0
           GCN(Drop)      72.3   70.6   61.4   57.2   41.6   34.4
           JKNet          -      68.7   67.7   69.8   68.2   63.4
           JKNet(Drop)    -      72.6   71.8   72.6   70.8   72.2
           Incep          -      69.3   68.4   70.2   68.0   67.5
           Incep(Drop)    -      72.7   71.4   72.5   72.6   71.0
           GCNII          68.2   68.9   70.6   72.9   73.4   73.4
           GCNII*         66.1   67.9   70.6   72.0   73.2   73.1
Pubmed     GCN            79.0   76.5   61.2   40.9   22.4   35.3
           GCN(Drop)      79.6   79.4   78.1   78.5   77.0   61.5
           JKNet          -      78.0   78.1   72.6   72.4   74.5
           JKNet(Drop)    -      78.7   78.7   79.1   79.2   78.9
           Incep          -      77.7   77.9   74.9   OOM    OOM
           Incep(Drop)    -      79.5   78.6   79.0   OOM    OOM
           GCNII          78.2   78.8   79.3   80.2   79.8   79.7
           GCNII*         77.7   78.2   78.8   80.3   79.8   80.1

Table 4. Summary of micro-averaged F1 scores on PPI.

Method                              PPI
GraphSAGE (Hamilton et al., 2017)   61.2
VR-GCN (Chen et al., 2018b)         97.8
GaAN (Zhang et al., 2018)           98.71
GAT (Veličković et al., 2018)       97.3
JKNet (Xu et al., 2018)             97.6
GeniePath (Liu et al., 2019)        98.5
Cluster-GCN (Chiang et al., 2019)   99.36
GCNII                               99.53 ± 0.01
GCNII*                              99.56 ± 0.02

GCNII achieves new state-of-the-art performance across all three datasets. Notably, GCNII outperforms the previous state-of-the-art methods by at least 2%. It is also worthwhile to note that the two recent deep models, JKNet and IncepGCN with DropEdge, do not seem to offer significant advantages over the shallow model APPNP. On the other hand, our method achieves this result with a 64-layer model, which demonstrates the benefit of deep network structures.

A detailed comparison with other deep models. Table 3 summarizes the results for the deep models with various numbers of layers. We reuse the best-reported results for JKNet, JKNet(Drop) and Incep(Drop)¹. We observe that on Cora and Citeseer, the performance of GCNII and GCNII* consistently improves as we increase the number of layers. On Pubmed, GCNII and GCNII* achieve the best results with 16 layers, and maintain similar performance as we increase the network depth to 64. We attribute this quality to the identity mapping technique. Overall, the results suggest that with initial residual and identity mapping, we can resolve the over-smoothing problem and extend the vanilla GCN into a truly deep model. On the other hand, the performance of GCN with DropEdge and JKNet drops rapidly as the number of layers exceeds 32, which means they still suffer from over-smoothing.

¹ https://github.com/DropEdge/DropEdge

6.2. Full-Supervised Node Classification

We now evaluate GCNII in the task of full-supervised node classification. Following the setting in (Pei et al., 2020), we use 7 datasets: Cora, Citeseer, Pubmed, Chameleon, Cornell, Texas, and Wisconsin. For each dataset, we randomly split the nodes of each class into 60%, 20%, and 20% for training, validation, and testing, and measure the performance of all models on the test sets over 10 random splits, as suggested in (Pei et al., 2020). We fix the learning rate to 0.01, the dropout rate to 0.5, and the number of hidden units to 64 on all datasets, and perform a hyper-parameter search to tune the other hyper-parameters based on the validation set. Detailed configurations of all models for full-supervised node classification can be found in the supplementary materials. Besides the previously mentioned baselines, we also include three variants of Geom-GCN (Pei et al., 2020) as they are the state-of-the-art models on these datasets.
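A minimal sketch of the per-class 60%/20%/20% split used above (the function name, seed handling, and rounding behavior are illustrative choices, not the paper's released preprocessing code):

```python
import numpy as np

def split_per_class(labels, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly split the nodes of each class into train/val/test index arrays."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        n_tr = int(train_frac * len(idx))
        n_val = int(val_frac * len(idx))
        train.append(idx[:n_tr])
        val.append(idx[n_tr:n_tr + n_val])
        test.append(idx[n_tr + n_val:])
    return np.concatenate(train), np.concatenate(val), np.concatenate(test)
```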
Table 5 reports the mean classification accuracy of each model. We reuse the metrics already reported in (Pei et al., 2020) for GCN, GAT, and Geom-GCN. We observe that GCNII and GCNII* achieve new state-of-the-art results on 6 out of 7 datasets, which demonstrates the superiority of the deep GCNII framework. Notably, GCNII* outperforms APPNP by over 12% on the Wisconsin dataset. This result suggests that by introducing non-linearity into each layer, the predictive power of GCNII is stronger than that of the linear model APPNP.

6.3. Inductive Learning

For the inductive learning task, we apply 9-layer GCNII and GCNII* models with 2048 hidden units to the PPI dataset. We fix the following hyperparameters: α_ℓ = 0.5, λ = 1.0, and a learning rate of 0.001. Due to the large volume of training data, we set the dropout rate to 0.2 and the weight decay to zero. Following (Veličković et al., 2018), we also add a skip connection from the ℓ-th layer to the (ℓ + 1)-th layer of GCNII and GCNII* to speed up the convergence
of the training process. We compare GCNII with the following state-of-the-art methods: GraphSAGE (Hamilton et al., 2017), VR-GCN (Chen et al., 2018b), GaAN (Zhang et al., 2018), GAT (Veličković et al., 2018), JKNet (Xu et al., 2018), GeniePath (Liu et al., 2019), and Cluster-GCN (Chiang et al., 2019). The metrics are summarized in Table 4.

In concordance with our expectations, the results show that GCNII and GCNII* achieve new state-of-the-art performance on PPI. In particular, GCNII achieves this performance with a 9-layer model, while the number of layers of all baseline models is at most 5. This suggests that larger predictive power can also be leveraged by increasing the network depth in the task of inductive learning.

6.5. Ablation Study

Figure 2 shows the results of an ablation study that evaluates the contributions of our two techniques: the initial residual connection and the identity mapping. We make three observations from Figure 2: 1) Directly applying identity mapping to the vanilla GCN retards the effect of over-smoothing only marginally. 2) Directly applying the initial residual connection to the vanilla GCN relieves over-smoothing significantly. However, the best performance is still achieved by the 2-layer model. 3) Applying identity mapping and initial residual connection simultaneously ensures that the accuracy increases with the network depth. This result suggests that both techniques are needed to solve the problem of over-smoothing.
[Figure 2: ablation study; y-axis: accuracy.]
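To make the ablation concrete, the sketch below writes the compared propagation variants (vanilla GCN, with only the initial residual, with only the identity mapping, and both) as one parameterized layer. It is a minimal NumPy sketch; the combined form is our reading of the two techniques described in Section 3, since the paper's exact update equation (5) is not reproduced in this excerpt, and all names and defaults are illustrative.

```python
import numpy as np

def propagation_layer(P, H, W, H0=None, alpha=0.0, beta=None):
    """One propagation step covering the variants compared in the ablation:
      vanilla GCN:         alpha=0, beta=None -> σ(P̃ H W)
      + initial residual:  alpha>0, beta=None -> σ(((1-α) P̃ H + α H0) W)
      + identity mapping:  alpha=0, beta set  -> σ(P̃ H ((1-β) I + β W))
      both techniques:     alpha>0, beta set  (sketch of a GCNII-style layer)
    """
    agg = (1.0 - alpha) * (P @ H)
    if H0 is not None:
        agg = agg + alpha * H0                       # skip connection to the input layer
    weight = W if beta is None else (1.0 - beta) * np.eye(W.shape[0]) + beta * W
    return np.maximum(agg @ weight, 0.0)             # ReLU

def beta_for_layer(ell, lam=0.5):
    """β_ℓ = log(λ/ℓ + 1) ≈ λ/ℓ, as suggested in Section 3 (λ is a hyperparameter)."""
    return np.log(lam / ell + 1.0)
```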
the Research Funds of Renmin University of China under Grant 18XNLG21, by Shanghai Science and Technology Commission (Grant No. 17JC1420200), by Shanghai Sailing Program (Grant No. 18YF1401200), and a research fund supported by Alibaba Group through the Alibaba Innovative Research Program.

References

…tributed network embedding. Data Science and Engineering, 4(2):119–131, 2019.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, pp. 3837–3845, 2016.
Huang, W., Zhang, T., Rong, Y., and Huang, J. Adaptive sampling towards fast graph representation learning. In NeurIPS, pp. 4563–4572, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

Klicpera, J., Bojchevski, A., and Günnemann, S. Predict then propagate: Graph neural networks meet personalized pagerank. In ICLR, 2019a.

Klicpera, J., Weißenberger, S., and Günnemann, S. Diffusion improves graph learning. In NeurIPS, pp. 13333–13345, 2019b.

LeCun, Y., Bengio, Y., et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.

Lee, J., Lee, I., and Kang, J. Self-attention graph pooling. In ICML, 2019.

Li, C. and Goldwasser, D. Encoding social information with graph convolutional networks for political perspective detection in news media. In ACL, 2019.

Li, G., Müller, M., Thabet, A., and Ghanem, B. Deepgcns: Can gcns go as deep as cnns? In The IEEE International Conference on Computer Vision (ICCV), 2019a.

Li, H., Yang, Y., Chen, D., and Lin, Z. Optimization algorithm inspired deep neural network structure design. arXiv preprint arXiv:1810.01638, 2018a.

Li, J., Han, Z., Cheng, H., Su, J., Wang, P., Zhang, J., and Pan, L. Predicting path failure in time-evolving graphs. In KDD. ACM, 2019b.

Li, Q., Han, Z., and Wu, X. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018b.

Li, R., Wang, S., Zhu, F., and Huang, J. Adaptive graph convolutional neural networks. In AAAI, 2018c.

Liu, Z., Chen, C., Li, L., Zhou, J., Li, X., Song, L., and Qi, Y. Geniepath: Graph neural networks with adaptive receptive paths. In AAAI, 2019.

Ma, J., Wen, J., Zhong, M., Chen, W., and Li, X. MMM: multi-source multi-net micro-video recommendation with clustered hidden item representation learning. Data Science and Engineering, 4(3):240–253, 2019.

Oono, K. and Suzuki, T. Graph neural networks exponentially lose expressive power for node classification. In ICLR, 2020.

Page, L., Brin, S., Motwani, R., and Winograd, T. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

Papyan, V., Romano, Y., and Elad, M. Convolutional neural networks analyzed via convolutional sparse coding. The Journal of Machine Learning Research, 18(1):2887–2938, 2017.

Pei, H., Wei, B., Chang, K. C.-C., Lei, Y., and Yang, B. Geom-gcn: Geometric graph convolutional networks. In ICLR, 2020.

Qiu, J., Tang, J., Ma, H., Dong, Y., Wang, K., and Tang, J. Deepinf: Social influence prediction with deep learning. In KDD, pp. 2110–2119. ACM, 2018.

Rong, Y., Huang, W., Xu, T., and Huang, J. Dropedge: Towards deep graph convolutional networks on node classification. In ICLR, 2020.

Rozemberczki, B., Allen, C., and Sarkar, R. Multi-scale attributed node embedding, 2019.

Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.

Shang, J., Xiao, C., Ma, T., Li, H., and Sun, J. Gamenet: Graph augmented memory networks for recommending medication combination. In AAAI, 2019.

Thekumparampil, K. K., Wang, C., Oh, S., and Li, L.-J. Attention-based graph neural network for semi-supervised learning, 2018.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In ICLR, 2018.

Velickovic, P., Fedus, W., Hamilton, W. L., Liò, P., Bengio, Y., and Hjelm, R. D. Deep graph infomax. In ICLR, 2019.

Wang, G., Ying, R., Huang, J., and Leskovec, J. Improving graph attention networks with large margin-based constraints. arXiv preprint arXiv:1910.11945, 2019.

Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. Simplifying graph convolutional networks. In ICML, pp. 6861–6871, 2019.

Xu, B., Shen, H., Cao, Q., Qiu, Y., and Cheng, X. Graph wavelet neural network. In ICLR, 2019.

Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.