Abstract—Driven by the outstanding performance of neural networks in the structured Euclidean domain, recent years have seen a
surge of interest in developing neural networks for graphs and data supported on graphs. The graph is leveraged at each layer of the
neural network as a parameterization to capture detail at the node level with a reduced number of parameters and computational
complexity. Following this rationale, this paper puts forth a general framework that unifies state-of-the-art graph neural networks
(GNNs) through the concept of EdgeNet. An EdgeNet is a GNN architecture that allows different nodes to use different parameters to
weigh the information of different neighbors. By extrapolating this strategy to more iterations between neighboring nodes, the EdgeNet
learns edge- and neighbor-dependent weights to capture local detail. This is the most general local operation that a node can perform
and encompasses under one formulation all existing graph convolutional neural networks (GCNNs) as well as graph attention networks
(GATs). In writing different GNN architectures with a common language, EdgeNets highlight specific architecture advantages and
limitations, while providing guidelines to improve their capacity without compromising their local implementation. For instance, we show
that GCNNs have a parameter sharing structure that induces permutation equivariance. This can be an advantage or a limitation,
depending on the application. In cases where it is a limitation, we propose hybrid approaches and provide insights to develop several
other solutions that promote parameter sharing without enforcing permutation equivariance. Another interesting conclusion is the
unification of GCNNs and GATs —approaches that have been so far perceived as separate. In particular, we show that GATs are
GCNNs on a graph that is learned from the features. This particularization opens the doors to develop alternative attention
mechanisms for improving discriminatory power.
Index Terms—Edge varying, graph neural networks, graph signal processing, graph filters, learning on graphs.
1 INTRODUCTION

most general operation that a node can implement locally. I.e., the most general operation that relies on information exchanges only with neighbor nodes (Section 2).

In its most general form the number of learnable parameters of edge varying GNNs (EdgeNets) is of the order of the number of nodes and edges of the graph. To reduce the complexity of this parameterization, we can regularize EdgeNets in different ways by imposing restrictions on the freedom to choose different parameters at different nodes. We explain in this paper that existing GNN architectures are particular cases of EdgeNets associated with different parameter restrictions. We further utilize the insight of edge varying recursions to propose novel GNN architectures. In consequence, the novel contributions of this paper are:

(i) We define EdgeNets, which parameterize the linear operation of neural networks through a bank of edge varying recursions. EdgeNets are the most generic framework to design GNN architectures (Section 3).

(ii) We show the approaches in [8]–[17] (among others) are EdgeNets where all nodes use the same parameters. We extend the representation power of these networks by adding some level of variability in weighting different nodes and different edges of a node (Section 4).

(iii) Replacing finite length polynomials by rational functions provides an alternative parameterization of convolutional GNNs in terms of autoregressive moving average (ARMA) graph filters [29]. These ARMA GNNs generalize rational functions based on Cayley polynomials [14] (Section 4.3).

(iv) Although GATs and convolutional GNNs are believed to be different entities, GATs can be understood as GNNs with convolutional graph filters where a graph is learned ad hoc in each layer to represent the required abstraction between nodes. The weights of this graph choose neighbors whose values should most influence the computations at a particular node. This reinterpretation allows for the proposal of more generic GATs with higher expressive power (Section 5).

The rest of the paper is organized as follows. Section 2 reviews edge varying recursions on graphs and Section 3 introduces edge varying GNNs (EdgeNets). Section 4 discusses how existing GNN approaches are particular cases of EdgeNets and provides respective extensions. Section 5 is devoted to GAT networks. Numerical results are provided in Section 6 and conclusions are drawn in Section 7.

2 EDGE VARYING LINEAR GRAPH FILTERS

Consider a weighted graph G with vertex set V = {1, . . . , N}, edge set E ⊆ V × V composed of |E| = M ordered pairs (i, j), and weight function W : E → R. For each node i, define the neighborhood N_i = {j : (j, i) ∈ E} as the set of nodes connected to i and let N_i := |N_i| denote the number of elements (neighbors) in this set. Associated with G is a graph shift operator matrix S ∈ R^{N×N} whose sparsity pattern matches that of the edge set, i.e., entry S_{ij} ≠ 0 when (j, i) ∈ E or when i = j. Supported on the vertex set are graph signals x = [x_1, . . . , x_N]^⊤ ∈ R^N in which component x_i is associated with node i ∈ V.

The adjacency of points in time signals or the adjacency of points in images codifies a sparse and local relationship between signal components. This sparsity and locality is leveraged by time or space filters. Similarly, S captures the sparsity and locality of the relationship between components of a signal x supported on G. It is then natural to take the shift operator as the basis for defining filters for graph signals. In this spirit, let Φ^{(0)} be an N × N diagonal matrix and Φ^{(1)}, . . . , Φ^{(K)} be a collection of K matrices sharing the sparsity pattern of I_N + S. Consider then the sequence of signals z^{(k)} as

z^{(k)} = ∏_{k'=0}^{k} Φ^{(k')} x = Φ^{(k:0)} x, for k = 0, . . . , K   (1)

where the product matrix Φ^{(k:0)} := ∏_{k'=0}^{k} Φ^{(k')} = Φ^{(k)} · · · Φ^{(0)} is defined for future reference. Signal z^{(k)} can be computed using the recursion

z^{(k)} = Φ^{(k)} z^{(k−1)}, for k = 0, . . . , K   (2)

with initialization z^{(−1)} = x. This recursive expression implies signal z^{(k)} is produced from z^{(k−1)} using operations that are local in the graph. Indeed, since Φ^{(k)} shares the sparsity pattern of S, node i computes its component z_i^{(k)} as

z_i^{(k)} = Σ_{j ∈ N_i ∪ i} Φ_{ij}^{(k)} z_j^{(k−1)}.   (3)

Particularizing (3) to k = 0, it follows that each node i builds the ith entry of z^{(0)} as a scaled version of its signal x by the diagonal matrix Φ^{(0)}, i.e., z_i^{(0)} = Φ_{ii}^{(0)} x_i. Particularizing to k = 1, (3) yields that the components of z^{(1)} depend on the values of the signal x at most at neighboring nodes. Particularizing to k = 2, (3) shows the components of z^{(2)} depend only on the values of signal z^{(1)} at neighboring nodes which, in turn, depend only on the values of x at their neighbors. Thus, the components of z^{(2)} are a function of the values of x at most at the respective two-hop neighbors. Repeating this argument iteratively, z_i^{(k)} represents an aggregation of information at node i coming from its k-hop neighborhood; see Figure 1.

The collection of signals z^{(k)} behaves like a sequence of scaled shift operations except that instead of shifting the signal in time, the signal is diffused through the graph (the signal values are shifted between neighboring nodes). Leveraging this interpretation, the graph filter output u is defined as the sum

u = Σ_{k=0}^{K} z^{(k)} = Σ_{k=0}^{K} Φ^{(k:0)} x.   (4)

A filter output in time is a sum of scaled and shifted copies of the input signal. That (4) behaves as a filter follows from interpreting Φ^{(k:0)} as a scaled shift, which holds because of its locality. Each shift Φ^{(k:0)} is a recursive composition of individual shifts Φ^{(k)}. These individual shifts represent different operators that respect the structure of G while reweighing individual edges differently when needed.
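To make the recursion concrete, the following minimal numpy sketch computes the filter output in (4) through the local exchanges in (2). The random graph, the parameter values, and all names are illustrative assumptions and not part of the formal development.

```python
# Minimal numpy sketch of the edge varying filter in (2) and (4); the graph,
# the masks, and the parameter values below are illustrative assumptions.
import numpy as np

def edge_varying_filter(Phi, x):
    """Compute u = sum_k Phi^(k:0) x via the local recursion z^(k) = Phi^(k) z^(k-1)."""
    z = x.copy()                 # z^(-1) = x
    u = np.zeros_like(x)
    for Phi_k in Phi:            # k = 0, ..., K
        z = Phi_k @ z            # one exchange with one-hop neighbors
        u += z                   # accumulate z^(k) as in (4)
    return u

rng = np.random.default_rng(0)
N, K = 8, 2
S = (rng.random((N, N)) < 0.3).astype(float)      # random sparse shift operator
np.fill_diagonal(S, 0.0)
mask = (S + np.eye(N)) > 0                        # sparsity pattern of I_N + S
Phi = [np.diag(rng.standard_normal(N))]           # Phi^(0) is diagonal
Phi += [rng.standard_normal((N, N)) * mask for _ in range(K)]
u = edge_varying_filter(Phi, rng.standard_normal(N))
```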
Fig. 1. Edge Varying Graph Filters. Each edge varying matrix Φ^{(k)} acts as a different shift operator that locally combines the graph signal. (Top-left) The colored discs are centered at five reference nodes and their coverage shows the amount of local information needed to compute z^{(1)} = Φ^{(1:0)} x at these nodes. The coverage of the discs in the other graphs shows the signal information needed by the reference nodes to produce the successive outputs. (Bottom) Schematic illustration of the edge varying filter output of order K = 3.
For future reference define the filter matrix A(S) so that (4) rewrites as u = A(S)x. For this to hold, the filter matrix must be

A(S) = Σ_{k=0}^{K} Φ^{(k:0)} = Σ_{k=0}^{K} ∏_{k'=0}^{k} Φ^{(k')}.   (5)

Following [32], A(S) is a Kth order edge varying graph filter. Each matrix Φ^{(k)} contains at most M + N nonzero elements corresponding to the nonzero entries of I_N + S; thus, the total number of parameters that define the filter A(S) in (5) is K(M + N) + N. For short filters, this is smaller than the N^2 components of an arbitrary linear transform. Likewise, computing z^{(k)} = Φ^{(k)} z^{(k−1)} as per (2) incurs a computational complexity of order O(M + N). This further results in an overall computational complexity of order O(K(M + N)) for obtaining the filter output u in (4). The reduced number of parameters and computational cost of the edge varying graph filter is leveraged in the next section to define graph neural network (GNN) architectures with a controlled number of parameters and a computational complexity matched to the sparsity pattern of the graph.

Remark 1. In recursion (2), the edge varying coefficients Φ^{(k)} up to order k behave as different graph shifts and affect recursion z^{(k)}. Alternatively, we can think of shifting the signal with the original shift operator S, i.e., Sz^{(k−1)}, followed by the application of a matrix of coefficients Φ^{(k)}, i.e., Φ^{(k)} S z^{(k−1)}. This would result in an edge varying filter of the form

A(S) = Σ_{k=0}^{K} Φ^{(k)} S^{k−1}.   (6)

The definition in (6) makes the connection with conventional filters more apparent than (4). The matrix power S^k represents a shift of order k, while the matrix Φ^{(k)} represents an edge varying filter coefficient. Different shifts are scaled differently and combined together with a sum. In fact, a (convolutional) linear time-invariant filter can be recovered from (6) if the shift operator S is the adjacency matrix of a directed line graph and the filter coefficient matrix is a scaled identity matrix Φ^{(k)} = a_k I_N. If S is the shift operator of a directed line graph but Φ^{(k)} is an arbitrary matrix, we recover a linear time varying filter. The edge varying filter in (6) retains the number of parameters and the computational complexity of (5), both of order O(K(M + N)). It has been observed that (6) leads to more stable numerical properties than (5) [32] but we shall see that (5) makes it easier to connect to existing GNN architectures. Both formulations can be used interchangeably.

Remark 2. The presence of the edge (j, i) in graph G is interpreted here as an expectation that signal components x_j and x_i are close. The shift operator entry S_{ij} is a measure of the expected similarity. Larger entries indicate linked signal components are more related to each other. Therefore, the definition of the shift operator S makes it a valid stand-in for any graph representation matrix. Forthcoming discussions are valid whether S is an adjacency or a Laplacian matrix in any of their various normalized and unnormalized forms. We use S to keep discussions generic.

3 EDGE VARYING GRAPH NEURAL NETWORKS

Edge varying graph filters are the basis for defining GNN architectures through composition with pointwise nonlinear functions. Formally, consider a set of L layers indexed by l = 1, . . . , L and let A_l(S) = Σ_{k=0}^{K} Φ_l^{(k:0)} be the graph filter used at layer l. A GNN is defined by the recursive expression

x_l = σ( A_l(S) x_{l−1} ) = σ( Σ_{k=0}^{K} Φ_l^{(k:0)} x_{l−1} )   (7)

where we convene that x_0 = x is the input to the GNN and x_L is its output. To augment the representation power of GNNs it is customary to add multiple node features per layer. We do this by defining matrices X_l ∈ R^{N×F_l} in which each column x_l^f represents a different graph signal at layer l. These so-called features are cascaded through layers
where they are processed with edge varying graph filters and composed with pointwise nonlinearities according to

X_l = σ( Σ_{k=0}^{K} Φ_l^{(k:0)} X_{l−1} A_{lk} )   (8)

where A_{lk} ∈ R^{F_{l−1}×F_l} is a matrix of coefficients that affords flexibility to process different features with different filter coefficients. It is easy to see that (8) represents a bank of edge varying graph filters applied to a set of F_{l−1} input features x_{l−1}^g to produce a set of F_l output features x_l^f. Indeed, if we let a_{lk}^{fg} = [A_{lk}]_{fg} denote the (f, g)th entry of A_{lk}, (8) produces a total of F_{l−1} × F_l intermediate features of the form

u_l^{fg} = A_l^{fg}(S) x_{l−1}^g = Σ_{k=0}^{K} a_{lk}^{fg} Φ_l^{(k:0)} x_{l−1}^g   (9)

for g = 1, . . . , F_{l−1} and f = 1, . . . , F_l. The features u_l^{fg} are then aggregated across all g and passed through a pointwise nonlinearity to produce the output features of layer l as

x_l^f = σ( Σ_{g=1}^{F_{l−1}} u_l^{fg} ).   (10)

At layer l = 1 the input feature is a graph signal x_0^1 = x. This feature is passed through F_1 filters to produce F_1 higher level features as per (9). The latter are then processed by a pointwise nonlinearity [cf. (10)] to produce F_1 output features x_1^f. The subsequent layers l > 1 start with F_{l−1} input features x_{l−1}^g that are passed through the filter bank A_l^{fg}(S) [cf. (9)] to produce the higher level features u_l^{fg}. These are aggregated across all g = 1, . . . , F_{l−1} and passed through a nonlinearity to produce the layer's output features x_l^f [cf. (10)]. In the last layer l = L, it is assumed without loss of generality that the number of output features is F_L = 1. This single feature x_L^1 = x_L is the output of the edge varying GNN or, for short, EdgeNet.

The EdgeNet output is a function of the input signal x and the collection of filter banks A_l^{fg} [cf. (5)]. Group the filters in the filter tensor A(S) = {A_l^{fg}(S)}_{lfg} so as to write the GNN output as the mapping

Ψ(x; A(S)) = x_L, with A(S) = {A_l^{fg}(S)}_{lfg}.   (11)

The filter parameters are trained to minimize a loss over a training set of input-output pairs T = {(x, y)}. This loss measures the difference between the EdgeNet output x_L and the true value y averaged over the examples (x, y) ∈ T.

As it follows from (5), the total number of parameters in each filter is K(M + N) + N. This gets scaled by the number of filters per layer F_{l−1} × F_l and the number of layers L. To provide an order bound on the number of parameters that define the edge varying GNN, set the maximum feature number F = max_l F_l and observe that the number of parameters per layer is at most (K(M + N) + N)F^2. Likewise, the computational complexity at each layer is of order O(K(M + N)F^2). This number of parameters and computational complexity are expected to be smaller than the corresponding numbers of a fully connected neural network. This is a consequence of exploiting the sparse nature of edge varying filters [cf. (4) and (5)]. A GNN can then be considered as an architecture that exploits the graph structure to reduce the number of parameters of a fully connected neural network. The implicit hypothesis is that signal components associated with different nodes are processed together in accordance with the nodes' proximity in the graph.

We will show that the different existing GNN architectures are particular cases of (9)-(10) using different subclasses of edge varying graph filters (Section 4) and that the same is true for graph attention networks (Section 5). Establishing these relationships allows the proposal of natural architectural generalizations that increase the descriptive power of GNNs while still retaining manageable complexity. The proposed extensions exploit the analogy between edge varying graph filters [cf. (5)] and linear time varying filters (Remark 1).

Remark 3. In the proposed EdgeNet, we considered graphs with single edge features, i.e., each edge is described by a single scalar. However, even when the graph has multiple edge features, say E, the EdgeNet extends readily to this scenario. This can be obtained by seeing the multi-edge featured graph as the union of E graphs G_e = (V, E_e) with identical node set V and respective shift operator matrix S_e. For {Φ^{e(k)}} being the collection of the edge varying parameter matrices [cf. (1)] relative to the shift operator S_e, the lth layer output X_l [cf. (8)] becomes

X_l = σ( Σ_{e=1}^{E} Σ_{k=0}^{K} Φ_l^{e(k:0)} X_{l−1} A_{lk}^e ).   (12)

I.e., the outputs of each filter are also aggregated over the edge-feature dimension. The number of parameters is now at most (K(M + N) + N)F^2 E and the computational complexity at each layer is of order O(KF^2 E(M + N)). The GNN architectures that follow in the remainder of this manuscript, as special cases of the EdgeNet, can be readily extended to the multi-edge feature scenario by considering (12) instead of (8). The approach in [33] is the particular case of (12) with K = 1 and parameter matrix reduced to a scalar coefficient.

4 GRAPH CONVOLUTIONAL NEURAL NETWORKS

Several variants of graph convolutional neural networks (GCNNs) have been introduced in [9]–[12]. They can all be written as GNN architectures in which the edge varying component in (8) is fixed and given by powers of the shift operator matrix, Φ_l^{(k:0)} = S^k,

X_l = σ( Σ_{k=0}^{K} S^k X_{l−1} A_{lk} ).   (13)

By comparing (9) with (13), it follows this particular restriction yields a tensor A(S) with filters of the form

A_l^{fg}(S) = Σ_{k=0}^{K} a_{lk}^{fg} S^k   (14)

for some order K and scalar parameters a_{l0}^{fg}, . . . , a_{lK}^{fg}. Our focus in this section is to discuss variations on (14).
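Before discussing these variations, a minimal numpy sketch of one GCNN layer (13) may help fix ideas; the feature sizes and the choice of ReLU for σ are assumptions made only for the example.

```python
# Minimal numpy sketch of the GCNN layer in (13); shapes and the ReLU choice
# are illustrative assumptions rather than prescriptions of the text.
import numpy as np

def gcnn_layer(S, X, A):
    """S: N x N shift; X: N x F_in features; A: list of K+1 matrices F_in x F_out."""
    Z = X.copy()                       # S^0 X_{l-1}
    out = Z @ A[0]
    for A_k in A[1:]:                  # k = 1, ..., K
        Z = S @ Z                      # S^k X_{l-1} via repeated local shifts
        out = out + Z @ A_k
    return np.maximum(out, 0.0)        # pointwise nonlinearity sigma (ReLU assumed)

rng = np.random.default_rng(1)
N, F_in, F_out, K = 10, 3, 4, 2
S = rng.random((N, N)) * (rng.random((N, N)) < 0.3)
X = rng.standard_normal((N, F_in))
A = [rng.standard_normal((F_in, F_out)) for _ in range(K + 1)]
X_next = gcnn_layer(S, X, A)           # N x F_out output features
```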
Fig. 3. Hybrid Edge Varying Filter [cf. (18)]. The nodes in set I = {2, 7} are highlighted. Nodes 2 and 7 have edge varying parameters associated with their incident edges. All nodes, including 2 and 7, also use the global parameter a_k as in a regular convolutional graph filter.

The substitution of (17) into (5) generates block varying GNNs [36]. Block varying GNNs have B(K + 1)F^2 parameters per layer and a computational complexity of order O(KF^2 M).

Alternatively, we can consider what we call hybrid filters that are defined as linear combinations of convolutional filters and edge varying filters that operate in a subset of nodes; see Figure 3. Formally, let I ⊂ V denote an importance subset of I = |I| node labels and define shift matrices Φ_I^{(k)} such that the diagonal matrix Φ_I^{(0)} has entries [Φ_I^{(0)}]_{ii} ≠ 0 for all i ∈ I and [Φ_I^{(k)}]_{ij} = 0 for all i ∉ I or (i, j) ∉ E and k ≥ 1. That is, the coefficient matrices Φ_I^{(k)} may contain nonzero elements only at rows i that belong to set I and with the node j being a neighbor of i. We define hybrid filters as those of the form

A(S) = Σ_{k=0}^{K} ( ∏_{k'=0}^{k} Φ_I^{(k')} + a_k S^k ).   (18)

Substituting (18) in (5) generates hybrid GNNs. In essence, nodes i ∈ I learn edge dependent parameters which may also be different at different nodes, while nodes i ∉ I learn global parameters.

Hybrid filters are defined by a number of parameters that depends on the total number of neighbors of all nodes in the importance set I. Define then M_I = Σ_{i∈I} N_i and observe that Φ_I^{(0)} has I nonzero entries since it is a diagonal matrix, while Φ_I^{(k)} for k ≥ 1 have respectively M_I nonzero values. We then have KM_I + I parameters in the edge varying filters and K + 1 parameters in the convolutional filters. We therefore have a total of (I + KM_I + K + 1)F^2 parameters per layer in a hybrid GNN. The implementation cost of a hybrid GNN layer is of order O(KF^2(M + N)) since both terms in (18) respect the sparsity of the graph.

Block GNNs depend on the choice of blocks B and hybrid GNNs on the choice of the importance set I. We explore the use of different heuristics based on centrality and clustering measures in Section 6, where we will see that the choice of B and I is in general problem specific.

4.2 Spectral Graph Convolutional Neural Networks

The convolutional operation of the graph filter in (15) can be represented in the spectral domain. To do so, consider the input-output relationship u = A(S)x along with the eigenvector decomposition of the shift operator S = VΛV^{-1}. Projecting the input and output signals in the eigenvector space of S creates the so-called graph Fourier transforms x̃ := V^{-1} x and ũ := V^{-1} u [37] which allow us to write

ũ = Σ_{k=0}^{K} a_k Λ^k x̃.   (19)

Eq. (19) reveals that convolutional graph filters are pointwise in the spectral domain, due to the diagonal nature of the eigenvalue matrix Λ. We can therefore define the filter's spectral response a : R → R as the function

a(λ) = Σ_{k=0}^{K} a_k λ^k   (20)

which is a single-variable polynomial characterizing the graph filter A(S). If we allow for filters of order K = N − 1, there is always a set of parameters a_k such that a(λ_i) = ã_i for any set of spectral responses ã_i [25]. Thus, training over the set of spectral coefficients a(λ_1), . . . , a(λ_N) is equivalent to training over the space of (nodal) parameters a_0, . . . , a_{N−1}. GCNNs were first introduced in [8] using the spectral representation of graph filters in (20).

By using edge varying graph filters [cf. (5)], we can propose an alternative parameterization of the space of filters of order N which we will see may have some advantages. To explain this better let J be the index set defining the zero entries of S + I_N and let C_J ∈ {0, 1}^{|J|×N^2} be a binary selection matrix whose rows are those of I_{N^2} indexed by J. Let also B be a basis matrix that spans the null space of

C_J ( V^{-⊤} ∗ V )   (21)

where "∗" is the column-wise Khatri-Rao product and V^{-⊤} = (V^{-1})^⊤. Then, the following proposition from [32] quantifies the spectral response of a particular class of the edge varying graph filter in (5).

Proposition 2. Consider the subclass of the edge varying graph filters in (5) where the parameter matrices Φ^{(0)} + Φ^{(1)} and Φ^{(k)} for all k = 2, . . . , K are restricted to the ones that share the eigenvectors with S, i.e., Φ^{(0)} + Φ^{(1)} = VΛ^{(1)}V^{-1} and Φ^{(k)} = VΛ^{(k)}V^{-1} for all k = 2, . . . , K. The spectral response of this subclass of edge varying filters has the form

a(Λ) = Σ_{k=1}^{K} ∏_{k'=1}^{k} Λ^{(k')} = Σ_{k=1}^{K} ∏_{k'=1}^{k} diag( Bµ^{(k')} )   (22)

where B is an N × b basis kernel matrix that spans the null space of (21) and µ^{(k)} is a b × 1 vector containing the expansion coefficients of Λ^{(k)} into B.

Proof. See Appendix B.

Proposition 2 provides a subclass of the edge varying graph filters where, instead of learning K(M + N) + N parameters, they learn the Kb entries in µ^{(1)}, . . . , µ^{(K)} in (22). These filters build the output features as a pointwise multiplication between the filter spectral response a(Λ) and the input spectral transform x̃ = V^{-1}x, i.e., u = Va(Λ)x̃ = Va(Λ)V^{-1}x. Following then the analogies with conventional signal processing, (22) represents the spectral response of a convolutional edge varying graph filter. Spectral GCNNs are a particular case of (22) with order K = 1 and kernel B independent from the graph (e.g., a spline kernel).
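The pointwise spectral action in (19)-(20) can be checked numerically. The sketch below assumes a diagonalizable S with illustrative sizes and verifies that the vertex-domain output Σ_k a_k S^k x equals V a(Λ) V^{-1} x.

```python
# Numerical check of (19)-(20): a polynomial graph filter acts pointwise in the
# spectral domain, u = V a(Lambda) V^{-1} x. Assumes S is diagonalizable;
# sizes and coefficients are illustrative.
import numpy as np

rng = np.random.default_rng(2)
N, K = 6, 3
S = rng.standard_normal((N, N))
a = rng.standard_normal(K + 1)                      # taps a_0, ..., a_K
x = rng.standard_normal(N)

u = sum(a[k] * np.linalg.matrix_power(S, k) @ x for k in range(K + 1))

lam, V = np.linalg.eig(S)                           # S = V Lambda V^{-1}
x_tilde = np.linalg.solve(V, x.astype(complex))     # graph Fourier transform of x
u_tilde = sum(a[k] * lam**k for k in range(K + 1)) * x_tilde
assert np.allclose(V @ u_tilde, u)                  # matches the vertex-domain output
```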
Besides generalizing [8], a graph-dependent kernel allows us to implement (22) in the vertex domain through an edge varying filter of the form (5); hence, having a complexity of order O(K(M + N)) in contrast to the O(N^2) required for the graph-independent kernels [8]. The edge varying implementation also captures local detail up to a region of radius K from a node, yet it retains a spectral interpretation. Nevertheless, both the graph-dependent GNN [cf. (22)] and the graph-independent GNN [8] are more of theoretical interest since they require the eigendecomposition of the shift operator S. This aspect inadvertently implies a cubic complexity in the number of nodes, and an accurate learning process will suffer from numerical instabilities since it requires an order K ≈ N; hence, high order matrix powers S^k.

the coefficients β_p and γ_p of each of the single pole filter approximations H_K(R(γ_p)). These single pole filters are themselves reminiscent of the convolutional filters in (15) but the similarity is again superficial. Instead of coefficients a_k that multiply shift operator powers S^k, the filters in (34) train a coefficient γ_p which represents a constant that is subtracted from the diagonal entries of the shift operator S. The fact that this is equivalent to an ARMA filter suggests that (33)-(34) may help in designing more discriminative filters. We corroborate in Section 6 that GNNs using (33)-(34) outperform GNNs that utilize the filters in (15).

An ARMA GNN has (2P + K + 3)F^2 parameters per layer and a computational complexity of order O(F^2 P(MK + N)). This decomposes as: O(PN) to invert the diagonal matrices (D − γ_p I_N); O(PM) to scale the nonzeros of (S − D) by the inverse diagonal; O(PKM) to obtain the outputs of Jacobi ARMA filters (33) of order K.

The Jacobi ARMA GNN generalizes the architecture in [14]: instead of restricting the polynomials in (23) to Cayley polynomials, it allows the use of general polynomials. GNNs with ARMA graph filters have also been proposed in [39], [40]. The latter works considered a first-order iterative method to avoid the inverse operation. As shown in [29], first-order implementations of ARMA filters implement a convolutional filter of the form in (15) with parameter sharing between the different orders k = 0, . . . , K.

and α = [α_0, . . . , α_K]^⊤ be a set of direct terms; we can then rewrite (25) as

a(λ) = Σ_{p=0}^{P} β_p / (λ − γ_p) + Σ_{k=0}^{K} α_k λ^k,   (26)

where α, β, and γ are computed from a and b. A graph filter whose spectral response is as in (26) is one in which the spectral variable λ is replaced by the shift operator variable S. It follows that if α, β, and γ are chosen to make (26) and (25) equivalent, the filter in (23) is, in turn, equivalent to

A(S) = Σ_{p=0}^{P} β_p ( S − γ_p I )^{-1} + Σ_{k=0}^{K} α_k S^k,   (27)

which by grouping further the terms of the same order k leads to a single edge varying graph filter of the form in (5).

The Jacobi ARMA GNN provides an alternative parameterization of the EdgeNet that is different from that of the other polynomial convolutional filters in (15). In particular, ARMA GNNs promote the use of multiple polynomial filters of smaller order (i.e., the number of Jacobi iterations) with shared parameters between them. In fact, as it follows from (34) and (31), each of the filters H_K(R(γ_p)) depends on two parameters β_p and γ_p. We believe this parameter sharing among the different orders and the different nodes is the success behind the improved performance of the ARMA GNN compared with the single polynomial filters in (15). Given also the hybrid solutions developed in Section 4.1 for the polynomial filters, a direction that may attain further improvements is that of ARMA GNN architectures with controlled edge variability.

ARMA GNNs as EdgeNets. ARMA GNNs are another subclass of the EdgeNet. To see this, consider that each shift operator R(γ_p) in (31) shares the support with I_N + S. Hence, we can express the graph filter in (33) as the union of P + 2 edge varying graph filters. The first P + 1 of these filters have parameter matrices of the form

Φ_p^{(k:0)} = β_p R^k(γ_p) for k = 0, . . . , K − 1, and Φ_p^{(k:0)} = R^K(γ_p) for k = K.
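As an illustration, the following sketch evaluates the exact rational filter in (27) with direct matrix solves; the Jacobi iterations of (33)-(34), which avoid the inverses, are not reproduced here, and all coefficient values are arbitrary choices with poles assumed away from the spectrum of S.

```python
# Sketch of the exact rational (ARMA) filter in (27):
# u = sum_p beta_p (S - gamma_p I)^{-1} x + sum_k alpha_k S^k x.
# Direct solves stand in for the Jacobi iterations; values are illustrative.
import numpy as np

def arma_filter(S, x, beta, gamma, alpha):
    N = S.shape[0]
    u = np.zeros(N)
    for b_p, g_p in zip(beta, gamma):               # rational part of (27)
        u += b_p * np.linalg.solve(S - g_p * np.eye(N), x)
    z = x.copy()
    for a_k in alpha:                               # direct polynomial part of (27)
        u += a_k * z
        z = S @ z
    return u

rng = np.random.default_rng(3)
N = 8
S = rng.standard_normal((N, N)) / N                 # keep the spectrum small
u = arma_filter(S, rng.standard_normal(N),
                beta=[0.5, -0.2], gamma=[2.0, -3.0], alpha=[1.0, 0.3])
```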
Fig. 5. Higher Order Graph Attention Filters. (a) Graph convolutional attention filter. The input features X_{l−1} are shifted by the same edge varying shift operator Φ(B, e) and weighted by different coefficient matrices A_k. The edge varying coefficients in all Φ(B, e) are parameterized by the same matrix B and vector e following the attention mechanism. (b) Edge varying GAT filter. The input features X_{l−1} are shifted by different edge varying shift operators Φ^{(k)}(B_k, e_k) and weighted by different coefficient matrices A_k. The edge varying coefficients in the different Φ^{(k)}(B_k, e_k) are parameterized by a different matrix B_k and vector e_k following the attention mechanism.
By introducing the Jacobi method, we tackle the equivalence with (15) and obtain an ARMA GNN that is substantially different from (15).

5 GRAPH CONVOLUTIONAL ATTENTION NETWORKS

A graph convolutional attention network (GCAT) utilizes filters as in (13) but they are convolutional in a layer-specific matrix Φ_l = Φ that may be different from the shift operator S,

X_l = σ( Σ_{k=0}^{K} Φ^k X_{l−1} A_k ).   (36)

Note that A_k = A_{lk} and Φ = Φ_l are layer-dependent but we omit the layer index to simplify notation. Since matrix Φ shares the sparsity pattern of S, (36) defines a GNN as per (8). In a GCAT, the matrix Φ is learned from the features X_{l−1} that are passed from layer l − 1 following the attention mechanism [18]. Specifically, we define a matrix B ∈ R^{F_{l−1}×F_l} as well as a vector e ∈ R^{2F_l} and compute the edge scores

α_{ij} = σ( e^⊤ [ [X_{l−1}B]_i , [X_{l−1}B]_j ]^⊤ )   (37)

for all edges (i, j) ∈ E. In (37), we start with the vector of features X_{l−1} and mix them as per the coefficients in B. This produces a collection of graph signals X_{l−1}B in which each node i has F_l features that correspond to the ith row [X_{l−1}B]_i of the product matrix X_{l−1}B. The features at node i are concatenated with the features of node j and the resulting vector of 2F_l components is multiplied by vector e. This product produces the score α_{ij} after passing through the nonlinearity σ(·). Note that B = B_l, e = e_l, and the scores α_{ij} = α_{lij} depend on the layer index l. As is the case of A_k and Φ in (36), we omitted this index for simplicity.

The score α_{ij} could be used directly as an entry for the matrix Φ but to encourage attention sparsity we pass α_{ij} through a local soft maximum operator

Φ_{ij} = exp(α_{ij}) ( Σ_{j'∈N_i∪i} exp(α_{ij'}) )^{-1}.   (38)

The soft maximum assigns edge weights Φ_{ij} close to 1 to the largest of the edge scores α_{ij} and weights Φ_{ij} close to 0 to the rest. See also Figure 5 (a).

In Section 2 we introduced arbitrary edge varying graph filters [cf. (5)] which we leveraged in Section 3 to build edge varying GNNs [cf. (7)-(8)]. In Section 4 we pointed out that edge varying graph filters leave too many degrees of freedom in the learning parametrization, a problem that we could overcome with the use of graph convolutional filters [cf. (15)]. The latter suffer from the opposite problem as they may excessively constrict the GNN. GATs provide a solution of intermediate complexity. Indeed, the filters in (36) allow us to build a GNN with convolutional graph filters where the shift operator Φ is learned ad hoc in each layer to represent the required abstraction between nodes. The edges of this shift operator are trying to choose neighbors whose values should most influence the computations at a particular node. This is as in any arbitrary edge varying graph filter but the novelty of GATs is to reduce the number of learnable parameters by tying edge values to the matrix B and the vector e; observe that in (37) e is the same for all edges. Thus, the computation of the α_{ij} scores depends on the F_{l−1} × F_l parameters in B and the 2F_l parameters in e. This is of order no more than F^2 if we make F = max_l F_l. It follows that for the GAT in (36) the number of learnable parameters is at most F^2 + 2F + F^2(K + 1), which depends on design choices and is independent of the number of edges.
TABLE 1
Properties of Different Graph Neural Network Architectures. The parameters and the complexity are considered per layer. Architectures in bold are proposed in this work. Legend: N - number of nodes; M - number of edges; F - maximum number of features; K - recursion order; b - dimension of the kernel in (22); B - number of blocks in (16); I - the set of important nodes in (16) and (18); M_I - total neighbors for the nodes in I; P - parallel J-ARMANet branches; R - parallel attention branches. * Self-loops are not considered. ** The eigendecomposition cost O(N^3) is computed once.
We point out that since Φ respects the sparsity of the graph, the computational complexity of implementing (36) and its parameterization is of order O(F(NF + KM)).

5.1 Edge varying GAT networks

The idea of using attention mechanisms to estimate entries of a shift operator Φ can be extended to estimate entries Φ^{(k:0)} of an edge varying graph filter. To be specific, we propose to implement a generic GNN as defined by recursion (8), which we repeat here for ease of reference

X_l = σ( Σ_{k=0}^{K} Φ^{(k:0)} X_{l−1} A_k ).   (39)

Further recall that each of the edge varying filter coefficients Φ^{(k:0)} is itself defined recursively as [cf. (5)]

Φ^{(k:0)} = Φ^{(k)} × Φ^{(k−1:0)} = ∏_{k'=0}^{k} Φ^{(k')}.   (40)

We propose to generalize (37) so that we compute a different matrix Φ^{(k)} for each filter order k. Consider then matrices B_k and vectors e_k to compute the edge scores

α_{ij}^{(k)} = σ( e_k^⊤ [ [X_{l−1}B_k]_i , [X_{l−1}B_k]_j ]^⊤ )   (41)

for all edges (i, j) ∈ E. As in the case of (37), we could use α_{ij}^{(k)} as edge weights in Φ^{(k)}, but to promote attention sparsity we send α_{ij}^{(k)} through a soft maximum function to yield edge scores

Φ_{ij}^{(k)} = exp(α_{ij}^{(k)}) ( Σ_{j'∈N_i∪i} exp(α_{ij'}^{(k)}) )^{-1}.   (42)

Each of the edge varying matrices Φ^{(k)} for k = 1, . . . , K is parameterized by the tuple of transform parameters (B_k, e_k). Put simply, we are using a different GAT mechanism for each edge varying matrix Φ^{(k)}. These learned matrices are then used to build an edge varying filter to process the features X_{l−1} passed from the previous layer; see Figure 5 (b). The edge varying GAT filter employs K + 1 transform matrices B_k of dimensions F_l × F_{l−1}, K + 1 vectors e_k of dimensions 2F_l, and K + 1 matrices A_k of dimensions F_l × F_{l−1}. Hence, the total number of parameters for the edge varying GAT filter is at most (K + 1)(2F^2 + 2F). The computational complexity of the edge varying GAT is of order O(KF(NF + M)).

Remark 5. Graph attention networks first appeared in [18]. In that work, (37) and (38) are proposed as an attention mechanism for the signals of neighboring nodes and GNN layers are of the form X_l = σ(Φ_l X_{l−1} A_l). The latter is a particular case of either (36) or (39) in which only the term k = 1 is not null. Our observation in this section is that this is equivalent to computing a different graph represented by the shift operator Φ. This allows for the generalization to filters of arbitrary order K [cf. (36)] and to edge varying graph filters of arbitrary order [cf. (39)]. The approaches presented in this paper can likewise be extended with the multi-head attention mechanism proposed in [18] to improve the network capacity.

5.2 Discussions

Reducing model complexity. As defined in (37) and (41) the attention mechanisms are separate from the filtering. To reduce the number of parameters we can equate the attention matrices B or B_k with the filtering matrices A_k. To do that in the case of the GCAT in (36), the original proposal in [18] is to make B = A_1 so that (37) reduces to

α_{ij} = σ( e^⊤ [ [X_{l−1}A_1]_i , [X_{l−1}A_1]_j ]^⊤ ).   (43)

In the case of the edge varying GATs in (39) it is natural to equate B_k = A_k in which case (41) reduces to

α_{ij}^{(k)} = σ( e_k^⊤ [ [X_{l−1}A_k]_i , [X_{l−1}A_k]_j ]^⊤ ).   (44)

The choice in (44) removes (K + 1)F^2 parameters.

Accounting for differences in edge weights in the original shift operator. The major benefit of the GAT mechanism is that of building a GNN without requiring full knowledge of S. This is beneficial as it yields a GNN robust to uncertainties in the edge weights. This benefit becomes a drawback when S is well estimated, as it renders weights s_{ij} equivalent regardless of their relative values. One possible solution to this latter drawback is to use a weighted soft maximum operator so that the entries of Φ^{(k)} are chosen as

Φ_{ij}^{(k)} = exp(s_{ij} α_{ij}^{(k)}) ( Σ_{j'∈N_i∪i} exp(s_{ij'} α_{ij'}^{(k)}) )^{-1}.   (45)
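A sketch of the attention-based shift operator may be useful here: the code below assembles Φ from the scores in (37) and the local soft maximum in (38), with the weighted soft maximum in (45) as an option. Taking σ to be a LeakyReLU is an assumption, since the text leaves the nonlinearity generic.

```python
# Sketch of the attention shift operator in (37)-(38); weighted=True applies
# the S-weighted soft maximum of (45) instead. Sigma is a LeakyReLU (assumed).
import numpy as np

def attention_shift(X, B, e, S, weighted=False):
    """X: N x F_in features; B: F_in x F_out; e: length 2*F_out; S: N x N shift."""
    N, F_out = X.shape[0], B.shape[1]
    H = X @ B                                             # mixed features X_{l-1} B
    # alpha_ij = sigma(e^T [H_i ; H_j]) split as e_1^T H_i + e_2^T H_j
    alpha = (H @ e[:F_out])[:, None] + (H @ e[F_out:])[None, :]
    alpha = np.where(alpha < 0, 0.2 * alpha, alpha)       # LeakyReLU (assumed)
    mask = (S != 0) | np.eye(N, dtype=bool)               # neighborhoods N_i U {i}
    logits = S * alpha if weighted else alpha             # (45) versus (38)
    expb = np.where(mask, np.exp(logits), 0.0)            # local soft maximum
    return expb / expb.sum(axis=1, keepdims=True)         # rows sum to 1 on N_i U {i}

rng = np.random.default_rng(4)
N, F_in, F_out = 10, 3, 4
S = rng.random((N, N)) * (rng.random((N, N)) < 0.4)
Phi = attention_shift(rng.standard_normal((N, F_in)),
                      rng.standard_normal((F_in, F_out)),
                      rng.standard_normal(2 * F_out), S)
```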
Alternatively, we can resort to the use of a hybrid GAT in which we combine a GAT filter of the form in (39) with a regular convolutional filter of the form in (13),

X_l = σ( Σ_{k=0}^{K} ( S^k X_{l−1} A_k + Φ^{(k:0)} X_{l−1} A'_k ) ).   (46)

This is the GAT version of the hybrid GNN we proposed in (18). The filters in (46) account for both the GAT learned shifts Φ^{(k)} and the original shift S.

6 NUMERICAL RESULTS

This section corroborates the capacity of the different models with numerical results on synthetic and real-world graph signal classification problems. Given the different hyperparameters for the models in Table 1 and the trade-offs one can consider (e.g., complexity, number of parameters, radius of local information), our main aim is to provide insights on which methods exploit the graph prior better for learning purposes rather than achieving the highest accuracy.

From the architectures in Table 1, we did not consider the fully-connected network, the spectral kernel GCNN from [8], and the spectral edge varying GCNN (22) since their computational cost is not linear in the number of nodes or edges. We also leave to interested readers the GAT extensions discussed in Section 5.2. For all scenarios, we considered the ADAM optimization algorithm with parameters β_1 = 0.9 and β_2 = 0.999 [41].

6.1 Source localization on SBM graphs

The goal of this experiment is to identify which community in a stochastic block model (SBM) graph is the source of a diffused signal by observing different diffusion realizations at different time instants. We considered a connected and undirected graph of N = 50 nodes divided into five blocks each representing one community {c_1, . . . , c_5}. The intra- and inter-community edge formation probabilities are 0.8 and 0.2, respectively. The source node is one of the five nodes (v_{c_1}, . . . , v_{c_5}) with the largest degree in the respective community. The source signal x(0) is a Kronecker delta centered at the source node. The source signal is diffused at time t ∈ [0, 50] as x(t) = S^t x(0), where S is the graph adjacency matrix normalized to have unitary maximum eigenvalue.

The training set is composed of 10240 tuples of the form (x(t), c_i) for random t and i ∈ {1, . . . , 5}. These tuples are used to train the EdgeNets that are subsequently used to predict the source community c'_i for a testing signal x'(t), again for a random value of t. The validation and the test set are both composed of 2560 tuples (25% of the training set). The performance of the different algorithms is averaged over ten different graph realizations and ten data splits, for a total of 100 Monte-Carlo iterations. The ADAM learning algorithm is run over 40 epochs with batches of size 100 and a learning rate of 10^{-3}.

Architecture parameters. For this experiment, we compared 14 different architectures. All architectures comprise the cascade of a graph filtering layer with ReLU nonlinearity and a fully connected layer with softmax nonlinearity. The architectures are: i) the edge varying GNN (8); ii) the GCNN (13); iii) three node varying GNNs (16), where the five important nodes are selected based on iii-a) maximum degree; iii-b) spectral proxies [42], which ranks the nodes according to their contribution to different frequencies; iii-c) diffusion centrality (see Appendix C); iv) three node dependent edge varying GNNs, where the five important nodes B are selected similarly to the node varying case; v) three ARMANets (33) with number of Jacobi iterations v-a) K = 1; v-b) K = 3; v-c) K = 5; vi) the GAT network from [18]; vii) the GCAT network (36); and viii) the edge varying GAT network (39).

Our goal is to see how the different architectures handle their degrees of freedom, while all having linear complexity. To make this comparison more insightful, we proceed with the following rationale. For the approaches in i)-iv), we analyzed filter orders in the interval K ∈ {1, . . . , 5}. This is the only handle we have on these filters to control the number of parameters and locality radius, while keeping the same computational complexity. For the ARMANet in v), we set the direct term order to K = 0 to observe only the effect of the rational part. Subsequently, for each Jacobi iteration value K, we analyzed rational orders in the interval P ∈ {1, . . . , 5} as for the former approaches. While this strategy helps us control the local radius, recall the ARMANet has a computational complexity slightly higher than the former four architectures. For the GAT in vi), we analyzed different attention heads R ∈ {1, . . . , 5} such that the algorithm complexity matches those of the approaches in i)-iv). The number of attention heads is the only handle in the GAT network. Finally, for the GCAT in vii) and the edge varying GAT in viii), we fixed the attention heads to R = 3 and analyzed different filter orders K ∈ {1, . . . , 5}. Such a choice allows us to compare the impact of the local radius for the median attention head value of the GAT. Nevertheless, these architectures again have a slightly higher complexity for K ≥ 2.

Observations. The results of this experiment are shown in Figure 6. We make the following observations.

First, the attention-based approaches are characterized by a slow learning rate leading to a poor performance in 40 epochs. This is reflected in the higher test error of the GAT and of the edge varying GAT networks. However, this is not the case for the GCAT network. We attribute the latter reduced error to the superposition of the graph convolutional prior over attentions that GCAT explores; in fact, all convolutional approaches learn faster. The GCAT convolutional coefficients A_k in (36) are trained faster than their attention counterparts, bringing the overall architecture to a good local minimum. On the contrary, the error increases further for the edge varying GAT since multiple attention strategies are adopted for all k ∈ {1, . . . , K} in (39). Therefore, our conclusion is that the graph convolutional prior can be significantly helpful for attention mechanisms. We will further corroborate this in the upcoming sections.

Second, the edge varying GNN in (8) achieves the lowest error, although having the largest number of parameters. The convolutional approaches parameterize the edge varying filter well; hence, highlighting the benefit of the graph convolution. ARMANet is the best among the latter, characterized both by a lower mean error and standard deviation. This reduced error for ARMANet is not entirely surprising since rational functions have better interpolation and extrapolation properties than polynomial ones.
Fig. 6. Source Localization Test Error in the Stochastic Block Model graph. The y-axis scale is deformed to improve visibility. The thick bar interval indicates the average performance for different parameter choices (e.g., filter order, attention heads). The circle marker represents the mean value of this interval. The thin line spans an interval of one standard deviation from the average performance. The convolutional-based approaches offer a better performance than the attention-based ones. We attribute the poor performance of the attention techniques to the slow learning rate. Both the GAT and the edge variant GAT required more than 40 epochs to reach a local minimum. However, the graph convolutional attention network (GCAT) does not suffer from this issue, leading to faster learning.
It is, however, remarkable that the best result is obtained for a Jacobi iteration of K = 1. I.e., the parameter sharing imposed by ARMANet reaches a good local optimum even with a coarse approximation of the rational function.

Third, for the node selection strategies in architectures iii) and iv), there is no clear difference between the degree and the communication centrality. For the node varying GNNs, the communication centrality offers a lower error both in the mean and deviation. In the hybrid edge varying GNNs [cf. (18)], instead, the degree centrality achieves a higher mean while paying in the deviation. The spectral proxies centrality yields the worst performance.

Finally, we remark that we did not find any particular trend while changing the parameters of the different GNNs (e.g., order, attention heads). A rough observation is that low order recursions are often sufficient to reach low errors.

6.2 Source localization on Facebook sub-network

In the second experiment, we considered source localization on a real-world network comprising a 234-user Facebook subgraph obtained as the largest connected component of the dataset in [43]. This graph has two well-defined connected communities of different size and the objective here is to identify which of the two communities originated the diffusion. The performance of the different algorithms is averaged over 100 Monte-Carlo iterations. The remaining parameters for generating the data are unchanged.

Architecture parameters. We compared the eight GNN architectures reported in the left-most column of Table 2. For the node varying and the hybrid edge varying GNNs, the important nodes are again 10% of all nodes selected based on communication centrality. The Jacobi number of iterations for the ARMANet is K = 1.

Overall, this problem is easy to solve if the GNN is hypertuned with enough width and depth. However, this strategy hinders the impact of the specific filter. To highlight the role of the latter, we considered minimal GNN architectures composed of one layer and two features. We then grid-searched all parameters in Table 2 with the goal to reach a classification error of at most 2%. For the architectures that reach this criterion, we report the smallest parameters. For the architectures that do not reach this criterion, we report the minimum achieved error and the respective parameters. Our rationale is that the minimum parameters yield a lower complexity; hence, a faster training; the opposite holds for the learning rate.

From Table 2, we can observe that only the edge varying GNN and the ARMANet reach the predefined error. Both architectures stress our observation that low order recursions (K = 1) are often sufficient. Nevertheless, this is not the case for all other architectures. These observations suggest the edge varying GNN explores well its degrees of freedom without requiring information from more than one-hop away neighbors. The ARMANet best explores the convolutional prior; in accordance with the former results, the Jacobi implementation does not need to run until convergence to achieve impressive results. We also conclude the convolutional prior helps reduce the degrees of freedom of the EdgeNet but requires a deeper and/or wider network to achieve the predefined criterion. This is particularly seen in the GAT based architectures. The GCAT architecture, here, explores the convolutional prior and reduces the error compared with the edge varying prior that is unhelpful. We finally remark that for all approaches a substantially lower variance can be achieved by solely increasing the features.

6.3 Authorship attribution

In this third experiment, we assess the performance of the different GNN architectures in an authorship attribution problem based on real data. The goal is to classify if a text excerpt belongs to a specific author or to any other of the 20 contemporary authors based on word adjacency networks (WANs) [44]. A WAN is an author-specific directed graph whose nodes are function words without semantic meaning (e.g., prepositions, pronouns, conjunctions). A directed edge represents the transition probability between a pair of
TABLE 2
Source Localization Test Error in Facebook Subnetwork. The goal is to grid-search the parameters to achieve a mean error of at most 2%. For the architectures that did not achieve this criterion, the minimum error is reported.

Architecture        mean    std. dev.   order             attention heads   epochs              learning rate
                                        {1, 2, 3, 4, 5}   {1, 2, 3, 4, 5}   {10, 20, 40, 100}   {10^-2, 10^-3}
GCNN                 4.0%   13.0%       3                 n/a               100                 10^-3
Edge varying         1.5%    8.4%       1                 n/a                10                 10^-3
Node varying         6.0%   15.8%       3                 n/a                20                 10^-3
Hybrid edge var.     6.6%   15.9%       2                 n/a                40                 10^-3
ARMANet              2.0%    9.7%       1                 n/a                20                 10^-3
GAT                 10.9%   20.8%       n/a               1                  40                 10^-3
GCAT                 8.0%   18.4%       3                 1                 100                 10^-3
Edge varying GAT     7.1%   17.8%       2                 3                 100                 10^-3
networks, which is an EdgeNet that jointly learns the edge weights and the parameters of a convolutional filter.

Further research is needed in three main directions. First, in exploring the connection between the EdgeNets and receptive fields to parameterize differently the edge varying graph filter. Second, in assessing the EdgeNets performance in semi-supervised and graph classification scenarios. Third, in characterizing which EdgeNet parameterization has better transferability properties to unseen graphs.

APPENDIX A
PROOF OF PROPOSITION 1

Denote the respective graph shift operator matrices of the graphs G and G' as S and S'. For P being a permutation matrix, S' and x' can be written as S' = P^⊤SP and x' = P^⊤x. Then, the output of the convolutional filter in (15) applied to x' is

u' = Σ_{k=0}^{K} a_k (S')^k x' = Σ_{k=0}^{K} a_k (P^⊤SP)^k P^⊤x.   (47)

By using the properties of the permutation matrix, namely (P^⊤SP)^k = P^⊤S^kP and PP^⊤ = I_N, the output u' becomes

u' = P^⊤ ( Σ_{k=0}^{K} a_k S^k x ) = P^⊤ u.   (48)

those of S + I_N. From the properties of the vectorization operation, (50) can be rewritten as

C_J vec( VΛ^{(k)}V^{-1} ) = C_J ( V^{-⊤} ∗ V ) λ^{(k)}   (51)

where "∗" denotes the Khatri-Rao product and λ^{(k)} = diag(Λ^{(k)}) is the N-dimensional vector composed of the diagonal elements of Λ^{(k)}. As it follows from (50), (51) implies the vector λ^{(k)} lies in the null space of C_J (V^{-⊤} ∗ V), i.e.,

λ^{(k)} ∈ null( C_J (V^{-⊤} ∗ V) ).   (52)

Let then B be a basis that spans this null space [cf. (21)]. The vector λ^{(k)} can be expanded as

λ^{(k)} = Bµ^{(k)}   (53)

where µ^{(k)} is the vector containing the basis expansion coefficients. Finally, by putting back together (50)-(53), (49) becomes

a(Λ) = Σ_{k=1}^{K} ∏_{k'=1}^{k} diag( Bµ^{(k')} ).   (54)

The N × b basis matrix B is a kernel that depends on the specific graph and in particular on the eigenvectors V. The kernel dimension b depends on the rank of B and thus, on the rank of the null space in (52). In practice it is often observed that rank(B) = b ≪ N.
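The argument in (47)-(48) can be restated numerically. The sketch below, with illustrative sizes, checks that filtering the permuted signal on the permuted shift operator returns the permuted filter output.

```python
# Numerical restatement of (47)-(48): relabeling the graph with a permutation P
# permutes the convolutional filter output accordingly, u' = P^T u.
import numpy as np

def conv_filter(S, x, a):
    u, z = np.zeros_like(x), x.copy()
    for a_k in a:                                   # u = sum_k a_k S^k x, as in (15)
        u += a_k * z
        z = S @ z
    return u

rng = np.random.default_rng(5)
N, K = 7, 3
S = rng.standard_normal((N, N))
x = rng.standard_normal(N)
a = rng.standard_normal(K + 1)
P = np.eye(N)[rng.permutation(N)]                   # a permutation matrix

u = conv_filter(S, x, a)
u_perm = conv_filter(P.T @ S @ P, P.T @ x, a)       # filter on the relabeled graph
assert np.allclose(u_perm, P.T @ u)                 # permutation equivariance, (48)
```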