As we are interested in representation learning over the nodes, we attach to each node u ∈ V a feature vector, x_u ∈ ℝᵏ. The main way in which this data is presented to a machine learning model is in the form of a node feature matrix. That is, a matrix X ∈ ℝ^{|V|×k} is prepared by stacking these features:

    X = [x₁, x₂, …, x_{|V|}]ᵀ    (1)

There are many ways to represent E; since our context is one of linear algebra, we will use the adjacency matrix, A ∈ ℝ^{|V|×|V|}:

    a_uv = 1 if (u, v) ∈ E,  a_uv = 0 if (u, v) ∉ E    (2)

Note that it is often the case, especially in biochemical inputs, that we want to attach more information to the edges (such as distance scalars, or even entire feature vectors). I deliberately do not consider such cases to retain clarity; the conclusions we make would be the same in those cases.

However, the very act of using the above representations imposes a node ordering, and is therefore an arbitrary choice which does not align with the nodes and edges being unordered! Hence, we need to make sure that permuting the nodes and edges (PX and PAPᵀ, for a permutation matrix P) does not change the outputs. We recover the following rules a GNN must satisfy, for graph-level functions f and node-level functions F:

    f(PX, PAPᵀ) = f(X, A)    (Invariance)    (3)

    F(PX, PAPᵀ) = P F(X, A)    (Equivariance)    (4)

Further, defining the neighbourhood of a node u as

    N_u = {v | (u, v) ∈ E ∨ (v, u) ∈ E}    (5)

with X_{N_u} = {{x_v | v ∈ N_u}} its multiset of features (6), our local function, f, can take into account the neighbourhood; that is:

    h_u = f(x_u, X_{N_u})    F(X) = [h₁, h₂, …, h_{|V|}]ᵀ    (7)

Graph neural networks
Needless to say, defining f is one of the most active areas of machine learning research today. Depending on the literature context, it may be referred to as either 'diffusion', 'propagation', or 'message passing'. As claimed by [7], most of them can be classified into one of three spatial flavours:

    h_u = φ(x_u, ⊕_{v∈N_u} c_{vu} ψ(x_v))    (Convolutional)    (8)

    h_u = φ(x_u, ⊕_{v∈N_u} a(x_u, x_v) ψ(x_v))    (Attentional)    (9)

    h_u = φ(x_u, ⊕_{v∈N_u} ψ(x_u, x_v))    (Message passing)    (10)

where ψ and φ are neural networks, e.g. ψ(x) = ReLU(Wx + b), and ⊕ is any permutation-invariant aggregator, such as Σ, averaging, or max. The expressive power of the GNN progressively increases going from Equation (8) to Equation (10), at the cost of interpretability, scalability, or learning stability. For most tasks, a careful tradeoff is needed when choosing the right flavour.
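To make the three flavours concrete, here is a minimal numpy sketch (my illustration, not code from the survey): ψ and φ are single-layer networks, ⊕ is a sum or mean, the attention scores are plain dot products, and the toy graph is a ring so that every node has neighbours. All weight shapes and these particular choices are assumptions of the sketch, not prescriptions of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 6, 4, 8                       # |V|, input width k, hidden width d

# Toy inputs: node feature matrix X (Eq. 1) and a ring-graph adjacency A (Eq. 2).
X = rng.normal(size=(n, k))
A = np.zeros((n, n))
for u in range(n):
    A[u, (u + 1) % n] = A[(u + 1) % n, u] = 1.0

def relu(z):
    return np.maximum(z, 0.0)

W_psi = rng.normal(size=(k, d))         # psi(x) = ReLU(W x), bias omitted
W_mp  = rng.normal(size=(2 * k, d))     # psi(x_u, x_v) for message passing
W_phi = rng.normal(size=(k + d, d))     # phi acts row-wise on [x ‖ aggregated message]

def phi(X, M):
    return relu(np.concatenate([X, M], axis=-1) @ W_phi)

def convolutional(X, A):                # Eq. (8): fixed coefficients c_vu (here 1/deg(u))
    C = A / A.sum(axis=1, keepdims=True)
    return phi(X, C @ relu(X @ W_psi))

def attentional(X, A):                  # Eq. (9): learned a(x_u, x_v); dot-product scores
    S = np.where(A > 0, X @ X.T, -np.inf)       # assumes every node has >= 1 neighbour
    S = np.exp(S - S.max(axis=1, keepdims=True))
    alpha = S / S.sum(axis=1, keepdims=True)    # softmax over each node's neighbours
    return phi(X, alpha @ relu(X @ W_psi))

def message_passing(X, A):              # Eq. (10): messages depend on both endpoints
    M = np.zeros((n, d))
    for u in range(n):
        for v in np.flatnonzero(A[u]):
            M[u] += relu(np.concatenate([X[u], X[v]]) @ W_mp)
    return phi(X, M)
```

Each of the three layers satisfies the equivariance rule of Equation (4): relabelling the nodes as (PX, PAPᵀ) simply permutes the output rows.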
Given such a GNN layer, we can learn (m)any interesting tasks over a graph, by appropriately combining h_u. I exemplify the three principal such tasks, grounded in biological examples:

Node classification. If the aim is to predict targets for each node u ∈ V, then our output is equivariant, and we can learn a shared classifier directly on h_u. A canonical example of this is classifying protein functions (e.g. using gene ontology data [55]) in a given protein-protein interaction network, as first done by GraphSAGE [19].

Graph classification. If we want to predict targets for the entire graph, then we want an invariant output, hence need to first reduce all the h_u into a common representation, e.g. by performing ⊕_{u∈V} h_u, then learning a classifier over the resulting flat vector. A canonical example is classifying molecules for their quantum-chemical properties [18], estimating pharmacological properties like toxicity or solubility [13,50,22], or virtual drug screening [41].

Link prediction. Lastly, we may be interested in predicting properties of edges (u, v), or even predicting whether an edge exists; giving rise to the name 'link prediction'. In this case, a classifier can be learnt over the concatenation of features h_u ‖ h_v, along with any edge-level features. Canonical tasks include predicting links between drugs and diseases (drug repurposing [37]), drugs and targets (binding affinity prediction [29,23]), or drugs and drugs (predicting adverse side-effects from polypharmacy [54,9]).

It is possible to use the building blocks from the principal tasks above to go beyond classifying the entities given by the input graph, and have systems that produce novel molecules [33] or even perform retrosynthesis, the estimation of which reactions to utilise to synthesise given molecules [39,30].
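Given node embeddings h_u from any GNN layer, the three task heads above can be sketched in a few lines (my illustration; the linear classifiers, widths, and the sum readout are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_classes = 6, 8, 3
H = rng.normal(size=(n, d))            # node embeddings h_u from some GNN

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

W_node  = rng.normal(size=(d, n_classes))
W_graph = rng.normal(size=(d, n_classes))
w_link  = rng.normal(size=(2 * d,))

# Node classification: a shared classifier applied to each h_u (equivariant output).
node_probs = softmax(H @ W_node)                   # shape (n, n_classes)

# Graph classification: reduce all h_u with a permutation-invariant readout
# (here, a sum), then classify the resulting flat vector (invariant output).
graph_probs = softmax(H.sum(axis=0) @ W_graph)     # shape (n_classes,)

# Link prediction: score a candidate edge (u, v) from the concatenation h_u ‖ h_v.
def link_logit(u, v):
    return np.concatenate([H[u], H[v]]) @ w_link

p_edge = 1.0 / (1.0 + np.exp(-link_logit(0, 3)))   # probability of edge (0, 3)
```

Note how the graph-level head is invariant by construction: permuting the rows of H leaves the summed readout, and hence the prediction, unchanged.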
A natural question arises, following similar discussions over sets [53,47]: do GNNs, as given by Equation (10), represent all of the valid permutation-equivariant functions over graphs? Opinions are divided. Key results in previous years seem to indicate that such models are fundamentally limited in terms of the problems they can solve [51,36]. However, most, if not all, of the proposals for addressing those limitations are still expressible using the pairwise message passing formalism of Equation (10); the main requirement is to carefully modify the graph over which the equation is applied [44]. To supplement this further, [31] showed that, under proper initial features, sufficient depth-width product (#layers × dim h_u), and correct choices of ψ and φ, GNNs in Equation (10) are Turing universal: likely to be able to simulate any computation which any computer can perform over such inputs.

All points considered, it is the author's opinion that the formalism in this section is likely all we need to build powerful GNNs; although, of course, different perspectives may benefit different problems, and the existence of a powerful GNN does not mean it is easy to find using stochastic gradient descent.

GNNs without a graph: Deep Sets and Transformers
Throughout the prior section, we have made a seemingly innocent assumption: that we are given an input graph (through A). However, very often, not only will there not be a clear choice of A, but we may not have any prior belief on what A even is. Further, even if a ground-truth A is given without noise, it may not be the optimal computation graph: that is, passing messages over it may be problematic, for example, due to bottlenecks [1]. As such, it is generally a useful pursuit to study GNNs that are capable of modulating the input graph structure.

Accordingly, let us assume we only have a node feature matrix X, but no adjacency. One simple option is the 'pessimistic' one: assume there are no edges at all, i.e. A = I, or N_u = {u}. Under such an assumption, Equations (8)-(10) all reduce to h_u = f(x_u), yielding the Deep Sets model [53]. Therefore, no power from graph-based modelling is exploited here.

The converse option (the 'lazy' one) is to, instead, assume a fully-connected graph; that is, A = 11ᵀ, or N_u = V. This then gives the GNN the full potential to exploit any edges deemed suitable, and is a very popular choice for smaller numbers of nodes. It can be shown that convolutional GNNs (Equation (8)) would still reduce to Deep Sets in this case, which motivates the use of a stronger GNN. The next model in the hierarchy, attentional GNNs (Equation (9)), reduces to the following equation:

    h_u = φ(x_u, ⊕_{v∈V} a(x_u, x_v) ψ(x_v))    (11)

which is essentially the forward pass of a Transformer [43]. To reverse-engineer why Transformers appear here, let us consider the NLP perspective. Namely, words in a sentence interact (e.g. subject-object, adverb-verb). Further, these interactions are not trivial, and certainly not sequential; that is, words can interact even if they are many sentences apart.¹ Hence, we may want to use a graph between them. But what is this graph? Not even annotators tend to agree, and the optimal graph may well be task-dependent. In such a setting, a common assumption is to use a complete graph, and let the network infer relations by itself; at this point, the Transformer is all but rederived. For an in-depth rederivation, see [24].

¹ This insight may also partly explain why RNNs or CNNs have been seen as suboptimal language models: they implicitly assume only neighbouring words directly interact.
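To illustrate Equation (11) (my sketch, not code from the survey): on the complete graph N_u = V, choosing a(x_u, x_v) as softmax-normalised scaled dot products and ψ as a linear map recovers the core of a single-head Transformer self-attention layer. The query/key/value weight matrices and the 1/√k scaling are the standard Transformer choices, assumed here for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 8                                   # tokens (= nodes) and feature width
X = rng.normal(size=(n, k))
W_q, W_k, W_v = (rng.normal(size=(k, k)) for _ in range(3))

def self_attention(X):
    # Eq. (11) with N_u = V: every node attends over all nodes.
    S = (X @ W_q) @ (X @ W_k).T / np.sqrt(k)  # a(x_u, x_v): scaled dot product
    S = np.exp(S - S.max(axis=1, keepdims=True))
    alpha = S / S.sum(axis=1, keepdims=True)  # softmax over all v in V
    return alpha @ (X @ W_v)                  # aggregate psi(x_v) = W_v x_v

H = self_attention(X)
```

Being a GNN over a complete graph, this layer is permutation equivariant; any notion of word order has to be re-injected separately, e.g. through positional encodings.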
Another reason why Transformers have become such a dominant GNN variant is the fact that using a fully connected graph structure allows all model computations to be expressed using dense matrix products, and hence their computations align very well with currently prevalent accelerators (GPUs and TPUs). Further, they have a more favourable storage complexity than the message passing variant (Equation (10)). Accordingly, Transformers can be seen as GNNs that are currently winning the hardware lottery [21]!

Before closing this section, it is worth noting a third option for learning a GNN without an input graph: to infer a graph structure to be used as edges for a GNN. This is an emerging area known as latent graph inference. It is typically quite challenging, since edge selection is a non-differentiable operation, and various paradigms have been proposed in recent years to overcome this challenge: nonparametric [48,10], supervised [45], variational [27], reinforcement [26] and self-supervised learning [14].
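A minimal nonparametric example of the above (my sketch, not from the survey): infer a k-nearest-neighbour graph from feature similarity, then hand the resulting A to any GNN layer. The Euclidean metric and the choice of k are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, knn = 8, 4, 3
X = rng.normal(size=(n, d))                  # features only: no adjacency given

def infer_knn_graph(X, knn):
    # Pairwise squared Euclidean distances between node features.
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(D, np.inf)              # exclude self-loops
    A = np.zeros((len(X), len(X)))
    for u in range(len(X)):
        A[u, np.argsort(D[u])[:knn]] = 1.0   # connect u to its knn closest nodes
    return np.maximum(A, A.T)                # symmetrise into an undirected graph

A = infer_knn_graph(X, knn)
```

The hard top-k selection here is exactly the non-differentiable step mentioned above, which is why the variational, reinforcement and self-supervised paradigms exist: they make the edge choice learnable end-to-end.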
GNNs beyond permutation equivariance: Geometric graphs
To conclude our discussion, we revisit another assumption: we have assumed our graphs to be a discrete, unordered collection of nodes and edges; hence, only susceptible to permutation symmetries. But in many cases, this is not the entire story! The graph, in fact, may often be endowed with some spatial geometry, which will be very useful to exploit. Molecules, and their three-dimensional conformer structure, are a classical example of this.

In general, we will assume our inputs are geometric graphs: nodes are endowed with both features, f_u, and coordinates, x_u ∈ ℝ³. We may be interested in designing a model that is equivariant not only to permutations, but also to 3D rotations, translations and reflections (the Euclidean group, E(3)).

An E(3)-equivariant message passing layer transforms the coordinates and features, and yields updated features f′_u and coordinates x′_u. There exist many GNN layers that obey E(n) equivariance, and one particularly elegant solution was proposed by [38]:

    f′_u = φ(f_u, ⊕_{v∈N_u} ψ_f(f_u, f_v, ‖x_u − x_v‖²))    (12)

    x′_u = x_u + Σ_{v≠u} (x_u − x_v) ψ_x(f_u, f_v, ‖x_u − x_v‖²)    (13)

The key insight behind this model is that rotating, translating or reflecting coordinates does not change their distances ‖x_u − x_v‖², i.e., such operations are isometries. Hence, if we roto-translate all nodes as x_u ← Rx_u + b, the output features f′_u remain unchanged, while the output coordinates transform accordingly: x′_u ← Rx′_u + b.

While indeed highly elegant, a model like this hides a caveat: it only works over scalar features f_u. If our model needs to support any kind of vector input (e.g. forces between atoms), the model in Equation (12) would not suffice, because the vectors would need to appropriately rotate with R. [38] do propose a variant that allows for explicitly updating vector features, v_u:

    v′_u = φ_v(h_u) v_u + C Σ_{v≠u} (x_u − x_v) φ_x(f_u, f_v, ‖x_u − x_v‖²),    x′_u = x_u + v′_u    (14)

But these issues will continue to arise as features get 'tensored up'. Hence, in such circumstances, it might be useful to instead characterise a generic equation that supports all possible roto-translation equivariant models, and then learn its parameters. Such an analysis was done in Tensor Field Networks [42] for point clouds, and then extended to SE(3)-Transformers for general graphs [16].

Perhaps a fitting conclusion of this survey is a simple realisation: having shown how both Transformers and geometric equivariance constraints play a part within the context of GNNs, we now have all of the key building blocks to define some of the most famous geometric GNN architectures in the wild, such as AlphaFold 2 [25], but also similar protein-related articles which made headlines in both Nature Methods ([17]; MaSIF) and Nature Machine Intelligence [32]. It seems that protein folding, protein design, and protein binding prediction [40] all appear to be an extremely potent area of attack for geometric GNNs; just one of many solid reasons why the field of structural biology would benefit from these recent developments [5].

Declaration of competing interest
No conflicts to declare.

Data availability
No data was used for the research described in the article.

References
Papers of particular interest, published within the period of review, have been highlighted as:
* of special interest
** of outstanding interest

1. Alon U, Yahav E: On the bottleneck of graph neural networks and its practical implications. In International conference on learning representations; 2021. URL: https://openreview.net/forum?id=i80OPhOCVH2.
2. Battaglia P, Pascanu R, Lai M, Jimenez Rezende D, et al.: Interaction networks for learning about objects, relations and physics. Adv Neural Inf Process Syst 2016, 29.

3. Battaglia PW, Hamrick JB, Bapst V, Sanchez-Gonzalez A, Zambaldi V, Malinowski M, Tacchetti A, Raposo D, Santoro A, Faulkner R, et al.: Relational inductive biases, deep learning, and graph networks. 2018. arXiv preprint arXiv:1806.01261.

4. Blundell C, Buesing L, Davies A, Veličković P, Williamson G: Towards combinatorial invariance for Kazhdan-Lusztig polynomials. 2021. arXiv preprint arXiv:2111.15161.

5. Bouatta N, Sorger P, AlQuraishi M: Protein structure prediction by AlphaFold2: are attention and symmetries all you need? Acta Crystallogr D: Struct Biol 2021, 77:982–991.

6. Brody S, Alon U, Yahav E: How attentive are graph attention networks? In International conference on learning representations; 2022. URL: https://openreview.net/forum?id=F72ximsx7C1.

7. ** Bronstein MM, Bruna J, Cohen T, Veličković P: Geometric deep learning: grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478 2021.
Generalises the concepts in graph neural networks to derive most of deep learning in use today, satisfying specific symmetry groups over input domains.

8. * Davies A, Veličković P, Buesing L, Blackwell S, Zheng D, Tomašev N, Tanburn R, Battaglia P, Blundell C, Juhász A, et al.: Advancing mathematics by guiding human intuition with AI. Nature 2021, 600:70–74.
Used GNNs to uncover the structure of objects in pure mathematics, leading to top-tier conjectures and theorems in two distinct areas of mathematics.

9. Deac A, Huang YH, Veličković P, Liò P, Tang J: Drug-drug adverse effect prediction with graph co-attention. 2019. arXiv preprint arXiv:1905.00534.

10. Deac A, Lackenby M, Veličković P: Expander graph propagation. In The first learning on graphs conference; 2022. URL: https://openreview.net/forum?id=IKevTLt3rT.

11. Defferrard M, Bresson X, Vandergheynst P: Convolutional neural networks on graphs with fast localized spectral filtering. Adv Neural Inf Process Syst 2016, 29.

12. * Derrow-Pinion A, She J, Wong D, Lange O, Hester T, Perez L, Nunkesser M, Lee S, Guo X, Wiltshire B, et al.: ETA prediction with graph neural networks in Google Maps. In Proceedings of the 30th ACM international conference on information & knowledge management; 2021:3767–3776.
Deployed GNNs to serve travel time predictions in Google Maps.

13. Duvenaud DK, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, Adams RP: Convolutional networks on graphs for learning molecular fingerprints. Adv Neural Inf Process Syst 2015, 28.

14. Fatemi B, El Asri L, Kazemi SM: SLAPS: self-supervision improves structure learning for graph neural networks. Adv Neural Inf Process Syst 2021, 34:22667–22681.

15. Forrester JW: Counterintuitive behavior of social systems. Theor Decis 1971, 2:109–140.

16. Fuchs F, Worrall D, Fischer V, Welling M: SE(3)-Transformers: 3D roto-translation equivariant attention networks. Adv Neural Inf Process Syst 2020, 33:1970–1981.

17. Gainza P, Sverrisson F, Monti F, Rodola E, Boscaini D, Bronstein M, Correia B: Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat Methods 2020, 17:184–192.

18. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE: Neural message passing for quantum chemistry. In International conference on machine learning. PMLR; 2017:1263–1272.

19. Hamilton W, Ying Z, Leskovec J: Inductive representation learning on large graphs. Adv Neural Inf Process Syst 2017, 30.

20. * Hao J, Zhao T, Li J, Dong XL, Faloutsos C, Sun Y, Wang W: P-companion: a principled framework for diversified complementary product recommendation. In Proceedings of the 29th ACM international conference on information & knowledge management; 2020:2517–2524.
Deployed GNNs in important parts of Amazon's product recommendation system.

21. Hooker S: The hardware lottery. Commun ACM 2021, 64:58–65.

22. Jiang D, Wu Z, Hsieh CY, Chen G, Liao B, Wang Z, Shen C, Cao D, Wu J, Hou T: Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J Cheminf 2021, 13:1–23.

23. Jiang M, Li Z, Zhang S, Wang S, Wang X, Yuan Q, Wei Z: Drug–target affinity prediction using graph neural network and contact maps. RSC Adv 2020, 10:20701–20712.

24. ** Joshi C: Transformers are graph neural networks. The Gradient; 2020:5.
Demonstrates that Transformers are attentional GNNs.

25. * Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Zídek A, Potapenko A, et al.: Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596:583–589.
The AlphaFold 2 model, demonstrating highly accurate predictions of protein structure given their amino acid sequence.

26. * Kazi A, Cosmo L, Ahmadi SA, Navab N, Bronstein M: Differentiable graph module (DGM) for graph convolutional networks. In IEEE transactions on pattern analysis and machine intelligence; 2022.
Uses a combination of GNNs and RL to discover latent graph structures in point clouds and biomedical data.

27. Kipf T, Fetaya E, Wang KC, Welling M, Zemel R: Neural relational inference for interacting systems. In International conference on machine learning; 2018:2688–2697. PMLR.

28. Kipf TN, Welling M: Semi-supervised classification with graph convolutional networks. In International conference on learning representations; 2017. URL: https://openreview.net/forum?id=SJU4ayYgl.

29. Lim J, Ryu S, Park K, Choe YJ, Ham J, Kim WY: Predicting drug–target interaction using a novel graph neural network with 3D structure-embedded graph representation. J Chem Inf Model 2019, 59:3981–3988.

30. Liu CH, Korablyov M, Jastrzebski S, Włodarczyk-Pruszynski P, Bengio Y, Segler M: RetroGNN: fast estimation of synthesizability for virtual screening and de novo design by learning from slow retrosynthesis software. J Chem Inf Model 2022, 62:2293–2300.

31. * Loukas A: What graph neural networks cannot learn: depth vs width. In International conference on learning representations; 2020. URL: https://openreview.net/forum?id=B1l2bp4YwS.
Demonstrates that MPNNs are Turing universal, under reasonable assumptions on their input and parametrisation.

32. Méndez-Lucio O, Ahmad M, del Rio-Chanona EA, Wegner JK: A geometric deep learning approach to predict binding conformations of bioactive molecules. Nat Mach Intell 2021, 3:1033–1039.

33. Mercado R, Rastemo T, Lindelöf E, Klambauer G, Engkvist O, Chen H, Bjerrum EJ: Graph networks for molecular design. Mach Learn: Sci Technol 2021, 2, 025023.

34. * Mirhoseini A, Goldie A, Yazgan M, Jiang JW, Songhori E, Wang S, Lee YJ, Johnson E, Pathak O, Nazi A, et al.: A graph placement methodology for fast chip design. Nature 2021, 594:207–212.
Used GNNs to optimise chip placement to superhuman level, with GNN-powered designs making it into the TPUv5.

35. Monti F, Boscaini D, Masci J, Rodola E, Svoboda J, Bronstein MM: Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE conference on computer vision and pattern recognition; 2017:5115–5124.

36. Morris C, Ritzert M, Fey M, Hamilton WL, Lenssen JE, Rattan G, Grohe M: Weisfeiler and Leman go neural: higher-order graph neural networks. In Proceedings of the AAAI conference on artificial intelligence; 2019:4602–4609.

37. Morselli Gysi D, Do Valle Í, Zitnik M, Ameli A, Gan X, Varol O, Ghiassian SD, Patten J, Davey RA, Loscalzo J, et al.: Network