
Part I:

Foundations and Applications of


Graph Neural Networks
Yao Ma and Yiqi Wang, Michigan State University
Tyler Derr, Vanderbilt University
Lingfei Wu and Tengfei Ma, IBM Research

Tutorial website: https://ai.tencent.com/ailab/ml/KDD-Deep-Graph-Learning.html

1
Book: Deep Learning on Graphs

https://cse.msu.edu/~mayao4/dlg_book/
2
Part I Overview

Foundations: Basic Graph Theory, Spectral Graph Theory, Graph Fourier Analysis
Models: Filtering Layers in GNN, Pooling Layers in GNN, Graph Structure Learning, Robustness of GNN, Self-supervised Learning for GNN, Scalable Learning for GNN
Applications: Healthcare, Natural Language Processing

3
Part I Overview

Foundations: Basic Graph Theory, Spectral Graph Theory, Graph Fourier Analysis
Models: Filtering Layers in GNN, Pooling Layers in GNN, Graph Structure Learning, Robustness of GNN, Self-supervised Learning for GNN, Scalable Learning for GNN
Applications: Healthcare, Natural Language Processing

4
Graphs and Graph Signals

5
Graphs and Graph Signals

Graph Signal:

6
Graphs and Graph Signals

Graph Signal:

7
Graphs and Graph Signals

Graph Signal:
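The graph signal itself is shown only as an image in the slides; the standard definition, which the figure presumably illustrates, is a function on the nodes collected into a vector:

$$ f : \mathcal{V} \rightarrow \mathbb{R}, \qquad \mathbf{f} = [f(v_1), f(v_2), \dots, f(v_N)]^\top \in \mathbb{R}^N $$

so each node carries a scalar value (or, more generally, a feature vector).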

8
Matrix Representations of Graphs

9
Spectral graph theory. American Mathematical Soc.; 1997.
Matrix Representations of Graphs

Adjacency Matrix

10
Spectral graph theory. American Mathematical Soc.; 1997.
Matrix Representations of Graphs

Degree Matrix:

Degree Matrix Adjacency Matrix

11
Spectral graph theory. American Mathematical Soc.; 1997.
Matrix Representations of Graphs

Degree Matrix:

Degree Matrix Adjacency Matrix Laplacian Matrix
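The matrices appear only as figures; for an undirected graph with N nodes, the standard definitions (following the cited Spectral Graph Theory reference) are:

$$ A_{ij} = \begin{cases} 1 & (v_i, v_j) \in \mathcal{E} \\ 0 & \text{otherwise} \end{cases}, \qquad D = \mathrm{diag}(d_1, \dots, d_N), \;\; d_i = \sum_j A_{ij}, \qquad L = D - A $$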

12
Spectral graph theory. American Mathematical Soc.; 1997.
Laplacian Matrix as an Operator
Laplacian matrix is a difference operator:

13
Laplacian Matrix as an Operator
Laplacian matrix is a difference operator:

14
Laplacian Matrix as an Operator
Laplacian matrix is a difference operator:

Laplacian quadratic form:

15
Laplacian Matrix as an Operator
Laplacian matrix is a difference operator:

Laplacian quadratic form:

16
Laplacian Matrix as an Operator
Laplacian matrix is a difference operator:

Low frequency graph signal

Laplacian quadratic form:

17
Laplacian Matrix as an Operator
Laplacian matrix is a difference operator:

Low frequency graph signal

Laplacian quadratic form:

High frequency graph signal
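The elided equations are the standard ones; for a graph signal f,

$$ (L\mathbf{f})(i) = \sum_{j} A_{ij}\,\big(f(i) - f(j)\big), \qquad \mathbf{f}^\top L\, \mathbf{f} = \tfrac{1}{2}\sum_{i,j} A_{ij}\,\big(f(i) - f(j)\big)^2 $$

A small quadratic form (the signal changes little across edges) corresponds to a low-frequency graph signal; a large value corresponds to a high-frequency one.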


18
Eigen-decomposition of Laplacian Matrix
Laplacian matrix has a complete set of orthonormal eigenvectors:

19
Eigen-decomposition of Laplacian Matrix
Laplacian matrix has a complete set of orthonormal eigenvectors:

20
Eigen-decomposition of Laplacian Matrix
Laplacian matrix has a complete set of orthonormal eigenvectors:

Eigenvalues are sorted non-decreasingly:
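Written out (a standard result for the symmetric, positive semi-definite Laplacian):

$$ L = U \Lambda U^\top, \qquad U = [\mathbf{u}_1, \dots, \mathbf{u}_N], \;\; U^\top U = I, \qquad 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_N $$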

21
Eigenvectors as Graph Signals

22
Eigenvectors as Graph Signals
The frequency of an eigenvector of Laplacian matrix is its
corresponding eigenvalue:

23
Eigenvectors as Graph Signals
The frequency of an eigenvector of Laplacian matrix is its
corresponding eigenvalue:

Low frequency High frequency

24
Graph Fourier Transform (GFT)

25
Graph Fourier Transform (GFT)

26
Graph Fourier Transform (GFT)

27
The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine
Inverse Graph Fourier Transform (IGFT)
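The two transforms, as defined in the cited signal-processing-on-graphs literature, are:

$$ \hat{\mathbf{f}} = U^\top \mathbf{f} \;\; \text{(GFT)}, \qquad \mathbf{f} = U \hat{\mathbf{f}} \;\; \text{(IGFT)} $$

where the i-th Fourier coefficient \hat{f}_i = \mathbf{u}_i^\top \mathbf{f} is associated with frequency \lambda_i.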

28
Part I Overview

Foundations: Basic Graph Theory, Spectral Graph Theory, Graph Fourier Analysis
Models: Filtering Layers in GNN, Pooling Layers in GNN, Graph Structure Learning, Robustness of GNN, Self-supervised Learning for GNN, Scalable Learning for GNN
Applications: Healthcare, Natural Language Processing

29
Tasks on Graph-Structured Data
Node-level: Node Classification, Link Prediction
Graph-level: Graph Classification


30
Tasks on Graph-Structured Data
Node-level Graph-level

31
Tasks on Graph-Structured Data
Node-level Graph-level

Node Representations

32
Tasks on Graph-Structured Data
Node-level Graph-level

Node Representations Graph Representation

33
Tasks on Graph-Structured Data
Node-level Graph-level

Filtering Pooling

Node Representations Graph Representations

34
Two Main Operations in GNN
Graph Filtering

Graph Filtering

35
Two Main Operations in GNN
Graph Filtering

Graph Filtering

36
Two Main Operations in GNN
Graph Filtering

Graph Filtering

Graph filtering refines the node features


37
Two Main Operations in GNN
Graph Pooling

Graph Pooling

38
Two Main Operations in GNN
Graph Pooling

Graph Pooling

39
Two Main Operations in GNN
Graph Pooling

Graph Pooling

Graph pooling generates a smaller graph


40
General GNN Framework
For node-level tasks

Filtering Layer Activation

41
General GNN Framework
For graph-level tasks

Filtering Layer Activation Pooling Layer

… … …

42
Graph Filtering Operation

Graph Filtering

43
Two Types of Graph Filtering Operation
Spatial Based Filtering: Original GNN (Scarselli et al. 2005), GCN (Kipf & Welling, ICLR 2017), GraphSage (Hamilton et al., NIPS 2017), GAT (Veličković et al., ICLR 2018), MPNN (Gilmer et al., ICML 2017), ...
Spectral Based Filtering: Spectral Graph CNN (Bruna et al., ICLR 2014), ChebNet (Defferrard et al., NIPS 2016), GCN (Kipf & Welling, ICLR 2017), ...

44
Graph Filtering in the First GNN Paper

Graph neural networks for ranking web pages. WI. IEEE, 2005.
45
Graph Filtering in the First GNN Paper

46
Graph Spectral Filtering for Graph Signal
Recall:

47
Graph Spectral Filtering for Graph Signal
Recall:

Decompose
Coefficients

48
Graph Spectral Filtering for Graph Signal
Recall:

Decompose Filter
Coefficients Filtered coefficients

49
Graph Spectral Filtering for Graph Signal
Recall:

Decompose Filter
Coefficients Filtered coefficients

Example:

50
Graph Spectral Filtering for Graph Signal
Recall:

Decompose Filter Reconstruct


Coefficients Filtered coefficients

Example:

51
Graph Spectral Filtering for Graph Signal
Recall:

Decompose Filter Reconstruct


Coefficients Filtered coefficients

Example:

52
Graph Spectral Filtering for Graph Signal
Recall:

Decompose Filter Reconstruct


Coefficients Filtered coefficients

Example:

53
Graph Spectral Filtering for Graph Signal
Recall:

Decompose Filter Reconstruct


Coefficients Filtered coefficients

Filtering
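Putting the decompose, filter, and reconstruct steps into one equation (standard graph spectral filtering; the specific example filter in the slides is not recoverable from the text):

$$ \mathbf{f}' = U\, g(\Lambda)\, U^\top \mathbf{f}, \qquad g(\Lambda) = \mathrm{diag}\big(g(\lambda_1), \dots, g(\lambda_N)\big) $$

The GFT U^\top f produces the coefficients, g(\lambda_i) rescales each coefficient, and multiplying by U reconstructs the filtered signal.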

54
Graph Spectral Filtering for GNN
How to design the filter?

55
Graph Spectral Filtering for GNN
How to design the filter?

56
Graph Spectral Filtering for GNN
How to design the filter?

How to deal with multi-channel signals?

57
Graph Spectral Filtering for GNN
How to design the filter?

How to deal with multi-channel signals?

Each input channel contributes to each output channel

58
Graph Spectral Filtering for GNN
How to design the filter?

How to deal with multi-channel signals?

Each input channel contributes to each output channel

Filter each input channel
59


Graph Spectral Filtering for GNN
How to design the filter?

How to deal with multi-channel signals?

Each input channel contributes to each output channel

Filter each input channel
60


Spectral Networks and Locally Connected Networks on Graphs. ICLR 2014.
Expensive eigen-decomposition
64

Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. NIPS 2016.
No eigen-decomposition needed
68
Polynomial Parametrized Filter: a Spatial
View

69
Polynomial Parametrized Filter: a Spatial
View

70
Polynomial Parametrized Filter: a Spatial
View

71
Chebyshev Polynomials

72
Chebyshev Polynomials

Unstable under perturbation of coefficients

73
Chebyshev Polynomials

Unstable under perturbation of coefficients


Chebyshev polynomials:

74
Chebyshev Polynomials

Unstable under perturbation of coefficients


Chebyshev polynomials:

75
Chebyshev Polynomials

Unstable under perturbation of coefficients


Chebyshev polynomials:

76
Chebyshev Polynomials

Unstable under perturbation of coefficients


Chebyshev polynomials:
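For reference, the Chebyshev polynomials are defined by the recurrence

$$ T_0(x) = 1, \qquad T_1(x) = x, \qquad T_k(x) = 2x\, T_{k-1}(x) - T_{k-2}(x) $$

on the interval [-1, 1], which is why the Laplacian eigenvalues are rescaled to \tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I before they are used.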

77
ChebNet

Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. NIPS 2016.
78
ChebNet

79
ChebNet

No eigen-decomposition needed

80
ChebNet

No eigen-decomposition needed
Stable under perturbation of coefficients
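The ChebNet filter (from the cited NIPS 2016 paper) truncates the Chebyshev expansion at order K:

$$ g_\theta(L) \approx \sum_{k=0}^{K} \theta_k\, T_k(\tilde{L}), \qquad \tilde{L} = \frac{2}{\lambda_{\max}} L - I $$

so the filtered signal can be computed with K sparse matrix-vector products and no eigen-decomposition.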
81
GCN: Simplified ChebNet

Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017.
82
GCN: Simplified ChebNet

83
GCN: Simplified ChebNet

Apply a renormalization trick
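The renormalization trick, as in the cited ICLR 2017 paper: take the first-order Chebyshev approximation with a single parameter and replace I + D^{-1/2} A D^{-1/2} by

$$ \hat{A} = \tilde{D}^{-1/2} \tilde{A}\, \tilde{D}^{-1/2}, \qquad \tilde{A} = A + I, \qquad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} $$

which keeps the spectrum in a well-behaved range and avoids numerical instabilities when many layers are stacked.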

84
GCN for Multi-channel Signal
Recall:

Filter each input channel

85
GCN for Multi-channel Signal
Recall:

Filter each input channel


For GCN:

GCN filter

86
GCN for Multi-channel Signal
Recall:

Filter each input channel


For GCN:

GCN filter
In matrix form:

87
A Spatial View of GCN Filter

88
A Spatial View of GCN Filter

89
A Spatial View of GCN Filter

Observe that:

90
A Spatial View of GCN Filter

Observe that:

Hence,

91
A Spatial View of GCN Filter

Observe that:

Hence,

Feature transformation
92
A Spatial View of GCN Filter

Observe that:

Hence,

Feature transformation
Aggregation
93
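A minimal sketch of the GCN filter just described (feature transformation followed by aggregation over neighbors and self), written in plain PyTorch; the class and variable names are ours, not from the slides:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H' = act(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, a_hat, h):
        # a_hat: renormalized adjacency D~^{-1/2} (A + I) D~^{-1/2}, shape (N, N)
        # h: node features, shape (N, in_dim)
        h = self.linear(h)              # feature transformation
        return torch.relu(a_hat @ h)    # aggregation over neighbors and self

def normalize_adjacency(adj):
    """Renormalization trick: A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    a_tilde = adj + torch.eye(adj.shape[0])
    d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
```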
Filter in GCN VS Filter in the First GNN
GCN: k-th layer

The first GNN: k-th layer

94
Filter in GCN VS Filter in the First GNN
GCN: k-th layer

The first GNN: k-th layer

95
Filter in GCN VS Filter in the First GNN
GCN: k-th layer

The first GNN: k-th layer

96
Filter in GraphSage
Neighbor Sampling

Inductive Representation Learning on Large Graphs. NIPS 2017.


97
Filter in GraphSage
Neighbor Sampling

Aggregation
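A minimal sketch of neighbor sampling followed by mean aggregation, in the spirit of GraphSage; the fixed sample size, the mean aggregator, and all names are illustrative assumptions:

```python
import random
import torch

def sample_neighbors(adj_list, node, num_samples):
    """Uniformly sample a fixed number of neighbors (with replacement if too few)."""
    neigh = adj_list[node]
    if len(neigh) >= num_samples:
        return random.sample(neigh, num_samples)
    return [random.choice(neigh) for _ in range(num_samples)]

def graphsage_mean_layer(adj_list, features, weight_self, weight_neigh, num_samples=10):
    """One GraphSAGE-style layer with a mean aggregator over sampled neighbors."""
    out = []
    for v in range(features.shape[0]):
        neigh = sample_neighbors(adj_list, v, num_samples)
        neigh_mean = features[neigh].mean(dim=0)            # aggregate sampled neighbors
        h = features[v] @ weight_self + neigh_mean @ weight_neigh
        out.append(torch.relu(h))
    return torch.stack(out)
```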

98
Filter in GAT

Graph Attention Networks. ICLR 2018.
99
Filter in GAT
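The attention coefficients in GAT, from the cited ICLR 2018 paper:

$$ e_{ij} = \mathrm{LeakyReLU}\big(\mathbf{a}^\top [\, W\mathbf{h}_i \,\|\, W\mathbf{h}_j \,]\big), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}, \qquad \mathbf{h}_i' = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W \mathbf{h}_j\Big) $$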

100
Filter in MPNN
Message Passing

Feature Updating

Neural Message Passing for Quantum Chemistry. ICML 2017.
101
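In the MPNN framework, message passing and feature updating at layer k take the general form

$$ \mathbf{m}_v^{(k)} = \sum_{u \in \mathcal{N}(v)} M_k\big(\mathbf{h}_v^{(k-1)}, \mathbf{h}_u^{(k-1)}, \mathbf{e}_{uv}\big), \qquad \mathbf{h}_v^{(k)} = U_k\big(\mathbf{h}_v^{(k-1)}, \mathbf{m}_v^{(k)}\big) $$

with learnable message and update functions M_k and U_k.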
Graph Pooling Operation

Graph Pooling

102
gPool
Downsample by selecting the most important nodes

Graph U-Nets. ICML 2019.
103
gPool
Downsample by selecting the most important nodes
Importance Measure

104
gPool
Downsample by selecting the most important nodes
Importance Measure

105
gPool
Downsample by selecting the most important nodes
Importance Measure

106
gPool
Downsample by selecting the most important nodes
Importance Measure
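The importance measure and node selection in gPool (our transcription of the Graph U-Nets formulation; the gating nonlinearity is written generically as \sigma):

$$ \mathbf{y} = \frac{X\mathbf{p}}{\|\mathbf{p}\|}, \qquad \mathrm{idx} = \mathrm{top}\text{-}k(\mathbf{y}), \qquad X' = \big(X \odot \sigma(\mathbf{y})\big)_{\mathrm{idx}}, \qquad A' = A_{\mathrm{idx},\,\mathrm{idx}} $$

where p is a learnable projection vector: nodes are scored by projecting their features onto p, and only the top-k nodes (with gated features) are kept.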

107
DiffPool
Downsample by clustering the nodes using GNN

Hierarchical Graph Representation Learning with Differentiable Pooling. NeurIPS 2018.
108
DiffPool
Downsample by clustering the nodes using GNN
2 filters

Filter1:
Generate a soft-assign matrix

109
DiffPool
Downsample by clustering the nodes using GNN
2 filters

Filter1:
Generate a soft-assign matrix

Filter2:
Generate new features

110
DiffPool
Downsample by clustering the nodes using GNN

Generated soft-assign matrix

Generated new features

111
DiffPool
Downsample by clustering the nodes using GNN

Generated soft-assign matrix

Generated new features

112
DiffPool
Downsample by clustering the nodes using GNN

Generated soft-assign matrix

Generated new features
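The two filters and the pooled graph, from the cited DiffPool paper:

$$ S = \mathrm{softmax}\big(\mathrm{GNN}_{\mathrm{pool}}(A, X)\big), \qquad Z = \mathrm{GNN}_{\mathrm{embed}}(A, X), \qquad X' = S^\top Z, \qquad A' = S^\top A\, S $$

where the soft-assignment matrix S maps the original nodes to a smaller set of clusters, which become the nodes of the coarsened graph.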

113
Eigenpooling

Graph Convolutional Networks with EigenPooling. KDD 2019.
114
Eigenpooling

115
Eigenpooling

116
Eigenpooling

Capture both feature


and graph structure

117
Going Back to Graph Spectral Theory
Recall:

118
Going Back to Graph Spectral Theory
Do we need all the coefficients to reconstruct a “good” signal?

119
Going Back to Graph Spectral Theory
Do we need all the coefficients to reconstruct a “good” signal?

120
Going Back to Graph Spectral Theory
Do we need all the coefficients to reconstruct a “good” signal?

121
Eigenpooling: Truncated Fourier Coefficients
Eigenvectors (Fourier Modes) of the subgraph

122
Eigenpooling: Truncated Fourier Coefficients
Eigenvectors (Fourier Modes) of the subgraph

GFT
Fourier coefficients

123
Eigenpooling: Truncated Fourier Coefficients
Eigenvectors (Fourier Modes) of the subgraph

GFT
Fourier coefficients

Truncated Fourier
coefficients

124
Eigenpooling: Truncated Fourier Coefficients
Eigenvectors (Fourier Modes) of the subgraph

GFT
Fourier coefficients

Truncated Fourier
coefficients
New features for the subgraph (a node in the smaller graph)
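Schematically, for each subgraph EigenPooling takes the GFT of the subgraph's node features with respect to the subgraph Laplacian and keeps only the leading (low-frequency) coefficients:

$$ \hat{X} = U_{\mathrm{sub}}^\top X_{\mathrm{sub}}, \qquad X_{\mathrm{pool}} = [\hat{x}_1, \dots, \hat{x}_d], \quad d \ll N_{\mathrm{sub}} $$

The truncated coefficients summarize both the features and the structure of the subgraph, and they become the feature vector of the corresponding node in the smaller graph.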
125
Robustness of GNN

126
Adversarial Attacks on Deep Learning
Do Graph Neural Networks
Suffer the Same Problem?
Adversarial Attacks on GNN
[Figure: a clean example graph and an adversarially perturbed graph; after the perturbation, the GNN's prediction for node 8 changes.]

129
Consequences
Financial Systems
• Credit Card Fraud Detection
Recommender Systems
• Social Recommendation
• Product Recommendation
• ...

130
Image vs Graph

Discreteness
Perturbation Measure
Perturbation Type


131
Perturbation Type
• Adding an edge
• Deleting an edge
• Rewiring
• Modifying Features
• Node Injection


132
Evasion & Poisoning Attack
Evasion Attack Poisoning Attack

Evasion Attack: the GNN is trained first (① Train); the graph is perturbed afterwards, at test time.
Poisoning Attack: the graph is perturbed first (① Perturb), then the GNN is trained on the poisoned graph (② Train).

133
Targeted & Non-Targeted
Targeted Attack Non-Targeted Attack

Targeted Attack: degrade the prediction for a specific target node (e.g., node 8).
Non-Targeted Attack: degrade the overall performance of the model.

134
Attack Methods
Attack Methods | Injecting Node | Adding/Deleting Edge | Rewiring | Modifying Features | Evasion | Poisoning | Targeted | Non-Targeted
Grad-Argmax: ✔ ✔ ✔ ✔ ✔ ✔ ✔
Nettack: ✔ ✔ ✔ ✔ ✔
GF-Attack: ✔ ✔ ✔
ReWatt: ✔ ✔ ✔
RL-S2V: ✔ ✔ ✔
Meta-Attack: ✔ ✔ ✔
NIPA: ✔ ✔ ✔

135
Attack Methods
Attack Methods | Injecting Node | Adding/Deleting Edge | Rewiring | Modifying Features | Evasion | Poisoning | Targeted | Non-Targeted
Grad-Argmax: ✔ ✔ ✔ ✔ ✔ ✔ ✔

136
GradArgmax

Adversarial Attack on Graph Structured Data. ICML 2018.
137


GradArgmax

138
GradArgmax

139
GradArgmax

140
GradArgmax

141
Attack Methods
Attack Methods | Injecting Node | Adding/Deleting Edge | Rewiring | Modifying Features | Evasion | Poisoning | Targeted | Non-Targeted
Grad-Argmax: ✔ ✔ ✔ ✔ ✔ ✔ ✔
Nettack: ✔ ✔ ✔ ✔ ✔

142
Nettack

143
Adversarial Attacks on Neural Networks for Graph Data. KDD 2018.
Nettack
Idea 1: Train a surrogate model

A two-layer linearized GCN trained on the original graph
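The surrogate is the two-layer GCN with the nonlinearity removed, as in the cited KDD 2018 paper:

$$ Z = \mathrm{softmax}\big(\hat{A}\, \hat{A}\, X\, W^{(1)} W^{(2)}\big) = \mathrm{softmax}\big(\hat{A}^2 X W\big) $$

Candidate perturbations are then scored by how much they change the surrogate's prediction margin for the target node.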

144
Nettack
Edge Perturbations
Candidates

Feature
Perturbations
Candidates

Degree Distribution
Feature Co-occurrence

145
Nettack
Edge Perturbations
Candidates

Feature
Perturbations
Candidates

Attack
Target GCN Models Wrong
Prediction

146
Attack Methods
Attack Methods | Injecting Node | Adding/Deleting Edge | Rewiring | Modifying Features | Evasion | Poisoning | Targeted | Non-Targeted
Grad-Argmax: ✔ ✔ ✔ ✔ ✔ ✔ ✔
Nettack: ✔ ✔ ✔ ✔ ✔
GF-Attack: ✔ ✔ ✔

147
GF-Attack
Motivation:
• The output embeddings of graph embedding models have been shown to have a very low-rank property.
• Goal: damage the quality of the output embedding Z

• Formulation:
• A graph embedding model can be viewed as producing new graph signals by applying a graph filter ℋ together with a feature transformation:

148
A Restricted Black-box Adversarial Framework Towards Attacking Graph Embedding Models. AAAI20.
GF-Attack

149
Attack Methods
Attack Methods | Injecting Node | Adding/Deleting Edge | Rewiring | Modifying Features | Evasion | Poisoning | Targeted | Non-Targeted
Grad-Argmax: ✔ ✔ ✔ ✔ ✔ ✔ ✔
Nettack: ✔ ✔ ✔ ✔ ✔
GF-Attack: ✔ ✔ ✔
ReWatt: ✔ ✔ ✔

150
ReWatt
Motivation
Degree distribution may not be an ideal measure for perturbations.
How to make perturbations more unnoticeable? → Rewiring

151
Attacking Graph Convolutional Networks via Rewiring. arXiv 2019.
ReWatt
Rewiring Advantages
• Number of nodes and edges
remain the same
• Affects algebraic connectivity
in a smaller way
• Affects effective graph resistance
in a smaller way

152
ReWatt
Reinforcement Learning
Black-box classifier

Attacker
Policy Network

GCN → Node Embeddings, Edge Embeddings

153
Defending Against Attacks
Adversarial Training
Graph Purifying
Attention Mechanism

154
Adversarial Training
Motivation

Augment the training set with


adversarial data

155
Latent Adversarial Training of Graph Convolution Networks. ICML 2019 workshop.
Adversarial Training
Obstacles
•A is discrete
•X is often discrete

156
Graph Purifying - Preprocessing
Main Idea
• Purify the poisoned graph
• Train GNN on the purified graph

[Figure: ① Preprocess the poisoned graph to purify it, ② Train the GNN on the purified graph → Trained GNN]
157


Graph Purifying - Preprocessing
Observations
• Attackers favor adding edges over removing edges

Attackers tend to connect dissimilar nodes!


158
Adversarial Examples on Graph Data: Deep Insights into Attack and Defense. IJCAI 2019.
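A minimal sketch of the preprocessing defense from the cited IJCAI 2019 paper: drop edges whose endpoints have (near-)zero feature similarity. Binary node features, Jaccard similarity, and the threshold value are assumptions for illustration:

```python
import numpy as np

def jaccard_similarity(x_i, x_j):
    """Jaccard similarity between two binary feature vectors."""
    intersection = np.logical_and(x_i, x_j).sum()
    union = np.logical_or(x_i, x_j).sum()
    return intersection / union if union > 0 else 0.0

def purify_graph(adj, features, threshold=0.01):
    """Drop edges whose endpoints have near-zero feature similarity."""
    purified = adj.copy()
    rows, cols = np.nonzero(np.triu(adj, k=1))   # iterate over existing edges once
    for i, j in zip(rows, cols):
        if jaccard_similarity(features[i], features[j]) < threshold:
            purified[i, j] = purified[j, i] = 0  # remove the suspicious edge
    return purified
```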
Graph Purifying – Graph Learning: Pro-GNN
Graph Learning and GNN training
[Figure: poisoned graph → Graph Learning → clean graph; GNN Learning → Trained GNN]
159
Graph Structure Learning for Robust Graph Neural Networks. KDD 2020.
Pro-GNN: Defend Against Adversarial Attacks
Graph Properties
Low-rank
Sparsity
Feature smoothness

160
Pro-GNN: Defend Against Adversarial Attacks
Graph Properties
Low-rank
Sparsity
Feature smoothness

Table Credit: Adversarial Attacks and Defenses on Graphs: A Review and Empirical Study
161
Pro-GNN: Defend Against Adversarial Attacks
Graph Properties
Low-rank
Sparsity
Feature smoothness
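Schematically, Pro-GNN encodes the three properties as regularizers on a learned adjacency matrix S and optimizes them jointly with the GNN parameters (our transcription of the objective in the cited KDD 2020 paper):

$$ \min_{S,\,\theta}\;\; \|A - S\|_F^2 \;+\; \alpha \|S\|_1 \;+\; \beta \|S\|_* \;+\; \lambda\, \mathrm{tr}\big(X^\top \hat{L}_S X\big) \;+\; \mathcal{L}_{\mathrm{GNN}}(S, X, \theta) $$

where the \ell_1 norm promotes sparsity, the nuclear norm promotes low rank, and the trace term promotes feature smoothness over the learned graph.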

162
Pro-GNN: Framework

163
Attention Mechanism
Motivation
Reduce impact of adversarial edges
-- give lower attention score to adversarial edges

Thicker arrows indicate higher attention coefficients
164


RGCN
Motivation
Attacked nodes may have high uncertainty
Give lower attention score to reduce their impact

165
Robust Graph Convolutional Networks Against Adversarial Attacks. KDD 2019.
RGCN
Embed nodes as Gaussian
distributions to capture
uncertainty

166
RGCN
Attention Mechanism
Attacked nodes do have higher variance!

167
PA-GNN
Motivation
• Only relying on the perturbed graph to learn attention coefficients is not enough.
• We should exploit information from clean graphs from similar domains (e.g., Facebook & Twitter, Yelp & Foursquare).
Then use Transfer Learning / Meta Learning!

168
Robust Graph Neural Network Against Poisoning Attacks via Transfer Learning. WSDM 2020.
PA-GNN

169
Self-Supervised Learning for
Graph Neural Networks

170
Self-Supervised Learning
Relative position
pretext task
Doersch et al., 2015

Jigsaw puzzle
pretext task
Noroozi and Favaro, 2016
171
Graph-Structured Data

172
Traditional Deep Learning on Graphs
Traditional DL is designed for simple grids or sequences
• CNNs for fixed-size images/grids
• RNNs for text/sequences

But nodes on graphs have different connections:
• Arbitrary neighbor size
• Complex topological structure
• No fixed node ordering

Graph Neural Networks
[Figure: graph convolutions → activation function → node-level and graph-level representations]
173
Early Unsupervised GNN

•Objective is to
reconstruct
masked edges

•Could be used as a
pre-training step
for another task
(e.g., node
classification)

Variational Graph Auto-Encoders. NIPS Workshop on Bayesian Deep Learning, 2016.
174
Harnessing Unlabeled Nodes
in Node Classification

GNNs are inherently semi-supervised as unlabeled


nodes can also be utilized during feature aggregation

Can we harness the benefits of SSL to more fully


utilize unlabeled nodes?
[Figure legend: labeled node, unlabeled node, aggregation region around labeled node]
175
Problem Statement

Unlabeled node

176
Applying SSL to Graphs
Similarities to Image and Text Domains
• Nodes have features like images or text
🡪 Pretext tasks using attribute information
• Topological structure associated with unlabeled samples
🡪 Pretext tasks using structural information
Fundamental Differences Found in Graph
Domain
• Nodes are connected and dependent
🡪 Pretext tasks using node pairs or even sets

• Unlabeled nodes have structural relations to labeled nodes


🡪 Pretext tasks using label information 177
Main Strategies to Merge SSL Tasks with GNN
• Joint Training (see the sketch below)
• Two-stage Training
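A minimal sketch of the joint-training strategy: the downstream task loss and a self-supervised pretext loss share the same GNN encoder and are optimized together. The weighting lam and all function names are ours, not from any specific paper:

```python
import torch

def joint_training_step(gnn, task_head, ssl_head, graph, features,
                        labels, train_mask, ssl_targets, optimizer, lam=0.5):
    """One optimization step of joint (multi-task) training: L = L_task + lam * L_ssl."""
    optimizer.zero_grad()
    h = gnn(graph, features)                               # shared node embeddings
    task_loss = torch.nn.functional.cross_entropy(
        task_head(h)[train_mask], labels[train_mask])      # supervised loss on labeled nodes
    ssl_loss = ssl_head.loss(h, ssl_targets)               # pretext loss on all nodes
    loss = task_loss + lam * ssl_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```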

178
Multi-Stage Self-Training GNNs

Multi-Stage Self-Supervised Learning for Graph Convolutional Networks on Graphs with Few Labels. AAAI 2020.
179
DeepCluster and M3S Training
For each stage:
• Run DeepCluster (K-means on the node embeddings)
• Align cluster centers to labeled data class centers
• Sort remaining unlabeled nodes based on confidence of prediction
• For each class j: find the top samples; if a sample's pseudo label matches the aligned cluster label, add it to the training set
• Train for a fixed number of epochs

180
Case Study Results with M3S

General Insights:
• The less training data the larger the improvement over GCN
• Self-training can typically provide improvements
• Using MultiStage typically is better than single Self-training
• DeepCluster based self-checking provides a benefit

181
Contrastive Learning for Graphs via
Augmentations

Contrastive Multi-View Representation Learning on Graphs. AAAI 2020.
182


Insights on Contrastive Learning for Graphs
Augmentation types:
• Feature space: masking or adding Gaussian noise
• Structure space: adding/removing connections, sub-sampling, global view with diffusion matrix

They use the diffusion matrix (e.g., Personalized PageRank) along with the adjacency matrix to provide and contrast a global and a local viewpoint.

Insights:
• Able to compete with supervised methods
• Contrasting node and graph embeddings works best
• More than two views is worse
183
Generative Pre-Training of GNNs

GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD 2020.
184


When Does SSL Help GCNs?
They evaluate
three ways to
include SSL tasks:
• Pretraining/
Finetuning
• Self-training
• Multi-task
learning

When Does Self-Supervision Help Graph Convolutional Networks? ICML 2020.
185
Node Clustering Pretext Task
• Features used: Nodes
• Assumptions: Feature similarity
• Loss Type: Classification

• Follows the ideas of clustering from M3S


• Each node is given a cluster based on node features
• This cluster label is assigned the self-supervised label to predict

186
Partitioning Pretext Task
• Features used: Edges
• Assumptions: Connection density
• Loss Type: Classification

• Rather than clustering the node features, instead they partition the
network based on the structure
• Similarly, partition indices are the self-supervised label to predict

187
Graph Completion Pretext Task
• Features used: Nodes & Edges
• Assumptions: Context based Representation
• Loss Type: Regression

188
SSL Universally Improves Most GNN Models

Insights:
• Generally multi-task performs better than
pretraining/finetuning
• SSL acts universally well to improve many
GNN base models

189
When and Why SSL Works on GNNs
• Presents a set of basic pretext tasks using structure and
attribute information

• Insights gained on:


• Which strategy to harness SSL on GNNs?
• Why do some pretext tasks work on GNNs while others do not?
• How to construct advanced pretext tasks beyond basic structure
and attributes?

Self-Supervised Learning on Graphs: Deep Insights and New Directions. arXiv 2020.
190
Basic Pretext Tasks on Graphs

Structure (Local): Node Property, Edge Mask
Structure (Global): Pairwise Distance, Distance to Clusters
Attribute: Attribute Mask, Pairwise Attribute Similarity
191
Local Structure Pretext Tasks
Node Property (regression loss): predict an extracted node property from the GNN-mapped node embeddings.

Edge Mask (classification loss): randomly mask some edges; from the mapped node embeddings of a pair (i, j), predict whether the edge was masked or is a remaining edge.
192
Global Structure Pretext Tasks
Pairwise Distance (classification loss): calculate all pairwise shortest-path lengths; from the mapped node embeddings of a pair (i, j), predict their distance.

Distance to Clusters (regression loss): obtain k clusters; from the mapped node embeddings, predict the distance from each node to the center of each cluster.
193
Attribute Pretext Tasks
Attribute Mask (regression loss): mask the attributes of some nodes; from the mapped node embeddings, reconstruct the masked attributes.

Pairwise Attribute Similarity (regression loss): find the most and least similar node pairs by attribute similarity; from the mapped node embeddings of those pairs, predict the associated similarity values.
194
Empirical Study of Basic Pretext Tasks

[Table: results grouped by Local Structure, Global Structure, and Attribute pretext tasks]

Insights:
• In general joint/multi-task training outperforms pre-training/two-stage training
• Global structure generally outperforms local structure
• Is there a way to further combine and improve these basic methods?
195
Deeper Insights into Why Some SSL Tasks Work

Positive values 🡪 GCN node embeddings achieve higher accuracy

[Table: pretext task performance when using original GCN embeddings compared with original node attributes]
196
Further Insights and New Directions
Node similarity is a fundamental property of graphs
🡪 Does this similarity get maintained in the GCN embeddings?
Two nodes are
• structurally equivalent if their local neighborhoods significantly overlap
🡪 Based on local neighbor aggregation in GCN it would be expected
to be somehow maintained in their embeddings
e.g., Pairwise Distance pretext helps maintain this

• regularly equivalent even if not having the same neighbors if the neighbors are themselves
similar
🡪 If this similarity is based on their attributes, even if neighbors are different,
if their neighbors share similar features then two nodes are similar
e.g., Pairwise Attribute Similarity pretext helps maintain this
🡪 Next we define regular task equivalence… 197
Further Insights and New Directions
Node similarity is a fundamental property of graphs
🡪 Does this similarity get maintained in the GCN embeddings?

Two nodes are


structurally equivalent if their local neighborhoods significantly overlap
regularly (attribute) equivalent even if not having the same neighbors if the
neighbors are themselves similar (regarding attributes)
regularly task equivalent defines similarity of nodes in relation to the task
🡪 Intuition: if every node constructs a pretext vector based on label
information from their local neighborhood, then two nodes having similar (or
dissimilar) vectors, we encourage to be similar (or dissimilar) in the
embedding space 198
Advanced Pretext Tasks on Graphs
SelfTask, based on the intuitions of regular task equivalence:
• Structure + Label: Distance to Labeled
• Structure + Attribute + Label: Context Label, Ensemble Label, Corrected Label
199
SelfTask: Distance to Labeled

200
SelfTask: Context Label
Each node constructs a neighbor label
distribution context vector

201
SelfTask: Corrected Label
Key Idea: Improve Context Label by iteratively improving the context vector

SelfTask:
Context Label

202
Advanced SSL Results with
SelfTask for Node
Classification

203

Insights:
• Advanced methods utilizing the label information of
neighbors significantly improves performance
• The label correction stage indeed helps SelfTask
• Limited labeled data? No problem!
Summary of SSL for GNNs
• SSL for GNNs is still in the early stages but seen rapid growth/interest

• Just as in other domains, not all defined pretext tasks can work
• Some are more general than others
• While some can be specifically designed with domain specific knowledge

• Methods have taken pre-training, self-training, or multi-task training approaches

• Can we further leverage the relation between unlabeled nodes to labeled


nodes in advancing pretext tasks?

• Further theoretical and empirical analysis is desired to better understand when/why/how SSL for GNNs can work
204
Scalable Learning of Graph Neural
Networks
Tengfei Ma
IBM Research AI
IBM T. J. Watson Research Center
@KDD 2020 Tutorial

205
Graph Convolutional Networks (GCN)
- (Kipf and Welling 2017)
Motivation
Matrix form of a GCN layer and per-node form (embedding vectors are oriented as column vectors; see the equations below): each node aggregates over its full neighborhood.
Problems of GCN:
• Time and memory cost for large graphs: starting from a single node, after a few layers almost the whole graph will be touched, so even mini-batch training is expensive.
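Written out (standard GCN; the per-node form uses the column-vector convention stated on the slide):

$$ H^{(k)} = \sigma\big(\hat{A}\, H^{(k-1)} W^{(k)}\big), \qquad \mathbf{h}_v^{(k)} = \sigma\Big(W^{(k)} \sum_{u \in \mathcal{N}(v) \cup \{v\}} \hat{A}_{vu}\, \mathbf{h}_u^{(k-1)}\Big) $$

Unrolling the per-node form over several layers touches the full multi-hop neighborhood of v, which is exactly the cost problem listed above.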
Node Sampling
-GraphSAGE (Hamilton et al. 2017)
Node sampling: for each node just sample a fixed number of neighbors
• Matrix form
• Where idxk-1 is a uniformly randomly sampled subset of nodes. (For nonuniform random, need
proper scaling.)
• Per-node form
• For all v in idxk only

• Problems of node sampling


• Still power law
• No formal analysis to justify
Sampled
neighborhood
(same for all v in idxk)
Layer Sampling
- FastGCN (ICLR 2018)
Generalization of a GCN layer (assume each layer independent)

Monte Carlo sampling


• For each layer/batch sample nodes

• Matrix form (in a batch, all v have the same sampled neighbors)
Comparison

GCN FastGCN
Importance Sampling

• Uniform sampling

• Variance reduction
• Importance sampling, sampling from Q instead of a uniform P
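Schematically (up to normalization constants), each FastGCN layer is estimated by importance-sampled Monte Carlo over an independent node distribution q:

$$ h^{(k+1)}(v) \approx \sigma\Big(\frac{1}{t}\sum_{j=1}^{t} \frac{\hat{A}(v, u_j)\, h^{(k)}(u_j)\, W^{(k)}}{q(u_j)}\Big), \qquad u_j \sim q, \qquad q(u) \propto \big\|\hat{A}(:, u)\big\|^2 $$

Sampling nodes proportionally to the squared column norms of \hat{A} reduces the variance of the estimator compared with uniform sampling.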
Results

Per-batch training time Prediction accuracy


Adaptive Sampling
(Huang et al. 2018)
Problem of FastGCN:
• Layer-independent assumption
• Too sparse sampling ->lower accuracy
Extension
• Layer-dependent: sample the lower layer conditioned on the top one
• Based on importance sampling schema, they learn a self-dependent function
of each node to determine its importance for the sampling
• To explicitly reduce sampling variance, they add the variance to the loss
function and explicitly minimize the variance by model training

213
Graph Sampling
- GraphSAINT (Zeng et al. 2020)
Instead of node sampling or layer sampling, do graph sampling

If , it is an unbiased estimator of the aggregator, i.e.


• Pre-processing: repeatedly sample n subgraphs, and set

• Run a full GCN on each subgraph


Advantage:
• Permits multiple sampler methods (random node/edge/random walk)
214
Sampling for Multi-Relational Graphs
--RS-GCN (ICML20 GRL+)
Motivation:
• Most of previous sampling methods are for homogeneous graphs.
• We are focusing on accelerating the learning on multi-relational graphs
Idea
• Relation type matters!
• Probability to sample relation r at hop k:

• REINFORCE to update sampling probabilities

215
Application in Real-World
- PinSAGE (Ying et al. 2018)
An early industry-level GNN-based recommendation system
• The core of PinSAGE is a neighborhood aggregation algorithm similar to
GraphSAGE
• Novelty: how to define the neighborhood?
• Importance-based: the neighborhood of a node u is defined as the T nodes that exert the
most influence on node u.
• Random walks from node u, top T visited nodes.
• Efficient Training:
• Does not train on the whole graph, but only on
targeted node set and their neighborhood
• MapReduce for model inference

216
Application in Real-World
-Anti-Money Laundering
Application of FastGCN on large synthetic AML datasets:
• Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional
Networks for Financial Forensics (NeurIPS 2018 WS)
• Entities as nodes and transactions as edges
• Detecting suspicious nodes/transactions

217
Industrial-level Libraries and Applications
Deep Graph Library (DGL)
PyTorch Geometric (PyG)
AliGraph -- Alibaba
PyTorch-BigGraph (PBG) -- Facebook
AntGraph Machine Learning system (AGL) -- Ant Finance

218
Part I Overview

Foundations Models Applications

Filtering Layers in GNN


Basic Graph Theory Healthcare
Pooling Layers in GNN
Spectral Graph Theory Graph Structure Learning
Robustness of GNN
Graph Fourier Analysis Self-supervised Learning for GNN Natural Language Processing
Scalable Learning for GNN

219
Graph Neural Networks for
Healthcare Applications
--Drug Discovery and Medical Recommendation
Tengfei Ma
IBM Research AI
IBM T. J. Watson Research Center
@KDD2020 Tutorial

220
Drug Discovery
Drug Discovery is a long, tedious, and costly process
• Machine learning can help
• de novo drug design
• Generating new molecules for desired target.
• drug safety checking
• Toxicity
• Adverse reaction/drug-drug interaction
Interestingly, they are all related to graphs
• Molecule -- graph
• DDI – graph
It is natural to develop GNN-based methods
• Molecule generation
• DDI prediction

221
Constrained Generation of Semantically
Valid Graphs via Regularizing Variational
Autoencoders
NeurIPS 2018.

222
Molecule Graph Generation
Generative Models for Images/Sequences

But generation of graphs?


• Graph neural networks need to know the predefined graph structure
• GNNs can be used as encoders, but how to design a decoder/generator?
• How to guarantee the generated sample is a valid graph?
Ideas:
• Represent a graph as the concatenation of its node matrix and edge matrix and treat it as an image -> so we can use the same decoder as for images
• Validity? Add constraints to the generator
Constrained Graph Variational Auto-Encoder
Overview of the framework

• A graph auto-encoder used to generate the graph


• In addition to a standard VAE (within the rectangle), we add a regularization
term.
• f(x) is the original VAE loss
• h and g are regularization terms
Approximate Training
A Lagrangian relaxation

Training in Standard VAE

• Monte Carlo sampling

Similarly for the regularization term


Constraints
Molecules
• Valence
• Expected node capacity (sum of edges) <= valence
• Connectivity
• Every node pair must be connected by a path
• B = A + A^2 + … + A^{n-1}
• If node i and j are connected, B_{ij} != 0
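Written out (our transcription of the two bullets above):

$$ \text{valence:}\;\; \sum_j \mathbb{E}[A_{ij}] \le \mathrm{valence}(i) \;\;\forall i, \qquad \text{connectivity:}\;\; B = A + A^2 + \dots + A^{n-1}, \;\; B_{ij} \neq 0 \;\;\forall i, j $$

Both constraints are imposed on the generator's output and enter the training objective through the regularization terms g and h mentioned earlier.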
Results
Compared to VAE with no regularization

Compared to previous works


Visualization of Generated Molecules
Left: two-dimensional interpolation
Right: one-dimensional (column-wise) interpolation
Drug Similarity Integration Through
Multi-view Graph Auto-Encoders
IJCAI 2018

229
Adverse Drug-Drug Interaction (DDI)

Common among patients with


complex diseases or comorbidities.
Hard to observe in clinical testing.
Affects 15% of the U.S. population. Costs more than $177 billion per year in disease management.
Drugs may interact to cause adverse DDIs.
DDI Prediction
Drug Features (database)
• Label Side Effect (SIDER)
• Off-Label Side Effect (OFFSIDES)
• Molecular substructure
• Drug Indication (MedDRA)
• ……
Assumption: similar drugs may have similar interaction to another drug.
Related to Graphs?
• Constructing a DDI graph, we can predict unknown DDIs as a link prediction problem.
• Constructing a similarity graph, where DDIs are regarded as node labels, and DDI prediction is
a node classification problem.
MV-GAE: Drug Similarity Integration Through
Multi-view Graph Auto-Encoders (IJCAI 2018)
[Figure: drug A, drug B, sim(A, B)]
• Challenges of multi-view learning
• The underlying relations of biomedical events are often nonlinear and complex over all types of features
• Features have different importance toward different target outcomes
• Our solution:
• Construct drug similarity graph for each
view, where DDIs are node labels
• Graph convolutional network (GCN) based
model for node embedding and prediction
• Attention mechanism to integrate different
views
A Simple Multi-View GCN

GCN GCN

Question: what if we do not have labels?


Attentive Multiview Similarity Fusion with
Graph Auto-Encoders (GAE)
GCN decoder

GCN encoder

Normalize

attention weights decided


by data and target
Semi-Supervised Extensions (SemiGAE)

GCN decoder

GCN encoder
What if we do not have node features?

GCN decoder

GCN encoder
Results
Analysis of Attention
Attention Weights

DDI Type     AUC    Chem.  indi.  TTDS   CPI
Chest Pain   0.772  0.151  0.303  0.144  0.402
Insomnia     0.755  0.380  0.261  0.078  0.291
indication   0.774  0.117  0.301  0.283  0.299

Results: views "indication" and "CPI" receive high weights for the ADR "Chest pain".
Graph-Enhanced Medication Recommendation:
GAMENet and G-BERT
AAAI 2019, IJCAI 2019.

239
EHR Phenotyping and Medication
Recommendation
EHR (electronic healthcare record):
• Representation

• Phenotyping is important for


• Disease prediction
What if there is a DDI?
• Medication recommendation
• Readmission prediction
Medication Recommendation
Challenges
Problems of previous approaches
• Without consideration of DDIs
• It is possible to recommend some un-safe drugs
• Lack of structure information for medical codes
• Throwing out a lot of resources
• The single-visit EHR sequences
• -> we cannot use it in RNN
• ->we cannot use it to predict the next visit

241
Medical Recommendation I: GAMENet
GAMENet: Graph Augmented MEmory Networks for Recommending
Medication Combination (Shang et al. 2019a)
• Key ideas: integrate the DDI graphs to provide safer medical recommendation
• Method: encode both EHRs and DDI graphs in the memory and impact on the
memory output
Graph Augmented Memory Module
I: Input memory representation converts inputs
into query for memory reading.
• RNN
G: Generalization is the process of generating
and updating the memory representation.
• Static memory bank Mb: GCN
• Dynamic memory Md : adding key-value pair
O: Output memory representation produces
outputs given the patient representation (the
query) and the current memory state Mb and Md.

R: Response is the final step to utilize patient


representation and memory output to predict
the multilabel medication.
Graph Augmented Memory Module
I: Input memory representation converts inputs
into query for memory reading.
• RNN
G: Generalization is the process of generating
and updating the memory representation.
• Static memory bank Mb: GCN
• Dynamic memory Md : adding key-value pair
O: Output memory representation produces
outputs given the patient representation (the
query) and the current memory state Mb and Md.

R: Response is the final step to utilize patient


representation and memory output to predict
the multilabel medication.
Graph Augmented Memory Module
I: Input memory representation converts inputs
into query for memory reading.
• RNN
G: Generalization is the process of generating
and updating the memory representation.
• Static memory bank Mb: GCN
• Dynamic memory Md : adding key-value pair
O: Output memory representation produces
outputs given the patient representation (the
query) and the current memory state Mb and Md.

R: Response is the final step to utilize patient


representation and memory output to predict
the multilabel medication.
Medical Recommendation II: GBERT (Graph +
Pre-training)
Motivation
• To utilize the hierarchy information of medical codes
• To pretrain on single-visit data (which was generally discarded in previous
systems)
Framework:
• ontology embedding -> visit embedding -> pre-training
Ontology Embedding
A modified GNN to model the ontology and get initial embeddings for
all medical codes
• Leaf to root

• Root to leaf

• GAT style message passing


Visit Embedding

248
Pre-training
Input example:
• [CLS] d1, d2, [MASK], d3, d4, m1, [MASK], m2, m3
• d: diagnosis; m: medication
pre-training on each visit of EHR sequences
• Self-prediction: same as BERT, use to a mask to mask out some codes and
predict them
• Dual-prediction: known all diagnosis, predict medication; known medication,
predict diagnosis
• Note: No position embedding, because there is no order within one visit.
Results
• We used EHR data from MIMIC-III
[Johnson et al., 2016] and conducted all
our experiments on a cohort where
patients have more than one visit.

• For GAMENet, we used DDI knowledge


from TWOSIDES dataset.

• For G-BERT, we utilize data from patients


with both single visit and multiple visits
in the training dataset as pre-training
data source.

IBM Research AI
Deep Graph Learning and
Graph-to-Sequence Learning in NLP
Lingfei Wu
IBM Research AI
IBM T. J. Watson Research Center

Joint work with Yu Chen, Mohammed J Zaki, Kun Xu, Zhiguo Wang, Yansong Feng,
Michael Witbrock, and Vadim Sheinin

@KDD 2020 Tutorial


251
Why graphs?
Graphs are a general
language for
describing and
modeling complex
systems

Graph!
252
Graph-structured data are ubiquitous

Internet networks, social networks, networks of transactions, biomedical graphs, scene graphs, program graphs
253
Machine (Deep) Learning with Graphs
Classical ML tasks on graphs:
• Node classification: predict the type of a given node
• Link prediction: predict whether two nodes are linked
• Community detection: identify densely linked clusters of nodes
• Graph matching (similarity): how similar are two (sub)graphs

Recent ML tasks on graphs:
• Graph classification: predict the type of a given graph
• Graph generation: generate graphs from a learned distribution
• Graph structure learning: jointly learn graph structure and graph embeddings
• Graph-to-X learning: graph inputs, X outputs
254
Deep Graph Learning and
Graph-to-Sequence Learning in NLP
Lingfei Wu
IBM Research AI
IBM T. J. Watson Research Center

Joint work with Yu Chen, Mohammed J Zaki, Kun Xu, Zhiguo Wang, Yansong Feng,
Michael Witbrock, and Vadim Sheinin

@KDD 2020 Tutorial


255
Why graphs?
Graphs are a general
language for
describing and
modeling complex
systems

Graph!
256
Graph-structured data are ubiquitous

Internet networks, social networks, networks of transactions, biomedical graphs, scene graphs, program graphs
257
Machine (Deep) Learning with Graphs
Classical ML tasks on graphs:
• Node classification: predict the type of a given node
• Link prediction: predict whether two nodes are linked
• Community detection: identify densely linked clusters of nodes
• Graph matching (similarity): how similar are two (sub)graphs

Recent ML tasks on graphs:
• Graph classification: predict the type of a given graph
• Graph generation: generate graphs from a learned distribution
• Graph structure learning: jointly learn graph structure and graph embeddings
• Graph-to-X learning: graph inputs, X outputs
258
Iterative and Robust Deep
Graph Learning for GNNs

259
Graph Learning: Motivations
• GNNs are powerful; unfortunately, they require graph-structured data to be available.
• Questionable if the given intrinsic
graph-structures are optimal (i.e., noisy,
incomplete) for the downstream tasks.
• Many applications (e.g., NLP tasks) may
only have non-graph structured data or
even just the original feature matrix.

260
Graph Learning: Formulation

261
Existing State-of-the-art Methods
Graph construction from data [Kalofolias, 2016; Kalofolias and Perraudinl, 2017]
• Gaussian kernel or KNN-based
• Directly optimizing the graph adjacency matrix with smoothed graph signals
• Issues: 1) does not consider downstream task; 2) no refinement
Dynamic models of interacting systems [Kipf et al., ICML’18]
• Inferring an explicit interaction structure using a variational graph auto-encoder
• Issues: 1) cannot jointly learn the graph structure and graph representations; 2) transductive
• Jointly optimizing graph structures and GNN parameters [Franceschi, ICML ’19]
• Modeling a joint probability distribution over the edges of a graph with N vertices
• Issues: 1) hard to optimize; 2) Not scalable; 3) cannot handle inductive learning
262
Iterative Deep Graph Learning : System
Overview

Graph Learning and Graph Embedding: A Unified Perspective


• Graph learning as similarity metric learning
• Graph regularization to control smoothness, sparsity, and connectivity
• Iterative method to refine the graph structures and graph embeddings
263
IDGL: Graph Learning as Similarity Metric
Learning
We design a multi-head weighted cosine similarity metric function to learn a
similarity matrix S for all pairs of nodes.

We proceed to extract a symmetric sparse adjacency matrix from the similarity


matrix S by considering only the ɛ-neighborhood for each node.

where is the normalized adjacency matrix of the initial graph (or kNN-graph).
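The multi-head weighted cosine metric and the ε-neighborhood sparsification can be written as (our transcription of the IDGL formulation):

$$ s_{ij}^{(p)} = \cos\big(\mathbf{w}_p \odot \mathbf{v}_i,\; \mathbf{w}_p \odot \mathbf{v}_j\big), \qquad s_{ij} = \frac{1}{m}\sum_{p=1}^{m} s_{ij}^{(p)}, \qquad A_{ij} = \begin{cases} s_{ij} & s_{ij} > \varepsilon \\ 0 & \text{otherwise} \end{cases} $$

The adjacency actually fed to the GNN is then a combination of this learned graph and the normalized initial (or kNN) graph, as stated above.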

264
IDGL: Graph Regularization
We adapt the techniques designed for learning graphs from smooth signals and apply them as regularization for controlling smoothness, connectivity and sparsity

Smoothness

Connectivity & sparsity

Graph regularization loss
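The three regularizers, in the form used for learning graphs from smooth signals (our transcription):

$$ \Omega(A, X) = \frac{\alpha}{n^2}\, \mathrm{tr}\big(X^\top L X\big) \;-\; \frac{\beta}{n}\, \mathbf{1}^\top \log(A \mathbf{1}) \;+\; \frac{\gamma}{n^2}\, \|A\|_F^2 $$

The trace term encourages feature smoothness over the learned graph, the log barrier on the node degrees encourages connectivity, and the Frobenius norm controls sparsity.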

265
IDGL: Iterative Method for Joint Graph
Structure and Representation Learning
Iterative method repeatedly
▪ Learn better adjacency matrix with the
updated node embeddings
▪ Learn better node embeddings with the
refined adjacency matrix
Iterative procedure dynamically stops in a
mini-batch
▪ the learned adjacency matrix converges
with certain threshold
▪ the maximal number of iterations is reached

266
Results (Transductive Setting)

267
Results (Inductive Setting & Runtime)

268
Results (Ablation Study)

269
Results (Robustness to Missing/Adding Edges)

270
Results (Convergence & Dynamic Stopping)

271
Graph-to-sequence Learning in
Natural Language Processing

272
Seq2Seq: Applications and Challenges
Applications Challenges
• Machine translation • Only applied to problems whose
• Natural Language Generation inputs are represented as
• Logic form translation sequences
• Drug Discovery • Cannot handle more complex
structure such as graphs
• Converting graph inputs into
sequences inputs lose information
• Augmenting original sequence
inputs with additional structural
information enhances word
sequence feature
273
Contributions and Highlighted Research
Fundamental contributions in this research:
• Presented Graph2Seq, a generalized seq2seq model for graph inputs
• Attention-based encoder-decoder model for graph-to-sequence learning

• Two highlighted NLP tasks using Graph2Seq model:


• SQL-to-text Generation with Graph2Seq Model
• Question Generation with RL based Graph2Seq Model

274
Graph-to-Sequence Model [1]
Graph Convolutional Neural Network

[1] Kun Xu*, Lingfei Wu*, Zhiguo Wang, Yansong Feng, Michael Witbrock, and Vadim Sheinin (Equally Contributed), "Graph2Seq: Graph to Sequence Learning with Attention-based Neural Networks", arXiv 2018.
[2] Yu Chen, Lingfei Wu** and Mohammed J. Zaki (**Corresponding Author), "Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation", ICLR'20.
275
Bidirectional Node Embedding (Separate)
Bi-Sep Node embedding (take node v as an example)
1. transform each node’s text attribute to a feature vector by looking up the
embedding matrix

2. classify v’s neighbors into forward and backward neighbors, aggregate


neighbors information using a fully connect network followed by a max
pooling operation

3. repeat steps 2 for K times to aggregate neighbors information in K hops


4. concatenate final v’s forward and backward node embeddings as the final
bi-directional representation of node v
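A minimal sketch of the four Bi-Sep steps above; a mean aggregator is used in place of the fully connected network plus max pooling for brevity, and all names are ours:

```python
import torch

def bi_sep_node_embeddings(fwd_neighbors, bwd_neighbors, word_emb, num_hops=3):
    """Bidirectional (separate) node embeddings: aggregate forward and backward
    neighborhoods independently for K hops, then concatenate."""
    h_fwd, h_bwd = word_emb.clone(), word_emb.clone()    # step 1: init from text attributes
    for _ in range(num_hops):                            # step 3: repeat for K hops
        h_fwd = torch.stack([                            # step 2: aggregate forward neighbors
            h_fwd[nbrs].mean(dim=0) if nbrs else h_fwd[v]
            for v, nbrs in enumerate(fwd_neighbors)])
        h_bwd = torch.stack([                            # step 2: aggregate backward neighbors
            h_bwd[nbrs].mean(dim=0) if nbrs else h_bwd[v]
            for v, nbrs in enumerate(bwd_neighbors)])
    return torch.cat([h_fwd, h_bwd], dim=-1)             # step 4: concatenate both directions
```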
276
Bidirectional Node Embedding (Fuse)
Bi-Fuse Node embedding (take node v as an example)
1) Node aggregation

2) Fuse the aggregated node embeddings from both directions

3) Update the node embeddings using fused information

277
Graph Encoding
Graph embedding
• Pooling based graph embedding (max, min and average pooling)
• Node based graph embedding
Add one super node which is connected to all other nodes in the graph
The embedding of this super node is treated as graph embedding

278
Attention Based Sequence Decoding

context node
vector representation

279
Attention Based Sequence Decoding

context node
vector representation
attention alignment
weights model

280
Attention Based Sequence Decoding

context node
vector representation
attention alignment
weights model
Objective Function

281
Experiments: Text Reasoning and Shortest
Path

282
Experiments: Bidirectional Node Embedding

Bidirectional Node Embedding vs. Unidirectional Node Embedding: the bidirectional embedding converges more quickly

283
When Shall We Use Graph2Seq?
Case I: the inputs are naturally or best represented as a graph
e.g., "Ryan's description of himself: a genius."

Case II: a hybrid graph combining a sequence and its hidden structural information
e.g., augmenting "are there ada jobs outside Austin" with its dependency parsing tree
284
SQL-to-text Generation with
Graph2seq Model (EMNLP’18)

285
Natural Language Interface to Database

Need explanation !

What is the meaning ?

286
SQL-to-text Generation Task

287
Previous Approaches
Template-based approaches
Koutrika et al. 2010; Ngonga Ngomo et al. 2013
Time consuming and generates rigid and stylized language!

Deep learning models


Iyer et al. 2016 (Sequence to sequence model)

288
Problem
SQL query is a graph structured query
Naïve sequence encoders may need an elaborate design to fully capture the global
structure information

289
Motivation

290
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause

Where clause

291
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select

Where clause

292
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes

Where clause

293
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes
c) In some cases, there may exist aggregation functions such
as count and max; add aggregation node and connect it
with column node
Where clause

294
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes
c) In some cases, there may exist aggregation functions such
as count and max; add aggregation node and connect it
with column node
Where clause
a) for each condition, we use the same process as for the
Select clause to create nodes

295
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes
c) In some cases, there may exist aggregation functions such
as count and max; add aggregation node and connect it
with column node
Where clause
a) for each condition, we use the same process as for the
Select clause to create nodes

296
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes
c) In some cases, there may exist aggregation functions such
as count and max; add aggregation node and connect it
with column node
Where clause
a) for each condition, we use the same process as for the
Select clause to create nodes

297
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes
c) In some cases, there may exist aggregation functions such
as count and max; add aggregation node and connect it
with column node
Where clause
a) for each condition, we use the same process as for the
Select clause to create nodes

298
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes
c) In some cases, there may exist aggregation functions such
as count and max; add aggregation node and connect it
with column node
Where clause
a) for each condition, we use the same process as for the
Select clause to create nodes
b) add logical operators such as AND, OR and NOT
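A minimal sketch of the construction described above, using networkx; the node naming scheme and the simplified query structure (selected columns, optional aggregations, flat conditions joined by one logical operator) are our assumptions, not the paper's exact representation:

```python
import networkx as nx

def sql_to_graph(columns, aggregations, conditions, logic_op="AND"):
    """Sketch: build a graph for a simple SELECT ... WHERE ... query.
    columns: selected column names; aggregations: {column: function};
    conditions: list of (left, operator, right) tuples; logic_op joins the conditions."""
    g = nx.Graph()
    g.add_node("SELECT")
    for col in columns:                                   # Select clause
        if col in aggregations:                           # optional aggregation node
            g.add_edge("SELECT", aggregations[col])
            g.add_edge(aggregations[col], col)
        else:
            g.add_edge("SELECT", col)
    if conditions:                                        # Where clause
        g.add_edge("SELECT", logic_op)                    # logical operator node (AND/OR/NOT)
        for i, (left, op, right) in enumerate(conditions):
            cond = f"cond{i}:{op}"                        # one node per condition operator
            g.add_edge(logic_op, cond)
            g.add_edge(cond, left)
            g.add_edge(cond, right)
    return g
```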

299
Encoder-Decoder Architecture

Encoding

How to encode this graph structure


?

300
Graph-to-Sequence Model [1]
Graph Convolutional Neural Network

[1] Kun Xu*, Lingfei Wu*, Zhiguo Wang, Yansong Feng, Michael Witbrock, and Vadim Sheinin (Equally Contributed), "Graph2Seq: Graph to Sequence Learning with Attention-based Neural Networks", arXiv 2018.
[2] Yu Chen, Lingfei Wu** and Mohammed J. Zaki (**Corresponding Author), "Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation", ICLR'20.
301
Experiments
Datasets
WikiSQL (61,297 for training, 9,145 for development and 17,284 for test)
Stackoverflow (25,869 for training, 3,234 for development and 3,234 for test)
Baselines
Template
a) first map each element of a SQL query to an utterance
b) then use simple rules to assemble these utterances
Seq2Seq
We implement the model proposed in Bahdanau et al. 2014
Seq2Seq + Copy
We implement the model proposed in Gu et al. 2016
Tree2Seq
We implement the model proposed in Eriguchi et al. 2016

302
Results
Criteria
BLEU-4
Grammar (human evaluation)
Correct (human evaluation)

[Table: results on WikiSQL and Stackoverflow]

303
Examples

Seq2Seq model
Graph2Seq model

Seq2Seq model
Graph2Seq model

304
Graph2Seq + Reinforcement Learning
for Question Generation (ICLR’20)

305
Natural Question Generation: Background
Natural question generation (QG) is
a challenging yet rewarding task,
that aims to generate questions
given an input passage and a target
answer.

Many real applications:


Reading comprehension
Visual and video question answering
Dialog system

306
Natural Question Generation: Definition
Input:
A text passage:
A target answer:

• Output:
The best natural language question:
which maximizes the conditional likelihood:

307
Existing State-of-the-art Methods
Template-based approaches
Mostow & Chen, 2009; Heilman & Smith, 2010; Heilman, 2011
Rely on heuristic rules or hand-crafted templates
low generalizability and scalability

Seq2Seq-based approaches
Du et al., 2017; Zhou et al., 2017; Song et al., 2018a; Kumar et al., 2018a
Fail to utilize the rich text structure information beyond the simple word
sequence
Rely on cross-entropy based sequence training which has several limitations
Fail to effectively utilize the answer information

308
Known Issues of Existing Approaches
Issue I: fail to consider global interactions between answer and context
Solution I: deep alignment network to align answer and context

Issue II: fail to consider rich hidden structure information of the word sequence
Solution II: novel Graph2Seq model for capturing hidden structure information in the sequence

Issue III: limitations of cross-entropy based objectives, like exposure bias and inconsistency between train/test measurement
Solution III: novel reinforcement learning loss for enforcing syntactic and semantic coherence of the generated text

309
RL-based Graph2Seq for QG: System
Overview

310
Deep Answer Alignment
A deep alignment network for incorporating the answer information into passages at
both the word level and the contextualized hidden state level

denotes passage
denotes answer

311
Graph Construction: Static VS Dynamic
Syntax-based static graph
A directed and unweighted passage
graph based on dependency parsing

Semantics-aware dynamic graph


Dynamically build a directed and
weighted graph to model semantic
relationships among passage words
is the passage representation

312
Encoder-Decoder Architecture

Encoding

How to encode this graph structure


?

313
Graph-to-Sequence Model [1]
Graph Convolutional Neural Network

[1] Kun Xu*, Lingfei Wu*, Zhiguo Wang, Yansong Feng, Michael Witbrock, and Vadim Sheinin (Equally Contributed), "Graph2Seq: Graph to Sequence Learning with Attention-based Neural Networks", arXiv 2018.
[2] Yu Chen, Lingfei Wu** and Mohammed J. Zaki (**Corresponding Author), "Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation", ICLR'20.
314
Hybrid Evaluator
Regular cross-entropy based training objectives have limitations
Exposure bias
Evaluation discrepancy between training and testing
We apply a mixed objective function combining both the
cross-entropy loss and RL loss
Ensure the generation of syntactically and semantically valid text

Two-stage training strategy:


Train the model with cross-entropy loss
Finetune the model by optimizing the mixed objective function
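The mixed objective combines the cross-entropy loss with a self-critical RL loss (our transcription of the standard form used in the cited ICLR 2020 paper):

$$ \mathcal{L} = \gamma\, \mathcal{L}_{\mathrm{RL}} + (1 - \gamma)\, \mathcal{L}_{\mathrm{CE}}, \qquad \mathcal{L}_{\mathrm{RL}} = \big(r(\hat{Y}^{b}) - r(\hat{Y}^{s})\big) \sum_t \log p\big(\hat{y}^{s}_t \mid \hat{y}^{s}_{<t}, X\big) $$

where \hat{Y}^{s} is a sampled output, \hat{Y}^{b} a baseline (e.g., greedy) output, and r(\cdot) an evaluation-metric reward.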
315
Automatic Evaluation Results

316
Human Evaluation and Ablation Study
Results

317
Graph4NLP: A Library for Deep
Learning on Graphs for NLP

318
Architecture of Graph4NLP Library
Graph Construction: topology construction, embedding
Node Encoding: SAGE, GCN, GAT, GGNN, RGCN
Decoding / Prediction: classification, generation, KB completion
Evaluation: metrics, loss

Data Flow of Graph4NLP
Raw Data → Graph Construction → Featured Structured Data (Graph4NLP.GraphData) → GNN Embedding Methods → Encoded Structured Data (Graph4NLP.GraphData) → Prediction (User Model) → Prediction Results → Evaluation / Loss
Take-home Messages from This Talk
• Deep Learning on Graphs is a fast-growing area today.
• Since graphs can naturally encode complex information, they can bridge the gap between empirical domain knowledge and the power of deep learning.
• However, the input graph can be noisy, incomplete, or even unavailable; we presented the IDGL model, which jointly learns the graph structure and graph embeddings.
• We also presented the Graph2Seq model, a generalized Seq2Seq model for graph inputs, and demonstrated its advantages in two useful scenarios:
✔ Graph data (natural or best expressed)
✔ Augmented sequence with hidden structure information

321
What’s The Next?
• Welcome all to attend The Second International Workshop on Deep Learning on Graphs: Methods and Applications (DLG-KDD'20), which will be held jointly with the KDD MLG workshop on August 24th, 2020

• Release of our Graph4NLP library: the first library


for researchers and practitioners for easy use of
Graph Neural Networks for various NLP tasks!
