
Part I:

Foundations and Applications of


Graph Neural Networks
Yao Ma and Yiqi Wang, Michigan State University
Tyler Derr, Vanderbilt University
Lingfei Wu and Tengfei Ma, IBM Research

Tutorial website: https://ai.tencent.com/ailab/ml/KDD-Deep-Graph-Learning.html

1
Book: Deep Learning on Graphs

https://cse.msu.edu/~mayao4/dlg_book/
2
Part I Overview

Foundations: Basic Graph Theory, Spectral Graph Theory, Graph Fourier Analysis
Models: Filtering Layers in GNN, Pooling Layers in GNN, Graph Structure Learning, Robustness of GNN, Self-supervised Learning for GNN, Scalable Learning for GNN
Applications: Healthcare, Natural Language Processing

3
Part I Overview

Foundations: Basic Graph Theory, Spectral Graph Theory, Graph Fourier Analysis
Models: Filtering Layers in GNN, Pooling Layers in GNN, Graph Structure Learning, Robustness of GNN, Self-supervised Learning for GNN, Scalable Learning for GNN
Applications: Healthcare, Natural Language Processing

4
Graphs and Graph Signals

5
Graphs and Graph Signals

Graph Signal:

6
Graphs and Graph Signals

Graph Signal:

7
Graphs and Graph Signals

Graph Signal:
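The graph signal itself is shown only as an image in the slides; the standard definition, which the figure presumably illustrates, is a function on the nodes collected into a vector:

$$ f : \mathcal{V} \rightarrow \mathbb{R}, \qquad \mathbf{f} = [f(v_1), f(v_2), \dots, f(v_N)]^\top \in \mathbb{R}^N $$

so each node carries a scalar value (or, more generally, a feature vector).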

8
Matrix Representations of Graphs

9
Spectral graph theory. American Mathematical Soc.; 1997.
Matrix Representations of Graphs

Adjacency Matrix

10
Spectral graph theory. American Mathematical Soc.; 1997.
Matrix Representations of Graphs

Degree Matrix:

Degree Matrix Adjacency Matrix

11
Spectral graph theory. American Mathematical Soc.; 1997.
Matrix Representations of Graphs

Degree Matrix:

Degree Matrix Adjacency Matrix Laplacian Matrix
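The matrices appear only as figures; for an undirected graph with N nodes, the standard definitions (following the cited Spectral Graph Theory reference) are:

$$ A_{ij} = \begin{cases} 1 & (v_i, v_j) \in \mathcal{E} \\ 0 & \text{otherwise} \end{cases}, \qquad D = \mathrm{diag}(d_1, \dots, d_N), \;\; d_i = \sum_j A_{ij}, \qquad L = D - A $$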

12
Spectral graph theory. American Mathematical Soc.; 1997.
Laplacian Matrix as an Operator
Laplacian matrix is a difference operator:

13
Laplacian Matrix as an Operator
Laplacian matrix is a difference operator:

14
Laplacian Matrix as an Operator
Laplacian matrix is a difference operator:

Laplacian quadratic form:

15
Laplacian Matrix as an Operator
Laplacian matrix is a difference operator:

Laplacian quadratic form:

16
Laplacian Matrix as an Operator
Laplacian matrix is a difference operator:

Low frequency graph signal

Laplacian quadratic form:

17
Laplacian Matrix as an Operator
Laplacian matrix is a difference operator:

Low frequency graph signal

Laplacian quadratic form:

High frequency graph signal
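The elided equations are the standard ones; for a graph signal f,

$$ (L\mathbf{f})(i) = \sum_{j} A_{ij}\,\big(f(i) - f(j)\big), \qquad \mathbf{f}^\top L\, \mathbf{f} = \tfrac{1}{2}\sum_{i,j} A_{ij}\,\big(f(i) - f(j)\big)^2 $$

A small quadratic form (the signal changes little across edges) corresponds to a low-frequency graph signal; a large value corresponds to a high-frequency one.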


18
Eigen-decomposition of Laplacian Matrix
Laplacian matrix has a complete set of orthonormal eigenvectors:

19
Eigen-decomposition of Laplacian Matrix
Laplacian matrix has a complete set of orthonormal eigenvectors:

20
Eigen-decomposition of Laplacian Matrix
Laplacian matrix has a complete set of orthonormal eigenvectors:

Eigenvalues are sorted non-decreasingly:
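Written out (a standard result for the symmetric, positive semi-definite Laplacian):

$$ L = U \Lambda U^\top, \qquad U = [\mathbf{u}_1, \dots, \mathbf{u}_N], \;\; U^\top U = I, \qquad 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_N $$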

21
Eigenvectors as Graph Signals

22
Eigenvectors as Graph Signals
The frequency of an eigenvector of Laplacian matrix is its
corresponding eigenvalue:

23
Eigenvectors as Graph Signals
The frequency of an eigenvector of Laplacian matrix is its
corresponding eigenvalue:

Low frequency High frequency

24
Graph Fourier Transform (GFT)

25
Graph Fourier Transform (GFT)

26
Graph Fourier Transform (GFT)

27
The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine
Inverse Graph Fourier Transform (IGFT)
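The two transforms, as defined in the cited signal-processing-on-graphs literature, are:

$$ \hat{\mathbf{f}} = U^\top \mathbf{f} \;\; \text{(GFT)}, \qquad \mathbf{f} = U \hat{\mathbf{f}} \;\; \text{(IGFT)} $$

where the i-th Fourier coefficient \hat{f}_i = \mathbf{u}_i^\top \mathbf{f} is associated with frequency \lambda_i.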

28
Part I Overview

Foundations: Basic Graph Theory, Spectral Graph Theory, Graph Fourier Analysis
Models: Filtering Layers in GNN, Pooling Layers in GNN, Graph Structure Learning, Robustness of GNN, Self-supervised Learning for GNN, Scalable Learning for GNN
Applications: Healthcare, Natural Language Processing

29
Tasks on Graph-Structured Data
Node-level: Node Classification, Link Prediction
Graph-level: Graph Classification


30
Tasks on Graph-Structured Data
Node-level Graph-level

31
Tasks on Graph-Structured Data
Node-level Graph-level

Node Representations

32
Tasks on Graph-Structured Data
Node-level Graph-level

Node Representations Graph Representation

33
Tasks on Graph-Structured Data
Node-level Graph-level

Filtering Pooling

Node Representations Graph Representations

34
Two Main Operations in GNN
Graph Filtering

Graph Filtering

35
Two Main Operations in GNN
Graph Filtering

Graph Filtering

36
Two Main Operations in GNN
Graph Filtering

Graph Filtering

Graph filtering refines the node features


37
Two Main Operations in GNN
Graph Pooling

Graph Pooling

38
Two Main Operations in GNN
Graph Pooling

Graph Pooling

39
Two Main Operations in GNN
Graph Pooling

Graph Pooling

Graph pooling generates a smaller graph


40
General GNN Framework
For node-level tasks

Filtering Layer Activation

41
General GNN Framework
For graph-level tasks

Filtering Layer Activation Pooling Layer

… … …

42
Graph Filtering Operation

Graph Filtering

43
Two Types of Graph Filtering Operation
Spatial Based Filtering: Original GNN (Scarselli et al. 2005), GCN (Kipf & Welling, ICLR 2017), GraphSage (Hamilton et al., NIPS 2017), GAT (Veličković et al., ICLR 2018), MPNN (Gilmer et al., ICML 2017), ...
Spectral Based Filtering: Spectral Graph CNN (Bruna et al., ICLR 2014), ChebNet (Defferrard et al., NIPS 2016), GCN (Kipf & Welling, ICLR 2017), ...

44
Graph Filtering in the First GNN Paper

Graph neural networks for ranking web pages. WI. IEEE, 2005.
45
Graph Filtering in the First GNN Paper

46
Graph Spectral Filtering for Graph Signal
Recall:

47
Graph Spectral Filtering for Graph Signal
Recall:

Decompose
Coefficients

48
Graph Spectral Filtering for Graph Signal
Recall:

Decompose Filter
Coefficients Filtered coefficients

49
Graph Spectral Filtering for Graph Signal
Recall:

Decompose Filter
Coefficients Filtered coefficients

Example:

50
Graph Spectral Filtering for Graph Signal
Recall:

Decompose Filter Reconstruct


Coefficients Filtered coefficients

Example:

51
Graph Spectral Filtering for Graph Signal
Recall:

Decompose Filter Reconstruct


Coefficients Filtered coefficients

Example:

52
Graph Spectral Filtering for Graph Signal
Recall:

Decompose Filter Reconstruct


Coefficients Filtered coefficients

Example:

53
Graph Spectral Filtering for Graph Signal
Recall:

Decompose Filter Reconstruct


Coefficients Filtered coefficients

Filtering
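Putting the decompose, filter, and reconstruct steps into one equation (standard graph spectral filtering; the specific example filter in the slides is not recoverable from the text):

$$ \mathbf{f}' = U\, g(\Lambda)\, U^\top \mathbf{f}, \qquad g(\Lambda) = \mathrm{diag}\big(g(\lambda_1), \dots, g(\lambda_N)\big) $$

The GFT U^\top f produces the coefficients, g(\lambda_i) rescales each coefficient, and multiplying by U reconstructs the filtered signal.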

54
Graph Spectral Filtering for GNN
How to design the filter?

55
Graph Spectral Filtering for GNN
How to design the filter?

56
Graph Spectral Filtering for GNN
How to design the filter?

How to deal with multi-channel signals?

57
Graph Spectral Filtering for GNN
How to design the filter?

How to deal with multi-channel signals?

Each input channel contributes to each output channel

58
Graph Spectral Filtering for GNN
How to design the filter?

How to deal with multi-channel signals?

Each input channel contributes to each output channel

Filter each input channel
59


Graph Spectral Filtering for GNN
How to design the filter?

How to deal with multi-channel signals?

Each input channel contributes to each output channel

Filter each input channel
60


Spectral Networks and Locally Connected Networks on Graphs. ICLR 2014.
Expensive eigen-decomposition
64

Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. NIPS 2016.
No eigen-decomposition needed
68
Polynomial Parametrized Filter: a Spatial
View

69
Polynomial Parametrized Filter: a Spatial
View

70
Polynomial Parametrized Filter: a Spatial
View

71
Chebyshev Polynomials

72
Chebyshev Polynomials

Unstable under perturbation of coefficients

73
Chebyshev Polynomials

Unstable under perturbation of coefficients


Chebyshev polynomials:

74
Chebyshev Polynomials

Unstable under perturbation of coefficients


Chebyshev polynomials:

75
Chebyshev Polynomials

Unstable under perturbation of coefficients


Chebyshev polynomials:

76
Chebyshev Polynomials

Unstable under perturbation of coefficients


Chebyshev polynomials:
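For reference, the Chebyshev polynomials are defined by the recurrence

$$ T_0(x) = 1, \qquad T_1(x) = x, \qquad T_k(x) = 2x\, T_{k-1}(x) - T_{k-2}(x) $$

on the interval [-1, 1], which is why the Laplacian eigenvalues are rescaled to \tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I before they are used.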

77
ChebNet

Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. NIPS 2016.
78
ChebNet

79
ChebNet

No eigen-decomposition needed

80
ChebNet

No eigen-decomposition needed
Stable under perturbation of coefficients
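The ChebNet filter (from the cited NIPS 2016 paper) truncates the Chebyshev expansion at order K:

$$ g_\theta(L) \approx \sum_{k=0}^{K} \theta_k\, T_k(\tilde{L}), \qquad \tilde{L} = \frac{2}{\lambda_{\max}} L - I $$

so the filtered signal can be computed with K sparse matrix-vector products and no eigen-decomposition.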
81
GCN: Simplified ChebNet

Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017.
82
GCN: Simplified ChebNet

83
GCN: Simplified ChebNet

Apply a renormalization trick
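The renormalization trick, as in the cited ICLR 2017 paper: take the first-order Chebyshev approximation with a single parameter and replace I + D^{-1/2} A D^{-1/2} by

$$ \hat{A} = \tilde{D}^{-1/2} \tilde{A}\, \tilde{D}^{-1/2}, \qquad \tilde{A} = A + I, \qquad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} $$

which keeps the spectrum in a well-behaved range and avoids numerical instabilities when many layers are stacked.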

84
GCN for Multi-channel Signal
Recall:

Filter each input channel

85
GCN for Multi-channel Signal
Recall:

Filter each input channel


For GCN:

GCN filter

86
GCN for Multi-channel Signal
Recall:

Filter each input channel


For GCN:

GCN filter
In matrix form:

87
A Spatial View of GCN Filter

88
A Spatial View of GCN Filter

89
A Spatial View of GCN Filter

Observe that:

90
A Spatial View of GCN Filter

Observe that:

Hence,

91
A Spatial View of GCN Filter

Observe that:

Hence,

Feature transformation
92
A Spatial View of GCN Filter

Observe that:

Hence,

Feature transformation
Aggregation
93
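A minimal sketch of the GCN filter just described (feature transformation followed by aggregation over neighbors and self), written in plain PyTorch; the class and variable names are ours, not from the slides:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H' = act(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, a_hat, h):
        # a_hat: renormalized adjacency D~^{-1/2} (A + I) D~^{-1/2}, shape (N, N)
        # h: node features, shape (N, in_dim)
        h = self.linear(h)              # feature transformation
        return torch.relu(a_hat @ h)    # aggregation over neighbors and self

def normalize_adjacency(adj):
    """Renormalization trick: A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    a_tilde = adj + torch.eye(adj.shape[0])
    d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
```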
Filter in GCN VS Filter in the First GNN
GCN: k-th layer

The first GNN: k-th layer

94
Filter in GCN VS Filter in the First GNN
GCN: k-th layer

The first GNN: k-th layer

95
Filter in GCN VS Filter in the First GNN
GCN: k-th layer

The first GNN: k-th layer

96
Filter in GraphSage
Neighbor Sampling

Inductive Representation Learning on Large Graphs. NIPS 2017.


97
Filter in GraphSage
Neighbor Sampling

Aggregation
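A minimal sketch of neighbor sampling followed by mean aggregation, in the spirit of GraphSage; the fixed sample size, the mean aggregator, and all names are illustrative assumptions:

```python
import random
import torch

def sample_neighbors(adj_list, node, num_samples):
    """Uniformly sample a fixed number of neighbors (with replacement if too few)."""
    neigh = adj_list[node]
    if len(neigh) >= num_samples:
        return random.sample(neigh, num_samples)
    return [random.choice(neigh) for _ in range(num_samples)]

def graphsage_mean_layer(adj_list, features, weight_self, weight_neigh, num_samples=10):
    """One GraphSAGE-style layer with a mean aggregator over sampled neighbors."""
    out = []
    for v in range(features.shape[0]):
        neigh = sample_neighbors(adj_list, v, num_samples)
        neigh_mean = features[neigh].mean(dim=0)            # aggregate sampled neighbors
        h = features[v] @ weight_self + neigh_mean @ weight_neigh
        out.append(torch.relu(h))
    return torch.stack(out)
```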

98
Filter in GAT

Graph Attention Networks. ICLR 2018.
99
Filter in GAT
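The attention coefficients in GAT, from the cited ICLR 2018 paper:

$$ e_{ij} = \mathrm{LeakyReLU}\big(\mathbf{a}^\top [\, W\mathbf{h}_i \,\|\, W\mathbf{h}_j \,]\big), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}, \qquad \mathbf{h}_i' = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W \mathbf{h}_j\Big) $$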

100
Filter in MPNN
Message Passing

Feature Updating

Neural Message Passing for Quantum Chemistry. ICML 2017.
101
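In the MPNN framework, message passing and feature updating at layer k take the general form

$$ \mathbf{m}_v^{(k)} = \sum_{u \in \mathcal{N}(v)} M_k\big(\mathbf{h}_v^{(k-1)}, \mathbf{h}_u^{(k-1)}, \mathbf{e}_{uv}\big), \qquad \mathbf{h}_v^{(k)} = U_k\big(\mathbf{h}_v^{(k-1)}, \mathbf{m}_v^{(k)}\big) $$

with learnable message and update functions M_k and U_k.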
Graph Pooling Operation

Graph Pooling

102
gPool
Downsample by selecting the most important nodes

Graph U-Nets. ICML 2019.
103
gPool
Downsample by selecting the most important nodes
Importance Measure

104
gPool
Downsample by selecting the most important nodes
Importance Measure

105
gPool
Downsample by selecting the most important nodes
Importance Measure

106
gPool
Downsample by selecting the most important nodes
Importance Measure
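The importance measure and node selection in gPool (our transcription of the Graph U-Nets formulation; the gating nonlinearity is written generically as \sigma):

$$ \mathbf{y} = \frac{X\mathbf{p}}{\|\mathbf{p}\|}, \qquad \mathrm{idx} = \mathrm{top}\text{-}k(\mathbf{y}), \qquad X' = \big(X \odot \sigma(\mathbf{y})\big)_{\mathrm{idx}}, \qquad A' = A_{\mathrm{idx},\,\mathrm{idx}} $$

where p is a learnable projection vector: nodes are scored by projecting their features onto p, and only the top-k nodes (with gated features) are kept.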

107
DiffPool
Downsample by clustering the nodes using GNN

Hierarchical Graph Representation Learning with Differentiable Pooling. NeurIPS 2018.
108
DiffPool
Downsample by clustering the nodes using GNN
2 filters

Filter1:
Generate a soft-assign matrix

109
DiffPool
Downsample by clustering the nodes using GNN
2 filters

Filter1:
Generate a soft-assign matrix

Filter2:
Generate new features

110
DiffPool
Downsample by clustering the nodes using GNN

Generated soft-assign matrix

Generated new features

111
DiffPool
Downsample by clustering the nodes using GNN

Generated soft-assign matrix

Generated new features

112
DiffPool
Downsample by clustering the nodes using GNN

Generated soft-assign matrix

Generated new features
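The two filters and the pooled graph, from the cited DiffPool paper:

$$ S = \mathrm{softmax}\big(\mathrm{GNN}_{\mathrm{pool}}(A, X)\big), \qquad Z = \mathrm{GNN}_{\mathrm{embed}}(A, X), \qquad X' = S^\top Z, \qquad A' = S^\top A\, S $$

where the soft-assignment matrix S maps the original nodes to a smaller set of clusters, which become the nodes of the coarsened graph.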

113
Eigenpooling

Graph Convolutional Networks with EigenPooling. KDD 2019.
114
Eigenpooling

115
Eigenpooling

116
Eigenpooling

Capture both feature


and graph structure

117
Going Back to Graph Spectral Theory
Recall:

118
Going Back to Graph Spectral Theory
Do we need all the coefficients to reconstruct a “good” signal?

119
Going Back to Graph Spectral Theory
Do we need all the coefficients to reconstruct a “good” signal?

120
Going Back to Graph Spectral Theory
Do we need all the coefficients to reconstruct a “good” signal?

121
Eigenpooling: Truncated Fourier Coefficients
Eigenvectors (Fourier Modes) of the subgraph

122
Eigenpooling: Truncated Fourier Coefficients
Eigenvectors (Fourier Modes) of the subgraph

GFT
Fourier coefficients

123
Eigenpooling: Truncated Fourier Coefficients
Eigenvectors (Fourier Modes) of the subgraph

GFT
Fourier coefficients

Truncated Fourier
coefficients

124
Eigenpooling: Truncated Fourier Coefficients
Eigenvectors (Fourier Modes) of the subgraph

GFT
Fourier coefficients

Truncated Fourier
coefficients
New features for the subgraph (a node in the smaller graph)
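Schematically, for each subgraph EigenPooling takes the GFT of the subgraph's node features with respect to the subgraph Laplacian and keeps only the leading (low-frequency) coefficients:

$$ \hat{X} = U_{\mathrm{sub}}^\top X_{\mathrm{sub}}, \qquad X_{\mathrm{pool}} = [\hat{x}_1, \dots, \hat{x}_d], \quad d \ll N_{\mathrm{sub}} $$

The truncated coefficients summarize both the features and the structure of the subgraph, and they become the feature vector of the corresponding node in the smaller graph.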
125
Robustness of GNN

126
Adversarial Attacks on Deep Learning
Do Graph Neural Networks
Suffer the Same Problem?
Adversarial Attacks on GNN
[Figure: a clean example graph and an adversarially perturbed graph; after the perturbation, the GNN's prediction for node 8 changes.]

129
Consequences
Financial Systems
• Credit Card Fraud Detection
Recommender Systems
• Social Recommendation
• Product Recommendation
• ...

130
Image vs Graph

Discreteness
Perturbation Measure
Perturbation Type


131
Perturbation Type
• Adding an edge
• Deleting an edge
• Rewiring
• Modifying Features
• Node Injection


132
Evasion & Poisoning Attack
Evasion Attack Poisoning Attack

Evasion Attack: the GNN is trained first (① Train); the graph is perturbed afterwards, at test time.
Poisoning Attack: the graph is perturbed first (① Perturb), then the GNN is trained on the poisoned graph (② Train).

133
Targeted & Non-Targeted
Targeted Attack Non-Targeted Attack

Targeted Attack: degrade the prediction for a specific target node (e.g., node 8).
Non-Targeted Attack: degrade the overall performance of the model.

134
Attack Methods
Attack Methods | Injecting Node | Adding/Deleting Edge | Rewiring | Modifying Features | Evasion | Poisoning | Targeted | Non-Targeted
Grad-Argmax: ✔ ✔ ✔ ✔ ✔ ✔ ✔
Nettack: ✔ ✔ ✔ ✔ ✔
GF-Attack: ✔ ✔ ✔
ReWatt: ✔ ✔ ✔
RL-S2V: ✔ ✔ ✔
Meta-Attack: ✔ ✔ ✔
NIPA: ✔ ✔ ✔

135
Attack Methods
Attack Methods | Injecting Node | Adding/Deleting Edge | Rewiring | Modifying Features | Evasion | Poisoning | Targeted | Non-Targeted
Grad-Argmax: ✔ ✔ ✔ ✔ ✔ ✔ ✔

136
GradArgmax

Adversarial Attack on Graph Structured Data. ICML 2018.
137


GradArgmax

138
GradArgmax

139
GradArgmax

140
GradArgmax

141
Attack Methods
Attack Methods | Injecting Node | Adding/Deleting Edge | Rewiring | Modifying Features | Evasion | Poisoning | Targeted | Non-Targeted
Grad-Argmax: ✔ ✔ ✔ ✔ ✔ ✔ ✔
Nettack: ✔ ✔ ✔ ✔ ✔

142
Nettack

143
Adversarial Attacks on Neural Networks for Graph Data. KDD 2018.
Nettack
Idea 1: Train a surrogate model

A two-layer linearized GCN trained on the original graph
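The surrogate is the two-layer GCN with the nonlinearity removed, as in the cited KDD 2018 paper:

$$ Z = \mathrm{softmax}\big(\hat{A}\, \hat{A}\, X\, W^{(1)} W^{(2)}\big) = \mathrm{softmax}\big(\hat{A}^2 X W\big) $$

Candidate perturbations are then scored by how much they change the surrogate's prediction margin for the target node.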

144
Nettack
Edge Perturbations
Candidates

Feature
Perturbations
Candidates

Degree Distribution
Feature Co-occurrence

145
Nettack
Edge Perturbations
Candidates

Feature
Perturbations
Candidates

Attack
Target GCN Models Wrong
Prediction

146
Attack Methods
Attack Methods | Injecting Node | Adding/Deleting Edge | Rewiring | Modifying Features | Evasion | Poisoning | Targeted | Non-Targeted
Grad-Argmax: ✔ ✔ ✔ ✔ ✔ ✔ ✔
Nettack: ✔ ✔ ✔ ✔ ✔
GF-Attack: ✔ ✔ ✔

147
GF-Attack
Motivation:
• The output embeddings of graph embedding models have been shown to have a very low-rank property.
• Goal: damage the quality of the output embedding Z

• Formulation:
• A graph embedding model can be viewed as producing new graph signals by applying a graph filter ℋ together with a feature transformation:

148
A Restricted Black-box Adversarial Framework Towards Attacking Graph Embedding Models. AAAI20.
GF-Attack

149
Attack Methods
Attack Methods | Injecting Node | Adding/Deleting Edge | Rewiring | Modifying Features | Evasion | Poisoning | Targeted | Non-Targeted
Grad-Argmax: ✔ ✔ ✔ ✔ ✔ ✔ ✔
Nettack: ✔ ✔ ✔ ✔ ✔
GF-Attack: ✔ ✔ ✔
ReWatt: ✔ ✔ ✔

150
ReWatt
Motivation
Degree distribution may not be an ideal measure for perturbations.
How to make perturbations more unnoticeable? → Rewiring

151
Attacking Graph Convolutional Networks via Rewiring. arXiv 2019.
ReWatt
Rewiring Advantages
• Number of nodes and edges
remain the same
• Affects algebraic connectivity
in a smaller way
• Affects effective graph resistance
in a smaller way

152
ReWatt
Reinforcement Learning
Black-box classifier

Attacker
Policy Network

GCN → Node Embeddings, Edge Embeddings

153
Defending Against Attacks
Adversarial Training
Graph Purifying
Attention Mechanism

154
Adversarial Training
Motivation

Augment the training set with


adversarial data

155
Latent Adversarial Training of Graph Convolution Networks. ICML 2019 workshop.
Adversarial Training
Obstacles
•A is discrete
•X is often discrete

156
Graph Purifying - Preprocessing
Main Idea
• Purify the poisoned graph
• Train GNN on the purified graph

[Figure: ① Preprocess the poisoned graph to purify it, ② Train the GNN on the purified graph → Trained GNN]
157


Graph Purifying - Preprocessing
Observations
• Attackers favor adding edges over removing edges

Attackers tend to connect dissimilar nodes!


158
Adversarial Examples on Graph Data: Deep Insights into Attack and Defense. IJCAI 2019.
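A minimal sketch of the preprocessing defense from the cited IJCAI 2019 paper: drop edges whose endpoints have (near-)zero feature similarity. Binary node features, Jaccard similarity, and the threshold value are assumptions for illustration:

```python
import numpy as np

def jaccard_similarity(x_i, x_j):
    """Jaccard similarity between two binary feature vectors."""
    intersection = np.logical_and(x_i, x_j).sum()
    union = np.logical_or(x_i, x_j).sum()
    return intersection / union if union > 0 else 0.0

def purify_graph(adj, features, threshold=0.01):
    """Drop edges whose endpoints have near-zero feature similarity."""
    purified = adj.copy()
    rows, cols = np.nonzero(np.triu(adj, k=1))   # iterate over existing edges once
    for i, j in zip(rows, cols):
        if jaccard_similarity(features[i], features[j]) < threshold:
            purified[i, j] = purified[j, i] = 0  # remove the suspicious edge
    return purified
```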
Graph Purifying – Graph Learning: Pro-GNN
Graph Learning and GNN training
[Figure: poisoned graph → Graph Learning → clean graph; GNN Learning → Trained GNN]
159
Graph Structure Learning for Robust Graph Neural Networks. KDD 2020.
Pro-GNN: Defend Against Adversarial Attacks
Graph Properties
Low-rank
Sparsity
Feature smoothness

160
Pro-GNN: Defend Against Adversarial Attacks
Graph Properties
Low-rank
Sparsity
Feature smoothness

Table Credit: Adversarial Attacks and Defenses on Graphs: A Review and Empirical Study
161
Pro-GNN: Defend Against Adversarial Attacks
Graph Properties
Low-rank
Sparsity
Feature smoothness
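Schematically, Pro-GNN encodes the three properties as regularizers on a learned adjacency matrix S and optimizes them jointly with the GNN parameters (our transcription of the objective in the cited KDD 2020 paper):

$$ \min_{S,\,\theta}\;\; \|A - S\|_F^2 \;+\; \alpha \|S\|_1 \;+\; \beta \|S\|_* \;+\; \lambda\, \mathrm{tr}\big(X^\top \hat{L}_S X\big) \;+\; \mathcal{L}_{\mathrm{GNN}}(S, X, \theta) $$

where the \ell_1 norm promotes sparsity, the nuclear norm promotes low rank, and the trace term promotes feature smoothness over the learned graph.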

162
Pro-GNN: Framework

163
Attention Mechanism
Motivation
Reduce impact of adversarial edges
-- give lower attention score to adversarial edges

Thicker arrows indicate higher attention coefficients
164


RGCN
Motivation
Attacked nodes may have high uncertainty
Give lower attention score to reduce their impact

165
Robust Graph Convolutional Networks Against Adversarial Attacks. KDD 2019.
RGCN
Embed nodes as Gaussian
distributions to capture
uncertainty

166
RGCN
Attention Mechanism
Attacked nodes do have higher variance!

167
PA-GNN
Motivation
• Only relying on the perturbed graph to learn attention coefficients is not enough.
• We should exploit information from clean graphs from similar domains (e.g., Facebook & Twitter, Yelp & Foursquare).
Then use Transfer Learning / Meta Learning!

168
Robust Graph Neural Network Against Poisoning Attacks via Transfer Learning. WSDM 2020.
PA-GNN

169
Self-Supervised Learning for
Graph Neural Networks

170
Self-Supervised Learning
Relative position
pretext task
Doersch et al., 2015

Jigsaw puzzle
pretext task
Noroozi and Favaro, 2016
171
Graph-Structured Data

172
Traditional Deep Learning on Graphs
Traditional DL is designed for simple grids or sequences
• CNNs for fixed-size images/grids
• RNNs for text/sequences

But nodes on graphs have different connections:
• Arbitrary neighbor size
• Complex topological structure
• No fixed node ordering

Graph Neural Networks
[Figure: graph convolutions → activation function → node-level and graph-level representations]
173
Early Unsupervised GNN

•Objective is to
reconstruct
masked edges

•Could be used as a
pre-training step
for another task
(e.g., node
classification)

Variational Graph Auto-Encoders. NIPS Workshop on Bayesian Deep Learning, 2016.
174
Harnessing Unlabeled Nodes
in Node Classification

GNNs are inherently semi-supervised as unlabeled


nodes can also be utilized during feature aggregation

Can we harness the benefits of SSL to more fully


utilize unlabeled nodes?
[Figure legend: labeled node, unlabeled node, aggregation region around labeled node]
175
Problem Statement

Unlabeled node

176
Applying SSL to Graphs
Similarities to Image and Text Domains
• Nodes have features like images or text
🡪 Pretext tasks using attribute information
• Topological structure associated with unlabeled samples
🡪 Pretext tasks using structural information
Fundamental Differences Found in Graph
Domain
• Nodes are connected and dependent
🡪 Pretext tasks using node pairs or even sets

• Unlabeled nodes have structural relations to labeled nodes


🡪 Pretext tasks using label information 177
Main Strategies to Merge SSL Tasks with GNN
• Joint Training (see the sketch below)
• Two-stage Training
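A minimal sketch of the joint-training strategy: the downstream task loss and a self-supervised pretext loss share the same GNN encoder and are optimized together. The weighting lam and all function names are ours, not from any specific paper:

```python
import torch

def joint_training_step(gnn, task_head, ssl_head, graph, features,
                        labels, train_mask, ssl_targets, optimizer, lam=0.5):
    """One optimization step of joint (multi-task) training: L = L_task + lam * L_ssl."""
    optimizer.zero_grad()
    h = gnn(graph, features)                               # shared node embeddings
    task_loss = torch.nn.functional.cross_entropy(
        task_head(h)[train_mask], labels[train_mask])      # supervised loss on labeled nodes
    ssl_loss = ssl_head.loss(h, ssl_targets)               # pretext loss on all nodes
    loss = task_loss + lam * ssl_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```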

178
Multi-Stage Self-Training GNNs

Multi-Stage Self-Supervised Learning for Graph Convolutional Networks on Graphs with Few Labels. AAAI 2020.
179
DeepCluster and M3S Training
For each stage:
• Run DeepCluster (K-means on the node embeddings)
• Align cluster centers to labeled data class centers
• Sort remaining unlabeled nodes based on confidence of prediction
• For each class j: find the top samples; if a sample's pseudo label matches the aligned cluster label, add it to the training set
• Train for a fixed number of epochs

180
Case Study Results with M3S

General Insights:
• The less training data the larger the improvement over GCN
• Self-training can typically provide improvements
• Using MultiStage typically is better than single Self-training
• DeepCluster based self-checking provides a benefit

181
Contrastive Learning for Graphs via
Augmentations

Contrastive Multi-View Representation Learning on Graphs. AAAI 2020.
182


Insights on Contrastive Learning for Graphs
Augmentation types:
• Feature space: masking or adding Gaussian noise
• Structure space: adding/removing connections, sub-sampling, global view with diffusion matrix

They use the diffusion matrix (e.g., Personalized PageRank) along with the adjacency matrix to provide and contrast a global and a local viewpoint.

Insights:
• Able to compete with supervised methods
• Contrasting node and graph embeddings works best
• More than two views is worse
183
Generative Pre-Training of GNNs

GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD 2020.
184


When Does SSL Help GCNs?
They evaluate
three ways to
include SSL tasks:
• Pretraining/
Finetuning
• Self-training
• Multi-task
learning

When Does Self-Supervision Help Graph Convolutional Networks? ICML 2020.
185
Node Clustering Pretext Task
• Features used: Nodes
• Assumptions: Feature similarity
• Loss Type: Classification

• Follows the ideas of clustering from M3S


• Each node is given a cluster based on node features
• This cluster label is assigned the self-supervised label to predict

186
Partitioning Pretext Task
• Features used: Edges
• Assumptions: Connection density
• Loss Type: Classification

• Rather than clustering the node features, instead they partition the
network based on the structure
• Similarly, partition indices are the self-supervised label to predict

187
Graph Completion Pretext Task
• Features used: Nodes & Edges
• Assumptions: Context based Representation
• Loss Type: Regression

188
SSL Universally Improves Most GNN Models

Insights:
• Generally multi-task performs better than
pretraining/finetuning
• SSL acts universally well to improve many
GNN base models

189
When and Why SSL Works on GNNs
• Presents a set of basic pretext tasks using structure and
attribute information

• Insights gained on:


• Which strategy to harness SSL on GNNs?
• Why do some pretext tasks work on GNNs while others do not?
• How to construct advanced pretext tasks beyond basic structure
and attributes?

Self-Supervised Learning on Graphs: Deep Insights and New Directions. arXiv 2020.
190
Basic Pretext Tasks on Graphs

Structure (Local): Node Property, Edge Mask
Structure (Global): Pairwise Distance, Distance to Clusters
Attribute: Attribute Mask, Pairwise Attribute Similarity
191
Local Structure Pretext Tasks
Node Property (regression loss): predict an extracted node property from the GNN-mapped node embeddings.

Edge Mask (classification loss): randomly mask some edges; from the mapped node embeddings of a pair (i, j), predict whether the edge was masked or is a remaining edge.
192
Global Structure Pretext Tasks
Pairwise Distance (classification loss): calculate all pairwise shortest-path lengths; from the mapped node embeddings of a pair (i, j), predict their distance.

Distance to Clusters (regression loss): obtain k clusters; from the mapped node embeddings, predict the distance from each node to the center of each cluster.
193
Attribute Pretext Tasks
Attribute Mask (regression loss): mask the attributes of some nodes; from the mapped node embeddings, reconstruct the masked attributes.

Pairwise Attribute Similarity (regression loss): find the most and least similar node pairs by attribute similarity; from the mapped node embeddings of those pairs, predict the associated similarity values.
194
Empirical Study of Basic Pretext Tasks

[Table: results grouped by Local Structure, Global Structure, and Attribute pretext tasks]

Insights:
• In general joint/multi-task training outperforms pre-training/two-stage training
• Global structure generally outperforms local structure
• Is there a way to further combine and improve these basic methods?
195
Deeper Insights into Why Some SSL Tasks Work

Positive values 🡪 GCN node embeddings achieve higher accuracy

[Table: pretext task performance when using original GCN embeddings compared with original node attributes]
196
Further Insights and New Directions
Node similarity is a fundamental property of graphs
🡪 Does this similarity get maintained in the GCN embeddings?
Two nodes are
• structurally equivalent if their local neighborhoods significantly overlap
🡪 Based on local neighbor aggregation in GCN it would be expected
to be somehow maintained in their embeddings
e.g., Pairwise Distance pretext helps maintain this

• regularly equivalent even if not having the same neighbors if the neighbors are themselves
similar
🡪 If this similarity is based on their attributes, even if neighbors are different,
if their neighbors share similar features then two nodes are similar
e.g., Pairwise Attribute Similarity pretext helps maintain this
🡪 Next we define regular task equivalence… 197
Further Insights and New Directions
Node similarity is a fundamental property of graphs
🡪 Does this similarity get maintained in the GCN embeddings?

Two nodes are


structurally equivalent if their local neighborhoods significantly overlap
regularly (attribute) equivalent even if not having the same neighbors if the
neighbors are themselves similar (regarding attributes)
regularly task equivalent defines similarity of nodes in relation to the task
🡪 Intuition: if every node constructs a pretext vector based on label
information from their local neighborhood, then two nodes having similar (or
dissimilar) vectors, we encourage to be similar (or dissimilar) in the
embedding space 198
Advanced Pretext Tasks on Graphs
SelfTask, based on the intuitions of regular task equivalence:
• Structure + Label: Distance to Labeled
• Structure + Attribute + Label: Context Label, Ensemble Label, Corrected Label
199
SelfTask: Distance to Labeled

200
SelfTask: Context Label
Each node constructs a neighbor label
distribution context vector

201
SelfTask: Corrected Label
Key Idea: Improve Context Label by iteratively improving the context vector

SelfTask:
Context Label

202
Advanced SSL Results with
SelfTask for Node
Classification

203

Insights:
• Advanced methods utilizing the label information of
neighbors significantly improves performance
• The label correction stage indeed helps SelfTask
• Limited labeled data? No problem!
Summary of SSL for GNNs
• SSL for GNNs is still in the early stages but seen rapid growth/interest

• Just as in other domains, not all defined pretext tasks can work
• Some are more general than others
• While some can be specifically designed with domain specific knowledge

• Methods have taken pre-training, self-training, or multi-task training approaches

• Can we further leverage the relation between unlabeled nodes to labeled


nodes in advancing pretext tasks?

• Further theoretical and empirical analysis is desired to better understand when/why/how SSL for GNNs can work
204
Scalable Learning of Graph Neural
Networks
Tengfei Ma
IBM Research AI
IBM T. J. Watson Research Center
@KDD 2020 Tutorial

205
Graph Convolutional Networks (GCN)
- (Kipf and Welling 2017)
Motivation
Matrix form of a GCN layer and per-node form (embedding vectors are oriented as column vectors; see the equations below): each node aggregates over its full neighborhood.
Problems of GCN:
• Time and memory cost for large graphs: starting from a single node, after a few layers almost the whole graph will be touched, so even mini-batch training is expensive.
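Written out (standard GCN; the per-node form uses the column-vector convention stated on the slide):

$$ H^{(k)} = \sigma\big(\hat{A}\, H^{(k-1)} W^{(k)}\big), \qquad \mathbf{h}_v^{(k)} = \sigma\Big(W^{(k)} \sum_{u \in \mathcal{N}(v) \cup \{v\}} \hat{A}_{vu}\, \mathbf{h}_u^{(k-1)}\Big) $$

Unrolling the per-node form over several layers touches the full multi-hop neighborhood of v, which is exactly the cost problem listed above.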
Node Sampling
-GraphSAGE (Hamilton et al. 2017)
Node sampling: for each node just sample a fixed number of neighbors
• Matrix form
• Where idxk-1 is a uniformly randomly sampled subset of nodes. (For nonuniform random, need
proper scaling.)
• Per-node form
• For all v in idxk only

• Problems of node sampling


• Still power law
• No formal analysis to justify
Sampled
neighborhood
(same for all v in idxk)
Layer Sampling
- FastGCN (ICLR 2018)
Generalization of a GCN layer (assume each layer independent)

Monte Carlo sampling


• For each layer/batch sample nodes

• Matrix form (in a batch, all v have the same sampled neighbors)
Comparison

GCN FastGCN
Importance Sampling

• Uniform sampling

• Variance reduction
• Importance sampling, sampling from Q instead of a uniform P
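Schematically (up to normalization constants), each FastGCN layer is estimated by importance-sampled Monte Carlo over an independent node distribution q:

$$ h^{(k+1)}(v) \approx \sigma\Big(\frac{1}{t}\sum_{j=1}^{t} \frac{\hat{A}(v, u_j)\, h^{(k)}(u_j)\, W^{(k)}}{q(u_j)}\Big), \qquad u_j \sim q, \qquad q(u) \propto \big\|\hat{A}(:, u)\big\|^2 $$

Sampling nodes proportionally to the squared column norms of \hat{A} reduces the variance of the estimator compared with uniform sampling.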
Results

Per-batch training time Prediction accuracy


Adaptive Sampling
(Huang et al. 2018)
Problem of FastGCN:
• Layer-independent assumption
• Too sparse sampling ->lower accuracy
Extension
• Layer-dependent: sample the lower layer conditioned on the top one
• Based on importance sampling schema, they learn a self-dependent function
of each node to determine its importance for the sampling
• To explicitly reduce sampling variance, they add the variance to the loss
function and explicitly minimize the variance by model training

213
Graph Sampling
- GraphSAINT (Zeng et al. 2020)
Instead of node sampling or layer sampling, do graph sampling

If , it is an unbiased estimator of the aggregator, i.e.


• Pre-processing: repeatedly sample n subgraphs, and set

• Run a full GCN on each subgraph


Advantage:
• Permits multiple sampler methods (random node/edge/random walk)
214
Sampling for Multi-Relational Graphs
--RS-GCN (ICML20 GRL+)
Motivation:
• Most of previous sampling methods are for homogeneous graphs.
• We are focusing on accelerating the learning on multi-relational graphs
Idea
• Relation type matters!
• Probability to sample relation r at hop k:

• REINFORCE to update sampling probabilities

215
Application in Real-World
- PinSAGE (Ying et al. 2018)
An early industry-level GNN-based recommendation system
• The core of PinSAGE is a neighborhood aggregation algorithm similar to
GraphSAGE
• Novelty: how to define the neighborhood?
• Importance-based: the neighborhood of a node u is defined as the T nodes that exert the
most influence on node u.
• Random walks from node u, top T visited nodes.
• Efficient Training:
• Does not train on the whole graph, but only on
targeted node set and their neighborhood
• MapReduce for model inference

216
Application in Real-World
-Anti-Money Laundering
Application of FastGCN on large synthetic AML datasets:
• Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional
Networks for Financial Forensics (NeurIPS 2018 WS)
• Entities as nodes and transactions as edges
• Detecting suspicious nodes/transactions

217
Industrial-level Libraries and Applications
Deep Graph Library (DGL)
PyTorch Geometric (PyG)
AliGraph -- Alibaba
PyTorch-BigGraph (PBG) -- Facebook
AntGraph Machine Learning system (AGL) -- Ant Finance

218
Part I Overview

Foundations Models Applications

Filtering Layers in GNN


Basic Graph Theory Healthcare
Pooling Layers in GNN
Spectral Graph Theory Graph Structure Learning
Robustness of GNN
Graph Fourier Analysis Self-supervised Learning for GNN Natural Language Processing
Scalable Learning for GNN

219
Graph Neural Networks for
Healthcare Applications
--Drug Discovery and Medical Recommendation
Tengfei Ma
IBM Research AI
IBM T. J. Watson Research Center
@KDD2020 Tutorial

220
Drug Discovery
Drug Discovery is a long, tedious, and costly process
• Machine learning can help
• de novo drug design
• Generating new molecules for desired target.
• drug safety checking
• Toxicity
• Adverse reaction/drug-drug interaction
Interestingly, they are all related to graphs
• Molecule -- graph
• DDI – graph
It is natural to develop GNN-based methods
• Molecule generation
• DDI prediction

221
Constrained Generation of Semantically
Valid Graphs via Regularizing Variational
Autoencoders
NeurIPS 2018.

222
Molecule Graph Generation
Generative Models for Images/Sequences

But generation of graphs?


• Graph neural networks need to know the predefined graph structure
• GNNs can be used as encoders, but how to design a decoder/generator?
• How to guarantee the generated sample is a valid graph?
Ideas:
• Represent a graph as the concatenation of its node matrix and edge matrix and treat it as an image -> so we can use the same decoder as for images
• Validity? Add constraints to the generator
Constrained Graph Variational Auto-Encoder
Overview of the framework

• A graph auto-encoder used to generate the graph


• In addition to a standard VAE (within the rectangle), we add a regularization
term.
• f(x) is the original VAE loss
• h and g are regularization terms
Approximate Training
A Lagrangian relaxation

Training in Standard VAE

• Monte Carlo sampling

Similarly for the regularization term


Constraints
Molecules
• Valence
• Expected node capacity (sum of edges) <= valence
• Connectivity
• Every node pair must be connected by a path
• B = A + A^2 + … + A^{n-1}
• If node i and j are connected, B_{ij} != 0
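Written out (our transcription of the two bullets above):

$$ \text{valence:}\;\; \sum_j \mathbb{E}[A_{ij}] \le \mathrm{valence}(i) \;\;\forall i, \qquad \text{connectivity:}\;\; B = A + A^2 + \dots + A^{n-1}, \;\; B_{ij} \neq 0 \;\;\forall i, j $$

Both constraints are imposed on the generator's output and enter the training objective through the regularization terms g and h mentioned earlier.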
Results
Compared to VAE with no regularization

Compared to previous works


Visualization of Generated Molecules
Left: two-dimensional interpolation
Right: one-dimensional (column-wise) interpolation
Drug Similarity Integration Through
Multi-view Graph Auto-Encoders
IJCAI 2018

229
Adverse Drug-Drug Interaction (DDI)

Common among patients with


complex diseases or comorbidities.
Hard to observe in clinical testing.
Affects 15% of the U.S. population. Costs more than $177 billion per year in disease management.
Drugs may interact to cause adverse DDIs.
DDI Prediction
Drug Features (database)
• Label Side Effect (SIDER)
• Off-Label Side Effect (OFFSIDES)
• Molecular substructure
• Drug Indication (MedDRA)
• ……
Assumption: similar drugs may have similar interaction to another drug.
Related to Graphs?
• Constructing a DDI graph, we can predict unknown DDIs as a link prediction problem.
• Constructing a similarity graph, where DDIs are regarded as node labels, and DDI prediction is
a node classification problem.
MV-GAE: Drug Similarity Integration Through
Multi-view Graph Auto-Encoders (IJCAI 2018)
[Figure: drug A, drug B, sim(A, B)]
• Challenges of multi-view learning
• The underlying relations of biomedical events are often nonlinear and complex over all types of features
• Features have different importance toward different target outcomes
• Our solution:
• Construct drug similarity graph for each
view, where DDIs are node labels
• Graph convolutional network (GCN) based
model for node embedding and prediction
• Attention mechanism to integrate different
views
A Simple Multi-View GCN

GCN GCN

Question: what if we do not have labels?


Attentive Multiview Similarity Fusion with
Graph Auto-Encoders (GAE)
GCN decoder

GCN encoder

Normalize

attention weights decided


by data and target
Semi-Supervised Extensions (SemiGAE)

GCN decoder

GCN encoder
What if we do not have node features?

GCN decoder

GCN encoder
Results
Analysis of Attention
Attention Weights

DDI Type     AUC    Chem.  indi.  TTDS   CPI
Chest Pain   0.772  0.151  0.303  0.144  0.402
Insomnia     0.755  0.380  0.261  0.078  0.291
indication   0.774  0.117  0.301  0.283  0.299

Results: views "indication" and "CPI" receive high weights for the ADR "Chest pain".
Graph-Enhanced Medication Recommendation:
GAMENet and G-BERT
AAAI 2019, IJCAI 2019.

239
EHR Phenotyping and Medication
Recommendation
EHR (electronic healthcare record):
• Representation

• Phenotyping is important for


• Disease prediction
What if there is a DDI?
• Medication recommendation
• Readmission prediction
Medication Recommendation
Challenges
Problems of previous approaches
• Without consideration of DDIs
• It is possible to recommend some un-safe drugs
• Lack of structure information for medical codes
• Throwing out a lot of resources
• The single-visit EHR sequences
• -> we cannot use it in RNN
• ->we cannot use it to predict the next visit

241
Medical Recommendation I: GAMENet
GAMENet: Graph Augmented MEmory Networks for Recommending
Medication Combination (Shang et al. 2019a)
• Key ideas: integrate the DDI graphs to provide safer medical recommendation
• Method: encode both EHRs and DDI graphs in the memory and impact on the
memory output
Graph Augmented Memory Module
I: Input memory representation converts inputs
into query for memory reading.
• RNN
G: Generalization is the process of generating
and updating the memory representation.
• Static memory bank Mb: GCN
• Dynamic memory Md : adding key-value pair
O: Output memory representation produces
outputs given the patient representation (the
query) and the current memory state Mb and Md.

R: Response is the final step to utilize patient


representation and memory output to predict
the multilabel medication.
Graph Augmented Memory Module
I: Input memory representation converts inputs
into query for memory reading.
• RNN
G: Generalization is the process of generating
and updating the memory representation.
• Static memory bank Mb: GCN
• Dynamic memory Md : adding key-value pair
O: Output memory representation produces
outputs given the patient representation (the
query) and the current memory state Mb and Md.

R: Response is the final step to utilize patient


representation and memory output to predict
the multilabel medication.
Graph Augmented Memory Module
I: Input memory representation converts inputs
into query for memory reading.
• RNN
G: Generalization is the process of generating
and updating the memory representation.
• Static memory bank Mb: GCN
• Dynamic memory Md : adding key-value pair
O: Output memory representation produces
outputs given the patient representation (the
query) and the current memory state Mb and Md.

R: Response is the final step to utilize patient


representation and memory output to predict
the multilabel medication.
Medical Recommendation II: GBERT (Graph +
Pre-training)
Motivation
• To utilize the hierarchy information of medical codes
• To pretrain on single-visit data (which was generally discarded in previous
systems)
Framework:
• ontology embedding -> visit embedding -> pre-training
Ontology Embedding
A modified GNN to model the ontology and get initial embeddings for
all medical codes
• Leaf to root

• Root to leaf

• GAT style message passing


Visit Embedding

248
Pre-training
Input example:
• [CLS] d1, d2, [MASK], d3, d4, m1, [MASK], m2, m3
• d: diagnosis; m: medication
pre-training on each visit of EHR sequences
• Self-prediction: same as BERT, use to a mask to mask out some codes and
predict them
• Dual-prediction: known all diagnosis, predict medication; known medication,
predict diagnosis
• Note: No position embedding, because there is no order within one visit.
Results
• We used EHR data from MIMIC-III
[Johnson et al., 2016] and conducted all
our experiments on a cohort where
patients have more than one visit.

• For GAMENet, we used DDI knowledge


from TWOSIDES dataset.

• For G-BERT, we utilize data from patients


with both single visit and multiple visits
in the training dataset as pre-training
data source.

IBM Research AI
Deep Graph Learning and
Graph-to-Sequence Learning in NLP
Lingfei Wu
IBM Research AI
IBM T. J. Watson Research Center

Joint work with Yu Chen, Mohammed J Zaki, Kun Xu, Zhiguo Wang, Yansong Feng,
Michael Witbrock, and Vadim Sheinin

@KDD 2020 Tutorial


251
Why graphs?
Graphs are a general
language for
describing and
modeling complex
systems

Graph!
252
Graph-structured data are ubiquitous

Internet networks, social networks, networks of transactions, biomedical graphs, scene graphs, program graphs
253
Machine (Deep) Learning with Graphs
Classical ML tasks on graphs:
• Node classification: predict the type of a given node
• Link prediction: predict whether two nodes are linked
• Community detection: identify densely linked clusters of nodes
• Graph matching (similarity): how similar are two (sub)graphs

Recent ML tasks on graphs:
• Graph classification: predict the type of a given graph
• Graph generation: generate graphs from a learned distribution
• Graph structure learning: jointly learn graph structure and graph embeddings
• Graph-to-X learning: graph inputs, X outputs
254
Deep Graph Learning and
Graph-to-Sequence Learning in NLP
Lingfei Wu
IBM Research AI
IBM T. J. Watson Research Center

Joint work with Yu Chen, Mohammed J Zaki, Kun Xu, Zhiguo Wang, Yansong Feng,
Michael Witbrock, and Vadim Sheinin

@KDD 2020 Tutorial


255
Why graphs?
Graphs are a general
language for
describing and
modeling complex
systems

Graph!
256
Graph-structured data are ubiquitous

Internet networks, social networks, networks of transactions, biomedical graphs, scene graphs, program graphs
257
Machine (Deep) Learning with Graphs
Classical ML tasks on graphs:
• Node classification: predict the type of a given node
• Link prediction: predict whether two nodes are linked
• Community detection: identify densely linked clusters of nodes
• Graph matching (similarity): how similar are two (sub)graphs

Recent ML tasks on graphs:
• Graph classification: predict the type of a given graph
• Graph generation: generate graphs from a learned distribution
• Graph structure learning: jointly learn graph structure and graph embeddings
• Graph-to-X learning: graph inputs, X outputs
258
Iterative and Robust Deep
Graph Learning for GNNs

259
Graph Learning: Motivations
• GNNs are powerful; unfortunately, they require graph-structured data to be available.
• Questionable if the given intrinsic
graph-structures are optimal (i.e., noisy,
incomplete) for the downstream tasks.
• Many applications (e.g., NLP tasks) may
only have non-graph structured data or
even just the original feature matrix.

260
Graph Learning: Formulation

261
Existing State-of-the-art Methods
Graph construction from data [Kalofolias, 2016; Kalofolias and Perraudinl, 2017]
• Gaussian kernel or KNN-based
• Directly optimizing the graph adjacency matrix with smoothed graph signals
• Issues: 1) does not consider downstream task; 2) no refinement
Dynamic models of interacting systems [Kipf et al., ICML’18]
• Inferring an explicit interaction structure using a variational graph auto-encoder
• Issues: 1) cannot jointly learn the graph structure and graph representations; 2) transductive
• Jointly optimizing graph structures and GNN parameters [Franceschi, ICML ’19]
• Modeling a joint probability distribution over the edges of a graph with N vertices
• Issues: 1) hard to optimize; 2) Not scalable; 3) cannot handle inductive learning
262
Iterative Deep Graph Learning : System
Overview

Graph Learning and Graph Embedding: A Unified Perspective


• Graph learning as similarity metric learning
• Graph regularization to control smoothness, sparsity, and connectivity
• Iterative method to refine the graph structures and graph embeddings
263
IDGL: Graph Learning as Similarity Metric
Learning
We design a multi-head weighted cosine similarity metric function to learn a
similarity matrix S for all pairs of nodes.

We proceed to extract a symmetric sparse adjacency matrix from the similarity


matrix S by considering only the ɛ-neighborhood for each node.

where is the normalized adjacency matrix of the initial graph (or kNN-graph).
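The multi-head weighted cosine metric and the ε-neighborhood sparsification can be written as (our transcription of the IDGL formulation):

$$ s_{ij}^{(p)} = \cos\big(\mathbf{w}_p \odot \mathbf{v}_i,\; \mathbf{w}_p \odot \mathbf{v}_j\big), \qquad s_{ij} = \frac{1}{m}\sum_{p=1}^{m} s_{ij}^{(p)}, \qquad A_{ij} = \begin{cases} s_{ij} & s_{ij} > \varepsilon \\ 0 & \text{otherwise} \end{cases} $$

The adjacency actually fed to the GNN is then a combination of this learned graph and the normalized initial (or kNN) graph, as stated above.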

264
IDGL: Graph Regularization
We adapt the techniques designed for learning graphs from smooth signals and apply them as regularization for controlling smoothness, connectivity and sparsity

Smoothness

Connectivity & sparsity

Graph regularization loss
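The three regularizers, in the form used for learning graphs from smooth signals (our transcription):

$$ \Omega(A, X) = \frac{\alpha}{n^2}\, \mathrm{tr}\big(X^\top L X\big) \;-\; \frac{\beta}{n}\, \mathbf{1}^\top \log(A \mathbf{1}) \;+\; \frac{\gamma}{n^2}\, \|A\|_F^2 $$

The trace term encourages feature smoothness over the learned graph, the log barrier on the node degrees encourages connectivity, and the Frobenius norm controls sparsity.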

265
IDGL: Iterative Method for Joint Graph
Structure and Representation Learning
Iterative method repeatedly
▪ Learn better adjacency matrix with the
updated node embeddings
▪ Learn better node embeddings with the
refined adjacency matrix
Iterative procedure dynamically stops in a
mini-batch
▪ the learned adjacency matrix converges
with certain threshold
▪ the maximal number of iterations is reached

266
Results (Transductive Setting)

267
Results (Inductive Setting & Runtime)

268
Results (Ablation Study)

269
Results (Robustness to Missing/Adding Edges)

270
Results (Convergence & Dynamic Stopping)

271
Graph-to-sequence Learning in
Natural Language Processing

272
Seq2Seq: Applications and Challenges
Applications Challenges
• Machine translation • Only applied to problems whose
• Natural Language Generation inputs are represented as
• Logic form translation sequences
• Drug Discovery • Cannot handle more complex
structure such as graphs
• Converting graph inputs into
sequences inputs lose information
• Augmenting original sequence
inputs with additional structural
information enhances word
sequence feature
273
Contributions and Highlighted Research
Fundamental contributions in this research:
• Presented Graph2Seq, a generalized seq2seq model for graph inputs
• Attention-based encoder-decoder model for graph-to-sequence learning

• Two highlighted NLP tasks using Graph2Seq model:


• SQL-to-text Generation with Graph2Seq Model
• Question Generation with RL based Graph2Seq Model

274
Graph-to-Sequence Model [1]
Graph Convolutional Neural Network

[1] Kun Xu*, Lingfei Wu*, Zhiguo Wang, Yansong Feng, Michael Witbrock, and Vadim Sheinin (Equally Contributed), "Graph2Seq: Graph to Sequence Learning with Attention-based Neural Networks", arXiv 2018.
[2] Yu Chen, Lingfei Wu** and Mohammed J. Zaki (**Corresponding Author), "Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation", ICLR'20.
275
Bidirectional Node Embedding (Separate)
Bi-Sep Node embedding (take node v as an example)
1. transform each node’s text attribute to a feature vector by looking up the
embedding matrix

2. classify v’s neighbors into forward and backward neighbors, aggregate


neighbors information using a fully connect network followed by a max
pooling operation

3. repeat steps 2 for K times to aggregate neighbors information in K hops


4. concatenate final v’s forward and backward node embeddings as the final
bi-directional representation of node v
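A minimal sketch of the four Bi-Sep steps above; a mean aggregator is used in place of the fully connected network plus max pooling for brevity, and all names are ours:

```python
import torch

def bi_sep_node_embeddings(fwd_neighbors, bwd_neighbors, word_emb, num_hops=3):
    """Bidirectional (separate) node embeddings: aggregate forward and backward
    neighborhoods independently for K hops, then concatenate."""
    h_fwd, h_bwd = word_emb.clone(), word_emb.clone()    # step 1: init from text attributes
    for _ in range(num_hops):                            # step 3: repeat for K hops
        h_fwd = torch.stack([                            # step 2: aggregate forward neighbors
            h_fwd[nbrs].mean(dim=0) if nbrs else h_fwd[v]
            for v, nbrs in enumerate(fwd_neighbors)])
        h_bwd = torch.stack([                            # step 2: aggregate backward neighbors
            h_bwd[nbrs].mean(dim=0) if nbrs else h_bwd[v]
            for v, nbrs in enumerate(bwd_neighbors)])
    return torch.cat([h_fwd, h_bwd], dim=-1)             # step 4: concatenate both directions
```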
276
Bidirectional Node Embedding (Fuse)
Bi-Fuse Node embedding (take node v as an example)
1) Node aggregation

2) Fuse the aggregated node embeddings from both directions

3) Update the node embeddings using fused information

277
Graph Encoding
Graph embedding
• Pooling based graph embedding (max, min and average pooling)
• Node based graph embedding
Add one super node which is connected to all other nodes in the graph
The embedding of this super node is treated as graph embedding

278
Attention Based Sequence Decoding

context node
vector representation

279
Attention Based Sequence Decoding

context node
vector representation
attention alignment
weights model

280
Attention Based Sequence Decoding

context node
vector representation
attention alignment
weights model
Objective Function

281
Experiments: Text Reasoning and Shortest
Path

282
Experiments: Bidirectional Node Embedding

Bidirectional Node Embedding vs. Unidirectional Node Embedding: the bidirectional embedding converges more quickly

283
When Shall We Use Graph2Seq?
Case I: the inputs are naturally or best represented as a graph
e.g., "Ryan's description of himself: a genius."

Case II: a hybrid graph combining a sequence and its hidden structural information
e.g., augmenting "are there ada jobs outside Austin" with its dependency parsing tree
284
SQL-to-text Generation with
Graph2seq Model (EMNLP’18)

285
Natural Language Interface to Database

Need explanation !

What is the meaning ?

286
SQL-to-text Generation Task

287
Previous Approaches
Template-based approaches
Koutrika et al. 2010; Ngonga Ngomo et al. 2013
Time consuming and generates rigid and stylized language!

Deep learning models


Iyer et al. 2016 (Sequence to sequence model)

288
Problem
SQL query is a graph structured query
Naïve sequence encoders may need an elaborate design to fully capture the global
structure information

289
Motivation

290
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause

Where clause

291
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select

Where clause

292
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes

Where clause

293
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes
c) In some cases, there may exist aggregation functions such
as count and max; add aggregation node and connect it
with column node
Where clause

294
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes
c) In some cases, there may exist aggregation functions such
as count and max; add aggregation node and connect it
with column node
Where clause
a) for each condition, we use the same process as for the
Select clause to create nodes

295
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes
c) In some cases, there may exist aggregation functions such
as count and max; add aggregation node and connect it
with column node
Where clause
a) for each condition, we use the same process as for the
Select clause to create nodes

296
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes
c) In some cases, there may exist aggregation functions such
as count and max; add aggregation node and connect it
with column node
Where clause
a) for each condition, we use the same process as for the
Select clause to create nodes

297
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes
c) In some cases, there may exist aggregation functions such
as count and max; add aggregation node and connect it
with column node
Where clause
a) for each condition, we use the same process as for the
Select clause to create nodes

298
Graph Representation of SQL Query

Represent the SQL query as a graph


Select clause
a) create a node assigned with text attribute select
b) connect SELECT node with column nodes
c) In some cases, there may exist aggregation functions such
as count and max; add aggregation node and connect it
with column node
Where clause
a) for each condition, we use the same process as for the
Select clause to create nodes
b) add logical operators such as AND, OR and NOT
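A minimal sketch of the construction described above, using networkx; the node naming scheme and the simplified query structure (selected columns, optional aggregations, flat conditions joined by one logical operator) are our assumptions, not the paper's exact representation:

```python
import networkx as nx

def sql_to_graph(columns, aggregations, conditions, logic_op="AND"):
    """Sketch: build a graph for a simple SELECT ... WHERE ... query.
    columns: selected column names; aggregations: {column: function};
    conditions: list of (left, operator, right) tuples; logic_op joins the conditions."""
    g = nx.Graph()
    g.add_node("SELECT")
    for col in columns:                                   # Select clause
        if col in aggregations:                           # optional aggregation node
            g.add_edge("SELECT", aggregations[col])
            g.add_edge(aggregations[col], col)
        else:
            g.add_edge("SELECT", col)
    if conditions:                                        # Where clause
        g.add_edge("SELECT", logic_op)                    # logical operator node (AND/OR/NOT)
        for i, (left, op, right) in enumerate(conditions):
            cond = f"cond{i}:{op}"                        # one node per condition operator
            g.add_edge(logic_op, cond)
            g.add_edge(cond, left)
            g.add_edge(cond, right)
    return g
```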

299
Encoder-Decoder Architecture

Encoding

How to encode this graph structure


?

300
Graph-to-Sequence Model [1]
Graph Convolutional Neural Network

[1] Kun Xu*, Lingfei Wu*, Zhiguo Wang, Yansong Feng, Michael Witbrock, and Vadim Sheinin (Equally Contributed), "Graph2Seq: Graph to Sequence Learning with Attention-based Neural Networks", arXiv 2018.
[2] Yu Chen, Lingfei Wu** and Mohammed J. Zaki (**Corresponding Author), "Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation", ICLR'20.
301
Experiments
Datasets
WikiSQL (61,297 for training, 9,145 for development and 17,284 for test)
Stackoverflow (25,869 for training, 3,234 for development and 3,234 for test)
Baselines
Template
a) first map each element of a SQL query to an utterance
b) then use simple rules to assemble these utterances
Seq2Seq
We implement the model proposed in Bahdanau et al. 2014
Seq2Seq + Copy
We implement the model proposed in Gu et al. 2016
Tree2Seq
We implement the model proposed in Eriguchi et al. 2016

302
Results
Criteria
BLEU-4
Grammar (human evaluation)
Correct (human evaluation)

[Table: results on WikiSQL and Stackoverflow]

303
Examples

Seq2Seq model
Graph2Seq model

Seq2Seq model
Graph2Seq model

304
Graph2Seq + Reinforcement Learning
for Question Generation (ICLR’20)

305
Natural Question Generation: Background
Natural question generation (QG) is
a challenging yet rewarding task,
that aims to generate questions
given an input passage and a target
answer.

Many real applications:


Reading comprehension
Visual and video question answering
Dialog system

306
Natural Question Generation: Definition
Input:
A text passage:
A target answer:

• Output:
The best natural language question:
which maximizes the conditional likelihood:

307
Existing State-of-the-art Methods
Template-based approaches
Mostow & Chen, 2009; Heilman & Smith, 2010; Heilman, 2011
Rely on heuristic rules or hand-crafted templates
low generalizability and scalability

Seq2Seq-based approaches
Du et al., 2017; Zhou et al., 2017; Song et al., 2018a; Kumar et al., 2018a
Fail to utilize the rich text structure information beyond the simple word
sequence
Rely on cross-entropy based sequence training which has several limitations
Fail to effectively utilize the answer information

308
Known Issues of Existing Approaches
Issue I: fail to consider global interactions between answer and context
Solution I: deep alignment network to align answer and context

Issue II: fail to consider rich hidden structure information of the word sequence
Solution II: novel Graph2Seq model for capturing hidden structure information in the sequence

Issue III: limitations of cross-entropy based objectives, like exposure bias and inconsistency between train/test measurement
Solution III: novel reinforcement learning loss for enforcing syntactic and semantic coherence of the generated text

309
RL-based Graph2Seq for QG: System
Overview

310
Deep Answer Alignment
A deep alignment network for incorporating the answer information into passages at
both the word level and the contextualized hidden state level

denotes passage
denotes answer

311
Graph Construction: Static VS Dynamic
Syntax-based static graph
A directed and unweighted passage
graph based on dependency parsing

Semantics-aware dynamic graph


Dynamically build a directed and
weighted graph to model semantic
relationships among passage words
is the passage representation

312
Encoder-Decoder Architecture

Encoding

How to encode this graph structure


?

313
Graph-to-Sequence Model [1]
Graph Convolutional Neural Network

[1] Kun Xu*, Lingfei Wu*, Zhiguo Wang, Yansong Feng, Michael Witbrock, and Vadim Sheinin (Equally Contributed), "Graph2Seq: Graph to Sequence Learning with Attention-based Neural Networks", arXiv 2018.
[2] Yu Chen, Lingfei Wu** and Mohammed J. Zaki (**Corresponding Author), "Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation", ICLR'20.
314
Hybrid Evaluator
Regular cross-entropy based training objectives have limitations
Exposure bias
Evaluation discrepancy between training and testing
We apply a mixed objective function combining both the
cross-entropy loss and RL loss
Ensure the generation of syntactically and semantically valid text

Two-stage training strategy:


Train the model with cross-entropy loss
Finetune the model by optimizing the mixed objective function
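The mixed objective combines the cross-entropy loss with a self-critical RL loss (our transcription of the standard form used in the cited ICLR 2020 paper):

$$ \mathcal{L} = \gamma\, \mathcal{L}_{\mathrm{RL}} + (1 - \gamma)\, \mathcal{L}_{\mathrm{CE}}, \qquad \mathcal{L}_{\mathrm{RL}} = \big(r(\hat{Y}^{b}) - r(\hat{Y}^{s})\big) \sum_t \log p\big(\hat{y}^{s}_t \mid \hat{y}^{s}_{<t}, X\big) $$

where \hat{Y}^{s} is a sampled output, \hat{Y}^{b} a baseline (e.g., greedy) output, and r(\cdot) an evaluation-metric reward.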
315
Automatic Evaluation Results

316
Human Evaluation and Ablation Study
Results

317
Graph4NLP: A Library for Deep
Learning on Graphs for NLP

318
Architecture of Graph4NLP Library
Graph Construction: topology construction, embedding
Node Encoding: SAGE, GCN, GAT, GGNN, RGCN
Decoding / Prediction: classification, generation, KB completion
Evaluation: metrics, loss

Data Flow of Graph4NLP
Raw Data → Graph Construction → Featured Structured Data (Graph4NLP.GraphData) → GNN Embedding Methods → Encoded Structured Data (Graph4NLP.GraphData) → Prediction (User Model) → Prediction Results → Evaluation / Loss
Take-home Messages from This Talk
• Deep Learning on Graphs is a fast-growing area today.
• Since graphs can naturally encode complex information, they can bridge the gap between empirical domain knowledge and the power of deep learning.
• However, the input graph can be noisy, incomplete, or even unavailable; we presented the IDGL model, which jointly learns the graph structure and graph embeddings.
• We also presented the Graph2Seq model, a generalized Seq2Seq model for graph inputs, and demonstrated its advantages in two useful scenarios:
✔ Graph data (natural or best expressed)
✔ Augmented sequence with hidden structure information

321
What’s The Next?
• Welcome all to attend The Second International Workshop on Deep Learning on Graphs: Methods and Applications (DLG-KDD'20), which will be held jointly with the KDD MLG workshop on August 24th, 2020

• Release of our Graph4NLP library: the first library


for researchers and practitioners for easy use of
Graph Neural Networks for various NLP tasks!
