
Capturing Fine-grained Semantics in Contrastive Graph Representation Learning

Lin Shu^a, Chuan Chen^{a,*} and Zibin Zheng^a

^a Sun Yat-sen University
* Corresponding Author. Email: chenchuan@mail.sysu.edu.cn.

Abstract. Graph contrastive learning defines a contrastive task to pull similar instances close and push dissimilar instances away. It learns discriminative node embeddings without supervised labels, which has aroused increasing attention in the past few years. Nevertheless, existing methods of graph contrastive learning ignore the differences between the diverse semantics that exist in graphs, so they learn coarse-grained node embeddings and lead to sub-optimal performances on downstream tasks. To bridge this gap, we propose a novel Fine-grained Semantics enhanced Graph Contrastive Learning (FSGCL) in this paper. Concretely, FSGCL first introduces a motif-based graph construction, which employs graph motifs to extract the diverse semantics that exist in graphs from the perspective of input data. Then, the semantic-level contrastive task is explored to further enhance the utilization of fine-grained semantics from the perspective of model training. Experiments on five real-world datasets demonstrate the superiority of our proposed FSGCL over state-of-the-art methods. To make the results reproducible, we will make our codes public on GitHub after this paper is accepted.

Figure 1. An illustrative example of the diverse semantics that exist in social graphs and the case study of utilizing fine-grained semantics in graph contrastive learning. The histograms depict the accuracies of node classification on the synthetic dataset.

1 Introduction

Graph-structured data is pervasive in a wide variety of real-world scenarios, such as social graphs [17, 23], citation graphs [1, 6] and biological graphs [16, 9]. Recent years have witnessed the proliferation of network representation learning [34], among which Graph Neural Networks (GNNs) [29, 39] attract a surge of interest and show their effectiveness in learning good-quality representations from graphs. However, most GNNs are trained under the supervision of manual labels [10, 26], which are expensive and even unavailable in the real world, resulting in overfitting on the limited labels. As a consequence, self-supervised graph representation learning [31, 32] comes into being, which makes full use of unlabeled graphs to train models and thus removes the need for expensive manual labels.

As a dominant category of self-supervised graph representation learning, graph contrastive learning (GCL) [33, 41] has gained increasing attention in the past few years. Most GCL methods adopt an instance-level contrastive task, i.e., they make each node/instance discriminative from every other. Nevertheless, there exist various latent semantics in graphs, and different nodes might share high similarity under different inherent semantics. The instance-level contrastive schema constrains each node to be far away from every other node, which ignores the differences between the diverse semantics that exist in graphs and is prone to pull nodes with high semantic similarity far apart, resulting in a coarse-grained representation for each node. As an illustration, Figure 1 depicts a social graph that contains four distinct communities. Specifically, Alice participates in four communities that imply the diverse semantics of colleagues, friends, families and classmates. Existing GCL methods leverage the connections in graphs without distinguishing the semantic nuances between different connections, thus leading to the information loss of inherent semantics and sub-optimal performances on downstream tasks. To verify our hypothesis, we conduct a case study on a synthetic dataset composed of 8 communities, where nodes might belong to 2 communities simultaneously and each community can simulate the inherent semantics in graphs explicitly (more detailed settings are introduced in the Supplementary Material). Concretely, we apply the multilabel algorithm [35] on the frozen node embeddings generated from the method w/o semantics (i.e., a method that ignores fine-grained semantics [13]) and the method with semantics (i.e., our proposed method), where the ground-truth communities are regarded as labels. The histograms report the classification accuracies, where the value of the i-th row and j-th column denotes the proportion of nodes belonging to both community i and j that are accurately classified. As a result, the diagonal parts of the histograms depict the accuracies of nodes with only one community, while the non-diagonal parts depict the accuracies of nodes with two communities. It can be observed that the method w/o semantics has lower classification accuracies on the non-diagonal parts compared with the diagonal ones, implying that methods ignoring fine-grained semantics have poor learning ability on nodes with multiple semantics. In contrast, the method with semantics
improves the accuracies of the non-diagonal parts, indicating that considering the fine-grained semantics can boost the embedding quality and achieve higher performances on downstream tasks. Therefore, it is of crucial significance to mine fine-grained semantics for graph contrastive learning.

In recent years, numerous studies [2, 5] have verified the effectiveness of motifs for capturing latent semantics in graphs. Specifically, graph motifs are defined as essential subgraph patterns that occur frequently in graphs. Since different graph motifs depict different semantic contexts of each node, the semantic roles of each node can be easily distinguished according to the specified semantic contexts. As a consequence, graph motifs have been recognized as popular fundamental tools to mine latent semantics. For example, Motif-CNN [22] fuses information from multiple graph motifs to mine semantic dependencies between nodes. SHNE [38] captures structural semantics by exploiting graph motifs to boost the expressive power of the proposed model. Therefore, we propose to integrate graph motifs to mine fine-grained semantics for graph contrastive learning.

Overall, we propose Fine-grained Semantics enhanced Graph Contrastive Learning (FSGCL) in this paper. Concretely, FSGCL first introduces a motif-based graph construction, which generates multiple semantic graphs to integrate latent distinct semantics from the perspective of input data. Then, based on the generated semantic graphs, we combine the instance-level contrastive task with a carefully designed semantic-level contrastive task to further enhance the utilization of fine-grained latent semantics from the perspective of model training. In addition, since the random negative sampling strategy used in traditional GCL methods is prone to treat nodes with high similarity as negative pairs and pull them apart, which results in the loss of semantic similarity, we adopt a slow-moving average approach [7] to achieve the contrastive task without negative samples and further ensure the effective utilization of inherent semantics.

To sum up, the major contributions of this paper are concluded as follows:

• We propose a fine-grained semantics enhanced graph contrastive learning (FSGCL) model, which boosts the expressive ability of node representations by exploring latent semantics for contrastive learning on graphs.
• The motif-based graph construction is introduced to extract the diverse semantics that exist in graphs from the perspective of input data. Furthermore, the semantic-level and instance-level contrastive tasks without negative samples are explored jointly to further enhance the utilization of fine-grained semantics from the perspective of model training.
• We conduct extensive experiments to demonstrate the effectiveness of FSGCL. We further report visualization results on the synthetic graph to verify the superior ability of FSGCL in capturing diverse semantics.

2 Related Work

2.1 Graph Contrastive Learning

Graph contrastive learning has gained increasing popularity in recent years; it leverages unlabeled graphs to train models and thus removes the need for expensive manual labels. Inspired by the success of contrastive learning in computer vision [4, 25], contrastive learning for graphs achieves the contrastive task by pulling similar instances close and pushing dissimilar instances far away. Representative researches include GraphCL [33], which designs various types of graph augmentations to facilitate invariant representation learning for contrastive GNNs. Further, GCA [41] proposes adaptive augmentation schemes to encourage the model to learn significant topologies and attributes from graphs. However, they exploit the connections in graphs without distinguishing the underlying semantic nuances, thus leading to sub-optimal node representations. Recently, PGCL [15] proposes a clustering-based approach that pulls semantically similar graphs closer and investigates the bias issue of negative sampling. DGCL [14] disentangles the latent semantic factors of the graph to learn factorized graph representations for contrastive learning. Nevertheless, these methods are tailored for graph-level contrastive learning; they don't take the underlying node-level semantics into account and are not suitable for node-level graph contrastive learning.

2.2 Motifs in Graph Learning

A graph motif, a fundamental subgraph pattern that occurs frequently in graphs, plays a significant role in mining latent semantics. For instance, Motif-CNN [22] fuses information from multiple graph motifs to mine semantic dependencies between nodes. SHNE [38] captures structural semantics by exploiting graph motifs to boost the expressive power of the proposed model. However, these models are trained under the supervision of manual labels, which are expensive and even unavailable. Therefore, it is of crucial significance to utilize graph motifs to capture fine-grained semantics via graph contrastive learning, a dominant category of self-supervised graph representation learning. It should be noted that MICRO-Graph [36] and MGSSL [37] also incorporate graph motifs with self-supervised learning, but their goal is to generate informative motifs via self-supervised learning, where the latter model further leverages the learned motifs to sample subgraphs for graph-level contrastive learning. Their aims are entirely different from that of our node-level graph contrastive learning model, which considers the underlying semantics to generate high-quality node embeddings.

Figure 2. An illustrative example of graph motifs and corresponding co-occurrence matrices.

3 Preliminaries

In this section, we first describe the main definitions and notations that appear in this paper. Specifically, calligraphic math font (e.g., V) represents a set, boldface uppercase letters (e.g., A) represent matrices and boldface lowercase letters (e.g., w) represent vectors.

Graph Contrastive Representation Learning. A graph can be represented as G = (V, E), where V = {v_1, v_2, ..., v_n} denotes the set of n nodes and E denotes the set of links. Given a graph G, graph contrastive representation learning aims at learning discriminative d-dimensional embeddings Z ∈ R^{n×d} for the n nodes in G via a contrastive task, which can then serve downstream tasks such as node classification.
Graph Motif. A graph motif is denoted as M = (V_M, E_M), where V_M and E_M are the set of nodes and links of motif M respectively. Given a motif pattern M, a motif instance is defined as a subgraph of G that matches the pattern of motif M, which is denoted as (V_S, E_S), where V_S ⊆ V and E_S ⊆ E, satisfying (1) ∀u ∈ V_S, Φ(u) ∈ V_M, where Φ : V_S → V_M is a bijection, and (2) ∀u, v ∈ V_S, if (Φ(u), Φ(v)) ∈ E_M, then (u, v) ∈ E_S. We denote the instance set of motif M as S_M = {(V_S, E_S)}.

Motif-based Co-occurrence Matrix. Given a specific motif M and the corresponding instance set S_M = {(V_S, E_S)}, the co-occurrence matrix O^M of M is defined as:

    O^M_{v,u} = \sum_{(V_S, E_S) \in S_M} I((v, u) \in E_S),    (1)

where I(s) is the indicator function, i.e., the value of I(s) equals 1 if the condition s is true and 0 otherwise. O^M_{v,u} depicts the number of co-occurrences of nodes v and u in all motif instances in S_M. Figure 2 illustrates an example of graph motifs and corresponding co-occurrence matrices.
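To make Eq. (1) concrete, the sketch below accumulates co-occurrence counts from an already enumerated set of motif instances. It is an illustration rather than the authors' released code: the motif matching step itself (e.g., the algorithm referenced as [12]) is assumed to have been run beforehand, and the symmetric treatment of undirected edges is our assumption.

import numpy as np

def cooccurrence_matrix(motif_instances, n):
    """Build the motif-based co-occurrence matrix O^M of Eq. (1).

    motif_instances: iterable of edge sets, one per matched instance (V_S, E_S);
    each edge is a pair (u, v) of node indices. n: number of nodes in G.
    """
    O = np.zeros((n, n), dtype=np.int64)
    for edges in motif_instances:
        for u, v in edges:
            # I((v, u) in E_S): every edge of the instance adds one co-occurrence
            O[u, v] += 1
            O[v, u] += 1  # assume an undirected graph, so count symmetrically
    return O

# toy usage: two triangle instances sharing the edge (0, 1)
instances = [{(0, 1), (1, 2), (0, 2)}, {(0, 1), (1, 3), (0, 3)}]
print(cooccurrence_matrix(instances, n=4)[0, 1])  # -> 2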
4 Framework

4.1 Components of FSGCL

Overall, FSGCL first generates two graph views G^{1st} and G^{2nd} based on the semantic graphs constructed by graph motifs (Section 4.2). Then, FSGCL utilizes two main neural networks to learn embeddings, i.e., the online network f_θ and the target network f_ξ (θ ≠ ξ). Concretely, the online network is composed of a graph encoder, a projector and a predictor, while the target network has the same architecture as the online network except for the absence of the predictor. The asymmetric architecture prevents the training model from falling into trivial solutions. The semantic-wise graph encoder is introduced to extract the latent semantics that exist in graphs (Section 4.3), followed by the semantic-level and instance-level contrastive objectives formalized in Section 4.4. The whole architecture of FSGCL is illustrated in Figure 3.

Figure 3. Framework of FSGCL. FSGCL utilizes the online and target networks to learn, where the target network is stop-gradient and optimized by the slow-moving average of the online network.

4.2 Motif-based Graph Construction

To explicitly explore latent semantics in graphs, we leverage graph motifs to capture the semantic similarity. Specifically, given T graph motifs, for each motif M_i, i ∈ {1, ..., T}, we first obtain the motif instance set S_{M_i} via an efficient motif matching algorithm [12], and calculate the corresponding motif-based co-occurrence matrix O^{M_i}, where nodes u and v share the same semantics within motif M_i if O^{M_i}_{u,v} is non-zero. It is straightforward to employ the co-occurrence matrix as the semantic adjacency matrix for the subsequent graph encoding procedure to generate node embeddings. However, on the one hand, graph motifs can only capture underlying semantics from the perspective of structural information, which ignores the role node features play in semantic similarity. On the other hand, there exist a large number of similar node pairs in the co-occurrence matrix, so it is unscalable and inflexible to regard all similar node pairs as neighbor pairs. As a consequence, we propose to use both the feature similarity and the motif co-occurrence matrix to select top-K neighbors for each node. Concretely, given the node features X ∈ R^{n×d_f}, where d_f denotes the dimension of initial features, we first calculate the feature similarity matrix C through the widely used cosine similarity, which is formulated as:

    C_{ij} = \frac{X_i \cdot X_j}{\|X_i\| \, \|X_j\|}.    (2)

Then, the non-zero position matrix of the co-occurrence matrix O^{M_i} is extracted, which is denoted as R^{M_i}:

    R^{M_i}_{u,v} = \begin{cases} 1, & O^{M_i}_{u,v} \neq 0, \\ 0, & O^{M_i}_{u,v} = 0. \end{cases}    (3)

Ultimately, we select the top-K elements from M = C ⊙ R^{M_i} for each node to construct the semantic graph G_{SG_i}, whose adjacency matrix A^{SG_i} is calculated as

    A^{SG_i}_{u,v} = \begin{cases} M_{u,v}, & M_{u,v} \in \mathrm{TopK}(M_u), \\ 0, & M_{u,v} \notin \mathrm{TopK}(M_u), \end{cases}    (4)

where ⊙ denotes the Hadamard product. Therefore, we construct T semantic graphs via the T graph motifs respectively, where the non-zero elements of each adjacency matrix A^{SG_i} denote that two nodes share high semantic similarity within both motif M_i and node features.
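The construction of Eqs. (2)-(4) can be sketched as follows. This is an illustrative dense NumPy version under our own assumptions (a practical implementation would presumably use sparse matrices, and ties at the top-K boundary are broken arbitrarily by argsort).

import numpy as np

def semantic_graph(X, O, k):
    """Sketch of Eqs. (2)-(4): build A^{SG_i} from features X and co-occurrence O^{M_i}."""
    # Eq. (2): cosine similarity between node features
    X_unit = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    C = X_unit @ X_unit.T
    # Eq. (3): non-zero position matrix R^{M_i}
    R = (O != 0).astype(C.dtype)
    # M = C ⊙ R (Hadamard product)
    M = C * R
    # Eq. (4): keep the top-K entries of each row, zero out the rest
    A = np.zeros_like(M)
    for u in range(M.shape[0]):
        topk = np.argsort(M[u])[-k:]  # rows with fewer than k non-zeros just keep zeros
        A[u, topk] = M[u, topk]
    return A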
Based on the semantic graphs, we generate two different graph views G^{1st} and G^{2nd} via graph augmentations, which help learn discriminative node embeddings for the subsequent graph encoding and contrastive tasks. To further strengthen the information of the input graph, we collectively utilize the original graph and the semantic graphs to build different graph views. Specifically, inspired by MVGRL [8], we first adopt the Personalized PageRank (PPR) [18] diffusion on the input graph to compute the diffusion matrix U:

    U = \alpha \left( I_n - (1 - \alpha) D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \right)^{-1},    (5)

where A is the adjacency matrix of the input graph, D ∈ R^{n×n} is the diagonal degree matrix of A, and α denotes the teleport probability in a random walk [11]. Furthermore, we perturb the initial node feature matrix X to generate augmented features X̃^{1st} and X̃^{2nd} of the different views, where each element of X̃^{1st} and X̃^{2nd} is set to zero with a probability r. Overall, as depicted in Figure 3, we build the two graph views G^{1st} and G^{2nd} by

    G^{1st} = \{ (\tilde{X}^{1st}, \tilde{A}^{1st}_i) \mid \tilde{A}^{1st}_0 = A, \; \tilde{A}^{1st}_j = A^{SG_j} \},
    G^{2nd} = \{ (\tilde{X}^{2nd}, \tilde{A}^{2nd}_i) \mid \tilde{A}^{2nd}_0 = U, \; \tilde{A}^{2nd}_j = A^{SG_j} \},    (6)

where i ∈ {0, ..., T} and j ∈ {1, ..., T}. By applying graph diffusion and feature perturbation, we can not only simultaneously encode the rich local and global information that exists in the original graph [8], but also extract the intrinsic feature information. As a result, each graph view is composed of a holistic graph (i.e., the original whole graph) and T semantic graphs that have been perturbed.
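A minimal sketch of the view construction of Eqs. (5)-(6) is given below. The closed-form PPR inverse and the Bernoulli feature mask follow the text, while the dense matrix inverse and the seed handling are simplifications assumed for illustration only.

import numpy as np

def ppr_diffusion(A, alpha=0.2):
    """Eq. (5): closed-form Personalized PageRank diffusion of the input graph."""
    n = A.shape[0]
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    A_norm = D_inv_sqrt @ A @ D_inv_sqrt
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * A_norm)

def mask_features(X, r, rng):
    """Feature perturbation of Eq. (6): zero each entry independently with probability r."""
    return X * (rng.random(X.shape) >= r)

def build_views(X, A, semantic_adjs, r, alpha=0.2, seed=0):
    """Assemble G^{1st} (holistic graph A) and G^{2nd} (diffusion graph U) per Eq. (6)."""
    rng = np.random.default_rng(seed)
    U = ppr_diffusion(A, alpha)
    view_1st = (mask_features(X, r, rng), [A] + list(semantic_adjs))
    view_2nd = (mask_features(X, r, rng), [U] + list(semantic_adjs))
    return view_1st, view_2nd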
4.3 Semantic-wise Graph Encoder

After generating the two graph views based on the holistic graph and semantic graphs, we employ graph neural networks (GNNs) to learn node embeddings on each view. Since the T + 1 graphs of each view depict distinct aspects of the input graph, encoding all graphs with the same GNN will result in the loss of the unique information of each graph. Therefore, we propose to learn separate GNNs for each graph of each view, which are collectively referred to as the semantic-wise graph encoder. More specifically, for each graph (X̃^m, Ã^m_i), i ∈ {0, ..., T}, where m denotes the m-th augmented graph view, we utilize graph convolutional layers [10] to learn node embeddings Z^m_i, which update node embeddings by aggregating neighbors' embeddings. The propagation of each layer can be represented as:

    Z^m_i = \sigma(\tilde{D}^{-\frac{1}{2}}_{i,m} A^m_i \tilde{D}^{-\frac{1}{2}}_{i,m} Z^m_i W^m_i),    (7)

where σ is the activation function (i.e., PReLU), A^m_i = Ã^m_i + I_n is the adjacency matrix with added self-connections, I_n is the identity matrix, D̃_{i,m} is the diagonal degree matrix of A^m_i, W^m_i ∈ R^{d_in × d_out} is the layer-specific transformation matrix for the i-th graph in the m-th view, d_in and d_out denote the input and output embedding dimensions of the propagation layer, and d_out is set to d in each propagation layer. Ultimately, the GNNs employed in the semantic-wise graph encoder are built by stacking multiple propagation layers defined in Equation 7. For the convenience of presentation, we denote Z^m_0 generated from the holistic graph as the holistic embeddings and Z^m_i, i ∈ {1, ..., T}, generated from the semantic graphs as the semantic embeddings in the following sections.

Generally, for each augmented graph view, since different semantics emphasize distinct aspects of the whole graph, we apply a weighted sum function on the T semantic embeddings before combining them with the holistic embeddings to obtain the final node embeddings Z^m_combine:

    Z^m_{combine} = \beta \sum_{i=1}^{T} w^m_i Z^m_i + Z^m_0,    (8)

where w^m_i is the coefficient that balances the different importance of the T semantics and β is the weight coefficient that controls the balance between the holistic and semantic embeddings. As a consequence, the proposed semantic-wise graph encoder generates T + 2 embeddings for each node v on each graph view, which contain the holistic embeddings Z^m_0, the T semantic embeddings Z^m_i, i ∈ {1, ..., T}, and the final node embeddings Z^m_{T+1} = Z^m_{combine}.
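A possible PyTorch sketch of the semantic-wise encoder of Eqs. (7)-(8) is shown below. It uses a single propagation layer per graph and takes the motif coefficients w^m_i and β as plain inputs (consistent with the per-dataset values reported in Appendix A); this is an assumption for illustration, not the authors' exact implementation.

import torch
import torch.nn as nn

def normalize_adj(adj):
    """Symmetric normalization of A + I used in Eq. (7)."""
    a_hat = adj + torch.eye(adj.size(0), device=adj.device)
    d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-12).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)

class SemanticWiseEncoder(nn.Module):
    """One single-layer GCN per graph of a view (holistic + T semantic graphs)."""

    def __init__(self, in_dim, out_dim, num_graphs):
        super().__init__()
        # a separate transformation matrix W_i^m for every graph of the view
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.empty(in_dim, out_dim)) for _ in range(num_graphs)])
        for w in self.weights:
            nn.init.xavier_uniform_(w)
        self.act = nn.PReLU()

    def forward(self, x, adjs, motif_coeffs, beta):
        # adjs[0] is the holistic matrix (A or U); adjs[1:] are the semantic graphs A^{SG_i}
        z = [self.act(normalize_adj(a) @ x @ w) for a, w in zip(adjs, self.weights)]
        # Eq. (8): weighted sum of the T semantic embeddings plus the holistic embeddings
        z_combine = beta * sum(c * zi for c, zi in zip(motif_coeffs, z[1:])) + z[0]
        return z + [z_combine]  # T + 2 embeddings per node for this view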
4.4 Contrastive Objective

The downstream task requires the output embeddings of the graph encoders to learn, yet the optimization objectives of the downstream task and the contrastive task are not consistent. As a consequence, before defining the contrastive objectives, we employ graph projectors to project the encoded embeddings Z^m_i to new subspaces, which has been confirmed to be an effective way to improve the representation quality [3]. Specifically, we leverage a non-linear perceptron layer to project node embeddings:

    Q^m_i = \sigma(U^m_i Z^m_i + b^m_i), \quad i \in \{0, ..., T + 1\},    (9)

where U^m_i ∈ R^{d×d} and b^m_i ∈ R^{d×1} are the transformation matrix and bias for the i-th embeddings of the m-th graph view. Furthermore, for the online network, we stack L perceptron layers as predictors to build an asymmetric architecture that prevents the training model from falling into trivial solutions:

    P^{1st}_i = \sigma(E_i Q^{1st}_i + a_i), \quad i \in \{0, ..., T + 1\},    (10)

where E_i ∈ R^{d×d} and a_i ∈ R^{d×1} are the transformation matrix and bias for the i-th embeddings in the graph predictor.
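The projector and predictor of Eqs. (9)-(10) can be sketched as small perceptron modules. In the paper one projector is kept per embedding index i ∈ {0, ..., T + 1}, so the modules below would be instantiated once per index; this per-index instantiation is our reading of the equations rather than a confirmed implementation detail.

import torch.nn as nn

class Projector(nn.Module):
    """Eq. (9): a single non-linear perceptron layer, present in both networks."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.PReLU())
    def forward(self, z):
        return self.net(z)

class Predictor(nn.Module):
    """Eq. (10): L stacked perceptron layers, present only in the online network."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(dim, dim), nn.PReLU()]
        self.net = nn.Sequential(*layers)
    def forward(self, q):
        return self.net(q)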
In the following, we introduce the instance-level and semantic-level contrastive tasks in detail.

4.4.1 Instance-level Contrastive Learning

The instance-level contrastive task regards the graph as a perceptual whole and achieves the contrastive task on the whole graph, i.e., it doesn't distinguish different semantics. In this section, we formalize the instance-level contrastive task on the holistic and final combined node embeddings since they don't distinguish diverse semantics. Traditional contrastive learning methods randomly select negative instances for each node to achieve the contrastive task. Nevertheless, random negative sampling introduces false-negative instance pairs, i.e., nodes that share similar semantics might be pulled far apart, thus weakening the utilization of node similarity. Therefore, inspired by BYOL [7], we propose to design the instance-level contrastive task without negative samples. Specifically, we utilize the normalized predicted embeddings of the online network P^{1st}_i to predict the normalized projected embeddings of the target network Q^{2nd}_i and leverage the cosine similarity to calculate the loss:

    \mathcal{L}_{Holistic} = -\frac{1}{N} \sum_{v=1}^{N} \frac{P^{1st}_{(0,v)} \cdot Q^{2nd}_{(0,v)}}{\|P^{1st}_{(0,v)}\| \, \|Q^{2nd}_{(0,v)}\|},    (11)

    \mathcal{L}_{Combine} = -\frac{1}{N} \sum_{v=1}^{N} \frac{P^{1st}_{(T+1,v)} \cdot Q^{2nd}_{(T+1,v)}}{\|P^{1st}_{(T+1,v)}\| \, \|Q^{2nd}_{(T+1,v)}\|},    (12)

where P^{1st}_{(i,v)} and Q^{2nd}_{(i,v)} denote node v's embeddings of P^{1st}_i and Q^{2nd}_i. As a result, the predicting loss only requires pushing pairs with high similarity closer.
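All three losses (Eqs. 11-13) share the same negative-cosine predicting form, which can be written compactly as below; the stop-gradient on the target branch mirrors BYOL [7] and reflects our reading of the asymmetric design, not code released with the paper.

import torch.nn.functional as F

def predicting_loss(P, Q):
    """Negative cosine similarity of Eqs. (11)-(13), averaged over nodes.

    P: online predictions P^{1st}_i (gradients flow through these).
    Q: target projections Q^{2nd}_i (treated as a fixed target).
    """
    P = F.normalize(P, dim=-1)
    Q = F.normalize(Q.detach(), dim=-1)  # stop-gradient on the target branch
    return -(P * Q).sum(dim=-1).mean()

# L_Holistic uses i = 0, L_Combine uses i = T + 1, and each L^i_Semantic uses one of
# the T semantic embeddings; the joint loss of Eq. (15) weights the semantic terms by gamma.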
4.4.2 Semantic-level Contrastive Learning

Since the instance-level contrastive learning ignores the fine-grained semantics that exist in graphs, we propose the semantic-level contrastive task to strengthen the utilization of semantic information. Specifically, we leverage the T semantic embeddings to achieve the semantic-level contrastive learning. Since different semantic graphs convey distinct information, separate contrastive objectives should be designed on the T semantic graphs to distinguish semantic nuances. Therefore, for the i-th semantic graph, we formalize the semantic predicting loss without negative samples as

    \mathcal{L}^i_{Semantic} = -\frac{1}{N} \sum_{v=1}^{N} \frac{P^{1st}_{(i,v)} \cdot Q^{2nd}_{(i,v)}}{\|P^{1st}_{(i,v)}\| \, \|Q^{2nd}_{(i,v)}\|}, \quad i \in \{1, ..., T\},    (13)

where P^{1st}_{(i,v)} and Q^{2nd}_{(i,v)} denote node v's embeddings of P^{1st}_i and Q^{2nd}_i. The whole semantic-level contrastive loss is composed of the combination of the T semantic contrastive subtasks:

    \mathcal{L}_{Semantic} = \sum_{i=1}^{T} \mathcal{L}^i_{Semantic}.    (14)
4.5 Model Training

Combining the semantic-level and instance-level contrastive learning, we train the FSGCL model by the joint loss:

    \mathcal{L}_{Joint} = \gamma \mathcal{L}_{Semantic} + \mathcal{L}_{Holistic} + \mathcal{L}_{Combine},    (15)

where γ is the weight coefficient that controls the significance of the semantic-level contrastive task. In practice, we symmetrize the loss by feeding G^{2nd} and G^{1st} to the online and target networks respectively to compute \tilde{\mathcal{L}}_{Joint}. Furthermore, we only update the parameters θ of the online network by minimizing the total loss \mathcal{L} = \mathcal{L}_{Joint} + \tilde{\mathcal{L}}_{Joint}, while the parameters ξ of the target network are stop-gradient and optimized by a slow-moving average of the online network:

    \theta \leftarrow \mathrm{optimize}(\theta, \lambda, \nabla_\theta \mathcal{L}),    (16)

    \xi \leftarrow \tau \xi + (1 - \tau)\theta,    (17)

where λ is the learning rate and τ is the decay rate. At the end of the training, we only keep the encoder of the online network and treat the final node embeddings Z^{1st}_{final} as the input of downstream inference tasks.

Since the online and target networks are initialized randomly, the randomness makes all node embeddings distinct from each other at the initial stage of training, thus achieving the effect of distinguishing different instances (i.e., the effect of negative instances). Moreover, τ is usually set to be a large number (e.g., 0.99), so the slow-moving average algorithm can not only pull augmented instance pairs (i.e., positive instance pairs) closer, but also implicitly preserve the distinction between different instances to a large extent. Compared to methods with random negative sampling, FSGCL only pulls positive instances closer, which not only reduces the computation cost, but also avoids pushing instances of similar semantics far apart during optimization, thus strengthening the utilization of semantic similarity.
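The symmetrized optimization of Eqs. (15)-(17) can be sketched as one training step. Here joint_loss is a hypothetical helper standing in for Eq. (15) composed from the encoder, projector and predictor outputs, and τ = 0.99 follows the value discussed above; the rest is a sketch, not the authors' training loop.

import torch

@torch.no_grad()
def ema_update(online, target, tau=0.99):
    """Eq. (17): slow-moving average of the online parameters into the target network."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.data.mul_(tau).add_(p_online.data, alpha=1.0 - tau)

def train_step(online, target, optimizer, view1, view2, joint_loss, tau=0.99):
    """One symmetrized update: evaluate Eq. (15) on (view1, view2) and (view2, view1)."""
    # joint_loss is a user-supplied (hypothetical) callable implementing Eq. (15)
    loss = joint_loss(online, target, view1, view2) + joint_loss(online, target, view2, view1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                 # Eq. (16): gradient step on the online parameters θ
    ema_update(online, target, tau)  # Eq. (17): EMA step on the target parameters ξ
    return float(loss)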
5 Experiments

5.1 Datasets

We conduct node classification on five public real-world datasets: Amazon-Photos, Amazon-Computers¹, Coauthor-CS, Coauthor-Physics² and WikiCS³, where we use a random train/validation/test split (10/10/80%) for the Amazon and Coauthor datasets since these four datasets have no standard splits, and follow the 20 canonical train/validation/test splits for the WikiCS dataset. The details of the datasets are listed in Table 2.

¹ https://github.com/shchur/gnn-benchmark/tree/master/data/npz
² https://github.com/shchur/gnn-benchmark/tree/master/data/npz
³ https://github.com/pmernyei/wiki-cs-dataset/raw/master/dataset

5.2 Baselines and Experimental Settings

We compare our FSGCL with the semi-supervised GCN [10] and 8 state-of-the-art unsupervised contrastive learning methods, including DGI [27], GMI [21], MVGRL [8], GRACE [40], GCA [41], BGRL [24], ProGCL [30] and AFGRL [13]. We use the authors' released codes and follow the papers' guidance to train all comparison methods. Since the three motifs in Figure 2 are the most basic and simplest motifs consisting of three nodes and four nodes, we utilize the three motifs shown in Figure 2 in this paper, which are denoted as Motif-A, B, C in the following. More detailed experimental settings and parameter sensitivity analysis are presented in the Supplementary Material due to the limitation of pages.
Table 1. Node classification accuracies (± std) results (in %) on five real-world datasets. OOM: out of memory on a 24GB GPU. We emphasize the highest
values in bold and the second ones with underlines.
Amazon-Photos Amazon-Computers Coauthor-CS Coauthor-Physics WikiCS
GCN 93.18 ± 0.18 89.37 ± 0.29 92.34 ± 0.04 95.06 ± 0.06 79.19 ± 0.60
DGI 91.61 ± 0.22 83.95 ± 0.47 92.15 ± 0.63 94.51 ± 0.52 75.35 ± 0.14
GMI 90.68 ± 0.17 82.21 ± 0.31 OOM OOM 74.85 ± 0.08
MVGRL 91.74 ± 0.07 87.52 ± 0.11 92.11 ± 0.12 95.33 ± 0.03 77.52 ± 0.08
GRACE 92.59 ± 0.16 88.55 ± 0.30 89.81 ± 0.19 OOM 79.27 ± 0.67
GCA 92.39 ± 0.32 87.57 ± 0.45 92.49 ± 0.14 OOM 79.14 ± 0.25
BGRL 93.23 ± 0.28 90.42 ± 0.15 92.68 ± 0.14 95.37 ± 0.10 80.07 ± 0.49
ProGCL 92.01 ± 0.24 87.44 ± 0.38 OOM OOM 79.61 ± 0.63
AFGRL 93.04 ± 0.26 89.41 ± 0.34 93.27 ± 0.17 95.71 ± 0.10 77.78 ± 0.42
FSGCL 94.65 ± 0.16 90.54 ± 0.21 94.22 ± 0.07 96.10 ± 0.08 80.25 ± 0.02

Table 2. Description of datasets and corresponding matching numbers of motif instances.

Datasets    Amazon-Photos  Amazon-Computers  Coauthor-CS  Coauthor-Physics  WikiCS
Nodes       7,650          13,572            18,333       34,493            11,701
Edges       119,081        245,861           81,894       247,962           216,123
Attributes  754            767               6,805        8,415             300
Classes     8              10                15           5                 10
Motif A     12,136,593     42,532,480        1,409,965    7,499,633         36,875,788
Motif B     717,400        1,527,469         85,799       468,550           3,224,375
Motif C     14,516,676     38,630,055        202,620      1,886,051         79,649,534

5.3 Experimental Results

We repeat all experiments 5 times and report the average accuracies (± std) on the five real-world datasets. The experimental results are reported in Table 1, where we emphasize the highest values in bold and the second ones with underlines. Overall, we summarize the major observations as follows:

• Our proposed FSGCL consistently outperforms the baselines of graph contrastive learning, which can be attributed to the ability of graph motifs to mine the fine-grained semantics that exist in graphs. Therefore, distinguishing nuances between diverse semantics in contrastive learning helps learn more informative node embeddings, thus boosting the performances of downstream tasks.
• It can be observed that FSGCL has large improvements on the Amazon-Photos and Coauthor datasets, while the improvements are relatively small on Amazon-Computers and WikiCS. We conjecture that this phenomenon is because the matching numbers of motif instances are higher on Amazon-Computers and WikiCS than on other datasets (shown in Table 2). To be more specific, we extract the non-zero node pairs in the motif co-occurrence matrices, and select top-K elements for each node based on the feature similarity of the aforementioned non-zero node pairs to construct the semantic graphs. As a result, when there exist excess motif instances, a large amount of non-zero node pairs in the co-occurrence matrices are extracted, resulting in the adjacency matrices of the semantic graphs being extremely close to the top-K matrices of feature similarity, thus having less informative semantics and achieving smaller improvements.
• Contrastive learning methods that don't require negative instances (i.e., our FSGCL, BGRL and AFGRL) perform better than those with negative sampling (i.e., DGI, GMI, MVGRL, GRACE, GCA and ProGCL). What's more, methods without negative sampling are more scalable and can easily be applied to larger datasets such as Coauthor-CS and Coauthor-Physics. It verifies that removing the procedure of negative sampling can not only reduce the computation cost, but also improve the embedding quality since it avoids pushing instances of similar semantics far apart while optimizing.

5.4 Visualization Analysis on Synthetic Graphs

To provide a more intuitive understanding of the functionality of mining fine-grained semantics in FSGCL, we employ the well-known LFR toolkit [12] to generate a synthetic graph with overlapping communities, i.e., each node might exist in multiple communities, which can simulate the underlying diverse semantics more explicitly. Concretely, we generate a graph with 10,000 nodes and 8 communities, where 8,000 nodes simultaneously belong to two different communities (more detailed parameters are introduced in the Supplementary Material). Then, the multilabel K nearest neighbors algorithm [35] is applied on the frozen node embeddings generated from ProGCL, AFGRL, BGRL and our FSGCL, where the ground-truth communities are used as labels.

Figure 4. Visualization analysis on the synthetic datasets: (a) ProGCL, (b) AFGRL, (c) BGRL, (d) FSGCL.

Figure 4 depicts the heatmaps of classification accuracies on the 8 communities, where the value of the i-th row and j-th column denotes the proportion of nodes belonging to both community i and j that are accurately classified. Therefore, the diagonal elements of the heatmaps imply the predicting accuracies of nodes which have only one community, while the non-diagonal elements imply the accuracies for nodes which have two distinct communities. It is apparent that ProGCL, AFGRL and BGRL have higher predicting accuracies on the diagonals compared with the non-diagonal parts of the heatmaps, which indicates that existing methods learn poor-quality embeddings for nodes that simultaneously participate in multiple communities since they ignore the distinct semantics that exist in graphs. Instead, our FSGCL generates higher accuracies overall, especially on the non-diagonal parts of the heatmaps, which demonstrates that FSGCL can indeed distinguish fine-grained semantics and boost the expressive ability.
Table 3. Ablation study of key components of FSGCL.

Datasets         Amazon-Photos  Amazon-Computers  Coauthor-CS  Coauthor-Physics  WikiCS
FSGCL            94.65          90.54             94.22        96.10             80.25
w/o w_i^m        94.07          90.45             94.20        96.08             80.26
w/o slow         92.20          89.80             92.09        95.75             78.45
w/o A_SGi        93.13          90.05             93.32        95.75             79.55
w/o top-k A_SGi  93.97          89.96             94.10        96.03             78.97
w/o L_Semantic   94.19          90.49             93.92        96.02             80.14
w/o L_Holistic   94.15          90.11             93.87        96.02             79.60

Table 4. Ablation study of different motif combinations of FSGCL.

Datasets   Amazon-Photos  Amazon-Computers  Coauthor-CS  Coauthor-Physics  WikiCS
FSGCL      94.65          90.54             94.22        96.10             80.25
motif-A    94.60          90.51             93.81        96.02             79.47
motif-B    93.74          89.97             93.49        95.85             79.89
motif-C    94.11          90.21             93.54        95.85             79.86
motif-AB   94.49          90.50             94.15        96.07             79.98
motif-AC   94.62          90.52             94.16        96.05             79.99
motif-BC   94.29          94.30             93.78        95.83             80.14

5.5 Ablation Study

The ablation study is performed on six variants to investigate the effectiveness of key components:

• w/o w_i^m: it removes the weighted coefficients of different motifs in Eq. (8), i.e., it trains with an average weighted sum in Eq. (8).
• w/o slow: it replaces the slow-moving average algorithm with the random negative sampling strategy to optimize the model.
• w/o A_SGi: it sets A_SGi = A, i ∈ {1, ..., T}, i.e., it trains with the original graphs instead of the semantic graphs.
• w/o top-k A_SGi: it sets A_SGi = topK(C), i ∈ {1, ..., T}, i.e., it substitutes the top-K matrices of feature similarity for the adjacency matrices of the semantic graphs.
• w/o L_Semantic and w/o L_Holistic: they set L_Semantic = 0 and L_Holistic = 0, respectively.

The results of FSGCL and its variants are reported in Table 3. From this table, we can conclude that:

• w/o w_i^m performs slightly worse than FSGCL in most situations, demonstrating that considering the different contributions made by distinct motifs helps improve the performances on downstream tasks. It should be noted that w/o w_i^m is still superior to all comparison methods in Table 1, demonstrating that our FSGCL can achieve improvements with only a few parameters.
• w/o slow performs much worse than FSGCL, proving that the non-negative schema (i.e., the slow-moving average algorithm) can indeed improve the embedding quality by reducing the information loss of semantics.
• We observe a performance drop in w/o A_SGi and w/o top-k A_SGi, verifying the effectiveness of graph motifs in capturing informative semantic nuances.
• Compared with w/o L_Semantic and w/o L_Holistic, we can see that both the semantic-level and instance-level contrastive losses boost the performance of FSGCL, confirming the necessity of combining these two contrastive tasks.

Besides, to investigate how graph motifs affect the performance of FSGCL, we conduct experiments on all combinations of Motif-A, B, C used in this paper. From Table 4, we can find that all combinations are worse than FSGCL which contains all three motifs, demonstrating that more graph motifs can help mine more semantic information and boost the expressive ability of node embeddings. It should be noted that our method can be seamlessly extended to utilize more complex graph motifs; we don't conduct extra experiments with more motifs in this paper.

Figure 5. Runtime of FSGCL and three comparison methods on Amazon-Photos and Amazon-Computers: (a) Amazon-Photos, (b) Amazon-Computers.

5.6 Runtime Analysis

In this section, we compare the runtime of our FSGCL and the three methods (ProGCL, AFGRL and BGRL) that perform the best among all comparison methods. Specifically, we not only report the model training time of FSGCL, but also report the time of semantic construction, i.e., the time of generating motif instances and constructing semantic graphs in the preprocessing stage. The runtime results on Amazon-Photos and Amazon-Computers are depicted in Figure 5:

• Compared with ProGCL, which requires negative samples in model training, FSGCL reduces training time by removing the need for massive negative instances.
• FSGCL requires less training time than AFGRL. Though AFGRL also adopts the schema of non-negative training, this phenomenon is reasonable since AFGRL performs K-means clustering in each iteration, which is more time-consuming than our FSGCL that constructs semantic graphs during data preprocessing.
• Although FSGCL consumes more total time than BGRL, FSGCL takes less model training time than BGRL, which might be because FSGCL converges faster by capturing fine-grained semantics and thus requires fewer iterations and less training time.

5.7 Conclusions

In this paper, we propose a fine-grained semantics enhanced graph contrastive learning model, FSGCL. Concretely, FSGCL first employs graph motifs to construct multiple semantic graphs to mine the distinct semantics from the perspective of input data. Then, the semantic-level contrastive task without negative samples is introduced to further enhance the utilization of fine-grained latent semantics from the perspective of model training. The conducted experiments on five real-world datasets indicate that FSGCL is consistently superior to state-of-the-art methods, verifying the effectiveness of our proposed model.
Figure 6. Parameter sensitivity analysis of τ, k, r, γ and β on five real-world datasets: (a) Amazon-Photos, (b) Amazon-Computers, (c) Coauthor-CS, (d) Coauthor-Physics, (e) WikiCS.

Appendix A Experimental Settings

The proposed model FSGCL is implemented with PyTorch [19] and DGL [28] and trained with the AdamW optimizer with the base learning rate η_b = 0.001 and weight decay of 10^{-5}. Following the settings of BGRL [24], the learning rate is annealed via a cosine schedule during the n_total steps with an initial warmup period of n_w steps. As a result, the learning rate at step i is computed as:

    \eta_i = \begin{cases} \frac{i \times \eta_b}{n_w}, & \text{if } i \le n_w, \\ \eta_b \times \left(1 + \cos \frac{(i - n_w)\pi}{n_{total} - n_w}\right) \times 0.5, & \text{if } n_w \le i \le n_{total}. \end{cases}    (18)
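Eq. (18) corresponds to a linear warmup followed by a cosine decay; a small sketch is given below, where the warmup and total step counts are placeholders rather than values reported in the paper (only the base learning rate of 0.001 is).

import math

def learning_rate(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000):
    """Cosine learning-rate schedule with linear warmup, Eq. (18)."""
    if step <= warmup_steps:
        return step * base_lr / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))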
For node classification, the final evaluation is achieved by fitting an l2-regularized LogisticRegression classifier from Scikit-Learn [20] using the liblinear solver on the frozen node embeddings, where the regularization strength is chosen by grid search from {2^{-10}, 2^{-9}, ..., 2^{9}, 2^{10}}.
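The evaluation protocol described above can be sketched with Scikit-Learn as follows; the use of GridSearchCV with 3-fold cross-validation is an assumption about how the regularization strength is selected, as the paper only states that it is chosen by grid search over {2^{-10}, ..., 2^{10}}.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def evaluate(embeddings, labels, train_idx, test_idx):
    """l2-regularized logistic regression on frozen embeddings, C chosen by grid search."""
    grid = GridSearchCV(
        LogisticRegression(solver="liblinear", max_iter=1000),
        {"C": 2.0 ** np.arange(-10, 11)},  # {2^-10, ..., 2^10}
        cv=3)
    grid.fit(embeddings[train_idx], labels[train_idx])
    return grid.score(embeddings[test_idx], labels[test_idx])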
The dimension of the hidden and output representations is set to 512. The values of k, τ, β, γ and the number of GCN layers are set to 7, 0.99, 0.5, 0.5, 2 for Amazon-Computers and WikiCS, and 5, 0.996, 1, 1, 1 for the other datasets. The number of MLP layers L of the online predictor is set to 2 for the Amazon and Wiki datasets, and 1 for the Coauthor datasets. The augmentation ratio r is set to 0.2 for Amazon-Computers, 0.1 for WikiCS and 0.3 for the other datasets. As for the motif coefficients w_i^m, i ∈ {Motif-A, B, C}, we set the values of w^m to [0.7, 0.1, 0.2] for Amazon-Photos, [0.3, 0.1, 0.6] for Amazon-Computers, [0.4, 0.2, 0.4] for Coauthor-CS, [0.5, 0.45, 0.05] for Coauthor-Physics and [0.1, 0.5, 0.4] for WikiCS. The value of α is set to 0.2 on all datasets.

Appendix B Settings of Synthetic Dataset

We employ the well-known LFR toolkit [12] to generate a synthetic graph with 10,000 nodes and 8 overlapping communities. More specifically, we set the average degree k and maximum degree to 20 and 50 respectively, the mixing parameter mu (each node shares a fraction of its edges with nodes in other communities) to 0.2, the minimum number minc and maximum number maxc of community sizes to 1500 and 3000, the number of overlapping nodes on to 8,000, and the number of memberships of the overlapping nodes om to 2.

Appendix C Parameter Sensitivity Analysis

In this section, we investigate the sensitivity of five major hyperparameters: the decay rate τ of the slow-moving average strategy, k that is used to select the top-K elements to build the semantic graphs, the augmentation ratio r, the weight coefficient γ that controls the significance of the semantic-level contrastive objective, and the weight coefficient β that balances the holistic and semantic embeddings. The results for the above parameters are illustrated in Figure 6, where τ takes its value from the list {0.9, 0.93, 0.96, 0.99, 0.993, 0.996, 0.999}, k ranges in {3, 5, 7, 9, 11, 13, 15}, r ranges in {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7}, and γ and β take their values from {0.1, 0.3, 0.5, 0.7, 0.9, 1, 2}.

As shown in Figure 6, FSGCL achieves higher performances when τ ranges from 0.99 to 0.999, demonstrating that a larger value of τ is beneficial for the slow-moving average strategy for learning discriminative embeddings. Besides, FSGCL always performs better when k is small, which is reasonable since the semantic graphs own more unique semantic information when k is smaller. For the augmentation ratio r, the performances drop sharply when r becomes too large, indicating that over-perturbation leads to the loss of useful information in the original graphs. For the weight coefficients γ and β, it can be observed that our model is relatively stable with regard to γ, while the performance decreases when β is too small or too large.
References

[1] Kimitaka Asatani, Junichiro Mori, Masanao Ochi, and Ichiro Sakata, 'Detecting trends in academic research from a citation network using network representation learning', PloS one, 13(5), e0197260, (2018).
[2] Chris Biemann, Lachezar Krumov, Stefanie Roos, and Karsten Weihe, 'Network motifs are a powerful tool for semantic distinction', in Towards a Theoretical Framework for Analyzing Complex Linguistic Networks, 83–105, Springer, (2016).
[3] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, 'A simple framework for contrastive learning of visual representations', in International conference on machine learning, pp. 1597–1607. PMLR, (2020).
[4] Bo Dai and Dahua Lin, 'Contrastive learning for image captioning', Advances in Neural Information Processing Systems, 30, (2017).
[5] Manoj Reddy Dareddy, Mahashweta Das, and Hao Yang, 'motif2vec: Motif aware node representation learning for heterogeneous networks', in 2019 IEEE International Conference on Big Data (Big Data), pp. 1052–1059. IEEE, (2019).
[6] Travis Ebesu and Yi Fang, 'Neural citation network for context-aware citation recommendation', in Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, pp. 1093–1096, (2017).
[7] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al., 'Bootstrap your own latent - a new approach to self-supervised learning', Advances in neural information processing systems, 33, 21271–21284, (2020).
[8] Kaveh Hassani and Amir Hosein Khasahmadi, 'Contrastive multi-view representation learning on graphs', in International Conference on Machine Learning, pp. 4116–4126. PMLR, (2020).
[9] Shuting Jin, Xiangxiang Zeng, Feng Xia, Wei Huang, and Xiangrong Liu, 'Application of deep learning methods in biological networks', Briefings in bioinformatics, 22(2), 1902–1917, (2021).
[10] Thomas N Kipf and Max Welling, 'Semi-supervised classification with graph convolutional networks', arXiv preprint arXiv:1609.02907, (2016).
[11] Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann, 'Diffusion improves graph learning', arXiv preprint arXiv:1911.05485, (2019).
[12] Andrea Lancichinetti and Santo Fortunato, 'Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities', Physical Review E, 80(1), 016118, (2009).
[13] Namkyeong Lee, Junseok Lee, and Chanyoung Park, 'Augmentation-free self-supervised learning on graphs', in Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7372–7380, (2022).
[14] Haoyang Li, Xin Wang, Ziwei Zhang, Zehuan Yuan, Hang Li, and Wenwu Zhu, 'Disentangled contrastive learning on graphs', Advances in Neural Information Processing Systems, 34, 21872–21884, (2021).
[15] Shuai Lin, Chen Liu, Pan Zhou, Zi-Yuan Hu, Shuojia Wang, Ruihui Zhao, Yefeng Zheng, Liang Lin, Eric Xing, and Xiaodan Liang, 'Prototypical graph contrastive learning', IEEE Transactions on Neural Networks and Learning Systems, (2022).
[16] Giulia Muzio, Leslie O'Bray, and Karsten Borgwardt, 'Biological network analysis with deep learning', Briefings in bioinformatics, 22(2), 1515–1530, (2021).
[17] David F Nettleton, 'Data mining of social networks represented as graphs', Computer Science Review, 7, 1–34, (2013).
[18] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, 'The pagerank citation ranking: Bringing order to the web.', Technical report, Stanford InfoLab, (1999).
[19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al., 'Pytorch: An imperative style, high-performance deep learning library', Advances in neural information processing systems, 32, (2019).
[20] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al., 'Scikit-learn: Machine learning in python', the Journal of machine Learning research, 12, 2825–2830, (2011).
[21] Zhen Peng, Wenbing Huang, Minnan Luo, Qinghua Zheng, Yu Rong, Tingyang Xu, and Junzhou Huang, 'Graph representation learning via graphical mutual information maximization', in Proceedings of The Web Conference 2020, pp. 259–270, (2020).
[22] Aravind Sankar, Xinyang Zhang, and Kevin Chen-Chuan Chang, 'Motif-based convolutional neural network on graphs', arXiv preprint arXiv:1711.05697, (2017).
[23] Qiaoyu Tan, Ninghao Liu, and Xia Hu, 'Deep representation learning for social network analysis', Frontiers in big Data, 2, 2, (2019).
[24] Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Mehdi Azabou, Eva L Dyer, Remi Munos, Petar Veličković, and Michal Valko, 'Large-scale representation learning on graphs via bootstrapping', arXiv preprint arXiv:2102.06514, (2021).
[25] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola, 'What makes for good views for contrastive learning?', Advances in Neural Information Processing Systems, 33, 6827–6839, (2020).
[26] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio, 'Graph attention networks', arXiv preprint arXiv:1710.10903, (2017).
[27] Petar Velickovic, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm, 'Deep graph infomax.', ICLR (Poster), 2(3), 4, (2019).
[28] Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, et al., 'Deep graph library: A graph-centric, highly-performant package for graph neural networks', arXiv preprint arXiv:1909.01315, (2019).
[29] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip, 'A comprehensive survey on graph neural networks', IEEE transactions on neural networks and learning systems, 32(1), 4–24, (2020).
[30] Jun Xia, Lirong Wu, Ge Wang, Jintao Chen, and Stan Z Li, 'Progcl: Rethinking hard negative mining in graph contrastive learning', in International Conference on Machine Learning, pp. 24332–24346. PMLR, (2022).
[31] Yaochen Xie, Zhao Xu, Jingtun Zhang, Zhengyang Wang, and Shuiwang Ji, 'Self-supervised learning of graph neural networks: A unified review', IEEE Transactions on Pattern Analysis and Machine Intelligence, (2022).
[32] Minghao Xu, Hang Wang, Bingbing Ni, Hongyu Guo, and Jian Tang, 'Self-supervised graph-level representation learning with local and global structure', in International Conference on Machine Learning, pp. 11548–11558. PMLR, (2021).
[33] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen, 'Graph contrastive learning with augmentations', Advances in Neural Information Processing Systems, 33, 5812–5823, (2020).
[34] Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang, 'Network representation learning: A survey', IEEE transactions on Big Data, 6(1), 3–28, (2018).
[35] Min-Ling Zhang and Zhi-Hua Zhou, 'Ml-knn: A lazy learning approach to multi-label learning', Pattern recognition, 40(7), 2038–2048, (2007).
[36] Shichang Zhang, Ziniu Hu, Arjun Subramonian, and Yizhou Sun, 'Motif-driven contrastive learning of graph representations', arXiv preprint arXiv:2012.12533, (2020).
[37] Zaixi Zhang, Qi Liu, Hao Wang, Chengqiang Lu, and Chee-Kong Lee, 'Motif-based graph self-supervised learning for molecular property prediction', Advances in Neural Information Processing Systems, 34, 15870–15882, (2021).
[38] Ziyang Zhang, Chuan Chen, Yaomin Chang, Weibo Hu, Xingxing Xing, Yuren Zhou, and Zibin Zheng, 'Shne: Semantics and homophily preserving network embedding', IEEE Transactions on Neural Networks and Learning Systems, (2021).
[39] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun, 'Graph neural networks: A review of methods and applications', AI open, 1, 57–81, (2020).
[40] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang, 'Deep graph contrastive representation learning', arXiv preprint arXiv:2006.04131, (2020).
[41] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang, 'Graph contrastive learning with adaptive augmentation', in Proceedings of the Web Conference 2021, pp. 2069–2080, (2021).
