
2019 IEEE International Conference on Data Mining (ICDM)

Interpretable Feature Learning of Graphs using Tensor Decomposition
Shah Muhammad Hamdi and Rafal Angryk
Department of Computer Science, Georgia State University,
Atlanta, GA, 30302, United States.
Email: {shamdi1, rangryk}@cs.gsu.edu

Abstract—In recent years, node embedding algorithms, which learn low dimensional vector representations for nodes in a graph, have been one of the key research interests of the graph mining community. The existing algorithms either rely on computationally expensive eigendecomposition of the large matrices, or require tuning of the word embedding-based hyperparameters as a result of representing the graph as a node sequence similar to the sentences in a document. Moreover, the latent features produced by these algorithms are hard to interpret. In this paper, we present two novel tensor decomposition-based node embedding algorithms, that can learn node features from arbitrary types of graphs: undirected, directed, and/or weighted, without relying on eigendecomposition or word embedding-based hyperparameters. Both algorithms preserve the local and global structural properties of the graph by using k-step transition probability matrices to construct third-order multidimensional arrays or tensors and perform CANDECOMP/PARAFAC (CP) decomposition in order to produce an interpretable and low dimensional vector space for the nodes. Our experiments encompass different types of graphs (undirected/directed, unweighted/weighted, sparse/dense) of different domains such as social networking and neuroscience. Our experimental evaluation proves our models to be interpretable with respect to the understandability of the feature space, precise with respect to network reconstruction and link prediction, and accurate with respect to node classification and graph classification.

Index Terms—Node embedding, Tensor decomposition, Interpretability of feature space

I. INTRODUCTION

Graphs are one of the most ubiquitous data structures used in computer science and related fields. By capturing the interactions between individual entities, graphs facilitate discovering the underlying complex structure of a system. Mining real-life graphs plays an important role in studying the network behavior of different domains such as social sciences (social network), linguistics (word co-occurrence network), biology (protein-protein interaction network), neuroscience (brain network) and so on. Recently, there has been a surge of research interest in embedding graph structures, such as nodes, edges, subgraphs, and the whole graph, in a low dimensional vector space [1]. Among them, representation learning of the nodes is most widely studied, which facilitates downstream machine learning tasks such as network reconstruction, link prediction, node classification, and graph classification.

In recent years, a good number of node embedding algorithms have been proposed. They can be roughly divided into three categories: matrix decomposition-based approaches, multihop similarity-based approaches, and random walk-based approaches. Most matrix decomposition-based approaches decompose various matrix representations of graphs by eigendecomposition or Singular Value Decomposition (SVD). Multihop similarity-based approaches consider the higher-order proximities of the nodes, and use matrix factorization for decomposing higher-order proximity matrices (e.g., GraRep [2], AROPE [3]). Random walk-based approaches consider the input graph as a set of random walks from each node (e.g., Node2vec [4], DeepWalk [5]). These random walks are considered as sentences, where the nodes are considered as words in a Natural Language Processing (NLP) model. Finally, the Skip-gram model [6] is used to find the node embeddings.

While eigendecomposition of the large real-world networks is very expensive, random walk-based methods are comparatively scalable. But the random walk-based approaches require the tuning of a number of hyperparameters, some of which are NLP-based. For example, Node2vec requires tuning of several hyperparameters such as context size, walks per node, walk length, return parameter and in-out parameter. Moreover, almost all the node embedding algorithms represent the nodes as d-dimensional vectors, and do not provide any direction to the interpretability of the features.

In this work, we propose two algorithms: Tensor Decomposition-based Node Embedding (TDNE) and Tensor Decomposition-based Node Embedding per Slice (TDNEpS). TDNE uses higher-order transition probability matrices of a graph to construct one third-order tensor, while TDNEpS considers each transition probability matrix as one third-order tensor. Both algorithms use higher-order transition probability matrices of a graph to construct one or more third-order tensors, and perform CP decomposition to get the representations of the nodes and the representations of the transition steps. The algorithms do not rely on eigendecomposition of large matrices, or tuning of the NLP-based hyperparameters such as context size.

The main contributions of this work are:
1) To the best of our knowledge, this work is the first attempt to learn embeddings of the transition steps (one kind of pairwise proximity [2]).
2) Our method provides interpretability by creating a feature space for the nodes, where the role of each feature is understandable.

3) When we have a set of graphs, where each graph consists of the same labeled node set, we use the learned representations of the nodes of each graph for embedding the whole graphs. Therefore, in addition to evaluating our algorithms in single graph-based tasks such as node classification and link prediction, we have evaluated our algorithms in multi-graph-based tasks such as graph classification.

Fig. 1: CP decomposition of a third-order tensor.
The rest of the paper is organized as follows. In section II, we discuss the related work. We discuss the graph notations used in this paper and some preliminaries of tensor decomposition in section III. We present our algorithms TDNE and TDNEpS in section IV. In section V, we present the experimental findings. We conclude the paper in section VI.

II. RELATED WORK

Early works on node embedding were basically dimensionality reduction techniques, which required the matrix factorization of the first-order proximity matrix or adjacency matrix. Laplacian Eigenmaps [7] and Locally Linear Embedding (LLE) [8] can be viewed as those early approaches. After creating a knn graph from the feature space of the data, Laplacian Eigenmaps embeds the nodes by eigendecomposition of the graph Laplacian. LLE considers that each node is a linear combination of its neighbors, and finds the solution by singular value decomposition of a sparse matrix, which is calculated by subtracting the normalized adjacency matrix from the same-sized identity matrix. Later approaches such as GraRep [2] and Higher Order Proximity preserved Embedding (HOPE) [9] considered higher-order proximities of the nodes. GraRep utilizes the multihop neighborhood of the nodes by incorporating higher powers of the adjacency matrix and generates node embeddings by successive singular value decomposition of the powers of the log-transformed, probabilistic adjacency matrix. HOPE measures overlap between node neighborhoods, where Jaccard similarity, Adamic-Adar score, Katz score or Personalized PageRank score can be used as overlap calculating functions. The asymmetric transitivity preserving nature of HOPE enables embedding of the nodes of a directed graph. Reliance on eigendecomposition or singular value decomposition of large matrices makes all the matrix factorization-based approaches computationally expensive, and results in a compromise of performance due to poor approximation.

Being inspired by the Skip-gram model [6], which learns word embeddings by employing a fixed sliding window so that words in similar contexts have similar representations, DeepWalk [5] considered the network as a "document". By applying truncated random walks, DeepWalk sampled sequences of nodes (similar to the words of a document) and used Stochastic Gradient Descent (SGD) optimization to learn the representation of each node so that it is similar to the representations of its neighbor nodes. Node2vec [4] later increased the flexibility of node sampling by incorporating a biased random walk. Although both methods are able to achieve more scalability than the matrix factorization-based methods, dependence on a local neighborhood window keeps them from reaching the globally optimal solution.

To capture the highly non-linear structures of graphs, deep learning has been used by Structural Deep Network Embedding (SDNE) [10], Deep Neural Networks for Learning Graph Representation (DNGR) [11], and Graph Convolutional Network (GCN) [12]. SDNE and DNGR use a deep autoencoder to learn a node representation from its global neighborhood vector. GCN becomes comparatively more scalable by defining a convolution operator on the graph, which iteratively aggregates embeddings of the local neighborhood to reach the global optima. Although deep learning-based models result in high accuracy, the scalability is compromised because of their high training time.

While all the previous node embedding algorithms produce node features that are not easily interpretable, our tensor decomposition-based node embedding algorithms use arbitrary-order proximity to generate an interpretable feature space for the nodes.

III. NOTATIONS AND PRELIMINARIES

A. Graph notations

Definition 1. (Graph) A graph with n nodes is defined as G = (V, E), where V = {v1, v2, v3, ..., vn} is the set of nodes, and E = {eij} (i, j = 1, ..., n) is the set of edges, which are the relationships between the nodes. The adjacency matrix S of the graph has n rows and n columns. For unweighted graphs, Sij = 1 if there exists an edge between nodes i and j, and Sij = 0 otherwise. For weighted graphs, Sij ≠ 0 represents the positive/negative weight of the relationship between nodes i and j, while Sij = 0 means no relationship between them. For undirected graphs, the adjacency matrix S is symmetric, i.e., Sij = Sji. For directed graphs, the adjacency matrix S is not necessarily symmetric, i.e., Sij ≠ Sji in general.

Definition 2. (1-step transition probability matrix) The 1-step transition probability between nodes i and j for both directed and undirected graphs is defined as the normalized edge weight between those nodes. Therefore, the 1-step transition probability matrix is found by normalizing each row of the adjacency matrix S:

Aij = Sij / Σj Sij

Definition 3. (k-step transition probability matrix) For preserving the global structural similarity, we use the k-step transition probability matrix A^k, which is the k-th power of the 1-step transition probability matrix. In this matrix, A^k_ij represents the transition probability from node i to node j in exactly k steps.
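To make Definitions 2 and 3 concrete, a minimal NumPy sketch is given below; it assumes a dense adjacency matrix, and the function and variable names are ours rather than part of the paper's released code.

```python
import numpy as np

def transition_matrices(S, K):
    """Sketch of Definitions 2 and 3: row-normalize the adjacency matrix S to
    get the 1-step transition probability matrix A, then return A together
    with its first K powers A^1, ..., A^K."""
    row_sums = S.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0            # guard isolated nodes against division by zero
    A = S / row_sums                          # A_ij = S_ij / sum_j S_ij
    powers, Ak = [], np.eye(S.shape[0])
    for _ in range(K):
        Ak = Ak @ A                           # A^k gives k-step transition probabilities
        powers.append(Ak.copy())
    return A, powers
```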

Fig. 2: CP decomposition-based representation learning of source nodes, target nodes, and transition steps. (a) Making of a third-order tensor from powers of the 1-step transition probability matrix. (b) CP decomposition and the extraction of three factor matrices (source factor matrix A, target factor matrix B, transition step factor matrix C).
B. Preliminaries of Tensor Decomposition

Tensors are multidimensional arrays. In our proposed method of node embedding using tensor decomposition, we consider third-order tensors and CP decomposition. In this section, we briefly review the CP decomposition.

CP decomposition: CP decomposition factorizes the tensor into a sum of rank-one tensors [13]. Given a third-order tensor X ∈ R^(I×J×K), where I, J and K denote the indices of tensor elements in three of its modes, CP decomposition factorizes the tensor in the following way:

X ≈ Σ_{r=1}^{R} ar ∘ br ∘ cr = [[A, B, C]]    (1)

Here, ∘ denotes the outer product of the vectors, R is the tensor rank, which is a positive integer, and ar, br, and cr are vectors, where ar ∈ R^I, br ∈ R^J, and cr ∈ R^K for r = 1, 2, 3, ..., R. After stacking those vectors, we can get the factor matrices A = [a1, a2, ..., aR], B = [b1, b2, ..., bR], and C = [c1, c2, ..., cR], where A ∈ R^(I×R), B ∈ R^(J×R), and C ∈ R^(K×R). Fig. 1 is a visualization of the CP decomposition of a third-order tensor.

The matricized forms of the tensor X are given by

X(1) ≈ A(C ⊙ B)^T
X(2) ≈ B(C ⊙ A)^T
X(3) ≈ C(B ⊙ A)^T

where ⊙ represents the Khatri-Rao product of two matrices.

ALS Solution of CP Decomposition: CP decomposition can be solved by Alternating Least Squares (ALS) [14]. The cost function of CP decomposition can be formulated as

min_{A,B,C} ‖X − Σ_{r=1}^{R} ar ∘ br ∘ cr‖²_F    (2)

where ‖·‖_F is the tensor Frobenius norm, whose square is the sum of squares of all elements of the tensor. By initializing B and C with random values, ALS updates A by the following rule:

A ← argmin_A ‖X(1) − A(C ⊙ B)^T‖²_F    (3)

Then, by fixing A and C, it updates B by

B ← argmin_B ‖X(2) − B(C ⊙ A)^T‖²_F    (4)

Finally, by fixing A and B, it updates C by

C ← argmin_C ‖X(3) − C(B ⊙ A)^T‖²_F    (5)

Equations 3, 4 and 5 are repeated until the convergence of equation 2.

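A minimal NumPy sketch of these ALS updates is shown below; the unfolding and Khatri-Rao helpers follow the conventions of [13], the random initialization and fixed iteration count are our simplifications, and all names are ours (the implementation reported in Section V relies on the TensorLy library instead).

```python
import numpy as np

def unfold(X, mode):
    # Mode-n matricization of a third-order tensor, following Kolda & Bader [13]
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

def khatri_rao(U, V):
    # Column-wise Kronecker product of U (K x R) and V (J x R) -> (K*J) x R
    K, R = U.shape
    J, _ = V.shape
    return (U[:, None, :] * V[None, :, :]).reshape(K * J, R)

def cp_als(X, R, n_iter=100):
    """Sketch of CP-ALS for a third-order tensor X (equations 2-5)."""
    I, J, K = X.shape
    rng = np.random.default_rng(0)
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))   # B and C are randomly initialized, as in the text
    C = rng.standard_normal((K, R))
    for _ in range(n_iter):
        # Eq. (3): update A with B, C fixed
        A = unfold(X, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
        # Eq. (4): update B with A, C fixed
        B = unfold(X, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
        # Eq. (5): update C with A, B fixed
        C = unfold(X, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
    return A, B, C
```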
IV. THE PROPOSED MODELS

In this section, we discuss two algorithms for tensor decomposition-based node embedding: TDNE and TDNEpS.

A. Tensor Decomposition-based Node Embedding (TDNE)

Fig. 2 describes our model of tensor decomposition-based node embedding (TDNE). Without loss of generality, we use an example of a directed graph in the figure. In TDNE, a third-order tensor X ∈ R^(n×n×K) is constructed by stacking the k-step transition probability matrices for k = 1, 2, 3, ..., K. The objects represented by the three modes of this tensor are: nodes (as sources), nodes (as targets), and transition steps. Then CP decomposition is performed with a given rank R. CP decomposition results in vectors ar ∈ R^n, br ∈ R^n, and cr ∈ R^K for r = 1, 2, 3, ..., R. These vectors are stacked together to form three factor matrices, A = [a1, a2, ..., aR], B = [b1, b2, ..., bR], and C = [c1, c2, ..., cR], where A ∈ R^(n×R), B ∈ R^(n×R), and C ∈ R^(K×R).

In factor matrix A ∈ R^(n×R), each row is an R-dimensional representation of the source role played by the corresponding node. In factor matrix B ∈ R^(n×R), each row is an R-dimensional representation of the target role played by the corresponding node. In factor matrix C ∈ R^(K×R), each row i is an R-dimensional representation of the i-th transition step, where 1 ≤ i ≤ K.

After we find the source factor matrix A, target factor matrix B, and transition factor matrix C, we can compute the projection of the source embedding of node i on the transition embedding j, where 1 ≤ i ≤ n and 1 ≤ j ≤ K, and get the source-transition embedding matrix ST ∈ R^(n×K). Similarly, we can get a target-transition embedding matrix TT ∈ R^(n×K) that reflects the projection of target embeddings on transition step embeddings. Finally, we get the node embedding matrix Z ∈ R^(n×2K) by concatenating ST and TT. The first K columns of Z represent the source role of a node with varying transition steps, and the last K columns of Z represent the target role of a node with varying transition steps. TDNE is shown in Algorithm 1.

ST = A ∗ C^T
TT = B ∗ C^T
Z = [ST, TT]

Algorithm 1 TDNE: Tensor Decomposition-based Node Embedding
Input: 1-step transition probability matrix A
       Maximum transition step K
       CP decomposition rank R
Output: Node embedding matrix Z
 1: n = count_rows(A)
 2: X = tensor(n, n, K)
 3: for k in 1 to K do
 4:     X(:, :, k) = A^k
 5: end for
 6: [A, B, C] ⇐ CP_ALS(X, R)
 7: ST = A ∗ C^T
 8: TT = B ∗ C^T
 9: Z = [ST, TT]
10: return Z
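Assuming TensorLy's parafac routine as the CP-ALS solver (the library the paper reports using in Section V; its return format differs slightly across versions) and a dense NumPy transition matrix, Algorithm 1 could be sketched as follows; the function and variable names are ours.

```python
import numpy as np
from tensorly.decomposition import parafac  # CP-ALS solver from TensorLy

def tdne(A1, K, R):
    """Sketch of Algorithm 1 (TDNE). A1: 1-step transition probability
    matrix (n x n); K: maximum transition step; R: CP decomposition rank."""
    n = A1.shape[0]
    X = np.zeros((n, n, K))
    Ak = np.eye(n)
    for k in range(K):
        Ak = Ak @ A1              # k-step transition probabilities
        X[:, :, k] = Ak           # stack as the k-th frontal slice
    # Recent TensorLy versions return (weights, factor_matrices);
    # older ones return only the factor list, so adjust the unpacking if needed.
    weights, (A, B, C) = parafac(X, rank=R, n_iter_max=200)
    ST = A @ C.T                  # source-transition embeddings, n x K
    TT = B @ C.T                  # target-transition embeddings, n x K
    return np.hstack([ST, TT])    # node embedding matrix Z, n x 2K
```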
B. Tensor Decomposition-based Node Embedding per Slice (TDNEpS)

In TDNEpS (Algorithm 2), we consider each transition probability matrix A^k as a third-order tensor with a single slice. Therefore, instead of having a single third-order tensor, we have K third-order tensors, where each tensor X^(k) ∈ R^(n×n×1). Then, for each tensor X^(k), CP decomposition is performed with a given rank R, and three factor matrices are found: the source factor matrix regarding the k-th transition step A^(k) ∈ R^(n×R), the target factor matrix regarding the k-th transition step B^(k) ∈ R^(n×R), and the k-th transition factor matrix C^(k) ∈ R^(1×R). We compute the source-transition embedding matrix ST^(k) ∈ R^(n×1) and the target-transition embedding matrix TT^(k) ∈ R^(n×1), and finally concatenate the ST^(k)'s and TT^(k)'s for 1 ≤ k ≤ K to get the node embedding matrix Z ∈ R^(n×2K).

ST^(k) = A^(k) ∗ (C^(k))^T
TT^(k) = B^(k) ∗ (C^(k))^T
Z = [ST^(1), ST^(2), ..., ST^(K), TT^(1), TT^(2), ..., TT^(K)]

Algorithm 2 TDNEpS: Tensor Decomposition-based Node Embedding per Slice
Input: 1-step transition probability matrix A
       Maximum transition step K
       CP decomposition rank R
Output: Node embedding matrix Z
 1: n = count_rows(A)
 2: for k in 1 to K do
 3:     X^(k) = tensor(n, n, 1)
 4:     X^(k) = A^k
 5:     [A^(k), B^(k), C^(k)] ⇐ CP_ALS(X^(k), R)
 6:     ST^(k) = A^(k) ∗ (C^(k))^T
 7:     TT^(k) = B^(k) ∗ (C^(k))^T
 8: end for
 9: Z = [ST^(1), ST^(2), ..., ST^(K), TT^(1), TT^(2), ..., TT^(K)]
10: return Z

The meaning of each column of Z is the same for both TDNE and TDNEpS. Although both algorithms have similar complexities and output, our experimental findings suggest that TDNEpS has less performance variance and higher accuracy in comparison with TDNE.
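A corresponding sketch of Algorithm 2, under the same assumptions (TensorLy's parafac, names ours) as the TDNE sketch above, simply runs one CP decomposition per slice.

```python
import numpy as np
from tensorly.decomposition import parafac

def tdneps(A1, K, R):
    """Sketch of Algorithm 2 (TDNEpS): one CP decomposition per k-step slice."""
    n = A1.shape[0]
    ST_cols, TT_cols = [], []
    Ak = np.eye(n)
    for k in range(K):
        Ak = Ak @ A1
        Xk = Ak[:, :, None]                        # n x n x 1 single-slice tensor
        _, (Ak_f, Bk_f, Ck_f) = parafac(Xk, rank=R, n_iter_max=200)
        ST_cols.append(Ak_f @ Ck_f.T)              # ST^(k), shape n x 1
        TT_cols.append(Bk_f @ Ck_f.T)              # TT^(k), shape n x 1
    return np.hstack(ST_cols + TT_cols)            # Z, shape n x 2K
```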
V. EXPERIMENTS

In this section, we experimentally evaluate our algorithms with respect to the interpretability of the feature space, and the performance of network reconstruction, link prediction, node classification, and graph classification. Among these tasks, graph classification is a multi-graph-based learning task, while the others are single graph-based.

A. Experimental settings

To comprehensively evaluate our algorithms, we have used different types of networks: directed and undirected, weighted and unweighted, sparse and dense, small and large networks. In Table I, we list the datasets used in the experiments and their properties.

For network reconstruction, link prediction, node classification, and graph classification, we have compared our algorithms TDNE and TDNEpS with five baseline algorithms. We selected the baselines based on the categories of node embedding algorithms (Section I). For all of these baselines, we have used dataset and task-dependent hyperparameters suggested by their papers. For the BlogCatalog dataset, we have used the number of dimensions d = 128, and for the other datasets, we have used d = 16 for these baselines.

TABLE I: Dataset statistics and their use in experiments

Dataset               | Network type        | Network properties      | |V|    | |E|      | Density** | Experiment
Karate                | social network      | directed, unweighted    | 34     | 78       | 0.06952   | interpretability of the features
BlogCatalog           | social network      | undirected, unweighted  | 10,312 | 333,983  | 0.00628   | network reconstruction and link prediction
Brazilian airports    | air-traffic network | undirected, unweighted  | 131    | 1,003    | 0.11779   | node classification
ADHD graph database*  | brain network       | undirected, weighted    | 90     | 1992.897 | 0.4976    | graph classification

* For the graph database, |V| is the number of nodes in each graph, and |E| is the mean number of edges over all graphs.
** For an undirected graph, density = 2|E| / (|V|(|V| − 1)). For a directed graph, density = |E| / (|V|(|V| − 1)).

• Laplacian Eigenmaps (LAP) [7] is a matrix decomposition-based method that performs eigendecomposition of the Laplacian of the graph.
• LLE [8] is a matrix decomposition-based method that embeds nodes by singular value decomposition, taking into consideration that each node is a linear combination of its neighbors.
• HOPE [9] is a multihop similarity-based method that performs generalized SVD on the similarity matrix found from node neighborhoods. We set the decay parameter β = 0.01 for the Katz Index.
• GraRep [2] is a multihop similarity-based method that generates node embeddings by successive singular value decomposition of the powers of the log-transformed, probabilistic adjacency matrix. We set the maximum transition step K = 6, and the log shifted factor β = 1/n.
• Node2vec [4] is a random walk-based method, which is a generalization of DeepWalk [5]. Node2vec uses biased random walks to create node sequences, and uses the Skip-gram model [6] to learn node embeddings. We set walks per node r = 80, walk length l = 10, context size k = 10, return parameter p = 1, and in-out parameter q = 1.

For node classification and graph classification, we use classification accuracy, i.e., the percentage of correct predictions. For network reconstruction and link prediction, we use Precision@Np and mean average precision (MAP) [3], [15].

Precision@Np: Pr@Np is the fraction of correct predictions in the top-Np predicted node pairs. It is defined as

Pr@Np = |Epred(1 : Np) ∩ Eobs| / Np    (6)

where Epred(1 : Np) are the top-Np predicted node pairs, and Eobs are the observed edges. For network reconstruction, Eobs = E. For link prediction, Eobs is the set of hidden edges.

Mean Average Precision: MAP considers the precision of each node and computes the average over all nodes.

MAP = (Σ_i AP(i)) / n    (7)

AP(i) = Σ_{Np} Pr@Np(i) · I{Epred_i(Np) ∈ Ei} / |{Np : Epred_i(Np) ∈ Ei}|

Pr@Np(i) = |Epred_i(1 : Np) ∩ Ei| / Np

where Epred_i is the set of predicted edges for node i. For network reconstruction, Ei is the set of observed edges for node i. For link prediction, Ei is the set of hidden edges for node i.
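A small Python sketch of these two metrics, with our own hypothetical input conventions (ranked candidate lists and edge sets), is given below.

```python
import numpy as np

def precision_at_Np(ranked_pairs, E_obs, Np):
    """Pr@Np (Eq. 6): fraction of the top-Np predicted node pairs that belong
    to the observed (network reconstruction) or hidden (link prediction) set."""
    hits = sum(1 for pair in ranked_pairs[:Np] if pair in E_obs)
    return hits / Np

def mean_average_precision(ranked_targets, E, n):
    """MAP (Eq. 7): ranked_targets[i] is the list of candidate targets for
    node i, sorted by predicted score; E[i] is node i's observed/hidden edge set."""
    ap = []
    for i in range(n):
        hits, precisions = 0, []
        for rank, j in enumerate(ranked_targets[i], start=1):
            if j in E[i]:                        # the rank-th prediction is a correct edge
                hits += 1
                precisions.append(hits / rank)   # Pr@Np(i) at this cut-off
        ap.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(ap))
```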
For implementing the baseline algorithms and the performance evaluation metrics, we used the GEM (Graph Embedding Methods) library [15]. All the experiments are conducted on a single PC with an Intel Core i7-6700 CPU (clock speed 3.40GHz), 16 GB RAM, and the Ubuntu 16.04 operating system. Our implementation uses the Tensorly library of Python [16]. We made TDNE and TDNEpS available at: https://github.com/hamdi08/TDN

B. Interpretability of the features

For this experiment, we used Zachary's Karate club network [17] (Fig. 3(a)), where we consider each edge as directed. The network has 34 nodes and 78 directed edges. We performed TDNE with K = 6 and r = 2. Therefore, the third-order tensor has size 34*34*6. After CP decomposition, we visualize the embeddings of each transition step (Fig. 3(b)). The L2 norms of the transition embeddings (Fig. 3(c)) show the relatively high importance of lower-order proximities compared to the higher-order proximities, which is intuitive for social networks. In Fig. 3(d), the final embedding of each node is shown in a 12 (=2*6) dimensional feature space, where the first six features represent the source property of the nodes with varying transition step from one to six, and the last six features represent the target property of the nodes with varying transition step from one to six. Node 1, which has all outgoing edges and no incoming edges, is embedded in a way so that it has high values in only the source property representing features (more specifically, the features which represent the source property in lower transition steps). An almost opposite embedding nature is observed in node 34, which has all incoming edges and no outgoing edges. For some nodes which have an almost equal number of incoming and outgoing edges, such as nodes 9 and 10, we see a distribution of high values among the source property representing features and the target property representing features. Features representing higher-order transition steps (such as 4th, 5th and 6th-order) of both source and target properties have no impact in this network, which supports the facts found in Fig. 3(c).
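These quantities can be read directly off the factor matrices; the short sketch below (our helper functions, assuming the factor matrices A, B, C produced by a CP decomposition as in the TDNE sketch) computes the per-step L2 norms plotted in Fig. 3(c) and the 2K-dimensional node features shown in Fig. 3(d).

```python
import numpy as np

def transition_step_importance(C):
    """L2 norm of each transition-step embedding (rows of factor matrix C);
    larger norms indicate more influential proximities, as in Fig. 3(c)."""
    return np.linalg.norm(C, axis=1)

def node_feature_roles(A, B, C):
    """Interpretable node features: the first K columns are source roles and
    the last K columns are target roles, ordered by increasing transition step."""
    return np.hstack([A @ C.T, B @ C.T])
```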

Fig. 3: Executing TDNE on the directed Karate network. (a) Directed Karate network of 34 nodes and 78 edges. (b) Transition step embeddings (K = 6, CP decomposition rank 2). (c) L2 norms of k-step transition step embeddings. (d) Representation of the nodes in the 12-dimensional space after CP decomposition with rank 2.

C. Network reconstruction

Reconstruction of the network from the learned embeddings of the nodes is a common task for evaluating node embedding algorithms. The node pairs (possible edges) are ranked according to the node similarities, i.e., the inner product of two node embeddings, and equations 6 and 7 are used to determine Precision@Np and MAP.

For this experiment, we have used the BlogCatalog network¹, which consists of 10,312 nodes and 333,983 edges (undirected and unweighted). In this social network, nodes represent bloggers and edges represent social relationships among them. Fig. 4 shows that TDNE (K = 3, r = 4) and TDNEpS (K = 1, r = 1) outperform the other baseline algorithms in terms of Precision@Np (Fig. 4a) and MAP (Fig. 4b). We vary the number of reconstructed node pairs from one hundred to one million, and record Precision@Np for each given number of reconstructed node pairs. We executed each algorithm five times, and plotted the means as points and the standard deviations as shaded regions. We observe that single matrix factorization-based methods such as Laplacian Eigenmaps and LLE perform very poorly because of their reliance on the approximation of eigenvectors of the large, first-order proximity matrix. Multihop similarity-based methods such as HOPE and GraRep that use higher-order proximities of nodes perform comparatively better. The random walk-based method Node2vec performs better than them because of its flexible neighborhood sampling. Although Node2vec's performance is better than GraRep's in terms of Precision@Np, in terms of MAP, Node2vec's performance is a bit inferior to GraRep's.

¹ http://socialcomputing.asu.edu/datasets/BlogCatalog3

Fig. 4: Network reconstruction performance of TDNE and TDNEpS with five other baseline algorithms on the BlogCatalog network. (a) Precision@Np in network reconstruction. (b) Mean MAP of each baseline in network reconstruction.

Fig. 5: Link prediction performance of TDNE and TDNEpS with five other baseline algorithms on the BlogCatalog network. (a) Precision@Np in link prediction. (b) Mean MAP of each baseline in link prediction.

Surprisingly, TDNE with d = 6 and TDNEpS with d = 2 perform better than all the baselines, which use d = 128. The robust performance of these tensor decomposition-based node embedding methods can be attributed to the representation learning of the transition steps (proximities). While TDNE shows some variance over the experiments, TDNEpS performs consistently and outperforms TDNE in network reconstruction. This proves the superiority of independent decomposition of the tensor slices in comparison with a single decomposition of a single tensor.

D. Link prediction

Link prediction is a typical application of node embedding that aims to predict which pairs of nodes are likely to form edges. In our experiments, we randomly hid 20% of the edges (66,797 edges) of the BlogCatalog network, and ran the node embedding algorithms on the remaining edges (267,186 edges) to learn node representations. As in the network reconstruction experiments, we used d = 128 for the baseline algorithms, d = 6 for TDNE, and d = 2 for TDNEpS. For both Precision@Np and MAP, we followed exactly the same experimental settings as network reconstruction (variation of the number of reconstructed node pairs and number of executions of each algorithm).

Fig. 5 shows that both TDNE and TDNEpS outperform the other baselines in terms of both Precision@Np (Fig. 5a) and MAP (Fig. 5b). Since link prediction is a harder task than network reconstruction, all algorithms give smaller values in both metrics. According to the Precision@Np and MAP values of link prediction, we can see that TDNEpS > TDNE > Node2vec > GraRep > HOPE > LLE > LAP. These results support the fact that algorithms considering higher-order proximities tend to have better performance than the algorithms considering lower-order proximities, and the proximity learning nature of tensor decomposition-based methods performs better in link prediction even with a small number of dimensions. As in the case of network reconstruction, TDNE shows more variance and less precision than TDNEpS.

Fig. 6: Node classification performance in the Brazilian airport network.

Fig. 7: Graph classification performance in the ADHD graph database.

E. Node classification

For node classification, we used the Brazilian airport network, which has 131 nodes and 1003 edges (undirected and unweighted).
The nodes correspond to the airports in Brazil, and edges indicate the existence of commercial flights between them. The nodes have four labels from 0 to 3, which indicate the airport activities, i.e., the number of takeoffs and landings in the year 2016, and label 0 means the highest activity level. The data is collected and labeled by Ribeiro et al. [18] from the website of the National Civil Aviation Agency (ANAC)².

Because of the limited size of the network, we set d = 16 for the baseline algorithms, and d = 8 for both TDNE and TDNEpS. We executed both tensor decomposition-based methods with K = 4 and r = 4. As the classifier, we used a one-vs-all logistic regression classifier with L2 regularization. We varied the percentage of training representations from 10% to 90% and report the mean classification accuracy over 10 trials of random sampling. From Fig. 6, we can see that both TDNE and TDNEpS outperform the other baselines in node classification. This figure also shows the performance variation of single matrix decomposition-based methods (LAP, LLE, and HOPE), multiple matrix decomposition-based/random walk-based higher-order proximity preserving methods (GraRep and Node2vec), and tensor decomposition-based proximity learning methods (TDNE and TDNEpS). With only 30% training data, the classification accuracy of TDNEpS is twice that of HOPE, the best performing single matrix decomposition-based method.

² http://www.anac.gov.br/

F. Graph classification

For graph classification, we have a graph database of labeled graphs, D = {G1, G2, ..., G|D|}. We can represent each graph in a fixed dimensional vector space by different graph embedding schemes, such as computing structural properties such as degree, clustering coefficient, etc. of the nodes [19], counting the appearances of frequent/discriminative subgraph patterns [20], and so on. To evaluate the node embedding algorithms in the light of graph classification, we take a graph database, where each graph has the same labeled node set, and apply node embedding algorithms to embed the nodes of each graph. If each graph has n nodes, then graph Gi has a node embedding matrix Zi ∈ R^(n×d). We get the embedding of the graph Gi by reshaping its node embedding matrix to Zi ∈ R^(1×nd). Therefore, the embedding matrix of the graph database is ZD ∈ R^(|D|×nd), as sketched below.
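A short NumPy sketch of this reshaping step (the function name is ours) is given below; the rows of ZD can then be fed, together with the graph labels, to an off-the-shelf classifier.

```python
import numpy as np

def graph_database_embedding(node_embedding_matrices):
    """Sketch: flatten each graph's node embedding matrix Z_i (n x d) into a
    1 x nd row and stack the rows into Z_D of shape |D| x nd. Assumes every
    graph shares the same labeled node set, so the row order is consistent."""
    return np.vstack([Z.reshape(1, -1) for Z in node_embedding_matrices])
```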
Brain network classification is a good example of graph classification, where each graph has the same labeled node set. In brain networks, the nodes represent brain regions defined by some standard brain atlas, and edges represent functional/structural similarity of the brain regions [21]. Non-invasive neuroimaging modalities such as Magnetic Resonance Imaging (MRI), Electroencephalography (EEG), and Diffusion Tensor Imaging (DTI) can be used to construct brain networks, which have been used by the neuroscience community to investigate different neurological disorders such as Alzheimer's, Schizophrenia, Bipolar disorder, and Attention-deficit/hyperactivity disorder (ADHD) [22]. Given a set of brain networks and associated case/control labels, we need to maximize the classification performance. For this experiment, we have considered an fMRI (Functional Magnetic Resonance Imaging)-based brain network dataset.

fMRI measures the functional activities of different brain regions by capturing 3D brain volumes over time. The time series of each voxel (the unit of brain volume) is calculated, which represents the change in the Blood Oxygenation Level Dependent (BOLD) signal over the scan period. The voxels are grouped into predefined brain atlas-based regions, and the mean time series is calculated for each region. The pairwise Pearson correlation coefficients between the time series of the regions give the correlation matrix. By applying a threshold (usually 0) on the correlation matrix, we get the adjacency matrix of the brain network [20].
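A minimal sketch of this construction, assuming the per-region mean time series are already available as a NumPy array, could look as follows.

```python
import numpy as np

def brain_network_adjacency(region_time_series, threshold=0.0):
    """Sketch: region_time_series has shape (regions, time points) and holds
    the mean BOLD time series per atlas region; edges are kept where the
    pairwise Pearson correlation exceeds the threshold (usually 0)."""
    corr = np.corrcoef(region_time_series)        # pairwise Pearson correlations
    np.fill_diagonal(corr, 0.0)                   # no self-loops
    return np.where(corr > threshold, corr, 0.0)  # weighted adjacency matrix S
```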
For this experiment, we collected a brain network dataset from the ADHD-200 global competition³.

³ http://fcon_1000.projects.nitrc.org/indi/adhd200/

Fig. 8: Node classification and graph classification performance of TDNEpS in different settings of K and R. (a) Brazil airport network with fixed R (= 4) and varying K. (b) Brazil airport network with fixed K (= 2) and varying R. (c) ADHD dataset with fixed R (= 2) and varying K. (d) ADHD dataset with fixed K (= 1) and varying R.

The dataset has 768 brain networks, where 280 of them are labeled as ADHD positive, and the rest are normal controls. There are 90 nodes in each network, which represent 90 cerebral brain regions defined by the Automated Anatomical Labeling (AAL) parcellation on resting-state fMRI scans of the subjects. The detailed preprocessing steps of this dataset are discussed in [23].

We have used d = 16 for the baselines, d = 4 for TDNE, and d = 2 for TDNEpS in this experiment. We have used a binary logistic regression classifier with L2 regularization. We report the mean classification accuracy after 10 trials of train/test sampling, while varying the train set size from 10% to 90%. From Fig. 7, we see that TDNEpS outperforms all other baselines. Node2vec performs better than TDNE in this experiment. LAP performs better than the other single matrix factorization-based methods such as LLE and HOPE. Surprisingly, GraRep performs very poorly, most probably due to overfitting. In GraRep, the concatenation of the d-dimensional embeddings of each transition probability matrix results in a Kd-dimensional embedding of each node. Such high dimensionality of the node representations generated by GraRep causes classifier overfitting and results in lower classification accuracy even with a higher amount of training data in some cases.

G. Parameter sensitivity

Fig. 8 shows the node classification and graph classification performance of TDNEpS with the change of the hyperparameters K and R. The tensor decomposition rank R is set as the number of class labels in the dataset, and the maximum transition step K is tuned while keeping R fixed. Therefore, for the Brazilian airport network, which has four class labels, we set R = 4 to see the effect of K on the node classification performance with respect to train set size (Fig. 8a). The setting of K = 2 performs well when the train set size is small. Then, with fixed K = 2, we vary R (Fig. 8b), and see that R = 2 performs consistently better with K = 2. We set R = 2 for the binary labeled ADHD dataset in graph classification, and vary K with R fixed.

We see from Fig. 8c that the graph classification performance on the ADHD dataset is very sensitive to K. For some K values, the classifier might overfit (for example, K = 4). We take K = 1 as the best maximum transition step and vary R (Fig. 8d). Although R = 1 and R = 3 give some good results, it is difficult to find the optimal R. Therefore, we recommend using grid search for optimizing these two hyperparameters K and R, as sketched below.
VI. CONCLUSION

In this work, we present two novel tensor decomposition-based node embedding algorithms, TDNE and TDNEpS, which utilize higher-order transition probability matrices of a graph (directed or undirected, weighted or unweighted) to construct one or more third-order tensor(s), and use CP decomposition to extract factor matrices containing the representations of the source and/or target properties of the nodes, and the transition steps. We have theoretically and experimentally shown that the node features produced by these algorithms are highly interpretable in terms of the understandability of the feature roles. Moreover, the learned embeddings of the transition steps make them perform well in network reconstruction, link prediction, node classification, and graph classification.

In the future, we want to find the theoretical relationship between tensor decomposition-based node embedding algorithms and eigendecomposition/SVD-based algorithms. Moreover, we are interested in constructing the tensor with other types of similarity matrices, such as Jaccard similarity, and observing the node embedding performance. We also look forward to applying TDNE and TDNEpS to embedding the correlation graphs derived from multivariate time series data, such as the data for solar flare prediction [24].

ACKNOWLEDGMENTS

This project has been supported in part by funding from the Division of Advanced Cyberinfrastructure within the Directorate for Computer and Information Science and Engineering, the Division of Astronomical Sciences within the Directorate for Mathematical and Physical Sciences, and the Division of Atmospheric and Geospace Sciences within the Directorate for Geosciences, under NSF awards #1443061 and #1931555. It was also supported in part by funding from the Heliophysics Living With a Star Science Program, under NASA award #NNX15AF39G.

REFERENCES

[1] H. Cai, V. W. Zheng, and K. C. Chang, "A comprehensive survey of graph embedding: Problems, techniques, and applications," IEEE Trans. Knowledge and Data Engineering, vol. 30, no. 9, pp. 1616–1637, 2018.
[2] S. Cao, W. Lu, and Q. Xu, "Grarep: Learning graph representations with global structural information," in Proc. of the 24th ACM Intl. Conf. on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19-23, 2015, 2015, pp. 891–900.
[3] Z. Zhang, P. Cui, X. Wang, J. Pei, X. Yao, and W. Zhu, "Arbitrary-order proximity preserved network embedding," in Proc. of the 24th ACM SIGKDD Intl. Conf. on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, 2018, pp. 2778–2786.
[4] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proc. of the 22nd ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 2016, pp. 855–864.
[5] B. Perozzi, R. Al-Rfou, and S. Skiena, "Deepwalk: Online learning of social representations," in The 20th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA, August 24-27, 2014, 2014, pp. 701–710.
[6] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in 1st Intl. Conf. on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proc., 2013.
[7] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," in Advances in Neural Information Processing Systems, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada, 2001, pp. 585–591.
[8] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[9] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, "Asymmetric transitivity preserving graph embedding," in Proc. of the 22nd ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 2016, pp. 1105–1114.
[10] D. Wang, P. Cui, and W. Zhu, "Structural deep network embedding," in Proc. of the 22nd ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 2016, pp. 1225–1234.
[11] S. Cao, W. Lu, and Q. Xu, "Deep neural networks for learning graph representations," in Proc. of the Thirtieth AAAI Conf. on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, 2016, pp. 1145–1152.
[12] T. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in 5th Intl. Conf. on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conf. Track Proceedings, 2017.
[13] T. Kolda and B. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
[14] N. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. Papalexakis, and C. Faloutsos, "Tensor decomposition for signal processing and machine learning," IEEE Trans. on Signal Processing, vol. 65, no. 13, pp. 3551–3582, 2017.
[15] P. Goyal and E. Ferrara, "Graph embedding techniques, applications, and performance: A survey," Knowledge-Based Systems, vol. 151, pp. 78–94, 2018.
[16] J. Kossaifi, Y. Panagakis, A. Anandkumar, and M. Pantic, "Tensorly: Tensor learning in Python," Journal of Machine Learning Research, vol. 20, no. 26, pp. 1–6, 2019.
[17] W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.
[18] L. F. R. Ribeiro, P. H. P. Saverese, and D. R. Figueiredo, "struc2vec: Learning node representations from structural identity," in Proc. of the 23rd ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13-17, 2017, 2017, pp. 385–394.
[19] C.-Y. Wee, P.-T. Yap, D. Zhang, K. Denny, J. N. Browndyke, G. G. Potter, K. A. Welsh-Bohmer, L. Wang, and D. Shen, "Identification of MCI individuals using structural and functional connectivity networks," Neuroimage, vol. 59, no. 3, pp. 2045–2056, 2012.
[20] B. Cao, X. Kong, J. Zhang, P. S. Yu, and A. B. Ragin, "Mining brain networks using multiple side views for neurological disorder identification," in 2015 IEEE Intl. Conf. on Data Mining, ICDM 2015, Atlantic City, NJ, USA, November 14-17, 2015, 2015, pp. 709–714.
[21] B. Jie, D. Zhang, W. Gao, Q. Wang, C.-Y. Wee, and D. Shen, "Integration of network topological and connectivity properties for neuroimaging classification," IEEE Trans. on Biomedical Engineering, 2014.
[22] S. M. Hamdi, Y. Wu, R. A. Angryk, L. C. Krishnamurthy, and R. Morris, "Identification of discriminative subnetwork from fMRI-based complete functional connectivity networks," Intl. Journal of Semantic Computing, vol. 13, no. 1, pp. 25–44, 2019.
[23] J. B. Lee, X. Kong, Y. Bao, and C. M. Moore, "Identifying deep contrasting networks from time series data: Application to brain network analysis," in Proc. of the 2017 SIAM Intl. Conf. on Data Mining, Houston, Texas, USA, April 27-29, 2017, 2017, pp. 543–551.
[24] S. M. Hamdi, D. Kempton, R. Ma, S. F. Boubrahimi, and R. A. Angryk, "A time series classification-based approach for solar flare prediction," in 2017 IEEE Intl. Conf. on Big Data, BigData 2017, Boston, MA, USA, December 11-14, 2017, 2017, pp. 2543–2551.
