This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2020.3019488, IEEE
Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1
Abstract—Heterogeneous information networks (HINs) are commonly used to model information systems with multi-type objects and relations. In contrast, graphs with a single type of node and edge are often called homogeneous graphs. Measuring similarities among objects is an important task in data mining applications such as web search, link prediction, and clustering. Several similarity measures are currently defined for HINs. Most of them are based on meta-paths, which are sequences of node classes and edge types along the paths between two nodes. However, meta-paths, which are often designed by domain experts, are hard to enumerate and to choose w.r.t. the quality of similarity scores. This makes existing similarity measures difficult to use in real applications. To address this problem, we extend SimRank, a well-known similarity measure on homogeneous graphs, to HINs by introducing the concept of the decay graph. The newly proposed similarity measure, called HowSim, is meta-path free and captures structural and semantic similarity simultaneously. The generality and effectiveness of HowSim, and the efficiency of our proposed algorithms for computing HowSim scores, are demonstrated by extensive experiments.
• Y. Wang is with Shenzhen Institute of Computing Sciences, Shenzhen University, China. E-mail: yuewang@sics.ac.cn.
• Z. Wang, Z. Li, X. Jian, H. Xin, L. Chen are with the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China. E-mail: {zwangec,zlicb,xjian,hxinaa,leichen}@ust.hk.
• Z. Zhao, J. Song, Z. Chen, M. Zhao are with the WeChat Group, Tencent Corporation, Guangzhou, China. E-mail: {joshuazhao,barrysong,hollischen,doudouzhao}@tencent.com.

1 INTRODUCTION

1.1 Motivation

Many information systems can be modeled as graphs, where data objects are represented by nodes and the relations among them are represented by edges. Moreover, real-world systems often consist of data objects and relations of multiple types. Hence, heterogeneous information networks (HINs), which contain nodes and links of multiple types, are commonly used to model such systems. Examples include social media networks and bibliographic networks.

Measuring similarities among nodes in the same domain plays a key role in data mining tasks such as graph clustering [1], spam detection [2], recommendation systems [3], and link prediction [4]. Several similarity measures are currently defined for HINs, including PathSim [5], HeteSim [6], RelSim [7], PCRW [8], [9], and AvgSim [10]. However, all of them suffer from a generality problem, since they require a user to specify a meta-path before querying similarities. A meta-path captures one particular semantic meaning among the different paths connecting two objects. Consider the HIN in Figure 1a, with node types Author, Paper and Venue. The meta-path APA indicates that two authors have been co-authors, and the meta-path APVPA indicates that two authors have published papers in the same venue. Since one meta-path has only one semantic meaning, it fails to aggregate all possible semantic information when evaluating similarity. Even though a user can combine similarity scores by providing multiple meta-paths, this strategy has the following drawbacks: 1) it is hard to adjust the weights among meta-paths when combining scores w.r.t. the quality of similarity results, since there is no principle for deciding which meta-path is more important than the others; 2) enumerating meta-paths takes considerable effort, since two nodes can be connected in different ways and the number of potential meta-paths is infinite.

Moreover, a meta-path defined for querying similarities among nodes of type A1 cannot be applied to nodes of type A2. For instance, if we use the meta-paths APA and APVPA to find similar authors over the HIN in Figure 1a, we cannot reuse these meta-paths to find similar papers or similar venues. The candidates for finding similar papers are PAP or PVP, and those for finding similar venues are VPV or VPAPV. This shortcoming of meta-path based measures leads to a heavy workload for meta-path generation and selection when users want to query similarities in multiple domains.

Another way to measure similarities over an HIN is to treat it as a homogeneous graph by ignoring the type information of nodes and edges, and then to use an existing similarity measure for homogeneous graphs, such as Personalized PageRank (PPR) [11], SimRank [12] or P-Rank [13]. This strategy tackles the generality problem discussed above. However, considering only structural similarity and ignoring the information carried by multiple typed relations loses semantic information, which can also contribute to similarity scores. Therefore, there is currently no similarity measure on HINs that combines structural similarity and semantic similarity in a simple and general way.

1.2 Contributions

In this paper, we propose HowSim, a Heterogeneous information network based Similarity measure. Our basic idea is to apply the intuition of SimRank [12], i.e., “two nodes are similar if they are referenced by similar nodes”, to HINs. However, this extension is not straightforward. The original SimRank is a structural similarity measure defined specifically for homogeneous graphs, and it cannot capture the semantic similarity arising from multiple relations over different domains in an HIN. As discussed previously, one option is to ignore the node and edge types and treat the HIN as a homogeneous graph. However, this loses the semantic information.
1041-4347 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Cornell University Library. Downloaded on September 03,2020 at 05:00:45 UTC from IEEE Xplore. Restrictions apply.
From our point of view, because HINs contain multiple typed relations, the similarity of two objects can be aggregated from different semantic meanings under different relations with different neighbors, instead of from the single “referenced by” relationship in SimRank. For example, in the bibliographic network in Figure 1a, two papers can be similar in the following four semantic aspects:
1) They are cited by similar papers (“cited-by”);
2) They are citing similar papers (“citing”);
3) They are written by similar authors (“written-by”);
4) They are published in similar venues (“published-in”).
At the same time, two authors can only be similar if they write similar papers, since the node type “author” is only involved in the relation “writing” (“written-by”).¹ Therefore, ignoring the different types of nodes and edges in the HIN, as SimRank does, cannot capture the subtle differences in (i) the ways of similarity aggregation in different domains; and (ii) the similarities aggregated from different types of relations with different neighbors.

¹ This does not mean that the similarity of authors is not influenced by the similarity of the venues where they have publications, since this information is absorbed in the paper-paper similarity due to the recursive aggregation.

[Fig. 1: An example of an HIN and its network schema. (a) Network instance; (b) An example of network schema on bibliographic data]

To address this problem, we propose the concept of the decay graph. A decay graph is a weighted directed graph whose structure is the same as the network schema of an HIN, and whose weights are defined for each relation and its inverse (details in Section 3). The function of a decay graph is to encode how similarities are aggregated under different relations. Equipped with the decay graph, we define our HowSim measure by aggregating similarities recursively from neighbors.

HowSim is a general similarity measure for the following reasons. First, the original definition of SimRank and its variants [12], [13] are all special cases of the HowSim model, since they differ only in the initializations of their decay graphs. Second, once the decay graph is given, all similarity scores in all domains of an HIN are determined, without requiring users to provide any additional meta-path information. Third, even though HowSim is a meta-path free measure, it can combine the different semantic meanings of different paths due to its recursive definition. Therefore, HowSim can capture structural similarity and semantic similarity simultaneously.

We then study the mathematical properties of HowSim in depth. In addition to the basic definition of HowSim, we present its matrix representation and probabilistic interpretation. We show that HowSim is symmetric, normalized, and self-maximum, and that it always has a unique solution. We also propose an effective method for finding a decay graph automatically from example node pairs. To compute HowSim scores, we propose a naive iterative method and prove that it always converges to the correct HowSim scores. We also propose heuristic optimization strategies to improve efficiency. We verify the effectiveness of our HowSim model for various data mining tasks by extensive experiments. A comparison of HowSim with other popular measures is shown in Table 1. In summary, we make the following contributions in this paper:
• We propose a general and meta-path free similarity measure, i.e., HowSim, by proposing the concept of the decay graph to extend SimRank to HINs.
• We study properties of HowSim in depth and show that it is symmetric, non-negative, normalized and self-maximum. This leads to HowSim being a semi-metric measure.
• We introduce an iterative method to compute HowSim scores with an accuracy guarantee, and then propose optimization techniques to speed up the computation.
• We propose an effective algorithm for finding a decay graph automatically for HowSim.
• We perform extensive experiments to validate the effectiveness and generality of HowSim.

TABLE 1: The comparison of different similarity measures

| Measure | Symmetric | Structural Information | Semantic Information | Meta-path Free |
|---|---|---|---|---|
| PPR [14] | ✗ | ✓ | ✗ | ✓ |
| SimRank [12] | ✓ | ✓ | ✗ | ✓ |
| PathSim [5] | ✓ | ✗ | ✓ | ✗ |
| HeteSim [6] | ✓ | ✗ | ✓ | ✗ |
| PCRW [8] | ✗ | ✗ | ✓ | ✗ |
| HowSim | ✓ | ✓ | ✓ | ✓ |

This paper is organized as follows. We introduce related definitions in Section 2, and define HowSim in Section 3. Methods for computing HowSim scores are introduced in Section 4. We discuss how to define a decay graph in Section 5, and we perform experimental studies in Section 6. Related work is discussed in Section 7, and we conclude the paper in Section 8.

2 PRELIMINARIES

In this section, we give background on SimRank and HINs. Their detailed definitions can be found in [12] and in [5], [6], [15], [16], respectively.

2.1 SimRank

The intuition behind SimRank is two-fold: a) two nodes are similar if they are linked by similar nodes; b) two identical nodes have the similarity 1. Given an unweighted directed graph G(V, E) with n nodes and m edges, the SimRank score of two nodes a, b ∈ V is formulated as follows:

\[
s(a, b) =
\begin{cases}
1, & a = b, \\
\dfrac{C}{|I(a)|\,|I(b)|} \displaystyle\sum_{a' \in I(a),\, b' \in I(b)} s(a', b'), & a \neq b,
\end{cases}
\tag{1}
\]

where I(a) denotes the set of in-neighbors of node a, and C ∈ (0, 1) is the decay factor, which is usually set to 0.6 [17] or 0.8 [12]. Since Eq.(1) is recursive, SimRank can aggregate similarities of multi-hop neighbors of the original pair of nodes, and produce high-quality results. SimRank has attracted a lot of research attention following its introduction [17], [18], [19], [20], [21], [22], [23].
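To make the recursion in Eq.(1) concrete, the following minimal Python sketch iterates it for a fixed number of rounds on a tiny graph. The function name `simrank`, the toy graph, the iteration count, and the choice to score a pair as 0 when either node has no in-neighbors are assumptions made for illustration; they are not taken from the paper.

```python
from itertools import product


def simrank(in_neighbors, C=0.8, iterations=10):
    """Naive fixed-point iteration of Eq.(1).

    in_neighbors maps every node v to the list I(v) of its in-neighbors.
    Assumption: a pair with an empty in-neighbor set on either side scores 0.
    """
    nodes = list(in_neighbors)
    # Start from s(a, a) = 1 and s(a, b) = 0 for a != b.
    s = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        nxt = {}
        for a, b in product(nodes, nodes):
            if a == b:
                nxt[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                nxt[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, damped by C.
            total = sum(s[(x, y)] for x in ia for y in ib)
            nxt[(a, b)] = C * total / (len(ia) * len(ib))
        s = nxt
    return s


# Hypothetical toy graph with edges u -> a, u -> b, v -> b.
toy = {"u": [], "v": [], "a": ["u"], "b": ["u", "v"]}
scores = simrank(toy, C=0.8, iterations=5)
print(scores[("a", "b")])
```

On this toy graph, a and b share the in-neighbor u, so Eq.(1) gives s(a, b) = 0.8 × (s(u, u) + s(u, v)) / (1 × 2) = 0.4.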
TABLE 2: R of the network schema in Figure 1b

| R | R.S | R.M | R.T | c(R) |
|---|---|---|---|---|
| R1 | paper | citing | paper | 0.3 |
| R2 | venue | publishing | paper | 0.5 |
| R3 | author | writing | paper | 0.6 |
| R4 | paper | cited by | paper | 0.2 |
| R5 | paper | published in | venue | 0.2 |
| R6 | paper | written by | author | 0.1 |

[Fig. 2(b): The decay graph of SimRank, C ∈ (0, 1)]
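As an illustration of how the decay function in Table 2 could be held in code, the sketch below stores each relation as a tuple and groups relations by their source type R.S. The names `RELATIONS`, `relations_by_source` and `total_weight` are hypothetical, and treating "relations grouped by source type" as the per-type relation set is an assumption based on how Table 2 lists R.S and R.T.

```python
from collections import defaultdict
import math

# Table 2 as (name, source type R.S, meaning R.M, target type R.T, weight c(R)).
RELATIONS = [
    ("R1", "paper", "citing", "paper", 0.3),
    ("R2", "venue", "publishing", "paper", 0.5),
    ("R3", "author", "writing", "paper", 0.6),
    ("R4", "paper", "cited by", "paper", 0.2),
    ("R5", "paper", "published in", "venue", 0.2),
    ("R6", "paper", "written by", "author", 0.1),
]


def relations_by_source(relations):
    """Group relations by their source node type (assumed per-type relation set)."""
    grouped = defaultdict(list)
    for name, src, meaning, dst, weight in relations:
        grouped[src].append((name, meaning, dst, weight))
    return dict(grouped)


def total_weight(relations):
    """Sum the decay weights of the relations leaving each node type."""
    totals = defaultdict(float)
    for _, src, _, _, weight in relations:
        totals[src] += weight
    return dict(totals)


R_OF = relations_by_source(RELATIONS)
TOTALS = total_weight(RELATIONS)
for node_type in sorted(TOTALS):
    names = [name for name, _, _, _ in R_OF[node_type]]
    print(f"{node_type}: relations {names}, total weight {TOTALS[node_type]:.1f}")
```

With the weights of Table 2, the relations leaving "paper" (R1, R4, R5, R6) carry a total weight of 0.8, those leaving "venue" 0.5, and those leaving "author" 0.6.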
published in, 10% comes from the average similarity of authors who write them, 30% comes from the average similarity of papers they cite, and 20% comes from the similarity of papers which cite them. For the “venue-venue” similarity, 50% comes from the average similarity of papers they publish.

The ability to customize the decay function is central to HowSim, since we cannot assume that weights are fixed across different applications. For example, to find similar papers, a researcher could focus more on the author-paper relationship to find papers with similar authors, or alternatively focus more on the paper-venue relationship to find papers published in similar venues.

Let a and b be two nodes of type A. We define the HowSim score s_A(a, b) as follows.

Definition 4 (HowSim). Given an HIN G(V, E, φ, ψ), a decay function c, and a pair of nodes a, b ∈ V where φ(a) = φ(b) = A ∈ 𝒜, the HowSim similarity of (a, b) is defined as:

\[
s_A(a, b) =
\begin{cases}
1, & a = b, \\
\displaystyle\sum_{R \in \mathcal{R}(A)} \sum_{a' \in N_R(a)} \sum_{b' \in N_R(b)} \frac{c(R)\, s_{R.T}(a', b')}{|N_R(a)|\,|N_R(b)|}, & a \neq b.
\end{cases}
\tag{3}
\]

When a = b, their similarity is 1 by definition. When a ≠ b, Eq.(3) computes the average similarity between N_R(a) and N_R(b) for each R ∈ ℛ(A), and combines these averages with the weights c(R). Since φ(a) = φ(b), the types of the neighbors of a and b under R are always the same, so Eq.(3) is well defined on any node pair with the same type, and the solution to Eq.(3) is the collection of |𝒜| similarity matrices. In addition, given G, there are Σ_{A∈𝒜} |V_A|² equations following Eq.(3). In the rest of this paper, we simplify both s_A(a, b) and s_{R.T}(a, b) to s(a, b), ignoring the type information, when the context is clear.

SimRank and its variants (bipartite SimRank [12] and P-Rank [13]) are special cases of HowSim similarity, since they can be viewed as different initializations of decay graphs. Their decay graphs are shown in Figures 2b to 2d, and the details of the comparisons are in the supplemental material.

Theorem 1. Given G and a decay function c, the solution of Eq.(3) is unique.

Proof. Please see the proof in the supplemental material.

More Discussions about Decay Graphs. The decay graph and the network schema share the same network structure; the only difference between them is that the decay graph is weighted. The decay graph is the reason we conclude that HowSim outperforms meta-path based measures w.r.t. usability: it uses the network schema itself to compute similarities instead of asking users to provide meta-paths. The size of a decay graph is O(|ℛ| + |𝒜|), which is much smaller than the underlying HIN. The role of the decay graph w.r.t. HowSim is the same as that of the decay factor C w.r.t. SimRank, though the former is a weighted graph and the latter is a scalar. When the HIN is dynamic, i.e., nodes/edges keep being added or deleted, we can reuse the same decay graph as long as the network schema does not change, which is similar to the reuse of the decay factor in dynamic SimRank computation [19], [20], [22], [23]. However, if the network schema changes, e.g., a new type of relation is added to the HIN, we need to refine the decay graph correspondingly. Besides, HowSim relies on the network schema being given, as is the case in [5], [6], [16]. We leave the extension of HowSim to schema-free networks, such as knowledge graphs, to future work.

3.2 Probabilistic Interpretation

Suppose two coupled random walks W_a and W_b start from a and b, respectively. Let type A ∈ 𝒜 be the node type in which W_a and W_b currently reside. At each step, W_a and W_b either: 1) stop with probability α(A); or 2) choose a relation R from ℛ(A) with probability proportional to c(R), and each choose one of their neighbors under R (from N_R(a) and N_R(b)) uniformly to proceed. The probabilistic interpretation of HowSim is given as follows.

Theorem 2. Let s′(a, b) be the probability that W_a and W_b meet; then s′(a, b) is equal to s(a, b).

Proof. Please see the proof in the supplemental material.

Based on the probabilistic interpretation of HowSim, we can conclude that HowSim scores are in [0, 1].

3.3 Matrix Representation

Here we give the matrix representation of Eq.(3). Given a type A ∈ 𝒜, let S_A be the similarity matrix for A. Given a relation R ∈ ℛ, let R (as a matrix) be the adjacency matrix of the relation R, i.e., ∀a ∈ V_{R.S}, b ∈ V_{R.T}, R_{a,b} is 1 if R(a, b) ∈ E and 0 otherwise. Let W_R be the row-normalized matrix of R. Then ∀A ∈ 𝒜, S_A satisfies:

\[
S_A = \Big( \sum_{R \in \mathcal{R}(A)} c(R)\, W_R\, S_{R.T}\, W_R^{\top} \Big) \vee I,
\]

where ∨ denotes the element-wise maximum operator, and I is an identity matrix. There are |𝒜| such matrix equations for G.

3.4 Properties of HowSim

In summary, HowSim has the following numeric properties:
1) Normalized: given a ∈ V, ∀b ∈ V: s(a, b) ∈ [0, 1];
2) Symmetric: ∀a, b ∈ V, s(a, b) = s(b, a);
3) Self-maximum: ∀a ∈ V, s(a, a) = 1;
4) Uniqueness: given G and c, the similarities in all domains of G are unique.
Next, we discuss the computation of HowSim.

4 THE COMPUTATION OF HOWSIM

Extending SimRank Techniques to HowSim. It is possible to extend the techniques for SimRank computation to HowSim, since HowSim is a generalization of SimRank. There are currently two types of methods for SimRank computation with accuracy guarantees: iterative methods and random walk based methods [24]. The former are usually used for all-pairs SimRank, while the latter are used for single-pair/single-source queries. We study all-pairs HowSim in this paper, and leave the HowSim query problem as future work. We cannot extend the iterative method unless the following questions are answered: 1) do the estimated HowSim scores converge to the ground truth; 2) if so, what is the convergence rate. In this section, we first present a naive iterative method for computing HowSim matrices with an accuracy guarantee, and then present optimization techniques to reduce the cost.
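As a rough illustration of what an all-pairs fixed-point iteration over Eq.(3) can look like, the following Python sketch updates every same-type pair from the previous round's scores. It is a sketch under stated assumptions, not the algorithm analyzed in this section: the function name `howsim`, the toy bibliographic network, the SimRank-style initialization (1 on identical pairs, 0 elsewhere), the fixed iteration count, and the handling of empty neighbor sets (they contribute 0) are all choices made for illustration. The weights are those of Table 2.

```python
from itertools import product


def howsim(nodes_by_type, relations, iterations=20):
    """Iterate Eq.(3) on all same-type pairs.

    nodes_by_type: {type: [node, ...]}.
    relations: list of (source type, target type, c(R), neighbors), where
    neighbors maps a source node to its neighbor list N_R(node).
    """
    # Assumed initialization: s(a, a) = 1 and s(a, b) = 0 otherwise.
    s = {t: {(a, b): float(a == b) for a, b in product(ns, ns)}
         for t, ns in nodes_by_type.items()}
    for _ in range(iterations):
        nxt = {}
        for t, ns in nodes_by_type.items():
            nxt[t] = {}
            for a, b in product(ns, ns):
                if a == b:
                    nxt[t][(a, b)] = 1.0
                    continue
                score = 0.0
                for src, dst, c, nbrs in relations:
                    if src != t:
                        continue  # relation does not start at this node type
                    na, nb = nbrs.get(a, []), nbrs.get(b, [])
                    if not na or not nb:
                        continue  # assumption: empty neighbor sets add nothing
                    total = sum(s[dst][(x, y)] for x in na for y in nb)
                    score += c * total / (len(na) * len(nb))
                nxt[t][(a, b)] = score
        s = nxt
    return s


# Hypothetical toy bibliographic network (2 authors, 3 papers, 2 venues).
NODES = {"author": ["a0", "a1"], "paper": ["p0", "p1", "p2"], "venue": ["v0", "v1"]}
writing = {"a0": ["p0", "p1"], "a1": ["p1", "p2"]}
written_by = {"p0": ["a0"], "p1": ["a0", "a1"], "p2": ["a1"]}
publishing = {"v0": ["p0", "p1"], "v1": ["p2"]}
published_in = {"p0": ["v0"], "p1": ["v0"], "p2": ["v1"]}
citing = {"p1": ["p0"], "p2": ["p0"]}
cited_by = {"p0": ["p1", "p2"]}
RELATIONS = [
    ("paper", "paper", 0.3, citing),
    ("venue", "paper", 0.5, publishing),
    ("author", "paper", 0.6, writing),
    ("paper", "paper", 0.2, cited_by),
    ("paper", "venue", 0.2, published_in),
    ("paper", "author", 0.1, written_by),
]
scores = howsim(NODES, RELATIONS)
print(round(scores["author"][("a0", "a1")], 4))
```

After a single round from this initialization, the two authors score 0.6 × 1/4 = 0.15, because exactly one of the four paper pairs they wrote is an identical pair. Later rounds add the contributions that flow in through paper-paper similarity, which is how venue and citation information reaches the author domain.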
TABLE 3: Case study of DeBo (Bookmark: 8 Hours Of The Top 10 JavaScript Talks From 2010 That You Can’t Miss)

| HowSim | Weighted PPR |
|---|---|
| 8 Hours Of The Top 10 JavaScript Talks From 2010 That You Can’t Miss | 8 Hours Of The Top 10 JavaScript Talks From 2010 That You Cant Miss |
| 50 Useful Tools and Resources For Web Designers - Smashing Magazine | russell davies: what I meant to say at lift - part two - big red buttons and sliding into glass |
| Script Junkie — LABjs; RequireJS: Loading JavaScript Resources the Fun Way | Roo Reynolds - Playful |
| The 20 Most Practical and Creative Uses of jQuery - NETTUTS | YouTube - Simon Peyton Jones: Data Parallel Haskell |
| 35 Best Free Chrome Extensions for Web Developers — Web Resources — WebAppers | Must Watch TED Videos |
| The Essentials of Writing High Quality JavaScript — Nettuts+ | TED Talks - PostRank - Google Docs |
| 5 Good Reasons Why Designers Should Code — Carsonified | The Best TED Talks To Make Use Of Social Media |
| Custom Checkboxes, Custom Radio Buttons, Custom Select Lists | 100 Great Tech Talks for Educators — Best Colleges Online |
| Table Sorting with Prototype * Dexagogo | blip.tv (since 2005) |
| A Detailed Look into Popular Styles in Web Design — Onextrapixel - Showcasing Web Treats Without A Hitch | Science Cooking Public Lectures a Harvard School of Engineering and Applied Sciences |
UIKit, Three20 and UITableView (rank 2-4) are all tools or frameworks for developing iOS applications, and Keychain (rank 6) is the password management system developed by Apple. However, weighted PPR finds some general tags whose scope is much larger than iphone-development, such as github and open-source.

We then use LaFM and query the top-10 similar artists for Wolfgang Amadeus Mozart, who was a prolific and influential composer of the classical era. The results are shown in Table 5. We find that the results of HowSim are also famous classical pianists or composers, such as Martha Argerich and Franz Joseph Haydn. However, the results of weighted PPR are not satisfactory, e.g., Sissel is a crossover soprano and Acid House Kings is a Swedish indie pop band. Using the same decay graph, we also query similar genres for country music; the results are shown in Table 6. We can see that similar genres are found by HowSim, such as Canadian country, Texas country and hillbilly music, while weighted PPR finds irrelevant genres, e.g., female vocalists and rock.

The reason the results of weighted PPR are worse than those of HowSim is that weighted PPR gives high scores to hubs (nodes with high degree), even though these hubs may be irrelevant to the query nodes. From the above case studies, we can conclude that HowSim can identify similar objects over multiple domains.

6.2.2 Varying Decay Graphs

Different decay graphs lead to different similarity results. Over IMDb, we use three different decay graphs c1, c2 and c3 (shown in Table 7), and find the top-10 similar movies for Titanic. The results are shown in Table 8. We find that for c1, the top-6 movies are all directed by James Cameron. This is because, under c1, the similarity of movies is highly dependent on the similarity of the directors who direct them. As for c2, we find that all similar movies star either Leonardo DiCaprio or Kate Winslet. The reason is that in c2 the similarity of movies highly depends on the similarity of actors, and Leonardo and Kate are the stars of Titanic. The results of c3 are all romantic drama movies, since Titanic is a love story. In addition, most of the c3 results also star either Leonardo or Kate, which shows that HowSim can combine the similarities from multiple relations. Therefore, by customizing decay graphs, HowSim has the flexibility to reflect users’ preferences about how to aggregate similarities from multiple domains.

6.2.3 User Study

We perform a blinded user study to test the quality of the results of different methods. We use pooling [27], which is a standard approach for evaluating top-k document ranking quality in Information Retrieval (IR) systems when the ground truth ranking scores of all documents are difficult to obtain. The basic idea of pooling is as follows. Suppose that we are evaluating l similarity measures A1, ···, Al, each of which returns the top-k results that are most similar to a query. We first take the top-k results returned by each measure, then merge them into a pool, removing the duplicates. We then show the results in the pool to experts for evaluation. Based on the feedback provided by the experts, we pick the best k documents from the pool, and use them as the ground truth for evaluating the top-k results returned by A1, ···, Al. In the user study, each user is viewed as an expert for examining the results in the pool. More precisely, for each query node u, we retrieve the top-k nodes returned by each method, and merge them into a pool. We then present the pool to a user p, who orders the nodes in the pool w.r.t. their similarity to u. This ordering is viewed as the ground truth in p’s view, and the precision of the different measures can then be calculated. Based on the above strategy, using the IMDb data, we use the movie Titanic as the query, present the pooling results of the different measures to 10 different people, and then calculate the top-k precision for each of them. The average precision among different users for the different measures is shown in Figure 4. We find that HowSim achieves the best precision for varying k, indicating that users prefer the similarity results returned by HowSim to the others.

[Fig. 4: Top-k precision by user study. x-axis: k (2 to 10); y-axis: Precision (0.0 to 1.0). Measures compared: HowSim, nSimGram, PathSim(MDM), PathSim(MGM), HeteSim(MDM), HeteSim(MGM), WPPR, PPR, SimRank]

TABLE 9: Meta-paths used in clustering for PathSim and HeteSim

| Data set | Meta-paths (domain: meta-path) |
|---|---|
| IMDb | M: MDM; D: DMGMD; A: AMGMA; G: GMDMG |
| LaFM | U: UATAU; A: AUA; T: TAUAT |
| DeBo | U: UBTBU; B: BUB; T: TBUBT |
| DBIS | V: VPAPV; P: PAP; A: APVPA |
| Amazon | I: IICII; C: CIC |

6.3 Effectiveness

We compare HowSim with other measures over different tasks, and then study the effectiveness of finding decay graphs.

6.3.1 Node Clustering

We test the clustering quality of different measures. We apply K-Medoids [28] to perform clustering using the scores returned by the different measures. We use the compactness (CP) to evaluate clustering quality, which is defined as:

\[
CP_K = \frac{(K-1) \sum_{i=1}^{K} \sum_{v_j \in C_i} s(v_j, m_i)}{\sum_{i=1}^{K} \sum_{v_j \in C \setminus C_i} s(m_i, v_j)},
\]

where K is the number of clusters, C_i is the i-th cluster, C = ∪_{i=1}^{K} C_i, m_i is the center of C_i, and s(∗, ∗) is the similarity function. The numerator describes the closeness between centers and the objects grouped into their clusters, and the denominator represents the closeness between centers and the objects that are not grouped into their clusters. A high CP_K score indicates a good clustering result. For each data set, we first vary K from 3 to 9 in steps of 2, and we use different meta-paths for each meta-path based measure. The result is shown in Figure 5. We find that HowSim outperforms the others over different data sets, and that using different meta-paths leads to different compactness for PathSim and HeteSim. We then perform clustering over different domains, setting K = 5. The meta-paths of PathSim and HeteSim for the different domains are summarized in Table 9, and the compactness is shown in Figure 6. We can still observe that HowSim has the largest compactness scores on various domains. The reason is that HowSim can aggregate similarity scores from different domains of neighbors, while meta-path based methods can only capture one
TABLE 10: The precision of label prediction of different measures
Method Howsim PathSim(MDM) PathSim(MGM) HeteSim(MDM) HeteSim(MGM) nSimGram WPPR PPR SimRank
Precision 0.74 0.28 0.7 0.12 0.68 0.72 0.7 0.66 0.22
Howsim Pathsim(DMD) Howsim Pathsim(VPV) Howsim Pathsim(UBU) Howsim Pathsim(UAU) Howsim Pathsim(ICI)
WPPR Hetesim(DMGMD) WPPR Hetesim(VPAPV) WPPR Hetesim(UBTBU) WPPR Hetesim(UATAU) WPPR Hetesim(IICII)
PPR Hetesim(DMD) PPR Hetesim(VPV) PPR Hetesim(UBU) PPR Hetesim(UAU) PPR Hetesim(ICI)
Pathsim(DMGMD) SimRank Pathsim(VPAPV) SimRank Pathsim(UBTBU) SimRank Pathsim(UATAU) SimRank Pathsim(IICII) SimRank
nSimGram nSimGram nSimGram nSimGram nSimGram
103
Compactness
Compactness
Compactness
Compactness
Compactness
105
103 104
104
103
102
3 5 7 9 3 5 7 9 3 5 7 9 3 5 7 9 3 5 7 9
K K K K K
(a) IMDb (b) DBIS (V) (c) DeBo (U) (d) LaFM (U) (e) Amazon (I)
Compactness
Compactness
Compactness
Compactness
103 104 104
104
102 103 103
A F D G V P A U B T U A T 103 I C
IMDb DBIS DeBo LaFM Amazon
Fig. 6: Clustering on different domains
Fig. 7: Link prediction - Amazon. Panels: 5%, 10%, 15%, 20% of edges removed; x-axis: k (×10^2), y-axis: precision.
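The precision reported in the link-prediction figures, |candidates ∩ deleted edges| / |deleted edges|, can be sketched as follows. This is only an illustration under stated assumptions: edges are treated as undirected, both endpoints of every removed edge are used as queries, and `top_k` is a hypothetical callable that returns the k most similar nodes of a node under whichever measure is being evaluated.

```python
def link_prediction_precision(top_k, deleted_edges):
    """Compute |candidates ∩ deleted| / |deleted| for a set of removed edges.

    top_k: hypothetical callable mapping a node to its k most similar nodes.
    deleted_edges: list of (a, b) pairs removed from the graph (assumed undirected).
    """
    deleted = {frozenset(edge) for edge in deleted_edges}
    if not deleted:
        return 0.0
    candidates = set()
    for a, b in deleted_edges:
        for endpoint in (a, b):
            # Propose an edge between the endpoint and each of its similar nodes.
            for v in top_k(endpoint):
                if v != endpoint:
                    candidates.add(frozenset((endpoint, v)))
    return len(candidates & deleted) / len(deleted)

# Toy example (hypothetical): two removed user-user edges and fixed top-k lists.
toy_neighbours = {"u1": ["u2", "u5"], "u2": ["u6"], "u3": ["u7"], "u4": ["u8"]}
toy_deleted = [("u1", "u2"), ("u3", "u4")]
toy_score = link_prediction_precision(lambda n: toy_neighbours.get(n, []), toy_deleted)
```

Because the candidate set grows with k while the denominator stays fixed, this quantity can only stay the same or increase as k increases.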
particular meaning given a meta-path. SimRank and PPR cannot capture any semantic information. Another finding is that PPR performs better than meta-path based methods. The reason is that the number of path instances following the given meta-path can be very small, especially when nodes have low degrees. In this case, the scores of meta-path based measures are too coarse to distinguish different levels of similarity between node pairs (see Table 12). Besides, counting only the paths that follow a specific meta-path cannot aggregate information from other paths. In contrast, PPR and WPPR have recursive definitions, so the number of paths considered is infinite and the scores are more varied. Note that when clustering on different domains, we need to specify different meta-paths for PathSim and HeteSim, while for HowSim the same decay function is used without any changes. Therefore, we can also conclude that HowSim is a general measure and can ease data mining tasks on HINs.

6.3.2 Label Prediction
We compare different measures w.r.t. the precision of label prediction. Using the IMDb data set, we predict the genres of movies using different measures. In particular, we first remove the edges between movies and genres, and then select 500 movies randomly. For each movie, we compute its top-50 similar movies, and use majority voting among the results to predict the genre. We use MovieLens (https://movielens.org/) as ground truth. The average precision of the different measures is shown in Table 10. We can observe that the quality of PathSim and HeteSim highly depends on the selected meta-path. The weighted PPR (WPPR) is better than normal PPR since the former takes node/edge types into consideration. Among all measures, HowSim achieves the highest precision.

6.3.3 Link Prediction
We compare different measures w.r.t. the quality of link prediction. Since the similarity function is defined for two nodes of the same type, links can only be predicted within the same domain. As a result, we can only use Amazon, DeBo and LaFM, and predict links among items and users. For each data set, we randomly remove from 5% to 20% of the total edges. The missing links are predicted as follows: given an endpoint a of a removed edge, we retrieve the top-k similar nodes of a, and add edges between a and its similar nodes to a set of candidates. Precision is used to measure the quality, i.e., |candidates ∩ deleted edges| / |deleted edges|. We also vary k from 10 to 100 with a step of 10. For meta-path based methods, we also consider different meta-paths, e.g., we use the notation PathSim(UBTBU) if we use PathSim with the meta-path UBTBU. The results are shown in Figures 7 to 9. We can observe that the precision increases with an increasing
Fig. 8: Link prediction - DeBo. Panels: 5%, 10%, 15%, 20% of edges removed; x-axis: k (×10^2), y-axis: precision.
Fig. 9: Link prediction - LaFM. Panels: 5%, 10%, 15%, 20% of edges removed; x-axis: k (×10^2), y-axis: precision.
Fig. 10: Average score. x-axis: data sets (LaFM, IMDb, DBIS, Amazon, DeBo).
Fig. 11: Varying data sets. Panels: (a) Maximum error, (b) CPU time; x-axis: data sets (DBIS, DeBo, LaFM, Amazon, IMDb).
Fig. 12: Varying ε. Panels: (a) Maximum error, (b) CPU time; x-axis: ε from 0.00 to 0.10.
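The maximum error (ME) plotted in Figures 11 and 12 is the largest absolute gap between estimated and exact scores over all node pairs. The sketch below computes it and also illustrates why an iteration count derived from an error bound moves in steps. Two assumptions are made loudly here: the helper names are hypothetical, and `iterations_for` uses a geometric bound of the form c^K ≤ ε, which is the shape known for SimRank-style iterations; the bound actually derived for HowSim is not reproduced in this section.

```python
import math

def maximum_error(estimated, exact):
    """ME = max over node pairs of |estimated(a, b) - exact(a, b)|.

    Both arguments map a node pair (a, b) to a score.
    """
    return max(abs(estimated[pair] - exact[pair]) for pair in exact)

def iterations_for(eps, c):
    """Smallest integer K with c**K <= eps, up to floating-point rounding.

    Assumed geometric error bound with 0 < c < 1 and 0 < eps < 1;
    this is an illustrative stand-in, not the bound proved for HowSim.
    """
    return math.ceil(math.log(eps) / math.log(c))

# Toy scores (hypothetical) for two node pairs.
toy_exact = {("a", "b"): 0.5, ("a", "c"): 0.2}
toy_estimated = {("a", "b"): 0.48, ("a", "c"): 0.21}
toy_me = maximum_error(toy_estimated, toy_exact)

# Different error bounds can map to the same K, which gives stair-like curves.
toy_iterations = {eps: iterations_for(eps, 0.6) for eps in (0.01, 0.02, 0.03, 0.04)}
```

Since K is an integer, neighbouring values of ε (here 0.03 and 0.04) can share the same K and therefore the same ME and running time.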
k, because a larger k yields a larger candidate set for link prediction. We find that HowSim, HeteSim and nSimGram generally outperform the other competitors across different settings. Besides, the quality of the results of meta-path based measures highly depends on the meta-path used. For example, on Amazon, no matter which meta-path based measure is used (PathSim or HeteSim), the results under IICII have higher precision than those under ICI. This is also our motivation for introducing HowSim: it is hard to generate and select good meta-paths for different applications and domains before we perform any data mining task.

TABLE 12: Similarity scores of top-10 lists
Rank  HowSim    nSimGram  PathSim(MDM)  PathSim(MGM)  HeteSim(MDM)  HeteSim(MGM)  WPPR         PPR
1     1         1         1             1             1             1             0.408375     0.471681
2     0.290797  0.5       1             1             1             1             0.00248191   0.00186792
3     0.195247  0.25      1             1             1             1             0.00246846   0.00110983
4     0.19511   0.25      1             1             1             1             0.00245409   0.001088
5     0.178121  0.25      1             1             1             1             0.00245201   0.00107252
6     0.172314  0.25      1             1             1             1             0.00243339   0.00106975
7     0.169834  0.25      1             1             1             1             0.00242949   0.00104432
8     0.16869   0.25      0             1             0             1             0.000522484  0.00103956
9     0.167171  0.25      0             1             0             1             0.000269837  0.000981586
10    0.1648    0.25      0             1             0             1             0.00026926   0.000973551

6.3.4 Top-k Similarity Search
We compare different measures w.r.t. the quality of the results of top-k similarity search. We use IMDb, and search for the top-10 movies most similar to Titanic. The results are in Table 11. We can see that the results of HowSim are better than the others. Semantically, PathSim (MDM) and HeteSim (MDM) only output films directed by Cameron, and PathSim (MGM) and HeteSim (MGM) only produce romantic drama films. Most of the results of SimRank are not similar to Titanic. In contrast, the results of HowSim are much better, because they are all romantic drama movies starring Leonardo, Kate, or both. The results of PPR are not as good as those of HowSim. For example, although Aliens, The Terminator, and Avatar are science-fiction action horror films directed
by Cameron, they are obviously different from Titanic in terms of genres and actors. The results of nSimGram also include some unrelated movies, e.g., Welcome to Collinwood is a caper comedy film and Demon Knight is a horror comedy film, both of which are dissimilar to Titanic. Among all measures, HowSim is the only one that can combine similarities from multiple domains.

In addition, numerically, the score distribution of meta-path based methods cannot provide ranking results of good quality since most items share the same score, while the scores of HowSim take different values due to its recursive definition, i.e., an infinite number of paths are taken into consideration. The similarity scores of the top-k lists are shown in Table 12. We find that the scores of PathSim and HeteSim are either 1 or 0, which cannot distinguish different levels of similarity. nSimGram shows similar behavior. In contrast, the scores of HowSim can be used to rank similarities because its recursive definition produces different numerical levels. If we want PathSim or HeteSim to consider all aspects when finding similar films, then the shortest meta-path is MDMAMGMAMDM, which is too long to compute with the PathSim and HeteSim algorithms.

6.3.5 Effect of Varying α
We test the effect of varying α w.r.t. the similarity results. We use the IMDb data set, and vary α from 0.2 to 0.8 with a step of 0.2 for all node types. For each particular α(A), we set c(R) = (1 − α(A)) / |R(A)| for ∀R ∈ R(A). We extract the top-100 similar directors from S_D, excluding the diagonal elements. We find that the top-100 pairs are identical, and similar pairs of other types show similar results. We also query the top-10 similar directors for a set of randomly selected directors, and the top-10 lists are also the same for the various values of α. Due to space limitations, we omit the result tables here. We conclude that a different α only affects the absolute HowSim scores, not their relative ranking.

6.3.6 Effectiveness of Finding Decay Graphs
We study the effectiveness of our method of finding decay graphs. For each data set, we select 50 example pairs, and set α(A) = 0.2 for ∀A ∈ A. In addition to Algorithm 1, we also consider the following strategies: 1) Equiv: it sets c(R) = (1 − α(A)) / |R(A)| for ∀R ∈ R(A) and ∀A ∈ A; 2) Rand: it assigns random real values to the c(R)s while ensuring they satisfy Def. 3. We then calculate the average HowSim scores of the example pairs based on the decay functions computed by the three methods. The results are shown in Figure 10. On all data sets, our method outputs the highest average HowSim scores, which verifies the effectiveness of the solution for finding a decay graph in Section 5.2.

6.3.7 Comparison with Embedding-based Methods
We also compare HowSim with embedding-based methods, which generate embeddings for nodes and measure node similarities in vector space. We compare HowSim with the following methods: 1) TransE [29]: an embedding method for multi-relational data; 2) Metapath2vec [30]: a representation learning method for heterogeneous networks. For Metapath2vec, we set the number of walks per node to 200, the walk length to 20, and the vector dimension to 128. We use the meta-path UBU. For TransE, we set the number of epochs to 500. We perform link prediction on DeBo, set k = 20, and vary the percentage of randomly removed edges from 5% to 20%. The precision is shown in Figure 13. On average, Metapath2vec achieves the highest precision, and the result of HowSim is better than that of TransE. Therefore, the precision of HowSim is comparable with that of embedding-based methods. On the other hand, compared with embedding-based methods, HowSim is relatively interpretable due to its clear definition, and the parameters of the decay graph have clear semantic meaning.

Fig. 13: Link prediction - DeBo. Methods: HowSim, Metapath2vec, TransE; x-axis: percentage of deleted edges (0.05 to 0.2), y-axis: precision.

6.4 Efficiency
We study the efficiency of computing HowSim scores by comparing Naive-Iter (Section 4.1) and Opt-Iter (Algorithm 4.2). We use the maximum error (ME) to measure the accuracy of the algorithms, i.e., ME = max_{a,b∈V} |ŝ(a, b) − s(a, b)|, where ŝ(a, b) is the estimated HowSim score. We then test the efficiency of finding a decay graph.

6.4.1 Varying Data Sets
We set ε = 0.01 and run both algorithms on all data sets. The results of ME are shown in Figure 11a. We can see that the ME of both algorithms is smaller than ε, showing that they both have an accuracy guarantee. The computational time is shown in Figure 11b. We can observe that Opt-Iter is faster than Naive-Iter on all data sets, verifying the usefulness of the strategies proposed in Section 4.2.

6.4.2 Varying ε
Using DBIS, we vary ε from 0.01 to 0.1 with a step of 0.01. The ME under different ε is shown in Figure 12a. In addition, we draw a dotted line marking each value of ε, so that if the maximum error of an algorithm is bounded by ε, it lies under the dotted line. We find that both Naive-Iter and Opt-Iter are bounded, which is consistent with our analysis in Section 4. Note that the ME changes in a stair-like manner with an increasing ε: since the number of iterations K is discrete, K may remain the same under a varying ε, which leads to the same ME. The computational time is shown in Figure 12b. We find that the CPU time of Opt-Iter is less than that of Naive-Iter under all ε, showing the effectiveness of the optimization techniques introduced in Section 4.2. Another finding is that the CPU time of both Naive-Iter and Opt-Iter shows a stair-like behavior, showing that different values of ε may share the same number of iterations. Besides, the number of iterations of both methods under each particular ε is the same, because Opt-Iter only reduces the computational cost within each iteration, rather than the total number of iterations.

6.4.3 Varying |Λ|
We test the efficiency of the decay graph finding algorithm. Using the DeBo data set, we vary the number of example pairs |Λ| from 200 to 1000. The time of the decay graph finding process is shown in Figure 14a. We find that the processing time grows approximately linearly with an increasing |Λ|. This is because the more example pairs are provided, the more time is needed for enumerating the paths connecting each pair of nodes in the example set and
setting up the optimization problem. We set k = 50 and randomly remove 10% of the edges, and then perform the task of link prediction, guided by the decay graph under a varying |Λ|. The precision is shown in Figure 14b. We find that while |Λ| varies from 200 to 1000, the precision does not fluctuate much, and is around 0.8 for the different numbers of example pairs. The reason why only a small number of example pairs can achieve high precision is as follows. Since the number of parameters of a decay graph is small, i.e., |R| + |A|, the cardinality of the training set (example pairs) needed for deriving the decay graph is also small, i.e., O(|R| + |A|). In reality, it is also not practical for users to provide many similar pairs. This is consistent with the motivation of designing the decay graph: only a few specified parameters are enough for deciding the similarities of pairs of all types. On the other hand, the precision does not increase much as |Λ| increases. This is due to the limited number of parameters in the decay graph, which restricts the flexibility of HowSim to fit all possible similarity configurations (consider that there is only one parameter for SimRank, i.e., the decay factor c). However, this does not affect the effectiveness of HowSim, since only a few example pairs can achieve high precision.

Fig. 14: Varying |Λ|. Panels: (a) CPU time (y-axis: time (s)), (b) Precision; x-axis: |Λ| from 200 to 1000.

We summarize the findings of the experiments as follows: 1) HowSim can find similar objects for multiple domains, aggregate similarities from various relations, and thus produce high-quality results; 2) The decay graph is flexible for users to encode their preferences w.r.t. similarity aggregation for query processing; 3) HowSim is effective in different data mining tasks; 4) The strategies proposed in Section 4.2 can speed up HowSim's computation; 5) Our method (Section 5.2) is effective for finding good-quality decay graphs.

7 RELATED WORKS
Similarity Search on HINs. Similarity search over HINs has been studied recently, due to the popularity and ubiquity of HINs. Most current similarity measures are meta-path based. PathSim was proposed in [5], which counts all possible path instances given a meta-path and then performs normalization. [6] introduced HeteSim, which computes the meeting probability of two nodes following a given meta-path, and supports similarity search over different types of nodes. [7] proposed RelSim for relation similarity search in schema-rich HINs, e.g., knowledge graphs. [8], [9] defined a Path Constrained Random Walk (PCRW) model to measure object proximity in a labeled graph, which outperforms PPR over several data mining tasks. [10] proposed AvgSim, which considers both the given meta-path and its reverse, avoiding the path decomposition phase in HeteSim. While current works mainly focus on formulating different measures by incorporating meta-paths in different ways, HowSim is a meta-path free measure and can determine the similarities of different domains. Recently, [26] introduced a q-gram based measure, which counts the frequencies of paths connecting nodes a and b. The difference between [26] and HowSim is that [26] considered paths of a fixed length q, while HowSim aggregates all possible symmetric paths connecting nodes a and b, due to its recursive definition. [31] also extended SimRank with semantics by incorporating the semantic similarity of nodes and edge weights over an HIN. [31] required an additional ontology graph accompanying the underlying HIN to compute the semantic similarity. However, most real HINs do not have a corresponding ontology. Besides, [31] assumed that each edge is labeled with a weight which indicates the importance of the relation, but in reality, weights over an HIN are hard to obtain. In contrast, HowSim does not assume an ontology or edge weights. It only requires parameters at the schema level, instead of the instance level, and thus the number of parameters is much smaller.

Similarity Search on Homogeneous Graphs. Similarity search over homogeneous graphs has been well studied. Most similarity measures for them are link-based, because homogeneous graphs only have structural information. Some popular measures include PPR [14], SimRank [12], P-Rank [13], CoSimRank [32], and so on. These measures cannot be used in heterogeneous graphs directly, since they do not consider the semantic information of HINs. HowSim extends the intuition of SimRank to HINs. However, they differ in the following aspects: 1) SimRank uses a numerical variable, i.e., the decay factor, to describe how similarities are aggregated from neighbors, while HowSim uses a weighted graph, i.e., the decay graph, to define how similarities are aggregated from neighbors of different types; 2) While SimRank is defined for nodes of the same type, HowSim is defined for nodes of multiple types.

Graph Embeddings. Graph/node embeddings [33] are vector representations of nodes, where the learned embeddings usually preserve node proximity, i.e., similar nodes should have close embeddings. Some early works such as LLE [34], LE [35], and CGE [36] computed the node embeddings by matrix factorization, which results in heavy time and space overheads on large graphs. Recently, with the advent of deep learning [37], deep-learning based methods have shown promising performance in similarity evaluation. Some works such as D2AGE [38] and IPE [39] aimed to produce embeddings particularly for the similarity search task by a path-to-path learning schema. In addition, many graph embedding methods were proposed for general graph mining tasks, such as DeepWalk [40], LINE [41], GCN [42], and Node2Vec [43]. Despite their effectiveness, all of these methods only focused on homogeneous graphs. To address this problem, some recent methods such as Metapath2Vec [30] and HIN2Vec [44] were proposed to learn graph embeddings on heterogeneous graphs. Specifically, Metapath2Vec [30] exploited the graph schema and utilized user-specified meta-paths to control the generation process of graph embeddings, while HIN2Vec [44] further improved Metapath2Vec by introducing a semi-supervised process to fine-tune the embeddings for specific graph mining tasks. Our HowSim model differs from the above methods in the following aspects. First, while all the above methods are learning-based, where embeddings are learned by optimizing an objective function, HowSim is a general measure which has a clear definition and an interpretable semantic meaning. Second, when the graph is updated, learning-based methods would re-train the node embeddings, which is costly. In contrast, HowSim similarities can be computed on-line, due to its closed-form definition. Third, learning-based methods need to store all node embeddings, which has a large space
cost when the graph is large. There is no additional space cost for HowSim.

8 CONCLUSION
In this paper, we propose an effective and general similarity measure, i.e., HowSim, by extending SimRank to HINs. We introduce the concept of the decay graph, which describes how similarities are aggregated from different domains over different relations, making it possible to extend SimRank to HINs. We also give the matrix representation and probabilistic interpretation of HowSim. We study the properties of HowSim in depth, i.e., HowSim is normalized, self-maximum, symmetric, and has a unique solution. Compared with meta-path based measures, which can only capture a specific aspect of similarity, HowSim is more general and the decay function can be customized by users. We propose a naive iterative method for computing all-pairs HowSim and introduce optimization techniques. We also propose a method for finding decay graphs by providing example pairs of high proximity. Extensive experiments show the effectiveness of HowSim and the efficiency of our proposed methods. In the future, we plan to design efficient algorithms for top-k/single-source queries with HowSim, and extend HowSim to schema-free networks such as knowledge graphs.

REFERENCES
[1] X. Yin, J. Han, and P. S. Yu, “Linkclus: efficient clustering via heterogeneous semantic links,” in VLDB, 2006.
[2] N. Spirin and J. Han, “Survey on web spam detection: principles and algorithms,” SIGKDD Explor. Newsl., vol. 13, no. 2, 2012.
[3] Z. Abbassi and V. S. Mirrokni, “A recommender system based on local random walks and spectral methods,” in Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, 2007, pp. 102–108.
[4] D. Liben-Nowell and J. Kleinberg, “The link-prediction problem for social networks,” J Assoc Inf Sci Technol, vol. 58, no. 7, 2007.
[5] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “Pathsim: Meta path-based top-k similarity search in heterogeneous information networks,” PVLDB, vol. 4, no. 11, pp. 992–1003, 2011.
[6] C. Shi, X. Kong, Y. Huang, P. S. Yu, and B. Wu, “Hetesim: A general framework for relevance measure in heterogeneous networks,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 10, pp. 2479–2492, 2014.
[7] C. Wang, Y. Sun, Y. Song, J. Han, Y. Song, L. Wang, and M. Zhang, “Relsim: relation similarity search in schema-rich heterogeneous information networks,” in SDM. SIAM, 2016, pp. 621–629.
[8] N. Lao and W. W. Cohen, “Relational retrieval using a combination of path-constrained random walks,” Machine learning, vol. 81, no. 1, pp. 53–67, 2010.
[9] N. Lao and W. W. Cohen, “Fast query execution for retrieval models based on path-constrained random walks,” in KDD. ACM, 2010, pp. 881–888.
[10] D. Xiao, X. Meng, Y. Li, C. Shi, and B. Wu, “Avgsim: Relevance measurement on massive data in heterogeneous networks,” JATIT, vol. 84, no. 1, 2016.
[11] G. Jeh and J. Widom, “Scaling personalized web search,” in WWW, 2003.
[12] G. Jeh and J. Widom, “Simrank: a measure of structural-context similarity,” in KDD, 2002.
[13] P. Zhao, J. Han, and Y. Sun, “P-rank: a comprehensive structural similarity measure over information networks,” in CIKM, 2009.
[14] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web,” Stanford InfoLab, Tech. Rep., 1999.
[15] Y. Sun and J. Han, “Mining heterogeneous information networks: a structural analysis approach,” ACM SIGKDD Explorations Newsletter, vol. 14, no. 2, pp. 20–28, 2013.
[16] C. Shi, Y. Li, J. Zhang, Y. Sun, and P. S. Yu, “A survey of heterogeneous information network analysis,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 1, pp. 17–37, 2017.
[17] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for simrank computation,” The VLDB Journal, vol. 19, no. 1, pp. 45–66, 2010.
[18] B. Tian and X. Xiao, “Sling: A near-optimal index structure for simrank,” in SIGMOD, 2016.
[19] M. Jiang, A. W. Fu, R. C. Wong, and K. Wang, “READS: A random walk approach for efficient and accurate dynamic simrank,” PVLDB, vol. 10, no. 9, 2017.
[20] Y. Liu, B. Zheng, X. He, Z. Wei, X. Xiao, K. Zheng, and J. Lu, “Probesim: Scalable single-source and top-k simrank computations on dynamic graphs,” PVLDB, 2017.
[21] W. Yu, X. Lin, W. Zhang, and J. A. McCann, “Dynamical simrank search on time-varying networks,” VLDB J., 2018.
[22] Y. Wang, X. Lian, and L. Chen, “Efficient simrank tracking in dynamic graphs,” in ICDE, 2018, pp. 545–556.
[23] Y. Wang, L. Chen, Y. Che, and Q. Luo, “Accelerating pairwise simrank estimation over static and dynamic graphs,” VLDB J., vol. 28, no. 1, pp. 99–122, Feb. 2019.
[24] Z. Zhang, Y. Shao, B. Cui, and C. Zhang, “An experimental evaluation of simrank-based similarity search algorithms,” PVLDB, vol. 10, no. 5, pp. 601–612, 2017.
[25] W. Xie, D. Bindel, A. Demers, and J. Gehrke, “Edge-weighted personalized pagerank: Breaking a decade-old performance barrier,” in KDD. ACM, 2015, pp. 1325–1334.
[26] A. Conte, G. Ferraro, R. Grossi, A. Marino, K. Sadakane, and T. Uno, “Node similarity with q-grams for real-world labeled networks,” in KDD. ACM, 2018, pp. 1282–1291.
[27] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to information retrieval. Cambridge university press, 2008.
[28] L. Kaufman and P. J. Rousseeuw, Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 2009, vol. 344.
[29] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in NIPS, 2013, pp. 2787–2795.
[30] Y. Dong, N. V. Chawla, and A. Swami, “metapath2vec: Scalable representation learning for heterogeneous networks,” in SIGKDD. ACM, 2017, pp. 135–144.
[31] B. Youngmann, T. Milo, and A. Somech, “Boosting simrank with semantics,” in EDBT, Lisbon, Portugal, March 26-29, 2019, pp. 37–48.
[32] S. Rothe and H. Schütze, “Cosimrank: A flexible & efficient graph-theoretic similarity measure,” in ACL, vol. 1, 2014, pp. 1392–1402.
[33] H. Cai, V. W. Zheng, and K. Chang, “A comprehensive survey of graph embedding: problems, techniques and applications,” IEEE Transactions on Knowledge and Data Engineering, 2018.
[34] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[35] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural computation, vol. 15, no. 6, pp. 1373–1396, 2003.
[36] D. Luo, F. Nie, H. Huang, and C. H. Ding, “Cauchy graph embedding,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 553–560.
[37] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press Cambridge, 2016, vol. 1.
[38] Z. Liu, V. W. Zheng, Z. Zhao, F. Zhu, K. C.-C. Chang, M. Wu, and J. Ying, “Distance-aware dag embedding for proximity search on heterogeneous graphs,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.
[39] Z. Liu, V. W. Zheng, Z. Zhao, Z. Li, H. Yang, M. Wu, and J. Ying, “Interactive paths embedding for semantic proximity search on heterogeneous graphs,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1860–1869.
[40] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 701–710.
[41] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 1067–1077.
[42] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[43] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016, pp. 855–864.
[44] T.-y. Fu, W.-C. Lee, and Z. Lei, “Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning,” in Proceedings of the 26th ACM on Conference on Information and Knowledge Management. ACM, 2017, pp. 1797–1806.
Yue Wang received the PhD degree from the Hong Kong University of Science and Technology (HKUST), in 2019. He is a researcher with the Shenzhen Institute of Computing Sciences, Shenzhen University. His research interests include data mining and graph algorithms.

Zhe Wang received the BS degree from School of Data and Computer Science, Sun Yat-sen University in 2018, and the MA degree from CSE Department, HKUST in 2019. He is currently working as a research assistant in the CSE Department, HKUST. His major research fields include large-scale graph mining and graph representation learning.

Ziyuan Zhao is a researcher at WeChat Group, Tencent Cooperation.

Zijian Li received his BS and MA degree from CSE department of Zhejiang University, China in 2015. He is currently a PhD student at the CSE department of Hong Kong University of Science and Technology. His major research fields are large-scale graph mining and distributed network analysis.

Xun Jian received his B.Eng. degree in Software Engineering in 2014 from Beihang University. Then he received his M.Sc. degree in Information Technology in 2016 from The Hong Kong University of Science and Technology (HKUST). Now he is a Ph.D. student in the Department of Computer Science at HKUST. His research interests include crowdsourcing and algorithms on graph.

Hao Xin is a PhD candidate at Department of Computer Science and Engineering in Hong Kong University of Science and Technology (HKUST). Currently, he is working with Prof. Lei Chen on knowledge base. He obtained his bachelor degree in computer science from Zhejiang University, China in 2016.

Lei Chen (Fellow, IEEE) received his BS degree in Computer Science and Engineering from Tianjin University, China, in 1994, the MA degree from Asian Institute of Technology, Thailand, in 1997, and the PhD degree in computer science from University of Waterloo, Canada, in 2005. He is now a professor in the Department of Computer Science and Engineering at Hong Kong University of Science and Technology. His research interests include data-driven machine learning, crowdsourcing, knowledge graphs, graph and probabilistic databases.

Jianchun Song received the BS degree in computer science and technology from Zhengzhou University, China, in 2012, and the MA degree from Harbin Institute of Technology, China, in 2014. His research interests include search engine and dialogue system.

Zhenhong Chen is a researcher at WeChat Group, Tencent Cooperation.

Meng Zhao is a researcher at WeChat Group, Tencent Cooperation.