
This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2020.3019488, IEEE
Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1

Effective Similarity Search on Heterogeneous Networks: A Meta-path Free Approach
Yue Wang, Zhe Wang, Ziyuan Zhao, Zijian Li, Xun Jian, Hao Xin, Lei Chen, Fellow, IEEE, Jianchun Song,
Zhenhong Chen, Meng Zhao

Abstract—Heterogeneous information networks (HINs) are usually used to model information systems with multi-type objects and
relations. In contrast, graphs that have a single type of nodes and edges are often called homogeneous graphs. Measuring similarities
among objects is an important task in data mining applications, such as web search, link prediction, and clustering. Currently, several
similarity measures are defined for HINs. Most of these measures are based on meta-paths, which describe sequences of node types
and edge types along the paths between two nodes. However, meta-paths, which are often designed by domain experts, are hard to
enumerate and choose w.r.t. the quality of similarity scores. This makes using existing similarity measures in real applications difficult.
To address this problem, we extend SimRank, a well-known similarity measure on homogeneous graphs, to HINs, by introducing the
concept of the decay graph. The proposed similarity measure, HowSim, is meta-path free and captures structural and semantic
similarity simultaneously. The generality and effectiveness of HowSim, and the efficiency of our proposed algorithms for computing
HowSim scores, are demonstrated by extensive experiments.

Index Terms—Heterogeneous information networks, similarity measure, data mining, SimRank

• Y. Wang is with Shenzhen Institute of Computing Sciences, Shenzhen University, China. E-mail: yuewang@sics.ac.cn.
• Z. Wang, Z. Li, X. Jian, H. Xin, L. Chen are with the Department of Computer Science and Engineering, the Hong Kong
University of Science and Technology, Clear Water Bay, Hong Kong SAR, China. E-mail: {zwangec,zlicb,xjian,hxinaa,leichen}@ust.hk.
• Z. Zhao, J. Song, Z. Chen, M. Zhao are with the WeChat Group, Tencent Corporation, Guangzhou, China. E-mail:
{joshuazhao,barrysong,hollischen,doudouzhao}@tencent.com.

1 INTRODUCTION
1.1 Motivation
Many information systems can be modeled as graphs, where data objects are represented by nodes, and the relations
among them are represented by edges. Besides, real-world systems often consist of data objects of multiple types
and relations. Hence, heterogeneous information networks (HINs), which contain multiple-type nodes and links, are
usually used to model such systems. Examples include social media networks and bibliographic networks.

Measuring similarities among nodes in the same domain plays a key role in data mining tasks, such as graph
clustering [1], spam detection [2], recommendation systems [3], and link prediction [4]. Currently, there are several
similarity measures defined for HINs, including PathSim [5], HeteSim [6], RelSim [7], PCRW [8], [9], AvgSim [10], and
so on. However, all of them suffer from the generality problem, since they require a user to specify a meta-path
before querying similarities. A meta-path can capture one particular semantic meaning among different paths
connecting two objects. Consider the HIN in Figure 1a, with node types Author, Paper and Venue. The meta-path APA
indicates two authors have been co-authors, and the meta-path APVPA indicates two authors have published papers
in the same venue. Since one meta-path only has one semantic meaning, it fails to aggregate all possible semantic
information to evaluate the similarity. Even though a user can combine similarity scores by providing multiple
meta-paths, this strategy has the following drawbacks: 1) It is hard to adjust the weights among meta-paths when
combining scores w.r.t. the quality of similarity results, since there is no principle to decide which meta-path is
more important than the others; 2) Enumerating meta-paths is laborious, since two nodes can be connected in
different ways, and the number of potential meta-paths is infinite.

Moreover, a meta-path defined for querying similarities among nodes of type A1 cannot be applied to nodes of type
A2. For instance, if we use meta-paths APA and APVPA to find similar authors over the HIN in Figure 1a, we cannot
reuse these meta-paths to find similar papers or similar venues. The candidates for finding similar papers are PAP
or PVP, and for finding similar venues are VPV or VPAPV. This shortcoming of meta-path based measures leads to a
heavy workload for meta-path generation and selection when users want to query similarities in multiple domains.

Another way to measure similarities over an HIN is to treat it as a homogeneous graph by ignoring the type
information of different nodes and edges, and then to apply an existing similarity measure for homogeneous graphs,
such as Personalized PageRank (PPR) [11], SimRank [12] or P-Rank [13]. This strategy can tackle the generality
problem discussed above. However, only considering the structural similarity and ignoring the information of
multiple typed relations would result in the loss of semantic information, which can also contribute to similarity
scores. Therefore, there is still no similarity measure on HINs that combines structural similarity and semantic
similarity in a simple and general way.

1.2 Contributions
In this paper, we propose HowSim, a Heterogeneous information network based Similarity measure. Our basic
idea is to apply the intuition of SimRank [12], i.e., “two nodes are similar if they are referenced by similar
nodes”, to HINs. However, it is not intuitive to perform this extension directly. The original SimRank is a
structural similarity measure defined particularly for homogeneous graphs, and it cannot capture the semantic
similarity from multiple relations over different domains in an HIN. As we have discussed previously, one way is to
ignore the node and edge types, and treat the HIN as a homogeneous graph. However, this results in the loss of the
semantic information.

1041-4347 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Cornell University Library. Downloaded on September 03,2020 at 05:00:45 UTC from IEEE Xplore. Restrictions apply.

TABLE 1: The comparison of different similarity measures
Measure        Symmetric   Structural Information   Semantic Information   Meta-path Free
PPR [14]       No          Yes                      No                     Yes
SimRank [12]   Yes         Yes                      No                     Yes
PathSim [5]    Yes         No                       Yes                    No
HeteSim [6]    Yes         No                       Yes                    No
PCRW [8]       No          No                       Yes                    No
HowSim         Yes         Yes                      Yes                    Yes

From our point of view, due to the existence of multiple typed relations in HINs, the similarity of two objects can be
aggregated by different semantic meanings under different relations with different neighbors, instead of the single
“referenced by” relationship in SimRank. For example, consider the bibliographic network in Figure 1a, two papers can be
similar in the following four semantic aspects:
1) They are cited by similar papers (“cited-by”);
2) They are citing similar papers (“citing”);
3) They are written by similar authors (“written-by”);
4) They are published in similar venues (“published-in”).
At the same time, two authors can only be similar if they write similar papers, since the node type “author” is only
involved in the relation “writing” (“written-by”) (see footnote 1). Therefore, ignoring the different types of nodes
and edges in the HIN, as SimRank does, cannot capture the subtle differences in (i) the ways of similarity
aggregation in different domains; and (ii) the similarities aggregated from different types of relations with
different neighbors.
Footnote 1: This does not mean the similarity of authors is not influenced by the similarity of venues where they have
publications, since this information is absorbed in the paper-paper similarity due to the recursive aggregation.

Fig. 1: An example of an HIN and its network schema. (a) Network instance. (b) An example of network schema on
bibliographic data.

To address this problem, we propose the concept of the decay graph. A decay graph is a weighted directed graph,
whose structure is the same as the network schema of an HIN, and whose weights are defined for each relation and
its inverse (details in Section 3). The functionality of a decay graph is to encode how similarities are aggregated
under different relations. Equipped with the decay graph, we define our HowSim measure by aggregating similarities
recursively from neighbors.

HowSim is a general similarity measure due to the following reasons. First, the original definition of SimRank and
its variants [12], [13] are all special cases of the HowSim model, since they just differ in the initializations of
decay graphs. Second, once given the decay graph, all similarity scores in all domains in an HIN are determined,
without requiring users to provide any additional meta-path information. Third, even though HowSim is a meta-path
free measure, it can combine the different semantic meanings of different paths due to its recursive definition.
Therefore, HowSim can capture the structural similarity and semantic similarity simultaneously.

We then study the mathematical properties of HowSim in depth. In addition to the basic definition of HowSim, we
present its matrix representation and probabilistic interpretation. We show that HowSim is symmetric, normalized,
self-maximum, and it always has a unique solution. In addition, we also propose an effective method for finding a
decay graph automatically given example node pairs. To compute HowSim scores, we propose a naive iterative method
and prove that it always converges to the correct HowSim scores. We also propose heuristic optimization strategies
to improve the efficiency. We verify the effectiveness of our HowSim model for various data mining tasks by
extensive experiments. The comparisons of HowSim with other popular measures are also shown in Table 1. In summary,
we make the following contributions in this paper:

• We propose a general and meta-path free similarity measure, i.e., HowSim, by proposing the concept of the
decay graph to extend SimRank to HINs.
• We study properties of HowSim in depth and show it is symmetric, non-negative, normalized and self-maximum.
This leads to HowSim being a semi-metric measure.
• We introduce an iterative method to compute HowSim scores with an accuracy guarantee, and then propose
optimization techniques to boost computation.
• We propose an effective algorithm for finding a decay graph automatically for HowSim.
• We perform extensive experiments to validate the effectiveness and generality of HowSim.

This paper is organized as follows. We introduce related definitions in Section 2, and define HowSim in Section 3.
Methods for computing HowSim scores are introduced in Section 4. We discuss how to define a decay graph in
Section 5, and we perform experimental studies in Section 6. Related works are shown in Section 7, and we conclude
the paper in Section 8.

2 PRELIMINARIES
In this section, we give background on SimRank and HINs. Detailed definitions can also be found in [12] and
[5], [6], [15], [16], respectively.

2.1 SimRank
The intuition behind SimRank is two-fold: a) two nodes are similar if they are linked by similar nodes; b) two
identical nodes have the similarity 1. Given an unweighted directed graph G(V, E) with n nodes and m edges, the
SimRank score of two nodes a, b ∈ V is formulated as follows:

\[
s(a, b) =
\begin{cases}
1, & a = b, \\[4pt]
\dfrac{C}{|I(a)|\,|I(b)|} \displaystyle\sum_{a' \in I(a),\; b' \in I(b)} s(a', b'), & a \neq b,
\end{cases}
\tag{1}
\]

where I(a) denotes the set of in-neighbors of node a, and C ∈ (0, 1) is the decay factor which is usually set to
0.6 [17] or 0.8 [12]. Since Eq.(1) is recursive, SimRank can aggregate similarities of multi-hop neighbors of the
original pair of nodes, and produce high-quality results. SimRank has attracted a lot of research attention
following its introduction [17], [18], [19], [20], [21], [22], [23].
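
As an illustration of Eq.(1), the following minimal sketch computes SimRank scores by fixed-point iteration on a toy graph. The graph, the function name `simrank`, and the number of iterations are assumptions made for this example and are not taken from the paper. Pairs in which either node has no in-neighbors are given the score 0, which is the usual convention for an empty sum.

```python
def simrank(in_nbrs, C=0.8, iterations=10):
    """Iterate Eq. (1), starting from s(a, a) = 1 and s(a, b) = 0 for a != b."""
    nodes = list(in_nbrs)
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        nxt = {}
        for a in nodes:
            for b in nodes:
                ia, ib = in_nbrs[a], in_nbrs[b]
                if a == b:
                    nxt[(a, b)] = 1.0
                elif ia and ib:
                    # Average the previous scores over all in-neighbor pairs.
                    total = sum(s[(x, y)] for x in ia for y in ib)
                    nxt[(a, b)] = C * total / (len(ia) * len(ib))
                else:
                    # Assumed convention: no in-neighbors means no similarity.
                    nxt[(a, b)] = 0.0
        s = nxt
    return s


# Hypothetical toy graph with edges c -> a, d -> a and c -> b,
# stored as a map from each node to its in-neighbors.
IN_NBRS = {"a": ["c", "d"], "b": ["c"], "c": [], "d": []}
scores = simrank(IN_NBRS)
print(scores[("a", "b")])
```

Here a and b share the in-neighbor c, while d only points to a, so the average over the two in-neighbor pairs (c, c) and (d, c) is 1/2 and the score of (a, b) is C/2.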

2.2 Heterogeneous Information Network
Definition 1 (Information Network [5]). An information network is a directed graph G(V, E) with a node type mapping
function φ : V → 𝒜 and an edge mapping function ψ : E → ℛ, where each node v ∈ V has a node type φ(v) ∈ 𝒜 and each
edge e ∈ E has a relation type ψ(e) ∈ ℛ.

The underlying network is called a heterogeneous information network (HIN) when the number of node types |𝒜| > 1 or
the number of edge types |ℛ| > 1; otherwise, it is called a homogeneous information network.

Example 1. Figure 1a shows an HIN for the bibliographic data, derived from DBLP (see footnote 2), which contains the
bibliographic information for the computer science community. This bibliographic network contains three types of
nodes: paper, venue and author. Each paper is connected with one or multiple authors and a venue where it is
published. These different types of edges reveal the different relations among different nodes over an HIN.

Footnote 2: https://dblp.uni-trier.de/db/

It is also necessary to describe an HIN on the meta level, i.e., which types of objects and relations an HIN can
have. Therefore, the concept of a network schema is proposed in [5].

Definition 2 (Network Schema [5]). The network schema is a meta template for an HIN, and it is a directed graph
T_G(𝒜, ℛ), which is defined on the node types 𝒜, with edges as relations from ℛ.

The network schema specifies the type constraints over objects and relationships among them. We denote the semantic
meaning of R as R.M. For a relation R which links an object type S to another object type T, i.e.,
$S \xrightarrow{R.M} T$, we call S the source type of R (denoted as R.S), and T the target type (denoted as R.T).
The inverse relationship of R is denoted as R^{-1}. If $S \xrightarrow{R.M} T$ holds, then
$T \xrightarrow{R^{-1}.M} S$ holds naturally, i.e., R.S = R^{-1}.T and R.T = R^{-1}.S. Therefore, if R ∈ ℛ, then
R^{-1} ∈ ℛ. As a result, edges in T_G are always bi-directional. We denote by ℛ(A) the set of relations whose source
type is A, i.e., ℛ(A) = {R : R ∈ ℛ ∧ R.S = A}. Given a node a and a relation R where R.S = φ(a), we denote the
neighbors of a under R in G as N_R(a) = {a' : φ(a') = R.T ∧ (a, a') ∈ E}.

Example 2. The directed graph in Figure 1b shows the network schema of the HIN in Figure 1a. Table 2 shows relations
in the network schema, where ℛ = {R1, R2, R3, R4, R5, R6}. In ℛ, the following reverse relations hold:
R1^{-1} = R4, R2^{-1} = R5 and R3^{-1} = R6. Consider the type A is paper, then ℛ(A) = {R1, R4, R5, R6}, i.e., the
set of relations whose source type is paper. Consider paper p1, its neighbors under different relations in ℛ(A) are:
N_R1(p1) = {p2}, N_R4(p1) = ∅, N_R5(p1) = {z1}, and N_R6(p1) = {x1, x2, x3}.

TABLE 2: ℛ of the network schema in Figure 1b
R    R.S      R.M            R.T      c(R)
R1   paper    citing         paper    0.3
R2   venue    publishing     paper    0.5
R3   author   writing        paper    0.6
R4   paper    cited by       paper    0.2
R5   paper    published in   venue    0.2
R6   paper    written by     author   0.1

3 THE DEFINITION OF HOWSIM
In this section, we first introduce the concept of decay function (graph) and the HowSim measure. We then give its
probabilistic interpretation and matrix representation, which are useful for understanding the properties of HowSim.

3.1 Decay Function and HowSim Equation
Our intuition of HowSim is derived from SimRank [12], whose intuition is “two objects are similar if they are
referenced by similar objects”. To extend the idea of SimRank to HINs, we use the following intuitions to define
HowSim:
1) Similarity is only defined among nodes of the same type, not nodes of different types (see footnote 3);
2) Two objects are similar if they have similar neighbors;
3) A node has the similarity of 1 with itself.

Footnote 3: Indeed, finding close objects among different types is usually called relevance search, instead of
similarity search.

Intuition (1) means that the similarities among nodes of an HIN are represented by |𝒜| similarity matrices, each of
which is for a type A ∈ 𝒜, instead of a single |V| × |V| matrix. These |𝒜| matrices are dependent on each other. For
intuition (2), we consider N_R(a) for ∀R ∈ ℛ(A), due to different roles of a in different relations. Intuition (3)
makes sure that similarity scores are normalized, which follows the original SimRank definition [12].

Definition 3 (Decay Function). Given a network schema T_G(𝒜, ℛ), a decay function c is a mapping ℛ → ℝ, i.e., it
maps each relation in ℛ to a real number, and c also satisfies:
1) ∀R ∈ ℛ, c(R) ≥ 0;
2) ∀A ∈ 𝒜, 0 < Σ_{R ∈ ℛ(A)} c(R) < 1.

The decay function can also be represented by a directed graph, whose structure is the same as T_G(𝒜, ℛ), except
that edges are assigned weights. We call this graph the decay graph, and we use the terms decay function and decay
graph interchangeably in the rest of the paper. Given any T_G and c, a unique function α : 𝒜 → ℝ can be determined,
where ∀A ∈ 𝒜:

\[
\alpha(A) = 1 - \sum_{R \in \mathcal{R}(A)} c(R). \tag{2}
\]

The decay graph describes how a node aggregates similarities from its neighbors over different relations.

Fig. 2: Examples of decay graphs. (a) A decay graph of the schema in Figure 1b. (b) The decay graph of SimRank,
C ∈ (0, 1). (c) The decay graph of Bipartite SimRank, C1, C2 ∈ (0, 1). (d) The decay graph of P-Rank,
C1 + C2 ∈ (0, 1).

Example 3. Figure 2a shows a decay graph for the network schema in Figure 1b. For the “paper-paper” similarity,
20% comes from the average similarity of venues they are


published in, 10% comes from the average similarity of authors who write them, 30% comes from the average
similarity of papers they cite, and 20% comes from the similarity of papers which cite them. For the “venue-venue”
similarity, 50% comes from the average similarity of papers they publish.

The ability to customize the decay function is central to HowSim, since we cannot assume weights are fixed for
different applications. For example, to find similar papers, a researcher would focus more on the author-paper
relationship to find papers with similar authors, or alternatively focus more on the paper-venue relationship to
find papers published in similar venues.

Let a and b be two nodes of type A. We define the HowSim score s_A(a, b) as follows.

Definition 4 (HowSim). Given an HIN G(V, E, φ, ψ), a decay function c, and a pair of nodes a, b ∈ V where
φ(a) = φ(b) = A ∈ 𝒜, the HowSim similarity of (a, b) is defined as:

\[
s_A(a, b) =
\begin{cases}
1, & a = b, \\[4pt]
\displaystyle\sum_{R \in \mathcal{R}(A)} \; \sum_{a' \in N_R(a)} \; \sum_{b' \in N_R(b)}
\frac{c(R)\, s_{R.T}(a', b')}{|N_R(a)|\,|N_R(b)|}, & a \neq b.
\end{cases}
\tag{3}
\]

When a = b, their similarity is 1 by definition. In the case that a ≠ b, Eq.(3) computes the average similarities
from N_R(a) and N_R(b) for each R ∈ ℛ(A), and combines them with the weights c(R). Since φ(a) = φ(b), the types of
neighbors of a and b under R are always the same, so Eq.(3) is well defined on any node pair with the same type,
and the solution to Eq.(3) is the collection of |𝒜| similarity matrices. In addition, given G, there are
Σ_{A ∈ 𝒜} |V_A|^2 equations following Eq.(3). In the rest of this paper, we simplify both s_A(a, b) and
s_{R.T}(a, b), by ignoring the type information, as s(a, b) when the context is clear.

SimRank and its variants (bipartite SimRank [12] and P-Rank [13]) are special cases of HowSim similarity, since they
can be viewed as different initializations of decay graphs. Their decay graphs are shown in Figures 2b to 2d, and
the details of comparisons are in the supplemental material.

Theorem 1. Given G and a decay function c, the solution of Eq.(3) is unique.

Proof. Please see the proof in the supplemental material.

More Discussions about Decay Graphs. The decay graph and the network schema share the same network structure; the
only difference between them is that the decay graph is weighted. The presence of the decay graph is the reason
that we conclude HowSim outperforms meta-path based measures w.r.t. usability: it just uses the network schema
itself to compute similarities instead of asking users to provide meta-paths. The size of a decay graph is
O(|ℛ| + |𝒜|), which is much smaller than the underlying HIN. The role of the decay graph w.r.t. HowSim is the same
as that of the decay factor C w.r.t. SimRank, though the former is a weighted graph and the latter is a scalar.
When the HIN is dynamic, i.e., nodes/edges keep being added or deleted, we can reuse the same decay graph if the
network schema is not changed, which is similar to the reuse of the decay factor in dynamic SimRank computation
[19], [20], [22], [23]. However, if the network schema is changed, e.g., a new type of relation is added to the
HIN, we need to refine the decay graph correspondingly. Besides, HowSim relies on the network schema being given,
as is also the case in [5], [6], [16]. We would extend HowSim to schema-free networks, such as knowledge graphs, in
future work.

3.2 Probabilistic Interpretation
Suppose two coupled random walks W_a and W_b start from a and b, respectively. Let type A ∈ 𝒜 be the node type W_a
and W_b currently reside in. At each step, W_a and W_b either: 1) stop with the probability α(A); or 2) choose a
relation R from ℛ(A) with the probability proportional to c(R), and choose one of their neighbors under R (N_R(a)
and N_R(b)) uniformly to proceed. The probabilistic interpretation of HowSim is given as follows.

Theorem 2. Let s'(a, b) be the probability that W_a and W_b meet, then s'(a, b) is equal to s(a, b).

Proof. Please see the proof in the supplemental material.

Based on the probabilistic interpretation of HowSim, we can conclude HowSim scores are in [0, 1].

3.3 Matrix Representation
Here we give the matrix representation for Eq.(3). Given a type A ∈ 𝒜, let S_A be the similarity matrix for A.
Given a relation R ∈ ℛ, let **R** be the adjacency matrix of R, i.e., ∀a ∈ V_{R.S}, b ∈ V_{R.T}, **R**_{a,b} is 1
if R(a, b) ∈ E and 0 otherwise. Let W_R be the row normalized matrix of **R**. Then ∀A ∈ 𝒜, S_A satisfies:

\[
S_A = \Big( \sum_{R \in \mathcal{R}(A)} c(R)\, W_R\, S_{R.T}\, W_R^{\top} \Big) \vee I,
\]

where ∨ denotes the element-wise maximum operator, and I is an identity matrix. There are |𝒜| such matrix equations
for G.

3.4 Properties of HowSim
In summary, HowSim has the following numeric properties:
1) Normalized: given a ∈ V, ∀b ∈ V: s(a, b) ∈ [0, 1];
2) Symmetric: ∀a, b ∈ V, s(a, b) = s(b, a);
3) Self-maximum: ∀a ∈ V, s(a, a) = 1;
4) Uniqueness: given G and c, the similarities in all domains of G are unique.
Next, we discuss the computation of HowSim.

4 THE COMPUTATION OF HOWSIM
Extending SimRank Techniques to HowSim. It is possible to extend the techniques for SimRank computation to
HowSim, since HowSim is a generalization of SimRank. Currently there are two types of methods for SimRank
computation with accuracy guarantees: iterative methods and random walk based methods [24]. The former is usually
used for all-pairs SimRank, while the latter is for single-pair/source queries. We study all-pairs HowSim in this
paper, and leave the HowSim query problem as future work. We cannot extend the iterative method unless the
following questions are answered: 1) do the estimated HowSim scores converge to the ground truth; 2) if so, what is
the convergence rate. In this section, we first present a naive iterative method for computing HowSim matrices
with an accuracy guarantee, and then present optimization techniques to reduce the cost.
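
To make Definition 3, Eq.(2) and Eq.(3) concrete, the following minimal sketch checks a decay function and then evaluates Eq.(3) by repeatedly substituting the scores of the previous round. It is an illustration under stated assumptions rather than the authors' implementation: the toy HIN (two papers by different authors in one venue), the relation names, the data layout and the function names `alpha` and `howsim` are all hypothetical, and the citation relations of Table 2 are left out to keep the example small. The decay weights reuse the values of Table 2 for the relations that remain.

```python
def alpha(decay, source_type):
    """Eq. (2): alpha(A) = 1 - sum of c(R) over relations R with source type A."""
    totals = {}
    for rel, c in decay.items():
        if c < 0:
            raise ValueError(f"c({rel}) must be non-negative")
        a_type = source_type[rel]
        totals[a_type] = totals.get(a_type, 0.0) + c
    for a_type, total in totals.items():
        if not 0.0 < total < 1.0:
            raise ValueError(f"weights of type {a_type} must sum to a value in (0, 1)")
    return {a_type: 1.0 - total for a_type, total in totals.items()}


def howsim(node_type, nbrs, source_type, decay, iterations=30):
    """Iterate Eq. (3), starting from s(a, a) = 1 and s(a, b) = 0 for a != b."""
    alpha(decay, source_type)  # only used here to validate Definition 3
    pairs = [(a, b) for a in node_type for b in node_type
             if node_type[a] == node_type[b]]
    s = {(a, b): 1.0 if a == b else 0.0 for a, b in pairs}
    for _ in range(iterations):
        nxt = {}
        for a, b in pairs:
            if a == b:
                nxt[(a, b)] = 1.0
                continue
            total = 0.0
            for rel, c in decay.items():
                if source_type[rel] != node_type[a]:
                    continue
                na, nb = nbrs[rel].get(a, []), nbrs[rel].get(b, [])
                if na and nb:  # an empty neighbor set contributes nothing
                    acc = sum(s[(x, y)] for x in na for y in nb)
                    total += c * acc / (len(na) * len(nb))
            nxt[(a, b)] = total
        s = nxt
    return s


# Hypothetical toy HIN: p1 is written by x1, p2 by x2, both appear in venue z1.
NODE_TYPE = {"p1": "paper", "p2": "paper", "x1": "author", "x2": "author", "z1": "venue"}
SOURCE_TYPE = {"written_by": "paper", "published_in": "paper",
               "writing": "author", "publishing": "venue"}
NBRS = {
    "written_by": {"p1": ["x1"], "p2": ["x2"]},
    "published_in": {"p1": ["z1"], "p2": ["z1"]},
    "writing": {"x1": ["p1"], "x2": ["p2"]},
    "publishing": {"z1": ["p1", "p2"]},
}
DECAY = {"written_by": 0.1, "published_in": 0.2, "writing": 0.6, "publishing": 0.5}

scores = howsim(NODE_TYPE, NBRS, SOURCE_TYPE, DECAY)
print(round(scores[("p1", "p2")], 6), round(scores[("x1", "x2")], 6))
```

On this toy network Eq.(3) reduces to s(p1, p2) = 0.1 · s(x1, x2) + 0.2 and s(x1, x2) = 0.6 · s(p1, p2), so the iteration approaches s(p1, p2) = 0.2 / 0.94. The paper similarity picks up the shared venue directly, and the author similarity inherits it through the papers, without any meta-path being specified.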


4.1 Naive Iterative Computing
Due to the recursive definition of Eq.(3), the HowSim scores can be computed via the Jacobi method, which
iteratively updates current scores using results of the last iteration. Initially, at step k = 0, we set
s(a, b) = 1 if a = b, and 0 otherwise. Then at each step k > 0, we update s(a, b) (where a ≠ b) according to
Eq.(3):

\[
s_A^{(k)}(a, b) = \sum_{R \in \mathcal{R}(A)} \; \sum_{a' \in N_R(a)} \; \sum_{b' \in N_R(b)}
\frac{c(R)\, s_{R.T}^{(k-1)}(a', b')}{|N_R(a)|\,|N_R(b)|}. \tag{4}
\]

The following theorem shows the similarities computed at any step k are bounded.

Theorem 3. ∀k > 0, ∀A ∈ 𝒜, and ∀a, b ∈ V_A, the following inequality holds:

\[
0 \le s^{(k-1)}(a, b) \le s^{(k)}(a, b) \le 1. \tag{5}
\]

Proof. Please see the proof in the supplemental material.

Proposition 4. The result of the iterative procedure converges to the correct solution of Eq.(3) as k → ∞.

Proof. Please see the proof in the supplemental material.

The Complexity. We first analyse the number of iterations required to achieve an error bound and then the
complexity.

Theorem 5. Let C = max_{A ∈ 𝒜} Σ_{R ∈ ℛ(A)} c(R). For any k ∈ [0, ∞) and any A ∈ 𝒜,

\[
s_A(a, b) - s_A^{(k)}(a, b) \le C^{k+1}. \tag{6}
\]

Proof. Please see the proof in the supplemental material.

From Theorem 5, if we want max_{a,b ∈ V} |ŝ(a, b) − s(a, b)| to be less than a given error bound ε, where ŝ(a, b) is
the estimated HowSim score of a and b, then we can set k = ⌈log_C ε⌉. Next we analyse the complexity. In each
iteration, there are Σ_{A ∈ 𝒜} |V_A|^2 scores to be computed. Since each node pair aggregates under different
relations, the time cost of one iteration is:

\[
O\Big( \sum_{A \in \mathcal{A}} \sum_{R \in \mathcal{R}(A)} |V_A|^2 d_R^2 \Big)
= O\Big( \sum_{A \in \mathcal{A}} \sum_{R \in \mathcal{R}(A)} |E_R|^2 \Big)
= O\Big( \sum_{R \in \mathcal{R}} |E_R|^2 \Big) = O(m^2), \tag{7}
\]

where V_A is the set of nodes of type A, E_R is the set of edges for relation R, and d_R is the average node degree
under R. The worst-case time complexity of the above naive iterative method with K iterations is O(Km^2). In
addition, the space cost is O(Σ_{A ∈ 𝒜} |V_A|^2) = O(n^2).

4.2 Optimization Techniques
In this section, we present two techniques, i.e., selecting candidate pairs and relation-specific partial sum,
which reduce the above time complexity to O(Kmn). These techniques were originally proposed by [17] for efficient
SimRank iteration; however, they are still applicable to boost HowSim computation with minor modifications. The
details can be found in the supplemental material.

5 DEFINING DECAY GRAPH
The decay graph is the key component in HowSim. In this section, we discuss how to define a decay graph.

Algorithm 1 Find decay graph
Require: An HIN G, example pairs Λ, the function α(∗)
Ensure: c(R) for ∀R ∈ ℛ
1: for all (a, b) ∈ Λ do
2:     for all p ∈ 𝒫(a, b) do
3:         l ← the length of p
4:         build the weight function w(c) according to p, add it to Eq.(12)
5: Solve the minimization problem in Eq.(13)

5.1 Semantic Meaning
The semantic meaning of parameters in the decay graph is clear. Specifically, α(A) controls to what extent the
similarities of nodes depend on their neighbors. The smaller α(A) (A ∈ 𝒜) is, the more heavily the similarities of
nodes with type A rely on their neighbors. After α(A) is set, Σ_{R ∈ ℛ(A)} c(R) = 1 − α(A) is fixed, then we can set
c(R) for R ∈ ℛ(A) one by one, based on the fact that c(R) denotes to what extent the similarities of the nodes of
type A rely on similarities of nodes of type R.T under R, as we have illustrated in Example 3. Without prior
knowledge of the HIN data, a simplification can be made by first setting α(A), ∀A ∈ 𝒜, to a constant, and then
setting c(R) for each R individually.

5.2 Example Pairs
Sometimes, users are not able to provide exact values for α(∗) and c(∗), due to the lack of domain knowledge.
Instead, example pairs can be provided by users to indicate which node pairs should be of high proximity. Given a
set of example pairs Λ, we aim to find a decay graph that makes the HowSim scores of the example pairs large. Let 𝒞
be the set of all possible decay functions, then finding the decay graph is the following optimization problem (see
footnote 4):

\[
\underset{c \in \mathcal{C}}{\text{maximize}} \; \sum_{(a, b) \in \Lambda} s(a, b). \tag{8}
\]

Footnote 4: Note that though we provide such an objective function, users can provide different objectives and
examples for their own needs such as dissimilar pairs or relative similarity ranking of node pairs, to get a decay
function. How to define a decay function is orthogonal to the definition of HowSim.

To find such a c, a naive method is to enumerate all possible decay functions, then for each possible decay
function, use methods in Section 4 to compute similarities in Λ, and then choose the optimal one. This naive method
has two drawbacks: 1) Parameters in c are real numbers, which are uncountable and cannot be enumerated; 2) Using
the all-pairs solution (Section 4) to get similarities in Λ is inefficient, since |Λ| ≪ n^2. Next, we introduce a
more efficient method, which computes s(a, b) by path enumeration to avoid the computation of all-pairs HowSim
scores.

Definition 5 (Symmetric Path). A path p in an HIN G is symmetric if the relations defined by it are symmetric, in
other words, p can be represented as:

\[
a_1 \xrightarrow{R_1} a_2 \xrightarrow{R_2} \cdots\; a_l \xrightarrow{R_l} f \xleftarrow{R_l} b_l \;\cdots
\xleftarrow{R_2} b_2 \xleftarrow{R_1} b_1 .
\]

We define the length of a symmetric path p as half of the number of its edges, e.g., the length of the symmetric
path in Def. 5 is l. Given a symmetric path p which connects a and b, let weight(p) be defined as:

\[
\mathrm{weight}(p) = \prod_{i=1}^{l} \frac{c(R_i)}{|N_{R_i}(a_i)|\,|N_{R_i}(b_i)|}, \tag{9}
\]


where $R_i$ is the $i$-th relation along $p$, $a_i$ (resp. $b_i$) is the $i$-th node along $p$ from $a$ (resp. $b$), and $N_{R_i}(a_i)$ (resp. $N_{R_i}(b_i)$) is the set of neighbors of $a_i$ (resp. $b_i$) under $R_i$. Let $\mathcal{P}(a, b)$ be the set of symmetric paths between $a$ and $b$. We then have the following result for the single-pair HowSim score.

Theorem 6. $s(a, b) = \sum_{p \in \mathcal{P}(a,b)} \mathrm{weight}(p)$.

Next, we re-formulate the problem of Eq. (8). Let type $A_i$ be the $i$-th node type in $\mathcal{A}$, and $R_j$ be the $j$-th relation type in $\mathcal{R}$. Let the $|\mathcal{A}|$ by $|\mathcal{R}|$ matrix $B$ be defined as:
$$B_{i,j} = \begin{cases} 1, & R_j \in \mathcal{R}(A_i) \\ 0, & \text{otherwise.} \end{cases} \tag{10}$$

[Fig. 3: Data sets (the underlined characters represent node types). Panels: (a) IMDb, (b) Amazon, (c) DBIS, (d) Delicious Bookmarks, (e) Last.FM.]
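As a small numeric illustration of Eq. (9) and Theorem 6, the sketch below evaluates weight(p) for two hand-made symmetric paths and adds them up. The function names `path_weight` and `truncated_score`, the tuple layout, and all decay values and neighbor counts are hypothetical and chosen only for this example. Because Theorem 6 sums over every symmetric path, a sum over a finite list of paths is only a truncated approximation of $s(a, b)$.

```python
from math import prod

def path_weight(decays, deg_a, deg_b):
    """Evaluate weight(p) from Eq. (9) for one symmetric path of length l.

    decays[i] : c(R_i), the decay of the i-th relation along the path
    deg_a[i]  : |N_{R_i}(a_i)|, neighbor count of the i-th node on a's side
    deg_b[i]  : |N_{R_i}(b_i)|, neighbor count of the i-th node on b's side
    """
    if not (len(decays) == len(deg_a) == len(deg_b)):
        raise ValueError("all three sequences must have length l")
    return prod(c / (na * nb) for c, na, nb in zip(decays, deg_a, deg_b))

def truncated_score(paths):
    """Sum weight(p) over a finite list of symmetric paths (Theorem 6, truncated)."""
    return sum(path_weight(*p) for p in paths)

# Hypothetical data: one path of length 1 and one path of length 2.
paths = [
    ([0.2], [2], [4]),             # 0.2 / (2 * 4)
    ([0.2, 0.2], [2, 5], [4, 1]),  # (0.2 / (2 * 4)) * (0.2 / (5 * 1))
]
print(truncated_score(paths))
```

In this toy case the length-2 path contributes roughly 0.001 against roughly 0.025 for the length-1 path, which illustrates why longer symmetric paths matter less to the score.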
Let $\mathbf{a}$ be a vector of length $|\mathcal{A}|$, where $\mathbf{a}_i = \alpha(A_i)$. Let $\mathbf{c}$ be a vector of length $|\mathcal{R}|$, where $\mathbf{c}_i = c(R_i)$. Finding a decay graph is equivalent to finding a $\mathbf{c}$ that maximizes Eq. (8). Besides, Eq. (2) requires $B\mathbf{c} + \mathbf{a} = \mathbf{1}$.

From the above, we can also conclude that weight($p$) can be viewed as a multivariate function $w$ which maps $\mathbf{c}$ to $\mathbb{R}^+$ w.r.t. $p$. Hence, we get:
$$\sum_{(a,b)\in\Lambda} s(a, b) = \sum_{(a,b)\in\Lambda} \sum_{p\in\mathcal{P}(a,b)} \mathrm{weight}(p) = \sum_{p\in\mathcal{P}_\Lambda} w(\mathbf{c}), \tag{11}$$
where $\mathcal{P}_\Lambda = \bigcup_{(a,b)\in\Lambda} \mathcal{P}(a, b)$. We then introduce the following loss function w.r.t. $\mathbf{c}$:
$$\mathrm{loss}(\mathbf{c}) = -\sum_{k=1}^{|\mathcal{P}_\Lambda|} w_k(\mathbf{c}) + \lambda \|B\mathbf{c} + \mathbf{a} - \mathbf{1}\|_2^2, \tag{12}$$
where $w_k(\mathbf{c})$ is the weight of the $k$-th path in $\mathcal{P}_\Lambda$, and $\lambda \|B\mathbf{c} + \mathbf{a} - \mathbf{1}\|_2^2$ is the penalty term which makes $B\mathbf{c} + \mathbf{a} \approx \mathbf{1}$. Therefore, the optimization problem of Eq. (8) can be transformed to the following minimization problem:
$$\underset{\mathbf{c}}{\text{minimize}} \quad \mathrm{loss}(\mathbf{c}) \tag{13}$$
$$\text{subject to} \quad \mathbf{c}_i \ge 0, \ \forall i \in \{1, \cdots, |\mathcal{R}|\}. \tag{14}$$

The method of finding a decay function $c$ is shown in Algorithm 1. A user provides a set of example pairs $\Lambda$, and $\alpha(A)$ is set to a constant for all $A \in \mathcal{A}$. We next build the loss function. We iterate over the node pairs in $\Lambda$ (line 1); for each node pair, we enumerate its symmetric paths (line 2) and update the loss function with the term $w_k(\mathbf{c})$. After the loss function is obtained, we solve the minimization problem using the gradient descent process of Eqs. (15) and (16). In practice, we cannot enumerate all possible symmetric paths between $a$ and $b$, since their number is infinite. Based on Theorem 6, the longer a symmetric path, the smaller the contribution it makes to $s(a, b)$. Therefore, we only enumerate short symmetric paths (setting the maximum length $L = 3$). Enumerating the symmetric paths between $a$ and $b$ requires $O(d^{2L})$ time, where $d$ is the average degree of $G$. Building the loss function (line 4) takes $O(|\mathcal{P}_\Lambda| L)$ time. Therefore, Algorithm 1 requires $O(|\Lambda| d^{2L} + |\mathcal{R}||\mathcal{P}_\Lambda| L T + |\mathcal{R}|^2 T)$ time, where $T$ is the number of iterations of gradient descent. Note that in practice $L$, $|\mathcal{R}|$ and $|\Lambda|$ are very small (compared with $|G|$), thus Algorithm 1 is practically efficient. Note also that the pair set $\Lambda$ does not need to cover all node types or relation types. This is because of the recursive definition of HowSim: no matter which type of pairs a user has provided, the weights of all relations are covered in the computation.
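The procedure for fitting a decay function can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the symmetric paths have already been enumerated, and that each path $k$ is summarized by a constant coefficient (the product of the $1/(|N_{R_i}(a_i)||N_{R_i}(b_i)|)$ factors in Eq. (9)) and an exponent vector counting how often each relation occurs, so that $w_k(\mathbf{c}) = \mathrm{coef}_k \prod_i \mathbf{c}_i^{e_{k,i}}$. The names `residual`, `loss`, `grad` and `fit_decay`, the toy data, and the values chosen for $\lambda$, $\eta$ and the iteration count are all hypothetical.

```python
from math import prod

def residual(B, a, c):
    """Return Bc + a - 1, the violation of the constraint required by Eq. (2)."""
    return [sum(bij * cj for bij, cj in zip(row, c)) + ai - 1.0
            for row, ai in zip(B, a)]

def loss(paths, B, a, c, lam):
    """Eq. (12): negative total path weight plus the squared-norm penalty."""
    total = sum(coef * prod(ci ** e for ci, e in zip(c, exps))
                for coef, exps in paths)
    return -total + lam * sum(r * r for r in residual(B, a, c))

def grad(paths, B, a, c, lam):
    """Eq. (15): partial derivative of the loss for every coordinate of c."""
    r = residual(B, a, c)
    g = []
    for i in range(len(c)):
        dw = 0.0
        for coef, exps in paths:
            if exps[i] == 0:
                continue  # this path weight does not depend on c_i
            rest = prod(cj ** e
                        for j, (cj, e) in enumerate(zip(c, exps)) if j != i)
            dw += coef * exps[i] * c[i] ** (exps[i] - 1) * rest
        penalty = 2.0 * lam * sum(rj * row[i] for rj, row in zip(r, B))
        g.append(-dw + penalty)
    return g

def fit_decay(paths, B, a, c0, lam=10.0, eta=0.01, iters=500):
    """Projected gradient descent using the update rule of Eq. (16)."""
    c = list(c0)
    for _ in range(iters):
        g = grad(paths, B, a, c, lam)
        c = [max(0.0, ci - eta * gi) for ci, gi in zip(c, g)]
    return c

# Toy instance: one node type A with alpha(A) = 0.4 and two relations.
B = [[1, 1]]
a = [0.4]
paths = [(0.5, (1, 0)), (0.1, (0, 1))]  # (coefficient, exponents) per path
c0 = [0.3, 0.3]                          # (1 - alpha(A)) / |R(A)|
c = fit_decay(paths, B, a, c0)
print(c, loss(paths, B, a, c, lam=10.0))
```

On this toy instance the iterates move toward $\mathbf{c} \approx (0.625, 0)$: the relation carried by the heavier path attracts all of the decay mass, and because the constraint enters only as a soft penalty, $B\mathbf{c} + \mathbf{a}$ ends slightly above 1 rather than exactly at 1.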
According to Eq. (9), $w_k(\mathbf{c})$ is a weighted product of different powers of different components of $\mathbf{c}$, thus Eq. (13) is a non-linear programming problem. We derive an approximate solution for Eq. (13) using the gradient descent method. The partial derivative for each component of $\mathbf{c}$ is:
$$\frac{\partial\,\mathrm{loss}}{\partial \mathbf{c}_i} = -\sum_{k=1}^{|\mathcal{P}_\Lambda|} \frac{\partial w_k(\mathbf{c})}{\partial \mathbf{c}_i} + 2\lambda (B\mathbf{c} + \mathbf{a} - \mathbf{1})^{\top} B \mathbf{e}_i, \tag{15}$$
where $\mathbf{e}_i$ is a unit vector whose $i$-th coordinate is 1. Besides, there is a box constraint in Eq. (14). Therefore, in each iteration, we update $\mathbf{c}_i$ using the following formula:
$$\mathbf{c}_i^{t+1} = \max\Big\{0,\ \mathbf{c}_i^{t} - \eta \cdot \frac{\partial\,\mathrm{loss}}{\partial \mathbf{c}_i}\Big\}, \tag{16}$$
where $\eta$ is the learning rate. Suppose a relation $R$ has the source node type $A$, and $R$ corresponds to the $i$-th coordinate of $\mathbf{c}$; we set the initial value $\mathbf{c}_i^0 = \frac{1-\alpha(A)}{|\mathcal{R}(A)|}$. Next, we analyse the cost of each iteration. For each $\mathbf{c}_i$, computing $\frac{\partial w_k(\mathbf{c})}{\partial \mathbf{c}_i}$ requires $O(l_k)$ time, where $l_k$ is the length of the $k$-th path in $\mathcal{P}_\Lambda$. Besides, computing the term $2\lambda (B\mathbf{c} + \mathbf{a} - \mathbf{1})^{\top} B \mathbf{e}_i$ requires $O(|\mathcal{R}|)$ time. Hence, each coordinate of $\mathbf{c}$ requires $O(|\mathcal{P}_\Lambda| L + |\mathcal{R}|)$ time to update, and it takes $O(|\mathcal{R}||\mathcal{P}_\Lambda| L + |\mathcal{R}|^2)$ time to update $\mathbf{c}$, where $L$ is the maximum length of paths in $\mathcal{P}_\Lambda$.

6 EXPERIMENTS

6.1 Setup

6.1.1 Data Sets

We use five real-world HIN data sets: 1) IMDb$^{5}$: an HIN that records information about movies; 2) DBIS$^{6}$: an HIN for collaboration in the database community; 3) Last.FM (LaFM)$^{7}$: it contains social networking, tagging, and music artist listening information from a set of 2K users of the Last.fm online music system; 4) Delicious Bookmarks (DeBo)$^{7}$: it contains social networking, bookmarking, and tagging information from a set of 2K users of the Delicious social bookmarking system; 5) Amazon$^{8}$: an HIN of the product co-purchasing network crawled from the Amazon website. The network schema and size of each data set are shown in Figure 3.

5. https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset
6. https://ericdongyx.github.io/metapath2vec/m2v.html
7. https://grouplens.org/datasets/hetrec-2011/
8. http://snap.stanford.edu/data/amazon0601.html

6.1.2 Methods

We use the following measures to compare the effectiveness of similarity search: 1) PPR [14]: one of the most popular similarity measures on homogeneous graphs; 2) Weighted

PPR (WPPR) [25]: the variant of PPR whose transition probabilities are weighted over edges; 3) SimRank [12]: a similarity measure on homogeneous graphs based on the expected meeting time of two random walks; 4) PathSim [5]: a similarity measure on HINs that counts the path instances of a given meta-path; 5) HeteSim [6]: a similarity measure on HINs based on the meeting probability over a given meta-path whose length is odd; 6) nSimGram [26]: a q-gram based similarity measure based on the node labels of paths; 7) HowSim. To test the efficiency of all-pairs HowSim computation, we use Naive-Iter (Section 4.1) and Opt-Iter (Section 4.2). Besides, we also use Opt-Df (Algorithm 1) to test the effectiveness of finding good decay functions.

6.1.3 Settings

The default decay factor of SimRank is set to 0.6, as suggested in [17], and the default teleportation probability for PPR is 0.8. For WPPR, we use the same decay graph for its transition probabilities as HowSim. For nSimGram, we set the default path length $q = 3$, the same as [26]. We use various meta-paths for PathSim and HeteSim over different tasks. In particular, since the network schema of each data set is available, we first enumerate possible meta-paths, and then select those which produce high-quality results. The default decay graph for each HIN is defined as $c(R) = 0.2$ for all $R \in \mathcal{R}$. All experiments are conducted on a Linux machine with an Intel(R) Xeon(R) E5-2630 v3 @ 2.40GHz CPU and 128GB of memory.

TABLE 3: Case study of DeBo (Bookmark: 8 Hours Of The Top 10 JavaScript Talks From 2010 That You Can't Miss)

HowSim:
1. 8 Hours Of The Top 10 JavaScript Talks From 2010 That You Can't Miss
2. 50 Useful Tools and Resources For Web Designers - Smashing Magazine
3. Script Junkie | LABjs; RequireJS: Loading JavaScript Resources the Fun Way
4. The 20 Most Practical and Creative Uses of jQuery - NETTUTS
5. 35 Best Free Chrome Extensions for Web Developers | Web Resources | WebAppers
6. The Essentials of Writing High Quality JavaScript | Nettuts+
7. 5 Good Reasons Why Designers Should Code | Carsonified
8. Custom Checkboxes, Custom Radio Buttons, Custom Select Lists
9. Table Sorting with Prototype * Dexagogo
10. A Detailed Look into Popular Styles in Web Design | Onextrapixel - Showcasing Web Treats Without A Hitch

Weighted PPR:
1. 8 Hours Of The Top 10 JavaScript Talks From 2010 That You Can't Miss
2. russell davies: what I meant to say at lift - part two - big red buttons and sliding into glass
3. Roo Reynolds - Playful
4. YouTube - Simon Peyton Jones: Data Parallel Haskell
5. Must Watch TED Videos
6. TED Talks - PostRank - Google Docs
7. The Best TED Talks To Make Use Of Social Media
8. 100 Great Tech Talks for Educators | Best Colleges Online
9. blip.tv (since 2005)
10. Science Cooking Public Lectures a Harvard School of Engineering and Applied Sciences

TABLE 4: DeBo (Tag: iphone-development)

| Rank | HowSim | Weighted PPR |
|---|---|---|
| 1 | iphone-development | iphone-development |
| 2 | uiview | iphone |
| 3 | uikit | development |
| 4 | three20 | ios |
| 5 | uitableview | objective-c |
| 6 | opengles | opensource |
| 7 | keychain | github |
| 8 | postmortem | sample |
| 9 | coredata | open-source |
| 10 | mediathek | ipad |

TABLE 5: LaFM (Artist: Wolfgang Amadeus Mozart)

| Rank | HowSim | Weighted PPR |
|---|---|---|
| 1 | Wolfgang Amadeus Mozart | Wolfgang Amadeus Mozart |
| 2 | Martha Argerich | Sissel |
| 3 | Georges Bizet | Johann Sebastian Bach |
| 4 | Franz Joseph Haydn | Felix Mendelssohn |
| 5 | Johann Pachelbel | Björk |
| 6 | Marc-Antoine Charpentier | Kaizers Orchestra |
| 7 | Giuseppe Verdi | Giuseppe Verdi |
| 8 | Antonín Dvořák | Acid House Kings |
| 9 | Hauschka | Frédéric Chopin |
| 10 | Giovanni Allevi | Johann Pachelbel |

TABLE 6: LaFM (Genre: country)

| Rank | HowSim | Weighted PPR |
|---|---|---|
| 1 | country | country |
| 2 | Canadian country | female vocalists |
| 3 | rain is a good thing | classic country |
| 4 | texas country | folk |
| 5 | hillbilly music | 90s |
| 6 | cheesy pop | 60s |
| 7 | walk the line soundtrack | singer-songwriter |
| 8 | country taggradio | country ladies |
| 9 | honky tonk | rock |
| 10 | favorite country music group | beautiful |

TABLE 7: Different decay graphs for IMDb

| R | R.S | R.M | R.T | c1 | c2 | c3 |
|---|---|---|---|---|---|---|
| R1 | A | act | M | 0.1 | 0.75 | 0.1 |
| R2 | M | acted by | A | 0.1 | 0.75 | 0.1 |
| R3 | D | direct | M | 0.75 | 0.1 | 0.1 |
| R4 | M | directed by | D | 0.75 | 0.1 | 0.1 |
| R5 | G | have | M | 0.1 | 0.1 | 0.75 |
| R6 | M | belong to | G | 0.1 | 0.1 | 0.75 |

TABLE 8: Similar movies of Titanic

| Rank | Decay graph c1 | Decay graph c2 | Decay graph c3 |
|---|---|---|---|
| 1 | Titanic | Titanic | Titanic |
| 2 | The Abyss | Revolutionary Road | Revolutionary Road |
| 3 | Aliens | The Reader | The Reader |
| 4 | Terminator 2: Judgment Day | The Great Gatsby | The Great Gatsby |
| 5 | The Terminator | Quills | Marvin's Room |
| 6 | True Lies | Little Children | Labor Day |
| 7 | Revolutionary Road | Labor Day | Little Children |
| 8 | Marvin's Room | What's Eating Gilbert Grape | Romeo + Juliet |
| 9 | Sense and Sensibility | Romeo + Juliet | Sense and Sensibility |
| 10 | The Great Gatsby | Marvin's Room | What's Eating Gilbert Grape |

6.2 Case Study

We show the generality and usability of HowSim.

6.2.1 Query Over Multiple Domains

Using the data set DeBo, we first query the top-10 similar bookmarks for the bookmark 8 Hours Of The Top 10 JavaScript Talks From 2010 That You Can't Miss. The results are shown in Table 3. The results of weighted PPR, which achieves the best results among the measures for homogeneous graphs, are also included. We find that HowSim precisely finds bookmarks about JavaScript and web-page design. For example, the 2nd result lists tools and resources for web designers, and the 4th result is about jQuery, which is a very popular JavaScript library (https://jquery.com). However, weighted PPR finds some irrelevant bookmarks, such as Must Watch TED Videos and blip.tv. Using the same decay graph, we then query the top-10 similar tags for the tag iphone-development; the results are in Table 4. We can see that similar tags are also found by HowSim. For instance, UIView,


UIKit, Three20 and UITableView (ranks 2-5) are all tools or frameworks for developing iOS applications, and Keychain (rank 7) is the password management system developed by Apple. However, weighted PPR finds some general tags whose scope is much larger than iphone-development, such as github and open-source.

We then use LaFM and query the top-10 similar artists for Wolfgang Amadeus Mozart, who was a prolific and influential composer of the classical era. The results are shown in Table 5. We find that the results of HowSim are also famous classical pianists or composers, such as Martha Argerich and Franz Joseph Haydn. However, the results of weighted PPR are not satisfactory, e.g., Sissel is a crossover soprano and Acid House Kings is a Swedish indie pop band. Using the same decay graph, we also query similar genres for country music; the results are shown in Table 6. We can see that similar genres are found by HowSim, such as Canadian country, Texas country and hillbilly music, while weighted PPR finds irrelevant genres, e.g., female vocalists and rock.

The reason why the results of weighted PPR are worse than those of HowSim is that weighted PPR gives high scores to hubs (nodes with high degree), even though these hubs may be irrelevant to the query nodes. From the above case studies, we can conclude that HowSim can identify similar objects over multiple domains.

6.2.2 Varying Decay Graphs

Different decay graphs lead to different similarity results. Over IMDb, we use three different decay graphs $c_1$, $c_2$ and $c_3$ (shown in Table 7), and find the top-10 similar movies for Titanic. The results are shown in Table 8. We find that for $c_1$, the top-6 movies are all directed by James Cameron; this is because under $c_1$ the similarity of movies is highly dependent on the similarity of the directors who direct them. As for $c_2$, we find that all similar movies star either Leonardo DiCaprio or Kate Winslet; the reason is that under $c_2$ the similarity of movies highly depends on the similarity of actors, and Leonardo and Kate are the stars of Titanic. The results of $c_3$ are all romantic drama movies, since Titanic is a love story. In addition, most of the $c_3$ results also star either Leonardo or Kate, which shows that HowSim has the ability to combine the similarities from multiple relations. Therefore, HowSim has the flexibility to reflect users' preferences on how to aggregate similarities from multiple domains by customizing decay graphs.

6.2.3 User Study

We perform a blinded user study to test the quality of the results of different methods. We use pooling [27], which is a standard approach for evaluating the quality of top-k document rankings in Information Retrieval (IR) systems when the ground truth ranking scores of all documents are difficult to obtain. The basic idea of pooling is as follows. Suppose that we are evaluating $l$ similarity measures $A_1, \cdots, A_l$, each of which returns the top-k results that are most similar to a query. We first take the top-k results returned by each measure, then we merge them into a pool, removing the duplicates. We then show the results in the pool to experts for evaluation. Based on the feedback provided by the experts, we pick the best k documents from the pool, and use them as the ground truth for evaluating the top-k results returned by $A_1, \cdots, A_l$. In the user study, each user is viewed as an expert for examining the results in the pool. More precisely, for each query node $u$, we retrieve the top-k nodes returned by each method, and merge them into a pool. We then present the pool to a user $p$ to order the nodes in the pool w.r.t. the similarity to $u$. This ordering is viewed as the ground truth in $p$'s view, and the precision of the different measures can then be calculated. Based on the above strategy, using the IMDb data, we use the movie Titanic as the query, and present the pooling results of the different measures to 10 different people, then calculate the top-k precision for each of them. The average precision among different users for different measures is shown in Figure 4. We find that HowSim achieves the best precision for varying $k$, indicating that users prefer the similarity results returned by HowSim to those of the other measures.

[Fig. 4: Top-k precision by user study. x-axis: k (2 to 10); y-axis: Precision (0.0 to 1.0). Legend: HowSim, nSimGram, PathSim(MGM), PathSim(MDM), HeteSim(MGM), HeteSim(MDM), WPPR, PPR, SimRank.]

TABLE 9: Meta-paths used in clustering for PathSim and HeteSim

| Data set | Meta-paths (domain: meta-path) |
|---|---|
| IMDb | M: MDM; D: DMGMD; A: AMGMA; G: GMDMG |
| LaFM | U: UATAU; A: AUA; T: TAUAT |
| DeBo | U: UBTBU; B: BUB; T: TBUBT |
| DBIS | V: VPAPV; P: PAP; A: APVPA |
| Amazon | I: IICII; C: CIC |

6.3 Effectiveness

We compare HowSim with other measures over different tasks, and then study the effectiveness of finding decay graphs.

6.3.1 Node Clustering

We test the quality of clustering of different measures. We apply K-Medoids [28] to perform clustering using the scores returned by the different measures. We use the compactness (CP) to evaluate clustering quality, which is defined as:
$$CP_K = \frac{(K-1) \sum_{i=1}^{K} \sum_{v_j \in C_i} s(v_j, m_i)}{\sum_{i=1}^{K} \sum_{v_j \in C \setminus C_i} s(m_i, v_j)},$$
where $K$ is the number of clusters, $C_i$ is the $i$-th cluster, $C = \cup_{i=1}^{K} C_i$, $m_i$ is the center of $C_i$, and $s(*, *)$ is the similarity function. The numerator describes the closeness between centers and the objects grouped into the current clusters, and the denominator represents the closeness between centers and the objects that are not grouped into the current clusters. A high $CP_K$ score indicates a good clustering result. For each data set, we first vary $K$ from 3 to 9 with a step of 2, and we use different meta-paths for each meta-path based measure. The results are shown in Figure 5. We find that HowSim outperforms the others over the different data sets, and we also find that using different meta-paths leads to different compactness for PathSim and HeteSim. We then perform clustering over different domains, setting $K = 5$. The meta-paths of PathSim and HeteSim for different domains are summarized in Table 9, and the compactness is shown in Figure 6. We can still observe that HowSim has large compactness scores on various domains. The reason is that HowSim can aggregate similarity scores from different domains of neighbors, while meta-path based methods can only capture one

particular meaning given a meta-path. SimRank and PPR cannot capture any semantic information. Another finding is that PPR performs better than meta-path based methods; the reason is that the number of path instances following the given meta-path can be very small, especially when nodes are of low degrees. In this case, the scores of meta-path based measures are too restricted to distinguish different levels of similar node pairs (see Table 12). Besides, only counting the paths following a specific meta-path cannot aggregate information from other paths. On the contrary, PPR and WPPR have a recursive definition, so the number of paths considered is infinite, and the scores are more varied. Note that when clustering on different domains, we need to specify different meta-paths for PathSim and HeteSim, while for HowSim, the same decay function is used without any changes. Therefore, we can also conclude that HowSim is a general measure and can ease data mining tasks on HINs.

[Fig. 5: Clustering with varying K. Panels: (a) IMDb, (b) DBIS (V), (c) DeBo (U), (d) LaFM (U), (e) Amazon (I). x-axis: K (3, 5, 7, 9); y-axis: Compactness. Curves: HowSim, WPPR, PPR, nSimGram, SimRank, and PathSim/HeteSim with the meta-paths DMGMD and DMD (IMDb), VPAPV and VPV (DBIS), UBTBU and UBU (DeBo), UATAU and UAU (LaFM), IICII and ICI (Amazon).]

[Fig. 6: Clustering on different domains. Panels: IMDb (A, F, D, G), DBIS (V, P, A), DeBo (U, B, T), LaFM (U, A, T), Amazon (I, C). y-axis: Compactness. Legend: HowSim, WPPR, PPR, PathSim, nSimGram, HeteSim, SimRank.]

6.3.2 Label Prediction

We compare different measures w.r.t. the precision of label prediction. Using the IMDb data set, we predict the genres of movies using different measures. In particular, we first remove the edges between movies and genres, and then select 500 movies randomly. For each movie, we compute its top-50 similar movies, and use majority voting among the results to predict the genre. We use MovieLens$^{10}$ as the ground truth. The average precision of the different measures is shown in Table 10. We can observe that the quality of PathSim and HeteSim highly depends on the selected meta-path. Weighted PPR is better than normal PPR since the former takes node/edge types into consideration. Among them, HowSim achieves the highest precision.

10. https://movielens.org/.

TABLE 10: The precision of label prediction of different measures

| Method | Precision |
|---|---|
| HowSim | 0.74 |
| PathSim(MDM) | 0.28 |
| PathSim(MGM) | 0.7 |
| HeteSim(MDM) | 0.12 |
| HeteSim(MGM) | 0.68 |
| nSimGram | 0.72 |
| WPPR | 0.7 |
| PPR | 0.66 |
| SimRank | 0.22 |

[Fig. 7: Link prediction - Amazon. Panels: 5%, 10%, 15%, 20% of edges removed. x-axis: k (0 to 100); y-axis: Precision. Legend: HowSim, nSimGram, PathSim(IICII), PathSim(ICI), HeteSim(IICII), HeteSim_s(ICI), SimRank, WPPR, PPR.]

6.3.3 Link Prediction

We compare different measures w.r.t. the quality of link prediction. Since the similarity function is defined for two nodes of the same type, links can only be predicted within the same domain. As a result, we can only use Amazon, DeBo and LaFM, and predict links among items and users. For each data set, we randomly remove from 5% to 20% of the total edges. The missing links are predicted as follows: given an endpoint $a$ of a removed edge, we retrieve the top-k similar nodes of $a$, and add the edges between $a$ and its similar nodes to a set of candidates. Precision is used to measure the quality, i.e., $\frac{|\text{candidates} \cap \text{deleted edges}|}{|\text{deleted edges}|}$. We also vary $k$ from 10 to 100 with a step of 10. For meta-path based methods, we also consider using different meta-paths, e.g., we use the notation PathSim(UBTBU) if we use PathSim with the meta-path UBTBU. The results are shown in Figures 7 to 9. We can observe that the precision increases with an increasing


$k$; this is because a larger $k$ indicates a larger candidate set for link prediction. We find that HowSim, HeteSim and nSimGram generally outperform the other competitors over different settings. Besides, the quality of the results of meta-path based measures highly depends on the meta-path used. For example, on Amazon, no matter which meta-path based measure is used (PathSim or HeteSim), the results under IICII have higher precision than those under ICI. Actually, this is also our motivation to introduce HowSim, i.e., it is hard to generate and select good meta-paths for different applications and domains before we perform any data mining tasks.

[Fig. 8: Link prediction - DeBo. Panels: 5%, 10%, 15%, 20% of edges removed. x-axis: k; y-axis: Precision. Legend as extracted: HowSim, nSimGram, PathSim(IICII), PathSim(ICI), HeteSim(IICII), HeteSim_s(ICI), SimRank, WPPR, PPR.]

[Fig. 9: Link prediction - LaFM. Panels: 5%, 10%, 15%, 20% of edges removed. x-axis: k; y-axis: Precision. Legend as extracted: HowSim, nSimGram, PathSim(IICII), PathSim(ICI), HeteSim(IICII), HeteSim_s(ICI), SimRank, WPPR, PPR.]

TABLE 11: Comparison of Top-k similarity search

| Rank | HowSim | nSimGram | PathSim(MDM) | PathSim(MGM) | HeteSim(MDM) | HeteSim(MGM) | WPPR | PPR | SimRank |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Titanic | Titanic | Titanic | Water for Elephants | Titanic | Titanic | Titanic | Titanic | Titanic |
| 2 | Revolutionary Road | Revolutionary Road | The Abyss | Closer | Avatar | The Great Gatsby | Aliens | Revolutionary Road | Inherent Vice |
| 3 | The Reader | Heavenly Creatures | The Terminator | Love in the Time of Cholera | Terminator 2: Judgment Day | Memoirs of a Geisha | The Abyss | Aliens | Triple |
| 4 | The Great Gatsby | Iris | Aliens | Angel Eyes | True Lies | Déjà Vu | The Terminator | The Abyss | Deconstructing Harry |
| 5 | Quills | The Terminator | True Lies | Curse of the Golden Flower | The Abyss | The Majestic | Avatar | The Terminator | Three to Tango |
| 6 | Little Children | Movie 43 | Terminator 2 | Memoirs of a Geisha | Aliens | Eat Pray Love | True Lies | Terminator 2 | Burnt |
| 7 | Labor Day | Romance & Cigarettes | Avatar | Revolutionary Road | The Terminator | Up Close & Personal | Revolutionary Road | Avatar | We're No Angels |
| 8 | What's Eating Gilbert Grape | What's Eating Gilbert Grape | Harry Potter and the Half-Blood Prince | Seven Pounds | Pirates of the Caribbean: At World's End | Seven Pounds | The Reader | True Lies | Everyone Says I Love You |
| 9 | Romeo + Juliet | Welcome to Collinwood | Pirates of the Caribbean: Dead Man's Chest | At First Sight | Spectre | The Scarlet Letter | Little Children | Quills | Death Sentence |
| 10 | Marvin's Room | Tales from the Crypt: Demon Knight | Superman Returns | Hit the Floor | The Dark Knight Rises | Anna Karenina | Romeo + Juliet | Gangs of New York | Everybody's Fine |

[Fig. 10: Average score. x-axis: Data sets (LaFM, IMDb, DBIS, Amazon, DeBo); y-axis: Avg. similarity. Legend: Equiv, Opt, Random.]

[Fig. 11: Varying data sets. Panels: (a) Maximum error, (b) CPU time (s). x-axis: Data sets (DBIS, DeBo, LaFM, Amazon, IMDb). Legend: Naive-Iter, Opt-Iter.]

[Fig. 12: Varying $\epsilon$. Panels: (a) Maximum error, (b) CPU time (s). x-axis: $\epsilon$ (0.00 to 0.10). Legend: Naive-Iter, Opt-Iter.]

TABLE 12: Similarity scores of top-10 lists

| Rank | HowSim | nSimGram | PathSim (MDM) | PathSim (MGM) | HeteSim (MDM) | HeteSim (MGM) | WPPR | PPR |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.408375 | 0.471681 |
| 2 | 0.290797 | 0.5 | 1 | 1 | 1 | 1 | 0.00248191 | 0.00186792 |
| 3 | 0.195247 | 0.25 | 1 | 1 | 1 | 1 | 0.00246846 | 0.00110983 |
| 4 | 0.19511 | 0.25 | 1 | 1 | 1 | 1 | 0.00245409 | 0.001088 |
| 5 | 0.178121 | 0.25 | 1 | 1 | 1 | 1 | 0.00245201 | 0.00107252 |
| 6 | 0.172314 | 0.25 | 1 | 1 | 1 | 1 | 0.00243339 | 0.00106975 |
| 7 | 0.169834 | 0.25 | 1 | 1 | 1 | 1 | 0.00242949 | 0.00104432 |
| 8 | 0.16869 | 0.25 | 0 | 1 | 0 | 1 | 0.000522484 | 0.00103956 |
| 9 | 0.167171 | 0.25 | 0 | 1 | 0 | 1 | 0.000269837 | 0.000981586 |
| 10 | 0.1648 | 0.25 | 0 | 1 | 0 | 1 | 0.00026926 | 0.000973551 |

6.3.4 Top-k Similarity Search

We compare different measures w.r.t. the quality of the results of top-k similarity search. We use IMDb, and search the top-10 similar movies to Titanic. The results are in Table 11. We can see that the results of HowSim are better than the others. Semantically, PathSim (MDM) and HeteSim (MDM) only output films which are directed by Cameron, and PathSim (MGM) and HeteSim (MGM) only produce romantic drama films. Most of the results of SimRank are not similar to Titanic. On the contrary, the results of HowSim are much better; the reason is that they are all romantic drama movies starring Leonardo, Kate, or both of them. The results of PPR are not as good as those of HowSim; for example, though Aliens, The Terminator and Avatar are science-fiction action horror films directed


Howsim Metapath2vec TransE similarities in vector space. We compare HowSim with the
1.0 following methods: 1) TransE [29]: an embedding method
0.9 for multi-relational data; 2) Metapath2vec [30]: a represen-

Precision
0.8 tation learning method for heterogeneous networks. For
0.7 Metapath2vec, we set the number of walks per node to
0.6
0.5 0.05
200, the walk length to 20, the vector dimension to 128.
0.1 0.15 0.2 We use the meta-path UBU. For TransE, we set epoch as
Percentage of deleted edges
500. We perform link prediction on DeBo, set k = 20, and
Fig. 13: Link prediction - DeBo vary the percentage of randomly removed edges from 5%
to 20%. The precision is shown in Figure 13. On average,
Metapath2vec achieves the highest precision, and the result
by Cameron, they are obviously different from Titanic in the of HowSim is better than TransE. Therefore, the precision of
aspect of genres and actors. The results of nSimGram also HowSim is comparable with embedding-based methods. On
include some unrelated movies, e.g., Welcome to Collinwood is the other hand, compared with embedding-based methods,
a caper comedy film and Demon Knight is a horror comedy HowSim is relatively interpretable due to its clear definition,
film, which are dissimilar to Titanic. Among all measures, and the parameters of decay graph have clear semantic
HowSim is the only one which can combine similarities meaning.
from multiple domains.
In addition, numerically, the score distributions of meta-path based methods cannot provide ranking results of good quality since most items share the same score, while the scores of HowSim always take different values due to its recursive definition, i.e., an infinite number of paths are taken into consideration. The similarity scores of the top-k lists are shown in Table 12. We find that the scores of PathSim and HeteSim are either 1s or 0s, which cannot distinguish different levels of similarity. Similar results are observed for nSimGram. On the contrary, the scores of HowSim can be used to rank similarities, since its recursive definition produces different numerical levels. If we want to make PathSim or HeteSim consider all aspects to find similar films, then the shortest meta-path is MDMAMGMAMDM, which is too long to compute with the PathSim and HeteSim algorithms.

6.4 Efficiency

We study the efficiency of computing HowSim scores by comparing Naive-Iter (Section 4.1) and Opt-Iter (Algorithm 4.2). We use the maximum error (ME) to measure the accuracy of the algorithms, i.e., $\mathrm{ME} = \max_{a,b \in V} |\hat{s}(a,b) - s(a,b)|$, where $\hat{s}(a,b)$ is the estimated HowSim score and $s(a,b)$ is the exact one. We then test the efficiency of finding a decay graph.

6.4.1 Varying Data Sets

We set ε = 0.01 and run both algorithms on all data sets. The results of ME are shown in Figure 11a. We can see that the ME of both algorithms is smaller than ε, showing that both have an accuracy guarantee. The computational time is shown in Figure 11b. We observe that Opt-Iter is faster than Naive-Iter on all data sets, verifying the usefulness of the strategies proposed in Section 4.2.
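As a concrete reading of the ME definition in Section 6.4, the following minimal Python sketch computes the maximum error between an estimated and an exact score table and checks it against ε. The dictionary representation, the function name `max_error`, and the toy scores are illustrative assumptions, not the data structures or values used in the experiments.

```python
def max_error(estimated, exact):
    """ME = max over all node pairs (a, b) of |s_hat(a, b) - s(a, b)|."""
    if estimated.keys() != exact.keys():
        raise ValueError("both score tables must cover the same node pairs")
    return max((abs(estimated[p] - exact[p]) for p in exact), default=0.0)


# Hypothetical exact and estimated scores on a two-node graph.
exact = {("a", "a"): 1.0, ("a", "b"): 0.30,
         ("b", "a"): 0.30, ("b", "b"): 1.0}
estimated = {("a", "a"): 1.0, ("a", "b"): 0.295,
             ("b", "a"): 0.295, ("b", "b"): 1.0}

eps = 0.01
me = max_error(estimated, exact)
print(me < eps)  # prints True: the estimate respects the error bound
```

Plotting `me` against `eps` for several values of ε, together with the line ME = ε, gives the kind of check that the dotted line in Figure 12a provides.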
6.3.5 Effect of Varying α

We test the effect of varying α on the similarity results. We use the IMDb data set and vary α from 0.2 to 0.8 with a step of 0.2 for all node types. For each particular α(A), we set $c(R) = \frac{1-\alpha(A)}{|\mathcal{R}(A)|}$ for $\forall R \in \mathcal{R}(A)$. We extract the top-100 similar directors from $S_D$, excluding the diagonal elements. We find that the top-100 pairs are identical, and similar pairs of other types show similar results. We also query the top-10 similar directors for a set of randomly selected directors; the top-10 lists are also the same for various α. Due to space limitations, we omit the tables here. We conclude that a different α only affects the absolute HowSim scores, not their relative ranking.

6.3.6 Effectiveness of Finding Decay Graphs

We study the effectiveness of our method of finding decay graphs. For each data set, we select 50 example pairs and set α(A) = 0.2 for $\forall A \in \mathcal{A}$. In addition to Algorithm 1, we also consider the following strategies: 1) Equiv: it sets $c(R) = \frac{1-\alpha(A)}{|\mathcal{R}(A)|}$ for $\forall R \in \mathcal{R}(A)$ and $\forall A \in \mathcal{A}$; 2) Rand: it assigns random real values to the c(R)s while making sure they follow Def. 3. We then calculate the average HowSim scores of the example pairs based on the decay functions computed by the three methods. The results are shown in Figure 10. On all data sets, our method outputs the highest average HowSim scores, which verifies the effectiveness of the solution for finding a decay graph in Section 5.2.

6.3.7 Comparison with Embedding-based Methods

We also compare HowSim with embedding-based methods, which generate embeddings for nodes and measure node

6.4.2 Varying ε

Using DBIS, we vary ε from 0.01 to 0.1 with a step of 0.01. The ME under different ε is shown in Figure 12a. In addition, we draw a dotted line for the various values of ε; if the maximum error of an algorithm is bounded by ε, it lies under the dotted line. We find that both Naive-Iter and Opt-Iter are bounded, which is consistent with our analysis in Section 4. Note that the ME shows a stair-like shape with increasing ε: since the number of iterations K is discrete, K may remain the same under a varying ε, which leads to the same ME. The computational time is shown in Figure 12b. We find that the CPU time of Opt-Iter is less than that of Naive-Iter under all ε, showing the effectiveness of the optimization techniques introduced in Section 4.2. Another finding is that the CPU time of both Naive-Iter and Opt-Iter shows a stair-like behavior, since different values of ε may share the same number of iterations. Besides, the number of iterations of both methods under each particular ε is the same. This is because Opt-Iter only reduces the computational cost within each iteration, rather than the total number of iterations.

6.4.3 Varying |Λ|

We test the efficiency of the decay graph finding algorithm. Using the DeBo data set, we vary the number of example pairs |Λ| from 200 to 1000. The time of the decay graph finding process is shown in Figure 14a. We find that the processing time grows approximately linearly with an increasing |Λ|. This is because the more example pairs are provided, the more time is needed for enumerating the paths connecting each pair of nodes in the example set and


setting up the optimization problem. We set k = 50, randomly remove 10% of the edges, and then perform the task of link prediction, guided by the decay graph under a varying |Λ|. The precision is shown in Figure 14b. We find that while |Λ| varies from 200 to 1000, the precision does not fluctuate much and stays around 0.8 across the different numbers of example pairs. The reason why only a small number of example pairs suffices to achieve high precision is as follows. Since the number of parameters of a decay graph is small, i.e., |R| + |A|, the cardinality of the training set (example pairs) needed for deriving the decay graph is also small, i.e., O(|R| + |A|). In reality, it is also not practical for users to provide lots of similar pairs. This is consistent with the motivation of designing the decay graph: only a few specified parameters are enough for deciding the similarities of pairs of all types. On the other hand, the precision does not increase much as |Λ| increases. This is due to the limited number of parameters in the decay graph, which restricts the flexibility of HowSim to fit all possible similarity configurations (consider that there is only one parameter for SimRank, i.e., the decay factor c). However, this does not affect the effectiveness of HowSim, since only a few example pairs can achieve high precision.

Fig. 14: Varying |Λ|. (a) CPU time: time (s, ×10^3) vs. |Λ|; (b) Precision: precision vs. |Λ|.

We summarize the findings of the experiments as follows: 1) HowSim can find similar objects for multiple domains, aggregate similarities from various relations, and thus produce high-quality results; 2) the decay graph is flexible for users to encode their preferences w.r.t. similarity aggregation for query processing; 3) HowSim is effective in different data mining tasks; 4) the strategies proposed in Section 4.2 can boost HowSim's computation; 5) our method (Section 5.2) is effective for finding good-quality decay graphs.

7 RELATED WORKS

Similarity Search on HINs. Similarity search over HINs has been studied recently, due to the popularity and ubiquity of HINs. Most current similarity measures are meta-path based. PathSim was proposed in [5], which counts all possible path instances given a meta-path and then performs normalization. [6] introduced HeteSim, which computes the meeting probability of two nodes following a given meta-path, and supports similarity search over different types of nodes. [7] proposed RelSim for relation similarity search in schema-rich HINs, e.g., knowledge graphs. [8], [9] defined a Path Constrained Random Walk (PCRW) model to measure the object proximity in a labeled graph, which outperforms PPR over several data mining tasks. [10] proposed AvgSim, which considers both the given meta-path and its reverse, avoiding the path decomposition phase in HeteSim. While current works mainly focus on formulating different measures by incorporating meta-paths in different ways, HowSim is a meta-path free measure and can determine the similarities of different domains. Recently, [26] introduced a q-gram based measure, which counts the frequencies of different q-grams formed by different q-length paths connecting nodes a and b. The difference between [26] and HowSim is that [26] considers paths of a fixed length q, while HowSim aggregates all possible symmetric paths connecting nodes a and b, due to its recursive definition. [31] also extended SimRank with semantics by incorporating the semantic similarity of nodes and edge weights over an HIN. [31] requires an additional ontology graph associated with the underlying HIN to compute the semantic similarity. However, most real HINs do not have a corresponding ontology. Besides, [31] assumes that each edge is labeled with a weight that indicates the importance of the relation, but in reality, weights over an HIN are hard to obtain. On the contrary, HowSim does not assume an ontology or edge weights. It only requires parameters at the schema level, instead of the instance level, and thus the number of parameters is much smaller.

Similarity Search on Homogeneous Graphs. Similarity search over homogeneous graphs has been well studied. Most similarity measures for them are link-based, because homogeneous graphs only have structural information. Some popular measures include PPR [14], SimRank [12], P-Rank [13], CoSimRank [32], and so on. These similarities cannot be used in heterogeneous graphs directly, since they do not consider the semantic information of HINs. HowSim extends the intuition of SimRank to HINs. However, they differ in the following aspects: 1) SimRank uses a numerical variable, i.e., the decay factor, to describe how similarities are aggregated from neighbors, while HowSim uses a weighted graph, i.e., the decay graph, to define how similarities are aggregated from neighbors of different types; 2) while SimRank is defined for nodes of the same type, HowSim is defined for nodes of multiple types.

Graph Embeddings. Graph/node embeddings [33] are vector representations of nodes, where the learned embeddings usually preserve the node proximity, i.e., similar nodes should have close embeddings. Some early works such as LLE [34], LE [35], and CGE [36] computed the node embeddings by matrix factorization, which results in heavy time and space overheads on large graphs. Recently, with the advent of deep learning [37], deep-learning based methods have shown promising performance in similarity evaluation. Some works such as D2AGE [38] and IPE [39] aimed to produce embeddings particularly for the similarity search task by a path-to-path learning schema. In addition, many graph embedding methods were proposed for general graph mining tasks, such as DeepWalk [40], LINE [41], GCN [42], and Node2Vec [43]. Despite their effectiveness, all of these methods only focused on homogeneous graphs. To address this problem, some recent methods such as Metapath2Vec [30] and HIN2Vec [44] were proposed to learn graph embeddings on heterogeneous graphs. Specifically, Metapath2Vec [30] exploited the graph schema and utilized user-specified meta-paths to control the generation process of graph embeddings, while HIN2Vec [44] further improved Metapath2Vec by introducing a semi-supervised process to fine-tune the embeddings for specific graph mining tasks. Our HowSim model differs from the above methods in the following aspects. First, while all the above methods are learning-based, where embeddings are learned by optimizing an objective function, HowSim is a general measure that has a clear definition and an interpretable semantic meaning. Second, when the graph is updated, learning-based methods would re-train the node embeddings, which is costly. On the contrary, HowSim similarities can be computed on-line, due to its closed-form definition. Third, learning-based methods need to store all node embeddings, which has a large space


cost when the graph is large. There is no additional space cost for HowSim.

8 CONCLUSION

In this paper, we propose an effective and general similarity measure, i.e., HowSim, by extending SimRank to HINs. We introduce the concept of the decay graph, which describes how similarities are aggregated from different domains over different relations, and which makes it possible to extend SimRank to HINs. We also give the matrix representation and probabilistic interpretation of HowSim. We study the properties of HowSim in depth, i.e., HowSim is normalized, self-maximum, symmetric, and has a unique solution. Compared with meta-path based measures, which can only capture a specific aspect of similarity, HowSim is more general and the decay function can be customized by users. We propose a naive iterative method for computing all-pairs HowSim and introduce optimization techniques. We also propose a method for finding decay graphs from example pairs of high proximity. Extensive experiments show the effectiveness of HowSim and the efficiency of our proposed methods. In the future, we plan to design efficient algorithms for top-k/single-source queries with HowSim, and to extend HowSim to schema-free networks such as knowledge graphs.

REFERENCES

[1] X. Yin, J. Han, and P. S. Yu, “Linkclus: efficient clustering via heterogeneous semantic links,” in VLDB, 2006.
[2] N. Spirin and J. Han, “Survey on web spam detection: principles and algorithms,” SIGKDD Explor. Newsl., vol. 13, no. 2, 2012.
[3] Z. Abbassi and V. S. Mirrokni, “A recommender system based on local random walks and spectral methods,” in Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, 2007, pp. 102–108.
[4] D. Liben-Nowell and J. Kleinberg, “The link-prediction problem for social networks,” J Assoc Inf Sci Technol, vol. 58, no. 7, 2007.
[5] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “Pathsim: Meta path-based top-k similarity search in heterogeneous information networks,” PVLDB, vol. 4, no. 11, pp. 992–1003, 2011.
[6] C. Shi, X. Kong, Y. Huang, S. Y. Philip, and B. Wu, “Hetesim: A general framework for relevance measure in heterogeneous networks,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 10, pp. 2479–2492, 2014.
[7] C. Wang, Y. Sun, Y. Song, J. Han, Y. Song, L. Wang, and M. Zhang, “Relsim: relation similarity search in schema-rich heterogeneous information networks,” in SDM. SIAM, 2016, pp. 621–629.
[8] N. Lao and W. W. Cohen, “Relational retrieval using a combination of path-constrained random walks,” Machine learning, vol. 81, no. 1, pp. 53–67, 2010.
[9] N. Lao and W. W. Cohen, “Fast query execution for retrieval models based on path-constrained random walks,” in KDD. ACM, 2010, pp. 881–888.
[10] D. Xiao, X. Meng, Y. Li, C. Shi, and B. Wu, “Avgsim: Relevance measurement on massive data in heterogeneous networks,” JATIT, vol. 84, no. 1, 2016.
[11] G. Jeh and J. Widom, “Scaling personalized web search,” in WWW, 2003.
[12] G. Jeh and J. Widom, “Simrank: a measure of structural-context similarity,” in KDD, 2002.
[13] P. Zhao, J. Han, and Y. Sun, “P-rank: a comprehensive structural similarity measure over information networks,” in CIKM, 2009.
[14] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web,” Stanford InfoLab, Tech. Rep., 1999.
[15] Y. Sun and J. Han, “Mining heterogeneous information networks: a structural analysis approach,” Acm Sigkdd Explorations Newsletter, vol. 14, no. 2, pp. 20–28, 2013.
[16] C. Shi, Y. Li, J. Zhang, Y. Sun, and S. Y. Philip, “A survey of heterogeneous information network analysis,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 1, pp. 17–37, 2017.
[17] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for simrank computation,” The VLDB Journal, vol. 19, no. 1, pp. 45–66, 2010.
[18] B. Tian and X. Xiao, “Sling: A near-optimal index structure for simrank,” in SIGMOD, 2016.
[19] M. Jiang, A. W. Fu, R. C. Wong, and K. Wang, “READS: A random walk approach for efficient and accurate dynamic simrank,” PVLDB, vol. 10, no. 9, 2017.
[20] Y. Liu, B. Zheng, X. He, Z. Wei, X. Xiao, K. Zheng, and J. Lu, “Probesim: Scalable single-source and top-k simrank computations on dynamic graphs,” PVLDB, 2017.
[21] W. Yu, X. Lin, W. Zhang, and J. A. McCann, “Dynamical simrank search on time-varying networks,” VLDB J., 2018.
[22] Y. Wang, X. Lian, and L. Chen, “Efficient simrank tracking in dynamic graphs,” in ICDE, 2018, pp. 545–556.
[23] Y. Wang, L. Chen, Y. Che, and Q. Luo, “Accelerating pairwise simrank estimation over static and dynamic graphs,” VLDBJ, vol. 28, no. 1, pp. 99–122, Feb. 2019.
[24] Z. Zhang, Y. Shao, B. Cui, and C. Zhang, “An experimental evaluation of simrank-based similarity search algorithms,” PVLDB, vol. 10, no. 5, pp. 601–612, 2017.
[25] W. Xie, D. Bindel, A. Demers, and J. Gehrke, “Edge-weighted personalized pagerank: Breaking a decade-old performance barrier,” in KDD. ACM, 2015, pp. 1325–1334.
[26] A. Conte, G. Ferraro, R. Grossi, A. Marino, K. Sadakane, and T. Uno, “Node similarity with q-grams for real-world labeled networks,” in KDD. ACM, 2018, pp. 1282–1291.
[27] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to information retrieval. Cambridge university press, 2008.
[28] L. Kaufman and P. J. Rousseeuw, Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 2009, vol. 344.
[29] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in NIPS, 2013, pp. 2787–2795.
[30] Y. Dong, N. V. Chawla, and A. Swami, “metapath2vec: Scalable representation learning for heterogeneous networks,” in SIGKDD. ACM, 2017, pp. 135–144.
[31] B. Youngmann, T. Milo, and A. Somech, “Boosting simrank with semantics,” in EDBT, Lisbon, Portugal, March 26-29, 2019, pp. 37–48.
[32] S. Rothe and H. Schütze, “Cosimrank: A flexible & efficient graph-theoretic similarity measure,” in ACL, vol. 1, 2014, pp. 1392–1402.
[33] H. Cai, V. W. Zheng, and K. Chang, “A comprehensive survey of graph embedding: problems, techniques and applications,” IEEE Transactions on Knowledge and Data Engineering, 2018.
[34] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[35] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural computation, vol. 15, no. 6, pp. 1373–1396, 2003.
[36] D. Luo, F. Nie, H. Huang, and C. H. Ding, “Cauchy graph embedding,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 553–560.
[37] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press Cambridge, 2016, vol. 1.
[38] Z. Liu, V. W. Zheng, Z. Zhao, F. Zhu, K. C.-C. Chang, M. Wu, and J. Ying, “Distance-aware dag embedding for proximity search on heterogeneous graphs,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.
[39] Z. Liu, V. W. Zheng, Z. Zhao, Z. Li, H. Yang, M. Wu, and J. Ying, “Interactive paths embedding for semantic proximity search on heterogeneous graphs,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1860–1869.
[40] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 701–710.
[41] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 1067–1077.
[42] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[43] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016, pp. 855–864.
[44] T.-y. Fu, W.-C. Lee, and Z. Lei, “Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning,” in Proceedings of the 26th ACM on Conference on Information and Knowledge Management. ACM, 2017, pp. 1797–1806.


Yue Wang received the PhD degree from the Hong Kong University of Science and Technology (HKUST), in 2019. He is a researcher with the Shenzhen Institute of Computing Sciences, Shenzhen University. His research interests include data mining and graph algorithms.

Zhe Wang received the BS degree from the School of Data and Computer Science, Sun Yat-sen University in 2018, and the MA degree from the CSE Department, HKUST in 2019. He is currently working as a research assistant in the CSE Department, HKUST. His major research fields include large-scale graph mining and graph representation learning.

Ziyuan Zhao is a researcher at WeChat Group, Tencent Cooperation.

Zijian Li received his BS and MA degrees from the CSE department of Zhejiang University, China in 2015. He is currently a PhD student at the CSE department of Hong Kong University of Science and Technology. His major research fields are large-scale graph mining and distributed network analysis.

Xun Jian received his B.Eng. degree in Software Engineering in 2014 from Beihang University. Then he received his M.Sc. degree in Information Technology in 2016 from The Hong Kong University of Science and Technology (HKUST). Now he is a Ph.D. student in the Department of Computer Science at HKUST. His research interests include crowdsourcing and algorithms on graphs.

Hao Xin is a PhD candidate at the Department of Computer Science and Engineering, Hong Kong University of Science and Technology (HKUST). Currently, he is working with Prof. Lei Chen on knowledge bases. He obtained his bachelor degree in computer science from Zhejiang University, China in 2016.

Lei Chen (Fellow, IEEE) received his BS degree in Computer Science and Engineering from Tianjin University, China, in 1994, the MA degree from Asian Institute of Technology, Thailand, in 1997, and the PhD degree in computer science from University of Waterloo, Canada, in 2005. He is now a professor in the Department of Computer Science and Engineering at Hong Kong University of Science and Technology. His research interests include data-driven machine learning, crowdsourcing, knowledge graphs, graph and probabilistic databases.

Jianchun Song received the BS degree in computer science and technology from Zhengzhou University, China, in 2012, and the MA degree from Harbin Institute of Technology, China, in 2014. His research interests include search engines and dialogue systems.

Zhenhong Chen is a researcher at WeChat Group, Tencent Cooperation.

Meng Zhao is a researcher at WeChat Group, Tencent Cooperation.
