You are on page 1of 12

Received May 6, 2020, accepted May 20, 2020, date of publication May 25, 2020, date of current version

June 5, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2996945

Optimized Signature Selection for Efficient


String Similarity Search
TAEGYOUNG LEE1 , TAE-SUN CHUNG 2, AND JONGIK KIM 3
1 Department of Computer Science and Engineering, The Hong Kong University of Science and Technology (HKUST), Hong Kong
2 Department of Computer Engineering, Ajou University, Suwon 16499, South Korea
3 Division of Computer Science and Engineering, Jeonbuk National University, Jeonju 54896, South Korea

Corresponding author: Jongik Kim (jongik@jbnu.ac.kr)


This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the
Korean Government (Ministry of Science and ICT) under Grant 2019R1F1A1059795 and Grant 2019R1F1A1058548.

ABSTRACT In this paper, we study the problem of string similarity search to retrieve in a database all
strings similar to a query string within a given threshold. To measure the similarity between strings, we use
edit distance. Many algorithms have been proposed under a filtering-and-verification framework to solve
the problem. To reduce the overhead of edit distance verification, it is crucial to efficiently generate a small
number of candidates in the filtering phase. Recently, an index structure named HSTree has been proposed
for efficiently generating candidate strings. To generate candidates, they select and utilize HSTree nodes at
a specific level calculated from a given threshold. In this paper, we observe that there are many alternative
ways to select HSTree nodes, and propose a novel technique that selects HSTree nodes in an optimized way
based on the observation. We also propose a modified HSTree, named a threaded HSTree, which connects
inverted lists of an HSTree node to inverted lists of its child nodes. With a threaded HSTree, we can reduce
the overhead of index lookups in HSTree nodes while selecting optimal tree nodes. Experimental results
show that the proposed technique significantly outperforms the existing technique using the HSTree.

INDEX TERMS Edit distance, hierarchical tree index, optimized signature selection, partition signature
scheme, string similarity search.

I. INTRODUCTION is an fundamental operation required in a wide range of


Finding similar objects is essential in data analytics, and applications including data cleaning [8], query relax-
many similarity measures have been developed for differ- ation [33], DNA read mapping [17], [18], and near duplicate
ent types of data. For example, SimRank [13] and its vari- detection [45]. To measure the similarity between two strings,
ants [26], [37], [47], [49], [50], [52] have been proposed we use edit distance [11], [27], [28], [38], which is the
to measure the similarity between objects in an information minimum number of edit operations (insertion, deletion, and
network; common subgraphs [5], [36], missing edges and fea- substitution) to transform one string to the other. Edit distance
tures [51], [53], and graph edit distance [15], [32], [34] have has the following advantages over alternative measure: it
been developed to quantify the similarity between complex reflects the ordering of characters in the string and it allows
objects that are represented by graph models; Jaccard [12], non-trivial alignment. These properties enable edit distance to
Cosine, and Dice [9] similarities are commonly used for set capture typographical errors for text documents, and to cap-
data; and LSA [19] have been developed to measure similar- ture similarities for Homologous proteins or genes [29], [43].
ity between documents through corpus analysis. The problem of string similarity search studied in this
In this paper, we focus on syntactic similarity between paper is to retrieve all strings in a string database whose edit
unstructured text data. Because text data are abundant, and distance to a query string is within a given threshold. This is
typographical errors and different representations of text data a challenging problem, because edit distance computation is
cannot be avoidable, finding syntactically similar strings costly and a scan-based approach that computes the edit dis-
tance for each string in the database would incur a prohibiting
The associate editor coordinating the review of this manuscript and O(N · n2 ) cost for a large database, where N is the number
approving it for publication was Qichun Zhang . of strings in the database and n is the average length of a

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020 98193
T. Lee et al.: Optimized Signature Selection for Efficient String Similarity Search

string. To address the performance challenge, there has been a search problem, it exhibits good performance. It is also easily
rich literature on this problem [3], [4], [6], [7], [16], [18], used to support top-k similarity queries.
[20]–[23], [29], [39], [41], [46], [48]. Although we can use the partition signature scheme with
All existing techniques adopt a filtering-and-verification HSTree, this approach has the following limitation. We only
framework, with a main focus on the filtering phase to reduce use the tree at a specific level, which is determined by
the overhead of edit distance computation in the verification a threshold. In partition-based approach, we need at least
phase. To effectively generate candidate strings by filtering τ +1 partitions to use the pigeonhole principle. Since HSTree
out strings dissimilar to a query string, they utilize signatures selects nodes in a specific level, the number of partitions used
extracted from data strings. The most widely used signature for a query is not consistently determined.
scheme is q-gram, which is a substring of a string with Example 1: Consider we have a string ‘‘SIMILARITY’’
length q. Given two strings with an edit distance threshold, in our string database. Figure 1 depicts how the string is
a necessary condition to meet the threshold is established on partitioned into each HSTree node. For a query with a thresh-
the minimum number of q-grams contained in both strings. old τ = 2, the tree nodes at the second level is selected
To efficiently generate candidate strings using the q-gram and τ + 2 = 4 partitions of the string is used to check if
signature scheme, all existing algorithms utilize an inverted at least c = 2 partitions are contained in the query string.
index built on data strings. They extract q-grams from data When τ = 4, the tree nodes at the third level is selected and
strings, and make an inverted list on each q-gram, which is a τ + 4 = 8 partitions of the string is used to check if at least
list of ids of strings that contain the q-gram. Then, they build c = 4 partitions are contained in the query string.
an index that maps each q-gram to its corresponding inverted In Example 1, we have to select τ + 2 partitions for τ = 2,
list. Early work (e.g., [20], [21]) extracts overlapping q-grams while we should select τ + 4 partitions for τ = 4. When
from a query string, and using an inverted index, generates using the pigeonhole principle with τ + c partitions, there is a
those data string that shares enough number of q-grams with trade-off between filtering and verification costs for different
the query string. Later work (e.g., [14], [17]) selects non- c values (see Section III-A for the details), but HSTree cannot
overlapping q-grams from a query string to reduce the number balance the trade-off because c value is determined by the
of candidate strings. threshold τ as shown in the example above.
A drawback of the q-gram signature scheme is that there To address the limitation, we propose a partition selection
can be many strings that share a q-gram, because q is usually algorithm that selects τ + c partitions, i.e., HSTree nodes,
chosen to be a small value (e.g. from 2 to 4) to support various for a fixed value c regardless of τ . We observe that we can
thresholds. As a result, a large number of candidates can be select partitions from different levels of HSTree. For example,
generated in the filtering phase. To overcome the limitation, consider we are given a fixed c = 1. When τ = 2, we can
the partition signature scheme has been proposed [22], [23]. select τ +c = 3 partitions ‘‘SIMIL’’ at level 1, and ‘‘AR’’ and
The partition signature scheme establishes a filtering con- ‘‘ITY’’ at level 2 in Figure 1. If τ = 4, we can select τ +c = 5
dition using the pigeonhole principle as follows. Given two partitions ‘‘SI’’, ‘‘MIL’’, and ‘‘AR’’ at level 2 and ‘‘I’’ and
strings r and s with a threshold τ , if we decompose r into ‘‘TY’’ at level 3. Interestingly, there are many alternative ways
τ + c partitions, i.e., disjoint substrings, at least c partitions to select a given number of partitions. When τ = 2, for
should be contained in s to meet the threshold. Since we can example, we can select alternative combinations of τ + c = 3
use large signatures (especially when c = 1), the partition partitions: {‘‘SIMIL’’, ‘‘AR’’, ‘‘ITY’’} or {‘‘SIM’’, ‘‘IL’’,
signature scheme has been found to be much more efficient ‘‘ARITY’’}. Based on the observation, we propose a novel
than the q-gram scheme. However, an offline index cannot dynamic programming algorithm that selects an optimal com-
be built on partitions because the number of partitions is bination of a given number of partitions that generates the
determined by the threshold, which is specified when a query minimum number of candidates.
is issued. Therefore, it is not suitable to the string similarity
search problem. This scheme is proposed to solve the string
similarity join problem, where an index is built on-the-fly
during join processing.
Recently, a hierarchical index structure, named HSTree,
has been proposed to apply the partition signature scheme
to string similarity search problem [39], [48]. The HSTree
is a full binary tree. At the ith level of the tree, each data FIGURE 1. ‘‘SIMILARITY’’ partitioned into HSTree nodes.

string is decomposed into 2i partitions, where the jth partition


is indexed in the jth node of the ith level (see Section II-C The following summarize the contributions of the paper
for the details of the HSTree index). Given a query string • We show that there are many alternative combinations of
with a threshold τ , it selects the lowest level having at least HSTree nodes to evaluate a query, and develop a novel
τ + 1 nodes (or partitions) to use the pigeonhole principle. dynamic programming algorithm that select an optimal
As HSTree can use the partition signature scheme in the combination of nodes.

98194 VOLUME 8, 2020


T. Lee et al.: Optimized Signature Selection for Efficient String Similarity Search

• We propose an enhanced HSTree, named a threaded Lemma 1: Given two strings r and s and a threshold τ ,
HSTree, that connects inverted lists of an HSTree node if we extract τ + c disjoint segments from r, where c is a
to inverted lists of its child nodes. With a threaded constant, s should contain at least c segments of r to meet the
HSTree, we can reduce the overhead of index lookups threshold.
in HSTree nodes while selecting optimal tree nodes. A disjoint segment of r contained in s is called a match-
• We implement the proposed algorithm and conduct ing segment. The lemma above states that we need at least
extensive experiments on real datasets. From our experi- c matching segments to meet the threshold. The intuition
mental results, we show that the proposed algorithm sig- behind the lemma is that a mismatching segment causes at
nificantly outperforms the existing algorithm that uses least one edit operation, and edit operations caused by dif-
the HSTree. ferent segments are independent. If the number of matching
segments is less than c, the number of mismatching seg-
The rest of this paper is organized as follows. In Section II, ments is greater than τ , and thus r and s cannot meet the
we formulate the problem of string similarity search, describe threshold τ .
the candidate generation method based on the partition Example 3: Consider two strings s5 and s6 in Table 1.
signature scheme and review the HSTree index structure. Given a threshold τ = 2, if we extract τ +2 disjoint segments,
In Section III, we propose a novel dynamic programming ‘‘al’’, ‘‘on’’, ‘‘ene’’, and ‘‘ss’’ from s5 , only one of the
algorithm to select an optimal combination of tree nodes. segments, i.e., ‘‘al’’, is contained in s6 . Hence, s5 and s6
In Section IV, we enhance an HSTree to reduce the overhead cannot meet the threshold by Lemma 1
of index lookups. In Section V, we report our experimental
results on real datasets. We brief related work in Section VI TABLE 1. An example string collection.
and conclude the paper in Section VII.

II. PROBLEM FORMULATION AND PRIOR WORK


A. PROBLEM FORMULATION
The edit distance between two strings r and s, denoted by
ed(r, s), is the minimum number of edit operations to trans-
form r into s or vice versa. An edit operation is insertion,
deletion, or substitution of a single character. For example,
ed(‘‘string’’, ‘‘strong’’) is 1 because ‘‘string’’ can
be transformed into ‘‘strong’’ by substituting one character In the q-gram signature scheme, existing techniques
‘i’ with ‘o’. select non-overlapping q-grams to generate candidates using
Definition 1: Given a string database D, and a query string Lemma 1. However, those techniques cannot utilize string
q with an edit distance threshold τ , the problem of string segments between the selected q-grams, and filtering power
similarity search is to retrieve from D all strings s such that is limited because the signature size (i.e., q) is small.
ed(q, s) ≤ τ . PassJoin [22], [23] has been proposed to find similar strings
Example 2: For the strings in Table 1, consider a string using partition-based signatures for Lemma 1. PassJoin
database D = {s1 , s2 , . . . , s8 }. Given a query string q = decomposes a string s into τ +c disjoint segments1 w1 , w2 , . . . ,
‘‘string’’ with a threshold 1, the result of the similarity wτ +c , such that s = w1 ·w2 ·. . .·wτ +c , where wi ·wi+1 denotes
search is {s1 , s2 , s3 }. the concatenation of wi and wi+1 . We call wi a partition of s.
By using a partition-based signature, which is longer than a
q-gram signature, the number of candidates can be reduced
B. DISJOINT SIGNATURE-BASED APPROACH FOR since the longer a signature is, the less strings that share
GENERATING CANDIDATES the signature. In the remaining of this subsection, we briefly
We can establish a necessary condition between two strings introduce the technique in PassJoin.
to meet a threshold using the pigeonhole principle. The fol- Given a string database D and a threshold τ , an index
lowing definition and lemma formally state the necessary is built on strings in Dl = {s | s ∈ D ∧ |s| = l} as
condition. follows. Each string in Dl is partitioned into τ + c seg-
Definition 2: Given a string s, consider two substrings ments2 . For the jth segments of the strings in Dl , we make
s[p1 , l1 ] and s[p2 , l2 ] of s, where s[p, l] denotes a substring j
an inverted index Ll that maps each distinct jth segment w to
of s starting at position p with length l. Without loss of j
Ll (w), which is a list of ids of strings that have w in their
generality, we assume that p1 < p2 . The substrings s[p1 , l1 ]
and s[p2 , l2 ] are disjoint, if and only if p1 + l1 ≤ p2 . 1 We note that PassJoin [22], [23] uses τ + 1 signatures. In our discussion
Disjoint substrings in the definition above does not share here, we generalize it to τ + c signatures based on Lemma 1 for the easy of
description in the following section.
any character in a common position. For a string ‘‘abcde’’, 2 There are many ways to partition a string into τ + c segments. PassJoin
for example, ‘‘abc’’ and ‘‘de’’ are disjoint, but ‘‘abc’’ and uses an even-partition scheme, where each segment should have nearly the
‘‘cd’’ are not disjoint. same length. Refer to [22], [23] for the details.

VOLUME 8, 2020 98195


T. Lee et al.: Optimized Signature Selection for Efficient String Similarity Search

j
jth segments. In this way, an index Il = {Ll | 1 ≤ j ≤ C. HSTree
τ + c} is built for the strings in Dl . The index for D is The partition-based approach described in Section II-B can
I = {Il | l is a distinct length of strings in D}. build an index only with a given static threshold. Therefore,
Example 4: For the string collection D in Table 1, consider it is hard to apply the approach to the search problem, where a
that τ = 1 and c = 1 are given. To make I6 , we decompose threshold can vary from query to query. To address the prob-
each string in D6 = {s1 , s2 , s3 , s4 } into two partitioned lem, HSTree [39], [48] has been proposed, which maintains
segments with the same length 3. We can construct I9 for alternative partitioning results of each data string.
D9 in a similar way. Figure 2 shows I6 = {L16 , L26 } and For each distinct string length l, an HSTree Hl is built on
I9 = {L19 , L29 }. Therefore, the index for D is I = {I6 , I9 }. Dl as follows. At the level i of the tree3 , Hl partitions each
data string s ∈ Dl into 2i segments, and the jth segment of s is
i,j
indexed in the inverted index of the jth node, denoted by Ll ,
just like the partition-based approach in the previous section.
For the purpose of presentation, we use the inverted index in
a node interchangeably with the node itself in the rest of the
paper. Similar to PassJoin, HSTree also use an even-partition
FIGURE 2. Example inverted indices I6 and I9 for the string collection
in Table 1. scheme. Specifically, for each segment w at the level i, w is
partitioned into two disjoint segments in the (i + 1)th level,
Before we describe how to evaluate a query using the such that the first segment is the prefix of w of length b|w|/2c
index, we present an obvious necessary condition between and the second segment is the suffix of w of length d|w|/2e.
two strings to meet a threshold. The following lemma states For example, Figure 3 shows an HSTree H9 for the string
a condition on the size difference of two strings. collection in Table 1.
Lemma 2: Given two strings r and s with a threshold τ ,
if ed(r, s) ≤ τ , then ||r| − |s|| ≤ τ .
To evaluate a query string q, we need to search the index
j
from I|q|−τ to I|q|+τ by Lemma 2. For each inverted index Ll
in Il (|q| − τ ≤ l ≤ |q| + τ, 1 ≤ j ≤ τ + c), we first compute
j
substrings of q, denoted by W(q, Ll ), to look up Lil . Let the
j j
length of segments indexed in Ll be `l . Note that all segments
j
in Ll have the same length (e.g., `29 = 5 in Example 4).
j
To find all segments in Ll that are contained in q, we compute
j j
W(q, Ll ) = {w | w is a substring of q of length `l } and take
j j
union of Ll (w)’s for all w ∈ W(q, Ll ). Let the result list of
j FIGURE 3. H9 for the string collection in Table 1.
the union be Rl (q). The set containing all candidate strings
j
in Il can be obtained by merging Rl (q)’s for all j ∈ [1, τ + c]
Given a query q with a threshold τ , we can evaluate
and choosing those strings that are contained at least c result
the query using the HSTree between H|q|−τ and H|q|+τ by
lists by Lemma 1.
Lemma 2. To generate candidate strings, we need at least τ +1
Example 5: Given the index depicted in Figure 2, consider
partitions of strings by Lemma 1. For each Hl (|q| − τ ≤
a query string q = ‘‘aparment’’ with a threshold τ = 1,
l ≤ |q| + τ ), we select the lowest level having at least τ + 1
where c = 1. Since |q| = 8, we need to search I7 , I8 , i,j
nodes. Therefore, i = dlog2 (τ + 1)e. Given 2i nodes, i.e., Ll
and I9 by Lemma 2. Since I7 = I8 = ∅, we only look i
(1 ≤ j ≤ 2 ), the query can be evaluated as the similar way
up I9 as follows. Since `19 = 4 (for L19 ), each string in
introduced in the previous section. A segment set of the query
W(q, L19 ) = {apar, parm, arme, rmen, ment} searches i,j i,j
string q for searching an index Ll , denoted by W(q, Ll ),
L19 , and R19 = {s7 } is obtained since apar in W(q, L19 )
is computed using the following position lower bound and
is indexed in L19 . To search L29 , we enumerate W(q, L29 ) =
upper bound (see PassJoin [22], [23] for the details).
{aparm, parme, armen, rment} since `29 = 5 (for L29 ).
i,j
R29 = ∅ because no string in W(q, L29 ) is indexed in L29 . LBT = max(1, p(Ll ) − (j − 1),
We merge R19 = {s7 } and R29 = ∅ and find a candidate s7 , i,j
p(Ll ) + 1 − (τ + c − j)), (1)
since s7 is contained in c (= 1) result list.
i,j i,j
Once candidate strings are generated, each of candidate is UBT = min(|q| − `(Ll ) + 1, p(Ll ) + (j − 1),
verified by computing the edit distance to the query string. i,j
j
p(Ll ) + 1 + (τ + c − j)), (2)
Intuitively, the smaller size W(q, Ll ), the smaller number
i,j i,j
of candidates. By utilizing various conditions (e.g., a condi- where `(Ll ) and p(Ll ) denote the length and the position of
i,j
tion on position difference between segments), the number the segments in Ll , respectively, and 1 = |q| − l. In the
j
of segments in W(q, Ll ) can be substantially reduced (see
Section II-C). 3 The root node of an HSTree is at the level 0.

98196 VOLUME 8, 2020


T. Lee et al.: Optimized Signature Selection for Efficient String Similarity Search

formulas above, j − 1 and τ + c − j indicate the relative optimal τ + c disjoint nodes across different levels in an
locations4 of a partition from the left-side and the right-side HSTree, where any two nodes are disjoint if and only if they
perspectives, respectively. Given the LBT and UBT , do not lie on the same root-to-leaf path of the tree. Note that
i,j i,j if two HSTree nodes are disjoint, the substrings of a data
W(q, Ll ) = {q[p, `(Ll )] | p ∈ [LBT , UBT ]}, (3) string corresponding the nodes are also disjoint. Given τ + c
i,j
where q[p, `(Ll )] denotes a substring of q starting at posi- disjoint nodes, therefore, we can apply Lemma 1 to generate
i,j candidate strings.
tion p with length `(Ll ). Since the level i has 2i nodes,
Example 7: Consider a query string q = alignment for
τ + c = 2 and c = 2 − τ in Lemma 1. Hence, we generate
i i
τ = 4 for H9 in Figure 3. In the original HSTree technique,
those strings whose segments are selected from at least 2i − τ
i,j we select the level 3 and c is determined to 23 − 4 = 4. For
nodes at the level i, while searching the index with W(q, Ll )
i a given c = 1, however, we can consider tree nodes from all
for 1 ≤ j ≤ 2 .
levels to select τ + 1 = 5 disjoint nodes. For example, we can
Example 6: Consider a query string q = ‘‘alignment’’
select {L1,1 3,5 3,6 3,7 3,8
9 , L9 , L9 , L9 , L9 } at the levels 1 and 3.
with a threshold τ = 1 for the string collection in Table 1.
Alternatively, we can also select {L2,1 3,3 3,4 2,3 2,4
9 , L9 , L9 , L9 , L9 }
Since |q| = 9, we need to search H8 , H9 , and H10 .
at the level 2 and 3. There are many other combinations of
As H8 = H10 = ∅, we search H9 depicted in Figure 3.
τ + 1 disjoint nodes in this case.
In H9 , we select the level dlog2 τ + 1e = 1 and search L1,1 9 As shown in the example above, there can be multiple
and L1,2 . To search L 1,1
, we compute W(q, L 1,1
) = {alig}
9 9 9 combinations of τ + c disjoint nodes. Among all possible
(LBT = 1 and UBT = 1 because p(L1,1 9 ) = 1, 1 = 0, and combinations, in this paper, we develop a novel dynamic
`(L1,1 1,1
9 ) = 4). Since L9 does not contain alig, we generate programming algorithm that selects an optimal combination
no candidate string in this node. Next, we search L1,2 9 with that minimizes the number of candidates. We remark that
W(q, L1,29 ) = {nment} (LB T = 5 and UB T = 5 because any combination of τ + c disjoint nodes generates candidate
p(L1,2
9 ) = 5, 1 = 0, and `(L 1,2
9 ) = 5). L 1,2
9 does not contain strings containing all result strings by Lemma 1. Thus, our
1,2
nment, and we generate no candidate in L9 either. Hence, optimization technique does not affect the accuracy of sim-
the result is ∅. ilarity search, i.e., it does not miss any result string. In the
following subsections, we use c = 1 for simplicity, and we
III. OPTIMIZED HSTree NODE SELECTION will discuss the effect of different c values by treating c as a
A. MOTIVATION OF OUR WORK tunable parameter in Section V-B.
In general, the quality of a partition signature for generating
candidates is assumed to be proportional to the size of the B. OPTIMIZED NODE SELECTION ALGORITHM
signature. By choosing c = 1 in Lemma 1, we can maximize Given a query string q with a threshold τ , we search from
the size of each partitioned segment, and thus expect that H|q|−τ to H|q|+τ to generate candidates as we discussed
the number of candidates generated from each partition is earlier. Because we independently search each HSTree, and
minimized. For this reason, PassJoin [22], [23] uses τ + 1 each tree generates candidates independently, in this section,
scheme. If we use a larger c value, the size of each parti- we restrict our discussion to those strings of length l and
tion signature is reduced, and thus the inverted list for the consider optimized node selection for the HSTree Hl . Given
signature becomes longer. Nevertheless, we have a stricter τ + 1 nodes {N1 , N2 , . . . , Nτ +1 } of Hl , candidate strings are
filtering condition, since a candidate requires to have at least generated as follows.
c partitioned segments contained in a query. To generate τ[
+1 [
candidates, however, we need to scan and merge more and C= Ni (w), (4)
longer inverted lists, which can degrade the performance of i=1 w∈W (q,Ni )
the search. Therefore, c in Lemma 1 can be also used as a
tunable parameter (e.g. [14], [17], [18]) to balance filtering where q is the query string and w is each segment in W(q, Ni ).
and verification costs. Note that Ni also denotes the inverted index of the ith node.
In HSTree, the c value is dependent on τ , that is, c = 2i −τ Like all other string similarity search techniques that utilize
where i = dlog2 (τ + 1)e. For example, we have to use τ + 1 signatures to generate candidates, we assume each partition
scheme for τ = 3, while we should use τ + 4 scheme for signature independently generates candidate strings. There-
τ = 4. This is because we select nodes in a specific level fore, the number of candidates can be estimated as:
of the tree based on a given threshold. Since the c value is τ +1
X X
determined by τ , we cannot expect consistent performance NC = |Ni (w)|, (5)
for different τ values, and we have no chance for improving i=1 w∈W (q,Ni )
performance by adjusting c. To address the problem, for a
where |Ni (w)| denotes the size of the inverted list Ni (w).
given fixed c value, we propose a novel technique that selects
An optimal combination of τ + 1 nodes can be obtained
4 The position of a segment starts from 1, while the relative location of a by enumerating all possible τ + 1 disjoint nodes, computing
partition starts from 0. the number of candidates generated by each combination,

VOLUME 8, 2020 98197


T. Lee et al.: Optimized Signature Selection for Efficient String Similarity Search

and selecting a combination having the minimum number of


candidates. The following lemma states that we can prune
certain combinations of τ + 1 disjoint nodes.
Lemma 3: Given a combination P +1of τ + 1 disjoint nodes
S = {N1 , . . . , Nτ +1 } of Hl , if τk=1 `(Nk ) < l, there exists
another combination that generates candidates no more than
the initial combination S, where `(Nk ) denotes the length of
segments indexed P in Nk .
Proof: If τk=1 +1
`(Nk ) < l, there should be at least one
FIGURE 4. An initial DP array for an HSTree having 15 nodes
node N ∈ S such that no nodes in the subtree rooted by the (i.e., depth = 3) when τ = 4.
sibling of N is included in S. By replacing N with its parent,
we can have another combination that generates candidates
no more than S, because candidate generated by the parent is Proof: It considers k disjoint nodes in the left subtree
obviously a subset of that generated by N .  i,j i,j
of Ll and n − k disjoint nodes in the right subtree of Ll .
Example 8: In Example 7, we may select τ + 1 nodes Given any k value, assume that it correctly selects the min-
{L2,1 3,3 3,4 3,6 2,4
9 , L9 , L9 , L9 , L9 }. In this case, we can replace imum number of candidates in the left and right subtrees,
L9 with its parent L9 , and candidates generated by L2,3
3,6 2,3
9
respectively. Since it considers all possible range of k value
is a subset of candidates generated by L3,6 9 .
by Lemma 4, it computes the minimum number of candidates
i,j i,j
The following recurrence calculates the minimum number for the subtree rooted by Ll by the assumption. When Ll is
of candidates when we select n disjoint nodes from a subtree a leaf node, it clearly returns the correct number of candi-
i,j
i,j
rooted by Ll , which is the jth node at the level i of Hl for dates by Equation (7). Therefore, by induction, NC (Ll , n)
those data strings of length l. correctly computes the minimum number of candidates. 
A minor difficulty with the recurrence is that we can
if n = 1: identify relative locations of partitions after selecting optimal
i,j i,j
X
NC (Ll , n) = |Ll (w)|, (6) i,j
partitions, while LBT and UBT for W(q, Ll ) require the
i,j i,j
i,j
w∈W (q,Ll ) relative location of the partition Ll . Notice that j in Ll is
i,j
otherwise : no more the relative location of the partition Ll , since we
i,j
NC (Ll , n) = min{NC (Ll
i+1,2j−1
, k) select partitions in different levels of the tree. We solve the
i,j
k problem by using the following LBL and UBL for W(q, Ll )
i+1,2j
+ NC (Ll , n − k)} (7) in our recurrence (see PassJoin [22], [23] for the details of
i,j LBL and UBL ).
In case n = 1, Ll generates the minimum number of can-
i,j i,j τ −1
didates, thus we select Ll . When n > 1, we select k disjoint LBL = max(1, p(Ll ) − b c), (8)
i+1,2j−1 2
nodes from the subtree rooted by the left child Ll , and τ +1
i,j i,j
remaining n − k disjoint nodes from the subtree rooted by the UBL = min(|q| − `(Ll ) + 1, p(Ll ) + b c). (9)
i+1,2j 2
right child Ll . Among all possible k values, which are
discussed in Lemma 4, we select an optimal combination of After selecting optimal partitions, we use LBT and UBT
n disjoint nodes that has the minimum number of candidates. to generate candidates with the selected partitions. It is
Note that the recurrence considers only worth noting that we do not need to lookup indices for
Pthose combinations i,j the inverted lists used in generating candidates, but we can
of disjoint nodes {N1 , . . . Nn } such that nk=1 `(Nk ) = `(Ll )
by the following lemma. select the required inverted lists from those inverted lists
Lemma 4: In the recurrence above, the range of all possi- retrieved during partition selection because [LBT , UBT ] ⊆
ble k values is as follows. [LBL , UBL ] [22], [23].
It is obviously inefficient to compute the minimum number
max(1, n − 2maxL−(i+1) ) ≤ k ≤ min(n − 1, 2maxL−(i+1) ), of candidates by recursively enumerating all possible combi-
nations of τ +1 disjoint nodes. Instead, we develop a dynamic
where maxL denotes the maximum level of the tree, i.e., the
programming algorithm based on the recurrence above as
leaf level blog2 lc.
follows. There are |Hl | = 2maxL+1 − 1 nodes in Hl , where
Proof: It is obvious that 1 ≤ k ≤ n − 1. Since
maxL is the leaf level. We make an array A having |Hl |
an HSTree is a full binary tree, the maximum number of i,j
elements, where the node Ll corresponds to A[2i − 1 + j].
disjoint nodes (i.e., the number of nodes in the leaf level) in i,j
the subtree rooted by Ll
i+1,2j−1 i+1,2j
(or Ll ) is 2maxL−(i+1) . For the purpose of presentation, we use A[Ll ] to denote
i,j
Therefore, k ≤ 2 maxL−(i+1) and n − k ≤ 2 maxL−(i+1) , that is, A[2i − 1 + j]. The subtree rooted by Ll has 2maxL−i leaf
maxL−i i,j
n − 2maxL−(i+1) ≤ k ≤ 2maxL−(i+1) .  nodes. Hence, we make min(τ + 1, 2 ) slots for A[Ll ]
i,j i,j i,j
i,j
Theorem 1: NC (Ll , n) correctly computes the minimum to keep NC (Ll , n) in the nth slot of A[Ll ], i.e., A[Ll ][n] =
i,j
number of candidates. NC (Ll , n). We initialize each slot of the array to ∞.

98198 VOLUME 8, 2020


T. Lee et al.: Optimized Signature Selection for Efficient String Similarity Search

i,j
For example, Figure 4 shows an initial DP array for an HSTree Recall the location of Ll in A is 2i − 1 + j. Let the location
with depth 3 when τ = 4. be x. The locations of the left and right children are 2x =
Given an initialized DP array A, Algorithm 1 outlines our 2i+1 −1+2j−1 and 2x +1 = 2i+1 −1+2j, respectively. The
dynamic programming algorithm that computes the mini- algorithm finally returns the number of minimum candidates
mum number of candidates in Hl for a query string q with a (Line 13).
threshold τ . We assume that q, τ and A are globally visible in Example 9: Given a query string q with a threshold
the algorithm. In the algorithm, each slot of the DP array A has τ = 4, consider an HSTree Hl shown in Figure 5(a). In the
three values: n_cands is the minimum number of candidates, figure, the number in each node denotes the sum of the
and left and right are links to its children cells, which are lengths of the inverted lists selected by segments of q (i.e.,
P i,j
used to keep track of the optimal combination of nodes. The i,j |Ll (w)|). We can find an optimal combination
w∈W (q,L )
l
algorithm computes the minimum number of candidates only
of disjoint nodes by calling DPSelect(L0,1 l , τ + 1 = 5).
when it is not already computed (Line 1). If the number of
To find an optimal combination, DPSelect construct a DP
nodes to be selected is 1, then it saves the sum of the sizes of
i,j array depicted in Figure 5(b). It recursively fills each slot in
the inverted lists for W(q, Ll ) of the current node (Line 4).
the array while keeping links to its children slots. Once it fills
In this case, the child links are set to nil, which indicates
A[L0,1
l ][5], we can find an optimal combination by following
that this node is a terminal node (Line 5). If the number of
the links of A[L0,1
l ][5], which are depicted in red lines in the
nodes to be selected is greater than 2, the algorithm selects an
figure. The optimal combination of disjoint nodes selected by
optimal combination of nodes in the subtree rooted by current
the algorithm is indicated by the grayed slots.
node based on the recurrence (for loop in Line 7). It saves
i,j
the minimum number of candidates in A[Ll ][n].n_cands
i,j i,j
(Line 10). It also keeps, in A[Ll ][n].left and A[Ll ][n].right,
the slot numbers of the children of the current node
(Lines 11–12). Note that the algorithm only needs to keep
the slot numbers, because the children of current node
can be easily located as follows. The left and right chil-
i,j i+1,2j−1 i+1,2j
dren of A[Ll ] are A[Ll ] and A[Ll ], respectively.

i,j
Algorithm 1: DPSelect(Ll , n)
i,j
input : Ll is a node of Hl , and n is the number of
disjoint nodes to select.
i,j
output: A[Ll ][n].n_cands – minimum number of
candidates of n disjoint nodes in the subtree
i,j
rooted by Ll
i,j
1 if A[Ll ][n].n_cands 6 = ∞ then
i,j
2 return A[Ll ][n].n_cands;
FIGURE 5. A running example of the DPSelect algorithm.
3 if n = 1 then
i,j P i,j
4 A[Ll ][n].n_cands ← w∈W (q,Li,j ) |Ll (w)|;
i,j i,j
l Lemma 5: The time complexity of the algorithm is
5 A[Ll ][n].left ← A[Ll ][n].right ← nil; O(l · τ · CI ), where l is the length of strings indexed in an
6 else HSTree Hl , τ is a threshold for a query, and CI is the average
7 for k ← max(1, n − 2maxL−(i+1) ) to cost for index lookups.
min(n − 1, 2maxL−(i+1) ) do Proof: It can be seen that the number of segments in
i,j
8 NC ← DPSelect(Ll
i+1,2j−1
, k) + W(q, Ll ) is at most 2τ . The algorithm requires O(l · τ · CI )
i+1,2j to fill the first row of the DP array, i.e., A[1][∗], since there
DPSelect(Ll , n − k);
i,j are at most l slots in the first row. In the nth row of the DP
9 if A[Ll ][n] > NC then array, there are at most 2dlogl2 ne ≤ nl slots. A slot in the nth
i,j
10 A[Ll ][n].n_cands ← NC ; row requires at most 2(n − 1) lookups of the DP array slots
i,j
11 A[Ll ][n].left ← k; (see Line 6 of the algorithm). Hence, the algorithm requires
i,j
12 A[Ll ][n].right ← n − k; O(l) to fill the nth row (n > 1), and it requires O(l · τ ) to fill
all the rows except the first row. Consequently, filling the first
i,j row dominates the time complexity of the algorithm, which
13 return A[Ll ][n].n_cands;
is O(l · τ · CI ). 

VOLUME 8, 2020 98199


T. Lee et al.: Optimized Signature Selection for Efficient String Similarity Search

C. REDUCING INDEX LOOKUP OVERHEAD


A drawback of the proposed algorithm is that it looks up
indices in all tree nodes. To reduce the overhead of index
lookups, we limit the maximum level maxL to dlog2 (τ + 1)e,
i.e., the minimum level having at least τ + 1 nodes. An inter-
esting observation is that we can ignore HSTree nodes below
a certain level to select τ + 1 disjoint nodes for a query.
Lemma 6 formally states the observation.
Lemma 6: Given a threshold τ and a maximum level
maxL, the minimum level where we can select a node is
FIGURE 6. Threaded HSTree for the HSTree in Figure 3.
minL = dlog2 (2maxL /(2maxL − τ ))e.
Proof: The maximum number of possible disjoint nodes retrieve all inverted lists for the nodes at the levels between
(i.e., the number of nodes in the leaf level) in an HSTree is minL and maxL. Then, we fill the first row of the DP array
2maxL . Consider we select a node at minL. Then, we cannot using the sizes of retrieved inverted lists. Finally, we apply
use any nodes in the subtree rooted by the selected node (the our algorithm to compute remaining slots of the DP array.
subtree has 2maxL−minL nodes at the level maxL). Hence, We remark that the condition Line 3 of the algorithm is always
the maximum number of remaining disjoint nodes is 2maxL − false, since we initialize the first row of the DP array before
2maxL−minL . To select τ + 1 disjoint nodes, we need to be calling the algorithm.
able to select τ nodes in the remaining disjoint nodes. Hence, Algorithm 2 encapsulates the initialization of the DP
we have the inequality, τ ≤ 2maxL − 2maxL−minL . The array using a threaded HSTree. We assume that q, τ ,
inequality gives us the minimum level minL = dlog2 (2maxL / the HSTree Hl , and the DP array A are visible in the
(2maxL − τ ))e.  algorithm. The algorithm first initialize an array of sets of
Example 10: For the HSTree in Figure 3, if τ = 5, inverted lists SL , which keeps inverted lists corresponding
maxL = dlog2 (5 + 1)e = 3 and minL = i,j i,j
to W(q, Ll ) for each node Ll (Line 1). Then, it retrieve
dlog2 (23 /(23 − 5)e = 2. inverted lists for the query in each node at the level minL
The observation can be generalized to τ + c scheme as (Line 2). Recall that an HSTree node N also denotes the
follows. Since we need to select τ + c disjoint nodes, maxL inverted index in N . For simplicity, we use SL [N ] to denote
becomes dlog2 (τ + c)e, and minL in Lemma 6 becomes the element of SL corresponding to the node N . After looking
dlog2 (2maxL /(2maxL − (τ + c − 1)))e. We remark that Algo- up inverted indices in the nodes at the level minL and retriev-
rithm 1 does not look up the inverted indices of the nodes ing inverted lists for the query, it uses the retrieved inverted
below minL by Lemma 4 (i.e, the condition in Line 3 is lists to construct inverted lists of the nodes at the levels above
always false for the nodes below minL).

IV. THREADED HSTree


Algorithm 2: InitializeDPArray
To further reduce the index lookup cost, in this section,
we develop a threaded HSTree structure. Consider a seg- 1 SL ← an array of empty sets;
ment w indexed in an HSTree node N and let the inverted 2 foreach nodeSN at level minL do
list for w be Iw . The first half of w is indexed in the left child 3 SL [N ] = w∈W (q,N ) N (w);
of N and the second half of w is indexed in the right child
4 for lv ← minL + 1 to maxL do
of N . Let the inverted lists of the first and second halves be
5 foreach node N at level lv do
Iw1 and Iw2 , respectively. We connect Iw with Iw1 and Iw2 with
6 NP ← parent of N ;
pointers, which are called threads. If we look up the inverted
7 foreach list l ∈ SL [NP ] do
index in N to find Iw for w, we can directly locate Iw1 and Iw2
8 if N is the left child of NP then
by following the threads. Figure 6 shows this modification of
9 SL [N ] = SL [N ] ∪ l.left_thread;
the HSTree in Figure 3, where the dashed blue lines denote
the threads that connect inverted lists. 10 else SL [N ] = SL [N ] ∪ l.right_thread;
We use a threaded HSTree with our algorithm as follows. 11 foreach w ∈ W(q, N ) do
Given a query, we first look up inverted indices of the nodes at 12 if N (w) 6 ∈ SL [N ] then
the minimum level minL, and we keep the retrieved inverted 13 SL [N ] = SL [N ] ∪ N (w)
lists. For each node N in the next level, we first locate its
parent node, and follow the threads of the inverted lists kept
14 for lv ← minL to maxL do
in the parent node. It can be easily seen that the inverted lists
15 foreach node N P
at level lv do
identified by parent’s threads are a subset of the inverted lists
16 A[N ][1] ← l∈SL [N ] |l|;
required in N . Hence, we look up the inverted index in N to
retrieve unidentified inverted lists only. In this way, we can

98200 VOLUME 8, 2020


T. Lee et al.: Optimized Signature Selection for Efficient String Similarity Search

FIGURE 7. Query time for different c values.

TABLE 2. Datasets used in experiments. remaining sections, we denote our algorithm by OptSearch
and the original HSTree search algorithm by HSSearch.

B. EXPERIMENTS ON DIFFERENT C VALUES


In this subsection, we evaluate our algorithm varying c values.
Even when c > 1, as shown in [17]5 , we can still estimate
an upper bound of the number of candidates with the sum
minL (i.e., levels lv > minL) (Line 4). For each node N at the of the sizes of inverted lists for a query string. Therefore,
level lv, it first locates the parent node NP of N (Line 6), and we can still use our algorithm to obtain an optimal combi-
obtains inverted lists for the query in N using the inverted nation of τ + c disjoint nodes. In this case, as discussed in
lists of the parent nodes, i.e., SL [Np ] as follows. If N is the Section III-C, we only need to change maxL to dlog2 (τ + c)e
left child of Np , it follows the left thread of each inverted list and minL to dlog2 (2maxL /(2maxL − (τ + c − 1)))e in our
in SL [Np ] (Line 8). Otherwise, it follows the right thread of algorithm.
each inverted list in SL [Np ] (Line 9). To retrieve those inverted Figure 7 shows query response times for alternative c
lists that are not identified from SL [NP ], it finally lookup values on three different datasets. If we use a c value larger
the inverted index in N (Line 11). For each segment w ∈ than 1, an underflow of the number of partitions may occurs.
W(q, N ), if the inverted list of w is not identified from SL [NP ] In this case, we re-evaluated the query using τ + 1 partitions.
(Line 12), we lookup the inverted index of N to retrieve the For all datasets and thresholds, we observed that c = 2
inverted index (Line 13). In Line 12, we need to check if w is exhibited the best performance. It can be explained by the
contained in the segments corresponding to the inverted lists query time ratios shown in Figure 8. There is a trade-off
in SL [N ]. This membership test can be easily done by merging between filtering (i.e., merging inverted lists) and verification
the positions of the segments in W(q, N ) and the positions of (i.e., edit distance computation) times for different c values.
the segments corresponding to the inverted lists in SL [N ]. To obtain the best performance, we need to balance the trade-
off. As shown in the figure, the time differences between
V. EXPERIMENTS filtering and verification were minimized when c was 2.
A. EXPERIMENTAL SETTINGS We remark that the cost for selecting an optimal combination
In experiments, we used four widely used real-world is negligible, and it is included in Indexlookup in Figure 8.
datasets, IMDB Actor and Movie (http://www.imdb.com), These results justify the motivation of our work described in
and Web Corpus (http://www.ldc.upenn.edu/Catalog, number Section III-A. Based on the results in this subsection, we used
LDC2006T13). Some important statistics of the datasets are c = 2 in our algorithm in the following subsections.
presented in Table 2. We chose the datasets to compare C. EXPERIMENTS ON INDEX LOOKUP TIME
performance for short and long strings. Our algorithm was
In this subsection, we evaluate the effects on index lookup
implemented in GNU C++ and compiled with -O3 option.
times when a threaded HSTree and the restriction of the
All experiments were conducted on a machine with 32GB
maximum level maxL are applied. Figure 9 shows the results.
main memory and Intel i7 CPU running an Ubuntu operating
When we used a threaded HSTree, index lookup time was
system. Datasets and indices were held in main memory.
reduced by 1.5 times on average. When we applied the restric-
We randomly extracted 2000 query strings from each data
tion of maxL (i.e., maxL= dlog2 (τ + c)e) along with a
set, ran queries. We evaluated the proposed technique in terms
threaded HSTree, index lookup time was reduced by 2 times
of query processing time. The reported results in this section
on average. As shown in Figure 8, index lookup time did not
are aggregated response times from 2000 queries. Since the
affect the query time when a threshold was large (e.g., τ ≥ 3).
HSTree technique consistently outperformed other existing
techniques as reported in [39], [48], we compared our tech- 5 we remark that [17] is a DNA read mapping technique that utilizes q-
nique with the original HSTree technique [39], [48]. In the grams, and thus the contributions of this paper are orthogonal to that of [17].

VOLUME 8, 2020 98201


T. Lee et al.: Optimized Signature Selection for Efficient String Similarity Search

FIGURE 8. Query time ratios for different c values.

FIGURE 9. Index lookup time comparisions.

FIGURE 10. Comparisons with HSSearch (Query Time).

For a low threshold (e.g. τ = 1), however, index lookup lookups, the benefit of OptSearch is reduced by the over-
time was very important on the overall performance because head of index lookups. As a threshold increased, however,
merging inverted lists and verifying candidate strings were OptSearch significantly outperformed HSSearch because
very fast. Since the proposed technique looks up inverted of the optimal partition selection and a good balance between
indices in all HSTree nodes to select an optimized combi- filtering and verification. For example, OptSearch is about
nation of nodes, it is crucial to reduce index lookup time 3 times faster than HSSearch when τ = 4 on the Actor
for low thresholds. As shown in the experiments in this sub- dataset (Figure 10(a)).
section, the threaded HSTree structure and maxL restriction Since we used the dataset, Corpus, which contains
technique effectively reduced index lookup time. 10 millions of strings, we can see that the proposed techniques
scales well to a large dataset from Figure 10(c).
D. COMPARISONS WITH HSSearch
In this subsection, we compare our algorithm with VI. RELATED WORK
HSSearch. Figure 10 shows the results. On each dataset, A. SIMILARITY MEASURES
OptSearch outperformed HSSearch by up to about 3 times. The problem of quantifying similarity between objects has
For low thresholds (e.g., τ ≤ 2), the performance of witnessed growing interest over the past decades. To measure
OptSearch was just as good as HSSearch. This is because the similarity between text data, character-based similarity
inverted lists for partitions are very short and list merging functions and token set-based similarity functions are widely
and edit distance computation can be done very quickly used. The most representative character-based similarity
on a low threshold. Since OptSearch requires more index function is edit distance [11], [27], [28], [38], which is also

98202 VOLUME 8, 2020


T. Lee et al.: Optimized Signature Selection for Efficient String Similarity Search

known as Levenshtein distance. Since it reflects the ordering strings to establish a filtering condition based on the pigeon-
of characters in strings and it allows non-trivial alignment, hole principle. PassJoin [22], [23] decomposes data strings
it is widely adopted in many applications. For concrete into τ +1 partitions to perform efficient string similarity join.
examples, it is used in practical applications such as diff HSTree [39], [48] proposes a hierarchical index structure
and patch commands in Linux OS systems, source code that considers multiple partitionings of a string to support
management for version control systems like GitHub, and string similarity search using the partition-based approach of
DNA read mappers (e.g. [17], [18]) for analyzing genomic PassJoin.
data. Set-based similarity functions such as Jaccard coeffi-
cient [12], Dice [9] and Cosine similarity are also used to VII. CONCLUSION
measure the similarity between text data by tokenizing each In this paper, we propose an optimal partition selection algo-
string into a set of words or q-grams. Since set-based similar- rithm to improve the performance of string similarity search
ity functions considers only exact match between tokens, they using the HSTree indexing technique. We observed that there
might not correctly measure similarity when the granularity can be multiple combination of τ + c disjoint nodes in an
of a token is large. Fast-Join [40] and MF-Join [42] address HSTree, and proposed a novel dynamic programming algo-
this problem by considering similarity between tokens using rithm that selects an optimal combination of HSTree nodes.
edit distance before computing set-based similarity. LSA [19] To reduce the overhead of selecting optimal combination,
and it variants (e.g., [24], [25]) also have been developed we also proposed a threaded HSTree, which is an enhanced
to measure similarity between documents through corpus HSTree structure. We evaluated the proposed technique using
analysis. real datasets and showed our technique outperformed the
There are similarity functions that are used in structured existing technique HSSearch.
data. SimRank [13] and many variants [26], [37], [47], [49],
[50], [52] has been proposed to consider semantic similarity
information between objects in information networks. The REFERENCES
intuition behind SimRank is that similar objects are linked by [1] A. Arasu, V. Ganti, and R. Kaushik, ‘‘Efficient exact set-similarity joins,’’
similar objects. Based on the intuition, it quantifies node sim- in Proc. VLDB, 2006, pp. 918–929.
[2] R. J. Bayardo, Y. Ma, and R. Srikant, ‘‘Scaling up all pairs similar-
ilarity based on the compound similarity of their neighbors. ity search,’’ in Proc. 16th Int. Conf. World Wide Web WWW, 2007,
Graph edit distance [32], [34] and feature-based similarity pp. 131–140.
functions [5], [36], [51], [53] has been proposed to quantify [3] A. Behm, S. Ji, C. Li, and J. Lu, ‘‘Space-constrained gram-based indexing
for efficient approximate string search,’’ in Proc. IEEE 25th Int. Conf. Data
the similarity between complex objects represented by graph Eng., Mar. 2009, pp. 604–615.
models. Similar to edit distance, graph edit distance measure [4] A. Behm, C. Li, and M. J. Carey, ‘‘Answering approximate string queries
the distance between two graphs using the minimum number on large data sets using external memory,’’ in Proc. IEEE 27th Int. Conf.
Data Eng., Apr. 2011, pp. 888–899.
of edit operations to make the graphs isomorphic. In contrast [5] H. Bunke and K. Shearer, ‘‘A graph distance metric based on the max-
to edit distance, however, graph edit distance computation is imal common subgraph,’’ Pattern Recognit. Lett., vol. 19, nos. 3–4,
NP-hard and many techniques have been proposed to effi- pp. 255–259, Mar. 1998.
[6] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, ‘‘Robust and efficient
ciently compute graph edit distance (e.g. [15], [32]). fuzzy match for online data cleaning,’’ in Proc. ACM SIGMOD Int. Conf.
Manage. Data SIGMOD, 2003, pp. 313–324.
B. STRING SIMILARITY QUERY PROCESSING [7] S. Chaudhuri, V. Ganti, and L. Gravano, ‘‘Selectivity estimation for string
predicates: Overcoming the underestimation problem,’’ in Proc. 20th Int.
Set similarity join is the problem of finding similar pairs of Conf. Data Eng., 2004, pp. 227–238.
records from two collections of records, which is essential [8] S. Chaudhuri, V. Ganti, and R. Kaushik, ‘‘A primitive operator for simi-
in many applications including data cleaning [8] and near larity joins in data cleaning,’’ in Proc. 22nd Int. Conf. Data Eng. (ICDE),
Apr. 2006, p. 5.
duplicate detection [45]. Many algorithms are developed for [9] L. R. Dice, ‘‘Measures of the amount of ecologic association between
the problem of set similarity join [1], [2], [6], [8], [10], species,’’ Ecology, vol. 26, no. 3, pp. 297–302, Jul. 1945.
[10] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan,
[30], [35], [40], [41], [43]–[45]. Some of these algorithms and D. Srivastava, ‘‘Approximate string joins in a database (almost) for
(e.g. [2], [31]) solve only join problems, but most of these free,’’ in Proc. VLDB, Sep. 2001, pp. 491–500.
algorithms can be applied to the search problem in their [11] P. A. V. Hall and G. R. Dowling, ‘‘Approximate string matching,’’ ACM
Comput. Surv., vol. 12, pp. 381–402, Dec. 1980.
original form or with slight modification. Many algorithms [12] P. Jaccard, ‘‘Étude comparative de la distribution florale dans une portion
and inverted index structures have been proposed for the des alpes et des jura,’’ Bull Soc Vaudoise Sci. Nat., vol. 37, pp. 547–579,
similarity search problem [3], [4], [6], [7], [16], [20], [21], 1901.
[13] G. Jeh and J. Widom, ‘‘SimRank: A measure of structural-context similar-
[29], [41], [46]. The technique called VGRAM [21], [46] was ity,’’ in Proc. 8th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining
proposed to use variable-length grams in an inverted index to KDD, 2002, pp. 538–543.
improve similarity search performance and reduce the index [14] J. Kim, ‘‘An effective candidate generation method for improving perfor-
mance of edit similarity query processing,’’ Inf. Syst., vol. 47, pp. 116–128,
size. A disk-based inverted index structure [4] was proposed Jan. 2015.
for supporting similarity search on large datasets. In [16], [15] J. Kim, D.-H. Choi, and C. Li, ‘‘Inves: Incremental partitioning-based ver-
a dataset partitioning algorithm has been proposed to reduce ification for graph similarity search,’’ in Proc. EDBT, 2019, pp. 229–240.
[16] J. Kim and H. Lee, ‘‘Efficient exact similarity searches using multiple
the number of candidates by exploiting document frequency token orderings,’’ in Proc. IEEE 28th Int. Conf. Data Eng., Apr. 2012,
orderings. Recent techniques exploits partitioning of data pp. 822–833.

VOLUME 8, 2020 98203


T. Lee et al.: Optimized Signature Selection for Efficient String Similarity Search

[17] J. Kim, C. Li, and X. Xie, ‘‘Improving read mapping using additional prefix [44] C. Xiao, W. Wang, X. Lin, and H. Shang, ‘‘Top-k set similarity joins,’’ in
grams,’’ BMC Bioinf., vol. 15, no. 1, p. 42, 2014. Proc. IEEE 25th Int. Conf. Data Eng., Mar. 2009, pp. 916–927.
[18] J. Kim, C. Li, and X. Xie, ‘‘Hobbes3: Dynamic generation of variable- [45] C. Xiao, W. Wang, X. Lin, and J. X. Yu, ‘‘Efficient similarity joins for near
length signatures for efficient approximate subsequence mappings,’’ in duplicate detection,’’ in Proc. 17th Int. Conf. World Wide Web WWW, 2008,
Proc. IEEE 32nd Int. Conf. Data Eng. (ICDE), May 2016, pp. 169–180. pp. 131–140.
[19] T. K. Landauer and S. T. Dumais, ‘‘A solution to Plato’s problem: The latent [46] X. Yang, B. Wang, and C. Li, ‘‘Cost-based variable-length-Gram selection
semantic analysis theory of acquisition, induction, and representation of for string collections to support approximate queries efficiently,’’ in Proc.
knowledge.,’’ Psychol. Rev., vol. 104, no. 2, pp. 211–240, 1997. ACM SIGMOD Int. Conf. Manage. Data SIGMOD, 2008, pp. 353–364.
[20] C. Li, J. Lu, and Y. Lu, ‘‘Efficient merging and filtering algorithms for [47] S.-H. Yoon, S.-W. Kim, and S. Park, ‘‘C-rank: A link-based similarity
approximate string searches,’’ in Proc. IEEE 24th Int. Conf. Data Eng., measure for scientific literature databases,’’ Inf. Sci., vol. 326, pp. 25–40,
Apr. 2008, pp. 257–266. Jan. 2016.
[21] C. Li, B. Wang, and X. Yang, ‘‘VGRAM: Improving performance of [48] M. Yu, J. Wang, G. Li, Y. Zhang, D. Deng, and J. Feng, ‘‘A unified frame-
approximate queries on string collections using variable-length grams,’’ work for string similarity search with edit-distance constraint,’’ VLDB J.,
in Proc. VLDB, 2007, pp. 303–314. vol. 26, no. 2, pp. 249–274, Apr. 2017.
[22] G. Li, D. Deng, and J. Feng, ‘‘Pass-join+: A partition-based method [49] W. Yu, X. Lin, W. Zhang, J. Pei, and J. A. McCann, ‘‘SimRank*: Effective
for string similarity joins with edit-distance constraints,’’ ACM Trans. and scalable pairwise similarity search based on graph topology,’’ VLDB
Database Syst., vol. 38, no. 2, pp. 1–40, 2013. J., vol. 28, no. 3, pp. 401–426, Jun. 2019.
[23] G. Li, D. Deng, J. Wang, and J. Feng, ‘‘Pass-join: A partition based method [50] M. Zhang, J. Wang, and W. Wang, ‘‘HeteRank: A general similarity
for similarity joins,’’ Proc. VLDB Endowment, vol. 5, no. 3, pp. 253–264, measure in heterogeneous information networks by integrating multi-type
2011. relationships,’’ Inf. Sci., vol. 453, pp. 389–407, Jul. 2018.
[24] P. Martin, S. Benno, and A. Maik, ‘‘A wikipedia- based multilingual [51] S. Zhang, J. Yang, and W. Jin, ‘‘SAPPER: Subgraph indexing and approxi-
retrieval model,’’ in Proc. ECIR, 2008, pp. 522–530. mate matching in large graphs,’’ Proc. VLDB Endowment, vol. 3, nos. 1–2,
[25] I. Matveeva, G. Levow, A. Farahat, and C. Royer, ‘‘Term representa- pp. 1185–1194, Sep. 2010.
tion with generalized latent semantic analysis,’’ in Proc. RANLP, 2005, [52] P. Zhao, J. Han, and Y. Sun, ‘‘P-rank: A comprehensive structural similarity
pp. 308–315. measure over information networks,’’ in Proc. 18th ACM Conf. Inf. Knowl.
[26] T. Milo, A. Somech, and B. Youngmann, ‘‘Boosting simrank with seman- Manage. CIKM, 2009, pp. 553–562.
tics,’’ in Proc. EDBT, 2019, pp. 1–12. [53] G. Zhu, X. Lin, K. Zhu, W. Zhang, and J. X. Yu, ‘‘TreeSpan: Efficiently
[27] S. B. Needleman and C. D. Wunsch, ‘‘A general method applicable to the computing similarity all-matching,’’ in Proc. Int. Conf. Manage. Data
search for similarities in the amino acid sequence of two proteins,’’ J. Mol. SIGMOD, 2012, pp. 529–540.
Biol., vol. 48, no. 3, pp. 443–453, Mar. 1970.
[28] J. L. Peterson, ‘‘Computer programs for detecting and correcting spelling
errors,’’ Commun. ACM, vol. 23, no. 12, pp. 676–687, Dec. 1980. TAEGYOUNG LEE received the B.S. and M.S.
[29] J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin, ‘‘Efficient exact edit similarity degrees in computer science and engineering
query processing with the asymmetric signature scheme,’’ in Proc. Int. from Jeonbuk National University, South Korea,
Conf. Manage. Data SIGMOD, 2011, pp. 1033–1044. in 2015 and 2017, respectively, and the second
[30] K. Ramasamy, J. M. Patel, R. Kaushik, and J. F. Naughton, ‘‘Set contain- M.S. degree in computer science from the Korea
ment joins: The good, the bad and the ugly,’’ in Proc. VLDB, Sep. 2000, Advanced Institute of Science and Technology
pp. 351–362. (KAIST), in 2019. She is currently pursuing the
[31] L. A. Ribeiro and T. Härder, ‘‘Efficient set similarity joins using min-
Ph.D. degree in computer science and engineer-
prefix,’’ in Proc. ADBIS, 2009, pp. 88–102.
[32] K. Riesen, S. Fankhauser, and H. Bunke, ‘‘Speeding up graph edit ing with The Hong Kong University of Science
distance computation with a bipartite heuristic,’’ in Proc. MLG, 2007, and Technology (HKUST). Her research interests
pp. 21–24. include data analysis, theory of computation, and computational geometry.
[33] M. Sahami and T. D. Heilman, ‘‘A Web-based kernel function for measur-
ing the similarity of short text snippets,’’ in Proc. 15th Int. Conf. World TAE-SUN CHUNG received the B.S. degree
Wide Web WWW, 2006, pp. 377–386. in computer science from the Korea Advanced
[34] A. Sanfeliu and K.-S. Fu, ‘‘A distance measure between attributed rela-
Institute of Science and Technology (KAIST),
tional graphs for pattern recognition,’’ IEEE Trans. Syst., Man, Cybern.,
in February 1995, and the M.S. and Ph.D. degrees
vols. SMC–13, no. 3, pp. 353–362, May 1983.
[35] S. Sarawagi and A. Kirpal, ‘‘Efficient set joins on similarity predi- in computer science from Seoul National Uni-
cates,’’ in Proc. ACM SIGMOD Int. Conf. Manage. Data SIGMOD, 2004, versity, South Korea, in February 1997 and
pp. 743–754. August 2002, respectively. He is currently a Pro-
[36] H. Shang, X. Lin, Y. Zhang, J. X. Yu, and W. Wang, ‘‘Connected substruc- fessor with the Department of Software, Ajou
ture similarity search,’’ in Proc. Int. Conf. Manage. Data SIGMOD, 2010, University. His current research interests include
pp. 903–914. flash memory storage, query processing in spatial
[37] H. Tong, C. Faloutsos, and J.-Y. Pan, ‘‘Fast random walk with restart and databases, machine learning, similarity search, and general database systems.
its applications,’’ in Proc. 6th Int. Conf. Data Mining (ICDM), Dec. 2006,
pp. 613–622.
[38] R. A. Wagner and M. J. Fischer, ‘‘The String-to-String correction prob- JONGIK KIM received the B.S. and M.S.
lem,’’ J. ACM (JACM), vol. 21, no. 1, pp. 168–173, Jan. 1974. degrees in computer science from the Korea
[39] J. Wang, G. Li, D. Deng, Y. Zhang, and J. Feng, ‘‘Two birds with one stone: Advanced Institute of Science and Technology
An efficient hierarchical framework for top-k and threshold-based string (KAIST), in 1998 and 2000, respectively, and
similarity search,’’ in Proc. IEEE 31st Int. Conf. Data Eng., Apr. 2015, the Ph.D. degree in computer engineering from
pp. 519–530. Seoul National University, South Korea, in 2004.
[40] J. Wang, G. Li, and J. Fe, ‘‘Fast-join: An efficient method for fuzzy token He worked as a Senior Researcher at the Elec-
matching based string similarity join,’’ in Proc. IEEE 27th Int. Conf. Data tronics and Telecommunications Research Insti-
Eng., Apr. 2011, pp. 458–469. tute (ETRI), from 2004 to 2007. He joined the
[41] J. Wang, G. Li, and J. Feng, ‘‘Can we beat the prefix filtering?: An adaptive
Division of Computer Science and Engineering,
framework for similarity join and search,’’ in Proc. Int. Conf. Manage. Data
Jeonbuk National University, as a Faculty Member, in 2007. He is currently
SIGMOD, 2012, pp. 85–96.
[42] J. Wang, C. Lin, and C. Zaniolo, ‘‘MF-join: Efficient fuzzy string similarity a Professor with Jeonbuk National University. He was a Visiting Scholar
join with multi-level filtering,’’ in Proc. IEEE 35th Int. Conf. Data Eng. with the University of California at Irvine (UCI) Irvine, from 2012 to
(ICDE), Apr. 2019, pp. 386–397. 2013. He has been working in the area of semi-structured database (XML
[43] C. Xiao, W. Wang, and X. Lin, ‘‘Ed-join: An efficient algorithm for sim- database), telematics systems, flash-memory data management, event stream
ilarity joins with edit distance constraints,’’ in Proc. PVLDB, Aug. 2008, processing, and similarity query processing.
pp. 933–944.

98204 VOLUME 8, 2020

You might also like