Applying Hash-based Indexing in Text-based Information Retrieval

Benno Stein and Martin Potthast
Faculty of Media, Media Systems
Bauhaus University Weimar, 99421 Weimar, Germany
{benno.stein | martin.potthast}@medien.uni-weimar.de
ABSTRACT

Hash-based indexing is a powerful technology for similarity search in large document collections [13]. The central idea is the interpretation of hash collisions as similarity indication, provided that an appropriate hash function is given. In this paper we identify basic retrieval tasks which can benefit from this new technology, we relate them to well-known applications, and we discuss how hash-based indexing is applied. Moreover, we present two recently developed hash-based indexing approaches and compare the achieved performance improvements in real-world retrieval settings. This analysis, which has not been conducted in this or a similar form until now, shows the potential of tailored hash-based indexing methods.
Keywords

hash-based indexing, similarity hashing, efficient search, performance evaluation
1. INTRODUCTORY BACKGROUND

Text-based information retrieval in general deals with the search in a large document collection $D$. In this connection we distinguish between a "real" document $d \in D$, in the form of a paper, a book, or a Web site, and its computer representation $\mathbf{d}$, in the form of a term vector, a suffix tree, or a signature file. Likewise, $\mathbf{D}$ denotes the set of computer representations of the real documents in $D$. Typically, a document representation $\mathbf{d}$ is an $m$-dimensional feature vector, which means that the objects in $\mathbf{D}$ can be considered as points in the $m$-dimensional vector space. The similarity between two documents, $d, d_q$, shall be inversely proportional to the distance of their feature vectors $\mathbf{d}, \mathbf{d}_q \in \mathbf{D}$. The similarity is measured by a function $\varphi(\mathbf{d}, \mathbf{d}_q)$ which maps onto $[0;1]$, with $0$ and $1$ indicating no and maximum similarity respectively; $\varphi$ may rely on the $l_1$-norm, the $l_2$-norm, or on the angle between the feature vectors. Obviously the most similar document $d \in D$ respecting a query document $d_q$ maximizes $\varphi(\mathbf{d}, \mathbf{d}_q)$.

We refer to the task of finding $d$ as similarity search, which can be done by a linear scan of $\mathbf{D}$; in fact this task cannot be done better than in $O(|D|)$ if the dimensionality, $m$, of the feature space
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission from the author.
DIR 2007, March 28-29, Leuven, Belgium
Copyright 2007 Stein / Potthast.
is around 10 or higher [15]. Also the use of an inverted file index cannot improve the runtime complexity since the postlist lengths are in $O(|D|)$. Here the idea of hash-based indexing comes into play: testing whether or not $d$ is a member of $D$ can be done in virtually constant time by means of hashing. This concept can be extended toward similarity search if we are given some kind of similarity hash function, $h_\varphi : \mathbf{D} \rightarrow U$, which maps the set $\mathbf{D}$ of document representations to a universe $U$ of keys from the natural numbers, $\mathbb{N}$, and which fulfills the following property [13]:

$$h_\varphi(\mathbf{d}) = h_\varphi(\mathbf{d}_q) \;\Rightarrow\; \varphi(\mathbf{d}, \mathbf{d}_q) \geq 1 - \varepsilon, \qquad (1)$$

with $\mathbf{d}, \mathbf{d}_q \in \mathbf{D}$ and $0 < \varepsilon \ll 1$.

That is, a hash collision between two elements from $\mathbf{D}$ indicates a high similarity between them, which essentially relates to the concept of precision, $prec_\varepsilon$:

$$prec_\varepsilon = \frac{|\{\mathbf{d} \in \mathbf{D}_q : \varphi(\mathbf{d}, \mathbf{d}_q) \geq 1 - \varepsilon\}|}{|\mathbf{D}_q|}, \quad \text{with } \mathbf{D}_q = \{\mathbf{d} \in \mathbf{D} : h_\varphi(\mathbf{d}) = h_\varphi(\mathbf{d}_q)\}.$$

Though the applicability of hash-based indexing to text-based information retrieval has been demonstrated, there has been no discussion of which retrieval tasks can benefit from it; moreover, none of the hash-based indexing approaches have been compared in this connection. Our research aims at closing this gap.

Section 2 discusses retrieval tasks along with applications where hash-based indexing can be applied to improve retrieval performance and result quality. Section 3 introduces two hash-based indexing methods that can be utilized in the field of text-based information retrieval. Section 4 presents results from a comparison of these methods for important text-retrieval tasks.
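Property (1) and the precision measure $prec_\varepsilon$ can be computed directly once a similarity function and a hash bucket are at hand. A minimal sketch in Python; the cosine measure over sparse term vectors and the toy bucket are illustrative assumptions, not the paper's experimental setup:

```python
import math

def cosine_sim(d1, d2):
    """Similarity function phi, mapping onto [0, 1] for non-negative term vectors."""
    dot = sum(d1.get(t, 0) * d2[t] for t in d2)
    n1 = math.sqrt(sum(v * v for v in d1.values()))
    n2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def precision_eps(bucket, d_q, phi, eps):
    """prec_eps: the fraction of documents colliding with d_q (the bucket D_q)
    that are truly within the eps-neighborhood of d_q."""
    if not bucket:
        return 0.0
    hits = sum(1 for d in bucket if phi(d, d_q) >= 1 - eps)
    return hits / len(bucket)
```

For example, a bucket containing one exact duplicate of the query and one unrelated document yields a precision of 0.5 for any small $\varepsilon$.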
2. RETRIEVAL TASKS IN TEXT-BASED INFORMATION RETRIEVAL

Let $T$ denote the set of terms that are used within the document representations $\mathbf{d} \in \mathbf{D}$. The most important data structure for searching $\mathbf{D}$ is the inverted file [16, 2], which maps a term $t \in T$ onto a so-called postlist that encodes for each occurrence of $t$ the respective documents $d_i$, $i = 1, \ldots, o$, along with its position in $d_i$. In the following it is assumed that we are given a (possibly very large) document collection $D$ which can be accessed with an inverted file index, $\mu_i$, $\mu_i : T \rightarrow \mathcal{P}(\mathbf{D})$, as well as with a hash index, $\mu_h$, $\mu_h : \mathbf{D} \rightarrow \mathcal{P}(\mathbf{D})$, where $\mathcal{P}(\mathbf{D})$ denotes the power set of $\mathbf{D}$. The inverted file index $\mu_i$ is used to answer term queries; the hash index $\mu_h$ is ideally suited to answer
Index-basedretrievalGroupingClassificationSimilarity searchCategorizationNear-duplicate detectionPartial document similarityComplete document similarity
Retrieval tasks Applications
focused search, efficient search(cluster hypothesis)plagiarism analysis, result cleaning plagiarism analysis query by example directory maintenance 
Figure1: Text-based information retrieval tasks (left) and exemplified applications(right) wherehash-based indexingcan beapplied.
document queries, say, to map a document onto a set of similar documents. Section 3 shows how such a hash index along with a hash function $h_\varphi$ is constructed.

We have identified three fundamental text retrieval tasks where hash-based indexing can be applied: (i) grouping, which further divides into categorization and near-duplicate detection, (ii) similarity search, which further divides into partial and complete document comparison, and (iii) classification. Figure 1 organizes these tasks and gives examples for respective applications; the following subsections provide a more in-depth characterization of them.
2.1 Retrieval Task: Grouping
Grouping plays an important role in connection with user interaction and search interfaces: many retrieval tasks return a large result set that needs to be refined, visually prepared, or cleaned from duplicates. Refinement and visual preparation require the identification of categories, which is an unsupervised classification task since reasonable categories are a priori unknown. Categorizing search engines like Vivísimo and AIsearch address this problem with a cluster analysis [17, 11]; they are well-known examples where such a kind of result set preparation is applied. Duplicate elimination helps in cleaning result sets that are cluttered with many nearly identical documents [4]. The latter situation is
term queryInvertedfile indexHashindex
term 1doc 2, doc 3, ...term 2doc 5, doc 8......112568doc 1, doc 7, ...256749doc 9......
DocumentcollectionResultsetIdentifiedcategories
...12
Figure 2: Illustration of a grouping retrieval task. Startingpoint is an information need formulated as term query, whichis satisfied with an inverted file index, yielding a result set(Step
x
). The subsequent categorization is realized with a hashindex in linear time of the result set size (Step
).
typical for nearly all kinds of product searches and commercial applications on the Web: for the same product various suppliers can be found; likewise, mirror sites can pretend the existence of apparently different documents.
Remarks on Runtime.
With hash-based indexing the performance of such kinds of retrieval tasks can be significantly improved: a user formulates his or her information need as a term query for which, in a first step, a result set $D_q \subseteq D$ is compiled by means of an inverted file index $\mu_i$. In a second step the hash index $\mu_h$ is applied for the grouping operation, which may be categorization or duplicate elimination. Figure 2 illustrates the strategy. Recall that $\mu_h$ helps us to accomplish the grouping operation in linear time in the result set size $|D_q|$. By contrast, if a vector-based document model is employed for grouping, we obtain a runtime of $O(|D_q|^2)$, since duplicate elimination or category formation requires a pairwise similarity computation between the elements in $D_q$.
2.2 Retrieval Task: Similarity Search
The most common retrieval task is a similarity search where a user's information need is formulated as a (Boolean) term query. If the user is in fact interested in documents that match exactly, say, that contain the query terms literally, the optimum index structure respecting search performance is the inverted file index, $\mu_i$. The major search engine companies like Google, Yahoo, or AltaVista are specialized in satisfying these kinds of information
term query kInvertedfile indexHashindexDocumentcollectionResultset
term 1doc 2, doc 3, ...term 2doc 5, doc 8......112568doc 1, doc 7, ...256749doc 9......
document
query
 
term query 1
...
Alternative
...
Figure 3: Illustration of a similarity search retrieval task.Starting point is an information need formulated as documentquery, which is satisfied with a hash index in constant time of the document collection size. Alternatively, keywords may beextracted from the document query to formulate several termqueries.
 
needs. However, if a user specifies his or her information need in the form of a document query, say, in a "more like this" manner, a hash-based index, $\mu_h$, could be employed as index structure. The question is whether a similarity hash function $h_\varphi$ that fulfills Property (1) can be constructed at all, which in turn depends on the admissible $\varepsilon$-interval demanded by the information need.
Remarks on Runtime.
The utilization of a hash-based index, $\mu_h$, can be orders of magnitude faster compared to a classical inverted file index $\mu_i$: to identify similar documents with $\mu_i$, appropriate keywords must be extracted from the document query, a number $k$ of term queries must be formed from these keywords, and the respective result sets $D_{q_1}, \ldots, D_{q_k}$ must be compared to the document query. Figure 3 illustrates both this alternative strategy and the utilization of $\mu_h$. If we assume an effort of $O(1)$ for both inverted file index access and hash index access, the construction of the result set requires $O(|D_{q_1}| + \ldots + |D_{q_k}|)$ with $\mu_i$, and only $O(1)$ with $\mu_h$. Note that the actual runtime difference depends on the quality of the extracted keywords and the finesse when forming the term queries from them; it can assume dramatic proportions. This situation is aggravated if an application like plagiarism analysis requires the segmentation of the query document in order to realize a similarity search for one paragraph at a time [14].
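The contrast between the two strategies in Figure 3 can be sketched as follows. The dict-based indices, the keyword list, and `h_phi` are illustrative assumptions for exposition:

```python
def more_like_this_hash(hash_index, h_phi, d_q):
    """Document query via the hash index mu_h: a single O(1) bucket lookup."""
    return hash_index.get(h_phi(d_q), set())

def more_like_this_terms(inverted_index, keywords):
    """Alternative via mu_i: union of k postlists, which costs
    O(|D_q1| + ... + |D_qk|) in the sizes of the retrieved postlists."""
    candidates = set()
    for t in keywords:
        candidates |= inverted_index.get(t, set())
    return candidates
```

In the term-query variant, the candidate set would still have to be ranked against the document query, adding a similarity computation per candidate; the hash lookup sidesteps both costs.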
2.3 Retrieval Task: Classification
Classification plays an important role in various text retrieval tasks: genre analysis, spam filtering, category assignment, or author identification, to mention only a few. Text classification is the supervised counterpart of text categorization and, for a small number of classes, it can be successfully addressed with Bayes classifiers, discriminant analysis, support vector machines, or neural networks. However, with an increasing number of classes the construction of a classifier with assured statistical properties is rendered nearly impossible.

With hash-based indexing an efficient and robust classifier can be constructed, even if the number of classes, $C_1, \ldots, C_p$, is high and if only a small or irregularly distributed number of training documents is given. The classification approach follows the $k$-nearest-neighbor principle and presumes that all documents of the classes $C_i$ are stored in a hash index $\mu_h$. Given a document $d_q$ to be classified, its hash value is computed and the documents $D_q$ that are assigned to the same hash bucket are investigated with respect to their distribution in $C_1, \ldots, C_p$. $d_q$ gets assigned the class $C_j$ to which the majority of the documents in $D_q$ belongs:

$$j = \mathop{\mathrm{argmax}}_{i = 1, \ldots, p} \; |C_i \cap D_q|$$
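The majority-vote rule above can be sketched as follows; the hash index, the label map, and `h_phi` are illustrative stand-ins for the trained index $\mu_h$ and the classes $C_1, \ldots, C_p$:

```python
from collections import Counter

def classify(hash_index, h_phi, labels, d_q):
    """Assign d_q the majority class among the documents in its hash
    bucket D_q, i.e. j = argmax_i |C_i intersect D_q|."""
    bucket = hash_index.get(h_phi(d_q), [])
    if not bucket:
        return None  # empty bucket: no collision-based evidence
    votes = Counter(labels[d] for d in bucket)
    return votes.most_common(1)[0][0]
```

Note that an empty bucket yields no decision; in practice one would fall back to a conventional classifier or to a coarser hash function.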
3. REALIZATION OF HASH-BASED INDEXING

Given a similarity hash function $h_\varphi : \mathbf{D} \rightarrow U$, a hash-based index $\mu_h$ can be directly constructed by means of a hash table $T$ along with a standard hash function $h : U \rightarrow \{1, \ldots, |T|\}$; $h$ maps the universe of keys, $U$, onto the $|T|$ storage positions. Indexing a set $\mathbf{D}$ of document representations means to compute for each $\mathbf{d} \in \mathbf{D}$ its hash key $h_\varphi(\mathbf{d})$ and to insert at position $h(h_\varphi(\mathbf{d}))$ in $T$ a pointer to $d$. This way $T$ maintains for each value of $h_\varphi$ a bucket $\mathbf{D}_q \subseteq \mathbf{D}$ that fulfills the following condition:

$$\mathbf{d}_1, \mathbf{d}_2 \in \mathbf{D}_q \;\Rightarrow\; h_\varphi(\mathbf{d}_1) = h_\varphi(\mathbf{d}_2),$$

where $\mathbf{d}_1, \mathbf{d}_2$ denote the computer representations of the documents $d_1, d_2$. Based on $T$ and $h_\varphi$ a document query corresponds to a single hash table lookup. In particular, if $h_\varphi$ fulfills Property (1) the bucket returned for $\mathbf{d}_q$ defines a set $\mathbf{D}_q$ whose elements are in the $\varepsilon$-neighborhood of $\mathbf{d}_q$ with respect to $\varphi$:

$$\mathbf{d} \in \mathbf{D}_q \;\Rightarrow\; \varphi(\mathbf{d}, \mathbf{d}_q) \geq 1 - \varepsilon$$

The crucial part is the choice and the parameterization of a suitable similarity hash function $h_\varphi$ for text documents. To our knowledge two approaches have been proposed recently, namely fuzzy-fingerprinting and locality-sensitive hashing. Both approaches can be directly applied to the vector space representation $\mathbf{d}$ of a document and are introduced now.
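The construction just described can be sketched as follows. In this illustration Python's built-in dict subsumes the standard hash function $h$ that maps keys onto storage positions; the similarity hash function `h_phi` is supplied by the caller:

```python
class HashIndex:
    """mu_h realized as a hash table T: h_phi yields the similarity key,
    and the dict's own hashing maps that key onto a storage position."""

    def __init__(self, h_phi):
        self.h_phi = h_phi
        self.table = {}

    def insert(self, doc):
        """Index a document: one key computation, one bucket insertion."""
        self.table.setdefault(self.h_phi(doc), []).append(doc)

    def query(self, d_q):
        """A document query is a single table lookup; if h_phi fulfills
        Property (1), the returned bucket lies in the eps-neighborhood
        of d_q with respect to phi."""
        return self.table.get(self.h_phi(d_q), [])
```

For example, with a (deliberately crude) first-letter hash, all documents sharing an initial letter fall into one bucket and are returned by a single lookup.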
3.1 Fuzzy-Fingerprinting
Fuzzy-fingerprinting is a hashing approach specifically designed for, but not limited to, text-based information retrieval [13]. It is based on the definition of a small number $k$, $k \in [10, 100]$, of prefix equivalence classes. A prefix class, for short, contains all terms starting with the same prefix. The computation of $h_\varphi(\mathbf{d})$ happens in the following steps: (i) computation of $pf$, a $k$-dimensional vector that quantifies the distribution of the index terms in $\mathbf{d}$ with respect to the prefix classes; (ii) normalization of $pf$ using a corpus that provides a representative cross-section of the source language, and computation of $\Delta_{pf} = (\delta_1, \ldots, \delta_k)$, the vector of deviations from the expected distribution;¹ (iii) fuzzification of $\Delta_{pf}$ by projecting the exact deviations according to diverse fuzzification schemes. Figure 4 illustrates the construction process.

Typically, three fuzzification schemes (= linguistic variables) are used, where each scheme differentiates between one and three deviation intervals. For a fuzzification scheme $\rho$ with $r$ deviation intervals, Equation (2) shows how a document's normalized deviation vector $\Delta_{pf}$ is encoded:

$$h_\varphi^{(\rho)}(\mathbf{d}) = \sum_{i=0}^{k-1} \delta_i^{(\rho)} \cdot r^i, \quad \text{with } \delta_i^{(\rho)} \in \{0, \ldots, r-1\} \qquad (2)$$

$\delta_i^{(\rho)}$ is a document-specific value and represents the fuzzified deviation of $\delta_i \in \Delta_{pf}$ when applying fuzzification scheme $\rho$.

A prefix class is not necessarily limited to exactly one prefix. It is also possible to define a combined prefix class as the union of two or more others. There are $26^i$ possible prefix classes for the Latin alphabet, where $i$ denotes the prefix length. Using a bin-packing heuristic the combined prefix classes can be constructed such that each class has the same expected frequency.

Document representations based on prefix classes can be regarded as abstractions of the vector space model. On the one hand they have a weaker performance when directly applied to a retrieval task such as grouping, similarity search, or classification. On the other hand they are orders of magnitude smaller and do not suffer from sparsity.
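Step (iii) and the encoding of Equation (2) can be sketched as follows. The fuzzification thresholds below are illustrative assumptions; the paper does not prescribe concrete interval boundaries at this point:

```python
def fuzzify(deviation, thresholds):
    """Project an exact deviation onto one of r = len(thresholds) + 1
    deviation intervals (a fuzzification scheme rho). The threshold
    values are assumptions for illustration."""
    for interval, t in enumerate(thresholds):
        if abs(deviation) < t:
            return interval
    return len(thresholds)

def encode_fingerprint(deltas, r):
    """Equation (2): encode the fuzzified deviations delta_i, each in
    {0, ..., r-1}, as a single key sum_i delta_i * r**i, i.e. read the
    vector as a radix-r number."""
    assert all(0 <= d < r for d in deltas)
    return sum(d * r**i for i, d in enumerate(deltas))
```

Because the fingerprint is a plain integer, it can be used directly as the key of the hash index described in Section 3.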
3.2 Locality-Sensitive Hashing
Locality-sensitive hashing (LSH) is a generic framework for the randomized construction of hash functions [9]. Based on a family $H_\varphi$ of simple hash functions $h$, $h : \mathbf{D} \rightarrow \mathbb{N}$, a locality-sensitive hash function $h_\varphi$ is a combination of $k$ functions $h \in H_\varphi$ obtained by an independent and identically distributed random choice. When using summation as combination rule, the construction of $h_\varphi(\mathbf{d})$ is defined as follows:

$$h_\varphi(\mathbf{d}) = \sum_{i=1}^{k} h_i(\mathbf{d}), \quad \text{with } \{h_1, \ldots, h_k\} \subset_{rand} H_\varphi$$

¹ In this paper the British National Corpus is used as reference, which is a 100-million-word collection of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English [1].
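The summation-based combination can be sketched as follows. The concrete family $H_\varphi$ used here (quantized random projections) is an illustrative assumption; the construction above leaves the family abstract:

```python
import random

def make_projection_hash(dim, w, rng):
    """One simple hash function h_i: project onto a random Gaussian
    direction a and quantize with bucket width w, floor((a . d) / w).
    This family choice is an assumption for illustration."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda d: int(sum(ai * di for ai, di in zip(a, d)) // w)

def make_lsh(dim, k, w=1.0, seed=0):
    """Draw k simple hash functions i.i.d. from the family and combine
    them by summation: h_phi(d) = sum_{i=1..k} h_i(d)."""
    rng = random.Random(seed)
    hs = [make_projection_hash(dim, w, rng) for _ in range(k)]
    return lambda d: sum(h(d) for h in hs)
```

For a fixed seed the resulting $h_\varphi$ is deterministic, so nearby vectors have a good chance of colliding while the key remains a single integer suitable for the hash table of Section 3.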