needs. However, if a user specifies his/her information need in the form of a document query, say, in a "more like this" manner, a hash-based index, µ_h, could be employed as index structure. The question is whether a similarity hash function h_ϕ that fulfills Property (1) can be constructed at all, which in turn depends on the admissible ε-interval demanded by the information need.
Remarks on Runtime.
The utilization of a hash-based index, µ_h, can be orders of magnitude faster compared to a classical inverted file index µ_i: To identify similar documents with µ_i, appropriate keywords must be extracted from the document query, a number k of term queries must be formed from these keywords, and the respective result sets D_q1, ..., D_qk must be compared to the document query. Figure 3 illustrates both this alternative strategy and the utilization of µ_h. If we assume an effort of O(1) for both inverted file index access and hash index access, the construction of the result set requires O(|D_q1| + ... + |D_qk|) with µ_i, and only O(1) with µ_h. Note that the actual runtime difference depends on the quality of the extracted keywords and the finesse when forming the term queries from them; it can assume dramatic proportions. This situation is aggravated if an application like plagiarism analysis requires the segmentation of the query document in order to realize a similarity search for one paragraph at a time [14].
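The two lookup strategies can be contrasted in a toy sketch; the index contents, document ids, and the fingerprint value are illustrative stand-ins, not part of the paper:

```python
# Toy comparison of the two strategies: k term queries against an
# inverted file index versus a single hash index lookup.
inverted_index = {             # term -> postings (set of document ids)
    "hash":  {1, 2, 3},
    "index": {2, 3, 4},
}
hash_index = {0xBEEF: {2, 3}}  # fingerprint -> bucket of document ids

def query_inverted(keywords):
    """Union of k postings lists: effort O(|D_q1| + ... + |D_qk|)."""
    result = set()
    for term in keywords:                        # one lookup per term query
        result |= inverted_index.get(term, set())
    return result                                # candidates, still unfiltered

def query_hash(fingerprint):
    """Single bucket lookup: effort O(1) plus the bucket size."""
    return hash_index.get(fingerprint, set())
```

The inverted-index result still has to be compared against the document query, whereas the hash bucket directly contains the near-duplicate candidates.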
2.3 Retrieval Task: Classification
Classification plays an important role in various text retrieval tasks: genre analysis, spam filtering, category assignment, or author identification, to mention only a few. Text classification is the supervised counterpart of text categorization and, for a small number of classes, it can be successfully addressed with Bayes, discriminant analysis, support vector machines, or neural networks. However, with an increasing number of classes the construction of a classifier with assured statistical properties is rendered nearly impossible.

With hash-based indexing an efficient and robust classifier can be constructed, even if the number of classes, C_1, ..., C_p, is high and if only a small or irregularly distributed number of training documents is given. The classification approach follows the k-nearest-neighbor principle and presumes that all documents of the classes C_i are stored in a hash index µ_h. Given a document d_q to be classified, its hash value is computed, and the documents D_q that are assigned to the same hash bucket are investigated with respect to their distribution in C_1, ..., C_p. d_q gets assigned that class C_j to which the majority of the documents of D_q belongs:

C_j = argmax_{i=1,...,p} |C_i ∩ D_q|
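This bucket-based majority vote can be sketched as follows; the class labels, fingerprints, and bucket contents are made-up toy data:

```python
from collections import Counter

# Toy training data: class label per document id, and the hash index
# mapping each fingerprint to its bucket D_q.
label = {1: "C1", 2: "C1", 3: "C2", 4: "C2", 5: "C2"}
hash_index = {0xA1: {1, 2, 3}, 0xB2: {4, 5}}

def classify(fingerprint):
    """Assign d_q the class C_j = argmax_i |C_i ∩ D_q|."""
    D_q = hash_index.get(fingerprint, set())
    if not D_q:
        return None                          # empty bucket: no decision
    votes = Counter(label[d] for d in D_q)   # distribution over C_1,...,C_p
    return votes.most_common(1)[0][0]        # majority class
```

Note that the classifier needs no training phase beyond indexing: inserting a labeled document into µ_h is the entire "learning" step.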
3. REALIZATION OF HASH-BASED INDEXING
Given a similarity hash function h_ϕ: D → U, a hash-based index µ_h can be directly constructed by means of a hash table T along with a standard hash function h: U → {1, ..., |T|}; h maps the universe of keys, U, onto the |T| storage positions. Indexing a set D of document representations means to compute for each d ∈ D its hash key h_ϕ(d) and to insert in T at position h(h_ϕ(d)) a pointer to d. This way T maintains for each value of h_ϕ a bucket D_q ⊂ D that fulfills the following condition:

d_1, d_2 ∈ D_q  ⇒  h_ϕ(d_1) = h_ϕ(d_2),

where d_1, d_2 denote the computer representations of the documents d_1, d_2. Based on T and h_ϕ a document query corresponds to a single hash table lookup. In particular, if h_ϕ fulfills Property (1), the bucket returned for d_q defines a set D_q whose elements are in the ε-neighborhood of d_q with respect to ϕ:

d ∈ D_q  ⇒  ϕ(d, d_q) ≥ 1 − ε
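The construction just described can be sketched in a few lines. The similarity hash `h_phi` below is a crude stand-in (token-count bucketing, chosen only so the sketch runs); any function fulfilling Property (1), such as those of Section 3, would be plugged in instead:

```python
T_SIZE = 97                       # number of storage positions |T|
T = [[] for _ in range(T_SIZE)]   # hash table T: buckets of pointers to d

def h_phi(d):
    """Stand-in similarity hash h_phi: D -> U (toy: token-count bucket)."""
    return len(d.split()) // 10

def h(key):
    """Standard hash function h: U -> {0, ..., |T|-1}."""
    return key % T_SIZE

def index(docs):
    """Insert each d at position h(h_phi(d))."""
    for d in docs:
        T[h(h_phi(d))].append(d)

def query(d_q):
    """A document query is a single hash table lookup."""
    return T[h(h_phi(d_q))]
```

All documents in one bucket share the same h_ϕ value, which is exactly the bucket condition stated above.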
The crucial part is the choice and the parameterization of a suitable similarity hash function h_ϕ for text documents. To our knowledge two approaches have been recently proposed, namely fuzzy-fingerprinting and locality-sensitive hashing. Both approaches can be directly applied to the vector space representation d of a document and are introduced now.
3.1 Fuzzy-Fingerprinting
Fuzzy-fingerprinting is a hashing approach specifically designed for, but not limited to, text-based information retrieval [13]. It is based on the definition of a small number of k, k ∈ [10, 100], prefix equivalence classes. A prefix class, for short, contains all terms starting with the same prefix. The computation of h_ϕ(d) happens in the following steps: (i) Computation of pf, a k-dimensional vector that quantifies the distribution of the index terms in d with respect to the prefix classes. (ii) Normalization of pf using a corpus that provides a representative cross-section of the source language, and computation of ∆pf = (δ_1, ..., δ_k)^T, the vector of deviations to the expected distribution.¹ (iii) Fuzzification of ∆pf by projecting the exact deviations according to diverse fuzzification schemes. Figure 4 illustrates the construction process.

Typically, three fuzzification schemes (= linguistic variables) are used, where each scheme differentiates between one and three deviation intervals. For a fuzzification scheme ρ with r deviation intervals, Equation 2 shows how a document's normalized deviation vector ∆pf is encoded:

h_ϕ^{(ρ)}(d) = Σ_{i=0}^{k−1} δ_i^{(ρ)} · r^i,   with δ_i^{(ρ)} ∈ {0, ..., r − 1}   (2)
δ_i^{(ρ)} is a document-specific value and represents the fuzzified deviation of δ_i ∈ ∆pf when applying fuzzification scheme ρ.

A prefix class is not necessarily limited to exactly one prefix. It is also possible to define a combined prefix class as the union of two or more others. There are 26^i possible prefix classes for the Latin alphabet, where i denotes the prefix length. Using a bin-packing heuristic the combined prefix classes can be constructed such that each class has the same expected frequency.

Document representations based on prefix classes can be regarded as abstractions of the vector space model. On the one hand they have a weaker performance when directly applied to a retrieval task such as grouping, similarity search, or classification. On the other hand they are orders of magnitude smaller and do not suffer from sparsity.
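Steps (i)–(iii) and Equation 2 can be sketched as follows. The prefix classes, the expected distribution, and the fuzzification thresholds below are invented toy values for one scheme ρ, not the parameterization of [13]:

```python
# Toy parameterization: k = 4 single-letter prefix classes and a uniform
# expected distribution (a real setup normalizes against a reference corpus).
PREFIXES = ["a", "b", "c", "d"]
EXPECTED = [0.25, 0.25, 0.25, 0.25]

def fingerprint(terms, r=3, thresholds=(-0.05, 0.05)):
    k = len(PREFIXES)
    # (i) distribution of the index terms over the prefix classes
    pf = [sum(t.startswith(p) for t in terms) / len(terms) for p in PREFIXES]
    # (ii) deviation from the expected distribution: delta_pf
    dpf = [pf[i] - EXPECTED[i] for i in range(k)]
    # (iii) fuzzify each deviation into one of r intervals, i.e. {0,...,r-1}
    fuzz = [sum(d > t for t in thresholds) for d in dpf]
    # Equation 2: encode the fuzzified vector as a radix-r number
    return sum(fuzz[i] * r**i for i in range(k))
```

The radix-r encoding guarantees distinct fuzzified deviation vectors map to distinct hash values, while documents whose deviations fall into the same intervals collide.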
3.2 Locality-Sensitive Hashing
Locality-sensitive hashing (LSH) is a generic framework for the randomized construction of hash functions [9]. Based on a family H_ϕ of simple hash functions h, h: D → U, a locality-sensitive hash function h_ϕ is a combination of k functions h ∈ H_ϕ obtained by an independent and identically distributed random choice. When using summation as combination rule, the construction of h_ϕ(d) is defined as follows:

h_ϕ(d) = Σ_{i=1}^{k} h_i(d),   with {h_1, ..., h_k} ⊂_rand H_ϕ
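A minimal sketch of this summation rule; the family H_ϕ is instantiated here with randomized projection functions (an assumed instantiation for illustration only, and the dimensionality, bucket width, and seed are arbitrary):

```python
import random

DIM, K = 8, 4                 # toy dimensionality and number of functions k
rng = random.Random(42)       # fixed seed: the random i.i.d. draw from H_phi

# Each h_i projects d onto a random direction and discretizes the result.
A = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(K)]
W = 1.0                       # bucket width of the discretization

def h_i(i, d):
    """One simple hash function h in H_phi."""
    return int(sum(a * x for a, x in zip(A[i], d)) // W)

def h_phi(d):
    """Combination by summation: h_phi(d) = sum_{i=1}^{k} h_i(d)."""
    return sum(h_i(i, d) for i in range(K))
```

Vectors that are close under the underlying similarity measure receive similar projections and therefore tend to fall into the same sum, while distant vectors tend not to.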
¹ In this paper the British National Corpus is used as reference, which is a 100-million-word collection of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English [1].