Many clustering algorithms were motivated by a particular problem domain. Accordingly, the requirements of each algorithm vary, including data representation, cluster model, similarity measure, and running time. Each of these requirements has, to a greater or lesser degree, a significant effect on the usability of an algorithm. Moreover, this variation makes it difficult to compare algorithms designed for different problem domains. The following section addresses some of these requirements.
2. Properties of Clustering Algorithms
Before we can analyze and compare different algorithms, we have to define some of the properties of such algorithms and find out which problem domains impose which kinds of properties. An analysis of different document clustering methods will be presented in section 3.
2.1. Data Model
Most clustering algorithms expect the data set to be clustered in the form of a set of vectors $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, where the vector $\mathbf{x}_i$, $i = 1, \ldots, n$, corresponds to a single object in the data set and is called the feature vector. Extracting the proper features to represent through the feature vector is highly dependent on the problem domain. The dimensionality of the feature vector is a crucial factor in the running time of the algorithm and hence its scalability. However, some problem domains by default impose a high dimension. There exist some methods to reduce the problem dimension, such as principal component analysis. Krishnapuram et al. [5] were able to reduce a 500-dimensional feature vector to 10 dimensions using this method; however, its validity was not justified. We now turn our focus to document data representation and how to extract the proper features.
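As an illustration of the kind of dimensionality reduction mentioned above, the following is a minimal sketch of principal component analysis computed through a singular value decomposition with NumPy. It is not the procedure used by Krishnapuram et al. [5]; the function name `pca_reduce`, the random data, and the choice of 200 objects are assumptions made purely for illustration, with the 500-to-10 reduction chosen only to mirror the sizes quoted in the text.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X (n objects x d features) onto the top-k principal components.

    Hypothetical helper for illustration; returns the reduced data,
    the k principal directions, and the feature means used for centering.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                      # center each feature at zero
    # Rows of Vt are orthonormal principal directions, ordered by
    # decreasing singular value (i.e. decreasing explained variance).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    Z = Xc @ components.T              # n x k reduced feature vectors
    return Z, components, mean

# Illustrative data only: 200 objects with 500-dimensional feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
Z, components, mean = pca_reduce(X, 10)
print(Z.shape)  # (200, 10)
```

Each row of `Z` is a 10-dimensional feature vector that could be passed to a clustering algorithm in place of the original 500-dimensional one. Whether the discarded directions carry information relevant to the cluster structure depends on the data and would have to be checked separately.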
Document Data Model
Most document clustering methods use the Vector Space model to represent document objects. Each document is represented by a vector $\mathbf{d}$ in the term space, such that $\mathbf{d} = \{tf_1, tf_2, \ldots, tf_n\}$, where $tf_i$, $i = 1, \ldots, n$, is the term frequency of the term $t_i$ in the document. To represent every document with the same set of terms, we have to extract all the terms found in the documents and use them as our feature vector¹. Sometimes another method is used which combines the term frequency with the inverse document frequency (TF-IDF). The document frequency $df_i$ is the number of documents in a collection of $N$ documents in which the term $t_i$ occurs. A typical inverse document frequency (idf) factor of this type is given by $\log(N/df_i)$. The weight of a term $t_i$ in a document is given by $w_i = tf_i \times \log(N/df_i)$ [13]. To keep the
¹ Obviously, the dimensionality of the feature vector is always very high, in the range of hundreds and sometimes thousands.