Web Mining: Clustering Web Documents
A Preliminary Review

Khaled M. Hammouda
Department of Systems Design Engineering
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
hammouda@pami.uwaterloo.ca

February 26, 2001

Evidently there is a tremendous proliferation in the amount of information found today on the largest shared information source, the World Wide Web (or simply the Web). The process of finding relevant information on the Web can be overwhelming. Even with the presence of today's search engines that index the Web, it is hard to wade through the large number of documents returned in response to a user query. This fact has led to the need to organize a large set of documents (the result of a user query, or simply a collection of documents) into categories through clustering. It is believed that grouping similar documents together into clusters will help users find relevant information more quickly, and will allow them to focus their search in the appropriate direction. The purpose of this review is to explore the clustering techniques in the data mining literature and to report on their appropriateness for clustering large sets of web documents. The review is by no means complete, but it covers the most representative approaches for clustering.
1. Background
The motivation behind clustering any set of data is to find inherent structure in the data, and expose this structure as a set of groups, where the data objects within each group should exhibit a large degree of similarity (known as intra-cluster similarity) while the similarity among different clusters should be minimized [9]. There is a multitude of clustering techniques in the literature, each adopting a certain strategy for detecting the grouping in the data. However, most of the reported methods have some common features [4]:
There is no explicit supervision effect.
Patterns are organized with respect to an optimization criterion.
They all adopt the notion of similarity or distance.

It should be noted that some algorithms, however, make use of labelled data to evaluate their clustering results, but not in the process of clustering itself (e.g. [12] and [13]).
 
Many of the clustering algorithms were motivated by a certain problem domain. Accordingly, the requirements of each algorithm vary, including data representation, cluster model, similarity measures, and running time. Each of these requirements has a significant effect on the usability of an algorithm. Moreover, this variation makes it difficult to compare algorithms designed for different problem domains. The following section addresses some of these requirements.
2. Properties of Clustering Algorithms
Before we can analyze and compare different algorithms, we have to define some of the properties of such algorithms, and find out which problem domains impose which kinds of properties. An analysis of different document clustering methods will be presented in section 3.
2.1. Data Model
Most clustering algorithms expect the data set to be clustered in the form of a set of vectors $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, where the vector $\mathbf{x}_i$, $i = 1, \ldots, n$, corresponds to a single object in the data set and is called the feature vector. Extracting the proper features to represent through the feature vector is highly dependent on the problem domain. The dimensionality of the feature vector is a crucial factor in the running time of the algorithm and hence its scalability. However, some problem domains by default impose a high dimension. There exist some methods to reduce the problem dimension, such as principal component analysis. Krishnapuram et al. [5] were able to reduce a 500-dimensional feature vector to 10 dimensions using this method; however, its validity was not justified. We now turn our focus to document data representation, and how to extract the proper features.
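To illustrate the kind of dimensionality reduction mentioned above, the following is a minimal Python sketch of principal component analysis based on power iteration. It is not the procedure used in [5], whose details are not given here. The function name `pca_project` and its parameters (`k`, `iterations`, `seed`) are assumptions made for this example, and the sketch assumes at least two objects and that `k` does not exceed the number of directions with nonzero variance.

```python
import math
import random


def pca_project(data, k, iterations=200, seed=0):
    """Project each row of `data` onto its top-k principal components.

    Hypothetical helper for illustration: it uses power iteration with
    re-orthogonalization instead of a full eigendecomposition.
    """
    n, d = len(data), len(data[0])

    # Center the data so that every feature has zero mean.
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]

    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    rng = random.Random(seed)
    components = []
    for _ in range(k):
        # Start from a random unit vector.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        length = math.sqrt(sum(vi * vi for vi in v))
        v = [vi / length for vi in v]
        for _ in range(iterations):
            # Multiply by the covariance matrix.
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            # Remove the directions already captured by earlier components.
            for c in components:
                dot = sum(wi * ci for wi, ci in zip(w, c))
                w = [wi - dot * ci for wi, ci in zip(w, c)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            if norm < 1e-12:
                break  # no variance left in the remaining directions
            v = [wi / norm for wi in w]
        components.append(v)

    # Coordinates of every centered object in the reduced space.
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]
```

With a hypothetical data set `X` of 500-dimensional feature vectors, `pca_project(X, 10)` would return one 10-dimensional vector per object. The pure-Python covariance computation is quadratic in the number of features, so this form is only suitable for small demonstrations.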
Document Data Model
Most document clustering methods use the Vector Space model to represent document objects. Each document is represented by a vector $\mathbf{d}$ in the term space, such that $\mathbf{d} = \{tf_1, tf_2, \ldots, tf_n\}$, where $tf_i$, $i = 1, \ldots, n$, is the term frequency of the term $t_i$ in the document. To represent every document with the same set of terms, we have to extract all the terms found in the documents and use them as our feature vector. (Obviously, the dimensionality of this feature vector is always very high, in the range of hundreds and sometimes thousands.) Sometimes another method is used which combines the term frequency with the inverse document frequency (TF-IDF). The document frequency $df_i$ is the number of documents in a collection of $N$ documents in which the term $t_i$ occurs. A typical inverse document frequency ($idf$) factor of this type is given by $\log(N/df_i)$. The weight of a term $t_i$ in a document is given by $w_i = tf_i \times \log(N/df_i)$ [13]. To keep the feature vector dimension reasonable, only the $n$ terms with the highest weights in all the documents are chosen as the $n$ features. Wong and Fu [13] showed that they could reduce the number of representative terms by choosing only the terms that have sufficient coverage¹ over the document set.

Some algorithms [6][13] refrain from using term frequencies (or term weights) by using a binary feature vector, where each term weight is either 1 or 0, depending on whether the term is present in the document or not, respectively. Wong and Fu [13] argued that the average term frequency in web documents is below 2 (based on statistical experiments), which does not indicate the actual importance of the term; thus a binary weighting scheme would be more suitable to this problem domain.

Before any feature extraction takes place, the document set is first cleaned by removing stop-words² and then applying a stemming algorithm that converts different word forms into a similar canonical form.

Another model for document representation is called N-gram. The N-gram model assumes that the document is a sequence of characters, and using a sliding window of size $n$ the character sequence is scanned, extracting all $n$-character sequences in the document. The N-gram approach is tolerant of minor spelling errors because of the redundancy introduced in the resulting n-grams. The model also achieves minor language independence when used with a stemming algorithm. Similarity in this approach is based on the number of shared n-grams between two documents.

Finally, a new model proposed by Zamir and Etzioni [2] is a phrase-based approach. The model finds common phrase suffixes between documents and builds a suffix tree where each node represents part of a phrase (a suffix node) and has associated with it the documents containing this phrase-suffix. The approach clearly captures the information of word proximity, which is thought to be valuable for finding similar documents.
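The vector-based representations discussed in this subsection can be sketched in a few lines of Python. This is a minimal sketch, assuming the documents have already been cleaned and tokenized into lists of terms. All function names are hypothetical, the TF-IDF weight follows the formula $tf_i \times \log(N/df_i)$ given in the text, and the n-gram part omits stemming.

```python
import math
from collections import Counter


def term_frequencies(docs):
    """Return the shared vocabulary and one term-frequency vector per document."""
    vocab = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(tf_vectors):
    """Weight every term frequency by log(N / df_i)."""
    n_docs = len(tf_vectors)
    n_terms = len(tf_vectors[0])
    # df[i] = number of documents in which term i occurs (never zero here,
    # because the vocabulary is built from the documents themselves).
    df = [sum(1 for vec in tf_vectors if vec[i] > 0) for i in range(n_terms)]
    return [[tf * math.log(n_docs / df[i]) for i, tf in enumerate(vec)]
            for vec in tf_vectors]


def binary_vectors(tf_vectors):
    """Binary scheme: 1 if the term is present in the document, 0 otherwise."""
    return [[1 if tf > 0 else 0 for tf in vec] for vec in tf_vectors]


def char_ngrams(text, n):
    """Set of all n-character sequences seen by a sliding window over text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_ngrams(a, b, n=3):
    """Number of distinct n-grams that two character sequences have in common."""
    return len(char_ngrams(a, n) & char_ngrams(b, n))


if __name__ == "__main__":
    docs = [["web", "mining", "web"],
            ["web", "clustering"],
            ["document", "clustering"]]
    vocab, tf = term_frequencies(docs)
    print(vocab)
    print(tf)
    print(tfidf_vectors(tf))
    print(binary_vectors(tf))
    print(shared_ngrams("cluster", "clustre"))
```

In this toy collection, a term that occurs in every document would receive a weight of zero, since $\log(N/N) = 0$. The misspelled string "clustre" still shares three of its five character 3-grams with "cluster", which illustrates the tolerance to minor spelling errors noted above.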
Numerical Data Model
A more straightforward model of data is the numerical model. Based on the problem context, a number of features are extracted, where each feature is represented as an interval of numbers. The feature vector is usually of reasonable dimensionality, yet it depends on the problem being analyzed. The feature intervals are usually normalized so that each feature has the same effect when calculating distance measures. Similarity in this case is straightforward, since the distance calculation between two vectors is usually trivial [15].
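As a small illustration of why normalization matters for distance calculations, the following Python sketch rescales every feature to the interval [0, 1] and then computes a Euclidean distance. Min-max scaling and the Euclidean metric are assumptions chosen for this example; the text does not prescribe a particular normalization or distance measure, and the function names are hypothetical.

```python
import math


def min_max_normalize(vectors):
    """Rescale every feature (column) to the interval [0, 1]."""
    d = len(vectors[0])
    lows = [min(vec[j] for vec in vectors) for j in range(d)]
    highs = [max(vec[j] for vec in vectors) for j in range(d)]
    normalized = []
    for vec in vectors:
        row = []
        for j in range(d):
            span = highs[j] - lows[j]
            # A constant feature carries no information; map it to 0.
            row.append((vec[j] - lows[j]) / span if span > 0 else 0.0)
        normalized.append(row)
    return normalized


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # The second feature has a much larger range than the first.
    raw = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
    scaled = min_max_normalize(raw)
    print(euclidean_distance(raw[0], raw[1]))        # dominated by feature 2
    print(euclidean_distance(scaled[0], scaled[1]))  # both features contribute
```

Without the rescaling step, the feature with the widest numeric range dominates the distance, which is the effect the normalization is meant to remove.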
¹ The coverage of a feature is defined as the percentage of documents containing that feature.
² Stop-words are very common words that have no significance for capturing relevant information about a document (such as “the”, “and”, “a”, etc.).