
EFFECTIVE TERM BASED TEXT CLUSTERING ALGORITHMS

NIBAS P.P

EPAHECS033
Government Engineering College
Sreekrishnapuram
Palakkad

November 25, 2010


CONTENTS

INTRODUCTION
REQUIREMENT OF INFORMATION RETRIEVAL
DOCUMENT PREPROCESSING
TEXT CLUSTERING ATTRIBUTES SELECTION
PROBLEM DEFINITION
FTC (Frequent Term-based Clustering)
CLUSTERING ALGORITHMS
APPLICATION
CONCLUSION
REFERENCE
INTRODUCTION

In every industry, almost all paper documents have
electronic copies.
This is because electronic format provides:
a) safer storage
b) smaller size
c) quick access to documents
Text clustering methods can be used to group large sets of
text documents.
Document clustering is the automatic organization of
documents into clusters or groups. Grouping is based on
the principle of maximizing intra-cluster similarity and
minimizing inter-cluster similarity.
REQUIREMENT OF INFORMATION RETRIEVAL

To improve the results of information retrieval through document
clustering, the requirements are stated as follows:
The document model should preserve the sequential relationship
between words in the document.
Associating a meaningful label with each final cluster is
essential.
Overlapping between documents should be allowed.
The high dimensionality of text documents should be reduced.
DOCUMENT PREPROCESSING

All text clustering methods require several steps of
data preprocessing.
Non-textual information such as HTML tags and punctuation
is removed from the documents.
The context of a document is mostly represented by its
nouns.
Contd...

Based on this, the following assumptions were made to achieve
document dimension reduction:
Elimination of words with fewer than 3 characters.
Elimination of general words.
Elimination of adverbs and adjectives.
Elimination of verbs.
To achieve frequent term generation:
For a small document, each line is treated as a record.
For a large document, each paragraph is treated as a record.
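
The filtering and record-splitting steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names preprocess, split_records and GENERAL_WORDS are hypothetical, the stop-word list is a tiny stand-in for the "general words", and the elimination of adverbs, adjectives and verbs is left out because it needs a part-of-speech tagger.

```python
import re

# Hypothetical stand-in for the "general words" list; a real list is far longer.
GENERAL_WORDS = {"the", "and", "for", "are", "this", "that", "with", "from"}

def preprocess(record: str) -> list[str]:
    """Return the filtered terms of one record (a line or a paragraph)."""
    text = re.sub(r"<[^>]+>", " ", record)        # strip HTML tags
    words = re.findall(r"[a-z]+", text.lower())   # drops punctuation and digits
    # Keep words of 3 or more characters that are not general words.
    return [w for w in words if len(w) >= 3 and w not in GENERAL_WORDS]

def split_records(document: str, large: bool) -> list[str]:
    """Small documents: one record per line; large ones: one per paragraph."""
    parts = re.split(r"\n\s*\n", document) if large else document.splitlines()
    return [p for p in parts if p.strip()]

terms = preprocess("<p>The cat is on a mat, purring!</p>")
print(terms)  # ['cat', 'mat', 'purring']
```

How large a document must be before paragraphs replace lines as records is not stated on the slide, so the sketch leaves that choice to the caller through the large flag.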
TEXT CLUSTERING ATTRIBUTES SELECTION

Text clustering is performed in two stages:
Frequent term set generation.
Grouping of frequent term documents.
Frequent term set generation is characterised by the attribute
minimum support threshold.
Grouping of frequent term documents is characterised by the
attribute matching threshold.
Contd...

Minimum Support Threshold


The document database is reduced based on a value called the
minimum support threshold.
If the minimum support threshold is low, the dimension
reduction is small. To get a larger reduction in size, the
minimum support should be high.
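
A small sketch of how the minimum support threshold prunes terms. It assumes support is expressed as a fraction of the records and, for brevity, handles single terms rather than term sets; the function name frequent_terms and the sample records are hypothetical.

```python
from collections import Counter

def frequent_terms(records: list[list[str]], min_support: float) -> set[str]:
    """Terms that occur in at least min_support (a fraction) of the records."""
    # Count each term once per record, however often it repeats inside it.
    counts = Counter(term for rec in records for term in set(rec))
    needed = min_support * len(records)
    return {term for term, n in counts.items() if n >= needed}

records = [["cluster", "text"], ["cluster", "term"], ["cluster", "text", "web"]]
low = frequent_terms(records, 0.3)    # low threshold: every term survives
high = frequent_terms(records, 0.6)   # high threshold: only terms in 2+ records
print(sorted(low))   # ['cluster', 'term', 'text', 'web']
print(sorted(high))  # ['cluster', 'text']
```

Raising the threshold from 0.3 to 0.6 halves the vocabulary here, which is the reduction effect the slide describes.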
Matching Threshold
Documents are grouped by finding the match of frequent terms
between them, which is compared against a value called the
matching threshold.
Matching is the ratio of the number of common terms between
documents to the total number of terms.
For a low matching threshold, more documents are grouped
together; for a high matching threshold, fewer documents are
grouped.
PROBLEM DEFINITION

Let D = {d1, d2, d3, ..., dn} be the set of text documents.
Let T be the set of all terms occurring in the documents of D.
Let d1 = {t11, t12, ..., t1m} and d2 = {t21, t22, ..., t2m} be
the sets of frequent terms in documents d1 and d2.
Let F = {f1, f2, ..., fk} be the set of all frequent term sets in D
with respect to min-support, where min-support is a real
number.
The cover of each element fi of F can be regarded as a cluster.
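
As an illustration of the cover, the sketch below assumes the usual FTC reading: the cover of a frequent term set is the set of documents that contain every term in it. The function name cover and the three sample documents are hypothetical.

```python
def cover(term_set: frozenset[str], documents: dict[str, set[str]]) -> set[str]:
    """Names of the documents that contain every term of term_set."""
    return {name for name, terms in documents.items() if term_set <= terms}

docs = {
    "d1": {"cluster", "text", "term"},
    "d2": {"cluster", "web"},
    "d3": {"text", "term"},
}
first = cover(frozenset({"text", "term"}), docs)
second = cover(frozenset({"cluster"}), docs)
print(sorted(first))   # ['d1', 'd3']
print(sorted(second))  # ['d1', 'd2']
```

Document d1 appears in both covers, which shows why candidate clusters built this way can overlap.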
Contd...

Let the clustering of D into m sets be defined as R = {C1, C2,
C3, ..., Cm} such that each cluster Ci contains at least one
document: Ci ≠ NULL, i = 1, ..., m.
FTC(Frequent Term-based Clustering)

Text clustering faces problems such as:
Very high dimensionality of the data.
Understandability of the clustering descriptions.
A frequent term-based approach to clustering has been
introduced to address them.
Frequent Term based Clustering (FTC) is a text clustering
technique which uses frequent term sets and dramatically
decreases the dimensionality of the document vector space.
CLUSTERING ALGORITHMS

Algorithms for effective text clustering are:
1. Min-match Cluster Algorithm
2. Max-match cluster algorithm
3. Min-Max match cluster algorithm
Min-match Cluster Algorithm

Let A and B be two frequent term sets of documents d1 and
d2, represented as vectors.
Matching is denoted min(Vm) and defined as the ratio of the
number of common elements between vectors A and B to the
number of elements in the smaller of the two sets:
min(Vm) = |A ∩ B| / min(|A|, |B|)

Example
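
A small worked case of the min-match measure, written as a sketch under assumptions: the term sets A and B are made up for illustration, both sets are assumed non-empty, and the result is expressed as a percentage, as in step 9 of the algorithm.

```python
def min_match(a: set[str], b: set[str]) -> float:
    """Common terms as a percentage of the smaller set (sets must be non-empty)."""
    return len(a & b) * 100 / min(len(a), len(b))

A = {"cluster", "text", "term", "web"}
B = {"cluster", "text"}
print(min_match(A, B))  # 100.0
```

Both terms of B occur in A, so the smaller set is fully matched even though A has two extra terms.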
Algorithm

D: Document database
FTL: frequent term list
CL: Cluster list
FT: frequent terms
Min-Cluster(CL,FTL,D)
1. For each FT i in FTL do
2. t1 = ith index frequent terms
3. Initialise high percent matching = -1 and cluster index= -1
4. For each FT j in FTL do
5. if (i ≠ j) then t2 = jth index frequent terms
6. if (t1.length < t2.length) then total terms = t1.length
7. Else total terms=t2.length End if
8. match= Calculate matching terms between vector i and j using
Binary Search
9. matching percent = match * 100 / total terms
10. if (matching percent > matching threshold) and
(high percent matching < matching percent) then
high percent matching = matching percent and cluster index = j
11. End if
12. End if
13. Next loop (j)
14. if (cluster index ≠ -1) then
15. Add frequent term list(cluster index) to frequent term list(i)
16. Add Cluster list(cluster index) to Cluster list(i)
17. Remove Cluster list(cluster index)from Cluster list
18. Remove frequent term list(cluster index)from
frequent term list
19. End if
20. Next loop (i)
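
The steps of Min-Cluster can be sketched in Python as below. This is a hedged reading of the pseudocode, not the authors' code: set intersection replaces the binary search of step 8, term sets are assumed non-empty, the index is adjusted by hand because entries are deleted during the scan, and the names min_cluster and min_match_percent are hypothetical.

```python
def min_match_percent(t1: set[str], t2: set[str]) -> float:
    """Steps 6-9: common terms as a percentage of the smaller term set."""
    return len(t1 & t2) * 100 / min(len(t1), len(t2))

def min_cluster(cluster_list, term_list, threshold):
    """Merge each entry with its best match above threshold (steps 1-20)."""
    clusters = [set(c) for c in cluster_list]   # work on copies
    terms = [set(t) for t in term_list]
    i = 0
    while i < len(terms):
        best_percent, best_index = -1.0, -1      # step 3
        for j in range(len(terms)):              # steps 4-13
            if j == i:
                continue
            percent = min_match_percent(terms[i], terms[j])
            if percent > threshold and percent > best_percent:
                best_percent, best_index = percent, j
        if best_index != -1:                     # steps 14-19
            terms[i] |= terms[best_index]        # step 15
            clusters[i] |= clusters[best_index]  # step 16
            del clusters[best_index]             # step 17
            del terms[best_index]                # step 18
            if best_index < i:
                i -= 1  # the current entry moved one position left
        i += 1
    return clusters, terms

term_sets = [{"cluster", "text"}, {"cluster", "text", "term"}, {"web", "search"}]
file_sets = [{"d1"}, {"d2"}, {"d3"}]
clusters, terms = min_cluster(file_sets, term_sets, threshold=50)
print(clusters)  # d1 and d2 end up together, d3 stays alone
```

With a threshold of 50, the first two term sets match at 100 percent and are merged, while the third shares no terms with either and remains its own cluster.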
Contd...

In this algorithm, step 2 selects a vector as the comparable vector.
Steps 5 to 7 find the smaller of the two input vectors
selected in steps 2 and 5 and assign its length as the
minimum vector count.
In step 8, the matching terms between the two vectors are
calculated using binary search.
In step 9, the matching percentage between the vectors is
calculated using the minimum vector count.
In step 10, the vector with the highest match so far is
selected and the value of the highest match vector is
updated.
Steps 5 to 11 are repeated until the comparable vector has
been compared with all the remaining vectors.
Contd...

In steps 15 and 16, if the highest match vector is found, then:
a) Its frequent terms are added to the terms of the comparable
vector selected in step 2 (step 15).
b) The highest match cluster is added to the comparable cluster
(step 16).
In steps 17 and 18:
a) The highest match cluster is removed from the cluster list
(step 17).
b) The highest match cluster terms are removed from the frequent
term list (step 18).
Max-match cluster algorithm

Let A and B be two frequent term sets of documents d1 and
d2, represented as vectors.
Matching is denoted max(Vm) and defined as the ratio of the
number of common elements between vectors A and B to the
number of elements in the larger of the two sets:
max(Vm) = |A ∩ B| / max(|A|, |B|)

Example
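
A small worked case of the max-match measure, again as a sketch: the term sets A and B are made up for illustration, both are assumed non-empty, and the value is a percentage as in step 9.

```python
def max_match(a: set[str], b: set[str]) -> float:
    """Common terms as a percentage of the larger set (sets must be non-empty)."""
    return len(a & b) * 100 / max(len(a), len(b))

A = {"cluster", "text", "term", "web"}
B = {"cluster", "text"}
print(max_match(A, B))  # 50.0
```

Dividing by the larger set makes the measure stricter: two common terms out of four give 50 percent, so a pair of very different sizes needs more shared terms to pass the same matching threshold.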
Algorithm

D: document database
FTL: frequent term list
CL: Cluster list
FT: frequent terms
Max-Cluster(CL,FTL,D)
1. For each FT i in FTL do
2. t1 = ith index frequent words
3. Initialise high percent matching = -1 and cluster index= -1
4. For each FT j in FTL do
5. if (i ≠ j) then t2 = jth index frequent words
6. if (t1.length<t2.length) then total terms = t2.length
7. Else total terms=t1.length
End if
8. match= Calculate matching terms between vector i and j using
Binary Search
9. matching percent = match * 100 / total terms
10. if (matching percent>matching threshold) and
(high percent matching< matching percent) then
high percent matching = matching percent and cluster index = j
11. End if
12. End if
13. Next loop (j)
14. if (cluster index ≠ -1) then
15. Add frequent term list(cluster index) to frequent term list(i)
16. Add Cluster list(cluster index) to Cluster list(i)
17. Remove Cluster list(cluster index)from Cluster list
18. Remove frequent term list(cluster index)from
frequent term list
19. End if
20. Next loop (i)
Contd..

The only difference here is that the maximum vector count of
the two input vectors is used instead of the minimum.
The rest of the steps are the same as in the previous
algorithm.
Min-Max match cluster algorithm

The matching is denoted by min-max(Vm) and is defined as the
ratio of twice the number of matching terms to the total
number of elements in the two sets. For term sets A and B:
min-max(Vm) = 2 × |A ∩ B| / (|A| + |B|)

Example
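
A small worked case of the min-max measure. The sketch assumes that "the number of elements of two sets" means the sum of the two set sizes, so that the value cannot exceed 100 percent; the term sets A and B are made up for illustration.

```python
def min_max_match(a: set[str], b: set[str]) -> float:
    """Twice the common terms as a percentage of the combined size of both sets."""
    return len(a & b) * 2 * 100 / (len(a) + len(b))

A = {"cluster", "text", "term", "web"}
B = {"cluster", "text"}
print(round(min_max_match(A, B), 1))  # 66.7
```

For the same pair of sets this value lies between the one obtained by dividing by the smaller set and the one obtained by dividing by the larger set, which is what the name min-max suggests.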
Algorithm

D: document database
FTL: frequent term list (set contains set of Frequent Terms)
CL: Cluster list (set contains set of Input Files Names)
FT: frequent terms
t1, t2: Frequent Term Set
Min-MaxCluster (CL,FTL,D)
1. For each FT i in FTL do
2. t1 = ith index frequent words
3. Initialise high percent matching = -1 and cluster index= -1
4. For each FT j in FTL do
5. if (i ≠ j) then t2 = jth index frequent words
6. t3 = ith FTL UNION jth FTL
7. total terms = t3.length
8. match= Calculate matching terms between vector i and j using
Binary Search
9. matching percent = match * 2* 100 / total terms
10. if (matching percent> matching threshold) and
(high percent matching< matching percent) then
high percent matching = matching percent and cluster index = j
11. End if
12. End if
13. Next loop (j)
14. if (cluster index ≠ -1) then
15. Add frequent term list(cluster index) to frequent term list(i)
16. Add Cluster list(cluster index) to Cluster list(i)
17. Remove Cluster list(cluster index) from Cluster list
18. Remove frequent term list(cluster index) from
frequent term list
19. End if
20. Next loop (i)
Contd...

The first difference here is that the total number of items
present in both sets is used as the denominator.
The other main difference is that the numerator is multiplied
by the number of vectors being compared, which is 2.
APPLICATION

Document clustering has wide application in areas such as:
web mining
It is the process of discovering patterns from the web.
search engine
It is designed to search for information on the World Wide Web.
information retrieval
It is the science of searching for documents, for information
within documents, and for metadata about documents, as well as
searching relational databases and the World Wide Web.
CONCLUSION

For effective text clustering, three new clustering algorithms
were proposed.
All three algorithms are compared with the standard FTC
algorithm to show their competency.
The three developed algorithms achieve better cluster quality
than the FTC algorithm.
References

1 Ponmuthuramalingam P. et al., "Effective Term Based Text
Clustering Algorithms", (IJCSE) Vol. 02, No. 05, 2010,
1665-1673.
2 Beil F., Ester M. and Xu X., "Frequent Term-based Text
Clustering", Proceedings of the 8th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 2002,
436-442.
3 Dubes R.C. and Jain A.K., "Algorithms for Clustering
Data", Prentice Hall, Englewood Cliffs, N.J., U.S.A., 1988.
4 Fung B.C.M., Wang K. and Ester M., "Hierarchical Document
Clustering using Frequent Item sets", Proceedings of SIAM
International Conference on Data Mining, 2003, 180-304.
THANK YOU.
