
Improved Document Clustering using K-means Algorithm

Pramod Bide
Dept. of Computer Engg, Ramrao Adik Institute of Technology, Navi Mumbai, India
pramod.bide@gmail.com

Rajashree Shedge
Dept. of Computer Engg, Ramrao Adik Institute of Technology, Navi Mumbai, India
rajashree.shedge@rait.ac.in

Abstract - Searching for similar documents plays a crucial role in document management. Because of the tremendous day-by-day increase in documents, it is essential to segregate these documents into proper clusters. Faster categorization of documents is required in forensic investigation, but analysis of these documents is difficult. So, there is a need to separate multiple collections of documents into similar ones through clustering. Specifying the number of clusters is mandatory in existing partitioning algorithms, and the output is totally dependent on the given input. Over clustering is the major problem in document clustering. The proposed algorithm takes as input the keywords found after extraction and solves the problem of over clustering by dividing the documents into small groups using a Divide and Conquer strategy. In this paper, an Improved Document Clustering algorithm is given which generates the number of clusters for any text documents and uses cosine similarity measures to place similar documents in proper clusters. Experimental results showed that the accuracy of the proposed algorithm is high compared to the existing algorithm in terms of F-Measure and time complexity.

Keywords - Document Clustering; Divide and Conquer; Cosine Similarity; Tf-Idf; Threshold

I. INTRODUCTION

There is an exponential increase in the amount of digital data and the number of text documents. Thus, it is very difficult to organize these large collections of text documents in an effective way, and locating interesting information or patterns [1] has become a vital task. Document clustering is a widely used method to speed up the search for similar documents. Clustering organizes a large set of documents into a number of similar clusters [2], so that the documents in the same cluster are more similar to one another than to documents in another cluster. The vector space model [3] is used by our document clustering technique and is explained in detail in section III. Preprocessing is done to convert words to their base form and to remove stop words and duplicate words before applying the vector space model to the text documents. The distances between documents are measured using similarity measures like Cosine, Jaccard [4], etc. Then the clustering algorithms are applied until the required number of clusters is formed. There are two common families of clustering algorithms: partitioning algorithms, in which clusters are computed directly by iteratively swapping objects or groups of objects between the clusters, and hierarchical algorithms, in which a hierarchy of clusters is built and every cluster is subdivided into child clusters which form a partition of their parent cluster. Different clustering algorithms [5] produce different results with different features. Hierarchical algorithms are slower than partitioning algorithms, but they give better accuracy. Very little time is available for performing examinations [6]. Experimental results showed that the proposed algorithm takes less time for clustering compared to the existing K-means algorithm, and its F-measure (F1) score is also much higher than that of the existing algorithm.

The remainder of this paper is organized as follows. Section II presents related work. Section III explains the details of the proposed system. Section IV analyzes the experimental results. The conclusion and future work are given in Section V.

II. RELATED WORK

K-means clustering is a partitioning clustering algorithm: it partitions the given data into K clusters. Several other clustering algorithms have been proposed for the document clustering task. A novel algorithm [9] for automatic clustering showed how clustering can be done automatically; an improved partitioning K-means algorithm [10] presented a new method for initializing centroids; an ontology based K-means algorithm [11] showed how ontological domains can be used in clustering documents; and in [12] datasets were tested to analyze the efficiency of the K-means algorithm when crime documents are given as input.

The methods discussed above [9], [10], [11], [12] used the 20 Newsgroups, Reuters-21578, and real-time data sets. All of these algorithms use the cosine similarity measure to determine similarity. The granularity (i.e. the K value) must be specified for every algorithm except the first one, because the first algorithm is automatic: it considers document categories for automatic clustering. Likewise, initial conditioning is required in all the algorithms except the first. Zero clustering is the major advantage of the first algorithm, i.e. documents with zero value in the similarity matrix also get a cluster. All the algorithms are adaptable to dynamic data except the second, because the second algorithm calculates an average document similarity matrix. After this analysis we conclude that these algorithms share common limitations, namely over clustering and a mandatory K value, so a new algorithm is developed to overcome these limitations.

III. SYSTEM ARCHITECTURE

The traditional K-means algorithm works well for a few documents, but when the number of documents increases it fails to cluster them properly. It is also not automatic: the number of clusters needs to be specified in advance. To resolve these issues we started with an improved document clustering algorithm based on K-means and proceeded with the algorithm explained in section G. The main features of the proposed algorithm are that it is capable of handling large document collections, since it is a partition based algorithm, and that it is automatic. The proposed algorithm considers feature vectors for automatically clustering the documents present in a corpus.

The flow of our proposed system is shown in Fig. 1 below. Text documents are given as input. A divide and conquer merge sort strategy is applied to the documents in the corpus to partition them into small sets of documents. Preprocessing is done on all documents to remove non-informative words, e.g. "is", "and", "if", and all stop words. The feature vector is extracted from the preprocessed dataset by taking the keywords (i.e. features) with the highest frequency. The vector space model (VSM) computes the term frequency and count of each word in the documents. The final vector space model is a numerical matrix representation of the text data, which is then normalized (NVSM). The cosine similarity matrix is calculated using the NVSM. These matrices are then used as input to K-means and to the proposed improved document clustering algorithm, and clustering is performed. Finally, the results are compared on different parameters such as F-measure and time complexity.

Fig. 1: System Architecture (pipeline: Text Documents → Partitioning Documents (Divide and Conquer Strategy) → Preprocessing → Feature Extraction → Vector Space Model → Calculate Cosine Similarity → Clustering using K-means Algorithm → Similar Clusters)
A. Divide and Conquer Strategy

The input corpus (i.e. the text documents) is divided into small sets of documents (i.e. partitions), because the K-means algorithm does not work well when the number of documents is large. We therefore use a divide and conquer merge sort strategy: the divide step splits the documents into small sets, and the conquer step merges the output of each partition according to the cluster labels.

B. Preprocessing

In information retrieval, preprocessing is done before applying the vector space model to the text documents. The preprocessing steps take a plain text document as input and output a set of tokens to be included in the vector model. There are 5 steps of preprocessing, sketched in the code below:

1) Filtering: Punctuation marks and special characters are removed from the plain text document.
2) Tokenization: Sentences are split into individual words or tokens.
3) Stop word removal: Words (e.g. "and", "the", etc.) which do not convey any meaning as a dimension in the vector space are removed.
4) Stemming: Words are reduced to their base form. For example, the words "process" and "processing" are reduced to the stem "process" using Porter's algorithm.
5) Pruning: Very low frequency words are removed.
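A minimal Python sketch of these five steps follows. The stop word list and the suffix-stripping stemmer are illustrative stand-ins, not the paper's implementation; a real system would use a full stop word list and Porter's algorithm (e.g. via NLTK).

    import re
    from collections import Counter

    # Illustrative stop word list; a real system would use a full list.
    STOP_WORDS = {"a", "an", "and", "if", "in", "is", "it", "of", "or", "the", "to"}

    def crude_stem(token):
        # Stand-in for Porter's algorithm: strip a few common suffixes so
        # that e.g. "processing" reduces to "process".
        for suffix in ("ing", "ed", "es"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[:-len(suffix)]
        if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
            return token[:-1]
        return token

    def preprocess(text, min_freq=2):
        # 1) Filtering: remove punctuation marks and special characters.
        text = re.sub(r"[^A-Za-z\s]", " ", text)
        # 2) Tokenization: split sentences into individual words.
        tokens = text.lower().split()
        # 3) Stop word removal.
        tokens = [t for t in tokens if t not in STOP_WORDS]
        # 4) Stemming: reduce words to their base form.
        tokens = [crude_stem(t) for t in tokens]
        # 5) Pruning: drop very low frequency words.
        counts = Counter(tokens)
        return [t for t in tokens if counts[t] >= min_freq]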
C. Feature Vector

A small set of keywords extracted from the dataset is known as the feature vector. These keywords are required to create the vector space model. We have used a frequency based method to extract the feature vector: tokens with frequency > 5 are selected from all the documents and the keyword list (i.e. the feature vector) is prepared. For the 20 Newsgroups dataset, we obtained a total of 1172 features from three categories. A minimal sketch of this selection follows.
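The function name and the tokenized-document input below are our assumptions, not the paper's code; the threshold is simply applied to corpus-wide token counts.

    from collections import Counter

    def extract_feature_vector(tokenized_docs, min_freq=5):
        # Count token frequency across the whole corpus and keep the
        # keywords that occur more than min_freq times.
        counts = Counter()
        for tokens in tokenized_docs:
            counts.update(tokens)
        return sorted(t for t, c in counts.items() if c > min_freq)

Applied to the preprocessed documents of the selected categories, such a threshold yields the keyword list used as the feature vector.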
D. Vector Space Model

Another name for the VSM is the Term Frequency Inverse Document Frequency model, i.e. the TF-IDF model. It is the standard retrieval technique used in text mining. Each document is represented as an n-dimensional vector using the feature vector, and the value of each element in the vector reflects the importance of the corresponding feature in the document. The similarity between documents can then be measured by calculating the distance between document vectors: if two documents contain the same keywords, they are similar. We therefore use the term frequency tf(i, j), i.e. the number of times a term i occurs in document j, normalized with respect to the maximal frequency of all terms occurring in the document:

    tf(i, j) = freq(i, j) / max{ freq(x, j) : x ∈ j }    (3.1)

where i is a term or keyword in document j, and x is the term with maximum frequency.
We should also consider how many times a term occurs in the document corpus. The document frequency dfi of a term is the number of documents in which term i occurs. If D is the size of the corpus (i.e. the total number of documents in the database), then the inverse document frequency is given by

    idf(i) = log(D / dfi)    (3.2)

The weight Wi of a term is calculated as

    Wi = tfi * log(D / dfi)    (3.3)

The weight of a term is normalized with respect to the maximum weight of all terms present in a document. Cosine similarity is used as the similarity measure for vector space retrieval. If two vectors match completely, their similarity should equal 1; if two vectors have no keywords in common, their similarity should equal 0; and if some part of a vector matches, the similarity should lie between 0 and 1. The cosine similarity between two vectors is calculated as

    sim(Di, Dj) = (Di · Dj) / (|Di| |Dj|)    (3.4)

Finally, the cosine similarity matrix (CS) is computed for the sets of documents given as input, and the output is the document numbers with their cluster labels.
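The sketch below ties Eqs. (3.1)-(3.4) together; the function names and the plain-list matrix representation are illustrative choices, not the paper's implementation.

    import math

    def tfidf_matrix(tokenized_docs, features):
        # df_i: number of documents containing term i, for Eq. (3.2).
        D = len(tokenized_docs)
        df = {f: sum(1 for doc in tokenized_docs if f in doc) for f in features}
        matrix = []
        for doc in tokenized_docs:
            counts = {f: doc.count(f) for f in features}
            max_freq = max(counts.values(), default=0) or 1
            row = []
            for f in features:
                tf = counts[f] / max_freq                      # Eq. (3.1)
                idf = math.log(D / df[f]) if df[f] else 0.0    # Eq. (3.2)
                row.append(tf * idf)                           # Eq. (3.3)
            matrix.append(row)
        return matrix

    def cosine_sim(u, v):
        # Eq. (3.4): sim(Di, Dj) = (Di . Dj) / (|Di| |Dj|).
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0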
E. F-measure

The F-measure is a combination of the precision (P) and recall (R) measures and is used to compare how similar two clusters are. It is given by

    F-Measure = 2PR / (P + R)    (3.5)

where

    Precision(P) = |relevant documents ∩ retrieved documents| / |retrieved documents|

    Recall(R) = |relevant documents ∩ retrieved documents| / |relevant documents|

A direct translation of these definitions is sketched below.
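This is a minimal sketch assuming the relevant and retrieved documents are given as sets of document ids.

    def f_measure(relevant, retrieved):
        # Eq. (3.5): harmonic mean of precision and recall.
        hits = len(relevant & retrieved)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)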

F. K-means Algorithm using cosine distance

If a vector of documents (D1, D2, ..., Dn) is given, the K-means clustering algorithm partitions the n documents into K clusters (K ≤ n) such that the cosine distance within clusters is minimal:

1. Randomly select initial centroids that divide the documents into K clusters.
2. Compute the cosine distance of each document from the centroid of each cluster, and assign the document to the cluster with the closest centroid.
3. Repeat step 2 until there is no change in the newly formed clusters.

A minimal sketch of these steps follows.
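The sketch reuses the cosine_sim helper from section D; random initialization and the mean-centroid update are standard K-means choices rather than details the paper specifies.

    import random

    def kmeans_cosine(vectors, k, max_iter=100):
        # Step 1: randomly select k initial centroids.
        centroids = random.sample(vectors, k)
        labels = [0] * len(vectors)
        for _ in range(max_iter):
            # Step 2: assign each document to the most similar centroid
            # (maximum cosine similarity = minimum cosine distance).
            new_labels = [max(range(k), key=lambda c: cosine_sim(v, centroids[c]))
                          for v in vectors]
            # Step 3: stop when no assignment changes.
            if new_labels == labels:
                break
            labels = new_labels
            # Recompute each centroid as the mean of its cluster's vectors.
            for c in range(k):
                members = [v for v, l in zip(vectors, labels) if l == c]
                if members:
                    centroids[c] = [sum(col) / len(members)
                                    for col in zip(*members)]
        return labels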
G. The Proposed Algorithm

While implementing K-means we faced the problems described below. If we give the 20 Newsgroups dataset as input to K-means along with a K value, it will cluster the documents into K clusters; the K value is therefore mandatory. K-means also takes more time to cluster the documents, and over clustering is its major problem: unrelated documents get placed with related documents unnecessarily. Another problem is that K-means works well for a few documents, but when the number of documents increases it fails to cluster them.

To resolve these issues we start with a divide and conquer strategy to handle large document collections, and we take as input a set of keywords, i.e. terms (e.g. term1, term2, ..., term n), instead of a K value. With the cosine similarity matrix we then proceed with the following algorithm.

Input: Dataset D = {d1, d2, ..., dn}
Output: Set of cluster numbers C along with the associated document numbers m.

1. U = {Di | i ∈ N}
2. Distribute the documents into groups using the divide and conquer merge sort strategy.
   2.1 Apply the divide strategy on the input corpus.
   2.2 Apply the divide strategy until the documents are equally placed in groups.
   2.3 Go to step 3.
   2.4 Conquer the clusters obtained in step 7.
3. Apply the K-means algorithm on every partition iteratively until we get the same clusters.
4. Calculate the similarity of the documents using the cosine similarity measure.
   4.1 The similarity of the documents in step 4 is calculated as follows,
   4.2 for S = Di | i ∈ N, where D denotes the documents and N the number of documents,
   4.3 for i = 1 to n, build the cosine similarity matrix:

       CS (n × n) =
       | 1       D(1,2)  D(1,3)  D(1,4)  ...  D(1,n) |
       | D(2,1)  1       D(2,3)  D(2,4)  ...  D(2,n) |
       | D(3,1)  D(3,2)  1       D(3,4)  ...  D(3,n) |
       | ...                                         |
       | D(n,1)  D(n,2)  D(n,3)  D(n,4)  ...  1      |

5. Assign the nearest (most similar) documents to the new clusters.
6. If the clusters do not match, go to step 4.
7. If the clusters match, stop.
8. Conquer the clusters using the conquer strategy mentioned in step 2.4.

A sketch of this procedure follows the list.
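This is a compact sketch under our assumptions: the corpus is divided into fixed-size partitions (the paper splits merge-sort style until the groups are equal), cosine K-means (section F) is run per partition, and the per-partition clusters are conquered under global labels. The partition_size parameter and the label-merging rule are illustrative; the paper derives the granularity from the extracted keywords rather than an explicit k.

    def improved_clustering(vectors, k, partition_size=50):
        # Divide: split the corpus into small, roughly equal partitions.
        partitions = [vectors[i:i + partition_size]
                      for i in range(0, len(vectors), partition_size)]
        clusters = {}      # global cluster label -> list of document numbers
        next_label = 0
        offset = 0
        for part in partitions:
            # Steps 3-7: iterate K-means on the partition until stable.
            labels = kmeans_cosine(part, min(k, len(part)))
            # Step 8 (conquer): map each partition-local cluster to a
            # global label. A fuller implementation would also merge
            # clusters across partitions whose centroids are cosine-similar.
            local_to_global = {}
            for i, label in enumerate(labels):
                if label not in local_to_global:
                    local_to_global[label] = next_label
                    next_label += 1
                clusters.setdefault(local_to_global[label], []).append(offset + i)
            offset += len(part)
        return clusters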
IV. EXPERIMENTAL RESULTS

We have tested our algorithm and the K-means algorithm on 20newsgroup-18828. The 20newsgroup-18828 dataset contains text documents from 20 categories; we used the categories stated in Table 1.

TABLE 1 - 20newsgroup-18828
Categories:
Talk.politics.guns
Talk.politics.mideast
Talk.politics.misc

We have designed the complete system, including all the modules described in section III. The input to the system is any text dataset from the categories mentioned in Table 1, and the output is the document numbers with their cluster labels.
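As a usage illustration, a hypothetical driver (not the authors' experiment code) can chain the sketches from section III and time the run; the toy corpus below merely stands in for the 20newsgroup documents.

    import time

    # Toy corpus standing in for the 20newsgroup documents; the real
    # experiment uses 100-200 documents per run.
    docs = [
        "gun control debate in congress",
        "debate on gun laws and control",
        "peace talks in the mideast region",
        "mideast peace talks resume",
    ]

    tokenized = [preprocess(d, min_freq=1) for d in docs]
    features = extract_feature_vector(tokenized, min_freq=1)
    vectors = tfidf_matrix(tokenized, features)

    start = time.perf_counter()
    clusters = improved_clustering(vectors, k=2, partition_size=4)
    elapsed = time.perf_counter() - start
    print(f"proposed: {len(clusters)} clusters in {elapsed:.4f} s")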
For the analysis of results, we applied K-means and our proposed algorithm to 100 and 200 documents from 20newsgroup and made the following observations.

For the 20newsgroup-18828 dataset, if we give K = 5 as input, K-means will cluster the documents into 5 clusters. Since our algorithm does not require K as input but instead takes cluster labels, it clusters the documents under those cluster labels.

Fig. 3 shows the F1-score of both algorithms. The accuracy of both algorithms is checked using the F1-score, which is explained in detail in the section above and is calculated using the standard formula for both algorithms. The F1-score of the proposed algorithm is greater than that of the existing algorithm because the proposed algorithm is keyword based, i.e. feature based, so its F1-score is closer to 1, whereas the existing algorithm is key based and its F1-score is lower.

Fig. 3. F-Measure comparison for 20newsgroup (F-measure vs. number of clusters (2-5), Existing vs. Proposed).

Fig. 4 shows the comparison of both algorithms with respect to time. The existing algorithm takes more time than the proposed algorithm because it calculates the cosine similarity between every pair of documents in the input corpus. The proposed algorithm takes less time because it calculates the cosine similarity between the documents present in each partition simultaneously; since it partitions the documents into small sets, fewer documents are present in a single partition.

Fig. 4. Time comparison for 20newsgroup dataset (time vs. number of clusters (2-5), Existing vs. Proposed).

V. CONCLUSION

Document clustering plays an important role in selecting the required documents among thousands of documents. Traditional K-means document clustering algorithms are not automatic, and they suffer from the problem of over clustering. The proposed algorithm demonstrates that automatic clustering can be achieved by considering features as cluster labels, and that over clustering can be overcome by partitioning the corpus data initially and then partitioning iteratively until all the documents are placed in proper clusters. The experimental results show that the improved document clustering algorithm exhibits improvements in the clustering of similar documents: the F1-score is higher for the proposed algorithm, and the time required for clustering is also much lower compared to the existing algorithm.
REFERENCES

[1] Sargur Srihari and Graham Leedham, "A survey of computer methods in forensic document examination," Proceedings of the 11th Conference of the International Graphonomics Society, pp. 278-281, IEEE, November 2003.
[2] Rolf Oppliger and Ruedi Rytz, "Digital Evidence: Dream and Reality," Swiss Federal Strategy Unit for Information Technology, pp. 44-48, IEEE, 2003.
[3] Ahmad Mehrbod, Aneesh Zutshi and Antonio Grilo, "A Vector Space Model Approach for Searching and Matching Product E-Catalogues," Proceedings of the Eighth International Conference on Management Science and Engineering Management, Advances in Intelligent Systems and Computing, Vol. 281, pp. 833-842, Springer, 2014.
[4] Mushfeq-Us-Saleheen Shameem and Raihana Ferdous, "An efficient K-means Algorithm integrated with Jaccard Distance Measure for Document Clustering," pp. 1-6, IEEE, 2009.
[5] V. Mary Amala Bai and D. Manimegalai, "An Analysis of Document Clustering Algorithms," ICCCCT-10, pp. 402-406, IEEE, 2010.
[6] Sonia Bui, Michelle Enyeart and Jenghuei Luong, "Issues in Computer Forensics," COEN 150, November 2003.
[7] Xiaoping Qing and Shijue Zheng, "A new method for initialising the K-means clustering algorithm," Second International Symposium on Knowledge Acquisition and Modeling, pp. 41-44, IEEE, 2009.
[8] Ranjana Agrawal and Madhura Phatak, "A Novel Algorithm for Automatic Document Clustering," 3rd IEEE International Advance Computing Conference (IACC), pp. 877-882, IEEE, 2013.
[9] Zonghu Wang, Zhijing Liu, Donghui Chen and Kai Tang, "A New Partitioning Based Algorithm For Document Clustering," Eighth International Conference on Fuzzy Systems and Knowledge Discovery, pp. 1741-1745, IEEE, 2011.
[10] S.C. Punitha, R. Jayasree and M. Punithavalli, "Partition Document Clustering using Ontology Approach," 2013 International Conference on Computer Communication and Informatics (ICCCI-2013), Jan. 4-6, pp. 1-5, 2013.
[11] Luis Filipe da Cruz Nassif and Eduardo Raul Hruschka, "Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection," IEEE Transactions on Information Forensics and Security, Vol. 8, No. 1, pp. 46-54, January 2013.
[12] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman and Angela Y. Wu, "An Efficient k-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, pp. 881-892, July 2002.
[13] Elizabeth Leon, Jonatan Gomez and Olfa Nasraoui, "A Genetic Niching Algorithm with Self-adaptating Operator Rates for Document Clustering," Eighth Latin American Web Congress, 2012.
[14] A. E. Eldesoky, M. Saleh and N. A. Sakr, "Novel Similarity Measure for Document Clustering Based on Topic Phrases," pp. 92-96, IEEE, 2009.
[15] Sargur Srihari and Graham Leedham, "Study of Clustering Algorithm based on Model data," Proceedings of the 11th Conference of the International Graphonomics Society, pp. 3961-3964, November 2007.