
International Journal of Computer Systems (ISSN: 2394-1065), Volume 03– Issue 08, August, 2016

Available at http://www.ijcsonline.com/

A Study on K-Means Clustering in Text Mining Using Python


Dr. (Ms.) Ananthi Sheshasayee¹, Ms. G. Thailambal²
¹ Head and Associate Professor, Quaid-e-Milleth College for Women, Chennai, India
² Research Scholar, SCSVMV University, Kancheepuram, India

Abstract

According to statistics, there are 195,248,950 Internet users in India, the second largest Internet user base in the world, and the total number of websites increased to 672,985,183 in the year 2013. Text Mining is an emerging research area, since the information on the web grows every day. Users do not know how the documents displayed are linked to the query they entered; sometimes the documents are relevant and many times they are irrelevant to the query typed by the user. These appropriate and inappropriate results depend on the clustering algorithm applied. Obtaining proper result pages from these websites is possible only through the process of clustering. Clustering is a fundamental process in many disciplines, and cluster analysis is used for grouping similar collections of patterns based on similarity factors. This paper discusses the tasks of text mining algorithms and clustering techniques. Among the different types of clustering algorithms available, the K-Means clustering algorithm is presented in detail along with its strengths and limitations. The paper also covers the computational measures used to identify similar objects to cluster, as well as the applications of clustering and the tools used for clustering in different applications. Related work on the K-Means clustering algorithm in text mining and other applications is presented, with the conclusion that the K-Means algorithm can be combined with other algorithms to obtain efficient results.

Keywords: Text Mining, Clustering Algorithm, K-Means Clustering, Python.

I. INTRODUCTION

Text Mining is the retrieval of information of different patterns from unstructured textual data in web repositories. Text mining is a variation on a field called data mining, which tries to find interesting patterns in large databases. Text mining, also known as Intelligent Text Analysis, Text Data Mining or Knowledge Discovery in Text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text [8]. Typically, only a small fraction of the many available documents will be relevant to a given individual user. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare different documents, rank the importance and relevance of the documents, or find patterns and trends across multiple documents. Thus, text mining has become an increasingly popular and essential theme in data mining [9].

II. TASKS OF TEXT MINING ALGORITHMS [7]

A. Text Categorization
Assigning documents to pre-defined categories. Many statistical approaches have been applied, such as regression models and Support Vector Machines.

B. Text Clustering
Finding groups of similar data objects based on a similarity function. The methods applied are categorized as hierarchical and partitioning.

C. Concept Mining
The task of discovering concepts, which combines the categorization and clustering approaches to find concepts and their relations in text collections.

D. Information Retrieval
Retrieving information from a collection of available information resources depending on the user's query.

E. Information Extraction
The task of automatically extracting structured information from unstructured or semi-structured documents.

III. CLUSTERING TECHNIQUES

Clustering is the grouping of similar data sets with the same content. It includes grouping the same text messages in e-mail, or the same content from different books. Text clustering algorithms are classified into many types, namely distance-based algorithms, frequent-sequence algorithms, feature selection and extraction algorithms, and density-based algorithms. A clustering algorithm discovers groups in the set of documents such that documents within a group are more similar than documents across groups [2].


Clustering Tasks

Fig. 1 Documents before and after clustering (scattered documents → clustered documents)

The following conditions help to increase the effectiveness of the clustering [1]:

A. Similarity Measure: Only similar documents are to be considered, which is hard to define.
B. Dimension Reduction: The size of the data needs to be reduced to increase the efficiency of the operations by removing irrelevant words from the text collection.
C. Cluster Labels: Giving separate names to the different clusters in an appropriate way is needed to identify the clusters clearly.
D. Number of Clusters: The number of clusters has to be decided in advance, which is difficult when little information is available.
E. Overlapping of Clusters: The algorithm should accept overlapping clusters, since certain documents cover several topics.
F. Scalability: The algorithm should be usable irrespective of the size of the data.
G. Flexibility: The algorithm should be flexible with respect to different attributes, numbers of clusters, etc.

The clustering hypothesis is formulated as: "Given a suitable clustering of a collection, if a user is interested in a document d, then the user is also interested in the other members of the cluster containing d." The parameters used by clustering algorithms are [3]:

• The number of clusters desired
• A minimum and maximum size of the cluster
• The control of overlap between clusters
• An arbitrarily chosen objective function to be optimized
• A threshold value of the matching function below which an object will not be included in the cluster

H. Distance Computation

Most cluster analysis methods are based on the similarity between objects, computed as the distance between each pair. The properties of a distance are:

• Distance is always positive.
• The distance from a point to itself is zero.
• The distance from x to y is always the same as the distance from y to x.
• The distance from point x to point y cannot be greater than the sum of the distance from x to any other point z and the distance from z to y.

Fig. 2 Key tasks of clustering: document representation (convert the documents into structured form), definition of a similarity measure (similarity between two documents), and clustering logic (determining which cluster a document is assigned to based on the similarity measure)

A. Distance measures of Clusters

• Euclidean distance: the largest-valued attributes dominate the measure, so the attributes should be properly scaled.
D(x, y) = (Σi (xi − yi)²)^(1/2)   ... (1)

• Manhattan distance: the domination of the largest-valued attribute is not as strong as with the Euclidean distance.
D(x, y) = Σi |xi − yi|   ... (2)

• Chebychev distance: based on the maximum attribute difference.
D(x, y) = maxi |xi − yi|   ... (3)

• Categorical distance: used when many attributes have categorical values with only a small number of possible values. Let N be the total number of categorical attributes.
D(x, y) = (number of attributes in which xi ≠ yi) / N   ... (4)
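As a concrete illustration of the four measures (1)–(4), the following is a minimal sketch in plain Python; the function names and example vectors are our own, not taken from the paper.

```python
# Minimal sketch of the distance measures (1)-(4); names are illustrative.
from math import sqrt

def euclidean(x, y):
    # (1) square root of the sum of squared attribute differences
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # (2) sum of absolute attribute differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def chebychev(x, y):
    # (3) maximum absolute attribute difference
    return max(abs(xi - yi) for xi, yi in zip(x, y))

def categorical(x, y):
    # (4) fraction of categorical attributes on which x and y disagree
    return sum(1 for xi, yi in zip(x, y) if xi != yi) / len(x)

# Example: distances between two small vectors
print(euclidean([1, 2, 3], [2, 4, 6]))              # ~3.742
print(manhattan([1, 2, 3], [2, 4, 6]))              # 6
print(chebychev([1, 2, 3], [2, 4, 6]))              # 3
print(categorical(["red", "small"], ["red", "large"]))  # 0.5
```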


I. Types of Clustering [5]

• Partitional clustering
The given n data objects are partitioned into k partitions, each representing a cluster (k ≤ n). The partitioned data should satisfy the following criteria:
(i) At least one data object should be in each cluster.
(ii) A data object should belong to only one cluster group.
The widely used methods are iterative (reallocation) clustering, in which data objects move from one cluster to another, and single-pass clustering, in which each data object is processed only once.

• K-Means Clustering
The most widely used partitional clustering algorithm is K-Means, which assigns each point to the cluster whose center, called the centroid, is nearest. The center is the average of all the points in the cluster; its coordinates are the arithmetic mean for each dimension taken separately over all the points in the cluster [6].
The steps of K-Means are:
Step 1: Choose the number of clusters k.
Step 2: Randomly generate k points as initial cluster centers.
Step 3: Determine the Euclidean distance of each object to all centroids.
Step 4: Assign each point to the nearest centroid.
Step 5: Re-compute the new cluster centers.
Step 6: Repeat Steps 3–5 until convergence.
This algorithm aims to minimize the following objective function for k clusters and n data points:

J = Σj=1..k Σi=1..n ||xi − cj||²   ... (5)

where ||xi − cj|| is a chosen (Euclidean) distance measure between a data point xi and the center cj of its cluster.

K-Means still has some limitations: handling outliers is not possible, and intermediate solutions are not produced. Nevertheless, this algorithm is traditionally used in most applications, since it is easy to implement and its time complexity is O(N) [10], where N is the number of objects to be grouped. Table 1 lists the advantages of K-Means clustering.
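To make the steps above and the objective function (5) concrete, the following is a minimal NumPy sketch of the algorithm; the variable names are illustrative and this is not the code shown later in Fig. 3.

```python
# Minimal NumPy sketch of the K-Means steps and the objective J in (5); illustrative only.
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k of the points as the initial cluster centers
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: Euclidean distance of every object to every centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: assign each point to the nearest centroid
        labels = distances.argmin(axis=1)
        # Step 5: re-compute the cluster centers as the mean of the assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 6: repeat Steps 3-5 until the centers stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Objective (5): sum of squared distances of each point to its own cluster center
    J = ((points - centroids[labels]) ** 2).sum()
    return labels, centroids, J

# Usage with toy 2-D data; the number of clusters k is chosen in Step 1
data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
labels, centers, J = kmeans(data, k=2)
print(labels, J)
```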

• Hierarchical Clustering
These methods either start with one cluster and split it into smaller and smaller clusters, or merge similar clusters into larger and larger clusters, resulting in a tree of clusters.

• Density Based clustering
For each data point in a cluster, at least a minimum number of points must exist within a given radius. Each cluster is a dense region of points surrounded by regions of low density.

• Grid based clustering
The object space is divided into a grid according to the characteristics of the data. This method is not affected by data ordering and can deal with non-numeric data easily.

• Model based clustering
This algorithm builds clusters with a high level of similarity within them and a low level of similarity between them. It works based on the mean values and minimizes the squared error function.

Table 1: Advantages of K-Means
Type of attributes the algorithm can handle: Numeric
Time complexity: Low
Data ordering dependency: Yes
Prior knowledge and user-defined parameters: Yes
Interpretability of results: Clusters
Ability to memorize results: Centroids

J. Clustering Implementation in Python
The following partial code is implemented in the Python language [22].


Fig. 3 Sample Clustering Implementation using Python
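The code in Fig. 3 appears in the original paper as an image (drawn from [22]) and is not reproduced in this text. As a rough indication of what a K-Means text-clustering implementation looks like in Python, here is a minimal sketch that assumes the scikit-learn library; it is not the figure's actual code, and the document list is our own toy example.

```python
# Minimal sketch of K-Means text clustering in Python using scikit-learn
# (assumed library; not the exact code of Fig. 3 / [22]).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "python is a programming language",
    "text mining with python and clustering",
]

# Document representation: convert the documents into structured (TF-IDF) form
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Clustering logic: assign each document to one of k clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

for doc, label in zip(documents, labels):
    print(label, doc)
```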

IV. RELATED WORK OF K-MEANS CLUSTERING IN OTHER APPLICATIONS
Oyelade, O. J. et al. present the k-means clustering algorithm as a simple and efficient tool to monitor the progression of students' performance in higher institutions. They analyzed the students' results based on cluster analysis and used standard statistical algorithms to arrange their score data according to the level of their performance [11].
Bader Aljaber et al. show that the use of citation contexts, when combined with the vocabulary in the full text of documents in High Energy Physics and Genomics, is a promising alternative means of capturing the critical topics covered by journal articles. The authors use a link-based clustering algorithm which determines the similarity between documents by the number of co-citations. They used a bi-clustering algorithm and, at the end, included the K-means algorithm to reduce the size of the bi-clusters by merging their similar documents [12].

V. RELATED WORK OF TEXT MINING APPLICATIONS USING K-MEANS CLUSTERING ALGORITHM
Anil Kumar Pandey et al. use the K-means algorithm to cluster web documents to help researchers. The authors extract document features and apply the Apriori algorithm, which generates mutually exclusive frequent sets that are taken as the initial points of the K-means clustering algorithm. This displays highly related documents with the same features together [13].
Neetu Sharma et al. use the K-means algorithm and a Random Forest classifier in the WEKA tool and conclude that using clustering before classification on the data file poach.arff from WordNet optimized the performance [14].
VI. CONCLUSION

The performance of a clustering algorithm depends on the structure, the amount and the representativeness of the data. Some of the applications where clustering is widely used are discussed in this paper, which shows the importance of clustering in Text Mining. Many other clustering algorithms are available, each with its own pros and cons, and they can be combined to obtain better results.
REFERENCES
[1] Francis Musembi Kwale, "A Critical Review of K-Means Text Clustering Algorithms", International Journal of Advanced Research in Computer Science, Vol. 4, No. 9, ISSN 0976-5697.
[2] Dan Munteanu, Severin Bumbaru, "A Survey of Text Clustering Techniques Used for Web Mining", The Annals of "Dunarea de Jos" University of Galati, Fascicle III, ISSN 1221-454X.
[3] C. J. Van Rijsbergen, "Information Retrieval", Butterworths, London.
[4] Pushplata, Ram Chatterjee, "An Analytical Assessment on Document Clustering", I.J. Computer Network and Information Security, 5, 63-71, DOI: 10.5815/ijcnis.2012.05.08.
[5] S. Prabha, K. Duraiswamy, M. Sharmila, "Analysis of Different Clustering Techniques in Data and Text Mining", International Journal of Computer Science Engineering (IJCSE), Vol. 3, No. 02, ISSN 2319-7323.
[6] S. C. Punitha, M. Punithavalli, "A Comparative Study to Find a Suitable Method for Text Document Clustering", International Journal of Computer Science & Information Technology, Vol. 3, No. 6.
[7] Sayantani Ghosh, Sudipta Roy, Samir K. Bandyopadhyay, "A Tutorial Review on Text Mining Algorithms", International Journal of Advanced Research in Computer and Communication Engineering, Vol. 1, Issue 4, ISSN 2278-1021.
[8] Vishal Gupta, Gurpreet S. Lehal, "A Survey of Text Mining Techniques and Applications", Journal of Emerging Technologies in Web Intelligence, Vol. 1, No. 1.
[9] R. Sagayam, S. Srinivasan, S. Roshni, "A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques", International Journal of Computational Engineering Research, Vol. 2, Issue 5, pp. 1443-1446.
[10] "Comparative Study of Clustering Algorithms on Textual Databases", Thesis submitted to Technical University Ilmenau, Germany.
[11] O. J. Oyelade, O. O. Oladipupo, I. C. Obagbuwa, "Application of K-Means Clustering Algorithm for Prediction of Students' Academic Performance", (IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, Issue 1.
[12] Bader Aljaber, Nicola Stokes, James Bailey, Jian Pei, "Document Clustering of Scientific Texts Using Citation Contexts", Information Retrieval, DOI 10.1007/s10791-009-9108-x, Springer Science+Business Media, LLC.
[13] Anil Kumar Pandey, T. Jaya Laxmi, "Web Document Clustering for Finding Expertise in Research Area", BVICAM's International Journal of Information Technology, Vol. 1, No. 2, ISSN 0973-5658.
[14] Neetu Sharma, S. Niranjan, "Optimization of Word Sense Disambiguation Using Clustering in WEKA", International Journal of Computer Technology & Applications, Vol. 3 (4), 1598-1604, ISSN 2229-6093.
[15] L. V. Bijuraj, "Clustering and its Applications", Proceedings of National Conference on New Horizons in IT, ISBN 978-93-82338-79-6.
[16] https://code.google.com/p/sofia-ml
[17] http://nlp.fi.muni.cz/projekty/gensim
[18] http://mahout.apache.org
[19] http://radimrehurek.com/gensim
[20] http://carrotsearch.com/lingo3g
[21] http://graphlab.org
[22] Toby Segaran, Programming Collective Intelligence: Building Smart Web 2.0 Applications. Sebastopol, CA: O'Reilly Media.

