International Journal of Advance Foundation and Research in Computer (IJAFRC)

Volume 2, Issue 8, August - 2015. ISSN 2348 4853, Impact Factor 1.317

A Comparative Study on Classification and Clustering Techniques Using Assorted Data Mining Tools
Dr.S.Prasath
Assistant Professor, Department of Computer Science,
Erode Arts and Science College (Autonomous), Erode.
softprasaths@gmail.com
ABSTRACT
Data mining is essentially the discovery of important information and patterns in very large volumes
of available data. Two important data mining techniques are clustering and classification: the
latter uses a set of pre-classified examples to build a model that can classify the population of
records at large, and the former partitions the data into groups of similar objects. This paper
proposes a new method that integrates these two data mining techniques, clustering and
classification. A comparative study is then carried out between simple classification and the
proposed integrated clustering-classification technique. Four popular data mining tools were used
for both techniques, with six different classifiers and one clusterer for all sets. Across all the
tools used, the integrated clustering-classification technique was found to be better than the
simple classification technique. This result was consistent for all six classifiers used. For both
techniques, the best classifier was found to be SVM. Of the four tools used, WEKA was found to be
the best in terms of algorithm flexibility. All comparisons were drawn by comparing the percentage
accuracy of each classifier used.
Keywords: Data mining, Classification, Clustering, Data mining tools, WEKA, Orange, Fuzzy.
I. INTRODUCTION

Data mining centers on the automated discovery of new facts and relationships in already existing
data. The different techniques of data mining include association, regression, prediction,
clustering and classification. Clustering is the division of data into groups of similar objects.
Clustering is an example of unsupervised learning, as it learns by observation [1]. Classification
is a data mining function that assigns items in a collection to target categories or classes. The
goal of classification is to accurately predict the target class for each case in the data [2].
This paper deals with the use of the integrated clustering-classification technique on some of the
free data mining tools available today. The tools on which the integrated clustering-classification
technique has been implemented are KNIME (Konstanz Information Miner), Tanagra [3], Orange and WEKA
(Waikato Environment for Knowledge Analysis) [4]. The classifiers used for this purpose are Naive
Bayes, Support Vector Machine, K Nearest Neighbor, Zero Rule, Decision Tree and One Rule.
Data mining is the process of automatically classifying cases based on data patterns obtained from
a dataset. A number of algorithms have been developed and implemented to extract information and
discover knowledge patterns that may be useful for decision support. Data mining is also known as
KDD (Knowledge Discovery in Databases); data preprocessing, pattern recognition, clustering and
classification are the popular techniques in data mining. This paper first examines in detail the
data preprocessing that takes place in the database server module, and then discusses the database
module in terms of classification and clustering as provided by the data mining tools.

II. KNOWLEDGE DISCOVERY PROCESS


The terms Knowledge Discovery in Databases (KDD) and Data Mining are often used interchangeably.
KDD is the process of transforming low-level data into high-level knowledge. Hence, KDD refers to
the nontrivial extraction of implicit, previously unknown and potentially useful information from
data in databases. Although data mining and KDD are often treated as synonyms, data mining is in
fact an important step in the KDD process.
The Knowledge Discovery in Databases process comprises several steps leading from raw data
collections to some form of new knowledge. In the data cleaning stage, noisy and irrelevant data
are removed from the collection. In the data integration stage, multiple data sources, often
heterogeneous, may be combined into a common source. In the data selection stage, the data relevant
to the analysis is chosen and retrieved from the data collection. Data transformation, also known
as data consolidation, is the stage in which the selected data is transformed into forms
appropriate for the mining procedure. Data mining is the stage in which intelligent techniques are
applied to extract potentially useful patterns. In pattern evaluation, interesting patterns
representing knowledge are identified based on given measures. Knowledge representation is the
final stage, in which the discovered knowledge is visually presented to the user. In this step,
visualization techniques are used to help users understand and interpret the data mining results.
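The stages above can be pictured as a chain of small functions. The following is only a minimal sketch under stated assumptions: the function names, the record layout and the sample values are hypothetical, and the "mining" step is a toy per-class average standing in for a real algorithm.

```python
from statistics import mean

def clean(records):
    """Data cleaning: drop records that contain a missing (None) value."""
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    """Data integration: merge several sources into one common list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """Data transformation: min-max scale one numeric field to [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant fields
    return [{**r, field: (r[field] - lo) / span} for r in records]

def mine(records, field, label):
    """Data mining (toy stand-in): mean of `field` for each `label` value."""
    groups = {}
    for r in records:
        groups.setdefault(r[label], []).append(r[field])
    return {k: mean(v) for k, v in groups.items()}

# Hypothetical sample records from two sources (illustration only).
source_a = [{"age": 20, "glucose": 80, "label": "neg"},
            {"age": 50, "glucose": None, "label": "pos"}]
source_b = [{"age": 60, "glucose": 180, "label": "pos"},
            {"age": 30, "glucose": 100, "label": "neg"}]

data = integrate(clean(source_a), clean(source_b))
data = select(data, ["glucose", "label"])
data = transform(data, "glucose")
patterns = mine(data, "glucose", "label")
print(patterns)  # pattern evaluation and visualization would follow here
```

Keeping each stage as a separate function mirrors the way the stages are listed in the text; in practice the stages are often iterated rather than run once in sequence.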
III. DATA MINING PROCESS
In the KDD process, data mining methods are used for extracting patterns from data. The patterns
that can be discovered depend on the data mining tasks applied. Generally, there are two types of
data mining tasks: descriptive tasks, which describe the general properties of the existing data,
and predictive tasks, which attempt to make predictions based on the available data. Data mining
can be performed on data in quantitative, textual or multimedia form.
Data mining applications can use different kinds of parameters to examine the data. They include
association (patterns where one event is connected to another event), sequence or path analysis
(patterns where one event leads to another event), classification (identification of new patterns
with predefined targets) and clustering (grouping of identical or similar objects).
Problem definition: The first step is to identify the goals, based on which the correct series of
tools can be applied to the data to build the corresponding behavioral model.
Data exploration: If the quality of the data is not suitable for an accurate model, recommendations
on future data collection and storage strategies can be made. For analysis, all data needs to be
consolidated so that it can be treated consistently.
Data preparation: The purpose is to clean and transform the data: missing and invalid values are
treated, and all known valid values are made consistent for more robust analysis.

Modeling: Based on the data and the desired outcomes, a data mining algorithm or combination of
algorithms is selected for analysis. These algorithms include classical techniques such as
statistics, neighborhoods and clustering, as well as next-generation techniques such as decision
trees, networks and rule-based algorithms. The specific algorithm is selected based on the
particular objective to be achieved and the quality of the data to be analyzed.
Evaluation and Deployment: Based on the results of the data mining algorithms, an analysis is
conducted to determine key conclusions and to create a series of recommendations for further
consideration.
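Since the comparisons in this paper are made on percentage accuracy, the evaluation step can be illustrated with a small sketch. It uses a Zero Rule baseline (always predict the majority class of the training labels); the helper names and the sample values are hypothetical and are not the paper's experimental setup.

```python
from collections import Counter

def zero_rule(train_labels):
    """Zero Rule baseline: always predict the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: majority

def percentage_accuracy(predict, test_set):
    """Share of test cases whose predicted label equals the true label, in %."""
    correct = sum(predict(features) == label for features, label in test_set)
    return 100.0 * correct / len(test_set)

# Hypothetical labels and test cases (illustration only).
train_labels = ["neg", "neg", "pos", "neg"]
test_set = [((85,), "neg"), ((183,), "pos"), ((89,), "neg"), ((137,), "pos")]

baseline = zero_rule(train_labels)
baseline_accuracy = percentage_accuracy(baseline, test_set)
print(f"Zero Rule accuracy: {baseline_accuracy:.2f}%")
```

A baseline of this kind gives a floor against which the accuracy of the other classifiers can be judged.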

Fig.1 Steps involved in Data mining


IV. DATA MINING METHODS
Classification: A supervised learning technique, in which the classes are known.
Clustering: An unsupervised learning technique, in which the classes are unknown.
Association Rule Mining: Identifying hidden, previously unknown relationships between the
entities.
Temporal Mining: The use of temporal data, covering the modeling of temporal events, time series,
pattern detection, sequences and temporal association rules.
Time Series Analysis: Describes the nature and behavior of time series data and predicts the future
trend and behavior of the data.
Web Mining: Mining web data, covering web content mining, web structure mining and web usage
mining.
Spatial Mining: Used with GIS for mining knowledge from spatial databases, covering spatial
classification, clustering and rule generation tasks.
V. DATA MINING CLASSIFICATION ALGORITHMS
The different classification algorithms available are:
Naive Bayes (NB): A probabilistic classifier based on Bayes' theorem with an independent-feature
probability model.
Decision tree (C4.5): A statistical classifier developed by Ross Quinlan that classifies data by
generating decision trees.
Support Vector Machine (SVM): An example of a non-probabilistic binary linear classifier; from the
set of input data it predicts which of the two possible classes forms the output.
K Nearest Neighbor (KNN): An example of instance-based learning. KNN is sensitive to the local
structure of the data: the function is only approximated locally, and all computation is deferred
until classification time.
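To make the instance-based idea concrete, here is a minimal K Nearest Neighbor sketch. It assumes Euclidean distance and a simple majority vote; the function name and the sample points are hypothetical, and the tools compared in this paper may use different defaults.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs. No model is built in
    advance: all distance computation happens here, at prediction time.
    """
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training points (illustration only).
train = [
    ((1.0, 1.0), "neg"), ((1.2, 0.8), "neg"), ((0.9, 1.1), "neg"),
    ((4.0, 4.2), "pos"), ((4.1, 3.9), "pos"), ((3.8, 4.0), "pos"),
]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (4.0, 4.0)))
```

Because the vote depends only on the nearest points, a few mislabeled neighbours can change the prediction, which is the sensitivity to local structure noted above.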

VI. CLUSTERING
Clustering partitions the records in a database into different groups. Within the same group the
differences should be as small as possible, while between different groups the differences should
be as large as possible. There are no predefined classes in this grouping, so it falls under
unsupervised learning. Techniques included in cluster analysis are partitioning methods,
hierarchical methods, density-based methods, grid-based methods, model-based methods, clustering of
high-dimensional data, constraint-based clustering and outlier analysis.
i. K-means Clustering
ii. Hierarchical clustering
iii. Density based clustering
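Of the three techniques listed, K-means is the simplest to sketch. The version below is a minimal illustration assuming Euclidean distance and caller-supplied initial centroids; the function name, the iteration limit and the sample points are hypothetical and do not reproduce any tool's implementation.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """K-means with caller-supplied initial centroids.

    Returns (labels, centroids), where labels[i] is the index of the
    centroid nearest to points[i].
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical two-feature points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The two alternating steps reduce the within-group differences, which is the objective described at the start of this section.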
VII. DATA MINING TOOLS
The data mining tools on which the integrated clustering-classification technique has been
implemented are described below.
WEKA tool
WEKA (Waikato Environment for Knowledge Analysis) is a data mining and machine learning tool
developed by the Department of Computer Science, University of Waikato, New Zealand. It is an
open-source collection of many data mining and machine learning algorithms, including data
pre-processing, classification, regression, clustering, association rule extraction and feature
selection. It supports the .arff (Attribute-Relation File Format) file format.
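An .arff file is plain text: a relation name, a list of attribute declarations and a data section with one comma-separated row per instance. The sketch below writes such text from Python. The helper name, the attribute names and the sample rows are hypothetical, and only numeric and nominal attributes are handled.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    `attributes` is a list of (name, type) pairs, where type is either the
    string "numeric" or a list of the allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attribute: allowed values are listed inside braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"

# Hypothetical attributes and rows (illustration only).
arff_text = to_arff(
    "diabetes_sample",
    [("glucose", "numeric"), ("age", "numeric"), ("class", ["neg", "pos"])],
    [(85, 31, "neg"), (183, 32, "pos")],
)
print(arff_text)
```

Declaring the class as a nominal attribute is what lets a classifier treat it as the target category rather than as a number.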
Tanagra
Tanagra was written as an aid to education and research on data mining by Ricco Rakotomalala. The
entire user operation of Tanagra is based on the stream diagram paradigm. Under this paradigm, a
user builds a graph specifying the data sources and the operations on the data. Paths through the
graph describe the flow of data through manipulations and analysis. Tanagra simplifies this
paradigm by restricting the graph to be a tree, so that each node has only one parent and therefore
only one data source for each operation.
KNIME

KNIME (Konstanz Information Miner) is an open-source data analytics, reporting and integration
platform. KNIME integrates various components for machine learning and data mining through its
modular data pipelining concept. A graphical user interface allows assembly of nodes for data
preprocessing (ETL: Extraction, Transformation, Loading), modeling, data analysis and
visualization.
Orange
Orange is a component-based data mining and machine learning software suite, featuring a visual
programming front-end for explorative data analysis and visualization, and Python bindings and
libraries for scripting. It includes a set of components for data preprocessing, feature scoring
and filtering, modeling, model evaluation and exploration techniques. It is implemented in C++ and
Python.

Fig.2. Process Flow


VIII. EXPERIMENTAL RESULTS
The "Pima Indian Diabetes" dataset is considered with the K-means, hierarchical and density-based
clustering techniques and the different classification algorithms available in the data mining
tools. The Pima Indian Diabetes dataset is available from the UCI Machine Learning Repository.
Table I shows the accuracy of the K-means clustering technique for the different classifiers used.
SVM provides the highest accuracy, in the range 76-78%, followed by Naive Bayes in the range
73-76%, KNN in the range 72-73% and, close behind, C4.5 in the range 69-74%. The pictorial
representation is shown in Fig. 3.
Table I: Accuracy for K-means clustering

Classifier   Weka     Tanagra  Orange   KNIME
NB           76.32%   74.87%   75.38%   73.17%
C4.5         73.82%   74.21%   70.05%   69.27%
KNN          73.17%   72.11%   72.90%   72.26%
SVM          78.34%   76.45%   76.17%   77.60%
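The ranges quoted for Table I can be checked directly from its values. The short sketch below copies the Table I accuracies into a dictionary (the variable names are illustrative) and computes the minimum and maximum per classifier, together with the classifier that has the highest mean accuracy across the four tools.

```python
# Accuracy values (%) copied from Table I, in the order Weka, Tanagra, Orange, KNIME.
table_1 = {
    "NB":   [76.32, 74.87, 75.38, 73.17],
    "C4.5": [73.82, 74.21, 70.05, 69.27],
    "KNN":  [73.17, 72.11, 72.90, 72.26],
    "SVM":  [78.34, 76.45, 76.17, 77.60],
}

# Lowest and highest accuracy of each classifier across the four tools.
ranges = {clf: (min(acc), max(acc)) for clf, acc in table_1.items()}

# Classifier with the highest mean accuracy across the four tools.
best = max(table_1, key=lambda clf: sum(table_1[clf]) / len(table_1[clf]))

for clf, (lo, hi) in ranges.items():
    print(f"{clf}: {lo:.2f}% to {hi:.2f}%")
print("highest mean accuracy:", best)
```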


Fig.3. Accuracy for K-means clustering


Table II shows the accuracy of the hierarchical clustering technique for the different classifiers
used. SVM gives the highest accuracy on three of the four tools, ranging from 73.01% to 77.68%.
KNN ranges from 68.84% to 78.21%, C4.5 from 62.32% to 71.08% and Naive Bayes from 64.01% to 68.74%.
The pictorial representation is shown in Fig. 4.

Table II: Accuracy for Hierarchical Clustering

Classifier   Weka     Tanagra  Orange   KNIME
NB           65.22%   66.34%   68.74%   64.01%
C4.5         62.32%   71.08%   70.65%   68.75%
KNN          71.08%   68.84%   69.38%   78.21%
SVM          77.67%   76.24%   77.68%   73.01%

Fig.4. Accuracy for Hierarchical Clustering


Table III shows the accuracy of the density-based clustering technique for the different
classifiers used. SVM gives the highest accuracy on three of the four tools, ranging from 73.81% to
77.00%. KNN ranges from 69.65% to 75.52%, C4.5 from 62.11% to 69.28% and Naive Bayes from 63.34% to
64.89%. The pictorial representation is shown in Fig. 5.

Table III: Accuracy for Density based clustering

Classifier   Weka     Tanagra  Orange   KNIME
NB           63.58%   63.34%   64.89%   64.22%
C4.5         62.11%   68.71%   69.28%   68.59%
KNN          69.65%   70.84%   70.95%   75.52%
SVM          77.00%   74.89%   74.87%   73.81%

Fig.5. Accuracy for Density based clustering


Comparing the data in Tables I, II and III shows that the SVM classifier is the best for the
K-means, hierarchical and density-based clustering techniques and for the
clustering-classification technique. However, the percentage accuracy using the SVM classifier lies
in the range 76-78% in one case and 76-77% in the other. The comparison of the tables shows that
the K-means clustering technique gives more accurate results than the other clustering techniques.
Overall, the accuracy of the K-means clustering technique is about 2-12% higher than that of the
other clustering techniques across the range of tools and algorithms used.
IX. CONCLUSIONS
Data mining is the extraction of useful patterns and relationships from data sources such as
databases, texts and the web. This research discussed different data mining tools and the
importance of these tools, considering various aspects. The experimental results were compared
across existing clustering and classification techniques; in terms of accuracy, SVM gives the best
results compared with the other methods.
X. REFERENCES

[1] David Heckerman, "Bayesian Networks for Data Mining", Data Mining and Knowledge Discovery,
1997.
[2] David Hand, Heikki Mannila and Padhraic Smyth, "Principles of Data Mining", The MIT Press,
2001.
[3] Ritu Chauhan, Harleen Kaur, M. Afshar Alam, "Data Clustering Method for Discovering Clusters in
Spatial Cancer Databases", International Journal of Computer Applications, Volume 10, No. 6,
November 2010.
[4] J. R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.
[5] S. Kotsiantis, D. Kanellopoulos, P. Pintelas, "Data Preprocessing for Supervised Learning",
International Journal of Computer Science, Vol. 1, No. 2, pp. 111-117, 2006.
[6] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations",
Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, University of
California Press, pp. 281-297, 1967.
[7] S. P. Lloyd, "Least squares quantization in PCM", IEEE Transactions on Information Theory,
Vol. 28, pp. 129-137, 1982.
[8] Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar and Nidhi Gupta, "A Comparative
Study of Various Clustering Algorithms in Data Mining", International Journal of Engineering
Research and Applications, Vol. 2, Issue 3, 2012.
[9] Timothy C. Havens, "Clustering in relational data and ontologies", July 2010.

AUTHOR PROFILE
Dr. S. Prasath is currently working as an Assistant Professor in the Department of Computer
Science, Erode Arts & Science College (Autonomous), Erode, Tamilnadu, India. He received his Ph.D.
degree from Bharathiar University, Coimbatore, Tamilnadu, India in 2015. He obtained his Master's
degree in Software Engineering from M. Kumarasamy College of Engineering, Karur, under Anna
University, Chennai in 2008 and his M.Phil. degree in Computer Science in 2009. His areas of
interest include image processing and data mining. He has presented 6 papers at national and 2 at
international conferences, and has published 10 papers in national and international journals.
