
Execution Analysis of Various Clustering Algorithms in Weka

Muhammad Sohail
(Student, Department of Computer Science, Riphah International University, Lahore)
Maliksohail284@yahoo.com
ABSTRACT

Clustering is a technique that comes under data mining and has many applications. Different clustering algorithms are suitable under different conditions, and each takes certain parameters and input points. This paper provides a comparative analysis of clustering algorithms, including K-Means, DBSCAN, Optics and Make Density Based Clusters. To understand the performance of these algorithms and compare them, four datasets (Birds, Cars, Mall Customer and Seeds) were used with the data science tool WEKA. The outcomes were analyzed on the basis of the run time of each clustering algorithm used in this paper.

Keywords: Clustering, Data mining, Weka.

I. INTRODUCTION

Data science is a broad field that is closely related to data mining. Data mining is used to find unknown patterns in data sets and to provide correlations from which outcomes can be derived. There are many data mining techniques, such as classification, clustering, association and regression. Each technique gives better results when it is used under specific circumstances. Clustering groups data into classes and identifies objects that share their most significant properties. Similarity and dissimilarity are based on the objects' properties. Clustering is an unsupervised learning technique, and the datasets used for it are unlabeled.

There are many clustering techniques, such as density-based, grid-based and model-based clustering. Data segmentation is another name for clustering. Clustering is also used to detect outliers. Data mining is widely used in statistics, marketing and many other fields. In clustering, everything is learned by observing the attributes, and there is no need to learn from examples. Typical requirements of clustering include scalability and minimal requirements for domain knowledge.

K-means clustering is an unsupervised technique, which means there is no need for labeled data. The algorithm minimizes the Euclidean distance between the data points and the center of the cluster to which they are assigned. The k-means algorithm is not complicated, but it is very useful. The "means" in the name refers to the average of the data provided for analysis.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm that is used to separate high-density clusters from low-density regions. Clustering in DBSCAN is governed by two parameters: the neighborhood and the minimum number of points.

Optics and Make Density Based Clusters are also density-based clustering algorithms, and they are analyzed in terms of the parameters needed to create clusters. Optics does not create clusters of the data explicitly; instead it produces an ordering of the database that represents the density-based clustering structure.

WEKA, a widely used data mining tool, was used to complete this research; it has all the built-in features that this research required. WEKA needs no coding, provides a GUI, and is easy to use and learn. It is a free tool, available on the internet for data mining tasks, and was developed in New Zealand.

Fig.1 WEKA interface

II. REVIEW OF LITERATURE

Data mining is widely used in statistical analysis, and researchers have done much work in it. Xianzhi Wang et al. (2006) provide a method for clustering time series based on their structural characteristics; it clusters on global features extracted from the time series. Global measures describing the time series are obtained by applying statistical operations that best capture the underlying characteristics: trend, seasonality, periodicity, serial correlation, skewness, kurtosis, chaos, nonlinearity,
and self-similarity. Since the method clusters using extracted global measures, it reduces the dimensionality of the time series and is much less sensitive to missing or noisy data. A search mechanism is provided to find the best selection from the feature set that should be used as the clustering inputs [2].

DBSCAN [15] and SNN [12] have also been compared. SNN requires more inputs than DBSCAN. The results showed that SNN performs better than DBSCAN, since it can detect clusters with different densities while DBSCAN cannot.

In [19], Mariam et al. and Syed et al. compared two density-based clustering algorithms, DBSCAN and RDBC (Recursive Density-Based Clustering). RDBC is an improvement on DBSCAN: it calls DBSCAN with different distance thresholds ε and density thresholds MinPts. The comparison concludes that RDBC forms more clusters than DBSCAN and that the run time of RDBC is lower than that of DBSCAN.

Another study notes that every algorithm has its own importance and is chosen according to the behavior of the data, but it found k-means to be the simplest of the algorithms compared. Deep knowledge of the algorithms is not required to work in WEKA, which makes WEKA a suitable tool for data mining applications. That study covers only the clustering operations in WEKA, and its authors intend to produce a complete reference paper on WEKA.

A further study finds that k-means gives better results than the hierarchical and EM algorithms. It builds a model faster than hierarchical and EM clustering, but it takes more time than density-based algorithms. The log-likelihood value of the density-based algorithm is also higher. From these results, k-means is a faster and safer algorithm than the others used in that study.

In another paper, the run time of the DEMassDBSCAN algorithm is better than that of the DBSCAN clustering algorithm, while the run time of OPTICS is also better than that of DBSCAN. It gives better results on large data sets, and fewer points are left unassigned than with DBSCAN. The result of DEMassDBSCAN is efficient, with few noise points in the whole data set. In future, mass estimation may be applied to other tasks. In some cases, however, the run time of OPTICS is better than that of DEMassDBSCAN, and run time varies with data size.

III. METHODOLOGY

Four algorithms, all of them clustering algorithms, were used to obtain the experimental results of this paper: K-means, DBSCAN, Optics and Make Density Based Clusters. The results of these algorithms were analyzed by measuring run time.

Dataset

Four datasets are used in this research paper: Birds, Mall Customer, Seeds and Cars. The Birds dataset contains 12 attributes and 420 instances. The Mall Customer dataset contains 5 attributes and 200 instances. The Seeds dataset contains 8 attributes and 210 instances, and the Cars dataset contains 8 attributes and 261 instances. The datasets were taken from Kaggle and the UCI Machine Learning Repository, and they contain numeric values. Missing values were handled with a WEKA filter, which substitutes the mean or median for each missing value, so the datasets used in the experiments contain no missing values.

TABLE 1: Dataset Instances and Attributes

Dataset         Instances   Attributes
Birds           420         12
Mall Customer   200         5
Seeds           210         8
Cars            261         8

All the experiments and practical work were done in the WEKA tool. First, WEKA is opened and Explorer is clicked. The dataset is then selected through the Explorer menu and checked for missing values. After that, the algorithms are applied to it from the clustering menu, and their clusters, readings and run times are noted. This is repeated for all the datasets used in this research. The algorithm selection interface for the Birds dataset is shown in the following figures.
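The measurement described in the methodology amounts to timing how long a clusterer takes to build its model on a dataset. The following Python sketch illustrates that idea outside WEKA. It is only an illustration under stated assumptions: the small hand-written k-means, the helper names (kmeans, timed, make_points) and the synthetic two-dimensional points are all hypothetical and are not the WEKA implementation or the datasets used in this paper.

```python
import math
import random
import time


def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def make_points(n=210, seed=1):
    """Synthetic 2-D points around three centers (illustrative data only)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    points = []
    for i in range(n):
        cx, cy = centers[i % 3]
        points.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
    return points


if __name__ == "__main__":
    pts = make_points()
    (centroids, labels), seconds = timed(kmeans, pts, 3)
    print(f"k-means on {len(pts)} points took {seconds:.4f} s")
```

Run times measured this way depend on the machine and on the implementation, so they are not comparable with the values that WEKA reports.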

Fig.2 K-Means run time.

Fig.3 DEMassDBSCAN run time.

Fig.4 DBSCAN run time.

Fig.5 Optics run time.

The above four figures show the results of all four algorithms applied to the Birds dataset. The run time of each algorithm on each dataset is given below in tabular form.

TABLE 2: Datasets and Algorithm Run Time

Dataset     K-means   DBSCAN   Optics   Make Density
Birds       0.07      0.14     0.5      0.05
Car         0.02      0.03     0.23     0.02
Customer    0.05      0.06     0.31     0.02
Seed        0.05      0.03     0.41     0.03

IV. RESULTS

Table 2 shows the results of all four clustering algorithms. On the Birds dataset, Make Density Based Clusters has the best run time of all the algorithms (0.05). On the Car dataset, K-means and Make Density Based Clusters show the same run time (0.02). On the Customer dataset, Make Density Based Clusters again has the best run time (0.02). Finally, on the Seed dataset the best result is tied: DBSCAN and Make Density Based Clusters have the same run time (0.03).
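The per-dataset comparison can be checked directly from the Table 2 values. The short Python sketch below copies the run times from Table 2 and lists the fastest algorithm (or algorithms, in case of a tie) for each dataset. The names RUN_TIMES and fastest are chosen here for illustration only.

```python
# Run times copied from Table 2 (rows: datasets, columns: algorithms).
RUN_TIMES = {
    "Birds":    {"K-means": 0.07, "DBSCAN": 0.14, "Optics": 0.5,  "Make Density": 0.05},
    "Car":      {"K-means": 0.02, "DBSCAN": 0.03, "Optics": 0.23, "Make Density": 0.02},
    "Customer": {"K-means": 0.05, "DBSCAN": 0.06, "Optics": 0.31, "Make Density": 0.02},
    "Seed":     {"K-means": 0.05, "DBSCAN": 0.03, "Optics": 0.41, "Make Density": 0.03},
}


def fastest(times):
    """Return the sorted names of all algorithms sharing the lowest run time."""
    best = min(times.values())
    return sorted(name for name, t in times.items() if t == best)


if __name__ == "__main__":
    for dataset, times in RUN_TIMES.items():
        print(f"{dataset}: fastest = {', '.join(fastest(times))}")
```

Listing ties explicitly matters here because two of the four datasets have two algorithms sharing the lowest run time.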

V. CONCLUSION

This research paper shows that the run time of Make Density Based Clusters is the best of all the algorithms used. Table 2 demonstrates the difference between them. It is also concluded that the run time of K-means clustering is better than that of DBSCAN on three of the four datasets (Birds, Car and Customer).
Optics, by contrast, has the longest run time on every dataset in Table 2.

VI. REFERENCES
