
International Journal of Scientific Research Engineering & Technology (IJSRET), ISSN 2278 0882

Volume 5, Issue 4, April 2016

UNCOVERING FEATURES NOISE WITH K-MEANS ALGORITHM USING ENTROPY MEASURE
Lamia Abed Noor Muhammed
College of Computer and Mathematics
University of Al-Qadissiyah
lamia.abed@qu.edu.iq

ABSTRACT
Clustering is one of the techniques used for extracting
significant structures from data. Data can be clustered
with different techniques, most of which are
unsupervised, so the clustering must be evaluated. The
k-means algorithm is one of the most common clustering
techniques and has been evaluated against many issues.
This paper takes up one of these issues: the
sensitivity of clustering to noise features in the
data, considered against the number of clusters and
evaluated with the entropy metric. The experimental
work applies the k-means algorithm to the iris data,
following a specific procedure in order to consolidate
the results.
Keywords: k-means, entropy, number of clusters,
clustering quality, iris data analysis

I. INTRODUCTION
In recent years, the high dimensionality of the modern
massive datasets has provided a considerable challenge
to k-means clustering approaches. First, the curse of
dimensionality can make algorithms for k-means
clustering very slow, and, second, the existence of
many irrelevant features may not allow the
identification of the relevant underlying structure in
the data [1].
In general, feature subset selection can be viewed as
the process of identifying and removing as many
irrelevant and redundant features as possible. This is
because, first, irrelevant features do not contribute
to predictive accuracy and, second, redundant features
do not help in obtaining a better predictor, since they
mostly provide information that is already present in
other features [2].
Stability of a learning algorithm with respect to small
input perturbations is an important property, as it
implies that the derived models are robust with respect
to the presence of noisy features and/or data sample
fluctuations [3]. Traditionally, the feature subset
selection research has focused on searching for
relevant features [2]. The criterion used to evaluate
the "goodness" of a specific subset of features follows
either the wrapper model or the filter model.
According to the former, the clustering algorithm, C, is
first chosen and a subset of features, F, is evaluated
through the results obtained from the application of C
to the data set X, where for each point only the
features in F are taken into account. According to the
latter, the evaluation of a subset of features is carried
out using intrinsic properties of the data prior to the
application of the clustering algorithm [4]. The
optimality of a feature subset is measured by an
evaluation criterion. As the dimensionality of a domain
expands, the number of features N increases. Finding
an optimal feature subset is usually intractable, and
many problems related to feature selection have been
shown to be NP-hard [5].

II. K-MEANS CLUSTERING

K-means is the most popular partitional algorithm among
the various clustering algorithms. It is an exclusive
clustering algorithm that is simple, easy to use, and
very efficient in dealing with large amounts of data,
with linear time complexity [6]. MacQueen first proposed
the k-means clustering algorithm in 1967. The idea is to
classify the data into k clusters, where k is an input
parameter specified in advance, through an iterative
relocation technique that converges to a local minimum [7].
The k-means algorithm clusters observations into k groups.
It assigns each observation to a cluster based on the
observation's proximity to the mean of the cluster. The
cluster means are then recomputed and the process
begins again [6], as shown in Algorithm(1).
Algorithm(1): Traditional k-means Algorithm
a) Select k points as initial centroids
b) Repeat
c) Form k clusters by assigning each point to its closest centroid
d) Recompute the centroid of each cluster
e) Until the centroids do not change [7]
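The steps of Algorithm(1) can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the paper: the function name kmeans, the max_iter safeguard, the choice to leave the centroid of an empty cluster unchanged, and the toy points are all assumptions introduced here.

```python
from math import dist


def kmeans(points, initial_centroids, max_iter=100):
    """Basic k-means loop: assign points, recompute centroids, repeat."""
    # Start from the k points supplied as initial centroids.
    centroids = [tuple(c) for c in initial_centroids]
    assignment = []
    for _ in range(max_iter):  # max_iter is an assumed safeguard
        # Form k clusters by assigning each point to its closest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Recompute the centroid of each cluster as the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(old)
        # Stop when the centroids do not change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment


# Toy data: two visibly separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, initial_centroids=points[:2])
print(assignment)
print(centroids)
```

Although both initial centroids are taken from the first group, the relocation loop moves one of them to the second group within a few iterations.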
The objective of the k-means algorithm is to minimize the
intra-cluster distance and maximize the inter-cluster
distance, based on the Euclidean distance [6]. To measure
the quality of a clustering, the sum of squared errors
(SSE) is used, i.e. the error for each point is computed.
The SSE is formally defined as follows [9]:

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, m_i)^2 \qquad (1)$$

In the above equation, m_i is the center of cluster C_i,
while d(x, m_i) is the Euclidean distance between a
point x and m_i. Thus, the criterion function E attempts
to minimize the distance of each point from the center
of the cluster to which the point belongs [10].
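As a small worked example of the SSE criterion, the sketch below sums the squared Euclidean distances d(x, m_i) over all points of all clusters. The helper name sse and the toy clusters and centers are assumptions made for illustration only.

```python
from math import dist


def sse(clusters, centers):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    return sum(
        dist(x, m) ** 2
        for members, m in zip(clusters, centers)
        for x in members
    )


# Two toy clusters; every point lies at distance 1 from its center.
clusters = [[(1.0, 1.0), (1.0, 3.0)], [(5.0, 5.0), (7.0, 5.0)]]
centers = [(1.0, 2.0), (6.0, 5.0)]
print(sse(clusters, centers))  # 4.0
```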
The k-means algorithm requires three user-specified
parameters: the number of clusters K, the cluster
initialization, and the distance metric. Each cluster is
initialized with one object that serves as its center. A
cluster is a set of objects such that an object in a
cluster is closer (more similar) to the center of that
cluster than to the center of any other cluster. The
center of a cluster is often a centroid, the average of
all the points in the cluster. Different initializations
can lead to different final clusterings because k-means
only converges to local minima. One way to overcome the
local minima is to run the k-means algorithm, for a given
K, with multiple different initial partitions and choose
the partition with the smallest squared error [11].
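The restart strategy described above can be sketched as follows. This is only an illustration under stated assumptions: the initial centroids are drawn at random from the data points, the names kmeans and best_of_restarts, the number of restarts, and the seed are all hypothetical, and a compact k-means routine is repeated here so that the block runs on its own.

```python
import random
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Plain k-means; returns (centroids, assignment, squared error)."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iter):
        assignment = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            updated.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            )
        if updated == centroids:
            break
        centroids = updated
    error = sum(dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
    return centroids, assignment, error


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run k-means from several random initializations; keep the smallest error."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    runs = [kmeans(points, rng.sample(points, k)) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])


# Toy data: three groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
centroids, assignment, error = best_of_restarts(points, k=3)
print(round(error, 3))
```

Selecting by the smallest squared error does not guarantee the global optimum; it only keeps the best of the local minima that were reached.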

III. ENTROPY

Evaluating the quality of clustering results is still a
challenge in recent research. There are two kinds of
clustering validation techniques, based on external
criteria and internal criteria respectively [12].
Entropy measures the purity of the clusters with respect
to class labels. Thus, if every cluster consists of
objects with only a single class label, the entropy is 0.
However, as the class labels of the objects in a cluster
become more varied, the entropy increases. To compute the
entropy of a dataset, we need to calculate the class
distribution of the objects in each cluster as follows:

$$e_j = -\sum_{i} p_{ij} \log(p_{ij}) \qquad (2)$$

where p_{ij} is the proportion of objects in cluster j
that belong to class i, and the sum is taken over all
classes. The total entropy for a set of clusters is
calculated as the weighted sum of the entropies of all
clusters, as shown in the next equation:

$$e = \sum_{j=1}^{m} \frac{n_j}{n} e_j \qquad (3)$$

where n_j is the size of cluster j, m is the number of
clusters, and n is the total number of data points [13].

IV. EXPERIMENT WORK

Table(1) entropy values for different runs with different
features subsets and different numbers of clusters

features subsets used                                 no. of clusters
                                                      two      three    four     five
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width  0.91     0.74     0.84     0.83
Sepal.Length, Sepal.Width, Petal.Length               0.91     0.89     0.84     1.17
Sepal.Length, Sepal.Width, Petal.Width                0.95     1.465    1.72     1.351
Sepal.Length, Petal.Length, Petal.Width               1.09     0.74     0.74     0.74
Sepal.Width, Petal.Length, Petal.Width                0.78     0.42     0.48     0.56
Sepal.Length, Sepal.Width                             1.41     1.16     1.62     2.25
Sepal.Length, Petal.Length                            0.95     0.94     1.03     0.81
Sepal.Length, Petal.Width                             1.33     1.3      1.34     1.48
Sepal.Width, Petal.Width                              0.693    0.73     0.64     0.919
Sepal.Width, Petal.Length                             0.78     0.5      0.69     1.37
Petal.Length, Petal.Width                             0.78     0.44     0.44     0.69
Sepal.Length                                          1.61     1.99     2.46     2.63
Sepal.Width                                           1.67     2.4      3.1      3.63
Petal.Length                                          0.78     0.5      0.68     0.67
Petal.Width                                           1.06     0.41     0.44     0.68

Table(2) statistical metrics of entropy values
from different number of clusters

no. of clusters   max     min     average   Stdv.
two               1.67    0.693   1.0468    0.3146
three             2.4     0.41    0.975     0.5966
four              3.1     0.44    1.1373    0.7847
five              3.63    0.56    1.3186    0.8771

The work in this paper was carried out in several steps.
The first step was to choose a suitable, well-known data
set to use with k-means, so the "iris" data set was
selected.

In the iris data, one class is linearly separable from the
other two; the latter are not linearly separable from each
other. The predicted attribute is the class of iris plant.

B. EXPERIMENTAL PROCEDURE
The experiment in this paper followed a specific
procedure. The k-means algorithm was executed for many
runs, organized by subset of features. With each subset
of features, the k-means algorithm was executed four
times, each run with a different number of clusters (two,
three, four, or five). The entropy of the resulting
clusters was then calculated based on the class labels
available with the data, as shown in Table(1). The entropy
value increases as the number of classes present in an
individual cluster increases, which indicates impurity of
that cluster, and vice versa.
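The procedure above can be sketched as an enumeration of runs plus an entropy score for each run. The sketch below is an illustration only: the function names are hypothetical, the natural logarithm is an assumption (the paper does not state the base), the total entropy is the size-weighted sum described in the text, and the clustering step itself is left out, with a hand-made cluster assignment standing in for a k-means result.

```python
from collections import Counter
from itertools import combinations
from math import log

FEATURES = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]
CLUSTER_COUNTS = (2, 3, 4, 5)


def feature_subsets(features):
    """All non-empty feature subsets, largest first."""
    return [
        subset
        for size in range(len(features), 0, -1)
        for subset in combinations(features, size)
    ]


def cluster_entropy(class_labels):
    """Entropy of the class distribution inside one cluster (natural log assumed)."""
    total = len(class_labels)
    return -sum(
        (count / total) * log(count / total)
        for count in Counter(class_labels).values()
    )


def total_entropy(assignment, class_labels):
    """Size-weighted sum of the per-cluster entropies."""
    clusters = {}
    for cluster_id, label in zip(assignment, class_labels):
        clusters.setdefault(cluster_id, []).append(label)
    n = len(class_labels)
    return sum(
        len(members) / n * cluster_entropy(members)
        for members in clusters.values()
    )


# One run per (feature subset, number of clusters) pair.
runs = [(subset, k) for subset in feature_subsets(FEATURES) for k in CLUSTER_COUNTS]
print(len(runs))  # 60

# Hand-made stand-in for a clustering result: cluster 0 is pure,
# cluster 1 mixes two classes evenly.
labels = ["setosa"] * 4 + ["versicolor"] * 2 + ["virginica"] * 2
assignment = [0] * 4 + [1] * 4
print(round(total_entropy(assignment, labels), 3))  # 0.347
```

A pure cluster contributes zero, so the score of the stand-in assignment comes entirely from the mixed cluster.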


C. DISCUSSION
The disparity of the entropy results obtained with
different subsets of features indicates the effect of the
selected features on clustering. The results therefore
help to identify the noise features, which are those that
increase the entropy values. It is evident that the two
features "Sepal.Length" and "Sepal.Width" are noise
features, because the resulting entropy values are high
in comparison with those of the other features,
"Petal.Length" and "Petal.Width".
On the other hand, the results in Table(1) reveal that the
number of clusters plays a role in the differences in
entropy values between runs, as shown in Table(2), which
gives statistical metrics of the runs for each number of
clusters, and in Figure(1).
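The statistical metrics of Table(2) can be recomputed from a column of Table(1). The sketch below does this for the two-cluster runs, with the fifteen entropy values transcribed from Table(1) in row order. Using the sample standard deviation (statistics.stdev) is an assumption; it appears to agree with the reported Stdv. value.

```python
from statistics import mean, stdev

# Entropy values of the fifteen two-cluster runs, transcribed from Table(1).
two_cluster_entropy = [0.91, 0.91, 0.95, 1.09, 0.78, 1.41, 0.95, 1.33,
                       0.693, 0.78, 0.78, 1.61, 1.67, 0.78, 1.06]

summary = {
    "max": max(two_cluster_entropy),
    "min": min(two_cluster_entropy),
    "average": mean(two_cluster_entropy),
    "stdev": stdev(two_cluster_entropy),  # sample standard deviation (assumed)
}
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Table(2) reports 1.67, 0.693, 1.0468, and 0.3146 for these runs; the recomputed average agrees up to rounding in the last digit.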

Figure(1) entropy values with different features subsets using
(a) two clusters, (b) three clusters, (c) four clusters, (d) five clusters

V. CONCLUSION

The results obtained from the work in this paper shed
light on the sensitivity of clustering to noise features,
measured through the entropy of the resulting clusters.
Noise features can be uncovered through clustering in a
preprocessing step, run on a small sample of the data as
a bootstrap step, in order to identify these features and
then exclude them in the next step, when the main
clustering is performed.
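One possible shape for such a preprocessing step is sketched below. It is a hedged illustration, not the procedure of the paper: the function names are hypothetical, and the rule that flags a feature as noise when its entropy exceeds the mean entropy is an assumption, since no cut-off is specified. The pilot entropy values are the single-feature, three-cluster entries of Table(1).

```python
def split_noise_features(entropy_by_feature, threshold=None):
    """Return (kept, noise); a feature is noise when its entropy exceeds the threshold."""
    if threshold is None:
        # Assumption: use the mean entropy as the cut-off.
        threshold = sum(entropy_by_feature.values()) / len(entropy_by_feature)
    noise = [f for f, e in entropy_by_feature.items() if e > threshold]
    kept = [f for f in entropy_by_feature if f not in noise]
    return kept, noise


def project(rows, kept):
    """Keep only the retained features of each record before the main clustering."""
    return [{f: row[f] for f in kept} for row in rows]


# Single-feature entropy values for three clusters, taken from Table(1).
pilot_entropy = {
    "Sepal.Length": 1.99,
    "Sepal.Width": 2.4,
    "Petal.Length": 0.5,
    "Petal.Width": 0.41,
}
kept, noise = split_noise_features(pilot_entropy)
print(noise)  # ['Sepal.Length', 'Sepal.Width']

# One illustrative record with all four features.
rows = [{"Sepal.Length": 5.1, "Sepal.Width": 3.5,
         "Petal.Length": 1.4, "Petal.Width": 0.2}]
print(project(rows, kept))  # [{'Petal.Length': 1.4, 'Petal.Width': 0.2}]
```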

A. IRIS DATA
This is perhaps the best known database to be found in
the pattern recognition literature. Fisher's paper is a
classic in the field and is referenced frequently to this
day. There are four variables (which can be used as
features) and one variable used as the class label, as
follows: Sepal.Length, Sepal.Width, Petal.Length,
Petal.Width, and Species as the class variable. The data
set contains 3 classes of 50 instances each, where each
class refers to a type of iris plant. One class is
linearly separable from the other two.

REFERENCES

[1] C. Boutsidis, M.W. Mahoney, and P. Drineas,
"Unsupervised feature selection for the k-means
clustering problem", in Annual Advances in Neural
Information Processing Systems 22: Proceedings of
the 2009 Conference, 2009.
[2] G. Pawan, J. Susheel, J. Anurag, "A Review of
Fast Clustering-Based Feature Subset Selection
Algorithm", International Journal of Scientific &
Technology Research, Volume 3, Issue 11, November
2014, pp. 86-91.


[3] M. Dimitrios and M. Elena, "Feature selection for
k-means clustering stability: theoretical analysis and
an algorithm", Data Mining and Knowledge Discovery,
July 2014, Volume 28, Issue 4, pp. 918-960.
[4] T. Sergios and K. Konstantinos, "Pattern
Recognition", fourth edition, Academic Press, page 823.
[5] R. Sabitha, S. Karthik, "Performance assessment
of feature selection methods using k-means on adult
set", Research Journal of Computer Systems
Engineering (RJCSE), Vol. 04, Special Issue, June 2013.
[6] M. R. Sindhu, S. Rachna, "Clustering Algorithm
Technique", International Journal of Enhanced Research
in Management and Computing Applications, Vol. 2,
Issue 3, March 2013.
[7] M.P.S Bhatia and Deepika Khurana, "Experimental
study of Data clustering using k-Means and modified
algorithms", International Journal of Data Mining &
Knowledge Management Process (IJDKP), Vol. 3, No. 3,
May 2013, pp. 17-30.
[8] I. R. Swapnil, S. Amit M., "Clustering Techniques",
International Journal of Advanced Research in Computer
Science, Volume 4, No. 6, May 2013 (Special Issue),
pp. 5-9.
[9] T. Pang-Ning, S. Michael, and K. Vipin,
"Introduction to Data Mining", Addison-Wesley, 2006,
pp. 487-568.
[10] H. Maria, B. Yannis, and V. Michalis, "On
Clustering Validation Techniques", Journal of
Intelligent Information Systems, 17:2/3, pp. 107-145,
2001, Kluwer Academic Publishers.
[11] A.K. Jain, "Data clustering: 50 years beyond
K-means", Pattern Recognition Letters 31 (2010),
pp. 651-666.
[12] X. Hui, W. Junjie, C. Jian, "K-means Clustering
versus Validation Measures: A Data Distribution
Perspective", KDD'06, August 20-23, 2006,
Philadelphia, Pennsylvania, USA.
[13] R. Eréndira and others, "A comparison of
internal and external cluster validation indexes",
Proceedings of the 2011 American conference of
applied mathematics and the 5th WSEAS International
conference on computer engineering and applications,
pp. 158-163.

www.ijsret.org
