
USING ENTROPY MEASURE

Lamia Abed Noor Muhammed

College of Computer and Mathematics

University of Al-Qadissiyah

lamia.abed@qu.edu.iq

ABSTRACT

Clustering is one of the techniques used for extracting structure from data, where it is useful for discovering the significant structures in the data. Data can be clustered with different clustering techniques, most of them unsupervised, so evaluation of the clustering is required. The k-means algorithm is one of the most common clustering techniques and has been evaluated with respect to many issues. This paper takes up one of these issues: the sensitivity of clustering data that contains noise features, as the number of clusters varies, with the entropy metric used to evaluate this sensitivity. Experimental work is performed by applying the k-means algorithm to the iris data; a specific procedure is executed in order to consolidate the results.

Keywords: k-means, entropy, number of clusters, clustering quality, iris data analysis

I. INTRODUCTION

In recent years, the high dimensionality of modern massive datasets has posed a considerable challenge to k-means clustering approaches. First, the curse of dimensionality can make algorithms for k-means clustering very slow, and, second, the existence of many irrelevant features may not allow the identification of the relevant underlying structure in the data [1].

Generally, feature subset selection can be viewed as the process of identifying and removing as many irrelevant and redundant features as possible. This is because, firstly, irrelevant features do not contribute to predictive accuracy, and, secondly, redundant features do not help in obtaining a better predictor, since they provide mostly information that is already present in other features [2].

Stability of a learning algorithm with respect to small input perturbations is an important property, as it implies that the derived models are robust with respect to the presence of noisy features and/or data sample fluctuations [3]. Traditionally, feature subset selection research has focused on searching for relevant features [2]. The criterion used to evaluate the "goodness" of a specific subset of features F follows either the wrapper model or the filter model. According to the former, the clustering algorithm C is evaluated through the results obtained from the application of C on the data set X, where for each point only the features in F are taken into account. According to the latter, the evaluation of a subset of features is carried out using intrinsic properties of the data prior to application of the clustering algorithm [4]. The optimality of a feature subset is measured by an evaluation criterion. As the dimensionality of a domain expands, the number of features N increases. Finding an optimal feature subset is usually intractable, and many problems related to feature selection have been shown to be NP-hard [5].

II. K-MEANS CLUSTERING

One of the most commonly used clustering algorithms is k-means clustering. K-means is an exclusive clustering algorithm, which is simple, easy to use, and very efficient in dealing with large amounts of data, with linear time complexity [6]. MacQueen first proposed the k-means clustering algorithm in 1967. The k-means algorithm is one of the popular partitioning algorithms. The idea is to classify the data into k clusters, where k is an input parameter specified in advance, through an iterative relocation technique which converges to a local minimum [7].

The k-means algorithm clusters observations into k groups, where k is provided as an input parameter. It assigns each observation to a cluster based upon the observation's proximity to the mean of that cluster. The cluster means are then recomputed and the process begins again [6], as shown in Algorithm(1).

Algorithm(1): Traditional k-means Algorithm
a) Select k points as initial centroids
b) Repeat
c) Form k clusters by assigning each point to its closest centroid
d) Recompute the centroid of each cluster, until the centroids do not change [7]
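As a minimal sketch of Algorithm(1) (our illustration, not code from the paper; the iteration cap and the empty-cluster guard are added assumptions):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Traditional k-means (Algorithm(1)) on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    # a) Select k points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):                      # b) Repeat
        # c) Form k clusters by assigning each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # d) Recompute the centroid of each cluster (keeping the old
        #    centroid if a cluster happens to become empty).
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # until centroids do not change
            break
        centroids = new_centroids
    return labels, centroids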

The objective function of the k-means algorithm is to minimize the intra-cluster distance and maximize the inter-cluster distance, based on the Euclidean distance [6]. So, to measure the quality of a clustering, the sum of squared errors (SSE) is used, i.e., the error over all points is computed as follows [9]:

E = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, m_i)^2        (1)

In the above equation, m_i is the center of cluster C_i, while d(x, m_i) is the Euclidean distance between a point x and m_i. Thus, the criterion function E attempts to minimize the distance of each point from the center of the cluster to which the point belongs [10].
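As a companion to Eq. (1), a small helper (again an illustrative sketch, not the paper's code) that evaluates the criterion E for a clustering produced by the kmeans function above:

def sse(X, labels, centroids):
    """Eq. (1): sum of squared Euclidean distances d(x, m_i)^2 of each
    point x to the center m_i of the cluster it belongs to."""
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())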

The k-means algorithm requires three user-specified parameters: the number of clusters K, the cluster initialization, and the distance metric. Each cluster is initialized with one object that serves as its center. A cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of its own cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster. Different initializations can lead to different final clusterings because k-means only converges to local minima. One way to overcome the local minima is to run the k-means algorithm, for a given K, with multiple different initial partitions and choose the partition with the smallest squared error [11].
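This restart strategy can be sketched directly on top of the two helpers above (the choice of seeds is an assumption; scikit-learn's KMeans exposes the same behaviour through its n_init parameter):

def kmeans_best_of(X, k, n_init=10):
    """Run k-means from n_init different initializations and keep
    the partition with the smallest squared error E of Eq. (1)."""
    best = None
    for seed in range(n_init):
        labels, centroids = kmeans(X, k, seed=seed)
        err = sse(X, labels, centroids)
        if best is None or err < best[0]:
            best = (err, labels, centroids)
    return best   # (smallest E, its labels, its centroids)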

III. ENTROPY

Evaluating the quality of clustering results has been a challenge in recent research. There are two kinds of clustering validation techniques, which are based on external criteria and internal criteria, respectively [12].

Entropy measures the purity of the clusters with respect to the class labels. Thus, if all clusters consist of objects with only a single class label, the entropy is 0. However, as the class labels of objects in a cluster become more varied, the entropy increases. To compute the entropy of a dataset, we first calculate the class distribution of the objects in each cluster, and then the entropy of each cluster i, as follows:

e_i = -\sum_{j} p_{ij} \log_2 p_{ij}        (2)

where the sum is taken over all classes and p_{ij} is the probability that a member of cluster i belongs to class j. The total entropy for a set of clusters is calculated as the weighted sum of the entropies of all clusters, as shown in the next equation:

e = \sum_{i=1}^{K} \frac{m_i}{m} e_i        (3)

where m_i is the size of cluster i, m is the total number of points, and K is the number of clusters.
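Eqs. (2) and (3) translate directly into code; the following helper (our sketch, assuming the class labels are given as a simple array) is reused in the experiment below:

import numpy as np

def total_entropy(labels, classes):
    """Weighted cluster entropy, Eqs. (2)-(3); 0 means every cluster
    contains objects of a single class."""
    labels = np.asarray(labels)
    classes = np.asarray(classes)
    m = len(labels)
    total = 0.0
    for i in np.unique(labels):
        members = classes[labels == i]
        m_i = len(members)
        p = np.unique(members, return_counts=True)[1] / m_i   # p_ij
        e_i = -(p * np.log2(p)).sum()                         # Eq. (2)
        total += (m_i / m) * e_i                              # Eq. (3)
    return total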

Table(1): entropy values of the clusterings obtained with different features subsets and different numbers of clusters. Rows are the fifteen non-empty subsets of {Sepal.Length, Sepal.Width, Petal.Length, Petal.Width}; columns give the entropy for two, three, four, and five clusters. [The individual cell values are not reliably recoverable from the source layout; per Table(2) they range from 0.693 to 1.67 for two clusters, 0.41 to 2.4 for three, 0.44 to 3.1 for four, and 0.56 to 3.63 for five, with the single feature Sepal.Width producing the maximum entropy in every column.]

Table(2): statistical metrics of entropy values from different numbers of clusters

metric      two clusters    three clusters    four clusters    five clusters
max             1.67             2.4               3.1              3.63
min             0.693            0.41              0.44             0.56
average         1.0468           0.975             1.1373           1.3186
Stdv.           0.3146           0.5966            0.7847           0.8771

IV. EXPERIMENT WORK

The experiment work was carried out in several steps. The first step is to choose suitable and well-known data to deal with k-means, so the "iris" data set was selected.


A. IRIS DATA

This is perhaps the best-known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. There are four variables (usable as features) and one variable as class label, as follows: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species as the class variable. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other. Predicted attribute: class of iris plant.
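The data set is bundled with scikit-learn, so an experiment can start from something like the following (a sketch assuming scikit-learn is installed):

from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X_iris = iris.data        # 150 x 4 feature table
y_iris = iris.target      # class label: 0, 1, 2 (one value per species)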

B. EXPERIMENTAL PROCEDURE

The experiment in this paper was done through a specific procedure. The k-means algorithm was executed for many runs, the runs being performed with different subsets of features. With each subset of features, the k-means algorithm was executed four times, each run with a different number of clusters (two, three, four, and five). The entropy of the resulting clusters was then calculated based on the class label available with the data, as shown in Table(1). The entropy value increases as the number of classes in an individual cluster increases, which indicates impurity of this cluster, and vice versa.
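The whole procedure (fifteen feature subsets, four cluster counts, entropy computed from the class labels) can be sketched as follows, combining scikit-learn's KMeans with the total_entropy helper defined above; the random seed is an assumption:

from itertools import combinations
from sklearn.cluster import KMeans

features = list(X_iris.columns)
for r in range(1, len(features) + 1):
    for subset in combinations(features, r):      # the 15 feature subsets
        for k in (2, 3, 4, 5):                    # the four cluster counts
            labels = KMeans(n_clusters=k, n_init=10, random_state=0) \
                         .fit_predict(X_iris[list(subset)])
            print(subset, k, round(total_entropy(labels, y_iris), 3))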

C. DISCUSSION

The disparity of the entropy results when executing with different subsets of features indicates the effect of the selected features on clustering. As a result, the significant results make it possible to determine the noise features, namely those whose inclusion increases the entropy values. It is obvious that the two features "Sepal.Length" and "Sepal.Width" are noise features, because the resulting entropy values are high in comparison with those of the other features, "Petal.Length" and "Petal.Width".

On the other side, the results in Table(1) reveal that the number of clusters plays a role in the significance of the entropy values across different runs, as shown in Table(2), which reports statistical metrics over the runs grouped by the number of clusters, and as shown in Figure(1).

Figure(1): entropy values with different features subsets using (a) two clusters, (b) three clusters, (c) four clusters, (d) five clusters.

V. CONCLUSION

This paper sheds light on the sensitivity of clustering to noise features by measuring the entropy of the resulting clusters. Uncovering noise features through clustering can be exploited by executing a preprocessing step on a small sample of the data, as a bootstrap step, in order to determine these features and then exclude them in the next step when performing the main clustering.

REFERENCES

[1] C. Boutsidis, M. W. Mahoney, and P. Drineas, "Unsupervised feature selection for the k-means clustering problem", in Advances in Neural Information Processing Systems 22: Proceedings of the 2009 Conference, 2009.

[2] G. Pawan, J. Susheel, and J. Anurag, "A Review of Fast Clustering-Based Feature Subset Selection Algorithm", International Journal of Scientific & Technology Research, Volume 3, Issue 11, November 2014, pp. 86-91.


[3] "k-means clustering stability: theoretical analysis and an algorithm", Data Mining and Knowledge Discovery, Volume 28, Issue 4, July 2014, pp. 918-960.

[4] T. Sergios and K. Konstantinos, "Pattern Recognition", fourth edition, Academic Press, page 823.

[5] R. Sabitha and S. Karthik, "Performance assessment of feature selection methods using k-means on Adult data set", Research Journal of Computer Systems Engineering (RJCSE), Vol. 04, Special Issue, June 2013.

[6] M. R. Sindhu and S. Rachna, "Clustering Algorithm Technique", International Journal of Enhanced Research in Management and Computing Applications, Vol. 2, Issue 3, March 2013.

[7] M. P. S. Bhatia and Deepika Khurana, "Experimental study of Data clustering using k-Means and modified algorithms", International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 3, No. 3, May 2013, pp. 17-30.

[8] I. R. Swapnil and S. Amit M., "Clustering Techniques", International Journal of Advanced Research in Computer Science, Volume 4, No. 6 (Special Issue), May 2013, pp. 5-9.

[9] T. Pang-Ning, S. Michael, and K. Vipin, "Introduction to Data Mining", Addison-Wesley, 2006, pp. 487-568.

[10] H. Maria, B. Yannis, and V. Michalis, "On Clustering Validation Techniques", Journal of Intelligent Information Systems, 17:2/3, pp. 107-145, 2001, Kluwer Academic Publishers.

[11] A. K. Jain, "Data clustering: 50 years beyond K-means", Pattern Recognition Letters 31 (2010), pp. 651-666.

[12] X. Hui, W. Junjie, and C. Jian, "K-means Clustering versus Validation Measures: A Data Distribution Perspective", KDD'06, August 20-23, 2006, Philadelphia, Pennsylvania, USA.

[13] E. Rendón and others, "A comparison of internal and external cluster validation indexes", Proceedings of the 2011 American Conference on Applied Mathematics and the 5th WSEAS International Conference on Computer Engineering and Applications, pp. 158-163.
