A New Approach on K-Means Clustering

This paper explores the application of feature extraction to obtain the necessary features using k-means clustering. The main goal of research on feature extraction using k-means is to find the best features from the cluster analysis. The implementation can also be carried out using a genetic algorithm (GA); here the problem is solved using MATLAB. The k-means clustering process for feature extraction gives accuracy almost equal to that of Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Although it is an unsupervised learning method, it can be used to partition the data into groups before the dataset is classified into different classes, which improves efficiency with respect to the number of objects and attributes. The same logic can be developed in a genetic algorithm (GA) and may give better accuracy.

More info:

Published by: ijcsis on Feb 19, 2012
Copyright: Attribution Non-commercial

(IJCSIS) International Journal of Computer Science and Information Security, 2011
A NEW APPROACH ON K-MEANS CLUSTERING
Trilochan Rout (1), Srikanta Kumar Mohapatra (2), Jayashree Mohanty (3), Sushant Ku. Kamillla (4), Susant K. Mohapatra (5)
1,2,3 - Computer Science and Engineering Dept., NMIET, Bhubaneswar, Orissa, India
4 - Dept. of Physics, ITER, Bhubaneswar, Orissa, India
5 - Chemical and Materials Engineering/MS 388, University of Nevada, Reno, NV 89557, USA
Abstract
This paper explores the application of feature extraction to obtain the necessary features using k-means clustering. The main goal of research on feature extraction using k-means is to find the best features from the cluster analysis. The implementation can also be carried out using a genetic algorithm (GA); here the problem is solved using MATLAB. The k-means clustering process for feature extraction gives accuracy almost equal to that of Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Although it is an unsupervised learning method, it can be used to partition the data into groups before the dataset is classified into different classes, which improves efficiency with respect to the number of objects and attributes. The same logic can be developed in a genetic algorithm (GA) and may give better accuracy.
Keywords: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Genetic algorithm (GA).
I. INTRODUCTION
The need to understand large, complex, information-rich data sets is common to virtually all fields of business, science, and engineering. In the business world, corporate and customer data are becoming recognized as a strategic asset. The ability to extract useful knowledge hidden in these data and to act on that knowledge is becoming increasingly important in today's competitive world. For industry, therefore, mining data is important for decision making.

Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection, and customer retention to production control and science exploration. Data mining is used mainly in statistical pattern classification. Statistical pattern classification deals with classifying objects into different categories, based on certain observations made on the objects. The information available about an object is expressed in terms of certain measurements made on the object, known as the features or the attribute set of the object.

In many applications, the data that is the subject of analysis and processing in data mining is multidimensional and is represented by a number of features. The so-called curse of dimensionality, pertinent to many learning algorithms, denotes the drastic rise in computational complexity and classification error for data with a high number of dimensions. Hence, the dimensionality of the feature space is often reduced before classification is undertaken. Feature extraction and feature selection are used to reduce the dimension of the dataset. Feature extraction involves the production of a new set of features from the original features in the data, through the application of some mapping. Feature selection involves the selection of important attributes or features from the data set in order to classify the data present in the data set.
 
Well-known unsupervised feature extraction methods include Principal Component Analysis (PCA) and k-means clustering. The important corresponding supervised approach is Linear Discriminant Analysis (LDA).
http://sites.google.com/site/ijcsis/ ISSN 1947-5500
The primary purpose of this work is to develop an efficient method of feature extraction for reducing the dimension. To this end, a new approach to k-means clustering for feature extraction has been developed. This method extracts features on the basis of the cluster centers.
 
1.2 Motivation
In the field of data mining, feature extraction has many applications, such as dimension reduction, pattern classification, data visualization, and automatic exploratory data analysis. Extracting proper features from a rich data set is the major issue. Much work has been done before on reducing dimension, mainly using PCA and LDA. Identification of important attributes or features has been a major area of research for the last several years. This work aims to give a new solution to some long-standing needs of feature extraction and to pursue a new approach to dimension reduction. PCA finds a set of the most representative projection vectors such that the projected samples retain the most information about the original samples. LDA uses the class information and finds a set of vectors that maximize the between-class scatter while minimizing the within-class scatter. Clustering is another technique, which forms groups of the objects present in the dataset. The cluster centers also make it possible to find the necessary features of the data set. The present work uses this new approach to extracting features.
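To make the PCA description above concrete, the following is a minimal sketch of projecting samples onto the most representative directions. It is not taken from the paper: the function name pca_project, the use of NumPy's SVD, and the sample values are all assumptions chosen for illustration.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal directions (illustrative helper)."""
    # Center each attribute so the directions describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # The projected samples are the new, lower-dimensional features.
    return X_centered @ components.T, components

# Made-up data: 5 objects described by 3 attributes.
X = np.array([
    [2.0, 4.1, 1.0],
    [3.0, 6.0, 1.2],
    [4.0, 7.9, 0.9],
    [5.0, 10.1, 1.1],
    [6.0, 12.0, 1.0],
])
Z, components = pca_project(X, n_components=2)
print(Z.shape)  # (5, 2)
```

The first projected column carries at least as much variance as the second, which is the sense in which the projection "retains the most information" about the original samples.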
 
2. An Overview of Data Mining and Knowledge Discovery
Data mining is an iterative process within which progress is defined by discovery, through either automatic or manual methods. Data mining is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an "interesting" outcome. Data mining is the search for new, valuable, and nontrivial information in large volumes of data. It is a cooperative effort of humans and computers. Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers.
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.
As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages.
In machine learning, clustering is an example of unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases.
 
Two types of clustering algorithms are nonhierarchical and hierarchical. In nonhierarchical clustering, such as the k-means algorithm, the relationship between clusters is undetermined. Hierarchical clustering repeatedly links pairs of clusters until every data object is included in the hierarchy. With both of these approaches, an important issue is how to determine the similarity between two objects, so that clusters can be formed from objects with a high similarity to each other. Commonly, distance functions, such as the Manhattan and Euclidean distance functions, are used to determine similarity. A distance function yields a higher value for pairs of objects that are less similar to one another. Sometimes a similarity function is used instead, which yields higher values for pairs that are more similar.
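As a small illustration of the distance and similarity functions just described, the sketch below computes the Manhattan and Euclidean distances between two objects. The similarity function shown, 1 / (1 + distance), is only one hypothetical choice and is not taken from the paper; the sample points are made up.

```python
import math

def manhattan(p, q):
    """Sum of absolute attribute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Square root of the sum of squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def similarity(p, q):
    """Hypothetical similarity in (0, 1]: larger when the objects are closer."""
    return 1.0 / (1.0 + euclidean(p, q))

a, b, c = (1.0, 2.0), (4.0, 6.0), (1.5, 2.5)
print(manhattan(a, b))  # 7.0
print(euclidean(a, b))  # 5.0
# c is closer to a than b is, so it gets a lower distance and a higher similarity.
print(euclidean(a, c) < euclidean(a, b), similarity(a, c) > similarity(a, b))  # True True
```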
 
Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. The computational task of classifying the data set into k clusters is often referred to as k-clustering.
 
Simply speaking, k-means clustering is an algorithm to classify or group objects, based on their attributes or features, into K groups, where K is a positive integer. The grouping is done by minimizing the sum of squares of distances between the data and the corresponding cluster centroid. Thus the purpose of k-means clustering is to classify the data.
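The quantity being minimized can be sketched as follows. This is an illustrative computation under assumed names (within_cluster_ss, centroids_for) and made-up data, not the paper's implementation: it compares the sum of squared distances for a sensible grouping and for a poor one.

```python
import numpy as np

def centroids_for(X, labels, k):
    """Mean of the objects assigned to each of the k groups (illustrative helper)."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

def within_cluster_ss(X, labels, centroids):
    """Sum of squared distances between each object and its own cluster centroid."""
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Made-up data: 4 objects with 2 attributes each.
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
good = np.array([0, 0, 1, 1])  # nearby objects grouped together
bad = np.array([0, 1, 0, 1])   # distant objects grouped together

print(within_cluster_ss(X, good, centroids_for(X, good, 2)))  # 1.5
print(within_cluster_ss(X, bad, centroids_for(X, bad, 2)))    # 15.5
```

The grouping with the smaller value is the one k-means prefers.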
 
4.3 K-means algorithm:
The basic step of k-means clustering is simple. In the beginning we determine the number of clusters K and assume the centroid, or center, of each cluster. We can take any random objects as the initial centroids, or the first K objects in sequence can also serve as the initial centroids. Then the k-means algorithm repeats the three steps below until convergence.
Iterate until stable (= no object moves group):
1. Determine the centroid coordinates
2. Determine the distance of each object to the centroids
3. Group the objects based on minimum distance
Fig 4.1: Flow chart for finding Cluster
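The iteration described above can be sketched in a few lines. This is a minimal illustration, not the authors' MATLAB implementation: the function name k_means, the max_iter safeguard, the handling of empty groups, and the sample data are assumptions. It uses the first K objects as the initial centroids, one of the two options the text mentions.

```python
import numpy as np

def k_means(X, k, max_iter=100):
    """Basic k-means; max_iter is an added safeguard, not part of the description."""
    X = np.asarray(X, dtype=float)
    # Initial centroids: the first k objects in sequence.
    centroids = X[:k].copy()
    labels = None
    for _ in range(max_iter):
        # Distance of each object to each centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Group each object with its nearest centroid.
        new_labels = dists.argmin(axis=1)
        # Stable: no object moved to a different group.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a group is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Made-up data: 4 objects with 2 attributes each.
X = [[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]]
labels, centroids = k_means(X, k=2)
print(labels.tolist())     # [0, 0, 1, 1]
print(centroids.tolist())  # [[1.5, 1.0], [4.5, 3.5]]
```

On this data the assignment changes once (the second object moves to the first group) and then stays fixed, at which point the loop stops.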
1. Initial value of centroids: Assign the first k objects as the initial clusters; their centroids can be found by directly assigning their attribute values initially.
2. Objects-centroids distance: We calculate the distance between each cluster centroid and each object using the Euclidean distance. The Euclidean distance between points $P = (p_1, p_2, \ldots, p_n)$ and $Q = (q_1, q_2, \ldots, q_n)$ in Euclidean n-space is defined as:
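For reference, the standard definition of the Euclidean distance between these two points is given below. The symbol $d(P, Q)$ is a notational assumption and may differ from the notation used in the original paper.

```latex
d(P, Q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}
        = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```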
 