Clustering Survey

Published by: Shanti Prasad on Dec 23, 2009
Copyright:Attribution Non-commercial







Survey of Clustering Data Mining Techniques
Pavel Berkhin
 Accrue Software, Inc
Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others.

Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and were successfully applied to real-life data mining problems. They are the subject of this survey.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning – Concept learning; I.4.6 [Image Processing]: Segmentation; I.5.1 [Pattern Recognition]: Models; I.5.3 [Pattern Recognition]: Clustering.

General Terms: Algorithms, Design

Additional Key Words and Phrases: Clustering, partitioning, data mining, unsupervised learning, descriptive learning, exploratory data analysis, hierarchical clustering, probabilistic clustering, k-means
Author's address: Pavel Berkhin, Accrue Software, 1045 Forest Knoll Dr., San Jose, CA, 95129; e-mail: pavelb@accrue.com

1. Introduction
1.1. Notations
1.2. Clustering Bibliography at Glance
1.3. Classification of Clustering Algorithms
1.4. Plan of Further Presentation
2. Hierarchical Clustering
2.1. Linkage Metrics
2.2. Hierarchical Clusters of Arbitrary Shapes
2.3. Binary Divisive Partitioning
2.4. Other Developments
3. Partitioning Relocation Clustering
3.1. Probabilistic Clustering
3.2. k-Medoids Methods
3.3. k-Means Methods
4. Density-Based Partitioning
4.1. Density-Based Connectivity
4.2. Density Functions
5. Grid-Based Methods
6. Co-Occurrence of Categorical Data
7. Other Clustering Techniques
7.1. Constraint-Based Clustering
7.2. Relation to Supervised Learning
7.3. Gradient Descent and Artificial Neural Networks
7.4. Evolutionary Methods
7.5. Other Developments
8. Scalability and VLDB Extensions
9. Clustering High Dimensional Data
9.1. Dimensionality Reduction
9.2. Subspace Clustering
10. General Algorithmic Issues
10.1. Assessment of Results
10.2. How Many Clusters?
10.3. Data Preparation
10.4. Proximity Measures
10.5. Handling Outliers
Acknowledgements
References

1. Introduction
The goal of this survey is to provide a comprehensive review of different clustering techniques in data mining. Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar between themselves and dissimilar to objects of other groups. Representing data by fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification. It represents many data objects by few clusters, and hence, it models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. Therefore, clustering is unsupervised learning of a hidden data concept. Data mining deals with large databases that impose on clustering analysis additional severe computational requirements. These challenges led to the emergence of powerful, broadly applicable data mining clustering methods surveyed below.
1.1. Notations
To fix the context and to clarify prolific terminology, we consider a dataset $X$ consisting of data points (or synonymously, objects, instances, cases) $x_i = (x_{i1}, \dots, x_{id}) \in A$ in attribute space $A$, where $i = 1, \dots, N$, and each component $x_{il} \in A_l$ is a numerical or nominal categorical attribute (or synonymously, feature). For a discussion of attribute data types see [Han & Kamber 2001]. Such point-by-attribute data format conceptually corresponds to an $N \times d$ matrix and is used by the majority of algorithms reviewed below. However, data of other formats, such as variable length sequences and heterogeneous data, is becoming more and more popular. The simplest attribute space subset is a direct Cartesian product of sub-ranges $C = \prod_{l=1}^{d} C_l \subset A$, $C_l \subset A_l$, called a segment (also cube, cell, region). A unit is an elementary segment whose sub-ranges consist of a single category value, or of a small numerical bin. Describing the numbers of data points per every unit represents an extreme case of clustering, a histogram, where no actual clustering takes place. This is a very expensive representation, and not a very revealing one. User driven segmentation is another commonly used practice in data exploration that utilizes expert knowledge regarding the importance of certain sub-domains. We distinguish clustering from segmentation to emphasize the importance of the automatic learning process.
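As a rough illustration of counting data points per elementary segment, the following Python sketch maps each point of a small point-by-attribute dataset to the elementary segment that contains it and tallies the counts. The toy data, the bin width `BIN_WIDTH`, and the helper `unit_of` are assumptions made for this example only; they do not come from the survey.

```python
from collections import Counter

# Hypothetical toy dataset: 6 points with 2 attributes each
# (attribute 1 is numerical, attribute 2 is nominal categorical).
X = [
    (0.5, "red"),
    (1.2, "red"),
    (1.7, "blue"),
    (3.1, "blue"),
    (3.4, "blue"),
    (0.9, "red"),
]

BIN_WIDTH = 1.0  # assumed width of a "small numerical bin"


def unit_of(point):
    """Map a point to its elementary segment: (bin index, category value)."""
    value, category = point
    return (int(value // BIN_WIDTH), category)


# Number of data points per elementary segment.
histogram = Counter(unit_of(x) for x in X)

for unit, count in sorted(histogram.items()):
    print(unit, count)
```

Even on six points the summary needs four non-empty cells, which hints at why this representation becomes expensive and unrevealing as the number of attributes grows.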
The ultimate goal of clustering is to assign points to a finite system of $k$ subsets, clusters. Usually subsets do not intersect (this assumption is sometimes violated), and their union is equal to the full dataset with the possible exception of outliers:

$$X = C_1 \cup \dots \cup C_k \cup C_{outliers}, \qquad C_{j_1} \cap C_{j_2} = \emptyset \ \text{ for } j_1 \neq j_2.$$
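The two conditions above (clusters do not intersect, and together with the outliers they cover the dataset) can be checked mechanically. The sketch below is a minimal illustration that identifies points by integer index; the function name `is_valid_clustering` and the sample data are hypothetical and not part of the survey.

```python
def is_valid_clustering(points, clusters, outliers=frozenset()):
    """Return True if the clusters are pairwise disjoint, disjoint from
    the outliers, and their union with the outliers equals the dataset."""
    seen = set()
    for cluster in clusters:
        if seen & set(cluster):  # a point assigned to two clusters
            return False
        seen |= set(cluster)
    if seen & set(outliers):     # an outlier also assigned to a cluster
        return False
    return seen | set(outliers) == set(points)


# Hypothetical dataset of 7 points, referred to by index.
points = range(7)

ok = is_valid_clustering(points, [{0, 1, 2}, {3, 4}], outliers={5, 6})
overlapping = is_valid_clustering(points, [{0, 1, 2}, {2, 3, 4}], outliers={5, 6})
incomplete = is_valid_clustering(points, [{0, 1, 2}, {3, 4}], outliers={5})

print(ok, overlapping, incomplete)
```

Methods that allow overlapping clusters deliberately violate the first condition, so a check like this applies only to strict partitions.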
1.2. Clustering Bibliography at Glance
General references regarding clustering include [Hartigan 1975; Spath 1980; Jain & Dubes 1988; Kaufman & Rousseeuw 1990; Dubes 1993; Everitt 1993; Mirkin 1996; Jain et al. 1999; Fasulo 1999; Kolatch 2001; Han et al. 2001; Ghosh 2002]. A very good introduction to contemporary data mining clustering techniques can be found in the textbook [Han & Kamber 2001].

There is a close relationship between clustering techniques and many other disciplines. Clustering has always been used in statistics [Arabie & Hubert 1996] and science [Massart & Kaufman 1983]. The classic introduction to the pattern recognition framework is given in [Duda & Hart 1973]. Typical applications include character recognition. Machine learning clustering algorithms were applied to image segmentation and computer vision [Jain & Flynn 1996]. For statistical approaches to pattern recognition see [Dempster et al. 1977] and [Fukunaga 1990]. Clustering can be viewed as a density estimation problem. This is the subject of traditional multivariate statistical estimation [Scott 1992]. Clustering is also widely used for data compression in image processing, which is also known as vector quantization [Gersho & Gray 1992]. Data
