Data Clustering: A Review
A. K. Jain, Michigan State University
M. N. Murty, Indian Institute of Science
P. J. Flynn, The Ohio State University

Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.

Categories and Subject Descriptors: I.5.1 [Pattern Recognition]: Models; I.5.3 [Pattern Recognition]: Clustering; I.5.4 [Pattern Recognition]: Applications—Computer vision; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering; I.2.6 [Artificial Intelligence]: Learning—Knowledge acquisition

General Terms: Algorithms

Additional Key Words and Phrases: Cluster analysis, clustering applications, exploratory data analysis, incremental clustering, similarity indices, unsupervised learning

Section 6.1 is based on the chapter “Image Segmentation Using Clustering” by A. K. Jain and P. J. Flynn, Advances in Image Understanding: A Festschrift for Azriel Rosenfeld (K. Bowyer and N. Ahuja, Eds.), 1996 IEEE Computer Society Press, and is used by permission of the IEEE Computer Society.

Authors’ addresses: A. Jain, Department of Computer Science, Michigan State University, A714 Wells Hall, East Lansing, MI 48824; M. Murty, Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560 012, India; P. Flynn, Department of Electrical Engineering, The Ohio State University, Columbus, OH 43210.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. © 2000 ACM 0360-0300/99/0900–0001 $5.00
 ACM Computing Surveys, Vol. 31, No. 3, September 1999

CONTENTS

1. Introduction
   1.1 Motivation
   1.2 Components of a Clustering Task
   1.3 The User’s Dilemma and the Role of Expertise
   1.4 History
   1.5 Outline
2. Definitions and Notation
3. Pattern Representation, Feature Selection and Extraction
4. Similarity Measures
5. Clustering Techniques
   5.1 Hierarchical Clustering Algorithms
   5.2 Partitional Algorithms
   5.3 Mixture-Resolving and Mode-Seeking Algorithms
   5.4 Nearest Neighbor Clustering
   5.5 Fuzzy Clustering
   5.6 Representation of Clusters
   5.7 Artificial Neural Networks for Clustering
   5.8 Evolutionary Approaches for Clustering
   5.9 Search-Based Approaches
   5.10 A Comparison of Techniques
   5.11 Incorporating Domain Constraints in Clustering
   5.12 Clustering Large Data Sets
6. Applications
   6.1 Image Segmentation Using Clustering
   6.2 Object and Character Recognition
   6.3 Information Retrieval
   6.4 Data Mining
7. Summary

1. INTRODUCTION

1.1 Motivation

Data analysis underlies many computing applications, either in a design phase or as part of their on-line operations. Data analysis procedures can be dichotomized as either exploratory or confirmatory, based on the availability of appropriate models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification, of measurements based on either (i) goodness-of-fit to a postulated model, or (ii) natural groupings (clustering) revealed through analysis. Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. An example of clustering is depicted in Figure 1. The input patterns are shown in Figure 1(a), and the desired clusters are shown in Figure 1(b). Here, points belonging to the same cluster are given the same label. The variety of techniques for representing data, measuring proximity (similarity) between data elements, and grouping data elements has produced a rich and often confusing assortment of clustering methods.

It is important to understand the difference between clustering (unsupervised classification) and discriminant analysis (supervised classification). In supervised classification, we are provided with a collection of labeled (pre-classified) patterns; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn the descriptions of classes, which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters also, but these category labels are data driven; that is, they are obtained solely from the data.

Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems, there is little prior information (e.g., statistical models) available about the data, and the decision-maker must make as few assumptions about the data as possible. It is under these restrictions that clustering methodology is particularly appropriate for the exploration of interrelationships among the data points to make an assessment (perhaps preliminary) of their structure.

The term “clustering” is used in several research communities to describe methods for grouping of unlabeled data. These communities have different terminologies and assumptions for the components of the clustering process and the contexts in which clustering is used. Thus, we face a dilemma regarding the scope of this survey. The production of a truly comprehensive survey would be a monumental task given the sheer mass of literature in this area. The accessibility of the survey might also be questionable given the need to reconcile very different vocabularies and assumptions regarding clustering in the various communities.

The goal of this paper is to survey the core concepts and techniques in the large subset of cluster analysis with its roots in statistics and decision theory. Where appropriate, references will be made to key concepts and techniques arising from clustering methodology in the machine-learning and other communities.

The audience for this paper includes practitioners in the pattern recognition and image analysis communities (who should view it as a summarization of current practice), practitioners in the machine-learning communities (who should view it as a snapshot of a closely related field with a rich history of well-understood techniques), and the broader audience of scientific professionals (who should view it as an accessible introduction to a mature field that is making important contributions to computing application areas).

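
The distinction between supervised classification and clustering drawn in this section can be illustrated with a small sketch. It is not taken from the paper: the function names, the nearest-centroid rule, the distance threshold `radius`, and the toy data are all assumptions chosen for brevity. The first function uses given labels to describe each class and then labels a new pattern; the second receives no labels and derives its own group identifiers purely from proximity.

```python
import numpy as np

def nearest_centroid_classify(train_x, train_y, new_x):
    """Supervised: describe each given class by its centroid, then label a new pattern."""
    classes = np.unique(train_y)
    centroids = np.array([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(centroids - new_x, axis=1)
    return classes[np.argmin(dists)]

def threshold_clusters(x, radius):
    """Unsupervised: link patterns closer than `radius`; each connected group is a cluster."""
    n = len(x)
    labels = np.full(n, -1)          # -1 marks a pattern not yet assigned
    current = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = current
        stack = [i]
        while stack:                 # grow the group through near neighbors
            j = stack.pop()
            d = np.linalg.norm(x - x[j], axis=1)
            for k in np.nonzero((d < radius) & (labels == -1))[0]:
                labels[k] = current
                stack.append(k)
        current += 1
    return labels

# Hypothetical two-dimensional patterns forming two well-separated groups.
patterns = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                     [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]])
given_labels = np.array(["a", "a", "a", "b", "b", "b"])

# Supervised case: labels are supplied with the training patterns.
predicted = nearest_centroid_classify(patterns, given_labels, np.array([4.8, 5.1]))

# Clustering case: only the patterns are supplied; the labels are data driven.
cluster_ids = threshold_clusters(patterns, radius=1.0)
print(predicted, cluster_ids)
```

The cluster identifiers returned by the second function are arbitrary integers; they carry no meaning beyond "same group", which is the sense in which cluster labels are obtained solely from the data.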
1.2 Components of a Clustering Task
Typical pattern clustering activity involves the following steps [Jain and Dubes 1988]:

(1) pattern representation (optionally including feature extraction and/or selection),
(2) definition of a pattern proximity measure appropriate to the data domain,
(3) clustering or grouping,
(4) data abstraction (if needed), and
(5) assessment of output (if needed).

Figure 2 depicts a typical sequencing of the first three of these steps, including a feedback path where the grouping process output could affect subsequent feature extraction and similarity computations.
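
The five steps listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the procedure of any particular method surveyed here: feature standardization stands in for pattern representation, Euclidean distance for the proximity measure, a basic k-means loop with farthest-first seeding for the grouping step, centroids for data abstraction, and a within-versus-between distance comparison for assessment. All function names and the toy data are hypothetical.

```python
import numpy as np

def represent(raw):
    """Step 1: pattern representation; scale each feature to zero mean and unit variance."""
    raw = np.asarray(raw, dtype=float)
    std = raw.std(axis=0)
    std[std == 0] = 1.0              # avoid dividing by zero for constant features
    return (raw - raw.mean(axis=0)) / std

def proximity(a, b):
    """Step 2: proximity measure; Euclidean distance is an assumed choice."""
    return np.linalg.norm(a - b, axis=-1)

def seed_centers(x, k):
    """Farthest-first seeding so that the sketch is deterministic."""
    centers = [x[0]]
    for _ in range(k - 1):
        d = np.min([proximity(x, c) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    return np.array(centers)

def group(x, k, iters=20):
    """Step 3: grouping; alternate assignment and centroid update (basic k-means)."""
    centers = seed_centers(x, k)
    for _ in range(iters):
        labels = proximity(x[:, None, :], centers[None, :, :]).argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean(axis=0)
    return labels, centers

def abstract(labels, centers):
    """Step 4: data abstraction; summarize each cluster by its centroid and size."""
    return {c: {"centroid": centers[c], "size": int((labels == c).sum())}
            for c in range(len(centers))}

def assess(x, labels, centers):
    """Step 5: assessment; mean distance to own centroid vs. closest pair of centroids."""
    within = proximity(x, centers[labels]).mean()
    k = len(centers)
    between = min(proximity(centers[i], centers[j])
                  for i in range(k) for j in range(i + 1, k))
    return within, between

# Hypothetical raw measurements on two features with very different scales.
raw = [[1.0, 200], [1.2, 210], [0.9, 190], [8.0, 900], [8.3, 950], [7.9, 880]]
x = represent(raw)
labels, centers = group(x, k=2)
summary = abstract(labels, centers)
within, between = assess(x, labels, centers)
print(labels, summary, within, between)
```

The feedback path mentioned for Figure 2 would correspond to rerunning `represent` or changing `proximity` after inspecting the output of `group` or `assess`; that loop is left out of the sketch.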

Pattern representation refers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm. Some of this information may not be controllable by the

Figure 1. Data clustering. (a) Input patterns; (b) desired clusters.