concept. Data mining deals with large databases that impose additional severe computational requirements on clustering analysis. These challenges led to the emergence of powerful, broadly applicable data mining clustering methods, surveyed below.
To fix the context and to clarify the prolific terminology, we consider a dataset $X$ consisting of data points (or synonymously, objects, instances, cases) $x_i = (x_{i1}, \ldots, x_{id}) \in A$ in attribute space $A$, where $i = 1, \ldots, N$, and each component $x_{il} \in A_l$ is a numerical or nominal categorical attribute (or synonymously, feature, variable, dimension). For a discussion of attribute data types see [Han & Kamber 2001]. Such a point-by-attribute data format conceptually corresponds to an $N \times d$ matrix and is used by the majority of algorithms reviewed below. However, data in other formats, such as variable-length sequences and heterogeneous data, is becoming more and more popular. The simplest attribute space subset is a direct Cartesian product of sub-ranges $C = \prod_{l=1}^{d} C_l \subset A$, $C_l \subset A_l$, called a segment (also cube, cell, region). A unit is an elementary segment whose sub-ranges consist of a single category value or of a small numerical bin. Describing the number of data points per unit represents an extreme case of clustering, a histogram, where no actual clustering takes place. This is a very expensive representation, and not a very revealing one. User-driven segmentation is another commonly used practice in data exploration that utilizes expert knowledge regarding the importance of certain sub-domains. We distinguish clustering from segmentation to emphasize the importance of the automatic learning process.
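The histogram over units described above can be illustrated with a minimal Python sketch. It assumes numerical attributes are cut into equal-width bins and that each categorical value forms its own sub-range; the names `unit_of`, `unit_histogram`, and `bin_width`, as well as the sample points, are hypothetical and chosen only for illustration.

```python
from collections import Counter

def unit_of(point, bin_width=1.0):
    """Map a point to its unit (an elementary segment).

    Assumption: numerical values fall into equal-width bins of size
    `bin_width`; categorical values are kept as single-value sub-ranges.
    """
    key = []
    for value in point:
        if isinstance(value, (int, float)):
            key.append(int(value // bin_width))  # index of the numerical bin
        else:
            key.append(value)                    # a single category value
    return tuple(key)

def unit_histogram(points, bin_width=1.0):
    """Count the data points that fall into each non-empty unit."""
    return Counter(unit_of(p, bin_width) for p in points)

# Hypothetical dataset: N = 4 points, d = 3 attributes (two numerical, one categorical).
X = [
    (0.2, 1.7, "red"),
    (0.9, 1.1, "red"),
    (3.4, 0.5, "blue"),
    (0.5, 1.9, "red"),
]

hist = unit_histogram(X)
for unit, count in sorted(hist.items()):
    print(unit, count)
```

Only non-empty units are stored here. The number of possible units grows multiplicatively with the number of attributes, which is one way to see why the text calls this representation expensive.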
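The ultimate goal of clustering is to assign points to a finite system of $k$ subsets, called clusters. Usually the subsets do not intersect (this assumption is sometimes violated), and their union is equal to the full dataset with the possible exception of outliers:
$$X = C_1 \cup \cdots \cup C_k \cup C_{\text{outliers}}, \qquad C_{j_1} \cap C_{j_2} = \emptyset \quad (j_1 \neq j_2).$$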
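The requirement that non-intersecting clusters, together with the outliers, cover the full dataset can be checked mechanically. The following Python sketch assumes points are referred to by integer indices and that the outlier set is also disjoint from every cluster; the function name `is_partition_with_outliers` and the sample data are hypothetical.

```python
def is_partition_with_outliers(dataset, clusters, outliers=frozenset()):
    """Check that the clusters and the outlier set are pairwise disjoint
    and that their union equals the full dataset.

    Assumption: the outlier set is treated like one more subset that
    must not overlap any cluster.
    """
    parts = [set(c) for c in clusters] + [set(outliers)]
    covered = set()
    for part in parts:
        if covered & part:   # a point was already assigned to an earlier subset
            return False
        covered |= part
    return covered == set(dataset)

# Hypothetical example: 8 points identified by index, k = 3 clusters, one outlier.
X = range(8)
clusters = [{0, 1, 2}, {3, 4}, {5, 6}]
outliers = {7}

print(is_partition_with_outliers(X, clusters, outliers))
```

Methods that allow overlapping clusters violate the disjointness condition by design, so a check like this applies only to the non-intersecting case.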
1.2. Clustering Bibliography at Glance
General references regarding clustering include [Hartigan 1975; Spath 1980; Jain & Dubes 1988; Kaufman & Rousseeuw 1990; Dubes 1993; Everitt 1993; Mirkin 1996; Jain et al. 1999; Fasulo 1999; Kolatch 2001; Han et al. 2001; Ghosh 2002]. A very good introduction to contemporary data mining clustering techniques can be found in the textbook [Han & Kamber 2001].
There is a close relationship between clustering techniques and many other disciplines. Clustering has always been used in statistics [Arabie & Hubert 1996] and science [Massart & Kaufman 1983]. The classic introduction to the pattern recognition framework is given in [Duda & Hart 1973]. Typical applications include speech and character recognition. Machine learning clustering algorithms were applied to image segmentation and computer vision [Jain & Flynn 1996]. For statistical approaches to pattern recognition see [Dempster et al. 1977] and [Fukunaga 1990]. Clustering can be viewed as a density estimation problem. This is the subject of traditional multivariate statistical estimation [Scott 1992]. Clustering is also widely used for data compression in image processing, which is also known as vector quantization [Gersho & Gray 1992]. Data