
INFORMATION SCIENCES
An International Journal

ELSEVIER    Information Sciences 112 (1998) 267-295

Data clustering analysis in a multidimensional space
A. Bouguettaya *, Q. Le Viet
School of Information Systems, Queensland University of Technology, Brisbane, Qld 4001, Australia
Received 1 June 1997; accepted 30 April 1998
Communicated by Ahmed Elmagarmid

Abstract
Cluster analysis techniques are used to classify objects into groups based on their similarities. There is a wide choice of methods with different requirements in computer resources. We present the results of a fairly exhaustive study to evaluate three commonly used clustering algorithms, namely, single linkage, complete linkage, and centroid. The cluster analysis study is conducted in the two-dimensional (2-D) space. Three types of statistical distribution are used. Two different types of distances to compare lists of objects are also used. The results point to some startling similarities in the behavior and stability of all clustering methods. © 1998 Elsevier Science Inc. All rights reserved.

1. Introduction and motivation

Cluster analysis is a generic name for multivariate analysis techniques to create groups of objects based on their degree of association [21]. In simple words, cluster analysis classifies items into groups based on their similarities. In databases, cluster analysis has been used to re-allocate stored information based on predefined criteria with the goal of improving the efficiency of data retrieval operations. The standard way of evaluating the degree of similarity varies from one application to another. By re-allocating data, related information is physically stored as close as possible to reduce the number of disk accesses. In a distributed environment, clustering is even more important because of the impact on the

*Corresponding author. E-mail: athman@icis.qut.edu.au


response time if the requested data is physically located at different sites. The
need to do clustering is clear. There are many issues that need to be addressed:
1. Calculation of the degree of association between different types of objects.
2. Determination of an acceptable criterion to evaluate the "goodness" of clus-
tering methods.
3. Adaptability of the clustering methods with different distributions of data:
randomly scattered, skewed or concentrated around certain regions, etc.
Several clustering methods have been proposed that differ in the approach taken to perform the clustering, some of which are: Hierarchical versus Partitional, Agglomerative versus Divisive, Extrinsic versus Intrinsic, etc. [7,10,8,14,22,23,25,18,1]. In that respect, each clustering method has a different theoretical basis and is applicable to particular fields. We propose a fairly exhaustive study of well-known clustering techniques in the two-dimensional (2-D) space. Previous studies, which are less exhaustive in their analysis, have focused on the 1-D space [5]. Our experiments include a variety of environment settings to test the clustering techniques' sensitivity and behavior. The results can be of paramount importance across a wide range of applications.

1.1. Definitions

We present a high level description of the different ontologies used in the re-
search literature and this paper. The ontologies we cover are the following:
cluster analysis, objects, clusters, distance and similarity, coefficient of correla-
tion, and standard deviation.
Cluster analysis: This is about the generation of groups of entities that fit a set of definitions. The group which forms a cluster should have a higher degree of association among its members than with members of different groups. At a high level of abstraction, a cluster can be viewed as a group of "similar" objects. Cluster analysis is sometimes referred to as automatic classification. Cluster analysis has a property that makes it different from other classification methods, namely, information about classes of groupings is not known prior to the processing. Items are grouped into clusters that are defined by members of those clusters.
Objects (or items): These are used in a broad sense. They can be anything that needs to be classified based on certain criteria. The object may represent
a single attribute in a relational database or a complex object in an object-ori-
ented database provided that it can be represented as a point in a measurement
space. Obviously, all objects to be clustered should be defined in the same mea-
surement space. In a 1-D (one dimensional) environment, an object is repre-
sented as a point belonging to a segment defined by the interval [a,b] where
a and b are arbitrary numbers.
Clusters: These are groups of objects linked together according to some
rules. The goal of clustering is to find groups containing objects most homogeneous within these groups. Homogeneity refers to the common properties of the objects to be clustered. There are two ways to represent clusters in a mea-
surement space: as a hypothetical point which is not an object in the cluster, or
as an existing object in the cluster called centroid or cluster representative.
Clusters are represented in the measurement space in the same form as the
objects they contain. To distinguish between an object and a cluster, additional
information is needed: the number of objects in the cluster. From that point of
view, a single object is a cluster containing exactly one object. An example of
clusters in a 1-D environment is {0.1}, {0.6}, {0.2, 0.3, 0.5}, etc.
Distance and Similarity: To cluster items in a database or in any other envi-
ronment, some means of quantifying the degree of associations between items
are needed. They may be a measure of distances or similarities. There are a
number of similarity measures available and the choice may have an effect
on the results obtained. Objects which have more than one dimension may
use relative or normalized weight to convert their distance to an arbitrary scale
so that they can be compared. Once the objects are defined in the same mea-
surement space as the points, it is a straightforward exercise to compute the dis-
tance or similarity. The smaller the distance the more similar two objects are.
The most popular choice in computing distance is the Euclidean distance with

$$d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2},$$

where n is the number of dimensions. For the 1-D space, the distance becomes $d(i,j) = |x_i - x_j|$.
There is also the Manhattan distance or city block concept, represented as follows:

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|.$$
The distance between two clusters involves some or all items of the two clusters
and is calculated differently depending on the clustering method.
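As a small illustration (not taken from the paper; the function names are ours), the two point-to-point distances above can be computed as follows in Python:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# In the 1-D case both measures reduce to |x_i - x_j|:
print(euclidean([0.3], [0.7]), manhattan([0.3], [0.7]))   # both about 0.4
```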
Standard Deviation: This is the measurement of the fluctuation of values as compared to the mean value. In this study, standard deviation is used to show the acceptability of the results. The standard deviation of a random variable $\mathscr{X}$ is given by

$$\sigma(\mathscr{X}) = \sqrt{E(\mathscr{X}^2) - E^2(\mathscr{X})}.$$

Coefficient of Correlation: This is the measurement of the strength of the relationship between two variables $\mathscr{X}$ and $\mathscr{Y}$. It essentially answers the question "how similar are $\mathscr{X}$ and $\mathscr{Y}$?". The values of the coefficient of correlation range from 0 to 1, where the value 0 points to no similarity and the value 1 points to high similarity. The coefficient of correlation is used to find the similarity among objects. The correlation r of two random variables $\mathscr{X}$ and $\mathscr{Y}$, where $\mathscr{X} = (x_1, x_2, x_3, \ldots, x_n)$ and $\mathscr{Y} = (y_1, y_2, y_3, \ldots, y_n)$, is given by the formula

$$r = \frac{|E(\mathscr{X}\mathscr{Y}) - E(\mathscr{X}) \cdot E(\mathscr{Y})|}{\sqrt{E(\mathscr{X}^2) - E^2(\mathscr{X})}\,\sqrt{E(\mathscr{Y}^2) - E^2(\mathscr{Y})}},$$

where $E(\mathscr{X}) = \sum_{i=1}^{n} x_i/n$, $E(\mathscr{Y}) = \sum_{i=1}^{n} y_i/n$ and $E(\mathscr{X}\mathscr{Y}) = \sum_{i=1}^{n} x_i y_i/n$.
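Transcribed directly into Python, the two statistics above might look as follows (a sketch; the helper names are ours):

```python
import math

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    """Standard deviation as defined above: sqrt(E(X^2) - E^2(X))."""
    return math.sqrt(mean([v * v for v in values]) - mean(values) ** 2)

def correlation(xs, ys):
    """Coefficient of correlation r as defined above; a value between 0 and 1."""
    exy = mean([x * y for x, y in zip(xs, ys)])
    return abs(exy - mean(xs) * mean(ys)) / (std_dev(xs) * std_dev(ys))
```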
1.2. Related work

Cluster analysis has been used in several fields of science to determine tax-
onomical relationships among entities, including fields dealing with species,
psychiatric profiles, census and survey data, and chemical structures
[18,27,14,22].
Clustering applications range from databases (e.g. data clustering and data
mining) to machine learning to data compression [11]. Our application domain
is in the area of databases. In the case of database clustering, the ability to cat-
egorize items into groups allows the re-allocation of related data in a database
to improve the performance of DBMSs. Data records which are frequently ref-
erenced together are moved in close proximity to reduce access time. To reach
this goal, cluster analysis is used to form clusters based on the similarities of
data items. Data may be re-allocated based on values of an attribute, group
of attributes or on accessing patterns. These criteria determine the measuring
distance between data items.
With the advent of OODBs, the need for efficient clustering techniques be-
comes crucial. Some OODBs use some limited form of clustering to improve
their performance. However, they are mostly static in nature [4]. The case of
OODBs is unique in that the underlying model provides a testbed for dynamic
clustering. This is the reason why clustering takes on a whole new meaning
with OODBs. There has been a surge in the number of studies of database clus-
tering [13,17,24,3,2]. In particular, there were recently a number of studies
which investigated adaptive clustering techniques, i.e., the clustering techniques
which can cope with changing access patterns and perform clustering on-line
[5,26].
In database mining and knowledge discovery, the primary goal is the search
for, and the discovery of, previously unknown patterns in large datasets stored
in databases [19,9]. The patterns are then used to predict the model of data
classification. There is a wide range of benefits for "mining" data to find inter-
esting associations. Data warehouses become valuable in terms of understand-
ing, managing, and using previously unknown relationships between sets of
data. Our experiments are meant to provide a generic view on how data is clus-
tered. In this regard, the experiments do not target specific applications and are
applicable to a wide range of domains.
This research builds upon previous work that we have conducted using a different set of environment settings. We investigated three commonly used clustering algorithms: Slink, Clink, and Average, with different settings [5]. The preliminary findings seem to indicate that the choice of clustering method becomes irrelevant in terms of final outcomes. The study presented here extends our previous work to include several other settings [5]. The new environment settings include additional parameters: a new clustering method, a statistical distribution, larger input sizes, and the space dimension. The aim is to provide a basis for a more categorical argument as to the behavior and sensitivity of the considered clustering methods.

1.3. Statistical distributions

The objects used in this study consist of points lying in the interval [0,1]. There are numerous statistical distributions and we selected the three that most closely model real world distributions [6]. Our aim is to see whether clustering is dependent on the way objects are generated. The first one is the uniform distribution, the second one is the piecewise distribution, and the third one is the Gaussian distribution. In what follows, we describe the statistical distributions that we used in our experiments.
Uniform distribution: The respective distribution function is the following: $\mathscr{F}(x) = x$. The density function of this distribution is $f(x) = \mathscr{F}'(x) = 1$ for all $x$ such that $0 \le x \le 1$.
Piecewise (skewed) distribution: The respective distribution function is the following:

$$\mathscr{F}(x) = \begin{cases} 0.05 & \text{if } 0 \le x < 0.37,\\ 0.475 & \text{if } 0.37 \le x < 0.62,\\ 0.525 & \text{if } 0.62 \le x < 0.743,\\ 0.95 & \text{if } 0.743 \le x < 0.89,\\ 1 & \text{if } 0.89 \le x \le 1. \end{cases}$$

The density function of this distribution is $f(x) = [\mathscr{F}(b) - \mathscr{F}(a)]/(b - a)$ for all $x$ such that $a \le x < b$.
Gaussian (normal) distribution: The respective distribution function is the following:

$$\mathscr{F}(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

This is a two-parameter ($\sigma$ and $\mu$) distribution, where $\mu$ is the mean of the distribution and $\sigma^2$ is the variance. The density function of this distribution is

$$f(x) = \mathscr{F}'(x) = \frac{1}{\sqrt{2\pi}}\,\frac{\mu - x}{\sigma^3}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

In producing samples for the Gaussian distribution, we choose $\mu = 0.5$ and $\sigma = 0.1$.

$$\mathscr{F}(x) = \begin{cases} 0.00132 & \text{if } 0.1 \le x < 0.2,\\ 0.02277 & \text{if } 0.2 \le x < 0.3,\\ 0.15867 & \text{if } 0.3 \le x < 0.4,\\ 0.49997 & \text{if } 0.4 \le x < 0.5, \end{cases} \qquad \text{for } 0.0 \le x \le 1.$$
For values of x that are in the range [0.5,1], the distribution is symmetric.
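A possible way to draw samples from these three distributions (our own sketch; it relies on Python's random module rather than the random number generator used in the experiments) is:

```python
import random

def sample_uniform():
    return random.random()

def sample_gaussian(mu=0.5, sigma=0.1):
    # Clamp to [0, 1] so every object stays in the measurement space.
    return min(1.0, max(0.0, random.gauss(mu, sigma)))

# Piecewise (skewed): uniform density inside each segment, with segment
# probabilities taken from the increments of the distribution function above.
SEGMENTS = [(0.0, 0.37, 0.05), (0.37, 0.62, 0.425), (0.62, 0.743, 0.05),
            (0.743, 0.89, 0.425), (0.89, 1.0, 0.05)]

def sample_piecewise():
    (a, b, _), = random.choices(SEGMENTS, weights=[s[2] for s in SEGMENTS])
    return random.uniform(a, b)
```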
Following is an outline of the paper. In Section 2, the clustering methods
used in this study are described. In Section 3, we detail the experiments con-
ducted in this study. In Section 4, we provide the interpretations of the exper-
iment results. In Section 5, we provide some concluding remarks.

2. Clustering methods

There are different ways to classify clustering methods according to the type
of cluster structure they produce. The simple non-hierarchical methods divide
the data set of N objects into M clusters, where no overlap is allowed. They
are also known as partitioning methods. Each item is a member of only the cluster to which it is most similar, and the cluster may be represented by a
centroid or cluster representative that represents the characteristics of all con-
tained objects. This method is heuristically based and mostly applied in social
sciences.
Hierarchical methods produce a nested data set in which pairs of items or
clusters are successively linked until every item in the data set is linked to form
one cluster. Hierarchical methods can be one of the following:
• Agglomerative, with N - 1 pairwise joins from an unclustered data set. In
other words, from N clusters of one object, this method gradually forms
one cluster of N objects. At each step, clusters or objects are joined together
into larger and larger clusters ending with one big cluster containing all ob-
jects.
• Divisive, in which all objects belong to a single cluster at the beginning; they are then divided into smaller clusters over and over until the last cluster containing two objects has been broken apart into its atomic constituents.
In both cases, the result of the procedure is a hierarchical tree, hence the
name " hierarchical methods". The hierarchical tree may be presented as a den-
drogram, in which pairwise coupling of the objects in the data set is shown and
the length of the branches (vertices) or the value of the similarity is represented
numerically. Divisive methods are less commonly used and only agglomerative
methods will be discussed in this paper.

2.1. Hierarchical clustering techniques

This section describes the hierarchical agglomerative clustering methods and


their characteristics. These methods have been used in the experiments present-
ed in this paper.
Single linkage clustering method (Slink): The distance between two clusters is the smallest distance of all distances between two items $(x, y)$, denoted $\delta_{x,y}$, such that $x$ is a member of one cluster and $y$ is a member of another cluster. This method is also known as the nearest neighbor method. The distance $\delta_{X,Y}$ is computed as follows:

$$\delta_{X,Y} = \min\{\delta_{x,y}\} \quad \text{with } x \in X,\ y \in Y,$$
where (X, Y) are clusters, and (x,y) are objects in the corresponding clusters.
This method is the simplest among all clustering methods. It has some attrac-
tive theoretical properties [12]. However, it tends to form long or chaining clus-
ters. This method may not be very suitable for objects concentrated around
some centers in the measurement space.
Complete linkage clustering method (Clink): The similarity coefficient is the longest distance between any pair of objects, denoted $\delta_{X,Y}$, taken from the two clusters. This method is also called the furthest neighbor clustering method [16,22,8]. The distance is computed as follows:

$$\delta_{X,Y} = \max\{\delta_{x,y}\} \quad \text{with } x \in X,\ y \in Y.$$
Centroid/median method: Clusters in this method are represented by a "centroid", a point in the middle of the cluster. The distance between two clusters is the distance between their centroids. This method also has a special case in which the centroid of the smaller group is weighted equally to that of the larger one.
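A compact sketch of the three similarity coefficients (our own rendering; clusters are lists of points, and euclidean is the helper sketched in Section 1.1):

```python
def slink_distance(cluster_x, cluster_y):
    """Single linkage: smallest pairwise distance between the two clusters."""
    return min(euclidean(x, y) for x in cluster_x for y in cluster_y)

def clink_distance(cluster_x, cluster_y):
    """Complete linkage: largest pairwise distance between the two clusters."""
    return max(euclidean(x, y) for x in cluster_x for y in cluster_y)

def centroid_distance(cluster_x, cluster_y):
    """Centroid method: distance between the two cluster centroids."""
    def centroid(cluster):
        return [sum(axis) / len(cluster) for axis in zip(*cluster)]
    return euclidean(centroid(cluster_x), centroid(cluster_y))
```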
2.2. General algorithm

We provide examples using the three clustering methods. More examples


can be found in [22,8]. As a first step, objects are generated using a random number generator. In our case, these objects are points in the interval [0,1]. After these objects are created, they are compared to each other by measuring the distance. The distance between two clusters is computed using the similarity coefficient. The way objects and clusters of objects are joined together to form larger clusters varies with the approach used. We outline a generic algorithm that is applicable to all clustering methods; a sketch in code is given after the steps below. Essentially, it consists of two phases. The first phase records the similarity coefficients. The second phase computes the minimum and performs the clustering. Initially every cluster consists of exactly one object.
1. Scan all clusters and record all similarity coefficients.
2. Compute the minimum of all similarity coefficients and then join the corre-
sponding clusters.

3. If exactly one cluster remains then stop.
4. Go to (1).
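A minimal Python sketch of this generic procedure (our own rendering, reusing the linkage functions above; it records the successive merges rather than drawing the tree):

```python
def agglomerative(points, cluster_distance):
    """Generic agglomerative clustering; cluster_distance is one of
    slink_distance, clink_distance or centroid_distance."""
    clusters = [[p] for p in points]        # initially one object per cluster
    merges = []
    while len(clusters) > 1:
        # Step 1: scan all cluster pairs and record the similarity coefficients.
        pairs = [(cluster_distance(a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        # Step 2: join the pair with the minimum similarity coefficient.
        d, i, j = min(pairs)
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

With slink_distance and the ten objects of Table 1 (each value wrapped as a one-element point), this reproduces the merge distances of Example 1 below, up to rounding and the order in which ties are joined.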
There is a case when using the Clink method where ambiguity may arise. Suppose that when performing Step 1, three successive clusters are to be joined (they all have the minimum similarity value). When performing Step 2, the first two clusters are joined. However, when computing the similarity coefficient between this new cluster and the third cluster, the similarity value may now be different from the minimum value. The question now is what the next step should be. There are essentially two possibilities:
• Either proceed by joining clusters, recomputing the similarity coefficient each time in Step 2.
• Or join all those clusters that have the same similarity coefficient at once, and do not recompute the similarity in Step 2.
In general, there is no evidence that one is better than the other [22]. For our
study, we selected the first alternative.

2.3. Examples

Following are examples (see Table 1) of how data is clustered, to provide an idea of how the different clustering methods work with the same set of data. Example 1 uses the Slink method, Example 2 uses the Clink method, and Example 3 uses the Centroid method (see Figs. 1-3). The sample data has 10 items; each item has an identification and a value on which the distance is calculated. The uniform distribution was used to generate this set of objects. For the sake of simplicity, we consider the 1-D space for generating data objects.

Example 1 (Slink).
1. Join clusters {10} and {1} at distance 0.0117584.
2. Join clusters {6} and {10, 1} at 0.105885.

Table 1
Example of a sample data list

Value Id

0.7764770 1
0.5529483 2
0.3294196 3
0.1058909 4
0.8823621 5
0.6588334 6
0.4353047 7
0.2117760 8
0.9882473 9
0.7647185 10
Fig. 1. Clustering tree using Slink.

Fig. 2. Clustering tree using Clink.
Fig. 3. Clustering tree using Centroid.

3. Join clusters {4} and {8} at 0.105885140.
4. Join clusters {3} and {7} as one cluster, and {2}, {6, 10, 1}, {5} and {9} as another cluster at 0.105885148.
5. Join clusters {4, 8}, {3, 7} and {2, 6, 10, 1, 5, 9} as one cluster at 0.117643.

Example 2 (Clink).
1. Join clusters {1} and {10} at distance 0.0117584.
2. Join clusters {4} and {8} at 0.105885140.
3. Join clusters {3} and {7} as one cluster, and {2} and {6} as another cluster, and {5} and {9} as another cluster at 0.105885148.
4. Join clusters {2, 6} and {1, 10} at 0.2235286.
5. Join clusters {4, 8} and {3, 7} at 0.3294138.
6. Join clusters {2, 6, 10, 1} and {5, 9} at 0.4352989.
7. Join clusters {4, 8, 3, 7} and {2, 6, 10, 1, 5, 9} at 0.8823563.

Example 3 (Centroid).
1. Join clusters {10} and {1} at distance of 0.0117585 and form the central
point at 0.77059775.

2. Join clusters {2} and {6} as one cluster, and {3} and {7} as another cluster,
and {4} and {8} as another cluster at 0.105885148 and form the central
points at 0.60589085, 0.32241215, and 0.15883345.
3. Join clusters {5} and {9} at 0.1058852 and form the central point at
0.93530470.
4. Join clusters {1, 10} and {2, 6} at 0.16478690 and form the central point at
0.68824430.
5. Join clusters {3, 7} and {4, 8} at 0.22357870 and form the central point at
0.27062280.
6. Join clusters {1, 10, 2, 6} and {5, 9} at 0.24706040 and form the central
point at 0.81177450.
7. Join clusters {1, 10, 2, 6, 5, 9} and {3, 7, 4, 8} at 0.5315170.
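For instance, the central point formed in step 1 is the mean of the two member values: $(0.7647185 + 0.7764770)/2 = 0.77059775$.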

3. Experiment description

In this study, the hierarchical tree is implemented as a minimum spanning


tree (MST). An MST is a tree in which N objects are linked by N - 1 connections and there is no cycle in the tree. The MST has the following properties that make it suitable to represent the hierarchical tree (a construction sketch in code follows the list):
• Any isolated item can be connected to a nearest neighbor.
• Any isolated fragment (a subset of an MST) can in any case be connected to a nearest neighbor by an available link.
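One possible way to build such an MST is Prim's algorithm (a sketch under the assumption of Euclidean distances between points; not necessarily the authors' implementation):

```python
def minimum_spanning_tree(points):
    """Prim's algorithm: N points linked by N - 1 edges, no cycles.
    Returns the tree as an adjacency dict {point index: [neighbor indices]}."""
    n = len(points)
    in_tree = {0}
    adjacency = {i: [] for i in range(n)}
    while len(in_tree) < n:
        # Connect the nearest isolated point to the current fragment.
        d, u, v = min((euclidean(points[u], points[v]), u, v)
                      for u in in_tree for v in range(n) if v not in in_tree)
        adjacency[u].append(v)
        adjacency[v].append(u)
        in_tree.add(v)
    return adjacency
```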
The data used in this paper are drawn from a 2-D space. Their values range from 0 to 1 inclusive and the list sizes range from 100 to 500. The linear congruential algorithm is used to generate data objects [15,20]. The seed is the system time. Each experiment followed these steps:
• Generate lists of objects.
• Carry out the clustering process with three different clustering methods.
• Calculate the coefficient of correlation for each clustering method.
Each experiment is repeated 100 times and the standard deviation of the co-
efficients of correlation is calculated. The least square approximation (LSA) is
used to evaluate the acceptability of the approximation. If a coefficient of correlation obtained using the LSA falls within the segment defined by the corresponding standard deviation, the approximation is deemed to be acceptable.
Types of distances in comparing trees: There are two ways of comparing two
trees, obtained from lists of objects, to compute their coefficient of correlation.
The distance used in the coefficient of correlation could for instance be comput-
ed using the actual linear (Euclidean) difference between two objects. The other
method of computing the distance is to use the minimum number of edges (of a
tree) needed to join two objects. The latter has an advantage over the former in
that it provides a more "natural" implementation of a correlation. We call the
first type of distance the linear distance, and the second the edge distance.

Once we choose a distance type, we compute the coefficient of correlation by


selecting one pair of identifiers in the second list (shorter list) and compute
its distance and then look for the same pair in the first list and compute its dis-
tance. We repeat the same process for all remaining pairs in the second list.
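As a sketch of the edge-distance option (our own code; trees are the adjacency dicts produced by the MST sketch earlier in this section, and correlation is the helper from Section 1.1):

```python
from collections import deque

def edge_distance(tree, a, b):
    """Minimum number of tree edges needed to join objects a and b (BFS)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == b:
            return depth
        for nxt in tree[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    raise ValueError("objects are not connected")

def tree_correlation(tree1, tree2, shared_ids):
    """Coefficient of correlation between the edge distances of two trees,
    computed over every pair of identifiers common to both lists."""
    pairs = [(a, b) for i, a in enumerate(shared_ids) for b in shared_ids[i + 1:]]
    d1 = [edge_distance(tree1, a, b) for a, b in pairs]
    d2 = [edge_distance(tree2, a, b) for a, b in pairs]
    return correlation(d1, d2)
```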
There are several types of coefficients of correlation. This stems from the fact
that several parameters have been used. For instance, the clustering method is
one parameter. There are 3 × 2 × 3 = 18 possible ways to compute the coefficient of correlation for two lists of objects. Indeed, we have the following choices:
• first parameter: Slink, Clink, or Centroid;
• second parameter: linear distance or edge distance;
• third parameter: uniform distribution, piecewise distribution, or Gaussian distribution.
The other dimension of this comparison study that has a direct influence on
the clustering is the data input. This determines what kind of data is to be com-
pared and what its size is. The following cases have been identified to check the
sensitivity of each clustering method with regard to the input data. For every
type of coefficient of correlation mentioned above, eleven (11) types of situa-
tions (hence, eleven coefficients of correlation) have been isolated. It is our belief that these cases closely represent what may influence the choice of a clustering method.
1. The coefficient of correlation is between pairs of objects drawn from a set S
and pairs of objects drawn from the first half of the same set S. The first half
of S is used before the set is sorted.
2. The coefficient of correlation is between pairs of objects drawn from S and
pairs of objects drawn from the second half of S. The second half of S is
used before the set is sorted.
3. The coefficient of correlation is between pairs of objects drawn from the first half of S, say $S_1$, and pairs of objects drawn from the first half of another set S', say $S'_1$. The two sets are given ascending identifiers after being sorted. The first object of $S_1$ is given as identifier the number 1, and so is the first object of $S'_1$. The second object of $S_1$ is given as identifier the number 2, and so is the second object of $S'_1$, and so on.
4. The coefficient of correlation is between pairs of objects drawn from the second half of S, say $S_2$, and pairs of objects drawn from the second half of S', say $S'_2$. The two sets are given ascending identifiers after being sorted in the same way as in the previous case.
5. The coefficient of correlation is between pairs of objects drawn from S and
pairs of objects drawn from the union of a set X and S. The set X contains
10% new randomly generated objects.
6. The coefficient of correlation definition is the same as the fifth coefficient of
correlation except that X now contains 20% new randomly generated ob-
jects.
7. The coefficient of correlation definition is the same as the fifth coefficient of
correlation except that X now contains 30% new randomly generated ob-
jects.
8. The coefficient of correlation definition is the same as the fifth coefficient of
correlation except that X now contains 40% new randomly generated ob-
jects.
9. The coefficient of correlation is between pairs of objects drawn from S using
the uniform distribution and pairs of objects drawn from S' using the piece-
wise distribution.
10. The coefficient of correlation is between pairs of objects drawn from S using
the uniform distribution and pairs of objects drawn from S' using the
Gaussian distribution.
11. The coefficient of correlation is between pairs of objects drawn from S using
the Gaussian distribution and pairs of objects drawn from S' using the
piecewise distribution.
In a nutshell, the above coefficients of correlation are meant to analyze different situations in the evaluation of results. We partition these situations into three groups, represented by three blocks of coefficients of correlation:
First block: The first, second, third and fourth coefficients of correlation are
used to check the influence of the context on how objects are clustered.
Second block: The fifth, sixth, seventh and eighth coefficients of correlation are used to check the influence of the data size.
Third Block: The ninth, tenth, and eleventh coefficients of correlation are
used to check the relation which may exist between two lists obtained using
two different distributions.
To ensure the statistical representativity of the results, the average and standard deviation of 100 coefficient of correlation values (of the same type) are computed. The least square approximation is then applied to obtain the following equation:

$$f(x) = ax + b.$$

The criterion for a good approximation (or acceptability) is given by the inequality

$$|y_i - f(x_i)| \le \sigma(y_i) \quad \text{for all } i,$$

where $y_i$ is the coefficient of correlation, $f$ is the approximation function and $\sigma(y_i)$ is the standard deviation for $y_i$. If this inequality is satisfied, $f$ is then a good
approximation. The least square approximation, if acceptable, helps predict
the behavior of clustering methods for data points beyond the range considered
in our experiments.
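Written out, the fit and the acceptability test could look as follows (a sketch; the function names are ours):

```python
def least_squares_line(xs, ys):
    """Fit f(x) = a*x + b by ordinary least squares."""
    n = len(xs)
    a = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) \
        / (n * sum(x * x for x in xs) - sum(xs) ** 2)
    b = (sum(ys) - a * sum(xs)) / n
    return a, b

def acceptable(xs, ys, sigmas):
    """Check that |y_i - f(x_i)| <= sigma_i for every data point."""
    a, b = least_squares_line(xs, ys)
    return all(abs(y - (a * x + b)) <= s for x, y, s in zip(xs, ys, sigmas))
```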

4. Results and their interpretations

As stated earlier, the aim of this study is to conduct experiments to deter-


mine the stability of clustering methods and how they compare to each other.
For the sake of readability, a shorthand notation is used to indicate all possible
cases (see Table 2). A similar notation has been used in our previous findings
[5]. For instance, to represent the input with the following parameters: Slink,
Uniform distribution and Linear distance; the abbreviation SUL is used.
Results are presented in figures and tables. The figures describe the different
types of coefficients of correlation. The tables (see Table 3) describe the least
square approximations of the coefficients of correlations.

4.1. Analysis of the stability and sensitivity of the clustering methods

We first look at the different clustering methods and analyze how stable and
sensitive they are to the various parameters. In essence, we are interested in
knowing how each clustering method reacts to the changes in parameter values.

4.1.1. Slink: Results interpretation


We look at the behavior of the three blocks of coefficients of correlation val-
ues as defined in Section 3. We then provide an interpretation of the corre-
sponding results.
First block of coefficients of correlation: As previously mentioned, the first
four coefficients of correlation are meant to test the influence of the context.

Table 2
List of abbreviations
Term Shorthand
Slink S
Clink C
Centroid O
Uniform distr. U
Gaussian distr. G
Piecewise distr. P
Linear distance L
Edge distance E

Table 3
Graphical representation of all types of correlation coefficients and standard deviations

Coefficient of correlation        Line style
First and second blocks:
  First and fifth                 Solid
  Second and sixth                Dashed
  Third and seventh               Dotted
  Fourth and eighth               Dash-dotted
Third block:
  Ninth                           Solid
  Tenth                           Dashed
  Eleventh                        Dotted

Fig. 4 represents the first four coefficients of correlation. The (small) difference
between L and E values is consistently the same across all experiments with the
exception of those experiments comparing different distributions (Figs. 6, 9
and 12). We note that the values using E are consistently smaller than those
values using L. The reason lies in the difference in computing distances. When L
is used, the distance between the members of two clusters is the same for all
members. However, when E is used, this may not be true (e.g. tree that is
not height balanced) since the distance is equal to the number of edges connect-
ing two members belonging to different clusters. In the case of Figs. 6, 9 and 12
the difference is attenuated due to the use of different distributions.
When the values in L and E are compared against each other, the trend
among the four coefficients of correlation is almost the same. This points to
the fact that the distance type does not play a major role in the final clustering.
The values of the first and second types of correlation are larger than those of the third and fourth types of correlation because of the corresponding intrinsic semantics. The first and second types of correlation compare data objects drawn from the same initial set. The third and fourth types of correlation com-
pare data objects drawn from different sets. One should expect the former data
objects to be more related than the latter data objects. The standard deviation
values exhibit roughly the same kind of behavior as their corresponding coef-
ficient of correlation values. This points to the fact that the different types of
correlation behave in a uniform and predictable fashion.
Since the coefficient of correlation values are greater than 0.5 in all the above
cases, this points to the important observation that the data context does not
seem to play a significant role in the final data clustering. Likewise, the data
set size does not seem to have any substantial influence on the final clustering.
Note that the slope value is almost equal to zero. This is also confirmed by the
uniform behavior of the standard deviation values as described above.
Second block of coefficients of correlation: The next four coefficients of cor-
relation check the influence of the data size on clustering. Fig. 5 depicts the
next block of coefficients of correlation.
[Figure: coefficient of correlation versus list size for single linkage under the uniform, piecewise and Gaussian distributions, with linear and edge distances.]

Fig. 4. Slink: First block of coefficients of correlation.

When the values in L and E are compared, there is no substantial difference


which is indicative of the independence of the clustering from any type of dis-
tance used. As in the previous case, the standard deviation values exhibit the
same behavior as the corresponding coefficient of correlation values. This is
[Figure: coefficient of correlation versus list size for single linkage under the uniform, piecewise and Gaussian distributions, with linear and edge distances.]

Fig. 5. Slink: Second block of coefficients of correlation.

reminiscent of a uniform and predictable behavior of the different types of correlations.
The high values indicate that the context sizes have little effect on how data is clustered. As in the previous case, the data size does not seem to influence the

final clustering outcome as the slope is nearly equal to zero. Likewise, the data
set size does not seem to have any substantial influence on the final clustering.
Note that the slope value is almost equal to zero. This is also confirmed by the
uniform behavior of the standard deviation values as described above.
Third block of coefficients of correlation: The next three coefficients of corre-
lation check the influence of the distribution for L and E. All other parameters
are fixed and the same for the pairs of sets of objects to be compared.
Fig. 6 depicts the last three types of coefficients of correlations.
The curve representing the case UP (Uniform and Piecewise distributions), in either the L or the E case, shows values that are a bit lower than the values in the curves representing UG (Uniform and Gaussian distributions) and GP (Gaussian and Piecewise distributions). This can be explained by the problem of bootstrapping the random number generator. This is a constant in most of the experiments conducted in this study. The first concurrent experiments (Slink using L and E) exhibit a behavior that is a little different from the other pieces of experiments.
When the values in the case of L and E are compared, no substantial differ-
ence is observed. This is indicative of the independence of the clustering from
the types of distances used. Like the previous case, the standard deviation val-
ues exhibit the same behavior as the corresponding coefficient of correlation
values. This is reminiscent of a uniform and predictable behavior of the different
types of correlations.
Since values converge to the value 0.5, this indicates that the distributions do
not influence the clustering one way or the other. As in the previous case, the
data size does not seem to influence the final clustering outcome as the slope
is nearly equal to zero. Likewise, the data set size does not seem to have any
substantial influence on the final clustering. Note that the slope value is almost
equal to zero. This is also confirmed by the uniform behavior of the standard
deviation values as described above.
[Figure: the third block of coefficients of correlation versus list size for single linkage, with linear and edge distances.]

Fig. 6. Slink: Third block of coefficients of correlation.


[Figure: coefficient of correlation versus list size for complete linkage under the uniform, piecewise and Gaussian distributions, with linear and edge distances.]

Fig. 7. Clink: First block of coefficients of correlation.

4.1.2. Clink: Results interpretation


In essence, the experiments for the Clink clustering method follow the same
type of pattern and behavior as the Slink method. Figs. 7-9 depict the first, second and third blocks of coefficients of correlation.
[Figure: coefficient of correlation versus list size for complete linkage under the uniform, piecewise and Gaussian distributions, with linear and edge distances.]

Fig. 8. Clink: Second block of coefficients of correlation.

The interpretations that apply for the Slink method also apply for the Clink
clustering methods as both follow a very similar pattern of behavior and the
values for both the coefficients of correlation and the standard deviation are
quite similar.
[Figure: the third block of coefficients of correlation versus list size for complete linkage, with linear and edge distances.]

Fig. 9. Clink: Third block of coefficients of correlation.

4.1.3. Centroid: Results interpretation


As was the case for Slink and Clink, the experiments for the Centroid clustering method follow the same type of pattern and behavior. The only differ-
ence lies in the values for the different correlations obtained using different
parameters. Figs. 10-12 depict the first, second, third blocks of coefficients
of correlation.
The interpretations that apply for the previous clustering methods also ap-
ply for the Centroid clustering method as it follows a similar pattern of behav-
ior. Similarly, the values for both the coefficients of correlation and the
standard deviation are also similar.

4.2. Summary of results

Table 4 shows a summary of the results that averages out the different com-
puted coefficients of correlation. Blocks 1-3 correspond respectively to the first
block of four coefficients of correlation, the second block of four coefficients of
correlation, and the third block of three coefficients of correlation.

4.3. Acceptability of the least square approximation

Tables 5-7 represent the least square approximations for all the curves
shown in this study. The acceptability of an approximation depends on whether all the coefficient of correlation values fall within the interval delimited by the approximating function and the standard deviation. If this is the case, then we say that the approximation is good. Otherwise, we look at how many points do not fall within the boundaries and determine the goodness of the function. Using these functions will enable us to predict the behavior of the clustering methods with higher data set sizes.
[Figure: coefficient of correlation versus list size for the centroid method under the uniform, piecewise and Gaussian distributions, with linear and edge distances.]

Fig. 10. Centroid: First block of coefficients of correlation.

As Tables 5-7 (see Appendix) show, the values of the slopes are all very
small. This points to the stability of all results. All approximations yield lines that are almost parallel to the x-axis.
[Figure: coefficient of correlation versus list size for the centroid method under the uniform, piecewise and Gaussian distributions, with linear and edge distances.]

Fig. 11. Centroid: Second block of coefficients of correlation.

The acceptability test was run and all points passed the test satisfactorily.
Therefore all the approximations listed in tables mentioned above are good ap-
proximations.
[Figure: the third block of coefficients of correlation versus list size for the centroid method, with linear and edge distances.]

Fig. 12. Centroid: Third block of coefficients of correlation.

4.4. Comparison of results across clustering methods

In what follows, we compare the different clustering methods against each


other using the different parameters used in this study. We rely on the results
obtained and the general observations that can be drawn from the experiments
shown in the different figures and tables.
1. The results show that across space dimensions, the context does not com-
pletely hide the sets. For instance, the first and second types of coefficients of
correlation (as shown in all figures) are a little different from the third and

Table 4
Summary of results

                       Slink    Clink    Centroid
Block 1    L    U      0.6      0.6      0.55
                P      0.7      0.75     0.7
                G      0.7      0.7      0.65
           E    U      0.5      0.55     0.55
                P      0.7      0.7      0.65
                G      0.6      0.6      0.6
Block 2    L    U      0.8      0.7      0.75
                P      0.9      0.85     0.85
                G      0.85     0.8      0.75
           E    U      0.65     0.7      0.65
                P      0.8      0.75     0.7
                G      0.7      0.7      0.7
Block 3    L           0.5      0.55     0.5
           E           0.5      0.5      0.45
[Tables 5 and 6: least square approximations of the first and second blocks of coefficients of correlation.]

Table 7
Function approximation of the third block of coefficients of correlation

Type    Ninth correlation        Tenth correlation         Eleventh correlation
SL      -0.0000078 X + 0.44      -0.00083 X + 0.50         -0.0008 X + 0.51
SE       0.00113 X + 0.45        -0.000119 X + 0.51        -0.000124 X + 0.51
CL      -0.00089 X + 0.53        -0.00095 X + 0.52          0.0000088 X + 0.53
CE      -0.00085 X + 0.52         0.00089 X + 0.50          0.00088 X + 0.51
OL       0.00079 X + 0.50         0.00079 X + 0.51         -0.0000088 X + 0.50
OE       0.0000101 X + 0.46       0.00111 X + 0.48          0.0000101 X + 0.49

fourth types of coefficient of correlation (as shown in all figures). The values
clearly show what kinds of coefficients are computed.
2. The results show that given the same distribution and type of distance, all
clustering methods exhibit the same behavior and yield approximately the same
values.
3. Slink, Clink, and Centroid seem to have very similar behavior. The coef-
ficients of correlation values are also very close. An explanation for the similar-
ity in behavior between Slink, Clink, and Centroid is that these methods are
based on one single object per cluster to determine similarity distances.
4. The second block of coefficients of correlation for all clustering methods demonstrates that the context does not influence the data clustering because all
coefficients of correlation are close to the value 1.
5. The results also show that all clustering methods are equally stable. This
finding comes as a surprise, as intuitively, one expects a clustering method to be
more stable than the others.
6. The results show that the data distribution does not significantly affect the
clustering techniques because the values obtained are very similar to each oth-
er. That is a relatively major finding as the results strongly point to the inde-
pendence of the distribution and the data clustering.
7. The third block of coefficients of correlation across all clustering methods shows that the three methods are little or not at all perturbed even in a noisy environment, since there are no significant differences in results from the Uniform, Piecewise, and Gaussian distributions.
8. The type of distance (linear or edge) does not influence the clustering pro-
cess as there are no significant differences between the coefficients of correlation
obtained using either linear or edge distances.
The results obtained in this study confirm those from [5], which used a one-dimensional (1-D) data sample and fewer parameters. The results point very strongly to the fact that, in general, no clustering technique is better than another. What this essentially means is that there is an inherent way for data objects to cluster, independently from the techniques used.
The other important result is that the only discriminator for selecting a clustering method is its computational attractiveness and nothing else. This is a very important result as, in the past, there was never any evidence that clustering methods had very similar behavior.
The results presented here are compelling evidence that clustering methods
do not seem to influence the outcome of the clustering process. Indeed, all clus-
tering methods considered here exhibit a behavior that is almost constant re-
gardless of the parameters being used in comparing them.

5. Conclusion

In this exhaustive study, we considered three clustering methods. The exper-


iments involved a wide range of parameters to test the stability and sensitivity
of the clustering methods. These experiments were conducted for objects that
are in the 2-D space. The results obtained overwhelmingly point to the stability of each clustering method and its low sensitivity to noise. The most startling finding of this study is, however, that all clustering methods exhibit an almost identical behavior, regardless of the parameters used. The fact that data
objects are drawn from different data spaces does not change the above find-
ings [5].
The above findings have paramount implications. In that regard, one of the most important results is that objects have a natural tendency to cluster themselves. Clustering methods do not seem to play a major role in the final shape of
the clustering. This also means that the only criterion that should be used to
select one clustering method is the attractiveness of the computational com-
plexity of the clustering algorithm.

Acknowledgements

We would like to thank Mostefa Golea and Alex Delis for their insightful
and detailed comments.

References

[1] M.S. Aldenderfer, R.K. Blashfield, Cluster Analysis, Sage, Beverly Hills, CA, 1984.
[2] J. Banerjee, W. Kim, S.-J. Kim, J. Garza, Clustering a DAG for CAD databases, IEEE Transactions on Software Engineering 14 (11) (1988) 1684-1699.
[3] V. Benzaken, An evaluation model for clustering strategies in the O2 object-oriented database system, In: ICDT, 1990.
[4] V. Benzaken, C. Delobel, Enhancing performance in a persistent object store: Clustering strategies in O2, In: PODS, 1990.
[5] A. Bouguettaya, On-line clustering, IEEE Transactions on Knowledge and Data Engineering 8 (2) (1996).

[6] A. Delis, V.R. Basili, Data binding tool: A tool for measurement based ada source reusability
and design assessment, International Journal of Software Engineering and Knowledge
Engineering 3 (3) (1993) 287-318.
[7] R. Dubes, A.K. Jain, Clustering methodologies in exploratory data analysis, Advances in
Computers 19 (1980).
[8] B. Everitt, Cluster Analysis, Heinemann, Yorkshire, England, 1977.
[9] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, MIT Press, Menlo Park, CA, 1996.
[10] J.A. Hartigan, Clustering Algorithms, Wiley, London, 1975.
[11] A.K. Jain, J. Mao, K.M. Mohiuddin, Artificial neural networks, Computer 29 (3) (1996) 31-
44.
[12] N. Jardine, R. Sibson, Mathematical Taxonomy, Wiley, London, 1971.
[13] J.-B.R. Cheng, A.R. Hurson, Effective clustering of complex objects in object-oriented databases, In: SIGMOD, 1991.
[14] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, London, 1990.
[15] D.E. Knuth, The Art of Computer Programming, Addison-Wesley, Reading, MA, 1971.
[16] G.N. Lance, W.T. Williams, A general theory for classification sorting strategy, The
Computer Journal 9 (1967) 373-386.
[17] W.J. McIver, R. King, Self-adaptive, on-line reclustering of complex object data, In:
SIGMOD, 1994.
[18] F. Murtagh, A survey of recent advances in hierarchical clustering algorithms, The Computer Journal 26 (4) (1983) 354-358.
[19] G. Piatetsky-Shapiro, W.J. Frawley (Eds.), Knowledge Discovery in Databases, AAAI Press,
Menlo Park, CA, 1991.
[20] W.H. Press, Numerical Recipes in C, The Art of Scientific Programming, 2nd ed., Cambridge
University Press, 1992.
[21] E. Rasmussen, Clustering algorithms in information retrieval, In: W.B. Frakes, R. Baeza-Yates (Eds.), Information Retrieval: Data Structures and Algorithms, Prentice-Hall, Englewood Cliffs, NJ, 1990.
[22] H.C. Romesburg, Cluster Analysis for Researchers, Krieger, Malabar, FL, 1990.
[23] R.C. Tryon, D.E. Bailey, Cluster Analysis, McGraw-Hill, New York, 1970.
[24] M.M. Tsangaris, J.F. Naughton, A stochastic approach for clustering in object bases, In:
SIGMOD, 1991.
[25] P. Willett, Recent trends in hierarchic document clustering, a critical review, Information Processing and Management 9 (24) (1988) 577-597.
[26] C.T. Yu, C. Suen, K. Lam, M.K. Siu, Adaptive record clustering, ACM Transactions 2 (10)
(1985) 180-204.
[27] J. Zupan, Clustering of Large Data Sets, Research Studies Press, Letchworth, England, 1982.