ICS 635 - MACHINE LEARNING - DR. SUSANNA STILL

Iterative Mesh Based Clustering with Threshold Subdivision
Robert Ross Puckett
Department of Information & Computer Sciences
University of Hawaii, Manoa Campus
Honolulu, HI 96822, USA

Abstract - Traditional clustering methodologies suffer from serious drawbacks to their general utility. For example, the k-means clustering algorithm usually involves the use of a metric such as Euclidean distance, which is scale variant [1]. Clustering methods based upon similarity matrices, although less susceptible to this problem, add greater space complexity for the storage of the matrices. As proposed by Choudhari et al. [2], a mesh based clustering approach without stopping criterion can provide a clustering solution with O(n) time and space requirements. As mentioned in that paper, stopping criteria based on connections or distance may produce suboptimal clustering by not exhausting the solution space or through scale variance. However, Choudhari notes that the selection of an appropriate grid size is a major limitation of the algorithm. As such, this paper describes the development of an iterative mesh based clustering method. In this method, the grid size is reduced incrementally until a threshold value of the cells belong to the expected number of clusters. Thus, the appropriate grid size is self-determined.

Index Terms - mesh clustering, grid clustering, partitioning, pattern classification.

I. INTRODUCTION

CLUSTERING is a useful tool for segmentation, pattern classification, and data mining. However, traditional approaches suffer from serious drawbacks to their utility. One of the more popular methods, k-means clustering, depends heavily on the choice of distance metric used. Such an algorithm is likely subject to scale variance, which can result in sub-optimal clustering. Many other methods involve the use of a similarity matrix. However, the creation and maintenance of such a matrix leads to O(n^2) space complexity [2]. Thus, in large high-dimensional data sets, the curse of dimensionality makes such clustering methods inefficient and ineffective [3].

Choudhari et al. [2] proposed a mesh based clustering algorithm lacking a stopping criterion. That is, the space is divided into a mesh and clustering is performed based upon the adjacent cells with data inside them. Furthermore, the algorithm does not include a stopping criterion such as stopping at a certain degree of connectivity or a distance threshold.

However, as admitted in the paper, the algorithm's major problem is finding an appropriate grid size. Thus, the implemented algorithm described herein incrementally decreases the grid size until a threshold of cluster membership is reached.

II. METHODS

Initially the algorithm was implemented, for the most part, as outlined in the Choudhari paper. This provided the capability of mesh clustering for a given M by N grid size value. In the algorithm the data set is first normalized. Next, each point is assigned to a box number. After all the boxes are assigned points, the boxes are clustered together. The algorithm examines the neighbors of each cell to see which neighbor boxes contain points. All neighbor boxes containing points are assumed to be inside the same cluster as the box being considered. At this point a graph is displayed with the current grid and boxes, using colors to represent the different clusters. Next, the M and N values are increased and the process repeats. Thus the grid is made finer.

After each clustering attempt and graph display, the clusters are examined against a threshold. If the mean cluster size is within a proportion of the expected cluster size, then the subdivision process is halted. For the purposes of graphing, an additional stopping criterion is added to prevent subdivisions that would make the graph illegible.

III. EXPERIMENTS

The first experiments were verification tests to ensure that the clustering algorithm and grapher were operating correctly. Using artificial data and hand calculations, several values were tested against the results of the algorithm. Different mesh sizes were used to determine useful graduations in size for subdivisions. Some grid sizes result in obviously erroneous graphs, which are likely the result of round-off error in converting between double values and integer pixel positions. This problem is being tracked down. However, certain grid sizes seemed immune to this problem. Thus, stable grid sizes were used for the further experiments.

The remaining experiments used the ELENA artificial data-set database [4]. This database includes intersecting Gaussians, rings, and additional forms. Although k-means clustering provides hyperspheres for its clusters, the mesh based clustering allows for greater flexibility in cluster shape through non-spherical clusters.

A. ELENA Clouds

The ELENA clouds database consists of 5000 two-dimensional data points divided into three overlapping clusters
of different variance, mean, and position. Two of the distributions are circular, while the third is oblong.

As shown in Figure 1, the clustering attempt with 100x100 cells resulted in one large cluster. The cell size is too large and enough neighbor cells contain points such that there is a connection between all of the clusters. Figure 2 shows that with 150x150 cells, the upper two clusters are now separated from the lower cluster. The overlapping of the lower cluster is not as severe as that of the upper clusters; thus, this cell size was able to separate the top clusters and the bottom cluster. Finally, with 200x200 cells, Figure 3 shows the three clusters fully separated. The graphs are filtered to not show clusters of exceedingly small size.

Fig. 1. 100 Square Grid Results

Fig. 2. 150 Square Grid Results

Fig. 3. 200 Square Grid Results

B. ELENA Concentric

The ELENA concentric database consists of 2500 two-dimensional data points divided into two non-overlapping clusters. One cluster is a ring shape, while the other cluster is a circular shape embedded within the first cluster's ring. There is no appreciable gap between the two clusters. The grid algorithm is not limited to clustering circles or ellipses. Any shape of adjacent boxes with points in them can constitute a cluster. Thus, it was hoped that the algorithm would be able to identify the ring-shaped cluster and the circle-shaped cluster as two separate clusters.

Unfortunately, the mesh clustering algorithm performed abysmally for this data set. For no grid size were the two clusters separable. For large grid sizes all data points were clustered into a single cluster. For grid sizes smaller than the average distance between two points, the graph is divided into dozens if not hundreds of meaningless clusters. The problem is that the algorithm depends on proximity of neighbors to define membership in clusters. However, if the gap between the two distributions were larger, then a grid size could be found that would accurately separate the distributions.

IV. CONCLUSION

Mesh based clustering is a useful tool that requires additional research. Since identifying the appropriate grid size is a major limiting factor of the Choudhari algorithm, the implemented algorithm performs incremental subdivisions until a threshold of cluster membership is reached. Unfortunately, there are still problems needing resolution with this algorithm and with mesh clustering in general.

Overlapping and sparse distributions create major problems for this form of grid clustering. For overlapping data sets it is desirable to have a small cell size to prevent the clusters being grouped together. For sparse data sets, it is desirable to have a large cell size to capture more of the neighbors and join more points together into a common cluster. Unfortunately, as the ELENA clouds experiment above shows, it is possible to have both overlapping and sparse distributions occur together.

Additionally, the lack of a stopping criterion allows for a great variety of possible cluster shapes and sizes. Unfortunately, if the cell size is not optimal, then far more cells may be joined together into a cluster than should be. All it takes is for one small chain of boxes to connect two clusters and join them, no matter how far they are apart. Furthermore, the practice of adding adjacent neighbors to the same cluster is impractical in high dimensional spaces as the number of neighbors increases exponentially [3].
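The chaining effect can be reproduced with a minimal sketch of the Section II procedure (normalize, assign points to boxes, merge neighboring occupied boxes). This is an illustration under stated assumptions rather than the implementation used in the experiments: the function names assign_boxes and mesh_cluster are hypothetical, normalization is assumed to be min-max scaling, and "neighbors" is assumed to mean the 8 surrounding boxes.

```python
from collections import deque

def assign_boxes(points, m, n):
    """Min-max normalize 2-D points and map each one to a (row, col) box.

    Assumption: min-max scaling; the paper only says the data is normalized.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    xr = (max(xs) - x0) or 1.0  # guard against zero range
    yr = (max(ys) - y0) or 1.0
    boxes = {}
    for i, (x, y) in enumerate(points):
        col = min(int((x - x0) / xr * n), n - 1)
        row = min(int((y - y0) / yr * m), m - 1)
        boxes.setdefault((row, col), []).append(i)
    return boxes

def mesh_cluster(points, m, n):
    """Label points so that occupied boxes that touch share one cluster.

    Assumption: 8-neighborhood adjacency on an m x n mesh.
    """
    boxes = assign_boxes(points, m, n)
    labels = [-1] * len(points)
    seen = set()
    cluster = 0
    for start in boxes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first walk over adjacent occupied boxes
            r, c = queue.popleft()
            for i in boxes[(r, c)]:
                labels[i] = cluster
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nb = (r + dr, c + dc)
                    if nb in boxes and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        cluster += 1
    return labels

# Two small groups in opposite corners of the unit square.
blobs = [(0.0, 0.0), (0.05, 0.05), (0.95, 0.95), (1.0, 1.0)]
# A thin diagonal chain of single points bridging the two groups.
chain = [(k / 10 + 0.05, k / 10 + 0.05) for k in range(1, 9)]

separate = len(set(mesh_cluster(blobs, 10, 10)))
joined = len(set(mesh_cluster(blobs + chain, 10, 10)))
```

On a 10x10 mesh the two groups alone form two clusters, while adding the eight chain points merges everything into one. In two dimensions each box has 8 neighbors; in d dimensions the same neighborhood contains 3^d - 1 boxes, which is the exponential growth noted above.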
For simple distributions, identifying the center of a cluster could be as simple as averaging the positions of the member boxes of that cluster. However, for more complexly shaped distributions this process, and the very concept of a center, may not be useful.
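A small sketch of this averaging, and of how it breaks down for a ring, is given here. The helper cluster_center is hypothetical, and boxes are assumed to be (row, col) indices on an m x n mesh over data normalized to the unit square.

```python
def cluster_center(member_boxes, m, n):
    """Average the centers of a cluster's boxes, in normalized coordinates."""
    rows = [r for r, _ in member_boxes]
    cols = [c for _, c in member_boxes]
    cy = (sum(rows) / len(rows) + 0.5) / m  # +0.5 moves from box corner to box center
    cx = (sum(cols) / len(cols) + 0.5) / n
    return cx, cy

# A ring: the 8 boxes surrounding box (2, 2) on a 5x5 mesh.
ring = [(r, c) for r in (1, 2, 3) for c in (1, 2, 3) if (r, c) != (2, 2)]
center = cluster_center(ring, 5, 5)
```

For the ring, the averaged center falls at (0.5, 0.5), inside box (2, 2), which is the one box in the neighborhood that does not belong to the cluster.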

V. FUTURE WORK
Although it is possible to simply rerun the algorithm with
larger or smaller grid sizes, it seems unnecessary to include all
of the cells in such future operations. Although the algorithm
originally has O(n) complexity, repetitive runs would result
in longer run-time.
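The rerun loop, combined with the threshold test from Section II, might look like the sketch below. Several details are assumptions made for illustration: the data is already scaled to the unit square, the mesh is square (g by g), the expected cluster size is taken as n divided by the expected number of clusters, and the names cluster_sizes, refine, tol, step, and max_grid are hypothetical.

```python
def cluster_sizes(points, g):
    """Point counts per cluster on a g x g mesh (8-neighborhood assumed)."""
    boxes = {}
    for x, y in points:
        key = (min(int(y * g), g - 1), min(int(x * g), g - 1))
        boxes[key] = boxes.get(key, 0) + 1
    sizes, seen = [], set()
    for start in boxes:
        if start in seen:
            continue
        seen.add(start)
        stack, total = [start], 0
        while stack:  # depth-first walk over adjacent occupied boxes
            r, c = stack.pop()
            total += boxes[(r, c)]
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nb = (r + dr, c + dc)
                    if nb in boxes and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        sizes.append(total)
    return sizes

def refine(points, expected_k, tol=0.25, start=2, step=2, max_grid=64):
    """Make the mesh finer until the mean cluster size is within
    tol * expected of expected = n / expected_k (assumed threshold test)."""
    expected = len(points) / expected_k
    g = start
    while g <= max_grid:  # max_grid plays the role of the legibility cap
        sizes = cluster_sizes(points, g)
        mean = sum(sizes) / len(sizes)
        if abs(mean - expected) <= tol * expected:
            return g, sizes
        g += step
    return None, []

# Two tight groups of five points each, in opposite corners.
pts = [(0.1 + 0.01 * i, 0.1) for i in range(5)] + \
      [(0.9 - 0.01 * i, 0.9) for i in range(5)]
grid, sizes = refine(pts, expected_k=2)
```

With these points the 2x2 mesh joins both groups through diagonally adjacent boxes, and the loop stops at the 4x4 mesh with two clusters of five points. Each pass rebins every point, so k passes cost roughly k times a single run.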
As such it may be possible to optimize future iterations
by limiting the number of cells being subdivided. Instead of
dividing all of the cells, which is somewhat equivalent to
increasing the grid size, we can instead divide only the cells
that are important. That is, we will first discard cells with no
data points within. Next, we should not need to subdivide cells
in which similar data points occupy most of the area, leaving
little free space.
Thus, with normalized data, we can start the algorithm
with the largest possible grid size, and allow the algorithm to
continue subdividing the grids where subdivision is suspected
to result in improved clustering. This process will likely
result in clusters composed of cells of heterogeneous size.
Cluster centers will likely be larger cells surrounded by smaller
cells that further define the cluster boundaries.
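One way such selective subdivision could work is sketched below as a quadtree-style recursion. The splitting rule is a stand-in chosen for illustration, not a tested design: a cell is treated as filled, and left whole, when all four of its quadrants contain points. The function name subdivide and the max_depth parameter are hypothetical.

```python
def subdivide(points, x0=0.0, y0=0.0, size=1.0, depth=0, max_depth=4):
    """Return leaf cells as (x0, y0, size, count) for data in the unit square.

    Empty cells are dropped; only partly filled cells are split further.
    """
    if not points:
        return []  # discard cells with no data points
    if depth == max_depth or len(points) == 1:
        return [(x0, y0, size, len(points))]
    half = size / 2
    quads = {}
    for x, y in points:
        key = (x >= x0 + half, y >= y0 + half)  # (right half?, upper half?)
        quads.setdefault(key, []).append((x, y))
    if len(quads) == 4:
        # Assumed fill test: every quadrant is occupied, so keep the cell whole.
        return [(x0, y0, size, len(points))]
    leaves = []
    for (right, upper), pts in quads.items():
        leaves += subdivide(
            pts,
            x0 + (half if right else 0.0),
            y0 + (half if upper else 0.0),
            half,
            depth + 1,
            max_depth,
        )
    return leaves

# Four points filling the lower-left quarter, plus one isolated point.
pts = [(0.1, 0.1), (0.1, 0.4), (0.4, 0.1), (0.4, 0.4), (0.9, 0.9)]
leaves = subdivide(pts)
```

Here the root cell is split once, the two empty quarters are discarded, and the remaining two quarters are kept as leaves without further division.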

REFERENCES
[1] R. O. Duda, P. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley, 2001.
[2] V. N. Choudhari A., Hanmandlu M. and C. R.D, "Mesh based clustering without stopping criterion," in INDICON, 2005 Annual IEEE, 2005.
[3] A. Hinneburg and D. A. Keim, "Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering," in Proceedings of the 25th VLDB Conference, 1999, pp. 506-517. [Online]. Available: http://fusion.cs.uni-magdeburg.de/pubs/optigrid.pdf
[4] "Elena database," April 2005. [Online]. Available: http://www.dice.ucl.ac.be/mlg/DataBases/ELENA/ARTIFICIAL/