
CURE: An Efficient Clustering Algorithm for Large Databases

Authors: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim
Presentation by: Vuk Malbasa
For CIS664, Prof. Vasilis Megalooekonomou
Overview
• Introduction
• Previous Approaches
• Drawbacks of previous approaches
• CURE: Approach
• Enhancements for Large Datasets
• Conclusions
Introduction
• Clustering problem: Given a set of points, separate them into
clusters so that data points within a cluster are more similar to
each other than to points in different clusters.
• Traditional clustering techniques either favor
clusters with spherical shapes and similar sizes
or are fragile to the presence of outliers.
• CURE is robust to outliers and identifies clusters with
non-spherical shapes and wide variances in size.
• Each cluster is represented by a fixed number of
well scattered points.
Introduction
• CURE is a hierarchical clustering technique where each partition is
nested into the next partition in the sequence.
• CURE is an agglomerative algorithm where disjoint clusters are
successively merged until the desired number of clusters is reached.
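As a rough illustration (not the authors' implementation), the sketch below shows the basic agglomerative loop: start with singleton clusters and repeatedly merge the closest pair until only k clusters remain. The function name agglomerative and the pluggable dist argument are my own; dist can be any of the metrics listed on the next slide.

```python
import numpy as np

def agglomerative(points, k, dist):
    """Naive agglomerative clustering: merge the closest pair of clusters
    until only k clusters remain."""
    clusters = [np.array([p]) for p in points]   # every point starts as its own cluster
    while len(clusters) > k:
        # find the pair of clusters whose inter-cluster distance is smallest
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: dist(clusters[ab[0]], clusters[ab[1]]))
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters
```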
Previous Approaches
• At each step in agglomerative clustering, the two clusters that are
merged are the pair for which some distance metric is minimized.
• This distance metric can be:
– Distance between the means of the clusters, dmean
– Average distance between all pairs of points across the clusters, dave
– Maximal distance between points in the clusters, dmax
– Minimal distance between points in the clusters, dmin
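For concreteness, here is a minimal sketch of the four metrics for two clusters given as numpy point arrays (the function names are mine, not from the paper):

```python
import numpy as np

def pairwise(A, B):
    # all cross-cluster Euclidean distances: entry (i, j) = ||A[i] - B[j]||
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def d_mean(A, B):
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # distance between centroids

def d_ave(A, B):
    return pairwise(A, B).mean()   # average over all cross-cluster pairs

def d_max(A, B):
    return pairwise(A, B).max()    # farthest pair (complete linkage)

def d_min(A, B):
    return pairwise(A, B).min()    # closest pair (single linkage)
```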
Drawbacks of previous approaches
• For situations where clusters vary in size, the dave, dmax and dmean
distance metrics will split large clusters into parts.
• Non-spherical clusters will be split by dmean.
• Clusters connected by outliers will be merged if the dmin metric is
used.
• None of the stated approaches work well in the presence of
non-spherical clusters or outliers.
Drawbacks of previous approaches
CURE: Approach
• CURE is positioned between the centroid-based (dmean) and all-points
(dmin) extremes.
• A constant number of well scattered points is used to capture the
shape and extent of a cluster.
• The points are shrunk towards the centroid of the cluster by a
factor α.
• These well scattered and shrunk points are used as the
representatives of the cluster.
CURE: Approach
• The scattered points approach alleviates the shortcomings of dave
and dmin.
– Since multiple representatives are used, the splitting of large
clusters is avoided.
– Multiple representatives allow for the discovery of non-spherical
clusters.
– The shrinking phase will affect outliers more
than other points since their distance from the
centroid will be decreased more than that of
regular points.
CURE: Approach
• Initially, since all points are in separate clusters, each cluster is
defined by the single point it contains.
• Clusters are merged until they contain at least c points.
• The first scattered point in a cluster is the one which is farthest
away from the cluster's centroid.
• Each further scattered point is chosen so that its distance from the
previously chosen scattered points is maximal.
• When the c well scattered points have been chosen, they are shrunk
towards the mean by a factor α (r = p + α*(mean - p)).
• After clusters have c representatives, the distance between two
clusters is the distance between the closest pair of representatives,
one from each cluster.
• Every time two clusters are merged, their representatives are
re-calculated.
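A minimal sketch of the representative-selection and shrinking steps, assuming Euclidean distance and clusters stored as numpy arrays (the helper names representatives and cluster_distance are my own):

```python
import numpy as np

def representatives(cluster, c, alpha):
    """Pick up to c well scattered points by farthest-point selection,
    then shrink them towards the centroid by the factor alpha."""
    mean = cluster.mean(axis=0)
    # first scattered point: farthest from the centroid
    scattered = [cluster[np.argmax(np.linalg.norm(cluster - mean, axis=1))]]
    while len(scattered) < min(c, len(cluster)):
        # next point: maximizes its distance to the already chosen scattered points
        d = np.min(np.linalg.norm(cluster[:, None, :] - np.array(scattered)[None, :, :],
                                  axis=-1), axis=1)
        scattered.append(cluster[np.argmax(d)])
    # shrink: r = p + alpha * (mean - p)
    return np.array([p + alpha * (mean - p) for p in scattered])

def cluster_distance(reps_u, reps_v):
    """Distance between two clusters: closest pair of representatives."""
    return np.linalg.norm(reps_u[:, None, :] - reps_v[None, :, :], axis=-1).min()
```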
Enhancements for Large Datasets
• Random sampling
– Filters outliers and allows the dataset to fit into
memory
• Partitioning
– First cluster within each partition, then merge the clusters across
partitions
• Labeling Data on Disk
– The final labeling phase can be done by nearest-neighbor assignment
of each point to the already chosen cluster representatives
• Handling outliers
– Outliers are partially eliminated and spread out by random sampling;
the remainder are identified because they belong to small clusters
that grow slowly
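A hedged sketch of the final labeling step, assuming each cluster's shrunk representatives are kept as a numpy array: every point on disk is assigned to the cluster whose nearest representative is closest (the function name label_points is my own):

```python
import numpy as np

def label_points(points, reps_per_cluster):
    """Assign each point to the cluster owning its nearest representative."""
    labels = []
    for p in points:
        # distance from p to the closest representative of each cluster
        dists = [np.linalg.norm(reps - p, axis=1).min() for reps in reps_per_cluster]
        labels.append(int(np.argmin(dists)))
    return labels
```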
Conclusions
• CURE can identify not only spherical but also ellipsoidal clusters
• CURE is robust to outliers
• CURE correctly clusters data with large
differences in cluster size
• Running time for a low-dimensional dataset with s points is O(s²)
• Using partitioning and sampling CURE can
be applied to large datasets
Thanks!
?
