Authors: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim
Presentation by: Vuk Malbasa
For CIS664, Prof. Vasilis Megalooekonomou

Overview
• Introduction
• Previous Approaches
• Drawbacks of previous approaches
• CURE: Approach
• Enhancements for Large Datasets
• Conclusions

Introduction
• Clustering problem: given a set of points, separate them into clusters so that points within a cluster are more similar to each other than to points in different clusters.
• Traditional clustering techniques either favor clusters with spherical shapes and similar sizes, or are fragile in the presence of outliers.
• CURE is robust to outliers and identifies clusters with non-spherical shapes and wide variances in size.
• Each cluster is represented by a fixed number of well-scattered points.

Introduction
• CURE is a hierarchical clustering technique: each partition is nested into the next partition in the sequence.
• CURE is an agglomerative algorithm: disjoint clusters are successively merged until the number of clusters is reduced to the desired number.

Previous Approaches
• At each step in agglomerative clustering, the two clusters merged are those that minimize some distance metric.
• This distance metric can be:
  – Distance between the means of the clusters, dmean
  – Average distance between all pairs of points in the two clusters, dave
  – Maximal distance between points in the two clusters, dmax
  – Minimal distance between points in the two clusters, dmin

Drawbacks of previous approaches
• When clusters vary in size, the dave, dmax, and dmean metrics tend to split large clusters into parts.
• Non-spherical clusters will be split by dmean.
• Clusters connected by outliers will be incorrectly joined if the dmin metric is used.
• None of the stated approaches work well in the presence of non-spherical clusters or outliers.

Drawbacks of previous approaches

CURE: Approach
• CURE is positioned between the centroid-based (dave) and all-points (dmin) extremes.
• A constant number of well-scattered points is used to capture the shape and extent of a cluster.
• These points are shrunk towards the centroid of the cluster by a factor α.
• The well-scattered and shrunk points are used as representatives of the cluster.

CURE: Approach
• The scattered-points approach alleviates the shortcomings of dave and dmin:
  – Since multiple representatives are used, the splitting of large clusters is avoided.
  – Multiple representatives allow the discovery of non-spherical clusters.
  – The shrinking phase affects outliers more than other points, since their distance from the centroid is decreased more than that of regular points.

CURE: Approach
• Initially, since all points are in separate clusters, each cluster is defined by the single point it contains.
• Clusters are merged until they contain at least c points.
• The first scattered point in a cluster is the one farthest from the cluster's centroid.
• Each subsequent scattered point is chosen so that its distance from the previously chosen scattered points is maximal.
• Once the c well-scattered points are chosen, they are shrunk towards the centroid by the factor α: r = p + α·(mean − p).
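The representative-selection rule above (farthest-from-centroid first, then farthest-from-chosen, then shrink by α) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, defaults, and use of numpy are my assumptions.

```python
import numpy as np

def select_representatives(points, c=4, alpha=0.5):
    """Sketch: pick c well-scattered points from one cluster and shrink
    them towards the centroid via r = p + alpha * (mean - p).
    Illustrative helper; name and defaults are not from the paper."""
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)

    # First scattered point: the point farthest from the centroid.
    dist_to_mean = np.linalg.norm(points - mean, axis=1)
    chosen = [int(np.argmax(dist_to_mean))]

    # Each further point maximizes its distance to the points chosen so far.
    while len(chosen) < min(c, len(points)):
        d = np.min(
            np.linalg.norm(points[:, None, :] - points[chosen][None, :, :], axis=2),
            axis=1,
        )
        d[chosen] = -np.inf  # never re-pick an already chosen point
        chosen.append(int(np.argmax(d)))

    scattered = points[chosen]
    # Shrink towards the centroid: r = p + alpha * (mean - p)
    return scattered + alpha * (mean - scattered)
```

With α = 0 the representatives stay on the cluster boundary (all-points-like behavior); with α = 1 they collapse onto the centroid (centroid-like behavior), which is the middle ground the slides describe.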
• Once clusters have c representatives, the distance between two clusters is the distance between the closest pair of representatives, one from each cluster.
• Every time two clusters are merged, their representatives are re-calculated.

Enhancements for Large Datasets
• Random sampling
  – Filters outliers and allows the dataset to fit into memory
• Partitioning
  – First cluster within partitions, then merge the partitions
• Labeling Data on Disk
  – The final labeling phase can be done by nearest-neighbor assignment to the already chosen cluster representatives
• Handling outliers
  – Outliers are partially eliminated and spread out by random sampling, and are identified because they belong to small clusters that grow slowly

Conclusions
• CURE can identify clusters that are not only spherical but also ellipsoid
• CURE is robust to outliers
• CURE correctly clusters data with large differences in cluster size
• Running time for a low-dimensional dataset with s points is O(s²)
• Using partitioning and sampling, CURE can be applied to large datasets

Thanks!
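The merge criterion (closest pair of representatives) and the disk-labeling step (nearest representative) described above might be sketched like this. The helper names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def cluster_distance(reps_u, reps_v):
    """Distance between two clusters: the minimum distance between any
    representative of one cluster and any representative of the other.
    (Illustrative sketch of the merge criterion on the slides.)"""
    reps_u = np.asarray(reps_u, dtype=float)
    reps_v = np.asarray(reps_v, dtype=float)
    # All pairwise distances between the two representative sets.
    pairwise = np.linalg.norm(reps_u[:, None, :] - reps_v[None, :, :], axis=2)
    return float(pairwise.min())

def label_point(point, cluster_reps):
    """Final labeling phase: assign a data point on disk to the cluster
    whose nearest representative is closest (hypothetical helper)."""
    point = np.asarray(point, dtype=float)
    dists = [
        np.min(np.linalg.norm(np.asarray(reps, dtype=float) - point, axis=1))
        for reps in cluster_reps
    ]
    return int(np.argmin(dists))
```

At each agglomerative step, the pair of clusters minimizing `cluster_distance` would be merged and their representatives re-selected from the union, as the slides state.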