• Cluster: a collection of data objects
  – similar (or related) to one another within the same group
  – dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
  – Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
• Typical applications
  – As a stand-alone tool to get insight into data distribution
  – As a preprocessing step for other algorithms

Clustering for Data Understanding and Applications
• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
• Information retrieval: document clustering
• Land use: identification of areas of similar land use in an earth observation database
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults
• Segment a customer database based on similar buying patterns
• Identify similar Web usage patterns

Clustering Example

Clustering vs. Classification
• No prior knowledge
  – Number of clusters
  – Meaning of clusters
• Unsupervised learning

Types of Clustering
• Hierarchical – Nested set of clusters created.
• Partitional – One set of clusters created.
• Incremental – Each element handled one at a time.
• Simultaneous – All elements handled together.
• Overlapping/Non-overlapping

Clustering Approaches
• Hierarchical
  – Agglomerative
  – Divisive
• Partitional
• Categorical
• Large DB
  – Sampling
  – Compression

Partitional Clustering
• Nonhierarchical
• Creates clusters in one step as opposed to several steps.
• Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
• Usually deals with static sets.

Partitional Algorithms
• MST
• Squared Error
• K-Means
• Nearest Neighbor
• PAM (Partitioning Around Medoids)
• BEA (Bond Energy Algorithm)
• GA (Genetic Algorithm)

K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly choose initial means: m1=2, m2=4
• K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5, m2=16 (3 is equidistant from both initial means; the tie goes to K1)
• K1={2,3,4}, K2={10,12,20,30,11,25}, m1=3, m2=18
• K1={2,3,4,10}, K2={12,20,30,11,25}, m1=4.75, m2=19.6
• K1={2,3,4,10,11,12}, K2={20,30,25}, m1=7, m2=25
• Stop, as reassigning the points with these means yields the same clusters.
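
The iterations of the K-Means example can be reproduced with a short Python sketch. This is a minimal illustration for one-dimensional data, not a reference implementation: the function name kmeans_1d and the history list are made up for this example, and the rule that a point equidistant from two means goes to the lower-numbered cluster is an assumption (it matters for the point 3 in the first step).

```python
def kmeans_1d(points, initial_means, max_iter=100):
    """Basic k-means on 1-D data (illustrative sketch).

    Returns the final clusters, the final means, and a history of
    (clusters, means) pairs, one per reassignment that changed something.
    """
    means = list(initial_means)
    previous = None
    history = []
    clusters = [[] for _ in means]
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest mean.
        clusters = [[] for _ in means]
        for p in points:
            # Assumption: min() keeps the first minimum, so ties go to
            # the lower-numbered cluster.
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # Stop when the assignment no longer changes.
        if clusters == previous:
            break
        # Update step: recompute each mean (keep the old one if a cluster is empty).
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        history.append((clusters, means))
        previous = clusters
    return clusters, means, history


points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means, history = kmeans_1d(points, [2, 4])
for step, (c, m) in enumerate(history, start=1):
    print(step, [sorted(k) for k in c], m)
```

Each printed line corresponds to one bullet of the example: the clusters after reassignment, followed by the recomputed means. The loop ends on the first pass where the clusters stay the same.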