
Clustering

What is Cluster Analysis?


• Cluster: A collection of data objects
– similar (or related) to one another within the same group
– dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
– Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning by
observation, in contrast to supervised learning by examples)
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
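As a rough illustration of the cluster definition above (objects similar within a group, dissimilar across groups), the sketch below compares the average distance inside a group with the average distance between two groups. The 1-D data, the absolute-difference distance, and the helper names `mean_within` and `mean_between` are assumptions made for this example; they do not come from the slides.

```python
from itertools import combinations, product

def mean_within(cluster):
    """Average absolute difference over all pairs inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def mean_between(c1, c2):
    """Average absolute difference over all pairs drawn from two clusters."""
    pairs = list(product(c1, c2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Hypothetical 1-D data: two visibly separated groups.
group_a = [1.0, 1.5, 2.0]
group_b = [8.0, 8.5, 9.0]

print("within A :", mean_within(group_a))
print("within B :", mean_within(group_b))
print("between  :", mean_between(group_a, group_b))
```

For a reasonable grouping, the within-group averages should come out much smaller than the between-group average.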
Clustering for Data Understanding and Applications

• Biology: taxonomy of living things: kingdom, phylum, class, order, family,
genus and species
• Information retrieval: document clustering
• Land use: Identification of areas of similar land use in an earth observation
database
• Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
• City-planning: Identifying groups of houses according to their house type,
value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered
along continental faults
• Segment customer database based on similar buying patterns.
• Identify similar Web usage patterns
Clustering Example
Clustering vs. Classification
• No prior knowledge
– Number of clusters
– Meaning of clusters
• Unsupervised learning
Types of Clustering
• Hierarchical – Nested set of clusters created.
• Partitional – One set of clusters created.
• Incremental – Each element handled one at a
time.
• Simultaneous – All elements handled
together.
• Overlapping/Non-overlapping
Clustering Approaches
• Hierarchical
– Agglomerative
– Divisive
• Partitional
• Categorical
• Large DB
– Sampling
– Compression
Partitional Clustering
• Nonhierarchical
• Creates clusters in one step as opposed to
several steps.
• Since only one set of clusters is output, the
user normally has to input the desired number
of clusters, k.
• Usually deals with static sets.
Partitional Algorithms
• MST (Minimum Spanning Tree)
• Squared Error
• K-Means
• Nearest Neighbor
• PAM (Partitioning Around Medoids)
• BEA (Bond Energy Algorithm)
• GA (Genetic Algorithm)
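The slides name the squared-error criterion without defining it. A common reading is the sum, over all clusters, of squared distances from each point to its cluster mean. The sketch below assumes that reading and 1-D numeric data. The function name `squared_error` and the two sample partitions are hypothetical.

```python
def squared_error(clusters):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        total += sum((x - mean) ** 2 for x in cluster)
    return total

# Two hypothetical partitions of the same six points.
tight = [[1, 2, 3], [10, 11, 12]]   # nearby points grouped together
loose = [[1, 2, 10], [3, 11, 12]]   # nearby points split apart

print("tight:", squared_error(tight))
print("loose:", squared_error(loose))
```

Under this criterion a partition with a lower total is preferred, so `tight` scores better than `loose`.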
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=2, m2=4
• K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5, m2=16
(3 is equidistant from both initial means; the tie goes to K1)
• K1={2,3,4}, K2={10,12,20,30,11,25}, m1=3, m2=18
• K1={2,3,4,10}, K2={12,20,30,11,25}, m1=4.75, m2=19.6
• K1={2,3,4,10,11,12}, K2={20,30,25}, m1=7, m2=25
• Stop: reassigning the points with these means leaves the clusters
unchanged.
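The iterations above can be reproduced with a short script. This is a minimal sketch of the usual assign-then-recompute loop on 1-D data, not necessarily the exact procedure the slides intend. The function name `kmeans_1d`, the tie rule (a point equidistant from two means goes to the lower-numbered one), and the handling of empty clusters are assumptions.

```python
def kmeans_1d(points, means, max_iter=100):
    """Plain k-means on 1-D data.

    Returns (clusters, means, history), where history holds the
    (clusters, means) pair produced by each iteration.
    """
    history = []
    clusters = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest mean.
        # min() returns the first minimum, so ties go to the lower index.
        new_clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            new_clusters[nearest].append(p)
        # Stop when the assignment no longer changes.
        if new_clusters == clusters:
            break
        clusters = new_clusters
        # Update step: recompute each mean; an empty cluster keeps its old mean.
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        history.append((clusters, means))
    return clusters, means, history

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means, history = kmeans_1d(data, [2, 4])

for step, (cs, ms) in enumerate(history, start=1):
    print(step, [sorted(c) for c in cs], ms)
```

With the starting means m1=2 and m2=4, the printed steps should match the four rows of the example. A different starting pair may take a different number of steps or settle on different clusters.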
