Professional Documents
Culture Documents
Cluster Analysis
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Outlier Analysis
What is Cluster Analysis?
• Cluster: a collection of data objects
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters
• Cluster analysis
• Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
What is Clustering?
• Given a collection of objects, put objects into •Do we put Collins with Venter because they’re
groups based on similarity. both biologists, or do we put Collins with Lander
because they both work for the HGP?
HGP Celera
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
• create thematic maps in GIS by clustering feature spaces
• detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
• Document classification
• Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and
then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation
database
• Insurance: Identifying groups of motor insurance policy holders with a high
average claim cost
• City-planning: Identifying groups of houses according to their house
type, value, and geographical location
• Earth-quake studies: Observed earth quake epicenters should be clustered along
continent faults
What is not Cluster Analysis?
• Supervised classification
• Have class label information
• Simple segmentation
• Dividing students into different registration groups alphabetically, by last name
• Results of a query
• Groupings are a result of an external specification
• Graph partitioning
• Some mutual relevance and synergy, but areas are not identical
Introduction to Clustering
• Clustering is somewhat related to classification in the sense that in both cases data
are grouped.
• In order to understand the difference between the two, consider a sample dataset
containing marks obtained by a set of students and corresponding grades as shown in
Table 15.1.
• The fact is that groups in Fig 12.1 are already predefined in Table 12.1. This is similar to
classification, where we have given a dataset where groups of data are predefined.
• Consider another situation, where ‘Grade’ is not known, but we have to make a grouping.
• Put all the marks into a group if any other mark in that group does not exceed by 5 or more.
• This is similar to “Relative grading” concept and grade may range from A to Z.
• Informally, similarity between two objects (e.g., two images, two documents, two records, etc.) is a numerical measure
of the degree to which two objects are alike.
• The dissimilarity on the other hand, is another alternative (or opposite) measure of the degree to which two objects
are different.
• Usually, similarity and dissimilarity are non-negative numbers and may range from zero (highly dissimilar (no similar))
to some finite/infinite value (highly similar (no dissimilar)).
Note:
• Frequently, the term distance is used as a synonym for dissimilarity
• In fact, it is used to refer as a special case of dissimilarity.
• Partitional Clustering
• A division data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n objects into
a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen’67): Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each
cluster is represented by one of the objects in the cluster
K-Means Clustering
• Simple Clustering: K-means
• Interval-scaled variables:
• Binary variables: