
Clustering

Parts of the content of this class are adapted from online materials. In particular:
1. Introduction to Computational Thinking and Data Science, by Prof. Eric Grimson, Prof. John Guttag and Dr. Ana Bell, MIT.
https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/
2. Unsupervised Learning: Clustering, by Shimon Ullman, Tomaso Poggio, Danny Harari, Daniel Zysman, Darren Seibert, MIT.
http://www.mit.edu/~9.54/fall14/slides/Class13.pdf
Machine learning paradigm
• Observe a set of examples: training data
• Infer something about the process that generated that data
• Use the inference to make predictions about previously unseen data: test data
• Supervised: given a set of feature/label pairs, find a rule that predicts the label associated with a previously unseen input
• Unsupervised: given a set of feature vectors (without labels), group them into "natural clusters".
What is Clustering?
What do we need for Clustering?
Distance Measures
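To make the discussion concrete, here is a minimal sketch of two common distance measures used in clustering (Euclidean and Manhattan); the function names are illustrative assumptions, not the lecture's code.

```python
def euclidean(p, q):
    """Euclidean (L2) distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

Both are special cases of the Minkowski metric; which one is appropriate depends on whether diagonal moves in feature space should count as shorter than axis-aligned ones.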
Clustering is an Optimization Problem

• Why not divide variability by the size of the cluster (as variance does)?
• Because clusters with more points are likely to look less cohesive according to this measure, big and bad is worse than small and bad.
• If one wants to compare the coherence of two clusters of different sizes, one needs to divide the variability of each cluster by its size.
• Is the optimization problem simply finding a C that minimizes dissimilarity(C)?
• No, otherwise we could put each example in its own cluster.
• We need a constraint, e.g.
• Minimum distance between clusters
• Number of clusters
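The objective discussed above can be sketched as follows. `variability` and `dissimilarity` follow the definitions in the bullets (variability is deliberately not divided by cluster size), but the exact names are assumptions, not the lecture's code:

```python
def mean_vector(cluster):
    """Coordinate-wise mean of a list of feature vectors."""
    n = len(cluster)
    return [sum(xs) / n for xs in zip(*cluster)]

def variability(cluster):
    """Sum of squared Euclidean distances to the cluster mean.
    Deliberately NOT divided by cluster size, so a big bad cluster
    is penalized more than a small bad one."""
    center = mean_vector(cluster)
    return sum(sum((a - b) ** 2 for a, b in zip(point, center))
               for point in cluster)

def dissimilarity(clustering):
    """Objective for a whole clustering: total variability."""
    return sum(variability(c) for c in clustering)
```

Note that minimizing `dissimilarity` alone is trivial (one point per cluster gives zero), which is why a constraint such as a fixed number of clusters is needed.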
Clustering Techniques
Hierarchical clustering
Linkage metrics
Example of hierarchical clustering
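The agglomerative procedure behind hierarchical clustering can be sketched as a bottom-up merge loop. This version assumes single linkage (the distance between two clusters is the distance of their closest pair of points), one of several possible linkage metrics; the function names are illustrative:

```python
def single_linkage(c1, c2):
    """Distance between clusters = Euclidean distance of their closest pair."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
               for p in c1 for q in c2)

def agglomerate(points, target_k):
    """Start with every point in its own cluster, then repeatedly merge
    the two closest clusters until target_k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_linkage(clusters[ij[0]],
                                                 clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Swapping `min` for `max` in `single_linkage` would give complete linkage; recording the order of merges instead of stopping at `target_k` would give the full dendrogram.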
Clustering Algorithms
K-means Algorithm
An Example: Step 1
Step 2:
Step 3:
Result of first iteration
Second Iteration
Result of Second Iteration
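The iteration walked through above (pick initial centroids, assign each point to its nearest centroid, recompute centroids, repeat until nothing changes) can be sketched as plain Lloyd's-style k-means. This is an illustrative implementation under that assumption, not the lecture's code:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # step 1: pick k initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # step 2: assign each point to
            i = min(range(k),                 # its nearest centroid
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        new_centroids = [                     # step 3: recompute each centroid
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:        # stop when assignments stabilize
            break
        centroids = new_centroids
    return centroids, clusters
```

On well-separated data this converges in a few iterations, which is the main reason k-means is popular despite its sensitivity to k and to the initial centroids.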
Why Use K-means?
Issues with K-means
• Choosing the "wrong" k can lead to strange results
• Consider k = 3
• Results can depend upon the initial centroids
• Number of iterations
• Even the final results
• The greedy algorithm can find different local optima
• The algorithm is sensitive to outliers
Dealing with Outliers
How to choose K
Sensitivity to Initial Seeds
Mitigating dependence on initial centroids
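One common mitigation, sketched below under the assumption of a plain k-means inner loop (this is a standard technique, not necessarily the lecture's exact method), is to run the algorithm from several different random seeds and keep the run with the lowest total within-cluster variability:

```python
import random

def _kmeans_once(points, k, seed, iters=100):
    """One run of plain k-means (Lloyd's algorithm) from a random seed."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

def kmeans_restarts(points, k, n_trials=10):
    """Run k-means n_trials times; keep the run with the lowest
    total within-cluster variability (sum of squared distances)."""
    best, best_score = None, float("inf")
    for seed in range(n_trials):
        centroids, clusters = _kmeans_once(points, k, seed)
        score = sum(sum((a - b) ** 2 for a, b in zip(p, c))
                    for c, cl in zip(centroids, clusters) for p in cl)
        if score < best_score:
            best, best_score = (centroids, clusters), score
    return best
```

Because each trial is independent, this reduces (but does not eliminate) the chance of ending in a poor local optimum caused by an unlucky initial seeding.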
An Example
Data Sample
Class Example
Class Cluster
Evaluating a clustering
Patients
K-means
Examining results
Result
How many positives are there?
A Hypothesis
Testing multiple values of K
