Unsupervised Learning and Clustering
Somayeh Molaei
University of Michigan, BDSI 2022

Thanks to Professor Danai Koutra for some of the slide content.


Supervised Learning

Somayeh Molaei, CSE Department, University of Michigan, BDSI 2022
Supervised Learning
• Goal: Learn to classify examples
• Images (face recognition)
• Emails (spam filtering)
• Biological samples (tumor classification)

Tesla autopilot failed

Amazon’s AI recruiting tool showed bias against women
• Amazon started building machine learning programs in 2014 to
review job applicants’ resumes. However, the AI-based experimental
hiring tool had a major flaw: it was biased against women.

AI camera mistakes linesman’s head for a ball

• In a hilarious incident, an AI-powered camera designed to automatically track the ball at a soccer game ended up tracking the bald head of a linesman instead.

Microsoft’s AI chatbot turns sexist, racist
• In 2016, Microsoft launched an AI chatbot called Tay. Tay engaged
with Twitter users through “casual and playful conversation.”
However, in less than 24 hours, Twitter users manipulated the bot to
make deeply sexist and racist remarks.

Unstable behavior

Supervised Learning vs. Unsupervised Learning

Unsupervised learning in the wild
• Examples:
• finding similar homes for sale
• grouping patients by symptoms
• mining customer purchase patterns
• grouping search results according to topic
• computational biology → gene expression analysis
• grouping emails
• image processing → segmenting an image into regions

Examples

What’s possible?
• Dimensionality reduction
• Recommender systems
• Anomaly detection
• Clustering

Dimensionality Reduction
• Maps the data from a high-dimensional space to a low-dimensional space
• Loses some information
• Keeps the majority of the information
• Needs less storage space
• Needs less computational power
• Makes the data possible to visualize

Image source: https://machinelearningandco.com/2022/06/21/dimensionality-reduction/


Loss of structural information

Benefit: gives a simpler overall image of the data

Principal Component Analysis (PCA)
• Standardize the range of the continuous initial variables
• Compute the covariance matrix to identify correlations
• Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
• Keep the desired number of principal components
• Recast the data along the chosen principal component axes
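
The five steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the helper names (standardize, covariance, top_eigenpair, pca) and the toy data are assumptions made for this example, and the eigenvectors are found with power iteration plus deflation, a simple stand-in for a library eigen-solver. It also assumes no column is constant, since a zero standard deviation would divide by zero.

```python
import math
import random

def standardize(rows):
    """Step 1: scale every column to zero mean and unit (sample) variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Step 2: sample covariance matrix of rows that are already centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpair(matrix, iterations=500, seed=0):
    """Step 3: leading eigenvalue and unit eigenvector via power iteration."""
    d = len(matrix)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random start vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    length = math.sqrt(sum(x * x for x in v))
    v = [x / length for x in v]
    # Rayleigh quotient v^T M v gives the eigenvalue for a unit vector v.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v

def pca(rows, n_components):
    z = standardize(rows)
    cov = covariance(z)
    d = len(cov)
    components, variances = [], []
    for _ in range(n_components):  # Step 4: keep the desired number
        lam, v = top_eigenpair(cov)
        components.append(v)
        variances.append(lam)
        # Deflation: remove the component just found before the next search.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    # Step 5: recast the standardized data along the chosen components.
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in z]
    return projected, components, variances

# Hypothetical toy data: two strongly correlated variables.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
projected, components, variances = pca(data, n_components=2)
print(variances)  # the first value dominates: one axis captures most variance
```

Because the two toy variables are almost perfectly correlated, nearly all of the variance lands on the first principal component, which is the situation in which dropping the second component loses little information.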

Recommender Systems
- Content-based systems:
- Based on a single user’s interactions and preferences.
- Collaborative filtering:
- Casts a much wider net, collecting information from the interactions of many other users to derive suggestions for you. This approach makes recommendations based on other users with similar tastes or situations, using the history of the user and the behavior of similar users.
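
As a rough illustration of collaborative filtering, the sketch below scores the items a user has not seen by the ratings of similar users. Everything in it is an assumption made for the example: the ratings dictionary, the user and item names, and the choice of cosine similarity with a similarity-weighted average. Real recommender systems use many other similarity measures and models.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ana": {"A": 5, "B": 4, "C": 1},
    "ben": {"A": 4, "B": 5, "D": 5},
    "cat": {"A": 1, "C": 5, "E": 4},
}

def cosine_similarity(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, top_n=1):
    """Rank unseen items by the similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only recommend items the user has not rated
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]

print(recommend("ana", ratings))  # ['D']
```

Here "ben" rates items A and B much like "ana" does, so the item "ben" liked that "ana" has not seen is ranked first.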

Anomaly detection
• Fraud detection
• Medical diagnoses
• Airplane engine anomaly detection
• Machinery rotation prediction

Anomaly detection methods
• Isolation forest (tree-based method)
• KNN (distance-based method, often grouped with clustering-based methods)
• PCA-based methods
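
To make the KNN idea concrete, here is a minimal sketch that scores each point by its average distance to its k nearest neighbors, so isolated points get high scores. The function name knn_anomaly_scores, the value k=2, and the toy points are assumptions for illustration; this is one common way to use nearest neighbors for anomaly detection, not the only one.

```python
import math

def knn_anomaly_scores(points, k=2):
    """Average distance from each point to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical data: four nearby points and one far-away point.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (8.0, 8.0)]
scores = knn_anomaly_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous])  # (8.0, 8.0)
```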

Clustering
• Retail companies often use clustering to identify groups of households that
are similar to each other.
• Streaming services often use clustering analysis to identify viewers who
have similar behavior.
• Data scientists for sports teams often use clustering to identify players that
are similar to each other.
• Many businesses use cluster analysis to identify consumers who are similar to each other, so they can tailor the emails they send to each group in a way that maximizes revenue.
• Health insurance companies often use cluster analysis to identify “clusters” of consumers who use their health insurance in specific ways. They set monthly premiums based on how much each group is expected to use its insurance.
Clustering
• K-means clustering
• Groups similar data points together to discover underlying patterns. To achieve this, K-means looks for a fixed number (k) of clusters in a dataset.
• Hierarchical clustering
• Groups data points into a set of clusters, where each cluster is distinct from the other clusters and the objects within each cluster are broadly similar to each other.
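
A minimal sketch of K-means (the standard assign-then-update loop, often called Lloyd's algorithm) may help here. The function name kmeans, the random choice of initial centers, the fixed seed, and the toy points are all assumptions for this example; library implementations usually add smarter initialization and multiple restarts.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Alternate between assigning points to centers and recomputing centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: k random points as initial centers
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[c]  # keep the old center if a cluster is empty
            for c, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
              for p in points]
    return centers, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centers, labels = kmeans(points, k=2)
print(labels)
```

The label numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.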

K-means Clustering

Hierarchical Clustering
• How many clusters?
• Dendrogram plot
• Elbow method
• How to measure the distance between two points?
• Euclidean distance
• Manhattan distance
• How to measure the distance between two clusters of points?
• Single linkage
• Complete linkage
• Average linkage
• Centroid linkage
• Ward linkage
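
The choices listed above (a point-to-point distance and a cluster-to-cluster linkage) plug into a simple bottom-up procedure: start with every point in its own cluster and repeatedly merge the two closest clusters. The sketch below is a naive illustration under assumptions: the names agglomerative, manhattan, and LINKAGES are made up for this example, only single, complete, and average linkage are covered, and the brute-force search is far slower than library implementations.

```python
import math

def manhattan(p, q):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Each linkage turns the list of point-to-point distances between two
# clusters into a single cluster-to-cluster distance.
LINKAGES = {
    "single": min,                            # closest pair of points
    "complete": max,                          # farthest pair of points
    "average": lambda ds: sum(ds) / len(ds),  # mean over all pairs
}

def agglomerative(points, n_clusters, linkage="single", metric=math.dist):
    """Merge the two closest clusters until n_clusters remain."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]  # clusters hold point indices
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [metric(points[i], points[j])
                         for i in clusters[a] for j in clusters[b]]
                d = combine(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical data: three points near the origin, two far away.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(agglomerative(points, 2))  # [[0, 1, 2], [3, 4]]
print(agglomerative(points, 2, linkage="complete", metric=manhattan))
```

The default metric, math.dist, is the Euclidean distance from the Python standard library.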
How many clusters?

This method (the dendrogram plot) is specific to hierarchical clustering


How many clusters? The general method
• The elbow method works for all the clustering methods:

Distortion score: the sum of squared distances from each point to its assigned center.
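
The distortion score can be computed directly from its definition. In the sketch below the clusterings for k = 1, 2, 3 are written by hand purely for illustration (they are not the output of any algorithm), each center is taken to be the cluster mean, and the names centroid and distortion are assumptions made for this example.

```python
def centroid(cluster):
    """Mean point of a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def distortion(clusters):
    """Sum of squared distances from each point to its cluster's center."""
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        for p in cluster:
            total += sum((x - c) ** 2 for x, c in zip(p, center))
    return total

# Hypothetical data: two tight groups of three points each.
a = [(1, 1), (1, 2), (2, 1)]
b = [(8, 8), (8, 9), (9, 8)]

d1 = distortion([a + b])                      # k = 1: everything together
d2 = distortion([a, b])                       # k = 2: the two natural groups
d3 = distortion([a, [(8, 8), (8, 9)], [(9, 8)]])  # k = 3: one group split
for k, d in zip((1, 2, 3), (d1, d2, d3)):
    print(k, round(d, 3))
```

The score drops sharply from k = 1 to k = 2 and only slightly from k = 2 to k = 3. Plotted against k, that bend is the "elbow", and it suggests k = 2 for this toy data.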

Distance metric
• Euclidean distance
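
For reference, the standard definitions of the two point-to-point distances named in this deck are written out below for points p and q with n coordinates. The symbols p, q, n, and i are notation chosen for this note.

```latex
d_{\text{Euclidean}}(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
\qquad
d_{\text{Manhattan}}(p, q) = \sum_{i=1}^{n} \lvert p_i - q_i \rvert
```

For example, with p = (1, 2) and q = (4, 6), the Euclidean distance is sqrt(3^2 + 4^2) = 5 and the Manhattan distance is 3 + 4 = 7.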

Distance between clusters

Single Linkage

Average Linkage

Complete Linkage

Ward method

centroid

Variation within the cluster
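
Ward's method merges the pair of clusters whose merge increases the total within-cluster variation the least. A minimal sketch of that merge cost follows; the function names (centroid, within_variation, ward_cost) and the toy clusters are assumptions made for this example.

```python
def centroid(cluster):
    """Mean point of a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def within_variation(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    center = centroid(cluster)
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in cluster)

def ward_cost(a, b):
    """Increase in within-cluster variation caused by merging a and b."""
    return within_variation(a + b) - within_variation(a) - within_variation(b)

# Hypothetical clusters: two vertical pairs of points, 4 units apart.
a = [(0, 0), (0, 2)]
b = [(4, 0), (4, 2)]
print(ward_cost(a, b))  # 16.0
```

The same value follows from the closed form |A||B| / (|A| + |B|) times the squared distance between the two centroids: here (2 * 2 / 4) * 4^2 = 16.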

Clustering with Ward linkage vs. average linkage

http://scikit-learn.org/stable/auto_examples/cluster/plot_digits_linkage.html
Disadvantage? 1000 data points

• Hierarchical clustering is not scalable to larger datasets:
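
One way to see the scaling problem: a straightforward agglomerative implementation compares every pair of points, and the number of pairs grows quadratically with the dataset size. The short calculation below only counts pairs; the function name pairwise_distance_count is an assumption for this example, and actual running time and memory depend on the implementation.

```python
def pairwise_distance_count(n):
    """Number of distinct point pairs among n points: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (1_000, 10_000, 100_000):
    print(n, pairwise_distance_count(n))
# 1000 499500
# 10000 49995000
# 100000 4999950000
```

Going from 1,000 to 100,000 points multiplies the data by 100 but the number of pairwise distances by roughly 10,000.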

Hierarchical vs. k-means
• Hierarchical:
• deterministic; no need to decide the number of clusters ahead of time
• the only input parameters are the distance measure and the linkage criterion
• can find irregularly shaped clusters
• slow

• K-means:
• fast; each iteration is linear in the number of points, clusters, and dimensions
• can be affected by outliers
• has problems when clusters differ in size, density, or shape

It’s all about what works for your data.

http://scikit-learn.org/stable/modules/clustering.html