Unsupervised Learning and Clustering: Somayeh Molaei University of Michigan, BDSI 2022

Thanks to Professor Danai Koutra for some of the slide content.

Somayeh Molaei, CSE Department, University of Michigan, BDSI 2022

Supervised Learning
• Goal: Learn to classify examples
• Images (face recognition)
• Emails (spam filtering)
• Biological samples (tumor classification)

Tesla autopilot failed

Amazon’s AI recruiting tool showed bias against women
• Amazon started building machine learning programs in 2014 to
review job applicants’ resumes. However, the AI-based experimental
hiring tool had a major flaw: it was biased against women.

AI camera mistakes linesman’s head for a ball
• In a hilarious incident, an AI-powered camera designed to automatically
track the ball at a soccer game ended up tracking the bald head of a
linesman instead.

Microsoft’s AI chatbot turns sexist, racist
• In 2016, Microsoft launched an AI chatbot called Tay. Tay engaged
with Twitter users through “casual and playful conversation.”
However, in less than 24 hours, Twitter users manipulated the bot to
make deeply sexist and racist remarks.

Unstable behavior

Supervised Learning vs. Unsupervised Learning

Unsupervised learning in the wild
• Examples:
• finding similar homes for sale
• grouping patients by symptoms
• mining customer purchase patterns
• grouping search results according to topic
• computational biology → gene expression analysis
• grouping emails
• image processing → segmenting an image into regions

Examples

What’s possible?
• Dimensionality reduction
• Recommender systems
• Anomaly detection
• Clustering

Dimensionality Reduction
• Maps the data from a high-dimensional space to a low-dimensional space
• Loss of information
• Keeps the majority of the information
• Less storage space
• Less computational power
• Ability to visualize
Image source: https://machinelearningandco.com/2022/06/21/dimensionality-reduction/

Loss of structural information

Benefit: gives a simpler overall image of the data

Principal Component Analysis (PCA)
• Standardize the range of the continuous initial variables
• Compute the covariance matrix to identify correlations
• Compute the eigenvectors and eigenvalues of the covariance matrix to
identify the principal components
• Keep the desired number of principal components
• Recast the data along the chosen principal component axes
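
The five PCA steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `pca`, the synthetic data, and the choice of two components are assumptions made for the example, and the sketch assumes no variable is constant (a constant column would cause a division by zero in the standardization step).

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    # Step 1: standardize each variable (zero mean, unit variance).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the standardized variables.
    cov = np.cov(Z, rowvar=False)
    # Step 3: eigenvalues and eigenvectors; eigh is meant for symmetric
    # matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: keep the components with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    # Step 5: recast the data along the chosen principal component axes.
    return Z @ components, eigenvalues[order]

# Synthetic data (assumed): the second column is almost a copy of the first,
# so most of the variance should fall on a single component.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200),
                     rng.normal(size=200)])
projected, variances = pca(X, n_components=2)
print(projected.shape)
```

Each returned eigenvalue is the variance of the data along the corresponding component, which is one way to judge how many components are worth keeping.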

Recommender Systems
- Content-based system:
- Based on a single user’s interactions and preferences.
- Collaborative filtering:
- Collaborative filtering casts a much wider net, collecting information from
the interactions of many other users to derive suggestions for you. This
approach makes recommendations based on other users with similar tastes or
situations, using the history of the user and the behavior of similar users.
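
As a rough illustration of collaborative filtering, the sketch below scores items a user has not seen by the ratings of similar users, weighted by cosine similarity. The ratings table, the user and item names, and the function names are all hypothetical; real systems use far larger data and more careful similarity measures.

```python
from math import sqrt

# Hypothetical ratings (user -> item -> rating); not taken from the slides.
ratings = {
    "ana": {"A": 5, "B": 4, "C": 1},
    "ben": {"A": 5, "B": 5, "D": 4},
    "cat": {"A": 1, "C": 5, "E": 2},
}

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, top_n=2):
    """Rank unseen items by the similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only suggest items the user has not rated
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]

print(recommend("ana", ratings))
```

Here "ben" rates items much like "ana" does, so the item he liked ranks ahead of the one that only the dissimilar user "cat" rated.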

Anomaly detection
• Fraud detection
• Medical diagnoses
• Airplane engine anomaly detection
• Machinery rotation prediction

Anomaly detection methods
• Isolation forest (tree-based method)
• KNN (nearest-neighbor, distance-based methods)
• PCA-based
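
To make the KNN idea concrete, one common scoring rule (assumed here, since the slide does not spell one out) is to use each point's distance to its k-th nearest neighbor as its anomaly score. The function name, the toy points, and k = 2 are illustrative choices.

```python
from math import dist

def knn_anomaly_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores

# Toy data (assumed): four points in a tight square and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = knn_anomaly_scores(points, k=2)
outlier = points[scores.index(max(scores))]
print(outlier)  # (8, 8)
```

Points inside a dense group have close neighbors and therefore low scores, while an isolated point has a large score.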

Clustering
• Retail companies often use clustering to identify groups of households that
are similar to each other.
• Streaming services often use clustering analysis to identify viewers who
have similar behavior.
• Data scientists for sports teams often use clustering to identify players that
are similar to each other.
• Many businesses use cluster analysis to identify consumers who are similar
to each other so they can tailor their emails sent to consumers in such a way
that maximizes their revenue.
• Health insurance companies often use cluster analysis to identify “clusters”
of consumers that use their health insurance in specific ways. They set
monthly premiums based on how much consumers are expected to use their
insurance.

Clustering
• K-means clustering
• Groups similar data points together to discover underlying patterns. To
achieve this objective, K-means looks for a fixed number (k) of clusters in a
dataset.
• Hierarchical clustering
• Groups data points into a set of clusters, where each cluster is distinct from
the other clusters, and the objects within each cluster are broadly similar to
each other.

K-means Clustering
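
K-means looks for a fixed number k of clusters by alternating two steps: assign each point to its nearest center, then move each center to the mean of its points. The sketch below is one plain-Python version of that loop (Lloyd's algorithm). The function names, the random initialization from the data points, and the toy data are assumptions made for the example.

```python
import random
from math import dist

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for p in points]

def kmeans(points, k, n_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(n_iter):
        labels = assign(points, centers)          # assignment step
        new_centers = []
        for c in range(k):                        # update step
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])    # keep an empty cluster's center
        if new_centers == centers:                # converged: nothing moved
            break
        centers = new_centers
    return centers, assign(points, centers)

# Toy data (assumed): two well-separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, labels = kmeans(points, k=2)
print(sorted(centers))
print(labels)
```

The result depends on the initial centers, which is why practical implementations usually run several random restarts and keep the best one.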

Hierarchical Clustering
• How many clusters?
• Dendrogram plot
• Elbow method
• How to measure the distance between two points?
• Euclidean distance
• Manhattan distance
• How to measure the distance between two clusters of points?
• Single linkage
• Complete linkage
• Average linkage
• Centroid
• Ward

How many clusters?
The dendrogram plot method is specific to hierarchical clustering.

How many clusters? The general method
• The elbow method works for all clustering methods:
Distortion score: the sum of squared distances from each point to its
assigned center.
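
The distortion score can be computed directly from its definition. In the sketch below the clusterings for k = 1, 2, and 4 are hand-picked for illustration (in practice they would come from running a clustering algorithm for each k), and the function name is an assumption.

```python
def distortion(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )

points = [(0, 0), (0, 2), (10, 0), (10, 2)]

# Hand-picked centers and assignments for each k (assumed for illustration).
candidates = {
    1: ([(5, 1)], [0, 0, 0, 0]),
    2: ([(0, 1), (10, 1)], [0, 0, 1, 1]),
    4: (points, [0, 1, 2, 3]),
}
scores = {k: distortion(points, centers, labels)
          for k, (centers, labels) in candidates.items()}
print(scores)  # {1: 104, 2: 4, 4: 0}
```

The score drops sharply from k = 1 to k = 2 and only slightly after that; this bend is the "elbow" that suggests k = 2 for these points.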

Distance metric
• Euclidean distance
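
As a small worked example, the Euclidean distance is the straight-line distance between two points, while the Manhattan distance listed earlier sums the absolute differences along each coordinate. The function names and the sample points are illustrative.

```python
from math import sqrt

def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```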

Distance between clusters

Single Linkage

Average Linkage

Complete Linkage
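
Single, average, and complete linkage differ only in how they summarize the distances between all cross-cluster pairs of points: the minimum, the mean, and the maximum, respectively. A minimal sketch, with illustrative function names and toy clusters:

```python
from itertools import product
from math import dist

def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

# Toy clusters (assumed for illustration).
left = [(0, 0), (0, 1)]
right = [(3, 0), (4, 0)]
print(single_linkage(left, right))    # 3.0
print(average_linkage(left, right))
print(complete_linkage(left, right))
```

For any two clusters the single-linkage distance is the smallest of the three and the complete-linkage distance the largest.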

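Ward method
• Centroid: the mean of the points in a cluster
• Variation within the cluster: the sum of squared distances from each point
to the centroid; Ward merges the two clusters whose merge increases it the least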
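
The Ward criterion can be illustrated by measuring how much the within-cluster variation grows when two clusters are merged. The sketch below assumes the usual definition (sum of squared distances to the centroid); the function names and toy clusters are illustrative.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def within_cluster_variation(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in cluster)

def ward_cost(a, b):
    """Increase in within-cluster variation caused by merging a and b."""
    return (within_cluster_variation(a + b)
            - within_cluster_variation(a)
            - within_cluster_variation(b))

# Toy clusters (assumed for illustration).
left = [(0, 0), (0, 2)]
right = [(4, 0), (4, 2)]
far = [(10, 0), (10, 2)]
print(ward_cost(left, right))  # 16.0
print(ward_cost(left, far))    # 100.0
```

Merging `left` with `right` costs less than merging `left` with `far`, so Ward linkage would merge the nearer pair first.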

Clustering with Ward linkage vs. average linkage
http://scikit-learn.org/stable/auto_examples/cluster/plot_digits_linkage.html

Disadvantage? 1000 data points
• Hierarchical clustering is not scalable to larger datasets.
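
One way to see the scaling problem, offered here as an illustration rather than as what the original slide showed: standard agglomerative clustering starts from the distance between every pair of points, and the number of pairs grows quadratically with the number of points. The function name is an assumption.

```python
def pairwise_distance_count(n):
    """Number of distinct point pairs among n points: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (100, 1_000, 10_000):
    print(n, pairwise_distance_count(n))
# 100 4950
# 1000 499500
# 10000 49995000
```

Going from 1,000 to 10,000 points multiplies the number of distances by roughly 100.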

Hierarchical vs. k-means
• Hierarchical:
• deterministic, no need to decide the number of clusters ahead of time
• only input parameters are the distance measure and the linkage criterion
• can find irregularly shaped clusters
• slow
• K-means:
• fast, linear in all relevant quantities
• can be affected by outliers
• problems when clusters have different sizes, densities, and shapes

It’s all about what works for your data.
http://scikit-learn.org/stable/modules/clustering.html