Chapter3 ML 21-22
Plan
Introduction
Regression
Support vector machines
Decision trees
Bayesian learning
Artificial neural networks
Hidden Markov models
Reinforcement learning
© FEZZA S. v21‐22
Machine Learning 1
Objectives
• Clustering
• Application domains
• Partitional clustering (K-means)
• Cluster quality
28/11/2021
Supervised learning
Training set: {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}; each example comes with a label y^(i).
Unsupervised learning
Training set: {x^(1), x^(2), ..., x^(m)}; no labels are given.
• Clustering aims to find classes without labeled examples
• An “unsupervised” learning method
• Place similar items in same group, different items in different groups
Clustering
(Figure: a 2-D scatter of points grouped into Cluster 1 and Cluster 2.)
Applications of clustering
• Biology: identifying similar entities, building plant and animal taxonomies, grouping genes by functionality.
• Clustering is also used in pattern recognition, data analysis, and image processing.
K-means
• The “K” in K‐means stands for the number of clusters you want.
• The “means” in K‐means stands for the cluster centroids (means) we will compute.
K-means
(Figures: step-by-step illustration of the K-means algorithm on a 2-D example.)
K-means
K-means algorithm
Input:
- K (number of clusters)
- Training set {x^(1), x^(2), ..., x^(m)}, x^(i) ∈ R^n (drop the x_0 = 1 convention)
K-means
K-means algorithm
Randomly initialize K cluster centroids μ_1, μ_2, ..., μ_K ∈ R^n
Repeat {
    // Cluster-assignment step
    for i = 1 to m
        c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
    // Move-centroid step
    for k = 1 to K
        μ_k := average (mean) of the points assigned to cluster k
}
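The two alternating steps above translate almost line-for-line into NumPy. A minimal sketch (the function name and defaults are my own, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means on an (m, n) data matrix X with K clusters."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Randomly initialize the centroids mu_1..mu_K to K distinct training examples.
    mu = X[rng.choice(m, size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Cluster-assignment step: c[i] = index of the centroid closest to x^(i).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, K)
        c = dists.argmin(axis=1)
        # Move-centroid step: mu_k = mean of the points assigned to cluster k.
        new_mu = np.array([X[c == k].mean(axis=0) if (c == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # centroids stopped moving: converged
            break
        mu = new_mu
    return c, mu
```

Note that an empty cluster keeps its previous centroid in this sketch; production implementations usually re-seed it instead.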
K-means
K-means for non-separated clusters
(Figure: T-shirt sizing; customer weight plotted against height. K-means can still split the population into size groups even though the data show no well-separated clusters.)
K-means
K-means optimization objective
c^(i) = index of the cluster (1, 2, ..., K) to which example x^(i) is currently assigned
μ_k = cluster centroid k (μ_k ∈ R^n)
μ_c^(i) = cluster centroid of the cluster to which example x^(i) has been assigned
Optimization objective:
J(c^(1), ..., c^(m), μ_1, ..., μ_K) = (1/m) Σ_{i=1}^{m} ‖x^(i) − μ_c^(i)‖²
minimize J over c^(1), ..., c^(m) and μ_1, ..., μ_K
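Given the assignments c and centroids μ, the objective J (the distortion) is a one-liner to evaluate; a sketch, with a function name of my choosing:

```python
import numpy as np

def distortion(X, c, mu):
    """J(c, mu) = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2."""
    diffs = X - mu[c]  # mu[c] selects the assigned centroid for every example
    return float(np.mean(np.sum(diffs ** 2, axis=1)))
```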
K-means
Random initialization
Should have K < m.
Randomly pick K training examples.
Set μ_1, ..., μ_K equal to these K examples.
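The recommended initialization, picking K distinct training examples as the starting centroids, is essentially one line in NumPy (the helper name is hypothetical):

```python
import numpy as np

def init_centroids(X, K, seed=0):
    """Set mu_1..mu_K to K distinct, randomly chosen training examples (K < m)."""
    rng = np.random.default_rng(seed)
    assert K < X.shape[0], "should have K < m"
    return X[rng.choice(X.shape[0], size=K, replace=False)].copy()
```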
K-means
Local optima
(Figures: suboptimal clusterings caused by unlucky centroid initializations; the algorithm converges to a local rather than global minimum of the distortion J.)
K-means
Random initialization with multiple restarts
For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get c^(1), ..., c^(m), μ_1, ..., μ_K.
    Compute the cost function (distortion) J(c^(1), ..., c^(m), μ_1, ..., μ_K)
}
Pick the clustering that gave the lowest cost J
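Multiple random restarts, keeping the run with the lowest distortion, can be sketched as follows (self-contained; the helper replicates the basic algorithm from earlier slides, and the names are my own):

```python
import numpy as np

def kmeans_once(X, K, rng, n_iters=50):
    """One K-means run from a random initialization; returns (c, mu, J)."""
    mu = X[rng.choice(X.shape[0], size=K, replace=False)].copy()
    for _ in range(n_iters):
        c = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2).argmin(axis=1)
        mu = np.array([X[c == k].mean(axis=0) if (c == k).any() else mu[k]
                       for k in range(K)])
    # Final assignment and distortion J for this run.
    c = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2).argmin(axis=1)
    J = float(np.mean(np.sum((X - mu[c]) ** 2, axis=1)))
    return c, mu, J

def kmeans_restarts(X, K, n_restarts=100, seed=0):
    """Run K-means n_restarts times and keep the clustering with the lowest J."""
    rng = np.random.default_rng(seed)
    return min((kmeans_once(X, K, rng) for _ in range(n_restarts)),
               key=lambda run: run[2])
```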
K-means++
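The slides give no detail on this slide, but K-means++ (Arthur and Vassilvitskii, 2007) is the standard improved seeding scheme: the first centroid is picked uniformly at random, and each subsequent centroid is sampled with probability proportional to D(x)², its squared distance to the nearest centroid already chosen, which spreads the seeds out and makes bad local optima much less likely. A sketch (the function name is my own):

```python
import numpy as np

def kmeanspp_init(X, K, seed=0):
    """K-means++ seeding: sample each new centroid with probability ~ D(x)^2."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    centroids = [X[rng.integers(m)]]  # first seed: uniform at random
    for _ in range(K - 1):
        # D(x)^2: squared distance from each point to its nearest chosen centroid.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(m, p=d2 / d2.sum())])
    return np.array(centroids)
```

Points that coincide with a chosen centroid have D(x)² = 0, so they can never be picked twice.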
K-means
What is the right value of K?
Bad choices for the number of clusters: when K is too small, separate clusters get merged (left); when K is too large, some clusters get chopped into multiple pieces (right).
K-means
Choosing the value of K
Elbow method: plot the cost function J against the number of clusters K and pick the K at the "elbow", where the curve's decrease flattens out.
(Figures: two plots of the cost function J versus K = 1, ..., 8; one shows a clear elbow, the other decreases smoothly with no obvious elbow.)
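The elbow method amounts to running K-means over a range of K and comparing the resulting distortions. A self-contained sketch on synthetic data with three clusters, so the drop in J should flatten near K = 3 (helper name and data are my own):

```python
import numpy as np

def kmeans_cost(X, K, rng, n_iters=50):
    """Run plain K-means and return the final distortion J."""
    mu = X[rng.choice(X.shape[0], size=K, replace=False)].copy()
    for _ in range(n_iters):
        c = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2).argmin(axis=1)
        mu = np.array([X[c == k].mean(axis=0) if (c == k).any() else mu[k]
                       for k in range(K)])
    c = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2).argmin(axis=1)
    return float(np.mean(np.sum((X - mu[c]) ** 2, axis=1)))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, (30, 2)) for loc in (0.0, 5.0, 10.0)])
costs = {K: kmeans_cost(X, K, rng) for K in range(1, 9)}
# J shrinks as K grows; look for the K where the drop flattens out.
```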
K-means
Choosing the value of K
Sometimes, you’re running K‐means to get clusters to use for some
later/downstream purpose. Evaluate K‐means based on a metric for
how well it performs for that later purpose.
Clustering assessment metrics
• Silhouette coefficient: denoting by a the mean distance between a sample and all other points in the same cluster, and by b the mean distance between that sample and all points in the next nearest cluster, the silhouette coefficient s of a single sample is defined as
s = (b − a) / max(a, b)
The coefficient takes values in the interval [−1, 1]:
• s ≈ 0: the sample is very close to the neighboring cluster.
• s ≈ 1: the sample is far away from the neighboring clusters.
• s ≈ −1: the sample is probably assigned to the wrong cluster.
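The definition above can be sketched directly for a single sample (the function name is mine; scikit-learn's `silhouette_score` provides a vectorized version over the whole data set):

```python
import numpy as np

def silhouette_sample(X, labels, i):
    """s = (b - a) / max(a, b) for example i.

    a: mean distance to the other points of i's own cluster;
    b: mean distance to the points of the nearest other cluster.
    """
    d = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    a = d[own & (np.arange(len(X)) != i)].mean()
    b = min(d[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)
```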
Drawbacks of K-means
(Figures: examples where K-means performs poorly.)