Gilles Gasso
September 18, 2019
1 Introduction
   Notion of dissimilarity
   Quality of clusters
2 Methods of clustering
   Hierarchical clustering
      Principle and algorithm
   K-means
      Principle and algorithm
      Variations
Introduction
$\mathcal{D} = \{x_i \in \mathbb{R}^d\}_{i=1}^N$: set of training samples
Clustering images
https://courses.cs.washington.edu/courses/cse416/18sp/slides/L11_kmeans.pdf
Applications

Field              | Data type                           | Clusters
-------------------|-------------------------------------|------------------------------
Text mining        | Texts                               | Close texts
                   | E-mails                             | Automatic folders
Graph mining       | Graphs                              | Social sub-networks
BioInformatics     | Genes                               | Resembling genes
Marketing          | Client profiles, purchased products | Customer segmentation
Image segmentation | Images                              | Homogeneous areas in an image
Web log analysis   | Clickstreams                        | User profiles

http://images2.programmersought.com/267/fc/fc00092c0966ec1d4b726f60880f9703.png
What is clustering?
How to define similarity or dissimilarity between samples?
How to characterize a cluster?
How to choose the number of clusters?
Which clustering algorithms to use?
How to assess a clustering result?
https://image.slidesharecdn.com/k-means-130411020903-phpapp01/95/k-means-clustering-algorithm-4-638.jpg?cb=1365646184
Concept of dissimilarity
Dissimilarity is a function of the pair $(x_1, x_2)$ with values in $\mathbb{R}^+$ such that
$D(x_1, x_2) = D(x_2, x_1) \ge 0$ and $D(x_1, x_2) = 0 \Rightarrow x_1 = x_2$
Usual examples (Minkowski distance of order $q$):
Euclidean distance, corresponding to $q = 2$:
$D(x_1, x_2)^2 = \sum_{j=1}^{d} (x_{1,j} - x_{2,j})^2 = (x_1 - x_2)^\top (x_1 - x_2)$
Manhattan distance ($q = 1$): $D(x_1, x_2) = \sum_{j=1}^{d} |x_{1,j} - x_{2,j}|$
Weighted distance, with $W$ a positive definite matrix:
$D^2(x_1, x_2) = (x_1 - x_2)^\top W (x_1 - x_2)$
Example: $D(x_1, x_2) = 3$
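A minimal NumPy sketch of these three dissimilarities (the sample points and the weight matrix W below are made-up examples):

```python
import numpy as np

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 0.0])

# Euclidean distance (q = 2)
d_euclid = np.sqrt(np.sum((x1 - x2) ** 2))

# Manhattan distance (q = 1)
d_manhattan = np.sum(np.abs(x1 - x2))

# Weighted distance D^2 = (x1 - x2)^T W (x1 - x2),
# with a hypothetical positive definite (here diagonal) W
W = np.diag([1.0, 0.5])
d_weighted = np.sqrt((x1 - x2) @ W @ (x1 - x2))

print(d_euclid, d_manhattan, d_weighted)  # 2.828..., 4.0, 2.449...
```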
Distance between two clusters (maximum diameter):
$D_{\max}(C_1, C_2) = \max \{ D(x_i, x_j) : x_i \in C_1, x_j \in C_2 \}$
Quality of clusters
Inertia of cluster $C_\ell$: $J_\ell = \sum_{i \in C_\ell} D^2(x_i, \mu_\ell)$
measures how close the points are to the center $\mu_\ell$: the lower $J_\ell$, the smaller the dispersion of the points around $\mu_\ell$
Within distance: $J_w = \sum_\ell \sum_{i \in C_\ell} D^2(x_i, \mu_\ell)$
Between distance: $J_b = \sum_\ell N_\ell \, D^2(\mu_\ell, \mu)$, where $\mu$ is the global center and $N_\ell = \mathrm{card}(C_\ell)$
measures the "distance" between the clusters: the greater this inertia, the better separated the clusters
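A short NumPy sketch of both criteria (the function name inertias is mine, not from the course):

```python
import numpy as np

def inertias(X, labels):
    """Within (Jw) and between (Jb) inertia of a partition."""
    mu = X.mean(axis=0)                      # global center
    Jw, Jb = 0.0, 0.0
    for l in np.unique(labels):
        C = X[labels == l]                   # points of cluster l
        mu_l = C.mean(axis=0)                # cluster center
        Jw += np.sum((C - mu_l) ** 2)        # dispersion around mu_l
        Jb += len(C) * np.sum((mu_l - mu) ** 2)
    return Jw, Jb
```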
Illustration
Figure: four clusters $C_1, \dots, C_4$ with centers $g_1, \dots, g_4$ and the global center $g$.
A good clustering . . .
is one that minimizes the within distance $J_w$ and maximizes the between distance $J_b$
Methods of clustering
Approaches of clustering
Hierarchical clustering
K-means clustering
Algorithm
Initialization:
Each sample forms its own cluster
Compute the pairwise distance matrix $M$ with $M_{ij} = D(x_i, x_j)$
Repeat
Select from M the two closest clusters CI and CJ
Merge CI and CJ into the cluster CG
Update M by computing the distance between CG and the remaining clusters
Until all samples are merged into one cluster
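This agglomerative procedure is available, for instance, in SciPy; a sketch on made-up data, using complete linkage (the $D_{\max}$ criterion seen earlier):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (5, 2)),   # toy cluster around (0, 0)
               rng.normal(3, 0.3, (5, 2))])  # toy cluster around (3, 3)

# Successive merges of the two closest clusters (complete linkage)
Z = linkage(X, method='complete')

# Cut the hierarchy to recover K = 2 clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```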
Figure: indexed hierarchy (dendrogram) built by successively merging clusters $C_1, \dots, C_{10}$.
Linkage criterion
Figure: six points A–F in the plane; two different linkage criteria produce two different dendrograms on the same data.
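The linkage criterion defines how the distance between the newly merged cluster and the others is updated; a quick sketch comparing three standard criteria on hypothetical points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points standing in for A..F
X = np.array([[0.0, 0.5], [0.5, 0.7], [1.7, 1.0],
              [2.8, 0.9], [4.0, 0.5], [4.5, 0.9]])

# Each criterion may merge in a different order, hence different clusters
for method in ('single', 'complete', 'average'):
    Z = linkage(X, method=method)
    print(method, fcluster(Z, t=2, criterion='maxclust'))
```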
Goal
$\mathcal{D} = \{x_i \in \mathbb{R}^d\}_{i=1}^N$: a set of training samples
Direct approach
Build all possible partitions
Retain the best partition among them
NP-hard problem
The number of possible partitions of $N$ samples into $K$ clusters grows exponentially (it is the Stirling number of the second kind):
$\#\text{Partitions} = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^N$
For $N = 10$ and $K = 4$, there are 34105 possible partitions!
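This count is easy to check numerically; a short sketch reproducing the value above:

```python
from math import comb, factorial

def n_partitions(N, K):
    """Number of partitions of N samples into K non-empty clusters
    (Stirling number of the second kind)."""
    total = sum((-1) ** (K - k) * comb(K, k) * k ** N for k in range(1, K + 1))
    return total // factorial(K)

print(n_partitions(10, 4))  # 34105
```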
Data partitioning
Workaround solution
Determine the $K$ clusters $\{C_\ell\}_{\ell=1}^{K}$ and their centers $\{\mu_\ell\}_{\ell=1}^{K}$ that minimize the within-cluster distance $J_w$:
$\min_{\{C_\ell\}_{\ell=1}^{K}, \{\mu_\ell\}_{\ell=1}^{K}} \; \sum_{\ell=1}^{K} \sum_{i \in C_\ell} \|x_i - \mu_\ell\|^2$
K-means clustering
K-Means: illustration
Clustering into K = 2 clusters
Figure: six snapshots of the K-means iterations, alternating assignment of the points and update of the two centers.
K-Means: algorithm
Initialize the centers $\mu_1, \dots, \mu_K$
Repeat
Assignment: assign each $x_i$ to the cluster $C_\ell$ of the nearest center
Update: recompute each center as
$\mu_\ell = \frac{1}{N_\ell} \sum_{i \in C_\ell} x_i$ with $N_\ell = \mathrm{card}(C_\ell)$
Until convergence
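A minimal NumPy sketch of these two alternating steps (random sampling of K points as initial centers is one common choice; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # initial centers
    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest center
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: mu_l = mean of the samples assigned to C_l
        new_mu = np.array([X[labels == l].mean(axis=0) for l in range(K)])
        if np.allclose(new_mu, mu):  # convergence: centers stop moving
            break
        mu = new_mu
    return labels, mu
```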
K-Means: example
Sequential K-Means
Adaptation of K-means for streaming samples
Initialize the centers $\mu_1, \dots, \mu_K$
Repeat
acquire a new sample $x$
assign $x$ to the nearest cluster $C_\ell$
increment the counter $n_\ell$
recompute the center $\mu_\ell$ of this cluster:
$\mu_\ell \leftarrow \mu_\ell + \frac{1}{n_\ell} (x - \mu_\ell)$
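A sketch of one streaming update (the array mu holds the K current centers, n the per-cluster sample counts; both names are mine):

```python
import numpy as np

def sequential_kmeans_step(x, mu, n):
    """Assign one incoming sample x, then move the winning center."""
    l = np.argmin(((mu - x) ** 2).sum(axis=1))  # nearest cluster
    n[l] += 1                                   # increment n_l
    mu[l] += (x - mu[l]) / n[l]                 # mu_l <- mu_l + (x - mu_l)/n_l
    return l

# Usage on a stream: mu initialized to K starting centers, n to zeros
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
n = np.zeros(len(mu))
for x in np.array([[0.1, 0.2], [2.9, 3.1], [0.0, -0.1]]):
    sequential_kmeans_step(x, mu, n)
print(mu)
```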
Conclusion