
Clustering

Gilles Gasso

INSA Rouen - ASI Departement


Laboratory LITIS

September 18, 2019



Plan

1 Introduction
Notion of dissimilarity
Quality of clusters

2 Methods of clustering
Hierarchical clustering
Principle and algorithm
K-means
Principle and algorithm
Variations


Introduction

D = {x_i ∈ R^d}_{i=1}^N : set of training samples

Goal : structure the data into homogeneous categories


Group the samples into clusters so that the data within a cluster are as
similar as possible

Clustering ≡ unsupervised learning.

Clustering images

https://courses.cs.washington.edu/courses/cse416/18sp/slides/L11_kmeans.pdf


Applications
Field              | Data type                            | Clusters
Text mining        | Texts                                | Close texts
Text mining        | E-mails                              | Automatic folders
Graph mining       | Graphs                               | Social sub-networks
Bioinformatics     | Genes                                | Resembling genes
Marketing          | Client profiles, purchased products  | Customer segmentation
Image segmentation | Images                               | Homogeneous areas in an image
Web log analysis   | Clickstreams                         | User profiles

http://images2.programmersought.com/267/fc/fc00092c0966ec1d4b726f60880f9703.png

What is clustering ?
How to define similarity or dissimilarity between samples ?
How to characterize a cluster ?
How many clusters ?
Which clustering algorithms ?
How to assess a clustering result ?

https://image.slidesharecdn.com/k-means-130411020903-phpapp01/95/k-means-clustering-algorithm-4-638.jpg?cb=1365646184

Dissimilarity measure (1)

Concept of dissimilarity
A dissimilarity is a function of the pair (x_1, x_2) with values in R+ such that
D(x_1, x_2) = D(x_2, x_1) ≥ 0 and D(x_1, x_2) = 0 ⇒ x_1 = x_2

Measure of dissimilarity : distance D(x_1, x_2) between 2 points x_1 and x_2

Minkowski's distance : D(x_1, x_2)^q = Σ_{j=1}^d |x_{1,j} − x_{2,j}|^q

Euclidean distance corresponds to q = 2 :
D(x_1, x_2)^2 = Σ_{j=1}^d (x_{1,j} − x_{2,j})^2 = (x_1 − x_2)^T (x_1 − x_2)

Manhattan distance (q = 1) : D(x_1, x_2) = Σ_{j=1}^d |x_{1,j} − x_{2,j}|

Metric linked to a positive definite matrix W :

D^2(x_1, x_2) = (x_1 − x_2)^T W (x_1 − x_2)
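A minimal NumPy sketch of these distances; the vectors x1, x2 and the matrix W below are made-up examples, not data from the course.

import numpy as np

def minkowski(x1, x2, q=2):
    # D(x1, x2) = (sum_j |x1_j - x2_j|^q)^(1/q); q=2 Euclidean, q=1 Manhattan
    return np.sum(np.abs(x1 - x2) ** q) ** (1.0 / q)

def metric_W(x1, x2, W):
    # distance induced by a positive definite matrix W
    diff = x1 - x2
    return np.sqrt(diff @ W @ diff)

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 0.0, 3.0])
print(minkowski(x1, x2, q=2))            # Euclidean distance
print(minkowski(x1, x2, q=1))            # Manhattan distance
print(metric_W(x1, x2, np.eye(3)))       # W = I recovers the Euclidean case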


Dissimilarity measure (2)

The d-dimensional samples x_1 and x_2 take discrete values

Compute the contingency matrix A(x_1, x_2), whose entry a_{ij} counts the positions at which x_1 takes value i and x_2 takes value j

x_1 = (0 1 2 1 2 1)^T and x_2 = (1 0 2 1 0 1)^T

A(x_1, x_2) =
| 0 1 0 |
| 1 2 0 |
| 1 0 1 |

Hamming distance : number of positions where the 2 vectors differ

D(x_1, x_2) = Σ_i Σ_{j≠i} a_{ij}

Example : D(x_1, x_2) = 3
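A small sketch of this computation, assuming the discrete values are coded 0..m−1 as in the example above.

import numpy as np

x1 = np.array([0, 1, 2, 1, 2, 1])
x2 = np.array([1, 0, 2, 1, 0, 1])

m = max(x1.max(), x2.max()) + 1          # number of distinct values
A = np.zeros((m, m), dtype=int)
for v1, v2 in zip(x1, x2):               # a_ij = #positions with x1 = i and x2 = j
    A[v1, v2] += 1

hamming = A.sum() - np.trace(A)          # off-diagonal mass = number of mismatches
print(A)
print(hamming)                           # 3, as in the example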


Dissimilarity between clusters (1)

Distance D(C_1, C_2) between 2 clusters C_1 and C_2

minimum diameter (nearest neighbour) :
D_min(C_1, C_2) = min { D(x_i, x_j), x_i ∈ C_1, x_j ∈ C_2 }

maximum diameter :
D_max(C_1, C_2) = max { D(x_i, x_j), x_i ∈ C_1, x_j ∈ C_2 }

[Figures : minimum diameter, maximum diameter]


Dissimilarity between clusters (2)


Distance D(C_1, C_2) between 2 clusters C_1 and C_2

average distance :
D_moy(C_1, C_2) = ( Σ_{x_i ∈ C_1} Σ_{x_j ∈ C_2} D(x_i, x_j) ) / (n_1 n_2)

Ward's distance (between centres) :
D_Ward(C_1, C_2) = sqrt( n_1 n_2 / (n_1 + n_2) ) · D(µ_1, µ_2)

[Figures : average distance, distance between centroids]
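A minimal NumPy sketch of these cluster-to-cluster distances, assuming each cluster is given as an array of points and D is the Euclidean distance; the two toy clusters are illustrative.

import numpy as np

def pairwise_dist(C1, C2):
    # all Euclidean distances D(x_i, x_j) with x_i in C1 and x_j in C2
    return np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=-1)

def d_min(C1, C2): return pairwise_dist(C1, C2).min()
def d_max(C1, C2): return pairwise_dist(C1, C2).max()
def d_moy(C1, C2): return pairwise_dist(C1, C2).mean()   # sum / (n1 * n2)

def d_ward(C1, C2):
    n1, n2 = len(C1), len(C2)
    mu1, mu2 = C1.mean(axis=0), C2.mean(axis=0)
    return np.sqrt(n1 * n2 / (n1 + n2)) * np.linalg.norm(mu1 - mu2)

C1 = np.array([[0., 0.], [1., 0.]])
C2 = np.array([[3., 4.], [4., 4.], [5., 3.]])
print(d_min(C1, C2), d_max(C1, C2), d_moy(C1, C2), d_ward(C1, C2))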


Characteristics of a good clustering

Every cluster C_ℓ is characterized by :

Its centroid : µ_ℓ = (1/N_ℓ) Σ_{i ∈ C_ℓ} x_i with N_ℓ = card(C_ℓ)

Its inertia : J_ℓ = Σ_{i ∈ C_ℓ} D^2(x_i, µ_ℓ)
measures how close the points are to µ_ℓ. The lower J_ℓ, the smaller the dispersion of the points around µ_ℓ

Within-cluster distance : J_w = Σ_ℓ Σ_{i ∈ C_ℓ} D^2(x_i, µ_ℓ) = Σ_ℓ J_ℓ

Let µ be the mean of all the samples : µ = (1/N) Σ_i x_i

Between-cluster distance : J_b = Σ_ℓ N_ℓ D^2(µ_ℓ, µ)
measures the "distance" between the clusters. The greater this inertia, the better the clusters are separated
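A short sketch computing J_w and J_b from a data matrix X and an array of cluster labels (both names are illustrative), using the squared Euclidean distance for D.

import numpy as np

def within_between(X, labels):
    # within-cluster inertia J_w and between-cluster inertia J_b
    mu = X.mean(axis=0)                        # global mean µ
    Jw, Jb = 0.0, 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        mu_k = Xk.mean(axis=0)                 # centroid µ_k
        Jw += np.sum((Xk - mu_k) ** 2)         # J_k, summed over clusters
        Jb += len(Xk) * np.sum((mu_k - mu) ** 2)
    return Jw, Jb

X = np.array([[0., 0.], [1., 0.], [5., 5.], [6., 5.]])
labels = np.array([0, 0, 1, 1])
Jw, Jb = within_between(X, labels)
print(Jw, Jb, Jw + Jb, np.sum((X - X.mean(axis=0)) ** 2))  # Jw + Jb = total inertia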


Illustration

[Figure : four clusters C_1, ..., C_4 with centroids g_1, ..., g_4 and global centroid g]

Total inertia of the points = within-cluster inertia + between-cluster inertia

A good clustering . . .
is one that minimizes the within-cluster distance and maximizes the between-cluster
distance

Approaches of clustering

Hierarchical clustering

K-means clustering


Hierarchical clustering : principle

Principle : bottom-up approach


Each point or cluster is gradually "agglomerated" to the nearest cluster.

Algorithm
Initialization :
Each sample represents a cluster,
Compute the pairwise distance matrix M with Mij = D(x i , x j )
Repeat
Select from M the two closest clusters CI and CJ
Merge CI and CJ into the cluster CG
Update M by computing the distance between CG and the remaining
clusters
Until all samples are merged into one cluster
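For reference, the same bottom-up procedure is available in SciPy; a minimal sketch on synthetic data (the data, the Ward linkage and the final cut into 2 clusters are arbitrary choices for illustration).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),       # two synthetic groups
               rng.normal(3, 0.5, (20, 2))])

Z = linkage(X, method='ward')                     # sequence of merges (the dendrogram)
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) draws the hierarchy of merges with matplotlib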


Hierarchical clustering : illustration

[Figure : samples, indexed hierarchy (dendrogram) and resulting clustering for clusters C_1, ..., C_10]

Dendrogram = representation of the successive merges

Height of a cluster in the dendrogram = distance between the 2
clusters before they merge


Linkage criterion

Common criteria used for merging two clusters

Single linkage (minimum), based on D_min(C_1, C_2)
  tends to produce a few large clusters (chaining effect)
  sensitive to noisy data

Complete linkage (maximum), based on D_max(C_1, C_2)
  tends to produce compact, specific clusters (only very close clusters are merged together)
  sensitive to noisy data

Average linkage, based on D_moy(C_1, C_2)
  tends to produce clusters with similar variance

Ward linkage, based on D_Ward(C_1, C_2)
  tends to minimize the within-cluster variance of the clusters being merged
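These four criteria correspond to the method argument of SciPy's linkage function; a quick sketch comparing them on the same toy data (illustrative only).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

for method in ('single', 'complete', 'average', 'ward'):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion='maxclust')
    print(method, np.bincount(labels)[1:])   # cluster sizes differ with the criterion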


Influence of linkage criterion (1)

[Figure : toy points A–F in the plane (Euclidean metric) and the dendrograms obtained with single linkage and complete linkage; the merge order and the merge heights differ between the two criteria]

The clustering result may change with the selected linkage criterion


Influence of linkage criterion (2)


Approaches of clustering

Hierarchical clustering

K-means clustering


Clustering by data partitioning

Goal
D = {x_i ∈ R^d}_{i=1}^N : a set of training samples

Search for a partition into K clusters (with K < N)

Direct approach
Build all possible partitions
Retain the best partition among them

NP-hard problem
The number of possible partitions grows exponentially :
#Partitions = (1/K!) Σ_{k=1}^{K} (−1)^{K−k} C_K^k k^N
For N = 10 and K = 4, there are 34105 possible partitions !
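A quick check of this count (the formula is the Stirling number of the second kind):

from math import comb, factorial

def n_partitions(N, K):
    # number of ways to split N samples into K non-empty clusters
    return sum((-1) ** (K - k) * comb(K, k) * k ** N for k in range(1, K + 1)) // factorial(K)

print(n_partitions(10, 4))   # 34105, as stated above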


Data partitioning

Workaround solution
Determine the K clusters {C_ℓ}_{ℓ=1}^K and their centers {µ_ℓ}_{ℓ=1}^K that
minimize the within-cluster distance J_w :

min over {C_ℓ}_{ℓ=1}^K, {µ_ℓ}_{ℓ=1}^K of Σ_{ℓ=1}^K Σ_{i ∈ C_ℓ} ||x_i − µ_ℓ||^2

Global solution : NP-hard problem

A local solution (not necessarily the optimal partition) can be reached
with a simple algorithm : K-means


K-means clustering

A popular clustering algorithm

Principle
Assume the centroids µ_ℓ, ℓ = 1, · · · , K are fixed
  assign each point x_i to one and only one cluster
  x_i is assigned to the closest cluster C_ℓ (according to the distance
  between x_i and the cluster center µ_ℓ)
Given the clusters C_ℓ, ℓ = 1, · · · , K,
  estimate their centers µ_ℓ, ℓ = 1, · · · , K
Repeat the previous steps until convergence


K-Means : illustration
Clustering in K = 2 classes

[Figure : ground-truth data, initialization, and the clusters obtained at iterations 1, 2, 3 and 5]


K-Means : Lloyd's algorithm

Initialize the centers µ_1, · · · , µ_K

Repeat
  Assign each point x_i to the closest cluster :

  ∀i ∈ {1, · · · , N}, s_i ← arg min_ℓ ||x_i − µ_ℓ||^2 and C_ℓ = {i : s_i = ℓ}

  Compute the center µ_ℓ of each cluster :

  µ_ℓ = (1/N_ℓ) Σ_{i ∈ C_ℓ} x_i with N_ℓ = card(C_ℓ)

Until convergence
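A minimal NumPy sketch of Lloyd's algorithm as described above; initializing the centers by sampling K points from the data and stopping when the assignments no longer change are common choices, not prescriptions from the slide.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    # Lloyd's algorithm: alternate assignment and center update
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)].copy()   # initial centers among the x_i
    labels = None
    for _ in range(n_iter):
        # assignment step: s_i = argmin_l ||x_i - mu_l||^2
        dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        new_labels = dist2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                          # no assignment changed: converged
        labels = new_labels
        # update step: mu_l = mean of the points assigned to cluster l
        for k in range(K):
            if np.any(labels == k):
                mu[k] = X[labels == k].mean(axis=0)
    return mu, labels

mu, labels = kmeans(np.random.default_rng(1).normal(size=(200, 2)), K=3)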


K-Means : example

[Figure : K-means example; the initial centers are shown as plain yellow squares]


K-Means : example

=⇒ Different initializations lead to different partitions !



K-Means : remarks and limitations


The criterion J_w decreases at each iteration.

The algorithm converges, at least to a local minimum of J_w

Initialization of the µ_k :
  select random values within the range of the x_i
  select randomly among the x_i
Different initializations can lead to different clusters (convergence to a
local minimum)

K-Means : some issues


Number of clusters
Difficult to assess the number of clusters because of the lack of a ground
truth to compare with
Fixed a priori (e.g. : we want to split customers into K segments)
Use the "elbow trick" on the variation of J_w(K) w.r.t. K
Use ad-hoc metrics such as the silhouette score
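A possible sketch of both heuristics with scikit-learn, assuming it is available; the data X is arbitrary toy data with three groups.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in (0, 3, 6)])  # 3 synthetic groups

for K in range(2, 7):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    # km.inertia_ is J_w(K): look for the "elbow" where it stops decreasing sharply
    print(K, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))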


Sequential K-Means
Adaptation of K-means to streaming samples

Initialize µ_1, · · · , µ_K and the counts n_1, · · · , n_K
Repeat
  acquire a new sample x
  assign x to the nearest cluster :

  C_ℓ ← x such that ℓ = arg min_k ||x − µ_k||^2

  increment n_ℓ
  update the center µ_ℓ of this cluster :

  µ_ℓ = µ_ℓ + (1/n_ℓ)(x − µ_ℓ)
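A minimal sketch of this online update; the stream is simulated by iterating over an array, and taking the first K samples as initial centers is an assumption, not specified on the slide.

import numpy as np

def sequential_kmeans(stream, K):
    # online K-means: update one center per incoming sample
    stream = iter(stream)
    mu = np.array([next(stream) for _ in range(K)], dtype=float)  # assumed init: first K samples
    n = np.ones(K)                                                # one sample seen per center
    for x in stream:
        l = np.argmin(((mu - x) ** 2).sum(axis=1))                # nearest center
        n[l] += 1
        mu[l] += (x - mu[l]) / n[l]                               # running mean of cluster l
    return mu

X = np.random.default_rng(2).normal(size=(500, 2))
print(sequential_kmeans(X, K=3))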


Conclusion

Clustering : unsupervised learning

Data grouping into homogeneous clusters
The number of clusters is a hyper-parameter that depends on the
application, but it can be determined with metrics such as the
silhouette score
Several algorithms : hierarchical clustering, K-means, but also DBSCAN,
spectral clustering, . . .
