
Clustering

Sistemas de Informação e Estatística 1


Clustering

Dividing the data into clusters such that:


§Similar (related) objects should be in the same cluster
§Dissimilar (unrelated) objects should be in different clusters
§Clusters are not defined beforehand (otherwise: use classification)
§Clusters have (statistical, geometric, ...) properties such as:
§ Connectivity
§ Separation
§ Least squared deviation
§ Density

Sistemas de Informação e Estatística 2


Some applications
§Customer segmentation
§ Optimize ad targeting or product design for different “focus groups”
§Web visitor segmentation
§ Optimize web page navigation for different user segments
§Data aggregation/reduction
§ Represent many data points with a single (representative) example. E.g., reduce color
palette of an image to k colors
§Text collection organization
§ Group text documents into (previously unknown) topics
§Biology
§ Taxonomy of living things: kingdom, phylum, class, order, family, genus and species
§Information retrieval
§ Document clustering

Sistemas de Informação e Estatística 3


Some applications
§Land use
§ Identification of areas of similar land use in an Earth observation database
§Marketing
§ Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
§City-planning
§ Identifying groups of houses according to their house type, value and geographical location
§Earthquake studies
§ Observed earthquake epicenters should be clustered along continent faults
§Climate
§ Understanding the Earth's climate; finding patterns of atmospheric and oceanic phenomena
§Economic Science
§ Market research

Sistemas de Informação e Estatística 4


Process
§Feature selection
§ Select information (about objects) concerning the task of interest
§ Aim at minimal information redundancy
§ Weighting of information
§Proximity measure
§ Similarity of two feature vectors
§Clustering criterion
§ Expressed via a cost function or some rules
§Clustering algorithms
§ Choice of algorithms
§Validation of the results
§ Validation test
§Interpretation of the results
§ Integration with applications

Sistemas de Informação e Estatística 5


Metric

1. d(x, y) ≥ 0
2. d(x, y) = 0 ⇔ x = y
3. d(x, y) = d(y, x)
4. d(x, y) ≤ d(x, z) + d(z, y)

• Distance: satisfies 1. and 3.
• Dissimilarity: usually satisfies 1. and 3.

Sistemas de Informação e Estatística 6


Similarity

1. s(x, y) ≥ 0
2. s(x, y) = s(y, x)
3. s(x, y) ∈ [0, 1]
4. s(x, x) = 1

• A distance can be converted into a similarity, for example: s(x, y) := 1 / (1 + d(x, y)) ∈ [0, 1]
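A one-line Python sketch of this distance-to-similarity conversion (the function name to_similarity is just illustrative):

    # Turn any non-negative distance into a similarity in (0, 1].
    def to_similarity(d):
        return 1.0 / (1.0 + d)

    print(to_similarity(0.0), to_similarity(3.0))   # 1.0 0.25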

Sistemas de Informação e Estatística 7


Distances for binary data

§Binary attributes: two objects o1, o2 ∈ {0, 1}^d having only binary attributes
§Determine the following quantities:
§ f01 = the number of attributes where o1 has 0 and o2 has 1
§ f10 = the number of attributes where o1 has 1 and o2 has 0
§ f00 = the number of attributes where o1 has 0 and o2 has 0
§ f11 = the number of attributes where o1 has 1 and o2 has 1
§Simple matching coefficient (SMC):
   sSMC(o1, o2) = (f11 + f00) / (f01 + f10 + f00 + f11) = (f11 + f00) / d
§ Used for symmetric attributes, where each state (0, 1) is equally important
§Simple matching distance (SMD):
   dSMC(o1, o2) = 1 − sSMC(o1, o2) = (f01 + f10) / d = dHamming(o1, o2) / d
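A minimal Python/NumPy sketch of these counts and of the simple matching coefficient/distance; the two example vectors are made up for illustration:

    import numpy as np

    o1 = np.array([1, 0, 1, 1, 0, 0])
    o2 = np.array([1, 1, 0, 1, 0, 0])
    d = len(o1)

    # Count agreements and disagreements per attribute.
    f11 = np.sum((o1 == 1) & (o2 == 1))
    f00 = np.sum((o1 == 0) & (o2 == 0))
    f10 = np.sum((o1 == 1) & (o2 == 0))
    f01 = np.sum((o1 == 0) & (o2 == 1))

    smc = (f11 + f00) / d        # simple matching coefficient
    smd = (f01 + f10) / d        # simple matching distance = Hamming distance / d
    print(smc, smd)              # 0.666..., 0.333...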

Sistemas de Informação e Estatística 8


Distance for sets

§Jaccard coefficient for sets A and B (if A = B = ∅, we use similarity 1):
   sJaccard(A, B) := |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|) ∈ [0, 1]
§Jaccard distance for sets A and B:
   dJaccard(A, B) := 1 − sJaccard(A, B) = (|A ∪ B| − |A ∩ B|) / |A ∪ B| ∈ [0, 1]
§If we encode the sets as binary vectors, we get:
   sJaccard(o1, o2) = f11 / (f01 + f10 + f11)
   dJaccard(o1, o2) = (f01 + f10) / (f01 + f10 + f11)
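A short Python sketch of the Jaccard coefficient/distance, both for sets and for the binary-vector form; the example sets and vectors are made up:

    import numpy as np

    A = {"red", "green", "blue"}
    B = {"green", "blue", "yellow"}

    # Set form (similarity 1 if both sets are empty, as defined above).
    s_jaccard = len(A & B) / len(A | B) if (A or B) else 1.0
    d_jaccard = 1.0 - s_jaccard
    print(s_jaccard, d_jaccard)                # 0.5 0.5

    # Binary-vector form: f11 / (f01 + f10 + f11).
    o1 = np.array([1, 1, 1, 0])
    o2 = np.array([0, 1, 1, 1])
    f11 = np.sum((o1 == 1) & (o2 == 1))
    f10 = np.sum((o1 == 1) & (o2 == 0))
    f01 = np.sum((o1 == 0) & (o2 == 1))
    print(f11 / (f01 + f10 + f11))             # 0.5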
Sistemas de Informação e Estatística 9
Distance for categorical data

§Categorical data: two lists x = (x1, ..., xd) and y = (y1, ..., yd),
  with each xi and yi being categorical (nominal) attributes
§Hamming distance: “count the number of differences”
   distHamming(x, y) = Σ_{i=1..d} δ(xi, yi),  with δ(xi, yi) = 0 if xi = yi and δ(xi, yi) = 1 if xi ≠ yi
§Jaccard, by taking the position into account
§Gower’s, for mixed interval, ordinal, and categorical data:
   distGower(x, y) = Σ_{i=1..d} g(xi, yi),  with
     g(xi, yi) = 0                                 if xi = yi
     g(xi, yi) = 1                                 if variable Xi is categorical and xi ≠ yi
     g(xi, yi) = |xi − yi| / Ri                    if variable Xi is interval scale
     g(xi, yi) = |rank(xi) − rank(yi)| / (n − 1)   if variable Xi is ordinal scale
   where Ri = max_x xi − min_x xi is the range, and rank(x) ∈ [1, ..., n] is the rank
§Intuition: each attribute has the same maximum distance contribution 1
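A small Python sketch of the Hamming and Gower distances for mixed data; the column types, ranges, and example records are made-up assumptions, and the ordinal case is omitted for brevity:

    import numpy as np

    def hamming(x, y):
        # Count the number of positions where the two lists differ.
        return sum(xi != yi for xi, yi in zip(x, y))

    def gower(x, y, kinds, ranges):
        # kinds[i] in {"categorical", "interval"}; ranges[i] = max - min of variable i.
        total = 0.0
        for i, kind in enumerate(kinds):
            if kind == "categorical":
                total += 0.0 if x[i] == y[i] else 1.0
            else:  # interval scale: absolute difference normalized by the range
                total += abs(x[i] - y[i]) / ranges[i]
        return total

    x = ["red", 1.70, 30]
    y = ["blue", 1.80, 40]
    kinds = ["categorical", "interval", "interval"]
    ranges = [None, 0.5, 60]                           # assumed ranges of the interval variables
    print(hamming(["a", "b", "c"], ["a", "x", "c"]))   # 1
    print(gower(x, y, kinds, ranges))                  # 1 + 0.2 + 0.1666... ≈ 1.37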

Sistemas de Informação e Estatística 10


Distance for continuous variables
§d_p(x, y) = (|y1 − x1|^p + ... + |yd − xd|^p)^(1/p)     (Minkowski distance)
§d_1(x, y) = |y1 − x1| + ... + |yd − xd|                 (Manhattan distance)
§d_2(x, y) = √(|y1 − x1|² + ... + |yd − xd|²)            (Euclidean distance)
§d_∞(x, y) = max(|y1 − x1|, ..., |yd − xd|)              (Chebyshev distance)
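A quick NumPy sketch of the p = 1, 2, ∞ cases for two made-up vectors:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, 3.0])

    d1   = np.sum(np.abs(y - x))            # Manhattan (p = 1)
    d2   = np.sqrt(np.sum((y - x) ** 2))    # Euclidean (p = 2)
    dinf = np.max(np.abs(y - x))            # Chebyshev (p = ∞)
    print(d1, d2, dinf)                     # 5.0  3.605...  3.0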

Sistemas de Informação e Estatística 11


Mahalanobis distance

§Given a data set with variables X1, ..., Xd and n vectors x1, ..., xn, with xi = (xi1, ..., xid)ᵀ
§The d × d covariance matrix is computed as:
   Σij := S_{XiXj} = 1/(n − 1) · Σ_{k=1..n} (xki − X̄i)(xkj − X̄j)
§The Mahalanobis distance is then defined on two vectors x, y ∈ R^d as:
   dMah(x, y) = √((x − y)ᵀ Σ⁻¹ (x − y))
   where Σ⁻¹ is the inverse of the covariance matrix of the data
§This is a generalization of the Euclidean distance
§This takes correlation between attributes into account
§The attributes can have different ranges of values
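A short NumPy sketch of the definition above; the random data matrix is an illustrative assumption:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))           # n = 100 vectors with d = 3 variables

    cov = np.cov(X, rowvar=False)           # d x d covariance matrix (divides by n - 1)
    cov_inv = np.linalg.inv(cov)

    def mahalanobis(x, y, cov_inv):
        diff = x - y
        return np.sqrt(diff @ cov_inv @ diff)

    print(mahalanobis(X[0], X[1], cov_inv))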
Sistemas de Informação e Estatística 12
Hierarchical clustering

§Agglomerative methods
§ Start with each observation forming a cluster
§ Join clusters based on cluster linkage
§ Stop when having 1 cluster
§ Analyze intermediate stopping point
§Divisive methods
§ Start with all observations in 1 cluster
§ Divide a cluster based on a splitting criterion
§ Stop when having 1 observation per cluster
§ Analyze intermediate stopping point

Sistemas de Informação e Estatística 13


Linkage

§Single-linkage: minimum distance ≙ maximum similarity
   dsingle(A, B) := min_{a∈A, b∈B} d(a, b) ≙ max_{a∈A, b∈B} s(a, b)
§Complete-linkage: maximum distance ≙ minimum similarity
   dcomplete(A, B) := max_{a∈A, b∈B} d(a, b) ≙ min_{a∈A, b∈B} s(a, b)
§Average-linkage (UPGMA): average distance ≙ average similarity
   daverage(A, B) := 1/(|A|·|B|) Σ_{a∈A} Σ_{b∈B} d(a, b)
§Centroid-linkage: distance of cluster centers (Euclidean only)
   dcentroid(A, B) := ‖μA − μB‖²
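A small NumPy/SciPy sketch comparing the four criteria on two made-up clusters A and B (Euclidean distance assumed):

    import numpy as np
    from scipy.spatial.distance import cdist

    A = np.array([[0.0, 0.0], [1.0, 0.0]])
    B = np.array([[4.0, 0.0], [5.0, 1.0]])

    D = cdist(A, B)                                          # pairwise distances between A and B
    single   = D.min()                                       # minimum distance
    complete = D.max()                                       # maximum distance
    average  = D.mean()                                      # average distance (UPGMA)
    centroid = np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)  # squared distance of the centroids
    print(single, complete, average, centroid)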

Sistemas de Informação e Estatística 14
Lance-Williams equation

§A and B are the 2 previous clusters agglomerated into A ∪ B
§Recursive equation: Lance-Williams update equations have the general form
   D(A ∪ B, C) = α1·d(A, C) + α2·d(B, C) + β·d(A, B) + γ·|d(A, C) − d(B, C)|
§Several (but not all) linkages can be expressed in this form (for distances):

   Linkage                          α1                        α2                        β                      γ
   Single-linkage                   1/2                       1/2                       0                      −1/2
   Complete-linkage                 1/2                       1/2                       0                      +1/2
   Average-group-linkage (UPGMA)    |A|/(|A|+|B|)             |B|/(|A|+|B|)             0                      0
   McQuitty (WPGMA)                 1/2                       1/2                       0                      0
   Centroid-linkage (UPGMC)         |A|/(|A|+|B|)             |B|/(|A|+|B|)             −|A|·|B|/(|A|+|B|)²    0
   Median-linkage (WPGMC)           1/2                       1/2                       −1/4                   0
   Ward                             (|A|+|C|)/(|A|+|B|+|C|)   (|B|+|C|)/(|A|+|B|+|C|)   −|C|/(|A|+|B|+|C|)     0
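A tiny Python sketch checking that the single- and complete-linkage rows of the table reproduce the minimum and maximum of d(A, C) and d(B, C); the numeric distances are made up:

    def lance_williams(d_ac, d_bc, d_ab, alpha1, alpha2, beta, gamma):
        # D(A ∪ B, C) for the given coefficients.
        return (alpha1 * d_ac + alpha2 * d_bc + beta * d_ab
                + gamma * abs(d_ac - d_bc))

    d_ac, d_bc, d_ab = 2.0, 5.0, 3.0
    single   = lance_williams(d_ac, d_bc, d_ab, 0.5, 0.5, 0.0, -0.5)   # = min(d_ac, d_bc) = 2.0
    complete = lance_williams(d_ac, d_bc, d_ab, 0.5, 0.5, 0.0, +0.5)   # = max(d_ac, d_bc) = 5.0
    print(single, complete)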
Sistemas de Informação e Estatística 15
Linkage

§MiniMax linkage: cannot be computed with Lance-Williams updates,
  but we need to find the best cluster representative (in O(n²))

Sistemas de Informação e Estatística 16

Dendrogram

[Figure: example dendrogram built by agglomerative clustering, with the objects on the x-axis and the merge distance on the y-axis]
Sistemas de Informação e Estatística 17


AGNES – AGglomerative NESting

1. Compute the pairwise distance matrix of objects


2. Find the position of the minimum distance d(i, j) (for similarities: the maximum similarity s(i, j))
3. Combine rows and columns of 𝑖 and 𝑗 into one using Lance-Williams
update equations
4. Repeat from (2.) until only one entry remains
5. Return dendrogram tree
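For reference, a short sketch of running the same agglomerative procedure with SciPy's hierarchical-clustering routines; the toy 2-D coordinates and the cut into 3 clusters are illustrative assumptions, not part of the slides:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0],
                  [3.0, 4.0], [3.5, 4.5], [4.5, 3.5]])

    Z = linkage(X, method="complete")                 # also: "single", "average", "centroid", "ward"
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
    print(labels)
    # dendrogram(Z)                                   # draws the tree if matplotlib is available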

Sistemas de Informação e Estatística 18


AGNES – complete linkage

Example with complete linkage (= maximum of the distances), on six objects A–F.
[Figure: scatter plot of the objects, the distance matrix, and the dendrogram, updated step by step]

       A      B      C      D      E      F
A      0      0.71   5      2.92   2.5    3.54
B      0.71   0      5.70   3.61   3.20   4.24
C      5      5.70   0      2.55   2.69   1.58
D      2.92   3.61   2.55   0      0.5    1
E      2.5    3.20   2.69   0.5    0      1.12
F      3.54   4.24   1.58   1      1.12   0

The algorithm repeats the following steps until only one cluster remains:
§ Find the minimum distance in the matrix.
§ Merge the two clusters, and update the dendrogram.
§ Update the distance matrix (here: keep the maximum in each merged row/column, except the diagonal).

Notes:
§ We may need to merge non-adjacent rows.
§ We don’t know the optimum label positions of the dendrogram in advance.

Sistemas de Informação e Estatística 19–31

AGNES – single linkage

Example with single linkage (= minimum of the distances), on the same six objects A–F.
[Figure: scatter plot, distance matrix, and dendrogram for single linkage]

In this very simple example, single and complete linkage are very similar: they produce the same
clusters here, but this is usually not the case.

Sistemas de Informação e Estatística 32
K-means

§Sum of Squares
§ The sum-of-squares objective:
   SSQ := Σ_C Σ_d Σ_{xi∈C} (xi,d − μC,d)²
   (over every cluster C, every dimension d, and every point xi: the squared deviation from the mean)
§ We can rearrange these sums because of commutativity
§Cluster centroids
§ For every cluster C and dimension d, the arithmetic mean minimizes the squared deviations:
   Σ_{xi∈C} (xi,d − μC,d)²  is minimized by  μC,d = 1/|C| Σ_{xi∈C} xi,d
§ Assigning every point xi to its least-squares closest cluster C usually reduces SSQ, too ²
§Note: sum of squares ≡ squared Euclidean distance:
   Σ_d (xi,d − μC,d)² ≡ ‖xi − μC‖² ≡ d²Euclidean(xi, μC)
§We can therefore say that every point is assigned to the “closest” cluster, but we cannot use
  arbitrary other distance functions in k-means (because the arithmetic mean only minimizes SSQ)

² This is not always optimal: the change in the mean can increase the SSQ of the new cluster.
  But this difference is commonly ignored in algorithms and textbooks.
Sistemas de Informação e Estatística 33
Optimization algorithm

1. Choose k points randomly as initial centers

2. Repeat
1. Assign every point to the least-squares closest center

2. Stop if no cluster assignment changed

3. Update the centers with the arithmetic mean
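A minimal NumPy sketch of this loop (Lloyd's algorithm); the function name, random seed, and toy data are illustrative assumptions, and empty clusters are not handled:

    import numpy as np

    def kmeans(X, k, rng=np.random.default_rng(0), max_iter=100):
        # 1. choose k points randomly as initial centers
        centers = X[rng.choice(len(X), size=k, replace=False)]
        labels = None
        for _ in range(max_iter):
            # 2.1 assign every point to the least-squares (squared Euclidean) closest center
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_labels = dists.argmin(axis=1)
            # 2.2 stop if no cluster assignment changed
            if labels is not None and np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # 2.3 update the centers with the arithmetic mean of each cluster
            centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        return labels, centers

    X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
                   np.random.default_rng(2).normal(5, 0.5, (20, 2))])
    labels, centers = kmeans(X, k=2)
    print(centers)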

Sistemas de Informação e Estatística 34


Centroid initialization strategies
§ Initial centers given by domain expert
§ Randomly assign points to partitions 1...k (not very good)
§ Randomly generate k centers from a uniform distribution (not very good)
§ Randomly choose k data points as centers (uniform from the data)
§ First k data points
§ Choose a point randomly, then always take the farthest point from the current seeds to get k initial
points (initial centers are often outliers, and repeated runs give similar results)
§ Weighting points by their distance
§ Points are chosen randomly with p ∝ min_c ‖xi − c‖² (c ranging over all current seeds); see the sketch after this list
§ Run a few k-means iterations on a sample, then use centroids from the sample result
§ ...
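A small NumPy sketch of the distance-weighted seeding mentioned above (the k-means++ idea); the function name and toy data are illustrative assumptions:

    import numpy as np

    def weighted_init(X, k, rng=np.random.default_rng(0)):
        centers = [X[rng.integers(len(X))]]          # first seed: uniform at random
        for _ in range(k - 1):
            # squared distance of every point to its nearest current seed
            d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
                        axis=1)
            probs = d2 / d2.sum()                    # p ∝ min_c ||x - c||²
            centers.append(X[rng.choice(len(X), p=probs)])
        return np.array(centers)

    X = np.random.default_rng(1).normal(size=(50, 2))
    print(weighted_init(X, k=3))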

Sistemas de Informação e Estatística 35



Measures of the (in)adequacy

Table 5.1 Measures of the (in)adequacy of the mth cluster containing nm objects, derived from
a dissimilarity matrix D with elements d_{ql,kv} measuring the dissimilarity between the lth object
in the qth group and the vth object in the kth group (index r ∈ {1, 2}).

§ Lack of homogeneity:  h1(m) = Σ_{l=1..nm} Σ_{v=1..nm, v≠l} (d_{ml,mv})^r
§ Lack of homogeneity:  h2(m) = max_{l,v=1..nm, v≠l} (d_{ml,mv})^r
§ Lack of homogeneity:  h3(m) = min_{v=1..nm} Σ_{l=1..nm} (d_{ml,mv})^r
§ Separation:           i1(m) = Σ_{l=1..nm} Σ_{k≠m} Σ_{v=1..nk} (d_{ml,kv})^r
§ Separation:           i2(m) = min_{l=1..nm, k≠m, v=1..nk} (d_{ml,kv})^r
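A short NumPy sketch of two of these indices (with r = 1) computed from a full dissimilarity matrix; the function names and the tiny matrix are illustrative assumptions:

    import numpy as np

    def lack_of_homogeneity_h1(D, labels, m):
        # Sum of all within-cluster dissimilarities of cluster m (ordered pairs; diagonal is 0).
        idx = np.where(labels == m)[0]
        return D[np.ix_(idx, idx)].sum()

    def separation_i2(D, labels, m):
        # Smallest dissimilarity between cluster m and any object outside it.
        inside = labels == m
        return D[np.ix_(inside, ~inside)].min()

    D = np.array([[0.0, 1.0, 4.0],
                  [1.0, 0.0, 5.0],
                  [4.0, 5.0, 0.0]])
    labels = np.array([0, 0, 1])
    print(lack_of_homogeneity_h1(D, labels, 0))   # 1 + 1 = 2
    print(separation_i2(D, labels, 0))            # 4.0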

Sistemas de Informação e Estatística 36


Variations

§k-medians: use the median in each dimension


§ For use with the Manhattan norm Σ_j |x_j − m_j|
§k-modes: use the mode instead of the mean
§ For categorical data, using Hamming distance.
§k-prototypes: mean on continuous variables, mode on categorical
§ For mixed data, using squared Euclidean (respectively, Hamming) distance
§k-medoids: using the medoid (element with smallest distance sum)
§ For arbitrary distance functions.
§Spherical k-means: use the mean, but normalized to unit length
§ For cosine distance.
§Gaussian Mixture Modeling: using mean and covariance
§ For use with Mahalanobis distance.

Sistemas de Informação e Estatística 37


Optimal k

§SSQk may exhibit an “elbow” or “knee”: initially it improves fast, then much
slower
§Use alternate criteria such as Silhouette
§ Computing silhouette is O(n²) – more expensive than k-means
§ GAP statistic
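A brief scikit-learn sketch of scanning k with the average silhouette (the silhouette criterion mentioned above); the synthetic data and the range of k are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in (0.0, 3.0, 6.0)])

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, silhouette_score(X, labels))   # pick the k with the highest average silhouette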

Sistemas de Informação e Estatística 38


References

§Everitt, B. S., Landau, S., Leese, M. & Stahl, D. Cluster Analysis. (Wiley,
2011).
§Johnson, R. A. & Wichern, D. W. Applied Multivariate Statistical Analysis.
(Prentice Hall, 2007).
§Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and
validation of cluster analysis. J Comput Appl Math 20, 53–65 (1987).

Sistemas de Informação e Estatística 39
