
Clustering

Sistemas de Informação e Estatística 1


Clustering

Dividing the data into clusters such that:


§Similar (related) objects should be in the same cluster
§Dissimilar (unrelated) objects should be in different clusters
§Clusters are not defined beforehand (otherwise: use classification)
§Clusters have (statistical, geometric, ...) properties such as:
§ Connectivity
§ Separation
§ Least squared deviation
§ Density

Sistemas de Informação e Estatística 2


Some applications
§Customer segmentation
§ Optimize ad targeting or product design for different “focus groups”
§Web visitor segmentation
§ Optimize web page navigation for different user segments
§Data aggregation/reduction
§ Represent many data points with a single (representative) example. E.g., reduce color
palette of an image to k colors
§Text collection organization
§ Group text documents into (previously unknown) topics
§Biology
§ Taxonomy of living things: kingdom, phylum, class, order, family, genus and species
§Information retrieval
§ Document clustering

Sistemas de Informação e Estatística 3


Some applications
§Land use
§ Identification of areas of similar land use in an Earth observation database
§Marketing
§ Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
§City-planning
§ Identifying groups of houses according to their house type, value and geographical location
§Earthquake studies
§ Observed earthquake epicenters should be clustered along continent faults
§Climate
§ Understanding the Earth's climate; finding patterns of atmospheric and oceanic phenomena
§Economic Science
§ Market research

Sistemas de Informação e Estatística 4


Process
§Feature selection
§ Select information (about objects) concerning the task of interest
§ Aim at minimal information redundancy
§ Weighting of information
§Proximity measure
§ Similarity of two feature vectors
§Clustering criterion
§ Expressed via a cost function or some rules
§Clustering algorithms
§ Choice of algorithms
§Validation of the results
§ Validation test
§Interpretation of the results
§ Integration with applications

Sistemas de Informação e Estatística 5


Metric

1. d(x, y) ≥ 0
2. d(x, y) = 0 ⇔ x = y
3. d(x, y) = d(y, x)
4. d(x, y) ≤ d(x, z) + d(z, y)

• Distance: satisfies 1. and 3.
• Dissimilarity: usually satisfies 1. and 3.

Sistemas de Informação e Estatística 6


Similarity

1. s(x, y) ≥ 0
2. s(x, y) = s(y, x)
3. s(x, y) ∈ [0, 1]
4. s(x, x) = 1

• A distance can be converted into a similarity, for example: s(x, y) := 1 / (1 + d(x, y)) ∈ [0, 1]
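A one-line Python sketch of this distance-to-similarity conversion (the function name to_similarity is just illustrative):

    # Turn any non-negative distance into a similarity in (0, 1].
    def to_similarity(d):
        return 1.0 / (1.0 + d)

    print(to_similarity(0.0), to_similarity(3.0))   # 1.0 0.25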

Sistemas de Informação e Estatística 7


Distances for binary data

§Binary attributes: two objects o1, o2 ∈ {0, 1}^d having only binary attributes
§Determine the following quantities:
§ f01 = the number of attributes where o1 has 0 and o2 has 1
§ f10 = the number of attributes where o1 has 1 and o2 has 0
§ f00 = the number of attributes where o1 has 0 and o2 has 0
§ f11 = the number of attributes where o1 has 1 and o2 has 1
§Simple matching coefficient (SMC):
   sSMC(o1, o2) = (f11 + f00) / (f01 + f10 + f00 + f11) = (f11 + f00) / d
§ Used for symmetric attributes, where each state (0, 1) is equally important
§Simple matching distance (SMD):
   dSMC(o1, o2) = 1 − sSMC(o1, o2) = (f01 + f10) / d = dHamming(o1, o2) / d
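A minimal Python/NumPy sketch of these counts and of the simple matching coefficient/distance; the two example vectors are made up for illustration:

    import numpy as np

    o1 = np.array([1, 0, 1, 1, 0, 0])
    o2 = np.array([1, 1, 0, 1, 0, 0])
    d = len(o1)

    # Count agreements and disagreements per attribute.
    f11 = np.sum((o1 == 1) & (o2 == 1))
    f00 = np.sum((o1 == 0) & (o2 == 0))
    f10 = np.sum((o1 == 1) & (o2 == 0))
    f01 = np.sum((o1 == 0) & (o2 == 1))

    smc = (f11 + f00) / d        # simple matching coefficient
    smd = (f01 + f10) / d        # simple matching distance = Hamming distance / d
    print(smc, smd)              # 0.666..., 0.333...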

Sistemas de Informação e Estatística 8


Distance for sets

§Jaccard coefficient for sets A and B (if A = B = ∅, we use similarity 1):
   sJaccard(A, B) := |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|) ∈ [0, 1]
§Jaccard distance for sets A and B:
   dJaccard(A, B) := 1 − sJaccard(A, B) = (|A ∪ B| − |A ∩ B|) / |A ∪ B| ∈ [0, 1]
§If we encode the sets as binary vectors, we get:
   sJaccard(o1, o2) = f11 / (f01 + f10 + f11)
   dJaccard(o1, o2) = (f01 + f10) / (f01 + f10 + f11)
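A short Python sketch of the Jaccard coefficient/distance, both for sets and for the binary-vector form; the example sets and vectors are made up:

    import numpy as np

    A = {"red", "green", "blue"}
    B = {"green", "blue", "yellow"}

    # Set form (similarity 1 if both sets are empty, as defined above).
    s_jaccard = len(A & B) / len(A | B) if (A or B) else 1.0
    d_jaccard = 1.0 - s_jaccard
    print(s_jaccard, d_jaccard)                # 0.5 0.5

    # Binary-vector form: f11 / (f01 + f10 + f11).
    o1 = np.array([1, 1, 1, 0])
    o2 = np.array([0, 1, 1, 1])
    f11 = np.sum((o1 == 1) & (o2 == 1))
    f10 = np.sum((o1 == 1) & (o2 == 0))
    f01 = np.sum((o1 == 0) & (o2 == 1))
    print(f11 / (f01 + f10 + f11))             # 0.5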
Sistemas de Informação e Estatística 9
Distance for categorical data

§Categorical data: two lists x = (x1, ..., xd) and y = (y1, ..., yd),
  with each xi and yi being categorical (nominal) attributes
§Hamming distance: “count the number of differences”
   distHamming(x, y) = Σ_{i=1..d} δ(xi, yi),  with δ(xi, yi) = 0 if xi = yi and δ(xi, yi) = 1 if xi ≠ yi
§Jaccard, by taking the position into account
§Gower’s, for mixed interval, ordinal, and categorical data:
   distGower(x, y) = Σ_{i=1..d} g(xi, yi),  with
     g(xi, yi) = 0                                 if xi = yi
     g(xi, yi) = 1                                 if variable Xi is categorical and xi ≠ yi
     g(xi, yi) = |xi − yi| / Ri                    if variable Xi is interval scale
     g(xi, yi) = |rank(xi) − rank(yi)| / (n − 1)   if variable Xi is ordinal scale
   where Ri = max_x xi − min_x xi is the range, and rank(x) ∈ [1, ..., n] is the rank
§Intuition: each attribute has the same maximum distance contribution 1
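A small Python sketch of the Hamming and Gower distances for mixed data; the column types, ranges, and example records are made-up assumptions, and the ordinal case is omitted for brevity:

    import numpy as np

    def hamming(x, y):
        # Count the number of positions where the two lists differ.
        return sum(xi != yi for xi, yi in zip(x, y))

    def gower(x, y, kinds, ranges):
        # kinds[i] in {"categorical", "interval"}; ranges[i] = max - min of variable i.
        total = 0.0
        for i, kind in enumerate(kinds):
            if kind == "categorical":
                total += 0.0 if x[i] == y[i] else 1.0
            else:  # interval scale: absolute difference normalized by the range
                total += abs(x[i] - y[i]) / ranges[i]
        return total

    x = ["red", 1.70, 30]
    y = ["blue", 1.80, 40]
    kinds = ["categorical", "interval", "interval"]
    ranges = [None, 0.5, 60]                           # assumed ranges of the interval variables
    print(hamming(["a", "b", "c"], ["a", "x", "c"]))   # 1
    print(gower(x, y, kinds, ranges))                  # 1 + 0.2 + 0.1666... ≈ 1.37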

Sistemas de Informação e Estatística 10


Distance for continuous variables
§d_p(x, y) = (|y1 − x1|^p + ... + |yd − xd|^p)^(1/p)     (Minkowski distance)
§d_1(x, y) = |y1 − x1| + ... + |yd − xd|                 (Manhattan distance)
§d_2(x, y) = √(|y1 − x1|² + ... + |yd − xd|²)            (Euclidean distance)
§d_∞(x, y) = max(|y1 − x1|, ..., |yd − xd|)              (Chebyshev distance)
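A quick NumPy sketch of the p = 1, 2, ∞ cases for two made-up vectors:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, 3.0])

    d1   = np.sum(np.abs(y - x))            # Manhattan (p = 1)
    d2   = np.sqrt(np.sum((y - x) ** 2))    # Euclidean (p = 2)
    dinf = np.max(np.abs(y - x))            # Chebyshev (p = ∞)
    print(d1, d2, dinf)                     # 5.0  3.605...  3.0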

Sistemas de Informação e Estatística 11


Mahalanobis distance

§Given a data set with variables X1, ..., Xd and n vectors x1, ..., xn, with xi = (xi1, ..., xid)ᵀ
§The d × d covariance matrix is computed as:
   Σij := S_{XiXj} = 1/(n − 1) · Σ_{k=1..n} (xki − X̄i)(xkj − X̄j)
§The Mahalanobis distance is then defined on two vectors x, y ∈ R^d as:
   dMah(x, y) = √((x − y)ᵀ Σ⁻¹ (x − y))
   where Σ⁻¹ is the inverse of the covariance matrix of the data
§This is a generalization of the Euclidean distance
§This takes correlation between attributes into account
§The attributes can have different ranges of values
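A short NumPy sketch of the definition above; the random data matrix is an illustrative assumption:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))           # n = 100 vectors with d = 3 variables

    cov = np.cov(X, rowvar=False)           # d x d covariance matrix (divides by n - 1)
    cov_inv = np.linalg.inv(cov)

    def mahalanobis(x, y, cov_inv):
        diff = x - y
        return np.sqrt(diff @ cov_inv @ diff)

    print(mahalanobis(X[0], X[1], cov_inv))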
Sistemas de Informação e Estatística 12
Hierarchical clustering

§Agglomerative methods
§ Start with each observation forming a cluster
§ Join clusters based on cluster linkage
§ Stop when having 1 cluster
§ Analyze intermediate stopping point
§Divisive methods
§ Start with all observations in 1 cluster
§ Divide a cluster based on a splitting criterion
§ Stop when having 1 observation per cluster
§ Analyze intermediate stopping point

Sistemas de Informação e Estatística 13


Linkage

§Single-linkage: minimum distance ≙ maximum similarity
   dsingle(A, B) := min_{a∈A, b∈B} d(a, b) ≙ max_{a∈A, b∈B} s(a, b)
§Complete-linkage: maximum distance ≙ minimum similarity
   dcomplete(A, B) := max_{a∈A, b∈B} d(a, b) ≙ min_{a∈A, b∈B} s(a, b)
§Average-linkage (UPGMA): average distance ≙ average similarity
   daverage(A, B) := 1/(|A|·|B|) Σ_{a∈A} Σ_{b∈B} d(a, b)
§Centroid-linkage: distance of cluster centers (Euclidean only)
   dcentroid(A, B) := ‖μA − μB‖²
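A small NumPy/SciPy sketch comparing the four criteria on two made-up clusters A and B (Euclidean distance assumed):

    import numpy as np
    from scipy.spatial.distance import cdist

    A = np.array([[0.0, 0.0], [1.0, 0.0]])
    B = np.array([[4.0, 0.0], [5.0, 1.0]])

    D = cdist(A, B)                                          # pairwise distances between A and B
    single   = D.min()                                       # minimum distance
    complete = D.max()                                       # maximum distance
    average  = D.mean()                                      # average distance (UPGMA)
    centroid = np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)  # squared distance of the centroids
    print(single, complete, average, centroid)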

Sistemas de Informação e Estatística 14
Lance-Williams equation

§A and B are the 2 previous clusters agglomerated into A ∪ B
§Recursive equation: Lance-Williams update equations have the general form
   D(A ∪ B, C) = α1·d(A, C) + α2·d(B, C) + β·d(A, B) + γ·|d(A, C) − d(B, C)|
§Several (but not all) linkages can be expressed in this form (for distances):

   Linkage                          α1                        α2                        β                      γ
   Single-linkage                   1/2                       1/2                       0                      −1/2
   Complete-linkage                 1/2                       1/2                       0                      +1/2
   Average-group-linkage (UPGMA)    |A|/(|A|+|B|)             |B|/(|A|+|B|)             0                      0
   McQuitty (WPGMA)                 1/2                       1/2                       0                      0
   Centroid-linkage (UPGMC)         |A|/(|A|+|B|)             |B|/(|A|+|B|)             −|A|·|B|/(|A|+|B|)²    0
   Median-linkage (WPGMC)           1/2                       1/2                       −1/4                   0
   Ward                             (|A|+|C|)/(|A|+|B|+|C|)   (|B|+|C|)/(|A|+|B|+|C|)   −|C|/(|A|+|B|+|C|)     0
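A tiny Python sketch checking that the single- and complete-linkage rows of the table reproduce the minimum and maximum of d(A, C) and d(B, C); the numeric distances are made up:

    def lance_williams(d_ac, d_bc, d_ab, alpha1, alpha2, beta, gamma):
        # D(A ∪ B, C) for the given coefficients.
        return (alpha1 * d_ac + alpha2 * d_bc + beta * d_ab
                + gamma * abs(d_ac - d_bc))

    d_ac, d_bc, d_ab = 2.0, 5.0, 3.0
    single   = lance_williams(d_ac, d_bc, d_ab, 0.5, 0.5, 0.0, -0.5)   # = min(d_ac, d_bc) = 2.0
    complete = lance_williams(d_ac, d_bc, d_ab, 0.5, 0.5, 0.0, +0.5)   # = max(d_ac, d_bc) = 5.0
    print(single, complete)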
Sistemas de Informação e Estatística 15
Linkage

§MiniMax linkage: cannot be computed with Lance-Williams updates,
  but we need to find the best cluster representative (in O(n²))

Sistemas de Informação e Estatística 16

Dendrogram

[Figure: example dendrogram built by agglomerative clustering, with the objects on the x-axis and the merge distance on the y-axis]
Sistemas de Informação e Estatística 17


AGNES – AGglomerative NESting

1. Compute the pairwise distance matrix of objects


2. Find the position of the minimum distance d(i, j) (for similarities: the maximum similarity s(i, j))
3. Combine rows and columns of 𝑖 and 𝑗 into one using Lance-Williams
update equations
4. Repeat from (2.) until only one entry remains
5. Return dendrogram tree
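For reference, a short sketch of running the same agglomerative procedure with SciPy's hierarchical-clustering routines; the toy 2-D coordinates and the cut into 3 clusters are illustrative assumptions, not part of the slides:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0],
                  [3.0, 4.0], [3.5, 4.5], [4.5, 3.5]])

    Z = linkage(X, method="complete")                 # also: "single", "average", "centroid", "ward"
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
    print(labels)
    # dendrogram(Z)                                   # draws the tree if matplotlib is available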

Sistemas de Informação e Estatística 18


AGNES – complete linkage

Example with complete linkage (= maximum of the distances), on six objects A–F.
[Figure: scatter plot of the objects, the distance matrix, and the dendrogram, updated step by step]

       A      B      C      D      E      F
A      0      0.71   5      2.92   2.5    3.54
B      0.71   0      5.70   3.61   3.20   4.24
C      5      5.70   0      2.55   2.69   1.58
D      2.92   3.61   2.55   0      0.5    1
E      2.5    3.20   2.69   0.5    0      1.12
F      3.54   4.24   1.58   1      1.12   0

The algorithm repeats the following steps until only one cluster remains:
§ Find the minimum distance in the matrix.
§ Merge the two clusters, and update the dendrogram.
§ Update the distance matrix (here: keep the maximum in each merged row/column, except the diagonal).

Notes:
§ We may need to merge non-adjacent rows.
§ We don’t know the optimum label positions of the dendrogram in advance.

Sistemas de Informação e Estatística 19–31

AGNES – single linkage

Example with single linkage (= minimum of the distances), on the same six objects A–F.
[Figure: scatter plot, distance matrix, and dendrogram for single linkage]

In this very simple example, single and complete linkage are very similar: they produce the same
clusters here, but this is usually not the case.

Sistemas de Informação e Estatística 32
K-means

§Sum of Squares
§ The sum-of-squares objective:
   SSQ := Σ_C Σ_d Σ_{xi∈C} (xi,d − μC,d)²
   (over every cluster C, every dimension d, and every point xi: the squared deviation from the mean)
§ We can rearrange these sums because of commutativity
§Cluster centroids
§ For every cluster C and dimension d, the arithmetic mean minimizes the squared deviations:
   Σ_{xi∈C} (xi,d − μC,d)²  is minimized by  μC,d = 1/|C| Σ_{xi∈C} xi,d
§ Assigning every point xi to its least-squares closest cluster C usually reduces SSQ, too ²
§Note: sum of squares ≡ squared Euclidean distance:
   Σ_d (xi,d − μC,d)² ≡ ‖xi − μC‖² ≡ d²Euclidean(xi, μC)
§We can therefore say that every point is assigned to the “closest” cluster, but we cannot use
  arbitrary other distance functions in k-means (because the arithmetic mean only minimizes SSQ)

² This is not always optimal: the change in the mean can increase the SSQ of the new cluster.
  But this difference is commonly ignored in algorithms and textbooks.
Sistemas de Informação e Estatística 33
Optimization algorithm

1. Choose k points randomly as initial centers

2. Repeat
1. Assign every point to the least-squares closest center

2. Stop if no cluster assignment changed

3. Update the centers with the arithmetic mean
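A minimal NumPy sketch of this loop (Lloyd's algorithm); the function name, random seed, and toy data are illustrative assumptions, and empty clusters are not handled:

    import numpy as np

    def kmeans(X, k, rng=np.random.default_rng(0), max_iter=100):
        # 1. choose k points randomly as initial centers
        centers = X[rng.choice(len(X), size=k, replace=False)]
        labels = None
        for _ in range(max_iter):
            # 2.1 assign every point to the least-squares (squared Euclidean) closest center
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_labels = dists.argmin(axis=1)
            # 2.2 stop if no cluster assignment changed
            if labels is not None and np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # 2.3 update the centers with the arithmetic mean of each cluster
            centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        return labels, centers

    X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
                   np.random.default_rng(2).normal(5, 0.5, (20, 2))])
    labels, centers = kmeans(X, k=2)
    print(centers)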

Sistemas de Informação e Estatística 34


Centroid initialization strategies
§ Initial centers given by domain expert
§ Randomly assign points to partitions 1...k (not very good)
§ Randomly generate k centers from a uniform distribution (not very good)
§ Randomly choose k data points as centers (uniform from the data)
§ First k data points
§ Choose a point randomly, then always take the farthest point from the current seeds to get k initial
points (initial centers are often outliers, and repeated runs give similar results)
§ Weighting points by their distance
§ Points are chosen randomly with p ∝ min_c ‖xi − c‖² (c ranging over all current seeds); see the sketch after this list
§ Run a few k-means iterations on a sample, then use centroids from the sample result
§ ...
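A small NumPy sketch of the distance-weighted seeding mentioned above (the k-means++ idea); the function name and toy data are illustrative assumptions:

    import numpy as np

    def weighted_init(X, k, rng=np.random.default_rng(0)):
        centers = [X[rng.integers(len(X))]]          # first seed: uniform at random
        for _ in range(k - 1):
            # squared distance of every point to its nearest current seed
            d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
                        axis=1)
            probs = d2 / d2.sum()                    # p ∝ min_c ||x - c||²
            centers.append(X[rng.choice(len(X), p=probs)])
        return np.array(centers)

    X = np.random.default_rng(1).normal(size=(50, 2))
    print(weighted_init(X, k=3))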

Sistemas de Informação e Estatística 35



Measures of the (in)adequacy

Table 5.1 Measures of the (in)adequacy of the mth cluster containing nm objects, derived from
a dissimilarity matrix D with elements d_{ql,kv} measuring the dissimilarity between the lth object
in the qth group and the vth object in the kth group (index r ∈ {1, 2}).

§ Lack of homogeneity:  h1(m) = Σ_{l=1..nm} Σ_{v=1..nm, v≠l} (d_{ml,mv})^r
§ Lack of homogeneity:  h2(m) = max_{l,v=1..nm, v≠l} (d_{ml,mv})^r
§ Lack of homogeneity:  h3(m) = min_{v=1..nm} Σ_{l=1..nm} (d_{ml,mv})^r
§ Separation:           i1(m) = Σ_{l=1..nm} Σ_{k≠m} Σ_{v=1..nk} (d_{ml,kv})^r
§ Separation:           i2(m) = min_{l=1..nm, k≠m, v=1..nk} (d_{ml,kv})^r
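A short NumPy sketch of two of these indices (with r = 1) computed from a full dissimilarity matrix; the function names and the tiny matrix are illustrative assumptions:

    import numpy as np

    def lack_of_homogeneity_h1(D, labels, m):
        # Sum of all within-cluster dissimilarities of cluster m (ordered pairs; diagonal is 0).
        idx = np.where(labels == m)[0]
        return D[np.ix_(idx, idx)].sum()

    def separation_i2(D, labels, m):
        # Smallest dissimilarity between cluster m and any object outside it.
        inside = labels == m
        return D[np.ix_(inside, ~inside)].min()

    D = np.array([[0.0, 1.0, 4.0],
                  [1.0, 0.0, 5.0],
                  [4.0, 5.0, 0.0]])
    labels = np.array([0, 0, 1])
    print(lack_of_homogeneity_h1(D, labels, 0))   # 1 + 1 = 2
    print(separation_i2(D, labels, 0))            # 4.0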

Sistemas de Informação e Estatística 36


Variations

§k-medians: use the median in each dimension


§ For use with the Manhattan norm Σ_j |x_j − m_j|
§k-modes: use the mode instead of the mean
§ For categorical data, using Hamming distance.
§k-prototypes: mean on continuous variables, mode on categorical
§ For mixed data, using squared Euclidean (respectively, Hamming) distance
§k-medoids: using the medoid (element with smallest distance sum)
§ For arbitrary distance functions.
§Spherical k-means: use the mean, but normalized to unit length
§ For cosine distance.
§Gaussian Mixture Modeling: using mean and covariance
§ For use with Mahalanobis distance.

Sistemas de Informação e Estatística 37


Optimal k

§SSQk may exhibit an “elbow” or “knee”: initially it improves fast, then much
slower
§Use alternate criteria such as Silhouette
§ Computing silhouette is O(n²) – more expensive than k-means
§ GAP statistic
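A brief scikit-learn sketch of scanning k with the average silhouette (the silhouette criterion mentioned above); the synthetic data and the range of k are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in (0.0, 3.0, 6.0)])

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, silhouette_score(X, labels))   # pick the k with the highest average silhouette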

Sistemas de Informação e Estatística 38


References

§Everitt, B. S., Landau, S., Leese, M. & Stahl, D. Cluster Analysis. (Wiley,
2011).
§Johnson, R. A. & Wichern, D. W. Applied Multivariate Statistical Analysis.
(Prentice Hall, 2007).
§Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and
validation of cluster analysis. J Comput Appl Math 20, 53–65 (1987).

Sistemas de Informação e Estatística 39
