Distance and Similarity

Distance properties:
1. d(x, y) ≥ 0
2. d(x, y) = 0 ⟺ x = y
3. d(x, y) = d(y, x)
• Distance: satisfies 1. and 3.

Similarity properties:
1. s(x, y) ≥ 0
2. s(x, y) = s(y, x)
3. s(x, y) ∈ [0, 1]
4. s(x, x) = 1

A distance can be converted into a similarity, e.g. s(x, y) := 1 / (1 + d(x, y)) ∈ [0, 1].
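A quick Python sketch of this distance-to-similarity conversion (illustrative only, not from the slides):

    def similarity(d):
        """Convert a non-negative distance into a similarity in [0, 1]."""
        return 1.0 / (1.0 + d)

    print(similarity(0.0))  # identical objects: 1.0
    print(similarity(4.0))  # more distant objects: 0.2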
Example: Distances for Categorical Data

Categorical data: two lists x = {x_1, x_2, ..., x_d} and y = {y_1, y_2, ..., y_d},
with each x_i and y_i being categorical (nominal) attributes.

• Hamming distance: "count the number of differences".

      dist_Hamming(x, y) = Σ_{i=1}^{d} δ(x_i, y_i)   with   δ(x_i, y_i) = 0 if x_i = y_i, and 1 if x_i ≠ y_i
• Jaccard, by taking the position into account.
• Gower's distance, for mixed interval, ordinal, and categorical data:

      dist_Gower(x, y) = Σ_{i=1}^{d} ⎧ 0                                     if x_i = y_i
                                     ⎪ 1                                     if variable X_i is categorical and x_i ≠ y_i
                                     ⎨ |x_i − y_i| / R_i                     if variable X_i is interval scale
                                     ⎩ |rank(x_i) − rank(y_i)| / (n − 1)     if variable X_i is ordinal scale

      where R_i = max_x(x_i) − min_x(x_i) is the range, and rank(x) ∈ [1, ..., n] is the rank.

Intuition: each attribute has the same maximum distance contribution 1.
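A small Python sketch of these two distances (illustrative; the function names and the attribute-type metadata passed to the Gower variant are assumptions, not from the slides):

    def hamming(x, y):
        """Count the number of positions in which two equal-length lists differ."""
        return sum(1 for xi, yi in zip(x, y) if xi != yi)

    def gower(x, y, kinds, ranges, n):
        """Simplified Gower distance (assumed interface).
        kinds[i]  : 'categorical', 'interval', or 'ordinal' for attribute i
        ranges[i] : range R_i of attribute i (used for interval attributes)
        n         : number of objects (ordinal values are assumed to already be ranks 1..n)."""
        total = 0.0
        for i, (xi, yi) in enumerate(zip(x, y)):
            if xi == yi:
                continue                        # contributes 0
            if kinds[i] == 'categorical':
                total += 1.0                    # mismatch contributes 1
            elif kinds[i] == 'interval':
                total += abs(xi - yi) / ranges[i]
            else:                               # ordinal scale
                total += abs(xi - yi) / (n - 1)
        return total

    print(hamming(['red', 'S', 'cotton'], ['red', 'M', 'wool']))  # 2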
Mahalanobis Distance (a distance for continuous variables)

Given a data set with variables X_1, ..., X_d, and n vectors x_1, ..., x_n, x_i = (x_{i1}, ..., x_{id})^T.

The d × d covariance matrix is computed as:

      Σ_{ij} := S_{X_i X_j} = 1/(n − 1) · Σ_{k=1}^{n} (x_{ki} − X̄_i)(x_{kj} − X̄_j)

The Mahalanobis distance is then defined on two vectors x, y ∈ R^d as:

      d_Mah(x, y) = sqrt( (x − y)^T Σ^{−1} (x − y) )

where Σ^{−1} is the inverse of the covariance matrix of the data.

• This is a generalization of the Euclidean distance
• This takes correlation between attributes into account
• The attributes can have different ranges of values
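A short numpy sketch of this definition (illustrative, not from the slides); np.cov with rowvar=False uses the same 1/(n − 1) normalization:

    import numpy as np

    def mahalanobis(x, y, data):
        """Mahalanobis distance between x and y, using the covariance of `data` (an (n, d) array).
        Assumes the covariance matrix is invertible."""
        cov = np.cov(data, rowvar=False)          # d x d covariance matrix
        diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

    # toy data with correlated attributes on different scales
    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.8], [0.0, 10.0]])
    print(mahalanobis(data[0], data[1], data))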
Hierarchical clustering
§Agglomerative methods
§ Start with each observation forming its own cluster
§ Join clusters based on a cluster linkage criterion
§ Stop when only 1 cluster remains
§ Analyze intermediate stopping points
§Divisive methods
§ Start with all observations in 1 cluster
§ Divide a cluster based on a splitting criterion
§ Stop when there is 1 observation per cluster
§ Analyze intermediate stopping points
Distance of Clusters
Single-linkage: minimum distance ≅ maximum similarity

      d_single(A, B) := min_{a∈A, b∈B} d(a, b) ≅ max_{a∈A, b∈B} s(a, b)

Complete-linkage: maximum distance ≅ minimum similarity

      d_complete(A, B) := max_{a∈A, b∈B} d(a, b) ≅ min_{a∈A, b∈B} s(a, b)

Average-linkage (UPGMA): average distance ≅ average similarity

      d_average(A, B) := 1/(|A|·|B|) · Σ_{a∈A} Σ_{b∈B} d(a, b)

Centroid-linkage: distance of cluster centers (Euclidean only)

      d_centroid(A, B) := ||μ_A − μ_B||²
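A small numpy sketch computing these four cluster distances for two clusters given as point arrays (illustrative names, not from the slides):

    import numpy as np

    def cluster_distances(A, B):
        """Single, complete, average, and centroid linkage between two clusters.
        A and B are (n_A, d) and (n_B, d) point arrays; Euclidean base distance."""
        # pairwise Euclidean distances between all points of A and B
        D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        single   = D.min()
        complete = D.max()
        average  = D.mean()                                            # = sum / (|A| * |B|)
        centroid = np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)      # squared distance of centers
        return single, complete, average, centroid

    A = np.array([[0.0, 0.0], [1.0, 0.0]])
    B = np.array([[4.0, 3.0], [5.0, 3.0]])
    print(cluster_distances(A, B))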
Lance-Williams equation

AGNES – Agglomerative Nesting [KR90]

Lance-Williams update equations have the general form:

      D(A ∪ B, C) = α_1 d(A, C) + α_2 d(B, C) + β d(A, B) + γ |d(A, C) − d(B, C)|

§ A and B are the two previous clusters agglomerated into A ∪ B
§ Recursive equation
§ Several (but not all) linkages can be expressed in this form (for distances):

                                         α_1                        α_2                        β                        γ
      Single-linkage                     1/2                        1/2                        0                        −1/2
      Complete-linkage                   1/2                        1/2                        0                        +1/2
      Average-group-linkage (UPGMA)      |A|/(|A|+|B|)              |B|/(|A|+|B|)              0                        0
      McQuitty (WPGMA)                   1/2                        1/2                        0                        0
      Centroid-linkage (UPGMC)           |A|/(|A|+|B|)              |B|/(|A|+|B|)              −|A||B|/(|A|+|B|)²       0
      Median-linkage (WPGMC)             1/2                        1/2                        −1/4                     0
      Ward                               (|A|+|C|)/(|A|+|B|+|C|)    (|B|+|C|)/(|A|+|B|+|C|)    −|C|/(|A|+|B|+|C|)       0

MiniMax linkage: cannot be computed with Lance-Williams updates.
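As an illustration (assumed helper names, not from the slides), a Lance-Williams update in Python, shown with the single- and complete-linkage coefficients from the table:

    def lance_williams(d_AC, d_BC, d_AB, alpha1, alpha2, beta, gamma):
        """Generic Lance-Williams update: distance from the merged cluster A∪B to C."""
        return alpha1 * d_AC + alpha2 * d_BC + beta * d_AB + gamma * abs(d_AC - d_BC)

    # single linkage:   alpha1 = alpha2 = 1/2, beta = 0, gamma = -1/2  ->  min(d_AC, d_BC)
    # complete linkage: alpha1 = alpha2 = 1/2, beta = 0, gamma = +1/2  ->  max(d_AC, d_BC)
    d_AC, d_BC, d_AB = 2.0, 5.0, 1.0
    print(lance_williams(d_AC, d_BC, d_AB, 0.5, 0.5, 0.0, -0.5))  # 2.0 = min
    print(lance_williams(d_AC, d_BC, d_AB, 0.5, 0.5, 0.0, +0.5))  # 5.0 = max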
[Figure: example data set of 12 objects [KR90], plotted on a 0–12 coordinate grid.]
Example: AGNES

Example with complete linkage (= maximum of the distances):

Distance matrix (excerpt):

           A      B      C      D      E      F
      A    0      0.71   5      2.92   2.5    3.54
      B    0.71   0      5.70   3.61   3.20   4.24
      C    5      5.70   0      2.55   2.69   1.58
      D    2.92   3.61   2.55   0      0.5    1

[Figure: scatter plot of the points A–F and the dendrogram built by the successive complete-linkage merges.]
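The same kind of agglomerative clustering can be reproduced with scipy; this is a sketch with made-up 2D coordinates for A–F, since the exact point coordinates are not given here:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # hypothetical 2D coordinates for six points labelled A..F
    points = np.array([[1.0, 2.0], [1.5, 2.5], [6.0, 4.0],
                       [3.5, 2.0], [3.8, 2.4], [5.0, 3.0]])
    labels = list("ABCDEF")

    # complete linkage = merge clusters by their maximum pairwise distance
    Z = linkage(points, method="complete", metric="euclidean")

    # cut the dendrogram into, e.g., 3 clusters
    print(fcluster(Z, t=3, criterion="maxclust"))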
K-means: The Sum of Squares Objective

The sum-of-squares objective:

      SSQ := Σ_C Σ_d Σ_{x_i ∈ C} (x_{i,d} − μ_{C,d})²
             (every cluster × every dimension × every point: squared deviation from the mean)

We can rearrange these sums because of commutativity.

For every cluster C and dimension d, the arithmetic mean minimizes

      Σ_{x_i ∈ C} (x_{i,d} − μ_{C,d})²,   i.e., it is minimized by   μ_{C,d} = 1/|C| · Σ_{x_i ∈ C} x_{i,d}

Assigning every point x_i to its least-squares closest cluster C usually reduces SSQ, too.

Note: sum of squares ≡ squared Euclidean distance:

      Σ_d (x_{i,d} − μ_{C,d})² ≡ ||x_i − μ_C||₂² ≡ d²_Euclidean(x_i, μ_C)

We can therefore say that every point is assigned to the "closest" cluster, but we cannot use arbitrary other distance functions here.
The standard k-means algorithm alternates these two steps:
1. Assign every point to the least-squares closest center
2. Recompute every center as the mean of the points assigned to it
and repeats them until the assignments no longer change.
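A compact numpy sketch of this iteration and of the SSQ objective (illustrative, not from the slides):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Plain Lloyd-style k-means on an (n, d) array X; returns centers and labels."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
        for _ in range(n_iter):
            # assign every point to the least-squares closest center
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            labels = dists.argmin(axis=1)
            # recompute every center as the mean of its assigned points
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels

    def ssq(X, centers, labels):
        """Sum-of-squares objective: squared Euclidean deviation of each point from its center."""
        return float(((X - centers[labels]) ** 2).sum())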
Measures of cluster (in)adequacy

Table 5.1: Measures of the (in)adequacy of the m-th cluster containing n_m objects, derived from a dissimilarity matrix D, with elements d_{ql,kv} measuring the dissimilarity between the l-th object in the q-th group and the v-th object in the k-th group. [Table contents not reproduced.]
§SSQ_k may exhibit an "elbow" or "knee": initially it improves fast, then much more slowly
§Use alternative criteria such as the Silhouette
§ Computing the silhouette is O(n²) – more expensive than k-means itself
§ GAP statistic
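For example, with scikit-learn (an illustrative sketch on synthetic data, not from the slides):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [4, 0], [2, 4])])

    # compare candidate k by silhouette (higher is better), instead of relying on the SSQ elbow alone
    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))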
§Everitt, B. S., Landau, S., Leese, M. & Stahl, D. Cluster Analysis. (Wiley,
2011).
§Johnson, R. A. & Wichern, D. W. Applied Multivariate Statistical Analysis.
(Prentice Hall, 2007).
§Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and
validation of cluster analysis. J Comput Appl Math 20, 53–65 (1987).