Professional Documents
Culture Documents
Cluster Measures
Measures for Continuous Data
EUCLID
The distance between two items, x and y, is the square root of the sum of the
squared differences between the values for the items.
1 6 ∑ 1x − y 6
EUCLID x , y =
i
i i
2
SEUCLID
The distance between two items is the sum of the squared differences between the
values for the items.
1 6 ∑ 1x − y 6
SEUCLID x , y =
i
i i
2
CORRELATION
3Z Z 8
1 6 ∑ N
CORRELATION x , y = i
xi yi
where Z xi is the (standardized) Z-score value of x for the ith case or variable, and
N is the number of cases or variables.
1
2 CLUSTER
COSINE
1x y 6
1 6 ∑
COSINE x , y = i
i i
∑ x ∑ y
2 2
i i
i i
CHEBYCHEV
The distance between two items is the maximum absolute difference between the
values for the items.
1 6
CHEBYCHEV x , y = maxi xi − yi
BLOCK
The distance between two items is the sum of the absolute differences between the
values for the items.
1 6 ∑ x −y
BLOCK x , y =
i
i i
MINKOWSKI(p)
The distance between two items is the pth root of the sum of the absolute
differences to the pth power between the values for the items.
1 6 ∑ x − y
MINKOWSKI x , y =
i
i i
p 1p
CLUSTER 3
1 6
POWER p, r
The distance between two items is the rth root of the sum of the absolute
differences to the pth power between the values for the items.
1 6 ∑ x − y
POWER x , y =
i
i i
p 1r
CHISQ
The magnitude of this similarity measure depends on the total frequencies of the
two cases or variables whose proximity is computed. Expected values are from the
model of independence of cases (or variables), x and y.
CHISQ1 x , y 6 = ∑
2 x − E 1 x 67 + 2 y − E 1 y 67
2 2
i
E1x 6 ∑ E1 y 6
i
i i
i
i
i i
PH2
This is the CHISQ measure normalized by the square root of the combined
frequency. Therefore, its value does not depend on the total frequencies of the two
cases or variables whose proximity is computed.
1 6
PH2 x , y =
1 6
CHISQ x , y
N
Present Absent
Item 1 Present a b
Absent c d
1 6
RR x , y =
a
a +b+c+d
This is the ratio of the number of matches to the total number of characteristics.
1 6
SM x , y =
a+d
a +b+c+d
1 6
JACCARD x , y =
a
a +b+c
1 6
DICE x , y =
2a
2a + b + c
CLUSTER 5
1 6 21a2+1ad 6++db6 + c
SS1 x , y =
1 6
RT x , y =
a+d
1 6
a+d +2 b+c
1 6
SS2 x , y =
a
1 6
a+2 b+c
This measure has a minimum value of 0 and no upper limit. It is undefined when
1 6
there are no nonmatches b = 0 and c = 0 . Therefore, PROXIMITIES assigns an
artificial upper limit of 9999.999 to K1 when it is undefined or exceeds this value.
1 6
K1 x , y =
a
b+c
This measure has a minimum value of 0, has no upper limit, and is undefined when
1 6
there are no nonmatches b = 0 and c = 0 . As with K1, PROXIMITIES assigns an
artificial upper limit of 9999.999 to SS3 when it is undefined or exceeds this value.
1 6
SS3 x , y =
a+d
b+c
6 CLUSTER
Conditional Probabilities
The following three binary measures yield values that you can interpret in terms of
conditional probability. All three are similarity measures.
This yields the average conditional probability that a characteristic is present in one
item given that the characteristic is present in the other item. The measure is an
average over both items acting as predictors. It has a range of 0 to 1.
1 6
K2 x , y =
1 6 1 6
a a +b +a a +c
2
This yields the conditional probability that a characteristic of one item is in the
same state (present or absent) as the characteristic of the other item. The measure is
an average over both items acting as predictors. It has a range of 0 to 1.
1 6
SS4 x , y =
1 6 1 6 1
a a +b +a a +c +d b+d +d c+d 6 1 6
4
This measure gives the probability that a characteristic has the same state in both
items (present in both or absent from both) minus the probability that a
characteristic has different states in the two items (present in one and absent from
the other). HAMANN has a range of –1 to +1 and is monotonically related to SM,
SS1, and RT.
1 6 1aa++db6+−c1b++dc6
HAMANN x , y =
CLUSTER 7
Predictability Measures
The following four binary measures assess the association between items as the
predictability of one given the other. All four measures yield similarities.
1 6 1 6 1 6 1 6
t1 = max a , b + max c, d + max a , c + max b, d
1 6 1 6
t 2 = max a + c, b + d + max a + b, c + d
1 6 1
LAMBDA x , y =
6
t1 − t 2
2 a + b + c + d − t2
Anderberg’s D (Similarity)
1 6 1 6 1 6 1 6
t1 = max a , b + max c, d + max a , c + max b, d
1 6 1 6
t 2 = max a + c, b + d + max a + b, c + d
1 6 1
D x, y =
t1 − t 2
2 a +b+c+d 6
This is a function of the cross-product ratio for a 2 × 2 table. It has a range of –1 to +1.
1 6
Y x, y =
ad − bc
ad + bc
8 CLUSTER
Yule’s Q (Similarity)
This is the 2 × 2 version of Goodman and Kruskal’s ordinal measure gamma. Like
Yule’s Y, Q is a function of the cross-product ratio for a 2 × 2 table and has a range
of –1 to +1.
1 6
Q x, y =
ad − bc
ad + bc
This is the binary form of the cosine. It has a range of 0 to 1 and is a similarity
measure.
This is the binary form of the Pearson product-moment correlation coefficient. Phi
is a similarity measure, and its range is 0 to 1.
This is a distance measure. Its minimum value is 0, and it has no upper limit.
1 6
BEUCLID x , y = b + c
This is also a distance measure. Its minimum value is 0, and it has no upper limit.
1 6
BSEUCLID x , y = b + c
Size Difference
1 6
SIZE x , y =
1b − c6 2
1a + b + c + d 6 2
Pattern Difference
1 6
PATTERN x , y =
bc
1a + b + c + d 6 2
1 6 1a + b + ca++db61+bc++cd6 − 1b − c6
2
BSHAPE x , y =
1 6 2
10 CLUSTER
1 6
DISPER x , y =
ad − bc
1a + b + c + d 6 2
1 6 41a +bb++cc + d 6
VARIANCE x , y =
Also known as the Bray-Curtis nonmetric coefficient, this dissimilarity measure has
a range of 0 to 1.
1 6
BLWMN x , y =
b+c
2a + b + c
Clustering Methods
Notation
The following notation is used unless otherwise specified:
S Matrix of similarity or dissimilarity measures
sij Similarity or dissimilarity measure between cluster i and cluster j
General Procedure
Begin with N clusters each containing one case. Denote the clusters 1 through N.
• 1 6
Find the most similar pair of clusters p and q p > q . Denote this similarity
s pq . If a dissimilarity measure is used, large values indicate dissimilarity. If a
similarity measure is used, small values indicate dissimilarity.
• Perform the previous two steps until all entities are in one cluster.
str = s pr + sqr
Update N t by
Nt = N p + Nq
and then choose the most similar pair based on the value
sij 3N N 8
i j
12 CLUSTER
str = s pr + sqr
Nt = N p + N q
43 N i 83
+ N j Ni + N j − 1 89 2
Single Linkage
Update str by
%Kmin3s pr , sqr 8
&Kmax3s pr , sqr 8
if S is a dissimilarity matrix
str =
' if S is a similarity matrix
Complete Linkage
Update str by
%Kmax3s pr , sqr 8
&Kmin3s pr , sqr 8
if S is a dissimilarity matrix
str =
' if S is a similarity matrix
CLUSTER 13
Centroid Method
Update str by
Np Nq N p Nq
str = s pr + sqr − s pq
N p + Nq N p + Nq
3N p + Nq 8 2
Median Method
Update str by
3
str = s pr + sqr 8 2−s pq 4
Ward’s Method
Update str by
6 3N 8 3 8
1
str =
1 Nt + Nr
r + N p srp + N r + N q srq − N r s pq
W = W + .5 s pq
Note that for Ward’s method, the coefficient given in the agglomeration schedule is
really the within-cluster sum of squares at that step. For all other methods, this
coefficient represents the distance at which the clusters p and q were joined.
References
Anderberg, M. R. 1973. Cluster analysis for applications. New York: Academic
Press.