(Figure: data points grouped into clusters, with a few outliers.)
Data matrix (two modes): n tuples/objects described by p variables:

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Dissimilarity (or distance) matrix (one mode): the pairwise distances between objects:

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]

Assuming a symmetric distance: d(i, j) = d(j, i).
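The two structures above can be related in code: starting from a (two-mode) data matrix, one computes the (one-mode) dissimilarity matrix. A minimal sketch, assuming a small made-up data matrix and Euclidean distance (defined later in this section):

```python
import math

# Hypothetical data matrix: n = 4 objects described by p = 2 variables.
X = [[1.0, 2.0],
     [2.0, 4.0],
     [5.0, 1.0],
     [1.0, 2.0]]

def dist(a, b):
    """Euclidean distance between two p-dimensional objects."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

n = len(X)
# Lower-triangular dissimilarity matrix: row i holds d(i,1) ... d(i,i),
# with 0 on the diagonal; symmetry d(i,j) = d(j,i) makes the upper half redundant.
D = [[dist(X[i], X[j]) for j in range(i + 1)] for i in range(n)]

for row in D:
    print(["%.2f" % v for v in row])
```

Note that objects 1 and 4 are identical, so their dissimilarity is 0 even though they are distinct rows of the data matrix.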
Measuring Similarity in
Clustering
The choice of dissimilarity/similarity metric depends on the type of the variables:
Binary variables
e.g., gender (M/F), has_cancer(T/F)
Ordinal variables
e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
Ratio-scaled variables
population growth (1,10,100,1000,...)
Minkowski distance:

\[
L_p(i,j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p}
\]

where i = (x_{i1}, x_{i2}, …, x_{in}) and j = (x_{j1}, x_{j2}, …, x_{jn}) are two n-dimensional data objects.

If p = 1, L_1 is the Manhattan distance:

\[
L_1(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|
\]
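The Minkowski formula above translates directly into code. A minimal sketch, with illustrative vectors chosen so the results are easy to check by hand:

```python
def minkowski(i, j, p):
    """L_p distance between two n-dimensional objects i and j."""
    return sum(abs(xi - xj) ** p for xi, xj in zip(i, j)) ** (1 / p)

i = (1.0, 0.0, 3.0)
j = (4.0, 4.0, 3.0)

# p = 1 (Manhattan): |1-4| + |0-4| + |3-3| = 7
print(minkowski(i, j, 1))  # 7.0
# p = 2 (Euclidean): sqrt(9 + 16 + 0) = 5
print(minkowski(i, j, 2))  # 5.0
```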
Similarity and Dissimilarity
Between Objects (Cont.)
If p = 2, L_2 is the Euclidean distance:

\[
d(i,j) = \sqrt{\,|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{in} - x_{jn}|^2\,}
\]

Properties:

d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)

One can also use a weighted distance:

\[
d(i,j) = \sqrt{\,w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2\,}
\]
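A minimal sketch of the weighted distance above, with illustrative points and weights; setting all weights to 1 recovers the plain Euclidean distance, and the four metric properties can be spot-checked on concrete values:

```python
import math

def weighted_euclidean(i, j, w):
    """Weighted Euclidean distance with per-dimension weights w."""
    return math.sqrt(sum(wk * (xi - xj) ** 2 for wk, xi, xj in zip(w, i, j)))

i, j, k = (0.0, 0.0), (3.0, 4.0), (1.0, 1.0)
w = (1.0, 1.0)  # all weights 1: reduces to the plain Euclidean distance

print(weighted_euclidean(i, j, w))  # 5.0

# Spot-check the metric properties on these points:
assert weighted_euclidean(i, j, w) >= 0                              # d(i,j) >= 0
assert weighted_euclidean(i, i, w) == 0.0                            # d(i,i) = 0
assert weighted_euclidean(i, j, w) == weighted_euclidean(j, i, w)    # symmetry
assert weighted_euclidean(i, j, w) <= (weighted_euclidean(i, k, w)
                                       + weighted_euclidean(k, j, w))  # triangle inequality
```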
Binary Variables
A binary variable has two states: 0 (absent), 1 (present).

A contingency table for binary data, e.g. i = (0011101001) and j = (1001100110):

                     object j
                   1     0     sum
  object i    1    a     b     a+b
              0    c     d     c+d
            sum   a+c   b+d     p

Simple matching distance:

d(i,j) = (b + c) / (a + b + c + d)

Jaccard coefficient distance:

d(i,j) = (b + c) / (a + b + c)
Binary Variables
Another approach is to define the similarity of two objects rather than their distance. In that case we have the following:

Simple matching coefficient similarity:

s(i,j) = (a + d) / (a + b + c + d)

Jaccard coefficient similarity:

s(i,j) = a / (a + b + c)
Jack: 101000
Mary: 101010
Jim:  110000

Simple match distance:

d(i,j) = (number of non-common bit positions) / (total number of bits)

Jaccard coefficient:

d(i,j) = 1 − (number of 1's in i AND j) / (number of 1's in i OR j)
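The two distances above can be checked on the Jack/Mary/Jim vectors. A minimal sketch:

```python
jack = "101000"
mary = "101010"
jim  = "110000"

def simple_match_distance(i, j):
    """Fraction of bit positions where i and j disagree."""
    return sum(bi != bj for bi, bj in zip(i, j)) / len(i)

def jaccard_distance(i, j):
    """1 - (number of 1's in i AND j) / (number of 1's in i OR j)."""
    both   = sum(bi == "1" and bj == "1" for bi, bj in zip(i, j))
    either = sum(bi == "1" or  bj == "1" for bi, bj in zip(i, j))
    return 1 - both / either

print(simple_match_distance(jack, mary))  # 1/6: they disagree in one of six positions
print(jaccard_distance(jack, mary))       # 1 - 2/3 = 1/3
print(jaccard_distance(jack, jim))        # 1 - 1/3 = 2/3
```

By both measures Jack is closer to Mary than to Jim.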
Variables of Mixed Types
A database may contain all six types of variables:
symmetric binary, asymmetric binary, nominal, ordinal, interval,
and ratio-scaled.
One may use a weighted formula to combine their effects.
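One way to read "weighted formula" here is a weighted average of per-variable dissimilarities, each first scaled to [0, 1]. The function below is an illustrative sketch of that idea, not a formula taken from this text:

```python
def mixed_dissimilarity(per_variable_d, weights):
    """Weighted average of per-variable dissimilarities, each in [0, 1].

    per_variable_d[f] is the dissimilarity of objects i and j on
    variable f, computed with whatever measure suits that variable's type.
    """
    return sum(w * d for w, d in zip(weights, per_variable_d)) / sum(weights)

# e.g. one binary variable (objects disagree, dissimilarity 1.0) and one
# interval variable (absolute difference scaled to 0.25), equally weighted:
print(mixed_dissimilarity([1.0, 0.25], [1.0, 1.0]))  # 0.625
```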
Major Clustering Approaches
Partitioning algorithms: Construct random partitions and then
iteratively refine them by some criterion
Hierarchical algorithms: Create a hierarchical decomposition of the
set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Partitioning Algorithms: Basic
Concept
Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in
the cluster
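The partitioning idea behind k-means can be sketched as the classic two-step iteration (assign each object to its nearest center, then move each center to the mean of its cluster). The data and starting centers below are illustrative, not from the example that follows:

```python
import math

def kmeans(points, centers, iterations=10):
    """Minimal k-means sketch: k clusters, each represented by its center."""
    k = len(centers)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[c]
                   for c, cl in enumerate(clusters)]
    return centers, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)  # (1.25, 1.5) and (8.5, 8.75)
```

k-medoids/PAM differs only in the update step: instead of the mean, it picks the actual object in the cluster that minimizes total dissimilarity.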
K-means Clustering
Record   Attribute 1   Attribute 2   Cluster
1        22.4          32.0          C1
2        10.0          5.0           C2
4        25.5          36.6          C1
5        20.6          29.2          C1
K-means Clustering Example
Now,the new means (centroids) for the two clusters are computed. The
mean for a cluster, Ci, with n records of m dimensions is the vector:
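The centroid computation just described, applied to the three records assigned to C1 in the earlier table (records 1, 4, and 5):

```python
# Records assigned to C1 in the earlier table (records 1, 4, and 5).
c1_records = [(22.4, 32.0), (25.5, 36.6), (20.6, 29.2)]

n = len(c1_records)
# Mean of the cluster, dimension by dimension: zip(*...) transposes
# the records so each `xs` is one dimension's values.
centroid = tuple(sum(xs) / n for xs in zip(*c1_records))
print(centroid)  # roughly (22.83, 32.6)
```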
A second iteration then computes the distance of each record from the new
centroids. In the following table, calculate the distance of each record
from the new C1 and C2, and assign each record to the suitable cluster:
(Table: records 1 through 6, with columns for the distance to C1, the distance to C2, and the assigned cluster, to be filled in.)
Then calculate the new C1 and C2.
Tip: C1 will be (28.3, 6.7) and C2 will be (51.7, 21.7).
K-means Clustering Example