CLUSTERING: SIMILARITY
OBJECTIVES
https://discuss.cryosparc.com/t/using-particles-from-cluster-mode-in-3d-va-for-refinement-fails/3665/2
SIMILARITY AND DISSIMILARITY
• Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
• Proximity refers to a similarity or dissimilarity
• Proximity measures are used to create clusters
• A cluster is a collection of data objects such that the objects within a cluster are similar to one
another and dissimilar to the objects in other clusters.
MEASURE THE QUALITY OF CLUSTERING
1. Data matrix
Suppose that we have n objects (such as persons, items, or courses) described by p attributes (also called measurements or features), such as age, height, weight, or gender.
• The data matrix represents n objects
• This structure stores the n data objects in the form of a relational table, or n-by-p matrix (n number of objects × p number of attributes)
• So, the matrix represents n data points with p dimensions
• Number of rows: n (objects). Number of columns: p (attributes)
• Two modes: made up of two entities or “things”, namely rows (for objects) and columns (for attributes)

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

The objects are:
x1 = (x11, x12, . . . , x1p),
x2 = (x21, x22, . . . , x2p),
and so on, where xij is the value for object xi of the jth attribute.

2. Dissimilarity matrix
• Used to store dissimilarity values for pairs of objects
• This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table
• n data points, but registers only the difference d(i, j), which becomes larger the more the objects differ
• A triangular matrix
• Single mode (contains one kind of entity)

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
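The step from an n-by-p data matrix to a triangular n-by-n dissimilarity matrix can be sketched as follows. This is a minimal illustration, not part of the original slides: the function names, the three sample rows, and the choice of a sum of absolute differences as the distance are all assumptions made for the example.

```python
def dissimilarity_matrix(data, dist):
    """Return the lower-triangular dissimilarity matrix as a list of rows.

    Row i holds d(i, 0), ..., d(i, i); the diagonal entry d(i, i) is 0.
    """
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]


def abs_difference(a, b):
    # Assumed distance for this sketch: sum of absolute attribute differences.
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical data matrix: n = 3 objects, p = 2 attributes.
data = [(1, 2), (3, 5), (2, 0)]
matrix = dissimilarity_matrix(data, abs_difference)

for row in matrix:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i), so the upper triangle would repeat the same values.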
NOMINAL ATTRIBUTE
• A nominal attribute can take on two or more states e.g., red,
yellow, blue, green
• The dissimilarity between two objects i and j:
• Method 1: Simple matching
\[ d(i, j) = \frac{p - m}{p} \]
m: # of matches, p: total # of variables
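Dissimilarity matrices for the four-object example (Zone Code only, Task Code only, and both attributes together):

\[
\begin{bmatrix} 0 & & & \\ 1 & 0 & & \\ 1 & 1 & 0 & \\ 0 & 1 & 1 & 0 \end{bmatrix}
\qquad
\begin{bmatrix} 0 & & & \\ 0 & 0 & & \\ 1 & 1 & 0 & \\ 1 & 1 & 1 & 0 \end{bmatrix}
\qquad
\begin{bmatrix} 0 & & & \\ 0.5 & 0 & & \\ 1 & 1 & 0 & \\ 0.5 & 1 & 1 & 0 \end{bmatrix}
\]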
NOMINAL ATTRIBUTES: more than 1 attribute
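Object identifier   Zone Code (nominal)   Task Code (nominal)
1                   Code-A                Code-P
2                   Code-B                Code-P
3                   Code-C                Code-R
4                   Code-A                Code-Q

How do you get the distance? Use simple matching over both attributes (p = 2), where m is the number of matches:
\[ d(i, j) = \frac{p - m}{p} \]
• d(2,1) = (2 − 1) / 2 = 0.5
• d(3,1) = (2 − 0) / 2 = 1
• d(4,1) = (2 − 1) / 2 = 0.5
• d(3,2) = (2 − 0) / 2 = 1
• d(4,2) = (2 − 0) / 2 = 1
• d(4,3) = (2 − 0) / 2 = 1

Dissimilarity Matrix (Zone and Task Code)
\[
\begin{bmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
d(4,1) & d(4,2) & d(4,3) & 0
\end{bmatrix}
=
\begin{bmatrix}
0   &   &   &   \\
0.5 & 0 &   &   \\
1   & 1 & 0 &   \\
0.5 & 1 & 1 & 0
\end{bmatrix}
\]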
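The simple matching calculation for the Zone Code / Task Code example can be checked with a short sketch. The function and variable names are illustrative assumptions; the data values are the four objects from the table.

```python
def simple_matching(a, b):
    """Simple matching dissimilarity d = (p - m) / p for nominal attributes."""
    p = len(a)                                   # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matching attributes
    return (p - m) / p


# Object identifier -> (Zone Code, Task Code)
objects = {
    1: ("Code-A", "Code-P"),
    2: ("Code-B", "Code-P"),
    3: ("Code-C", "Code-R"),
    4: ("Code-A", "Code-Q"),
}

# Lower triangle only: every pair (i, j) with i > j.
d = {
    (i, j): simple_matching(objects[i], objects[j])
    for i in objects
    for j in objects
    if i > j
}

for pair in sorted(d):
    print(pair, d[pair])
```

Passing only the Zone Code or only the Task Code (one-element tuples) to the same function gives the single-attribute dissimilarities.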
BINARY ATTRIBUTE
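Nominal attribute with only 2 states (0 and 1)
CONTINGENCY TABLE for objects i and j
• q: # of attributes that equal 1 for both object i and object j
• r: # of attributes that equal 1 for object i and 0 for object j
• s: # of attributes that equal 0 for object i and 1 for object j
• t: # of attributes that equal 0 for both object i and object j
• Distance measure for symmetric binary variables
(if the outcomes of each state are equally valuable):
\[ d(i, j) = \frac{r + s}{q + r + s + t} \]
An example of a symmetric attribute is gender (whether female or male)
• Distance measure for asymmetric binary variables
(if the outcomes of the states are not equally important, so t is unimportant):
\[ d(i, j) = \frac{r + s}{q + r + s} \]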
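As a rough sketch of the two binary distance measures, the code below counts q, r, s, and t for a pair of 0/1 vectors and applies both formulas. The two sample vectors and all function names are hypothetical, and returning 0.0 when the asymmetric denominator is zero is an assumption of this sketch, not something the slides specify.

```python
def contingency(i, j):
    """Count (q, r, s, t) for two equal-length binary vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # 1 in both
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # 1 in i, 0 in j
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # 0 in i, 1 in j
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # 0 in both
    return q, r, s, t


def symmetric_distance(i, j):
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)


def asymmetric_distance(i, j):
    q, r, s, _ = contingency(i, j)  # t is ignored
    denominator = q + r + s
    # Assumption: two all-zero vectors are treated as identical.
    return (r + s) / denominator if denominator else 0.0


# Hypothetical objects described by six binary attributes.
obj_i = (1, 0, 1, 0, 0, 0)
obj_j = (1, 0, 0, 0, 1, 0)

print(contingency(obj_i, obj_j))
print(symmetric_distance(obj_i, obj_j))
print(asymmetric_distance(obj_i, obj_j))
```

Here the three shared zeros lower the symmetric distance but have no effect on the asymmetric one.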
• Minkowski distance:
A popular distance measure
\[ d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h} \]
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order
Properties
◦ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
◦ d(i, j) = d(j, i) (Symmetry)
◦ d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
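A minimal sketch of the Minkowski distance and a spot check of the three properties is given below. The function name and the three sample points are assumptions chosen for illustration.

```python
def minkowski(i, j, h):
    """Minkowski distance of order h between two p-dimensional objects."""
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)


# Illustrative two-dimensional points.
x1, x2, x3 = (1, 2), (3, 5), (2, 0)

manhattan = minkowski(x1, x2, 1)   # order h = 1
euclidean = minkowski(x1, x2, 2)   # order h = 2
print(manhattan, round(euclidean, 2))

# Spot checks of the properties for h = 2 on these points.
assert minkowski(x1, x1, 2) == 0                                    # d(i, i) = 0
assert minkowski(x1, x2, 2) == minkowski(x2, x1, 2)                 # symmetry
assert euclidean <= minkowski(x1, x3, 2) + minkowski(x3, x2, 2)     # triangle inequality
```

Setting h = 1 gives the Manhattan distance and h = 2 gives the Euclidean distance.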
NUMERIC ATTRIBUTE
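Data Matrix
point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5

[Figure: scatter plot of the points x1 to x4]

Euclidean distance:
\[ d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2} \]
For example, d(x1, x2) = \sqrt{(1 - 3)^2 + (2 - 5)^2} = \sqrt{13} ≈ 3.61

Dissimilarity matrix (entries to be computed):
\[
\begin{bmatrix}
0 &   &   &   \\
? & 0 &   &   \\
? & ? & 0 &   \\
? & ? & ? & 0
\end{bmatrix}
\]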
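One way to fill in the unknown entries is to compute the Euclidean distance for every pair of the four points. The sketch below does this with the standard library only; the helper name and the rounding to two decimals are choices made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# The four points from the data matrix: (attribute1, attribute2).
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

# Lower-triangular dissimilarity matrix, rounded to two decimals.
euclid_matrix = [
    [round(euclidean(points[names[i]], points[names[j]]), 2) for j in range(i + 1)]
    for i in range(len(names))
]

for name, row in zip(names, euclid_matrix):
    print(name, row)
# x1 [0.0]
# x2 [3.61, 0.0]
# x3 [2.24, 5.1, 0.0]
# x4 [4.24, 1.0, 5.39, 0.0]
```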
ORDINAL ATTRIBUTE
Steps: map each ordinal value to its rank (increasing order), then normalize the rank within [0,1].

Object identifier   Lab test-2 (ordinal)   Rank   Normalized rank
1                   excellent              3      (3-1)/(3-1) = 1
2                   fair                   1      (1-1)/(3-1) = 0
3                   good                   2      (2-1)/(3-1) = 0.5
4                   excellent              3      (3-1)/(3-1) = 1

Compute d(2,1), d(3,1), d(4,1), ..., d(4,3) on the normalized ranks with the Euclidean distance:
\[ d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2} \]

Dissimilarity matrix:
\[
\begin{bmatrix}
0   &     &     &   \\
1.0 & 0   &     &   \\
0.5 & 0.5 & 0   &   \\
0   & 1.0 & 0.5 & 0
\end{bmatrix}
\]
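The rank mapping and normalization for the Lab test-2 attribute can be sketched as below. The names are illustrative; with a single attribute, the Euclidean distance reduces to the absolute difference of the normalized ranks, which is what the code uses.

```python
# Ordinal states in increasing order, mapped to ranks 1..M.
ranks = {"fair": 1, "good": 2, "excellent": 3}
M = len(ranks)


def normalize(rank, max_rank):
    """Map a rank in 1..max_rank onto [0, 1]."""
    return (rank - 1) / (max_rank - 1)


# Lab test-2 values for objects 1 to 4.
lab_test = ["excellent", "fair", "good", "excellent"]
z = [normalize(ranks[value], M) for value in lab_test]

# Lower-triangular dissimilarity matrix on the normalized ranks.
ordinal_matrix = [
    [abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))
]

print(z)
for row in ordinal_matrix:
    print(row)
```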
COSINE SIMILARITY
\[ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} \]
||d1|| = 6.481
||d2|| = 4.12
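The vectors d1 and d2 did not survive in this section, so the sketch below uses assumed term-frequency vectors chosen because their norms reproduce the two values above (6.481 and 4.12). Treat the vectors, and therefore the resulting similarity, as an illustration rather than the original example.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))


# ASSUMED vectors (not given in the text); picked to match the stated norms.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(norm(d1), 3))                   # 6.481
print(round(norm(d2), 2))                   # 4.12
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

A value close to 1 means the two vectors point in nearly the same direction, regardless of their lengths.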