
TOPIC 6 – PART B

CLUSTERING: SIMILARITY
OBJECTIVES

• To introduce the basic concepts of clustering


• To discuss how to compute the dissimilarity between objects of
different attribute types ✅
• To examine several clustering techniques
• Partitioning approach
• Hierarchical approach

Image source: https://discuss.cryosparc.com/t/using-particles-from-cluster-mode-in-3d-va-for-refinement-fails/3665/2
SIMILARITY AND DISSIMILARITY

• Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
 Numerical measure of how different two data objects are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies
• Proximity refers to a similarity or dissimilarity
• Proximity measures are used to form clusters
• A cluster is a collection of data objects such that the objects within a cluster are similar to one
another and dissimilar to the objects in other clusters.
MEASURE THE QUALITY OF CLUSTERING

• Dissimilarity/similarity metric
 Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
 The definitions of distance functions usually differ by variable type: numeric, boolean, categorical, ordinal, ratio, and vector
• Quality of clustering
 It is hard to define “similar enough” or “good enough” for a cluster
 The answer is typically highly subjective
DATA MATRIX AND DISSIMILARITY MATRIX
Suppose that we have n objects (such as persons, items, or courses) described by p attributes (also called measurements or features), such as age, height, weight, or gender.

The objects are x1 = (x11, x12, . . . , x1p), x2 = (x21, x22, . . . , x2p), and so on, where xij is the value for object xi of the jth attribute.

1. Data matrix
 Represents n objects (rows) described by p attributes (columns)
 This structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes)
 So, the matrix represents n data points with p dimensions
 Two modes: made up of two kinds of entities or “things”, namely rows (for objects) and columns (for attributes)

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

2. Dissimilarity matrix
 Used to store dissimilarity values for pairs of objects
 This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table
 n data points, but registers only the difference d(i, j), which becomes larger the more objects i and j differ
 A triangular matrix
 Single mode (contains one kind of entity)

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
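The step from a data matrix to a dissimilarity matrix can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `dissimilarity_matrix`, the three sample objects, and the choice of Manhattan distance are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def dissimilarity_matrix(data: Sequence[Sequence], dist: Callable) -> List[List[float]]:
    """Return the lower-triangular dissimilarity matrix as a list of rows.

    Row i holds d(i, 0), ..., d(i, i), so each row ends with d(i, i) = 0.
    """
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(len(data))]


# Hypothetical data matrix: n = 3 objects described by p = 2 numeric attributes.
objects = [(1, 2), (3, 5), (2, 0)]


def manhattan(a, b):
    # Sum of absolute attribute differences (one possible distance function).
    return sum(abs(x - y) for x, y in zip(a, b))


matrix = dissimilarity_matrix(objects, manhattan)
for row in matrix:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i), which is why the slide calls the structure a triangular matrix.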
NOMINAL ATTRIBUTE
• A nominal attribute can take on two or more states e.g., red,
yellow, blue, green
• The dissimilarity between two objects i and j:
• Method 1: Simple matching

d(i, j) = (p − m) / p

 m is the number of matches (attributes for which objects i and j are in the same state)
 p is the total number of attributes describing the objects
• Method 2: Use a large number of binary attributes
 creating a new binary attribute for each of the M nominal states
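Both methods can be sketched briefly in Python. The helper names `simple_matching` and `to_binary` and the second attribute ("small"/"large") are hypothetical and only serve the illustration.

```python
def simple_matching(obj_i, obj_j):
    """Method 1: d(i, j) = (p - m) / p for two equal-length tuples of nominal values."""
    p = len(obj_i)                                      # total number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p


def to_binary(value, states):
    """Method 2: encode one nominal value as one binary attribute per state."""
    return [1 if value == state else 0 for state in states]


colours = ["red", "yellow", "blue", "green"]

# Two hypothetical objects described by (colour, size): one match out of two attributes.
print(simple_matching(("red", "small"), ("red", "large")))

# The nominal value "blue" becomes four binary attributes, one per colour state.
print(to_binary("blue", colours))
```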
NOMINAL ATTRIBUTE
Object identifier   Zone Code (nominal)
1                   Code-A
2                   Code-B
3                   Code-C
4                   Code-A

Form the dissimilarity matrix by calculating the distance d(i, j) = (p − m) / p, where m is the number of matches and p is the total number of attributes describing the objects.

Example: d(2,1) = (1 − 0) / 1 = 1
• m = 0, because the Zone Code for objects 1 and 2 is different
• p = 1, because Zone Code is the only attribute, so the total number of attributes is equal to 1

\[
\begin{bmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
d(4,1) & d(4,2) & d(4,3) & 0
\end{bmatrix}
=
\begin{bmatrix}
0 &   &   &   \\
1 & 0 &   &   \\
1 & 1 & 0 &   \\
0 & 1 & 1 & 0
\end{bmatrix}
\]
NOMINAL ATTRIBUTES: more than 1 attribute
Object identifier   Zone Code (nominal)   Task Code (nominal)
1                   Code-A                Code-P
2                   Code-B                Code-P
3                   Code-C                Code-R
4                   Code-A                Code-Q

d(i, j) = (p − m) / p
• m is the number of matches
• p is the total number of attributes describing the objects
• In this example, p = 2 for the attributes Zone Code and Task Code

The distance is calculated based on 2 attributes.

Zone Code only:

\[
\begin{bmatrix}
0 &   &   &   \\
1 & 0 &   &   \\
1 & 1 & 0 &   \\
0 & 1 & 1 & 0
\end{bmatrix}
\]

Task Code only:

\[
\begin{bmatrix}
0 &   &   &   \\
0 & 0 &   &   \\
1 & 1 & 0 &   \\
1 & 1 & 1 & 0
\end{bmatrix}
\]

Zone and Task Code:

\[
\begin{bmatrix}
0   &   &   &   \\
0.5 & 0 &   &   \\
1   & 1 & 0 &   \\
0.5 & 1 & 1 & 0
\end{bmatrix}
\]
NOMINAL ATTRIBUTES: more than 1 attribute
Object identifier   Zone Code (nominal)   Task Code (nominal)
1                   Code-A                Code-P
2                   Code-B                Code-P
3                   Code-C                Code-R
4                   Code-A                Code-Q

How do you get the distance? Apply d(i, j) = (p − m) / p with p = 2:

• d(2,1) = (2 − 1) / 2 = 0.5
• d(3,1) = (2 − 0) / 2 = 1
• d(4,1) = (2 − 1) / 2 = 0.5
• d(3,2) = (2 − 0) / 2 = 1
• d(4,2) = (2 − 0) / 2 = 1
• d(4,3) = (2 − 0) / 2 = 1

Dissimilarity Matrix (Zone and Task Code):

\[
\begin{bmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
d(4,1) & d(4,2) & d(4,3) & 0
\end{bmatrix}
=
\begin{bmatrix}
0   &   &   &   \\
0.5 & 0 &   &   \\
1   & 1 & 0 &   \\
0.5 & 1 & 1 & 0
\end{bmatrix}
\]
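The worked example can be checked with a short script. This is a sketch under the assumption that each object is stored as a (Zone Code, Task Code) tuple; the helper name `simple_matching` is hypothetical.

```python
def simple_matching(obj_i, obj_j):
    """d(i, j) = (p - m) / p, with m matches among p nominal attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p


# (Zone Code, Task Code) for objects 1 to 4, taken from the table in this example.
objects = [
    ("Code-A", "Code-P"),
    ("Code-B", "Code-P"),
    ("Code-C", "Code-R"),
    ("Code-A", "Code-Q"),
]

# Lower-triangular dissimilarity matrix: row i holds d(i, 1), ..., d(i, i).
matrix = [
    [simple_matching(objects[i], objects[j]) for j in range(i + 1)]
    for i in range(len(objects))
]
for row in matrix:
    print(row)
```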
BINARY ATTRIBUTE
Nominal attribute with only 2 states (0 and 1)

• A contingency table for binary data


where
• q is the number of attributes that equal 1 for both objects i and j,
• r is the number of attributes that equal 1 for object i but that are 0 for object j,
• s is the number of attributes that equal 0 for object i but equal 1 for object j,
and
• t is the number of attributes that equal 0 for both objects i and j.
• The total number of attributes is p, where p = q + r + s + t.
CONTINGENCY TABLE for objects i and j

                      Object j
                      1        0        sum
Object i    1         q        r        q + r
            0         s        t        s + t
            sum       q + s    r + t    p
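The four counts of the contingency table can be tallied directly from two binary vectors. A minimal sketch follows; the function name `contingency` and the two sample vectors are assumptions chosen for illustration.

```python
def contingency(obj_i, obj_j):
    """Return (q, r, s, t) for two equal-length sequences of 0/1 values."""
    pairs = list(zip(obj_i, obj_j))
    q = sum(1 for a, b in pairs if a == 1 and b == 1)  # 1 for both objects
    r = sum(1 for a, b in pairs if a == 1 and b == 0)  # 1 for i, 0 for j
    s = sum(1 for a, b in pairs if a == 0 and b == 1)  # 0 for i, 1 for j
    t = sum(1 for a, b in pairs if a == 0 and b == 0)  # 0 for both objects
    return q, r, s, t


# Hypothetical pair of objects described by p = 5 binary attributes.
q, r, s, t = contingency([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(q, r, s, t, "p =", q + r + s + t)
```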
BINARY ATTRIBUTE
• Distance measure for symmetric binary variables
(if the outcomes of both states are equally valuable):

d(i,j) = (r+s) / (q+r+s+t)

Example of a symmetric attribute: gender (female or male)
• Distance measure for asymmetric binary variables
(if the outcomes of the states are not equally important; the number of negative matches, t, is considered unimportant and is left out):

d(i,j) = (r+s) / (q+r+s)

Example of an asymmetric attribute: a medical test (a positive result is more important)
• Jaccard coefficient (similarity measure for asymmetric binary variables):

sim(i,j) = q / (q+r+s) = 1 − d(i,j)
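The three measures differ only in whether t enters the denominator, which a small sketch makes visible. The function names and the sample counts are hypothetical, and the asymmetric functions assume q + r + s > 0.

```python
def symmetric_distance(q, r, s, t):
    """Symmetric binary distance: negative matches t count in the denominator."""
    return (r + s) / (q + r + s + t)


def asymmetric_distance(q, r, s, t):
    """Asymmetric binary distance: t is accepted for a uniform signature but ignored."""
    return (r + s) / (q + r + s)


def jaccard(q, r, s, t):
    """Jaccard coefficient, a similarity: equals 1 - asymmetric_distance."""
    return q / (q + r + s)


# Hypothetical counts with many negative matches (t = 6).
counts = (2, 1, 1, 6)
print(symmetric_distance(*counts))   # t makes the objects look closer
print(asymmetric_distance(*counts))  # t is left out
print(jaccard(*counts))
```

With many shared zeros the symmetric distance shrinks, while the asymmetric distance is unaffected, which is the reason for ignoring t when a 0 carries little information.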


BINARY ATTRIBUTES
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       Y       N        N        N        N

• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be set to 1, and the value N be set to 0

Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   1       0       1        0        0        0
Mary   1       0       1        0        1        0
Jim    1       1       0        0        0        0

Using d(i,j) = (r+s) / (q+r+s) on the asymmetric attributes only:

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
BINARY ATTRIBUTES
Let’s find the distance between Jack and Mary.

Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   1       0       1        0        0        0
Mary   1       0       1        0        1        0
Jim    1       1       0        0        0        0

1. Tabulate the contingency table for Jack (object i) and Mary (object j).

Jack \ Mary   1        0
1             q = 2    r = 0
0             s = 1    t = 3

2. Apply the formula d(i,j) = (r+s) / (q+r+s):

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33

Repeat to get d(Jack, Jim) and d(Mary, Jim).
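The repetition for the other two pairs can be automated. The sketch below assumes the 0/1 encoding from the table (Y and P as 1, N as 0) and uses a hypothetical helper `asymmetric_distance` that counts q, r, and s in one pass.

```python
# Asymmetric attributes only: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}


def asymmetric_distance(obj_i, obj_j):
    """d(i, j) = (r + s) / (q + r + s); negative matches (t) are ignored."""
    q = r = s = 0
    for a, b in zip(obj_i, obj_j):
        if a == 1 and b == 1:
            q += 1
        elif a == 1 and b == 0:
            r += 1
        elif a == 0 and b == 1:
            s += 1
    return (r + s) / (q + r + s)


for i, j in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    print(f"d({i}, {j}) = {asymmetric_distance(patients[i], patients[j]):.2f}")
```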
DISTANCE ON NUMERIC ATTRIBUTE
• The most popular distance measure is the Euclidean distance
• Let i = (xi1, xi2, . . . , xip) and j = (xj1, xj2, . . . , xjp) be two objects described by p numeric attributes.
• The Euclidean distance between objects i and j is defined as

\[
d(i,j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{ip}-x_{jp})^2}
\]

• Manhattan (or city block) distance:

\[
d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|
\]

• Minkowski distance (a popular distance measure):

\[
d(i,j) = \sqrt[h]{|x_{i1}-x_{j1}|^h + |x_{i2}-x_{j2}|^h + \cdots + |x_{ip}-x_{jp}|^h}
\]

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (h = 1 gives the Manhattan distance, h = 2 gives the Euclidean distance)

Properties
◦ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
◦ d(i, j) = d(j, i) (Symmetry)
◦ d(i, j) ≤ d(i, k) + d(k, j) (Triangle inequality)
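Since the Manhattan and Euclidean distances are the Minkowski distance of order 1 and 2, one function covers all three. This is a minimal sketch; the function name `minkowski` and the sample points are chosen for illustration only, and h is assumed to be at least 1.

```python
def minkowski(obj_i, obj_j, h):
    """Minkowski distance of order h between two equal-length numeric tuples."""
    return sum(abs(a - b) ** h for a, b in zip(obj_i, obj_j)) ** (1 / h)


# Hypothetical sample points with p = 2 numeric attributes.
i, j, k = (1, 2), (3, 5), (2, 0)

manhattan = minkowski(i, j, 1)  # |1 - 3| + |2 - 5|
euclidean = minkowski(i, j, 2)  # sqrt((1 - 3)^2 + (2 - 5)^2)
print(manhattan, euclidean)

# Spot-check of the metric properties for these points (h = 2).
print(minkowski(i, i, 2) == 0)                                     # d(i, i) = 0
print(minkowski(i, j, 2) == minkowski(j, i, 2))                    # symmetry
print(minkowski(i, j, 2) <= minkowski(i, k, 2) + minkowski(k, j, 2))  # triangle inequality
```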
NUMERIC ATTRIBUTE
Data Matrix

point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5

[Figure: scatter plot of the four points x1, x2, x3, and x4]

Dissimilarity Matrix (with Euclidean Distance)

        x1      x2      x3      x4
x1      0
x2      3.61    0
x3      ??      5.1     0
x4      4.24    ??      5.39    0

Example: d(x2, x1) = √((3 − 1)² + (5 − 2)²) = √13 ≈ 3.61

Find d(x3,x1) and d(x4,x2).
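One way to check the exercise is to compute the whole matrix from the data matrix. The sketch below is illustrative; the dictionary layout and the helper name `euclidean` are assumptions, and values are rounded to two decimals as on the slide.

```python
import math

# Data matrix from this example: point -> (attribute1, attribute2).
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Print the lower-triangular dissimilarity matrix row by row.
for row, name_i in enumerate(names):
    values = [round(euclidean(points[name_i], points[name_j]), 2)
              for name_j in names[:row + 1]]
    print(name_i, values)
```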


ORDINAL ATTRIBUTE
An ordinal variable can be discrete or continuous
Order is important, e.g., rank, can be treated like interval-scaled
Grade (e.g., A+, A, A-, B+, B, B-, C+, C, C-, D+, D, E, F)

• Suppose that f is an attribute from a set of ordinal attributes describing n objects.


The dissimilarity computation with respect to f involves the following steps:
• The value of f for the ith object is xif , and f has Mf ordered states, representing the ranking
1, . . . , Mf .
• Replace each xif by its rank rif ∈ {1, . . . , Mf}
• Since each ordinal attribute can have a different number of states, it is often necessary to map the range of each attribute onto [0.0, 1.0] so that each attribute has equal weight. We perform such data normalization by replacing the rank rif of the ith object in the fth attribute by

\[
z_{if} = \frac{r_{if} - 1}{M_f - 1}
\]

• Compute the dissimilarity using methods for interval-scaled (numeric) variables
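The rank-and-normalize steps can be sketched as a single helper. The name `normalize_ranks` and the four-state "size" attribute are hypothetical, and the attribute is assumed to have at least two states so that Mf − 1 is not zero.

```python
def normalize_ranks(values, ordered_states):
    """Map ordinal values to z = (r - 1) / (M - 1), where r is the 1-based rank."""
    M = len(ordered_states)
    rank = {state: position + 1 for position, state in enumerate(ordered_states)}
    return [(rank[value] - 1) / (M - 1) for value in values]


# Hypothetical ordinal attribute with M = 4 states, listed in increasing order.
sizes = ["small", "medium", "large", "extra-large"]
z = normalize_ranks(["medium", "extra-large", "small"], sizes)
print(z)
```

After this mapping the lowest state is always 0 and the highest is always 1, so attributes with different numbers of states contribute on the same scale.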


ORDINAL ATTRIBUTE

Object identifier   Lab test-2 (ordinal)   Rank (increasing order)   Normalized rank within [0,1]
1                   excellent              3                         (3 − 1) / (3 − 1) = 1
2                   fair                   1                         (1 − 1) / (3 − 1) = 0
3                   good                   2                         (2 − 1) / (3 − 1) = 0.5
4                   excellent              3                         (3 − 1) / (3 − 1) = 1

The normalized rank is computed as

\[
z_{if} = \frac{r_{if} - 1}{M_f - 1}
\]

Find the distances d(2,1), d(3,1), d(4,1), ..., d(4,3) using the Euclidean distance:

\[
d(i,j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{ip}-x_{jp})^2}
\]

\[
\begin{bmatrix}
0 &   &   &   \\
? & 0 &   &   \\
? & ? & 0 &   \\
? & ? & ? & 0
\end{bmatrix}
\]
ORDINAL ATTRIBUTE
Object identifier   Lab test-2 (ordinal)   Rank (increasing order)   Normalized rank within [0,1]
1                   excellent              3                         (3 − 1) / (3 − 1) = 1
2                   fair                   1                         (1 − 1) / (3 − 1) = 0
3                   good                   2                         (2 − 1) / (3 − 1) = 0.5
4                   excellent              3                         (3 − 1) / (3 − 1) = 1

The distance measure is the Euclidean distance, applied to the normalized ranks:

\[
d(i,j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{ip}-x_{jp})^2}
\]

\[
\begin{bmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
d(4,1) & d(4,2) & d(4,3) & 0
\end{bmatrix}
=
\begin{bmatrix}
0   &     &     &   \\
1.0 & 0   &     &   \\
0.5 & 0.5 & 0   &   \\
0   & 1.0 & 0.5 & 0
\end{bmatrix}
\]
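The ordinal example can be reproduced end to end in a few lines. This sketch assumes the ordering fair < good < excellent used in the table; the variable names are illustrative.

```python
states = ["fair", "good", "excellent"]              # increasing order, so M = 3
tests = ["excellent", "fair", "good", "excellent"]  # Lab test-2 for objects 1 to 4

# Step 1: map each value to its rank. Step 2: normalize the rank onto [0, 1].
rank = {state: position + 1 for position, state in enumerate(states)}
z = [(rank[value] - 1) / (len(states) - 1) for value in tests]

# Step 3: with a single attribute the Euclidean distance reduces to |z_i - z_j|.
matrix = [[abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))]
for row in matrix:
    print(row)
```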
COSINE SIMILARITY

• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
• Other vector objects: gene features in micro-arrays, …
• Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and
||d|| is the length of vector d
COSINE SIMILARITY
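cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and
||d|| is the length of vector d (Euclidean norm):

\[
\|d\| = \sqrt{d_1^2 + d_2^2 + \cdots + d_p^2}
\]

Example: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25

||d1|| = √(5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²) = √42 = 6.481
||d2|| = √(3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²) = √17 = 4.12

cos(d1, d2) = 25 / (6.481 × 4.12) ≈ 0.94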
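The same calculation can be written as a short function. This is a sketch rather than a library routine: the name `cosine_similarity` is an assumption, and both vectors are assumed to be non-zero so that the norms are not zero.

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


# Term-frequency vectors of documents 1 and 2 from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))
```

Because only the angle between the vectors matters, a long document and a short one with the same word proportions score close to 1.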
THANK YOU
Shuzlina Abdul Rahman | Sofianita Mutalib | Siti Nur Kamaliah Kamarudin
