
Measuring Data Similarity and Dissimilarity
Similarity and Dissimilarity
◼ Similarity
◼ Numerical measure of how alike two data objects are
◼ Value is higher when objects are more alike
◼ Often falls in the range [0, 1]
◼ Dissimilarity (e.g., distance)
◼ Numerical measure of how different two data objects are
◼ Lower when objects are more alike
◼ Minimum dissimilarity is often 0
◼ Upper limit varies
◼ Proximity refers to either a similarity or a dissimilarity
◼ In data mining applications such as clustering, outlier
analysis, and nearest-neighbor classification, we need
ways to assess how alike or unalike objects are in
comparison to one another
◼ For example, a store may want to search for clusters
of customer objects, resulting in groups of customers
with similar characteristics (e.g., similar income, area
of residence, and age). Such information can then be
used for marketing.

◼ A cluster is a collection of data objects such that the
objects within a cluster are similar to one another and
dissimilar to the objects in other clusters
◼ Outlier analysis employs clustering-based techniques to
identify potential outliers as objects that are highly
dissimilar to others
◼ Knowledge of object similarities can be used in
nearest-neighbor classification schemes, where a given
object (e.g., a patient) is assigned a class label (relating
to, say, a diagnosis) based on its similarity to other
objects in the model.

Data Matrix
◼ Also called an object-by-attribute structure
◼ A two-mode matrix
◼ This structure stores the n data objects in the form of a
relational table, or n-by-p matrix (n objects × p attributes),
i.e., n data points with p dimensions
◼ Each row corresponds to an object and each column
represents an attribute
◼ In the notation below, f is used to index through the p
attributes:

    | x11  ...  x1f  ...  x1p |
    | ...  ...  ...  ...  ... |
    | xi1  ...  xif  ...  xip |
    | ...  ...  ...  ...  ... |
    | xn1  ...  xnf  ...  xnp |
Dissimilarity Matrix
◼ Also called an object-by-object structure
◼ Used to store dissimilarity values for pairs of objects
◼ Covers all n data points, but registers only the distances
between pairs of objects
◼ Often represented as a triangular matrix
◼ A single-mode matrix: it contains one kind of entity
(dissimilarities)
◼ d(i, j) denotes the measured dissimilarity or
"difference" between objects i and j
Dissimilarity Matrix (cont)
◼ In general, d(i, j) is a non-negative number that is
close to 0 when objects i and j are highly similar or
"near" each other, and becomes larger the more they
differ.
◼ Note:
◼ d(i, i) = 0; that is, the difference between an object
and itself is 0. Furthermore,
◼ d(i, j) = d(j, i)

    |   0                            |
    | d(2,1)    0                    |
    | d(3,1)  d(3,2)    0            |
    |   :       :       :            |
    | d(n,1)  d(n,2)   ...   ...  0  |
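A dissimilarity matrix can be computed directly from a data matrix. The sketch below uses a small illustrative data set and the sum of absolute attribute differences (the Manhattan distance, introduced later) as the pairwise measure; the function names are ours, and any dissimilarity function can be substituted:

```python
# Build an n-by-n dissimilarity matrix from an n-by-p data matrix.

def manhattan(x, y):
    # stand-in pairwise dissimilarity: sum of absolute attribute differences
    return sum(abs(a - b) for a, b in zip(x, y))

def dissimilarity_matrix(data, d):
    """Return the full matrix with d(i, j) for every pair of objects."""
    n = len(data)
    return [[d(data[i], data[j]) for j in range(n)] for i in range(n)]

data = [[1, 2], [3, 5], [2, 0]]   # n = 3 objects, p = 2 attributes
D = dissimilarity_matrix(data, manhattan)
# As the slides note, D[i][i] == 0 and D[i][j] == D[j][i],
# so only the lower triangle needs to be stored in practice.
```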
Measures of similarity
◼ Measures of similarity can often be expressed as
a function of measures of dissimilarity
◼ For example, for nominal data,

    sim(i, j) = 1 − d(i, j)

◼ where sim(i, j) is the similarity between objects i and j.
Measures of similarity (cont)
◼ Many clustering and nearest-neighbor algorithms
operate on a dissimilarity matrix
◼ Data in the form of a data matrix can be
transformed into a dissimilarity matrix before
applying such algorithms.
Proximity Measure for Nominal Attributes
◼ A nominal attribute can take 2 or more states, e.g.,
Color = {red, yellow, blue, green} (a generalization of a
binary attribute)
◼ The dissimilarity between two objects i and j can
be computed based on the ratio of mismatches
◼ Method 1: Simple matching
◼ m: # of matches (i.e., the number of attributes for which i
and j are in the same state)
◼ p: total # of variables

    d(i, j) = (p − m) / p
Proximity Measure for Nominal Attributes (cont)
◼ Method 2: Use a large number of binary attributes
◼ Create a new binary attribute for each of the
M nominal states
Proximity Measure for Nominal Attributes (cont)
◼ Example: Compute the dissimilarity using the
nominal data below

Solution:
◼ For the nominal attribute, test-1, we set p = 1,
so that d(i, j) evaluates to 0 if objects i and j
match, and 1 if the objects differ
Proximity Measure for Nominal Attributes (cont)

    d(i, j) = (p − m) / p

◼ From the above, all objects are dissimilar except
objects 1 and 4.
◼ For a pair of objects with p = 6 attributes and m = 1 match:

    d(i, j) = (p − m) / p = (6 − 1) / 6 = 5/6

    sim(i, j) = 1 − d(i, j) = m / p = 1/6
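The simple-matching computation can be sketched as follows; the two objects and their attribute values here are illustrative, not taken from the slide's table:

```python
# Simple-matching dissimilarity for nominal attributes: d(i, j) = (p - m) / p,
# where m is the number of matching attributes out of p.

def nominal_dissim(i, j):
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)   # count matching states
    return (p - m) / p

# Two hypothetical objects described by p = 6 nominal attributes,
# matching in exactly one state (the third attribute):
obj_i = ["red",  "S", "A", "cash", "yes", "urban"]
obj_j = ["blue", "M", "A", "card", "no",  "rural"]

d = nominal_dissim(obj_i, obj_j)   # (6 - 1) / 6 = 5/6
sim = 1 - d                        # 1/6
```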
Proximity Measure for Binary Attributes
◼ A contingency table for binary data (rows: object i,
columns: object j):

          1        0
    1   q (1,1)  r (1,0)
    0   s (0,1)  t (0,0)

◼ Distance measure for symmetric binary variables:

    d(i, j) = (r + s) / (q + r + s + t)

◼ Distance measure for asymmetric binary variables:

    d(i, j) = (r + s) / (q + r + s)

◼ Jaccard coefficient (similarity measure for asymmetric
binary variables), where M11 counts the attributes that
are 1 in both vectors and M01, M10 count the mismatches:

    Js(i, j) = M11 / (M01 + M10 + M11)
Proximity Measure for Binary Attributes (cont)
◼ Summary of the measures, using the contingency-table
counts:

          1        0
    1   q (1,1)  r (1,0)
    0   s (0,1)  t (0,0)

◼ Similarity, symmetric binary:   sim(i, j) = (q + t) / (q + r + s + t)
◼ Similarity, asymmetric binary:  sim(i, j) = q / (q + r + s)
Proximity Measure for Binary Attributes (cont)

          1        0
    1   q (1,1)  r (1,0)
    0   s (0,1)  t (0,0)

◼ Symmetric binary:

    d(i, j)   = (r + s) / (q + r + s + t) = (M10 + M01) / (M11 + M10 + M01 + M00)

    sim(i, j) = (q + t) / (q + r + s + t) = (M11 + M00) / (M11 + M10 + M01 + M00)

◼ Asymmetric binary:

    d(i, j)   = (r + s) / (q + r + s) = (M10 + M01) / (M11 + M10 + M01)

    sim(i, j) = q / (q + r + s) = M11 / (M11 + M10 + M01)
Dissimilarity between Binary Variables
◼ Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
◼ Gender is a symmetric attribute
◼ The remaining attributes are asymmetric binary
◼ Let the values Y and P be 1, and the value N be 0
◼ Distance between objects (patients) is computed based only on
the asymmetric attributes:

    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
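The three distances above can be reproduced with a short sketch, coding Y and P as 1 and N as 0 and keeping only the six asymmetric attributes (gender, being symmetric, is excluded as the slide states):

```python
# Asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s),
# which ignores the 0-0 matches (t).

def asym_binary_dissim(i, j):
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # 1-1 matches
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # i=1, j=0
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # i=0, j=1
    return (r + s) / (q + r + s)

# fever, cough, test-1, test-2, test-3, test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

round(asym_binary_dissim(jack, mary), 2)   # 0.33
round(asym_binary_dissim(jack, jim), 2)    # 0.67
round(asym_binary_dissim(jim, mary), 2)    # 0.75
```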
Dissimilarity of Numeric Data
◼ Distance measures are used for computing the
dissimilarity of objects described by numeric
attributes
◼ Common distance measures:
◼ Euclidean
◼ Manhattan, and
◼ Minkowski distances
◼ In some cases, the data are normalized before
applying distance calculations
◼ This involves transforming the data to fall within
a smaller or common range, such as [-1, 1] or
[0.0, 1.0].
Euclidean distance (L2 norm)
◼ The most popular distance measure
◼ Corresponds to straight-line distance
◼ Given i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) as two
objects described by p numeric attributes
◼ The Euclidean distance between i and j is:

    d(i, j) = sqrt( (xi1 − xj1)^2 + (xi2 − xj2)^2 + ... + (xip − xjp)^2 )

◼ OR, equivalently:

    d(i, j) = ( |xi1 − xj1|^2 + |xi2 − xj2|^2 + ... + |xip − xjp|^2 )^(1/2)

◼ Work an example
Manhattan (or city block) distance (L1 norm)
◼ Named so because it is the distance in blocks
between any two points in a city
◼ Given i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) as two
objects described by p numeric attributes
◼ The Manhattan distance between i and j is:

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

◼ Work an example
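The Euclidean and Manhattan distances above can be sketched as follows; the two points are illustrative:

```python
import math

# Euclidean (L2) and Manhattan (L1) distances between two
# p-dimensional objects given as sequences of numeric attributes.

def euclidean(i, j):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan(i, j):
    return sum(abs(a - b) for a, b in zip(i, j))

i, j = (1, 2), (3, 5)
euclidean(i, j)   # sqrt(2^2 + 3^2) = sqrt(13) ≈ 3.61
manhattan(i, j)   # |1 - 3| + |2 - 5| = 5
```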
Properties of Euclidean and Manhattan
distance measures
◼ Both the Euclidean and the Manhattan distance
satisfy the following mathematical properties:
◼ Non-negativity: d(i, j) ≥ 0: Distance is a non-
negative number.
◼ Identity of indiscernibles: d(i, i) = 0: The distance
of an object to itself is 0
◼ Symmetry: d(i, j) = d(j, i): Distance is a symmetric
function
◼ Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j): Going
directly from object i to object j in space is no more
than making a detour over any other object k
◼ A measure that satisfies these conditions is
known as a metric.
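The four properties can be checked empirically, here for the Manhattan distance on a handful of random 2-D points (a sanity sketch, not a proof):

```python
import itertools
import random

def manhattan(i, j):
    return sum(abs(a - b) for a, b in zip(i, j))

random.seed(0)
pts = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(5)]

for i, j, k in itertools.product(pts, repeat=3):
    assert manhattan(i, j) >= 0                                  # non-negativity
    assert manhattan(i, i) == 0                                  # identity of indiscernibles
    assert manhattan(i, j) == manhattan(j, i)                    # symmetry
    assert manhattan(i, j) <= manhattan(i, k) + manhattan(k, j)  # triangle inequality
```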
Minkowski distance
◼ Minkowski distance is a generalization of the
Euclidean and Manhattan distances.
◼ It is defined as:

    d(i, j) = ( |xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h )^(1/h)

◼ where h is a real number such that h ≥ 1 (Note:
h = p in other books)
◼ h = 1 → L1 norm → Manhattan distance
◼ h = 2 → L2 norm → Euclidean distance
The supremum distance
◼ Also known as:
◼ Lmax norm
◼ L∞ norm
◼ Chebyshev distance
◼ It is a generalization of the Minkowski distance
for h → ∞
◼ Computation: find the attribute f that gives the
maximum difference in values between the two objects
◼ Thus it is the maximum difference between any component
(attribute) of the vectors
◼ It is defined as:

    d(i, j) = lim (h→∞) ( Σf |xif − xjf|^h )^(1/h) = max f |xif − xjf|
Example: Minkowski Distance
◼ Compute the dissimilarity among the data objects
(x1, x2, x3 and x4), each described by attributes 1 and 2:

    point   attribute 1   attribute 2
    x1      1             2
    x2      3             5
    x3      2             0
    x4      4             5
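A sketch that evaluates the Minkowski distance for h = 1 and h = 2, plus the supremum distance, on two of the points from the table (x1 and x2):

```python
# Minkowski distance with parameter h >= 1, and the supremum
# (h -> infinity) distance as its limiting case.

def minkowski(i, j, h):
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

def supremum(i, j):
    # maximum difference over any single attribute
    return max(abs(a - b) for a, b in zip(i, j))

x1, x2 = (1, 2), (3, 5)
minkowski(x1, x2, 1)   # h = 1, Manhattan: |1-3| + |2-5| = 5.0
minkowski(x1, x2, 2)   # h = 2, Euclidean: sqrt(4 + 9) = sqrt(13) ≈ 3.61
supremum(x1, x2)       # max(2, 3) = 3
```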
Cosine Similarity
◼ A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as a keyword) or
phrase in the document
◼ Other vector objects: gene features in micro-arrays, ...
◼ Applications: information retrieval, biological taxonomy, gene feature
mapping, ...
Cosine Similarity (cont)
◼ Cosine measure: if d1 and d2 are two vectors, then

    cos_sim(d1, d2) = (d1 • d2) / (||d1|| ||d2||)

◼ where • indicates the vector dot product and ||d|| is the length
of vector d, i.e., the Euclidean norm of the vector
◼ The cosine similarity measure does not satisfy the
metric properties (non-negativity, identity of
indiscernibles, symmetry, and triangle inequality),
hence it is referred to as a nonmetric measure
Example: Cosine Similarity
◼ Exercise: Find the similarity between documents 1 and 2.

    d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

    d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
    ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 ≈ 6.481
    ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 ≈ 4.123

    cos_sim(d1, d2) = (d1 • d2) / (||d1|| ||d2||)

◼ Hence, cos_sim(d1, d2) ≈ 0.94
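The worked example can be checked with a short sketch:

```python
import math

# Cosine similarity: dot product divided by the product of the
# Euclidean norms of the two vectors.

def cos_sim(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
round(cos_sim(d1, d2), 2)   # 25 / (sqrt(42) * sqrt(17)) ≈ 0.94
```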
Wait for Assignment
Assignment
1. Use the information in the table below to compute the
cosine similarity and dissimilarity of:
   a) Document 1 and document 3
   b) Document 2 and document 4
   c) Document 3 and document 4
Assignment
2. Compute the dissimilarity among the data objects (a, b, c
and d), each described by Attributes 1 and 2:

    point   Attribute 1   Attribute 2
    a       2             7
    b       4             6
    c       6             6
    d       2             0

3. Compute the similarity and dissimilarity of the binary
data in the table below:

    Name     Malaria   Tuberculosis   Stomach Ulcer   Typhoid   High BP
    Mensah   Y         Y              N               Y         Y
    Kofi     N         N              Y               N         Y
    Mike     Y         Y              N               N         Y
    Esi      N         Y              Y               Y         Y
