
Measuring Data Similarity and Dissimilarity
Similarity and Dissimilarity
◼ Similarity
◼ Numerical measure of how alike two data objects are
◼ Value is higher when objects are more alike
◼ Often falls in the range [0, 1]
◼ Dissimilarity (e.g., distance)
◼ Numerical measure of how different two data objects are
◼ Lower when objects are more alike
◼ Minimum dissimilarity is often 0
◼ Upper limit varies
◼ Proximity refers to either a similarity or a dissimilarity
◼ In data mining applications such as clustering, outlier
analysis, and nearest-neighbor classification, we need
ways to assess how alike or unalike objects are in
comparison to one another
◼ For example, a store may want to search for clusters
of customer objects, resulting in groups of customers
with similar characteristics (e.g., similar income, area
of residence, and age). Such information can then be
used for marketing.

◼ A cluster is a collection of data objects such that the
objects within a cluster are similar to one another and
dissimilar to the objects in other clusters
◼ Outlier analysis employs clustering-based techniques to
identify potential outliers as objects that are highly
dissimilar to others
◼ Knowledge of object similarities can be used in
nearest-neighbor classification schemes, where a given
object (e.g., a patient) is assigned a class label (relating
to, say, a diagnosis) based on its similarity to other
objects in the model.

Data Matrix
◼ Also called an object-by-attribute structure
◼ A two-mode matrix
◼ This structure stores the n data objects in the form of a
relational table, or n-by-p matrix (n objects × p attributes),
i.e., n data points with p dimensions
◼ Each row corresponds to an object and each column
represents an attribute
◼ In the notation below, f is used to index through the p
attributes:

    | x11  ...  x1f  ...  x1p |
    | ...  ...  ...  ...  ... |
    | xi1  ...  xif  ...  xip |
    | ...  ...  ...  ...  ... |
    | xn1  ...  xnf  ...  xnp |
Dissimilarity Matrix
◼ Also called an object-by-object structure
◼ Used to store dissimilarity values for pairs of objects
◼ Covers all n data points, but registers only the distances
between pairs of objects
◼ Often represented as a triangular matrix
◼ A single-mode matrix: it contains one kind of entity
(dissimilarities)
◼ d(i, j) denotes the measured dissimilarity or
"difference" between objects i and j
Dissimilarity Matrix (cont)
◼ In general, d(i, j) is a non-negative number that is
close to 0 when objects i and j are highly similar or
"near" each other, and becomes larger the more they
differ.
◼ Note:
◼ d(i, i) = 0; that is, the difference between an object
and itself is 0. Furthermore,
◼ d(i, j) = d(j, i)

    |   0                            |
    | d(2,1)    0                    |
    | d(3,1)  d(3,2)    0            |
    |   :       :       :            |
    | d(n,1)  d(n,2)   ...   ...  0  |
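A dissimilarity matrix can be computed directly from a data matrix. The sketch below uses a small illustrative data set and the sum of absolute attribute differences (the Manhattan distance, introduced later) as the pairwise measure; the function names are ours, and any dissimilarity function can be substituted:

```python
# Build an n-by-n dissimilarity matrix from an n-by-p data matrix.

def manhattan(x, y):
    # stand-in pairwise dissimilarity: sum of absolute attribute differences
    return sum(abs(a - b) for a, b in zip(x, y))

def dissimilarity_matrix(data, d):
    """Return the full matrix with d(i, j) for every pair of objects."""
    n = len(data)
    return [[d(data[i], data[j]) for j in range(n)] for i in range(n)]

data = [[1, 2], [3, 5], [2, 0]]   # n = 3 objects, p = 2 attributes
D = dissimilarity_matrix(data, manhattan)
# As the slides note, D[i][i] == 0 and D[i][j] == D[j][i],
# so only the lower triangle needs to be stored in practice.
```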
Measures of similarity
◼ Measures of similarity can often be expressed as
a function of measures of dissimilarity
◼ For example, for nominal data,

    sim(i, j) = 1 − d(i, j)

◼ where sim(i, j) is the similarity between objects i and j.
Measures of similarity (cont)
◼ Many clustering and nearest-neighbor algorithms
operate on a dissimilarity matrix
◼ Data in the form of a data matrix can be
transformed into a dissimilarity matrix before
applying such algorithms.
Proximity Measure for Nominal Attributes
◼ A nominal attribute can take 2 or more states, e.g.,
Color = {red, yellow, blue, green} (a generalization of a
binary attribute)
◼ The dissimilarity between two objects i and j can
be computed based on the ratio of mismatches
◼ Method 1: Simple matching
◼ m: # of matches (i.e., the number of attributes for which i
and j are in the same state)
◼ p: total # of variables

    d(i, j) = (p − m) / p
Proximity Measure for Nominal Attributes (cont)
◼ Method 2: Use a large number of binary attributes
◼ Create a new binary attribute for each of the
M nominal states
Proximity Measure for Nominal Attributes (cont)
◼ Example: Compute the dissimilarity using the
nominal data below

Solution:
◼ For the nominal attribute, test-1, we set p = 1,
so that d(i, j) evaluates to 0 if objects i and j
match, and 1 if the objects differ
Proximity Measure for Nominal Attributes (cont)

    d(i, j) = (p − m) / p

◼ From the above, all objects are dissimilar except
objects 1 and 4.
◼ For a pair of objects with p = 6 attributes and m = 1 match:

    d(i, j) = (p − m) / p = (6 − 1) / 6 = 5/6

    sim(i, j) = 1 − d(i, j) = m / p = 1/6
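The simple-matching computation can be sketched as follows; the two objects and their attribute values here are illustrative, not taken from the slide's table:

```python
# Simple-matching dissimilarity for nominal attributes: d(i, j) = (p - m) / p,
# where m is the number of matching attributes out of p.

def nominal_dissim(i, j):
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)   # count matching states
    return (p - m) / p

# Two hypothetical objects described by p = 6 nominal attributes,
# matching in exactly one state (the third attribute):
obj_i = ["red",  "S", "A", "cash", "yes", "urban"]
obj_j = ["blue", "M", "A", "card", "no",  "rural"]

d = nominal_dissim(obj_i, obj_j)   # (6 - 1) / 6 = 5/6
sim = 1 - d                        # 1/6
```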
Proximity Measure for Binary Attributes
◼ A contingency table for binary data (rows: object i,
columns: object j):

          1        0
    1   q (1,1)  r (1,0)
    0   s (0,1)  t (0,0)

◼ Distance measure for symmetric binary variables:

    d(i, j) = (r + s) / (q + r + s + t)

◼ Distance measure for asymmetric binary variables:

    d(i, j) = (r + s) / (q + r + s)

◼ Jaccard coefficient (similarity measure for asymmetric
binary variables), where M11 counts the attributes that
are 1 in both vectors and M01, M10 count the mismatches:

    Js(i, j) = M11 / (M01 + M10 + M11)
Proximity Measure for Binary Attributes (cont)
◼ Summary of the measures, using the contingency-table
counts:

          1        0
    1   q (1,1)  r (1,0)
    0   s (0,1)  t (0,0)

◼ Similarity, symmetric binary:   sim(i, j) = (q + t) / (q + r + s + t)
◼ Similarity, asymmetric binary:  sim(i, j) = q / (q + r + s)
Proximity Measure for Binary Attributes (cont)

          1        0
    1   q (1,1)  r (1,0)
    0   s (0,1)  t (0,0)

◼ Symmetric binary:

    d(i, j)   = (r + s) / (q + r + s + t) = (M10 + M01) / (M11 + M10 + M01 + M00)

    sim(i, j) = (q + t) / (q + r + s + t) = (M11 + M00) / (M11 + M10 + M01 + M00)

◼ Asymmetric binary:

    d(i, j)   = (r + s) / (q + r + s) = (M10 + M01) / (M11 + M10 + M01)

    sim(i, j) = q / (q + r + s) = M11 / (M11 + M10 + M01)
Dissimilarity between Binary Variables
◼ Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
◼ Gender is a symmetric attribute
◼ The remaining attributes are asymmetric binary
◼ Let the values Y and P be 1, and the value N be 0
◼ Distance between objects (patients) is computed based only on
the asymmetric attributes:

    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
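The three distances above can be reproduced with a short sketch, coding Y and P as 1 and N as 0 and keeping only the six asymmetric attributes (gender, being symmetric, is excluded as the slide states):

```python
# Asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s),
# which ignores the 0-0 matches (t).

def asym_binary_dissim(i, j):
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # 1-1 matches
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # i=1, j=0
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # i=0, j=1
    return (r + s) / (q + r + s)

# fever, cough, test-1, test-2, test-3, test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

round(asym_binary_dissim(jack, mary), 2)   # 0.33
round(asym_binary_dissim(jack, jim), 2)    # 0.67
round(asym_binary_dissim(jim, mary), 2)    # 0.75
```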
Dissimilarity of Numeric Data
◼ Distance measures are used for computing the
dissimilarity of objects described by numeric
attributes
◼ Common distance measures:
◼ Euclidean
◼ Manhattan, and
◼ Minkowski distances
◼ In some cases, the data are normalized before
applying distance calculations
◼ This involves transforming the data to fall within
a smaller or common range, such as [-1, 1] or
[0.0, 1.0].
Euclidean distance (L2 norm)
◼ The most popular distance measure
◼ Corresponds to straight-line distance
◼ Given i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) as two
objects described by p numeric attributes
◼ The Euclidean distance between i and j is:

    d(i, j) = sqrt( (xi1 − xj1)^2 + (xi2 − xj2)^2 + ... + (xip − xjp)^2 )

◼ OR, equivalently:

    d(i, j) = ( |xi1 − xj1|^2 + |xi2 − xj2|^2 + ... + |xip − xjp|^2 )^(1/2)

◼ Work an example
Manhattan (or city block) distance (L1 norm)
◼ Named so because it is the distance in blocks
between any two points in a city
◼ Given i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) as two
objects described by p numeric attributes
◼ The Manhattan distance between i and j is:

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

◼ Work an example
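The Euclidean and Manhattan distances above can be sketched as follows; the two points are illustrative:

```python
import math

# Euclidean (L2) and Manhattan (L1) distances between two
# p-dimensional objects given as sequences of numeric attributes.

def euclidean(i, j):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan(i, j):
    return sum(abs(a - b) for a, b in zip(i, j))

i, j = (1, 2), (3, 5)
euclidean(i, j)   # sqrt(2^2 + 3^2) = sqrt(13) ≈ 3.61
manhattan(i, j)   # |1 - 3| + |2 - 5| = 5
```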
Properties of Euclidean and Manhattan
distance measures
◼ Both the Euclidean and the Manhattan distance
satisfy the following mathematical properties:
◼ Non-negativity: d(i, j) ≥ 0: Distance is a non-
negative number.
◼ Identity of indiscernibles: d(i, i) = 0: The distance
of an object to itself is 0
◼ Symmetry: d(i, j) = d(j, i): Distance is a symmetric
function
◼ Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j): Going
directly from object i to object j in space is no more
than making a detour over any other object k
◼ A measure that satisfies these conditions is
known as a metric.
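The four properties can be checked empirically, here for the Manhattan distance on a handful of random 2-D points (a sanity sketch, not a proof):

```python
import itertools
import random

def manhattan(i, j):
    return sum(abs(a - b) for a, b in zip(i, j))

random.seed(0)
pts = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(5)]

for i, j, k in itertools.product(pts, repeat=3):
    assert manhattan(i, j) >= 0                                  # non-negativity
    assert manhattan(i, i) == 0                                  # identity of indiscernibles
    assert manhattan(i, j) == manhattan(j, i)                    # symmetry
    assert manhattan(i, j) <= manhattan(i, k) + manhattan(k, j)  # triangle inequality
```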
Minkowski distance
◼ Minkowski distance is a generalization of the
Euclidean and Manhattan distances.
◼ It is defined as:

    d(i, j) = ( |xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h )^(1/h)

◼ where h is a real number such that h ≥ 1 (Note:
h = p in other books)
◼ h = 1 → L1 norm → Manhattan distance
◼ h = 2 → L2 norm → Euclidean distance
The supremum distance
◼ Also known as:
◼ Lmax norm
◼ L∞ norm
◼ Chebyshev distance
◼ It is a generalization of the Minkowski distance
for h → ∞
◼ Computation: find the attribute f that gives the
maximum difference in values between the two objects
◼ Thus it is the maximum difference between any component
(attribute) of the vectors
◼ It is defined as:

    d(i, j) = lim (h→∞) ( Σf |xif − xjf|^h )^(1/h) = max f |xif − xjf|
Example: Minkowski Distance
◼ Compute the dissimilarity among the data objects
(x1, x2, x3 and x4), each described by attributes 1 and 2:

    point   attribute 1   attribute 2
    x1      1             2
    x2      3             5
    x3      2             0
    x4      4             5
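A sketch that evaluates the Minkowski distance for h = 1 and h = 2, plus the supremum distance, on two of the points from the table (x1 and x2):

```python
# Minkowski distance with parameter h >= 1, and the supremum
# (h -> infinity) distance as its limiting case.

def minkowski(i, j, h):
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

def supremum(i, j):
    # maximum difference over any single attribute
    return max(abs(a - b) for a, b in zip(i, j))

x1, x2 = (1, 2), (3, 5)
minkowski(x1, x2, 1)   # h = 1, Manhattan: |1-3| + |2-5| = 5.0
minkowski(x1, x2, 2)   # h = 2, Euclidean: sqrt(4 + 9) = sqrt(13) ≈ 3.61
supremum(x1, x2)       # max(2, 3) = 3
```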
Cosine Similarity
◼ A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as a keyword) or
phrase in the document
◼ Other vector objects: gene features in micro-arrays, ...
◼ Applications: information retrieval, biological taxonomy, gene feature
mapping, ...
Cosine Similarity (cont)
◼ Cosine measure: if d1 and d2 are two vectors, then

    cos_sim(d1, d2) = (d1 • d2) / (||d1|| ||d2||)

◼ where • indicates the vector dot product and ||d|| is the length
of vector d, i.e., the Euclidean norm of the vector
◼ The cosine similarity measure does not satisfy the
metric properties (non-negativity, identity of
indiscernibles, symmetry, and triangle inequality),
hence it is referred to as a nonmetric measure
Example: Cosine Similarity
◼ Exercise: Find the similarity between documents 1 and 2.

    d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

    d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
    ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 ≈ 6.481
    ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 ≈ 4.123

    cos_sim(d1, d2) = (d1 • d2) / (||d1|| ||d2||)

◼ Hence, cos_sim(d1, d2) ≈ 0.94
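The worked example can be checked with a short sketch:

```python
import math

# Cosine similarity: dot product divided by the product of the
# Euclidean norms of the two vectors.

def cos_sim(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
round(cos_sim(d1, d2), 2)   # 25 / (sqrt(42) * sqrt(17)) ≈ 0.94
```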
Wait for Assignment
Assignment
1. Use the information in the table below to compute the
cosine similarity and dissimilarity of:
   a) Document 1 and document 3
   b) Document 2 and document 4
   c) Document 3 and document 4
Assignment
2. Compute the dissimilarity among the data objects (a, b, c
and d), each described by Attributes 1 and 2:

    point   Attribute 1   Attribute 2
    a       2             7
    b       4             6
    c       6             6
    d       2             0

3. Compute the similarity and dissimilarity of the binary
data in the table below:

    Name     Malaria   Tuberculosis   Stomach Ulcer   Typhoid   High BP
    Mensah   Y         Y              N               Y         Y
    Kofi     N         N              Y               N         Y
    Mike     Y         Y              N               N         Y
    Esi      N         Y              Y               Y         Y
