Dissimilarity
Similarity and Dissimilarity
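◼ Similarity
◼ Numerical measure of how alike two data objects are
◼ Higher when objects are more alike
◼ Dissimilarity
◼ Numerical measure of how different two data objects are
◼ Lower when objects are more alike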
◼ In data mining applications such as clustering, outlier
analysis, and nearest-neighbor classification, we
need ways to assess how alike or unalike objects
are in comparison to one another
◼ For example, a store may want to search for clusters
of customer objects, resulting in groups of customers
with similar characteristics (e.g., similar income, area
of residence, and age). Such information can then be
used for marketing.
◼ A cluster is a collection of data objects such that the
objects within a cluster are similar to one another and
dissimilar to the objects in other clusters
◼ Outlier analysis employs clustering-based techniques to
identify potential outliers as objects that are highly
dissimilar to others
◼ Knowledge of object similarities can be used in nearest-
neighbor classification schemes where a given object
(e.g., a patient) is assigned a class label (relating to, say,
a diagnosis) based on its similarity toward other objects in
the model.
Data Matrix
◼ Also called the object-by-attribute structure
◼ Two-mode matrix (rows and columns represent different
entities: objects and attributes)
◼ This structure stores the n data objects in the form
of a relational table, or n-by-p matrix (n objects × p
attributes), i.e., n data points with p dimensions
◼ Each row corresponds to an object and each column
represents an attribute
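◼ In the notation below, f is used to index through the p
attributes, so x_if is the value of attribute f for object i:
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$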
Dissimilarity Matrix
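◼ Also called the object-by-object structure
◼ Used to store dissimilarity values for pairs of
objects
◼ n data points, but registers only the distance
between each pair
◼ A triangular matrix, since d(i, j) = d(j, i) and d(i, i) = 0
◼ One-mode matrix (one kind of entity:
dissimilarities)
$$
\begin{bmatrix}
0      &        &        &        \\
d(2,1) & 0      &        &        \\
d(3,1) & d(3,2) & 0      &        \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
◼ where d(i, j) is the measured dissimilarity or
“difference” between objects i and j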
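The relationship between the two structures can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function names `manhattan` and `dissimilarity_matrix`, the sample rows, and the use of Manhattan distance as d(i, j) are assumptions made for the example. Any dissimilarity function could be passed in instead.

```python
def manhattan(a, b):
    """Sum of absolute attribute differences (an illustrative choice of d)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dissimilarity_matrix(data, d):
    """Return the lower triangle of the n-by-n dissimilarity matrix.

    Row i holds d(i, 0), ..., d(i, i), so the last entry of each row is the
    diagonal value d(i, i). The upper triangle is omitted because
    d(i, j) = d(j, i).
    """
    return [[d(data[i], data[j]) for j in range(i + 1)] for i in range(len(data))]


# Hypothetical data matrix: n = 3 objects (rows), p = 2 attributes (columns).
data = [
    [1, 2],
    [3, 5],
    [2, 0],
]

for row in dissimilarity_matrix(data, manhattan):
    print(row)
```

Storing only the lower triangle keeps n(n + 1)/2 values instead of n², which is why the slide describes the structure as triangular.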
Dissimilarity Matrix (cont)
Measures of similarity (cont)
Proximity Measure for Nominal Attributes
◼ m: number of matches (attributes on which objects i and j have the same value)
◼ p: total number of variables
$$d(i, j) = \frac{p - m}{p}$$
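The mismatch ratio for nominal attributes can be sketched in a few lines of Python. The function name `nominal_dissimilarity` and the two sample objects are hypothetical, chosen only for illustration. The sketch assumes both objects have the same non-zero number of attributes.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                 # total number of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matching attributes
    return (p - m) / p


# Hypothetical objects with p = 3 nominal attributes each.
a = ("red", "small", "round")
b = ("red", "large", "round")

print(nominal_dissimilarity(a, b))      # 2 of the 3 attributes match
print(1 - nominal_dissimilarity(a, b))  # corresponding similarity, m / p
```

The second print uses the common convention sim(i, j) = 1 − d(i, j), which for this measure equals m / p.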
Proximity Measure for Nominal Attributes (cont)
Solution:
◼ For the single nominal attribute, test-1, we set p = 1, so d(i, j) = 0
if objects i and j match on test-1 and d(i, j) = 1 otherwise
Proximity Measure for Nominal Attributes (cont)
$$d(i, j) = \frac{p - m}{p}$$
Proximity Measure for Binary Measures
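Binary Attribute: Dissimilarity and Similarity
Asymmetric Binary
$$d(i, j) = \frac{r + s}{q + r + s} = \frac{M_{10} + M_{01}}{M_{11} + M_{10} + M_{01}}$$
$$sim(i, j) = \frac{q}{q + r + s} = \frac{M_{11}}{M_{11} + M_{10} + M_{01}}$$
◼ q = M11: number of attributes equal to 1 for both objects i and j
◼ r = M10: number of attributes equal to 1 for object i and 0 for object j
◼ s = M01: number of attributes equal to 0 for object i and 1 for object j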
Dissimilarity between Binary Variables
◼ Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      Y      N       N       N       N
◼ Gender is a symmetric attribute
◼ The remaining attributes are asymmetric binary
◼ Let the values Y and P be 1, and the value N be 0
◼ Distance between objects (patients) is computed based only on
the asymmetric attributes.
$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
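The three patient distances can be reproduced with a short Python sketch. The function name `asymmetric_binary_dissimilarity` and the dictionary layout are assumptions made for illustration. The 0/1 vectors encode Fever, Cough, and Test-1 to Test-4 with Y and P as 1 and N as 0, and Gender is left out because it is symmetric. The sketch assumes at least one attribute is 1 in one of the two objects, otherwise the denominator is zero.

```python
def asymmetric_binary_dissimilarity(x, y):
    """d = (r + s) / (q + r + s); attributes that are 0 in both objects are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # 1 in both objects
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 in x only
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 1 in y only
    return (r + s) / (q + r + s)


# Attribute order: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
patients = {
    "jack": [1, 0, 1, 0, 0, 0],
    "mary": [1, 0, 1, 0, 1, 0],
    "jim":  [1, 1, 0, 0, 0, 0],
}

for a, b in [("jack", "mary"), ("jack", "jim"), ("jim", "mary")]:
    d = asymmetric_binary_dissimilarity(patients[a], patients[b])
    print(f"d({a}, {b}) = {d:.2f}")
```

On these values Jim and Mary are the least alike of the three pairs, and Jack and Mary the most alike.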
Dissimilarity of Numeric Data
◼ Work an example
Properties of Euclidean and Manhattan
distance measures
◼ Both the Euclidean and the Manhattan distance
satisfy the following mathematical properties:
◼ Non-negativity: d(i, j) ≥ 0. Distance is a non-negative
number.
◼ Identity of indiscernibles: d(i, i) = 0. The distance
of an object to itself is 0.
◼ Symmetry: d(i, j) = d(j, i). Distance is a symmetric
function.
◼ Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j). Going
directly from object i to object j in space is no more
than making a detour over any other object k.
◼ A measure that satisfies these conditions is
known as a metric.
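The four properties can be spot-checked numerically. The sketch below is illustrative only: the helper name `satisfies_metric_properties`, the tolerance, and the sample points are assumptions. Passing the check on a handful of points is a sanity test, not a proof that a measure is a metric.

```python
from itertools import product
from math import sqrt


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def satisfies_metric_properties(points, d, tol=1e-12):
    """Check the four properties on every triple (i, j, k) of the given points."""
    for i, j, k in product(points, repeat=3):
        if d(i, j) < 0:                        # non-negativity
            return False
        if abs(d(i, i)) > tol:                 # d(i, i) = 0
            return False
        if abs(d(i, j) - d(j, i)) > tol:       # symmetry
            return False
        if d(i, j) > d(i, k) + d(k, j) + tol:  # triangle inequality
            return False
    return True


# Hypothetical sample points with two numeric attributes each.
points = [(1, 2), (3, 5), (2, 0), (4, 5)]

print(satisfies_metric_properties(points, euclidean))
print(satisfies_metric_properties(points, manhattan))
```

A measure such as squared Euclidean distance fails this check on suitable points because it violates the triangle inequality, which is one way to see why the properties matter.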
Minkowski distance
◼ Special case h → ∞, also known as
◼ Lmax norm
◼ L∞ norm
◼ Chebyshev distance
◼ It is the limit of the Minkowski distance
as h → ∞
◼ Computation: find the attribute f that gives the
maximum difference in values between the two objects
◼ Thus it is the maximum difference between any component
(attribute) of the vectors
◼ It is defined as:
$$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^{h} \right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$$
Example: Minkowski Distance
◼ Compute the dissimilarity among the data objects
(x1, x2, x3 and x4), each described by attribute 1 and attribute 2
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5
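One way to work this example is the Python sketch below. It assumes the standard Minkowski definition d(i, j) = (Σ_f |x_if − x_jf|^h)^(1/h), with h = 1 giving the Manhattan distance and h = 2 the Euclidean distance, plus the maximum-difference form for h → ∞. The function names `minkowski` and `supremum` are illustrative, not taken from the slides.

```python
def minkowski(a, b, h):
    """Minkowski distance of order h (h = 1: Manhattan, h = 2: Euclidean)."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)


def supremum(a, b):
    """Limiting case h -> infinity: the largest single-attribute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

measures = [
    ("Manhattan (h = 1)", lambda a, b: minkowski(a, b, 1)),
    ("Euclidean (h = 2)", lambda a, b: minkowski(a, b, 2)),
    ("Supremum (h -> inf)", supremum),
]

for label, d in measures:
    print(label)
    for i, ni in enumerate(names):
        # Lower triangle only, since d(i, j) = d(j, i).
        row = [round(d(points[ni], points[nj]), 2) for nj in names[: i + 1]]
        print(f"  {ni}: {row}")
```

Each printed block is the lower triangle of a dissimilarity matrix for the four points, one block per value of h.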
Cosine Similarity
◼ A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document
Cosine Similarity (cont)
◼ Cosine measure: if d1 and d2 are two vectors, then
$$cossim(d_1, d_2) = \frac{d_1 \bullet d_2}{\|d_1\| \, \|d_2\|}$$
where • indicates the vector dot product and ||d|| is the length of vector d,
i.e., the Euclidean norm of the vector
Example: Cosine Similarity
◼ Exercise: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5
= 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5
= 4.12
Hence, cossim(d1, d2) = 25 / (6.481 × 4.12) = 0.94
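The arithmetic above can be checked with a small Python sketch. The function name `cosine_similarity` is an illustrative choice. The sketch assumes both vectors have the same length and that neither is the zero vector.

```python
from math import sqrt


def cosine_similarity(a, b):
    """cossim(a, b) = (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))  # Euclidean norm of a
    norm_b = sqrt(sum(y * y for y in b))  # Euclidean norm of b
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

Because only the angle between the vectors matters, scaling either document vector by a positive constant leaves the result unchanged.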
Wait for Assignment
Assignment
2. Compute the dissimilarity among the data objects (a, b, c
and d), each described by Attribute 1 and Attribute 2