You are on page 1of 18

CSC479

Data Mining
Lecture # 4
Non-Euclidean Distances
Non-Euclidean Distances
 Proximity Measure for Nominal Attributes
 Jaccard measure for binary vectors
 Distance measure for Ordinal variables
 Cosine measure = angle between vectors
from the origin to the points in question.
 Dissimilarity Measure for Mixed type
attributes.
 Edit distance = number of inserts and
deletes to change one string into another.
2
Proximity Measure for Nominal Attributes
 The dissimilarity between two objects I
and j having nominal attributes can be
compute based on the ratio of
mismatches pm
d (i, j) 
p
where m is the number of matches and p
is the total number of attributes.
m
sim(i, j) 1 d (i, j) 
p
Example:
3
pm
d (i, j) 
p
4
Jaccard Measure
 A note about Binary variables first
 Symmetric binary variable
• If both states are equally valuable and carry the same weight,
that is, there is no preference on which outcome should be
coded as 0 or 1.
• Like “gender” having the states male and female
 Asymmetric binary variable
• If the outcomes of the states are not equally important, such
as the positive and negative outcomes of a disease test.
• We should code the rarest one by 1 (e.g., HIV positive), and
the other by 0 (HIV negative).
 Given two asymmetric binary variables, the agreement
of two 1s (a positive match) is then considered more
important than that of two 0s (a negative match).
5
Jaccard Measure
 A contingency table for binary data
Object j
1 0 sum
1 a b a b
Object i
0 c d cd
sum a  c b  d p

 Simple matching coefficient (invariant, if the binary


variable is symmetric): d (i, j)  a  bbc
cd
 Jaccard coefficient (noninvariant if the binary variable
is asymmetric): d (i, j)  b  c
a bc
6
Jaccard Measure Example
 Example
Name Fever Cough Test-1 Test-2 Test-3 Test-4
Jack Y N P N N N
Mary Y N P N P N
Jim Y P N N N N

 All attributes are asymmetric binary


 let the values Y and P be set to 1, and the value N be set to 0

1 0 sum
1 a b a b
0 c d cd
sum a  c b  d p

d (i, j)  bc
a bc
sim(i, j) 1 d (i, j)  a
a bc 7
 These measurements suggest that Jim and Mary are unlikely to have
a similar disease because they have the highest dissimilarity value
among the three pairs. Of the three patients, Jack and Mary are the
most likely to have a similar disease.

8
Distance for Ordinal variables
 The rank of the ordinal variable f for the ith object is rif.
Where variable f has Mf ordered states.
 rif Є {1…Mf}
 Since each ordinal variable can have a different number
of states, therefore map the range of each variable onto
[0.0, 1.0], so that each variable has equal weight. This
can be achieved using the following formula.
 For each value rif in ordinal variable f , replace it by zif
rif 1
zif 
M f 1
 After calculating zif , calculate the distance using
Euclidean distance formulas for numeric values.
Example:
9
Distance for Ordinal variables
Example: Find dissimilarity matrix for test-2.

10
• Suppose that we have the sample data shown earlier in Table 2.2, except that this time only the
object-identifier and the continuous ordinal attribute, test-2, are available
• There are three states for test-2: fair, good, and excellent, that is, Mf = 3. For step 1, if we
replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3,
respectively.
• Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
• For step 3, we can use, say, the Euclidean distance which results in the following dissimilarity
matrix:

3
1 rif 1
zif 
2 M f 1
3
• Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4 (i.e., d(2,1)=1.0 and
d.(4,2)= 1.0). This makes intuitive sense since objects 1 and 4 are both excellent. Object 211is
fair, which is at the opposite end of the range of values for test-2.
Dissimilarity for attributes of Mixed Type
 Suppose the data set contains p attributes of mixed type.
The dissimilarity between object i and j is
 p
 ij( f ) d ij( f )
d (i, j)  f 1
Where
 p
f 1  (f)
ij
 (f)
ij 0 if (i) xif or xjf is missing or
(ii) xif =xjf =0 and attribute is symmetric binary
 (f)
ij 1 otherwise
(f)
xif  x jf
 If f is numeric: d ij  where h runs
max h xhf  min h xhf
over all the nonmissing objects for attribute f.
 If f is nominal or binary: ij  0 if xif  x jf ; otherwise 1.
(f)
d
12
Dissimilarity for attributes of Mixed Type
 If f is ordinal: compute the ranks rif and z  rif 1 , and
treat Zif as numeric.
if
M f 1

13
Dissimilarity for attributes of Mixed Type

14
Dissimilarity for attributes of Mixed Type

15
Cosine Measure
 Think of a point as a vector from the origin
(0,0,…,0) to its location.
 Two points’ vectors make an angle, whose
cosine is the normalized dot-product of the
vectors. sim(i, j)  x t .y
x. y

Where x and y are column vectors


Example:
Find cosine similarity between x and y where


2 

4
   
x 5
  x 2
 
   
   
8
  7
 
16
Edit Distance
 The edit distance of two strings is the
number of inserts and deletes of
characters needed to turn one into the
other.

 Equivalently, d(x,y) = |x| + |y| -2|


LCS(x,y)|.
 LCS = longest common subsequence =
longest string obtained both by deleting
from x and deleting from y.
17
Example
 x = abcde ; y = bcduve.

 LCS(x,y) = bcde.
 D(x,y) = |x| + |y| - 2|LCS(x,y)| =
5 + 6 –2*4 = 3.

18

You might also like