Professional Documents
Culture Documents
Data Mining
Lecture # 4
Non-Euclidean Distances
Non-Euclidean Distances
Proximity Measure for Nominal Attributes
Jaccard measure for binary vectors
Distance measure for Ordinal variables
Cosine measure = angle between vectors
from the origin to the points in question.
Dissimilarity Measure for Mixed type
attributes.
Edit distance = number of inserts and
deletes to change one string into another.
2
Proximity Measure for Nominal Attributes
The dissimilarity between two objects I
and j having nominal attributes can be
compute based on the ratio of
mismatches pm
d (i, j)
p
where m is the number of matches and p
is the total number of attributes.
m
sim(i, j) 1 d (i, j)
p
Example:
3
pm
d (i, j)
p
4
Jaccard Measure
A note about Binary variables first
Symmetric binary variable
• If both states are equally valuable and carry the same weight,
that is, there is no preference on which outcome should be
coded as 0 or 1.
• Like “gender” having the states male and female
Asymmetric binary variable
• If the outcomes of the states are not equally important, such
as the positive and negative outcomes of a disease test.
• We should code the rarest one by 1 (e.g., HIV positive), and
the other by 0 (HIV negative).
Given two asymmetric binary variables, the agreement
of two 1s (a positive match) is then considered more
important than that of two 0s (a negative match).
5
Jaccard Measure
A contingency table for binary data
Object j
1 0 sum
1 a b a b
Object i
0 c d cd
sum a c b d p
1 0 sum
1 a b a b
0 c d cd
sum a c b d p
d (i, j) bc
a bc
sim(i, j) 1 d (i, j) a
a bc 7
These measurements suggest that Jim and Mary are unlikely to have
a similar disease because they have the highest dissimilarity value
among the three pairs. Of the three patients, Jack and Mary are the
most likely to have a similar disease.
8
Distance for Ordinal variables
The rank of the ordinal variable f for the ith object is rif.
Where variable f has Mf ordered states.
rif Є {1…Mf}
Since each ordinal variable can have a different number
of states, therefore map the range of each variable onto
[0.0, 1.0], so that each variable has equal weight. This
can be achieved using the following formula.
For each value rif in ordinal variable f , replace it by zif
rif 1
zif
M f 1
After calculating zif , calculate the distance using
Euclidean distance formulas for numeric values.
Example:
9
Distance for Ordinal variables
Example: Find dissimilarity matrix for test-2.
10
• Suppose that we have the sample data shown earlier in Table 2.2, except that this time only the
object-identifier and the continuous ordinal attribute, test-2, are available
• There are three states for test-2: fair, good, and excellent, that is, Mf = 3. For step 1, if we
replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3,
respectively.
• Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
• For step 3, we can use, say, the Euclidean distance which results in the following dissimilarity
matrix:
3
1 rif 1
zif
2 M f 1
3
• Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4 (i.e., d(2,1)=1.0 and
d.(4,2)= 1.0). This makes intuitive sense since objects 1 and 4 are both excellent. Object 211is
fair, which is at the opposite end of the range of values for test-2.
Dissimilarity for attributes of Mixed Type
Suppose the data set contains p attributes of mixed type.
The dissimilarity between object i and j is
p
ij( f ) d ij( f )
d (i, j) f 1
Where
p
f 1 (f)
ij
(f)
ij 0 if (i) xif or xjf is missing or
(ii) xif =xjf =0 and attribute is symmetric binary
(f)
ij 1 otherwise
(f)
xif x jf
If f is numeric: d ij where h runs
max h xhf min h xhf
over all the nonmissing objects for attribute f.
If f is nominal or binary: ij 0 if xif x jf ; otherwise 1.
(f)
d
12
Dissimilarity for attributes of Mixed Type
If f is ordinal: compute the ranks rif and z rif 1 , and
treat Zif as numeric.
if
M f 1
13
Dissimilarity for attributes of Mixed Type
14
Dissimilarity for attributes of Mixed Type
15
Cosine Measure
Think of a point as a vector from the origin
(0,0,…,0) to its location.
Two points’ vectors make an angle, whose
cosine is the normalized dot-product of the
vectors. sim(i, j) x t .y
x. y
LCS(x,y) = bcde.
D(x,y) = |x| + |y| - 2|LCS(x,y)| =
5 + 6 –2*4 = 3.
18