2.8 It is important to define or select similarity measures in data analysis. However, there is no commonly
accepted subjective similarity measure. Results can vary depending on the similarity measures used.
Nonetheless, seemingly different similarity measures may be equivalent after some transformation.
Suppose we have the following 2-D data set:

        A1     A2
x1      1.5    1.7
x2      2.0    1.9
x3      1.6    1.8
x4      1.2    1.5
x5      1.5    1.0

(a) Consider the data as 2-D data points. Given a new data point, x = (1.4,1.6) as a query, rank the
database points based on similarity with the query using Euclidean distance, Manhattan distance,
supremum distance, and cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean distance on the
transformed data to rank the data points.
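Ans a) Formula for Euclidean distance, for points $x = (a_1, a_2)$ and $y = (b_1, b_2)$:

$$d(x, y) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$$

Therefore, d(x,x1)=0.141
d(x,x2)=0.671
d(x,x3)=0.283
d(x,x4)=0.224
d(x,x5)=0.608
Thus, the rank of the data points based on similarity with x using Euclidean distance, from most to least similar, is
x1, x4, x3, x5, x2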
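
The Euclidean figures above can be checked with a short script. This is only a sketch: the coordinates are the ones that appear in the cosine-similarity arithmetic of this answer, and the names `points`, `query`, and `euclidean` are illustrative choices, not part of the exercise.

```python
import math

# Data points as used in the worked answer (names are illustrative).
points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}
query = (1.4, 1.6)


def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Distance from the query to every point, then sort ascending
# so the most similar (smallest distance) point comes first.
euclid_d = {name: euclidean(query, p) for name, p in points.items()}
euclid_rank = sorted(euclid_d, key=euclid_d.get)

for name in euclid_rank:
    print(f"{name}: {euclid_d[name]:.3f}")
```

Sorting in ascending order of distance puts the most similar point first, which is the convention used for the rankings in this answer.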

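Formula for Manhattan distance, for points $x = (a_1, a_2)$ and $y = (b_1, b_2)$:

$$d(x, y) = |a_1 - b_1| + |a_2 - b_2|$$

Therefore, d(x,x1)=0.2
d(x,x2)=0.9
d(x,x3)=0.4
d(x,x4)=0.3
d(x,x5)=0.7
Thus, the rank of the data points based on similarity with x using Manhattan distance, from most to least similar, is
x1, x4, x3, x5, x2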

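Formula for Supremum distance, for points $x = (a_1, a_2)$ and $y = (b_1, b_2)$:

$$d(x, y) = \max(|a_1 - b_1|, |a_2 - b_2|)$$

Therefore, d(x,x1)=0.1
d(x,x2)=0.6
d(x,x3)=0.2
d(x,x4)=0.2
d(x,x5)=0.6
Thus, the rank of the data points based on similarity with x using Supremum distance, from most to least similar, is
x1, x4, x3, x5, x2, where x4 and x3 are tied at 0.2 and x5 and x2 are tied at 0.6.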
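
As a rough check of the Manhattan and supremum values, the sketch below computes both for the same query. The coordinates are those used in the cosine-similarity arithmetic of this answer; the helper names and the rounding to six decimals (used only so that floating-point noise does not hide the ties) are assumptions made for illustration.

```python
# Data points as used in the worked answer (names are illustrative).
points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}
query = (1.4, 1.6)


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def supremum(p, q):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


# Round so that values such as 0.19999999999999996 and 0.2 compare equal.
manhattan_d = {n: round(manhattan(query, p), 6) for n, p in points.items()}
supremum_d = {n: round(supremum(query, p), 6) for n, p in points.items()}

manhattan_rank = sorted(manhattan_d, key=manhattan_d.get)

# Group points that share the same supremum distance, nearest group first.
levels = sorted(set(supremum_d.values()))
supremum_groups = [[n for n in points if supremum_d[n] == v] for v in levels]

print(manhattan_rank)
print(supremum_groups)
```

Grouping by distance level makes the ties explicit: under supremum distance the order within each tied pair is arbitrary.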
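Cosine similarity:

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$

where $\|x\|$ is the Euclidean norm of vector $x = (a_1, a_2, \ldots, a_n)$, defined as $\sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$.

$$\cos(x, x_1) = \frac{(1.4)(1.5) + (1.6)(1.7)}{\sqrt{(1.4^2 + 1.6^2)(1.5^2 + 1.7^2)}} = \frac{2.1 + 2.72}{\sqrt{23.2328}} = \frac{4.82}{4.82004} = 0.99999$$

$$\cos(x, x_2) = \frac{(1.4)(2) + (1.6)(1.9)}{\sqrt{(1.4^2 + 1.6^2)(2^2 + 1.9^2)}} = \frac{2.8 + 3.04}{\sqrt{34.3972}} = \frac{5.84}{5.86491} = 0.99575$$

$$\cos(x, x_3) = \frac{(1.4)(1.6) + (1.6)(1.8)}{\sqrt{(1.4^2 + 1.6^2)(1.6^2 + 1.8^2)}} = \frac{2.24 + 2.88}{\sqrt{26.216}} = \frac{5.12}{5.12016} = 0.99997$$

$$\cos(x, x_4) = \frac{(1.4)(1.2) + (1.6)(1.5)}{\sqrt{(1.4^2 + 1.6^2)(1.2^2 + 1.5^2)}} = \frac{1.68 + 2.4}{\sqrt{16.6788}} = \frac{4.08}{4.08397} = 0.99903$$

$$\cos(x, x_5) = \frac{(1.4)(1.5) + (1.6)(1.0)}{\sqrt{(1.4^2 + 1.6^2)(1.5^2 + 1.0^2)}} = \frac{2.1 + 1.6}{\sqrt{14.69}} = \frac{3.7}{3.83275} = 0.96536$$

Thus, the rank of the data points based on similarity with x using cosine similarity, from most to least similar, is x1, x3, x4, x2, x5.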
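
The cosine values for x1 and x3 differ only in the fifth decimal place, so a quick numerical check is useful. The following is a minimal sketch using the coordinates from the arithmetic above; `points`, `query`, and `cosine` are illustrative names.

```python
import math

# Data points as used in the worked answer (names are illustrative).
points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}
query = (1.4, 1.6)


def cosine(p, q):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.hypot(*p) * math.hypot(*q))


cos_sim = {name: cosine(query, p) for name, p in points.items()}

# Higher cosine means more similar, so sort in descending order.
cos_rank = sorted(cos_sim, key=cos_sim.get, reverse=True)

for name in cos_rank:
    print(f"{name}: {cos_sim[name]:.5f}")
```

Unlike the distance measures, cosine similarity is sorted in descending order, since a larger value means a smaller angle between the vectors.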

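b) norm(x) = sqrt{(1.4)^2 + (1.6)^2} ~ 2.1260
Normalized x is (1.4/2.1260, 1.6/2.1260) ~ (0.6585, 0.7526)
norm(x1) = sqrt{(1.5)^2 + (1.7)^2} ~ 2.2672
Normalized x1 is (1.5/2.2672, 1.7/2.2672) ~ (0.6616, 0.7498)
norm(x2) = sqrt{(2)^2 + (1.9)^2} ~ 2.7586
Normalized x2 is (2/2.7586, 1.9/2.7586) ~ (0.7250, 0.6888)
norm(x3) = sqrt{(1.6)^2 + (1.8)^2} ~ 2.4083
Normalized x3 is (1.6/2.4083, 1.8/2.4083) ~ (0.6644, 0.7474)
norm(x4) = sqrt{(1.2)^2 + (1.5)^2} ~ 1.9209
Normalized x4 is (1.2/1.9209, 1.5/1.9209) ~ (0.6247, 0.7809)
norm(x5) = sqrt{(1.5)^2 + (1.0)^2} ~ 1.8028
Normalized x5 is (1.5/1.8028, 1.0/1.8028) ~ (0.8321, 0.5547)
Formula for Euclidean distance, for points $x = (a_1, a_2)$ and $y = (b_1, b_2)$:

$$D(x, y) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$$

Applying it to the normalized points (distances computed from the unrounded coordinates):
D(x,x1)=0.0041
D(x,x2)=0.0922
D(x,x3)=0.0078
D(x,x4)=0.0441
D(x,x5)=0.2632
Thus, the rank of the data points based on similarity with x using Euclidean distance in normalized form, from most to least similar, is
x1, x3, x4, x2, x5, which matches the ranking obtained with cosine similarity.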
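
The normalized-distance ranking can be checked numerically. For unit vectors u and v, the squared Euclidean distance equals 2(1 - cos(u, v)), so ranking by this distance must agree with ranking by cosine similarity. The sketch below assumes the coordinates used throughout this answer; `normalize` and the other names are illustrative.

```python
import math

# Data points as used in the worked answer (names are illustrative).
points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}
query = (1.4, 1.6)


def normalize(p):
    """Scale a vector so that its Euclidean norm equals 1."""
    norm = math.hypot(*p)
    return tuple(a / norm for a in p)


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


unit_query = normalize(query)
unit_points = {name: normalize(p) for name, p in points.items()}

norm_d = {n: euclidean(unit_query, p) for n, p in unit_points.items()}
norm_rank = sorted(norm_d, key=norm_d.get)

# For unit vectors the cosine is just the dot product; compare the
# distance with sqrt(2 * (1 - cos)) to confirm the identity.
for name in norm_rank:
    cos = sum(a * b for a, b in zip(unit_query, unit_points[name]))
    print(f"{name}: D={norm_d[name]:.4f}  sqrt(2(1-cos))={math.sqrt(2 * (1 - cos)):.4f}")
```

Because the distance is a decreasing function of the cosine once every vector has unit length, the two measures are equivalent for ranking purposes, which is the point the exercise makes about transformations.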
