
1. Compute the Lp-norm between (1,2) and (3,4) for p = 1, 2, ∞ (that is, the Manhattan distance, the Euclidean distance, and the infinity norm).
$$L_1 = \sum_{i=1}^{d} |x_i - y_i| = |x_1 - y_1| + |x_2 - y_2| = |1 - 3| + |2 - 4| = 2 + 2 = 4$$

$$L_2 = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2} = \sqrt{(1 - 3)^2 + (2 - 4)^2} = \sqrt{4 + 4} = 2\sqrt{2}$$

$$L_\infty = \max_i |x_i - y_i| = \max(|1 - 3|, |2 - 4|) = 2$$
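The three distances above can be checked numerically. The following is a minimal sketch; the helper name `lp_distance` is hypothetical and not part of the original solution.

```python
import math

def lp_distance(x, y, p):
    """Return the L_p distance between two equal-length numeric sequences."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # The infinity norm is the largest coordinate-wise difference.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

x, y = (1, 2), (3, 4)
l1 = lp_distance(x, y, 1)            # Manhattan distance
l2 = lp_distance(x, y, 2)            # Euclidean distance
linf = lp_distance(x, y, math.inf)   # infinity norm
print(l1, l2, linf)
```

The printed values should agree with 4, 2√2, and 2 from the hand computation.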

2. Compute the match-based similarity, cosine similarity, and the Jaccard coefficient between the two sets {A,B,C} and {A,C,D,E}. If the measure only applies to numeric data, you can transform the data into numeric first.

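Match-based similarity:
Encode each set as a binary vector over the dimensions A, B, ..., E, and discretize each dimension into a single equidepth bucket with range [0,1]. Therefore, m_i = 1 and n_i = 0 for every dimension.

#    A  B  C  D  E
1    1  1  1  0  0
2    1  0  1  1  1

With only one bucket per dimension, the two records share a bucket in every dimension, so the proximity set is S(X, Y, k_d) = {A,B,C,D,E}.

$$PSelect(\bar{X}, \bar{Y}, k_d) = \left[ \sum_{i \in S(\bar{X}, \bar{Y}, k_d)} \left( 1 - \frac{|x_i - y_i|}{m_i - n_i} \right)^p \right]^{1/p}$$

Dimensions B, D, and E each contribute 1 − |x_i − y_i|/1 = 0, so only the terms for A and C remain.

For p = 1:
$$\left(1 - \frac{|1 - 1|}{1}\right) + \left(1 - \frac{|1 - 1|}{1}\right) = 2$$

For p = 2:
$$\left[ \left(1 - \frac{|1 - 1|}{1}\right)^2 + \left(1 - \frac{|1 - 1|}{1}\right)^2 \right]^{1/2} = \sqrt{2}$$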
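As a numeric check of the match-based similarity, here is a minimal sketch. It assumes, as in the answer above, a single bucket per dimension, so every dimension belongs to the proximity set; the function name `pselect` and its `ranges` argument are hypothetical.

```python
def pselect(x, y, ranges, p):
    """Match-based similarity, assuming one bucket per dimension.

    ranges holds the (n_i, m_i) lower and upper bucket bounds per dimension.
    Because there is a single bucket, no dimension is filtered out.
    """
    total = 0.0
    for xi, yi, (lo, hi) in zip(x, y, ranges):
        total += (1 - abs(xi - yi) / (hi - lo)) ** p
    return total ** (1.0 / p)

set1 = [1, 1, 1, 0, 0]    # {A, B, C} over dimensions A..E
set2 = [1, 0, 1, 1, 1]    # {A, C, D, E}
ranges = [(0, 1)] * 5     # n_i = 0, m_i = 1 for every dimension

sim_p1 = pselect(set1, set2, ranges, 1)
sim_p2 = pselect(set1, set2, ranges, 2)
print(sim_p1, sim_p2)
```

With more than one bucket per dimension, the loop would first need to skip dimensions where the two values fall in different buckets; that step is omitted here.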

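Cosine similarity:

$$\frac{\sum_{i=1}^{d} x_i \, y_i}{\sqrt{\sum_{i=1}^{d} x_i^2} \, \sqrt{\sum_{i=1}^{d} y_i^2}} = \frac{(1)(1) + (1)(0) + (1)(1) + (0)(1) + (0)(1)}{\sqrt{1^2 + 1^2 + 1^2 + 0^2 + 0^2} \, \sqrt{1^2 + 0^2 + 1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{3}\sqrt{4}} = \frac{1}{\sqrt{3}} = \frac{\sqrt{3}}{3} = 0.577$$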

Jaccard coefficient:

$$\frac{|S_X \cap S_Y|}{|S_X \cup S_Y|} = \frac{|\{A, C\}|}{|\{A, B, C, D, E\}|} = \frac{2}{5} = 0.4$$
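The cosine similarity and Jaccard coefficient for the two sets can be reproduced with a short sketch. The binary encoding over the sorted union of elements is the same transformation used in the table for this question; the function and variable names are illustrative only.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard(s, t):
    """Size of the intersection divided by the size of the union."""
    return len(s & t) / len(s | t)

s_x = {"A", "B", "C"}
s_y = {"A", "C", "D", "E"}

# Encode each set as a 0/1 vector over the dimensions A..E.
universe = sorted(s_x | s_y)
vec_x = [1 if e in s_x else 0 for e in universe]
vec_y = [1 if e in s_y else 0 for e in universe]

cos_sim = cosine_similarity(vec_x, vec_y)
jac = jaccard(s_x, s_y)
print(round(cos_sim, 3), jac)
```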
3. Compute the edit distance between: (a) ababcabc and babcbc and (b)
cbacbacba and acbacbacb.
Assume an equal cost of insertion, deletion, or replacement.
(a) ababcabc → babcbc (positions are 1-indexed in the original string)
babcabc: delete a at position 1
babcbc: delete a at position 6
Cost = 2
(b) cbacbacba → acbacbacb
acbacbacba: insert a at position 0 (before the first character)
acbacbacb: delete a at position 9
Cost = 2
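The costs above can be verified with the standard dynamic-programming formulation of edit distance (Levenshtein distance with unit costs). This is a minimal sketch, not necessarily the procedure used to find the edit sequences listed above.

```python
def edit_distance(s, t):
    """Unit-cost edit distance (insert, delete, replace) between two strings."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # replace or match
        prev = curr
    return prev[-1]

dist_a = edit_distance("ababcabc", "babcbc")
dist_b = edit_distance("cbacbacba", "acbacbacb")
print(dist_a, dist_b)
```

Both calls should return 2, which confirms that the two-step edit sequences are optimal.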

4. Compute the normalized-cosine measure between the following two sentences:


(a) “The sly fox jumped over the lazy dog.”
(b) “The dog jumped at the intruder.”
For TF, use the raw count, while for IDF use the standard inverse document
frequency.
Possible Answer #1: “The” and “the” are treated as the same word

Word      TF(a)  TF(b)  n/n_i  IDF = log10(n/n_i)  TF-IDF(a)  TF-IDF(b)
the       2      2      2/2    log(2/2) = 0        0          0
sly       1      0      2/1    log(2) = 0.301      0.301      0
fox       1      0      2/1    log(2) = 0.301      0.301      0
jumped    1      1      2/2    log(2/2) = 0        0          0
over      1      0      2/1    log(2) = 0.301      0.301      0
lazy      1      0      2/1    log(2) = 0.301      0.301      0
dog       1      1      2/2    log(2/2) = 0        0          0
at        0      1      2/1    log(2) = 0.301      0          0.301
intruder  0      1      2/1    log(2) = 0.301      0          0.301


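Normalized-cosine similarity, where h(x_i) and h(y_i) are the TF-IDF weights of word i in sentences (a) and (b):

$$\frac{\sum_{i=1}^{d} h(x_i) \, h(y_i)}{\sqrt{\sum_{i=1}^{d} h(x_i)^2} \, \sqrt{\sum_{i=1}^{d} h(y_i)^2}} = \frac{0}{\sqrt{\sum_{i=1}^{d} h(x_i)^2} \, \sqrt{\sum_{i=1}^{d} h(y_i)^2}} = 0$$

The numerator is 0 because every word that occurs in both sentences has an IDF of 0.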

Possible Answer #2: “The” and “the” are treated as different words

(a) “The sly fox jumped over the lazy dog.”
(b) “The dog jumped at the intruder.”

Word      TF(a)  TF(b)  n/n_i  IDF = log10(n/n_i)  TF-IDF(a)  TF-IDF(b)
The       1      1      1      log(2/2) = 0        0          0
sly       1      0      2      log(2/1) = 0.301    0.301      0
fox       1      0      2      log(2/1) = 0.301    0.301      0
jumped    1      1      1      log(2/2) = 0        0          0
over      1      0      2      log(2/1) = 0.301    0.301      0
the       1      1      1      log(2/2) = 0        0          0
lazy      1      0      2      log(2/1) = 0.301    0.301      0
dog       1      1      1      log(2/2) = 0        0          0
at        0      1      2      log(2/1) = 0.301    0          0.301
intruder  0      1      2      log(2/1) = 0.301    0          0.301

Normalized-cosine similarity:

$$\frac{\sum_{i=1}^{d} h(x_i) \, h(y_i)}{\sqrt{\sum_{i=1}^{d} h(x_i)^2} \, \sqrt{\sum_{i=1}^{d} h(y_i)^2}} = \frac{0}{\sqrt{\sum_{i=1}^{d} h(x_i)^2} \, \sqrt{\sum_{i=1}^{d} h(y_i)^2}} = 0$$
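Both possible answers can be reproduced with a short sketch that builds raw-count TF vectors, weights them by log10(n/n_i), and takes the cosine. The function name `tfidf_cosine`, the `lowercase` flag, and the letters-only tokenization are assumptions made for illustration; the two-sentence collection is the one given in the question.

```python
import math
import re
from collections import Counter

def tfidf_cosine(doc_a, doc_b, lowercase=True):
    """Cosine between TF-IDF vectors of two documents (raw TF, log10 IDF)."""
    docs = [doc_a, doc_b]
    tokenized = []
    for doc in docs:
        text = doc.lower() if lowercase else doc
        tokenized.append(re.findall(r"[A-Za-z]+", text))  # words only
    n = len(docs)
    vocab = sorted({w for toks in tokenized for w in toks})
    # n_i: number of documents that contain word i.
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    idf = {w: math.log10(n / df[w]) for w in vocab}
    counts = [Counter(toks) for toks in tokenized]
    x = [counts[0][w] * idf[w] for w in vocab]
    y = [counts[1][w] * idf[w] for w in vocab]
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

a = "The sly fox jumped over the lazy dog."
b = "The dog jumped at the intruder."
same_case = tfidf_cosine(a, b, lowercase=True)    # Possible Answer #1
diff_case = tfidf_cosine(a, b, lowercase=False)   # Possible Answer #2
print(same_case, diff_case)
```

In a two-document collection, any word shared by both documents has IDF log10(2/2) = 0, so the similarity is 0 under either treatment of capitalization. Note that this sketch would divide by zero if the two documents had identical vocabularies, since every weight would then vanish.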
