$$L_2 = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2} = \sqrt{(1 - 3)^2 + (2 - 4)^2} = \sqrt{4 + 4} = 2\sqrt{2}$$
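The arithmetic above can be checked with a short script. This is only a minimal sketch: the function name `euclidean_distance` is an illustrative choice, and the points (1, 2) and (3, 4) are the values substituted in the worked equation.

```python
import math


def euclidean_distance(x, y):
    """Return the L2 distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimensionality")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Points taken from the worked equation: X = (1, 2), Y = (3, 4).
distance = euclidean_distance((1, 2), (3, 4))
print(distance)  # should agree with 2 * sqrt(2)
```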
Match-based similarity:
Discretize the data in each dimension (A, B, ..., E) into one equi-depth bucket with range [0, 1]. Therefore, $m_i = 1$ and $n_i = 0$.
# A B C D E
1 1 1 1 0 0
2 1 0 1 1 1
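Cosine similarity:
$$\frac{\sum_{i=1}^{d} x_i \, y_i}{\sqrt{\sum_{i=1}^{d} x_i^2} \, \sqrt{\sum_{i=1}^{d} y_i^2}} = \frac{(1)(1) + (1)(0) + (1)(1) + (0)(1) + (0)(1)}{\sqrt{1^2 + 1^2 + 1^2 + 0^2 + 0^2} \, \sqrt{1^2 + 0^2 + 1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{3}\sqrt{4}} = \frac{1}{\sqrt{3}} = \frac{\sqrt{3}}{3} = 0.577$$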
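The cosine computation can be reproduced from the two binary records in the table. The sketch below is an illustration only; the names `cosine_similarity`, `record_1`, and `record_2` are assumptions introduced here, not part of the original solution.

```python
import math


def cosine_similarity(x, y):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Rows 1 and 2 of the table over attributes A..E.
record_1 = [1, 1, 1, 0, 0]
record_2 = [1, 0, 1, 1, 1]
similarity = cosine_similarity(record_1, record_2)
print(round(similarity, 3))  # should agree with 1 / sqrt(3)
```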
Jaccard coefficient:
$$\frac{|S_X \cap S_Y|}{|S_X \cup S_Y|} = \frac{|\{A, C\}|}{|\{A, B, C, D, E\}|} = \frac{2}{5} = 0.4$$
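The Jaccard coefficient can be checked the same way by turning each binary record into the set of attributes it contains. This is a sketch under the assumption that a value of 1 means the attribute is present; the helper names `jaccard` and `to_set` are illustrative.

```python
def jaccard(s_x, s_y):
    """Return |intersection| / |union| for two sets."""
    union = s_x | s_y
    if not union:
        raise ValueError("Jaccard coefficient is undefined for two empty sets")
    return len(s_x & s_y) / len(union)


def to_set(attributes, values):
    """Collect the attribute names whose value is nonzero."""
    return {name for name, value in zip(attributes, values) if value}


attributes = ["A", "B", "C", "D", "E"]
s_x = to_set(attributes, [1, 1, 1, 0, 0])  # {A, B, C}
s_y = to_set(attributes, [1, 0, 1, 1, 1])  # {A, C, D, E}
print(jaccard(s_x, s_y))
```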
3. Compute the edit distance between: (a) ababcabc and babcbc and (b)
cbacbacba and acbacbacb.
Assume an equal cost of insertion, deletion, or replacement.
A. ababcabc → babcbc
babcabc: delete a at position 1
babcbc: delete a at position 6
Cost = 2
B. cbacbacba → acbacbacb
acbacbacba: insert a at position 0
acbacbacb: delete a at position 9
Cost = 2
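Both answers can be verified with the standard dynamic-programming recurrence for edit distance with unit costs. The sketch below is one common way to implement it, keeping only two rows of the table; the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(source, target):
    """Return the minimum number of unit-cost insertions, deletions,
    and replacements needed to turn source into target."""
    # prev[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]  # turning i source characters into an empty target
        for j, t_ch in enumerate(target, start=1):
            replace_cost = 0 if s_ch == t_ch else 1
            curr.append(min(
                prev[j] + 1,                 # delete s_ch
                curr[j - 1] + 1,             # insert t_ch
                prev[j - 1] + replace_cost,  # match or replace
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ababcabc", "babcbc"))       # part (a)
print(edit_distance("cbacbacba", "acbacbacb"))   # part (b)
```

The recurrence only guarantees the minimum cost; the specific edit sequences listed above are one way of reaching it.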
Word     TF (a)   TF (b)   IDF   IDF (standard)   TF-IDF (a)   TF-IDF (b)
The      1        1        1     log(2/2)         0            0
jumped   1        1        1     log(2/2)         0            0
the      1        1        1     log(2/2)         0            0
dog      1        1        1     log(2/2)         0            0
at       0        1        2     log(2/1)         0            0.301
Normalized-cosine similarity:
$$\frac{\sum_{i=1}^{d} h(x_i) \, h(y_i)}{\sqrt{\sum_{i=1}^{d} h(x_i)^2} \, \sqrt{\sum_{i=1}^{d} h(y_i)^2}} = \frac{0}{\sqrt{\sum_{i=1}^{d} h(x_i)^2} \, \sqrt{\sum_{i=1}^{d} h(y_i)^2}} = 0$$
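The table values and the final result can be reproduced with a few lines of code. This is a sketch under stated assumptions: the term frequencies and document counts are read off the table, the standard IDF is taken as a base-10 logarithm (which matches the 0.301 entry), and the function returns 0 when either weighted vector is all zeros, which is a convention chosen here to match the stated answer rather than something the formula itself defines.

```python
import math

# Values read from the table; doc_freq is the number of documents
# (out of n_docs = 2) that contain each word.
words = ["The", "jumped", "the", "dog", "at"]
tf_a = [1, 1, 1, 1, 0]
tf_b = [1, 1, 1, 1, 1]
doc_freq = [2, 2, 2, 2, 1]
n_docs = 2

# Standard IDF: log10(n / n_i).
idf = [math.log10(n_docs / n_i) for n_i in doc_freq]

# TF-IDF weights h(x_i) and h(y_i) for documents (a) and (b).
h_a = [tf * w for tf, w in zip(tf_a, idf)]
h_b = [tf * w for tf, w in zip(tf_b, idf)]


def normalized_cosine(hx, hy):
    """Cosine similarity over TF-IDF weighted vectors."""
    dot = sum(a * b for a, b in zip(hx, hy))
    norm_x = math.sqrt(sum(a * a for a in hx))
    norm_y = math.sqrt(sum(b * b for b in hy))
    if norm_x == 0 or norm_y == 0:
        # Assumption: treat an all-zero vector as having similarity 0.
        return 0.0
    return dot / (norm_x * norm_y)


print([round(w, 3) for w in h_b])
print(normalized_cosine(h_a, h_b))
```

Every word in document (a) also appears in document (b), so all of its weights are zero and the numerator vanishes.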