Find Cos_Sim(2,3).
Cosine Similarity Example
Cos_Sim(2,3) = (0*1 + 0*0 + 1*0 + 1*0) / (sqrt(1^2 + 1^2) * sqrt(1^2))
= 0 / sqrt(2)
= 0
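The calculation above can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `cosine_similarity` and the convention of returning 0 for an all-zero vector are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty document is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

# Term order: Team, Win, Stock, Market
d2 = [0, 0, 1, 1]  # Stock, Market
d3 = [1, 0, 0, 0]  # Team
print(cosine_similarity(d2, d3))  # prints 0.0
```

The two documents share no terms, so the numerator is 0 and the similarity is 0 regardless of the vector lengths.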
Euclidean Distance
Euclidean distance is one of several distance measures.
The input is two items (documents, for example) represented
as vectors, and the output is the distance between them.
If the two items are identical, the distance is 0.
There is no upper limit on the distance.
Distance Example
We have two training documents
D1: Team, Team, win (sport class)
D2: Stock, Market (Eco class)
We have one testing document
D3: Team
Find the distance between D3 and D2
Answer: The training documents (D1, D2) and the testing document (D3) can be represented as:
      Team  Win  Stock  Market
D1    2     1    0      0
D2    0     0    1      1
D3    1     0    0      0
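The term counts above can be produced mechanically. The sketch below is one possible way to do it, assuming each document is already split into words and that the vocabulary order is fixed as Team, Win, Stock, Market. The helper name `document_term_matrix` is hypothetical.

```python
from collections import Counter

def document_term_matrix(docs, vocabulary):
    """Return one row of term counts per document, columns in vocabulary order."""
    rows = []
    for tokens in docs:
        counts = Counter(t.lower() for t in tokens)  # case-insensitive counting
        rows.append([counts[term] for term in vocabulary])
    return rows

vocabulary = ["team", "win", "stock", "market"]
docs = [
    ["Team", "Team", "win"],  # D1 (sport)
    ["Stock", "Market"],      # D2 (eco)
    ["Team"],                 # D3 (testing document)
]
matrix = document_term_matrix(docs, vocabulary)
for name, row in zip(["D1", "D2", "D3"], matrix):
    print(name, row)
```

Each row is the vector used in the similarity and distance calculations: D1 = (2, 1, 0, 0), D2 = (0, 0, 1, 1), D3 = (1, 0, 0, 0).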
Find Distance(2,3).
Distance Example
Distance(2,3) = sqrt((0-1)^2 + (0-0)^2 + (1-0)^2 + (1-0)^2)
= sqrt(1 + 0 + 1 + 1)
= sqrt(3) ≈ 1.73
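The same distance can be computed in code. This is a small sketch with an illustrative function name, `euclidean_distance`, and it assumes both vectors have the same length and term order.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Term order: Team, Win, Stock, Market
d2 = [0, 0, 1, 1]  # Stock, Market
d3 = [1, 0, 0, 0]  # Team
print(euclidean_distance(d2, d3))  # sqrt(3), about 1.732
print(euclidean_distance(d3, d3))  # identical items give 0.0
```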
Our First Data Mining Algorithm!
Back to the previous example:
We have two training documents
D1: Team, Team, win (sport class)
D2: Stock, Market (Eco class)
We have one testing document
D3: Team
We want to classify the testing document.
One classifier is called k-Nearest Neighbors (kNN).
With k = 1, it simply assigns the testing document the label of the closest training
document.
Closest = maximum similarity or minimum distance.
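The idea can be sketched in a few lines of Python. The version below is an illustration under stated assumptions, not a definitive implementation: it uses Euclidean distance as the notion of "closest", takes a majority vote among the k nearest training vectors, and the names `knn_predict`, `train_vectors`, and `train_labels` are hypothetical.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=1):
    """Predict by majority vote among the k training vectors closest to query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Term order: Team, Win, Stock, Market
train_vectors = [[2, 1, 0, 0], [0, 0, 1, 1]]  # D1, D2
train_labels = ["sport", "eco"]
d3 = [1, 0, 0, 0]                              # testing document
print(knn_predict(train_vectors, train_labels, d3, k=1))  # prints sport
```

To use similarity instead of distance, the sort key would be replaced by a similarity function and the ranking reversed so that the most similar documents come first.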
Cosine Similarity Example
Cos_Sim(1,3) = (2*1 + 1*0 + 0*0 + 0*0) / (sqrt(2^2 + 1^2) * sqrt(1^2))
= 2/sqrt(5) ≈ 0.89
Cos_Sim(2,3) = 0
Document 1 is more similar to document 3 than document 2 is.
kNN will predict the label of document 3 to be the same as the
label of document 1.
The label of document 1 is sport, so the label of document 3 will be
sport.
We want to classify the testing document using distance
Distance Example
Distance(1,3) = sqrt((2-1)^2 + (1-0)^2 + (0-0)^2 + (0-0)^2)
= sqrt(1 + 1 + 0 + 0)
= sqrt(2) ≈ 1.41
Distance(2,3) = sqrt(3) ≈ 1.73
Document 1 is closer to document 3 than document 2 is.
Document 3 will have the label of document 1, which is
sport.
Example 2
We have two training documents
D1: Team, win (sport class)
D2: Stock, Market, Stock (Eco class)
We have one testing document
D3: Team, Stock
1. Represent the three documents using a document-term matrix.
2. What would be the predicted label of D3 if kNN is used with
cosine similarity?
3. What would be the predicted label of D3 if kNN is used with
Euclidean distance?
Answer
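1. The three documents can be represented as:
      Team  Win  Stock  Market
D1    1     1    0      0
D2    0     0    2      1
D3    1     0    1      0
2. Cos_Sim(1,3) = (1*1 + 1*0 + 0*1 + 0*0) / (sqrt(1^2 + 1^2) * sqrt(1^2 + 1^2))
= 1/2 = 0.5
Cos_Sim(2,3) = (0*1 + 0*0 + 2*1 + 1*0) / (sqrt(2^2 + 1^2) * sqrt(1^2 + 1^2))
= 2 / (sqrt(5) * sqrt(2))
≈ 0.63
Document 2 is more similar to D3, so the prediction for D3 is economy.
3. Distance(1,3) = sqrt((1-1)^2 + (1-0)^2 + (0-1)^2 + (0-0)^2) = sqrt(2) ≈ 1.41
Distance(2,3) = sqrt((0-1)^2 + (0-0)^2 + (2-1)^2 + (1-0)^2) = sqrt(3) ≈ 1.73
Document 1 is closer to D3, so the prediction for D3 is sport.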
Note that cosine similarity gives a different prediction from
Euclidean distance here.
For text documents, cosine similarity often works better than Euclidean
distance, because it compares the direction of the term vectors and is not affected by document length.
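The differing predictions in Example 2 can be reproduced with a short script. This is a sketch for checking the arithmetic only. The helper names and the dictionary layout are assumptions, and each class has just one training document, so this is nearest neighbor with k = 1.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Term order: Team, Win, Stock, Market
train = {
    "sport": [1, 1, 0, 0],  # D1: Team, win
    "eco":   [0, 0, 2, 1],  # D2: Stock, Market, Stock
}
d3 = [1, 0, 1, 0]           # D3: Team, Stock

# Closest = maximum similarity or minimum distance.
by_cosine = max(train, key=lambda label: cosine_similarity(train[label], d3))
by_distance = min(train, key=lambda label: euclidean_distance(train[label], d3))

print(by_cosine)    # prints eco
print(by_distance)  # prints sport
```

Cosine similarity favors D2 (about 0.63 against 0.5), while Euclidean distance favors D1 (sqrt(2) against sqrt(3)), so the two measures disagree on this example.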