Similarity Measures/Metrics
Semantic Relatedness
• Euclidean Distance
• Cosine Similarity
• Jaccard Coefficient
• F-SCORE
• Pearson Correlation Coefficient
1. EUCLIDEAN DISTANCE
• Euclidean distance is the ordinary (straight-line) distance
between two points; in two- or three-dimensional space it
can be measured with a ruler. It is widely used in
clustering, dimensionality reduction and other data
mining problems, including text clustering.
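• This is probably the most commonly chosen type of
distance. For two points $x$ and $y$ in $n$-dimensional
space it is given by
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
• Between two document vectors $\vec{t_1}$ and $\vec{t_2}$, the Euclidean
distance is defined as:
$$D_E(\vec{t_1}, \vec{t_2}) = \sqrt{\sum_{k=1}^{m} (w_{k,1} - w_{k,2})^2}$$
where $w_{k,1}$ and $w_{k,2}$ are the weights of term $k$ in the two
documents and $m$ is the number of terms.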
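As a minimal sketch of the definition above, the distance between two document vectors can be computed as follows. The function name `euclidean_distance` and the sample term weights are illustrative assumptions, not taken from the slides.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Sum the squared per-term differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term-weight vectors for two documents (made-up values).
doc1 = [3.0, 0.0, 1.0]
doc2 = [0.0, 4.0, 1.0]
print(euclidean_distance(doc1, doc2))  # 5.0
```

A distance of 0 means the two vectors are identical; larger values mean the documents are less similar.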
COSINE SIMILARITY
• Cosine Similarity measures the cosine of the angle
between two vectors.
[Figure: the angle θ between two document vectors t1 and t2]
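The cosine of the angle between two vectors is their dot product divided by the product of their lengths. The sketch below assumes plain Python lists as vectors; the function name `cosine_similarity` and the sample values are illustrative, not from the slides.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical vectors: t1 and t2 share no terms, so they are orthogonal.
t1 = [1.0, 0.0]
t2 = [0.0, 1.0]
print(cosine_similarity(t1, t2))  # 0.0
```

Unlike Euclidean distance, this measure depends only on the direction of the vectors, so scaling a document vector does not change the result.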
Recall
• Recall is the fraction of the documents that are
relevant to the query that are successfully retrieved.
Example
• Let us assume that there are 60 relevant documents for
a particular keyword ‘w’.
• Let’s also assume that given the keyword ‘w’, an
information retrieval system returns 30 documents in
total out of which 20 are relevant.
• Then the precision is 20/30 = 66.67% and the recall is
20/60 = 33.33%. The F-measure, the harmonic mean of the
two, is 2·P·R/(P+R) = 2·(2/3)·(1/3)/(2/3 + 1/3) = 4/9 ≈ 44.44%.
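The arithmetic of this example can be checked with a short sketch. It assumes the usual definitions: precision is the fraction of retrieved documents that are relevant, and the F-measure is the harmonic mean of precision and recall. The function name `precision_recall_f` is hypothetical.

```python
def precision_recall_f(relevant_retrieved, retrieved, relevant_total):
    """Return (precision, recall, F-measure) from raw document counts."""
    precision = relevant_retrieved / retrieved if retrieved else 0.0
    recall = relevant_retrieved / relevant_total if relevant_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Counts from the example: 20 relevant among 30 retrieved, 60 relevant overall.
p, r, f = precision_recall_f(20, 30, 60)
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
# precision=0.6667 recall=0.3333 F=0.4444
```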
Pearson Correlation Coefficient