
Similarity Measures/Metrics

Semantic Relatedness

• Semantic relatedness indicates the degree to which words
are associated via any type of semantic relationship
(such as synonymy, antonymy, hyponymy, hypernymy and
other types).
• Semantic similarity is a special case of relatedness that
takes into consideration only hyponymy/hypernymy
relations.
Semantic Similarity

• Semantic similarity is a concept whereby a set of
documents, or terms within term lists, is assigned a
metric based on the likeness of their meaning /
semantic content.
• Most automatic measures of semantic
similarity/relatedness yield a number between
-1 and 1, or between 0 and 1, where 1 signifies extremely
high similarity/relatedness and 0 signifies
little to none.
Various Similarity Measures

Similarities (distances) are a set of rules that serve as
criteria for grouping or separating objects.

• Euclidean Distance
• Cosine Similarity
• Jaccard Coefficient
• F-Score
• Pearson Correlation Coefficient
1. EUCLIDEAN DISTANCE
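• It is the ordinary distance between two points and can
be easily measured with a ruler in two- or three-dimensional
space. Euclidean distance is widely used in
clustering, dimensionality reduction and other data
mining problems, including text clustering.
• This is probably the most commonly chosen type of
distance. Between two document vectors
$\vec{t_a} = (w_{1,a}, \dots, w_{m,a})$ and $\vec{t_b} = (w_{1,b}, \dots, w_{m,b})$,
where $w_{t,a}$ is the weight of term $t$ in document $a$ and $m$ is
the number of terms, the Euclidean distance is defined as:

$$D_E(\vec{t_a}, \vec{t_b}) = \sqrt{\sum_{t=1}^{m} \left(w_{t,a} - w_{t,b}\right)^2}$$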
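
As a minimal sketch of the Euclidean distance between two documents, assuming each document is already represented as an equal-length list of term weights over the same vocabulary (the function name `euclidean_distance` and the sample vectors are illustrative, not part of the slides):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length term-weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must cover the same vocabulary")
    # Sum the squared per-term differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term-frequency vectors over a three-term vocabulary.
doc_a = [3, 0, 4]
doc_b = [0, 0, 0]
print(euclidean_distance(doc_a, doc_b))  # 5.0
```

Identical vectors give a distance of 0, and the value grows without an upper bound as the documents diverge, so it is a distance rather than a similarity score.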
2. COSINE SIMILARITY
• Cosine similarity measures the cosine of the angle
between two vectors.
• When documents are represented as vectors, the
similarity of two documents corresponds to the
correlation between the vectors.
• Cosine similarity is one of the most popular similarity
measures applied to text documents, such as in numerous
Information Retrieval applications, Natural Language
Processing and Clustering.
[Figure: document vectors d1 and d2 drawn in a term space with axes t1, t2 and t3; θ is the angle between d1 and d2.]

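Given two documents with term vectors $\vec{t_a}$ and $\vec{t_b}$, the
cosine similarity can be calculated by the formula below:

$$\mathrm{SIM}_C(\vec{t_a}, \vec{t_b}) = \frac{\vec{t_a} \cdot \vec{t_b}}{|\vec{t_a}| \, |\vec{t_b}|}$$

• An important property of the cosine similarity
is its independence of document length.
• For term vectors with non-negative weights (such as
term frequencies), the cosine similarity is non-negative
and bounded in [0, 1].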
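
The length-independence property can be illustrated with a short sketch, assuming documents are given as equal-length lists of non-negative term weights (the name `cosine_similarity`, the zero-vector convention and the sample vectors are assumptions made for this example):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must cover the same vocabulary")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for an empty document
    return dot / (norm_a * norm_b)

doc = [1, 2, 0]
doubled = [2, 4, 0]  # the same document concatenated with itself
print(cosine_similarity(doc, doubled))    # very close to 1.0
print(cosine_similarity([1, 0], [0, 1]))  # 0.0, no shared terms
```

Doubling every term count changes the vector's length but not its direction, so the similarity stays at (approximately, in floating point) 1.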
3. JACCARD COEFFICIENT

• The Jaccard coefficient measures similarity as the
intersection divided by the union of the objects; for
two objects A and B, $J(A, B) = |A \cap B| \,/\, |A \cup B|$. For
text documents, the Jaccard coefficient compares the
sum weight of shared terms to the sum weight of
terms that are present in either of the two
documents but are not the shared terms.
• The Jaccard coefficient is a similarity measure
and ranges between 0 and 1.
• It is 1 when A = B and 0 when A and B are disjoint,
where 1 means the two objects are the same
and 0 means they are completely different.
• Let there be two documents
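
As a minimal sketch of the set-based form (intersection divided by union), assuming each document is reduced to its set of distinct terms and all terms carry equal weight; the function name, the empty-document convention and the two sample sentences are illustrative only:

```python
def jaccard_coefficient(terms_a, terms_b):
    """Size of the shared-term set divided by the size of the combined-term set."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty documents count as identical
    return len(a & b) / len(union)

# Hypothetical documents, tokenized by whitespace.
doc_a = "the cat sat on the mat".split()
doc_b = "the cat lay on the rug".split()

# 3 shared terms (the, cat, on) out of 7 distinct terms in total.
print(jaccard_coefficient(doc_a, doc_b))
```

This unweighted version ignores how often a term occurs; the weighted description above would replace the set sizes with sums of term weights.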
4. F-SCORE

• In statistics, the F1 score (also F-score or F-measure) is a
measure of a test's accuracy. It considers both the
precision (p) and the recall (r) of the test to compute
the score:

$$F_1 = \frac{2\,p\,r}{p + r}$$

• p is the number of correct results divided by the
number of all returned results, and r is the number of
correct results divided by the number of results that
should have been returned.
• In various Information Retrieval and other text
document mining applications, Precision and
Recall are defined as follows:
Precision
• Precision is the fraction of the documents retrieved
that are relevant to the user's information need.

Recall
• Recall is the fraction of the documents that are
relevant to the query that are successfully retrieved.
Example
• Let us assume that there are 60 relevant documents for
a particular keyword ‘w’.
• Let’s also assume that, given the keyword ‘w’, an
information retrieval system returns 30 documents in
total, of which 20 are relevant.
• Then the precision and recall in this case are
20/30 = 66.67% and 20/60 = 33.33% respectively, and the
F-measure is 2(2/3)(1/3) / (2/3 + 1/3) = 4/9, or about 44.44%.
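
The arithmetic of this example can be checked with a short sketch; the function name `precision_recall_f1` and its parameter names are chosen for illustration, and the F-measure is assumed to be the harmonic mean of precision and recall:

```python
def precision_recall_f1(relevant_retrieved, retrieved, relevant):
    """Return (precision, recall, F1) from three document counts."""
    if retrieved <= 0 or relevant <= 0:
        raise ValueError("retrieved and relevant counts must be positive")
    p = relevant_retrieved / retrieved  # fraction of returned documents that are relevant
    r = relevant_retrieved / relevant   # fraction of relevant documents that were returned
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

# Counts from the example: 20 relevant among 30 returned, 60 relevant overall.
p, r, f1 = precision_recall_f1(20, 30, 60)
print(f"precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
# precision=0.6667 recall=0.3333 F1=0.4444
```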
5. PEARSON CORRELATION COEFFICIENT

• In statistics, Pearson correlation (or the Pearson
correlation coefficient) refers to the measure of the
linear dependence between two variables.
• This metric measures how highly correlated two
variables are, and ranges from -1 to +1.
• The closer the coefficient is to 1 or -1, the more
closely the variables are related. If the coefficient is
close to zero, there is no linear relationship between
the two variables.
Cont’d

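• If sample $X_k$ has n measurements, or features, it can
be written as $X_k = \{x_{1k}, x_{2k}, \dots, x_{nk}\}$. For a second
sample $X_j$ written the same way, the formula to
calculate r can be written as follows:

$$r_{jk} = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}\;\sqrt{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}}$$

where $\bar{x}_k$ is the mean of the n measurements in $X_k$.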
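
As a rough sketch of the sample form of the coefficient, assuming two samples are given as equal-length lists of measurements (the function name `pearson_correlation` and the sample values are illustrative, not from the slides):

```python
import math

def pearson_correlation(x, y):
    """Sample Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 values each")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (numerator).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (denominator terms).
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(ss_x * ss_y)

print(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # close to +1: perfect positive linear relation
print(pearson_correlation([1, 2, 3, 4], [8, 6, 4, 2]))  # close to -1: perfect negative linear relation
```

A value near zero from this function only rules out a linear relationship; two variables can still be related in a non-linear way.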
Assignment #3

query1. Anna hazare anti Land Acquisition Bill
query2. Stock market mutual fund
query3. Britney spear music mp3
query4. Khap panchayat honour killing
query5. Sql server dbms database

For each of the above queries, fetch the top 100 documents
retrieved from Google.
Calculate the various similarity measures and analyze
the relevancy of each measure.
