
Assignment No.

Text Mining

Similarity and dissimilarity measures

Submitted To: Dr. Ibrar

Submitted By: Fahmi Hassan Ali Quradaa

Ph.D. Scholar

University of Peshawar

Department of Computer Science


Similarity and Dissimilarity Measurements

1. INTRODUCTION

Similarity and distance measures are vital components used for solving many pattern
recognition problems such as classification and clustering. These measures play an
increasingly central role in text-related research and applications across several tasks,
such as text classification, topic tracking, question answering, short answer scoring,
machine translation, essay scoring, and topic detection. Clustering algorithms need a
measure for determining how dissimilar two given documents are, and this difference is
often quantified by measures such as the Euclidean distance and cosine similarity.
These measurements can also be used to identify the suitable clustering algorithm for a
specific problem.

Informally, the similarity between two objects is a numerical measure of the degree to
which the two objects are alike. Consequently, similarity is higher for pairs of objects
that are more alike. Similarities are usually non-negative and often fall between zero
(completely dissimilar) and one (practically identical).

On the other hand, the dissimilarity between two objects is the degree to which the two
objects are unlike; dissimilarities are lower for more similar pairs of objects. Commonly,
the term distance is used as a synonym for dissimilarity. Dissimilarities sometimes fall in
the interval [0, 1], but it is also common for them to range from 0 to infinity.

A variety of similarity and dissimilarity measures have been formulated throughout the
years, each with its own strengths and weaknesses. In this assignment I first discuss some
of the literature on similarity and distance measurements, then present a list of similarity
and dissimilarity measures, and finally compare those measurements.

2. Literature review
There have been many surveys of the similarity and distance measures proposed in
different disciplines; applying suitable measures results in more accurate data analysis.
Vijaymeena and Kavitha [1] discussed the similarity measures that are applied to text.
They classified similarity measures into three significant categories: corpus-based,
string-based, and knowledge-based. Choi et al. [2] carried out a comprehensive survey
of binary measures. They collected 76 binary similarity and dissimilarity (distance)
measures proposed over the last century and disclosed their correlations by means of
the hierarchical clustering technique. David J. Weller-Fahy et al. [3] conducted an
overview of the use of similarity and distance (dissimilarity) measures within Network
Intrusion Anomaly Detection (NIAD) research. Irani, Pise, and Phatak [4] discussed
various clustering techniques and the current similarity measures used in distance-based
clustering. A. S. Shirkhorshidi et al. [5] proposed a technical framework to analyze,
compare, and benchmark the effect of various similarity measures on the results of
distance-based clustering algorithms.

3. Similarity and distance measures

3.1 Levenshtein distance

The Levenshtein distance between two strings is the minimum number of edits required
to convert one string into the other, where the allowable edit operations are the deletion,
insertion, or replacement of a single character [11]. The Levenshtein algorithm computes
the distance between two strings x and y as follows:
1- Initialize a matrix M of size (|x|+1) x (|y|+1).
2- Fill the borders: $M_{i,0} = i$ and $M_{0,j} = j$.
3- Recursion:
$M_{i,j} = \begin{cases} M_{i-1,j-1} & \text{if } x[i] = y[j] \\ 1 + \min(M_{i-1,j},\ M_{i,j-1},\ M_{i-1,j-1}) & \text{otherwise} \end{cases}$
4- Distance: $levenshtein_{dist}(x,y) = M_{|x|,|y|}$

Levenshtein similarity: $sim_{levenshtein}(x,y) = 1 - \frac{levenshtein_{dist}(x,y)}{\max(|x|,|y|)}$

For example, levenshtein_dist("kitten", "sitting") = 3 (substitute k→s, substitute e→i,
insert g), so the Levenshtein similarity is 1 − 3/7 ≈ 0.571.
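A minimal Python sketch of the dynamic-programming recursion above (the function
names are mine, chosen for illustration):

```python
def levenshtein_dist(x: str, y: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or replacements needed to turn x into y."""
    m, n = len(x), len(y)
    # M[i][j] = distance between the first i chars of x and the first j chars of y
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i
    for j in range(n + 1):
        M[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                M[i][j] = M[i - 1][j - 1]
            else:
                M[i][j] = 1 + min(M[i - 1][j],      # deletion
                                  M[i][j - 1],      # insertion
                                  M[i - 1][j - 1])  # replacement
    return M[m][n]

def sim_levenshtein(x: str, y: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    return 1 - levenshtein_dist(x, y) / max(len(x), len(y))

print(levenshtein_dist("kitten", "sitting"))            # 3
print(round(sim_levenshtein("kitten", "sitting"), 3))   # 0.571
```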
3.2 Jaro distance

Jaro distance is used to measure the similarity between two strings; the higher the value,
the more similar the strings are [5]. It computes the similarity between two strings x and
y as follows:
1- Search for common characters between the strings.
2- m: the number of matching characters.
3- Search range for matching characters: two characters match only if they are within
$\frac{\max(|x|,|y|)}{2} - 1$ positions of each other.
4- t: the number of transpositions (half the number of matching characters that appear
in a different order in the two strings).
5- $sim_{jaro} = \frac{1}{3}\left(\frac{m}{|x|} + \frac{m}{|y|} + \frac{m-t}{m}\right)$

For example, for "MARTHA" and "MARHTA": m = 6 and t = 1, so
sim_jaro = (1/3)(6/6 + 6/6 + 5/6) ≈ 0.944.
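A Python sketch of the steps above (this is one straightforward reading of the algorithm,
not a reference implementation):

```python
def sim_jaro(x: str, y: str) -> float:
    """Jaro similarity in [0, 1]; higher means more similar."""
    if x == y:
        return 1.0
    match_range = max(len(x), len(y)) // 2 - 1
    x_matched = [False] * len(x)
    y_matched = [False] * len(y)
    m = 0  # number of matching characters
    for i, cx in enumerate(x):
        lo = max(0, i - match_range)
        hi = min(len(y), i + match_range + 1)
        for j in range(lo, hi):
            if not y_matched[j] and y[j] == cx:
                x_matched[i] = y_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matching characters that are out of order
    xs = [c for c, used in zip(x, x_matched) if used]
    ys = [c for c, used in zip(y, y_matched) if used]
    t = sum(a != b for a, b in zip(xs, ys)) / 2
    return (m / len(x) + m / len(y) + (m - t) / m) / 3

print(round(sim_jaro("MARTHA", "MARHTA"), 3))  # 0.944
```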
3.3 Q-grams

Q-grams are typically used in approximate string matching by "sliding" a window of
length q over the characters of a given string to create a number of length-q grams. A
match is then rated as the number of q-grams shared with the second string over the
number of possible q-grams [5]. It works as follows (a sketch appears after the list):
- Split the string into short substrings of length q:
  - slide a window over the string;
  - q = 2 gives bigrams;
  - variation: pad with q−1 special characters, which emphasizes the beginning and end of the string;
  - variation: include positional information to weight the similarities.
- Number of q-grams in a string x: |x| − q + 1.
- Count how many q-grams are common to both strings.
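A minimal Python sketch, assuming a Dice-style normalization (shared q-grams over the
average number of q-grams; other normalizations are possible):

```python
def qgrams(s: str, q: int = 2) -> list:
    """All contiguous substrings of length q; there are len(s) - q + 1 of them."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def sim_qgram(x: str, y: str, q: int = 2) -> float:
    """Share of q-grams the two strings have in common (Dice-style)."""
    gx, gy = qgrams(x, q), qgrams(y, q)
    common = len(set(gx) & set(gy))
    return 2 * common / (len(gx) + len(gy))

print(qgrams("hello"))                          # ['he', 'el', 'll', 'lo']
print(round(sim_qgram("hello", "hallo"), 2))    # 0.5
```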

3.4 Euclidean distance

The Euclidean distance between two points is measured from the numerical differences
of their coordinates, and it is usual to identify a point with its Cartesian coordinates [11].
For two points p and q on the real line, the distance between them is $d(p,q) = |p - q|$;
more generally, for two points $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$:

$d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
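A one-line sketch in plain Python over coordinate sequences:

```python
def euclidean(x, y):
    """Straight-line distance between two points given as coordinate sequences."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

print(euclidean((0, 0), (3, 4)))  # 5.0
```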
3.5 Manhattan distance

The Manhattan distance measures the distance that would be traveled to get from one
point (x) to the other (y) if a grid-like path is followed. The distance between two points
is the sum of the absolute differences of their corresponding components [5].
The distance between a point $X = (X_1, X_2, \ldots)$ and a point $Y = (Y_1, Y_2, \ldots)$ is given by:

$d = \sum_{i=1}^{n} |X_i - Y_i|$

where n is the number of variables, and $X_i$ and $Y_i$ are the values of the i-th variable
at points X and Y respectively.
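The same sketch, swapping the squared terms for absolute differences:

```python
def manhattan(x, y):
    """Grid-path ("taxicab") distance: sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(manhattan((0, 0), (3, 4)))  # 7
```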
The difference between the Euclidean distance and the Manhattan distance is shown in Figure (1).

Figure (1): The difference between the Euclidean distance and the Manhattan distance

3.6 Chebyshev distance

The Chebyshev distance is the maximum absolute distance in any single dimension
between two M-dimensional points. It examines the absolute magnitude of the
differences between the coordinates of a pair of points [11]. In two dimensions its
formula is:

$d_{ij} = \max(|X_1 - X_2|,\ |Y_1 - Y_2|)$

Applications of the Chebyshev distance include chess (the number of moves a king
needs between two squares) and warehouse logistics.
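A sketch generalizing to any number of dimensions:

```python
def chebyshev(x, y):
    """Largest absolute coordinate difference between two points."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

print(chebyshev((0, 0), (3, 4)))  # 4
```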

3.7 Hamming Distance

The Hamming distance is a measure for comparing two equal-length strings, often
binary strings. It computes the distance as the number of positions at which the two
symbols differ [11]. For example:

- the distance between "karolin" and "kathrin" is 3, and between "karolin" and "kerstin" is 3;
- the distance between 1011101 and 1001001 is 2, and between 2173896 and 2233796 is 3.

It is computed as:

$d_H(x,y) = \sum_{i=1}^{n} [x_i \neq y_i]$

where $[\cdot]$ is 1 when the condition holds and 0 otherwise. This measure is used in
telecommunication to estimate the error when data is transmitted over computer
networks.
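A sketch, assuming the two strings have equal length:

```python
def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(x, y))

print(hamming("karolin", "kathrin"))    # 3
print(hamming("1011101", "1001001"))    # 2
```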

3.8 Cosine similarity

Cosine similarity is a metric used to compare how similar documents are without
consideration of their size. The cosine of the angle between two vectors is measured to
decide whether the documents' vectors are pointing in the same direction [11].
The cosine similarity is computed as follows:

$similarity(A,B) = \frac{A \cdot B}{\|A\| \times \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$
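A sketch over plain coordinate lists (in a document setting these would be
term-frequency vectors over a shared vocabulary):

```python
def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Two toy term-frequency vectors over the same vocabulary
print(round(cosine_similarity([1, 2, 0, 1], [2, 4, 0, 2]), 3))  # 1.0 (same direction)
```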

3.9 Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), is a
semantic similarity measure for discovering similarity between words based on
information gathered from a large corpus. It assumes that terms (words) that are close
in meaning will occur in similar pieces of documents or paragraphs. The first step in
LSA is to create a matrix in which each row represents an index word and each column
represents a document or paragraph; each cell in the matrix contains the number of
times that word (term) occurs in that document or paragraph. Then, the number of
columns is reduced using a mathematical technique called SVD (singular value
decomposition). Finally, the remaining words (rows) are compared using the cosine
similarity [11].
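A minimal sketch of this pipeline, assuming scikit-learn is available (the corpus, the
component count, and the example words are illustrative choices, not part of the
original method description):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]

# Term-document matrix: rows = terms, columns = documents
vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(corpus).T  # transpose so rows are terms

# Reduce the number of columns with truncated SVD
svd = TruncatedSVD(n_components=2)
term_vectors = svd.fit_transform(term_doc)

# Compare two terms with cosine similarity in the reduced space
vocab = vectorizer.vocabulary_
cat, dog = term_vectors[vocab["cat"]], term_vectors[vocab["dog"]]
print(cosine_similarity([cat], [dog])[0, 0])
```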

3.10 Mahalanobis distance

The Mahalanobis distance is a measure of the distance between a vector and a
distribution of data, or a variation that measures the separation of two vectors drawn
from the same dataset [5]. For vectors x and y and the covariance matrix S of the data,
it is calculated as:

$d(x,y) = \sqrt{(x - y)^T S^{-1} (x - y)}$
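A sketch using SciPy, which expects the inverse of the covariance matrix (the toy
dataset here is an illustrative assumption):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Toy dataset: rows are observations, columns are variables
data = np.array([[2.0, 2.0], [2.0, 5.0], [6.0, 5.0], [7.0, 3.0], [4.0, 7.0]])

# SciPy's mahalanobis() takes the *inverse* of the covariance matrix
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
print(mahalanobis(data[0], data[1], inv_cov))
```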
3.11 Lesk

The Lesk measure is another type of knowledge-based semantic relatedness, which
covers more relationships between words or concepts, for example "is a part of", "is a
kind of", "is a specific example of", and "is the opposite of". It retrieves from Machine
Readable Dictionaries (MRDs) all sense definitions of the concepts (words) to be
disambiguated, and then finds the overlaps between the definitions. The relatedness
score is computed as the sum of the squares of the overlap lengths [11].
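A toy sketch of the overlap scoring, assuming the two sense definitions (glosses) have
already been retrieved as strings; difflib's matching blocks are used here as a stand-in
for finding maximal contiguous word overlaps:

```python
from difflib import SequenceMatcher

def overlap_score(gloss_a: str, gloss_b: str) -> int:
    """Sum of squared lengths of contiguous word overlaps between two glosses."""
    a, b = gloss_a.lower().split(), gloss_b.lower().split()
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    return sum(block.size ** 2 for block in blocks)

gloss_bank1 = "a financial institution that accepts deposits"
gloss_bank2 = "a financial institution licensed to receive deposits"
print(overlap_score(gloss_bank1, gloss_bank2))  # 3**2 + 1**2 = 10
```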

4. Comparison between similarity measures

In the following table, different similarity and distance measures are compared based
on the following criteria: equation, time complexity, advantages, disadvantages, and the
application areas for which the measure is suitable. As we can see, each measure has
strengths and weaknesses; for example, the Euclidean distance is very common, easy to
compute, and works well with compact or isolated clusters, but it is sensitive to outliers.
Based on that, the selection of the similarity measure determines the suitable clustering
or pattern recognition algorithm for a specific problem.
| Distance measure | Equation | Time complexity | Advantages | Disadvantages | Applications |
|---|---|---|---|---|---|
| Levenshtein distance | $M_{\lvert x \rvert, \lvert y \rvert}$ (see the recursion in Section 3.1) | O(n·m) | | | Spelling correction and all applications that benefit from soft matching of words, e.g. information retrieval, machine translation |
| Euclidean distance | $d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ | O(n) | Very common, easy to compute, and works well with datasets with compact or isolated clusters [7,10] | Sensitive to outliers [7,10] | K-means algorithm, fuzzy c-means algorithm [8] |
| Manhattan distance | $d = \sum_{i=1}^{n} \lvert x_i - y_i \rvert$ | O(n) | Common; like other Minkowski-driven distances it works well with datasets with compact or isolated clusters [7] | Sensitive to outliers [7,10] | K-means algorithm |
| Chebyshev distance | $d = \max_i \lvert x_i - y_i \rvert$ | O(n) | Requires less time to decide the distances between the datasets [6] | Requires more space | Chess, warehouse logistics, electronic CAM |
| Hamming distance | $d_H = \sum_{i=1}^{n} [x_i \neq y_i]$ | O(n) | | | Detection of errors in information transmission and telecommunication |
| Cosine measure | $\frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$ | O(3n) | Independent of vector length and invariant to rotation [10] | Not invariant to linear transformation [10] | Mostly used in document similarity applications [8,10] |
| Mahalanobis distance | $d = \sqrt{(x-y)^T S^{-1} (x-y)}$ | O(3n) | A data-driven measure that can ease the distance distortion caused by a linear combination of attributes [5] | Can be expensive in terms of computation [10] | Hyperellipsoidal clustering algorithm [9] |

5. Conclusion

In this work we discussed several similarity and distance measures and compared some
of them in terms of equation, time complexity, and other criteria. Each measure has
strengths and weaknesses, so when using any clustering or data mining algorithm we
should carefully decide which measure to use, because that choice will affect the results
we get.

References

[1] M. K. Vijaymeena and K. Kavitha, "A survey on similarity measures in text mining," Mach. Learn. Appl. Int. J., vol. 3, pp. 19–28, 2016. doi: 10.5121/mlaij.2016.3103

[2] S. Choi, S. Cha, and C. C. Tappert, "A Survey of Binary Similarity and Distance Measures," J. Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.

[3] D. J. Weller-Fahy, B. J. Borghetti, and A. A. Sodemann, "A Survey of Distance and Similarity Measures Used Within Network Intrusion Anomaly Detection," IEEE Commun. Surv. Tutor., vol. 17, no. 1, pp. 70–91, 2015.

[4] J. Irani, N. Pise, and M. Phatak, "Clustering techniques and the similarity measures used in clustering: A survey," International Journal of Computer Applications, vol. 134, no. 7, pp. 9–14, 2016. doi: 10.5120/ijca2016907841

[5] A. S. Shirkhorshidi, S. Aghabozorgi, and T. Y. Wah, "A comparison study on similarity and dissimilarity measures in clustering continuous data," PLOS ONE, vol. 10, no. 12, pp. 1–20, 2015. [Online]. Available: https://doi.org/10.1371/journal.pone.0144059

[6] S. Dahal, "Effect of Different Distance Measures in Result of Cluster Analysis," Master's thesis, 2015.

[7] G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied Probability. Society for Industrial and Applied Mathematics, 2007.

[8] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.

[9] J. Mao and A. K. Jain, "A self-organizing network for hyperellipsoidal clustering (HEC)," IEEE Trans. Neural Networks, vol. 7, pp. 16–29, 1996. doi: 10.1109/72.478389. PMID: 18255555

[10] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, pp. 264–323, 1999. doi: 10.1145/331499.331504

[11] W. H. Gomaa and A. A. Fahmy, "A survey of text similarity approaches," Int. J. Comput. Appl., vol. 68, no. 13, pp. 13–18, 2013.
