
Summary for:

https://www.researchgate.net/publication/325411209_Semantic_Similarity-_A_Review_of_Approaches_and_Metrics

Akila, D. (2018). Semantic Similarity - A Review of Approaches and Metrics. International Journal of Applied Engineering Research, 9.

Introduction:

Semantic similarity computes the conceptual similarity between words or terms.

Semantic similarity consists of two types: relational similarity (which indicates the similarity between the relations holding within pairs of words) and attributional similarity (which denotes the similarity between the attributes of the words themselves).

Two words, X and Y, are attributionally similar when the attributes of X are similar to the attributes of Y. Two pairs, A:B and C:D, are relationally similar when the relations between A and B are similar to the relations between C and D.

Classification of Semantic Similarity Approaches


Semantic similarity approaches are divided into the following five categories:

 Metrics based semantic similarity approaches

 Corpus based approaches

 Ontology based Approaches

 Relational based Approaches

 Hybrid based Approaches


 Metrics based semantic similarity approaches

Metrics based semantic similarity approaches measure the similarity between words based on metrics such as path length, page count, and features. Metrics based semantic similarity methods can be categorized into the following:

 Edge based method

The edge based method measures the similarity between two terms based on the length of the path that connects the terms and the location of the terms in the taxonomy.

The various edge based methods are the Path Length Approach, Depth Relative Scaling, Conceptual Similarity, and Normalized Path Length. Further, the path length approach is classified into shortest path length and weighted shortest path length.

Path Length Approach:

The most basic approach, divided into the two variants below (a short code sketch follows the list):

 Shortest Path Length: a straightforward method that measures semantic similarity by the length of the shortest path between two terms. Merits: simple. Demerits: the link variance problem (every edge counts equally, although edges can span different semantic distances).

 Weighted Shortest Path Length: assigns a weight to every edge and calculates the semantic similarity based on those weights; performs better than the plain shortest path.
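
As a concrete illustration, here is a minimal Python sketch of the shortest path idea using NLTK's WordNet interface (an assumption of mine; the paper does not prescribe a library). NLTK's path_similarity scores 1 / (1 + shortest path length), so a shorter path in the taxonomy yields a score closer to 1:

import nltk
nltk.download("wordnet", quiet=True)  # fetch WordNet on first use

from nltk.corpus import wordnet as wn

car, bus, tree = wn.synset("car.n.01"), wn.synset("bus.n.01"), wn.synset("tree.n.01")
print(car.path_similarity(bus))   # higher: short taxonomy path between two vehicles
print(car.path_similarity(tree))  # lower: the connecting path is much longer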

Depth Relative Scaling

The distance between two adjacent nodes is the average of the edge weight in each direction, scaled with respect to the depth of the nodes, so that edges deeper in the taxonomy contribute less. The semantic distance is then the summation of the distances between neighboring nodes over all links in a path.
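
A toy sketch of this idea on a hand-built taxonomy (the 1/depth weighting below is an illustrative assumption, not the paper's exact formula; it only captures that deeper edges count for less):

DEPTH = {"entity": 0, "animal": 1, "plant": 1, "dog": 2, "oak": 2}  # toy taxonomy depths

def edge_distance(a, b):
    # deeper edge -> smaller contribution to the semantic distance
    return 1.0 / max(DEPTH[a], DEPTH[b])

def path_distance(path):
    # semantic distance = sum of per-edge distances along the path
    return sum(edge_distance(path[i], path[i + 1]) for i in range(len(path) - 1))

print(path_distance(["dog", "animal", "entity", "plant", "oak"]))  # 0.5 + 1 + 1 + 0.5 = 3.0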

Conceptual Similarity

Used with translated words: it measures the similarity among the semantic representations of verbs and provides a solution to the lexical selection problem in machine translation. Moreover, it finds the conceptual similarity among pairs of concepts.

A shared drawback: these methods do not support the is-a relation and depend only on the path length.
 Information Content based Method

Using the is-a relation, it defines the similarity between two concepts by examining the maximum information they share; probability estimates add this ability on top of edge counting.

It uses WordNet as the taxonomy and calculates the information content from the Brown corpus. However, it suffers from the word ambiguity problem.

Page count is sometimes considered as an additional factor.

Merits: conceptually quite simple and not sensitive to the problem of varying link distances. Demerits: suffers due to inappropriate word senses.
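
A minimal sketch of this family of measures, using NLTK's implementation of Resnik similarity: the score is the information content of the most informative ancestor the two concepts share, with IC counts taken from the Brown corpus as described above (using NLTK here is my assumption, not the paper's prescription):

import nltk
nltk.download("wordnet", quiet=True)
nltk.download("wordnet_ic", quiet=True)

from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")  # information content from the Brown corpus
car, bus = wn.synset("car.n.01"), wn.synset("bus.n.01")
print(car.res_similarity(bus, brown_ic))  # higher = more shared information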

 Feature based method

This method measures the similarity between two terms based on their properties or the relationships among the terms in the taxonomy. Common features between the terms boost the similarity, and non-common features reduce the similarity of the two concepts. It mostly uses a novel model to find all kinds of relations (see the sketch below).

Merits: takes concept features into consideration. Demerits: computational complexity, and it cannot work well without a complete feature set.
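
A small sketch of the common/distinctive feature idea, in the spirit of Tversky's ratio model (the feature sets and the alpha/beta weights are illustrative assumptions):

def feature_similarity(a, b, alpha=0.5, beta=0.5):
    # shared features raise the score, distinctive features lower it
    common = len(a & b)
    return common / (common + alpha * len(a - b) + beta * len(b - a))

bird = {"wings", "feathers", "lays_eggs", "flies"}
bat = {"wings", "fur", "flies", "nocturnal"}
print(feature_similarity(bird, bat))  # 2 shared vs. 2 + 2 distinctive -> 0.5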

 Corpus based approaches


These approaches find the similarity between terms based on a corpus (a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject). The approaches used are:

 Latent Semantic Analysis (LSA)

A statistical method that leverages word co-occurrence in a large unlabeled corpus of texts. LSA assumes that words with similar meanings occur in similar pieces of text.

Merits: reduced redundancy. Demerits: cannot measure the degree of similarity between two relations.
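
A minimal LSA sketch using scikit-learn (my choice of library; the toy corpus is illustrative): build a TF-IDF term-document matrix, reduce it with truncated SVD to a low-dimensional latent space, and compare documents by cosine similarity there:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat and a kitten played on the mat",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]
X = TfidfVectorizer().fit_transform(docs)          # term-document matrix
Z = TruncatedSVD(n_components=2).fit_transform(X)  # latent semantic space
print(cosine_similarity(Z))  # the two cat documents pair up, as do the two market documents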

 Generalized Latent Semantic Analysis (GLSA)

It is an extension of the LSA approach that focuses on term vectors; it requires a dimensionality reduction method and a measure of semantic association.

Merits: efficiently captures semantic relations between terms. Demerits: noise.

 Explicit Semantic Analysis (ESA)

Explicit Semantic Analysis (ESA) computes the semantic relatedness of texts with the help of huge knowledge-base repositories such as Wikipedia and the Open Directory Project (ODP): each text is represented as a weighted vector of knowledge-base concepts.

Merits: can use Wikipedia. Demerits: does not exploit the link structure.
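
A toy ESA sketch (the mini "articles" below stand in for Wikipedia pages and are purely illustrative): represent a word as a TF-IDF weighted vector over concept articles and compare words by the cosine of their concept vectors:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

concepts = {                      # pretend each entry is a Wikipedia article
    "Finance": "bank money stock market interest investor loan",
    "Rivers": "river bank water stream flood shore fishing",
    "Food": "bread butter cheese kitchen recipe meal",
}
vec = TfidfVectorizer().fit(concepts.values())
M = vec.transform(concepts.values())      # concept-by-term matrix

def concept_vector(word):
    # the word's TF-IDF weight in each concept article
    return (M @ vec.transform([word]).T).toarray().ravel()

a, b = concept_vector("money"), concept_vector("investor")
print(cosine_similarity([a], [b]))  # related: both load on the "Finance" concept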

There are more techniques that can be used with the corpus based approach.

 Ontology based Approaches


Ontology based approaches determine concept pairs using a Resource Description Framework (RDF). They cover a wide range of algorithms and approaches, such as:

 Lexical Resource based Approaches:

This approach uses lexical resources such as WordNet and Wikipedia to compute the semantic similarity.

Techniques we can use:

1) Directed Acyclic Graph (DAG): uses WordNet and DAG theory.

Merits: combining WordNet and DAG theory provides better results. Demerits: in WordNet, words have more than one sense.

2) Spreading activation strategy: uses Wikipedia.

Merits: Wikipedia increases accuracy. Demerits: hard to compute.

 Ontology Based Approach

Computer programs map the various domain-specific ontologies through different similarity measures and corpora. Similarity measures used in this technique are cosine similarity, the Jaccard coefficient, and a novel market-based model, as sketched below.
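
Minimal sketches of the two standard measures just named (the vectors and sets are illustrative; the novel market-based model is specific to the surveyed work and not reproduced here):

import math

def cosine(u, v):
    # cosine similarity: dot product over the product of vector magnitudes
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def jaccard(a, b):
    # Jaccard coefficient: shared elements over all distinct elements
    return len(a & b) / len(a | b)

print(cosine([1, 2, 0], [2, 3, 1]))                             # vector-based similarity
print(jaccard({"dog", "cat", "pet"}, {"dog", "pet", "leash"}))  # 2 shared / 4 total = 0.5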

Note: In conventional methods, accurate information gathering suffers because the web user profile is obtained automatically. The novel technique solves this complexity; its principal intention is to construct an absolute concept model that discovers ontologies automatically from data sets. Moreover, it offers a method for developing patterns to process the discovered ontologies, and it achieves this intention efficiently.

Techniques:

1) Compact Concept Ontology (CCO): uses WordNet.

Merits: better performance. Demerits: cannot be applied to context-based video retrieval.

2) New ontology-based measure (or new feature-based measure): uses taxonomical features.

Merits: simple. Demerits: slow performance.

3) Ontology mining: uses an association set.

Merits: effectively uses the discovered knowledge. Demerits: difficult to write adequate descriptions and narratives.

4) Market basket model: uses a document corpus.

Merits: prediction error is reduced using the corpus structural information. Demerits: scalability problems and inefficiency.

 Relational based Approaches

These approaches consider the relatedness among words. The main models are:

Vector Space Model (VSM):

The VSM defines the relationship between a word pair using vectors of frequencies of predefined patterns drawn from a large corpus.

Merits: reduces the number of nodes and the computation cost. Demerits: does not consider lexical similarity.
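
A toy sketch of the pattern-frequency idea (the patterns and counts are invented for illustration): each word pair becomes a vector of how often it appears in each pattern, and pairs with similar vectors stand in similar relations:

import math

patterns = ["X cuts Y", "X works with Y", "X eats Y"]
pair_vectors = {
    ("mason", "stone"): [12, 30, 0],     # illustrative corpus counts per pattern
    ("carpenter", "wood"): [15, 25, 0],
    ("cat", "mouse"): [0, 1, 40],
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

print(cosine(pair_vectors[("mason", "stone")], pair_vectors[("carpenter", "wood")]))  # high
print(cosine(pair_vectors[("mason", "stone")], pair_vectors[("cat", "mouse")]))       # low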

Latent Relational Analysis:

Latent Relational Analysis (LRA) extends the VSM to overcome its drawbacks in measuring the semantic relations between two pairs of words. LRA enhances the VSM approach by deriving patterns automatically from the corpus, smoothing the frequency data using singular value decomposition, and reformulating the word pairs based on synonyms. In terms of performance, LRA achieves a substantial improvement over VSM.

Merits: can be applied to many applications. Demerits: high error rate.

Lexical Pattern

Combines two distinctive approaches, statistical methods and lexico-syntactic patterns, to extract semantic relations from text.

Merits: extracts the generic and associative relations between words. Demerits: low accuracy.
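
A minimal lexico-syntactic pattern sketch in the style of Hearst patterns, where "X such as Y" is read as Y is-a X (the sentence and the single pattern are illustrative; real systems combine many patterns with corpus statistics):

import re

text = "They studied vehicles such as cars, buses, and trucks."
pattern = re.compile(r"(\w+) such as ((?:\w+(?:, )?(?:and )?)+)")

for hypernym, tail in pattern.findall(text):
    for hyponym in re.findall(r"\w+", tail):
        if hyponym != "and":
            print(f"{hyponym} is-a {hypernym}")  # cars is-a vehicles, ...
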
 Hybrid Approaches:

The hybrid approach combines the semantic, corpus, ontology, and relational based approaches.

It improves on simple models, for example by combining the lexical taxonomy structure with corpus statistical information. This simplifies the semantic distance calculation between nodes in the semantic space and upgrades the edge based approach to the better, more detailed node based approach; another improved model is the novel hybrid approach.

Techniques we can use (see the sketch after this list):

1) A corpus-based approach that combines the lexical taxonomy structure with corpus statistical information: depends on occurrence counts.

Merits: very high correlation value. Demerits: consumes a lot of resources.

2) A hybrid approach with WordNet and the Internet as a corpus.

Merits: using the Internet yields higher accuracy. Demerits: data sparseness of the corpus.
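
As an illustration of technique 1, a minimal sketch using NLTK's Lin similarity, a well-known hybrid measure that combines WordNet's taxonomy structure with information content estimated from the Brown corpus (that this matches the exact measure surveyed in the paper is an assumption):

import nltk
nltk.download("wordnet", quiet=True)
nltk.download("wordnet_ic", quiet=True)

from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")  # corpus statistics
car, bus = wn.synset("car.n.01"), wn.synset("bus.n.01")
print(car.lin_similarity(bus, brown_ic))  # in [0, 1]; blends taxonomy + corpus IC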

Check Table 2 in the paper for more methods to research; it is very good.
