
Feature-based approaches to semantic similarity assessment of concepts using Wikipedia

Yuncheng Jiang, Xiaopei Zhang, Yong Tang, Ruihua Nie

Presented By:
Kushagra Sharma (286/CO/12)
Lakshay Bansal (287/CO/12)

Abstract
In the past, several approaches have been proposed to assess similarity by evaluating the knowledge modeled in one or multiple ontologies.
However, the existing measures have some limitations, such as relying on predefined ontologies and fitting only static (non-dynamic) domains.
In this paper, some novel feature based similarity assessment methods are proposed that depend entirely on Wikipedia and can avoid most of the limitations and drawbacks of the previous methods.

Introduction
Definition: Semantic similarity is understood as the degree of
taxonomic proximity between concepts (or terms, words).
In other words, semantic similarity states how taxonomically near
two concepts (or terms, words) are, because they share some
aspects of their meaning.
Technically, similarity measures assess a numerical score that quantifies this proximity as a function of the semantic evidence observed in one or several knowledge sources.

Ontology based methods to estimate similarity
Edge Counting Measures
Information Content Measures
Feature Based Measures
Hybrid Measures

Edge counting measures:
- take into account the length of the path linking the concepts (or terms) and the position of the concepts (or terms) in a given dictionary (or taxonomy, ontology).
The main advantage of edge counting measures is their simplicity. They rely only on the graph model of an input ontology, whose evaluation requires a low computational cost.
Due to this simplicity, these approaches offer limited accuracy, because ontologies model a large amount of taxonomical knowledge that is not considered during the evaluation of the minimum path. From another perspective, the main assumption of edge counting measures is that an edge represents the same semantic distance anywhere in the structure of the graph (or path), which is not true, as some sections of the graph may be finely classified and others only coarsely defined.
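As a concrete illustration of the idea (a minimal sketch, not this paper's method), a path-length measure over a toy taxonomy can be written in a few lines of Python; the taxonomy contents and the inverse-path-length scoring are assumptions chosen for illustration:

from collections import deque

# Toy undirected is-a graph; the edges are assumptions for illustration only.
TAXONOMY = {
    "vehicle": ["car", "bicycle"],
    "car": ["vehicle", "sedan"],
    "bicycle": ["vehicle"],
    "sedan": ["car"],
}

def shortest_path_length(a, b):
    """BFS over the taxonomy graph; returns the edge count between a and b."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nbr in TAXONOMY.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None  # concepts are disconnected

def path_similarity(a, b):
    """Edge counting similarity: inversely proportional to path length."""
    d = shortest_path_length(a, b)
    return None if d is None else 1.0 / (1.0 + d)

print(path_similarity("sedan", "bicycle"))  # 1 / (1 + 3) = 0.25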

Information Content Measures
- consist of measuring the difference in the information content of the two concepts (or terms) as a function of their probability of occurrence in a text corpus (or an ontology).
Information Content (IC) based approaches assess the similarity between concepts as a function of the information content that both concepts have in common in a given ontology. In the past, IC was typically computed from concept distribution in tagged textual corpora. However, this introduces a dependency on corpus availability and manual tagging that hampers accuracy and applicability due to data sparseness.
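As an illustration of the classic corpus-based formulation (a minimal sketch under assumed toy counts, not this paper's method), IC is commonly defined as IC(c) = -log p(c), and Resnik's measure takes the IC of the two concepts' least common subsumer:

import math

# Toy corpus frequencies (assumptions for illustration); counts are assumed
# to propagate upward, so the root concept sees every occurrence.
FREQ = {"entity": 1000, "vehicle": 120, "car": 60, "bicycle": 30}
TOTAL = FREQ["entity"]

def ic(concept):
    """Information content: the rarer a concept, the more informative it is."""
    return -math.log(FREQ[concept] / TOTAL)

def resnik_similarity(a, b, lcs):
    """Resnik: similarity is the IC of the least common subsumer (LCS)."""
    return ic(lcs)

# 'vehicle' is assumed to be the LCS of 'car' and 'bicycle' in the toy data.
print(resnik_similarity("car", "bicycle", lcs="vehicle"))  # -log(0.12) ≈ 2.12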

Feature based measures
- consist of measuring the similarity between concepts (or terms) as a function of their properties, or based on their relationships to other similar concepts (or terms).
Feature based approaches estimate similarity between concepts according to the weighted sum of the amount of common and non-common features.
By features, authors usually consider taxonomic and non-taxonomic information modeled in an ontology, in addition to concept descriptions (e.g., glosses) retrieved from dictionaries. Due to the additional semantic evidence considered during the assessment, they potentially improve on edge counting approaches.

DISADVANTAGES OF ONTOLOGY BASED METHODS
Clearly, the construction process of domain ontologies is time-consuming and error-prone, and maintaining these ontologies also requires a lot of effort from experts. Thus, ontology based similarity measures are limited in scope and scalability.
With the emergence of social networks and instant messaging systems, many (sets of) concepts or terms (proper nouns, brands, acronyms, new words, conversational words, technical terms and so on) are not included in WordNet or domain ontologies (in fact, Web users can now publish whatever they want to share with the rest of the world using Wikis, Blogs and online communities); therefore, similarity measures based on these kinds of knowledge resources cannot be used for such tasks.
These limitations are the motivation behind the new techniques presented in this paper, which infer semantic similarity from a new kind of information source: a wide-coverage online encyclopedia, namely Wikipedia.

Feature based similarity

Feature based approaches to similarity assess similarity between concepts as a function of their properties.
Common features tend to increase similarity and non-common ones tend to diminish it.
Admitting a function ψ(c) that yields the set of features relevant to c, Tversky proposed the following similarity function:

Sim(a, b) = α·F(ψ(a) ∩ ψ(b)) − β·F(ψ(a) \ ψ(b)) − γ·F(ψ(b) \ ψ(a))

where F is some function that reflects the salience of a set of features, ψ(a) ∩ ψ(b) is the intersection of the two feature sets, ψ(a) \ ψ(b) is the set obtained by eliminating the elements of ψ(b) from ψ(a), the set of features of concept a, and α, β, and γ are parameters that provide for differences in focus on the different components.
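A minimal sketch of Tversky's model in Python, taking F as simple set cardinality and α = 1, β = γ = 0.5 (these concrete choices are assumptions for illustration):

def tversky_similarity(features_a, features_b, alpha=1.0, beta=0.5, gamma=0.5):
    """Tversky's feature-contrast model with F = set cardinality."""
    common = len(features_a & features_b)  # F(ψ(a) ∩ ψ(b))
    only_a = len(features_a - features_b)  # F(ψ(a) \ ψ(b))
    only_b = len(features_b - features_a)  # F(ψ(b) \ ψ(a))
    return alpha * common - beta * only_a - gamma * only_b

car = {"wheels", "engine", "doors", "seats"}
bike = {"wheels", "pedals", "seats"}
print(tversky_similarity(car, bike))  # 1*2 - 0.5*2 - 0.5*1 = 0.5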
Rodriguez and Egenhofer (R&E) present an approach to computing semantic similarity in which similarity is computed as the weighted sum of the similarities between the synsets, features (e.g., meronyms, attributes, etc.) and neighbor concepts (those linked via semantic pointers) of the evaluated terms:

SimR&E(a, b) = w·Ssynsets(a, b) + u·Sfeatures(a, b) + v·Sneighborhoods(a, b)

where the functions Ssynsets, Sfeatures, and Sneighborhoods are the similarity between the synonym sets, features, and semantic neighborhoods of the evaluated terms, and w, u, and v (w, u, v ≥ 0) are the respective weights of the similarity of each specification component, which depend on the characteristics of the ontologies.

S represents the overlap between the corresponding feature sets, computed by normalizing Tversky's model as follows:

S(a, b) = |A ∩ B| / (|A ∩ B| + α(a, b)·|A \ B| + (1 − α(a, b))·|B \ A|)

where A and B are the corresponding feature sets of a and b, and α(a, b) ∈ [0, 1] weights the non-common features as a function of the depth of the concepts in the taxonomy.
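A minimal sketch of this normalized overlap and the weighted sum in Python; fixing α to a constant 0.5 (instead of deriving it from taxonomy depths) and the example weights are assumptions for illustration:

def re_overlap(set_a, set_b, alpha=0.5):
    """Tversky-normalized overlap between two feature sets."""
    common = len(set_a & set_b)
    denom = common + alpha * len(set_a - set_b) + (1 - alpha) * len(set_b - set_a)
    return common / denom if denom else 0.0

def re_similarity(a, b, weights=(0.4, 0.3, 0.3)):
    """Weighted sum over synset/feature/neighborhood overlaps.
    a and b are dicts with keys 'synsets', 'features', 'neighborhoods';
    the weights (w, u, v) are illustrative, not the paper's values."""
    w, u, v = weights
    return (w * re_overlap(a["synsets"], b["synsets"])
            + u * re_overlap(a["features"], b["features"])
            + v * re_overlap(a["neighborhoods"], b["neighborhoods"]))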

A feature based function called X-similarity relies on the matching between synsets and term description sets. The term description sets contain words extracted by parsing term definitions. Two terms are similar if their synsets, or their description sets, or the synsets of the terms in their neighborhood (e.g., more specific and more general terms) are lexically similar. The similarity function is expressed as follows:

SimX(a, b) = 1, if Ssynsets(a, b) > 0; otherwise max{Sneighborhoods(a, b), Sglosses(a, b)}

The similarity for the semantic neighborhoods, Sneighborhoods, is calculated as follows:

Sneighborhoods(a, b) = maxi |Ai ∩ Bi| / |Ai ∪ Bi|

where i denotes the relationship type (Ai and Bi being the sets of neighbor terms of a and b linked by relationship i), and each individual overlap is the Jaccard coefficient |A ∩ B| / |A ∪ B|.
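A minimal sketch of X-similarity in Python; representing glosses as sets of lowercase words and neighborhoods as a dict keyed by relationship type are assumptions for illustration:

def jaccard(set_a, set_b):
    """Lexical overlap |A ∩ B| / |A ∪ B|; 0 for two empty sets."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def x_similarity(a, b):
    """X-similarity: any synset match wins; otherwise the best of the
    neighborhood and gloss overlaps. a and b are dicts with keys
    'synsets', 'glosses', and 'neighborhoods' (relationship type ->
    set of neighbor terms, e.g. 'hypernym' -> {...})."""
    if jaccard(a["synsets"], b["synsets"]) > 0:
        return 1.0
    s_neigh = max((jaccard(a["neighborhoods"][i], b["neighborhoods"][i])
                   for i in a["neighborhoods"].keys() & b["neighborhoods"].keys()),
                  default=0.0)
    return max(s_neigh, jaccard(a["glosses"], b["glosses"]))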

WIKIPEDIA
The structure of Wikipedia is as follows :
Articles: The basic unit of information in Wikipedia is the article. Articles are written in a form of free
text that follows a comprehensive set of editorial and structural guidelines in order to promote
consistency and cohesion. Each article describes a single concept, and there is a single article
for each concept. Article titles are succinct phrases that resemble terms in a conventional
thesaurus.
Redirect pages: A redirect page is one with no text other than a directive in the form of a direct link.
These pages are used to redirect the query to the actual article page containing information about
the entity denoted by the query. This is used to point alternative expressions for an entity to the
same article, and accordingly models synonymy.
Disambiguation pages: Instead of taking readers to an article named by the term, the Wikipedia search engine sometimes takes them directly to a disambiguation page where they can click on the meaning they want. These pages collect links for a number of possible entities the original query could refer to. This models homonymy.
Hyperlinks: Articles are peppered with hyperlinks to other articles. Because the terms used as anchors are often couched in different words, Wikipedia's hyperlinks are also useful as an additional source of synonyms not captured by redirects. Hyperlinks also complement disambiguation pages by encoding polysemy. In particular, articles mentioning other encyclopedic entries point to them through internal hyperlinks. This models article cross-reference.
Category structure: Since May 2004, Wikipedia also provides a semantic network by means of its categories: articles can be assigned one or more categories, which are further categorized to provide a so-called category tree. In practice, this tree is not designed as a strict hierarchy, but allows multiple categorization schemes to coexist simultaneously.

Feature-based similarity using Wikipedia
Formal representation of Wikipedia concepts:
Let A be a Wikipedia article and Con be the title of A. The formal representation of the Wikipedia concept Con is defined as follows:

Con = <Synonyms, Glosses, Anchors, Categories>


where Synonyms = {Con, Con1, . . . , Conm} is the set of synonyms of Con, Glosses is the first
paragraph of text of A, Anchors = {Anc1, . . . , Ancn} is the set of anchor texts (i.e., labels of
internal hyperlinks) in A, and Categories = {Cat1, . . . , Catk} is the set of categories of A.
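This 4-tuple maps naturally onto a small data structure; a minimal sketch in Python (the field values in the toy instance are assumptions, not real Wikipedia data):

from dataclasses import dataclass, field

@dataclass
class WikiConcept:
    """Formal representation of a Wikipedia concept:
    Con = <Synonyms, Glosses, Anchors, Categories>."""
    title: str
    synonyms: set = field(default_factory=set)    # title plus redirect titles
    glosses: set = field(default_factory=set)     # words of the first paragraph
    anchors: set = field(default_factory=set)     # labels of internal hyperlinks
    categories: set = field(default_factory=set)  # assigned category names

# Toy instance (values are assumptions for illustration).
car_con = WikiConcept(
    title="Car",
    synonyms={"Car", "Automobile", "Motor car"},
    glosses={"a", "wheeled", "motor", "vehicle"},
    anchors={"vehicle", "engine", "road"},
    categories={"Cars", "Motor vehicles"},
)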

A framework for feature-based similarity

Let Con1 = <Synonyms1, Glosses1, Anchors1, Categories1> and Con2 = <Synonyms2, Glosses2,
Anchors2, Categories2> be two Wikipedia concepts. The similarity of Con1 and Con2, denoted
as SimCon(Con1, Con2), is the function
SimCon: WikiCon × WikiCon → [0, 1] and is defined as follows:
SimCon(Con1, Con2) = Sconcepts(Ssynonyms(Synonyms1, Synonyms2), Sglosses(Glosses1, Glosses2), Sanchors(Anchors1, Anchors2), Scategories(Categories1, Categories2))
{ WikiCon stands for the set of all Wikipedia concepts; Synsets, Glosssets, Anchorsets, and Categorysets denote the sets of all synonym sets, gloss sets, anchor sets, and category sets of Wikipedia concepts in Wikipedia, respectively. }
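A minimal sketch of the framework in Python, reusing the WikiConcept sketch above; Sconcepts is kept pluggable as an `aggregate` function, and `set_sim` stands in for any concrete per-feature similarity (e.g. the `jaccard` helper defined earlier):

def sim_con(con1, con2, set_sim, aggregate):
    """Framework instantiation point: combine the four per-feature
    similarities into a single score in [0, 1].
    set_sim: similarity between two feature sets;
    aggregate: the Sconcepts function, e.g. average or max."""
    scores = [
        set_sim(con1.synonyms, con2.synonyms),
        set_sim(con1.glosses, con2.glosses),
        set_sim(con1.anchors, con2.anchors),
        set_sim(con1.categories, con2.categories),
    ]
    return aggregate(scores)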

MATHEMATICAL MODELLING OF FEATURE BASED ASSESSMENT
We can obtain different feature based approaches to similarity assessment resulting from instantiations of the framework.
Without loss of generality, we assume that there are two sets of terms (or words, concepts), Set1 and Set2. Obviously, these two sets may be Synonyms, Glosses, Anchors, or Categories. According to the X-similarity approach or the R&E approach (Rodriguez & Egenhofer), we have the following similarity computation methods for Set1 and Set2:

SXsim(Set1, Set2) = |Set1 ∩ Set2| / |Set1 ∪ Set2|
SR&E(Set1, Set2) = |Set1 ∩ Set2| / (|Set1 ∩ Set2| + λ·|Set1 \ Set2| + (1 − λ)·|Set2 \ Set1|)

{ where 0 ≤ λ ≤ 1 and the value of λ is specified by users or experts (λ = 0.5 by default) }
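In code, these are exactly the `jaccard` and `re_overlap` helpers sketched earlier (with `alpha` playing the role of λ); a quick usage example on toy sets:

set1 = {"car", "automobile", "vehicle"}
set2 = {"vehicle", "bicycle"}

print(jaccard(set1, set2))          # SXsim: 1/4 = 0.25
print(re_overlap(set1, set2, 0.5))  # SR&E: 1 / (1 + 0.5*2 + 0.5*1) = 0.4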

Given two Wikipedia concepts
Con1 = <Synonyms1, Glosses1, Anchors1, Categories1> and
Con2 = <Synonyms2, Glosses2, Anchors2, Categories2>,
according to the notions of SXsim and SR&E, we have the following approaches to similarity measures for Wikipedia concepts (supposing that the function Sconcepts is the average or the max):
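Continuing the sketches above, each measure pairs one set-similarity function with one aggregator; the slides name the resulting combinations SimFirCon through SimTenCon, and four illustrative pairings are shown below (which pairing carries which name is an assumption here):

from statistics import mean

def sim_fir_con(c1, c2):  # SXsim overlap + average
    return sim_con(c1, c2, jaccard, mean)

def sim_sec_con(c1, c2):  # SR&E overlap + average
    return sim_con(c1, c2, re_overlap, mean)

def sim_thi_con(c1, c2):  # SXsim overlap + max
    return sim_con(c1, c2, jaccard, max)

def sim_fou_con(c1, c2):  # SR&E overlap + max
    return sim_con(c1, c2, re_overlap, max)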

Comparison of various approaches to human based judgements

Results on correlation with human judgements of similarity measures. (Table columns, from left to right: measure approach, correlation for the M&C benchmark, correlation for the R&G benchmark, and correlation for the 353-TC benchmark.)

Benchmark and experimental results (based on 30 students' and professors' judgements)

Results on correlation with our benchmark of similarity measures.

Analysis of Experimental Results

The approaches SimFirCon, SimSecCon, SimThiCon, and SimFouCon perform relatively well, with the lowest correlation being 0.520 and the highest 0.827.
The exploitation of the R&E approach provides better performance than the X-similarity approach in feature based similarity assessment of Wikipedia concepts.
In the R&E approach, the similarity measure is based on the normalization of Tversky's model and the set-theory functions of intersection and difference. Moreover, the relative importance of the non-common characteristics is also considered in the R&E approach.
SimFifCon and SimSixCon obtain the lowest correlation coefficients on all benchmarks. Regarding the similarity computation of SimFifCon and SimSixCon, the maximum is obtained by Scategories. That is to say, in SimFifCon and SimSixCon, while three features (i.e., glosses, anchor texts, and categories) are considered, only one feature (i.e., categories) is exploited in the practical computation. A similar situation is repeated for the approaches SimNinCon and SimTenCon. This shows that if similarity assessment of Wikipedia concepts relies only on the Wikipedia category structure, such feature based similarity computation methods yield substantially inferior results.

Conclusion
The final goal of computerized similarity measures is to accurately mimic human judgements about semantic similarity.
In this paper, some limitations of the existing feature based measures are identified, such as relying on one or multiple predefined domain ontologies and fitting only static (i.e., non-dynamic) domains.
To implement feature based semantic similarity measurement using Wikipedia, a formal representation of Wikipedia concepts is presented. Then, a framework for feature based similarity, built on this formal representation of Wikipedia concepts, is given.
The evaluation, based on several widely used benchmarks and a benchmark developed in this paper, sustains the intuitions with respect to human judgements. Overall, several of the methods presented here correlate well with human judgements and constitute effective ways of determining similarity between Wikipedia concepts. In addition, considering the limitations (e.g., small size) of the existing standard benchmarks for concept similarity assessment, the authors will pursue the design of a new benchmark specially focused on Wikipedia concepts.
