Presented By:
Kushagra Sharma (286/CO/12)
Lakshay Bansal (287/CO/12)
Abstract
In the past, several approaches have been proposed that assess similarity by
evaluating the knowledge modeled in one or more ontologies.
Introduction
Definition: Semantic similarity is understood as the degree of
taxonomic proximity between concepts (or terms, words).
In other words, semantic similarity states how taxonomically near
two concepts (or terms, words) are, because they share some
aspects of their meaning.
Technically, similarity measures compute a numerical score that
quantifies this proximity as a function of the semantic evidence
observed in one or several knowledge sources.
DISADVANTAGES of ONTOLOGY-BASED METHODS
The construction of domain ontologies is clearly time-consuming and error-prone,
and maintaining them also demands substantial effort from experts. Ontology-based
similarity measures are therefore limited in scope and scalability.
With the emergence of social networks and instant messaging systems, many
concepts and terms (proper nouns, brands, acronyms, new words, conversational
words, technical terms, and so on) are not included in WordNet or domain
ontologies; indeed, Web users can now publish whatever they want to share with
the rest of the world through wikis, blogs, and online communities. Similarity
measures based on these kinds of knowledge resources therefore cannot be used
for such tasks.
These limitations motivate the new techniques presented in this paper, which
infer semantic similarity from a new kind of information source: a
wide-coverage online encyclopedia, namely Wikipedia.
A feature-based function called X-similarity relies on matching between synsets and term
description sets. A term description set contains the words extracted by parsing the term's
definition. Two terms are similar if their synsets, their description sets, or the synsets of the
terms in their neighborhood (e.g., more specific and more general terms) are lexically similar.
The similarity function is expressed as follows
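As a rough sketch of the idea (not the exact X-similarity formula from the paper), lexical overlap between two word sets can be scored with a Jaccard ratio, and the overall score can take the strongest signal across the feature sets; treating a shared synset as maximal similarity is an assumption made here for illustration.

```python
def jaccard(a, b):
    """Set overlap |A ∩ B| / |A ∪ B|, 0 if both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def x_similarity_sketch(synsets1, synsets2, desc1, desc2, neigh1, neigh2):
    # Illustrative reading of X-similarity: if the terms share a synset,
    # treat them as maximally similar; otherwise take the strongest
    # lexical overlap among description sets and neighborhood synsets.
    if jaccard(synsets1, synsets2) > 0:
        return 1.0
    return max(jaccard(desc1, desc2), jaccard(neigh1, neigh2))
```

For example, two terms with disjoint synsets but overlapping definitions would score by description-set overlap alone.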
WIKIPEDIA
The structure of Wikipedia is as follows:
Articles: The basic unit of information in Wikipedia is the article. Articles are written as free
text that follows a comprehensive set of editorial and structural guidelines in order to promote
consistency and cohesion. Each article describes a single concept, and there is a single article
for each concept. Article titles are succinct phrases that resemble terms in a conventional
thesaurus.
Redirect pages: A redirect page is one with no text other than a directive in the form of a direct link.
These pages are used to redirect the query to the actual article page containing information about
the entity denoted by the query. This is used to point alternative expressions for an entity to the
same article, and accordingly models synonymy.
Disambiguation pages: Instead of taking readers to an article named by the term, the Wikipedia
search engine sometimes takes them directly to a disambiguation page, where they can click on
the meaning they want. These pages collect links for the possible entities the original
query could refer to. This models homonymy.
Hyperlinks: Articles are peppered with hyperlinks to other articles. Because the terms used as
anchors are often couched in different words, Wikipedia's hyperlinks are also useful as an
additional source of synonyms not captured by redirects. Hyperlinks also complement
disambiguation pages by encoding polysemy. In particular, articles mentioning other encyclopedic
entries point to them through internal hyperlinks. This models article cross-reference.
Category structure: Since May 2004, Wikipedia has also provided a semantic network by means of its
categories: articles can be assigned one or more categories, which are further categorized to
form a so-called category tree. In practice, this tree is not designed as a strict hierarchy,
but allows multiple categorization schemes to coexist simultaneously.
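The structural elements above can be pictured with small, hypothetical data structures (the article titles and mappings below are invented for illustration): redirects map alternative expressions to one canonical article (synonymy), while disambiguation pages map one term to several candidate articles (homonymy).

```python
# Hypothetical fragments of the structures described above.
redirects = {"USA": "United States", "U.S.": "United States"}    # synonymy
disambiguation = {"Jaguar": ["Jaguar (animal)", "Jaguar Cars"]}  # homonymy
categories = {                                                   # category tree
    "United States": ["Countries in North America"],
    "Jaguar (animal)": ["Felines"],
}

def resolve(term):
    """Follow a redirect to the canonical article title, or return the
    candidate articles when the term is ambiguous."""
    if term in redirects:
        return [redirects[term]]
    return disambiguation.get(term, [term])
```

For instance, resolve("USA") yields the single canonical article, while resolve("Jaguar") yields both candidate meanings.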
Let Con1 = <Synonyms1, Glosses1, Anchors1, Categories1> and Con2 = <Synonyms2, Glosses2,
Anchors2, Categories2> be two Wikipedia concepts. The similarity of Con1 and Con2, denoted
SimCon(Con1, Con2), is the function
SimCon: WikiCon × WikiCon → [0, 1], defined as follows:
SimCon(Con1, Con2) = Sconcepts(Ssynonyms(Synonyms1, Synonyms2), Sglosses(Glosses1, Glosses2),
Sanchors(Anchors1, Anchors2), Scategories(Categories1, Categories2))
{WikiCon stands for the set of all Wikipedia concepts; Synsets, Glosssets, Anchorsets, and
Categorysets denote the sets of all synonym sets, gloss sets, anchor sets, and category sets
of Wikipedia concepts in Wikipedia, respectively.}
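The framework leaves the per-feature scorers and the aggregator Sconcepts open. A minimal sketch, assuming Jaccard overlap for each feature and the maximum as aggregator (both choices are assumptions for illustration, not the paper's specific instantiations):

```python
def jaccard(a, b):
    """Set overlap |A ∩ B| / |A ∪ B|, 0 if both sets are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def sim_con(con1, con2, combine=max):
    """Score each feature pair (synonyms, glosses, anchors, categories)
    and aggregate the four scores into a single value in [0, 1]."""
    return combine(jaccard(f1, f2) for f1, f2 in zip(con1, con2))

# Hypothetical concepts as (Synonyms, Glosses, Anchors, Categories) tuples.
puma = ({"puma", "cougar"}, {"large", "cat"}, {"felidae"}, {"mammals"})
cougar = ({"cougar"}, {"big", "cat"}, {"felidae"}, {"mammals"})
print(sim_con(puma, cougar))  # → 1.0, via the identical anchor/category sets
```

Swapping `combine` for a weighted average would keep the range [0, 1] while letting all four features contribute.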
Comparison of Various Approaches to Human Judgements
Results on correlation with human judgements of similarity measures.
Analysis of Experimental Results
The approaches SimFirCon, SimSecCon, SimThiCon, and SimFouCon perform relatively well,
with the lowest correlation being 0.520 and the highest 0.827.
Exploiting the R&E approach provides better performance than the X-similarity
approach in feature-based similarity assessment of Wikipedia concepts.
In the R&E approach, the similarity measure is based on a normalization of Tversky's model and
the set-theoretic functions of intersection and difference. Moreover, the R&E approach also
considers the relative importance of the noncommon characteristics.
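Tversky's ratio model weighs the common features of two concepts against each concept's distinctive features; the weights alpha and beta express the relative importance of the noncommon characteristics. A minimal sketch (the exact normalization used by R&E may differ):

```python
def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky's ratio model: common features over common plus weighted
    noncommon features. alpha and beta set the relative importance of
    each concept's distinctive features; alpha = beta = 0.5 gives a
    symmetric, Dice-like measure in [0, 1]."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0
```

With alpha ≠ beta the measure becomes asymmetric, which is useful when one concept is clearly the more general of the pair.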
SimFifCon and SimSixCon obtain the lowest correlation coefficients on all benchmarks.
In the similarity computation of SimFifCon and SimSixCon, the maximum is obtained by
Scategories. That is to say, although SimFifCon and SimSixCon consider three features
(i.e., glosses, anchor texts, and categories), only one feature (i.e., categories) is exploited
in the practical computation. A similar situation occurs for the approaches SimNinCon and
SimTenCon. This shows that if similarity assessment of Wikipedia concepts relies only on the
Wikipedia category structure, such feature-based similarity computation methods yield
substantially inferior results.
Conclusion
The final goal of computerized similarity measures is to accurately mimic human
judgements about semantic similarity.
In this paper, some limitations of the existing feature-based measures are
identified, such as relying on one or more predefined domain ontologies and
fitting only static (i.e., non-dynamic) domains.
To implement feature-based semantic similarity measurement using Wikipedia,
a formal representation of Wikipedia concepts is presented. Then, a
framework for feature-based similarity built on this formal representation of Wikipedia
concepts is given.
The evaluation, based on several widely used benchmarks and a benchmark
developed in this paper, supports the intuitions with respect to human judgements.
Overall, several methods presented here correlate well with human judgements and constitute
effective ways of determining similarity between Wikipedia concepts. In addition,
considering the limitations (e.g., small size) of the existing standard benchmarks for
concept similarity assessment, we will pursue the design of a new benchmark specially
focused on Wikipedia concepts.