
2012 International Conference on Information Technology and e-Services

Semantic similarity measure based on multiple resources
Imen Akermi
LARODEC, ISG Tunis
University of Tunis
Tunis, Tunisia
imenakermi@yahoo.fr

Rim Faiz
LARODEC, IHEC Carthage
University of Carthage
Carthage Présidence, Tunisia
Rim.Faiz@ihec.rnu.tn

Abstract: The ability to accurately judge the semantic similarity between words is critical to the performance of several applications such as Information Retrieval and Natural Language Processing. Therefore, in this paper we propose a semantic similarity measure that uses, on the one hand, an online English dictionary provided by the Semantic Atlas project of the French National Centre for Scientific Research (CNRS) and, on the other hand, a page-count-based metric computed from a social website whose content is generated by users.

Keywords: Similarity measures, Web search, Semantic similarity, User-generated content, Information Retrieval.

I. INTRODUCTION
With the development of the Web, a great amount of information has become available on Web pages, which opens new perspectives for sharing information and allows the collaborative construction of content and the development of social networks. This shared information represents a collective intelligence, or what we prefer to call human swarm knowledge. It is a remarkable potential that we take advantage of in our work in order to measure the similarity between words.
Measures of the semantic similarity of words have been
used for a long time in applications in Natural Language
Processing and related areas, such as the automatic creation of
thesauri [10], [19], text classification [8], word sense
disambiguation [15], [16] and information extraction and
retrieval [5], [30]. Indeed, one of the main goals of these
applications is to facilitate user access to relevant information.
According to Baeza-Yates and Ribeiro-Neto [3], an
information retrieval system should provide the user with easy
access to the information in which he is interested.
Therefore, various methods have been proposed: corpus-based methods using lexical resources such as WordNet¹ and the Brown corpus² [18], [15], [12], [19], and Web-based metrics [20], [28], [1].
In this same context, we propose a method that uses
information available on the Web to measure semantic
similarity between a pair of words.

¹ http://wordnet.princeton.edu/
² http://icame.uib.no/brown/

978-1-4673-1166-3/12/$31.00 © 2012 IEEE

The rest of the document is organized as follows: Section 2 introduces related work on semantic similarity methods. In Section 3, we present our approach for measuring semantic similarity between words. In Section 4, we evaluate our method in order to demonstrate its effectiveness. In Section 5, we conclude with a few remarks and some perspectives.
II. RELATED WORK
Many previous works on semantic similarity have used
manually compiled taxonomies such as WordNet and large text
corpora [18], [15], [12], [19].
Li et al. [17] combined structural semantic information from a lexical taxonomy and information content from a corpus in a nonlinear model. They proposed a similarity measure that uses shortest path length, depth and local density in a taxonomy. Lin [19] defined the similarity between two concepts in terms of the information that is common to both concepts and the information contained in each individual concept. However, one major issue with taxonomy- and corpus-oriented approaches is that they might not capture similarity between proper names such as named entities (e.g., personal names, location names, product names) or the new uses of existing words.
Other Web-based approaches use Web content for measuring semantic similarity between words. Turney [28] defined a measure called point-wise mutual information (PMI-IR), using the number of hits returned by a web search engine, to recognize synonyms. A similar approach for measuring the similarity between words was proposed by Matsuo et al. [20]. They proposed the use of web hits for the extraction of communities on the Web. They measured the association between two personal names using the Simpson coefficient, which is calculated from the number of web hits for each individual name and for their conjunction (i.e., the AND query of the two names). However, page counts alone are not a sufficient measure of similarity: even if two words appear on the same page, they might not be related [4]. A telling example: page counts for the word apple include pages about apple as a fruit as well as Apple as a company. Moreover, given the scale and noise of the Web, some words might co-occur on some pages purely by chance. For these reasons, page counts alone are unreliable for measuring semantic similarity.

Sahami and Heilman [26] measured semantic similarity between two queries using snippets returned for those queries by a search engine. For each query, they collect snippets from a
by a search engine. For each query, they collect snippets from a
search engine and represent each snippet as a TF-IDF-weighted
term vector. Each vector is normalized and the centroid of the
set of vectors is computed. Semantic similarity between two
queries is then defined as the inner product between the
corresponding centroid vectors. Chen et al. [7] developed a
double-checking model using text snippets returned by a web
search engine to compute semantic similarity between words.
For two words P and Q, they collect snippets for each word
from a web search engine. Then they count the occurrences of
word P in the snippets for word Q and the occurrences of word
Q in the snippets for word P. These values are combined
nonlinearly to compute the similarity between P and Q.
However, this method depends heavily on the search engine's
ranking algorithm. Although two words P and Q might be very
similar, there is no reason to believe that one can find Q in the
snippets for P, or vice versa.
Bollegala et al. [4] developed a hybrid semantic similarity
measure. They combined a WordNet based metric with
retrieval of information from text snippets returned by a Web
search engine. They automatically discover lexico-syntactic
templates for semantically related and unrelated words using
WordNet, and they train a support vector machine (SVM)
classifier. The learned templates are used for extracting
information from the text fragments returned by the search
engine.
In our work, we explore the potential offered by social networks and Web content in order to measure semantic similarity. Therefore, in this paper we propose a method that uses both a page-count-based similarity returned by a social website and a dictionary-based metric.
III. THE PROPOSED METHOD

Our method for measuring semantic similarity between words uses, on the one hand, an online English dictionary provided by the Semantic Atlas project (SA) [23] and, on the other hand, page counts returned by a social website whose content is generated by users (see Fig. 1).

We proceed in three phases:
1. The calculation of the similarity (SD) between two words, based on the online dictionary provided by the Semantic Atlas project.
2. The calculation of the similarity (SP) between two words, based on the page counts returned by the social website Digg.com³.
3. The integration of the two similarity measures SD and SP.

A. Phase 1: The Semantic Atlas based similarity measure

In this phase, we extract synonyms for each word from the online English dictionary provided by the Semantic Atlas project [23] of the French National Centre for Scientific Research (CNRS).

What drew our attention is the fact that the SA is composed of several dictionaries and thesauri (including Roget's thesaurus), thus offering a wide range of senses for a given word. In fact, the SA is used for the automatic treatment of polysemy and semantic disambiguation [29]. It has also been used in psycholinguistics research at the University of Geneva [13], in a survey to define the age of word acquisition in French as well as degrees of familiarity of words. The SA is currently available in French and English versions. It can be consulted online, via a flexible interface allowing interactive navigation, at http://dico.isc.cnrs.fr⁴. The current online version allows users to set the search type for English words to (1) standard (narrow synonymy) or (2) enriched (broad synonymy). In this paper, we use the enriched search type for better precision.

In our work, we use the SA to extract a set of synonyms for each word. Once the two synonym sets are collected, we calculate the degree of similarity between them, which we call S(w1, w2), using the Jaccard coefficient:
S(w1, w2) = mc / (mw1 + mw2 - mc)    (1)

where
mc: the number of words common to the two synonym sets;
mw1: the number of words contained in the w1 synonym set;
mw2: the number of words contained in the w2 synonym set.

[Figure 1 depicts the overall pipeline: a word pair (w1, w2) feeds both a dictionary-based similarity component, which compares the synonym sets of w1 and w2 with the Jaccard coefficient, and a page-counts-based component, which uses the WebPMI coefficient; the two scores are then combined into Sim(w1, w2).]

Figure 1. The proposed method for measuring semantic similarity between a given pair of words.

If the synonym set of word w1 explicitly contains the word w2, or vice versa, we directly assign the value 1 to S(w1, w2). For example, to find the semantic similarity between the words asylum and madhouse, we extract the two synonym sets for these words from the dictionary.

³ http://www.digg.com
⁴ This site is the most consulted address of the French National Centre for Scientific Research's (CNRS) domain, one of the major research bodies in France.

According to TABLE I, each of the two words appears in the other's synonym set, meaning that asylum and madhouse are considered perfect synonyms.

TABLE I. THE GROUPS OF SYNONYMS FOR THE WORDS ASYLUM AND MADHOUSE

asylum: ark, bedlam, booby hatch, crazy house, cuckoo's nest, funny farm, funny house, home, hospital, infirmary, insane asylum, institution, loony bin, madhouse, mental home, mental hospital, mental institution, nuthouse, psychiatric hospital, sanatorium, harbour, haven, preserve, refuge, reserve, retreat, safehold, safety, sanctuary, shelter.

madhouse: asylum, bedlam, booby hatch, chaos, crazy house, cuckoo's nest, funny farm, funny house, insane asylum, loony bin, lunatic asylum, mental home, mental hospital, mental institution, nuthouse, psychiatric hospital, sanatorium.
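To make Phase 1 concrete, the following Python sketch (our illustration, not the authors' code; the synonym sets are hard-coded excerpts from TABLE I rather than fetched from the Semantic Atlas) computes S(w1, w2) with equation (1) and the perfect-synonym rule:

def dictionary_similarity(w1, syns1, w2, syns2):
    """Phase 1: similarity S(w1, w2) over two synonym sets (equation 1)."""
    # Perfect-synonym rule: if either word appears in the other's
    # synonym set, assign the maximum similarity directly.
    if w2 in syns1 or w1 in syns2:
        return 1.0
    mc = len(syns1 & syns2)                     # words common to both sets
    return mc / (len(syns1) + len(syns2) - mc)  # Jaccard coefficient

# Excerpts of the TABLE I synonym sets (truncated for readability).
asylum_syns = {"bedlam", "booby hatch", "crazy house", "loony bin",
               "madhouse", "mental home", "nuthouse", "sanatorium"}
madhouse_syns = {"asylum", "bedlam", "booby hatch", "crazy house",
                 "loony bin", "mental home", "nuthouse", "sanatorium"}

print(dictionary_similarity("asylum", asylum_syns, "madhouse", madhouse_syns))
# -> 1.0: each word occurs in the other's synonym set.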

B. Phase 2: The page counts similarity measure

In this phase, we calculate the degree of similarity between the two words w1 and w2 using the WebPMI coefficient [4], whose parameters are the numbers of pages returned by the social website Digg.com for the queries w1, w2 and (w1 AND w2). In fact, we explore the potential offered by a community site whose content is generated by users for measuring semantic similarity between words.

Barlow [2] identifies Digg.com as one of the most popular aggregators of articles published on the Web whose content is generated by users. It lets users participate in the sharing of useful information and the exchange of advice, which offers a huge amount of information that would be impossible to obtain otherwise. A perfect example of what relational dynamics can produce, Digg.com is one of the most visited websites on the Web. Users publish on the site links to articles or pieces of information that they find interesting, with or without comments [22].

We use this social site to obtain the page count for a given query. Once the page counts for the queries w1, w2 and (w1 AND w2) are obtained, we calculate the WebPMI coefficient for the given pair of words:
WebPMI(w1, w2) = log2( (H(w1 ∧ w2) / N) / ((H(w1) / N) · (H(w2) / N)) )    (2)

where
H(w1 ∧ w2): the page count for the conjunctive query (w1 AND w2);
H(w1): the page count for the query w1;
H(w2): the page count for the query w2.

We set N = 10^10, following [4].

For the example of the previous section, the WebPMI coefficient is:
WebPMI(asylum, madhouse) = 0.800.
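A minimal sketch of equation (2); the page counts are assumed to have been obtained from Digg.com beforehand (the retrieval step is not part of the formula), and the counts below are made up for illustration:

import math

N = 10**10  # web size estimate, following [4]

def web_pmi(h_w1, h_w2, h_both, n=N):
    """Equation (2): pointwise mutual information over page counts."""
    if h_w1 == 0 or h_w2 == 0 or h_both == 0:
        return 0.0  # avoid log of zero; no co-occurrence means no evidence
    return math.log2((h_both / n) / ((h_w1 / n) * (h_w2 / n)))

# Hypothetical page counts; raw PMI values are later normalized into
# [0,1] for the comparisons reported in TABLE II.
print(web_pmi(h_w1=50_000, h_w2=8_000, h_both=320))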

C. Phase 3: The overall similarity measure

In this last phase, we combine the two measures previously calculated using the following formula:

SimFA(w1, w2) = α · S(w1, w2) + (1 - α) · WebPMI(w1, w2)    (3)

where α ∈ [0, 1].

First experiments on the Miller-Charles [21] and Rubenstein-Goodenough [25] datasets have shown that our measure performs best with α = 0.6. We intend to perform further experiments on the WordSimilarity-353 Test Collection [9] in order to stabilize α.
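Equation (3) then reduces to a one-line convex combination; a sketch with the paper's setting α = 0.6 (the component scores are assumed to already lie in [0,1], and reproducing the exact TABLE II figures would also depend on the normalization applied there):

def sim_fa(s_dict, s_pmi, alpha=0.6):
    """Equation (3): weighted combination of the two similarity scores."""
    return alpha * s_dict + (1 - alpha) * s_pmi

# asylum-madhouse: S = 1.0 from Phase 1, WebPMI = 0.800 from Phase 2
print(sim_fa(1.0, 0.800))  # -> 0.92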
IV. EXPERIMENTS

A. The benchmark datasets

We evaluate the proposed method against the Miller-Charles dataset [21], a dataset of 30 word pairs. Because two word pairs were omitted in earlier versions of WordNet, most researchers have used only 28 pairs for evaluation; we, however, also evaluated our method on Rubenstein and Goodenough's [25] original dataset of 65 word pairs. These two datasets are considered a reliable benchmark for evaluating semantic similarity measures. The word pairs are rated on a scale from 0 (no similarity) to 4 (perfect synonymy).
In TABLE II, we compare the results of our measure with Chen's Co-occurrence Double Checking (CODC) measure [7], with the Sahami and Heilman metric [26], with the Bollegala et al. measure [4], and with four popular co-occurrence measures adapted by [4]: WebJaccard, WebSimpson, WebDice, and WebPMI (point-wise mutual information). These measures use page counts returned by the search engine Google⁵.
They define WebSimpson(P, Q) as:

WebSimpson(P, Q) = 0 if H(P ∧ Q) ≤ c; H(P ∧ Q) / min(H(P), H(Q)) otherwise    (4)

They set the WebSimpson coefficient to zero if the page count for the query (P AND Q) is less than a threshold c⁶. Applied to the previous example: WebSimpson(asylum, madhouse) = 0.024.

They define the WebDice coefficient as a variant of the Dice coefficient. WebDice(P, Q) is defined as:

WebDice(P, Q) = 0 if H(P ∧ Q) ≤ c; 2 · H(P ∧ Q) / (H(P) + H(Q)) otherwise    (5)

Example: WebDice(asylum, madhouse) = 0.008.

They define WebJaccard as a variant of the Jaccard index using page counts:

WebJaccard(P, Q) = 0 if H(P ∧ Q) ≤ c; H(P ∧ Q) / (H(P) + H(Q) - H(P ∧ Q)) otherwise    (6)

Example: WebJaccard(asylum, madhouse) = 0.007.

⁵ http://www.google.com
⁶ c = 5 in the experiments.
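The three thresholded baselines translate directly into code; a sketch under the same assumption that the page counts H(P), H(Q) and H(P ∧ Q) are given:

def web_simpson(h_p, h_q, h_pq, c=5):
    """Equation (4): Simpson (overlap) coefficient over page counts."""
    return 0.0 if h_pq <= c else h_pq / min(h_p, h_q)

def web_dice(h_p, h_q, h_pq, c=5):
    """Equation (5): Dice coefficient over page counts."""
    return 0.0 if h_pq <= c else 2 * h_pq / (h_p + h_q)

def web_jaccard(h_p, h_q, h_pq, c=5):
    """Equation (6): Jaccard coefficient over page counts."""
    return 0.0 if h_pq <= c else h_pq / (h_p + h_q - h_pq)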

TABLE II. THE SIMFA SIMILARITY MEASURE COMPARED TO BASELINES ON THE MILLER-CHARLES DATASET.

Word pair | Miller-Charles | WebJaccard | WebSimpson | WebDice | WebPMI | Sahami et al. | Chen-CODC | Bollegala et al. | SimFA
cord-smile | 0.13 | 0.004 | 0.003 | 0.004 | 0.467 | 0.090 | 0 | 0 | 0.199
rooster-voyage | 0.08 | 0.000 | 0.000 | 0.000 | 0.000 | 0.197 | 0 | 0.017 | 0.227
noon-string | 0.08 | 0.002 | 0.002 | 0.002 | 0.450 | 0.082 | 0 | 0.018 | 0.195
glass-magician | 0.11 | 0.002 | 0.008 | 0.003 | 0.513 | 0.143 | 0 | 0.180 | 0.219
monk-slave | 0.55 | 0.006 | 0.003 | 0.007 | 0.609 | 0.095 | 0 | 0.375 | 0.262
coast-forest | 0.42 | 0.022 | 0.013 | 0.026 | 0.570 | 0.248 | 0 | 0.405 | 0.246
monk-oracle | 1.1 | 0.001 | 0.001 | 0.001 | 0.466 | 0.045 | 0 | 0.328 | 0.199
lad-wizard | 0.42 | 0.003 | 0.005 | 0.004 | 0.581 | 0.149 | 0 | 0.220 | 0.250
forest-graveyard | 0.84 | 0.002 | 0.005 | 0.002 | 0.556 | 0 | 0 | 0.547 | 0.244
food-rooster | 0.89 | 0.001 | 0.019 | 0.001 | 0.486 | 0.075 | 0 | 0.060 | 0.210
coast-hill | 0.87 | 0.020 | 0.009 | 0.024 | 0.522 | 0.293 | 0 | 0.874 | 0.234
car-journey | 1.16 | 0.015 | 0.028 | 0.018 | 0.474 | 0.189 | 0.290 | 0.286 | 0.215
crane-implement | 1.68 | 0.002 | 0.004 | 0.002 | 0.476 | 0.152 | 0 | 0.133 | 0.202
brother-lad | 1.66 | 0.003 | 0.014 | 0.004 | 0.561 | 0.236 | 0.379 | 0.344 | 0.257
bird-crane | 2.97 | 0.010 | 0.022 | 0.012 | 0.626 | 0.223 | 0 | 0.879 | 0.271
bird-cock | 3.05 | 0.003 | 0.003 | 0.004 | 0.475 | 0.058 | 0.502 | 0.593 | 0.214
food-fruit | 3.08 | 0.116 | 0.167 | 0.136 | 0.661 | 0.181 | 0.338 | 0.998 | 0.286
brother-monk | 2.82 | 0.006 | 0.018 | 0.007 | 0.577 | 0.267 | 0.547 | 0.377 | 0.849
asylum-madhouse | 3.61 | 0.007 | 0.024 | 0.008 | 0.800 | 0.212 | 0 | 0.773 | 0.946
furnace-stove | 3.11 | 0.028 | 0.017 | 0.034 | 0.736 | 0.310 | 0.928 | 0.889 | 0.917
magician-wizard | 3.5 | 0.020 | 0.020 | 0.024 | 0.691 | 0.233 | 0.671 | 1 | 0.901
journey-voyage | 3.84 | 0.029 | 0.033 | 0.035 | 0.658 | 0.524 | 0.417 | 0.996 | 0.884
coast-shore | 3.7 | 0.052 | 0.035 | 0.062 | 0.650 | 0.381 | 0.518 | 0.945 | 0.881
implement-tool | 2.95 | 0.045 | 0.047 | 0.053 | 0.556 | 0.419 | 0.419 | 0.684 | 0.839
boy-lad | 3.76 | 0.007 | 0.062 | 0.009 | 0.633 | 0.471 | 0 | 0.974 | 0.875
automobile-car | 3.92 | 0.124 | 0.397 | 0.144 | 0.684 | 1 | 0.686 | 0.980 | 0.897
midday-noon | 3.42 | 0.012 | 0.008 | 0.015 | 0.685 | 0.289 | 0.856 | 0.819 | 0.899
gem-jewel | 3.84 | 1 | 1 | 1 | 1 | 0.211 | 1 | 0.686 | 0.969
Correlation | - | 0.316 | 0.383 | 0.328 | 0.663 | 0.579 | 0.693 | 0.834 | 0.840



These measures are based on association ratios between words, computed using the frequency of co-occurrence of words in corpora. The main hypothesis of this approach is that two words are semantically related if their association ratio is high.

All figures in TABLE II, except those of Miller-Charles, are normalized into the interval [0,1] to facilitate comparison.

In TABLE III, we also evaluate our method on the Rubenstein-Goodenough dataset [25].
Rubenstein and Goodenough [25] obtained synonymy
judgments from 51 human subjects on 65 pairs of words.
The pairs ranged from highly synonymous to
semantically unrelated, and the subjects were asked to rate
them, on the scale of 0.0 to 4.0, according to their similarity
of meaning [6].

We evaluate our method against Rubenstein and Goodenough's dataset with the following five measures: Hirst and St-Onge's [11], Jiang and Conrath's [12], Leacock and Chodorow's [14], Lin's [18], and Resnik's [24].
The first is claimed as a measure of semantic relatedness
because it uses all noun relations in WordNet; the others are
claimed only as measures of similarity because they use only
the hyponymy relation. These measures were implemented
by Budanitsky and Hirst [6], where the Brown Corpus was
used as the basis for the frequency counts needed in the
information-based approaches.
TABLE III. THE SIMFA SIMILARITY MEASURE COMPARED TO BASELINES ON RUBENSTEIN AND GOODENOUGH'S DATASET.

Method | Correlation
Human replication | 0.901
Resnik | 0.779
Lin (1998a) | 0.819
Li et al. (2003) | 0.891
Leacock & Chodorow | 0.838
Hirst & St-Onge | 0.786
Jiang & Conrath | 0.781
Bollegala et al. (2007) | 0.812
Proposed SimFA | 0.875

B. Results

According to TABLE II, our proposed measure SimFA achieves the highest correlation, 0.840. The CODC measure reports zero similarity scores for many word pairs in the benchmark. One reason for this sparsity is that even though two words in a pair (P, Q) are semantically similar, one might not always find Q among the top snippets for P, and vice versa [4]. The similarity measure proposed by Sahami and Heilman [26] is placed fifth, with a correlation of 0.579. Among the four page-counts-based measures, WebPMI achieves the highest correlation (r = 0.663).

As summarized in TABLE III, our proposed measure evaluated against Rubenstein and Goodenough's [25] dataset gives a correlation coefficient of 0.875, which we consider promising. Our measure outperforms simple WordNet-based approaches such as edge-counting and information-content measures, and it is comparable with the other methods.
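The correlation figures in TABLES II and III are Pearson coefficients between each measure's scores and the human ratings. A self-contained sketch of this evaluation step (using only three of the 28 Miller-Charles pairs, for brevity):

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Human ratings (Miller-Charles, 0-4 scale) and SimFA scores from TABLE II.
human = [3.61, 3.11, 0.13]   # asylum-madhouse, furnace-stove, cord-smile
simfa = [0.946, 0.917, 0.199]
print(round(pearson(human, simfa), 3))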
V. CONCLUSION AND FUTURE WORK

Semantic similarity between words is fundamental to various fields such as Cognitive Science, Natural Language Processing and Information Retrieval. Therefore, relying on a robust semantic similarity measure is crucial.

In this paper, we introduced a new similarity measure between words that uses, on the one hand, an online English dictionary provided by the Semantic Atlas project of the French National Centre for Scientific Research (CNRS) and, on the other hand, page counts returned by the social website Digg.com, whose content is generated by users.

Experimental results on the Miller-Charles benchmark dataset show that the proposed measure is promising.

There are several lines of future work that our proposed measure lays the foundation for. We will incorporate this measure into other similarity-based applications to determine its ability to improve tasks such as text classification and clustering. In fact, we are working on integrating our metric into an opinion classification application [8].

Moreover, we intend to go further in this domain by developing new semantic similarity measures for sentences and documents.

REFERENCES

[1] I. Akermi and R. Faiz: Mesure de similarité sémantique entre les mots en utilisant le contenu du Web. Revue des Nouvelles Technologies de l'Information (RNTI), Hermann éditions, D. Zighed and G. Venturini, editors, EGC 2012, pp. 545-546, 2012.
[2] A. Barlow: Blogging America: The New Public Sphere. Praeger, 2008.
[3] R. Baeza-Yates and B. Ribeiro-Neto: Modern Information Retrieval. Addison Wesley, 1999.
[4] D. Bollegala, Y. Matsuo, and M. Ishizuka: Measuring semantic similarity between words using web search engines. In Proceedings of the International Conference on the World Wide Web (WWW 2007), pp. 757-766, 2007.
[5] C. Buckley, G. Salton, J. Allan, and A. Singhal: Automatic query expansion using SMART: TREC 3. In Proceedings of the 3rd Text REtrieval Conference, pp. 69-80, 1994.
[6] A. Budanitsky and G. Hirst: Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), pp. 13-47, 2006.
[7] H. Chen, M. Lin and Y. Wei: Novel association measures using web search with double checking. In Proceedings of COLING/ACL, pp. 1009-1016, 2006.
[8] A. Elkhlifi, R. Bouchlaghem and R. Faiz: Opinion Extraction and Classification Based on Semantic Similarities. In Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2011), AAAI Press, 2011.
[9] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin: Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), pp. 116-131, 2002.
[10] G. Grefenstette: Automatic thesaurus generation from raw text using knowledge-poor techniques. In Making Sense of Words, 9th Annual Conference of the UW Centre for the New OED and Text Research, 1993.
[11] G. Hirst and D. St-Onge: Lexical chains as representations of context for the detection and correction of malapropisms. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database, The MIT Press, Cambridge, MA, pp. 305-332, 1998.
[12] J.J. Jiang and D.W. Conrath: Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics (ROCLING X), 1998.
[13] C. Lachaud: La prégnance perceptive des mots parlés : une réponse au problème de la segmentation lexicale. Thèse de doctorat, Université de Genève, 2005.
[14] C. Leacock and M. Chodorow: Combining local context and WordNet similarity for word sense identification. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database, The MIT Press, Cambridge, MA, pp. 265-283, 1998.
[15] M.E. Lesk: Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the SIGDOC Conference, Toronto, 1986.
[16] H. Li and N. Abe: Word clustering and disambiguation based on co-occurrence data. In Proceedings of COLING-ACL, pp. 749-755, 1998.
[17] Y. Li, Z. Bandar and D. McLean: An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering, 15(4), pp. 871-882, 2003.
[18] D. Lin: Automatic retrieval and clustering of similar words. In Proceedings of the 17th COLING, pp. 768-774, 1998a.
[19] D. Lin: An information-theoretic definition of similarity. In Proceedings of the 15th ICML, pp. 296-304, 1998b.
[20] Y. Matsuo, T. Sakaki, K. Uchiyama and M. Ishizuka: Graph-based word clustering using web search engine. In Proceedings of EMNLP, 2006.
[21] G.A. Miller and W.G. Charles: Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), pp. 1-28, 1991.
[22] F. Pisani and D. Piotet: Comment le web change le monde: L'alchimie des multitudes, 2008.
[23] S. Ploux, A. Boussidan and H. Ji: The Semantic Atlas: an interactive model of lexical representation. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, 2010.
[24] P. Resnik: Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), 1995.
[25] H. Rubenstein and J. Goodenough: Contextual correlates of synonymy. Communications of the ACM, 8, pp. 627-633, 1965.
[26] M. Sahami and T. Heilman: A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of the 15th International World Wide Web Conference (WWW 2006), 2006.
[27] S. Patwardhan, S. Banerjee and T. Pedersen: Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, Mexico, pp. 241-257, 2003.
[28] P.D. Turney: Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of ECML, pp. 491-502, 2001.
[29] F. Venant: Utiliser des classes de sélection distributionnelle pour désambiguïser les adjectifs. In Proceedings of TALN, Toulouse, France, 2007.
[30] J. Xu and B. Croft: Improving the effectiveness of information retrieval. ACM Transactions on Information Systems, 18(1), pp. 79-112, 2000.
