Rim Faiz
I. INTRODUCTION
With the development of the Web, a great amount of information has become
available on Web pages, opening new perspectives for sharing information and
allowing the collaborative construction of content and the development of
social networks. This shared information represents a collective intelligence,
or what we prefer to call human swarm knowledge. In our work, we take
advantage of this remarkable potential in order to measure the similarity
between words.
Measures of the semantic similarity of words have been
used for a long time in applications in Natural Language
Processing and related areas, such as the automatic creation of
thesauri [10], [19], text classification [8], word sense
disambiguation [15], [16] and information extraction and
retrieval [5], [30]. Indeed, one of the main goals of these
applications is to facilitate user access to relevant information.
According to Baeza-Yates and Ribeiro-Neto [3], an
information retrieval system should provide the user with easy
access to the information in which he is interested.
Therefore, various methods have been proposed: corpus-based methods relying on
resources such as WordNet and the Brown corpus [18], [15], [12], [19], and
Web-based metrics [20], [28], [1].
In this same context, we propose a method that uses
information available on the Web to measure semantic
similarity between a pair of words.
http://wordnet.princeton.edu/.
http://icame.uib.no/brown/.
mc / (mw1 + mw2 - mc)   (1)

Where
mc: the number of common words between the two synonym sets;
mw1, mw2: the number of words in the synonym sets of w1 and w2, respectively.
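As a minimal sketch (the function name and the toy synonym lists are
illustrative, not taken from the paper), the dictionary-based coefficient of
(1) can be computed directly from two synonym sets:

```python
def dictionary_similarity(synonyms_w1, synonyms_w2):
    """Overlap coefficient of eq. (1): mc / (mw1 + mw2 - mc)."""
    s1, s2 = set(synonyms_w1), set(synonyms_w2)
    mc = len(s1 & s2)            # common words between the two synonym sets
    mw1, mw2 = len(s1), len(s2)  # sizes of the two synonym sets
    return mc / (mw1 + mw2 - mc)

# Two words sharing 2 of their 4 distinct synonyms score 2 / (3 + 3 - 2) = 0.5.
print(dictionary_similarity(["refuge", "shelter", "haven"],
                            ["shelter", "haven", "retreat"]))  # → 0.5
```

Because the denominator counts the union of the two sets, the coefficient is 1
for identical synonym sets and 0 for disjoint ones.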
[Figure: overview of the measure — the synonym sets for word w1 and word w2
are compared with a dictionary-based Jaccard similarity coefficient, and a
WebPMI similarity coefficient is computed in parallel.]
http://www.digg.com.
This site is the most consulted address within the domain of the French
National Center for Scientific Research (CNRS), one of the major research
bodies in France.
Synonym sets returned for the words asylum and madhouse:

asylum: ark, bedlam, booby hatch, crazy house, cuckoo's nest, funny farm,
funny house, home, hospital, infirmary, insane asylum, institution, loony bin,
madhouse, mental home, mental hospital, mental institution, nuthouse,
psychiatric hospital, sanatorium, harbour, haven, preserve, refuge, reserve,
retreat, safehold, safety, sanctuary, shelter.

madhouse: asylum, bedlam, booby hatch, chaos, crazy house, cuckoo's nest,
funny farm, funny house, insane asylum, loony bin, lunatic asylum, mental
home, mental hospital, mental institution, nuthouse, psychiatric hospital,
sanatorium.
WebPMI(w1, w2) = log2( (H(w1 w2)/N) / ((H(w1)/N) × (H(w2)/N)) )   (2)

Where
H(w1 w2): the page count for the query (w1 AND w2);
H(w1): the page count for the query w1;
H(w2): the page count for the query w2.
We set N = 10^10, according to [4].
The WebPMI coefficient, for the example mentioned in the previous section, is
WebPMI(asylum, madhouse) = 0,800.
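Assuming eq. (2) denotes the usual pointwise mutual information over
page-count probabilities, it can be sketched as follows (the function name and
the zero-count guard are our own additions):

```python
import math

def web_pmi(h_w1, h_w2, h_w1w2, n=10**10):
    """WebPMI of eq. (2) computed from raw page counts.

    h_w1, h_w2: page counts for the individual queries w1 and w2;
    h_w1w2: page count for the conjunctive query (w1 AND w2);
    n: estimated total number of indexed pages (N = 10^10, after [4]).
    """
    if h_w1 == 0 or h_w2 == 0 or h_w1w2 == 0:
        return 0.0  # no evidence of co-occurrence
    return math.log2((h_w1w2 / n) / ((h_w1 / n) * (h_w2 / n)))
```

The score grows when the two words co-occur on more pages than their
individual frequencies would predict under independence.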
First experiments on the Miller-Charles [21] and Rubenstein-Goodenough [25]
datasets have shown that our measure performs best with the weighting
parameter set to 0,6. We intend to perform further experiments on the
WordSimilarity-353 Test Collection [9] in order to stabilize this parameter.
IV. EXPERIMENTS
http://www.google.com.
We set c = 5 in the experiments.
TABLE II. THE SIMFA SIMILARITY MEASURE COMPARED TO BASELINES ON MILLER-CHARLES' DATASET.

Word pairs | Miller-Charles | WebJaccard | WebSimpson | WebDice | WebPMI | Sahami et al. | Chen-CODC | Bollegala et al. | SimFA
cord-smile | 0,13 | 0,004 | 0,003 | 0,004 | 0,467 | 0,090 | 0 | 0 | 0,199
rooster-voyage | 0,08 | 0,000 | 0,000 | 0,000 | 0,000 | 0,197 | 0 | 0,017 | 0,227
noon-string | 0,08 | 0,002 | 0,002 | 0,002 | 0,450 | 0,082 | 0 | 0,018 | 0,195
glass-magician | 0,11 | 0,002 | 0,008 | 0,003 | 0,513 | 0,143 | 0 | 0,180 | 0,219
monk-slave | 0,55 | 0,006 | 0,003 | 0,007 | 0,609 | 0,095 | 0 | 0,375 | 0,262
coast-forest | 0,42 | 0,022 | 0,013 | 0,026 | 0,570 | 0,248 | 0 | 0,405 | 0,246
monk-oracle | 1,1 | 0,001 | 0,001 | 0,001 | 0,466 | 0,045 | 0 | 0,328 | 0,199
lad-wizard | 0,42 | 0,003 | 0,005 | 0,004 | 0,581 | 0,149 | 0 | 0,220 | 0,250
forest-graveyard | 0,84 | 0,002 | 0,005 | 0,002 | 0,556 | 0 | 0 | 0,547 | 0,244
food-rooster | 0,89 | 0,001 | 0,019 | 0,001 | 0,486 | 0,075 | 0 | 0,060 | 0,210
coast-hill | 0,87 | 0,020 | 0,009 | 0,024 | 0,522 | 0,293 | 0 | 0,874 | 0,234
car-journey | 1,16 | 0,015 | 0,028 | 0,018 | 0,474 | 0,189 | 0,290 | 0,286 | 0,215
crane-implement | 1,68 | 0,002 | 0,004 | 0,002 | 0,476 | 0,152 | 0 | 0,133 | 0,202
brother-lad | 1,66 | 0,003 | 0,014 | 0,004 | 0,561 | 0,236 | 0,379 | 0,344 | 0,257
bird-crane | 2,97 | 0,010 | 0,022 | 0,012 | 0,626 | 0,223 | 0 | 0,879 | 0,271
bird-cock | 3,05 | 0,003 | 0,003 | 0,004 | 0,475 | 0,058 | 0,502 | 0,593 | 0,214
food-fruit | 3,08 | 0,116 | 0,167 | 0,136 | 0,661 | 0,181 | 0,338 | 0,998 | 0,286
brother-monk | 2,82 | 0,006 | 0,018 | 0,007 | 0,577 | 0,267 | 0,547 | 0,377 | 0,849
asylum-madhouse | 3,61 | 0,007 | 0,024 | 0,008 | 0,800 | 0,212 | 0 | 0,773 | 0,946
furnace-stove | 3,11 | 0,028 | 0,017 | 0,034 | 0,736 | 0,310 | 0,928 | 0,889 | 0,917
magician-wizard | 3,5 | 0,020 | 0,020 | 0,024 | 0,691 | 0,233 | 0,671 | 1 | 0,901
journey-voyage | 3,84 | 0,029 | 0,033 | 0,035 | 0,658 | 0,524 | 0,417 | 0,996 | 0,884
coast-shore | 3,7 | 0,052 | 0,035 | 0,062 | 0,650 | 0,381 | 0,518 | 0,945 | 0,881
implement-tool | 2,95 | 0,045 | 0,047 | 0,053 | 0,556 | 0,419 | 0,419 | 0,684 | 0,839
boy-lad | 3,76 | 0,007 | 0,062 | 0,009 | 0,633 | 0,471 | 0 | 0,974 | 0,875
automobile-car | 3,92 | 0,124 | 0,397 | 0,144 | 0,684 | 1 | 0,686 | 0,980 | 0,897
midday-noon | 3,42 | 0,012 | 0,008 | 0,015 | 0,685 | 0,289 | 0,856 | 0,819 | 0,899
gem-jewel | 3,84 | 1 | 1 | 1 | 1 | 0,211 | 1 | 0,686 | 0,969
Correlation | - | 0,316 | 0,383 | 0,328 | 0,663 | 0,579 | 0,693 | 0,834 | 0,840
WebJaccard(P,Q) = 0, if H(PQ) ≤ c;
WebJaccard(P,Q) = H(PQ) / (H(P) + H(Q) - H(PQ)), otherwise.   (6)
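The threshold behaviour of eq. (6) can be sketched as follows (the function
name is illustrative; c = 5 as in the experiments):

```python
def web_jaccard(h_p, h_q, h_pq, c=5):
    """WebJaccard of eq. (6) computed from raw page counts.

    Returns 0 when the conjunctive count h_pq is at most c, treating rare
    co-occurrences as noise; otherwise the Jaccard ratio of the counts.
    """
    if h_pq <= c:
        return 0.0
    return h_pq / (h_p + h_q - h_pq)

print(web_jaccard(100, 50, 25))  # 25 / (100 + 50 - 25) = 0.2
print(web_jaccard(100, 50, 3))   # below the threshold c, so 0.0
```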
TABLE III. THE SIMFA SIMILARITY MEASURE COMPARED TO PREVIOUS WORK ON RUBENSTEIN-GOODENOUGH'S DATASET.

Method | Correlation
Human replication | 0.901
Resnik | 0.779
Lin (1998a) | 0.819
Li et al. (2003) | 0.891
Leacock & Chodorow | 0.838
Hirst & St-Onge | 0.786
Jiang & Conrath | 0.781
Bollegala et al. (2007) | 0.812
Proposed SimFA | 0.875
B. Results
According to TABLE II, our proposed measure SimFA achieves the highest
correlation, 0,840. The CODC measure reports zero similarity scores for many
word pairs in the benchmark. One reason for this sparsity is that even though
the two words in a pair (P, Q) are semantically similar, we might not always
find Q among the top snippets for P (and vice versa) [4]. The similarity
measure proposed by Sahami and Heilman [26] is placed fifth, with a
correlation of 0,579.
Among the four page-counts-based measures, WebPMI achieves the highest
correlation (r = 0,663).
As summarized in TABLE III, our proposed measure evaluated against
Rubenstein-Goodenough's [25] dataset gives a correlation coefficient of 0,875,
which we consider promising. Our measure outperforms simple WordNet-based
approaches such as edge-counting and information-content measures, and it is
comparable with the other methods.
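The correlation figures in TABLE II and TABLE III compare each method's scores
against the human ratings; assuming the standard Pearson coefficient (which is
what the cited baselines report), a minimal sketch is:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# A method whose scores are a linear rescaling of the human ratings
# correlates perfectly (up to floating-point rounding):
print(pearson([0.13, 0.55, 3.84], [0.013, 0.055, 0.384]))  # ≈ 1.0
```

Note that Pearson correlation is invariant under linear rescaling, which is
why measures on different scales (e.g. Miller-Charles ratings in [0, 4] and
similarity coefficients in [0, 1]) can be compared directly.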
V. CONCLUSION AND FUTURE WORK
Semantic similarity between words is fundamental to
various fields such as Cognitive Science, Natural Language
Processing and Information Retrieval. Therefore, relying on
a robust semantic similarity measure is crucial.
In this paper, we introduced a new similarity measure between words that uses,
on the one hand, an online English dictionary provided by the Semantic Atlas
project of the French National Centre for Scientific Research (CNRS) and, on
the other hand, page counts returned by the social website Digg.com, whose
content is generated by users.