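Proceedings of the "International Conference on Advanced Nanomaterials & Emerging Engineering Technologies" (ICANMEET-2013), organized by Sathyabama University, Chennai, India in association with DRDO, New Delhi, India, 24-26 July 2013.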
Shanmugapriya*1, K. Latha*2
1shanmu.raj90@gmail.com
... from snippets returned by search engine. These ...
Keywords
I. INTRODUCTION
In a search engine, information for a given query is often
retrieved with the help of a lexical ontology such as WordNet.
However, it is difficult to retrieve the set of documents that is
semantically related to the query, because the semantically
related words for each sense of a word are updated manually in
such ontologies. There, a synset lists the set of synonymous
words for a particular sense of a word.
However, semantic similarity between entities changes over
time, so updating these resources manually is very difficult.
For example, apple is frequently associated with computers on
the web, yet this sense of apple is not listed in most
general-purpose thesauri or dictionaries. A user who searches
for apple on the web might be interested in this sense of apple
and not in apple as a fruit. New words are constantly being
created, and new senses are assigned to existing words.
Manually maintaining ontologies to capture these new words
and senses is costly, if not impossible.
Semantic similarity between two entities can be measured using
information extracted from a web search engine, namely the
page counts and the snippets returned for a given query. The
page count of a query is the number of web pages that contain
the query words. To measure the semantic similarity between
two query words, we use the number of web pages in which
both words co-occur. This is a simple similarity measure, but it
has some drawbacks. First, it does not consider the position of
the words in the web page: even though two words occur in
the same page, they may not be related. Second, it might
contain the combination of all
978-1-4799-1379-4/13/$31.00 ©2013 IEEE
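$$
\mathrm{CODC}(X,Y)=
\begin{cases}
0 & \text{if } f(Y@X)=0 \text{ or } f(X@Y)=0\\[4pt]
e^{\log\left[\frac{f(Y@X)}{f(X)}\times\frac{f(X@Y)}{f(Y)}\right]^{\alpha}} & \text{otherwise}
\end{cases}
\tag{1}
$$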
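[Figure: system overview. Recoverable block labels: query words X and Y, Search Engine, Snippets, Lexical Patterns, Pattern Clusters, WebJaccard, WebDice, WebPMI, WebOverlap, SVM.]

$$
\mathrm{NGD}(P,Q)=\frac{\max\{\log H(P),\,\log H(Q)\}-\log H(P,Q)}{\log N-\min\{\log H(P),\,\log H(Q)\}}
\tag{2}
$$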
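As a minimal sketch of the distance in (2), the function below computes it from three page counts and the index size N. The function name, the natural logarithm, and the handling of a zero co-occurrence count are assumptions made for illustration; the counts in the usage line are made up.

```python
import math

def ngd(h_p, h_q, h_pq, n):
    """Distance from equation (2), computed from page counts.

    h_p, h_q: page counts for the single queries P and Q (must be > 0)
    h_pq:     page count for the query containing both P and Q
    n:        number of pages indexed by the search engine
    """
    if h_pq == 0:
        # Assumption: words that never co-occur are treated as infinitely distant.
        return math.inf
    log_p, log_q = math.log(h_p), math.log(h_q)
    numerator = max(log_p, log_q) - math.log(h_pq)
    denominator = math.log(n) - min(log_p, log_q)
    return numerator / denominator

# Made-up counts: H(P)=1000, H(Q)=100, H(P,Q)=10, N=10**6
distance = ngd(1000, 100, 10, 10**6)
```

Because the measure is a ratio of log differences, the base of the logarithm does not change the result.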
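$$
\mathrm{WebJaccard}(P,Q)=
\begin{cases}
0 & \text{if } H(P\cap Q)\le c\\[4pt]
\dfrac{H(P\cap Q)}{H(P)+H(Q)-H(P\cap Q)} & \text{otherwise}
\end{cases}
\tag{3}
$$

$$
\mathrm{WebDice}(P,Q)=
\begin{cases}
0 & \text{if } H(P\cap Q)\le c\\[4pt]
\dfrac{2H(P\cap Q)}{H(P)+H(Q)} & \text{otherwise}
\end{cases}
\tag{4}
$$

$$
\mathrm{WebOverlap}(P,Q)=
\begin{cases}
0 & \text{if } H(P\cap Q)\le c\\[4pt]
\dfrac{H(P\cap Q)}{\min(H(P),H(Q))} & \text{otherwise}
\end{cases}
\tag{5}
$$

$$
\mathrm{WebPMI}(P,Q)=
\begin{cases}
0 & \text{if } H(P\cap Q)\le c\\[4pt]
\log_2\!\left(\dfrac{H(P\cap Q)/N}{\dfrac{H(P)}{N}\cdot\dfrac{H(Q)}{N}}\right) & \text{otherwise}
\end{cases}
\tag{6}
$$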
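The four page-count measures (WebJaccard, WebDice, WebOverlap, WebPMI) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names, the default threshold `c=5`, and the base-2 logarithm in WebPMI are assumptions, and the page counts in the usage lines are made up.

```python
import math

def web_jaccard(h_p, h_q, h_pq, c=5):
    """H(P∩Q) / (H(P) + H(Q) - H(P∩Q)); 0 when H(P∩Q) <= c."""
    if h_pq <= c:
        return 0.0
    return h_pq / (h_p + h_q - h_pq)

def web_dice(h_p, h_q, h_pq, c=5):
    """2 H(P∩Q) / (H(P) + H(Q)); 0 when H(P∩Q) <= c."""
    if h_pq <= c:
        return 0.0
    return 2 * h_pq / (h_p + h_q)

def web_overlap(h_p, h_q, h_pq, c=5):
    """H(P∩Q) / min(H(P), H(Q)); 0 when H(P∩Q) <= c."""
    if h_pq <= c:
        return 0.0
    return h_pq / min(h_p, h_q)

def web_pmi(h_p, h_q, h_pq, n, c=5):
    """log2 of the joint probability over the product of marginals; 0 when H(P∩Q) <= c."""
    if h_pq <= c:
        return 0.0
    return math.log2((h_pq / n) / ((h_p / n) * (h_q / n)))

# Made-up counts: H(P)=1000, H(Q)=500, H(P∩Q)=200, N=10**6
scores = {
    "WebJaccard": web_jaccard(1000, 500, 200),
    "WebDice": web_dice(1000, 500, 200),
    "WebOverlap": web_overlap(1000, 500, 200),
    "WebPMI": web_pmi(1000, 500, 200, 10**6),
}
```

The threshold guards against co-occurrences that are too rare to be meaningful: any pair whose joint page count does not exceed `c` scores zero on all four measures.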
$$
\mu(a)=\sum_{i} f(P_i, Q_i, a)
\tag{6}
$$
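The total occurrence μ(a) is simply the sum of a pattern's frequencies over all word pairs, and it gives a natural key for ordering patterns from most to least frequent. The sketch below illustrates this with hypothetical patterns and made-up counts; the names `total_occurrence` and `pattern_vectors` are illustrative only.

```python
def total_occurrence(frequencies):
    """mu(a): sum of f(P_i, Q_i, a) over all word pairs i."""
    return sum(frequencies)

# Hypothetical patterns; each list holds made-up frequencies
# f(P_i, Q_i, a) for three word pairs (P_1, Q_1) .. (P_3, Q_3).
pattern_vectors = {
    "X and Y": [1, 0, 2],
    "X is a Y": [4, 3, 0],
    "X such as Y": [2, 2, 1],
}

mu = {a: total_occurrence(v) for a, v in pattern_vectors.items()}

# Order patterns by descending total occurrence.
ordered = sorted(mu, key=mu.get, reverse=True)
```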
After sorting, the most frequent patterns come first and rare
patterns are shifted to the end of the set. We initialize the set
of clusters, C, to the empty set. The outer for loop (starting at
line 3) repeatedly takes a pattern a_i from the ordered set, and
the inner for loop (starting at line 6) finds the cluster c* (∈ C)
that is most similar to a_i. To compute the similarity between a
pattern and a cluster, we first represent the cluster by the
centroid of the word-pair frequency vectors of all patterns in
that cluster. Next, we compute the cosine similarity between
the cluster centroid (c_j) and the word-pair frequency vector of
the pattern (a_i). If the similarity between a pattern a_i and its
most similar cluster c* is greater than the threshold θ, we
append a_i to c*. Otherwise, a_i forms a new cluster of its own.
Algorithm 1. Sequential pattern clustering algorithm.
Input: patterns P = {a_1, ..., a_n}, threshold θ
Output: clusters C
1: SORT(P)
2: C ← {}
3: for pattern a_i ∈ P do
4:   max ← −∞
5:   c* ← null
6:   for cluster c_j ∈ C do
7:     sim ← cosine(a_i, c_j)
8:     if sim > max then
9:       max ← sim
10:      c* ← c_j
11:    end if
12:  end for
13:  if max > θ then
14:    c* ← c* + a_i
15:  else
16:    C ← C ∪ {a_i}
17:  end if
18: end for
19: return C
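Algorithm 1 can be sketched in Python as follows. This is one possible reading rather than the authors' code: it assumes each pattern is represented by its word-pair frequency vector, that SORT orders patterns by descending total frequency, and that a cluster's centroid is the element-wise mean of its members' vectors. The function names, the example patterns, and the threshold value are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def sequential_cluster(patterns, theta):
    """Cluster patterns (dict: pattern -> frequency vector) in a single pass."""
    # Line 1: most frequent patterns first (assumed sort key: total frequency).
    ordered = sorted(patterns, key=lambda a: sum(patterns[a]), reverse=True)
    clusters = []  # Line 2: each cluster is a list of pattern names
    for a in ordered:  # Line 3
        best_sim, best_cluster = -math.inf, None  # Lines 4-5
        for cluster in clusters:  # Line 6
            center = centroid([patterns[m] for m in cluster])
            sim = cosine(patterns[a], center)  # Line 7
            if sim > best_sim:  # Lines 8-10
                best_sim, best_cluster = sim, cluster
        if best_cluster is not None and best_sim > theta:  # Line 13
            best_cluster.append(a)  # Line 14
        else:
            clusters.append([a])  # Line 16: start a new cluster
    return clusters

# Hypothetical patterns with made-up frequencies over three word pairs.
example = {
    "X is a Y": [5, 4, 0],
    "X, a kind of Y": [4, 3, 0],
    "X and Y": [0, 0, 6],
}
result = sequential_cluster(example, theta=0.7)
```

With these made-up vectors the first two patterns point in nearly the same direction and end up in one cluster, while the third is orthogonal to them and starts its own. Because every pattern is compared only against existing clusters, the procedure needs a single pass over the sorted patterns.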
$$
w_{ij}=\frac{\mu(a_i)}{\sum_{t\in c_j}\mu(t)}
\tag{7}
$$
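The ratio in (7) normalizes a pattern's total occurrence μ by the summed total occurrence of every pattern in the same cluster, so the values within one cluster add up to 1. A minimal sketch, assuming the μ values have already been computed; the function name and the numbers are hypothetical.

```python
def cluster_weights(cluster, mu):
    """Return mu(a) / sum of mu(t) over t in the cluster, for each pattern a in the cluster."""
    total = sum(mu[t] for t in cluster)
    return {a: mu[a] / total for a in cluster}

# Made-up total occurrences for three patterns in one cluster.
mu_values = {"a1": 6, "a2": 3, "a3": 1}
weights = cluster_weights(["a1", "a2", "a3"], mu_values)
```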
$$
d(f^{*}) = h(f^{*}) + b
\tag{10}
$$
TABLE 1

Word Pair        WebJaccard  WebDice  WebPMI  WebOverlap  CODC  NGD   SH    Proposed
Car-automobile   0.64        0.65     0.45    0.83        0.7   0.15  0.9   0.92
Apple-computer   0.5         0.52     0.42    0.9         0.68  0.23  1.00  1.00
Car-journey      0.32        0.47     0.45    0.45        0.28  0.37  0.42  0.5
Car-travel       0.5         0.6      0.05    0.10        0.6   0.75  0.7   0.82

V. CONCLUSION
Semantic similarity is measured using both page counts and
snippets. Based on page counts, four word co-occurrence
measures are computed. Lexical patterns are extracted from
snippets and then clustered using a sequential clustering
algorithm. Co-occurrence