Proceedings of the "International Conference on Advanced Nanomaterials & Emerging Engineering Technologies" (ICANMEET-2013), organized by Sathyabama University, Chennai, India, in association with DRDO, New Delhi, India, 24-26 July 2013.
Measuring Semantic Similarity Using Web Search Engine

Shanmugapriya*1, K. Latha*2
*1 shanmu.raj90@gmail.com
Anna University, Regional Centre, Tiruchirappalli

Abstract— We propose an automatic method to measure semantic similarity between entities using a web search engine, exploiting both page counts and lexical patterns extracted from snippets. Semantic similarity is measured from the page counts and the snippets returned by the search engine for the given query words. From the page counts, four word co-occurrence measures are calculated. Lexical patterns describing semantic relations are extracted from the snippets returned by the search engine, and these patterns are then clustered using a sequential clustering algorithm. Finally, the word co-occurrence measures are combined with the lexical pattern clusters, and the combination is learned using an SVM.

Keywords— Page count, Lexical pattern extraction, Lexical pattern clustering, Support Vector Machine

I. INTRODUCTION
In a search engine, information is retrieved for a given query by using a lexical ontology such as WordNet. But there is a problem in retrieving the set of documents that are semantically related to a given query: the semantically related words for a particular sense of a word must be manually updated in such lexical ontologies. There, a synset lists the set of synonymous words for a particular sense of a word. However, semantic similarity between entities changes over time, which makes manual updating very difficult. For example, apple is frequently associated with computers on the web. However, this sense of apple is not listed in most general-purpose thesauri or dictionaries. A user who searches for apple on the web might be interested in this sense of apple and not in apple as a fruit. New words are constantly being created, and new senses are assigned to existing words. Manually maintaining ontologies to capture these new words and senses is costly if not impossible.
Semantic similarity between entities can be measured using information extracted from a web search engine: the page counts and the snippets returned for a given query. The page count is the number of web pages that contain the given query words. To measure the semantic similarity between two query words from page counts, we use the count of web pages in which the query words co-occur. This is a simple similarity measure, but it has some drawbacks. First, it does not consider the positions of the words in a web page; even though two words occur in the same page, they may not be related. Second, the page count conflates all senses of a query word, even when only one sense is needed.
The next useful piece of information, the snippets extracted from the search engine for a given query, is utilized in measuring semantic similarity by providing useful information about the local context of the query. This avoids a full search for the query words in the documents. We consider only the top-ranked results from the search engine. Because of this, there is no guarantee of obtaining useful information from the top-ranked results, since the factors considered in ranking results vary across search engines.
In this project, we propose an automatic method to measure semantic similarity between entities using a web search engine, based on both page counts and lexical patterns extracted from snippets. Four word co-occurrence measures are calculated using the page counts of the query words. Lexical patterns describing semantic relations are extracted from the snippets retrieved from the search engine for the given query words, and those patterns are then clustered using a sequential clustering algorithm. The clusters are combined with the word co-occurrence measures, and the combination is learned using an SVM.
By using this semantic similarity measure, we identify the relations between entities on the world wide web. An ontology is created using this information, which avoids the manual updating of the different senses associated with entities on the web.
II. RELATED WORK
Chen et al. [1] measure the association of terms using snippets returned by web search and simple web page counts, with a model called Web Search with Double Checking (WSDC) that analyzes the snippets returned by a web search engine. They assume that two entities X and Y have an association between them if we can find Y from X (forward process) and X from Y (backward process) by web search.
The forward process counts the occurrences of Y in the snippets of X, f(Y@X), and the backward process counts the occurrences of X in the snippets of Y, f(X@Y). This is called the double checking operation.
Co-occurrence double checking depends on the search engine's ranking algorithm.


The CODC measure is defined as

CODC(X, Y) = \begin{cases} 0, & \text{if } f(Y@X) = 0 \text{ or } f(X@Y) = 0 \\ e^{\log\left[\frac{f(Y@X)}{f(X)} \times \frac{f(X@Y)}{f(Y)}\right]^{\alpha}}, & \text{otherwise} \end{cases}    (1)

Here, f(X) is the total number of occurrences of X and f(Y) is the total number of occurrences of Y. CODC measures association in the interval [0, 1]. When f(Y@X) = 0 and f(X@Y) = 0, CODC(X, Y) = 0; when f(Y@X) = 1 and f(X@Y) = 1, CODC(X, Y) = 1. This measure is very stable in word-word and name-name experiments.
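For concreteness, a minimal Python sketch of the CODC computation follows. The function name and signature are illustrative, and the exponent α is read as scaling the logarithm (so the score stays in [0, 1] and equals 1 when the ratio is 1); that reading of the published formula is an assumption.

```python
def codc(f_y_at_x: int, f_x_at_y: int, f_x: int, f_y: int,
         alpha: float = 0.15) -> float:
    """Co-Occurrence Double Checking score in the spirit of Chen et al. [1].

    f_y_at_x: occurrences of Y in the snippets returned for X (forward check)
    f_x_at_y: occurrences of X in the snippets returned for Y (backward check)
    f_x, f_y: total occurrences of X and Y
    """
    if f_y_at_x == 0 or f_x_at_y == 0:
        return 0.0
    ratio = (f_y_at_x / f_x) * (f_x_at_y / f_y)
    # e^(alpha * log(ratio)) == ratio ** alpha; this reading of the exponent
    # keeps the score in [0, 1] and gives CODC = 1 when the ratio is 1.
    return ratio ** alpha
```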
Cilibrasi and Vitanyi [2] proposed a theory of similarity between words or entities based on information distance and Kolmogorov complexity. Words and phrases acquire their meaning from the way they are used in society and from their relative semantics. To measure this, the world wide web is used as a database together with a search engine. A measure called the Google similarity distance is constructed using Google page counts. It is also called the Normalized Google Distance (NGD) and is defined as
NGD(P, Q) = \frac{\max\{\log H(P), \log H(Q)\} - \log H(P, Q)}{\log N - \min\{\log H(P), \log H(Q)\}}    (2)
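A corresponding sketch of the NGD computation from raw page counts; note that NGD is a distance, so smaller values indicate more related words. The signature is illustrative and assumes all page counts are nonzero.

```python
import math

def ngd(h_p: int, h_q: int, h_pq: int, n: int) -> float:
    """Normalized Google Distance (2) from page counts.

    h_p, h_q: page counts for P and Q; h_pq: page count for "P AND Q";
    n: number of documents indexed by the search engine.
    """
    log_p, log_q, log_pq = math.log(h_p), math.log(h_q), math.log(h_pq)
    return (max(log_p, log_q) - log_pq) / (math.log(n) - min(log_p, log_q))
```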

[Figure 1: Outline of the Proposed Method. The query words X and Y are sent to a web search engine; the page counts feed the WebJaccard, WebOverlap, WebDice, and WebPMI measures, while the returned snippets yield lexical patterns that are grouped into pattern clusters and combined with the co-occurrence measures by an SVM.]

NGD ignores the positions of the words in a web page, it may conflate all senses of a particular word, and it does not consider the local context of the query.
Sahami and Heilman [3] proposed a web-based kernel function for measuring the similarity of short text snippets. Generally, there may be different ways of describing the same concept or entity; for example, "AI" and "Artificial Intelligence" are very similar in meaning but share no common terms. Thus, similarity measures that consider only common terms, such as cosine similarity, will give incorrect results.
Their method for measuring the similarity between short text snippets considers the semantic context of those snippets. Each snippet is treated as a query. For each query, snippets are collected from a search engine and represented as TF-IDF vectors. The centroid of the vectors is computed, and semantic similarity is defined as the inner product between the centroid vectors corresponding to the queries.
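A minimal sketch of this snippet-centroid kernel, assuming scikit-learn's TfidfVectorizer; fitting one vectorizer over both snippet collections is an implementation choice here, not something specified in [3].

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def snippet_kernel(snippets_a: list[str], snippets_b: list[str]) -> float:
    """Similarity of two short texts from the snippets each one retrieves,
    in the spirit of Sahami and Heilman [3]: represent every snippet as a
    TF-IDF vector, take the centroid per query, and return the inner
    product of the normalized centroids."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(snippets_a + snippets_b).toarray()
    centroid_a = tfidf[: len(snippets_a)].mean(axis=0)
    centroid_b = tfidf[len(snippets_a):].mean(axis=0)
    centroid_a /= np.linalg.norm(centroid_a) or 1.0  # guard against zero vectors
    centroid_b /= np.linalg.norm(centroid_b) or 1.0
    return float(centroid_a @ centroid_b)
```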

III. SEMANTIC SIMILARITY MEASURE

In this project, we propose an automatic method to measure semantic similarity between entities using a web search engine, based on both page counts and lexical patterns extracted from snippets. Four word co-occurrence measures are calculated using the page counts of the query words. Lexical patterns describing semantic relations are extracted from the snippets retrieved from the search engine for the given query words, and those patterns are then clustered using a sequential clustering algorithm. The clusters are combined with the word co-occurrence measures, and the combination is learned using an SVM. By using this semantic similarity measure, we identify the relations between entities on the world wide web. An ontology is created using this information, which avoids the manual updating of the different senses associated with entities on the web.

A. PAGE COUNT CALCULATION

Page count is an estimate of the number of web pages that contain the query words together. We query the search engine using two words, and the page counts for each word separately and for their co-occurrence are obtained. For example, for query words P and Q, the page counts for P and Q are

calculated separately, and the page count for the conjunctive query P AND Q is also calculated.
For example, Google returns 11,300,000 as the page count for "car" AND "automobile," whereas it returns 49,000,000 for "car" AND "apple." Although automobile is more semantically similar to car than apple is, the page count for the query "car" AND "apple" is more than four times greater than that for the query "car" AND "automobile." One must therefore consider the page counts not just for the query P AND Q, but also for the individual words P and Q, to assess the semantic similarity between P and Q.
Using these estimates, we calculate four word co-occurrence measures, namely WebJaccard, WebDice, WebOverlap, and WebPMI, to compute the semantic similarity.
WebJaccard(P, Q) is defined as

WebJaccard(P, Q) = \begin{cases} 0, & \text{if } H(P \cap Q) \le c \\ \frac{H(P \cap Q)}{H(P) + H(Q) - H(P \cap Q)}, & \text{otherwise} \end{cases}    (3)

Given the scale and noise in web data, it is possible that two words may appear on some pages even though they are not related. In order to reduce the adverse effects attributable to such co-occurrences, we set the WebJaccard coefficient to zero if the page count for the query P ∩ Q is less than a threshold c.
WebDice(P, Q) is defined as

WebDice(P, Q) = \begin{cases} 0, & \text{if } H(P \cap Q) \le c \\ \frac{2 H(P \cap Q)}{H(P) + H(Q)}, & \text{otherwise} \end{cases}    (4)

WebOverlap(P, Q) is defined as

WebOverlap(P, Q) = \begin{cases} 0, & \text{if } H(P \cap Q) \le c \\ \frac{H(P \cap Q)}{\min(H(P), H(Q))}, & \text{otherwise} \end{cases}    (5)

WebPMI(P, Q) is defined as

WebPMI(P, Q) = \begin{cases} 0, & \text{if } H(P \cap Q) \le c \\ \log_2 \frac{H(P \cap Q)/N}{(H(P)/N)(H(Q)/N)}, & \text{otherwise} \end{cases}    (6)

In the above equations, H(P), H(Q), and H(P ∩ Q) are the page counts of the queries P, Q, and P AND Q, respectively; c is a constant; and N is the number of documents indexed by the search engine [4].
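The four measures of (3)-(6) reduce to simple arithmetic on three page counts. A sketch follows; the threshold value c = 5 is an illustrative choice, not a value fixed by the paper.

```python
import math

def web_measures(h_p: int, h_q: int, h_pq: int, n: int, c: int = 5) -> dict:
    """Page-count-based co-occurrence measures of equations (3)-(6).

    h_p, h_q, h_pq: page counts for P, Q, and "P AND Q";
    n: number of documents indexed by the search engine;
    c: noise threshold below which all measures are zeroed.
    """
    if h_pq <= c:
        return {"WebJaccard": 0.0, "WebDice": 0.0,
                "WebOverlap": 0.0, "WebPMI": 0.0}
    return {
        "WebJaccard": h_pq / (h_p + h_q - h_pq),
        "WebDice": 2 * h_pq / (h_p + h_q),
        "WebOverlap": h_pq / min(h_p, h_q),
        "WebPMI": math.log2((h_pq / n) / ((h_p / n) * (h_q / n))),
    }
```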
B. LEXICAL PATTERN EXTRACTION

Snippets are the text selected from a document that includes the queried words. They give information about the local context of the query, which provides useful clues to the semantic relations that exist between the query words. Using snippets is computationally efficient since it obviates the need to download the documents, and thus large time consumption is avoided. Here, the phrases between the query words are extracted from the snippets; these phrases describe the semantic relationships between the query words. For example, is a, is the largest, is part of, and also known as are all semantic relations between query words.
Consider the following snippet:

"Cricket is a game played between two teams, each with eleven players."
In this example snippet, the words indicating the semantic relation appear between the query words Cricket and Game: the lexical pattern "is a" describes the semantic relation between Cricket and Game.
Though snippets are efficient for measuring semantic relations, they have some drawbacks: a snippet can be a fragmented sentence, and it can also be formed from fragments taken from different positions in the document. In Google, delimiters are used to separate the multiple fragments of a snippet; before lexical pattern extraction, these fragments are separated. An example delimiter used in Google is "....". A lexical pattern extraction algorithm is used to find the semantic relations that exist between the query words, and we run the algorithm separately on each fragment.
For two words P and Q, we query the search engine using the wildcard query "P * * * * * Q" and download the retrieved snippets. Each "*" denotes zero or one word between the query words, so the retrieved snippets contain the two query words within a window of seven words.
For a snippet retrieved for a word pair (P, Q), we first replace the two words P and Q, respectively, with two variables X and Y. We replace all numeric values by D, a marker for digits. Next, we generate all subsequences of words from that snippet that satisfy all of the following conditions:
A subsequence must contain exactly one occurrence of each X and Y.
The maximum length of a subsequence is L words.
A subsequence is allowed to skip one or more words. However, we do not skip more than G words consecutively. Moreover, the total number of words skipped in a subsequence should not exceed G.
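A brute-force Python sketch of this subsequence generation follows (the paper itself uses a modified PrefixSpan for efficiency, as noted below); the values chosen here for L and G are illustrative.

```python
from itertools import combinations

def extract_patterns(tokens: list[str], max_len: int = 5,
                     max_skip: int = 2) -> set[str]:
    """Generate candidate lexical patterns from a snippet whose query words
    have already been replaced by X and Y (and numbers by D).

    A pattern keeps exactly one X and one Y, has at most max_len words,
    never skips more than max_skip words consecutively, and skips at most
    max_skip words in total (L and G in the text)."""
    n = len(tokens)
    patterns = set()
    for size in range(2, max_len + 1):
        for idxs in combinations(range(n), size):
            words = [tokens[i] for i in idxs]
            if words.count("X") != 1 or words.count("Y") != 1:
                continue
            # total number of words skipped inside the chosen span
            if (idxs[-1] - idxs[0] + 1) - size > max_skip:
                continue
            # no gap between consecutive kept words may exceed max_skip
            if any(b - a - 1 > max_skip for a, b in zip(idxs, idxs[1:])):
                continue
            patterns.add(" ".join(words))
    return patterns

# e.g. extract_patterns("X is a Y played between two teams".split())
# yields patterns such as "X is a Y".
```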

We expand all negation contractions in a context. For example, didn't is expanded to did not. We do not skip the word not when generating subsequences. For example, this condition ensures that from the snippet X is not a Y, we do not produce the subsequence X is a Y.
Finally, we count the frequency of all generated subsequences and only use the subsequences that occur more than T times as lexical patterns. We use a modified version of the PrefixSpan [5] algorithm to generate the subsequences of text snippets.

C. LEXICAL PATTERN CLUSTERING

A semantic relation can be represented by different patterns. For example, consider the two patterns "is a" and "is a large"; both describe an "is-a" semantic relation between words. Identifying the different patterns that describe the same semantic relation lets us measure semantic similarity accurately. Patterns are represented by vectors of word-pair frequencies: the word-pair frequency vector of a pattern a holds the frequencies of a in different word pairs, where the frequency f(Pi, Qi, a) is the number of times the pattern a occurs between the word pair (Pi, Qi).
Given a set of patterns and a clustering similarity threshold, the clustering algorithm returns clusters (of patterns) that express similar semantic relations. First, in Algorithm 1, the function SORT sorts the patterns into descending order of their total occurrences in all word pairs. The total occurrence of a pattern a in all word pairs, μ(a), is defined as

\mu(a) = \sum_{i} f(P_i, Q_i, a)    (7)

After sorting, the most frequent patterns come first and the rare patterns shift to the end of the set. We initialize the set of clusters, C, to the empty set. The outer for loop (starting at line 3) repeatedly takes a pattern ai from the ordered set, and the inner for loop (starting at line 6) finds the cluster c* (∈ C) that is most similar to ai. To compute the similarity between a pattern and a cluster, we represent a cluster by the centroid of all the word-pair frequency vectors corresponding to the patterns in that cluster, and we compute the cosine similarity between the cluster centroid cj and the word-pair frequency vector of the pattern ai. If the similarity between a pattern ai and its most similar cluster c* is greater than the threshold θ, we append ai to c*.
Algorithm 1. Sequential pattern clustering algorithm.
Input: patterns P = {a1, ..., an}, threshold θ
Output: clusters C
1: SORT(P)
2: C ← {}
3: for pattern ai ∈ P do
4:   max ← −∞
5:   c* ← null
6:   for cluster cj ∈ C do
7:     sim ← cosine(ai, cj)
8:     if sim > max then
9:       max ← sim
10:      c* ← cj
11:    end if
12:  end for
13:  if max > θ then
14:    c* ← c* ∪ {ai}
15:  else
16:    C ← C ∪ {{ai}}
17:  end if
18: end for
19: return C
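A direct Python rendering of Algorithm 1 follows. The printed algorithm does not say whether a cluster centroid is recomputed when a pattern is appended; this sketch recomputes it, which is an assumption.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def sequential_cluster(patterns: dict[str, np.ndarray],
                       theta: float) -> list[list[str]]:
    """Algorithm 1: patterns maps each pattern to its word-pair frequency
    vector; theta is the clustering threshold. Patterns are visited in
    descending order of total frequency mu(a); each joins its most similar
    cluster if the cosine to the cluster centroid exceeds theta, and
    otherwise starts a new cluster."""
    order = sorted(patterns, key=lambda a: patterns[a].sum(), reverse=True)
    clusters: list[list[str]] = []
    centroids: list[np.ndarray] = []
    for a in order:
        best, best_sim = None, -np.inf
        for j, centroid in enumerate(centroids):
            sim = cosine(patterns[a], centroid)
            if sim > best_sim:
                best, best_sim = j, sim
        if best is not None and best_sim > theta:
            clusters[best].append(a)
            # recompute the centroid over the enlarged cluster (assumption)
            centroids[best] = np.mean([patterns[p] for p in clusters[best]],
                                      axis=0)
        else:
            clusters.append([a])
            centroids.append(patterns[a].copy())
    return clusters
```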


By sorting the lexical patterns in descending order of their frequency and clustering the most frequent patterns first, we form clusters for the more common relations first. This enables us to keep rare patterns, which are likely to be outliers, from attaching to otherwise clean clusters.

D. MEASURING THE SEMANTIC SIMILARITY


We have now extracted lexical patterns based on the snippets returned by the web search engine, as well as the page counts for the query words and their conjunctive form. Next, we combine the four word co-occurrence measures with the lexical patterns by learning with a Support Vector Machine (SVM). To train the SVM, we take 3,000 synonymous word pairs and 3,000 non-synonymous word pairs as the training set. The synonymous word pairs are extracted from WordNet, and the non-synonymous word pairs are formed from the synonymous word pairs by rearranging the words between pairs.
Given N clusters of lexical patterns, we first represent a pair of words (P, Q) by an (N + 4)-dimensional feature vector fPQ. The four page-count-based co-occurrence measures defined in Section III.A are used as four distinct features in fPQ. For completeness, let us assume that the (N + 1)st, (N + 2)nd, (N + 3)rd, and (N + 4)th features are set, respectively, to WebJaccard, WebOverlap, WebDice, and WebPMI. Next, we compute a feature from each of the N clusters as follows: first, we assign a weight wij to a pattern ai that is in a cluster cj as follows
w_{ij} = \frac{\mu(a_i)}{\sum_{t \in c_j} \mu(t)}    (8)

Here, μ(a) is the total frequency of a pattern a in all word pairs, and it is given by (7). Because we perform a hard clustering on patterns, a pattern can belong to only one cluster (i.e., wij = 0 for ai ∉ cj). Finally, we compute the value of the jth feature in the feature vector for a word pair (P, Q) as follows:

f_{PQ,j} = \sum_{a_i \in c_j} w_{ij} \, f(P, Q, a_i)    (9)
The value of the jth feature of the feature vector fPQ representing a word pair (P, Q) can be seen as the weighted sum of all patterns in cluster cj that co-occur with the words P and Q. We assume all patterns in a cluster to represent a particular semantic relation. Consequently, the jth feature value given by (9) expresses the significance of the semantic relation represented by cluster j for the word pair (P, Q). For example, if the weight wij were set to 1 for all patterns ai in a cluster cj, then the jth feature value would simply be the sum of the frequencies of all patterns in cluster cj with the words P and Q. However, assigning an equal weight to all patterns in a cluster is not desirable in practice, because some patterns can contain misspellings and/or can be grammatically incorrect. Equation (8) assigns a weight to a pattern proportional to its frequency in a cluster. If a pattern has a high frequency in a cluster, then it is likely to be a canonical form of the relation represented by all the patterns in that cluster. Consequently, the weighting scheme described by (8) prefers highly frequent patterns in a cluster.
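A sketch of the feature-vector construction of (8) and (9) follows; the container types and the placement of the four co-occurrence features at the front of the vector are illustrative.

```python
import numpy as np

def feature_vector(pair_measures: dict[str, float],
                   pattern_freq: dict[str, int],
                   clusters: list[list[str]],
                   mu: dict[str, int]) -> np.ndarray:
    """(N + 4)-dimensional feature vector for one word pair (P, Q).

    pair_measures: the four co-occurrence scores for this pair;
    pattern_freq: maps each pattern a to f(P, Q, a) for this pair;
    clusters: the N pattern clusters from Algorithm 1;
    mu: maps a pattern to its total frequency mu(a) over all word
        pairs, as in (7)."""
    features = [pair_measures[k] for k in
                ("WebJaccard", "WebOverlap", "WebDice", "WebPMI")]
    for cluster in clusters:
        total = sum(mu[a] for a in cluster)       # normalizer of eq. (8)
        features.append(sum((mu[a] / total) * pattern_freq.get(a, 0)
                            for a in cluster))    # eq. (9) per cluster
    return np.array(features)
```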
To train a two-class SVM to detect synonymous and non-synonymous word pairs, we utilize a training data set S = {(Pk, Qk, yk)} of word pairs. S consists of synonymous word pairs (positive training instances) and non-synonymous word pairs (negative training instances), and is generated automatically from WordNet synsets. The label yk ∈ {−1, 1} indicates whether the word pair (Pk, Qk) is a synonymous word pair (i.e., yk = 1) or a non-synonymous word pair (i.e., yk = −1). For each word pair in S, we create an (N + 4)-dimensional feature vector as described above. To simplify the notation, let us denote the feature vector of a word pair (Pk, Qk) by fk. Finally, we train a two-class SVM using the labeled feature vectors.
We define the semantic similarity sim(P*, Q*) between P* and Q* as the posterior probability, p(y* = 1 | f*), that the feature vector f*, corresponding to the word pair (P*, Q*), belongs to the synonymous-words class (i.e., y* = 1). sim(P*, Q*) is given by

sim(P^*, Q^*) = p(y^* = 1 \mid f^*)    (10)

Because SVMs are large-margin classifiers, the output of an SVM is the distance from the classification hyperplane. The distance d(f*) of an instance f* from the classification hyperplane is given by

d(f^*) = h(f^*) + b    (11)

Here, b is the bias term and the hyperplane, h(f*), is given by

h(f^*) = \sum_{k} y_k \alpha_k K(f_k, f^*)    (12)

where K(·,·) is the kernel function.
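A sketch of the training and scoring step, assuming scikit-learn's SVC: with probability=True, scikit-learn fits a sigmoid to the SVM margins, which matches the use of a posterior probability p(y* = 1 | f*) derived from the hyperplane distance in (10)-(12). The RBF kernel choice is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

def train_similarity_svm(features: np.ndarray, labels: np.ndarray) -> SVC:
    """Train on labeled feature vectors: y = 1 for synonymous pairs,
    y = -1 otherwise. probability=True fits a sigmoid on the margins so
    that p(y = 1 | f) can be read off as the similarity score of (10)."""
    svm = SVC(kernel="rbf", probability=True)  # kernel choice is an assumption
    svm.fit(features, labels)
    return svm

def sim(svm: SVC, f_star: np.ndarray) -> float:
    """Semantic similarity of a word pair = posterior probability that its
    feature vector belongs to the synonymous-words class."""
    col = list(svm.classes_).index(1)
    return float(svm.predict_proba(f_star.reshape(1, -1))[0, col])
```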

IV. RESULT ANALYSIS

The proposed work is evaluated by comparing various semantic similarity measures on the following data set. The results in Table 1 show the differences between the various measures.

TABLE 1
COMPARISON OF VARIOUS SEMANTIC MEASURES

Word Pair       WebJaccard  WebDice  WebPMI  WebOverlap  CODC  NGD   SH    Proposed
Car-automobile  0.64        0.65     0.45    0.83        0.7   0.15  0.9   0.92
Apple-computer  0.5         0.52     0.42    0.9         0.68  0.23  1.00  1.00
Car-journey     0.32        0.45     0.28    0.42        0.5   0.05  0.6   0.7
Car-travel      0.47        0.45     0.37    0.5         0.6   0.10  0.75  0.82

V. CONCLUSION
Semantic similarity is measured using both page counts and snippets. Based on the page counts, four word co-occurrence measures are computed. Lexical patterns are extracted from snippets and then clustered using a sequential clustering algorithm. The co-occurrence


measures are combined with the lexical patterns to define the features for a given word pair. Using these features, an SVM is trained to classify synonymous and non-synonymous word pairs, and the semantic similarity is then measured via the distance of a feature vector from the hyperplane. In future work, an ontology file (WSDL) will be created using this semantic similarity measure, which will help in identifying the relations between entities.

VI. REFERENCES
[1] H. Chen, M. Lin, and Y. Wei, "Novel Association Measures Using Web Search with Double Checking," Proc. 21st Int'l Conf. Computational Linguistics and 44th Ann. Meeting of the Assoc. for Computational Linguistics (COLING/ACL '06), pp. 1009-1016, 2006.
[2] R. Cilibrasi and P. Vitanyi, "The Google Similarity Distance," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 3, pp. 370-383, 2007.
[3] M. Sahami and T. Heilman, "A Web-Based Kernel Function for Measuring the Similarity of Short Text Snippets," Proc. 15th Int'l World Wide Web Conf., 2006.
[4] Z. Bar-Yossef and M. Gurevich, "Random Sampling from a Search Engine's Index," Proc. 15th Int'l World Wide Web Conf., 2006.
[5] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M. Hsu, "Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp. 1424-1440, 2004.
[6] A. Kilgarriff, "Googleology Is Bad Science," Computational Linguistics, vol. 33, pp. 147-151, 2007.
[7] D. Bollegala, Y. Matsuo, and M. Ishizuka, "Measuring Semantic Similarity between Words Using Web Search Engines," Proc. Int'l Conf. World Wide Web (WWW '07), pp. 757-766, 2007.
[8] M. Li, X. Chen, X. Li, B. Ma, and P. Vitanyi, "The Similarity Metric," IEEE Trans. Information Theory, vol. 50, no. 12, pp. 3250-3264, Dec. 2004.
[9] Z. Bar-Yossef and M. Gurevich, "Random Sampling from a Search Engine's Index," Proc. 15th Int'l World Wide Web Conf., 2006.
[10] D. Bollegala, Y. Matsuo, and M. Ishizuka, "Disambiguating Personal Names on the Web Using Automatically Extracted Key Phrases," Proc. 17th European Conf. Artificial Intelligence, pp. 553-557, 2006.
