Computation of Semantic Similarity Among Cross Ontological Concepts For Biomedical Domain
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 111
Abstract— Based on Amos Tversky's psychological contrast model, this paper proposes a corpus-independent, information content based similarity computation method to assess similarity between biomedical concepts belonging to multiple ontologies. Ontologies have been widely used in many domains including database integration, bioinformatics, and the Semantic Web to facilitate the sharing of heterogeneous information. Semantic similarity techniques are becoming important components in most intelligent knowledge-based and semantic information retrieval (SIR) systems. This paper discusses the limitations of existing semantic similarity methods for computing similarity between concepts of a single ontology and concepts belonging to different ontologies. The proposed approach exploits the informativeness of concepts as a factor for computing the amount of specific and shared features between the concepts. Identifying the Most Specific Common Abstraction (MSCA) between concepts belonging to different ontologies is a challenge, and we propose a methodology to identify the MSCA by forming a virtual root which connects the root concepts of the considered ontologies. The proposed idea is tested using the MeSH and SNOMED-CT biomedical ontologies.
Index Terms — Biomedical domain, Information retrieval, Ontology, Similarity Methods, UMLS.
—————————— ——————————
1 INTRODUCTION
Assessing semantic similarity between concepts is a central issue in many research areas such as Linguistics, Cognitive Science, Biomedicine, and Artificial Intelligence. Semantic similarity techniques are becoming important components in most intelligent knowledge-based and semantic information retrieval (SIR) systems [1], [2]. With the growing access to heterogeneous and independent data repositories, the differences in the structure and semantics of the data stored in those repositories play a major role in information systems. Semantic similarity relates to computing the similarity between terms that are conceptually similar but not necessarily lexically similar. Typically, semantic similarity is computed by mapping terms to an ontology and examining their relationships (hyponymy, hypernymy, meronymy and homonymy) in that ontology. Semantic similarity approaches fall under four different categories: the ontology based approach, the information content based approach, the feature based approach and the hybrid based approach. The basic qualitative properties that a semantic similarity measure should consider are the commonality, difference, identity, symmetry and depth properties.

COMMONALITY PROPERTY - The similarity between A and B is related to their commonality. The more commonality they share, the more similar they are.

DIFFERENCE PROPERTY - The similarity between A and B is related to the differences between them. The less difference they have, the more similar they are.

IDENTITY PROPERTY - The maximum similarity between A and B is reached when A and B are identical, no matter how much commonality they share.

SYMMETRIC PROPERTY - The similarity between concepts (A, B) is equal to the similarity between the concepts (B, A).

DEPTH PROPERTY - The distance between A and B is represented by an edge of the concepts and is influenced by the depth of the location of the edge in the ontology.

This paper discusses the proposed method to compute semantic similarity among cross ontological concepts. Section II discusses the classification of various semantic similarity methods based on a single ontology; Section III discusses the classification of similarity methods based on cross ontology. Section IV discusses the architectural design and algorithm of the similarity method for cross ontological concepts in the biomedical domain.
————————————————
Mrs. K. Saruladha is with the Computer Science Department, Pondicherry Engineering College, Puducherry, Pin 605014, India.
Dr. K. Aghila is with the Computer Science Department, Pondicherry University, Puducherry, Pin 605014, India.
Ms. A. Bhuvaneswary is with the Computer Science Department, Pondicherry Engineering College, Puducherry, Pin 605014, India.
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 8, AUGUST 2010, ISSN 2151-9617
2 SEMANTIC SIMILARITY METHODS FOR SINGLE ONTOLOGY
Semantic similarity methods are broadly classified into single ontology similarity methods and cross ontological similarity methods. Various approaches can be used to find the similarity between two concepts in an ontology. Similarity methods for a single ontology can be broadly classified into four main approaches:
• Ontology Based Approaches
• Information Content (Corpus) Based Approaches
• Hybrid Based Approaches
• Feature Based Approaches

The information content (IC) of a concept is obtained by considering the negative log likelihood of the probability of the concept in a given corpus and is given by
$IC(c) = -\log p(c)$ (1)
where c is a concept in the considered ontology and p(c) is the probability of encountering c in a given corpus. The IC value of each concept decreases monotonically as we move from the leaves of the taxonomy to its root. The root node of the IS-A hierarchy has the maximum frequency count, since it includes the frequency counts of every other concept in the hierarchy. This approach adheres to the basic properties such as commonality, symmetry and difference.
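A minimal sketch of the corpus based IC in (1) may help here. The taxonomy, the concept names and the raw counts below are invented for illustration and are not taken from MeSH or SNOMED-CT; the sketch only assumes that each occurrence of a concept is also counted for all of its ancestors, as described above.

```python
import math

# Hypothetical IS-A taxonomy (child -> parent) and raw corpus counts;
# the names and numbers are invented for illustration only.
PARENT = {"disease": None, "infection": "disease", "anemia": "disease",
          "pyelonephritis": "infection"}
RAW_COUNT = {"disease": 10, "infection": 20, "anemia": 30, "pyelonephritis": 40}


def cumulative_counts(parent, raw):
    """Add each concept's raw count to itself and to all of its ancestors."""
    total = {c: 0 for c in parent}
    for concept, count in raw.items():
        node = concept
        while node is not None:
            total[node] += count
            node = parent[node]
    return total


def information_content(parent, raw):
    """Return IC(c) = -log p(c), with p(c) = cumulative count / root count."""
    total = cumulative_counts(parent, raw)
    root_total = max(total.values())  # the root subsumes every concept
    return {c: -math.log(total[c] / root_total) for c in parent}


ic = information_content(PARENT, RAW_COUNT)
```

In this toy hierarchy the root has p(c) = 1 and therefore IC = 0, while concepts nearer the leaves receive larger IC values.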
Fig. 2. Semantic Similarity Methods for Cross Ontology.

3.1 Path Length Approach for Cross Ontology
The ontology based approach used for a single ontology differs in the case of multiple ontologies in that one ontology is considered as primary and the other as secondary. The semantic similarity between cross ontological concepts is measured by joining the two ontologies at a common node, which is considered as a bridge node. According to the method of Al-Mubaid et al. [2], the semantic similarity between concepts in a single ontology and in multiple ontologies is measured by an ontology-structure-based technique for the biomedical domain (MeSH). Al-Mubaid et al. proposed that semantic similarity can be measured in three different cases: 1) similarity method for a single primary ontology, 2) similarity method for cross ontology and 3) similarity method within secondary ontologies.
The semantic similarity measure for cross ontology is based on three features:
• A common specificity of concepts in the ontology
• Cross modified path length between two concepts
• A local granularity of both ontologies.
For cross-ontology semantic similarity, the common specificity feature between two concepts C1 and C2 takes into account the depth of the least common subsumer (LCS) of the two concepts and the depth D of the ontology.
$CSpec(C_1, C_2) = D - Depth(LCS(C_1, C_2))$ (3)
The smaller the CSpec value, the more information the two concepts share. In this case, the two concepts belong to two different ontologies, one identified as the primary ontology and the other, with the lesser number of concepts, as the secondary ontology. Using the bridge node, the least common subsumer of the two concepts (C1, C2) is obtained by considering the LCS of the first node C1 in the primary ontology and the bridge node,
$LCS(C_1, C_2) = LCS(C_1, bridge_n)$ (4)
Thus the path length is calculated by adding d1 = d(C1, bridge) and d2 = d(C2, bridge). In order to scale the path length and CSpec features in the secondary ontology to the primary ontology, the path rate is given by
$PathRate = \frac{2D_1 - 1}{2D_2 - 1}$ (5)
where D1 and D2 represent the depths of the primary and secondary ontologies. According to the path feature scale of the primary ontology, the cross modified path length between the two concept nodes is calculated as given in (6)
$Path(C_1, C_2) = d_1 + PathRate \times d_2 - 1$ (6)
Since there may be many bridge nodes between two concepts, there can be more than one path length, i.e. {path_i}, and the semantic distance, SemDist, between two concept nodes is given as follows
$CSpec_i(C_1, C_2) = D_1 - Depth(LCS(C_1, Bridge_i))$ (7)
$SemDist(C_1, C_2) = \log((path_i - 1) \times CSpec_i + k)$ (8)

3.2 Feature Based Approach for Cross Ontology
According to Rodriguez & Egenhofer, the semantic similarity among multiple ontologies is measured by considering three important features: 1) matching process, 2) semantic neighborhoods and 3) distinguishing features.
In [16], each concept is considered as an entity class. The similarity between entity classes is given as
$S(a^p, b^q) = W_w S_w(a^p, b^q) + W_u S_u(a^p, b^q) + W_n S_n(a^p, b^q)$ (9)
where W_w, W_u, W_n are the respective weights of the similarity of each component and their values are greater than 0. The functions S_w, S_u and S_n are the similarities between synonym sets, features and semantic neighborhoods. The entity class a belongs to ontology p and b belongs to ontology q. The similarity between entity classes is calculated using synonym sets, features, and semantic neighborhoods and is given by
$S(a, b) = \frac{|A \cap B|}{|A \cap B| + \alpha(a, b)\,|A \setminus B| + (1 - \alpha(a, b))\,|B \setminus A|}$ (10)
where α is the function representing the depth of the ontology and its value ranges from 0 to 1. The function α is given in (11), (12).
When $depth(C_1^{O_1}) \le depth(C_2^{O_2})$
$\alpha(C_1, C_2) = \frac{depth(C_1^{O_1})}{depth(C_1^{O_1}) + depth(C_2^{O_2})}$ (11)
When $depth(C_1^{O_1}) > depth(C_2^{O_2})$
$\alpha(C_1, C_2) = 1 - \frac{depth(C_1^{O_1})}{depth(C_1^{O_1}) + depth(C_2^{O_2})}$ (12)
Word matching (S_w) is determined by considering the set of common words and different words in the synonym sets that denote the entity classes [14]. Feature matching (S_u) applies a matching process which classifies features into parts (S_p), functions (S_f), and attributes (S_a). The feature similarity using word matching is given by
$S_u(a^p, b^q) = W_p S_p(a^p, b^q) + W_f S_f(a^p, b^q) + W_a S_a(a^p, b^q)$ (13)
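The feature based ratio in (10), its depth based weight in (11) and (12), and the weighted sum in (13) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the feature sets, the equal weights and the handling of two concepts at depth 0 are hypothetical choices, not part of the cited method.

```python
def alpha(depth_a, depth_b):
    """Depth based weight following (11) and (12)."""
    total = depth_a + depth_b
    if total == 0:
        return 0.5  # assumption: two concepts at depth 0 are weighted equally
    ratio = depth_a / total
    return ratio if depth_a <= depth_b else 1.0 - ratio


def feature_ratio(features_a, features_b, depth_a, depth_b):
    """Ratio of (10): common features against weighted distinguishing features."""
    a, b = set(features_a), set(features_b)
    common = len(a & b)
    w = alpha(depth_a, depth_b)
    denominator = common + w * len(a - b) + (1.0 - w) * len(b - a)
    return common / denominator if denominator else 0.0


def feature_match(s_parts, s_functions, s_attributes,
                  weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of (13); the equal weights are an assumption."""
    w_p, w_f, w_a = weights
    return w_p * s_parts + w_f * s_functions + w_a * s_attributes


# Invented part sets for two entity classes at depths 2 and 3.
parts_a = {"blood", "red", "cell"}
parts_b = {"blood", "cell", "white", "immune"}
s_p = feature_ratio(parts_a, parts_b, depth_a=2, depth_b=3)
```

With these invented sets the two classes share two features, the weight is α = 2/5, and the ratio is 2 / (2 + 0.4 × 1 + 0.6 × 2). Identical feature sets give a ratio of 1.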
In (13), W_p, W_f, W_a ≥ 0. Semantic-neighborhood matching (S_n) compares entity classes a^p and b^q of ontologies p and q with radius r, respectively. The semantic neighborhood similarity is given by
$S_n(a^p, b^q, r) = \frac{|a^p \cap_n b^q|}{|a^p \cap_n b^q| + \alpha(a^p, b^q)\,\delta(a^p, a^p \cap_n b^q, r) + (1 - \alpha(a^p, b^q))\,\delta(b^q, a^p \cap_n b^q, r)}$ (14)

But when we extend the P&S [14], [15] metric to cross ontologies, the IC value of a concept should be computed based on both the ontologies. The following principles were kept in mind for designing the new similarity measure.
• The proposed measure should be based on human psychological models as all of the existing
nate one ontology as primary and the other as secondary based on the granularity of the concepts they possess, and then identify the concepts for which the semantic similarity is to be calculated. Let O1(Ci) and O2(Cj) be the concepts belonging to the corresponding ontologies and r1 and r2 be the root nodes of the selected ontologies. Create a virtual root (VR) and connect the root nodes r1 and r2 to it. For our experiments we have considered the datasets (36 concept pairs) used by [2] & [14]. For the biomedical concepts of the datasets, XML files are generated using Clinclue and the Dragon toolkit. Each XML input file contains the hypernymy and hyponymy relations of each concept. It also contains the depth and synonym set of each concept. The created XML files of the biomedical concepts serve as input to the algorithm and the semantic similarity is calculated.

SS_Score Algorithm (Cross Sim(XML file1, XML file2))
// SS_Score represents Semantic Similarity Score.
Step 1: Get the input XML file for the concepts from the repository.
Step 5: Calculate the information content for the specific concepts, i.e. IC(C1) and IC(C2)
$IC(c) = 1 - \frac{\log(hypo(c) + 1)}{\log(max_{con})}$
go to Step 7
Step 6: Calculate the depth of the concepts in both ontologies by (12), (13)
go to Step 7
Step 7: Calculate the semantic similarity between the concept pair (C1, C2) by
$Sim(C_1, C_2) = \frac{IC(MSCA(C_1, C_2))}{IC(MSCA(C_1, C_2)) + \alpha(C_1, C_2)\,IC(C_1) + (1 - \alpha(C_1, C_2))\,IC(C_2)}$
go to Step 8
Step 8: Calculate the semantic similarity for cross ontology using the refined Information Content approaches (Resnik using (19), J&C using (20), Lin using (21)).
Step 9: Collect human judgements for the concept pairs for which the similarity rating is to be calculated.
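Steps 5 to 7 above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the hyponym counts, depths and ontology size are invented, and α is assumed to be the depth based weight of (11) and (12).

```python
import math


def intrinsic_ic(hyponym_count, max_concepts):
    """Step 5: IC(c) = 1 - log(hypo(c) + 1) / log(max_con)."""
    return 1.0 - math.log(hyponym_count + 1) / math.log(max_concepts)


def alpha(depth_1, depth_2):
    """Depth based weight, assumed here to follow (11) and (12)."""
    total = depth_1 + depth_2
    if total == 0:
        return 0.5  # assumption: equal weighting when both depths are 0
    ratio = depth_1 / total
    return ratio if depth_1 <= depth_2 else 1.0 - ratio


def cross_similarity(ic_msca, ic_1, ic_2, depth_1, depth_2):
    """Step 7: shared information (MSCA) against weighted specific information."""
    w = alpha(depth_1, depth_2)
    denominator = ic_msca + w * ic_1 + (1.0 - w) * ic_2
    return ic_msca / denominator if denominator else 0.0


# Invented values: an ontology of 1000 concepts, an MSCA with 99 hyponyms,
# a leaf concept C1 at depth 4 and a concept C2 with 9 hyponyms at depth 3.
MAX_CONCEPTS = 1000
ic_msca = intrinsic_ic(99, MAX_CONCEPTS)
ic_c1 = intrinsic_ic(0, MAX_CONCEPTS)
ic_c2 = intrinsic_ic(9, MAX_CONCEPTS)
sim = cross_similarity(ic_msca, ic_c1, ic_c2, depth_1=4, depth_2=3)
```

Because the intrinsic IC depends only on the number of hyponyms, no corpus is needed. A leaf concept receives IC = 1, and a concept that subsumes every other concept receives IC = 0.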
The similarity between concepts a4 and b3, which belong to two different ontologies, is measured by connecting the sub-roots (a1 and b1) of the concepts to the virtual root (VR). The common ancestors that exist between a4 and b3 among the different ontologies are a3 in O1 and b3 in O2. Thus MSCA(a4, b3) is chosen from the ontology whose MSCA concept has the minimum number of hyponyms, and its information content value can be calculated using (16). The information content for the specific concepts is measured using (17). The depth of the MSCA concept from the virtual root is calculated using (11) & (12). Thus the similarity value among cross ontological concepts is calculated using (15).

TABLE 2
SIMILARITY RATING FOR BIOMEDICAL CONCEPTS
Concept 1        Concept 2             Similarity rating
Anemia           Appendicitis          0
Antibiotics      Antibacterial agent   0.736
Urinary tract    Pyelonephritis        0.373

[2] H. A. Nguyen and H. Al-Mubaid, “Measuring Semantic Similarity Between Biomedical Concepts Within Multiple Ontologies,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 39, no. 4, pp. 339–398, 2009.
[3] A. Tversky, “Features of similarity,” Psychological Review, vol. 84, no. 2, pp. 327–352, 1977.
[4] Rada, H. Mili, M. Bicknell, E. Blettner, “Development and application of a metric on semantic nets,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 19, pp. 17–30, 1989.
[5] G. Hirst, D. St-Onge, “Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms,” in WordNet: An Electronic Lexical Database, MIT Press, 1998.
[6] Wu and M. Palmer, “Verb semantics and lexical selection,” Proc. 32nd Ann. Meeting Assoc. Comput. Linguistics, pp. 133–138, 1994.
[7] Michael Sussna, “Word sense disambiguation for free-text indexing using a massive semantic network,” Proc. Second International Conference on Information and Knowledge Management, pp. 67–74, 1993.
[8] Claudia Leacock and Martin Chodorow, “Combining local context and WordNet similarity for word sense identification,” in Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database, pp. 265–283, 1998.
[9] P. Resnik, “Information content to evaluate semantic similarity in taxonomy,” Proc. of IJCAI, pp. 448–453, 1995.
[10] D. Lin, “An information-theoretic definition of similarity,” in Proc. of Conference on Machine Learning, pp. 296–304, 1998.
[18] MeSH Browser (2010). Available: http://www.nlm.nih.gov/mesh/MBrowser.html
[19] SNOMED-CT (2010). Available: http://www.snomed.org/index.html
[20] Angelos Hliaoutakis, “Semantic Similarity Measure in MeSH Ontology and their application to Information Retrieval on Medline,” 2005.
[21] Giuseppe Pirro and Jerome Euzenat, “A Feature and Information Theoretic Framework for Semantic Similarity and Relatedness,” 2010.