
Knowledge-Based Systems xxx (2012) xxxxxx

Contents lists available at SciVerse ScienceDirect

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

Evolutionary algorithm based on different semantic similarity functions for synonym recognition in the biomedical domain

José M. Chaves-González, Jorge Martínez-Gil
University of Extremadura, Escuela Politécnica, 10003 Cáceres, Spain

Article info

Article history:
Received 12 December 2011
Received in revised form 9 May 2012
Accepted 13 July 2012
Available online xxxx

Keywords:
Semantic similarity
Evolutionary computation
Semantic web
Synonym recognition
Differential evolution

Abstract

One of the most challenging problems in the semantic web field consists of computing the semantic similarity between different terms. The problem here is the lack of accurate domain-specific dictionaries, such as biomedical, financial or any other particular and dynamic field. In this article we propose a new approach which uses different existing semantic similarity methods to obtain precise results in the biomedical domain. Specifically, we have developed an evolutionary algorithm which uses information provided by different semantic similarity metrics. Our results have been validated against a variety of biomedical datasets and different collections of similarity functions. The proposed system provides very high quality results when compared against similarity ratings provided by human experts (in terms of Pearson correlation coefficient), surpassing the results of other relevant works previously published in the literature.
© 2012 Elsevier B.V. All rights reserved.

1. Introduction

The vast amount of information on the World Wide Web (WWW) has made the study of web semantic techniques [27] one of the most interesting areas of research. The semantic web is based on semantic organization of resources to process them in a more efficient manner. The semantic similarity [4] of text expressions is a very relevant problem, especially in fields such as data integration [10], query expansion [9] or document classification [12], among others. The reason is that those specific fields need semantic similarity computation to work properly.

On the other hand, semantic similarity measurements are usually performed with some kind of metrics [20]. The most common metrics are text-based semantic similarity measures obtained from the degree of similarity or dissimilarity between two text strings [5]. Semantic similarity measures provide a floating point number between 0 (total dissimilarity) and 1 (complete similarity).

Most existing works reach a high level of accuracy when solving general purpose datasets [22], because they describe approaches based on large and updated dictionaries [20]. However, these methods do not obtain precise results when they are applied to specific domains, such as biomedicine. The reason is that most methods in this domain work with a unique ontology. Therefore, results depend on the ontological detail and coverage [26]. In this work, we propose a new technique which beats other methods based on semantic similarity measures previously published in the specific literature. The evolutionary algorithm (EA) designed combines information provided by a variety of well-known semantic similarity functions. Thus, our EA works as a high-level heuristic designed to improve the results obtained by using each individual similarity function. Furthermore, in order to validate our results, we use different biomedical datasets for which classical algorithms do not usually get optimal results.

The rest of the manuscript is organized as follows: Section 2 discusses related work. Section 3 describes the problem and the approaches we have used in this paper. The methodology and results are explained in Section 4. Finally, conclusions and possible lines of future work are discussed in the last section.

2. Previous work

Semantic similarity measurement has traditionally been an interesting research area within the Natural Language Processing (NLP) field [13,18]. The reason is that synonym recognition is a key aspect of human conversations. In fact, the fast development of the semantic web has led researchers to focus on the development of techniques based on synonym recognition to improve the discovery of resources on the WWW [14]. Thus, a user who is looking for the term cars obtains results including terms such as automobiles and vehicles.

A first approach to measure the semantic similarity between two terms consists of computing the Euclidean distance (one of the most popular metrics) between those words. However,

Corresponding author.
E-mail addresses: jm@unex.es (J.M. Chaves-González), jorgemar@unex.es (J. Martínez-Gil).

0950-7051/$ - see front matter 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.knosys.2012.07.005

Please cite this article in press as: J.M. Chaves-Gonzlez, J. Martnez-Gil, Evolutionary algorithm based on different semantic similarity functions for
synonym recognition in the biomedical domain, Knowl. Based Syst. (2012), http://dx.doi.org/10.1016/j.knosys.2012.07.005

Fig. 1. Illustrative diagram of the proposed approach, HH(DE).

Euclidean distance is not suitable for all types of input data, such as the case in which we have to compute the distance of different word meanings. For this reason, most previous works have been focused on designing new semantic similarity measures.

Traditional semantic similarity measurement techniques use some kind of dictionary in order to compute the degree of similarity between words. The problem here is that most works use general purpose resources such as WordNet [20]. However, these sources offer limited coverage of biomedical terms. For this reason, several resources have been developed in recent years to improve semantic similarity measures, for example, Medical Subject Headings (MeSH) [22] or Unified Medical Language System (UMLS) [21]. Semantic similarity measures fall into three main categories:

• Path-based measures are based on dictionaries or thesauri. If a given word has two or more meanings in those sources, then multiple paths may exist between the two words. The problem with this method is that it relies on the notion that all links in the taxonomy are at uniform distances.
• Information content measures are based on frequency counts of concepts when they are found in the corpus text (Pedersen et al. [20]). These measures assign higher values to specific concepts (e.g. pitch fork), and lower scores to more general terms (e.g. idea).
• Feature based measures consider the similarity between terms according to their properties. In general, it is possible to estimate semantic similarity according to the number of common features. For example, these methods could be based on the relations between similar terms according to concept descriptions retrieved from dictionaries.

As was previously mentioned, the problem of traditional semantic similarity metrics is that there are no complete and updated dictionaries for specific fields. If we focus on the biomedical domain, several outstanding works have been proposed in recent years. For instance, Pirró [22] proposed a new information content measure using the MeSH biomedical ontology which successfully improved existing similarity methods. However, our study improves the results obtained in that work by using a combination of several similarity functions. Al-Mubaid and Nguyen [1] also proposed an ontology-based semantic similarity measure and applied it to the biomedical domain. This proposal is based on the path length between concept nodes and the depth of each term in the ontology hierarchy tree. Our results are also nearer to human judgments in this case. Furthermore, there are important works by Sánchez et al. [24,26] in which several semantic similarity measures based on approximating concept semantics in terms of information content are successfully presented. Once again, our technique obtains more precise results. Finally, Pedersen et al. [21] implemented and evaluated a variety of semantic similarity measures based on ontologies and terminologies found in the Unified Medical Language System (UMLS). Our results are also better in this case because we combine information provided by different methods.

3. Differential evolution for synonym recognition

In this section we describe the problem and the proposed solution. Our approach is based on the similarity scores provided by different atomic similarity functions. The evolutionary algorithm works as a hyper-heuristics which assigns different coefficient values to the similarity scores obtained from the pool of functions included in the system. At the beginning of the process, all functions (or metrics) evenly contribute to calculate the semantic similarity value of a specific pair of terms. Then, the system evolves so that the functions which provide the most similar values to the human expert's opinion have the highest coefficients. Fig. 1 shows the working diagram of the proposed approach.

The differential evolution (DE) algorithm [29] was chosen among other candidates because, after a preliminary study, we concluded that DE obtained very competitive results for the problem addressed. The reason lies in how the algorithm makes the solutions evolve. Our system can be considered as a hyper-heuristics (HHs) which uses differential evolution, HH(DE), to assign to each similarity function a specific coefficient. These values modify the relevance of each function. Differential evolution performs the search of local optima by making small additions and subtractions between the members of its population (see Section 3.3). This feature fits the problem perfectly, because the algorithm works with the scores provided by the similarity functions (Fig. 1). In fact, the individual is defined as an array of floating point values, s, where s(fx) is the coefficient which modifies the result provided by the similarity function fx. Fig. 2 illustrates the representation of the individual used in this work.

3.1. The synonym recognition problem

Given two text expressions a and b, the problem consists of providing the degree of synonymy between both words. However, synonym recognition usually extends beyond synonymy and also involves semantic similarity measurements. According to Bollegala et al. [3], a certain degree of semantic similarity is observed not only between synonyms (e.g. lift and elevator), but also between meronyms (e.g. car and wheel), hyponyms (e.g. leopard and cat), related words (e.g. blood and hospital), and even between antonyms (e.g. day and night).

Fig. 2. Individual representation.


Table 1
Classification of the most relevant similarity metrics.

Type | Similarity function and reference | Description
Path based measures | Path length (PATH) [20] | PATH is inversely proportional to the number of nodes along the shortest path between the words
Path based measures | Leacock and Chodorow (LCH) [15] | LCH can be computed as −log(length/(2 × D)), where length is the length of the shortest path between the two words and D is the maximum depth of the taxonomy
Path based measures | Wu and Palmer (WUP) [30] | WUP considers the depths of the words in the taxonomies, along with the depth of the LCS (least common subsumer). The formula is: score = 2 × depth(LCS)/(depth(s1) + depth(s2))
Path based measures | Hirst and St-Onge (HSO) [11] | HSO looks for lexical chains linking the word senses
Information content measures | Resnik (RES) [23] | RES computes the information content of the LCS
Information content measures | Jiang and Conrath (JCN) [6] | JCN can be calculated as follows: 1/(IC(term1) + IC(term2) − 2 × IC(LCS)), where IC refers to information content
Information content measures | Lin (LIN) [17] | LIN can be computed as: 2 × IC(LCS)/(IC(term1) + IC(term2))
Feature based measures | Adapted Lesk (LESK) [16] | The similarity score is the sum of the squares of the overlap lengths
Feature based measures | Gloss Vector (vector) [2] | This method works by forming second-order co-occurrence vectors from the glosses or WordNet definitions of concepts
Feature based measures | Gloss Vector, pairwise modified (vector_pairs) [20] | This metric forms separate vectors corresponding to each of the adjacent glosses
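The closed-form entries of Table 1 can be written down directly. The following is a minimal sketch, assuming that the depths, the shortest path length and the information content (IC) values have already been obtained from a taxonomy and a corpus. The function and parameter names are illustrative and do not come from WordNet::Similarity or any other library. The guards for zero denominators are also an assumption of this sketch.

```python
import math


def wup(depth_lcs, depth_s1, depth_s2):
    """Wu and Palmer: 2 * depth(LCS) / (depth(s1) + depth(s2))."""
    return 2.0 * depth_lcs / (depth_s1 + depth_s2)


def lch(path_length, max_depth):
    """Leacock and Chodorow: -log(length / (2 * D))."""
    return -math.log(path_length / (2.0 * max_depth))


def res(ic_lcs):
    """Resnik: information content of the least common subsumer."""
    return ic_lcs


def jcn(ic_term1, ic_term2, ic_lcs):
    """Jiang and Conrath: 1 / (IC(term1) + IC(term2) - 2 * IC(LCS))."""
    distance = ic_term1 + ic_term2 - 2.0 * ic_lcs
    # Assumption of this sketch: identical concepts (distance 0) map to infinity.
    return math.inf if distance == 0 else 1.0 / distance


def lin(ic_term1, ic_term2, ic_lcs):
    """Lin: 2 * IC(LCS) / (IC(term1) + IC(term2))."""
    total = ic_term1 + ic_term2
    # Assumption of this sketch: two concepts with zero IC score 0.
    return 0.0 if total == 0 else 2.0 * ic_lcs / total
```

Note that LCH, RES and JCN are not bounded by 1 in this raw form, so they would need to be rescaled before being compared with scores normalized in [0, 1].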

We have designed an algorithm which provides a floating point number. This number refers to the similarity between two biomedical expressions. A value of 0 indicates that the words compared are absolutely dissimilar. On the other hand, a value of 1 means that the expressions share exactly the same meaning.
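The individual introduced at the beginning of Section 3 is an array of coefficients, one per similarity function, and the system score for a word pair is a linear combination of the function scores. A minimal sketch of that combination follows. The function names are hypothetical, and how the combined value is mapped back into [0, 1] is not specified in the text, so this sketch leaves it unnormalized.

```python
def equal_coefficients(num_functions):
    """Starting point in which every function contributes evenly."""
    return [1.0 / num_functions] * num_functions


def combined_similarity(coefficients, scores):
    """Linear combination of the scores of one word pair.

    coefficients[i] weights the score returned by similarity function i.
    """
    if len(coefficients) != len(scores):
        raise ValueError("one coefficient per similarity function is required")
    return sum(c * s for c, s in zip(coefficients, scores))


# Scores of pair P12 for PATH, RES and LCH, taken from Table 5.
p12_scores = [0.333, 0.619, 0.702]
print(combined_similarity(equal_coefficients(3), p12_scores))
```

Since the Pearson correlation is unchanged when all coefficients are multiplied by the same positive constant, only the relative sizes of the coefficients matter for the fitness used later.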

3.2. Semantic similarity metrics

There is a large number of similarity metrics published in the literature. Table 1 summarizes the most relevant similarity functions according to their classification. All of them have been used in our work. The first column (Table 1) indicates the general type of the function. The second column contains the similarity metrics and a reference to obtain more detailed information. Finally, the third column includes a brief explanation of the metrics.

The main advantage of path based measures is that they are easy to develop because they are easy to understand. On the contrary, this kind of measure needs rich taxonomies. Moreover, they only work with nodes belonging to those taxonomies and only the relation is-a can be used to link to terms. The advantage of information content measures is that they use empirical information from real corpora. Although information content measures only worked with nodes belonging to specific taxonomies in the past, Sánchez et al. show that this kind of measure obtains very good results with real corpora [24]. Finally, feature based measures do not require underlying structures and use implicit knowledge from real corpora. As a disadvantage of these metrics, the definitions of terms can be too short and the computation tends to be very intensive in most cases.

3.3. The differential evolution algorithm

Differential evolution (DE) is a population based evolutionary algorithm (EA) for global optimization created by Storn and Price [29]. Due to its simplicity and effectiveness, DE has been applied to a large number of optimization problems in a wide range of domains [7]. It is based on the generation of new individuals by calculating vector differences between randomly-selected solutions. Fig. 3 illustrates the explanation of our version of the algorithm. Our particular DE has been carefully configured and adapted to the problem managed in our study, as will be explained in Section 4. For this reason, among the different variants of the algorithm, we chose the discrete handling approach version with the selection scheme best/1/bin [19]. This configuration provides more competitive results than the rest of variants. The notation best/1/bin indicates how the crossover and mutation operators work [19]. Thus, our DE includes binomial crossover (bin) and uses a unique (/1/) difference vector for the mutation of the best solution (best) taken from the population.

Fig. 3. Outline of the DE algorithm for the Synonym Recognition Problem.

As we can see in Algorithm 1, DE starts with the random generation of the population (line 1) through the assignment of a random coefficient to each gene of the individual (see Fig. 2). As was explained previously, the population consists of a certain number of solutions (this is a parameter to be configured). Each individual is represented by a vector of weighting factors, as described in Fig. 2. After the generation of the population, the fitness of each individual is assigned to each solution using the Pearson correlation [28]. This correlation, corr(X, Y), is calculated according to Eq. (1) with the values provided by a human expert for every pair of words of the specific dataset under study. The final result is a floating point value between +1 (perfect positive linear correlation) and −1 (perfect negative linear correlation) which indicates the degree of linear dependence between the variable X (human expert opinion) and Y (our solution).
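The best/1/bin scheme and the correlation-based fitness described in this subsection can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. The default parameters are the values reported in Table 2, with MIN, MAX read as −100, 100. Maximizing the signed correlation, clamping genes to [lo, hi], and forcing one gene to come from the mutant (the usual binomial crossover convention) are assumptions of this sketch. All function names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def fitness(weights, score_rows, human):
    """Correlation between human ratings and the weighted function scores."""
    combined = [sum(w * s for w, s in zip(weights, row)) for row in score_rows]
    return pearson(human, combined)


def de_best_1_bin(score_rows, human, pop_size=100, f=0.5, cross_prob=0.1,
                  generations=100, lo=-100.0, hi=100.0, seed=None):
    """score_rows[p][k] is the score of function k for word pair p."""
    rng = random.Random(seed)
    dim = len(score_rows[0])
    # Random initial population: one coefficient per similarity function.
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    fit = [fitness(ind, score_rows, human) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            best = max(range(pop_size), key=fit.__getitem__)
            others = [k for k in range(pop_size) if k not in (i, best)]
            r1, r2 = rng.sample(others, 2)
            # Mutation: xMut = xBest + F * (xInd1 - xInd2), clamped to bounds.
            mutant = [min(hi, max(lo, pop[best][j] + f * (pop[r1][j] - pop[r2][j])))
                      for j in range(dim)]
            # Binomial crossover between the target and the mutant.
            j_rand = rng.randrange(dim)
            trial = [mutant[j] if (rng.random() < cross_prob or j == j_rand)
                     else pop[i][j] for j in range(dim)]
            trial_fit = fitness(trial, score_rows, human)
            # Keep the better of target and trial in the target position.
            if trial_fit >= fit[i]:
                pop[i], fit[i] = trial, trial_fit
    best = max(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]
```

A call such as `de_best_1_bin(score_rows, human, seed=1)` returns the best coefficient vector found and its correlation with the human ratings. Because a trial replaces its target only when it is at least as fit, the best fitness in the population never decreases between generations.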


Algorithm 1. Pseudo-code for the DE algorithm

1: generateRandomPopulation (population)
2: calculateFitness (population)
3: while (stop condition not reached) do
4:   for (each individual of the population)
5:     selectIndividuals (xTarget, xBest, xInd1, xInd2)
6:     xMut ← diffMutation (xBest, F, xInd1, xInd2)
7:     xTrial ← binCrossOver (xTarget, xMut, CrossProb)
8:     calculateFitness (xTrial)
9:     updateIndividual (xTarget, xTrial)
10:   endfor
11: endwhile
12: return bestIndividual (population)

The nearer the value of correlation is to any of the extreme values (−1 or +1), the stronger is the correlation between the variables and the higher is the quality of the solution. On the other hand, if the result gets nearer to 0, it means that the variables are closer to being uncorrelated, and consequently, the solution must be considered of poor quality. We use the correlation so we can compare in fair terms against other published studies [20,22]. For a more detailed explanation of Eq. (1), please consult [28].

$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \qquad (1)$$

The main loop of the algorithm (Fig. 3) starts after evaluating the whole population (line 2). DE is an iterative algorithm, where the successive generations try to get an optimal solution, stopping when the maximum number of generations is reached (line 3).

First, we select four solutions (line 5). xTarget and xBest are the solution being processed and the best solution found so far, respectively. xInd1 and xInd2 are two randomly chosen solutions different from xTarget and xBest. Next, mutation is performed (line 6) according to the expression: xMut ← xBest + F · (xInd1 − xInd2). This operator is explained in Fig. 3 in two steps (diffMutation 1 and 2). In the first part xDiff is calculated through the expression: xDiff ← F · (xInd1 − xInd2). xDiff represents the mutation to be applied to the best solution. The parameter F ∈ [0, 2] establishes the mutation amount used in that operation. Then, xMut is generated by modifying each gene of xBest with the mutation indicated in xDiff (Fig. 3). After mutation, xTarget and xMut individuals are crossed (line 7) using binomial crossover [31] according to a crossover probability, crossProb ∈ [0, 1]. Then, the obtained individual, xTrial, is evaluated (Eq. (1)) to check its quality (line 8) and compared against xTarget. The best individual is saved in the xTarget position (line 9 and Fig. 3). This process is repeated for each individual in the population (line 4) while the stop condition is not satisfied (line 3). In our case, the stop condition is a certain number of generations, which is also a parameter to be configured (see Section 4.1). At the end of the process, the best individual (Fig. 2) is returned as the final result of our system (line 12). It is important to point out here that results have been obtained after a complete experimental process using different biomedical datasets. In case of using other datasets from other domains, further experiments would be necessary.

4. Experiments and results

In this section we summarize the main experiments and the results obtained in our study. We have used different similarity metrics and biomedical datasets to test the proposed system.

4.1. Methodology

All the experiments have been run under the same environment: an Intel Xeon 2.33 GHz processor and 8 GB RAM. On the software side, we have used the GCC 4.1.2 compiler on a Scientific Linux 5.3 64 bits OS. Since we are dealing with a stochastic algorithm, we have carried out 100 independent runs for each experiment. Results provided in the following subsections are average results of these 100 executions. It is important to point out here that the use of the arithmetic mean is a valid statistical measurement because the results follow a normal distribution (a p-value greater than 0.05 for the Shapiro-Wilk test confirms this fact [8]). Moreover, all results present an extremely low dispersion, since the standard deviation for all experiments is lower than 10^−15, so results can be considered statistically reliable.

In the following subsection we will discuss the results obtained in our experiments using different similarity metrics and different word datasets. Table 2 summarizes the parameter adjustment performed to the system developed. The parameter setting is very relevant, since the quality of the results largely depends on the accuracy of this adjustment. Therefore, we performed a complete and precise study for each parameter.

Table 2
Optimal parameter setting.

Parameter | Optimal value
Population size | 100
Mutation factor, F | 0.5
Crossover probability, crossProb | 0.1
Max generations | 100
MIN, MAX | −100, 100

4.2. Result discussion

We have performed two kinds of experiments. First, we explain the results provided by our proposed system using two different sets of similarity metrics (from WordNet dictionary¹ and from the Pirró study [22]). Next, we discuss the results obtained using two different datasets from the biomedical domain [1,22].

4.2.1. Experiments with different metrics

We compare our results against two sources. First, we use the study published by Pirró [22], in which the authors propose a new metric based on features (P&S). Table 3 summarizes the study using the same biomedical dataset. Every value is normalized in the interval [0, 1].

The first column (in Table 3) identifies the word pair. Then, each word under evaluation is presented (columns 2 and 3). The fourth column corresponds with the value of correlation provided by a human expert. Next, all computational methods appear classified in different methodologies. Our results are shown in the last column. As described previously (Section 4.1), they are highly reliable because they are average results from 100 independent executions.

Our results are similar to human judgment when the scores of the functions included in our hyper-heuristics (Fig. 1) are precise. For example, our scores for pairs with labels in Table 3: P17, P24 and P35 are close to the human opinion because basic functions provide precise results (see Table 3). On the other hand, if the basic functions do not perform accurately for a specific pair (e.g. P13, Table 3), our score is better than most of the values provided by those functions.

¹ http://www.wordnet.princeton.edu/.


Table 3
Similarity results obtained by our system (last column) against other results published. Resnik, Lin and J&C are IC based measures, Li is a hybrid measure, P&S is a feature based measure, and HH(DE) is our EA.

Pair  Word 1  Word 2  Human expert  Resnik  Lin  J&C  Li  P&S  HH(DE)
P01 Anemia Appendicitis 0.031 0.000 0.000 0.190 0.130 0.133 0.116
P02 Otitis Media Infantile Colic 0.156 0.000 0.000 0.160 0.100 0.000 0.056
P03 Dementia Atopic Dermatitis 0.060 0.000 0.000 0.290 0.130 0.202 0.165
P04 Bacterial Pneumonia Malaria 0.156 0.000 0.000 0.030 0.100 0.000 0.024
P05 Osteoporosis Patent Ductus Arteriosus 0.156 0.000 0.000 0.150 0.000 0.000 0.037
P06 Sequence Antibacterial Agents 0.155 0.000 0.000 0.270 0.160 0.184 0.159
P07 Acq. Immunno. Syndrome Congenital Heart Defects 0.060 0.000 0.000 0.070 0.080 0.000 0.030
P08 Meningitis Tricuspid Atresia 0.031 0.000 0.000 0.190 0.130 0.131 0.115
P09 Sinusitis Mental Retardation 0.031 0.000 0.000 0.360 0.130 0.117 0.152
P10 Hypertension Failure 0.500 0.000 0.000 0.210 0.130 0.109 0.112
P11 Hyperlipidemia Hyperkalemia 0.156 0.331 0.483 0.470 0.510 0.561 0.443
P12 Hypothyroidism Hyperthyroidism 0.406 0.619 0.726 0.750 0.630 0.718 0.665
P13 Sarcoidosis Tuberculosis 0.406 0.000 0.000 0.250 0.070 0.169 0.134
P14 Vaccines Immunity 0.593 0.000 0.000 0.520 0.000 0.344 0.251
P15 Asthma Pneumonia 0.375 0.517 0.790 0.870 0.520 0.749 0.627
P16 Diabetic Nephropathy Diabetes Mellitus 0.500 0.612 0.759 0.790 0.770 0.741 0.696
P17 Lactose Intolerance Irritable Bowel Syndrome 0.468 0.468 0.468 0.470 0.360 0.468 0.451
P18 Urinary Tract Infection Pyelonephritis 0.656 0.470 0.588 0.670 0.420 0.604 0.533
P19 Neonatal Jaundice Sepsis 0.187 0.000 0.000 0.190 0.160 0.000 0.073
P20 Anemia Deficiency Anemia 0.437 0.601 0.720 0.790 0.360 0.712 0.622
P21 Psychology Cognitive Science 0.593 0.627 0.770 0.810 0.800 0.751 0.714
P22 Adenovirus Rotavirus 0.437 0.267 0.332 0.450 0.350 0.398 0.358
P23 Migraine Headache 0.718 0.229 0.243 0.370 0.170 0.269 0.266
P24 Myocardial Ischemia Myocardial Infarction 0.750 0.595 0.918 0.890 0.800 0.830 0.713
P25 Hepatitis B Hepatitis C 0.562 0.649 0.823 0.860 0.660 0.790 0.715
P26 Carcinoma Neoplasm 0.750 0.246 0.626 0.850 0.450 0.651 0.488
P27 Pulmonary Stenosis Aortic Stenosis 0.531 0.658 0.781 0.810 0.660 0.763 0.707
P28 Failure to Thrive Malnutrition 0.625 0.000 0.000 0.180 0.130 0.126 0.111
P29 Breast Feeding Lactation 0.843 0.000 0.000 0.040 0.080 0.029 0.033
P30 Antibiotics Antibacterial Agents 0.937 1.000 1.000 1.000 0.990 1.000 1.000
P31 Seizures Convulsions 0.843 0.880 1.000 0.900 0.810 0.990 0.887
P32 Pain Ache 0.875 0.861 1.000 1.000 0.990 0.954 0.920
P33 Malnutrition Nutritional Deficiency 0.875 0.622 1.000 1.000 0.980 0.874 0.780
P34 Measles Rubeola 0.906 0.924 1.000 1.000 0.990 1.000 0.965
P35 Chicken Pox Varicella 0.968 1.000 1.000 1.000 0.990 1.000 1.000
P36 Down Syndrome Trisomy 21 0.875 1.000 1.000 1.000 0.990 1.000 1.000

Table 4
Pearson correlation between computational methods and human judgments.

Similarity function | Correlation
Resnik | 0.721
Lin | 0.718
J&C | 0.718
Li | 0.707
P&S | 0.725
HH(DE) | 0.732

Table 4 presents the Pearson correlation between the methods that appear in Table 3 [22] and the human expert score. As can be seen, our approach provides the best results of the study. HH(DE) surpasses other similarity metrics because it is more robust than most existing methods.

The second step in our experimental evaluation consists of using a different set of similarity functions to tackle the same word dataset. Table 5 presents the results obtained with every metric included in WordNet::similarity resources [20].

Every metric used has been previously explained in Section 3.2 (Table 1). Our results appear in the last column (Table 5). Our scores are also statistically robust since they are average results of 100 independent executions, with very low standard deviations (see Section 4.1). Although there are functions which do not obtain good correlation results (Table 6), such as HSO (0.332), JCN (0.237), LIN (0.218) or vector_pairs (0.333), they have been included in our system (Fig. 1), because the higher the number of functions used, the better the results obtained (see Table 7). Although the global correlation of a particular function is not very good, that particular function is important in our system, since our hyper-heuristics modifies its coefficient so that the function provides relevant information to the global system.

Once again, we can see that our HH(DE) improves the results provided by the basic functions because it uses a linear combination of every function (see Section 3). Although a specific function may obtain bad results, it is unlikely that every metric provides poor results.

As may be seen in Table 6, our proposal significantly improves on the rest of the metrics. In fact, the correlation value reached (0.809) is even higher than the result obtained in the previous set of experiments (0.732, Table 4), although in that case, every similarity function provided better results. The scores provided by every similarity function are worse than in the previous experiment because WordNet is a general purpose resource which is not very appropriate for the biomedical domain [25]. Therefore, we can conclude that the higher the number of similarity functions included in our hyper-heuristics (see Fig. 1), the higher the quality of the global results. Table 7 shows how the global quality of our method is improved as more basic functions are incorporated into our pool.

4.2.2. Experiments with different datasets

In this subsection we study the results provided by our system using another biomedical dataset [1]. The configuration of our system is exactly the same; therefore, we can say that the parameter setting of our HH(DE) is consistent with at least two datasets. To our best knowledge, there are no works in which more datasets


Table 5
Similarity results obtained by our system (last column) against WordNet results.

Word pair Human expert HSO JCN WUP PATH LIN LESK RES LCH Vector_pairs Vector HH(DE)
P01 0.031 0.250 0.000 0.842 0.250 0.000 0.004 0.000 0.624 0.010 0.183 0.139
P02 0.156 0.188 0.000 0.800 0.200 0.000 0.002 0.000 0.564 0.007 0.211 0.169
P03 0.06 0.000 0.000 0.480 0.071 0.000 0.002 0.000 0.285 0.013 0.111 0.115
P04 0.156 0.125 0.092 0.720 0.250 0.524 0.010 0.000 0.436 0.080 0.326 0.113
P05 0.156 0.000 0.000 0.174 0.050 0.000 0.002 0.000 0.188 0.018 0.081 0.034
P06 0.155 0.000 0.058 0.300 0.077 0.060 0.010 0.000 0.305 0.022 0.152 0.073
P07 0.06 0.000 0.052 0.375 0.091 0.075 0.028 0.000 0.350 0.051 0.397 0.195
P08 0.031 0.000 0.000 0.609 0.100 0.000 0.002 0.000 0.376 0.011 0.121 0.134
P09 0.031 0.000 0.000 0.556 0.111 0.000 0.003 0.000 0.404 0.003 0.057 0.083
P10 0.5 0.250 0.000 0.842 0.250 0.000 0.007 0.000 0.624 0.018 0.457 0.281
P11 0.156 0.313 0.000 0.889 0.333 0.000 0.078 0.331 0.702 0.195 0.727 0.472
P12 0.406 1.000 0.000 0.900 0.333 0.000 0.019 0.619 0.702 0.221 0.097 0.199
P13 0.406 0.000 0.000 0.720 0.125 0.000 0.007 0.000 0.436 0.070 0.222 0.165
P14 0.593 0.000 0.048 0.267 0.083 0.000 0.009 0.000 0.326 0.016 0.251 0.125
P15 0.375 0.313 0.000 0.923 0.333 0.000 0.013 0.517 0.702 0.227 0.375 0.358
P16 0.5 1.000 0.044 0.182 0.053 0.000 0.105 0.612 0.202 0.041 0.396 0.435
P17 0.468 0.000 0.000 0.250 0.053 0.000 0.001 0.468 0.202 0.034 0.108 0.298
P18 0.656 0.250 0.000 0.963 0.500 0.000 0.050 0.470 0.812 0.062 0.591 0.536
P19 0.187 0.125 0.059 0.400 0.077 0.221 0.011 0.000 0.305 0.043 0.112 0.029
P20 0.437 0.000 0.108 0.571 0.100 0.471 0.051 0.601 0.376 0.074 0.329 0.450
P21 0.593 0.375 0.000 0.900 0.333 0.000 0.153 0.627 0.702 0.167 0.515 0.539
P22 0.437 0.250 0.000 0.842 0.250 0.000 0.029 0.267 0.624 0.042 0.093 0.212
P23 0.718 0.250 0.000 0.957 0.500 0.000 0.428 0.229 0.812 0.020 0.612 0.504
P24 0.75 0.000 0.000 0.720 0.125 0.000 0.038 0.595 0.436 0.034 0.116 0.446
P25 0.562 0.313 0.000 0.933 0.333 0.000 0.074 0.649 0.702 0.359 0.583 0.463
P26 0.75 0.313 0.000 0.889 0.250 0.000 0.228 0.246 0.624 0.134 0.480 0.385
P27 0.531 0.313 0.000 0.917 0.333 0.000 0.100 0.658 0.702 0.438 0.833 0.550
P28 0.625 0.000 0.000 0.636 0.111 0.000 0.029 0.000 0.404 0.125 0.309 0.164
P29 0.843 0.250 0.000 0.857 0.250 0.000 0.017 0.000 0.624 0.218 0.849 0.366
P30 0.937 0.250 1.000 0.941 0.500 1.000 0.661 1.000 0.812 0.119 0.769 0.894
P31 0.843 0.313 0.455 0.952 0.500 0.897 0.098 0.880 0.812 0.197 0.628 0.710
P32 0.875 0.313 0.402 0.947 0.500 0.861 0.152 0.861 0.812 0.050 0.419 0.562
P33 0.875 0.000 0.000 0.571 0.100 0.000 0.040 0.622 0.376 0.022 0.322 0.554
P34 0.906 1.000 0.000 1.000 1.000 0.000 1.000 0.924 1.000 0.333 1.000 0.791
P35 0.968 0.250 0.000 0.966 0.500 0.000 0.752 1.000 0.812 0.055 0.720 1.000
P36 0.875 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.500 1.000 0.720

Table 6
Pearson correlation between computational methods calculated in WordNet::similarity and human judgments.

Metrics Correlation
HSO 0.332
JCN 0.237
WUP 0.490
PATH 0.517
LIN 0.218
LESK 0.517
RES 0.721
LCH 0.553
Vector_pairs 0.333
Vector 0.593
HH(DE) 0.809

Table 7
Pearson correlation values for different numbers of similarity functions. In each case, HH(DE) uses the similarity functions of highest quality.

Similarity functions included in HH(DE) Correlation
10 0.809
9 0.771
8 0.769
7 0.768
6 0.701
5 0.692
4 0.692
3 0.678
2 0.658
1 0.642

from the biomedical domain have been used, so we cannot perform more comparisons. Table 8 presents the word pairs of the dataset and the scores provided by human experts. As in the previous case, every value is normalized to the interval [0, 1].
Our results are shown in Table 9. The first two columns are taken from Table 8 and identify the word pair and the human expert score. The next 10 columns correspond to the values obtained from the WordNet similarity tool [20]. The last column contains the result obtained by our HH(DE). As in previous experiments, these results are statistically reliable because they are the average of 100 independent executions, with very low standard deviations (lower than 10^-10).
As may be observed in Table 9, our results clearly surpass those provided by the rest of the metrics. We can also verify that our results are more consistent across datasets than those provided by other functions. For example, the HSO function obtains poor results on the previous dataset (0.332, Table 6) but a considerably better result in this case (0.701, Table 10), using WordNet in both cases.
Furthermore, as in the previous experiment, our strategy improves the results for word pairs that are already well evaluated by the basic functions (e.g. WP01, WP04 or WP29), and it provides reasonable results for word pairs that are poorly evaluated (e.g. WP21 or WP30).
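The comparisons in this section score each method by its Pearson correlation with the human expert ratings. As a minimal sketch of that computation, the snippet below correlates two metric columns with the human column. The helper name `pearson` and the choice of rows are illustrative assumptions: only rows WP01 to WP05 of Table 9 are used, so the printed values are not comparable to the correlations reported over the full dataset.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Illustrative subset: rows WP01-WP05 of Table 9
# (human expert, HSO and PATH columns).
HUMAN = [1.0, 0.75, 0.7, 0.825, 0.55]
SCORES = {
    "HSO": [1.000, 0.313, 0.000, 1.000, 0.188],
    "PATH": [1.000, 0.111, 0.077, 1.000, 0.200],
}

for name, values in SCORES.items():
    print(f"{name}: r = {pearson(values, HUMAN):.3f}")
```

The coefficient is scale invariant, which is why metrics with very different value ranges can be compared against the same normalized human scores.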


Table 8
Human expert scores for another biomedical word dataset.

Word pair Word 1 Word 2 Human expert


WP01 Renal failure Kidney failure 1
WP02 Heart Myocardium 0.75
WP03 Stroke Infarct 0.7
WP04 Abortion Miscarriage 0.825
WP05 Delusion Schizophrenia 0.55
WP06 Congestive heart failure Pulmonary edema 0.35
WP07 Metastasis Adenocarcinoma 0.45
WP08 Calcification Stenosis 0.5
WP09 Diarrhea Stomach cramps 0.325
WP10 Mitral stenosis Atrial fibrillation 0.325
WP11 Chronic obstructive pulmonary disease Lung infiltrates 0.475
WP12 Rheumatoid arthritis Lupus 0.275
WP13 Brain tumor Intracranial hemorrhage 0.325
WP14 Carpal tunnel syndrome Osteoarthritis 0.275
WP15 Diabetes mellitus Hypertension 0.25
WP16 Acne Syringe 0.25
WP17 Antibiotic Allergy 0.3
WP18 Cortisone Total knee replacement 0.25
WP19 Pulmonary embolus Myocardial infarction 0.3
WP20 Pulmonary fibrosis Lung cancer 0.35
WP21 Cholangiocarcinoma Colonoscopy 0.25
WP22 Lymphoid hyperplasia Laryngeal cancer 0.25
WP23 Multiple sclerosis Psychosis 0.25
WP24 Appendicitis Osteoporosis 0.25
WP25 Rectal polyp Aorta 0.25
WP26 Xerostomia Alcoholic cirrhosis 0.25
WP27 Peptic ulcer disease Myopia 0.25
WP28 Depression Cellulitis 0.25
WP29 Varicose vein Entire knee meniscus 0.25
WP30 Hyperlipidemia Metastasis 0.25

Table 9
Similarity results obtained by our system (last column) against the results obtained with WordNet using the second biomedical dataset.

Word pair Human expert HSO JCN WUP PATH LIN LESK RES LCH Vector_pairs Vector HH(DE)
WP01 1 1.000 0.000 1.000 1.000 0.000 1.000 0.000 1.000 0.010 0.183 1.000
WP02 0.75 0.313 0.078 0.600 0.111 0.370 0.210 0.320 0.404 0.007 0.211 0.517
WP03 0.7 0.000 0.055 0.333 0.077 0.079 0.060 0.066 0.305 0.013 0.111 0.094
WP04 0.825 1.000 0.000 1.000 1.000 0.000 0.195 0.907 1.000 0.080 0.326 0.742
WP05 0.55 0.188 0.000 0.778 0.200 0.000 0.114 0.469 0.564 0.018 0.081 0.193
WP06 0.35 0.000 0.000 0.556 0.111 0.000 0.047 0.269 0.404 0.022 0.152 0.019
WP07 0.45 0.000 0.000 0.174 0.050 0.000 0.018 0.000 0.188 0.036 0.327 0.050
WP08 0.5 0.000 0.000 0.556 0.111 0.000 0.027 0.269 0.404 0.051 0.397 0.030
WP09 0.325 0.000 0.057 0.333 0.077 0.000 0.074 0.066 0.305 0.011 0.121 0.123
WP10 0.325 0.188 0.000 0.833 0.200 0.000 0.022 1.000 0.564 0.003 0.057 0.128
WP11 0.475 0.000 0.000 0.500 0.200 0.000 0.008 0.297 0.298 0.018 0.457 0.030
WP12 0.275 0.188 0.000 0.846 0.200 0.000 0.114 0.582 0.564 0.195 0.727 0.079
WP13 0.325 0.000 0.098 0.750 0.143 0.540 0.043 0.508 0.472 0.221 0.097 0.056
WP14 0.275 0.000 0.000 0.160 0.046 0.000 0.029 0.000 0.162 0.070 0.222 0.055
WP15 0.25 0.000 0.000 0.560 0.083 0.000 0.095 0.477 0.326 0.016 0.251 0.124
WP16 0.25 0.000 0.043 0.167 0.048 0.000 0.036 0.000 0.175 0.227 0.375 0.023
WP17 0.3 0.000 0.000 0.200 0.059 0.000 0.077 0.000 0.232 0.041 0.396 0.125
WP18 0.25 0.000 0.000 0.300 0.067 0.000 0.040 0.052 0.266 0.034 0.108 0.018
WP19 0.3 0.000 0.000 0.300 0.067 0.000 0.037 0.066 0.266 0.062 0.591 0.066
WP20 0.35 0.000 0.000 0.667 0.100 0.000 0.022 0.508 0.376 0.043 0.112 0.007
WP21 0.25 0.000 0.000 0.222 0.046 0.000 0.066 0.066 0.162 0.500 1.000 0.298
WP22 0.25 0.000 0.104 0.583 0.091 0.540 0.045 0.477 0.350 0.074 0.329 0.029
WP23 0.25 0.000 0.000 0.632 0.125 0.000 0.052 0.352 0.436 0.167 0.515 0.029
WP24 0.25 0.000 0.000 0.286 0.063 0.000 0.012 0.066 0.248 0.042 0.093 0.010
WP25 0.25 0.000 0.000 0.261 0.056 0.000 0.033 0.052 0.216 0.020 0.612 0.018
WP26 0.25 0.000 0.000 0.571 0.100 0.000 0.014 0.352 0.376 0.034 0.116 0.005
WP27 0.25 0.000 0.000 0.692 0.111 0.000 0.022 0.508 0.404 0.359 0.583 0.191
WP28 0.25 0.000 0.000 0.375 0.091 0.000 0.025 0.052 0.350 0.134 0.480 0.081
WP29 0.25 0.000 0.000 0.546 0.091 0.000 0.021 0.320 0.350 0.438 0.833 0.221
WP30 0.25 0.000 0.000 0.250 0.077 0.000 0.048 0.000 0.305 0.125 0.309 0.170
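The HH(DE) column of Table 9 comes from an evolutionary search that assigns a coefficient to each similarity metric. The sketch below illustrates one way such a search could look; it is not the authors' implementation. It assumes that the combined score is a weighted sum of the metric scores, that fitness is the Pearson correlation with the human ratings, and that a basic DE/rand/1/bin scheme with coefficients clipped to [0, 1] is used. All function names, parameter values and the six-row, three-metric subset of Table 9 (HSO, PATH, LESK for WP01 to WP06) are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either list is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def combine(weights, rows):
    """Assumed combination: weighted sum of the metric scores of each pair."""
    return [sum(w * s for w, s in zip(weights, row)) for row in rows]


def differential_evolution(rows, human, pop_size=20, generations=50,
                           f=0.5, cr=0.9, seed=0):
    """DE/rand/1/bin sketch that maximizes correlation with human scores."""
    rng = random.Random(seed)
    dim = len(rows[0])
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    fit = [pearson(combine(ind, rows), human) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct individuals, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    trial.append(min(1.0, max(0.0, v)))  # clip to [0, 1]
                else:
                    trial.append(pop[i][d])
            trial_fit = pearson(combine(trial, rows), human)
            if trial_fit >= fit[i]:  # greedy one-to-one replacement
                pop[i], fit[i] = trial, trial_fit
    best = max(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]


# Illustrative subset of Table 9: (HSO, PATH, LESK) for WP01-WP06.
ROWS = [
    (1.000, 1.000, 1.000),
    (0.313, 0.111, 0.210),
    (0.000, 0.077, 0.060),
    (1.000, 1.000, 0.195),
    (0.188, 0.200, 0.114),
    (0.000, 0.111, 0.047),
]
HUMAN = [1.0, 0.75, 0.7, 0.825, 0.55, 0.35]

# Repeat the search with different seeds and summarize the spread.
results = [differential_evolution(ROWS, HUMAN, seed=s)[1] for s in range(10)]
print("mean fitness:", statistics.mean(results))
print("std deviation:", statistics.pstdev(results))
```

Repeating the search with several seeds and reporting the mean and standard deviation mirrors the way the averaged results over independent executions are described, although the number of runs here is kept small for speed.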

Table 10 shows a wide range of scores for the correlation coefficient. This is mainly because the correlation depends not only on the strategy implemented but also on the amount and quality of the background data. For example, metrics based on vectors are often good; however, they are not successful in our case because the terms examined are not extracted from high-quality corpora. A general-purpose dictionary (WordNet) does not contain many biomedical terms, so these functions are not precise in this case. Our HH(DE) assigns a coefficient to each metric to modify the importance of each metric in the global system and avoid negative results.

Table 10
Correlation between different computational methods and human judgments for the second dataset.

Metrics Correlation
HSO 0.701
JCN 0.111
WUP 0.483
PATH 0.753
LIN 0.077
LESK 0.712
RES 0.106
LCH 0.687
Vector_pairs 0.351
Vector 0.289
HH(DE) 0.885

Finally, Table 11 summarizes all results related to the second dataset. As can be seen, our approach improves on every other similarity function applied to the same word dataset. There are several reasons which can explain this good behaviour. For example, the approaches presented in [21] are limited by the fact that a unique ontology is exploited. According to Sánchez et al. [26], these results rely completely on the coverage and completeness of the input ontology. On the other hand, the metrics proposed in [1] are nearer to our results since their strategies consist of experimentally optimized parameters for the evaluated dataset. However, even compared against those metrics, our hyper-heuristic obtains more successful results.

Table 11
Pearson correlations between different approaches and the human expert opinion.

Similarity function (metric) Correlation
Vector [21] 0.76
LIN [21] 0.69
J&C [21] 0.55
RES [21] 0.55
Path [21] 0.48
L&C [21] 0.47
PATH [1] 0.818
L&C [1] 0.833
W&P [1] 0.778
C&K [1] 0.702
Metrics proposed in [1] 0.836
HH(DE) 0.885

5. Conclusions and future work

In this work, we have presented a novel approach that surpasses existing similarity functions when dealing with datasets from the biomedical domain. The novelty of our work consists of using other similarity functions as black boxes which are smartly combined. This allows our HH(DE) to make use of the best features from every similarity function.
We present the novel approach of applying an evolutionary algorithm to this kind of problem. Furthermore, it provides the best similarity scores (see Section 4) when compared against other relevant works published in the literature.
As future work, we propose to explore further possibilities for synonym recognition in other domains, especially those in which good synonym dictionaries do not exist. We also look forward to improving our fitness function to make it domain independent.

References

[1] H. Al-Mubaid, H.A. Nguyen, Measuring semantic similarity between biomedical concepts within multiple ontologies, IEEE Transactions on Systems, Man, and Cybernetics, Part C 39 (4) (2009) 389-398.
[2] S. Banerjee, T. Pedersen, Extended gloss overlaps as a measure of semantic relatedness, IJCAI (2003) 805-810.
[3] D. Bollegala, Y. Matsuo, M. Ishizuka, Measuring semantic similarity between words using web search engines, in: Proceedings of the World Wide Web Conference, 2007, pp. 757-766.
[4] A. Budanitsky, G. Hirst, Evaluating wordnet-based measures of lexical semantic relatedness, Computational Linguistics 32 (1) (2006) 13-47.
[5] P-I. Chen, S-J. Lin, Word AdHoc Network: Using Google Core Distance to extract the most relevant information, Knowledge Based System 24 (3) (2011) 393-405.
[6] D. Conrath, J. Jiang, Semantic similarity based on corpus statistics and lexical taxonomy, in: Comp. Linguist Proc., Taiwan, 1997, pp. 19-33.
[7] S. Das, P.N. Suganthan, Differential evolution: a survey of the state-of-the-art, IEEE Transaction on Evolutionary Computation 15 (1) (2011) 4-31.
[8] J. Demšar, Statistical comparison of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1-30.
[9] F.A. Grootjen, T.P. van der Weide, Conceptual query expansion, Data Knowledge Engineering 56 (2) (2006) 174-193.
[10] A.Y. Halevy, A. Rajaraman, J.J. Ordille, Data integration: the teenage years, VLDB (2006) 9-16.
[11] G. Hirst, D. St-Onge, Lexical chains as representations of context for the detection and correction of malapropisms, in: C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, 1998.
[12] J. Hu, R.S. Kashi, G.T. Wilfong, Comparison and classification of documents based on layout similarity, Information Retrieval 2 (2/3) (2000) 227-243.
[13] A. Java, S. Nirenburg, M. McShane, T.W. Finin, J. English, A. Joshi, Using a natural language understanding system to generate semantic web content, International Journal on Semantic Web and Information Systems 3 (4) (2007) 50-74.
[14] E. Kaufmann, A. Bernstein, Evaluating the usability of natural language query languages and interfaces to Semantic Web knowledge bases, Journal of Web Semantics 8 (3) (2010) 377-393.
[15] C. Leacock, M. Chodorow, G.A. Miller, Using corpus statistics and wordnet relations for sense identification, Computational Linguistics 24 (1) (1998) 147-165.
[16] M. Lesk, Information in data: using the Oxford English dictionary on a computer, SIGIR Forum 20 (1-4) (1986) 18-21.
[17] D. Lin, An information-theoretic definition of similarity, in: Int. Conf. ML Proc., San Francisco, CA, USA, 1998, pp. 296-304.
[18] C.D. Manning, H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, Massachusetts, 1999.
[19] E. Mezura-Montes, J. Velázquez-Reyes, C.A. Coello-Coello, A comparative study of differential evolution variants for global optimization, in: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO '06), ACM, New York, NY, USA, 2006, pp. 485-492.
[20] T. Pedersen, S. Patwardhan, J. Michelizzi, WordNet::Similarity: measuring the relatedness of concepts, Association for the Advancement of Artificial Intelligence (2004) 1024-1025.
[21] T. Pedersen, S. Pakhomov, S. Patwardhan, C.G. Chute, Measures of semantic similarity and relatedness in the biomedical domain, Journal of Biomedical Informatics 40 (3) (2007) 288-299.
[22] G. Pirró, A semantic similarity metric combining features and intrinsic information content, Data Knowledge Engineering 68 (11) (2009) 1289-1308.
[23] P. Resnik, Using information content to evaluate semantic similarity in a taxonomy, IJCAI (1995) 448-453.
[24] D. Sánchez, M. Batet, D. Isern, Ontology-based information content computation, Knowledge Based System 24 (2) (2011) 297-303.
[25] D. Sánchez, M. Batet, D. Isern, A. Valls, Ontology-based semantic similarity: a new feature-based approach, Expert Systems with Applications 39 (9) (2012) 7718-7728.
[26] D. Sánchez, A. Solé-Ribalta, M. Batet, F. Serratosa, Enabling semantic similarity estimation across multiple ontologies: an evaluation in the biomedical domain, Journal of Biomedical Informatics 45 (1) (2012) 141-155.
[27] N. Shadbolt, T. Berners-Lee, W. Hall, The semantic web revisited, IEEE Intelligent Systems 21 (3) (2006) 96-101.
[28] J.S. Simonoff, Smoothing Methods in Statistics, Springer, 1996.
[29] R. Storn, K. Price, Differential Evolution - A Simple and Efficient Adaptive Scheme for Global Optimization Over Continuous Spaces, TR-95-012, International Computer Science Institute, Berkeley, 1995.
[30] Z. Wu, M. Palmer, Verb semantics and lexical selection, in: Assoc. Comput. Linguist Proc., Las Cruces, NM, USA, 1994, pp. 133-138.
[31] D. Zaharie, A comparative analysis of crossover variants in differential evolution, in: Proceedings of the International Multiconference on Computer Science and Information Technology, Wisla, Poland, 2007, pp. 171-181.
