
Biomedical Semantic Embeddings: Using hybrid sentences to construct biomedical word embeddings and its applications

Arshad Shaik and Wei Jin


Computer Science and Engineering
University of North Texas
Denton, Texas
ArshadShaik@my.unt.edu, wei.jin@unt.edu

Abstract—Word embeddings are a useful method that has shown enormous success in various NLP tasks, not only in the open domain but also in the biomedical domain. The biomedical domain provides various domain-specific resources and tools that can be exploited to improve the performance of these word embeddings. However, most of the research related to word embeddings in the biomedical domain focuses on the analysis of model architecture, hyper-parameters, and input text. In this paper, we use SemMedDB to design new sentences called 'Semantic Sentences'. We then use these sentences, in addition to biomedical text, as inputs to the word embedding model. This approach aims at introducing biomedical semantic types, as defined by UMLS, into the vector space of word embeddings. The semantically rich word embeddings presented here rival state-of-the-art biomedical word embeddings, improving on semantic similarity and relatedness metrics by up to 11%. We also demonstrate how these semantic types in word embeddings can be utilized.

Keywords-word embeddings; semantic types; SemMedDB

I. INTRODUCTION

Word embeddings, or distributed representations of words, have been a focus of research since the early 1990s. However, more noticeable results were achieved with advances in computational speed and neural networks. Bengio et al. [1] provided a neural probabilistic language model that was able to learn a distributed representation for words, tackling the curse of dimensionality. Collobert et al. [2] trained a deep neural network with semi-supervised and multitask learning for various NLP tasks such as prediction of part-of-speech tags, chunks, named entity tags, semantic roles, etc. Joseph et al. [3] took previously trained word embeddings and provided a comparison of these embeddings when used as input features for NLP tasks such as named entity recognition and chunking. Mikolov et al. [4] proposed the continuous bag-of-words and skip-gram architectures for computing word vectors from very large data sets. These model architectures provided a significant improvement in accuracy when tested on syntactic and semantic word similarities at much lower computational cost; both models still stand as the state of the art for training word embeddings. Recently, a surge of algorithms has been proposed in the literature that utilize co-occurrence information and matrix factorization for learning distributed representations of words [5], [6], [7]. Although the algorithms have different underlying architectures, the resulting embeddings usually give similar performance when leveraged for various NLP tasks.

Similarly, there has been some research on the creation of word embeddings in the biomedical domain. Stenetorp et al. [8] were the first to present an analysis of various word representations in biomedical NLP, demonstrating substantial benefit from word representations trained on in-domain texts compared to out-of-domain texts for entity recognition and classification tasks. Pyysalo et al. [9] provided distributional semantic resources for biomedical text processing. They utilized these resources to preprocess openly available biomedical literature, i.e., PubMed and PMC OA, finally applying word2vec to train word embeddings from these preprocessed texts. These word embeddings are openly available and have been used as features in various BioNLP studies. Chiu et al. [10] prepared a study on how to train good word embeddings for biomedical NLP; their research provided a comparison of how the quality of word embeddings differed based on pre-processing of text, model selection (skip-gram vs. CBOW), hyper-parameters (sampling, min-count, learning rate, vector dimension, and context window size), and corpus selection (PMC, PubMed). In conclusion, most word embedding research in the biomedical domain is restricted to the utilization of openly available biomedical text and analysis studies.

On the other hand, the biomedical domain is rich in semantic resources that have been used to improve various tasks such as text classification, named entity recognition, information retrieval, and question answering systems. These biomedical tools were developed with the advancement of NLP in the biomedical field and have been used directly or indirectly for various NLP tasks. Sarker et al. [11] used MetaMap [12] to extract semantic types and concepts from text for adverse drug reaction detection. Similar features were extracted by Cao et al. [13] for question classification based on topics. Weiming et al. [14], Cao et al. [15], and Hristovski et al. [16] make use of SemRep [17], MetaMap, and UMLS [18] for biomedical question answering systems. These semantic

978-1-5386-9138-0/19/$31.00 ©2019 IEEE

resources have been a good addition for improving various NLP tasks; however, they remain unexplored in the field of biomedical word embeddings.

Although individual research has been conducted in the area of word embeddings to solve BioNLP problems and in the utilization of domain knowledge to improve various BioNLP tasks, there remains a large gap in research on word embeddings utilizing semantic resources in the biomedical domain. Some of the research involving both was done by Yu et al. [19] and Abdeddaim et al. [20], who made use of the UMLS/MeSH lexicon to improve word embeddings. Our paper is unique in utilizing SemMedDB information and constructing new biomedical word embeddings that are semantically rich and more useful compared to previous biomedical word embeddings.

The remainder of the paper is organized as follows. Section II provides background on the tools and resources used in this novel approach and further details their implementation. Sections III and IV detail the design of the word embeddings and the efficiency of utilizing them. Section V discusses the data sets used for the experiments and the results of the evaluation. Section VI is reserved for conclusions and future work.

II. BACKGROUND

Considering the extensive research in model architectures for training word embeddings, our motivation was not to tweak these already established state-of-the-art architectures; rather, the novelty of our research lies in the usage of domain-specific semantic resources to provide better inputs to these models. We were also influenced by the fact that there is minimal research in biomedical word embeddings exploiting these semantic resources, especially SemMedDB, to improve the quality of biomedical word embeddings. Even though these semantic resources have been used to solve NLP problems similar to those tackled by word embeddings, they have not been used in unison with word embeddings.

We use the information in SemMedDB to engineer hybrid sentences, which we call semantic sentences for the scope of this study. These semantic sentences, in addition to biomedical text, are provided as input to word2vec (skip-gram) to train our semantic word embeddings. As a result, we create biomedical word embeddings whose vectors represent not only words but also semantic types. Later in this paper we explain the benefit of inducing these semantic types into the vector space.

This section further provides an outline of the tools and resources utilized to achieve our semantic word embeddings, together with a brief description of how the tools are implemented in this novel approach. These domain-specific tools and resources are provided by the National Library of Medicine (NLM).

A. Word2vec

Mikolov et al. [4] proposed two architectures as part of their word2vec tool to learn distributed representations of words from very large data sets: continuous bag-of-words (CBOW) and skip-gram. These model architectures yielded a remarkable enhancement in accuracy when tested on syntactic and semantic word similarities at much lower computational cost; both models still stand as the state of the art for training word embeddings. Skip-gram aims to predict the context words given a target word in a sliding window of text, while CBOW predicts the probability of a word given the surrounding words.

Skip-gram. Given a text corpus, skip-gram aims at deducing word representations that are good at estimating the context words given a target word in a sliding window of text. Specifically, skip-gram takes each word in the corpus (denoted as w_t) and its surrounding words within a window of defined size (denoted as C_t) as input. The model then feeds each pair (w_t, w_c), where w_c ∈ C_t, into a neural network that is trained to maximize the log probability of neighboring words in the corpus. More formally, given a training corpus represented as a sequence of words w_1, w_2, ..., w_T, the objective of the skip-gram model is to maximize the following function:

    O = Σ_{t=1}^{T} Σ_{w_c ∈ C_t} log P(w_c | w_t)

For both its performance and popularity, we make use of skip-gram to train our word embeddings model.

B. UMLS

The aim of the National Library of Medicine's (NLM) Unified Medical Language System (UMLS) [18] is to ease the development of conceptual relationships between users and machine-readable data. The UMLS comprises three centrally developed Knowledge Sources: a Metathesaurus, a Semantic Network, and the SPECIALIST Lexicon and Lexical Tools. It contains a variety of files and software that make use of these Knowledge Sources to help users in different domains find machine-readable information relevant to their particular needs or research problems. The Metathesaurus consists of biomedical terms and codes from many vocabularies, including CPT, ICD-10-CM, LOINC, MeSH, RxNorm, and SNOMED CT. This information is gathered by making use of the Semantic Network and lexical tools to group synonymous terms, categorize biomedical concepts by semantic types, and link health information, medical terms, drug names, and billing codes across different computer systems. The Semantic Network consists of broad categories of semantic types and the relationships between them. The SPECIALIST Lexicon and Lexical Tools consist of natural language processing tools such as SemRep,

MetaMap, etc. Details of each resource used in this research are provided in the sections below.

C. Semantic Network

The Semantic Network [21] provides a categorization of all concepts represented in the UMLS Metathesaurus and a set of relationships that exist between these concepts. The Semantic Network contains information about the set of semantic types, or categories, which may be assigned to a biomedical concept, and it defines the set of relationships that may hold between these semantic types. The Semantic Network contains 133 different semantic types and 54 relationships that can exist between these concepts. The semantic types are represented as nodes in the network, and the relationships between them are the links.

These semantic types are represented using unique identifiers comprising four characters, e.g., phsu for "Pharmacologic Substance". We make use of these unique identifiers in our paper and also provide their full forms as necessary.

An example of a relationship in the Semantic Network is "phsu—affects—dsyn", where "phsu" (Pharmacologic Substance) and "dsyn" (Disease or Syndrome) are semantic types and "affects" is the relationship between them. These semantic types and relationships are defined by the Semantic Network as follows:

phsu: A substance used in the treatment or prevention of pathologic disorders. This includes substances that occur naturally in the body and are administered therapeutically.

dsyn: A condition which alters or interferes with a normal process, state, or activity of an organism. It is usually characterized by the abnormal functioning of one or more of the host's systems, parts, or organs. Included here is a complex of symptoms descriptive of a disorder.

affects: Produces a direct effect on. Implied here is the altering or influencing of an existing condition, state, situation, or entity. This includes has a role in, alters, influences, predisposes, catalyzes, stimulates, regulates, depresses, impedes, enhances, contributes to, leads to, and modifies.

We use the information in the Semantic Network as standard guidance for working with the semantic type vectors that are introduced as part of our research. This information will be used during our applications and the similarity- and relatedness-related tasks that come in later sections of this paper.

D. MetaMap

MetaMap [12] is a tool developed to map biomedical texts to biomedical concepts. These concepts are classified by semantic types¹ defined by the UMLS Metathesaurus, such as Body Part, Organ, or Organ Component (T023, A1.2.3.1) and Anatomical Structure (T017, A1.2). MetaMap uses a knowledge-intensive approach based on symbolic, natural language processing (NLP), and computational linguistic techniques to identify biomedical concepts in text. For this reason it has been used in many applications such as information retrieval, data mining, and decision support systems.

¹ https://metamap.nlm.nih.gov/Docs/SemanticTypes_2013AA.txt

Given biomedical text, MetaMap outputs the biomedical concepts in the text and provides details about these concepts, such as concept id, concept name, preferred name, semantic type, etc. For example, when we provide "Histochemical changes in psoriasis treated with triamcinolone" as input text to MetaMap, the following biomedical concepts are identified in the text along with their information:

Psoriasis [Disease or Syndrome]
changes [Quantitative Concept]
Treated with [Therapeutic or Preventive Procedure]
Triamcinolone [Organic Chemical, Pharmacologic Substance]

We only show the concept [semantic type] information above, since these are the two values that we make use of in our research; however, MetaMap provides many other details related to the identified biomedical concepts. We use the concepts and semantic types identified in biomedical text as input features for our text classification task.

E. SemRep

SemRep [17] is another useful tool provided by NLM that extracts semantic predications from sentences in biomedical text. A semantic predication consists of a subject argument, an object argument, and the relation that exists between them. The subject and object arguments of each predication are concepts from the UMLS Metathesaurus, and their binding relationship is a relation from the UMLS Semantic Network. In the example below, sentence (1) is a statement found in text; SemRep extracts the predications given in (2):

(1) We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia.
(2) Hemofiltration-TREATS-Patients
    Digoxin overdose-PROCESS OF-Patients
    Hyperkalemia-COMPLICATES-Digoxin overdose
    Hemofiltration-TREATS-Digoxin overdose

F. SemMedDB

The Semantic MEDLINE Database (SemMedDB) [22] is a database of semantic predications extracted by running SemRep [17] on PubMed. SemMedDB currently contains information about approximately 94.0 million predications from all PubMed citations (about 27.9 million citations as of December 31, 2017) and provides valuable information for extraction purposes. The database consists of 8 tables, of which the SENTENCE, PREDICATION, and PREDICATION_AUX tables are used in our research. More specifically, we use the row data present in the SENTENCE, SUBJECT_SEMTYPE, OBJECT_SEMTYPE, SUBJECT_TEXT

Sentence | Subject | Object | Subject semantic type | Object semantic type | Semantic sentence
Histochemical changes in psoriasis treated with triamcinolone. | triamcinolone | psoriasis | phsu | dsyn | Histochemical changes in dsyn treated with phsu.
Fulminating hepatic necrosis in a patient with multiple myeloma treated with urethan. | multiple myeloma | patient | neop | humn | Fulminating hepatic necrosis in a humn with neop treated with urethan.
Clinical importance of the Russian spasmolytic preparation Etaphen in visceral-reflex stenocardia. | spasmolytic | stenocardia | phsu | sosy | Clinical importance of the Russian phsu preparation Etaphen in visceral-reflex sosy.
Effect of cyclic progestin-estrogen therapy on sebum and acne in women. | acne | women | dsyn | humn | Effect of cyclic progestin-estrogen therapy on sebum and dsyn in humn.
Surgical treatment of congenital cardiovascular anomalies accompanied by cyanosis. | congenital cardiovascular anomalies | cyanosis | dsyn | sosy | Surgical treatment of dsyn accompanied by sosy.

Table I. Examples of semantic sentences created from the semantic information of a sentence in SemMedDB; here "phsu" (Pharmacologic Substance), "dsyn" (Disease or Syndrome), "neop" (Neoplastic Process), "humn" (Human), and "sosy" (Sign or Symptom) are biomedical semantic types defined by UMLS.
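The construction illustrated in Table I (replacing the subject and object text of a SemMedDB sentence with their UMLS semantic type identifiers) can be sketched in a few lines of Python. This is a simplified illustration: real SemMedDB rows may require case-insensitive matching or the character offsets stored alongside the predication.

```python
def make_semantic_sentence(sentence, subj_text, subj_semtype, obj_text, obj_semtype):
    """Replace the subject and object text of a sentence with their
    UMLS semantic type identifiers, yielding a 'semantic sentence'."""
    hybrid = sentence.replace(subj_text, subj_semtype)
    hybrid = hybrid.replace(obj_text, obj_semtype)
    return hybrid

# First row of Table I:
print(make_semantic_sentence(
    "Histochemical changes in psoriasis treated with triamcinolone.",
    "triamcinolone", "phsu",   # subject and its semantic type
    "psoriasis", "dsyn"))      # object and its semantic type
# -> Histochemical changes in dsyn treated with phsu.
```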

and OBJECT_TEXT columns of these tables. The most critical part of our research, the 'semantic sentences', is created by extracting the information from these columns of SemMedDB.

III. METHOD

A. Semantic Embeddings

Word embeddings are created by providing a list of sentences as input to the skip-gram model, which aims at deducing word representations based on context words. The main goal of our research is to introduce semantic types into the vector space of word embeddings. To do so, it is essential that the input to the skip-gram model contains sentences or text with semantic types. With this aim of providing semantic types and their context information to the skip-gram model, we first create 'semantic sentences', which contain semantic types together with their context. By training the skip-gram model on these semantic sentences, we get word embeddings with distributed representations of both words and semantic types.

SemMedDB provides semantic data related to predicate sentences. Semantic sentences are constructed using the SENTENCE, SUBJECT_SEMTYPE, OBJECT_SEMTYPE, SUBJECT_TEXT, and OBJECT_TEXT data in SemMedDB. This is done by replacing the subject and object text in a sentence with their respective semantic types. Table I shows examples of semantic sentences created using semantic information from SemMedDB.

Using these 'semantic sentences' as input to skip-gram, word embeddings are created that consist of both biomedical terms and UMLS semantic types. Figure 1 shows how the word embeddings are created. Finally, the word embeddings created in this process contain distributed representations not only of words but also of semantic types, which are at our disposal for various tasks such as calculating the relatedness between two biomedical terms and text classification utilizing the vectors of these semantic types as input features.

For analysis purposes we create two word embeddings, SMDB and SMDB+, using the skip-gram model from the gensim library with its default parameters, such as a window size of 5 and a vector dimension of 200. The training sentences used for these embeddings are what differentiate them, as detailed below:

1) SMDB: This word embedding is trained using the 94 million sentences in SemMedDB as input to the skip-gram model; we use it as a baseline to compare with the semantic embeddings created in this research.

2) SMDB+: This is our special word embedding created using the 94 million sentences in SemMedDB plus the respective 94 million semantic sentences created utilizing semantic type information. It contains both biomedical terms and semantic types in its vector space.

IV. APPLICATIONS

In this section we demonstrate how our semantic word embeddings can be put to use. Word embeddings are a powerful tool because of the relationships they capture between different words. For example, taking the vectors of "queen", "woman", and "man" and computing vector("queen") - vector("woman") + vector("man") yields the vector of "king". Similarly, the standard UMLS semantic types introduced in our semantic word embeddings capture various semantic relations between different biomedical terms. To further demonstrate this idea, we show below how we can compute a disease vector given its related drug.

Drug-disease pairs: using the vectors of a drug and the semantic types of drug and disease, a set of related diseases is generated. For example, calculating vector("bortezomib") - vector("orch") + vector("neop") gives a vector X, and on computing the nearest word vector to X we get "myeloma",

Figure 1. Biomedical semantic embeddings are created from sentences and "semantic sentences", where the "semantic sentences" are created by utilizing the semantic information in SemMedDB. Here "phsu" (Pharmacologic Substance) and "dsyn" (Disease or Syndrome) are biomedical semantic types defined by UMLS.

in the word embeddings. Bortezomib is a drug used to treat multiple myeloma; the semantic type of bortezomib in UMLS is defined as orch (Organic Chemical), and neop (Neoplastic Process) is the semantic type assigned to cancer diseases. By using the knowledge of semantic types from the Semantic Network and utilizing the semantic type vectors from our word embeddings, we were able to arrive at the cancer disease myeloma just by knowing the name of the drug bortezomib.

Similarly, there are 133 different semantic types in the Semantic Network and 54 relationships that exist between these semantic types; using our semantic word embeddings, these semantic types can be exploited to obtain pairs such as drug-disease, drug-target, etc.

A. Intrinsic Evaluation

Pakhomov et al. [23] provide similarity and relatedness judgments for clinical terms, culminating in a set of biomedical pairs with scores based on their degree of similarity and relatedness. This dataset acts as a benchmark for the intrinsic evaluation of the embeddings presented here. To compute the similarity between vectors A and B, cosine similarity is used, which is defined as follows:

    cos(A, B) = (A · B) / (∥A∥ ∥B∥) = (Σ_{i=1}^{n} A_i B_i) / (√(Σ_{i=1}^{n} A_i²) √(Σ_{i=1}^{n} B_i²))

1) Similarity: From Pakhomov et al.'s [23] observations and results we can conclude that similarity between two biomedical terms exists when the two terms belong to the same semantic type. For example, medrol and prednisolone are names of drugs which belong to the same drug class, treat similar diseases, and are both assigned the same semantic type of phsu (Pharmacologic Substance) by UMLS. The similarity score for such pairs can easily be determined by finding the cosine similarity between the two biomedical terms. We

compute the similarity measure using cosine similarity, as has been done previously [9].

2) Relatedness: Relatedness between two biomedical terms is a difficult task to define computationally. From Pakhomov et al.'s [23] observations and results we can conclude that two semantically related terms can be of the same or different semantic types. In previous research [9] the problem of relatedness is targeted using the same cosine similarity score. Although the cosine score is a good measure if the semantic types of two terms are the same, it is not a good measure for computing relatedness between two terms of different semantic types. For example, diabetes and insulin are semantically related, but these biomedical terms are of different semantic types: diabetes has the semantic type Disease or Syndrome and insulin has the semantic type Pharmacologic Substance. There exists a relatedness between these two terms because insulin is the drug used to treat diabetes.

Given the distribution of words in word embeddings, words which are similar are close to each other, but words with different semantic types are far apart. Based on this phenomenon we define a better measure of semantic relatedness here; this measure is only possible because of the special nature of our word embeddings, which contain vectors of semantic types in addition to biomedical terms. A more accurate relatedness score is determined by utilizing the semantic types in the word embeddings to capture the relationship between the two biomedical terms. For a given biomedical pair (term1, term2) the relatedness can be calculated using the formula: relatedness(term1, term2) = cos(X, Y), where X = "term1" - "sem1" + "sem2" and Y = "term2", and "sem1" and "sem2" are the semantic types of "term1" and "term2" respectively. For example, relatedness("diabetes", "insulin") = cos(X, Y) where X = "diabetes" - "dsyn" + "aapp" and Y = "insulin"; here dsyn (Disease or Syndrome) is the semantic type of diabetes and aapp (Amino Acid, Peptide, or Protein) is the semantic type of insulin. The relatedness procedure is provided in Algorithm 1 below for illustration:

Algorithm 1 Relatedness Calculation
 1: procedure RELATEDNESS(term1, term2, SMDB+)
 2:     sem1 ← getSem(term1)                  ▷ use MetaMap to get semantic type
 3:     sem2 ← getSem(term2)                  ▷ use MetaMap to get semantic type
 4:     vectorTerm1 ← getVector(term1, SMDB+)
 5:     vectorTerm2 ← getVector(term2, SMDB+)
 6:     vectorSem1 ← getVector(sem1, SMDB+)
 7:     vectorSem2 ← getVector(sem2, SMDB+)
 8:     relatedVector ← vectorTerm1 − vectorSem1 + vectorSem2
 9:     relatedness ← cos(relatedVector, vectorTerm2)
10:     return relatedness

Since the distribution of words captures the relations between different words, the semantic types yield a better calculation of relatedness. Note: in case there is no semantic type for a biomedical concept, we can simply use the word vector value, ignoring the semantic type vectors in the formula, which reduces to the similarity measure. The relatedness scores achieved are considerably higher than those of other biomedical word embeddings.

B. Text Classification

Recently, CNNs and other neural networks have been widely used for various NLP tasks including sentence classification [24], [25], [26], [27]. Pre-trained word embeddings have shown effective results when used as input features to train a sentence classification model [28], [29]. Applying Yoon Kim's [29] CNN model showcases the power of the semantic embeddings (SMDB+) compared to a normal word embedding (SMDB), which does not contain semantic type vectors.

Normally, the vectors of the words present in a text are used as input features to a CNN text classifier. However, due to the nature of the word embeddings developed here, the vectors of the semantic types present in a text can be used as additional input features to the CNN classifier. Semantic types in text are determined using MetaMap; the semantic type vectors are then appended to the vectors of the words in the text as input. The results of the text classification are discussed in detail in the following section.

V. RESULTS

A. Data sets

1) UMNSRS-Rel and UMNSRS-Sim: UMNSRS-Rel and UMNSRS-Sim are datasets compiled by Pakhomov et al. [23] which consist of semantically related and semantically similar biomedical terms. The degree of similarity and relatedness is captured by a score assigned by a group of eight medical residents. Based on these scores, we use the top 12 semantically similar pairs from UMNSRS-Sim for the similarity measure and the top 12 related pairs from UMNSRS-Rel for the relatedness measure.

2) Clinical Questions: A subset of the clinical questions dataset provided by the National Library of Medicine, accumulated by Ely et al. and D'Alessandro et al. [30], [31], [32]. This dataset consists of 4,654 clinical questions that arose during patient care and visits. The dataset contains information on the topics assigned to each question; each question is assigned one or more topics from a set of 12 topics. This paper makes use of the question set belonging to the top five most recurring topics, namely "Pharmacological", "Management", "Diagnosis", "Treatment & Prevention", and "Test". The distribution of these questions across topics is illustrated in Table III.
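Algorithm 1 above can be rendered directly in Python. The tiny embedding and the semantic-type lookup below are hypothetical stand-ins for the SMDB+ embedding and the MetaMap call; they only serve to show the vector arithmetic of the relatedness formula.

```python
import math

# Hypothetical stand-ins: a toy "SMDB+" embedding and a MetaMap-style lookup.
smdb_plus = {
    "diabetes": [1.0, 0.0, 0.2], "insulin": [0.1, 1.0, 0.3],
    "dsyn":     [0.9, 0.1, 0.1], "aapp":    [0.0, 0.9, 0.2],
}
semtype = {"diabetes": "dsyn", "insulin": "aapp"}

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def relatedness(term1, term2, emb, get_sem):
    """Algorithm 1: cos(term1 - sem1 + sem2, term2)."""
    v1, v2 = emb[term1], emb[term2]
    s1, s2 = emb[get_sem(term1)], emb[get_sem(term2)]
    related = [a - b + c for a, b, c in zip(v1, s1, s2)]
    return cos(related, v2)

score = relatedness("diabetes", "insulin", smdb_plus, semtype.get)
```

With the real 200-dimensional SMDB+ vectors, `get_sem` would call MetaMap and `emb` would be the trained embedding's keyed vectors.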

Term 1 | Term 2 | Pyysalo et al. | SMDB | SMDB+
medrol | prednisolone | 0.60804 | 0.53884 | 0.50556
lipitor | zocor | 0.66682 | 0.85673 | 0.84450
thalassemia | hemoglobinopathy | 0.70729 | 0.62486 | 0.61177
convulsion | epilepsy | 0.54465 | 0.52118 | 0.51423
emaciation | cachexia | 0.49607 | 0.57806 | 0.57375
dizziness | vertigo | 0.72978 | 0.80141 | 0.81400
mycosis | histoplasmosis | 0.55776 | 0.61603 | 0.59177
enalapril | lisinopril | 0.94660 | 0.93651 | 0.94842
actonel | fosamax | 0.67757 | 0.76767 | 0.82370
carboplatin | cisplatin | 0.87725 | 0.85271 | 0.86052
xanax | ativan | 0.72460 | 0.80707 | 0.80830
ethanol | alcohol | 0.57237 | 0.61436 | 0.62272
Average | | 0.67573 | 0.70962 | 0.70994

Table II. Comparison of cosine scores between Term 1 and Term 2 vectors from the Pyysalo et al., SMDB, and SMDB+ word embeddings. Values in bold indicate the highest scores.
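Each score in Table II is simply the cosine between two term vectors. A minimal sketch, with hypothetical 3-dimensional vectors standing in for the real 200-dimensional embedding vectors:

```python
import numpy as np

def cosine(a, b):
    """cos(A, B) = A·B / (||A|| ||B||), the measure behind Table II."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors for a similar drug pair such as medrol/prednisolone.
v_term1 = [0.7, 0.2, 0.1]
v_term2 = [0.6, 0.3, 0.2]
print(round(cosine(v_term1, v_term2), 5))
```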

Question Topic No. of Questions


Pharmacological 1594
of introducing semantic types in this word embeddings and
Management 1403 further bolsters the formula used to compute the relatedness
Diagnosis 994 score. Again SMDB performed better than Pyysalo et al.
Treatment & Prevention 868
Test 746
[9] due to better quality of sentences. Cosine score is not
a good measure of relatedness due to the fact that words
Table III
Q UESTION TOPICS DISTRIBUTION with similar smeantic types are grouped together but words
with different semantic types are far apart in vector space.
However, by using semantic types in SMDB+ vector space
B. Semantic Similarity

Similarity between two terms is computed as cos(Term1, Term2), where Term1 and Term2 are the corresponding word vectors from the different word embeddings. Table II shows the similarity scores computed on the top 12 similar terms from the UMNSRS-Sim dataset using the Pyysalo et al. [9], SMDB, and SMDB+ word embeddings. The embeddings created from SemMedDB sentences and semantic sentences provided an approximately 3.5% increase in average score, demonstrating the effectiveness of this technique. One interesting observation is that SMDB performed equally well; this could be due to the quality of the sentences in SemMedDB. Since SemMedDB sentences encode an explicit subject-object relationship, the similarity between words was captured better than in the Pyysalo et al. [9] embeddings, which were trained on the whole of PubMed.

C. Semantic Relatedness

The same cosine score used for computing relatedness with the Pyysalo et al. and SMDB embeddings was applied here. However, SMDB+ additionally provides semantic type vectors, which can be exploited to better capture the relatedness between two terms. We calculate relatedness in the SMDB+ word embeddings using the formula defined in the Method section. Table IV shows relatedness scores on the top 12 biomedical pairs from UMNSRS-Rel that differ in semantic type but are semantically related to each other. The 11% increase in the average score of SMDB+ over Pyysalo et al. showcases the power of semantic type vectors, through which we were able to arrive at better relatedness measures.

D. Text Classification

From the clinical questions dataset, text classification was applied to the five topics with the most questions. These topic question datasets were prepared following Cao et al. [13] for binary classification. The accuracies of SVM classifiers trained on a set of domain-independent and domain-specific features, as in Cao et al. [13], were compared to the accuracies of CNN classifiers trained using vectors from the SMDB and SMDB+ word embeddings as features. Cao et al. use combinations of "BOW", "POS", "CSTY", and "BIGRAM", where "BOW" is bag of words, "POS" is part of speech, and "CSTY" is concept semantic type. Here, the SMDB classifier uses the vectors of the words in the question text as input features, whereas the SMDB+ classifier uses the vectors of both the words and the semantic types. To obtain the semantic types in the question text we use MetaMap. Table V compares the accuracies of the SVM classifiers trained on the different feature sets to those of the CNN classifiers trained using word embeddings as features. It is evident from the results that using both word and semantic type vectors (SMDB+) provided a 2.35% increase in classification accuracy over Cao et al. [13], and the 1% increase of the SMDB+ classifier over the SMDB classifier illustrates the effectiveness of using semantic type vectors as additional features for the CNN model.
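The cosine score used in the similarity and relatedness experiments above can be sketched as follows. The `embeddings` table and its toy vectors are hypothetical stand-ins for the actual SMDB/SMDB+ vectors, not the authors' code.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical lookup table standing in for loaded SMDB/SMDB+ embeddings.
embeddings = {
    "diabetes": np.array([0.2, 0.7, 0.1]),
    "insulin":  np.array([0.3, 0.6, 0.2]),
}

sim = cosine(embeddings["diabetes"], embeddings["insulin"])
```

With real embeddings, the same lookup-and-cosine step is applied to each Term 1/Term 2 pair from UMNSRS-Sim and UMNSRS-Rel.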

Authorized licensed use limited to: MICROSOFT. Downloaded on July 08,2022 at 07:24:28 UTC from IEEE Xplore. Restrictions apply.
Term 1 Sem 1 Term 2 Sem 2 Pyysalo et al. SMDB SMDB+
diabetes dsyn insulin aapp 0.43701 0.34331 0.43770
meningitis dsyn headache sosy 0.31292 0.32223 0.51885
nausea sosy zofran orch 0.35318 0.46599 0.37232
hypothyroidism dsyn synthroid aapp 0.27220 0.32810 0.26122
pain sosy morphine orch 0.34102 0.44372 0.54410
diabetes dsyn polydipsia sosy 0.27831 0.25547 0.37107
hyperemesis sosy zofran orch 0.09222 0.19064 0.20726
diabetes dsyn polydipsia sosy 0.30673 0.32209 0.35027
obesity dsyn snoring sosy 0.37785 0.40361 0.45213
dyslipidemia dsyn lipitor orch 0.22700 0.24698 0.40581
headache sosy tylenol orch 0.25602 0.10493 0.25559
ataxia sosy ethanol orch 0.03230 0.00496 0.36580
Average 0.26851 0.28517 0.37851
Table IV
Comparison of relatedness scores between Term 1 and Term 2 vectors from the Pyysalo et al., SMDB and SMDB+ word embeddings. Here Sem 1 and Sem 2 are the semantic types for Term 1 and Term 2, respectively. Values in bold indicate the highest scores. aapp (Amino Acid, Peptide, or Protein), dsyn (Disease or Syndrome), orch (Organic Chemical), and sosy (Sign or Symptom) are semantic types defined in UMLS.
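To illustrate how semantic type vectors can enter the relatedness score, the sketch below averages the term-level and type-level cosines. The equal-weight combination is an assumption made purely for illustration; the actual SMDB+ score uses the formula defined in the paper's Method section.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def relatedness(term1_vec, term2_vec, sem1_vec, sem2_vec, alpha=0.5):
    """Illustrative combination of the term-level and semantic-type-level
    cosines; the weighting alpha is an assumption, not the paper's formula."""
    return alpha * cosine(term1_vec, term2_vec) + (1 - alpha) * cosine(sem1_vec, sem2_vec)

# Toy vectors standing in for SMDB+ entries for (pain, sosy) and (morphine, orch).
pain, morphine = np.array([0.4, 0.1, 0.5]), np.array([0.3, 0.2, 0.6])
sosy, orch = np.array([0.9, 0.1]), np.array([0.7, 0.3])
score = relatedness(pain, morphine, sosy, orch)
```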

Topic                   BOW+BIGRAM   BOW+BIGRAM+POS   BOW+BIGRAM+CSTY   BOW+BIGRAM+CSTY+POS   SMDB    SMDB+
Pharmacological         86.99        87.22            87.81             87.97                 90.35   91.3
Management              67.06        67.09            67.02             67.02                 65.99   68.51
Diagnosis               76.82        76.61            76.97             77.13                 77.19   78.97
Treatment & Prevention  71.09        71.20            71.26             71.03                 74.59   75.28
Test                    79.71        79.78            80.79             80.79                 82.55   83.98
Average                 76.33        76.38            76.77             76.78                 78.13   79.60
Table V
Accuracy comparison of SVM classifiers trained on sets of features, i.e. "BOW+BIGRAM", "BOW+BIGRAM+POS", "BOW+BIGRAM+CSTY" and "BOW+BIGRAM+CSTY+POS", as trained by Cao et al. [13], to CNN classifiers trained using SMDB and SMDB+ word embeddings as features. Values in bold indicate the highest scores.
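The difference between the SMDB and SMDB+ feature setups can be sketched as building the 2-D input a CNN classifier would convolve over: word vectors for the question tokens, with the vectors of their semantic types appended in the SMDB+ case. All names and toy vectors here are hypothetical; in the actual pipeline the semantic types come from MetaMap.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
# Hypothetical SMDB+ table holding both word vectors and semantic-type
# vectors ('dsyn' stands in for a MetaMap-derived semantic type).
vectors = {w: rng.normal(size=dim) for w in
           ["what", "treats", "migraine", "dsyn"]}

def question_matrix(tokens, sem_types, table, dim):
    """Stack word vectors plus appended semantic-type vectors into the
    2-D feature matrix for a CNN classifier (SMDB+ setting); the SMDB
    setting is the same call with sem_types=[]."""
    rows = [table.get(t, np.zeros(dim)) for t in tokens + sem_types]
    return np.stack(rows)

X = question_matrix(["what", "treats", "migraine"], ["dsyn"], vectors, dim)
# X has one row per token plus one per semantic type: shape (4, 8)
```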

VI. CONCLUSIONS

This work takes a new and effective approach towards the training of biomedical word embeddings by utilizing semantic information in SemMedDB. We bridge the gap between the utilization of semantic resources and biomedical word embeddings. Using semantic information from SemMedDB, semantic types were introduced into the vector space of word embeddings. This paper further illustrates the use of semantic type vectors in various tasks: finding biomedical pairs such as drug-disease, calculating better relatedness scores, and improving text classification accuracy. The proposed semantic embeddings have potential use in various biomedical tasks that use pre-trained word embeddings as features.

In the future, adding text from other clinical text resources in addition to the 'semantic sentences' will create more powerful word embeddings with an increased vocabulary. Also, since the skip-gram model was used for training the word embeddings, in the future we will analyze how different model architectures affect the resulting word embeddings. We also did not focus much on preprocessing the text before creating the word embeddings, which leaves room for improving their quality. Experimenting with different hyper-parameters while training the word embeddings can also be done in the future. Finally, more research can be done on applications of our semantic embeddings wherever pre-trained word embeddings are used as features.

REFERENCES

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.

[2] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 160–167.

[3] J. Turian, L. Ratinov, and Y. Bengio, "Word representations: a simple and general method for semi-supervised learning," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 384–394.

[4] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.

[5] O. Levy and Y. Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems, 2014, pp. 2177–2185.

[6] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

[7] A. Österlund, D. Ödling, and M. Sahlgren, "Factorization of latent variables in distributional semantic models," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 227–231.

[8] P. Stenetorp, H. Soyer, S. Pyysalo, S. Ananiadou, and T. Chikayama, "Size (and domain) matters: Evaluating semantic word space representations for biomedical text," Proceedings of SMBM, vol. 12, 2012.

[9] S. Moen and T. S. S. Ananiadou, "Distributional semantics resources for biomedical text processing," in Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan, 2013, pp. 39–43.

[10] B. Chiu, G. Crichton, A. Korhonen, and S. Pyysalo, "How to train good word embeddings for biomedical NLP," in Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 2016, pp. 166–174.

[11] A. Sarker and G. Gonzalez, "Portable automatic text classification for adverse drug reaction detection via multi-corpus training," Journal of Biomedical Informatics, vol. 53, pp. 196–207, 2015.

[12] A. R. Aronson, "Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program," in Proceedings of the AMIA Symposium. American Medical Informatics Association, 2001, p. 17.

[13] Y.-g. Cao, J. J. Cimino, J. Ely, and H. Yu, "Automatically extracting information needs from complex clinical questions," Journal of Biomedical Informatics, vol. 43, no. 6, pp. 962–971, 2010.

[14] W. Weiming, D. Hu, M. Feng, and L. Wenyin, "Automatic clinical question answering based on UMLS relations," in Third International Conference on Semantics, Knowledge and Grid (SKG 2007). IEEE, 2007, pp. 495–498.

[15] Y. Cao, F. Liu, P. Simpson, L. Antieau, A. Bennett, J. J. Cimino, J. Ely, and H. Yu, "AskHERMES: An online question answering system for complex clinical questions," Journal of Biomedical Informatics, vol. 44, no. 2, pp. 277–288, 2011.

[16] D. Hristovski, D. Dinevski, A. Kastrin, and T. C. Rindflesch, "Biomedical question answering using semantic relations," BMC Bioinformatics, vol. 16, no. 1, p. 6, 2015.

[17] T. C. Rindflesch and M. Fiszman, "The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text," Journal of Biomedical Informatics, vol. 36, no. 6, pp. 462–477, 2003.

[18] B. L. Humphreys and D. Lindberg, "The UMLS project: making the conceptual connection between users and the information they need," Bulletin of the Medical Library Association, vol. 81, no. 2, p. 170, 1993.

[19] Z. Yu, T. Cohen, B. Wallace, E. Bernstam, and T. Johnson, "Retrofitting word vectors of MeSH terms to improve semantic similarity measures," in Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis, 2016, pp. 43–51.

[20] S. Abdeddaïm, S. Vimard, and L. F. Soualmia, "The MeSH-gram neural network model: Extending word embedding vectors with MeSH concepts for UMLS semantic similarity and relatedness in the biomedical domain," arXiv preprint arXiv:1812.02309, 2018.

[21] A. T. McCray, "The UMLS Semantic Network," in Proceedings, Symposium on Computer Applications in Medical Care. American Medical Informatics Association, 1989, pp. 503–507.

[22] H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat, and T. C. Rindflesch, "SemMedDB: a PubMed-scale repository of biomedical semantic predications," Bioinformatics, vol. 28, no. 23, pp. 3158–3160, 2012.

[23] S. Pakhomov, B. McInnes, T. Adam, Y. Liu, T. Pedersen, and G. B. Melton, "Semantic similarity and relatedness between clinical terms: an experimental study," in AMIA Annual Symposium Proceedings, vol. 2010. American Medical Informatics Association, 2010, p. 572.

[24] W.-t. Yih, X. He, and C. Meek, "Semantic parsing for single-relation question answering," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, 2014, pp. 643–648.

[25] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil, "Learning semantic representations using convolutional neural networks for web search," in Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014, pp. 373–374.

[26] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," arXiv preprint arXiv:1404.2188, 2014.

[27] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.

[28] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," arXiv preprint arXiv:1404.2188, 2014.

[29] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.

[30] J. W. Ely, J. A. Osheroff, M. H. Ebell, G. R. Bergus, B. T. Levy, M. L. Chambliss, and E. R. Evans, "Analysis of questions asked by family doctors regarding patient care," BMJ, vol. 319, no. 7206, pp. 358–361, 1999.

[31] J. W. Ely, J. A. Osheroff, K. J. Ferguson, M. L. Chambliss, D. C. Vinson, and J. L. Moore, "Lifelong self-directed learning using a computer database of clinical questions," Journal of Family Practice, vol. 45, no. 5, pp. 382–389, 1997.

[32] D. M. D'Alessandro, C. D. Kreiter, and M. W. Peterson, "An evaluation of information-seeking behaviors of general pediatricians," Pediatrics, vol. 113, no. 1, pp. 64–69, 2004.
