Journal of Engineering, Computers & Applied Sciences (JEC&AS)                 ISSN No: 2319‐5606
 
Volume 2, No.4, April 2013 
_________________________________________________________________________________ 
 

To Identify Disease Treatment Relationship in Short Text Using Machine Learning & Natural Language Processing

Khan Razik, Student, Computer Department, Sinhgad Institute of Technology, Lonavala
Dhande Mayur, Student, Computer Department, Sinhgad Institute of Technology, Lonavala
Patil Aniket, Student, Computer Department, Sinhgad Institute of Technology, Lonavala
Gaikwad Namrata, Student, Computer Department, Sinhgad Institute of Technology, Lonavala
ABSTRACT
Advances in the medical domain have made automatic learning popular in the fields of medical
decision support, extraction of medical knowledge, and overall health management. Machine
Learning (hereafter, ML) and Natural Language Processing (hereafter, NLP) can make the healthcare
field more efficient and reliable. This paper describes how ML and NLP can be used to extract
knowledge from published medical papers. The approach extracts the sentences that mention diseases
and treatments and identifies the relationship between them.
Keywords: Automatic Learning, Medical Decision Support, Machine Learning, Natural Language Processing,
Healthcare.

1 Introduction
People are more concerned about their health than ever before. In spite of their busy schedules,
they want everything to run smoothly. People want fast access to reliable information, delivered in
a manner that suits their habits and workflow. The medical field has grown to such an extent that
people practicing medicine need not only experience but also knowledge of the latest discoveries.
The Electronic Health Record (hereafter, EHR) is becoming a standard in the healthcare domain.
Websites such as Google Health[9] and Microsoft Health Vault[10] encourage people to care deeply
about their health. Having an EHR has the following benefits[11]:
1. Rapid access to information that is focused on certain topics such as immunizations, drugs, etc.
2. Quality medical data for taking proper medical decisions. For this purpose we need better, more
efficient, and more reliable access to information.
According to research, people search the web in order to stay regularly informed about their
health. In the medical domain, the most used source of information is MEDLINE[12]. MEDLINE is a
database into which new research discoveries are entered at a high rate. Because of their busy
schedules, experts do not have time to read millions of articles, so there is a need for a tool
that serves this purpose.
Our objective is to work with ML and NLP so that the task of identifying and disseminating reliable
healthcare information becomes easy and beneficial for people. A hierarchical approach is used for
performing the two tasks: the first is to identify and eliminate uninformative sentences, and the
second is to classify the rest of the sentences by the relation of interest. This approach shows a
substantial improvement in obtaining the information.

2 Related Work
Entity recognition for Diseases and Treatments
The most relevant work is that of Rosario and Hearst[2]. It uses Hidden Markov Models and maximum
entropy models to perform both entity recognition and relation discrimination. Their representation
techniques are based on words in context, part-of-speech information, phrases, and a medical
lexical ontology (MeSH terms[13]). The task of relation extraction, or relation identification, has
previously been addressed with a focus on biomedical tasks (Craven[3]), gene-disorder association
(Ray and Craven[4]), and diseases and drugs (Srinivasan and Rindflesch[5]).

Rule-based approaches
Rule-based approaches have been widely used for solving relation extraction tasks. The main sources
of information used by this technique are either syntactic (part-of-speech (POS) tags and syntactic
structures) or semantic, in the form of fixed patterns that contain words that trigger a certain
relation. The best rule-based systems are the ones that use rules constructed manually or
semi-automatically (extracted automatically and refined manually). A positive aspect of rule-based
systems is that they obtain good precision, while recall tends to be low. They tend to require more
human-expert effort than data-driven methods (though human effort is needed in data-driven methods
too, to label the data).
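
To make the idea of fixed trigger patterns in rule-based relation extraction concrete, the following
minimal Python sketch assigns a relation to a sentence when a trigger word occurs in it. The trigger
lists and the function name `match_relation` are illustrative assumptions and are not taken from any
of the cited systems.

```python
import re

# Hypothetical trigger words per relation; illustrative only, not the
# rules of any cited system.
TRIGGERS = {
    "Cure": ["cure", "cures", "cured", "treats", "treated"],
    "Prevent": ["prevent", "prevents", "prevented", "prevention"],
    "Side Effect": ["side effect", "side effects", "adverse"],
}

# One compiled pattern per relation, matching whole words only.
PATTERNS = {
    relation: re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    for relation, words in TRIGGERS.items()
}


def match_relation(sentence):
    """Return the first relation whose trigger pattern fires, else None."""
    for relation, pattern in PATTERNS.items():
        if pattern.search(sentence):
            return relation
    return None
```

A sentence such as "Ibuprofen relieves headaches." is missed because "relieves" is not in any trigger
list, which illustrates why such systems tend to have good precision but low recall.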

www.borjournals.com                                                               Blue Ocean Research Journals          72 

Syntactic rule-based relation extraction systems are complex systems based on additional tools used
to assign POS tags or to extract syntactic parse trees. It is known that for the biomedical
literature such tools are not yet at the state-of-the-art level that they reach for general English
texts, and therefore their performance on sentences is not always the best (Bunescu et al. [6]).

3 Proposed Approach
3.1 Task
The work that we present in this paper is focused on two tasks: automatically identifying sentences
published in medical abstracts (Medline) as containing or not containing information about diseases
and treatments, and automatically identifying the semantic relations that exist between diseases
and treatments, as expressed in these texts. The second task is focused on three semantic
relations: Cure, Prevent, and Side Effect.
The problems addressed in this paper form the building blocks of a framework that can be used by
healthcare providers (e.g., private clinics, hospitals, medical doctors, etc.) or by laypeople who
want to be in charge of their health by reading the latest published life science articles related
to their interests. The final product can be envisioned as a browser plug-in or a desktop
application that automatically finds and extracts the latest medical discoveries related to
disease-treatment relations and presents them to the user. The product can be developed and sold by
companies that do research in Healthcare Informatics, Natural Language Processing, and Machine
Learning, and by companies that develop tools like Microsoft Health Vault. Consumers are looking to
buy or use products that satisfy their needs and gain their trust and confidence. Healthcare
products are probably the most sensitive to the trust and confidence of consumers. Companies that
want to sell information technology healthcare frameworks need to build tools that allow them to
extract and mine automatically the wealth of published research.
The first task (task 1, or sentence selection) identifies sentences from Medline published
abstracts that talk about diseases and treatments. The task is similar to a scan of the sentences
contained in the abstract of an article in order to present to the user only the sentences that are
identified as containing relevant information (disease-treatment information). The second task
(task 2, or relation identification) has a deeper semantic dimension and is focused on identifying
disease-treatment relations in the sentences already selected as being informative (i.e., task 1 is
applied first). We focus on three relations: Cure, Prevent, and Side Effect, because these are of
most importance. Applying task 1 first, followed by task 2, gives far better results than applying
machine learning directly to the content. This approach is better because, if uninformative data
are not filtered out (task 1), they can be mistaken for potentially relevant data.

Fig 1. Architecture of the Proposed System

3.2 Algorithms Used
As classification algorithms, we use a set of six representative models: decision-based models
(decision trees), probabilistic models (Naïve Bayes (NB) and Complement Naïve Bayes (CNB), which is
adapted for text with imbalanced class distribution), adaptive learning (AdaBoost), a linear
classifier (support vector machine (SVM) with polynomial kernel), and a classifier that always
predicts the majority class in the training data (used as a baseline). We decided to use these
classifiers because they are representative of the learning algorithms in the literature and were
shown to work well on both short and long texts. Decision trees are decision-based models similar
to the rule-based models that are used in handcrafted systems, and are suitable for short texts.
Probabilistic models, especially the ones based on the Naïve Bayes theory, are the state of the art
in text classification and in almost any automatic text classification task. Adaptive learning
algorithms are the ones that focus on hard-to-learn concepts, usually underrepresented in the data,
a characteristic that appears in our short texts and imbalanced data sets. SVM-based models are
acknowledged state-of-the-art classification techniques on text. All classifiers are part of a tool
called Weka[14] (Oana Frunza et al. [1]).

3.2.1 Bag-of-words Representation
The bag-of-words (BOW) representation is commonly used for text classification tasks. It is a
representation in which features are chosen among the words that are present in the training data.
Because we deal with short texts with an average of 20 words per sentence, the difference between a
binary value representation and a frequency value representation is not large. In our case, we
chose a frequency value representation. This has the advantage that if a feature appears more than
once in a sentence, this means that it is important, and the frequency value representation will
capture this: the feature's value will be greater than that of other features. We keep only the
words that appeared at least three times in the training collection, contain at least one
alphanumeric character, are not part of an English list of stop words[15], and are longer than
three characters. Words of one or two characters are not considered as features for two other
reasons: possible incorrect tokenization, and problems with very short acronyms in the medical
domain that could be highly ambiguous (could be an acronym or an abbreviation of a common word).

3.2.2 NLP Representation
The second type of representation is based on syntactic information. In order to extract this type
of information we used the Stanford POS tagger[8] tool. The tagger analyzes English sentences and
outputs the base forms, part-of-speech tags, chunk tags, and named entity tags.
The following preprocessing steps are applied in order to identify the final set of features to be
used for classification: removing features that contain only punctuation, removing stop words
(using the same list of words as for our BOW representation), and considering as valid features
only the lemma-based forms. We chose to use lemmas because there are a lot of inflected forms
(e.g., plural forms) for the same word, and the lemmatized form (the base form of a word) gives us
the same base form for all of them.

4 Evaluation Measures
The most commonly used evaluation measures in ML settings are accuracy, precision, recall, and
F-measure. All these measures are computed from a confusion matrix (Kohavi and Provost [7]) that
contains information about the actual (true) classes and the classes predicted by the classifier.
The test set on which the models are evaluated contains the true classes, and the evaluation tries
to identify how many of the true classes were predicted by the classifier. In ML settings, special
attention needs to be directed to the evaluation measures that are used. For data sets that are
highly imbalanced (one class is overrepresented in comparison with another), standard evaluation
measures like accuracy are not suitable. Because our data sets are imbalanced, we chose to report,
in addition to accuracy, the macroaveraged F-measure. We decided to report the macroaveraged and not
the microaveraged F-measure because the macromeasure is not influenced by the majority class, as the
micromeasure is. The macromeasure focuses better on the performance the classifier has on the
minority classes. The formulas for the evaluation measures are:
Accuracy = the number of correctly classified instances divided by the total number of instances.
Recall = the ratio of correctly classified positive instances to the total number of positives.
This evaluation measure is known to the medical research community as sensitivity.
Precision = the ratio of correctly classified positive instances to the total number of instances
classified as positive.
F-measure = the harmonic mean of precision and recall (Oana Frunza et al. [1]).

5 Conclusion
This approach is useful because it gives information only about the area of interest. The work is
divided into two tasks. The first task that we tackle in this paper has applications in information
retrieval, information extraction, and text summarization. We identify potential improvements in
results when more information is brought into the representation technique for the task of
classifying short medical texts.
The second task that we address can be viewed as a task that benefits from solving the first task
first. In this study, we have focused on three semantic relations between diseases and treatments.
Our work shows that the best results are obtained when the classifier is not overwhelmed by
sentences that are not related to the task.
This study is related to a particular field, but its future scope lies in the fact that it can be
extended to information on the web. Identifying and classifying medical-related information on the
web is a challenge that can bring valuable information to the research community and also to the
end user. We also consider as potential future work ways in which the framework's capabilities can
be used in a commercial recommender system and integrated into a new EHR system.

6 Acknowledgement
We express our sincere gratitude to the co-operative department, which provided us with valuable
assistance and the requirements for the development. We take this opportunity to record our sincere
thanks and heartfelt gratitude to our guide, Prof. M. Galphade, for her useful guidance and for
making her deep knowledge and experience available to us.
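
As an illustration of the feature-selection rules described in Section 3.2.1, the following minimal
Python sketch builds a frequency-valued bag-of-words representation. It is not the Weka-based setup
used in the work: the tiny `STOP_WORDS` set stands in for the external stop-word list[15], and the
simple whitespace tokenizer and the function names are assumptions made for the example.

```python
from collections import Counter

# Tiny stand-in stop list (assumption); the paper relies on an external
# English stop-word list [15].
STOP_WORDS = {"the", "and", "with", "that", "this", "from", "were", "have"}


def tokenize(sentence):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    tokens = (t.strip(".,;:()[]\"'") for t in sentence.lower().split())
    return [t for t in tokens if t]


def build_vocabulary(sentences, min_count=3, min_length=4):
    """Keep words that pass the filtering rules of Section 3.2.1."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return sorted(
        tok
        for tok, n in counts.items()
        if n >= min_count                      # at least three occurrences
        and any(ch.isalnum() for ch in tok)    # at least one alphanumeric
        and tok not in STOP_WORDS              # not a stop word
        and len(tok) >= min_length             # longer than three characters
    )


def vectorize(sentence, vocabulary):
    """Frequency-valued BOW vector: one count per vocabulary word."""
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]
```

With a frequency representation, a word that occurs twice in a sentence receives the value 2 rather
than 1, which is the property the section argues for.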

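
The evaluation measures defined in Section 4 can be sketched in a few lines of Python. This is a
minimal illustration under the stated definitions, not the evaluation code used in the work; the
function names and the toy label lists in the note below are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F-measure, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def macro_f_measure(y_true, y_pred):
    """Unweighted mean of the per-class F-measures over the true classes."""
    labels = sorted(set(y_true))
    total = sum(per_class_scores(y_true, y_pred, lab)[2] for lab in labels)
    return total / len(labels)
```

For a toy imbalanced set with eight Cure and two Prevent sentences, a classifier that always
predicts Cure reaches an accuracy of 0.8 but a macroaveraged F-measure of only about 0.44, because
the F-measure of the minority class is 0. This is the effect that motivates reporting the
macroaveraged F-measure.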

7 References
[1] Oana Frunza, Diana Inkpen, and Thomas Tran, "A Machine Learning Approach for Identifying
Disease-Treatment Relations in Short Texts," vol. 23, 2011.
[2] B. Rosario and M.A. Hearst, "Semantic Relations in Bioscience Text," Proc. 42nd Ann. Meeting on
Assoc. for Computational Linguistics, vol. 430, 2004.
[3] M. Craven, "Learning to Extract Relations from Medline," Proc. Assoc. for the Advancement of
Artificial Intelligence, 1999.
[4] S. Ray and M. Craven, "Representing Sentence Structure in Hidden Markov Models for Information
Extraction," Proc. Int'l Joint Conf. Artificial Intelligence (IJCAI '01), 2001.
[5] P. Srinivasan and T. Rindflesch, "Exploring Text Mining from Medline," Proc. Am. Medical
Informatics Assoc. (AMIA) Symp., 2002.
[6] R. Bunescu, R. Mooney, Y. Weiss, B. Schölkopf, and J. Platt, "Subsequence Kernels for Relation
Extraction," Advances in Neural Information Processing Systems, vol. 18, pp. 171-178, 20
[7] R. Kohavi and F. Provost, "Glossary of Terms," Machine Learning, Editorial for the Special
Issue on Applications of Machine Learning and the Knowledge Discovery Process, vol. 30, pp. 271-274,
1998.
[8] http://nlp.stanford.edu/software/tagger.shtml.
[9] Google health report, https://www.google.com/health
[10] Microsoft Health Vault, http://healthvault.com
[11] Health care tracker, http://healthcaretracker.wordpress.com/
[12] Medline Database, http://www.proquest.com/en-US/catalogs/databases/detail/medline_ft.shtml
[13] Medical Subject Headings, http://www.nlm.nih.gov/mesh/meshhome.html.
[14] Weka tool, http://www.cs.waikato.ac.nz/ml/weka/.
[15] List of Stop words, http://www.site.uottawa.ca/~diana/csi5180/StopWords.
