Journal of Engineering, Computers & Applied Sciences (JEC&AS) ISSN No: 2319‐5606
Volume 2, No.4, April 2013
_________________________________________________________________________________
www.borjournals.com Blue Ocean Research Journals 72
Syntactic rule-based relation extraction systems are complex systems based on
additional tools used to assign POS tags or to extract syntactic parse trees. It
is known that in the biomedical literature such tools are not yet at the
state-of-the-art level as they are for general English texts, and therefore
their performance on sentences is not always the best (Bunescu et al. [6]).

3 Proposed Approach

3.1 Task

The work that we present in this paper is focused on two tasks: automatically
identifying sentences published in medical abstracts (Medline) as containing or
not information about diseases and treatments, and automatically identifying
semantic relations that exist between diseases and treatments, as expressed in
these texts. The second task is focused on three semantic relations: Cure,
Prevent, and Side Effect.

The problems addressed in this paper form the building blocks of a framework
that can be used by healthcare providers (e.g., private clinics, hospitals,
medical doctors, etc.), or laypeople who want to be in charge of their health by
reading the latest life science published articles related to their interests.
The final product can be envisioned as a browser plug-in or a desktop
application that will automatically find and extract the latest medical
discoveries related to disease-treatment relations and present them to the user.
The product can be developed and sold by companies that do research in
Healthcare Informatics, Natural Language Processing, and Machine Learning, and
companies that develop tools like Microsoft Health Vault.

Consumers are looking to buy or use products that satisfy their needs and gain
their trust and confidence. Healthcare products are probably the most sensitive
to the trust and confidence of consumers. Companies that want to sell
information technology healthcare frameworks need to build tools that allow them
to extract and mine automatically the wealth of published research.

The first task (task 1 or sentence selection) identifies sentences from Medline
published abstracts that talk about diseases and treatments. The task is similar
to a scan of sentences contained in the abstract of an article in order to
present to the user only sentences that are identified as containing relevant
information (disease treatment information). The second task (task 2 or relation
identification) has a deeper semantic dimension and it is focused on identifying
disease-treatment relations in the sentences already selected as being
informative (e.g., task 1 is applied first). We focus on three relations: Cure,
Prevent, and Side Effect, because these are of most importance. Applying task 1
first followed by task 2 gives far better results than applying machine learning
directly to the content. This approach is better because uninformative data can
be considered as potential data if not filtered (task 1).

Fig 1. Architecture of the Proposed System

3.2 Algorithms Used

As classification algorithms, we use a set of six representative models:
decision-based models (Decision trees), probabilistic models (Naïve Bayes (NB)
and Complement Naïve Bayes (CNB), which is adapted for text with imbalanced
class distribution), adaptive learning (AdaBoost), a linear classifier (support
vector machine (SVM) with polynomial kernel), and a classifier that always
predicts the majority class in the training data (used as a baseline). We
decided to use these classifiers because they are representative for the
learning algorithms in the literature and were shown to work well on both short
and long texts.

Decision trees are decision-based models similar to the rule-based models that
are used in handcrafted systems, and are suitable for short texts. Probabilistic
models, especially the ones based on the Naïve Bayes theory, are the state of
the art in text classification and in almost any automatic text classification
task. Adaptive learning algorithms are the ones that focus on hard-to-learn
concepts, usually underrepresented in the data, a characteristic that appears in
our short texts and imbalanced data sets. SVM-based models are acknowledged
state-of-the-art classification techniques on text. All classifiers are part of
a tool called Weka [14] (Oana Frunza et al. [1]).

3.2.1 Bag-of-words Representation

The bag-of-words (BOW) representation is commonly used for text classification
tasks. It is a representation in which features are chosen among the words that
are present in the training data. Because we deal with short texts with an
average of 20 words per sentence, the difference between a binary value
representation and a frequency value representation is not large. In our case,
we chose a frequency value representation. This has the advantage that if
a feature appears more than once in a sentence, this means that it is important
and the frequency value representation will capture this: the feature's value
will be greater than that of other features. We keep only the words that
appeared at least three times in the training collection, contain at least one
alphanumeric character, are not part of an English list of stop words [15], and
are longer than three characters. Words that have a length of two or one
character are not considered as features because of two other reasons: possible
incorrect tokenization and problems with very short acronyms in the medical
domain that could be highly ambiguous (could be an acronym or an abbreviation of
a common word).

The macromeasure is not influenced by the majority class, as the micromeasure
is. The macromeasure better focuses on the performance the classifier has on the
minority classes. The formulas for the evaluation measures are: Accuracy = the
ratio of correctly classified instances to the total number of instances;
Recall = the ratio of correctly classified positive instances to the total
number of positives. This evaluation measure is known to the medical research
community as sensitivity. Precision = the ratio of correctly classified positive
instances to the total number of instances classified as positive. F-measure =
the harmonic mean between precision and recall (Oana Frunza et al. [1]).
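The evaluation measures described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the paper reports using Weka); the function names and the toy labels are assumptions made for illustration. The example shows how a majority-class baseline can score well on accuracy and the micro-averaged F-measure while the macro-averaged F-measure stays low.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def class_counts(gold, predicted, label):
    """Count true positives, false positives and false negatives for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    return tp, fp, fn


def precision_recall_f(tp, fp, fn):
    """Precision, recall (sensitivity) and their harmonic mean (F-measure)."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f_measure = safe_div(2 * precision * recall, precision + recall)
    return precision, recall, f_measure


def accuracy(gold, predicted):
    """Share of instances whose predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return safe_div(correct, len(gold))


def macro_f(gold, predicted):
    """Average the per-class F-measures, so every class weighs the same."""
    labels = sorted(set(gold) | set(predicted))
    scores = [precision_recall_f(*class_counts(gold, predicted, label))[2]
              for label in labels]
    return safe_div(sum(scores), len(scores))


def micro_f(gold, predicted):
    """Pool the counts of all classes first, so large classes dominate."""
    labels = sorted(set(gold) | set(predicted))
    tp = fp = fn = 0
    for label in labels:
        c_tp, c_fp, c_fn = class_counts(gold, predicted, label)
        tp, fp, fn = tp + c_tp, fp + c_fp, fn + c_fn
    return precision_recall_f(tp, fp, fn)[2]


# Toy data (assumption): 8 "Cure" and 2 "Prevent" sentences, with a baseline
# that always predicts the majority class.
gold = ["Cure"] * 8 + ["Prevent"] * 2
baseline = ["Cure"] * 10

print("accuracy:", accuracy(gold, baseline))
print("micro F :", micro_f(gold, baseline))
print("macro F :", macro_f(gold, baseline))
```

On this toy data the macro-averaged value is pulled down by the minority class that the baseline never predicts, which is the behaviour the text attributes to the macromeasure.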
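The bag-of-words feature selection rules from Section 3.2.1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the whitespace tokenizer, the tiny stop-word set, the function names, and the example sentences are all hypothetical, and the paper itself relies on Weka and on an English stop-word list [15].

```python
from collections import Counter

# Tiny stand-in stop-word list (assumption); the paper uses an English list [15].
STOP_WORDS = {"the", "and", "with", "that", "this", "from", "were", "have"}


def tokenize(sentence):
    """Lowercase, split on whitespace, strip surrounding punctuation (assumed tokenizer)."""
    tokens = (raw.strip(".,;:()[]\"'") for raw in sentence.lower().split())
    return [tok for tok in tokens if tok]


def keep_as_feature(word, corpus_count, min_count=3, min_length=4):
    """Apply the four filters stated in the text to one candidate word."""
    return (corpus_count >= min_count               # at least three occurrences
            and any(ch.isalnum() for ch in word)    # one alphanumeric character
            and word not in STOP_WORDS              # not a stop word
            and len(word) >= min_length)            # longer than three characters


def build_vocabulary(training_sentences):
    """Collect the words of the training collection that pass all filters."""
    counts = Counter(tok for s in training_sentences for tok in tokenize(s))
    return sorted(w for w, c in counts.items() if keep_as_feature(w, c))


def to_frequency_vector(sentence, vocabulary):
    """Frequency-value representation: one count per vocabulary word."""
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]


# Hypothetical training sentences, for illustration only.
training = [
    "Aspirin reduces fever and aspirin reduces pain.",
    "Aspirin can cause stomach bleeding.",
    "Vaccination can prevent influenza; vaccination is safe.",
    "Influenza vaccination reduces fever -- usually.",
]

vocabulary = build_vocabulary(training)
print(vocabulary)
print(to_frequency_vector(training[0], vocabulary))
```

A word that occurs twice in a sentence receives the value 2 in that sentence's vector, which is the advantage of the frequency representation over a binary one that the section describes.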