
2018 14th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD)

Bidirectional LSTM-CRF for biomedical named entity recognition

Xuemin Yang
Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing, 210003, China

Zhihong Gao
Zhejiang Engineering Research Center of Intelligence Medicine, Wenzhou, 325035, China

Yongmin Li
Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing, 210003, China

Chuandi Pan
Zhejiang Engineering Research Center of Intelligence Medicine, Wenzhou, 325035, China

Ronggen Yang
Faculty of Intelligent Science and Control Engineering, Jinling Institute of Technology, Nanjing, 211169, China

Geng Yang
Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing, 210003, China

Lejun Gong*
Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing, 210003, China
*Corresponding author

Abstract—Biomedical entity recognition extracts significant entities, such as cells, proteins, and genes; it is a challenging task for any automatic system that mines knowledge from bioinformatics texts. In this paper, we use a bidirectional long short-term memory network (Bi-LSTM) combined with conditional random fields (CRFs) to obtain word representations automatically, removing the need for extensive feature engineering. The experimental results show that this word representation method can effectively capture latent semantic information. Without relying on any hand-crafted features, the method achieves a 76.81% F-score on the test dataset. The proposed method is therefore expected to advance biomedical text mining and bioinformatics entity recognition.

Keywords – entity recognition; word representation; CRFs; Bi-LSTM

I. INTRODUCTION

With the rapid growth of the biomedical literature, extracting structured information from raw text has received increasing attention in recent years. The biomedical literature can be regarded as a huge unstructured database that provides a wealth of resources for biological studies, and text data mining helps us obtain information from this wide range of texts, which is broadly used in biological research. Text mining is a subfield of information extraction concerned with knowledge discovery from textual data. Its purpose is to transform text into knowledge that humans can use: it extracts previously unknown, understandable, and ultimately usable knowledge from large amounts of textual data, while using that knowledge to better organize the information for future reference. Text mining applies intelligent algorithms such as neural networks, case-based reasoning, and probabilistic reasoning, combined with language processing techniques, to analyze massive amounts of plain text, extract or mark keywords, concepts, and relationships between words, and categorize documents by content to gain useful knowledge and information [1]. Named entity recognition refers to the identification of entities with specific meanings in text. Entities are the information-bearing units of natural language, and their recognition belongs to the research field of text information processing. Biomedical entity recognition is a key part of information extraction in bioinformatics; it is of great significance for research and applications in information retrieval, automatic question answering, machine translation, and knowledge base construction. At present, approaches to biomedical entity recognition fall mainly into shallow machine learning and deep neural network methods. Shallow machine learning methods mainly include the Conditional Random Field (CRF) model [2-3], the Hidden Markov Model [4], the Maximum Entropy

978-1-5386-8097-1/18/$31.00 ©2018 IEEE 239


Model [5], the Support Vector Machine [6], and so on. Traditional entity recognition methods depend directly on a large number of hand-crafted features and specialized domain knowledge. Wang et al. [7] evaluated the CRF-based Gimli method, whose F-value reached 72.23%. Zhou and Su [8] used a CRF with rich domain knowledge and hand-crafted features to raise the F-value to 72.55%. Liao et al. [9] constructed a skip-chain CRF model for biomedical named entity recognition; this model can take long-distance dependencies in biomedical text into account, and reached a 73.20% F-value on the GENIA corpus.

At present, most of the features used in traditional shallow machine learning methods are based on domain knowledge and experience, which is time-consuming and laborious: it requires repeated experiments and corresponding feature selection, and it is difficult to capture deeper semantic information. In recent years, neural networks have shown excellent performance in general-domain named entity recognition. Compared with statistical machine learning or rule-based methods, neural network-based deep learning methods have the advantage of generalizing better and depending less on hand-crafted features. Accordingly, many general-domain entity recognition models based on neural networks have been proposed. In the study of biomedical entity recognition with deep neural networks, Yao et al. [10] first used neural networks to generate word vectors from unlabeled biological texts and then built a multi-layer neural network whose F-value reached 71.01%. Li and Jin et al. [11] achieved a 72.76% F-value on the GENIA corpus using a Bi-LSTM method. This paper presents a biomedical entity recognition method based on the BiLSTM-CRF architecture. The method no longer depends directly on hand-crafted features and domain knowledge but instead uses context-based word embeddings. The experimental results suggest that this direction has broad prospects for text mining in biology. The next section describes the method in detail.

II. RELATED WORK

We use a neural network architecture to recognize entities in this article. It consists of three modules. First, we preprocess the biomedical training dataset and transform it into word embeddings. Second, we train the BiLSTM-CRF model to obtain an annotation model. Finally, we preprocess the test dataset and evaluate the trained model. The workflow is illustrated in Figure 1. The following subsections describe the key steps of the work.

Figure 1. Workflow of the method

A. Models

This part elaborates the BiLSTM-CRF method and analyzes the effects of the Bi-LSTM and CRF components on biological entity recognition, illustrating the function of each module in the model.

1) LSTM networks

The LSTM [13] network is a special recurrent neural network model that overcomes the vanishing gradient problem caused by long sequences in the traditional RNN model. The LSTM model can selectively store contextual information through a specially designed gate structure; as a result, it is well suited to biomedical named entity recognition. Figure 2 illustrates the structure of an LSTM memory cell.

Figure 2. LSTM memory unit (input x_t, gates i_t, f_t, o_t, cell state c_t, output h_t).

The main structure of the LSTM network can be expressed formally as follows:

σ(x) = 1 / (1 + e^{-x})
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
f_t = σ(W_f h_{t-1} + U_f x_t + b_f)
o_t = σ(W_o h_{t-1} + U_o x_t + b_o)
c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)
c_t = i_t ⊗ c̃_t + f_t ⊗ c_{t-1}
h_t = tanh(c_t) ⊗ o_t
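The gate equations can be sketched step by step in plain NumPy. This is a minimal illustration only: the dimensions, the random weights, and the `lstm_step` helper are assumptions for demonstration, not the trained parameters or code of our model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the gate equations above."""
    # i_t = sigma(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
    i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] @ c_prev + p["bi"])
    # f_t = sigma(W_f h_{t-1} + U_f x_t + b_f)
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x_t + p["bf"])
    # o_t = sigma(W_o h_{t-1} + U_o x_t + b_o)
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x_t + p["bo"])
    # candidate cell state: tanh(W_c h_{t-1} + U_c x_t + b_c)
    c_tilde = np.tanh(p["Wc"] @ h_prev + p["Uc"] @ x_t + p["bc"])
    # c_t = i_t * c_tilde + f_t * c_{t-1};  h_t = tanh(c_t) * o_t
    c = i * c_tilde + f * c_prev
    h = np.tanh(c) * o
    return h, c

# toy sizes: 300-d input (matching the word-vector dimension), 100-d hidden state
d_in, d_h = 300, 100
rng = np.random.default_rng(0)
p = {k: rng.standard_normal((d_h, d_in)) * 0.01 for k in ("Wxi", "Uf", "Uo", "Uc")}
p.update({k: rng.standard_normal((d_h, d_h)) * 0.01 for k in ("Whi", "Wci", "Wf", "Wo", "Wc")})
p.update({k: np.zeros(d_h) for k in ("bi", "bf", "bo", "bc")})

h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), p)
```

Because the output gate o_t lies in (0, 1) and tanh is bounded, every component of h_t stays inside (-1, 1), which is what makes long chains of such steps numerically stable.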
where σ and tanh denote two different neuron activation functions; i, f, and o denote the different gates; W denotes a weight matrix connecting two layers (e.g., W_{xi} is the weight matrix from the input layer to the input gate of the hidden layer); U denotes the weight matrix of each gate for the input x_t; b denotes a bias vector (e.g., b_i is the bias vector of the input gate of the hidden layer); and c denotes the state of the memory cell. This gating mechanism can effectively filter and memorize the information in the memory cell, solving the problem of the plain RNN. However, the LSTM captures only the preceding context of the text; for named entity recognition tasks, the following context also carries important information.

2) BiLSTM-CRF networks

To take full advantage of contextual information, we use a Bi-LSTM. A sentence x = (x_1, x_2, ..., x_k) serves as the input to the neural network, and for each word x_i in the sentence we use its word embedding. We feed the word sequence of a given sentence into a Bi-LSTM network, where a forward and a backward representation of each word are computed. →h_t is the output of the forward LSTM at time t, and ←h_t is the output of the backward LSTM at time t. The output representation of the Bi-LSTM at time t is defined as h_t = [→h_t, ←h_t]. This representation contains both the preceding and the following context, making it better suited to predicting named entity labels.

The linear-chain CRF can obtain a globally optimal sequence of labels by taking the relationships between adjacent labels into account. It makes full use of these relationships, optimizes the output tag sequence globally, and gives higher recognition performance for biomedical entities with long spans and modifiers. In this paper, the output of the Bi-LSTM model is used as the input of the CRF model to obtain the globally optimal label sequence. The CRF uses the forward-backward algorithm to compute the conditional probabilities and feature expectations at different sequence positions, uses the quasi-Newton method to maximize the likelihood and thereby estimate the model parameters, and uses the Viterbi dynamic programming algorithm to decode for prediction.

3) Word embedding

Word embedding is a means of mapping a vocabulary to real vectors that capture the distributional syntactic and semantic information of words. Each word or part-of-speech tag is represented as a low-dimensional real-valued vector, so that the Euclidean distance between any two related words, or between any two part-of-speech tags, becomes smaller. The distributed representation of words alleviates the curse of dimensionality and the local generalization limitations in machine learning. Compared with traditional feature representations, it can uncover the inherent relationships in the input data and capture its internal grammatical and semantic similarity.

Word2vec is a natural language processing tool released by Google in 2013. It vectorizes all words so that the relationships between words can be measured quantitatively and the connections between words explored. In this paper, to obtain word embeddings, we used the continuous bag-of-words language model to convert each word in the corpus into a d-dimensional word vector, where d is set to 300; in the end we obtained pre-trained vectors for 21,571 words.

4) Training parameters

A pre-trained or randomly initialized embedding matrix maps each word x_i in a sentence from a one-hot vector to a low-dimensional word vector, whose dimension is set to 300. We marked words that are not in the embedding vocabulary as UNK and randomly initialized the words that do not exist in the pre-training file; these embeddings are treated as model parameters and adjusted in the course of training. Random word embeddings are uniformly sampled from the range [-0.25, 0.25].

We used Adam [14] to train our models; compared with Stochastic Gradient Descent (SGD) [15], the Adam optimizer trains faster. We set the learning rate to 0.001. Experiments demonstrate that adding Dropout [16] to the input and output of the bidirectional LSTM reduces overfitting; the dropout rate is set to 0.5.

B. Experimental dataset

We evaluated the BiLSTM-CRF model on the GENIA corpus. GENIA is being developed to provide reference material to advance information retrieval and knowledge mining techniques in the biological field. The corpus contains 1,999 Medline abstracts, retrieved from PubMed with the medical subject heading terms human, blood cells, and transcription factors. There are 22,402 sentences in total, comprising 18,546 training sentences and 3,856 test sentences. In terms of corpus processing, to clearly represent the named entities to be identified, we annotated them with BIO tags. The BIO marking rules are as follows: "B" indicates the beginning of an entity, "I" denotes the inside of an entity, and "O" represents a token outside any biomedical entity.
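To make the BIO scheme concrete, the following sketch decodes entity spans from a tagged sentence. The sample tokens, their tags, and the `extract_entities` helper are invented for illustration; they are not drawn from the GENIA corpus.

```python
# Hypothetical BIO-tagged sentence: "B" opens an entity, "I" continues
# the current entity, "O" marks tokens outside any entity.
tokens = ["IL-2", "gene", "expression", "in", "human", "T", "cells"]
tags   = ["B",    "I",    "O",          "O",  "B",     "I",  "I"]

def extract_entities(tokens, tags):
    """Collect the token span of every B/I run and return its surface text."""
    entities, start = [], None
    for idx, tag in enumerate(tags):
        if tag == "B":                      # a new entity begins here
            if start is not None:           # close a directly preceding entity
                entities.append((start, idx))
            start = idx
        elif tag == "O":                    # close any open entity
            if start is not None:
                entities.append((start, idx))
                start = None
    if start is not None:                   # entity running to end of sentence
        entities.append((start, len(tags)))
    return [" ".join(tokens[s:e]) for s, e in entities]

print(extract_entities(tokens, tags))  # → ['IL-2 gene', 'human T cells']
```

The same span-recovery logic is what evaluation scripts use when computing precision, recall, and F-score over predicted versus gold BIO sequences.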
III. EXPERIMENTAL RESULTS

Following the method formulated above, we carried out experiments on the GENIA corpus. We analyzed five types of entities in the corpus: protein, DNA, RNA, cell_type, and cell_line. The outcome of the experiment is illustrated in Figure 3.

Figure 3. Results of biomedical entity recognition (precision, recall, and F-score for protein, DNA, RNA, cell_type, and cell_line).

Our method obtained a 76.81% F-score and, on the test dataset, achieved better performance than previous works [8-12]. The method is therefore expected to be useful for developing biological text processing techniques.

IV. CONCLUSION

The article presented a method for recognizing entities using a neural network architecture based on a bidirectional LSTM, word embeddings, and CRFs. To obtain more accurate identification results, we decode the output of the Bi-LSTM network through the CRFs to obtain the optimal tag sequence. The integration of the CRFs improves the recognition performance for biomedical entities with multiple modifiers and blurred boundaries. We validated our approach on the test dataset; compared with the other approaches, ours performs best. In summary, for this biological named entity recognition task, the adoption of word vectors and the integration of the Bi-LSTM and CRFs models effectively improve recognition performance.

ACKNOWLEDGEMENTS

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61502243, 61272084, 61300240, 61572263, 61502251, 61502247, 61503195), the Natural Science Foundation of Jiangsu Province (Grant Nos. BK20130417, BK20150863, BK20140895, BK20140875, BK20150862), the China Postdoctoral Science Foundation (2016M590483, 2018M632349), the Jiangsu Province Postdoctoral Science Foundation (1501072B), the Scientific and Technological Support Project (Society) of Jiangsu Province (BE2016776), the Natural Science Foundation of Higher Education Institutions of Jiangsu Province in China (16KJD520003), and the Nanjing University of Posts and Telecommunications Science Foundation (NY214068, NY213088). This study is also partially supported by the Zhejiang Engineering Research Center of Intelligent Medicine under 2016E10011.

REFERENCES
[1] Ye Z, Tafti A P, He K Y, et al. SparkText: Biomedical Text Mining on Big Data Framework [J]. PLoS One, 2016, 11(9): e0162721.
[2] Lee H C, Kao H Y. CDRnN: A high performance chemical-disease recognizer in biomedical literature [C]// IEEE International Conference on Bioinformatics and Biomedicine, 2017: 374-379.
[3] Li K, Ai W, Tang Z, et al. Hadoop Recognition of Biomedical Named Entity Using Conditional Random Fields [J]. IEEE Transactions on Parallel & Distributed Systems, 2015, 26(11): 3040-3051.
[4] Ponomareva N, Pla F, Molina A, et al. Biomedical named entity recognition: a poor knowledge HMM-based approach; 2007: 382-387.
[5] Ekbal A, Saha S, Hasanuzzaman M. Multiobjective Approach for Feature Selection in Maximum Entropy Based Named Entity Recognition [C]. IEEE Computer Society, 2010: 323-326.
[6] Ju Z, Wang J, Zhu F. Named Entity Recognition from Biomedical Text Using SVM [J]. 2011: 1-4.
[7] Wang X, Yang C, Guan R. A comparative study for biomedical named entity recognition [J]. International Journal of Machine Learning and Cybernetics, 2015, 9(3): 373-382.
[8] Zhou G, Zhang J, Su J, et al. Recognizing names in biomedical texts: a machine learning approach [J]. Bioinformatics, 2004, 20(7): 1178-1190.
[9] Liao Z, Wu H. Biomedical Named Entity Recognition Based Skip-Chain CRFs [J]. 2012: 1495-1498.
[10] Yao L, Liu H, Liu Y, et al. Biomedical Named Entity Recognition based Deep Neutral Network [J]. International Journal of Hybrid Information Technology, 2015, 8.
[11] Li L, Jin L, Jiang Y, et al. Recognizing Biomedical Named Entities Based on the Sentence Vector/Twin Word Embeddings Conditioned Bidirectional LSTM [M]; 2016: 165-176.
[12] Li Lishuang, Guo Yuankai. Bio-medical Named Entity Recognition Based CNN-BLSTM-CRF Model [J]. Chinese Journal of Information, 2018(1).
[13] Huang Y, Ali S, Wang L, et al. Biomedical Entity Recognition based Long and Short Term Memory Model [C]// International Conference on Mechatronics, Computer and Education Informationization, 2017.
[14] Balles L, Hennig P. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients. PMLR 80: 404-413, 2018.
[15] Li X L. Preconditioned Stochastic Gradient Descent [J]. IEEE Trans Neural Netw Learn Syst, 2018, 29(5): 1454-1466.
[16] Gil Vera, Victor. Learning Analytics and Scholar Dropout: A Predictive Model. 10.5829/idosi.mejsr.2017.1414.1419.
