Professional Documents
Culture Documents
Abstract
Despite recent advances in natural language
processing, many statistical models for pro-
arXiv:1902.07669v2 [cs.CL] 21 Feb 2019
Table 5: Test F1 Measure on NER for the small and medium scispaCy models compared to a variety of
strong baselines and state of the art models. The Baseline and SOTA (State of the Art) columns include
only single models which do not use additional resources, such as language models, or additional sources
of supervision, such as multi-task learning. + Resources allows any type of supervision or pretraining. All
scispaCy results are the mean of 5 random seeds.
Medicine focus specifically on entity linking using way interactions from PubMed abstracts, lead-
the Unified Medical Language System (UMLS) ing to the development of Literome (Poon et al.,
(Bodenreider, 2004) as a knowledge base. (Buyko 2014), a search engine for genic pathway interac-
et al.) adapt Apache OpenNLP using the GENIA tions and genotype-phenotype interactions. A fun-
corpus, but their system is not openly available and damental aspect of (Valenzuela-Escarcega et al.,
is less suitable for modern, Python-based work- 2018; Poon et al., 2014) is the use of hand-written
flows. The GENIA Tagger (Tsuruoka et al., 2005) rules and triggers for events based on dependency
provides the closest comparison to scispaCy due to tree paths; the connection to the application of
it’s multi-stage pipeline, integrated research con- scispaCy is quite apparent.
tributions and production quality runtime. We im-
prove on the GENIA Tagger by adding a full de- 7 Conclusion
pendency parser rather than just noun chunking,
In this paper we presented several robust model
as well as improved results for NER without com-
pipelines for a variety of natural language process-
promising significantly on speed.
ing tasks focused on biomedical text. The scis-
In more fundamental NLP research, the GENIA
paCy models are fast, easy to use, scalable, and
corpus (Kim et al., 2003) has been widely used to
achieve close to state of the art performance. We
evaluate transfer learning and domain adaptation.
hope that the release of these models enables new
(McClosky et al., 2006) demonstrate the effective-
applications in biomedical information extraction
ness of self-training and parse re-ranking for do-
whilst making it easy to leverage high quality syn-
main adaptation. (Rimell and Clark, 2008) adapt
tactic annotation for downstream tasks. Addition-
a CCG parser using only POS and lexical cate-
ally, we released a reformatted GENIA 1.0 cor-
gories, while (Joshi et al., 2018) extend a neu-
pus augmented with automatically produced Uni-
ral phrase structure parser trained on web text to
versal Dependency annotations and recovered and
the biomedical domain with a small number of
aligned original abstract metadata.
partially annotated examples. These papers focus
mainly of the problem of domain adaptation it-
self, rather than the objective of obtaining a robust, References
high-performance parser using existing resources.
NLP techniques, and in particular, distant su- Waleed Ammar, Dirk Groeneveld, Chandra Bhagavat-
ula, Iz Beltagy, Miles Crawford, Doug Downey, Ja-
pervision have been employed to assist the cu- son Dunkelberger, Ahmed Elgohary, Sergey Feld-
ration of large, structured biomedical resources. man, Vu Ha, Rodney Kinney, Sebastian Kohlmeier,
(Poon et al., 2015) extract 1.5 million cancer path- Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew E.
Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Yoav Goldberg and Joakim Nivre. 2012. A dynamic
Wang, Chris Wilhelm, Zheng Yuan, Madeleine van oracle for arc-eager dependency parsing. In Coling
Zuylen, and Oren Etzioni. 2018. Construction of 2012.
the literature graph in semantic scholar. In NAACL-
HLT. Matthew Honnibal and Ines Montani. 2017. spacy 2:
Natural language understanding with bloom embed-
Anonymous. 2019. Anonymous. dings, convolutional neural networks and incremen-
tal parsing. To appear.
Alan R. Aronson. 2001. Effective mapping of biomedi-
Eduard H. Hovy, Mitchell P. Marcus, Martha Palmer,
cal text to the umls metathesaurus: the metamap pro-
Lance A. Ramshaw, and Ralph M. Weischedel.
gram. Proceedings. AMIA Symposium, pages 17–
2006. Ontonotes: The 90% solution. In HLT-
21.
NAACL.
Michael Bada, Miriam Eckert, Donald Evans, Kristin Vidur Joshi, Matthew Peters, and Mark Hopkins. 2018.
Garcia, Krista Shipley, Dmitry Sitnikov, William A. Extending a parser to distant domains using a few
Baumgartner, K. Bretonnel Cohen, Karin M. Ver- dozen partially annotated examples. In ACL.
spoor, Judith A. Blake, and Lawrence Hunter. 2011.
Concept annotation in the craft corpus. In BMC Jin-Dong Kim, Tomoko Ohta, Yuka Tateisi, and
Bioinformatics. Jun’ichi Tsujii. 2003. Genia corpus - a semantically
annotated corpus for bio-textmining. Bioinformat-
Olivier Bodenreider. 2004. The unified medical lan- ics, 19 Suppl 1:i180–2.
guage system (umls): integrating biomedical ter-
minology. Nucleic acids research, 32 Database Eliyahu Kiperwasser and Yoav Goldberg. 2016. Sim-
issue:D267–70. ple and accurate dependency parsing using bidirec-
tional lstm feature representations. Transactions
Lutz Bornmann and Rüdiger Mutz. 2014. Growth rates of the Association for Computational Linguistics,
of modern science: A bibliometric analysis. CoRR, 4:313–327.
abs/1402.4578.
Martin Krallinger, Florian Leitner, Obdulia Rabal,
Ekaterina Buyko, Joachim Wermter, Michael Poprat, Miguel Vazquez, Julen Oyarzábal, and Alfonso Va-
and Udo Hahn. Automatically adapting an nlp core lencia. 2015. Chemdner: The drugs and chemical
engine to the biology domain. In Proceedings of names extraction challenge. In J. Cheminformatics.
the ISMB 2006 Joint Linking Literature, Informa- Guillaume Lample, Miguel Ballesteros, Sandeep Sub-
tion and Knowledge for Biology and the 9th Bio- ramanian, Kazuya Kawakami, and Chris Dyer. 2016.
Ontologies Meeting. Neural architectures for named entity recognition.
In HLT-NAACL.
Nigel Collier and Jin-Dong Kim. 2004. Introduction
to the bio-entity recognition task at jnlpba. In NLP- Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sci-
BA/BioNLP. aky, Chih-Hsuan Wei, Robert Leaman, Allan Peter
Davis, Carolyn J. Mattingly, Thomas C. Wiegers,
Gamal K. O. Crichton, Sampo Pyysalo, Billy Chiu, and and Zhiyong Lu. 2016. Biocreative v cdr task cor-
Anna Korhonen. 2017. A neural network multi-task pus: a resource for chemical disease relation extrac-
learning approach to biomedical named entity recog- tion. Database : the journal of biological databases
nition. In BMC Bioinformatics. and curation, 2016.
Dina Demner-Fushman, Willie J. Rogers, and Alan R. David McClosky and Eugene Charniak. 2008. Self-
Aronson. 2017. Metamap lite: an evaluation of training for biomedical parsing. In ACL.
a new java implementation of metamap. Journal
of the American Medical Informatics Association : David McClosky, Eugene Charniak, and Mark John-
JAMIA, 24 4:841–844. son. 2006. Reranking and self-training for parser
adaptation. In ACL.
Rezarta Islamaj Dogan, Robert Leaman, and Zhiyong
Shikhar Murty, Patrick Verga, Luke Vilnis, Irena
Lu. 2014. Ncbi disease corpus: A resource for dis-
Radovanovic, and Andrew McCallum. 2018. Hi-
ease name recognition and concept normalization.
erarchical losses and new resources for fine-grained
Journal of biomedical informatics, 47:1–10.
entity typing and linking. In ACL.
Timothy Dozat and Christopher D. Manning. 2016. Dat Quoc Nguyen and Karin Verspoor. 2018. From
Deep biaffine attention for neural dependency pars- POS tagging to dependency parsing for biomedical
ing. CoRR, abs/1611.01734. event extraction. arXiv preprint arXiv:1808.03731.
Martin Gerner, Goran Nenadic, and Casey M. Hoifung Poon, Chris Quirk, Charlie DeZiel, and David
Bergman. 2009. Linnaeus: A species name iden- Heckerman. 2014. Literome: Pubmed-scale ge-
tification system for biomedical literature. In BMC nomic knowledge base in the cloud. Bioinformatics,
Bioinformatics. 30 19:2840–2.
Hoifung Poon, Kristina Toutanova, and Chris Quirk. Marco Antonio Valenzuela-Escarcega, Ozgun Babur,
2015. Distant supervision for cancer pathway ex- Gus Hahn-Powell, Dane Bell, Thomas Hicks, En-
traction from text. Pacific Symposium on Biocom- rique Noriega-Atala, Xia Wang, Mihai Surdeanu,
puting. Pacific Symposium on Biocomputing, pages Emek Demir, and Clayton T. Morrison. 2018.
120–31. Large-scale automated machine reading discovers
new cancer-driving mechanisms. In Database.
Sampo Pyysalo and Sophia Ananiadou. 2014.
Anatomical entity mention recognition at literature Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang,
scale. In Bioinformatics. Marinka Zitnik, Jingbo Shang, Curtis P. Langlotz,
and Jiawei Han. 2018. Cross-type biomedical
Sampo Pyysalo, Tomoko Ohta, Rafal Rak, Andrew named entity recognition with deep multi-task learn-
Rowley, Hong-Woo Chun, Sung-Jae Jung, Sung-Pil ing. Bioinformatics.
Choi, Jun’ichi Tsujii, and Sophia Ananiadou. 2015.
Overview of the cancer genetics and pathway cura- Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Al-
tion tasks of bionlp shared task 2013. In BMC Bioin- lan Peter Davis, Carolyn J. Mattingly, Jiao Li,
formatics. Thomas C. Wiegers, and Zhiyong Lu. 2016. Assess-
ing the state of the art in biomedical relation extrac-
Laura Rimell and Stephen Clark. 2008. Adapting a tion: overview of the biocreative v chemical-disease
lexicalized-grammar parser to contrasting domains. relation (cdr) task. Database : the journal of biolog-
In EMNLP. ical databases and curation, 2016.