
Construction of the Literature Graph in Semantic Scholar

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford,
Doug Downey,* Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha,
Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi,
Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm,
Zheng Yuan,* Madeleine van Zuylen, and Oren Etzioni

waleeda@allenai.org

Allen Institute for Artificial Intelligence, Seattle WA 98103, USA
* Northwestern University, Evanston IL 60208, USA

Abstract

We describe a deployed scalable system for organizing published scientific literature into a heterogeneous graph to facilitate algorithmic manipulation and discovery. The resulting literature graph consists of more than 280M nodes, representing papers, authors, entities and various interactions between them (e.g., authorships, citations, entity mentions). We reduce literature graph construction into familiar NLP tasks (e.g., entity extraction and linking), point out research challenges due to differences from standard formulations of these tasks, and report empirical results for each task. The methods described in this paper are used to enable semantic features in www.semanticscholar.org.

[Figure 1: Part of the literature graph.]
1 Introduction

The goal of this work is to facilitate algorithmic discovery in the scientific literature. Despite notable advances in scientific search engines, data mining and digital libraries (e.g., Wu et al., 2014), researchers remain unable to answer simple questions such as:

• What is the percentage of female subjects in depression clinical trials?
• Which of my co-authors published one or more papers on coreference resolution?
• Which papers discuss the effects of Ranibizumab on the Retina?

In this paper, we focus on the problem of extracting structured data from scientific documents, which can later be used in natural language interfaces (e.g., Iyer et al., 2017) or to improve ranking of results in academic search (e.g., Xiong et al., 2017). We describe methods used in a scalable deployed production system for extracting structured information from scientific documents into the literature graph (see Fig. 1). The literature graph is a directed property graph which summarizes key information in the literature and can be used to answer the queries mentioned earlier as well as more complex queries. For example, in order to compute the Erdős number of an author X, the graph can be queried to find the number of nodes on the shortest undirected path between author X and Paul Erdős such that all edges on the path are labeled “authored”.
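To make the Erdős-number example concrete, here is a minimal sketch of how such a query could be answered over a small in-memory snapshot of the graph using the networkx library. The node identifiers and toy edges are hypothetical, and the deployed system uses its own graph infrastructure rather than this library.

```python
import networkx as nx

# Hypothetical in-memory snapshot of a few literature-graph nodes and edges.
G = nx.Graph()
G.add_edge("author:paul_erdos", "paper:p1", label="authored")
G.add_edge("author:alice", "paper:p1", label="authored")
G.add_edge("author:alice", "paper:p2", label="authored")
G.add_edge("author:x", "paper:p2", label="authored")

def erdos_number(graph, author_id, target="author:paul_erdos"):
    """Half the number of 'authored' edges on the shortest undirected
    author-paper path, i.e. the number of co-authorship hops."""
    authored_edges = [
        (u, v) for u, v, d in graph.edges(data=True) if d["label"] == "authored"
    ]
    authored_only = graph.edge_subgraph(authored_edges)
    try:
        path = nx.shortest_path(authored_only, author_id, target)
    except (nx.NodeNotFound, nx.NetworkXNoPath):
        return None
    return (len(path) - 1) // 2  # author -> paper -> author counts as one hop

print(erdos_number(G, "author:x"))  # 2
```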

We reduce literature graph construction into familiar NLP tasks such as sequence labeling, entity linking and relation extraction, and address some of the impractical assumptions commonly made in the standard formulations of these tasks. For example, most research on named entity recognition tasks report results on large labeled datasets such as CoNLL-2003 and ACE-2005 (e.g., Lample et al., 2016), and assume that entity types in the test set match those labeled in the training set (including work on domain adaptation, e.g., Daumé, 2007). These assumptions, while useful for developing and benchmarking new methods, are unrealistic for many domains and applications. The paper also serves as an overview of the approach we adopt at www.semanticscholar.org in a step towards more intelligent academic search engines (Etzioni, 2011).

In the next section, we start by describing our symbolic representation of the literature. Then, we discuss how we extract metadata associated with a paper such as authors and references, then how we extract the entities mentioned in paper text. Before we conclude, we briefly describe other research challenges we are actively working on in order to improve the quality of the literature graph.

2 Structure of The Literature Graph

The literature graph is a property graph with directed edges. Unlike Resource Description Framework (RDF) graphs, nodes and edges in property graphs have an internal structure which is more suitable for representing complex data types such as papers and entities. In this section, we describe the attributes associated with nodes and edges of different types in the literature graph.

2.1 Node Types

Papers. We obtain metadata and PDF files of papers via partnerships with publishers (e.g., Springer, Nature), catalogs (e.g., DBLP, MEDLINE), pre-publishing services (e.g., arXiv, bioRxiv), as well as web-crawling. Paper nodes are associated with a set of attributes such as ‘title’, ‘abstract’, ‘full text’, ‘venues’ and ‘publication year’. While some of the paper sources provide these attributes as metadata, it is often necessary to extract them from the paper PDF (details in §3). We deterministically remove duplicate papers based on string similarity of their metadata, resulting in 37M unique paper nodes. Papers in the literature graph cover a variety of scientific disciplines, including computer science, molecular biology, microbiology and neuroscience.

Authors. Each node of this type represents a unique author, with attributes such as ‘first name’ and ‘last name’. The literature graph has 12M nodes of this type.

Entities. Each node of this type represents a unique scientific concept discussed in the literature, with attributes such as ‘canonical name’, ‘aliases’ and ‘description’. Our literature graph has 0.4M nodes of this type. We describe how we populate entity nodes in §4.3.

Entity mentions. Each node of this type represents a textual reference of an entity in one of the papers, with attributes such as ‘mention text’, ‘context’, and ‘confidence’. We describe how we populate the 237M mentions in the literature graph in §4.1.

2.2 Edge Types

Citations. We instantiate a directed citation edge from paper nodes $p_1 \rightarrow p_2$ for each $p_2$ referenced in $p_1$. Citation edges have attributes such as ‘from paper id’, ‘to paper id’ and ‘contexts’ (the textual contexts where $p_2$ is referenced in $p_1$). While some of the paper sources provide these attributes as metadata, it is often necessary to extract them from the paper PDF as detailed in §3.

Authorship. We instantiate a directed authorship edge between an author node and a paper node $a \rightarrow p$ for each author of that paper.

Entity linking edges. We instantiate a directed edge from an extracted entity mention node to the entity it refers to.

Mention–mention relations. We instantiate a directed edge between a pair of mentions in the same sentential context if the textual relation extraction model predicts one of a predefined list of relation types between them in a sentential context.1 We encode a symmetric relation between $m_1$ and $m_2$ as two directed edges $m_1 \rightarrow m_2$ and $m_2 \rightarrow m_1$.

1 Due to space constraints, we opted not to discuss our relation extraction models in this draft.

Entity–entity relations. While mention–mention edges represent relations between mentions in a particular context, entity–entity edges represent relations between abstract entities. These relations may be imported from an existing knowledge base (KB) or inferred from other edges in the graph.
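The node and edge types described in this section can be pictured as typed records carrying attribute maps. The following dataclass sketch is purely illustrative of that property-graph structure; the class names, field names and edge-type strings are our assumptions for this example, not the production schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class Node:
    node_id: str
    node_type: str               # e.g. 'paper', 'author', 'entity', 'mention'
    attributes: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Edge:
    source: str                  # node_id of the source node
    target: str                  # node_id of the target node
    edge_type: str               # e.g. 'cites', 'authored', 'links_to', 'related_to'
    attributes: Dict[str, Any] = field(default_factory=dict)

# A citation edge carrying its textual contexts, as described in Section 2.2.
citation = Edge(
    source="paper:p1",
    target="paper:p2",
    edge_type="cites",
    attributes={"contexts": ["... as shown in previous work ..."]},
)
```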

3 Extracting Metadata

In the previous section, we described the overall structure of the literature graph. Next, we discuss how we populate paper nodes, author nodes, authorship edges, and citation edges.

Although some publishers provide sufficient metadata about their papers, many papers are provided with incomplete metadata. Also, papers obtained via web-crawling are not associated with any metadata. To fill in this gap, we built the ScienceParse system to predict structured data from the raw PDFs using recurrent neural networks (RNNs).2 For each paper, the system extracts the paper title, list of authors, and list of references; each reference consists of a title, a list of authors, a venue, and a year.

2 The ScienceParse libraries can be found at http://allenai.org/software/.

Field                 Precision  Recall  F1
title                 85.5       85.5    85.5
authors               92.1       92.1    92.1
bibliography titles   89.3       89.4    89.3
bibliography authors  97.1       97.0    97.0
bibliography venues   91.7       89.7    90.7
bibliography years    98.0       98.0    98.0

Table 1: Results of the ScienceParse system.
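Concretely, the extraction target for each PDF is a nested record containing the fields evaluated in Table 1. The sketch below illustrates that output shape only; the class and field names are hypothetical and do not correspond to the actual ScienceParse API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Reference:
    title: str
    authors: List[str]
    venue: Optional[str] = None
    year: Optional[int] = None

@dataclass
class ParsedPaper:
    title: str
    authors: List[str]
    references: List[Reference] = field(default_factory=list)

paper = ParsedPaper(
    title="Construction of the Literature Graph in Semantic Scholar",
    authors=["Waleed Ammar", "Dirk Groeneveld"],
    references=[Reference("Identifying meaningful citations",
                          ["Marco Valenzuela", "Vu Ha", "Oren Etzioni"],
                          venue="AAAI Workshop", year=2015)],
)
```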
Preparing the input layer. We split each PDF into individual pages, and feed each page to Apache’s PDFBox library3 to convert it into a sequence of tokens, where each token has features, e.g., ‘text’, ‘font size’, ‘space width’, ‘position on the page’.

3 https://pdfbox.apache.org

We normalize the token-level features before feeding them as inputs to the model. For each of the ‘font size’ and ‘space width’ features, we compute three normalized values (with respect to the current page, the current document, and the whole training corpus), each value ranging from -0.5 to +0.5. The token’s ‘position on the page’ is given in XY coordinate points. We scale the values linearly to range from (-0.5, -0.5) at the top-left corner of the page to (0.5, 0.5) at the bottom-right corner.

In order to capture case information, we add seven numeric features to the input representation of each token: whether the first/second letter is uppercase/lowercase, the fraction of uppercase/lowercase letters and the fraction of digits.

To help the model make correct predictions for metadata which tend to appear at the beginning (e.g., titles and authors) or at the end of papers (e.g., references), we provide the current page number as two discrete variables (relative to the beginning and end of the PDF file) with values 0, 1 and 2+. These features are repeated for each token on the same page.

For the k-th token in the sequence, we compute the input representation $i_k$ by concatenating the numeric features, an embedding of the ‘font size’, and the word embedding of the lowercased token. Word embeddings are initialized with GloVe (Pennington et al., 2014).

Model. The input token representations are passed through one fully-connected layer and then fed into a two-layer bidirectional LSTM (Long Short-Term Memory, Hochreiter and Schmidhuber, 1997), i.e.,

$\overrightarrow{g}_k = \mathrm{LSTM}(W i_k, \overrightarrow{g}_{k-1}), \quad g_k = [\overrightarrow{g}_k; \overleftarrow{g}_k],$
$\overrightarrow{h}_k = \mathrm{LSTM}(g_k, \overrightarrow{h}_{k-1}), \quad h_k = [\overrightarrow{h}_k; \overleftarrow{h}_k],$

where $W$ is a weight matrix, and $\overleftarrow{g}_k$ and $\overleftarrow{h}_k$ are defined similarly to $\overrightarrow{g}_k$ and $\overrightarrow{h}_k$ but process token sequences in the opposite direction.

Following Collobert et al. (2011), we feed the output of the second layer $h_k$ into a dense layer to predict unnormalized label weights for each token and learn label bigram feature weights (often described as a conditional random field layer when used in neural architectures) to account for dependencies between labels.

Training. The ScienceParse system is trained on a snapshot of the data at PubMed Central. It consists of 1.4M PDFs and their associated metadata, which specify the correct titles, authors, and bibliographies. We use a heuristic labeling process that finds the strings from the metadata in the tokenized PDFs to produce labeled tokens. This labeling process succeeds for 76% of the documents. The remaining documents are not used in the training process. During training, we only use pages which have at least one token with a label that is not “none”.

Decoding. At test time, we use Viterbi decoding to find the most likely global sequence, with no further constraints. To get the title, we use the longest continuous sequence of tokens with the “title” label. Since there can be multiple authors, we use all continuous sequences of tokens with the “author” label as authors, but require that all authors of a paper are mentioned on the same page. If the author labels are predicted in multiple pages, we use the one with the largest number of authors.
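The decoding heuristics above (take the longest contiguous run of “title” tokens, collect all “author” runs, and keep only the page with the most author runs) can be approximated with simple post-processing over the predicted label sequence. This is a sketch of the described behaviour under assumed label strings, not the deployed implementation.

```python
from itertools import groupby
from typing import List, Tuple

def contiguous_spans(labels: List[str], target: str) -> List[Tuple[int, int]]:
    """Return [start, end) spans of maximal runs carrying the target label."""
    spans, i = [], 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        if label == target:
            spans.append((i, i + length))
        i += length
    return spans

def extract_title_and_authors(tokens, labels, pages):
    """tokens, labels and pages are parallel per-token lists; pages[i] is the
    page index of token i."""
    # Title: the longest contiguous run of tokens labeled 'title'.
    title_spans = contiguous_spans(labels, "title")
    title = ""
    if title_spans:
        s, e = max(title_spans, key=lambda span: span[1] - span[0])
        title = " ".join(tokens[s:e])

    # Authors: all contiguous 'author' runs, restricted to the single page
    # that contains the largest number of author runs.
    authors_by_page = {}
    for s, e in contiguous_spans(labels, "author"):
        authors_by_page.setdefault(pages[s], []).append(" ".join(tokens[s:e]))
    authors = []
    if authors_by_page:
        best_page = max(authors_by_page, key=lambda p: len(authors_by_page[p]))
        authors = authors_by_page[best_page]
    return title, authors
```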

Results. We run our final tests on a held-out set from PubMed Central, consisting of about 54K documents. The results are detailed in Table 1. We use a conservative evaluation where an instance is correct if it exactly matches the gold annotation, with no credit for partial matching.

To give an example of the type of errors our model makes, consider the paper (Wang et al., 2013) titled “Clinical review: Efficacy of antimicrobial-impregnated catheters in external ventricular drainage - a systematic review and meta-analysis.” The title we extract for this paper omits the first part “Clinical review:”. This is likely to be a result of the pattern “Foo: Bar Baz” appearing in many training examples with only “Bar Baz” labeled as the title.

4 Entity Extraction and Linking

In the previous section, we described how we populate the backbone of the literature graph, i.e., paper nodes, author nodes and citation edges. Next, we discuss how we populate mentions and entities in the literature graph using entity extraction and linking on the paper text. In order to focus on more salient entities in a given paper, we only use the title and abstract.

4.1 Approaches

We experiment with three approaches for entity extraction and linking:

I. Statistical: uses one or more statistical models for predicting mention spans, then uses another statistical model to link mentions to candidate entities in a KB.

II. Hybrid: defines a small number of hand-engineered, deterministic rules for string-based matching of the input text to candidate entities in the KB, then uses a statistical model to disambiguate the mentions.4

III. Off-the-shelf: uses existing libraries, namely (Ferragina and Scaiella, 2010, TagMe)5 and (Demner-Fushman et al., 2017, MetaMap Lite)6, with minimal post-processing to extract and link entities to the KB.

4 We also experimented with a “pure” rules-based approach which disambiguates deterministically, but the hybrid approach consistently gave better results.
5 The TagMe APIs are described at https://sobigdata.d4science.org/web/tagme/tagme-help
6 We use v3.4 (L0) of MetaMap Lite, available at https://metamap.nlm.nih.gov/MetaMapLite.shtml

Approach       CS prec.  CS yield  Bio prec.  Bio yield
Statistical    98.4      712       94.4       928
Hybrid         91.5      1990      92.1       3126
Off-the-shelf  97.4      873       77.5       1206

Table 2: Document-level evaluation of three approaches in two scientific areas: computer science (CS) and biomedical (Bio).

We evaluate the performance of each approach in two broad scientific areas: computer science (CS) and biomedical research (Bio). For each unique (paper ID, entity ID) pair predicted by one of the approaches, we ask human annotators to label each mention extracted for this entity in the paper. We use CrowdFlower to manage human annotations and only include instances where three or more annotators agree on the label. If one or more of the entity mentions in that paper is judged to be correct, the pair (paper ID, entity ID) counts as one correct instance. Otherwise, it counts as an incorrect instance. We report ‘yield’ in lieu of ‘recall’ due to the difficulty of doing a scalable comprehensive annotation.

Table 2 shows the results based on 500 papers using v1.1.2 of our entity extraction and linking components. In both domains, the statistical approach gives the highest precision and the lowest yield. The hybrid approach consistently gives the highest yield, but sacrifices precision. The TagMe off-the-shelf library used for the CS domain gives surprisingly good results, with precision within 1 point of the statistical models. However, the MetaMap Lite off-the-shelf library we used for the biomedical domain suffered a huge loss in precision. Our error analysis showed that each of the approaches is able to predict entities not predicted by the other approaches, so we decided to pool their outputs in our deployed system, which gives significantly higher yield than any individual approach while maintaining reasonably high precision.

4.2 Entity Extraction Models

Given the token sequence $t_1, \ldots, t_N$ in a sentence, we need to identify spans which correspond to entity mentions. We use the BILOU scheme to encode labels at the token level. Unlike most formulations of named entity recognition problems (NER), we do not identify the entity type (e.g., protein, drug, chemical, disease) for each mention since the output mentions are further grounded in a KB with further information about the entity (including its type), using an entity linking module.
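As a concrete illustration of the BILOU scheme with untyped mentions (a sketch with assumed tag strings, not our preprocessing code), a multi-token mention is tagged B, I, ..., L, a single-token mention is tagged U, and every other token is tagged O:

```python
def bilou_tags(num_tokens: int, mention_spans):
    """Encode mention spans ([start, end) token offsets) as BILOU tags."""
    tags = ["O"] * num_tokens
    for start, end in mention_spans:
        if end - start == 1:
            tags[start] = "U"          # Unit-length mention
        else:
            tags[start] = "B"          # Beginning
            for i in range(start + 1, end - 1):
                tags[i] = "I"          # Inside
            tags[end - 1] = "L"        # Last
    return tags

tokens = ["We", "train", "a", "convolutional", "neural", "network", "on", "SemEval"]
print(bilou_tags(len(tokens), [(3, 6), (7, 8)]))
# ['O', 'O', 'O', 'B', 'I', 'L', 'O', 'U']
```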

Model. First, we construct the token embedding $x_k = [c_k; w_k]$ for each token $t_k$ in the input sequence, where $c_k$ is a character-based representation computed using a convolutional neural network (CNN) with a filter of size 3 characters, and $w_k$ are learned word embeddings initialized with the GloVe embeddings (Pennington et al., 2014).

We also compute context-sensitive word embeddings, denoted as $lm_k = [\overrightarrow{lm}_k; \overleftarrow{lm}_k]$, by concatenating the projected outputs of forward and backward recurrent neural network language models (RNN-LM) at position k. The language model (LM) for each direction is trained independently and consists of a single-layer long short-term memory (LSTM) network followed by a linear projection layer. While training the LM parameters, $\overrightarrow{lm}_k$ is used to predict $t_{k+1}$ and $\overleftarrow{lm}_k$ is used to predict $t_{k-1}$. We fix the LM parameters during training of the entity extraction model. See Peters et al. (2017) and Ammar et al. (2017) for more details.

Given the $x_k$ and $lm_k$ embeddings for each token $k \in \{1, \ldots, N\}$, we use a two-layer bidirectional LSTM to encode the sequence with $x_k$ and $lm_k$ feeding into the first and second layer, respectively. That is,

$\overrightarrow{g}_k = \mathrm{LSTM}(x_k, \overrightarrow{g}_{k-1}), \quad g_k = [\overrightarrow{g}_k; \overleftarrow{g}_k],$
$\overrightarrow{h}_k = \mathrm{LSTM}([g_k; lm_k], \overrightarrow{h}_{k-1}), \quad h_k = [\overrightarrow{h}_k; \overleftarrow{h}_k],$

where $\overleftarrow{g}_k$ and $\overleftarrow{h}_k$ are defined similarly to $\overrightarrow{g}_k$ and $\overrightarrow{h}_k$ but process token sequences in the opposite direction.

Similar to the model described in §3, we feed the output of the second LSTM into a dense layer to predict unnormalized label weights for each token and learn label bigram feature weights to account for dependencies between labels.

Description                  F1
Without LM                   49.9
With LM                      54.1
Avg. of 15 models with LM    55.2

Table 3: Results of the entity extraction model on the development set of SemEval-2017 task 10.

Results. We use the standard data splits of the SemEval-2017 Task 10 on entity (and relation) extraction from scientific papers (Augenstein et al., 2017). Table 3 compares three variants of our entity extraction model. The first line omits the LM embeddings $lm_k$, while the second line is the full model (including LM embeddings), showing a large improvement of 4.2 F1 points. The third line shows that creating an ensemble of 15 models further improves the results by 1.1 F1 points.

Model instances. In the deployed system, we use three instances of the entity extraction model with a similar architecture, but trained on different datasets. Two instances are trained on the BC5CDR (Li et al., 2016) and the CHEMDNER datasets (Krallinger et al., 2015) to extract key entity mentions in the biomedical domain such as diseases, drugs and chemical compounds. The third instance is trained on mention labels induced from Wikipedia articles in the computer science domain. The outputs of all model instances are pooled together and combined with the rule-based entity extraction module, then fed into the entity linking model (described below).

4.3 Knowledge Bases

In this section, we describe the construction of entity nodes and entity–entity edges. Unlike other knowledge extraction systems such as the Never-Ending Language Learner (NELL)7 and OpenIE 4,8 we use existing knowledge bases (KBs) of entities to reduce the burden of identifying coherent concepts. Grounding the entity mentions in a manually-curated KB also increases user confidence in automated predictions. We use two KBs:

UMLS: The UMLS metathesaurus integrates information about concepts in specialized ontologies in several biomedical domains, and is funded by the U.S. National Library of Medicine.

DBpedia: DBpedia provides access to structured information in Wikipedia. Rather than including all Wikipedia pages, we used a short list of Wikipedia categories about CS and included all pages up to depth four in their trees in order to exclude irrelevant entities, e.g., “Lord of the Rings” in DBpedia.

7 http://rtw.ml.cmu.edu/rtw/
8 https://github.com/allenai/openie-standalone

4.4 Entity Linking Models

Given a text span s identified by the entity extraction model in §4.2 (or with heuristics) and a reference KB, the goal of the entity linking model is to associate the span with the entity it refers to.

A span and its surrounding words are collectively referred to as a mention. We first identify a set of candidate entities that a given mention may refer to. Then, we rank the candidate entities based on a score computed using a neural model trained on labeled data.

For example, given the string “. . . database of facts, an ILP system will . . . ”, the entity extraction model identifies the span “ILP” as a possible entity and the entity linking model associates it with “Inductive_Logic_Programming” as the referent entity (from among other candidates like “Integer_Linear_Programming” or “Instruction-level_Parallelism”).

Datasets. We used two datasets: i) a biomedical dataset formed by combining MSH (Jimeno-Yepes et al., 2011) and BC5CDR (Li et al., 2016) with UMLS as the reference KB, and ii) a CS dataset we curated using Wikipedia articles about CS concepts with DBpedia as the reference KB.

Candidate selection. In a preprocessing step, we build an index which maps any token used in a labeled mention or an entity name in the KB to associated entity IDs, along with the frequency this token is associated with that entity. This is similar to the index used in previous entity linking systems (e.g., Bhagavatula et al., 2015) to estimate the probability that a given mention refers to an entity. At train and test time, we use this index to find candidate entities for a given mention by looking up the tokens in the mention. This method also serves as our baseline in Table 4 by selecting the entity with the highest frequency for a given mention.

Scoring candidates. Given a mention (m) and a candidate entity (e), the neural model constructs a vector encoding of the mention and the entity. We encode the mention and entity using the functions f and g, respectively, as follows:

$f(m) = [v_{m.name}; \mathrm{avg}(v_{m.lc}, v_{m.rc})],$
$g(e) = [v_{e.name}; v_{e.def}],$

where m.surface, m.lc and m.rc are the mention’s surface form, left and right contexts, and e.name and e.def are the candidate entity’s name and definition, respectively. $v_{\text{text}}$ is a bag-of-words sum encoder for text. We use the same encoder for the mention surface form and the candidate name, and another encoder for the mention contexts and entity definition.

Additionally, we include numerical features to estimate the confidence of a candidate entity based on the statistics collected in the index described earlier. We compute two scores based on the word overlap of (i) the mention’s context and the candidate’s definition and (ii) the mention’s surface span and the candidate entity’s name. Finally, we feed the concatenation of the cosine similarity between $f(m)$ and $g(e)$ and the intersection-based scores into an affine transformation followed by a sigmoid non-linearity to compute the final score for the pair (m, e).

Model     CS    Bio
Baseline  84.2  54.2
Neural    84.6  85.8

Table 4: The Bag of Concepts F1 score of the baseline and neural model on the two curated datasets.

Results. We use the Bag of Concepts F1 metric (Ling et al., 2015) for comparison. Table 4 compares the performance of the most-frequent-entity baseline and our neural model described above.

5 Other Research Problems

In the previous sections, we discussed how we construct the main components of the literature graph. In this section, we briefly describe several other related challenges we are actively working on.

Author disambiguation. Despite initiatives to establish global author IDs such as ORCID and ResearcherID, most publishers provide author information as names (e.g., arXiv). However, author names cannot be used as a unique identifier since several people often share the same name. Moreover, different venues and sources use different conventions in reporting the author names, e.g., “first initial, last name” vs. “last name, first name”. Inspired by Culotta et al. (2007), we train a supervised binary classifier for merging pairs of author instances and use it to incrementally create author clusters. We only consider merging two author instances if they have the same last name and share the first initial. If the first name is spelled out (rather than abbreviated) in both author instances, we also require that the first name matches.
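The blocking rule described above (same last name, same first initial, and an exact first-name match whenever both instances spell the first name out) can be sketched as a simple candidate filter. The record fields are assumed, and the decision to actually merge a candidate pair is still left to the learned classifier.

```python
def merge_candidates(a, b):
    """Return True if two author records may be merged, per the rule above.
    Each record is a dict with 'first' and 'last' name strings."""
    first_a, last_a = a["first"].strip().lower(), a["last"].strip().lower()
    first_b, last_b = b["first"].strip().lower(), b["last"].strip().lower()
    if last_a != last_b:
        return False
    if not first_a or not first_b or first_a[0] != first_b[0]:
        return False

    def spelled_out(name):
        return len(name.rstrip(".")) > 1   # treat "J." or "J" as an initial

    if spelled_out(first_a) and spelled_out(first_b):
        return first_a.rstrip(".") == first_b.rstrip(".")
    return True  # shared initial is enough; the classifier decides the merge

print(merge_candidates({"first": "J.", "last": "Smith"},
                       {"first": "Jane", "last": "Smith"}))   # True
print(merge_candidates({"first": "Jane", "last": "Smith"},
                       {"first": "Joan", "last": "Smith"}))   # False
```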

Ontology matching. Popular concepts are often represented in multiple KBs. For example, the concept of “artificial neural networks” is represented as entity ID D016571 in the MESH ontology, and represented as page ID ‘21523’ in DBpedia. Ontology matching is the problem of identifying semantically-equivalent entities across KBs or ontologies.9

9 Variants of this problem are also known as deduplication or record linkage.

Limited KB coverage. The convenience of grounding entities in a hand-curated KB comes at the cost of limited coverage. Introduction of new concepts and relations in the scientific literature occurs at a faster pace than KB curation, resulting in a large gap in KB coverage of scientific concepts. In order to close this gap, we need to develop models which can predict textual relations as well as detailed concept descriptions in scientific papers. For the same reasons, we also need to augment the relations imported from the KB with relations extracted from text. Our approach to address both entity and relation coverage is based on distant supervision (Mintz et al., 2009). In short, we train two models for identifying entity definitions and relations expressed in natural language in scientific documents, and automatically generate labeled data for training these models using known definitions and relations in the KB.

We note that the literature graph currently lacks coverage for important entity types (e.g., affiliations) and domains (e.g., physics). Covering affiliations requires small modifications to the metadata extraction model followed by an algorithm for matching author names with their affiliations. In order to cover additional scientific domains, more agreements need to be signed with publishers.

Figure and table extraction. Non-textual components such as charts, diagrams and tables provide key information in many scientific documents, but the lack of large labeled datasets has impeded the development of data-driven methods for scientific figure extraction. In Siegel et al. (2018), we induced high-quality training labels for the task of figure extraction in a large number of scientific documents, with no human intervention. To accomplish this we leveraged the auxiliary data provided in two large web collections of scientific documents (arXiv and PubMed) to locate figures and their associated captions in the rasterized PDF. We use the resulting dataset to train a deep neural network for end-to-end figure detection, yielding a model that can be more easily extended to new domains compared to previous work.

Understanding and predicting citations. The citation edges in the literature graph provide a wealth of information (e.g., at what rate a paper is being cited and whether it is accelerating), and open the door for further research to better understand and predict citations. For example, in order to allow users to better understand what impact a paper had and effectively navigate its citations, we experimented with methods for classifying a citation as important or incidental, as well as more fine-grained classes (Valenzuela et al., 2015). The citation information also enables us to develop models for estimating the potential of a paper or an author. In Weihs and Etzioni (2017), we predict citation-based metrics such as an author’s h-index and the citation rate of a paper in the future. Also related is the problem of predicting which papers should be cited in a given draft (Bhagavatula et al., 2018), which can help improve the quality of a paper draft before it is submitted for peer review, or be used to supplement the list of references after a paper is published.

6 Conclusion and Future Work

In this paper, we discuss the construction of a graph, providing a symbolic representation of the scientific literature. We describe deployed models for identifying authors, references and entities in the paper text, and provide experimental results to evaluate the performance of each model.

Three research directions follow from this work and other similar projects, e.g., Hahn-Powell et al. (2017); Wu et al. (2014): i) improving quality and enriching content of the literature graph (e.g., ontology matching and knowledge base population); ii) aggregating domain-specific extractions across many papers to enable a better understanding of the literature as a whole (e.g., identifying demographic biases in clinical trial participants and summarizing empirical results on important tasks); and iii) exploring the literature via natural language interfaces.

In order to help future research efforts, we make the following resources publicly available: metadata for over 20 million papers,10 the meaningful citations dataset,11 models for figure and table extraction,12 models for predicting citations in a paper draft,13 and models for extracting paper metadata,14 among other resources.15

10 http://labs.semanticscholar.org/corpus/
11 http://allenai.org/data.html
12 https://github.com/allenai/deepfigures-open
13 https://github.com/allenai/citeomatic
14 https://github.com/allenai/science-parse
15 http://allenai.org/software/

References

Waleed Ammar, Matthew E. Peters, Chandra Bhagavatula, and Russell Power. 2017. The AI2 system at SemEval-2017 Task 10 (ScienceIE): Semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).

Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew D. McCallum. 2017. SemEval 2017 Task 10 (ScienceIE): Extracting keyphrases and relations from scientific publications. In ACL workshop (SemEval).

Chandra Bhagavatula, Sergey Feldman, Russell Power, and Waleed Ammar. 2018. Content-based citation recommendation. In NAACL.

Chandra Bhagavatula, Thanapon Noraset, and Doug Downey. 2015. TabEL: Entity linking in web tables. In ISWC.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. In JMLR.

Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, and Andrew D. McCallum. 2007. Author disambiguation using error-driven machine learning with a ranking loss function. In IIWeb Workshop.

Hal Daumé. 2007. Frustratingly easy domain adaptation. In ACL.

Dina Demner-Fushman, Willie J. Rogers, and Alan R. Aronson. 2017. MetaMap Lite: An evaluation of a new Java implementation of MetaMap. In JAMIA.

Oren Etzioni. 2011. Search needs a shake-up. Nature 476(7358):25–26.

Paolo Ferragina and Ugo Scaiella. 2010. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In CIKM.

Gus Hahn-Powell, Marco Antonio Valenzuela-Escarcega, and Mihai Surdeanu. 2017. Swanson linking revisited: Accelerating literature-based discovery across domains using a conceptual influence graph. In ACL.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke S. Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. In ACL.

Antonio J. Jimeno-Yepes, Bridget T. McInnes, and Alan R. Aronson. 2011. Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation. BMC Bioinformatics 12(1):223.

Martin Krallinger, Florian Leitner, Obdulia Rabal, Miguel Vazquez, Julen Oyarzabal, and Alfonso Valencia. 2015. CHEMDNER: The drugs and chemical names extraction challenge. In J. Cheminformatics.

Guillaume Lample, Miguel Ballesteros, Sandeep K. Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In HLT-NAACL.

Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: A resource for chemical disease relation extraction. Database: The Journal of Biological Databases and Curation 2016.

Xiao Ling, Sameer Singh, and Daniel S. Weld. 2015. Design challenges for entity linking. Transactions of the Association for Computational Linguistics 3:315–328.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.

Noah Siegel, Nicholas Lourie, Russell Power, and Waleed Ammar. 2018. Extracting scientific figures with distantly supervised neural networks. In JCDL.

Marco Valenzuela, Vu Ha, and Oren Etzioni. 2015. Identifying meaningful citations. In AAAI Workshop (Scholarly Big Data).

Xiang Wang, Yan Dong, Xiang qian Qi, Yi-Ming Li, Cheng-Guang Huang, and Lijun Hou. 2013. Clinical review: Efficacy of antimicrobial-impregnated catheters in external ventricular drainage - a systematic review and meta-analysis. In Critical Care.

Luca Weihs and Oren Etzioni. 2017. Learning to predict citation-based impact measures. In JCDL.

Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea, Alexander Ororbia, Douglas Jordan, and C. Lee Giles. 2014. CiteSeerX: AI in a digital library search engine. In AAAI.

Chenyan Xiong, Russell Power, and Jamie Callan. 2017. Explicit semantic ranking for academic search via knowledge graph embedding. In WWW.

