
SpanBERT: Improving Pre-training by Representing and Predicting Spans

Mandar Joshi∗†   Danqi Chen∗‡§   Yinhan Liu§   Daniel S. Weld†   Luke Zettlemoyer†§   Omer Levy§

† Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
{mandar90,weld,lsz}@cs.washington.edu
‡ Computer Science Department, Princeton University, Princeton, NJ
danqic@cs.princeton.edu
Allen Institute of Artificial Intelligence, Seattle
{danw}@allenai.org
§ Facebook AI Research, Seattle
{danqi,yinhanliu,lsz,omerlevy}@fb.com

Abstract

We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE.¹

1 Introduction

Pre-training methods like BERT (Devlin et al., 2019) have shown strong performance gains using self-supervised training that masks individual words or subword units. However, many NLP tasks involve reasoning about relationships between two or more spans of text. For example, in extractive question answering (Rajpurkar et al., 2016), determining that the "Denver Broncos" is a type of "NFL team" is critical for answering the question "Which NFL team won Super Bowl 50?" Such spans provide a more challenging target for self-supervision tasks; for example, predicting "Denver Broncos" is much harder than predicting only "Denver" when you know the next word is "Broncos". In this paper, we introduce a span-level pretraining approach that consistently outperforms BERT, with the largest gains on span selection tasks such as question answering and coreference resolution.

We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our method differs from BERT in both the masking scheme and the training objectives. First, we mask random contiguous spans, rather than random individual tokens. Second, we introduce a novel span boundary objective (SBO) so the model learns to predict the entire masked span from the observed tokens at its boundary. Span-based masking forces the model to predict entire spans solely using the context in which they appear. Furthermore, the SBO encourages the model to store this span-level information at the boundary tokens, which can be easily accessed during the fine-tuning stage. Figure 1 illustrates our approach.

To implement SpanBERT, we build on a well-tuned replica of BERT, which itself substantially outperforms the original BERT. While building on our baseline, we find that pre-training on single segments, instead of two half-length segments with the next sentence prediction (NSP) objective, considerably improves performance on most downstream tasks. Therefore, we add our modifications on top of the tuned single-sequence BERT baseline.

∗ Equal contribution.
¹ Our code and pre-trained models are available at https://github.com/facebookresearch/SpanBERT.

Transactions of the Association for Computational Linguistics, vol. 8, pp. 64–77, 2020. https://doi.org/10.1162/tacl_a_00300
Action Editor: Radu Florian. Submission batch: 9/2019; Revision batch: 10/2019; Published 3/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: An illustration of SpanBERT training. The span an American football game is masked. The SBO uses the output representations of the boundary tokens, x4 and x9 (in blue), to predict each token in the masked span. The equation shows the MLM and SBO loss terms for predicting the token football (in pink), which, as marked by the position embedding p3, is the third token from x4.

Together, our pre-training process yields models that outperform all BERT baselines on a wide variety of tasks, and reach substantially better performance on span selection tasks in particular. Specifically, our method reaches 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 (Rajpurkar et al., 2016, 2018), respectively—reducing error by as much as 27% compared with our tuned BERT replica. We also observe similar gains on five additional extractive question answering benchmarks (NewsQA, TriviaQA, SearchQA, HotpotQA, and Natural Questions).²

SpanBERT also arrives at a new state of the art on the challenging CoNLL-2012 ("OntoNotes") shared task for document-level coreference resolution, where we reach 79.6% F1, exceeding the previous top model by 6.6% absolute. Finally, we demonstrate that SpanBERT also helps on tasks that do not explicitly involve span selection, and show that our approach even improves performance on TACRED (Zhang et al., 2017) and GLUE (Wang et al., 2019).

Whereas others show the benefits of adding more data (Yang et al., 2019) and increasing model size (Lample and Conneau, 2019), this work demonstrates the importance of designing good pre-training tasks and objectives, which can also have a remarkable impact.

² We use the modified MRQA version of these datasets. See more details in Section 4.1.

2 Background: BERT

BERT (Devlin et al., 2019) is a self-supervised approach for pre-training a deep transformer encoder (Vaswani et al., 2017), before fine-tuning it for a particular downstream task. BERT optimizes two training objectives—masked language model (MLM) and next sentence prediction (NSP)—which only require a large collection of unlabeled text.

Notation  Given a sequence of word or subword tokens X = (x1, x2, ..., xn), BERT trains an encoder that produces a contextualized vector representation for each token:

    enc(x1, x2, ..., xn) = x1, x2, ..., xn,

where (with a slight abuse of notation) xi on the right-hand side denotes the contextualized vector of token xi.

Masked Language Model  Also known as a cloze test, MLM is the task of predicting missing tokens in a sequence from their placeholders. Specifically, a subset of tokens Y ⊆ X is sampled and substituted with a different set of tokens. In BERT's implementation, Y accounts for 15% of the tokens in X; of those, 80% are replaced with [MASK], 10% are replaced with a random token (according to the unigram distribution), and 10% are kept unchanged. The task is to predict the original tokens in Y from the modified input. BERT selects each token in Y independently by randomly selecting a subset. In SpanBERT, we define Y by randomly selecting contiguous spans (Section 3.1).
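To make the 80%/10%/10% replacement rule concrete, here is a minimal Python sketch (our illustration, not BERT's actual implementation; the token list, vocabulary, and pre-selected positions are assumed inputs):

```python
import random

def corrupt(tokens, masked_positions, vocab, mask_token="[MASK]"):
    """Apply BERT-style corruption to the positions selected for prediction:
    80% become [MASK], 10% become a random vocabulary token, 10% are kept."""
    corrupted = list(tokens)
    for i in masked_positions:
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_token
        elif r < 0.9:
            # BERT samples from the unigram distribution; uniform here for brevity.
            corrupted[i] = random.choice(vocab)
        # else: leave the original token in place
    return corrupted
```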

Next Sentence Prediction  The NSP task takes two sequences (XA, XB) as input, and predicts whether XB is the direct continuation of XA. This is implemented in BERT by first reading XA from the corpus, and then (1) either reading XB from the point where XA ended, or (2) randomly sampling XB from a different point in the corpus. The two sequences are separated by a special [SEP] token. Additionally, a special [CLS] token is added to XA, XB to form the input, where the target of [CLS] is whether XB indeed follows XA in the corpus.

In summary, BERT optimizes the MLM and the NSP objectives by masking word pieces uniformly at random in data generated by the bi-sequence sampling procedure. In the next section, we will present our modifications to the data pipeline, masking, and pre-training objectives.

3 Model

We present SpanBERT, a self-supervised pre-training method designed to better represent and predict spans of text. Our approach is inspired by BERT (Devlin et al., 2019), but deviates from its bi-text classification framework in three ways. First, we use a different random process to mask spans of tokens, rather than individual ones. We also introduce a novel auxiliary objective—the SBO—which tries to predict the entire masked span using only the representations of the tokens at the span's boundary. Finally, SpanBERT samples a single contiguous segment of text for each training example (instead of two), and thus does not use BERT's next sentence prediction objective, which we omit.

3.1 Span Masking

Given a sequence of tokens X = (x1, x2, ..., xn), we select a subset of tokens Y ⊆ X by iteratively sampling spans of text until the masking budget (e.g., 15% of X) has been spent. At each iteration, we first sample a span length (number of words) from a geometric distribution ℓ ∼ Geo(p), which is skewed towards shorter spans. We then randomly (uniformly) select the starting point for the span to be masked. We always sample a sequence of complete words (instead of subword tokens) and the starting point must be the beginning of one word. Following preliminary trials,³ we set p = 0.2, and also clip ℓ at ℓmax = 10. This yields a mean span length of mean(ℓ) = 3.8. Figure 2 shows the distribution of span mask lengths.

³ We experimented with p = {0.1, 0.2, 0.4} and found 0.2 to perform the best.

Figure 2: We sample random span lengths from a geometric distribution ℓ ∼ Geo(p = 0.2) clipped at ℓmax = 10.

As in BERT, we also mask 15% of the tokens in total: replacing 80% of the masked tokens with [MASK], 10% with random tokens, and 10% with the original tokens. However, we perform this replacement at the span level and not for each token individually; that is, all the tokens in a span are replaced with [MASK] or sampled tokens.
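To make the sampling procedure above concrete, here is a small Python sketch (our illustration, not the released SpanBERT code). It draws span lengths from the clipped geometric distribution and marks whole-word spans until the masking budget is reached; for simplicity the budget here is counted in words, whereas the paper's 15% budget is over subword tokens.

```python
import numpy as np

def sample_masked_spans(num_words, budget_words, p=0.2, max_len=10, rng=None):
    """Pick whole-word spans to mask until the word budget is spent.
    Span lengths follow Geo(p=0.2); draws longer than max_len=10 are
    rejected and resampled, giving a mean span length of about 3.8."""
    rng = rng or np.random.default_rng()
    masked = set()
    while len(masked) < budget_words:
        length = rng.geometric(p)           # P(len=k) = (1-p)^(k-1) * p, k >= 1
        if length > max_len:                # clip by rejection sampling
            continue
        start = rng.integers(0, num_words)  # uniform start, at a word boundary
        masked.update(range(start, min(start + length, num_words)))
    return sorted(masked)

# Sanity check: the clipped geometric mean is ~3.8 words, as reported above.
draws = np.random.default_rng(0).geometric(0.2, size=200_000)
print(draws[draws <= 10].mean())            # prints approximately 3.8
```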

3.2 Span Boundary Objective

Span selection models (Lee et al., 2016, 2017; He et al., 2018) typically create a fixed-length representation of a span using its boundary tokens (start and end). To support such models, we would ideally like the representations for the end of the span to summarize as much of the internal span content as possible. We do so by introducing a span boundary objective that involves predicting each token of a masked span using only the representations of the observed tokens at the boundaries (Figure 1).

Formally, we denote the output of the transformer encoder for each token in the sequence by x1, ..., xn. Given a masked span of tokens (xs, ..., xe) ∈ Y, where (s, e) indicates its start and end positions, we represent each token xi in the span using the output encodings of the external boundary tokens xs−1 and xe+1, as well as the position embedding of the target token pi−s+1:

    yi = f(xs−1, xe+1, pi−s+1)

where position embeddings p1, p2, ... mark relative positions of the masked tokens with respect to the left boundary token xs−1. We implement the representation function f(·) as a 2-layer feed-forward network with GeLU activations (Hendrycks and Gimpel, 2016) and layer normalization (Ba et al., 2016):

    h0 = [xs−1; xe+1; pi−s+1]
    h1 = LayerNorm(GeLU(W1 h0))
    yi = LayerNorm(GeLU(W2 h1))

We then use the vector representation yi to predict the token xi and compute the cross-entropy loss exactly like the MLM objective.

SpanBERT sums the loss from both the span boundary and the regular masked language model objectives for each token xi in the masked span (xs, ..., xe), while reusing the input embedding (Press and Wolf, 2017) for the target tokens in both MLM and SBO:

    L(xi) = L_MLM(xi) + L_SBO(xi)
          = − log P(xi | xi) − log P(xi | yi)

where, in the conditioning terms, xi and yi denote the encoder output and the span boundary representation at position i, respectively.
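For readers who prefer code, the following PyTorch-style sketch is our paraphrase of the objective as described above, not the authors' implementation; the tensor shapes and the `position_emb` table size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanBoundaryObjective(nn.Module):
    """f(x_{s-1}, x_{e+1}, p_{i-s+1}): a 2-layer feed-forward net with GeLU
    and LayerNorm, followed by a vocabulary prediction head."""
    def __init__(self, hidden_size, vocab_size, max_span_len=10, pos_dim=200):
        super().__init__()
        self.position_emb = nn.Embedding(max_span_len, pos_dim)  # relative positions, 0-indexed here
        self.layer1 = nn.Linear(2 * hidden_size + pos_dim, hidden_size)
        self.layer2 = nn.Linear(hidden_size, hidden_size)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.vocab_proj = nn.Linear(hidden_size, vocab_size)  # tied to the input embeddings in the paper

    def forward(self, left_boundary, right_boundary, rel_positions, targets):
        # left/right_boundary: (num_masked, hidden); rel_positions, targets: (num_masked,)
        h0 = torch.cat([left_boundary, right_boundary,
                        self.position_emb(rel_positions)], dim=-1)
        h1 = self.norm1(F.gelu(self.layer1(h0)))
        y  = self.norm2(F.gelu(self.layer2(h1)))
        return F.cross_entropy(self.vocab_proj(y), targets)

# Total loss per masked token: L = L_MLM + L_SBO, i.e., the usual MLM
# cross-entropy from the encoder output plus the boundary-based loss above.
```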
3.3 Single-Sequence Training

As described in Section 2, BERT's examples contain two sequences of text (XA, XB), and an objective that trains the model to predict whether they are connected (NSP). We find that this setting is almost always worse than simply using a single sequence without the NSP objective (see Section 5 for further details). We conjecture that single-sequence training is superior to bi-sequence training with NSP because (a) the model benefits from longer full-length contexts, or (b) conditioning on, often unrelated, context from another document adds noise to the masked language model. Therefore, in our approach, we remove both the NSP objective and the two-segment sampling procedure, and simply sample a single contiguous segment of up to n = 512 tokens, rather than two half-segments that sum up to n tokens together.

In summary, SpanBERT pre-trains span representations by: (1) masking spans of full words using a geometric distribution based masking scheme (Section 3.1), and (2) optimizing an auxiliary span boundary objective (Section 3.2) in addition to MLM using a single-sequence data pipeline (Section 3.3). A procedural description can be found in Appendix A.
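A minimal sketch of the single-sequence data pipeline described above (our illustration, assuming the corpus has already been tokenized into a list of documents): it packs each document into contiguous blocks of at most 512 tokens and never mixes tokens from different documents in one block.

```python
def make_blocks(documents, max_len=512):
    """Split each tokenized document into contiguous blocks of up to
    max_len tokens; blocks never cross a document boundary."""
    blocks = []
    for doc in documents:                       # doc: list of subword tokens
        for start in range(0, len(doc), max_len):
            blocks.append(doc[start:start + max_len])
    return blocks
```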
4 Experimental Setup

4.1 Tasks

We evaluate on a comprehensive suite of tasks, including seven question answering tasks, coreference resolution, nine tasks in the GLUE benchmark (Wang et al., 2019), and relation extraction. We expect that the span selection tasks, question answering and coreference resolution, will particularly benefit from our span-based pre-training.

Extractive Question Answering  Given a short passage of text and a question as input, the task of extractive question answering is to select a contiguous span of text in the passage as the answer.

We first evaluate on SQuAD 1.1 and 2.0 (Rajpurkar et al., 2016, 2018), which have served as major question answering benchmarks, particularly for pre-trained models (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019). We also evaluate on five more datasets from the MRQA shared task (Fisch et al., 2019)⁴: NewsQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019). Because the MRQA shared task does not have a public test set, we split the development set in half to make new development and test sets. The datasets vary in both domain and collection methodology, making this collection a good test bed for evaluating whether our pre-trained models can generalize well across different data distributions.

⁴ https://github.com/mrqa/MRQA-Shared-Task-2019. MRQA changed the original datasets to unify them into the same format, e.g., all the contexts are truncated to a maximum of 800 tokens and only answerable questions are kept.

Following BERT (Devlin et al., 2019), we use the same QA model architecture for all the datasets. We first convert the passage P = (p1, p2, ..., pl) and question Q = (q1, q2, ..., ql′) into a single sequence X = [CLS] p1 p2 ... pl [SEP] q1 q2 ... ql′ [SEP], pass it to the pre-trained transformer encoder, and train two linear classifiers independently on top of it for predicting the answer span boundary (start and end). For the unanswerable questions in SQuAD 2.0, we simply set the answer span to be the special token [CLS] for both training and testing.
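The following PyTorch-style fragment is our sketch of the standard BERT-style QA head described above, not the released fine-tuning code; the encoder interface and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class QAHead(nn.Module):
    """Two independent linear classifiers over the encoder outputs,
    scoring every position as a potential answer start / end."""
    def __init__(self, hidden_size):
        super().__init__()
        self.start_classifier = nn.Linear(hidden_size, 1)
        self.end_classifier = nn.Linear(hidden_size, 1)

    def forward(self, encoder_outputs):            # (batch, seq_len, hidden)
        start_logits = self.start_classifier(encoder_outputs).squeeze(-1)
        end_logits = self.end_classifier(encoder_outputs).squeeze(-1)
        return start_logits, end_logits             # (batch, seq_len) each

# Input packing: [CLS] p1 ... pl [SEP] q1 ... ql' [SEP]; for unanswerable
# SQuAD 2.0 questions, both start and end targets point at [CLS].
```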
Coreference Resolution  Coreference resolution is the task of clustering mentions in text which refer to the same real-world entities. We evaluate on the CoNLL-2012 shared task (Pradhan et al., 2012) for document-level coreference resolution. We use the independent version of the Joshi et al. (2019b) implementation of the higher-order coreference model (Lee et al., 2018). The document is divided into non-overlapping segments of a pre-defined length.⁵ Each segment is encoded independently by the pre-trained transformer encoder, which replaces the original LSTM-based encoder. For each mention span x, the model learns a distribution P(·) over possible antecedent spans Y:

    P(y) = e^{s(x,y)} / Σ_{y′∈Y} e^{s(x,y′)}

The span pair scoring function s(x, y) is a feed-forward neural network over fixed-length span representations and hand-engineered features over x and y:

    s(x, y) = s_m(x) + s_m(y) + s_c(x, y)
    s_m(x) = FFNN_m(g_x)
    s_c(x, y) = FFNN_c(g_x, g_y, φ(x, y))

Here g_x and g_y denote the span representations, which are a concatenation of the two transformer output states of the span endpoints and an attention vector computed over the output representations of the tokens in the span. FFNN_m and FFNN_c represent two feedforward neural networks with one hidden layer, and φ(x, y) represents the hand-engineered features (e.g., speaker and genre information). A more detailed description of the model can be found in Joshi et al. (2019b).

⁵ The length was chosen from {128, 256, 384, 512}. See more details in Appendix B.
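As a rough illustration of the antecedent-scoring equations above (a simplified PyTorch sketch under our own assumptions about tensor shapes and hidden sizes, not the Joshi et al. (2019b) code), the mention scores and pair score are combined and normalized with a softmax over candidate antecedents:

```python
import torch
import torch.nn as nn

def ffnn(in_dim, hidden, out_dim=1):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class AntecedentScorer(nn.Module):
    """s(x, y) = s_m(x) + s_m(y) + s_c(x, y), turned into P(y) by a softmax
    over the candidate antecedents of mention x."""
    def __init__(self, span_dim, feat_dim, hidden=150):
        super().__init__()
        self.mention_ffnn = ffnn(span_dim, hidden)               # FFNN_m
        self.pair_ffnn = ffnn(2 * span_dim + feat_dim, hidden)   # FFNN_c

    def forward(self, g_x, g_y, phi):
        # g_x: (span_dim,); g_y: (num_candidates, span_dim); phi: (num_candidates, feat_dim)
        s_m_x = self.mention_ffnn(g_x)                           # (1,)
        s_m_y = self.mention_ffnn(g_y).squeeze(-1)               # (num_candidates,)
        pair_in = torch.cat([g_x.expand_as(g_y), g_y, phi], dim=-1)
        s_c = self.pair_ffnn(pair_in).squeeze(-1)                # (num_candidates,)
        scores = s_m_x + s_m_y + s_c
        return torch.softmax(scores, dim=-1)                     # P(y) over antecedents
```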
Relation Extraction  TACRED (Zhang et al., 2017) is a challenging relation extraction dataset. Given one sentence and two spans within it—subject and object—the task is to predict the relation between the spans from 42 pre-defined relation types, including no relation. We follow the entity masking schema from Zhang et al. (2017) and replace the subject and object entities by their NER tags such as "[CLS] [SUBJ-PER] was born in [OBJ-LOC], Michigan, ...", and finally add a linear classifier on top of the [CLS] token to predict the relation type.
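A small sketch of the entity-masking input construction described above (our illustration; the token layout follows the example in the text, while the exact helper interface and special-token strings are assumptions):

```python
def mask_entities(tokens, subj_span, obj_span, subj_type, obj_type):
    """Replace the subject and object spans with their NER-tag placeholders,
    e.g. [SUBJ-PER] and [OBJ-LOC], before prepending [CLS]."""
    (ss, se), (os_, oe) = subj_span, obj_span      # inclusive token spans
    out, i = [], 0
    while i < len(tokens):
        if i == ss:
            out.append(f"[SUBJ-{subj_type}]"); i = se + 1
        elif i == os_:
            out.append(f"[OBJ-{obj_type}]"); i = oe + 1
        else:
            out.append(tokens[i]); i += 1
    return ["[CLS]"] + out

# mask_entities("Bill Gates was born in Seattle , Washington".split(),
#               (0, 1), (5, 5), "PER", "LOC")
# -> ['[CLS]', '[SUBJ-PER]', 'was', 'born', 'in', '[OBJ-LOC]', ',', 'Washington']
```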
GLUE  The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) consists of 9 sentence-level classification tasks:

• Two sentence-level classification tasks including CoLA (Warstadt et al., 2018) for evaluating linguistic acceptability and SST-2 (Socher et al., 2013) for sentiment classification.

• Three sentence-pair similarity tasks including MRPC (Dolan and Brockett, 2005), a binary paraphrasing task over sentence pairs from news sources, STS-B (Cer et al., 2017), a graded similarity task for news headlines, and QQP,⁶ a binary paraphrasing task between Quora question pairs.

• Four natural language inference tasks including MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007), and WNLI (Levesque et al., 2011).

Unlike question answering, coreference resolution, and relation extraction, these sentence-level tasks do not require explicit modeling of span-level semantics. However, they might still benefit from implicit span-based reasoning (e.g., the Prime Minister is the head of the government). Following previous work (Devlin et al., 2019; Radford et al., 2018),⁷ we exclude WNLI from the results to enable a fair comparison. Although recent work (Liu et al., 2019a) has applied several task-specific strategies to increase performance on the individual GLUE tasks, we follow BERT's single-task setting and only add a linear classifier on top of the [CLS] token for these classification tasks.

⁶ https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
⁷ Previous work has excluded WNLI on account of construction issues outlined on the GLUE website: https://gluebenchmark.com/faq.

4.2 Implementation

We reimplemented BERT's model and pre-training method in fairseq (Ott et al., 2019).
We used the model configuration of BERT-large as in Devlin et al. (2019) and also pre-trained all our models on the same corpus: BooksCorpus and English Wikipedia, using cased Wordpiece tokens.

Compared with the original BERT implementation, the main differences in our implementation include: (a) We use different masks at each epoch, while BERT samples 10 different masks for each sequence during data processing. (b) We remove all the short-sequence strategies used before (they sampled shorter sequences with a small probability of 0.1; they also first pre-trained with a smaller sequence length of 128 for 90% of the steps). Instead, we always take sequences of up to 512 tokens until we reach a document boundary. We refer readers to Liu et al. (2019b) for further discussion on these modifications and their effects.

As in BERT, the learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed. We retain the β hyperparameters (β1 = 0.9, β2 = 0.999) and a decoupled weight decay (Loshchilov and Hutter, 2019) of 0.1. We also keep a dropout of 0.1 on all layers and attention weights, and a GeLU activation function (Hendrycks and Gimpel, 2016). We deviate from the optimization by running for 2.4M steps and using an epsilon of 1e-8 for AdamW (Kingma and Ba, 2015), which converges to a better set of model parameters. Our implementation uses a batch size of 256 sequences with a maximum of 512 tokens.⁸ For the SBO, we use 200-dimension position embeddings p1, p2, ... to mark positions relative to the left boundary token. The pre-training was done on 32 Volta V100 GPUs and took 15 days to complete. Fine-tuning is implemented based on HuggingFace's codebase (Wolf et al., 2019) and more details are given in Appendix B.

⁸ On average, this is approximately 390 sequences, because some documents have fewer than 512 tokens.
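The optimization recipe above maps onto a short PyTorch configuration like the following (a hedged sketch under the stated hyperparameters, not the fairseq configuration actually used):

```python
import torch

def build_optimizer(model, peak_lr=1e-4, warmup_steps=10_000, total_steps=2_400_000):
    """AdamW with decoupled weight decay, linear warmup to peak_lr over
    10k steps, then linear decay to zero over the remaining steps."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.999), eps=1e-8, weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```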
4.3 Baselines

We compare SpanBERT to three baselines:

Google BERT  The pre-trained models released by Devlin et al. (2019).⁹

Our BERT  Our reimplementation of BERT with improved data preprocessing and optimization (Section 4.2).

Our BERT-1seq  Our reimplementation of BERT trained on single full-length sequences without NSP (Section 3.3).

⁹ https://github.com/google-research/bert.

                 SQuAD 1.1        SQuAD 2.0
                 EM      F1       EM      F1
Human Perf.      82.3    91.2     86.8    89.4
Google BERT      84.3    91.3     80.0    83.3
Our BERT         86.5    92.6     82.8    85.9
Our BERT-1seq    87.5    93.3     83.8    86.6
SpanBERT         88.8    94.6     85.7    88.7

Table 1: Test results on SQuAD 1.1 and SQuAD 2.0.

5 Results

We compare SpanBERT to the baselines per task, and draw conclusions based on the overall trends.

5.1 Per-Task Results

Extractive Question Answering  Table 1 shows the performance on both SQuAD 1.1 and 2.0. SpanBERT exceeds our BERT baseline by 2.0% and 2.8% F1, respectively (3.3% and 5.4% over Google BERT). In SQuAD 1.1, this result accounts for over 27% error reduction, reaching 3.4% F1 above human performance.

Table 2 demonstrates that this trend goes beyond SQuAD, and is consistent in every MRQA dataset. On average, we see a 2.9% F1 improvement from our reimplementation of BERT. Although some gains are coming from single-sequence training (+1.1%), most of the improvement stems from span masking and the span boundary objective (+1.8%), with particularly large gains on TriviaQA (+3.2%) and HotpotQA (+2.7%).

Coreference Resolution  Table 3 shows the performance on the OntoNotes coreference resolution benchmark. Our BERT reimplementation improves the Google BERT model by 1.2% on the average F1 metric and single-sequence training brings another 0.5% gain. Finally, SpanBERT improves considerably on top of that, achieving a new state of the art of 79.6% F1 (previous best result is 73.0%).
                 NewsQA  TriviaQA  SearchQA  HotpotQA  Natural Questions  Avg.
Google BERT      68.8    77.5      81.7      78.3      79.9               77.3
Our BERT         71.0    79.0      81.8      80.5      80.5               78.6
Our BERT-1seq    71.9    80.4      84.0      80.3      81.8               79.7
SpanBERT         73.6    83.6      84.8      83.0      82.5               81.5

Table 2: Performance (F1) on the five MRQA extractive question answering tasks.

                               MUC                  B³                   CEAF_φ4
                          P     R     F1       P     R     F1       P     R     F1      Avg. F1
Prev. SotA (Lee et al., 2018)
                          81.4  79.5  80.4     72.2  69.5  70.8     68.2  67.1  67.6    73.0
Google BERT               84.9  82.5  83.7     76.7  74.2  75.4     74.6  70.1  72.3    77.1
Our BERT                  85.1  83.5  84.3     77.3  75.5  76.4     75.0  71.9  73.9    78.3
Our BERT-1seq             85.5  84.1  84.8     77.8  76.7  77.2     75.3  73.5  74.4    78.8
SpanBERT                  85.8  84.8  85.3     78.3  77.9  78.1     76.4  74.2  75.3    79.6

Table 3: Performance on the OntoNotes coreference resolution benchmark. The main evaluation is the average F1 of three metrics: MUC, B³, and CEAF_φ4 on the test set.

                               P      R      F1
BERT-EM (Soares et al., 2019)  −      −      70.1
BERT-EM + MTB∗                 −      −      71.5
Google BERT                    69.1   63.9   66.4
Our BERT                       67.8   67.2   67.5
Our BERT-1seq                  72.4   67.9   70.1
SpanBERT                       70.8   70.9   70.8

Table 4: Test performance on the TACRED relation extraction benchmark. BERT-large and BERT-EM + MTB from Soares et al. (2019) are the current state of the art. ∗: BERT-EM + MTB incorporated an intermediate "matching the blanks" pre-training on entity-linked text based on English Wikipedia, which is not a direct comparison to ours trained only from raw text.

Relation Extraction  Table 4 shows the performance on TACRED. SpanBERT exceeds our reimplementation of BERT by 3.3% F1 and achieves close to the current state of the art (Soares et al., 2019). Our model performs better than their BERT-EM but is 0.7 point behind BERT-EM + MTB, which used entity-linked text for additional pre-training. Most of this gain (+2.6%) stems from single-sequence training, although the contribution of span masking and the span boundary objective is still a considerable 0.7%, resulting largely from higher recall.

GLUE  Table 5 shows the performance on GLUE. For most tasks, the different models appear to perform similarly. Moving to single-sequence training without the NSP objective substantially improves CoLA, and yields smaller (but considerable) improvements on MRPC and MNLI. The main gains from SpanBERT are in the SQuAD-based QNLI dataset (+1.3%) and in RTE (+6.9%), the latter accounting for most of the rise in SpanBERT's GLUE average.

5.2 Overall Trends

We compared our approach to three BERT baselines on 17 benchmarks, and found that SpanBERT outperforms BERT on almost every task. In 14 tasks, SpanBERT performed better than all baselines. In two tasks (MRPC and QQP), it performed on par in terms of accuracy with single-sequence trained BERT, but still outperformed the other baselines. In one task (SST-2), Google's BERT baseline performed better than SpanBERT by 0.4% accuracy.

When considering the magnitude of the gains, it appears that SpanBERT is especially better at extractive question answering. In SQuAD 1.1, for example, we observe a solid gain of 2.0% F1 even though the baseline is already well above human performance. On MRQA, SpanBERT improves between 2.0% (Natural Questions) and 4.6% (TriviaQA) F1 on top of our BERT baseline.

               CoLA  SST-2  MRPC       STS-B      QQP        MNLI       QNLI  RTE   (Avg)
Google BERT    59.3  95.2   88.5/84.3  86.4/88.0  71.2/89.0  86.1/85.7  93.0  71.1  80.4
Our BERT       58.6  93.9   90.1/86.6  88.4/89.1  71.8/89.3  87.2/86.6  93.0  74.7  81.1
Our BERT-1seq  63.5  94.8   91.2/87.8  89.0/88.4  72.1/89.5  88.0/87.4  93.0  72.1  81.7
SpanBERT       64.3  94.8   90.9/87.9  89.9/89.1  71.9/89.5  88.1/87.7  94.3  79.0  82.8

Table 5: Test set performance on GLUE tasks. MRPC: F1/accuracy, STS-B: Pearson/Spearman correlation, QQP: F1/accuracy, MNLI: matched/mismatched accuracies, and accuracy for all the other tasks. WNLI (not shown) is always set to majority class (65.1% accuracy) and included in the average.

Finally, we observe that single-sequence training works considerably better than bi-sequence training with NSP with BERT's choice of sequence lengths for a wide variety of tasks. This is surprising because BERT's ablations showed gains from the NSP objective (Devlin et al., 2019). However, the ablation studies still involved bi-sequence data processing (i.e., the pre-training stage only controlled for the NSP objective while still sampling two half-length sequences). We hypothesize that bi-sequence training, as it is implemented in BERT (see Section 2), impedes the model from learning longer-range features, and consequently hurts performance on many downstream tasks.

6 Ablation Studies

We compare our random span masking scheme with linguistically-informed masking schemes, and find that masking random spans is a competitive and often better approach. We then study the impact of the SBO, and contrast it with BERT's NSP objective.¹⁰

6.1 Masking Schemes

Previous work (Sun et al., 2019) has shown improvements in downstream task performance by masking linguistically informed spans during pre-training for Chinese data. We compare our random span masking scheme with masking of linguistically informed spans. Specifically, we train the following five baseline models, differing only in the way tokens are masked.

Subword Tokens  We sample random Wordpiece tokens, as in the original BERT.

Whole Words  We sample random words, and then mask all of the subword tokens in those words. The total number of masked subtokens is around 15%.

Named Entities  50% of the time, we sample from named entities in the text, and sample random whole words the other 50% of the time. The total number of masked subtokens is 15%. Specifically, we run spaCy's named entity recognizer (Honnibal and Montani, 2017)¹¹ on the corpus and select all the non-numerical named entity mentions as candidates.

Noun Phrases  Similar to Named Entities, we sample from noun phrases 50% of the time. The noun phrases are extracted by running spaCy's constituency parser.

Geometric Spans  We sample random spans from a geometric distribution, as in our SpanBERT (see Section 3.1).

Table 6 shows how different pre-training masking schemes affect performance on the development set of a selection of tasks. All the models are evaluated on the development sets and are based on the default BERT setup of bi-sequence training with NSP; the results are not directly comparable to the main evaluation. With the exception of coreference resolution, masking random spans is preferable to other strategies. Although linguistic masking schemes (named entities and noun phrases) are often competitive with random spans, their performance is not consistent; for instance, masking noun phrases achieves parity with random spans on NewsQA, but underperforms on TriviaQA (−1.1% F1).

On coreference resolution, we see that masking random subword tokens is preferable to any form of span masking. Nevertheless, we shall see in the following experiment that combining random span masking with the span boundary objective can improve upon this result considerably.

¹⁰ To save time and resources, we use the checkpoints at 1.2M steps for all the ablation experiments.
¹¹ https://spacy.io/.

6.2 Auxiliary Objectives

In Section 5, we saw that bi-sequence training with the NSP objective can hurt performance on downstream tasks, when compared with single-sequence training. We test whether this holds true for models pre-trained with span masking, and also evaluate the effect of replacing the NSP objective with the SBO.

SQuAD 2.0 NewsQA TriviaQA Coreference MNLI-m QNLI GLUE (Avg)
Subword Tokens 83.8 72.0 76.3 77.7 86.7 92.5 83.2
Whole Words 84.3 72.8 77.1 76.6 86.3 92.8 82.9
Named Entities 84.8 72.7 78.7 75.6 86.0 93.1 83.2
Noun Phrases 85.0 73.0 77.7 76.7 86.5 93.2 83.5
Geometric Spans 85.4 73.0 78.8 76.4 87.0 93.3 83.4

Table 6: The effect of replacing BERT’s original masking scheme (Subword Tokens) with different
masking schemes. Results are F1 scores for QA tasks and accuracy for MNLI and QNLI on the
development sets. All the models are based on bi-sequence training with NSP.

                           SQuAD 2.0  NewsQA  TriviaQA  Coref  MNLI-m  QNLI  GLUE (Avg)
Span Masking (2seq) + NSP  85.4       73.0    78.8      76.4   87.0    93.3  83.4
Span Masking (1seq)        86.7       73.4    80.0      76.3   87.3    93.8  83.8
Span Masking (1seq) + SBO  86.8       74.1    80.3      79.0   87.6    93.9  84.0

Table 7: The effects of different auxiliary objectives, given MLM over random spans as the primary objective.

Table 7 confirms that single-sequence training typically improves performance. Adding SBO further improves performance, with a substantial gain on coreference resolution (+2.7% F1) over span masking alone. Unlike the NSP objective, SBO does not appear to have any adverse effects.

7 Related Work

Pre-trained contextualized word representations that can be trained from unlabeled text (Dai and Le, 2015; Melamud et al., 2016; Peters et al., 2018) have had immense impact on NLP lately, particularly as methods for initializing a large model before fine-tuning it for a specific task (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019). Beyond differences in model hyperparameters and corpora, these methods mainly differ in their pre-training tasks and loss functions, with a considerable amount of contemporary literature proposing augmentations of BERT's MLM objective.

While previous and concurrent work has looked at masking (Sun et al., 2019) or dropping (Song et al., 2019; Chan et al., 2019) multiple words from the input—particularly as pretraining for language generation tasks—SpanBERT pretrains span representations (Lee et al., 2016), which are widely used for question answering, coreference resolution, and a variety of other tasks. ERNIE (Sun et al., 2019) shows improvements on Chinese NLP tasks using phrase and named entity masking. MASS (Song et al., 2019) focuses on language generation tasks, and adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence. We attempt to more explicitly model spans using the SBO objective, and show that (geometrically distributed) random span masking works as well as, and sometimes better than, masking linguistically coherent spans. We evaluate on English benchmarks for question answering, relation extraction, and coreference resolution, in addition to GLUE.

A different ERNIE (Zhang et al., 2019) focuses on integrating structured knowledge bases with contextualized representations, with an eye on knowledge-driven tasks like entity typing and relation classification. UNILM (Dong et al., 2019) uses multiple language modeling objectives—unidirectional (both left-to-right and right-to-left), bidirectional, and sequence-to-sequence prediction—to aid generation tasks like summarization and question generation. XLM (Lample and Conneau, 2019) explores cross-lingual pre-training for multilingual tasks such as translation and cross-lingual classification. KERMIT (Chan et al., 2019), an insertion-based approach, fills in missing tokens (instead of predicting masked ones) during pre-training; they show improvements on machine translation and zero-shot question answering.

Concurrent with our work, RoBERTa (Liu et al., 2019b) presents a replication study of BERT pre-training that measures the impact of many key hyperparameters and training data size. Also concurrent, XLNet (Yang et al., 2019) combines an autoregressive loss and the Transformer-XL (Dai et al., 2019) architecture with a more than eight-fold increase in data to achieve current state-of-the-art results on multiple benchmarks. XLNet also masks spans (of 1–5 tokens) during pre-training, but predicts them autoregressively. Our model focuses on incorporating span-based pre-training, and as a side effect, we present a stronger BERT baseline while controlling for the corpus, architecture, and the number of parameters.

Related to our SBO objective, pair2vec (Joshi et al., 2019a) encodes word-pair relations using a negative sampling-based multivariate objective during pre-training. Later, the word-pair representations are injected into the attention layer of downstream tasks, and thus encode limited downstream context. Unlike pair2vec, our SBO objective yields "pair" (start and end tokens of spans) representations which more fully encode the context during both pre-training and fine-tuning, and are thus more appropriately viewed as span representations. Stern et al. (2018) focus on improving language generation speed using a block-wise parallel decoding scheme; they make predictions for multiple time steps in parallel and then back off to the longest prefix validated by a scoring model. Also related are sentence representation methods (Kiros et al., 2015; Logeswaran and Lee, 2018), which focus on predicting surrounding contexts from sentence embeddings.

8 Conclusion

We presented a new method for span-based pre-training which extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. Together, our pre-training process yields models that outperform all BERT baselines on a variety of tasks, and reach substantially better performance on span selection tasks in particular.

Appendices

A Pre-training Procedure

We describe our pre-training procedure as follows:

1. Divide the corpus into single contiguous blocks of up to 512 tokens.

2. At each step of pre-training:

   (a) Sample a batch of blocks uniformly at random.

   (b) Mask 15% of word pieces in each block in the batch using the span masking scheme (Section 3.1).

   (c) For each masked token xi, optimize L(xi) = L_MLM(xi) + L_SBO(xi) (Section 3.2).

B Fine-tuning Hyperparameters

We apply the following fine-tuning hyperparameters to all methods, including the baselines.

Extractive Question Answering  For all the question answering tasks, we use max_seq_length = 512 and a sliding window of size 128 if the lengths are longer than 512. We choose learning rates from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5} and batch sizes from {16, 32}, and fine-tune four epochs for all the datasets.

Coreference Resolution  We divide the documents into multiple chunks of lengths up to max_seq_length and encode each chunk independently. We choose max_seq_length from {128, 256, 384, 512}, BERT learning rates from {1e-5, 2e-5}, task-specific learning rates from {1e-4, 2e-4, 3e-4}, and fine-tune 20 epochs for all the datasets. We use batch size = 1 (one document) for all the experiments.

TACRED/GLUE  We use max_seq_length = 128, choose learning rates from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5} and batch sizes from {16, 32}, and fine-tune 10 epochs for all the datasets. The only exception is CoLA, where we used four epochs (following Devlin et al., 2019), because 10 epochs lead to severe overfitting.

Acknowledgments

We would like to thank Pranav Rajpurkar and Robin Jia for patiently helping us evaluate SpanBERT on SQuAD. We thank the anonymous reviewers, the action editor, and our colleagues at Facebook AI Research and the University of Washington for their insightful feedback that helped improve the paper.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In International Workshop on Semantic Evaluation (SemEval), pages 1–14. Vancouver, Canada.

William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit. 2019. KERMIT: Generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (NIPS), pages 3079–3087.

Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Association for Computational Linguistics (ACL).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems (NIPS).

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9.

Luheng He, Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2018. Jointly predicting predicates and arguments in neural semantic role labeling. In Association for Computational Linguistics (ACL), pages 364–369.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
Mandar Joshi, Eunsol Choi, Omer Levy, Daniel Weld, and Luke Zettlemoyer. 2019a. pair2vec: Compositional word-pair embeddings for cross-sentence inference. In North American Association for Computational Linguistics (NAACL), pages 3597–3608.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Association for Computational Linguistics (ACL), pages 1601–1611.

Mandar Joshi, Omer Levy, Daniel S. Weld, and Luke Zettlemoyer. 2019b. BERT for coreference resolution: Baselines and analysis. In Empirical Methods in Natural Language Processing (EMNLP).

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems (NIPS).

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics (TACL).

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems (NIPS).

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Empirical Methods in Natural Language Processing (EMNLP), pages 188–197.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In North American Association for Computational Linguistics (NAACL), pages 687–692.

Kenton Lee, Shimi Salant, Tom Kwiatkowski, Ankur Parikh, Dipanjan Das, and Jonathan Berant. 2016. Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436.

Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL).

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR).

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Computational Natural Language Learning (CoNLL), pages 51–61.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In North American Association for Computational Linguistics (NAACL), pages 48–53.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL), pages 2227–2237.
Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, pages 1–40.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163. Association for Computational Linguistics (ACL).

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. OpenAI.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL), pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392.

Livio Baldini Soares, Nicholas Arthur FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Association for Computational Linguistics (ACL), pages 2895–2905.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning (ICML), pages 5926–5936.

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems (NIPS).

Yu Stephanie Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xinlun Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In 2nd Workshop on Representation Learning for NLP, pages 191–200.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS).

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In North American Association for Computational Linguistics (NAACL), pages 1112–1122.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems (NeurIPS).
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Empirical Methods in Natural Language Processing (EMNLP), pages 2369–2380.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Empirical Methods in Natural Language Processing (EMNLP), pages 35–45.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Association for Computational Linguistics (ACL), pages 1441–1451.
