† Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
{mandar90,weld,lsz}@cs.washington.edu
‡ Computer Science Department, Princeton University, Princeton, NJ
danqic@cs.princeton.edu
Allen Institute for Artificial Intelligence, Seattle
{danw}@allenai.org
Transactions of the Association for Computational Linguistics, vol. 8, pp. 64–77, 2020. https://doi.org/10.1162/tacl_a_00300
Action Editor: Radu Florian. Submission batch: 9/2019; Revision batch: 10/2019; Published 3/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: An illustration of SpanBERT training. The span an American football game is masked. The SBO uses the output representations of the boundary tokens, x_4 and x_9 (in blue), to predict each token in the masked span. The equation shows the MLM and SBO loss terms for predicting the token football (in pink), which, as marked by the position embedding p_3, is the third token from x_4.
considerably improves performance on most downstream tasks. Therefore, we add our modifications on top of the tuned single-sequence BERT baseline.

Together, our pre-training process yields models that outperform all BERT baselines on a wide variety of tasks, and reach substantially better performance on span selection tasks in particular. Specifically, our method reaches 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 (Rajpurkar et al., 2016, 2018), respectively—reducing error by as much as 27% compared with our tuned BERT replica. We also observe similar gains on five additional extractive question answering benchmarks (NewsQA, TriviaQA, SearchQA, HotpotQA, and Natural Questions).²

SpanBERT also arrives at a new state of the art on the challenging CoNLL-2012 ("OntoNotes") shared task for document-level coreference resolution, where we reach 79.6% F1, exceeding the previous top model by 6.6% absolute. Finally, we demonstrate that SpanBERT also helps on tasks that do not explicitly involve span selection, and show that our approach even improves performance on TACRED (Zhang et al., 2017) and GLUE (Wang et al., 2019).

Whereas others show the benefits of adding more data (Yang et al., 2019) and increasing model size (Lample and Conneau, 2019), this work demonstrates the importance of designing good pre-training tasks and objectives, which can also have a remarkable impact.

² We use the modified MRQA version of these datasets. See more details in Section 4.1.

2 Background: BERT

BERT (Devlin et al., 2019) is a self-supervised approach for pre-training a deep transformer encoder (Vaswani et al., 2017), before fine-tuning it for a particular downstream task. BERT optimizes two training objectives—masked language model (MLM) and next sentence prediction (NSP)—which only require a large collection of unlabeled text.

Notation  Given a sequence of word or subword tokens X = (x_1, x_2, ..., x_n), BERT trains an encoder that produces a contextualized vector representation for each token:

$$\mathrm{enc}(x_1, x_2, \ldots, x_n) = \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$$

Masked Language Model  Also known as a cloze test, MLM is the task of predicting missing tokens in a sequence from their placeholders. Specifically, a subset of tokens Y ⊆ X is sampled and substituted with a different set of tokens. In BERT's implementation, Y accounts for 15% of the tokens in X; of those, 80% are replaced with [MASK], 10% are replaced with a random token (according to the unigram distribution), and 10% are kept unchanged. The task is to predict the original tokens in Y from the modified input. BERT selects each token in Y independently by randomly selecting a subset. In SpanBERT, we define Y by randomly selecting contiguous spans (Section 3.1).
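To make the 80/10/10 recipe concrete, here is a minimal sketch of BERT-style token corruption. It is an illustration, not the authors' code; `mask_id`, `vocab_size`, and the uniform sampling of the random replacement token are simplifying assumptions (the paper uses the unigram distribution).

```python
import random

def bert_style_corrupt(tokens, mask_id, vocab_size, mask_budget=0.15):
    """Corrupt a token-id sequence following the MLM recipe described above:
    ~15% of positions are selected; of those, 80% become [MASK], 10% become a
    random token, and 10% are left unchanged. The model must recover the
    original token at every selected position."""
    corrupted = list(tokens)
    targets = {}  # position -> original token id to predict
    n_selected = max(1, round(mask_budget * len(tokens)))
    for pos in random.sample(range(len(tokens)), n_selected):
        targets[pos] = tokens[pos]
        r = random.random()
        if r < 0.8:
            corrupted[pos] = mask_id                       # replace with [MASK]
        elif r < 0.9:
            # the paper samples this token from the unigram distribution;
            # uniform sampling is used here only to keep the sketch short
            corrupted[pos] = random.randrange(vocab_size)
        # else: keep the original token unchanged
    return corrupted, targets
```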
Next Sentence Prediction  The NSP task takes two sequences (X_A, X_B) as input, and predicts whether X_B is the direct continuation of X_A. This is implemented in BERT by first reading X_A from the corpus, and then (1) either reading X_B from the point where X_A ended, or (2) randomly sampling X_B from a different point in the corpus. The two sequences are separated by a special [SEP] token. Additionally, a special [CLS] token is added to X_A, X_B to form the input, where the target of [CLS] is whether X_B indeed follows X_A in the corpus.
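A hedged sketch of how such NSP training pairs could be assembled from a list of corpus segments follows. The 50/50 split between true continuations and random segments is an assumption (the excerpt does not state the ratio), and the helper is illustrative rather than BERT's actual data pipeline.

```python
import random

def make_nsp_example(segments, cls_token="[CLS]", sep_token="[SEP]"):
    """Build one NSP example from a list of corpus segments (each a list of tokens).

    X_A is read from the corpus; X_B is either its true continuation (label 1)
    or a segment sampled from elsewhere in the corpus (label 0)."""
    a_idx = random.randrange(len(segments) - 1)
    x_a = segments[a_idx]
    if random.random() < 0.5:                      # assumed 50/50 split
        x_b, is_next = segments[a_idx + 1], 1      # true continuation
    else:
        x_b, is_next = random.choice(segments), 0  # random segment (may rarely
                                                   # coincide with the continuation)
    tokens = [cls_token] + x_a + [sep_token] + x_b + [sep_token]
    return tokens, is_next
```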
In summary, BERT optimizes the MLM and the NSP objectives by masking word pieces uniformly at random in data generated by the bi-sequence sampling procedure described above.

Figure 2: We sample random span lengths from a geometric distribution ℓ ∼ Geo(p = 0.2), clipped at ℓ_max = 10.
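The following sketch implements the sampling the Figure 2 caption describes, assuming p = 0.2 and a maximum span length of 10:

```python
import numpy as np

_rng = np.random.default_rng()

def sample_span_length(p=0.2, max_len=10):
    """Sample a span length l ~ Geo(p) (support starts at 1), clipped at max_len.
    Shorter spans are therefore more likely, and no span exceeds max_len tokens."""
    return min(int(_rng.geometric(p)), max_len)
```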
where position embeddings p_1, p_2, ... mark relative positions of the masked tokens with respect to the left boundary token x_{s-1}. We implement the representation function f(·) as a 2-layer feed-forward network with GeLU activations (Hendrycks and Gimpel, 2016) and layer normalization (Ba et al., 2016):

$$\mathbf{h}_0 = [\mathbf{x}_{s-1}; \mathbf{x}_{e+1}; \mathbf{p}_{i-s+1}]$$
$$\mathbf{h}_1 = \mathrm{LayerNorm}(\mathrm{GeLU}(W_1 \mathbf{h}_0))$$
$$\mathbf{y}_i = \mathrm{LayerNorm}(\mathrm{GeLU}(W_2 \mathbf{h}_1))$$

We then use the vector representation y_i to predict the token x_i and compute the cross-entropy loss.

… to MLM using a single-sequence data pipeline (Section 3.3). A procedural description can be found in Appendix A.

4 Experimental Setup

4.1 Tasks

We evaluate on a comprehensive suite of tasks, including seven question answering tasks, coreference resolution, nine tasks in the GLUE benchmark (Wang et al., 2019), and relation extraction. We expect that the span selection tasks, question answering and coreference resolution, will particularly benefit from our span-based pre-training.
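As a concrete reading of the span boundary representation function defined by the equations in Section 3.2 above, here is a minimal PyTorch sketch. It is not the released SpanBERT implementation; the hidden size and the maximum relative position are assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanBoundaryHead(nn.Module):
    """f(x_{s-1}, x_{e+1}, p_{i-s+1}): a 2-layer feed-forward network
    with GeLU activations and layer normalization, as in the equations above."""

    def __init__(self, hidden_size=768, max_rel_pos=512):
        super().__init__()
        self.pos_emb = nn.Embedding(max_rel_pos, hidden_size)  # relative positions p_1, p_2, ...
        self.w1 = nn.Linear(3 * hidden_size, hidden_size)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.w2 = nn.Linear(hidden_size, hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x_left, x_right, rel_pos):
        # h0 = [x_{s-1}; x_{e+1}; p_{i-s+1}]
        h0 = torch.cat([x_left, x_right, self.pos_emb(rel_pos)], dim=-1)
        # h1 = LayerNorm(GeLU(W1 h0));  y_i = LayerNorm(GeLU(W2 h1))
        h1 = self.norm1(F.gelu(self.w1(h0)))
        y_i = self.norm2(F.gelu(self.w2(h1)))
        return y_i  # scored against the vocabulary to predict x_i with a cross-entropy loss
```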
… answer span to be the special token [CLS] for both training and testing.

Coreference Resolution  Coreference resolution is the task of clustering mentions in text which refer to the same real-world entities. We evaluate on the CoNLL-2012 shared task (Pradhan et al., 2012) for document-level coreference resolution. We use the independent version of the Joshi et al. (2019b) implementation of the higher-order coreference model (Lee et al., 2018). The document is divided into non-overlapping segments of a pre-defined length.⁵ Each segment is encoded independently by the pre-trained transformer …

… was born in [OBJ-LOC], Michigan, ...", and finally add a linear classifier on top of the [CLS] token to predict the relation type.

GLUE  The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) consists of 9 sentence-level classification tasks:

• Two sentence-level classification tasks including CoLA (Warstadt et al., 2018) for evaluating linguistic acceptability and SST-2 (Socher et al., 2013) for sentiment classification.
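The relation-extraction fragment above describes marking entities with their NER tags and classifying the relation from the final [CLS] representation. A hedged sketch of such a fine-tuning head is below; the encoder interface, the hidden size, and the number of relation types are assumptions rather than details taken from the excerpt.

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Linear classifier over the [CLS] representation of an encoded example,
    e.g. "[CLS] [SUBJ-PER] was born in [OBJ-LOC] , Michigan , ... [SEP]"."""

    def __init__(self, encoder, hidden_size=768, num_relations=42):
        super().__init__()
        self.encoder = encoder                      # any pre-trained transformer encoder
        self.classifier = nn.Linear(hidden_size, num_relations)

    def forward(self, input_ids, attention_mask):
        # assumes the encoder returns the sequence of hidden states first
        hidden = self.encoder(input_ids, attention_mask=attention_mask)[0]
        cls_repr = hidden[:, 0]                     # representation of the [CLS] token
        return self.classifier(cls_repr)            # logits over relation types
```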
We used the model configuration of BERT_large as in Devlin et al. (2019) and also pre-trained all our models on the same corpus: BooksCorpus and English Wikipedia using cased Wordpiece tokens.

Compared with the original BERT implementation, the main differences in our implementation include: (a) We use different masks at each epoch while BERT samples 10 different masks for each sequence during data processing. (b) We remove all the short-sequence strategies used before (they sampled shorter sequences with a small probability 0.1; they also first pre-trained with smaller sequence length of 128 for 90% of the steps).

                 SQuAD 1.1        SQuAD 2.0
                 EM      F1       EM      F1
Human Perf.      82.3    91.2     86.8    89.4
Google BERT      84.3    91.3     80.0    83.3
Our BERT         86.5    92.6     82.8    85.9
Our BERT-1seq    87.5    93.3     83.8    86.6
SpanBERT         88.8    94.6     85.7    88.7

Table 1: Test results on SQuAD 1.1 and SQuAD 2.0.

Our BERT-1seq  Our reimplementation of BERT trained on single full-length sequences without NSP (Section 3.3).
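Difference (a) above, re-sampling the mask every time a sequence is visited rather than fixing 10 masks during data processing, can be realized by applying the corruption inside the dataset's item lookup. A small sketch under that assumption:

```python
import torch
from torch.utils.data import Dataset

class DynamicMaskingDataset(Dataset):
    """Applies a fresh corruption each time a sequence is fetched, so every epoch
    sees a different mask (unlike masks precomputed once during preprocessing)."""

    def __init__(self, sequences, corrupt_fn):
        self.sequences = sequences    # list of token-id lists
        self.corrupt_fn = corrupt_fn  # e.g. a closure over the corruption sketch earlier

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        corrupted, targets = self.corrupt_fn(self.sequences[idx])
        return torch.tensor(corrupted), targets
```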
                 NewsQA   TriviaQA   SearchQA   HotpotQA   Natural Questions   Avg.
Google BERT      68.8     77.5       81.7       78.3       79.9                77.3
Our BERT         71.0     79.0       81.8       80.5       80.5                78.6
Our BERT-1seq    71.9     80.4       84.0       80.3       81.8                79.7
SpanBERT         73.6     83.6       84.8       83.0       82.5                81.5

Table 2: Performance (F1) on the five MRQA extractive question answering tasks.
                     MUC                  B³                   CEAF_φ4
                     P     R     F1       P     R     F1       P     R     F1       Avg. F1
Prev. SotA
(Lee et al., 2018)   81.4  79.5  80.4     72.2  69.5  70.8     68.2  67.1  67.6     73.0

Table 3: Performance on the OntoNotes coreference resolution benchmark. The main evaluation is the average F1 of three metrics: MUC, B³, and CEAF_φ4 on the test set.
                 CoLA   SST-2   MRPC        STS-B       QQP         MNLI        QNLI   RTE    (Avg)
Google BERT      59.3   95.2    88.5/84.3   86.4/88.0   71.2/89.0   86.1/85.7   93.0   71.1   80.4
Our BERT         58.6   93.9    90.1/86.6   88.4/89.1   71.8/89.3   87.2/86.6   93.0   74.7   81.1
Our BERT-1seq    63.5   94.8    91.2/87.8   89.0/88.4   72.1/89.5   88.0/87.4   93.0   72.1   81.7
SpanBERT         64.3   94.8    90.9/87.9   89.9/89.1   71.9/89.5   88.1/87.7   94.3   79.0   82.8

Table 5: Test set performance on GLUE tasks. MRPC: F1/accuracy, STS-B: Pearson/Spearman correlation, QQP: F1/accuracy, MNLI: matched/mismatched accuracies, and accuracy for all the other tasks. WNLI (not shown) is always set to majority class (65.1% accuracy) and included in the average.
… training with NSP with BERT's choice of sequence lengths for a wide variety of tasks. This is surprising because BERT's ablations showed gains from the NSP objective.

6 Ablation Studies

We compare our random span masking scheme with linguistically-informed masking schemes, and find that masking random spans is a competitive and often better approach. We then study the impact of the SBO, and contrast it with BERT's NSP objective.¹⁰

6.1 Masking Schemes

Previous work (Sun et al., 2019) has shown improvements in downstream task performance by masking linguistically informed spans during pre-training for Chinese data. We compare our random span masking scheme with masking of linguistically informed spans. Specifically, we train the following five baseline models differing only in the way tokens are masked.

Subword Tokens  We sample random Wordpiece tokens, as in the original BERT.

Whole Words  We sample random words, and then mask all of the subword tokens in those words. The total number of masked subtokens is around 15%.

Named Entities  At 50% of the time, we sample from named entities in the text, and sample random whole words for the other 50%. The total number of masked subtokens is around 15%. (A sketch of how such candidate spans can be collected appears at the end of this subsection.)

Table 6 shows how different pre-training masking schemes affect performance on the development set of a selection of tasks. All the models are evaluated on the development sets and are based on the default BERT setup of bi-sequence training with NSP; the results are not directly comparable to the main evaluation. With the exception of coreference resolution, masking random spans is preferable to other strategies. Although linguistic masking schemes (named entities and noun phrases) are often competitive with random spans, their performance is not consistent; for instance, masking noun phrases achieves parity with random spans on NewsQA, but underperforms on TriviaQA (−1.1% F1).

On coreference resolution, we see that masking random subword tokens is preferable to any form of span masking. Nevertheless, we shall see in the following experiment that combining random span masking with the span boundary objective can improve upon this result considerably.

¹⁰ To save time and resources, we use the checkpoints at 1.2M steps for all the ablation experiments.
¹¹ https://spacy.io/.
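As referenced in the baseline descriptions above, candidate spans for the Named Entities and Noun Phrases baselines could be collected with spaCy (footnote 11). This sketch is purely illustrative and assumes the `en_core_web_sm` pipeline, which is not named in the excerpt.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline with NER and a parser

def candidate_spans(text):
    """Return character-offset spans of named entities and noun phrases,
    from which the corresponding masking baselines sample whole-word spans."""
    doc = nlp(text)
    entities = [(ent.start_char, ent.end_char) for ent in doc.ents]
    noun_phrases = [(chunk.start_char, chunk.end_char) for chunk in doc.noun_chunks]
    return entities, noun_phrases
```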
                  SQuAD 2.0   NewsQA   TriviaQA   Coreference   MNLI-m   QNLI   GLUE (Avg)
Subword Tokens    83.8        72.0     76.3       77.7          86.7     92.5   83.2
Whole Words       84.3        72.8     77.1       76.6          86.3     92.8   82.9
Named Entities    84.8        72.7     78.7       75.6          86.0     93.1   83.2
Noun Phrases      85.0        73.0     77.7       76.7          86.5     93.2   83.5
Geometric Spans   85.4        73.0     78.8       76.4          87.0     93.3   83.4

Table 6: The effect of replacing BERT's original masking scheme (Subword Tokens) with different masking schemes. Results are F1 scores for QA tasks and accuracy for MNLI and QNLI on the development sets. All the models are based on bi-sequence training with NSP.
Table 7: The effects of different auxiliary objectives, given MLM over random spans as the primary
objective.
6.2 Auxiliary Objectives

In Section 5, we saw that bi-sequence training with the NSP objective can hurt performance on downstream tasks, when compared with single-sequence training. We test whether this holds true for models pre-trained with span masking, and also evaluate the effect of replacing the NSP objective with the SBO.

Table 7 confirms that single-sequence training typically improves performance. Adding SBO further improves performance, with a substantial gain on coreference resolution (+2.7% F1) over span masking alone. Unlike the NSP objective, SBO does not appear to have any adverse effects.

7 Related Work

Pre-trained contextualized word representations that can be trained from unlabeled text (Dai and Le, 2015; Melamud et al., 2016; Peters et al., 2018) have had immense impact on NLP lately, particularly as methods for initializing a large model before fine-tuning it for a specific task (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019). Beyond differences in model hyperparameters and corpora, these methods mainly differ in their pre-training tasks and loss functions, with a considerable amount of contemporary literature proposing augmentations of BERT's MLM objective.

While previous and concurrent work has looked at masking (Sun et al., 2019) or dropping (Song et al., 2019; Chan et al., 2019) multiple words from the input—particularly as pretraining for language generation tasks—SpanBERT pretrains span representations (Lee et al., 2016), which are widely used for question answering, coreference resolution, and a variety of other tasks. ERNIE (Sun et al., 2019) shows improvements on Chinese NLP tasks using phrase and named entity masking. MASS (Song et al., 2019) focuses on language generation tasks, and adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence. We attempt to more explicitly model spans using the SBO objective, and show that (geometrically distributed) random span masking works as well, and sometimes better than, masking linguistically-coherent spans. We evaluate on English benchmarks for question answering, relation extraction, and coreference resolution in addition to GLUE.

A different ERNIE (Zhang et al., 2019) focuses on integrating structured knowledge bases with contextualized representations with an eye on knowledge-driven tasks like entity typing and relation classification. UNILM (Dong et al., 2019) uses multiple language modeling objectives—unidirectional (both left-to-right and right-to-left), bidirectional, and sequence-to-sequence prediction—to aid generation tasks like summarization and question generation. XLM (Lample and Conneau, 2019) explores cross-lingual pre-training for multilingual tasks such as translation and cross-lingual classification. Kermit (Chan et al., 2019), an insertion based approach, fills in missing tokens
(instead of predicting masked ones) during pre-training; they show improvements on machine translation and zero-shot question answering. Concurrent with our work, RoBERTa (Liu et al., 2019b) presents a replication study of BERT pre-training that measures the impact of many key hyperparameters and training data size. Also concurrent, XLNet (Yang et al., 2019) combines an autoregressive loss and the Transformer-XL (Dai et al., 2019) architecture with more than an eight-fold increase in data to achieve current state-of-the-art results on multiple benchmarks. XLNet also masks spans (of 1–5 tokens) during pre-training …

… substantially better performance on span selection tasks in particular.

Appendices

A Pre-training Procedure

We describe our pre-training procedure as follows:

1. Divide the corpus into single contiguous blocks of up to 512 tokens.

2. At each step of pre-training:

   (a) Sample a batch of blocks uniformly at random.
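A small sketch of the first steps of the procedure above (block construction and uniform batch sampling). The batch size is an assumption, and the masking and loss computation that follow in the full procedure are omitted here.

```python
import random

def make_blocks(token_ids, block_size=512):
    """Step 1: divide the corpus (a flat list of token ids) into single
    contiguous blocks of up to block_size tokens."""
    return [token_ids[i:i + block_size] for i in range(0, len(token_ids), block_size)]

def sample_batch(blocks, batch_size=32):
    """Step 2(a): sample a batch of blocks uniformly at random."""
    return random.sample(blocks, k=min(batch_size, len(blocks)))
```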
Acknowledgments

We would like to thank Pranav Rajpurkar and Robin Jia for patiently helping us evaluate SpanBERT on SQuAD. We thank the anonymous reviewers, the action editor, and our colleagues at Facebook AI Research and the University of Washington for their insightful feedback that helped improve the paper.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (NIPS), pages 3079–3087.

Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Association for Computational Linguistics (ACL).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems (NeurIPS).

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Mandar Joshi, Eunsol Choi, Omer Levy, Daniel Weld, and Luke Zettlemoyer. 2019a. pair2vec: Compositional word-pair embeddings for cross-sentence inference. In North American Association for Computational Linguistics (NAACL), pages 3597–3608.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Association for Computational Linguistics (ACL), pages 1601–1611.

Mandar Joshi, Omer Levy, Daniel S. Weld, and Luke Zettlemoyer. 2019b. BERT for coreference resolution: Baselines and analysis. In Empirical Methods in Natural Language Processing (EMNLP).

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In North American Association for Computational Linguistics (NAACL), pages 687–692.

Kenton Lee, Shimi Salant, Tom Kwiatkowski, Ankur Parikh, Dipanjan Das, and Jonathan Berant. 2016. Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436.

Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL), pages 2227–2237.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, pages 1–40.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL), pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392.

Livio Baldini Soares, Nicholas Arthur FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Association for Computational Linguistics (ACL), pages 2895–2905.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning (ICML), pages 5926–5936.

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems (NIPS).

Yu Stephanie Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xinlun Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Workshop on Representation Learning for NLP.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS).

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In North American Association for Computational Linguistics (NAACL), pages 1112–1122.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems (NeurIPS).

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Empirical Methods in Natural Language Processing (EMNLP), pages 2369–2380.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Empirical Methods in Natural Language Processing (EMNLP), pages 35–45.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Association for Computational Linguistics (ACL), pages 1441–1451.