Contextualized Representations Using Textual Encyclopedic Knowledge
Abstract

We present a method to represent input texts by contextualizing them jointly with dynamically retrieved textual encyclopedic background knowledge from multiple documents. We apply our method to reading comprehension tasks by encoding questions and passages together with background sentences about the entities they mention. We show that integrating background knowledge directly from text is effective for tasks focusing on factual reasoning and allows direct reuse of powerful pretrained BERT-style encoders. Moreover, knowledge integration can be further improved with suitable pretraining via a self-supervised masked language model objective over words in background-augmented input text. On TriviaQA, our approach obtains improvements of 1.6 to 3.1 F1 over comparable RoBERTa models which do not integrate background knowledge dynamically. On MRQA, a large collection of diverse question answering datasets, we see consistent gains in-domain along with large improvements out-of-domain on BioASQ (2.1 to 4.2 F1), TextbookQA (1.6 to 2.0 F1), and DuoRC (1.1 to 2.0 F1).

[Figure 1 shows a TriviaQA passage, "... work appeared in the Daily Express and Daily Mail... he was a contributor to the William Hickey column ...", linked to two retrieved background sentences: "William Hickey" is the pseudonymous byline of a gossip column published in the Daily Express, and The Daily Express is a daily national middle-market tabloid newspaper in the United Kingdom.]

Figure 1: A TriviaQA example showing how background sentences from Wikipedia help define the meaning of phrases in the context and their relationship to phrases in the question. The answer William Hickey is connected to the question phrase pen-name of the gossip columnist in the Daily Express through the background sentence.

1 Introduction

Current self-supervised representations, trained at large scale from document-level contexts, are known to encode linguistic (Tenney et al., 2019) and factual (Petroni et al., 2019) knowledge into their parameters. Yet, even large pretrained representations are unable to capture and preserve all factual knowledge they have "read" during pretraining, due to the long tail of entity- and event-specific information (Logan et al., 2019). For open-domain tasks, where the input consists of only a question or a statement out of context, as in open-domain QA or factuality prediction, previous work has retrieved and used text that may contain or entail the needed answer to build representations for the task (Chen et al., 2017; Guu et al., 2020; Lewis et al., 2020; Oh et al., 2017; Kadowaki et al., 2019).

On the other hand, when relevant text is provided as input, as in reading comprehension (Rajpurkar et al., 2016), relation extraction, syntactic analysis, and other tasks that can be cast as labeling spans in the input text, prior work has not focused on drawing background information from external text sources. Instead, most research has explored architectures that integrate background from structured knowledge bases to form input text representations (Bauer et al., 2018; Mihaylov and Frank, 2018; Yang et al., 2019; Zhang et al., 2019; Peters et al., 2019).¹

We posit that representations should be able to directly integrate textual background knowledge, since a wider scope of information is more readily available in textual form.

∗ Work completed while interning at Google.
¹ A notable exception is Weissenborn et al. (2017), with a specialized architecture which uses textual entity descriptions.
[Figure 2 depicts a Transformer encoding the sequence "[CLS] What is the pen-name of the gossip columnist… [SEP] … appeared in the Daily Express and Daily Mail… he was a contributor to the William Hickey column … [SEP] William Hickey: "William Hickey" is the pseudonymous byline of a gossip column… [SEP] Daily Express: The Daily Express is a daily national middle-market… [SEP]" into output representations h_1 … h_512.]

Figure 2: We contextualize the input text, in this case a question and a passage, together with textual encyclopedic knowledge (TEK) using a pretrained Transformer to create TEK-enriched representations.
Our method represents input texts by jointly encoding them with dynamically retrieved sentences from the Wikipedia pages of entities they mention. We term these representations TEK-enriched, for Textual Encyclopedic Knowledge (Figure 2 shows an illustration), and use them for reading comprehension (RC) by contextualizing questions and passages together with retrieved Wikipedia background sentences. Such background knowledge can help reason about the relationships between questions and passages. Figure 1 shows an example question from the TriviaQA dataset (Joshi et al., 2017) asking for the pen-name of a gossip columnist. Encoding relevant background knowledge (pseudonymous byline of a gossip column published in the Daily Express) helps ground the vague reference to the William Hickey column in the given document context.

Using text as background knowledge allows us to directly reuse powerful pretrained BERT-style encoders (Devlin et al., 2019). We show that an off-the-shelf RoBERTa (Liu et al., 2019b) model can be directly finetuned on minimally structured TEK-enriched inputs, which are formatted to allow the encoder to distinguish between the original passages and background sentences. This method considerably improves on current state-of-the-art methods which only consider context from a single input document (Section 4). The improvement comes without an increase in the length of the input window for the Transformer (Vaswani et al., 2017).

Although existing pretrained models provide a good starting point for task-specific TEK-enriched representations, there is still a mismatch between the type of input seen during pretraining (single document segments) and the type of input the model is asked to represent for downstream tasks (document text with background Wikipedia sentences from multiple pages). We show that the Transformer model can be substantially improved by reducing this mismatch via self-supervised masked language model (MLM) (Devlin et al., 2019) pretraining on TEK-augmented input texts.

Our approach records considerable improvements over state-of-the-art base (12-layer) and large (24-layer) Transformer models for in-domain and out-of-domain document-level extractive question answering (QA), on tasks where factual knowledge about entities is important and well covered by the background collection. On TriviaQA, we see improvements of 1.6 to 3.1 F1 over comparable RoBERTa models which do not integrate background information. On MRQA (Fisch et al., 2019), a large collection of diverse QA datasets, we see consistent gains in-domain along with large improvements out-of-domain on BioASQ (2.1 to 4.2 F1), TextbookQA (1.6 to 2.0 F1), and DuoRC (1.1 to 2.0 F1).

2 TEK-enriched Representations

We follow recent work on pretraining bidirectional Transformer representations on unlabeled text and finetuning them for downstream tasks (Devlin et al., 2019). Subsequent approaches have shown significant improvements over BERT by improving the training example generation, the masking strategy, the pretraining objectives, and the optimization methods (Liu et al., 2019b; Joshi et al., 2020). We build on these improvements to train TEK-enriched representations and use them for extractive QA.

Our approach seeks to contextualize input text X = (x_1, ..., x_n) jointly with relevant textual encyclopedic background knowledge B retrieved dynamically from multiple documents. We define a retrieval function f_ret(X, D), which takes X as input and retrieves a list of text spans B = (B_1, ..., B_M) from the corpus D. In our implementation, each of the text spans B_i is a sentence. The encoder then represents X by jointly encoding X with B using f_enc(X, B), such that the output representations of X are cognizant of the information present in B (see Figure 2). We use a deep Transformer encoder operating over the input sequence [CLS] X [SEP] B [SEP] for f_enc(·).
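To make the encoder input concrete, the following is a minimal sketch (ours, not the authors' released code) of packing a tokenized context and ranked background sentences into the [CLS] X [SEP] B [SEP] sequence; the entity-page prefix and per-sentence [SEP] follow Section 2.1, and the 384/128 budget split is the one the paper settles on below. All names are our own, and tokens are plain strings rather than subwords.

from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"
N_C, N_B = 384, 128  # context / background token budgets within a 512 window

def pack_tek_input(context: List[str],
                   backgrounds: List[Tuple[str, List[str]]]) -> List[str]:
    """Builds [CLS] X [SEP] B [SEP] from ranked (page_name, sentence) pairs."""
    seq = [CLS] + context[:N_C] + [SEP]
    budget = N_B
    for page_name, sentence in backgrounds:
        # Each background sentence is prefixed with its source page name and
        # a ':' separator, and followed by [SEP] (Section 2.1).
        piece = [page_name, ":"] + sentence + [SEP]
        if len(piece) > budget:
            break  # keep only whole top-ranked sentences, up to N_B tokens
        seq.extend(piece)
        budget -= len(piece)
    return seq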
We refer to inputs X generically as contexts. These could be either contiguous word sequences from documents (passages) or, for the QA application, question-passage pairs, which we refer to as RC-contexts. For a fixed Transformer input length limit (which is necessary for computational efficiency), there is a trade-off between the length of the document context (the length of X) and the amount of background knowledge (the length of B). Section 5 explores this trade-off and shows that for an encoder input limit of 512, the values of N_C = 384 for the length of X and N_B = 128 for the length of B provide an effective compromise.

We use a simple implementation of the background retrieval function f_ret(X, D), using an entity linker for finetuning (Section 2.1) and Wikipedia hyperlinks for pretraining (Section 2.2), and scoring the relevance of individual sentences by ngram overlap.

2.1 TEK-Enriched Question Answering

The input X for the extractive QA task consists of the question Q and a candidate passage P. We use the following retrieval function f_ret(X, D) to obtain relevant background B.
Background Knowledge Retrieval for QA We detect entity mentions in X using a proprietary Wikipedia-based entity linker,² and form a candidate pool of background segments B_i as the union of the sentences in the Wikipedia pages of the detected entities. These sentences are then ranked by their number of overlapping ngrams with the question (equally weighted unigrams, bigrams, and trigrams). To form the input for the Transformer encoder, each background sentence B_i is minimally structured by prepending the name of the entity whose page it belongs to, followed by a separator ':' token. Each sentence B_i is followed by [SEP]. Appendix A shows an example of an RC-context with background knowledge segments.

² We also report results on publicly available linkers, showing that our method is robust to the exact choice of the linker (Section 5).
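As a concrete illustration of the ranking step above, here is a minimal sketch (ours) of the equally weighted unigram/bigram/trigram overlap scorer; applied with X in place of the question, the same scorer matches the pretraining-time ranking of Section 2.2. Whether overlap is counted over sets or multisets is not specified in the paper; multiset counting is an assumption here.

from collections import Counter
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_score(query: List[str], sentence: List[str]) -> int:
    # Equally weighted unigram, bigram, and trigram overlap (multiset counts).
    return sum(sum((ngrams(query, n) & ngrams(sentence, n)).values())
               for n in (1, 2, 3))

def rank_candidates(query: List[str],
                    candidates: List[Tuple[str, List[str]]]):
    """Sorts (page_name, sentence_tokens) pairs by overlap with the query."""
    return sorted(candidates,
                  key=lambda c: overlap_score(query, c[1]), reverse=True)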
QA Model Following BERT, our QA model architecture consists of two independent linear classifiers for predicting the answer span boundary (start and end) on top of the output representations of X. We assume that the answer, if present, is contained only in the given passage P, and do not consider potential mentions of the answer in the background B. For instances which do not contain the answer, we set the answer span to be the special token [CLS]. We use a fixed Transformer input window size of 512, and use a sliding window with a stride of 128 tokens to handle longer documents. Our TEK-enriched representations use document passages of length 384, while the baselines use longer passages of length 512.
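The sketch below (ours; written in PyTorch for brevity, whereas the paper's implementation is in TensorFlow, see Appendix B) illustrates this kind of span-boundary head: two independent linear classifiers score each encoder output position, and non-passage positions are masked out so answers are extracted only from X.

import torch
import torch.nn as nn

class SpanBoundaryHead(nn.Module):
    """Two independent linear classifiers for answer start and end."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.start = nn.Linear(hidden_size, 1)
        self.end = nn.Linear(hidden_size, 1)

    def forward(self, encoder_out: torch.Tensor, passage_mask: torch.Tensor):
        # encoder_out: [batch, seq_len, hidden]; passage_mask: [batch, seq_len]
        # with 1 on passage positions and [CLS] (the no-answer target), and 0
        # on question and background positions, which are never answer spans.
        start_logits = self.start(encoder_out).squeeze(-1)
        end_logits = self.end(encoder_out).squeeze(-1)
        neg_inf = torch.finfo(start_logits.dtype).min
        start_logits = start_logits.masked_fill(passage_mask == 0, neg_inf)
        end_logits = end_logits.masked_fill(passage_mask == 0, neg_inf)
        return start_logits, end_logits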
2.2 TEK-enriched Pretraining

Standard pretraining uses contiguous document-level natural language inputs. Since TEK-augmented inputs are formatted as natural language sequences, off-the-shelf pretrained models can be used as a starting point for creating TEK-enriched representations. As one of our approaches, we use a standard single-document pretraining model.

While the input format is the same, there is a mismatch between contiguous document segments and TEK-augmented inputs sourced from multiple documents. We therefore propose an additional pretraining stage: starting from the RoBERTa parameters, we resume pretraining with an MLM objective on TEK-augmented document text X, which encourages the model to integrate the knowledge from multiple background segments.

Background Knowledge Retrieval in Pretraining In pretraining, X is a contiguous block of text from Wikipedia. The retrieval function f_ret(X, D) returns B = (B_1, ..., B_M), where each B_i is a sentence from the Wikipedia page of some entity hyperlinked from a span in X. We use high-precision Wikipedia hyperlinks instead of an entity linker for pretraining. The background candidate sentences are ranked by their ngram overlap with X, and the top-ranking sentences in B, up to N_B tokens, are used. If no entities are found in X, B is constructed from the context following X in the same document.

Training Objective We continue pretraining a deep Transformer using the MLM objective (Devlin et al., 2019) after initializing the parameters with pretrained RoBERTa weights. Following improvements in SpanBERT (Joshi et al., 2020), we mask spans with lengths sampled from a geometric distribution over the entire input (X and B). We use a single segment ID and remove the next sentence prediction objective, which has been shown not to improve performance for multiple tasks including QA (Joshi et al., 2020; Liu et al., 2019b).
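A minimal sketch (ours) of this geometric span masking follows; the geometric parameter p = 0.2, the span-length cap of 10, and the 15% masking budget are SpanBERT's defaults and are assumptions here, since the paper does not restate them.

import random
import numpy as np

def sample_mask_positions(seq_len, budget_ratio=0.15, p=0.2, max_len=10):
    """Samples token positions to mask over the entire input (X and B)."""
    budget = int(seq_len * budget_ratio)
    masked = set()
    while len(masked) < budget:
        # Span length ~ Geometric(p), clipped to max_len and the sequence.
        span_len = min(int(np.random.geometric(p)), max_len, seq_len)
        start = random.randrange(0, seq_len - span_len + 1)
        masked.update(range(start, start + span_len))
        # The final span may overshoot the budget slightly in this sketch.
    return sorted(masked)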
Task       Train     Dev      Test
TQA Wiki   61,888    7,993    7,701
TQA Web    528,979   68,621   65,059
MRQA       616,819   58,221   9,633

Table 1: Data statistics for TriviaQA and MRQA.
We evaluate two methods for building textual-knowledge-enriched representations for QA, differing in the pretraining approach used:

TEK_PF Our full approach, TEK_PF,³ consists of two stages: (a) 200K steps of TEK-pretraining on Wikipedia starting from the RoBERTa checkpoint, and (b) finetuning and doing inference on RC-contexts augmented with TEK background.

TEK_F TEK_F replaces the first, specialized pretraining stage of TEK_PF with 200K steps of standard single-document-context pretraining for a fair comparison with TEK_PF, but follows the same finetuning regimen.

³ The subscripts P and F stand for pretraining and finetuning, respectively.

3 Experimental Setup

We perform experiments on TriviaQA and MRQA, two large extractive question answering benchmarks (see Table 1 for dataset statistics).

TriviaQA TriviaQA (Joshi et al., 2017) contains trivia questions paired with evidence collected via entity linking and web search. The dataset is distantly supervised in that the answers are contained in the evidence, but the context may not support answering the questions. We experiment with both the Wikipedia and Web tasks.

MRQA The MRQA shared task (Fisch et al., 2019) consists of several widely used QA datasets unified into a common format, aimed at evaluating out-of-domain generalization. The data consists of a training set, in-domain and out-of-domain dev sets, and a private out-of-domain test set. The training and the in-domain dev sets consist of modified versions of the corresponding sets from SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), TriviaQA Web (Joshi et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019). The out-of-domain test evaluation, including access to questions and passages, is only available through Codalab. Due to the complexity of our system, which involves entity linking and retrieval, we perform development and model selection on the in-domain dev set and treat the out-of-domain dev set as the test set. The out-of-domain set we evaluate on has examples from BioASQ (Tsatsaronis et al., 2015), DROP (Dua et al., 2019), DuoRC (Saha et al., 2018), RACE (Lai et al., 2017), RelationExtraction (Levy et al., 2017), and TextbookQA (Kembhavi et al., 2017).

3.1 Baselines

We compare TEK_PF and TEK_F with two baselines, RoBERTa and RoBERTa++. Both use the same architecture as our approach, but use only original RC-contexts for finetuning and inference, and use standard single-document-context RoBERTa pretraining. TEK_PF and TEK_F use N_C = 384 and N_B = 128, while both baselines use N_C = 512 and N_B = 0.

RoBERTa We finetune the model on QA data without knowledge augmentation, starting from the same RoBERTa checkpoint that is used as an initializer for TEK-augmented pretraining.

RoBERTa++ For a fair evaluation of the new TEK-augmented pretraining method while controlling for the number of pretraining steps and other hyperparameters, we extend RoBERTa's pretraining for an additional 200K steps on single contiguous blocks of text (without background information). We use the same masking and other hyperparameters as in TEK-augmented pretraining. This pretrained checkpoint is also used to initialize the parameters for our TEK_F approach.

The implementation details of all models, including hyperparameters, can be found in Appendix B.

4 Results

TriviaQA Table 2 compares our approaches with baselines and previous work. The 12-layer variant of our RoBERTa baseline outperforms or matches the performance of several previous systems, including ELMo-based ones (Wang et al., 2018; Lewis, 2018), which are specialized for this task. We also see that RoBERTa++ outperforms RoBERTa, indicating that there is still room for improvement by simply pretraining for more steps on task-domain-relevant text. Furthermore, the 12-layer and 24-layer variants of our TEK_F approach considerably improve over a comparable RoBERTa++ baseline for both Wikipedia (1.9 and 1.1 F1, respectively) and Web (1.6 and 1.0 F1, respectively), indicating that TEK representations are useful even without additional TEK-pretraining. The base variant of our best model, TEK_PF, which uses TEK-pretrained TEK-enriched representations, records even bigger gains of 3.1 F1 and 2.0 F1 on Wikipedia and Web, respectively, over a comparable 12-layer RoBERTa++ baseline. The 24-layer models show similar trends, with improvements of 1.6 and 1.7 F1 over RoBERTa++.
                            TQA Wiki        TQA Web
                            EM     F1       EM     F1
Previous work
Clark and Gardner (2018)    64.0   68.9     66.4   71.3
Weissenborn et al. (2017)   64.6   69.9     67.5   72.8
Wang et al. (2018)          66.6   71.4     68.6   73.1
Lewis (2018)                67.3   72.3     -      -
This work
RoBERTa (Base)              66.7   71.7     77.0   81.4
RoBERTa++ (Base)            68.0   72.9     76.8   81.4
TEK_F (Base)                70.0   74.8     78.2   83.0
TEK_PF (Base)               71.2   76.0     78.8   83.4
RoBERTa (Large)             72.3   76.9     80.6   85.1
RoBERTa++ (Large)           72.9   77.5     81.1   85.5
TEK_F (Large)               74.1   78.6     82.2   86.5
TEK_PF (Large)              74.6   79.1     83.0   87.2

Table 2: Test set performance on TriviaQA.
MRQA Table 3 shows in-domain and out-of-domain evaluation on MRQA. As in the case of TriviaQA, the 12-layer variants of our RoBERTa baselines are competitive with previous work, which includes D-Net (Li et al., 2019) and Delphi (Longpre et al., 2019), the top two systems of the MRQA shared task, while the 24-layer variants considerably outperform the current state of the art across all datasets. RoBERTa++ again performs better than RoBERTa on all datasets except DROP and RACE. DROP is designed to test arithmetic reasoning, while RACE contains (often fictional and thus not groundable to Wikipedia) passages from English exams for middle and high school students in China. The performance drop after further pretraining on Wikipedia could be a result of multiple factors, including the difference in the style of required reasoning or content; we leave further investigation of this phenomenon for future work. The base variants of TEK_F and TEK_PF outperform both baselines on all other datasets. Comparing the base variant of our full TEK_PF approach to RoBERTa++, we observe an overall improvement of 1.6 F1, with strong gains on BioASQ (4.2 F1), DuoRC (2.0 F1), and TextbookQA (2.0 F1). The 24-layer variants of TEK_PF show similar trends, with improvements of 2.1 F1 on BioASQ, 1.1 F1 on DuoRC, and 1.6 F1 on TextbookQA. Our large models see a reduction in the average gain, mostly due to a drop in performance on DROP. As in the case of TriviaQA, TEK-pretraining generally improves performance even further where TEK-finetuning is useful (with the exception of DuoRC, which sees a small loss of 0.24 F1 due to TEK-pretraining for the large models⁴), with the biggest gains seen on BioASQ.

Takeaways Both TEK_PF and TEK_F record strong gains on benchmarks that focus on factual reasoning, outperforming the RoBERTa-based baselines that use only RC-contexts. The success of TEK_F underscores the advantage of textual encyclopedic knowledge in that it improves current models even without additional TEK-pretraining. Finally, TEK-pretraining further improves the model's ability to use the retrieved background knowledge for the downstream RC task.

5 Ablation Studies

TEK vs. Context-only Pretraining We also compare the two pretraining setups for models which do not use background knowledge to form representations for the finetuning tasks. Table 4 shows results for all four combinations of the pretraining and finetuning method variables, using 12-layer base models on the development sets of TriviaQA and MRQA (in-domain). Comparing rows 1 and 3, we see marginal gains across all datasets for TEK-pretraining, indicating that pretraining with encyclopedic knowledge does not hurt QA performance even when such information is not available during finetuning and inference. While previous work (Liu et al., 2019b; Joshi et al., 2020) has shown that pretraining with single contiguous chunks of text clearly outperforms BERT's bi-sequence pipeline,⁵ our results suggest

⁴ According to the Wilcoxon signed-rank test of statistical significance, the large TEK_PF is significantly better than TEK_F on BioASQ and TextbookQA (p-value < .05), and is not significantly different from it for DuoRC.
⁵ BERT randomly samples the second sequence from a different document in the corpus with a probability of 0.5.
                    MRQA-In  BioASQ  TextbookQA  DuoRC  RE     DROP   RACE   MRQA-Out
Shared task
D-Net (Ensemble)    84.82    -       -           -      -      -      -      70.42
Delphi              -        71.98   65.54       63.36  87.85  58.9   53.87  66.92
This work
RoBERTa (Base)      82.98    68.80   58.32       62.56  86.87  54.88  49.14  68.17
RoBERTa++ (Base)    83.22    68.36   60.51       62.40  87.93  53.11  47.90  68.38
TEK_F (Base)        83.44    69.71   62.19       63.43  87.49  51.04  46.43  68.46
TEK_PF (Base)       83.71    72.58   62.55       64.43  88.29  54.58  47.75  70.01
RoBERTa (Large)     85.75    73.41   65.95       66.79  88.82  68.63  56.84  74.02
RoBERTa++ (Large)   85.80    74.73   67.51       67.40  89.58  67.62  55.95  74.58
TEK_F (Large)       86.23    75.37   68.17       68.80  89.43  67.46  55.20  74.88
TEK_PF (Large)      86.33    76.80   69.10       68.54  89.15  66.24  56.14  75.00

Table 3: In-domain and out-of-domain performance (F1) on MRQA. RE refers to the Relation Extraction dataset. MRQA-Out refers to the averaged out-of-domain F1.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL '09, pages 147–155, Stroudsburg, PA, USA. Association for Computational Linguistics.

Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Chao Wang and Hui Jiang. 2019. Explicit utilization of general knowledge in machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2263–2272, Florence, Italy. Association for Computational Linguistics.

Wei Wang, Ming Yan, and Chen Wu. 2018. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1705–1714, Melbourne, Australia. Association for Computational Linguistics.

Dirk Weissenborn, Tomáš Kočiský, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.

Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2019. Improving question answering over incomplete KBs with knowledge-aware reader. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4258–4264, Florence, Italy. Association for Computational Linguistics.

Jinxi Xu and W. Bruce Croft. 1996. Query expansion using local and global document analysis. In SIGIR.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2346–2357, Florence, Italy. Association for Computational Linguistics.
Table 7: Pretraining (left) and QA finetuning (right) examples which encode contexts with background sentences from Wikipedia. The input is minimally structured by including the source page of each background sentence and separating the sentences using special [SEP] tokens. Background is shown in blue and entities are indicated in bold.
B Implementation

We implemented all models in TensorFlow (Abadi et al., 2015). For pretraining, we used the 12-layer RoBERTa-base (125M parameters) and 24-layer RoBERTa-large (355M parameters) configurations, and initialized the parameters from their respective checkpoints. In TEK-augmented pretraining, we further pretrained the models for 200K steps with a batch size of 512 and BERT's triangular learning rate schedule with a warmup of 5000 steps on TEK-augmented contexts. We used a peak learning rate of 0.0001 for base and 5e-5 for large models. All models were trained and evaluated on Google Cloud TPUs. We apply the following finetuning hyperparameters to all methods, including the baselines. For each method, we chose the best model based on dev set performance, measured using F1.
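For concreteness, a small sketch (ours) of the triangular schedule referenced above: linear warmup to the peak rate over the warmup steps, then linear decay to zero by the final step. The defaults reflect the 5000-step warmup, 200K total steps, and 1e-4 peak rate quoted in this appendix for base-model pretraining; the exact decay endpoint is an assumption.

def triangular_lr(step, peak=1e-4, warmup=5000, total=200_000):
    """BERT-style triangular schedule: linear warmup, then linear decay."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, total - step) / (total - warmup)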
TriviaQA⁷ We follow the input preprocessing of Clark and Gardner (2018). The input to our model is the concatenation of the first four 400-token passages selected by their linear passage ranker. For training, we define the gold span to be the first occurrence of the gold answer(s) in the context (Joshi et al., 2017; Talmor and Berant, 2019). We choose learning rates from {1e-5, 2e-5} and finetune for 5 epochs with a batch size of 32.

MRQA⁸ We choose learning rates from {1e-5, 2e-5} and the number of epochs from {2, 3, 5} with a batch size of 32. For both benchmarks, especially

⁷ https://nlp.cs.washington.edu/triviaqa/
⁸ https://github.com/mrqa/MRQA-Shared-Task-2019

Dataset          Epochs   LR
12-layer Models
MRQA             3        2e-5
TriviaQA Wiki    5        1e-5
TriviaQA Web     5        1e-5
24-layer Models
MRQA             2        1e-5
TriviaQA Wiki    5        2e-5
TriviaQA Web     5        1e-5

Table 8: Hyperparameter configurations for TEK_PF.