[Figure 2: Illustration of the two attention mechanisms in LongT5. (a) Local Attention: each input token attends only to a neighborhood of r tokens around it; for a given choice of r, complexity is linear in the input sequence length l: O(l × r). (b) TGlobal attention: each input token can attend to its r-token neighborhood (as in Local Attention), plus to all global tokens.]
3.1.2 Transient Global Attention (TGlobal)
To allow input tokens to interact with each other in each layer of the encoder at a longer range than Local Attention's local radius, we introduce Transient Global Attention as a modification of ETC's global-local attention in a "fixed blocks" pattern. Namely, we divide the input sequence into blocks of k tokens, and for each block we compute a global token by summing (and then normalizing) the embeddings of every token in the block (see Figure 2.b). Now, when computing attention, we allow each input token to attend not only to nearby tokens, as in Local Attention, but also to every global token. We call these global tokens transient because, in contrast to ETC-like global-local attention patterns, they are dynamically constructed (and subsequently discarded) within each attention operation, removing any requirement for deciding which input tokens should be treated as "global".

TGlobal attention only introduces a couple of new parameters[4]: (1) T5-style relative position biases representing the distance from an input token's block to the block of each global token it is attending to, and (2) T5-style layer normalization parameters for normalizing each global token's embedding. The rest of the parameters are identical to T5, and we accommodate sequence packing[3] by additionally masking attention from input tokens to the global tokens of other examples. We found block size k = 16 to be sufficient in practice. Notice, thus, that TGlobal attention introduces a block of l × l/k additional attention key-value pairs to compute on top of Local Attention (l input tokens, each attending to l/k global tokens; represented by the rightmost rectangle in Figure 2.b); hence, for input sequence length l, the complexity is O(l(r + l/k)).

[3] Sequence packing places more than one example in the same input sequence to increase training efficiency. This is especially useful in LongT5, since with the large input lengths used in our model, if many examples are short, most of the input sequence would otherwise be dedicated to padding, wasting significant computation.
[4] For base models, we introduced 10k additional parameters, 25k for large, and 50k for xl.
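To make the block construction concrete, the following is a minimal sketch of transient global tokens in JAX. The function names, shapes, and the local_keys layout are illustrative assumptions, not the actual Flaxformer implementation; the learned relative position biases and layer-norm scale are omitted.

```python
import jax.numpy as jnp

def transient_global_tokens(x, k=16, eps=1e-6):
    """x: [l, d] token embeddings, with l divisible by k.
    Returns [l // k, d] transient global tokens: the sum of each block's
    embeddings, followed by a T5-style (RMS) layer normalization whose
    learned scale parameter is omitted here."""
    l, d = x.shape
    blocks = x.reshape(l // k, k, d).sum(axis=1)               # block-wise sum
    rms = jnp.sqrt(jnp.mean(blocks ** 2, axis=-1, keepdims=True) + eps)
    return blocks / rms

def tglobal_attention_scores(q, local_keys, x, k=16):
    """q: [l, d] queries; local_keys: [l, r, d] holds, for each token, the
    keys of its r-token local neighborhood. Each token's scores cover its
    local window plus all l/k global tokens, so per-token cost is
    O(r + l/k) rather than O(l), giving O(l(r + l/k)) overall."""
    g = transient_global_tokens(x, k)                          # [l/k, d]
    local_scores = jnp.einsum('ld,lrd->lr', q, local_keys)     # [l, r]
    global_scores = jnp.einsum('ld,gd->lg', q, g)              # [l, l/k]
    # Relative position biases (input block -> global block) would be
    # added to these scores before the softmax; omitted for brevity.
    return jnp.concatenate([local_scores, global_scores], axis=-1)
```

Since the global tokens are recomputed from the current layer's input embeddings inside every attention operation, no persistent global state needs to be stored or chosen ahead of time, which is what makes them "transient".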
3.2 PEGASUS Principle Sentences Generation Pre-training

T5 is pre-trained with a span corruption objective, where spans of consecutive input tokens are replaced with a mask token and the model is trained to reconstruct the masked-out tokens. While this is effective, recent work on masked language modeling (MLM) (Liu et al., 2019; Zhang et al., 2019b) shows that carefully selecting the prediction objective can lead to significantly better performance. One argument is that predicting more informative tokens from the text forces the model to learn better semantics of the text. Motivated by this, we explore masking and generating the principle sentences from the text. In particular, we adopt the Gap Sentences Generation with Principle Ind-Uniq strategy from Zhang et al. (2019a), which was used for summarization pre-training.

Following Zhang et al. (2019a), we select the top-m scored (Principle) sentences based on ROUGE-F1 score (Lin, 2004), using s_i = rouge(x_i, D \ {x_i}), ∀i, where i is the sentence index and D is the collection of sentences in the document. Each sentence is scored independently (Ind), and each n-gram is only counted once (Uniq).
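As an illustration of the Ind-Uniq selection just described, here is a minimal sketch in plain Python. It scores unigram overlap only and assumes pre-tokenized sentences; the actual pipeline computes ROUGE-F1 as in Zhang et al. (2019a), not this simplified helper.

```python
def rouge1_f1_uniq(sentence, rest):
    """Unigram-overlap F1 where each n-gram is counted only once (Uniq):
    we compare sets of tokens rather than multisets."""
    s, r = set(sentence), set(rest)
    overlap = len(s & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(s), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def select_principle_sentences(sentences, m):
    """sentences: list of tokenized sentences making up document D.
    Scores each sentence x_i independently (Ind) against the rest of the
    document, s_i = rouge(x_i, D \\ {x_i}), and returns the indices of
    the top-m scored (Principle) sentences."""
    scores = []
    for i, x_i in enumerate(sentences):
        rest = [tok for j, s in enumerate(sentences) if j != i for tok in s]
        scores.append(rouge1_f1_uniq(x_i, rest))
    return sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:m]
```

The selected sentences are then masked out of the input and used as the generation target, so the model must produce the most informative sentences of the document rather than arbitrary corrupted spans.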
4 Experiments

4.1 Configurations

LongT5 is implemented using JAX[5] and the Flaxformer[6] library. Following the same setup as T5.1.1[7], we consider models of three sizes: base (∼220M), large (∼770M), and xl (∼3B), and use the same cased English SentencePiece vocabulary model used by T5.1.1, which contains 32,000 sentence pieces. We use a batch size of 128 and Adafactor as the optimizer in all experiments. We use greedy decoding instead of beam search in all our experiments, even on the test sets; our results reported below could therefore potentially be improved further by using beam search, but we prefer to keep the setup consistent with our dev setup.

[5] https://github.com/google/jax
[6] https://github.com/google/flaxformer
[7] https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511

4.1.1 Pre-training

We pre-train LongT5 models for 1M steps with 4096 input sequence length and 910 output sequence length. We use the same inverse square-root learning rate schedule as T5, with the learning rate set to 1/√max(step, warm_up_steps), where warm_up_steps is set to 10,000. As with T5.1.1, we pre-train LongT5 only on the C4 dataset (Raffel et al., 2019b), and we do not apply dropout during pre-training. As described in section 3.2, we use the PEGASUS Principle Sentences Generation objective as our pre-training objective. The configuration is similar to what was described by Zhang et al. (2019a) for their larger models, except for the masked sentence ratio, for which we use a value of 0.2 instead of 0.45[8]. In section 5.3, we show our ablation study between Principle Sentences Generation and Span Corruption.

[8] We briefly experimented with other values, but found 0.2 to work best on the downstream tasks of interest.
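For reference, the schedule above can be transcribed directly as a function (a transcription of the formula in the text, not code from the LongT5 implementation):

```python
import math

def learning_rate(step, warm_up_steps=10_000):
    # Constant at 1/sqrt(10000) = 0.01 for the first 10k steps,
    # then decaying as 1/sqrt(step).
    return 1.0 / math.sqrt(max(step, warm_up_steps))
```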
4.1.2 Fine-tuning

For fine-tuning, we use a constant learning rate of 0.001 and a dropout rate of 0.1 for all tasks. For summarization tasks, we experiment with input lengths of 4096, 8192, and 16384, and an output length of 512. For QA tasks, we experiment with input lengths starting at 512 and scaling up to 36864, with an output length of 128.
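The fine-tuning hyperparameters above can be summarized as follows; the dictionary layout is purely illustrative and does not reflect the actual configuration format used by the implementation:

```python
FINE_TUNING_CONFIG = {
    "learning_rate": 0.001,   # constant, all tasks
    "dropout_rate": 0.1,
    "summarization": {"input_lengths": (4096, 8192, 16384), "output_length": 512},
    "qa": {"input_length_range": (512, 36864), "output_length": 128},
}
```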
4.2 Evaluation on Summarization Tasks

We choose to benchmark our models on summarization tasks covering various context lengths, because of their long-context understanding and generative nature.

4.2.1 Datasets

LongT5 was benchmarked on the following six datasets.

CNN / Daily Mail (Nallapati et al., 2016): News articles from CNN and Daily Mail are used as input, and the article's summary bullets are the target summary.

PubMed (Cohan et al., 2018): Scientific documents collected from PubMed, with a document's content used as input and its corresponding abstract as the target summary.

arXiv (Cohan et al., 2018): Similar to PubMed, but with documents taken from arXiv.

BigPatent (Sharma et al., 2019): U.S. patent documents, with the patent's details used as input and the patent's abstract as the target summary.

MediaSum (Zhu et al., 2021): Interview transcripts from CNN and NPR are used as input, and their corresponding topics and overviews are the target summary.

Multi-News (Fabbri et al., 2019): The task involves summarizing multiple news documents about a topic into a human-written summary.

Table 1 provides statistics on the number of examples in the train, validation, and test splits, and the average, median, maximum, and 90th-percentile input sequence lengths. As can be seen, these datasets have long inputs and would benefit from models that can handle lengthier inputs. We included the CNN / Daily Mail dataset to benchmark on a common task, and especially to see how using TGlobal attention impacts the model, despite its inputs being shorter than those of the other datasets.

4.2.2 Results

We compare LongT5 with various top approaches: BigBird-PEGASUS (Zaheer et al., 2020b), HAT-BART (Rohde et al., 2021), DANCER PEGASUS (Gidiotis and Tsoumakas, 2020), PRIMER (Xiao et al., 2021), TG-MultiSum (Cui and Hu, 2021), LED (Beltagy et al., 2020), and an application of BART by Zhu et al. (2021). For these comparisons, we use the common evaluation metrics ROUGE-1, ROUGE-2, and ROUGE-L.
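These metrics can be computed with, for example, the rouge-score package; the snippet below is one common way to do so, and is our assumption rather than the scoring implementation actually used by the paper:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the patent describes a hinge mechanism",   # reference summary
    prediction="a hinge mechanism is described",       # model output
)
print({name: round(score.fmeasure, 4) for name, score in scores.items()})
```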
Dataset            Train      Validation  Test    Average    Median   Max      90th pct.
CNN / Daily Mail   287,113    13,368      11,490  982.39     894      5,268    1,659
arXiv              203,037    6,436       6,440   10,720.18  8,519    378,825  20,170
PubMed             119,924    6,633       6,658   4,747.97   3,883    452,915  8,883
BigPatent          1,207,222  67,068      67,072  6,537.32   5,236    294,004  11,328
MediaSum           443,596    10,000      10,000  2,302.02   1,748    125,974  4,128
Multi-News         44,972     5,622       5,622   2,593.81   1,902.5  683,544  4,853

Table 1: Statistics for the summarization datasets. Train/Validation/Test are example counts; Average, Median, Max, and 90th percentile describe input length, measured in tokens using a SentencePiece model.
[Table 3: Statistics for the QA datasets. Input length measured in tokens using a SentencePiece model.]
[Figure 3: Sequences per second as a function of input length, for T5.1.1, LongT5 with Local Attention, and LongT5 with TGlobal attention. Input lengths start at 512 and go as far as possible before running out of memory. Measurements taken with batch size 128, on 4x8 TPUv3 slices; base and large model sizes shown.]

[Figure 4: Speed versus performance on NQ (short-answer F1), for T5, LongT5 with Local Attention, and LongT5 with TGlobal attention, at different input sequence lengths. Input lengths start at 512 and go as far as possible before running out of memory. Measurements taken with batch size 128, on 4x8 TPUv3 slices.]

[...] lengths underperform those with 4k length, but we believe this to be due to noise in the experiments, as the results are the output of just one repetition of each experiment due to resource constraints. Moreover, while LongT5 with Local Attention often underperforms T5.1.1, LongT5 with TGlobal attention significantly outperforms T5.1.1. For example, considering the large model sizes, T5.1.1 was only able to scale up to an input length of 3k tokens, while the TGlobal model was able to reach 6k tokens, outperforming T5.1.1 at 4k token length (there was a dip at 6k token length, but we hypothesize this is just due to variance, as we only did one run per configuration).

For TriviaQA, we compare LongT5 with various top approaches on the leaderboard: BigBird-ETC (Zaheer et al., 2020a), Fusion-in-Decoder (Izacard and Grave, 2021), and ReadTwice (Zemlyanskiy et al., 2021). As shown in Table 3, TriviaQA inputs are quite long; being able to scale up both in model size and to 16k input length therefore helps LongT5 achieve state-of-the-art results.

5 Analysis

5.1 Input Length vs Speed

In order to evaluate the training speed and memory consumption of LongT5 compared to T5.1.1, we performed a series of training runs on the NQ dataset, starting at input length 512 and increasing the input length steadily until the models ran out of memory on a 4x8 TPUv3 slice. Results are shown in Figure 3, which compares six model configurations: T5.1.1 base, T5.1.1 large, LongT5 (base Local), LongT5 (large Local), LongT5 (base TGlobal), and LongT5 (large TGlobal). For each configuration, we show a curve plotting the number of sequences per second processed during training (speed, on the vertical axis) against input length (horizontal axis). Both axes are shown in logarithmic scale.

We can see that at shorter lengths (512), T5.1.1, LongT5 Local, and LongT5 TGlobal have similar speeds, but as we increase the sequence length, LongT5 becomes significantly faster. For example, at sequence length 2048, T5.1.1 base can only process 479 sequences per second, while LongT5 (base TGlobal) can process 765 and LongT5 (base Local) can process 860. The differences grow even larger as sequence length increases.

Another important fact that Figure 3 shows is that the T5.1.1 models reach their out-of-memory point much earlier. For example, we could only scale up to 6k tokens for T5.1.1 base, whereas LongT5 (base Local) can go up to 36k tokens in length and LongT5 (base TGlobal) up to 12k. Large models show a similar picture, with T5.1.1 large going only up to 3k, but the LongT5 variants going to 10k (large Local) and 6k (large TGlobal).

5.2 Input Length vs Performance

This section presents a similar analysis, but where we plot model speed versus performance on NQ
(F1 score). Results are shown in Figure 4 for the large-size models. Each point in the curves is annotated with the corresponding sequence length.

As Figure 4 shows, performance increases significantly as input length increases, highlighting the benefits of LongT5. Moreover, input length by itself is not enough to achieve good performance on all datasets; in particular, on the NQ dataset (used in this figure), using Local Attention significantly hurts performance compared with TGlobal or with T5.1.1. So, even at very long input lengths, LongT5 with Local Attention just matches T5.1.1 with an input length of 3k on NQ. However, LongT5 with TGlobal attention outperforms T5.1.1. Moreover, note that although the plot shows a few irregularities (such as at 8k length for LongT5 with Local Attention, or 6k length with TGlobal attention), this is because the plot shows the results of a single run only, and hence there is some noise. The trends, however, can clearly be seen.

5.3 Principle Sentences Generation vs. Span Corruption

As mentioned in section 3.2, we use PEGASUS Principle Sentences Generation (PSG) instead of the default Span Corruption (SC) used in T5 as our pre-training objective. Table 5 shows our ablation study for fine-tuning on NQ and arXiv from a model pre-trained with the default Span Corruption objective, a model pre-trained with Principle Sentences Generation, and a model pre-trained with both objectives. The comparison is done on the dev sets of the tasks, with TGlobal base models. Both pre-training and fine-tuning of the models mentioned above are done with input sequence length 4096. The table shows that, even though Principle Sentences Generation was developed by Zhang et al. (2019a) as a pre-training strategy for summarization, it benefits both summarization and QA tasks; however, using both objectives together performs worse than using PSG alone.

Objective   NQ EM   NQ F1   arXiv R-1   arXiv R-2   arXiv R-L
PSG         62.21   66.94   44.95       18.74       40.99
SC          58.65   63.05   43.49       18.12       39.71
SC + PSG    59.74   64.54   44.85       18.79       40.90

Table 5: Ablation study on the dev sets for different pre-training strategies, span corruption (SC) vs. principle sentences generation (PSG), and their effects on the NQ and arXiv fine-tuning tasks. The models are TGlobal base, and fine-tuning is done with input sequence length 4096.

Table 6 shows an additional ablation study on arXiv and PubMed, where we compare regular T5.1.1 pre-trained with Span Corruption against T5.1.1 pre-trained with Principle Sentences Generation, using the same pre-training input sequence length of 512 (as was done in the original T5.1.1 pre-training task). As expected, Principle Sentences Generation helped the model achieve better results than Span Corruption when seeing the same amount of pre-training data. We also compare this with dev scores from LongT5 with TGlobal attention at 4k and 16k input lengths; we can see that having full attention allows for better results, but being able to scale to longer input sequence lengths allows LongT5 to achieve its stronger results.

arXiv
Objective      R-1    R-2    R-L
SC             44.59  18.34  40.65
PSG            45.78  18.94  41.53
LongT5 (4k)    45.66  19.22  41.49
LongT5 (16k)   48.21  21.70  44.03

PubMed
Objective      R-1    R-2    R-L
SC             47.86  22.14  44.39
PSG            48.74  23.42  45.24
LongT5 (4k)    48.47  23.38  45.01
LongT5 (16k)   50.12  24.78  46.56

Table 6: Ablation study on arXiv and PubMed for different pre-training strategies, span corruption (SC) vs. principle sentences generation (PSG), with the T5.1.1 model, along with LongT5 with TGlobal attention. Fine-tuning was done at large model size, with input sequence length 4096 except where otherwise noted.

6 Related Work

Language model pre-training followed by task-specific fine-tuning has proven to be a powerful tool for numerous NLP tasks (Devlin et al., 2019; Liu et al., 2019; Zhang et al., 2019b; Radford et al., 2019; Raffel et al., 2019a; Lewis et al., 2020; Joshi et al., 2020). BERT (Devlin et al., 2019) introduced the Masked Language Model (MLM), where a model predicts masked tokens given a sequence of text input. Fine-tuning a pre-trained BERT model has led to improved performance on various NLP tasks. However, MLM predictions are not made auto-regressively, which limits the capability of the BERT family for generation tasks.
Raffel et al. (2019a) introduced the span corruption task in T5 as the pre-training objective, where a model predicts the masked token spans using an autoregressive model. It can handle generation tasks because the pre-training is done in a generative way. BART (Lewis et al., 2020) is similar to T5 but uses a slightly different pre-training objective, in which spans are masked from the input but the complete output is predicted. However, none of these works investigated pre-training for very long sequence inputs. They typically use a transformer (Vaswani et al., 2017) architecture as the backbone, whose complexity is quadratic in the input length, making them impractical for modeling very long sequence inputs.

Long text modeling. An extensive amount of work has also been done on modeling long texts such as documents. The work of Roy et al. (2016); Chen (2017); Wu et al. (2018) obtained document embeddings from word-level embeddings. Another line of research models long documents through hierarchical training: the work of Yang et al. (2016); Miculicich et al. (2018) employed Hierarchical Attention Networks for document classification and neural machine translation, and Guo et al. (2019) proposed using a hierarchical network to build document embeddings on top of sentence embeddings for parallel document mining.

More recent research has focused on improving the memory and computation efficiency of transformer models (Tay et al., 2020b, 2021) for handling long inputs. One type of approach uses non-full attention patterns to restrict the attention field range, reducing the attention complexity from O(n²) to O(n log n) or O(n); examples include Sinkhorn (Tay et al., 2020a), Longformer (Beltagy et al., 2020), ETC (Ainslie et al., 2020), and BigBird (Zaheer et al., 2020a). Another type of approach leverages low-rank approximations of the attention matrix, such as Linformer (Wang et al., 2020), Performer (Choromanski et al., 2021), Random Feature Attention (Peng et al., 2021), and LUNA (Ma et al., 2021).

7 Conclusion

This paper presented a new Transformer-based neural model called LongT5, with which we have explored the effects of scaling both input length and model size at the same time. Specifically, the main differences of LongT5 with respect to T5.1.1 are (1) a new scalable attention mechanism called Transient Global attention, which is a drop-in replacement for the standard T5 attention mechanism and hence can be used without needing additional side inputs to the model or modifications to the model inputs; and (2) the use of a PEGASUS-style Principle Sentences Generation pre-training objective.

Via experimentation on several challenging summarization and question answering datasets, we have explored the performance gains that can be achieved by scaling both input length and model size, resulting in state-of-the-art results on several datasets: arXiv, PubMed, BigPatent, MediaSum, and TriviaQA.

As part of our future work, we would like to pursue several directions, such as studying efficient attention mechanisms for the decoder and the decoder-to-encoder attention pieces of the model (both Local Attention and TGlobal attention are currently only applied to the encoder in LongT5). Additionally, we would like to incorporate additional long-input transformer ideas into the LongT5 architecture that could further improve model efficiency.

References
Joshua Ainslie, Santiago Ontañón, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured data in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.

Minmin Chen. 2017. Efficient vector representation for documents through corruption. In 5th International Conference on Learning Representations.

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. 2021. Rethinking attention with performers. In International Conference on Learning Representations.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana. Association for Computational Linguistics.

Peng Cui and Le Hu. 2021. Topic-guided abstractive multi-document summarization.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy. Association for Computational Linguistics.

Alexios Gidiotis and Grigorios Tsoumakas. 2020. A divide-and-conquer approach to the summarization of long documents. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:3029–3040.

Mandy Guo, Yinfei Yang, Keith Stevens, Daniel Cer, Heming Ge, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Hierarchical document encoder for parallel corpus mining. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 64–72, Florence, Italy. Association for Computational Linguistics.

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada. Association for Computational Linguistics.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. 2021. Luna: Linear unified nested attention. In Thirty-Fifth Conference on Neural Information Processing Systems.

Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947–2954, Brussels, Belgium. Association for Computational Linguistics.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. 2021. Random feature attention. In International Conference on Learning Representations.

Ofir Press, Noah A. Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input length extrapolation.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019a. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019b. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.

Tobias Rohde, Xiaoxia Wu, and Yinhan Liu. 2021. Hierarchical learning for generation with long source sequences.

Dwaipayan Roy, Debasis Ganguly, Mandar Mitra, and Gareth J. F. Jones. 2016. Representing documents and queries as sets of word embedded vectors for information retrieval. CoRR, abs/1606.07869.

Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213, Florence, Italy. Association for Computational Linguistics.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. 2020a. Sparse Sinkhorn attention. In ICML.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2021. Long Range Arena: A benchmark for efficient transformers. In International Conference on Learning Representations.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020b. Efficient transformers: A survey. ArXiv, abs/2009.06732.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California. Association for Computational Linguistics.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020a. Big Bird: Transformers for longer sequences. CoRR, abs/2007.14062.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020b. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, volume 33, pages 17283–17297. Curran Associates, Inc.

Yury Zemlyanskiy, Joshua Ainslie, Michiel de Jong, Philip Pham, Ilya Eckstein, and Fei Sha. 2021. ReadTwice: Reading very large documents with memories.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. CoRR, abs/1912.08777.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019b. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy. Association for Computational Linguistics.

Chenguang Zhu, Yang Liu, Jie Mei, and Michael Zeng. 2021. MediaSum: A large-scale media interview dataset for dialogue summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5927–5934, Online. Association for Computational Linguistics.
[Table 8: Summarization results comparing T5, T5 with PEGASUS-style Principle Sentences Generation (PSG) pre-training, and LongT5 with the best known approaches for the various datasets. All T5 scores are with the standard T5.1.1 model. All LongT5 scores are with models using TGlobal attention. For each task, we scale up the input length depending on the statistics of the inputs; thus not all of the tasks were scaled to 16k. We do not include the input lengths of other models because each model uses the input differently, and hence direct comparison is not possible.]