
LongT5: Efficient Text-To-Text Transformer for Long Sequences

Mandy Guo∗†, Joshua Ainslie∗†, David Uthus∗, Santiago Ontañón∗


Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang

Google Research
{xyguo, jainslie, duthus, santiontanon, jianmon, yhsung, yinfeiy}@google.com

∗ Equal contributions. † Corresponding authors.

Abstract

Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present LongT5, a new model that explores the effects of scaling both the input length and model size at the same time. Specifically, we integrate attention ideas from long-input transformers (ETC), and adopt pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization and question answering tasks, as well as outperform the original T5 models on these tasks. We have open sourced our architecture and training code, as well as our pre-trained model checkpoints.

Figure 1: The average ROUGE score ((R-1 + R-2 + R-L)/3) of LongT5 and baseline models on arXiv and PubMed summarization tasks (Cohan et al., 2018) with different input lengths (x axis). Baseline models: HAT-BART (Rohde et al., 2021), BigBird-PEGASUS (Zaheer et al., 2020b), PRIMER (Xiao et al., 2021), LED (Beltagy et al., 2020). The size of each circle roughly indicates the number of parameters of each model.

1 Introduction

Transformer models such as BERT (Devlin et al., 2019), and other variants (Liu et al., 2019; Radford et al., 2019; Raffel et al., 2019a; Lewis et al., 2020) have achieved state-of-the-art results on many challenging NLP tasks. Moreover, recent work in long-input transformers (Ainslie et al., 2020; Zaheer et al., 2020b; Beltagy et al., 2020; Tay et al., 2021) has shown that increasing the input length a Transformer is able to process results in further performance gains. Additionally, it is also known that increasing model size leads to performance gains in many tasks (Kaplan et al., 2020).

In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. To achieve this, we integrate long-input transformer attention and pre-training ideas into the scalable T5 (Raffel et al., 2019a) model architecture. The resulting model, as shown in Figure 1, achieves state-of-the-art performance on several tasks which require handling long sequence inputs.

Regarding attention, we design a new attention mechanism, which we call Transient Global (TGlobal), that mimics ETC's local/global mechanism (Ainslie et al., 2020). Importantly, TGlobal attention removes the need for the additional side inputs in ETC, in order to fit within the T5 architecture. The main idea of ETC's local/global mechanism is to introduce local sparsity in the attention mechanism to reduce the quadratic cost when scaling to long inputs. Specifically, ETC only allows tokens in the input (called the long input) to attend to a local neighborhood, and adds a secondary input called the global memory, through which tokens in the long input can attend to each other indirectly. One disadvantage of this mechanism is that it requires designing this secondary global input for each new problem. In order to adapt it to T5, our new TGlobal mechanism synthesizes these global tokens on the fly (as aggregations of groups of tokens in the input), at each attention layer. Our experiments show that this mechanism results in only a small degradation in performance with respect to full attention at the same input length, but allows the model to scale to much larger input lengths, resulting in significant performance gains.
Regarding pre-training, we adopt the pre-training strategy of the PEGASUS (Zhang et al., 2019a) model. This pre-training strategy was originally designed for abstractive summarization, but in our experiments, we found it also improves model performance on other tasks, such as question answering, and hence we adopted it in LongT5. The key idea is to mask out key (principle) sentences from a document and ask the model to reproduce them as a single string, as if it was a summary.

We evaluate LongT5 on several summarization and question answering tasks (see Sections 4.2.1 and 4.3.1 for detailed descriptions of these datasets). Thanks to the scaling of both input length and model size, we achieve state-of-the-art results on many of them.

The main contributions of this work are:

• A new Transformer architecture, LongT5, that allows for scaling both input length and model scale at the same time.

• A new attention mechanism (TGlobal), which mimics ETC's local/global mechanism but is a drop-in replacement for regular attention in existing Transformer architectures like T5.

• An analysis of model performance when varying both input length and model size of vanilla T5 and LongT5 models (pushing both models up to the maximum lengths they can handle before encountering memory issues), to understand the trade-offs in both performance and computation cost.

• State-of-the-art results on the arXiv, PubMed, BigPatent, MediaSum, and TriviaQA datasets. For Natural Questions, we used a slightly different formulation than the original task, and hence we do not make state-of-the-art claims.

• We open source our model architecture [1] and training code, as well as pre-trained model checkpoints, on GitHub [2].

[1] Published under the Flaxformer GitHub https://github.com/google/flaxformer/tree/main/flaxformer/architectures/longt5
[2] https://github.com/google-research/longt5

2 T5

T5 (Raffel et al., 2019a) is a transformer-based text-to-text pre-trained language model that is gaining popularity for its unified framework, which converts all text-based language problems into a text-to-text format, and for its ease of scaling up the number of parameters (from 60M to 11B parameters) with model parallelism. With a full attention transformer, T5 has been successfully applied to many NLP tasks, but these tasks only require shorter input sequences. This is due to the limitation of quadratic computation growth with respect to input sequence length, resulting in larger memory consumption and longer training time. Recently, Press et al. (2021) explored scaling up T5-style models at inference time to longer sequences than seen during training, but how to scale up T5-style models in input sequence length during training remains underexplored.

3 LongT5

3.1 Architecture

We extend the original T5 encoder with global-local attention sparsity patterns (Ainslie et al., 2020; Zaheer et al., 2020a) to handle long inputs. For the work reported in this paper, we used a standard T5 decoder since all of the tasks we considered require relatively short output sequence lengths. Architecturally, the main difference between T5 and LongT5 lies in the attention mechanism. We experiment with two attention mechanism variations for LongT5, illustrated in Figure 2: (1) Local Attention and (2) Transient Global Attention (TGlobal). Both variations preserve several properties of T5: relative position representations, support for example packing, and compatibility with T5 checkpoints.

3.1.1 Local Attention

For Local Attention, we simply replace the encoder self-attention operation in T5 with a sparse sliding-window local attention operation following the implementation in ETC (Ainslie et al., 2020). Specifically, for a given local radius r, this formulation only allows each token to attend to r tokens to the left and right of it (see Figure 2.a). We found r = 127 to be sufficient in practice, where r is the number of neighboring tokens to the left and to the right.

Local Attention does not introduce any new parameters and easily accommodates the attention masking required for example packing [3]. For a given choice of r, complexity is linear in input sequence length l: O(l × r).

[3] Example packing refers to packing more than one short example in the same input sequence to increase training efficiency. This is especially useful in LongT5, since with the large input lengths used in our model, if many examples are short, most of the input sequence would be dedicated to padding, wasting significant computation.
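To make the sliding-window pattern of Section 3.1.1 concrete, the sketch below builds the banded mask (each query attends only to keys within radius r) and applies it to standard dot-product attention. This is an illustrative dense-mask sketch in JAX written by us, not the Flaxformer implementation, which computes the window in a blocked, memory-efficient way instead of materializing an l × l mask.

import jax
import jax.numpy as jnp

def local_attention(q, k, v, r=127):
    # q, k, v: [seq_len, d] query/key/value projections for one sequence.
    # Dense-mask illustration: only O(l * r) score entries survive the mask.
    seq_len, d = q.shape
    pos = jnp.arange(seq_len)
    window = jnp.abs(pos[:, None] - pos[None, :]) <= r   # banded (local) mask
    scores = (q @ k.T) / jnp.sqrt(d)
    scores = jnp.where(window, scores, -1e9)             # hide out-of-window keys
    return jax.nn.softmax(scores, axis=-1) @ v

For example packing, this window mask would additionally be intersected with a block-diagonal mask so that tokens of one packed example never attend to tokens of another.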
Figure 2: Illustration of the two attention mechanisms we experimented with in LongT5. (a) In LongT5 Local Attention, each input token can attend to its neighborhood: r tokens to the left and r tokens to the right. (b) In LongT5 Transient Global (TGlobal) Attention, each input token can attend to its neighborhood (as in Local Attention), plus to all global tokens; each global token is the result of averaging k input tokens, followed by a LayerNorm.

3.1.2 Transient Global Attention (TGlobal)

To allow input tokens to interact with each other in each layer of the encoder at a longer range than Local Attention's local radius, we introduce Transient Global Attention as a modification of ETC's global-local attention in a "fixed blocks" pattern. Namely, we divide the input sequence into blocks of k tokens, and for each block we compute a global token by summing (and then normalizing) the embeddings of every token in the block (see Figure 2.b). Now when computing attention, we allow each input token to attend not only to nearby tokens like in Local Attention, but also to every global token. We call these global tokens transient because, in contrast to ETC-like global-local attention patterns, these tokens are dynamically constructed (and subsequently discarded) within each attention operation, removing any requirement for deciding which input tokens should be treated as "global".

TGlobal attention only introduces a couple of new parameters [4]: (1) T5-style relative position biases representing the distance from an input token's block to the block of each global token it is attending to, and (2) T5-style layer normalization parameters for normalizing each global token's embedding. The rest of the parameters are identical to T5, and we accommodate sequence packing by additionally masking attention from input tokens to global tokens of other examples. We found block size k = 16 to be sufficient in practice. Notice thus that TGlobal attention introduces a block of l × l/k additional attention key-value pairs to calculate on top of Local Attention (l input tokens, attending to l/k global tokens; represented by the rightmost rectangle in Figure 2.b), hence for input sequence length l, complexity is O(l(r + l/k)).

[4] For base models, we introduced 10k additional parameters, 25k for large, and 50k for xl.
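Below is a minimal dense sketch, written by us in JAX, of the TGlobal pattern described in this section: global tokens are synthesized by block-averaging the input, and every query attends to its local window plus all globals. For brevity it reuses the averaged embeddings directly as extra keys and values, and it omits the relative position biases, the learned LayerNorm on global tokens, and the packing masks used by the actual model.

import jax
import jax.numpy as jnp

def tglobal_attention(q, k, v, x, r=127, block=16):
    # q, k, v: [seq_len, d] projections; x: [seq_len, d] token embeddings used
    # to synthesize the transient global tokens (assumes seq_len % block == 0).
    seq_len, d = q.shape
    num_blocks = seq_len // block
    g = x.reshape(num_blocks, block, d).mean(axis=1)        # one global token per block
    pos = jnp.arange(seq_len)
    local = jnp.abs(pos[:, None] - pos[None, :]) <= r       # [seq_len, seq_len]
    see_globals = jnp.ones((seq_len, num_blocks), dtype=bool)
    mask = jnp.concatenate([local, see_globals], axis=1)
    keys = jnp.concatenate([k, g], axis=0)                   # globals are built and
    vals = jnp.concatenate([v, g], axis=0)                   # discarded per layer
    scores = (q @ keys.T) / jnp.sqrt(d)                      # [seq_len, seq_len + l/k]
    scores = jnp.where(mask, scores, -1e9)
    return jax.nn.softmax(scores, axis=-1) @ vals

The l/k extra key/value columns are exactly the additional work on top of Local Attention that gives the O(l(r + l/k)) complexity stated above.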
3.2 PEGASUS Principle Sentences Generation Pre-training

T5 is pre-trained with a span corruption objective, where spans of consecutive input tokens are replaced with a mask token and the model is trained to reconstruct the masked-out tokens. While it is effective, recent work on masked language modeling (MLM) (Liu et al., 2019; Zhang et al., 2019b) shows that carefully selecting the prediction objective could lead to significantly better performance. One argument is that predicting more informative tokens from the text could force the model to learn better semantics of the text. Motivated by that, we explore masking and generating the principle sentences from the text. In particular, we adopt the Gap Sentences Generation with Principle Ind-Uniq strategy from Zhang et al. (2019a), which was used for summarization pre-training.

Following Zhang et al. (2019a), we select the top-m scored (Principle) sentences based on ROUGE-F1 score (Lin, 2004), using s_i = rouge(x_i, D \ {x_i}), ∀i, where i is the sentence index and D is the collection of sentences in the document. Each sentence is scored independently (Ind), and each n-gram is only counted once (Uniq).
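As a concrete illustration of the Ind-Uniq selection just described, the Python sketch below scores each sentence independently against the rest of the document and returns the indices of the top-m sentences. It uses a unigram-set F1 as a stand-in for the full ROUGE-F1 of Zhang et al. (2019a), and the function names are ours.

def principle_sentence_indices(sentences, m):
    # Score each sentence independently (Ind) against the rest of the document,
    # counting each unigram only once (Uniq), and return the top-m indices.
    def f1(sentence, rest):
        pred = set(sentence.lower().split())
        ref = set(w for s in rest for w in s.lower().split())
        overlap = len(pred & ref)
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred)
        recall = overlap / len(ref)
        return 2 * precision * recall / (precision + recall)

    scores = [f1(s, sentences[:i] + sentences[i + 1:]) for i, s in enumerate(sentences)]
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:m]

During pre-training, the selected principle sentences are masked out of the input and concatenated to form the target string the model must generate.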
4 Experiments

4.1 Configurations

LongT5 is implemented using JAX [5] and the Flaxformer [6] library. Following the same setup as T5.1.1 [7], we consider models of 3 sizes: base (∼220M), large (∼770M), and xl (∼3B), and use the same cased English SentencePiece vocab model used by T5.1.1, which contains 32000 sentence pieces. We use a batch size of 128 and Adafactor as the optimizer in all experiments. We decided to use greedy decoding instead of beam search for all our experiments, even with the test sets; therefore, our results reported below could potentially be improved further by using beam search, but we would like to keep the setup consistent with our dev setup.

[5] https://github.com/google/jax
[6] https://github.com/google/flaxformer
[7] https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511

4.1.1 Pre-training

We pre-train LongT5 models for 1M steps on 4096 input sequence length and 910 output sequence length. We use the same inverse square-root learning rate schedule as T5, with the learning rate set to 1/sqrt(max(step, warm_up_steps)), where warm_up_steps is set to 10000. The same as T5.1.1, we pre-train LongT5 only on the C4 dataset (Raffel et al., 2019b), and we do not apply dropout during pre-training. As described in Section 3.2, we use the PEGASUS Principle Sentences Generation objective as our pre-training objective. The configuration is similar to what was described by Zhang et al. (2019a) for their larger models, except for the masked sentence ratio, for which we use a value of 0.2 instead of 0.45 [8]. In Section 5.3, we will show our ablation study between Principle Sentences Generation and Span Corruption.

[8] We briefly experimented with other values, but found 0.2 to work best with the downstream tasks of interest.
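For reference, the inverse square-root schedule above can be written in a few lines; this is our sketch of the stated formula, not the exact Flaxformer training code.

import jax.numpy as jnp

def inverse_sqrt_lr(step, warmup_steps=10000):
    # lr = 1 / sqrt(max(step, warmup_steps)): constant at 0.01 for the first
    # 10k steps, then decaying as the inverse square root of the step count.
    return 1.0 / jnp.sqrt(jnp.maximum(step, warmup_steps))

For example, inverse_sqrt_lr(1_000_000) evaluates to 0.001 at the end of the 1M-step pre-training run.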
4.1.2 Fine-tuning

For fine-tuning, we use a constant learning rate of 0.001 and a dropout rate of 0.1 for all tasks. For summarization tasks, we experiment with values of 4096, 8192, and 16384 for input lengths and 512 for output lengths. For QA tasks, we experiment with values starting at 512 and scaling up to 36864 for input lengths, and 128 for output lengths.

4.2 Evaluation on Summarization Tasks

We choose to benchmark our models on summarization tasks that cover various context lengths, because of their long context understanding and generative nature.

4.2.1 Datasets

LongT5 was benchmarked on the following six datasets.

CNN / Daily Mail (Nallapati et al., 2016) News articles from CNN and Daily Mail are used as input, and the article's summary bullets are the target summary.

PubMed (Cohan et al., 2018) Scientific documents were collected from PubMed, with a document's content used as input and its corresponding abstract as the target summary.

arXiv (Cohan et al., 2018) Similar to PubMed, but with documents taken from arXiv.

BigPatent (Sharma et al., 2019) U.S. patent documents, with the patent's details used as input and the patent's abstract as the target summary.

MediaSum (Zhu et al., 2021) Interview transcripts from CNN and NPR were used as input, and their corresponding topics and overviews were used as the target summary.

Multi-News (Fabbri et al., 2019) The task involves summarizing multiple news documents about a topic into a human-written summary.

Table 1 provides statistics for the number of examples in the train, validation, and test splits, and the average, median, max, and 90th percentile input sequence length. As can be seen, these datasets are long in input length, and would benefit from models that can model lengthier inputs. We included the CNN / Daily Mail dataset to benchmark on a common task, especially to see how using TGlobal attention impacts the model, despite the length of the inputs being smaller than in the other datasets.

4.2.2 Results

We compare LongT5 with various top approaches: BigBird-PEGASUS (Zaheer et al., 2020b), HAT-BART (Rohde et al., 2021), DANCER PEGASUS (Gidiotis and Tsoumakas, 2020), PRIMER (Xiao et al., 2021), TG-MultiSum (Cui and Hu, 2021), LED (Beltagy et al., 2020), and an application of BART by Zhu et al. (2021). For these comparisons, we use the common evaluation metrics of ROUGE-1, ROUGE-2, and ROUGE-L.
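The ROUGE numbers in the tables that follow can be computed with any standard ROUGE implementation; as one possible sketch (ours, not necessarily the paper's exact evaluation pipeline), the open-source rouge_score package reports the three metrics as follows.

from rouge_score import rouge_scorer

def rouge_1_2_l(prediction, reference):
    # F-measure (in %) for ROUGE-1, ROUGE-2, and ROUGE-L of one prediction.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, prediction)
    return {name: 100 * s.fmeasure for name, s in scores.items()}

Figure 1 condenses the three numbers into a single score by averaging them: (R-1 + R-2 + R-L) / 3.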
Dataset          | Example Count (Train / Validation / Test) | Input Length (Average / Median / Max / 90th percentile)
CNN / Daily Mail | 287,113 / 13,368 / 11,490                 | 982.39 / 894 / 5268 / 1659
arXiv            | 203,037 / 6,436 / 6,440                   | 10,720.18 / 8,519 / 378,825 / 20,170
PubMed           | 119,924 / 6,633 / 6,658                   | 4,747.97 / 3,883 / 452,915 / 8,883
BigPatent        | 1,207,222 / 67,068 / 67,072               | 6,537.32 / 5,236 / 294,004 / 11,328
MediaSum         | 443,596 / 10,000 / 10,000                 | 2,302.02 / 1,748 / 125,974 / 4,128
Multi-News       | 44,972 / 5,622 / 5,622                    | 2,593.81 / 1,902.5 / 683,544 / 4,853

Table 1: Statistics for the summarization datasets. Input length measured in tokens using a SentencePiece model.

As can be seen in Table 2, LongT5 is able to achieve state-of-the-art ROUGE scores for arXiv, PubMed, BigPatent, and MediaSum. For arXiv and PubMed, which are composed of longer inputs, being able to scale up to 16k input length helps LongT5 achieve strong results.

One dataset where LongT5 is not able to achieve state-of-the-art results is Multi-News. LongT5 is the 2nd best model, slightly worse than PRIMER. This is understandable, as the PRIMER model was pre-trained on a large corpus of documents related to news events, thus exposing the model to a corpus similar to that seen in Multi-News.

When looking at CNN / Daily Mail, we can see that LongT5 was comparable with HAT-BART, despite not having full attention. LongT5 did at least get stronger scores on the ROUGE-2 metric.

arXiv
Approach                   | R-1   | R-2   | R-L
DANCER PEGASUS             | 45.01 | 17.6  | 40.56
BigBird-PEGASUS (large)    | 46.63 | 19.02 | 41.77
HAT-BART                   | 46.68 | 19.07 | 42.17
LED (large)                | 46.63 | 19.62 | 41.83
PRIMER                     | 47.6  | 20.8  | 42.6
LongT5 (large - 16k input) | 48.28 | 21.63 | 44.11
LongT5 (xl - 16k input)    | 48.35 | 21.92 | 44.27

PubMed
Approach                   | R-1   | R-2   | R-L
DANCER PEGASUS             | 46.34 | 19.97 | 42.42
BigBird-PEGASUS (large)    | 46.32 | 20.65 | 42.33
HAT-BART                   | 48.36 | 21.43 | 37.00
LongT5 (large - 16k input) | 49.98 | 24.69 | 46.46
LongT5 (xl - 16k input)    | 50.23 | 24.76 | 46.67

BigPatent
Approach                   | R-1   | R-2   | R-L
BigBird-PEGASUS (large)    | 60.64 | 42.46 | 50.01
LongT5 (large - 16k input) | 70.38 | 56.81 | 62.73
LongT5 (xl - 16k input)    | 76.87 | 66.06 | 70.76

Multi-News
Approach                   | R-1   | R-2   | R-L
TG-MultiSum                | 47.10 | 17.55 | 20.73
PRIMER                     | 49.9  | 21.1  | 25.9
LongT5 (large - 8k input)  | 47.18 | 18.44 | 24.18
LongT5 (xl - 8k input)     | 48.17 | 19.43 | 24.94

MediaSum
Approach                   | R-1   | R-2   | R-L
BART (large)               | 35.09 | 18.05 | 31.44
LongT5 (large - 4k input)  | 35.54 | 19.04 | 32.20
LongT5 (xl - 4k input)     | 36.15 | 19.66 | 32.80

CNN / Daily Mail
Approach                   | R-1   | R-2   | R-L
HAT-BART                   | 44.48 | 21.31 | 41.52
LongT5 (large - 4k input)  | 42.49 | 20.51 | 40.18
LongT5 (xl - 4k input)     | 43.94 | 21.40 | 41.28

Table 2: Summarization results comparing LongT5 with the best known approaches. LongT5 scores are with models using TGlobal attention. For each task, we scale up the input length depending on the inputs' statistics, thus not all are scaled to 16k. For more results, please see Section A in the Appendix.

4.3 Evaluation on QA Tasks

For the evaluation on QA tasks, we choose two popular benchmarks, Natural Questions and TriviaQA, that require long context understanding.

4.3.1 Datasets

NaturalQuestions (NQ) Questions are real queries issued by multiple users to Google search that retrieve a Wikipedia page in the top five search results. Answer text is drawn from the search results (Kwiatkowski et al., 2019).

The original NQ dataset asks models to predict a short answer (including no-answer or yes/no) and a long answer. We framed the task as a seq2seq task and ignored the long answer. Hence, our results focus only on the short answer. Moreover, since our models predict answer texts instead of answer spans, our evaluation method differs slightly from the leader boards, and our results are not directly comparable to other existing approaches: (1) Since only the train and dev sets are publicly available, we use 90% of the official train set for training while using 10% as a hold-out dev set to fine-tune the hyperparameters and training epoch, and use the official dev set as our test set. (2) We benchmark LongT5 against the corresponding T5.1.1 models instead of directly comparing to the leader boards.
Dataset  | Example Count (Train / Validation / Test) | Input Length (Average / Median / Max / 90th percentile)
NQ       | 307,373 / 7,830 / -                       | 6,695.92 / 4,486 / 151,519 / 15,290.8
TriviaQA | 87,622 / 11,313 / 10,832                  | 69,082.51 / 45,011 / 1,174,918 / 150,643

Table 3: Statistics for the QA datasets. Input length measured in tokens using a SentencePiece model.

TriviaQA Trivia enthusiasts authored question-answer pairs. Answers are drawn from Wikipedia and Bing web search results, excluding trivia websites (Joshi et al., 2017).

We use the official train/validation splits for training and fine-tuning the hyperparameters and training epoch, then re-train that model combining both train and validation sets to evaluate on the Wikipedia domain on the leader board [9].

[9] https://competitions.codalab.org/competitions/17208

Table 3 shows the dataset statistics for the number of examples in the train and validation splits, and the average, median, max, and 90th percentile input sequence length.

4.3.2 Results

Table 4 shows a summary of the results for the NQ and TriviaQA datasets (see Appendix B for full results). For each dataset, we show two metrics: EM (Exact Match) and F1 score (evaluating precision and recall of individual words in the answer compared to the ground truth, ignoring stop words).
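The sketch below shows one simplified way these two metrics could be computed for a predicted answer string; the stop-word list and normalization are our own simplification of the description above, not the exact scoring script used for the paper.

STOP_WORDS = {"a", "an", "the", "in", "on", "of", "and", "or", "to", "is"}

def normalize(text):
    # Lowercase and drop stop words before comparing answer strings.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def exact_match(prediction, truth):
    return float(normalize(prediction) == normalize(truth))

def f1(prediction, truth):
    pred, gold = normalize(prediction), normalize(truth)
    common = sum(min(pred.count(w), gold.count(w)) for w in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)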
NQ
Approach                   | EM    | F1
T5.1.1 (base - 512 input)  | 50.93 | 52.54
T5.1.1 (base - 6k input)   | 56.73 | 56.73
T5.1.1 (large - 512 input) | 57.29 | 60.68
T5.1.1 (large - 3k input)  | 60.09 | 64.17
T5.1.1 (xl - 4k input)     | 60.75 | 64.07
Local:
LongT5 (base - 512 input)  | 54.39 | 58.24
LongT5 (base - 36k input)  | 55.77 | 59.66
LongT5 (large - 512 input) | 55.19 | 58.00
LongT5 (large - 10k input) | 60.01 | 64.40
TGlobal:
LongT5 (base - 512 input)  | 55.73 | 59.06
LongT5 (base - 12k input)  | 58.12 | 62.44
LongT5 (large - 512 input) | 57.55 | 61.53
LongT5 (large - 4k input)  | 60.77 | 65.38
LongT5 (large - 6k input)  | 59.17 | 63.38
LongT5 (xl - 8k input)     | 62.66 | 66.61

TriviaQA
Approach                   | EM    | F1
BigBird-ETC (random attn)  | 80.86 | 84.5
Fusion-in-Decoder          | 80.09 | 84.35
ReadTwice                  | 76.86 | 80.85
TGlobal:
LongT5 (base - 16k input)  | 74.67 | 78.9
LongT5 (large - 16k input) | 78.38 | 82.45
LongT5 (xl - 16k input)    | 81.00 | 84.83

Table 4: QA results: (1) NQ results comparing T5.1.1 and LongT5. Base/large models are trained on 4x8 TPUv3 with no model partitioning. Xl models are trained on 8x16 TPUv3 with 8 partitions. (2) TriviaQA results compared to top models on the leader board. LongT5 scores using Local and TGlobal attention. Full results in Appendix B.

For NQ, we compare T5.1.1, LongT5 with Local Attention, and LongT5 with TGlobal attention. We decided to run T5.1.1 (1) with the default 512 input sequence length [10] and (2) with the largest input sequence length that can fit into device memory [11], and use those as baselines. Since we are comparing against T5.1.1, for LongT5 experiments we report results at 512 input length for base and large, and the largest input length allowed by each model before running out of memory on the same hardware configuration used in our T5.1.1 experiments.

[10] For base and large models.
[11] For base and large models, we used 4x8 TPUv3 and no model partitioning; for the xl model, we used 8x16 TPUv3 and 8 partitions.

As the table shows, increasing input length generally results in significant benefits on NQ, with models with larger input lengths significantly outperforming those with smaller input lengths in most cases. Sometimes, models with the largest input lengths underperform those with 4k length, but we believe this to be due to noise in the experiments, as results are the output of just one repetition of each experiment due to resource constraints. Moreover, while LongT5 with Local Attention often underperforms T5.1.1, LongT5 with TGlobal attention significantly outperforms T5.1.1. For example, considering the large size models, T5.1.1 was only able to scale up to an input length of 3k tokens, while the TGlobal model was able to reach 6k tokens, outperforming T5.1.1 at 4k token length (there was a dip at 6k token length, but we hypothesize this is just due to variance, as we only did one run for each configuration).

For TriviaQA, we compare LongT5 with various top approaches on the leader board: BigBird-ETC (Zaheer et al., 2020a), Fusion-in-Decoder (Izacard and Grave, 2021), and ReadTwice (Zemlyanskiy et al., 2021). As shown in Table 3, TriviaQA inputs are quite long, therefore being able to scale up both in model size and to 16k input length helps LongT5 achieve state-of-the-art results.

5 Analysis

5.1 Input Length vs Speed

In order to evaluate the training speed and memory consumption of LongT5 compared to T5.1.1, we performed a series of training runs on the NQ dataset, starting at input length 512 and increasing the input length steadily until models ran out of memory on a 4x8 TPUv3 slice. Results are shown in Figure 3, which compares 6 different model configurations: T5.1.1 base, T5.1.1 large, LongT5 (base Local), LongT5 (large Local), LongT5 (base TGlobal), and LongT5 (large TGlobal). For each model configuration, we show a curve plotting the number of sequences per second processed during training (speed, on the vertical axis) for each input length (horizontal axis). Both axes are shown in logarithmic scale.

Figure 3: Sequences per second as a function of input length for T5.1.1, LongT5 with Local Attention, and LongT5 with TGlobal attention. Input lengths start at 512, and go as far as possible before running out of memory. Measurements taken with batch size 128, on 4x8 TPUv3 slices. Base and large model sizes shown.

We can see that at shorter lengths (512), T5.1.1, LongT5 Local, and LongT5 TGlobal have similar speeds, but as we increase the sequence length, LongT5 becomes significantly faster. For example, at sequence length 2048, T5.1.1 base can only process 479 sequences per second, while LongT5 (base TGlobal) can process 765 and LongT5 (base Local) can process 860. The differences grow even larger as sequence length increases.

Another important fact that Figure 3 shows is that T5.1.1 models reach their out-of-memory point much earlier. For example, we could only scale up to 6k tokens for T5.1.1 base. On the other hand, LongT5 (base Local) can go up to 36k tokens in length, and LongT5 (base TGlobal) up to 12k. Large models show a similar picture, with T5.1.1 large going only up to 3k, but the LongT5 variants going to 10k (large Local) and 6k (large TGlobal).

Figure 4: Speed versus performance on NQ (short-answer F1), for T5, LongT5 with Local Attention, and LongT5 with TGlobal attention, for different input sequence lengths. Input lengths start at 512, and go as far as possible before running out of memory. Measurements taken with batch size 128, on 4x8 TPUv3 slices.
5.2 Input Length vs Performance

This section presents a similar analysis, but where we plotted model speed versus performance on NQ (F1 score). Results are shown in Figure 4 for models with large size. Each point in the curves is annotated with the corresponding sequence length.

As Figure 4 shows, performance increases significantly as input length increases, highlighting the benefits of LongT5. Moreover, input length by itself is not enough to achieve good performance in all datasets; in particular, in the NQ dataset (used in this figure), using Local Attention significantly hurts performance when compared with TGlobal or with T5.1.1. So, even at very long input lengths, LongT5 with Local Attention just matches T5.1.1 with input length of 3k on NQ. However, LongT5 with TGlobal attention outperforms T5.1.1. Moreover, note that although the plot shows a few irregularities (such as the 8k length for LongT5 with Local Attention, or the 6k length with TGlobal Attention), that is because the plot shows only the results of a single run, and hence there is some noise. However, the trends can clearly be seen.

5.3 Principle Sentences Generation vs. Span Corruption

As mentioned in Section 3.2, we use PEGASUS Principle Sentences Generation instead of the default Span Corruption used in T5 as our pre-training objective. Table 5 shows our ablation study for fine-tuning on NQ and arXiv from a model pre-trained using the default Span Corruption objective, a model pre-trained with Principle Sentences Generation, and a model pre-trained with both objectives. The comparison is done on the dev set of the tasks, and with TGlobal base models. Both pre-training and fine-tuning of the models mentioned above are done with input sequence length 4096. The table shows that, even though Principle Sentences Generation was developed by Zhang et al. (2019a) as a pre-training strategy for summarization, it benefits both summarization and QA tasks, but using both objectives together performs worse than just using PSG.

Objective | NQ (EM / F1)  | arXiv (R-1 / R-2 / R-L)
PSG       | 62.21 / 66.94 | 44.95 / 18.74 / 40.99
SC        | 58.65 / 63.05 | 43.49 / 18.12 / 39.71
SC + PSG  | 59.74 / 64.54 | 44.85 / 18.79 / 40.90

Table 5: Ablation study on the dev set for different pre-training strategies using span corruption (SC) vs. principle sentences generation (PSG) and the effects on the NQ and arXiv fine-tuning tasks. The models are TGlobal base, and fine-tuning is done with input sequence length 4096.

Table 6 shows an additional ablation study with arXiv and PubMed, where we compare regular T5.1.1 with Span Corruption against T5.1.1 pre-trained with Principle Sentences Generation, while using the same pre-training input sequence length of 512 (as was done in the original T5.1.1 pre-training task). As expected, Principle Sentences Generation helped the model achieve better results compared to Span Corruption when seeing the same amount of pre-training data. We also compare this with dev scores from LongT5 with TGlobal attention at 4k and 16k input lengths; these show that having full attention allows for better results, but that being able to scale to longer input sequence lengths allows LongT5 to achieve its stronger results.

arXiv
Objective    | R-1   | R-2   | R-L
SC           | 44.59 | 18.34 | 40.65
PSG          | 45.78 | 18.94 | 41.53
LongT5 (4k)  | 45.66 | 19.22 | 41.49
LongT5 (16k) | 48.21 | 21.7  | 44.03

PubMed
Objective    | R-1   | R-2   | R-L
SC           | 47.86 | 22.14 | 44.39
PSG          | 48.74 | 23.42 | 45.24
LongT5 (4k)  | 48.47 | 23.38 | 45.01
LongT5 (16k) | 50.12 | 24.78 | 46.56

Table 6: Ablation study on arXiv and PubMed for different pre-training strategies using span corruption (SC) vs. principle sentences generation (PSG) with the T5.1.1 model, along with LongT5 with TGlobal attention. Fine-tuning was done on large model size, with input sequence length of 4096 except where otherwise noted.

6 Related Work

Language model pre-training followed by task-specific fine-tuning has proven to be a powerful tool for numerous NLP tasks (Devlin et al., 2019; Liu et al., 2019; Zhang et al., 2019b; Radford et al., 2019; Raffel et al., 2019a; Lewis et al., 2020; Joshi et al., 2020). BERT (Devlin et al., 2019) introduced the Masked Language Model (MLM), where a model predicts masked tokens given a sequence of text input. Fine-tuning a pre-trained BERT model has led to improved performance on various NLP tasks. However, MLM predictions are not made auto-regressively, which limits the capability of the
BERT family for generation tasks. Raffel et al. (2019a) introduced the span corruption task in T5 as the pre-training objective, where a model predicts the masked token spans using an autoregressive model. It can handle generation tasks as the pre-training is done in a generative way. BART (Lewis et al., 2020) is similar to T5 but used a slightly different pre-training objective, in which spans are masked from the input but the complete output is predicted. However, none of these works tried to investigate pre-training for very long sequence inputs. They often use a transformer (Vaswani et al., 2017) architecture as backbone, the complexity of which is quadratic in the input length, making them impractical for modeling very long sequence inputs.

Long text modeling An extensive amount of work has also been done on modeling long text like documents. The work of Roy et al. (2016); Chen (2017); Wu et al. (2018) obtained document embeddings from word-level embeddings. Another line of research tries to model long documents through hierarchical training. The work of Yang et al. (2016); Miculicich et al. (2018) employed Hierarchical Attention Networks for document classification and neural machine translation, and Guo et al. (2019) proposed using a hierarchy network to build document embeddings on top of sentence embeddings for parallel document mining.

More recent research has been focusing on improving the memory and computation efficiency of transformer models (Tay et al., 2020b, 2021) for handling long inputs. One type of approach uses non-full attention patterns to restrict the attention field range, reducing the attention complexity from O(n^2) to O(n log n) or O(n); this includes Sinkhorn (Tay et al., 2020a), Longformer (Beltagy et al., 2020), ETC (Ainslie et al., 2020), and BigBird (Zaheer et al., 2020a). Another type of approach leverages a low-rank approximation of the attention matrix, such as Linformer (Wang et al., 2020), Performer (Choromanski et al., 2021), Random Feature Attention (Peng et al., 2021), and LUNA (Ma et al., 2021).

7 Conclusion

This paper presented a new Transformer-based neural model called LongT5, with which we have explored the effects of scaling both input length and model size at the same time. Specifically, the main differences of LongT5 with respect to T5.1.1 are (1) a new scalable attention mechanism called Transient Global attention, which is a drop-in replacement for the standard T5 attention mechanism, and hence can be used without needing additional side-inputs to the model or modifications to the model inputs; and (2) using a PEGASUS-style Principle Sentences Generation pre-training objective.

Via experimentation on several challenging summarization and question answering datasets, we have explored the performance gains that can be achieved by scaling both input length and model size, resulting in state-of-the-art results on several datasets: arXiv, PubMed, BigPatent, MediaSum, and TriviaQA.

As part of our future work, we would like to pursue several directions, such as studying efficient attention mechanisms in the decoder and the decoder-to-encoder attention pieces of the model (both Local Attention and TGlobal attention are only applied to the encoder in LongT5 for now). Additionally, we would like to incorporate additional long-input transformer ideas into the LongT5 architecture that could further improve model efficiency.

References

Joshua Ainslie, Santiago Ontañón, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured data in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.

Minmin Chen. 2017. Efficient vector representation for documents through corruption. 5th International Conference on Learning Representations.

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. 2021. Rethinking attention with performers. In International Conference on Learning Representations.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana. Association for Computational Linguistics.
Peng Cui and Le Hu. 2021. Topic-guided abstractive multi-document summarization.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy. Association for Computational Linguistics.

Alexios Gidiotis and Grigorios Tsoumakas. 2020. A divide-and-conquer approach to the summarization of long documents. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:3029–3040.

Mandy Guo, Yinfei Yang, Keith Stevens, Daniel Cer, Heming Ge, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Hierarchical document encoder for parallel corpus mining. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 64–72, Florence, Italy. Association for Computational Linguistics.

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada. Association for Computational Linguistics.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. 2021. Luna: Linear unified nested attention. In Thirty-Fifth Conference on Neural Information Processing Systems.

Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947–2954, Brussels, Belgium. Association for Computational Linguistics.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. 2021. Random feature attention. In International Conference on Learning Representations.

Ofir Press, Noah A. Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input length extrapolation.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019a. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019b. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.

Tobias Rohde, Xiaoxia Wu, and Yinhan Liu. 2021. Hierarchical learning for generation with long source sequences.

Dwaipayan Roy, Debasis Ganguly, Mandar Mitra, and Gareth J. F. Jones. 2016. Representing documents and queries as sets of word embedded vectors for information retrieval. CoRR, abs/1606.07869.

Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213, Florence, Italy. Association for Computational Linguistics.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. 2020a. Sparse Sinkhorn attention. In ICML.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2021. Long Range Arena: A benchmark for efficient transformers. In International Conference on Learning Representations.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020b. Efficient transformers: A survey. ArXiv, abs/2009.06732.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity.

Lingfei Wu, Ian En-Hsu Yen, Kun Xu, Fangli Xu, Avinash Balakrishnan, Pin-Yu Chen, Pradeep Ravikumar, and Michael J. Witbrock. 2018. Word mover's embedding: From word2vec to document embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4524–4534. Association for Computational Linguistics.

Wen Xiao, Iz Beltagy, Giuseppe Carenini, and Arman Cohan. 2021. PRIMER: Pyramid-based masked sentence pre-training for multi-document summarization.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California. Association for Computational Linguistics.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020a. Big Bird: Transformers for longer sequences. CoRR, abs/2007.14062.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020b. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, volume 33, pages 17283–17297. Curran Associates, Inc.

Yury Zemlyanskiy, Joshua Ainslie, Michiel de Jong, Philip Pham, Ilya Eckstein, and Fei Sha. 2021. ReadTwice: Reading very large documents with memories.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. CoRR, abs/1912.08777.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019b. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy. Association for Computational Linguistics.

Chenguang Zhu, Yang Liu, Jie Mei, and Michael Zeng. 2021. MediaSum: A large-scale media interview dataset for dialogue summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5927–5934, Online. Association for Computational Linguistics.
A Summarization Results

Table 8 shows the full set of results on the summarization datasets used in this paper. This includes the standard T5 model (using version T5.1.1), T5 with PEGASUS Principle Sentences Generation pre-training, and the LongT5 model.

As can be seen, scaling up the input size helps the models achieve better performance metrics. T5 models struggle when scaling up to 4k input, though, as the fine-tuning task can take many days even when using a large topology of TPUv3.

When comparing the regular T5.1.1 model with a T5.1.1 model using PEGASUS Principle Sentences Generation pre-training, the latter was able to achieve better results, with the results also improving as the input size scaled up. This helps show that using the latter pre-training objective together with scaling up allows us to get the best results from these models.

LongT5, despite having a reduced attention from using TGlobal attention, is able to get strong performance results due to both scaling up to larger inputs and leveraging the Gap Sentences Generation pre-training strategy.

B QA Results

Table 7 shows the full set of results comparing T5.1.1 and LongT5 models on the QA datasets used in this paper. For both NQ and TriviaQA in this comparison study, we use 90% of the official training set for training while using 10% as a hold-out dev set to fine-tune the hyperparameters and training epoch, and use the official dev set to report the numbers in this table. We run each model to the largest input length allowed before running out of memory on a specific hardware configuration: base/large models on 4x8 TPUv3 with no model partitioning, and xl models on 8x16 TPUv3 with 8 partitions.

Approach      | NQ (EM / F1)  | TriviaQA (EM / F1)
base:
T5.1.1 (512)  | 50.93 / 52.54 | 48.91 / 52.89
T5.1.1 (6k)   | 56.73 / 56.73 | 59.09 / 63.31
large:
T5.1.1 (512)  | 57.29 / 60.68 | 53.26 / 57.01
T5.1.1 (3k)   | 60.09 / 64.17 | 60.15 / 64.15
xl:
T5.1.1 (4k)   | 60.75 / 64.07 | 65.33 / 69.43
base Local:
LongT5 (512)  | 54.39 / 58.24 | - / -
LongT5 (1k)   | 54.60 / 57.88 | - / -
LongT5 (2k)   | 56.48 / 60.56 | - / -
LongT5 (4k)   | 56.10 / 60.52 | - / -
LongT5 (8k)   | 55.90 / 59.98 | - / -
LongT5 (16k)  | 56.41 / 60.46 | - / -
LongT5 (32k)  | 55.84 / 59.59 | - / -
LongT5 (36k)  | 55.77 / 59.66 | - / -
base TGlobal:
LongT5 (512)  | 55.73 / 59.06 | - / -
LongT5 (1k)   | 57.41 / 61.25 | - / -
LongT5 (2k)   | 56.96 / 60.25 | - / -
LongT5 (4k)   | 58.97 / 63.03 | - / -
LongT5 (8k)   | 58.07 / 62.67 | - / -
LongT5 (12k)  | 58.12 / 62.44 | 63.27 / 67.42
large Local:
LongT5 (512)  | 55.19 / 58.00 | - / -
LongT5 (1k)   | 57.47 / 60.79 | - / -
LongT5 (2k)   | 58.49 / 62.12 | - / -
LongT5 (4k)   | 59.44 / 63.72 | - / -
LongT5 (8k)   | 58.66 / 62.28 | - / -
LongT5 (10k)  | 60.01 / 64.40 | - / -
large TGlobal:
LongT5 (512)  | 57.55 / 61.53 | - / -
LongT5 (1k)   | 59.69 / 63.91 | - / -
LongT5 (4k)   | 60.77 / 65.38 | - / -
LongT5 (6k)   | 59.17 / 63.38 | 63.76 / 67.82
xl TGlobal:
LongT5 (4k)   | 62.38 / 66.39 | - / -
LongT5 (8k)   | 62.66 / 66.61 | 67.89 / 71.71

Table 7: QA results comparing T5.1.1 and LongT5 at different sequence lengths. Base and large models are trained on 4x8 TPUv3 with no model partitioning, and xl models are trained on 8x16 TPUv3 with 8 partitions.
arXiv PubMed
Approach R-1 R-2 R-L R-1 R-2 R-L
DANCER PEGASUS 45.01 17.6 40.56 46.34 19.97 42.42
BigBird-PEGASUS (large) 46.63 19.02 41.77 46.32 20.65 42.33
HAT-BART 46.68 19.07 42.17 48.36 21.43 37.00
LED (large) 46.63 19.62 41.83 - - -
PRIMER 47.6 20.8 42.6 - - -
T5.1.1 (large - 1k input) 39.79 14.02 36.23 42.18 16.60 38.96
T5.1.1 (large - 2k input) 42.84 16.62 39.01 45.51 19.55 42.10
T5.1.1 (large - 4k input) 44.51 18.20 40.62 47.90 22.08 44.36
T5.1.1 + PSG (large - 1k input) 38.53 13.61 35.08 43.34 17.55 40.10
T5.1.1 + PSG (large - 2k input) 42.85 16.50 38.99 46.51 20.37 43.00
T5.1.1 + PSG (large - 4k input) 45.86 18.40 41.62 48.94 22.92 45.4
LongT5 (base - 4k input) 44.87 18.54 40.97 47.77 22.58 44.38
LongT5 (large - 4k input) 45.64 18.6 41.51 48.38 23.32 44.93
LongT5 (large - 8k input) 46.61 19.67 42.44 49.81 24.3 46.26
LongT5 (large - 16k input) 48.28 21.63 44.11 49.98 24.69 46.46
LongT5 (xl - 4k input) 45.99 19.51 42.04 48.99 23.48 45.51
LongT5 (xl - 8k input) 47.44 20.84 43.34 50.04 24.45 46.42
LongT5 (xl - 16k input) 48.35 21.92 44.27 50.23 24.76 46.67
BigPatent MultiNews
Approach R-1 R-2 R-L R-1 R-2 R-L
BigBird-PEGASUS (large) 60.64 42.46 50.01 - - -
TG-MultiSum - - - 47.10 17.55 20.73
PRIMER - - - 49.9 21.1 25.9
T5.1.1 (large - 1k input) 55.07 37.49 45.90 43.69 16.26 23.03
T5.1.1 (large - 2k input) 60.07 43.49 50.90 44.95 17.26 23.74
T5.1.1 (large - 4k input) 62.14 45.85 52.95 45.67 17.88 24.15
T5.1.1 + PSG (large - 1k input) 58.58 41.80 49.74 44.43 15.85 22.41
T5.1.1 + PSG (large - 2k input) 64.51 49.15 56.01 46.65 17.74 23.74
T5.1.1 + PSG (large - 4k input) 67.05 52.24 58.70 47.48 18.60 24.31
LongT5 (base - 4k input) 60.95 44.22 51.52 46.01 17.37 23.5
LongT5 (large - 4k input) 66.17 51.10 57.70 46.99 18.21 24.08
LongT5 (large - 8k input) 67.42 52.62 59.04 47.18 18.44 24.18
LongT5 (large - 16k input) 70.38 56.81 62.73 - - -
LongT5 (xl - 4k input) 75.82 64.64 69.54 48.15 19.30 24.76
LongT5 (xl - 8k input) 76.39 65.37 70.16 48.17 19.43 24.94
LongT5 (xl - 16k input) 76.87 66.06 70.76 - - -
MediaSum CNN / Daily Mail
Approach R-1 R-2 R-L R-1 R-2 R-L
HAT-BART - - - 44.48 21.31 41.52
BART (large) 35.09 18.05 31.44 - - -
T5.1.1 (large - 1k input) 30.68 14.88 27.88 42.60 20.41 40.03
T5.1.1 (large - 2k input) 32.83 16.75 29.79 42.55 20.25 39.99
T5.1.1 (large - 4k input) 34.37 18.09 31.12 42.27 19.93 39.72
T5.1.1 + PSG (large - 1k input) 32.02 16.15 28.89 42.62 20.46 40.02
T5.1.1 + PSG (large - 2k input) 34.04 17.87 30.77 42.69 20.40 40.06
T5.1.1 + PSG (large - 4k input) 36.11 19.48 32.67 43.41 20.99 40.77
LongT5 (base - 4k input) 35.09 18.35 31.87 42.15 20.11 39.6
LongT5 (large - 4k input) 35.54 19.04 32.20 42.49 20.51 40.18
LongT5 (xl - 4k input) 36.15 19.66 32.80 43.94 21.40 41.28

Table 8: Summarization results comparing T5, T5 with PEGASUS-style Principle Sentences Generation (PSG)
pre-training, and LongT5 with best known approaches for the various datasets. All T5 scores are with standard
T5.1.1 model. All LongT5 scores are with models using TGlobal attention. For each task, we scale up the input
length depending on the statistics of the inputs, thus not all of the tasks were scaled to 16k. We do not include input
length of other models because each model uses the input differently, and hence, direct comparison is not possible.
