
PaReprop: Fast Parallelized Reversible Backpropagation

Tyler Zhu∗ Karttikeya Mangalam∗



UC Berkeley
tyler.zhu@berkeley.edu

∗ denotes equal technical contribution

Abstract

The growing size of datasets and deep learning models has made faster and memory-efficient training crucial. Reversible transformers have recently been introduced as an exciting new method for extremely memory-efficient training, but they come with an additional computation overhead of activation re-computation in the backpropagation phase. We present PaReprop, a fast Parallelized Reversible Backpropagation algorithm that parallelizes the additional activation re-computation overhead in reversible training with the gradient computation itself in the backpropagation phase. We demonstrate the effectiveness of the proposed PaReprop algorithm through extensive benchmarking across model families (ViT, MViT, Swin and RoBERTa), data modalities (Vision & NLP), model sizes (from small to giant), and training batch sizes. Our empirical results show that PaReprop achieves up to 20% higher training throughput than vanilla reversible training, largely mitigating the theoretical overhead of 25% lower throughput from activation re-computation in reversible training. Project page: https://tylerzhu.com/pareprop.

[Figure 1: bar chart of training throughput (im/s) for PaReprop vs. Reprop across Swin backbones Tiny, Small, Base, Large, Huge, and Giant.]

Figure 1. PaReprop vs. Reprop for Swin Transformer. We benchmark our proposed Parallelized Reversible backpropagation (PaReprop) algorithm against vanilla reversible backpropagation (Reprop). By exploiting parallelized kernels, PaReprop achieves up to 20% training throughput gain without any change to the underlying computation, thereby ensuring accuracy. Notably, deeper transformers enjoy better throughput improvement, a desirable quality for scaling memory-efficient reversible training.
1. Introduction

The field of deep learning has made great strides in recent years on the back of large-scale models. By increasing the size of models, we have achieved state-of-the-art performance again and again on a variety of tasks, such as image recognition [7], language modeling [3], and speech recognition [1]. However, as these models become larger, it becomes equally important to improve their efficiency so that they can be deployed in real-world applications. Models need to be able to run on smaller amounts of compute and run faster. At the huge scale of today's models, small speedups can lead to large gains over the course of the many epochs and training weeks required to train them.

Vision transformers achieve their impressive results by stacking multiple layers of transformer blocks. However, this increase in depth is quite costly, as the number of activations that need to be stored during training increases linearly with the depth of the model. Reversible vision transformers [20], as well as other reversible architectures [10, 14, 22], are one recent line of work that offers a promising approach to improving the efficiency of large models by decoupling the memory needed from the depth through reversible activations. This allows them to maintain top performance while requiring less memory.

The framework of reversible transformations, however, offers a further tradeoff: for a tiny amount of additional memory, we can improve throughput by a significant amount using parallelization. This takes advantage of the independence between the recomputations and the gradient updates in the backward pass, which allows them to be computed simultaneously in theory. From this observation, we introduce a method for parallelizing reversible backpropagation which we call PaReprop, or Parallelized Reversible backpropagation. Our method is general enough to be applied to any reversible architecture, and we demonstrate this by testing on a wide variety of reversible architectures, hardware, and memory settings, showing that we can achieve significant speedups in practice. In summary, we make the following contributions:

1. We propose a novel method for parallelizing reversible backpropagation which is compatible with modern auto-differentiation packages like PyTorch.

2. Our method achieves significant speedups across a diverse set of model families, data modalities, model sizes, and training batch sizes. We increase training throughput for all of our models while maintaining the same accuracy as the original model.

3. PaReprop scales better on throughput with memory than standard reversible backpropagation. In particular, PaReprop leads to more favorable memory vs. throughput trade-offs, i.e. our method can achieve higher throughput at any given threshold of memory used.

2. Related Works

Reversible architectures are a type of neural network architecture based on NICE [5, 6], which was an early model for generative flow-based image generation [12, 13]. Reversible ResNet [10] is a type of reversible architecture that uses the NICE invertible transformations to enable memory-efficient image classification. Other researchers have built upon this idea by proposing improved reversible CNN models using ODE characterizations [4, 15, 21], momentum [15, 21], and several other improvements [2, 9, 11, 22]. Recently, reversible networks have also been adapted to core NLP tasks in Reformer [14] and to several core vision tasks in Rev-ViT [20]. Crucially, Rev-ViT [20], under very strict parity constraints on the parameters, FLOPs, and activation sizes of the proposed models, shows reversible models to be an equivalently powerful class of models as vanilla transformers, but with the added benefit of extremely memory-efficient training.

Extensions of reversible transformers have recently been proposed that allow them to be used in more general settings. Rev-ViT and Rev-MViT were proposed in [20] as demonstrations that reversible architectures can be made from both isotropic and hierarchical transformers. Recently, work has also shown that non-reversible checkpoints of models can be rewired to become reversible backbones for temporal action localization very efficiently, using much less compute than it would take to train a model from scratch [23]. Similar work has also been done to extend these results to natural language transformers such as BERT, BART, and OPT [17]. Our work adds to these efforts by additionally proposing Rev-Swin and Rev-RoBERTa models [18, 19], as well as an orthogonal throughput improvement to all of these methods.

Figure 2. Illustration of the Reversible Transformation with arbitrary functions F, G and the forward process (left) and backward process (right). See Eq. 1 for the mathematical definition.

3. Parallelized Reversible Backpropagation

We begin with a review of the reversible transformation (Section 3.1), and subsequently introduce the vanilla memory-efficient reversible training algorithm (Section 3.2) and its application to training modern transformers. Then, we present PaReprop, our proposed procedure for speeding up reversible backpropagation with parallelized activation and gradient computation (Section 3.3).

3.1. Reversible Transformations

A reversible transformation T maps inputs I1 and I2 to outputs O1 and O2 in an invertible process, even if there is no analytical inverse to the transformation. We will use intermediate functions F and G, which need not be invertible. The reversible transformation applies each function, F and then G, one at a time, first to I1 and then to I2 + F(I1), and adds the result to the other input, providing a simple method to recover the inputs from the outputs. This process is illustrated in Figure 2, and the final transformation is given by

I = \begin{bmatrix} I_1 \\ I_2 \end{bmatrix} \;\xmapsto{T}\; \begin{bmatrix} I_1 + G(I_2 + F(I_1)) \\ I_2 + F(I_1) \end{bmatrix} = \begin{bmatrix} O_1 \\ O_2 \end{bmatrix} := O \qquad (1)

It requires one call each of F(·) and G(·) to compute either T or its inverse T′, so both the forward and the backward pass require the same computational cost.
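As a concrete illustration of Eq. 1, the following minimal PyTorch sketch implements the transformation T and its inverse for arbitrary callables F and G. The linear layers in the usage check are stand-ins of our own choosing, not the paper's sub-blocks.

```python
import torch

def reversible_forward(I1, I2, F, G):
    """Eq. 1: (I1, I2) -> (O1, O2) with arbitrary (possibly non-invertible) F, G."""
    O2 = I2 + F(I1)          # second output depends only on the inputs
    O1 = I1 + G(O2)          # first output reuses O2, i.e. I1 + G(I2 + F(I1))
    return O1, O2

def reversible_inverse(O1, O2, F, G):
    """Recover (I1, I2) from (O1, O2) using one extra call of F and G each."""
    I1 = O1 - G(O2)          # undo the second coupling step
    I2 = O2 - F(I1)          # undo the first coupling step
    return I1, I2

if __name__ == "__main__":
    F = torch.nn.Linear(64, 64)   # illustrative stand-ins for the sub-blocks
    G = torch.nn.Linear(64, 64)
    I1, I2 = torch.randn(2, 64), torch.randn(2, 64)
    O1, O2 = reversible_forward(I1, I2, F, G)
    R1, R2 = reversible_inverse(O1, O2, F, G)
    print(torch.allclose(R1, I1, atol=1e-5), torch.allclose(R2, I2, atol=1e-5))
```

The recovery is exact up to floating-point rounding, which is what allows activations to be discarded during the forward pass.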
3.2. Reversible Transformers

Reversible Transformers [10, 20] are a family of memory-efficient models based on these reversible transformations. For their application, we will make use of the property that they can perfectly recompute any input I from its output O, which can be used at the granularity of transformer blocks. For vision transformers, F(·) is set to be the attention block while G(·) is set to be the MLP block.
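To make this concrete, the sketch below wires the coupling of Eq. 1 into a single transformer block with F as the attention sub-block and G as the MLP sub-block. The pre-norm placement, head count, and MLP ratio are illustrative assumptions rather than the exact Rev-ViT/Rev-Swin configuration.

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """One reversible transformer block: F = attention, G = MLP (a sketch;
    the exact sub-block layout of Rev-ViT/Rev-Swin may differ)."""

    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def F(self, x):                       # attention sub-block
        x = self.norm1(x)
        return self.attn(x, x, x, need_weights=False)[0]

    def G(self, x):                       # MLP sub-block
        return self.mlp(self.norm2(x))

    def forward(self, I1, I2):            # Eq. 1
        O2 = I2 + self.F(I1)
        O1 = I1 + self.G(O2)
        return O1, O2

    def inverse(self, O1, O2):            # exact input recovery
        I1 = O1 - self.G(O2)
        I2 = O2 - self.F(I1)
        return I1, I2
```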

[Figure 3: schematic timelines for Backprop, Reprop, and PaReprop, plotting memory (GiB) against time elapsed as blocks 0–3 run forward passes, backward passes with gradient updates, and activation recomputations; PaReprop uses two streams (stream0, stream1).]

Figure 3. Parallelized Reversible backpropagation. PaReprop (bottom) parallelizes the activation re-computation stage of block i − 1 (green blocks) with the gradient computation stage of block i. Reversible training drastically alleviates the training memory burden of vanilla networks (top vs. middle rows) but introduces the additional computational burden of activation re-computation in the backward pass (see [20]). PaReprop further alleviates this re-computation burden, thereby making reversible architectures a practical choice for deep transformer training.

Figure 4. Reversible Swin. [20] introduces the Reversible ViT and MViT architectures. Following the same principles, we introduce the reversible Swin architecture and benchmark all three reversible architectures with PaReprop. We showcase (a) a typical Swin block, as well as (b) a downsample block for processing at multiple hierarchies of scale.

This means that during our forward pass, we do not need to store any activations. In the backward pass, we recompute the activations of the current block using its output, and then backpropagate to recover our gradients and update our weights. We then delete the activations of the current block and repeat this process for the next block; this is normal Reversible Backpropagation, or Reprop, in Figure 3.

However, the backward pass of block N, i.e. the gradient update, is not needed to recompute the activations of block N − 1. Therefore, if we can hide the forward pass of block N − 1 within the backward pass of block N, we can theoretically speed up our computation to be on par with that of normal backprop. This is our key insight, which we will now present in detail.
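Before describing the parallelization, here is a minimal sketch of the sequential Reprop backward loop just described. It assumes blocks expose the inverse method from the earlier sketch; real implementations wrap this logic in a custom torch.autograd.Function rather than a plain loop.

```python
import torch

def reprop_backward(blocks, O1, O2, dO1, dO2):
    """Sequential reversible backprop (Reprop): for each block, recompute its
    inputs from its outputs, replay the forward pass under autograd, then
    backpropagate. Conceptual sketch only."""
    for block in reversed(blocks):
        with torch.no_grad():                          # activation recomputation
            I1, I2 = block.inverse(O1, O2)
        I1 = I1.detach().requires_grad_()
        I2 = I2.detach().requires_grad_()
        R1, R2 = block(I1, I2)                         # rebuild the local graph
        torch.autograd.backward((R1, R2), (dO1, dO2))  # gradient computation
        O1, O2 = I1, I2                                # outputs of the block below
        dO1, dO2 = I1.grad, I2.grad                    # gradients to pass down
    return dO1, dO2
```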
3.3. Parallelizing with Two Streams

Our method's key contribution is both theoretical and practical. The first is illustrated in Figure 3, where our PaReprop method parallelizes the backward pass of normal reversible backprop so that it takes nearly the same time as vanilla backprop. We do this by performing the gradient update for block N at the same time as the activation recomputation for block N − 1, as there is no dependence between them once we have the activations for block N. Approximating a backward pass as taking twice the time of a forward pass, each reversible block costs roughly one forward, one recomputation forward, and two forward-equivalents of backward; hiding the recomputation removes about one unit in four, so theoretically we can hide 25% of the computation and see this speedup in our throughput.

However, achieving this parallelization is rather tricky. Standard autodifferentiation frameworks like PyTorch use a single CUDA stream by default to maintain sequential ordering of GPU operations. Our method extends this by maintaining multiple streams to parallelize operations over. However, these streams enforce that forward and backward passes occur on the same stream, which causes issues if we implement PaReprop naively by keeping one stream for the activation recomputation and another for the gradient updates. This necessitates our alternative computation scheme.

Another problem is that PyTorch is unable to free memory efficiently in parallel processes, as there is no asynchronous implementation of CUDA free yet. Thus, it requires a costly CUDA free operation which synchronizes our streams and slows our process down significantly. In practice, it is most beneficial to run our PaReprop method at anywhere from 33% to 50% of the empirical maximum batch size so that we do not hit this trap.
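The sketch below illustrates the idea with the naive two-stream scheme: a side stream prefetches block i − 1's inputs while the default stream computes block i's gradients. As noted above, this naive splitting runs into PyTorch's stream constraints and memory-freeing costs, so it should be read as a conceptual illustration rather than the exact scheduling PaReprop uses; it also assumes a CUDA device and the inverse method from the earlier sketches.

```python
import torch

def pareprop_backward(blocks, O1, O2, dO1, dO2):
    """PaReprop sketch: overlap block i-1's activation recomputation (side
    CUDA stream) with block i's gradient computation (default stream)."""
    side = torch.cuda.Stream()
    with torch.no_grad():
        I1, I2 = blocks[-1].inverse(O1, O2)             # inputs of the last block
    for i in reversed(range(len(blocks))):
        if i > 0:
            side.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(side), torch.no_grad():
                P1, P2 = blocks[i - 1].inverse(I1, I2)  # prefetch block i-1 inputs
        J1 = I1.detach().requires_grad_()
        J2 = I2.detach().requires_grad_()
        R1, R2 = blocks[i](J1, J2)                      # gradient computation
        torch.autograd.backward((R1, R2), (dO1, dO2))
        dO1, dO2 = J1.grad, J2.grad
        if i > 0:
            torch.cuda.current_stream().wait_stream(side)
            I1, I2 = P1, P2                             # inputs for the next iteration
    return dO1, dO2
```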

[Figure 5: Throughput (ims/sec) vs. batch size for vision architectures, PaReprop vs. Reprop. Rows: ViT (Small, Base, Large), MViT (Base, Large, Huge), Swin (Large, Huge, Giant); batch sizes span roughly 2^1 to 2^10.]

Figure 5. PaReprop Vision Training throughput vs. Batch size across model architectures and sizes. 1σ error bars are shown. Accuracies are kept intact with PaReprop (Section 4).

3.4. Reversible Swin and RoBERTa

In order to demonstrate the generality of our method to other architectures, we propose two novel reversible architectures based on established models: Swin Transformer and RoBERTa. Figure 4 shows our modifications to the original Swin architecture. We follow suit with the original reversible ViT authors and choose to keep the Attention blocks and MLP blocks as our reversible subfunctions. We also demonstrate how we handle multiscale features in the architecture by utilizing a fusion block (either averaging or a simple MLP) before the usual patch merging layer.
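Since the text only describes the fusion at a high level (averaging or a simple MLP applied to the two residual streams before the usual patch merging layer), the following is a loose sketch under those assumptions; the class name, the concatenate-then-project variant, and the interface are our own guesses rather than the paper's design.

```python
import torch
import torch.nn as nn

class StreamFusion(nn.Module):
    """Fuse the two reversible streams before Swin's patch-merging layer.
    The paper mentions either averaging or a simple MLP; both options are
    sketched here with illustrative details."""

    def __init__(self, dim, mode="mlp"):
        super().__init__()
        self.mode = mode
        self.proj = nn.Linear(2 * dim, dim) if mode == "mlp" else None

    def forward(self, I1, I2):
        if self.mode == "avg":
            fused = 0.5 * (I1 + I2)                         # simple averaging
        else:
            fused = self.proj(torch.cat([I1, I2], dim=-1))  # simple MLP fusion
        return fused  # the fused tensor then goes through the usual patch merging
```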
4. Results

In this section, we present the experimental results of our proposed method, denoted as PaReprop, in comparison with the original reversible ViT, denoted as Reprop. We analyze our method over the choice of backbone size (from 5.5M to over 2B parameters), architecture class (ViT [7], Swin [19], MViT [8, 16], RoBERTa [18]), data modalities (Vision and NLP), GPU training memory allocated, input sequence length (for RoBERTa), and batch size. The primary metric we are concerned with is training throughput (images/sec or sequences/sec), the number of images (or sequences) we can process per second, as our measure of speed. We benchmark on 224×224 image classification and standard text classification.

As shown in [20], reversible models attain the same accuracy as the original ViT, and we do not change the underlying algorithm but simply propose a faster implementation, so we focus on analyzing the speedup of our method rather than its accuracy. All of our results are run on a single NVIDIA A100 GPU with 80GB of memory, with AdamW as our optimizer. Our results were similar on an NVIDIA RTX 2080 Ti as well as with SGD.

4.1. PaReprop Training Throughput is Better

In the first experiment, we compare our PaReprop method with the original Reprop method used in the original reversible ViT. We compare the top throughput achieved over batch sizes of {1, 2, 4, . . . , 256, 1024} (as allowed by memory) for each backbone. In these cases, we run on standard image classification, but our results hold over any choice of task (video understanding, etc.).
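For context, training throughput in this sense can be estimated by timing a fixed number of training steps and dividing the number of images processed by the elapsed wall-clock time. The sketch below assumes a user-supplied train_step and batch tensor and a CUDA device; it is not the benchmarking harness used in the paper.

```python
import time
import torch

def measure_throughput(train_step, batch, n_warmup=10, n_iters=50):
    """Estimate training throughput (images/sec) for a given training step."""
    for _ in range(n_warmup):
        train_step(batch)                 # warm up CUDA kernels and the allocator
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        train_step(batch)
    torch.cuda.synchronize()              # wait for queued GPU work to finish
    elapsed = time.perf_counter() - start
    return n_iters * batch.shape[0] / elapsed
```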

[Figure 6: Throughput (seqs/sec) vs. batch size for Rev-RoBERTa Base and Large at sequence lengths 512 and 1024, PaReprop vs. Reprop.]

Figure 6. PaReprop NLP Training Throughput across Sequence length and Batch Sizes. Comparison of our method on RoBERTa, a language transformer. One s.d. error bars are shown.

[Figure 7: Memory (GiB) vs. batch size for Swin architectures (Tiny, Small, Base, Large, Huge, Giant), comparing PaReprop, Reprop, and Backprop.]

Figure 7. PaReprop Memory vs. Batch size. Our method, PaReprop, as well as standard reversible backprop, Reprop, use comparatively much less memory than non-reversible approaches (Backprop). The memory increase to achieve the proposed speedups is negligible compared to the overall savings.
We see that across three different architecture families (ViT, MViT, Swin) and three choices of model sizes, our method outperforms Reprop, in some cases drastically.

Vision Transformers show a mostly matched throughput between PaReprop and standard reversible backpropagation (Figure 5, top row). We find that because Reprop can already utilize a large batch size, the GPU utilization is quite high and the PaReprop kernels are unable to run in parallel.

Hierarchical Transformers enjoy a much more pronounced benefit with PaReprop, such as Multiscale Vision Transformer (Figure 5, middle row) and Swin Transformer (Figure 5, bottom row; Figure 1). Hierarchical transformers have non-homogeneous architectures that allow PaReprop to hide the recomputation latency. They have several small kernels that can be parallelized more effectively, which leads to significant speedups of 9.8% on the largest MViT-H and a 19.3% increase on the largest Swin-G, consistently outperforming the standard reversible backpropagation method (Reprop). This shows that PaReprop provides significant speedups for vision-specific models.

NLP Transformers. Finally, we also clearly demonstrate PaReprop gains on the natural language modality. Reversible models have also successfully been applied to language tasks such as Reformer. Following [20], we extend RoBERTa [18] to Rev-RoBERTa and provide throughput benchmarking results on sentiment analysis. Note that, as shown in Rev-ViT [20], using a simple reversible rewiring of the model maintains the original accuracy of the vanilla model. Figure 6 shows our method outperforming the original Reprop by large amounts across both choices of model size (Base and Large) and sequence lengths (512 and 1024).

4.2. Training GPU Memory Trends

We also investigate the effect of memory on our method (Figure 7). Specifically, we compare the memory used by our method with the original Reprop and vanilla backprop, and show that our method is more memory efficient. As shown in the plots, using any kind of reversible backpropagation offers memory savings of up to almost 3x in some cases, which allows us to significantly extend the batch sizes that we can use. In these scenarios, our parallelized reversible backpropagation requires only an extra fraction of the small amount of memory being used to maintain parallelization, and allows us to achieve higher throughputs. This finding is consistent across all model architectures we tested, but we illustrate most of our findings on Swin in Figure 7 for simplicity.

5. Conclusion

We present Parallelized Reversible Backpropagation (PaReprop), a fast reversible backpropagation algorithm for training reversible transformers in both Vision (Rev-ViT, Rev-Swin, Rev-MViT) and NLP (Rev-RoBERTa). PaReprop parallelizes the backward activation recomputation, an additional overhead introduced in reversible networks, with the gradient computation itself in backpropagation. By allowing recomputation to overlap with gradient calculation, the additional latency of recomputation is hidden, thereby increasing training throughput while still maintaining extremely memory-efficient training. Our method maintains the same accuracy as the vanilla method and achieves better throughput on all of the models we tested, outperforming some by up to 20%.

References

[1] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12449–12460. Curran Associates, Inc., 2020.

[2] Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, pages 573–582. PMLR, 2019.

[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.

[4] Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

[5] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

[6] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.

[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. ICLR, 2021.

[8] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proc. ICCV, 2021.

[9] Marc Finzi, Pavel Izmailov, Wesley Maddox, Polina Kirichenko, and Andrew Gordon Wilson. Invertible convolutional networks. In Workshop on Invertible Neural Nets and Normalizing Flows, International Conference on Machine Learning, 2019.

[10] Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 2211–2221, 2017.

[11] Tristan Hascoet, Quentin Febvre, Weihao Zhuang, Yasuo Ariki, and Tetsuya Takiguchi. Layer-wise invertibility for extreme memory cost reduction of cnn training. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.

[12] Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pages 2722–2730. PMLR, 2019.

[13] Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.

[14] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.

[15] Duo Li and Shang-Hua Gao. m-revnet: Deep reversible neural networks with momentum. arXiv preprint arXiv:2108.05862, 2021.

[16] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In CVPR, 2022.

[17] Baohao Liao, Shaomu Tan, and Christof Monz. Make your pre-trained model reversible: From parameter to memory efficient fine-tuning, 2023.

[18] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[19] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proc. CVPR, 2022.

[20] Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer, and Jitendra Malik. Reversible vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10830–10840, 2022.

[21] Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Momentum residual neural networks. arXiv preprint arXiv:2102.07870, 2021.

[22] Yang Song, Chenlin Meng, and Stefano Ermon. Mintnet: Building invertible neural networks with masked convolutions. arXiv preprint arXiv:1907.07945, 2019.

[23] Chen Zhao, Shuming Liu, Karttikeya Mangalam, and Bernard Ghanem. Re^2tal: Rewiring pretrained video backbones for reversible temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.