PaReprop: Fast Parallelized Reversible Backpropagation

Tyler Zhu    Karttikeya Mangalam
UC Berkeley
tyler.zhu@berkeley.edu
Abstract

The growing size of datasets and deep learning models [...] we introduce a method for parallelizing reversible backpropagation which we call PaReprop, or Parallelized Reversible Backpropagation. Our method is general enough to be applied to any reversible architecture, and we demonstrate this by testing on a wide variety of reversible architectures, hardware, and memory settings, and show that we can achieve significant speedups in practice. In summary, we make the following contributions:

Figure 2. Illustration of the Reversible Transformation with arbitrary functions F, G and the forward process (left) and backward process (right). See Eq. 1 for the mathematical definition.
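As a reference for Figure 2, and assuming the standard two-residual-stream formulation of Rev-ViT [20] (whether this matches the paper's Eq. 1 exactly is an assumption of this extract), the reversible transformation and its inverse take the form

    Y_1 = X_1 + F(X_2),        Y_2 = X_2 + G(Y_1)        (forward)
    X_2 = Y_2 - G(Y_1),        X_1 = Y_1 - F(X_2)        (backward)

Since X_1 and X_2 can be recovered exactly from Y_1 and Y_2, intermediate activations do not need to be cached during the forward pass.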
[Figure 5: Throughput (ims/sec) vs. batch size for vision architectures. Rows show ViT (Small, Base, Large), MViT (Base, Large, Huge), and Swin (Large, Huge, Giant); each panel compares PaReprop and Reprop over batch sizes in powers of two.]
Figure 5. PaReprop Vision Training throughput vs. batch size across model architectures and sizes. 1σ error bars are shown. Accuracies are kept intact with PaReprop (Section 4).
3.4. Reversible Swin and RoBERTa

In order to demonstrate the generality of our method to other architectures, we propose two novel reversible architectures based on established models: Swin Transformer and RoBERTa. Figure 4 shows our modifications to the original Swin architecture. We follow suit with the original reversible ViT authors and choose to keep the Attention blocks and MLP blocks as our reversible subfunctions. We also demonstrate how we handle multiscale features in the architecture by utilizing a fusion block (either averaging or a simple MLP) before the usual patch merging layer.
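As a rough illustration of the fusion step described above, the following minimal PyTorch sketch either averages the two residual streams or mixes them with a small MLP before the usual patch-merging layer. The class name FusionBlock and its arguments are hypothetical; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Fuse the two reversible residual streams before patch merging.

    mode="avg" simply averages the streams; mode="mlp" concatenates them
    and mixes with a small MLP. Both return a tensor of shape (..., dim).
    """
    def __init__(self, dim: int, mode: str = "avg"):
        super().__init__()
        self.mode = mode
        if mode == "mlp":
            self.proj = nn.Sequential(
                nn.Linear(2 * dim, dim),
                nn.GELU(),
                nn.Linear(dim, dim),
            )

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        if self.mode == "avg":
            return 0.5 * (x1 + x2)
        return self.proj(torch.cat([x1, x2], dim=-1))
```

One natural follow-up (an assumption, not stated above) is to duplicate the merged, downsampled output into both streams to seed the next stage.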
4. Results

In this section, we present the experimental results of our proposed method, denoted PaReprop, in comparison with the original reversible ViT, denoted Reprop. We analyze our method over the choice of backbone size (from 5.5M to over 2B parameters), architecture class (ViT [7], Swin [19], MViT [8, 16], RoBERTa [18]), data modality (vision and NLP), GPU training memory allocated, input sequence length (for RoBERTa), and batch size. The primary metric we are concerned with is training throughput (images/sec or sequences/sec), which is the number of images (or sequences) we can process per second, as our measure of speed. We benchmark on 224×224 image classification and standard text classification.

As shown in [20], the reversible models attain the same accuracy as the original reversible ViT, and we do not change the underlying algorithm but simply propose a faster implementation, so we focus on analyzing the speedup of our method rather than its accuracy. All of our results are run on a single NVIDIA A100 GPU with 80GB of memory, with AdamW as our optimizer. Our results were similar on an NVIDIA RTX 2080 Ti, as well as with SGD.

4.1. PaReprop Training Throughput is Better

In the first experiment, we compare our PaReprop method with the original Reprop method used in the original reversible ViT. We compare the top throughput achieved over batch sizes of {1, 2, 4, . . . , 256, 1024} (as allowed by memory) for each backbone. In these cases, we run on standard image classification, but our results will hold over any choice of task (video understanding, etc.).
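The throughput protocol above can be mirrored with a small benchmarking helper. The sketch below is illustrative only: the model argument and the 1000-class labels are placeholders, and it is not the authors' benchmarking code.

```python
import time
import torch
import torch.nn.functional as F

def training_throughput(model, batch_size, image_size=224, steps=20, device="cuda"):
    """Measure training throughput (images/sec) at a single batch size."""
    model = model.to(device).train()
    opt = torch.optim.AdamW(model.parameters())
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    y = torch.randint(0, 1000, (batch_size,), device=device)  # dummy labels
    for _ in range(3):  # warm-up iterations (kernel compilation, caches)
        opt.zero_grad(set_to_none=True)
        F.cross_entropy(model(x), y).backward()
        opt.step()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        opt.zero_grad(set_to_none=True)
        F.cross_entropy(model(x), y).backward()
        opt.step()
    torch.cuda.synchronize()
    return steps * batch_size / (time.time() - start)

# Sweep batch sizes and keep the best throughput, as in the protocol above:
# best = max(training_throughput(model, b) for b in [1, 2, 4, 8, 16, 32, 64, 128, 256])
```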
We see that across three different architecture families (ViT, MViT, Swin) and three choices of model sizes, our method outperforms Reprop, in some cases drastically.

Vision Transformers show a mostly matched throughput between PaReprop and standard reversible backpropagation (Figure 5, top row). We find that because Reprop can already utilize a large batch size, the GPU utilization is quite high and the PaReprop kernels are unable to run in parallel.

Hierarchical Transformers enjoy a much more pronounced benefit from PaReprop, such as the Multiscale Vision Transformer (Figure 5, middle row) and the Swin Transformer (Figure 5, bottom row; Figure 1). Hierarchical transformers have non-homogeneous architectures that allow PaReprop to hide the recomputation latency. They have several small kernels that can be parallelized more effectively, which leads to significant speedups of 9.8% on the largest MViT-H and a 19.3% increase on the largest Swin-G, consistently outperforming the standard reversible backpropagation method (Reprop). This shows that PaReprop provides significant speedup for vision-specific models.

NLP Transformers. Finally, we also clearly demonstrate PaReprop gains on the natural language modality. Reversible models have also been applied successfully to language tasks, such as Reformer. Following [20], we extend RoBERTa [18] to Rev-RoBERTa and provide throughput benchmarking results on sentiment analysis. Note that, as shown in Rev-ViT [20], a simple reversible rewiring of the model maintains the original accuracy of the vanilla model. Figure 6 shows our method outperforming the original Reprop by large amounts across both choices of model size (Base and Large) and sequence lengths (512 and 1024).

[Figure 6: Throughput (seqs/sec) vs. batch size for RoBERTa Base and Large at sequence lengths of 512 and 1024, comparing PaReprop and Reprop.]
Figure 6. PaReprop NLP Training Throughput across Sequence Length and Batch Sizes. Comparison of our method on RoBERTa, a language transformer. One s.d. error bars are shown.

4.2. Training GPU Memory Trends

We also investigate the effect of memory on our method (Figure 7). Specifically, we compare the memory used by our method with the original Reprop and the vanilla backprop, and show that our method is more memory efficient. As shown in the plots, using any kind of reversible backpropagation offers memory savings of up to almost 3x in some cases, which allows us to significantly extend the batch sizes that we can use. In these scenarios, using our parallelized reversible backpropagation requires only an extra fraction of the small amount of memory being used to maintain parallelization, and allows us to achieve higher throughputs. This finding is consistent across all the model architectures we tested, but we illustrate most of our findings on Swin in Figure 7 for simplicity.

[Figure 7: Memory (GiB) vs. batch size for Swin architectures (Tiny, Small, Base, Large, Huge, Giant), comparing PaReprop, Reprop, and Backprop.]
Figure 7. PaReprop Memory vs. Batch Size. Our method, PaReprop, as well as standard reversible backprop, Reprop, use comparatively much less memory than non-reversible approaches (Backprop). The memory increase to achieve the proposed speedups is negligible compared to the overall savings.

5. Conclusion

We present Parallelized Reversible Backpropagation (PaReprop), a fast reversible backpropagation algorithm for training reversible transformers in both vision (Rev-ViT, Rev-Swin, Rev-MViT) and NLP (Rev-RoBERTa). PaReprop parallelizes the backward activation recomputation, an additional overhead introduced in reversible networks, with the gradient computation itself in backpropagation. By allowing recomputation to overlap with gradient calculation, the additional latency of recomputation is hidden, thereby increasing the training throughput while still maintaining extremely memory-efficient training. Our method maintains the same accuracy as the vanilla method, and achieves better throughput on all of the models we tested, outperforming some by up to 20%.
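To make the overlap concrete, here is a heavily simplified PyTorch sketch of scheduling activation recomputation on a side CUDA stream while gradients are computed on the default stream. The per-block methods inverse and backward_grads are hypothetical stand-ins; this illustrates the idea described above, not the PaReprop implementation.

```python
import torch

def parallel_reversible_backward(blocks, y1, y2, dy1, dy2):
    """Backward pass over reversible blocks, overlapping activation
    recomputation (side stream) with gradient computation (main stream)."""
    side = torch.cuda.Stream()
    main = torch.cuda.current_stream()

    # Inputs of the last block must be recomputed before anything can overlap.
    x1, x2 = blocks[-1].inverse(y1, y2)

    for i in range(len(blocks) - 1, -1, -1):
        if i > 0:
            # Recompute the previous block's inputs on the side stream ...
            side.wait_stream(main)
            with torch.cuda.stream(side):
                px1, px2 = blocks[i - 1].inverse(x1, x2)
        # ... while the main stream computes this block's gradients.
        dy1, dy2 = blocks[i].backward_grads(x1, x2, dy1, dy2)
        if i > 0:
            main.wait_stream(side)  # recomputed activations ready for next step
            x1, x2 = px1, px2
    return dy1, dy2
```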
References

[1] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12449–12460. Curran Associates, Inc., 2020.

[2] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, pages 573–582. PMLR, 2019.

[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.

[4] Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

[5] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

[6] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. ICLR, 2021.

[8] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proc. ICCV, 2021.

[9] Marc Finzi, Pavel Izmailov, Wesley Maddox, Polina Kirichenko, and Andrew Gordon Wilson. Invertible convolutional networks. In Workshop on Invertible Neural Nets and Normalizing Flows, International Conference on Machine Learning, 2019.

[10] Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 2211–2221, 2017.

[11] Tristan Hascoet, Quentin Febvre, Weihao Zhuang, Yasuo Ariki, and Tetsuya Takiguchi. Layer-wise invertibility for extreme memory cost reduction of CNN training. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.

[12] Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pages 2722–2730. PMLR, 2019.

[13] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.

[14] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.

[15] Duo Li and Shang-Hua Gao. m-RevNet: Deep reversible neural networks with momentum. arXiv preprint arXiv:2108.05862, 2021.

[16] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. MViTv2: Improved multiscale vision transformers for classification and detection. In CVPR, 2022.

[17] Baohao Liao, Shaomu Tan, and Christof Monz. Make your pre-trained model reversible: From parameter to memory efficient fine-tuning, 2023.

[18] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[19] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proc. CVPR, 2022.

[20] Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer, and Jitendra Malik. Reversible vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10830–10840, 2022.

[21] Michael E. Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Momentum residual neural networks. arXiv preprint arXiv:2102.07870, 2021.

[22] Yang Song, Chenlin Meng, and Stefano Ermon. MintNet: Building invertible neural networks with masked convolutions. arXiv preprint arXiv:1907.07945, 2019.

[23] Chen Zhao, Shuming Liu, Karttikeya Mangalam, and Bernard Ghanem. Re2TAL: Rewiring pretrained video backbones for reversible temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.