
Cocktailer: Analyzing and Optimizing Dynamic

Control Flow in Deep Learning

Student: Sundus Adwan (1205297)

Supervisor: Dr. Ayman Alhrob

6/4/2024
Table of contents:

- Introduction
- Background and motivation
- COCKTAILER Design
- uTask
- uProgram
- Implementation
- COCKTAILER on NVIDIA CUDA GPUs
- Evaluation
- Related work
- Conclusion
- References

Introduction
- In deep neural networks (DNNs), control flow is essential for various tasks, like handling sequential data and
making dynamic decisions based on input.

- DNN programs consist of control flow (managing the sequence of operations) and data flow (performing
actual computations).

- Existing DNN frameworks face efficiency challenges due to frequent synchronization between the CPU and
accelerators, and due to their inability to optimize data flow across control flow boundaries.

- The inefficiency arises from a mismatch in parallelism between control flow (sequential execution) and data
flow (parallel execution).
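
To make this mismatch concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper) of a DNN program mixing the two: the Python loop and the if-condition are control flow executed on the CPU, while the tensor operations are data flow launched on the accelerator, and reading the condition value forces a CPU-accelerator synchronization on every iteration.

    import torch
    import torch.nn as nn

    class ToyRNN(nn.Module):
        """Toy model mixing control flow (Python loop/if) with data flow (tensor ops)."""
        def __init__(self, hidden=256):
            super().__init__()
            self.cell = nn.Linear(hidden, hidden)

        def forward(self, xs):
            # xs: (seq_len, batch, hidden)
            h = torch.zeros(xs.size(1), xs.size(2), device=xs.device)
            for t in range(xs.size(0)):               # control flow: the loop runs on the CPU
                h = torch.tanh(self.cell(xs[t]) + h)  # data flow: kernels run on the accelerator
                if float(h.abs().mean()) < 1e-3:      # control flow: reading the value syncs CPU and accelerator
                    break
            return h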
Introduction
- COCKTAILER is introduced as a new DNN compiler that co-optimizes control flow and data flow.

- COCKTAILER utilizes the uTask abstraction to unify control and data flow, allowing for better scheduling and
optimization opportunities.

- By automatically moving control flow operations to accelerators, COCKTAILER achieves significant speedups in
DNN model execution.

- COCKTAILER is compatible with various accelerators and demonstrates substantial performance improvements
over existing frameworks.

Figure 1: Models with control flow
Background and motivation:
- Deep neural networks (DNNs) have found extensive applications in various domains like computer vision,
speech processing, and natural language processing.

- Traditionally, DNN architectures were composed of sequential feed-forward layers, but they have evolved to
include complex control logic, allowing for dynamic computation and adaptability within the network structure.

- Control flow introduces dynamic computation capabilities, enabling architectures that can adjust their structure
during runtime. For instance, loops are commonly used in recurrent neural networks (RNNs) and Transformers
to handle variable-length sequences.
Background and motivation:
- Control flow also facilitates conditional computation, allowing specific parts of the model to execute based on
certain conditions. This is useful for tasks like processing images with different resolutions.

- Additionally, control flow contributes to efficient computation by selectively executing parts of the model based
on input data or intermediate results. Techniques like early-exit mechanisms reduce computation by skipping
unnecessary layers on easy input samples (see the sketch after this list).

- Control flow can also be leveraged to adapt DNN models to different environments, such as different hardware
accelerators, by balancing computation cost and model performance through control flow decisions.
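
A minimal early-exit sketch (illustrative only; the module, layer sizes, and confidence threshold are my assumptions, not taken from any specific model):

    import torch
    import torch.nn as nn

    class EarlyExitNet(nn.Module):
        """Easy inputs exit after the shallow classifier; hard ones run the deep path."""
        def __init__(self, dim=128):
            super().__init__()
            self.shallow = nn.Linear(dim, 10)
            self.deep = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 10))

        def forward(self, x, threshold=0.9):
            early = self.shallow(x)
            conf = early.softmax(dim=-1).max()   # for brevity, one decision for the whole batch
            if conf > threshold:                 # branch decided at runtime from intermediate results
                return early                     # early exit: the deeper layers are skipped
            return self.deep(x)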
COCKTAILER Design
- COCKTAILER is a DNN compiler designed to optimize control flow and data flow together.

- It takes a DNN program with both control flow and data flow as input.

- Instead of separate scheduling, COCKTAILER schedules control flow and data flow within the program in a
unified space.

- COCKTAILER generates a uProgram representation for the program, consisting of multiple independent
uTasks.

- Each uTask represents both the control flow and data flow logic of one compute unit.
COCKTAILER Design
- COCKTAILER abstracts hardware accelerators as multiple levels of parallel processing units.

- This hardware abstraction aligns with common hardware accelerators, like NVIDIA GPUs.

- An example loop structure in COCKTAILER is scheduled as a uProgram mapped onto a 3-level accelerator abstraction.

- Each loop-uTask contains both control flow and data flow operations, scheduled into parallel processing units.
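
As a rough illustration of that abstraction (names and unit counts below are my assumptions, not COCKTAILER's actual API), an NVIDIA GPU can be viewed as three levels of parallel processing units: the whole device, its thread blocks (streaming multiprocessors), and the threads inside each block.

    # Illustrative 3-level accelerator abstraction (unit counts are only examples).
    ACCELERATOR_LEVELS = [
        {"level": 0, "name": "device",       "num_units": 1},
        {"level": 1, "name": "thread_block", "num_units": 80},         # e.g., 80 SMs on a V100
        {"level": 2, "name": "thread",       "num_units": 80 * 1024},  # threads across all blocks
    ]

    def units_at(level: int) -> int:
        """How many parallel processing units exist at a given level."""
        return ACCELERATOR_LEVELS[level]["num_units"]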
COCKTAILER Design (uTask and uProgram)
uTask:
- Definition: A fine-grained representation of computation logic that can be scheduled to one of the multi-level
processing units in hardware accelerators for execution.

- Representation:

- Data Flow Operators: Represented as a group of independent and homogeneous uTasks, where each uTask
executes computation on one processing unit.

- Control Flow Operations: Represented as NestedUTasks, where the body contains sequential computation tasks
executed on one processing unit.
uTask
- Types:
‒ Loop-uTask: Represents loop control flow with compute() implementing loop condition and body_uTasks
executing loop body computation.

‒ Branch-uTask: Represents branch control flow with compute() implementing branch condition and
then_uTasks/else_uTasks executing computations of respective branches.

‒ Function: Represents function computation with compute() executing function body uTasks sequentially.

‒ Recursion and uTask reference: Enables representation of recursive functions with uTask references for
recursive calls.
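
To make these variants concrete, here is a minimal Python sketch of how they could be modeled; the field names (compute, body_uTasks, then_uTasks, else_uTasks) follow the descriptions above, but the class layout is my own illustration rather than COCKTAILER's actual implementation.

    from dataclasses import dataclass, field
    from typing import Callable, List, Optional

    @dataclass
    class UTask:
        # The computation (or condition check) executed on one processing unit.
        compute: Callable[[], None]

    @dataclass
    class LoopUTask(UTask):
        # compute() implements the loop condition; body_uTasks execute the loop body.
        body_uTasks: List[UTask] = field(default_factory=list)

    @dataclass
    class BranchUTask(UTask):
        # compute() implements the branch condition; one arm runs per evaluation.
        then_uTasks: List[UTask] = field(default_factory=list)
        else_uTasks: List[UTask] = field(default_factory=list)

    @dataclass
    class FunctionUTask(UTask):
        # compute() runs the body uTasks sequentially; a uTask reference
        # (recursive_ref) stands in for recursive calls.
        body_uTasks: List[UTask] = field(default_factory=list)
        recursive_ref: Optional["FunctionUTask"] = None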
uProgram
- Definition: Represents the execution plan of a uTask-represented DNN program mapped to a level of parallel
processing units on the hardware accelerator.

- Contents:

- Independent uTasks: Each uTask is scheduled to one processing unit, and the total uTask count is reported by
get_uTask_num().

- Unit Level: Indicates the level of parallel processing units on the accelerator to which the uProgram is mapped.

- Purpose: Enables scheduling of both data flow and control flow operations in a single space on the accelerator,
optimizing performance by co-scheduling control and data flow.
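
Continuing the sketch above (again an illustration, not the real code), a uProgram can be modeled as a set of independent uTasks plus the level of processing units they are mapped to.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class UProgram:
        # Execution plan: independent uTasks mapped to one level of parallel units.
        uTasks: List["UTask"]      # UTask as sketched in the previous block
        unit_level: int            # which level of processing units the plan targets

        def get_uTask_num(self) -> int:
            """Total uTask count; each uTask is scheduled onto one processing unit."""
            return len(self.uTasks)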
Implementation
- COCKTAILER comprises approximately 10,000 lines of code written in Python and C++, built upon PyTorch and Rammer.

- Model developers can continue working with native PyTorch programs without any additional effort.

- COCKTAILER converts PyTorch programs into ONNX graphs, automatically scheduling both data flow and control flow.

- It applies control-flow-related optimizations and wraps the generated code as a customized PyTorch operator.

- Implemented for NVIDIA and AMD GPUs, COCKTAILER can be extended to other accelerators with compatible hardware
abstraction and APIs.
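
The user-facing workflow might look roughly like the following. torch.jit.script and torch.onnx.export are real PyTorch APIs (scripting preserves loops and branches in the exported graph); cocktailer_compile and the returned operator are hypothetical names standing in for COCKTAILER's internal pipeline.

    import torch
    import torch.nn as nn

    class LoopModel(nn.Module):
        """Tiny model whose iteration count depends on an input value."""
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(16, 16)

        def forward(self, x, steps):
            for _ in range(int(steps)):       # control flow kept by scripting
                x = torch.relu(self.fc(x))    # data flow
            return x

    # Real PyTorch APIs: script the model, then export it (control flow included) to ONNX.
    scripted = torch.jit.script(LoopModel())
    torch.onnx.export(scripted, (torch.randn(1, 16), torch.tensor(4)), "model.onnx",
                      opset_version=13)

    # Hypothetical from here on: a COCKTAILER-style pass would co-schedule the graph's
    # control flow and data flow and wrap the result as a customized PyTorch operator.
    # optimized_op = cocktailer_compile("model.onnx", target="cuda")   # not a real API
    # y = optimized_op(torch.randn(1, 16), torch.tensor(4))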
COCKTAILER on NVIDIA CUDA GPUs
- COCKTAILER implements a ScheduleOperator interface on top of various tools and manually-implemented
kernels.

- It leverages Rammer to convert data flow operators' kernel source code to uOperators with multiple uTasks,
scheduling both data flow and control flow.

- COCKTAILER adapts to NVIDIA and AMD GPUs, with potential for extension to other accelerators
compatible with its hardware abstraction.

- For nested-uTask and loop-uTask code generation, COCKTAILER dynamically allocates tensor memory and
manages register pressure.
COCKTAILER on NVIDIA CUDA GPUs
- Branch-uTask code generation involves optimizing GPU kernel occupancy and reclustering branches for
efficient execution.

- uTask reference enables recursion support, requiring users to set a maximum stack depth limit to manage
memory usage.
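
A rough sketch of the idea behind that limit (plain Python, purely illustrative; nothing here is COCKTAILER's generated GPU code): recursion is executed with an explicit, pre-allocated stack whose capacity matches the user-provided maximum depth, so memory can be reserved up front instead of growing dynamically on the accelerator.

    class BoundedStack:
        """Pre-allocated call stack for recursive uTasks; depth is bounded up front."""
        def __init__(self, max_depth: int):
            self.frames = [None] * max_depth   # memory reserved once, at the user-set limit
            self.top = 0

        def push(self, frame) -> None:
            if self.top >= len(self.frames):
                raise RuntimeError("recursion exceeds the configured maximum stack depth")
            self.frames[self.top] = frame
            self.top += 1

        def pop(self):
            self.top -= 1
            return self.frames[self.top]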
Evaluation
- Platform: Evaluation conducted on an NVIDIA Tesla V100-PCIE-32GB GPU (CUDA 11.5) and an AMD Instinct
MI100 GPU (ROCm 4.3).

- Baselines: Compared with PyTorch (with TorchScript), TensorFlow, JAX, and a baseline using Rammer for
data flow and PyTorch for control flow (COCKTAILERBASE).

- Benchmarks: Various DNN models covering CNN, RNN, transformers, with loops, branches, and recursion.

- End-to-end Evaluation (NVIDIA GPU):

- COCKTAILER outperforms TorchScript, TensorFlow, and JAX by up to 8.22×, achieving a geometric mean
speedup of 1.85×.
Evaluation
‒ LSTM, NASRNN, and Attention benefit from COCKTAILER's dynamic loop unrolling and scheduling to
thread block level, achieving speedups of up to 1.93×.

‒ Seq2seq and models with branches (BlockDrop, SkipNet) show speedups of up to 1.61× and 1.84×,
respectively, due to optimized GPU execution of control flow.

‒ RAE model with recursion achieves speedups of up to 9.35× over PyTorch and 8.22× over
COCKTAILERBASE.

- Control Flow Overhead Analysis:

- COCKTAILER minimizes control flow overhead, achieving performance similar to traced graph baselines.

- Models with loops and branches experience reduced overhead compared to baseline systems due to optimized
GPU execution.
Evaluation
Breakdown of Optimizations:

- Dynamic loop unrolling, block-level scheduling, and other optimizations contribute to performance
improvements, especially in models with loops and branches.

- End-to-end Evaluation (AMD GPU):

– COCKTAILER outperforms TorchScript, TensorFlow, and JAX by up to 5.86×, 112.34×, and 272.63×,
respectively, on AMD MI100 GPU.
Related work
Control Flow Handling in DL Frameworks:

– TensorFlow 1.x [1] and TorchScript [2] execute control flow on CPU as special operators or instructions.

– Chainer [3], PyTorch [2], and JAX [4] leverage Python's runtime for control flow operations.

- Optimized Control Flow Handling:

- VersaPipe [5] optimizes pipelines for general GPU programs.


Related work
– Cortex [6] provides interfaces for recursion with data patterns but assumes fixed jump directions.

- Batching Approaches:

- DyNet [7], Cavs [8], TensorFlow Fold [9], etc., introduce schedulers to batch ready-to-execute
operators, complementary to COCKTAILER.

- DL Compilers for Graph Optimization:

- TASO [10], Rammer [11], etc., optimize computation graphs without control flow, compatible
with COCKTAILER.
Related work
- COCKTAILER introduces the uTask abstraction, which aligns with hardware parallelism and enhances
existing optimizations.

- Scaling DL Models on Distributed Architectures:

- Tofu [12], FlexFlow [13], GSPMD [14], etc., parallelize execution across multiple devices but
mainly for static architectures or specific dynamic models.
Conclusion
- COCKTAILER addresses performance issues in supporting dynamic DNN models by co-scheduling control
flow and data flow.

- Key Contributions:

1. Provides fine-grained uTask abstraction for holistic scheduling on hardware accelerators.

2. Designs scheduling mechanism and heuristic policy for efficient execution.

3. Offers control flow optimizations in both scheduling and code generation.

- Evaluations show significant performance improvements over state-of-the-art methods, positioning
COCKTAILER as a valuable enhancement to deep learning infrastructure.
References
[1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016, November). TensorFlow:
A system for large-scale machine learning. In OSDI (Vol. 16, pp. 265-283).
[2] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., ... & Chintala, S. (2019). PyTorch: An
imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems
(pp. 8024-8035).
[3] Tokui, S., Oono, K., Hido, S., & Clayton, J. (2015). Chainer: a next-generation open source framework for deep
learning. In Proceedings of workshop on machine learning systems (Vol. 5).
[4] Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., ... & Wanderman-Milne, S. (2018).
JAX: Composable transformations of Python+NumPy programs.
[5] Zheng, Z., Oh, C., Zhai, J., Shen, X., Yi, Y., & Chen, W. (2017, October). VersaPipe: A versatile programming
framework for pipelined computing on GPU. In Proceedings of the 50th Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO-50).
References
[6] Fegade, P., Chen, T., Gibbons, P., & Mowry, T. (2021). Cortex: A compiler for recursive deep learning models.
Proceedings of Machine Learning and Systems, 3.
[7] Neubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., ... & Snyder, W. (2017).
Dynet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.
[8] Xu, S., Zhang, H., Neubig, G., Dai, W., Kim, J. K., Deng, Z., Ho, Q., Yang, G., & Xing, E. P. (2018). Cavs: An
efficient runtime system for dynamic neural networks. In 2018 USENIX Annual Technical Conference (USENIX ATC 18).
[9] Looks, M., Herreshoff, M., Hutchins, D., & Norvig, P. (2017). Deep learning with dynamic computation graphs.
In International Conference on Learning Representations (ICLR).
[10] Jia, Z., Padon, O., Thomas, J., Warszawski, T., Zaharia, M., & Aiken, A. (2019, October). TASO: Optimizing
deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM
Symposium on Operating Systems Principles (SOSP).
References
[11] Ma, L., Xie, Z., Yang, Z., Xue, J., Miao, Y., Cui, W., Hu, W., Yang, F., Zhang, L., & Zhou, L. (2020). Rammer:
Enabling holistic deep learning compiler optimizations with rTasks. In 14th USENIX Symposium on Operating
Systems Design and Implementation (OSDI 20) (pp. 881-897).
[12] Wang, M., Huang, C., & Li, J. (2019, March). Supporting very large models using automatic dataflow graph
partitioning. In Proceedings of the Fourteenth EuroSys Conference (EuroSys 19).
[13] Jia, Z., Zaharia, M., & Aiken, A. (2019). Beyond data and model parallelism for deep neural networks.
Proceedings of Machine Learning and Systems, 1, 1-13.
[14] Xu, Y., Lee, H., Chen, D., Hechtman, B., Huang, Y., Joshi, R., Krikun, M., Lepikhin, D., Ly, A., Maggioni, M.,
et al. (2021). GSPMD: General and scalable parallelization for ML computation graphs. arXiv preprint
arXiv:2105.04663.
Thank you for listening
