6/4/2024
- DNN programs consist of control flow (managing the sequence of operations) and data flow (performing
actual computations).
- Existing DNN frameworks suffer efficiency problems due to frequent synchronization between the CPU and
accelerators and their inability to optimize data flow across control-flow boundaries.
- The inefficiency arises from a mismatch in parallelism between control flow (sequential execution) and data
flow (parallel execution).
Introduction
- COCKTAILER is introduced as a new DNN compiler that co-optimizes control flow and data flow.
- Traditionally, DNN architectures were composed of sequential feed-forward layers, but they have evolved to
include complex control logic, allowing for dynamic computation and adaptability within the network structure.
- Control flow introduces dynamic computation capabilities, enabling architectures that can adjust their structure
during runtime. For instance, loops are commonly used in recurrent neural networks (RNNs) and Transformers
to handle variable-length sequences.
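The loop pattern above can be sketched in plain Python: the loop itself is sequential control flow, while each step's tensor computation is data flow. The `cell` step function is a hypothetical stand-in for illustration, not COCKTAILER code.

```python
# Minimal sketch of loop control flow in an RNN, assuming a
# hypothetical `cell` step function; not COCKTAILER's actual code.
def rnn_forward(cell, inputs, h0):
    """Run `cell` over a variable-length input sequence."""
    h = h0
    for x in inputs:    # control flow: sequential loop over time steps
        h = cell(x, h)  # data flow: one step of (parallel) tensor compute
    return h
```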
Background and motivation
- Control flow also facilitates conditional computation, allowing specific parts of the model to execute based on
certain conditions. This is useful for tasks like processing images with different resolutions.
- Additionally, control flow contributes to efficient computation by selectively executing parts of the model based
on input data or intermediate results. Techniques like early-exiting mechanisms help reduce computational
resources by skipping unnecessary layers on easy input samples.
- Control flow can also be leveraged to adapt DNN models to different environments, such as different hardware
accelerators, by balancing computation cost and model performance through control flow decisions.
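The conditional and early-exit mechanisms described above can be illustrated with a hedged plain-Python sketch; `stages`, `exit_heads`, and the confidence `threshold` are assumed names for illustration, not part of COCKTAILER.

```python
# Illustrative early-exit sketch: each stage is followed by a cheap
# exit head; confident (easy) inputs skip the remaining layers.
# `stages`, `exit_heads`, and `threshold` are assumed names.
def early_exit_forward(stages, exit_heads, x, threshold=0.9):
    prediction = None
    for stage, head in zip(stages, exit_heads):
        x = stage(x)                 # data flow: run one stage
        score, prediction = head(x)  # cheap exit classifier
        if score >= threshold:       # control flow: branch on confidence
            return prediction        # early exit: skip remaining stages
    return prediction                # fell through to the final head
```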
COCKTAILER Design
- COCKTAILER is a DNN compiler designed to optimize control flow and data flow together.
- It takes a DNN program with both control flow and data flow as input.
- Instead of separate scheduling, COCKTAILER schedules control flow and data flow within the program in a
unified space.
- COCKTAILER generates a uProgram representation for the program, consisting of multiple independent
uTasks.
- Each uTask represents both the control flow and data flow logic of one compute unit.
- COCKTAILER abstracts hardware accelerators as multiple levels of parallel processing units.
- This hardware abstraction aligns with common hardware accelerators, like NVIDIA GPUs.
- The example loop structure in COCKTAILER is scheduled as a uProgram mapped on a 3-level accelerator.
- Each loop-uTask contains both control flow and data flow operations, scheduled into parallel processing units.
COCKTAILER Design (uTask and uProgram)
uTask:
- Definition: A fine-grained representation of computation logic that can be scheduled to one of the multi-level
processing units in hardware accelerators for execution.
- Representation:
- Data Flow Operators: Represented as a group of independent and homogeneous uTasks, where each uTask
executes computation on one processing unit.
- Control Flow Operations: Represented as NestedUTasks, where the body contains sequential computation tasks
executed on one processing unit.
- Types:
- Loop-uTask: Represents loop control flow, with compute() implementing the loop condition and body_uTasks
executing the loop body computation.
- Branch-uTask: Represents branch control flow, with compute() implementing the branch condition and
then_uTasks/else_uTasks executing the computations of the respective branches.
- Function: Represents function computation, with compute() executing the function body uTasks sequentially.
- Recursion and uTask reference: Enables representation of recursive functions, with uTask references standing
in for recursive calls.
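The uTask types above can be summarized with an illustrative class sketch in plain Python. This is not COCKTAILER's actual implementation; the names (`body_uTasks`, `then_uTasks`, `else_uTasks`) mirror the text, but the class structure is an assumption.

```python
# Illustrative sketch of the uTask hierarchy described above; not
# COCKTAILER's real classes. compute() evaluates the control-flow
# condition, nested uTask lists carry the body computations.
class UTask:
    """One schedulable unit of control/data flow."""
    def compute(self, state):
        raise NotImplementedError

class DataUTask(UTask):
    """Data-flow uTask: computation on one processing unit."""
    def __init__(self, fn):
        self.fn = fn
    def compute(self, state):
        return self.fn(state)

class LoopUTask(UTask):
    """Loop-uTask: compute() checks the loop condition, then runs
    body_uTasks sequentially while it holds."""
    def __init__(self, condition, body_uTasks):
        self.condition = condition
        self.body_uTasks = body_uTasks
    def compute(self, state):
        while self.condition(state):          # loop condition
            for t in self.body_uTasks:        # loop body computation
                state = t.compute(state)
        return state

class BranchUTask(UTask):
    """Branch-uTask: compute() checks the branch condition and runs
    then_uTasks or else_uTasks accordingly."""
    def __init__(self, condition, then_uTasks, else_uTasks):
        self.condition = condition
        self.then_uTasks = then_uTasks
        self.else_uTasks = else_uTasks
    def compute(self, state):
        taken = self.then_uTasks if self.condition(state) else self.else_uTasks
        for t in taken:
            state = t.compute(state)
        return state
```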
uProgram
- Definition: Represents the execution plan of a uTask-represented DNN program mapped to a level of parallel
processing units on the hardware accelerator.
- Contents:
- Independent uTasks: Each uTask is scheduled to one processing unit, and the total uTask count is reported by
get_uTask_num().
- Unit Level: Indicates the level of parallel processing units on the accelerator to which the uProgram is mapped.
- Purpose: Enables scheduling of both data flow and control flow operations in a single space on the accelerator,
optimizing performance by co-scheduling control and data flow.
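A uProgram, as defined above, can be sketched as a thin container over independent uTasks. The names `get_uTask_num` and `unit_level` come from the text; everything else here is an illustrative assumption (in particular, the sequential `run` stands in for parallel dispatch on the accelerator).

```python
# Illustrative sketch of a uProgram: independent uTasks mapped to
# one level of parallel processing units. Not COCKTAILER's code;
# `run` is a sequential stand-in for parallel hardware dispatch.
class UProgram:
    def __init__(self, uTasks, unit_level):
        self.uTasks = uTasks          # independent, one per processing unit
        self.unit_level = unit_level  # e.g. 0 = thread, 1 = block, 2 = device

    def get_uTask_num(self):
        """Number of processing units the program occupies."""
        return len(self.uTasks)

    def run(self, states):
        # Each uTask would execute on its own processing unit in
        # parallel; here we simulate that with a sequential loop.
        return [t(state) for t, state in zip(self.uTasks, states)]
```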
Implementation
- COCKTAILER comprises approximately 10,000 lines of code written in Python and C++, built upon PyTorch and Rammer.
- Model developers can continue working with native PyTorch programs without any additional effort.
- COCKTAILER converts PyTorch programs into ONNX graphs, automatically scheduling both data flow and control flow.
- It applies control-flow-related optimizations and wraps the generated code as a customized PyTorch operator.
- Implemented for NVIDIA and AMD GPUs, COCKTAILER can be extended to other accelerators with compatible hardware
abstraction and APIs.
COCKTAILER on NVIDIA CUDA GPUs
- COCKTAILER implements a ScheduleOperator interface on top of various tools and manually-implemented
kernels.
- It leverages Rammer to convert data flow operators' kernel source code to uOperators with multiple uTasks,
scheduling both data flow and control flow.
- For nested-uTask and loop-uTask code generation, COCKTAILER dynamically allocates tensor memory and
manages register pressure.
- Branch-uTask code generation involves optimizing GPU kernel occupancy and reclustering branches for
efficient execution.
- uTask reference enables recursion support, requiring users to set a maximum stack depth limit to manage
memory usage.
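The user-set recursion depth limit can be illustrated with a plain-Python sketch; `MAX_DEPTH` and the tree-reduction workload are assumptions for illustration, not COCKTAILER's generated code.

```python
# Illustrative sketch of bounding recursion with a user-set maximum
# stack depth, as described above; the tree reduction is an assumed
# example workload, not COCKTAILER's code generation.
MAX_DEPTH = 32  # user-set limit that bounds stack memory usage

def reduce_tree(node, depth=0):
    """Recursively sum a nested (left, right) tuple tree."""
    if depth >= MAX_DEPTH:
        raise RuntimeError("exceeded the user-set recursion depth limit")
    if not isinstance(node, tuple):           # leaf: plain value
        return node
    left = reduce_tree(node[0], depth + 1)    # recursive uTask reference
    right = reduce_tree(node[1], depth + 1)
    return left + right                       # data-flow computation at the node
```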
Evaluation
- Platform: Evaluation conducted on NVIDIA Tesla V100-PCIE-32GB GPU (CUDA 11.5) and AMD Instinct
MI100 GPU (ROCM 4.3).
- Baselines: Compared with PyTorch (with TorchScript), TensorFlow, JAX, and a baseline using Rammer for
data flow and PyTorch for control flow (COCKTAILERBASE).
- Benchmarks: A range of DNN models covering CNNs, RNNs, and Transformers, with loops, branches, and recursion.
- COCKTAILER outperforms TorchScript, TensorFlow, and JAX by up to 8.22×, achieving a geometric mean
speedup of 1.85×.
- LSTM, NASRNN, and Attention benefit from COCKTAILER's dynamic loop unrolling and scheduling to the
thread-block level, achieving speedups of up to 1.93×.
- Seq2seq and models with branches (BlockDrop, SkipNet) show speedups of up to 1.61× and 1.84×,
respectively, due to optimized GPU execution of control flow.
- The RAE model with recursion achieves speedups of up to 9.35× over PyTorch and 8.22× over
COCKTAILERBASE.
- COCKTAILER minimizes control flow overhead, achieving performance similar to traced graph baselines.
- Models with loops and branches experience reduced overhead compared to baseline systems due to optimized
GPU execution.
Breakdown of Optimizations:
- Dynamic loop unrolling, block-level scheduling, and other optimizations contribute to performance
improvements, especially in models with loops and branches.
- COCKTAILER outperforms TorchScript, TensorFlow, and JAX by up to 5.86×, 112.34×, and 272.63×,
respectively, on the AMD MI100 GPU.
Related work
Control Flow Handling in DL Frameworks:
- TensorFlow 1.x [1] and TorchScript [2] execute control flow on the CPU as special operators or instructions.
- Chainer [3], PyTorch [2], and JAX [4] leverage Python's runtime for control flow operations.
- Batching Approaches:
- DyNet [7], Cavs [8], TensorFlow Fold [9], etc., introduce schedulers to batch ready-to-execute
operators, complementary to COCKTAILER.
- TASO [10], Rammer [11], etc., optimize computation graphs without control flow, compatible
with COCKTAILER.
- COCKTAILER introduces uTask abstraction aligning with hardware parallelism, enhancing
existing optimizations.
- Tofu [12], FlexFlow [13], GSPMD [14], etc., parallelize execution across multiple devices but
mainly for static architectures or specific dynamic models.
Conclusion
- COCKTAILER addresses performance issues in supporting dynamic DNN models by co-scheduling control
flow and data flow.
- Key Contributions: