NeMo Framework
Part of NVIDIA AI Enterprise, connecting developers to enterprise AI applications through an end-to-end pipeline: Data Curation, Distributed Training, Model Customization, Accelerated Inference, and Guardrails.
• Multi-Modality: build language, image, and other generative AI models
• Data Curation at Scale: extract, deduplicate, and filter information from large unstructured data at scale
• Optimized Training: accelerate training and throughput by parallelizing the model and the training data across thousands of nodes
• Model Customization: easily customize with p-tuning, SFT, Adapters, RLHF, and ALiBi
• Deploy at Scale: run optimized inference at scale, anywhere
• Guardrails: keep applications aligned with safety and security requirements using NeMo Guardrails
• Support: NVIDIA AI Enterprise and experts by your side to keep projects on track
Megatron-Core
MLPerf Optimizations
3D Parallelism Techniques to Build Foundation Models
• Communication overlapping (see the sketch below)
  • Data parallel
    • Gradient allreduce overlap (without distributed optimizer)
    • Gradient reduce-scatter and parameter allgather overlap (with distributed optimizer)
  • Interleaved pipeline parallel
  • TP-comm optimization (experimental)
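The data-parallel overlap item means launching the gradient allreduce for each parameter as soon as its gradient is produced, instead of waiting for the whole backward pass to finish. A minimal plain-PyTorch sketch of that pattern (not Megatron-Core's implementation, which uses bucketed reduce-scatter/allgather with the distributed optimizer; assumes an initialized NCCL process group and PyTorch 2.1+):

    # Illustrative overlap of gradient allreduce with the backward pass.
    import torch
    import torch.distributed as dist

    def attach_overlap_hooks(model: torch.nn.Module, handles: list):
        def hook(param: torch.Tensor) -> None:
            # Fires as soon as this parameter's grad is accumulated during backward,
            # so the async allreduce overlaps with the rest of the backward pass.
            handles.append(dist.all_reduce(param.grad, op=dist.ReduceOp.AVG, async_op=True))

        for p in model.parameters():
            if p.requires_grad:
                p.register_post_accumulate_grad_hook(hook)

    # Usage per step (sketch):
    #   handles = []; attach_overlap_hooks(model, handles)   # once, before training
    #   loss.backward()
    #   for work in handles: work.wait()                     # drain remaining comms
    #   optimizer.step(); handles.clear()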
New Release - FP8 Enablement via Transformer Engine API
Benefits of FP8
★ Accelerates math-intensive operations - FP8 Tensor Cores are 2x faster than 16-bit Tensor Cores
★ Accelerates memory-intensive operations - reduces memory traffic, since an 8-bit value needs half as many bytes of memory access as a 16-bit value
★ Facilitates preparation for fast inference deployment - models are already trained in FP8
★ Fourth-generation Tensor Cores (H100 FP8) are up to 6x faster than A100 (end-to-end network speedups are up to 3x)
Training in FP8
- FP8 training matches the behavior of 16-bit training - no changes to hyperparameters
- FP8 is enabled through Transformer Engine (TE), a library that handles FP8 details internally and abstracts them away from DL frameworks; calls to FP8 operations and casts happen inside its modules
- TE APIs are integrated into the Megatron-Core transformer layer (a minimal usage sketch follows)
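A minimal sketch of FP8 training with the Transformer Engine PyTorch API. The layer sizes, recipe settings, and dtypes are illustrative assumptions; in NeMo/Megatron-Core this integration lives inside the transformer layer rather than being written by hand:

    # Minimal Transformer Engine FP8 sketch (illustrative, not Megatron-Core's integration).
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    layer = te.Linear(1024, 1024, bias=True).cuda()
    fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID,   # E4M3 forward, E5M2 backward
                                amax_history_len=16,
                                amax_compute_algo="max")

    x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)

    # Inside the context, supported modules run their GEMMs in FP8;
    # scaling factors and casts are handled by TE internally.
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        y = layer(x)

    y.float().sum().backward()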
FP8 TRAINING
Less Storage, More Performance
• The H100 GPU adds FP8 Tensor Cores to accelerate both AI training and inference. FP8 Tensor Cores support FP32 and FP16 accumulators, and two new FP8 input types (compared below):
  • E4M3, with 4 exponent bits, 3 mantissa bits, and 1 sign bit
  • E5M2, with 5 exponent bits, 2 mantissa bits, and 1 sign bit
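The two formats trade precision for range: E4M3 has more mantissa bits (typically used for forward activations and weights), while E5M2 has more exponent bits and a much wider dynamic range (typically used for gradients). A small sketch that queries these properties, assuming a PyTorch build (2.1+) that exposes the FP8 dtypes:

    # Inspect the dynamic range and precision of the two FP8 formats.
    import torch

    for name, dtype in [("E4M3", torch.float8_e4m3fn), ("E5M2", torch.float8_e5m2)]:
        info = torch.finfo(dtype)
        print(f"{name}: max={info.max}, min normal={info.tiny}, eps={info.eps}")

    # Prints roughly:
    #   E4M3: max=448.0,   min normal=0.015625,   eps=0.125
    #   E5M2: max=57344.0, min normal=6.1035e-05, eps=0.25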
New Release - Context Parallelism
Support for long sequences
Splitting the sequence for the entire network reduces the need for activation checkpointing and hence recompute overhead (see the sketch below).
• Benefits
  • Pre-training: ~30% performance improvement for large models (175B+) and long sequence lengths (16k+) by removing checkpoint/recompute overheads
  • Fine-tuning: avoids OOM for long sequences (32k-128k)
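Conceptually, context parallelism shards the sequence dimension across a group of GPUs for the whole network, so each rank only stores activations for its local chunk; attention layers then exchange keys/values across ranks. A toy sketch of the sharding step only (not Megatron-Core's implementation; the ring-style KV exchange is omitted):

    # Toy illustration of context-parallel sequence sharding (not Megatron-Core code).
    import torch

    def shard_sequence(hidden: torch.Tensor, cp_rank: int, cp_size: int) -> torch.Tensor:
        """Split [batch, seq, hidden] along the sequence dim; each CP rank keeps one chunk,
        so activation memory per rank shrinks by ~cp_size for the whole network."""
        seq_len = hidden.shape[1]
        assert seq_len % cp_size == 0, "sequence length must divide evenly across CP ranks"
        chunk = seq_len // cp_size
        return hidden[:, cp_rank * chunk:(cp_rank + 1) * chunk, :]

    full = torch.randn(2, 16384, 4096)           # long-sequence activations
    local = shard_sequence(full, cp_rank=0, cp_size=4)
    print(local.shape)                           # torch.Size([2, 4096, 4096])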
New Release - MoE Support
• A Mixture of Experts (MoE) model is a type of neural network that can be thought of as having different submodels (or experts), each specialized for different inputs. Each expert is a feed-forward network with its own weights.
• The experts are sparsely activated: for a given input token, only a subset of experts is used, giving the model more capacity while limiting computation.
• Which experts a token uses is determined by the routing algorithm (see the router sketch after the lists below).
• Benefits:
  • MoE substantially scales up model capacity while introducing only a small computation overhead
● Router choices
○ Sinkhorn
○ Top-K (on roadmap)
○ Expert Choice (on roadmap)
● MoE Parallelism
○ Expert Parallelism (EP+DP)
○ Expert Tensor Parallelism (EP+DP+TP(+SP))
○ ETP with Pipeline Parallelism (EP+DP+TP(+SP)+PP)
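A minimal top-k router sketch in PyTorch to make the routing idea concrete. This is illustrative only: per the list above, Megatron-Core currently ships a Sinkhorn router and Top-K is on the roadmap, and all dimensions here are made up:

    # Toy top-k MoE router: each token is sent to its k highest-scoring experts.
    import torch
    import torch.nn.functional as F

    hidden_size, num_experts, top_k = 1024, 8, 2
    tokens = torch.randn(16, hidden_size)             # [num_tokens, hidden]
    router_weight = torch.randn(hidden_size, num_experts) * 0.02

    logits = tokens @ router_weight                   # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_experts = probs.topk(top_k, dim=-1)

    # Each token activates only `top_k` of the `num_experts` feed-forward experts;
    # their outputs are later combined, weighted by the renormalized router probs.
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    print(topk_experts[0], topk_probs[0])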
Traditional MoE Layer Workflow
Gale, Trevor, et al. "MegaBlocks: Efficient Sparse Training with Mixture-of-Experts." Proceedings of Machine Learning and Systems 5 (2023).
MoE Layer Workflow in Megatron-Core
No token is dropped, to ensure model accuracy.
[Figure: the router scores each token, tokens are permuted and grouped per expert (Expert 0/1/2), and expert outputs are scaled and un-permuted back to the original order]
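The grouping step can be done with a simple sort/gather so that every token reaches its assigned expert, with no capacity-based dropping. A toy sketch of that dispatch, assuming one expert per token for brevity (not Megatron-Core code):

    # Toy dropless dispatch: permute tokens so each expert processes a contiguous block.
    import torch

    num_tokens, hidden, num_experts = 8, 4, 3
    tokens = torch.randn(num_tokens, hidden)
    expert_ids = torch.randint(num_experts, (num_tokens,))    # router decision per token

    # Permute: sort tokens by assigned expert, remember how to undo it.
    perm = torch.argsort(expert_ids)
    grouped = tokens[perm]
    counts = torch.bincount(expert_ids, minlength=num_experts).tolist()

    # Run each expert on its contiguous slice (experts are just placeholders here).
    outputs = []
    for chunk in grouped.split(counts):
        outputs.append(chunk * 2.0)                            # stand-in for an expert FFN
    out = torch.cat(outputs)

    # Un-permute back to the original token order; no token was dropped.
    unperm = torch.empty_like(perm)
    unperm[perm] = torch.arange(num_tokens)
    out = out[unperm]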
MoE in Megatron-Core
Future Updates/Roadmap
● Megablocks integration
● …
LLM INFERENCE – TENSORRT-LLM & TRITON
TensorRT-LLM Optimizing LLM Inference
SoTA Performance for Large Language Models for Production Deployments
TensorRT-LLM is an open-source library that delivers state-of-the-art inference performance for the latest large language models on NVIDIA GPUs.
TensorRT-LLM wraps TensorRT's deep learning compiler, optimized kernels from FasterTransformer, pre/post-processing, and MGMN (multi-GPU, multi-node) communication in a simple open-source Python API for defining, optimizing, and executing LLMs for inference in production.
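As a taste of that Python API, newer TensorRT-LLM releases expose a high-level LLM entry point; a minimal sketch assuming that API and a Hugging Face model name (both are assumptions, not the exact workflow shown on this slide):

    # Sketch of TensorRT-LLM's high-level Python API (recent releases; illustrative).
    # The model name and sampling settings below are assumptions, not from the slide.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf")        # builds/loads an optimized engine
    params = SamplingParams(temperature=0.8, top_p=0.95)

    outputs = llm.generate(["Explain KV caching in one sentence."], params)
    for out in outputs:
        print(out.outputs[0].text)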
[Charts: Performance, TCO, average latency, and cost comparisons. Numbers are preliminary, based on internal evaluation of Llama 7B on H100.]
TensorRT-LLM in the DL Compiler Ecosystem
TensorRT-LLM builds on TensorRT Compilation
• TensorRT-LLM
  • Built on top of TensorRT
  • Leverages TensorRT for general graph optimizations & fast kernels
  • Adds LLM-specific optimizations:
    • KV caching & custom MHA kernels
    • Inflight batching, paged attention
    • Multi-GPU, multi-node (MGMN)
    • Quantization (INT4/INT8 weight-only, GPTQ, AWQ, SmoothQuant, FP8)
    • & more
  • ONLY for LLMs
• TensorRT
  • General-purpose deep learning compiler
  • Graph rewriting, constant folding, kernel fusion
  • Optimized GEMMs & pointwise kernels
  • Kernel auto-tuning
  • Memory optimizations
  • Multi-stream execution
  • & more
  • All AI workloads
TensorRT-LLM Usage
Create, Build, Execute
0. Trained model in a framework: NeMo, HuggingFace, or other DL frameworks; MGMN inference

Easily modify models
• Modify the model much as you would in a DL framework
• Add operators to the forward call as desired
• Ops can be used with any model
Modify models simply with modular Python layers:

    # add a new layer to a model with modular building blocks
    hidden_states = self.linear(hidden_states)
    return hidden_states
Improved op coverage & definition
• Define ops in Python with TensorRT Python primitives
• or map ops to arbitrary kernels with plugins
• Compose operators in Python

    # functional.py
    def sigmoid(input: Tensor) -> Tensor:
        layer = default_trtnet().add_activation(input.trt_tensor,
                                                trt.ActivationType.SIGMOID)
        return _create_tensor(layer.get_output(0), layer)

    def silu(input: Tensor) -> Tensor:
        return input * sigmoid(input)

    def swiglu(input: Tensor) -> Tensor:
        x, gate = chunk(input, 2, dim=-1)
        return silu(gate) * x
Implementing RMSNorm in TensorRT-LLM
• Quickly implement, entirely in Python, to unblock deployments
• Compiled entirely in TensorRT
• Same or similar performance to custom CUDA kernels
• Caveat: TensorRT may not always fuse operators into a single kernel, which can impact performance.

    with precision("float32"):
        varx = pow(input, 2.0)
        varx = varx.mean(dim, keepdim=True)
        denom = varx + eps
        denom = denom.sqrt()
        y = input / denom
    return y
Implementing New Operators
Utilizing custom CUDA kernels
TensorRT-LLM can use custom kernels via "Plugins"
• Any kernels from CUTLASS, OpenAI Triton, CUDA, & more
0. GPU kernel: compiled GPU kernel from CUTLASS, OAIT, or other
1. Define kernel config: metadata on the kernel for plugin generation
(Flow: GPU kernel -> TensorRT-LLM plugin -> functional.py -> layers)
[Chart: Maximum relative performance of TensorRT-LLM on H100 vs A100 for GPT-J 6B, Llama 7B, Llama 70B (TP4), and Falcon 180B (TP8); speedups shown range from 2.5x to 4.6x.]
[Chart: Relative performance for GPT-J 6B, Llama 7B, and Llama 70B; data labels shown include 4.3x, 5.1x, and 8.9x.]
[Chart: Throughput (requests/s) on H100 vs A100 for GPT-J 6B, Llama 7B, Llama 70B (TP2), Llama 70B (TP4), and Falcon 180B (TP8); values shown range from 4.4 to 50.9 requests/s.]
TensorRT-LLM v0.5.0 internal build. Triton with TensorRT-LLM inflight batching backend
DGX H100 FP8 & DGX A100 FP16
CNN Daily Mail dataset. Varying max concurrency. SOL serving scenario
TPN = Tensor Parallel across N devices.
TensorRT-LLM Performance
Advanced techniques can further improve TensorRT-LLM performance and reduce memory consumption
[Chart: Relative throughput and memory improvement over the FP16 baseline for FP16, KVQuant, INT8, INT8 + KVQuant, INT8 SQ, INT8 SQ + KVQuant, INT4, and INT4 + KVQuant; data labels shown include 1.6x and 0.3x.]
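To make one of these techniques concrete, below is a generic sketch of symmetric per-channel INT8 weight-only quantization in plain PyTorch; it illustrates the idea only and is not TensorRT-LLM's implementation or API:

    # Generic INT8 weight-only quantization sketch (illustrative; not TensorRT-LLM code).
    import torch

    def quantize_weight_int8(w: torch.Tensor):
        """Symmetric per-output-channel INT8 quantization: w ~ w_q * scale."""
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0           # [out, 1]
        w_q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
        return w_q, scale

    def int8_weight_only_matmul(x: torch.Tensor, w_q: torch.Tensor, scale: torch.Tensor):
        # Weights are stored in INT8 (roughly half the memory of FP16) and
        # dequantized on the fly; activations stay in higher precision.
        return x @ (w_q.float() * scale).t()

    w = torch.randn(4096, 4096)
    w_q, scale = quantize_weight_int8(w)
    x = torch.randn(8, 4096)
    err = (x @ w.t() - int8_weight_only_matmul(x, w_q, scale)).abs().max()
    print(f"max abs error vs FP32 matmul: {err:.4f}")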