
LLM TRAINING UPDATES

Overview of NVIDIA’s Large Language Model offerings


Core Value Proposition

NeMo Framework: Easy-to-use, out-of-the-box framework with a large model collection for enterprise users to experiment, train, and deploy.

Megatron-LM: A lightweight reference framework showing how to use Megatron-Core to build your own LLM framework.

Megatron-Core: Library of GPU-optimized techniques for LLM training, for customers building a custom LLM framework.

Transformer Engine: Acceleration library for Transformer models on Hopper, including FP8.

TRT-LLM: Achieves optimal inference performance on the latest large language models on NVIDIA GPUs.
NVIDIA NeMo for Custom LLMs
End-to-end, cloud-native framework to build, customize and deploy generative AI models

[Diagram: developers and applications query custom LLMs built with the NeMo Framework (data curation, distributed training, model customization, accelerated inference, guardrails), delivered as part of NVIDIA AI Enterprise]

Multi-Modality: Build language, image, and generative AI models.
Data Curation at Scale: Extract, deduplicate, and filter information from large unstructured data at scale.
Optimized Training: Accelerate training and throughput by parallelizing the model and the training data across thousands of nodes.
Model Customization: Easily customize with P-tuning, SFT, Adapters, RLHF, and ALiBi.
Deploy at Scale: Run optimized inference at scale, anywhere.
Guardrails: Keep applications aligned with safety and security requirements using NeMo Guardrails.
Support: NVIDIA AI Enterprise and experts by your side to keep projects on track.
Megatron-Core

Performance at Scale
● Memory, compute, and communication optimizations
● Parallelism techniques: Tensor-Parallelism, Pipeline-Parallelism, Sequence-Parallelism (a tensor-parallel sketch follows this slide)
● Checkpointing – full/selective
● Distributed optimizer
● Hopper FP8 via Transformer Engine
● MLPerf optimizations

Flexibility
● Optimized transformer blocks and techniques for LLM frameworks
● PyTorch programmability interface

Formalized Product Support
● Latest research and performance optimizations
● Regular releases, open source on GitHub and as pip wheels
● Versioned APIs and documentation
● Open source - welcome PRs from the community
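As a concrete, single-process illustration of the tensor-parallelism technique listed above (a minimal sketch, not Megatron-Core's implementation), the snippet below shards a linear layer's weight by output columns so each "rank" computes a partial result; real deployments place one shard per GPU and gather the partial outputs with NCCL.

# Single-process simulation of tensor (column) parallelism; illustrative only.
import torch

d_in, d_out, tp_size = 16, 32, 4
x = torch.randn(8, d_in)                  # activations are replicated across TP ranks
w = torch.randn(d_in, d_out)              # full weight of a column-parallel linear layer

shards = torch.chunk(w, tp_size, dim=1)   # each rank holds d_out / tp_size output columns
partials = [x @ w_shard for w_shard in shards]   # each rank computes its slice independently
y_tp = torch.cat(partials, dim=1)         # the all-gather of partial outputs

assert torch.allclose(y_tp, x @ w, atol=1e-5)    # matches the unsharded computation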
3D Parallelism Techniques To Build Foundation Models

• Requires extensive experimentation to configure hyperparameters
• Needs state-of-the-art algorithms to process internet-scale data across an entire datacenter

• Communication overlapping (a minimal overlap sketch follows this list)
  • Data parallel
    • Gradient allreduce overlap (without distributed optimizer)
    • Gradient reduce-scatter and parameter allgather overlap (with distributed optimizer)
  • Interleaved pipeline parallel
  • TP-comm optimization (experimental)
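The sketch below shows the basic data-parallel overlap pattern from this list: launch the gradient all-reduce asynchronously and keep computing while the communication is in flight. It is illustrative only and runs as a single gloo process so it is self-contained; real training uses NCCL across many GPUs and hooks this into the backward pass.

# Minimal gradient all-reduce overlap pattern; illustrative, single process with the gloo backend.
import os
import torch
import torch.distributed as dist

def main():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    grad_bucket = torch.randn(1024)                        # gradients of one bucket of parameters
    handle = dist.all_reduce(grad_bucket, async_op=True)   # kick off communication immediately

    # ... backward computation for earlier layers proceeds here, overlapping with communication ...
    earlier_grads = torch.randn(1024) * 2.0

    handle.wait()                                          # reduction must finish before the optimizer step
    dist.destroy_process_group()

if __name__ == "__main__":
    main()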
New Release - FP8 Enablement via Transformer Engine API
Benefits of FP8
★ Accelerates math-intensive operations - FP8 Tensor Cores are 2x faster than 16-bit Tensor Cores
★ Accelerates memory-intensive operations - reduces memory traffic, since 8-bit values need half the bytes of 16-bit values
★ Facilitates preparation for fast inference deployment - models are already trained in FP8
★ Fourth-generation Tensor Cores (H100* FP8) are up to 6x faster than A100 (network end-to-end speedups are up to 3x)
Training in FP8
- FP8 training matches the behavior of 16-bit training - no changes to hyperparameters
- FP8 is enabled in Transformer Engine (TE), a library that handles FP8 details internally and abstracts them away from DL frameworks. Calls to FP8 operations and casts are done within modules
- TE APIs are integrated into the Megatron-Core Transformer Layer

Megatron-Core Transformer Layer (TE modules per block)

• Attention: TE.ColumnParallelLinear, TE.RowParallelLinear, TE.DotProductAttention
• MLP: TE.ColumnParallelLinear, TE.RowParallelLinear
• Norm: TE.LayerNorm, TE.RMSNorm
* For Technical Discussion & Reference Only
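As a rough illustration of how a framework calls into TE (a minimal sketch, not the Megatron-Core integration itself), the snippet below wraps a TE linear layer in an fp8_autocast region with a delayed-scaling recipe. It assumes an H100-class GPU and the transformer_engine package; the layer sizes and history length are illustrative.

# Minimal FP8 sketch with Transformer Engine; assumes Hopper hardware and transformer_engine installed.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID format: E4M3 in the forward pass, E5M2 in the backward pass, with delayed per-tensor scaling
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID, amax_history_len=16)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)          # the GEMM runs in FP8; casts and scaling live inside the module
y.sum().backward()        # the backward pass also uses FP8 where the recipe allows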

FP8 TRAINING
Less Storage, More Performance

• The H100 GPU adds FP8 Tensor Cores to accelerate both AI training and inference. FP8 Tensor Cores support FP32 and FP16 accumulators, and two new FP8 input types:
  • E4M3, with 4 exponent bits, 3 mantissa bits, and 1 sign bit
  • E5M2, with 5 exponent bits, 2 mantissa bits, and 1 sign bit
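For intuition, the snippet below inspects the two formats' dynamic ranges using PyTorch's float8 dtypes (a sketch assuming a PyTorch build that ships torch.float8_e4m3fn and torch.float8_e5m2); E4M3 trades range for precision, E5M2 the reverse.

# Illustrative only: compare E4M3 and E5M2 ranges via PyTorch's float8 dtypes (PyTorch >= 2.1).
import torch

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "smallest normal:", info.tiny)

# Casting shows the coarse value grid of 8-bit floats: values snap to representable FP8 points
x = torch.tensor([0.1234, 3.7, 250.0])
print(x.to(torch.float8_e4m3fn).to(torch.float32))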

FP8 TRAINING
Less Storage, More Performance

• Partition the DL network graph into safe and unsafe regions
  • The FP8 training recipe can be combined with an FP32 or FP16/BF16 recipe for the unsafe regions
• Explicit casts are not enough - FP8 operators need to use higher precision internally and be able to produce higher-precision outputs
  • E4M3 for the forward pass, E5M2 for the backward pass
• Use per-tensor scaling factors
  • FP16 training uses a single loss-scaling factor during the backward pass to avoid over- and underflows in the value distribution of the tensors
  • In FP8, scaling factors are needed in both passes, and a single scaling factor is no longer enough
• Delayed scaling (see the sketch after this list)
  • Choose the scaling factor based on the maximums of absolute values seen in some number of previous iterations
  • Requires storing the history of maximums as additional parameters of the FP8 operators
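A minimal sketch of the delayed-scaling idea (illustrative only, not Transformer Engine's implementation): keep a short rolling history of per-tensor absolute maxima and derive the next iteration's scale from it.

# Illustrative delayed scaling; Transformer Engine handles this internally per FP8 tensor.
import torch

E4M3_MAX = 448.0  # largest representable E4M3 magnitude

def update_scale(amax_history: torch.Tensor, new_amax: torch.Tensor, margin: int = 0):
    # shift the rolling history of absolute maxima and insert the newest observation
    amax_history = torch.roll(amax_history, shifts=1)
    amax_history[0] = new_amax
    # the next iteration's scale comes from the history, not from the current tensor alone
    amax = amax_history.max()
    scale = (E4M3_MAX / amax) / (2.0 ** margin)
    return amax_history, scale

history = torch.zeros(16)                       # amax history stored with the FP8 operator
history, scale = update_scale(history, torch.tensor(3.2))
x = torch.randn(4)
x_scaled = torch.clamp(x * scale, -E4M3_MAX, E4M3_MAX)   # scaled before the cast to FP8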

[Figure: FP8 tensor vs. BF16 tensor]

FP8 TRAINING
training loss and accuracy
New Release - Context Parallelism
Support long sequences

• Many LLM tasks require long sequence lengths, e.g.,
  • QA over long documents
  • Summarization of long documents, for example medical records or financial records
  • Code analysis and generation using programming manuals

• What is Context Parallelism?
  • Split the long-context input along the sequence dimension and parallelize across multiple GPUs
  • Different from sequence parallelism, which only splits the activations of dropout and LayerNorm along the sequence dimension
  • Supports BF16 and FP8
  • Supports Flash Attention v1; v2 support is being added
  • Supports self-attention with causal masking; extending to cross-attention and other masking mechanisms

Splitting the sequence for the entire network reduces the need for activation checkpointing and hence recompute overhead.
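To make the splitting step concrete, here is a toy sketch (not Megatron-Core's implementation) of partitioning activations along the sequence dimension across a context-parallel group; in practice, attention over the full sequence then requires exchanging keys/values between the context-parallel ranks, and real implementations also balance work across ranks for causal masking.

# Toy context-parallel sequence split; illustrative only.
import torch

def split_along_sequence(x: torch.Tensor, cp_size: int, cp_rank: int) -> torch.Tensor:
    """x: [batch, seq_len, hidden]. Each context-parallel rank keeps seq_len / cp_size tokens."""
    assert x.shape[1] % cp_size == 0, "sequence length must divide evenly across CP ranks"
    return torch.chunk(x, cp_size, dim=1)[cp_rank]

x = torch.randn(2, 16384, 1024)               # a 16k-token input that is heavy for a single GPU
local = split_along_sequence(x, cp_size=4, cp_rank=0)
print(local.shape)                            # torch.Size([2, 4096, 1024]): each rank holds 4k tokens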
New Release - Context Parallelism
Support long sequences

• Benefits
  • Pre-training: 30% performance improvement for large models (175B+) and long sequence lengths (16k+) by removing checkpoint/recompute overheads
  • Fine-tuning: avoids OOM for long sequences (32k-128k)

Splitting the sequence for the entire network reduces the need for activation checkpointing and hence recompute overhead.
New Release - MoE support
• A Mixture of Experts (MoE) model is a type of neural network that can be thought of as having different submodels (or experts), each specialized for different inputs. Each expert is a feed-forward network with its own weights.
  • The experts are sparsely activated: for a given input token, only a subset of the experts is used, giving the model more capacity while limiting computation.
  • Which experts are used for a given token is determined by the routing algorithm.

• Benefits:
  • MoE substantially scales up model capacity while introducing only a small computation overhead
  • Fast inference time

[Figure: Transformer encoder vs. MoE Transformer encoder [1][2]]

[1] Lepikhin, Dmitry, et al. "GShard: Scaling giant models with conditional computation and automatic sharding." arXiv preprint arXiv:2006.16668 (2020).
[2] Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).
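A toy sketch of the idea described above (illustrative only, not Megatron-Core's MoE layer): a router scores the experts for each token, and only the selected expert FFN runs for that token.

# Toy MoE layer with top-1 routing; real implementations batch tokens per expert and run them in parallel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                              # x: [tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)
        top_prob, top_expert = probs.max(dim=-1)       # each token picks one expert (top-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_expert == e                     # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask]) * top_prob[mask].unsqueeze(-1)
        return out

tokens = torch.randn(10, 64)
print(ToyMoE()(tokens).shape)                          # torch.Size([10, 64])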
New Release - MoE support

● Router choices
○ Sinkhorn
○ Top-K (on roadmap)
○ Expert Choice (on roadmap)

● Token droppable vs. dropless
○ Dropless

● MoE layer frequency
○ All layers

● MoE Parallelism
○ Expert Parallelism (EP+DP)
○ Expert Tensor Parallelism (EP+DP+TP(+SP))
○ ETP with Pipeline Parallelism (EP+DP+TP(+SP)+PP)
Traditional MoE Layer Workflow
[Figure: traditional MoE layer workflow]

Gale, Trevor, et al. "MegaBlocks: Efficient Sparse Training with Mixture-of-Experts." Proceedings of Machine Learning and Systems 5 (2023).
MoE Layer Workflow in Megatron-Core
No token is dropped, to ensure model accuracy

(1) Routing: assign token feature vectors to experts based on the router probabilities.
(2) Permutation: group tokens by expert.
(3) Computation: each expert computes its layers over the set of tokens it was assigned.

[Diagram: router → token permutation → Expert 0 / Expert 1 / Expert 2 → scale → permutation back]
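A compact sketch of this dropless routing → permutation → computation flow (illustrative only; Megatron-Core's implementation differs in its kernels and parallelism):

# Illustrative dropless MoE flow: route, permute tokens by expert, compute per expert, un-permute.
import torch
import torch.nn.functional as F

def moe_forward(tokens, router_w, expert_ws):
    # (1) Routing: score every token against every expert, keep the top-1 expert per token
    probs = F.softmax(tokens @ router_w, dim=-1)               # [num_tokens, num_experts]
    top_prob, expert_id = probs.max(dim=-1)

    # (2) Permutation: sort tokens so that tokens assigned to the same expert are contiguous
    order = torch.argsort(expert_id)
    permuted = tokens[order]
    counts = torch.bincount(expert_id, minlength=len(expert_ws)).tolist()

    # (3) Computation: each expert processes only its contiguous slice of tokens (none are dropped)
    outputs, start = [], 0
    for e, n in enumerate(counts):
        outputs.append(permuted[start:start + n] @ expert_ws[e])
        start += n
    out = torch.cat(outputs, dim=0) * top_prob[order].unsqueeze(-1)   # scale by router probability

    # Un-permute back to the original token order
    unpermuted = torch.empty_like(out)
    unpermuted[order] = out
    return unpermuted

tokens = torch.randn(8, 16)
router_w = torch.randn(16, 4)
expert_ws = [torch.randn(16, 16) for _ in range(4)]
print(moe_forward(tokens, router_w, expert_ws).shape)              # torch.Size([8, 16])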
MoE in Megatron-Core
Future Updates/Roadmap

● MegaBlocks integration

● More routers (Top-K, Expert Choice, …)

● Pipeline communication/computation overlapping for the MoE layer

● Token-droppable workflow support

● …
LLM INFERENCE – TENSORRT-LLM & TRITON
TensorRT-LLM: Optimizing LLM Inference
SoTA performance for large language models in production deployments

TensorRT-LLM is an open-source library for optimal inference performance on the latest large language models on NVIDIA GPUs.

TensorRT-LLM wraps TensorRT's deep learning compiler, optimized kernels from FasterTransformer, pre/post processing, and MGMN communication in a simple open-source Python API for defining, optimizing, & executing LLMs for inference in production.

SoTA Performance: Leverage TensorRT compilation & kernels from FasterTransformer, CUTLASS, OAI Triton, and more.
Ease of Extension: Add new operators or models in Python to quickly support new LLMs with optimized performance.
LLM Batching with Triton: Maximize throughput and GPU utilization through new scheduling techniques for LLMs.

# define a new activation
def silu(input: Tensor) -> Tensor:
    return input * sigmoid(input)

# implement models like in DL FWs
class LlamaModel(Module):
    def __init__(…):
        self.layers = ModuleList([…])

    def forward(…):
        hidden = self.embedding(…)
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden

[Charts: A100 vs. H100 with TRT-LLM, and static vs. inflight batching; relative gains of 4.6x, 3x, 2x, and 5x across performance, TCO, average latency, and cost. Numbers are preliminary, based on internal evaluation of Llama 7B on H100.]
TensorRT-LLM in the DL Compiler Ecosystem
TensorRT-LLM builds on TensorRT compilation

• TensorRT-LLM
  • Built on top of TensorRT
  • Leverages TensorRT for general graph optimizations & fast kernels
  • Adds LLM-specific optimizations:
    • KV caching & custom MHA kernels
    • Inflight batching, paged attention
    • Multi-GPU, multi-node
    • Quantization (INT4/INT8 weight-only, GPTQ, AWQ, SmoothQuant, FP8)
    • & more
  • ONLY for LLMs

• TensorRT
  • General-purpose deep learning compiler
  • Graph rewriting, constant folding, kernel fusion
  • Optimized GEMMs & pointwise kernels
  • Kernel auto-tuning
  • Memory optimizations
  • Multi-stream execution
  • & more
  • All AI workloads
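To illustrate why KV caching matters for LLM decoding (a conceptual sketch; TensorRT-LLM implements this with fused and paged-attention kernels): at each generated token only the new key/value pair is computed and appended, and attention reads the cached history instead of recomputing it.

# Conceptual KV-cache decode step; not TensorRT-LLM's kernels.
import torch

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """q_new/k_new/v_new: [batch, 1, d]; caches: [batch, past_len, d]."""
    k = torch.cat([k_cache, k_new], dim=1)        # append this step's key/value to the cache
    v = torch.cat([v_cache, v_new], dim=1)
    scores = q_new @ k.transpose(-1, -2) / (q_new.shape[-1] ** 0.5)
    out = torch.softmax(scores, dim=-1) @ v       # attend over the whole cached history
    return out, k, v                              # updated caches are reused at the next step

b, d = 1, 64
k_cache, v_cache = torch.zeros(b, 0, d), torch.zeros(b, 0, d)   # empty cache before the first token
for _ in range(3):                                              # generate three tokens
    q, k, v = torch.randn(b, 1, d), torch.randn(b, 1, d), torch.randn(b, 1, d)
    out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
print(k_cache.shape)                                            # torch.Size([1, 3, 64])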
TensorRT-LLM Usage
Create, Build, Execute

0. Trained model in a framework: NeMo, HuggingFace, or other DL frameworks.

1. Model initialization: instantiate the model and load the weights. Load a pre-built example model, or define one via the TensorRT-LLM Python APIs.

2. Engine building: build & serialize the engine. The model is compiled into an optimized implementation via TensorRT and custom kernels (TRT engine + plugins) and saved as a serialized engine.

3. Execution: load the engine and run optimized inference. Execute in Python, C++, or Triton.
TensorRT-LLM Usage
Use pre-built models, or optimize new ones!

Model building
● PyTorch/TE-like building blocks for transformers (trt_llm.layers.*: transformer, mlp, attention, …)
  ○ e.g., fMHA, layerNorm, activations, etc.
  ○ Built on top of the TensorRT Python API
● Build arbitrary LLMs or deploy pre-built implementations
  ○ e.g., GPT, LLaMa, BERT, etc.

MGMN inference
● Leverages NCCL plugins for multi-device communication
  ○ Long term, this will move into TensorRT
● Pre-segmented graphs in pre-built models
  ○ Users can manually segment custom models
  ○ Future releases will allow automatic segmentation across GPUs

Model execution
● The TensorRT-LLM Runtime combines TensorRT layers, NCCL plugins, perf plugins, & pre/post-processing ops into a single object
  ○ Includes tokenization & sampling (e.g., beam search)

[Diagram: LLM frameworks (NeMo, PyT, JAX, …) → TensorRT-LLM inference (pre-built models such as GPT and LLaMa, or custom models built from trt_llm.layers.*) → TensorRT-LLM backend (TensorRT primitives, FT kernels, NCCL comm., pre/post processing) → TensorRT-LLM runtime (C++/Py runtime on top of the TensorRT runtime)]
TensorRT-LLM Usage
Model Building & Defining Operators

Easily modify models
• Modify the model similarly to a DL framework
• Add operators to the forward call as desired
• Ops can be used with any model

Modify models simply with modular Python layers:

# llama.py
class LLaMaModel(Module):
    def __init__(…):
        self.layers = ModuleList([…, trt_llm.linear()])

    def forward(…):
        hidden_states = self.embedding(…)
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        # add a new layer to the model with modular building blocks
        hidden_states = self.linear(hidden_states)
        return hidden_states

Improved op coverage & definition
• Define ops in Python with TensorRT Python primitives
• Or map ops to arbitrary kernels with plugins
• Compose operators in Python

Creating new ops from TRT is only a few lines of Python:

# functional.py
def sigmoid(input: Tensor) -> Tensor:
    layer = default_trtnet().add_activation(input.trt_tensor,
                                            trt.ActivationType.SIGMOID)
    return _create_tensor(layer.get_output(0), layer)

def silu(input: Tensor) -> Tensor:
    return input * sigmoid(input)

def swiglu(input: Tensor) -> Tensor:
    x, gate = chunk(input, 2, dim=-1)
    return silu(gate) * x


Implementing New Operators
RMSNorm via TensorRT Python APIs

Implement new operators quickly via TensorRT:
• Quickly implement, entirely in Python, to unblock deployments
• Compiled entirely in TensorRT
• Fused into a single kernel*
• Same or similar performance to custom CUDA kernels

Implementing RMSNorm in TensorRT-LLM:

def rms_norm(input: Tensor,
             normalized_shape: Union[int, Tuple[int]],
             weight: Optional[Tensor] = None,
             eps: float = 1e-05) -> Tensor:
    dim = tuple([-i - 1 for i in range(len(normalized_shape))])

    with precision("float32"):
        varx = pow(input, 2.0)
        varx = varx.mean(dim, keepdim=True)
        denom = varx + eps
        denom = denom.sqrt()
        y = input / denom

    if weight is not None:
        y = y * weight

    return y

* TensorRT may not always fuse operators into a single kernel, which can impact performance.
Implementing New Operators
Utilizing Custom CUDA Kernels

TensorRT-LLM can use custom kernels via "Plugins":
• Allows peak performance on key ops (e.g., MHA)
• Quickly improve any performance bottlenecks
• Any kernels from CUTLASS, OpenAI Triton, CUDA, & more
• Insert as layers directly into the TensorRT-LLM model
• Execution & management handled entirely by TensorRT

Workflow:
0. GPU kernel: a compiled GPU kernel from CUTLASS, OAI Triton, or other sources.
1. Define the kernel config: metadata on the kernel for plugin generation.
2. Generate the plugin: TensorRT-LLM auto-generates the plugin (TensorRT-LLM Plugin, functional.py, layers).
3. Embed: use the generated layers in the model definition.
TensorRT-LLM Performance Across Architectures
H100 up to 4.6x faster than A100 with TensorRT-LLM

[Bar chart: maximum relative tokens/s/GPU improvement of H100 over A100 for GPT-J 6B, Llama 7B, Llama 70B (TP4), and Falcon 180B (TP8); speedups range from 2.5x to 4.6x]

TensorRT-LLM v0.5.0 internal build. Tokens/s/GPU relative improvement.
DGX H100 FP8 vs. DGX A100 FP16.
Max batch size up to 64. Input & output sequence lengths of {1, 128, 2048, 4096}.
TPN = tensor parallel across N devices.
TensorRT-LLM Performance Improvement
Up to 9x faster than baseline LLM implementations in DL frameworks

[Bar chart: relative tokens/s/GPU of TensorRT-LLM vs. the framework baseline for GPT-J 6B, Llama 7B, and Llama 70B; speedups range from 4.3x to 8.9x]

TensorRT-LLM v0.5.0 internal build vs. HF Accelerate. Tokens/s/GPU relative improvement.
DGX H100. TensorRT-LLM FP8, HF Accelerate FP16.
Max batch size up to 64. Input & output sequence length 128:128.
TPN = tensor parallel across N devices. TensorRT-LLM Llama 70B TP2, HF Accelerate TP1.
TensorRT-LLM Performance
End-to-End Performance Using Inflight Batching & Triton

[Bar chart: requests/s on H100 vs. A100 for GPT-J 6B, Llama 7B, Llama 70B (TP2), Llama 70B (TP4), and Falcon 180B (TP8); measured throughputs range from 4.4 to 50.9 requests/s, with A100 results not shown for two configurations]

TensorRT-LLM v0.5.0 internal build. Triton with the TensorRT-LLM inflight batching backend.
DGX H100 FP8 & DGX A100 FP16.
CNN/Daily Mail dataset. Varying max concurrency. SOL serving scenario.
TPN = tensor parallel across N devices.
TensorRT-LLM Performance
Advanced techniques can further improve TensorRT-LLM performance & memory consumption

[Bar chart: throughput and memory relative to FP16 for KV-cache quantization (KVQuant), INT8, INT8 + KVQuant, INT8 SmoothQuant (SQ), INT8 SQ + KVQuant, INT4, and INT4 + KVQuant; labeled values include a 1.6x throughput gain and memory reduced to 0.3x of FP16]

TensorRT-LLM v0.2.0 internal build. MPT-7B.
1x A100-40GB. Averaged across batch sizes [1, 512] and sequence lengths [1, 512].
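The modes in this chart combine reduced-precision weights with optional KV-cache quantization. As a rough illustration of the simplest of them, the sketch below shows weight-only INT8 quantization with per-channel scales (conceptual only; TensorRT-LLM's kernels fuse the dequantization into the GEMM).

# Conceptual INT8 weight-only quantization; not TensorRT-LLM's fused kernels.
import torch

def quantize_weight_int8(w: torch.Tensor):
    # per-output-channel symmetric quantization: w ≈ w_q * scale
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    w_q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return w_q, scale

def int8_weight_only_linear(x, w_q, scale):
    # weights are stored in 8 bits (half the memory of FP16) and dequantized on the fly
    return x @ (w_q.float() * scale).t()

w = torch.randn(1024, 1024)          # [out_features, in_features]
x = torch.randn(4, 1024)
w_q, scale = quantize_weight_int8(w)
err = (int8_weight_only_linear(x, w_q, scale) - x @ w.t()).abs().max()
print(w_q.dtype, float(err))         # int8 storage, small quantization error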
