Agenda
• LLMs in Context
2
Why Bother Creating A Custom Model?
Motivations for Fine-Tuning
3. You want a smaller model (fewer parameters) trained to imitate a larger one (knowledge distillation).
5
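The distillation motivation above can be made concrete: the smaller student model is trained to match the larger teacher's softened output distribution. A toy sketch in pure Python; the logits and temperature here are invented for illustration, not taken from any real model:

```python
import math

def softmax(logits, temperature=1.0):
    # Soften the distribution with a temperature; T > 1 exposes the
    # teacher's relative confidence in near-miss classes.
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions:
    # the student is trained to imitate the larger teacher's outputs.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.5]   # hypothetical logits from a large model
aligned = [3.8, 1.1, 0.4]   # student that closely imitates the teacher
random_ = [0.1, 2.5, 1.9]   # student that does not
loss_good = distillation_loss(teacher, aligned)
loss_bad = distillation_loss(teacher, random_)
```

Minimizing this loss over a training corpus pushes the student toward the teacher's behavior at a fraction of the parameter count.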
Example App 1: Domain-Tailored
Summarization with Hosted LLM APIs
6
Domain-Tailored Summarization
Typical downstream clients only read from the Summarization API.
[Diagram: client reading from the Summarization API]
7
Domain-Tailored Summarization
Our example application also writes new data back into the API.
[Diagram: Summarization API with a client application that both reads and writes]
8
Domain-Tailored Summarization
9
How It Works
Domain-Tailored Summarization Under the Hood
10
Tuning Hosted API LLMs
11
Tune and Host an LLM Entirely Through an API
12
OpenAI UI for Fine-Tuning and Assistants
https://platform.openai.com/finetune
https://platform.openai.com/playground?mode=assistant
13
What is Fine-Tuning?
Traditionally, fine-tuning means updating all of the model's parameters with supervised learning.
14
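The traditional full-parameter recipe can be sketched with a toy model: every parameter is updated by gradient descent on supervised pairs. A real run does the same over billions of parameters with a token-level cross-entropy loss; the 1-D linear model and data below are illustrative only:

```python
# Toy "full-parameter" supervised fine-tuning: every parameter of the
# model (here a single weight w) is updated by gradient descent on
# labeled (input, target) pairs.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # supervised pairs
w, lr = 0.0, 0.05

for _ in range(200):                 # epochs over the dataset
    for x, y in data:
        grad = 2 * (w * x - y) * x   # d/dw of the squared error (w*x - y)^2
        w -= lr * grad               # update ALL parameters (here: just w)
```

After training, w converges to 2.0, the weight that fits every supervised pair.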
What is Fine-Tuning?
Also refers to alignment with human intent
Reinforcement Learning with Human Feedback (RLHF) pipeline (simplified version):
1. Supervised fine-tuning of the LLM
2. Train a reward model with human feedback: humans rate responses, and those ratings (preferences) train the reward model
3. PPO training: the policy network produces a response to a prompt, and the reward model supplies the reward
https://huggingface.co/nvidia/SteerLM-llama2-13B
https://arxiv.org/abs/2310.05344
16
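Reward models in this pipeline are commonly trained with a pairwise preference loss (Bradley-Terry style): the loss shrinks as the model scores the human-preferred response higher than the rejected one. A minimal sketch with hypothetical scalar rewards, not any specific implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(reward_chosen, reward_rejected):
    # Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it widens the margin between the preferred and
    # rejected responses' scores.
    return -math.log(sigmoid(reward_chosen - reward_rejected))

# Hypothetical scalar rewards for two responses to the same prompt.
good_margin = preference_loss(2.0, -1.0)   # preferred response scored higher
bad_margin = preference_loss(-1.0, 2.0)    # preferred response scored lower
```

With no margin at all (equal rewards), the loss is log 2, and it falls toward zero as the margin grows.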
Latest Techniques for Customizing LLMs
Full-Parameter Fine-Tuning
• Techniques: SFT, RLHF, SteerLM
• Training data: thousands of examples & complex use cases
Zero-Shot: asking the foundation model to perform a task with no previous examples
Few-Shot: providing examples as context to the foundation model before giving it a task
18
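The two prompting styles can be made concrete with simple prompt builders; the task and examples below are invented for illustration:

```python
def zero_shot(task, query):
    # Zero-shot: ask the model to perform the task with no examples.
    return f"{task}\n\nInput: {query}\nOutput:"

def few_shot(task, examples, query):
    # Few-shot: prepend worked examples as context before the new input.
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{task}\n\n{demos}\n\nInput: {query}\nOutput:"

task = "Classify the sentiment of the movie review as positive or negative."
examples = [("A delightful film.", "positive"), ("A tedious mess.", "negative")]
prompt = few_shot(task, examples, "An instant classic.")
```

No model weights change in either case; only the text sent to the foundation model differs.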
Latest Techniques for Customizing LLMs
20
Prompt Learning (Continued)
Prompt Tuning vs P-Tuning
21
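Assuming the usual definitions, both techniques keep the base model frozen and prepend learned "virtual token" embeddings to the input; prompt tuning learns those embeddings directly, while p-tuning generates them with a small prompt-encoder network. This toy sketch shows only the shared idea of prepending trainable embeddings; the sizes and the hash-based "embedding" are stand-ins, not real model components:

```python
import random
random.seed(0)

EMB = 8          # toy embedding width (real models use e.g. 4096)
VIRTUAL = 3      # number of learnable virtual tokens

def embed_tokens(tokens):
    # Stand-in for the frozen base model's token embedding table.
    return [[hash((t, d)) % 97 / 97.0 for d in range(EMB)] for t in tokens]

# Prompt tuning: the virtual-token embeddings themselves are the only
# trainable parameters (p-tuning would produce them with an MLP instead).
virtual_embeddings = [[random.uniform(-0.1, 0.1) for _ in range(EMB)]
                      for _ in range(VIRTUAL)]

def build_input(tokens):
    # Prepend virtual tokens to the frozen input embeddings; only
    # virtual_embeddings would receive gradient updates during tuning.
    return virtual_embeddings + embed_tokens(tokens)

seq = build_input(["summarize", "this", "document"])
```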
Example App 1: P-Tuning through NeMo LLM Service
Web UI for Easy Model Customization
22
Example App 1: P-Tuning through NeMo LLM Service
Loss Curves
23
Latest Techniques for Customizing LLMs
HOW: Prompt templates | Tune companion model | Tune LLM model weights
TRAINING DATA: Single-digit number of prompt-completion examples & simple use cases | A few hundred examples & use cases where prompt engineering is not sufficient | Thousands of examples & complex use cases
25
Latest Techniques for Customizing LLMs
Parameter-Efficient Fine-Tuning (PEFT)
HOW: Prompt templates | Tune companion model | Add custom layers to LLM | Tune LLM model weights
TRAINING DATA: Single-digit number of prompt-completion examples & simple use cases | A few hundred examples & use cases where prompt engineering is not sufficient | Hundreds of examples for a multitude of downstream tasks | Thousands of examples & complex use cases
27
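As one concrete PEFT example, LoRA freezes the pretrained weight matrix and learns a low-rank update, so only a small number of new parameters is trained. A toy sketch with made-up sizes (real layers are thousands of dimensions wide, where the parameter savings become dramatic):

```python
# LoRA sketch: the frozen weight W is augmented with a low-rank update
# (alpha / r) * B @ A; only B (d x r) and A (r x d) are trained.
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r, alpha = 4, 2, 4.0  # toy sizes; real models use d in the thousands
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
B = [[0.1] * r for _ in range(d)]   # trainable, d x r
A = [[0.2] * d for _ in range(r)]   # trainable, r x d

scale = alpha / r
delta = matmul(B, A)                # low-rank update, rank <= r
W_eff = [[W[i][j] + scale * delta[i][j] for j in range(d)]
         for i in range(d)]         # effective weights used at inference
```

Because the update can be folded into W after training, LoRA adds no extra inference latency, matching the comparison later in the deck.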
Obtaining Datasets for Tuning
28
Example App 2: Tailored RAG
29
Retrieval Augmented Generation (RAG)
Motivation
• Decouples an LLM from only being able to act on its original training data
• Obviates the need to retrain the LLM with the latest data
• LLMs are limited by context window sizes
Concept
• Connect the LLM to data sources at inference time
  • e.g., databases, web, documents, 3rd-party APIs, etc.
• Find relevant data
• Inject relevant data into the prompt
Components of the Application
1. Human input (prompt)
2. Vectorization (embedding)
3. Retrieve vectors and calculate distance
4. Extract closest matching docs
5. Inject relevant docs into the prompt
6. Output becomes up-to-date, more accurate, with ability to cite sources
30
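The six components above can be sketched end to end. A toy bag-of-words vector stands in for a real embedding model and a Python list for the vector database; the documents are invented for this example:

```python
import math

DOCS = [  # stand-in for a document store / vector database
    "The 2024 maintenance window is scheduled for the first Saturday of March.",
    "Employees may carry over up to five vacation days into the next year.",
]

VOCAB = sorted({w for d in DOCS for w in d.lower().split()})

def embed(text):
    # Step 2: vectorization (toy bag-of-words in place of a real encoder).
    words = text.lower().split()
    return [words.count(v) for v in VOCAB]

def cosine(u, v):
    # Step 3: similarity between the query vector and stored vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def rag_prompt(question):
    q = embed(question)  # steps 1-2: human input, embedded
    best = max(DOCS, key=lambda d: cosine(q, embed(d)))  # steps 3-4
    # Step 5: inject the retrieved doc; step 6: the LLM can now answer
    # from current data and cite its source.
    return f"Context: {best}\n\nQuestion: {question}\nAnswer:"

prompt = rag_prompt("How many vacation days can employees carry over?")
```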
Embeddings and the Vector Database
Searching via semantic similarity
Chunk 1 shares more keywords with the query, but semantic search can differentiate the meanings of "lead" and understand that "team" and "group" are similar, so Chunk 2 may be more helpful for the query.
[Figure: 2D representation of a 768-dimension embedding space, with clusters for specialized medical topics and Speech AI]
31
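The keyword-vs-semantic contrast can be reproduced in miniature. The 3-D "embeddings" below are hand-picked stand-ins for what a real 768-dimension encoder would produce, chosen to illustrate the point rather than computed by any model:

```python
import math

query  = "who will lead the research team"
chunk1 = "lead levels in drinking water can harm a research subject"
chunk2 = "Dr. Chen heads the group studying water safety"

def keyword_overlap(a, b):
    # Naive lexical search: count shared words.
    return len(set(a.split()) & set(b.split()))

# Hypothetical embeddings: a real encoder would place the leadership
# sense of "lead"/"team" near "heads"/"group", and the metal sense far away.
emb = {
    query:  [0.9, 0.1, 0.2],
    chunk1: [0.2, 0.9, 0.1],   # "lead" as the metal: far from the query
    chunk2: [0.8, 0.2, 0.3],   # leadership meaning: close to the query
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

kw1, kw2 = keyword_overlap(query, chunk1), keyword_overlap(query, chunk2)
sem1, sem2 = cosine(emb[query], emb[chunk1]), cosine(emb[query], emb[chunk2])
```

Keyword counting ranks chunk 1 first; semantic similarity ranks chunk 2 first, matching the slide's point.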
Canonical RAG Workflow
[Diagram: the user's prompt passes through guardrails, context retrieved from user data is injected, and the response passes back through guardrails]
33
Example App 2: Naive RAG
34
Example App 2: Question Decomposition RAG
35
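The slide title suggests splitting a compound question into sub-questions and retrieving context for each before answering. A hedged sketch under that assumption, with a naive "and"-splitter standing in for LLM-driven decomposition and a toy corpus:

```python
# Hypothetical question-decomposition RAG: split a compound question
# into sub-questions, retrieve context for each, answer them together.
DOCS = {
    "pto":    "Employees accrue 1.5 vacation days per month.",
    "remote": "Remote work is allowed up to three days per week.",
}

def decompose(question):
    # Naive decomposition on "and"; a production system would ask an
    # LLM to generate the sub-questions instead.
    return [q.strip().rstrip("?") + "?" for q in question.split(" and ")]

def retrieve(sub_question):
    # Toy retrieval: pick the doc sharing the most words with the sub-question.
    words = set(sub_question.lower().split())
    return max(DOCS.values(),
               key=lambda d: len(words & set(d.lower().split())))

def build_prompt(question):
    subs = decompose(question)
    context = "\n".join(retrieve(s) for s in subs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt(
    "How many vacation days do employees accrue and "
    "how often can they work remotely?")
```

Each sub-question pulls in its own supporting document, so the final prompt contains context for both halves of the question.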
How It Works
Synthetic Data Generation and Context Distillation
36
More Robust RAG
37
Tuning Self-Managed LLMs
41
Self-Managed vs. Hosted API
Self-Managed
Motivations:
• Privacy/Ownership
• Portability/Flexibility
• Cost: run on your own infrastructure
• Choice of customization
Examples for Getting Started: NeMo Framework, HuggingFace Hub + PEFT
Hosted API
Motivations:
• Easy to use: push-button experiences
• Easy deployment: don't have to worry about managing hardware and keeping your API healthy
Examples for Getting Started: OpenAI, Cohere, AWS Bedrock, NeMo LLM Service
42
NVIDIA NeMo Framework
From foundation model to application
[Diagram: developers take pre-trained foundation models through the NeMo Framework, running on NVIDIA AI Enterprise, to applications that answer in-domain queries with in-domain, secure, cited responses]
Fine-Tuning in NeMo
An All-in-One Implementation
1. Set Parameter-Efficient Fine-Tuning Type
2. Set Hyperparameters
   a) Adapter
   b) LoRA
   c) P-Tuning

peft:
  peft_scheme: "adapter"  # can be either adapter, ia3, ptuning, adapter_and_ptuning, or lora
  restore_from_path: null
  adapter_tuning:
    type: 'parallel_adapter'  # this should be either 'parallel_adapter' or 'linear_adapter'
    adapter_dim: 32
    adapter_dropout: 0.0
    norm_position: 'pre'  # This can be set to 'pre' or 'post'; 'pre' is normally what is used.
    column_init_method: 'xavier'  # options: xavier, zero or normal
    row_init_method: 'zero'  # options: xavier, zero or normal
    norm_type: 'mixedfusedlayernorm'  # options are ['layernorm', 'mixedfusedlayernorm']
  lora_tuning:
    adapter_dim: 32
    adapter_dropout: 0.0
    column_init_method: 'xavier'  # options: xavier, zero or normal
    row_init_method: 'zero'  # IGNORED if linear_adapter is used; options: xavier, zero or normal
  p_tuning:
    virtual_tokens: 10  # The number of virtual tokens the prompt encoder should add at the start of the sequence
    bottleneck_dim: 1024  # the size of the prompt encoder mlp bottleneck
    embedding_dim: 1024  # the size of the prompt encoder embeddings
    init_std: 0.023
47
Recap
What did we learn today?
48
Live Q&A
49
Explore More Gen AI/LLM Training
25% off workshops for LLM Day registrants*
USE CODE: TRAIN-LLM
Self-Paced Courses
> Generative AI Explained (Free)
> Generative AI With Diffusion Models
Join the NVIDIA Developer community to get access to technical training, technology, AI models, and 600+ SDKs:
Developer.nvidia.com/join
View our comprehensive Gen AI/LLM learning path, covering fundamental to advanced topics:
> Gen AI/LLM Learning Path
*Offer valid for any of the DLI public workshops scheduled through March 01, 2024.
50
Supplementary Materials
51
Comparison of Approaches
[Table comparing seven customization approaches; the column headers were not recovered from the source]
Same model architecture: Yes | Yes | Yes | No | No in training, Yes in inference | No | Yes
New added parameters: Zero | Limited | Limited | Moderate | Moderate | Limited | Large
Extra inference latency: High | Moderate | Moderate | Limited | Zero | Limited | Zero
Training data requested: Minimum | Limited | Limited | Moderate | Moderate | Moderate | High
Training computation cost: Zero | Limited | Limited | Moderate | Moderate | Moderate | High
52