Tailoring LLMs to Your Use Case
Christopher Pang, Senior Solutions Engineer | 17 November 2023
Agenda
• LLMs in Context
• Tuning Hosted API LLMs
• Data Collection and Prep for Tuning
• Tuning Self-Managed LLMs
Why Bother Creating A Custom Model?
Motivations for Fine-Tuning
1. You just want the best use-case/task-specific results.
2. You want to save money and reduce latency.
3. You want a smaller model (fewer parameters) that was trained to imitate a larger one.
Example App 1: Domain-Tailored Summarization with Hosted LLM APIs
Domain-Tailored Summarization
• Typical downstream clients only read from the API
• Our example application also writes new data back into the API
[Diagram: Client Application and Summarization API]
How It Works
Domain-Tailored Summarization Under the Hood
1. Tune LLM on Your Data
2. Base and Custom LLM Output Predictions
3. Humans Comparing Outputs Side-by-Side With Voting and Editing
4. Generate New Data (feeds back into tuning)
Tuning Hosted API LLMs
Tune and Host an LLM Entirely Through an API
Models shown:
• Jurassic-2 Mid, Jurassic-2 Light
• Titan
• Command
• GPT 4, GPT 3.5, GPT 3
• NeMo GPT, Llama 2
• Llama 2 7B and 70B, Mistral 7B, MPT 7B Instruct
OpenAI UI for Fine-Tuning and Assistants
• Fine-tuning: https://platform.openai.com/finetune
• Assistants: https://platform.openai.com/playground?mode=assistant
What is Fine-Tuning?
Traditionally means updating the full parameters of the model with supervised learning
• Full-parameter fine-tuning re-trains an LLM with a prompt dataset in a supervised manner, requiring an update of all model weights per task
• Prompt dataset needs to be sufficiently large, on the order of thousands to hundreds of thousands of prompts
[Diagram: Initial Pre-Trained LLM, copied to a New LLM through Training on a Dataset (Prompt-Sample Pairs)]
What is Fine-Tuning?
Also refers to alignment with human intent
1. Supervised Fine-Tuning of LLM
• ~10K-100K prompt-responses as input.
• Fine-tune LLM using prompt and responses.
2. Train Reward Model with Human Feedback
• ~100K-1M responses ranked and rated.
• Reward model: trained to mimic human feedback of model generated responses to prompts.
3. Reinforcement Learning Pipeline with Human Feedback
• Build pipeline with RLHF to continuously improve model over time.
• Multiple neural networks interacting.
[Diagram: Humans Rating Responses, Reward (Preference) Model; PPO Training (Simplified Version): Prompt, Policy Network, Response, Reward Model, Reward]
For more details (from 🤗): https://huggingface.co/blog/rlhf
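To make step 2 concrete, here is a minimal sketch of how a ranked pair of responses can become a training signal for a reward model. The slides do not say which loss is used. A pairwise, Bradley-Terry style loss is one common choice and is assumed here. The function name and the reward scores are illustrative only.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human-ranked pair: small when the reward model scores
    the human-preferred response above the rejected one.

    Computes -log(sigmoid(reward_chosen - reward_rejected)).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) is algebraically equal to log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))

# Hypothetical reward scores, not taken from any real model.
loss_agree = pairwise_preference_loss(2.0, -1.0)     # model agrees with the human ranking
loss_tie = pairwise_preference_loss(0.5, 0.5)        # model cannot tell the two apart
loss_disagree = pairwise_preference_loss(-1.0, 2.0)  # model contradicts the human ranking

print(loss_agree, loss_tie, loss_disagree)
```

The loss grows as the reward model's ordering drifts away from the human ordering, which is what pushes it to mimic human feedback.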
Alignment with Human Intent: SteerLM
A technique to customize LLMs during inference
1. Train a prediction model on human-annotated datasets to evaluate response quality on any number of attributes like helpfulness, humor, and creativity.
2. Annotate diverse datasets by predicting their attribute scores to enrich the diversity of data available to the model.
3. Fine-tune by training the LLM to generate responses conditioned on specified combinations of attributes, like user-perceived quality and helpfulness.
4. Bootstrap training through model sampling by generating diverse responses conditioned on maximum quality, then fine-tuning on them to further improve alignment.
https://huggingface.co/nvidia/SteerLM-llama2-13B
https://arxiv.org/abs/2310.05344
Latest Techniques for Customizing LLMs
[Chart axes: data, compute & time investment; accuracy for specific use-cases]
FULL-PARAMETER FINE-TUNING
• Techniques: SFT, RLHF, SteerLM
• How: Tune LLM model weights
• Training data: Thousands of examples & complex use cases
• Advantage: SFT is traditional method supported by all libraries. Robust for tuning to challenging domains (biomedical, coding, etc.)
Prompt Engineering
Prompt design is crucial to obtaining good results from an LLM
Zero-Shot
Asking the foundation model to perform a task with no previous example
Q: What is the capital of France?
A: Paris
• Lower token count
• More space for context
Few-Shot
Providing examples as context to the foundation model before giving it a task
Q: What is the capital of Spain?
A: {'answer': 'Madrid'}
Q: What is the capital of Italy?
A: {'answer': 'Rome'}
Q: What is the capital of France?
A: {'answer': 'Paris'}
• Better aligned responses
• Higher accuracy on complex questions
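A small sketch of how the zero-shot and few-shot prompts on this slide could be assembled in code. The helper name `build_few_shot_prompt` and the `Q:`/`A:` template are illustrative assumptions, not an API from any library.

```python
def build_few_shot_prompt(examples, question):
    """Build a Q/A prompt. An empty `examples` list yields a zero-shot prompt."""
    lines = []
    for example_question, example_answer in examples:
        lines.append(f"Q: {example_question}")
        lines.append(f"A: {example_answer}")
    # The final question is left unanswered for the model to complete.
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)

# Examples taken from the slide: they show the answer format we want back.
examples = [
    ("What is the capital of Spain?", "{'answer': 'Madrid'}"),
    ("What is the capital of Italy?", "{'answer': 'Rome'}"),
]

zero_shot_prompt = build_few_shot_prompt([], "What is the capital of France?")
few_shot_prompt = build_few_shot_prompt(examples, "What is the capital of France?")

print(zero_shot_prompt)
print("---")
print(few_shot_prompt)
```

The few-shot prompt is longer (more tokens, less room for other context), but the examples steer the model toward the structured answer format.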
[Chart axis: data, compute & time investment]
PROMPT ENGINEERING
• Techniques: Few-shot / in-context learning, chain-of-thought reasoning, system prompting
• How: Prompt templates
• Training data: Single-digit number of prompt-completion examples & simple use cases
• Advantage: Minimal input of sample prompts; can be tuned online by end users
FULL-PARAMETER FINE-TUNING
• Techniques: SFT, RLHF, SteerLM
• How: Tune LLM model weights
• Training data: Thousands of examples & complex use cases
• Advantage: SFT is traditional method supported by all libraries. Robust for tuning to challenging domains (biomedical, coding, etc.)
Prompt Learning
Comparison to Full-Parameter Fine-Tuning
• Prompt learning adds a small number of trainable virtual tokens upstream of the LLM
• More efficient: for each new custom task, all we do is train those tokens
• The downstream foundation model is unchanged
• Often outperforms full-parameter fine-tuning when training data is small
Source: Lightning AI (Creators of PyTorch Lightning)
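The idea of "trainable virtual tokens upstream of the LLM" can be sketched without any framework: the frozen model's token embeddings are left untouched, and a short block of trainable vectors is prepended to them. The sizes, names, and toy vocabulary below are assumptions for illustration, not values from NeMo or any other library.

```python
import random

HIDDEN_SIZE = 4          # toy embedding width (real models use thousands)
NUM_VIRTUAL_TOKENS = 3   # size of the learned soft prompt

random.seed(0)

# Frozen: stands in for the foundation model's embedding table.
frozen_embeddings = {
    token: [random.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]
    for token in ["summarize", "this", "text"]
}

# Trainable: the only parameters updated for a new task.
virtual_token_embeddings = [[0.0] * HIDDEN_SIZE for _ in range(NUM_VIRTUAL_TOKENS)]

def build_model_input(tokens):
    """Prepend the virtual-token vectors to the frozen embeddings of the real tokens."""
    token_vectors = [frozen_embeddings[token] for token in tokens]
    return virtual_token_embeddings + token_vectors

model_input = build_model_input(["summarize", "this", "text"])
trainable_parameters = NUM_VIRTUAL_TOKENS * HIDDEN_SIZE

print(len(model_input), "input vectors,", trainable_parameters, "trainable parameters")
```

Switching tasks then means swapping in a different set of virtual-token vectors while the foundation model stays the same.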
Prompt Learning (Continued)
Prompt Tuning vs P-Tuning
Prompt Tuning
• Fixed prompt of special tokens, where only embeddings can be updated.
• Fewer parameters to fine-tune.
• Limited capacity to adapt to target task, but lower HW resource cost.
P-Tuning
• A small LSTM (Long Short-Term Memory) model is used to predict embeddings of a fixed prompt of tokens.
• Requires more parameters to be tuned.
• Higher accuracy at the cost of increased HW resources.
Example App 1: P-Tuning through NeMo LLM Service
Web UI for Easy Model Customization
Example App 1: P-Tuning through NeMo LLM Service
Loss Curves
[Chart axis: data, compute & time investment]
PROMPT ENGINEERING
• Techniques: Few-shot / in-context learning, chain-of-thought reasoning
• How: Prompt templates
• Training data: Single-digit number of prompt-completion examples & simple use cases
• Advantage: Minimal input of sample prompts; can be tuned online by end users
PROMPT LEARNING
• Techniques: Prompt tuning, P-tuning
• How: Tune companion model
• Training data: A few hundred examples & use cases where prompt engineering is not sufficient
• Advantage: Fast customization; only tuning a small model per task, and the downstream foundation model is unchanged
FULL-PARAMETER FINE-TUNING
• Techniques: SFT, RLHF, SteerLM
• How: Tune LLM model weights
• Training data: Thousands of examples & complex use cases
• Advantage: SFT is traditional method supported by all libraries. Robust for tuning to challenging domains (biomedical, coding, etc.)
Adapter-Based Techniques
Adapters, LoRA, IA3
• Adapters: Insert into each transformer layer, only update weights of adapters. https://arxiv.org/abs/1902.00751
• Low-Rank Adaptation (LoRA): Optimize rank decomposition matrices of dense layers. https://arxiv.org/abs/2106.09685
• IA3: Like adapters, but where each adapter is a vector that scales key, value or ffn. https://arxiv.org/abs/2205.05638
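As a rough illustration of the LoRA bullet, the sketch below applies a low-rank update to one dense layer using plain Python lists. The frozen weight `W` is left untouched, and only the two small matrices `A` and `B` would be trained. Initializing `B` to zero and scaling by `alpha / rank` follows the LoRA paper. The helper names and toy dimensions are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x) for one dense layer."""
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # trainable low-rank path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def lora_parameter_count(d_in, d_out, rank):
    """Trainable parameters: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

d_in, d_out, rank = 4, 3, 2
W = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]  # frozen weights
A = [[0.5] * d_in for _ in range(rank)]    # trainable, rank x d_in
B = [[0.0] * rank for _ in range(d_out)]   # trainable, d_out x rank, starts at zero
x = [1.0, 2.0, 3.0, 4.0]

# With B at zero the adapted layer behaves exactly like the original layer.
y_initial = lora_forward(W, A, B, x, alpha=4, rank=rank)

# Parameter savings for a hypothetical 4096 x 4096 layer at rank 8.
full_parameters = 4096 * 4096
lora_parameters = lora_parameter_count(4096, 4096, rank=8)
print(full_parameters, lora_parameters)
```

Because the update is just a sum of two matrix products, it can be merged into `W` after training, which is why LoRA adds no extra inference latency.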
Parameter-Efficient Fine-Tuning (PEFT)
[Chart axis: data, compute & time investment]
PROMPT ENGINEERING
• Techniques: Few-shot / in-context learning, chain-of-thought reasoning
• How: Prompt templates
• Training data: Single-digit number of prompt-completion examples & simple use cases
• Advantage: Minimal input of sample prompts; can be tuned online by end users
PROMPT LEARNING
• Techniques: Prompt tuning, P-tuning
• How: Tune companion model
• Training data: A few hundred examples & use cases where prompt engineering is not sufficient
• Advantage: Fast customization; only tuning a small model per task, and the downstream foundation model is unchanged
ADAPTERS
• Techniques: Adapters, LoRA, IA3
• How: Add custom layers to LLM
• Training data: Hundreds of examples for a multitude of downstream tasks
• Advantage: Achieve higher accuracy while requiring fewer samples than traditional fine-tuning
FULL-PARAMETER FINE-TUNING
• Techniques: SFT, RLHF, SteerLM
• How: Tune LLM model weights
• Training data: Thousands of examples & complex use cases
• Advantage: SFT is traditional method supported by all libraries. Robust for tuning to challenging domains (biomedical, coding, etc.)
Data Collection and Preparation for Tuning
Obtaining Datasets for Tuning
• Don't over-rely on public datasets. Datasets are everywhere. Need to curate input/output pairs.
• "Less is More for Alignment." High-quality, low-quantity training data vs. low-quality, high-quantity.
  • https://arxiv.org/abs/2305.11206
• Synthetic data generation: Use high-end model and complex prompt template to induce correct behavior/outputs from smaller model ("context distillation").
Example prompt/completion pairs:
{"prompt": "Summarize the following text:\nNVIDIA announced the release of…", "completion": "NVIDIA's new product Guardrails…"}
{"prompt": "Summarize the following text:\nWhen Jensen Huang first founded…", "completion": "NVIDIA's CEO outlines vision for…"}
{"prompt": "Summarize the following text:\nHi team,\n\nI noticed that Omni…", "completion": "Omniverse Replicator feature request…"}
...
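Curated prompt/completion pairs like the ones above are commonly stored one JSON object per line (JSONL). The sketch below validates and serializes such records with the standard library. The record contents are placeholders rather than the slide's truncated examples, and the exact field names a given tuning service expects may differ.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}

def to_jsonl(records):
    """Validate prompt/completion records and serialize them one JSON object per line."""
    lines = []
    for index, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {index} is missing {sorted(missing)}")
        if not record["prompt"].strip() or not record["completion"].strip():
            raise ValueError(f"record {index} has an empty field")
        # json.dumps escapes embedded newlines, so each record stays on one line.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"

# Placeholder records for illustration only.
records = [
    {"prompt": "Summarize the following text:\nExample source text A",
     "completion": "Example summary A"},
    {"prompt": "Summarize the following text:\nExample source text B",
     "completion": "Example summary B"},
]

jsonl_text = to_jsonl(records)
round_trip = [json.loads(line) for line in jsonl_text.splitlines()]
print(jsonl_text)
```

Rejecting incomplete records early fits the "high-quality, low-quantity" advice: a few clean pairs are worth more than many malformed ones.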
Example App 2: Tailored RAG
Retrieval Augmented Generation (RAG)
Motivation
• Decouples an LLM from only being able to act on original training data
• Obviates the need to retrain the LLM with the latest data
• LLMs limited by context window sizes
Concept
• Connect LLM to data sources at inference time
  • e.g., databases, web, documents, 3rd party APIs, etc.
• Find relevant data
• Inject relevant data into the prompt
Components of the Application
1. Human input (prompt)
2. Vectorization (embedding)
3. Retrieve vectors and calculate distance
4. Extract closest matching docs
5. Inject relevant docs into the prompt
6. Output becomes up-to-date, more accurate, with ability to cite source
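The application components listed above can be sketched end to end in a few lines. To keep the example self-contained, a word-count vector stands in for a real embedding model and a Python list stands in for the vector database. Both are simplifying assumptions, as are the documents and the prompt template.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, documents, top_k=1):
    """Steps 2-4: embed the query, score every document, keep the closest."""
    query_vector = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query, context_docs):
    """Step 5: inject the retrieved documents into the prompt."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical document store.
documents = [
    "The cafeteria opens at 8 am on weekdays.",
    "Expense reports are due on the last Friday of each month.",
    "The parking garage closes at 10 pm.",
]

query = "When are expense reports due?"          # step 1: human input
top_docs = retrieve(query, documents, top_k=1)
prompt = build_prompt(query, top_docs)           # this prompt would be sent to the LLM
print(prompt)
```

In a real system the `embed` function would call an embedding model and the ranking would be delegated to a vector database, but the control flow stays the same.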
Embeddings and the Vector Database
Searching via semantic similarity
• Embeddings are data (text, image, or other data) represented as numerical vectors
• Input text -> embedding model -> output vector
• Part of semantic search
• Model trained to embed similar inputs close together
• Useful for: classification, clustering, topic discovery
• Many pretrained and trainable embedding model sources
• Modern ones are often deep neural networks
Query: Who will lead the construction team?
Chunk 1: The construction team found lead in the paint.
Chunk 2: Ozzy has been picked to lead the group.
Chunk 1 shares more keywords with the query, but semantic search can differentiate the meanings of "lead" and understand that "team" and "group" are similar, so Chunk 2 may be more helpful for the query.
[Figure: 2D representation of a 768-dimension embedding space, with clusters labeled Scientific computing, High-performance computing, Specialized medical topics, and Speech AI]
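The keyword claim in this example is easy to check. The sketch below counts shared words between the query and each chunk with a simple lowercase tokenizer, which is an assumption and not how any particular search engine tokenizes. It shows why keyword matching alone would rank Chunk 1 first even though Chunk 2 is the better answer.

```python
import re

def keywords(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

query = "Who will lead the construction team?"
chunk_1 = "The construction team found lead in the paint."
chunk_2 = "Ozzy has been picked to lead the group."

overlap_1 = keywords(query) & keywords(chunk_1)
overlap_2 = keywords(query) & keywords(chunk_2)

print(sorted(overlap_1))  # words Chunk 1 shares with the query
print(sorted(overlap_2))  # words Chunk 2 shares with the query
```

Chunk 1 shares four words with the query and Chunk 2 shares two, so a keyword ranker prefers Chunk 1. A semantic embedding model is what lets the retriever treat "lead the group" as closer in meaning to "lead the construction team".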
Canonical RAG Workflow
[Diagram components: Prompt, Response, Guardrails (on input and output), Workflow Framework, Embedding Model, User Data, Vector Database, Context, LLM]
Question Decomposition
Making hard questions easier
• Retrieval augmented generation (RAG) can struggle out-of-the-box with complex prompts when retrieval fails to find the right documents.
• Solution: Tune a small question decomposition model.
  • Decompose complex questions into easier sub-questions, each with a single topic, which makes retrieval more likely to succeed.
• Anthropic: "Question Decomposition Improves the Faithfulness of Model-Generated Reasoning"
[Diagram: Question, Decompose, Sub-Question 1 and Sub-Question 2, DB, Combine, Output]
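The decompose, retrieve, combine flow in the diagram can be sketched as plain control flow. In the talk the decomposition step is a tuned small LLM. Here a naive split on " and " is used purely as a placeholder so the example runs, and the keyword lookup stands in for vector retrieval. All names and data are hypothetical.

```python
def decompose(question):
    """Placeholder for a tuned decomposition model: split on ' and '."""
    parts = [part.strip(" ?") for part in question.split(" and ")]
    return [part + "?" for part in parts if part]

# Hypothetical single-topic knowledge base keyed by topic word.
knowledge_base = {
    "cafeteria": "The cafeteria opens at 8 am.",
    "parking": "The parking garage closes at 10 pm.",
}

def retrieve(sub_question):
    """Stand-in for vector search: return docs whose topic word appears in the question."""
    lowered = sub_question.lower()
    return [text for topic, text in knowledge_base.items() if topic in lowered]

def combine(question, contexts):
    """Merge the per-sub-question context into one prompt for the final answer."""
    lines = ["Context:"]
    for sub_question, docs in contexts.items():
        for doc in docs:
            lines.append(f"- ({sub_question}) {doc}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)

question = "When does the cafeteria open and when does the parking garage close?"
sub_questions = decompose(question)
contexts = {sub_question: retrieve(sub_question) for sub_question in sub_questions}
final_prompt = combine(question, contexts)
print(final_prompt)
```

Each sub-question touches one topic, so each retrieval call has a clearer target than the original compound question.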
Example App 2: Naive RAG
Example App 2: Question Decomposition RAG
How It Works
Synthetic Data Generation and Context Distillation
1. Acquire dataset of complex questions
2. Prompt Llama 2 70B with few-shot prompts
3. Manually verify results
4. Train Llama 2 7B with PEFT (LoRA) and FP4 to mimic Llama 2 70B
5. Prompt Llama 2 7B with zero-shot prompts. Hosted on OctoAI platform.
More Robust RAG
• Tuning a question decomposition model
• Tuning a QA model that explicitly only answers from context
• Retrieval enhancements:
  • Tuning a custom embedding model
  • Re-ranker
  • Decoupling retrieval and generation chunks
  • Using document metadata/hierarchy
• Agents: using external tools (e.g., to answer a math question)
• Serving for inference
Tuning Self-Managed LLMs
Self-Managed vs. Hosted API
Self-Managed LLMs
Own & manage underlying model weights
Motivations:
• Privacy/Ownership
• Portability/Flexibility
• Cost: Run on own infrastructure
• Choice of customization
Examples for Getting Started: NeMo Framework, HuggingFace Hub + PEFT
Hosted API LLMs
Access only available through hosted APIs
Motivations:
• Easy to use: Push-button experiences
• Easy deployment: Don't have to worry about managing hardware and keeping your API healthy
Examples for Getting Started: OpenAI, Cohere, AWS Bedrock, NeMo LLM Service
NVIDIA NeMo Framework
From foundation model to application
[Diagram: DEVELOPERS use the NeMo Framework to build APPLICATIONS that take in-domain queries and return in-domain, secure, cited responses]
• Foundation Model Building: Data Curation, Distributed Training
• Pre-Trained Foundation Models: GPT-8B, GPT-22B, GPT-43B, Community Models
• Model Customization
• Accelerated Inference
• Guardrails
Platform layers: NeMo Framework, NVIDIA AI Enterprise, NVIDIA DGX Cloud
Fine-Tuning in NeMo
An All-in-One Implementation
1. Set Parameter-Efficient Fine-Tuning Type
2. Set Hyperparameters

peft:
  peft_scheme: "adapter" # can be either adapter, ia3, ptuning, adapter_and_ptuning, or lora
  restore_from_path: null
  adapter_tuning:
    type: 'parallel_adapter' # this should be either 'parallel_adapter' or 'linear_adapter'
    adapter_dim: 32
    adapter_dropout: 0.0
    norm_position: 'pre' # This can be set to 'pre' or 'post', 'pre' is normally what is used.
    column_init_method: 'xavier' # options: xavier, zero or normal
    row_init_method: 'zero' # options: xavier, zero or normal
    norm_type: 'mixedfusedlayernorm' # options are ['layernorm', 'mixedfusedlayernorm']
  lora_tuning:
    adapter_dim: 32
    adapter_dropout: 0.0
    column_init_method: 'xavier' # options: xavier, zero or normal
    row_init_method: 'zero' # IGNORED if linear_adapter is used, options: xavier, zero or normal
  # Used for p-tuning peft training
  p_tuning:
    virtual_tokens: 10 # The number of virtual tokens the prompt encoder should add at the start of the sequence
    bottleneck_dim: 1024 # the size of the prompt encoder mlp bottleneck
    embedding_dim: 1024 # the size of the prompt encoder embeddings
    init_std: 0.023
Hyperparameter sections by PEFT type:
a) Adapter: adapter_tuning
b) LoRA: lora_tuning
c) P-Tuning: p_tuning
Recap
What did we learn today?
1. Why might you want to customize your own LLM?
a. Better performance, save money, reduce latency, smaller models.
2. How should you customize your own LLM?
a. For most use cases, parameter-efficient fine-tuning (PEFT). Choose what's easiest for you.
3. What data is needed to customize your own LLM?
a. You're already generating your own data. Start recording it! Also try synthetic data generation.
4. Do you use a hosted API or self-manage to customize your own LLM?
a. Choice is up to the developer. Consider cost, convenience, privacy, portability.
Live Q&A
Explore More Gen AI/LLM Training:
25% off Workshops for LLM Day registrants*
USE CODE: TRAIN-LLM
Instructor-Led Workshops
> Generative AI With Diffusion Models
> Rapid Application Development Using LLMs
> Efficient Large Language Model (LLM) Customizations
Self-Paced Courses
> Generative AI Explained (Free)
> Generative AI With Diffusion Models
View our comprehensive Gen AI/LLM learning path, covering fundamental to advanced topics
> Gen AI/LLM Learning Path
Apply to NVIDIA Inception for startups: NVIDIA.com/startups
Join the NVIDIA Developer community to get access to technical training, technology, AI models and 600+ SDKs: Developer.nvidia.com/join
*Offer valid for any of the DLI public workshops scheduled through March 01, 2024.
Supplementary Materials
Comparison of Approaches
Approach | Prompt Engineering | Prompt Tuning | P-Tuning | Adapter | LoRA | IA3 | Full-Param Fine-Tuning
Frozen model weights in training | Yes | Yes | Yes | Yes | Yes | Yes | No
Same model architecture in inference | Yes | Yes | Yes | No | Yes | No | Yes
New added parameters | Zero | Limited | Limited | Moderate | Moderate | Limited | Large
Extra inference latency | High | Moderate | Moderate | Limited | Zero | Limited | Zero
Extra inference computation cost | High | Moderate | Moderate | Limited | Zero | Limited | Zero
Multi-task in one inference batch | Yes | Yes | Yes | No | No | No | No
Accuracy | Fair | Good | Good | Better | Better | Better | Best
Training data requested | Minimum | Limited | Limited | Moderate | Moderate | Moderate | High
Training computation cost | Zero | Limited | Limited | Moderate | Moderate | Moderate | High
