
Tailoring LLMs to Your Use Case

Christopher Pang, Senior Solutions Engineer | 17 November 2023

1
Agenda
• LLMs in Context

• Tuning Hosted API LLMs

• Data Collection and Prep for Tuning

• Tuning Self-Managed LLMs

2
Why Bother Creating A Custom Model?
Motivations for Fine-Tuning

1. You just want the best use-case/task-specific results.

2. You want to save money and reduce latency.

3. You want a smaller model (fewer parameters) that was trained to imitate a larger one.

5
Example App 1: Domain-Tailored
Summarization with Hosted LLM APIs

6
Domain-Tailored Summarization
Typical downstream clients only read from the API

Summarization
API

7
Domain-Tailored Summarization
Our example application also writes new data back into the API

Diagram: Summarization API and Client Application, with data flowing in both directions

8
Domain-Tailored Summarization

9
How It Works
Domain-Tailored Summarization Under the Hood

1. Tune LLM on Your Data
2. Base and Custom LLM Output Predictions
3. Humans Comparing Outputs Side-by-Side With Voting and Editing
4. Generate New Data (fed back into tuning)

10
Tuning Hosted API LLMs

11
Tune and Host an LLM Entirely Through an API

• Jurassic-2 Mid
• Jurassic-2 Light
• Titan
• Command
• GPT 4
• GPT 3.5
• GPT 3
• NeMo GPT
• Llama 2
• Llama 2 7B and 70B
• Mistral 7B
• MPT 7B Instruct

12
OpenAI UI for Fine-Tuning and Assistants

https://platform.openai.com/finetune
https://platform.openai.com/playground?mode=assistant

13
What is Fine-Tuning?
Traditionally means updating all parameters of the model with supervised learning

• Full-parameter fine-tuning re-trains an LLM with a prompt dataset in a supervised manner, requiring an update of all model weights per task
• Prompt dataset needs to be sufficiently large, on the order of thousands to hundreds of thousands of prompts

Diagram: the Initial Pre-Trained LLM is copied, then trained on the Dataset (Prompt-Sample Pairs) to produce the New LLM

14
What is Fine-Tuning?
Also refers to alignment with human intent

1. Supervised Fine-Tuning of LLM
   ~10K-100K prompt-responses as input. Fine-tune LLM using prompt and responses.
2. Train Reward Model with Human Feedback
   ~100K-1M responses ranked and rated. Reward model: trained to mimic human feedback of model generated responses to prompts.
3. Reinforcement Learning Pipeline with Human Feedback
   Build pipeline with RLHF to continuously improve model over time. Multiple neural networks interacting.

Diagram labels: Humans Rating Responses → Reward (Preference) Model; PPO Training (Simplified Version): Prompt → Policy Network → Response → Reward Model → Reward

For more details (from 🤗): https://huggingface.co/blog/rlhf
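The slide does not say how the reward model in step 2 is trained. One common choice, assumed here purely for illustration, is a pairwise preference loss: for each pair of responses that humans ranked, the model is penalized when the preferred response does not score higher. The function name and the scalar rewards below are hypothetical.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the preferred response outscores the other one."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

# The loss shrinks as the reward model ranks the human-preferred response higher.
print(pairwise_preference_loss(2.0, 0.0))  # preferred response scored higher: small loss
print(pairwise_preference_loss(0.0, 2.0))  # preferred response scored lower: large loss
```

In a real pipeline the two rewards would come from the reward network itself, and this loss would be averaged over the ranked pairs.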


15
Alignment with Human Intent: SteerLM
A technique to customize LLMs during inference

1. Train a prediction model on human-annotated datasets to evaluate response quality on any number of attributes like helpfulness, humor, and creativity.
2. Annotate diverse datasets by predicting their attribute scores to enrich the diversity of data available to the model.
3. Fine-tune by training the LLM to generate responses conditioned on specified combinations of attributes, like user-perceived quality and helpfulness.
4. Bootstrap training through model sampling by generating diverse responses conditioned on maximum quality, then fine-tuning on them to further improve alignment.

https://huggingface.co/nvidia/SteerLM-llama2-13B
https://arxiv.org/abs/2310.05344

16
Latest Techniques for Customizing LLMs

Axes: data, compute & time investment; accuracy for specific use cases

FULL-PARAMETER FINE-TUNING
• Techniques: SFT, RLHF, SteerLM
• How: Tune LLM model weights
• Training data: Thousands of examples & complex use cases
• Advantage: SFT is traditional method supported by all libraries. Robust for tuning to challenging domains (biomedical, coding, etc.)

17
Prompt Engineering
Prompt design is crucial to obtaining good results from an LLM

Zero-Shot
Asking the foundation model to perform a task with no previous example

Q: "What is the capital of France?" → LLM → A: "Paris"

• Lower token count
• More space for context

Few-Shot
Providing examples as context to the foundation model before giving it a task

Q: What is the capital of Spain?
A: {'answer': 'Madrid'}
Q: What is the capital of Italy?
A: {'answer': 'Rome'}
Q: What is the capital of France?
→ LLM →
A: {'answer': 'Paris'}

• Better aligned responses
• Higher accuracy on complex questions
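A minimal sketch of how the few-shot prompt on this slide could be assembled in code. The helper name `build_few_shot_prompt` is hypothetical, and `json.dumps` emits double-quoted JSON rather than the single-quoted form shown on the slide.

```python
import json

def build_few_shot_prompt(examples, question):
    """Assemble a few-shot prompt: worked Q/A pairs first, then the new question."""
    parts = []
    for q, a in examples:
        # Render each answer in the same JSON shape we want the model to copy.
        parts.append(f"Q: {q}\nA: {json.dumps({'answer': a})}")
    # Leave the final answer open so the model completes it.
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

examples = [
    ("What is the capital of Spain?", "Madrid"),
    ("What is the capital of Italy?", "Rome"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of France?")
print(prompt)
```

Passing an empty `examples` list yields the zero-shot form of the same prompt, which makes the token-count trade-off between the two styles easy to measure.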

18
Latest Techniques for Customizing LLMs

Axes: data, compute & time investment; accuracy for specific use cases

PROMPT ENGINEERING
• Techniques: Few-shot / in-context learning, chain-of-thought reasoning, system prompting
• How: Prompt templates
• Training data: Single-digit number of prompt-completion examples & simple use cases
• Advantage: Minimal input of sample prompts; can be tuned online by end users

FULL-PARAMETER FINE-TUNING
• Techniques: SFT, RLHF, SteerLM
• How: Tune LLM model weights
• Training data: Thousands of examples & complex use cases
• Advantage: SFT is traditional method supported by all libraries. Robust for tuning to challenging domains (biomedical, coding, etc.)

19
Prompt Learning
Comparison to Full-Parameter Fine-Tuning

• Prompt learning adds a small number of trainable virtual tokens upstream of the LLM
• More efficient: for each new custom task, all we do is train those tokens
• The downstream foundation model is unchanged
• Often outperforms full-parameter fine-tuning when training data is small

Source: Lightning AI (Creators of PyTorch Lightning)

20
Prompt Learning (Continued)
Prompt Tuning vs P-Tuning

Prompt Tuning
• Fixed prompt of special tokens, where only embeddings can be updated.
• Fewer parameters to fine-tune.
• Limited capacity to adapt to target task, but lower HW resource cost.

P-Tuning
• A small LSTM (Long Short-Term Memory) model is used to predict embeddings of a fixed prompt of tokens.
• Requires more parameters to be tuned.
• Higher accuracy at the cost of increased HW resources.
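To make the "trainable virtual tokens upstream of the LLM" idea concrete, here is a toy sketch in plain Python. The sizes, the three-word vocabulary, and the random initialization are assumptions chosen for illustration; a real implementation would use the model's own embedding layer and an optimizer.

```python
import random

EMBED_DIM = 4           # toy embedding width (assumption)
NUM_VIRTUAL_TOKENS = 3  # toy number of trainable virtual tokens (assumption)

random.seed(0)

# Frozen lookup table standing in for the LLM's embedding layer.
frozen_embeddings = {
    token: [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for token in ["summarize", "this", "text"]
}

# The only trainable parameters: one vector per virtual token.
virtual_tokens = [
    [random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(NUM_VIRTUAL_TOKENS)
]

def embed_with_virtual_prompt(tokens):
    """Prepend the trainable virtual-token vectors to the frozen input embeddings."""
    return virtual_tokens + [frozen_embeddings[t] for t in tokens]

sequence = embed_with_virtual_prompt(["summarize", "this", "text"])
trainable = NUM_VIRTUAL_TOKENS * EMBED_DIM
print(len(sequence), "positions,", trainable, "trainable values")
```

In prompt tuning these vectors are updated directly; in P-tuning, as the slide notes, a small companion model predicts them instead.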

21
Example App 1: P-Tuning through NeMo LLM Service
Web UI for Easy Model Customization

22
Example App 1: P-Tuning through NeMo LLM Service
Loss Curves

23
Latest Techniques for Customizing LLMs

Axes: data, compute & time investment; accuracy for specific use cases

PROMPT ENGINEERING
• Techniques: Few-shot / in-context learning, chain-of-thought reasoning, system prompting
• How: Prompt templates
• Training data: Single-digit number of prompt-completion examples & simple use cases
• Advantage: Minimal input of sample prompts; can be tuned online by end users

PROMPT LEARNING
• Techniques: Prompt tuning, P-tuning
• How: Tune companion model
• Training data: A few hundred examples & use cases where prompt engineering is not sufficient
• Advantage: Fast customization: only tuning a small model per task; downstream foundation model is unchanged

FULL-PARAMETER FINE-TUNING
• Techniques: SFT, RLHF, SteerLM
• How: Tune LLM model weights
• Training data: Thousands of examples & complex use cases
• Advantage: SFT is traditional method supported by all libraries. Robust for tuning to challenging domains (biomedical, coding, etc.)

24
Adapter-Based Techniques
Adapters, LoRA, IA3

Adapters
• Insert into each transformer layer, only update weights of adapters
• https://arxiv.org/abs/1902.00751

Low-Rank Adaptation (LoRA)
• Optimize rank decomposition matrices of dense layers
• https://arxiv.org/abs/2106.09685

IA3
• Like adapters, but where each adapter is a vector that scales key, value or ffn
• https://arxiv.org/abs/2205.05638
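A toy sketch of the LoRA idea in plain Python: the pretrained weight stays frozen and only two small factor matrices are trained. The matrix sizes and initial values are assumptions for illustration, and the scaling factor used in the LoRA paper is left out for brevity.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

d_out, d_in, rank = 4, 4, 1  # toy sizes (assumption)

# Frozen pretrained weight W (d_out x d_in); never updated during tuning.
W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]

# Trainable low-rank factors: B (d_out x rank) starts at zero, A (rank x d_in) is small.
B = [[0.0] * rank for _ in range(d_out)]
A = [[0.1 * (j + 1) for j in range(d_in)] for _ in range(rank)]

def effective_weight(W, B, A):
    """LoRA adds the low-rank product B @ A to the frozen weight."""
    return add(W, matmul(B, A))

# With B initialized to zero, the adapted layer matches the base layer exactly.
assert effective_weight(W, B, A) == W

full_params = d_out * d_in
lora_params = rank * (d_out + d_in)
print(full_params, "frozen values vs", lora_params, "trainable values")
```

The saving is small at this toy size but grows quickly: the trainable count scales with `rank * (d_out + d_in)` while the frozen weight scales with `d_out * d_in`.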

25
Latest Techniques for Customizing LLMs
Parameter-Efficient Fine-Tuning (PEFT)

Axes: data, compute & time investment; accuracy for specific use cases

PROMPT ENGINEERING
• Techniques: Few-shot / in-context learning, chain-of-thought reasoning, system prompting
• How: Prompt templates
• Training data: Single-digit number of prompt-completion examples & simple use cases
• Advantage: Minimal input of sample prompts; can be tuned online by end users

PROMPT LEARNING
• Techniques: Prompt tuning, P-tuning
• How: Tune companion model
• Training data: A few hundred examples & use cases where prompt engineering is not sufficient
• Advantage: Fast customization: only tuning a small model per task; downstream foundation model is unchanged

ADAPTERS
• Techniques: Adapters, LoRA, IA3
• How: Add custom layers to LLM
• Training data: Hundreds of examples for a multitude of downstream tasks
• Advantage: Achieve higher accuracy while requiring fewer samples than traditional fine-tuning

FULL-PARAMETER FINE-TUNING
• Techniques: SFT, RLHF, SteerLM
• How: Tune LLM model weights
• Training data: Thousands of examples & complex use cases
• Advantage: SFT is traditional method supported by all libraries. Robust for tuning to challenging domains (biomedical, coding, etc.)

26
Data Collection and Preparation
For Tuning

27
Obtaining Datasets for Tuning

• Don’t over-rely on public datasets. Datasets are everywhere. Need to curate input/output pairs.
• “Less is More for Alignment.” High-quality, low-quantity training data vs. low-quality, high-quantity.
  • https://arxiv.org/abs/2305.11206
• Synthetic data generation: Use high-end model and complex prompt template to induce correct behavior/outputs from smaller model (“context distillation”).

Example prompt/completion pairs:

{"prompt": "Summarize the following text:\nNVIDIA announced the release of…",
 "completion": "NVIDIA’s new product Guardrails…"}

{"prompt": "Summarize the following text:\nWhen Jensen Huang first founded…",
 "completion": "NVIDIA’s CEO outlines vision for…"}

{"prompt": "Summarize the following text:\nHi team,\n\nI noticed that Omni…",
 "completion": "Omniverse Replicator feature request…"}
...
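A small sketch of how curated prompt/completion pairs like these could be written out as JSON Lines, one object per line. The two records are placeholders (hypothetical), and the validation rules are assumptions rather than any provider's requirements.

```python
import json

# Hypothetical records; real ones would come from your own curated input/output pairs.
records = [
    {"prompt": "Summarize the following text:\n<source document 1>",
     "completion": "<reference summary 1>"},
    {"prompt": "Summarize the following text:\n<source document 2>",
     "completion": "<reference summary 2>"},
]

def to_jsonl(rows):
    """Serialize one JSON object per line, rejecting empty or malformed rows."""
    lines = []
    for row in rows:
        if set(row) != {"prompt", "completion"}:
            raise ValueError(f"unexpected keys: {sorted(row)}")
        if not row["prompt"].strip() or not row["completion"].strip():
            raise ValueError("prompt and completion must be non-empty")
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines) + "\n"

jsonl_text = to_jsonl(records)
# Round-trip check: every line parses back to the original record.
assert [json.loads(line) for line in jsonl_text.splitlines()] == records
```

Checking rows at write time is a cheap way to keep low-quality pairs out of a small, high-quality tuning set.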

28
Example App 2: Tailored RAG

29
Retrieval Augmented Generation (RAG)

Motivation
• Decouples an LLM from only being able to act on original training data
• Obviates the need to retrain the LLM with the latest data
• LLMs limited by context window sizes

Concept
• Connect LLM to data sources at inference time
  • e.g., databases, web, documents, 3rd party APIs, etc.
• Find relevant data
• Inject relevant data into the prompt

Components of the Application
1. Human input (prompt)
2. Vectorization (embedding)
3. Retrieve vectors and calculate distance
4. Extract closest matching docs
5. Inject relevant docs into the prompt
6. Output becomes up-to-date, more accurate, with ability to cite source
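The retrieval and injection steps of the application can be sketched in a few lines of Python. Everything here is a stand-in: the three documents, their hand-written vectors, and the prompt wording are hypothetical, and a real system would obtain the vectors from an embedding model and a vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical document store: text plus a precomputed embedding vector.
store = [
    ("Doc A: release notes for the latest product version.", [0.9, 0.1, 0.0]),
    ("Doc B: cafeteria menu for this week.", [0.0, 0.2, 0.9]),
    ("Doc C: known issues in the latest product version.", [0.8, 0.3, 0.1]),
]

def retrieve(query_vector, k=2):
    """Rank stored docs by similarity to the query and keep the top k."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

def build_prompt(question, docs):
    """Inject the retrieved docs into the prompt ahead of the question."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

question = "What changed in the latest version?"
query_vector = [1.0, 0.2, 0.0]  # stand-in for embedding the question
top_docs = retrieve(query_vector)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The assembled prompt is what gets sent to the LLM; because the context lines carry document labels, the model's answer can point back to its sources.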

30
Embeddings and the Vector Database
Searching via semantic similarity

• Embeddings are data (text, image, or other data) represented as numerical vectors
• Input text -> embedding model -> output vector
• Part of semantic search
• Model trained to embed similar inputs close together
• Useful for: classification, clustering, topic discovery
• Many pretrained and trainable embedding model sources
• Modern ones are often deep neural networks

Query: Who will lead the construction team?
Chunk 1: The construction team found lead in the paint.
Chunk 2: Ozzy has been picked to lead the group.

Chunk 1 shares more keywords with the query, but semantic search can differentiate the meanings of "lead" and understand that "team" and "group" are similar, so Chunk 2 may be more helpful for the query.

Figure: 2D representation of a 768-dimension embedding space, with clusters labeled Scientific computing, High-performance computing, Specialized medical topics, and Speech AI
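The keyword claim in the example above is easy to check with a short script. This sketch only counts shared words; it illustrates why plain keyword matching would prefer Chunk 1 and is not a semantic search implementation. The tokenization rule is an assumption.

```python
import re

def keywords(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

query = "Who will lead the construction team?"
chunks = {
    "Chunk 1": "The construction team found lead in the paint.",
    "Chunk 2": "Ozzy has been picked to lead the group.",
}

# Count how many distinct words each chunk shares with the query.
overlap = {name: len(keywords(query) & keywords(text)) for name, text in chunks.items()}
print(overlap)  # {'Chunk 1': 4, 'Chunk 2': 2}
```

An embedding model is what lets a retriever look past these counts and rank Chunk 2 as the better match.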

31
Canonical RAG Workflow

Diagram components: User (Prompt in, Response out), Guardrails (on both the prompt and the response), Workflow Framework, Embedding Model, Data, Vector Database, Context, LLM


Question Decomposition
Making hard questions easier

• Retrieval augmented generation (RAG) can struggle out-of-the-box with complex prompts when retrieval fails to find the right documents.
• Solution: Tune a small question decomposition model.
• Decompose complex questions into easier sub-questions with a single topic, which makes retrieval more likely to succeed.
• Anthropic: “Question Decomposition Improves the Faithfulness of Model-Generated Reasoning”

Diagram: Question → Decompose → Sub-Question 1, Sub-Question 2 → DB → Combine → Output
33
Example App 2: Naive RAG

34
Example App 2: Question Decomposition RAG

35
How It Works
Synthetic Data Generation and Context Distillation

1. Acquire dataset of complex questions
2. Prompt Llama 2 70B with few-shot prompts
3. Manually verify results
4. Train Llama 2 7B with PEFT (LoRA) and FP4 to mimic Llama 2 70B
5. Prompt Llama 2 7B with zero-shot prompts. Hosted on OctoAI platform.
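The data-generation half of this workflow might look roughly like the following sketch. The template wording, the function names, and the `teacher` and `is_verified` callables are all hypothetical; `teacher` stands in for a call to the larger model and `is_verified` for the manual review step.

```python
FEW_SHOT_TEMPLATE = (
    "Break each question into simpler sub-questions.\n\n"
    "{examples}\n\n"
    "Question: {question}\nSub-questions:"
)

def build_teacher_prompt(question, worked_examples):
    """Few-shot prompt for the large teacher model."""
    shots = "\n\n".join(
        f"Question: {q}\nSub-questions: {subs}" for q, subs in worked_examples
    )
    return FEW_SHOT_TEMPLATE.format(examples=shots, question=question)

def build_training_pairs(questions, worked_examples, teacher, is_verified):
    """Keep (zero-shot prompt, teacher output) pairs that pass verification."""
    pairs = []
    for question in questions:
        output = teacher(build_teacher_prompt(question, worked_examples))
        if is_verified(question, output):
            # The student sees only the bare question, not the few-shot scaffold.
            pairs.append({
                "prompt": f"Question: {question}\nSub-questions:",
                "completion": output,
            })
    return pairs

# Stub teacher and reviewer so the sketch runs end to end.
pairs = build_training_pairs(
    questions=["complex question one", "complex question two"],
    worked_examples=[("example question", "example sub-questions")],
    teacher=lambda prompt: " placeholder sub-questions",
    is_verified=lambda question, output: question.endswith("one"),
)
print(pairs)
```

The key design point is that the few-shot scaffold appears only in the teacher's prompt, so the smaller model learns to produce the same output from a zero-shot prompt.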

36
More Robust RAG

• Tuning a question decomposition model


• Tuning a QA model that explicitly only answers from context
• Retrieval enhancements:
• Tuning a custom embedding model
• Re-ranker
• Decoupling retrieval and generation chunks
• Using document metadata/hierarchy

38
More Robust RAG

• Tuning a question decomposition model


• Tuning a QA model that explicitly only answers from context
• Retrieval enhancements:
• Tuning a custom embedding model
• Re-ranker
• Decoupling retrieval and generation embeddings
• Using document metadata/hierarchy
• Agents: using external tools (e.g., to answer a math question)
• Serving for inference

40
Tuning Self-Managed LLMs

41
Self-Managed vs. Hosted API

Self-Managed LLMs
Own & manage underlying model weights
Motivations:
• Privacy/Ownership
• Portability/Flexibility
• Cost: Run on own infrastructure
• Choice of customization
Examples for Getting Started: NeMo Framework, HuggingFace Hub + PEFT

Hosted API LLMs
Access only available through hosted APIs
Motivations:
• Easy to use: Push-button experiences
• Easy deployment: Don't have to worry about managing hardware and keeping your API healthy
Examples for Getting Started: OpenAI, Cohere, AWS Bedrock, NeMo LLM Service

42
NVIDIA NeMo Framework
From foundation model to application

Diagram: DEVELOPERS → NeMo Framework → APPLICATIONS
• Foundation Model Building: Data Curation, Distributed Training
• Pre-Trained Foundation Models: GPT-8B, GPT-22B, GPT-43B, Community Models
• Model Customization, Accelerated Inference, Guardrails
• Applications: in-domain queries in; in-domain, secure, cited responses out
• Platform layers: NeMo Framework, NVIDIA AI Enterprise, NVIDIA DGX Cloud


43
Fine-Tuning in NeMo
An All-in-One Implementation

1. Set Parameter-Efficient Fine-Tuning Type
2. Set Hyperparameters
   a) Adapter
   b) LoRA
   c) P-Tuning

    peft:
      peft_scheme: "adapter"  # can be either adapter, ia3, ptuning, adapter_and_ptuning, or lora
      restore_from_path: null

      adapter_tuning:
        type: 'parallel_adapter'  # this should be either 'parallel_adapter' or 'linear_adapter'
        adapter_dim: 32
        adapter_dropout: 0.0
        norm_position: 'pre'  # This can be set to 'pre' or 'post', 'pre' is normally what is used.
        column_init_method: 'xavier'  # options: xavier, zero or normal
        row_init_method: 'zero'  # options: xavier, zero or normal
        norm_type: 'mixedfusedlayernorm'  # options are ['layernorm', 'mixedfusedlayernorm']

      lora_tuning:
        adapter_dim: 32
        adapter_dropout: 0.0
        column_init_method: 'xavier'  # options: xavier, zero or normal
        row_init_method: 'zero'  # IGNORED if linear_adapter is used, options: xavier, zero or normal

      p_tuning:
        virtual_tokens: 10  # The number of virtual tokens the prompt encoder should add at the start of the sequence
        bottleneck_dim: 1024  # the size of the prompt encoder mlp bottleneck
        embedding_dim: 1024  # the size of the prompt encoder embeddings
        init_std: 0.023

47
Recap
What did we learn today?

1. Why might you want to customize your own LLM?
   a. Better performance, save money, reduce latency, smaller models.
2. How should you customize your own LLM?
   a. For most use cases, parameter-efficient fine-tuning (PEFT). Choose what’s easiest for you.
3. What data is needed to customize your own LLM?
   a. You’re already generating your own data. Start recording it! Also try synthetic data generation.
4. Do you use a hosted API or self-manage to customize your own LLM?
   a. Choice is up to the developer. Consider cost, convenience, privacy, portability.

48
Live Q&A

49
Explore More Gen AI/LLM Training:
25% off Workshops for LLM Day registrants*
USE CODE: TRAIN-LLM

Instructor-Led Workshops
> Generative AI With Diffusion Models
> Rapid Application Development Using LLMs
> Efficient Large Language Model (LLM) Customizations

Self-Paced Courses
> Generative AI Explained (Free)
> Generative AI With Diffusion Models

View our comprehensive Gen AI/LLM learning path, covering fundamental to advanced topics
> Gen AI/LLM Learning Path

Apply to NVIDIA Inception for startups:
NVIDIA.com/startups

Join the NVIDIA Developer community to get access to technical training, technology, AI models and 600+ SDKs:
Developer.nvidia.com/join

*Offer valid for any of the DLI public workshops scheduled through March 01, 2024.

50
Supplementary Materials

51

Comparison of Approaches

                                 | Prompt Engineering | Prompt Tuning | P-Tuning | Adapter  | LoRA     | IA3      | Full-Param Fine-Tuning
Frozen model weights             | Yes                | Yes           | Yes      | Yes      | Yes      | Yes      | No
Same model architecture          | Yes                | Yes           | Yes      | No       | No/Yes*  | No       | Yes
New added parameters             | Zero               | Limited       | Limited  | Moderate | Moderate | Limited  | Large
Extra inference latency          | High               | Moderate      | Moderate | Limited  | Zero     | Limited  | Zero
Extra inference computation cost | High               | Moderate      | Moderate | Limited  | Zero     | Limited  | Zero
Multi-task in one inference batch| Yes                | Yes           | Yes      | No       | No       | No       | No
Accuracy                         | Fair               | Good          | Good     | Better   | Better   | Better   | Best
Training data required           | Minimum            | Limited       | Limited  | Moderate | Moderate | Moderate | High
Training computation cost        | Zero               | Limited       | Limited  | Moderate | Moderate | Moderate | High

* LoRA: No in training, Yes in inference.

52
