
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or
distribute these slides for commercial purposes. You may make copies of these slides and
use or distribute them for educational purposes as long as you cite DeepLearning.AI as the
source of the slides.

For the rest of the details of the license, see


https://creativecommons.org/licenses/by-sa/2.0/legalcode
Generative AI and large language models (LLMs)
FINE-TUNING, INSTRUCTION PROMPTS, AND PARAMETER-EFFICIENT FINE-TUNING
Fine-tuning with
instruction prompts
GenAI project lifecycle

Scope → Select → Adapt and align model → Application integration
● Scope: Define the problem
● Select: Choose model
● Adapt and align model: Prompt engineering, Fine-tuning, Align with human feedback, Evaluate
● Application integration: Optimize model and deploy for inference; Augment model and build LLM-powered applications
Fine-tuning an LLM
with instruction prompts
In-context learning (ICL) - zero shot inference

Prompt → LLM → Completion
Prompt: "Classify this review: I loved this DVD! Sentiment:"
Completion: "Classify this review: I loved this DVD! Sentiment: Positive"
In-context learning (ICL) - zero shot inference

Prompt → LLM → Completion
Prompt: "Classify this review: I loved this DVD! Sentiment:"
Completion: "Classify this review: I loved this DVD! Sentiment: ...received a very nice book review"
Here the model fails to follow the instruction and does not classify the sentiment.
In-context learning (ICL) - one/few shot inference

Prompt → LLM → Completion
Prompt:
"Classify this review: I loved this DVD! Sentiment: Positive
Classify this review: I don’t like this chair. Sentiment:"
Completion: "… Sentiment: Negative"

One-shot or Few-shot Inference


Limitations of in-context learning

Context window (multiple labeled examples followed by the new review to classify):
Classify this review: I loved this movie! Sentiment: Positive
Classify this review: I don’t like this chair. Sentiment: Negative
Classify this review: This sofa is so ugly. Sentiment: Negative
Classify this review: Who would use this product? Sentiment:

● In-context learning may not work for smaller models, even with multiple examples
● Examples take up space in the context window
● Instead, try fine-tuning the model
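To make the mechanics concrete, here is a minimal Python sketch (not from the course materials) of how a few-shot prompt like the one above can be assembled before it is sent to a model; the example reviews and the helper name are illustrative assumptions.

# Minimal sketch: assembling a few-shot sentiment prompt for in-context learning.
few_shot_examples = [
    ("I loved this DVD!", "Positive"),
    ("I don't like this chair.", "Negative"),
]

def build_few_shot_prompt(new_review: str) -> str:
    """Concatenate labeled examples followed by the unlabeled review."""
    parts = []
    for review, sentiment in few_shot_examples:
        parts.append(f"Classify this review:\n{review}\nSentiment: {sentiment}\n")
    parts.append(f"Classify this review:\n{new_review}\nSentiment:")
    return "\n".join(parts)

print(build_few_shot_prompt("This sofa is so ugly."))
# Every added example consumes context-window tokens, which is one of the
# limitations of in-context learning noted above.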
LLM fine-tuning at a high level
LLM pre-training

Unstructured text (TEXT[...]) → Pre-trained LLM
GB - TB - PB of unstructured textual data
LLM fine-tuning at a high level
LLM fine-tuning

Pre-trained LLM + task-specific labeled examples, represented as prompt-completion pairs (PROMPT[...], COMPLETION[...]) → Fine-tuned LLM with improved performance
GB - TB of labeled examples for a specific task or set of tasks
Using prompts to fine-tune LLMs with instruction
LLM fine-tuning

Pre-trained LLM + task-specific examples (PROMPT[...], COMPLETION[...]) → Fine-tuned LLM with improved performance

Each prompt/completion pair includes a specific "instruction" to the LLM, for example:
● "Classify this review: I loved this DVD! Sentiment: Positive"
● "Classify this review: I don’t like this chair. Sentiment: Negative"
● "Summarize the following text: [EXAMPLE TEXT] [EXAMPLE COMPLETION]"
● "Translate this sentence to ...: [EXAMPLE TEXT] [EXAMPLE COMPLETION]"

Full fine-tuning updates all of the model's parameters.
Sample prompt instruction templates
● Classification / sentiment analysis
● Text generation
● Text summarization

Source: https://github.com/bigscience-workshop/promptsource/blob/main/promptsource/templates/amazon_polarity/templates.yaml
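As a rough illustration of what such a template does (this is a simplified Python sketch, not the promptsource YAML format linked above), a classification template turns a raw labeled example into a prompt-completion pair; the field names and the example record are assumptions.

# Sketch: turning a raw labeled example into an instruction prompt/completion pair.
template_prompt = "Classify this review:\n{review}\nSentiment:"
template_completion = " {label}"

example = {"review": "I loved this DVD!", "label": "Positive"}  # illustrative record

prompt = template_prompt.format(review=example["review"])
completion = template_completion.format(label=example["label"])

print(prompt)      # Classify this review: / I loved this DVD! / Sentiment:
print(completion)  # " Positive"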
LLM fine-tuning process
LLM fine-tuning

Prepared instruction dataset (PROMPT[...], COMPLETION[...] pairs, ...) → Training splits:
● Training
● Validation
● Test
LLM fine-tuning process
LLM fine-tuning

Prompt: "Classify this review: I loved this DVD! Sentiment:"
LLM completion: "Classify this review: I loved this DVD! Sentiment: Neutral"
Label: "Classify this review: I loved this DVD! Sentiment: Positive"

The pre-trained LLM's completion is compared with the label, and the difference is used to update the model's weights.
Loss: Cross-Entropy
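A minimal PyTorch/Transformers sketch of this training step is shown below, assuming a seq2seq model such as FLAN-T5 as a convenient example; the model name and hyperparameters are illustrative, and Hugging Face models compute the token-level cross-entropy internally when labels are passed.

# Sketch: one supervised fine-tuning step on a single prompt/completion pair.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = "Classify this review:\nI loved this DVD!\nSentiment:"
label_text = "Positive"

inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer(label_text, return_tensors="pt").input_ids

# Passing labels makes the model return the cross-entropy loss between its
# predicted token distribution and the reference completion.
outputs = model(**inputs, labels=labels)
loss = outputs.loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss.backward()        # gradients for all (or, with PEFT, a subset of) weights
optimizer.step()
optimizer.zero_grad()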
LLM fine-tuning process
LLM fine-tuning

Prepared instruction dataset → Training splits:
● Training
● Validation → validation_accuracy
● Test → test_accuracy
LLM fine-tuning process

Pre-trained LLM → (instruction fine-tuning) → Fine-tuned Instruct LLM
Fine-tuning on a single task

Pre-trained LLM + single-task training dataset (e.g. summarization) → Fine-tuned Instruct LLM

Example training prompts:
Summarize the following text:
[EXAMPLE TEXT]
[EXAMPLE COMPLETION]
…

Often, only 500-1,000 examples are needed to fine-tune a single task.
Catastrophic forgetting
● Fine-tuning can significantly increase the performance of a model on a specific task…

Before fine-tuning
Prompt: "Classify this review: I loved this DVD! Sentiment:"
Completion: "Classify this review: I loved this DVD! Sentiment: ...received a very nice book review"
Catastrophic forgetting
● Fine-tuning can significantly increase the performance of a model on a specific task…

After fine-tuning
Prompt: "Classify this review: I loved this DVD! Sentiment:"
Completion: "Classify this review: I loved this DVD! Sentiment: POSITIVE"
Catastrophic forgetting
● …but can lead to a reduction in ability on other tasks

Before fine-tuning
Prompt: "What is the name of the cat? Charlie the cat roamed the garden at night."
Completion: "… Charlie"
Catastrophic forgetting
● …but can lead to a reduction in ability on other tasks

After fine-tuning
Prompt: "What is the name of the cat? Charlie the cat roamed the garden at night."
Completion: "… The garden was positive."
How to avoid catastrophic forgetting
● First note that you might not have to!
● Fine-tune on multiple tasks at the same time
● Consider Parameter Efficient Fine-tuning (PEFT)
Multi-task, instruction fine-tuning

Pre-trained LLM, instruction fine-tuned on many tasks → Instruct LLM

Example instructions:
● Summarize the following text:
● Rate this review:
● Translate into Python code:
● Identify the places:
[EXAMPLE TEXT] … [EXAMPLE COMPLETION]

Many examples of each task are needed for training.
Instruction fine-tuning with FLAN
● FLAN models refer to a specific set of instructions used to perform instruction fine-tuning

FLAN: "The metaphorical dessert to the main course of pretraining"
Instruction fine-tuning with FLAN
● FLAN models refer to a specific set of instructions used to perform instruction fine-tuning

FLAN (Fine-tuned LAnguage Net):
● T5 → FLAN-T5
● PaLM → FLAN-PaLM
FLAN-T5: Fine-tuned version of pre-trained T5 model
● FLAN-T5 is a great, general-purpose, instruct model

Instruction-tuning mixture:
T0-SF (55 Datasets, 14 Categories, 193 Tasks): Commonsense reasoning, Question generation, Closed-book QA, Adversarial QA, Extractive QA, …
Muffin (69 Datasets, 27 Categories, 80 Tasks): Natural language inference, Code instruction generation, Code repair, Dialog context generation, Summarization (SAMSum), …
CoT (reasoning) (9 Datasets, 1 Category, 9 Tasks): Arithmetic reasoning, Commonsense reasoning, Explanation generation, Sentence composition, Implicit reasoning, …
Natural Instructions (372 Datasets, 108 Categories, 1554 Tasks): Cause effect classification, Commonsense reasoning, Named Entity Recognition, Toxic Language Detection, Question answering, …

Source: Chung et al. 2022, “Scaling Instruction-Finetuned Language Models”
SAMSum: A dialogue dataset
Sample prompt training dataset (samsum) to fine-tune FLAN-T5 from pretrained T5

Source: https://huggingface.co/datasets/samsum, https://github.com/google-research/FLAN/blob/2c79a31/flan/v2/templates.py#L3285


Sample FLAN-T5 prompt templates
"samsum": [
("{dialogue}\n\Briefly summarize that dialogue.", "{summary}"),
("Here is a dialogue:\n{dialogue}\n\nWrite a short summary!",
"{summary}"),
("Dialogue:\n{dialogue}\n\nWhat is a summary of this dialogue?",
"{summary}"),
("{dialogue}\n\nWhat was that dialogue about, in two sentences or less?",
"{summary}"),
("Here is a dialogue:\n{dialogue}\n\nWhat were they talking about?",
"{summary}"),
("Dialogue:\n{dialogue}\nWhat were the main points in that "
"conversation?", "{summary}"),
("Dialogue:\n{dialogue}\nWhat was going on in that conversation?",
"{summary}"),
]
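As a minimal sketch of how these templates are used, one of the prompt/completion pairs above can be rendered against a samsum-style record; the record below is illustrative, not taken from the dataset card.

# Sketch: render one of the FLAN samsum templates above into a training pair.
templates = [
    ("{dialogue}\n\nBriefly summarize that dialogue.", "{summary}"),
    ("Dialogue:\n{dialogue}\n\nWhat is a summary of this dialogue?", "{summary}"),
]

record = {  # illustrative samsum-style record
    "dialogue": "Amanda: I baked cookies. Do you want some?\nJerry: Sure!",
    "summary": "Amanda baked cookies and will bring Jerry some.",
}

prompt_template, completion_template = templates[0]
prompt = prompt_template.format(dialogue=record["dialogue"])
completion = completion_template.format(summary=record["summary"])
print(prompt)
print(completion)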
Improving FLAN-T5’s summarization capabilities

ShopBot example conversation:
ShopBot: Hi there! How can I help you today?
Customer: I need to return a pair of jeans that I purchased.
ShopBot: Of course, I’d be happy to help you. When did you buy your item?
Customer: Two weeks ago.

Goal: Summarize conversations to identify actions to take
Improving FLAN-T5’s summarization capabilities
Further fine-tune FLAN-T5 with a domain-specific instruction dataset (dialogsum)
Example support-dialog summarization
Prompt (created from template)

Summarize the following conversation.


Tommy: Hello. My name is Tommy Sandals, I have a reservation.
Mike: May I see some identification, sir, please?
Tommy: Sure. Here you go.
Mike: Thank you so much. Have you got a credit card, Mr.
Sandals?
Tommy: I sure do.
Mike: Thank you, sir. You'll be in room 507, nonsmoking,
queen bed.
Tommy: That's great, thank you!
Mike: Enjoy your stay!
Source: https://huggingface.co/datasets/knkarthick/dialogsum/viewer/knkarthick--dialogsum/
Summary before fine-tuning FLAN-T5 with our dataset

Prompt (created from template):
Summarize the following conversation.
Tommy: Hello. My name is Tommy Sandals, I have a reservation.
Mike: May I see some identification, sir, please?
...
Tommy: That's great, thank you!
Mike: Enjoy your stay!

Model: FLAN-T5

Completion (Summary):
Tommy Sandals has a reservation for a room at the Venetian Hotel in Las Vegas.

Adequate completion, but it does not match the human baseline.

Human baseline summary:
Tommy Sandals has got a reservation. Mike asks for his identification and credit card and helps his check-in.
Summary after fine-tuning FLAN-T5 with our dataset

Prompt (created from template):
Summarize the following conversation.
Tommy: Hello. My name is Tommy Sandals, I have a reservation.
...
Mike: Enjoy your stay!

Model: FLAN-T5 (Fine-tuned)

Completion (Summary):
Tommy Sandals has a reservation and checks in showing his ID and credit card. Mike helps him to check in and approves his reservation.

Better summary, more closely matches the human baseline.
Fine-tuning with your own data

ShopBot example conversation:
ShopBot: Hi there! How can I help you today?
Customer: I need to return a pair of jeans that I purchased.
ShopBot: Of course, I’d be happy to help you. When did you buy your item?
Customer: Two weeks ago.
Model evaluation metrics
LLM Evaluation - Challenges

Accuracy = Correct Predictions / Total Predictions
LLM Evaluation - Challenges

“Mike really loves drinking tea.” vs. “Mike adores sipping tea.” — different words, same meaning
“Mike does not drink coffee.” vs. “Mike does drink coffee.” — differ by one word, opposite meaning
LLM Evaluation - Metrics

ROUGE:
● Used for text summarization
● Compares a summary to one or more reference summaries

BLEU SCORE:
● Used for text translation
● Compares to human-generated translations
LLM Evaluation - Metrics - Terminology

n-gram: a sequence of n words (a unigram is a single word, a bigram is a pair of adjacent words)

Example: The dog lay on the rug as I sipped a cup of tea.
LLM Evaluation - Metrics - ROUGE-1

Reference (human): It is cold outside.
Generated output: It is very cold outside.

ROUGE-1 Recall = unigram matches / unigrams in reference = 4/4 = 1.0
ROUGE-1 Precision = unigram matches / unigrams in output = 4/5 = 0.8
ROUGE-1 F1 = 2 x (precision x recall) / (precision + recall) = 2 x (0.8 / 1.8) = 0.89
LLM Evaluation - Metrics - ROUGE-1

Reference (human): It is cold outside.
Generated output: It is not cold outside.

ROUGE-1 Recall = unigram matches / unigrams in reference = 4/4 = 1.0
ROUGE-1 Precision = unigram matches / unigrams in output = 4/5 = 0.8
ROUGE-1 F1 = 2 x (precision x recall) / (precision + recall) = 2 x (0.8 / 1.8) = 0.89

The scores are identical to the previous example, even though the generated sentence has the opposite meaning.
LLM Evaluation - Metrics - ROUGE-2

Reference (human): It is cold outside.
Reference bigrams: "It is", "is cold", "cold outside"

Generated output: It is very cold outside.
Generated bigrams: "It is", "is very", "very cold", "cold outside"
LLM Evaluation - Metrics - ROUGE-2

Reference (human): It is cold outside. (bigrams: "It is", "is cold", "cold outside")
Generated output: It is very cold outside. (bigrams: "It is", "is very", "very cold", "cold outside")

ROUGE-2 Recall = bigram matches / bigrams in reference = 2/3 = 0.67
ROUGE-2 Precision = bigram matches / bigrams in output = 2/4 = 0.5
ROUGE-2 F1 = 2 x (precision x recall) / (precision + recall) = 2 x (0.335 / 1.17) = 0.57
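The ROUGE-1 and ROUGE-2 arithmetic above can be reproduced with a few lines of Python. The sketch below uses naive whitespace tokenization and set-based matching, which is enough to recover the numbers in these examples; production ROUGE implementations also handle stemming, multiple references, clipping, and so on.

# Sketch: ROUGE-N recall, precision and F1 with naive whitespace tokenization.
def ngrams(text, n):
    tokens = text.lower().replace(".", "").split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n(reference, generated, n):
    ref, gen = ngrams(reference, n), ngrams(generated, n)
    matches = len(ref & gen)
    if matches == 0:
        return 0.0, 0.0, 0.0
    recall = matches / len(ref)
    precision = matches / len(gen)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

print(rouge_n("It is cold outside.", "It is very cold outside.", 1))  # (1.0, 0.8, ~0.89)
print(rouge_n("It is cold outside.", "It is very cold outside.", 2))  # (~0.67, 0.5, ~0.57)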
LLM Evaluation - Metrics - ROUGE-L

Reference (human): It is cold outside.
Generated output: It is very cold outside.

Longest common subsequences (LCS): "It is" and "cold outside", each of length 2
LLM Evaluation - Metrics - ROUGE-L

Reference (human): It is cold outside.
Generated output: It is very cold outside.
LCS: Longest common subsequence

ROUGE-L Recall = LCS(Gen, Ref) / unigrams in reference = 2/4 = 0.5
ROUGE-L Precision = LCS(Gen, Ref) / unigrams in output = 2/5 = 0.4
ROUGE-L F1 = 2 x (precision x recall) / (precision + recall) = 2 x (0.2 / 0.9) = 0.44
LLM Evaluation - Metrics - ROUGE hacking

Reference (human): It is cold outside.
Generated output: cold cold cold cold

A nonsense output built from repeated reference words can still achieve perfect unigram precision.
LLM Evaluation - Metrics - ROUGE clipping

Reference (human): It is cold outside.

Generated output: cold cold cold cold
ROUGE-1 Precision = unigram matches / unigrams in output = 4/4 = 1.0
Modified precision = clip(unigram matches) / unigrams in output = 1/4 = 0.25

Generated output: outside cold it is
Modified precision = clip(unigram matches) / unigrams in output = 4/4 = 1.0 (word order is still not accounted for)
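A minimal sketch of the clipping idea: each generated unigram is only credited up to the number of times it appears in the reference.

# Sketch: modified (clipped) unigram precision, which defeats "cold cold cold cold".
from collections import Counter

def clipped_unigram_precision(reference, generated):
    ref_counts = Counter(reference.lower().replace(".", "").split())
    gen_counts = Counter(generated.lower().replace(".", "").split())
    clipped = sum(min(count, ref_counts[token]) for token, count in gen_counts.items())
    return clipped / sum(gen_counts.values())

print(clipped_unigram_precision("It is cold outside.", "cold cold cold cold"))  # 0.25
print(clipped_unigram_precision("It is cold outside.", "outside cold it is"))   # 1.0 - word order still ignored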
LLM Evaluation - Metrics - BLEU
BLEU metric = Avg(precision across range of n-gram sizes)

Reference (human):

I am very happy to say that I am drinking a warm cup of tea.


Generated output:
I am very happy that I am drinking a cup of tea. - BLEU 0.495
I am very happy that I am drinking a warm cup of tea. - BLEU 0.730
I am very happy to say that I am drinking a warm tea. - BLEU 0.798
I am very happy to say that I am drinking a warm cup of tea. - BLEU 1.000
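Scores like the ones above are normally computed with an off-the-shelf BLEU implementation. The sketch below assumes the Hugging Face evaluate library is installed; the exact values depend on the tokenizer and smoothing used, so they may differ slightly from the slide.

# Sketch: scoring candidate translations against one human reference with BLEU.
# Requires: pip install evaluate
import evaluate

bleu = evaluate.load("bleu")
reference = ["I am very happy to say that I am drinking a warm cup of tea."]

for candidate in [
    "I am very happy that I am drinking a cup of tea.",
    "I am very happy that I am drinking a warm cup of tea.",
    "I am very happy to say that I am drinking a warm tea.",
    "I am very happy to say that I am drinking a warm cup of tea.",
]:
    result = bleu.compute(predictions=[candidate], references=[reference])
    print(round(result["bleu"], 3), "-", candidate)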
Benchmarks
Evaluation benchmarks
GLUE

The tasks included in the GLUE benchmark:

Source: Wang et al. 2018, “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”
SuperGLUE

The tasks included in the SuperGLUE benchmark:

Source: Wang et al. 2019, “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems”
GLUE and SuperGLUE leaderboards

Disclaimer: metrics may not be up-to-date. Check https://super.gluebenchmark.com and https://gluebenchmark.com/leaderboard for the latest.
Benchmarks for massive models

Massive Multitask Language Understanding (MMLU), 2021

Source: Hendrycks et al. 2021, “Measuring Massive Multitask Language Understanding”
Benchmarks for massive models

Massive Multitask Language Understanding (MMLU), 2021
BIG-bench, 2022 — with BIG-bench Lite and BIG-bench Hard variants

Source: Hendrycks et al. 2021, “Measuring Massive Multitask Language Understanding”
Source: Suzgun et al. 2022, “Challenging BIG-Bench tasks and whether chain-of-thought can solve them”
Holistic Evaluation of Language Models (HELM)

Metrics:
1. Accuracy
2. Calibration
3. Robustness
4. Fairness
5. Bias
6. Toxicity
7. Efficiency
Holistic Evaluation of Language Models (HELM)

Disclaimer: metrics may not be up-to-date. Check https://crfm.stanford.edu/helm/latest for the latest.
Key takeaways
LLM fine-tuning process
LLM fine-tuning

Training dataset of prompt/completion pairs → Pre-trained LLM

Prompt: "Classify this review: I loved this DVD! Sentiment:"
LLM completion: "Classify this review: I loved this DVD! Sentiment: Neutral"
Label: "Classify this review: I loved this DVD! Sentiment: Positive"

Loss: Cross-Entropy between the LLM completion and the label, used to update the model's weights over many examples, producing an updated (instruct) LLM.
Parameter-efficient fine-tuning (PEFT)
Full fine-tuning of large LLMs is challenging

Memory needed during training:
● Trainable weights
● Optimizer states
● Gradients
● Forward activations
● Temp memory

The additional components beyond the trainable weights can require roughly 12-20x the memory of the LLM weights themselves.
Parameter efficient fine-tuning (PEFT)

● Train only a small number of layers, with most of the LLM's layers frozen ❄
● Or add new trainable layers or components to the LLM, keeping the original weights frozen ❄
● Less prone to catastrophic forgetting
Full fine-tuning creates full copy of original LLM per task

LLM → QA fine-tune → QA LLM (GBs)
LLM → Summarize fine-tune → Summarize LLM (GBs)
LLM → Generate fine-tune → Generate LLM (GBs)
PEFT fine-tuning saves space and is flexible

One base LLM (GBs) plus small task-specific PEFT weights:
● QA PEFT weights (MBs)
● Summarize PEFT weights (MBs)
● Generate PEFT weights (MBs)
PEFT Trade-offs
● Parameter efficiency
● Memory efficiency
● Training speed
● Model performance
● Inference costs
PEFT methods

Selective: Select a subset of initial LLM parameters to fine-tune
Reparameterization: Reparameterize model weights using a low-rank representation (LoRA)
Additive: Add trainable layers or parameters to the model (Adapters; Soft Prompts, e.g. Prompt Tuning)

Source: Lialin et al. 2023, “Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning”
Low-Rank Adaptation of Large Language Models (LoRA)
Transformers: recap

Inputs → Embedding → Encoder (self-attention + feed-forward network) → Decoder (self-attention + feed-forward network) → Softmax output

Example: tokenized input ids 2345 3425 3853 pass through the encoder and decoder to produce the output “J'aime l'apprentissage automatique” (“I love machine learning”).
LoRA: Low-Rank Adaptation of LLMs

1. Freeze most of the original LLM weights ❄ (e.g. the self-attention weights applied to the embedding vectors)
2. Inject two rank-decomposition matrices, A and B
3. Train the weights of the smaller matrices
   Rank r is a small dimension, typically 4, 8, …, 64

Steps to update the model for inference:
1. Matrix-multiply the low-rank matrices: B x A
2. Add the result to the original frozen weights ❄ + B x A → updated weights
Concrete example using base Transformer as reference

Use the base Transformer model presented by Vaswani et al. 2017:
● Transformer weights have dimensions d x k = 512 x 64
● So 512 x 64 = 32,768 trainable parameters

In LoRA with rank r = 8:
● A has dimensions r x k = 8 x 64 = 512 trainable parameters
● B has dimensions d x r = 512 x 8 = 4,096 trainable parameters
● 4,608 trainable parameters in total: an 86% reduction in parameters to train!
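A small NumPy sketch of the same calculation, using the dimensions from the example above; it is illustrative only and ignores the scaling factor that real LoRA implementations apply to the low-rank update.

# Sketch: LoRA low-rank update with the dimensions from the example above.
import numpy as np

d, k, r = 512, 64, 8

W = np.random.randn(d, k)   # frozen pre-trained weight matrix: 512 x 64 = 32,768 params
A = np.random.randn(r, k)   # trainable:   8 x 64 =   512 params
B = np.random.randn(d, r)   # trainable: 512 x  8 = 4,096 params

delta_W = B @ A             # (512 x 8) @ (8 x 64) -> 512 x 64, same shape as W
W_merged = W + delta_W      # merged weights used at inference time

trainable = A.size + B.size
print(trainable, W.size, 1 - trainable / W.size)  # 4608 32768 ~0.86 reduction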
LoRA: Low-Rank Adaptation of LLMs

1. Train different rank-decomposition matrices for different tasks
2. Update weights before inference: multiply the task-specific matrices (B x A) and add the result to the frozen base weights ❄ to get the updated weights for Task A, Task B, and so on. Switching tasks only requires swapping in a different pair of small matrices.
Sample ROUGE metrics for full vs. LoRA fine-tuning

FLAN-T5, dialog summarization:

Base model (baseline score), flan_t5_base:
{‘rouge1’: 0.2334, ‘rouge2’: 0.0760, ‘rougeL’: 0.2014, ‘rougeLsum’: 0.2015}

Full fine-tune (+80.63% rouge1 vs. base), flan_t5_base_instruct_full:
{‘rouge1’: 0.4216, ‘rouge2’: 0.1804, ‘rougeL’: 0.3384, ‘rougeLsum’: 0.3384}

LoRA fine-tune (-3.20% rouge1 vs. full fine-tune), flan_t5_base_instruct_lora:
{‘rouge1’: 0.4081, ‘rouge2’: 0.1633, ‘rougeL’: 0.3251, ‘rougeLsum’: 0.3249}
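Score dictionaries like the ones above are typically produced with an off-the-shelf ROUGE implementation. This is a minimal sketch assuming the Hugging Face evaluate library, with placeholder lists standing in for the model's generated summaries and the human baselines.

# Sketch: producing rouge1/rouge2/rougeL/rougeLsum scores like those shown above.
# Requires: pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")

# Placeholder lists; in practice these hold the model's generated summaries and
# the human baseline summaries over the dialogue-summarization test set.
predictions = ["Tommy Sandals has a reservation and checks in showing his ID and credit card."]
references = ["Tommy Sandals has got a reservation. Mike asks for his identification and credit card and helps his check-in."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}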
Choosing the LoRA rank

● Effectiveness of higher rank appears to plateau
● Relationship between rank and dataset size needs more empirical data

Source: Hu et al. 2021, “LoRA: Low-Rank Adaptation of Large Language Models”
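The rank r discussed above is the main knob when configuring LoRA in practice. The sketch below assumes the Hugging Face peft library; the rank, scaling, dropout, and target module names are illustrative choices rather than recommendations, and the API may differ between versions.

# Sketch: wrapping a FLAN-T5 base model with a LoRA adapter via the peft library.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

lora_config = LoraConfig(
    r=8,                        # rank of the decomposition matrices A and B
    lora_alpha=32,              # scaling applied to the low-rank update
    target_modules=["q", "v"],  # attention query/value projections (T5 naming)
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only a small fraction of weights are trainable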


QLoRA: Quantized LoRA

● Introduces 4-bit NormalFloat (nf4) data type for 4-bit quantization
● Supports double-quantization to reduce memory by ~0.4 bits per parameter (~3 GB for a 65B model)
● Unified GPU-CPU memory management reduces GPU memory usage
● LoRA adapters at every layer - not just attention layers
● Minimizes accuracy trade-off

Source: Dettmers et al. 2023, “QLoRA: Efficient Finetuning of Quantized LLMs”
Prompt tuning with soft prompts
Prompt tuning is not prompt engineering!

Prompt engineering works with the language of the prompt itself, e.g. one-shot or few-shot inference:
Prompt: "Classify this review: I loved this DVD! Sentiment: Positive / Classify this review: I don’t like this chair. Sentiment:"
Completion: "… Sentiment: Negative"
Prompt tuning adds trainable “soft prompt” to inputs

Soft prompt (trainable virtual tokens X1 X1 X1 …) prepended to the input tokens (X2 X3 X4 X5 X6 X7 X8)
● Same length as the token embedding vectors
● Typically 20-100 tokens

Example input: The teacher teaches the student with the book.
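A minimal PyTorch sketch of the idea (not the course implementation): a small matrix of trainable virtual-token embeddings is prepended to the frozen input embeddings before they enter the model; the dimensions below are illustrative.

# Sketch: prepending trainable soft-prompt vectors to frozen input embeddings.
import torch
import torch.nn as nn

vocab_size, embed_dim = 32_000, 768      # illustrative model dimensions
num_virtual_tokens = 20                  # soft prompts are typically 20-100 tokens

token_embedding = nn.Embedding(vocab_size, embed_dim)
token_embedding.weight.requires_grad = False          # model weights stay frozen

soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim))  # trainable

input_ids = torch.randint(0, vocab_size, (1, 8))      # e.g. "The teacher teaches the student ..."
input_embeds = token_embedding(input_ids)             # shape (1, 8, 768)

# Concatenate the soft prompt in front of the real token embeddings.
batch_soft_prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
full_embeds = torch.cat([batch_soft_prompt, input_embeds], dim=1)   # shape (1, 28, 768)
print(full_embeds.shape)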
Soft prompts

● Embeddings of each token exist at a unique point in multi-dimensional space (e.g. fire, book, jumps, fox, swim, whale)
Full Fine-tuning vs prompt tuning

Full fine-tuning:
● Weights of model updated during training
● Millions to billions of parameters updated

Prompt tuning:
● Weights of model frozen and soft prompt trained
● 10K - 100K parameters updated
Prompt tuning for multiple tasks

Train a separate soft prompt for each task (Task A, Task B, …) against the same frozen LLM ❄, then switch out the soft prompt at inference time to change task!
Performance of prompt tuning

[Plot: SuperGLUE Score vs. Number of Model Parameters, comparing Full Fine-tuning, Multi-task Fine-tuning, Prompt tuning, and Prompt engineering]

Prompt tuning can be as effective as full fine-tuning for larger models!

Source: Lester et al. 2021, “The Power of Scale for Parameter-Efficient Prompt Tuning”
Interpretability of soft prompts

● A trained soft-prompt embedding does not correspond to a known token in the embedding space…
Interpretability of soft prompts

● …but its nearest-neighbor tokens form a semantic group with similar meanings (e.g. “100%”, “completely”, “totally”, “altogether”, “entirely”).
PEFT methods summary

Selective: Select a subset of initial LLM parameters to fine-tune
Reparameterization: Reparameterize model weights using a low-rank representation (LoRA)
Additive: Add trainable layers or parameters to the model (Adapters; Soft Prompts, e.g. Prompt Tuning)
