DeepLearning.AI makes these slides available for educational purposes. You may not use or
distribute these slides for commercial purposes. You may make copies of these slides and
use or distribute them for educational purposes as long as you cite DeepLearning.AI as the
source of the slides.
GenAI project lifecycle
[Diagram: Define the problem → Choose model → Adapt and align model (prompt engineering, fine-tuning, align with human feedback; evaluate) → Optimize model and deploy for inference → Augment model and build LLM-powered applications]
Fine-tuning an LLM with instruction prompts
In-context learning (ICL) - zero-shot inference
[Diagram: a pre-trained LLM is produced by training on GB-TB-PB of unstructured textual data (TEXT[...])]
LLM fine-tuning at a high level
LLM fine-tuning
[Diagram: a pre-trained LLM is fine-tuned on GB-TB of labeled examples for a specific task or set of tasks, producing a fine-tuned LLM. In instruction fine-tuning, these labeled examples are prompt-completion pairs (PROMPT[...], COMPLETION[...]), and training on them yields improved performance]
Using prompts to fine-tune LLMs with instruction
LLM fine-tuning
[Diagram: the instruction dataset demonstrates the desired behavior, e.g.
Classify this review: I loved this DVD! Sentiment: Positive
Classify this review: I don't like this chair. Sentiment: Negative
Fine-tuning the pre-trained LLM on such prompt-completion pairs produces a fine-tuned LLM; the same approach applies to other tasks such as text generation and text summarization]
Source: https://github.com/bigscience-workshop/promptsource/blob/main/promptsource/templates/amazon_polarity/templates.yaml
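To make the template idea concrete, here is a minimal sketch of turning a labeled example into a prompt-completion pair. The template wording and field names are illustrative, not taken from the linked templates.yaml:

```python
# A minimal sketch of converting a labeled example into an instruction
# prompt-completion pair via a template, in the spirit of promptsource.
# Template wording and field names here are illustrative assumptions.

template = "Classify this review:\n{review}\nSentiment:"

example = {"review": "I loved this DVD!", "label": "Positive"}

prompt = template.format(review=example["review"])
completion = example["label"]

print(prompt)      # Classify this review: / I loved this DVD! / Sentiment:
print(completion)  # Positive
```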
LLM fine-tuning process
[Diagram: the prompt-completion dataset is divided into training, validation, and test splits]
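A minimal sketch of such a split; the 80/10/10 fractions are an assumption, as the slides do not specify them:

```python
# A minimal sketch of splitting prompt-completion pairs into training,
# validation, and test sets. The 80/10/10 split is an assumption.
import random

pairs = [(f"PROMPT[{i}]", f"COMPLETION[{i}]") for i in range(1000)]
random.Random(42).shuffle(pairs)  # fixed seed for reproducibility

n = len(pairs)
train = pairs[: int(0.8 * n)]                    # used to update model weights
validation = pairs[int(0.8 * n): int(0.9 * n)]   # for validation_accuracy
test = pairs[int(0.9 * n):]                      # for final test_accuracy
```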
During training, prompts from the training set are fed to the pre-trained LLM, e.g.:
Prompt: Classify this review: I loved this DVD! Sentiment:
LLM completion: Sentiment: Neutral
Label: Sentiment: Positive
The cross-entropy loss between the model's completion and the label is computed and used to update the model weights.
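A minimal sketch of one such training step, assuming the Hugging Face transformers library and a seq2seq model such as FLAN-T5; passing `labels` makes the model compute the cross-entropy loss internally:

```python
# One instruction fine-tuning step: compare the model's completion
# distribution with the label via cross-entropy and update the weights.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Classify this review:\nI loved this DVD!\nSentiment:"
label = "Positive"

inputs = tokenizer(prompt, return_tensors="pt")
targets = tokenizer(label, return_tensors="pt")

outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=targets.input_ids)  # cross-entropy loss computed here
outputs.loss.backward()    # backpropagate
optimizer.step()           # update all weights (full fine-tuning)
optimizer.zero_grad()
```

In practice this runs over many batches of prompt-completion pairs rather than a single example.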
The held-out validation split is used to measure validation_accuracy during fine-tuning, and the test split gives a final test_accuracy.
[Diagram: the end result of the process is a fine-tuned instruct LLM]
Fine-tuning on a single task
[Diagram: a pre-trained LLM is fine-tuned on a single task, e.g. 'Summarize the following text: [EXAMPLE TEXT]' → '[EXAMPLE COMPLETION]', producing an instruct LLM]
Often, only 500-1,000 examples are needed to fine-tune for a single task.
Catastrophic forgetting
● Fine-tuning can significantly increase the performance of a model on a specific task…
Before fine-tuning:
Prompt: Classify this review: I loved this DVD! Sentiment:
Completion: …received a very nice book review (the base model fails to follow the instruction)
After fine-tuning:
Prompt: Classify this review: I loved this DVD! Sentiment:
Completion: Sentiment: POSITIVE
● …but can lead to a reduction in ability on other tasks.
Before fine-tuning:
Prompt: What is the name of the cat? Charlie the cat roamed the garden at night.
Completion: Charlie
After fine-tuning:
Prompt: What is the name of the cat? Charlie the cat roamed the garden at night.
Completion: The garden was positive. (the model has forgotten how to do this task)
How to avoid catastrophic forgetting
● First note that you might not have to!
● Fine-tune on multiple tasks at the same time
● Consider Parameter Efficient Fine-tuning (PEFT)
Multi-task, instruction fine-tuning
[Diagram: the pre-trained LLM is fine-tuned on a mixture of tasks at once, e.g. 'Rate this review:', 'Translate into Python code:', …, each with [EXAMPLE TEXT] and [EXAMPLE COMPLETION] pairs, producing a single instruct LLM. FLAN models are trained this way]
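A minimal sketch of building such a mixed-task dataset; the task names, templates, and examples below are illustrative placeholders, not the FLAN mixture:

```python
# A minimal sketch of mixing examples from several instruction tasks into
# one training set, so each batch sees a mixture (reducing catastrophic
# forgetting). All task names and data here are illustrative.
import random

task_templates = {
    "rate": "Rate this review:\n{text}",
    "translate": "Translate into Python code:\n{text}",
}
raw_examples = [
    ("rate", "I loved this DVD!", "5 stars"),
    ("translate", "add two numbers", "def add(a, b): return a + b"),
]

mixed = [
    (task_templates[task].format(text=text), completion)
    for task, text, completion in raw_examples
]
random.shuffle(mixed)  # interleave tasks so every batch sees a mixture
```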
Instruction fine-tuning with FLAN
● FLAN models refer to a specific set of instructions used to perform instruction fine-tuning.
[Diagram: T5 becomes FLAN-T5 through instruction fine-tuning across many task categories, e.g.:
- Commonsense Reasoning, Question Generation, Closed-book QA, Adversarial QA, Extractive QA, …
- Natural language inference, Code instruction gen, Code repair, Dialog context generation, Summarization (SAMSum), …
- Arithmetic reasoning, Commonsense reasoning, Explanation generation, Sentence composition, Implicit reasoning, …
- Cause effect classification, Named Entity Recognition, Toxic Language Detection, Question answering, …]
Improving FLAN-T5's summarization capabilities
Further fine-tune FLAN-T5 with a domain-specific instruction dataset (dialogsum).
Example: support-dialog summarization for a customer-service chatbot ('ShopBot'), with the prompt created from a template.
Model evaluation metrics
LLM Evaluation - Challenges
For traditional ML models with deterministic outputs, evaluation is straightforward:
Accuracy = Correct Predictions / Total Predictions
LLM output is non-deterministic, and similarity between sentences is hard to score: "Mike does not drink coffee." and "Mike does drink coffee." differ by a single word yet mean the opposite.
LLM Evaluation - Metrics
Two common metric families are ROUGE (used for text summarization) and BLEU score (used for text translation). Terminology: a unigram is a single word, a bigram is two consecutive words, and an n-gram is a group of n consecutive words.
LLM Evaluation - Metrics - ROUGE-1
Reference (human): It is cold outside.
Generated output: It is very cold outside.
ROUGE-1 Recall = unigram matches / unigrams in reference = 4/4 = 1.0
ROUGE-1 Precision = unigram matches / unigrams in output = 4/5 = 0.8
The generated output "It is not cold outside." gets exactly the same scores (recall 4/4 = 1.0, precision 4/5 = 0.8) even though its meaning is the opposite of the reference: unigram matching ignores word order and meaning.
LLM Evaluation - Metrics - ROUGE-2
Reference (human): It is cold outside. → bigrams: "It is", "is cold", "cold outside"
Generated output: It is very cold outside. → bigrams: "It is", "is very", "very cold", "cold outside"
ROUGE-2 Precision = bigram matches / bigrams in output = 2/4 = 0.5
ROUGE-2 Recall = bigram matches / bigrams in reference = 2/3 ≈ 0.67
ROUGE-2 F1 = 2 × (precision × recall) / (precision + recall) = 2 × 0.335 / 1.17 ≈ 0.57
LLM Evaluation - Metrics - ROUGE-L
LCS: longest common subsequence.
Reference (human): It is cold outside.
Generated output: It is very cold outside.
The longest common subsequences are "It is" and "cold outside", each of length 2, so LCS(Gen, Ref) = 2.
ROUGE-L Recall = LCS(Gen, Ref) / unigrams in reference = 2/4 = 0.5
ROUGE-L Precision = LCS(Gen, Ref) / unigrams in output = 2/5 = 0.4
ROUGE-L F1 = 2 × (precision × recall) / (precision + recall) = 2 × 0.2 / 0.9 ≈ 0.44
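A minimal sketch matching the numbers above. Note that the slides compute LCS as the longest common contiguous run of words (length 2 here), while the textbook ROUGE-L definition allows non-contiguous subsequences, which would give length 4 for this pair:

```python
# A minimal sketch of ROUGE-L as computed on the slides, treating LCS as
# the longest common *contiguous* run of words. Tokenization is naive
# whitespace splitting; real implementations differ in detail.
def longest_common_run(a, b):
    # Dynamic programming over word pairs; dp[i+1][j+1] extends a run.
    best = 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            if wa == wb:
                dp[i + 1][j + 1] = dp[i][j] + 1
                best = max(best, dp[i + 1][j + 1])
    return best

ref = "it is cold outside".split()
gen = "it is very cold outside".split()

lcs = longest_common_run(gen, ref)                   # 2
recall = lcs / len(ref)                              # 2/4 = 0.5
precision = lcs / len(gen)                           # 2/5 = 0.4
f1 = 2 * precision * recall / (precision + recall)   # ~0.44
print(lcs, recall, precision, round(f1, 2))
```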
LLM Evaluation - Metrics - ROUGE hacking
ROUGE can be gamed by degenerate outputs.
Reference (human): It is cold outside.
Generated output: cold cold cold cold
ROUGE-1 Precision = unigram matches / unigrams in output = 4/4 = 1.0
A perfect score for nonsense. Clipping caps each word's matches at its count in the reference:
Modified precision = clip(unigram matches) / unigrams in output = 1/4 = 0.25
But word order can still fool it. Generated output: outside cold it is
Modified precision = clip(unigram matches) / unigrams in output = 4/4 = 1.0
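A minimal, self-contained sketch of ROUGE-1 with clipping that reproduces these numbers; real projects would typically use a library such as rouge_score or evaluate instead:

```python
# A minimal sketch of ROUGE-1 with clipped matches. Counter intersection
# caps each word's matches at its count in the reference, penalizing
# "cold cold cold cold" but not reordered words.
from collections import Counter

def rouge_1(reference, generated):
    ref = reference.lower().strip(".").split()
    gen = generated.lower().strip(".").split()
    matches = sum((Counter(gen) & Counter(ref)).values())  # clipped counts
    recall = matches / len(ref)
    precision = matches / len(gen)
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return recall, precision, f1

print(rouge_1("It is cold outside.", "It is very cold outside."))  # (1.0, 0.8, ~0.89)
print(rouge_1("It is cold outside.", "cold cold cold cold"))        # precision 0.25
print(rouge_1("It is cold outside.", "outside cold it is"))         # precision 1.0
```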
LLM Evaluation - Metrics - BLEU
BLEU score quantifies translation quality by comparing the generated output with a human reference; it is computed as the average precision across a range of n-gram sizes.
GLUE
Source: Wang et al. 2018, “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”
SuperGLUE
Source: Wang et al. 2019, “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems”
GLUE and SuperGLUE leaderboards
Disclaimer: metrics may not be up-to-date. Check https://super.gluebenchmark.com and https://gluebenchmark.com/leaderboard for the latest.
Benchmarks for massive models
Massive Multitask Language Understanding (MMLU), 2021
Source: Hendrycks et al. 2021, “Measuring Massive Multitask Language Understanding”
BIG-bench (including the BIG-bench Hard and BIG-bench Lite subsets), 2022
Source: Suzgun et al. 2022, “Challenging BIG-Bench tasks and whether chain-of-thought can solve them”
Holistic Evaluation of Language Models (HELM)
Metrics:
1. Accuracy
2. Calibration
3. Robustness
4. Fairness
5. Bias
6. Toxicity
7. Efficiency
Disclaimer: metrics may not be up-to-date. Check https://crfm.stanford.edu/helm/latest for the latest.
Key takeaways
LLM fine-tuning process (recap)
[Diagram: prompts from the training dataset are fed to the pre-trained LLM; the completion (e.g. 'Sentiment: Neutral' for 'Classify this review: I loved this DVD! Sentiment:') is compared with the label ('Sentiment: Positive') using a cross-entropy loss, which updates the model weights to produce an updated LLM]
Parameter-efficient fine-tuning (PEFT)
Full fine-tuning of large LLMs is challenging
Training requires memory not only for the trainable weights but also for optimizer states, gradients, forward activations, and temporary memory, typically 12-20x the memory of the model weights alone.
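A rough back-of-envelope sketch of what that rule of thumb implies, assuming 32-bit (4-byte) parameters and a 1B-parameter model as an example:

```python
# A rough memory estimate for full fine-tuning, using the slide's
# 12-20x rule of thumb. Assumes 32-bit (4-byte) parameters; the 1B
# model size is our example, not from the slides.
params = 1_000_000_000            # a 1B-parameter model
weights_gb = params * 4 / 1e9     # ~4 GB just to store the weights

low, high = 12 * weights_gb, 20 * weights_gb
print(f"weights: ~{weights_gb:.0f} GB")
print(f"full fine-tuning (weights + optimizer states, gradients, "
      f"activations, temp memory): ~{low:.0f}-{high:.0f} GB")
```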
Parameter efficient fine-tuning (PEFT)
[Diagram: with PEFT, most or all of the original LLM weights are frozen (❄); only a small number of trainable layers, added components, or new weights are updated]
[Diagram: full fine-tuning for each task (QA, summarize, generate) produces a separate fine-tuned LLM, each GBs in size]
PEFT fine-tuning saves space and is flexible
[Diagram: a single shared base LLM (GBs) plus small task-specific PEFT weights (MBs each) for QA, summarize, and generate; the appropriate PEFT weights are swapped in for each task]
PEFT Trade-offs
Choosing a PEFT method involves trade-offs, including parameter efficiency, memory efficiency, training speed, model performance, and inference costs.
PEFT method categories:
- Selective: select a subset of initial LLM parameters to fine-tune.
- Reparameterization: reparameterize model weights using a low-rank representation (LoRA).
- Additive: add trainable layers or parameters to the model (soft prompts).
Source: Lialin et al. 2023, “Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning”
PEFT methods covered next:
- LoRA
- Soft prompts (prompt tuning)
Source: Lialin et al. 2023, “Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning”
Low-Rank Adaptation of Large Language Models (LoRA)
Transformers: recap
[Diagram: input text, e.g. 'J'aime l'apprentissage automatique' ('I love machine learning'), is tokenized and embedded, then processed by the encoder and decoder, each containing self-attention and feed-forward network layers, with a softmax over the output vocabulary producing the output]
LoRA: Low-Rank Adaptation of LLMs
[Diagram: the self-attention weights applied to the embedding vectors are frozen (❄). LoRA injects two small rank-decomposition matrices, A and B, alongside them, and only these are trained. The rank r is a small dimension, typically 4, 8, …, 64]
Steps to update the model for inference:
1. Matrix-multiply the low-rank matrices: B × A.
2. Add the result to the original frozen weights: ❄ + B×A.
Concrete example using the base Transformer as reference
The weights applied to the embedding vectors have dimensions 512 × 64 = 32,768 trainable parameters. With LoRA rank r = 8, A has dimensions 8 × 64 = 512 parameters and B has dimensions 512 × 8 = 4,096 parameters: about 86% fewer parameters to train.
LoRA: Low-Rank Adaptation of LLMs
[Diagram: for task A, multiply task A's LoRA matrices and add them to the frozen weights (❄ + B×A for task A); to switch to task B, swap in task B's matrices instead. The base LLM stays unchanged; only the small per-task LoRA matrices are swapped at inference time]
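A minimal sketch of a LoRA-wrapped linear layer in PyTorch, mirroring the 512 × 64 example above; the class name, initialization, and alpha scaling are our assumptions, and real projects typically use a library such as Hugging Face peft:

```python
# A minimal sketch of a LoRA-wrapped linear layer. Only A and B train;
# the base weights stay frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)        # freeze original weights (the ❄ path)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # r x k
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # d x r, starts at 0
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus low-rank update path: W x + (B A) x
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Matches the slides' concrete example: a 512 -> 64 projection.
layer = LoRALinear(nn.Linear(512, 64), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8*512 + 64*8 = 4608, vs 32768 weights in the base layer
```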
Sample ROUGE metrics for full vs. LoRA fine-tuning
Task: FLAN-T5 on dialog summarization.
Base model (flan_t5_base), baseline score:
{'rouge1': 0.2334, 'rouge2': 0.0760, 'rougeL': 0.2014, 'rougeLsum': 0.2015}
Full fine-tune (flan_t5_base_instruct_full), +80.63% rouge1 over baseline:
{'rouge1': 0.4216, 'rouge2': 0.1804, 'rougeL': 0.3384, 'rougeLsum': 0.3384}
LoRA fine-tune (flan_t5_base_instruct_lora), -3.20% rouge1 vs. the full fine-tune:
{'rouge1': 0.4081, 'rouge2': 0.1633, 'rougeL': 0.3251, 'rougeLsum': 0.3249}
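For reference, a sketch of how a LoRA fine-tune like this is typically configured with the Hugging Face peft library; the hyperparameter values are illustrative assumptions, not the exact ones behind these scores:

```python
# A sketch of setting up LoRA fine-tuning of FLAN-T5 with `peft`.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

lora_config = LoraConfig(
    r=8,                        # LoRA rank
    lora_alpha=32,              # scaling factor
    target_modules=["q", "v"],  # apply LoRA to T5's query/value projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # small fraction of total parameters
```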
Choosing the LoRA rank
Soft prompt
[Diagram: trainable soft-prompt vectors are prepended to the embedded input tokens (X1 … X8) of the input, e.g. 'The teacher teaches the student with the book.'. Each soft-prompt vector has the same length as the token embedding vectors, and typically 20-100 virtual tokens are used]
Soft prompts
[Diagram: hard tokens such as 'fire', 'book', 'jumps', 'whale' each correspond to a fixed point in the embedding space, whereas soft-prompt vectors can take on any value in the continuous embedding space and are learned during training]
Full fine-tuning vs. prompt tuning
[Diagram: full fine-tuning updates all model weights for the task; in prompt tuning, the LLM weights are frozen (❄) and only the soft prompt for the task (e.g. task B) is trained]
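A minimal sketch of the prompt-tuning mechanism; the class and parameter names are ours for illustration:

```python
# A minimal sketch of prompt tuning: a small matrix of trainable
# virtual-token embeddings is prepended to the frozen model's input
# embeddings. Only the soft prompt receives gradients.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_virtual_tokens: int = 20, embed_dim: int = 768):
        super().__init__()
        # Typically 20-100 virtual tokens, same dimension as token embeddings.
        self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, embed_dim) * 0.01)

    def forward(self, token_embeds):
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)  # prepend soft prompt

# Usage: freeze the LLM, embed the input tokens, prepend the soft prompt,
# and run the model on `inputs_embeds`; only `soft_prompt.prompt` trains.
soft_prompt = SoftPrompt()
dummy = torch.zeros(2, 8, 768)   # a batch of two 8-token embedded inputs
print(soft_prompt(dummy).shape)  # torch.Size([2, 28, 768])
```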
Performance of prompt tuning
[Plot: performance vs. model size for full fine-tuning, multi-task fine-tuning, and prompt tuning; prompt tuning approaches full fine-tuning performance as models grow larger]
The trained soft-prompt tokens do not correspond to any known tokens (such as 'book' or 'jumps'), but their nearest neighbors form a semantic group with similar meanings, e.g. 'completely', 'totally', 'altogether', 'entirely'.
PEFT methods summary
- LoRA
- Soft prompts (prompt tuning)