
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Week 3 Overview
deeplearning.ai

Week 3
● Question Answering: BERT
● Transfer learning: T5

Question Answering
● Context-based
● Closed book
Not just the model
[Diagram: transfer learning carries over not just the model but also the data and the training. "Transfer Learning!"]

Classical training
[Diagram: Training on course reviews → Model; Inference on a course review → Model]

Transfer learning
[Diagram: Pre-training on movie reviews → Model; Training on the "downstream" task (course reviews) → Model; Inference on a course review → Model]
Transfer Learning: Different Tasks
[Diagram: Pre-training on sentiment classification ("Watching the movie is like ..."); the pre-trained model cannot answer "When's my birthday?" ("Umm…"). After training on the downstream task, question answering, it answers "When is Pi Day?" with "March 14!"]
BERT: Bi-directional Context

Uni-directional:
Learning from deeplearning.ai is like watching the sunset with my best friend!
(context comes only from the words to the left)

Bi-directional:
Learning from deeplearning.ai is like watching the sunset with my best friend!
(context comes from the words on both sides)
T5: Single task vs. Multi-task
[Diagram: in the single-task setup, "Studying with deeplearning.ai was ..." goes to separate models (Model 1, Model 2); in the multi-task setup, every task goes through one shared model]
T5: more data, better performance

English Wikipedia: ~13 GB
C4 (Colossal Clean Crawled Corpus): ~800 GB
Transfer Learning in NLP
deeplearning.ai
Desirable Goals

● Reduce training time
● Improve predictions
● Work well with small datasets

→ Transfer Learning!
Transfer learning options

1. How to transfer: feature-based vs. fine-tuning
2. Pre-train data: labeled vs. unlabeled
3. Pre-training task: language modeling (masked words, next sentence)
Transfer option 1
General purpose learning

"I am ___ because I am learning" → CBOW → "happy"
The resulting word embeddings serve as input "features" for another task, e.g. translation.
Transfer option 1
Feature-based vs. Fine-Tuning

Feature-based: pre-train a model, take its features (e.g. embeddings), and train a new model on them for the downstream prediction.
Fine-tuning: pre-train a model, then fine-tune the same model on the downstream task.
Transfer option 1
Fine-tune: adding a layer
[Diagram: a model pre-trained on movies gets an additional layer and is fine-tuned on course reviews]
Pre-train data (option 2)
Data and performance
[Diagram: a small dataset → Model vs. a much larger dataset → Model; more pre-training data generally gives better performance]
Pre-train data (option 2)
Labeled vs. Unlabeled Data

Labeled text data vs. unlabeled text data
Pre-train data (option 2)
Transfer learning with unlabeled data

Pre-training: unlabeled text → Model (no labels!). Which tasks work with unlabeled data?
Downstream task (labeled data): "What day is Pi day?" → Model → "March 14"
Pre-training task (option 3)
Self-supervised task

Unlabeled data → create inputs (features) and targets (labels) from the text itself
Pre-training task (option 3)
Self-supervised tasks

Unlabeled data: "Learning from deeplearning.ai is like watching the sunset with my best friend."
Input: "Learning from deeplearning.ai is like watching the sunset with my best _______"
Target: "friend"

Model prediction → loss → update

Language modeling
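For intuition, here is a minimal sketch (plain Python; the helper name is illustrative) of how unlabeled text can be turned into self-supervised (input, target) pairs for language modeling:

def make_lm_pairs(sentence):
    """Turn one unlabeled sentence into (input, target) pairs:
    the input is the words seen so far, the target is the next word."""
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        context = " ".join(words[:i])   # words seen so far (features)
        target = words[i]               # next word (label)
        pairs.append((context, target))
    return pairs

text = "Learning from deeplearning.ai is like watching the sunset with my best friend."
for context, target in make_lm_pairs(text)[-2:]:
    print(f"input: {context!r} -> target: {target!r}")

No human labels are needed: the targets come from the text itself.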
Fine-tune a model for each downstream task

Pre-training → Model
Training on downstream task → a separate fine-tuned model for translation, summarization, and Q&A

Summary

1. How to transfer: feature-based vs. fine-tuning
2. Pre-train data: labeled vs. unlabeled
3. Pre-training task: language modeling (masked words, next sentence)
ELMo, GPT, BERT, T5
deeplearning.ai
Outline

CBOW ELMo GPT BERT T5


Context
… right ...

… they were on the right ...

… they were on the right side of the street


Continuous Bag of Words

… they were on the right side of the street

A fixed window of words on each side of the center word:
"on", "the", "side", "of" → fully-connected (feed-forward) neural network → "right"
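As a quick sketch (plain Python; the function name is illustrative), collecting the fixed-window context that CBOW would feed into the feed-forward network looks like this:

def cbow_context(words, center_idx, window=2):
    """Collect the fixed-window context around a center word, CBOW-style."""
    left = words[max(0, center_idx - window):center_idx]
    right = words[center_idx + 1:center_idx + 1 + window]
    return left + right, words[center_idx]

sentence = "they were on the right side of the street".split()
context, center = cbow_context(sentence, center_idx=4, window=2)
print(context, "->", center)   # ['on', 'the', 'side', 'of'] -> 'right'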
Need more context?

… they were on the right side of the street.
… they were on the right side of history.

A fixed window around "right" looks the same in both sentences.
Use all context words:
The legislators believed that they were on the right side of history, so they changed the law.
ELMo: Full context using RNNs

The legislators believed that they were on the _____ side of history, so they changed the law.

A bi-directional LSTM (one pass left-to-right, one right-to-left) produces the word embedding for "right".
OpenAI GPT

ELMo (RNN) → Transformer (encoder + decoder) → GPT (decoder only)

The legislators believed that they were on the _____

Uni-directional. Why not bi-directional?
Transformer
Attention

… on the right side ...

With bi-directional attention, each word can peek at itself!
GPT: Uni-directional

Transformer attention (bi-directional): "… on the right side ...", each word can peek at itself!
GPT attention (uni-directional): "… on the right", no peeking!
BERT

Transformer (encoder + decoder) → GPT (decoder) → BERT (encoder)

The legislators believed that they were on the _____ side of history, so they changed the law.

Bi-directional
Transformer + Bi-directional Context

… on the ___ side ___ history ... → Model → "right", "of"

Multi-Mask Language Modeling
BERT: Words to Sentences

Sentence "A": The legislators believed that they were on the right side of history.
Sentence "B": So they changed the law.  (or: Then the bunny ate the carrot.)

Does "B" follow "A"? → Next Sentence Prediction

BERT Pre-training Tasks

Multi-Mask Language Modeling:
… on the ___ side ___ history ... → Model → "right", "of"

Next Sentence Prediction:
Sentence "A" ? Sentence "B"
T5: Encoder vs. Encoder-Decoder

Transformer (encoder + decoder) → GPT (decoder) → BERT (encoder) → T5 (encoder-decoder)
T5: Multi-task

"Studying with deeplearning.ai was ..." → Model

How?
T5: Text-to-Text

"Classify: Learning from deeplearning.ai is like..." → "5 stars"   (task type: classify)
"Summarize: It was the best of times…" → "It was alright"          (task type: summarize)
"Question: When is Pi day?" → "March 14"                            (task type: question)
Summary
More details next!

CBOW:  FFNN, context window
ELMo:  RNN, full sentence, bi-directional context
GPT:   Transformer decoder, uni-directional context
BERT:  Transformer encoder, bi-directional context, multi-mask LM, next sentence prediction
T5:    Transformer encoder-decoder, bi-directional context, multi-task
Bidirectional Encoder Representations from Transformers (BERT)
deeplearning.ai
Outline

● Learn about the BERT architecture

● Understand how BERT pre-training works


BERT

● Makes use of transfer learning/pre-training
BERT

● A multi-layer bidirectional transformer
● Positional embeddings
● BERT_base:
  12 layers (12 transformer blocks)
  12 attention heads
  110 million parameters
BERT pre-training

After school Lukasz does his ____ in the library.

● Masked language modeling (MLM)


BERT pre-training

After school Lukasz does his homework in the library.

After school ____ ____ his homework in the ____.

Summary

● Choose 15% of the tokens at random: mask them 80% of the time, replace them with a random token 10% of the time, or keep them as is 10% of the time (a minimal sketch follows below).

● There can be multiple masked spans in a sentence.

● Next sentence prediction is also used when pre-training.
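The 80/10/10 masking rule can be sketched in a few lines of plain Python (the "[MASK]" string and word-level tokens are illustrative; real implementations work on subword token IDs):

import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """BERT-style masking: pick ~15% of tokens; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            labels[i] = tok                       # model must predict the original token
            roll = random.random()
            if roll < 0.8:
                inputs[i] = "[MASK]"              # 80%: replace with the mask token
            elif roll < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: replace with a random token
            # else: 10% keep the original token unchanged
    return inputs, labels

tokens = "after school lukasz does his homework in the library".split()
print(mask_tokens(tokens, vocab=tokens))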


BERT Objective
deeplearning.ai
Outline

● Understand how BERT inputs are fed into the model

● Visualize the output

● Learn about the BERT objective


Formalizing the input

Input:               [CLS] my dog is cute [SEP] he likes play ##ing [SEP]
Token embeddings:    E_[CLS] E_my E_dog E_is E_cute E_[SEP] E_he E_likes E_play E_##ing E_[SEP]
Segment embeddings:  E_A  E_A  E_A  E_A  E_A  E_A  E_B  E_B  E_B  E_B  E_B
Position embeddings: E_0  E_1  E_2  E_3  E_4  E_5  E_6  E_7  E_8  E_9  E_10

Each token's input representation is the sum of its token, segment, and position embeddings.
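A toy numpy sketch of that sum (sizes and all weights are illustrative; the token IDs shown are just examples):

import numpy as np

d_model, vocab_size, max_len = 8, 30522, 512
rng = np.random.default_rng(0)

token_emb = rng.normal(size=(vocab_size, d_model))
segment_emb = rng.normal(size=(2, d_model))      # segment A = 0, segment B = 1
position_emb = rng.normal(size=(max_len, d_model))

token_ids = np.array([101, 2026, 3899, 2003, 10140, 102])  # e.g. "[CLS] my dog is cute [SEP]"
segment_ids = np.array([0, 0, 0, 0, 0, 0])                  # all from sentence A
positions = np.arange(len(token_ids))

# Each row is the sum of token + segment + position embeddings for one input token.
inputs = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(inputs.shape)   # (6, 8): one d_model-dimensional vector per token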
Visualizing the output

● [CLS]: a special classification symbol added in front of every input
● [SEP]: a special separator token

[Diagram: masked sentence A and masked sentence B (an unlabeled sentence pair) are embedded as E_[CLS], E_1 … E_N, E_[SEP], E_1' … E_M' and fed into BERT, producing outputs C, T_1 … T_N, T_[SEP], T_1' … T_M'. C feeds the NSP head; the T outputs feed the masked LM head.]
BERT Objective

Objective 1: Multi-Mask LM. Loss: cross entropy over the vocabulary (size V).
Objective 2: Next Sentence Prediction. Loss: binary loss (2 classes).
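Conceptually, pre-training optimizes the sum of the two objectives. A toy numpy sketch of that combination (random logits, illustrative shapes and token IDs):

import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy of one softmax prediction against an integer target."""
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]

rng = np.random.default_rng(0)
V = 30522                                 # vocabulary size

# Objective 1: multi-mask LM, one V-way prediction per masked position.
mlm_logits = rng.normal(size=(2, V))      # two masked positions in this example
mlm_targets = np.array([2157, 1997])      # illustrative ids for the masked words
mlm_loss = np.mean([cross_entropy(l, t) for l, t in zip(mlm_logits, mlm_targets)])

# Objective 2: next sentence prediction, a 2-way prediction from the [CLS] output.
nsp_logits = rng.normal(size=2)
nsp_target = 1                            # 1 = sentence B follows sentence A
nsp_loss = cross_entropy(nsp_logits, nsp_target)

total_loss = mlm_loss + nsp_loss          # pre-training minimizes the sum
print(total_loss)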
Summary

● BERT objective

● Model inputs/outputs
Fine-tuning BERT
deeplearning.ai
Fine-tuning BERT: Outline

Pre-train BERT on (Sentence A, Sentence B) pairs, then fine-tune on downstream tasks:
● MNLI: hypothesis and premise
● NER: sentence → tags
● SQuAD: question → answer
Inputs (Summary)

Sentence A | Sentence B
Text | Ø
Sentence | Paraphrase
Question | Passage
Article | Summary
Hypothesis | Premise
Sentence | Entities

Transformer: T5
deeplearning.ai
Outline

● Understand how T5 works

● Recognize the different types of attention used

● Overview of model architecture


Transformer - T5 Model

Text-to-Text: machine translation, classification, summarization, question answering (Q&A), sentiment

© "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", Raffel et al., 2020
Transformer - T5 Model

Original text:
Thank you for inviting me to your party last week.

Inputs:
Thank you <X> me to your party <Y> week.

Targets:
<X> for inviting <Y> last <Z>

© "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", Raffel et al., 2020
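A minimal sketch (plain Python; the span choices are fixed by hand here for illustration) of how such input/target pairs are built with sentinel tokens:

def span_corrupt(words, spans):
    """Replace each chosen span with a sentinel (<X>, <Y>, ...) in the input
    and emit the dropped words, prefixed by their sentinel, as the target."""
    sentinels = ["<X>", "<Y>", "<Z>"]
    inputs, targets, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs += words[cursor:start] + [sentinel]
        targets += [sentinel] + words[start:end]
        cursor = end
    inputs += words[cursor:]
    targets += [sentinels[len(spans)]]          # closing sentinel
    return " ".join(inputs), " ".join(targets)

words = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(words, spans=[(2, 4), (8, 9)])   # drop "for inviting" and "last"
print(inp)   # Thank you <X> me to your party <Y> week .
print(tgt)   # <X> for inviting <Y> last <Z>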
Model Architecture

[Diagram: three architectures compared on inputs x1…x4 and outputs y1, y2: an encoder-decoder, a language model (decoder only, causal attention), and a prefix LM (fully-visible attention over the prefix, causal attention over the rest)]

© "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", Raffel et al., 2020
Model Architecture

● Encoder-decoder
● 12 transformer blocks each
● 220 million parameters

© "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", Raffel et al., 2020
Summary

● Prefix LM attention

● Model architecture

● Pre-training T5 (MLM)
Multi-task Training Strategy
deeplearning.ai
Multi-task training strategy

One T5 model handles all of these:

"Translate English to German: That is good." → "Das ist gut"
"cola sentence: The course is jumping well." → "not acceptable"
"stsb sentence1: The rhino grazed on the grass. Sentence2: A rhino is grazing in a field." → "3.8"
"Summarize: state authorities dispatched emergency crews tuesday to survey the damage after an onslaught of severe weather in mississippi…" → "six people hospitalized after a storm in attala county"

© "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", Raffel et al., 2020
Input and Output Format

● Machine translation:
  • translate English to German: That is good.
● Predict entailment, contradiction, or neutral:
  • mnli premise: I hate pigeons hypothesis: My feelings towards pigeons are filled with animosity. target: entailment
● Winograd schema:
  • The city councilmen refused the demonstrators a permit because *they* feared violence

© "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", Raffel et al., 2020
Multi-task Training Strategy

Fine-tuning method     GLUE   CNNDM  SQuAD  SGLUE  EnDe   EnFr   EnRo
All parameters         83.28  19.24  80.88  71.36  26.98  39.82  27.65
Adapter layers, …      80.52  15.08  79.32  60.40  13.84  17.88  15.54
Adapter layers, …      81.51  16.62  79.47  63.03  19.83  27.50  22.63
Adapter layers, …      81.54  17.78  79.18  64.30  23.45  33.98  25.81
Adapter layers, …      81.51  16.62  79.47  63.03  19.83  27.50  22.63
Gradual unfreezing     82.50  18.95  79.17  70.79  26.71  39.02  26.93

How much data from each task to train on?

© "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", Raffel et al., 2020
Data Training Strategies

● Examples-proportional mixing: sample from each dataset in proportion to its size
● Equal mixing: sample from each dataset at the same rate
● Temperature-scaled mixing: interpolate between the two with a temperature parameter

[Diagram: Data 1 / Sample 1 and Data 2 / Sample 2 under each mixing strategy]
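As a rough sketch (plain Python; the dataset-size limit and temperature values are illustrative, following the idea described in the T5 paper), the mixing rate for each task can be computed like this:

def mixing_rates(dataset_sizes, temperature=1.0, limit=2**21):
    """Temperature-scaled mixing: rates proportional to min(size, limit)**(1/T).
    T = 1 gives examples-proportional mixing; large T approaches equal mixing."""
    scaled = [min(n, limit) ** (1.0 / temperature) for n in dataset_sizes]
    total = sum(scaled)
    return [s / total for s in scaled]

sizes = [1_000_000, 10_000]                 # a large task and a small task
print(mixing_rates(sizes, temperature=1))   # ~[0.99, 0.01]  (proportional)
print(mixing_rates(sizes, temperature=8))   # much closer to equal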
Gradual unfreezing vs. Adapter layers

● Gradual unfreezing: fine-tune progressively more layers, unfreezing from the last layer backwards
● Adapter layers: add small extra dense blocks to the model and train only those on the downstream task

© "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", Raffel et al., 2020
Fine-tuning

Pre-training: one model trained on translation, summarization, and MLM.

Fine-tune on a specific task:
Model → Q&A (2^18 steps)
GLUE Benchmark
deeplearning.ai
General Language Understanding Evaluation (GLUE)

● A collection used to train, evaluate, and analyze natural language understanding systems
● Datasets with different genres, and of different sizes and difficulties
● Leaderboard
Tasks Evaluated on

● Is the sentence grammatical or not?
● Sentiment
● Paraphrase
● Similarity
● Question duplicates
● Answerable
● Contradiction
● Entailment
● Winograd (co-reference)
General Language Understanding Evaluation

● Drive research

● Model agnostic

● Makes use of transfer learning


Question Answering
deeplearning.ai
Transformer encoder

[Diagram: Inputs → Input Embedding → Positional Encoding → Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm]

Feedforward:
[
  LayerNorm,
  dense,
  activation,
  dropout_middle,
  dense,
  dropout_final,
]
Transformer encoder

Encoder block:
[
  Residual(
    LayerNorm,
    attention,
    dropout,
  ),
  Residual(
    feed_forward,
  ),
]
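The pseudocode above mirrors the combinator style used in the course labs. A hedged sketch of how such an encoder block might be assembled with Trax (layer names follow the Trax library; hyperparameter values are illustrative):

import trax.layers as tl

def feed_forward(d_model=512, d_ff=2048, dropout=0.1, mode='train'):
    # LayerNorm -> dense -> activation -> dropout -> dense -> dropout
    return [
        tl.LayerNorm(),
        tl.Dense(d_ff),
        tl.Relu(),
        tl.Dropout(rate=dropout, mode=mode),
        tl.Dense(d_model),
        tl.Dropout(rate=dropout, mode=mode),
    ]

def encoder_block(d_model=512, d_ff=2048, n_heads=8, dropout=0.1, mode='train'):
    # Two residual sub-blocks: (LayerNorm -> attention -> dropout) and feed-forward.
    return [
        tl.Residual(
            tl.LayerNorm(),
            tl.Attention(d_model, n_heads=n_heads, dropout=dropout, mode=mode),
            tl.Dropout(rate=dropout, mode=mode),
        ),
        tl.Residual(
            feed_forward(d_model, d_ff, dropout, mode),
        ),
    ]

print(tl.Serial(encoder_block()))   # inspect the assembled block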
Data examples
Question: What percentage of the French population today is non - European ?

Context: Since the end of the Second World War , France has become an ethnically diverse country . Today ,
approximately five percent of the French population is non - European and non - white . This does not
approach the number of non - white citizens in the United States ( roughly 28 – 37 % , depending on how Latinos are
classified ; see Demographics of the United States ) . Nevertheless , it amounts to at least three million people , and has
forced the issues of ethnic diversity onto the French policy agenda . France has developed an approach to dealing with
ethnic problems that stands in contrast to that of many advanced , industrialized countries . Unlike the United States ,
Britain , or even the Netherlands , France maintains a " color - blind " model of public policy . This means that it targets
virtually no policies directly at racial or ethnic groups . Instead , it uses geographic or class criteria to address issues of social
inequalities . It has , however , developed an extensive anti - racist policy repertoire since the early 1970s . Until recently ,
French policies focused primarily on issues of hate speech — going much further than their American counterparts — and
relatively less on issues of discrimination in jobs , housing , and in provision of goods and services .

Target: Approximately five percent


Implementing Q&A with T5

● Load a pre-trained model
● Process the data to get the required inputs and outputs: "question: Q context: C" as input and "A" as target (a sketch follows below)
● Fine-tune your model on the new task and input
● Predict using your own model

[Diagram: the same multi-task T5 examples as before (translation, cola, stsb, summarization)]
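A minimal sketch (plain Python; the helper name is illustrative, the example fields follow the SQuAD-style data shown earlier) of turning a question/context pair into the T5 text-to-text format:

def to_t5_qa_example(question, context, answer=None):
    """Format a SQuAD-style record as a T5 text-to-text pair."""
    inputs = f"question: {question} context: {context}"
    targets = answer if answer is not None else ""
    return inputs, targets

inputs, targets = to_t5_qa_example(
    question="What percentage of the French population today is non-European?",
    context="Today, approximately five percent of the French population "
            "is non-European and non-white.",
    answer="Approximately five percent",
)
print(inputs[:80], "...")
print("target:", targets)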
Hugging Face: Introduction
deeplearning.ai
Outline
● What is Hugging Face?

● How you can use the Hugging Face ecosystem


Hugging Face

Use it with the Transformers library.

Use it for:
● Applying state-of-the-art transformer models
● Fine-tuning pretrained transformer models
Hugging Face: Using Transformers

Pipelines:
1. Pre-processing your inputs
2. Running the model
3. Post-processing the outputs

Example: a Q&A pipeline takes a context and questions and returns answers.
Hugging Face: Fine-Tuning Transformers

● Datasets: about one thousand
● Model checkpoints: more than 14 thousand
● Tokenizer (also converts model outputs back into human-readable text)
● Trainer and evaluation metrics

Checkpoint: a set of learned parameters for a model, obtained with a training procedure for some task.
Hugging Face: Using Transformers
deeplearning.ai
Using Transformers

Pipelines:
1. Pre-processing your inputs
2. Running the model
3. Post-processing the outputs

Example: a Q&A pipeline takes a context and questions and returns answers.
Pipelines

Initialization: a task and a model checkpoint
Use: the inputs for the task

● Sentiment Analysis: a sequence
● Question Answering: context and questions
● Fill-Mask: a sentence with a masked position
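For example, a question-answering pipeline can be used roughly like this (a sketch; the checkpoint name is just one of many available on the Model Hub):

from transformers import pipeline

# Build a Q&A pipeline from a task name and (optionally) a specific checkpoint.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What percentage of the French population today is non-European?",
    context="Today, approximately five percent of the French population "
            "is non-European and non-white.",
)
print(result["answer"], result["score"])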
Checkpoints

● Huge number of model checkpoints that you can use in your pipelines.
● But beware: not every checkpoint is suitable for your task.
Model Hub

● A hub containing models that you can use in your pipelines according to the task you need: https://huggingface.co/models
● The Model Card shows a description of your selected model and useful information such as code snippet examples.
Hugging Face: Fine-Tuning Transformers
deeplearning.ai
Fine-Tuning Tools

● Datasets: about one thousand
● Model checkpoints: more than 14 thousand
● Tokenizer (also turns model outputs back into human-readable text)
● Trainer and evaluation metrics


Model Checkpoints

More than 15 thousand (and increasing). Upload the architecture and weights with 1 line of code!

Model      | Dataset                                      | Name
DistilBERT | Stanford Question Answering Dataset (SQuAD)  | distilbert-base-cased-distilled-squad
BERT       | Wikipedia and Book Corpus                    | bert-base-cased
...        | ...                                          | ...


Datasets

● About one thousand datasets
● Load them using just one function
● Optimized to work with massive amounts of data!
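For instance, loading a dataset takes a single call (a sketch; "squad" is one of the available dataset names):

from datasets import load_dataset

# Download and cache the SQuAD dataset with one function call.
squad = load_dataset("squad")

print(squad)                          # DatasetDict with 'train' and 'validation' splits
print(squad["train"][0]["question"])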


Tokenizers

"What well-known superheroes were introduced between 1939 and 1941 by Detective Comics?"
→ Tokenizer →
[101, 1327, 1218, 118, 1227, 18365, 1279, 1127, 2234, 1206, 3061, 1105, 3018, 1118, 9187, 7452, 136, 102]

Depending on the use case, you might need to run additional steps.
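A sketch of that step with the Transformers library (the checkpoint name is illustrative; any compatible checkpoint works):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

question = ("What well-known superheroes were introduced "
            "between 1939 and 1941 by Detective Comics?")

encoded = tokenizer(question)
print(encoded["input_ids"])                                   # token ids, with [CLS]/[SEP] added
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # back to readable tokens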
Trainer and Evaluation Metrics

The Trainer object lets you define the training procedure:
● Number of epochs
● Warm-up steps
● Weight decay
● ...

Train using one line of code!

Pre-defined evaluation metrics, like BLEU and ROUGE.
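A hedged sketch of that workflow (model, dataset, and hyperparameter values are illustrative placeholders, not the course's exact setup):

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A small labeled dataset, tokenized so the Trainer can consume it.
dataset = load_dataset("glue", "sst2")
def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=64)
dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=3,       # number of epochs
    warmup_steps=500,         # warm-up steps
    weight_decay=0.01,        # weight decay
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"].select(range(1000)),  # small subset for the sketch
    eval_dataset=dataset["validation"],
)

trainer.train()   # train using one line of code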
