
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Transformers
vs RNNs
deeplearning.ai
Outline

● Issues with RNNs

● Comparison with Transformers


Neural Machine Translation

Comment allez-vous

How are you


No parallel computing!
Seq2Seq Architectures

[Figure: an encoder-decoder RNN unrolled over the input sequence]

T sequential steps
Loss of information
Vanishing gradient
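The bottleneck is the recurrence itself: each hidden state depends on the previous one, so the T steps cannot run in parallel and information from early tokens fades. A minimal NumPy sketch of that sequential loop (the weight names Wx, Wh, b and the toy dimensions are made up for illustration; this is not the course's code):

```python
import numpy as np

def rnn_forward(x, h0, Wx, Wh, b):
    """Unroll a vanilla RNN: every hidden state depends on the previous one,
    so the T time steps must run one after another (no parallelism over time)."""
    h = h0
    hidden_states = []
    for x_t in x:                                   # T sequential steps
        h = np.tanh(Wx @ x_t + Wh @ h + b)
        hidden_states.append(h)
    return np.stack(hidden_states)                  # information from early steps tends to fade

# toy usage: 5 time steps, 4-dimensional inputs, 3-dimensional hidden state
T, d_in, d_h = 5, 4, 3
x = np.random.randn(T, d_in)
states = rnn_forward(x, np.zeros(d_h),
                     np.random.randn(d_h, d_in), np.random.randn(d_h, d_h), np.zeros(d_h))
```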
RNNs vs Transformer: Encoder-Decoder

[Figure: an LSTM encoder reads "It's time for tea" into hidden states h1…h4; an attention mechanism combines them into a context vector c that, together with the decoder state s_{i-1}, lets the decoder generate the translation starting from <sos> ("C'est …")]

Transformers don't use RNNs, such as LSTMs or GRUs
Transformers
Overview
deeplearning.ai
The Transformer Model

https://arxiv.org/abs/1706.03762
Scaled Dot-Product Attention

[Figure: queries, keys, and values are the inputs to the scaled dot-product attention block (Vaswani et al., 2017)]
Multi-Head Attention

● Scaled dot-product attention is applied multiple times in parallel

● Linear transformations of the input queries, keys, and values
The Encoder

Provides a contextual representation of each item in the input sequence

Self-Attention

Every item in the input attends to every other item in the sequence
The Decoder

Encoder-Decoder Attention

Every position from the decoder attends to the outputs from the encoder

Masked Self-Attention

Every position attends to previous positions
RNNs vs Transformer: Positional Encoding

[Figure: the input "Je suis content" is mapped to word embeddings, and a positional encoding vector is added to each embedding]
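Since the model contains no recurrence, position information has to be injected explicitly. A minimal sketch of the sinusoidal positional encoding from Vaswani et al. (2017); the function name and toy dimensions are assumptions for illustration, not the course's exact implementation:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sine, odd dimensions use cosine."""
    positions = np.arange(max_len)[:, np.newaxis]                    # (max_len, 1)
    div_terms = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe

# added to the word embeddings, e.g. for "Je suis content" (3 tokens, toy 4-dim embeddings)
embeddings = np.random.randn(3, 4)
inputs = embeddings + positional_encoding(3, 4)
```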


The Transformer

[Figure: the full Transformer with an encoder and a decoder]

Easy to parallelize!
Summary

● In RNNs, parallel computing is difficult to implement

● For long sequences, RNNs suffer from loss of information

● RNNs also have the problem of vanishing gradients

● Transformers help with all of the above


Transformer
Applications
deeplearning.ai
Outline

● Transformers applications in NLP

● Some state-of-the-art Transformer models

● Introduction to T5
Transformer NLP applications

● Translation
● Text summarization
● Chatbots
● Auto-complete

Other NLP tasks

● Named entity recognition (NER)
● Question answering (Q&A)
● Sentiment analysis
● Market intelligence
● Text classification
● Character recognition
● Spell checking
State of the Art Transformers

● GPT-2: Generative Pre-training for Transformer
  Radford, A., et al. (2018), OpenAI

● BERT: Bidirectional Encoder Representations from Transformers
  Devlin, J., et al. (2018), Google AI Language

● T5: Text-to-Text Transfer Transformer
  Raffel, C., et al. (2019), Google
T5: Text-To-Text Transfer Transformer

Translation
"Translate English into French: I am happy" → "Je suis content"

Classification
"Cola sentence: He bought fruits and." → Unacceptable
"Cola sentence: He bought fruits and vegetables." → Acceptable

Q&A
"Question: Which volcano in Tanzania is the highest mountain in Africa?" → Answer: Mount Kilimanjaro
T5: Text-To-Text Transfer Transformer

Regression
"Stsb sentence1: Cats and dogs are mammals. Sentence2: There are four known forces in nature – gravity, electromagnetic, weak and strong." → 0.0
"Stsb sentence1: Cats and dogs are mammals. Sentence2: Cats, dogs, and cows are domesticated." → 2.6

Summarization
"Summarize: State authorities dispatched emergency crews Tuesday to survey the damage after an onslaught of severe weather in Mississippi…" → "Six people hospitalized after a storm in Attala County"
T5: Demo
Summary

● Transformers are suitable for a wide range of NLP applications


● GPT-2, BERT, and T5 are cutting-edge Transformer models

● T5 is a powerful multi-task transformer


Scaled Dot-Product
Attention
deeplearning.ai
Outline

● Revisit scaled dot product attention

● Mathematics behind Attention


Scaled dot-product attention

[Figure: queries, keys, and values flow through the scaled dot-product attention block (Vaswani et al., 2017)]

● Scaling improves performance
● The attention weights add up to 1
● The output is a weighted sum of the values V
● Just two matrix multiplications and a Softmax!
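Concretely, the whole operation is softmax(QKᵀ / √d_k) V. A minimal NumPy sketch (the function name and shapes are illustrative; the course's implementation may differ in details such as batching and masking):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V : two matrix multiplications and a softmax."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                            # (n_q, n_k) scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # each row adds up to 1
    return weights @ v                                         # weighted sum of the values

# toy usage: 2 queries, 3 key/value pairs, d_k = d_v = 4
q = np.random.randn(2, 4)
k = np.random.randn(3, 4)
v = np.random.randn(3, 4)
context = scaled_dot_product_attention(q, k, v)                # shape (2, 4)
```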
Queries, Keys and Values

[Figure: the words of "Je suis heureux" are embedded and stacked to form the query matrix Q; the words of "I am happy" are embedded and stacked to form the key matrix K and the value matrix V]

● The number of columns is the size of the embedding
● K and V have the same number of rows
● Generally, the keys and values are the same
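A toy sketch of that construction, with made-up 4-dimensional embeddings and reusing the attention sketch above; purely illustrative:

```python
import numpy as np

# hypothetical 4-dimensional embeddings, one vector per word (values are made up)
french_embeddings = {"Je": np.random.randn(4), "suis": np.random.randn(4), "heureux": np.random.randn(4)}
english_embeddings = {"I": np.random.randn(4), "am": np.random.randn(4), "happy": np.random.randn(4)}

# stack the per-word embeddings into matrices: one row per word
Q = np.stack([french_embeddings[w] for w in ["Je", "suis", "heureux"]])   # queries
K = np.stack([english_embeddings[w] for w in ["I", "am", "happy"]])       # keys
V = K.copy()                                   # generally the values are the same as the keys

context = scaled_dot_product_attention(Q, K, V)   # one context vector per query word
```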
Attention Math

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

● The result is one context vector for each query: the number of rows is the number of queries, and the number of columns is the size of the value vector
● Entry (2, 3) of the softmax weight matrix is the weight assigned to the third key for the second query
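As a shape check (the symbols n_q, n_k, d_k, d_v are introduced here for illustration and are not the slide's own notation):

```latex
% Q has one row per query, K and V one row per key/value pair
Q \in \mathbb{R}^{n_q \times d_k},\qquad K \in \mathbb{R}^{n_k \times d_k},\qquad V \in \mathbb{R}^{n_k \times d_v}
% the weight matrix W has one row per query and one column per key
W = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) \in \mathbb{R}^{n_q \times n_k},\qquad
\mathrm{Attention}(Q,K,V) = WV \in \mathbb{R}^{n_q \times d_v}
% e.g. w_{2,3} is the weight assigned to the third key for the second query
```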
Summary

● Scaled dot-product attention is essential for the Transformer

● The inputs to attention are queries, keys, and values

● Because it is just matrix multiplications and a softmax, attention runs efficiently on GPUs and TPUs


Masked Self-
Attention
deeplearning.ai
Outline

● Ways of Attention

● Overview of masked Self-Attention


Encoder-Decoder Attention

Queries come from one sentence; keys and values come from another

[Figure: weight matrix with the French words "c'est", "l'heure", "du", "thé" as rows (queries) and the English words "it's", "time", "for", "tea" as columns (keys)]
Self-Attention

Queries, keys, and values come from the same sentence

[Figure: weight matrix with "it's", "time", "for", "tea" as both rows and columns]

Gives the meaning of each word within the sentence
Masked Self-Attention

Queries, keys, and values come from the same sentence. Queries don't attend to future positions.

[Figure: weight matrix with "it's", "time", "for", "tea" as rows and columns; the entries above the diagonal (future positions) are blocked out]
Masked self-attention math

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V

The mask M has zeros on and below the diagonal and minus infinity above it:

M = \begin{pmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{pmatrix}

After the softmax, the weights assigned to future positions are equal to 0.
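A minimal sketch of how such a mask could be built and applied, extending the earlier attention sketch (illustrative only, not the course's code):

```python
import numpy as np

def masked_self_attention(x_embeddings):
    """Self-attention where each position only attends to itself and earlier positions."""
    q = k = v = x_embeddings                        # queries, keys, values: same sentence
    n, d_k = q.shape
    mask = np.triu(np.full((n, n), -np.inf), 1)     # 0 on/below the diagonal, -inf above it
    scores = q @ k.T / np.sqrt(d_k) + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                              # future positions end up with weight 0

tokens = np.random.randn(4, 8)                      # e.g. "it's time for tea", toy 8-dim embeddings
context = masked_self_attention(tokens)
```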
Summary

● There are three main ways of attention: encoder-decoder attention, self-attention, and masked self-attention

● In self-attention, queries, keys, and values come from the same sentence

● In masked self-attention, queries cannot attend to future positions


Multi-head
Attention
deeplearning.ai
Outline

● Intuition behind Multi-Head Attention

● Math of Multi-Head Attention


Multi-Head Attention - Overview

[Figure: the original embeddings of "c'est", "l'heure", "du", "thé" (queries) and "it's", "time", "for", "tea" (keys and values) are linearly transformed into different representation spaces; in head 1 and head 2 the words cluster differently, so each head can capture different relationships]
Multi-Head Attention - Overview

[Figure: the queries, keys, and values each pass through their own linear layer in every head; scaled dot-product attention runs in each head in parallel; the heads are concatenated and sent through a final linear layer]

The linear layers hold the learnable parameters


Multi-Head Attention

\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V)

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O

Each head produces context vectors for each query; the heads are concatenated and projected with W^O.

Usual choice of dimensions: d_k = d_v = d_{model} / h, where d_{model} is the embedding size.
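A compact sketch of that computation, assuming d_model is divisible by the number of heads and reusing the scaled_dot_product_attention sketch above; the weight matrices are random stand-ins, and real implementations batch this differently:

```python
import numpy as np

def multi_head_attention(q, k, v, n_heads, wq, wk, wv, wo):
    """Project Q/K/V per head, run scaled dot-product attention in each head,
    concatenate the heads, and apply a final linear layer."""
    d_model = q.shape[-1]
    d_head = d_model // n_heads                      # usual choice: d_k = d_v = d_model / h
    heads = []
    for i in range(n_heads):
        sl = slice(i * d_head, (i + 1) * d_head)     # this head's slice of the projections
        heads.append(scaled_dot_product_attention(q @ wq[:, sl], k @ wk[:, sl], v @ wv[:, sl]))
    return np.concatenate(heads, axis=-1) @ wo       # concatenate, then final linear layer

# toy usage: 4 tokens, d_model = 8, 2 heads
x = np.random.randn(4, 8)
wq, wk, wv, wo = (np.random.randn(8, 8) for _ in range(4))
out = multi_head_attention(x, x, x, n_heads=2, wq=wq, wk=wk, wv=wv, wo=wo)   # shape (4, 8)
```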
Summary

● Multi-head attention attends to information from different representations

● The heads can be computed in parallel

● The computational cost is similar to single-head attention


Transformer
decoder
deeplearning.ai
Outline

● Overview of Transformer decoder

● Implementation (decoder and feed-forward block)


Transformer decoder
Overview

[Figure: decoder stack with input embedding, positional encoding, multi-head attention, Add & Norm, feed-forward, Add & Norm, repeated N times, then a linear layer and softmax producing the output probabilities]

● Input: a sentence or paragraph
  ○ we predict the next word
● The sentence gets embedded, and positional encodings are added
  ○ vectors representing each position
● Multi-head attention looks at previous words
● Feed-forward layer with ReLU
  ○ that's where most of the parameters are!
● Residual connections with layer normalization
● Repeat N times
● Dense layer and softmax for the output (sketched in code below)
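A high-level sketch of that flow, reusing the positional_encoding and masked_self_attention sketches above; the weight arguments are hypothetical, and the block shows the order of operations rather than the course's actual implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def decoder_block(x, w1, b1, w2, b2):
    """One decoder block: masked self-attention and a feed-forward layer,
    each wrapped in a residual connection followed by layer normalization."""
    x = layer_norm(x + masked_self_attention(x))          # Add & Norm around attention
    ff = np.maximum(0, x @ w1 + b1) @ w2 + b2             # feed-forward with ReLU
    return layer_norm(x + ff)                             # Add & Norm around feed-forward

def transformer_decoder(token_embeddings, blocks, w_out):
    """Embed + positional encoding, N decoder blocks, dense layer, softmax."""
    n, d_model = token_embeddings.shape
    x = token_embeddings + positional_encoding(n, d_model)
    for w1, b1, w2, b2 in blocks:                          # repeat N times
        x = decoder_block(x, w1, b1, w2, b2)
    logits = x @ w_out                                     # dense layer over the vocabulary
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return probs / probs.sum(axis=-1, keepdims=True)       # output probabilities
```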
Transformer decoder
Explanation

[Figure: the input "<start> I am happy" passes through the input embedding and positional encoding, then a decoder block (multi-head attention, Add & Norm, feed-forward, Add & Norm), then a linear layer and softmax that produce the output probabilities]
The Transformer decoder
Decoder Block

[Figure: inside a decoder block, the positional input embedding goes through multi-head attention, then Add & Norm, i.e. LayerNorm(input + attention output), then a feed-forward layer applied to every position, then another Add & Norm, producing the output vector]
The Transformer decoder
Feed forward layer

[Figure: after the self-attention layer, a feed-forward network with ReLU activations is applied independently to each position]
Summary

● The Transformer decoder mainly consists of three layers

● The decoder and feed-forward blocks are the core of this model's code

● It also includes a module to calculate the cross-entropy loss


Transformer
summarizer
deeplearning.ai
Outline

● Overview of Transformer summarizer

● Technical details for data processing

● Inference with a Language Model


Transformer for summarization

[Figure: the same Transformer decoder; the input is an article and the output is its summary]
Technical details for data processing

Model input:
ARTICLE TEXT <EOS> SUMMARY <EOS> <pad> …

Tokenized version:
[2,3,5,2,1,3,4,7,8,2,5,1,2,3,6,2,1,0,0]

Loss weights: 0s until the first <EOS>, then 1s from the start of the summary.
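A small sketch of how those loss weights could be built from a tokenized example; the EOS id of 1 and pad id of 0 are made-up assumptions for illustration, not the course's actual vocabulary:

```python
import numpy as np

def summary_loss_weights(token_ids, eos_id=1, pad_id=0):
    """0 for every article token up to and including the first <EOS>,
    1 for the summary tokens that follow, 0 again for the padding."""
    token_ids = np.asarray(token_ids)
    first_eos = int(np.argmax(token_ids == eos_id))      # index of the first <EOS>
    weights = np.zeros(token_ids.shape, dtype=float)
    weights[first_eos + 1:] = 1.0                        # summary starts after the article's <EOS>
    weights[token_ids == pad_id] = 0.0                   # never score the padding
    return weights

example = [2, 3, 5, 2, 1, 3, 4, 7, 8, 2, 5, 1, 2, 3, 6, 2, 1, 0, 0]
print(summary_loss_weights(example))
```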
Cost function

Cross entropy loss:

J = -\sum_{i} \sum_{j} y_j^{(i)} \log \hat{y}_j^{(i)}

j: over the summary
i: batch elements
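A sketch of that cross-entropy combined with the loss weights from the previous slide; the argument names and shapes are assumptions for illustration:

```python
import numpy as np

def weighted_cross_entropy(probs, targets, weights):
    """Cross-entropy over the predicted next-word distributions, counting only the
    positions whose loss weight is 1 (i.e. the summary part of each example)."""
    # probs: (batch, seq_len, vocab) model outputs; targets: (batch, seq_len) token ids
    batch, seq_len, _ = probs.shape
    target_probs = probs[np.arange(batch)[:, None], np.arange(seq_len), targets]
    log_likelihood = np.log(target_probs + 1e-9) * weights    # zero out article and padding
    return -log_likelihood.sum() / weights.sum()              # averaged over the counted tokens
```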
Inference with a Language Model

Model input:
[Article] <EOS> [Summary] <EOS>

Inference:

● Provide: [Article] <EOS>

● Generate the summary word by word
  ○ until the final <EOS>

● Pick the next word by random sampling (see the sketch below)
  ○ each time you get a different summary!
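A minimal sketch of that decoding loop, assuming a hypothetical `model` callable that returns next-token probability distributions and a made-up `eos_id`; illustrative only:

```python
import numpy as np

def generate_summary(model, article_ids, eos_id, max_len=100):
    """Feed [Article] <EOS>, then repeatedly sample the next token from the model's
    output distribution until it produces the final <EOS>."""
    tokens = list(article_ids) + [eos_id]
    summary = []
    for _ in range(max_len):
        probs = model(np.array(tokens))[-1]               # next-word distribution at the last position
        next_id = np.random.choice(len(probs), p=probs)   # random sampling: each run can give
        if next_id == eos_id:                             # a different summary
            break
        tokens.append(next_id)
        summary.append(next_id)
    return summary
```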
Summary

● For summarization, a weighted loss function is optimized

● The Transformer decoder summarizes by predicting the next word, one word at a time

● The Transformer uses tokenized versions of the input
