DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Problems with RNNs
● Loss of information: encoding a length-T sequence takes T sequential steps
● Vanishing gradient over long sequences
RNNs vs Transformer: Encoder-Decoder
[Figure: an RNN encoder-decoder translating "It's time for tea" into French ("C'est ..."): LSTM encoder hidden states h1-h4 feed an attention mechanism, which combines with the previous decoder state s_(i-1) to produce each output token, starting from <sos>.]
● Transformers don't use RNNs, such as LSTMs or GRUs
Transformers
Overview
deeplearning.ai
The Transformer Model
https://arxiv.org/abs/1706.03762
Scaled Dot-Product Attention (Vaswani et al., 2017)
[Figure: queries, keys and values flowing through the scaled dot-product attention block]
Multi-Head Attention
● Applies scaled dot-product attention multiple times in parallel
● Uses linear transformations of the input queries, keys and values
The Encoder
[Figure: encoder and decoder stacks over embeddings — the encoder uses self-attention, the decoder uses masked self-attention]
● Provides a contextual representation of each item in the input sequence
● Easy to parallelize!
Summary
● Some Transformers
● Introduction to T5
Transformer NLP applications
● Translation
● Text summarization
● Chat-bots
● Auto-complete
Other NLP tasks
● Named entity recognition (NER)
● Question answering (Q&A)
● Sentiment analysis
● Market intelligence
● Text classification
● Character recognition
● Spell checking
State of the Art Transformers
Attention Math
[Figure: the queries Q come from the embedding stack of "Je suis heureux"; the keys K and values V come from the embedding stack of "I am happy". Generally, K and V have the same number of rows.]
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
● The result contains one context vector per query: its number of rows is the number of queries, and its number of columns is the size of the value vectors.
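The attention computation above can be sketched in NumPy. This is a minimal illustration, not course code; the function name `scaled_dot_product_attention` is mine:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    q: (n_q, d_k) queries, k: (n_k, d_k) keys, v: (n_k, d_v) values.
    Returns one context vector per query: shape (n_q, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                   # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v
```

With zero queries and keys, the softmax is uniform and each context vector is just the mean of the value rows, which makes the output shape easy to check by hand.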
● Ways of Attention
Self-Attention
● Queries, keys and values come from the same sentence (e.g. "it's time for tea" / "c'est l'heure du thé")
● The attention weight matrix captures the meaning of each word within the sentence
Masked Self-Attention
● Queries, keys and values come from the same sentence, but queries don't attend to future positions.
[Figure: attention weight matrix for "it's time for tea" with the entries above the diagonal masked out]

Masked self-attention math
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k) + M) V
● M is a mask with 0 on and below the diagonal and minus infinity above it, so after the softmax the weights assigned to future positions are equal to 0.
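The masked formula can be sketched the same way; the helper names are mine, and the mask is built exactly as described above (minus infinity above the diagonal):

```python
import numpy as np

def causal_mask(n):
    """0 on and below the diagonal, -inf strictly above it."""
    idx = np.arange(n)
    return np.where(idx[None, :] > idx[:, None], -np.inf, 0.0)

def masked_self_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k) + M) V — queries can't see the future."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k) + causal_mask(q.shape[0])
    scores -= scores.max(axis=-1, keepdims=True)   # exp(-inf) -> weight 0
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With identical queries and keys, position 0 can only attend to itself, and position 1 splits its attention evenly over positions 0 and 1, which matches the triangular weight matrix on the slide.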
Summary
[Figure: two attention heads over "it's time for tea" and "c'est l'heure du thé" — starting from the same original embeddings, each head learns a different alignment between the words]
Multi-Head Attention - Overview
[Figure: scaled dot-product attention applied to several heads in parallel; the per-head context vectors (one per query) are concatenated and passed through a final linear layer]
● Each head uses its own learnable linear transformations of Q, K and V
● MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
● d_model: embedding size
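The overview above can be sketched for a single sequence, splitting the d_model features into heads by reshaping. This is a hypothetical illustration, assuming square projection matrices of size d_model x d_model:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """MultiHead(Q, K, V) = Concat(head_1..head_h) W_O for one
    sequence x of shape (seq_len, d_model); d_model % n_heads == 0."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def project_and_split(w):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return (x @ w).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = project_and_split(w_q), project_and_split(w_k), project_and_split(w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head attention scores
    context = softmax(scores) @ v                          # (n_heads, seq_len, d_head)
    concat = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o                                    # final linear layer
```

With identity projections and identical input rows, every head's softmax is uniform and the output equals the input, a quick sanity check on the reshaping.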
Summary

Positional Encoding
[Figure: positional encodings are added to the input embeddings of "<start> I am happy" before the inputs enter the model]
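The slides don't show the encoding formula, but the original paper (Vaswani et al., 2017) uses sinusoidal positional encodings; a minimal NumPy sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from Vaswani et al. (2017):
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model)[None, :]              # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    # even feature indices get sin, odd indices get cos
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
```

The result has one row per position; it is simply added to the input embeddings, so it must share their dimensionality d_model.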
The Transformer decoder
[Figure: decoder architecture — Inputs → Input Embedding → add Positional Encoding → decoder block (masked Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm), repeated N times → Linear → SoftMax → Output Probabilities]
● Each sublayer is wrapped in a residual connection with layer normalization: Output Vector = LayerNorm(Input + Multi-Head Attention)
● The positional input embedding is the input embedding plus the positional encoding
The Transformer decoder
Feed forward layer
[Figure: the feed-forward sublayer applies the same fully connected network with ReLU activations to each position independently]
● Decoder and feed-forward blocks are the core of this model
Technical details for data processing
Model input: ARTICLE TEXT <EOS> SUMMARY <EOS> <pad> …
Tokenized version: [2,3,5,2,1,3,4,7,8,2,5,1,2,3,6,2,1,0,0]
Loss weights: 0s up to the first <EOS>, then 1s from the start of the summary onward.
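The loss-weight rule above is easy to express in code. A hypothetical sketch, assuming <EOS> has token id 1 and <pad> has id 0 as in the example; zeroing the padding positions is my assumption, not stated on the slide:

```python
import numpy as np

def loss_weights(token_ids, eos_id=1, pad_id=0):
    """0s up to and including the first <EOS> (the article),
    1s afterwards (the summary), per the weighting described above."""
    ids = np.asarray(token_ids)
    first_eos = int(np.argmax(ids == eos_id))  # index of the first <EOS>
    weights = np.zeros(ids.shape)
    weights[first_eos + 1:] = 1.0              # summary starts right after it
    weights[ids == pad_id] = 0.0               # assumption: padding excluded too
    return weights
```

Multiplying these weights into the per-token loss means only the summary tokens contribute to training.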
Cost function
Cross entropy loss:
J = -(1/m) Σ_i Σ_j y_j^(i) log ŷ_j^(i)
● j: sum over the summary positions
● i: sum over the batch elements
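Combining the cross entropy loss with the loss weights from the previous slide might look like this; the function name and the choice to normalize by the number of weighted tokens (rather than the raw batch size m) are mine:

```python
import numpy as np

def weighted_cross_entropy(log_probs, targets, weights):
    """Cross entropy counting only positions whose loss weight is 1
    (the summary tokens).
    log_probs: (batch, seq_len, vocab) log-probabilities from the model
    targets:   (batch, seq_len) integer token ids
    weights:   (batch, seq_len) 0/1 loss weights"""
    b, t = targets.shape
    # log P(correct token) at every position, via advanced indexing
    token_ll = log_probs[np.arange(b)[:, None], np.arange(t)[None, :], targets]
    return -np.sum(token_ll * weights) / np.sum(weights)
```

If the model assigns probability 0.5 to the correct token at the single weighted position, the loss is exactly log 2.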
Inference with a Language Model
Model input:
[Article] <EOS> [Summary] <EOS>
Inference:
● Provide: [Article] <EOS>