
Transformers
▪Context

Slides: Google Research



Context
▪ “The animal didn't cross the street because it was too tired”
▪ What does “it” refer to: the animal or the street?


Context


Attention
▪ The attention mechanism has, in principle, an unlimited reference window: every position can attend directly to every other position in the sequence.


Attention Is All You Need

The encoder produces a continuous vector representation of the inputs.


1. Input Embedding
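A minimal sketch of this step (PyTorch; the vocabulary size and model dimension below are assumed values for illustration): each token id is mapped to a learned d_model-dimensional vector.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512           # assumed sizes for illustration
embed = nn.Embedding(vocab_size, d_model)  # learned lookup table

token_ids = torch.tensor([[5, 42, 317]])   # one sentence of 3 token ids
x = embed(token_ids)                       # (batch=1, seq_len=3, d_model=512)
```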


2. Positional Encoding
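Since attention itself is order-agnostic, a positional encoding is added to the input embeddings. A minimal sketch of the sinusoidal encoding from the original paper (sizes are illustrative):

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)           # even indices
    angle = pos / torch.pow(10000.0, i / d_model)                # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
# x = embed(token_ids) + pe[:token_ids.size(1)]   # added to the input embeddings
```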


3-4. Encoder Layer


Multi-Head Attention

Queries (Q), keys (K), and values (V) are computed from the input tokens X1, X2, X3 and fed into the multi-head attention block.


Self Attention
▪ Associates each word in the input with every other word in the input.


Self attention details


▪ Given an input sequence X
▪ Project X to queries, keys, and values using linear transforms:
  Q = W_Q X
  K = W_K X
  V = W_V X
▪ The attention output is softmax(Q Kᵀ / √d_k) V, where d_k is the key dimension (a sketch follows below)
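A minimal sketch of this computation, assuming the tokens are the rows of X and the projection matrices are given (sizes are illustrative, with d_model = 512 and d_k = d_v = 64):

```python
import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    Q = X @ W_q                                    # (seq_len, d_k)
    K = X @ W_k                                    # (seq_len, d_k)
    V = X @ W_v                                    # (seq_len, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                             # (seq_len, d_v)

X = torch.randn(3, 512)                            # 3 tokens
W_q, W_k, W_v = (torch.randn(512, 64) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)             # (3, 64)
```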




Residual Connection, Layer Normalization, and Pointwise Feed-Forward Network
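A minimal sketch of one encoder block built from these pieces (PyTorch; post-norm ordering as in the original paper, sizes illustrative):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))  # pointwise feed-forward
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)         # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))      # residual connection + layer norm
        return x
```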


Transformer Encoder

▪ The encoder block is stacked N times: the output of one Transformer encoder layer is fed as input to the next (see the sketch below).
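The same stacking is available off the shelf in PyTorch; a brief sketch (N = 6 as in the original paper, other sizes illustrative):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)  # the block repeated N = 6 times

x = torch.randn(1, 9, 512)   # (batch, seq_len, d_model)
out = encoder(x)             # same shape: (1, 9, 512)
```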


Encoder Example

Credit: Y Tamura

Multi Head Example

▪ Number of tokens = 9
▪ Encoded token dimension = 512
▪ Split each token into 8 equal parts of 64 dimensions, one per head: H1 H2 H3 H4 H5 H6 H7 H8 (see the sketch below)
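A minimal sketch of the split described above: each 512-dimensional token vector is reshaped into 8 heads of 64 dimensions, attention runs in each head independently, and the 8 head outputs are concatenated back to 512.

```python
import torch

n_tokens, d_model, n_heads = 9, 512, 8
d_head = d_model // n_heads                 # 64

x = torch.randn(n_tokens, d_model)          # 9 encoded tokens
heads = x.view(n_tokens, n_heads, d_head)   # (9, 8, 64): H1 ... H8 per token
heads = heads.transpose(0, 1)               # (8, 9, 64): one 9x64 matrix per head
# attention is computed per head, then the 8 outputs of size (9, 64)
# are concatenated back into a (9, 512) matrix.
```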


Multi Head Example


Transformer Decoder

▪ The decoder generates the output text.

▪ It is autoregressive: it takes its previous outputs as inputs.

▪ Generation stops when the <end> token is generated (see the decoding sketch below).
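A minimal sketch of this autoregressive loop with greedy decoding. Here `model`, `start_id`, and `end_id` are hypothetical placeholders; `model(src, tgt)` is assumed to return next-token logits of shape (batch, tgt_len, vocab).

```python
import torch

def greedy_decode(model, src, start_id, end_id, max_len=50):
    tgt = [start_id]
    for _ in range(max_len):
        logits = model(src, torch.tensor([tgt]))  # feed previous outputs back in
        next_id = int(logits[0, -1].argmax())     # most likely next token
        tgt.append(next_id)
        if next_id == end_id:                     # stop at the <end> token
            break
    return tgt
```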


Output Embedding and Positional Encoding


Decoder Multi-Head Attention 1


Look-Ahead Mask

▪ Prevents the decoder from attending to future tokens (see the mask sketch below)
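A minimal sketch of the mask, assuming the PyTorch convention that True marks positions that may not be attended to; the masked scores are set to -inf before the softmax so their attention weights become zero.

```python
import torch

def look_ahead_mask(size):
    # position i may only attend to positions 0..i
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

print(look_ahead_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```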


Linear Classifier
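The decoder output is projected to vocabulary-size logits and softmaxed to pick the next word; a minimal sketch with assumed sizes:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000              # assumed sizes
classifier = nn.Linear(d_model, vocab_size)

dec_out = torch.randn(1, 7, d_model)          # decoder output for 7 positions
logits = classifier(dec_out)                  # (1, 7, vocab_size)
probs = logits.softmax(dim=-1)                # distribution over the vocabulary
next_token = probs[0, -1].argmax()            # predicted next word id
```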


Transformer – Encoder & Decoder


Transformers, GPT-2, and BERT


1. A transformer uses an encoder stack to model the input and a decoder stack to model the output (using information from the encoder side).
2. If there is no separate input and we only want to model the “next word”, we can drop the encoder side and generate the next word one token at a time. This gives us GPT.
3. If we are only interested in training a language model of the input for other downstream tasks, we do not need the decoder of the transformer. This gives us BERT.

Training a Transformer
▪Transformers typically use semi-supervised
learning with
‣Unsupervised pretraining over a very large dataset of
general text
‣Followed by supervised fine-tuning over a focused data
set of inputs and outputs for a particular task
▪Tasks for pretraining and fine-tuning commonly include:
‣language modeling
‣next-sentence prediction (aka completion)
‣question answering
‣reading comprehension
‣sentiment analysis
‣paraphrasing

Pre-trained Models
▪ Since training a model requires huge datasets of
text and significant computation, researchers often
use common pretrained models
▪ Examples (~ December 2021) include
‣Google’s BERT model
‣Hugging Face's various Transformer models
‣OpenAI's GPT-2 and GPT-3 models
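For instance, a pretrained Hugging Face model can be used in a couple of lines; the sketch below downloads a default sentiment-analysis model (the output shown is only indicative):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a default pretrained model
print(classifier("Transformers make transfer learning easy."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```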


GPT-2 & BERT


GPT – Variants
GPT released June 2018
GPT-2 released Nov. 2019 with 1.5B parameters
GPT-3 released in 2020 with 175B parameters

GPT-2 was released in four sizes: 117M, 345M, 762M, and 1542M parameters


ViT – Vision Transformers


• Divide an input image into 196 (14×14) small patches of size 16×16

• Treat each patch like a token embedding in NLP

• Use them as input to a standard transformer encoder (as in BERT)

• Use 12 transformer layers (Norm, multi-head attention, etc.)

• Take the final output and use it as input to a dense layer with 1000 classes (a patch-embedding sketch follows)
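A minimal sketch of the patch step for a 3×224×224 image: 14×14 = 196 patches of 16×16, each flattened to 3·16·16 = 768 values and linearly projected to the embedding dimension (768 for ViT-Base).

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                  # one RGB image
patch, dim = 16, 768

# cut into 14 x 14 = 196 non-overlapping 16x16 patches
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * patch * patch)

proj = nn.Linear(3 * patch * patch, dim)           # linear patch embedding
tokens = proj(patches)                             # (1, 196, 768): one token per patch
```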


Patch Embedding

Each patch is converted into a vector of size 3×16×16 = 768



[CLS] Token

The [CLS] token is a vector of size (1, 768).

The final patch matrix has size (197, 768): 196 rows from the patches and 1 from the [CLS] token (see the sketch below).
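A minimal sketch of prepending the learnable [CLS] token and adding the (learnable, in ViT) position embeddings, assuming `tokens` are the 196 patch embeddings from the previous step:

```python
import torch
import torch.nn as nn

dim = 768
cls_token = nn.Parameter(torch.zeros(1, 1, dim))        # learnable [CLS] token, (1, 1, 768)
pos_embed = nn.Parameter(torch.zeros(1, 197, dim))      # learnable position embeddings

tokens = torch.randn(1, 196, dim)                       # patch embeddings
x = torch.cat([cls_token, tokens], dim=1) + pos_embed   # (1, 197, 768) fed to the encoder
```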


ViT - Summary


ViT Layers


How Good is ViT?

• Worse than ResNet when trained only on ImageNet

• Performance improves when pre-trained on a big (and I mean it) dataset

• The pretrained ViT then outperforms much bigger CNNs


ViT – in Numbers


Criticisms

• Better results only with more data

• The cost of training from scratch is ridiculously high (~$30k)

• Is it really that different from convolutions?
