
Transformers
▪Context

Slides: Google Research



Context
▪ “The animal didn't cross the street because it was too tired”
▪ What does “it” refer to: the animal or the street?


Context


Attention
▪ The attention mechanism has, in principle, an unlimited reference window: every position can attend directly to every other position in the sequence.


Attention Is All You Need

The encoder produces a continuous vector representation of the inputs.


1. Input Embedding
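A minimal sketch of this step (PyTorch; the vocabulary size and model dimension below are assumed values for illustration): each token id is mapped to a learned d_model-dimensional vector.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512           # assumed sizes for illustration
embed = nn.Embedding(vocab_size, d_model)  # learned lookup table

token_ids = torch.tensor([[5, 42, 317]])   # one sentence of 3 token ids
x = embed(token_ids)                       # (batch=1, seq_len=3, d_model=512)
```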


2. Positional Encoding
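Since attention itself is order-agnostic, a positional encoding is added to the input embeddings. A minimal sketch of the sinusoidal encoding from the original paper (sizes are illustrative):

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)           # even indices
    angle = pos / torch.pow(10000.0, i / d_model)                # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
# x = embed(token_ids) + pe[:token_ids.size(1)]   # added to the input embeddings
```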


3-4. Encoder Layer


Multi-Head Attention

Queries (Q), keys (K), and values (V) are computed from the input tokens X1, X2, X3 and fed into the multi-head attention block.


Self Attention
▪ Associates each word in the input with every other word in the input.


Self attention details


▪ Given an input sequence X
▪ Project X to queries, keys, and values using linear transforms:
  Q = W_Q X
  K = W_K X
  V = W_V X
▪ The attention output is softmax(Q Kᵀ / √d_k) V, where d_k is the key dimension (a sketch follows below)
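A minimal sketch of this computation, assuming the tokens are the rows of X and the projection matrices are given (sizes are illustrative, with d_model = 512 and d_k = d_v = 64):

```python
import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    Q = X @ W_q                                    # (seq_len, d_k)
    K = X @ W_k                                    # (seq_len, d_k)
    V = X @ W_v                                    # (seq_len, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                             # (seq_len, d_v)

X = torch.randn(3, 512)                            # 3 tokens
W_q, W_k, W_v = (torch.randn(512, 64) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)             # (3, 64)
```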




Residual Connection, Layer Normalization, and Pointwise Feed-Forward Network
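A minimal sketch of one encoder block built from these pieces (PyTorch; post-norm ordering as in the original paper, sizes illustrative):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))  # pointwise feed-forward
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)         # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))      # residual connection + layer norm
        return x
```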


Transformer Encoder

▪ The encoder block is stacked N times: the output of one Transformer encoder layer is fed as input to the next (see the sketch below).
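The same stacking is available off the shelf in PyTorch; a brief sketch (N = 6 as in the original paper, other sizes illustrative):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)  # the block repeated N = 6 times

x = torch.randn(1, 9, 512)   # (batch, seq_len, d_model)
out = encoder(x)             # same shape: (1, 9, 512)
```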


Encoder Example

Credit: Y Tamura

Multi Head Example

▪ Number of tokens = 9
▪ Encoded token dimension = 512
▪ Split each token into 8 equal parts of 64 dimensions, one per head: H1 H2 H3 H4 H5 H6 H7 H8 (see the sketch below)
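A minimal sketch of the split described above: each 512-dimensional token vector is reshaped into 8 heads of 64 dimensions, attention runs in each head independently, and the 8 head outputs are concatenated back to 512.

```python
import torch

n_tokens, d_model, n_heads = 9, 512, 8
d_head = d_model // n_heads                 # 64

x = torch.randn(n_tokens, d_model)          # 9 encoded tokens
heads = x.view(n_tokens, n_heads, d_head)   # (9, 8, 64): H1 ... H8 per token
heads = heads.transpose(0, 1)               # (8, 9, 64): one 9x64 matrix per head
# attention is computed per head, then the 8 outputs of size (9, 64)
# are concatenated back into a (9, 512) matrix.
```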


Multi Head Example


Transformer Decoder

▪ The decoder generates the output text.

▪ It is autoregressive: it takes its previous outputs as inputs.

▪ Generation stops when the <end> token is generated (see the decoding sketch below).
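A minimal sketch of this autoregressive loop with greedy decoding. Here `model`, `start_id`, and `end_id` are hypothetical placeholders; `model(src, tgt)` is assumed to return next-token logits of shape (batch, tgt_len, vocab).

```python
import torch

def greedy_decode(model, src, start_id, end_id, max_len=50):
    tgt = [start_id]
    for _ in range(max_len):
        logits = model(src, torch.tensor([tgt]))  # feed previous outputs back in
        next_id = int(logits[0, -1].argmax())     # most likely next token
        tgt.append(next_id)
        if next_id == end_id:                     # stop at the <end> token
            break
    return tgt
```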


Output Embedding and Positional Encoding


Decoder Multi-Head Attention 1


Look-Ahead Mask

▪ Prevents the decoder from attending to future tokens (see the mask sketch below)
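A minimal sketch of the mask, assuming the PyTorch convention that True marks positions that may not be attended to; the masked scores are set to -inf before the softmax so their attention weights become zero.

```python
import torch

def look_ahead_mask(size):
    # position i may only attend to positions 0..i
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

print(look_ahead_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```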


Linear Classifier
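The decoder output is projected to vocabulary-size logits and softmaxed to pick the next word; a minimal sketch with assumed sizes:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000              # assumed sizes
classifier = nn.Linear(d_model, vocab_size)

dec_out = torch.randn(1, 7, d_model)          # decoder output for 7 positions
logits = classifier(dec_out)                  # (1, 7, vocab_size)
probs = logits.softmax(dim=-1)                # distribution over the vocabulary
next_token = probs[0, -1].argmax()            # predicted next word id
```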


Transformer – Encoder & Decoder


Transformers, GPT-2, and BERT


1. A transformer uses an encoder stack to model the input and a decoder stack to model the output (using information from the encoder side).
2. If there is no separate input and we only want to model the “next word”, we can drop the encoder side and generate the next word one token at a time. This gives us GPT.
3. If we are only interested in training a language model of the input for other downstream tasks, we do not need the decoder of the transformer. This gives us BERT.

Training a Transformer
▪Transformers typically use semi-supervised
learning with
‣Unsupervised pretraining over a very large dataset of
general text
‣Followed by supervised fine-tuning over a focused data
set of inputs and outputs for a particular task
▪Tasks for pretraining and fine-tuning commonly include:
‣language modeling
‣next-sentence prediction (aka completion)
‣question answering
‣reading comprehension
‣sentiment analysis
‣paraphrasing

Pre-trained Models
▪ Since training a model requires huge datasets of
text and significant computation, researchers often
use common pretrained models
▪ Examples (~ December 2021) include
‣Google’s BERT model
‣Hugging Face's various Transformer models
‣OpenAI's GPT-2 and GPT-3 models
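For instance, a pretrained Hugging Face model can be used in a couple of lines; the sketch below downloads a default sentiment-analysis model (the output shown is only indicative):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a default pretrained model
print(classifier("Transformers make transfer learning easy."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```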


GPT-2 & BERT


GPT – Variants
GPT released June 2018
GPT-2 released Nov. 2019 with 1.5B parameters
GPT-3 released in 2020 with 175B parameters

GPT-2 was released in four sizes: 117M, 345M, 762M, and 1542M parameters


ViT – Vision Transformers


• Divide an input image into 196 (14×14) small patches of size 16×16

• Treat each patch like a token embedding in NLP

• Use them as input to a standard transformer encoder (as in BERT)

• Use 12 transformer layers (Norm, multi-head attention, etc.)

• Take the final output and use it as input to a dense layer with 1000 classes (a patch-embedding sketch follows)
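A minimal sketch of the patch step for a 3×224×224 image: 14×14 = 196 patches of 16×16, each flattened to 3·16·16 = 768 values and linearly projected to the embedding dimension (768 for ViT-Base).

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                  # one RGB image
patch, dim = 16, 768

# cut into 14 x 14 = 196 non-overlapping 16x16 patches
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * patch * patch)

proj = nn.Linear(3 * patch * patch, dim)           # linear patch embedding
tokens = proj(patches)                             # (1, 196, 768): one token per patch
```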


Patch Embedding

Each patch is converted into a vector of size 3×16×16 = 768



[CLS] Token

The [CLS] token is a vector of size (1, 768).

The final patch matrix has size (197, 768): 196 rows from the patches and 1 from the [CLS] token (see the sketch below).
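A minimal sketch of prepending the learnable [CLS] token and adding the (learnable, in ViT) position embeddings, assuming `tokens` are the 196 patch embeddings from the previous step:

```python
import torch
import torch.nn as nn

dim = 768
cls_token = nn.Parameter(torch.zeros(1, 1, dim))        # learnable [CLS] token, (1, 1, 768)
pos_embed = nn.Parameter(torch.zeros(1, 197, dim))      # learnable position embeddings

tokens = torch.randn(1, 196, dim)                       # patch embeddings
x = torch.cat([cls_token, tokens], dim=1) + pos_embed   # (1, 197, 768) fed to the encoder
```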


ViT - Summary


ViT Layers


How Good is ViT?

• Worse than ResNet when trained only on ImageNet

• Performance improves when pre-trained on a big (and I mean it) dataset

• The pretrained ViT then outperforms much bigger CNNs


ViT – in Numbers


Criticisms

• Better results only with more data

• The cost of training from scratch is ridiculously high (~$30k)

• Is it really that different from convolutions?
