
Module 13

Compare BERT, GPT-2 and XLNet. Write down the differences between them.

Theory Comparison :-
BERT :-

• BERT stands for Bidirectional Encoder Representations from Transformers.
• BERT is designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers.
• The BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
• BERT is pre-trained with two objectives: masked language modelling (MLM) and next sentence prediction (NSP).
• Masked Language Model: the model masks some random words in the input and tries to predict the missing tokens. As reported in the paper, 15% of the words are chosen for masking. Of those, 1) 80% are replaced by the [MASK] token; 2) 10% are replaced by a random word; and 3) the remaining 10% are left unchanged (a sketch of this rule appears after this list).
• Next Sentence Prediction: the objective is to understand the relationship between two sentences. Sentences are fed in pairs, separated by a special separator token, and the model estimates how likely it is that the 2nd sentence is the follow-up of the 1st.
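
Below is a minimal sketch of the 80/10/10 masking rule described in the MLM bullet above, assuming PyTorch tensors of token ids. The function name bert_mask_tokens and its arguments are illustrative, not part of any official BERT codebase.

import torch

def bert_mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    # Illustrative BERT-style masking: choose ~15% of positions as
    # prediction targets, then apply the 80/10/10 rule.
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    chosen = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~chosen] = -100                       # only chosen positions count in the loss

    # 80% of the chosen tokens are replaced by [MASK]
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & chosen
    input_ids[to_mask] = mask_token_id

    # half of the remaining 20% (i.e. 10% overall) become a random token
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & chosen & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]

    # the final 10% of chosen tokens are left unchanged
    return input_ids, labels

In practice, special tokens such as [CLS] and [SEP] would also be excluded from masking, which this sketch omits for brevity.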
GPT :-

• The GPT family trains huge models with billions of parameters; for example, GPT-3's largest version has 175B parameters. The architecture is based on the Transformer's decoder block: the encoder-decoder cross-attention part of the block is removed because there is no encoder, and the self-attention part is replaced with masked self-attention.
• GPT uses an autoregressive pre-training objective, causal language modelling: the whole input sequence is fed to the model, which is expected to predict the next token at each timestep. (At inference time, each newly generated token is fed back to the model in a loop to get the next timestep's prediction; see the sketch after this list.) Masked self-attention prevents the model from cheating by looking forward, since future tokens are masked out at each timestep.
• It is a generative model and can do different tasks with linear layers on top. It also uses special tokens for each task to pass both input and target sequences jointly to the model, so it can understand the task and make the prediction accordingly.
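
The feed-back loop described above can be seen in a toy greedy-decoding sketch with GPT-2, assuming the Hugging Face transformers library; the prompt and the number of generated tokens are arbitrary choices.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Encode an arbitrary prompt
input_ids = tokenizer("The Transformer architecture", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                          # generate 20 new tokens
        logits = model(input_ids).logits                         # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
        input_ids = torch.cat([input_ids, next_id], dim=-1)      # feed it back in

print(tokenizer.decode(input_ids[0]))

Inside the model, masked self-attention ensures that each position only attends to earlier positions, so no future token leaks into the prediction.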
XLNet :-

• XLNet is built on an encoder-style, Transformer-XL-based architecture and is pre-trained on the idea that corrupting the input (as BERT does with [MASK]) is undesirable, because information and the dependencies between masked tokens are lost. Instead, a permutation language modelling pre-training objective was introduced, which considers the possible permutations (factorization orders) of an input sequence; see the sketch below.
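
As a rough illustration of the permutation idea only (not the full XLNet objective, which also uses two-stream attention and Transformer-XL memory), the sketch below builds the attention-visibility mask implied by one randomly sampled factorization order.

import torch

seq_len = 5
perm = torch.randperm(seq_len)            # one random factorization order
rank = torch.empty(seq_len, dtype=torch.long)
rank[perm] = torch.arange(seq_len)        # rank[i] = position of token i in that order

# visible[i, j] is True when token i may attend to token j,
# i.e. token j comes strictly earlier than token i in the sampled order
visible = rank.unsqueeze(1) > rank.unsqueeze(0)
print(perm)
print(visible.int())

Averaged over many sampled orders, every token eventually conditions on every other token, which is how XLNet captures bidirectional context without corrupting the input.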

Conclusion :

• XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on many tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
