Theory Comparison :-
GPT :-
GPT is known for training huge models with billions of parameters; for example, the
largest GPT-3 variant has 175B parameters. The architecture is based on the
Transformer's decoder block. The encoder-decoder cross-attention part of the block
is removed because there is no encoder, and the self-attention part is replaced with
masked self-attention.
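To make the masking concrete, below is a minimal single-head sketch in PyTorch (the class name and sizes are illustrative, not GPT's actual implementation): there is no cross-attention, and the causal mask stops each position from attending to later positions.

```
import torch
import torch.nn as nn

class MaskedSelfAttention(nn.Module):
    """Single-head causal self-attention, as used in a decoder-only (GPT-style) block.

    Illustrative sketch only; real GPT blocks use multi-head attention,
    dropout, and much larger hidden sizes.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.d_model = d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / self.d_model ** 0.5   # (batch, seq, seq)
        # Causal mask: position t may only attend to positions <= t,
        # so the model cannot look forward at future tokens.
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~causal, float("-inf"))
        attn = scores.softmax(dim=-1)
        return self.out(attn @ v)
```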
They chose an autoregressive pre-training objective called Causal Language
Modelling: the whole input sequence is fed to the model, which is expected to
predict the next token at each timestep. (At generation time, each newly generated
token is fed back to the model in a loop to obtain the prediction for the following
timestep.) Masked self-attention prevents the model from cheating by looking
forward at each timestep, because the future tokens are masked out.
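A short sketch of both ideas, assuming `model` maps token ids of shape (batch, seq_len) to per-position logits of shape (batch, seq_len, vocab_size); this is not the actual GPT training code, just the shape of the objective and the generation loop:

```
import torch
import torch.nn.functional as F

def causal_lm_loss(model, token_ids):
    """Causal Language Modelling: predict token t+1 from tokens up to t."""
    inputs = token_ids[:, :-1]             # all tokens except the last
    targets = token_ids[:, 1:]             # the same sequence shifted left by one
    logits = model(inputs)                 # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=20):
    """Greedy decoding loop: append each predicted token and feed the sequence back in."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                       # (1, seq_len, vocab_size)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)    # grow the sequence by one token
    return ids
```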
It is a generative model but can do different tasks with linear layers on top. It also
uses task-specific special tokens to pass both the input and target sequences jointly
to the model, so it can recognise the task and make the prediction accordingly.
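As an illustration of this joint formatting, here is a hypothetical helper in the spirit of GPT-1's input transformations; the token names <start>, <delim>, and <extract> are assumptions for the sketch, not the model's actual vocabulary:

```
# Pack two sequences into one token stream, separated by special tokens,
# so a single linear head on top of the final hidden state can solve the task.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"   # assumed special tokens

def format_entailment(premise_tokens, hypothesis_tokens):
    """Join premise and hypothesis for a textual-entailment classification head."""
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [EXTRACT]

print(format_entailment(["the", "cat", "sleeps"], ["an", "animal", "sleeps"]))
# ['<start>', 'the', 'cat', 'sleeps', '<delim>', 'an', 'animal', 'sleeps', '<extract>']
```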
XLNET :-
It is an autoregressive model pre-trained on the idea that corrupting the input data
(as BERT's masked-token objective does) is not a good idea, because information and
dependencies between the masked tokens are lost. Instead, a Permutation Language
Modelling pre-training objective was introduced, which considers the possible
permutations (factorisation orders) of an input sequence.
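A minimal sketch of that idea (hypothetical helper, PyTorch): sample a factorisation order and turn it into an attention mask, so each position only sees the positions that precede it in that order; the real XLNet additionally uses two-stream attention and a Transformer-XL backbone.

```
import torch

def permutation_attention_mask(seq_len: int):
    """Sample a factorisation order and build its attention mask.

    Nothing is masked out of the input itself, so no information is lost by
    corrupting tokens as in BERT's objective; only the attention pattern changes.
    """
    order = torch.randperm(seq_len)                 # sampled factorisation order
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[order] = torch.arange(seq_len)             # rank[i] = place of position i in the order
    # mask[i, j] is True if position i may attend to position j
    # (i.e. j comes strictly earlier in the sampled order).
    mask = rank.unsqueeze(1) > rank.unsqueeze(0)
    return order, mask

order, mask = permutation_attention_mask(5)
print(order)   # e.g. tensor([3, 0, 4, 1, 2])
print(mask)    # boolean (5, 5) attention mask for this factorisation order
```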
Conclusion :