Here's the GPTLanguageModel class & its visual representation. I'll break it down for you in the subsequent tweets. Continue reading...

[Diagram: token embedding + position embedding feed N stacked Transformer blocks (LayerNorm, masked multi-head attention, skip connection, LayerNorm, feed-forward, skip connection), followed by a final LayerNorm and the linear lm_head.]

gpt_language_model.py

import torch
import torch.nn as nn
from torch.nn import functional as F

class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)  # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # better init, not covered in the original GPT video, but important; will cover in a follow-up video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensors of integers
        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T,C)
        x = tok_emb + pos_emb     # (B,T,C)
        x = self.blocks(x)        # (B,T,C)
        x = self.ln_f(x)          # (B,T,C)
        logits = self.lm_head(x)  # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
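Quick orientation first: below is a minimal usage sketch, not taken from the original screenshots, showing how this class is typically instantiated, called, and sampled from. It assumes the hyperparameters are module-level globals (the values here are just examples) and that the Block, MultiHeadAttention, Head & FeedFoward classes from the next tweets are defined in the same file.

import torch
import torch.nn.functional as F

# assumed module-level hyperparameters (example values only)
vocab_size = 65          # e.g. a character-level vocabulary
n_embd     = 384
n_head     = 6
n_layer    = 6
block_size = 256         # maximum context length
dropout    = 0.2
device     = 'cuda' if torch.cuda.is_available() else 'cpu'

model = GPTLanguageModel().to(device)

# training-style call: token ids in, logits + scalar cross-entropy loss out
# (with targets given, forward returns logits flattened to (B*T, vocab_size))
xb = torch.randint(0, vocab_size, (4, block_size), device=device)
yb = torch.randint(0, vocab_size, (4, block_size), device=device)  # in practice: xb shifted by one token
logits, loss = model(xb, yb)

# sampling sketch: keep feeding the last block_size tokens back in
ctx = torch.zeros((1, 1), dtype=torch.long, device=device)
for _ in range(100):
    logits, _ = model(ctx[:, -block_size:])       # crop context to block_size
    probs = F.softmax(logits[:, -1, :], dim=-1)   # distribution over the next token
    nxt = torch.multinomial(probs, num_samples=1)
    ctx = torch.cat((ctx, nxt), dim=1)

Note how the context is cropped to the last block_size tokens: the position embedding table only has block_size rows, so the model can never attend further back than that.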
We start with the Block! Depending on how deep you want the network to be, the number of blocks (N) can be changed.

Each block has:
- a LayerNorm
- multi-headed attention
- a skip connection
- a second LayerNorm
- a feed-forward network
- another skip connection

block.py

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # skip connection around attention
        x = x + self.ffwd(self.ln2(x))  # skip connection around the feed-forward
        return x

The multi-headed attention: self-attention is applied in parallel across multiple heads & the results are concatenated at the end.

[Diagram: attention applied in parallel, outputs concatenated, then a linear projection.]

multi_head_attention.py

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # run every head, concatenate along channels
        out = self.dropout(self.proj(out))                   # project back to n_embd
        return out

The individual attention head:

attention_head.py

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B, T, C = x.shape
        k = self.key(x)    # (B,T,hs)
        q = self.query(x)  # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5             # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)                                  # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x)  # (B,T,hs)
        out = wei @ v      # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

The feed-forward module! A simple linear layer followed by a non-linearity.

feed_forward.py

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

At the bottom we have two things: a token embedding table that provides an embedding for each token, and a positional embedding table that tells the network where each token sits within the block.

[Diagram: token embedding table (vocab_size rows) and position embedding table (block_size rows), one n_embd-dimensional vector per row.]
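To make those two tables concrete, here's a tiny self-contained sketch (toy sizes, not the values used in the video) of how the token and position embeddings are looked up and added, mirroring the first lines of GPTLanguageModel.forward:

import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 10, 8, 4                    # toy sizes for illustration
token_embedding_table = nn.Embedding(vocab_size, n_embd)     # one learned vector per token id
position_embedding_table = nn.Embedding(block_size, n_embd)  # one learned vector per position

idx = torch.randint(0, vocab_size, (2, 8))                   # (B=2, T=8) batch of token ids
tok_emb = token_embedding_table(idx)                         # (2, 8, 4)  what each token "is"
pos_emb = position_embedding_table(torch.arange(8))          # (8, 4)     where each token sits
x = tok_emb + pos_emb                                        # broadcasts over the batch -> (2, 8, 4)
print(x.shape)                                               # torch.Size([2, 8, 4])

Because pos_emb depends only on the position index, the same (T, C) tensor is broadcast across every sequence in the batch.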

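One more toy sketch before wrapping up: it isolates the masked-attention arithmetic from the Head class above, using random tensors in place of the learned key/query/value projections, so the shapes and the causal mask are easy to inspect.

import torch
import torch.nn.functional as F

B, T, hs = 1, 4, 8                                # toy batch size, sequence length, head size
q = torch.randn(B, T, hs)                         # stand-ins for self.query(x) / self.key(x) / self.value(x)
k = torch.randn(B, T, hs)
v = torch.randn(B, T, hs)

wei = q @ k.transpose(-2, -1) * hs**-0.5          # (B, T, T) scaled "affinities"
tril = torch.tril(torch.ones(T, T))               # lower-triangular causal mask
wei = wei.masked_fill(tril == 0, float('-inf'))   # a token can't look at future tokens
wei = F.softmax(wei, dim=-1)                      # each row sums to 1 over allowed positions
out = wei @ v                                     # (B, T, T) @ (B, T, hs) -> (B, T, hs)
print(wei[0])                                     # lower-triangular attention weights
print(out.shape)                                  # torch.Size([1, 4, 8])

Each row of wei is a probability distribution over the positions that token is allowed to look at, which is why the matrix comes out lower-triangular.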
That's a wrap! If you're interested in:
- Python
- Data Science
- Machine Learning
- MLOps
- NLP
- Computer Vision
- LLMs

follow me on LinkedIn. Every day, I share tutorials on the above topics!

Cheers!!
@akshay_pachaar