Here's the GPTLanguageModel class & its visual representation. I'll break it down for you in the subsequent tweets. Continue reading...

[Diagram: token embedding + position embedding feed N stacked Transformer blocks (LayerNorm, masked multi-head attention, skip connection, LayerNorm, feed-forward, skip connection), followed by a final LayerNorm and the linear lm_head.]

gpt_language_model.py

import torch
import torch.nn as nn
from torch.nn import functional as F

class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)  # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # better init, not covered in the original GPT video, but important; will cover in a follow-up video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensors of integers
        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T,C)
        x = tok_emb + pos_emb     # (B,T,C)
        x = self.blocks(x)        # (B,T,C)
        x = self.ln_f(x)          # (B,T,C)
        logits = self.lm_head(x)  # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
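Quick orientation first: below is a minimal usage sketch, not taken from the original screenshots, showing how this class is typically instantiated, called, and sampled from. It assumes the hyperparameters are module-level globals (the values here are just examples) and that the Block, MultiHeadAttention, Head & FeedFoward classes from the next tweets are defined in the same file.

import torch
import torch.nn.functional as F

# assumed module-level hyperparameters (example values only)
vocab_size = 65          # e.g. a character-level vocabulary
n_embd     = 384
n_head     = 6
n_layer    = 6
block_size = 256         # maximum context length
dropout    = 0.2
device     = 'cuda' if torch.cuda.is_available() else 'cpu'

model = GPTLanguageModel().to(device)

# training-style call: token ids in, logits + scalar cross-entropy loss out
# (with targets given, forward returns logits flattened to (B*T, vocab_size))
xb = torch.randint(0, vocab_size, (4, block_size), device=device)
yb = torch.randint(0, vocab_size, (4, block_size), device=device)  # in practice: xb shifted by one token
logits, loss = model(xb, yb)

# sampling sketch: keep feeding the last block_size tokens back in
ctx = torch.zeros((1, 1), dtype=torch.long, device=device)
for _ in range(100):
    logits, _ = model(ctx[:, -block_size:])       # crop context to block_size
    probs = F.softmax(logits[:, -1, :], dim=-1)   # distribution over the next token
    nxt = torch.multinomial(probs, num_samples=1)
    ctx = torch.cat((ctx, nxt), dim=1)

Note how the context is cropped to the last block_size tokens: the position embedding table only has block_size rows, so the model can never attend further back than that.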
We start with the Block! Depending on how deep you want the network to be, the number of blocks (N) can be changed.

Each block has:
- a LayerNorm
- multi-headed attention
- a skip connection
- a second LayerNorm
- a feed-forward network
- another skip connection

block.py

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # skip connection around attention
        x = x + self.ffwd(self.ln2(x))  # skip connection around the feed-forward
        return x

The multi-headed attention: self-attention is applied in parallel across multiple heads & the results are concatenated at the end.

[Diagram: attention applied in parallel, outputs concatenated, then a linear projection.]

multi_head_attention.py

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # run every head, concatenate along channels
        out = self.dropout(self.proj(out))                   # project back to n_embd
        return out

The individual attention head:

attention_head.py

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B, T, C = x.shape
        k = self.key(x)    # (B,T,hs)
        q = self.query(x)  # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5             # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)                                  # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x)  # (B,T,hs)
        out = wei @ v      # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

The feed-forward module! A simple linear layer followed by a non-linearity.

feed_forward.py

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

At the bottom we have two things: a token embedding table that provides an embedding for each token, and a positional embedding table that tells the network where each token sits within the block.

[Diagram: token embedding table (vocab_size rows) and position embedding table (block_size rows), one n_embd-dimensional vector per row.]
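To make those two tables concrete, here's a tiny self-contained sketch (toy sizes, not the values used in the video) of how the token and position embeddings are looked up and added, mirroring the first lines of GPTLanguageModel.forward:

import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 10, 8, 4                    # toy sizes for illustration
token_embedding_table = nn.Embedding(vocab_size, n_embd)     # one learned vector per token id
position_embedding_table = nn.Embedding(block_size, n_embd)  # one learned vector per position

idx = torch.randint(0, vocab_size, (2, 8))                   # (B=2, T=8) batch of token ids
tok_emb = token_embedding_table(idx)                         # (2, 8, 4)  what each token "is"
pos_emb = position_embedding_table(torch.arange(8))          # (8, 4)     where each token sits
x = tok_emb + pos_emb                                        # broadcasts over the batch -> (2, 8, 4)
print(x.shape)                                               # torch.Size([2, 8, 4])

Because pos_emb depends only on the position index, the same (T, C) tensor is broadcast across every sequence in the batch.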

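One more toy sketch before wrapping up: it isolates the masked-attention arithmetic from the Head class above, using random tensors in place of the learned key/query/value projections, so the shapes and the causal mask are easy to inspect.

import torch
import torch.nn.functional as F

B, T, hs = 1, 4, 8                                # toy batch size, sequence length, head size
q = torch.randn(B, T, hs)                         # stand-ins for self.query(x) / self.key(x) / self.value(x)
k = torch.randn(B, T, hs)
v = torch.randn(B, T, hs)

wei = q @ k.transpose(-2, -1) * hs**-0.5          # (B, T, T) scaled "affinities"
tril = torch.tril(torch.ones(T, T))               # lower-triangular causal mask
wei = wei.masked_fill(tril == 0, float('-inf'))   # a token can't look at future tokens
wei = F.softmax(wei, dim=-1)                      # each row sums to 1 over allowed positions
out = wei @ v                                     # (B, T, T) @ (B, T, hs) -> (B, T, hs)
print(wei[0])                                     # lower-triangular attention weights
print(out.shape)                                  # torch.Size([1, 4, 8])

Each row of wei is a probability distribution over the positions that token is allowed to look at, which is why the matrix comes out lower-triangular.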
That's a wrap! If you're interested in:
- Python
- Data Science
- Machine Learning
- MLOps
- NLP
- Computer Vision
- LLMs

follow me on LinkedIn. Every day, I share tutorials on the above topics!

Cheers!!
@akshay_pachaar