You are on page 1of 43
41424, 1211 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding Openinapp 7 @O dQ search Gwe O © eated from phtofunt: Solving Transformer by Hand: A Step-by-Step Math Example @ Fareed Khan + Follow Published in Level Up Coding - 1 minread . Dec 18,2028 S2x*x Qs W © fo nitpsoevelupgitconnected.comunderstar formers ron-star-o-end-a-step-by-step-math-example-téd4e6ée6eb1 3 24, 14:11AM Salving Transformer by Hand: A Step-by-Step Math Example | by Fareed Knan | Level Up Cosing Ihave already written a detailed blog on how transformers work using a very small sample of the dataset, which will be my best blog ever because it has elevated my profile and given me the motivation to write more. However, that blog is incomplete as it only covers 20% of the transformer architecture and contains numerous calculation errors, as pointed out by readers. After a considerable amount of time has passed since that blog, I will be revisiting the topic in this new blog. My previous blog on transformer architecture (covers only 20%): Understanding Transformers: A Step-by-Step Math Example — Part1 lunderstand that the transformer architecture may seem scary, and you might have encountered various explanations on... medium.com I plan to explain the transformer again in the same manner as I did in my previous blog (for both coders and non-coders), providing a complete guide with a step-by-step approach to understanding how they work. Table of Contents + Defining our Dataset + Finding Vocab Size + Encoding + Calculating Embedding * Calculating Positional Embedding * Concatenating Positional and Word Embeddings ntps:evelupgitconnectes.com/understanding-ransformers-rom-startto-ond-a-step-by-stop-math-example-18d4e6¢o6ebt1 2183 srar24, 111 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding Multi Head Attention Adding and Normalizing Feed Forward Network Adding and Normalizing Again * Di r Part + Understanding Mask Multi Head Attention * Calculating the Predicted Word Important Points Conclusion Step 1— Defining our Dataset The dataset used for creating ChatGPT is 570 GB. On the other hand, for our purposes, we will be using a very small dataset to perform numerical calculations visually. Dataset (corpus) | drink and | know things. When you play the game of thrones, you win or you die. The true enemy won't wait out the storm, He brings the storm. Our entire dataset containing only three sentences Our entire dataset contains only three sentences, all of which are dialogues taken from a TV show. Although our dataset is cleaned, in real-world hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 3143 srar24, 111 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding scenarios like ChatGPT creation, cleaning a 570 GB dataset requires a significant amount of effort. Step 2— Finding Vocab Size The vocabulary size determines the total number of unique words in our dataset. It can be calculated using the below formula, where N is the total number of words in our dataset. vocab size = count(set(N)) vocab_size formula where N is total number of words In order to find N, we need to break our dataset into individual words. Dataset (Corpus) | drink and | know things. When you play the game of thrones, you win or you die. The true enemy won't wait out the storm, He brings the storm. When, you, play, the, game, of, thrones, you, win, or, you, die, N |, drink, and, 1, Know, things, The, true, enemy, won't, wait, out, the, storm, He, brings, the, storm calculating variable N After obtaining N, we perform a set operation to remove duplicates, and then we can count the unique words to determine the vocabulary size. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 4183 s1ar24, 1-11. AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding vocab size = count(set(N)) I rink, and, 1, Know, things, set (irene lay, the, game, of, thrones, you, win, oF you, de, ) The, ie, enemy, wor, welt, out the, storm, He, brings, te, storm |, drink, and, Know, things, When, you, play, the, game, of, L, COUNT (tiveness or ce te, enemy, wont wat ot stom, Ho L,=23 finding vocab size Therefore, the vocabulary size is 23, as there are 23 unique words in our dataset. Step 3 — Encoding Now, we need to assign a unique number to each unique word. 12 3 4 5 6 7 8 9 10 1 12 | drink things Know When won't play out true storm brings game 13001415 6 7 #18 19 20 2 © 22028 the win of enemy you wait thrones and or die He encoding our unique words ‘As we have considered a single token as a single word and assigned a number to it, ChatGPT has considered a portion of a word as a single token using this formula: 1 Token = 0.75 Word hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 5143 srar24, 111 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding After encoding our entire dataset, it’s time to select our input and start working with the transformer architecture. Step 4— Calculating Embedding Let's select a sentence from our corpus that will be processed in our transformer architecture. 12 34 5 6 7 8 | drink things Know When won't play out Bo 4 6 6 7 8 19 the win of enemy you wait thrones Positional Encoding @&-o Inputs. <——4 9 mo on 2 true storm brings game 2 «8 andor diese play game of-_—_— thrones, 7 2 18 19 Input sentence for transformer We have selected our input, and we need to find an embedding vector for it. The original paper uses a 512-dimensional embedding vector for each input word. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 6143 star24, 111 AM 512 Attention Is All You Need paper When s you 7 play 7 game of 12 15 Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding thrones 19 Original Paper uses 512 dimension vector Since, for our case, we need to work with a smaller dimension of embedding vector to visualize how the calculation is taking place. So, we will be using a dimension of 6 for the embedding vector. ional seang Q-@ Tost embsng Inputs When | you play_|_ game of _ | thrones 5 7 7 12 15 19 el e2 3 ed 5 26 o7o | 038 | 001 | 012 | 088 06 06 012 [| 051 06 041 | 0.33 os | 0.06 | 0.27 | 065 | 079 | 075 oes | 079 | 031 | 0.22 | 062 | 0.48 0.97 og 0.56 0.07 0.5 0.94 0.2 074 | 059 | 037 07 0.21 Embedding vectors of our input These values of the embedding vector are between 0 and 1 and are filled randomly in the beginning. They will later be updated as our transformer hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 41424, 1211 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding starts understanding the meanings among the words. Step 5 — Calculating Positional Embedding Now we need to find positional embeddings for our input. There are two formulas for positional embedding depending on the position of the ith value of that embedding vector for each word. Embedding vector for any word even position For even position ‘odd position PE (qos,2i) = sin(pos/100007!/4') even position odd position ‘even position For odd position PE (pos,2i-+1) = €08(pos/10000?"/4«!) Positional Embedding formula ‘As you do know, our input sentence is “when you play the game of thrones” and the starting word is “when” with a starting index (POS) value is 0 , having a dimension (d) of 6. For i from 0 to 5, we calculate the positional embedding for our first word of the input sentence. nitps:evelupgitconnectes.com/understanding-ransformers-rom-startto-ond-a-step-by-stop-math-example-18d4e6¢o6ebt 2143 star24, 111.0 Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding When 5 i el Position Formula pi 0 0.79 Even _|sin(0/10000°*")| 0 1 0.6 Odd | cos(0/10000°*/)) fa 2 | 0.96 | Even _|sin(0/10000°/)|_g 3_| 064 | odd _[¢as(0/10000")F 4 4 | 0.97 | Even |sin(0/10000°")| oo 5 0.2 Oda | cos(0/10000°°°/)| 4 d(dim) 6 Pos 0 Positional Embedding for word: When Similarly, we can calculate positional embedding for all the words in our input sentence. When | you | play | same | of | thrones 5 17, 7 12 15 19 i[o [2 [ 23 [| m [ % [ os Postrel 0 0 | oeais | 09093 | o.ai1 | -0.7568 | -0.9589 eer OF 1 1_| 00464 | 0.9957 _| 0.1388 | 0.1846 | 0.9732 mim 2 o 0.0022 0.0043 | 0.0065 | 0.0086 | 0.0108 Inputs: 3 [1 [00001 [ 1 | 0.0003 | 0.0004 | 1 4 oO 0 oO oO 0 0 5 ? 0 1 oO oO 1 ddim) 6 6 6 6 6 6 Pos 0) 1 2 Calculating Positional Embeddings of our input (The calculated values are rounded) hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 91a srar24, 111 AM Step 6 — Concatenating Positional and Word Embeddings After calculating positional embedding, we need to add word embeddings Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding and positional embeddings. gee Poe Te Tee a Tee — Sig O-4 <——_ Position Embedding Matrix ae [aoa | os [os [aan [oe ous | aoeze | oon | a0oss | acoes | core 1 [ooo | 1 | oaeos | omnes [1 eot_[ exe [eos [eo [ens | ono tt ce o7 | 122 | 092 | 026 | 012 | -0.36 xe | oa7 | 151 | 07 | oso [13 + oss | 00s | o27 | oss | os | 076 Word Embedding Matrix aea_[ 079 | 131 | 022 | 062 | 149 ote poe ee os7 | os | 056 | 007 | os | ose as |-e12 | osi_| 06 | oat _| ox 12 [| o7 [ ase [oa | o7 | 1a a2 | a7 | aso [087 | 07 | oz concatenation step This resultant matrix from combining both matrices (Word embedding matrix and positional embedding matrix) will be considered as an input to the encoder part. Step 7 — Multi Head Attention A multi-head attention is comprised of many single-head attentions. It is up to us how many single heads we need to combine. For example, LLaMA LLM from Meta has used 32 single heads in the encoder architecture. Below is the illustrated diagram of how a single-head attention looks like. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 10183 41424, 1211 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding Single Head Attention Single Head attention in Transformer There are three inputs: query, key, and value. Each of these matrices is obtained by multiplying a different set of weights matrix from the Transpose of same matrix that we computed earlier by adding the word embedding and positional embedding matrix. Let's say, for computing the query matrix, the set of weights matrix must have the number of rows the same as the number of columns of the transpose matrix, while the columns of the weights matrix can be any; for example, we suppose 4 columns in our weights matrix. The values in the weights matrix are between © and 1 randomly, which will later be updated when our transformer starts learning the meaning of these words. ntps:evelupgitconnectes.com/understanding-ransformers-rom-startto-ond-a-step-by-stop-math-example-18d4e6¢o6ebt1 13 s1ar24, 1211 AM, Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding ‘Word Embedding + Positional Embedding Linear weights for query wien | 070 | 16 | 096 | 16¢ | oo7 | 12 os | 045 | os: | 069 you | 122 | o17 | 00s | 072 | 09 | o74 00s | oss | 037 | oss oy | oz | asi | or | am [oss | sso | Xx [oss | or | ose | oct game | 026 | o7 | 065 | om | oo7 | o37 on | os | o4 | om ot | o12 | 059 | os | os | os | o7 07 | 027 | 02 | 067 vones | 096 | 13 | 076 | 148 | oo4 | 1.20 oas | 056 | 057 | 0.07 6x6 6x4 calculating Query matrix Similarly, we can compute the key and value matrices using the same procedure, but the values in the weights matrix must be different for both. Word Embedding + Positional Embedding. Linear weights for key when [073 | 16 | os | 16¢ | o97 | 12 o7 | os7 | om | 073 vou | 122 [ 017 | aos | 07 | 09 | 074 oss | ors | os | 0.17 voy | ose [xsi | 027 | 1m | ose | 150 | y | os | ove | 08 | ose some | 025 [07 | 06s | 022 | 007 | 037 os | 073 | 02 | oat a [ow | ose | os | os | os | o7 os7 | 086 | 042 | 0.08 wones | -036 | 13 | 076 | 148 | ose | 121 078 | om | 087 | 088 6x6 6x4 Word Embedding + Positional Embedding Linear weights for value Wien [073 | 46 | 096 | 166 [ 097 [| 12 ose [007 [07 | 035 vou | 122 | 017 | 006s | 070 | 09 | 074 02 | os7 | 061 | 0.35 pay | 092 | 151 | 027 | 131 | 086 | 1.58 os7 | os | os | os ome | 026 | 074 | 066 [022 | 007 | 037 | X o67 | 035 | 098 | 054 of [01 [050 [08 | o62 | 05 | 07 047 | 083 | 034 | 0.98 Grones | 036 | 13 | 076 | 148 | 094 | 121 ‘06 | 069 | 013 | 0.98 6x6 6x4 Calculating Key and Value Matrices hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 12183 srar24, 111 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding So, after multiplying matrices, the resultant query, key, and values are obtained: Query 3es_| 38 | 408 | 342 2ss | 196 | 277 | 178 aa | 36 | 349 | 272 oe: |i iis; |i aaa |? 45 19 | 156 | 108 | 153 aoe | 26 [L2ms. || 222 6x4 key value am_| 40a | 415 | 3a 3ea_| a8 | 408 | 3.42 21g | 251 | 164 | 193 255 | 196 | 277 | 178 3.28 | 311 | 365 | 3.01 3so_|_36 | sao | 272 1o7_| 113 | 164 | 135 102 | 11¢ | 124 | 13 asi |_.ate7!_|| ‘onal |_ aon ao) |l|_fs6_|_ te6 "|| 15s 251 | 30a | 345 | 2.28 aoa | 29 | 273 | 222 6x4 6x4 Query, Key, Value matrices Now that we have all three matrices, let’s start calculating single-head attention step by step. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 19143 star24, 111.0 Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding Query Transpose (Key) ae | 38 | aoe | aa 25s | 196 | 27 | 170 an | 21 | a2 | 107 | 149 | 251 aa [as [ae [am | yy [om fos [am [us {aor [aoe 102 | ste | 12 | 19 ats [noe | 30s | tee [21a | sas 19 [ass | 180 | ass aa 193 | am | 105 | 101 | 228 3.04 273 2.22 4x6 6x4 [Mati] 50.341 | 31.2000 | 49.7206 | 19.7596 | 20.1006 | 43.1604 (_Sottvax_) 34.5402 | 16.2056 | 29.6169 | 11.761 | 16.6133 | 25.6608 = _| 50.8796 | 27.3994 | 43.2409 | 17.0909 | 24.5349 | 37.695, ss 18.1304 | 9.728 | 15.4544 | 6.2134 | a.061_| 13.3808 26,3707 | 14.0997 | 22.5500 | 8.9445 | 12.6967 | 19.4858 «.a0ei | 22.668 | 35.6360 | 14008 | 20.103 | sosass | ——> Cuatwat) matrix multiplication between Query and Key For scaling the resultant matrix, we have to reuse the dimension of our embedding vector, which is 6. hntps:evelupgitcornectee,comlunerstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt aia srar24, 1-11 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding sea | 51.208 52 [48 7306 19,7538 | 20.006 | a3.16ae ‘34.5002 | 10.2058 | 206160 1.7761 | 168139 | 25.6608 50.8796 | 27.904 | 23.2409 17.0908 | 24.seo | 97.685, rei | 9.726 1sasee 62s | 98st 13.2004 26.707 | 14.085 7 | 225800 cas | 12.6067 | 19.4058 ‘axes | 22.668 | 95.6089 4.004 | 20.109 | s0s26s vay where d (dimension) is 6 23.81721 12.77409 | 20.30219 8.062904, 1.50852 17.62 14.1009 7.434201 | 12.09231 4.809165, 6.781004, 10.47973 20.7167 | 11.186 17.6526 6.976963, 10.01433 15.39096 7.401542 | 3.972256 | 6.307436. 2.535222, 3.612997 | 5.466445 10.7651 5.752218 | 9.205999 3.64974 5.184753) 7.956759 17.10152 9.254989 | 14.5497. 5.715476 8.205791 12.62712 Cama) Came) > Gam scaling the resultant matrix with dimension 5 The next step of masking is optional, and we won't be calculating it. Masking is like telling the model to focus only on what’s happened before a certain point and not peek into the future while figuring out the importance of different words in a sentence. It helps the model understand things in a step- by-step manner, without cheating by looking ahead. So now we will be applying the softmax operation on our scaled resultant matrix. hntps:evelupgitcornectee,comlunerstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 15183 s1ar24, 111.0 Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding (EEN er | os [om [os | ve SoftMax: 141 | 743 | 1209 | 481 | 678 | 1048 s(x) —u zor | 1139 | sas | ese | 1001 | 1520 Dee! war [57 | sat | 30s | aie | 708 va [eas Duss Dan [an [ees as softmax(23.82) = rary Gait 4 gas BOF GAT y GIR L (ener) Eo or 0 | 0002 | ———> (Gavan se | aooit { axts2 | acoo1 | ocoos | o.022 = = 0.9534 | 0.0001 | 0.0421 oO 0.0044 0.6476 | 0.021 0.2177 | 0.005 | 0.0146 | 0.094 (teen) 0.7603 | 0.0052 | 0.164 | 0.0006 | 0.0023 | 0.047 a] 0.9174 | 0.0004 | 0.0716 0.0001 | 0.0105 axKy Applying soft max on resultant matrix Doing the final multiplication step to obtain the resultant matrix from single-head attention. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 16183 star24, 111.0 Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding oe) softmax| (iz Value ose |_o | oars] o | 0 | ooo ace | a8 | aoe | an 0.96 | oooii | 0.1152 | acoos | oo006 | 0.020 zs5_| 106 | 277 | 170 esse | oooor | oo | o | 0 | conus ase [a6 | s49 | 272 asa | ones | oxi | 005 | onus | ove | X [soe | use [1m [sa 0.7003 | ao0s2 | oes | 0.0006 | a.0029 | 0.047 19 | ise | ree | so 09174 | ooo | canis | 0 | 0001 | o.o105 aoa [ 29 [ 27a | 22 6x6 6x4 3.864257| 3.79246 | 4.060367| 3.39751 | ——————> [amu] 3,601295| 3.75252 |3.977937| 3.30861 = [3.855542] 3.787426] 4.04909 | 3.386086 Mask (o5t) 3.622841 | 3.584996] 3.750419 | 3.081634) 3.745786 | 3.706744] 3.904894 | 3.233619) (sexe) 3.835366] 3.77528 | 4.022837 | 3.356435) Tattaar We have calculated single-head attention, while multi-head attention comprises many single-head attentions, as I stated earlier. Below is a visual calculating the final matrix of single head attention of how it looks like: hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt mas 41424, 1-11 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding Multi Head Attention Multi Head attention in Transformer Each single-head attention has three inputs: query, key, and value, and each three have a different set of weights. Once all single-head attentions output their resultant matrices, they will all be concatenated, and the final concatenated matrix is once again transformed linearly by multiplying it with a set of weights matrix initialized with random values, which will later get updated when the transformer starts training. Since, in our case, we are considering a single-head attention, but this is how it looks if we are working with multi-head attention. ntps:evelupgitconnectes.com/understanding-ransformers-rom-startto-ond-a-step-by-stop-math-example-18d4e6¢o6ebt1 1643 s1ar24, 1-11 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding Single Head Attention Multi Head Attention (N Heads) Our Case Real world Case Concatenation 3.884257| 3.79246 | 4.060367) 3.39751 3.801295 | 3.75252 | 3.977997) 3.90061 3.855542| 3.767426] 4.04000 | 3.385086] 3.622641 | 3.564996] 3.750419] 3.081854 3.745766 | 3.706744 3.904898] 3.233519| 3.835966] 3.7752 | 4.022837] 3.356435 | (mata (Sona) ma) Cae) Ta (task ot) J es (x) GD} ee ee) Cotari_} or) qd ow Sev okey Single Head attention vs Multi Head attention In either case, whether it’s single-head or multi-head attention, the resultant matrix needs to be once again transformed linearly by multiplying a set of weights matrix. hntps:evelupgitcornectee,comlunerstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 19143 star24, 1-11. AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding o Linear weights soma la ) %. ue colina length mst be, sen ae TO (embedding positional) matrix columns length 38 | 375 | 398 | 331 oa [os | 045 | 054 | 007 | 058 306 | 379 | 405 | 3.39 xX oes | 074 | 078 0s 075 | 055 362_[ "358 [375 [3.08 os3_| oa | _os5 | os9 | o49 | 014 375 [371 [39 [323 o7 | 06 | 012 | 042 | 029 | 087 3ea_| 378 | 402 | 3:36 4x6 exa 10.84 | 945 | 7.33 78 6.09 | 7.66 10.65 | 9.28 | 722 | 767 | 599 | 7.51 10.83 | 943 | 7.33 | 7.79 | 608 | 7.65 10.08 | 8.77 | 685 | 7.25 | 567 | 7.09 0.48 [| 912 | 741 | 754 | 589 | 7.38 10.7 [| 9.38 | 7.29 | 7.75 | 6.05 76 normalizing single head attention matrix Make sure the linear set of weights matrix number of columns must be equal to the matrix that we computed earlier (word embedding + positional embedding) matrix number of columns, because the next step, we will be adding the resultant normalized matrix with (word embedding + positional embedding) matrix. Output of Multi Head attention Fi = 10.84 | 945 | 733 | 78 6.09 | 7.66 cee | |i 10.65 | 9.28 | 7.22 | 7.67 | 599 | 7.51 ee 10.83 | 943 | 733 | 7.79 | 608 | 7.65 10.08 | 8.77 | 685 | 7.25 | 567 | 7.09 feral 10.48 | 912 | 741 | 754 | 5.89 | 7.98 10.77 | 938 | 729 | 775 | 605 | 76 6x6 Output matrix of multi head attention hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 20183 srar24, 111 AM As we have computed the resultant matrix for multi-head attention, next, we will be working on adding and normalizing step. Step 8 — Adding and Normalizing Once we obtain the resultant matrix from multi-head attention, we have to add it to our original matrix. Let’s do it first. Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding ‘Word Embedding + Positional Embedding Wen [ore [a6 [oss [iss [amr [a2 you | -u2a[oa7_| 006 [ore | 09 | ove voy [os [ast | oz [ist | oss | 159 wne_|_oz6 [074 [06s | 022 [007 [037 # [012 |os9 [08 [062 | 05 | 07 Worse [036 | 13 [076 | 148 [ose | 129 6x6 Output of Multi Head attention wos [sas [733 | 78 | 609 | 705 106s 928 | 722 | 767 599 | 7.51 1os3_|043_| 733 | 779 | 608 _| 7.65 10.08_| 677 | 6as_| 7.25 | sa7_| 7.09 10ae[912 | 71 | 758 | 589 | 7.8 10.77 [938] 729 | 775 | 605 | 76 6x6 11.63 11.05 8.29 9.44 7.06 8.86 11.87 9.45 7.28 8.46 6.89 8.25 11.75 10.94 76 9.1 6.64 9.24 10.34 9.51 7.51 7.A7 5.74. 7.46 10.6 9.71 791 8.16 6.39 8.08 10.41 10.68 8.05 9.23 6.99 8.81 To normalize the above matrix, we need to compute the mean and standard ‘Adding matrices to perform add and norm step deviation row-wise for each row. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 2118s s1ar24, 1-11 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding mean = Row Wise Implementation Mean __ Standard Deviation 11.63 [11.05 [8.29 [9.44[7.06| 8.86] ————>__9.26 1.57 11.87 9.45 [7.28 [8.46] 6.89] 8.25| ————>_ 8.56 1.64 11.75| 10.94] 7.6 | 9.1 [6.64] 9.28]-————>|_9.04 1.76 10.34| 9.51 [7.51|7.47| 5.74] 7.46] ————>|_7.86. 1.51 10.6 [ 9.71 |7.91[8.16|6.39| 8.08] ————>|_ 8.37, 1.35) 10.41 [10.68 [8.05 [9.23 6.99] 8.81] ————__ 8.93 1.28 calculating meand and std we subtract each value of the matrix by the corresponding row mean and divide it by the corresponding standard deviation. Mean Sed 11.05] 629 9.44 ]7.06 [6.85] 11.87/ 9.45 |7.26[6.46|6.09/8.25| [ese [160 1175/1084] 7 | 91 [6st|9.28) [aoa | 176 10.34] 951 |751|7.47|5.74|7.46| [796 [151 106 [9.71 |7.91]8.16[629/6.08] [esr [1.35 10.41] 10.68 /6.05]9.23| 699/861] [eos | 1.26 value mean __ 11.63 — 9.26 std+error 1.57 + 0.0001 i H - aia _[-0.62[ 011 [ -14 | -0.25 202 | 0.54 | -0.78 | -0.06[ -1.02 | -0.19 154 | 108 | -082[ 0.03 | -136 | 0.11 164 | 1.09 | -0.23[-026| 14 | -0.26 165 | 0.99 | -0.34[-016[ -1.47 | -0.21 ais | 137 [-069|[ 0.23 | -152 | -0.09 normalizing the resultant matrix hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 22183 cs, 81084 Salvng Tantomery Hand A Stepy-Stp Mah Exaile by Fareed Kran | Level Up Cong Adding a small value of error prevents the denominator from being zero and avoids making the entire term infinity. Step 9 — Feed Forward Network After normalizing the matrix, it will be processed through a feedforward network. We will be using a very basic network that contains only one linear layer and one ReLU activation function layer. This is how it looks like visually: ReLU( max(0, x) Real world case (multiple layers) Linear Layer = X-W+b our case (one linear layer) tT = Feed Forward network comparison First, we need to calculate the linear layer by multiplying our last calculated matrix with a random set of weights matrix, which will be updated when the transformer starts learning, and adding the resultant matrix to a bias matrix that also contains random values. hntps:evelupgitcornectee,comlunerstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 23183 tar24, 111.0 Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding Matrix afrer add and norm step w ra_[ 116 [oe [on [1a [025 a TS 202 | 058 |-0.78 | 006 | 1.02 | 019 ox7_{ 082 | 063 | 046 | 006 1.54 | 1.08 [0.82 [0.03 [1.36 [011 | X | 053 | 087 ) 047 | 1) oat tee 108 Tos Lome ot 5 ‘053| 088 [038 | 009 | oe | 0.25 tes [oss[-ase[-ose [ray [om] | eet eas fore {oss fost [ose teas Passos ase | on ex6 6x6 we [on [on [om or Bias ‘02 [-126-[- 111 [012 | 046] 07 oss [ise [oso [os {os [oso] y i 2st ‘053 [097 [096 | ~015_| 016 | 02 oss_[-1_[-as7_[-on1 | 02 | oe ‘os2[102[ 061 | 026-014 | 02 x6 (lo [one [025 [ae [035 [04s] os | 125 | 109 | 056 | 057 | 115 oes | 144 | 196 | 054 | O81 | 142 = os | 136 | -057 | 081 | 068 | 1.04 oss [| iis | 123 | 057 [ 051 | 0.97 ose | 129 | -o62 | 053 | 055 | 1.09 1.04 12 oss | 068 | 049 | 0.97 6x6 Calculating Linear Layer After calculating the linear layer, we need to pass it through the ReLU layer and use its formula. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt alas ster24, 1-11. AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding ReLU(z) = max(0, x) 1.25 1.09 0.56 0.57 1.15 0.66 1.44 1.36 0.54 0.81 1.42 0.95, 1.36 -0.57 0.81 0.68 1.04 0.95 1.15 1.23 0.57, 0.51 0.97 0.98 1.29 0.62 0.53 0.55 1.09 1.04 1.2 0.86 0.68 0.49 0.97 6x6 max(0, 0.91) [isi] 1.25 [1.09] 0.56 | 0.57] 1.15 == 0.66 | 1.44/1.36 | 0.54 | 0.81 | 1.42 Forward 0.95[ 1.36] 0 | 0.81 | 0.68 | 1.04 0.95 | 1.15 |1.23 | 0.57 | 0.51 | 0.97 Fara Nom 0.98129] 0 | 0.53| 0.55 | 1.09 1.04] 1.2 [0.86] 0.68 | 0.49 | 0.97 Calculating ReLU Layer Step 10 — Adding and Normalizing Again Once we obtain the resultant matrix from feed forward network, we have to add it to the matrix that is obtained from previous add and norm step, and then normalizing it using the row wise mean and standard deviation. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 25183 star24, 1-11 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding Matrix from Feed Forward Network Matrix from Previous Add and Norm Step oat [125] 1.08] 086 [O57 [15 Te Pa] eyo 0.66] 1.44 1.36] 054 [0.81 [1.42 per ost are | aos [ar [039 tetieteterteetie |p Fectietsetestister 095/1.15]1.23] 057 [051 [0.97 je tee tat | em |__| ae 096/129] 0 [083/055 109 nei [aes | ans [ise | on 108] 1.2 [0.66] 0.69 | 049] 097 Mean Std 1.0033 [1.103534 za a ae % 1.1233 [1.214349 — Pes hea ee Pas is] —» [0.9033 [1.301837 TT soe Ci oar 9.9933 | 1.289058 os Pe Pata 0.8167 |1.306016| 0.95 [1.320773 ize [ 126 | -o4s | -03 | -166 | -0.09 ize | om | 045 | -053 | 11 [0.09 122 | 118 | 1.32 | -005 | -1.22 | 0.19 124 [097 | 001 | -053 | -1.46 | 0.22 Forward 139 [112 | -o89 | -o34 | -133 [0.05 09s | 123 | -059 | -003 | -15 | -0.05 =i ‘Add and Norm after Feed Forward Network ‘The output matrix of this add and norm step will serve as the query and key matrix in one of the multi-head attention mechanisms present in the decoder part, which you can easily understand by tracing outward from the add and norm to the decoder section. Step 11 — Decoder Part The good news is that up until now, we have calculated Encoder part, all the steps that we have performed, from encoding our dataset to passing our matrix through the feedforward network, are unique. It means we haven't calculated them before. But from now on, all the upcoming steps that is the remaining architecture of the transformer (Decoder part) are going to involve similar kinds of matrix multiplications. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 26183 srar24, 111 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding Take a look at our transformer architecture. What we have covered so far and what we have to cover yet: What we have covered so far What we have to cover Output Probabities Tres em Feed evar Tan ‘Aertan Ne Nasied aoa ‘tern Postrel Eb Poston Encoding QY Encoding Tot Sues Emeeg enbessra t Inputs Cutouts Upcoming steps illustration We won't be calculating the entire decoder because most of its portion contains similar calculations to what we have already done in the encoder. Calculating the decoder in detail would only make the blog lengthy due to repetitive steps. Instead, we only need to focus on the calculations of the input and output of the decoder. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 278s srar24, 111 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding When training, there are two inputs to the decoder. One is from the encoder, where the output matrix of the last add and norm layer serves as the query and key for the second multi-head attention layer in the decoder part. Below is the visualization of it (from batool haider): Encoder 1 Visualization is from Batol Haider While the value matrix comes from the decoder after the first add and norm step. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 28183 srar24, 111 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding The second input to the decoder is the predicted text. If you remember, our input to the encoder is when you play game of thrones so the input to the decoder is the predicted text, which in our case is you win or you die . But the predicted input text needs to follow a standard wrapping of tokens that make the transformer aware of where to start and where to end. Encoder Input ———> When you play game of thrones Decoder Input ———> you win or you die input comparison of encoder and decoder Where and are two new tokens being introduced. Moreover, the decoder takes one token as an input at a time. It means that will be served as an input, and you must be the predicted text for it. hntps:evelupgitcornectee,comlunerstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 29183 star24, 111.0 Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding a a = 5 EQ) Pesiti¢ ? © YY Encod el jut } < 0.31 dding 0.21 | 0.12 0.64 uts Outputs 0.98 (shifted right) 0.2 Decoder input word ‘As we already know, these embeddings are filled with random values, which will later be updated during the training process. Compute rest of the blocks in the same way that we computed earlier in the encoder part. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 30183 41424, 1-11 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding Output Probabilities Calculation is same as in Encoder moods Outputs Calculating Decoder Before diving into any further details, we need to understand what masked multi-head attention is, using a simple mathematical example. Step 12 — Understanding Mask Multi Head Attention Ina Transformer, the masked multi-head attention is like a spotlight that a model uses to focus on different parts of a sentence. It’s special because it doesn’t let the model cheat by looking at words that come later in the sentence. This helps the model understand and generate sentences step by step, which is important in tasks like talking or translating words into another language. ntps:evelupgitconnectes.com/understanding-ransformers-rom-startto-ond-a-step-by-stop-math-example-18d4e6¢o6ebt1 31a 41424, 1211 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding Suppose we have the following input matrix, where each row represents a position in the sequence, and each column represents a feature: 1 Input Matrix 4 7 Co Ot bh OMDW inpur matrix for masked multi head attentions Now, let’s understand the masked multi-head attention components having two heads: » . Linear Projections (Query, Key, Value): Assume the linear projections for each head: Head 1: Wql,WK1,Wv1 and Head 2: Wq2,Wk2,Wv2 nN . Calculate Attention Scores: For each head, calculate attention scores using the dot product of Query and Key, and apply the mask to prevent attending to future positions. » . Apply Softmax: Apply the softmax function to obtain attention weights. + . Weighted Summation (Value): Multiply the attention weights by the Value to get the weighted sum for each head. a . Concatenate and Linear Transformation: Concatenate the outputs from both heads and apply a linear transformation. Let’s do a simplified calculation: ntps:evelupgitconnectes.com/understanding-ransformers-rom-startto-ond-a-step-by-stop-math-example-18d4e6¢o6ebt1 323 srar24, 111 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding Assuming two conditions + Wql = Wk1 = Wv1 = Wq2 = Wk2 = Wv2=I, the identity matrix. °@ ‘Input Matrix Head 1: Head 2: Qi = K, =V; = Input Matrix Qz = Ky = Va = Input Matrix A, =Q:- KT Ay = Qo: KY 147 23 =|2 58 56 369 8 9 100 23 =|4 5 0] (Masked) 5 6] (Masked) 789 09 softmax(A1) =WeM Concatenate and Linear Transformation: Coneatenate([O1, O2)) (Apply Learnable Linear Transformation) Mask Multi Head Attention (Two Heads) The concatenation step combines the outputs from the two attention heads into a single set of information. Imagine you have two friends who each give you advice on a problem. Concatenating their advice means putting both pieces of advice together so that you have a more complete view of what they suggest. In the context of the transformer model, this step helps capture different aspects of the input data from multiple perspectives, contributing toa richer representation that the model can use for further processing. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 33143 sara, 11 AM Solving Transformer by Hand: A Step-by-Step Math Example [by Fareed Kan | Level Up Coding Step 13 — Calculating the Predicted Word The output matrix of the last add and norm block of the decoder must contain the same number of rows as the input matrix, while the number of columns can be any. Here, we work with 6. Outrut Prbepies Rows must be 6 while columns can be any length 1.2 133 [ 211 | 262 [ 306 | 354 ose | 145 | 224 | 278 | 378 | 455 134 [204 [ 226 | 266 | 356 | 3.96 oso | 179 | 246 | 249 | 332 | 3.64 oot | 104 [| 163 | 192 | 195 | 21 oi2 | 099 | 12 156 | 157 17 =) ‘Add and Norm output of decoder The last add and norm block resultant matrix of the decoder must be flattened in order to match it with a linear layer to find the predicted probability of each unique word in our dataset (corpus). hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 44s a6, 1:44AM Solving Transformer by Hand A Stop-by-Step Math Example | by Fareed Kan | Level Up Coding 12 1.33 211 2.62 3.06 4 3.54) Flatten the matrix 0.56 a 12 [| 133 [ 211 [ 262 [ 306 | 354 23 oss | 145 | 224 | 278 | 3.78 | 4.55 278 1.34 2.04 2.26 2.66 3.56 3.86 az8! oso | 179 | 246 | 249 | 3.32 | 3.64 458 ost | 104 | 183 | 192 [ 195 [21 . oi2 | o99 | 12 | 156 | 157 | 17 This flattened layer will be passed through a linear layer to compute the logits (scores) of each unique word in our dataset. hntps:evelupgitcornectee,comlunerstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt flattened the last add and norm block matrix 012 0.99 12 1.56 157 17 35183 srar24, 111 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding Linear Layer = X -W No Bias Linear set of weights rink | things | He af 2 [a eos Pa os [ om [om xt 1 [os [oa 037 Flatten Layer (Row Matrix) om {em [ov | +++ Last 023 | oe | oss 076 1.2[1.33[241] ee+ |1.2]1.56[1.57]17 4 075 | 076 | 042] 444 | 026 1x 1n (our case nis 36) ase [os [oa a a3 | 036 | 00s os 035 | oa | one 087 nxm (m is 23 Vocab Size) 1 | drink | things| .,, | you | ,,, | He : a a 7 2 logits = iis [23 [ 115] e** [5a] ee [25 Calculating Logits Once we obtain the logits, we can use the softmax function to normalize them and find the word that contains the highest probability. hntps:evelupgitcornectee,comlunerstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt star24, 111 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding 1_| drink [things] .,, [you ] .., [_He . 1 2 3 17 23 logits = iia [23 | 115 sa] e+ (23 Applying softmax @) => si) = Sy 2; drink | things you He sees. 1 z 3 Psa 17 23 Probabilities = 0.21 0.05, 0.001 oor 0.56 ver 0.12 , highest Probability Finding the Predicted word So based on our calculations, the predicted word from the decoder is you. hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 37183 tar24, 1211.0 Solving Transformer by Hand: A Step-by-Step Math Example | by Farced Knan | Level Up Coding you 7 0.56 you is the predicted word, which will now act an an input for our decoder ‘and so on. Output Probabilties Final output of decoder This predicted word you, will be treated as the input word for the decoder, and this process continues until the token is predicted. Important Points 1. The above example is very simple, as it does not involve epochs or any other important parameters that can only be visualized using a programming language like Python. 2. It has shown the process only until training, while evaluation or testing cannot be visually seen using this matrix approach. 3. Masked multi-head attentions can be used to prevent the transformer from looking at the future, helping to avoid overfitting your model. Conclusion hntps:evelupgitcornectee,comlunerstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 38143 cs, 81084 Salvng Tantomery Hand A Stepy-Stp Mah Exaile by Fareed Kran | Level Up Cong In this blog, I have shown you a very basic way of how transformers mathematically work using matrix approaches. We have applied positional encoding, softmax, feedforward network, and most importantly, multi-head attention. In the future, I will be posting more blogs on transformers and LLM as my core focus is on NLP. More importantly, if you want to build your own million-parameter LLM from scratch using Python, I have written a blog on it which has received a lot of appreciation on Medium. You can read it here: Building a Million-Parameter LLM from Scratch Using Python A Step-by-Step Guide to Replicating LLaMA Architecture levelupptconnectedicom Have a great time reading! Data Science Artificial Intelligence Machine Learning Python Deep Learning ¢ hntps:evelupgiteornectee,comlunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 39143 41424, 1211 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding @ Written by Fareed Khan Croton) O 21K Followers + Writer for Level Up Coding MSc Data Science, | write on Al https://wwwlinkedin.com/in/fareed-khan-dew/ More from Fareed Khan and Level Up Coding @) Fareed Khan @ Tirendaz Al in Level Up Coding Modern GUI using Tkinter How to Use ChatGPT in Daily Life? There are two things to remember: Save time and money using Chat@PT 4minread - Sep 4,2022 @minroad - Apr 4,2023 &) 59k Q 107 imi oe i System Design Interview Question: Building a Million-Parameter LLM Design Spotify from Scratch Using Python @ Hayk Simonyan in Level Up Coding @ Fareed Khan in Level Up Coding ntps:evelupgitconnectes.com/understanding-ransformers-rom-startto-ond-a-step-by-stop-math-example-18d4e6¢o6ebt1 4049 41424, 1211 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding High-level overview of a System Design A Step-by-Step Guide to Replicating LlaMA Interview Question - Design Spotify. Architecture Sminread + Feb 21,2024 25minread » Dec, 2023 4K Qs wt we 25K QQ 30 Chow See all from Fareed Khan See all from Level Up Coding Recommended from Medium a kaggle imi @veepLeorning A © Benedict Neo in btgrit Data Science Publication XQ in The Research Nest Roadmap to Learn Al in 2024 Explained: Transformers for A free curriculum for hackers and Everyone programmers to learn Al The underlying architecture of modern LLMs Mmin read + Mar 11,2024 1Smin read » Mar 8,2024 & a4x Q 100 too mm Qs tio ntps:evelupgitconnectes.com/understanding-ransformers-rom-startto-ond-a-step-by-stop-math-example-18d4e6¢o6ebt1 449 41424, 1211 AM Solving Transformer by Hand: A Step-by-Step Math Example | by Farsed Knan | Level Up Coding Lists Predictive Modeling w/ === Practical Guides to Machine Python Learning 20stories » 1060 saves TOstories - 1269saves Natural Language Processing = === —\} ChatGPT 1249 stories - 828 saves ECT 2istories - 551 saves @ Austin star. in Arica intetigence in Plain Ena Reinforcement Learning is Dead. Long Live the Transformer! from Scratch Using Python Large Language Models are more powerful A Step-by-Step Guide to Replicating LLaMA than you imagine Architecture + © Bminread - Jan 13,2024 25minread - Dec7, 2023 Sy 199k QQ 48 af we Sp 25K

You might also like