Version 1.0
January 31, 2022
Status: 75% confident of total info accuracy, and I am sure the core is in there with a little extra
decoding in sections beyond the ToC. Better accessibility and reliability will come in future
editions (after chapter VII.) Formatting needs most work, additional sections to be determined.
Table of Contents
I. The Basics
FOREWORD:
MasterWaffle(-r [a play on words]) is an individual who embraces the idea of fluidity and
open-mindedness. I view beliefs, ideas, and preferences as dynamic and constantly changing,
rather than fixed and rigid. My approach to communication reflects this perspective, as I
intentionally use loose speech patterns and non-definitive language to promote flexibility in
thinking. This approach allows for more creative problem solving and encourages people to
expand their perspectives and consider new ideas. Being a MasterWaffle is not just about
avoiding rigidity in thought, but also about fostering an environment of exploration and learning.
As a Polymath new to the field of Artificial Intelligence, I am thrilled to present this
notebook, which provides an in-depth examination of this rapidly evolving field, as I grow to
understand it. The aim is to do so in a way that translates the complexity of this vastly difficult
subject to anyone who may not have a strong background knowledge, but a drive to learn all the
same. I have dedicated my life to exploring and educating others on the intricacies of this
wobbly subject and its impact on our world. This is a work in progress towards that, and I hope
to see it well-rounded, fact checked, and totally complete some day.
It is my firm belief that knowledge should be accessible to all, free of charge and
available to be used by as many people who want to read it. This book is my contribution to that
cause. It is my hope that this work will serve as a valuable resource for students, professionals,
and anyone with an interest in AI, who cannot understand the highly info-dense and jargon-filled
material available beyond blog spaces. I am not seeking to profit from this textbook. However, I
do welcome any freely given donations, and I will try and consider any offers for trade for
services. I will not impose any paywalls or other financial barriers to access this information. I
invite you to join me on this exciting journey as we delve into the fascinating world of Artificial
Intelligence!
Feel free to forward any Questions/Comments/etc. to me on Discord!!!
The Basics
A ChatBot is a computer program or algorithmic information processing system that is
designed to emulate human-like diction via language modeling and prediction, through
auditory, visual, symbolic, or other context-based methods. The more advanced the
architecture, the more computation (counted in floating-point operations, or FLOPs) is needed
to complete each operation, but the more the model can do in return to produce the best
output. A large General Language Model or GLM like ChatGPT (built on the GPT-3 family, whose
largest model has ~175 billion learned parameters) uses many attention heads to process the
input together with its learned embeddings. At its core, the primary function of a chatbot is to
interact with a user in “natural language”, understand their intent and respond by providing
them with relevant information or performing certain tasks.
Chatbots can be classified into two main categories -- rule-based and self-learning -- but
the two are often used together in high dimensional text processing. At the most basic level, a
Chat Bot differs from a Search Engine in that it does not return a directory of links to relevant
information but instead summarizes the information asked for and returns it in whatever
format is requested. (It can include web links that appeared in its training data, but this is
highly limited compared to a search engine.)
Rule-based chatbots rely on a set of predefined rules and patterns to understand and
respond to user input. Think Flow Chart but represented as a program to algorithmically process
the input to output. These rules are typically implemented using a decision tree or a finite state
machine. While rule-based chatbots are relatively simple to implement, they lack the ability to
understand context or adapt to changing scenarios. Early examples of rule-based chat bots
include Ask Jeeves and MUD (Multi-User Dungeon) text adventures. They require relatively little processing power,
have existing open source libraries, and can be programmed to perform any number of very
specific processing tasks. For example, text cleaning is an important step in general natural
language processing, and a rule-based approach can be used to remove unwanted characters,
words, or patterns from the text. This can also help improve the quality of the input data and
make it more suitable for formatting (or tokenizing) input vocabulary for use in a Self-Learning
AI core later in the process. For example, a rule-based chatbot could be used to correct spelling
mistakes or standardize the text (e.g. converting all text to lowercase or expanding contractions
like “it’s” to “it is”). By using a rule-based approach as a pre-step, the text is prepared in a way
that the GPT model can process it even more effectively, the output will be of better quality and
it will be more relevant to the desired task. Some of this is built into the model’s tokenization, but
things such as keyword filtering are up to the developers using the API in their apps and
web app designs, and can be helpful when designing the application layer, whether it uses a
focused and Fine-Tuned Model or just a General Language Model.
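As a rough sketch of that rule-based pre-step, here is a small Python cleaner. The contraction
table, the tag-stripping rule, and the list of allowed characters are my own made-up examples,
not anything the GPT tokenizer actually does:

```python
import re

# Hypothetical contraction table; a real application would need a much longer one.
CONTRACTIONS = {"it's": "it is", "don't": "do not", "can't": "cannot"}

def clean_text(text: str) -> str:
    """Apply fixed cleaning rules in order; no learning is involved."""
    text = text.replace("\u2019", "'")             # curly apostrophe -> straight apostrophe
    text = text.lower()                            # standardize case
    text = re.sub(r"<[^>]+>", " ", text)           # drop HTML-style tags
    for short, full in CONTRACTIONS.items():       # expand known contractions
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"[^a-z0-9\s.,?!']", " ", text)  # remove unwanted characters
    return re.sub(r"\s+", " ", text).strip()       # collapse repeated whitespace

print(clean_text("It\u2019s   a <b>GREAT</b> day, don't you think?"))
# prints: it is a great day, do not you think?
```

Every rule here is fixed in advance, which is exactly why it is cheap to run and also why it
cannot handle a contraction that is missing from the table.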
Self-learning AI, also known as machine learning, uses algorithms that enable the AI
system to improve its performance over time without being explicitly programmed. It uses data
to learn and make predictions or decisions, whereas rule-based systems follow a set of
predetermined rules. Self-learning AI is more flexible and adaptable compared to rule-based
systems and can handle complex, dynamic and unpredictable situations. Rule-based systems
can only perform tasks within the constraints of their predefined rules and cannot adapt to new
situations without being reprogrammed for error handling. This rigidity can be really problematic
in situations like self-driving cars, where data can change and flexibility is critical. That same
flexibility is very useful in Large Language Models (LLM) designed for all manner of dialogue.
Neural Networks: a type of model that is inspired by the structure and function of the
human brain (Neuroscience) and can be used to perform a wide range of tasks, including natural
language processing, machine learning, computer vision or speech recognition, and so on to
mimic stimulus sensory response. Neural networks consist of layers of interconnected nodes,
or artificial neurons, that are trained to recognize patterns in the input data. These patterns are
used to generate appropriate responses to the environment (in this case the input session).
Deep learning is a subfield of machine learning that utilizes neural networks with many
layers to analyze and understand complex datasets. Each layer builds on the patterns found by
the layer before it, which allows the chatbot to work out the meaning of the user's input and
generate appropriate responses on multiple levels. AI training can be
time-consuming and requires significant computational resources, but it's necessary to make
sure that the AI system can perform its tasks accurately.
Training: includes supervised learning, unsupervised learning, and reinforcement
learning. In supervised learning, the training dataset includes labeled examples of the desired
output, allowing the AI system to learn by comparing its predictions to the correct answers. In
unsupervised learning, the training dataset is unlabeled, and the AI system is left to discover its
own patterns and insights. Reinforcement learning is where the AI system learns through trial
and error, receiving rewards or penalties based on the outcomes of its actions.
A Painfully Brief history of chatbots: Chatbots have been around since the 1960s, with
early chatbot programs such as ELIZA and PARRY, which would inspire the MUD text-based
adventures of the 80s. In the 1990s and 2000s, chatbot-style interfaces began to be integrated
into websites and messaging platforms, with services such as Ask Jeeves allowing users to ask
questions in an even more natural way. In recent years, advancements in natural language
processing and machine learning have led to the development of more advanced chatbots, such
as ChatGPT, which combines several highly advanced relationships in mathematics and computer
technology built up over 80 years that we summarize today as Artificial Intelligence. One of the
biggest contributions to this field is Information Theory, published by Claude Shannon in 1948
at Bell Labs, only months after the world-changing transistor was invented there.
AI models are created using techniques such as Deep Learning and Topic Modeling.
These techniques allow computers to learn from data and make predictions based on patterns
in datasets selected by developers. The process of creating an AI model involves several
steps, each of which is crucial to the success of the model. This is the general process:
1 a. To create your own AI model, you can either use existing deep learning techniques,
such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), or you can
develop your own deep learning techniques to extract relevant data from text. Topic modeling is
a machine learning technique used for text data specifically. It involves identifying topics and
themes in a set of documents or text data. You can use existing topic modeling techniques,
such as Latent Dirichlet Allocation (LDA), or develop your own topic modeling technique.
1 b. Selecting the “Market Model”: The first step in creating an AI model is selecting the
type of model that is best suited for the problem you are trying to solve. Different types of AI
models include supervised learning, unsupervised learning, reinforcement learning, etc. It's
important to choose the model that will work best for your data set and the problem you are
trying to solve. There are AI models for general text, image generation (such as Stable Diffusion), and more.
2. Data Preparation: The next step is preparing your data for input into the model. This
includes collecting, cleaning, and formatting the data. It's important to choose the right features
and format the data in a way that the model can use it as designed.
3. Training: Once the data has been prepared, it's time to train the model. This involves
feeding the data into the model and allowing it to learn the relationships between the input data
and the output predictions. The goal is to train the model to make accurate predictions based on
patterns in the data. This can be tedious, but is vital to give the model a feel of the goals. There
are also many options to use feedback to assist, such as OpenAI's alignment work, which uses
reinforcement learning from human feedback (RLHF) to steer the model, and the public research
preview of ChatGPT, which gathers feedback from everyday users.
4. Validation: After training the model, it's important to validate its performance to
ensure that it is accurate and to avoid overfitting. This is done by evaluating the model on a
separate data set that it has not seen before to see how well its training worked.
5. Fine-tuning: Based on the results of the validation, you may need to fine-tune the
model. This can involve adjusting the hyperparameters, changing the architecture, or making
other modifications to improve the performance of the model. Then repeat 2-4 until valid.
6. Deployment: Once the model has been trained and validated, it's time to deploy it. This
involves integrating the model into a real-world application and making it accessible for use.
7. Monitoring and Maintenance: The final step is to continuously monitor the
performance of the deployed model and update it as necessary. This may involve adding new
data to the training set, updating the model architecture, or making other modifications to
improve its performance over time. The OpenAI structure seems to be iterative evolutions in
complexity from model to model via exploration and discovery of that new complexity.
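Steps 2 to 4 above can be sketched in a few lines of plain Python. This is only a toy under my
own assumptions: the "model" is a straight line y = w*x + b, the data is invented, and the
learning rate and number of passes are arbitrary picks rather than recommended values.

```python
import random

random.seed(0)

# Step 2 (data preparation): invented data that follows y = 2x + 1 plus a little noise.
data = [(x / 10, 2 * (x / 10) + 1 + random.uniform(-0.05, 0.05)) for x in range(50)]
random.shuffle(data)
train, valid = data[:40], data[40:]          # hold out 20% that training never sees

def mse(w, b, rows):
    """Mean squared error of the line w*x + b on the given rows."""
    return sum((w * x + b - y) ** 2 for x, y in rows) / len(rows)

# Step 3 (training): plain gradient descent on the training rows only.
w, b, learning_rate = 0.0, 0.0, 0.05
for _ in range(2000):
    grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
    grad_b = sum(2 * (w * x + b - y) for x, y in train) / len(train)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# Step 4 (validation): compare the error on seen and unseen rows.
train_error = mse(w, b, train)
valid_error = mse(w, b, valid)
print(f"w={w:.2f} b={b:.2f} train_error={train_error:.4f} valid_error={valid_error:.4f}")
```

If the validation error were much larger than the training error, that would be the sign of
overfitting that step 4 is meant to catch.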
Math-Magical Vocabulary
With the help of mathematics, self-learning chatbots can understand the user's intent
and context, and generate responses that are more relevant and useful. This section is only
intended to provide definitions for the future discussions of how this stuff works, and will not
cover too much detail beyond base. Math is just Symbolic Logic:
An algorithm is a set of instructions or a step-by-step procedure for solving a problem or
achieving a specific task. Algorithms can be implemented in computer programs to automate a
specific process or solve a problem. Algorithms are designed to take input, process it, and
produce output in systematic methods and with fixed-definitions. They are used to perform
tasks, such as sorting data, searching for information, and performing calculations.
A Heuristic is a problem-solving strategy that uses a practical approach to find a
solution that is not guaranteed to be optimal or perfect, but which is “good enough” for a
specific situation. Heuristics are often used in situations where an exact solution is not known
or it is impractical to use an algorithm due to a lack of information, or because of complexity
and the limit to the system’s computational resources. They are often based on experience and
previous knowledge and are used to find approximate solutions quickly using analytics.
Statistics deals with collecting, analyzing, and interpreting data. It provides methods for
summarizing and making inferences about data, as well as for modeling and predicting future
events based on past data. It is used in a wide range of fields, including business, economics,
social sciences, and health sciences, to make decisions and test hypotheses.
Calculus deals with rates of change and accumulation of quantities. It provides methods
for studying functions and their properties, such as continuity, derivatives and integrals.
Calculus is used to model physical phenomena, such as motion and change in physical
systems, and is used widely in computer science. In machine learning, derivatives (gradients)
tell the training process which way to adjust the model.
Linear Algebra deals with the study of linear equations, vector spaces, and linear
transformations. It provides methods for solving systems of linear equations, and for
understanding the properties of matrices, vectors and linear transformations. Linear algebra is
used in a wide range of fields like physics, engineering, computer science, and economics, to
help humans both model and solve problems with variables and standard operations.
Graph theory deals with the study of graphs, which are mathematical structures used to
model pairwise relationships between objects. Graph theory provides methods for
understanding the properties of graphs, such as connectivity and network structure. It is used in
a wide range of fields, including computer science, operations research, and engineering to
model and solve problems in areas such as transportation and communication networks. This
concept is how dimensions are introduced to add layers of complexity: a neural network is
itself a graph of nodes and weighted connections.
Information theory deals with the study of the representation, processing and
transmission of information. It provides methods for measuring and analyzing the amount of
information in a system, and for understanding the limits of data compression and data
transmission. Information theory is used in a wide range of fields, including computer science,
engineering, physics, and telecommunications, to design efficient communication systems and
to understand the fundamental limits of data compression and transmission within domains.
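To make “measuring the amount of information” a little more concrete, here is Shannon's
entropy formula, H = sum of p * log2(1/p) over every symbol, applied to the characters of a
short string. The function name is my own, and treating each character as an independent
symbol is a simplifying assumption:

```python
import math
from collections import Counter

def entropy_bits(text: str) -> float:
    """Average information per character, in bits, from character frequencies."""
    counts = Counter(text)
    total = len(text)
    # p = count / total, so log2(1 / p) = log2(total / count)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(entropy_bits("aaaa"))   # one symbol only: nothing to learn from each character
print(entropy_bits("abab"))   # two equally likely symbols
print(entropy_bits("abcd"))   # four equally likely symbols
```

The more evenly spread the symbols are, the more bits each one carries, and the harder the
text is to compress.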
The Add operation is the addition of two vectors. At the input of the model, the first one is
the initial embedding of the word and the second one is a vector that represents the position of
the word in the sentence, also known as the positional encoding. This position vector is learned
during the training process in GPT-style models (the original Transformer used fixed sine and
cosine patterns instead). The purpose of this operation is to add the information about the
position of the word to its embedding, so the model can understand word order and the context
of the sentence. Inside each Transformer layer, the “Add” of an “Add & Norm” step is a residual
connection: the input of a sub-layer is added back to that sub-layer's output.
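A minimal sketch of adding position information to word embeddings, using tiny made-up
vectors. Real models learn these numbers during training and use far more than 4 dimensions;
the two tables and the function name below are invented for illustration:

```python
# Made-up 4-dimensional word vectors (a trained model would learn these).
TOKEN_EMBEDDING = {
    "dog": [0.2, 0.1, 0.0, 0.5],
    "bites": [0.7, 0.3, 0.1, 0.0],
    "man": [0.1, 0.6, 0.4, 0.2],
}
# Made-up position vectors, one per slot in the sentence.
POSITION_EMBEDDING = [
    [0.01, 0.00, 0.02, 0.00],   # position 0
    [0.00, 0.03, 0.00, 0.01],   # position 1
    [0.02, 0.02, 0.01, 0.03],   # position 2
]

def embed(tokens):
    """Return one vector per token: word embedding + position embedding."""
    return [
        [t + p for t, p in zip(TOKEN_EMBEDDING[tok], POSITION_EMBEDDING[pos])]
        for pos, tok in enumerate(tokens)
    ]

a = embed(["dog", "bites", "man"])
b = embed(["man", "bites", "dog"])
print(a[0] != b[2])   # same word "dog", different position, different vector
```

Without the position vectors, "dog bites man" and "man bites dog" would hand the model the
same set of word vectors.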
The Norm operation is the normalization of the vectors obtained from the Add
operation. It is used to ensure that the embeddings have the same scale and distribution. This is
done by applying a normalization function to each vector; Transformers use Layer Normalization
rather than Batch Normalization. The purpose of this operation is to make sure that the
embeddings for different positions in the sentence have the same scale and distribution, which
allows the Multi-Head Self-Attention system to process them consistently.
Dropout: Is a technique used to regularize neural networks by randomly dropping out
(setting to zero) some of the neurons during training. The idea behind dropout is to prevent
overfitting by making the model less dependent on the specific training data. Because a
different random set of neurons is dropped at each training step, the model is forced to rely on
other neurons to make predictions and cannot lean too heavily on any single one. Dropout is
switched off when the trained model is used. This technique is widely used in deep learning to
improve the performance of neural network models.
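Here is a small sketch of dropout on a list of activations. It uses the common “inverted
dropout” variant, where the surviving values are scaled up so the expected total stays about the
same; the function name and sample numbers are mine, and the rate is assumed to be below 1:

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(activations)             # dropout is switched off outside training
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

random.seed(42)
layer_output = [0.5, 1.2, 0.8, 2.0, 0.3, 1.5]
print(dropout(layer_output, rate=0.5))                   # some values zeroed, the rest doubled
print(dropout(layer_output, rate=0.5, training=False))   # unchanged
```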
Layer normalization: Is a technique used to normalize the activations of a neural
network. It is applied to each datapoint on its own, across all the activations (features) of a
layer, instead of across a batch of datapoints as in batch normalization. The main benefit of
layer normalization is that it helps to stabilize the training of deep neural networks; this is often
explained as reducing the internal covariate shift.
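As a sketch, layer normalization of a single vector subtracts that vector's mean and divides by
its standard deviation. The `gain` and `bias` arguments stand in for the learned scale and shift
of a real layer, and are fixed here as an assumption to keep things short:

```python
import math

def layer_norm(x, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize one vector across its own features (not across a batch)."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [gain * (v - mean) / math.sqrt(variance + eps) + bias for v in x]

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in normalized])
```

The small `eps` only guards against dividing by zero when every value in the vector is the same.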
Covariate Shift: Refers to the change in the distribution of the input data to a model
during training or inference. This can occur when the model is trained on a different distribution
of data than it is tested on. When this happens, the model's performance can degrade because
it is not able to generalize to the new data. This is a common problem in machine learning and
can be mitigated using techniques such as domain adaptation and sample weighting.
Domain adaptation: Is a technique used to improve the performance of a machine
learning model when it is applied to a new domain or task. This technique aims to reduce the
impact of the distribution shift between the training and the test data. It can be achieved by
using techniques such as fine-tuning a pre-trained model on the target domain, re-weighting the
training data, or using adversarial training. The goal of domain adaptation is to make the model
more robust to changes in the input data distribution and to improve its generalization ability.
Sample Weighting: Is a technique used to assign different importance to different
samples in a dataset. By assigning different weights to different samples, the model can be
made to focus more on certain samples, or to downweight the impact of other samples. This
technique is used to handle imbalanced datasets, where one class has many more samples
than the other classes. By assigning higher weights to the minority class samples, the model
can learn to better distinguish between the classes.
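One simple way to pick those weights is inverse class frequency: total samples divided by
(number of classes times the count of that class). This is a sketch with invented labels, and it
is only one heuristic among many:

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Invented, imbalanced dataset: 90 "ok" messages and 10 "spam" messages.
labels = ["ok"] * 90 + ["spam"] * 10
weights = class_weights(labels)
print(weights)
```

With these weights, the 10 "spam" samples together count for as much as the 90 "ok" samples
when the weighted errors are added up during training.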
Specifics of a ChatBot:
Natural Language Processing (NLP) deals with the interaction between computers and
human language. It is an interdisciplinary field that includes linguistics, computer science, and
cognitive psychology. The goal of NLP is to enable computers to [step 1] understand, [step 2]
interpret, and [step 3] generate human-like responses in a way that is both accurate and natural,
meaning subservient and pleasant. NLP is a complex field that involves several different tasks
and subfields, such as natural language understanding (NLU), natural language generation
(NLG), speech recognition, dialogue management, text-to-speech (TTS), language modeling,
sentiment analysis, and more. (Text prompts for image generators such as Stable Diffusion
build on the same ideas.)
The ChatBot process starts with the raw text input, pre-processing it to remove
noise (text cleaning) and normalizing it in order to standardize it. After
pre-processing, a classical NLP pipeline passes the text through techniques such as tokenization,
part-of-speech tagging, named entity recognition, and syntactic parsing, to extract relevant
information and structure. The processed data is then used for various NLP tasks such as
information retrieval, question answering, machine translation, text summarization, sentiment
analysis, and text generation. This process is not capable of producing truly novel knowledge
without people to fill in the gaps in clever & novel ways. People who try it
and are disappointed by it often don’t understand that this is a tool, and not AGI.
AGI, or Artificial General Intelligence, is a step beyond these processes of General
Language Models, and would essentially be a fully independent (possibly even omnipotent)
consciousness; basically, it would be a machine capable of thought. A good way to think of
this very important relationship between AGI and AI is to imagine, if you will, the supercomputer
Deep Thought from The Hitchhiker's Guide to the Galaxy. This machine is able to understand and
answer even the most complex open ended question pondered by everyone since the dawn of
time: "What is the meaning of life?" The best part is that it calculates the “answer”, a
seemingly random but absolutely golden reply of "42", all on its own.
Contrast that with Marvin the robot to think about Artificial Intelligence (AI), a process of
“simulating” human-like interactions with machines. He is programmed with human-like abilities,
such as the ability to feel depression, and even has a brain that is "the size of a planet." He might
be thought of as the pinnacle of AI, but he still falls short of AGI somehow and, through his
depression, constantly reminds the audience of this problem he cannot simply express. For
example, he can observe everything down to the number of atoms in a room, but then what? It
must be miserable not being able to go further. Deep Thought is AGI in its ability to handle the
worst human input perfectly and think about it, while Marvin is programmed with human-like
abilities but is ultimately never asked a single open ended question.
A Cognitive Network is designed using multiple layers of neural networks to process and
analyze information, more closely resembling a human brain. Each layer is responsible for a
specific task (like how different areas of the brain do different tasks), and the connections
between the network's artificial neurons allow for the flow of information and the ability to
learn and adapt. One of the key features of these cognitive networks is their ability to learn
from experience, much like the brain. The technology has gotten pretty close to human-like, but
it still lacks the thought needed to replace us, and may never get there, as creativity and
observation may be forces beyond mathematics.
Named Entity Recognition: This method is used to identify and extract specific elements such
as persons, organizations, and locations from a text.
Sentiment analysis: This method uses machine learning algorithms to identify the tone.
Part-of-speech tagging: This method involves identifying the parts of speech in a sentence,
such as nouns, verbs, and adjectives. Related methods also pick out the subject, predicate, and so forth.
Dependency parsing: This method is used to identify the grammatical structure of a sentence
and the relationships between different words in the sentence.
Topic modeling: This method is used to identify the main topics discussed in a text. It includes
techniques such as LDA and LSA for setting up training to scan data.
Text summarization: This method generates a shorter version of a long text that keeps the key points.
Text Prediction: This method is used to generate new text based on a given input.
Text Classification: used to assign predefined categories or labels to a given text.
Dialogue: This method is used to generate responses in a conversational setting, like chatbots
and virtual assistants. ChatGPT was trained extensively using dialogue data.
Math: This is a post January 30th update, and not much is known, though it seems likely that
this doesn’t really mean calculator ops but better symbolic operations (like in code).
The cutoff for the information that ChatGPT has been trained on falls in
late 2021 (around September or October), but the exact day is not specified. ChatGPT is also
not accurate at citing sources in general if given general subjects, but can do better if specific
things like authors or URLs are listed; it is also able to give general summaries of the books that
it does know. This cutoff and these limitations are where your cleverness comes in. Specific
limitations of this model:
Limited Content: Not just by date, but in representative volume of data trained on.
Fixed Dictionary: Cannot change definitions or redefine connections as words evolve.
Rarity Problem: Harder to handle less trained information, or names never learned.
Abstraction Limits: Difficulty with highest complexity levels, such as open ended questions.
Mathematics: Cannot handle large numbers like a scientific calculator might easily do.
Dependency: On reliable trainers, large datasets of labeled data, and a whole lot of Energy.
Imagination: This is pretty much non-existent without yours to supplement.
Single PoV: Unlike a Search Engine with lots of references, it gives one answer, and it will lie to you, like a child.
Lacking In: Contextual cues, emotional awareness, misses ambiguity, and machine language.
Bias: Machine Learning models are often trained on biased datasets by biased humans.
{This sentence is the first, and should contain the task Directive for the model's output.
The middle part should be the context, which can be further defined with Containers such as
quotations, brackets, parentheses, etc. to expand upon the context in the first sentence in
unique, distinguishable ways. This space can include exact information about what Chat should
do, such as classifying all the rules for a game, or specific examples of what you would like Chat
to reproduce. This can be a helpful space to be very general in, such as providing an example with
a lot of brackets to indicate general subjects for Chat to fill in, such as “Example [term]:
[definition][simplified definition][example: general example subject format]”. It might be more
helpful to simply name a known method or book, like using Art of War to get the
job done in very few words, applying the understanding that Art of War is about conflict resolution,
which would otherwise have to be worked out with a few simple questions. Names are powerful.
Any following statements to the context, if any, should be questions or format instructions.}
By providing the model with a simple subject and directive, you are able to shape the
domain of information and narrow down the options to a sort of skeleton, and then work your
way out. The context memory space is limited to about 4k tokens, so names are extremely
helpful for inserting complex ideas in shortened spaces. “Define TRIZ” is just 4 letters that Chat
can tell you a lot about. It is important to remember that anything can be a name or label if used
properly, and as easily as something can be referenced as a named concept from a book, you
can also make up names for things yourself, such as defining an algorithmic text processing
instruction set. If you are familiar with basic logic operators, you can also do this with IF THEN
too. Example:
1. List all associated terms, subjects, and categories related as a CSV list.
2. Summarize the main subject of the sample in full technical vocabulary, explaining the
information in as much detail as possible.
3. Summarize the main subject of the sample to a middle-school student.
Once Chat has been told that these three steps are called “SUM”, you can simply use “SUM: topic”
to follow this exact format, now and for quite a long time after in the same session.
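If you are sending prompts from your own code, the same naming trick can be sketched as two
small helpers that only build text. Nothing here calls any API, and the function names and the
exact wording of the definition message are my own invention:

```python
def define_macro(name, steps):
    """Build the one-time message that teaches the model what `name` means."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f'When I write "{name}: <topic>", do the following for that topic:\n{numbered}'

def use_macro(name, topic):
    """Build the short follow-up message that reuses the named instruction set."""
    return f"{name}: {topic}"

SUM_STEPS = [
    "List all associated terms, subjects, and categories related as a CSV list.",
    "Summarize the main subject in full technical vocabulary, in as much detail as possible.",
    "Summarize the main subject to a middle-school student.",
]

print(define_macro("SUM", SUM_STEPS))   # send once, near the start of the session
print(use_macro("SUM", "TRIZ"))         # send whenever you want the same format again
```

The long definition spends tokens once, and every later request costs only a handful.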
Transformer Architecture
The Transformer is a neural network architecture. It is primarily used for NLP tasks such
as machine translation, text summarization, and language modeling. The Transformer
Architecture is based on attention mechanisms, which allow the model to weigh the importance
of different parts of the input sequence when building a representation of each token.
This architecture is simpler than earlier sequence transduction models that use complex
recurrent neural networks (RNN) or convolutional neural networks (CNN), and it is more
parallelizable and requires less time to train.
Parallelizable refers to the ability of a task or process to be divided into smaller parts
that can be executed simultaneously. In the context of deep learning and neural networks,
it refers to the ability to split the training process of a model across multiple
processors or machines, allowing for faster training times. The Transformer can process every
position of a sequence at once, instead of one token at a time like a recurrent network.
Experiments in the original paper showed that Transformer models are superior in quality: the
model achieves state-of-the-art results on machine translation tasks and generalizes well to
other tasks such as English constituency parsing, the process of analyzing sentences by
breaking them down into sub-phrases also known as constituents (linguistically coherent units
in the sentence). The Transformer consists of two main components: the encoder and the decoder.
The encoder takes in a sequence of input tokens and produces a representation of
the input, one vector per token, known as the "context." This is done through a series of
self-attention layers, which allow the model to weigh the importance of different parts of the
input sequence when creating the context.
The decoder then takes the context as new input and produces a sequence of output tokens.
Like the encoder, the decoder also uses self-attention layers, but it also uses a mechanism
called "cross-attention" to attend to the encoder's context when generating each output token.
The ChatGPT “Transformer” utilizes a technique called the "Multi-Head Self-Attention
Mechanism." This technique is used to improve the ability of the model to understand and
generate text by allowing it to attend to different parts of the input embeddings simultaneously.
This improves the model's ability to understand the context and relationships between the
different parts of the input. The attention mechanism is implemented by using multiple heads
which are trained to attend to different parts of the input, resulting in more coherent and
natural-sounding text generation. (GPT models keep only the decoder half of the design
described above.)
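Below is a stripped-down sketch of scaled dot-product self-attention with several heads. To
keep it short I made some loud assumptions: each head simply takes its own slice of the vector
instead of using learned projection matrices, there is no masking of future tokens, and the
vector length must divide evenly by the number of heads. A real GPT layer has all of those
missing pieces.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)                      # how much this token looks at each token
        weights.append(w)
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values))
                        for i in range(len(values[0]))])
    return outputs, weights

def multi_head_self_attention(x, num_heads):
    """Run attention separately on each slice of the vectors, then join the results."""
    size = len(x[0]) // num_heads
    heads = []
    for h in range(num_heads):
        part = [row[h * size:(h + 1) * size] for row in x]
        out, _ = attention(part, part, part)     # queries, keys, values all come from x
        heads.append(out)
    return [sum((heads[h][t] for h in range(num_heads)), []) for t in range(len(x))]

tokens = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
for row in multi_head_self_attention(tokens, num_heads=2):
    print([round(v, 3) for v in row])
```

Each output row is a blend of all the input rows, with the blend decided separately by each head.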
These models are trained on large amounts of data and can capture more complex
patterns in the input data. Natural Language Understanding (NLU) focuses on the ability of a
computer program (algorithm) to understand the meaning of human language. In the context of
rule-based chatbots, NLU refers to the techniques and algorithms used to interpret and
understand user input.
Natural Language Generation (NLG): is concerned with generating human-like text.
Techniques used in NLG include text summarization, text-to-speech, and dialogue generation.
Rule-based chatbots rely on a set of predefined rules to understand and respond to user input.
These rules are typically implemented using a decision tree or a finite state machine, which
dictate the chatbot's response based on the user's input.
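The finite state machine idea can be sketched as a lookup table: the current state plus a
keyword in the user's message decides the reply and the next state. The states, keywords, and
replies below are all invented for illustration:

```python
# state -> list of (keyword, next_state, reply); everything here is hypothetical.
RULES = {
    "start": [
        ("hello", "ask_topic", "Hi! Do you want to know our hours or our location?"),
    ],
    "ask_topic": [
        ("hours", "start", "We are open 9 to 5."),
        ("location", "start", "We are at 1 Example Street."),
    ],
}
FALLBACK = "Sorry, I did not understand that."

def respond(state, user_input):
    """Return (next_state, reply) using only the fixed rules for the current state."""
    text = user_input.lower()
    for keyword, next_state, reply in RULES[state]:
        if keyword in text:
            return next_state, reply
    return state, FALLBACK                      # no rule matched: stay in the same state

state = "start"
for message in ["Hello there", "What are your hours?", "hours"]:
    state, reply = respond(state, message)
    print(reply)
```

The last message gets the fallback reply, because "hours" is only understood in the
"ask_topic" state. That is the lack of flexibility that rule-based bots are known for.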
The Basic Transformer Architecture is a neural network architecture that was introduced in the
2017 paper "Attention Is All You Need" by Ashish Vaswani, Noam Shazeer, Niki
Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.
XL-Language Models
One of the most widely used techniques in modern AI-based chatbots is the use of
Neural Networks trained with large amounts of input data, which can generate human-ish
responses. These AI-based chatbots are capable of handling more complex and dynamic
interactions, and can adapt to changing scenarios and user behavior. This means a search with
NLP is like driving a Semi Truck, while a search algorithm like Google is like driving a compact
car. Both can get you there, but one requires way more fuel and calculation resources while the
other requires a little more work and energy from you. The two main reasons for this are:
The first is XLNet, a deep learning based language model that uses a Transformer
architecture and a permutation-based training objective to generate representations for words in
a sentence. The model employs a multi-head self-attention mechanism, which allows it to
capture a diverse range of dependencies between tokens in the input. The multi-head
self-attention mechanism in XLNet consists of multiple attention heads, each trained to attend
to different aspects of the input.
The second is Transformer-XL, a variant of the Transformer architecture that addresses
the limitations of traditional Transformer models in handling sequences of longer lengths.
Transformer-XL reuses the hidden states computed for the previous segment of text (segment-level
recurrence) and introduces improved positional encodings, which allow for the modeling of
dependencies between tokens that are further apart in the sequence. Unlike traditional
Transformer models that use absolute positional encodings, Transformer-XL uses relative positional
encodings, which describe how far apart two tokens are rather than where each one sits in the
input. This allows the model to
effectively capture the relationships between tokens in much longer sequences, such as longer
inputs and multi-stage inputs.
In XLNet, the relative positional encodings and segment-level recurrence from Transformer-XL are
incorporated into the model, along with the permutation-based training objective, and the
multi-head self-attention mechanism is trained with that objective to produce appropriate outputs.
These features work together to process much longer input strings, as well as carry context
between segments of the input, while simultaneously processing the input in several different
ways at once for the best predictions. During training, XLNet samples different orderings
(permutations) in which to predict the input tokens, so that over many examples each word is
predicted from context on both its left and its right, allowing it to capture the context for each
word in a highly comprehensive manner. This yields better results on various NLP tasks, such as
text classification and question answering, at the cost of more processing power. In other words,
the permutation-based training objective in XLNet allows it to capture both forward dependencies
(on the words that come before a token) and backward dependencies (on the words that come after it).
Overall, while incredible, it is very important to remember that this requires a Super
Computer, and a lot of energy to make happen. It might be important to ask yourself if chat is
doing work you want to do, simply amusing you endlessly, or actually being used to do the work
you don’t want to do. The last is, in the author's opinion, the best possible use case for AI, but
also a dangerously sharp blade edge if abused to the point that nobody has to think anymore. I
encourage the exploration of the works of Isaac Asimov to find out more about this topic.
6. Dropout: A dropout layer is added to the concatenated tensor to prevent overfitting by randomly
zeroing some of the activations. Dropout is applied after the final linear transformation.
7. Layer normalization: A layer normalization is added to the concatenated tensor after the
dropout step. This step is used to normalize the activations across the different dimensions of
the tensor, which helps to improve the stability and performance of the model during training.
The attention equation in this section describes the attention mechanism used in the transformer
architecture, specifically the multi-head self-attention mechanism (MHSAM). The transformer
architecture uses self-attention to weigh the importance of each input vector in the sequence,
allowing the model to focus on different parts of the input when generating the output.
The attention mechanism is composed of three main components: the query, key, and
value matrices. The query matrix represents what each position in the input is looking for, the
key matrix represents what each position offers to be matched against, and the value matrix
holds the content that is passed along to the output once a match is found.
The attention mechanism calculates the attention weights by taking the dot product
of the query and key matrices, scaling it by the square root of the dimension of the key, and then
applying the softmax function to each row, resulting in a probability distribution over the input
sequence for every position. The attention weights can be thought of as the importance of each
input vector in the sequence.
These attention weights are then used to calculate the output of the MHSAM by taking a
weighted sum of the values matrix, which is given by: Output = Attention_weights * V.
So, in summary, this equation describes the process of calculating attention weights in
the transformer architecture using the multi-head self-attention mechanism, which is an
important part of the transformer architecture that allows the model to focus on different parts
of the input when generating the output.
Softmax: This function is commonly used as the final activation function in a neural
network when the output represents a probability distribution across multiple classes. It maps
the inputs to a probability distribution across all classes. In attention it is used as follows:
Attention weights = Softmax(QK^T / sqrt(d_k))
Where:
● Q is the matrix of queries, with shape (batch size, sequence length, d_k)
● K is the matrix of keys, with shape (batch size, sequence length, d_k)
● T denotes the transpose operation
● sqrt(d_k) is the scaling factor to prevent the dot product from becoming too large
● Softmax(x) is the softmax function applied to each row of x, resulting in a probability
distribution over the input sequence
In this equation, the dot product of the query and key matrices is scaled by the square
root of the dimension of the key before being passed through the softmax function. This scaling
is done to prevent the dot product from becoming too large, which would cause the softmax
function to saturate and produce attention weights that are close to 0 or 1.
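The equation can be worked through by hand with a small sketch in plain Python. It handles a single sequence with no batch dimension, and the tiny Q, K, and V matrices are made-up numbers chosen only to make the behavior easy to see. The function names are this example's own, not part of any library.

```python
import math


def softmax(row):
    """Turn a list of scores into probabilities that sum to 1."""
    peak = max(row)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Compute Softmax(Q K^T / sqrt(d_k)) and the weighted sum of V."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Output = Attention_weights * V, one row per query position.
    output = [[sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
              for row in weights]
    return weights, output


# Made-up example: two positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
weights, output = attention(Q, K, V)
```

In this example each query matches its own key best, so each row of the weights puts more probability on its own position and the output leans toward that position's value.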
The model does not use a separate step to decide how important each head is. In the standard
Transformer, the outputs of all heads are concatenated and passed through a final linear
transformation. The weights of that transformation determine how much each head's output
contributes to the result, so the importance of each head is learned implicitly rather than
assigned by a dedicated layer.
Like every other parameter, these weights are learned during the training phase of the model. The model is
trained on a large dataset, and the weights are updated during training to optimize the
model's performance on the task at hand. They are adjusted again during the
fine-tuning phase, where the model is trained on a smaller dataset that is more
specific to the task at hand.
The attention weights can be thought of as the importance of each input vector in the
sequence. These attention weights are then used to calculate the output of the MHSAM by
taking a weighted sum of the values, which is given by:
Output = Attention_weights * V
Where:
● V is the matrix of values, with shape (batch size, sequence length, d_v)
In summary, this equation calculates the attention weights by applying the softmax function to
the dot product of the query and key matrices, and then using the attention weights to calculate
the output of the MHSAM by taking a weighted sum of the values. The input is passed through
multiple encoder layers, each one consisting of:
Based on the transformer architecture that ChatGPT-4 is built on, it is likely that the activation
functions used are ReLU or GELU:
ReLU (Rectified Linear Unit) is a type of activation function that returns the input if it is positive
and returns 0 if it is negative. It is defined as f(x) = max(0,x), where x is the input and f(x) is the
output. ReLU is widely used in deep learning networks because it is computationally efficient,
does not saturate for large input values, and has been found to improve the convergence of the
training process.
GELU (Gaussian Error Linear Unit) is a type of activation function that is similar to ReLU, but it
has a probabilistic interpretation. It is defined as f(x) = x * Φ(x), where Φ is the cumulative
distribution function of the standard normal (Gaussian) distribution, and it is commonly
approximated as f(x) = 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3))), where x is the
input and f(x) is the output. In other words, GELU scales each input by the probability that a
standard Gaussian variable is smaller than it, which gives a smooth curve instead of ReLU's sharp
corner at 0. GELU is also computationally efficient, and it has been found to improve the
performance of models in some cases, especially during “unsupervised learning”.
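The two activation functions can be compared with a short Python sketch. The function names are this example's own. The exact form of GELU is written with the error function from Python's math module, and the tanh version is the approximation with the 0.044715 constant quoted above.

```python
import math


def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def gelu_exact(x):
    """x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """The tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Print the three functions side by side for a few sample inputs.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, relu(x), round(gelu_exact(x), 4), round(gelu_tanh(x), 4))
```

Unlike ReLU, GELU lets small negative inputs through as small negative outputs, and the two GELU forms stay very close to each other across the sampled range.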
The MHSAM utilizes multiple heads in order to analyze and understand input text in different
ways. Each head is learned during training as a kind of emergent property of the dataset. These
heads work together to provide a thorough and in-depth analysis of the input text, allowing for a
more accurate understanding of the meaning and context. The importance of each head can
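vary depending on the task at hand, and fine-tuning can shift how much the model relies on
certain heads. Because the heads are learned rather than assigned, the names below are best read
as labels for the kinds of patterns heads can pick up, not as fixed parts of the mechanism: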
1. Syntax Head: This head is responsible for understanding the syntactic structure of the input
text, such as the subject, verb, and object. It analyzes the grammatical relationships between
different parts of the input, which helps the model to understand the structure of the sentence
and its meaning. For example, in the sentence "Act like a storybook Writer," the syntax head would identify
that "Act" is the verb of a command (with an implied subject, "you"), that "Writer" is the noun
being compared to, and that “storybook" describes the writer. This understanding of
the syntactic structure of the sentence allows the model to understand the grammatical
relationships between parts of speech in your input’s embeddings.
2. Semantic Head: This head is responsible for understanding the semantic meaning of the input
text. It analyzes the main idea or underlying concept of the text, which helps the model to
understand the meaning and context of the input. For example, in the sentence "I would like
to book a flight to Paris for next week," the Semantic Head would analyze the input text and
understand the underlying intent of the statement, which is the desire to book a flight to a
specific destination. It would identify the keywords "book," "flight," "Paris," and "next week," and
understand that the intent is to book a flight to Paris for the following week.
3. The Denoising Autoencoding Head: This head is responsible for reconstructing the original
sentence from a corrupted version, which helps the model to learn robustness against noise. In
the context of NLP, noise can refer to any information that is irrelevant or redundant to the task
at hand, such as typos, misspellings, or irrelevant words. This kind of noise can increase the
entropy of the input data by introducing uncertainty or disorder, making it more difficult for the
model to extract the relevant information needed to perform the task. For example, in a natural
language processing task, such as sentiment analysis, noise in the input could be a long
irrelevant text before or after the text that is being analyzed. It would make it harder for the
model to determine the sentiment of the text, because the model would have to process
unnecessary information. In this case, the noise would be increasing the entropy of the input
data, making it more difficult for the model to extract the relevant information needed to perform
the task of sentiment analysis. A denoising autoencoder head would help the model to learn
robustness against this kind of noise by training it to reconstruct the original sentence from a
corrupted version. This would enable the model to better handle input data with high entropy
and extract the relevant information needed to perform the task, even when there is noise
present in the input data.
4. The Position Head: This head is responsible for encoding the position of each token in the
sequence, which helps the model to understand the relative position of words in a sentence.
This can be particularly important in natural language processing tasks, where the position of
words in a sentence can convey important information about their meaning and the
relationships between them. In transformer-based models, position information is typically
supplied at the input, where position-encoding vectors (either fixed or learned) are added to
the token embeddings. These position-encoding vectors are designed to capture the
position of each token in the sequence and allow the model to differentiate between
occurrences of the same word that appear in different positions in the sentence. For
example, in a machine translation task, the position head would be responsible for encoding the
position of each token in the source sentence and the position of each token in the target
sentence. This would help the model to understand the meaning of the sentence by
understanding the relative position of each token in the sentence, which would be important in
order to generate a coherent and contextually accurate response.
5. Named Entities Head: This head is responsible for identifying named entities such as people,
places, and organizations in the input text. It helps the model to understand the relationships
between different entities in the input and the context of the input. For
example: "I want you to emulate the writing style of Hemingway in The Old Man and the Sea."
The Named Entity Head is responsible for identifying and classifying the entities here (the author
"Hemingway" and the book "The Old Man and the Sea") so that what the model knows about them
can be used as a template.
6. Coreference Resolution Head: This head is responsible for understanding when different words in a
sentence refer to the same object. It analyzes the input text and the
relationships between words and entities in the input, which allows the model to generate
coherent and contextually accurate text. For example: "Two ducks sit in front of a duck, two
ducks sit behind a duck, and there is a duck in the middle. How many ducks are there?" The
Coreference Resolution Head would analyze the input text and understand that "duck" is used
multiple times in the sentence, and that several of those mentions refer to the same ducks.
Counting every mention separately would give too many ducks. Once the repeated mentions are
resolved, three ducks in a row satisfy every part of the sentence, so the model can respond
accordingly. This ability to
understand the relationships between words and entities in the input allows the model to
generate more natural and accurate responses in a conversation.
7. The Dialogue Act Head: This head is responsible for identifying the type of dialogue act, for
example, a question, a command, an assertion, etc. in order to generate a coherent and
contextually accurate response.
8. The Relation Head: This head is responsible for identifying semantic relationships between entities in the
input, such as subject-verb-object relationships. For example, redefining words to Named
Entities in order to establish new definitions, or assigning more exact contextual definitions for
what you are trying to work on vs the general output.
9. The Masked Head: This head is responsible for predicting the missing words in a sentence, which
helps the model to learn the probability distribution of words in a sentence. This technique is
known as Masked Language Modeling (MLM) and it is commonly used in pre-training
transformer models such as BERT. The idea is to randomly mask some words in the input sentence and then
train the model to predict the original word based on the context provided by the remaining
words. For example, given the sentence "I went to the store to buy some <...>", the model would
be trained to predict the word based on the context provided by the other words in the sentence.
10. Sentiment Analysis Head: This head is responsible for understanding the sentiment or
emotional tone of the input text. It helps the model to generate a response that is coherent with
the sentiment of the input. For example, "Write a heartfelt love letter that expresses to an aging child why we lied to
them about Santa in order to break the news." The Sentiment Analysis Head would analyze the
input text and understand the sentiment or emotional tone of the statement, which is a mixture
of sadness, regret, and love. It would identify keywords such as "heartfelt," "love," "aging," "child,"
"lied," "Santa," and "break the news," and understand that the sentiment of the statement is a mix
of sadness and regret for lying to the child about Santa, but also love and the desire to express
this love in a heartfelt letter that can help the child to understand the truth. This understanding
of the tone and context of the input allows the model to generate an appropriate output.
11. The Temporal Head: This head is responsible for understanding temporal expressions, such as
dates, times and time expressions, in order to understand the context.
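The masking step described under item 9 (Masked Language Modeling) can be sketched as a small data-preparation function. This is only an illustration of how training examples could be built: the "[MASK]" placeholder, the 15% masking rate, and the function name are assumptions made for the example, and real models differ in the details.

```python
import random

MASK = "[MASK]"  # placeholder symbol; the exact token differs between models


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens and remember the originals as prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the model would be trained to predict this word
        else:
            masked.append(tok)
    return masked, targets


sentence = "I went to the store to buy some milk".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3)
```

During training, the model sees the masked list and is scored on how well it recovers each word stored in the targets.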
Feed-forward neural network: The high-dimensional vector representation is then passed through the encoder, which
is made up of a stack of layers. Each layer in the encoder consists of two main components: a multi-head
self-attention mechanism, which allows the model to attend to different parts of the input at different levels of
abstraction, and a feed-forward neural network, which allows the model to learn more complex relationships between
the input and the encoded representation. The output of this step is the "Encoded representation of the input text",
which is a sequence of vectors, one per token, that captures the meaning and context of the input text.
1. Linear transformations: The input is transformed into multiple "heads" or projections using linear transformations.
Each head learns a different representation of the input by attending to different parts of the input sequence.
a. In part 1 of the Multi-Head Self-Attention Mechanism (MHSAM), the input is transformed into multiple "heads" or
projections using linear transformations.
b. The linear transformations are typically implemented as matrix multiplications, where the input is multiplied by a
weight matrix. The weight matrix is learned during the training process and is specific to each head.
c. The number of heads is a hyperparameter that can be adjusted depending on the specific task and dataset. For
example, let's say we want to use 4 heads. We would first apply a linear transformation to the input by multiplying
it with a weight matrix of shape (input_dim, d_model), where d_model is the dimension of the output. This results in an output of
shape (batch_size, sequence_length, d_model). Then we would apply 4 different linear transformations, each one
with a different weight matrix of shape (d_model, head_dim) where head_dim is the dimension of each head.
This results in an output of shape (batch_size, sequence_length, head_dim, 4). The purpose of these linear
transformations is to project the input into different representations, each one learned by a different head. These
different representations allow the model to attend to different aspects of the input sequence, such as syntactic
and semantic relationships. Each head learns its own set of weights and is able to capture different patterns in
the input, which is why the different heads are concatenated later on. It's important to note that the attention
sub-layer is usually paired with a normalization step, such as layer normalization, which helps to stabilize
the gradients during the training process and improve the model's performance.
2. Self-Attention: The projections are then used to compute self-attention, which allows the model to attend to different
parts of the input sequence simultaneously. The attention is computed by taking a dot product between two sets of
projections of the input, called queries and keys, applying a softmax function, and using the result to weight a third
set of projections, called values.
a. In the Multi-Head Self-Attention Mechanism (MHSAM), the projections from the linear transformations in step 1
are then used to compute self-attention. Self-attention allows the model to attend to different parts of the input
sequence simultaneously, rather than sequentially as in traditional recurrent neural networks.
b. The self-attention is computed by taking a dot product between the query projections and the key
projections. The queries, keys, and values are all obtained by applying learned linear
transformations to the input. The dot product between the queries and keys
produces a score for each position in the input sequence, indicating the importance of that position with respect
to the current position.
c. The scores are then passed through a softmax function, which normalizes the scores across all positions in the
input sequence. The softmax function is applied along the sequence dimension, resulting in a probability
distribution over all positions in the input sequence.
d. The softmax probabilities are then used to weight the values, which hold a representation of the input at
each position. The weighted values are then summed to obtain the final output of the self-attention mechanism,
which is a weighted sum of the input representations at all positions.
e. One important thing to mention here is that the attention mechanism can be applied in different ways. For
example, the decoder layers of the original Transformer use two types of attention: masked self-attention and
encoder-decoder attention (also called cross-attention). Masked self-attention, also known as causal
self-attention, does not allow the decoder to see future tokens, so each position can attend only to itself and
the positions before it. Encoder-decoder attention lets the decoder attend to the encoder's output.
f. This output is then concatenated with the outputs of the other heads and passed through a final linear
transformation in steps 3 and 4 to produce the final output of the Multi-Head Self-Attention Mechanism. The final
output captures different types of dependencies in the input, such as syntactic and semantic relationships, and is
used as input to the encoder and/or decoder layers to produce the final output of the model.
3. Concatenation: The self-attention outputs for each head are concatenated along the last dimension to form the
multi-head attention output.
a. In the Multi-Head Self-Attention Mechanism (MHSAM), the self-attention outputs for each head are concatenated
along the last dimension to form the multi-head attention output.
b. Concatenation is a simple operation that combines the outputs of different linear transformations, where the
output of each transformation is represented as a tensor. In the case of MHSAM, the output of each head's
self-attention mechanism is concatenated along the last dimension to form the multi-head attention output.
c. For example, let's say we have 4 heads, each with an output of shape (batch_size, sequence_length, head_dim).
The concatenation operation will take these 4 outputs and combine them along the last dimension to form a
single tensor of shape (batch_size, sequence_length, head_dim*4). The concatenated tensor will have the same
number of dimensions as the original tensors, but the last dimension will be the sum of the sizes of the last
dimensions of the original tensors.
d. The concatenation operation allows the model to combine the information learned by the different heads, and to
produce a final representation that captures different types of dependencies in the input, such as syntactic and
semantic relationships. The concatenated output is then passed through a final linear transformation in step 4 to
produce the final output of the Multi-Head Self-Attention Mechanism.
e. It's important to note that the concatenation operation is performed after the self-attention mechanism for each
head, this way the model can learn different representations for each head, and then combine them to produce
the final output. Also, concatenation is not the only way of combining the heads, sometimes it's also possible to
add them or take the average, but concatenation is the most common method.
4. Final Linear Transformation: The concatenated tensor is then passed through a final linear transformation to produce
the final output of the Multi-Head Self-Attention Mechanism. This final linear transformation uses a learned
weight matrix that mixes the information from the different heads back into a single representation of size
d_model.
5. Output: The final output will be a tensor of the same shape as the input. This tensor is then passed on to the rest of the
encoder and/or decoder layer, and eventually the final output of the model will be produced.
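Steps 1 through 4 can be put together in a small plain-Python sketch. It is a toy under stated assumptions: a single sequence with no batch dimension, 2 heads instead of 4, random weights standing in for learned ones, and helper names (matmul, attention_head, multi_head_attention) invented for this example.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(X, Wq, Wk, Wv):
    """Steps 1 and 2: project the input, then apply scaled dot-product attention."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def multi_head_attention(X, heads, Wo):
    """Steps 3 and 4: concatenate the head outputs, then apply the final linear map."""
    head_outputs = [attention_head(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    concat = []
    for i in range(len(X)):
        row = []
        for h in head_outputs:
            row.extend(h[i])  # join along the last dimension
        concat.append(row)
    return matmul(concat, Wo)


# Toy sizes and random (untrained) weights, for illustration only.
rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


seq_len, d_model, num_heads = 3, 4, 2
head_dim = d_model // num_heads
X = random_matrix(seq_len, d_model)
heads = [tuple(random_matrix(d_model, head_dim) for _ in range(3))
         for _ in range(num_heads)]
Wo = random_matrix(num_heads * head_dim, d_model)
out = multi_head_attention(X, heads, Wo)
```

The output has one row per input position and d_model values per row, which matches the statement in step 5 that the result has the same shape as the input.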
It's worth noting that the MHSAM is just one component of the transformer architecture, and the transformer models
like ChatGPT also have other components such as the encoder and decoder layers, the feed forward network and the
layer normalization. The combination of all these components make transformer models very powerful for a wide
range of NLP tasks.
1. Encoder Layers: The encoder layers in transformer-based models like ChatGPT are responsible for encoding the input
sequence into a compact and informative representation. Each encoder layer typically consists of a Multi-Head
Self-Attention Mechanism (MHSAM) and a feed-forward neural network (FFN). The MHSAM allows the model to
attend to different parts of the input sequence simultaneously, while the FFN transforms the output of the MHSAM
into a new representation.
a. Multi-Head Self-Attention Mechanism (MHSAM): The MHSAM allows the model to attend to different parts of the
input sequence simultaneously. The attention is computed by taking a dot product between two sets of learned
projections of the input, called queries and keys, and then applying a softmax function.
The softmax probabilities are then used to weight the values, a third set of projections that hold a representation
of the input at each position. The weighted values are then summed to obtain the final output of the self-attention mechanism,
which is a weighted sum of the input representations at all positions.
b. Feed Forward Network (FFN): The feed-forward neural network (FFN) is a simple fully-connected neural network
that applies a series of non-linear transformations to the output of the Multi-Head Self-Attention Mechanism
(MHSAM). It typically consists of two linear layers with a ReLU activation function in between. Each linear
layer multiplies its input by a weight matrix and adds a bias term. The ReLU activation function is applied to the
output of the first linear layer, before the second, and it is used to introduce
non-linearity to the model.
c. The encoder layers work together to extract high-level features from the input sequence and create a compact
and informative representation of it. Each encoder layer processes the output of the previous layer and applies a
series of non-linear transformations to it, this way the model can extract more complex and abstract features
from the input.
d. It's worth noting that the number of encoder layers is a hyperparameter that can be adjusted depending on the
specific task and dataset, and also that in the transformer architecture, the encoder layers are stacked one after
the other, this way the model can extract more and more complex features from the input as it goes deeper in the
encoder layers.
2. Decoder Layers: The decoder layers in transformer-based models like ChatGPT are responsible for
generating the output sequence. Each decoder layer typically consists of a masked Multi-Head
Self-Attention Mechanism (MHSAM) and a feed-forward neural network (FFN). In the original encoder-decoder
Transformer, each decoder layer also has an attention mechanism over the encoder's output.
The masked self-attention mechanism allows the model to attend to different parts of the sequence generated so far
while preventing it from seeing the future tokens, and the FFN transforms the output of
the attention mechanisms into a new representation.
3. Feed Forward Network (FFN): The feed-forward neural network (FFN) in transformer-based models like ChatGPT is a
simple fully-connected neural network that applies a series of non-linear transformations to the input. It typically
consists of two linear layers with a ReLU activation function in between. The FFN is used to transform the output of
the Multi-Head Self-Attention Mechanism (MHSAM) and encoder/decoder layers into a new representation.
4. Layer Normalization: Layer normalization is a normalization technique that is applied to the input of each layer in the
transformer-based models like ChatGPT. It helps to stabilize the gradients during the training process and improve
the model's performance. Layer normalization normalizes the inputs by subtracting the mean and dividing by the
standard deviation along the last dimension of the input.
All of these components work together to enable transformer models like ChatGPT to learn complex relationships in
the input data and generate coherent and high-quality text.
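The feed-forward network and layer normalization described above are simple enough to write out for a single token vector. This is a minimal sketch with made-up weights. The function names are this example's own, and the learned scale and shift that real layer normalization usually includes are left out to keep it short.

```python
import math


def layer_norm(x, eps=1e-5):
    """Subtract the mean and divide by the standard deviation of one vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, W, b):
    """One linear layer: x times W plus b, with W of shape (in_dim, out_dim)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU activation in between."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)


# Made-up weights for a vector of size 2 with a hidden layer of size 3.
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b2 = [0.5, -0.5]

ffn_out = feed_forward([1.0, -1.0], W1, b1, W2, b2)
normed = layer_norm([1.0, 2.0, 3.0, 4.0])
```

In this example the ReLU zeroes out the two negative hidden values, so only the first hidden unit contributes to the output, and the normalized vector ends up with a mean of 0 and a standard deviation close to 1.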
When the transformer model goes deeper into the encoder layers, it's able to extract more and more complex features
from the input. This is due to the stacking of the layers and the ability of each layer to extract different types of
features and representations.
Each encoder layer in transformer-based models like ChatGPT consists of a Multi-Head Self-Attention Mechanism
(MHSAM) and a Feed Forward Network (FFN). The MHSAM allows the model to attend to different parts of the input
sequence simultaneously, and the FFN applies a series of non-linear transformations to the output of the MHSAM.
As the model goes deeper into the encoder layers, the information processed by the MHSAM and the FFN becomes
more abstract and complex. The MHSAM is able to attend to different parts of the input sequence, and the FFN
applies multiple non-linear transformations to the output of the MHSAM. This allows the model to extract more
complex and abstract features and representations of the input.
The deeper layers of the encoder are able to build upon the representations learned by the shallower layers, and learn
more complex features as they have access to the abstract representations learned by the shallower layers. This
allows the model to learn a hierarchical representation of the input, where the shallow layers learn simple features,
and the deeper layers learn more complex features that are built upon the simple features learned by the shallow
layers.
STEP 1: TOKENIZATION
INPUT The input to the model is a person's raw text.
Processing Tokenization is the first step in processing input text. It is the process of breaking down the input text into
individual units called tokens. Tokens are the basic building blocks of natural language processing and machine
learning models: they are the units over which the model learns the context and relationships between words in the
input data. This includes words in other languages as well as several programming languages. The tokenizer then
converts each token into a number, called a token ID, which can be processed by the model. This step also
compresses the input, since one token usually covers several characters.
There are different ways to tokenize text, depending on the tokenization method used. The tokenization method used
for GPT is based on a byte-level Byte-Pair Encoding (BPE) algorithm. BPE is a data compression technique that
learns to split the text into subwords. This helps to overcome the problem of rare words, as the model can learn to
compose the meaning of a word from its subwords, rather than having to memorize the entire word.
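To make the rare-word point concrete, here is a small sketch of how an already-trained BPE tokenizer splits words it may never have seen whole. The merge list below is hypothetical (I made it up for illustration; a real tokenizer learns tens of thousands of merges from a corpus), and apply_merges is my own helper name.

```python
def apply_merges(word, merges):
    """Split a word into subwords by replaying learned merges in order."""
    tokens = list(word)  # start from individual characters
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Join every adjacent (left, right) pair into one token.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hypothetical merges, in the order a training run might have learned them.
merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]

print(apply_merges("lowest", merges))   # ['low', 'est']
print(apply_merges("slowest", merges))  # ['s', 'low', 'est']
```

Even if "slowest" never appeared in the training text, it is still represented by pieces the model already knows.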
The BPE Algorithm: This simple but powerful algorithm starts with an initial set of tokens, which are often individual
characters or bytes, and then iteratively merges the most frequent pair of adjacent tokens until a stopping criterion
is met. The merging loosely resembles Huffman Coding in data compression, which also builds its encoding by
repeatedly merging symbols, although Huffman Coding merges the two least frequent symbols rather than the most
frequent pair.
1. Initialize a vocabulary of unique symbols: The first step is to initialize a vocabulary from the input text.
This is done by splitting the text into its smallest units (individual characters or bytes) and collecting the
unique ones. This initial vocabulary is used as the starting point for the iterative process of merging.
2. Initialize a frequency table: Count all the bigrams in the text. A bigram is a pair of adjacent tokens in the
input text. The frequency table records the number of occurrences of each bigram and is used to determine
which bigram to merge.
3. Iteratively merge the most frequent bigram: The BPE algorithm then merges the single most frequent
bigram in the text. This is done by replacing every occurrence of that bigram with a new token that is the
concatenation of the two original tokens. For example, if the most frequent bigram is "t" followed by "h",
every such pair is replaced by the new token "th"; a later merge could then join "th" and "e" into "the".
4. Each merge adds one new token to the vocabulary, and the frequency table is updated to reflect the new
token and its frequency. The process is repeated until a stopping criterion is met. Common stopping criteria
include a maximum vocabulary size, a minimum frequency threshold for bigrams, or a specific number of
merge operations.
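The steps above can be sketched as a short training loop. This is a simplified character-level version under my own assumptions, not GPT's actual tokenizer (which works on bytes and pre-splits the text first). The function names and the num_merges and min_frequency parameters are mine; ties between equally frequent bigrams are broken by whichever appears first in the text, and the frequency table is simply recounted after each merge.

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges, min_frequency=1):
    tokens = list(text)  # step 1: start from single characters
    merges = []
    for _ in range(num_merges):  # stopping criterion: number of merges
        pair_counts = Counter(zip(tokens, tokens[1:]))  # step 2: bigram table
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # step 3: most frequent
        if pair_counts[best] < min_frequency:  # stopping criterion: frequency
            break
        tokens = merge_pair(tokens, best)  # step 4: new token, then recount
        merges.append(best)
    return tokens, merges

tokens, merges = train_bpe("low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(tokens)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

After only two merges the repeated stem "low" has become a single token, which is the whole idea of BPE in miniature.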
Byte-Pair Encoding (BPE) originated as a data compression technique that iteratively replaces the most frequent pair
of adjacent bytes (or symbols) in a data set with a single, unused byte (or symbol). This process continues until a
stopping criterion is met, such as a maximum number of distinct symbols in the encoded data. BPE can be used to
reduce the size of a text corpus and to generate a fixed-size vocabulary for neural machine translation systems.
Example: Suppose we have the following text: "hello world, how are you today?"
● Initial vocabulary: {'h','e','l','o',' ','w','r','d',',','a','y','u','t','?'}
● First step, get the most common pair of characters. In a text this short every pair occurs exactly once, so the
tie is broken by taking the first pair: "he"
● Replace "he" with a new token "A"
● Text: "Allo world, how are you today?"
● Vocabulary: {'A','h','e','l','o',' ','w','r','d',',','a','y','u','t','?'}
● Repeat the process: the next pair (again a tie, broken by position) is "Al", which becomes "B", giving
"Blo world, how are you today?"
● Vocabulary after two merges: {'A','B','h','e','l','o',' ','w','r','d',',','a','y','u','t','?'}
On a larger corpus, pairs that really are frequent (such as "t" followed by "h") are merged first. Once the text data is
tokenized, the next step is to convert the tokens into numerical representations so that they can be used as input to
machine learning models. One common way is to use one-hot encoding, where each token is represented by a binary
vector of the same length as the vocabulary: the vector is all zeros except for a 1 in the position corresponding to the
token's ID.
For example, with the 16-token vocabulary above, token 'A' (ID 0) would be represented by the vector
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
and token 'B' (ID 1) would be represented by the vector
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
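The one-hot scheme is easy to try out. Here is a minimal sketch using a small, made-up vocabulary (the list toy_vocab and the function one_hot are my own illustrative names); the position of a token in the list plays the role of its ID.

```python
def one_hot(token, vocab):
    """Return a vector of zeros with a single 1 at the token's ID."""
    vector = [0] * len(vocab)
    vector[vocab.index(token)] = 1  # the token's position in vocab is its ID
    return vector

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = ["A", "B", "h", "e", "l", "o"]

print(one_hot("A", toy_vocab))  # [1, 0, 0, 0, 0, 0]
print(one_hot("B", toy_vocab))  # [0, 1, 0, 0, 0, 0]
```

Notice that the vector gets longer with every token added to the vocabulary, which is one reason large models work with compact embeddings instead of raw one-hot vectors.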
STEP 4 ENCODING
INPUT The embedding layer in the previous step maps each token to a high-dimensional vector; this vector is called
the token's embedding. It represents the word in a continuous vector space. The idea is that similar words should
have similar vectors, meaning they will be located close to each other in the vector space. The embeddings are then
passed through the rest of the transformer to perform NLP tasks. The learned embeddings can capture the meaning
of the words, and their relationships with other words, in a way that preserves the underlying structure of the data,
which is called semantic meaning. Before training, no such "Semantic Relation" has been learned: the input has no
meaning associated with its words, it's just a set of raw text without any label or structure.
PROCESS Before the stack of layers in the encoder can process the input, each token is turned into its embedding:
1. The first step is to create a lookup table. This table is a matrix where each row corresponds to a unique token in
the input vocabulary. Each row is the embedding vector that represents that token, and each column is one
dimension of the embedding.
2. The input tokens are then passed to the lookup table. The lookup table returns the corresponding embedding
vector for each token. The embedding vectors typically have far fewer dimensions than there are tokens in the
vocabulary, which allows the model to generalize better.
3. If the embedding dimension differs from the size of the other layers in the model, the embedding vectors are
passed through a linear transformation to adjust their dimensions. This linear transformation is a matrix
multiplication. The matrix, which is learned during training, adjusts the dimensionality of the embedding vectors
to match the number of neurons in the next layer of the model.
4. The output of the linear transformation is the final embedding representation. This final embedding
representation is what is used as input to the next layers in the model.
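The lookup and the linear transformation from the steps above can be sketched with plain lists. All the numbers here are invented for illustration: embedding_table is a hypothetical 3-token vocabulary with 2-dimensional embeddings, and projection is a hypothetical 2-by-3 matrix standing in for the learned linear transformation. In a real model both would be learned during training and be far larger.

```python
# Hypothetical lookup table: one row per token ID, one column per dimension.
embedding_table = [
    [0.1, 0.3],   # embedding for token ID 0
    [0.7, -0.2],  # embedding for token ID 1
    [0.0, 0.5],   # embedding for token ID 2
]

# Hypothetical learned matrix mapping 2 dimensions to 3.
projection = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
]

def embed(token_ids, table):
    """Steps 1-2: return the table row for each token ID."""
    return [table[token_id] for token_id in token_ids]

def project(vector, matrix):
    """Step 3: multiply a row vector by a matrix."""
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(len(matrix[0]))
    ]

embedded = embed([2, 0], embedding_table)
final = [project(vector, projection) for vector in embedded]  # step 4
```

Each token ID ends up as one vector whose length matches what the next layer expects, here 3 numbers per token.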
Output: Encoded representation of the input text: After going through the encoder, the model generates a vector
representation of the input text that captures its meaning and context: one fixed-length, high-dimensional vector
per input token. Together, these vectors contain all the information about the input text that the model has
extracted during the encoding process.
STEP 5
Processing The decoder uses the encoded representation of the input text as context, in combination with the
masked self-attention mechanism, to generate coherent and contextually appropriate text. The decoder generates a
sequence of tokens, one at a time, which are then passed through the output layer to generate the final output text.
The decoder is trained during the model's training phase by minimizing the difference (typically measured as a
cross-entropy loss) between the generated tokens and the target text in the training dataset. It's worth mentioning
that the decoder is essentially a language model that is conditioned on the input text.
The Decoder takes the output of the encoder and generates the final output. The decoder typically follows similar
steps as the encoder, but it also includes some additional steps such as:
1. The decoder attends to the output of the encoder: attention scores are calculated between the decoder's current
representation and the encoder output, using learned projection matrices.
2. The attention scores are normalized into attention weights, which are then used to weigh the importance of
different parts of the encoder output when making a prediction.
3. This attention is computed several times in parallel with different learned parameter sets (hence the name
"multi-head"), and the results are combined.
4. The output of the attention mechanism is passed through a position-wise feed-forward network (FFN) to further
process the representation.
5. The output of the FFN is then passed through a final linear layer that produces a score for every token in the
vocabulary.
6. The final output of the decoder is passed through an output layer that turns these scores into probabilities and
selects the next token; repeating this token by token produces the final generated text.
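The path from attention scores to attention weights to a weighted result can be shown for a single query and a single head. This is a minimal sketch using the standard scaled dot-product formulation (dividing by the square root of the vector size is part of that formulation, not something stated in the steps above). The function names and the tiny example vectors are my own, and the learned projection matrices are left out to keep it short.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head attention for one query vector."""
    scale = math.sqrt(len(query))
    # Attention scores: how well the query matches each key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # attention weights
    # Weighted sum of the value vectors.
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return weights, output

# Made-up example: two encoder positions, the query matches the first one.
weights, output = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

The first position gets the larger weight because its key points in the same direction as the query, so the output leans toward the first value vector.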
Discussion of metrics used to measure performance, such as perplexity, BLEU, ROUGE and METEOR:
● Perplexity: Perplexity is a measure of how well a probabilistic model can predict a given sequence of words.
It is defined as the exponential of the cross-entropy between the true distribution and the model's predicted
distribution. In simpler terms, it is a measure of how well the model can predict the likelihood of a given
sequence of words. The lower the perplexity, the better the model is at predicting the next word in a
sequence.
● BLEU: The BLEU score is a commonly used metric for evaluating the quality of machine-generated text,
particularly in the field of machine translation. It compares n-grams of the generated text against the
reference text and assigns a score based on the number of matching n-grams. The BLEU score ranges from
0 to 1, with 1 indicating a perfect match between the machine-generated text and the reference text.
● ROUGE: The ROUGE score is a similar metric to BLEU, but it compares the machine-generated text to the
reference text based on recall rather than precision. ROUGE score is computed by comparing the
overlapping words between the generated text and the reference text. Like BLEU, ROUGE also ranges from 0
to 1, with 1 indicating a perfect match between the machine-generated text and the reference text.
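A rough sketch of the ideas behind these metrics follows. The perplexity function follows the definition above directly (the exponential of the average negative log-probability the model gave to each correct token). The overlap_scores function only shows the precision-versus-recall distinction between BLEU and ROUGE on word n-grams; it is not a full implementation of either, since real BLEU combines several n-gram sizes with a brevity penalty and ROUGE has several variants. The function names and example sentences are my own.

```python
import math
from collections import Counter

def perplexity(token_probs):
    """token_probs: probability the model assigned to each correct token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

def ngram_counts(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def overlap_scores(candidate, reference, n=1):
    """Return (precision, recall) of matching n-grams."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    matches = sum((cand & ref).values())  # each n-gram counted at most as often as in both
    precision = matches / max(sum(cand.values()), 1)  # BLEU-style view
    recall = matches / max(sum(ref.values()), 1)      # ROUGE-style view
    return precision, recall

# A model that gives every correct token a 1-in-4 chance is "as confused"
# as if it were choosing among 4 options.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])

precision, recall = overlap_scores("the cat sat", "the cat sat down")
```

In the example, every generated word appears in the reference (precision 1.0), but the generated text misses one reference word (recall 0.75).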
STEP 6
INPUT Generated tokens: The output of the decoding step is a sequence of tokens, which represent the generated
text.
PROCESSING Output by the output layer: The next step is the output layer, which takes the generated tokens and
converts them back into text form. This is done by looking up each token's corresponding piece of text in a
vocabulary table. This table, which is fixed when the tokenizer is built rather than learned by the model, maps the
numerical representation of the tokens back to their original words or subwords.
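The lookup itself is the simplest part of the whole pipeline. Here is a minimal sketch with a made-up vocabulary table (id_to_token and decode are illustrative names, and the subword pieces are invented, not taken from any real tokenizer).

```python
# Hypothetical vocabulary table: token ID -> piece of text.
id_to_token = {0: "hel", 1: "lo", 2: " ", 3: "wor", 4: "ld"}

def decode(token_ids, table):
    """Look up each token ID and join the pieces back into text."""
    return "".join(table[token_id] for token_id in token_ids)

print(decode([0, 1, 2, 3, 4], id_to_token))  # hello world
```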
All of the methods I could get out of it on the filter system for API development:
1. Removing special characters and numbers: The first step in the cleaning process is to remove any special characters and
numbers from the text. This can include punctuation marks, numbers, and other non-alphabetic characters. This step is
important because special characters and numbers may not be relevant to the meaning of the text and can make it
difficult for the model to understand the text.
2. Lowercasing the text: The next step is to convert all the text to lowercase. This is important because the model may be
sensitive to capitalization, and converting all text to lowercase ensures that the model does not interpret words with
different capitalization as different words.
3. Removing stop words: Stop words are common words such as "the", "and", "is", etc. that do not carry much meaning and
can be removed from the text without affecting the overall meaning of the text. Removing stop words can reduce the size
of the input data and make it easier for the model to understand the text.
4. Tokenization: Tokenization is the process of breaking the input text into smaller units called tokens. The model can then
process the text one token at a time. Common tokenization techniques include word tokenization, sentence tokenization,
and character tokenization.
5. Stemming and Lemmatization: Stemming and Lemmatization are natural language processing techniques used to reduce
words to their base form. This can help to reduce the number of unique words in the text and make it easier for the model
to understand the text.
6. Removing non-English text: Any non-English text present in the input data is removed.
7. Removing duplicate or near-duplicate text: Repeated or nearly identical passages in the input data are removed.
8. Removing HTML tags: If the text data is in the form of web pages, it may contain HTML tags that need to be removed
before the model can process the text.
9. Removing URLs: Any URLs present in the text data are removed.
10. Removing emoji, emoticons and special characters: Any emoji, emoticons and special characters present in the text
data are removed.
11. Removing any prohibited phrases or keywords: Text data can be screened for prohibited phrases or keywords, such as
hate speech, profanity, or other harmful content, and removed from the input data.
12. Handling contractions: Contractions are expanded, for example "I'm" to "I am".
13. Handling typos and spelling errors: Typos and spelling errors present in the text data are corrected.
14. Handling abbreviations: Abbreviations are expanded, for example "Mr." to "Mister".
15. Handling multi-modal information: Relevant information is extracted from images, videos, audio, etc. and
integrated with the text data.
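A handful of the cleaning steps listed above can be chained together in a few lines. This is only an illustration of the general idea, not the actual filter system of any API. The stop word list and contraction table are tiny, hypothetical samples, and clean_text is my own function name. The order matters here: HTML tags and URLs are removed first, because lowercasing and stripping special characters would otherwise mangle them into leftover fragments.

```python
import re

# Tiny illustrative samples; real lists are much longer.
STOP_WORDS = {"the", "and", "is", "a", "an", "of"}
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}

def clean_text(text):
    text = re.sub(r"<[^>]+>", " ", text)       # step 8: remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # step 9: remove URLs
    text = text.lower()                        # step 2: lowercase
    for short, full in CONTRACTIONS.items():   # step 12: expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)      # step 1: special chars, numbers
    tokens = text.split()                      # step 4: word tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # step 3: stop words

cleaned = clean_text("<p>I'm reading the docs at https://example.com and it's GREAT!</p>")
print(cleaned)  # ['i', 'am', 'reading', 'docs', 'at', 'it', 'great']
```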