
An AI Engineer’s Guide to Machine Learning and Generative AI
Everything you need to know about the underlying ideas, scientific innovations, &
technologies that are powering modern AI Applications and Agents

ai geek (wishesh)

31 min read · Oct 4

TL;DR
In this blog I will provide an in-depth overview of the key ideas, scientific
innovations, and technologies powering modern AI applications and agents. It

covers the fundamentals of machine learning, diving into techniques like


supervised, unsupervised, semi-supervised, self-supervised, and reinforcement
learning. We will then explore generative AI models in depth, focusing on large
language models, diffusion models, multimodal models, and vision-language
models. The goal is to equip AI Engineers, who have a background in software
engineering, with the necessary knowledge to create engaging AI experiences by
seamlessly integrating Generative AI into real-world applications.
Table of Contents
1. Introduction
2. Machine Learning
— Supervised Learning
— Unsupervised Learning
— Semi-Supervised Learning
— Self-Supervised Learning
— Reinforcement Learning
— Neural Networks
— Deep Learning
— Natural Language Processing (NLP)
— Autoregressive Language Models
— Transformers
— Markov Chain
— Autoencoders
— Contrastive Language–Image Pre-training (CLIP)
— Generative Pre-trained Transformers (GPT)
3. Generative AI Models
— Large Language Models (LLMs)
— Diffusion Models
— Multimodal Models
— Vision Language Models (VLM)
4. Conclusion

Introduction
Generative AI systems like GPTs, Llama, PaLM, Claude, and others are transforming
digital user experiences. They can be seamlessly integrated with various data
sources, allowing AI applications to interact with users in a highly personalized way,
delivering information tailored to their specific needs.


OpenAI has played a crucial role in making generative AI accessible to software


developers, enabling them to create engaging experiences through prompt
engineering. This was once considered difficult to achieve. In addition to that,
establishing connections between these models and external data sources is
important for providing context and building a long-term memory, often done
through vector searches. Frameworks like LangChain and LlamaIndex have
simplified the development process, handling complex prompt sequences, and
ensuring smooth integration of LLM capabilities. Additionally, ChatGPT Plugins
provide businesses with an opportunity to incorporate specific contexts into their
interactions with users.

The trend is now moving towards smaller, task-specific AI applications and


adaptable development frameworks. Prompt engineering is a crucial part of
creating these specialized AI applications. Composable frameworks, which combine
LLMs to meet specific user needs, are becoming a promising trend. Innovators are
now designing frameworks without being limited by existing models, promising an
even more dynamic future for AI applications. Furthermore, open-source LLMs like
Meta’s Llama-2 and Mistral are addressing concerns about depending too heavily on


a few providers, ensuring a diverse and sustainable path forward for AI


applications.

There are a few different ways you can get LLMs to do what you want. Prompt
Engineering and Fine-tuning are the two most promising places to get started. Fine-
tuning is a more advanced technique, and we will discuss it in more detail in a
separate blog post. But let’s talk about Prompt Engineering. It is an iterative process,
and it’s difficult to predict how well a prompt will perform for a specific task in
advance. This approach involves trying out different prompts, evaluating the
results, and deciding what to do next. For instance, out of the two prompts provided
below, the second one performs better:

Prompt 1: [Problem/question] State the answer and then explain your reasoning.

Prompt 2: [Problem/question] Explain your reasoning and then state the answer.

The reason the second prompt works better is that LLMs generate text one token at a
time, each conditioned on what came before. With the first prompt, the model commits
to an answer before it has produced any reasoning, which can lead it to answer too
quickly and yield less effective outcomes; with the second prompt, the reasoning
tokens it generates first give it useful context for the final answer.
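
To make this concrete, here is a minimal sketch of how you might compare the two orderings in code. The `complete` callable is a stand-in for whichever LLM client you actually use (OpenAI, Anthropic, a local model, and so on), and the word problem is just an illustrative placeholder.

```python
# Sketch of comparing prompt orderings; `complete` maps a prompt string to a model response.
PROBLEM = "A train leaves at 8:05 am and the next one leaves 25 minutes later. When does it leave?"

def answer_first(problem: str) -> str:
    # Forces the model to commit to an answer before it has generated any reasoning.
    return f"{problem}\nState the answer and then explain your reasoning."

def reasoning_first(problem: str) -> str:
    # Lets the model produce its reasoning first, so the final answer is conditioned on it.
    return f"{problem}\nExplain your reasoning and then state the answer."

def compare(complete):
    for build in (answer_first, reasoning_first):
        print(f"--- {build.__name__} ---")
        print(complete(build(PROBLEM)))
```

Running both variants against the same model and scoring (or simply eyeballing) the outputs is exactly the iterative evaluation loop described above.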


State the answer and then explain your reasoning (there’s no subway at exactly 8:15 am)

Explain your reasoning and then state the answer (more reasonable answer)

To gain a comprehensive understanding of AI application development and prompt


engineering, it’s essential to acquaint yourself with the fundamental concepts,
technologies, and advancements underpinning machine learning and generative
AI. While creating demos powered by Large Language Models (LLMs) may be
straightforward, building production-ready real-world applications necessitates a
profound grasp of the fundamentals of machine learning and generative AI.

In this blog, we will delve into the pivotal concepts, innovations, and technologies
powering modern AI applications. We will begin by focusing on the fundamentals
of machine learning, and subsequently, we will explore generative AI models in
greater depth. Additionally, I’ve included courses and hands-on tutorials at the
conclusion of each topic for those who wish to further immerse themselves in these
subjects.

Machine Learning
Machine learning, a crucial aspect of artificial intelligence, imitates how humans
learn by using data and algorithms to improve accuracy with experience. The term was
coined in 1959 by IBM’s Arthur Samuel, who built a self-learning checkers program, and
the concept has since led to revolutionary advancements like Netflix’s recommendation
system, Tesla’s self-driving cars, and OpenAI’s ChatGPT. It relies on statistical
techniques to make predictions and uncover patterns in data. TensorFlow and PyTorch are essential
tools for creating machine learning models. As this field progresses, it introduces
both transformative possibilities and important ethical dilemmas.

I am including a brief introduction to the key machine learning techniques that are
relevant to generative AI. Knowing the basics of these techniques and the

underlying ideas will help you better understand the model outputs and aid in
conceptualizing your AI application development and evaluation process. A good
place to start with machine learning is Andrew Ng’s Machine Learning
Specialization on Coursera.

Supervised Learning: Precise Predictions with Labeled Data


Supervised learning relies on labeled datasets to teach algorithms how to accurately
classify and predict data. Models adjust their weights based on input-output pairs from
a training set, and techniques like cross-validation are used to check that the fitted
model generalizes rather than overfits. This method addresses two main types of
problems: classification (assigning inputs to categories) and regression (predicting
continuous values and understanding relationships between variables).

Supervised Machine Learning (Source: NeuroSpace)

Widely used algorithms like neural networks, Naive Bayes, and support vector
machines (SVM) make supervised learning applicable in various business areas,
such as image recognition, sentiment analysis, and text generation. Despite its
widespread use, there are ongoing challenges, including the need for expert model
design and the time-intensive nature of training. Nevertheless, supervised learning
continues to be crucial in generating predictive insights across different industries.
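
As a small, self-contained illustration (not from any particular production system), here is a supervised classifier in scikit-learn: the labeled input-output pairs drive training, and a held-out split estimates accuracy. The digits dataset and SVM hyperparameters are arbitrary choices for the sketch.

```python
# Minimal supervised-learning sketch: an SVM classifier trained on labeled data.
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)                       # labeled dataset: images and digit labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="rbf", gamma=0.001)                      # support vector machine classifier
clf.fit(X_train, y_train)                                 # learn from input-output pairs

print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```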

If you are interested in delving deeper into Supervised Learning, you can explore
the Supervised Learning course on Coursera.

Unsupervised Learning: Discovering Patterns in Unlabeled Data


Unsupervised learning finds hidden patterns in data without needing labeled
annotations, working directly with raw input. One important use is clustering,


which groups similar data together, revealing important information.

Unsupervised Machine Learning (Source: Diego Calvo)

Tasks like customer segmentation and recommendation systems rely on


unsupervised learning. Clustering methods like K-Means and K-Medoids identify
similarities among features, forming cohesive groups. For example, in a
recommendation system, clustering users based on behavior can suggest content
without knowing specific interests.

Unsupervised learning is widely applied in various fields like exploratory data


analysis, customer segmentation, cross-selling, and image recognition. By
identifying similarities and differences, it uncovers valuable insights, making it a
crucial tool in machine learning.
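
A minimal clustering sketch with scikit-learn (using synthetic data, so the cluster count is known in advance) shows the idea: K-Means groups unlabeled points purely by feature similarity, without ever seeing a label.

```python
# Minimal unsupervised-learning sketch: K-Means clustering on unlabeled data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)    # the true labels are discarded

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)                             # assign each point to a cluster

print("cluster sizes:", [int((cluster_ids == k).sum()) for k in range(3)])
print("cluster centers:\n", kmeans.cluster_centers_)
```

In a recommendation setting, the rows of X would be per-user behavior features, and users in the same cluster would receive similar suggestions.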

If you are interested in delving deeper into Unsupervised Learning, you can explore
the Unsupervised Learning course on Coursera.

Semi-Supervised Learning: Bridging the Gap Between Labeled and Unlabeled Data
Semi-Supervised Learning is useful when there is a shortage of labeled data but an
abundance of unlabeled data. For example, if there are millions of pictures of
different real-world objects, but only 50k of them are labeled, it’s impractical and
expensive to label the rest manually. Instead, semi-supervised learning offers a
solution by training a model on the labeled subset and using it to predict labels for


the unlabeled majority. This not only saves time and resources but also maintains
accuracy.

Semi-Supervised Machine Learning (Source: Christophe Atten)

This approach operates at the intersection of supervised and unsupervised


learning, making use of both labeled and unlabeled data. It works on the idea that
nearby data points often have similar labels, which makes it versatile for tasks like
classification and clustering. Through key assumptions like continuity, cluster, and
manifold, semi-supervised learning extracts meaningful information from data
relationships.

There are several techniques that drive semi-supervised learning. Pseudo-labeling


is one of them, which assigns approximate labels to unlabeled data using the
model’s predictions. Self-Training refines this process by only accepting high-
confidence predictions. On the other hand, Label Propagation relies on graph-based
transductive methods to infer labels for unlabeled data points based on majority
neighbor votes.
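
scikit-learn ships a self-training wrapper that implements this pseudo-labeling loop, which makes for a convenient sketch. Here the data is synthetic and roughly 90% of the labels are hidden (marked with -1, the library's convention for "unlabeled"); the threshold controls how confident a pseudo-label must be before it is accepted.

```python
# Minimal semi-supervised sketch: self-training with high-confidence pseudo-labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.9        # hide ~90% of the labels
y_partial = y.copy()
y_partial[unlabeled] = -1                   # -1 marks "unlabeled" for scikit-learn

base = SVC(probability=True, gamma="auto")  # base learner must expose probability estimates
model = SelfTrainingClassifier(base, threshold=0.9)
model.fit(X, y_partial)                     # trains on real labels plus accepted pseudo-labels

print("accuracy on originally unlabeled points:", model.score(X[unlabeled], y[unlabeled]))
```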

While semi-supervised learning is highly effective in many applications, it’s


important to note its limitations. It works best when labeled data represents the
entire distribution. In situations where subtle differences are crucial or if data
representation is biased, semi-supervised learning may not perform as well.


If you are interested in delving deeper into this topic, you can explore the hands-on
tutorials by DigitalSreeni.

Self-Supervised Learning: Pioneering Autonomous Knowledge Acquisition


Self-supervised learning (SSL) is changing the landscape of machine learning by
reducing the need for labeled data. Traditionally, intelligent systems relied on
expensive annotated data, which was a major limitation. SSL addresses this by
automatically creating labels from unstructured data, unlocking the potential of
large, unlabeled datasets. This method turns unsupervised problems into
supervised ones by utilizing the inherent structure of the data. For example, in
natural language processing, SSL completes sentences, and in videos, it predicts
frames. Importantly, SSL accomplishes tasks like image classification without fixed
labels. Through three main steps — programmatic label generation, pre-training,
and fine-tuning — SSL enables models to learn strong data representations,
decreasing the reliance on manual labels. This advancement spans various
domains, including text, images, speech, and graph data. As SSL continues to
develop, it holds the promise of revolutionizing the machine learning field,
enabling more potent models with less data and effort.
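
The "programmatic label generation" step is easier to see in code. The toy sketch below turns a raw, unlabeled sentence into two kinds of (input, target) pairs: next-word prediction (the autoregressive signal used by GPT-style models) and masked-word prediction (the signal used by BERT-style models). No human annotation is involved.

```python
# Minimal self-supervised sketch: derive training labels directly from raw text.
raw_text = "self supervised learning creates labels from the data itself"
tokens = raw_text.split()

# Signal 1: next-word prediction (autoregressive language modeling).
next_word_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Signal 2: masked-word prediction (BERT-style masked language modeling).
masked_pairs = []
for i, token in enumerate(tokens):
    masked = tokens.copy()
    masked[i] = "[MASK]"
    masked_pairs.append((" ".join(masked), token))

print(next_word_pairs[0])   # (['self'], 'supervised')
print(masked_pairs[0])      # ('[MASK] supervised learning ...', 'self')
```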

Self-Supervised Learning = Filling in the Blanks (Source: Yann LeCun)


The self-supervised workflow starts with an unlabeled source dataset and a labeled target dataset. SSL enables
models to learn strong data representations through programmatic label generation, pre-training, and fine-tuning
(Source: arxiv.org)

You may be curious about how SSL differs from unsupervised, semi-supervised, and
supervised learning. Let’s explore that further.

SSL and unsupervised learning have distinct purposes. Although both operate
without labeled data, they vary in feedback mechanisms. Unsupervised learning is
broader, emphasizing model-centric approaches, while SSL focuses on data-centric
feedback. Unsupervised learning excels at clustering and dimensionality reduction,
while SSL sets the stage for regression and classification tasks, similar to supervised
learning.

In terms of the distinction between SSL and semi-supervised learning, SSL relies on
data structure and doesn’t need labeled data, whereas semi-supervised learning
uses a small amount of labeled data alongside unlabeled data. Both aim to reduce
dependency on labels, but they differ in approach and application.

Furthermore, supervised learning and self-supervised learning both use training


datasets with labels, but they differ significantly in how they acquire labels.
Supervised learning depends on manually annotated data, similar to a classroom
setting where a teacher provides examples. On the other hand, SSL eliminates the
need for manual labeling by autonomously generating labels.

If you are interested in delving deeper into this topic, you can explore the Self-
Supervised Learning Series by Yann LeCun.


Reinforcement Learning: Mastering Decision-Making Through Trial and Error


Reinforcement Learning (RL) focuses on maximizing rewards in a given situation,
without a predefined answer. The agent learns through trial and error, gathering
data from interactions. This self-teaching system is great for tasks with many
choices, as it excels in decision-making without human guidance. For example, it
can navigate obstacles to reach a reward. Key elements include the initial input
state, multiple possible actions, rewards or penalties based on the agent’s actions,
continuous learning, and decision-making aimed at maximizing cumulative reward. In short, RL
empowers machines to learn by doing, leading to optimal results.
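
A tiny tabular Q-learning sketch captures this trial-and-error loop. The environment here is a made-up five-cell corridor where only the rightmost cell gives a reward; the agent discovers the "always move right" policy purely from the rewards it collects. The exploration rate is set fairly high so the toy agent finds the reward quickly.

```python
# Minimal reinforcement-learning sketch: tabular Q-learning on a 1-D corridor.
import numpy as np

n_states, n_actions = 5, 2                  # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))         # expected return for each (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.3       # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:                # until the rewarding goal cell is reached
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: nudge the estimate toward reward + discounted future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))   # the learned values favor "move right" in every non-terminal state
```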

Reinforcement Machine Learning (Source: Stanford)


RL has a wide range of applications in various fields, demonstrating its versatility


and effectiveness in autonomous decision-making. It has made significant strides in
gaming, as seen in AlphaGo’s success in the ancient game of Go. In robotics, RL
enables machines to learn complex tasks and adapt to new challenges, especially
when human demonstration is not feasible. RL also plays a crucial role in
autonomous vehicles, aiding in tasks like path planning and adaptive driving
strategies. Beyond physical realms, RL is vital in providing personalized
recommendations in e-commerce and content delivery platforms. It also helps in
optimizing resource allocation and management, making the most of limited
resources to achieve specific goals. In healthcare, RL guides treatment policies,
ensuring individualized and efficient patient care. Additionally, in finance, RL is
used for stock trading agents, optimizing strategies for dynamic markets.


Deep Learning based self-driving car. The architecture can be implemented either as a sequential
perception-planning-action pipeline (a), or as an End2End system (b) (Source: arxiv.org)

Deep Reinforcement Learning (DRL) combines artificial neural networks and


reinforcement learning, advancing AI’s understanding of the visual world. It allows
agents to learn tasks with specific goals, such as playing video games from pixel
data or controlling real-world robots using camera inputs. This innovation,
demonstrated by AlphaGo, holds significant potential not only in gaming but also in
complex, real-world applications. DRL addresses the challenge of linking
immediate actions with delayed outcomes, akin to human decision-making. Its
progress in uncertain, dynamic environments signifies a step towards practical AI
solutions.


AlphaGo: Neural network training pipeline and architecture (Source: Nature)

Deep Reinforcement Learning for Dialogue Generation (Source: arxiv.org)

If you are interested in delving deeper into the topic of Reinforcement Learning,
you can explore the Stanford Course CS234: Reinforcement Learning by Emma
Brunskill. And if you want to further explore Deep Reinforcement Learning, you
can learn from the UC Berkeley lectures: Deep Reinforcement Learning.
Neural Networks: Mimicking the Brain for Intelligent Computing
Neural networks, also known as artificial neural networks (ANNs), form the basis of
deep learning, a branch of machine learning. These networks are inspired by the
structure of the human brain and consist of nodes organized into layers, including
input, hidden, and output layers. Each node, or artificial neuron, processes data
using weights and thresholds, becoming active if the output exceeds a specific
threshold. Through training with data, their accuracy is refined, making them
valuable tools for tasks like speech and image recognition, image generation, as
well as natural language processing.


Neural Network (Source: GeeksforGeeks)


Each node operates like a linear regression model, with inputs, weights, a bias, and
an output. This allows the network to handle complex information and make
decisions. The process of passing data from one layer to the next defines a neural
network as feedforward. Sigmoid neurons, which produce values between 0 and 1,
are commonly used for this purpose.

Sigmoid Function (Source: Wikipedia)

Neural networks are widely used in image recognition, speech processing, and
natural language processing. They rely on supervised learning, using labeled
datasets to fine-tune their algorithms. Training involves minimizing the cost
function through gradient descent, gradually adjusting the model’s parameters.
Additionally, backpropagation allows for error calculation and parameter
adjustments, further improving the model’s accuracy.
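
The pieces named above (forward pass, cost function, backpropagation, gradient descent) fit together in a few lines of PyTorch. This is a toy sketch that learns the XOR function with a small feedforward network and sigmoid activations; the layer sizes and learning rate are arbitrary illustrative choices.

```python
# Minimal neural-network sketch: forward pass, loss, backpropagation, gradient descent.
import torch
import torch.nn as nn

# Tiny feedforward network: 2 inputs -> 8 hidden units -> 1 output in (0, 1).
model = nn.Sequential(nn.Linear(2, 8), nn.Sigmoid(), nn.Linear(8, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)   # plain gradient descent
loss_fn = nn.BCELoss()                                    # cost function to minimize

# Toy dataset: the XOR truth table.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

for step in range(5000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)   # forward pass + loss
    loss.backward()               # backpropagation computes gradients for every weight and bias
    optimizer.step()              # gradient descent adjusts the parameters

print(model(X).detach().round().flatten())   # should approximate [0, 1, 1, 0]
```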


Gradient Descent (Source: IBM)

Neural Network Training in Progress, Showing Forward and Backward Passes (Source: 3Blue1Brown)

There are various types of neural networks designed for specific purposes.
Perceptrons, the earliest form, paved the way for multi-layer perceptrons (MLPs)
and convolutional neural networks (CNNs), which are extensively used in image
recognition. Recurrent neural networks (RNNs) excel in making predictions for
time-series data, while autoencoders focus on creating abstract representations
from input data.


If you are interested in delving deeper into the topic of Neural Networks, you can
explore the hands-on tutorials by Sentdex.
Deep Learning: Revolutionizing AI and Industries with Neural Networks
Deep learning involves neural networks with three or more layers, inspired by the
human brain, which learn from extensive data sets. Additional layers improve
accuracy compared to single-layer networks, forming the basis for various AI
applications like digital assistants, computer vision, and self-driving cars.

Deep Neural Network (Source: IBM)

Unlike traditional machine learning, deep learning excels in processing


unstructured data, automating feature extraction, and performing tasks with
minimal human intervention. It includes supervised, unsupervised, semi-
supervised, self-supervised, and reinforcement learning. Deep neural networks
refine predictions using layers of interconnected nodes and utilize forward
propagation and backpropagation for training. Different types of neural networks,
such as CNNs for image processing, RNNs for sequential data, and Transformers for
text generation, are tailored to specific tasks.


Convolutional Neural Network (CNN) (Source: LearnOpenCV)

Real-world applications of deep learning are widespread, from fraud detection in


law enforcement to risk assessment in finance. Customer service benefits from
chatbots, and healthcare uses image recognition for faster diagnoses. However,
deep learning requires significant computing power, often provided by high-
performance GPUs. The resurgence of deep learning is largely attributed to the
advancement of GPUs, enabling the development of multi-layered networks. Recent
theoretical progress has brought clarity to neural networks’ computational abilities,
global optimization, and overfitting, dispelling their previous opacity.


If you are interested in delving deeper into the topic of Deep Learning, you can
explore the MIT’s Introduction to Deep Learning course by Alexander Amini and
Practical Deep Learning for Coders course by Jeremy Howard.

Natural Language Processing (NLP): Harnessing the Power of Human Communication


Natural Language Processing (NLP) allows computers to comprehend, generate,
and manipulate both written and spoken human language. This technology is the
foundation for virtual assistants like Siri, Cortana, Google Assistant, and Alexa, as
well as AI applications like ChatGPT, Claude, and Bard. It plays a vital role in
various applications, from web search to sentiment analysis.


In exploring NLP, it’s essential to distinguish between Natural Language


Understanding (NLU) and Natural Language Generation (NLG), focusing on
language comprehension and production, respectively. Recent advancements in
NLP, driven by machine learning, especially deep learning, have enabled us to
uncover complex language patterns from extensive datasets.

Classical NLP approach (Source: Thanaki)

Deep learning approach for NLP (Source: Thanaki)


The applications of NLP are broad and influential. They span from automating tasks
with chatbots and agents to enhancing search capabilities and organizing large
document collections. Industries such as healthcare, legal, finance, customer
service, and insurance are benefiting greatly by streamlining processes involving
unstructured text.

At the core of NLP are machine learning models, with deep learning leading the
way. Techniques like pretrained foundation models such as LLMs and transfer
learning allow for adaptability to new tasks with minimal training data. API
providers like OpenAI have developed pretrained LLM models tailored to various
applications, further accelerating development.

Preprocessing techniques like tokenization, bag-of-words models, and stop word


removal are crucial in NLP. While part-of-speech tagging and syntactic parsing have
their place, modern deep learning-based models have less reliance on them.
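
A quick sketch of those preprocessing steps with scikit-learn: CountVectorizer tokenizes each document, drops common English stop words, and builds a bag-of-words matrix of token counts. The two example sentences are placeholders.

```python
# Minimal NLP preprocessing sketch: tokenization, stop-word removal, bag-of-words.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cat sat on the mat.",
    "The dog chased the cat around the yard.",
]

vectorizer = CountVectorizer(stop_words="english")   # tokenize and drop stop words
bow = vectorizer.fit_transform(docs)                 # sparse document-term count matrix

print(vectorizer.get_feature_names_out())            # the learned vocabulary
print(bow.toarray())                                 # one row of counts per document
```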

If you are interested in delving deeper into the topic of NLP, you can explore
Stanford’s CS224N: Natural Language Processing with Deep Learning course.

Autoregressive Language Models: Harnessing the Power of Sequential Data


Autoregressive language models, a key tool in Natural Language Processing (NLP),
have greatly advanced tasks like text generation and machine translation. These
models, known for their depth and reliance on neural networks, are excellent at
understanding complex relationships in sequential data. Unlike traditional
statistical autoregressive models, they use deep learning frameworks like
TensorFlow and PyTorch for more flexibility.


Autoregressive WaveNet model (Source: DeepMind)

Unlike Recurrent Neural Networks (RNNs), autoregressive models don’t require


hidden states, making them computationally efficient and ensuring stable training.
They can generate new data by modeling the overall distribution of observations
and targets, which is different from discriminative models that focus only on
conditional distributions.

Autoregressive architectures like PixelCNN, PixelCNN++, WaveNet, and the


transformative Transformer model have been successful in various domains,
including image and audio modeling as well as NLP applications. They are
particularly useful in tasks like text generation, where sentence lengths vary.

One of the remarkable strengths of autoregressive models is their ability to identify


patterns across different time scales. This versatility is crucial in tasks like music
modeling, where both short-term and long-term correlations play a fundamental
role.
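
At their core, these models factor the probability of a sequence into a product of conditionals, p(x_t | x_1, ..., x_{t-1}), and generate by repeatedly sampling the next element and feeding it back in. The sketch below shows that loop; `next_token_distribution` is a placeholder that a trained model (for example, a Transformer) would normally provide.

```python
# Minimal autoregressive-generation sketch; the "model" here is a random placeholder.
import numpy as np

VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]
rng = np.random.default_rng(0)

def next_token_distribution(tokens):
    # Placeholder: a real model would compute p(next token | tokens so far).
    logits = rng.normal(size=len(VOCAB))
    return np.exp(logits) / np.exp(logits).sum()        # softmax over the vocabulary

def generate(prompt, max_len=10):
    tokens = list(prompt)
    for _ in range(max_len):
        probs = next_token_distribution(tokens)          # p(x_t | x_<t)
        token = VOCAB[rng.choice(len(VOCAB), p=probs)]   # sample the next token
        if token == "<eos>":
            break
        tokens.append(token)                             # feed it back in for the next step
    return " ".join(tokens)

print(generate(["the"]))
```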

It is also important to keep in mind that autoregressive models have received
significant critique from Dr. Yann LeCun, a Turing Award-winning scientist.
According to LeCun, autoregressive models are capable of generating text that
appears intelligent, but they have limitations. They lack reasoning abilities, make
factual and logical errors, and are inconsistent. They are also unable to plan their
answers and can make mistakes that drift them away from the correct set of
answers.


According to Yann, the alternative to autoregressive models is the Joint Embedding


Predictive Architecture (JEPA), which is a non-generative architecture. This
architecture involves running both the input and output through encoders that
eliminate irrelevant details about the input and output. This approach allows for
more accurate predictions, especially in video data, where there are too many
details to predict every single one. It is important to note that this is an active area
of research. Unlike autoregressive LLMs, it does not have fully developed, ready-to-
use machine learning models for practical purposes.

The Image-based Joint-Embedding Predictive Architecture (I-JEPA) uses a single context block to predict
the representations of various target blocks originating from the same image (Source: MetaAI)

If you are interested in delving deeper into the topic of Autoregressive Language
Models, you can explore the UC Berkeley CS294 Deep Unsupervised Learning course.

Transformers: A Paradigm Shift in AI


Introduced by Google in 2017, transformer models are a groundbreaking type of
neural network. They are highly skilled at comprehending context and relationships
in sequential data, using mathematical techniques like attention and self-attention
to identify even subtle connections between distant elements in a sequence.
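
At the heart of self-attention is a simple computation: every position produces a query, a key, and a value vector; attention weights come from scaled query-key dot products; and the output is a weighted average of the values. Here is a minimal NumPy sketch of single-head scaled dot-product self-attention (with randomly initialized projections, purely for illustration).

```python
# Minimal self-attention sketch: scaled dot-product attention over one sequence.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # similarity of every position with every other
    weights = softmax(scores, axis=-1)            # attention weights sum to 1 for each query
    return weights @ V                            # weighted average of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))           # a toy "embedded sequence"
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)        # (4, 8): one contextualized vector per position
```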


The Transformer — model architecture (arxiv.org)

Researchers at Stanford refer to transformers as “foundation models” because of


their remarkable versatility, which is causing a significant shift in the field of
artificial intelligence. They find applications in various areas like real-time
translation, genetic research, fraud detection, and healthcare improvement. Their
impact extends to everyday internet use, with Google and Microsoft Bing relying on
transformers.

Transformers have also led to a positive cycle in AI development, as they can make
precise predictions and generate more data for ongoing model improvement.
Components of the transformer architecture, such as input and positional
embeddings, encoder-decoder layers, and residual connections, have paved the way
for significant advancements in natural language understanding and generation.


In the broader context, there are three main types of transformers suited for
different tasks: auto-regressive, auto-encoding, and sequence-to-sequence models.
The choice of transformer depends on factors like dataset size, task complexity, and
desired outcomes. GPT-like models excel in text generation, BERT-like models in
text comprehension, and BART/T5-like models are adept at both.

If you are interested in delving deeper into the topic of Transformers, you can
explore the Stanford CS25: Transformers United course.

Markov Chain
A Markov chain, originally conceptualized by mathematician Andrey Markov,
stands as a transformative force in stochastic modeling. It operates on the principle
that the subsequent state depends solely on the current one, eliminating the need
for a complete historical record. This inherent ‘memorylessness’ expands its scope
of application, spanning fields as varied as economics, genetics, and finance. Central
to its functioning are state transitions, dictated by probabilities: the future state
depends only on the present state and the time elapsed, disregarding prior
trajectories. This fundamental property is what underlies Markov theory.


Theoretical — Markov Chains (Source: Github)

These chains are typically represented as directed graphs, where each arrow
signifies a transition probability. The use of matrix representation is pivotal,
providing a visual depiction of transition probabilities between states. Higher-order
matrices offer valuable insights into multi-step transitions. Markov chains are
further categorized into discrete and continuous-time, each influencing the nature
of transitions. Properties such as irreducibility and periodicity offer valuable
insights into their behavior. They have greatly simplified the study of real-world
processes and play a crucial role in Data Science, spanning techniques like Markov
Chain Monte Carlo (MCMC), information theory, and Diffusion Models. Diffusion
models are formulated as Markov chains with T steps. While certain
assumptions underlie their application, Markov chains, as exemplified in
something as everyday as meal choices, demonstrate their effectiveness and
versatility in practical scenarios.
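
The meal-choice example is easy to write down: a transition matrix holds the one-step probabilities, and matrix powers give the multi-step behavior. The numbers below are made up purely for illustration.

```python
# Minimal Markov-chain sketch: daily meal choices as states with transition probabilities.
import numpy as np

states = ["pizza", "salad", "pasta"]
# P[i, j] = probability of choosing meal j tomorrow given meal i today (rows sum to 1).
P = np.array([
    [0.20, 0.60, 0.20],
    [0.30, 0.30, 0.40],
    [0.50, 0.25, 0.25],
])

today = np.array([1.0, 0.0, 0.0])                       # we had pizza today
tomorrow = today @ P                                    # one-step transition
in_a_week = today @ np.linalg.matrix_power(P, 7)        # multi-step transition via matrix powers

print("tomorrow:", dict(zip(states, tomorrow.round(3))))
print("in a week:", dict(zip(states, in_a_week.round(3))))
```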

The Markov Chain of forward/reverse diffusion process of generating a sample by slowly adding/removing
noise (Source: arxiv.org)


If you are interested in delving deeper into the topic of Markov Chains, you can
explore the explanation of using Markov Chains in Diffusion Models by Ari Seff. You
can also check out the in-depth tutorials on Markov Chains by Normalized Nerd.
Autoencoders
Autoencoders are designed to replicate their input as output, making them
invaluable in tasks like image reconstruction and noise reduction. The magic lies in
their ability to distill complex data into a compact, lower-dimensional
representation, known as the bottleneck. This compression, coupled with
structured data, enables autoencoders to excel at tasks where correlations between
input features exist.

An autoencoder uses an encoder to compress an input into a representation and a decoder to reconstruct
the input from the representation (Source: DeepLearning.ai)

The architecture of an autoencoder comprises three essential components: the


Encoder, Bottleneck, and Decoder. The Encoder efficiently compresses input data,
creating a compact bottleneck that holds the key information. The Bottleneck,
although the smallest part, is the neural network’s heart, allowing only vital
information to flow through. This restricted flow mitigates overfitting, making
autoencoders robust learners.


Illustration of Autoencoder model architecture (Source: Lil’Log)

The Decoder, on the other hand, is tasked with reconstructing the compressed
knowledge representation back into the original form. For simple autoencoders, the
output mirrors the input, albeit with reduced noise. However, variational
autoencoders (VAEs) generate entirely new content based on the input, showcasing
the versatility of this technology.

The relationship between these components is pivotal. The Encoder uses


convolutional blocks and pooling modules to compress input data, culminating in
the bottleneck. The Bottleneck’s design ensures it captures the essence of the input,
forming a knowledge representation. This representation not only prevents
memorization of the input but also establishes vital correlations within the network.

The size of the bottleneck is crucial. A smaller bottleneck reduces the risk of
overfitting but may lead to the loss of important information. Therefore, striking a
balance is essential to ensure optimal performance.
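
The encoder-bottleneck-decoder structure is only a few lines in PyTorch. This sketch trains a small fully connected autoencoder to reconstruct random vectors standing in for real data; all layer sizes (including the bottleneck width) are arbitrary illustrative choices.

```python
# Minimal autoencoder sketch: encoder -> bottleneck -> decoder, trained to reconstruct inputs.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=64, bottleneck_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 32), nn.ReLU(),
                                     nn.Linear(32, bottleneck_dim))   # compressed representation
        self.decoder = nn.Sequential(nn.Linear(bottleneck_dim, 32), nn.ReLU(),
                                     nn.Linear(32, input_dim))        # reconstruction

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                         # reconstruction error

X = torch.rand(256, 64)                        # stand-in for real (e.g. image) data
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)                # the output should mirror the input
    loss.backward()
    optimizer.step()

print("final reconstruction loss:", float(loss))
```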

If you are interested in delving deeper into the topic of Autoencoders, you can
explore the hands-on tutorials by DigitalSreeni.

Contrastive Language–Image Pre-training (CLIP)


CLIP is a neural network that addresses critical issues in traditional computer vision
approaches. Unlike traditional vision models, CLIP is trained on a diverse range of
images with natural language supervision readily available on the internet. This
enables it to perform a wide array of classification tasks without direct optimization
for specific benchmarks. As a result, CLIP closes the “robustness gap” by up to 75%,


while matching the performance of established models like ResNet-50 on ImageNet,


without relying on the original labeled examples.

This innovation promises to revolutionize computer vision by reducing reliance on


costly datasets and enabling adaptability to various tasks.
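
Assuming the Hugging Face transformers implementation of CLIP (plus the Pillow library and any local image file, here a hypothetical photo.jpg), zero-shot classification is just a matter of scoring the image against a list of candidate captions:

```python
# Zero-shot image classification sketch with CLIP via Hugging Face transformers.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = Image.open("photo.jpg")                      # placeholder path to a local image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)     # image-text similarity as probabilities

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```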

CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts
in our dataset. (Source: OpenAI)


Using CLIP as a zero-shot classifier. (Source: OpenAI)

Generative Pre-trained Transformers (GPT)


GPTs (GPT-1, GPT-2, and GPT-3) are a family of neural network models utilizing the
transformer architecture, revolutionizing artificial intelligence (AI). The system
employs a two-stage process: first, a transformer model is trained on a large dataset
using unsupervised learning with language modeling as a signal. Then, the model is
fine-tuned on smaller supervised datasets for specific tasks. This approach builds
on prior work in semi-supervised sequence learning and ULMFiT, demonstrating
that a Transformer-based model can excel in a range of tasks including
commonsense reasoning and reading comprehension.


GPT Architecture (left) and Fine-Tuning (right) (Source: OpenAI)

They power generative applications like ChatGPT, enabling the creation of human-
like text, images, and more. GPT’s impact spans industries, from Q&A bots to
content generation. Its significance lies in the transformative potential of the
transformer architecture, automating tasks from language translation to content
creation. GPT’s versatility spans social media content creation, code writing, data
analysis, and even building interactive voice assistants. By understanding and
predicting language, GPT models represent a leap towards achieving artificial
general intelligence.

Generative AI Models
Generative AI encompasses models capable of producing high-quality content
across various mediums, including text, images, audio, and video, driven by their
training data. OpenAI’s ChatGPT exemplifies this revolution, crafting poems, jokes,
and essays that rival human creations. While the initial emphasis was on visual
generation, the spotlight has now shifted to natural language processing.
Generative models can also extend beyond language, decoding software code,
molecular structures, and more.

Large Language Models (LLMs)


Large Language Models, built on transformer networks, are the backbone of
modern AI. These models consist of multiple layers, including self-attention and
feed-forward layers, enabling them to understand context and meaning in
sequential data.


Bidirectional Encoder Representations from Transformers (BERT) revolutionized


natural language processing. Unlike earlier models like GloVe, which had fixed
word embeddings regardless of context, BERT harnessed bidirectional
Transformers, allowing it to consider both left and right contexts simultaneously.
This provided a more nuanced understanding of word meanings. In comparison,
ELMo concatenated left-to-right and right-to-left information, missing out on the
full contextual picture. BERT’s attention layers outperformed its predecessors,
thanks to its joint conditioning on both left and right contexts in all layers. This
breakthrough in pre-training architecture, utilizing a bidirectional Transformer,
propelled BERT to the forefront of NLP. It reads entire sentences in one pass,
enabling attention layers to grasp the context of a word from all surrounding
directions. BERT’s two-step pre-training process, Masked Language Model (MLM)
and Next Sentence Prediction, further fine-tunes its understanding of language
nuances, making it a powerful tool in various downstream tasks.

Overall pre-training and fine-tuning procedures for BERT (Source: arxiv.org)

Generative Pre-trained Transformers (GPT) represent a monumental leap in


artificial intelligence, underpinned by the transformer architecture. These models,
including the renowned GPT-3 and GPT-4, empower applications like ChatGPT to craft
remarkably human-like text and content, revolutionizing sectors from marketing to
education. GPT’s impact is profound, streamlining tasks from language translation
to content creation at an unprecedented pace. With applications spanning social
media content creation, code generation, data analysis, and interactive voice
assistants, GPT’s versatility is striking. GPT’s functioning relies on neural network-
based language prediction, utilizing the transformer architecture’s self-attention

mechanisms. By processing vast datasets, GPT-3, with its 175 billion parameters,
attains a level of proficiency and fluency that marks a paradigm shift in AI
capabilities. From improving customer feedback analysis to enhancing virtual
reality interactions, GPT is reshaping industries across the board.

Original Generative Pre-Trained Transformer (GPT) Architecture (Source: Wikipedia)


Transformer architecture and training objectives used for training original GPT model (Source: OpenAI)


GPT performance on academic and professional exams. Exams are ordered from low to high based on GPT-
3.5 performance. GPT-4 outperforms GPT-3.5 on most exams tested. (Source: arxiv.org)

ChatGPT training pipeline (Source: Andrej Karpathy)

Key innovations such as positional encoding and self-attention allow them to


process input non-sequentially, extracting subtle relationships. This parallel
processing capability, harnessing GPUs, drastically reduces training time. Large
Language Models, often with billions of parameters, undergo unsupervised
learning, mastering grammar, languages, and knowledge. They can handle vast
amounts of data from sources like the internet and Wikipedia, making them
invaluable tools in various applications, from natural language understanding to
data analysis.


A survey of LLMs — May 2023. A timeline of existing large language models (having a size larger than 10B) in recent
years. The timeline was established mainly according to the release date of the technical paper for a model. (Source:
arxiv.org)


The evolutionary tree of modern LLMs traces the development of language models in recent years and highlights
some of the most well-known models. Models on the same branch have closer relationships. Transformer-based
models are shown in non-grey colors: decoder-only models in the blue branch, encoder-only models in the pink
branch, and encoder-decoder models in the green branch. The vertical position of the models on the timeline
represents their release dates. Open-source models are represented by solid squares, while closed-source models
are represented by hollow ones. The stacked bar plot in the bottom right corner shows the number of models from
various companies and institutions (Source: arxiv.org)


LLM Ecosystem Graph is a framework to document the foundation models ecosystem, namely both the assets
(datasets, models, and applications) and their relationships. (Source: Stanford.edu)

A broader overview of LLMs, dividing LLMs into four branches: 1. Training 2. Inference 3. Applications 4. Challenges
(arxiv.org)


Open-source (and open) LLMs are becoming increasingly capable, and fine-tuned
models designed for specific tasks have started to surpass even the most capable
models like GPT-4 (Phind-CodeLlama-34B-v2 for coding and Gorilla for writing API
calls). In contrast to their proprietary counterparts, which are restricted by
licensing agreements, open-source LLMs are generally freely accessible. This
accessibility allows AI engineers and researchers not only to employ them for
various purposes but also to enhance and distribute them. This democratization
brings forth a host of advantages.

Helpfulness human evaluation results for Llama 2-Chat compared to other open-source and closed-source
models. (Source: arxiv.org)

Firstly, it fosters transparency and flexibility, enabling enterprises to exert precise


control over their data and minimizing the risk of unauthorized access or data
leaks. Moreover, open-source LLMs are considerably more cost-effective in the long
run, as they don’t incur licensing fees. Additionally, they allow for customization
and fine-tuning to suit specific organizational needs, a task that would be both
cumbersome and costly with proprietary models. The open-source model
ecosystem empowers a diverse community to contribute, resulting in cutting-edge
solutions that keep businesses at the forefront of technology.


However, it’s essential to acknowledge and address potential risks, including issues
of bias, misinformation, consent, and security. Through education and robust AI
governance, these challenges can be mitigated, ensuring the responsible and
effective use of open-source LLMs in a variety of domains.

The landscape of open-source LLMs is evolving rapidly, and models like Llama-2
and Mistral are emerging as the preferred choices for many AI researchers and
engineers. HuggingFace Open LLM Leaderboard and Stanford Ecosystem Graph are
good places to keep track of open source (and open) LLMs.

For successful prompt engineering for your AI applications you need to understand
two crucial factors of LLMs: Temperature and Top P. Temperature influences text
randomness, with lower values favoring conservative predictions and higher values
fostering creativity. Adjusting temperature is key to tailoring output for specific
tasks. For fact-based questions, opt for a lower temperature to prioritize accuracy.
Conversely, creative tasks, like poetry generation, benefit from higher temperatures
for imaginative results. Top P (nucleus sampling), on the other hand, restricts
sampling to the smallest set of tokens whose cumulative probability exceeds the chosen
threshold. A lower value narrows token choices for more precise but potentially less
diverse outputs, while a higher value encourages diversity by considering a wider token
range. Choose a lower top_p for accuracy-driven tasks and increase it for more
diverse responses.
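
In practice these are just per-request parameters. As a hedged sketch (assuming the OpenAI Python client's v1-style chat interface and a placeholder model name; other providers expose equivalent knobs), you might tune them like this:

```python
# Sketch of setting temperature and top_p per request with the OpenAI Python client.
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

def ask(question: str, temperature: float, top_p: float) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",                              # placeholder model name
        messages=[{"role": "user", "content": question}],
        temperature=temperature,                            # lower = more deterministic
        top_p=top_p,                                        # lower = narrower token choices
    )
    return response.choices[0].message.content

# Fact-based question: favor accuracy.
print(ask("In what year was the transformer architecture introduced?", temperature=0.1, top_p=0.1))
# Creative task: allow more randomness and diversity.
print(ask("Write a two-line poem about attention.", temperature=0.9, top_p=0.95))
```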

If you are interested in delving deeper into the topic of Large Language Models, you
can explore these courses: Stanford CS324 — Large Language Models, Stanford CS
224N NLP with Deep Learning, Princeton COS 597G Understanding Large Language
Models, Stanford XCS224U Natural Language Understanding, MIT Generative AI For
Constructive Communication.

Diffusion Models: Revolutionizing Image Generation and Manipulation in Machine Learning


DALL-E 3, Imagen, Stable Diffusion, and Midjourney have propelled diffusion models
to the forefront of machine learning. These models redefine interaction with
technology, enabling the generation of a wide array of images from text prompts.
From photorealistic to fantastical, they transcend imagination.


A high-level overview of unCLIP. Also the architecture behind OpenAI’s DALL-E 2 (Source: arxiv.org)

These generative models operate by adding noise to training data and then learning
to reverse this process, resulting in coherent images from randomness. They excel
in tasks like text-to-image generation, denoising, and more. At their core, diffusion
models are parameterized Markov chains, honed through variational inference,
designed to generate data resembling their training data. Put simply, if these
models are trained on images of dogs, they can conjure remarkably lifelike canine
images.
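
The forward (noising) half of that Markov chain is straightforward to sketch. With a linear noise schedule, x_t can be sampled directly from x_0 in closed form, and after enough steps the data is essentially pure Gaussian noise; the reverse, denoising half is what the neural network learns. The schedule values below are illustrative, not taken from any specific paper.

```python
# Minimal sketch of the forward diffusion process: progressively add Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)           # noise schedule beta_1 ... beta_T
alphas_cumprod = np.cumprod(1.0 - betas)     # cumulative product "alpha-bar"

def q_sample(x0, t):
    """Sample x_t given x_0: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1.0 - alphas_cumprod[t]) * noise

x0 = rng.uniform(-1, 1, size=(8, 8))         # stand-in for a tiny image
for t in (0, 250, 999):
    xt = q_sample(x0, t)
    print(f"t={t:4d}  remaining signal ~ {np.sqrt(alphas_cumprod[t]):.3f}  sample std ~ {xt.std():.2f}")
```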

Diffusion models smoothly perturb data by adding noise, then reverse this process to generate new data from noise.
(Source: arxiv.org)

Diffusion models represent the pinnacle of generative capabilities, surpassing


predecessors like Generative Adversarial Networks (GANs) and Variational
Autoencoders (VAEs). Their stability and versatility in conditioning on various
inputs promise transformative impacts on industries ranging from entertainment and
retail to AR/VR.


Moreover, platforms like DALL-E 3 and Stable Diffusion’s DreamStudio/StableStudio


are democratizing access. DreamStudio offers user-friendly tools for image
manipulation, while DALL-E 3 simplifies image generation and Stable Diffusion’s
open-source models allow for local installations of diffusion models for all sorts of
applications.


DALL-E 3 is a powerful text-to-image model trained by OpenAI (Source: OpenAI)

If you are interested in delving deeper into the topic of Diffusion Models, you can
explore these two courses: the UC Berkeley CS 198 Lecture on Diffusion Models and
Practical Deep Learning: Deep Learning Foundations to Stable Diffusion by Jeremy
Howard.

Multimodal Models
Multimodal models simultaneously handle diverse sensory inputs like text, images,
audio, and video. Unlike traditional unimodal AI systems, they fuse information
from various sources, yielding a richer understanding of data with context and
supporting details. These models employ intricate deep learning techniques
involving encoder, mixer, and decoder layers, imitating how humans integrate
sensory input.


a) Comparison between the human brain and multimodal foundation model BriVL (Bridging-Vision-and-Language)
for coping with both vision and language information. b) Comparison between modeling weak semantic correlation
data and modeling strong semantic correlation data. (Source: Nature)

Multimodal models represent a paradigm shift in artificial intelligence, emulating


the human brain’s ability to process information through various senses. By
integrating data from sources like text, images, audio, and video, these models offer
a comprehensive understanding of complex datasets, with applications spanning
from speech recognition, visual question answering, to autonomous vehicles.


MotionLM autoregressively generates sequences of discrete motion tokens for a set of agents to produce
consistent interactive trajectory forecasts. (Source: arxiv.org)

Generative AI for Autonomy (GAIA-1) Architecture: a generative world model that leverages video, text, and action
inputs to generate realistic driving scenarios. (Source: arxiv.org)

Combining different machine learning models involves leveraging multiple models


to enhance performance. Techniques like ensemble models, stacking, and bagging
are employed. For instance, ensemble models, exemplified by random forests,

amalgamate outputs from diverse models to improve accuracy, offering a robust


approach for prediction tasks.

There are predominantly three techniques for multimodal learning: 1) fusion-based


approach, 2) alignment-based approach, and 3) late fusion.

The fusion-based approach encodes different modalities into a shared


representation space, creating a unified, modality-invariant understanding. Early
and mid-fusion techniques are employed to determine when fusion occurs. An
example is image and text captioning, where visual features and text semantics are
merged for a comprehensive representation.

Multimodal Fusion Architecture (Source: IEEE)
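
As a rough illustration, the sketch below shows an early-fusion pattern in PyTorch: pre-extracted image and text features are projected into a shared space and concatenated before a joint prediction head. The encoders and dimensions are stand-ins, not a specific published architecture.

```python
# A minimal sketch of early/mid fusion: image and text features are projected into
# a shared space and concatenated before a joint head (e.g. for classification or
# captioning). The feature extractors are assumed to run upstream of this module.
import torch
import torch.nn as nn

class EarlyFusionModel(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, shared_dim=512, n_classes=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)   # e.g. pooled CNN/ViT features
        self.txt_proj = nn.Linear(txt_dim, shared_dim)   # e.g. a BERT [CLS] embedding
        self.head = nn.Sequential(
            nn.Linear(2 * shared_dim, shared_dim), nn.ReLU(),
            nn.Linear(shared_dim, n_classes),
        )

    def forward(self, img_feats, txt_feats):
        fused = torch.cat([self.img_proj(img_feats), self.txt_proj(txt_feats)], dim=-1)
        return self.head(fused)

model = EarlyFusionModel()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))  # batch of 4 image-text pairs
print(logits.shape)  # torch.Size([4, 10])
```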

In the alignment-based approach, modalities are aligned to facilitate direct comparison. This proves advantageous when modalities have an inherent relationship, as seen in audio-visual speech recognition. An example is sign language recognition, where temporal alignment of visual and audio modalities is crucial for accurate interpretation.

Illustration of the lightweight multimodal alignment learning of encoding and decoding. (Source: arxiv.org)

The late fusion approach combines predictions from separate models trained on
individual modalities, resulting in a final prediction. Late fusion proves effective
when modalities aren’t directly related or offer complementary information. An
example is emotion recognition in music, where audio features and lyrics are
separately modeled and then combined for a more accurate prediction.

Multimodal Fusion Learning (Source: Github)
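
A minimal late-fusion sketch, using the music-emotion example above: two unimodal classifiers are assumed to already exist, and only their class probabilities are combined. The weights and numbers are made up for illustration.

```python
# Late fusion: unimodal models are trained separately and only their predictions
# are combined, here by a weighted average of per-class probabilities.
import numpy as np

def late_fusion(prob_audio: np.ndarray, prob_lyrics: np.ndarray,
                w_audio: float = 0.6, w_lyrics: float = 0.4) -> np.ndarray:
    """Combine class probabilities from two unimodal emotion classifiers."""
    fused = w_audio * prob_audio + w_lyrics * prob_lyrics
    return fused.argmax(axis=-1)

# Fake predictions for 3 songs over 4 emotion classes.
prob_audio = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.2, 0.5, 0.2, 0.1],
                       [0.25, 0.25, 0.25, 0.25]])
prob_lyrics = np.array([[0.4, 0.3, 0.2, 0.1],
                        [0.1, 0.2, 0.6, 0.1],
                        [0.1, 0.1, 0.1, 0.7]])
print(late_fusion(prob_audio, prob_lyrics))  # predicted emotion index per song
```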

There are a few challenges in multimodal learning. The first is representation: processing different modalities while preserving their unique characteristics is difficult. Joint and coordinated representation strategies are generally employed to address this; for instance, a dataset like MS COCO, which pairs images with captions, is typically handled with both joint and coordinated representations.

Determining the optimal fusion method is also a challenge. Different techniques may be more effective based on specific tasks or situations. For instance, in a movie recommendation system, combining textual data, audio, and visual information requires careful consideration of the importance of each modality.

The next challenge is alignment. Tasks like audio-visual speech recognition demand
precise alignment of audio and visual data. Techniques like hidden Markov models
and dynamic time warping are used for synchronization.
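
For intuition, here is a small NumPy sketch of dynamic time warping over two toy 1-D feature streams. Real systems would align multi-dimensional audio and visual features, but the recurrence is the same.

```python
# Dynamic time warping (DTW): find the minimal-cost alignment between two sequences
# that evolve at different rates, e.g. audio frames vs. visual mouth features.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local distance between frames
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

audio = np.array([0.0, 0.2, 0.9, 1.0, 0.3])
video = np.array([0.0, 0.1, 0.2, 0.8, 1.0, 0.9, 0.2])  # same gesture, slower
print(dtw_distance(audio, video))
```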

Co-learning is a challenge as well. Transferring knowledge from one modality to another is crucial, especially in low-resource scenarios. This approach is vital in medical diagnosis, where combining modalities like CT scans and MRI scans enhances accuracy.

MACAW-LLM is a multi-modal LLM that seamlessly incorporates visual, audio, and textual data. Its three key components — the Modality Module, Alignment Module, and Cognitive Module — work in tandem to process diverse information sources. The Modality Module expands the model’s capabilities by integrating additional encoders for visual and audio data, allowing it to handle multiple modalities effectively. The Alignment Module resolves compatibility issues arising from independently trained modality encoders, ensuring smooth integration of multi-modal information. Leveraging pretrained large language models, the Cognitive Module serves as both the foundation and the textual modality encoder, underpinning MACAW-LLM’s ability to follow human instructions.

An overview of MACAW-LLM model architecture (Source: arxiv.org)

Video-Audio-Text Transformer (VATT) is a groundbreaking framework for unsupervised multimodal representation learning. By leveraging convolution-free Transformer architectures, VATT processes raw signals like video, audio, and text to extract rich, joint embeddings. This versatility proves invaluable across various downstream tasks, from video action recognition to text-to-video retrieval. VATT’s architecture combines elements from BERT and ViT, with a focus on preserving modality-specific processing. Through self-supervised learning and noise contrastive estimation, VATT creates a common space for meaningful representations, with applications in speech recognition, image captioning, and video retrieval. VATT’s capacity to unify modalities promises a paradigm shift in multimodal AI.

Overview of the VATT architecture and the self-supervised, multimodal learning strategy (Source: arxiv.org)

If you are interested in delving deeper into the topic of Multimodal Models, you can
explore the CMU Multimodal Machine Learning course.
Vision Language Models (VLM)
The convergence of vision and language is propelling us into an era of
unprecedented multimodal understanding. Vision-language models (VLMs), adept
at processing diverse modalities like images, text, and video, stand as a
monumental advancement in this journey.

VLMs can perform multimodal tasks through few-shot learning. With just a handful
of task-specific examples, VLMs excel in problem-solving without additional
training.

These models process interleaved images, videos, and text prompts to generate
associated language. Much like their linguistic counterparts, VLMs use a dual
interface to tackle multimodal tasks. By providing example pairs of visual inputs
and expected text responses, the model learns to answer questions based on new
images or videos. This versatile approach extends to image and video tasks, treating
them as text prediction challenges with visual input conditioning.
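
Conceptually, a few-shot multimodal prompt is just an interleaved sequence of images and text, with the final image acting as the query. The sketch below only illustrates that structure; `vlm.generate` and the file paths are hypothetical placeholders, not any particular library's API.

```python
# Structure of a few-shot, interleaved image-text prompt for a VLM.
# The example image paths and the `vlm.generate` call are hypothetical.
from pathlib import Path

prompt = [
    Path("examples/fox.jpg"),   "Q: What animal is this? A: A red fox.",
    Path("examples/heron.jpg"), "Q: What animal is this? A: A grey heron.",
    Path("query/unknown.jpg"),  "Q: What animal is this? A:",
]

# answer = vlm.generate(prompt, max_new_tokens=20)  # hypothetical VLM client call
# print(answer)
print(prompt)  # two worked examples condition the model; the last image is the query
```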

VLMs stand out with their unique ability to process sequences of text tokens
interleaved with multimedia, bridging the gap between visual and linguistic
understanding. VLMs like Flamingo synergize pre-trained vision and language
models, enabling them to “perceive” visual scenes and engage in rudimentary
reasoning.

A query result for “cups with dancing people” (Source: Google.com)

Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data
including clinical language, imaging, and genomics with the same model weights. (Source: Google
Research)

Approach to grafting a model works by training a medical information adapter that maps the output of an
existing or refined image encoder into an LLM-understandable form (Source: Google Research)

IDEFICS is an 80 billion parameters multimodal VLM that accepts sequences of images and texts as input and
generates coherent text as output. (Source: HuggingFace.co)

Selected examples of inputs and outputs obtained from Google Deepmind’s Flamingo-80B (Source: arxiv.org)

A qualitative example generated by a visual language model — InstructBLIP Vicuna model (Source:
arxiv.org)

Since 2021, models like Google DeepMind’s Flamingo, OpenAI’s CLIP and GPT-4V, and Alibaba’s
Qwen-VL have been redefining tasks such as image captioning and visual question
answering, showcasing the transformative potential of joint vision-language
models. This evolution has ushered in the era of zero-shot generalization, opening
up new practical applications across a multitude of industries.

Overview of the Flamingo model (Source: Google DeepMind)

Modern vision-language models rely on transformer architectures, seamlessly integrating an image encoder, a text encoder, and a strategy for information fusion. This design allows models to discern intricate relationships between visual and textual data with high precision.
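
As a rough sketch of the fusion step, the PyTorch module below lets text tokens attend over image patch embeddings via cross-attention; the dimensions and the surrounding encoders are assumptions, not a specific published model.

```python
# Cross-attention fusion: text tokens (queries) attend over image patch embeddings
# (keys/values), roughly the "fusing with cross attention" pattern. The ViT and text
# encoders that would produce these tensors are assumed to exist upstream.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_patches):
        attended, _ = self.cross_attn(text_tokens, image_patches, image_patches)
        return self.norm(text_tokens + attended)  # residual connection

fusion = CrossAttentionFusion()
text = torch.randn(2, 16, 512)      # batch of 2 sequences, 16 text tokens each
patches = torch.randn(2, 196, 512)  # 14x14 image patches per image
print(fusion(text, patches).shape)  # torch.Size([2, 16, 512])
```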

The pre-training strategies, including Contrastive Learning, PrefixLM, Multi-modal Fusing with Cross Attention, Masked-Language Modeling, and No Training, play a pivotal role in empowering these models. These techniques enable models to effectively map images and text, predict textual sequences, and align specific regions of images with text, thus facilitating tasks like visual question answering.
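
Of these, contrastive learning is the easiest to sketch: matched image-text pairs are pulled together and mismatched pairs pushed apart with a symmetric cross-entropy over a similarity matrix (the CLIP-style objective). The embeddings below are random placeholders standing in for real encoder outputs.

```python
# CLIP-style symmetric contrastive loss over a batch of image/text embedding pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(len(img_emb))           # i-th image matches i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```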

Underpinning these pre-training efforts are vast multi-modal datasets such as PMD,
COCO, Conceptual Captions, and Flickr30K. These rich and diverse datasets serve as
the foundation upon which these models are trained. For downstream tasks,
datasets like VQA, NLVR2, TextVQA, and Hateful Memes are instrumental in fine-
tuning models for specific applications.

One interesting development in Vision Language Models is BLIP-2 (Bootstrapping Language-Image Pre-training) by Salesforce Research. It leverages off-the-shelf frozen image encoders and large language models, introducing a lightweight Querying Transformer to bridge modalities. This two-stage pre-training process propels representation learning and generative capabilities. Remarkably, BLIP-2 surpasses existing methods with fewer parameters, outperforming Flamingo 80B by 8.7% on zero-shot VQAv2 with a fraction of the trainable parameters. Additionally, it showcases unprecedented zero-shot image-to-text generation aligned with natural language instructions.

Overview of BLIP-2’s framework (Source: arxiv.org)
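
If you want to try BLIP-2 yourself, here is a short usage sketch, assuming the Hugging Face transformers integration and the Salesforce/blip2-opt-2.7b checkpoint; check the model card for exact, up-to-date usage and hardware requirements.

```python
# Zero-shot visual question answering with BLIP-2, assuming the Hugging Face
# `transformers` integration and the Salesforce/blip2-opt-2.7b checkpoint.
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
image = Image.open(requests.get(url, stream=True).raw)

# Instruction-style prompt conditions the frozen LLM on the image via the Q-Former.
inputs = processor(images=image,
                   text="Question: how many cats are there? Answer:",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```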

If you are interested in delving deeper into the topic of Vision Language Models, you can explore the Microsoft Research Vision-and-Language course.

Conclusion
In this blog I provided a comprehensive overview of the foundational concepts,
innovations, and technologies enabling the rapid advancement of AI applications
and agents. Understanding machine learning techniques like supervised learning,
neural networks, NLP, and reinforcement learning is key for AI Engineers looking
to leverage AI capabilities. Generative models like LLMs, diffusion models, and
VLMs are transforming user experiences across industries. Frameworks like
LangChain and Llama-Index are simplifying the integration of LLMs into real-world
applications. The future will likely see more specialized, adaptable AI applications
and open-source models, ensuring innovation flourishes. With this background
knowledge, developers can effectively engineer prompts, build demos, and create
production-ready AI applications that provide personalized, engaging experiences.

Join the Conversation!


If you enjoyed this article and want to stay connected, I invite you to follow me here
on Medium AI Geek and on Twitter at AI Geek.
