
II Year - I Semester                                                L T P C
DEEP LEARNING                                                       3 0 0 3

Course Objectives:
At the end of the course, the students will be expected to:
 Learn deep learning methods for working with sequential data,
 Learn deep recurrent and memory networks,
 Learn deep Turing machines,
 Apply such deep learning mechanisms to various learning problems.
 Know the open issues in deep learning, and have a grasp of the current research directions.

Course Outcomes:
After the completion of the course, student will be able to
 Demonstrate the basic concepts, fundamental learning techniques, and layers.
 Discuss neural network training and various random models.
 Explain different types of deep learning network models.
 Classify the Probabilistic Neural Networks.
 Implement tools on Deep Learning techniques.

UNIT I: Introduction: Various paradigms of learning problems, Perspectives and Issues in deep
learning framework, review of fundamental learning techniques. Feed forward neural network:
Artificial Neural Network, activation function, multi-layer neural network.
UNIT II: Training Neural Network: Risk minimization, loss function, back propagation,
regularization, model selection, and optimization.
Conditional Random Fields: Linear chain, partition function, Markov network, Belief
propagation, Training CRFs, Hidden Markov Model, Entropy.
UNIT III: Deep Learning: Deep Feed Forward network, regularizations, training deep models,
dropouts, Convolution Neural Network, Recurrent Neural Network, and Deep Belief Network.
UNIT IV: Probabilistic Neural Network: Hopfield Net, Boltzmann machine, RBMs, Sigmoid
net, Auto encoders.
UNIT V: Applications: Object recognition, sparse coding, computer vision, natural language
processing. Introduction to Deep Learning Tools: Caffe, Theano, Torch.
Text Books:
1. Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016.
2. Bishop, C. M., Pattern Recognition and Machine Learning, Springer, 2006.

Reference Books:
1. Artificial Neural Networks, Yegnanarayana, B., PHI Learning Pvt. Ltd, 2009.
2. Matrix Computations, Golub, G. H., and Van Loan, C. F., JHU Press, 2013.
3. Neural Networks: A Classroom Approach, Satish Kumar, Tata McGraw-Hill Education, 2004.
UNIT-I
INTRODUCTION

Introduction to Deep Learning


Deep learning is a branch of machine learning which is based on artificial neural
networks. It is capable of learning complex patterns and relationships within data. In
deep learning, we don’t need to explicitly program everything. It has become
increasingly popular in recent years due to advances in processing power and the
availability of large datasets. It is based on artificial neural networks (ANNs),
also known as deep neural networks (DNNs). These neural networks are inspired by
the structure and function of the human brain’s biological neurons, and they are
designed to learn from large amounts of data.
1. Deep Learning is a subfield of Machine Learning that involves the use of neural
networks to model and solve complex problems. Neural networks are modeled after
the structure and function of the human brain and consist of layers of
interconnected nodes that process and transform data.
2. The key characteristic of Deep Learning is the use of deep neural networks, which
have multiple layers of interconnected nodes. These networks can learn complex
representations of data by discovering hierarchical patterns and features in the
data. Deep Learning algorithms can automatically learn and improve from data
without the need for manual feature engineering.
3. Deep Learning has achieved significant success in various fields, including image
recognition, natural language processing, speech recognition, and recommendation
systems. Some of the popular Deep Learning architectures include Convolutional
Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Deep Belief
Networks (DBNs).
4. Training deep neural networks typically requires a large amount of data and
computational resources. However, the availability of cloud computing and the
development of specialized hardware, such as Graphics Processing Units (GPUs),
has made it easier to train deep neural networks.
In summary, Deep Learning is a subfield of Machine Learning that involves the use of
deep neural networks to model and solve complex problems. Deep Learning has
achieved significant success in various fields, and its use is expected to continue to
grow as more data becomes available, and more powerful computing resources
become available.
What is Deep Learning?
Deep learning is the branch of machine learning which is based on artificial neural
network architecture. An artificial neural network or ANN uses layers of interconnected
nodes called neurons that work together to process and learn from the input data.
In a fully connected Deep neural network, there is an input layer and one or more
hidden layers connected one after the other. Each neuron receives input from the
previous layer neurons or the input layer. The output of one neuron becomes the input
to other neurons in the next layer of the network, and this process continues until the
final layer produces the output of the network. The layers of the neural network
transform the input data through a series of nonlinear transformations, allowing the
network to learn complex representations of the input data.

Today Deep learning has become one of the most popular and visible areas of
machine learning, due to its success in a variety of applications, such as computer
vision, natural language processing, and Reinforcement learning.
Deep learning can be used for supervised, unsupervised, as well as reinforcement
machine learning, and it uses a variety of ways to process data in each setting.
 Supervised Machine Learning: Supervised machine learning is the machine
learning technique in which the neural network learns to make predictions or
classify data based on labeled datasets. Here we provide both the input features
and the target variables. The neural network learns to make predictions based
on the cost or error that comes from the difference between the predicted and the
actual target; this process is known as backpropagation. Deep learning algorithms
like convolutional neural networks and recurrent neural networks are used for many
supervised tasks like image classification and recognition, sentiment analysis,
language translation, etc.
 Unsupervised Machine Learning: Unsupervised machine learning is the machine
learning technique in which the neural network learns to discover the patterns or to
cluster the dataset based on unlabeled datasets. Here there are no target variables,
so the machine has to determine the hidden patterns or relationships within
the datasets on its own. Deep learning algorithms like autoencoders and generative models
are used for unsupervised tasks like clustering, dimensionality reduction, and
anomaly detection.
 Reinforcement Machine Learning: Reinforcement Machine Learning is
the machine learning technique in which an agent learns to make decisions in an
environment to maximize a reward signal. The agent interacts with the environment
by taking action and observing the resulting rewards. Deep learning can be used to
learn policies, or a set of actions, that maximizes the cumulative reward over time.
Deep reinforcement learning algorithms like Deep Q networks and Deep
Deterministic Policy Gradient (DDPG) are used for reinforcement tasks such as robotics and
game playing.

Artificial neural networks

Artificial neural networks are built on the principles of the structure and operation of
human neurons. They are also known as neural networks or neural nets. An artificial neural
network’s input layer, which is the first layer, receives input from external sources and
passes it on to the hidden layer, which is the second layer. Each neuron in the hidden
layer gets information from the neurons in the previous layer, computes the weighted
total, and then transfers it to the neurons in the next layer. These connections are
weighted, which means that the impacts of the inputs from the preceding layer are
more or less optimized by giving each input a distinct weight. These weights are then
adjusted during the training process to enhance the performance of the model.
[Figure: Fully connected artificial neural network]

Artificial neurons, also known as units, are found in artificial neural networks. The
whole Artificial Neural Network is composed of these artificial neurons, which are
arranged in a series of layers. The complexity of a neural network depends on the
complexity of the underlying patterns in the dataset, which determines whether a layer has a
dozen units or millions of units. Commonly, an Artificial Neural Network has an input layer, an
output layer as well as hidden layers. The input layer receives data from the outside
world which the neural network needs to analyze or learn about.
After passing through one or more hidden layers, this data is transformed into
valuable data for the output layer. Finally, the output layer provides an output in
the form of the artificial neural network’s response to the data that comes in.
Units are linked to one another from one layer to another in the bulk of neural
networks. Each of these links has weights that control how much one unit influences
another. The neural network learns more and more about the data as it moves from
one unit to another, ultimately producing an output from the output layer.
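
To make the weighted-sum-plus-activation idea concrete, below is a minimal NumPy sketch of a single forward pass through a small fully connected network. The layer sizes and the random weights are illustrative assumptions, not values taken from the text.

import numpy as np

def sigmoid(z):
    # Squash the weighted total into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 3 input features, 4 hidden units, 1 output unit.
rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input from the "outside world"
W1 = rng.normal(size=(4, 3))      # weights: input layer -> hidden layer
b1 = np.zeros(4)                  # biases of the hidden layer
W2 = rng.normal(size=(1, 4))      # weights: hidden layer -> output layer
b2 = np.zeros(1)

# Each hidden neuron computes a weighted total of the previous layer's
# outputs and passes it through a non-linear activation function.
hidden = sigmoid(W1 @ x + b1)

# The output layer repeats the same computation on the hidden activations.
output = sigmoid(W2 @ hidden + b2)
print(output)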
Difference between Machine Learning and Deep Learning:
Machine learning and deep learning are both subsets of artificial intelligence, but there
are many similarities and differences between them.

Machine Learning | Deep Learning
Applies statistical algorithms to learn the hidden patterns and relationships in the dataset. | Uses artificial neural network architectures to learn the hidden patterns and relationships in the dataset.
Can work on a smaller amount of data. | Requires a larger volume of data compared to machine learning.
Better for low-label tasks. | Better for complex tasks like image processing, natural language processing, etc.
Takes less time to train the model. | Takes more time to train the model.
A model is created from relevant features which are manually extracted from images to detect an object in the image. | Relevant features are automatically extracted from images; it is an end-to-end learning process.
Less complex and easy to interpret the result. | More complex; it works like a black box and interpretations of the result are not easy.
Can work on the CPU or requires less computing power compared to deep learning. | Requires a high-performance computer with a GPU.

Types of neural networks

Deep Learning models are able to automatically learn features from the data, which
makes them well-suited for tasks such as image recognition, speech recognition, and
natural language processing. The most widely used architectures in deep learning are
feedforward neural networks, convolutional neural networks (CNNs), and recurrent
neural networks (RNNs).
Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear flow
of information through the network. FNNs have been widely used for tasks such as
image classification, speech recognition, and natural language processing.
Convolutional Neural Networks (CNNs) are specifically for image and video
recognition tasks. CNNs are able to automatically learn features from the images,
which makes them well-suited for tasks such as image classification, object detection,
and image segmentation.
Recurrent Neural Networks (RNNs) are a type of neural network that is able to
process sequential data, such as time series and natural language. RNNs are able to
maintain an internal state that captures information about the previous inputs, which
makes them well-suited for tasks such as speech recognition, natural language
processing, and language translation.
Applications of Deep Learning :
The main applications of deep learning can be divided into computer vision, natural
language processing (NLP), and reinforcement learning.
Computer vision
In computer vision, Deep learning models can enable machines to identify and
understand visual data. Some of the main applications of deep learning in computer
vision include:
 Object detection and recognition: Deep learning models can be used to identify
and locate objects within images and videos, enabling applications such as
self-driving cars, surveillance, and robotics.
 Image classification: Deep learning models can be used to classify images into
categories such as animals, plants, and buildings. This is used in applications such
as medical imaging, quality control, and image retrieval.

 Image segmentation: Deep learning models can be used to segment images
into different regions, making it possible to identify specific features within images.
Natural language processing (NLP):
In NLP, the Deep learning model can enable machines to understand and generate
human language. Some of the main applications of deep learning in NLP include:
 Automatic Text Generation – A deep learning model can learn from a corpus of text,
and new text such as summaries and essays can be automatically generated using
these trained models.
 Language translation: Deep learning models can translate text from one language
to another, making it possible to communicate with people from different linguistic
backgrounds.
 Sentiment analysis: Deep learning models can analyze the sentiment of a piece of
text, making it possible to determine whether the text is positive, negative, or
neutral. This is used in applications such as customer service, social media
monitoring, and political analysis.
 Speech recognition: Deep learning models can recognize and transcribe spoken
words, making it possible to perform tasks such as speech-to-text conversion, voice
search, and voice-controlled devices.

Reinforcement learning:
In reinforcement learning, deep learning is used to train agents to take actions in an
environment to maximize a reward. Some of the main applications of deep learning in
reinforcement learning include:
 Game playing: Deep reinforcement learning models have been able to beat human
experts at games such as Go, Chess, and Atari.
 Robotics: Deep reinforcement learning models can be used to train robots to
perform complex tasks such as grasping objects, navigation, and manipulation.
 Control systems: Deep reinforcement learning models can be used to control
complex systems such as power grids, traffic management, and supply chain
optimization.
Challenges in Deep Learning
Deep learning has made significant advancements in various fields, but there are still
some challenges that need to be addressed. Here are some of the main challenges in
deep learning:
1. Data availability: Deep learning requires large amounts of data to learn from, and
gathering enough data for training is a major concern.
2. Computational Resources: For training the deep learning model, it is
computationally expensive because it requires specialized hardware like GPUs and
TPUs.
3. Time-consuming: While working on sequential data, depending on the computational
resources, training can take a very long time, even days or months.
4. Interpretability: Deep learning models are complex and work like a black box, so it is
very difficult to interpret the results.
5. Overfitting: When the model is trained again and again, it becomes too specialized
for the training data, leading to overfitting and poor performance on new data.

Advantages of Deep Learning:

1. High accuracy: Deep Learning algorithms can achieve state-of-the-art performance
in various tasks, such as image recognition and natural language processing.
2. Automated feature engineering: Deep Learning algorithms can automatically
discover and learn relevant features from data without the need for manual feature
engineering.
3. Scalability: Deep Learning models can scale to handle large and complex datasets,
and can learn from massive amounts of data.
4. Flexibility: Deep Learning models can be applied to a wide range of tasks and can
handle various types of data, such as images, text, and speech.
5. Continual improvement: Deep Learning models can continually improve their
performance as more data becomes available.
Disadvantages of Deep Learning:

1. High computational requirements: Deep Learning models require large amounts of
data and computational resources to train and optimize.
2. Requires large amounts of labeled data: Deep Learning models often require a
large amount of labeled data for training, which can be expensive and time-
consuming to acquire.
3. Interpretability: Deep Learning models can be challenging to interpret, making it
difficult to understand how they make decisions.
4. Overfitting: Deep Learning models can sometimes overfit to the training data,
resulting in poor performance on new and unseen data.
5. Black-box nature: Deep Learning models are often treated as black boxes, making
it difficult to understand how they work and how they arrived at their predictions.
In summary, while Deep Learning offers many advantages, including high accuracy
and scalability, it also has some disadvantages, such as high computational
requirements, the need for large amounts of labeled data, and interpretability
challenges. These limitations need to be carefully considered when deciding
whether to use Deep Learning for a specific task.

Difference Between Artificial Intelligence vs Machine Learning vs Deep Learning

Artificial Intelligence:
Artificial Intelligence is basically the mechanism to incorporate human intelligence
into machines through a set of rules (algorithms). AI is a combination of two words:
“Artificial” meaning something made by humans or non-natural things and
“Intelligence” meaning the ability to understand or think accordingly. Another definition
could be that “AI is basically the study of training your machine(computers) to
mimic a human brain and its thinking capabilities”.
AI focuses on 3 major aspects(skills): learning, reasoning, and self-correction to
obtain the maximum efficiency possible.
Machine Learning:
Machine Learning is basically the study/process which provides the system(computer)
to learn automatically on its own through experiences it had and improve accordingly
without being explicitly programmed. ML is an application or subset of AI. ML
focuses on the development of programs so that it can access data to use it for itself.
The entire process makes observations on data to identify the possible patterns being
formed and make better future decisions as per the examples provided to them. The
major aim of ML is to allow the systems to learn by themselves through
experience without any kind of human intervention or assistance.

Deep Learning:
Deep Learning is basically a sub-part of the broader family of Machine Learning
which makes use of Neural Networks(similar to the neurons working in our brain) to
mimic human brain-like behavior. DL algorithms focus on information processing
patterns mechanism to possibly identify the patterns just like our human brain does
and classifies the information accordingly. DL works on larger sets of data when
compared to ML and the prediction mechanism is self-administered by machines.
Below is a table of differences between Artificial Intelligence, Machine Learning and
Deep Learning:

1. Definition:
   - AI stands for Artificial Intelligence, and is basically the study/process which enables machines to mimic human behaviour through a particular algorithm.
   - ML stands for Machine Learning, and is the study that uses statistical methods enabling machines to improve with experience.
   - DL stands for Deep Learning, and is the study that makes use of Neural Networks (similar to neurons present in the human brain) to imitate functionality just like a human brain.
2. Scope:
   - AI is the broader family consisting of ML and DL as its components.
   - ML is the subset of AI.
   - DL is the subset of ML.
3. Nature:
   - AI is a computer algorithm which exhibits intelligence through decision making.
   - ML is an AI algorithm which allows systems to learn from data.
   - DL is an ML algorithm that uses deep (more than one layer) neural networks to analyze data and provide output accordingly.
4. Approach:
   - Search trees and much complex math are involved in AI.
   - If you have a clear idea about the logic (math) involved behind it and you can visualize the complex functionalities like K-Means, Support Vector Machines, etc., then it defines the ML aspect.
   - If you are clear about the math involved but don't have an idea about the features, and you break the complex functionalities into linear/lower-dimension features by adding more layers, then it defines the DL aspect.
5. Aim:
   - The aim of AI is basically to increase the chances of success and not accuracy.
   - The aim of ML is to increase accuracy, not caring much about the success ratio.
   - DL attains the highest rank in terms of accuracy when it is trained with a large amount of data.
6. Categories:
   - Three broad categories/types of AI are: Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI).
   - Three broad categories/types of ML are: Supervised Learning, Unsupervised Learning and Reinforcement Learning.
   - DL can be considered as neural networks with a large number of parameters and layers, lying in one of the four fundamental network architectures: Unsupervised Pre-trained Networks, Convolutional Neural Networks, Recurrent Neural Networks and Recursive Neural Networks.
7. Efficiency:
   - The efficiency of AI is basically the efficiency provided by ML and DL respectively.
   - ML is less efficient than DL as it cannot work for longer dimensions or higher amounts of data.
   - DL is more powerful than ML as it can easily work for larger sets of data.
8. Examples:
   - Examples of AI applications include: Google's AI-Powered Predictions, ridesharing apps like Uber and Lyft, commercial flights using an AI autopilot, etc.
   - Examples of ML applications include: Virtual Personal Assistants (Siri, Alexa, Google, etc.), email spam and malware filtering.
   - Examples of DL applications include: sentiment-based news aggregation, image analysis and caption generation, etc.
9. Field:
   - AI refers to the broad field of computer science that focuses on creating intelligent machines that can perform tasks that would normally require human intelligence, such as reasoning, perception, and decision-making.
   - ML is a subset of AI that focuses on developing algorithms that can learn from data and improve their performance over time without being explicitly programmed.
   - DL is a subset of ML that focuses on developing deep neural networks that can automatically learn and extract features from data.
10. Subfields and learning types:
   - AI can be further broken down into various subfields such as robotics, natural language processing, computer vision, expert systems, and more.
   - ML algorithms can be categorized as supervised, unsupervised, or reinforcement learning. In supervised learning, the algorithm is trained on labeled data, where the desired output is known. In unsupervised learning, the algorithm is trained on unlabeled data, where the desired output is unknown.
   - DL algorithms are inspired by the structure and function of the human brain, and they are particularly well-suited to tasks such as image and speech recognition.
11. System structure:
   - AI systems can be rule-based, knowledge-based, or data-driven.
   - In reinforcement learning, the ML algorithm learns by trial and error, receiving feedback in the form of rewards or punishments.
   - DL networks consist of multiple layers of interconnected neurons that process data in a hierarchical manner, allowing them to learn increasingly complex representations of the data.
AI vs. Machine Learning vs. Deep Learning Examples:


Artificial Intelligence (AI) refers to the development of computer systems that can
perform tasks that would normally require human intelligence.
There are numerous examples of AI applications across various industries. Here are
some common examples:
 Speech recognition: Speech recognition systems use AI algorithms to transcribe
speech and identify the words spoken. These systems are used in virtual assistants
like Siri and Alexa, as well as in call centers and other voice-controlled applications.
 Personalized recommendations: E-commerce sites and streaming services like
Amazon and Netflix use AI algorithms to analyze users’ browsing and viewing
history to recommend products and content that they are likely to be interested in.
 Predictive maintenance: AI-powered predictive maintenance systems analyze
data from sensors and other sources to predict when equipment is likely to fail,
helping to reduce downtime and maintenance costs.
 Medical diagnosis: AI-powered medical diagnosis systems analyze medical
images and other patient data to help doctors make more accurate diagnoses and
treatment plans.
 Autonomous vehicles: Self-driving cars and other autonomous vehicles use AI
algorithms and sensors to analyze their environment and make decisions about
speed, direction, and other factors.
 Virtual Personal Assistants (VPA) like Siri or Alexa – these use natural
language processing to understand and respond to user requests, such as playing
music, setting reminders, and answering questions.
 Autonomous vehicles – self-driving cars use AI to analyze sensor data, such as
cameras and lidar, to make decisions about navigation, obstacle avoidance, and
route planning.
 Fraud detection – financial institutions use AI to analyze transactions and detect
patterns that are indicative of fraud, such as unusual spending patterns or
transactions from unfamiliar locations.
 Image recognition – AI is used in applications such as photo organization, security
systems, and autonomous robots to identify objects, people, and scenes in images.
 Natural language processing – AI is used in chatbots and language translation
systems to understand and generate human-like text.
 Predictive analytics – AI is used in industries such as healthcare and marketing to
analyze large amounts of data and make predictions about future events, such as
disease outbreaks or consumer behavior.
 Game-playing AI – AI algorithms have been developed to play games such as
chess, Go, and poker at a superhuman level, by analyzing game data and making
predictions about the outcomes of moves.
Examples of Machine Learning:
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that involves the use of
algorithms and statistical models to allow a computer system to “learn” from data and
improve its performance over time, without being explicitly programmed to do so.
Here are some examples of Machine Learning:
 Image recognition: Machine learning algorithms are used in image recognition
systems to classify images based on their contents. These systems are used in a
variety of applications, such as self-driving cars, security systems, and medical
imaging.
 Speech recognition: Machine learning algorithms are used in speech recognition
systems to transcribe speech and identify the words spoken. These systems are
used in virtual assistants like Siri and Alexa, as well as in call centers and other
applications.
 Natural language processing (NLP): Machine learning algorithms are used in
NLP systems to understand and generate human language. These systems are
used in chatbots, virtual assistants, and other applications that involve natural
language interactions.
 Recommendation systems: Machine learning algorithms are used in
recommendation systems to analyze user data and recommend products or
services that are likely to be of interest. These systems are used in e-commerce
sites, streaming services, and other applications.
 Sentiment analysis: Machine learning algorithms are used in sentiment analysis
systems to classify the sentiment of text or speech as positive, negative, or neutral.
These systems are used in social media monitoring and other applications.
 Predictive maintenance: Machine learning algorithms are used in predictive
maintenance systems to analyze data from sensors and other sources to predict
when equipment is likely to fail, helping to reduce downtime and maintenance
costs.
 Spam filters in email – ML algorithms analyze email content and metadata to
identify and flag messages that are likely to be spam.
 Recommendation systems – ML algorithms are used in e-commerce websites
and streaming services to make personalized recommendations to users based on
their browsing and purchase history.
 Predictive maintenance – ML algorithms are used in manufacturing to predict
when machinery is likely to fail, allowing for proactive maintenance and reducing
downtime.
 Credit risk assessment – ML algorithms are used by financial institutions to
assess the credit risk of loan applicants, by analyzing data such as their income,
employment history, and credit score.
 Customer segmentation – ML algorithms are used in marketing to segment
customers into different groups based on their characteristics and behavior,
allowing for targeted advertising and promotions.
 Fraud detection – ML algorithms are used in financial transactions to detect
patterns of behavior that are indicative of fraud, such as unusual spending patterns
or transactions from unfamiliar locations.
 Speech recognition – ML algorithms are used to transcribe spoken words into
text, allowing for voice-controlled interfaces and dictation software.
Examples of Deep Learning:
Deep Learning is a type of Machine Learning that uses artificial neural networks with
multiple layers to learn and make decisions.
Here are some examples of Deep Learning:
 Image and video recognition: Deep learning algorithms are used in image and
video recognition systems to classify and analyze visual data. These systems are
used in self-driving cars, security systems, and medical imaging.
 Generative models: Deep learning algorithms are used in generative models to
create new content based on existing data. These systems are used in image and
video generation, text generation, and other applications.
 Autonomous vehicles: Deep learning algorithms are used in self-driving cars and
other autonomous vehicles to analyze sensor data and make decisions about
speed, direction, and other factors.
 Image classification – Deep Learning algorithms are used to recognize objects
and scenes in images, such as recognizing faces in photos or identifying items in
an image for an e-commerce website.
 Speech recognition – Deep Learning algorithms are used to transcribe spoken
words into text, allowing for voice-controlled interfaces and dictation software.
 Natural language processing – Deep Learning algorithms are used for tasks such
as sentiment analysis, language translation, and text generation.
 Recommender systems – Deep Learning algorithms are used in recommendation
systems to make personalized recommendations based on users’ behavior and
preferences.
 Fraud detection – Deep Learning algorithms are used in financial transactions to
detect patterns of behavior that are indicative of fraud, such as unusual spending
patterns or transactions from unfamiliar locations.
 Game-playing AI – Deep Learning algorithms have been used to develop game-
playing AI that can compete at a superhuman level, such as the AlphaGo AI that
defeated the world champion in the game of Go.
 Time series forecasting – Deep Learning algorithms are used to forecast future
values in time series data, such as stock prices, energy consumption, and weather
patterns.
How AI vs. ML vs. DL Work: Is There a Difference?
Working in AI is not the same as being an ML or DL engineer. Here’s how you can tell
those careers apart and decide which one is the right call for you.
What Does an AI Engineer Do?

An AI Engineer is a professional who designs, develops, and implements artificial


intelligence (AI) systems and solutions. Here are some of the key responsibilities and
tasks of an AI Engineer:
 Design and development of AI algorithms: AI Engineers design, develop, and
implement AI algorithms, such as decision trees, random forests, and neural
networks, to solve specific problems.
 Data analysis: AI Engineers analyze and interpret data, using statistical and
mathematical techniques, to identify patterns and relationships that can be used to
train AI models.
 Model training and evaluation: AI Engineers train AI models on large datasets,
evaluate their performance, and adjust the parameters of the algorithms to improve
accuracy.
 Deployment and maintenance: AI Engineers deploy AI models into production
environments and maintain and update them over time.
 Collaboration with stakeholders: AI Engineers work closely with stakeholders,
including data scientists, software engineers, and business leaders, to understand
their requirements and ensure that the AI solutions meet their needs.
 Research and innovation: AI Engineers stay current with the latest advancements
in AI and contribute to the research and development of new AI techniques and
algorithms.
 Communication: AI Engineers communicate the results of their work, including the
performance of AI models and their impact on business outcomes, to stakeholders.
An AI Engineer must have a strong background in computer science, mathematics,
and statistics, as well as experience in developing AI algorithms and solutions. They
should also be familiar with programming languages, such as Python and R.
What Does a Machine Learning Engineer Do?

A Machine Learning Engineer is a professional who designs, develops, and


implements machine learning (ML) systems and solutions. Here are some of the key
responsibilities and tasks of a Machine Learning Engineer:
 Design and development of ML algorithms: Machine Learning Engineers design,
develop, and implement ML algorithms, such as decision trees, random forests, and
neural networks, to solve specific problems.
 Data analysis: Machine Learning Engineers analyze and interpret data, using
statistical and mathematical techniques, to identify patterns and relationships that
can be used to train ML models.
 Model training and evaluation: Machine Learning Engineers train ML models on
large datasets, evaluate their performance, and adjust the parameters of the
algorithms to improve accuracy.
 Deployment and maintenance: Machine Learning Engineers deploy ML models
into production environments and maintain and update them over time.
 Collaboration with stakeholders: Machine Learning Engineers work closely with
stakeholders, including data scientists, software engineers, and business leaders,
to understand their requirements and ensure that the ML solutions meet their
needs.
 Research and innovation: Machine Learning Engineers stay current with the
latest advancements in ML and contribute to the research and development of new
ML techniques and algorithms.
 Communication: Machine Learning Engineers communicate the results of their
work, including the performance of ML models and their impact on business
outcomes, to stakeholders.
A Machine Learning Engineer must have a strong background in computer science,
mathematics, and statistics, as well as experience in developing ML algorithms and
solutions. They should also be familiar with programming languages, such as Python
and R, and have experience working with ML frameworks and tools.
What Does a Deep Learning Engineer Do?

A Deep Learning Engineer is a professional who designs, develops, and implements


deep learning (DL) systems and solutions. Here are some of the key responsibilities
and tasks of a Deep Learning Engineer:
 Design and development of DL algorithms: Deep Learning Engineers design,
develop, and implement deep neural networks and other DL algorithms to solve
specific problems.
 Data analysis: Deep Learning Engineers analyze and interpret large datasets,
using statistical and mathematical techniques, to identify patterns and relationships
that can be used to train DL models.
 Model training and evaluation: Deep Learning Engineers train DL models on
massive datasets, evaluate their performance, and adjust the parameters of the
algorithms to improve accuracy.
 Deployment and maintenance: Deep Learning Engineers deploy DL models into
production environments and maintain and update them over time.
 Collaboration with stakeholders: Deep Learning Engineers work closely with
stakeholders, including data scientists, software engineers, and business leaders,
to understand their requirements and ensure that the DL solutions meet their
needs.
 Research and innovation: Deep Learning Engineers stay current with the latest
advancements in DL and contribute to the research and development of new DL
techniques and algorithms.
 Communication: Deep Learning Engineers communicate the results of their work,
including the performance of DL models and their impact on business outcomes, to
stakeholders.

Learning Paradigms in Machine Learning

A learning paradigm basically states a particular pattern by which something or
someone learns. In this section, we will talk about the learning paradigms related to
machine learning, i.e., how a machine learns when some data is given to it, and its
pattern of approach for some particular data.

There are three basic types of learning paradigms widely associated with machine
learning, namely
1. Supervised Learning

2. Unsupervised Learning

3. Reinforcement Learning

We will be talking in brief about all of them.

Supervised Learning

Supervised learning is a machine learning task in which a function maps the input to
output data using the provided input-output pairs.
The above statement states that in this type of learning, you need to give both the input
and output(usually in the form of labels) to the computer for it to learn from it. What the
computer does is that it generates a function based on this data, which can be anything
like a simple line, to a complex convex function, depending on the data provided.

This is the most basic type of learning paradigm, and most algorithms we learn today are
based on this type of learning pattern. Some examples of these are :

1. Linear Regression (the simple line function!)

2. Logistic Regression (0 or 1 logic, meaning yes or no!)


Some practical examples of the same are listed below, followed by a minimal code sketch:

Reference : https://www.geeksforgeeks.org/supervised-unsupervised-learning/

Classification: Machine is trained to classify something into some class.


 classifying whether a patient has disease or not

 classifying whether an email is spam or not

Regression: Machine is trained to predict some value like price, weight or height.

 predicting house/property price

 predicting stock market price
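
As a concrete illustration of the classification and regression examples above, here is a minimal sketch using scikit-learn (assumed to be installed); the synthetic data and parameter values are made up purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)

# Regression: predict a continuous value (e.g. a price) from one feature.
X_reg = rng.uniform(0, 10, size=(100, 1))
y_reg = 3.0 * X_reg.ravel() + 2.0 + rng.normal(scale=0.5, size=100)
reg = LinearRegression().fit(X_reg, y_reg)
print("learned slope and intercept:", reg.coef_, reg.intercept_)

# Classification: predict a label (e.g. spam / not spam) from two features.
X_clf = rng.normal(size=(200, 2))
y_clf = (X_clf[:, 0] + X_clf[:, 1] > 0).astype(int)   # synthetic labels
clf = LogisticRegression().fit(X_clf, y_clf)
print("predicted class for [1, 1]:", clf.predict([[1.0, 1.0]]))

Both estimators are trained on labeled input-output pairs, which is exactly what makes them supervised.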

Unsupervised Learning

In this type of learning paradigm, the computer is provided with just the input to
develop a learning pattern. It is basically Learning from no results!!

This means that the computer has to recognize a pattern in the given input and develop
a learning algorithm accordingly. So we conclude that “the machine learns through
observation and finds structures in data”. This is still a relatively unexplored field of
machine learning, and big tech giants like Google and Microsoft are currently
researching developments in it.

Some real life examples of the same are:

Reference : https://www.geeksforgeeks.org/supervised-unsupervised-learning/

Clustering: A clustering problem is where you want to discover the inherent groupings
in the data

 such as grouping customers by purchasing behavior

Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data

 such as people that buy X also tend to buy Y
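
The clustering example above (grouping customers by purchasing behaviour) can be sketched with k-means from scikit-learn; the two synthetic "customer" groups below are assumptions made purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Unlabeled "customer" data: two behavioural features per customer.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# No target variable is given; the algorithm only looks for structure in the inputs.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:", kmeans.labels_[:5], "...")
print("cluster centres:")
print(kmeans.cluster_centers_)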

Reinforcement Learning

Reinforcement Learning is a type of Machine Learning, and thereby also a branch of


Artificial Intelligence. It allows machines and software agents to automatically
determine the ideal behavior within a specific context, in order to maximize its
performance.
Learning pattern in reinforcement learning

There is an excellent analogy to explain this type of learning paradigm, “training a


dog”.

This learning paradigm is like a dog trainer, which teaches the dog how to respond to
specific signs, like a whistle, clap, or anything else. Whenever the dog responds correctly,
the trainer gives a reward to the dog, which can be a “Bone or a biscuit”.
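
The reward idea behind the dog-training analogy can be made concrete with tabular Q-learning, one of the simplest reinforcement learning algorithms. The tiny "corridor" environment, the hyperparameters, and the reward of 1 at the rightmost state are all assumptions chosen only to illustrate the update rule.

import numpy as np

# A tiny 1-D corridor: 5 states, the reward (the "biscuit") sits in the last state.
# Actions: 0 = move left, 1 = move right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))        # the agent's value estimates
alpha, gamma, epsilon = 0.1, 0.9, 0.1      # learning rate, discount, exploration rate
rng = np.random.default_rng(3)

def step(state, action):
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: usually exploit the best known action, sometimes explore.
        action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(np.round(Q, 2))   # "move right" should score higher than "move left" in every state

Deep reinforcement learning (e.g. Deep Q-Networks) replaces the Q table with a neural network, but the reward-driven update is the same in spirit.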

Deep Learning Frameworks

As with any other AI algorithm, a programming framework is required to create deep learning
algorithms. These are usually extensions of existing frameworks, or specialized frameworks
developed to create deep learning algorithms. Each framework comes with its own drawbacks
and advantages. Let’s delve deeper into some of the most popular and powerful deep learning
frameworks.

1. TensorFlow

TensorFlow is a machine learning and deep learning framework that was created and released by
Google in 2015. TensorFlow is the most popular deep learning framework in use today, as it is
not only used by big leaders like Google, NVIDIA, and Uber, but also by data scientists and AI
practitioners on a daily basis.
TensorFlow is a library for Python, although work is being done to port it to other popular
languages like Java, JavaScript, C++, and more. However, a lot of resources are required to
create a deep learning model with TensorFlow, as it relies on using a lot of coding to specify the
structure of the network.

A commonly cited drawback of TensorFlow is that it operates with a static computation graph
meaning that the algorithm has to be run every time to see the changes. However, the platform
itself is extremely extensible and powerful, contributing to its high degree of adoption.

2. PyTorch

In many ways, PyTorch is TensorFlow’s primary competitor in the deep learning framework
market. Developed and created by Facebook, PyTorch is an open-source deep learning
framework that works with Python. Apart from powering most of Facebook’s services, it is also
used by other companies, such as Johnson & Johnson, Twitter, and Salesforce.

As suggested by its name, PyTorch offers in-built support for Python, and even allows users to
use the standard debuggers provided by the software. As opposed to TensorFlow’s static graph,
PyTorch has a dynamic computation graph, allowing users to easily see what effect their changes
will have on the end result while programming the solution.

The framework offers easier options for training neural networks, utilizing modern technologies
like data parallelism and distributed learning. The community for PyTorch is also highly active,
with pre-trained models being published on a regular basis. However, TensorFlow beats PyTorch
in terms of providing cross-platform solutions, as Google’s vertical integration with Android
offers more power to TF users.

3. Keras

Keras is a deep learning framework that is built on top of other prominent frameworks like
TensorFlow, Theano, and the Microsoft Cognitive Toolkit (CNTK). Even though it loses out
to PyTorch and TensorFlow in terms of programmability, it is the ideal starting point for
beginners to learn neural networks.

Keras allows users to create large and complex models with simple commands. While this means
that it is not as configurable as its competitors, creating prototypes and proofs-of-concept is a
much easier task. It is also accessible as an application programming interface (API), making the
software accessible in any scenario.
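
To show how little code Keras needs for a prototype, here is a minimal sketch (assuming TensorFlow 2.x with its bundled Keras is installed); the random data and the layer sizes are arbitrary illustrative choices.

import numpy as np
from tensorflow import keras

# Synthetic data: 100 samples with 20 features and binary labels.
X = np.random.rand(100, 20).astype("float32")
y = np.random.randint(0, 2, size=(100,))

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=16, verbose=0)
print(model.predict(X[:2]))

The same model would take noticeably more code to express in low-level TensorFlow or PyTorch, which is why Keras is often recommended as a starting point for beginners.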

Closing Thoughts for Techies

Deep learning and neural networks are at the forefront of AI research and technology today. This
sets the stage for even more advancements in deep learning, as research progresses, and newer
methods enter the mainstream.
With the rise of open-source and accessible tools like TensorFlow and Keras, deep learning is
sure to get the ball rolling in terms of enterprise adoption. Supporting infrastructure, such as
powerful and accessible cloud computing and marketplaces for pre-trained models, are also
laying the groundwork for greater adoption of the technology.

Working professionals in the AI space must learn skills that can be used in deep learning
applications. Neural networks may hold the key to a future where AI can function at levels that
are unheard of today.

Top 10 Deep Learning Techniques

1. Classic Neural Networks

Also known as Fully Connected Neural Networks, it is often identified by its multilayer
perceptrons, where the neurons are connected to the adjacent layer. It was designed by Frank
Rosenblatt, an American psychologist, in 1958. It involves the adaptation of the model to
fundamental binary data inputs. The functions included in this model (sketched in code after the
lists below) are:

 Linear function: Rightly termed, it represents a single line which multiplies its inputs
with a constant multiplier.
 Non-Linear function: It is further divided into three subsets:
1. Sigmoid Curve: It is a function interpreted as an S-shaped curve with its range from 0 to
1.
2. Hyperbolic tangent (tanh) refers to the S-shaped curve having a range of -1 to 1.
3. Rectified Linear Unit (ReLU): It is a piecewise function that yields 0 when the input
value is less than the set threshold and yields the input's linear multiple when it is
higher than the set threshold.
Works Best in:

1. Any tabular dataset with rows and columns, such as data formatted in CSV
2. Classification and regression problems with real-valued inputs
3. Any model requiring the highest flexibility, like that of ANNs
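
The activation functions named above can be written in a few lines of NumPy; the constant multiplier and threshold values below are arbitrary illustrative choices.

import numpy as np

def linear(x, c=2.0):
    # Linear function: multiplies its input by a constant multiplier c.
    return c * x

def sigmoid(x):
    # S-shaped curve with range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # S-shaped curve with range (-1, 1).
    return np.tanh(x)

def relu(x, threshold=0.0):
    # Yields 0 below the threshold and the input itself above it.
    return np.where(x < threshold, 0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, fn in [("linear", linear), ("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
    print(name, np.round(fn(x), 3))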
2. Convolutional Neural Networks

CNN is an advanced and high-potential type of the classic artificial neural network model. It is
built for tackling higher complexity, preprocessing, and data compilation. It takes reference
from the order of arrangement of neurons present in the visual cortex of an animal brain.

CNNs can be considered among the most efficient and flexible models, specializing in
image as well as non-image data. They have four different organizational elements:

 It is made up of a single input layer, which generally is a two-dimensional arrangement of


neurons for analyzing primary image data, which is similar to that of photo pixels.
 Some CNNs also consist of a single-dimensional output layer of neurons that processes
images on their inputs, via the scattered connected convolutional layers.
 The CNNs also have the presence of a third layer known as the sampling layer to limit
the number of neurons involved in the corresponding network layers.
 Overall, CNNs have single or multiple connected layers that connect the sampling to
output layers.
This network model can help derive relevant image data in the form of smaller units or chunks.
The neurons present in the convolution layers are accountable for the cluster of neurons in the
previous layer.

Once the input data is imported into the convolutional model, there are four stages involved in
building the CNN (a hedged Keras sketch of these stages appears at the end of this subsection):

 Convolution: The process derives feature maps from input data, followed by a function
applied to these maps.
 Max-Pooling: It helps CNN to detect an image based on given modifications.
 Flattening: In this stage, the data generated is then flattened for a CNN to analyze.
 Full Connection: It is often described as a hidden layer that compiles the loss function
for a model.
The CNNs are adequate for tasks, including image recognition, image analyzing, image
segmentation, video analysis, and natural language processing. However, there can be other
scenarios where CNN networks can prove to be useful like:

 Image datasets containing OCR document analysis


 Any two-dimensional input data which can be further transformed to one-dimensional for
quicker analysis
 The model needs to be involved in its architecture to yield output.
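
The four stages listed earlier (convolution, max-pooling, flattening, full connection) map almost one-to-one onto Keras layers. The sketch below is a hedged illustration (assuming TensorFlow 2.x); the 28x28 grayscale input size, the filter counts, and the 10 output classes are hypothetical choices, not values from the text.

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    # Convolution: derives feature maps from the input data.
    keras.layers.Conv2D(16, kernel_size=3, activation="relu"),
    # Max-pooling (the "sampling layer"): limits the number of neurons involved.
    keras.layers.MaxPooling2D(pool_size=2),
    # Flattening: the feature maps are flattened so dense layers can analyze them.
    keras.layers.Flatten(),
    # Full connection: a hidden dense layer followed by the output layer.
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()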

3. Recurrent Neural Networks (RNNs)

RNNs were first designed to help predict sequences; for example, the Long Short-Term
Memory (LSTM) network is known for its multiple functionalities. Such networks work
entirely on data sequences of variable input length.

The RNN puts the knowledge gained from its previous state as an input value for the current
prediction. Therefore, it can help in achieving short-term memory in a network, leading to the
effective management of stock price changes, or other time-based data systems.

As mentioned earlier, there are two overall types of RNN designs that help in analyzing
problems. They are:

 LSTMs: Useful in the prediction of data in time sequences, using memory. It has three
gates: Input, Output, and Forget.
 Gated RNNs: Also useful in data prediction of time sequences via memory. It has two
gates— Update and Reset.
Works Best in:

 One to One: A single input connected to a single output, like Image classification.
 One to many: A single input linked to output sequences, like Image captioning that
includes several words from a single image.
 Many to One: Series of inputs generating single output, like Sentiment Analysis.
 Many to many: Series of inputs yielding series of outputs, like video classification.
It is also widely used in language translation, conversation modeling, and more.
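
As a concrete many-to-one example (sentiment analysis over a token sequence), here is a hedged Keras sketch; the vocabulary size, sequence length, and random data are illustrative assumptions.

import numpy as np
from tensorflow import keras

vocab_size, seq_len = 1000, 20
X = np.random.randint(0, vocab_size, size=(64, seq_len))   # 64 fake token sequences
y = np.random.randint(0, 2, size=(64,))                    # fake positive/negative labels

model = keras.Sequential([
    keras.layers.Embedding(input_dim=vocab_size, output_dim=32),
    keras.layers.LSTM(32),                        # internal state carries earlier inputs forward
    keras.layers.Dense(1, activation="sigmoid"),  # single output: positive or negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, verbose=0)

Swapping keras.layers.LSTM(32) for keras.layers.GRU(32) gives the gated RNN variant with update and reset gates mentioned above.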



4. Generative Adversarial Networks

It is a combination of two deep learning techniques of neural networks – a Generator and a


Discriminator. While the Generator network yields artificial data, the Discriminator helps in
discerning between real and fake data.

Both of the networks are competitive: the Generator keeps producing artificial data that looks
identical to real data, and the Discriminator continuously tries to tell real data from fake data.
In a scenario where there is a requirement to create an image library, the Generator network would
produce simulated data resembling the authentic images, typically using a deconvolution
(transposed convolution) neural network.

It would then be followed by an image detector (Discriminator) network to differentiate between
the real and fake images. Starting with roughly a 50% chance of being right, the detector needs to
improve the quality of its classification, since the generator grows better at artificial image
generation. Such competition contributes to the overall effectiveness and speed of the network.

Works Best in:

 Image and Text Generation


 Image Enhancement
 New Drug Discovery processes
5. Self-Organizing Maps

The SOMs or Self-Organizing Maps operate on unlabeled data in an unsupervised manner and reduce
the number of random variables in a model. In this type of deep learning technique, the
output dimension is fixed as a two-dimensional model, as each synapse connects to its input
and output nodes.

As each data point competes for its model representation, the SOM updates the weight of the
closest nodes or Best Matching Units (BMUs). Based on the proximity of a BMU, the value
of the weights changes. As weights are considered as a node characteristic in itself, the value
represents the location of the node in the network.

Works best in:

 When the dataset does not come with target (Y) values


 Project explorations for analyzing the dataset framework
 Creative projects in Music, Videos, and Text with the help of AI
6. Boltzmann Machines

This network model doesn’t come with any predefined direction and therefore has its nodes
connected in a circular arrangement. Because of such uniqueness, this deep learning
technique is used to produce model parameters.

Different from all previous deterministic network models, the Boltzmann Machines model is
referred to as stochastic.

Works Best in:

 System monitoring
 Setting up of a binary recommendation platform
 Analyzing specific datasets

7. Deep Reinforcement Learning

Before understanding the Deep Reinforcement Learning technique, reinforcement learning


refers to the process where an agent interacts with an environment to modify its state. The
agent can observe and take actions accordingly; by interacting with the situation, the agent
helps the network reach its objective.

Here, in this network model, there is an input layer, an output layer, and multiple hidden
layers, where the state of the environment forms the input layer itself. The model works by
continuously attempting to predict the future reward of each action taken in the given state of
the situation.

Works Best in:

 Board Games like Chess, Poker


 Self-Drive Cars
 Robotics
 Inventory Management
 Financial tasks like asset pricing
8. Autoencoders

One of the most commonly used types of deep learning techniques, this model operates
automatically based on its inputs, applying an activation function before decoding the final
output. Such a bottleneck formation leads to representing the data with fewer dimensions and
leveraging most of the inherent data structure.
The Types of Autoencoders are:

 Sparse – Where hidden layers outnumber the input layer for the generalization approach
to take place to reduce overfitting. It limits the loss function and prevents the autoencoder
from overusing all its nodes.
 Denoising – Here, a modified version of the input, with some values randomly set to 0, is fed
to the network, which must reconstruct the original input.
 Contractive – A penalty factor is added to the loss function to limit overfitting and data
copying, in case the hidden layer outnumbers the input layer.
 Stacked – Once another hidden layer is added to an autoencoder, it leads to two stages of
encoding for one phase of decoding.
Works Best in:

 Feature detection
 Setting up a compelling recommendation model
 Add features to large datasets
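
A minimal bottleneck autoencoder can be sketched in Keras as follows (assuming TensorFlow 2.x); the 64-dimensional input, the 8-unit bottleneck, and the random training data are illustrative assumptions.

import numpy as np
from tensorflow import keras

input_dim, latent_dim = 64, 8
X = np.random.rand(256, input_dim).astype("float32")

autoencoder = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    keras.layers.Dense(32, activation="relu"),            # encoder
    keras.layers.Dense(latent_dim, activation="relu"),    # bottleneck
    keras.layers.Dense(32, activation="relu"),            # decoder
    keras.layers.Dense(input_dim, activation="sigmoid"),  # reconstruction
])
# The target is the input itself, so the network must learn a compressed representation.
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)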

9. Backpropagation

In deep learning, the backpropagation or back-prop technique is referred to as the central


mechanism for neural networks to learn about any errors in data prediction. Propagation, on the
other hand, refers to the transmission of data in a given direction via a dedicated channel. The
entire system works by propagating signals in the forward direction at the moment of decision,
and then sending information about any shortcomings (errors) back through the network in
reverse.

 First, the network uses its parameters to make a decision (prediction) on the data
 Second, the prediction is weighed against the true value with a loss function
 Third, the identified error gets back-propagated to self-adjust any incorrect parameters
Works Best in:

 Data Debugging
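
The three steps above can be traced on a single sigmoid neuron. This is a minimal sketch; the example input, label, starting weights, and learning rate are all made-up illustrative values.

import numpy as np

x = np.array([2.0, 2.0])          # one training example
y_true = 1.0                      # its label
w = np.array([0.1, -0.2])         # initial weights
b = 0.0                           # initial bias
lr = 0.5                          # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5):
    # 1) Forward pass: the network makes a prediction on the data.
    z = w @ x + b
    y_pred = sigmoid(z)
    # 2) The prediction is weighed against the true value with a loss function.
    loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # 3) The error is back-propagated; for sigmoid + cross-entropy, dL/dz = y_pred - y_true.
    dz = y_pred - y_true
    w -= lr * dz * x              # adjust the weights against their gradient
    b -= lr * dz                  # adjust the bias against its gradient
    print(f"step {step}: loss={loss:.4f}, prediction={y_pred:.4f}")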

10. Gradient Descent

In the mathematical context, a gradient refers to a slope that has a measurable angle and can be
represented as a relationship between variables. In this deep learning technique, the
relationship between the error produced by the neural network and the model parameters
can be represented as “x” and “y”. Since the variables are dynamic in a neural network,
the error can be increased or decreased with small changes to the parameters.
Many professionals visualize the technique as that of a river path coming down the mountain
slopes. The objective of such a method is — to find the optimum solution. Since there is the
presence of several local minimum solutions in a neural network, in which the data can get
trapped and lead to slower, incorrect compilations – there are ways to refrain from such
events.

Like the terrain of a mountain, there are particular functions in the neural network called
convex functions, which keep the error decreasing at expected rates until it reaches its
minimum. There can be differences in the paths the data takes to reach the final destination
due to variation in the initial values of the function.

Works Best in:

 Updating parameters in a given model
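
Here is a minimal sketch of gradient descent on a simple convex error surface; the function, starting point, and learning rate are illustrative assumptions, not part of any particular neural network.

import numpy as np

# A simple convex error surface: E(x, y) = (x - 3)^2 + (y + 1)^2, minimum at (3, -1).
def error(p):
    x, y = p
    return (x - 3) ** 2 + (y + 1) ** 2

def gradient(p):
    x, y = p
    return np.array([2 * (x - 3), 2 * (y + 1)])

p = np.array([0.0, 0.0])          # starting point on the "mountain"
learning_rate = 0.1

for step in range(50):
    p = p - learning_rate * gradient(p)   # take a small step downhill

print("minimum found near:", np.round(p, 3), "error:", round(float(error(p)), 6))

In a neural network the parameters play the role of (x, y) and backpropagation supplies the gradient, but the downhill update rule is the same.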

FEED FORWARD NEURAL NETWORK

Feed Forward Process in Deep Neural Network

Now, we know how combining linear models with different weights and biases can result in non-
linear models. How does a neural network know what weight and bias values to have in each layer? It
is no different from how we did it for the single perceptron model.

We still use the gradient descent optimization algorithm, which minimizes the error of our model
by iteratively moving the parameters in the direction of steepest descent, the direction that
reduces the error the most. It updates the weights of every model in every single layer. We will
talk more about optimization algorithms and backpropagation later.

It is important to understand what the subsequent training of our neural network accomplishes:
the trained network separates our data samples with some decision boundary.

"The process of receiving an input to produce some kind of output to make some kind of prediction is
known as Feed Forward." Feed Forward neural network is the core of many other important neural
networks such as convolution neural network.

In a feed-forward neural network there are no feedback loops or backward connections; there is
simply an input layer, a hidden layer, and an output layer.
There can be multiple hidden layers, depending on the kind of data you are dealing with. The
number of hidden layers is known as the depth of the neural network; a deeper network can
learn more complex functions. The input layer first provides the network with data, and the
output layer then makes predictions on that data based on a series of functions. The ReLU
function is the most commonly used activation function in deep neural networks.

To gain a solid understanding of the feed-forward process, let's see this mathematically.

1) The first input is fed to the network, represented as the vector [x1, x2, 1], where 1 is the
bias value.

2) Each input is multiplied by the weights of the first and second model to obtain its score in
each model.

So, we will multiply our inputs by a matrix of weights using matrix multiplication.
3) After that, we take the sigmoid of our scores, which gives us the probability of the point being
in the positive region in both models.

4) We multiply the probabilities obtained in the previous step by the second set of weights. We
always include a bias of one whenever taking a combination of inputs.

And, as we know, to obtain the probability of the point being in the positive region of this model,
we take the sigmoid, which produces the final output of the feed-forward process.

Let's take the neural network we had previously, with the linear models in the hidden layer
combining to form the non-linear model in the output layer.
We will use our non-linear model to produce an output that describes the probability of the
point being in the positive region. The point is (2, 2); along with the bias, we represent the
input as [2, 2, 1].

Recall the first linear model in the hidden layer and the equation that defines it.

This means that, in the first layer, to obtain the linear combination the inputs are multiplied by
-4 and -1 and the bias value is multiplied by 12.

In the second model, the inputs are multiplied by -1/5 and 1, and the bias is multiplied by 3, to
obtain the linear combination of that same point.

Now, to obtain the probability that the point is in the positive region relative to both models, we
apply the sigmoid to both scores as

The second layer contains the weights which dictated the combination of the linear models in the first
layer to obtain the non-linear model in the second layer. The weights are 1.5, 1, and a bias value of 0.5.

Now, we have to multiply our probabilities from the first layer with the second set of weights as

Now, we will take the sigmoid of our final score

This is the complete math behind the feed-forward process, in which the inputs traverse the
entire depth of the neural network. In this example there is only one hidden layer, but whether
there is one hidden layer or twenty, the computational process is the same for every hidden layer.
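
The worked example above can be reproduced in a few lines of NumPy. This is only an
illustrative sketch; it assumes the weights quoted in the text (-4, -1 and bias 12 for the first
hidden model; -1/5, 1 and bias 3 for the second; 1.5, 1 and bias 0.5 in the output layer) and the
input point (2, 2).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 2.0, 1.0])                  # input point (2, 2) plus bias input 1

# Hidden layer: one row of weights per linear model, last column is the bias weight.
W1 = np.array([[-4.0, -1.0, 12.0],
               [-0.2,  1.0,  3.0]])
h = sigmoid(W1 @ x)                            # probabilities from the two linear models

# Output layer combines the two hidden outputs plus a bias of 1.
W2 = np.array([1.5, 1.0, 0.5])
y = sigmoid(W2 @ np.append(h, 1.0))            # final feed-forward output

print(h, y)                                    # roughly [0.881, 0.990] and 0.94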
Artificial Neural Networks and its Applications

As you read this article, which organ in your body is thinking about it? It’s the brain of
course! But do you know how the brain works? Well, it has neurons or nerve cells that
are the primary units of both the brain and the nervous system. These neurons receive
sensory input from the outside world which they process and then provide the output
which might act as the input to the next neuron.
Each of these neurons is connected to other neurons in complex arrangements at
synapses. Now, are you wondering how this is related to Artificial Neural Networks?
Well, Artificial Neural Networks are modeled after the neurons in the human brain.
Let’s check out what they are in detail and how they learn information.

Artificial Neural Networks

Artificial Neural Networks contain artificial neurons which are called units. These units
are arranged in a series of layers that together constitute the whole Artificial Neural
Network in a system. A layer can have only a dozen units or millions of units, depending on
how complex a network must be to learn the hidden patterns in the dataset. Commonly, an
Artificial Neural Network has an input layer, an
output layer as well as hidden layers. The input layer receives data from the outside
world which the neural network needs to analyze or learn about. Then this data
passes through one or multiple hidden layers that transform the input into data that is
valuable for the output layer. Finally, the output layer provides an output in the form of
a response of the Artificial Neural Networks to input data provided.
In the majority of neural networks, units are interconnected from one layer to another.
Each of these connections has weights that determine the influence of one unit on
another unit. As the data transfers from one unit to another, the neural network learns
more and more about the data which eventually results in an output from the output
layer.
Neural Networks Architecture

The structures and operations of human neurons serve as the basis for artificial neural
networks. It is also known as neural networks or neural nets. The input layer of an
artificial neural network is the first layer, and it receives input from external sources
and releases it to the hidden layer, which is the second layer. In the hidden layer, each
neuron receives input from the previous layer neurons, computes the weighted sum,
and sends it to the neurons in the next layer. These connections are weighted, meaning the
effect of each input from the previous layer is scaled up or down by the weight assigned to it;
the weights are adjusted during the training process to improve model performance.
Artificial neurons vs Biological neurons
The concept of artificial neural networks comes from the biological neurons found in
animal brains, so the two share many similarities in structure and function.
 Structure: The structure of artificial neural networks is inspired by biological
neurons. A biological neuron has a cell body or soma to process the impulses,
dendrites to receive them, and an axon that transfers them to other neurons. The
input nodes of artificial neural networks receive input signals, the hidden layer
nodes compute these input signals, and the output layer nodes compute the final
output by processing the hidden layer’s results using activation functions.
Biological Neuron → Artificial Neuron
Dendrite → Inputs
Cell nucleus or Soma → Nodes
Synapses → Weights
Axon → Output

 Synapses: Synapses are the links between biological neurons that enable the
transmission of impulses from dendrites to the cell body. Synapses are the weights
that join the one-layer nodes to the next-layer nodes in artificial neurons. The
strength of the links is determined by the weight value.
 Learning: In biological neurons, learning happens in the cell body nucleus or soma,
which has a nucleus that helps to process the impulses. An action potential is
produced and travels through the axons if the impulses are powerful enough to
reach the threshold. This becomes possible by synaptic plasticity, which
represents the ability of synapses to become stronger or weaker over time in
reaction to changes in their activity. In artificial neural networks, backpropagation
is a technique used for learning, which adjusts the weights between nodes
according to the error or differences between predicted and actual outcomes.
Biological Neuron → Artificial Neuron
Synaptic plasticity → Backpropagation

 Activation: In biological neurons, activation is the firing rate of the neuron which
happens when the impulses are strong enough to reach the threshold. In artificial
neural networks, a mathematical function known as an activation function maps the
input to the output and performs the activation.
Biological neurons to Artificial neurons

How do Artificial Neural Networks learn?

Artificial neural networks are trained using a training set. For example, suppose you
want to teach an ANN to recognize a cat. Then it is shown thousands of different
images of cats so that the network can learn to identify a cat. Once the neural network
has been trained enough using images of cats, then you need to check if it can identify
cat images correctly. This is done by making the ANN classify the images it is
provided by deciding whether they are cat images or not. The output obtained by the
ANN is corroborated by a human-provided description of whether the image is a cat
image or not. If the ANN identifies incorrectly then back-propagation is used to adjust
whatever it has learned during training. Backpropagation is done by fine-tuning the
weights of the connections in ANN units based on the error rate obtained. This
process continues until the artificial neural network can correctly recognize a cat in an
image with minimal possible error rates.

What are the types of Artificial Neural Networks?

 Feedforward Neural Network : The feedforward neural network is one of the most
basic artificial neural networks. In this ANN, the data or the input provided travels in
a single direction. It enters into the ANN through the input layer and exits through
the output layer while hidden layers may or may not exist. So the feedforward
neural network has a front-propagated wave only and usually does not have
backpropagation.
 Convolutional Neural Network : A Convolutional neural network has some
similarities to the feed-forward neural network, where the connections between
units have weights that determine the influence of one unit on another unit. But a
CNN has one or more than one convolutional layer that uses a convolution
operation on the input and then passes the result obtained in the form of output to
the next layer. CNN has applications in speech and image processing which is
particularly useful in computer vision.
 Modular Neural Network: A Modular Neural Network contains a collection of
different neural networks that work independently towards obtaining the output with
no interaction between them. Each of the different neural networks performs a
different sub-task by obtaining unique inputs compared to other networks. The
advantage of this modular neural network is that it breaks down a large and
complex computational process into smaller components, thus decreasing its
complexity while still obtaining the required output.
 Radial basis function Neural Network: Radial basis functions are functions that consider
the distance of a point with respect to a center. RBF networks have two layers. In the first
layer, the input is mapped onto all the radial basis functions in the hidden layer, and then
the output layer computes the output in the next step. Radial
basis function nets are normally used to model the data that represents any
underlying trend or function.
 Recurrent Neural Network: The Recurrent Neural Network saves the output of a
layer and feeds this output back to the input to better predict the outcome of the
layer. The first layer in the RNN is quite similar to the feed-forward neural network
and the recurrent neural network starts once the output of the first layer is
computed. After this layer, each unit will remember some information from the
previous step so that it can act as a memory cell in performing computations.

Applications of Artificial Neural Networks

1. Social Media: Artificial Neural Networks are used heavily in Social Media. For
example, let’s take the ‘People you may know’ feature on Facebook that suggests
people that you might know in real life so that you can send them friend requests.
Well, this magical effect is achieved by using Artificial Neural Networks that analyze
your profile, your interests, your current friends, and also their friends and various
other factors to calculate the people you might potentially know. Another common
application of Machine Learning in social media is facial recognition. This is done
by finding around 100 reference points on the person’s face and then matching
them with those already available in the database using convolutional neural
networks.
2. Marketing and Sales: When you log onto E-commerce sites like Amazon and
Flipkart, they will recommend products for you to buy based on your previous
browsing history. Similarly, suppose you love Pasta, then Zomato, Swiggy, etc. will
show you restaurant recommendations based on your tastes and previous order
history. This is true across all new-age marketing segments like Book sites, Movie
services, Hospitality sites, etc. and it is done by implementing personalized
marketing. This uses Artificial Neural Networks to identify the customer likes,
dislikes, previous shopping history, etc., and then tailor the marketing campaigns
accordingly.
3. Healthcare: Artificial Neural Networks are used in Oncology to train algorithms that
can identify cancerous tissue at the microscopic level at the same accuracy as
trained physicians. Various rare diseases may manifest in physical characteristics
and can be identified in their early stages by using facial analysis on the
patient photos. So the full-scale implementation of Artificial Neural Networks in the
healthcare environment can only enhance the diagnostic abilities of medical experts
and ultimately lead to the overall improvement in the quality of medical care all over
the world.
4. Personal Assistants: I am sure you all have heard of Siri, Alexa, Cortana, etc.,
and have probably used one, depending on the phone you have! These are personal
assistants and an example of speech recognition that uses Natural Language
Processing to interact with the users and formulate a response accordingly.
Natural Language Processing uses artificial neural networks that are made to
handle many tasks of these personal assistants such as managing the language
syntax, semantics, correct speech, the conversation that is going on, etc.
Activation functions in Neural Networks

It is recommended to understand Neural Networks before reading this article.


In the process of building a neural network, one of the choices you get to make is
what Activation Function to use in the hidden layer as well as at the output layer of the
network. This article discusses some of the choices.
Elements of a Neural Network
Input Layer: This layer accepts input features. It provides information from the outside
world to the network, no computation is performed at this layer, nodes here just pass
on the information(features) to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world, they are part of
the abstraction provided by any neural network. The hidden layer performs all sorts of
computation on the features entered through the input layer and transfers the result to
the output layer.
Output Layer: This layer brings the information learned by the network out to the outer
world.
What is an activation function and why use them?
The activation function decides whether a neuron should be activated or not by
calculating the weighted sum and further adding bias to it. The purpose of the
activation function is to introduce non-linearity into the output of a neuron.
Explanation: We know, the neural network has neurons that work in correspondence
with weight, bias, and their respective activation function. In a neural network, we
would update the weights and biases of the neurons on the basis of the error at the
output. This process is known as back-propagation. Activation functions make the
back-propagation possible since the gradients are supplied along with the error to
update the weights and biases.
Why do we need Non-linear activation function?
A neural network without an activation function is essentially just a linear regression
model. The activation function does the non-linear transformation to the input making
it capable to learn and perform more complex tasks.
Mathematical proof
Suppose we have a Neural net like this :-
Elements of the diagram are as follows:
Hidden layer i.e. layer 1:
z(1) = W(1)X + b(1)
a(1) = z(1)
Here,
 z(1) is the vectorized output of layer 1
 W(1) is the vectorized weights assigned to the neurons of the hidden layer i.e. w1, w2, w3
and w4
 X is the vectorized input features i.e. i1 and i2
 b(1) is the vectorized bias assigned to the neurons in the hidden layer i.e. b1 and b2
 a(1) is the vectorized form of any linear function.
(Note: We are not considering activation function here)

Layer 2 i.e. output layer :-


Note : Input for layer 2 is output from layer 1
z(2) = W(2)a(1) + b(2)
a(2) = z(2)
Calculation at Output layer
z(2) = (W(2) * [W(1)X + b(1)]) + b(2)
z(2) = [W(2) * W(1)] * X + [W(2)*b(1) + b(2)]
Let,
[W(2) * W(1)] = W
[W(2)*b(1) + b(2)] = b
Final output : z(2) = W*X + b
which is again a linear function
So the output is again a linear function even after adding a hidden layer. Hence we can conclude
that no matter how many hidden layers we attach to the neural net, all layers will behave the
same way, because the composition of two linear functions is itself a linear function. A neuron
cannot learn anything useful with just a linear function attached to it; a non-linear activation
function lets it learn from the error. Hence we need an activation function.
Variants of Activation Function
Linear Function
 Equation : A linear function has the equation of a straight line, i.e. y = x
 No matter how many layers we have, if all of them are linear in nature, the final activation
function of the last layer is nothing but a linear function of the input of the first layer.
 Range : -inf to +inf
 Uses : The linear activation function is used in just one place, i.e. the output layer.
 Issues : The derivative of a linear function is a constant that no longer depends on the
input "x", so gradient updates carry no information about the input and the function cannot
introduce any non-linear behaviour into our algorithm.
For example, calculating the price of a house is a regression problem. A house price can take
any large or small value, so we can apply a linear activation at the output layer. Even in this
case, the neural net must have non-linear activation functions in its hidden layers.
Sigmoid Function

 It is a function which is plotted as ‘S’ shaped graph.


 Equation : A = 1/(1 + e^(-x))
 Nature : Non-linear. Notice that for X values between -2 and 2, the Y values are very
steep, meaning small changes in x bring about large changes in the value of Y.
 Value Range : 0 to 1
 Uses : Usually used in output layer of a binary classification, where result is either
0 or 1, as value for sigmoid function lies between 0 and 1 only so, result can be
predicted easily to be 1 if value is greater than 0.5 and 0 otherwise.
Tanh Function
 The activation that works almost always better than sigmoid function is Tanh
function also known as Tangent Hyperbolic function. It’s actually mathematically
shifted version of the sigmoid function. Both are similar and can be derived from
each other.
 Equation :- tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x)) = 2 × sigmoid(2x) − 1

 Value Range :- -1 to +1
 Nature :- non-linear
 Uses :- Usually used in the hidden layers of a neural network, as its values lie between
-1 and 1; the mean of the hidden-layer outputs therefore comes out to be 0 or very close to
it, which helps center the data and makes learning for the next layer much easier.
RELU Function

 It Stands for Rectified linear unit. It is the most widely used activation function.
Chiefly implemented in hidden layers of Neural network.
 Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.
 Value Range :- [0, inf)
 Nature :- non-linear, which means we can easily backpropagate the errors and
have multiple layers of neurons being activated by the ReLU function.
 Uses :- ReLU is less computationally expensive than tanh and sigmoid because it
involves simpler mathematical operations. At any time only a few neurons are activated,
making the network sparse and therefore efficient and easy to compute. In simple words,
ReLU learns much faster than the sigmoid and tanh functions.
Softmax Function

The softmax function is also a type of sigmoid function but is handy when we are
trying to handle multi- class classification problems.
 Nature :- non-linear
 Uses :- Usually used when handling multiple classes. The softmax function is commonly
found in the output layer of image classification problems. It squeezes the output for each
class to between 0 and 1 and divides each by the sum of the outputs, so the values form a
probability distribution.
 Output:- The softmax function is ideally used in the output layer of the classifier
where we are actually trying to attain the probabilities to define the class of each
input.
 The basic rule of thumb is if you really don’t know what activation function to use,
then simply use RELU as it is a general activation function in hidden layers and is
used in most cases these days.
 If your output is for binary classification, then the sigmoid function is a very natural choice
for the output layer.
 If your output is for multi-class classification, then Softmax is very useful for predicting
the probabilities of each class. (A short NumPy sketch of these activation functions follows.)
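
As a quick reference, here is a minimal NumPy sketch of the activation functions discussed
above; the sample input vector is only for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                          # squashes values into (-1, 1), zero-centred

def relu(x):
    return np.maximum(0.0, x)                  # 0 for negative inputs, identity otherwise

def softmax(x):
    e = np.exp(x - np.max(x))                  # shift for numerical stability
    return e / e.sum()                         # outputs sum to 1, usable as probabilities

z = np.array([-2.0, 0.0, 3.0])                 # illustrative pre-activation scores
print(sigmoid(z), tanh(z), relu(z), softmax(z))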
Multi Layered Neural Networks in R Programming

A neural network is a series or set of algorithms that tries to recognize the underlying
relationships in a data set through a process that mimics the operation of the human brain.
The term can refer to networks of neurons that are either artificial or organic in nature. A
neural network can easily adapt to changing input to achieve or generate the best possible
result without needing to redesign the output criteria.

Types of Neural Network

Neural Networks can be classified into multiple types based on their layers and depth,
activation filters, structure, neurons used, neuron density, data flow, and so on. The
types of Neural Networks are as follows:
 Perceptron
 Multi-Layer Perceptron or Multi-Layer Neural Network
 Feed Forward Neural Networks
 Convolutional Neural Networks
 Radial Basis Function Neural Networks
 Recurrent Neural Networks
 Sequence to Sequence Model
 Modular Neural Network
Multi-Layer Neural Network
To be precise, a fully connected multi-layered neural network is known as a Multi-Layer
Perceptron. A multi-layered neural network consists of multiple layers of artificial neurons
or nodes. Unlike single-layer neural networks, most networks built in recent times are
multi-layered. The following diagram is a visualization of a multi-layer neural network.
Explanation: Here the nodes marked as “1” are known as bias units. The leftmost layer or
Layer 1 is the input layer, the middle layer or Layer 2 is the hidden layer, and the rightmost
layer or Layer 3 is the output layer. We can say that the above diagram has 3 input units
(leaving the bias unit), 1 output unit, and 4 hidden units (1 bias unit is not included).
A multi-layered neural network is a typical example of a feed-forward neural network. The
number of neurons and the number of layers are hyperparameters of the network that need
tuning; in order to find ideal values for them, one must use cross-validation techniques.
Weight adjustment during training is carried out using the back-propagation technique.
Formula for Multi-Layered Neural Network

Suppose we have n inputs (x1, x2, …, xn) and a bias unit, and let the weights applied be w1,
w2, …, wn. The summation is found by performing the dot product between the inputs and the
weights and adding the bias:

r = Σ (i = 1 to n) wi xi + bias

On feeding r into the activation function F(r), we find the output of a hidden-layer neuron.
For the first neuron of the first hidden layer h1, this is:
h1(1) = F(r)
Repeat the same procedure for all the other hidden layers, continuing until the last set of
weights is reached.

Implementing Multi-Layered Neural Network in R

In R Language, install the neuralnet package to work on the concepts of Neural


Network. The neuralnet package demands an all-numeric matrix or data frame.
Control the hidden layers by mentioning the value against the hidden parameter of
the neuralnet() function which can be a vector for many hidden layers. Use
the set.seed() function every time to generate random numbers.
Example: Use the neuralnet package in order to fit a linear model. Let us see the
steps to fit a Multi-Layered Neural network in R.
Step 1: The first step is to pick the dataset. Here in this example, let’s work on
the Boston dataset of the MASS package. This dataset typically deals with the
housing values in the fringes or suburbs of Boston. The goal is to find the medv or
median values of the houses occupied by its owner by using all the other available
continuous variables. Use the set.seed() function to generate random numbers.
r

set.seed(500)

library(MASS)

data <- Boston

Step 2: Then check for missing values or data points in the dataset. If any, then fix
the data points which are missing.
r

apply(data, 2, function(x) sum(is.na(x)))

Output:
crim zn indus chas nox rm age dis rad
tax ptratio black lstat medv
0 0 0 0 0 0 0 0 0
0 0 0 0 0
Step 3: Since no data points are missing proceed towards preparing the data set.
Now randomly split the data into two sets, Train set and Test set. On preparing the
data, try to fit the data on a linear regression model and then test it on the test set.
r

index <- sample(1 : nrow(data),

round(0.75 * nrow(data)))

train <- data[index, ]

test <- data[-index, ]

lm.fit <- glm(medv~., data = train)

summary(lm.fit)

pr.lm <- predict(lm.fit, test)

MSE.lm <- sum((pr.lm - test$medv)^2) / nrow(test)

Output:
Deviance Residuals:
Min 1Q Median 3Q Max
-14.9143 -2.8607 -0.5244 1.5242 25.0004

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.469681 6.099347 7.127 5.50e-12 ***
crim -0.105439 0.057095 -1.847 0.065596 .
zn 0.044347 0.015974 2.776 0.005782 **
indus 0.024034 0.071107 0.338 0.735556
chas 2.596028 1.089369 2.383 0.017679 *
nox -22.336623 4.572254 -4.885 1.55e-06 ***
rm 3.538957 0.472374 7.492 5.15e-13 ***
age 0.016976 0.015088 1.125 0.261291
dis -1.570970 0.235280 -6.677 9.07e-11 ***
rad 0.400502 0.085475 4.686 3.94e-06 ***
tax -0.015165 0.004599 -3.297 0.001072 **
ptratio -1.147046 0.155702 -7.367 1.17e-12 ***
black 0.010338 0.003077 3.360 0.000862 ***
lstat -0.524957 0.056899 -9.226 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 23.26491)

Null deviance: 33642 on 379 degrees of freedom


Residual deviance: 8515 on 366 degrees of freedom
AIC: 2290

Number of Fisher Scoring iterations: 2


Step 4: Now normalize the data set before training the neural network, i.e. scale the data
and split it again. The scale() function returns a matrix that needs to be coerced into a
data frame.
r

maxs <- apply(data, 2, max)

mins <- apply(data, 2, min)

scaled <- as.data.frame(scale(data,

center = mins,

scale = maxs - mins))


train_ <- scaled[index, ]

test_ <- scaled[-index, ]

Step 5: Now fit the data into Neural network. Use the neuralnet package.
r

library(neuralnet)

n <- names(train_)

f <- as.formula(paste("medv ~",

paste(n[!n %in% "medv"],

collapse = " + ")))

nn <- neuralnet(f,

data = train_,

hidden = c(4, 2),

linear.output = T)

Now our model is fitted to the Multi-Layered Neural network. Now combine all the
steps and also plot the neural network to visualize the output. Use the plot() function
to do so.
r

# R program to illustrate

# Multi Layered Neural Networks

# Use the set.seed() function


# To generate random numbers

set.seed(500)

# Import required library

library(MASS)

# Working on the Boston dataset

data <- Boston

apply(data, 2, function(x) sum(is.na(x)))

index <- sample(1 : nrow(data),

round(0.75 * nrow(data)))

train <- data[index, ]

test <- data[-index, ]

lm.fit <- glm(medv~., data = train)

summary(lm.fit)

pr.lm <- predict(lm.fit, test)

MSE.lm <- sum((pr.lm - test$medv)^2) / nrow(test)

maxs <- apply(data, 2, max)

mins <- apply(data, 2, min)


scaled <- as.data.frame(scale(data,

center = mins,

scale = maxs - mins))

train_ <- scaled[index, ]

test_ <- scaled[-index, ]

# Applying Neural network concepts

library(neuralnet)

n <- names(train_)

f <- as.formula(paste("medv ~",

paste(n[!n %in% "medv"],

collapse = " + ")))

nn <- neuralnet(f, data = train_,

hidden = c(4, 2),

linear.output = T)

# Plotting the graph

plot(nn)

Output:
UNIT-II

TRAINING NEURAL NETWORK

Principles of Risk Minimization for Learning Theory

Abstract

Learning is posed as a problem of function estimation, for which two principles of solution are
considered: empirical risk minimization and structural risk minimization. These two principles
are applied to two different statements of the function estimation problem: global and local.
Systematic improvements in prediction power are illustrated in application to zip-code
recognition.

1 INTRODUCTION
The structure of the theory of learning differs from that of most other theories for applied
problems. The search for a solution to an applied problem usually requires the three following
steps:

1. State the problem in mathematical terms.

2. Formulate a general principle to look for a solution to the problem.

3. Develop an algorithm based on such general principle.

The first two steps of this procedure offer in general no major difficulties; the third step
requires most efforts, in developing computational algorithms to solve the problem at hand. In
the case of learning theory, however, many algorithms have been developed, but we still lack a
clear understanding of the mathematical statement needed to describe the learning procedure,
and of the general principle on which the search for solutions

should be based. This paper is devoted to these first two steps, the statement of the problem
and the general principle of solution. The paper is organized as follows. First, the problem of
function estimation is stated, and two principles of solution are discussed: the principle of
empirical risk minimization and the principle of structural risk minimization. A new statement is
then given: that of local estimation of function, to which the same principles are applied. An
application to zip-code recognition is used to illustrate these ideas.

2 FUNCTION ESTIMATION MODEL

The learning process is described through three components:

1. A generator of random vectors x, drawn independently from a fixed but unknown


distribution P(x).

2. A supervisor which returns an output vector y to every input vector x, according to a


conditional distribution function P(y|x), also fixed but unknown.

3. A learning machine capable of implementing a set of functions f(x, w), w ∈ W.

The problem of learning is that of choosing from the given set of functions the one which best
approximates the supervisor's response. The selection is based on a training set of ℓ independent
observations

(x1, y1), ..., (xℓ, yℓ).   (1)

The formulation given above implies that learning corresponds to the problem of function
approximation.

3 PROBLEM OF RISK MINIMIZATION

In order to choose the best available approximation to the supervisor's response, we measure
the loss or discrepancy L(y, f(x, w)) between the response y of the supervisor to a given input x
and the response f(x, w) provided by the learning machine. Consider the expected value of the
loss, given by the risk functional

R(w) = ∫ L(y, f(x, w)) dP(x, y).   (2)

The goal is to minimize the risk functional R(w) over the class of functions f(x, w), w ∈ W. But
the joint probability distribution P(x, y) = P(y|x)P(x) is unknown and the only available
information is contained in the training set (1).

4 EMPIRICAL RISK MINIMIZATION

In order to solve this problem, the following induction principle is proposed:

the risk functional R(w) is replaced by the empirical risk functional

E(w) = (1/ℓ) Σ (i = 1 to ℓ) L(yi, f(xi, w))   (3)

constructed on the basis of the training set (1). The induction principle of empirical risk
minimization (ERM) assumes that the function f(x, w*), which minimizes E(w) over the set
w ∈ W, results in a risk R(w*) which is close to its minimum. This induction principle is quite
general; many classical methods such as least squares or maximum likelihood are realizations of
the ERM principle. The evaluation of the soundness of the ERM principle requires answers to
the following two questions:

1. Is the principle consistent? (Does R(w*) converge to its minimum value on the set w ∈ W as ℓ → ∞?)
2. How fast is the convergence as ℓ increases?

The answers to these two questions have been shown (Vapnik et al., 1989) to be equivalent to
the answers to the following two questions:

1. Does the empirical risk E(w) converge uniformly to the actual risk R(w) over the full set
f(x, w), w ∈ W? Uniform convergence is defined as

Prob{ sup (w ∈ W) |R(w) − E(w)| > ε } → 0 as ℓ → ∞.   (4)

2. What is the rate of convergence?

It is important to stress that uniform convergence (4) for the full set of functions is

a necessary and sufficient condition for the consistency of the ERM principle.
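
To make the ERM induction principle concrete, here is a small NumPy sketch that evaluates the
empirical risk (3) under the 0/1 loss and picks, from a finite set of candidate indicator functions,
the one that minimizes it; the toy data and the candidate weight vectors are invented purely for
illustration.

import numpy as np

# Toy training set (illustrative): 2-D points with 0/1 labels.
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])

def empirical_risk(w, X, y):
    # Empirical risk E(w): average 0/1 loss of the indicator f(x, w) = [w.x > 0].
    predictions = (X @ w > 0).astype(int)
    return np.mean(predictions != y)

# ERM: among a (finite, illustrative) set of candidate functions, pick the one
# with the smallest empirical risk on the training set.
candidates = [np.array([1.0, -1.0]), np.array([-1.0, 1.0]), np.array([0.5, -2.0])]
best = min(candidates, key=lambda w: empirical_risk(w, X, y))
print(best, empirical_risk(best, X, y))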

5 VC-DIMENSION OF THE SET OF FUNCTIONS

The theory of uniform convergence of empirical risk to actual risk, developed in the 70's and
80's, includes a description of necessary and sufficient conditions as well as bounds for the
rate of convergence (Vapnik, 1982). These bounds, which are independent of the
distribution function P(x, y), are based on a quantitative measure of the capacity of the set
of functions implemented by the learning machine: the VC-dimension of the set. For
simplicity, these bounds will be discussed here only for the case of binary pattern
recognition, for which y ∈ {0, 1} and f(x, w), w ∈ W is the class of indicator functions. The loss
function takes only two values: L(y, f(x, w)) = 0 if y = f(x, w) and L(y, f(x, w)) = 1 otherwise. In
this case, the risk functional (2) is the probability of error, denoted P(w). The empirical
risk functional (3), denoted ν(w), is the frequency of error in the training set. The VC-
dimension of a set of indicator functions is the maximum number h of vectors which can be
shattered in all possible 2^h ways using functions in the set. For instance, h = n + 1 for linear
decision rules in n-dimensional space, since they can shatter at most n + 1 points.

6 RATES OF UNIFORM CONVERGENCE

The notion of VC-dimension provides a bound to the rate of uniform convergence. For a set
of indicator functions with VC-dimension h, the following inequality holds:

Prob{ sup (w ∈ W) |P(w) − ν(w)| > ε } < (2ℓe/h)^h exp{−ε²ℓ}.   (5)

It then follows that, with probability 1 − η, simultaneously for all w ∈ W,

P(w) < ν(w) + C0(ℓ/h, η),   (6)

with confidence interval

C0(ℓ/h, η) = sqrt( (h(ln(2ℓ/h) + 1) − ln η) / ℓ ).   (7)

This important result provides a bound on the actual risk P(w) for all w ∈ W, including the w*
which minimizes the empirical risk ν(w). The deviation |P(w) − ν(w)| in (5) is expected to be
largest for P(w) close to 1/2, since it is this value of P(w) which maximizes the error
variance σ(w) = sqrt(P(w)(1 − P(w))).

The worst-case bound for the confidence interval (7) is thus likely to be controlled by the worst
decision rule.

The bound (6) is achieved for the worst case P(w) = 1/2, but not for small P(w), which is the
case of interest. A uniformly good approximation to P(w) follows from considering

Prob{ sup (w ∈ W) (P(w) − ν(w)) / σ(w) > ε }.   (8)

The variance of the relative deviation (P(w) − ν(w))/σ(w) is now independent of w. A
bound for the probability (8), if available, would yield a uniformly good bound on the actual
risk for all P(w). Such a bound has not yet been established. But for P(w) << 1 the
approximation σ(w) ≈ sqrt(P(w)) holds, and the following inequality is true:

Prob{ sup (w ∈ W) (P(w) − ν(w)) / sqrt(P(w)) > ε } < (2ℓe/h)^h exp{−ε²ℓ/4}.   (9)

It then follows that, with probability 1 − η, simultaneously for all w ∈ W,

P(w) < ν(w) + C1(ℓ/h, ν(w), η),   (10)

with confidence interval

C1(ℓ/h, ν(w), η) = 2 ((h(ln(2ℓ/h) + 1) − ln η)/ℓ) (1 + sqrt(1 + ν(w)ℓ / (h(ln(2ℓ/h) + 1) − ln η))).   (11)

Note that the confidence interval now depends on ν(w), and that for ν(w) = 0 it reduces to
C1(ℓ/h, 0, η) = 2 C0²(ℓ/h, η), which provides a more precise bound for real-case learning.

7 STRUCTURAL RISK MINIMIZATION

The method of ERM can be theoretically justified by considering the inequalities (6) or (10).
When ℓ/h is large, the confidence interval C0 or C1 becomes small and can be neglected. The
actual risk is then bounded by the empirical risk alone, and the probability of error on the test
set can be expected to be small when the frequency of error in the training set is small.
However, if ℓ/h is small, the confidence interval cannot be neglected, and even ν(w) = 0 does
not guarantee a small probability of error. In this case the minimization of P(w) requires a new
principle, based on the simultaneous minimization of ν(w) and the confidence interval. It is then
necessary to control the VC-dimension of the learning machine. To do this, we introduce a
nested structure of subsets S_p = {f(x, w), w ∈ W_p}, such that S1 ⊂ S2 ⊂ ... ⊂ Sn. The
corresponding VC-dimensions of the subsets satisfy h1 < h2 < ... < hn. The principle of
structural risk minimization (SRM) requires a two-step process: the empirical risk has to be
minimized for each element of the structure, and the optimal element S* is then selected to
minimize the guaranteed risk, defined as the sum of the empirical risk and the confidence
interval. This process involves a trade-off: as h increases the minimum empirical risk decreases,
but the confidence interval increases.

8 EXAMPLES OF STRUCTURES FOR NEURAL NETS

The general principle of SRM can be implemented in many different ways. Here we consider
three different examples of structures built for the set of functions implemented by a neural
network.

1. Structure given by the architecture of the neural network. Consider an ensemble of fully
connected neural networks in which the number of units in one of the hidden layers is
monotonically increased. The set of implementable functions forms a structure as the
number of hidden units is increased.

2. Structure given by the learning procedure. Consider the set of functions S = {f(x, w), w ∈ W}
implementable by a neural net of fixed architecture. The parameters {w} are the weights
of the neural network. A structure is introduced through S_p = {f(x, w), ||w|| < C_p} with
C1 < C2 < ... < Cn. For a convex loss function, the minimization of the empirical risk within the
element S_p of the structure is achieved through the minimization of

E(w, γ_p) = (1/ℓ) Σ (i = 1 to ℓ) L(yi, f(xi, w)) + γ_p ||w||²

with appropriately chosen Lagrange multipliers γ1 > γ2 > ... > γn. The well-known "weight decay"
procedure refers to the minimization of this functional.

3. Structure given by preprocessing. Consider a neural net with fixed architecture. The input
representation is modified by a transformation z = K(x, β), where the parameter β controls
the degree of degeneracy introduced by this transformation (for instance, β could be
the width of a smoothing kernel). A structure is introduced in the set of
functions S = {f(K(x, β), w), w ∈ W} through β ≥ c_p with c1 > c2 > ... > cn.
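
The "structure given by the learning procedure" in example 2 is easy to express in code. Below
is a minimal NumPy sketch of minimizing the regularized empirical risk E(w, γ) for a linear model
with squared loss by gradient descent, i.e. the weight-decay procedure; the synthetic data,
learning rate, and value of γ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                     # illustrative inputs
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)

def regularized_risk_grad(w, X, y, gamma):
    # Gradient of E(w, gamma) = (1/l) * sum (x_i.w - y_i)^2 + gamma * ||w||^2.
    residual = X @ w - y
    return (2.0 / len(y)) * X.T @ residual + 2.0 * gamma * w

gamma, lr = 0.1, 0.05                            # larger gamma -> smaller ||w|| -> smaller element S_p
w = np.zeros(X.shape[1])
for _ in range(500):
    w -= lr * regularized_risk_grad(w, X, y, gamma)   # the "weight decay" update

print(np.round(w, 2), round(float(np.linalg.norm(w)), 2))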

Loss Functions and Their Use In Neural Networks

Loss functions are one of the most important aspects of neural networks, as they (along
with the optimization functions) are directly responsible for fitting the model to the given
training data.

This article will dive into how loss functions are used in neural networks, different types
of loss functions, writing custom loss functions in TensorFlow, and practical
implementations of loss functions to process image and video training data — the
primary data types used for computer vision, my topic of interest & focus.

Background Information

First, a quick review of the fundamentals of neural networks and how they work.

Neural networks are a set of algorithms that are designed to recognize


trends/relationships in a given set of training data. These algorithms are based on the
way human neurons process information.

This equation represents how a neural network processes the input data at each layer and
eventually produces a predicted output value.

ŷ = f(wᵀx + b)


To train — the process by which the model maps the relationship between the training
data and the outputs — the neural network updates its parameters, the
weights, wT, and biases, b, to satisfy the equation above.

Each training input is loaded into the neural network in a process called forward
propagation. Once the model has produced an output, this predicted output is
compared against the given target output in a process called backpropagation — the
parameters of the model are then adjusted so that it now outputs a result closer to
the target output.

This is where loss functions come in.


Loss Functions Overview


A loss function is a function that compares the target and predicted output values and
measures how well the neural network models the training data. When training, we aim
to minimize this loss between the predicted and target outputs.

The parameters are adjusted to minimize the average loss — we find the
weights, wT, and biases, b, that minimize the value of J (the average loss).


We can think of this akin to residuals, in statistics, which measure the distance of the
actual y values from the regression line (predicted values) — the goal being to minimize
the net distance.

How Loss Functions Are Implemented in TensorFlow

For this article, we will use Google’s TensorFlow library to implement different loss
functions — it makes it easy to demonstrate how loss functions are used in models.

In TensorFlow, the loss function the neural network uses is specified as a parameter of
model.compile(), the method that configures the model for training (training itself is then run
with model.fit()).
model.compile(loss='mse', optimizer='sgd')
The loss function can be passed either as a String — as shown above — or as a function
object — either imported from TensorFlow or written as a custom loss function, as we will
discuss later.
from tensorflow.keras.losses import mean_squared_error
model.compile(loss=mean_squared_error, optimizer='sgd')

All loss functions in TensorFlow have a similar structure:


def loss_function(y_true, y_pred):
    return losses

It must be formatted this way because the model.compile() method expects only two
input parameters for the loss attribute.

Types of Loss Functions

In supervised learning, there are two main types of loss functions — these correlate to the
2 major types of neural networks: regression and classification loss functions

1. Regression Loss Functions — used in regression neural networks; given an input


value, the model predicts a corresponding output value (rather than pre-selected
labels); Ex. Mean Squared Error, Mean Absolute Error

2. Classification Loss Functions — used in classification neural networks; given an


input, the neural network produces a vector of probabilities of the input belonging to
various pre-set categories — can then select the category with the highest probability
of belonging; Ex. Binary Cross-Entropy, Categorical Cross-Entropy

Mean Squared Error (MSE)

One of the most popular loss functions, MSE finds the average of the squared differences
between the target and the predicted outputs:

MSE = (1/n) Σ (i = 1 to n) (y_i − ŷ_i)²

This function has numerous properties that make it especially suited for calculating loss.
The difference is squared, which means it does not matter whether the predicted value is
above or below the target value; however, values with a large error are penalized. MSE is
also a convex function (as shown in the diagram above) with a clearly defined global
minimum — this allows us to more easily utilize gradient descent optimization to set
the weight values.

Here is a standard implementation in TensorFlow — built into the TensorFlow library as


well.
def mse(y_true, y_pred):
    # mean of the squared differences between targets and predictions
    return tf.reduce_mean(tf.square(y_true - y_pred))

However, one disadvantage of this loss function is that it is very sensitive to outliers; if a
predicted value is significantly greater than or less than its target value, this will
significantly increase the loss.

Mean Absolute Error (MAE)

MAE finds the average of the absolute differences between the target and the predicted
outputs:

MAE = (1/n) Σ (i = 1 to n) |y_i − ŷ_i|

This loss function is used as an alternative to MSE in some cases. As mentioned


previously, MSE is highly sensitive to outliers, which can dramatically affect the loss
because the distance is squared. MAE is used in cases when the training data has a large
number of outliers to mitigate this.
Here is a standard implementation in TensorFlow — built into the TensorFlow library as
well.
def mae(y_true, y_pred):
    # mean of the absolute differences between targets and predictions
    return tf.reduce_mean(tf.abs(y_true - y_pred))

It also has some disadvantages: the derivative of the absolute value is undefined at 0 and has
a constant magnitude everywhere else, so as the average error approaches 0, gradient descent
struggles to settle smoothly on the minimum.

Because of this, a loss function called a Huber Loss was developed, which has the
advantages of both MSE and MAE.


If the absolute difference between the actual and predicted value is less than or equal to a
threshold value, 𝛿, then MSE is applied. Otherwise — if the error is sufficiently large —
MAE is applied.

This is the TensorFlow implementation —this involves using a wrapper function to utilize
the threshold variable, which we will discuss in a little bit.
def huber_loss_with_threshold(t=1.0):   # t plays the role of the threshold 𝛿
    def huber_loss(y_true, y_pred):
        error = y_true - y_pred
        within_threshold = tf.abs(error) <= t
        small_error = tf.square(error)
        large_error = t * (tf.abs(error) - (0.5 * t))
        # choose element-wise between the MSE-style and MAE-style branches
        return tf.where(within_threshold, small_error, large_error)
    return huber_loss
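
As a usage sketch (the tiny model and the threshold value of 1.0 are invented for illustration),
the wrapped function is passed to model.compile() like any other loss object:

import tensorflow as tf

# A hypothetical one-output regression model, just to show where the wrapped loss goes.
model = tf.keras.Sequential([tf.keras.Input(shape=(3,)), tf.keras.layers.Dense(1)])
model.compile(optimizer='sgd', loss=huber_loss_with_threshold(t=1.0))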

Binary Cross-Entropy/Log Loss

This is the loss function used in binary classification models — where the model takes in
an input and has to classify it into one of two pre-set categories.
BCE = −(1/N) Σ (i = 1 to N) [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]

Classification neural networks work by outputting a vector of probabilities — the


probability that the given input fits into each of the pre-set categories; then selecting the
category with the highest probability as the final output.

In binary classification, there are only two possible actual values of y — 0 or 1. Thus, to
accurately determine loss between the actual and predicted values, it needs to compare
the actual value (0 or 1) with the probability that the input aligns with that category
(p(i) = probability that the category is 1; 1 — p(i) = probability that the category is 0)

This is the TensorFlow implementation.


def log_loss(y_true, y_pred):
    # clip predictions away from 0 and 1 so the logs stay finite
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1 - 1e-7)
    error = y_true * tf.math.log(y_pred) + (1 - y_true) * tf.math.log(1 - y_pred)
    return -tf.reduce_mean(error)

Categorical Cross-Entropy Loss

In cases where the number of classes is greater than two, we utilize categorical cross-
entropy — this follows a very similar process to binary cross-entropy.

CCE = −Σ (c = 1 to M) y_c log(p_c)


Binary cross-entropy is a special case of categorical cross-entropy, where M = 2 — the
number of categories is 2.
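
Since no code was given for this case, here is a minimal sketch in the same style as the earlier
loss functions; it assumes y_true is one-hot encoded and y_pred contains per-class probabilities
(tf.keras.losses.CategoricalCrossentropy is the built-in equivalent).

import tensorflow as tf

def categorical_cross_entropy(y_true, y_pred):
    # clip so the log stays finite, then sum -y*log(p) over classes and average over samples
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1 - 1e-7)
    return -tf.reduce_mean(tf.reduce_sum(y_true * tf.math.log(y_pred), axis=-1))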

Custom Loss Functions

As seen earlier, when writing neural networks, you can import loss functions as function
objects from the tf.keras.losses module, which contains a number of built-in loss functions
(such as MSE, MAE, and the cross-entropy losses discussed above).

However, there may be cases where these traditional/main loss functions may not be
sufficient. Some examples would be if there is too much noise in your training data
(outliers, erroneous attribute values, etc.) — which cannot be compensated for with data
preprocessing — or use in unsupervised learning (as we will discuss later). In these
instances, you can write custom loss functions to suit your specific conditions.
def custom_loss_function(y_true, y_pred):
    return losses

Writing custom loss functions is very straightforward; the only requirements are that the
loss function must take in only two parameters: y_pred (predicted output) and y_true
(actual output).
Some examples of these are 3 custom loss functions, in the case of a variational auto-
encoder (VAE) model, from Hands-On Image Generation with TensorFlow by Soon Yau
Cheong.
def vae_kl_loss(y_true, y_pred):
    kl_loss = -0.5 * tf.reduce_mean(
        1 + vae.logvar - tf.square(vae.mean) - tf.exp(vae.logvar))
    return kl_loss

def vae_rc_loss(y_true, y_pred):
    # rc_loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    rc_loss = tf.keras.losses.MSE(y_true, y_pred)
    return rc_loss

def vae_loss(y_true, y_pred):
    kl_loss = vae_kl_loss(y_true, y_pred)
    rc_loss = vae_rc_loss(y_true, y_pred)
    kl_weight_const = 0.01
    return kl_weight_const * kl_loss + rc_loss

Depending on the math of your loss function, you may need to add additional parameters
— such as the threshold value 𝛿 in the Huber Loss (above); to do this, you must include a
wrapper function, as TF will not allow you to have more than 2 parameters in your loss
function.
def custom_loss_with_threshold(threshold=1):
    def custom_loss(y_true, y_pred):
        pass  # implement the loss function here - it can use the threshold variable
    return custom_loss

Implementation In Image Processing



Let’s look at some practical implementations of loss functions. Specifically, we will look
at how loss functions are used to process image data in various use cases.

Image Classification

One of the most fundamental aspects of computer vision is image classification — being
able to assign an image to one of two or more pre-selected labels; this allows users to
recognize objects, writing, people, etc. within the image (in image classification, the
image usually has only one subject).

The most commonly used loss function in image classification is cross-entropy loss/log
loss (binary for classification between 2 classes and sparse categorical for 3 or more),
where the model outputs a vector of probabilities that the input image belongs to each of
the pre-set categories. This output is then compared to the actual output, represented by
a vector of equal size, where the correct category has a probability of 1 and all others have
a probability of 0.

A rudimentary implementation of this can be imported directly from the TensorFlow


library and does not require any further customization or modification. Below is an
excerpt of an open-source Deep Convolutional Neural Network (CNN) by IBM which
classifies document images (id cards, application forms, etc.).
model.add(Dense(5, activation='sigmoid'))
model.summary()
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Research is currently being done to develop new (custom) loss functions to optimize
multi-class classification. Below is an excerpt of a proposed loss function developed by
researchers at Duke University, which extends categorical cross-entropy loss by looking
at patterns in incorrect results as well, to speed up the learning process.
def matrix_based_crossentropy(output, target, matrixA, from_logits=False):
    Loss = 0
    ColumnVector = np.matmul(matrixA, target)
    for i, y in enumerate(output):
        Loss -= (target[i] * math.log(output[i], 2))
        Loss += ColumnVector[i] * exponential(output[i])
        Loss -= (target[i] * exponential(output[i]))
    newMatrix = updateMatrix(matrixA, target, output, 4)
    return [Loss, newMatrix]

Image Generation

Image generation is a process by which neural networks create images (from an existing
library) per the user’s specifications.

Throughout this article, we have dealt primarily with the use of loss functions in
supervised learning — where we have had clearly labeled inputs, x, and outputs, y, and
the model was supposed to determine the relationship between these two variables.
Image generation is an application of unsupervised learning — where the model is
required to analyze and find patterns in unlabelled input datasets. The basic principle of
loss functions still holds; the goal of a loss function in unsupervised learning is to
determine the difference between the input example and the hypothesis — the model’s
approximation of the input example itself.

For example, this equation models how MSE would be implemented for unsupervised
learning, where h(x) is the hypothesis function.

MSE = (1/n) Σ (i = 1 to n) (h(x_i) − x_i)²

Below is an excerpt of a Contrastive Language-Image Pretraining (CLIP) diffusion model


— which generates art (images) through text descriptions — along with a few image
samples.
if args.init_weight:
    result.append(F.mse_loss(z, z_orig) * args.init_weight / 2)

lossAll, img = ascend_txt()
if i % args.display_freq == 0:
    checkin(i, lossAll)
loss = sum(lossAll)
loss.backward()

Another example of the use of loss functions in image generation was shown above in
our Custom Loss Functions section, in the case of a variational auto-encoder (VAE)
model.
Backpropagation Process in Deep Neural Network

Backpropagation is one of the important concepts of a neural network. Our task is to
classify our data as well as possible; for this, we have to update the weights and biases,
but how can we do that in a deep neural network? In the linear regression model, we use
gradient descent to optimize the parameters. Similarly, here we also use the gradient
descent algorithm, with backpropagation supplying the gradients.

For a single training example, the backpropagation algorithm calculates the gradient of
the error function. Backpropagation can be written as a function of the neural network.
Backpropagation algorithms are a set of methods used to efficiently train artificial neural
networks following a gradient descent approach that exploits the chain rule.

The main feature of backpropagation is the iterative, recursive and efficient way in which
it calculates the weight updates that improve the network until it is able to perform the
task for which it is being trained. The derivatives of the activation functions need to be
known at network design time for backpropagation to work.

Now, how is the error function used in backpropagation, and how does backpropagation work?
Let's start with an example and work through it mathematically to understand exactly how the
weights are updated.

Input values

X1=0.05
X2=0.10
Initial weight

W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55

Bias Values

b1=0.35 b2=0.60

Target Values

T1=0.01
T2=0.99

Now, we first calculate the values of H1 and H2 by a forward pass.

Forward Pass

To find the value of H1 we first multiply the input value from the weights as

H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775

To calculate the final result of H1, we performed the sigmoid function as

We will calculate the value of H2 in the same way as H1


H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925

To calculate the final result of H2, we perform the sigmoid function as

Now, we calculate the values of y1 and y2 in the same way as we calculate the H1 and
H2.

To find the value of y1, we first multiply the input value i.e., the outcome of H1 and H2
from the weights as

y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597

To calculate the final result of y1 we performed the sigmoid function as

We will calculate the value of y2 in the same way as y1


y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214

To calculate the final result of y2, we perform the sigmoid function as

Our target values are 0.01 and 0.99. Our y1 and y2 values do not match the target values
T1 and T2.

Now, we will find the total error, which is simply the difference between the actual outputs
and the target outputs. The total error is calculated as

So, the total error is

Now, we will backpropagate this error to update the weights using a backward pass.
Backward pass at the output layer

To update a weight, we calculate the error corresponding to that weight with the help of
the total error. The error on weight w is calculated by differentiating the total error with
respect to w.

We perform the backward pass, so first consider the last-layer weight w5.

From equation (2), it is clear that we cannot partially differentiate it with respect to w5
directly, because w5 does not appear in it. We therefore split equation (1) into multiple
terms, using the chain rule, so that we can easily differentiate it with respect to w5.

Now, we calculate each term one by one to differentiate Etotal with respect to w5.
Putting the value of e^(−y1) into equation (5), we then substitute these values into
equation (3) to find the final result.

Now, we will calculate the updated weight w5new with the following formula:

w5new = w5 − η × ∂Etotal/∂w5, where η is the learning rate.

In the same way, we calculate w6new, w7new, and w8new, which gives us the following
values:
w5new=0.35891648
w6new=0.408666186
w7new=0.511301270
w8new=0.561370121
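The intermediate equations referenced above were originally shown as images and are not reproduced in the text. As a hedged reconstruction of the standard chain-rule decomposition for this example (with y1out denoting the sigmoid output of the first output neuron and η the learning rate; the updated values listed above are consistent with η = 0.5):

∂Etotal/∂w5 = (∂Etotal/∂y1out) × (∂y1out/∂y1) × (∂y1/∂w5)
            = (y1out − T1) × y1out × (1 − y1out) × H1out

w5new = w5 − η × ∂Etotal/∂w5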

Backward pass at Hidden layer

Now, we will backpropagate to our hidden layer and update the weight w1, w2, w3, and
w4 as we have done with w5, w6, w7, and w8 weights.

We will calculate the error at w1 as

From equation (2), it is clear that we cannot partially differentiate it with respect to w1,
because w1 does not appear in it. We split equation (1) into multiple terms so that we can
easily differentiate it with respect to w1.

Now, we calculate each term one by one to differentiate Etotal with respect to w1.

We again split this term, because there is no H1(final) term in Etotal.

We split it once more, because there is no H1 term directly in E1 and E2.

We again split both of these, because there are no y1 and y2 net-input terms directly in E1 and E2.

Now, we find the values of these terms by putting the known values into equations (18) and (19).

From equation (18)

From equation (8)

From equation (19)


Putting the value of e^(−y2) into equation (23)

From equation (21)


Now, from equations (16) and (17),

we put the resulting values into equation (15).

Having this, we still need to figure out the derivative of the H1 output with respect to its net input.

Putting the value of e^(−H1) into equation (30),

we calculate the partial derivative of the total net input to H1 with respect to w1 in the
same way as we did for the output neuron. Finally, we put all of these values into
equation (13) to find the final result.

Now, we will calculate the updated weight w1new with the following formula:

w1new = w1 − η × ∂Etotal/∂w1

In the same way, we calculate w2new, w3new, and w4new, and this gives us the following
values:

w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229

We have now updated all the weights. The error on the network was 0.298371109 when
we fed forward the inputs 0.05 and 0.1. After the first round of backpropagation, the total
error is down to 0.291027924. After repeating this process 10,000 times, the total error is
down to 0.0000351085. At this point, the output neurons produce 0.015912196 and
0.984065734, i.e., close to our target values, when we feed forward 0.05 and 0.1.
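As a hedged, minimal sketch (not part of the original text), the worked example above can be reproduced in NumPy. The learning rate of 0.5 is an assumption inferred from the updated weight values listed earlier; as in the example, the biases are kept fixed:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs, targets, and initial parameters from the worked example
x = np.array([0.05, 0.10])
t = np.array([0.01, 0.99])
W1 = np.array([[0.15, 0.20],   # w1, w2 -> H1
               [0.25, 0.30]])  # w3, w4 -> H2
W2 = np.array([[0.40, 0.45],   # w5, w6 -> y1
               [0.50, 0.55]])  # w7, w8 -> y2
b1, b2 = 0.35, 0.60
lr = 0.5                       # assumed learning rate, consistent with w5new = 0.35891648

for _ in range(10000):
    # Forward pass
    h = sigmoid(W1 @ x + b1)                       # hidden outputs H1out, H2out
    y = sigmoid(W2 @ h + b2)                       # network outputs y1out, y2out
    # Backward pass (chain rule), computed before any weight is updated
    delta_out = (y - t) * y * (1 - y)              # dE/d(net input of output neurons)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)   # dE/d(net input of hidden neurons)
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)

print(0.5 * np.sum((t - y) ** 2))   # total error, approaching 0.0000351085
print(y)                            # outputs approaching the targets 0.01 and 0.99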

Model Selection for Deep Neural Networks

Artificial neural networks can be used to create highly accurate


models in domains such as language processing and image
recognition. However, since hundreds of trials may be needed to
generate a good model architecture and associated
hyperparameters, training deep neural networks is incredibly
resource-intensive. For example, trying five network architectures
with five values for learning rate and five values for regularizer will
result in 125 combinations. This is the challenge of model
selection, which is time-consuming and expensive, especially when
training one model at a time. So, because massively parallel
processing databases like Greenplum can have dozens of segments
(workers), we exploit this parallel compute architecture to address
the challenge of model selection by training many model
configurations simultaneously.

Model hopper parallelism

To train many models simultaneously, we implement a novel approach


from recent research called model hopper parallelism (MOP) [4, 5].
MOP combines task and data parallelism by exploiting the fact that
stochastic gradient descent (SGD), a widely used optimization method,
is robust to data visit order.

The method works as follows. Suppose, as we show in this image, we


have a set S of model configurations that we want to train
for k iterations. The dataset is shuffled once and distributed
to p segments (workers). We pick p model configurations
from S and assign one configuration per segment, and each
configuration is trained for a single sub-epoch. When a segment
completes a sub-epoch, it is assigned a new configuration that has not
yet been trained on that segment. Thus, a model "hops" from one
segment to another until all workers have been visited, which
completes one iteration of SGD for each model. Other models
from S are rotated in, in round robin fashion, until all S model
configurations are trained for k iterations.
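As an illustrative sketch only (this is not the Greenplum/MADlib implementation, the function names are hypothetical, and real MOP runs the segments concurrently rather than in a loop), the round-robin scheduling idea can be expressed roughly as follows:

def model_hopper(configs, segments, k, train_sub_epoch):
    """Round-robin MOP schedule: each config visits every segment once per iteration."""
    p = len(segments)
    for _ in range(k):                               # k SGD iterations per configuration
        for start in range(0, len(configs), p):      # process configurations in groups of p
            group = configs[start:start + p]
            for hop in range(p):                     # each hop moves models to the next segment
                for s, segment in enumerate(segments):
                    cfg_idx = (s + hop) % p
                    if cfg_idx < len(group):         # the last group may be smaller than p
                        train_sub_epoch(group[cfg_idx], segment)

# Example usage with stand-in objects
segs = ["seg0", "seg1", "seg2"]
cfgs = [f"config{i}" for i in range(4)]
log = lambda cfg, seg: print(f"training {cfg} on {seg}")
model_hopper(cfgs, segs, k=1, train_sub_epoch=log)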

Figure 1: An example of model hopper parallelism with three segments

Note that the training dataset is locked in place on the segments


throughout the training period; only the model state moves between
segments. The advantage of moving the model state rather than the
training dataset is efficiency, since in practical deep learning
problems the former is far smaller than the latter.

Data distribution rules are adjustable. For example, GPUs may not be
connected to all segment hosts for cost control reasons. In this case,
the training data set will only be distributed to segments with GPUs
attached, and the models will only hop to those segments.

The APIs

Below are the two SQL queries needed to run MOP on Greenplum using
Apache MADlib. The first loads model configurations to be trained into
a model selection table. The second calls the MOP function to fit each
of the models in the model selection table.

Figure 2: APIs for model selection on Greenplum using Apache MADlib

Putting it to the test

We use the well-known CIFAR-10 dataset with 50,000 training


examples of 32x32 color images. Grid search generates a set of 96
training configurations, consisting of three model architectures, eight
combinations of optimizers and parameters, and four batch sizes.
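For illustration only (the architectures and parameter values here are hypothetical placeholders, not the ones used in the experiment), a grid of 96 = 3 × 8 × 4 configurations could be generated like this:

from itertools import product

architectures = ["cnn_small", "cnn_medium", "resnet_lite"]              # 3 model architectures
optimizers = [("sgd", lr) for lr in (0.1, 0.01, 0.001, 0.0001)]
optimizers += [("adam", lr) for lr in (0.01, 0.001, 0.0001, 0.00001)]   # 8 optimizer/parameter combos
batch_sizes = [32, 64, 128, 256]                                        # 4 batch sizes

configs = list(product(architectures, optimizers, batch_sizes))
print(len(configs))   # 96 training configurations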

Here we see the accuracy and cross-entropy loss for each of the 96 combinations after
training for 10 epochs. Training takes 2 hours and 47 minutes on the test cluster.* The
models display a wide range of accuracy, with some performing well and some not
converging at all.

Figure 3: Initial training of 96 model configurations

The next step that a data scientist may take is to look for good models
from the initial run and select a subset of those that have reasonable
accuracy and low variance compared to the training set, meaning they
do not overfit. We select 12 of these good models and run them for 50
iterations, which takes 1 hour and 53 minutes on the test cluster.
Figure 4: Final training for more iterations with the 12 best model
configurations

From the second run, we select the best model, again ensuring that it
does not overfit the training data. The chosen model achieves
validation accuracy of roughly 81 percent using the SGD optimizer with
learning rate=.001, momentum=0.95, and batch size=256.

Once we have this model, we can then use it for inference to classify
new images.
Figure 5: Inference showing the top three probabilities

Optimization for Neural Networks

Various artificial neural networks have fully connected layers in their


architecture. In this chapter, we explain how the mathematics of a
fully connected neural network works. We design and experiment with
various training and loss functions. We also explain that the
optimization and backpropagation steps used when training neural
networks are similar to how learning happens in our brains. The brain
learns by reinforcing neuron connections when faced with a concept it
has seen before, and weakening connections if it learns new
information that contradicts previously learned concepts. Machines
only understand numbers. Mathematically, stronger connections
correspond to larger numbers, and weaker connections correspond to
smaller numbers.
Finally, we walk through various regularization techniques, explaining
their advantages, disadvantages, and use cases.

The Brain Cortex and Artificial Neural Networks

Neural networks are modeled after the brain cortex, which involves billions of neurons arranged
in a layered structure. Figure 4-1 shows an image of three vertical cross-sections of the brain
neocortex, and Figure 4-2 shows a diagram of a fully connected artificial neural network.
Figure 4-1. Three drawings of cortical lamination by Santiago Ramón y Cajal,
taken from the book Comparative Study of the Sensory Areas of the Human
Cortex (Andesite Press) (image source: Wikipedia)
In Figure 4-1, each drawing shows a vertical cross-section of the cortex, with the surface
(outermost side closest to the skull) of the cortex at the top. On the left is a Nissl-stained visual
cortex of a human adult. In the middle is a Nissl-stained motor cortex of a human adult. On the
right is a Golgi-stained cortex of a month-and-a-half-old infant. The Nissl stain shows the cell
bodies of neurons. The Golgi stain shows the dendrites and axons of a random subset of neurons.
The layered structure of the neurons in the cortex is evident in all three cross-sections.

Figure 4-2. A fully connected or dense artificial neural network with four layers
Even though different regions of the cortex are responsible for different functions, such as vision,
auditory perception, logical thinking, language, speech, etc., what actually determines the
function of a specific region are its connections: which sensory and motor input and output
regions it connects to. This means that if a cortical region is connected to a different sensory
input/output region—for example, a vision locality instead of an auditory one—then it will
perform vision tasks (computations), not auditory tasks. In a very simplified sense, the cortex
performs one basic function at the neuron level. In an artificial neural network, the basic
computation unit is the perceptron, and it functions in the same way across the whole network.
The various connections, layers, and architecture of the neural network (both the brain cortex and
artificial neural networks) are what allow these computational structures to do very
impressive things.
Training Function: Fully Connected, or Dense, Feed Forward Neural Networks
In a fully connected or dense artificial neural network (see Figure 4-2), every neuron, represented
by a node (the circles), in every layer is connected to all the neurons in the next layer. The first
layer is the input layer, the last layer is the output layer, and the intermediate layers are
called hidden layers. The neural network itself, whether fully connected or not (the networks that
we will encounter in the next few chapters are convolutional and are not fully connected), is a
computational graph representing the formula of the training function. Recall that we use this
function to make predictions after training.
Training in the neural networks context means finding the parameter values, or weights, that
enter into the formula of the training function via minimizing a loss function. This is similar to
training linear regression, logistic regression, softmax regression, and support vector machine
models, which we discussed in Chapter 3. The mathematical structure here remains the same:

1. Training function

2. Loss function

3. Optimization
The only difference is that for the simple models of Chapter 3, the formulas of the training
functions are very uncomplicated. They linearly combine the data features, add a bias term (ω₀),
and pass the result into at most one nonlinear function (for example, the logistic function in
logistic regression). As a consequence, the results of these models are also simple: a linear (flat)
function for linear regression, and a linear division boundary between different classes in logistic
regression, softmax regression, and support vector machines. Even when we use these simple
models to represent nonlinear data, such as in the cases of polynomial regression (fitting the data
into polynomial functions of the features) or support vector machines with the kernel trick, we
still end up with linear functions or division boundaries, but these will either be in higher
dimensions (for polynomial regression, the dimensions are the feature and its powers) or in
transformed dimensions (such as when we use the kernel trick with support vector machines).
For neural network models, on the other hand, the process of linearly combining the features,
adding a bias term, then passing the result through a nonlinear function (now called activation
function) is the computation that happens only in one neuron. This simple process happens over
and over again in dozens, hundreds, thousands, or sometimes millions of neurons, arranged in
layers, where the output of one layer acts as the input of the next layer. Similar to the brain
cortex, the aggregation of simple and similar processes over many neurons and layers produces,
or allows for the representation of, much more complex functionalities. This is sort of
miraculous. Thankfully, we are able to understand much more about artificial neural networks
than our brain’s neural networks, mainly because we design them, and after all, an artificial
neural network is just one mathematical function. No black box remains dark once we dissect it
under the lens of mathematics. That said, the mathematical analysis of artificial neural networks
is a relatively new field. There are still many questions to be answered and a lot to be discovered.

A Neural Network Is a Computational Graph Representation of the Training Function


Even for a network with only five neurons, such as the one in Figure 4-3, it is pretty messy to
write the formula of the training function. This justifies the use of computational graphs to
represent neural networks in an organized and easy way. Graphs are characterized by two things:
nodes and edges (congratulations, this was lesson one in graph theory). In a neural network, an
edge connecting node i in layer m to node j in layer n is assigned a weight carrying all four
indices i, j, m, and n. That is four indices for only one edge! At the risk of drowning in a deep
ocean of indices, we must organize a neural network's weights in matrices.

Figure 4-3. A fully connected (or dense) feed forward neural network with only
five neurons arranged in three layers. The first layer (the three black dots on the
very left) is the input layer, the second layer is the only hidden layer with three
neurons, and the last layer is the output layer with two neurons.
Let’s model the training function of a feed forward fully connected neural network. Feed forward
means that the information flows forward through the computational graph representing the
network’s training function.
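As a small hedged illustration (not from the book text, with arbitrary placeholder values), the weights of the five-neuron network in Figure 4-3 can be organized into two matrices, and the forward pass of the training function then becomes two matrix-vector products:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Figure 4-3 network: 3 inputs -> 3 hidden neurons -> 2 output neurons.
# W1[i, j] is the weight from input j into hidden neuron i.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

x = np.array([0.1, 0.4, 0.7])        # one input example
h = sigmoid(W1 @ x + b1)             # hidden layer activations
y = sigmoid(W2 @ h + b2)             # network output: the training function evaluated at x
print(y)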
Conditional Random Fields

Conditional random field


Conditional random fields (CRFs) are a class of statistical modeling methods often
applied in pattern recognition and machine learning and used for structured
prediction. Whereas a classifier predicts a label for a single sample without
considering "neighbouring" samples, a CRF can take context into account. To do so,
the predictions are modelled as a graphical model, which represents the presence of
dependencies between the predictions. What kind of graph is used depends on the
application. For example, in natural language processing, "linear chain" CRFs are
popular, for which each prediction is dependent only on its immediate neighbours.
In image processing, the graph typically connects locations to nearby and/or similar
locations to enforce that they receive similar predictions.

Other examples where CRFs are used are: labeling or parsing of sequential data
for natural language processing or biological sequences, part-of-speech
tagging, shallow parsing, named entity recognition, gene finding, peptide critical
functional region finding, and object recognition and image
segmentation in computer vision.

Description
CRFs are a type of discriminative undirected probabilistic graphical model.

Lafferty, McCallum and Pereira define a CRF on observations X and random variables Y
as follows:

Let G = (V, E) be a graph such that Y = (Yv), v ∈ V, so that Y is indexed by the vertices of G.

Then (X, Y) is a conditional random field when each random variable Yv, conditioned
on X, obeys the Markov property with respect to the graph; that is, its probability is
dependent only on its neighbours in G:

P(Yv | X, Yw, w ≠ v) = P(Yv | X, Yw, w ∼ v), where w ∼ v means that w and v are neighbors in G.

What this means is that a CRF is an undirected graphical model whose nodes can be
divided into exactly two disjoint sets X and Y, the observed and output
variables, respectively; the conditional distribution p(Y | X) is then modeled.


Inference
For general graphs, the problem of exact inference in CRFs is intractable. The inference problem
for a CRF is basically the same as for an MRF and the same arguments hold. However, there
exist special cases for which exact inference is feasible:
 If the graph is a chain or a tree, message passing algorithms yield exact solutions. The algorithms
used in these cases are analogous to the forward-backward and Viterbi algorithm for the case of
HMMs.
 If the CRF only contains pair-wise potentials and the energy is submodular, combinatorial min
cut/max flow algorithms yield exact solutions.

If exact inference is impossible, several algorithms can be used to obtain approximate solutions.
These include:

 Loopy belief propagation


 Alpha expansion

 Mean field inference

 Linear programming relaxations

Parameter Learning

Learning the parameters θ is usually done by maximum likelihood learning for p(Yi | Xi; θ).


If all nodes have exponential family distributions and all nodes are observed during training,
this optimization is convex. It can be solved for example using gradient descent algorithms,
or Quasi-Newton methods such as the L-BFGS algorithm. On the other hand, if some variables
are unobserved, the inference problem has to be solved for these variables. Exact inference is
intractable in general graphs, so approximations have to be used.

Examples
In sequence modeling, the graph of interest is usually a chain graph. An input sequence of
observed variables X represents a sequence of observations, and Y represents a hidden
(or unknown) state variable that needs to be inferred given the observations. The Yi are
structured to form a chain, with an edge between each Yi−1 and Yi. As well as having a
simple interpretation of the Yi as "labels" for each element in the input sequence,
this layout admits efficient algorithms for:

 model training, learning the conditional distributions between the Yi and feature functions
from some corpus of training data.

 decoding, determining the probability of a given label sequence Y given X.

 inference, determining the most likely label sequence Y given X.

The conditional dependency of each Yi on X is defined through a fixed set
of feature functions of the form f(i, Yi−1, Yi, X), which can be thought of as measurements
on the input sequence that partially determine the likelihood of each possible value for Yi.
The model assigns each feature a numerical weight and combines them to determine the
probability of a certain value for Yi.


Linear-chain CRFs have many of the same applications as conceptually simpler hidden Markov
models (HMMs), but relax certain assumptions about the input and output sequence
distributions. An HMM can loosely be understood as a CRF with very specific feature functions
that use constant probabilities to model state transitions and emissions. Conversely, a CRF can
loosely be understood as a generalization of an HMM that makes the constant transition
probabilities into arbitrary functions that vary across the positions in the sequence of hidden
states, depending on the input sequence.

Notably, in contrast to HMMs, CRFs can contain any number of feature functions, the

feature functions can inspect the entire input sequence at any point during
inference, and the range of the feature functions need not have a probabilistic
interpretation.

Variants

Higher-order CRFs and semi-Markov CRFs

CRFs can be extended into higher-order models by making each Yi dependent on a fixed
number k of previous variables Yi−k, ..., Yi−1. In conventional formulations of higher-order
CRFs, training and inference are only practical for small values of k (such as k ≤ 5), since
their computational cost increases exponentially with k.


However, another recent advance has managed to ameliorate these issues by leveraging concepts
and tools from the field of Bayesian nonparametrics. Specifically, the CRF-infinity
approach constitutes a CRF-type model that is capable of learning infinitely-long temporal
dynamics in a scalable fashion. This is effected by introducing a novel potential function for
CRFs that is based on the Sequence Memoizer (SM), a nonparametric Bayesian model for
learning infinitely-long dynamics in sequential observations. To render such a model
computationally tractable, CRF-infinity employs a mean-field approximation of the postulated
novel potential functions (which are driven by an SM). This allows for devising efficient
approximate training and inference algorithms for the model, without undermining its capability
to capture and model temporal dependencies of arbitrary length.
There exists another generalization of CRFs, the semi-Markov conditional random field (semi-
CRF), which models variable-length segmentations of the label sequence Y. This
provides much of the power of higher-order CRFs to model long-range dependencies of the
Yi, at a reasonable computational cost.

Finally, large-margin models for structured prediction, such as the structured Support Vector
Machine can be seen as an alternative training procedure to CRFs.

Latent-dynamic conditional random field


Latent-dynamic conditional random fields (LDCRF) or discriminative probabilistic latent
variable models (DPLVM) are a type of CRFs for sequence tagging tasks. They are latent
variable models that are trained discriminatively.

In an LDCRF, like in any sequence tagging task, given a sequence of observations x = x1, ..., xn,
the main problem the model must solve is how to assign a sequence of labels y = y1, ..., yn from
one finite set of labels Y. Instead of directly modeling P(y|x) as an ordinary linear-chain CRF
would do, a set of latent variables h is "inserted" between x and y using the chain rule of
probability:

P(y | x) = Σh P(y | h, x) P(h | x)

This allows capturing latent structure between the observations and labels. While LDCRFs can
be trained using quasi-Newton methods, a specialized version of the perceptron algorithm called
the latent-variable perceptron has been developed for them as well, based on
Collins' structured perceptron algorithm

Derivation of the Partition Function in Conditional Random Fields
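The derivation itself is not reproduced in the source text. As a hedged sketch of the standard result for a linear-chain CRF (conventional notation, not taken from the original), the conditional distribution and its partition function are:

p(y | x) = (1 / Z(x)) × exp( Σt Σk λk fk(yt−1, yt, x, t) )

Z(x) = Σy′ exp( Σt Σk λk fk(y′t−1, y′t, x, t) )

The partition function Z(x) sums over all possible label sequences y′ and normalizes the exponentiated feature scores into a valid conditional distribution. For a chain, Z(x) can be computed efficiently with the forward algorithm (dynamic programming over positions t) rather than by explicit enumeration of every label sequence.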
Markov Networks
A Markov network is an undirected graphical model that allows reasoning under
uncertainty and which is often compared to Bayesian networks.

Graphical Representation and Applications

Figure 1. Examples of Markov networks. (a) A simple Markov network with cyclical
dependencies; for example, A is dependent on B, which is dependent on C, which is
dependent on A. Note that the cycles can be traversed in either direction due to
symmetry. (b) A portion of a large grid-like Markov network that could represent pixels
in an image. Note that each node depends on its neighbors, with no hierarchy of
dependence.

A Markov network is represented by an undirected graph G = (V, E) having the
required Markov properties, where the nodes in V represent random variables,
and the edges in E represent dependency relationships (see Figure 1). While this
representation is similar to that of Bayesian networks, there are significant differences.
The undirected edges in Markov networks allow symmetrical dependencies between
variables, which are useful in fields such as image analysis, where the value of a given
pixel depends on all the pixels around it, rather than having a hierarchy of pixels. In
addition, Markov networks permit cycles, and so cyclical dependencies can be
represented.

Markov networks are used in a variety of areas of artificial intelligence, including[1]:


 Image processing
 Computer vision
 Recurrent neural networks and content-addressable memory
Conditional Independence
Markov Properties

Figure 2. The three Markov properties. (a) By the global property, the
subgraphs A and C are conditionally independent given B. (b) By
the local property, A is conditionally independent from the rest of the network given its
neighbors. (c) By the pairwise property, A is conditionally independent
from B and C given the rest of the network, but B and C are still dependent on each
other.
It may be useful to summarize conditional independence in Markov networks as
follows: any two parts of a Markov network are conditionally independent given
evidence that "blocks" all paths between them. More formally, there are three key
properties of Markov networks related to conditional independence:[1]

 Global Markov property - Consider disjoint subsets A, B, and C of the Markov network.
A and C are conditionally independent given B iff all paths connecting A and C pass through B.


 Undirected local Markov property - A node is conditionally independent from the
rest of the network given its neighbors. Note that this means that the set of
neighbors of a node constitute its Markov blanket (the smallest set of nodes that,
when given as evidence, render the node conditionally independent from the rest of
the network).
 Pairwise Markov property - Any two nodes in the Markov network are conditionally
independent given the rest of the network if they are not neighbors.
It should be evident that the pairwise property follows from the local property, which in
turn follows from the global property.
Belief propagation
What is Bayesian belief propagation?

Bayesian beliefs are a way of understanding probability that is based on the concept of
subjective beliefs. In Bayesian probability, the probability of an event is based on an
individual’s subjective belief about the likelihood of the event occurring, taking into
account all available information. Beliefs can be updated if new information occurs.

Bayesian belief propagation is a type of message-passing algorithm that can be used to


solve probabilistic graphical models. These models represent the relationships between
different variables in a network, and are often used to solve complex problems involving
uncertainty.

In Bayesian belief propagation, messages are passed between the nodes in the network.
Each node updates its beliefs about the variables based on the messages it receives from
its neighbors. The algorithm iteratively updates the beliefs of each node until a stable
state is reached, at which point the beliefs of the nodes represent the most probable
values of the variables.
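As an illustrative, hedged sketch (toy numbers, not tied to any particular library), the message-passing idea described above can be shown on a tiny two-node network, where each node's belief combines its local evidence with the message from its neighbor:

import numpy as np

# Two binary variables A - B with a pairwise potential and unary (local) evidence
psi = np.array([[0.9, 0.1],      # psi[a, b]: compatibility of A=a with B=b
                [0.2, 0.8]])
phi_a = np.array([0.6, 0.4])     # local evidence for A
phi_b = np.array([0.5, 0.5])     # local evidence for B

# Message from A to B: sum over A's states of (evidence * pairwise potential)
msg_a_to_b = phi_a @ psi
# Message from B to A
msg_b_to_a = psi @ phi_b

# Beliefs: local evidence times incoming message, then normalize
belief_a = phi_a * msg_b_to_a
belief_a /= belief_a.sum()
belief_b = phi_b * msg_a_to_b
belief_b /= belief_b.sum()
print(belief_a, belief_b)        # marginal beliefs for A and B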

How does Bayesian belief propagation work in


heterogeneous networks?

Heterogeneous networks are networks in which the nodes and edges have different
values or characteristics. For example, a social network might have nodes representing
people and edges representing friendships, but some people might be more influential
than others, or some friendships might be stronger than others.

In such a network, Bayesian belief propagation can be used to solve probabilistic


graphical models in which the nodes and edges have different types. For example, the
algorithm can be used to infer the most likely values of variables based on the
relationships between nodes of different types.
How does Bayesian belief propagation contribute to the
emergence of filter bubbles?

Filter bubbles are a phenomenon that occurs when people are exposed to a limited range
of ideas and viewpoints, often as a result of personalized algorithms that selectively
present information to them. This can lead to people becoming isolated in their own echo
chambers, where their beliefs and perspectives are reinforced and unchallenged.

Bayesian belief propagation has been used in the development of personalized


algorithms that can contribute to the emergence of filter bubbles. These algorithms use
the relationships between nodes in a network to make recommendations or present
information to users.

For example, a recommendation algorithm might use Bayesian belief propagation to


infer the preferences of a user based on the preferences of their friends and other people
they are connected to in the network.

This can lead to the algorithm presenting the user with information that is more closely
aligned with their existing beliefs and perspectives, further reinforcing their existing
filter bubble.

Counterfactuals, shared beliefs, and polarization between


social groups

Counterfactuals are statements about events that did not happen, such as “If the election
had been held on a different day, the outcome might have been different.” These
statements are often used to explore alternative scenarios and consider the implications
of different choices or events.

However, when people in opposing social groups hold incompatible beliefs in


counterfactuals, these beliefs can become polarized.
For example, if one group believes that the election outcome would have been different if
certain actions had been taken, and the other group believes that the outcome would
have been the same no matter what, these conflicting beliefs can lead to further
polarization of beliefs and attitudes between the two groups.

This can both lead to and be exacerbated by the presence of filter bubbles, as people are
more likely to encounter information that supports their existing beliefs and
perspectives, rather than being exposed to a diverse range of viewpoints.

As a result, shared beliefs in counterfactuals can contribute to the emergence of echo


chambers, in which people’s beliefs and attitudes become more extreme and entrenched.

Training CRF

When we start off with our neural network, we initialize our weights randomly. Obviously,
that won't give very good results. In the process of training, we want to start with a badly
performing neural network and wind up with a network with high accuracy. In terms of the
loss function, we want the loss to be much lower at the end of training. Improving the
network is possible because we can change its function by adjusting the weights: we want
to find another function that performs better than the initial one.

The problem of training is equivalent to the problem of minimizing the loss function.
Why minimize a loss instead of maximizing an objective? It turns out a loss is a much easier
function to optimize.

There are a lot of algorithms that optimize functions. These algorithms can be gradient-
based or not, in the sense that gradient-based methods use not only the information provided
by the function, but also by its gradient. One of the simplest gradient-based algorithms — the
one I'll be covering in this post — is called Stochastic Gradient Descent. Let's see how it
works.
First, we need to remember what a derivative with respect to some variable is. Let's take
an easy function, f(x) = x. From the rules of calculus we know that its derivative is one at
every value of x. What does that tell us? The derivative is the rate at which our function
changes when we take an infinitely small step in the positive direction. Mathematically it
can be written as follows:

f(x + ε) − f(x) ≈ f′(x) · ε

which means: how much our function changes (the left term) approximately equals the
derivative of that function with respect to the variable x, multiplied by how much we
changed that variable. The approximation becomes exact when the step we take is
infinitely small, and this is a very important aspect of the derivative. Going back to our
simple function f(x) = x, we said that the derivative is 1, which means that if we take some
step epsilon in the positive direction, the function's output will change by 1 multiplied by
our step epsilon, which is just epsilon. In this case it's not even an approximation, it's
exact. Why? Because the derivative is the same for every value of x. That is not true for
most functions. Let's look at a slightly more complex function f(x) = x².
y = x²

From the rules of calculus we know that the derivative of that function is 2x. It's easy to
check that now, if we start at some value of x and take some step epsilon, the amount by which
our function changes is not going to be exactly equal to the approximation given by the formula above.

Now, the gradient is a vector of partial derivatives, whose elements contain the derivatives with
respect to each variable on which the function depends. With the simple functions we've been
considering so far, this vector only contains one element, because we've only been using
functions that take one input. With more complex functions (like our loss function), the
gradient will contain a derivative with respect to each variable we care about.

How can we use this information, provided by the derivatives, to minimize some function?
Let's go back to our function f(x) = x². Obviously, the minimum of that function is at x = 0,
but how would a computer know that? Suppose we start off with some random value of x, and
this value is 2. The derivative of the function at x = 2 equals 4, which means that if we take
a step in the positive direction our function will change proportionally to 4, so it will
increase. Instead, we want to minimize our function, so we take a step in the opposite,
negative, direction to be sure that our function will decrease, at least a little bit. How big
of a step can we take? Well, that's the bad news: the derivative only guarantees that the
function will decrease if we take an infinitely small step, which we can't do. Generally, you
want to control how big a step you make with some kind of hyper-parameter. This
hyper-parameter is called the learning rate, and I'll talk about it later. Let's now see what
happens if we start at the point x = -2. The derivative now equals -4, which means that if we
take a small step in the positive direction our function will change proportionally to -4, so
it will decrease. That's exactly what we want.

Have you noticed a pattern here? When x > 0, the derivative is greater than zero and we need
to go in the negative direction; when x < 0, the derivative is less than zero and we need to go
in the positive direction. We always need to take a step in the direction opposite to the
derivative. Let's apply the same idea to the gradient. The gradient is a vector which points in
some direction in space: in fact, it points in the direction of the steepest increase of the
function. Since we want to minimize our function, we take a step in the direction opposite to
the gradient. Let's apply this idea. In a neural network we think of the inputs x and outputs y
as fixed numbers. The variables with respect to which we take our derivatives are the
weights w, since these are the values we want to change to improve our network. If we compute
the gradient of the loss function with respect to our weights and take small steps in the
direction opposite to the gradient, our loss will gradually decrease until it converges to some
local minimum. This algorithm is called Gradient Descent. The rule for updating the weights on
each iteration of Gradient Descent is the following:

w := w − lr × ∂L/∂w

For each weight, subtract the derivative of the loss with respect to it, multiplied by the learning rate.

lr in the notation above means the learning rate. It's there to control how big a step we take
each iteration, and it is the most important hyper-parameter to tune when training neural
networks. If you choose a learning rate that is too big, then you'll make steps that are too
large and will 'jump over' the minimum, which means your algorithm will diverge. If you choose
a learning rate that is too small, it might take too much time to converge to some local
minimum. People have developed some very good techniques for finding an optimal learning
rate, but this goes beyond the scope of this post. Some of them are described in my other
post, 'Improving the way we work with learning rate'.
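As a minimal sketch of the update rule applied to the running example f(x) = x² (an illustration, not from the original post), starting from x = 2 the iterates move toward the minimum at 0:

def f_prime(x):
    return 2 * x              # derivative of f(x) = x**2

x = 2.0                       # the starting point used in the text
lr = 0.1                      # learning rate, chosen small enough not to overshoot
for _ in range(50):
    x = x - lr * f_prime(x)   # gradient descent update: step opposite to the derivative
print(x)                      # close to 0, the minimum of f(x) = x**2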

Unfortunately, we can't really apply this algorithm directly for training neural networks, and
the reason lies in our formula for the loss function.

As you can see from what we defined above, our formula is an average over a sum. From
calculus we know that the derivative of a sum is the sum of the derivatives. Therefore, in
order to compute the derivatives of the loss, we need to go through every example in our
dataset. It would be very inefficient to do this at every iteration of Gradient Descent, because
each iteration of the algorithm only improves our loss by some small step. To solve this
problem, there's another algorithm called Mini-batch Gradient Descent. The rule for updating
the weights stays the same, but we're not going to calculate the exact derivative. Instead,
we'll approximate the derivative on some small mini-batch of the dataset and use that
derivative to update the weights. Mini-batch updates aren't guaranteed to take steps in the
optimal direction; in fact, they usually won't. With full-batch gradient descent, if you choose
a small enough learning rate, your loss is guaranteed to decrease every iteration. With
mini-batches that's not true: your loss will decrease over time, but it will fluctuate and be
much more 'noisy'.

The size of the batch used for estimating the derivatives is another hyper-parameter you'll
have to choose. Usually, you want as big a batch size as your memory can handle, but I've
rarely seen people use batch sizes much larger than around 100.

The extreme version of mini-batch gradient descent, with a batch size equal to 1, is called
Stochastic Gradient Descent. In modern literature, and very commonly, when people say
Stochastic Gradient Descent (SGD) they actually refer to mini-batch gradient descent.
Most deep learning frameworks will let you choose the batch size for SGD.
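As a hedged sketch of the mini-batch idea (a toy linear model, not the post's original code), each update uses the gradient averaged over a small random batch rather than over the whole dataset:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))                 # 1000 examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for _ in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size               # gradient of MSE on the batch
    w -= lr * grad                                             # same update rule as full-batch GD
print(w)   # a noisy estimate approaching true_w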
That’s it on Gradient Descent and its variations. Recently, more and more people have
been using more advanced algorithms. Most of them are gradient-based and actually
based on SGD with slight modifications. I’m planning on writing about those as well.

Hidden Markov Model in Machine Learning


Hidden Markov Models (HMMs) are a type of probabilistic model that are commonly
used in machine learning for tasks such as speech recognition, natural language
processing, and bioinformatics. They are a popular choice for modelling sequences of
data because they can effectively capture the underlying structure of the data, even
when the data is noisy or incomplete. In this article, we will give a comprehensive
overview of Hidden Markov Models, including their mathematical foundations,
applications, and limitations.

What are Hidden Markov Models?

A Hidden Markov Model (HMM) is a probabilistic model that consists of a sequence


of hidden states, each of which generates an observation. The hidden states are
usually not directly observable, and the goal of HMM is to estimate the sequence of
hidden states based on a sequence of observations. An HMM is defined by the following
components:

o A set of N hidden states, S = {s1, s2, ..., sN}.


o A set of M observations, O = {o1, o2, ..., oM}.
o An initial state probability distribution, π = {π1, π2, ..., πN}, which specifies the
probability of starting in each hidden state.
o A transition probability matrix, A = [aij], defines the probability of moving from
one hidden state to another.
o An emission probability matrix, B = [bjk], defines the probability of emitting an
observation from a given hidden state.

The basic idea behind an HMM is that the hidden states generate the observations, and
the observed data is used to estimate the hidden state sequence. The standard procedure
for computing these estimates is the forward-backward algorithm.
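To make the components concrete, here is a hedged, minimal sketch (toy numbers, not from the article) of an HMM defined by (π, A, B) and the forward pass of the forward-backward algorithm, which computes the likelihood of an observation sequence:

import numpy as np

# Toy HMM with N = 2 hidden states and M = 2 possible observations
pi = np.array([0.6, 0.4])           # initial state distribution
A = np.array([[0.7, 0.3],           # A[i, j] = P(next state j | current state i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],           # B[j, k] = P(observation k | state j)
              [0.2, 0.8]])

def forward_likelihood(obs):
    """Probability of the observation sequence under the HMM (forward algorithm)."""
    alpha = pi * B[:, obs[0]]                  # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]          # propagate one step and weight by the emission
    return alpha.sum()

print(forward_likelihood([0, 1, 0]))           # likelihood of a short observation sequence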

Applications of Hidden Markov Models

Now, we will explore some of the key applications of HMMs, including speech
recognition, natural language processing, bioinformatics, and finance.
o Speech Recognition
One of the most well-known applications of HMMs is speech recognition. In this
field, HMMs are used to model the different sounds and phones that make up
speech. The hidden states, in this case, correspond to the different sounds or
phones, and the observations are the acoustic signals that are generated by the
speech. The goal is to estimate the hidden state sequence, which corresponds to
the transcription of the speech, based on the observed acoustic signals. HMMs
are particularly well-suited for speech recognition because they can effectively
capture the underlying structure of the speech, even when the data is noisy or
incomplete. In speech recognition systems, the HMMs are usually trained on
large datasets of speech signals, and the estimated parameters of the HMMs are
used to transcribe speech in real time.
o Natural Language Processing
Another important application of HMMs is natural language processing. In this
field, HMMs are used for tasks such as part-of-speech tagging, named entity
recognition, and text classification. In these applications, the hidden states are
typically associated with the underlying grammar or structure of the text, while
the observations are the words in the text. The goal is to estimate the hidden
state sequence, which corresponds to the structure or meaning of the text, based
on the observed words. HMMs are useful in natural language processing because
they can effectively capture the underlying structure of the text, even when the
data is noisy or ambiguous. In natural language processing systems, the HMMs
are usually trained on large datasets of text, and the estimated parameters of the
HMMs are used to perform various NLP tasks, such as text classification, part-of-
speech tagging, and named entity recognition.
o Bioinformatics
HMMs are also widely used in bioinformatics, where they are used to model
sequences of DNA, RNA, and proteins. The hidden states, in this case, correspond
to the different types of residues, while the observations are the sequences of
residues. The goal is to estimate the hidden state sequence, which corresponds to
the underlying structure of the molecule, based on the observed sequences of
residues. HMMs are useful in bioinformatics because they can effectively capture
the underlying structure of the molecule, even when the data is noisy or
incomplete. In bioinformatics systems, the HMMs are usually trained on large
datasets of molecular sequences, and the estimated parameters of the HMMs are
used to predict the structure or function of new molecular sequences.
o Finance
Finally, HMMs have also been used in finance, where they are used to model
stock prices, interest rates, and currency exchange rates. In these applications, the
hidden states correspond to different economic states, such as bull and bear
markets, while the observations are the stock prices, interest rates, or exchange
rates. The goal is to estimate the hidden state sequence, which corresponds to
the underlying economic state, based on the observed prices, rates, or exchange
rates. HMMs are useful in finance because they can effectively capture the
underlying economic state, even when the data is noisy or incomplete. In finance
systems, the HMMs are usually trained on large datasets of financial data, and the
estimated parameters of the HMMs are used to make predictions about future
market trends or to develop investment strategies.

Limitations of Hidden Markov Models

Now, we will explore some of the key limitations of HMMs and discuss how they can
impact the accuracy and performance of HMM-based systems.

o Limited Modeling Capabilities


One of the key limitations of HMMs is that they are relatively limited in their
modelling capabilities. HMMs are designed to model sequences of data, where
the underlying structure of the data is represented by a set of hidden states.
However, the structure of the data can be quite complex, and the simple
structure of HMMs may not be enough to accurately capture all the details. For
example, in speech recognition, the complex relationship between the speech
sounds and the corresponding acoustic signals may not be fully captured by the
simple structure of an HMM.
o Overfitting
Another limitation of HMMs is that they can be prone to overfitting, especially
when the number of hidden states is large or the amount of training data is
limited. Overfitting occurs when the model fits the training data too well and is
unable to generalize to new data. This can lead to poor performance when the
model is applied to real-world data and can result in high error rates. To avoid
overfitting, it is important to carefully choose the number of hidden states and to
use appropriate regularization techniques.
o Lack of Robustness
HMMs are also limited in their robustness to noise and variability in the data. For
example, in speech recognition, the acoustic signals generated by speech can be
subjected to a variety of distortions and noise, which can make it difficult for the
HMM to accurately estimate the underlying structure of the data. In some cases,
these distortions and noise can cause the HMM to make incorrect decisions,
which can result in poor performance. To address these limitations, it is often
necessary to use additional processing and filtering techniques, such as noise
reduction and normalization, to pre-process the data before it is fed into the
HMM.
o Computational Complexity
Finally, HMMs can also be limited by their computational complexity, especially
when dealing with large amounts of data or when using complex models. The
computational complexity of HMMs is due to the need to estimate the
parameters of the model and to compute the likelihood of the data given the
model. This can be time-consuming and computationally expensive, especially for
large models or for data that is sampled at a high frequency. To address this
limitation, it is often necessary to use parallel computing techniques or to use
approximations that reduce the computational complexity of the model.

Entropy in Machine Learning


We are living in a technology world, and somewhere everything is related to technology.
Machine Learning is also the most popular technology in the computer science world
that enables the computer to learn automatically from past experiences.

Also, Machine Learning is so much demanded in the IT world that most companies want
highly skilled machine learning engineers and data scientists for their business. Machine
Learning contains lots of algorithms and concepts that solve complex problems easily,
and one of them is entropy in Machine Learning. Almost everyone must have heard the
Entropy word once during their school or college days in physics and chemistry. The
base of entropy comes from physics, where it is defined as the measurement of
disorder, randomness, unpredictability, or impurity in the system. In this article, we will
discuss what entropy is in Machine Learning and why entropy is needed in Machine
Learning. So let's start with a quick introduction to the entropy in Machine Learning.

Introduction to Entropy in Machine Learning

Entropy is defined as the randomness or measuring the disorder of the information


being processed in Machine Learning. Further, in other words, we can say that entropy
is the machine learning metric that measures the unpredictability or impurity in
the system.
When information is processed in the system, every piece of information has a specific
value and can be used to draw conclusions from it. If it is easy to draw a valuable
conclusion from a piece of information, then its entropy is low; if the entropy is high,
it is difficult to draw any conclusion from that piece of information.

Entropy is a useful tool in machine learning to understand various concepts such as


feature selection, building decision trees, and fitting classification models, etc. Being a
machine learning engineer and professional data scientist, you must have in-depth
knowledge of entropy in machine learning.

What is Entropy in Machine Learning

Entropy is the measurement of disorder or impurities in the information processed in


machine learning. It determines how a decision tree chooses to split data.
We can understand the term entropy with a simple example: flipping a coin. When we flip
a coin, there are two possible outcomes. However, it is difficult to predict the exact
outcome, because there is no direct relation between flipping the coin and its outcome.
Both outcomes have a 50% probability, and in such a scenario the entropy is at its highest.
This is the essence of entropy in machine learning.

Mathematical Formula for Entropy

Consider a data set having a total of N classes; then the entropy (E) can be
determined with the formula below:

E = − Σ (i = 1 to N) Pi log2(Pi)

Where:

Pi = probability of randomly selecting an example in class i.
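As a small hedged sketch (not part of the original article), the formula can be evaluated directly for the fair-coin example, where both outcomes have probability 0.5:

import math

def entropy(probs):
    """Shannon entropy in bits: -sum(p * log2(p)) over non-zero class probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: the fair coin, maximum uncertainty
print(entropy([0.9, 0.1]))   # ~0.47 bits: a biased coin is easier to predict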


UNIT-III DEEP LEARNING

Deep Learning: Feedforward Neural Networks

Feedforward neural networks are also known as a Multi-layered Network of
Neurons (MLN). These models are called feedforward because the information only
travels forward in the neural network: through the input nodes, then through the
hidden layers (one or many), and finally through the output nodes. In an MLN there
are no feedback connections, i.e., the output of the network is not fed back into
itself. These networks are represented by a combination of many simpler models
(sigmoid neurons).

Motivation: Non-Linear Data


Before we talk about feedforward neural networks, let's understand the need for such
neural networks. Traditional models like the perceptron, which takes real inputs and gives
a boolean output, only work if the data is linearly separable. That means that the positive
points (green) should lie on one side of the boundary and the negative points (red) on the
other side. As you can see from the figure below, the perceptron does a very poor job of
finding the best decision boundary to separate the positive and negative points.
Perceptron Decision Surface for Non-Linear Data
Next, we have the sigmoid neuron model. This is similar to the perceptron, but the sigmoid
model is slightly modified so that the output from the sigmoid neuron is much smoother
than the step-function output of the perceptron. Although we have introduced the
non-linear sigmoid neuron function, a single sigmoid neuron is still not able to effectively
separate the red points (negatives) from the green points (positives).
Sigmoid Neuron Decision Boundary for Non-Linear Data
The important point is that, from the rigid decision boundary of the perceptron, we have taken
our first step in the direction of creating a decision boundary that works well for non-
linearly separable data. Hence the sigmoid neuron is the building block of our
feedforward neural network.
From my previous post on the Universal Approximation Theorem, we saw that even though a
single sigmoid neuron can't deal with non-linear data, if we connect multiple sigmoid
neurons in an effective way, the combination of neurons can approximate any complex
relationship between the input and the output that is required to deal with non-linear
data.
Combination of Sigmoid Neurons for Non-Linear Data
Feedforward Neural Networks
Multi-layered Network of neurons is composed of many sigmoid neurons. MLNs are
capable of handling the non-linearly separable data. The layers present between the
input and output layers are called hidden layers. The hidden layers are used to handle
the complex non-linearly separable relations between input and the output.
Simple Deep Neural Network
In this section let’s see how can we use a very simple neural network to solve complex
non-linear decision boundaries.
Let’s take an example of a mobile phone like/dislike predictor with two variables: screen
size, and the cost. It has a complex decision boundary as shown below:
Decision Boundary
We know that by using a single sigmoid neuron it is impossible to obtain this kind of
non-linear decision boundary, regardless of how we vary the sigmoid neuron parameters
w and b. Now let's change the situation, use a simple network of neurons for the same
problem, and see how it handles it.
Simple Neural Network
We have our inputs x₁ — screen size and x₂— price going into the network along with
the bias b₁ and b₂.
Now let’s break down the model neuron by neuron to understand. We have our first
neuron (leftmost) in the first layer which is connected to inputs x₁ and x₂ with the
weights w₁₁ and w₁₂ and the biases b₁ and b₂. The output of that neuron is represented as
h₁, which is a function of x₁ and x₂ with parameters w₁₁ and w₁₂.

Output of the first Neuron


If we apply the sigmoid function to the inputs x₁ and x₂ with the appropriate weights w₁₁,
w₁₂ and bias b₁ we would get an output h₁, which would be some real value between 0
and 1. The sigmoid output for the first neuron h₁ will be given by the following equation,

Output Sigmoid for First Neuron


Next up, we have another neuron in the first layer which is connected to inputs x₁ and x₂
with the weights w₁₃ and w₁₄ along with the biases b₃ and b₄. The output of the second
neuron is represented as h₂.
Output of the second Neuron
Similarly, the sigmoid output for the second neuron h₂ will be given by the following
equation,

So far we have seen the neurons present in the first layer but we also have another
output neuron which takes h₁ and h₂ as the input as opposed to the previous neurons.
The output from this neuron will be the final predicted output, which is a function of h₁
and h₂. The predicted output is given by the following equation,



Earlier we could adjust only w₁, w₂ and b — the parameters of a single sigmoid neuron. Now
we can adjust 9 parameters (w₁₁, w₁₂, w₁₃, w₁₄, w₂₁, w₂₂, b₁, b₂, b₃), which allows a
much more complex decision boundary to be handled. By trying out different
configurations of these parameters, we would be able to find the optimal surface where
the output for the entire middle area (red points) is one and everywhere else is zero,
which is what we desire.
Decision Boundary from Network
The important point to note is that even with the simple neural network we were able to
model the complex relationship between the input and output. Now the question arises:
how do we know in advance that this particular configuration is good, and why not add a few
more layers in between or a few more neurons in the first layer? All these questions are
valid, but for now we will keep things simple and take the network as it is. We will discuss
these questions and a lot more in detail when we discuss hyper-parameter tuning.
Generic Deep Neural Network
In the previous section, we saw a neural network for a specific task; now we will talk
about the neural network in its generic form.
Generic Network without Connections
Let's assume we have a network of neurons with two hidden layers (in blue; there can be
more than 2 layers if needed), and each hidden layer has 3 sigmoid neurons (there can be
more neurons, but for now I am keeping things simple). We have three inputs going into the
network (for simplicity I used only three inputs, but it can take n inputs), and there are
two neurons in the output layer. As I said before, we will take this network as it is and
understand the intricacies of the deep neural network.
First, I will explain the terminology then we will go into how these neurons interact with
each other. For each of these neurons, two things will happen

1. Pre-activation represented by ‘a’: It is a weighted sum of inputs plus the bias.


2. Activation represented by ‘h’: Activation function is Sigmoid function.
Generic Network with Connections
Let’s understand the network neuron by neuron. Consider the first neuron present in the
first hidden layer. The first neuron is connected to each of the inputs by weight W₁.
Going forward I will be using this format of the indices to represent the weights and
biases associated with a particular neuron,
W(Layer number)(Neuron number in the layer)(Input number)
b(Layer number)(Bias number associated for that input)
W₁₁₁— Weight associated with the first neuron present in the first hidden layer
connected to the first input.
W₁₁₂— Weight associated with the first neuron present in the first hidden layer
connected to the second input.
b₁₁ — Bias associated with the first neuron present in the first hidden layer.
b₁₂ — Bias associated with the second neuron present in the first hidden layer.
….
Here W₁ is a weight matrix containing the individual weights associated with the respective inputs. The pre-activation at each layer is the weighted sum of the outputs of the previous layer plus the bias. The mathematical equation for the pre-activation at each layer ‘i’ is given by,
aᵢ = Wᵢ · hᵢ₋₁ + bᵢ   (with h₀ = x, the input)
Pre-activation Function
The activation at each layer is equal to applying the sigmoid function to the pre-activation of that layer. The mathematical equation for the activation at each layer ‘i’ is given by,
hᵢ = g(aᵢ), where g is the sigmoid function
Activation Function
Finally, we can get the predicted output of the neural network by applying some kind of output activation function (which could be softmax, depending on the task) to the pre-activation of the last layer. The equation for the predicted output is shown below,
ŷ = f(a_L), where f is the output activation function and L is the last layer
Output Activation Function


Computations in Deep Neural Network
We have seen the terminology and functional aspect of the neural network. Now we will
see how the computations inside a DNN takes place.
Consider that you have 100 inputs and 10 neurons in each of the first and second hidden layers. Each of the 100 inputs will be connected to each of the 10 neurons in the first hidden layer, so the weight matrix of the first layer W₁ will have a total of 10 x 100 weights.

Weight Matrix
Remember, we are following a very specific format for the indices of the weight and bias
variable as shown below,
W(Layer number)(Neuron number in the layer)(Input number)
b(Layer number)(Bias number associated for that input)
Now let’s see how we can compute the pre-activation for the first neuron of the first layer, a₁₁. We know that pre-activation is nothing but the weighted sum of inputs plus bias. In other words, it is the dot product between the first row of the weight matrix W₁ and the input vector X, plus the bias b₁₁,
a₁₁ = W₁₁₁·x₁ + W₁₁₂·x₂ + … + W₁₁,₁₀₀·x₁₀₀ + b₁₁
Similarly, the pre-activation for each of the other 9 neurons in the first layer is given by the dot product of the corresponding row of W₁ with the input plus that neuron's bias.
In short, the overall pre-activation of the first layer is given by,
a₁ = W₁X + b₁
Where,
W₁ is a matrix containing the individual weights associated with the corresponding inputs, and b₁ is a vector containing the individual biases (b₁₁, b₁₂, …, b₁,₁₀) associated with the 10 sigmoid neurons.
The activation for the first layer is given by,
h₁ = g(a₁)
Where ‘g’ represents the sigmoid function.


Remember that a₁ is a vector of 10 pre-activation values, here we are applying the
element-wise sigmoid function on all these 10 values and storing them in another vector
represented as h₁. Similarly, we can compute the pre-activation and activation values
for ’n’ number of hidden layers present in the network.
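As a concrete illustration, here is a minimal NumPy sketch of the layer computation described above; the sizes follow the 100-input, 10-neuron example, and the random values are placeholders:

import numpy as np

def sigmoid(a):
    # element-wise sigmoid activation
    return 1.0 / (1.0 + np.exp(-a))

x = np.random.randn(100)          # 100 inputs
W1 = np.random.randn(10, 100)     # weight matrix of the first layer (10 neurons x 100 inputs)
b1 = np.random.randn(10)          # one bias per neuron in the first layer

a1 = W1 @ x + b1                  # pre-activation: weighted sum of inputs plus bias
h1 = sigmoid(a1)                  # activation: element-wise sigmoid of the pre-activation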
Output Layer of DNN
So far we have talked about the computations in the hidden layer. Now we will talk
about the computations in the output layer.
Generic Network with Connections
We can compute the pre-activation a₃ at the output layer by taking the dot product of
weights associated W₃ and activation of the previous layer h₂ plus bias vector b₃.
Pre-activation for the output layer
To find out the predicted output from the network, we are applying some function (which
we don’t know yet) to the pre-activation values.

Predicted Values
These two outputs will form a probability distribution, which means their sum will be equal to 1.
The output activation function is chosen depending on the task at hand; it can be, for example, softmax or linear.
Softmax Function
We will use the softmax function as the output activation function; it is the most frequently used output activation in deep learning for classification problems. For a pre-activation vector a, the softmax output for the j-th neuron is given by,
softmax(a)ⱼ = exp(aⱼ) / Σₖ exp(aₖ)
In the softmax function, every output is positive irrespective of the sign of the input, and the outputs sum to 1.
Now, let us illustrate the Softmax function on the above-shown network with 4 output
neurons. The output of all these 4 neurons is represented in a vector ‘a’. To this vector,
we will apply our softmax activation function to get the predicted probability distribution
as shown below,
Applying the Softmax Function
By applying the softmax function we would get a predicted probability distribution and
our true output is also a probability distribution, we can compare these two distributions
to compute the loss of the network.
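A minimal NumPy sketch of the softmax computation described above; the pre-activation values are arbitrary:

import numpy as np

def softmax(a):
    # subtract the max for numerical stability, then normalize the exponentials
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([1.2, -0.3, 2.5, 0.7])   # pre-activations of the 4 output neurons
y_hat = softmax(a)                    # predicted probability distribution
print(y_hat, y_hat.sum())             # the probabilities sum to 1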
Loss Function
In this section, we will talk about the loss function for binary and multi-class
classification. The purpose of the loss function is to tell the model that some correction
needs to be done in the learning process.
In general, the number of neurons in the output layer would be equal to the number of
classes. But in the case of binary classification, we can use only one sigmoid neuron
which outputs the probability P(Y=1), and therefore we can obtain P(Y=0) = 1 − P(Y=1). In the case of classification, we will use the cross-entropy loss to compare the predicted probability distribution and the true probability distribution.
Cross-entropy loss for binary classification is given by,
L = −[y log(ŷ) + (1 − y) log(1 − ŷ)]
Cross-entropy loss for multi-class classification is given by,
L = −Σ_c y_c log(ŷ_c), where the sum runs over the classes c
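A small NumPy sketch of both loss computations; the variable names and example values are illustrative:

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # y is 0 or 1, y_hat is the predicted probability P(Y=1)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    # y is a one-hot vector, y_hat is the predicted probability distribution
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat))

print(binary_cross_entropy(1, 0.8))                       # small loss: confident and correct
print(categorical_cross_entropy(np.array([0, 0, 1, 0]),
                                np.array([0.1, 0.2, 0.6, 0.1])))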


Learning Algorithm: Non-Mathy Version
We will be looking at the non-math version of the learning algorithm using gradient
descent. The objective of the learning algorithm is to determine the best possible values
for the parameters, such that the overall loss of the deep neural network is minimized as
much as possible.
The learning algorithm goes like this,
We initialize all the weights w (w₁₁₁, w₁₁₂, …) and biases b (b₁₁, b₁₂, …) randomly. We then iterate over all the observations in the data; for each observation, we find the corresponding predicted distribution from the neural network and compute the loss using the cross-entropy function. Based on the loss value, we update the weights such that the overall loss of the model at the new parameters is less than the current loss of the model.
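To make the loop above concrete, here is a toy, fully worked NumPy example using a single sigmoid neuron rather than a full deep network; the data, learning rate, and number of epochs are arbitrary illustrative choices:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# toy data: 4 observations with 2 features and binary labels (the AND function)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])

rng = np.random.default_rng(0)
w, b = rng.normal(size=2), 0.0            # random initialization of the parameters
lr = 0.5                                  # learning rate

for epoch in range(1000):
    y_hat = sigmoid(X @ w + b)            # predicted probabilities for all observations
    grad = y_hat - y                      # gradient of cross-entropy w.r.t. the pre-activation
    w -= lr * (X.T @ grad) / len(y)       # update the weights to reduce the loss
    b -= lr * grad.mean()                 # update the bias to reduce the loss

print(np.round(sigmoid(X @ w + b), 2))    # predictions move toward the true labels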
Conclusion
In this post, we briefly looked at the limitations of traditional models like the perceptron and the sigmoid neuron in handling non-linear data, and then we went on to see how we can use a simple neural network to solve a complex decision boundary problem. We then looked at the neural network in a generic sense and the computations behind the neural network in detail. Finally, we looked at the learning algorithm of the deep neural network.
Recommended Reading:
Sigmoid Neuron Learning Algorithm Explained With Math

Regularization Techniques in Deep Learning


One of the most common problems data science professionals face is avoiding overfitting. Have you come across a situation where your model performed exceptionally well on the training data but was not able to predict the test data? Or have you been at the top of a competition's public leaderboard, only to fall hundreds of places in the final rankings? Well, this section is for you.

Table of Contents

1. What is Regularization?
2. How does Regularization help in reducing Overfitting?
3. Different Regularization techniques in Deep Learning
 L2 and L1 regularization
 Dropout
 Data augmentation
 Early stopping
What is Regularization?
Before we deep dive into the topic, take a look at this image:
Have you seen this image before? As we move towards the right in this image, our
model tries to learn too well the details and the noise from the training data, which
ultimately results in poor performance on the unseen data.
In other words, while going towards the right, the complexity of the model increases
such that the training error reduces but the testing error doesn’t. This is shown in the
image below.

If you’ve built a neural network before, you know how complex they are. This makes
them more prone to overfitting.
Regularization is a technique which makes slight modifications to the learning
algorithm such that the model generalizes better. This in turn improves the model’s
performance on the unseen data as well.

How does Regularization help reduce Overfitting?


Let’s consider a neural network which is overfitting on the training data.

If you have studied the concept of regularization in machine learning, you will have a fair
idea that regularization penalizes the coefficients. In deep learning, it actually
penalizes the weight matrices of the nodes.
Assume that our regularization coefficient is so high that some of the weight matrices
are nearly equal to zero. This will result in a much simpler linear network and slight
underfitting of the training data.
Such a large value of the regularization coefficient is not that useful. We need to optimize the value of the regularization coefficient in order to obtain a well-fitted model, as shown in the image below.
Different Regularization Techniques in Deep Learning
Now that we have an understanding of how regularization helps in reducing overfitting,
we’ll learn a few different techniques in order to apply regularization in deep learning.

L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the general cost
function by adding another term known as the regularization term.

Cost function = Loss (say, binary cross entropy) + Regularization term


Due to the addition of this regularization term, the values of weight matrices decrease
because it assumes that a neural network with smaller weight matrices leads to simpler
models. Therefore, it will also reduce overfitting to quite an extent.
However, this regularization term differs in L1 and L2.

In L2, we have:

Cost function = Loss + (λ / 2m) × Σ ‖w‖²

Here, lambda is the regularization parameter. It is the hyperparameter whose value is optimized for better results. L2 regularization is also known as weight decay as it forces the weights to decay towards zero (but not exactly zero).
In L1, we have:

Cost function = Loss + (λ / 2m) × Σ ‖w‖
In this, we penalize the absolute value of the weights. Unlike L2, the weights may be
reduced to zero here. Hence, it is very useful when we are trying to compress our
model. Otherwise, we usually prefer L2 over it.
In Keras, we can directly apply regularization to any layer using the regularizers module. Below, I have applied an L2 regularizer to a dense layer having 500 neurons and a ReLU activation function.
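A sketch of what such a layer could look like in Keras; the 0.01 regularization factor is an arbitrary illustrative choice:

from keras import regularizers
from keras.layers import Dense

# dense layer with 500 neurons, ReLU activation, and an L2 penalty on its weights
layer = Dense(500, activation='relu',
              kernel_regularizer=regularizers.l2(0.01))

# an L1 penalty (or a combined L1/L2 penalty) can be applied the same way:
# kernel_regularizer=regularizers.l1(0.01) or regularizers.l1_l2(l1=0.01, l2=0.01)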
Dropout
This is the one of the most interesting types of regularization techniques. It also
produces very good results and is consequently the most frequently used regularization
technique in the field of deep learning.
To understand dropout, let’s say our neural network structure is akin to the one shown below:
So what does dropout do? At every iteration, it randomly selects some nodes and
removes them along with all of their incoming and outgoing connections as shown
below.
So each iteration has a different set of nodes and this results in a different set of
outputs. It can also be thought of as an ensemble technique in machine learning.
Ensemble models usually perform better than a single model as they capture more
randomness. Similarly, dropout also performs better than a normal neural network
model.
This probability of choosing how many nodes should be dropped is the hyperparameter
of the dropout function. As seen in the image above, dropout can be applied to both the
hidden layers as well as the input layers.
Due to these reasons, dropout is usually preferred when we have a large neural
network structure in order to introduce more randomness.
In Keras, we can implement dropout using the Dropout layer. Below is a dropout implementation. I have introduced a dropout of 0.2 as the probability of dropping, placed after the last hidden convolutional layer having 64 kernels and after the first dense layer having 500 neurons.
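A sketch of the kind of architecture described, with Dropout(0.2) after a 64-filter convolutional block and after a 500-unit dense layer; the input shape and remaining layers are assumptions for illustration:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(64, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    MaxPooling2D((2, 2)),
    Dropout(0.2),                   # drop 20% of the activations after the last hidden conv block
    Flatten(),
    Dense(500, activation='relu'),
    Dropout(0.2),                   # drop 20% of the activations after the first dense layer
    Dense(1, activation='sigmoid'),
])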
Data Augmentation
The simplest way to reduce overfitting is to increase the size of the training data. In
machine learning, we were not able to increase the size of training data as the labeled
data was too costly.
But, now let’s consider we are dealing with images. In this case, there are a few ways of
increasing the size of the training data – rotating the image, flipping, scaling, shifting,
etc. In the below image, some transformation has been done on the handwritten digits
dataset.

This technique is known as data augmentation. This usually provides a big leap in
improving the accuracy of the model. It can be considered as a mandatory trick in order
to improve our predictions.
In Keras, we can perform all of these transformations using ImageDataGenerator. It has a big list of arguments that you can use to pre-process your training data.
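For example, a typical augmentation setup might look like this; the specific ranges are illustrative, not prescriptive:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,        # randomly rotate images by up to 20 degrees
    width_shift_range=0.1,    # randomly shift images horizontally
    height_shift_range=0.1,   # randomly shift images vertically
    zoom_range=0.1,           # randomly zoom in on images
    horizontal_flip=True)     # randomly flip images left to right

# datagen.flow(x_train, y_train, batch_size=32) then yields augmented batches during training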

Early stopping
Early stopping is a kind of cross-validation strategy where we keep one part of the
training set as the validation set. When we see that the performance on the validation
set is getting worse, we immediately stop training the model. This is known as early stopping.
In the above image, we will stop training at
the dotted line since after that our model will start overfitting on the training data.
In Keras, we can apply early stopping using callbacks. Below is the implementation code for it. I have applied early stopping so that training stops if the monitored validation metric has not improved for 3 epochs.
from keras.callbacks import EarlyStopping

# stop training when the monitored validation metric has not improved for 3 consecutive epochs
earlystop = EarlyStopping(monitor='val_acc', patience=3)
epochs = 20
batch_size = 256

Here, monitor denotes the quantity that needs to be monitored, and ‘val_acc’ denotes the validation accuracy.
Patience denotes the number of epochs with no further improvement after which the training will be stopped. For better understanding, let's take a look at the above image again. After the dotted line, each epoch will result in a higher value of validation error. Therefore, 3 epochs after the dotted line (since our patience is equal to 3), our model will stop because no further improvement is seen.
Note: It may be possible that after 3 epochs (this is the value defined for patience
in general), the model starts improving again and the validation error starts
decreasing as well. Therefore, we need to take extra care while tuning this
hyperparameter.
Implementation on Malaria Cell Identification with keras
By this point, you should have a theoretical understanding of the different techniques we have gone through. We will now apply this knowledge to our deep learning practice problem: identifying malaria cells. In this problem, I will use all the regularization techniques discussed earlier, i.e.,

1. L1/L2 regularizers
2. Dropout
3. Data augmentation
4. Early stopping

Deep Learning
Deep learning is a type of artificial intelligence where machines learn to understand and
solve complex problems on their own, without explicit programming. It mimics how the
human brain works, using layers of interconnected artificial neurons to process and
learn from vast amounts of data. Deep learning has enabled significant advancements
in areas like image recognition, natural language processing etc. Training of deep
learning models is essential in shaping their capacity to generalize and excel in handling
new and unseen data.
How are Deep Learning Models trained?
The training process in deep learning involves feeding the model with a large dataset,
allowing it to learn patterns and adjusting its parameters to reduce errors. It can be
summarized in the following steps:

Data Collection: A huge and diverse dataset is gathered to train the model. High-quality
data is essential for the model to learn representative features and generalize well.
Example:
Suppose we are building an image classifier to identify whether an image contains a cat or a
dog. We gather a large and diverse dataset of labeled images of cats and dogs. This is the data
collection process.
Data Preprocessing: The collected data is preprocessed to ensure uniformity, remove
noise, and make it suitable for input to the model. This step includes normalization,
scaling, and data augmentation.
Example:
Continuing with our above example, we preprocess the images by resizing them to a consistent
size, normalizing the pixel values to a certain range and applying data augmentation techniques
to increase the diversity of the training samples.
Model Architecture: Choosing an appropriate model architecture is crucial. Different
model types, such as convolutional neural networks (CNNs), recurrent neural networks
(RNNs) and others, have distinct structures to do specific tasks.
Example:
We choose a simple neural network architecture consisting of an input layer, a hidden layer with
ReLU activation, and an output layer with a sigmoid activation function.
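Under the assumptions of this running example (flattened images as input, one ReLU hidden layer, one sigmoid output), a minimal Keras sketch might look like this; the layer sizes and input shape are arbitrary illustrations:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(128, activation='relu', input_shape=(64 * 64 * 3,)),  # hidden layer with ReLU
    Dense(1, activation='sigmoid'),                              # output: probability of "dog"
])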
Initialization: When we create a machine learning model, it consists of many internal
components called parameters, which can be adjusted to make better predictions. For
example, in a neural network, the parameters are the weights and biases of the
neurons. To start the training process, we randomly set these parameters to small
values. If we set them to zero or the same value, the model won't be able to learn
effectively because all the neurons will behave the same way. Proper initialization is
vital to prevent the model from getting stuck in local minima during training.
Example:
We initialize the weights and biases of the neurons in the hidden and output layers with small
random values. This random initialization helps the model learn diverse features from the data.
Forward Pass: The training process begins by giving the model some input data, which
could be images, text, or any other type of data the model is designed to handle. The
input data is fed into the model, where it goes through a series of mathematical
operations and transformations in different layers of the model. Each layer learns to
recognize different patterns or features in the data. As the data flows through the layers,
the model becomes more capable of understanding the patterns and relationships in the
data.
Example:
We take an input image and feed it through the network. The image's pixel values are passed
through the layers, and the hidden layer applies the ReLU activation function to the weighted
sum of inputs. The output layer applies the sigmoid activation to produce a prediction between 0
and 1, indicating the probability of the image being a dog.
Loss Calculation: After the data has passed through the model and predictions have
been made, we need to evaluate how well the model did. To do this, we compare its
predictions to the actual correct answers from the training data. The difference between
the predictions and the correct answers is calculated using a loss function. There are
various types of loss functions, depending on the problem we are trying to solve. For
example, in a regression problem (predicting a continuous value), we might use mean
squared error, while in a classification problem (predicting categories), we could use
cross-entropy loss. The loss function gives us a single number that represents how far
off the model's predictions are from the true values. The bigger the difference, the larger
the loss value. The goal of training is to minimize this loss by adjusting the model's
parameters.
Example:
We compare the model's prediction to the actual label (cat or dog) of the input image. Let's say
the actual label is 1 (dog). We use a loss function such as binary cross-entropy to calculate the
difference between the predicted probability and the actual label.
Backpropagation: Backpropagation is an important technique used to update the
model's parameters based on the computed loss. It propagates the error backward
through the layers to calculate gradients, indicating the direction and magnitude of
parameter updates. The number of epochs (one complete pass through the entire
training dataset during the model's learning process) determines how many times the
entire dataset is used during the backpropagation process, helping the model finetune
its parameters and improve its performance.
Example:
Backpropagation propagates the gradients backward through the layers, while updating the
parameters in the opposite direction of the gradient to minimize the loss. This process involves
adjusting the weights and biases in the hidden and output layers to make the model's prediction
more accurate.
Optimization techniques play a crucial role in training deep learning models by helping
them find the best parameters to minimize errors and improve performance. After
calculating the loss and gradients during the forward pass, the optimization algorithm is
used to update the model's parameters based on these gradients. The goal is to
minimize the loss function.
The Adam optimizer, short for Adaptive Moment Estimation, is one of the most popular
optimization algorithms. It combines the strengths of AdaGrad and RMSprop, to achieve
faster convergence and better results. Adam optimizes the learning rate for each
parameter based on their historical gradients, enabling the model to find an optimal
solution efficiently.

Other optimization techniques include Stochastic Gradient Descent (SGD) and its
variants like Mini Batch Gradient Descent and Batch Gradient Descent, which update
parameters based on the gradients computed on small batches of data. Momentum
optimization accelerates training by considering the direction of previous parameter
updates, allowing the model to bypass local minima and speed up convergence.
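In Keras, which is used elsewhere in this material, the choice of optimizer is made when compiling the model; a minimal sketch (the model, sizes, and learning rates are arbitrary illustrations) could look like this:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = Sequential([Dense(16, activation='relu', input_shape=(10,)),
                    Dense(1, activation='sigmoid')])

# Adam adapts the step size of each parameter based on its gradient history
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# SGD with momentum is the classical alternative:
# from keras.optimizers import SGD
# model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9), loss='binary_crossentropy')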
Variation in training among different models
The training process can vary significantly among different deep learning models based
on their architecture and the nature of the data they process. Here are a few examples:

Convolutional Neural Networks (CNNs): CNNs are widely used for image-related tasks.
Their training involves learning spatial hierarchies of features through convolutional
layers and pooling. Data augmentation is commonly used to increase the diversity of
training samples and improve generalization.
Recurrent Neural Networks (RNNs): RNNs are designed for sequential data like text or
time series data. Training RNNs involves handling vanishing gradients, which can be
addressed using specialized units like LSTM (Long short-term memory) or GRU (Gated
Recurrent Unit).
Generative Adversarial Networks (GANs): GANs are used to generate synthetic data.
Their training process involves two competing networks, the generator and
discriminator, which require careful tuning to achieve stability and convergence.
Transformer Models: Transformers are popular for natural language processing tasks.
They rely on self-attention mechanisms and parallel processing, which makes their
training process computationally intensive.
Reinforcement Learning Models: In reinforcement learning, the model learns by
interacting with an environment and receiving feedback. The training process involves
finding optimal policies through trial and error, which requires balancing exploration and
exploitation.
The variations in training among these models highlight the importance of selecting the
appropriate architecture and fine-tuning the training process to get optimal performance.

Factors Affecting Training of Models


Training deep learning models can be affected by various factors, which can impact the
model's performance and convergence. Some of the key factors include:

Dataset Size: The size of the training dataset can significantly influence the model's
ability to learn representative features.
Data Quality: High-quality data is essential for training deep learning models. Noisy or
incorrect data can adversely affect the model's performance.
Data Preprocessing: Proper data preprocessing, including normalization, scaling, and
data augmentation, can impact how well the model learns from the data.
Model Architecture: Choosing an appropriate model architecture is crucial. Different
architectures are suitable for different types of tasks and data.
Initialization: Random initialization of model parameters can impact how quickly the
model converges during training.
Learning Rate: The learning rate determines the step size for parameter updates during
training. An appropriate learning rate is essential for stable and effective training.
Optimizers: Different optimization algorithms, such as Adam, SGD, or RMSprop, can
affect the speed and efficiency of model training.
Batch Size: The number of samples processed in each training iteration (batch) can
impact the training process and convergence.
Compute Intensive Training Process
Training deep learning models can be extremely compute-intensive, especially for large-
scale models and datasets. The computational demands arise from the huge number of
operations involved in forward and backward passes during training. These operations
include matrix multiplications, convolutions, gradient computations etc.

The time required to train a deep learning model depends on several factors, including
the model's architecture, dataset size, hardware specifications (CPU/GPU), and
hyperparameters like batch size and learning rate. Larger models with more parameters
and complex architectures usually require more computational power and time for
training. To estimate training times for specific CPUs/GPUs, we need to consider the
hardware specifications and theoretical compute capabilities.

Here's how to estimate training time for a deep learning model on a specific CPU or
GPU:

 Determine the number of FLOPS for the CPU/GPU.

 Calculate the FLOPS required for one forward and backward pass through the model.

 Estimate the number of iterations (epochs) required for convergence.

 Multiply the FLOPS required per iteration by the number of iterations to get the total FLOPS.

 Divide the total FLOPS by the CPU/GPU's FLOPS to get the estimated training time.

 Training times can be affected by various factors, such as data loading and preprocessing,
communication overhead and implementation efficiency.
Here's an example:
Let us do a sample calculation for estimating the training time for a deep learning model
on two specific CPUs/GPUs: A and B.

Assumptions:
 CPU/GPU A has 10 TFLOPS (10 trillion floating-point operations per second) compute
capability.

 CPU/GPU B has 20 TFLOPS compute capability.

 The model requires 1 TFLOPs (one trillion floating-point operations) for one forward and backward pass.

 The model's training converges after 100 epochs.


Sample Calculation:
First, determine the floating-point operations required per iteration for the model:
FLOPs per iteration = 1 TFLOPs

Next, estimate the total operations for the entire training process:
Total FLOPs = FLOPs per iteration × Number of iterations
Total FLOPs = 1 TFLOPs × 100 epochs = 100 TFLOPs

Estimate the training time for CPU/GPU A:
Training time for A (in seconds) = Total FLOPs / CPU/GPU A's compute rate
Training time for A = 100 TFLOPs / 10 TFLOPS = 10 seconds

Estimate the training time for CPU/GPU B:
Training time for B (in seconds) = Total FLOPs / CPU/GPU B's compute rate
Training time for B = 100 TFLOPs / 20 TFLOPS = 5 seconds

So, based on these calculations, the estimated training time for the model on CPU/GPU A would be approximately 10 seconds, while on CPU/GPU B it would be about 5 seconds.
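The same arithmetic can be wrapped in a small helper function; the numbers simply reproduce the assumptions above:

def estimated_training_time_seconds(tflops_per_pass, epochs, device_tflops):
    # total floating-point operations divided by the device's operations per second
    total_tflop = tflops_per_pass * epochs
    return total_tflop / device_tflops

print(estimated_training_time_seconds(1, 100, 10))   # device A -> 10.0 seconds
print(estimated_training_time_seconds(1, 100, 20))   # device B -> 5.0 seconds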

A few more concepts in Deep Learning Training


Early Stopping
Early stopping is a technique used to prevent overfitting during the training of deep
learning models. Overfitting occurs when a model becomes too specialized to the
training data and performs poorly on new and unseen data. Early stopping helps to find
the right balance between model complexity and generalization. During the training
process, the model's performance is monitored on a separate validation dataset. The
training is stopped when the model's performance on the validation data starts to
degrade or no longer improves. The model is saved at the point where it achieved the
best performance on the validation set. By this, we ensure that the model generalizes
well to new data and avoids memorizing noise in the training dataset.

Calibration
Calibration is a post-processing technique used to improve the reliability of probability
estimates produced by deep learning models. It ensures that the predicted probabilities
are well-calibrated and reflect the actual probabilities as closely as possible. Firstly the
model is trained as usual to make predictions on a given task. After training, the model's
predicted probabilities are compared to the actual probabilities observed in the
validation or test data. A calibration function is then applied to adjust the predicted
probabilities based on the observed discrepancies between the model's predictions and
the true probabilities.

Knowledge Distillation
Knowledge distillation is a technique used for model compression and transfer learning.
It involves transferring the knowledge from a large, complex model (often referred to as
the 'teacher' model) to a smaller, more efficient model (the 'student' model). The teacher
model is trained on the target task using the standard training process. The student
model is initialized and trained to mimic the behavior of the teacher model. Instead of
using the true labels during training, the student model is trained with the soft targets
produced by the teacher model. Soft targets are the teacher model's predicted
probabilities for each class instead of the hard labels (0 or 1). This enables the student
model to learn from the rich knowledge and insights of the teacher model. It allows the
student model to achieve comparable performance to the teacher model while being
more computationally efficient and having a smaller memory footprint.
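As a rough illustration of how soft targets enter the training objective, here is a small NumPy sketch of a distillation-style loss; the temperature T, the weighting alpha, and all the values are illustrative assumptions rather than a fixed recipe:

import numpy as np

def softmax(z, T=1.0):
    # softened probabilities: higher T spreads probability mass across classes
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label, T=2.0, alpha=0.5):
    soft_targets = softmax(teacher_logits, T)                     # teacher's soft targets
    student_soft = softmax(student_logits, T)
    soft_loss = -np.sum(soft_targets * np.log(student_soft))      # match the teacher
    hard_loss = -np.log(softmax(student_logits)[true_label])      # match the true label
    return alpha * soft_loss + (1 - alpha) * hard_loss

print(distillation_loss(np.array([1.0, 0.2, -0.5]),
                        np.array([2.0, 0.1, -1.0]), true_label=0))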

Conclusion
The training process of deep learning models is crucial in enabling machines to learn
and make decisions like the human brain. Through proper data collection and
preprocessing, we ensure the model's ability to generalize and perform well on new
data. Optimization techniques like Adam optimizer play an important role in fine-tuning
the model's parameters for better accuracy and faster convergence. The future of deep
learning holds great promise. With ongoing research and advancements, we can expect
even more significant breakthroughs in various fields such as image recognition, natural
language processing, and medical diagnosis. As computing power and data availability
continue to grow, deep learning models will become more powerful and pave the way
for new applications and help solve complex problems in ways we couldn't have
imagined before.

Dropout Regularization in Deep Learning Models


In this post, you will discover the Dropout regularization technique and
how to apply it to your models in Python with Keras.

After reading this post, you will know:

 How the Dropout regularization technique works


 How to use Dropout on your input layers
 How to use Dropout on your hidden layers
 How to tune the dropout level on your problem

Dropout Regularization for Neural Networks

Dropout is a regularization technique for neural network models


proposed by Srivastava et al. in their 2014 paper “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”.
Dropout is a technique where randomly selected neurons are ignored during training. They are “dropped out” randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the neuron on the backward pass.

As a neural network learns, neuron weights settle into their context


within the network. Weights of neurons are tuned for specific features,
providing some specialization. Neighboring neurons come to rely on
this specialization, which, if taken too far, can result in a fragile model
too specialized for the training data. This reliance on context for a
neuron during training is referred to as complex co-adaptations.

You can imagine that if neurons are randomly dropped out of the
network during training, other neurons will have to step in and handle
the representation required to make predictions for the missing
neurons. This is believed to result in multiple independent internal
representations being learned by the network.

The effect is that the network becomes less sensitive to the specific
weights of neurons. This, in turn, results in a network capable of better
generalization and less likely to overfit the training data.

Dropout Regularization in Keras

Dropout is easily implemented by randomly selecting nodes to be


dropped out with a given probability (e.g., 20%) in each weight update
cycle. This is how Dropout is implemented in Keras. Dropout is only
used during the training of a model and is not used when evaluating
the skill of the model.

Next, let’s explore a few different ways of using Dropout in Keras.

The examples will use the Sonar dataset. This is a binary classification problem that aims to correctly identify rocks and mock-mines from sonar chirp returns. It is a good test dataset for neural networks because all the input values are numerical and have the same scale.
The dataset can be downloaded from the UCI Machine Learning
repository. You can place the sonar dataset in your current working
directory with the file name sonar.csv.
You will evaluate the developed models using scikit-learn with 10-fold
cross validation in order to tease out differences in the results better.

There are 60 input values and a single output value. The input values
are standardized before being used in the network. The baseline neural
network model has two hidden layers, the first with 60 units and the
second with 30. Stochastic gradient descent is used to train the model
with a relatively low learning rate and momentum.

Using Dropout on the Visible Layer

Dropout can be applied to input neurons called the visible layer.

In the example below, a new Dropout layer between the input (or
visible layer) and the first hidden layer was added. The dropout rate is
set to 20%, meaning one in five inputs will be randomly excluded from
each update cycle.

Additionally, as recommended in the original paper on Dropout, a


constraint is imposed on the weights for each hidden layer, ensuring
that the maximum norm of the weights does not exceed a value of 3.
This is done by setting the kernel_constraint argument on the Dense
class when constructing the layers.

The learning rate was lifted by one order of magnitude, and the
momentum was increased to 0.9. These increases in the learning rate
were also recommended in the original Dropout paper.
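The code for this example is not reproduced here, but a Keras sketch consistent with the description (60 standardized inputs, 20% dropout on the visible layer, a max-norm weight constraint of 3, and SGD with a lifted learning rate and 0.9 momentum) might look roughly like this; the exact values are assumptions for illustration:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import MaxNorm
from keras.optimizers import SGD

model = Sequential([
    Dropout(0.2, input_shape=(60,)),                             # dropout on the visible (input) layer
    Dense(60, activation='relu', kernel_constraint=MaxNorm(3)),
    Dense(30, activation='relu', kernel_constraint=MaxNorm(3)),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy',
              optimizer=SGD(learning_rate=0.1, momentum=0.9),    # lifted learning rate, high momentum
              metrics=['accuracy'])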

Using Dropout on Hidden Layers

Dropout can be applied to hidden neurons in the body of your network


model.
In the example below, Dropout is applied between the two hidden
layers and between the last hidden layer and the output layer. Again a
dropout rate of 20% is used as is a weight constraint on those layers.
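Analogously, a sketch with dropout between the hidden layers and before the output layer, under the same assumptions as above:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import MaxNorm

model = Sequential([
    Dense(60, activation='relu', input_shape=(60,), kernel_constraint=MaxNorm(3)),
    Dropout(0.2),                                                # dropout between the two hidden layers
    Dense(30, activation='relu', kernel_constraint=MaxNorm(3)),
    Dropout(0.2),                                                # dropout before the output layer
    Dense(1, activation='sigmoid'),
])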

Dropout in Evaluation Mode

Dropout will randomly reset some of the inputs to zero. If you wonder what happens after you have finished training, the answer is nothing! In Keras, a layer can tell whether the model is running in training mode or not, and the Dropout layer randomly zeroes inputs only when the model is training. To keep the scale of the signal seen by the next layer consistent, the retained inputs are rescaled during training: precisely, if the dropout rate is r, the kept inputs are scaled up by a factor of 1 / (1 − r), and at inference time the layer simply passes its input through unchanged.

Tips for Using Dropout

The original paper on Dropout provides experimental results on a suite


of standard machine learning problems. As a result, they provide a
number of useful heuristics to consider when using Dropout in
practice.

 Generally, use a small dropout value of 20%-50% of neurons, with 20%


providing a good starting point. A probability too low has minimal
effect, and a value too high results in under-learning by the network.
 Use a larger network. You are likely to get better performance when
Dropout is used on a larger network, giving the model more of an
opportunity to learn independent representations.
 Use Dropout on incoming (visible) as well as hidden units. Application
of Dropout at each layer of the network has shown good results.
 Use a large learning rate with decay and a large momentum. Increase
your learning rate by a factor of 10 to 100 and use a high momentum
value of 0.9 or 0.99.
 Constrain the size of network weights. A large learning rate can result
in very large network weights.
Imposing a constraint on the size of network weights, such as max-
norm regularization, with a size of 4 or 5 has been shown to improve
results.
Introduction to Convolution Neural Network
A Convolutional Neural Network (CNN) is a type of Deep Learning neural network architecture
commonly used in Computer Vision. Computer vision is a field of Artificial Intelligence that enables
a computer to understand and interpret the image or visual data.
When it comes to Machine Learning, Artificial Neural Networks perform really well. Neural
Networks are used in various datasets like images, audio, and text. Different types of Neural
Networks are used for different purposes, for example for predicting the sequence of words we
use Recurrent Neural Networks more precisely an LSTM, similarly for image classification we
use Convolution Neural networks. In this blog, we are going to build a basic building block for
CNN.
In a regular Neural Network there are three types of layers:
1. Input Layers: It’s the layer in which we give input to our model. The number of neurons in this
layer is equal to the total number of features in our data (number of pixels in the case of an
image).
2. Hidden Layer: The input from the Input layer is then fed into the hidden layer. There can be
many hidden layers depending upon our model and data size. Each hidden layer can have
different numbers of neurons which are generally greater than the number of features. The
output from each layer is computed by matrix multiplication of output of the previous layer with
learnable weights of that layer and then by the addition of learnable biases followed by
activation function which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function like sigmoid
or softmax which converts the output of each class into the probability score of each class.
The process of feeding the data into the model and obtaining the output from each layer as described above is called feedforward. We then calculate the error using an error function; some common error functions are cross-entropy, squared error loss, etc. The error function measures how well the network is performing. After that, we backpropagate through the model by calculating the derivatives. This step is called backpropagation, and it is basically used to minimize the loss.
Convolution Neural Network
Convolutional Neural Network (CNN) is the extended version of artificial neural networks
(ANN) which is predominantly used to extract the feature from the grid-like matrix dataset. For
example visual datasets like images or videos where data patterns play an extensive role.

CNN architecture

Convolutional Neural Network consists of multiple layers like the input layer, Convolutional layer,
Pooling layer, and fully connected layers.
Simple CNN architecture

The Convolutional layer applies filters to the input image to extract features, the Pooling layer
downsamples the image to reduce computation, and the fully connected layer makes the final
prediction. The network learns the optimal filters through backpropagation and gradient descent.

How Convolutional Layers works

Convolution Neural Networks, or convnets, are neural networks that share their parameters. Imagine
you have an image. It can be represented as a cuboid having its length, width (dimension of the
image), and height (i.e the channel as images generally have red, green, and blue channels).

Now imagine taking a small patch of this image and running a small neural network, called a filter
or kernel on it, with say, K outputs and representing them vertically. Now slide that neural network
across the whole image, as a result, we will get another image with different widths, heights, and
depths. Instead of just R, G, and B channels now we have more channels but lesser width and
height. This operation is called Convolution. If the patch size is the same as that of the image it
will be a regular neural network. Because of this small patch, we have fewer weights.
Image source: Deep Learning Udacity

Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
 Convolution layers consist of a set of learnable filters (or kernels) having small widths and
heights and the same depth as that of input volume (3 if the input layer is image input).
 For example, if we have to run convolution on an image with dimensions 34x34x3, the possible size of the filters can be a x a x 3, where ‘a’ can be anything like 3, 5, or 7, but smaller than the image dimension.
 During the forward pass, we slide each filter across the whole input volume step by step where
each step is called stride (which can have a value of 2, 3, or even 4 for high-dimensional
images) and compute the dot product between the kernel weights and patch from input volume.
 As we slide our filters we’ll get a 2-D output for each filter and we’ll stack them together as a
result, we’ll get output volume having a depth equal to the number of filters. The network will
learn all the filters.
Layers used to build ConvNets
A complete Convolutional Neural Network architecture is also known as a convnet. A convnet is a sequence of layers, and every layer transforms one volume into another through a differentiable function.
Types of layers:
Let’s take an example by running a convnet on an image of dimension 32 x 32 x 3.
 Input Layers: It’s the layer in which we give input to our model. In CNN, Generally, the input
will be an image or a sequence of images. This layer holds the raw input of the image with
width 32, height 32, and depth 3.
 Convolutional Layers: This is the layer which is used to extract features from the input dataset. It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are smaller matrices, usually of 2×2, 3×3, or 5×5 shape. Each kernel slides over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as a feature map. Suppose we use a total of 12 filters for this layer; we’ll get an output volume of dimension 32 x 32 x 12.
 Activation Layer: By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. It applies an element-wise activation function to the output of the convolution layer. Some common activation functions are ReLU: max(0, x), Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output volume will have dimensions 32 x 32 x 12.
 Pooling layer: This layer is periodically inserted in the convnet, and its main function is to reduce the size of the volume, which makes the computation fast, reduces memory, and also prevents overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16x16x12.

Image source: cs231n.stanford.edu

 Flattening: The resulting feature maps are flattened into a one-dimensional vector after the convolution and pooling layers so they can be passed into a fully connected layer for classification or regression.
 Fully Connected Layers: It takes the input from the previous layer and computes the final
classification or regression task.
Image source: cs231n.stanford.edu

 Output Layer: The output from the fully connected layers is then fed into a logistic function for
classification tasks like sigmoid or softmax which converts the output of each class into the
probability score of each class.

Example:

Let’s consider an image and apply the convolution layer, activation layer, and pooling layer
operation to extract the inside feature.
Input image:
Input image

Steps (a minimal code sketch implementing these steps follows the list):
 import the necessary libraries
 set the parameter
 define the kernel
 Load the image and plot it.
 Reformat the image
 Apply convolution layer operation and plot the output image.
 Apply activation layer operation and plot the output image.
 Apply pooling layer operation and plot the output image.
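A minimal sketch of these steps using plain NumPy with a synthetic image and a hand-defined edge-detection kernel; the image, kernel values, and helper names are illustrative assumptions, and plotting with matplotlib is omitted here:

import numpy as np

# set the parameters and define the kernel (a simple vertical-edge detector)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

# "load" an image: here a synthetic 8x8 grayscale image with a bright square in the middle
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0

# convolution layer operation: slide the kernel over the image and take dot products
def conv2d(img, k):
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

feature_map = conv2d(image, kernel)

# activation layer operation: element-wise ReLU
activated = np.maximum(feature_map, 0)

# pooling layer operation: 2x2 max pooling with stride 2
def max_pool(img, size=2):
    h, w = img.shape[0] // size, img.shape[1] // size
    return img[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

pooled = max_pool(activated)
print(feature_map.shape, activated.shape, pooled.shape)   # (6, 6) (6, 6) (3, 3)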

Introduction to Recurrent Neural Network

In this article, we will introduce a new variation of neural network which is


the Recurrent Neural Network also known as (RNN) that works better than a simple
neural network when data is sequential like Time-Series data and text data.
What is Recurrent Neural Network (RNN)?
Recurrent Neural Network(RNN) is a type of Neural Network where the output from
the previous step is fed as input to the current step. In traditional neural networks, all
the inputs and outputs are independent of each other, but in cases when it is required
to predict the next word of a sentence, the previous words are required and hence
there is a need to remember the previous words. Thus RNN came into existence,
which solved this issue with the help of a Hidden Layer. The main and most important
feature of RNN is its Hidden state, which remembers some information about a
sequence. The state is also referred to as Memory State since it remembers the
previous input to the network. It uses the same parameters for each input as it
performs the same task on all the inputs or hidden layers to produce the output. This
reduces the complexity of parameters, unlike other neural networks.

Recurrent Neural Network

Architecture Of Recurrent Neural Network


RNNs have the same input and output architecture as any other deep neural
architecture. However, differences arise in the way information flows from input to
output. Unlike Deep neural networks where we have different weight matrices for each
Dense network, in an RNN the weights across the network remain the same. It calculates the hidden state hₜ for every input xₜ using the following formulas:
hₜ = σ(U·xₜ + W·hₜ₋₁ + b)
yₜ = O(V·hₜ + c)
Hence yₜ = f(xₜ, hₜ₋₁, W, U, V, b, c)
Here the state matrix S has element sᵢ as the state of the network at timestep i.
The parameters in the network are W, U, V, b, c, which are shared across timesteps.

How RNN works

The Recurrent Neural Network consists of multiple fixed activation function units, one
for each time step. Each unit has an internal state which is called the hidden state of
the unit. This hidden state signifies the past knowledge that the network currently
holds at a given time step. This hidden state is updated at every time step to signify
the change in the knowledge of the network about the past. The hidden state is
updated using the following recurrence relation:-
The formula for calculating the current state:
ht = f(ht-1, xt)
where:
ht -> current state
ht-1 -> previous state
xt -> input state
Formula for applying the activation function (tanh):
ht = tanh(whh·ht-1 + wxh·xt)
where:
whh -> weight at the recurrent neuron
wxh -> weight at the input neuron
The formula for calculating the output:
yt = why·ht
where:
yt -> output
why -> weight at the output layer
These parameters are updated using Backpropagation. However, since RNN works on
sequential data here we use an updated backpropagation which is known as
Backpropagation through time.
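As a concrete illustration of the recurrence above, here is a minimal NumPy sketch of a forward pass through an RNN; the layer sizes and variable names are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size, seq_len = 4, 3, 2, 5

W_xh = rng.normal(size=(hidden_size, input_size))    # weights at the input neuron
W_hh = rng.normal(size=(hidden_size, hidden_size))   # weights at the recurrent neuron
W_hy = rng.normal(size=(output_size, hidden_size))   # weights at the output layer

xs = rng.normal(size=(seq_len, input_size))          # a toy input sequence
h = np.zeros(hidden_size)                            # h0: the initial hidden (memory) state

for x_t in xs:
    h = np.tanh(W_hh @ h + W_xh @ x_t)               # update the hidden state at this time step
    y_t = W_hy @ h                                   # output at this time step
    print(y_t)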
Backpropagation Through Time (BPTT)
In an RNN the computation proceeds in an ordered fashion: each variable is computed one at a time in a specified order, first h1, then h2, then h3, and so on. Hence we apply backpropagation through all these hidden time states sequentially.

Backpropagation Through Time (BPTT) In RNN

L(θ) (the loss function) depends on h3,
h3 in turn depends on h2 and W,
h2 in turn depends on h1 and W,
h1 in turn depends on h0 and W,
where h0 is a constant starting state.

For simplicity of this equation, we will apply backpropagation to only one of these terms. We already know how to compute the first part, as it is the same as in any simple deep neural network backpropagation. However, we will see how to apply backpropagation to the term involving W.

As we know, h3 = σ(W·h2 + b).

In such an ordered network, we can't compute the derivative with respect to W by simply treating h3 as a constant, because h3 itself also depends on W. The total derivative has two parts:

1. Explicit: treating all other inputs as constant.

2. Implicit: summing over all indirect paths from h3 to W.
Let us see how to do this

This algorithm is called backpropagation through time (BPTT) as we backpropagate


over all previous time steps

Training through RNN

1. A single-time step of the input is provided to the network.


2. Then calculate its current state using a set of current input and the previous state.
3. The current ht becomes ht-1 for the next time step.
4. One can go as many time steps according to the problem and join the information
from all the previous states.
5. Once all the time steps are completed the final current state is used to calculate the
output.
6. The output is then compared to the actual output, i.e., the target output, and the error is generated.
7. The error is then back-propagated to the network to update the weights and hence
the network (RNN) is trained using Backpropagation through time.

Advantages of Recurrent Neural Network

1. An RNN remembers information through time, which makes it useful in time-series prediction because the prediction can depend on previous inputs as well as the current one. Long-range memory of this kind is what Long Short-Term Memory (LSTM) networks are designed to provide.
2. Recurrent neural networks are even used with convolutional layers to extend the
effective pixel neighborhood.
Disadvantages of Recurrent Neural Network
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences if using tanh or relu as an activation
function.
Applications of Recurrent Neural Network
1. Language Modelling and Generating Text
2. Speech Recognition
3. Machine Translation
4. Image Recognition, Face detection
5. Time series Forecasting
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the
network.
1. One to One
2. One to Many
3. Many to One
4. Many to Many

One to One

This type of RNN behaves the same as any simple neural network; it is also known as a Vanilla Neural Network. In this neural network, there is only one input and one output.

One to One RNN

One To Many

In this type of RNN, there is one input and many outputs associated with it. One of the
most used examples of this network is Image captioning where given an image we
predict a sentence having Multiple words.

One To Many RNN

Many to One

In this type of network, many inputs are fed to the network at several states of the network, generating only one output. This type of network is used in problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.
Many to One RNN

Many to Many

In this type of neural network, there are multiple inputs and multiple outputs
corresponding to a problem. One Example of this Problem will be language
translation. In language translation, we provide multiple words from one language as
input and predict multiple words from the second language as output.

Many to Many RNN

Variation Of Recurrent Neural Network (RNN)


To overcome problems like vanishing and exploding gradients, several new advanced versions of RNNs have been developed; some of these are:
1. Bidirectional Neural Network (BiNN)
2. Long Short-Term Memory (LSTM)
Bidirectional Neural Network (BiNN)
A BiNN is a variation of a Recurrent Neural Network in which the input information flows in both directions, and the outputs of both directions are combined to produce the final output. BiNNs are useful in situations where the context of the input is important, such as NLP tasks and time-series analysis problems.
Long Short-Term Memory (LSTM)
Long Short-Term Memory works on the read-write-and-forget principle: given the input information, the network reads and writes the most useful information from the data and forgets the information that is not important for predicting the output. To do this, three new gates are introduced into the RNN. In this way, only the selected information is passed through the network.
Difference between RNN and Simple Neural Network
An RNN is considered to be the better choice over a simple deep neural network when the data is sequential. The significant differences between the two are listed below:

 Weights: In a Recurrent Neural Network, the same weights are shared across all time steps; in a simple deep neural network, the weights are different for each layer of the network.
 Sequential data: Recurrent Neural Networks are used when the data is sequential and the number of inputs is not predefined; a simple deep neural network has no special mechanism for sequential data, and the number of inputs is fixed.
 Number of parameters: The number of parameters in an RNN is higher than in a simple DNN, which has comparatively fewer parameters.
 Gradients: Exploding and vanishing gradients are the major drawback of an RNN; these problems also occur in a DNN, but they are not its major problem.

Deep Belief Network (DBN) in Deep Learning
Introduction
Deep Belief Networks (DBNs) are a type of deep learning architecture
combining unsupervised learning principles and neural networks. They are
composed of layers of Restricted Boltzmann Machines (RBMs), which are
trained one at a time in an unsupervised manner. The output of one RBM is
used as the input to the next RBM, and the final output is used for
supervised learning tasks such as classification or regression.

Deep Belief Network


DBNs have been used in various applications, including image recognition, speech recognition, and natural language processing. They have been shown to achieve state-of-the-art results in many tasks and are one of the most powerful deep learning architectures currently available.

DBNs also differ from other deep learning algorithms, such as autoencoders and standalone restricted Boltzmann machines (RBMs), in how they handle their inputs. They operate on an input layer with one neuron for each input feature and pass the data through numerous layers before arriving at the final layer, where outputs are produced using probabilities acquired from the earlier layers.
Architecture of DBN
The basic structure of a DBN is composed of several layers of RBMs. A
probability distribution is learned over the input data by each RBM, which is
a generative model. While the successive layers of the DBN learn higher-
level features, the initial layer of the DBN learns the fundamental structure
of the data. For supervised learning tasks like classification or regression,
the DBN's last layer is used.

Each RBM in a DBN is trained independently using contrastive divergence,


which is an unsupervised learning method. The gradient of the log-likelihood
of the data for the RBM's parameters can be approximated using this
method. The output of one trained RBM is then used as the input for the
subsequent RBM, which is done by stacking the trained RBMs on top of one
another.

After the DBN has been trained, supervised learning tasks can be performed
on it by adjusting the weights of the final layer using a supervised learning
technique like backpropagation. This fine-tuning process can improve the
DBN's performance on the specific task it was trained for.

Evolving of DBN
The earliest generation of neural networks, called perceptrons, is extraordinarily strong. They can help us recognize an object in a picture or estimate how much we enjoy a certain cuisine based on our responses. But they are constrained: they frequently consider one piece of information at a time and find it hard to comprehend the context of what is going on around them.

Second-generation neural networks introduced backpropagation, a technique that compares the output received with the intended result and drives the error value down towards zero, so that each perceptron finally reaches its ideal state.

Directed acyclic graphs (DAGs), commonly referred to as belief networks,


are the next step and help with inference and learning issues. It gives us
more control than ever before over our data.

Finally, deep belief networks (DBNs) can be used to create fair values that
we can then store in leaf nodes, ensuring that no matter what occurs along
the process, we always have the proper solution at hand.
Working of DBN
To take pixel input signals directly, we must first train a layer of features that receives its input from the pixels. Then, treating the activations of these learned features as if they were pixels, we learn a second layer of features on top of them. Every new layer of features we add to the network raises the lower bound on the log-likelihood of the training data set.

The following describes the operating pipeline of the deep belief network (a small code sketch of this greedy, layer-by-layer idea follows the list):

 We begin by performing multiple Gibbs sampling iterations in the top two hidden layers. The top two hidden layers define the RBM; as a result, this stage effectively draws a sample from it.
 After that, run a single ancestral sampling pass through the rest of the model to create a
sample from the visible units
 We will employ a single bottom-up approach to determine the values of the latent
variables in each layer. Greedy pretraining starts with an observed data vector in the
lowest layer. It then adjusts the generative weights in the other direction.

Advantages of DBN
One of the main advantages of DBNs is their ability to learn features from the data in an unsupervised manner. This means they do not require labeled data, which can be difficult and time-consuming to obtain. DBNs can also learn a hierarchical representation of the data, with each layer learning increasingly sophisticated features. This hierarchical representation can be quite helpful for applications like image recognition, where the earliest layers pick up fundamental details such as edges, while later layers learn more intricate properties such as shapes and objects.

Moreover, DBNs have proven to be resistant to overfitting, a major issue in deep learning. This is because the unsupervised pretraining of the RBMs acts as a form of regularisation for the model, and only a small amount of labelled data is needed during the fine-tuning phase, which minimises the risk of overfitting.

Other deep learning architectures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), can have their weights initialised by DBNs. This gives these architectures a solid set of initial weights to start from, which can increase their performance.
The capacity of DBNs to manage missing data is another benefit. In many real-world applications it happens frequently that some data is corrupted or absent. Traditional neural networks, which are designed to work with complete and correct data, may struggle to handle missing values. DBNs, however, can be trained with a method known as "dropout" to learn robust features that are unaffected by the presence of missing data.

DBNs can also be employed for generative tasks such as the creation of text and images. Thanks to the unsupervised pretraining of the RBMs, the DBN learns a probability distribution over the data, and new samples that resemble the training data can be produced. This is helpful in applications such as computer vision, where new images can be generated conditioned on labels or other attributes.

The issue of vanishing gradients is among the key difficulties in deep neural network training. As the network's layer count rises, the gradients used to update the weights during training may become very small, making it challenging to train the network efficiently. Because of the RBMs' unsupervised pretraining, DBNs can alleviate this issue. During pretraining, each RBM learns a representation of the data that is comparatively stable and does not alter dramatically with minor weight changes. This means that the gradients used to update the weights are significantly larger when the DBN is fine-tuned for a supervised task, improving training efficiency.

In addition to standard deep learning tasks, DBNs have been applied effectively in a variety of industries, including bioinformatics, drug development, and financial forecasting. In bioinformatics, DBNs have been employed to find patterns in gene expression data suggestive of disease, which can be utilised to create novel diagnostic tools. In drug discovery, DBNs have been used to find novel compounds with the potential to be turned into medicines. In the finance industry, DBNs have been used to predict stock prices and other financial variables.

Conclusion
In conclusion, DBNs are a powerful deep learning architecture that can be applied to a variety of tasks. They are made up of RBM layers that are trained in an unsupervised manner, with supervised learning applied to the final layer. DBNs are one of the most potent deep learning architectures currently available and have been demonstrated to produce state-of-the-art outcomes in a variety of tasks. They can learn features from the data without supervision, are resistant to overfitting, and can be used to initialise the weights of other deep-learning architectures.
UNIT-IV
PROBABILISTIC NEURAL NETWORK

Hopfield Neural Network

The Hopfield Neural Network, invented by Dr John J. Hopfield, consists of one layer of 'n' fully connected recurrent neurons. It is generally used for performing auto-association and optimization tasks. Its output is computed through a converging iterative process, which makes it behave differently from our normal feed-forward neural nets.
Discrete Hopfield Network
It is a fully interconnected neural network where each unit is connected to every other
unit. It behaves in a discrete manner, i.e. it gives finite distinct output, generally of two
types:
 Binary (0/1)
 Bipolar (-1/1)
The weights associated with this network are symmetric in nature (wij = wji) and there are no self-connections (wii = 0).

Structure & Architecture of Hopfield Network


 Each neuron has an inverting and a non-inverting output.
 Being fully connected, the output of each neuron is an input to all other neurons but not to itself.
The figure below shows a sample representation of a Discrete Hopfield Neural Network architecture having the following elements.
Discrete Hopfield Network Architecture

[ x1 , x2 , ... , xn ] -> Input to the n given neurons.
[ y1 , y2 , ... , yn ] -> Output obtained from the n given neurons.
wij -> weight associated with the connection between the i-th and the j-th neuron.
Training Algorithm
For storing a set of input patterns S(p) [p = 1 to P], where S(p) = S1(p) … Si(p) … Sn(p), the weight matrix is given by:
 For binary patterns: wij = Σp [2Si(p) − 1][2Sj(p) − 1], for i ≠ j
 For bipolar patterns: wij = Σp Si(p) Sj(p), for i ≠ j
(i.e. the weights here have no self-connection: wii = 0)


Steps involved in the training of a Hopfield Network are mapped below:
 Initialize the weights (wij) to store the patterns (using the training algorithm).
 For each input vector x, perform steps 3-7.
 Make the initial activations of the network equal to the external input vector x (yi = xi).
 For each unit yi, perform steps 5-7.
 Calculate the total input to unit i: yin,i = xi + Σj yj wji.
 Apply the activation over the total input to calculate the output:
yi = 1 if yin,i > θi; yi stays unchanged if yin,i = θi; yi = 0 if yin,i < θi
(where θi is the threshold, normally taken as 0).
 Now feed the obtained output yi back to all other units, so that the activation vector is updated.
 Test the network for convergence.
Consider the following problem. We are required to create a Discrete Hopfield Network in which the bipolar representation of the input vector, [1 1 1 -1] (or [1 1 1 0] in the binary representation), is stored. Test the Hopfield network with missing entries in the first and second components of the stored vector (i.e. [0 0 1 0]).
Given the input vector x = [1 1 1 -1] (bipolar), we initialize the weight matrix (wij = xi xj) as:
[  1  1  1 -1
   1  1  1 -1
   1  1  1 -1
  -1 -1 -1  1 ]
and the weight matrix with no self-connection (zero diagonal) is:
[  0  1  1 -1
   1  0  1 -1
   1  1  0 -1
  -1 -1 -1  0 ]
As per the question, the input vector x with missing entries is x = [0 0 1 0] ([x1 x2 x3 x4]) (binary). Set yi = x = [0 0 1 0] ([y1 y2 y3 y4]). Choose a unit yi (the order doesn't matter) for updating its activation and take the i-th column of the weight matrix for the calculation. For unit y1: yin,1 = x1 + Σj yj wj1 = 0 + (0 + 0 + 1 + 0) = 1 > 0, so y1 = 1 and the updated vector, obtained via feedback, is y = [1 0 1 0].
We repeat the same update for each remaining unit yi, each time using the vector updated via feedback, and then test whether the network has converged to the stored pattern.
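A minimal NumPy sketch of the recall procedure above, assuming the bipolar pattern [1, 1, 1, -1] and a probe in which the missing entries are set to 0; the number of update sweeps is an illustrative choice:

# Store one bipolar pattern with the Hebbian rule, then recover it from a partial probe.
import numpy as np

pattern = np.array([1, 1, 1, -1], dtype=float)
W = np.outer(pattern, pattern)
np.fill_diagonal(W, 0)                   # no self-connections (wii = 0)

x = np.array([0, 0, 1, 0], dtype=float)  # external input with missing entries set to 0
y = x.copy()                             # initial activations equal the external input
for _ in range(5):                       # a few asynchronous update sweeps
    for i in range(len(y)):
        net = x[i] + W[i] @ y            # y_in,i = x_i + sum_j y_j * w_ji (W is symmetric)
        if net > 0:
            y[i] = 1
        elif net < 0:
            y[i] = -1                    # if net == 0, the unit keeps its previous value
print(y)                                 # converges to the stored pattern [1, 1, 1, -1]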

Continuous Hopfield Network


Unlike the discrete Hopfield network, here the time parameter is treated as a continuous variable. So, instead of binary/bipolar outputs, we obtain values that lie between 0 and 1. It can be used to solve constrained optimization and associative memory problems. The output is defined as vi = g(ui), where g is a continuous sigmoid (squashing) function and:
 vi = output of the continuous Hopfield network
 ui = internal activity of a node in the continuous Hopfield network.
Energy Function
The Hopfield networks have an energy function associated with them. It either
diminishes or remains unchanged on update (feedback) after every iteration. The
energy function for a continuous Hopfield network is defined as:

To determine if the network will converge to a stable configuration, we see if the


energy function reaches its minimum by:

The network is bound to converge if the activity of each neuron wrt time is given by the
following differential equation:

Types of Boltzmann Machines

Deep Learning models are broadly classified into supervised and unsupervised
models.
Supervised DL models:
 Artificial Neural Networks (ANNs)
 Recurrent Neural Networks (RNNs)
 Convolutional Neural Networks (CNNs)
Unsupervised DL models:
 Self Organizing Maps (SOMs)
 Boltzmann Machines
 Autoencoders
Let us learn what exactly Boltzmann machines are and how they work, and also look at how a recommender system based on them can predict whether a user will like a movie or not, based on the movies watched previously.

A Boltzmann Machine is an unsupervised DL model in which every node is connected to every other node. That is, unlike ANNs, CNNs, RNNs and SOMs, Boltzmann Machines are undirected (the connections are bidirectional). A Boltzmann Machine is not a deterministic DL model but a stochastic or generative DL model; it is rather a representation of a certain system. There are two types of nodes in the Boltzmann Machine — visible nodes, those which we can and do measure, and hidden nodes, those which we cannot or do not measure. Although the node types are different, the Boltzmann machine considers them as part of one single system and everything works together. The training data is fed into the Boltzmann Machine and the weights of the system are adjusted accordingly. Boltzmann machines help us understand abnormalities by learning how the system works in normal conditions.

Boltzmann Machine

Energy-Based Models:
Boltzmann Distribution is used in the sampling distribution of the Boltzmann
Machine. The Boltzmann distribution is governed by the equation –
Pi = e(-∈i/kT)/ ∑e(-∈j/kT)
Pi - probability of system being in state i
∈i - Energy of system in state i
T - Temperature of the system
k - Boltzmann constant
∑e(-∈j/kT) - Sum of values for all possible states of the system
The Boltzmann Distribution describes the different states of the system, and Boltzmann machines create different states of the machine using this distribution. From the above equation, as the energy of the system increases, the probability of the system being in state ‘i’ decreases. Thus, the system is most stable in its lowest energy state (a gas is most stable when it spreads out). Here, in Boltzmann machines, the energy of the system is defined in terms of the weights of the synapses. Once the system is trained and the weights are set, the system always tries to find the lowest energy state for itself by adjusting the weights.
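A tiny Python sketch of the Boltzmann distribution above; the energy values and temperatures are arbitrary illustrative numbers:

# Lower-energy states get higher probability; a higher temperature flattens the distribution.
import numpy as np

def boltzmann_probs(energies, kT=1.0):
    # P_i = e^(-E_i / kT) / sum_j e^(-E_j / kT)
    weights = np.exp(-np.asarray(energies) / kT)
    return weights / weights.sum()

energies = [0.5, 1.0, 2.0, 4.0]
print(boltzmann_probs(energies, kT=1.0))    # the lowest-energy state is the most probable
print(boltzmann_probs(energies, kT=10.0))   # higher temperature -> closer to uniform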
Types of Boltzmann Machines:
 Restricted Boltzmann Machines (RBMs)
 Deep Belief Networks (DBNs)
 Deep Boltzmann Machines (DBMs)
Restricted Boltzmann Machines (RBMs):
In a full Boltzmann machine, each node is connected to every other node and hence
the connections grow exponentially. This is the reason we use RBMs. The
restrictions in the node connections in RBMs are as follows –
 Hidden nodes cannot be connected to one another.
 Visible nodes are not connected to one another.
Energy function example for a Restricted Boltzmann Machine –
E(v, h) = -∑ aivi - ∑ bjhj - ∑∑ viwi,jhj
a, b - biases in the system (constants)
vi, hj - visible node, hidden node
P(v, h) - probability of being in a certain state:
P(v, h) = e(-E(v, h))/Z
Z - sum of values for all possible states (the partition function)
Suppose that we are using our RBM to build a recommender system that works on six (6) movies. The RBM learns how to allocate the hidden nodes to certain features. Through the process of Contrastive Divergence, we make the RBM fit our particular set of movies, i.e. our case or scenario. The RBM identifies which features are important through the training process. The training data is either 0, 1 or missing, depending on whether a user liked that movie (1), disliked that movie (0) or did not watch the movie (missing data). The RBM automatically identifies the important features.
Contrastive Divergence:
The RBM adjusts its weights by this method. Using some randomly assigned initial weights, the RBM calculates the hidden nodes, which in turn use the same weights to reconstruct the input nodes. Each hidden node is constructed from all the visible nodes and each visible node is reconstructed from all the hidden nodes; hence the reconstructed input differs from the original input even though the weights are the same. The process continues until the reconstructed input matches the previous input, at which point the process is said to have converged. This entire alternating procedure is known as Gibbs Sampling.
Gibbs Sampling
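A minimal NumPy sketch of one contrastive-divergence (CD-1) update for an RBM with binary visible and hidden units; the layer sizes, learning rate and the random training batch are illustrative assumptions rather than part of the text above:

# One CD-1 step: positive phase from the data, one Gibbs step for the negative phase.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
a = np.zeros(n_visible)                  # visible biases
b = np.zeros(n_hidden)                   # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = rng.integers(0, 2, size=(8, n_visible)).astype(float)   # a batch of binary inputs

# Positive phase: hidden probabilities and samples driven by the data.
ph0 = sigmoid(v0 @ W + b)
h0 = (rng.random(ph0.shape) < ph0).astype(float)

# Negative phase: one Gibbs step -> reconstruct the visibles, then recompute the hiddens.
pv1 = sigmoid(h0 @ W.T + a)
v1 = (rng.random(pv1.shape) < pv1).astype(float)
ph1 = sigmoid(v1 @ W + b)

# Update: difference between data-driven and reconstruction-driven statistics.
W += lr * (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]
a += lr * (v0 - v1).mean(axis=0)
b += lr * (ph0 - ph1).mean(axis=0)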

The Gradient Formula gives the gradient of the log probability of the certain state of
the system with respect to the weights of the system. It is given as follows –
d/dwij(log(P(v0))) = <vi0 * hj0> - <vi∞ * hj∞>
v - visible state, h- hidden state
<vi0 * hj0> - initial state of the system
<vi∞ * hj∞> - final state of the system
P(v0) - probability that the system is in state v0
wij - weights of the system
The above equation tells us how a change in the weights of the system will change the log probability of the system being in a particular state. The system tries to end up in the lowest possible energy state (the most stable one). Instead of continuing the weight-adjustment process until the reconstructed input matches the previous one, we can also consider only the first few steps. This is sufficient to understand how to adjust the energy curve so as to reach the lowest energy state. Therefore, we adjust the weights and redesign the system and its energy curve such that we get the lowest energy for the current position. This is known as Hinton's shortcut.
Hinton’s Shortcut

Working of RBM – Illustrative Example –


Consider – Mary watches four movies out of the six available movies and rates four of
them. Say, she watched m1, m3, m4 and m5 and likes m3, m5 (rated 1) and dislikes the
other two, that is m1, m4 (rated 0) whereas the other two movies – m2, m6 are unrated.
Now, using our RBM, we will recommend one of these movies for her to watch next.
Say –
 m3, m5 are of ‘Drama’ genre.
 m1, m4 are of ‘Action’ genre.
 ‘Dicaprio’ played a role in m5.
 m3, m5 have won ‘Oscar.’
 ‘Tarantino’ directed m4.
 m2 is of the ‘Action’ genre.
 m6 is of both the genres ‘Action’ and ‘Drama’, ‘Dicaprio’ acted in it and it has won an
‘Oscar’.
We have the following observations –
 Mary likes m3, m5 and they are of the ‘Drama’ genre, so she probably likes ‘Drama’ movies.
 Mary dislikes m1, m4 and they are of the ‘Action’ genre, so she probably dislikes ‘Action’ movies.
 Mary likes m3, m5 and they have won an ‘Oscar’, so she probably likes ‘Oscar’-winning movies.
 Since ‘Dicaprio’ acted in m5 and Mary likes it, she will probably like a movie in which ‘Dicaprio’ acted.
 Mary does not like m4, which is directed by ‘Tarantino’, so she probably dislikes any movie directed by ‘Tarantino’.
Therefore, based on these observations and the details of m2 and m6, our RBM recommends m6 to Mary (‘Drama’, ‘Dicaprio’ and ‘Oscar’ match both Mary’s interests and m6). This is how an RBM works and hence is used in recommender systems.

Working of RBM

Thus, RBMs are used to build Recommender Systems.


Deep Belief Networks (DBNs):
Suppose we stack several RBMs on top of each other so that the first RBM outputs
are the input to the second RBM and so on. Such networks are known as Deep Belief
Networks. The connections within each layer are undirected (since each layer is an
RBM). Simultaneously, those in between the layers are directed (except the top two
layers – the connection between the top two layers is undirected). There are two ways
to train the DBNs-
1. Greedy Layer-wise Training Algorithm – The RBMs are trained layer by layer.
Once the individual RBMs are trained (that is, the parameters – weights, biases are
set), the direction is set up between the DBN layers.
2. Wake-Sleep Algorithm – The DBN is trained all the way up (connections going up
– wake) and then down the network (connections going down — sleep).
Therefore, we stack the RBMs, train them, and once we have the parameters trained,
we make sure that the connections between the layers only work downwards (except
for the top two layers).
Deep Boltzmann Machines (DBMs):
DBMs are similar to DBNs except that apart from the connections within layers, the
connections between the layers are also undirected (unlike DBN in which the
connections between layers are directed). DBMs can extract more complex or
sophisticated features and hence can be used for more complex tasks.


Restricted Boltzmann Machine


Introduction :
Restricted Boltzmann Machine (RBM) is a type of artificial neural network that is used
for unsupervised learning. It is a type of generative model that is capable of learning a
probability distribution over a set of input data.
RBM was popularized in the mid-2000s by Hinton and Salakhutdinov as a way to address the problem of unsupervised learning. It is a type of neural network that
consists of two layers of neurons – a visible layer and a hidden layer. The visible layer
represents the input data, while the hidden layer represents a set of features that are
learned by the network.
The RBM is called “restricted” because the connections between the neurons in the
same layer are not allowed. In other words, each neuron in the visible layer is only
connected to neurons in the hidden layer, and vice versa. This allows the RBM to
learn a compressed representation of the input data by reducing the dimensionality of
the input.
The RBM is trained using a process called contrastive divergence, which is a variant
of the stochastic gradient descent algorithm. During training, the network adjusts the
weights of the connections between the neurons in order to maximize the likelihood of
the training data. Once the RBM is trained, it can be used to generate new samples
from the learned probability distribution.
RBM has found applications in a wide range of fields, including computer vision,
natural language processing, and speech recognition. It has also been used in
combination with other neural network architectures, such as deep belief networks and
deep neural networks, to improve their performance.
What are Boltzmann Machines?
A Boltzmann machine is a network of neurons in which all the neurons are connected to each other. It has two layers, named the visible (or input) layer and the hidden layer; the visible layer is denoted as v and the hidden layer as h, and there is no output layer. Boltzmann machines are stochastic and generative neural networks capable of learning internal representations and able to represent and (given enough time) solve tough combinatoric problems.
The Boltzmann distribution (also known as the Gibbs distribution) is an integral part of statistical mechanics and explains the impact of parameters like entropy and temperature on the quantum states in thermodynamics. Because of this, Boltzmann machines are also known as Energy-Based Models (EBM). The Boltzmann machine was invented in 1985 by Geoffrey Hinton, then a Professor at Carnegie Mellon University, and Terry Sejnowski, then a Professor at Johns Hopkins University.

What are Restricted Boltzmann Machines (RBM)?


The term 'restricted' refers to the fact that we are not allowed to connect neurons of the same type of layer to each other. In other words, two neurons of the input layer, or two neurons of the hidden layer, cannot be connected to each other, although the hidden layer and the visible layer can be connected to each other.
Since the machine has no output layer, the question arises of how we are going to identify and adjust the weights, and how we are going to measure whether our prediction is accurate or not. All of these questions have one answer: the Restricted Boltzmann Machine.
The RBM algorithm was proposed by Geoffrey Hinton (2007), which learns probability
distribution over its sample training data inputs. It has seen wide applications in
different areas of supervised/unsupervised machine learning such as feature learning,
dimensionality reduction, classification, collaborative filtering, and topic modeling.
Consider the example movie rating discussed in the recommender system section.
Movies like Avengers, Avatar, and Interstellar have strong associations with the latest
fantasy and science fiction factor. Based on the user rating RBM will discover latent
factors that can explain the activation of movie choices. In short, RBM describes
variability among correlated variables of input dataset in terms of a potentially lower
number of unobserved variables.
The energy function is given by E(v, h) = -∑ aivi - ∑ bjhj - ∑∑ viwi,jhj, as introduced above.

Applications of Restricted Boltzmann Machine


Restricted Boltzmann Machines (RBMs) have found numerous applications in various
fields, some of which are:
 Collaborative filtering: RBMs are widely used in collaborative filtering for
recommender systems. They learn to predict user preferences based on their past
behavior and recommend items that are likely to be of interest to the user.
 Image and video processing: RBMs can be used for image and video processing
tasks such as object recognition, image denoising, and image reconstruction. They
can also be used for tasks such as video segmentation and tracking.
 Natural language processing: RBMs can be used for natural language processing
tasks such as language modeling, text classification, and sentiment analysis. They
can also be used for tasks such as speech recognition and speech synthesis.
 Bioinformatics: RBMs have found applications in bioinformatics for tasks such as
protein structure prediction, gene expression analysis, and drug discovery.
 Financial modeling: RBMs can be used for financial modeling tasks such as
predicting stock prices, risk analysis, and portfolio optimization.
 Anomaly detection: RBMs can be used for anomaly detection tasks such as fraud
detection in financial transactions, network intrusion detection, and medical
diagnosis.
 It is used in Filtering.
 It is used in Feature Learning.
 It is used in Classification.
 It is used in Risk Detection.
 It is used in Business and Economic analysis.

How do Restricted Boltzmann Machines work?


In an RBM there are two phases through which the entire RBM works:
1st Phase: In this phase, we take the input layer and, using the weights and biases, activate the hidden layer. This process is called the Feed Forward Pass. In the Feed Forward Pass we identify the positive and negative associations.
Feed Forward Equation:
 Positive Association — when the association between the visible unit and the hidden unit is positive.
 Negative Association — when the association between the visible unit and the hidden unit is negative.
2nd Phase: Since we don’t have any output layer, instead of calculating an output we reconstruct the input layer through the activated hidden state. This process is called the Feed Backward Pass: we backtrack from the activated hidden neurons to the input layer. After performing this we have a reconstructed input obtained through the activated hidden state, so we can calculate the error and adjust the weights as follows:
Feed Backward Equation:
 Error = Reconstructed Input Layer − Actual Input Layer
 Adjust Weight = Input * Error * Learning Rate (0.1)
After performing all these steps, we obtain the pattern that is responsible for activating the hidden neurons. To understand how it works, consider an example in which visible unit V1 activates hidden units h1 and h2, and visible unit V2 activates hidden units h2 and h3. Now, when a new visible unit, say V5, comes into the machine and it also activates h1 and h2, we can easily backtrack through the hidden units and identify that the characteristics of the new V5 neuron match those of V1, because V1 activated the same hidden units earlier.

Restricted Boltzmann Machines

Types of RBM :
There are mainly two types of Restricted Boltzmann Machine (RBM) based on the
types of variables they use:
1. Binary RBM: In a binary RBM, the input and hidden units are binary variables.
Binary RBMs are often used in modeling binary data such as images or text.
2. Gaussian RBM: In a Gaussian RBM, the input and hidden units are continuous
variables that follow a Gaussian distribution. Gaussian RBMs are often used in
modeling continuous data such as audio signals or sensor data.
Apart from these two types, there are also variations of RBMs such as:
1. Deep Belief Network (DBN): A DBN is a type of generative model that consists of
multiple layers of RBMs. DBNs are often used in modeling high-dimensional data
such as images or videos.
2. Convolutional RBM (CRBM): A CRBM is a type of RBM that is designed
specifically for processing images or other grid-like structures. In a CRBM, the
connections between the input and hidden units are local and shared, which makes
it possible to capture spatial relationships between the input units.
3. Temporal RBM (TRBM): A TRBM is a type of RBM that is designed for processing
temporal data such as time series or video frames. In a TRBM, the hidden units are
connected across time steps, which allows the network to model temporal
dependencies in the data.
Activation Functions (Sigmoid Function) in Neural
Networks
In the process of building a neural network, one of the choices you get to make is
what Activation Function to use in the hidden layer as well as at the output layer of the
network. This article discusses some of the choices.
Elements of a Neural Network
Input Layer: This layer accepts input features. It provides information from the outside
world to the network, no computation is performed at this layer, nodes here just pass
on the information(features) to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world, they are part of
the abstraction provided by any neural network. The hidden layer performs all sorts of
computation on the features entered through the input layer and transfers the result to
the output layer.
Output Layer: This layer brings the information learned by the network to the outer world.
What is an activation function and why use them?
The activation function decides whether a neuron should be activated or not by calculating the
weighted sum and further adding bias to it. The purpose of the activation function is to introduce
non-linearity into the output of a neuron.
Explanation: We know, the neural network has neurons that work in correspondence
with weight, bias, and their respective activation function. In a neural network, we would update
the weights and biases of the neurons on the basis of the error at the output. This process is
known as back-propagation. Activation functions make the back-propagation possible since the
gradients are supplied along with the error to update the weights and biases.
Why do we need a non-linear activation function?
A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.
Mathematical proof
Suppose we have a Neural net like this :-
Elements of the diagram are as follows:
Hidden layer i.e. layer 1:
z(1) = W(1)X + b(1)
a(1) = z(1)
Here,
 z(1) is the vectorized output of layer 1
 W(1) be the vectorized weights assigned to neurons of hidden layer i.e. w1, w2, w3
and w4
 X be the vectorized input features i.e. i1 and i2
 b is the vectorized bias assigned to neurons in hidden layer i.e. b1 and b2
 a(1) is the vectorized form of any linear function.
(Note: We are not considering activation function here)

Layer 2 i.e. output layer :-


Note : Input for layer 2 is output from layer 1
z(2) = W(2)a(1) + b(2)
a(2) = z(2)
Calculation at Output layer
z(2) = (W(2) * [W(1)X + b(1)]) + b(2)
z(2) = [W(2) * W(1)] * X + [W(2)*b(1) + b(2)]
Let,
[W(2) * W(1)] = W
[W(2)*b(1) + b(2)] = b
Final output : z(2) = W*X + b
which is again a linear function
The result is again a linear function, even after applying a hidden layer. Hence we can conclude that no matter how many hidden layers we attach to the neural net, all layers will behave the same way, because the composition of two linear functions is itself a linear function. A neuron cannot learn with just a linear function attached to it; a non-linear activation function lets it learn according to the difference with respect to the error. Hence we need a non-linear activation function.
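The argument above can also be checked numerically. A quick NumPy sketch, with randomly chosen weights and biases as assumptions:

# Two stacked linear layers collapse into a single linear layer.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(2)                                    # input features i1, i2
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)  # hidden layer
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal(1)  # output layer

two_layers = W2 @ (W1 @ X + b1) + b2          # z(2) = W(2)[W(1)X + b(1)] + b(2)
W, b = W2 @ W1, W2 @ b1 + b2                  # the collapsed single layer
one_layer = W @ X + b

print(np.allclose(two_layers, one_layer))     # True: still just a linear function of X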
Variants of Activation Function
Linear Function
 Equation : Linear function has the equation similar to as of a straight line i.e. y = x
 No matter how many layers we have, if all are linear in nature, the final activation
function of last layer is nothing but just a linear function of the input of first layer.
 Range : -inf to +inf
 Uses : Linear activation function is used at just one place i.e. output layer.
 Issues : If we differentiate the linear function, the result no longer depends on the input “x” and the gradient becomes a constant, so it won’t introduce any useful non-linear behavior to our algorithm.
For example : Calculating the price of a house is a regression problem. The house price may have any large or small value, so we can apply a linear activation at the output layer. Even in this case the neural net must still have some non-linear activation function in the hidden layers.
Sigmoid Function

 It is a function which is plotted as ‘S’ shaped graph.


 Equation : A = 1/(1 + e^-x)
 Nature : Non-linear. Notice that for X values between -2 and 2, the Y values are very steep: small changes in x bring about large changes in the value of Y.
 Value Range : 0 to 1
 Uses : Usually used in the output layer of a binary classifier, where the result is either 0 or 1. Since the value of the sigmoid function lies between 0 and 1 only, the result can easily be predicted to be 1 if the value is greater than 0.5 and 0 otherwise.
Tanh Function

 The activation that works almost always better than the sigmoid function is the Tanh function, also known as the Tangent Hyperbolic function. It is actually a mathematically shifted version of the sigmoid function; both are similar and can be derived from each other.
 Equation :- A = tanh(x) = 2/(1 + e^-2x) − 1
 Value Range :- -1 to +1
 Nature :- non-linear
 Uses :- Usually used in hidden layers of a neural network, as its values lie between -1 and 1; hence the mean of the hidden-layer activations comes out to be 0 or very close to it, which helps in centering the data by bringing the mean close to 0. This makes learning for the next layer much easier.
RELU Function

 It stands for Rectified Linear Unit. It is the most widely used activation function, chiefly implemented in the hidden layers of a neural network.
 Equation :- A(x) = max(0, x). It gives an output of x if x is positive and 0 otherwise.
 Value Range :- [0, inf)
 Nature :- non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.
 Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute. In simple terms, ReLU learns much faster than the sigmoid and tanh functions.
Softmax Function

The softmax function is also a type of sigmoid function but is handy when we are trying to handle multi-class classification problems.
 Nature :- non-linear
 Uses :- Usually used when trying to handle multiple classes. The softmax function is commonly found in the output layer of image classification problems. It squeezes the output for each class to a value between 0 and 1 and also divides by the sum of the outputs.
 Output :- The softmax function is ideally used in the output layer of the classifier, where we are actually trying to obtain the probabilities that define the class of each input.
 The basic rule of thumb is: if you really don’t know what activation function to use, simply use ReLU, as it is a general-purpose activation function for hidden layers and is used in most cases these days.
 If your output is for binary classification, then the sigmoid function is a very natural choice for the output layer.
 If your output is for multi-class classification, then Softmax is very useful for predicting the probabilities of each class.
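For reference, a minimal NumPy sketch of the activation functions discussed above (the test values are arbitrary):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                    # zero-centred, range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)            # max(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))            # subtract the max for numerical stability
    return e / e.sum()                   # outputs are positive and sum to 1

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z), sep="\n")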
Deep Autoencoders
A deep autoencoder is composed of two symmetrical deep-belief networks: four or five shallow layers representing the encoding half of the net, and a second set of four or five layers that make up the decoding half.

The layers are restricted Boltzmann machines, the building blocks of deep-belief
networks, with several peculiarities that we’ll discuss below. Here’s a simplified schema
of a deep autoencoder’s structure, which we’ll explain below.

Processing the benchmark dataset MNIST, a deep autoencoder would use binary
transformations after each RBM. Deep autoencoders can also be used for other types
of datasets with real-valued data, on which you would use Gaussian rectified
transformations for the RBMs instead.

Encoding Input Data

Let’s sketch out an example encoder:

784 (input) ----> 1000 ----> 500 ----> 250 ----> 100 -----> 30

If, say, the input fed to the network is 784 pixels (the square of the 28x28 pixel images
in the MNIST dataset), then the first layer of the deep autoencoder should have 1000
parameters; i.e. slightly larger.
This may seem counterintuitive, because having more parameters than input is a good
way to overfit a neural network.

In this case, expanding the parameters, and in a sense expanding the features of the
input itself, will make the eventual decoding of the autoencoded data possible.

This is due to the representational capacity of sigmoid-belief units, a form of transformation used with each layer. Sigmoid belief units can't represent as much information and variance as real-valued data, and the expanded first layer is a way of compensating for that.

The layers will be 1000, 500, 250 and 100 nodes wide, respectively, until the end, where the net produces a vector 30 numbers long. This 30-number vector is the last layer of the first half of the deep autoencoder, the pretraining half, and it is the product of a normal RBM, rather than a classification output layer such as Softmax or logistic regression, as you would normally see at the end of a deep-belief network.

Decoding Representations

Those 30 numbers are an encoded version of the 28x28 pixel image. The second half of
a deep autoencoder actually learns how to decode the condensed vector, which
becomes the input as it makes its way back.

The decoding half of a deep autoencoder is a feed-forward net with layers 100, 250,
500 and 1000 nodes wide, respectively. Layer weights are initialized randomly.

784 (output) <---- 1000 <---- 500 <---- 250 <---- 100 <---- 30

The decoding half of a deep autoencoder is the part that learns to reconstruct the
image. It does so with a second feed-forward net which also conducts back
propagation. The back propagation happens through reconstruction entropy.
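The encoder/decoder shapes described above can be sketched in a few lines. The following is a minimal, illustrative tf.keras sketch; the use of sigmoid units throughout and the binary cross-entropy reconstruction loss are assumptions consistent with the description, not a prescription:

# A 784 -> 1000 -> 500 -> 250 -> 100 -> 30 encoder and its mirror-image decoder.
import tensorflow as tf
from tensorflow.keras import layers, models

encoder = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(1000, activation="sigmoid"),
    layers.Dense(500, activation="sigmoid"),
    layers.Dense(250, activation="sigmoid"),
    layers.Dense(100, activation="sigmoid"),
    layers.Dense(30),                             # the 30-number code
])
decoder = models.Sequential([
    layers.Input(shape=(30,)),
    layers.Dense(100, activation="sigmoid"),
    layers.Dense(250, activation="sigmoid"),
    layers.Dense(500, activation="sigmoid"),
    layers.Dense(1000, activation="sigmoid"),
    layers.Dense(784, activation="sigmoid"),      # reconstructed 28x28 image, flattened
])
autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")   # reconstruction entropy
autoencoder.summary()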

Training Nuances

At the stage of the decoder’s backpropagation, the learning rate should be lowered, or
made slower: somewhere between 1e-3 and 1e-6, depending on whether you’re
handling binary or continuous data, respectively.
Use Cases

Image Search

As we mentioned above, deep autoencoders are capable of compressing images into


30-number vectors.
Image search, therefore, becomes a matter of uploading an image, which the search
engine will then compress to 30 numbers, and compare that vector to all the others in its
index.

Vectors containing similar numbers will be returned for the search query, and translated
into their matching image.

Data Compression

A more general case of image compression is data compression. Deep autoencoders


are useful for semantic hashing, as discussed in this paper by Geoff Hinton.

Topic Modeling & Information Retrieval (IR)

Deep autoencoders are useful in topic modeling, or statistically modeling abstract topics
that are distributed across a collection of documents.

This, in turn, is an important step in question-answer systems like Watson.

In brief, each document in a collection is converted to a Bag-of-Words (i.e. a set of word


counts) and those word counts are scaled to decimals between 0 and 1, which may be
thought of as the probability of a word occurring in the doc.

The scaled word counts are then fed into a deep-belief network, a stack of restricted
Boltzmann machines, which themselves are just a subset of feedforward-backprop
autoencoders. Those deep-belief networks, or DBNs, compress each document to a set
of 10 numbers through a series of sigmoid transforms that map it onto the feature
space.

Each document’s number set, or vector, is then introduced to the same vector space,
and its distance from every other document-vector measured. Roughly speaking,
nearby document-vectors fall under the same topic.

For example, one document could be the “question” and others could be the “answers,”
a match the software would make using vector-space measurements.
UNIT-V APPLICATIONS
Object Detection vs Object Recognition vs Image
Segmentation

Object Recognition:

Object recognition is the technique of identifying the object present in images and
videos. It is one of the most important applications of machine learning and deep
learning. The goal of this field is to teach machines to understand (recognize) the
content of an image just like humans do.

Object Recognition

Object Recognition Using Machine Learning


 HOG (Histogram of oriented Gradients) feature Extractor and SVM (Support
Vector Machine) model: Before the era of deep learning, it was a state-of-the-art
method for object detection. It takes histogram descriptors of both positive (images that contain objects) and negative (images that do not contain objects) samples and trains an SVM model on them.
 Bag of features model: Just like bag of words considers document as an orderless
collection of words, this approach also represents an image as an orderless
collection of image features. Examples of this are SIFT, MSER, etc.
 Viola-Jones algorithm: This algorithm is widely used for face detection in the
image or real-time. It performs Haar-like feature extraction from the image. This
generates a large number of features. These features are then passed into a
boosting classifier. This generates a cascade of the boosted classifier to perform
image detection. An image needs to pass to each of the classifiers to generate a
positive (face found) result. The advantage of Viola-Jones is that it has a detection
time of 2 fps which can be used in a real-time face recognition system.
Object Recognition Using Deep Learning
The Convolutional Neural Network (CNN) is one of the most popular ways of doing object recognition. It is widely used, and most state-of-the-art neural networks use this method for various object-recognition-related tasks such as image classification. A CNN takes an image as input and outputs the probability of the different classes: if an object is present in the image, its output probability is high, while the output probabilities of the remaining classes are negligible or low. The advantage of deep learning is that we don't need to do manual feature extraction from the data, as is required in classical machine learning.
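A minimal tf.keras sketch of the CNN-based recognition idea above: an image goes in and a probability for each class comes out. The input size, layer widths and the ten-class setup are illustrative assumptions:

# A small CNN classifier that outputs one probability per class via softmax.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # probability for each of 10 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()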

Challenges of Object Recognition:


 The output generated by the last (fully connected) layer of the CNN model is a single class label, so a simple CNN approach will not work if more than one class label is present in the image.
 If we want to localize the presence of an object with a bounding box, we need to try a different approach that outputs not only the class label but also the bounding box location.
Overview of tasks related to Object Recognition

Image Classification :
In Image classification, it takes an image as an input and outputs the classification
label of that image with some metric (probability, loss, accuracy, etc). For Example: An
image of a cat can be classified as a class label “cat” or an image of Dog can be
classified as a class label “dog” with some probability.

Image Classification

Object Localization: This algorithm locates the presence of an object in the image
and represents it with a bounding box. It takes an image as input and outputs the
location of the bounding box in the form of (position, height, and width).
Object Detection:
Object Detection algorithms act as a combination of image classification and object
localization. It takes an image as input and produces one or more bounding boxes
with the class label attached to each bounding box. These algorithms are capable
enough to deal with multi-class classification and localization as well as to deal with
the objects with multiple occurrences.
Challenges of Object Detection:
 In object detection, the bounding boxes are always rectangular, so they do not help in determining the shape of an object if the object has curved parts.
 Object detection cannot accurately estimate some measurements, such as the area or perimeter of an object, from the image.

Difference between classification. Localization and Detection (Source: Link)

Image Segmentation:
Image segmentation is a further extension of object detection in which we mark the presence of an object through pixel-wise masks generated for each object in the image. This technique is more granular than bounding-box generation because it helps us determine the shape of each object present in the image: instead of drawing bounding boxes, segmentation figures out the pixels that make up each object. This granularity helps us in various fields such as medical image processing, satellite imaging, etc. Many image segmentation approaches have been proposed recently; one of the most popular is Mask R-CNN, proposed by K. He et al. in 2017.
Object Detection vs Segmentation (Source: Link)

There are primarily two types of segmentation:


 Instance Segmentation: Multiple instances of the same class are separate segments, i.e. objects of the same class are treated as different. Therefore, each object is coloured with a different colour even if the objects belong to the same class.
 Semantic Segmentation: All objects of the same class form a single segment; therefore, all objects of the same class are coloured with the same colour.

Semantic vs Instance Segmentation (Source: Link)

Applications:
The above-discussed object recognition techniques can be utilized in many fields such
as:
 Driver-less Cars: Object Recognition is used for detecting road signs, other
vehicles, etc.
 Medical Image Processing: Object recognition and image processing techniques can help detect disease more accurately. Image segmentation helps to detect the shape of a defect present in the body. For example, Google's AI for breast cancer detection has been reported to detect cancer more accurately than doctors.
 Surveillance and Security: such as Face Recognition, Object Tracking, Activity
Recognition, etc.
Sparse Coding Neural
Networks

1. Overview
In this tutorial, we’ll provide a quick introduction to the sparse coding neural
networks concept.
Sparse coding, in simple words, is a machine learning approach in which a dictionary of
basis functions is learned and then used to represent input as a linear combination of a
minimal number of these basis functions.
Moreover, sparse coding seeks a compact, efficient data representation, which may be
utilized for tasks such as denoising, compression, and classification.

2. Sparse Coding

In the case of sparse coding, the data is first translated into a high-dimensional feature
space, where the basis functions are then employed to represent the data as a linear
combination of a limited number of these basis functions.
Subject to a sparsity constraint that promotes the coefficients to be zero or nearly zero,
the coefficients of linear combinations are selected to reduce the reconstruction error
between the original data and the representation. So, with most of the coefficients being
zero or almost zero, this sparsity constraint pushes the final representation to be a
linear combination of the basis functions.
In this way, sparse coding, as previously noted, seeks to discover a usable sparse
representation of any given data. Then, a sparse code will be used to encode data:

 To learn the sparse representation, the sparse coding requires input data
 This is incredibly helpful since you can utilize unsupervised learning to apply it
straight to any data
 Without sacrificing any information, it will automatically find the representation

Sparse coding techniques aim to meet the following two constraints simultaneously:
 For a given data sample (a vector x), it will try to learn how to create a useful sparse representation (a vector h) using an encoder matrix.
 For each representation h, it will try to reconstruct the data using a decoder matrix D.

2.1. Mathematical Background of Sparse Coding

We assume our data can be represented as a sparse linear combination of learned basis functions:

x ≈ D h    (1)

where x is a data vector, D is the dictionary (decoder matrix) whose columns are the basis functions, and h is the sparse code.
 Sparse learning: a process that aims to train on the input data and generate a dictionary of basis functions D.
 Sparse encoding: a process that aims to learn the sparse code h for new data using the generated dictionary D.

The general form of the target function to optimize in the dictionary learning process may be mathematically represented as follows:

min over D of Σ(k = 1..N) [ min over h(k) of ( || x(k) − D h(k) ||² + λ || h(k) ||1 ) ]    (2)

In the previous equation, λ is a constant; N denotes the number of data vectors; and x(k) means the k-th vector of the data. Furthermore, h(k) is the sparse representation of x(k); D is the decoder matrix (dictionary); and the coefficient of sparsity is λ.
The nested min expressions in the general form give the impression that we are nesting two optimization problems. We can think of this as two separate optimization problems competing to reach the best compromise:
 Outer minimization: the procedure that deals with the left side of the sum, adjusting D to reduce the input's loss during the reconstruction process.
 Inner minimization: the procedure that attempts to increase sparsity by lowering the L1 norm of the sparse representation h. Said differently, we are trying to solve a problem (in this case, reconstruction) while utilizing the fewest resources feasible to store our data.

The following algorithm performs sparse coding by iteratively updating the coefficients
for each sample in the dataset, using the residuals from the previous iteration to update
the coefficients for the current iteration. The sparsity constraint is enforced by setting
the coefficients to zero or close to zero if they fall below a certain threshold.
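One common way to realise this description is iterative soft-thresholding (ISTA). A minimal NumPy sketch, in which the dictionary, the data sample, the step size and the sparsity weight are illustrative assumptions:

# Repeatedly move the code along the residual gradient, then shrink small coefficients to zero.
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))            # dictionary: 50 basis functions in R^20
D /= np.linalg.norm(D, axis=0)               # normalise each basis function (atom)
x = rng.standard_normal(20)                  # one data sample
lam, step = 0.1, 0.1                         # sparsity weight and gradient step size

h = np.zeros(50)                             # sparse code, initialised to zero
for _ in range(200):
    residual = x - D @ h                     # reconstruction error of the current code
    h = h + step * (D.T @ residual)          # gradient step on the reconstruction term
    h = np.sign(h) * np.maximum(np.abs(h) - step * lam, 0.0)   # soft threshold (sparsity)

print("non-zero coefficients:", np.count_nonzero(h))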

3. Sparse Coding Neural Networks

From a biological perspective, sparse code follows the more all-encompassing idea of
neural code. Consider the case when you have binary neurons. So, basically:

 The neural networks will get some inputs and deliver outputs
 Some neurons in the neural network will be frequently activated while others
won’t be activated at all to calculate the outputs
 The average activity ratio refers to the number of activations on some data,
whereas the neural code is the observation of those activations for a specific
input
 Neural coding is the process of instructing your neurons to produce a reliable
neural code

Now that we know what a neural code is, we can speculate on what it may be
like. Then, data will be encoded using a sparse code while taking into
consideration the following scenarios:

 No neurons are even activated


 One neuron alone is activated
 Half of the neurons are active
 Neurons in half of the brain activate

The figure next shows an example of how a dense neural network graph is transformed
into a sparse neural network graph:
4. Applications of Sparse Coding

Brain imaging, natural language processing, and image and sound processing all use
sparse coding. Also, sparse coding has been employed in leading-edge systems in
several areas and has shown to be very successful at capturing the structure of
complicated data.
Resources for the implementation of sparse coding are available in several Python modules. The well-known toolkit Scikit-learn offers a sparse coding module with methods for creating a dictionary of basis functions and sparsely combining these basis functions to encode data.
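A short sketch of that scikit-learn route: learn a dictionary of basis functions with DictionaryLearning, then encode the data sparsely. The dataset, the number of atoms and the sparsity settings are illustrative assumptions:

# Learn a dictionary on a small image dataset and inspect the sparse codes.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import DictionaryLearning

X, _ = load_digits(return_X_y=True)
X = X[:200] / 16.0                                        # a small, scaled subset

dico = DictionaryLearning(n_components=32, transform_algorithm="lasso_lars",
                          transform_alpha=0.5, max_iter=20, random_state=0)
codes = dico.fit_transform(X)                             # sparse codes, one row per sample
print("dictionary shape:", dico.components_.shape)        # 32 basis functions of length 64
print("average non-zeros per code:", np.count_nonzero(codes, axis=1).mean())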

5. Conclusion

In short, sparse coding is a powerful machine learning technique that allows us to find
compact, efficient representations of complex data. It has been widely applied to various
applications such as brain imaging, natural language processing, and image and sound
processing and achieved promising results.

Difference Between Computer Vision and Deep


Learning
Over the last few decades, technologies that once seemed futuristic, like AI and machine vision, have become mainstream, embracing many applications ranging from automated robot assembly to automatic vehicle guidance, analysis of remotely sensed images and automated visual inspection. Computer vision and deep learning are among the hottest topics these days, with established tech companies and start-ups alike rushing to lead the competition.

What Is Computer Vision?


Computer Vision is an interdisciplinary field of artificial intelligence that enables computers to process, analyze and interpret the visual world. A massive number of objects exist in the real world, and while different objects might have a similar visual appearance, it is the subtle details that separate them from each other. Image recognition is regarded as the most common application of computer vision. The idea is to make computers identify and process images the same way human vision does. The ease with which human vision processes and interprets images is truly remarkable. Computer vision aims to pass this characteristic trait of humans on to computers, so that computers can understand and analyze complex visual scenes just as humans would, or even better.
What is Deep Learning?
Deep learning is a subset of machine learning and AI based on artificial neural networks that seeks to mimic the functioning of the human brain, so that computers can learn what comes naturally to humans. Deep learning is concerned with algorithms inspired by the structure of the human brain that enable machines to gain some level of understanding and knowledge in the way the human brain filters information. It defines model parameters for the decision-making process, mimicking the understanding process in the human brain. It is a way of performing data inference in machine learning, and together they are among the major tools of modern AI. It was initially developed as a machine learning approach to deal with complex input-output mappings. Today, deep learning is a state-of-the-art approach used across many industries for various applications.
Difference between Computer Vision and Deep Learning

Concept

– Computer vision is a subset of machine learning that deals with making computers or
machines understand human actions, behaviors, and languages similarly to humans.
The idea is to get machines to understand and interpret the visual world so that they
make sense out of it and derive some meaningful insights. Deep learning is a subset of
AI that seeks to mimic the functioning of the human brain based on artificial neural
networks.

Purpose

– The purpose of computer vision is to program a computer to interpret visual


information contained within image and video data in order to make better sense of the
digital data. The idea is to translate this data into meaningful insights, using contextual
information provided by humans in order to make better business decisions and solve
complex problems. Deep learning has been introduced with the objective of moving
machine learning closer to AI. DL algorithms have revolutionized the way we deal with
data. The goal is to extract features from raw data based on the notion of artificial neural
networks.

Applications

– The most common real world applications of computer vision include defect
detection, image labeling, face recognition, object detection, image classification, object
tracking, movement analysis, cell classification, and more. Top applications of deep
learning are self driving cars, natural language processing, visual recognition, image and
speech recognition, virtual assistants, chatbots, fraud detection, etc.
Computer Vision vs. Deep Learning: Comparison Chart
Summary
Deep learning has achieved remarkable progress in various fields in a short span of
time, particularly it has brought a revolution to the computer vision community,
introducing efficient solutions to the problems that had long remained unsolved.
Computer vision is a subfield of AI that seeks to make computers understand the
contents of the digital data contained within images or videos and make some sense out
of them. Deep learning aims to bring machine learning one step closer to one of its
original goals, that is, artificial intelligence.

Is computer vision part of deep learning?

The link between computer vision and machine learning is very fuzzy, as is the link
between computer vision and deep learning. In a short span of time, computer vision
has shown tremendous progress, and from interpreting optical data to object modeling,
the term deep learning has started to creep into computer vision, as it did into machine
learning, AI, and other domains.

What is computer vision with deep learning?

Many traditional applications in computer vision can be solved through invoking deep
learning methods. Computer vision seeks to guide machines and computers toward
understanding the contents of digital data such as images or videos.

Is computer vision machine learning?

Computer vision is a subset of machine learning, and machine learning is a subfield of


AI. Computer vision trains computers to make sense of the visual world as the human
vision does. While computer vision uses machine learning algorithms such as neural
networks, it is more than machine learning applied. They are closely related to each
other, but they are not the same.

Why is computer vision so hard?

Computer vision is a challenge because it is limited by hardware, and the way machines see objects and images is very different from how humans see and interpret them. Machines see them as numbers representing individual pixels, which makes it difficult to get them to understand what and how we see things.
What is the role of computer vision?

Computer vision is the science of making computers or machines understand human actions, behaviors, and languages similarly to humans. Computer vision has an amazing diversity of real-world applications, such as autonomous driving, biometric systems, pedestrian protection systems, video surveillance, robotics, medical diagnosis, and more.

Natural Language Processing – Overview


Machine Comprehension is a very interesting but challenging task in both Natural
Language Processing (NLP) and artificial intelligence (AI) research. There are several
approaches to natural language processing tasks. With recent breakthroughs in deep
learning algorithms, hardware, and user-friendly APIs like TensorFlow, some tasks
have become feasible up to a certain accuracy. This article contains information about
TensorFlow implementations of various deep learning models, with a focus on
problems in natural language processing. The purpose of this article is to help the machine understand the meaning of sentences, which improves the efficiency of machine translation, and to enable interaction with computing systems to obtain useful information from them.

Understanding Natural Language Processing:

Our ability to evaluate the relationship between sentences is essential for tackling a
variety of natural language challenges, such as text summarization, information
extraction, and machine translation. This challenge is formalized as the natural
language inference task of Recognizing Textual Entailment (RTE), which involves
classifying the relationship between two sentences as one of entailment, contradiction,
or neutrality. For instance, the premise “Garfield is a cat”, naturally entails the
statement “Garfield has paws”, contradicts the statement “Garfield is a German
Shepherd”, and is neutral to the statement “Garfield enjoys sleeping”.
Natural language processing is the ability of a computer program to understand
human language as it is spoken. NLP is a component of artificial intelligence that deals
with the interactions between computers and human languages, in particular the
processing and analysis of large amounts of natural language data. Natural language
processing can perform several different tasks by processing natural data through
different efficient means. These tasks could include:
 Answering questions about anything (what Siri, Alexa, and Cortana can do).
 Sentiment analysis (determining whether the attitude is positive, negative, or
neutral).
 Image-to-text mappings (creating captions for an input image).
 Machine translation (translating text into different languages).
 Speech recognition.
 Part of Speech (POS) tagging.
 Entity identification.
The traditional approach to NLP involved a lot of domain knowledge of linguistics
itself.
Deep learning, at its most basic level, is all about representation learning. With
convolutional neural networks (CNNs), the composition of different filters is used to
classify objects into categories. Taking a similar approach, this article creates
representations of words from large datasets.

Conversational AI: Natural Language Processing Features

 Natural Language Processing (NLP)


 Text Mining (TM)
 Computational Linguistics (CL)
 Machine Learning on Text Data (ML on Text)
 Deep Learning approaches for Text Data (DL on Text)
 Natural Language Understanding (NLU)
 Natural Language Generation (NLG)
Conversational AI has seen several amazing advances in recent years, with
significant improvements in automatic speech recognition (ASR), text-to-speech (TTS),
and intent recognition, as well as the significant growth of voice assistant devices like
the Amazon Echo and Google Home.
Deep learning techniques work efficiently on NLP-related problems. This article uses
backpropagation and stochastic gradient descent (SGD) as the training algorithms in
the NLP models.
The loss depends on every element of the training set, which makes it compute-intensive
to evaluate; this is certainly true for NLP problems, where the data set is large. Because
gradient descent is iterative, it has to be run for many steps, which means going through
the data hundreds or thousands of times. SGD instead estimates the loss by taking the
average loss over a small data set sampled at random from the larger one. It then
computes the derivative for that sample and assumes that this derivative points in the
right direction for gradient descent. A single step might even increase the loss rather
than reduce it, so SGD compensates by taking very many, very small steps. Each step is
cheap to compute, and overall the procedure performs well. The SGD algorithm is at the
core of deep learning; a minimal sketch follows below.
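
To make the procedure concrete, here is a minimal sketch of mini-batch SGD on a toy
linear least-squares problem. The data, the model, and the learning rate are made up for
illustration and are not part of the NLP models discussed in this article.

Python

# Mini-batch SGD sketch on a toy linear model (illustrative assumptions only)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 20))            # large "data set"
true_w = rng.normal(size=20)
y = X @ true_w + 0.1 * rng.normal(size=10000)

w = np.zeros(20)                            # model parameters
lr, batch_size = 0.01, 32                   # small steps, small random sample

for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)   # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size     # gradient of the average loss
    w -= lr * grad                                   # cheap, noisy step

print(np.linalg.norm(w - true_w))           # small after many noisy steps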

Word Vectors:

Words need to be represented as input to machine learning models; one
mathematical way to do this is to use vectors. There are an estimated 13 million words
in the English language, but many of these are related.
The goal is to find an N-dimensional vector space (where N << 13 million) that is
sufficient to encode all the semantics of our language. To do this, there needs to be an
understanding of the similarities and differences between words. The concept of vectors
and the distances between them (cosine, Euclidean, etc.) can be exploited to find
similarities and differences between words, as the sketch below illustrates.
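
As a small illustration, cosine similarity between word vectors can be computed directly
with numpy. The three-dimensional vectors below are invented toy values; real word
vectors are learned and much higher-dimensional.

Python

# Cosine similarity between toy word vectors (hand-picked values, illustrative only)
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

hotel = np.array([0.9, 0.8, 0.1])
motel = np.array([0.8, 0.9, 0.2])
cat   = np.array([0.1, 0.2, 0.9])

print(cosine(hotel, motel))   # close to 1 -> similar words
print(cosine(hotel, cat))     # much smaller -> dissimilar words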

How Do We Represent the Meaning of Words?

If separate vectors are used for all of the +13 million words in the English vocabulary,
several problems can occur. First, there will be large vectors with a lot of ‘zeroes’ and
one ‘one’ (in different positions representing a different word). This is also known as
one-hot encoding. Second, when searching for phrases such as “hotels in New
Jersey” in Google, expectations are that results pertaining to “motel”, “lodging”, and
“accommodation” in New Jersey are returned. And if using one-hot encoding, these
words have no natural notion of similarity. Ideally, dot products (since we are dealing
with vectors) of synonym/similar words would be close to one of the expected results.
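
The following sketch, using an assumed five-word toy vocabulary, shows the problem:
the dot product between the one-hot vectors of two different words is always zero, so
one-hot encoding cannot express that "hotel" and "motel" are related.

Python

# One-hot vectors carry no notion of similarity (toy five-word vocabulary)
import numpy as np

vocab = ["hotel", "motel", "lodging", "cat", "dog"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["hotel"])                             # [1. 0. 0. 0. 0.]
print(np.dot(one_hot["hotel"], one_hot["motel"]))   # 0.0 -- looks "unrelated"
print(np.dot(one_hot["hotel"], one_hot["hotel"]))   # 1.0 -- only matches itself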

Word2vec is a group of models that helps derive relations between a word and its
contextual words. Beginning with a small, random initialization of word vectors, the
predictive model learns the vectors by minimizing the loss function. In Word2vec, this
happens with a feed-forward neural network and optimization techniques such as the
SGD algorithm. There are also count-based models, which build a co-occurrence
count matrix of the words in the corpus: a large matrix with a row for each "word"
and a column for each "context". The number of "contexts" is of course large, since it
is essentially combinatorial in size. To overcome the size issue, singular value
decomposition can be applied to the matrix, reducing its dimensions while retaining
maximum information (a small sketch follows).
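
As a rough sketch of the count-based approach, the snippet below builds a word-word
co-occurrence matrix from a tiny invented corpus with a context window of one and
reduces it with SVD; the corpus, the window size, and the number of retained dimensions
are all illustrative assumptions.

Python

# Count-based word vectors: co-occurrence counts reduced with SVD (toy corpus)
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
tokens = sorted({w for s in corpus for w in s.split()})
index = {w: i for i, w in enumerate(tokens)}

C = np.zeros((len(tokens), len(tokens)))
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 1), min(len(words), i + 2)):   # window of 1
            if i != j:
                C[index[w], index[words[j]]] += 1

U, S, Vt = np.linalg.svd(C)
embeddings = U[:, :2] * S[:2]          # keep only the top-2 dimensions
print(embeddings[index["cat"]], embeddings[index["dog"]])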
Software and Hardware:
The programming language being used is Python 3.5.2 with Intel Optimization for
TensorFlow as the framework. For training and computation purposes, the Intel AI
DevCloud powered by Intel Xeon Scalable processors was used. Intel AI DevCloud
can provide a great performance bump from the host CPU for the right application and
use case due to having 50+ cores and its own memory, interconnect, and operating
system.

Training Models for NLP: Langmod_nn and Memn2n-master

Langmod_nn Model –
The Langmod_nn model builds a three-layer forward bigram neural network
consisting of an embedding layer, a hidden layer, and a final softmax layer, where the
goal is to use a given word in a corpus to attempt to predict the next word.
The input is passed in as a one-hot encoded vector of dimension 5000.
Input:
A word in a corpus. Because the vocabulary size can get very large, we have limited
the vocabulary to the top 5000 words in the corpus, and the rest of the words are
replaced with the UNK symbol. Each sentence in the corpus is also double-padded
with stop symbols.
Output:
The following word in the corpus, also one-hot encoded in a vector the size of the
vocabulary.

Layers –

The model consists of the following three layers:


Embedding Layer: Each word corresponds to a unique embedding vector, a
representation of the word in some embedding space. Here, the embeddings all have
dimension 50. We find the embedding for a given word by doing a matrix multiply
(essentially a table lookup) with an embedding matrix that is trained during regular
backpropagation.
Hidden Layer: A fully-connected feed-forward layer with a hidden layer size of 100,
and rectified linear unit (ReLU) activation.
Softmax Layer: A fully-connected feed-forward layer with a layer size equal to the
vocabulary size, where each element of the output vector (logits) corresponds to the
probability of that word in the vocabulary being the next word.
Loss – The standard cross-entropy loss between the logits and the true labels is used as
the model's cost.
Optimizer –
A standard SGD optimizer with a learning rate of 0.05.
Each epoch (around 480,000 examples) takes about 10 minutes to train on the CPU.
The test log-likelihood after epoch five is -846493.44
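
The Langmod_nn model itself is implemented in TensorFlow; purely as an illustration of
the architecture described above (vocabulary 5000, embedding size 50, hidden size 100,
cross-entropy loss, SGD with learning rate 0.05), here is a rough PyTorch re-sketch. The
class name and the fake training batch are assumptions, not part of the original code.

Python

# Rough PyTorch re-sketch of the described bigram model (illustrative only)
import torch
import torch.nn as nn

VOCAB, EMBED, HIDDEN = 5000, 50, 100

class BigramLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)   # embedding layer (table lookup)
        self.hidden = nn.Linear(EMBED, HIDDEN)    # hidden layer, ReLU applied below
        self.out = nn.Linear(HIDDEN, VOCAB)       # softmax layer (produces logits)

    def forward(self, word_ids):
        h = torch.relu(self.hidden(self.embed(word_ids)))
        return self.out(h)                        # logits over the vocabulary

model = BigramLM()
loss_fn = nn.CrossEntropyLoss()                   # cross-entropy on the logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# one illustrative training step on fake (current word -> next word) pairs
current = torch.randint(0, VOCAB, (32,))
nxt = torch.randint(0, VOCAB, (32,))
loss = loss_fn(model(current), nxt)
optimizer.zero_grad()
loss.backward()
optimizer.step()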
Memn2n-master:
Memn2n-master is a neural network with a recurrent attention model over a possibly
large external memory. The architecture is a form of memory network, but unlike earlier
memory networks it is trained end-to-end and hence requires significantly less
supervision during training, making it more generally applicable in realistic settings.
Input data –
This directory includes the first set of 20 tasks for testing text understanding and
reasoning from the bAbI project. The motivation behind these 20 tasks is that each task
tests a unique aspect of text understanding and reasoning, so together they probe the
different abilities of the trained models.
For both testing and training, we have 1000 questions each. However, we have not
used this much data as it might not be of much use.
The results of this model were a testing accuracy of 99.6%, training accuracy of
97.6%, and validation accuracy of 88%.
The TensorFlow framework has shown good results for training neural network models,
with the NLP models achieving good accuracy. Training, testing, and loss results have
been strong. The Langmod_nn model and the memory network both achieved good
accuracy with low loss and error values. The flexibility of the memory model allows it
to be applied to tasks as diverse as question answering and language modeling.

Conclusion:

As shown, NLP provides a wide set of techniques and tools that can be applied in many
areas of life. By learning these models and using them in everyday interactions, quality
of life can improve markedly. NLP techniques help to improve communication, reach
goals, and improve the outcomes of every interaction. NLP helps people use the tools
and techniques that are already available to them.
In the future, NLP will move beyond both statistical and rule-based systems toward a
natural understanding of language. Some improvements have already been made by the
tech giants. For example, Facebook has tried to use deep learning to understand text
without parsing, tags, named-entity recognition (NER), etc., and Google is trying to
convert language into mathematical expressions. Endpoint detection using grid long
short-term memory networks (by Google) and end-to-end memory networks on the
bAbI tasks (by Facebook) show the advances that can still be made in NLP models.

What Is Caffe?
Caffe is an open-source deep learning framework that supports a
variety of deep learning architectures.
Caffe (Convolutional Architecture for Fast Feature Embedding) is a deep
learning framework that supports a variety of deep learning architectures such
as CNN, RCNN, LSTM and fully connected networks. With its graphics processing unit
(GPU) support and out-of-the-box templates that simplify model setup and training,
Caffe is most popular for image classification and segmentation tasks.
In Caffe, you can define the model, solver, and optimization details in configuration files,
thanks to its expressive architecture. In addition, you can switch between GPU and
central processing unit (CPU) computation by changing a single flag in the configuration
file. Together, these features eliminate the need for hard coding in your project, which is
normally required in other deep learning frameworks. Caffe is also considered one of the
fastest convolutional network implementations available.
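
The same single-flag CPU/GPU switch is also exposed through Caffe's Python interface.
The sketch below assumes an existing deploy.prototxt and trained .caffemodel file (the
file names are placeholders), so it is illustrative rather than runnable as-is.

Python

# Minimal pycaffe sketch: switching compute mode and loading a trained model.
# 'deploy.prototxt' and 'weights.caffemodel' are placeholder file names.
import caffe

caffe.set_mode_cpu()          # or: caffe.set_mode_gpu(); caffe.set_device(0)

net = caffe.Net('deploy.prototxt',      # model definition (configuration file)
                'weights.caffemodel',   # trained parameters
                caffe.TEST)             # run in inference mode
output = net.forward()                  # forward pass through all layers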

IS CAFFE STILL USED?


The short answer is: Yes, Caffe is still used for a wide range of scientific research
projects with applications in natural language processing, computer vision and
multimedia. A number of large organizations including Meta make use of Caffe in one
way or another.


Caffe Applications

Caffe is used in a wide range of scientific research projects, startup prototypes and large-
scale industrial applications in natural language processing, computer vision and
multimedia. Several projects are built on top of the Caffe framework, such as Caffe2 and
CaffeOnSpark. Caffe2 is built on Caffe and is merged into Meta’s PyTorch. Yahoo has
also integrated Caffe with Apache Spark to create CaffeOnSpark, which brings deep
learning to Hadoop and Spark clusters.

How Does Caffe Work?

INTERFACES

Caffe is primarily a C++ library and exposes a modular development interface, but not
every situation requires custom compilation. Therefore, Caffe offers interfaces for daily
use by way of the command line, Python and MATLAB.
DATA PROCESSING

Caffe processes data in the form of blobs, which are N-dimensional arrays stored in a C-
contiguous fashion. A blob holds both the data we pass through the model and the diff,
which is the gradient computed by the network.

Data layers handle how the data is processed in and out of the Caffe model. Pre-
processing and transformation like random cropping, mirroring, scaling and mean
subtraction can be done by configuring the data layer. Furthermore, pre-fetching and
multiple-input configurations are also possible.

CAFFE LAYERS

Caffe layers and their parameters are the foundation of every Caffe deep learning model.
The bottom connection of the layer is where the input data is supplied and the top
connection is where the results are provided after computation. In each layer, three
different computations take place, which are setup, forward and backward
computations. In that respect, they are also the primary unit of computation.

Many state-of-the-art deep learning models can be created with Caffe using its layer
catalog. Data layers, normalization layers, utility layers, activation layers and loss layers
are among the layer types provided by Caffe.

CAFFE SOLVER

Caffe solver is responsible for learning — specifically for model optimization and
generating parameter updates to improve the loss. There are several solvers provided in
Caffe such as stochastic gradient descent, adaptive gradient and RMSprop. The solver is
configured separately to decouple modeling and optimization.

What Are the Pros and Cons of Using Caffe?

PROS OF CAFFE

Caffe Is Fast

 Caffe is one of the fastest convolutional network implementations available.


 Caffe can process over 60 million images per day with a single NVIDIA K40 GPU with pre-
fetching. That's one millisecond per image for inference and four milliseconds per image for
learning.

Caffe Is Easy to Use

 No coding is required in most cases. Model, solver, and optimization details can be
defined in configuration files.
 There are ready-to-use templates for common use cases.
 Caffe supports GPU training.
 Caffe is an open-source framework.

CONS OF CAFFE

Caffe Is Not Flexible

 A new network layer must be coded in C++/CUDA.
 It is difficult to experiment with new deep learning architectures not already covered in Caffe.
 HDF5 is the only output format. Additionally, the framework only supports a few input formats.
 Integration with other deep learning frameworks is limited.
 Defining the models in configuration files can be challenging when the model parameters and
layer numbers increase.
 There’s no high-level API to speed up the initial development.
Caffe Has a Limited Community and Little Commercial Support

 Caffe is developing at a slow pace. As a result, its popularity among machine learning
professionals is diminishing.
 The documentation is limited and most support is provided by the community alone, rather
than the developer.
 The absence of commercial support discourages enterprise-grade developers.

What Are Caffe Alternatives?

In addition to Caffe, there are many other deep learning frameworks available, such as:

 TensorFlow by Google: With its huge and active community, it is the most famous end-to-end
machine learning platform which provides tools not only for deep learning but also for
statistical and mathematical calculations.
 Keras by François Chollet: This is a high-level API for developing deep neural networks, while
using TensorFlow as a back-end engine. It's easy to create sophisticated neural models with
Keras, thanks to its simple programming interface.
 PyTorch by Meta: PyTorch is another open-source machine learning framework used for
applications such as computer vision and natural language processing.
 Apache MxNet by Apache Software Foundation: This is an open-source deep learning
framework that allows you to develop deep learning projects on a wide range of devices.
Apache MxNet supports multiple languages such as Python, C++, R, Scala, Julia, MATLAB
and JavaScript.

History of Caffe

Yangqing Jia initiated the Caffe project during his doctoral studies at the University of
California, Berkeley. Caffe was developed (and is currently maintained) by Berkeley AI
Research and community contributors under the BSD license. It is written in C++ and
its first stable release date was April 18, 2017.

Theano in Python
Theano is a Python library that allows us to evaluate mathematical operations
involving multi-dimensional arrays very efficiently. It is mostly used in building deep
learning projects. It runs far faster on a Graphics Processing Unit (GPU) than on a
CPU. Theano attains speeds that give tough competition to C implementations for
problems involving large amounts of data. Because it can take advantage of GPUs, it
can outperform C on a CPU by considerable orders of magnitude under certain
circumstances.
It knows how to take structures and convert them into very efficient code that uses
numpy and some native libraries. It is mainly designed to handle the kinds of
computation required by the large neural network algorithms used in deep learning. That
is why it is a very popular library in the field of deep learning.
How to install Theano :
pip install theano
Several of the symbols we will need to use are in the tensor subpackage of Theano.
We often import such packages with a handy name, let’s say, T.
from theano import *
import theano.tensor as T
Why Theano Python Library :
Theano is a sort of hybrid between numpy and sympy: it attempts to combine the two
into one powerful library. Some advantages of Theano are as follows:
 Stability Optimization: Theano can detect some numerically unstable expressions
and evaluate them using more stable means.
 Execution Speed Optimization: As mentioned earlier, Theano can make use of
recent GPUs and execute parts of expressions on your CPU or GPU, making it
much faster than pure Python.
 Symbolic Differentiation: Theano is smart enough to automatically create
symbolic graphs for computing gradients (a short example follows this list).
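
Here is a short example of symbolic differentiation: Theano builds the gradient
expression for y = x**2 automatically with T.grad.

Python

# Symbolic differentiation in Theano: T.grad builds the gradient expression
# automatically from the computation graph.
import theano
import theano.tensor as T

x = T.dscalar('x')
y = x ** 2                     # symbolic expression
dy_dx = T.grad(y, x)           # symbolic derivative: 2*x

f = theano.function([x], dy_dx)
print(f(4.0))                  # 8.0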
Basics of Theano :
Theano is a Python library that allows you to define, optimize, and evaluate
mathematical expressions involving multi-dimensional arrays efficiently. Some Theano
implementations are as follows.
Subtracting two scalars :

# Python program showing
# subtraction of two scalars

import theano
from theano import tensor

# Declaring variables
a = tensor.dscalar()
b = tensor.dscalar()

# Subtracting
res = a - b

# Converting it to a callable object
# so that it takes two scalars as parameters
func = theano.function([a, b], res)

# Calling the function
assert 20.0 == func(30.5, 10.5)

It will not produce any output: func(30.5, 10.5) returns 20.0, so the assertion holds and
passes silently.
Adding two scalars :
Python

# Python program showing
# addition of two scalars

import numpy
import theano.tensor as T
from theano import function

# Declaring two variables
x = T.dscalar('x')
y = T.dscalar('y')

# Summing up the two numbers
z = x + y

# Converting it to a callable object
# so that it takes two scalars as parameters
f = function([x, y], z)

# Calling the function
f(5, 7)

Output:
array(12.0)
Adding two matrices :
Python

# Python program showing
# addition of two matrices

import numpy
import theano.tensor as T
from theano import function

x = T.dmatrix('x')
y = T.dmatrix('y')

z = x + y

f = function([x, y], z)

f([[30, 50], [2, 3]], [[60, 70], [3, 4]])

Output:
array([[ 90., 120.],
[ 5., 7.]])
Logistic function using theano :
Let's try to compute the logistic curve, which is given by:
s(x) = 1 / (1 + e^(-x))
Logistic Function :
Python

# Python program to illustrate the logistic
# sigmoid function using theano

# Load theano library
import theano
from theano import tensor

# Declaring variable
a = tensor.dmatrix('a')

# Sigmoid function
sig = 1 / (1 + tensor.exp(-a))

# Now it takes a matrix as parameter
log = theano.function([a], sig)

# Calling function
print(log([[0, 1], [-1, -2]]))

Output :
[[0.5        0.73105858]
 [0.26894142 0.11920292]]
Theano is a foundation library mainly used for deep learning research and
development; it can be used directly to create deep learning models or through
convenient wrapper libraries such as Keras. It supports both convolutional networks
and recurrent networks, as well as combinations of the two.

Deep Learning with PyTorch | An Introduction


PyTorch in a lot of ways behaves like the arrays we love from NumPy. These NumPy
arrays, after all, are just tensors. PyTorch takes these tensors and makes it simple to
move them to GPUs for the faster processing needed when training neural networks. It
also provides a module that automatically calculates gradients (for backpropagation)
and another module specifically for building neural networks. Altogether, PyTorch
ends up being more flexible with Python and the NumPy stack than TensorFlow and
other frameworks.

Neural Networks: Deep learning is based on artificial neural networks, which have
been around in some form since the late 1950s. The networks are built from individual
parts approximating neurons, typically called units or simply "neurons." Each unit has
some number of weighted inputs. These weighted inputs are summed together (a
linear combination) and then passed through an activation function to get the unit's
output.
PyTorch is an open-source machine learning library based on the Torch library,
developed by Facebook’s AI Research lab. It is widely used in deep learning, natural
language processing, and computer vision applications. PyTorch provides a dynamic
computational graph, which allows for easy and efficient modeling of complex neural
networks.
Here's a general overview of how to use PyTorch for deep learning (a compact sketch
follows this list):
 Define the model: PyTorch provides a wide range of pre-built neural network
architectures, such as fully connected networks, convolutional neural networks
(CNNs), and recurrent neural networks (RNNs). The model can be defined using
PyTorch's torch.nn module.
 Define the loss function: The loss function is used to evaluate the model's
performance and update its parameters. PyTorch provides a wide range of loss
functions, such as cross-entropy loss and mean squared error loss.
 Define the optimizer: The optimizer is used to update the model's parameters based
on the gradients of the loss function. PyTorch provides a wide range of optimizers,
such as Adam and SGD.
 Train the model: The model is trained on a dataset using the defined loss function
and optimizer. PyTorch provides a convenient data loading and processing library,
torch.utils.data, that can be used to load and prepare the data.
 Evaluate the model: After training, the model's performance can be evaluated on a
test dataset.
 Deploy the model: The trained model can be deployed in a wide range of applications,
such as computer vision and natural language processing.
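
The sketch below strings these steps together on randomly generated data; the two-layer
classifier, the tensor sizes, and the number of epochs are illustrative assumptions only.

Python3

# Compact PyTorch training sketch on random data (illustrative only)
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Define the model
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
# Define the loss function and the optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Load and prepare (fake) data with torch.utils.data
X = torch.randn(512, 10)
y = torch.randint(0, 2, (512,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

# Train the model
for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

# Evaluate the model
with torch.no_grad():
    accuracy = (model(X).argmax(dim=1) == y).float().mean()
    print(f"training accuracy: {accuracy:.2f}")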

Tensors: It turns out neural network computations are just a bunch of linear algebra
operations on tensors, which are a generalization of matrices. A vector is a 1-
dimensional tensor, a matrix is a 2-dimensional tensor, and an array with three indices is
a 3-dimensional tensor. The fundamental data structure for neural networks is the tensor,
and PyTorch is built around tensors. It's time to explore how we can use PyTorch to
build a simple neural network.
Python3

# First, import PyTorch

import torch

Define an activation function (sigmoid) to apply to the linear output:


Python3

def activation(x):
    """ Sigmoid activation function

    Arguments
    ---------
    x: torch.Tensor
    """
    return 1 / (1 + torch.exp(-x))

Python3

# Generate some data

# Features are 5 random normal variables
features = torch.randn((1, 5))

# True weights for our data, random normal variables again
weights = torch.randn_like(features)

# and a true bias term
bias = torch.randn((1, 1))

features = torch.randn((1, 5)) creates a tensor with shape (1, 5), one row and five
columns, that contains values randomly distributed according to the normal distribution
with a mean of zero and standard deviation of one. weights =
torch.randn_like(features) creates another tensor with the same shape as features,
again containing values from a normal distribution. Finally, bias = torch.randn((1, 1))
creates a single value from a normal distribution.
Now we calculate the output of the network using matrix multiplication.
Python3

y = activation(torch.mm(features, weights.view(5, 1)) + bias)

That's how we can calculate the output for a single neuron. The real power of this
algorithm appears when you start stacking these individual units into layers and stacks
of layers to form a network of neurons. The output of one layer of neurons becomes
the input for the next layer. With multiple input units and output units, we now need to
express the weights as a matrix. We define the structure of the neural network and
initialize the weights and biases.
Python3

# Features are 3 random normal variables
features = torch.randn((1, 3))

# Define the size of each layer in our network
n_input = features.shape[1]   # Number of input units, must match number of input features
n_hidden = 2                  # Number of hidden units
n_output = 1                  # Number of output units

# Weights for inputs to hidden layer
W1 = torch.randn(n_input, n_hidden)
# Weights for hidden layer to output layer
W2 = torch.randn(n_hidden, n_output)

# and bias terms for hidden and output layers
B1 = torch.randn((1, n_hidden))
B2 = torch.randn((1, n_output))

Now we can calculate the output for this multi-layer network using the weights W1 &
W2, and the biases, B1 & B2.
Python3

h = activation(torch.mm(features, W1) + B1)

output = activation(torch.mm(h, W2) + B2)

print(output)

ADVANTAGES AND DISADVANTAGES:

Advantages of using PyTorch for deep learning:

Dynamic computational graph: PyTorch's dynamic computational graph allows for
easy and efficient modeling of complex neural networks. This makes it well-suited for
tasks such as natural language processing and computer vision.
User-friendly API: PyTorch has a user-friendly and intuitive API, which makes it easy
to implement, debug, and experiment with different models.
Active community: PyTorch has a large and active community, which means that you
can find a lot of tutorials, pre-trained models, and other resources to help you get
started.
Integration with other libraries: PyTorch has great integration with other libraries such
as CUDA, which allows for fast and efficient training of models on GPUs.
Easy to use for distributed training: PyTorch makes it easy to perform distributed
training across multiple GPUs and machines, which can significantly speed up training
time.

Disadvantages of using PyTorch for deep learning:

Less mature than TensorFlow: PyTorch is relatively new compared to TensorFlow,
which means that it may have less mature libraries and fewer resources available.
Less production ready: PyTorch is more suited for research and development work,
rather than for production-level deployment.
Less optimized for mobile: PyTorch does not have as good support for mobile
deployment as TensorFlow Lite.
Less optimized for browser: PyTorch does not have as good support for browser
deployment as TensorFlow.js
Less optimized for Autograd: PyTorch’s Autograd implementation is less optimized
than TensorFlow’s, which can make training large models more memory-intensive.
