Semester
3 0 0 3
DEEP LEARNING
Course Objectives:
At the end of the course, the students will be expected to:
Learn deep learning methods for working with sequential data,
Learn deep recurrent and memory networks,
Learn deep Turing machines,
Apply such deep learning mechanisms to various learning problems.
Know the open issues in deep learning, and have a grasp of the current research directions.
Course Outcomes:
After completing the course, students will be able to:
Demonstrate the basic concepts of fundamental learning techniques and layers.
Discuss neural network training and various random models.
Explain different types of deep learning network models.
Classify probabilistic neural networks.
Implement deep learning techniques using standard tools.
UNIT I: Introduction: Various paradigms of learning problems, Perspectives and Issues in deep
learning framework, review of fundamental learning techniques. Feed forward neural network:
Artificial Neural Network, activation function, multi-layer neural network.
UNIT II: Training Neural Network: Risk minimization, loss function, back propagation,
regularization, model selection, and optimization.
Conditional Random Fields: Linear chain, partition function, Markov network, Belief
propagation, Training CRFs, Hidden Markov Model, Entropy.
UNIT III: Deep Learning: Deep Feed Forward network, regularizations, training deep models,
dropouts, Convolution Neural Network, Recurrent Neural Network, and Deep Belief Network.
UNIT IV: Probabilistic Neural Network: Hopfield Net, Boltzmann machine, RBMs, Sigmoid
net, Auto encoders.
UNIT V: Applications: Object recognition, sparse coding, computer vision, natural language
processing. Introduction to Deep Learning Tools: Caffe, Theano, Torch.
Text Books:
1. Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016.
2. Bishop, C. M., Pattern Recognition and Machine Learning, Springer, 2006.
Reference Books:
1. Artificial Neural Networks, Yegnanarayana, B., PHI Learning Pvt. Ltd, 2009.
2. Matrix Computations, Golub, G. H., and Van Loan, C. F., JHU Press, 2013.
3. Neural Networks: A Classroom Approach, Satish Kumar, Tata McGraw-Hill Education, 2004.
UNIT-I
INTRODUCTION
Today, deep learning has become one of the most popular and visible areas of
machine learning, owing to its success in a variety of applications such as computer
vision, natural language processing, and reinforcement learning.
Deep learning can be applied to supervised, unsupervised, and reinforcement
machine learning, and it processes each in a different way.
Supervised Machine Learning: Supervised machine learning is the technique in
which a neural network learns to make predictions or classify data from labeled
datasets. Here we supply both the input features and the target variables. The
network learns by reducing the cost, or error, that comes from the difference
between the predicted and the actual target; propagating this error backward to
adjust the weights is known as backpropagation. Deep learning architectures such
as convolutional neural networks and recurrent neural networks are used for many
supervised tasks like image classification and recognition, sentiment analysis,
and language translation.
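As a minimal sketch of this idea, a single sigmoid neuron can be trained on labeled input-target pairs by repeatedly adjusting its weights against the prediction error. The toy dataset, learning rate, and iteration count below are invented for the illustration:

```python
import numpy as np

# Hypothetical labeled dataset: inputs x with binary targets y
x = np.array([[0.0], [1.0], [2.0], [3.0]])   # input features
y = np.array([[0.0], [0.0], [1.0], [1.0]])   # labeled targets

w = np.zeros((1, 1))
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    pred = sigmoid(x @ w + b)             # forward pass: prediction
    error = pred - y                      # predicted minus actual target
    grad = error * pred * (1.0 - pred)    # error scaled by the sigmoid slope
    w -= (x.T @ grad) / len(x)            # backpropagation: adjust the weights
    b -= grad.mean()
```

After training, the neuron assigns low probability to the inputs labeled 0 and high probability to those labeled 1.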
Unsupervised Machine Learning: Unsupervised machine learning is the technique
in which a neural network learns to discover patterns in, or to cluster, a dataset
without labels. Here there are no target variables; the machine has to determine
the hidden patterns or relationships within the data on its own. Deep learning
models such as autoencoders and generative models are used for unsupervised
tasks like clustering, dimensionality reduction, and anomaly detection.
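A minimal illustration of the unsupervised setting is k-means clustering, which discovers groupings in unlabeled points with no targets at all. The 2-D data and cluster locations below are hypothetical:

```python
import numpy as np

# Hypothetical unlabeled 2-D points forming two well-separated groups
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.3, (20, 2)),    # group near (0, 0)
                 rng.normal(5.0, 0.3, (20, 2))])   # group near (5, 5)

centroids = np.array([pts[0], pts[-1]])            # one seed from each region
for _ in range(10):
    # Assign every point to its nearest centroid (no labels are used)
    dists = np.linalg.norm(pts[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)
    # Move each centroid to the mean of its assigned points
    centroids = np.array([pts[labels == k].mean(axis=0) for k in range(2)])
```

The algorithm self-determines the two groupings purely by observing structure in the data.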
Reinforcement Machine Learning: Reinforcement Machine Learning is
the machine learning technique in which an agent learns to make decisions in an
environment to maximize a reward signal. The agent interacts with the environment
by taking action and observing the resulting rewards. Deep learning can be used to
learn policies, or a set of actions, that maximizes the cumulative reward over time.
Deep reinforcement learning algorithms such as Deep Q-Networks (DQN) and Deep
Deterministic Policy Gradient (DDPG) are used for tasks like robotics and game
playing.
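The reward-driven loop can be sketched with tabular Q-learning on a hypothetical five-state corridor; the environment, reward, and hyperparameters are invented for this sketch, and a Deep Q-Network would replace the table with a neural network:

```python
import numpy as np

# Hypothetical corridor: the agent starts at state 0 and earns +1 at state 4
n_states, n_actions = 5, 2              # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for _ in range(2000):                   # episodes
    s = 0
    while s != 4:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore
        a = int(rng.integers(2)) if rng.random() < 0.2 else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0
        # Move Q(s, a) toward the observed reward plus discounted future value
        Q[s, a] += 0.1 * (r + 0.9 * Q[s_next].max() - Q[s, a])
        s = s_next
```

After training, the greedy policy `Q.argmax(axis=1)` chooses "right" in every non-terminal state, the set of actions that maximizes cumulative reward over time.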
Artificial neural networks, also known as neural networks or neural nets, are built on
the principles of the structure and operation of human neurons. An artificial neural
network’s input layer, which is the first layer, receives input from external sources and
passes it on to the hidden layer, which is the second layer. Each neuron in the hidden
layer gets information from the neurons in the previous layer, computes the weighted
total, and then transfers it to the neurons in the next layer. These connections are
weighted, meaning that each input's influence is scaled by the distinct weight assigned
to it. These weights are then adjusted during the training process to enhance the
performance of the model.
Fully Connected Artificial Neural Network
Artificial neurons, also known as units, are found in artificial neural networks. The
whole Artificial Neural Network is composed of these artificial neurons, which are
arranged in a series of layers. Whether a layer has a dozen units or millions, the
complexity of a neural network depends on the complexity of the underlying patterns
in the dataset. Commonly, an artificial neural network has an input layer, an
output layer as well as hidden layers. The input layer receives data from the outside
world which the neural network needs to analyze or learn about.
In a fully connected artificial neural network, there is an input layer and one or more
hidden layers connected one after the other. Each neuron receives input from the
previous layer neurons or the input layer. The output of one neuron becomes the input
to other neurons in the next layer of the network, and this process continues until the
final layer produces the output of the network. Then, after passing through one or
more hidden layers, this data is transformed into valuable data for the output
layer. Finally, the output layer provides an output in the form of an artificial neural
network’s response to the data that comes in.
Units are linked to one another from one layer to the next in most neural
networks. Each of these links has weights that control how much one unit influences
another. The neural network learns more and more about the data as it moves from
one unit to another, ultimately producing an output from the output layer.
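The layer-to-layer weighted-sum computation described above can be sketched as follows; the layer sizes, random weights, and ReLU activation are illustrative assumptions:

```python
import numpy as np

def dense(x, W, b):
    # Each unit computes a weighted total of the previous layer's outputs,
    # then applies an activation (ReLU here, an assumed choice)
    return np.maximum(0.0, W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                           # input layer: 3 features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer: 4 units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)    # output layer: 2 units

h = dense(x, W1, b1)       # hidden output becomes the next layer's input
out = W2 @ h + b2          # final layer produces the network's output
```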
Difference between Machine Learning and Deep Learning:
Machine learning and deep learning are both subsets of artificial intelligence, with
many similarities as well as differences between them. For example, machine learning
takes less time to train a model, while deep learning takes more time.
Deep Learning models are able to automatically learn features from the data, which
makes them well-suited for tasks such as image recognition, speech recognition, and
natural language processing. The most widely used architectures in deep learning are
feedforward neural networks, convolutional neural networks (CNNs), and recurrent
neural networks (RNNs).
Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear flow
of information through the network. FNNs have been widely used for tasks such as
image classification, speech recognition, and natural language processing.
Convolutional Neural Networks (CNNs) are designed specifically for image and video
recognition tasks. CNNs are able to automatically learn features from images,
which makes them well-suited for tasks such as image classification, object detection,
and image segmentation.
Recurrent Neural Networks (RNNs) are a type of neural network that is able to
process sequential data, such as time series and natural language. RNNs are able to
maintain an internal state that captures information about the previous inputs, which
makes them well-suited for tasks such as speech recognition, natural language
processing, and language translation.
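The internal state an RNN maintains can be sketched as a single recurrent update applied at each time step; the dimensions and random weights here are illustrative assumptions:

```python
import numpy as np

# Minimal recurrent cell: the hidden state h carries information about
# all previous inputs in the sequence
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(4, 3))   # input-to-hidden weights
W_hh = rng.normal(scale=0.5, size=(4, 4))   # hidden-to-hidden (the recurrence)
h = np.zeros(4)                             # initial internal state

sequence = rng.normal(size=(6, 3))          # 6 time steps of 3-D inputs
for x_t in sequence:
    # The new state mixes the current input with the remembered state
    h = np.tanh(W_xh @ x_t + W_hh @ h)
```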
Applications of Deep Learning :
The main applications of deep learning can be divided into computer vision, natural
language processing (NLP), and reinforcement learning.
Computer vision
In computer vision, Deep learning models can enable machines to identify and
understand visual data. Some of the main applications of deep learning in computer
vision include:
Object detection and recognition: Deep learning models can be used to identify
and locate objects within images and videos, enabling applications such as
self-driving cars, surveillance, and robotics.
Image classification: Deep learning models can be used to classify images into
categories such as animals, plants, and buildings. This is used in applications such
as medical imaging, quality control, and image retrieval.
Image segmentation: Deep learning models can be used to segment images into
different regions, making it possible to identify specific features within images.
Natural language processing (NLP):
In NLP, deep learning models enable machines to understand and generate
human language. Some of the main applications of deep learning in NLP include:
Automatic text generation: Deep learning models can learn from a corpus of text,
and new text such as summaries and essays can then be generated automatically
using these trained models.
Language translation: Deep learning models can translate text from one language
to another, making it possible to communicate with people from different linguistic
backgrounds.
Sentiment analysis: Deep learning models can analyze the sentiment of a piece of
text, making it possible to determine whether the text is positive, negative, or
neutral. This is used in applications such as customer service, social media
monitoring, and political analysis.
Speech recognition: Deep learning models can recognize and transcribe spoken
words, making it possible to perform tasks such as speech-to-text conversion, voice
search, and voice-controlled devices.
Reinforcement learning:
In reinforcement learning, deep learning is used to train agents that take actions in an
environment to maximize a reward. Some of the main applications of deep learning in
reinforcement learning include:
Game playing: Deep reinforcement learning models have been able to beat human
experts at games such as Go, Chess, and Atari.
Robotics: Deep reinforcement learning models can be used to train robots to
perform complex tasks such as grasping objects, navigation, and manipulation.
Control systems: Deep reinforcement learning models can be used to control
complex systems such as power grids, traffic management, and supply chain
optimization.
Challenges in Deep Learning
Deep learning has made significant advancements in various fields, but there are still
some challenges that need to be addressed. Here are some of the main challenges in
deep learning:
1. Data availability: Deep learning requires large amounts of data to learn from, and
gathering enough training data is a major concern.
2. Computational resources: Training deep learning models is computationally
expensive and often requires specialized hardware such as GPUs and TPUs.
3. Time consumption: Training on large or sequential data can take a very long
time, sometimes days or months, depending on the computational resources.
4. Interpretability: Deep learning models are complex and work like a black box,
making their results very difficult to interpret.
5. Overfitting: When a model is trained too long, it becomes too specialized
to the training data, leading to overfitting and poor performance on new data.
Deep Learning:
Deep learning is a sub-field of the broader family of machine learning
which makes use of neural networks (similar to the neurons working in our brain) to
mimic human brain-like behavior. DL algorithms focus on information-processing
patterns to identify and classify information much as the human brain does.
DL works on larger sets of data than classical ML, and the prediction mechanism is
self-administered by the machine.
Below is a table of differences between Artificial Intelligence, Machine Learning and
Deep Learning:
Artificial Intelligence:
- Search trees and much complex math are involved in AI.
- Three broad categories/types of AI are Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI), and Artificial Super Intelligence (ASI).
- The efficiency of AI is basically the efficiency provided by ML and DL respectively.
- AI can be further broken down into various subfields such as robotics, natural language processing, computer vision, expert systems, and more.

Machine Learning:
- If you have a clear idea about the logic (math) involved and you can visualize the complex functionalities, like K-Means or Support Vector Machines, then it defines the ML aspect.
- Three broad categories/types of ML are Supervised Learning, Unsupervised Learning, and Reinforcement Learning.
- Less efficient than DL, as it cannot work with higher-dimensional or larger amounts of data.
- ML algorithms can be categorized as supervised, unsupervised, or reinforcement learning. In supervised learning, the algorithm is trained on labeled data, where the desired output is known; in unsupervised learning, it is trained on unlabeled data, where the desired output is unknown.

Deep Learning:
- If you are clear about the math involved but do not have an idea about the features, you break the complex functionality into linear/lower-dimension features by adding more layers; this defines the DL aspect.
- DL can be considered as neural networks with a large number of parameters and layers, lying in one of four fundamental network architectures: Unsupervised Pre-trained Networks, Convolutional Neural Networks, Recurrent Neural Networks, and Recursive Neural Networks.
- More powerful than ML, as it can easily work with larger sets of data.
- DL algorithms are inspired by the structure and function of the human brain, and they are particularly well suited to tasks such as image and speech recognition.
There are three basic types of learning paradigms widely associated with machine
learning, namely
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised Learning
Supervised learning is a machine learning task in which a function maps the input to
output data using the provided input-output pairs.
The above statement states that in this type of learning, you need to give both the input
and output(usually in the form of labels) to the computer for it to learn from it. What the
computer does is that it generates a function based on this data, which can be anything
like a simple line, to a complex convex function, depending on the data provided.
This is the most basic type of learning paradigm, and most algorithms we learn today are
based on this type of learning pattern. Some examples are:
Reference : https://www.geeksforgeeks.org/supervised-unsupervised-learning/
Regression: Machine is trained to predict some value like price, weight or height.
Unsupervised Learning
In this type of learning paradigm, the computer is provided with just the input to
develop a learning pattern. It is basically learning from no results!
This means that the computer has to recognize a pattern in the given input and develop
a learning algorithm accordingly. So we conclude that "the machine learns through
observation and finds structures in data". This is still a relatively unexplored field of
machine learning, and big tech giants like Google and Microsoft are currently
researching developments in it.
Reference : https://www.geeksforgeeks.org/supervised-unsupervised-learning/
Clustering: A clustering problem is where you want to discover the inherent groupings
in the data
Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data
Reinforcement Learning
This learning paradigm is like a dog trainer teaching a dog how to respond to
specific signs, like a whistle, clap, or anything else. Whenever the dog responds correctly,
the trainer rewards the dog, perhaps with a bone or a biscuit.
As with any other AI algorithm, a programming framework is required to create deep learning
algorithms. These are usually extensions of existing frameworks, or specialized frameworks
developed to create deep learning algorithms. Each framework comes with its own drawbacks
and advantages. Let’s delve deeper into some of the most popular and powerful deep learning
frameworks.
1. TensorFlow
TensorFlow is a machine learning and deep learning framework that was created and released by
Google in 2015. TensorFlow is the most popular deep learning framework in use today, as it is
not only used by big leaders like Google, NVIDIA, and Uber, but also by data scientists and AI
practitioners on a daily basis.
TensorFlow is a library for Python, although work is being done to port it to other popular
languages like Java, JavaScript, C++, and more. However, a lot of resources are required to
create a deep learning model with TensorFlow, as it relies on using a lot of coding to specify the
structure of the network.
A commonly cited drawback of TensorFlow is that it operates with a static computation graph,
meaning that the whole graph must be re-run to see the effect of changes. However, the platform
itself is extremely extensible and powerful, contributing to its high degree of adoption.
2. PyTorch
In many ways, PyTorch is TensorFlow’s primary competitor in the deep learning framework
market. Developed and created by Facebook, PyTorch is an open-source deep learning
framework that works with Python. Apart from powering most of Facebook's services, it is also
used by other companies such as Johnson & Johnson, Twitter, and Salesforce.
As suggested by its name, PyTorch offers in-built support for Python, and even allows users to
use the standard debuggers provided by the software. As opposed to TensorFlow’s static graph,
PyTorch has a dynamic computation graph, allowing users to easily see what effect their changes
will have on the end result while programming the solution.
The framework offers easier options for training neural networks, utilizing modern technologies
like data parallelism and distributed learning. The community for PyTorch is also highly active,
with pre-trained models being published on a regular basis. However, TensorFlow beats PyTorch
in terms of providing cross-platform solutions, as Google’s vertical integration with Android
offers more power to TF users.
3. Keras
Keras is a deep learning framework that is built on top of other prominent frameworks like
TensorFlow, Theano, and the Microsoft Cognitive Toolkit (CNTK). Even though it loses out
to PyTorch and TensorFlow in terms of programmability, it is the ideal starting point for
beginners learning about neural networks.
Keras allows users to create large and complex models with simple commands. While this means
that it is not as configurable as its competitors, creating prototypes and proofs-of-concept is a
much easier task. It is also accessible as an application programming interface (API), making the
software accessible in any scenario.
Deep learning and neural networks are at the forefront of AI research and technology today. This
sets the stage for even more advancements in deep learning, as research progresses, and newer
methods enter the mainstream.
With the rise of open-source and accessible tools like TensorFlow and Keras, deep learning is
sure to get the ball rolling in terms of enterprise adoption. Supporting infrastructure, such as
powerful and accessible cloud computing and marketplaces for pre-trained models, are also
laying the groundwork for greater adoption of the technology.
Working professionals in the AI space must learn skills that can be used in deep learning
applications. Neural networks may hold the key to a future where AI can function at levels that
are unheard of today.
Also known as fully connected neural networks, this architecture is often identified with its
multilayer perceptrons, where the neurons of one layer are connected to the next layer. It
was designed by Frank Rosenblatt, an American psychologist, in 1958, and involves adapting
the model to fundamental binary data inputs. The following functions are included in this model:
Linear function: Rightly termed, it represents a single line which multiplies its inputs
with a constant multiplier.
Non-Linear function: It is further divided into three subsets:
1. Sigmoid Curve: It is a function interpreted as an S-shaped curve with its range from 0 to
1.
2. Hyperbolic tangent (tanh) refers to the S-shaped curve having a range of -1 to 1.
3. Rectified Linear Unit (ReLU): A piecewise function that yields 0 when the input
value is below a set threshold and yields the input itself (a linear multiple) when it
is above the threshold.
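The three non-linear functions above can be written out directly as a short NumPy sketch (using zero as the ReLU threshold, the standard choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # S-shaped curve, range (0, 1)

def tanh(z):
    return np.tanh(z)                 # S-shaped curve, range (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # 0 below zero, the input itself above

z = np.array([-2.0, 0.0, 2.0])
s, t, r = sigmoid(z), tanh(z), relu(z)
```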
Works Best in:
1. Any tabular dataset with rows and columns, such as data formatted in CSV
2. Classification and regression problems with real-valued inputs
3. Any model requiring the high flexibility of ANNs
2. Convolutional Neural Networks
CNN is an advanced and high-potential type of the classic artificial neural network model. It is
built for tackling higher complexity, preprocessing, and data compilation. It takes reference
from the order of arrangement of neurons present in the visual cortex of an animal brain.
CNNs can be considered one of the most efficient and flexible models, specializing in
image as well as non-image data. Once the input data is imported into the convolutional
model, there are four stages involved in building the CNN:
Convolution: The process derives feature maps from input data, followed by a function
applied to these maps.
Max-Pooling: It downsamples the feature maps, helping the CNN detect an image even
when it has undergone modifications such as small shifts.
Flattening: In this stage, the data generated is then flattened for a CNN to analyze.
Full Connection: It is often described as a hidden layer that compiles the loss function
for a model.
CNNs are well suited to tasks including image recognition, image analysis, image
segmentation, video analysis, and natural language processing, and there are other
scenarios where CNN networks can prove useful as well.
3. Recurrent Neural Networks
RNNs were first designed to help predict sequences; the Long Short-Term Memory
(LSTM) algorithm, for example, is known for its many capabilities. Such networks work
entirely on data sequences of variable input length.
The RNN puts the knowledge gained from its previous state as an input value for the current
prediction. Therefore, it can help in achieving short-term memory in a network, leading to the
effective management of stock price changes, or other time-based data systems.
As mentioned earlier, there are two overall types of RNN designs that help in analyzing
problems. They are:
LSTMs: Useful in the prediction of data in time sequences, using memory. It has three
gates: Input, Output, and Forget.
Gated RNNs (GRUs): Also useful in data prediction of time sequences via memory. It has
two gates: Update and Reset.
Works Best in:
One to One: A single input connected to a single output, like Image classification.
One to many: A single input linked to output sequences, like Image captioning that
includes several words from a single image.
Many to One: Series of inputs generating single output, like Sentiment Analysis.
Many to many: Series of inputs yielding series of outputs, like video classification.
It is also widely used in language translation, conversation modeling, and more.
4. Generative Adversarial Networks (GANs)
A GAN consists of two competing networks: the Generator keeps producing artificial data
resembling the real data, while the Discriminator continuously tries to tell real data from
fake. In a scenario where there is a requirement to create an image library, the Generator,
typically a deconvolution neural network, would produce simulated data resembling the
authentic images, and an image-detector network would differentiate between the real and
fake images. Starting from a 50% chance of accuracy, the detector must improve its quality
of classification as the generator grows better at artificial image generation. Such
competition contributes to the overall effectiveness and speed of the network.
5. Self-Organizing Maps
SOMs, or Self-Organizing Maps, operate with the help of unsupervised data to reduce
the number of random variables in a model. In this deep learning technique, the
output dimension is fixed as a two-dimensional model, as each synapse connects to its
input and output nodes.
As each data point competes for its model representation, the SOM updates the weight of the
closest nodes or Best Matching Units (BMUs). Based on the proximity of a BMU, the value
of the weights changes. As the weights are considered a characteristic of the node itself,
their values represent the node's location in the network.
6. Boltzmann Machines
Different from all the preceding deterministic network models, the Boltzmann Machine is
referred to as stochastic. This network model does not come with any predefined direction
and therefore has its nodes connected in a circular arrangement. Because of this
uniqueness, the technique is used to produce model parameters.
Works Best in:
System monitoring
Setting up a binary recommendation platform
Analyzing specific datasets
7. Deep Reinforcement Learning
In this network model, there is an input layer, an output layer, and multiple hidden
layers, where the state of the environment is the input layer itself. The model works by
continuously attempting to predict the future reward of each action taken in the given
state of the environment.
8. Autoencoders
One of the most commonly used deep learning techniques, this model operates
automatically based on its inputs before applying an activation function and decoding
the final output. Such a bottleneck formation yields fewer categories of data and
leverages most of the inherent data structure.
The Types of Autoencoders are:
Sparse – Here the hidden layer outnumbers the input layer, and a sparsity constraint
forces generalization to reduce overfitting. It limits the loss function and prevents
the autoencoder from overusing all of its nodes.
Denoising – Here, a modified version of the inputs, with random values set to 0, is fed
to the network, which must reconstruct the original.
Contractive – A penalty factor is added to the loss function to limit overfitting and
data copying in case the hidden layer outnumbers the input layer.
Stacked – Once another hidden layer is added to an autoencoder, it leads to two
stages of encoding and one stage of decoding.
Works Best in:
Feature detection
Setting up a compelling recommendation model
Adding features to large datasets
9. Backpropagation
First, the network analyzes the parameters and makes a decision about the data
Second, the decision is weighed with a loss function
Third, the identified error gets back-propagated to self-adjust any incorrect parameters
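The three steps above can be sketched with a single linear unit; the input, target, learning rate, and iteration count are hypothetical:

```python
import numpy as np

# One linear unit trained on a hypothetical input/target pair
x, target = np.array([1.0, 2.0]), 3.0
w = np.zeros(2)

for _ in range(100):
    pred = w @ x                      # 1) the network produces an output
    loss = (pred - target) ** 2       # 2) the output is weighed by a loss function
    grad = 2.0 * (pred - target) * x  # 3) the error is propagated back to the weights
    w -= 0.05 * grad                  # self-adjust the incorrect parameters
```

After the loop, the unit's prediction matches the target almost exactly.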
Works Best in:
Data Debugging
10. Gradient Descent
In the mathematical context, a gradient is a slope with a measurable angle that can be
represented as a relationship between variables. In this deep learning technique, the
relationship between the error produced by the neural network and the data parameters
can be represented as "x" and "y". Since the variables are dynamic in a neural network,
the error can be increased or decreased with small changes.
Many professionals visualize the technique as a river finding its path down the mountain
slopes. The objective of such a method is to find the optimum solution. Since a neural
network contains several local minimum solutions, in which the data can get trapped and
lead to slower, incorrect convergence, there are ways to avoid such events.
Like the terrain of a mountain, there are particular functions in the neural network,
called convex functions, which keep the data flowing at expected rates toward its
minimum. There can be differences in the paths the data takes to the final destination
due to variation in the initial values of the function.
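The descent down such a slope can be sketched in one dimension, using the convex function f(x) = (x - 2)^2 as a stand-in for the error surface (the function and starting point are invented for the illustration):

```python
# Gradient descent on the convex function f(x) = (x - 2)^2: each step moves
# against the slope, like water running down a valley toward the minimum
def grad(x):
    return 2.0 * (x - 2.0)      # derivative of (x - 2)^2

x = 10.0                        # hypothetical starting point high on the slope
for _ in range(100):
    x -= 0.1 * grad(x)          # a small step in the direction of steepest descent
# x is now very close to 2.0, the function's minimum
```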
Now we know how combining lines with different weights and biases can result in non-
linear models. How does a neural network know what weight and bias values to have in
each layer? It is no different from how we did it for the single perceptron model.
We still make use of the gradient descent optimization algorithm, which acts to minimize the
error of our model by iteratively moving in the direction of steepest descent, the direction
that updates the parameters of our model while ensuring minimal error. It updates the weights
in every single layer. We will talk more about optimization algorithms and backpropagation later.
It is important to recognize the result of the subsequent training of our neural network:
classification is done by dividing our data samples with some decision boundary.
"The process of receiving an input to produce some kind of output to make some kind of prediction is
known as Feed Forward." Feed Forward neural network is the core of many other important neural
networks such as convolution neural network.
In a feed-forward neural network, there are no feedback loops or connections. There is
simply an input layer, a hidden layer, and an output layer.
There can be multiple hidden layers, depending on what kind of data you are dealing with.
The number of hidden layers is known as the depth of the neural network. A deeper network
can learn more complex functions. The input layer first provides the neural network with
data, and the output layer then makes predictions on that data based on a series of
functions. The ReLU function is the most commonly used activation function in deep neural
networks.
To gain a solid understanding of the feed-forward process, let's see this mathematically.
1) The first input is fed to the network, represented as the matrix of x1, x2, and 1,
where 1 is the bias value.
2) Each input is multiplied by the weights of the first and second models to obtain its
probability of being in the positive region under each model.
So, we multiply our inputs by a matrix of weights using matrix multiplication.
3) After that, we take the sigmoid of our scores, which gives us the probability of the
point being in the positive region in both models.
4) We multiply the probabilities obtained from the previous step by the second set of
weights. We always include a bias of one whenever taking a combination of inputs.
And, as we know, to obtain the probability of the point being in the positive region of
this model, we take the sigmoid, thus producing our final output in the feed-forward
process.
Let us take the neural network we had previously, with the linear models in the hidden
layer combining to form the non-linear model in the output layer.
So, what we will do we use our non-linear model to produce an output that describes the probability of
the point being in the positive region. The point was represented by 2 and 2. Along with bias, we will
represent the input as
Recall the first linear model in the hidden layer and the equation that defined it: to obtain the linear combination in the first layer, the inputs are multiplied by the weights -4 and -1, and the bias input is multiplied by 12. In the second model, the inputs are multiplied by the weights -1/5 and 1, and the bias input is multiplied by 3, to obtain the linear combination of that same point.
Now, to obtain the probability that the point is in the positive region relative to both models, we apply the sigmoid to both scores as
The second layer contains the weights which dictated the combination of the linear models in the first
layer to obtain the non-linear model in the second layer. The weights are 1.5, 1, and a bias value of 0.5.
Now, we have to multiply our probabilities from the first layer with the second set of weights as
This is the complete math behind the feed-forward process, in which the inputs traverse the entire depth of the neural network. In this example, there is only one hidden layer; whether there is one hidden layer or twenty, the computational process is the same for every hidden layer.
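The worked example above can be checked numerically. The following sketch (using the weights and the point (2, 2) given in the text) reproduces each step of the feed-forward pass:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Input point (2, 2) with a bias input of 1, as in the example above.
x1, x2 = 2.0, 2.0

# Hidden layer: the two linear models from the text
# (weights -4, -1 with bias weight 12, and -1/5, 1 with bias weight 3).
h1 = sigmoid(-4 * x1 - 1 * x2 + 12)    # sigmoid(2)
h2 = sigmoid(-0.2 * x1 + 1 * x2 + 3)   # sigmoid(4.6)

# Output layer: weights 1.5 and 1 with bias weight 0.5.
output = sigmoid(1.5 * h1 + 1 * h2 + 0.5)
print(round(output, 3))
```

Running this prints a value close to 0.943, the probability that the point lies in the positive region of the combined non-linear model.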
Artificial Neural Networks and their Applications
As you read this article, which organ in your body is thinking about it? It’s the brain of
course! But do you know how the brain works? Well, it has neurons or nerve cells that
are the primary units of both the brain and the nervous system. These neurons receive
sensory input from the outside world which they process and then provide the output
which might act as the input to the next neuron.
Each of these neurons is connected to other neurons in complex arrangements at
synapses. Now, are you wondering how this is related to Artificial Neural Networks?
Well, Artificial Neural Networks are modeled after the neurons in the human brain.
Let’s check out what they are in detail and how they learn information.
Artificial Neural Networks contain artificial neurons which are called units. These units
are arranged in a series of layers that together constitute the whole Artificial Neural
Network in a system. A layer can have anywhere from a dozen units to millions of units, depending on how complex the network needs to be to learn the hidden patterns in the dataset. Commonly, an Artificial Neural Network has an input layer, an
output layer as well as hidden layers. The input layer receives data from the outside
world which the neural network needs to analyze or learn about. Then this data
passes through one or multiple hidden layers that transform the input into data that is
valuable for the output layer. Finally, the output layer provides an output in the form of
a response of the Artificial Neural Networks to input data provided.
In the majority of neural networks, units are interconnected from one layer to another.
Each of these connections has weights that determine the influence of one unit on
another unit. As the data transfers from one unit to another, the neural network learns
more and more about the data which eventually results in an output from the output
layer.
Neural Networks Architecture
The structures and operations of human neurons serve as the basis for artificial neural networks, which are also known simply as neural networks or neural nets. The input layer of an
artificial neural network is the first layer, and it receives input from external sources
and releases it to the hidden layer, which is the second layer. In the hidden layer, each
neuron receives input from the previous layer neurons, computes the weighted sum,
and sends it to the neurons in the next layer. These connections are weighted, meaning that the effect of each input from the previous layer is scaled up or down by assigning a different weight to each input; the weights are adjusted during the training process to improve model performance.
Artificial neurons vs Biological neurons
The concept of artificial neural networks comes from biological neurons found in animal brains, so the two share many similarities in structure and function.
Structure: The structure of artificial neural networks is inspired by biological
neurons. A biological neuron has a cell body or soma to process the impulses,
dendrites to receive them, and an axon that transfers them to other neurons. The
input nodes of artificial neural networks receive input signals, the hidden layer
nodes compute these input signals, and the output layer nodes compute the final
output by processing the hidden layer’s results using activation functions.
Biological Neuron Artificial Neuron
Dendrite Inputs
Synapses Weights
Axon Output
Synapses: Synapses are the links between biological neurons that enable the
transmission of impulses from dendrites to the cell body. In artificial neurons, synapses correspond to the weights that join the nodes of one layer to the nodes of the next layer. The strength of a link is determined by its weight value.
Learning: In biological neurons, learning happens in the cell body nucleus or soma,
which has a nucleus that helps to process the impulses. An action potential is
produced and travels through the axons if the impulses are powerful enough to
reach the threshold. This is made possible by synaptic plasticity, which
represents the ability of synapses to become stronger or weaker over time in
reaction to changes in their activity. In artificial neural networks, backpropagation
is a technique used for learning, which adjusts the weights between nodes
according to the error or differences between predicted and actual outcomes.
Activation: In biological neurons, activation is the firing rate of the neuron which
happens when the impulses are strong enough to reach the threshold. In artificial neural networks, a mathematical function known as an activation function maps the input to the output and executes the activation.
Biological neurons to Artificial neurons
Artificial neural networks are trained using a training set. For example, suppose you
want to teach an ANN to recognize a cat. Then it is shown thousands of different
images of cats so that the network can learn to identify a cat. Once the neural network
has been trained enough using images of cats, then you need to check if it can identify
cat images correctly. This is done by making the ANN classify the images it is
provided by deciding whether they are cat images or not. The output obtained by the
ANN is corroborated by a human-provided description of whether the image is a cat
image or not. If the ANN identifies incorrectly then back-propagation is used to adjust
whatever it has learned during training. Backpropagation is done by fine-tuning the
weights of the connections in ANN units based on the error rate obtained. This
process continues until the artificial neural network can correctly recognize a cat in an
image with minimal possible error rates.
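The adjust-and-repeat cycle described above can be illustrated with a minimal sketch. The example below (an illustration on toy numeric data, not the cat classifier itself) trains a single sigmoid neuron by gradient descent, nudging its weight and bias after each example until the predictions match the labels:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: one feature, binary label (illustrative values only).
samples = [(0.5, 0), (1.5, 0), (2.5, 1), (3.5, 1)]
w, b, lr = 0.0, 0.0, 0.5

for epoch in range(200):
    for x, y in samples:
        p = sigmoid(w * x + b)   # forward pass: current prediction
        grad = p - y             # error signal (cross-entropy + sigmoid)
        w -= lr * grad * x       # back-propagate the error into the weight
        b -= lr * grad           # ... and into the bias

print(sigmoid(w * 3.5 + b) > 0.5, sigmoid(w * 0.5 + b) < 0.5)
```

After training, the neuron classifies the large inputs as 1 and the small ones as 0, mirroring the fine-tuning of connection weights based on the error rate described above.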
Feedforward Neural Network : The feedforward neural network is one of the most
basic artificial neural networks. In this ANN, the data or the input provided travels in
a single direction. It enters into the ANN through the input layer and exits through
the output layer while hidden layers may or may not exist. So the feedforward
neural network has a front-propagated wave only and usually does not have
backpropagation.
Convolutional Neural Network : A Convolutional neural network has some
similarities to the feed-forward neural network, where the connections between
units have weights that determine the influence of one unit on another unit. But a
CNN has one or more than one convolutional layer that uses a convolution
operation on the input and then passes the result obtained in the form of output to
the next layer. CNN has applications in speech and image processing which is
particularly useful in computer vision.
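The convolution operation mentioned above can be sketched in a few lines. This toy example (illustrative values only) slides a small kernel over an image and sums the element-wise products at each position, which is what a convolutional layer computes before applying its activation:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Sum of element-wise products over the current window.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]], dtype=float)
edge = np.array([[1, -1]], dtype=float)   # simple horizontal edge detector
print(conv2d(image, edge))
```

With the 1x2 kernel [1, -1], every output entry is -1, since each pixel in this example image is exactly one less than its right-hand neighbour.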
Modular Neural Network: A Modular Neural Network contains a collection of
different neural networks that work independently towards obtaining the output with
no interaction between them. Each of the different neural networks performs a
different sub-task by obtaining unique inputs compared to other networks. The
advantage of this modular neural network is that it breaks down a large and
complex computational process into smaller components, thus decreasing its
complexity while still obtaining the required output.
Radial basis function Neural Network: Radial basis functions are those functions
that consider the distance of a point concerning the center. RBF functions have two
layers. In the first layer, the input is mapped into all the Radial basis functions in the
hidden layer and then the output layer computes the output in the next step. Radial
basis function nets are normally used to model the data that represents any
underlying trend or function.
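The two-layer structure just described can be sketched with Gaussian radial basis functions (the centers and output weights below are illustrative assumptions):

```python
import numpy as np

def rbf(x, center, width=1.0):
    """Gaussian RBF: the response depends only on the distance to the center."""
    return np.exp(-np.sum((x - center) ** 2) / (2 * width ** 2))

centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
x = np.array([0.1, 0.0])

# First layer: map the input into every radial basis function.
hidden = np.array([rbf(x, c) for c in centers])

# Second layer: a weighted sum of the RBF activations (illustrative weights).
weights = np.array([0.7, 0.3])
output = hidden @ weights
print(output)
```

Note that the response is maximal (1.0) when the input sits exactly on a center and falls off with distance, which is why RBF nets suit smooth trend-fitting.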
Recurrent Neural Network: The Recurrent Neural Network saves the output of a
layer and feeds this output back to the input to better predict the outcome of the
layer. The first layer in the RNN is quite similar to the feed-forward neural network
and the recurrent neural network starts once the output of the first layer is
computed. After this layer, each unit will remember some information from the
previous step so that it can act as a memory cell in performing computations.
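A single recurrent step can be sketched as follows: the hidden state h is the "memory cell", and each step feeds the previous output back in alongside the new input (the weight shapes and random values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
Wx = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input-to-hidden weights
Wh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden-to-hidden: the feedback loop
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)  # memory carried between time steps
sequence = [rng.normal(size=n_in) for _ in range(5)]
for x in sequence:
    # Each step combines the new input with the previous step's output.
    h = np.tanh(Wx @ x + Wh @ h + b)

print(h.shape)
```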
Applications of Artificial Neural Networks
1. Social Media: Artificial Neural Networks are used heavily in Social Media. For
example, let’s take the ‘People you may know’ feature on Facebook that suggests
people that you might know in real life so that you can send them friend requests.
Well, this magical effect is achieved by using Artificial Neural Networks that analyze
your profile, your interests, your current friends, and also their friends and various
other factors to calculate the people you might potentially know. Another common
application of Machine Learning in social media is facial recognition. This is done
by finding around 100 reference points on the person’s face and then matching
them with those already available in the database using convolutional neural
networks.
2. Marketing and Sales: When you log onto E-commerce sites like Amazon and
Flipkart, they recommend products for you to buy based on your previous browsing history. Similarly, if you love pasta, then Zomato, Swiggy, etc. will
show you restaurant recommendations based on your tastes and previous order
history. This is true across all new-age marketing segments like Book sites, Movie
services, Hospitality sites, etc. and it is done by implementing personalized
marketing. This uses Artificial Neural Networks to identify the customer likes,
dislikes, previous shopping history, etc., and then tailor the marketing campaigns
accordingly.
3. Healthcare: Artificial Neural Networks are used in Oncology to train algorithms that
can identify cancerous tissue at the microscopic level at the same accuracy as
trained physicians. Various rare diseases that manifest in physical characteristics can be identified in their early stages by using facial analysis on the
patient photos. So the full-scale implementation of Artificial Neural Networks in the
healthcare environment can only enhance the diagnostic abilities of medical experts
and ultimately lead to the overall improvement in the quality of medical care all over
the world.
4. Personal Assistants: I am sure you all have heard of Siri, Alexa, Cortana, etc.,
and may even have used them, depending on the phone you have! These are personal
assistants and an example of speech recognition that uses Natural Language
Processing to interact with the users and formulate a response accordingly.
Natural Language Processing uses artificial neural networks that are made to
handle many tasks of these personal assistants such as managing the language
syntax, semantics, correct speech, the conversation that is going on, etc.
Activation functions in Neural Networks
Tanh Function
Value Range :- -1 to +1
Nature :- non-linear
Uses :- Usually used in hidden layers of a neural network, as its values lie between -1 and 1; the mean of the hidden-layer activations therefore comes out to be 0 or very close to it, which helps center the data by bringing the mean close to 0. This makes learning for the next layer much easier.
RELU Function
It stands for Rectified Linear Unit. It is the most widely used activation function.
Chiefly implemented in hidden layers of Neural network.
Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.
Value Range :- [0, inf)
Nature :- non-linear, which means we can easily backpropagate the errors and
have multiple layers of neurons being activated by the ReLU function.
Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any given time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute. In simple terms, ReLU learns much faster than the sigmoid and tanh functions.
Softmax Function
The softmax function is also a type of sigmoid function but is handy when we are
trying to handle multi- class classification problems.
Nature :- non-linear
Uses :- Usually used when trying to handle multiple classes; the softmax function is commonly found in the output layer of image classification problems. The softmax function squeezes the output for each class to a value between 0 and 1 and divides by the sum of the outputs.
Output:- The softmax function is ideally used in the output layer of the classifier
where we are actually trying to attain the probabilities to define the class of each
input.
The basic rule of thumb is if you really don’t know what activation function to use,
then simply use RELU as it is a general activation function in hidden layers and is
used in most cases these days.
If your output is for binary classification, then the sigmoid function is a very natural choice for the output layer.
If your output is for multi-class classification, then softmax is very useful for predicting the probability of each class.
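The activation functions discussed above can be written out in a few lines of NumPy (a sketch for experimentation, not a library implementation):

```python
import numpy as np

def tanh(z):
    return np.tanh(z)                 # range (-1, 1), zero-centered

def relu(z):
    return np.maximum(0.0, z)         # A(z) = max(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # range (0, 1)

def softmax(z):
    e = np.exp(z - np.max(z))         # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0])
print(relu(np.array([-2.0, 3.0])))    # negative score clipped to 0
print(softmax(scores))                # probabilities that sum to 1
```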
Multi Layered Neural Networks in R Programming
A series or set of algorithms that try to recognize the underlying relationship in a data
set through a definite process that mimics the operation of the human brain is known
as a Neural Network. Hence, a neural network can refer to a network of neurons, whether artificial or organic in nature. A neural network can easily adapt to changing input to generate the best possible result without needing to redesign the output criteria.
Neural Networks can be classified into multiple types based on their layers, depth, activation filters, structure, neurons used, neuron density, data flow, and so on. The types of Neural Networks are as follows:
Perceptron
Multi-Layer Perceptron or Multi-Layer Neural Network
Feed Forward Neural Networks
Convolutional Neural Networks
Radial Basis Function Neural Networks
Recurrent Neural Networks
Sequence to Sequence Model
Modular Neural Network
Multi-Layer Neural Network
To be accurate, a fully connected Multi-Layered Neural Network is known as a Multi-Layer Perceptron. A Multi-Layered Neural Network consists of multiple layers of artificial neurons or nodes. Unlike Single-Layer Neural Networks, most networks in recent times are Multi-Layered. The following diagram is a visualization
of a multi-layer neural network.
Explanation: Here the nodes marked as “1” are known as bias units. The leftmost
layer or Layer 1 is the input layer, the middle layer or Layer 2 is the hidden layer and
the rightmost layer or Layer 3 is the output layer. We can say that the above diagram has 3 input units (leaving out the bias unit), 1 output unit, and 4 hidden units (the bias unit is not included).
A Multi-layered Neural Network is a typical example of the Feed Forward Neural
Network. The number of neurons and the number of layers constitute the hyperparameters of the Neural Network, which need tuning. In order to find ideal values
for the hyperparameters, one must use some cross-validation techniques. Using the
Back-Propagation technique, weight adjustment training is carried out.
Formula for Multi-Layered Neural Network
Suppose we have n inputs (x1, x2, ..., xn) and a bias unit. Let the weights applied be w1, w2, ..., wn. Then the summation over the inputs, weights, and bias unit is found by performing a dot product between the inputs and the weights:
z = w1*x1 + w2*x2 + ... + wn*xn + b
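The same summation can be written as a short NumPy sketch (the input, weight, and bias values below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs x1..xn (illustrative values)
w = np.array([0.4, 0.1, -0.6])   # weights w1..wn (illustrative values)
b = 2.0                          # weight on the bias unit

z = np.dot(w, x) + b             # z = w1*x1 + w2*x2 + ... + wn*xn + b
a = sigmoid(z)                   # activation applied to the summation
print(z, a)
```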
Step 1: Load the MASS package, which contains the Boston housing dataset used below, and set a random seed for reproducibility.
set.seed(500)
library(MASS)
data <- Boston
Step 2: Then check for missing values or data points in the dataset. If any, then fix the data points which are missing.
apply(data, 2, function(x) sum(is.na(x)))
Output:
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0 0 0 0 0 0 0 0 0 0 0 0 0 0
Step 3: Since no data points are missing, proceed towards preparing the data set. Now randomly split the data into two sets, a train set and a test set. On preparing the data, try to fit the data on a linear regression model and then test it on the test set.
index <- sample(1:nrow(data), round(0.75 * nrow(data)))
train <- data[index, ]
test <- data[-index, ]
lm.fit <- glm(medv ~ ., data = train)
summary(lm.fit)
Output:
Deviance Residuals:
Min 1Q Median 3Q Max
-14.9143 -2.8607 -0.5244 1.5242 25.0004
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.469681 6.099347 7.127 5.50e-12 ***
crim -0.105439 0.057095 -1.847 0.065596 .
zn 0.044347 0.015974 2.776 0.005782 **
indus 0.024034 0.071107 0.338 0.735556
chas 2.596028 1.089369 2.383 0.017679 *
nox -22.336623 4.572254 -4.885 1.55e-06 ***
rm 3.538957 0.472374 7.492 5.15e-13 ***
age 0.016976 0.015088 1.125 0.261291
dis -1.570970 0.235280 -6.677 9.07e-11 ***
rad 0.400502 0.085475 4.686 3.94e-06 ***
tax -0.015165 0.004599 -3.297 0.001072 **
ptratio -1.147046 0.155702 -7.367 1.17e-12 ***
black 0.010338 0.003077 3.360 0.000862 ***
lstat -0.524957 0.056899 -9.226 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Step 4: Before fitting the neural network, scale the data to the [0, 1] interval using min-max normalization, then split the scaled data using the same index.
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
scaled <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
train_ <- scaled[index, ]
test_ <- scaled[-index, ]
Step 5: Now fit the data into the Neural network. Use the neuralnet package.
library(neuralnet)
n <- names(train_)
f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))
nn <- neuralnet(f,
                data = train_,
                hidden = c(5, 3),
                linear.output = T)
Now our model is fitted to the Multi-Layered Neural network. Now combine all the
steps and also plot the neural network to visualize the output. Use the plot() function
to do so.
# R program to illustrate
# a Multi-Layered Neural Network
set.seed(500)
library(MASS)
data <- Boston
index <- sample(1:nrow(data), round(0.75 * nrow(data)))
train <- data[index, ]
test <- data[-index, ]
lm.fit <- glm(medv ~ ., data = train)
summary(lm.fit)
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
scaled <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
train_ <- scaled[index, ]
test_ <- scaled[-index, ]
library(neuralnet)
n <- names(train_)
f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))
nn <- neuralnet(f, data = train_, hidden = c(5, 3), linear.output = T)
plot(nn)
Output:
UNIT-II
Abstract
Learning is posed as a problem of function estimation, for which two principles of solution are
considered: empirical risk minimization and structural risk minimization. These two principles
are applied to two different statements of the function estimation problem: global and local.
Systematic improvements in prediction power are illustrated in application to zip-code recognition.
1 INTRODUCTION
The structure of the theory of learning differs from that of most other theories for applied
problems. The search for a solution to an applied problem usually requires the three following
steps:
The first two steps of this procedure offer in general no major difficulties; the third step
requires most efforts, in developing computational algorithms to solve the problem at hand. In
the case of learning theory, however, many algorithms have been developed, but we still lack a
clear understanding of the mathematical statement needed to describe the learning procedure,
and of the general principle on which the search for solutions
should be based. This paper is devoted to these first two steps, the statement of the problem
and the general principle of solution. The paper is organized as follows. First, the problem of
function estimation is stated, and two principles of solution are discussed: the principle of
empirical risk minimization and the principle of structural risk minimization. A new statement is
then given: that of local estimation of function, to which the same principles are applied. An
application to zip-code recognition is used to illustrate these ideas.
3. A learning machine capable of implementing a set of functions f(x, w), w ∈ W. The problem of learning is that of choosing from the given set of functions the one which approximates best the supervisor's response. The selection is based on a training set of ℓ independent observations:

(x1, y1), ..., (xℓ, yℓ).   ---(1)

The formulation given above implies that learning corresponds to the problem of function approximation.
In order to choose the best available approximation to the supervisor's response, we measure the loss or discrepancy L(y, f(x, w)) between the response y of the supervisor to a given input x and the response f(x, w) provided by the learning machine. Consider the expected value of the loss, given by the risk functional

R(w) = ∫ L(y, f(x, w)) dP(x, y).   ---(2)

The goal is to minimize the risk functional R(w) over the class of functions f(x, w), w ∈ W. But the joint probability distribution P(x, y) = P(y|x)P(x) is unknown and the only available information is contained in the training set (1).
The risk functional R(w) is therefore replaced by the empirical risk functional

E(w) = (1/ℓ) Σ(i=1..ℓ) L(yi, f(xi, w))   ---(3)
constructed on the basis of the training set (1). The induction principle of empirical risk minimization (ERM) assumes that the function f(x, w*), which minimizes E(w) over the set w ∈ W, results in a risk R(w*) which is close to its minimum. This induction principle is quite general; many classical methods such as least squares or maximum likelihood are realizations of the ERM principle. The evaluation of the soundness of the ERM principle requires answers to the following two questions:
1. Is the principle consistent? (Does R(w*) converge to its minimum value on the set w ∈ W when ℓ → ∞?)
2. How fast is the convergence as ℓ increases?
The answers to these two questions have been shown (Vapnik et al., 1989) to be equivalent to the answers to the following two questions:
1. Does the empirical risk E(w) converge uniformly to the actual risk R(w) over the full set f(x, w), w ∈ W? Uniform convergence is defined as

Prob{ sup(w∈W) |R(w) - E(w)| > ε } → 0 as ℓ → ∞.   ---(4)
It is important to stress that uniform convergence (4) for the full set of functions is
a necessary and sufficient condition for the consistency of the ERM principle.
The theory of uniform convergence of empirical risk to actual risk, developed in the 70's and 80's, includes a description of necessary and sufficient conditions as well as bounds for the rate of convergence (Vapnik, 1982). These bounds, which are independent of the distribution function P(x, y), are based on a quantitative measure of the capacity of the set of functions implemented by the learning machine: the VC-dimension of the set. For simplicity, these bounds will be discussed here only for the case of binary pattern recognition, for which y ∈ {0, 1} and f(x, w), w ∈ W is a class of indicator functions. The loss function takes only two values: L(y, f(x, w)) = 0 if y = f(x, w) and L(y, f(x, w)) = 1 otherwise. In this case, the risk functional (2) is the probability of error, denoted P(w). The empirical risk functional (3), denoted ν(w), is the frequency of error in the training set. The VC-dimension of a set of indicator functions is the maximum number h of vectors which can be shattered in all possible 2^h ways using functions in the set. For instance, h = n + 1 for linear decision rules in n-dimensional space, since they can shatter at most n + 1 points.
The notion of VC-dimension provides a bound on the rate of uniform convergence. For a set of indicator functions with VC-dimension h, the following inequality holds:

Prob{ sup(w∈W) |P(w) - ν(w)| > ε } < (2ℓe/h)^h exp{-ε²ℓ}.   ---(5)

It then follows that with probability 1 - η, simultaneously for all w ∈ W,

P(w) < ν(w) + C0(ℓ/h, η),   ---(6)

with confidence interval

C0(ℓ/h, η) = sqrt( (h(ln(2ℓ/h) + 1) - ln η) / ℓ ).   ---(7)

This important result provides a bound on the actual risk P(w) for all w ∈ W, including the w* which minimizes the empirical risk ν(w). The deviation |P(w) - ν(w)| in (5) is expected to be maximal for P(w) close to 1/2, since it is this value of P(w) which maximizes the error variance σ(w) = sqrt(P(w)(1 - P(w))). The worst-case bound for the confidence interval (7) is thus likely to be controlled by the worst decision rule.
The bound (6) is achieved for the worst case P(w) = 1/2, but not for small P(w), which is the case of interest. A uniformly good approximation to P(w) follows from considering

Prob{ sup(w∈W) (P(w) - ν(w)) / σ(w) > ε }.   ---(8)

The variance of the relative deviation (P(w) - ν(w))/σ(w) is now independent of w. A bound for the probability (8), if available, would yield a uniformly good bound on the actual risk for all P(w). Such a bound has not yet been established. But for P(w) << 1, the approximation σ(w) ≈ sqrt(P(w)) holds, and the following inequality is valid:

Prob{ sup(w∈W) (P(w) - ν(w)) / sqrt(P(w)) > ε } < (2ℓe/h)^h exp{-ε²ℓ/4}.   ---(9)

It then follows that with probability 1 - η, simultaneously for all w ∈ W,

P(w) < ν(w) + C1(ℓ/h, ν(w), η),   ---(10)

with confidence interval

C1(ℓ/h, ν(w), η) = 2 ((h(ln(2ℓ/h) + 1) - ln η) / ℓ) (1 + sqrt( 1 + ν(w)ℓ / (h(ln(2ℓ/h) + 1) - ln η) )).   ---(11)

Note that the confidence interval now depends on ν(w), and that for ν(w) = 0 it reduces to C1(ℓ/h, 0, η) = 4C0²(ℓ/h, η), which provides a more precise bound for real-case learning.
The method of ERM can be theoretically justified by considering the inequalities (6) or (10). When ℓ/h is large, the confidence intervals C0 or C1 become small and can be neglected. The actual risk is then bounded by the empirical risk alone, and the probability of error on the test set can be expected to be small when the frequency of error on the training set is small. However, if ℓ/h is small, the confidence interval cannot be neglected, and even ν(w) = 0 does not guarantee a small probability of error. In this case the minimization of P(w) requires a new principle, based on the simultaneous minimization of ν(w) and the confidence interval. It is then necessary to control the VC-dimension of the learning machine. To do this, we introduce a nested structure of subsets Sp = {f(x, w), w ∈ Wp}, such that S1 ⊂ S2 ⊂ ... ⊂ Sn. The corresponding VC-dimensions of the subsets satisfy h1 < h2 < ... < hn. The principle of structural risk minimization (SRM) requires a two-step process: the empirical risk has to be minimized for each element of the structure, and the optimal element S* is then selected to minimize the guaranteed risk, defined as the sum of the empirical risk and the confidence interval. This process involves a trade-off: as h increases, the minimum empirical risk decreases, but the confidence interval increases.
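The SRM trade-off can be illustrated numerically. The sketch below (the empirical risks are illustrative assumptions; the confidence-interval form follows inequality (7) with η = 0.05) computes the guaranteed risk for a few capacities h and selects the one that minimizes it:

```python
import math

ell = 1000  # number of training examples

def confidence_interval(h):
    """C0(ell/h, eta) in the form of inequality (7), with eta = 0.05."""
    eta = 0.05
    return math.sqrt((h * (math.log(2 * ell / h) + 1) - math.log(eta)) / ell)

# Hypothetical empirical risks: richer structure elements fit the data better.
empirical_risk = {5: 0.50, 20: 0.10, 80: 0.05, 320: 0.04}

# Guaranteed risk = empirical risk + confidence interval; pick the minimizer.
guaranteed = {h: v + confidence_interval(h) for h, v in empirical_risk.items()}
best_h = min(guaranteed, key=guaranteed.get)
print(best_h, round(guaranteed[best_h], 3))
```

Here the richest element (h = 320) has the smallest empirical risk, but the intermediate h = 20 minimizes the guaranteed risk, exhibiting exactly the trade-off described above.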
The general principle of SRM can be implemented in many different ways. Here we consider
three different examples of structures built for the set of functions implemented by a neural
network .
1. Structure given by the architecture of the neural network. Consider an ensemble of fully connected neural networks in which the number of units in one of the hidden layers is monotonically increased. The set of implementable functions forms a structure as the number of hidden units is increased.
2. Structure given by the learning procedure. Consider the set of functions S = {f(x, w), w ∈ W} implementable by a neural net of fixed architecture. The parameters {w} are the weights of the neural network. A structure is introduced through Sp = {f(x, w), ||w|| ≤ Cp} with C1 < C2 < ... < Cn. For a convex loss function, the minimization of the empirical risk within the element Sp of the structure is achieved through the minimization of

E(w, γp) = (1/ℓ) Σ(i=1..ℓ) L(yi, f(xi, w)) + γp ||w||²

with appropriately chosen Lagrange multipliers γ1 > γ2 > ... > γn. The well-known "weight decay" procedure refers to the minimization of this functional.
3. Structure given by preprocessing. Consider a neural net with fixed architecture. The input representation is modified by a transformation z = K(x, β), where the parameter β controls the degree of degeneracy introduced by this transformation (for instance, β could be the width of a smoothing kernel). A structure is introduced in the set of functions S = {f(K(x, β), w), w ∈ W} through β ≥ cp with c1 > c2 > ... > cn.
Loss functions are one of the most important aspects of neural networks, as they (along
with the optimization functions) are directly responsible for fitting the model to the given
training data.
This article will dive into how loss functions are used in neural networks, different types
of loss functions, writing custom loss functions in TensorFlow, and practical
implementations of loss functions to process image and video training data — the
primary data types used for computer vision, my topic of interest & focus.
Background Information
First, a quick review of the fundamentals of neural networks and how they work.
This is how a neural network processes the input data at each layer and eventually produces a predicted output value.
Each training input is loaded into the neural network in a process called forward propagation. Once the model has produced an output, this predicted output is compared against the given target output in a process called backpropagation — the parameters of the model (its weights and biases) are then adjusted so that it outputs a result closer to the target output.
The parameters are adjusted to minimize the average loss — we find the weights, wT, and biases, b, that minimize the value of J (the average loss).
We can think of this akin to residuals, in statistics, which measure the distance of the
actual y values from the regression line (predicted values) — the goal being to minimize
the net distance.
For this article, we will use Google's TensorFlow library to implement different loss functions — it makes it easy to demonstrate how loss functions are used in models.
In TensorFlow, the loss function the neural network uses is specified as a parameter of model.compile() — the method that configures the model for training.
model.compile(loss='mse', optimizer='sgd')
The loss function can be passed either as a String — as shown above — or as a function object — either imported from TensorFlow or written as a custom loss function, as we will discuss later.
from tensorflow.keras.losses import mean_squared_error
model.compile(loss=mean_squared_error, optimizer='sgd')
It must be formatted this way because the model.compile() method expects either a string or a function object for the loss attribute.
In supervised learning, there are two main types of loss functions, corresponding to the two major types of neural networks: regression loss functions and classification loss functions.
One of the most popular loss functions, MSE finds the average of the squared differences between the target and the predicted outputs:
MSE = (1/n) Σ (y(i) - ŷ(i))²
This function has numerous properties that make it especially suited for calculating loss.
The difference is squared, which means it does not matter whether the predicted value is above or below the target value; however, larger errors are penalized more heavily. MSE is
also a convex function (as shown in the diagram above) with a clearly defined global
minimum — this allows us to more easily utilize gradient descent optimization to set
the weight values.
However, one disadvantage of this loss function is that it is very sensitive to outliers; if a
predicted value is significantly greater than or less than its target value, this will
significantly increase the loss.
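Both regression losses are simple to write out directly; the NumPy sketch below (illustrative values) also shows how a single outlier inflates MSE far more than MAE:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])

# Same predictions, but the last one is now a large outlier.
y_outlier = np.array([2.5, 0.0, 12.0])
print(mse(y_true, y_pred), mse(y_true, y_outlier))   # outlier dominates MSE
print(mae(y_true, y_pred), mae(y_true, y_outlier))   # MAE grows only linearly
```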
MAE finds the average of the absolute differences between the target and the predicted outputs:
MAE = (1/n) Σ |y(i) - ŷ(i)|
It also has some disadvantages; as the average distance approaches 0, gradient descent optimization does not work well, because the function's derivative is undefined at 0 and its magnitude stays constant rather than shrinking as the error shrinks.
Because of this, a loss function called a Huber Loss was developed, which has the
advantages of both MSE and MAE.
If the absolute difference between the actual and predicted value is less than or equal to a
threshold value, 𝛿, then MSE is applied. Otherwise — if the error is sufficiently large —
MAE is applied.
This is a TensorFlow implementation; it uses a wrapper function to make the threshold
variable available, which we will discuss in a little bit.
def huber_loss_with_threshold(t=1.0):
    def huber_loss(y_true, y_pred):
        error = y_true - y_pred
        within_threshold = tf.abs(error) <= t
        # the 0.5 factor keeps the two branches continuous at |error| = t
        small_error = 0.5 * tf.square(error)
        large_error = t * (tf.abs(error) - (0.5 * t))
        # tf.where selects elementwise; a Python if/else cannot branch on a tensor
        return tf.where(within_threshold, small_error, large_error)
    return huber_loss
Binary cross-entropy is the loss function used in binary classification models, where the
model takes in an input and has to classify it into one of two pre-set categories.
In binary classification, there are only two possible actual values of y: 0 or 1. Thus, to
accurately determine loss between the actual and predicted values, the function compares
the actual value (0 or 1) with the probability that the input aligns with that category,
where p(i) is the probability that the category is 1, and 1 − p(i) is the probability that
the category is 0.
In cases where the number of classes is greater than two, we utilize categorical cross-
entropy — this follows a very similar process to binary cross-entropy.
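To make the formula concrete, here is a small hand-rolled sketch of binary cross-entropy over a few made-up labels and predicted probabilities; this illustrates the math rather than the exact tf.keras implementation:

```python
import math

def binary_cross_entropy(y_true, p_pred):
    # average of -[y*log(p) + (1-y)*log(1-p)] over all examples
    n = len(y_true)
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    ) / n

# confident, mostly-correct predictions give a small loss
loss = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
```

Note how a correct prediction with probability 0.9 contributes only −log(0.9) ≈ 0.105, while a badly wrong one (say p = 0.1 for a true 1) would contribute −log(0.1) ≈ 2.3.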
As seen earlier, when writing neural networks, you can import loss functions as function
objects from the tf.keras.losses module, which provides built-in implementations of the
losses discussed above, among others.
However, there may be cases where these traditional/main loss functions may not be
sufficient. Some examples would be if there is too much noise in your training data
(outliers, erroneous attribute values, etc.) — which cannot be compensated for with data
preprocessing — or use in unsupervised learning (as we will discuss later). In these
instances, you can write custom loss functions to suit your specific conditions.
def custom_loss_function(y_true, y_pred):
    losses = ...  # compute the per-sample loss from y_true and y_pred
    return losses
Writing custom loss functions is very straightforward; the only requirement is that the
loss function must take in exactly two parameters, y_true (actual output) and y_pred
(predicted output), in that order.
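For instance, a custom loss obeying this two-parameter contract could look like the following sketch, with NumPy arrays standing in for TensorFlow tensors:

```python
import numpy as np

# a hand-written MSE following the (y_true, y_pred) contract that
# model.compile() expects of any loss function
def custom_mse(y_true, y_pred):
    return np.mean(np.square(y_true - y_pred))

loss = custom_mse(np.array([1.0, 2.0]), np.array([1.5, 2.5]))  # 0.25
```

In a real model you would write the body with TensorFlow ops (tf.reduce_mean, tf.square) so that gradients can flow through it.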
As an example, here are three custom loss functions for a variational autoencoder (VAE)
model, from Hands-On Image Generation with TensorFlow by Soon Yau Cheong.
def vae_kl_loss(y_true, y_pred):
    kl_loss = -0.5 * tf.reduce_mean(
        1 + vae.logvar - tf.square(vae.mean) - tf.exp(vae.logvar))
    return kl_loss

def vae_rc_loss(y_true, y_pred):
    # rc_loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    rc_loss = tf.keras.losses.MSE(y_true, y_pred)
    return rc_loss

def vae_loss(y_true, y_pred):
    kl_loss = vae_kl_loss(y_true, y_pred)
    rc_loss = vae_rc_loss(y_true, y_pred)
    kl_weight_const = 0.01
    return kl_weight_const * kl_loss + rc_loss
Depending on the math of your loss function, you may need to add additional parameters
— such as the threshold value 𝛿 in the Huber Loss (above); to do this, you must include a
wrapper function, as TF will not allow you to have more than 2 parameters in your loss
function.
def custom_loss_with_threshold(threshold=1):
    def custom_loss(y_true, y_pred):
        pass  # implement the loss function here; it can reference threshold
    return custom_loss
Let’s look at some practical implementations of loss functions. Specifically, we will look
at how loss functions are used to process image data in various use cases.
Image Classification
One of the most fundamental aspects of computer vision is image classification — being
able to assign an image to one of two or more pre-selected labels; this allows users to
recognize objects, writing, people, etc. within the image (in image classification, the
image usually has only one subject).
The most commonly used loss function in image classification is cross-entropy loss/log
loss (binary for classification between 2 classes, and categorical or sparse categorical
for 3 or more), where the model outputs a vector of probabilities that the input image
belongs to each of the pre-set categories. This output is then compared to the actual
output, represented by a vector of equal size, where the correct category has a
probability of 1 and all others have a probability of 0.
Research is currently being done to develop new (custom) loss functions to optimize
multi-class classification. Below is an excerpt of a proposed loss function developed by
researchers at Duke University, which extends categorical cross-entropy loss by looking
at patterns in incorrect results as well, to speed up the learning process.
def matrix_based_crossentropy(output, target, matrixA, from_logits=False):
    # exponential() and updateMatrix() are helpers defined elsewhere in the excerpt
    Loss = 0
    ColumnVector = np.matmul(matrixA, target)
    for i, y in enumerate(output):
        Loss -= target[i] * math.log(output[i], 2)
        Loss += ColumnVector[i] * exponential(output[i])
        Loss -= target[i] * exponential(output[i])
    newMatrix = updateMatrix(matrixA, target, output, 4)
    return [Loss, newMatrix]
Image Generation
Image generation is a process by which neural networks create images (from an existing
library) per the user’s specifications.
Throughout this article, we have dealt primarily with the use of loss functions in
supervised learning — where we have had clearly labeled inputs, x, and outputs, y, and
the model was supposed to determine the relationship between these two variables.
Image generation is an application of unsupervised learning — where the model is
required to analyze and find patterns in unlabelled input datasets. The basic principle of
loss functions still holds; the goal of a loss function in unsupervised learning is to
determine the difference between the input example and the hypothesis — the model’s
approximation of the input example itself.
For example, this equation models how MSE would be implemented for unsupervised
learning, where h(x) is the hypothesis function.
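The same idea can be sketched in code: the loss compares each input example to the hypothesis h(x), the model's reconstruction of that input. The 0.9 scaling below is an arbitrary stand-in for a trained model:

```python
import numpy as np

# unsupervised MSE: the "target" is the input itself, and h(x) is the
# model's reconstruction of it
def reconstruction_mse(x, h):
    return np.mean(np.square(h(x) - x))

x = np.array([0.2, 0.4, 0.6])
loss = reconstruction_mse(x, lambda v: 0.9 * v)  # imperfect reconstruction
```

A perfect reconstruction (h the identity) would give a loss of exactly zero.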
Another example of the use of loss functions in image generation was shown above in
our Custom Loss Functions section, in the case of a variational auto-encoder (VAE)
model.
Backpropagation Process in Deep Neural Network
Backpropagation is an iterative, recursive, and efficient method for calculating updated
weights that improve the network until it can perform the task for which it is being
trained. It requires that the derivatives of the activation functions be known at network
design time.
Now, how is the error function used in backpropagation, and how does backpropagation
work? Let's start with an example and work through it mathematically to understand
exactly how the weights are updated.
Input values
X1=0.05
X2=0.10
Initial weights
w1 = 0.15    w5 = 0.40
w2 = 0.20    w6 = 0.45
w3 = 0.25    w7 = 0.50
w4 = 0.30    w8 = 0.55
Bias Values
b1=0.35 b2=0.60
Target Values
T1=0.01
T2=0.99
Forward Pass
To find the value of H1, we multiply the inputs by the weights and add the bias:
H1 = x1×w1 + x2×w2 + b1
H1 = 0.05×0.15 + 0.10×0.20 + 0.35
H1 = 0.3775
We then pass H1 through the sigmoid activation to get the output of the hidden neuron:
out_H1 = 1/(1 + e^(−0.3775)) = 0.593269992. In the same way, H2 = 0.05×0.25 +
0.10×0.30 + 0.35 = 0.3925, and out_H2 = 0.596884378.
Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.
To find the value of y1, we multiply the hidden-layer outputs by the weights:
y1 = out_H1×w5 + out_H2×w6 + b2
y1 = 0.593269992×0.40 + 0.596884378×0.45 + 0.60
y1 = 1.10590597
Passing y1 and y2 through the sigmoid gives the final outputs out_y1 = 0.75136507 and
out_y2 = 0.772928465. Our target values are 0.01 and 0.99, so the outputs do not match
the targets T1 and T2.
Now, we find the total error, which is the sum of the squared differences between the
outputs and the targets:
E_total = Σ ½(target − output)² = ½(0.01 − 0.75136507)² + ½(0.99 − 0.772928465)²
= 0.274811083 + 0.023560026 = 0.298371109
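The forward pass above can be reproduced in a few lines of Python; the sigmoid activation is assumed, as in the rest of the example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2 = 0.05, 0.10
w = {1: 0.15, 2: 0.20, 3: 0.25, 4: 0.30, 5: 0.40, 6: 0.45, 7: 0.50, 8: 0.55}
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

# hidden layer
net_h1 = x1 * w[1] + x2 * w[2] + b1          # 0.3775
net_h2 = x1 * w[3] + x2 * w[4] + b1          # 0.3925
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)

# output layer
net_y1 = out_h1 * w[5] + out_h2 * w[6] + b2  # 1.10590597
net_y2 = out_h1 * w[7] + out_h2 * w[8] + b2
out_y1, out_y2 = sigmoid(net_y1), sigmoid(net_y2)

# total error: sum of half squared differences
e_total = 0.5 * (t1 - out_y1) ** 2 + 0.5 * (t2 - out_y2) ** 2
```

Running this reproduces the numbers in the walkthrough, including the total error of about 0.298371109.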
Now, we will backpropagate this error to update the weights using a backward pass.
Backward pass at the output layer
To update a weight, we calculate the error corresponding to that weight with the help of
the total error. The error on weight w is found by differentiating the total error with
respect to w.
The total error formula does not contain w5 directly, so we cannot partially
differentiate it with respect to w5 in one step. Instead, we split the derivative into
terms we can compute using the chain rule:
∂E_total/∂w5 = ∂E_total/∂out_y1 × ∂out_y1/∂y1 × ∂y1/∂w5
Now, we calculate each term one by one:
∂E_total/∂out_y1 = out_y1 − T1 = 0.75136507 − 0.01 = 0.74136507
∂out_y1/∂y1 = out_y1 × (1 − out_y1) = 0.186815602
∂y1/∂w5 = out_H1 = 0.593269992
Putting these values together gives the final result:
∂E_total/∂w5 = 0.74136507 × 0.186815602 × 0.593269992 = 0.082167041
Now, we calculate the updated weight w5new with the gradient descent update rule, using
a learning rate of 0.5:
w5new = w5 − 0.5 × ∂E_total/∂w5 = 0.40 − 0.5 × 0.082167041 = 0.35891648
In the same way, we calculate w6new, w7new, and w8new, which gives us the following
values:
w5new = 0.35891648
w6new = 0.408666186
w7new = 0.511301270
w8new = 0.561370121
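As a check, the w5 update can be reproduced in code; the learning rate of 0.5 is inferred from the numbers in the example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

lr = 0.5                        # learning rate inferred from the worked values
out_h1 = sigmoid(0.3775)        # hidden output feeding w5
out_y1 = sigmoid(1.10590597)    # output of neuron y1
t1 = 0.01

# chain rule: dE/dw5 = dE/dout_y1 * dout_y1/dnet_y1 * dnet_y1/dw5
d_error = out_y1 - t1
d_sigmoid = out_y1 * (1 - out_y1)
grad_w5 = d_error * d_sigmoid * out_h1

w5_new = 0.40 - lr * grad_w5    # matches 0.35891648
```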
Now, we will backpropagate to our hidden layer and update the weight w1, w2, w3, and
w4 as we have done with w5, w6, w7, and w8 weights.
The total error formula does not contain w1 directly either, so we again split the
derivative into terms using the chain rule, noting that w1 affects both outputs through
out_H1:
∂E_total/∂w1 = (∂E_y1/∂out_H1 + ∂E_y2/∂out_H1) × ∂out_H1/∂H1 × ∂H1/∂w1
Now, we calculate each term one by one. The first factor sums the error propagated back
from both output neurons. The second factor is the sigmoid derivative,
∂out_H1/∂H1 = out_H1 × (1 − out_H1). Finally, we calculate the partial derivative of the
total net input to H1 with respect to w1 the same way as we did for the output neuron:
∂H1/∂w1 = x1 = 0.05. Putting these values together gives the final result for
∂E_total/∂w1.
Now, we calculate the updated weight w1new with the same update formula. In the same
way, we calculate w2new, w3new, and w4new, which gives us the following values:
w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229
We have now updated all the weights. We originally found an error of 0.298371109 on the
network when we fed forward the inputs 0.05 and 0.1. After the first round of
backpropagation, the total error is down to 0.291027924. After repeating this process
10,000 times, the total error falls to 0.0000351085. At that point, when we feed forward
0.05 and 0.1, the output neurons generate 0.015912196 and 0.984065734, close to our
target values.
Data distribution rules are adjustable. For example, GPUs may not be
connected to all segment hosts for cost control reasons. In this case,
the training data set will only be distributed to segments with GPUs
attached, and the models will only hop to those segments.
The APIs
Below are the two SQL queries needed to run MOP on Greenplum using
Apache MADlib. The first loads model configurations to be trained into
a model selection table. The second calls the MOP function to fit each
of the models in the model selection table.
Here we see the accuracy and cross-entropy loss for each of the 96 combinations after
training for 10 epochs. Training takes 2 hours and 47 minutes on the test cluster.* The
models display a wide range of accuracy, with some performing well and some not
converging at all.
The next step that a data scientist may take is to look for good models
from the initial run and select a subset of those that have reasonable
accuracy and low variance compared to the training set, meaning they
do not overfit. We select 12 of these good models and run them for 50
iterations, which takes 1 hour and 53 minutes on the test cluster.
Figure 4: Final training for more iterations with the 12 best model
configurations
From the second run, we select the best model, again ensuring that it
does not overfit the training data. The chosen model achieves
validation accuracy of roughly 81 percent using the SGD optimizer with
learning rate=.001, momentum=0.95, and batch size=256.
Once we have this model, we can then use it for inference to classify
new images.
Figure 5: Inference showing the top three probabilities
Neural networks are modeled after the brain cortex, which involves billions of neurons arranged
in a layered structure. Figure 4-1 shows an image of three vertical cross-sections of the brain
neocortex, and Figure 4-2 shows a diagram of a fully connected artificial neural network.
Figure 4-1. Three drawings of cortical lamination by Santiago Ramón y Cajal,
taken from the book Comparative Study of the Sensory Areas of the Human
Cortex (Andesite Press) (image source: Wikipedia)
In Figure 4-1, each drawing shows a vertical cross-section of the cortex, with the surface
(outermost side closest to the skull) of the cortex at the top. On the left is a Nissl-stained visual
cortex of a human adult. In the middle is a Nissl-stained motor cortex of a human adult. On the
right is a Golgi-stained cortex of a month-and-a half-old infant. The Nissl stain shows the cell
bodies of neurons. The Golgi stain shows the dendrites and axons of a random subset of neurons.
The layered structure of the neurons in the cortex is evident in all three cross-sections.
Figure 4-2. A fully connected or dense artificial neural network with four layers
Even though different regions of the cortex are responsible for different functions, such as vision,
auditory perception, logical thinking, language, speech, etc., what actually determines the
function of a specific region are its connections: which sensory and motor skills input and output
regions it connects to. This means that if a cortical region is connected to a different sensory
input/output region—for example, a vision locality instead of an auditory one—then it will
perform vision tasks (computations), not auditory tasks. In a very simplified sense, the cortex
performs one basic function at the neuron level. In an artificial neural network, the basic
computation unit is the perceptron, and it functions in the same way across the whole network.
The various connections, layers, and architecture of the neural network (both the brain cortex and
artificial neural networks) are what allow these computational structures to do very
impressive things.
Training Function: Fully Connected, or Dense, Feed Forward Neural Networks
In a fully connected or dense artificial neural network (see Figure 4-2), every neuron, represented
by a node (the circles), in every layer is connected to all the neurons in the next layer. The first
layer is the input layer, the last layer is the output layer, and the intermediate layers are
called hidden layers. The neural network itself, whether fully connected or not (the networks that
we will encounter in the next few chapters are convolutional and are not fully connected), is a
computational graph representing the formula of the training function. Recall that we use this
function to make predictions after training.
Training in the neural networks context means finding the parameter values, or weights, that
enter into the formula of the training function via minimizing a loss function. This is similar to
training linear regression, logistic regression, softmax regression, and support vector machine
models, which we discussed in Chapter 3. The mathematical structure here remains the same:
1. Training function
2. Loss function
3. Optimization
The only difference is that for the simple models of Chapter 3, the formulas of the training
functions are very uncomplicated. They linearly combine the data features, add a bias term (ω₀),
and pass the result into at most one nonlinear function (for example, the logistic function in
logistic regression). As a consequence, the results of these models are also simple: a linear (flat)
function for linear regression, and a linear division boundary between different classes in logistic
regression, softmax regression, and support vector machines. Even when we use these simple
models to represent nonlinear data, such as in the cases of polynomial regression (fitting the data
into polynomial functions of the features) or support vector machines with the kernel trick, we
still end up with linear functions or division boundaries, but these will either be in higher
dimensions (for polynomial regression, the dimensions are the feature and its powers) or in
transformed dimensions (such as when we use the kernel trick with support vector machines).
For neural network models, on the other hand, the process of linearly combining the features,
adding a bias term, then passing the result through a nonlinear function (now called activation
function) is the computation that happens only in one neuron. This simple process happens over
and over again in dozens, hundreds, thousands, or sometimes millions of neurons, arranged in
layers, where the output of one layer acts as the input of the next layer. Similar to the brain
cortex, the aggregation of simple and similar processes over many neurons and layers produces,
or allows for the representation of, much more complex functionalities. This is sort of
miraculous. Thankfully, we are able to understand much more about artificial neural networks
than our brain’s neural networks, mainly because we design them, and after all, an artificial
neural network is just one mathematical function. No black box remains dark once we dissect it
under the lens of mathematics. That said, the mathematical analysis of artificial neural networks
is a relatively new field. There are still many questions to be answered and a lot to be discovered.
Figure 4-3. A fully connected (or dense) feed forward neural network with only
five neurons arranged in three layers. The first layer (the three black dots on the
very left) is the input layer, the second layer is the only hidden layer with three
neurons, and the last layer is the output layer with two neurons.
Let’s model the training function of a feed forward fully connected neural network. Feed forward
means that the information flows forward through the computational graph representing the
network’s training function.
Conditional Random Fields
Other examples where CRFs are used are: labeling or parsing of sequential data
for natural language processing or biological sequences, part-of-speech
tagging, shallow parsing, named entity recognition, gene finding, peptide critical
functional region finding, and object recognition and image
segmentation in computer vision.
Description
CRFs are a type of discriminative undirected probabilistic graphical model.
Lafferty, McCallum and Pereira define a CRF on observations X and random variables Y
as follows: let G = (V, E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by
the vertices of G. Then (X, Y) is a conditional random field when each random variable
Y_v, conditioned on X, obeys the Markov property with respect to the graph; that is, its
probability is dependent only on its neighbours in G:
P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are
neighbours in G.
What this means is that a CRF is an undirected graphical model whose nodes can be
divided into exactly two disjoint sets, X and Y, the observed and output variables; the
conditional distribution p(Y | X) is then modelled.
If exact inference is impossible, several algorithms can be used to obtain approximate solutions.
These include:
Parameter Learning
Examples
In sequence modeling, the graph of interest is usually a chain graph. An input sequence of
observed variables X represents a sequence of observations, and Y represents a hidden
(or unknown) state variable that needs to be inferred given the observations. The Y_i
are structured to form a chain, with an edge between each Y_{i−1} and Y_i. As well as
having a simple interpretation of the Y_i as "labels" for each element in the input
sequence, this layout admits efficient algorithms for:
model training, learning the conditional distributions between the Y_i and feature
functions from some corpus of training data.
The conditional dependency of each Y_i on X is defined through a fixed set of feature
functions of the form f(i, Y_{i−1}, Y_i, X), which can be thought of as measurements on
the input sequence that partially determine the likelihood of each possible value for
Y_i. The model assigns each feature a numerical weight and combines them to determine
the probability of a certain value for Y_i.
Notably, in contrast to HMMs, CRFs can contain any number of feature functions, the
feature functions can inspect the entire input sequence at any point during
inference, and the range of the feature functions need not have a probabilistic
interpretation
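A toy sketch of this weighted feature scoring (with made-up labels, features, and weights) might look like the following; it computes only the unnormalized score, leaving out the partition function:

```python
# hypothetical linear-chain CRF scoring: a weighted sum of feature
# functions f(i, y_prev, y_curr, x) over the positions of the sequence
def crf_score(x, y, feature_fns, weights):
    total = 0.0
    for i in range(1, len(y)):
        for f, w in zip(feature_fns, weights):
            total += w * f(i, y[i - 1], y[i], x)
    return total

# example features: a label-bigram feature and a word/label agreement feature
def f_bigram(i, y_prev, y_curr, x):
    return 1.0 if (y_prev, y_curr) == ("DET", "NOUN") else 0.0

def f_word(i, y_prev, y_curr, x):
    return 1.0 if x[i] == "dog" and y_curr == "NOUN" else 0.0

score = crf_score(["the", "dog"], ["DET", "NOUN"], [f_bigram, f_word], [0.5, 2.0])
```

Note that both features may inspect the entire input sequence x at any position, which is exactly the flexibility the text contrasts with HMMs.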
Variants
CRFs can be extended into higher-order models by making each Y_i dependent on a fixed
number of previous variables Y_{i−k}, ..., Y_{i−1}. In conventional formulations of
higher-order CRFs, training and inference are only practical for small values of k (such
as k ≤ 5), since their computational cost grows exponentially with k.
Finally, large-margin models for structured prediction, such as the structured Support
Vector Machine, can be seen as an alternative training procedure to CRFs.
In latent-dynamic conditional random fields (LDCRFs), the main problem the model must
solve is how to assign a sequence of labels y = (y1, ..., yn) from one finite set of
labels Y. Instead of directly modeling P(y|x) as an ordinary linear-chain CRF would do, a
set of latent variables h is "inserted" between x and y using the chain rule of
probability:
P(y | x) = Σ_h P(y | h, x) P(h | x)
This allows capturing latent structure between the observations and labels. While LDCRFs can
be trained using quasi-Newton methods, a specialized version of the perceptron algorithm called
the latent-variable perceptron has been developed for them as well, based on
Collins' structured perceptron algorithm
Figure 1. Examples of Markov networks. (a) A simple Markov network with cyclical
dependencies; for example, A is dependent on B, which is dependent on C, which is
dependent on A. Note that the cycles can be traversed in either direction due to
symmetry. (b) A portion of a large grid-like Markov network that could represent pixels
in an image. Note that each node depends on its neighbors, with no hierarchy of
dependence.
and the edges in represent dependency relationships (see Figure 1). While this
representation is similar to that of Bayesian networks, there are significant differences.
The undirected edges in Markov networks allow symmetrical dependencies between
variables, which are useful in fields such as image analysis, where the value of a given
pixel depends on all the pixels around it, rather than having a hierarchy of pixels. In
addition, Markov networks permit cycles, and so cyclical dependencies can be
represented.
Figure 2. The three Markov properties. (a) By the global property, the
subgraphs A and C are conditionally independent given B. (b) By
the local property, A is conditionally independent from the rest of the network given its
neighbors. (c) By the pairwise property, A is conditionally independent
from B and C given the rest of the network, but B and C are still dependent on each
other.
It may be useful to summarize conditional independence in Markov networks as
follows: any two parts of a Markov network are conditionally independent given
evidence that "blocks" all paths between them. More formally, there are three key
properties of Markov networks related to conditional independence:[1]
Bayesian beliefs are a way of understanding probability that is based on the concept of
subjective beliefs. In Bayesian probability, the probability of an event is based on an
individual’s subjective belief about the likelihood of the event occurring, taking into
account all available information. Beliefs can be updated as new information becomes available.
In Bayesian belief propagation, messages are passed between the nodes in the network.
Each node updates its beliefs about the variables based on the messages it receives from
its neighbors. The algorithm iteratively updates the beliefs of each node until a stable
state is reached, at which point the beliefs of the nodes represent the most probable
values of the variables.
Heterogeneous networks are networks in which the nodes and edges have different
values or characteristics. For example, a social network might have nodes representing
people and edges representing friendships, but some people might be more influential
than others, or some friendships might be stronger than others.
Filter bubbles are a phenomenon that occurs when people are exposed to a limited range
of ideas and viewpoints, often as a result of personalized algorithms that selectively
present information to them. This can lead to people becoming isolated in their own echo
chambers, where their beliefs and perspectives are reinforced and unchallenged.
This can lead to the algorithm presenting the user with information that is more closely
aligned with their existing beliefs and perspectives, further reinforcing their existing
filter bubble.
Counterfactuals are statements about events that did not happen, such as “If the election
had been held on a different day, the outcome might have been different.” These
statements are often used to explore alternative scenarios and consider the implications
of different choices or events.
This can both lead to and be exacerbated by the presence of filter bubbles, as people are
more likely to encounter information that supports their existing beliefs and
perspectives, rather than being exposed to a diverse range of viewpoints.
Training CRF
When we start off with our neural network, we initialize our weights randomly.
Obviously, this won't give very good results. In the process of training, we want to
start with a poorly performing neural network and wind up with a network with high
accuracy. In terms of the loss function, we want the loss to be much lower at the end of
training. Improving the network is possible because we can change its function by
adjusting the weights. We want to find another function that performs better than the
initial one.
The problem of training is therefore equivalent to the problem of minimizing the loss
function. Why minimize loss instead of maximizing accuracy? It turns out that loss is a
much easier function to optimize.
There are many algorithms that optimize functions. These algorithms can be
gradient-based or not; gradient-based methods use not only the information provided by
the function itself but also by its gradient. One of the simplest gradient-based
algorithms, and the one I'll be covering in this post, is called Stochastic Gradient
Descent. Let's see how it works.
First, we need to remember what the derivative with respect to some variable is. Let's
take an easy function f(x) = x. If we remember the rules of calculus from high school,
we know that the derivative of that function is one at every value of x. What does that
tell us? The derivative is the rate at which our function changes when we take an
infinitely small step in the positive direction. Mathematically, it can be written as
the following:
f(x + ε) − f(x) ≈ f′(x) · ε
This means that how much our function changes (the left-hand side) approximately equals
the derivative of that function with respect to the variable x, multiplied by how much
we changed that variable. The approximation becomes exact as the step we take becomes
infinitely small, and this is a very important concept behind the derivative. Going back
to our simple function f(x) = x, we said that its derivative is 1, which means that if
we take some step epsilon in the positive direction, the function output will change by
1 multiplied by our step epsilon, which is just epsilon. It's easy to check that this
rule holds. In fact, here it's not even an approximation; it's exact. Why? Because the
derivative is the same for every value of x. That is not true for most functions. Let's
look at a slightly more complex function, f(x) = x².
y = x²
From the rules of calculus we know that the derivative of this function is 2x. It's easy
to check that if we now start at some value of x and take some step epsilon, the change
in our function will not be exactly equal to the formula given above, because the
derivative itself changes as x changes.
Now, the gradient is a vector of partial derivatives, whose elements contain the
derivatives with respect to each variable on which the function depends. For the simple
functions we've considered so far, this vector contains only one element, because we've
only been using functions that take one input. For more complex functions (like our loss
function), the gradient will contain a derivative with respect to each variable we care
about.
How can we use this information, provided to us by the derivative, to minimize some
function? Let's go back to our function f(x) = x². Obviously, the minimum of this
function is at the point x = 0, but how would a computer know that? Suppose we start off
with some random value of x, and this value is 2. The derivative of the function at
x = 2 equals 4, which means that if we take a step in the positive direction, our
function will change proportionally to 4; that is, it will increase. Instead, we want to
minimize our function, so we take a step in the opposite, negative direction to be sure
that our function decreases, at least a little bit. How big of a step can we take? Well,
that's the bad news: the derivative only guarantees that the function will decrease for
an infinitely small step, and we can't take one of those. Generally, you control how big
a step you make with a hyper-parameter called the learning rate, which I'll talk about
later. Let's now see what happens if we start at the point x = −2. The derivative now
equals −4, which means that if we take a small step in the positive direction, our
function will change proportionally to −4; thus it will decrease. That's exactly what we
want.
Have you noticed a pattern here? When x > 0, the derivative is greater than zero and we
need to go in the negative direction; when x < 0, the derivative is less than zero and
we need to go in the positive direction. We always need to take a step in the direction
opposite to the derivative. Let's apply the same idea to the gradient. The gradient is a
vector that points in some direction in space; in fact, it points in the direction of
the steepest increase of the function. Since we want to minimize our function, we take a
step in the direction opposite to the gradient. In a neural network, we think of the
inputs x and outputs y as fixed numbers. The variables with respect to which we take our
derivatives are the weights w, since these are the values we want to change to improve
our network. If we compute the gradient of the loss function with respect to our weights
and take small steps in the direction opposite to the gradient, our loss will gradually
decrease until it converges to some local minimum. This algorithm is called Gradient
Descent. The rule for updating weights on each iteration of Gradient Descent is the
following:
w = w − lr · ∂L/∂w
For each weight, subtract the derivative of the loss with respect to it, multiplied by
the learning rate.
lr in the notation above means learning rate. It's there to control how big a step we
take each iteration. It is the most important hyper-parameter to tune when training
neural networks. If you choose a learning rate that is too big, you'll take steps that
are too large and 'jump over' the minimum, which means your algorithm will diverge. If
you choose a learning rate that is too small, it might take too long to converge to some
local minimum. People have developed some very good techniques for finding an optimal
learning rate, but that goes beyond the scope of this post. Some of them are described
in my other post, 'Improving the way we work with learning rate'.
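The update rule above can be sketched in a few lines, applied to our running example f(x) = x², whose derivative is 2x:

```python
# plain gradient descent: repeatedly step opposite the derivative
def grad_descent(df, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= lr * df(x)  # w = w - lr * dL/dw, in one dimension
    return x

# minimize f(x) = x^2 starting from x = 2; each step multiplies x by 0.8
x_min = grad_descent(lambda x: 2 * x, x0=2.0)
```

After 100 steps x has shrunk to essentially 0, the true minimum; with lr above 1.0 the same loop would diverge, illustrating the learning-rate trade-off described above.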
Unfortunately, we can’t really apply this algorithm for training neural networks and the
reason lies in our formula for the loss function.
As you can see from what we defined above, our formula is the average over the sum.
From calculus we know, that derivative of the sum is the sum of the derivatives.
Therefore, in order to compute the derivatives of the loss, we need to go through each
example of our dataset. It would be very inefficient to do it every iteration of the Gradient
Descent, because each iteration of the algorithm only improves our loss by some small
step. To solve this problem, there’s another algorithm, called Mini-batch Gradient
Descent. The rule for updating the weights stays the same, but we’re not going to be
calculating the exact derivative. Instead, we’ll approximate the derivate on some small
mini-batch of the dataset and use that derivative to update the weights. Mini-batch isn’t
guaranteed to take steps in optimal direction. In fact, it usually won’t. With gradient
descent, if choose small enough learning rate, your loss is guaranteed to decrease every
iteration. With mini-batch that’s not true. Your loss will decrease over time, but it will
fluctuate and be much more ‘noisy’.
The size of the batch used to estimate the derivatives is another hyper-parameter you'll
have to choose. Usually, you want as big a batch size as your memory can handle, but
I've rarely seen people use a batch size much larger than around 100.
The extreme version of mini-batch gradient descent, with a batch size of 1, is called
Stochastic Gradient Descent. In modern literature, very commonly, when people say
Stochastic Gradient Descent (SGD) they actually mean mini-batch gradient descent. Most
deep learning frameworks will let you choose the batch size for SGD.
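A minimal sketch of the mini-batch update loop, using a toy objective whose gradient is easy to write down (the average of (w − x)², whose minimizer is the data mean):

```python
import random

# mini-batch gradient descent: approximate the gradient on small shuffled
# batches instead of the full dataset
def minibatch_gd(grad_fn, data, w0, lr=0.1, batch_size=16, epochs=50, seed=0):
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            grad = sum(grad_fn(w, ex) for ex in batch) / len(batch)
            w -= lr * grad
    return w

# d/dw (w - x)^2 = 2(w - x); w is driven toward the mean of the data (2.0)
data = [1.0] * 50 + [3.0] * 50
w = minibatch_gd(lambda w, x: 2 * (w - x), data, w0=0.0)
```

As the text predicts, w does not move smoothly: each batch pulls it toward that batch's mean, so it fluctuates around 2.0 rather than settling exactly on it.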
That’s it on Gradient Descent and its variations. Recently, more and more people have
been using more advanced algorithms. Most of them are gradient-based and actually
based on SGD with slight modifications. I’m planning on writing about those as well.
The basic idea behind an HMM is that the hidden states generate the observations, and
the observed data is used to estimate the hidden state sequence. The standard procedure
for computing these estimates is the forward-backward algorithm.
Now, we will explore some of the key applications of HMMs, including speech
recognition, natural language processing, bioinformatics, and finance.
o Speech Recognition
One of the most well-known applications of HMMs is speech recognition. In this
field, HMMs are used to model the different sounds and phones that make up
speech. The hidden states, in this case, correspond to the different sounds or
phones, and the observations are the acoustic signals that are generated by the
speech. The goal is to estimate the hidden state sequence, which corresponds to
the transcription of the speech, based on the observed acoustic signals. HMMs
are particularly well-suited for speech recognition because they can effectively
capture the underlying structure of the speech, even when the data is noisy or
incomplete. In speech recognition systems, the HMMs are usually trained on
large datasets of speech signals, and the estimated parameters of the HMMs are
used to transcribe speech in real time.
o Natural Language Processing
Another important application of HMMs is natural language processing. In this
field, HMMs are used for tasks such as part-of-speech tagging, named entity
recognition, and text classification. In these applications, the hidden states are
typically associated with the underlying grammar or structure of the text, while
the observations are the words in the text. The goal is to estimate the hidden
state sequence, which corresponds to the structure or meaning of the text, based
on the observed words. HMMs are useful in natural language processing because
they can effectively capture the underlying structure of the text, even when the
data is noisy or ambiguous. In natural language processing systems, the HMMs
are usually trained on large datasets of text, and the estimated parameters of the
HMMs are used to perform various NLP tasks, such as text classification, part-of-
speech tagging, and named entity recognition.
o Bioinformatics
HMMs are also widely used in bioinformatics, where they are used to model
sequences of DNA, RNA, and proteins. The hidden states, in this case, correspond
to the different types of residues, while the observations are the sequences of
residues. The goal is to estimate the hidden state sequence, which corresponds to
the underlying structure of the molecule, based on the observed sequences of
residues. HMMs are useful in bioinformatics because they can effectively capture
the underlying structure of the molecule, even when the data is noisy or
incomplete. In bioinformatics systems, the HMMs are usually trained on large
datasets of molecular sequences, and the estimated parameters of the HMMs are
used to predict the structure or function of new molecular sequences.
o Finance
Finally, HMMs have also been used in finance, where they are used to model
stock prices, interest rates, and currency exchange rates. In these applications, the
hidden states correspond to different economic states, such as bull and bear
markets, while the observations are the stock prices, interest rates, or exchange
rates. The goal is to estimate the hidden state sequence, which corresponds to
the underlying economic state, based on the observed prices, rates, or exchange
rates. HMMs are useful in finance because they can effectively capture the
underlying economic state, even when the data is noisy or incomplete. In finance
systems, the HMMs are usually trained on large datasets of financial data, and the
estimated parameters of the HMMs are used to make predictions about future
market trends or to develop investment strategies.
Now, we will explore some of the key limitations of HMMs and discuss how they can
impact the accuracy and performance of HMM-based systems.
Machine Learning is in such demand in the IT world that most companies want
highly skilled machine learning engineers and data scientists for their business. Machine
Learning contains many algorithms and concepts that solve complex problems easily,
and one of them is entropy. Almost everyone has heard the word entropy at some point,
during their school or college days in physics and chemistry. The concept of entropy
comes from physics, where it is defined as a measure of the disorder, randomness,
unpredictability, or impurity in a system. In this article, we will discuss what entropy is in
Machine Learning and why it is needed. So let's start with a quick introduction to
entropy in Machine Learning.
Consider a data set having a total number of N classes; the entropy (E) can then be
determined with the formula below:

E = -Σᵢ₌₁ᴺ pᵢ log₂(pᵢ)

where pᵢ is the probability (i.e., the proportion) of class i in the data set.
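As a quick sketch of the formula above (assuming the data set is summarized by per-class counts):

```python
import math

def entropy(class_counts):
    """E = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts]
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A balanced binary data set has maximum entropy:
print(entropy([50, 50]))           # 1.0
# A pure (single-class) data set has zero entropy:
print(entropy([100, 0]))           # 0.0
```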
Deep Learning: Feedforward Neural Networks
So far we have seen the neurons present in the first layer but we also have another
output neuron which takes h₁ and h₂ as the input as opposed to the previous neurons.
The output from this neuron will be the final predicted output, which is a function of h₁
and h₂. The predicted output is given by the following equation,
Pre-activation Function
The activation at each layer is equal to applying the sigmoid function to the output of
pre-activation of that layer. The mathematical equation for the activation at each layer ‘i’
is given by,
Activation Function
Finally, we can get the predicted output of the neural network by applying some kind of
activation function (could be softmax depending on the task) to the pre-activation output
of the previous layer. The equation for the predicted output is shown below,
Weight Matrix
Remember, we are following a very specific format for the indices of the weight and bias
variable as shown below,
W(Layer number)(Neuron number in the layer)(Input number)
b(Layer number)(Bias number associated for that input)
Now let’s see how we can compute the pre-activation for the first neuron of the first
layer a₁₁. We know that pre-activation is nothing but the weighted sum of inputs plus
bias. In other words, it is the dot product between the first row of the weight matrix W₁
and the input matrix X plus bias b₁₁.
Similarly, the pre-activation for the other 9 neurons in the first layer is given by,
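A minimal NumPy sketch of these computations (the 10-neuron first layer matches the text; the 4-dimensional input and the random values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(4)          # input vector (4 features, assumed)
W1 = rng.standard_normal((10, 4))   # weight matrix: one row per neuron
b1 = rng.standard_normal(10)        # one bias per neuron

# Pre-activation: weighted sum of inputs plus bias (row-wise dot products).
a1 = W1 @ X + b1
# Activation: sigmoid applied element-wise to the pre-activation.
h1 = 1.0 / (1.0 + np.exp(-a1))
print(h1.shape)   # (10,)
```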
Predicted Values
These two outputs form a probability distribution, which means their sum is equal to 1.
The output activation function is chosen depending on the task at hand; it can be
softmax or linear.
Softmax Function
We will use the Softmax function as the output activation function. It is the most
frequently used activation function in deep learning for classification problems.
Softmax Function
In the Softmax function, the output is always positive, irrespective of the input.
Now, let us illustrate the Softmax function on the above-shown network with 4 output
neurons. The output of all these 4 neurons is represented in a vector ‘a’. To this vector,
we will apply our softmax activation function to get the predicted probability distribution
as shown below,
Applying the Softmax Function
By applying the softmax function, we get a predicted probability distribution. Since our
true output is also a probability distribution, we can compare these two distributions
to compute the loss of the network.
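A minimal sketch of applying softmax to a 4-element output vector 'a' (the values are illustrative):

```python
import numpy as np

def softmax(a):
    """Exponentiate each score, then normalize so the outputs sum to 1."""
    e = np.exp(a - np.max(a))   # subtracting the max improves numerical stability
    return e / e.sum()

a = np.array([2.0, 1.0, 0.5, -1.0])   # pre-activations of the 4 output neurons
p = softmax(a)
print(p)          # every entry is positive, regardless of the input sign
print(p.sum())    # sums to 1 (up to floating point)
```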
Loss Function
In this section, we will talk about the loss function for binary and multi-class
classification. The purpose of the loss function is to tell the model that some correction
needs to be done in the learning process.
In general, the number of neurons in the output layer would be equal to the number of
classes. But in the case of binary classification, we can use only one sigmoid neuron
which outputs the probability P(Y=1); therefore we can obtain P(Y=0) = 1 - P(Y=1). In the
case of classification, we will use the cross-entropy loss to compare the predicted
probability distribution and the true probability distribution.
Cross-entropy loss for binary classification is given by:

L = -[y log(ŷ) + (1 - y) log(1 - ŷ)]

where y is the true label (0 or 1) and ŷ is the predicted probability P(Y=1).
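A quick sketch of this loss on a single example (the prediction values are illustrative):

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """L = -[y*log(p) + (1-y)*log(1-p)] for a single example."""
    eps = 1e-12                       # clamp to avoid log(0)
    y_pred = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# A confident correct prediction gives a small loss...
print(binary_cross_entropy(1, 0.9))
# ...while a confident wrong prediction is penalized heavily.
print(binary_cross_entropy(1, 0.1))
```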
Table of Contents
1. What is Regularization?
2. How does Regularization help in reducing Overfitting?
3. Different Regularization techniques in Deep Learning
L2 and L1 regularization
Dropout
Data augmentation
Early stopping
What is Regularization?
Before we deep dive into the topic, take a look at this image:
Have you seen this image before? As we move towards the right in this image, our
model tries to learn too well the details and the noise from the training data, which
ultimately results in poor performance on the unseen data.
In other words, while going towards the right, the complexity of the model increases
such that the training error reduces but the testing error doesn’t. This is shown in the
image below.
If you’ve built a neural network before, you know how complex they are. This makes
them more prone to overfitting.
Regularization is a technique which makes slight modifications to the learning
algorithm such that the model generalizes better. This in turn improves the model’s
performance on the unseen data as well.
If you have studied the concept of regularization in machine learning, you will have a fair
idea that regularization penalizes the coefficients. In deep learning, it actually
penalizes the weight matrices of the nodes.
Assume that our regularization coefficient is so high that some of the weight matrices
are nearly equal to zero. This will result in a much simpler linear network and slight
underfitting of the training data.
Such a large value of the regularization coefficient is not that useful. We need to
optimize the value of the regularization coefficient in order to obtain a well-fitted model,
as shown in the image below.
Different Regularization Techniques in Deep Learning
Now that we have an understanding of how regularization helps in reducing overfitting,
we’ll learn a few different techniques in order to apply regularization in deep learning.
L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the general cost
function by adding another term known as the regularization term.
In L2, we have:

Cost function = Loss + (λ / 2m) × Σ‖w‖²

Here, lambda (λ) is the regularization parameter. It is the hyperparameter whose value is
optimized for better results. L2 regularization is also known as weight decay as it forces
the weights to decay towards zero (but not exactly zero).
In L1, we have:

Cost function = Loss + (λ / 2m) × Σ‖w‖

In this, we penalize the absolute value of the weights. Unlike L2, the weights may be
reduced to zero here. Hence, it is very useful when we are trying to compress our
model. Otherwise, we usually prefer L2 over it.
In Keras, we can directly apply regularization to any layer using the regularizers
module. Below, I have applied a regularizer to a dense layer having 500 neurons and a
ReLU activation function.
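The snippet referred to above is not reproduced here; a minimal sketch (the 500 neurons and ReLU activation are from the text, while the lambda value of 0.01 is an assumed example) could look like:

```python
from keras import regularizers
from keras.layers import Dense

# Dense layer with 500 neurons and ReLU activation, with an L2 penalty
# on its weights (lambda = 0.01 is an illustrative value, not from the text).
layer = Dense(500, activation='relu',
              kernel_regularizer=regularizers.l2(0.01))
```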
Dropout
This is one of the most interesting types of regularization techniques. It also
produces very good results and is consequently the most frequently used regularization
technique in the field of deep learning.
To understand dropout, let’s say our neural network structure is akin to the one shown
below:
So what does dropout do? At every iteration, it randomly selects some nodes and
removes them along with all of their incoming and outgoing connections as shown
below.
So each iteration has a different set of nodes and this results in a different set of
outputs. It can also be thought of as an ensemble technique in machine learning.
Ensemble models usually perform better than a single model as they capture more
randomness. Similarly, dropout also performs better than a normal neural network
model.
The probability of a node being dropped is the hyperparameter of the dropout function.
As seen in the image above, dropout can be applied to both the hidden layers and the
input layers.
Due to these reasons, dropout is usually preferred when we have a large neural
network structure in order to introduce more randomness.
In Keras, we can implement dropout using the Dropout layer. Below is the dropout
implementation. I have introduced a dropout rate of 0.2 (the probability of dropping a
node) in my neural network architecture, after the last hidden layer having 64 kernels
and after the first dense layer having 500 neurons.
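The implementation referred to above is not shown; a minimal sketch of such an architecture (the input shape and the layers other than the two described are illustrative assumptions) could look like:

```python
from keras import Input
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Input(shape=(64, 64, 3)),              # assumed input size
    Conv2D(64, (3, 3), activation='relu'), # last hidden layer: 64 kernels
    MaxPooling2D((2, 2)),
    Dropout(0.2),                          # drop 20% after the conv layer
    Flatten(),
    Dense(500, activation='relu'),         # first dense layer: 500 neurons
    Dropout(0.2),                          # drop 20% after the dense layer
    Dense(1, activation='sigmoid'),
])
```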
Data Augmentation
The simplest way to reduce overfitting is to increase the size of the training data. In
machine learning, this is often impractical, as labeled data is too costly to obtain.
But, now let’s consider we are dealing with images. In this case, there are a few ways of
increasing the size of the training data – rotating the image, flipping, scaling, shifting,
etc. In the below image, some transformation has been done on the handwritten digits
dataset.
This technique is known as data augmentation. This usually provides a big leap in
improving the accuracy of the model. It can be considered as a mandatory trick in order
to improve our predictions.
In Keras, we can perform all of these transformations using ImageDataGenerator. It
has a big list of arguments which you can use to pre-process your training data.
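A minimal sketch of such a generator (the specific argument values are illustrative assumptions):

```python
from keras.preprocessing.image import ImageDataGenerator

# Illustrative argument values; ImageDataGenerator accepts many more.
datagen = ImageDataGenerator(
    rotation_range=20,        # rotate images by up to 20 degrees
    horizontal_flip=True,     # flip images left-right
    zoom_range=0.1,           # random zoom (scaling)
    width_shift_range=0.1,    # shift images horizontally
    height_shift_range=0.1,   # shift images vertically
)
```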
Early stopping
Early stopping is a kind of cross-validation strategy where we keep one part of the
training set as the validation set. When we see that the performance on the validation
set is getting worse, we immediately stop training the model. This is known
as early stopping.
In the above image, we will stop training at
the dotted line since after that our model will start overfitting on the training data.
In Keras, we can apply early stopping using the callbacks function. Below is the
implementation code for it. I have applied early stopping so that training will stop
immediately if the validation metric does not improve for 3 epochs.
from keras.callbacks import EarlyStopping
earlystop = EarlyStopping(monitor='val_acc', patience=3)
epochs = 20
batch_size = 256
Here, monitor denotes the quantity that needs to be monitored and 'val_acc' denotes
the validation accuracy.
Patience denotes the number of epochs with no further improvement after which the
training will be stopped. For better understanding, let’s take a look at the above image
again. After the dotted line, each epoch will result in a higher value of validation error.
Therefore, 3 epochs after the dotted line (since our patience is equal to 3), our model
will stop because no further improvement is seen.
Note: It may be possible that after 3 epochs (this is the value defined for patience
in general), the model starts improving again and the validation error starts
decreasing as well. Therefore, we need to take extra care while tuning this
hyperparameter.
Implementation on Malaria Cell Identification with keras
By this point, you should have a theoretical understanding of the different techniques we
have gone through. We will now apply this knowledge to our deep learning practice
problem – Identify Malaria Cells. In this problem I will use all the regularization
techniques which I have discussed earlier, i.e.,
1. L1, L2 Regularizer
2. Dropout
3. Data Augmentation
4. Early Stopping
Now let's get started!
Deep Learning
Deep learning is a type of artificial intelligence where machines learn to understand and
solve complex problems on their own, without explicit programming. It mimics how the
human brain works, using layers of interconnected artificial neurons to process and
learn from vast amounts of data. Deep learning has enabled significant advancements
in areas like image recognition, natural language processing etc. Training of deep
learning models is essential in shaping their capacity to generalize and excel in handling
new and unseen data.
How are Deep Learning Models trained?
The training process in deep learning involves feeding the model with a large dataset,
allowing it to learn patterns and adjusting its parameters to reduce errors. It can be
summarized in the following steps:
Data Collection: A huge and diverse dataset is gathered to train the model. High-quality
data is essential for the model to learn representative features and generalize well.
Example:
Suppose we are building an image classifier to identify whether an image contains a cat or a
dog. We gather a large and diverse dataset of labeled images of cats and dogs. This is the data
collection process.
Data Preprocessing: The collected data is preprocessed to ensure uniformity, remove
noise, and make it suitable for input to the model. This step includes normalization,
scaling, and data augmentation.
Example:
Continuing with our above example, we preprocess the images by resizing them to a consistent
size, normalizing the pixel values to a certain range and applying data augmentation techniques
to increase the diversity of the training samples.
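A minimal NumPy sketch of the normalization and a simple augmentation step (the image size is an illustrative assumption; real pipelines would also resize images, e.g., with an image library):

```python
import numpy as np

rng = np.random.default_rng(1)
# A fake RGB image with pixel values in [0, 255] (size is illustrative).
image = rng.integers(0, 256, size=(100, 120, 3)).astype(np.float64)

# Normalize pixel values from [0, 255] into [0, 1].
normalized = image / 255.0

# One simple augmentation: a horizontal flip yields an extra training variant.
flipped = normalized[:, ::-1, :]
print(normalized.max() <= 1.0)   # True
```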
Model Architecture: Choosing an appropriate model architecture is crucial. Different
model types, such as convolutional neural networks (CNNs), recurrent neural networks
(RNNs) and others, have distinct structures to do specific tasks.
Example:
We choose a simple neural network architecture consisting of an input layer, a hidden layer with
ReLU activation, and an output layer with a sigmoid activation function.
Initialization: When we create a machine learning model, it consists of many internal
components called parameters, which can be adjusted to make better predictions. For
example, in a neural network, the parameters are the weights and biases of the
neurons. To start the training process, we randomly set these parameters to small
values. If we set them to zero or the same value, the model won't be able to learn
effectively because all the neurons will behave the same way. Proper initialization is
vital to prevent the model from getting stuck in local minima during training.
Example:
We initialize the weights and biases of the neurons in the hidden and output layers with small
random values. This random initialization helps the model learn diverse features from the data.
Forward Pass: The training process begins by giving the model some input data, which
could be images, text, or any other type of data the model is designed to handle. The
input data is fed into the model, where it goes through a series of mathematical
operations and transformations in different layers of the model. Each layer learns to
recognize different patterns or features in the data. As the data flows through the layers,
the model becomes more capable of understanding the patterns and relationships in the
data.
Example:
We take an input image and feed it through the network. The image's pixel values are passed
through the layers, and the hidden layer applies the ReLU activation function to the weighted
sum of inputs. The output layer applies the sigmoid activation to produce a prediction between 0
and 1, indicating the probability of the image being a dog.
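The forward pass in this example can be sketched in plain NumPy (all sizes and weight values are illustrative assumptions; the small weight scale simply keeps the sigmoid away from saturation):

```python
import numpy as np

rng = np.random.default_rng(42)

x = rng.random(12)                        # flattened pixel values (assumed size)
W1 = 0.5 * rng.standard_normal((16, 12))  # hidden layer: 16 units (assumed)
b1 = np.zeros(16)
W2 = 0.1 * rng.standard_normal((1, 16))   # small scale avoids saturation
b2 = np.zeros(1)

h = np.maximum(0, W1 @ x + b1)            # hidden layer: ReLU of weighted sum
z = W2 @ h + b2                           # output pre-activation
p = 1.0 / (1.0 + np.exp(-z))              # sigmoid: probability of "dog"
print(float(p))                           # a probability between 0 and 1
```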
Loss Calculation: After the data has passed through the model and predictions have
been made, we need to evaluate how well the model did. To do this, we compare its
predictions to the actual correct answers from the training data. The difference between
the predictions and the correct answers is calculated using a loss function. There are
various types of loss functions, depending on the problem we are trying to solve. For
example, in a regression problem (predicting a continuous value), we might use mean
squared error, while in a classification problem (predicting categories), we could use
cross-entropy loss. The loss function gives us a single number that represents how far
off the model's predictions are from the true values. The bigger the difference, the larger
the loss value. The goal of training is to minimize this loss by adjusting the model's
parameters.
Example:
We compare the model's prediction to the actual label (cat or dog) of the input image. Let's say
the actual label is 1 (dog). We use a loss function such as binary cross-entropy to calculate the
difference between the predicted probability and the actual label.
Backpropagation: Backpropagation is an important technique used to update the
model's parameters based on the computed loss. It propagates the error backward
through the layers to calculate gradients, indicating the direction and magnitude of
parameter updates. The number of epochs (one complete pass through the entire
training dataset during the model's learning process) determines how many times the
entire dataset is used during the backpropagation process, helping the model finetune
its parameters and improve its performance.
Example:
Backpropagation propagates the gradients backward through the layers, while updating the
parameters in the opposite direction of the gradient to minimize the loss. This process involves
adjusting the weights and biases in the hidden and output layers to make the model's prediction
more accurate.
Optimization techniques play a crucial role in training deep learning models by helping
them find the best parameters to minimize errors and improve performance. After
calculating the loss and gradients during the forward pass, the optimization algorithm is
used to update the model's parameters based on these gradients. The goal is to
minimize the loss function.
The Adam optimizer, short for Adaptive Moment Estimation, is one of the most popular
optimization algorithms. It combines the strengths of AdaGrad and RMSprop to achieve
faster convergence and better results. Adam adapts the learning rate for each
parameter based on its historical gradients, enabling the model to find an optimal
solution efficiently.
Other optimization techniques include Stochastic Gradient Descent (SGD) and its
variants: mini-batch gradient descent, which updates parameters based on gradients
computed on small batches of data, and batch gradient descent, which uses the full
dataset for each update. Momentum optimization accelerates training by considering
the direction of previous parameter updates, helping the model escape local minima
and speeding up convergence.
Variation in training among different models
The training process can vary significantly among different deep learning models based
on their architecture and the nature of the data they process. Here are a few examples:
Convolutional Neural Networks (CNNs): CNNs are widely used for image-related tasks.
Their training involves learning spatial hierarchies of features through convolutional
layers and pooling. Data augmentation is commonly used to increase the diversity of
training samples and improve generalization.
Recurrent Neural Networks (RNNs): RNNs are designed for sequential data like text or
time series data. Training RNNs involves handling vanishing gradients, which can be
addressed using specialized units like LSTM (Long short-term memory) or GRU (Gated
Recurrent Unit).
Generative Adversarial Networks (GANs): GANs are used to generate synthetic data.
Their training process involves two competing networks, the generator and
discriminator, which require careful tuning to achieve stability and convergence.
Transformer Models: Transformers are popular for natural language processing tasks.
They rely on self-attention mechanisms and parallel processing, which makes their
training process computationally intensive.
Reinforcement Learning Models: In reinforcement learning, the model learns by
interacting with an environment and receiving feedback. The training process involves
finding optimal policies through trial and error, which requires balancing exploration and
exploitation.
The variations in training among these models highlight the importance of selecting the
appropriate architecture and fine-tuning the training process to get optimal performance.
Dataset Size: The size of the training dataset can significantly influence the model's
ability to learn representative features.
Data Quality: High-quality data is essential for training deep learning models. Noisy or
incorrect data can adversely affect the model's performance.
Data Preprocessing: Proper data preprocessing, including normalization, scaling, and
data augmentation, can impact how well the model learns from the data.
Model Architecture: Choosing an appropriate model architecture is crucial. Different
architectures are suitable for different types of tasks and data.
Initialization: Random initialization of model parameters can impact how quickly the
model converges during training.
Learning Rate: The learning rate determines the step size for parameter updates during
training. An appropriate learning rate is essential for stable and effective training.
Optimizers: Different optimization algorithms, such as Adam, SGD, or RMSprop, can
affect the speed and efficiency of model training.
Batch Size: The number of samples processed in each training iteration (batch) can
impact the training process and convergence.
Compute Intensive Training Process
Training deep learning models can be extremely compute-intensive, especially for large-
scale models and datasets. The computational demands arise from the huge number of
operations involved in forward and backward passes during training. These operations
include matrix multiplications, convolutions, gradient computations etc.
The time required to train a deep learning model depends on several factors, including
the model's architecture, dataset size, hardware specifications (CPU/GPU), and
hyperparameters like batch size and learning rate. Larger models with more parameters
and complex architectures usually require more computational power and time for
training. To estimate training times for specific CPUs/GPUs, we need to consider the
hardware specifications and theoretical compute capabilities.
Here's how to estimate training time for a deep learning model on a specific CPU or
GPU:
Calculate the FLOPs required for one forward and backward pass through the model.
Multiply the FLOPs required per iteration by the number of iterations to get the total FLOPs.
Divide the total FLOPs by the CPU/GPU's FLOPS (operations per second) to get the estimated training time.
Training times can be affected by various factors, such as data loading and preprocessing,
communication overhead and implementation efficiency.
Here's an example:
Let us do a sample calculation for estimating the training time for a deep learning model
on two specific CPUs/GPUs: A and B.
Assumptions:
CPU/GPU A has 10 TFLOPS (10 trillion floating-point operations per second) compute
capability.
The model requires 1 TFLOPs (one trillion floating-point operations) for one forward and backward pass.
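The example breaks off here (CPU/GPU B's specifications are not given above), but the arithmetic for device A can be sketched as follows, with an assumed iteration count of 100,000:

```python
# Estimated training time = total FLOPs / device FLOPS.
flops_per_iteration = 1e12   # 1 TFLOPs per forward + backward pass (given)
iterations = 100_000         # assumed number of training iterations
device_flops = 10e12         # CPU/GPU A: 10 TFLOPS (given)

total_flops = flops_per_iteration * iterations
seconds = total_flops / device_flops
print(seconds)               # 10000.0 seconds, i.e., just under 3 hours
```

In practice the real time would be longer, since this ignores data loading, communication overhead, and implementation efficiency, as noted above.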
Calibration
Calibration is a post-processing technique used to improve the reliability of probability
estimates produced by deep learning models. It ensures that the predicted probabilities
are well-calibrated and reflect the actual probabilities as closely as possible. Firstly the
model is trained as usual to make predictions on a given task. After training, the model's
predicted probabilities are compared to the actual probabilities observed in the
validation or test data. A calibration function is then applied to adjust the predicted
probabilities based on the observed discrepancies between the model's predictions and
the true probabilities.
Knowledge Distillation
Knowledge distillation is a technique used for model compression and transfer learning.
It involves transferring the knowledge from a large, complex model (often referred to as
the 'teacher' model) to a smaller, more efficient model (the 'student' model). The teacher
model is trained on the target task using the standard training process. The student
model is initialized and trained to mimic the behavior of the teacher model. Instead of
using the true labels during training, the student model is trained with the soft targets
produced by the teacher model. Soft targets are the teacher model's predicted
probabilities for each class instead of the hard labels (0 or 1). This enables the student
model to learn from the rich knowledge and insights of the teacher model. It allows the
student model to achieve comparable performance to the teacher model while being
more computationally efficient and having a smaller memory footprint.
Conclusion
The training process of deep learning models is crucial in enabling machines to learn
and make decisions like the human brain. Through proper data collection and
preprocessing, we ensure the model's ability to generalize and perform well on new
data. Optimization techniques like Adam optimizer play an important role in fine-tuning
the model's parameters for better accuracy and faster convergence. The future of deep
learning holds great promise. With ongoing research and advancements, we can expect
even more significant breakthroughs in various fields such as image recognition, natural
language processing, and medical diagnosis. As computing power and data availability
continue to grow, deep learning models will become more powerful and pave the way
for new applications and help solve complex problems in ways we couldn't have
imagined before.
You can imagine that if neurons are randomly dropped out of the
network during training, other neurons will have to step in and handle
the representation required to make predictions for the missing
neurons. This is believed to result in multiple independent internal
representations being learned by the network.
The effect is that the network becomes less sensitive to the specific
weights of neurons. This, in turn, results in a network capable of better
generalization and less likely to overfit the training data.
There are 60 input values and a single output value. The input values
are standardized before being used in the network. The baseline neural
network model has two hidden layers, the first with 60 units and the
second with 30. Stochastic gradient descent is used to train the model
with a relatively low learning rate and momentum.
In the example below, a new Dropout layer between the input (or
visible layer) and the first hidden layer was added. The dropout rate is
set to 20%, meaning one in five inputs will be randomly excluded from
each update cycle.
The learning rate was lifted by one order of magnitude, and the
momentum was increased to 0.9. These increases in the learning rate
were also recommended in the original Dropout paper.
Dropout will randomly reset some of the inputs to zero. If you wonder
what happens after you have finished training, the answer is nothing! In
Keras, a layer can tell if the model is running in training mode or not,
and the Dropout layer will randomly reset some input only when the
model runs for training. At inference time, the Dropout layer passes its
input through unchanged. To keep the scale of the input consistent
between the two modes, Keras uses 'inverted' dropout: if the dropout
rate is p, the surviving inputs are scaled up by a factor of 1/(1−p)
during training, so the next layer sees input similar in scale.
CNN architecture
A Convolutional Neural Network consists of multiple layers: the input layer, convolutional
layers, pooling layers, and fully connected layers.
Simple CNN architecture
The Convolutional layer applies filters to the input image to extract features, the Pooling layer
downsamples the image to reduce computation, and the fully connected layer makes the final
prediction. The network learns the optimal filters through backpropagation and gradient descent.
Convolutional Neural Networks, or convnets, are neural networks that share their parameters.
Imagine you have an image. It can be represented as a cuboid having a length and width (the
dimensions of the image) and a height (i.e., the channels, as images generally have red, green,
and blue channels).
Now imagine taking a small patch of this image and running a small neural network, called a filter
or kernel, on it, with say, K outputs, and representing them vertically. Now slide that neural network
across the whole image; as a result, we will get another image with a different width, height, and
depth. Instead of just R, G, and B channels we now have more channels, but less width and
height. This operation is called Convolution. If the patch size were the same as that of the image, it
would be a regular neural network. Because of this small patch, we have fewer weights.
Image source: Deep Learning Udacity
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
Convolution layers consist of a set of learnable filters (or kernels) having small widths and
heights and the same depth as that of input volume (3 if the input layer is image input).
For example, suppose we run a convolution on an image with dimensions 34x34x3. The possible
size of the filters is a x a x 3, where 'a' can be 3, 5, or 7, but small compared to
the image dimension.
During the forward pass, we slide each filter across the whole input volume step by step where
each step is called stride (which can have a value of 2, 3, or even 4 for high-dimensional
images) and compute the dot product between the kernel weights and patch from input volume.
As we slide our filters we’ll get a 2-D output for each filter and we’ll stack them together as a
result, we’ll get output volume having a depth equal to the number of filters. The network will
learn all the filters.
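The bookkeeping above can be sketched in a few lines. With padding P, the spatial size of each 2-D output is (W - F + 2P)/S + 1 for image size W, filter size F, and stride S, and stacking K filters gives an output depth of K (the 34x34x3 example from the text is assumed):

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial size of a convolution output: (W - F + 2P) // S + 1."""
    return (w - f + 2 * p) // s + 1

# a x a x 3 filters on a 34x34x3 image, as in the example above
for a in (3, 5, 7):
    print(a, conv_output_size(34, a))   # 3 -> 32, 5 -> 30, 7 -> 28

# stacking K filters gives an output volume of depth K
n_filters = 12
out_volume = (conv_output_size(34, 3), conv_output_size(34, 3), n_filters)
print(out_volume)                       # (32, 32, 12)
```

The same formula covers pooling: a 2x2 window with stride 2 on a 32-wide input gives (32 - 2)/2 + 1 = 16.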
Layers used to build ConvNets
A complete Convolutional Neural Network architecture is also known as a convnet. A convnet is a
sequence of layers, and every layer transforms one volume to another through a differentiable
function.
Types of layers:
Let's take an example by running a convnet on an image of dimension 32 x 32 x 3.
Input Layers: It’s the layer in which we give input to our model. In CNN, Generally, the input
will be an image or a sequence of images. This layer holds the raw input of the image with
width 32, height 32, and depth 3.
Convolutional Layers: This is the layer used to extract features from the input
dataset. It applies a set of learnable filters, known as kernels, to the input images. The
filters/kernels are small matrices, usually of 2×2, 3×3, or 5×5 shape. Each filter slides over the
input image data and computes the dot product between the kernel weights and the corresponding
input image patch. The output of this layer is referred to as feature maps. Suppose we use a total
of 12 filters for this layer; we'll get an output volume of dimension 32 x 32 x 12.
Activation Layer: By adding an activation function to the output of the preceding layer,
activation layers add nonlinearity to the network. An element-wise activation function is
applied to the output of the convolution layer. Some common activation functions are ReLU:
max(0, x), Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output volume will
have dimensions 32 x 32 x 12.
Pooling Layer: This layer is periodically inserted in the convnet; its main function is to
reduce the size of the volume, which makes computation fast, reduces memory use, and also
helps prevent overfitting. Two common types of pooling layers are max pooling and average
pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of
dimension 16x16x12.
Flattening: The resulting feature maps are flattened into a one-dimensional vector after the
convolution and pooling layers so they can be passed into a fully connected layer for
classification or regression.
Fully Connected Layers: It takes the input from the previous layer and computes the final
classification or regression task.
Image source: cs231n.stanford.edu
Output Layer: The output from the fully connected layers is then fed into a function
appropriate for the classification task, such as sigmoid or softmax, which converts the
score of each class into a probability.
Example:
Let’s consider an image and apply the convolution layer, activation layer, and pooling layer
operation to extract the inside feature.
Input image:
Input image
Step:
import the necessary libraries
set the parameter
define the kernel
Load the image and plot it.
Reformat the image
Apply convolution layer operation and plot the output image.
Apply activation layer operation and plot the output image.
Apply pooling layer operation and plot the output image.
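The steps above can be sketched with plain NumPy (the "image" is random and the 3x3 kernel is a hypothetical edge-style filter; loading and plotting are omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((8, 8))                  # toy grayscale "image"

# define the kernel: a 3x3 edge-detection-style filter, chosen for illustration
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=float)

def conv2d(img, k):
    """Valid convolution: slide the kernel and take dot products with each patch."""
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def relu(x):                                # activation layer
    return np.maximum(0.0, x)

def max_pool(x, size=2, stride=2):          # pooling layer
    oh, ow = x.shape[0] // stride, x.shape[1] // stride
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

feature_map = conv2d(image, kernel)         # 8x8 -> 6x6
activated = relu(feature_map)               # same shape, negatives zeroed
pooled = max_pool(activated)                # 6x6 -> 3x3
```

Each stage shrinks or transforms the volume exactly as described in the layer list above.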
Recurrent Neural Network
The Recurrent Neural Network consists of multiple fixed activation function units, one
for each time step. Each unit has an internal state which is called the hidden state of
the unit. This hidden state signifies the past knowledge that the network currently
holds at a given time step. This hidden state is updated at every time step to signify
the change in the knowledge of the network about the past. The hidden state is
updated using the following recurrence relation:-
The formula for calculating the current state:
ht = f(ht-1, xt)
where:
ht -> current state
ht-1 -> previous state
xt -> input state
Formula for applying the activation function (tanh):
ht = tanh(whh · ht-1 + wxh · xt)
where:
whh -> weight at the recurrent neuron
wxh -> weight at the input neuron
The formula for calculating the output:
yt = why · ht
where:
yt -> output
why -> weight at the output layer
These parameters are updated using Backpropagation. However, since RNN works on
sequential data here we use an updated backpropagation which is known as
Backpropagation through time.
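The recurrence above can be sketched as a forward pass in NumPy; the dimensions and weights are illustrative, and note that the same three weight matrices are reused at every time step:

```python
import numpy as np

rng = np.random.default_rng(2)

# toy dimensions (illustrative only)
input_dim, hidden_dim, output_dim, steps = 3, 4, 2, 5

w_xh = rng.standard_normal((hidden_dim, input_dim)) * 0.1   # input -> hidden
w_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1  # hidden -> hidden (shared)
w_hy = rng.standard_normal((output_dim, hidden_dim)) * 0.1  # hidden -> output

xs = rng.standard_normal((steps, input_dim))  # a sequence of 5 input vectors

h = np.zeros(hidden_dim)                      # initial hidden state
outputs = []
for x_t in xs:
    h = np.tanh(w_hh @ h + w_xh @ x_t)        # ht = tanh(whh·ht-1 + wxh·xt)
    outputs.append(w_hy @ h)                  # yt = why·ht
```

Because the hidden state carries over between iterations, the output at each step depends on the entire input history up to that step.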
Backpropagation Through Time (BPTT)
In an RNN the computation is ordered: each variable is computed one at a time in a
specified order, first h1, then h2, then h3, and so on. Hence we apply backpropagation
through all these hidden time states sequentially, from the last time step back to the
first.
For simplicity, consider the gradient of the loss with respect to the shared weight
matrix W at a single time step. Part of the computation is the same as in any simple
deep neural network. However, the previous hidden state cannot be treated as a
constant, because it also depends on W; the total derivative therefore has two parts:
the direct gradient at the current time step, and the gradient that flows back through
the earlier hidden states.
Advantages of Recurrent Neural Network
1. An RNN retains information through time, which is exactly what makes it useful in
time-series prediction: it can take previous inputs into account. (Architectures such
as Long Short-Term Memory (LSTM) networks extend this memory to longer sequences.)
2. Recurrent neural networks are even used with convolutional layers to extend the
effective pixel neighborhood.
Disadvantages of Recurrent Neural Network
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences if using tanh or relu as an activation
function.
Applications of Recurrent Neural Network
1. Language Modelling and Generating Text
2. Speech Recognition
3. Machine Translation
4. Image Recognition, Face detection
5. Time series Forecasting
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the
network.
1. One to One
2. One to Many
3. Many to One
4. Many to Many
One to One
This type of RNN behaves the same as any simple neural network; it is also known as a
Vanilla Neural Network. In this network, there is only one input and one output.
One To Many
In this type of RNN, there is one input and many outputs associated with it. One of the
most used examples of this network is image captioning, where, given an image, we
predict a sentence having multiple words.
Many to One
In this type of network, many inputs are fed to the network at several states of the
network, generating only one output. This type of network is used in problems like
sentiment analysis, where we give multiple words as input and predict only the
sentiment of the sentence as output.
Many to One RNN
Many to Many
In this type of neural network, there are multiple inputs and multiple outputs
corresponding to a problem. One example of this is language translation: we provide
multiple words from one language as input and predict multiple words in the second
language as output.
Recurrent Neural Network vs Deep Neural Network
Weights: In an RNN, the same weights are shared across all time steps; in a DNN, the
weights are different for each layer of the network.
Sequential data: RNNs are used when the data is sequential and the number of inputs is
not predefined; a simple DNN has no special method for sequential data, and the number
of inputs is fixed.
Gradients: Exploding and vanishing gradients are the major drawback of RNNs; these
problems also occur in DNNs, but they are not the major problem there.
DBNs also differ from other deep learning algorithms, such as autoencoders and
restricted Boltzmann machines (RBMs), in that they do not use raw inputs in the
same way. They instead operate on an input layer with one neuron for each element
of the input vector and pass the signal through numerous layers before arriving at
the final layer, where outputs are produced using probabilities acquired from the
earlier layers.
Architecture of DBN
The basic structure of a DBN is composed of several layers of RBMs. A
probability distribution is learned over the input data by each RBM, which is
a generative model. While the successive layers of the DBN learn higher-
level features, the initial layer of the DBN learns the fundamental structure
of the data. For supervised learning tasks like classification or regression,
the DBN's last layer is used.
After the DBN has been trained, supervised learning tasks can be performed
on it by adjusting the weights of the final layer using a supervised learning
technique like backpropagation. This fine-tuning process can improve the
DBN's performance on the specific task it was trained for.
Evolving of DBN
The earliest generation of neural networks, called perceptrons, is surprisingly
powerful: depending on the task, they can help us recognize an object in a picture
or estimate how much we would enjoy a certain cuisine. But they are constrained.
They consider one piece of information at a time and find it hard to take into
account the context of what is going on around them.
Deep belief networks (DBNs) were developed to overcome these limits. By stacking
layers that each learn from the representations below, a DBN builds stable
intermediate representations, so that no matter what happens along the way, a
sensible answer remains at hand.
Working of DBN
First, we train a layer of features that receives its input directly from the
pixels. Then, treating the activations of these learned features as if they were
pixels, we learn features of features in a second hidden layer. Every new layer of
features we add to the network raises the lower bound on the log-likelihood of the
training data set.
The following describes the operating pipeline of the deep belief network −
We begin by performing multiple Gibbs sampling iterations in the top two hidden layers.
The top two hidden layers define an RBM, so this stage effectively draws a
sample from it.
After that, we run a single ancestral (top-down) sampling pass through the rest of the
model to create a sample from the visible units.
To infer the values of the latent variables in each layer, we use a single bottom-up
pass. Greedy pretraining starts with an observed data vector in the lowest layer and
then adjusts the generative weights in the opposite direction.
Advantages of DBN
One of the main advantages of DBNs is their ability to learn features from
the data in an unsupervised manner. This means they do not require labeled
data, which can be difficult and time-consuming. A hierarchical
representation of the data can also be learned by DBNs, with each layer
learning increasingly sophisticated features. For applications like picture
identification, where the earliest layers can pick up on fundamental details
like edges, this hierarchical representation can be quite helpful. Later layers,
however, are capable of learning more intricate properties, such forms and
objects
DBNs can also be employed for generative activities like the creation of text
and images. The DBN can learn a probability distribution over the data
thanks to the unsupervised pretraining of the RBMs, and new samples that
resemble the training data can be produced. This may be helpful in
applications such as computer vision, where new images can be generated
conditioned on labels or other attributes.
The issue of vanishing gradients is among the key difficulties in deep neural
network training. The gradients used to update the weights during training
may get very small as the network's layer count rises, making it challenging
to train the network efficiently. Because of the RBMs' unsupervised pretraining,
DBNs can mitigate this issue. Each RBM learns a representation of the data
during pretraining that is comparatively stable and does not change
dramatically with minor weight adjustments. This means that the gradients
used to update the weights are significantly larger when the DBN is
fine-tuned for a supervised task, improving training efficiency.
Conclusion
In conclusion, the powerful deep learning architecture known as DBNs can
be applied to a variety of tasks. They are made up of RBM layers that are
taught unsupervised, with supervised learning applied to the final layer.
DBNs are one of the most powerful deep learning architectures currently
available and have been demonstrated to produce state-of-the-art
results on a variety of tasks. They can learn features from
the data without supervision, are resistant to overfitting, and can be used
to initialize the weights of other deep-learning architectures.
UNIT-IV
PROBABILISTIC NEURAL NETWORK
The Hopfield Neural Network, invented by Dr. John J. Hopfield, consists of one layer
of 'n' fully connected recurrent neurons. It is generally used for performing auto-
association and optimization tasks. It is calculated using a converging iterative
process, and it generates a different response than our normal neural nets.
Discrete Hopfield Network
It is a fully interconnected neural network where each unit is connected to every other
unit. It behaves in a discrete manner, i.e. it gives finite distinct output, generally of two
types:
Binary (0/1)
Bipolar (-1/1)
The weights associated with this network are symmetric in nature and have the
following properties:
1. wij = wji for all i, j (the weight matrix is symmetric)
2. wii = 0 (no unit has a connection to itself)
Apply the activation over the total input yin_i = xi + Σj yj·wji to calculate the
output as per the equation given below:
yi = 1 if yin_i > θi; yi (unchanged) if yin_i = θi; 0 if yin_i < θi
where θi is the threshold of unit i.
As an example, take an input vector x with missing entries, x = [0 0 1 0] ([x1 x2 x3 x4])
(binary). Set yi = xi, so y = [0 0 1 0] ([y1 y2 y3 y4]). Choose a unit yi (the order
doesn't matter) for updating its activation, and take the i-th column of the weight
matrix for the calculation.
(We repeat the following steps for all units yi and check whether the network
converges.)
For each subsequent unit, we take the updated values via feedback (i.e. y =
[1 0 1 0]) and repeat the calculation until a full pass over all units leaves y
unchanged.
The continuous Hopfield network is bound to converge if the activity of each neuron
with respect to time is given by the following differential equation:
dui/dt = -ui + Σj wij·g(uj) + xi
where ui is the internal activity of unit i, g is the (sigmoid) activation, and xi is
the external input.
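The weight matrix of the worked example above appeared only as a figure; as an illustration, assume the network stores the pattern [1 0 1 0] via the Hebbian rule on its bipolar form. A NumPy sketch of the asynchronous update then reproduces the y = [1 0 1 0] iterations quoted above:

```python
import numpy as np

# Store the binary pattern s = [1 0 1 0] with the Hebbian rule on its
# bipolar form (2s - 1); the diagonal is zeroed (no self-connections).
s = np.array([1, 0, 1, 0])
b = 2 * s - 1                       # [ 1, -1,  1, -1]
W = np.outer(b, b)
np.fill_diagonal(W, 0)              # W is symmetric with zero diagonal

def update(y, W, theta=0.0):
    """One asynchronous pass: update each unit in turn using feedback."""
    y = y.copy()
    for i in range(len(y)):
        y_in = y @ W[:, i]          # total input to unit i
        if y_in > theta:
            y[i] = 1
        elif y_in < theta:
            y[i] = 0                # unchanged when y_in == theta
    return y

y = np.array([0, 0, 1, 0])          # input with missing entries
while True:
    y_next = update(y, W)
    if np.array_equal(y_next, y):   # converged: a full pass changes nothing
        break
    y = y_next

# y now holds the recovered stored pattern [1 0 1 0]
```

The first pass already flips y1 to 1 (total input +1 from unit y3), matching the feedback updates described above, and the second pass confirms convergence.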
Deep Learning models are broadly classified into supervised and unsupervised
models.
Supervised DL models:
Artificial Neural Networks (ANNs)
Recurrent Neural Networks (RNNs)
Convolutional Neural Networks (CNNs)
Unsupervised DL models:
Self Organizing Maps (SOMs)
Boltzmann Machines
Autoencoders
Let us learn what exactly Boltzmann machines are, how they work and also implement
a recommender system which recommends whether the user likes a movie or not
based on the previous movies watched.
Boltzmann Machine
Energy-Based Models:
Boltzmann Distribution is used in the sampling distribution of the Boltzmann
Machine. The Boltzmann distribution is governed by the equation –
Pi = e^(-εi/kT) / Σj e^(-εj/kT)
Pi - probability of the system being in state i
εi - energy of the system in state i
T - temperature of the system
k - Boltzmann constant
Σj e^(-εj/kT) - sum over all possible states j of the system
Boltzmann Distribution describes different states of the system and thus Boltzmann
machines create different states of the machine using this distribution. From the above
equation, as the energy of system increases, the probability for the system to be in
state ‘i’ decreases. Thus, the system is the most stable in its lowest energy state (a
gas is most stable when it spreads). Here, in Boltzmann machines, the energy of the
system is defined in terms of the weights of synapses. Once the system is trained
and the weights are set, the system always tries to find the lowest energy state for
itself by adjusting the weights.
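The distribution above can be checked numerically; a short sketch with hypothetical energy levels (k and T folded into a single constant kT):

```python
import numpy as np

def boltzmann(energies, kT=1.0):
    """P_i = exp(-E_i / kT) / sum_j exp(-E_j / kT)."""
    p = np.exp(-np.asarray(energies, dtype=float) / kT)
    return p / p.sum()

energies = [0.0, 1.0, 2.0]          # hypothetical energy levels
p = boltzmann(energies)

# probabilities sum to 1, and lower-energy states are more probable,
# which is exactly why the system settles into low-energy states
```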
Types of Boltzmann Machines:
Restricted Boltzmann Machines (RBMs)
Deep Belief Networks (DBNs)
Deep Boltzmann Machines (DBMs)
Restricted Boltzmann Machines (RBMs):
In a full Boltzmann machine, each node is connected to every other node, so the
number of connections grows quadratically with the number of nodes. This is the
reason we use RBMs. The restrictions on node connections in RBMs are as follows –
Hidden nodes cannot be connected to one another.
Visible nodes cannot be connected to one another.
The energy function of a Restricted Boltzmann Machine –
E(v, h) = -Σi ai·vi - Σj bj·hj - Σi Σj vi·wij·hj
ai, bj - biases in the system (constants)
vi, hj - visible node i, hidden node j
P(v, h) - probability of being in a certain state
P(v, h) = e^(-E(v, h))/Z
Z - the sum of e^(-E(v, h)) over all possible states (the partition function)
Suppose we are using our RBM to build a recommender system that works on six (6)
movies. The RBM learns how to allocate its hidden nodes to certain features, and
through the process of Contrastive Divergence it is fitted to our particular set of
movies. The RBM identifies which features are important during the training process.
Each training value is 1, 0, or missing, based on whether the user liked the movie
(1), disliked it (0), or did not watch it (missing data). The RBM automatically
identifies the important features.
Contrastive Divergence:
RBM adjusts its weights by this method. Using some randomly assigned initial
weights, RBM calculates the hidden nodes, which in turn use the same weights to
reconstruct the input nodes. Each hidden node is constructed from all the visible
nodes, and each visible node is reconstructed from all the hidden nodes; hence the
reconstructed input differs from the original input even though the weights are the
same. The process continues until the reconstructed input matches the previous input. The
process is said to be converged at this stage. This entire procedure is known
as Gibbs Sampling.
Gibb’s Sampling
The Gradient Formula gives the gradient of the log probability of the certain state of
the system with respect to the weights of the system. It is given as follows –
d/dwij (log P(v0)) = <vi0 · hj0> - <vi∞ · hj∞>
v - visible state, h - hidden state
<vi0 · hj0> - expectation in the initial state of the system
<vi∞ · hj∞> - expectation in the final (equilibrium) state of the system
P(v0) - probability that the system is in state v0
wij - weights of the system
The above equation tells us how a change in the weights of the system will change the
log probability of the system being in a particular state. The system tries to end up
in the lowest possible energy state (the most stable). Instead of continuing to adjust
the weights until the reconstructed input matches the previous one, we can also run
the sampling chain for only the first few steps. This is sufficient to understand how
to adjust the energy curve so as to lower the energy of the current position.
Therefore, we adjust the weights and reshape the energy curve so that we get low
energy for the current position. This is known as Hinton's shortcut.
Working of RBM
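The working just described can be sketched as one step of CD-1, assuming binary units, sigmoid activations, and hypothetical dimensions (six visible units for the six movies, three hidden features):

```python
import numpy as np

rng = np.random.default_rng(3)

n_visible, n_hidden = 6, 3          # e.g. six movies, three hidden features
W = rng.standard_normal((n_visible, n_hidden)) * 0.1
a = np.zeros(n_visible)             # visible biases
b = np.zeros(n_hidden)              # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, lr=0.1):
    """One step of Contrastive Divergence (CD-1) via Gibbs sampling."""
    global W, a, b
    # positive phase: sample hidden units from the data
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(n_hidden) < ph0).astype(float)
    # negative phase: reconstruct the visible units, then resample hidden
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(n_visible) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # move weights toward <v0 h0> and away from <v1 h1>
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)
    return v1                        # the reconstructed input

# user liked movies 0, 2 and 5; disliked the rest
v0 = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
for _ in range(100):
    v1 = cd1_step(v0)
```

After enough updates the reconstruction v1 tends to match v0, which is the convergence criterion described above.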
Types of RBM:
There are mainly two types of Restricted Boltzmann Machine (RBM) based on the
types of variables they use:
1. Binary RBM: In a binary RBM, the input and hidden units are binary variables.
Binary RBMs are often used in modeling binary data such as images or text.
2. Gaussian RBM: In a Gaussian RBM, the input and hidden units are continuous
variables that follow a Gaussian distribution. Gaussian RBMs are often used in
modeling continuous data such as audio signals or sensor data.
Apart from these two types, there are also variations of RBMs such as:
1. Deep Belief Network (DBN): A DBN is a type of generative model that consists of
multiple layers of RBMs. DBNs are often used in modeling high-dimensional data
such as images or videos.
2. Convolutional RBM (CRBM): A CRBM is a type of RBM that is designed
specifically for processing images or other grid-like structures. In a CRBM, the
connections between the input and hidden units are local and shared, which makes
it possible to capture spatial relationships between the input units.
3. Temporal RBM (TRBM): A TRBM is a type of RBM that is designed for processing
temporal data such as time series or video frames. In a TRBM, the hidden units are
connected across time steps, which allows the network to model temporal
dependencies in the data.
Activation functions (sigmoid function) in Neural
Networks
In the process of building a neural network, one of the choices you get to make is
what Activation Function to use in the hidden layer as well as at the output layer of the
network. This article discusses some of the choices.
Elements of a Neural Network
Input Layer: This layer accepts input features. It provides information from the outside
world to the network, no computation is performed at this layer, nodes here just pass
on the information(features) to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world, they are part of
the abstraction provided by any neural network. The hidden layer performs all sorts of
computation on the features entered through the input layer and transfers the result to
the output layer.
Output Layer: This layer brings the information learned by the network to the outer world.
What is an activation function and why use them?
The activation function decides whether a neuron should be activated or not by calculating the
weighted sum and further adding bias to it. The purpose of the activation function is to introduce
non-linearity into the output of a neuron.
Explanation: We know, the neural network has neurons that work in correspondence
with weight, bias, and their respective activation function. In a neural network, we would update
the weights and biases of the neurons on the basis of the error at the output. This process is
known as back-propagation. Activation functions make the back-propagation possible since the
gradients are supplied along with the error to update the weights and biases.
Why do we need Non-linear activation function?
A neural network without an activation function is essentially just a linear regression model. The
activation function applies a non-linear transformation to the input, making the network capable
of learning and performing more complex tasks.
Mathematical proof
Suppose we have a Neural net like this :-
Elements of the diagram are as follows:
Hidden layer i.e. layer 1:
z(1) = W(1)X + b(1)
a(1) = z(1)
Here,
z(1) is the vectorized output of layer 1
W(1) is the vectorized weights assigned to the neurons of the hidden layer i.e. w1, w2, w3
and w4
X is the vectorized input features i.e. i1 and i2
b(1) is the vectorized bias assigned to the neurons in the hidden layer i.e. b1 and b2
a(1) is the vectorized form of the linear output of layer 1
(Note: We are not considering an activation function here, so a(1) is simply z(1))
The activation that works almost always better than the sigmoid function is the Tanh
function, also known as the Tangent Hyperbolic function. It is actually a
mathematically shifted version of the sigmoid function. Both are similar and can be
derived from each other.
Equation :- f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) = 2·sigmoid(2x) - 1
Value Range :- -1 to +1
Nature :- non-linear
Uses :- Usually used in hidden layers of a neural network, as its values lie
between -1 and +1; the mean of the hidden-layer activations therefore comes out to
be 0 or very close to it, which helps in centering the data. This makes learning
for the next layer much easier.
RELU Function
It stands for Rectified Linear Unit. It is the most widely used activation function,
chiefly implemented in the hidden layers of a neural network.
Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.
Value Range :- [0, inf)
Nature :- non-linear, which means we can easily backpropagate the errors and
have multiple layers of neurons being activated by the ReLU function.
Uses :- ReLU is less computationally expensive than tanh and sigmoid because it
involves simpler mathematical operations. At any time only a few neurons are
activated, making the network sparse and hence efficient and easy to compute.
In simple words, ReLU learns much faster than the sigmoid and Tanh functions.
Softmax Function
The softmax function is also a type of sigmoid function, but it is handy when we are
trying to handle multi-class classification problems.
Nature :- non-linear
Uses :- Usually used when handling multiple classes. The softmax function is
commonly found in the output layer of image classification problems. It squeezes
the output for each class to between 0 and 1 and divides by the sum of the outputs.
Output :- The softmax function is ideally used in the output layer of the classifier,
where we are actually trying to obtain the probabilities that define the class of
each input.
The basic rule of thumb: if you really don't know which activation function to use,
simply use ReLU, as it is a general-purpose activation function for hidden layers
and is used in most cases these days.
If your output is for binary classification, the sigmoid function is a very natural
choice for the output layer.
If your output is for multi-class classification, Softmax is very useful for
predicting the probabilities of each class.
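The four functions discussed above are a few lines each in NumPy (the input vector x is illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)               # shifted/scaled sigmoid: 2*sigmoid(2x) - 1

def relu(x):
    return np.maximum(0.0, x)       # max(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))       # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 2.0])
# sigmoid(x) lies in (0, 1); tanh(x) in (-1, 1); relu zeroes negatives;
# softmax outputs sum to 1 and act as class probabilities
```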
Deep Autoencoders
A deep autoencoder is composed of two symmetrical deep-belief networks: typically
four or five shallow layers representing the encoding half of the net, and a
second set of four or five layers that make up the decoding half.
The layers are restricted Boltzmann machines, the building blocks of deep-belief
networks, with several peculiarities that we’ll discuss below. Here’s a simplified schema
of a deep autoencoder’s structure, which we’ll explain below.
Processing the benchmark dataset MNIST, a deep autoencoder would use binary
transformations after each RBM. Deep autoencoders can also be used for other types
of datasets with real-valued data, on which you would use Gaussian rectified
transformations for the RBMs instead.
784 (input) ----> 1000 ----> 500 ----> 250 ----> 100 -----> 30
If, say, the input fed to the network is 784 pixels (the square of the 28x28 pixel images
in the MNIST dataset), then the first layer of the deep autoencoder should have 1000
parameters; i.e. slightly larger.
This may seem counterintuitive, because having more parameters than input is a good
way to overfit a neural network.
In this case, expanding the parameters, and in a sense expanding the features of the
input itself, will make the eventual decoding of the autoencoded data possible.
The layers will be 1000, 500, 250, 100 nodes wide, respectively, until the end, where
the net produces a vector 30 numbers long. This 30-number vector is the last layer of
the first half of the deep autoencoder, the pretraining half, and it is the product of a
normal RBM, rather than a classification output layer such as Softmax or logistic
regression, as you would normally see at the end of a deep-belief network.
Decoding Representations
Those 30 numbers are an encoded version of the 28x28 pixel image. The second half of
a deep autoencoder actually learns how to decode the condensed vector, which
becomes the input as it makes its way back.
The decoding half of a deep autoencoder is a feed-forward net with layers 100, 250,
500 and 1000 nodes wide, respectively. Layer weights are initialized randomly.
The decoding half of a deep autoencoder is the part that learns to reconstruct the
image. It does so with a second feed-forward net which also conducts back
propagation. The back propagation happens through reconstruction entropy.
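A shape-only sketch of the two halves (random weights and a tanh squashing function; an untrained stand-in for illustration, not the RBM pretraining itself):

```python
import numpy as np

rng = np.random.default_rng(4)

# layer widths from the schema above: 784 -> 1000 -> 500 -> 250 -> 100 -> 30
encoder_dims = [784, 1000, 500, 250, 100, 30]
decoder_dims = encoder_dims[::-1]   # decoding half mirrors the encoder

def make_weights(dims):
    return [rng.standard_normal((m, n)) * 0.01 for m, n in zip(dims, dims[1:])]

def forward(x, weights):
    for W in weights:
        x = np.tanh(x @ W)          # squashing nonlinearity at each layer
    return x

enc_w = make_weights(encoder_dims)  # pretraining half
dec_w = make_weights(decoder_dims)  # decoding half, initialized randomly

x = rng.random(784)                 # one flattened 28x28 MNIST-style image
code = forward(x, enc_w)            # the 30-number encoded vector
recon = forward(code, dec_w)        # reconstruction, to be trained via backprop
```

Training would then minimize the reconstruction error between recon and x, as described above.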
Training Nuances
At the stage of the decoder’s backpropagation, the learning rate should be lowered, or
made slower: somewhere between 1e-3 and 1e-6, depending on whether you’re
handling binary or continuous data, respectively.
Use Cases
Image Search
As described above, a deep autoencoder can compress each image into a vector of 30
numbers. An image search engine can index images by these vectors; vectors containing
similar numbers will be returned for a search query and translated back into their
matching images.
Data Compression
Deep autoencoders are also useful in topic modeling, or statistically modeling abstract
topics that are distributed across a collection of documents; each document is first
converted into a vector of scaled word counts.
The scaled word counts are then fed into a deep-belief network, a stack of restricted
Boltzmann machines, which themselves are just a subset of feedforward-backprop
autoencoders. Those deep-belief networks, or DBNs, compress each document to a set
of 10 numbers through a series of sigmoid transforms that map it onto the feature
space.
Each document’s number set, or vector, is then introduced to the same vector space,
and its distance from every other document-vector measured. Roughly speaking,
nearby document-vectors fall under the same topic.
For example, one document could be the “question” and others could be the “answers,”
a match the software would make using vector-space measurements.
UNIT-V APPLICATIONS
Object Detection vs Object Recognition vs Image
Segmentation
Object Recognition:
Object recognition is the technique of identifying the object present in images and
videos. It is one of the most important applications of machine learning and deep
learning. The goal of this field is to teach machines to understand (recognize) the
content of an image just like humans do.
Object Recognition
Image Classification:
Image classification takes an image as input and outputs the classification label of
that image with some metric (probability, loss, accuracy, etc.). For example: an image
of a cat can be classified with the class label "cat", or an image of a dog can be
classified with the class label "dog", with some probability.
Image Classification
Object Localization: This algorithm locates the presence of an object in the image
and represents it with a bounding box. It takes an image as input and outputs the
location of the bounding box in the form of (position, height, and width).
Object Detection:
Object Detection algorithms act as a combination of image classification and object
localization. It takes an image as input and produces one or more bounding boxes
with the class label attached to each bounding box. These algorithms are capable
enough to deal with multi-class classification and localization as well as to deal with
the objects with multiple occurrences.
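Predicted boxes are typically compared against ground truth with intersection over union (IoU); a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corners:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # overlap rectangle (empty if the boxes are disjoint)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1/7, roughly 0.143
```

A detection is usually counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.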
Challenges of Object Detection:
In object detection, the bounding boxes are always rectangular, so they do not help
in determining the shape of an object if the object has curved parts.
Object detection cannot accurately estimate some measurements, such as the area or
perimeter of an object, from the image.
Image Segmentation:
Image segmentation is a further extension of object detection in which we mark the
presence of an object through pixel-wise masks generated for each object in the
image. This technique is more granular than bounding-box generation because it helps
us determine the shape of each object present in the image: instead of drawing
bounding boxes, segmentation figures out which pixels make up each object. This
granularity helps us in various fields such as medical image processing, satellite
imaging, etc. Many image segmentation approaches have been proposed recently; one of
the most popular is Mask R-CNN, proposed by K. He et al. in 2017.
Object Detection vs Segmentation (Source: Link)
Applications:
The above-discussed object recognition techniques can be utilized in many fields such
as:
Driver-less Cars: Object Recognition is used for detecting road signs, other
vehicles, etc.
Medical Image Processing: Object recognition and image processing techniques
can help detect disease more accurately. Image segmentation helps to detect the
shape of a defect present in the body. For example, Google's AI system for breast
cancer detection has been reported to match or exceed radiologists' accuracy in
some studies.
Surveillance and Security: such as Face Recognition, Object Tracking, Activity
Recognition, etc.
Sparse Coding Neural Networks
1. Overview
In this tutorial, we’ll provide a quick introduction to the sparse coding neural
networks concept.
Sparse coding, in simple words, is a machine learning approach in which a dictionary of
basis functions is learned and then used to represent input as a linear combination of a
minimal number of these basis functions.
Moreover, sparse coding seeks a compact, efficient data representation, which may be
utilized for tasks such as denoising, compression, and classification.
2. Sparse Coding
In the case of sparse coding, the data is first translated into a high-dimensional feature
space, where the basis functions are then employed to represent the data as a linear
combination of a limited number of these basis functions.
The coefficients of the linear combination are selected to reduce the reconstruction
error between the original data and its representation, subject to a sparsity
constraint that encourages the coefficients to be zero or nearly zero. With most of
the coefficients at or near zero, this sparsity constraint pushes the final
representation to be a sparse linear combination of the basis functions.
In this way, sparse coding, as previously noted, seeks to discover a usable sparse
representation of any given data. Then, a sparse code will be used to encode data:
To learn the sparse representation, the sparse coding requires input data
This is incredibly helpful since you can utilize unsupervised learning to apply it
straight to any data
Without sacrificing any information, it will automatically find the representation
Sparse coding techniques aim to meet the following two constraints simultaneously:
It'll try to learn how to create a useful sparse representation of each data vector x^(k) as a vector h^(k)
It'll try to reconstruct each data vector x^(k) through a decoder matrix D applied to h^(k):

x^(k) ≈ D h^(k)    (1)

Sparse learning: a process that fits both the dictionary and the sparse representations to the input data.
The general form of the target function to optimize in the dictionary learning process may
be mathematically represented as follows:

min_{D, h^(1), ..., h^(N)}  Σ_{k=1}^{N} || x^(k) − D h^(k) ||²  +  λ || h^(k) ||₁    (2)

In the previous equation, λ is a constant; N denotes the number of data vectors;
and x^(k) means the k-th vector of the data. Furthermore, h^(k) is the sparse
representation of x^(k); D is the decoder matrix (dictionary); and the coefficient of
sparsity is λ.
The min expression in the general form gives the impression that we are nesting
two optimization problems. We can think of this as two separate optimization problems
competing to reach the optimal compromise:
Outer minimization: the procedure that deals with the left term of the sum,
attempting to adjust D to reduce the loss incurred when reconstructing the input
Inner minimization: the procedure that attempts to increase sparsity by lowering
the L1 norm || h^(k) ||₁
The following algorithm performs sparse coding by iteratively updating the coefficients
for each sample in the dataset, using the residuals from the previous iteration to update
the coefficients for the current iteration. The sparsity constraint is enforced by setting
the coefficients to zero or close to zero if they fall below a certain threshold.
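The iterative procedure just described can be sketched with a minimal implementation, here iterative soft thresholding (ISTA) in plain numpy; the dictionary, sparsity weight, and iteration count are illustrative assumptions rather than a definitive implementation:

```python
import numpy as np

def sparse_code_ista(x, D, lam=0.05, n_iter=200):
    """Sparse-code the columns of x against dictionary D using iterative
    soft thresholding (ISTA). lam and n_iter are illustrative choices."""
    # Step size from the Lipschitz constant of the reconstruction gradient
    L = np.linalg.norm(D, ord=2) ** 2
    h = np.zeros((D.shape[1], x.shape[1]))
    for _ in range(n_iter):
        # Gradient step on the reconstruction error ||x - D h||^2
        h = h - (D.T @ (D @ h - x)) / L
        # Soft thresholding enforces the sparsity constraint: coefficients
        # whose magnitude falls below lam / L are set exactly to zero
        h = np.sign(h) * np.maximum(np.abs(h) - lam / L, 0.0)
    return h

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))      # dictionary: 50 atoms in 20 dimensions
D /= np.linalg.norm(D, axis=0)         # normalize each basis function
h_true = np.zeros((50, 1))
h_true[[3, 17, 41], 0] = [1.0, -2.0, 0.5]
x = D @ h_true                         # a signal built from only 3 atoms
h = sparse_code_ista(x, D)
print("nonzero coefficients:", np.count_nonzero(h), "of", h.size)
```

Because of the thresholding step, most entries of the recovered code are exactly zero, which is the defining property of a sparse representation.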
3. Sparse Coding From a Biological Perspective
From a biological perspective, sparse coding follows the more all-encompassing idea of
a neural code. Consider the case where you have binary neurons. So, basically:
The neural networks will get some inputs and deliver outputs
Some neurons in the neural network will be frequently activated while others
won’t be activated at all to calculate the outputs
The average activity ratio refers to the number of activations on some data,
whereas the neural code is the observation of those activations for a specific
input
Neural coding is the process of instructing your neurons to produce a reliable
neural code
Now that we know what a neural code is, we can speculate on what a good one may
look like. A sparse code encodes data so that, for any specific input, only a
small fraction of the neurons are activated.
The following figure shows an example of how a dense neural network graph is
transformed into a sparse neural network graph:
4. Applications of Sparse Coding
Brain imaging, natural language processing, and image and sound processing all use
sparse coding. Also, sparse coding has been employed in leading-edge systems in
several areas and has shown to be very successful at capturing the structure of
complicated data.
Resources for the implementation of sparse coding are available in several Python
modules. The well-known toolkit scikit-learn offers a sparse coding module with
methods for creating a dictionary of basis functions and sparsely combining these basis
functions to encode data.
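As an illustration, scikit-learn's SparseCoder can encode data against a fixed dictionary; this sketch uses a random (rather than learned) dictionary, and all sizes are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(0)
# A dictionary of 30 basis functions ("atoms") over 10-dimensional data;
# here it is random, whereas in practice it would be learned from data
dictionary = rng.standard_normal((30, 10))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

coder = SparseCoder(dictionary=dictionary,
                    transform_algorithm="omp",    # orthogonal matching pursuit
                    transform_n_nonzero_coefs=3)  # at most 3 atoms per sample
X = rng.standard_normal((5, 10))
codes = coder.transform(X)  # shape (5, 30); each row is mostly zeros
print(codes.shape, np.count_nonzero(codes, axis=1))
```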
5. Conclusion
In short, sparse coding is a powerful machine learning technique that allows us to find
compact, efficient representations of complex data. It has been widely applied to various
applications such as brain imaging, natural language processing, and image and sound
processing and achieved promising results.
Concept
– Computer vision is a field of artificial intelligence that deals with making computers or
machines understand the visual world similarly to humans.
The idea is to get machines to understand and interpret visual data so that they
can make sense of it and derive meaningful insights. Deep learning is a subset of
machine learning that seeks to mimic the functioning of the human brain using artificial
neural networks.
Applications
– The most common real-world applications of computer vision include defect
detection, image labeling, face recognition, object detection, image classification, object
tracking, movement analysis, cell classification, and more. Top applications of deep
learning include self-driving cars, natural language processing, visual recognition, image
and speech recognition, virtual assistants, chatbots, fraud detection, etc.
Summary
Deep learning has achieved remarkable progress in various fields in a short span of
time; in particular, it has brought a revolution to the computer vision community,
introducing efficient solutions to problems that had long remained unsolved.
Computer vision is a subfield of AI that seeks to make computers understand the
contents of the digital data contained within images or videos and make some sense out
of them. Deep learning aims to bring machine learning one step closer to one of its
original goals, that is, artificial intelligence.
The link between computer vision and machine learning is very fuzzy, as is the link
between computer vision and deep learning. In a short span of time, computer vision
has shown tremendous progress, and from interpreting optical data to object modeling,
the term deep learning has started to creep into computer vision, as it did into machine
learning, AI, and other domains.
Many traditional applications in computer vision can be solved through invoking deep
learning methods. Computer vision seeks to guide machines and computers toward
understanding the contents of digital data such as images or videos.
Computer vision is challenging because it is limited by hardware, and the way machines
see objects and images is very different from how humans see and interpret them.
Machines see them as numbers representing individual pixels, making it difficult to
get them to understand what and how we see.
Natural Language Processing
Our ability to evaluate the relationship between sentences is essential for tackling a
variety of natural language challenges, such as text summarization, information
extraction, and machine translation. This challenge is formalized as the natural
language inference task of Recognizing Textual Entailment (RTE), which involves
classifying the relationship between two sentences as one of entailment, contradiction,
or neutrality. For instance, the premise “Garfield is a cat”, naturally entails the
statement “Garfield has paws”, contradicts the statement “Garfield is a German
Shepherd”, and is neutral to the statement “Garfield enjoys sleeping”.
Natural language processing is the ability of a computer program to understand
human language as it is spoken. NLP is a component of artificial intelligence that deals
with the interactions between computers and human languages, in particular the
processing and analysis of large amounts of natural language data. Natural language
processing can perform several different tasks by processing natural data through
different efficient means. These tasks include:
Answering questions (what Siri, Alexa, and Cortana do)
Sentiment analysis (determining whether the attitude is positive, negative, or neutral)
Image-to-text mappings (creating captions from an input image)
Machine translation (translating text into different languages)
Speech recognition
Part-of-speech (POS) tagging
Entity identification
The traditional approach to NLP involved a lot of domain knowledge of linguistics
itself.
Deep learning, at its most basic level, is all about representation learning. With
convolutional neural networks (CNNs), the composition of different filters is used to
classify objects into categories. Taking a similar approach, this article creates
representations of words from large datasets.
Word Vectors:
If separate vectors are used for all of the 13+ million words in the English vocabulary,
several problems can occur. First, there will be large vectors with a lot of 'zeroes' and
one 'one' (in a different position for each word). This is also known as
one-hot encoding. Second, when searching for phrases such as "hotels in New
Jersey" in Google, we expect results pertaining to "motel", "lodging", and
"accommodation" in New Jersey to be returned. With one-hot encoding, these
words have no natural notion of similarity. Ideally, the dot products (since we are
dealing with vectors) of synonyms or similar words would be high.
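A toy sketch (with a hypothetical five-word vocabulary) shows why one-hot vectors carry no notion of similarity: the dot product of any two distinct one-hot vectors is always zero.

```python
import numpy as np

vocab = ["hotel", "motel", "lodging", "cat", "dog"]  # hypothetical tiny vocabulary

def one_hot(word):
    """Build a vector of zeros with a single one at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Distinct one-hot vectors are always orthogonal, so "hotel" looks no more
# similar to "motel" than it does to "cat"
print(one_hot("hotel") @ one_hot("motel"))  # 0.0
print(one_hot("hotel") @ one_hot("cat"))    # 0.0
```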
Word2vec is a group of models that helps derive relations between a word and its
contextual words. Beginning with a small, random initialization of word vectors, the
predictive model learns the vectors by minimizing the loss function. In Word2vec, this
happens with a feed-forward neural network and optimization techniques such as the
SGD algorithm. There are also count-based models, which build a co-occurrence
count matrix of the words in the corpus: a large matrix with a row for each "word"
and a column for each "context". The number of "contexts" is of course large,
since it is essentially combinatorial in size. To overcome the size issue, singular value
decomposition can be applied to the matrix, reducing its dimensions while
retaining maximum information.
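A count-based model of this kind can be sketched in a few lines of numpy; the corpus, window size, and number of retained dimensions are illustrative assumptions:

```python
import numpy as np

# A toy corpus; a real co-occurrence matrix would come from a large corpus
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a symmetric window of one word
M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                M[idx[w], idx[sent[j]]] += 1

# Truncated SVD keeps the top-2 singular directions as dense 2-d word
# vectors, reducing the matrix while retaining maximum information
U, S, Vt = np.linalg.svd(M)
word_vectors = U[:, :2] * S[:2]
print(word_vectors.shape)
```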
Software and Hardware:
The programming language being used is Python 3.5.2 with Intel Optimization for
TensorFlow as the framework. For training and computation purposes, the Intel AI
DevCloud powered by Intel Xeon Scalable processors was used. Intel AI DevCloud
can provide a great performance bump from the host CPU for the right application and
use case due to having 50+ cores and its own memory, interconnect, and operating
system.
Langmod_nn Model –
The Langmod_nn model builds a three-layer forward bigram-model neural network
consisting of an embedding layer, a hidden layer, and a final softmax layer, where the
goal is to use a given word in a corpus to attempt to predict the next word. The
input is passed in as a one-hot encoded vector of dimension 5,000.
Input:
A word in a corpus. Because the vocabulary size can get very large, we have limited
the vocabulary to the top 5000 words in the corpus, and the rest of the words are
replaced with the UNK symbol. Each sentence in the corpus is also double-padded
with stop symbols.
Output:
The following word in the corpus also encoded one-hot in a vector the size of the
vocabulary.
Layers –
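The forward pass through the three layers named above might be sketched as follows; the embedding and hidden sizes are illustrative assumptions, and the parameters are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 5000, 50, 100  # illustrative sizes

E  = rng.standard_normal((vocab_size, embed_dim)) * 0.01   # embedding layer
W1 = rng.standard_normal((embed_dim, hidden_dim)) * 0.01   # hidden layer
W2 = rng.standard_normal((hidden_dim, vocab_size)) * 0.01  # softmax layer

def forward(word_id):
    e = E[word_id]                    # look up the embedding of the input word
    h = np.tanh(e @ W1)               # hidden layer with a tanh non-linearity
    logits = h @ W2
    p = np.exp(logits - logits.max()) # softmax over the next-word vocabulary
    return p / p.sum()

probs = forward(42)                   # distribution over the following word
print(probs.shape)
```

The softmax output is a probability distribution over the 5,000-word vocabulary, matching the one-hot encoded target described above.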
Conclusion:
As shown, NLP provides a wide set of techniques and tools that can be applied in many
areas of life. By learning these models and building them into everyday applications,
we can greatly improve the quality of human-computer interaction. NLP techniques
help improve communication, automate routine language tasks, and get more value
out of every interaction, and many of the necessary tools are already freely available.
In the future, NLP will move beyond both statistical and rule-based systems to a
natural understanding of language. There are already some improvements made by
tech giants. For example, Facebook tried to use deep learning to understand text
without parsing, tags, named-entity recognition (NER), etc., and Google is trying to
convert language into mathematical expressions. Endpoint detection using grid long
short-term memory networks and end-to-end memory networks on bAbI tasks
performed by Google and Facebook respectively shows the advancement that can be
done in NLP models.
What Is Caffe?
Caffe is an open-source deep learning framework that supports a
variety of deep learning architectures.
Caffe (Convolutional Architecture for Fast Feature Embedding) is a deep
learning framework that supports a variety of deep learning architectures such
as CNN, RCNN, LSTM and fully connected networks. With its graphics processing unit
(GPU) support and out-of-the-box templates that simplify model setup and training,
Caffe is most popular for image classification and segmentation tasks.
In Caffe, you can define model, solver and optimization details in configuration files,
thanks to its expressive architecture. In addition, you can switch between GPU and
central processing unit (CPU) computation by changing a single flag in the configuration
file. Together, these features eliminate the need for hard coding in your project, which is
normally required in other deep learning frameworks. Caffe is also considered one of the
fastest convolutional network implementations available.
Caffe Applications
Caffe is used in a wide range of scientific research projects, startup prototypes and large-
scale industrial applications in natural language processing, computer vision and
multimedia. Several projects are built on top of the Caffe framework, such as Caffe2 and
CaffeOnSpark. Caffe2 is built on Caffe and is merged into Meta’s PyTorch. Yahoo has
also integrated Caffe with Apache Spark to create CaffeOnSpark, which brings deep
learning to Hadoop and Spark clusters.
INTERFACES
Caffe is primarily a C++ library and exposes a modular development interface, but not
every situation requires custom compilation. Therefore, Caffe offers interfaces for daily
use by way of the command line, Python and MATLAB.
DATA PROCESSING
Caffe processes data in the form of Blobs which are N-dimensional arrays stored in a C-
contiguous fashion. Data is stored both as data we pass along the model and as diff,
which is a gradient computed by the network.
Data layers handle how the data is processed in and out of the Caffe model. Pre-
processing and transformation like random cropping, mirroring, scaling and mean
subtraction can be done by configuring the data layer. Furthermore, pre-fetching and
multiple-input configurations are also possible.
CAFFE LAYERS
Caffe layers and their parameters are the foundation of every Caffe deep learning model.
The bottom connection of the layer is where the input data is supplied and the top
connection is where the results are provided after computation. In each layer, three
different computations take place, which are setup, forward and backward
computations. In that respect, they are also the primary unit of computation.
Many state-of-the-art deep learning models can be created with Caffe using its layer
catalog. Data layers, normalization layers, utility layers, activation layers and loss layers
are among the layer types provided by Caffe.
CAFFE SOLVER
Caffe solver is responsible for learning — specifically for model optimization and
generating parameter updates to improve the loss. There are several solvers provided in
Caffe such as stochastic gradient descent, adaptive gradient and RMSprop. The solver is
configured separately to decouple modeling and optimization.
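As a sketch, a layer with its bottom and top connections, and a solver that decouples optimization from the model, might look like the following prototxt fragments (all names and values are illustrative assumptions):

```
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"    # input arrives through the bottom connection
  top: "conv1"      # results flow out through the top connection
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
  }
}
```

```
net: "train_val.prototxt"
base_lr: 0.01
momentum: 0.9
type: "SGD"
solver_mode: GPU   # switch between GPU and CPU by changing this single flag
max_iter: 10000
```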
What Are the Pros and Cons of Using Caffe?
PROS OF CAFFE
Caffe Is Fast
No coding is required in most cases. Model, solver and optimization details can be
defined in configuration files.
There are ready-to-use templates for common use cases.
Caffe supports GPU training.
Caffe is an open-source framework.
CONS OF CAFFE
Caffe is developing at a slow pace. As a result, its popularity among machine learning
professionals is diminishing.
The documentation is limited and most support is provided by the community alone, rather
than the developer.
The absence of commercial support discourages enterprise-grade developers.
In addition to Caffe, there are many other deep learning frameworks available, such as:
TensorFlow by Google: With its huge and active community, it is the most famous end-to-end
machine learning platform which provides tools not only for deep learning but also for
statistical and mathematical calculations.
Keras by François Chollet: This is a high-level API for developing deep neural networks, while
using TensorFlow as a back-end engine. It's easy to create sophisticated neural models with
Keras, thanks to its simple programming interface.
PyTorch by Meta: PyTorch is another open-source machine learning framework used for
applications such as computer vision and natural language processing.
Apache MxNet by Apache Software Foundation: This is an open-source deep learning
framework that allows you to develop deep learning projects on a wide range of devices.
Apache MxNet supports multiple languages such as Python, C++, R, Scala, Julia, MATLAB
and JavaScript.
History of Caffe
Yangqing Jia initiated the Caffe project during his doctoral studies at the University of
California, Berkeley. Caffe was developed (and is currently maintained) by Berkeley AI
Research and community contributors under the BSD license. It is written in C++ and
its first stable release date was April 18, 2017.
Theano in Python
Theano is a Python library that allows us to evaluate mathematical operations
involving multi-dimensional arrays efficiently. It is mostly used in building deep
learning projects, and it runs much faster on a graphics processing unit (GPU)
than on a CPU. Theano attains speeds that rival hand-written C implementations
for problems involving large amounts of data. By taking advantage of GPUs, it can
even outperform C on a CPU by orders of magnitude under certain circumstances.
It knows how to take structures and convert them into very efficient code that uses
numpy and some native libraries. It is mainly designed to handle the types of
computation required by the large neural network algorithms used in deep learning. That
is why it is a very popular library in the field of deep learning.
How to install Theano :
pip install theano
Several of the symbols we will need to use are in the tensor subpackage of Theano.
We often import such packages with a handy name, let’s say, T.
from theano import *
import theano.tensor as T
Why Theano Python Library :
Theano is a sort of hybrid between numpy and sympy: it attempts to combine the
two into one powerful library. Some advantages of theano are as follows:
Stability Optimization: Theano can find out some unstable expressions and can
use more stable means to evaluate them
Execution Speed Optimization: As mentioned earlier, theano can make use of
recent GPUs and execute parts of expressions in your CPU or GPU, making it
much faster than Python
Symbolic Differentiation: Theano is smart enough to automatically create
symbolic graphs for computing gradients
Basics of Theano :
Theano is a Python library that allows you to define, optimize, and evaluate
mathematical expressions involving multi-dimensional arrays efficiently.Some Theano
implementations are as follows.
Subtracting two scalars :
import theano
from theano import tensor
# Declaring variables
a = tensor.dscalar()
b = tensor.dscalar()
# Subtracting
res = a - b
# Converting the expression into a callable object
func = theano.function([a, b], res)
# Calling function and asserting the result
assert 20.0 == func(30.5, 10.5)
It will not produce any output because the assertion succeeds: the difference of the
two numbers matches the expected value, hence it results in a true value.
Adding two scalars :
Python
import numpy
import theano.tensor as T
from theano import function
x = T.dscalar('x')
y = T.dscalar('y')
z = x + y
f = function([x, y], z)
f(5, 7)
Output:
array(12.0)
Adding two matrices :
Python
import numpy
import theano.tensor as T
from theano import function
x = T.dmatrix('x')
y = T.dmatrix('y')
z = x + y
f = function([x, y], z)
f([[30, 50], [2, 3]], [[60, 70], [3, 4]])
Output:
array([[ 90., 120.],
[ 5., 7.]])
Logistic function using theano :
Let's try to compute the logistic curve, which is given by: y = 1 / (1 + e^(-x))
Logistic Function :
Python
import theano
from theano import tensor
# Declaring variable
a = tensor.dmatrix('a')
# Sigmoid function
sig = 1 / (1 + tensor.exp(-a))
# Converting it into a callable function that takes a matrix
log = theano.function([a], sig)
# Calling function
print(log([[0, 1], [-1, -2]]))
Output :
[[0.5 0.73105858]
[0.26894142 0.11920292]]
Theano is a foundation library mainly used for deep learning research and
development. It can be used directly to create deep learning models or through
convenient wrapper libraries such as Keras. It supports both convolutional networks
and recurrent networks, as well as combinations of the two.
Neural Networks: Deep Learning is based on artificial neural networks which have
been around in some form since the late 1950s. The networks are built from individual
parts approximating neurons, typically called units or simply “neurons.” Each unit has
some number of weighted inputs. These weighted inputs are summed together (a
linear combination) then passed through an activation function to get the unit’s output.
Below is an example of a simple neural net.
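A minimal sketch of a single unit: the weighted inputs are summed (a linear combination) and passed through a sigmoid activation. The inputs and weights here are illustrative values, not trained parameters.

```python
import numpy as np

def neuron(x, w, b):
    """One unit: a weighted sum of inputs passed through a sigmoid activation."""
    z = np.dot(w, x) + b           # linear combination of the weighted inputs
    return 1 / (1 + np.exp(-z))    # activation function squashes z into (0, 1)

x = np.array([0.5, -1.0, 2.0])     # illustrative inputs
w = np.array([0.4, 0.6, -0.1])     # one weight per input
out = neuron(x, w, b=0.2)
print(out)
```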
PyTorch is an open-source machine learning library based on the Torch library,
developed by Facebook’s AI Research lab. It is widely used in deep learning, natural
language processing, and computer vision applications. PyTorch provides a dynamic
computational graph, which allows for easy and efficient modeling of complex neural
networks.
Here’s a general overview of how to use PyTorch for deep learning:
Define the model: PyTorch provides a wide range of pre-built neural network
architectures, such as fully connected networks, convolutional neural networks
(CNNs), and recurrent neural networks (RNNs). The model can be defined using
PyTorch’s torch.nn module.
Define the loss function: The loss function is used to evaluate the model’s
performance and update its parameters. PyTorch provides a wide range of loss
functions, such as cross-entropy loss and mean squared error loss.
Define the optimizer: The optimizer is used to update the model’s parameters based
on the gradients of the loss function. PyTorch provides a wide range of optimizers,
such as Adam and SGD.
Train the model: The model is trained on a dataset using the defined loss function and
optimizer. PyTorch provides a convenient data loading and processing library,
torch.utils.data, that can be used to load and prepare the data.
Evaluate the model: After training, the model’s performance can be evaluated on a
test dataset.
Deploy the model: The trained model can be deployed in a wide range of applications,
such as computer vision and natural language processing.
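The steps above can be sketched as a minimal training loop; the architecture, dataset, and hyperparameters are illustrative assumptions:

```python
import torch
from torch import nn

torch.manual_seed(0)
# 1. Define the model with torch.nn
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
# 2. Define the loss function
criterion = nn.CrossEntropyLoss()
# 3. Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# 4. Train on a tiny random dataset (in practice, load real data
#    with torch.utils.data loaders)
X = torch.randn(32, 4)
y = torch.randint(0, 2, (32,))
for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

# 5. Evaluate (here on the training data, for brevity)
with torch.no_grad():
    accuracy = (model(X).argmax(dim=1) == y).float().mean()
print(float(accuracy))
```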
Tensors: It turns out neural network computations are just a bunch of linear algebra
operations on tensors, which are a generalization of matrices. A vector is a
1-dimensional tensor, a matrix is a 2-dimensional tensor, and an array with three
indices is a 3-dimensional tensor. Tensors are the fundamental data structure of
neural networks, and PyTorch is built around them. It's time to explore how we can
use PyTorch to build a simple neural network.
Python3
import torch

def activation(x):
    """ Sigmoid activation function

        Arguments
        ---------
        x: torch.Tensor
    """
    return 1 / (1 + torch.exp(-x))
Python3
# Generate some data
torch.manual_seed(7)  # set the random seed so results are reproducible
# Features are 5 random normal variables
features = torch.randn((1, 5))
# True weights for our data, random normal variables again
weights = torch.randn_like(features)
# and a true bias term
bias = torch.randn((1, 1))
features = torch.randn((1, 5)) creates a tensor with shape (1, 5), one row and five
columns, that contains values randomly distributed according to the normal distribution
with a mean of zero and standard deviation of one. weights =
torch.randn_like(features) creates another tensor with the same shape as features,
again containing values from a normal distribution. Finally, bias = torch.randn((1, 1))
creates a single value from a normal distribution.
Now we calculate the output of the network using matrix multiplication.
Python3
y = activation(torch.mm(features, weights.view(5, 1)) + bias)
That’s how we can calculate the output for a single neuron. The real power of this
algorithm happens when you start stacking these individual units into layers and
stacks of layers, into a network of neurons. The output of one layer of neurons
becomes the input for the next layer. With multiple input units and output units, we
now need to express the weights as a matrix. We define the structure of neural
network and initialize the weights and biases.
Python3
# Features are 3 random normal variables
features = torch.randn((1, 3))
# Define the size of each layer in our network
n_input = features.shape[1]  # number of input units, must match the number of features
n_hidden = 2                 # number of hidden units
n_output = 1                 # number of output units
# Weights for input layer to hidden layer
W1 = torch.randn(n_input, n_hidden)
# Weights for hidden layer to output layer
W2 = torch.randn(n_hidden, n_output)
# and bias terms for hidden and output layers
B1 = torch.randn((1, n_hidden))
B2 = torch.randn((1, n_output))
Now we can calculate the output for this multi-layer network using the weights W1 &
W2, and the biases, B1 & B2.
Python3
h = activation(torch.mm(features, W1) + B1)
output = activation(torch.mm(h, W2) + B2)
print(output)