
Neural Networks unit 3

Neural Network and Deep Learning (Anna University)


Spiking Neural Networks – Convolutional Neural Networks – Deep Learning Neural Networks – Extreme Learning Machine Model – Convolutional Neural Networks: The Convolution Operation – Motivation – Pooling – Variants of the Basic Convolution Function – Structured Outputs – Data Types – Efficient Convolution Algorithms – Neuroscientific Basis – Applications: Computer Vision, Image Generation, Image Compression.

SPIKING NEURAL NETWORKS:


The Spiking Neural Network (SNN) is a third-generation neural network model, built with specialized network topologies that redefine the entire computational process. Spiking makes the network more intelligent and energy-efficient, which is crucial for small, resource-constrained devices.
With a three-layered feedforward specialized network topology, the SNN is one of the most powerful neural networks for processing temporal data in real time. This high computational power and advanced topology make it suitable for robotics and computer vision applications that require real-time data processing.
The SNN facilitates real-time sourcing and processing of data and is a major improvement over earlier neural networks, which primarily rely on firing frequency rather than on temporal information.
SNN spikes are computationally more expressive: the firing activity of a neuron in the SNN architecture is not tied to static inputs but to the notion of time.

CONVOLUTIONAL NEURAL NETWORKS:


Neural networks are a subset of machine learning, and they are at the heart of deep learning
algorithms. They are comprised of node layers, containing an input layer, one or more hidden layers, and an
output layer. Each node connects to another and has an associated weight and threshold. If the output of any
individual node is above the specified threshold value, that node is activated, sending data to the next layer
of the network. Otherwise, no data is passed along to the next layer of the network.
While the discussion above focused primarily on feedforward networks, there are various types of neural nets, which are used for different use cases and data types. For example, recurrent neural networks are commonly used for natural language processing and speech recognition, whereas convolutional neural networks (ConvNets or CNNs) are more often utilized for classification and computer vision tasks. Prior to CNNs,
manual, time-consuming feature extraction methods were used to identify objects in images. However,
convolutional neural networks now provide a more scalable approach to image classification and object
recognition tasks, leveraging principles from linear algebra, specifically matrix multiplication, to identify
patterns within an image. That said, they can be computationally demanding, requiring graphical processing
units (GPUs) to train models.

How do convolutional neural networks work?


Convolutional neural networks are distinguished from other neural networks by their superior
performance with image, speech, or audio signal inputs. They have three main types of layers, which are:
 Convolutional layer
 Pooling layer
 Fully-connected (FC) layer
The convolutional layer is the first layer of a convolutional network. While convolutional layers can be
followed by additional convolutional layers or pooling layers, the fully-connected layer is the final layer.
With each layer, the CNN increases in its complexity, identifying greater portions of the image. Earlier
layers focus on simple features, such as colors and edges. As the image data progresses through the layers of

the CNN, it starts to recognize larger elements or shapes of the object until it finally identifies the intended
object.

Convolutional layer
The convolutional layer is the core building block of a CNN, and it is where the majority of
computation occurs. It requires a few components, which are input data, a filter, and a feature map. Let’s
assume that the input will be a color image, which is made up of a matrix of pixels in 3D. This means that
the input will have three dimensions—a height, width, and depth—which correspond to RGB in an image.
We also have a feature detector, also known as a kernel or a filter, which will move across the receptive
fields of the image, checking if the feature is present. This process is known as a convolution.
The feature detector is a two-dimensional (2-D) array of weights, which represents part of the image. While
they can vary in size, the filter size is typically a 3x3 matrix; this also determines the size of the receptive
field. The filter is then applied to an area of the image, and a dot product is calculated between the input
pixels and the filter. This dot product is then fed into an output array. Afterwards, the filter shifts by a stride,
repeating the process until the kernel has swept across the entire image. The final output from the series of
dot products from the input and the filter is known as a feature map, activation map, or a convolved feature.
Note that the weights in the feature detector remain fixed as it moves across the image, which is also known
as parameter sharing. Some parameters, like the weight values, adjust during training through the process of
backpropagation and gradient descent. However, there are three hyperparameters which affect the volume
size of the output that need to be set before the training of the neural network begins. These include:
1. The number of filters affects the depth of the output. For example, three distinct filters would yield three
different feature maps, creating a depth of three.
2. Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While stride values of two or greater are rare, a larger stride yields a smaller output.
3. Zero-padding is usually used when the filters do not fit the input image. This sets all elements that fall
outside of the input matrix to zero, producing a larger or equally sized output. There are three types of
padding:
 Valid padding: This is also known as no padding. In this case, the last convolution is dropped if
dimensions do not align.
 Same padding: This padding ensures that the output layer has the same size as the input layer
 Full padding: This type of padding increases the size of the output by adding zeros to the border of
the input.
After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the
feature map, introducing nonlinearity to the model.
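The sliding filter, stride, zero-padding and ReLU described above can be sketched in a few lines of NumPy. This is a toy illustration only; the function name conv2d and the random inputs are assumptions made for demonstration, not part of any particular library.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a 2-D kernel over a 2-D image and return the feature map."""
    if padding > 0:
        image = np.pad(image, padding, mode="constant")   # zero-padding around the border
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)     # dot product of patch and filter
    return feature_map

image = np.random.rand(7, 7)                   # toy single-channel input
kernel = np.random.rand(3, 3)                  # 3x3 feature detector
fmap = conv2d(image, kernel, stride=1, padding=1)   # padding 1 keeps the 7x7 size
activated = np.maximum(fmap, 0)                # ReLU applied to the feature map
```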

Additional convolutional layer


As we mentioned earlier, another convolution layer can follow the initial convolution layer. When
this happens, the structure of the CNN can become hierarchical as the later layers can see the pixels within
the receptive fields of prior layers. As an example, let’s assume that we’re trying to determine if an image
contains a bicycle. You can think of the bicycle as a sum of parts. It is comprised of a frame, handlebars,
wheels, pedals, et cetera. Each individual part of the bicycle makes up a lower-level pattern in the neural net,
and the combination of its parts represents a higher-level pattern, creating a feature hierarchy within the
CNN. Ultimately, the convolutional layer converts the image into numerical values, allowing the neural
network to interpret and extract relevant patterns.

Pooling layer
Pooling layers, also known as downsampling layers, conduct dimensionality reduction, reducing the
number of parameters in the input. Similar to the convolutional layer, the pooling operation sweeps a filter
across the entire input, but the difference is that this filter does not have any weights. Instead, the kernel
applies an aggregation function to the values within the receptive field, populating the output array. There
are two main types of pooling:

 Max pooling: As the filter moves across the input, it selects the pixel with the maximum value to
send to the output array. As an aside, this approach tends to be used more often compared to average
pooling.
 Average pooling: As the filter moves across the input, it calculates the average value within the
receptive field to send to the output array.
While a lot of information is lost in the pooling layer, it also brings a number of benefits to the CNN: pooling layers help to reduce complexity, improve efficiency, and limit the risk of overfitting.
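A toy NumPy sketch of both pooling variants is shown below; it assumes a square feature map whose side is a multiple of the pool size, and the helper name pool2d is my own.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Downsample a 2-D feature map with a non-overlapping size x size window."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            window = feature_map[i:i + size, j:j + size]
            out[i // size, j // size] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.arange(16).reshape(4, 4).astype(float)
print(pool2d(fmap, size=2, mode="max"))       # keeps the maximum of each 2x2 window
print(pool2d(fmap, size=2, mode="average"))   # keeps the average of each 2x2 window
```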

Fully-connected layer
The name of the fully-connected layer aptly describes itself. As mentioned earlier, the pixel values of
the input image are not directly connected to the output layer in partially connected layers. However, in the
fully-connected layer, each node in the output layer connects directly to a node in the previous layer.
This layer performs the task of classification based on the features extracted through the previous layers and
their different filters. While convolutional and pooling layers tend to use ReLU functions, FC layers usually
leverage a softmax activation function to classify inputs appropriately, producing a probability from 0 to 1.
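As a small illustration of that final step, a softmax over the fully-connected layer's raw scores can be sketched as follows; the example scores are invented.

```python
import numpy as np

def softmax(logits):
    """Convert raw FC-layer scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])          # hypothetical scores for 3 classes
print(softmax(scores))                      # roughly [0.66, 0.24, 0.10]
```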

Types of convolutional neural networks


Kunihiko Fukushima and Yann LeCun laid the foundation for research on convolutional neural
networks with the neocognitron in 1980 and "Backpropagation Applied to Handwritten Zip Code
Recognition" in 1989, respectively. More famously, Yann LeCun successfully applied backpropagation to
train neural networks to identify and recognize patterns within a series of handwritten zip codes. He
continued this research with his team throughout the 1990s, culminating in "LeNet-5", which applied the
same principles of prior research to document recognition. Since then, a number of variant CNN
architectures have emerged with the introduction of new datasets, such as MNIST and CIFAR-10, and
competitions, like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Some of these other
architectures include:
• AlexNet
• VGGNet
• GoogLeNet
• ResNet
• ZFNet
However, LeNet-5 is known as the classic CNN architecture.

Convolutional neural networks and computer vision


Convolutional neural networks power image recognition and computer vision tasks. Computer vision
is a field of artificial intelligence (AI) that enables computers and systems to derive meaningful information
from digital images, videos and other visual inputs, and based on those inputs, it can take action. This ability
to provide recommendations distinguishes it from image recognition tasks. Some common applications of
computer vision today can be seen in:
• Marketing: Social media platforms provide suggestions on who might be in a photograph that has been posted on a profile, making it easier to tag friends in photo albums.
• Healthcare: Computer vision has been incorporated into radiology technology, enabling doctors to better identify cancerous tumors in healthy anatomy.
• Retail: Visual search has been incorporated into some e-commerce platforms, allowing brands to recommend items that would complement an existing wardrobe.
• Automotive: While the age of driverless cars hasn't quite emerged, the underlying technology has started to make its way into automobiles, improving driver and passenger safety through features like lane line detection.

DEEP LEARNING NEURAL NETWORKS


The History of Deep Learning
Deep learning was conceptualized by Geoffrey Hinton in the 1980s. He is widely considered to be the
founding father of the field of deep learning. Hinton has worked at Google since March 2013 when his
company, DNNresearch Inc., was acquired.
Hinton’s main contribution to the field of deep learning was to compare machine learning techniques to the
human brain.
More specifically, he created the concept of a "neural network", which is a deep learning algorithm
structured similarly to the organization of neurons in the brain. Hinton took this approach because the human
brain is arguably the most powerful computational engine known today.
The structure that Hinton created was called an artificial neural network (or artificial neural net for short).
Here's a brief description of how they function:
• Artificial neural networks are composed of layers of nodes
• Each node is designed to behave similarly to a neuron in the brain
• The first layer of a neural net is called the input layer, followed by hidden layers, then finally the output layer
• Each node in the neural net performs some sort of calculation, which is passed on to other nodes deeper in the neural net
Here is a simplified visualization to demonstrate how this works:

Neural nets represented an immense stride forward in the field of deep learning.
However, it took decades for machine learning (and especially deep learning) to gain prominence.
We’ll explore why in the next section.
Why Deep Learning Did Not Immediately Work
If deep learning was originally conceived decades ago, why is it just beginning to gain momentum today?
It’s because any mature deep learning model requires an abundance of two resources:
 Data
 Computing power

At the time of deep learning’s conceptual birth, researchers did not have access to enough of either data or
computing power to build and train meaningful deep learning models. This has changed over time, which
has led to deep learning’s prominence today.

Understanding Neurons in Deep Learning


Neurons are a critical component of any deep learning model.
In fact, one could argue that you can't fully understand deep learning without having a deep knowledge of how
neurons work.
This section will introduce you to the concept of neurons in deep learning. We’ll talk about the origin of
deep learning neurons, how they were inspired by the biology of the human brain, and why neurons are so
important in deep learning models today.
What is a Neuron in Biology?
Neurons in deep learning were inspired by neurons in the human brain. Here is a diagram of the anatomy of
a brain neuron:

As you can see, neurons have quite an interesting structure. Groups of neurons work together inside the
human brain to perform the functionality that we require in our day-to-day lives.
The question that Geoffrey Hinton asked during his seminal research in neural networks was whether we
could build computer algorithms that behave similarly to neurons in the brain. The hope was that by
mimicking the brain’s structure, we might capture some of its capability.
To do this, researchers studied the way that neurons behaved in the brain. One important observation was
that a neuron by itself is useless. Instead, you require networks of neurons to generate any meaningful
functionality.
This is because neurons function by receiving and sending signals. More specifically, the neuron’s dendrites
receive signals and pass along those signals through the axon.
The dendrites of one neuron are connected to the axon of another neuron. These connections are called
synapses, which is a concept that has been generalized to the field of deep learning.
What is a Neuron in Deep Learning?
Neurons in deep learning models are nodes through which data and computations flow.
Neurons work like this:
 They receive one or more input signals. These input signals can come from either the raw data set or
from neurons positioned at a previous layer of the neural net.
 They perform some calculations.

 They send some output signals to neurons deeper in the neural net through a synapse.
Here is a diagram of the functionality of a neuron in a deep learning neural net:

Let’s walk through this diagram step-by-step.


As you can see, neurons in a deep learning model are capable of having synapses that connect to more than
one neuron in the preceding layer. Each synapse has an associated weight, which impacts the preceding
neuron’s importance in the overall neural network.
Weights are a very important topic in the field of deep learning because adjusting a model’s weights is the
primary way through which deep learning models are trained. You’ll see this in practice later on when we
build our first neural networks from scratch.
Once a neuron receives its inputs from the neurons in the preceding layer of the model, it adds up each
signal multiplied by its corresponding weight and passes the sum on to an activation function, like this:

The activation function calculates the output value for the neuron. This output value is then passed on to the
next layer of the neural network through another synapse.
This serves as a broad overview of deep learning neurons. Do not worry if it was a lot to take in – we’ll learn
much more about neurons in the rest of this tutorial. For now, it’s sufficient for you to have a high-level
understanding of how they are structured in a deep learning model.
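A minimal sketch of such a neuron, a weighted sum over its synapses followed by an activation function, is given below; the sigmoid here is just one possible choice of activation, and the numbers are invented.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Weighted sum of the incoming signals passed through an activation function."""
    z = np.dot(weights, inputs) + bias      # weighted sum over the synapses
    return 1.0 / (1.0 + np.exp(-z))         # sigmoid activation (one possible choice)

x = np.array([0.5, -1.2, 3.0])              # signals from the previous layer
w = np.array([0.4, 0.1, -0.6])              # synapse weights
print(neuron(x, w, bias=0.2))               # output passed to the next layer
```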

Deep Learning Activation Functions


Activation functions are a core concept to understand in deep learning.
They are what allows neurons in a neural network to communicate with each other through their synapses.
In this section, you will learn to understand the importance and functionality of activation functions in deep
learning.

What Are Activation Functions in Deep Learning?


In the last section, we learned that neurons receive input signals from the preceding layer of a neural
network. A weighted sum of these signals is fed into the neuron's activation function, then the activation
function's output is passed onto the next layer of the network.
There are four main types of activation functions that we’ll discuss in this tutorial:
 Threshold functions
 Sigmoid functions
 Rectifier functions, or ReLUs
 Hyperbolic Tangent functions
Let’s work through these activations functions one-by-one.
Threshold Functions
Threshold functions compute a different output signal depending on whether their input lies above or
below a certain threshold. Remember, the input value to an activation function is the weighted sum of the
input values from the preceding layer in the neural network.
Mathematically speaking, here is the formal definition of a deep learning threshold function:

As the image above suggests, the threshold function is sometimes also called a unit step function.
Threshold functions are similar to boolean variables in computer programming. Their computed value is
either 1 (similar to True) or 0 (equivalent to False).
The Sigmoid Function
The sigmoid function is well-known among the data science community because of its use in logistic
regression, one of the core machine learning techniques used to solve classification problems.
The sigmoid function can accept any value, but always computes a value between 0 and 1.
Here is the mathematical definition of the sigmoid function:

One benefit of the sigmoid function over the threshold function is that its curve is smooth. This means it is
possible to calculate derivatives at any point along the curve.
The Rectifier Function
The rectifier function does not have the same smoothness property as the sigmoid function from the last
section. However, it is still very popular in the field of deep learning.
The rectifier function is defined as follows:
 If the input value is less than 0, then the function outputs 0
 If not, the function outputs its input value
Here is this concept explained mathematically:

Rectifier functions are often called Rectified Linear Unit activation functions, or ReLUs for short.
The Hyperbolic Tangent Function
The hyperbolic tangent function is the only activation function included in this tutorial that is based on a
hyperbolic trigonometric function.
Its mathematical definition is below:

The hyperbolic tangent function is similar in appearance to the sigmoid function, but its output values range from -1 to 1 rather than from 0 to 1, so the curve is shifted downwards.
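For reference, the four activation functions discussed above can be sketched in a few lines of NumPy using their standard textbook forms:

```python
import numpy as np

def threshold(x):        # unit step: 1 if the input is non-negative, else 0
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):          # smooth squashing of any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):             # rectifier: 0 for negative inputs, identity otherwise
    return np.maximum(0.0, x)

def tanh(x):             # hyperbolic tangent: squashes inputs into (-1, 1)
    return np.tanh(x)

z = np.array([-2.0, 0.0, 2.0])
for f in (threshold, sigmoid, relu, tanh):
    print(f.__name__, f(z))
```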

How Do Neural Networks Really Work?


So far in this tutorial, we have discussed two of the building blocks for building neural networks:
 Neurons
 Activation functions
However, you’re probably still a bit confused as to how neural networks really work.
This tutorial will put together the pieces we’ve already discussed so that you can understand how neural
networks work in practice.
The Example We’ll Be Using In This Tutorial
This tutorial will work through a real-world example step-by-step so that you can understand how neural
networks make predictions.
More specifically, we will be dealing with property valuations.
You probably already know that there are a ton of factors that influence house prices, including the
economy, interest rates, its number of bedrooms/bathrooms, and its location.
The high dimensionality of this data set makes it an interesting candidate for building and training a neural
network on.
One caveat about this section is the neural network we will be using to make predictions has already been
trained. We’ll explore the process for training a new neural network in the next section of this tutorial.
The Parameters In Our Data Set
Let’s start by discussing the parameters in our data set. More specifically, let’s imagine that the data set
contains the following parameters:
 Square footage
 Bedrooms
 Distance to city center

 House age
These four parameters will form the input layer of the artificial neural network. Note that in reality, there are
likely many more parameters that you could use to train a neural network to predict housing prices. We have
constrained this number to four to keep the example reasonably simple.
The Most Basic Form of a Neural Network
In its most basic form, a neural network only has two layers - the input layer and the output layer. The
output layer is the component of the neural net that actually makes predictions.
For example, if you wanted to make predictions using a simple weighted sum (also called linear regression)
model, your neural network would take the following form:
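A minimal sketch of that input-to-output weighted sum for the four housing parameters might look like this; the weight values are invented purely for illustration and have not been trained.

```python
import numpy as np

# Hypothetical, untrained example weights: one per input parameter, plus a bias term.
weights = np.array([150.0, 10000.0, -2000.0, -500.0])  # sqft, bedrooms, distance, age
bias = 50000.0

def predict_price(square_footage, bedrooms, distance_to_center, house_age):
    """Simple weighted-sum (linear regression style) prediction."""
    features = np.array([square_footage, bedrooms, distance_to_center, house_age])
    return float(np.dot(weights, features) + bias)

print(predict_price(1800, 3, 5.0, 20))   # toy prediction in arbitrary currency units
```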

While this is a bit abstract, the point is that most neural networks can be visualized in this manner:
 An input layer
 Possibly some hidden layers
 An output layer
It is the hidden layer of neurons that causes neural networks to be so powerful for calculating predictions.
Each neuron in a hidden layer performs calculations using some (or all) of the neurons in the previous layer
of the neural network. These values are then used in the next layer of the neural network.
The Purpose of Neurons in the Hidden Layer of a Neural Network
You are probably wondering – what exactly does each neuron in the hidden layer mean? Said differently,
how should machine learning practitioners interpret these values?
Generally speaking, neurons in the hidden layers of a neural net are activated (meaning their activation
function returns 1) for an input value that satisfies certain sub-properties.
For our housing price prediction model, one example might be 5-bedroom houses with small distances to the
city center.
In most other cases, describing the characteristics that would cause a neuron in a hidden layer to activate is
not so easy.
How Neurons Determine Their Input Values
Earlier in this tutorial, I wrote "Each neuron in a hidden layer performs calculations using some (or
all) of the neurons in the previous layer of the neural network."
This illustrates an important point – that each neuron in a neural net does not need to use every neuron in the
preceding layer.

The process through which neurons determine which input values to use from the preceding layer of the
neural net is called training the model. We will learn more about training neural nets in the next section of
this course.
Visualizing A Neural Net’s Prediction Process
When visualizing a neural network, we generally draw lines from the previous layer to the current layer
whenever the preceding neuron has a nonzero weight in the weighted sum formula for the current neuron.
The following image will help visualize this:

As you can see, not every neuron-neuron pair has a synapse. x4 only feeds three out of the five neurons in the
hidden layer, as an example. This illustrates an important point when building neural networks: not
every neuron in a preceding layer must be used in the next layer of a neural network.

How Neural Networks Are Trained


So far you have learned the following about neural networks:
 That they are composed of neurons
 That each neuron uses an activation function applied to the weighted sum of the outputs from the
preceding layer of the neural network
 A broad, no-code overview of how neural networks make predictions
We have not yet covered a very important part of the neural network engineering process: how neural
networks are trained.
Now you will learn how neural networks are trained. We’ll discuss data sets, algorithms, and broad
principles used in training modern neural networks that solve real-world problems.
Hard-Coding vs. Soft-Coding
There are two main ways that you can develop computer applications. Before digging in to how neural
networks are trained, it’s important to make sure that you have an understanding of the difference between
hard-coding and soft-coding computer programs.
Hard-coding means that you explicitly specify input variables and your desired output variables. Said
differently, hard-coding leaves no room for the computer to interpret the problem that you’re trying to solve.
Soft-coding is the complete opposite. It leaves room for the program to understand what is happening in the
data set. Soft-coding allows the computer to develop its own problem-solving approaches.
A specific example is helpful here. Here are two instances of how you might identify cats within a data set
using soft-coding and hard-coding techniques.

• Hard-coding: you use specific parameters to predict whether an animal is a cat. More specifically, you might say that if an animal's weight and length lie within certain ranges, then the animal is a cat.
• Soft-coding: you provide a data set that contains animals labelled with their species type and characteristics about those animals. Then you build a computer program to predict whether an animal is a cat or not based on the characteristics in the data set.
As you might imagine, training neural networks falls into the category of soft-coding. Keep this in mind as
you proceed through this course.
Training A Neural Network Using A Cost Function
Neural networks are trained using a cost function, which is an equation used to measure the error contained
in a network’s prediction.
The formula for a deep learning cost function (of which there are many – this is just one example) is below:

Note: this cost function is called the mean squared error, which is why there is an MSE on the left side of the
equal sign.
While there is plenty of formula mathematics in this equation, it is best summarized as follows:
Take the difference between the predicted output value of an observation and the actual output value of that
observation. Square that difference and divide it by 2.
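Written out for a single observation, following the verbal description above (with \hat{y} the predicted output value and y the actual output value), this cost is

C = \frac{1}{2}(\hat{y} - y)^2

and averaging over n observations gives the usual batch form \text{MSE} = \frac{1}{2n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2.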
To reiterate, note that this is simply one example of a cost function that could be used in machine learning
(although it is admittedly the most popular choice). The choice of which cost function to use is a complex
and interesting topic on its own, and outside the scope of this tutorial.
As mentioned, the goal of an artificial neural network is to minimize the value of the cost function. The cost
function is minimized when your algorithm’s predicted value is as close to the actual value as possible. Said
differently, the goal of a neural network is to minimize the error it makes in its predictions!
Modifying A Neural Network
After an initial neural network is created and its cost function is computed, changes are made to the neural
network to see if they reduce the value of the cost function.
More specifically, the actual components of the neural network that are modified are the weights on each
neuron's synapses that communicate with the next layer of the network.
The mechanism through which the weights are modified to move the neural network to weights with less
error is called gradient descent. For now, it’s enough for you to understand that the process of training neural
networks looks like this:
 Initial weights for the input values of each neuron are assigned
 Predictions are calculated using these initial values
 The predictions are fed into a cost function to measure the error of the neural network
 A gradient descent algorithm changes the weights for each neuron’s input values
 This process is continued until the weights stop changing (or until the amount of their change at each
iteration falls below a specified threshold)
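As a rough sketch of the loop just described, here is gradient descent on a one-weight linear model; the toy data, learning rate, and stopping threshold are all invented for illustration.

```python
import numpy as np

# Toy data: y is roughly 3 * x, so training should drive the weight toward 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.9, 9.2, 11.8])

w = 0.0                 # initial weight
learning_rate = 0.01
for step in range(10_000):
    y_hat = w * x                               # predictions with the current weight
    grad = np.mean((y_hat - y) * x)             # gradient of the (1/2)*MSE cost w.r.t. w
    new_w = w - learning_rate * grad            # gradient descent update
    if abs(new_w - w) < 1e-8:                   # stop when the weight stops changing
        break
    w = new_w

print(f"learned weight: {w:.3f}")               # close to 3
```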

This may seem very abstract - and that’s OK! These concepts are usually only fully understood when you
begin training your first machine learning models.

Extreme Learning Machine Model

What is ELM?
ELMs (Extreme Learning Machines) are feedforward neural networks, "invented" in 2006 by G. Huang.

As said in the original paper:

Hence the phrase "Extreme" in ELM (but the real reason for the name may vary depending on the source).

Why ELM is different from standard Neural Network


ELM doesn’t require gradient-based backpropagation to work. It uses Moore-Penrose generalized inverse to
set its weights.

First, we look at standard SLFN (Single hidden Layer Feedforward Neural network):

Single hidden Layer Feedforward Neural network, Source: Shifei Ding under CC BY 3.0

It’s pretty straightforward:

1. multiply inputs by weights

2. add bias

3. apply the activation function

4. repeat steps 1–3 number of layers times

5. calculate output

6. backpropagate

7. repeat everything

ELM removes step 4 (because it’s always SLFN), replaces step 6 with matrix inverse, and does it only once,
so step 7 goes away as well.

More details
Before going into details we need to look at how ELM output is calculated:
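The equation itself is not reproduced in the scan; the standard ELM formulation (a reconstruction, so treat the exact notation as an assumption) is

\sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) = t_j, \qquad j = 1, \dots, N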

Where:

• L is the number of hidden units

• N is the number of training samples

• β_i is the weight vector between the i-th hidden unit and the output

• w_i is the weight vector between the input and the i-th hidden unit

• g is an activation function

• b is a bias vector

• x is an input vector

It is quite similar to what's going on in a standard NN with backpropagation, but if you look closely you can
see that we are naming the weights between the hidden layer and the output Beta (β). This Beta matrix is special
because it is the one we are going to compute with a pseudo-inverse rather than learn by gradient descent. We can shorten the equation and write it as:
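The shortened form is the standard ELM linear system (again a reconstruction rather than a copy of the original figure):

H\beta = T, \quad \text{with } H \in \mathbb{R}^{N \times L},\; \beta \in \mathbb{R}^{L \times m},\; T \in \mathbb{R}^{N \times m}, \quad H_{j,i} = g(w_i \cdot x_j + b_i)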

where:

• m is the number of outputs

• H is called the Hidden Layer Output Matrix

• T is the training data target matrix

The theory behind the learning (you can skip this section if you want)
Now we have to dig deeper into the theory behind the network to decide what to do next.

A function is infinitely differentiable if it’s a smooth function

I'm not going to prove those theorems, but if you're interested please refer to Page 3 of ELM-NC-2006 for further
explanation.

Now what we have to do is define our cost function. Basing our assumptions on "Capabilities of a four-
layered feedforward neural network: four layers versus three", we can see that an SLFN is a linear system if the
input weights and the hidden layer biases can be chosen randomly.

Because our ELM is a linear system, we can create an optimization objective:

To approximate the solution we need to use Rao's and Mitra's work again:

Now, using the Moore-Penrose generalized inverse of H, we can calculate Beta hat as:

Learning algorithm
After going through some difficult math, we can now define the learning algorithm. The algorithm itself is
relatively easy:
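A minimal NumPy sketch of this algorithm, assuming a single hidden layer with a sigmoid activation, is given below. The function names and hyperparameters are my own illustration, not taken from the repository linked further down.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden=128, seed=0):
    """Random input weights + Moore-Penrose pseudo-inverse for the output weights."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W = rng.normal(size=(n_features, n_hidden))   # step 1: random input weights
    b = rng.normal(size=n_hidden)                 # random hidden biases
    H = sigmoid(X @ W + b)                        # step 2: hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                  # step 3: beta_hat = pinv(H) @ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Toy usage with random data (a stand-in for MNIST-style inputs and one-hot targets).
X = np.random.rand(200, 64)
T = np.eye(10)[np.random.randint(0, 10, 200)]
W, b, beta = elm_train(X, T, n_hidden=128)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
```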

If you're interested in seeing a Python implementation, please check this repository:

https://github.com/burnpiro/elm-pure

And here is a preview of how the model works on the MNIST dataset:

https://github.com/burnpiro/elm-pure/blob/master/ELM%20example.ipynb

As you can see, a simple version of ELM achieves >91% accuracy on the MNIST dataset, and it takes
around 3 s to train the network on an Intel i7-7820X CPU.

Performance comparison
I'm going to use metrics from the original paper in this section, and it might surprise you how long some of the
training takes compared with the previous MNIST example; but remember that the original paper was published
in 2006 and the networks were trained on a Pentium 4 1.9 GHz CPU.

Datasets

Results

We can ignore training time for now because it's obvious that gradient descent takes longer than a matrix
inversion. The most important information in this result table is Accuracy and Nodes. In the first two
datasets, you can see that the author used different sizes of BP networks to achieve the same results as ELM.
The BP network was 5x smaller in the first case and 2x smaller in the second case. That affects testing times
(it's faster to run a 100-node NN than a 500-node NN), and it tells us how accurate our method is at
approximating the dataset.

It is hard to find any tests of ELM networks on popular datasets, but I've managed to do so. Here is a
benchmark on CIFAR-10 and MNIST:

Where:

 DELM is a deep ELM

 ReNet is described in this paper

 RNN is a Recurrent Neural Network

 EfficientNet is described in this paper

I didn't find training times for ELMs, so there was no way to compare them with the results from other networks,
but all those multipliers (20x, 30x) are relative differences in training time based on the training of ELM
1000 on CIFAR-10. If there is a 30x time increase between ELM 1000 and ELM 3500, then you can
imagine how long it would take to train a DELM which has 15000 neurons.

THE CONVOLUTION OPERATION:

The key points of the feedforward neural network:

• The Universal Approximation Theorem (UAT) says that Deep Neural Networks (DNNs) are powerful function approximators.

• DNNs can be trained using backpropagation.

• However, fully connected DNNs (fully connected means that any neuron in a layer is connected to all the neurons in the previous layer) are prone to overfitting, because the network is very deep and the number of parameters is very large.

• The second problem with fully connected networks is that some gradients might vanish due to long chains. Since the network is very deep, the gradients in the first few layers might vanish while flowing back, and therefore those weights are barely trained.

• So, the objective is to have a network that is still complex (has non-linearities), since in most real-world problems the output is going to be a complex function of the input, but that has fewer parameters and is therefore less prone to overfitting. CNNs belong to the family of networks that serves this objective.

Convolutional Operation

The convolutional operation means that for a given input we re-estimate it as the weighted average of all the inputs around it. We have some weights assigned to the neighboring values, and we take the weighted sum of the neighbor values to estimate the value of the current input/pixel.

For a 2D input, the classic example is an image, where we re-calculate the value of every pixel by taking the weighted sum of the pixels (neighbors) around it. For example, let's say the input image is as given below:

Input Image

Now, in this input image, we calculate the value of each and every pixel by considering the weighted sum of the pixels around it.

Here we are calculating the value of the circled pixel by considering 3 neighbors around it; assume that the weights w1, w2, w3, w4 are associated with these 4 pixels (the pixel itself plus its 3 neighbors) respectively.

Now, this matrix of weights is referred to as the Kernel or Filter. In the above case, we have a kernel of size 2X2.

We compute the output (the re-estimated value of the current pixel) using the following formula:

Here m refers to the number of rows (which is 2 in this case) and n refers to the number of columns (which is 2 in this case).
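The formula itself is missing from the scan; reconstructed from the surrounding description (m kernel rows, n kernel columns, kernel anchored at the current pixel), it reads approximately

S(i, j) = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I(i+a,\, j+b)\, K(a, b)

where I is the input image, K is the kernel, and S(i, j) is the re-estimated value of the pixel at position (i, j).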

Now we place the 2X2 filter over the first 2X2 portion of the image and take the weighted sum and that

would give the new value of the first pixel.

We map the 2X2 kernel/filter over the 2X2 portion of the input.

The output of this operation would be: (aw + bx + ey + fz)

Then we move the filter horizontally by one and place it over the next 2X2 portion of the input; in this case the pixels of interest would be b, c, f, g, and computing the output with the same technique we would get: (bw + cx + fy + gz)

And then again we move the kernel/filter by 1 in the horizontal direction and take the weighted sum.

So, after this, the output from the first layer would look like:

Then we move the kernel down by 1 in the vertical direction, calculate the output, and again move the kernel in the horizontal direction. In general, we move the kernel like this: first, we start at the top-left portion of the image and move the filter in the horizontal direction until we cover the row completely; then we move the filter down in the vertical direction (relative to the top-left portion of the image), again stride it horizontally through the entire row, and continue like this. In essence, we move the kernel left to right, top to bottom.

Instead of considering pixels only in the forward direction, we can also consider the previous neighbors.

And to consider the previous neighbors, the formula for computing the output would be:

We take the limits from -m/2 to m/2, i.e. we take half of the rows from the previous neighbors and the other half from the forward direction (forward neighbors), and the same is the case in the other direction (-n/2 to n/2).

Typically, we take an odd-dimensional kernel.
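The centered form of the formula is also missing from the scan; a reconstruction consistent with the description above is

S(i, j) = \sum_{a=-\lfloor m/2 \rfloor}^{\lfloor m/2 \rfloor} \; \sum_{b=-\lfloor n/2 \rfloor}^{\lfloor n/2 \rfloor} I(i+a,\, j+b)\, K(a, b)

so that half of the neighborhood comes from the previous neighbors and half from the forward neighbors.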

Convolutional Operation in practice

Let the input image be as given below:

and we use a kernel/filter of size 3X3; for each pixel, we take the 3X3 neighborhood around it (the pixel itself is a part of this 3X3 neighborhood and sits at the center), just like in the below image:

Input Image, we consider 3X3 portions of this image as the kernel is of size 3X3

Let's say this input is a 30X30 image. We go over every pixel systematically, place the filter such that the pixel is at the center of the kernel, and re-estimate the value of that pixel as the weighted sum of the pixels around it.

So, in this way, we get back the re-estimated value of all the pixels.

We all have seen the convolutional operation in practice. Let’s say the kernel that we are using is as below:

Kernel
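The kernel figure is not reproduced here; based on the description below (every weight equal to 1/9), it is the 3x3 averaging (box-blur) filter, which could be passed directly to a convolution routine such as the conv2d sketch shown earlier:

```python
import numpy as np

blur_kernel = np.full((3, 3), 1.0 / 9.0)   # every weight is 1/9: a 3x3 averaging (box-blur) filter
```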

So, we move this kernel all over the image and re-compute every pixel as the weighted sum of its neighborhood. In this case, since all the weights are 1/9, the re-estimated value of each pixel is simply the average of the 9 pixels over which the kernel is placed.

Averaging over the neighborhood in this way smooths out local differences, which dilutes the values and blurs the image, and the output we get by applying this convolutional operation is:

So, the blur operation that we all might have used in a photo editing application actually applies the convolution operation behind the scenes.

Now, in the below-mentioned scenario, we are using 5 as the weight for the central pixel, 0 for the corner pixels and -1 for the remaining (edge-adjacent) pixels, so the net effect is that the value/color intensity of the central pixel is boosted while its neighborhood information is subtracted; the result is that it sharpens the image.
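A kernel consistent with this description (an assumed reconstruction of the missing figure: center weight 5, corner weights 0, remaining weights -1) is the common 3x3 sharpening filter:

```python
import numpy as np

sharpen_kernel = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]])   # boosts the center pixel, subtracts its 4 neighbors
```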

The output of the above convolution is:

Let's take one more example: in the below case, the value for the central pixel is -8 and for all other pixels it is 1. So if we have the same color throughout the 3X3 portion of the image (just like for the marked pixel in the below image), and the pixel intensity of this region is denoted by 'x', then we get -8x from the central pixel and +8x from the weighted sum of all the other pixels, and the summation of these results in 0.

So, wherever we have the same color in the 3X3 portion (some sample regions are marked in the below image), or in other words the neighbors are exactly the same as the current pixel, we get an output intensity of 0.
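The kernel being described (center weight -8, all other weights 1) is a Laplacian-style edge detector; its weights sum to zero, so any uniform 3x3 region produces an output of 0:

```python
import numpy as np

edge_kernel = np.array([[ 1,  1,  1],
                        [ 1, -8,  1],
                        [ 1,  1,  1]])   # weights sum to 0, so flat regions give zero output
```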

So, in effect, what will happen is that wherever there is a boundary (yellow highlighted in the below image), the neighboring pixels cannot all be the same as the current pixel; only in such regions do we get a non-zero value, and everywhere else we get a zero value. So, in effect, we end up detecting all the edges in the input image.

2D Convolution with 3D filter:

Below is a complete picture of how the 2D convolutional operation is performed over the input: we start at the top-left corner, apply the kernel over that area, and move the kernel horizontally towards the right; once we have reached the end (completed the entire row) on the right side, we move the kernel downwards by some steps and again start from the left side and move towards the right:

We slide the kernel horizontally

Once we complete the entire row, we slide the kernel vertically in the downward direction and start again from the left side

We move from left to right and from top to bottom.

In the case of a 3D input (an image is also a 3D input, as it has 3 channels corresponding to Red, Green and Blue; all these channels are superimposed on each other and that's how we get the final image. In other words, every pixel in the image has 3 values associated with it, so we can look at that as the depth), we have 3 channels (depth), one corresponding to each of R, G and B in the image. We use a filter of the same depth as the input, place the filter over the input, and compute the weighted sum across all 3 dimensions.

In most cases when we use convolution for 3D inputs, we use a 3D convolution filter (as depicted in the below image). That means if we place the filter at a given location in the image, we take a weighted average of its 3D neighborhood, but we are not going to slide it along the depth. The kernel has the same depth as the original input, and that's why there is no scope to move it through the depth of the input. For example, if the input image depth is 3 and the kernel depth is also 3, there is no room to move it along the depth; there is no movement available there.

In this case also, we move the filter horizontally and vertically as in the 2D case. We don't move the filter along the depth, as the input image depth is the same as the filter depth and there is no scope to move across the depth.

So, what we do in practice is take this 3D kernel and start moving it: we move it along the horizontal direction first, and we keep doing this through the entire image (moving from left to right and top to bottom) until we reach the last position. At the end of this, although our input was 3-dimensional, we get back a 2D output.

Points to consider:

 Input is 3D

 The filter is also 3D

 The convolutional operation that we perform is 2D as we are sliding the filter horizontally and vertically

and not along the depth

 This is because the depth of the filter is the same as the depth of the input

In practice, we apply multiple kernels/filters to the same input and get different representations/outputs from the same input depending on the kernel used. For example, one filter might detect the vertical edges in the input, a second might detect the horizontal edges in the image, another filter might blur the image, and so on.

In the above image, we are using 3 different filters and we are getting 3 outputs, one corresponding to each filter. We can combine these different output representations into one single volume (each output representation has a width and a height, and after combining all of the representations we get the depth as well). So, if we apply 3 filters to the input, we get an output of depth 3; if we apply 100 filters to the input, we get an output of depth 100.

Points to consider:

 Each filter applied to a 3D input would give a 2D output.

 Combining the output of multiple such filters would result in a 3D output.

Terminology

Let’s define some terminology and find out the relation between the input dimensions and the output

dimensions:

The spatial extent of a filter (F), i.e. the extent of the neighborhood we are looking at, is the dimension of the filter: it would be 'F X F'. Usually we have an odd-dimensional filter, and the depth of the filter is the same as the depth of the input (Di in this case).

Now we want to relate the output dimensions to the input dimensions.

Let's take a 2D input of dimension '7 X 7' and slide a filter of size '3 X 3' over it.

As we slide the filter over it (from left to right and top to bottom), we keep computing the output values, and it's very clear that the output is smaller than the input.

This is how we slide the filter over the image:

The reason why this happens is obvious: we can't place the kernel at the corners, as it would cross the boundary.

We can't place the filter at the crossed pixel (below image), because if we placed it there then the yellow highlighted portion would be undefined:

In practice, we would stop at the crossed pixel (as in the below image), where the filter still lies completely inside the image:

And this is why we get a smaller output: we are not able to apply the filter in any part of the shaded region in the below image:

Hence, we are not computing a re-estimated value for every pixel in the input, and therefore the number of pixels in the output is less than the number of pixels in the input.

This was the case for a '3 X 3' kernel; now let's see what happens when we have a '5 X 5' kernel:

Now we cannot place the kernel at the crossed pixel in the above image. We cannot place the kernel at the yellow highlighted pixel either. So, in this case, we cannot place the kernel at any of the shaded regions in the below image:

The bigger the kernel used, the smaller the output.

So, the output dimension in terms of the input is:
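For a W_in x W_in input and an F x F kernel with stride 1 and no padding, the standard relation is

W_out = W_in - F + 1

so a 7 x 7 input with a 3 x 3 kernel gives a 5 x 5 output.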

What if we want the output to be of the same size as the input?

If we want the output to be the same size as the input, then we need to pad the input appropriately:

Here we pad the input with 0 all around the input image and apply the 3X3 filter over the padded input, and we get an output of the same dimension as the input.

If we place the kernel at the crossed pixel in the below image, we now have 5 artificial pixels with a value of 0, and we are able to re-estimate the value of this crossed pixel.

Now the output would again be '7 X 7', as we have introduced this artificial boundary around the original input and this boundary contains only zeros.

If we have a '5 X 5' filter, it would still go outside the image even after this artificial padding of 1.

So, in this case, we need to increase the padding. Earlier we added a padding of 1 (meaning 1 row at the top, 1 at the bottom, 1 column at the left and 1 at the right). And it's obvious from the above image that if we want to use a '5 X 5' filter, then we should use a padding of 2.

The bigger the kernel size, the larger the padding required, and the updated formula for the relation between the input and output dimensions is:
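With a padding of P added on every side, the relation becomes

W_out = W_in - F + 2P + 1

e.g. 7 - 3 + 2(1) + 1 = 7, so a padding of 1 with a 3 x 3 kernel preserves the input size.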

Stride (S): Stride defines the interval at which the filter is applied. Until now we have discussed all the cases considering the stride to be 1, i.e. we move the filter by 1 in the horizontal and vertical direction, as depicted in the below image:

In some cases we may not want this; say we don't want a full replica of the image and just need a summary of it. In that case, we may choose to apply the filter only at alternate locations in the input.

Here we use S = 2, i.e. we move the filter by 2 in the horizontal as well as the vertical direction.

This interval between two successive positions where we apply the kernel is termed the Stride. And in the above case, the output would be roughly half the input, as we are skipping every alternate location in the image.

Now, if we are using a stride 'S', then the formula to compute the width and height is given by:
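The standard formula, for input width W_in, filter size F, padding P and stride S, is

W_out = floor((W_in - F + 2P) / S) + 1

and the same relation holds for the height.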

The higher the stride, the smaller the size of the output.

The depth of the output is going to be the same as the number of filters that we have.

Each 3D filter applied over the 3D input gives one 2D output; if we use K such filters, we get K such 2D outputs, and if we stack up all these K outputs we get an output of depth K. So, the depth of the output is the same as the number of filters used.

MOTIVATION:

• A digital image is a 2D grid of pixels. Since a neural network expects a vector as input, one idea for dealing with images would be to flatten the image and feed the output of the flattening operation to the neural network, and this would work to some extent.

But eventually, that flattened vector won't be the same for a translated image.

The neural network would have to learn very different parameters in order to classify the same objects, which is a difficult job since natural images are highly variable (lighting, translation, viewing angle, ...).

It is also worth mentioning that the input vector would be relatively big, 64*64*3 for an RGB image, which can cause memory problems when using a plain neural network, since with just 10 neurons in the first layer alone we would have 64*64*3*10 = 122,880 weights to train.

Natural images have 2 main characteristics:

• Locality: nearby pixels are more strongly correlated.

• Translation invariance: meaningful patterns can occur anywhere in the image.

How do Convolutional Neural Networks solve the problem for images?

The answer to this question lies in 3 characteristics of the CNN:

• Sparse connectivity: when processing an image, the input might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. These improvements in efficiency are usually quite large. If there are m inputs and n outputs, then matrix multiplication requires m × n parameters and the algorithms used in practice have O(m × n) runtime (per example). If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n parameters and O(k × n) runtime.

• Parameter sharing: in a convolutional neural net, each member of the kernel is used at every position of the input (except perhaps some of the boundary pixels, depending on the design decisions regarding the boundary). The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set. This does not affect the runtime of forward propagation (it is still O(k × n)), but it does further reduce the storage requirements of the model to k parameters.

• Equivariance: in the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation. We use the same network parameters to detect local patterns at many locations in the image.

The Convolution Operation

In a practical setting, the convolution operation is implemented by making the kernel slide across the image and produce an output value at each position.

We also convolve different kernels with the same input and as a result obtain different feature maps or channels.

Variants of the Convolution Operation:

• Valid convolution: doesn't use any padding.

• Same convolution: pads in a way that the output size is the same as the input size.

• Full convolution: we compute an output wherever the kernel and the input overlap by at least 1 pixel.

• Strided convolution: the kernel slides along the image with a step > 1.

• Dilated convolution: the kernel is spread out, with a step > 1 between kernel elements.

• Depthwise convolution: each output channel is connected only to one input channel.

POOLING:

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby

outputs. For example, the max pooling operation reports the maximum output within a rectangular

neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the L2

norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel. In all cases, pooling helps make the representation approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.

VARIANTS OF THE BASIC CONVOLUTION FUNCTION:


Convolution in the context of NN means an operation that consists of many applications of convolution in
parallel.

 Kernel K with element K_{i,j,k,l} giving the connection strength between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit.
 Input: V_{i,j,k} with channel i, row j and column k
 Output Z with the same format as V
 Use 1 as the first entry (1-based indexing)

Full Convolution
Zero padding, stride 1:
Z_{i,j,k} = ∑_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n}
Zero padding, stride s:
Z_{i,j,k} = c(K, V, s)_{i,j,k} = ∑_{l,m,n} [ V_{l, s·(j−1)+m, s·(k−1)+n} K_{i,l,m,n} ]

Convolution with a stride greater than 1 pixel is equivalent to convolution with stride 1 followed by downsampling.
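
This equivalence is easy to check numerically; the following NumPy sketch (an illustration, not part of the original notes) compares a stride-2 convolution with a stride-1 convolution followed by keeping every second output:

import numpy as np

def conv2d_valid(image, kernel, stride=1):
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros(((H - kH) // stride + 1, (W - kW) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + kH, c:c + kW] * kernel)
    return out

img = np.random.rand(8, 8)
ker = np.random.rand(3, 3)

strided = conv2d_valid(img, ker, stride=2)
downsampled = conv2d_valid(img, ker, stride=1)[::2, ::2]   # stride 1, then downsample
print(np.allclose(strided, downsampled))                   # True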


Some Zero Padding and 1 Stride

Without zero padding, the width of the representation shrinks by one pixel less than the kernel width at each layer. We are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels. Zero padding allows us to control the kernel width and the size of the output independently.


Special cases of zero padding:

 Valid: no zero padding is used. Limited number of layers.
 Same: keep the size of the output equal to the size of the input. Unlimited number of layers, but pixels near the border influence fewer output pixels than pixels near the center.
 Full: enough zeros are added for every pixel to be visited k (kernel width) times in each direction, resulting in an output of width m + k − 1. It is difficult to learn a single kernel that performs well at all positions in the convolutional feature map.

Usually the optimal amount of zero padding lies somewhere between 'Valid' and 'Same'.

Unshared Convolution
In some cases we do not want to use convolution but rather a locally connected layer; we then use unshared convolution. The indices into the weight tensor W are:

 i: the output channel


 j: the output row;


 k: the output column


 l: the input channel
 m: row offset within input
 n: column offset within input

Z_{i,j,k} = ∑_{l,m,n} [ V_{l, j+m−1, k+n−1} W_{i,j,k,l,m,n} ]
Comparison on local connections, convolution and full connection

This is useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space, e.g. looking for a mouth only in the bottom half of an image.

It can also be useful to make versions of convolution or locally connected layers in which the connectivity is further restricted, e.g. constraining each output channel i to be a function of only a subset of the input channels.

Advantages: reduced memory consumption, increased statistical efficiency, and reduced computation for both forward and backward propagation.


Tiled Convolution
Learn a set of kernels that we rotate through as we move through space. Immediately neighboring locations will have different filters, but the memory requirement for storing the parameters increases only by a factor of the size of this set of kernels. Comparison of locally connected layers, tiled convolution and standard convolution:


K: a 6-D tensor holding t different choices of kernel stack

Z_{i,j,k} = ∑_{l,m,n} [ V_{l, j+m−1, k+n−1} K_{i,l,m,n, j%t+1, k%t+1} ], where % is the modulo operation.

Locally connected layers and tiled convolutional layers with max pooling: the detector units of these layers are driven by different filters. If the filters learn to detect different transformed versions of the same underlying feature, then the max-pooled units become invariant to the learned transformation.


Backpropagation in the Convolution Layer

The quantities involved in backpropagation through a convolutional layer are:

 K: kernel stack
 V: input image
 Z: output of the conv layer
 G: gradient of the loss with respect to Z

STRUCTURED OUTPUTS:

A deep neural network model is a powerful framework for learning representations. Usually, it is used to
learn the relation x→y by exploiting the regularities in the input x. In structured output prediction
problems, y is multi-dimensional and structural relations often exist between the dimensions. The motivation
of this work is to learn the output dependencies that may lie in the output data in order to improve the
prediction accuracy. Unfortunately, feedforward networks are unable to exploit the relations between the
outputs. In order to overcome this issue, we propose in this paper a regularization scheme for training neural
networks for these particular tasks using a multi-task framework. Our scheme aims at incorporating the
learning of the output representation y in the training process in an unsupervised fashion while learning the
supervised mapping function x→y.

TYPES OF DATA:

Deep learning can be applied to any data type. The data types you work with, and the data you gather, will
depend on the problem you’re trying to solve.

1. Sound (Voice Recognition)


2. Text (Classifying Reviews)
3. Images (Computer Vision)
4. Time Series (Sensor Data, Web Activity)
5. Video (Motion Detection)

Use Cases

Deep learning can solve almost any problem of machine perception, including classifying data, clustering it, or making predictions about it.

 Classification: This image represents a horse; this email looks like spam; this transaction is
fraudulent
 Clustering: These two sounds are similar. This document is probably what user X is looking for
 Predictions: Given their web log activity, Customer A looks like they are going to stop using your
service


Deep learning is best applied to unstructured data like images, video, sound or text. An image is just a blob
of pixels, a message is just a blob of text. This data is not organized in a typical, relational database by rows
and columns. That makes it more difficult to specify its features manually.

Common use cases for deep learning include sentiment analysis, classifying images, predictive analytics,
recommendation systems, anomaly detection and more.


Data Attributes

For deep learning to succeed, your data needs to have certain characteristics.

Relevancy

The data you use to train your neural net must be directly relevant to your problem; that is, it must resemble
as much as possible the real-world data you hope to process. Neural networks are born as blank slates, and
they only learn what you teach them. If you want them to solve a problem involving certain kinds of data,
like CCTV video, then you have to train them on CCTV video, or something similar to it. The training data
should resemble the real-world data that they will classify in production.

Proper Classification

If a client wants to build a deep-learning solution that classifies data, then they need to have a labeled
dataset. That is, someone needs to apply labels to the raw data: “This image is a flower, that image is a
panda.” With time and tuning, this training dataset can teach a neural network to classify new images it has
not seen before.

Formatting

Neural networks eat vectors of data and spit out decisions about those vectors. All data needs to be
vectorized, and the vectors should be the same length when they enter the neural net. To get vectors of the
same length, it’s helpful to have, say, images of the same size (the same height and width). So sometimes
you need to resize the images. This is called data pre-processing.
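
As a small illustration of this pre-processing step (the file names and the target size are arbitrary assumptions), images of different sizes can be resized and flattened into equal-length vectors with PIL and NumPy:

import numpy as np
from PIL import Image

def image_to_vector(path, size=(64, 64)):
    # resize to a fixed size and flatten into a 1-D float vector in [0, 1]
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32).flatten() / 255.0

# vectors = [image_to_vector(p) for p in ["cat1.jpg", "cat2.jpg"]]  # all the same length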

Accessibility

The data needs to be stored in a place that’s easy to work with. A local file system, or HDFS (the Hadoop
file system), or an S3 bucket on AWS, for example. If the data is stored in many different databases that are
unconnected, you will have to build data pipelines. Building data pipelines and performing preprocessing
can account for at least half the time you spend building deep-learning solutions.


Minimum Data Requirement

The minimums vary with the complexity of the problem, but 100,000 instances in total, across all categories,
is a good place to start.

If you have labeled data (i.e. categories A, B, C and D), it’s preferable to have an evenly balanced dataset
with 25,000 instances of each label; that is, 25,000 instances of A, 25,000 instances of B and so forth.

EFFICIENT CONVOLUTION ALGORITHMS:

Efficient convolution algorithms are essential for various signal and image processing tasks, as well
as deep learning and computer vision applications. Convolution is a fundamental operation that involves
multiplying and summing values from two input arrays, and it can be computationally expensive, especially
for large input data and filter kernels. Several efficient algorithms and techniques have been developed to
speed up convolution operations. Here are some of the most important ones:

1. Direct (Naïve) Convolution:

The most straightforward way to compute a convolution is to perform the element-wise multiplication and
sum for each possible location of the filter over the input. While this is conceptually simple, it is highly
inefficient and slow for large inputs and filters.

2. Fast Fourier Transform (FFT) Convolution:

One efficient technique for convolution is to use the FFT. The idea is to convert the input and filter into
the frequency domain, perform element-wise multiplication, and then convert the result back to the time
domain using the inverse FFT. This approach can significantly reduce the computational complexity,
especially for large filters and inputs. However, it may introduce some artifacts due to the finite precision of
floating-point arithmetic.
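
As a quick illustration (not part of the original text), SciPy exposes both a direct and an FFT-based 2-D convolution, and the two agree up to floating-point error while the FFT version is much faster for large kernels:

import numpy as np
from scipy.signal import convolve2d, fftconvolve

img = np.random.rand(256, 256)
ker = np.random.rand(15, 15)

direct = convolve2d(img, ker, mode="same")    # naive spatial convolution
fast = fftconvolve(img, ker, mode="same")     # multiply in the frequency domain

print(np.allclose(direct, fast))              # True, up to floating-point error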

3. Winograd Convolution:

Winograd convolution is an algorithm that minimizes the number of multiplications required for
convolution. It uses a small set of precomputed matrices to transform the input and filter, allowing for faster
computation with less computational cost. This method is particularly effective for small filter sizes.

4. Strassen Algorithm:

Originally developed for matrix multiplication, the Strassen algorithm has also been adapted for
convolution. It reduces the number of multiplicative operations by recursively breaking down the
convolution into smaller sub-convolutions. This can be more efficient for large convolutions.


5. FFT-based Methods for 2D Convolution:

For 2D convolutions, you can use the FFT approach by performing separate 1D FFTs along each
dimension of the input and filter, and then combining them. This is particularly useful when dealing with
images and 2D data.

6. Im2Col and Col2Im:

These techniques involve reformatting the input and filter data into matrix form, where convolution can be
performed as a simple matrix multiplication operation. While this approach can be more efficient for
hardware implementations, it requires additional memory for the reformatted data.
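
A minimal NumPy sketch of the im2col idea (illustrative only): every patch of the input is unrolled into a column, after which the convolution becomes a single matrix multiplication.

import numpy as np

def im2col(image, kH, kW):
    # unfold every kH x kW patch of a 2-D image into one column of a matrix
    H, W = image.shape
    cols = [image[i:i + kH, j:j + kW].reshape(-1)
            for i in range(H - kH + 1)
            for j in range(W - kW + 1)]
    return np.stack(cols, axis=1)              # shape: (kH*kW, number of patches)

image = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)

patches = im2col(image, 3, 3)                  # (9, 16)
out = kernel.reshape(1, -1) @ patches          # convolution as a matrix product
out = out.reshape(4, 4)                        # fold back into the output map (col2im)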

7. Depthwise Separable Convolution:

Depthwise separable convolution is a technique used in deep learning, where a convolution operation is
split into two parts: depthwise convolution (applying a single filter to each input channel separately) and
pointwise convolution (applying 1x1 convolutions to combine the results). This reduces the number of
parameters and computations, making it efficient for mobile and embedded devices.
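
The parameter savings are easy to see in Keras, where SeparableConv2D implements exactly this depthwise-then-pointwise factorization (the layer sizes below are arbitrary assumptions chosen for illustration):

import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 32, 64))

standard = tf.keras.layers.Conv2D(128, 3, padding="same")(inputs)
separable = tf.keras.layers.SeparableConv2D(128, 3, padding="same")(inputs)  # depthwise + 1x1 pointwise

print(tf.keras.Model(inputs, standard).count_params())    # 73,856 parameters
print(tf.keras.Model(inputs, separable).count_params())   # 8,896 parameters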

8. Winograd-like Transformations:

Variations of the Winograd algorithm exist for different input sizes and filter dimensions, providing
options for optimizing convolution for specific scenarios.

The choice of convolution algorithm depends on the specific use case, hardware, and trade-offs between
speed and memory usage.
NEUROSCIENTIFIC BASIS:

The history of convolutional networks begins with neuroscientific experiments long before the relevant

computational models were developed.

Neurophysiologists David Hubel and Torsten Wiesel observed how neurons in the cat’s brain responded

to images projected in precise locations on a screen in front of the cat.

“Their great discovery was that neurons in the early visual system responded most strongly to very specific

patterns of light, such as precisely oriented bars, but responded hardly at all to other patterns”


The Neurons in the early visual cortex are organized in a hierarchical fashion, where the first cells connected

to the cat’s retinas are responsible for detecting simple patterns like edges and bars, followed by later layers

responding to more complex patterns by combining the earlier neuronal activities.

A Convolutional Neural Network may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers.

Filters in a Convolutional Neural network

The Visual Cortex of the brain is a part of the cerebral cortex that processes visual information. V1 is the

first area of the brain that begins to

perform significantly advanced processing of visual input.


A convolutional network layer is designed to capture three properties of V1:

1. V1 is arranged in a spatial map. It actually has a two-dimensional structure mirroring the structure of the

image in the retina. Convolutional networks capture this property by having their features defined

in terms of two dimensional maps.

2. V1 contains many simple cells. A simple cell’s activity can be characterized by a linear function of the

image in a small, spatially

localized receptive field. The detector units of a convolutional network are designed to emulate

these properties of simple cells.

3. V1 also contains many complex cells. These cells respond to features that

are similar to those detected by simple cells, but complex cells are invariant to small shifts in the position

of the feature. This inspires the pooling units of convolutional networks.

There are many differences between convolutional networks


and the mammalian vision system. Some of these differences are -

1. The human eye is mostly very low resolution, except for a tiny patch called the fovea. Most

convolutional networks receive large full resolution photographs as input.

2. The human visual system is integrated with many other senses, such as

hearing, and factors like our moods and thoughts. Convolutional networks

so far are purely visual.

3. Even simple brain areas like V1 are heavily impacted by feedback from higher levels. Feedback has been

explored extensively in neural network models but has not yet been shown to offer a compelling

improvement.

APPLICATIONS

COMPUTER VISION:

What is Computer Vision?


Computer vision is one of the fields of artificial intelligence that trains and enables computers to understand
the visual world. Computers can use digital images and deep learning models to accurately identify and
classify objects and react to them.


Computer vision in AI is dedicated to the development of automated systems that can interpret visual data
(such as photographs or motion pictures) in the same manner as people do. The idea behind computer vision
is to instruct computers to interpret and comprehend images on a pixel-by-pixel basis. This is the foundation
of the computer vision field. Regarding the technical side of things, computers will seek to extract visual
data, manage it, and analyze the outcomes using sophisticated software programs.

The amount of data that we generate today is tremendous - 2.5 quintillion bytes of data every single day.
This growth in data has proven to be one of the driving factors behind the growth of computer vision.

How Does Computer Vision Work?

Massive amounts of information are required for computer vision. Repeated data analyses are performed
until the system can differentiate between objects and identify visuals. Deep learning, a specific kind of
machine learning, and convolutional neural networks, an important form of a neural network, are the two
key techniques that are used to achieve this goal.

With the help of pre-programmed algorithmic frameworks, a machine learning system may automatically
learn about the interpretation of visual data. The model can learn to distinguish between similar pictures if it
is given a large enough dataset. Algorithms make it possible for the system to learn on its own, so that it
may replace human labor in tasks like image recognition.

Convolutional neural networks help machine learning and deep learning models understand images by dividing them into smaller regions that can be tagged. Using these tags, the network performs convolutions and makes predictions about the scene it is observing. With each cycle, the network performs further convolutions and checks the accuracy of its predictions, and that is when it starts perceiving and identifying pictures much like a human.

Computer vision is similar to solving a jigsaw puzzle in the real world. Imagine that you have all these
jigsaw pieces together and you need to assemble them in order to form a real image. That is exactly how the
neural networks inside a computer vision work. Through a series of filtering and actions, computers can put
all the parts of the image together and then think on their own. However, the computer is not just given a
puzzle of an image - rather, it is often fed with thousands of images that train it to recognize certain objects.

For example, instead of training a computer to look for pointy ears, long tails, paws and whiskers that make
up a cat, software programmers upload and feed millions of images of cats to the computer. This enables the
computer to understand the different features that make up a cat and recognize it instantly.

History

For almost 60 years, researchers and developers have sought to teach computers how to perceive and make
sense of visual information. In 1959, neurophysiologists started showing a cat a variety of sights in an effort
to correlate a reaction in the animal's brain. They found that it was particularly sensitive to sharp corners and
lines, which technically indicates that straight lines and other basic forms are the foundation upon which
image analysis is built.


Around the same period, the first image-scanning technology emerged that enabled computers to scan
images and obtain digital copies of them. This gave computers the ability to digitize and store images. In the
1960s, artificial intelligence (AI) emerged as an area of research, and the effort to address AI's inability to
mimic human vision began.

Neuroscientists demonstrated in 1982 that vision operates hierarchically and presented techniques enabling
computers to recognize edges, vertices, arcs, and other fundamental structures. At the same time, data
scientists created a pattern-recognition network of cells. By the year 2000, researchers were concentrating
their efforts on object identification, and by the following year, the industry saw the first-ever real-time face
recognition solutions.

Deep Learning Revolution

Examining the algorithms upon which modern computer vision technology is based is essential to
understanding its development. Deep learning is a kind of machine learning that modern computer vision
utilizes to get data-based insights.

When it comes to computer vision, deep learning is the way to go. An algorithm known as a neural network
is used. Patterns in the data are extracted using neural networks. Algorithms are based on our current
knowledge of the brain's structure and operation, specifically the linkages between neurons within the
cerebral cortex.

The perceptron, a mathematical model of a biological neuron, is the fundamental unit of a neural network. It
is possible to have many layers of linked perceptrons, much like the layers of neurons in the biological
cerebral cortex. As raw data is fed into the perceptron-generated network, it is gradually transformed into
predictions.

How Long Does It Take To Decipher An Image

Extremely fast CPUs and associated technology, together with a swift, dependable internet and cloud-based infrastructures, make the entire process blisteringly fast nowadays. Importantly, several of the largest businesses investing in AI research, like Google, Facebook, Microsoft, and IBM, have been open about their research and development in the field. In this way, people may build upon the foundation they've laid.

This has resulted in the AI sector heating up, and studies that used to take weeks to complete may now be
completed in a few minutes. In addition, for many computer vision tasks in the actual world, this whole
process takes place constantly in a matter of microseconds. As a result, a computer may currently achieve
what researchers refer to as "circumstantially conscious" status.

Computer Vision Applications

One field of Machine Learning where fundamental ideas are already included in mainstream products is
computer vision. The applications include:


 Self-Driving Cars

With the use of computer vision, autonomous vehicles can understand their environment. Multiple cameras record the environment surrounding the vehicle, and the footage is fed into computer vision algorithms that analyze the images in real time to locate road edges, decipher signposts, and see other vehicles, obstacles, and people. The autonomous vehicle can then navigate streets and highways on its own, swerve around obstructions, and get its passengers where they need to go safely.

 Facial Recognition

Facial recognition programs, which use computer vision to recognize individuals in photographs, rely
heavily on this field of study. Facial traits in photos are identified by computer vision algorithms, which then
match those aspects to stored face profiles. In order to verify the identity of the people using consumer
electronics, face recognition is increasingly being used. Facial recognition is used in social networking
applications for both user detection and user tagging. For the same reason, law enforcement uses face
recognition software to track down criminals using surveillance footage.

 Augmented & Mixed Reality

Augmented reality, which allows computers like smartphones and wearable technology to superimpose or
embed digital content onto real-world environments, also relies heavily on computer vision. Virtual items
may be placed in the actual environment through computer vision in augmented reality equipment. In order
to properly generate depth and proportions and position virtual items in the real environment, augmented
reality apps rely on computer vision techniques to recognize surfaces like tabletops, ceilings, and floors.

 Healthcare

Computer vision has contributed significantly to the development of health tech. Automating the process of
looking for malignant moles on a person's skin or locating indicators in an x-ray or MRI scan is only one of
the many applications of computer vision algorithms.

Examples

The following are some examples of well-established activities using computer vision:

 Categorization of Images

A computer program that uses image categorization can determine what an image is of (a dog, a banana, a
human face, etc.). In particular, it may confidently assert that an input picture matches a specific category. It
might be used by a social networking platform, for instance, to filter out offensive photos that people post.

 Object Detection

By first classifying images into categories, object detection may then utilize this information to search for
and catalog instances of the desired class of images. In the manufacturing industry, this can include finding
defects on the production line or locating broken equipment.


 Observation of Moving Objects

Once an object is detected, object tracking follows it as it moves, commonly using a live video stream or a series of sequentially captured photos. For example, driverless cars must not only detect and classify moving things like people, other motorists, and road systems, but also track them in order to prevent crashes and adhere to traffic regulations.

 Retrieval of Images Based on Their Contents

In contrast to traditional visual retrieval methods, which rely on metadata labels, a content-based recognition
system employs computer vision to search, explore, and retrieve pictures from huge data warehouses based
on the actual image content. Automatic picture annotations, which can replace traditional visual tagging,
may be used for this work.

Computer Vision Algorithms

Computer vision algorithms include the different methods used to understand the objects in digital images
and extract high-dimensional data from the real world to produce numerical or symbolic information. There
are many other computer vision algorithms involved in recognizing things in photographs. Some common
ones are:

 Object Classification - What is the main category of the object present in this photograph?

 Object Identification - What is the type of object present in this photograph?

 Object Detection - Where is the object in the photograph?

 Object Segmentation - What pixels belong to the object in the image?

 Object Verification - Is the object in the photograph?

 Object Recognition - What are the objects present in this photograph and where are they located?

 Object Landmark Detection - What are the key points for the object in this photograph?


Fig: Computer vision detecting cats in a picture (Source)

Many other advanced computer vision algorithms such as style transfer, colorization, human pose
estimation, action recognition, and more can be learned alongside deep learning algorithms.

Challenges of Computer Vision

Creating a machine with human-level vision is surprisingly challenging, and not only because of the
technical challenges involved in doing so with computers. We still have a lot to learn about the nature of
human vision.

To fully grasp biological vision, one must learn not just how various receptors like the eye work, but also
how the brain processes what it sees. The process has been mapped out, and its tricks and shortcuts have
been discovered, but, as with any study of the brain, there is still a considerable distance to cover.

Computer Vision Benefits

Computer vision can automate several tasks without the need for human intervention. As a result, it provides
organizations with a number of benefits:

 Faster and simpler process - Computer vision systems can carry out repetitive and monotonous tasks at a
faster rate, which simplifies the work for humans.

 Better products and services - Well-trained computer vision systems make very few mistakes, which results in faster delivery of high-quality products and services.

 Cost-reduction - Companies do not have to spend money on fixing their flawed processes because
computer vision will leave no room for faulty products and services.

Computer Vision Disadvantages

There is no technology that is free from flaws, which is true for computer vision systems. Here are a few
limitations of computer vision:

 Lack of specialists - Companies need to have a team of highly trained professionals with deep knowledge
of the differences between AI vs. Machine Learning vs. Deep Learning technologies to train computer
vision systems. There is a need for more specialists that can help shape this future of technology.

 Need for regular monitoring - If a computer vision system faces a technical glitch or breaks down, this
can cause immense loss to companies. Hence, companies need to have a dedicated team on board to
monitor and evaluate these systems.



IMAGE GENERATION:

The Generative Adversarial Network, popularly known as the GAN, is a deep learning, unsupervised machine learning technique proposed in 2014 by Goodfellow et al. The main blocks of this architecture are:

1. Generator: this block tries to generate images that are very similar to those of the original dataset by taking noise as input. It tries to learn the joint probability of the input data (X) and output data (Y), i.e. P(X, Y).

2. Discriminator: this block accepts two kinds of input, one from the main dataset and the other from the images produced by the Generator, and classifies them as Real or Fake.

To keep this generative and adversarial process simple, both of these blocks are built from deep-neural-network-based architectures that can be trained through forward and backward propagation.

To understand the concept in depth, we will implement GAN architectures with tensorflow-keras. We will focus on the generation of MNIST images with simple GANs, with Deep Convolutional GANs, and with Super Resolution GANs, with working examples.

Simple Generative Adversarial Networks (GANs)

With the above architecture of the simple GAN in mind, we first look at the architecture of the Generator model. The Generator consists of four dense layers, with 100-dimensional noise passed as input. The last dense layer of the Generator produces a 784-dimensional (28x28 = 784) vector, which is simply the flattened vector corresponding to an individual MNIST image.


Generator of Simple GAN

For the last Dense layer, we use a tanh activation because we normalize each image to the range [-1, +1]. The generator's output vector is then passed to the next block, the Discriminator network of the GAN.

The Discriminator, whose main task is to maximize the probability of correctly predicting real versus fake data, receives this 784-dimensional generator output vector. This block also comprises four dense layers, as shown below.


Discriminator of Simple GAN

A sigmoid activation is used for the last layer, which gives the probability of the input image being real or fake.

The LeakyReLU activation function is used in both the Generator and the Discriminator, which helps the model converge faster.

Both these Generative and Discriminative blocks are combined together as below;

Simple GAN model
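
The original figure is not reproduced here, so the following is a minimal Keras sketch of the simple GAN described above. The 100-dimensional noise input, the 784-dimensional tanh output, the sigmoid discriminator output and the LeakyReLU activations come from the text; the sizes of the intermediate Dense layers are assumptions made for illustration.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_generator():
    # four Dense layers: 100-d noise in, 784-d (28x28) image out, tanh output
    return models.Sequential([
        layers.Dense(256, input_dim=100), layers.LeakyReLU(0.2),
        layers.Dense(512), layers.LeakyReLU(0.2),
        layers.Dense(1024), layers.LeakyReLU(0.2),
        layers.Dense(784, activation="tanh"),
    ])

def build_discriminator():
    # four Dense layers: 784-d image in, probability of "real" out
    model = models.Sequential([
        layers.Dense(1024, input_dim=784), layers.LeakyReLU(0.2),
        layers.Dense(512), layers.LeakyReLU(0.2),
        layers.Dense(256), layers.LeakyReLU(0.2),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

generator = build_generator()
discriminator = build_discriminator()

# combined model: the discriminator is frozen here so that training the GAN
# updates only the generator's weights
discriminator.trainable = False
noise = layers.Input(shape=(100,))
gan = models.Model(noise, discriminator(generator(noise)))
gan.compile(optimizer="adam", loss="binary_crossentropy")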


To perform the actual training, we initialize the generator, discriminator and gan objects by building each functional block. We generate a 100-dimensional noise input for the generator; since the images are normalized to [-1, +1], the random noise is drawn from a normal distribution as well.

Prediction of Real and Fake images through GAN Discriminator

With the above code, we first generate sample images with the generator by passing random noise to it. These images are then combined with real images to form a batch of real and fake images. This batch is passed to the discriminator, which predicts the probability of each image being real or fake.

Up to this stage of discriminator prediction, we keep the discriminator trainable, as the prediction loss needs to be back-propagated through the network to update the weights of each layer.

Then, by freezing the layers of the discriminator, we back-propagate the GAN loss through the generator to update the weights of its layers.

Freezing Discriminator, we backpropagate loss through Generator
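
The training procedure just described can be sketched as the following loop (an illustration built on the hypothetical generator, discriminator and gan objects from the earlier sketch; x_train is assumed to hold MNIST images flattened to 784 values and scaled to [-1, +1]):

import numpy as np

batch_size = 128
for step in range(10000):
    # 1) train the discriminator on a mixed batch of real and generated images
    noise = np.random.normal(0, 1, (batch_size, 100))
    fake_images = generator.predict(noise, verbose=0)
    real_images = x_train[np.random.randint(0, x_train.shape[0], batch_size)]

    X = np.concatenate([real_images, fake_images])
    y = np.concatenate([np.ones(batch_size), np.zeros(batch_size)])
    discriminator.trainable = True
    discriminator.train_on_batch(X, y)

    # 2) freeze the discriminator and train the generator through the combined
    #    model, asking it to make generated images be labelled as "real"
    discriminator.trainable = False
    noise = np.random.normal(0, 1, (batch_size, 100))
    gan.train_on_batch(noise, np.ones(batch_size))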

The image below shows the progress of the GAN, where, starting from simple noise input, the generator is able to create MNIST-like images similar to the original data.


Deep Convolutional Generative Adversarial Network (DC GANs)

It was interesting to see similar MNIST images generated with a plain deep neural network, yet that model does not use the key deep learning building block for images, namely convolutional layers.

Henceforth, instead of flattening the image into dense layers, we will use convolutional filters to generate the image from the noise input.

The Generator of the DC GAN consists of a Dense layer followed by a Batch Normalization layer. We first take noise as input and project it to F×F×K elements; this output is then reshaped into an F×F×K tensor.

Conv2DTranspose layers are used in DC GANs; their main objective is to upsample the input feature maps. The complete architecture is sketched below:
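
Since the original figure is not reproduced here, the following is a hedged Keras sketch of such a DC GAN generator. The Dense + BatchNorm projection, the reshape and the Conv2DTranspose upsampling come from the text; the concrete sizes (7x7x128, two upsampling steps to 28x28x1) are assumptions chosen to match MNIST.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_dcgan_generator():
    return models.Sequential([
        layers.Dense(7 * 7 * 128, input_dim=100),                     # project the noise
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Reshape((7, 7, 128)),                                  # F x F x K tensor
        layers.Conv2DTranspose(64, 4, strides=2, padding="same"),     # 7x7  -> 14x14
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(1, 4, strides=2, padding="same",       # 14x14 -> 28x28
                               activation="tanh"),
    ])

print(build_dcgan_generator().output_shape)   # (None, 28, 28, 1)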

The Discriminator of the DC GAN, unlike in the previous example, accepts the input as an image instead of a flattened vector. Images generated by the Generator and images sampled from the original data are both passed to the discriminator.


LeakyReLU activation is used along with a small amount of Dropout to avoid overfitting the model.

The remaining flow of the DC GAN is the same as that of the simple GAN: we first let the discriminator update its weights through backpropagation of the training loss. After the discriminator has been updated, we freeze it and fit the generator on fake data; the generator loss is then back-propagated through it to update its weights.

Below is the image that shows the progress of DC-GAN performance on MNIST data over 400 epochs.

Super Resolution Generative Adversarial Networks (SR GANs)

Now we will look into one of the more advanced GAN architectures, called the SR GAN. Its main purpose is the generation of a Super Resolution image by the Generator from a Low Resolution input image. This Super Resolution image is very similar to the original High Resolution image of the original dataset.

SR GANs block diagram

Working of the SR GAN, following the block diagram above:

 The original dataset consists of High Resolution (HR) images, which are downsampled to obtain Low Resolution (LR) images.

 These LR images are then passed to the SR GAN Generator, which generates Super Resolution (SR) images that closely match the HR images.

 Batches of these SR and HR images are then passed to the SR GAN Discriminator, which predicts whether the images are real or fake.

 The final loss of the SR GAN is then back-propagated to the Generator and Discriminator networks to update their weights.

Now that we understand how the architecture works, let us look at the details of each of the core blocks: the Generator and the Discriminator.

SR GAN Generator


As stated in the original research paper, the image above shows the block diagram of the SR GAN Generator. An LR input image is passed through a Convolution layer followed by a Parametric ReLU activation. The output is then passed to a set of 16 residual blocks. The output from the residual blocks is passed through a couple of Convolutional blocks and then to an Up-sampling block, which increases the resolution of the image to the desired level.

SR GAN Discriminator

The Discriminator of the SR GAN, like in every other GAN architecture, does the job of separating fake from real images by accepting the two kinds of images; here we can see a somewhat more complex structure than in the previously seen architectures.

The Discriminator output is then used to compute the loss for the model.

The loss function in the SR GAN is a combination of a Content Loss and an Adversarial Loss. The content loss can be captured pixel-wise using the MSE between HR and SR images; it can be calculated by extracting the feature vector corresponding to each input image by passing it through a VGG19 network.
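
A hedged sketch of such a VGG19-based content loss in Keras (the choice of the block5_conv4 feature layer and the plain mean-squared error are assumptions; inputs are expected to be preprocessed for VGG19):

import tensorflow as tf

vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(vgg.input,
                                   vgg.get_layer("block5_conv4").output)
feature_extractor.trainable = False   # VGG19 is only used as a fixed feature extractor

def content_loss(hr_images, sr_images):
    # pixel-wise MSE between the VGG19 feature maps of the HR and SR images
    hr_features = feature_extractor(hr_images)
    sr_features = feature_extractor(sr_images)
    return tf.reduce_mean(tf.square(hr_features - sr_features))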


vgg19 Content Loss

While running the complete SR GAN model, we initialize the Generator, Discriminator and SR-GAN objects.

The Generator object is compiled with the Adam optimizer and only the content loss (i.e. the VGG19 pixel-wise MSE).

The Discriminator object, on the other hand, is compiled with binary_crossentropy and the Adam optimizer.

Instead of training on a mixed batch, we first pass HR images to the discriminator and then train it on a batch of SR images generated from the LR inputs, keeping discriminator.trainable = True.

To train the generator, we then freeze the discriminator and train the srgan object on LR images with [HR_images, REAL_IMAGE_LABELS] as the desired output.

Running the above network with a batch size of 64 for 400 epochs, we were able to get significant output from the SR GAN architecture.


The GIF at the start of the original article was generated using this SR GAN architecture. From the results below we can clearly see that the model was able to reconstruct the edges of the bridge and the canal.

As the model progresses through epochs, from Epoch 1 to Epoch 200 the bridge structure becomes clearly visible, and the cliff, its greenery and some of the buildings are clearly visible in the SR image (center).

At the end of Epoch 400, the color of the sky has improved considerably, as has the water around the bridge. So, starting from a low-resolution image in which individual pixels are visible to the eye, the model was able to regenerate the image with much finer detail.


Here are some other examples of generated images:


Not perfect, but not bad either.

Conclusion

Though, with simple modeling architecture, we were able to recreate not the best but much better image

output through all three forms of GAN architecture. By providing more image data and giving more time to

learn the features and in-dept detailing of the image, certainly the models will outperform the Original data

available.

IMAGE COMPRESSION:

Compression-decompression description

A compression-decompression task involves compressing data, sending it with low bandwidth usage, and then decompressing it. The objective of the process is to achieve minimal difference between the original and the decompressed images, i.e. to obtain the same image quality after compression-decompression as before the data transfer.

The schema for a compression-decompression task is presented in Figure 2:


Figure 2. Schema for a compression-decompression method. Data is the initial image file. An encoder is a
compression process, data compressed is a file after compression, a decoder is a decompression process,
Data* is a decompressed file.

To compare the performance of different methods we, first, measure compression coefficient and, after that,

we apply SSIM and PSNR metrics to measure similarities between the original image and the decompressed

image (all these metrics are described in the section Metrics below).

As we demonstrate in the Results section, different methods achieve different objectives: some produce high-

quality image results while having small compression efficiency, others reach high compression efficiency

while producing low-quality image results.

Dataset

We selected 10 images to compare and test different methods for a compression task. The dataset represents 5

bottles of Italian wines and 1 bottle of sauce (we chose this type of picture to further use the methods for the

bottle detection task as part of the ‘Bottle detection and classification’ company’s project). Examples of

images are presented in Figure 3:


Figure 3. Dataset for experiments with image compression methods (test data).

JPEG compression method

For the JPEG compression method, we employ the PIL library for python to compress .bmp images to .png

(code for running this is posted in GitHub), and JPEG format (Joint Photographic Experts Group)[10], which

is a standard image format for containing lossy and compressed image data. The format was introduced in the

early ‘90s, and since then, it became the most widely used image compression standard in the world[11]. The

main basis for JPEG’s lossy compression algorithm is the discrete cosine transform: this mathematical

operation converts each frame/field of the video source from the spatial (2D) domain into the frequency

domain. The JPEG standard specifies the codec, which defines how an image is compressed into a stream of

bytes and decompressed back into an image.

JPEG compression code:


from io import BytesIO
from PIL import Image

IMAGE_FILE = '1.bmp'                   # image file name
im1 = Image.open(IMAGE_FILE)

# here, we create an empty in-memory bytes buffer
buffer = BytesIO()
im1.save(buffer, "JPEG", quality=60)   # compressed file written into the buffer

Machine learning models

We tested several machine learning models (code for testing is posted in GitHub) and chose the most optimal

models (which are effortless to run, require minimal GPU, and can be evaluated using the selected metrics).

Model 1 — ‘Factorized Prior Autoencoder’

The model is taken from the paper “Variational image compression with a scale hyperprior”[5]. The

architecture is shown in Figure 4:


Figure 4. The architecture of the proposed network ‘Variational image compression with a scale hyperprior’.

We employed TensorFlow framework[9] to compare the models because all the models can be run within

the same framework, and it is convenient for our task. We used Google Colab to run the models because it

provides free GPU. Below, we show the code for running the framework for Factorized Prior Autoencoder

model (installation instructions in Colab).

First, install tensorflow-compression library:


!pip install tensorflow-compression

Second, clone the project to Colab:


![[ -e /tfc ]] || git clone https://github.com/tensorflow/compression /tfc
%cd /tfc/models
import tfci # Check if tfci.py is available.

Third, run the model.

Compression in TensorFlow for Factorized Prior Autoencoder optimized for MS-SSIM (multiscale

SSIM) is the following:


!python tfci.py compress bmshj2018-factorized-msssim-6 /1.png

 bmshj2018-factorized-msssim — model name;

 number 6 at the end of the name indicates the quality level (1: lowest, 8: highest);


 /1.png — input file name (image).

We experimented with several quality levels, and in the result table, we include the models which give an

approximately similar performance for SSIM metrics (around 0.97), namely, bmshj2018-factorized-msssim-6

in Table 2.

This script runs compression and produces a compressed file with .tfci name in addition to the target input

image (1.png). This file 1.png.tfci — is so-called compressed data from Figure 1.

Decompression in TensorFlow:
!python tfci.py decompress /1.png.tfci

This script produces a file with extension .png in addition to the compressed file name, for example,

1.png.tfci.png. The decompression code is the same for other models described below.

Model 2 — Nonlinear transform coder model with factorized priors

The second model is a nonlinear transform coder model with factorized priors (entropy models) optimized for

MSE, with GDN (generalized divisive normalization) activation functions, and 128 filters per layer[4]. Its

architecture is shown in Figure 5. It was also run on TensorFlow framework[9].

Figure 5. Schema of model architecture for nonlinear transform coder with factorized priors (entropy
models) optimized for MSE, with GDN[12].


GDN is typically applied to linear filter responses z = Hx, where x is a vector of image data, or to linear filter responses inside a composite function such as an ANN (artificial neural network). Its general form is defined as

y_i = z_i / (β_i + ∑_j γ_ij |z_j|^α_ij)^ε_i

where y represents the vector of normalized responses, and the vectors β, ε and matrices α, γ represent parameters of the transformation (all non-negative).

Compression in TensorFlow for nonlinear transform coder model with factorized priors (entropy models)

optimized for MSE, with GDN (generalized divisive normalization) activation functions:
!python tfci.py compress b2018-gdn-128-4 /1.png

The number 1–4 at the end indicates the quality level (1: lowest, 4: highest). We experiment with different

levels of quality and choose the model which produces SSIM quality of approximately 0.97 (b2018-gdn-128–

4 in Table 2).

Model 3 — Hyperprior model with non zero-mean Gaussian conditionals

The third model is a hyperprior model with non zero-mean Gaussian conditionals (without autoregression), optimized for MS-SSIM (multiscale SSIM)[6]. The architecture is shown in Figure 6. It was also run on the TensorFlow framework[9].


Figure 6. Model architecture for hyperprior model with non zero-mean Gaussian conditionals (without
autoregression) [6].

Compression in TensorFlow for hyperprior model with non zero-mean Gaussian conditionals (without

autoregression), optimized for MS-SSIM:


!python tfci.py compress mbt2018-mean-msssim-5 /1.png

The number 1–8 at the end indicates the quality level (1: lowest, 8: highest). We experiment with different

levels of quality and choose the model which produces SSIM quality of approximately 0.97 (mbt2018-mean-

msssim-5 in table 2).

Metrics

The performance of image compression-decompression methods can be evaluated using several metrics [4]:

 Compression efficiency/compression coefficient — the ratio between the compressed and the initial data

(image) size,

 Image quality (Distortion Measurement) — the difference between the original image and the

compressed/decompressed image,

 Computational cost — the number of seconds required for computing the compression and the additional
physical tool, such as GPU units.


Below, we summarize two metrics used for comparison, namely, compression efficiency/compression

coefficient, and image quality.

Compression efficiency/compression coefficient

Formula for this metric is the following:

N_compression = size(compressed data)/ size(uncompressed data).

N_compression is a compression coefficient equal to the size of the compressed data divided by the size of the initial data. Size(compressed data) is the file size in bytes after the model's compression. Size(uncompressed data) equals the image's height × width × channels in bytes (one byte per channel). Our evaluation dataset has 10 images of equal size, with width 576 px, height 768 px and 3 channels, so the size of the initial uncompressed data is 576 × 768 × 3 = 1,327,104 bytes = size(uncompressed data).
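
As a small sketch of this computation (the compressed file name refers to the example file produced earlier and is illustrative):

import os

uncompressed_size = 576 * 768 * 3                  # bytes, one byte per channel
compressed_size = os.path.getsize("1.png.tfci")    # size of the compressed file in bytes

n_compression = compressed_size / uncompressed_size
print(f"N_compression = {n_compression:.3f}")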

Image quality

To compare the quality of compression we chose three metrics. We measure the quality of the compressed

files using the formula:

N_quality = Quality_metric(Data, Data*),

where Quality_metric is either SSIM or PSNR. Below, we show formulas for those metrics.

SSIM

In image comparison, the mean squared error (MSE) is simple to implement, but it is not highly indicative of

the perceived similarity. Structural similarity aims to address this shortcoming by taking texture into
account[7].


Formula for SSIM:

SSIM(x, y) = ((2 μ_x μ_y + c1)(2 σ_xy + c2)) / ((μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2))

where x, y are the images to compare, μ is the average of image x or y respectively, σ² is the variance of x and y respectively (σ_xy being their covariance), and c1 and c2 are two variables that stabilize the division with a weak denominator.

from skimage.metrics import structural_similarity

SSIM = structural_similarity(img1, img2, multichannel=True)

PSNR

Compute the peak signal-to-noise ratio (PSNR) between images[8]:

PSNR = 10 · log10(R² / MSE)

where R is the maximum fluctuation in the input image data type and MSE is the mean squared error between the two images. For example, if the input image has a double-precision floating-point data type, then R is 1. If it has an 8-bit unsigned integer data type, R is 255.

import math
from torch import Tensor
import torch.nn.functional as F

def psnr(x: Tensor, x_hat: Tensor) -> float:
    # assumes images scaled to [0, 1], i.e. R = 1
    return -10 * math.log10(F.mse_loss(x, x_hat).item())

Results

JPEG compression method using classical codecs for image compression via python library PIL gave the

following results (see Table 1). For equal comparison, we intentionally chose the parameters to compress the

images in such a way that SSIM would be approximately 0.97 (that means, images were compressed with a

certain compression coefficient N_compression, which would give SSIM close to 0.97).

Table 1. Results obtained for the JPEG compression method.

In Table 2, we included models for neural network compression-decompression:


Table 2. Results obtained for three different neural networks models: FactorizedPriorAutoencoder:
bmshj2018-factorized-msssim-6[5], nonlinear transform coder with factorized priors: b2018-gdn-128–4[4],
hyperprior model with non zero-mean: Gaussian conditionals mbt2018-mean-msssim-5[6].

Conclusions

We compare the classical JPEG compression method with three different machine learning models for

compression-decompression task with TensorFlow framework. Several metrics are applied to compare the

performance. The results are as follows: with relatively equal SSIM quality (about 0.97), the best

compression was produced by the mbt2018-mean-msssim-5 model (N_compression is approximately 0.13).

The next best compression model is bmshj2018-factorized-msssim-6 (N_compression is approximately 0.23).

After this follows the classical JPEG compression method, with N_compression of around 0.288. The last is the b2018-gdn-128-4 model (N_compression is approximately 0.29). At the same time, the PSNR metrics for all neural network models are approximately the same (about 35), meaning that the MSE-based quality of the images after compression-decompression is almost the same for every model. It is also interesting to mention that the PSNR metric is higher for the JPEG method.
