
Neural Networks unit 3

Neural Network and Deep Learning (Anna University)


Spiking Neural Networks – Convolutional Neural Networks – Deep Learning Neural Networks – Extreme Learning Machine Model – Convolutional Neural Networks: The Convolution Operation – Motivation – Pooling – Variants of the Basic Convolution Function – Structured Outputs – Data Types – Efficient Convolution Algorithms – Neuroscientific Basis – Applications: Computer Vision, Image Generation, Image Compression.

SPIKING NEURAL NETWORKS:


The Spiking Neural Network (SNN) is a third-generation neural network model, built with specialized network topologies that redefine the entire computational process. Spiking makes the network more intelligent and energy-efficient, which is crucial for small, resource-constrained devices.
With a three-layered feedforward specialized network topology, the SNN is one of the most powerful neural networks for processing temporal data in real time. This high computational power and advanced topology make it suitable for robotics and computer vision applications that require real-time data processing.
The SNN facilitates real-time sourcing and processing of data and is a major improvement over earlier neural networks, which primarily rely on firing frequency rather than on temporal information.
SNN spikes are computationally more expressive: the firing activity of a neuron in the SNN architecture is not tied to static inputs but to the notion of time.

CONVOLUTIONAL NEURAL NETWORKS:


Neural networks are a subset of machine learning, and they are at the heart of deep learning
algorithms. They are comprised of node layers, containing an input layer, one or more hidden layers, and an
output layer. Each node connects to another and has an associated weight and threshold. If the output of any
individual node is above the specified threshold value, that node is activated, sending data to the next layer
of the network. Otherwise, no data is passed along to the next layer of the network.
While the discussion above focused primarily on feedforward networks, there are various types of neural nets, which are used for different use cases and data types. For example, recurrent neural networks are commonly used for natural language processing and speech recognition, whereas convolutional neural networks (ConvNets or CNNs) are more often utilized for classification and computer vision tasks. Prior to CNNs,
manual, time-consuming feature extraction methods were used to identify objects in images. However,
convolutional neural networks now provide a more scalable approach to image classification and object
recognition tasks, leveraging principles from linear algebra, specifically matrix multiplication, to identify
patterns within an image. That said, they can be computationally demanding, requiring graphical processing
units (GPUs) to train models.

How do convolutional neural networks work?


Convolutional neural networks are distinguished from other neural networks by their superior
performance with image, speech, or audio signal inputs. They have three main types of layers, which are:
 Convolutional layer
 Pooling layer
 Fully-connected (FC) layer
The convolutional layer is the first layer of a convolutional network. While convolutional layers can be
followed by additional convolutional layers or pooling layers, the fully-connected layer is the final layer.
With each layer, the CNN increases in its complexity, identifying greater portions of the image. Earlier
layers focus on simple features, such as colors and edges. As the image data progresses through the layers of

the CNN, it starts to recognize larger elements or shapes of the object until it finally identifies the intended
object.

Convolutional layer
The convolutional layer is the core building block of a CNN, and it is where the majority of
computation occurs. It requires a few components, which are input data, a filter, and a feature map. Let’s
assume that the input will be a color image, which is made up of a matrix of pixels in 3D. This means that
the input will have three dimensions—a height, width, and depth—which correspond to RGB in an image.
We also have a feature detector, also known as a kernel or a filter, which will move across the receptive
fields of the image, checking if the feature is present. This process is known as a convolution.
The feature detector is a two-dimensional (2-D) array of weights, which represents part of the image. While
they can vary in size, the filter size is typically a 3x3 matrix; this also determines the size of the receptive
field. The filter is then applied to an area of the image, and a dot product is calculated between the input
pixels and the filter. This dot product is then fed into an output array. Afterwards, the filter shifts by a stride,
repeating the process until the kernel has swept across the entire image. The final output from the series of
dot products from the input and the filter is known as a feature map, activation map, or a convolved feature.
Note that the weights in the feature detector remain fixed as it moves across the image, which is also known
as parameter sharing. Some parameters, like the weight values, adjust during training through the process of
backpropagation and gradient descent. However, there are three hyperparameters which affect the volume
size of the output that need to be set before the training of the neural network begins. These include:
1. The number of filters affects the depth of the output. For example, three distinct filters would yield three
different feature maps, creating a depth of three.
2. Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While stride values of two or greater are rare, a larger stride yields a smaller output.
3. Zero-padding is usually used when the filters do not fit the input image. This sets all elements that fall
outside of the input matrix to zero, producing a larger or equally sized output. There are three types of
padding:
 Valid padding: This is also known as no padding. In this case, the last convolution is dropped if
dimensions do not align.
 Same padding: This padding ensures that the output layer has the same size as the input layer
 Full padding: This type of padding increases the size of the output by adding zeros to the border of
the input.
After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the
feature map, introducing nonlinearity to the model.
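The sliding filter, stride, zero-padding and ReLU described above can be sketched in a few lines of NumPy. This is a toy illustration only; the function name conv2d and the random inputs are assumptions made for demonstration, not part of any particular library.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a 2-D kernel over a 2-D image and return the feature map."""
    if padding > 0:
        image = np.pad(image, padding, mode="constant")   # zero-padding around the border
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)     # dot product of patch and filter
    return feature_map

image = np.random.rand(7, 7)                   # toy single-channel input
kernel = np.random.rand(3, 3)                  # 3x3 feature detector
fmap = conv2d(image, kernel, stride=1, padding=1)   # padding 1 keeps the 7x7 size
activated = np.maximum(fmap, 0)                # ReLU applied to the feature map
```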

Additional convolutional layer


As we mentioned earlier, another convolution layer can follow the initial convolution layer. When
this happens, the structure of the CNN can become hierarchical as the later layers can see the pixels within
the receptive fields of prior layers. As an example, let’s assume that we’re trying to determine if an image
contains a bicycle. You can think of the bicycle as a sum of parts. It is comprised of a frame, handlebars,
wheels, pedals, et cetera. Each individual part of the bicycle makes up a lower-level pattern in the neural net,
and the combination of its parts represents a higher-level pattern, creating a feature hierarchy within the
CNN. Ultimately, the convolutional layer converts the image into numerical values, allowing the neural
network to interpret and extract relevant patterns.

Pooling layer
Pooling layers, also known as downsampling layers, conduct dimensionality reduction, reducing the
number of parameters in the input. Similar to the convolutional layer, the pooling operation sweeps a filter
across the entire input, but the difference is that this filter does not have any weights. Instead, the kernel
applies an aggregation function to the values within the receptive field, populating the output array. There
are two main types of pooling:

 Max pooling: As the filter moves across the input, it selects the pixel with the maximum value to
send to the output array. As an aside, this approach tends to be used more often compared to average
pooling.
 Average pooling: As the filter moves across the input, it calculates the average value within the
receptive field to send to the output array.
While a lot of information is lost in the pooling layer, it also brings a number of benefits to the CNN: pooling layers help to reduce complexity, improve efficiency, and limit the risk of overfitting.
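A toy NumPy sketch of both pooling variants is shown below; it assumes a square feature map whose side is a multiple of the pool size, and the helper name pool2d is my own.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Downsample a 2-D feature map with a non-overlapping size x size window."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            window = feature_map[i:i + size, j:j + size]
            out[i // size, j // size] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.arange(16).reshape(4, 4).astype(float)
print(pool2d(fmap, size=2, mode="max"))       # keeps the maximum of each 2x2 window
print(pool2d(fmap, size=2, mode="average"))   # keeps the average of each 2x2 window
```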

Fully-connected layer
The name of the fully-connected layer aptly describes itself. As mentioned earlier, the pixel values of
the input image are not directly connected to the output layer in partially connected layers. However, in the
fully-connected layer, each node in the output layer connects directly to a node in the previous layer.
This layer performs the task of classification based on the features extracted through the previous layers and
their different filters. While convolutional and pooling layers tend to use ReLU functions, FC layers usually
leverage a softmax activation function to classify inputs appropriately, producing a probability from 0 to 1.
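As a small illustration of that final step, a softmax over the fully-connected layer's raw scores can be sketched as follows; the example scores are invented.

```python
import numpy as np

def softmax(logits):
    """Convert raw FC-layer scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])          # hypothetical scores for 3 classes
print(softmax(scores))                      # roughly [0.66, 0.24, 0.10]
```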

Types of convolutional neural networks


Kunihiko Fukushima and Yann LeCun laid the foundation for research on convolutional neural
networks with the neocognitron in 1980 and "Backpropagation Applied to Handwritten Zip Code
Recognition" in 1989, respectively. More famously, Yann LeCun successfully applied backpropagation to
train neural networks to identify and recognize patterns within a series of handwritten zip codes. He
continued this research with his team throughout the 1990s, culminating in "LeNet-5", which applied the
same principles of prior research to document recognition. Since then, a number of variant CNN
architectures have emerged with the introduction of new datasets, such as MNIST and CIFAR-10, and
competitions, like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Some of these other
architectures include:
• AlexNet
• VGGNet
• GoogLeNet
• ResNet
• ZFNet
However, LeNet-5 is known as the classic CNN architecture.

Convolutional neural networks and computer vision


Convolutional neural networks power image recognition and computer vision tasks. Computer vision
is a field of artificial intelligence (AI) that enables computers and systems to derive meaningful information
from digital images, videos and other visual inputs, and based on those inputs, it can take action. This ability
to provide recommendations distinguishes it from image recognition tasks. Some common applications of
computer vision today can be seen in:
• Marketing: Social media platforms provide suggestions on who might be in a photograph that has been posted on a profile, making it easier to tag friends in photo albums.
• Healthcare: Computer vision has been incorporated into radiology technology, enabling doctors to better identify cancerous tumors in healthy anatomy.
• Retail: Visual search has been incorporated into some e-commerce platforms, allowing brands to recommend items that would complement an existing wardrobe.
• Automotive: While the age of driverless cars hasn't quite emerged, the underlying technology has started to make its way into automobiles, improving driver and passenger safety through features like lane line detection.

DEEP LEARNING NEURAL NETWORKS


The History of Deep Learning
Deep learning was conceptualized by Geoffrey Hinton in the 1980s. He is widely considered to be the
founding father of the field of deep learning. Hinton has worked at Google since March 2013 when his
company, DNNresearch Inc., was acquired.
Hinton’s main contribution to the field of deep learning was to compare machine learning techniques to the
human brain.
More specifically, he created the concept of a "neural network", which is a deep learning algorithm
structured similarly to the organization of neurons in the brain. Hinton took this approach because the human
brain is arguably the most powerful computational engine known today.
The structure that Hinton created was called an artificial neural network (or artificial neural net for short).
Here's a brief description of how they function:
• Artificial neural networks are composed of layers of nodes
• Each node is designed to behave similarly to a neuron in the brain
• The first layer of a neural net is called the input layer, followed by hidden layers, then finally the output layer
• Each node in the neural net performs some sort of calculation, which is passed on to other nodes deeper in the neural net
Here is a simplified visualization to demonstrate how this works:

Neural nets represented an immense stride forward in the field of deep learning.
However, it took decades for machine learning (and especially deep learning) to gain prominence.
We’ll explore why in the next section.
Why Deep Learning Did Not Immediately Work
If deep learning was originally conceived decades ago, why is it just beginning to gain momentum today?
It’s because any mature deep learning model requires an abundance of two resources:
 Data
 Computing power

At the time of deep learning’s conceptual birth, researchers did not have access to enough of either data or
computing power to build and train meaningful deep learning models. This has changed over time, which
has led to deep learning’s prominence today.

Understanding Neurons in Deep Learning


Neurons are a critical component of any deep learning model.
In fact, one could argue that you can't fully understand deep learning without having a deep knowledge of how
neurons work.
This section will introduce you to the concept of neurons in deep learning. We’ll talk about the origin of
deep learning neurons, how they were inspired by the biology of the human brain, and why neurons are so
important in deep learning models today.
What is a Neuron in Biology?
Neurons in deep learning were inspired by neurons in the human brain. Here is a diagram of the anatomy of
a brain neuron:

As you can see, neurons have quite an interesting structure. Groups of neurons work together inside the
human brain to perform the functionality that we require in our day-to-day lives.
The question that Geoffrey Hinton asked during his seminal research in neural networks was whether we
could build computer algorithms that behave similarly to neurons in the brain. The hope was that by
mimicking the brain’s structure, we might capture some of its capability.
To do this, researchers studied the way that neurons behaved in the brain. One important observation was
that a neuron by itself is useless. Instead, you require networks of neurons to generate any meaningful
functionality.
This is because neurons function by receiving and sending signals. More specifically, the neuron’s dendrites
receive signals and pass along those signals through the axon.
The dendrites of one neuron are connected to the axon of another neuron. These connections are called
synapses, which is a concept that has been generalized to the field of deep learning.
What is a Neuron in Deep Learning?
Neurons in deep learning models are nodes through which data and computations flow.
Neurons work like this:
 They receive one or more input signals. These input signals can come from either the raw data set or
from neurons positioned at a previous layer of the neural net.
 They perform some calculations.

 They send some output signals to neurons deeper in the neural net through a synapse.
Here is a diagram of the functionality of a neuron in a deep learning neural net:

Let’s walk through this diagram step-by-step.


As you can see, neurons in a deep learning model are capable of having synapses that connect to more than
one neuron in the preceding layer. Each synapse has an associated weight, which impacts the preceding
neuron’s importance in the overall neural network.
Weights are a very important topic in the field of deep learning because adjusting a model’s weights is the
primary way through which deep learning models are trained. You’ll see this in practice later on when we
build our first neural networks from scratch.
Once a neuron receives its inputs from the neurons in the preceding layer of the model, it adds up each
signal multiplied by its corresponding weight and passes the sum on to an activation function, like this:

The activation function calculates the output value for the neuron. This output value is then passed on to the
next layer of the neural network through another synapse.
This serves as a broad overview of deep learning neurons. Do not worry if it was a lot to take in – we’ll learn
much more about neurons in the rest of this tutorial. For now, it’s sufficient for you to have a high-level
understanding of how they are structured in a deep learning model.
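A minimal sketch of such a neuron, a weighted sum over its synapses followed by an activation function, is given below; the sigmoid here is just one possible choice of activation, and the numbers are invented.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Weighted sum of the incoming signals passed through an activation function."""
    z = np.dot(weights, inputs) + bias      # weighted sum over the synapses
    return 1.0 / (1.0 + np.exp(-z))         # sigmoid activation (one possible choice)

x = np.array([0.5, -1.2, 3.0])              # signals from the previous layer
w = np.array([0.4, 0.1, -0.6])              # synapse weights
print(neuron(x, w, bias=0.2))               # output passed to the next layer
```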

Deep Learning Activation Functions


Activation functions are a core concept to understand in deep learning.
They are what allows neurons in a neural network to communicate with each other through their synapses.
In this section, you will learn to understand the importance and functionality of activation functions in deep
learning.

What Are Activation Functions in Deep Learning?


In the last section, we learned that neurons receive input signals from the preceding layer of a neural
network. A weighted sum of these signals is fed into the neuron's activation function, then the activation
function's output is passed onto the next layer of the network.
There are four main types of activation functions that we’ll discuss in this tutorial:
 Threshold functions
 Sigmoid functions
 Rectifier functions, or ReLUs
 Hyperbolic Tangent functions
Let’s work through these activations functions one-by-one.
Threshold Functions
Threshold functions compute a different output signal depending on whether their input lies above or
below a certain threshold. Remember, the input value to an activation function is the weighted sum of the
input values from the preceding layer in the neural network.
Mathematically speaking, here is the formal definition of a deep learning threshold function:

As the image above suggests, the threshold function is sometimes also called a unit step function.
Threshold functions are similar to boolean variables in computer programming. Their computed value is
either 1 (similar to True) or 0 (equivalent to False).
The Sigmoid Function
The sigmoid function is well-known among the data science community because of its use in logistic
regression, one of the core machine learning techniques used to solve classification problems.
The sigmoid function can accept any value, but always computes a value between 0 and 1.
Here is the mathematical definition of the sigmoid function:

One benefit of the sigmoid function over the threshold function is that its curve is smooth. This means it is
possible to calculate derivatives at any point along the curve.
The Rectifier Function
The rectifier function does not have the same smoothness property as the sigmoid function from the last
section. However, it is still very popular in the field of deep learning.
The rectifier function is defined as follows:
 If the input value is less than 0, then the function outputs 0
 If not, the function outputs its input value
Here is this concept explained mathematically:

Rectifier functions are often called Rectified Linear Unit activation functions, or ReLUs for short.
The Hyperbolic Tangent Function
The hyperbolic tangent function is the only activation function included in this tutorial that is based on a
hyperbolic trigonometric function.
Its mathematical definition is below:

The hyperbolic tangent function is similar in appearance to the sigmoid function, but its output values range from -1 to 1 rather than from 0 to 1, so the curve is shifted downwards.
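For reference, the four activation functions discussed above can be sketched in a few lines of NumPy using their standard textbook forms:

```python
import numpy as np

def threshold(x):        # unit step: 1 if the input is non-negative, else 0
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):          # smooth squashing of any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):             # rectifier: 0 for negative inputs, identity otherwise
    return np.maximum(0.0, x)

def tanh(x):             # hyperbolic tangent: squashes inputs into (-1, 1)
    return np.tanh(x)

z = np.array([-2.0, 0.0, 2.0])
for f in (threshold, sigmoid, relu, tanh):
    print(f.__name__, f(z))
```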

How Do Neural Networks Really Work?


So far in this tutorial, we have discussed two of the building blocks for building neural networks:
 Neurons
 Activation functions
However, you’re probably still a bit confused as to how neural networks really work.
This tutorial will put together the pieces we’ve already discussed so that you can understand how neural
networks work in practice.
The Example We’ll Be Using In This Tutorial
This tutorial will work through a real-world example step-by-step so that you can understand how neural
networks make predictions.
More specifically, we will be dealing with property valuations.
You probably already know that there are a ton of factors that influence house prices, including the
economy, interest rates, its number of bedrooms/bathrooms, and its location.
The high dimensionality of this data set makes it an interesting candidate for building and training a neural
network on.
One caveat about this section is the neural network we will be using to make predictions has already been
trained. We’ll explore the process for training a new neural network in the next section of this tutorial.
The Parameters In Our Data Set
Let’s start by discussing the parameters in our data set. More specifically, let’s imagine that the data set
contains the following parameters:
 Square footage
 Bedrooms
 Distance to city center

 House age
These four parameters will form the input layer of the artificial neural network. Note that in reality, there are
likely many more parameters that you could use to train a neural network to predict housing prices. We have
constrained this number to four to keep the example reasonably simple.
The Most Basic Form of a Neural Network
In its most basic form, a neural network only has two layers - the input layer and the output layer. The
output layer is the component of the neural net that actually makes predictions.
For example, if you wanted to make predictions using a simple weighted sum (also called linear regression)
model, your neural network would take the following form:
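A minimal sketch of that input-to-output weighted sum for the four housing parameters might look like this; the weight values are invented purely for illustration and have not been trained.

```python
import numpy as np

# Hypothetical, untrained example weights: one per input parameter, plus a bias term.
weights = np.array([150.0, 10000.0, -2000.0, -500.0])  # sqft, bedrooms, distance, age
bias = 50000.0

def predict_price(square_footage, bedrooms, distance_to_center, house_age):
    """Simple weighted-sum (linear regression style) prediction."""
    features = np.array([square_footage, bedrooms, distance_to_center, house_age])
    return float(np.dot(weights, features) + bias)

print(predict_price(1800, 3, 5.0, 20))   # toy prediction in arbitrary currency units
```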

While this is a bit abstract, the point is that most neural networks can be visualized in this manner:
 An input layer
 Possibly some hidden layers
 An output layer
It is the hidden layer of neurons that causes neural networks to be so powerful for calculating predictions.
Each neuron in a hidden layer performs calculations using some (or all) of the neurons in the previous layer
of the neural network. These values are then used in the next layer of the neural network.
The Purpose of Neurons in the Hidden Layer of a Neural Network
You are probably wondering – what exactly does each neuron in the hidden layer mean? Said differently,
how should machine learning practitioners interpret these values?
Generally speaking, neurons in the hidden layers of a neural net are activated (meaning their activation
function returns 1) for an input value that satisfies certain sub-properties.
For our housing price prediction model, one example might be 5-bedroom houses with small distances to the
city center.
In most other cases, describing the characteristics that would cause a neuron in a hidden layer to activate is
not so easy.
How Neurons Determine Their Input Values
Earlier in this tutorial, I wrote "Each neuron in a hidden layer performs calculations using some (or
all) of the neurons in the previous layer of the neural network."
This illustrates an important point – that each neuron in a neural net does not need to use every neuron in the
preceding layer.

The process through which neurons determine which input values to use from the preceding layer of the
neural net is called training the model. We will learn more about training neural nets in the next section of
this course.
Visualizing A Neural Net’s Prediction Process
When visualizing a neural network, we generally draw lines from the previous layer to the current layer
whenever the preceding neuron has a nonzero weight in the weighted sum formula for the current neuron.
The following image will help visualize this:

As you can see, not every neuron-neuron pair has a synapse. x4 only feeds three out of the five neurons in the
hidden layer, as an example. This illustrates an important point when building neural networks: not
every neuron in a preceding layer must be used in the next layer of a neural network.

How Neural Networks Are Trained


So far you have learned the following about neural networks:
 That they are composed of neurons
 That each neuron uses an activation function applied to the weighted sum of the outputs from the
preceding layer of the neural network
 A broad, no-code overview of how neural networks make predictions
We have not yet covered a very important part of the neural network engineering process: how neural
networks are trained.
Now you will learn how neural networks are trained. We’ll discuss data sets, algorithms, and broad
principles used in training modern neural networks that solve real-world problems.
Hard-Coding vs. Soft-Coding
There are two main ways that you can develop computer applications. Before digging in to how neural
networks are trained, it’s important to make sure that you have an understanding of the difference between
hard-coding and soft-coding computer programs.
Hard-coding means that you explicitly specify input variables and your desired output variables. Said
differently, hard-coding leaves no room for the computer to interpret the problem that you’re trying to solve.
Soft-coding is the complete opposite. It leaves room for the program to understand what is happening in the
data set. Soft-coding allows the computer to develop its own problem-solving approaches.
A specific example is helpful here. Here are two instances of how you might identify cats within a data set
using soft-coding and hard-coding techniques.

• Hard-coding: you use specific parameters to predict whether an animal is a cat. More specifically, you might say that if an animal's weight and length lie within certain ranges, then the animal is a cat.
• Soft-coding: you provide a data set that contains animals labelled with their species type and characteristics about those animals. Then you build a computer program to predict whether an animal is a cat or not based on the characteristics in the data set.
As you might imagine, training neural networks falls into the category of soft-coding. Keep this in mind as
you proceed through this course.
Training A Neural Network Using A Cost Function
Neural networks are trained using a cost function, which is an equation used to measure the error contained
in a network’s prediction.
The formula for a deep learning cost function (of which there are many – this is just one example) is below:

Note: this cost function is called the mean squared error, which is why there is an MSE on the left side of the
equal sign.
While there is plenty of formula mathematics in this equation, it is best summarized as follows:
Take the difference between the predicted output value of an observation and the actual output value of that
observation. Square that difference and divide it by 2.
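Written out for a single observation, following the verbal description above (with \hat{y} the predicted output value and y the actual output value), this cost is

C = \frac{1}{2}(\hat{y} - y)^2

and averaging over n observations gives the usual batch form \text{MSE} = \frac{1}{2n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2.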
To reiterate, note that this is simply one example of a cost function that could be used in machine learning
(although it is admittedly the most popular choice). The choice of which cost function to use is a complex
and interesting topic on its own, and outside the scope of this tutorial.
As mentioned, the goal of an artificial neural network is to minimize the value of the cost function. The cost
function is minimized when your algorithm’s predicted value is as close to the actual value as possible. Said
differently, the goal of a neural network is to minimize the error it makes in its predictions!
Modifying A Neural Network
After an initial neural network is created and its cost function is computed, changes are made to the neural
network to see if they reduce the value of the cost function.
More specifically, the actual components of the neural network that are modified are the weights on each
neuron's synapses that communicate with the next layer of the network.
The mechanism through which the weights are modified to move the neural network to weights with less
error is called gradient descent. For now, it’s enough for you to understand that the process of training neural
networks looks like this:
 Initial weights for the input values of each neuron are assigned
 Predictions are calculated using these initial values
 The predictions are fed into a cost function to measure the error of the neural network
 A gradient descent algorithm changes the weights for each neuron’s input values
 This process is continued until the weights stop changing (or until the amount of their change at each
iteration falls below a specified threshold)
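As a rough sketch of the loop just described, here is gradient descent on a one-weight linear model; the toy data, learning rate, and stopping threshold are all invented for illustration.

```python
import numpy as np

# Toy data: y is roughly 3 * x, so training should drive the weight toward 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.9, 9.2, 11.8])

w = 0.0                 # initial weight
learning_rate = 0.01
for step in range(10_000):
    y_hat = w * x                               # predictions with the current weight
    grad = np.mean((y_hat - y) * x)             # gradient of the (1/2)*MSE cost w.r.t. w
    new_w = w - learning_rate * grad            # gradient descent update
    if abs(new_w - w) < 1e-8:                   # stop when the weight stops changing
        break
    w = new_w

print(f"learned weight: {w:.3f}")               # close to 3
```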

This may seem very abstract - and that’s OK! These concepts are usually only fully understood when you
begin training your first machine learning models.

Extreme Learning Machine Model

What is ELM?
ELMs (Extreme Learning Machines) are feedforward neural networks, "invented" in 2006 by G. Huang.

As said in the original paper:

Hence the phrase "Extreme" in ELM (but the real reason for the name may vary depending on the source).

Why ELM is different from standard Neural Network


ELM doesn’t require gradient-based backpropagation to work. It uses Moore-Penrose generalized inverse to
set its weights.

First, we look at standard SLFN (Single hidden Layer Feedforward Neural network):

Single hidden Layer Feedforward Neural network, Source: Shifei Ding under CC BY 3.0

It’s pretty straightforward:

1. multiply inputs by weights

2. add bias

3. apply the activation function

4. repeat steps 1–3 number of layers times

5. calculate output

6. backpropagate

7. repeat everything

ELM removes step 4 (because it’s always SLFN), replaces step 6 with matrix inverse, and does it only once,
so step 7 goes away as well.

More details
Before going into details we need to look at how ELM output is calculated:
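The equation itself is not reproduced in the scan; the standard ELM formulation (a reconstruction, so treat the exact notation as an assumption) is

\sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) = t_j, \qquad j = 1, \dots, N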

Where:

• L is the number of hidden units

• N is the number of training samples

• β_i is the weight vector between the i-th hidden unit and the output

• w_i is the weight vector between the input and the i-th hidden unit

• g is an activation function

• b is a bias vector

• x is an input vector

It is quite similar to what's going on in a standard NN with backpropagation, but if you look closely you can
see that we are naming the weights between the hidden layer and the output Beta (β). This Beta matrix is special
because it is the one we are going to compute with a pseudo-inverse rather than learn by gradient descent. We can shorten the equation and write it as:
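The shortened form is the standard ELM linear system (again a reconstruction rather than a copy of the original figure):

H\beta = T, \quad \text{with } H \in \mathbb{R}^{N \times L},\; \beta \in \mathbb{R}^{L \times m},\; T \in \mathbb{R}^{N \times m}, \quad H_{j,i} = g(w_i \cdot x_j + b_i)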

where:

• m is the number of outputs

• H is called the Hidden Layer Output Matrix

• T is the training data target matrix

The theory behind the learning (you can skip this section if you want)
Now we have to dig deeper into the theory behind the network to decide what to do next.

A function is infinitely differentiable if it’s a smooth function

I'm not going to prove those theorems, but if you're interested please refer to Page 3 of ELM-NC-2006 for further
explanation.

Now what we have to do is define our cost function. Basing our assumptions on "Capabilities of a four-
layered feedforward neural network: four layers versus three", we can see that an SLFN is a linear system if the
input weights and the hidden layer biases can be chosen randomly.

Because our ELM is a linear system, we can create an optimization objective:

To approximate the solution we need to use Rao's and Mitra's work again:

Now, using the Moore-Penrose generalized inverse of H, we can calculate Beta hat as:

Learning algorithm
After going through some difficult math, we can now define the learning algorithm. The algorithm itself is
relatively easy:
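A minimal NumPy sketch of this algorithm, assuming a single hidden layer with a sigmoid activation, is given below. The function names and hyperparameters are my own illustration, not taken from the repository linked further down.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden=128, seed=0):
    """Random input weights + Moore-Penrose pseudo-inverse for the output weights."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W = rng.normal(size=(n_features, n_hidden))   # step 1: random input weights
    b = rng.normal(size=n_hidden)                 # random hidden biases
    H = sigmoid(X @ W + b)                        # step 2: hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                  # step 3: beta_hat = pinv(H) @ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Toy usage with random data (a stand-in for MNIST-style inputs and one-hot targets).
X = np.random.rand(200, 64)
T = np.eye(10)[np.random.randint(0, 10, 200)]
W, b, beta = elm_train(X, T, n_hidden=128)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
```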

If you're interested in seeing a Python implementation, please check this repository:

https://github.com/burnpiro/elm-pure

And here is a preview of how the model works on the MNIST dataset:

https://github.com/burnpiro/elm-pure/blob/master/ELM%20example.ipynb

As you can see, a simple version of ELM achieves >91% accuracy on the MNIST dataset, and it takes
around 3 s to train the network on an Intel i7-7820X CPU.

Performance comparison
I'm going to use metrics from the original paper in this section, and it might surprise you how long some of the
training takes compared with the previous MNIST example; but remember that the original paper was published
in 2006 and the networks were trained on a Pentium 4 1.9 GHz CPU.

Datasets

Results

We can ignore training time for now because it's obvious that gradient descent takes longer than a matrix
inversion. The most important information in this result table is Accuracy and Nodes. In the first two
datasets, you can see that the author used different sizes of BP networks to achieve the same results as ELM.
The BP network was 5x smaller in the first case and 2x smaller in the second case. That affects testing times
(it's faster to run a 100-node NN than a 500-node NN), and it tells us how accurate our method is at
approximating the dataset.

It is hard to find any tests of ELM networks on popular datasets, but I've managed to do so. Here is a
benchmark on CIFAR-10 and MNIST:

Where:

 DELM is a deep ELM

 ReNet is described in this paper

 RNN is a Recurrent Neural Network

 EfficientNet is described in this paper

I didn't find training times for ELMs, so there was no way to compare them with the results from other networks,
but all those multipliers (20x, 30x) are relative differences in training time based on the training of ELM
1000 on CIFAR-10. If there is a 30x time increase between ELM 1000 and ELM 3500, then you can
imagine how long it would take to train a DELM which has 15000 neurons.

THE CONVOLUTION OPERATION:

The key points of the feedforward neural network:

• The Universal Approximation Theorem (UAT) says that Deep Neural Networks (DNNs) are powerful function approximators.

• DNNs can be trained using backpropagation.

• However, fully connected DNNs (fully connected means that any neuron in a layer is connected to all the neurons in the previous layer) are prone to overfitting, because the network is very deep and the number of parameters is very large.

• The second problem with fully connected networks is that some gradients might vanish due to long chains. Since the network is very deep, the gradients in the first few layers might vanish while flowing back, and therefore those weights are barely trained.

• So, the objective is to have a network that is still complex (has non-linearities), since in most real-world problems the output is going to be a complex function of the input, but that has fewer parameters and is therefore less prone to overfitting. CNNs belong to the family of networks that serves this objective.

Convolutional Operation

The convolutional operation means that for a given input we re-estimate it as the weighted average of all the inputs around it. We have some weights assigned to the neighboring values, and we take the weighted sum of the neighbor values to estimate the value of the current input/pixel.

For a 2D input, the classic example is an image, where we re-calculate the value of every pixel by taking the weighted sum of the pixels (neighbors) around it. For example, let's say the input image is as given below:

Input Image

Now, in this input image, we calculate the value of each and every pixel by considering the weighted sum of the pixels around it.

Here we are calculating the value of the circled pixel by considering 3 neighbors around it; assume that the weights w1, w2, w3, w4 are associated with these 4 pixels (the pixel itself plus its 3 neighbors) respectively.

Now, this matrix of weights is referred to as the Kernel or Filter. In the above case, we have a kernel of size 2X2.

We compute the output (the re-estimated value of the current pixel) using the following formula:

Here m refers to the number of rows (which is 2 in this case) and n refers to the number of columns (which is 2 in this case).
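The formula itself is missing from the scan; reconstructed from the surrounding description (m kernel rows, n kernel columns, kernel anchored at the current pixel), it reads approximately

S(i, j) = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I(i+a,\, j+b)\, K(a, b)

where I is the input image, K is the kernel, and S(i, j) is the re-estimated value of the pixel at position (i, j).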

Now we place the 2X2 filter over the first 2X2 portion of the image and take the weighted sum and that

would give the new value of the first pixel.

We map the 2X2 kernel/filter over the 2X2 portion of the input.

The output of this operation would be: (aw + bx + ey + fz)

Then we move the filter horizontally by one and place it over the next 2X2 portion of the input; in this case the pixels of interest would be b, c, f, g, and computing the output with the same technique we would get: (bw + cx + fy + gz)

And then again we move the kernel/filter by 1 in the horizontal direction and take the weighted sum.

So, after this, the output from the first layer would look like:

Then we move the kernel down by 1 in the vertical direction, calculate the output, and again move the kernel in the horizontal direction. In general, we move the kernel like this: first, we start at the top-left portion of the image and move the filter in the horizontal direction until we cover the row completely; then we move the filter down in the vertical direction (relative to the top-left portion of the image), again stride it horizontally through the entire row, and continue like this. In essence, we move the kernel left to right, top to bottom.

Instead of considering pixels only in the forward direction, we can also consider the previous neighbors.

And to consider the previous neighbors, the formula for computing the output would be:

We take the limits from -m/2 to m/2, i.e. we take half of the rows from the previous neighbors and the other half from the forward direction (forward neighbors), and the same is the case in the other direction (-n/2 to n/2).

Typically, we take an odd-dimensional kernel.
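The centered form of the formula is also missing from the scan; a reconstruction consistent with the description above is

S(i, j) = \sum_{a=-\lfloor m/2 \rfloor}^{\lfloor m/2 \rfloor} \; \sum_{b=-\lfloor n/2 \rfloor}^{\lfloor n/2 \rfloor} I(i+a,\, j+b)\, K(a, b)

so that half of the neighborhood comes from the previous neighbors and half from the forward neighbors.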

Convolutional Operation in practice

Let the input image be as given below:

and we use a kernel/filter of size 3X3; for each pixel, we take the 3X3 neighborhood around it (the pixel itself is a part of this 3X3 neighborhood and sits at the center), just like in the below image:

Input Image, we consider 3X3 portions of this image as the kernel is of size 3X3

Let's say this input is a 30X30 image. We go over every pixel systematically, place the filter such that the pixel is at the center of the kernel, and re-estimate the value of that pixel as the weighted sum of the pixels around it.

So, in this way, we get back the re-estimated value of all the pixels.

We all have seen the convolutional operation in practice. Let’s say the kernel that we are using is as below:

Kernel
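The kernel figure is not reproduced here; based on the description below (every weight equal to 1/9), it is the 3x3 averaging (box-blur) filter, which could be passed directly to a convolution routine such as the conv2d sketch shown earlier:

```python
import numpy as np

blur_kernel = np.full((3, 3), 1.0 / 9.0)   # every weight is 1/9: a 3x3 averaging (box-blur) filter
```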

So, we move this kernel all over the image and re-compute every pixel as the weighted sum of its neighborhood. In this case, since all the weights are 1/9, the re-estimated value of each pixel is simply the average of the 9 pixels over which the kernel is placed.

Averaging over the neighborhood in this way smooths out local differences, which dilutes the values and blurs the image, and the output we get by applying this convolutional operation is:

So, the blur operation that we all might have used in a photo editing application actually applies the convolution operation behind the scenes.

Now, in the below-mentioned scenario, we are using 5 as the weight for the central pixel, 0 for the corner pixels and -1 for the remaining (edge-adjacent) pixels, so the net effect is that the value/color intensity of the central pixel is boosted while its neighborhood information is subtracted; the result is that it sharpens the image.
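A kernel consistent with this description (an assumed reconstruction of the missing figure: center weight 5, corner weights 0, remaining weights -1) is the common 3x3 sharpening filter:

```python
import numpy as np

sharpen_kernel = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]])   # boosts the center pixel, subtracts its 4 neighbors
```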

The output of the above convolution is:

Let's take one more example: in the below case, the value for the central pixel is -8 and for all other pixels it is 1. So if we have the same color throughout the 3X3 portion of the image (just like for the marked pixel in the below image), and the pixel intensity of this region is denoted by 'x', then we get -8x from the central pixel and +8x from the weighted sum of all the other pixels, and the summation of these results in 0.

So, wherever we have the same color in the 3X3 portion (some sample regions are marked in the below image), or in other words the neighbors are exactly the same as the current pixel, we get an output intensity of 0.
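The kernel being described (center weight -8, all other weights 1) is a Laplacian-style edge detector; its weights sum to zero, so any uniform 3x3 region produces an output of 0:

```python
import numpy as np

edge_kernel = np.array([[ 1,  1,  1],
                        [ 1, -8,  1],
                        [ 1,  1,  1]])   # weights sum to 0, so flat regions give zero output
```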

So, in effect, what will happen is that wherever there is a boundary (yellow highlighted in the below image), the neighboring pixels cannot all be the same as the current pixel; only in such regions do we get a non-zero value, and everywhere else we get a zero value. So, in effect, we end up detecting all the edges in the input image.

2D Convolution with 3D filter:

Below is a complete picture of how the 2D convolutional operation is performed over the input: we start at the top-left corner, apply the kernel over that area, and move the kernel horizontally towards the right; once we have reached the end (completed the entire row) on the right side, we move the kernel downwards by some steps and again start from the left side and move towards the right:

We slide the kernel horizontally

Once we complete the entire row, we slide the kernel vertically in the downward direction and start again from the left side

We move from left to right and from top to bottom.

In the case of a 3D input (an image is also a 3D input, as it has 3 channels corresponding to Red, Green and Blue; all these channels are superimposed on each other and that's how we get the final image. In other words, every pixel in the image has 3 values associated with it, so we can look at that as the depth), we have 3 channels (depth), one corresponding to each of R, G and B in the image. We use a filter of the same depth as the input, place the filter over the input, and compute the weighted sum across all 3 dimensions.

In most cases when we use convolution for 3D inputs, we use a 3D convolution filter (as depicted in the below image). That means if we place the filter at a given location in the image, we take a weighted average of its 3D neighborhood, but we are not going to slide it along the depth. The kernel has the same depth as the original input, and that's why there is no scope to move it through the depth of the input. For example, if the input image depth is 3 and the kernel depth is also 3, there is no room to move it along the depth; there is no movement available there.

In this case also, we move the filter horizontally and vertically as in the 2D case. We don't move the filter along the depth, as the input image depth is the same as the filter depth and there is no scope to move across the depth.

So, what we do in practice is take this 3D kernel and start moving it: we move it along the horizontal direction first, and we keep doing this through the entire image (moving from left to right and top to bottom) until we reach the last position. At the end of this, although our input was 3-dimensional, we get back a 2D output.

Points to consider:

 Input is 3D

 The filter is also 3D

 The convolutional operation that we perform is 2D as we are sliding the filter horizontally and vertically

and not along the depth

 This is because the depth of the filter is the same as the depth of the input

In practice, we apply multiple kernels/filters to the same input and get different representations/outputs from the same input depending on the kernel used. For example, one filter might detect the vertical edges in the input, a second might detect the horizontal edges in the image, another filter might blur the image, and so on.

In the above image, we are using 3 different filters and we are getting 3 outputs, one corresponding to each filter. We can combine these different output representations into one single volume (each output representation has a width and a height, and after combining all of the representations we get the depth as well). So, if we apply 3 filters to the input, we get an output of depth 3; if we apply 100 filters to the input, we get an output of depth 100.

Points to consider:

 Each filter applied to a 3D input would give a 2D output.

 Combining the output of multiple such filters would result in a 3D output.

Terminology

Let’s define some terminology and find out the relation between the input dimensions and the output

dimensions:

The spatial extent of a filter (F), i.e. the extent of the neighborhood we are looking at, is the dimension of the filter: it would be 'F X F'. Usually we have an odd-dimensional filter, and the depth of the filter is the same as the depth of the input (Di in this case).

Now we want to relate the output dimensions to the input dimensions.

Let's take a 2D input of dimension '7 X 7' and slide a filter of size '3 X 3' over it.

As we slide the filter over it (from left to right and top to bottom), we keep computing the output values, and it's very clear that the output is smaller than the input.

This is how we slide the filter over the image:

The reason why this happens is obvious: we can't place the kernel at the corners, as it would cross the boundary.

We can't place the filter at the crossed pixel (below image), because if we placed it there then the yellow highlighted portion would be undefined:

In practice, we would stop at the crossed pixel (as in the below image), where the filter still lies completely inside the image:

And this is why we get a smaller output: we are not able to apply the filter in any part of the shaded region in the below image:

Hence, we are not computing a re-estimated value for every pixel in the input, and therefore the number of pixels in the output is less than the number of pixels in the input.

This was the case for a '3 X 3' kernel; now let's see what happens when we have a '5 X 5' kernel:

Now we cannot place the kernel at the crossed pixel in the above image. We cannot place the kernel at the yellow highlighted pixel either. So, in this case, we cannot place the kernel at any of the shaded regions in the below image:

The bigger the kernel used, the smaller the output.

So, the output dimension in terms of the input is:
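For a W_in x W_in input and an F x F kernel with stride 1 and no padding, the standard relation is

W_out = W_in - F + 1

so a 7 x 7 input with a 3 x 3 kernel gives a 5 x 5 output.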

What if we want the output to be of the same size as the input?

If we want the output to be the same size as the input, then we need to pad the input appropriately:

Here we pad the input with 0 all around the input image and apply the 3X3 filter over the padded input, and we get an output of the same dimension as the input.

If we place the kernel at the crossed pixel in the below image, we now have 5 artificial pixels with a value of 0, and we are able to re-estimate the value of this crossed pixel.

Now the output would again be '7 X 7', as we have introduced this artificial boundary around the original input and this boundary contains only zeros.

If we have a '5 X 5' filter, it would still go outside the image even after this artificial padding of 1.

So, in this case, we need to increase the padding. Earlier we added a padding of 1 (meaning 1 row at the top, 1 at the bottom, 1 column at the left and 1 at the right). And it's obvious from the above image that if we want to use a '5 X 5' filter, then we should use a padding of 2.

The bigger the kernel size, the larger the padding required, and the updated formula for the relation between the input and output dimensions is:
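With a padding of P added on every side, the relation becomes

W_out = W_in - F + 2P + 1

e.g. 7 - 3 + 2(1) + 1 = 7, so a padding of 1 with a 3 x 3 kernel preserves the input size.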

Stride (S): Stride defines the interval at which the filter is applied. Until now we have discussed all the cases considering the stride to be 1, i.e. we move the filter by 1 in the horizontal and vertical direction, as depicted in the below image:

In some cases we may not want this; say we don't want a full replica of the image and just need a summary of it. In that case, we may choose to apply the filter only at alternate locations in the input.

Here we use S = 2, i.e. we move the filter by 2 in the horizontal as well as the vertical direction.

This interval between two successive positions where we apply the kernel is termed the Stride. And in the above case, the output would be roughly half the input, as we are skipping every alternate location in the image.

Now, if we are using a stride 'S', then the formula to compute the width and height is given by:
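The standard formula, for input width W_in, filter size F, padding P and stride S, is

W_out = floor((W_in - F + 2P) / S) + 1

and the same relation holds for the height.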

The higher the stride, the smaller the size of the output.

The depth of the output is going to be the same as the number of filters that we have.

Each 3D filter applied over the 3D input gives one 2D output; if we use K such filters, we get K such 2D outputs, and if we stack up all these K outputs we get an output of depth K. So, the depth of the output is the same as the number of filters used.

MOTIVATION:

• A digital image is a 2D grid of pixels. Since a neural network expects a vector as input, one idea for dealing with images would be to flatten the image and feed the output of the flattening operation to the neural network, and this would work to some extent.

But eventually, that flattened vector won't be the same for a translated image.

The neural network would have to learn very different parameters in order to classify the same objects, which is a difficult job since natural images are highly variable (lighting, translation, viewing angle, ...).

It is also worth mentioning that the input vector would be relatively big, 64*64*3 for an RGB image, which can cause memory problems when using a plain neural network, since with just 10 neurons in the first layer alone we would have 64*64*3*10 = 122,880 weights to train.

Natural images have 2 main characteristics:

• Locality: nearby pixels are more strongly correlated.

• Translation invariance: meaningful patterns can occur anywhere in the image.

How do Convolutional Neural Networks solve the problem for images?

The answer to this question lies in 3 characteristics of the CNN:

• Sparse connectivity: when processing an image, the input might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. These improvements in efficiency are usually quite large. If there are m inputs and n outputs, then matrix multiplication requires m × n parameters and the algorithms used in practice have O(m × n) runtime (per example). If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n parameters and O(k × n) runtime.

• Parameter sharing: in a convolutional neural net, each member of the kernel is used at every position of the input (except perhaps some of the boundary pixels, depending on the design decisions regarding the boundary). The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set. This does not affect the runtime of forward propagation (it is still O(k × n)), but it does further reduce the storage requirements of the model to k parameters.

• Equivariance: in the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation. We use the same network parameters to detect local patterns at many locations in the image.

The Convolution Operation

In a practical setting, the convolution operation is implemented by making the kernel slide across the image and produce an output value at each position.

We also convolve different kernels with the same input and as a result obtain different feature maps or channels.

Variants of the Convolution Operation:

• Valid convolution: doesn't use any padding.

• Same convolution: pads in a way that the output size is the same as the input size.

• Full convolution: we compute an output wherever the kernel and the input overlap by at least 1 pixel.

• Strided convolution: the kernel slides along the image with a step > 1.

• Dilated convolution: the kernel is spread out, with a step > 1 between kernel elements.

• Depthwise convolution: each output channel is connected only to one input channel.

POOLING:

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby

outputs. For example, the max pooling operation reports the maximum output within a rectangular

neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the L2

norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel. In all cases, pooling helps make the representation approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.

VARIANTS OF THE BASIC CONVOLUTION FUNCTION:


Convolution in the context of NN means an operation that consists of many applications of convolution in
parallel.

 Kernel K with element K_{i,j,k,l} giving the connection strength between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit.
 Input: V_{i,j,k} with channel i, row j and column k
 Output Z with the same format as V
 Use 1 as the first entry (1-based indexing)

Full Convolution
Zero padding, stride 1:
Z_{i,j,k} = ∑_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n}
Zero padding, stride s:
Z_{i,j,k} = c(K, V, s)_{i,j,k} = ∑_{l,m,n} [ V_{l, s·(j−1)+m, s·(k−1)+n} K_{i,l,m,n} ]

Convolution with a stride greater than 1 pixel is equivalent to convolution with stride 1 followed by downsampling.
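
This equivalence is easy to check numerically; the following NumPy sketch (an illustration, not part of the original notes) compares a stride-2 convolution with a stride-1 convolution followed by keeping every second output:

import numpy as np

def conv2d_valid(image, kernel, stride=1):
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros(((H - kH) // stride + 1, (W - kW) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + kH, c:c + kW] * kernel)
    return out

img = np.random.rand(8, 8)
ker = np.random.rand(3, 3)

strided = conv2d_valid(img, ker, stride=2)
downsampled = conv2d_valid(img, ker, stride=1)[::2, ::2]   # stride 1, then downsample
print(np.allclose(strided, downsampled))                   # True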


Some Zero Padding and 1 Stride

Without zero padding, the width of the representation shrinks by one pixel less than the kernel width at each layer. We are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels. Zero padding allows us to control the kernel width and the size of the output independently.


Special cases of zero padding:

 Valid: no zero padding is used. Limited number of layers.
 Same: keep the size of the output equal to the size of the input. Unlimited number of layers, but pixels near the border influence fewer output pixels than pixels near the center.
 Full: enough zeros are added for every pixel to be visited k (kernel width) times in each direction, resulting in an output of width m + k − 1. It is difficult to learn a single kernel that performs well at all positions in the convolutional feature map.

Usually the optimal amount of zero padding lies somewhere between 'Valid' and 'Same'.

Unshared Convolution
In some cases we do not want to use convolution but rather a locally connected layer; we then use unshared convolution. The indices into the weight tensor W are:

 i: the output channel


 j: the output row;


 k: the output column


 l: the input channel
 m: row offset within input
 n: column offset within input

Z_{i,j,k} = ∑_{l,m,n} [ V_{l, j+m−1, k+n−1} W_{i,j,k,l,m,n} ]
Comparison on local connections, convolution and full connection

This is useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space, e.g. looking for a mouth only in the bottom half of an image.

It can also be useful to make versions of convolution or locally connected layers in which the connectivity is further restricted, e.g. constraining each output channel i to be a function of only a subset of the input channels.

Advantages: reduced memory consumption, increased statistical efficiency, and reduced computation for both forward and backward propagation.


Tiled Convolution
Learn a set of kernels that we rotate through as we move through space. Immediately neighboring locations will have different filters, but the memory requirement for storing the parameters increases only by a factor of the size of this set of kernels. Comparison of locally connected layers, tiled convolution and standard convolution:


K: a 6-D tensor holding t different choices of kernel stack

Z_{i,j,k} = ∑_{l,m,n} [ V_{l, j+m−1, k+n−1} K_{i,l,m,n, j%t+1, k%t+1} ], where % is the modulo operation.

Locally connected layers and tiled convolutional layers with max pooling: the detector units of these layers are driven by different filters. If the filters learn to detect different transformed versions of the same underlying feature, then the max-pooled units become invariant to the learned transformation.


Backpropagation in the Convolution Layer

The quantities involved in backpropagation through a convolutional layer are:

 K: kernel stack
 V: input image
 Z: output of the conv layer
 G: gradient of the loss with respect to Z

STRUCTURED OUTPUTS:

A deep neural network model is a powerful framework for learning representations. Usually, it is used to
learn the relation x→y by exploiting the regularities in the input x. In structured output prediction
problems, y is multi-dimensional and structural relations often exist between the dimensions. The motivation
of this work is to learn the output dependencies that may lie in the output data in order to improve the
prediction accuracy. Unfortunately, feedforward networks are unable to exploit the relations between the
outputs. In order to overcome this issue, we propose in this paper a regularization scheme for training neural
networks for these particular tasks using a multi-task framework. Our scheme aims at incorporating the
learning of the output representation y in the training process in an unsupervised fashion while learning the
supervised mapping function x→y.

TYPES OF DATA:

Deep learning can be applied to any data type. The data types you work with, and the data you gather, will
depend on the problem you’re trying to solve.

1. Sound (Voice Recognition)


2. Text (Classifying Reviews)
3. Images (Computer Vision)
4. Time Series (Sensor Data, Web Activity)
5. Video (Motion Detection)

Use Cases

Deep learning can solve almost any problem of machine perception, including classifying data, clustering it, or making predictions about it.

 Classification: This image represents a horse; this email looks like spam; this transaction is
fraudulent
 Clustering: These two sounds are similar. This document is probably what user X is looking for
 Predictions: Given their web log activity, Customer A looks like they are going to stop using your
service


Deep learning is best applied to unstructured data like images, video, sound or text. An image is just a blob
of pixels, a message is just a blob of text. This data is not organized in a typical, relational database by rows
and columns. That makes it more difficult to specify its features manually.

Common use cases for deep learning include sentiment analysis, classifying images, predictive analytics,
recommendation systems, anomaly detection and more.


Data Attributes

For deep learning to succeed, your data needs to have certain characteristics.

Relevancy

The data you use to train your neural net must be directly relevant to your problem; that is, it must resemble
as much as possible the real-world data you hope to process. Neural networks are born as blank slates, and
they only learn what you teach them. If you want them to solve a problem involving certain kinds of data,
like CCTV video, then you have to train them on CCTV video, or something similar to it. The training data
should resemble the real-world data that they will classify in production.

Proper Classification

If a client wants to build a deep-learning solution that classifies data, then they need to have a labeled
dataset. That is, someone needs to apply labels to the raw data: “This image is a flower, that image is a
panda.” With time and tuning, this training dataset can teach a neural network to classify new images it has
not seen before.

Formatting

Neural networks eat vectors of data and spit out decisions about those vectors. All data needs to be
vectorized, and the vectors should be the same length when they enter the neural net. To get vectors of the
same length, it’s helpful to have, say, images of the same size (the same height and width). So sometimes
you need to resize the images. This is called data pre-processing.
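
As a small illustration of this pre-processing step (the file names and the target size are arbitrary assumptions), images of different sizes can be resized and flattened into equal-length vectors with PIL and NumPy:

import numpy as np
from PIL import Image

def image_to_vector(path, size=(64, 64)):
    # resize to a fixed size and flatten into a 1-D float vector in [0, 1]
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32).flatten() / 255.0

# vectors = [image_to_vector(p) for p in ["cat1.jpg", "cat2.jpg"]]  # all the same length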

Accessibility

The data needs to be stored in a place that’s easy to work with. A local file system, or HDFS (the Hadoop
file system), or an S3 bucket on AWS, for example. If the data is stored in many different databases that are
unconnected, you will have to build data pipelines. Building data pipelines and performing preprocessing
can account for at least half the time you spend building deep-learning solutions.


Minimum Data Requirement

The minimums vary with the complexity of the problem, but 100,000 instances in total, across all categories,
is a good place to start.

If you have labeled data (i.e. categories A, B, C and D), it’s preferable to have an evenly balanced dataset
with 25,000 instances of each label; that is, 25,000 instances of A, 25,000 instances of B and so forth.

EFFICIENT CONVOLUTION ALGORITHMS:

Efficient convolution algorithms are essential for various signal and image processing tasks, as well
as deep learning and computer vision applications. Convolution is a fundamental operation that involves
multiplying and summing values from two input arrays, and it can be computationally expensive, especially
for large input data and filter kernels. Several efficient algorithms and techniques have been developed to
speed up convolution operations. Here are some of the most important ones:

1. Direct (Naïve) Convolution:

The most straightforward way to compute a convolution is to perform the element-wise multiplication and
sum for each possible location of the filter over the input. While this is conceptually simple, it is highly
inefficient and slow for large inputs and filters.

2. Fast Fourier Transform (FFT) Convolution:

One efficient technique for convolution is to use the FFT. The idea is to convert the input and filter into
the frequency domain, perform element-wise multiplication, and then convert the result back to the time
domain using the inverse FFT. This approach can significantly reduce the computational complexity,
especially for large filters and inputs. However, it may introduce some artifacts due to the finite precision of
floating-point arithmetic.
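
As a quick illustration (not part of the original text), SciPy exposes both a direct and an FFT-based 2-D convolution, and the two agree up to floating-point error while the FFT version is much faster for large kernels:

import numpy as np
from scipy.signal import convolve2d, fftconvolve

img = np.random.rand(256, 256)
ker = np.random.rand(15, 15)

direct = convolve2d(img, ker, mode="same")    # naive spatial convolution
fast = fftconvolve(img, ker, mode="same")     # multiply in the frequency domain

print(np.allclose(direct, fast))              # True, up to floating-point error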

3. Winograd Convolution:

Winograd convolution is an algorithm that minimizes the number of multiplications required for
convolution. It uses a small set of precomputed matrices to transform the input and filter, allowing for faster
computation with less computational cost. This method is particularly effective for small filter sizes.

4. Strassen Algorithm:

Originally developed for matrix multiplication, the Strassen algorithm has also been adapted for
convolution. It reduces the number of multiplicative operations by recursively breaking down the
convolution into smaller sub-convolutions. This can be more efficient for large convolutions.


5. FFT-based Methods for 2D Convolution:

For 2D convolutions, you can use the FFT approach by performing separate 1D FFTs along each
dimension of the input and filter, and then combining them. This is particularly useful when dealing with
images and 2D data.

6. Im2Col and Col2Im:

These techniques involve reformatting the input and filter data into matrix form, where convolution can be
performed as a simple matrix multiplication operation. While this approach can be more efficient for
hardware implementations, it requires additional memory for the reformatted data.
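
A minimal NumPy sketch of the im2col idea (illustrative only): every patch of the input is unrolled into a column, after which the convolution becomes a single matrix multiplication.

import numpy as np

def im2col(image, kH, kW):
    # unfold every kH x kW patch of a 2-D image into one column of a matrix
    H, W = image.shape
    cols = [image[i:i + kH, j:j + kW].reshape(-1)
            for i in range(H - kH + 1)
            for j in range(W - kW + 1)]
    return np.stack(cols, axis=1)              # shape: (kH*kW, number of patches)

image = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)

patches = im2col(image, 3, 3)                  # (9, 16)
out = kernel.reshape(1, -1) @ patches          # convolution as a matrix product
out = out.reshape(4, 4)                        # fold back into the output map (col2im)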

7. Depthwise Separable Convolution:

Depthwise separable convolution is a technique used in deep learning, where a convolution operation is
split into two parts: depthwise convolution (applying a single filter to each input channel separately) and
pointwise convolution (applying 1x1 convolutions to combine the results). This reduces the number of
parameters and computations, making it efficient for mobile and embedded devices.
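
The parameter savings are easy to see in Keras, where SeparableConv2D implements exactly this depthwise-then-pointwise factorization (the layer sizes below are arbitrary assumptions chosen for illustration):

import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 32, 64))

standard = tf.keras.layers.Conv2D(128, 3, padding="same")(inputs)
separable = tf.keras.layers.SeparableConv2D(128, 3, padding="same")(inputs)  # depthwise + 1x1 pointwise

print(tf.keras.Model(inputs, standard).count_params())    # 73,856 parameters
print(tf.keras.Model(inputs, separable).count_params())   # 8,896 parameters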

8. Winograd-like Transformations:

Variations of the Winograd algorithm exist for different input sizes and filter dimensions, providing
options for optimizing convolution for specific scenarios.

The choice of convolution algorithm depends on the specific use case, hardware, and trade-offs between
speed and memory usage.
NEUROSCIENTIFIC BASIS:

The history of convolutional networks begins with neuroscientific experiments long before the relevant

computational models were developed.

Neurophysiologists David Hubel and Torsten Wiesel observed how neurons in the cat’s brain responded

to images projected in precise locations on a screen in front of the cat.

“Their great discovery was that neurons in the early visual system responded most strongly to very specific

patterns of light, such as precisely oriented bars, but responded hardly at all to other patterns”


The Neurons in the early visual cortex are organized in a hierarchical fashion, where the first cells connected

to the cat’s retinas are responsible for detecting simple patterns like edges and bars, followed by later layers

responding to more complex patterns by combining the earlier neuronal activities.

A Convolutional Neural Network may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers.

Filters in a Convolutional Neural network

The Visual Cortex of the brain is a part of the cerebral cortex that processes visual information. V1 is the

first area of the brain that begins to

perform significantly advanced processing of visual input.


A convolutional network layer is designed to capture three properties of V1:

1. V1 is arranged in a spatial map. It actually has a two-dimensional structure mirroring the structure of the

image in the retina. Convolutional networks capture this property by having their features defined

in terms of two dimensional maps.

2. V1 contains many simple cells. A simple cell’s activity can be characterized by a linear function of the

image in a small, spatially

localized receptive field. The detector units of a convolutional network are designed to emulate

these properties of simple cells.

3. V1 also contains many complex cells. These cells respond to features that

are similar to those detected by simple cells, but complex cells are invariant to small shifts in the position

of the feature. This inspires the pooling units of convolutional networks.

There are many differences between convolutional networks


and the mammalian vision system. Some of these differences are -

1. The human eye is mostly very low resolution, except for a tiny patch called the fovea. Most

convolutional networks receive large full resolution photographs as input.

2. The human visual system is integrated with many other senses, such as

hearing, and factors like our moods and thoughts. Convolutional networks

so far are purely visual.

3. Even simple brain areas like V1 are heavily impacted by feedback from higher levels. Feedback has been

explored extensively in neural network models but has not yet been shown to offer a compelling

improvement.

APPLICATIONS

COMPUTER VISION:

What is Computer Vision?


Computer vision is one of the fields of artificial intelligence that trains and enables computers to understand
the visual world. Computers can use digital images and deep learning models to accurately identify and
classify objects and react to them.


Computer vision in AI is dedicated to the development of automated systems that can interpret visual data
(such as photographs or motion pictures) in the same manner as people do. The idea behind computer vision
is to instruct computers to interpret and comprehend images on a pixel-by-pixel basis. This is the foundation
of the computer vision field. Regarding the technical side of things, computers will seek to extract visual
data, manage it, and analyze the outcomes using sophisticated software programs.

The amount of data that we generate today is tremendous - 2.5 quintillion bytes of data every single day.
This growth in data has proven to be one of the driving factors behind the growth of computer vision.

How Does Computer Vision Work?

Massive amounts of information are required for computer vision. Repeated data analyses are performed
until the system can differentiate between objects and identify visuals. Deep learning, a specific kind of
machine learning, and convolutional neural networks, an important form of a neural network, are the two
key techniques that are used to achieve this goal.

With the help of pre-programmed algorithmic frameworks, a machine learning system may automatically
learn about the interpretation of visual data. The model can learn to distinguish between similar pictures if it
is given a large enough dataset. Algorithms make it possible for the system to learn on its own, so that it
may replace human labor in tasks like image recognition.

Convolutional neural networks help machine learning and deep learning models understand images by dividing them into smaller regions that can be tagged. Using these tags, the network performs convolutions and makes predictions about the scene it is observing. With each cycle, the network performs further convolutions and checks the accuracy of its predictions, and that is when it starts perceiving and identifying pictures much like a human.

Computer vision is similar to solving a jigsaw puzzle in the real world. Imagine that you have all these
jigsaw pieces together and you need to assemble them in order to form a real image. That is exactly how the
neural networks inside a computer vision work. Through a series of filtering and actions, computers can put
all the parts of the image together and then think on their own. However, the computer is not just given a
puzzle of an image - rather, it is often fed with thousands of images that train it to recognize certain objects.

For example, instead of training a computer to look for pointy ears, long tails, paws and whiskers that make
up a cat, software programmers upload and feed millions of images of cats to the computer. This enables the
computer to understand the different features that make up a cat and recognize it instantly.

History

For almost 60 years, researchers and developers have sought to teach computers how to perceive and make
sense of visual information. In 1959, neurophysiologists started showing a cat a variety of sights in an effort
to correlate a reaction in the animal's brain. They found that it was particularly sensitive to sharp corners and
lines, which technically indicates that straight lines and other basic forms are the foundation upon which
image analysis is built.


Around the same period, the first image-scanning technology emerged that enabled computers to scan
images and obtain digital copies of them. This gave computers the ability to digitize and store images. In the
1960s, artificial intelligence (AI) emerged as an area of research, and the effort to address AI's inability to
mimic human vision began.

Neuroscientists demonstrated in 1982 that vision operates hierarchically and presented techniques enabling
computers to recognize edges, vertices, arcs, and other fundamental structures. At the same time, data
scientists created a pattern-recognition network of cells. By the year 2000, researchers were concentrating
their efforts on object identification, and by the following year, the industry saw the first-ever real-time face
recognition solutions.

Deep Learning Revolution

Examining the algorithms upon which modern computer vision technology is based is essential to
understanding its development. Deep learning is a kind of machine learning that modern computer vision
utilizes to get data-based insights.

When it comes to computer vision, deep learning is the way to go. An algorithm known as a neural network
is used. Patterns in the data are extracted using neural networks. Algorithms are based on our current
knowledge of the brain's structure and operation, specifically the linkages between neurons within the
cerebral cortex.

The perceptron, a mathematical model of a biological neuron, is the fundamental unit of a neural network. It
is possible to have many layers of linked perceptrons, much like the layers of neurons in the biological
cerebral cortex. As raw data is fed into the perceptron-generated network, it is gradually transformed into
predictions.

How Long Does It Take To Decipher An Image

Extremely fast CPUs and associated technology, together with a swift, dependable internet and cloud-based infrastructures, make the entire process blisteringly fast nowadays. Importantly, several of the largest businesses investing in AI research, like Google, Facebook, Microsoft, and IBM, have been open about their research and development in the field. In this way, people may build upon the foundation they've laid.

This has resulted in the AI sector heating up, and studies that used to take weeks to complete may now be
completed in a few minutes. In addition, for many computer vision tasks in the actual world, this whole
process takes place constantly in a matter of microseconds. As a result, a computer may currently achieve
what researchers refer to as "circumstantially conscious" status.

Computer Vision Applications

One field of Machine Learning where fundamental ideas are already included in mainstream products is
computer vision. The applications include:


 Self-Driving Cars

With the use of computer vision, autonomous vehicles can understand their environment. Multiple cameras record the environment surrounding the vehicle, and the footage is fed into computer vision algorithms that analyze the images in real time to locate road edges, decipher signposts, and see other vehicles, obstacles, and people. The autonomous vehicle can then navigate streets and highways on its own, swerve around obstructions, and get its passengers where they need to go safely.

 Facial Recognition

Facial recognition programs, which use computer vision to recognize individuals in photographs, rely
heavily on this field of study. Facial traits in photos are identified by computer vision algorithms, which then
match those aspects to stored face profiles. In order to verify the identity of the people using consumer
electronics, face recognition is increasingly being used. Facial recognition is used in social networking
applications for both user detection and user tagging. For the same reason, law enforcement uses face
recognition software to track down criminals using surveillance footage.

 Augmented & Mixed Reality

Augmented reality, which allows computers like smartphones and wearable technology to superimpose or
embed digital content onto real-world environments, also relies heavily on computer vision. Virtual items
may be placed in the actual environment through computer vision in augmented reality equipment. In order
to properly generate depth and proportions and position virtual items in the real environment, augmented
reality apps rely on computer vision techniques to recognize surfaces like tabletops, ceilings, and floors.

 Healthcare

Computer vision has contributed significantly to the development of health tech. Automating the process of
looking for malignant moles on a person's skin or locating indicators in an x-ray or MRI scan is only one of
the many applications of computer vision algorithms.

Examples

The following are some examples of well-established activities using computer vision:

 Categorization of Images

A computer program that uses image categorization can determine what an image is of (a dog, a banana, a
human face, etc.). In particular, it may confidently assert that an input picture matches a specific category. It
might be used by a social networking platform, for instance, to filter out offensive photos that people post.

 Object Detection

By first classifying images into categories, object detection may then utilize this information to search for
and catalog instances of the desired class of images. In the manufacturing industry, this can include finding
defects on the production line or locating broken equipment.


 Observation of Moving Objects

Once an object is detected, object tracking follows it as it moves, commonly using a live video stream or a series of sequentially captured photos. For example, driverless cars must not only detect and classify moving things like people, other motorists, and road systems, but also track them in order to prevent crashes and adhere to traffic regulations.

 Retrieval of Images Based on Their Contents

In contrast to traditional visual retrieval methods, which rely on metadata labels, a content-based recognition
system employs computer vision to search, explore, and retrieve pictures from huge data warehouses based
on the actual image content. Automatic picture annotations, which can replace traditional visual tagging,
may be used for this work.

Computer Vision Algorithms

Computer vision algorithms include the different methods used to understand the objects in digital images
and extract high-dimensional data from the real world to produce numerical or symbolic information. There
are many other computer vision algorithms involved in recognizing things in photographs. Some common
ones are:

 Object Classification - What is the main category of the object present in this photograph?

 Object Identification - What is the type of object present in this photograph?

 Object Detection - Where is the object in the photograph?

 Object Segmentation - What pixels belong to the object in the image?

 Object Verification - Is the object in the photograph?

 Object Recognition - What are the objects present in this photograph and where are they located?

 Object Landmark Detection - What are the key points for the object in this photograph?


Fig: Computer vision detecting cats in a picture (Source)

Many other advanced computer vision algorithms such as style transfer, colorization, human pose
estimation, action recognition, and more can be learned alongside deep learning algorithms.

Challenges of Computer Vision

Creating a machine with human-level vision is surprisingly challenging, and not only because of the
technical challenges involved in doing so with computers. We still have a lot to learn about the nature of
human vision.

To fully grasp biological vision, one must learn not just how various receptors like the eye work, but also
how the brain processes what it sees. The process has been mapped out, and its tricks and shortcuts have
been discovered, but, as with any study of the brain, there is still a considerable distance to cover.

Computer Vision Benefits

Computer vision can automate several tasks without the need for human intervention. As a result, it provides
organizations with a number of benefits:

 Faster and simpler process - Computer vision systems can carry out repetitive and monotonous tasks at a
faster rate, which simplifies the work for humans.

 Better products and services - Well-trained computer vision systems make very few mistakes, which results in faster delivery of high-quality products and services.

 Cost-reduction - Companies do not have to spend money on fixing their flawed processes because
computer vision will leave no room for faulty products and services.

Computer Vision Disadvantages

There is no technology that is free from flaws, which is true for computer vision systems. Here are a few
limitations of computer vision:

 Lack of specialists - Companies need to have a team of highly trained professionals with deep knowledge
of the differences between AI vs. Machine Learning vs. Deep Learning technologies to train computer
vision systems. There is a need for more specialists that can help shape this future of technology.

 Need for regular monitoring - If a computer vision system faces a technical glitch or breaks down, this
can cause immense loss to companies. Hence, companies need to have a dedicated team on board to
monitor and evaluate these systems.



IMAGE GENERATION:

The Generative Adversarial Network, popularly known as the GAN, is a deep learning, unsupervised machine learning technique proposed in 2014 by Goodfellow et al. The main blocks of this architecture are:

1. Generator: this block tries to generate images that are very similar to those of the original dataset by taking noise as input. It tries to learn the joint probability of the input data (X) and output data (Y), i.e. P(X, Y).

2. Discriminator: this block accepts two kinds of input, one from the main dataset and the other from the images produced by the Generator, and classifies them as Real or Fake.

To keep this generative and adversarial process simple, both of these blocks are built from deep-neural-network-based architectures that can be trained through forward and backward propagation.

To understand the concept in depth, we will implement GAN architectures with tensorflow-keras. We will focus on the generation of MNIST images with simple GANs, with Deep Convolutional GANs, and with Super Resolution GANs, with working examples.

Simple Generative Adversarial Networks (GANs)

With the above architecture of the simple GAN in mind, we first look at the architecture of the Generator model. The Generator consists of four dense layers, with 100-dimensional noise passed as input. The last dense layer of the Generator produces a 784-dimensional (28x28 = 784) vector, which is simply the flattened vector corresponding to an individual MNIST image.


Generator of Simple GAN

For the last Dense layer, we use a tanh activation because we normalize each image to the range [-1, +1]. The generator's output vector is then passed to the next block, the Discriminator network of the GAN.

The Discriminator, whose main task is to maximize the probability of correctly predicting real versus fake data, receives this 784-dimensional generator output vector. This block also comprises four dense layers, as shown below.


Discriminator of Simple GAN

A sigmoid activation is used for the last layer, which gives the probability of the input image being real or fake.

The LeakyReLU activation function is used in both the Generator and the Discriminator, which helps the model converge faster.

Both these Generative and Discriminative blocks are combined together as below;

Simple GAN model
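
The original figure is not reproduced here, so the following is a minimal Keras sketch of the simple GAN described above. The 100-dimensional noise input, the 784-dimensional tanh output, the sigmoid discriminator output and the LeakyReLU activations come from the text; the sizes of the intermediate Dense layers are assumptions made for illustration.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_generator():
    # four Dense layers: 100-d noise in, 784-d (28x28) image out, tanh output
    return models.Sequential([
        layers.Dense(256, input_dim=100), layers.LeakyReLU(0.2),
        layers.Dense(512), layers.LeakyReLU(0.2),
        layers.Dense(1024), layers.LeakyReLU(0.2),
        layers.Dense(784, activation="tanh"),
    ])

def build_discriminator():
    # four Dense layers: 784-d image in, probability of "real" out
    model = models.Sequential([
        layers.Dense(1024, input_dim=784), layers.LeakyReLU(0.2),
        layers.Dense(512), layers.LeakyReLU(0.2),
        layers.Dense(256), layers.LeakyReLU(0.2),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

generator = build_generator()
discriminator = build_discriminator()

# combined model: the discriminator is frozen here so that training the GAN
# updates only the generator's weights
discriminator.trainable = False
noise = layers.Input(shape=(100,))
gan = models.Model(noise, discriminator(generator(noise)))
gan.compile(optimizer="adam", loss="binary_crossentropy")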


To perform the actual training, we initialize the generator, discriminator and gan objects by building each functional block. We generate a 100-dimensional noise input for the generator; since the images are normalized to [-1, +1], the random noise is drawn from a normal distribution as well.

Prediction of Real and Fake images through GAN Discriminator

With the above code, we first generate sample images with the generator by passing random noise to it. These images are then combined with real images to form a batch of real and fake images. This batch is passed to the discriminator, which predicts the probability of each image being real or fake.

Up to this stage of discriminator prediction, we keep the discriminator trainable, as the prediction loss needs to be back-propagated through the network to update the weights of each layer.

Then, by freezing the layers of the discriminator, we back-propagate the GAN loss through the generator to update the weights of its layers.

Freezing Discriminator, we backpropagate loss through Generator
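
The training procedure just described can be sketched as the following loop (an illustration built on the hypothetical generator, discriminator and gan objects from the earlier sketch; x_train is assumed to hold MNIST images flattened to 784 values and scaled to [-1, +1]):

import numpy as np

batch_size = 128
for step in range(10000):
    # 1) train the discriminator on a mixed batch of real and generated images
    noise = np.random.normal(0, 1, (batch_size, 100))
    fake_images = generator.predict(noise, verbose=0)
    real_images = x_train[np.random.randint(0, x_train.shape[0], batch_size)]

    X = np.concatenate([real_images, fake_images])
    y = np.concatenate([np.ones(batch_size), np.zeros(batch_size)])
    discriminator.trainable = True
    discriminator.train_on_batch(X, y)

    # 2) freeze the discriminator and train the generator through the combined
    #    model, asking it to make generated images be labelled as "real"
    discriminator.trainable = False
    noise = np.random.normal(0, 1, (batch_size, 100))
    gan.train_on_batch(noise, np.ones(batch_size))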

The image below shows the progress of the GAN, where, starting from simple noise input, the generator is able to create MNIST-like images similar to the original data.


Deep Convolutional Generative Adversarial Network (DC GANs)

It was interesting to see similar MNIST images generated with a plain deep neural network, yet that model does not use the key deep learning building block for images, namely convolutional layers.

Henceforth, instead of flattening the image into dense layers, we will use convolutional filters to generate the image from the noise input.

The Generator of the DC GAN consists of a Dense layer followed by a Batch Normalization layer. We first take noise as input and project it to F×F×K elements; this output is then reshaped into an F×F×K tensor.

Conv2DTranspose layers are used in DC GANs; their main objective is to upsample the input feature maps. The complete architecture is sketched below:
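
Since the original figure is not reproduced here, the following is a hedged Keras sketch of such a DC GAN generator. The Dense + BatchNorm projection, the reshape and the Conv2DTranspose upsampling come from the text; the concrete sizes (7x7x128, two upsampling steps to 28x28x1) are assumptions chosen to match MNIST.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_dcgan_generator():
    return models.Sequential([
        layers.Dense(7 * 7 * 128, input_dim=100),                     # project the noise
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Reshape((7, 7, 128)),                                  # F x F x K tensor
        layers.Conv2DTranspose(64, 4, strides=2, padding="same"),     # 7x7  -> 14x14
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(1, 4, strides=2, padding="same",       # 14x14 -> 28x28
                               activation="tanh"),
    ])

print(build_dcgan_generator().output_shape)   # (None, 28, 28, 1)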

The Discriminator of the DC GAN, unlike in the previous example, accepts the input as an image instead of a flattened vector. Images generated by the Generator and images sampled from the original data are both passed to the discriminator.


LeakyReLU activation is used along with a small amount of Dropout to avoid overfitting the model.

The remaining flow of the DC GAN is the same as that of the simple GAN: we first let the discriminator update its weights through backpropagation of the training loss. After the discriminator has been updated, we freeze it and fit the generator on fake data; the generator loss is then back-propagated through it to update its weights.

Below is the image that shows the progress of DC-GAN performance on MNIST data over 400 epochs.

Super Resolution Generative Adversarial Networks (SR GANs)

Now we will look into one of the more advanced GAN architectures, called the SR GAN. Its main purpose is the generation of a Super Resolution image by the Generator from a Low Resolution input image. This Super Resolution image is very similar to the original High Resolution image of the original dataset.

SR GANs block diagram

Working of the SR GAN, following the block diagram above:

 The original dataset consists of High Resolution (HR) images, which are downsampled to obtain Low Resolution (LR) images.

 These LR images are then passed to the SR GAN Generator, which generates Super Resolution (SR) images that closely match the HR images.

 Batches of these SR and HR images are then passed to the SR GAN Discriminator, which predicts whether the images are real or fake.

 The final loss of the SR GAN is then back-propagated to the Generator and Discriminator networks to update their weights.

Now that we understand how the architecture works, let us look at the details of each of the core blocks: the Generator and the Discriminator.

SR GAN Generator


As stated in the original research paper, the image above shows the block diagram of the SR GAN Generator. An LR input image is passed through a Convolution layer followed by a Parametric ReLU activation. The output is then passed to a set of 16 residual blocks. The output from the residual blocks is passed through a couple of Convolutional blocks and then to an Up-sampling block, which increases the resolution of the image to the desired level.

SR GAN Discriminator

The Discriminator of the SR GAN, like in every other GAN architecture, does the job of separating fake from real images by accepting the two kinds of images; here we can see a somewhat more complex structure than in the previously seen architectures.

The Discriminator output is then used to compute the loss for the model.

The loss function in the SR GAN is a combination of a Content Loss and an Adversarial Loss. The content loss can be captured pixel-wise using the MSE between HR and SR images; it can be calculated by extracting the feature vector corresponding to each input image by passing it through a VGG19 network.
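
A hedged sketch of such a VGG19-based content loss in Keras (the choice of the block5_conv4 feature layer and the plain mean-squared error are assumptions; inputs are expected to be preprocessed for VGG19):

import tensorflow as tf

vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(vgg.input,
                                   vgg.get_layer("block5_conv4").output)
feature_extractor.trainable = False   # VGG19 is only used as a fixed feature extractor

def content_loss(hr_images, sr_images):
    # pixel-wise MSE between the VGG19 feature maps of the HR and SR images
    hr_features = feature_extractor(hr_images)
    sr_features = feature_extractor(sr_images)
    return tf.reduce_mean(tf.square(hr_features - sr_features))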


vgg19 Content Loss

While running the complete SR GAN model, we initialize the Generator, Discriminator and SR-GAN objects.

The Generator object is compiled with the Adam optimizer and only the content loss (i.e. the VGG19 pixel-wise MSE).

The Discriminator object, on the other hand, is compiled with binary_crossentropy and the Adam optimizer.

Instead of training on a mixed batch, we first pass HR images to the discriminator and then train it on a batch of SR images generated from the LR inputs, keeping discriminator.trainable = True.

To train the generator, we then freeze the discriminator and train the srgan object on LR images with [HR_images, REAL_IMAGE_LABELS] as the desired output.

Running the above network with a batch size of 64 for 400 epochs, we were able to get significant output from the SR GAN architecture.


The GIF at the start of the original article was generated using this SR GAN architecture. From the results below we can clearly see that the model was able to reconstruct the edges of the bridge and the canal.

As the model progresses through epochs, from Epoch 1 to Epoch 200 the bridge structure becomes clearly visible, and the cliff, its greenery and some of the buildings are clearly visible in the SR image (center).

At the end of Epoch 400, the color of the sky has improved considerably, as has the water around the bridge. So, starting from a low-resolution image in which individual pixels are visible to the eye, the model was able to regenerate the image with much finer detail.


Here are some other examples of generated images:


Not perfect, but not bad either.

Conclusion

Though, with simple modeling architecture, we were able to recreate not the best but much better image

output through all three forms of GAN architecture. By providing more image data and giving more time to

learn the features and in-dept detailing of the image, certainly the models will outperform the Original data

available.

IMAGE COMPRESSION:

Compression-decompression description

A compression-decompression task involves compressing data, sending it with low bandwidth usage, and then decompressing it. The objective of the process is to achieve minimal difference between the original and the decompressed images, i.e. to obtain the same image quality after compression-decompression as before the data transfer.

The schema for a compression-decompression task is presented in Figure 2:


Figure 2. Schema for a compression-decompression method. Data is the initial image file. An encoder is a
compression process, data compressed is a file after compression, a decoder is a decompression process,
Data* is a decompressed file.

To compare the performance of different methods we, first, measure compression coefficient and, after that,

we apply SSIM and PSNR metrics to measure similarities between the original image and the decompressed

image (all these metrics are described in the section Metrics below).

As we demonstrate in the Results section, different methods achieve different objectives: some produce high-

quality image results while having small compression efficiency, others reach high compression efficiency

while producing low-quality image results.

Dataset

We selected 10 images to compare and test different methods for a compression task. The dataset represents 5

bottles of Italian wines and 1 bottle of sauce (we chose this type of picture to further use the methods for the

bottle detection task as part of the ‘Bottle detection and classification’ company’s project). Examples of

images are presented in Figure 3:


Figure 3. Dataset for experiments with image compression methods (test data).

JPEG compression method

For the JPEG compression method, we employ the PIL library for python to compress .bmp images to .png

(code for running this is posted in GitHub), and JPEG format (Joint Photographic Experts Group)[10], which

is a standard image format for containing lossy and compressed image data. The format was introduced in the

early ‘90s, and since then, it became the most widely used image compression standard in the world[11]. The

main basis for JPEG’s lossy compression algorithm is the discrete cosine transform: this mathematical

operation converts each frame/field of the video source from the spatial (2D) domain into the frequency

domain. The JPEG standard specifies the codec, which defines how an image is compressed into a stream of

bytes and decompressed back into an image.

JPEG compression code:


from io import BytesIO
from PIL import Image

IMAGE_FILE = '1.bmp'                   # image file name
im1 = Image.open(IMAGE_FILE)

# here, we create an empty in-memory bytes buffer
buffer = BytesIO()
im1.save(buffer, "JPEG", quality=60)   # compressed file written into the buffer

Machine learning models

We tested several machine learning models (code for testing is posted in GitHub) and chose the most optimal

models (which are effortless to run, require minimal GPU, and can be evaluated using the selected metrics).

Model 1 — ‘Factorized Prior Autoencoder’

The model is taken from the paper “Variational image compression with a scale hyperprior”[5]. The

architecture is shown in Figure 4:


Figure 4. The architecture of the proposed network ‘Variational image compression with a scale hyperprior’.

We employed TensorFlow framework[9] to compare the models because all the models can be run within

the same framework, and it is convenient for our task. We used Google Colab to run the models because it

provides free GPU. Below, we show the code for running the framework for Factorized Prior Autoencoder

model (installation instructions in Colab).

First, install tensorflow-compression library:


!pip install tensorflow-compression

Second, clone the project to Colab:


![[ -e /tfc ]] || git clone https://github.com/tensorflow/compression /tfc
%cd /tfc/models
import tfci # Check if tfci.py is available.

Third, run the model.

Compression in TensorFlow for Factorized Prior Autoencoder optimized for MS-SSIM (multiscale

SSIM) is the following:


!python tfci.py compress bmshj2018-factorized-msssim-6 /1.png

 bmshj2018-factorized-msssim — model name;

 number 6 at the end of the name indicates the quality level (1: lowest, 8: highest);


 /1.png — input file name (image).

We experimented with several quality levels, and in the result table, we include the models which give an

approximately similar performance for SSIM metrics (around 0.97), namely, bmshj2018-factorized-msssim-6

in Table 2.

This script runs compression and produces a compressed file with .tfci name in addition to the target input

image (1.png). This file 1.png.tfci — is so-called compressed data from Figure 1.

Decompression in TensorFlow:
!python tfci.py decompress /1.png.tfci

This script produces a file with extension .png in addition to the compressed file name, for example,

1.png.tfci.png. The decompression code is the same for other models described below.

Model 2 — Nonlinear transform coder model with factorized priors

The second model is a nonlinear transform coder model with factorized priors (entropy models) optimized for

MSE, with GDN (generalized divisive normalization) activation functions, and 128 filters per layer[4]. Its

architecture is shown in Figure 5. It was also run on TensorFlow framework[9].

Figure 5. Schema of model architecture for nonlinear transform coder with factorized priors (entropy
models) optimized for MSE, with GDN[12].


GDN is typically applied to linear filter responses z = Hx, where x is a vector of image data, or to linear filter responses inside a composite function such as an ANN (artificial neural network). Its general form is defined as

y_i = z_i / (β_i + ∑_j γ_ij |z_j|^α_ij)^ε_i

where y represents the vector of normalized responses, and the vectors β, ε and matrices α, γ represent parameters of the transformation (all non-negative).

Compression in TensorFlow for nonlinear transform coder model with factorized priors (entropy models)

optimized for MSE, with GDN (generalized divisive normalization) activation functions:
!python tfci.py compress b2018-gdn-128-4 /1.png

The number 1–4 at the end indicates the quality level (1: lowest, 4: highest). We experiment with different

levels of quality and choose the model which produces SSIM quality of approximately 0.97 (b2018-gdn-128–

4 in Table 2).

Model 3 — Hyperprior model with non zero-mean Gaussian conditionals

The third model is a hyperprior model with non zero-mean Gaussian conditionals (without autoregression), optimized for MS-SSIM (multiscale SSIM)[6]. The architecture is shown in Figure 6. It was also run on the TensorFlow framework[9].


Figure 6. Model architecture for hyperprior model with non zero-mean Gaussian conditionals (without
autoregression) [6].

Compression in TensorFlow for hyperprior model with non zero-mean Gaussian conditionals (without

autoregression), optimized for MS-SSIM:


!python tfci.py compress mbt2018-mean-msssim-5 /1.png

The number 1–8 at the end indicates the quality level (1: lowest, 8: highest). We experiment with different

levels of quality and choose the model which produces SSIM quality of approximately 0.97 (mbt2018-mean-

msssim-5 in table 2).

Metrics

The performance of image compression-decompression methods can be evaluated using several metrics [4]:

 Compression efficiency/compression coefficient — the ratio between the compressed and the initial data

(image) size,

 Image quality (Distortion Measurement) — the difference between the original image and the

compressed/decompressed image,

 Computational cost — the number of seconds required for computing the compression and the additional
physical tool, such as GPU units.


Below, we summarize two metrics used for comparison, namely, compression efficiency/compression

coefficient, and image quality.

Compression efficiency/compression coefficient

Formula for this metric is the following:

N_compression = size(compressed data)/ size(uncompressed data).

N_compression is a compression coefficient equal to the size of the compressed data divided by the size of the initial data. Size(compressed data) is the file size in bytes after the model's compression. Size(uncompressed data) equals the image's height × width × channels in bytes (one byte per channel). Our evaluation dataset has 10 images of equal size, with width 576 px, height 768 px and 3 channels, so the size of the initial uncompressed data is 576 × 768 × 3 = 1,327,104 bytes = size(uncompressed data).
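
As a small sketch of this computation (the compressed file name refers to the example file produced earlier and is illustrative):

import os

uncompressed_size = 576 * 768 * 3                  # bytes, one byte per channel
compressed_size = os.path.getsize("1.png.tfci")    # size of the compressed file in bytes

n_compression = compressed_size / uncompressed_size
print(f"N_compression = {n_compression:.3f}")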

Image quality

To compare the quality of compression we chose three metrics. We measure the quality of the compressed

files using the formula:

N_quality = Quality_metric(Data, Data*),

where Quality_metric is either SSIM or PSNR. Below, we show formulas for those metrics.

SSIM

In image comparison, the mean squared error (MSE) is simple to implement, but it is not highly indicative of

the perceived similarity. Structural similarity aims to address this shortcoming by taking texture into
account[7].


Formula for SSIM:

SSIM(x, y) = ((2 μ_x μ_y + c1)(2 σ_xy + c2)) / ((μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2))

where x, y are the images to compare, μ is the average of image x or y respectively, σ² is the variance of x and y respectively (σ_xy being their covariance), and c1 and c2 are two variables that stabilize the division with a weak denominator.

from skimage.metrics import structural_similarity

SSIM = structural_similarity(img1, img2, multichannel=True)

PSNR

Compute the peak signal-to-noise ratio (PSNR) between images[8]:

PSNR = 10 · log10(R² / MSE)

where R is the maximum fluctuation in the input image data type and MSE is the mean squared error between the two images. For example, if the input image has a double-precision floating-point data type, then R is 1. If it has an 8-bit unsigned integer data type, R is 255.

import math
from torch import Tensor
import torch.nn.functional as F

def psnr(x: Tensor, x_hat: Tensor) -> float:
    # assumes images scaled to [0, 1], i.e. R = 1
    return -10 * math.log10(F.mse_loss(x, x_hat).item())

Results

JPEG compression method using classical codecs for image compression via python library PIL gave the

following results (see Table 1). For equal comparison, we intentionally chose the parameters to compress the

images in such a way that SSIM would be approximately 0.97 (that means, images were compressed with a

certain compression coefficient N_compression, which would give SSIM close to 0.97).

Table 1. Results obtained for the JPEG compression method.

In Table 2, we included models for neural network compression-decompression:


Table 2. Results obtained for three different neural networks models: FactorizedPriorAutoencoder:
bmshj2018-factorized-msssim-6[5], nonlinear transform coder with factorized priors: b2018-gdn-128–4[4],
hyperprior model with non zero-mean: Gaussian conditionals mbt2018-mean-msssim-5[6].

Conclusions

We compare the classical JPEG compression method with three different machine learning models for

compression-decompression task with TensorFlow framework. Several metrics are applied to compare the

performance. The results are as follows: with relatively equal SSIM quality (about 0.97), the best

compression was produced by the mbt2018-mean-msssim-5 model (N_compression is approximately 0.13).

The next best compression model is bmshj2018-factorized-msssim-6 (N_compression is approximately 0.23).

After this follows the classical JPEG compression method, with N_compression of around 0.288. The last is the b2018-gdn-128-4 model (N_compression is approximately 0.29). At the same time, the PSNR metrics for all neural network models are approximately the same (about 35), meaning that the MSE-based quality of the images after compression-decompression is almost the same for every model. It is also interesting to mention that the PSNR metric is higher for the JPEG method.
