
Deep Learning Notes(ALL): -
UNIT – I: Basic of Deep Learning - History of Deep Learning, McCulloch Pitts Neuron,
Thresholding Logic, Perceptrons, Perceptron Learning Algorithm and Convergence, Multilayer
Perceptrons (MLPs), Representation Power of MLPs, Sigmoid Neurons, Feed forward Neural
Networks.
1. McCulloch-Pitts Model of Neuron

The McCulloch-Pitts neural model, the earliest ANN model, has only two types of inputs:
excitatory and inhibitory. The excitatory inputs have weights of positive magnitude and the
inhibitory inputs have weights of negative magnitude. The inputs of the McCulloch-Pitts neuron
can be either 0 or 1. It uses a threshold function as its activation function: the output
signal yout is 1 if the weighted input sum ysum is greater than or equal to a given threshold
value, else 0. The diagrammatic representation of the model is as follows:
McCulloch-Pitts Model
Simple McCulloch-Pitts neurons can be used to design logical operations. For that purpose, the
connection weights need to be chosen correctly along with the threshold value of the activation
function. For better understanding, let me consider an example:
John carries an umbrella if it is sunny or if it is raining. There are four possible situations. I
need to decide when John will carry the umbrella. The situations are as follows:
● First scenario: It is neither raining nor sunny
● Second scenario: It is not raining, but it is sunny
● Third scenario: It is raining, and it is not sunny
● Fourth scenario: It is raining as well as sunny
To analyse the situations using the McCulloch-Pitts neural model, I can consider the input signals as
follows:
● X1: Is it raining?
● X2: Is it sunny?
So, each input can be either 0 or 1. We can set the weights of both X1 and X2 to 1 and the
threshold value to 1. So, the neural network model will look like:
Truth table for this case:
Situation | x1 | x2 | ysum | yout
1         | 0  | 0  | 0    | 0
2         | 0  | 1  | 1    | 1
3         | 1  | 0  | 1    | 1
4         | 1  | 1  | 2    | 1
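The umbrella example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name mcculloch_pitts is an assumption made here, while the weights (1, 1) and the threshold (1) are the values used in the example above.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Return (ysum, yout) for binary inputs using a hard threshold."""
    ysum = sum(w * x for w, x in zip(weights, inputs))
    yout = 1 if ysum >= threshold else 0
    return ysum, yout


# Umbrella example: x1 = "is it raining?", x2 = "is it sunny?".
weights = [1, 1]
threshold = 1

# Each row is (x1, x2, ysum, yout), in the same order as the truth table.
truth_table = [(x1, x2) + mcculloch_pitts([x1, x2], weights, threshold)
               for x1 in (0, 1) for x2 in (0, 1)]

for row in truth_table:
    print(row)
```

With the same weights, raising the threshold to 2 turns the neuron from a logical OR into a logical AND, which shows how the threshold value alone changes the logical operation.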
2. Perceptron Learning Algorithm and Convergence:-

In downloads->DL_Notes->perceptron.pdf
Also, https://www.javatpoint.com/perceptron-in-machine-learning.
Convergence of the Perceptron Algorithm:
The Perceptron learning algorithm converges if the data is linearly separable, meaning
there exists a hyperplane that can completely separate the positive and negative
examples. When the data is linearly separable, the Perceptron algorithm is guaranteed
to find a solution, and it will converge after a finite number of iterations. This is known
as the Perceptron convergence theorem.
However, if the data is not linearly separable, the Perceptron algorithm does not
converge. In such cases, the algorithm keeps updating the weights endlessly, trying to
find a perfect separation that does not exist. Adding a bias term does not fix this, since
the decision boundary is still a single hyperplane. To handle non-linearly separable data,
training can be stopped after a fixed number of epochs, or more complex models (such as
neural networks with hidden layers) can be employed.
It's important to note that the Perceptron algorithm is a fundamental concept in neural
networks and machine learning, but its limitations led to the development of more
sophisticated algorithms and architectures, such as multilayer perceptrons (MLPs),
which can handle complex, non-linear relationships in the data.
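The convergence behaviour described above can be sketched with a small perceptron trained on two toy problems. This is a minimal sketch under stated assumptions: the function name train_perceptron, the learning rate, the epoch cap, and the choice of OR (linearly separable) and XOR (not linearly separable) as datasets are all illustrative and not taken from the referenced notes.

```python
import numpy as np


def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron learning rule with labels in {0, 1}.

    Returns (weights, bias, epochs_used, converged).
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b >= 0 else 0
            update = lr * (target - pred)   # 0 when the prediction is right
            if update != 0:
                w += update * xi
                b += update
                errors += 1
        if errors == 0:                     # a full pass with no mistakes
            return w, b, epoch, True
    return w, b, max_epochs, False          # epoch cap reached


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_or = np.array([0, 1, 1, 1])    # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

w, b, epochs, converged_or = train_perceptron(X, y_or)
_, _, _, converged_xor = train_perceptron(X, y_xor)

predictions_or = [(1 if np.dot(w, xi) + b >= 0 else 0) for xi in X]
print("OR converged:", converged_or, "after", epochs, "epochs")
print("XOR converged:", converged_xor)
```

The epoch cap is what keeps the XOR run from looping forever; on OR the loop exits as soon as one full pass makes no mistakes.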
3. Working of Back Propagation Algorithm:
4. MLPs:
Structure of a Multi-Layer Perceptron (MLP):
By definition, an MLP is a type of artificial neural network that is composed of
multiple layers of interconnected neurons. These networks are loosely modelled
after the neurons in the human brain.
An MLP is a feed-forward network: data is passed in only one direction, from
input to output, unlike recurrent neural networks, where connections form
cycles and outputs are fed back as inputs. The MLP is the basic building block
behind more specialized architectures such as CNNs.
1. Input Layer:
● The input layer consists of nodes (also known as input neurons)
representing the features of the input data. Each feature of the
input is represented by a separate node.
● For example, in a simple image recognition task, each node in the
input layer might represent a pixel's intensity value in the image.
2. Hidden Layers:
● Between the input and output layers, there can be one or more
hidden layers. These layers are called "hidden" because they are
not directly exposed to the external environment or the user; their
workings are internal to the network.
● Each node in a hidden layer performs a weighted sum of its inputs.
The weights represent the strength of the connections between
nodes in the previous layer and the current node in the hidden
layer.
● After calculating the weighted sum, an activation function is
applied to introduce non-linearity into the network. Common
activation functions include ReLU (Rectified Linear Unit), sigmoid,
and tanh.
● The introduction of hidden layers and non-linear activation
functions allows MLPs to capture complex patterns and
relationships within the data.
3. Output Layer:
● The output layer produces the network's predictions or
classifications. The number of nodes in the output layer depends
on the task:
● For binary classification, there is one node in the output
layer, often using a sigmoid activation function to produce
values between 0 and 1.
● For multi-class classification, there are multiple nodes
(equal to the number of classes) with softmax activation,
ensuring that the output values represent probabilities and
sum up to 1.
● For regression tasks, there is one node in the output layer
without any activation function, allowing the network to
predict continuous numerical values.
Functioning of an MLP:
1. Initialization:
● Initialize the weights and biases of the network. These values are
usually initialized randomly.
2. Forward Propagation:
● During the forward pass, input data is fed into the input layer.
● The input values are multiplied by the weights and summed up in
each node of the hidden layers.
● The result of this summation is then passed through an activation
function, producing the output of each node in the hidden layers.
● The process continues through each hidden layer until the output
layer is reached. The output layer produces the network's
predictions.
3. Loss Calculation:
● Compare the predictions from the output layer to the actual target
values using a suitable loss function, such as mean squared error
for regression or cross-entropy for classification.
● The loss function measures the difference between the predicted
values and the true values, providing a measure of how well the
network is performing.
4. Backpropagation:
● Backpropagation involves calculating the gradients of the loss with
respect to the network's weights and biases.
● These gradients are calculated using the chain rule, starting from
the output layer and propagating backward through the hidden
layers.
● The gradients indicate how much each weight and bias
contributed to the overall error. The network then adjusts these
parameters using optimization algorithms like gradient descent,
updating them to minimize the loss function.
5. Training Iterations:
● Steps 2 to 4 are repeated for multiple iterations (epochs) or until a
convergence criterion is met.
● During each iteration, the network learns to improve its
predictions by adjusting the weights and biases based on the
computed gradients.
6. Prediction:
● Once the network is trained and the weights and biases are
optimized, the MLP can be used to make predictions on new,
unseen data.
● Input data is fed into the trained network, and the output layer
produces the predictions or classifications.
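The six steps above can be sketched as a tiny NumPy MLP. This is a hedged illustration, not a reference implementation: the XOR dataset, the 2-4-1 layer sizes, the sigmoid activations, the learning rate, the step count, and the random seed are all assumptions chosen to keep the example small.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 1. Initialization: random weights, zero biases (2 inputs, 4 hidden, 1 output).
W1 = rng.normal(0, 1, size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(0, 1, size=(4, 1))
b2 = np.zeros((1, 1))

lr = 0.1
losses = []
for step in range(2000):                      # 5. training iterations
    # 2. Forward propagation.
    h = sigmoid(X @ W1 + b1)                  # hidden activations
    p = sigmoid(h @ W2 + b2)                  # predicted probabilities

    # 3. Loss calculation (binary cross-entropy, averaged over samples).
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    losses.append(loss)

    # 4. Backpropagation via the chain rule, output layer first.
    dz2 = (p - y) / len(X)                    # gradient at the output pre-activation
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    dh = dz2 @ W2.T
    dz1 = dh * h * (1 - h)                    # sigmoid derivative is h * (1 - h)
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # Gradient descent update.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

# 6. Prediction with the trained parameters.
p = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print("first loss:", losses[0], "last loss:", losses[-1])
```

Whether the network fully solves XOR depends on the initialization and the number of steps; the sketch is only meant to show where each step of the procedure sits in code.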
5. Sigmoid Neurons:
A sigmoid neuron is a type of artificial neuron that addresses some of the
limitations of the McCulloch-Pitts neuron. It introduces a sigmoid activation
function that allows for continuous and smooth output, as opposed to the
binary output of the McCulloch-Pitts neuron. The sigmoid function is
commonly used and has the following mathematical form:
σ(z) = 1 / (1 + e^(-z))
Where:
● σ(z) is the output of the sigmoid function.
● z is the weighted sum of inputs plus a bias term.
Sigmoid neurons have the following components:
1. Inputs: Similar to the McCulloch-Pitts neuron, sigmoid neurons take
inputs with associated weights. Each input is multiplied by its weight,
and the weighted inputs are summed up.
2. Bias: A bias term is added to the weighted sum before applying the
sigmoid activation function. The bias allows for a shift in the output and
is an additional learnable parameter.
3. Activation Function: The sigmoid activation function transforms the
weighted sum plus bias into a continuous value between 0 and 1. This
smooth transition allows for more nuanced representations and gradual
changes in output.
Mathematically, the output of a sigmoid neuron can be represented as:
Output = σ(Σ(w * x) + b) //after adding biases
Where:
● σ is the sigmoid activation function.
● Σ(w * x) represents the weighted sum of inputs.
● b is the bias.
Sigmoid neurons are used as building blocks in feedforward neural networks,
which are the foundation of many machine learning and deep learning
applications.
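As a small illustration of Output = σ(Σ(w * x) + b), here is a sketch of a single sigmoid neuron. The function names and the example weights, inputs, and bias are assumptions chosen so that the weighted sum plus bias is exactly 0.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# z = 2.0 * 1.0 + (-1.0) * 0.0 + (-2.0) = 0, and sigmoid(0) = 0.5
out = sigmoid_neuron([1.0, 0.0], [2.0, -1.0], -2.0)
print(out)
```

Unlike the hard threshold of the McCulloch-Pitts neuron, a small change in the weights or bias here produces a small change in the output, which is what makes gradient-based training possible.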
Feedforward Neural Networks:
A feedforward neural network is a type of artificial neural network in which the
neurons are organized into layers and data is transmitted in one direction, from
input to output, without feedback loops. This makes feedforward networks
suitable for tasks like pattern recognition and classification.
Feedforward networks with one or more hidden layers are also known as
multi-layer neural networks. During data flow, input nodes receive the data,
which travels through the hidden layers and exits through the output nodes.
Each layer consists of multiple neurons that perform computations on the inputs
they receive and pass the results to the next layer. Here's how a feedforward
neural network works:
1. Input Layer: This is the first layer of the network where the raw input
data is fed. Each neuron in the input layer corresponds to a feature or
attribute of the input data.
2. Hidden Layers: These are intermediate layers between the input and
output layers. Each neuron in a hidden layer receives inputs from the
previous layer, performs computations using its weights and bias, applies
an activation function (often sigmoid or more modern alternatives like
ReLU), and passes the result to the next layer.
3. Output Layer: This is the final layer of the network that produces the
desired output. The number of neurons in the output layer depends on
the problem type. For example, in a binary classification problem, there
might be one neuron with a sigmoid activation function. In a multi-class
classification problem, there could be multiple neurons with softmax
activation.
Feedforward neural networks are used for tasks such as classification,
regression, and even more complex tasks like image and speech recognition.
They are trained using various optimization algorithms (e.g., gradient descent)
to adjust the weights and biases in order to minimize a chosen loss function,
which measures the difference between predicted and actual outputs.
In summary, sigmoid neurons and feedforward neural networks build upon the
basic concepts of artificial neurons and threshold logic to create more versatile
and powerful models capable of capturing complex relationships in data.
6. Representation Power of MLP

The representation power of Multi-Layer Perceptrons (MLPs) refers to their
ability to approximate complex, non-linear relationships within data. MLPs with
one or more hidden layers and non-linear activation functions can learn and
represent a wide range of functions. This property is formalized by the
Universal Approximation Theorem, which states that a neural network with a
single hidden layer and a suitable non-linear activation function can
approximate any continuous function on a compact input domain to any desired
accuracy, provided a sufficiently large (but finite) number of hidden neurons
is used.
Let's illustrate the representation power of MLPs with a simple example:
Example: Approximating a Non-Linear Function
Consider a non-linear function f(x) = sin(x), where x is the input and f(x) is
the output. The task is to approximate this sinusoidal function using an MLP.
1. Data Generation:
● Generate a set of input-output pairs: x_i and y_i = sin(x_i) for a range
of x values.
2. MLP Architecture:
● Use an MLP with one hidden layer containing a number of
neurons. The number of neurons in the hidden layer determines the
model's capacity to approximate complex functions.
● Apply an appropriate activation function in the hidden layer, such
as the sigmoid function or the Rectified Linear Unit (ReLU)
function.
● Use a linear activation function in the output layer, as the goal is to
approximate a continuous function.
3. Training the MLP:
● Train the MLP using the generated input-output pairs.
● During training, the network adjusts its weights and biases to
minimize the difference between the predicted values (output of the
MLP) and the actual sinusoidal values.
● Utilize a suitable optimization algorithm (e.g., gradient descent)
and a loss function (e.g., mean squared error) to guide the training
process.
4. Evaluation:
● After training, evaluate the performance of the trained MLP by
comparing its predictions with the true sinusoidal values for new,
unseen inputs.
By adjusting the number of neurons in the hidden layer, an MLP can
approximate the sinusoidal function with different levels of accuracy. As the
number of neurons increases, the MLP gains more expressive power, allowing it
to capture intricate patterns and achieve a closer approximation to the true
sinusoidal curve.
This example demonstrates the representation power of MLPs: they can learn
and approximate complex non-linear functions, making them valuable tools in
various applications, including regression, classification, and pattern
recognition, where the relationships within the data are non-linear and intricate.
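The effect of hidden-layer width on the sin(x) example can be sketched as follows. Note the swapped-in technique: instead of training all weights with gradient descent as described above, this sketch keeps the hidden-layer weights fixed at random values and solves only the linear output layer by least squares. That shortcut, the tanh activation, the weight scale, the seed, and the widths 1, 5, and 50 are all assumptions made to keep the example short and deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

# One hidden layer with up to 50 tanh units; hidden weights are random and fixed.
max_hidden = 50
W = rng.normal(0, 2, size=(1, max_hidden))
b = rng.normal(0, 2, size=(1, max_hidden))
H_full = np.tanh(x @ W + b)                  # hidden activations, shape (200, 50)


def fit_mse(n_hidden):
    """Fit a linear output layer on the first n_hidden units; return training MSE."""
    H = np.hstack([H_full[:, :n_hidden], np.ones((len(x), 1))])  # add output bias
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    return float(np.mean((H @ coef - y) ** 2))


errors = {n: fit_mse(n) for n in (1, 5, 50)}
for n, err in errors.items():
    print(f"{n:2d} hidden units -> training MSE {err:.3e}")
```

Because each wider model contains the narrower one as a special case, the training error cannot increase as units are added, which mirrors the statement that more hidden neurons give more expressive power.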
The following are some of the most commonly used activation functions:
● Sigmoid: g(z) = 1 / (1 + e^(-z))
● Tanh: g(z) = (e^z - e^(-z)) / (e^z + e^(-z))
● ReLU: g(z) = max(0, z)
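For reference, the three activation functions can be written in NumPy as below. This is a minimal sketch using the standard definitions; tanh is written out explicitly as (e^z - e^(-z)) / (e^z + e^(-z)) so it can be compared against NumPy's built-in np.tanh.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))


def relu(z):
    return np.maximum(0, z)


z = np.array([-2.0, 0.0, 2.0])
print("sigmoid:", sigmoid(z))   # values in (0, 1)
print("tanh:   ", tanh(z))      # values in (-1, 1)
print("relu:   ", relu(z))      # negative inputs clipped to 0
```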
UNIT II
(06Hrs)
Training of feedforward Neural Network - Representation Power of Feed forward Neural Networks,
Training of feed forward neural network, Gradient Descent, Gradient Descent (GD), Momentum
Based GD, Nesterov Accelerated GD, Stochastic GD, AdaGrad, RMSProp, Adam.
2.1 Gradient Descent: -

With its types: https://www.javatpoint.com/gradient-descent-in-machine-learning
Gradient Descent is one of the most commonly used optimization algorithms for training
machine learning models; it works by minimizing the error between actual and expected
results. It is also used to train neural networks.
In mathematical terminology, optimization refers to the task of minimizing or maximizing
an objective function f(x) parameterized by x. Similarly, in machine learning,
optimization is the task of minimizing the cost function parameterized by the model's
parameters. The main objective of gradient descent is to minimize the cost function
through iterative parameter updates; reaching the global minimum is guaranteed only for
convex cost functions (with a suitable learning rate). Once machine learning models are
optimized, they can be used as powerful tools for Artificial Intelligence and various
computer science applications.
Gradient Descent is defined as one of the most commonly used iterative optimization
algorithms of machine learning to train the machine learning and deep learning models. It
helps in finding the local minimum of a function.
The relationship between the gradient and the local minimum or local maximum of a
function is as follows:
o If we move in the direction of the negative gradient (away from the gradient) of the
function at the current point, we approach a local minimum of that function. This
procedure is known as Gradient Descent, also known as steepest descent.
o If we move in the direction of the positive gradient (towards the gradient) of the
function at the current point, we approach a local maximum of that function. This
procedure is known as Gradient Ascent.
The main objective of using a gradient descent algorithm is to minimize the
cost function using iteration. To achieve this goal, it performs two steps iteratively:
o Calculates the first-order derivative of the function to compute the gradient or
slope of that function.
o Moves in the direction opposite to the gradient, stepping from the current point by
alpha times the gradient, where alpha is defined as the Learning Rate. It is a tuning
parameter in the optimization process which decides the length of the steps.
What is Cost-function?
The cost function measures the difference or error between actual values and
expected values at the current position, expressed as a single real number.
Gradient descent iterates along the direction of the negative gradient until the cost
function is close to its minimum (ideally near zero). At this point, the gradient is
close to zero and the model stops learning further.
The slight difference between the loss function and the cost function is about the error
within the training of machine learning models, as loss function refers to the error of
one training example, while a cost function calculates the average error across an entire
training set.
How does Gradient Descent work?

Before starting the working principle of gradient descent, we should know some basic
concepts to find out the slope of a line from linear regression. The equation for simple
linear regression is given as:
Y = mX + c
Where 'm' represents the slope of the line, and 'c' represents the intercept on the
y-axis.
The starting point (shown in the above fig.) is just an arbitrary point used to evaluate
the performance. At this starting point, we compute the first derivative or slope, and
use a tangent line to measure the steepness of this slope. This slope then informs the
updates to the parameters (weights and bias).
The slope is steepest at the starting point, but as new parameters are generated, the
steepness gradually reduces until the curve reaches its lowest point, which is called
the point of convergence.
The main objective of gradient descent is to minimize the cost function or the error
between expected and actual values. To minimize the cost function, two things are
required:
o Direction & Learning Rate
These two factors determine the partial derivative calculations of future iterations
and allow the algorithm to reach the point of convergence (a local or global minimum).
Let's discuss the learning rate in brief:
Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. This is
typically a small value that is evaluated and updated based on the behavior of the cost
function. A high learning rate results in larger steps but risks overshooting the
minimum. A low learning rate takes small steps, which compromises overall efficiency
but gives the advantage of more precision.
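The procedure above can be sketched for the line Y = mX + c with a mean-squared-error cost. This is a minimal sketch: the function name gradient_descent, the noiseless toy data generated from m = 2 and c = 1, the learning rate of 0.1, and the 5000 steps are all assumptions chosen for illustration.

```python
import numpy as np


def gradient_descent(X, Y, lr=0.1, steps=5000):
    """Fit Y = m*X + c by batch gradient descent on the mean squared error."""
    m, c = 0.0, 0.0                            # arbitrary starting point
    n = len(X)
    history = []
    for _ in range(steps):
        error = (m * X + c) - Y
        history.append(float(np.mean(error ** 2)))   # cost at the current point
        dm = (2.0 / n) * np.sum(error * X)     # partial derivative w.r.t. m
        dc = (2.0 / n) * np.sum(error)         # partial derivative w.r.t. c
        m -= lr * dm                           # step against the gradient
        c -= lr * dc
    return m, c, history


X = np.linspace(0, 1, 20)
Y = 2 * X + 1                                  # toy data: slope 2, intercept 1

m, c, history = gradient_descent(X, Y)
print("m =", m, "c =", c)
print("first cost:", history[0], "last cost:", history[-1])
```

Increasing lr makes each step larger and speeds up convergence up to a point, after which the updates overshoot and the cost grows instead of shrinking.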
4. What is RMSProp?
RMSprop is a gradient-based method for optimizing the training of neural
networks; the gradients it uses are computed by backpropagation.
As data travels through very complicated functions, such as neural
networks, the resulting gradients often vanish or explode. RMSprop was
designed for stochastic mini-batch learning and addresses this by
normalizing each update by a moving average of recent gradient magnitudes.
● RMSprop (Root Mean Squared Propagation) is an optimization
algorithm used in deep learning and other Machine Learning
techniques.
It is a variant of the gradient descent algorithm that helps to improve the
convergence speed and stability of the model training process.
RMSProp algorithm
Like other gradient descent algorithms, RMSprop works by calculating the
gradient of the loss function with respect to the model’s parameters and
updating the parameters in the opposite direction of the gradient to
minimize the loss. However, RMSProp introduces a few additional
techniques to improve the performance of the optimization process.
One key feature is its use of a moving average of the squared gradients to
scale the learning rate for each parameter. This helps to stabilize the
learning process and prevent oscillations in the optimization trajectory.
The algorithm can be summarized by the following RMSProp formula:
v_t = decay_rate * v_{t-1} + (1 - decay_rate) * gradient^2
parameter = parameter - learning_rate * gradient / (sqrt(v_t) + epsilon)
Where:
● v_t is the moving average of the squared gradients;
● decay_rate is a hyperparameter that controls the decay rate of the
moving average;
● learning_rate is a hyperparameter that controls the step size of the
update;
● gradient is the gradient of the loss function with respect to the
parameter; and
● epsilon is a small constant added to the denominator to prevent
division by zero.
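The two update lines above can be sketched for a single scalar parameter. This is a hedged illustration: the function name rmsprop_minimize, the test function f(x) = x^2 (gradient 2x), the starting point, and the hyperparameter values are assumptions; only the update rule itself comes from the formula above.

```python
import math


def rmsprop_minimize(grad_fn, x0, learning_rate=0.01, decay_rate=0.9,
                     epsilon=1e-8, steps=2000):
    """Apply the RMSProp update rule to a single scalar parameter."""
    x = x0
    v = 0.0                                    # moving average of squared gradients
    for _ in range(steps):
        g = grad_fn(x)
        v = decay_rate * v + (1 - decay_rate) * g * g
        x = x - learning_rate * g / (math.sqrt(v) + epsilon)
    return x


# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 5.
x_final = rmsprop_minimize(lambda x: 2 * x, x0=5.0)
print(x_final)
```

Because the gradient is divided by its own running magnitude, the step length stays close to learning_rate regardless of how large or small the raw gradient is, so with a fixed learning rate the parameter ends up hovering near the minimum rather than settling exactly on it.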
Adam vs RMSProp
RMSProp is often compared to the Adam (Adaptive Moment Estimation)
optimization algorithm, another popular optimization method for deep
learning. Both algorithms use adaptive per-parameter learning rates based
on a moving average of the squared gradients. Adam additionally keeps a
moving average of the gradients themselves (a momentum term) and applies
bias correction to both averages. Adam is generally more widely used than
the RMSProp optimizer, but both algorithms can be effective in different
settings.
RMSProp advantages
● Fast convergence. RMSprop often converges quickly, meaning it can
find good solutions to optimization problems in fewer iterations than
plain gradient descent. This can be especially useful for training
large or complex models, where training time is a critical concern.
● Stable learning. The use of a moving average of the squared
gradients in RMSprop helps to stabilize the learning process and
dampen oscillations in the optimization trajectory. This can make the
optimization process more robust and less prone to diverging.
● Few hyperparameters. RMSprop has few hyperparameters, which makes
it easier to tune and use in practice. The main hyperparameters in
RMSprop are the learning rate and the decay rate, which can be chosen
using techniques like grid search or random search.
● Good performance on non-convex problems. RMSprop tends to
perform well on non-convex optimization problems, which are common in
Machine Learning and deep learning. Non-convex optimization
problems can have multiple local minima, and RMSprop's adaptive
step sizes can help it find good solutions even in these cases.
Overall, RMSprop is a powerful and widely used optimization algorithm that
can be effective for training a variety of Machine Learning models,
especially deep learning models.
UNIT IV
(06Hrs)
Convolution Neural Network (CNN) - Convolutional operation, Pooling, LeNet, AlexNet,
ZF-Net, VGGNet, GoogLeNet, ResNet. Visualizing Convolutional Neural Networks, Guided
Backpropagation.
4.1 Convolution Neural Network

A Convolutional Neural Network (CNN) is an extension of artificial neural networks (ANN)
that is predominantly used to extract features from grid-like data, for example visual
datasets such as images or videos, where spatial patterns play an important role.
CNN architecture
Convolutional Neural Network consists of multiple layers like the input layer, Convolutional layer,
Pooling layer, and fully connected layers.
Simple CNN architecture
The Convolutional layer applies filters to the input image to extract features, the Pooling layer
downsamples the image to reduce computation, and the fully connected layer makes the final
prediction. The network learns the optimal filters through backpropagation and gradient descent.
Convolutional Neural Network (CNN) Architecture:
1. Introduction:
● CNNs are a class of deep neural networks designed for tasks such as image
recognition, object detection, and image classification.
● They are particularly effective in capturing spatial hierarchies of features using
convolutional layers.
2. Basic Components:
● Convolutional Layers:
● The core building blocks of CNNs are convolutional layers. These layers use
convolutional operations to scan input data with learnable filters (kernels) to
detect patterns and features.
● Filters are small, learnable matrices that slide over the input data to perform
convolution operations. The result is feature maps that represent learned
features.
● Activation Functions:
● Non-linear activation functions like ReLU (Rectified Linear Unit) are applied
after convolutional operations to introduce non-linearity and enable the
network to learn complex relationships.
● Pooling Layers:
● Pooling layers reduce the spatial dimensions of the feature maps, and therefore
the computation and the number of parameters in later layers, while making the
detected features less sensitive to small shifts and distortions in the input.
● Max pooling is a common pooling operation, selecting the maximum value
from a group of neighbouring pixels.
3. Architectural Layers:
● Input Layer:
● The input layer represents the raw input data, such as an image. The
dimensions of the input layer depend on the size and color channels of the
input images.
● Convolutional Blocks:
● Convolutional layers are grouped into blocks, each typically consisting of
convolutional operations followed by activation functions.
● Multiple blocks may be stacked to capture complex hierarchical features.
● Pooling Layers:
● After convolutional blocks, pooling layers downsample the spatial
dimensions of the feature maps.
● Fully Connected (Dense) Layers:
● After convolutional and pooling layers, fully connected layers process the
flattened feature maps and make predictions.
● Dense layers are used for classification tasks, and their neurons are
connected to every neuron in the previous layer.
● Output Layer:
● The output layer produces the final prediction based on the task. For
classification, it may involve a softmax activation function to yield class
probabilities.
4. Common CNN Architectures:
● LeNet-5:
● Developed by Yann LeCun and colleagues, LeNet-5 is one of the early CNN
architectures. It consists of convolutional and pooling layers, and it was
primarily designed for handwritten digit recognition.
● AlexNet:
● AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, is a
deeper CNN architecture with more layers. It won the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) in 2012 and played a crucial role in popularizing
CNNs.
● VGGNet:
● VGGNet, created by the Visual Geometry Group at the University of Oxford,
features uniform convolutional blocks and is known for its simplicity. It was the
runner-up in the classification task of ILSVRC 2014.
● ResNet (Residual Network):
● ResNet, proposed by Microsoft Research, introduces residual connections,
allowing the network to learn residual functions. This architecture mitigates
the vanishing gradient problem and enables training very deep networks.
● InceptionNet (GoogLeNet):
● GoogLeNet, developed by Google, introduces the concept of inception
modules, utilizing multiple filter sizes within the same layer. It won the
classification task of ILSVRC 2014.
● MobileNetV2:
● MobileNetV2 is designed for mobile and edge devices, featuring lightweight
depthwise separable convolutions. It balances model size and accuracy.
5. Training:
● CNNs are trained using backpropagation and optimization algorithms such as
stochastic gradient descent (SGD) or its variants.
● Large labeled datasets are crucial for effective training.
6. Transfer Learning:
● CNNs often leverage transfer learning by using pre-trained models on large datasets
like ImageNet. Fine-tuning is applied on specific tasks, saving training time and
resources.
Why should we use CNN?

Problem with Feedforward Neural Network:
• Suppose you are working with the MNIST dataset. Each image in MNIST is 28 x
28 x 1 (a grayscale image contains only 1 channel).
• The total number of neurons in the input layer will be 28 x 28 = 784, which is manageable.
• What if the size of the image is 1000 x 1000? Then you need 10⁶ neurons in the input
layer.
• This is a huge number of neurons, and every neuron in the first hidden layer needs
one weight per input neuron.
• It is computationally inefficient.
• So here comes the Convolutional Neural Network or CNN.
• In simple words, a CNN extracts the features of the image and converts it into a
lower dimension without losing its characteristics.
• In the following example you can see that the initial size of the image is 224 x 224 x
3. If you proceed without convolution then you need 224 x 224 x 3 = 150,528
neurons in the input layer, but after applying convolution, your input
dimension is reduced to 1 x 1 x 1000.
• It means you only need 1000 neurons in the first layer of the feedforward neural network.
• A convolutional neural network is used to detect and classify objects in an image.
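The size arithmetic in the points above can be checked with a short sketch. The image sizes come from the text; the 1000-unit hidden layer and the single convolutional layer with 64 filters of size 3 x 3 are hypothetical values introduced only to compare weight counts.

```python
def input_units(height, width, channels):
    """Number of input neurons needed when an image is flattened."""
    return height * width * channels


mnist = input_units(28, 28, 1)           # 784
large = input_units(1000, 1000, 1)       # 1,000,000
rgb_224 = input_units(224, 224, 3)       # 150,528

# Hypothetical comparison: weights in one fully connected layer of 1000 units
# versus one convolutional layer of 64 filters, each 3 x 3 over 3 channels.
hidden_units = 1000
dense_weights = rgb_224 * hidden_units   # one weight per (input, hidden unit) pair
conv_weights = 3 * 3 * 3 * 64            # filter weights are shared across positions

print(mnist, large, rgb_224)
print("dense weights:", dense_weights, "conv weights:", conv_weights)
```

The gap between the two weight counts comes from weight sharing: a convolutional filter reuses the same few weights at every position of the image.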
Q)Convolution Operation in CNN:

Convolution is a mathematical operation that involves combining two functions to produce a
third. In the context of CNNs and image processing, convolution is used to extract features
from an input image. The idea is to slide a small filter (also known as a kernel) over the input
image and perform element-wise multiplication and summation to produce an output,
known as the feature map.
Here's how the convolution operation works step by step:
Input Image: Consider a grayscale image as our input. Each pixel in the image has an
intensity value (e.g., ranging from 0 to 255).
Filter (Kernel): The filter is a small matrix with learnable weights. It is smaller than the input
image and is usually square (e.g., 3x3 or 5x5). The filter's values determine what features it
detects.
Sliding: The filter is slid over the input image in a specified manner. At each position,
element-wise multiplication is performed between the filter's values and the corresponding
pixel values in the image region covered by the filter.
Summation: After element-wise multiplication, the resulting values are summed up to get a
single value.
Feature Map: The sum is placed in the output (feature map) at the position corresponding to
the center of the filter's current location.
The process is repeated for every possible position of the filter over the input image. This
results in a new image-like structure, the feature map, where each value represents the
response of the filter to a specific feature in the input image.
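The sliding, multiplication, and summation steps can be sketched as a plain NumPy loop. This is a minimal sketch under stated assumptions: the function name conv2d_valid, the 6 x 6 test image (bright left half, dark right half), and the 3 x 3 vertical-edge kernel are illustrative choices. As is common in CNNs, the kernel is not flipped, and no padding is used, so the output index (i, j) refers to the filter's top-left position.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = image[i:i + kh, j:j + kw]       # part covered by the filter
            out[i, j] = np.sum(region * kernel)      # multiply element-wise, then sum
    return out


# Assumed test image: bright left half (10), dark right half (0).
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
# Vertical-edge filter: positive left column, negative right column.
kernel = np.array([[1, 0, -1]] * 3, dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The feature map is large only in the columns where the filter straddles the bright-to-dark boundary, which is how the filter responds to a vertical edge.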
Edge Detection Example

The early layers of a neural network detect edges from an image. Deeper
layers might be able to detect parts of objects, and even deeper layers
might detect complete objects (like a person's face).
Step1- Convolution Layer Operation:-
In this section, we will focus on how the edges can be detected from an
image. Suppose we are given the below image:
As you can see, there are many vertical and horizontal edges in the image.
The first thing to do is to detect these edges:
But how do we detect these edges? To illustrate this, let's take a 6 X 6
grayscale image (i.e. only one channel):
Next, we convolve this 6 X 6 matrix with a 3 X 3 filter:
After the convolution, we will get a 4 X 4 image. The first element of the 4
X 4 matrix will be calculated as:
So, we take the first 3 X 3 matrix from the 6 X 6 image and multiply it with
the filter. Now, the first element of the 4 X 4 output will be the sum of the
element-wise product of these values, i.e. 3*1 + 0 + 1*-1 + 1*1 + 5*0 + 8*-1
+ 2*1 + 7*0 + 2*-1 = -5. To calculate the second element of the 4 X 4
output, we will shift our filter one step towards the right and again get the
sum of the element-wise product:
Similarly, we will convolve over the entire image and get a 4 X 4 output:
So, convolving a 6 X 6 input with a 3 X 3 filter gave us an output of 4 X 4.
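The arithmetic for the first output element can be checked with a short sketch. The 3 X 3 patch and filter values below are read off the products in the sum above; the middle entry of the first row is assumed to be 0, since it is multiplied by a filter weight of 0 and does not affect the result.

```python
import numpy as np

# Top-left 3 X 3 patch of the 6 X 6 image (values taken from the sum in the text)
patch = np.array([[3, 0, 1],
                  [1, 5, 8],
                  [2, 7, 2]])

# 3 X 3 vertical edge filter implied by the multipliers 1, 0, -1 in the text
vertical_edge_filter = np.array([[1, 0, -1],
                                 [1, 0, -1],
                                 [1, 0, -1]])

first_element = int(np.sum(patch * vertical_edge_filter))
output_size = 6 - 3 + 1  # side length of the output for a 6 X 6 input and 3 X 3 filter

print(first_element)  # -5
print(output_size)    # 4
```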
Consider one more example:
Note: Higher pixel values represent the brighter portion of the image and
the lower pixel values represent the darker portions. This is how we can
detect a vertical edge in an image.
Step2- What Is a Pooling Layer?
Similar to the Convolutional Layer, the Pooling layer is responsible for
reducing the spatial size of the Convolved Feature. This is to decrease the
computational power required to process the data by reducing the
dimensions. There are two types of pooling: average pooling and max
pooling. I have only used Max Pooling so far and have not faced any
difficulties with it.
So what we do in Max Pooling is find the maximum value from the portion of
the image covered by the kernel. Max Pooling also acts as a noise
suppressant: it discards the noisy activations altogether, performing
de-noising along with dimensionality reduction.
On the other hand, Average Pooling returns the average of all the
values from the portion of the image covered by the kernel. Average
Pooling reduces dimensionality but suppresses noise only by averaging it
with the signal. Hence, Max Pooling often performs better than Average
Pooling in practice, although this depends on the task.
More Edge Detection
The type of filter that we choose helps to detect the vertical or horizontal
edges. We can use the following filters to detect different edges:
Some of the commonly used filters are:

The Sobel filter puts a little bit more weight on the central pixels. Instead
of using these filters, we can create our own as well and treat them as a
parameter which the model will learn using backpropagation.
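As a rough illustration of how the choice of filter picks out vertical or horizontal edges, the sketch below applies three hand-written filters to a small image. The 6 X 6 image (bright left half, dark right half), the helper name and the sign convention of the Sobel kernels are assumptions made for this example; the figures in the original notes are not reproduced here.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Square-input convolution with no padding and stride 1."""
    k = kernel.shape[0]
    size = image.shape[0] - k + 1
    return np.array([[np.sum(image[i:i + k, j:j + k] * kernel)
                      for j in range(size)] for i in range(size)])

# Hypothetical image: bright (10) on the left, dark (0) on the right
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)

simple_vertical = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
# Sobel puts more weight on the central row
sobel_vertical = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])
# Transposing a vertical edge filter gives the horizontal version
sobel_horizontal = sobel_vertical.T

print(convolve2d_valid(image, simple_vertical)[0])   # [ 0 30 30  0]
print(convolve2d_valid(image, sobel_vertical)[0])    # [ 0 40 40  0]
print(convolve2d_valid(image, sobel_horizontal)[0])  # [0 0 0 0]
```

The vertical filters respond strongly in the middle columns, where the brightness changes, while the horizontal filter gives no response because this image has no horizontal edge.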
Padding
We have seen that convolving an input of 6 X 6 dimension with a 3 X 3
filter results in 4 X 4 output. We can generalize it and say that if the input
is n X n and the filter size is f X f, then the output size will be (n-f+1) X
(n-f+1):
● Input: n X n
● Filter size: f X f
● Output: (n-f+1) X (n-f+1)
There are primarily two disadvantages here:
1. Every time we apply a convolution operation, the size of the image
shrinks
2. Pixels in the corners of the image are used far fewer times during
convolution than the central pixels. Hence, information near the
borders is under-represented in the output, which can lead
to information loss
To overcome these issues, we can pad the image with an additional border,
i.e., we add one pixel all around the edges. This means that the input will be
an 8 X 8 matrix (instead of a 6 X 6 matrix). Applying convolution of 3 X 3 on
it will result in a 6 X 6 matrix which is the original shape of the image. This
is where padding comes to the fore:
● Input: n X n
● Padding: p
● Filter size: f X f
● Output: (n+2p-f+1) X (n+2p-f+1)
There are two common choices for padding:
1. Valid: It means no padding. If we are using valid padding, the output
will be (n-f+1) X (n-f+1)
2. Same: Here, we apply padding so that the output size is the same as
the input size, i.e.,
n+2p-f+1 = n
So, p = (f-1)/2 (which is a whole number only when f is odd)
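The two formulas above can be wrapped in small helper functions to check the numbers used in this section. The function names are hypothetical, and a stride of 1 is assumed throughout.

```python
def conv_output_size(n, f, p=0):
    """Output side length for an n X n input, f X f filter, padding p, stride 1."""
    return n + 2 * p - f + 1

def same_padding(f):
    """Padding that keeps the output the same size as the input (f must be odd)."""
    if f % 2 == 0:
        raise ValueError("same padding with p = (f-1)/2 needs an odd filter size")
    return (f - 1) // 2

print(conv_output_size(6, 3))        # 4  (valid: no padding)
print(same_padding(3))               # 1
print(conv_output_size(6, 3, p=1))   # 6  (same: output matches input)
```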
We now know how to use padded convolution. This way we don’t lose a lot
of information near the borders and the image does not shrink either.
Fully Connected Layer
So far we have performed the Feature Extraction steps; now comes the
Classification part. The Fully Connected layer (as we have in an ANN) is used
for classifying the input image into a label. This layer connects the
information extracted in the previous steps (i.e., the Convolution and Pooling
layers) to the output layer and eventually classifies the input into the
desired label.
The complete process of a CNN model can be seen in the below image.
Q)Explain pooling layer with suitable example in convolutional
network.
→
A pooling layer is another important component of Convolutional
Neural Networks (CNNs) that follows convolutional layers. Its main
purpose is to reduce the spatial dimensions of the input feature
maps while retaining the most important information. Pooling is used
for downsampling and dimensionality reduction, which helps in
controlling the number of parameters in the network and reducing
computation.
Pooling Operation in CNN:
The pooling operation involves dividing the input feature map into
non-overlapping or overlapping regions and then performing an
aggregation operation (like max or average pooling) within each
region. The result is a pooled or downsampled version of the input
feature map.
There are two common types of pooling operations: max pooling and
average pooling.
1. Max Pooling: In max pooling, for each region, the maximum
value within that region is selected and placed in the pooled
feature map. Max pooling helps capture the most prominent
features within the region.
2. Average Pooling: In average pooling, the average value of all the
values within the region is calculated and placed in the pooled
feature map. Average pooling helps maintain a smoother
representation of the input.
Example:
Let's take a simple 4x4 feature map as an example:
Input Feature Map:
| 2 | 4 | 1 | 3 |
| 7 | 5 | 9 | 2 |
| 8 | 3 | 6 | 5 |
| 1 | 6 | 2 | 4 |
We will use max pooling with a 2x2 window and a stride of 2. This
means we will divide the feature map into non-overlapping 2x2
regions and select the maximum value from each region to create the
pooled feature map.
The process goes as follows:
1. First, apply the 2x2 max pooling window to the top-left 2x2
region: Max value = 7.
2. Move the window to the top-right 2x2 region: Max value = 9.
3. Move the window to the bottom-left 2x2 region: Max value = 8.
4. Move the window to the bottom-right 2x2 region: Max value =
6.
The resulting pooled feature map would look like this:
Pooled Feature Map:
| 7 | 9 |
| 8 | 6 |
In this example, the pooling operation reduced the spatial
dimensions of the feature map from 4x4 to 2x2, effectively
downsampling the data. Max pooling helped retain the most
important information within each 2x2 region.
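The worked example can be reproduced with a short NumPy sketch. The function name `pool2d` and its parameters are assumptions for illustration; the input values are the 4x4 feature map from the example, and average pooling is included for comparison.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling with a square window."""
    reduce_fn = np.max if mode == "max" else np.mean
    out_rows = (feature_map.shape[0] - size) // stride + 1
    out_cols = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            r, c = i * stride, j * stride
            window = feature_map[r:r + size, c:c + size]
            pooled[i, j] = reduce_fn(window)
    return pooled

feature_map = np.array([[2, 4, 1, 3],
                        [7, 5, 9, 2],
                        [8, 3, 6, 5],
                        [1, 6, 2, 4]])

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled)  # [[7. 9.] [8. 6.]]
print(avg_pooled)  # [[4.5 3.75] [4.5 4.25]]
```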
Pooling layers are typically inserted between convolutional layers in a
CNN architecture. They help in reducing the computation and
memory requirements of the network while preserving important
features for subsequent layers to work with.
Advantages of Max Pooling:
1. Translation Invariance: Max pooling helps make the network
less sensitive to small translations in the input data. Since only
the maximum value within a pooling window is retained, small
shifts in the input won't significantly affect the pooled output.
2. Feature Selection: Max pooling retains the most dominant and
important features within each local region of the input. It helps
the network focus on detecting key patterns.
3. Downsampling: Max pooling reduces the spatial dimensions of
the input, reducing computational requirements and memory
usage while retaining essential features.
Q) Explain in detail about - AlexNet, GoogleNet, VGGNet and
MobileNet with suitable diagrams
→
What is AlexNet?
AlexNet is the name given to a Convolutional Neural Network
architecture that won the ILSVRC competition in 2012.
ILSVRC (ImageNet Large Scale Visual Recognition Challenge) is a
competition where research teams evaluate their algorithms on a huge
dataset of labeled images (ImageNet) and compete to achieve higher
accuracy on several visual recognition tasks. AlexNet's win had a huge
impact on how teams approached the competition afterward.
The Architecture of AlexNet

AlexNet contains 8 layers with weights:
5 convolutional layers
3 fully connected layers.
ReLU activation is applied at the end of each layer except the last
one, which outputs a softmax distribution over the 1000 class labels.
Dropout is applied in the first two fully connected layers. As the
figure above shows, max-pooling is also applied after the first,
second, and fifth convolutional layers. The kernels of the second,
fourth, and fifth convolutional layers are connected only to those
kernel maps in the previous layer that reside on the same GPU. The
kernels of the third convolutional layer are connected to all kernel
maps in the second layer. The neurons in the fully connected layers
are connected to all neurons in the previous layer.
ReLU
An important feature of AlexNet is the use of the ReLU (Rectified
Linear Unit) nonlinearity.
Tanh or sigmoid activation functions used to be the usual choice when
training a neural network model.
AlexNet showed that with the ReLU nonlinearity, deep CNNs could be
trained much faster than with saturating activation functions like
tanh or sigmoid; the authors demonstrated this on the CIFAR-10 dataset.
Let's see why it trains faster with the ReLUs. The ReLU function is
given by
f(x) = max(0,x)
Plots of the two functions, tanh and ReLU (image credits:
www.learnopencv.com)
The tanh function saturates at very high or very low values of z. In
these regions, the slope of the function goes very close to zero. This
can slow down gradient descent.
The ReLU function’s slope stays at 1 for all positive values of z, so
it does not saturate there. This helps the optimization to converge
faster. For negative values of z the slope is zero, but in practice
many of the neurons in a network receive positive inputs, so learning
can still proceed.
ReLU wins over the sigmoid function, too, for the same reason.
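The saturation argument can be made concrete by evaluating the slopes of both functions at a few points. This is only an illustrative sketch; the sample values of z are arbitrary.

```python
import numpy as np

z = np.array([-5.0, -1.0, 0.5, 1.0, 5.0])

# d/dz tanh(z) = 1 - tanh(z)^2, which approaches 0 for large |z|
tanh_grad = 1.0 - np.tanh(z) ** 2

# d/dz max(0, z) is 1 for z > 0 and 0 for z < 0
relu_grad = (z > 0).astype(float)

for value, t, r in zip(z, tanh_grad, relu_grad):
    print(f"z = {value:5.1f}   tanh slope = {t:.6f}   ReLU slope = {r:.1f}")
```

At z = 5 the tanh slope is already below 0.001, while the ReLU slope is still exactly 1.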
The Overfitting Problem
AlexNet had 60 million parameters, which made overfitting a major
issue.
Two methods were used to reduce overfitting:
1. Data Augmentation
2. Dropout.
Data Augmentation
The authors generated image translations and horizontal reflections,
which increased the training set by a factor of 2048. They also
performed Principal Component Analysis (PCA) on the RGB pixel values
to change the intensities of the RGB channels, which reduced the top-1
error rate by more than 1%.
Dropout
The second technique that AlexNet used to avoid overfitting was
dropout. It consists of setting the output of each hidden neuron to
zero with a probability of 0.5. The neurons which are “dropped out” in
this way do not contribute to the forward pass and do not participate
in backpropagation. So every time an input is presented, the neural
network samples a different architecture.
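A minimal sketch of dropout on a layer's activations is given below. It uses the "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at test time; this is an assumption made for the example (the AlexNet paper instead multiplies the outputs by 0.5 at test time). The function name and arguments are hypothetical.

```python
import numpy as np

def dropout_forward(activations, drop_prob=0.5, training=True, rng=None):
    """Zero each activation with probability `drop_prob` (inverted dropout)."""
    if not training or drop_prob == 0.0:
        # At test time all neurons are used and nothing is dropped
        return activations, None
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - drop_prob
    # Mask entries are 0 for dropped neurons and 1/keep_prob for kept ones
    mask = (rng.random(activations.shape) < keep_prob) / keep_prob
    return activations * mask, mask

hidden = np.ones(8)
dropped, mask = dropout_forward(hidden, drop_prob=0.5, rng=np.random.default_rng(0))
print(dropped)  # roughly half the entries are 0, the rest are scaled to 2.0
```

The same mask would be applied to the gradients in the backward pass, which is why dropped neurons do not participate in backpropagation.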
Pros of AlexNet
1. AlexNet is considered a milestone of CNNs for image
classification.
2. Many of its methods, such as the conv + pooling design, dropout,
GPU parallel computing, and ReLU, are still industry standards
for computer vision.
3. A distinctive advantage of AlexNet is that images are fed
directly into the classification model.
4. The convolution layers automatically extract features such as
edges from the images, and the fully connected layers learn from
these features.
5. In theory, more complex visual patterns can be extracted
effectively by adding more convolutional layers.
Cons of AlexNet
1. AlexNet is NOT deep enough compared to later models such
as VGGNet, GoogLeNet, and ResNet.
2. The use of large convolution filters (5*5) was discouraged
shortly afterwards.
3. Initializing the weights from a normal distribution does not
effectively solve the vanishing gradient problem; it was later
replaced by the Xavier method.
4. Its performance is surpassed by more complex models such as
GoogLeNet (6.7% top-5 error) and ResNet (3.6% top-5 error).
GoogLeNet: -
GoogLeNet (or Inception V1) was proposed by researchers at Google
(in collaboration with various universities) in 2014 in the research
paper titled “Going Deeper with Convolutions”. This architecture was
the winner of the ILSVRC (ImageNet Large Scale Visual Recognition
Challenge) 2014 image classification challenge. It achieved a
significantly lower error rate than previous winners AlexNet
(winner of ILSVRC 2012) and ZF-Net (winner of ILSVRC 2013), and a
lower error rate than VGG (2014 runner-up). The architecture uses
techniques such as 1×1 convolutions in the middle of the
architecture and global average pooling.
Features of GoogleNet:
The GoogLeNet architecture is very different from previous
state-of-the-art architectures such as AlexNet and ZF-Net. It uses
methods such as 1×1 convolution and global average pooling that
enable it to build a deeper architecture. Some of these methods are
discussed below:
● 1×1 convolution: The inception architecture uses 1×1
convolutions to decrease the number of parameters (weights and
biases) of the architecture. Reducing the parameters in this way
makes it affordable to increase the depth of the architecture.
Let’s look at an example of a 1×1 convolution below:
● For example, if we want to perform a 5×5 convolution with 48
filters on a 14×14×480 input without using a 1×1 convolution as
an intermediate step:
● Total number of operations: (14 x 14 x 48) x (5 x 5 x 480) =
112.9M
● With a 1×1 convolution (16 filters) as an intermediate step:
● (14 x 14 x 16) x (1 x 1 x 480) + (14 x 14 x 48) x (5 x 5 x 16) = 1.5M
+ 3.8M = 5.3M, which is much smaller than 112.9M.
● Global Average Pooling:
In previous architectures such as AlexNet, fully connected
layers are used at the end of the network. These fully
connected layers contain the majority of the parameters of
many architectures, which increases the computation cost.
In the GoogLeNet architecture, global average pooling is used
at the end of the network instead. This layer takes a feature
map of 7×7 and averages it to 1×1. The pooling layer itself has
no trainable parameters, and it improves the top-1 accuracy
by 0.6%
● Inception Module: **Above Img IMP.
The inception module is different from previous architectures
such as AlexNet and ZF-Net, in which each layer has a fixed
convolution size.
In the Inception module, 1×1, 3×3 and 5×5 convolutions and 3×3
max pooling are performed in parallel on the input, and their
outputs are stacked together to generate the final output. The
idea is that convolution filters of different sizes will handle
objects at multiple scales better.
● Auxiliary Classifier for Training:
The Inception architecture uses some intermediate classifier
branches in the middle of the architecture; these branches are
used during training only. Each branch consists of a 5×5
average pooling layer with a stride of 3, a 1×1 convolution
with 128 filters, two fully connected layers of 1024 outputs and
1000 outputs, and a softmax classification layer. The loss of
these branches is added to the total loss with a weight of 0.3.
These branches help combat the vanishing gradient problem and
also provide regularization.
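The operation counts quoted in the 1×1 convolution example above can be verified with a few lines of arithmetic. The helper name `conv_ops` is hypothetical; it simply multiplies the number of output values by the number of multiplications needed for each one, exactly as in the formulas given in that example.

```python
def conv_ops(out_h, out_w, out_channels, kernel_size, in_channels):
    """Multiplications for a convolution: (output values) x (work per value)."""
    return (out_h * out_w * out_channels) * (kernel_size * kernel_size * in_channels)

# Direct 5x5 convolution: 480 input channels -> 48 output channels
direct = conv_ops(14, 14, 48, 5, 480)

# 1x1 convolution down to 16 channels, then 5x5 convolution up to 48 channels
reduce_1x1 = conv_ops(14, 14, 16, 1, 480)
conv_5x5 = conv_ops(14, 14, 48, 5, 16)
with_bottleneck = reduce_1x1 + conv_5x5

print(f"direct:          {direct / 1e6:.1f}M")           # 112.9M
print(f"with 1x1 first:  {with_bottleneck / 1e6:.1f}M")  # 5.3M
```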
What is VGG?
VGG stands for Visual Geometry Group; it is a
standard deep Convolutional Neural Network (CNN)
architecture with multiple layers. The “deep” refers to the
number of layers, with VGG-16 and VGG-19 having 16 and
19 weight layers respectively. The VGG architecture is the
basis of ground-breaking object recognition models. Developed
as a deep neural network, the VGGNet also surpasses baselines
on many tasks and datasets beyond ImageNet. Moreover, it is
still one of the most popular image recognition architectures.
VGG Neural Network Architecture – Source
Read more
at: https://viso.ai/deep-learning/vgg-very-deep-convolutional-networks/
VGG network is constructed with very small convolutional filters. The
VGG-16 consists of 13 convolutional layers and three fully connected
layers.
Let’s take a brief look at the architecture of VGG:
● Input: The VGGNet takes in an image input size of 224×224. For the
ImageNet competition, the creators of the model cropped out the
center 224×224 patch in each image to keep the input size of the
image consistent.
● Convolutional Layers: VGG’s convolutional layers leverage a minimal
receptive field, i.e., 3×3, the smallest possible size that still captures
up/down and left/right. Moreover, there are also 1×1 convolution
filters acting as a linear transformation of the input. This is followed
by a ReLU unit, an innovation introduced by AlexNet that reduces
training time. ReLU stands for rectified linear unit activation function;
it is a piecewise linear function that outputs the input if it is positive;
otherwise, the output is zero. The convolution stride is fixed at 1 pixel
to keep the spatial resolution preserved after convolution (stride is
the number of pixels by which the filter shifts over the input matrix).
● Hidden Layers: All the hidden layers in the VGG network use ReLU.
VGG does not usually leverage Local Response Normalization (LRN)
as it increases memory consumption and training time. Moreover, it
makes no improvements to overall accuracy.
● Fully-Connected Layers: The VGGNet has three fully connected
layers. Out of the three layers, the first two have 4096 channels each,
and the third has 1000 channels, 1 for each class.
VGG16 Architecture
The number 16 in the name VGG16 refers to the fact that it is a
16-layer deep neural network (VGGNet). This means that VGG16 is a pretty
extensive network and has a total of around 138 million parameters.
Even according to modern standards, it is a huge network. However,
the simplicity of the VGG16 architecture is what makes the network
appealing. Just by looking at its architecture, it can be said that it is
quite uniform.
The number of filters doubles with each stack of convolution layers
until it reaches 512. This is a major principle used to design the
architecture of the VGG16 network. One of the crucial downsides of the
VGG16 network is that it is a huge network, which means that it takes
more time to train its parameters.
Because of its depth and number of fully connected layers, the VGG16
model is more than 533MB. This makes implementing a VGG network a
time-consuming task.
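The figure of around 138 million parameters can be reproduced by counting weights and biases layer by layer. The channel widths below are assumed from the standard VGG16 configuration (they are not listed in these notes), together with a 224×224 RGB input and five 2×2 max-pooling stages that reduce the feature map to 7×7×512 before the fully connected layers.

```python
# Output channels of the 13 convolutional layers (assumed standard VGG16 layout)
conv_channels = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
# Output sizes of the three fully connected layers
fc_sizes = [4096, 4096, 1000]

conv_params = 0
in_channels = 3  # RGB input
for out_channels in conv_channels:
    # 3x3 kernel weights plus one bias per output channel
    conv_params += 3 * 3 * in_channels * out_channels + out_channels
    in_channels = out_channels

fc_params = 0
in_features = 7 * 7 * 512  # flattened feature map after the last pooling layer
for out_features in fc_sizes:
    fc_params += in_features * out_features + out_features
    in_features = out_features

total_params = conv_params + fc_params
print(f"conv: {conv_params:,}  fc: {fc_params:,}  total: {total_params:,}")
```

Under these assumptions the fully connected layers account for the large majority of the parameters, which is consistent with the remark above about their contribution to the model size.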
MobileNet:
MobileNet is designed for efficient computation on mobile and
embedded devices, focusing on reducing the number of operations
and parameters while maintaining good accuracy. Key features
include:
1. Depthwise Separable Convolutions: MobileNet uses depthwise
separable convolutions that split standard convolutions into
depthwise and pointwise convolutions. This drastically reduces
computation.
2. Width Multiplier and Resolution Multiplier: These parameters
allow trade-offs between accuracy and computational cost.
Width multiplier controls the number of channels, and
resolution multiplier controls input resolution.
3. Bottleneck Architecture: MobileNet uses a bottleneck
architecture with 1x1 convolutions to reduce the number of
input channels before performing more expensive operations.
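To see why depthwise separable convolutions reduce computation so sharply, the sketch below compares multiplication counts for a standard convolution and its depthwise plus pointwise replacement. The layer sizes (3×3 kernel, 32 input channels, 64 output channels, 56×56 feature map) are hypothetical values chosen for illustration, and the function names are assumptions.

```python
def standard_conv_cost(k, in_ch, out_ch, feat):
    """Multiplications for a standard k x k convolution on a feat x feat map."""
    return k * k * in_ch * out_ch * feat * feat

def depthwise_separable_cost(k, in_ch, out_ch, feat):
    """Depthwise k x k convolution per channel, then 1x1 pointwise convolution."""
    depthwise = k * k * in_ch * feat * feat
    pointwise = in_ch * out_ch * feat * feat
    return depthwise + pointwise

k, in_ch, out_ch, feat = 3, 32, 64, 56  # hypothetical layer
standard = standard_conv_cost(k, in_ch, out_ch, feat)
separable = depthwise_separable_cost(k, in_ch, out_ch, feat)
ratio = separable / standard

print(standard, separable)
print(f"ratio = {ratio:.4f}")  # equals 1/out_ch + 1/k^2
```

For a 3×3 kernel the ratio is close to 1/9 whenever the number of output channels is large, so the separable version needs roughly 8 to 9 times fewer multiplications.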
There are two types of convolution layers in the MobileNet V2
architecture:
● 1x1 Convolution
● 3x3 Depthwise Convolution
The MobileNet V2 model has two different kinds of block.
Each block has 3 different layers:
● 1x1 Convolution with ReLU6
● Depthwise Convolution
● 1x1 Convolution without any non-linearity
What is LeNet 5?
LeNet is a convolutional neural network that Yann LeCun
introduced in 1989. The name LeNet is commonly used to refer to
LeNet-5, a simple convolutional neural network.
LeNet-5 signifies CNN’s emergence and outlines its core
components. However, it was not popular at the time due to a lack
of hardware, especially GPUs (Graphics Processing Units,
specialised electronic circuits designed to rapidly manipulate
memory to accelerate the creation of images in a frame buffer
intended for output to a display device), and because alternative
algorithms, like SVM, could achieve results similar to or even
better than those of LeNet.
Features of LeNet-5
● Every convolutional layer includes three parts: convolution,
pooling, and nonlinear activation functions
● Using convolution to extract spatial features (Convolution was
called receptive fields originally)
● The average pooling layer is used for subsampling.
● ‘tanh’ is used as the activation function
● Using Multi-Layered Perceptron or Fully Connected Layers as
the last classifier
● The sparse connection between layers reduces the complexity
of computation
Architecture
The LeNet-5 CNN architecture has seven layers. Three
convolutional layers, two subsampling layers, and two fully
connected layers make up the layer composition.
**IMP Diagram
LeNet-5 Architecture
First Layer
A 32x32 grayscale image serves as the input for LeNet-5 and is
processed by the first convolutional layer comprising six feature
maps or filters with a stride of one. From 32x32x1 to 28x28x6, the
image’s dimensions shift.
Second Layer
Then, using a filter size of 2×2 and a stride of 2, LeNet-5 adds an
average pooling layer or sub-sampling layer. The image is reduced
to 14x14x6.
Third Layer
A second convolutional layer with 16 feature maps, a filter size of
5×5 and a stride of 1 is then present, giving an output of 10x10x16.
Not every feature map in this layer is connected to all six feature
maps in the layer below; most are connected to only a subset of
them, as can be seen in the illustration below.
The primary goal is to disrupt the network’s symmetry while
maintaining a manageable number of connections. Because of this,
there are 1516 training parameters instead of 2400 in this layer,
and similarly, there are 151600 connections instead of 240000.
Fourth Layer
With a filter size of 2×2 and a stride of 2, the fourth layer (S4) is
once more an average pooling layer. This layer is identical to the
second layer (S2) except that it has 16 feature maps, so the output
is reduced to 5x5x16.
Fifth Layer
With 120 feature maps, each measuring 1 x 1, the fifth layer (C5) is
a fully connected convolutional layer. All 400 nodes (5x5x16) in
layer four, S4, are connected to each of the 120 units in C5.
Sixth Layer
A fully connected layer (F6) with 84 units makes up the sixth layer.
Output Layer
The SoftMax output layer, which has 10 potential values and
corresponds to the digits 0 to 9, is the last layer.
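The sizes quoted for each layer can be traced with the output-size rule (n - f) / stride + 1. In this sketch the 5×5 filter size of the first and fifth layers is an assumption inferred from the stated input and output sizes (32 to 28, and 5 to 1); the layer names follow the S2, S4, C5 naming used above.

```python
def output_size(n, f, stride):
    """Side length after a filter of size f with the given stride (no padding)."""
    return (n - f) // stride + 1

# (name, filter size, stride, number of feature maps)
layers = [
    ("C1", 5, 1, 6),    # convolution
    ("S2", 2, 2, 6),    # average pooling
    ("C3", 5, 1, 16),   # convolution
    ("S4", 2, 2, 16),   # average pooling
    ("C5", 5, 1, 120),  # convolution producing 1x1 maps
]

size = 32  # 32x32 grayscale input
shapes = []
for name, f, stride, maps in layers:
    size = output_size(size, f, stride)
    shapes.append((name, size, size, maps))
    print(f"{name}: {size}x{size}x{maps}")
```

After C5 the 120 values feed the fully connected layer F6 (84 units) and then the 10-way output layer.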
Summary of LeNet-5 Architecture
**IMP Diagram for short viewing

UNIT V
(06 Hrs)
Recurrent Neural Network (RNN) - Recurrent Neural Networks,
Backpropagation through Time (BPTT), Vanishing and Exploding Gradients,
Long Short Term Memory (LSTM) Cells, Gated Recurrent Units (GRUs).
Q) Explain Vanishing Gradients and Exploding Gradients and
Gradient descent algorithm in detail. (7M)
Vanishing Gradients and Exploding Gradients are two issues that can
occur during the training of deep neural networks, particularly in deep
architectures. They are related to the behavior of gradients, which are
crucial for updating the weights of the neural network during the
training process. Let's explain both concepts and then delve into
Gradient Descent.
Vanishing Gradients:
Vanishing Gradients refer to a situation in which the gradients of the
loss function with respect to the model's parameters become
extremely small as they are backpropagated through the layers of a
deep neural network. This issue is particularly prevalent in networks
with many layers and is caused by the choice of activation functions,
weight initialization, and architecture design.
Causes:
● Activation Functions: Sigmoid and hyperbolic tangent (tanh)
activation functions squash their input values to a range between
0 and 1 or -1 and 1, respectively. Their derivatives are small
over most of the input range (at most 0.25 for sigmoid), so when
gradients are multiplied through many layers they can quickly
diminish to almost zero.
● Weight Initialization: Poor choices in weight initialization
methods can exacerbate the vanishing gradients problem.
● Deep Architectures: Deep networks with many layers tend to
suffer more from vanishing gradients as the gradients have to
pass through multiple weight matrices and activation functions.
Consequences:
● Slow Training: The network learns slowly because small
gradients result in tiny weight updates.
● Poor Performance: The model might underfit the training data
as the early layers struggle to learn complex patterns.
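A rough numerical illustration of the vanishing effect: the derivative of the sigmoid is at most 0.25, so a gradient that passes through several sigmoid layers is scaled by at most 0.25 per layer. The sketch below ignores the weight matrices entirely, which is a simplifying assumption; it only shows the best-case contribution of the activation function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0, where it equals 0.25
max_grad = sigmoid_grad(0.0)

for depth in (1, 5, 10, 20):
    # Upper bound on the factor contributed by `depth` sigmoid layers
    print(f"{depth:2d} layers: gradient factor <= {max_grad ** depth:.3e}")
```

Even in this best case the factor after 10 layers is below one millionth, which is why the early layers of a deep sigmoid network receive almost no learning signal.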
Exploding Gradients:
Exploding Gradients, on the other hand, are the opposite problem. They
occur when gradients become exceedingly large as they are
backpropagated through the layers of the network. This problem can
lead to numerical instability and failed training.
Causes:
● Weight Initialization: Poor weight initialization can lead to large
gradients, especially if the weights are initialized too large.
● Activation Functions: Unbounded activation functions like the
ReLU (Rectified Linear Unit) do not squash their outputs, so with
large weights the activations and gradients can grow unchecked.
● Deep Networks: Networks with many layers are more prone to
exploding gradients, especially if the gradients are not properly
controlled.
Consequences:
● Numerical Overflow: Large gradients can lead to numerical
overflow, causing training to break.
● Unstable Learning: The model's weights can oscillate wildly
during training, making it difficult for the network to converge.
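The role of weight size can be illustrated with a deliberately simplified model in which each layer multiplies the backpropagated gradient by a single scalar weight. This is an assumption made for the sketch (real layers use weight matrices and activation derivatives), but the same repeated multiplication drives both problems.

```python
def backpropagated_factor(weight, num_layers):
    """Gradient scaling after passing back through `num_layers` scalar layers."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= weight
    return factor

exploding = backpropagated_factor(1.5, 50)  # weights slightly too large
vanishing = backpropagated_factor(0.5, 50)  # weights slightly too small

print(f"weight 1.5, 50 layers: {exploding:.3e}")
print(f"weight 0.5, 50 layers: {vanishing:.3e}")
```

A per-layer factor only a little above 1 grows into a huge number over 50 layers, while a factor a little below 1 shrinks to almost nothing.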
Gradient Descent Algorithm:
Gradient Descent is an optimization algorithm used to update the
parameters (weights and biases) of a neural network during training. It
aims to minimize a loss function by adjusting the parameters in the
direction that reduces the loss.
Here's a basic overview of the Gradient Descent algorithm:
1. Initialization: Initialize the model's parameters (weights and
biases) randomly or using a specific initialization method.
2. Forward Pass: Pass a batch of training data through the
network to compute the predicted outputs.
3. Compute Loss: Calculate the loss, which measures the
difference between the predicted outputs and the actual targets.
4. Backpropagation: Compute the gradients of the loss with
respect to the parameters of the network using the chain rule.
This involves propagating the gradients backward through the
layers.
5. Gradient Update: Update the parameters using the computed
gradients. The update rule is typically of the form:
parameter = parameter - learning_rate * gradient,
where the learning rate controls the step size.
6. Repeat: Repeat steps 2-5 for multiple iterations (epochs) over
the entire training dataset.
Gradient Descent helps the model learn by iteratively adjusting its
parameters to minimize the loss. Various variants of Gradient
Descent, such as Stochastic Gradient Descent (SGD), Mini-batch
Gradient Descent, and Adam, incorporate different strategies to
improve convergence speed and stability.
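The six steps above can be sketched on a toy problem. The example fits a line y = w*x + b with full-batch gradient descent on the mean squared error; the synthetic data, the learning rate of 0.1 and the 200 epochs are all assumptions chosen for illustration, and the gradients are written out by hand rather than obtained from a backpropagation library.

```python
import numpy as np

# Hypothetical training data generated from y = 3x + 2
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 2.0

# Step 1: initialization
w, b = 0.0, 0.0
learning_rate = 0.1

for epoch in range(200):                  # Step 6: repeat
    pred = w * x + b                      # Step 2: forward pass
    error = pred - y
    loss = np.mean(error ** 2)            # Step 3: compute loss (MSE)
    grad_w = 2.0 * np.mean(error * x)     # Step 4: gradients of the loss
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w           # Step 5: gradient update
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}, loss = {loss:.2e}")
```

Using the whole dataset in every update makes this batch gradient descent; SGD and mini-batch variants would compute the same gradients on one sample or a small subset per step.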
Q) Explain RNN architecture in detail? (7M)
→
Introduction to Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a Deep Learning approach for
modelling sequential data. RNNs were the standard choice for
working with sequential data before the advent of attention
models. A deep feedforward model may require separate parameters
for each element of the sequence, and it may also be unable to
generalize to variable-length sequences.
Source: Medium.com
Recurrent Neural Networks use the same weights for each element
of the sequence, decreasing the number of parameters and
allowing the model to generalize to sequences of varying lengths.
Because of their design, RNNs can also be generalized to structured
data other than sequential data, such as geographical or graphical
data.
Recurrent neural networks, like many other deep learning
techniques, are relatively old. They were first developed in the
1980s, but their full potential was not appreciated until recently.
The advent of long short-term memory (LSTM) in the 1990s, combined
with an increase in computational power and the vast amounts of
data that we now have to deal with, has really pushed RNNs to the
forefront.
RNNs are a type of neural network that can be used to model sequence
data. RNNs, which are derived from feedforward networks, retain information
from earlier inputs, loosely resembling the way human memory works. Simply
put, recurrent neural networks can model sequential data in a way that many
other algorithms can’t.
Source: Quora.com
In standard neural networks, all of the inputs and outputs are independent
of one another. However, in some circumstances, such as when predicting the
next word of a phrase, the prior words are necessary, and so the previous
words must be remembered. RNNs were created to address this, using a
hidden layer to overcome the problem. The most important component of an
RNN is the hidden state, which remembers specific information about a
sequence.
RNNs have a memory that stores information about the calculations made so
far. An RNN employs the same parameters for each input, since it performs
the same task on every input and hidden state to produce the output.
How do Recurrent Neural Networks work?
→ The information in recurrent neural networks cycles through a
loop in the middle hidden layer.
The input layer x receives and processes the neural network’s
input before passing it on to the middle layer.
The middle layer h can consist of multiple hidden layers, each
with its own activation functions, weights, and biases. In an
ordinary feedforward network, the parameters of the different
hidden layers are not affected by the preceding layer, i.e.,
the network has no memory; a recurrent neural network is used
when such memory is needed.
The Recurrent Neural Network standardizes the activation
functions, weights, and biases so that each hidden layer has the
same parameters. Rather than constructing numerous hidden
layers, it creates only one and loops over it as many times as
necessary.
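The idea of one hidden layer that is looped over can be sketched as a simple forward pass. This is a minimal illustration assuming a tanh activation and the common update rule h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h); the variable names, sizes and random weights are hypothetical and no training is shown.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple RNN over a sequence and return the hidden state at each step."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state (the "memory")
    states = []
    for x_t in inputs:                   # the same weights are reused at every step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

# The same parameters handle sequences of different lengths
short_states = rnn_forward(rng.normal(size=(3, input_size)), W_xh, W_hh, b_h)
long_states = rnn_forward(rng.normal(size=(7, input_size)), W_xh, W_hh, b_h)
print(short_states.shape, long_states.shape)  # (3, 3) (7, 3)
```

Because the loop reuses `W_xh`, `W_hh` and `b_h` at every time step, the number of parameters does not grow with the sequence length.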
Advantages of RNNs:
● Handle sequential data effectively, including text, speech, and time
series.
● Process inputs of any length, unlike feedforward neural networks.
● Share weights across time steps, enhancing training efficiency.
Disadvantages of RNNs:
● Prone to vanishing and exploding gradient problems, hindering
learning.
● Training can be challenging, especially for long sequences.
● Computationally slower than other neural network architectures.
