
UNIT-4 FODL

Prerequisite of Convolution Neural Network


Deep Learning is an emerging field of Machine learning; that is, it is a subset of Machine Learning
where learning happens from past examples or experiences with the help of ‘Artificial Neural
Networks’.
Deep Learning uses deep neural networks, where the word ‘deep’ signifies the presence of more
than 1 or 2 hidden layers apart from the input and output layer.

What is an Artificial Neural Network?

Artificial neural networks are made up of neurons, which are the core processing units of the
network. For better understanding, refer to the diagram below:

In the given diagram, first, we have the ‘INPUT LAYER’, where the neurons are fed with training
observations. Then in between is the ‘HIDDEN LAYER‘ that performs most of the computations
required by our network. Lastly, the ‘OUTPUT LAYER‘ predicts the final output extracted from
the previous two layers.
How does this neural network work?

 For instance, if an image is passed as input, with N X N pixels, each pixel is fed as input
to each neuron of the first layer.
 Neurons of one layer are connected to the following layers through ‘channels’.
 Each of these channels is assigned a numerical value called ‘weight’.
 The inputs (x1, x2, …… xn) are multiplied by their corresponding weights, and their sum is
sent to the neurons in the hidden layer.
 Each of these neurons is associated with a numerical value called the ‘Bias’, further added
to the input sum.
 This value is then passed through a threshold function called the ‘Activation function’,
which determines whether the particular neuron will get activated or not.
 The activated neuron transmits data to neurons of the next layer over channels.
 Thus, data is propagated through the network, and the neuron with the highest value
determines the output.
 Output = f(Σ (wi * xi) + Bias), where f is the activation function.
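As a concrete illustration, here is a minimal NumPy sketch of this computation for a single neuron; the input values, weights and bias below are made up, and the sigmoid is used as one possible activation function:

import numpy as np

def sigmoid(z):
    # One common choice of 'threshold' (activation) function.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs, channel weights and bias for a single neuron.
x = np.array([0.5, 0.1, 0.4])     # inputs x1 .. xn
w = np.array([0.2, 0.8, -0.5])    # weights on the channels
b = 0.1                           # bias

# Output = f(sum_i(w_i * x_i) + Bias)
output = sigmoid(np.dot(w, x) + b)
print(output)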

Types of Deep Neural Network:


 Artificial Neural Network
 Multi-Layered Perceptron
 Recurrent Neural Network
 Convolutional Neural Network

CONVOLUTIONAL NEURAL NETWORK(CNN):

 It is a class of deep neural networks that extracts features from images, given as input, to
perform specific tasks such as image classification, face recognition and semantic image
segmentation. A CNN has one or more convolution layers for simple feature extraction, which
execute convolution operation (i.e. multiplication of a set of weights with input) while
retaining the critical features (spatial and temporal information) without human
supervision.

Why do we need CNN over ANN?

 CNN is needed as it is an important and more accurate way for image classification
problems. With Artificial Neural Networks, a 2D image would first be converted into a 1-
dimensional vector before training the model.
 Also, with an increase in the size of the image, the number of training parameters would
increase drastically, requiring far more storage and computation. Moreover, an ANN cannot
capture the spatial information in an image or the sequential information required for sequence data.
 Thus, CNN would always be a preferred way for dealing with 2D image classification
problems because of its ability to deal with images as data, thereby providing higher
accuracy.

Q1. Explain CNN in detail along with its architecture.

Ans. The architecture of CNN:


The three primary layers that define the structure of a convolutional neural network are:

1)Convolution layer:

This is the first layer of the convolutional network that performs feature extraction by sliding the
filter over the input image. The output or the convolved feature is the element-wise product of
filters in the image and their sum for every sliding action.

The output, also known as the feature map, corresponds to features of the original image such as
curves, sharp edges, textures, etc.

In the case of networks with more convolutional layers, the initial layers are meant for extracting
the generic features, while increasingly complex, task-specific features are extracted as the network gets deeper.

The image below shows the convolution operation.

2)Pooling Layer:

The primary purpose of this layer is to reduce the number of trainable parameters by decreasing
the spatial size of the image, thereby reducing the computational cost.
The image depth remains unchanged since pooling is done independently on each depth
dimension. Max pooling is the most common pooling method: the largest element in each window
of the feature map is retained, giving an output whose dimensions are reduced to a great extent
while the essential information is kept.

3)Fully Connected Layer:

The last few layers which determine the output are the fully connected layers. The output from the
pooling layer is Flattened into a one-dimensional vector and then given as input to the fully
connected layer.

The output layer has the same number of neurons as the number of categories we had in our
problem for classification, thus associating features to a particular label.

This whole process is known as forward propagation; the output so generated is compared to the
actual output (the ground-truth label) to compute the error.

The error is then backpropagated to update the filters (weights) and bias values. Thus, one training
iteration is completed after this forward and backward propagation cycle.

Q2. How can edges be detected from an image? Also explain concepts related to CNN along
with examples (IMPORTANT, Must DO, refer for Numericals)

Ans. Edge Detection Example

Early layers of a neural network detect edges from an image. Deeper layers might be able to detect
parts of objects, and even deeper layers might detect complete objects (like a person’s face).

Suppose we are given the below image:


As you can see, there are many vertical and horizontal edges in the image. The first thing to do is
to detect these edges:

But how do we detect these edges? To illustrate this, let’s take a 6 X 6 grayscale image (i.e. only
one channel):

Next, we convolve this 6 X 6 matrix with a 3 X 3 filter:

After the convolution, we will get a 4 X 4 image. The first element of the 4 X 4 matrix will be
calculated as:
So, we take the first 3 X 3 matrix from the 6 X 6 image and multiply it with the filter. Now, the
first element of the 4 X 4 output will be the sum of the element-wise product of these values, i.e.
3*1 + 0 + 1*-1 + 1*1 + 5*0 + 8*-1 + 2*1 + 7*0 + 2*-1 = -5. To calculate the second element of
the 4 X 4 output, we will shift our filter one step towards the right and again get the sum of the
element-wise product:

Similarly, we will convolve over the entire image and get a 4 X 4 output:

So, convolving a 6 X 6 input with a 3 X 3 filter gave us an output of 4 X 4. Consider one more
example:
Note: Higher pixel values represent the brighter portion of the image and the lower pixel values
represent the darker portions. This is how we can detect a vertical edge in an image.
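Since the figures are not reproduced here, the same sliding-window computation can be sketched in NumPy. The 6 X 6 pixel values below are illustrative, but the first 3 X 3 patch and the 3 X 3 vertical-edge filter match the worked numbers above:

import numpy as np

# Illustrative 6 x 6 grayscale image (the top-left 3 x 3 patch matches
# the worked example above).
image = np.array([
    [3, 0, 1, 2, 7, 4],
    [1, 5, 8, 9, 3, 1],
    [2, 7, 2, 5, 1, 3],
    [0, 1, 3, 1, 7, 8],
    [4, 2, 1, 6, 2, 8],
    [2, 4, 5, 2, 3, 9],
])

# 3 x 3 vertical-edge filter: 1s on the left column, -1s on the right.
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])

def convolve2d(img, k):
    n, f = img.shape[0], k.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the current patch and the filter, then sum.
            out[i, j] = np.sum(img[i:i + f, j:j + f] * k)
    return out

result = convolve2d(image, kernel)
print(result.shape)   # (4, 4)
print(result[0, 0])   # -5.0, as computed step by step above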

More Edge Detection

The type of filter that we choose helps to detect the vertical or horizontal edges. We can use the
following filters to detect different edges:

Some of the commonly used filters are:


The Sobel filter puts a little bit more weight on the central pixels. Instead of using these filters, we
can create our own as well and treat them as a parameter which the model will learn using
backpropagation.

Padding

We have seen that convolving an input of 6 X 6 dimension with a 3 X 3 filter results in 4 X 4


output. We can generalize it and say that if the input is n X n and the filter size is f X f, then the
output size will be (n-f+1) X (n-f+1):

 Input: n X n
 Filter size: f X f
 Output: (n-f+1) X (n-f+1)

There are primarily two disadvantages here:

1. Every time we apply a convolutional operation, the size of the image shrinks
2. Pixels present in the corners of the image are used only a small number of times during
convolution as compared to the central pixels. Hence, the information from the corners is
under-used, which can lead to information loss

To overcome these issues, we can pad the image with an additional border, i.e., we add one pixel
all around the edges. This means that the input will be an 8 X 8 matrix (instead of a 6 X 6 matrix).
Applying convolution of 3 X 3 on it will result in a 6 X 6 matrix which is the original shape of the
image. This is where padding comes to the fore:

 Input: n X n
 Padding: p
 Filter size: f X f
 Output: (n+2p-f+1) X (n+2p-f+1)

There are two common choices for padding:

1. Valid: It means no padding. If we are using valid padding, the output will be (n-f+1) X (n-
f+1)
2. Same: Here, we apply padding so that the output size is the same as the input size, i.e.,
n+2p-f+1 = n
So, p = (f-1)/2
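A quick NumPy check of this padding arithmetic follows; the image and filter values are dummies, since only the shapes matter here:

import numpy as np

n, f, p = 6, 3, 1                       # input size, filter size, padding
image = np.arange(n * n, dtype=float).reshape(n, n)   # dummy 6 x 6 image

padded = np.pad(image, pad_width=p)     # zero-pad: 6 x 6 becomes 8 x 8
out_size = padded.shape[0] - f + 1      # n + 2p - f + 1 = 6
print(padded.shape, out_size)           # (8, 8) 6

# 'Same' padding keeps the output equal to the input size (for odd f):
p_same = (f - 1) // 2
print(p_same)                           # 1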

We now know how to use padded convolution. This way we don’t lose a lot of information and
the image does not shrink either. Next, we will look at how to implement strided convolutions.

Strided Convolutions

Suppose we choose a stride of 2. So, while convolving over the image, we will take two steps at a
time – both in the horizontal and vertical directions. The dimensions for stride s will be:
 Input: n X n
 Padding: p
 Stride: s
 Filter size: f X f
 Output: [(n+2p-f)/s+1] X [(n+2p-f)/s+1]

Stride helps to reduce the size of the image, a particularly useful feature.
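The same dimension formula, including stride, can be written as a small helper function (the division is a floor when n + 2p - f is not a multiple of s):

def conv_output_size(n, f, p=0, s=1):
    # Output size of a convolution: floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4  (valid convolution)
print(conv_output_size(6, 3, p=1))       # 6  ('same' padding)
print(conv_output_size(7, 3, p=0, s=2))  # 3  (stride 2)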

Convolutions Over Volume

Suppose, instead of a 2-D image, we have a 3-D input image of shape 6 X 6 X 3. How will we
apply convolution on this image? We will use a 3 X 3 X 3 filter instead of a 3 X 3 filter. Let’s look
at an example:

 Input: 6 X 6 X 3
 Filter: 3 X 3 X 3

The dimensions above represent the height, width and channels in the input and filter. Keep in
mind that the number of channels in the input and filter should be the same. This will result in an
output of 4 X 4. Let’s understand it visually:

Since there are three channels in the input, the filter will consequently also have three channels.
After convolution, the output shape is a 4 X 4 matrix. So, the first element of the output is the sum
of the element-wise product of the first 27 values from the input (9 values from each channel) and
the 27 values from the filter. After that we convolve over the entire image.

Instead of using just a single filter, we can use multiple filters as well. How do we do that? Let’s
say the first filter will detect vertical edges and the second filter will detect horizontal edges from
the image. If we use multiple filters, the output dimension will change. So, instead of having a 4
X 4 output as in the above example, we would have a 4 X 4 X 2 output (if we have used 2 filters):
Generalized dimensions can be given as:

 Input: n X n X nc
 Filter: f X f X nc
 Padding: p
 Stride: s
 Output: [(n+2p-f)/s+1] X [(n+2p-f)/s+1] X nc’

Here, nc is the number of channels in the input and filter, while nc’ is the number of filters.
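A NumPy sketch of this convolution over volume, using random values since only the shapes matter here:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 3))           # input: height x width x channels
filters = rng.standard_normal((2, 3, 3, 3))  # nc' = 2 filters, each 3 x 3 x 3

n, f, n_filters = x.shape[0], filters.shape[1], filters.shape[0]
out = np.zeros((n - f + 1, n - f + 1, n_filters))   # 4 x 4 x 2

for k in range(n_filters):
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum of the element-wise product of 27 input values
            # (9 per channel) with the 27 filter values.
            out[i, j, k] = np.sum(x[i:i + f, j:j + f, :] * filters[k])

print(out.shape)   # (4, 4, 2)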

One Layer of a Convolutional Network

Once we get an output after convolving over the entire image using a filter, we add a bias term to
those outputs and finally apply an activation function to generate activations. This is one layer of
a convolutional network. Recall that the equation for one forward pass is given by:

z[1] = w[1]*a[0] + b[1]


a[1] = g(z[1])

In our case, input (6 X 6 X 3) is a[0] and filters (3 X 3 X 3) are the weights w[1]. These activations
from layer 1 act as the input for layer 2, and so on. Clearly, the number of parameters in case of
convolutional neural networks is independent of the size of the image. It essentially depends on
the filter size. Suppose we have 10 filters, each of shape 3 X 3 X 3. What will be the number of
parameters in that layer? Let’s try to solve this:

 Number of parameters for each filter = 3*3*3 = 27


 There will be a bias term for each filter, so total parameters per filter = 28
 As there are 10 filters, the total parameters for that layer = 28*10 = 280
No matter how big the image is, the parameters only depend on the filter size. Awesome, isn’t it?
Let’s have a look at the summary of notations for a convolution layer:

 f[l] = filter size


 p[l] = padding
 s[l] = stride
 n[c][l] = number of filters
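To make the parameter count concrete, here is a minimal Keras sketch (assuming TensorFlow 2.x is installed) that reproduces the 280 parameters worked out above for 10 filters of shape 3 X 3 X 3:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(6, 6, 3)),
    tf.keras.layers.Conv2D(filters=10, kernel_size=3),
])
print(model.count_params())   # 280 = (3*3*3 + 1) * 10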

Let’s combine all the concepts we have learned so far and look at a convolutional network
example.

Simple Convolutional Network Example

This is what a typical convolutional network looks like:

We take an input image (size = 39 X 39 X 3 in our case), convolve it with 10 filters of size 3 X 3,
and take the stride as 1 and no padding. This will give us an output of 37 X 37 X 10. We convolve
this output further and get an output of 7 X 7 X 40 as shown above. Finally, we take all these
numbers (7 X 7 X 40 = 1960), unroll them into a large vector, and pass them to a classifier that
will make predictions. This is a microcosm of how a convolutional network works.

There are a number of hyperparameters that we can tweak while building a convolutional network.
These include the number of filters, size of filters, stride to be used, padding, etc. We will look at
each of these in detail later in this article. Just keep in mind that as we go deeper into the network,
the size of the image shrinks whereas the number of channels usually increases.

In a convolutional network (ConvNet), there are basically three types of layers:

1. Convolution layer
2. Pooling layer
3. Fully connected layer

Let’s understand the pooling layer in the next section.

Pooling Layers
Pooling layers are generally used to reduce the size of the inputs and hence speed up the
computation. Consider a 4 X 4 matrix as shown below:

Applying max pooling on this matrix will result in a 2 X 2 output:

For every consecutive 2 X 2 block, we take the max number. Here, we have applied a filter of size
2 and a stride of 2. These are the hyperparameters for the pooling layer. Apart from max pooling,
we can also apply average pooling where, instead of taking the max of the numbers, we take their
average. In summary, the hyperparameters for a pooling layer are:

1. Filter size
2. Stride
3. Max or average pooling

If the input of the pooling layer is nh X nw X nc, then the output will be [{(nh – f) / s + 1} X {(nw –
f) / s + 1} X nc].
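A NumPy sketch of max pooling with filter size 2 and stride 2 (the 4 X 4 values are made up, since the figure is not reproduced here):

import numpy as np

x = np.array([
    [1, 3, 2, 1],
    [2, 9, 1, 1],
    [1, 3, 2, 3],
    [5, 6, 1, 2],
], dtype=float)

f, s = 2, 2                          # pooling filter size and stride
out = np.zeros((x.shape[0] // s, x.shape[1] // s))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # Keep the maximum of each 2 x 2 block.
        out[i, j] = x[i*s:i*s + f, j*s:j*s + f].max()

print(out)   # [[9. 2.]
             #  [6. 3.]]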

CNN Example

We’ll take things up a notch now. Let’s look at how a convolution neural network with
convolutional and pooling layer works. Suppose we have an input of shape 32 X 32 X 3:

There are a combination of convolution and pooling layers at the beginning, a few fully connected
layers at the end and finally a softmax classifier to classify the input into various categories. There
are a lot of hyperparameters in this network which we have to specify as well.

Generally, we take the set of hyperparameters which have been used in proven research and they
end up doing well. As seen in the above example, the height and width of the input shrinks as we
go deeper into the network (from 32 X 32 to 5 X 5) and the number of channels increases (from 3
to 10).

All of these concepts and techniques bring up a very fundamental question – why convolutions?
Why not something else?

Why Convolutions?

There are primarily two major advantages of using convolutional layers over using just fully
connected layers:

1. Parameter sharing
2. Sparsity of connections

Consider the below example:

If we would have used just the fully connected layer, the number of parameters would be =
32*32*3*28*28*6, which is nearly equal to 14 million! Makes no sense, right?

If we see the number of parameters in case of a convolutional layer, it will be = (5*5 + 1) * 6 (if
there are 6 filters), which is equal to 156. Convolutional layers reduce the number of parameters
and speed up the training of the model significantly.
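The two counts can be checked with simple arithmetic; note that the convolutional count below follows the simplified (5*5 + 1) per-filter figure used above, which ignores the input channels:

fc_params = (32 * 32 * 3) * (28 * 28 * 6)   # fully connected: 14,450,688 (~14 million)
conv_params = (5 * 5 + 1) * 6               # 6 filters of 5 x 5, plus a bias each

print(fc_params)    # 14450688
print(conv_params)  # 156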

In convolutions, we share the parameters while convolving through the input. The intuition behind
this is that a feature detector, which is helpful in one part of the image, is probably also useful in
another part of the image. So a single filter is convolved over the entire input and hence the
parameters are shared.

The second advantage of convolution is the sparsity of connections. For each layer, each output
value depends on a small number of inputs, instead of taking into account all the inputs.
Understanding and Calculating the number of Parameters in Convolution Neural Networks
(CNNs)


If you’ve been working with CNNs, it is common to encounter a summary of parameters as seen in
the above image. We all know it is easy to calculate the activation size, considering it’s merely the
product of width, height and the number of channels in that layer.
For example, as shown in the above image from coursera, the input layer’s shape is (32, 32, 3), the
activation size of that layer is 32 * 32 * 3 = 3072. The same holds good if you want to calculate the
activation shape of any other layer. Say, we want to calculate the activation size for CONV2. All
we have to do is just multiply (10,10,16) , i.e 10*10*16 = 1600, and you’re done calculating the
activation size.

However, what sometimes may get tricky is the approach to calculating the number of parameters in
a given layer. With that said, here are some simple ideas to keep in mind to do the same.

How does a CNN learn?

This goes back to the idea of understanding what we are doing with a convolution neural net, which
is basically trying to learn the values of filter(s) using backprop. In other words, if a layer has weight
matrices, that is a “learnable” layer.

Basically, the number of parameters in a given layer is the count of “learnable” elements, i.e., the
parameters of the filters in that layer.

Parameters in general are weights that are learnt during training. They are weight matrices that
contribute to the model’s predictive power and are changed during the back-propagation process.
Who governs the change? The training algorithm you choose, in particular the optimization
strategy, updates their values.

Now that you know what “parameters” are, let’s dive into calculating the number of parameters in
the sample CNN we saw above (example taken from Coursera: https://www.coursera.org/learn/convolutional-neural-networks/lecture/uRYL1/cnn-example).

1. Input layer: The input layer has nothing to learn; at its core, all it does is provide
the input image’s shape. So there are no learnable parameters here. Thus, number of
parameters = 0.

2. CONV layer: This is where a CNN learns, so we certainly have weight matrices here. To
calculate the learnable parameters, we multiply the filter width m, the filter height n and
the number of filters d in the previous layer, and account for all such filters k in the
current layer. Don’t forget the bias term for each filter. The number of parameters in a
CONV layer is therefore ((m * n * d) + 1) * k, the +1 being the bias term for each filter.
The same expression can be written as: ((width of the filter * height of the filter *
number of filters in the previous layer + 1) * number of filters), where “number of
filters” refers to the filters in the current layer.

3. POOL layer: This has got no learnable parameters because all it does is calculate a
specific number, no backprop learning involved! Thus number of parameters = 0.

4. Fully Connected Layer (FC): This certainly has learnable parameters; as a matter of fact,
in comparison to the other layers, this category of layers has the highest number of
parameters. Why? Because every neuron is connected to every neuron in the previous
layer. So, how do we calculate the number of parameters here? It is the product of the
number of neurons in the current layer c and the number of neurons in the previous
layer p and, as always, the bias term. Thus, the number of parameters here is:
((current layer neurons c * previous layer neurons p) + 1*c).

Now let’s follow these pointers and calculate the number of parameters, shall we?

1. The first input layer has no parameters. You know why.

2. Parameters in the second CONV1 (filter shape = 5*5, stride = 1) layer: ((width of filter *
height of filter * number of filters in the previous layer + 1) * number of filters)
= ((5*5*3) + 1) * 8 = 608.

3. The third POOL1 layer has no parameters. You know why.

4. Parameters in the fourth CONV2 (filter shape = 5*5, stride = 1) layer: ((width of filter *
height of filter * number of filters in the previous layer + 1) * number of filters)
= ((5*5*8) + 1) * 16 = 3216.

5. The fifth POOL2 layer has no parameters. You know why.

6. Parameters in the sixth FC3 layer: ((current layer c * previous layer p) + 1*c) =
(120*400) + (1*120) = 48120.

7. Parameters in the seventh FC4 layer: ((current layer c * previous layer p) + 1*c) =
(84*120) + (1*84) = 10164.

8. The Eighth Softmax layer has ((current layer c*previous layer p)+1*c) parameters =
10*84+1*10 = 850.
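As a cross-check, the whole example can be rebuilt in Keras (assuming TensorFlow 2.x). The 5 X 5 kernels, stride 1, the 8 and 16 filters and the FC sizes 120, 84 and 10 are taken from the numbers above; the 2 X 2, stride-2 pooling windows and the ReLU activations are assumptions consistent with the (10, 10, 16) activation shape mentioned earlier:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(8, kernel_size=5, strides=1, activation="relu"),   # CONV1: 608
    layers.MaxPooling2D(pool_size=2, strides=2),                     # POOL1: 0
    layers.Conv2D(16, kernel_size=5, strides=1, activation="relu"),  # CONV2: 3216
    layers.MaxPooling2D(pool_size=2, strides=2),                     # POOL2: 0
    layers.Flatten(),                                                # 5*5*16 = 400 values
    layers.Dense(120, activation="relu"),                            # FC3: 48120
    layers.Dense(84, activation="relu"),                             # FC4: 10164
    layers.Dense(10, activation="softmax"),                          # Softmax: 850
])
model.summary()   # per-layer parameter counts match the calculations above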


FYI:

1. In this article, the term “layer” is used very loosely to explain the separation. Ideally, CONV +
pooling together are termed as one layer.

2. Just because there are no parameters in the pooling layer, it does not imply that pooling has no
role in backprop. Pooling layer is responsible for passing on the values to the next and previous
layers during forward and backward propagation respectively.
In this article we saw what a parameter means, how to calculate the activation size, and how to
calculate the number of parameters in a CNN.

Q3. Explain different types of CNN architectures.

Ans. Different Types of CNN Architectures Explained: Examples


In the fast-paced world of computer vision and image processing, the problem of image
classification consistently stands out: the ability to effectively recognize and classify images. As
we continue to digitize and automate our world, the demand for systems that can understand and
interpret visual data is growing at an unprecedented rate. The challenge is not just
about recognizing images – it’s about doing so accurately and efficiently. Traditional machine
learning methods often fall short, struggling to handle the complexity and high dimensionality of
image data. This is where Convolutional Neural Networks (CNNs) come to the rescue. And there
are different types of CNN architectures based on which a CNN model can be trained for image
classification.
The CNN architecture is the most popular deep learning framework. CNNs have shown remarkable
success in tackling the problem of image recognition, bringing a newfound level of precision and
scalability. But not all CNNs are created equal, and understanding the different types of CNN
architectures is key to leveraging their full potential. In this section, we will discuss each type of
CNN architecture in detail and provide examples of how these CNN models work. Even before we
get to learn about the different types of CNN architecture, let’s briefly recall what a CNN is in the
first place.

What is CNN?
Before we get to the different types of CNN architecture, let’s quickly recall what a CNN is, what a
CNN model is, and what the most fundamental components of a CNN architecture are.

Convolutional Neural Networks, commonly referred to as CNNs, are a specialized kind of neural
network architecture that is designed to process data with a grid-like topology. This makes them
particularly well-suited for dealing with spatial and temporal data, like images and videos, that
maintain a high degree of correlation between adjacent elements.

CNNs are similar to other neural networks, but they have an added layer of complexity due to the
fact that they use a series of convolutional layers. Convolutional layers perform a mathematical
operation called convolution, a sort of specialized matrix multiplication, on the input data. The
convolution operation helps to preserve the spatial relationship between pixels by learning image
features using small squares of input data. The picture below represents a typical CNN
architecture.
The following are definitions of different layers shown in the above architecture:

Convolutional layers
Convolutional layers operate by sliding a set of ‘filters’ or ‘kernels’ across the input data. Each
filter is designed to detect a specific feature or pattern, such as edges, corners, or more complex
shapes in the case of deeper layers. As these filters move across the image, they generate a map
that signifies the areas where those features were found. The output of the convolutional layer
is a feature map, which is a representation of the input image with the filters applied.
Convolutional layers can be stacked to create more complex models, which can learn more
intricate features from images. Simply speaking, convolutional layers are responsible for
extracting features from the input images. These features might include edges, corners, textures,
or more complex patterns.
Pooling layers
Pooling layers follow the convolutional layers and are used to reduce the spatial dimension of
the input, making it easier to process and requiring less memory. In the context of images, “spatial
dimensions” refer to the width and height of the image. An image is made up of pixels, and you
can think of it like a grid, with rows and columns of tiny squares (pixels). By reducing the spatial
dimensions, pooling layers help reduce the number of parameters or weights in the network.
This helps to combat overfitting and speeds up training. Max pooling also reduces computational
complexity, owing to the reduced size of the feature map, and makes the model invariant to small
translations. Without max pooling, the network would not gain the ability
to recognize features irrespective of small shifts or rotations. This would make the model less
robust to variations in object positioning within the image, possibly affecting accuracy.

There are two main types of pooling: max pooling and average pooling. Max pooling takes the
maximum value from each feature map. For example, if the pooling window size is 2×2, it will
pick the pixel with the highest value in that 2×2 region. Max pooling effectively captures the most
prominent feature or characteristic within the pooling window. Average pooling calculates the
average of all values within the pooling window. It provides a smooth, average feature
representation.
Fully connected layers
Fully-connected layers are one of the most basic types of layers in a convolutional neural network
(CNN). As the name suggests, each neuron in a fully-connected layer is connected to every
neuron in the previous layer. Fully connected layers are typically used towards the end of a
CNN- when the goal is to take the features learned by the convolutional and max pooling layers
and use them to make predictions such as classifying the input to a label. For example, if we were
using a CNN to classify images of animals, the final Fully connected layer might take the features
learned by the previous layers and use them to classify an image as containing a dog, cat, bird,
etc.

Fully connected layers take the high-dimensional output from the previous convolutional and
pooling layers and flatten it into a one-dimensional vector. This allows the network to combine
and integrate all the extracted features across the entire image, rather than considering localized
features. It helps in understanding the global context of the image. The fully connected layers are
responsible for mapping the integrated features to the desired output, such as class labels in
classification tasks. They act as the final decision-making part of the network, determining what
the extracted features mean in the context of the specific problem (e.g., recognizing a cat or a dog).

The combination of Convolution layer followed by max-pooling layer and then similar sets creates
a hierarchy of features. The first layer detects simple patterns, and subsequent layers build on those
to detect more complex patterns.

Output Layer
The output layer in a Convolutional Neural Network (CNN) plays a critical role as it’s the final
layer that produces the actual output of the network, typically in the form of a classification or
regression result. Its importance can be outlined as follows:

1. Transformation of Features to Final Output: The earlier layers of the CNN


(convolutional, pooling, and fully connected layers) are responsible for extracting and
transforming features from the input data. The output layer takes these high-level,
abstracted features and transforms them into a final output form, which is directly
interpretable in the context of the problem being solved.
2. Task-Specific Formulation:
 For classification tasks, the output layer typically uses a softmax
activation function, which converts the input from the previous layers into
a probability distribution over the predefined classes. The softmax
function ensures that the output probabilities sum to 1, making them
directly interpretable as class probabilities.
 For regression tasks, the output layer might consist of one or more neurons
with linear or no activation function, providing continuous output values.
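A minimal NumPy sketch of the softmax function mentioned above, applied to hypothetical raw scores for three classes:

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability, exponentiate, then normalize.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw scores for 3 classes
probs = softmax(logits)
print(probs, probs.sum())            # class probabilities that sum to 1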
Real-world usage of CNN
CNNs are often used for image recognition and classification tasks. For example, CNNs can be
used to identify objects in an image or to classify an image as being a cat or a dog. CNNs can also
be used for more complex tasks, such as generating descriptions of an image or identifying the
points of interest in an image. Beyond image data, CNNs can also handle time-series data, such as
audio data or even text data, although other types of networks like Recurrent Neural Networks
(RNNs) or transformers are often preferred for these scenarios. CNNs are a powerful tool for deep
learning, and they have been used to achieve state-of-the-art results in many different applications.
Different types of CNN Architectures
The following is a list of different types of CNN architectures in deep learning that are most famous
/ popular ones:

LeNet – First CNN Architecture


LeNet was developed in 1998 by Yann LeCun and his collaborators for handwritten digit
recognition problems. LeNet was one of the first successful CNNs and is often considered the
“Hello World” of deep learning. It is one of the earliest and most widely-used CNN architectures
and has been successfully applied to tasks such as handwritten digit recognition. The LeNet-5
architecture consists of convolutional layers, each followed by a subsampling (pooling) layer, and
then fully connected layers. LeNet was the beginning of CNNs in deep learning for computer vision
problems. Deep networks of that era were hard to train because of the vanishing gradients
associated with sigmoid/tanh activations; the pooling (subsampling) layers placed between the
convolutional layers reduce the spatial size of the feature maps, which limits the number of
parameters, helps prevent overfitting and allows the network to train more effectively. The
diagram below represents the LeNet-5 architecture.

The LeNet CNN is a simple yet powerful model that has been used for various tasks such as
handwritten digit recognition, traffic sign recognition, and face detection. Although LeNet was
developed more than 20 years ago, its architecture is still relevant today and continues to be used.

AlexNet – Deep Learning Architecture that popularized CNN


AlexNet was developed by Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. AlexNet network
had a very similar architecture to LeNet, but was deeper, bigger, and featured Convolutional
Layers stacked on top of each other. AlexNet was the first large-scale CNN and was used to win
the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. The AlexNet
architecture was designed to be used with large-scale image datasets and it achieved state-of-the-
art results at the time of its publication. AlexNet is composed of 5 convolutional layers with a
combination of max-pooling layers, 3 fully connected layers, and 2 dropout layers. The activation
function used in the hidden layers is ReLU. The activation function used in the output layer is Softmax. The
total number of parameters in this architecture is around 60 million.
ZF Net: ZFnet is the CNN architecture that uses a combination of fully-connected layers and
CNNs. ZF Net was developed by Matthew Zeiler and Rob Fergus. It was the ILSVRC 2013
winner. The network has relatively fewer parameters than AlexNet, but still outperforms it on
ILSVRC 2012 classification task by achieving top accuracy with only 1000 images per class. It
was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by
expanding the size of the middle convolutional layers and making the stride and filter size on the
first layer smaller. It is based on the Zeiler and Fergus model, which was trained on the ImageNet
dataset. ZF Net CNN architecture consists of a total of seven layers: Convolutional layer, max-
pooling layer (downscaling), concatenation layer, convolutional layer with linear activation
function, and stride one, dropout for regularization purposes applied before the fully connected
output. This CNN model is computationally more efficient than AlexNet by introducing an
approximate inference stage through deconvolutional layers in the middle of CNNs. Here is
the paper on ZFNet.
GoogLeNet – CNN Architecture used by Google
GoogLeNet is the CNN architecture used by Google to win ILSVRC 2014 classification task. It
was developed by Christian Szegedy and his colleagues at Google. It has been shown to
have a notably reduced error rate in comparison with previous winners AlexNet (Ilsvrc 2012
winner) and ZF-Net (Ilsvrc 2013 winner). In terms of error rate, the error is significantly lesser
than VGG (2014 runner up). It achieves deeper architecture by employing a number of distinct
techniques, including 1×1 convolution, the Inception module and global average pooling. Despite
its depth, GoogLeNet is relatively parameter-efficient: the 1×1 “bottleneck” convolutions inside the
Inception modules and the global average pooling at the end sharply reduce the number of
parameters that must be learned compared with AlexNet and VGG. Real-world
applications/examples of GoogLeNet CNN architecture include Street
View House Number (SVHN) digit recognition task, which is often used as a proxy for roadside
object detection. Below is the simplified block diagram representing GoogLeNet CNN
architecture:
VGGNet – Deep CNN Architecture with Small (3x3) Filters
VGGNet is the CNN architecture that was developed by Karen Simonyan, Andrew Zisserman et
al. at Oxford University. The full form of “VGG16” stands for “Visual Geometry Group
16“. This name comes from the Visual Geometry Group at the University of Oxford, where this
neural network architecture was developed. The “16” in the name indicates that the model contains
16 layers that have weights; this includes convolutional layers as well as fully connected layers.
VGG16 is a 16-layer CNN with about 138 million parameters, trained on the ImageNet dataset
(over a million images across 1000 classes). It takes input images of 224 x 224 pixels and produces
4096-dimensional feature vectors in its fully connected layers. Such a large network is expensive to
train and requires a lot of data, which is one reason why more parameter-efficient architectures
such as GoogLeNet often work better than VGGNet for image classification tasks where input
images have a size between 100 x 100 and 350 x 350 pixels. Real-world applications/examples of VGGNet CNN
architecture include the ILSVRC 2014 classification task, which was also won by GoogleNet CNN
architecture. The VGG CNN model is computationally efficient and serves as a strong baseline for
many applications in computer vision due to its applicability for numerous tasks including object
detection. Its deep feature representations are used across multiple neural network architectures
like YOLO, SSD, etc. The diagram below represents the standard VGG16 network architecture
diagram:
ResNet – CNN architecture that also got used for NLP tasks apart from Image Classification
ResNet is the CNN architecture that was developed by Kaiming He et al. to win the ILSVRC 2015
classification task with a top-five error of only 3.57%. The network has 152 layers and around 60
million parameters, which is considered deep even for CNNs because it would have taken more
than 40 days on 32 GPUs to train the network on the ILSVRC 2015 dataset. CNNs are mostly used
for image classification tasks with 1000 classes, but ResNet proves that CNNs can also be used
successfully to solve natural language processing problems like sentence completion or machine
comprehension, where it was used by the Microsoft Research Asia team in 2016 and 2017
respectively. Real-life applications/examples of ResNet CNN architecture include Microsoft’s
machine comprehension system, which has used CNNs to generate the answers for more than 100k
questions in over 20 categories. The CNN architecture ResNet is computationally efficient and can
be scaled up or down to match the computational power of GPUs.

MobileNets – CNN Architecture for Mobile Devices


MobileNets are CNNs that can be fit on a mobile device to classify images or detect objects with
low latency. MobileNets were developed by Andrew G. Howard and colleagues at Google. They are usually very
small CNN architectures, which makes them easy to run in real-time using embedded devices like
smartphones and drones. The architecture is also flexible so it has been tested on CNNs with 100-
300 layers and it still works better than other architectures like VGGNet. Real-life examples of
MobileNets CNN architecture include the CNNs built into Android phones to run Google’s
Mobile Vision API, which can automatically identify labels of popular objects in images.

ZFNet – Improved Version of AlexNet CNN Architecture


ZFNet, short for Zeiler and Fergus Network, is a convolutional neural network (CNN) model that
gained significant attention in the field of deep learning, particularly in computer vision. It was
developed by Matthew Zeiler and Rob Fergus and became well-known after winning the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC) in 2013.

The key innovation of ZFNet lies in its approach to improving the AlexNet architecture, which
was the winner of the ILSVRC in 2012. ZFNet addressed some of the limitations of AlexNet
by tweaking the CNN architecture, particularly focusing on the convolutional layers.
ZFNet modified the first few layers of AlexNet. It used smaller filter sizes in the first and second
convolutional layers and altered the stride and filter sizes to improve feature extraction. One of the
most notable contributions of ZFNet was the introduction of a novel visualization
technique that allowed for better understanding and interpretation of the feature maps in CNNs.
By fine-tuning the architecture, ZFNet achieved improved performance in image classification
tasks compared to its predecessor, AlexNet.
GoogLeNet_DeepDream – Generate images based on CNN features
GoogLeNet_DeepDream is a deep dream CNN architecture that was developed by Alexander
Mordvintsev, Christopher Olah, et al.. It uses the Inception network to generate images based on
CNN features. The architecture is often used with the ImageNet dataset to generate psychedelic
images or create abstract artworks using human imagination at the ICLR 2017 workshop by David
Ha, et al.

To summarize the different types of CNN architectures described above in an easy to remember
form, you can use the following:

LeNet (1998)
 Key features: first successful application of CNNs; 5 layers (alternating between convolutional and pooling); used tanh/sigmoid activation functions
 Use case: recognizing handwritten and machine-printed characters

AlexNet (2012)
 Key features: deeper and wider than LeNet; used the ReLU activation function; implemented dropout layers; used GPUs for training
 Use case: large-scale image recognition tasks

ZFNet (2013)
 Key features: similar architecture to AlexNet, but with different filter sizes and numbers of filters; visualization techniques for understanding the network
 Use case: ImageNet classification

VGGNet (2014)
 Key features: deeper networks with smaller filters (3×3); all convolutional layers have the same depth; multiple configurations (VGG16, VGG19)
 Use case: large-scale image recognition

ResNet (2015)
 Key features: introduced “skip connections” or “shortcuts” to enable training of deeper networks; multiple configurations (ResNet-50, ResNet-101, ResNet-152)
 Use case: large-scale image recognition; won 1st place in the ILSVRC 2015

GoogLeNet (2014)
 Key features: introduced the Inception module, which allows for more efficient computation and deeper networks; multiple versions (Inception v1, v2, v3, v4)
 Use case: large-scale image recognition; won 1st place in the ILSVRC 2014

MobileNets (2017)
 Key features: designed for mobile and embedded vision applications; uses depthwise separable convolutions to reduce the model size and complexity
 Use case: mobile and embedded vision applications, real-time object detection

Classic Networks

In this section, we will look at the following popular networks:

1. LeNet-5
2. AlexNet
3. VGG

We will also see how ResNet works and finally go through a case study of an inception neural
network.

LeNet-5

Let’s start with LeNet-5:


It takes a grayscale image as input. Once we pass it through a combination of convolution and
pooling layers, the output will be passed through fully connected layers and classified into
corresponding classes. The total number of parameters in LeNet-5 are:

 Parameters: 60k
 Layers flow: Conv -> Pool -> Conv -> Pool -> FC -> FC -> Output
 Activation functions: Sigmoid/tanh and ReLU

AlexNet

An illustrated summary of AlexNet is given below:

This network is similar to LeNet-5 with just more convolution and pooling layers:

 Parameters: 60 million
 Activation function: ReLU

VGG-16

The underlying idea behind VGG-16 was to use a much simpler network where the focus is on
having convolution layers that have 3 X 3 filters with a stride of 1 (and always using the same
padding). The max pool layer is used after each convolution layer with a filter size of 2 and a stride
of 2. Let’s look at the architecture of VGG-16:
As it is a bigger network, the number of parameters is also larger.

 Parameters: 138 million
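If Keras is available, the pre-built VGG16 in tf.keras.applications can be used to check this figure (weights=None builds the architecture without downloading the pretrained weights):

import tensorflow as tf

vgg16 = tf.keras.applications.VGG16(weights=None)   # standard 16-layer configuration
print(vgg16.count_params())                         # roughly 138 million parameters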

These are three classic architectures. Next, we’ll look at more advanced architecture starting with
ResNet.

ResNet

Training very deep networks can lead to problems like vanishing and exploding gradients. How
do we deal with these issues? We can use skip connections where we take activations from one
layer and feed them to another layer that is even deeper in the network. There are residual blocks
in ResNet which help in training deeper networks.

Residual Blocks

The general flow to calculate activations from different layers can be given as:

This is how we calculate the activations a[l+2] using the activations a[l] and then a[l+1]. a[l] needs to
go through all these steps to generate a[l+2]:
In a residual network, we make a change in this path. We take the activations a[l] and pass them
directly to the second layer:

So, the activations a[l+2] will be:

a[l+2] = g(z[l+2] + a[l])

The residual network can be shown as:

The benefit of training a residual network is that even if we train deeper networks, the
training error does not increase. Whereas in case of a plain network, the training error first
decreases as we train a deeper network and then starts to rapidly increase:
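A minimal sketch of one residual block using the Keras functional API; the 'same' padding and the matching number of filters are assumptions so that the skip connection can be added to z[l+2]:

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(a_l, filters=64):
    z = layers.Conv2D(filters, 3, padding="same")(a_l)
    a = layers.Activation("relu")(z)
    z = layers.Conv2D(filters, 3, padding="same")(a)
    # a[l+2] = g(z[l+2] + a[l]): add the skip connection, then apply g.
    return layers.Activation("relu")(layers.Add()([z, a_l]))

inputs = tf.keras.Input(shape=(28, 28, 64))   # hypothetical input shape
outputs = residual_block(inputs)
print(tf.keras.Model(inputs, outputs).output_shape)   # (None, 28, 28, 64)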

We now have an overview of how ResNet works. But why does it perform so well? Let’s find out!

Why ResNets Work?

In order to make a good model, we first have to make sure that its performance on the training
data is good. That’s the first test and there really is no point in moving forward if our model fails
here. We have seen earlier that training deeper networks using a plain network increases the
training error after a point of time. But while training a residual network, this isn’t the case. Even
when we build a deeper residual network, the training error generally does not increase.

The equation to calculate activation using a residual block is given by:


a[l+2] = g(z[l+2] + a[l])
a[l+2] = g(w[l+2] * a[l+1] + b[l+2] + a[l])

Now, say w[l+2] = 0 and the bias b[l+2] is also 0, then:

a[l+2] = g(a[l])

It is fairly easy to calculate a[l+2] knowing just the value of a[l]. As per the research paper, ResNet
is given by:

Networks in Networks and 1×1 Convolutions

Let’s see how a 1 X 1 convolution can be helpful. Suppose we have a 28 X 28 X 192 input and we
apply a 1 X 1 convolution using 32 filters. So, the output will be 28 X 28 X 32:

The basic idea of using 1 X 1 convolution is to reduce the number of channels from the image. A
couple of points to keep in mind:

 We generally use a pooling layer to shrink the height and width of the image
 To reduce the number of channels from an image, we convolve it using a 1 X 1 filter (hence
reducing the computation cost as well)
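A one-line check of this channel reduction, assuming TensorFlow/Keras is available:

import tensorflow as tf

x = tf.random.normal((1, 28, 28, 192))                  # a batch of one input volume
conv1x1 = tf.keras.layers.Conv2D(filters=32, kernel_size=1)
print(conv1x1(x).shape)                                 # (1, 28, 28, 32)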

The Motivation Behind Inception Networks

While designing a convolutional neural network, we have to decide the filter size. Should it be a 1
X 1 filter, or a 3 X 3 filter, or a 5 X 5? Inception does all of that for us! Let’s see how it works.
Suppose we have a 28 X 28 X 192 input volume. Instead of choosing what filter size to use, or
whether to use convolution layer or pooling layer, inception uses all of them and stacks all the
outputs:

A good question to ask here – why are we using all these filters instead of using just a single filter
size, say 5 X 5? Let’s look at how many computations would arise if we would have used only a
5 X 5 filter on our input:

Number of multiplies = 28 * 28 * 32 * 5 * 5 * 192 = 120 million! Can you imagine how expensive
performing all of these will be?

Now, let’s look at the computations a 1 X 1 convolution and then a 5 X 5 convolution will give
us:
Number of multiplies for first convolution = 28 * 28 * 16 * 1 * 1 * 192 = 2.4 million
Number of multiplies for second convolution = 28 * 28 * 32 * 5 * 5 * 16 = 10 million
Total number of multiplies = 12.4 million

A significant reduction. This is the key idea behind inception.
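The multiplication counts above can be reproduced with simple arithmetic:

direct_5x5 = 28 * 28 * 32 * 5 * 5 * 192       # single 5 x 5 convolution
bottleneck_1x1 = 28 * 28 * 16 * 1 * 1 * 192   # 1 x 1 'bottleneck' convolution
then_5x5 = 28 * 28 * 32 * 5 * 5 * 16          # 5 x 5 convolution on the reduced volume

print(direct_5x5)                  # 120422400  (~120 million)
print(bottleneck_1x1 + then_5x5)   # 12443648   (~12.4 million)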

Inception Networks

This is how an inception block looks:


We stack all the outputs together. Also, we apply a 1 X 1 convolution before applying 3 X 3 and
5 X 5 convolutions in order to reduce the computations. An inception model is the combination of
these inception blocks repeated at different locations, some fully connected layer at the end, and a
softmax classifier to output the classes.
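A minimal Keras sketch of such an inception block follows; the filter counts are illustrative rather than the exact GoogLeNet configuration, and 'same' padding plus a stride-1 pooling branch keep the spatial size so the branch outputs can be concatenated:

import tensorflow as tf
from tensorflow.keras import layers

def inception_block(x):
    # Branch 1: plain 1 x 1 convolution.
    b1 = layers.Conv2D(64, 1, padding="same", activation="relu")(x)
    # Branch 2: 1 x 1 bottleneck, then 3 x 3 convolution.
    b2 = layers.Conv2D(96, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(128, 3, padding="same", activation="relu")(b2)
    # Branch 3: 1 x 1 bottleneck, then 5 x 5 convolution.
    b3 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(32, 5, padding="same", activation="relu")(b3)
    # Branch 4: max pooling, then 1 x 1 convolution.
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(32, 1, padding="same", activation="relu")(b4)
    # Stack all the outputs along the channel dimension.
    return layers.Concatenate()([b1, b2, b3, b4])

inputs = tf.keras.Input(shape=(28, 28, 192))
print(tf.keras.Model(inputs, inception_block(inputs)).output_shape)
# (None, 28, 28, 256)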

Q4. Write about applications of CNN in vision, speech and audio


Convolutional Neural Networks (CNNs) have found widespread applications in various domains,
including computer vision, speech processing, and audio analysis, owing to their ability to
automatically learn hierarchical representations from data. Here are some notable applications of
CNNs in vision, speech, and audio:

Computer Vision:

1. Image Classification:

CNNs excel in image classification tasks by learning hierarchical features from pixels to high-
level representations. Applications include identifying objects in images, facial recognition, and
scene classification

2. Object Detection:

CNNs are widely used for object detection tasks where the goal is to not only classify objects but
also locate them in an image. Applications include autonomous vehicles, surveillance, and
augmented reality.
3. Segmentation: CNNs are applied to image segmentation tasks, where the goal is to assign
a label to each pixel in an image. This is useful in medical imaging for identifying tumors,
as well as in general image editing and understanding.
4. Image Generation: CNNs can be used for image generation tasks, such as generating
realistic images from scratch or modifying existing ones. Generative models like GANs
(Generative Adversarial Networks) leverage CNNs for image synthesis.
5. Facial Recognition: CNNs play a crucial role in facial recognition systems, enabling
applications like unlocking devices, verifying identities, and enhancing security.

Speech Processing:

1. Speech Recognition:

CNNs are applied to automatic speech recognition (ASR) systems, converting spoken language
into text. They learn spectral features from audio signals, improving the accuracy of transcriptions
in applications like voice assistants, dictation software, and customer service automation.

2. Speaker Identification: CNNs can be used to identify individuals based on their voice
characteristics. This has applications in security, authentication systems, and personalized
services.

Audio Analysis:

1. Environmental Sound Classification: CNNs are employed in classifying environmental


sounds, such as identifying the type of sounds in a given audio clip. This is useful in
applications like urban noise monitoring, wildlife conservation, and smart home systems.
2. Music Genre Classification: CNNs are used to automatically classify music into genres
based on spectral and temporal features. This application is prevalent in music streaming
services for personalized recommendations.
3. Speech Emotion Recognition: CNNs can be applied to analyze and classify the emotional
content of speech. This has applications in customer service, virtual assistants, and mental
health monitoring.
4. Anomaly Detection: CNNs can identify anomalous patterns in audio data, making them
valuable for detecting unusual events in applications like security surveillance and
industrial monitoring.

The success of CNNs in these applications lies in their ability to automatically learn hierarchical
features and patterns directly from the raw input data, without the need for manual feature
engineering. This adaptability makes CNNs a powerful tool in various domains, enabling
advancements in technology and improving the efficiency of a wide range of applications.

Q5. Differentiate between different architectures of CNN


LeNet-5 (1998)
 Notable features: convolutional layers, subsampling (pooling), fully connected layers
 Applications: handwritten digit recognition
 Advantages: simple architecture, effective for small images
 Disadvantages: limited capacity for complex tasks

AlexNet (2012)
 Notable features: deep architecture (8 layers), ReLU activation, Local Response Normalization
 Applications: ImageNet Large Scale Visual Recognition Challenge
 Advantages: pioneered deep CNNs, good performance on large datasets
 Disadvantages: requires substantial computational resources

VGGNet (VGG16, VGG19) (2014)
 Notable features: simplicity with deep architectures, 3x3 convolutional filters
 Applications: image classification, object detection
 Advantages: easy to understand and implement, transferable features
 Disadvantages: high computational and memory requirements

GoogLeNet (Inception v1) (2014)
 Notable features: Inception modules (multiple filter sizes), global average pooling
 Applications: image classification, object detection
 Advantages: efficient use of parameters, reduced risk of overfitting
 Disadvantages: complex architecture may be harder to train

ResNet (2015)
 Notable features: residual learning (skip connections), very deep architecture
 Applications: image classification, object detection
 Advantages: mitigates the vanishing gradient problem, good generalization
 Disadvantages: increased complexity may lead to overfitting

DenseNet (2017)
 Notable features: densely connected blocks, feature reuse, direct connections
 Applications: image classification, object detection
 Advantages: parameter efficiency, improved feature propagation
 Disadvantages: high memory consumption, computationally expensive

MobileNet (2017)
 Notable features: depthwise separable convolutions, lightweight architecture
 Applications: mobile and embedded vision applications
 Advantages: efficient on mobile devices, low computational cost
 Disadvantages: reduced accuracy compared to larger architectures

EfficientNet (2019)
 Notable features: compound scaling (scaling up width, depth and resolution), efficient architecture
 Applications: image classification, object detection
 Advantages: achieves high accuracy with fewer parameters
 Disadvantages: requires careful tuning of scaling parameters

SqueezeNet (2016)
 Notable features: Fire modules (squeeze and expand), efficient architecture
 Applications: real-time object detection with limited computational resources
 Advantages: small model size, faster inference speed
 Disadvantages: may sacrifice accuracy for model compression

U-Net (2015)
 Notable features: encoder-decoder architecture, skip connections for segmentation
 Applications: medical image segmentation, biomedical image analysis
 Advantages: excellent for segmentation tasks, captures fine details
 Disadvantages: sensitive to variations in input size, limited context understanding

Q6. Explain the concept of transfer learning in deep learning.

Ans. What Is Transfer Learning?

The reuse of a pre-trained model on a new problem is known as transfer learning in machine
learning. In transfer learning, a machine uses the knowledge learned from a prior task to improve
its predictions on a new task. You could, for example, reuse the knowledge gained while training a
classifier to recognise beverages when training a new classifier to predict whether an image
contains cuisine.

In transfer learning, the knowledge of an already trained machine learning model is transferred to
a different but closely linked problem. For example, if you trained a simple classifier to predict
whether an image contains a backpack, you could use the model’s training knowledge to identify
other objects such as sunglasses.


With transfer learning, we basically try to use what we’ve learned in one task to better understand
the concepts in another: weights learned by a network on a source “task A” are transferred to a
network that has to perform a new “task B”.

Because of the massive amount of compute power required to train deep networks from scratch,
transfer learning is typically applied in computer vision and natural language processing tasks like
sentiment analysis.

Transfer learning is a powerful technique used in deep learning. By harnessing the ability to reuse
existing models and their knowledge on new problems, transfer learning has opened the door to
training deep neural networks even with limited data. This breakthrough is especially significant
in data science, where practical scenarios often lack enough labeled data.
How Transfer Learning Works?

In computer vision, neural networks typically learn to detect edges in the first layers, shapes in the
middle layers, and task-specific features in the later layers.

In transfer learning, the early and central layers are reused and only the later layers are retrained,
so the model leverages the labelled data of the task it was originally trained on.


Let’s return to the example of a model that was trained to identify a backpack in an image and will
now be used to detect sunglasses. Because the model has already learned to recognise generic
objects in its earlier layers, we simply retrain the later layers so that it learns what distinguishes
sunglasses from other objects.


Why Should You Use Transfer Learning?

Transfer learning offers a number of advantages, the most important of which are reduced training
time, improved neural network performance (in most circumstances), and not needing a large
amount of data.

To train a neural model from scratch, a lot of data is typically needed, but access to that data isn’t
always possible – this is when transfer learning comes in handy.

Because the model has already been pre-trained, a good machine learning model can be generated
with fairly little training data using transfer learning. This is especially useful in natural language
processing, where huge labelled datasets require a lot of expert knowledge. Additionally, training
time is reduced, because building a deep neural network from scratch for a complex task can take
days or even weeks.

Steps to Use Transfer Learning


Transfer learning makes sense when we don’t have enough annotated data to train our model with,
but a pre-trained model exists that was trained on similar data and tasks. If you used TensorFlow to
train the original model, you can simply restore it and retrain some of its layers for your job.
Transfer learning, however, only works if the features learnt in the first task are general, meaning
they can be applied to the new task. Furthermore, the model’s input must be the same size as it was
when it was first trained; if it isn’t, add a step to resize your input to the required size. With that in
place, there are several ways to apply transfer learning:

1. Training a Model to Reuse it

Consider the situation in which you wish to tackle task A but lack the necessary data to train a deep
neural network. One way around this is to find a related task B with a lot of data. Train the deep
neural network on task B and then reuse the model to solve task A. Whether you need to employ
the entire model or just a few layers depends on the problem you’re trying to solve.

If the input in both tasks is the same, you can reapply the model and make predictions for your new
input. Alternatively, changing and retraining the task-specific layers and the output layer is an
approach worth investigating.

2. Using a Pre-Trained Model

The second option is to employ a model that has already been trained. There are a number of these
models out there, so do some research beforehand. The number of layers to reuse and retrain is
determined by the task.

Keras, for example, ships with a set of pre-trained models that can be used for transfer learning,
prediction and fine-tuning; these models, along with quick tutorials on how to use them, can be
found in the Keras documentation. Many research institutions also make trained models available.
This form of transfer learning is the one most commonly used in deep learning (a minimal sketch
is given after this list).

3. Extraction of Features

Another option is to use deep learning to discover the optimum representation of your problem,
which means identifying the most important features. This approach is known as representation
learning, and it can often produce significantly better results than hand-designed representations.

In machine learning, features are mostly crafted by hand by researchers and domain specialists.
Fortunately, deep learning can extract features automatically. Of course, this does not diminish the
importance of feature engineering and domain knowledge; you must still decide which features you
feed into your network.

4. Extraction of Features in Neural Networks

Neural networks, on the other hand, have the ability to learn which features are important and
which aren’t. Even for complicated tasks that would otherwise require a lot of human effort, a
representation learning algorithm can find a good combination of features in a short amount of
time.

The learned representation can then be applied to a variety of other problems. Simply use the
initial layers to obtain the appropriate feature representation, but avoid using the network’s final
output, because it is too task-specific. Instead, feed data into the network and take the output of
one of the intermediate layers; this output can then be treated as a representation of the raw data.

This method is commonly used in computer vision because it can shrink your dataset, reducing
computation time and making the data more suitable for classical algorithms.
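To tie these options together, here is a minimal, illustrative Keras sketch; the choice of VGG16 as the pre-trained base, the 224 X 224 input size and the 10-class head are assumptions made only for the example:

import tensorflow as tf
from tensorflow.keras import layers

# 1. Load a model pre-trained on ImageNet, without its task-specific head
#    (weights="imagenet" downloads the pretrained weights on first use).
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False            # freeze the early and central layers

# 2. Add a new head for the new task and train only that part.
model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),   # hypothetical 10-class problem
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# 3. Feature extraction: reuse an intermediate layer's output as a
#    representation of the raw data for other (e.g. classical) algorithms.
feature_extractor = tf.keras.Model(
    inputs=base.inputs,
    outputs=base.get_layer("block5_pool").output)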

Models That Have Been Pre-Trained

There are a number of popular pre-trained machine learning models available. The Inception-v3
model, which was developed for the ImageNet “Large Visual Recognition Challenge”, is one of
them. Participants in this challenge had to categorize images into 1,000 subcategories such as
“zebra”, “Dalmatian”, and “dishwasher”.
