UNIT – I: Basics of Deep Learning - History of Deep Learning, McCulloch Pitts Neuron,
Thresholding Logic, Perceptrons, Perceptron Learning Algorithm and Convergence, Multilayer
Perceptrons (MLPs), Representation Power of MLPs, Sigmoid Neurons, Feed forward Neural
Networks.
McCulloch-Pitts Model
Simple McCulloch-Pitts neurons can be used to design logical operations. For that purpose, the connection weights need to be decided correctly, along with the threshold value of the activation function. For a better understanding, consider an example:
John carries an umbrella if it is sunny or if it is raining. There are four possible situations, and we need to decide in which of them John will carry the umbrella. To analyse the situations using the McCulloch-Pitts neural model, we can take the input signals as follows:
● X1: Is it raining?
● X2: Is it sunny?
The value of each input can be either 0 or 1. We take the weight of both X1 and X2 as 1 and the threshold value as 1. The resulting neuron behaves like an OR gate, and its truth table is:
Situation | X1 | X2 | X1 + X2 | Output Y
1 | 0 | 0 | 0 | 0
2 | 0 | 1 | 1 | 1
3 | 1 | 0 | 1 | 1
4 | 1 | 1 | 2 | 1
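As a quick check, here is a minimal Python sketch (the function and variable names are illustrative) that evaluates this McCulloch-Pitts neuron for all four situations:

# McCulloch-Pitts neuron for the umbrella (OR) example:
# both weights are 1 and the threshold value is 1, as chosen above.
def mp_neuron(x1, x2, w1=1, w2=1, threshold=1):
    weighted_sum = w1 * x1 + w2 * x2
    return 1 if weighted_sum >= threshold else 0   # fire only if the sum reaches the threshold

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, mp_neuron(x1, x2))
# Prints 0, 1, 1, 1: John carries the umbrella unless it is neither raining nor sunny.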
See also: https://www.javatpoint.com/perceptron-in-machine-learning
Convergence of the Perceptron Algorithm:
The Perceptron learning algorithm converges if the data is linearly separable, meaning
there exists a hyperplane that can completely separate the positive and negative
examples. When the data is linearly separable, the Perceptron algorithm is guaranteed
to find a solution, and it will converge after a finite number of iterations. This is known
as the Perceptron convergence theorem.
However, if the data is not linearly separable, the Perceptron algorithm may not
converge. In such cases, the algorithm keeps updating the weights endlessly, trying to
find a perfect separation, but it never stops. To handle non-linearly separable data,
techniques like adding a bias term or using more complex models (such as neural
networks with hidden layers) can be employed.
It's important to note that the Perceptron algorithm is a fundamental concept in neural
networks and machine learning, but its limitations led to the development of more
sophisticated algorithms and architectures, such as multilayer perceptrons (MLPs),
which can handle complex, non-linear relationships in the data.
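As an illustration of the learning rule, here is a minimal Python sketch of the Perceptron algorithm on a small linearly separable problem (the AND gate); the data, learning rate, and epoch limit are illustrative, not taken from the source:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # toy, linearly separable data (AND gate)
y = np.array([0, 0, 0, 1])

w, b, lr = np.zeros(2), 0.0, 0.1                  # weights, bias, learning rate

for epoch in range(100):                          # finite iterations suffice if data is separable
    mistakes = 0
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        update = lr * (target - pred)             # perceptron update rule
        w += update * xi
        b += update
        mistakes += int(update != 0)
    if mistakes == 0:                             # converged: every example classified correctly
        break

print(w, b, "converged after", epoch + 1, "epochs")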
3. Working of the Backpropagation Algorithm:
4. MLPs:
Structure of a Multi-Layer Perceptron (MLP):
By definition, an MLP is a type of artificial neural network composed of multiple layers of interconnected neurons. These networks are modelled after the neurons in the human brain.
MLPs are feedforward networks, in which data is passed in only one direction, unlike architectures such as recurrent neural networks, where data is passed in both directions and forms a cycle. The MLP is the core idea behind more powerful architectures such as CNNs.
1. Input Layer:
● The input layer consists of nodes (also known as input neurons)
representing the features of the input data. Each feature of the
input is represented by a separate node.
● For example, in a simple image recognition task, each node in the
input layer might represent a pixel's intensity value in the image.
2. Hidden Layers:
● Between the input and output layers, there can be one or more
hidden layers. These layers are called "hidden" because they are
not directly exposed to the external environment or the user; their
workings are internal to the network.
● Each node in a hidden layer performs a weighted sum of its inputs.
The weights represent the strength of the connections between
nodes in the previous layer and the current node in the hidden
layer.
● After calculating the weighted sum, an activation function is
applied to introduce non-linearity into the network. Common
activation functions include ReLU (Rectified Linear Unit), sigmoid,
and tanh.
● The introduction of hidden layers and non-linear activation
functions allows MLPs to capture complex patterns and
relationships within the data.
3. Output Layer:
● The output layer produces the network's predictions or
classifications. The number of nodes in the output layer depends
on the task:
● For binary classification, there is one node in the output
layer, often using a sigmoid activation function to produce
values between 0 and 1.
● For multi-class classification, there are multiple nodes
(equal to the number of classes) with softmax activation,
ensuring that the output values represent probabilities and
sum up to 1.
● For regression tasks, there is one node in the output layer
without any activation function, allowing the network to
predict continuous numerical values.
Functioning of an MLP:
1. Initialization:
● Initialize the weights and biases of the network. These values are
usually initialized randomly.
2. Forward Propagation:
● During the forward pass, input data is fed into the input layer.
● The input values are multiplied by the weights and summed up in
each node of the hidden layers.
● The result of this summation is then passed through an activation
function, producing the output of each node in the hidden layers.
● The process continues through each hidden layer until the output
layer is reached. The output layer produces the network's
predictions.
3. Loss Calculation:
● Compare the predictions from the output layer to the actual target
values using a suitable loss function, such as mean squared error
for regression or cross-entropy for classification.
● The loss function measures the difference between the predicted
values and the true values, providing a measure of how well the
network is performing.
4. Backpropagation:
● Backpropagation involves calculating the gradients of the loss with
respect to the network's weights and biases.
● These gradients are calculated using the chain rule, starting from
the output layer and propagating backward through the hidden
layers.
● The gradients indicate how much each weight and bias
contributed to the overall error. The network then adjusts these
parameters using optimization algorithms like gradient descent,
updating them to minimize the loss function.
5. Training Iterations:
● Steps 2 to 4 are repeated for multiple iterations (epochs) or until a
convergence criterion is met.
● During each iteration, the network learns to improve its
predictions by adjusting the weights and biases based on the
computed gradients.
6. Prediction:
● Once the network is trained and the weights and biases are
optimized, the MLP can be used to make predictions on new,
unseen data.
● Input data is fed into the trained network, and the output layer
produces the predictions or classifications.
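To tie steps 1 to 6 together, here is a compact NumPy sketch of a one-hidden-layer MLP trained with gradient descent; the data, layer sizes, and learning rate are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))            # toy inputs
y = X[:, :1] ** 2 + X[:, 1:]                     # toy regression target, shape (200, 1)

# 1. Initialization: random weights, zero biases
W1, b1 = rng.normal(0, 0.5, (2, 8)), np.zeros((1, 8))
W2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros((1, 1))
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):                        # 5. training iterations
    # 2. Forward propagation
    h = sigmoid(X @ W1 + b1)                     # hidden-layer activations
    y_hat = h @ W2 + b2                          # linear output node for regression
    # 3. Loss calculation (mean squared error)
    loss = np.mean((y_hat - y) ** 2)
    # 4. Backpropagation (chain rule, output layer back to hidden layer)
    d_out = 2 * (y_hat - y) / len(X)
    dW2, db2 = h.T @ d_out, d_out.sum(0, keepdims=True)
    d_h = (d_out @ W2.T) * h * (1 - h)           # derivative of the sigmoid
    dW1, db1 = X.T @ d_h, d_h.sum(0, keepdims=True)
    W1 -= lr * dW1; b1 -= lr * db1               # gradient-descent updates
    W2 -= lr * dW2; b2 -= lr * db2

# 6. Prediction on new, unseen data
x_new = np.array([[0.5, -0.2]])
print(loss, sigmoid(x_new @ W1 + b1) @ W2 + b2)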
5. Sigmoid Neurons:
A sigmoid neuron is a type of artificial neuron that addresses some of the
limitations of the McCulloch-Pitts neuron. It introduces a sigmoid activation
function that allows for continuous and smooth output, as opposed to the
binary output of the McCulloch-Pitts neuron. The sigmoid function is
commonly used and has the following mathematical form:
σ(z) = 1 / (1 + e^(-z))
Where:
● σ(z) is the output of the sigmoid function.
● z is the weighted sum of inputs plus a bias term.
Sigmoid neurons have the following components:
1. Inputs: Similar to the McCulloch-Pitts neuron, sigmoid neurons take
inputs with associated weights. Each input is multiplied by its weight,
and the weighted inputs are summed up.
2. Bias: A bias term is added to the weighted sum before applying the
sigmoid activation function. The bias allows for a shift in the output and
is an additional learnable parameter.
3. Activation Function: The sigmoid activation function transforms the
weighted sum plus bias into a continuous value between 0 and 1. This
smooth transition allows for more nuanced representations and gradual
changes in output.
Mathematically, the output of a sigmoid neuron can be represented as:
Output = σ(Σ(w · x) + b)
Where:
● σ is the sigmoid activation function.
● Σ(w * x) represents the weighted sum of inputs.
● b is the bias.
Sigmoid neurons are used as building blocks in feedforward neural networks,
which are the foundation of many machine learning and deep learning
applications.
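A one-line numerical sketch of that formula (the weights, inputs, and bias below are made-up values):

import numpy as np

def sigmoid_neuron(x, w, b):
    z = np.dot(w, x) + b                 # weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-z))      # smooth output between 0 and 1

print(sigmoid_neuron(x=[1.0, 0.5], w=[0.8, -0.4], b=0.1))   # ~0.67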
Feedforward Neural Networks:
Feedforward is a type of neural network in deep learning that transmits data
in one direction, from input to output, without feedback loops. This makes
feedforward networks suitable for tasks like pattern recognition and
classification.
Feedforward networks are also known as multi-layer neural networks. During data flow, input nodes receive data, which travels through the hidden layers and exits through the output nodes.
Example: Representation Power of MLPs (approximating a sinusoidal function)
4. Evaluation:
● After training, evaluate the performance of the trained MLP by comparing its predictions with the true sinusoidal values for new, unseen inputs.
By adjusting the number of neurons in the hidden layer, an MLP can
approximate the sinusoidal function with different levels of accuracy. As the
number of neurons increases, the MLP gains more expressive power, allowing it
to capture intricate patterns and achieve a closer approximation to the true
sinusoidal curve.
This example demonstrates the representation power of MLPs: they can learn
and approximate complex non-linear functions, making them valuable tools in
various applications, including regression, classification, and pattern
recognition, where the relationships within the data are non-linear and intricate.
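The idea can be sketched with scikit-learn's MLPRegressor (assuming scikit-learn is available; the data sizes and hyperparameters are illustrative): as the hidden layer grows, the fit to the sinusoid improves.

import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.linspace(0, 2 * np.pi, 400).reshape(-1, 1)   # sample points
y = np.sin(X).ravel()                                # true sinusoidal values

for n_hidden in (2, 10, 50):                         # more neurons -> more expressive power
    mlp = MLPRegressor(hidden_layer_sizes=(n_hidden,), activation="tanh",
                       max_iter=5000, random_state=0)
    mlp.fit(X, y)
    print(n_hidden, round(mlp.score(X, y), 3))       # R^2 rises towards 1 as capacity grows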
The local minimum or local maximum of a function is found using gradient descent as follows:
● Moving in the direction of the negative gradient of the function at the current point leads towards the local minimum of that function.
● Moving in the direction of the positive gradient leads towards the local maximum; this variant is known as gradient ascent.
Descending along the negative gradient is known as gradient descent, also called steepest descent. The main objective of using a gradient descent algorithm is to minimize the cost function using iteration. To achieve this goal, it performs two steps iteratively:
● Compute the gradient (the first-order derivative) of the cost function at the current point.
● Take a step in the direction opposite to the gradient, with the step size controlled by the learning rate.
What is Cost-function?
The cost function is defined as the measurement of the difference or error between the actual values and the expected values at the current position, expressed as a single real number.
Further, the algorithm continuously iterates along the direction of the negative gradient until the cost function approaches its minimum (ideally close to zero). At this point of steepest descent, the model stops learning further.
The slight difference between the loss function and the cost function is about the error
within the training of machine learning models, as loss function refers to the error of
one training example, while a cost function calculates the average error across an entire
training set.
Y = mX + c
Where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.
The starting point is just an arbitrary point used to evaluate the performance. At this starting point, we take the first derivative (the slope) and use a tangent line to measure how steep it is. This slope informs the updates to the parameters (the weights and the bias).
The slope is steep at the starting (arbitrary) point, but as new parameters are generated the steepness gradually reduces, until it reaches the lowest point, which is called the point of convergence.
The main objective of gradient descent is to minimize the cost function, i.e., the error between the expected and actual values. To minimize the cost function, two data points are required:
● The direction of steepest descent (the gradient at the current point).
● The learning rate.
These two factors determine the partial-derivative calculations of future iterations and allow the algorithm to reach the point of convergence, a local or global minimum.
Let's discuss the learning rate in brief:
Learning Rate:
It is defined as the step size taken to reach the minimum (lowest) point. It is typically a small value that is evaluated and updated based on the behaviour of the cost function. A high learning rate results in larger steps but risks overshooting the minimum, while a low learning rate takes small steps, sacrificing overall efficiency but giving the advantage of more precision.
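A minimal sketch of gradient descent fitting Y = mX + c by minimizing the mean squared error cost (the data and learning rate are illustrative):

import numpy as np

X = np.linspace(0, 1, 50)
Y = 2 * X + 1                               # toy data on the line Y = 2X + 1

m, c = 0.0, 0.0                             # arbitrary starting point
lr = 0.1                                    # learning rate: the step size per iteration

for step in range(2000):
    error = (m * X + c) - Y
    cost = np.mean(error ** 2)              # cost function over the whole set
    grad_m = 2 * np.mean(error * X)         # partial derivative of the cost wrt m
    grad_c = 2 * np.mean(error)             # partial derivative of the cost wrt c
    m -= lr * grad_m                        # step along the negative gradient
    c -= lr * grad_c

print(round(m, 3), round(c, 3), cost)       # m -> 2, c -> 1 at the point of convergence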
4. What is RMSProp?
RMSprop optimizes the training of neural networks using gradients, the same idea that underlies backpropagation. As data travels through very complicated functions such as deep neural networks, the resulting gradients often vanish or explode. RMSprop is a stochastic mini-batch learning method with an adaptive learning rate designed to cope with this.
RMSProp algorithm
Like other gradient descent algorithms, RMSprop works by calculating the
gradient of the loss function with respect to the model’s parameters and
updating the parameters in the opposite direction of the gradient to
minimize the loss. However, RMSProp introduces a few additional
techniques to improve the performance of the optimization process.
One key feature is its use of a moving average of the squared gradients to
scale the learning rate for each parameter. This helps to stabilize the
learning process and prevent oscillations in the optimization trajectory.
The update rule can be written as:
E[g^2]_t = β · E[g^2]_(t-1) + (1 − β) · g_t^2
θ_(t+1) = θ_t − (η / √(E[g^2]_t + ε)) · g_t
Where:
● g_t is the gradient of the loss with respect to the parameter at step t,
● E[g^2]_t is the moving average of the squared gradients,
● β is the decay rate, η is the learning rate, and ε is a small constant added for numerical stability.
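A minimal sketch of one RMSProp update for a parameter vector (the decay rate, learning rate, and epsilon below are typical defaults, not values from the source):

import numpy as np

def rmsprop_update(theta, grad, avg_sq, lr=0.001, beta=0.9, eps=1e-8):
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2      # moving average of squared gradients
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)  # per-parameter scaled step
    return theta, avg_sq

theta, avg_sq = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, -0.1])                             # gradient of the loss wrt theta
theta, avg_sq = rmsprop_update(theta, grad, avg_sq)
print(theta, avg_sq)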
Adam vs RMSProp
RMSProp is often compared to the Adam (Adaptive Moment Estimation)
optimization algorithm, another popular optimization method for deep
learning. Both algorithms combine elements of momentum and adaptive
learning rates to improve the optimization process, but Adam uses a
slightly different approach to compute the moving averages and adjust the
learning rates. Adam is generally more popular and widely used than the
RMSProp optimizer, but both algorithms can be effective in different
settings.
RMSProp advantages
● Fast convergence. RMSprop is known for its fast convergence
speed, which means that it can find good solutions to optimization
problems in fewer iterations than some other algorithms. This can be
especially useful for training large or complex models, where training
time is a critical concern.
● Stable learning. The use of a moving average of the squared
gradients in RMSprop helps to stabilize the learning process and
prevent oscillations in the optimization trajectory. This can make the
optimization process more robust and less prone to diverging or
getting stuck in local minima.
● Fewer hyperparameters. RMSprop has fewer hyperparameters than some other optimization algorithms, which makes it easier to tune and use in practice. The main hyperparameters in RMSprop are the learning rate and the decay rate, which can be chosen using techniques like grid search or random search.
● Good performance on non-convex problems. RMSprop tends to
perform well on non-convex optimization problems, common in
Machine Learning and deep learning. Non-convex optimization
problems have multiple local minima, and RMSprop’s fast
convergence speed and stable learning can help it find good
solutions even in these cases.
Overall, RMSprop is a powerful and widely used optimization algorithm that
can be effective for training a variety of Machine Learning models,
especially deep learning models.
UNIT IV (06 Hrs)
Convolution Neural Network (CNN) - Convolutional operation, Pooling, LeNet, AlexNet,
ZF-Net, VGGNet, GoogLeNet, ResNet. Visualizing Convolutional Neural Networks, Guided
Backpropagation.
CNN architecture
A Convolutional Neural Network consists of multiple layers, such as the input layer, convolutional layers, pooling layers, and fully connected layers.
Simple CNN architecture
The Convolutional layer applies filters to the input image to extract features, the Pooling layer
downsamples the image to reduce computation, and the fully connected layer makes the final
prediction. The network learns the optimal filters through backpropagation and gradient descent.
1. Introduction:
● CNNs are a class of deep neural networks designed for tasks such as image
recognition, object detection, and image classification.
2. Basic Components:
● Convolutional Layers:
● The core building blocks of CNNs are convolutional layers. These layers use
convolutional operations to scan input data with learnable filters (kernels) to
detect patterns and features.
● Filters are small, learnable matrices that slide over the input data to perform
convolution operations. The result is feature maps that represent learned
features.
● Activation Functions:
● Non-linear activation functions like ReLU (Rectified Linear Unit) are applied
after convolutional operations to introduce non-linearity and enable the
network to learn complex relationships.
● Pooling Layers:
● Pooling layers reduce the spatial dimensions of the input and the number of parameters in the network, helping to make the detection of features invariant to (i.e., not affected by) scale and orientation changes.
● Max pooling is a common pooling operation, selecting the maximum value from a group of neighbouring pixels.
3. Architectural Layers:
● Input Layer:
● The input layer represents the raw input data, such as an image. The
dimensions of the input layer depend on the size and color channels of the
input images.
● Convolutional Blocks:
● Stacks of convolutional layers followed by activation functions extract increasingly abstract features.
● Pooling Layers:
● Pooling layers between the convolutional blocks progressively reduce the spatial dimensions of the feature maps.
● Fully Connected Layers:
● After the convolutional and pooling layers, fully connected layers process the flattened feature maps and make predictions.
● Dense layers are used for classification tasks, and their neurons are connected to every neuron in the previous layer.
● Output Layer:
● The output layer produces the final prediction based on the task. For
classification, it may involve a softmax activation function to yield class
probabilities.
4. Common CNN Architectures:
● LeNet-5:
● AlexNet:
● VGGNet:
● InceptionNet (GoogLeNet):
● MobileNetV2:
5. Training:
● The network learns the optimal filter weights and biases through backpropagation and gradient descent.
6. Transfer Learning:
● CNNs often leverage transfer learning by using pre-trained models on large datasets
like ImageNet. Fine-tuning is applied on specific tasks, saving training time and
resources.
Input Image: Consider a grayscale image as our input. Each pixel in the image has an
intensity value (e.g., ranging from 0 to 255).
Filter (Kernel): The filter is a small matrix with learnable weights. It is smaller than the input
image and is usually square (e.g., 3x3 or 5x5). The filter's values determine what features it
detects.
Sliding: The filter is slid over the input image in a specified manner. At each position,
element-wise multiplication is performed between the filter's values and the corresponding
pixel values in the image region covered by the filter.
Summation: After element-wise multiplication, the resulting values are summed up to get a
single value.
Feature Map: The sum is placed in the output (feature map) at the position corresponding to
the center of the filter's current location.
The process is repeated for every possible position of the filter over the input image. This
results in a new image-like structure, the feature map, where each value represents the
response of the filter to a specific feature in the input image.
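The sliding, multiplication, and summation steps can be sketched directly in NumPy (the sizes are illustrative; stride 1 and no padding are assumed):

import numpy as np

def convolve2d_valid(image, kernel):
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))            # feature map is (n-f+1) x (n-f+1)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = image[i:i + f, j:j + f]          # image patch under the filter
            out[i, j] = np.sum(region * kernel)       # element-wise product, then sum
    return out

image = np.arange(16, dtype=float).reshape(4, 4)      # tiny 4x4 "image"
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)          # vertical-edge filter
print(convolve2d_valid(image, kernel))                # 2x2 feature map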
In this section, we will focus on how edges can be detected in an image. Suppose we are given a 6 X 6 grayscale image that contains many vertical and horizontal edges. The first thing to do is to detect these edges:
After the convolution, we will get a 4 X 4 image. The first element of the 4
X 4 matrix will be calculated as:
So, we take the first 3 X 3 matrix from the 6 X 6 image and multiply it with
the filter. Now, the first element of the 4 X 4 output will be the sum of the
element-wise product of these values, i.e. 3*1 + 0 + 1*-1 + 1*1 + 5*0 + 8*-1
+ 2*1 + 7*0 + 2*-1 = -5. To calculate the second element of the 4 X 4
output, we will shift our filter one step towards the right and again get the
sum of the element-wise product:
Similarly, we will convolve over the entire image and get a 4 X 4 output:
So, convolving a 6 X 6 input with a 3 X 3 filter gave us an output of 4 X 4.
Consider one more example:
Note: Higher pixel values represent the brighter portion of the image and
the lower pixel values represent the darker portions. This is how we can
detect a vertical edge in an image.
On the other hand, Average Pooling returns the average of all the values from the portion of the image covered by the kernel. Average Pooling performs dimensionality reduction while acting as a noise-suppressing mechanism. In practice, Max Pooling usually performs better than Average Pooling at preserving the most prominent features.
More Edge Detection
The type of filter that we choose helps to detect the vertical or horizontal
edges. We can use the following filters to detect different edges:
Padding
● Input: n X n
● Filter size: f X f
● Output: (n-f+1) X (n-f+1)
Every time we apply a convolution, the output shrinks (from n X n to (n-f+1) X (n-f+1)) and the pixels at the edges are used in fewer computations. To overcome these issues, we can pad the image with an additional border, i.e., we add one pixel all around the edges. This means that the input becomes an 8 X 8 matrix (instead of a 6 X 6 matrix). Applying a 3 X 3 convolution to it results in a 6 X 6 matrix, which is the original shape of the image. This is where padding comes to the fore:
● Input: n X n
● Padding: p
● Filter size: f X f
● Output: (n+2p-f+1) X (n+2p-f+1)
There are two common choices for padding: 'valid' (no padding) and 'same' (pad so that the output has the same size as the input).
We now know how to use padded convolution. This way we don’t lose a lot
of information and the image does not shrink either. Next, we will look at
how to implement strided convolutions.
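A small helper makes the output-size formula above (with padding p and stride s) concrete; the function name is illustrative:

def conv_output_size(n, f, p=0, s=1):
    # (n + 2p - f) / s + 1, for an n x n input and an f x f filter
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))        # 4 -> a 'valid' convolution shrinks the image
print(conv_output_size(6, 3, p=1))   # 6 -> 'same' padding preserves the original size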
Till now we have performed the feature-extraction steps; now comes the classification part. The fully connected layer (as we have in an ANN) is used for classifying the input image into a label. This layer connects the information extracted from the previous steps (i.e., the convolution and pooling layers) to the output layer and eventually classifies the input into the desired label. This completes the overall CNN pipeline: feature extraction followed by classification.
Q) Explain the pooling layer with a suitable example in a convolutional network.
Ans:
A pooling layer is another important component of Convolutional
Neural Networks (CNNs) that follows convolutional layers. Its main
purpose is to reduce the spatial dimensions of the input feature
maps while retaining the most important information. Pooling is used
for downsampling and dimensionality reduction, which helps in
controlling the number of parameters in the network and reducing
computation.
Pooling Operation in CNN:
The pooling operation involves dividing the input feature map into
non-overlapping or overlapping regions and then performing an
aggregation operation (like max or average pooling) within each
region. The result is a pooled or downsampled version of the input
feature map.
There are two common types of pooling operations: max pooling and
average pooling.
1. Max Pooling: In max pooling, for each region, the maximum
value within that region is selected and placed in the pooled
feature map. Max pooling helps capture the most prominent
features within the region.
2. Average Pooling: In average pooling, the average value of all the
values within the region is calculated and placed in the pooled
feature map. Average pooling helps maintain a smoother
representation of the input.
Example:
Let's take a simple 4x4 feature map as an example:
Input Feature Map:
| 2 | 4 | 1 | 3 |
| 7 | 5 | 9 | 2 |
| 8 | 3 | 6 | 5 |
| 1 | 6 | 2 | 4 |
We will use max pooling with a 2x2 window and a stride of 2. This
means we will divide the feature map into non-overlapping 2x2
regions and select the maximum value from each region to create the
pooled feature map.
The process goes as follows:
1. First, apply the 2x2 max pooling window to the top-left 2x2
region: Max value = 7.
2. Move the window to the top-right 2x2 region: Max value = 9.
3. Move the window to the bottom-left 2x2 region: Max value = 8.
4. Move the window to the bottom-right 2x2 region: Max value =
6.
The resulting pooled feature map would look like this:
Pooled Feature Map:
| 7 | 9 |
| 8 | 6 |
In this example, the pooling operation reduced the spatial
dimensions of the feature map from 4x4 to 2x2, effectively
downsampling the data. Max pooling helped retain the most
important information within each 2x2 region.
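The same 4x4 example can be reproduced with a short NumPy sketch (the helper name is illustrative):

import numpy as np

feature_map = np.array([[2, 4, 1, 3],
                        [7, 5, 9, 2],
                        [8, 3, 6, 5],
                        [1, 6, 2, 4]])

def max_pool(fmap, size=2, stride=2):
    out_dim = (fmap.shape[0] - size) // stride + 1
    pooled = np.zeros((out_dim, out_dim), dtype=fmap.dtype)
    for i in range(out_dim):
        for j in range(out_dim):
            window = fmap[i*stride:i*stride + size, j*stride:j*stride + size]
            pooled[i, j] = window.max()               # keep only the strongest response
    return pooled

print(max_pool(feature_map))   # [[7 9]
                               #  [8 6]]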
Pooling layers are typically inserted between convolutional layers in a
CNN architecture. They help in reducing the computation and
memory requirements of the network while preserving important
features for subsequent layers to work with.
Advantages of Max Pooling:
1. Translation Invariance: Max pooling helps make the network
less sensitive to small translations in the input data. Since only
the maximum value within a pooling window is retained, small
shifts in the input won't significantly affect the pooled output.
2. Feature Selection: Max pooling retains the most dominant and
important features within each local region of the input. It helps
the network focus on detecting key patterns.
3. Downsampling: Max pooling reduces the spatial dimensions of
the input, reducing computational requirements and memory
usage while retaining essential features.
ReLU
An important feature of AlexNet is the use of the ReLU (Rectified Linear Unit) nonlinearity.
Tanh or sigmoid activation functions used to be the usual way to train a neural network model. AlexNet showed that, using the ReLU nonlinearity, deep CNNs could be trained much faster than with saturating activation functions like tanh or sigmoid; this was demonstrated on the CIFAR-10 dataset.
Let's see why training is faster with ReLUs. The ReLU function is given by
f(x) = max(0, x)
Comparing plots of the two functions, tanh saturates at -1 and +1 (so its gradient shrinks towards zero for large inputs), while ReLU keeps a constant gradient of 1 for all positive inputs.
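A tiny numerical comparison of the two activations (illustrative values):

import numpy as np

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.maximum(0.0, x))        # ReLU: f(x) = max(0, x)
print(np.tanh(x))                # tanh saturates near -1 and +1 for large |x|
print(1 - np.tanh(x) ** 2)       # tanh gradient shrinks towards 0 when saturated,
                                 # while the ReLU gradient stays 1 for positive inputs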
Pros of AlexNet
1. AlexNet is considered a milestone of CNNs for image classification.
2. Many of its methods, such as the conv + pooling design, dropout, GPU parallel computing, and ReLU, are still the industry standard for computer vision.
3. A unique advantage of AlexNet is the direct image input to the classification model.
4. The convolution layers automatically extract the edges of the images, and the fully connected layers learn to combine these features.
5. Theoretically, the complexity of visual patterns can be effectively extracted by adding more convolutional layers.
Cons of AlexNet
1. AlexNet is not deep compared to later models such as VGGNet, GoogLeNet, and ResNet.
2. The use of large convolution filters (5x5) was discouraged shortly afterwards.
3. Initializing the weights from a normal distribution cannot effectively solve the vanishing-gradient problem; it was later replaced by Xavier initialization.
4. Its performance has been surpassed by more complex models such as GoogLeNet (6.7% top-5 error) and ResNet (3.6% top-5 error).
GoogLeNet:
GoogLeNet (or Inception V1) was proposed by researchers at Google (with the collaboration of various universities) in 2014 in the research paper titled "Going Deeper with Convolutions". This architecture was the winner of the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2014 image-classification challenge. It provided a significant decrease in error rate compared to previous winners AlexNet (winner of ILSVRC 2012) and ZF-Net (winner of ILSVRC 2013), and a significantly lower error rate than VGG (the 2014 runner-up). The architecture uses techniques such as 1×1 convolutions in the middle of the architecture and global average pooling.
Features of GoogleNet:
The GoogLeNet architecture is very different from previous state-of-the-art architectures such as AlexNet and ZF-Net. It uses many different kinds of methods, such as 1×1 convolution and global average pooling, that enable it to create a deeper architecture. Some of these methods are discussed below:
● 1×1 convolution: The Inception architecture uses 1×1 convolutions. These convolutions are used to decrease the number of parameters (weights and biases) of the architecture. By reducing the parameters, we can also increase the depth of the architecture. Let's look at an example of a 1×1 convolution below:
● For example, suppose we want to perform a 5×5 convolution with 48 filters, first without using a 1×1 convolution as an intermediate step and then with one; a rough cost comparison is sketched below:
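Purely for illustration, assume a 14×14×480 input feature map and a 1×1 bottleneck of 16 channels (these sizes are assumptions, not from the notes); the multiply counts then compare as follows:

H, W, C_in, C_out, k = 14, 14, 480, 48, 5          # assumed sizes, not from the notes

direct = H * W * C_out * k * k * C_in              # plain 5x5 convolution with 48 filters
bottleneck = (H * W * 16 * C_in                    # 1x1 convolution down to 16 channels
              + H * W * C_out * k * k * 16)        # 5x5 convolution on only 16 channels

print(f"{direct:,}")       # about 112.9 million multiplications
print(f"{bottleneck:,}")   # about 5.3 million multiplications -> far cheaper, fewer parameters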
VGGNet:
● Hidden Layers: All the hidden layers in the VGG network use ReLU. VGG does not usually use Local Response Normalization (LRN), as it increases memory consumption and training time without improving overall accuracy.
● Fully-Connected Layers: The VGGNet has three fully connected
layers. Out of the three layers, the first two have 4096 channels each,
and the third has 1000 channels, 1 for each class.
Read more at: https://viso.ai/deep-learning/vgg-very-deep-convolutional-networks/
VGG16 Architecture
The number 16 in the name VGG16 refers to the fact that it is a 16-layer deep neural network (VGGNet). This means that VGG16 is a pretty extensive network, with a total of around 138 million parameters. Even by modern standards, it is a huge network. However, the simplicity of the VGGNet16 architecture is what makes it appealing: just by looking at the architecture, it can be seen that it is quite uniform.
The number of filters doubles with every stack of convolution layers; this is a major principle used to design the architecture of the VGG16 network. One of the crucial downsides of the VGG16 network is that it is huge, which means that it takes more time to train its parameters.
Because of its depth and number of fully connected layers, the VGG16
model is more than 533MB. This makes implementing a VGG network a
time-consuming task.
MobileNet:
MobileNet is designed for efficient computation on mobile and
embedded devices, focusing on reducing the number of operations
and parameters while maintaining good accuracy. Key features
include:
1. Depthwise Separable Convolutions: MobileNet uses depthwise
separable convolutions that split standard convolutions into
depthwise and pointwise convolutions. This drastically reduces
computation.
2. Width Multiplier and Resolution Multiplier: These parameters
allow trade-offs between accuracy and computational cost.
Width multiplier controls the number of channels, and
resolution multiplier controls input resolution.
3. Bottleneck Architecture: MobileNet uses a bottleneck
architecture with 1x1 convolutions to reduce the number of
input channels before performing more expensive operations.
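A rough cost comparison illustrates the saving from depthwise separable convolutions (the feature-map and filter sizes below are assumptions for illustration):

H, W = 56, 56                    # spatial size of the feature map (assumed)
C_in, C_out, k = 64, 128, 3      # channels in/out and 3x3 kernels (assumed)

standard = H * W * k * k * C_in * C_out        # ordinary convolution
depthwise = H * W * k * k * C_in               # one k x k filter per input channel
pointwise = H * W * C_in * C_out               # 1x1 convolution mixes the channels
separable = depthwise + pointwise

print(standard, separable, round(standard / separable, 1))
# The separable version needs roughly 8-9x fewer multiplications for 3x3 kernels.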
What is LeNet 5?
LeNet is a convolutional neural network that Yann LeCun introduced in 1989. LeNet is a common term for LeNet-5, a simple convolutional neural network.
LeNet-5 signifies CNN's emergence and outlines its core components. However, it was not popular at the time due to a lack of hardware, especially GPUs (Graphics Processing Units, specialised electronic circuits designed to rapidly manipulate memory to accelerate the creation of images in a frame buffer intended for output to a display device), and due to alternative algorithms, such as SVMs, which could achieve results similar to or even better than those of LeNet.
Features of LeNet-5
● Every convolutional layer includes three parts: convolution,
pooling, and nonlinear activation functions
● Using convolution to extract spatial features (Convolution was
called receptive fields originally)
● The average pooling layer is used for subsampling.
● ‘tanh’ is used as the activation function
● Using Multi-Layered Perceptron or Fully Connected Layers as
the last classifier
● The sparse connection between layers reduces the complexity
of computation
Architecture
The LeNet-5 CNN architecture has seven layers: three convolutional layers, two subsampling (pooling) layers, and two fully connected layers.
**IMP Diagram: LeNet-5 Architecture
First Layer
A 32x32 grayscale image serves as the input for LeNet-5 and is
processed by the first convolutional layer comprising six feature
maps or filters with a stride of one. From 32x32x1 to 28x28x6, the
image’s dimensions shift.
Second Layer
Then, using a filter size of 2x2 and a stride of 2, LeNet-5 adds an average pooling (sub-sampling) layer. The image is reduced to 14x14x6.
Third Layer
A second convolutional layer with 16 feature maps of size 5x5 and a stride of 1 is then present. Only 10 of the 16 feature maps in this layer are connected to the six feature maps of the previous layer.
Sixth Layer
A fully connected layer (F6) with 84 units makes up the sixth layer.
Output Layer
The SoftMax output layer, which has 10 potential values and
corresponds to the digits 0 to 9, is the last layer.
Summary of LeNet-5 Architecture (table from Medium.com)
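A simplified PyTorch sketch of the layer sizes described above (partial connections between S2 and C3 and the original scaled-tanh details are omitted):

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # 32x32x1 -> 28x28x6
            nn.AvgPool2d(2, stride=2),                     # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # -> 10x10x16
            nn.AvgPool2d(2, stride=2),                     # -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # -> 1x1x120
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),                 # F6 layer with 84 units
            nn.Linear(84, num_classes),                    # 10 outputs for the digits 0-9
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
print(model(torch.randn(1, 1, 32, 32)).shape)              # torch.Size([1, 10])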
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks use the same weights for each element of the sequence, decreasing the number of parameters and allowing the model to generalize to sequences of varying lengths. Because of this design, RNNs can also generalize to structured data other than sequential data, such as geographical or graphical data.
Recurrent neural networks, like many other deep learning
techniques, are relatively old. They were first developed in the
1980s, but we didn’t appreciate their full potential until lately. The
advent of long short-term memory (LSTM) in the 1990s, combined
with an increase in computational power and the vast amounts of
data that we now have to deal with, has really pushed RNNs to the
forefront.
RNNs are a type of neural network that can be used to model sequence
data. RNNs, which are formed from feedforward networks, are similar to
human brains in their behaviour. Simply said, recurrent neural networks can
anticipate sequential data in a way that other algorithms can’t.
All of the inputs and outputs in standard neural networks are independent of one another. However, in some circumstances, such as predicting the next word of a phrase, the prior words are necessary, and so the previous words must be remembered. As a result, RNNs were created, which use a hidden layer to overcome this problem. The most important component of an RNN is the hidden state, which remembers specific information about a sequence.
RNNs have a memory that stores all information about the calculations. They use the same parameters for every input, since they perform the same task on all inputs and hidden states to produce the output.
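A minimal NumPy sketch of a single recurrent step shows how the same weights are reused while the hidden state carries the memory (the sizes and values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W_xh = rng.normal(0, 0.1, (hidden, inputs))      # input -> hidden weights
W_hh = rng.normal(0, 0.1, (hidden, hidden))      # hidden -> hidden weights (the "memory")
b_h = np.zeros(hidden)

def rnn_step(x_t, h_prev):
    # the new hidden state depends on the current input AND the previous hidden state
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):          # a toy sequence of 5 time steps
    h = rnn_step(x_t, h)                          # identical parameters at every step
print(h)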
Disadvantages of RNNs:
● Prone to vanishing and exploding gradient problems, hindering
learning.
● Training can be challenging, especially for long sequences.
● Computationally slower than other neural network architectures.