In the last few years, the IT industry has seen huge demand for one
particular skill set known as Deep Learning. Deep Learning is a subset of
Machine Learning consisting of algorithms that are inspired by the functioning
of the human brain, known as neural networks.
CNNs are a class of Deep Neural Networks that can recognize and classify
particular features from images and are widely used for analyzing visual
images. Their applications include image and video recognition, image
classification, medical image analysis, computer vision, and natural language
processing.
CNNs achieve high accuracy, which makes them especially useful for image
recognition. Image recognition has a wide range of uses in various industries
such as medical image analysis, phones, security, recommendation systems,
etc.
Basic Architecture
There are two main parts to a CNN architecture:
1. A convolution tool that separates and identifies the various features of the image for analysis, in a process called Feature Extraction. The feature extraction network consists of many pairs of convolutional and pooling layers.
2. A fully connected layer that utilizes the output from the convolution process and predicts the class of the image based on the features extracted in the previous stages.
This feature extraction stage aims to reduce the number of features present in a dataset: it creates new features that summarise the existing features contained in the original set. There are many CNN layers, as shown in a typical CNN architecture diagram.
Convolution Layers
There are three types of layers that make up a CNN: convolutional layers,
pooling layers, and fully connected (FC) layers. When these layers are
stacked, a CNN architecture is formed. In addition to these three layers,
there are two more important components, the dropout layer and the activation
function, which are defined below.
1. Convolutional Layer
This layer is the first layer that is used to extract the various features from the
input images. In this layer, the mathematical operation of convolution is
performed between the input image and a filter of a particular size MxM. By
sliding the filter over the input image, the dot product is taken between the
filter and the parts of the input image with respect to the size of the filter
(MxM).
The output is termed the feature map, and it gives us information about the
image such as corners and edges. Later, this feature map is fed to other
layers to learn several other features of the input image.
The convolution layer in a CNN passes the result to the next layer after applying
the convolution operation to the input. Convolutional layers benefit a CNN greatly
because they ensure the spatial relationships between pixels remain intact.
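To make the sliding-window dot product concrete, here is a minimal NumPy sketch of the operation described above, for a single-channel image with stride 1 and no padding; the function name and test values are illustrative, not from the original article.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation (what deep learning frameworks call
    'convolution') of a single-channel image with an MxM kernel, stride 1."""
    H, W = image.shape
    M, _ = kernel.shape
    out = np.zeros((H - M + 1, W - M + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product between the kernel and the image patch under it
            out[i, j] = np.sum(image[i:i+M, j:j+M] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # 5x5 input
kernel = np.ones((3, 3)) / 9.0                     # 3x3 averaging filter
print(conv2d(image, kernel).shape)                 # (3, 3) feature map
```

A 5x5 input convolved with a 3x3 filter this way yields a 3x3 feature map, since each spatial dimension shrinks to (input size - filter size) + 1.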
2. Pooling Layer
In Max Pooling, the largest element is taken from each region of the feature
map. Average Pooling calculates the average of the elements in a
predefined-size image section. In Sum Pooling, the total sum of the elements
in the predefined section is computed. The pooling layer usually serves as a
bridge between the convolutional layer and the FC layer.
This pooling stage generalises the features extracted by the convolution
layer and helps the network recognise the features independently of their
exact position. It also reduces the amount of computation in the network.
3. Fully Connected Layer
In this layer, the output from the previous layers is flattened and fed to the
FC layer. The flattened vector then passes through a few more FC layers, where
the usual mathematical operations take place; it is in this stage that the
classification process begins. Two fully connected layers are used because
two connected layers perform better than a single one. These layers in a CNN
reduce the need for human supervision.
4. Dropout
Usually, when all the features are connected to the FC layer, the model can
overfit the training dataset. Overfitting occurs when a model works so well
on the training data that it performs poorly on new data. To overcome this,
a dropout layer randomly deactivates a fraction of the neurons during
training, reducing the network's reliance on any individual neuron.
5. Activation Functions
Finally, one of the most important parameters of the CNN model is the
activation function. Activation functions are used to learn and approximate
any kind of continuous and complex relationship between variables of the
network. In simple words, they decide which information should be fired
forward through the network and which should not at the end of the network.
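For concreteness, here is a quick NumPy sketch of two common activation functions; this is an illustrative aside, with names and test values chosen for demonstration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # passes positives, zeroes out negatives

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes values into (0, 1)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # [0. 0. 3.]
print(sigmoid(z))   # [0.119 0.5   0.953] (approx.)
```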
https://www.upgrad.com/blog/basic-cnn-architecture/
https://www.geeksforgeeks.org/introduction-convolution-neural-network/
CNN, or Convolutional Neural Network, is a type of deep learning model commonly used for image
recognition, computer vision tasks, and other pattern recognition problems. The building blocks of a CNN
are designed to efficiently extract features from input data while preserving spatial relationships. Here
are the key components or building blocks of a typical CNN:
1. Convolutional Layer: The convolutional layer is the core building block of a CNN. It applies a set of
learnable filters (also known as kernels or feature detectors) to the input data. Each filter performs a
convolution operation by sliding over the input, computing dot products at each position, and producing
a feature map. Convolutional layers help capture local patterns and spatial hierarchies in the data.
2. Activation Function: After each convolutional operation, an activation function is applied element-wise
to introduce non-linearity into the network. The most commonly used activation function in CNNs is the
Rectified Linear Unit (ReLU), which sets negative values to zero and keeps positive values unchanged.
3. Pooling Layer: Pooling layers reduce the spatial dimensions (width and height) of the input, while
retaining important features. Max pooling is a common pooling operation that takes the maximum value
within each pooling region. It helps reduce the computational complexity and provides a form of
translation invariance by preserving the most salient features.
4. Fully Connected Layer: Fully connected layers are traditional neural network layers where each neuron
is connected to every neuron in the previous layer. These layers are typically used at the end of the CNN
architecture to classify or regress the extracted features. They learn complex combinations of features
from the previous layers and make predictions based on the learned representations.
5. Dropout: Dropout is a regularization technique used to prevent overfitting in CNNs. It randomly sets a
fraction of the input units to zero during training, which helps to reduce co-adaptation between neurons
and improves generalization.
6. Batch Normalization: Batch normalization is a technique that normalizes the output of a previous layer
by subtracting the batch mean and dividing by the batch standard deviation. It helps in stabilizing the
network training process, allowing higher learning rates, and reducing the sensitivity to network
initialization.
7. Convolutional Neural Network Architecture: CNNs are typically composed of multiple convolutional
layers stacked together, interspersed with activation functions, pooling layers, and other components
mentioned above. Different CNN architectures like LeNet, AlexNet, VGGNet, GoogLeNet, and ResNet
have varying depths, layer arrangements, and architectural innovations.
These building blocks, combined with appropriate hyperparameter tuning, training data, and
optimization techniques, enable CNNs to learn complex features from raw data and perform tasks such
as image classification, object detection, and semantic segmentation.
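To show how these building blocks fit together in practice, here is a hedged sketch of a small CNN in Keras (assuming TensorFlow is installed; all layer sizes and hyperparameters are arbitrary illustrative choices, not a prescribed architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative CNN combining the building blocks described above.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), padding="same"),   # 1. convolutional layer
    layers.Activation("relu"),                   # 2. activation function
    layers.BatchNormalization(),                 # 6. batch normalization
    layers.MaxPooling2D((2, 2)),                 # 3. pooling layer
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),        # 4. fully connected layer
    layers.Dropout(0.5),                         # 5. dropout
    layers.Dense(10, activation="softmax"),      # output over 10 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```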
Convolution Layer
A convolution layer transforms the input image in order to extract
features from it. In this transformation, the image is convolved with
a kernel (or filter).
A kernel is a small matrix, with its height and width smaller than the
image to be convolved. It is also known as a convolution matrix or
convolution mask. This kernel slides across the height and width of the
input image, and the dot product of the kernel and the image is computed
at every spatial position. The length by which the kernel slides is
known as the stride length. For example, with an input image of size
5x5, a kernel of size 3x3 and a stride length of 1, the output (also
referred to as the convolved feature) is of size 3x3.
When convolving a coloured (RGB) image, which has 3 channels, the
filters must have 3 channels as well. In other words, in
convolution, the number of channels in the kernel must be
the same as the number of channels in the input image.
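A tiny NumPy sketch of this channel-matching rule (shapes are illustrative): the kernel carries one 3x3 slice per input channel, and each output value sums the elementwise products over all channels.

```python
import numpy as np

rgb_image = np.random.rand(5, 5, 3)   # H x W x 3 channels
kernel = np.random.rand(3, 3, 3)      # kernel channels must also be 3

# One output position: elementwise product over all three channels, summed
patch = rgb_image[0:3, 0:3, :]
out_00 = np.sum(patch * kernel)       # a single scalar per filter position
print(out_00)
```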
Pooling Layer
The pooling layer is used to reduce the size of the input image. In a
convolutional neural network, a convolutional layer is usually followed
by a pooling layer. A pooling layer is usually added to speed up
computation and to make some of the detected features more robust.
The pooling operation uses a kernel and a stride as well. For example,
a 2x2 filter can be used to pool a 4x4 input image with a
stride of 2.
Max Pooling: In max pooling, from each patch of a feature map, the
maximum value is selected to create a reduced map.
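Here is a minimal NumPy sketch of 2x2 max pooling with stride 2 on a 4x4 input, mirroring the example described above (the helper function and values are illustrative):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling; assumes the input dimensions divide evenly by the stride."""
    H, W = feature_map.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(0, H - size + 1, stride):
        for j in range(0, W - size + 1, stride):
            # Keep only the largest element in each patch
            out[i // stride, j // stride] = feature_map[i:i+size, j:j+size].max()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]], dtype=float)
print(max_pool(fm))   # [[6. 4.] [8. 9.]] - a 2x2 reduced map
```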
https://deepai.org/machine-learning-glossary-and-terms/stride
In the context of convolutional neural networks (CNNs), a strided operation refers to the process of
applying a convolutional filter with a certain stride value, which determines the step size for moving the
filter across the input data.
In a standard convolutional operation, the filter is usually applied to the input data with a stride value of
1. This means that the filter moves one pixel at a time, covering the entire input image, and produces a
feature map with the same spatial dimensions as the input.
However, when using a strided operation, the filter is applied with a larger stride value, skipping some
pixels as it moves across the input data. This leads to a reduction in the spatial dimensions of the output
feature map.
For example, consider a 3x3 filter applied to a 5x5 input image with a stride of 2. The filter will start at
the top-left corner of the input, perform the convolution operation, and then move two pixels to the
right for the next convolution. It will continue this process until it reaches the end of the row, and then
move two rows down to the next position. This stride of 2 will effectively reduce the spatial dimensions
of the output feature map by a factor of 2 in both width and height.
Strided operations serve several purposes:
1. Dimensionality reduction: By applying a strided operation, the spatial dimensions of the feature maps
are reduced, which helps to reduce the computational complexity of subsequent layers and improve
efficiency.
2. Downsampling: Strided operations can act as a form of downsampling, where the information in the
input is summarized over larger regions. This can help capture more general features and reduce the
sensitivity to small local variations in the input.
3. Increasing receptive field: By using a larger stride value, the receptive field of each neuron in the
subsequent layers increases. This allows the network to capture larger spatial contexts and capture more
global information.
It's worth noting that strided operations can be used not only in convolutional layers but also in pooling
layers, where max pooling or average pooling can be applied with a stride value greater than 1 to achieve
similar downsampling effects.
Overall, strided operations provide a way to control the spatial dimensions and information flow within a
CNN, enabling more efficient processing of large-scale data while capturing important features.
Convolutional Neural Network — II
Mandar Deshpande
Continuing our learning from the last post, in this post we will cover
convolution over multiple channels (#channels).
In the given example, the purpose of the filter is to detect vertical edges.
If edges need to be detected only in the R channel, then only the
weights in the filter's R channel need to be set accordingly. If vertical
edges must be detected across all channels, then all filter channels will
have the same weights, as demonstrated above.
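As a sketch of the idea (the exact weights from the original figure are not shown here, so this uses a classic vertical-edge kernel as an assumed stand-in):

```python
import numpy as np

# A classic vertical-edge kernel (one channel): +1 column on the left,
# -1 on the right, so vertical intensity transitions give large responses.
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

# To detect vertical edges across all 3 channels of an RGB input,
# stack the same weights in every channel of the filter.
vertical_edge_rgb = np.stack([vertical_edge] * 3, axis=-1)  # shape (3, 3, 3)

# To detect edges only in the R channel, zero out the G and B slices.
r_only = vertical_edge_rgb.copy()
r_only[:, :, 1:] = 0.0
```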
The output dimension can be calculated for any general case using the
following equation:

output size = floor((n + 2p - f) / s) + 1

That is, an n x n x nc input convolved with nf filters of size f x f x nc,
with padding p and stride s, produces an output of size
(floor((n + 2p - f)/s) + 1) x (floor((n + 2p - f)/s) + 1) x nf.
Here, nc is the number of channels in the input image and nf is the
number of filters used.
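A small helper makes the formula concrete (illustrative code, not from the post):

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial output size for an n x n input, f x f filter,
    padding p and stride s: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# 5x5 input, 3x3 filter, no padding, stride 1 -> 3x3 output
print(conv_output_size(5, 3))          # 3
# Same input and filter with stride 2 -> 2x2 output
print(conv_output_size(5, 3, s=2))     # 2
# nf filters would give the output nf channels.
```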
Softmax regression, also known as multinomial logistic regression, is a classification algorithm that is
commonly used in machine learning and deep learning for multi-class classification problems. It is an
extension of logistic regression, which is used for binary classification.
In softmax regression, the goal is to assign an input sample to one of the multiple classes in a mutually
exclusive manner. The algorithm computes a probability distribution over the classes and assigns the
input to the class with the highest probability.
The main components of softmax regression are as follows:
1. Input Data: Each input sample is represented by a feature vector. The features can be real-valued or
discrete.
2. Model Parameters: Softmax regression learns a weight matrix and bias vector that map the input
features to the probabilities of the different classes. The weight matrix and bias vector are learned
through the training process.
3. Linear Transformation: The input features are linearly transformed using the weight matrix and bias
vector. This produces a set of scores or logits for each class. The logits represent the evidence or
confidence of the input sample belonging to each class.
4. Softmax Function: The logits are then passed through the softmax function, which normalizes them
into a probability distribution over the classes. The softmax function calculates the exponential of each
logit and divides it by the sum of exponentials across all classes.
5. Class Prediction: The class with the highest probability from the softmax function is selected as the
predicted class for the input sample.
6. Loss Function: Softmax regression uses a loss function to measure the discrepancy between the
predicted probabilities and the true class labels. The commonly used loss function is cross-entropy loss.
7. Training: The model parameters, i.e., the weight matrix and bias vector, are learned by minimizing the
loss function through optimization techniques such as gradient descent or its variants. The training
process adjusts the parameters iteratively to improve the model's ability to classify the input data
correctly.
Once the softmax regression model is trained, it can be used to predict the class labels for new, unseen
samples by passing their features through the trained model.
Softmax regression is widely used in various applications, including image classification, text
classification, and natural language processing tasks, where there are multiple classes to predict from. It
provides a probabilistic framework for multi-class classification by assigning class probabilities based on
the input features.
Introduction
Before understanding Softmax regression, we need to understand
the underlying softmax function that drives this regression.
The softmax function, also known as softargmax or normalized
exponential function, is, in simple terms, more like a
normalization function, which involves adjusting values measured
on different scales to a notionally common scale. There is more
than one method to accomplish this, and let us review why the
softmax method stands out. These methods could be used to
estimate probability scores from a set of values as in the case
of logistic regression or the output layer of a classification
neural network, both for finding the class with the largest
predicted probability.
σ(z)i = e^(β·zi) / Σj e^(β·zj)

where σ(z)i is the probability score for the i-th output, the zj are the
outputs, and β is a parameter that we choose if we want to use a base other
than e.
Features of Softmax:
Now for our earlier outputs 3, 7 and 14, our probabilities would
be e^3 / (e^3 + e^7 + e^14) = 1.6 × 10^-5, e^7 / (e^3 + e^7 + e^14) = 91 × 10^-5 and
e^14 / (e^3 + e^7 + e^14) = 0.99 respectively. As you will have noticed, this method
highlights the largest values and suppresses values that are
significantly below the maximum value. It also does this in proportion
to the scale of the numbers, i.e. we would not get
the same probability scores if the outputs were 0.3, 0.7 and
1.4; instead we would get the probability scores 0.18, 0.27
and 0.55
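These numbers are easy to verify with a few lines of NumPy (an illustrative sketch; the max-subtraction trick is a standard numerical-stability measure, not part of the formula itself):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([3.0, 7.0, 14.0])))   # ~[1.7e-05, 9.1e-04, 9.99e-01]
print(softmax(np.array([0.3, 0.7, 1.4])))    # ~[0.18, 0.27, 0.55]
```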
Applications:
As mentioned earlier, the softmax function/regression finds utility
in several areas; popular applications include image classification,
text classification, and other natural language processing tasks.
Implementation:
Now that we’ve understood how the softmax function works, we can
use that function to compute the probabilities predicted by a
crude linear model such as y = mx + b.
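As a hedged sketch of such an implementation (the names m, b, x, the shapes, and the class count are all illustrative assumptions, not from the original article), the linear model produces one logit per class and softmax turns the logits into probabilities:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # stable softmax
    return e / e.sum()

# Crude linear model y = m*x + b producing one logit per class.
x = np.array([1.0, 2.0])          # a single 2-feature input (illustrative)
m = np.random.randn(3, 2)         # weights: 3 classes x 2 features
b = np.zeros(3)                   # biases, one per class

logits = m @ x + b                # linear scores for each class
probs = softmax(logits)           # normalized class probabilities
print(probs, probs.sum())         # probabilities sum to 1
print("predicted class:", probs.argmax())
```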
There are two sources of data used to develop the mobile app. The first data distribution is
small: 10,000 pictures uploaded from the mobile application.
Since they come from amateur users, these pictures are not professionally shot, not well framed,
and blurrier. The second source is the web, from which 200,000 pictures were downloaded;
these cat pictures are professionally framed and in high resolution.
The small dataset of pictures uploaded by users is the distribution that matters for the
mobile app.
The guideline is to choose a development set and test set that reflect the data
you expect to get in the future and on which you consider it important to do well.
The advantage of this way of splitting up is that the target is well defined. The disadvantage
is that the training distribution is different from the development and test set distributions.
However, this way of splitting the data gives better performance in the long term.
Training and testing on different distributions refers to a scenario in machine learning where the data
used for training a model is drawn from a different distribution than the data used for testing or
evaluating the model. This situation can arise due to various reasons, such as changes in data collection
processes, domain shifts, or intentionally creating diverse datasets.
Training a model on one distribution and then evaluating it on a different distribution can lead to a
phenomenon called distribution mismatch or distributional shift. This can result in degraded
performance and reduced generalization of the model. The reasons for this performance drop include:
1. Differences in Data Characteristics: The statistical properties, such as mean, variance, or class
distributions, may differ between the training and testing data. As a result, the model may not perform
well on the unseen data because it has not learned to generalize to the different distribution.
2. Covariate Shift: Covariate shift occurs when the input features' marginal distribution differs between
the training and testing data, while the conditional distribution of the target variable remains the same.
This can lead to a mismatch between the training and testing data and affect the model's performance.
3. Concept Shift: Concept shift refers to a change in the underlying relationship between the input
features and the target variable. If the concept shift occurs between the training and testing data, the
model's learned patterns may not be applicable to the unseen data, leading to reduced accuracy.
To address the issue of training and testing on different distributions, several techniques can be
employed:
1. Data Collection: Efforts should be made to ensure that the training and testing data are as
representative of the real-world distribution as possible. Collecting a diverse and balanced dataset that
covers various scenarios can help mitigate the distributional shift.
2. Data Augmentation: Data augmentation techniques can be applied to artificially expand the training
dataset by creating new samples with variations. This can help the model learn to generalize better by
introducing more diverse examples.
3. Domain Adaptation: Domain adaptation methods aim to align the source and target domains to
reduce the distributional shift. Techniques like domain adaptation networks, importance weighting, or
feature adaptation can be applied to align the data distributions.
4. Transfer Learning: Transfer learning involves pretraining a model on a large, relevant dataset and then
fine-tuning it on the target distribution. This helps the model leverage the learned knowledge from the
source distribution and adapt it to the target distribution.
5. Cross-Validation: If labeled data from the target distribution is available, cross-validation can be used
to evaluate the model's performance. This allows for model selection and hyperparameter tuning on the
target distribution, improving the model's ability to generalize.
It is important to consider the potential distributional shift when designing machine learning models and
take appropriate steps to mitigate its impact. Understanding the data characteristics and employing
techniques to address the training-testing distribution mismatch can lead to more robust and accurate
models.
What is Bias?
Bias is simply defined as the inability of the model to capture
the true relationship in the data, because of which some difference or
error occurs between the model's predicted value and the actual value.
These differences between actual or expected values and the
predicted values are known as bias error or error
due to bias. Bias is a systematic error that occurs due to
wrong assumptions in the machine learning process.
What is Variance?
Variance is the measure of spread in data from
its mean position. In machine learning, variance is the
amount by which the performance of a predictive model
changes when it is trained on different subsets of the
training data. More specifically, variance is the
variability of the model: how sensitive it is to a
different subset of the training dataset, i.e. how much it
adjusts when given a new subset of the training data.
Typically models with high bias have low variance, and models with high
variance have low bias. This is because the two come from opposite types of
models. A model that’s not flexible enough to match a data set correctly (high
bias) is also not flexible enough to change dramatically when given a different
data set (low variance).
Those who’ve read my previous article on underfitting and overfitting will
probably note a lot of similarity between these concepts. Underfit models
usually have high bias and low variance. Overfit models usually have high
variance and low bias.
Keep in mind increasing variance is not always a bad thing. An underfit model
is underfit because it doesn’t have enough variance, which leads to
consistently high bias errors. This means when you’re developing a model you
need to find the right amount of variance, or the right amount of model
complexity. The key is to increase model complexity, thus decreasing bias and
increasing variance, until bias has been minimized but before significant
variance errors become evident.
Another solution is to increase the size of the data set used to train your
model. High variance errors, also referred to as overfitting models, come from
creating a model that’s too complex for the available data set. If you’re able to
use more data to train the model, then you can create a model that’s more
complex without accidentally adding variance error.
When training and testing data come from different distributions, it can have an impact on the bias and
variance of a machine learning model. Bias and variance are two fundamental sources of error in a
model's predictions.
Bias refers to the error introduced by approximating a real-world problem with a simplified model. It
captures how much the predicted values differ from the true values on average. A high bias indicates
that the model is too simplistic and cannot capture the underlying patterns in the data. When training
and testing data come from different distributions, the model's bias can be affected if the underlying
relationship between the input features and the target variable changes. In this case, the model may
struggle to capture the new patterns present in the testing data, leading to increased bias.
Variance, on the other hand, measures the variability of the model's predictions for different training
datasets. It quantifies how much the predictions differ when the model is trained on different subsets of
the data. High variance indicates that the model is too complex and overfits the training data, capturing
noise and random fluctuations. When training and testing data come from different distributions, the
model's variance can increase because it has learned specific patterns from the training distribution that
may not generalize well to the different distribution in the testing phase.
In the context of training and testing on different distributions, here's how bias and variance can be
affected:
1. Bias: If the underlying relationship between the input features and the target variable changes
between the training and testing distributions, the model's bias can increase. The model may not be able
to capture the new patterns present in the testing data, resulting in a higher average prediction error.
2. Variance: When the training and testing data come from different distributions, the model may
struggle to generalize well. This can lead to an increase in variance as the model has learned specific
patterns from the training distribution that do not apply to the testing distribution. The model's
predictions may vary significantly when trained on different subsets of the data, indicating higher
variability.
To strike a balance between bias and variance in the context of training and testing on different
distributions, it is important to consider techniques such as transfer learning, domain adaptation, or
cross-validation. These techniques can help mitigate the distributional shift and improve the model's
ability to generalize to the testing data. By reducing the bias and variance, the model can achieve better
performance on unseen data, even when the distributions differ between training and testing.
We’ll take a look at what transfer learning is, how it works, why and when it
should be used. Additionally, we’ll cover the different approaches of transfer
learning and provide you with some resources on already pre-trained models.
WHAT IS TRANSFER LEARNING?
Transfer learning, used in machine learning, is the reuse of a pre-trained
model on a new problem. In transfer learning, a machine exploits the
knowledge gained from a previous task to improve generalization about
another. For example, in training a classifier to predict whether an image
contains food, you could use the knowledge it gained during training to
recognize drinks.
With transfer learning, we basically try to exploit what has been learned in one
task to improve generalization in another. We transfer the weights that a
network has learned at “task A” to a new “task B.”
The general idea is to use the knowledge a model has learned from a task
with a lot of available labeled training data in a new task that doesn't have
much data. Instead of starting the learning process from scratch, we start with
patterns learned from solving a related task.
Transfer learning isn’t really a machine learning technique, but can be seen as
a “design methodology” within the field, much like active learning. It is also
not an exclusive part or study area of machine learning. Nevertheless, it has
become quite popular in combination with neural networks, which require huge
amounts of data and computational power.
Usually, a lot of data is needed to train a neural network from scratch but
access to that data isn't always available — this is where transfer learning
comes in handy. With transfer learning a solid machine learning model can be
built with comparatively little training data because the model is already pre-
trained. This is especially valuable in natural language processing because
mostly expert knowledge is required to create large labeled data sets.
Additionally, training time is reduced because it can sometimes take days or
even weeks to train a deep neural network from scratch on a complex task.
Transfer learning is typically used when:
There isn’t enough labeled training data to train your network from scratch.
There already exists a network that is pre-trained on a similar task, which is usually trained on
massive amounts of data.
Task 1 and task 2 have the same input.
Imagine you want to solve task A but don’t have enough data to train a deep
neural network. One way around this is to find a related task B with an
abundance of data. Train the deep neural network on task B and use the model
as a starting point for solving task A. Whether you'll need to use the whole
model or only a few layers depends heavily on the problem you're trying to
solve.
If you have the same input in both tasks, you may be able to reuse the model
and make predictions for your new input directly. Alternatively,
changing and retraining different task-specific layers and the output layer is
a method to explore.
Keras, for example, provides numerous pre-trained models that can be used
for transfer learning, prediction, feature extraction and fine-tuning. You can
find these models, and also some brief tutorials on how to use them, here.
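For example, here is a hedged sketch of this workflow with Keras (assuming TensorFlow is available; the new head layers and the 5-class output are illustrative choices):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Reuse a network pre-trained on ImageNet ("task B") as the starting
# point for a new 5-class problem ("task A").
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False            # freeze the transferred weights

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),   # new task-specific layers
    layers.Dense(5, activation="softmax"),  # new output layer
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```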
There are also many research institutions that release trained models.
3. FEATURE EXTRACTION
Another approach is to use deep learning to discover the best representation
of the problem, i.e. the most important features. The learned representation
can then be used for other problems as well.
Simply use the first layers to spot the right representation of features, but
don’t use the output of the network because it is too task-specific. Instead,
feed data into your network and use one of the intermediate layers as the
output layer. This layer can then be interpreted as a representation of the raw
data.
This approach is mostly used in computer vision because it can reduce the size
of your dataset, which decreases computation time and makes it more suitable
for traditional algorithms, as well.
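A minimal sketch of this feature-extraction approach in Keras, assuming TensorFlow is available; `block4_pool` is one of VGG16's actual intermediate layers, chosen here just for illustration:

```python
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False)

# Use an intermediate layer as the output: its activations serve as a
# generic feature representation of the raw input, instead of the
# task-specific class scores at the end of the network.
extractor = tf.keras.Model(inputs=base.input,
                           outputs=base.get_layer("block4_pool").output)

images = tf.random.uniform((2, 224, 224, 3))   # dummy batch
features = extractor(images)
print(features.shape)    # (2, 14, 14, 512): feature maps, not class scores
```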
There are some pre-trained machine learning models out there that are quite
popular. One of them is the Inception-v3 model, which was trained for
the ImageNet “Large Visual Recognition Challenge.” In this challenge,
participants had to classify images into 1,000 classes like
“zebra,” “Dalmatian” and “dishwasher.”
Here’s a very good tutorial from TensorFlow on how to retrain image
classifiers.
Microsoft also offers some pre-trained models, available for both R and
Python development, through the MicrosoftML R package and
the Microsoftml Python package.
Other quite popular models are ResNet and AlexNet. I also encourage a visit
to pretrained.ml, a sortable and searchable compilation of pre-trained deep
learning models complete with demos and code.
Introduction
In most machine learning contexts, we are concerned with solving
a single task at a time. Regardless of what that task is, the problem is
typically framed as using data to solve a single task or optimize a single
metric at a time. However, this approach will eventually hit a
performance ceiling, oftentimes due to the size of the data-set or the
ability of the model to learn meaningful representations from it.
https://www.baeldung.com/cs/end-to-end-deep-learning#:~:text=Definition,without%20any%20manual%20feature%20extraction.
https://towardsdatascience.com/e2e-the-every-purpose-ml-method-5d4f20dafee4
LeNet-5 is a convolutional neural network (CNN) architecture that was developed by Yann LeCun et al. in
1998. It was one of the pioneering CNN models for image recognition tasks and played a significant role
in the advancement of deep learning.
The LeNet-5 architecture was primarily designed for handwritten digit recognition, specifically for
classifying digits from the MNIST dataset. It consists of seven layers: two convolutional layers,
two subsampling (pooling) layers, two fully connected layers, and an output layer. The architecture
can be summarized as follows:
1. Input Layer: The input layer accepts grayscale images of size 32x32 pixels.
2. Convolutional Layers: The first convolutional layer applies six filters (also known as kernels) of size 5x5
to the input images, resulting in six feature maps. The second convolutional layer uses 16 filters of size
5x5 and produces 16 feature maps. Both layers use a stride of 1 and a "valid" padding, and the output
feature maps undergo a nonlinear activation using the hyperbolic tangent (tanh) function.
3. Subsampling Layers: Two subsampling layers follow the convolutional layers. They perform average
pooling over non-overlapping regions. The first subsampling layer reduces the spatial dimensions of the
feature maps by a factor of 2, and the second subsampling layer reduces them further.
4. Fully Connected Layers: The subsampled feature maps are then flattened and passed through two fully
connected layers. The first fully connected layer consists of 120 neurons, followed by the second fully
connected layer with 84 neurons. Each neuron is connected to all the neurons of the previous layer.
5. Output Layer: The final layer is a fully connected layer with 10 neurons, representing the 10 possible
classes (digits 0-9). The output layer uses a softmax activation function to produce the probability
distribution over the classes.
LeNet-5 was trained using the backpropagation algorithm and stochastic gradient descent (SGD)
optimization. It achieved remarkable performance on the MNIST dataset and showcased the potential of
CNNs in image recognition tasks.
LeNet-5 served as a foundation for subsequent advancements in deep learning and convolutional neural
networks, paving the way for more complex and powerful architectures for image recognition and other
computer vision tasks.
https://medium.com/@siddheshb008/lenet-5-architecture-explained-3b559cb2d52b
In total, the LeNet-5 architecture consists of seven layers. These layers can be categorized as follows:
1. Input Layer: The input layer accepts grayscale images of size 32x32 pixels.
2. Convolutional Layers: There are two convolutional layers in LeNet-5. The first convolutional layer
applies six filters of size 5x5 to the input images, and the second convolutional layer applies 16 filters of
size 5x5.
3. Subsampling (Pooling) Layers: LeNet-5 has two subsampling layers. Each subsampling layer performs
average pooling over non-overlapping regions.
4. Fully Connected Layers: There are two fully connected layers in LeNet-5. The first fully connected layer
consists of 120 neurons, and the second fully connected layer consists of 84 neurons.
5. Output Layer: The output layer is a fully connected layer with 10 neurons, representing the 10 possible
classes (digits 0-9) in the case of the MNIST dataset.
So, excluding the input (which is not usually counted as a layer), LeNet-5 has seven layers: two
convolutional layers, two subsampling layers, two fully connected layers, and one output layer.
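Putting the description above into code, here is a hedged Keras sketch of LeNet-5 (assuming TensorFlow; the tanh activations and average pooling follow the text above, while training details are omitted):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# LeNet-5 as described above: 2 conv + 2 average-pooling + 2 FC + output.
lenet5 = models.Sequential([
    layers.Input(shape=(32, 32, 1)),                  # grayscale 32x32
    layers.Conv2D(6, (5, 5), activation="tanh"),      # C1: 6 maps, 28x28
    layers.AveragePooling2D((2, 2)),                  # S2: subsample to 14x14
    layers.Conv2D(16, (5, 5), activation="tanh"),     # C3: 16 maps, 10x10
    layers.AveragePooling2D((2, 2)),                  # S4: subsample to 5x5
    layers.Flatten(),
    layers.Dense(120, activation="tanh"),             # first FC layer
    layers.Dense(84, activation="tanh"),              # second FC layer
    layers.Dense(10, activation="softmax"),           # output: digits 0-9
])
lenet5.summary()
```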
AlexNet: The Architecture that Challenged CNNs
Jerry Wei
A few years back, we still used small datasets like CIFAR and NORB
consisting of tens of thousands of images. These datasets were
sufficient for machine learning models to learn basic recognition tasks.
However, real life is never simple and has many more variables than
are captured in these small datasets. The recent availability of large
datasets like ImageNet, which consist of hundreds of thousands to
millions of labeled images, has pushed the need for extremely
capable deep learning models. Then came AlexNet.
(Figure: AlexNet’s most probable labels on eight ImageNet images, with the
correct label written under each image and the probability assigned to each
label shown by bars. Image credits to Krizhevsky et al., the original authors
of the AlexNet paper.)
AlexNet is a convolutional neural network (CNN) architecture that was introduced by Alex Krizhevsky, Ilya
Sutskever, and Geoffrey Hinton in 2012. It gained significant attention and marked a breakthrough in the
field of computer vision, particularly in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
competition.
The AlexNet architecture can be summarized as follows:
1. Input Layer: The input layer accepts RGB images of size 227x227 pixels.
2. Convolutional Layers: AlexNet begins with five convolutional layers. The first convolutional layer
applies 96 filters of size 11x11 with a stride of 4. The subsequent convolutional layers use smaller
filters: 256 filters of size 5x5 in the second layer, 384 filters of size 3x3 in the third and fourth layers,
and 256 filters of size 3x3 in the fifth layer. All convolutional layers use the Rectified Linear Unit (ReLU)
activation function.
3. Max Pooling Layers: After the first, second, and fifth convolutional layers, there is a max pooling layer
that performs overlapping 3x3 pooling with a stride of 2, reducing the spatial dimensions of the feature maps.
4. Local Response Normalization (LRN) Layer: Following the first and second convolutional layers, an LRN
layer is applied to enhance the model's response to specific patterns. It normalizes the responses across
neighboring feature maps.
5. Fully Connected Layers: After the convolutional and pooling layers, there are three fully connected
layers. The first fully connected layer consists of 4096 neurons, followed by a second fully connected
layer with 4096 neurons. Both these layers use the ReLU activation function. The final fully connected
layer, also known as the output layer, consists of 1000 neurons representing the 1000 classes in the
ImageNet dataset. It employs the softmax activation function to produce the probability distribution
over the classes.
6. Dropout: Dropout regularization is applied to the first and second fully connected layers with a
dropout rate of 0.5. It helps prevent overfitting by randomly dropping out units during training.
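Here is a hedged Keras sketch of the architecture described above (assuming TensorFlow; the LRN layers are omitted for brevity, as many modern reimplementations do, and this is an illustration rather than the authors' original code):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# AlexNet-style network following the layer description above.
alexnet = models.Sequential([
    layers.Input(shape=(227, 227, 3)),
    layers.Conv2D(96, (11, 11), strides=4, activation="relu"),    # -> 55x55
    layers.MaxPooling2D((3, 3), strides=2),                       # -> 27x27
    layers.Conv2D(256, (5, 5), padding="same", activation="relu"),
    layers.MaxPooling2D((3, 3), strides=2),                       # -> 13x13
    layers.Conv2D(384, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(384, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((3, 3), strides=2),                       # -> 6x6
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1000, activation="softmax"),   # 1000 ImageNet classes
])
alexnet.summary()
```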
The AlexNet architecture was trained on the ImageNet dataset, which consists of millions of labeled
images across 1000 different classes. It utilized techniques such as data augmentation, dropout, and GPU
acceleration for efficient training. AlexNet significantly outperformed previous models in the ILSVRC
competition, demonstrating the power of deep convolutional neural networks for image classification
tasks.
Since its introduction, AlexNet has inspired numerous advancements in deep learning and CNN
architectures, setting the stage for subsequent models such as VGGNet, GoogLeNet, and ResNet.
https://medium.com/@mygreatlearning/everything-you-need-to-know-about-vgg16-7315defb5918
VGG-16 (Visual Geometry Group 16) is a convolutional neural network (CNN) architecture that was
developed by the Visual Geometry Group at the University of Oxford. It was introduced by Karen
Simonyan and Andrew Zisserman in 2014. VGG-16 is known for its depth and simplicity and has been
influential in the field of computer vision.
1. Input Layer: The input layer accepts RGB images of size 224x224 pixels.
2. Convolutional Layers: VGG-16 consists of 13 convolutional layers, all using 3x3 filters with a stride
of 1. The first two convolutional layers have 64 filters each; the number of filters then doubles from
block to block: two layers with 128 filters, three layers with 256 filters, three layers with 512 filters,
and a final three layers with 512 filters.
3. Max Pooling Layers: After each set of two or three convolutional layers, there is a max pooling layer
that performs 2x2 pooling with a stride of 2, reducing the spatial dimensions of the feature maps.
4. Fully Connected Layers: After the convolutional and pooling layers, VGG-16 has three fully connected
layers. The first two fully connected layers consist of 4096 neurons each, while the last fully connected
layer, also known as the output layer, consists of the number of neurons corresponding to the specific
classification task.
5. ReLU Activation: ReLU (Rectified Linear Unit) activation is used after each convolutional and fully
connected layer. It introduces non-linearity into the network, allowing it to learn complex patterns and
representations.
VGG-16 has a total of about 138 million trainable parameters, making it a deep and computationally
intensive architecture. It is known for its homogeneous structure, with relatively small 3x3 filters and
max pooling layers throughout the network. This design choice aims to make the network more effective
in capturing fine-grained details.
VGG-16 has been widely used as a benchmark architecture for various computer vision tasks, including
image classification, object detection, and image segmentation. It has achieved notable performance on
the ImageNet dataset and has influenced subsequent CNN architectures, inspiring models like VGG-19,
ResNet, and DenseNet.
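The parameter count quoted above is easy to check with Keras's built-in VGG16 (assuming TensorFlow is installed; `weights=None` builds the architecture without downloading pre-trained weights):

```python
import tensorflow as tf

# VGG16 with its ImageNet classification head, architecture only.
vgg16 = tf.keras.applications.VGG16(weights=None)
print(f"{vgg16.count_params():,}")   # ~138,357,544 parameters
vgg16.summary()                      # 13 conv layers + 3 dense layers
```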
1. What is ResNet
Need for ResNet
Residual Block
How ResNet helps
2. ResNet architecture
3. Using ResNet with Keras
ResNet 50
What is ResNet?
ResNet, short for Residual Network, is a specific type of neural
network that was introduced in 2015 by Kaiming He, Xiangyu
Zhang, Shaoqing Ren and Jian Sun in their paper “Deep Residual
Learning for Image Recognition”. The ResNet models were extremely
successful: among other results, an ensemble of ResNets won first place
in the ILSVRC 2015 image classification competition with a top-5 error
rate of 3.57%.
Residual Block
In a plain network, the input x is multiplied by the layer's weights and
added to a bias, and this term then goes through the activation function
f() to give the output H(x):

H(x) = f(wx + b), which we can write simply as H(x) = f(x)

With a residual (skip) connection, the input x is added back to the output
of the weight layers, so the block computes:

H(x) = f(x) + x

The weight layers therefore only have to learn the residual
f(x) = H(x) - x rather than the whole mapping. In particular, when the
ideal mapping is the identity, the layers can simply learn f(x) = 0, which
gives H(x) = x; passing the input through unchanged is far easier to learn
than fitting an identity function with stacked nonlinear layers.
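A hedged Keras sketch of a basic residual block implementing H(x) = f(x) + x (the layer choices are illustrative; ResNet-50 itself uses deeper "bottleneck" blocks):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Basic residual block: output is H(x) = f(x) + x, so when the
    convolutions learn f(x) = 0 the block passes x through unchanged."""
    shortcut = x                                   # the skip connection
    y = layers.Conv2D(filters, (3, 3), padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                # f(x) + x
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, 64)
model = tf.keras.Model(inputs, outputs)
```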
1. TensorFlow: TensorFlow, developed by Google, is one of the most popular deep learning frameworks.
It provides a flexible and comprehensive ecosystem for building and deploying machine learning models.
TensorFlow supports both high-level APIs (such as Keras) for rapid prototyping and low-level APIs for
advanced customization. It offers excellent support for distributed computing and deployment on
various platforms, including CPUs, GPUs, and TPUs.
2. PyTorch: PyTorch, developed by Facebook's AI Research lab (FAIR), is another widely used deep
learning framework. It is known for its dynamic computation graphs and imperative, Pythonic
programming style, which make models easy to debug and experiment with. PyTorch offers strong GPU
acceleration and a rich ecosystem of libraries, and it has become especially popular in the research
community.
3. Keras: Keras is a high-level deep learning framework that can run on top of TensorFlow, Theano, or
Microsoft Cognitive Toolkit (CNTK). It provides a user-friendly and intuitive API for building and training
neural networks. Keras allows rapid prototyping and supports both convolutional and recurrent neural
networks. With its focus on simplicity and ease of use, Keras is a popular choice for beginners in deep
learning.
4. Caffe: Caffe (Convolutional Architecture for Fast Feature Embedding) is a deep learning framework
developed by Berkeley AI Research (BAIR). It emphasizes speed and efficiency, particularly in computer
vision tasks. Caffe has a declarative model definition syntax and a strong focus on convolutional neural
networks. It offers a large collection of pre-trained models, making it useful for transfer learning and
feature extraction.
5. MXNet: MXNet is a deep learning framework that provides flexible and efficient tools for building
neural networks. It supports both imperative and symbolic programming models and offers a wide range
of language bindings, including Python, R, and Julia. MXNet emphasizes scalability and distributed
computing, making it suitable for large-scale deep learning applications.
6. Theano: Theano is a deep learning framework that allows developers to define, optimize, and evaluate
mathematical expressions efficiently. It offers symbolic computation capabilities and efficient GPU
utilization. While Theano has been widely used in the past, its development has slowed down, and other
frameworks like TensorFlow and PyTorch have gained more popularity in recent years.
These are just a few examples of deep learning frameworks, and there are several other frameworks
available, such as Microsoft Cognitive Toolkit (CNTK), Deeplearning4j, and Chainer. The choice of
framework depends on factors such as the specific requirements of your project, the level of flexibility
needed, the size of the community and available resources, and your familiarity with the programming
language and interface.
https://marutitech.com/top-8-deep-learning-frameworks/