
Introduction

In the last few years, the IT industry has seen a huge demand for one
particular skill set known as Deep Learning. Deep Learning is a subset of Machine
Learning which consists of algorithms that are inspired by the functioning of
the human brain and its neural networks.


These structures are called Neural Networks. They teach the computer to do
what comes naturally to humans. In deep learning, there are several types of
models such as Artificial Neural Networks (ANN), Autoencoders, Recurrent
Neural Networks (RNN) and Reinforcement Learning. But one particular model
has contributed a lot to the fields of computer vision and image analysis: the
Convolutional Neural Network (CNN), or ConvNet.

CNN is very useful as it minimises human effort by automatically detecting the
features. For example, for apples and mangoes, it would automatically detect
the distinct features of each class on its own.


CNNs are a class of Deep Neural Networks that can recognize and classify
particular features from images and are widely used for analyzing visual
images. Their applications range from image and video recognition, image
classification, medical image analysis, computer vision and natural language
processing.

CNN has high accuracy, and because of this, it is useful in image
recognition. Image recognition has a wide range of uses in various industries
such as medical image analysis, phone security, recommendation systems,
etc.

The term 'Convolution' in CNN denotes the mathematical function of
convolution, which is a special kind of linear operation wherein two functions
are multiplied to produce a third function that expresses how the shape of
one function is modified by the other. In simple terms, two images, which can
be represented as matrices, are multiplied to give an output that is used to
extract features from the image.
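As a toy illustration of this multiply-and-sum (not from the original article; the numbers are made up), a single convolution step can be written in a few lines of NumPy:

```python
import numpy as np

# One step of the convolution described above: a 3x3 image patch and a
# 3x3 filter are multiplied element-wise and summed to produce a single
# value of the output.
patch = np.array([[10, 10, 0],
                  [10, 10, 0],
                  [10, 10, 0]])          # small region of an image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])          # a simple vertical-edge filter

value = np.sum(patch * kernel)           # element-wise product, then sum
print(value)                             # 30 -> strong vertical-edge response
```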


Basic Architecture

There are two main parts to a CNN architecture:

 A convolution tool that separates and identifies the various features of the image for
analysis, in a process called Feature Extraction.
 The network of feature extraction consists of many pairs of convolutional and pooling
layers.
 A fully connected layer that utilizes the output from the convolution process and
predicts the class of the image based on the features extracted in previous stages.
 This CNN model of feature extraction aims to reduce the number of features present
in a dataset. It creates new features which summarise the existing features
contained in the original set of features. There are many CNN layers, as shown in
the CNN architecture diagram.


Convolution Layers

There are three types of layers that make up a CNN: the
convolutional layers, pooling layers, and fully-connected (FC) layers. When
these layers are stacked, a CNN architecture is formed. In addition to
these three layers, there are two more important components, the
dropout layer and the activation function, which are defined below.


1. Convolutional Layer

This layer is the first layer that is used to extract the various features from the
input images. In this layer, the mathematical operation of convolution is
performed between the input image and a filter of a particular size MxM. By
sliding the filter over the input image, the dot product is taken between the
filter and the parts of the input image with respect to the size of the filter
(MxM).

The output is termed the Feature Map, which gives us information about the
image such as its corners and edges. Later, this feature map is fed to other
layers to learn several other features of the input image.

The convolution layer in a CNN passes the result to the next layer after applying the
convolution operation to the input. Convolutional layers are very beneficial, as
they ensure the spatial relationship between the pixels is kept intact.
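A minimal NumPy sketch of the sliding-window operation described above, assuming stride 1 and no padding (the kernel values are only illustrative):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide an MxM kernel over a 2-D image (stride 1, no padding) and
    take the dot product at every position, producing a feature map."""
    m = kernel.shape[0]
    out_h = image.shape[0] - m + 1
    out_w = image.shape[1] - m + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i+m, j:j+m] * kernel)
    return feature_map

image = np.random.rand(5, 5)             # toy 5x5 "image"
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])          # vertical-edge detector
print(convolve2d(image, kernel).shape)   # (3, 3) feature map
```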

2. Pooling Layer

In most cases, a Convolutional Layer is followed by a Pooling Layer. The
primary aim of this layer is to decrease the size of the convolved feature map
to reduce computational costs. This is performed by decreasing the
connections between layers; the pooling layer operates on each feature map independently.
Depending upon the method used, there are several types of pooling operations. Pooling
basically summarises the features generated by a convolution layer.

In Max Pooling, the largest element is taken from each region of the feature map. Average Pooling
calculates the average of the elements in a predefined-size image section. The
total sum of the elements in the predefined section is computed in Sum
Pooling. The Pooling Layer usually serves as a bridge between the
Convolutional Layer and the FC Layer.

This CNN model generalises the features extracted by the convolution layer,
and helps the network recognise the features independently. With the help
of this, the computations in the network are also reduced.


3. Fully Connected Layer


The Fully Connected (FC) layer consists of the weights and biases along with
the neurons and is used to connect the neurons between two different layers.
These layers are usually placed before the output layer and form the last few
layers of a CNN Architecture.

In this stage, the output from the previous layers is flattened and fed to the FC
layer. The flattened vector then passes through a few more FC layers, where the
mathematical operations usually take place. Here, the
classification process begins. The reason two layers are
connected is that two fully connected layers perform better than a single
connected layer. These layers in CNN reduce the need for human supervision.
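The shapes below are only illustrative, but they show the flatten-then-fully-connected step in NumPy terms (y = Wx + b):

```python
import numpy as np

# Illustrative shapes only: a pooled feature map is flattened into a vector
# and passed through a fully connected layer (y = Wx + b).
feature_maps = np.random.rand(5, 5, 16)        # 16 pooled 5x5 feature maps
x = feature_maps.flatten()                     # shape (400,)

rng = np.random.default_rng(0)
W = rng.standard_normal((120, x.size))         # weights of the FC layer
b = np.zeros(120)                              # biases of the FC layer

fc_output = W @ x + b                          # 120 neurons in this FC layer
print(fc_output.shape)                         # (120,)
```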

4. Dropout

Usually, when all the features are connected to the FC layer, it can cause
overfitting to the training dataset. Overfitting occurs when a particular model
works so well on the training data that it causes a negative impact on the model's
performance when used on new data.

To overcome this problem, a dropout layer is utilised, wherein a few neurons
are dropped from the neural network during the training process, resulting in a
reduced model size. On passing a dropout of 0.3, 30% of the nodes are
dropped out randomly from the neural network.

Dropout results in improving the performance of a machine learning model as
it prevents overfitting by making the network simpler. It drops neurons from
the neural network during training.
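A rough sketch of what a dropout layer does during training ("inverted" dropout, which also rescales the surviving activations; the rate of 0.3 matches the example above):

```python
import numpy as np

def dropout(activations, rate=0.3, training=True):
    """Inverted-dropout sketch: randomly zero out roughly `rate` of the neurons
    during training and rescale the rest so the expected sum is unchanged."""
    if not training:
        return activations                       # dropout is disabled at inference
    keep_prob = 1.0 - rate
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

a = np.ones(10)
print(dropout(a, rate=0.3))   # roughly 30% of the entries become 0
```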


5. Activation Functions

Finally, one of the most important parameters of the CNN model is the
activation function. They are used to learn and approximate any kind of
continuous and complex relationship between variables of the network. In
simple words, it decides which information of the model should fire in the
forward direction and which ones should not at the end of the network.

It adds non-linearity to the network. There are several commonly used
activation functions such as the ReLU, Softmax, tanh and Sigmoid
functions. Each of these functions has a specific usage. For a binary
classification CNN model, the sigmoid and softmax functions are preferred, and for
multi-class classification, softmax is generally used. In simple terms, activation
functions in a CNN model determine whether a neuron should be activated or
not. They decide whether the input to the network is important or not for the prediction,
using mathematical operations.
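For reference, here are minimal NumPy versions of the activation functions mentioned above (the input values are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)                      # zeroes out negative values

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))              # squashes values into (0, 1)

def softmax(x):
    e = np.exp(x - np.max(x))                    # subtract max for numerical stability
    return e / e.sum()                           # probabilities summing to 1

logits = np.array([2.0, -1.0, 0.5])
print(relu(logits))      # [2.  0.  0.5]
print(sigmoid(logits))   # element-wise probabilities (binary case)
print(softmax(logits))   # class probabilities (multi-class case)
```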

https://www.upgrad.com/blog/basic-cnn-architecture/

https://www.geeksforgeeks.org/introduction-convolution-neural-network/
CNN, or Convolutional Neural Network, is a type of deep learning model commonly used for image
recognition, computer vision tasks, and other pattern recognition problems. The building blocks of a CNN
are designed to efficiently extract features from input data while preserving spatial relationships. Here
are the key components or building blocks of a typical CNN:

1. Convolutional Layer: The convolutional layer is the core building block of a CNN. It applies a set of
learnable filters (also known as kernels or feature detectors) to the input data. Each filter performs a
convolution operation by sliding over the input, computing dot products at each position, and producing
a feature map. Convolutional layers help capture local patterns and spatial hierarchies in the data.

2. Activation Function: After each convolutional operation, an activation function is applied element-wise
to introduce non-linearity into the network. The most commonly used activation function in CNNs is the
Rectified Linear Unit (ReLU), which sets negative values to zero and keeps positive values unchanged.

3. Pooling Layer: Pooling layers reduce the spatial dimensions (width and height) of the input, while
retaining important features. Max pooling is a common pooling operation that takes the maximum value
within each pooling region. It helps reduce the computational complexity and provides a form of
translation invariance by preserving the most salient features.

4. Fully Connected Layer: Fully connected layers are traditional neural network layers where each neuron
is connected to every neuron in the previous layer. These layers are typically used at the end of the CNN
architecture to classify or regress the extracted features. They learn complex combinations of features
from the previous layers and make predictions based on the learned representations.

5. Dropout: Dropout is a regularization technique used to prevent overfitting in CNNs. It randomly sets a
fraction of the input units to zero during training, which helps to reduce co-adaptation between neurons
and improves generalization.

6. Batch Normalization: Batch normalization is a technique that normalizes the output of a previous layer
by subtracting the batch mean and dividing by the batch standard deviation. It helps in stabilizing the
network training process, allowing higher learning rates, and reducing the sensitivity to network
initialization.

7. Convolutional Neural Network Architecture: CNNs are typically composed of multiple convolutional
layers stacked together, interspersed with activation functions, pooling layers, and other components
mentioned above. Different CNN architectures like LeNet, AlexNet, VGGNet, GoogLeNet, and ResNet
have varying depths, layer arrangements, and architectural innovations.

These building blocks, combined with appropriate hyperparameter tuning, training data, and
optimization techniques, enable CNNs to learn complex features from raw data and perform tasks such
as image classification, object detection, and semantic segmentation.
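To tie these building blocks together, here is a minimal sketch of a small CNN in Keras (assuming TensorFlow is available; the layer sizes and the 10-class output are arbitrary choices, not taken from the text):

```python
from tensorflow.keras import layers, models

# A minimal sketch that stacks the building blocks listed above:
# convolution + ReLU, pooling, batch normalization, dropout, and
# fully connected layers for a 10-class image classifier.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                 # RGB input image
    layers.Conv2D(32, (3, 3), activation="relu"),    # convolution + ReLU
    layers.BatchNormalization(),                     # normalize activations
    layers.MaxPooling2D((2, 2)),                     # downsample feature maps
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                # to a 1-D feature vector
    layers.Dense(128, activation="relu"),            # fully connected layer
    layers.Dropout(0.3),                             # regularization
    layers.Dense(10, activation="softmax"),          # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```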
Convolution Layer
A convolution layer transforms the input image in order to extract
features from it. In this transformation, the image is convolved with
a kernel (or filter).

Image convolution (source)

A kernel is a small matrix, with its height and width smaller than the
image to be convolved. It is also known as a convolution matrix or
convolution mask. This kernel slides across the height and width of the
input image, and the dot product of the kernel and the image is computed
at every spatial position. The length by which the kernel slides is
known as the stride length. In the image below, the input image
is of size 5x5, the kernel is of size 3x3 and the stride length is
1. The output image is also referred to as the convolved feature.
When convolving a coloured (RGB) image with 3 channels, the
number of channels in the filter must be 3 as well. In other words, in
convolution, the number of channels in the kernel must be
the same as the number of channels in the input image.

Convolution on RGB image (source)


When we want to extract more than one feature from an image using
convolution, we can use multiple kernels instead of just
one. In such a case, the size of all the kernels must be the same. The
convolved features of the input image are stacked one after
the other to create an output, so that the number of channels is equal to
the number of filters used. See the image below for reference.

Convolution of RGB image using multiple filters (kernels) (source)

An activation function is the last component of the convolutional
layer, added to increase the non-linearity in the output. Generally, the ReLU
function or the tanh function is used as the activation function in a
convolution layer. Here is an image of a simple convolution layer,
where a 6x6x3 input image is convolved with two kernels of size
4x4x3 to get a convolved feature of size 3x3x2, to which the activation
function is applied to get the output, which is also referred to as the feature
map.
A convolution layer (source)
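A small NumPy sketch that reproduces the shape arithmetic of the example above (a 6x6x3 input convolved with two 4x4x3 kernels, stride 1, no padding, giving a 3x3x2 output); the random values are placeholders:

```python
import numpy as np

def convolve_volume(image, kernels):
    """Convolve an H x W x C image with a list of kh x kw x C kernels
    (stride 1, no padding) and stack the results along the last axis."""
    kh, kw, _ = kernels[0].shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w, len(kernels)))
    for k, kernel in enumerate(kernels):
        for i in range(out_h):
            for j in range(out_w):
                out[i, j, k] = np.sum(image[i:i+kh, j:j+kw, :] * kernel)
    return out

image = np.random.rand(6, 6, 3)                        # 6x6 RGB image
kernels = [np.random.rand(4, 4, 3) for _ in range(2)]  # two 4x4x3 filters
print(convolve_volume(image, kernels).shape)           # (3, 3, 2), as in the text
```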

Pooling Layer
The pooling layer is used to reduce the size of the input image. In a
convolutional neural network, a convolutional layer is usually followed
by a pooling layer. A pooling layer is usually added to speed up
computation and to make some of the detected features more robust.

The pooling operation uses a kernel and a stride as well. In the example image
below, a 2x2 filter is used for pooling the 4x4 input image, with a
stride of 2.

There are different types of pooling. Max pooling and average
pooling are the most commonly used pooling methods in a convolutional
neural network.
Max pooling on left, Average pooling on the right (source)

Max Pooling: In max pooling, from each patch of a feature map, the
maximum value is selected to create a reduced map.

Average Pooling: In average pooling, from each patch of a feature
map, the average value is selected to create a reduced map.
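A minimal sketch of both pooling operations on a toy 4x4 feature map with a 2x2 window and stride 2, as in the example above (the feature-map values are made up):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """2x2 pooling with stride 2 over a 2-D feature map, as in the example."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 2],
                 [7, 9, 8, 3],
                 [1, 2, 4, 6]], dtype=float)   # 4x4 feature map
print(pool2d(fmap, mode="max"))       # [[6. 5.] [9. 8.]]
print(pool2d(fmap, mode="average"))   # mean of each 2x2 patch
```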
https://www.geeksforgeeks.org/cnn-introduction-to-padding/

https://deepai.org/machine-learning-glossary-and-terms/stride

In the context of convolutional neural networks (CNNs), a strided operation refers to the process of
applying a convolutional filter with a certain stride value, which determines the step size for moving the
filter across the input data.

In a standard convolutional operation, the filter is usually applied to the input data with a stride value of
1. This means that the filter moves one pixel at a time, covering the entire input image, and (with
appropriate padding) produces a feature map with roughly the same spatial dimensions as the input.
However, when using a strided operation, the filter is applied with a larger stride value, skipping some
pixels as it moves across the input data. This leads to a reduction in the spatial dimensions of the output
feature map.

For example, consider a 3x3 filter applied to a 5x5 input image with a stride of 2. The filter will start at
the top-left corner of the input, perform the convolution operation, and then move two pixels to the
right for the next convolution. It will continue this process until it reaches the end of the row, and then
move two rows down to the next position. This stride of 2 will effectively reduce the spatial dimensions
of the output feature map by a factor of 2 in both width and height.
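The output-size arithmetic used in this example can be written as a one-line helper (the general formula with padding is an assumption here, since the text does not state it, but it is the standard one):

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial size of a convolution output: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1

# The example above: 5x5 input, 3x3 filter, stride 2, no padding -> 2x2 output
print(conv_output_size(n=5, f=3, stride=2))   # 2
# The same filter with stride 1 would give a 3x3 output
print(conv_output_size(n=5, f=3, stride=1))   # 3
```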

Strided operations are commonly used in CNNs for several reasons:

1. Dimensionality reduction: By applying a strided operation, the spatial dimensions of the feature maps
are reduced, which helps to reduce the computational complexity of subsequent layers and improve
efficiency.

2. Downsampling: Strided operations can act as a form of downsampling, where the information in the
input is summarized over larger regions. This can help capture more general features and reduce the
sensitivity to small local variations in the input.

3. Increasing receptive field: By using a larger stride value, the receptive field of each neuron in the
subsequent layers increases. This allows the network to capture larger spatial contexts and capture more
global information.

It's worth noting that strided operations can be used not only in convolutional layers but also in pooling
layers, where max pooling or average pooling can be applied with a stride value greater than 1 to achieve
similar downsampling effects.

Overall, strided operations provide a way to control the spatial dimensions and information flow within a
CNN, enabling more efficient processing of large-scale data while capturing important features.
Convolutional Neural Network — II
Mandar Deshpande · Towards Data Science · Mar 27, 2018
Continuing our learning from the last post we will be covering the
following topics in this post:

 Convolution over volume

 Multiple filters at one time

 One layer of convolution network

 Understanding the dimensional change

I have tried to explain most topics through illustrations as much as
possible. If something isn't easy to understand please ping me.

Let’s get started!

Convolution over volume


Don't be scared to read 'volume'; it's just a way of saying images with
more than one channel, i.e. RGB or any other channels.
Up until now we had just a single channel, so we were only
concerned about the height and width of the image. But with the
addition of more than one channel we need to take care of the filters
involved, as they should also encompass convolution across all
channels (3 here).
So if the image dimension is n x n x #channels, the filter which was
earlier f x f would now also be required to be of dimension f x f x #channels.

Intuitively, our input data is no longer 2-dimensional but in fact 3-dimensional
if you consider the channels, and hence the name volume.

Below is a simple example of convolution over volume, of an image
having dimension 6 x 6 x 3, with 3 denoting the 3 channels R, G and B.
Similarly, the filter is of dimension 3 x 3 x 3.
Fig 1. Convolution over volume (colour image — 3 channels) with filter of dimension 3 x 3

In the given example the purpose of the filter is to detect vertical edges.
If the edge needs to be detected only in the R channel, then only the
weights in the R channel need to be set for the requirement. If you need to
detect vertical edges across all channels, then all filter channels will
have the same weights, as demonstrated above.

Multiple filters at one time


There is a high chance that you may need to extract a lot of different
features from an image, for which you will use multiple filters. If
individual filters are convolved separately, it will increase the
computation time, so it is more convenient to use all the required filters
at one time directly.

Convolution is carried out individually as is the case with a single filter,
and then the results of the convolutions are merged together in a stack to
form an output volume, with the 3rd dimension representing the
number of filters.

The below example considers a 6 x 6 x 3 image as above, but we are using
2 filters (for vertical edges and horizontal edges) of dimension 3 x 3 x 3.

The resultant images of 4 x 4 each are stacked together to get an output
volume of 4 x 4 x 2.

Fig 2. Convolution over volume with multiple filter of dimensions 3 x 3

The output dimension can be calculated for any general case using the
following equation:
Fig 3. Equation governing the output image/signal dimension with respect to the input and filter dimensions

Here, n_c is the number of channels in the input image and n_f is the
number of filters used.
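The equation itself appears only as an image (Fig 3) in the original post; for reference, the standard output-size formula it corresponds to, with padding p and stride s included for the general case, is:

$$
(n \times n \times n_c) * (f \times f \times n_c) \;\rightarrow\; \left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1\right) \times n_f
$$

With stride 1 and no padding, as in the examples above, this reduces to (n - f + 1) x (n - f + 1) x n_f, which matches the 6 x 6 x 3 input with two 3 x 3 x 3 filters producing a 4 x 4 x 2 output.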

Softmax regression, also known as multinomial logistic regression, is a classification algorithm that is
commonly used in machine learning and deep learning for multi-class classification problems. It is an
extension of logistic regression, which is used for binary classification.

In softmax regression, the goal is to assign an input sample to one of the multiple classes in a mutually
exclusive manner. The algorithm computes a probability distribution over the classes and assigns the
input to the class with the highest probability.

Here's how softmax regression works:

1. Input Data: Each input sample is represented by a feature vector. The features can be real-valued or
discrete.

2. Model Parameters: Softmax regression learns a weight matrix and bias vector that map the input
features to the probabilities of the different classes. The weight matrix and bias vector are learned
through the training process.

3. Linear Transformation: The input features are linearly transformed using the weight matrix and bias
vector. This produces a set of scores or logits for each class. The logits represent the evidence or
confidence of the input sample belonging to each class.
4. Softmax Function: The logits are then passed through the softmax function, which normalizes them
into a probability distribution over the classes. The softmax function calculates the exponential of each
logit and divides it by the sum of exponentials across all classes.

5. Class Prediction: The class with the highest probability from the softmax function is selected as the
predicted class for the input sample.

6. Loss Function: Softmax regression uses a loss function to measure the discrepancy between the
predicted probabilities and the true class labels. The commonly used loss function is cross-entropy loss.

7. Training: The model parameters, i.e., the weight matrix and bias vector, are learned by minimizing the
loss function through optimization techniques such as gradient descent or its variants. The training
process adjusts the parameters iteratively to improve the model's ability to classify the input data
correctly.

Once the softmax regression model is trained, it can be used to predict the class labels for new, unseen
samples by passing their features through the trained model.

Softmax regression is widely used in various applications, including image classification, text
classification, and natural language processing tasks, where there are multiple classes to predict from. It
provides a probabilistic framework for multi-class classification by assigning class probabilities based on
the input features.
Introduction
Before understanding Softmax regression, we need to understand
the underlying softmax function that drives this regression.
The softmax function, also known as softargmax or normalized
exponential function, is, in simple terms, more like a
normalization function, which involves adjusting values measured
on different scales to a notionally common scale. There is more
than one method to accomplish this, and let us review why the
softmax method stands out. These methods could be used to
estimate probability scores from a set of values as in the case
of logistic regression or the output layer of a classification
neural network, both for finding the class with the largest
predicted probability.

Although simple mathematically, let us take some deliberate baby
steps in understanding the softmax so that we can appreciate
the subtle beauty of this technique. As a first step, let us
understand the simplest form of normalization, called 'hard-max'.
Let us assume our classification model has returned three
values (3, 7 and 14) as an output, and we want to assign
probabilities or label them. The easiest possible way is to
assign a 100% probability to the highest score and 0% to
everything else, i.e. 14 would get a 100% probability score,
while both 3 and 7 would get probability scores of 0% each.
Although simple, this method is relatively crude and does not
consider the scores of the other variables and their scales.

Next, let us consider the conventional normalization done by
taking the ratio of each score to the sum of all scores. For the
same model outputs (3, 7 and 14) our probabilities would be
3/(3+7+14) = 0.13, 7/(3+7+14) = 0.29 and 14/(3+7+14) = 0.58.
Although this method takes into account the scores of
outputs other than the maximum value, it suffers from the
following issues:

1. It does not take into account the effect of scale, i.e.
instead of outputs 3, 7 and 14, if we had outputs of
0.3, 0.7 and 1.4, we would still end up with the same
probability scores, namely 0.13, 0.29 and 0.58.
2. We would end up with negative probability scores if our
outputs were negative values, which may not make
mathematical sense.
Thus, it is imperative that we resort to some other method that takes
care of the aforementioned issues.

Enter the softmax method, which is mathematically given by

$$\sigma(z)_i = \frac{e^{\beta z_i}}{\sum_{j=1}^{K} e^{\beta z_j}}$$

where $\sigma(z)_i$ is the probability score, $z_i, z_j$ are the outputs, and $\beta$
is a parameter that we choose if we want to use a base other than $e$.

Features of Softmax:
Now, for our earlier outputs 3, 7 and 14, our probabilities would
be e^3/(e^3 + e^7 + e^14) = 1.6 x 10^-5, e^7/(e^3 + e^7 + e^14) = 91 x 10^-5 and
e^14/(e^3 + e^7 + e^14) = 0.99 respectively. As you will have noticed, this method
highlights the largest values and suppresses values that are
significantly below the maximum value. Also, this is done in
proportion to the scale of the numbers, i.e. we would not get
the same probability scores if the outputs were 0.3, 0.7 and
1.4; rather we would get probability scores of 0.18, 0.27
and 0.55.
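These numbers can be checked with a couple of lines of NumPy:

```python
import numpy as np

# Checking the numbers above: softmax of the raw scores 3, 7 and 14.
scores = np.array([3.0, 7.0, 14.0])
print(np.exp(scores) / np.exp(scores).sum())   # approx [1.7e-05, 9.1e-04, 9.99e-01]

# The same scores divided by 10 give noticeably softer probabilities,
# which is the scale sensitivity discussed above.
scaled = np.array([0.3, 0.7, 1.4])
print(np.exp(scaled) / np.exp(scaled).sum())   # approx [0.18, 0.27, 0.55]
```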

In addition, even if we end up with negative output values,
the probability scores will not be negative (due to the
property of the distribution).

Apart from these trivial properties, another interesting
property of the softmax function makes it all the more
attractive for neural-network-based applications. This property
is that the softmax function is continuously
differentiable, making it possible to calculate the derivative
of the loss function with respect to every weight in the network for
every input in the training set. Simply put, it makes it
easier to update the weights in the neural network.

Relationship with Sigmoid/Logistic Regression:
What if I said that the sigmoid function (the one we use to model
probabilities in logistic regression) is just a special case of the softmax?
You probably wouldn't agree, as they might look very different at first glance. Now
let us prove that the softmax function, which handles multiple
classes, is a generalization of the logistic regression used for
two-class classification.
We know that the softmax for K classes, with β = 1, is given by:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

We also know that for logistic regression there are two
classes, x and non-x (or zero). Plugging these into the formula
above, we get:

$$\sigma(x) = \frac{e^{x}}{e^{x} + e^{0}}$$

Now, dividing the numerator and denominator by e^x, we get:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The above equation is nothing but the sigmoid function; thus we
see how the softmax function is a generalization of the sigmoid
function (for two-class problems).
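A quick numerical check of this equivalence (the specific logit values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# A two-class softmax over the logits [x, 0] equals sigmoid(x),
# illustrating the derivation above.
for x in [-2.0, 0.5, 3.0]:
    print(softmax(np.array([x, 0.0]))[0], sigmoid(x))   # the two values match
```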

Applications:
As mentioned earlier, the softmax function/regression finds utility
in several areas, and some of the popular applications are as
below:

Image recognition in convolutional neural networks: here, the
last hidden layer outputs values, which are taken as
input by the output layer, which then computes the probabilities
using the softmax. The class with the highest probability is
then the final classification.

In reinforcement learning, the softmax function is also used
when a model needs to decide between taking the action currently
known to have the highest probability of a reward, called
exploitation, and taking an exploratory step, called
exploration. Further details on how this is computed are out of
the scope of this article but are available in the cited
footnote.

Implementation:
Now that we've understood how the softmax function works, we can
use that function to compute the probabilities predicted by a
crude linear model, such as y = mx + b.

Initially, we can use the linear model to make some initial
predictions. We can then use an optimization algorithm, such as
gradient descent, that adjusts m and b to minimize the
prediction errors of our model. Thereby we end up with a final
softmax regression model with good enough m and b to make future
predictions from the data. Although we use softmax for the
probabilities, we will still have to assign a class to a data point
based on the highest probability (such as with argmax).

The same is represented in the schematic below. A detailed and
nice write-up of the mathematics behind the model is also
available here and has not been included in this article, as it
is outside of the scope of this article.

The Python implementation of the model on the iris dataset is
attached and embedded below. Here we take the iris dataset,
which contains information on four flower characteristics: sepal
length, sepal width, petal length, and petal width. Using this
information, we try to predict one of 3 flower classes.
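The notebook embedded in the original article is not reproduced here; as a stand-in, a minimal softmax (multinomial logistic) regression on the iris dataset using scikit-learn might look like this (the split ratio and random seed are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the four flower measurements and the three class labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# With a multi-class target and the default solver, LogisticRegression
# fits a multinomial (softmax) model.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("class probabilities for one sample:", clf.predict_proba(X_test[:1]))
```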

Training and testing on different distributions

 Example: Cat vs Non-cat


 In this example, we want to create a mobile application that will classify and recognize
pictures of cats taken and uploaded by users.

 There are two sources of data used to develop the mobile app. The first data distribution is
small, 10 000 pictures uploaded from the mobile application.

 Since they are from amateur users, the pictures are not professionally shot, not well framed,
and blurrier. The second source is the web, from which you downloaded 200,000 pictures where
the cat pictures are professionally framed and in high resolution.

 The problem is that you have two different distributions:

1. a small data set of pictures uploaded by users. This distribution is important for the
mobile app.

2. a bigger data set from the web.

 The guideline used is that you have to choose a development set and test set that reflect the data
you expect to get in the future and consider important to do well on.

 The data is split as follows:

 The advantage of this way of splitting up is that the target is well defined. The disadvantage
is that the training distribution is different from the development and test set distributions.
However, this way of splitting the data gives better performance in the long term.
Training and testing on different distributions refers to a scenario in machine learning where the data
used for training a model is drawn from a different distribution than the data used for testing or
evaluating the model. This situation can arise due to various reasons, such as changes in data collection
processes, domain shifts, or intentionally creating diverse datasets.

Training a model on one distribution and then evaluating it on a different distribution can lead to a
phenomenon called distribution mismatch or distributional shift. This can result in degraded
performance and reduced generalization of the model. The reasons for this performance drop include:

1. Differences in Data Characteristics: The statistical properties, such as mean, variance, or class
distributions, may differ between the training and testing data. As a result, the model may not perform
well on the unseen data because it has not learned to generalize to the different distribution.

2. Covariate Shift: Covariate shift occurs when the input features' marginal distribution differs between
the training and testing data, while the conditional distribution of the target variable remains the same.
This can lead to a mismatch between the training and testing data and affect the model's performance.

3. Concept Shift: Concept shift refers to a change in the underlying relationship between the input
features and the target variable. If the concept shift occurs between the training and testing data, the
model's learned patterns may not be applicable to the unseen data, leading to reduced accuracy.

To address the issue of training and testing on different distributions, several techniques can be
employed:

1. Data Collection: Efforts should be made to ensure that the training and testing data are as
representative of the real-world distribution as possible. Collecting a diverse and balanced dataset that
covers various scenarios can help mitigate the distributional shift.

2. Data Augmentation: Data augmentation techniques can be applied to artificially expand the training
dataset by creating new samples with variations. This can help the model learn to generalize better by
introducing more diverse examples.
3. Domain Adaptation: Domain adaptation methods aim to align the source and target domains to
reduce the distributional shift. Techniques like domain adaptation networks, importance weighting, or
feature adaptation can be applied to align the data distributions.

4. Transfer Learning: Transfer learning involves pretraining a model on a large, relevant dataset and then
fine-tuning it on the target distribution. This helps the model leverage the learned knowledge from the
source distribution and adapt it to the target distribution.

5. Cross-Validation: If labeled data from the target distribution is available, cross-validation can be used
to evaluate the model's performance. This allows for model selection and hyperparameter tuning on the
target distribution, improving the model's ability to generalize.

It is important to consider the potential distributional shift when designing machine learning models and
take appropriate steps to mitigate its impact. Understanding the data characteristics and employing
techniques to address the training-testing distribution mismatch can lead to more robust and accurate
models.

What is Bias?
Bias is simply defined as the inability of the model to fit the data,
because of which there is some difference or error
between the model's predicted value and the actual value.
These differences between the actual or expected values and the
predicted values are known as error, bias error, or error
due to bias. Bias is a systematic error that occurs due to
wrong assumptions in the machine learning process.
What is Variance?
Variance is the measure of spread in data from
its mean position. In machine learning, variance is the
amount by which the performance of a predictive model
changes when it is trained on different subsets of the
training data. More specifically, variance is the
variability of the model: how sensitive it is to
another subset of the training dataset, i.e. how much it
adjusts when given a new subset of the training dataset.



What Are Model Bias and Variance?
Both terms describe how a model changes as you retrain it using different
portions of a given data set. By changing the portion of the data set used to
train the model, you can change the functions describing the resulting model.
However, models of different structures will respond to new data sets in
different ways. Bias and variance describe the two different ways models can
respond.

BIAS VS. VARIANCE


 Bias describes how well a model matches the training set. A model with
high bias won’t match the data set closely, while a model with low bias
will match the data set very closely. Bias comes from models that are
overly simple and fail to capture the trends present in the data set.
 Variance describes how much a model changes when you train it using
different portions of your data set. A model with high variance will have
the flexibility to match any data set you provided it, which may result in
dramatically different models each time. Variance comes from models
that are highly complex and employ a significant number of features.

Typically models with high bias have low variance, and models with high
variance have low bias. This is because the two come from opposite types of
models. A model that’s not flexible enough to match a data set correctly (high
bias) is also not flexible enough to change dramatically when given a different
data set (low variance).
Those who’ve read my previous article on underfitting and overfitting will
probably note a lot of similarity between these concepts. Underfit models
usually have high bias and low variance. Overfit models usually have high
variance and low bias.


What's the Tradeoff Between Bias and Variance?
The bias-variance trade-off is a commonly discussed term in data science.
Actions that you take to decrease bias (leading to a better fit to the training
data) will simultaneously increase the variance in the model (leading to higher
risk of poor predictions). The inverse is also true; actions you take to reduce
variance will inherently increase bias.

WHAT CAN I DO ABOUT THE BIAS-VARIANCE TRADE-OFF?

Keep in mind increasing variance is not always a bad thing. An underfit model
is underfit because it doesn’t have enough variance, which leads to
consistently high bias errors. This means when you’re developing a model you
need to find the right amount of variance, or the right amount of model
complexity. The key is to increase model complexity, thus decreasing bias and
increasing variance, until bias has been minimized but before significant
variance errors become evident.
Another solution is to increase the size of the data set used to train your
model. High variance errors, also referred to as overfitting models, come from
creating a model that’s too complex for the available data set. If you’re able to
use more data to train the model, then you can create a model that’s more
complex without accidentally adding variance error.

When training and testing data come from different distributions, it can have an impact on the bias and
variance of a machine learning model. Bias and variance are two fundamental sources of error in a
model's predictions.

Bias refers to the error introduced by approximating a real-world problem with a simplified model. It
captures how much the predicted values differ from the true values on average. A high bias indicates
that the model is too simplistic and cannot capture the underlying patterns in the data. When training
and testing data come from different distributions, the model's bias can be affected if the underlying
relationship between the input features and the target variable changes. In this case, the model may
struggle to capture the new patterns present in the testing data, leading to increased bias.

Variance, on the other hand, measures the variability of the model's predictions for different training
datasets. It quantifies how much the predictions differ when the model is trained on different subsets of
the data. High variance indicates that the model is too complex and overfits the training data, capturing
noise and random fluctuations. When training and testing data come from different distributions, the
model's variance can increase because it has learned specific patterns from the training distribution that
may not generalize well to the different distribution in the testing phase.

In the context of training and testing on different distributions, here's how bias and variance can be
affected:
1. Bias: If the underlying relationship between the input features and the target variable changes
between the training and testing distributions, the model's bias can increase. The model may not be able
to capture the new patterns present in the testing data, resulting in a higher average prediction error.

2. Variance: When the training and testing data come from different distributions, the model may
struggle to generalize well. This can lead to an increase in variance as the model has learned specific
patterns from the training distribution that do not apply to the testing distribution. The model's
predictions may vary significantly when trained on different subsets of the data, indicating higher
variability.

To strike a balance between bias and variance in the context of training and testing on different
distributions, it is important to consider techniques such as transfer learning, domain adaptation, or
cross-validation. These techniques can help mitigate the distributional shift and improve the model's
ability to generalize to the testing data. By reducing the bias and variance, the model can achieve better
performance on unseen data, even when the distributions differ between training and testing.

What Is Transfer Learning? Exploring the Popular Deep Learning Approach
Discover the value of transfer learning and how to use it.
Written by Niklas Donges · Updated by Jessica Powers, Sep 12, 2022 · Reviewed by Parul Pandey

Transfer learning is the reuse of a pre-trained model on a new problem. It's
currently very popular in deep learning because it can train deep neural
networks with comparatively little data. This is very useful in the data
science field since most real-world problems typically do not have millions of
labeled data points to train such complex models.

We’ll take a look at what transfer learning is, how it works, why and when it
should be used. Additionally, we’ll cover the different approaches of transfer
learning and provide you with some resources on already pre-trained models.
WHAT IS TRANSFER LEARNING?
Transfer learning, used in machine learning, is the reuse of a pre-trained
model on a new problem. In transfer learning, a machine exploits the
knowledge gained from a previous task to improve generalization about
another. For example, in training a classifier to predict whether an image
contains food, you could use the knowledge it gained during training to
recognize drinks.


What Is Transfer Learning?


In transfer learning, the knowledge of an already trained machine
learning model is applied to a different but related problem. For example, if
you trained a simple classifier to predict whether an image contains a
backpack, you could use the knowledge that the model gained during its
training to recognize other objects like sunglasses.

With transfer learning, we basically try to exploit what has been learned in one
task to improve generalization in another. We transfer the weights that a
network has learned at “task A” to a new “task B.”

The general idea is to use the knowledge a model has learned from a task
with a lot of available labeled training data in a new task that doesn't have
much data. Instead of starting the learning process from scratch, we start with
patterns learned from solving a related task.

Transfer learning is mostly used in computer vision and natural language
processing tasks like sentiment analysis due to the huge amount of
computational power required.

Transfer learning isn’t really a machine learning technique, but can be seen as
a “design methodology” within the field, for example, active learning. It is also
not an exclusive part or study-area of machine learning. Nevertheless, it has
become quite popular in combination with neural networks that require huge
amounts of data and computational power.

How Transfer Learning Works


In computer vision, for example, neural networks usually try to detect edges in
the earlier layers, shapes in the middle layer and some task-specific features in
the later layers. In transfer learning, the early and middle layers are used and
we only retrain the latter layers. It helps leverage the labeled data of the task it
was initially trained on.

Let's go back to the example of a model trained for recognizing a backpack in
an image, which will be used to identify sunglasses. In the earlier layers, the
model has learned to recognize objects; because of that, we will only retrain the
latter layers so it learns what separates sunglasses from other objects.
In transfer learning, we try to transfer as much knowledge as possible from
the previous task the model was trained on to the new task at hand. This
knowledge can be in various forms depending on the problem and the data.
For example, it could be how models are composed, which allows us to more
easily identify novel objects.


Why Use Transfer Learning


Transfer learning has several benefits, but the main advantages are
saving training time, better performance of neural networks (in most cases),
and not needing a lot of data.

Usually, a lot of data is needed to train a neural network from scratch but
access to that data isn't always available — this is where transfer learning
comes in handy. With transfer learning a solid machine learning model can be
built with comparatively little training data because the model is already pre-
trained. This is especially valuable in natural language processing because
mostly expert knowledge is required to create large labeled data sets.
Additionally, training time is reduced because it can sometimes take days or
even weeks to train a deep neural network from scratch on a complex task.

According to DeepMind CEO Demis Hassabis, transfer learning is also one of
the most promising techniques that could lead to artificial general
intelligence (AGI) someday.

When to Use Transfer Learning


As is always the case in machine learning, it is hard to form rules that are
generally applicable, but here are some guidelines on when transfer learning
might be used:

 There isn’t enough labeled training data to train your network from scratch.
 There already exists a network that is pre-trained on a similar task, which is usually trained on
massive amounts of data.
 When task 1 and task 2 have the same input.

If the original model was trained using an open-source library like
TensorFlow, you can simply restore it and retrain some layers for your task.
Keep in mind, however, that transfer learning only works if the features
learned from the first task are general, meaning they can be useful for another
related task as well. Also, the input of the model needs to have the same size as
it was initially trained with. If you don't have that, add a pre-processing step to
resize your input to the needed size.

Approaches to Transfer Learning


1. TRAINING A MODEL TO REUSE IT

Imagine you want to solve task A but don’t have enough data to train a deep
neural network. One way around this is to find a related task B with an
abundance of data. Train the deep neural network on task B and use the model
as a starting point for solving task A. Whether you'll need to use the whole
model or only a few layers depends heavily on the problem you're trying to
solve.

If you have the same input in both tasks, possibly reusing the model and
making predictions for your new input is an option. Alternatively,
changing and retraining different task-specific layers and the output layer is a
method to explore.

2. USING A PRE-TRAINED MODEL


The second approach is to use an already pre-trained model. There are a lot of
these models out there, so make sure to do a little research. How many layers
to reuse and how many to retrain depends on the problem.

Keras, for example, provides numerous pre-trained models that can be used
for transfer learning, prediction, feature extraction and fine-tuning. You can
find these models, and also some brief tutorials on how to use them, here.
There are also many research institutions that release trained models.

This type of transfer learning is most commonly used throughout deep learning.
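A minimal Keras sketch of this approach (the choice of MobileNetV2, the input size, and the 5-class head are illustrative assumptions, not from the article):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

# Load a network pre-trained on ImageNet without its classification head,
# freeze its weights, and train only a small new head for our task.
base = MobileNetV2(input_shape=(224, 224, 3),
                   include_top=False,          # drop the original output layer
                   weights="imagenet",
                   pooling="avg")
base.trainable = False                         # reuse the learned features as-is

model = models.Sequential([
    base,
    layers.Dense(128, activation="relu"),
    layers.Dense(5, activation="softmax"),     # e.g. 5 classes in the new task
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_task_images, new_task_labels, epochs=5)
```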

3. FEATURE EXTRACTION

Another approach is to use deep learning to discover the best representation
of your problem, which means finding the most important features. This
approach is also known as representation learning, and can often result in
much better performance than can be obtained with hand-designed
representations.

In machine learning, features are usually manually hand-crafted by
researchers and domain experts. Fortunately, deep learning can extract
features automatically. Of course, this doesn't mean feature engineering and
domain knowledge isn't important anymore — you still have to decide which
features you put into your network. That said, neural networks have the ability
to learn which features are really important and which ones aren't. A
representation learning algorithm can discover a good combination of features
within a very short timeframe, even for complex tasks which would otherwise
require a lot of human effort.

The learned representation can then be used for other problems as well.
Simply use the first layers to spot the right representation of features, but
don’t use the output of the network because it is too task-specific. Instead,
feed data into your network and use one of the intermediate layers as the
output layer. This layer can then be interpreted as a representation of the raw
data.

This approach is mostly used in computer vision because it can reduce the size
of your dataset, which decreases computation time and makes it more suitable
for traditional algorithms, as well.

POPULAR PRE-TRAINED MODELS

There are some pre-trained machine learning models out there that are quite
popular. One of them is the Inception-v3 model, which was trained for
the ImageNet “Large Visual Recognition Challenge.” In this challenge,
participants had to classify images into 1,000 classes like
“zebra,” “Dalmatian” and “dishwasher.”
Here’s a very good tutorial from TensorFlow on how to retrain image
classifiers.

Microsoft also offers some pre-trained models, available for both R and
Python development, through the MicrosoftML R package and
the Microsoftml Python package.

Other quite popular models are ResNet and AlexNet. I also encourage a visit
to pretrained.ml, a sortable and searchable compilation of pre-trained deep
learning models complete with demos and code.

Multi-task learning in Machine Learning
Deep multi-task learning with neural networks
Devin Soni · Towards Data Science · Jun 27, 2021

Introduction
In most machine learning contexts, we are concerned with solving
a single task at a time. Regardless of what that task is, the problem is
typically framed as using data to solve a single task or optimize a single
metric at a time. However, this approach will eventually hit a
performance ceiling, oftentimes due to the size of the data-set or the
ability of the model to learn meaningful representations from it.

Multi-task learning, on the other hand, is a machine learning approach
in which we try to learn multiple tasks simultaneously, optimizing
multiple loss functions at once. Rather than training independent
models for each task, we allow a single model to learn to complete all of
the tasks at once. In this process, the model uses all of the available
data across the different tasks to learn generalized representations of
the data that are useful in multiple contexts.

Multi-task learning has seen widespread usage across multiple
domains such as natural language processing, computer vision, and
recommendation systems. It is also commonly leveraged in industry,
such as at Google, due to its ability to effectively leverage large
amounts of data in order to solve related tasks.

When to use multi-task learning


Before going into the specifics of how to implement a multi-task
learning model, it is first important to go through situations in which
multi-task learning is, and is not, appropriate.

Generally, multi-task learning should be used when the tasks have
some level of correlation. In other words, multi-task learning
improves performance when there are underlying principles or
information shared between tasks.

For example, two tasks involving classifying images of animals are
likely to be correlated, as both tasks will involve learning to detect fur
patterns and colors. This would be a good use-case for multi-task
learning, since learning these image features is useful for both tasks.

On the other hand, sometimes training on multiple tasks results
in negative transfer between the tasks, in which the multi-task
model performs worse than the equivalent single-task models. This
generally happens when the different tasks are unrelated to each other,
or when the information learned in one task contradicts that learned in
another task.

Building a multi-task model


Now that we know when we should use multi-task learning, we will go
through a simple model architecture for a multi-task model. This will
focus on a neural network architecture (deep multi-task learning),
since neural networks are by far the most common type of model used
in multi-task learning.

Learning a shared representation

At its core, deep multi-task learning aims to learn to produce
generalized representations that are powerful enough to be shared
across different tasks. I will focus on hard parameter sharing here, in
which the different tasks use exactly the same base representation of
the input data.

As we can see, hard parameter sharing forces the model to learn an
intermediate representation that conveys enough information for all of
the tasks. The task-specific portions of the network all start with the
same base representation from the last shared layer.

Multi-task learning improves the generalizability of this representation
because learning multiple tasks forces the model to focus on the
features that are useful across all of the tasks. Assuming the tasks are
correlated, a feature that is important for Task A is also likely to be
important for Task C. The opposite is also true; unimportant features
are likely to be unimportant across all of the tasks.
Multi-task learning also effectively increases the size of your data-set,
since you are combining the data-sets from each task. By adding more
samples to the training set from different tasks, the model will learn to
better ignore the task-specific noise or biases within each individual
data-set.
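A minimal Keras sketch of hard parameter sharing (the layer sizes, the two heads, and the loss weights are illustrative assumptions):

```python
from tensorflow.keras import layers, Model, Input

# A shared trunk produces one representation, and each task gets its own head.
inputs = Input(shape=(32,))
shared = layers.Dense(64, activation="relu")(inputs)    # shared layers
shared = layers.Dense(64, activation="relu")(shared)

task_a = layers.Dense(1, name="task_a")(shared)                        # regression head
task_b = layers.Dense(3, activation="softmax", name="task_b")(shared)  # classification head

model = Model(inputs=inputs, outputs=[task_a, task_b])
model.compile(optimizer="adam",
              loss={"task_a": "mse", "task_b": "sparse_categorical_crossentropy"},
              loss_weights={"task_a": 1.0, "task_b": 0.5})   # both losses optimized at once
model.summary()
```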

https://www.baeldung.com/cs/end-to-end-deep-learning#:~:text=Definition,without%20any%20manual%20feature%20extraction.

https://towardsdatascience.com/e2e-the-every-purpose-ml-method-5d4f20dafee4
LeNet-5 is a convolutional neural network (CNN) architecture that was developed by Yann LeCun et al. in
1998. It was one of the pioneering CNN models for image recognition tasks and played a significant role
in the advancement of deep learning.

The LeNet-5 architecture was primarily designed for handwritten digit recognition, specifically for
classifying digits from the MNIST dataset. It consists of seven layers: two convolutional layers, two
subsampling (pooling) layers, two fully connected layers, and an output layer. The architecture can be
summarized as follows:

1. Input Layer: Accepts grayscale images of size 32x32 pixels as input.

2. Convolutional Layers: The first convolutional layer applies six filters (also known as kernels) of size 5x5
to the input images, resulting in six feature maps. The second convolutional layer uses 16 filters of size
5x5 and produces 16 feature maps. Both layers use a stride of 1 and a "valid" padding, and the output
feature maps undergo a nonlinear activation using the hyperbolic tangent (tanh) function.

3. Subsampling Layers: Two subsampling layers follow the convolutional layers. They perform average
pooling over non-overlapping regions. The first subsampling layer reduces the spatial dimensions of the
feature maps by a factor of 2, and the second subsampling layer reduces them further.

4. Fully Connected Layers: The subsampled feature maps are then flattened and passed through two fully
connected layers. The first fully connected layer consists of 120 neurons, followed by the second fully
connected layer with 84 neurons. Each neuron is connected to all the neurons of the previous layer.

5. Output Layer: The final layer is a fully connected layer with 10 neurons, representing the 10 possible
classes (digits 0-9). The output layer uses a softmax activation function to produce the probability
distribution over the classes.

LeNet-5 was trained using the backpropagation algorithm and stochastic gradient descent (SGD)
optimization. It achieved remarkable performance on the MNIST dataset and showcased the potential of
CNNs in image recognition tasks.

LeNet-5 served as a foundation for subsequent advancements in deep learning and convolutional neural
networks, paving the way for more complex and powerful architectures for image recognition and other
computer vision tasks.
https://medium.com/@siddheshb008/lenet-5-architecture-explained-3b559cb2d52b

In total, the LeNet-5 architecture consists of seven layers. These layers can be categorized as follows:

1. Input Layer: The input layer accepts grayscale images of size 32x32 pixels.

2. Convolutional Layers: There are two convolutional layers in LeNet-5. The first convolutional layer
applies six filters of size 5x5 to the input images, and the second convolutional layer applies 16 filters of
size 5x5.

3. Subsampling (Pooling) Layers: LeNet-5 has two subsampling layers. Each subsampling layer performs
average pooling over non-overlapping regions.

4. Fully Connected Layers: There are two fully connected layers in LeNet-5. The first fully connected layer
consists of 120 neurons, and the second fully connected layer consists of 84 neurons.
5. Output Layer: The output layer is a fully connected layer with 10 neurons, representing the 10 possible
classes (digits 0-9) in the case of the MNIST dataset.

So, excluding the input, LeNet-5 has seven layers in total: two convolutional layers, two subsampling
layers, two fully connected layers, and one output layer.
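
Putting the summary above into code, here is a minimal Keras sketch of LeNet-5. It follows the layer
counts listed above and simplifies some of the original details (the scaled tanh and the RBF output
layer of the 1998 paper are replaced by plain tanh and softmax, as is common in modern re-implementations).

# A hedged Keras sketch of LeNet-5 as summarized above.
import tensorflow as tf
from tensorflow.keras import layers, models

lenet5 = models.Sequential([
    layers.Conv2D(6, kernel_size=5, activation="tanh",
                  input_shape=(32, 32, 1)),               # C1: 6 feature maps, 5x5 filters
    layers.AveragePooling2D(pool_size=2),                 # S2: 2x2 average pooling
    layers.Conv2D(16, kernel_size=5, activation="tanh"),  # C3: 16 feature maps, 5x5 filters
    layers.AveragePooling2D(pool_size=2),                 # S4: 2x2 average pooling
    layers.Flatten(),
    layers.Dense(120, activation="tanh"),                 # fully connected, 120 neurons
    layers.Dense(84, activation="tanh"),                  # fully connected, 84 neurons
    layers.Dense(10, activation="softmax"),               # output: 10 digit classes
])
lenet5.summary()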
AlexNet: The Architecture that Challenged CNNs
(Jerry Wei, Towards Data Science, Jul 3, 2019)
A few years back, we still used small datasets like CIFAR and NORB
consisting of tens of thousands of images. These datasets were
sufficient for machine learning models to learn basic recognition tasks.
However, real life is never simple and has many more variables than
are captured in these small datasets. The recent availability of large
datasets like ImageNet, which consist of hundreds of thousands to
millions of labeled images, has pushed the need for an extremely
capable deep learning model. Then came AlexNet.

The Problem. Convolutional Neural Networks (CNNs) had always
been the go-to model for object recognition — they’re strong models
that are easy to control and even easier to train. They don’t experience
overfitting at any alarming scales when being used on millions of
images. Their performance is almost identical to standard feedforward
neural networks of the same size. The only problem: they’re hard to
apply to high resolution images. At the ImageNet scale, there needed to
be an innovation that would be optimized for GPUs and cut down on
training times while improving performance.

The Dataset. ImageNet: a dataset made of more than 15 million high-
resolution images labeled with 22 thousand classes. The key: web-
scraping images and crowd-sourcing human labelers. ImageNet even
has its own competition: the ImageNet Large-Scale Visual Recognition
Challenge (ILSVRC). This competition uses a subset of ImageNet’s
images and challenges researchers to achieve the lowest top-1 and top-
5 error rates (top-5 error rate would be the percent of images where the
correct label is not one of the model’s five most likely labels). In this
competition, data is not a problem; there are about 1.2 million training
images, 50 thousand validation images, and 150 thousand testing
images. The authors enforced a fixed resolution of 256x256 pixels for
their images by cropping out the center 256x256 patch of each image.
Convolutional Neural Networks that use ReLU achieved a 25% error rate on
CIFAR-10 six times faster than those that used tanh. Image credits to
Krizhevsky et al., the original authors of the AlexNet paper.

AlexNet. The architecture consists of eight layers: five convolutional
layers and three fully-connected layers. But this isn't what makes
AlexNet special; it also introduced several features that were new
approaches to convolutional neural networks at the time:

 ReLU Nonlinearity. AlexNet uses Rectified Linear
Units (ReLU) instead of the tanh function, which was
standard at the time. ReLU's advantage is in training
time; a CNN using ReLU was able to reach a 25% training
error on the CIFAR-10 dataset six times faster than a CNN
using tanh. (A toy sketch of the two activations follows
this list.)

 Multiple GPUs. Back in the day, GPUs were still rolling
around with 3 gigabytes of memory (nowadays those
kinds of memory would be rookie numbers). This was
especially bad because the training set had 1.2 million
images. AlexNet allows for multi-GPU training by putting
half of the model's neurons on one GPU and the other half
on another GPU. Not only does this mean that a bigger
model can be trained, but it also cuts down on the training
time.

 Overlapping Pooling. CNNs traditionally "pool"
outputs of neighboring groups of neurons with no
overlapping. However, when the authors introduced
overlap, they saw a reduction in error by about 0.5% and
found that models with overlapping pooling generally find
it harder to overfit.
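
As a toy illustration of the two activations compared in the ReLU point above
(this only shows the functions themselves, not the reported training-speed
result):

# Toy comparison of the two activations discussed above.
import numpy as np

def relu(x):
    # ReLU: max(0, x); cheap to compute and does not saturate for positive inputs.
    return np.maximum(0.0, x)

def tanh(x):
    # tanh saturates at -1 and +1, so gradients shrink for large |x|.
    return np.tanh(x)

x = np.linspace(-3, 3, 7)
print(relu(x))
print(tanh(x))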
Illustration of AlexNet’s architecture. Image credits to Krizhevsky et al., the
original authors of the AlexNet paper.

The Overfitting Problem. AlexNet had 60 million parameters, a
major issue in terms of overfitting. Two methods were employed to
reduce overfitting:

 Data Augmentation. The authors used label-preserving
transformations to make their data more varied.
Specifically, they generated image translations and
horizontal reflections, which increased the training set by
a factor of 2048. They also performed Principal
Component Analysis (PCA) on the RGB pixel values to
change the intensities of the RGB channels, which reduced
the top-1 error rate by more than 1%.

 Dropout. This technique consists of "turning off"
neurons with a predetermined probability (e.g. 50%). This
means that every iteration uses a different sample of the
model's parameters, which forces each neuron to learn
more robust features that can be used with other random
neurons. However, dropout also increases the training
time needed for the model's convergence.
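
Below is a toy sketch of the dropout idea just described. It uses "inverted"
dropout (scaling at training time), which is how modern frameworks implement
it; the original AlexNet instead scaled activations at test time.

# Toy inverted-dropout sketch.
import numpy as np

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                      # no dropout at inference
    # Zero each unit with probability p, then rescale the survivors
    # so the expected activation is unchanged.
    mask = (np.random.rand(*activations.shape) >= p).astype(activations.dtype)
    return activations * mask / (1.0 - p)

h = np.random.randn(4, 8)   # pretend hidden-layer activations
print(dropout(h, p=0.5))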

The Results. On the 2010 version of the ImageNet competition, the
best model achieved 47.1% top-1 error and 28.2% top-5 error. AlexNet
vastly outpaced this with a 37.5% top-1 error and a 17.0% top-5 error.
AlexNet is able to recognize off-center objects and most of its top five
classes for each image are reasonable. AlexNet won the 2012 ImageNet
competition with a top-5 error rate of 15.3%, compared to the second
place top-5 error rate of 26.2%.

AlexNet’s most probable labels on eight ImageNet images. The correct label is
written under each image, and the probability assigned to each label is also
shown by the bars. Image credits to Krizhevsky et al., the original authors of
the AlexNet paper.

What Now? AlexNet is an incredibly powerful model capable of
achieving high accuracies on very challenging datasets. However,
removing any of its convolutional layers drastically degrades its
performance. AlexNet was a pioneering architecture for
object-recognition tasks and has had a huge impact on the computer
vision sector of artificial intelligence; in its wake, deep CNNs
became the default approach for image tasks.

As a milestone in making deep learning more widely applicable,
AlexNet can also be credited with bringing deep learning to adjacent
fields such as natural language processing and medical image analysis.

AlexNet is a convolutional neural network (CNN) architecture that was introduced by Alex Krizhevsky, Ilya
Sutskever, and Geoffrey Hinton in 2012. It gained significant attention and marked a breakthrough in the
field of computer vision, particularly in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
competition.

Here's an explanation of the AlexNet architecture:

1. Input Layer: The input layer accepts RGB images of size 227x227 pixels.

2. Convolutional Layers: AlexNet begins with five convolutional layers. The first convolutional layer
applies 96 filters of size 11x11 with a stride of 4. The subsequent convolutional layers use smaller
filters: 256 filters of size 5x5 in the second layer, 384 filters of size 3x3 in the third and fourth
layers, and 256 filters of size 3x3 in the fifth layer. All convolutional layers use the Rectified Linear
Unit (ReLU) activation function.

3. Max Pooling Layers: Max pooling layers follow the first, second, and fifth convolutional layers. Each
performs overlapping 3x3 pooling with a stride of 2, reducing the spatial dimensions of the feature maps.

4. Local Response Normalization (LRN) Layer: Following the first and second convolutional layers (before
the corresponding pooling), an LRN layer is applied to enhance the model's response to specific patterns.
It normalizes the responses across neighboring feature maps.

5. Fully Connected Layers: After the convolutional and pooling layers, there are three fully connected
layers. The first fully connected layer consists of 4096 neurons, followed by a second fully connected
layer with 4096 neurons. Both these layers use the ReLU activation function. The final fully connected
layer, also known as the output layer, consists of 1000 neurons representing the 1000 classes in the
ImageNet dataset. It employs the softmax activation function to produce the probability distribution
over the classes.

6. Dropout: Dropout regularization is applied to the first and second fully connected layers with a
dropout rate of 0.5. It helps prevent overfitting by randomly dropping out units during training.
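
For concreteness, here is a hedged Keras sketch that follows the description above. It is a single-GPU
approximation: local response normalization is omitted for brevity, as many modern re-implementations do.

# A hedged Keras sketch of an AlexNet-style network (LRN omitted).
import tensorflow as tf
from tensorflow.keras import layers, models

alexnet = models.Sequential([
    layers.Conv2D(96, 11, strides=4, activation="relu",
                  input_shape=(227, 227, 3)),                  # conv1
    layers.MaxPooling2D(pool_size=3, strides=2),               # overlapping pooling
    layers.Conv2D(256, 5, padding="same", activation="relu"),  # conv2
    layers.MaxPooling2D(pool_size=3, strides=2),
    layers.Conv2D(384, 3, padding="same", activation="relu"),  # conv3
    layers.Conv2D(384, 3, padding="same", activation="relu"),  # conv4
    layers.Conv2D(256, 3, padding="same", activation="relu"),  # conv5
    layers.MaxPooling2D(pool_size=3, strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),                                       # dropout on FC layers
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1000, activation="softmax"),                  # 1000 ImageNet classes
])
alexnet.summary()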

The AlexNet architecture was trained on the ImageNet dataset, which consists of millions of labeled
images across 1000 different classes. It utilized techniques such as data augmentation, dropout, and GPU
acceleration for efficient training. AlexNet significantly outperformed previous models in the ILSVRC
competition, demonstrating the power of deep convolutional neural networks for image classification
tasks.

Since its introduction, AlexNet has inspired numerous advancements in deep learning and CNN
architectures, setting the stage for subsequent models such as VGGNet, GoogLeNet, and ResNet.
https://medium.com/@mygreatlearning/everything-you-need-to-know-about-vgg16-7315defb5918

VGG-16 (Visual Geometry Group 16) is a convolutional neural network (CNN) architecture that was
developed by the Visual Geometry Group at the University of Oxford. It was introduced by Karen
Simonyan and Andrew Zisserman in 2014. VGG-16 is known for its depth and simplicity and has been
influential in the field of computer vision.

Here's an explanation of the VGG-16 architecture:

1. Input Layer: The input layer accepts RGB images of size 224x224 pixels.

2. Convolutional Layers: VGG-16 consists of 13 convolutional layers, all with 3x3 filters and a stride of 1.
The first block has two convolutional layers with 64 filters each; the subsequent blocks keep the same
filter size but double the number of filters: two layers with 128 filters, three with 256 filters, three
with 512 filters, and finally three more with 512 filters.

3. Max Pooling Layers: After each set of two or three convolutional layers, there is a max pooling layer
that performs 2x2 pooling with a stride of 2, reducing the spatial dimensions of the feature maps.

4. Fully Connected Layers: After the convolutional and pooling layers, VGG-16 has three fully connected
layers. The first two fully connected layers consist of 4096 neurons each, while the last fully connected
layer, also known as the output layer, consists of the number of neurons corresponding to the specific
classification task.

5. ReLU Activation: ReLU (Rectified Linear Unit) activation is used after each convolutional and fully
connected layer. It introduces non-linearity into the network, allowing it to learn complex patterns and
representations.

VGG-16 has a total of about 138 million trainable parameters, making it a deep and computationally
intensive architecture. It is known for its homogeneous structure, using small 3x3 filters and 2x2 max
pooling layers throughout the network. Stacking small filters gives the receptive field of larger filters
with fewer parameters and additional non-linearities, which helps the network capture fine-grained details.

VGG-16 has been widely used as a benchmark architecture for various computer vision tasks, including
image classification, object detection, and image segmentation. It has achieved notable performance on
the ImageNet dataset and has influenced subsequent CNN architectures, inspiring models like VGG-19,
ResNet, and DenseNet.
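
Since VGG-16 ships pre-trained with Keras, a common pattern is to reuse its convolutional base as a
feature extractor and attach a new classifier head. The sketch below assumes this transfer-learning
setup; the 256-unit head and the 5-class output are illustrative placeholders.

# Hedged sketch: VGG-16 as a frozen feature extractor with a new head.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # freeze the 13 convolutional layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(5, activation="softmax"),  # replace 5 with your number of classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])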

After the first CNN-based architecture (AlexNet) won the ImageNet
2012 competition, every subsequent winning architecture used more
layers in a deep neural network to reduce the error rate. This
works up to a point, but as we keep increasing the number of
layers we run into a common deep learning problem known as the
vanishing/exploding gradient, in which gradients become close to
zero or excessively large. As a result, beyond a certain depth the
training and test error rates actually increase.
Comparison of 20-layer vs 56-layer architecture

In the above plot, we can observe that a 56-layer CNN gives a
higher error rate on both the training and testing datasets than a
20-layer CNN architecture. After analyzing the error rates further,
the authors concluded that this is caused by vanishing/exploding
gradients.
ResNet, proposed in 2015 by researchers at Microsoft Research,
introduced a new architecture called the Residual Network.
Residual Network: In order to solve the problem of the
vanishing/exploding gradient, this architecture introduced the
concept of residual blocks. In this network, we use a technique
called skip connections. A skip connection connects the
activations of a layer to later layers by skipping some layers in
between; this forms a residual block. ResNets are made by stacking
these residual blocks together. The approach behind this network
is that instead of having the layers learn the underlying mapping
directly, we allow the network to fit a residual mapping. So,
instead of the initial mapping H(x), we let the network fit
F(x) := H(x) - x, which gives H(x) = F(x) + x.
Skip (Shortcut) connection

The advantage of adding this type of skip connection is that if
any layer hurts the performance of the architecture, it can
effectively be skipped by regularization. This makes it possible
to train very deep neural networks without the problems caused by
vanishing/exploding gradients. The authors of the paper
experimented with networks of 100 to 1,000 layers on the CIFAR-10
dataset.
There is a similar approach called "highway networks"; these
networks also use skip connections. Similar to LSTMs, their skip
connections use parametric gates that determine how much
information passes through the skip connection. However, this
architecture has not provided better accuracy than the ResNet
architecture.
Over the last few years, there has been a series of breakthroughs
in the field of computer vision. Especially with the introduction
of deep convolutional neural networks, we are getting
state-of-the-art results on problems such as image classification
and image recognition. Over the years, researchers have tended to
make deeper neural networks (adding more layers) to solve such
complex tasks and to improve classification/recognition accuracy.
But it has been seen that as we keep adding layers to a neural
network, it becomes difficult to train, and the accuracy starts
saturating and then degrades. Here ResNet comes to the rescue and
helps solve this problem. In this article, we shall learn more
about ResNet and its architecture.

1. What is ResNet
 Need for ResNet
 Residual Block
 How ResNet helps
2. ResNet architecture
3. Using ResNet with Keras
 ResNet 50
What is ResNet?
ResNet, short for Residual Network, is a specific type of neural
network that was introduced in 2015 by Kaiming He, Xiangyu Zhang,
Shaoqing Ren and Jian Sun in their paper "Deep Residual Learning
for Image Recognition". The ResNet models were extremely
successful, as you can guess from the following:

 Won 1st place in the ILSVRC 2015 classification competition
with a top-5 error rate of 3.57% (an ensemble model).
 Won 1st place in the ILSVRC and COCO 2015 competitions in
ImageNet detection, ImageNet localization, COCO detection and
COCO segmentation.
 Replacing VGG-16 layers in Faster R-CNN with ResNet-101 gave
relative improvements of 28%.
 Efficiently trained networks with 100 layers, and even 1,000
layers.
Need for ResNet

Mostly, in order to solve a complex problem, we stack additional
layers in a deep neural network, which results in improved
accuracy and performance. The intuition behind adding more layers
is that these layers progressively learn more complex features.
For example, in the case of recognizing images, the first layer
may learn to detect edges, the second layer may learn to identify
textures, and similarly the third layer can learn to detect
objects, and so on. But it has been found that there is a maximum
threshold for depth with the traditional convolutional neural
network model. Here is a plot that shows the error percentage on
training and testing data for a 20-layer network and a 56-layer
network.

We can see that the error% for the 56-layer network is higher than
that of the 20-layer network on both the training and the testing
data. This suggests that adding more layers on top of a network
can degrade its performance. This could be blamed on the
optimization function, the initialization of the network and, more
importantly, the vanishing gradient problem. You might be thinking
that it could be a result of overfitting too, but here the error%
of the 56-layer network is worse on both training and testing
data, which does not happen when a model is overfitting.

Residual Block

This problem of training very deep networks has been alleviated
with the introduction of ResNet, or residual networks, and these
ResNets are made up of residual blocks.

The very first thing we notice to be different is that there is a
direct connection which skips some layers (the number may vary in
different models) in between. This connection is called the 'skip
connection' and is the core of residual blocks. Due to this skip
connection, the output of the layer changes. Without the skip
connection, the input 'x' gets multiplied by the weights of the
layer, followed by adding a bias term.

Next, this term goes through the activation function f() and we
get our output as H(x):

H(x) = f(wx + b), or simply H(x) = f(x)

Now, with the introduction of the skip connection, the output is
changed to

H(x) = f(x) + x

There is a slight problem with this approach when the dimensions
of the input differ from those of the output, which can happen
with convolutional and pooling layers. In this case, when the
dimensions of f(x) are different from x, we can take one of two
approaches (see the sketch after this list):

 The skip connection is padded with extra zero entries to
increase its dimensions.
 The projection method is used to match the dimensions, which is
done by adding a 1×1 convolutional layer to the input. In such a
case, the output is:
H(x) = f(x) + w1·x

Here we add an additional parameter w1, whereas no additional
parameter is added when using the first approach.
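
Here is a hedged Keras sketch of a residual block with the two shortcut
options just described: an identity skip when the shapes match, and a 1×1
"projection" convolution when they do not. Batch normalization, which the
original ResNet uses, is included; the input shape and filter counts below
are illustrative.

# Hedged sketch of a residual block with identity/projection shortcuts.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    shortcut = x

    # Main path: two 3x3 convolutions (this is f(x) in the notation above).
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)

    # Projection shortcut, H(x) = f(x) + w1·x, when dimensions differ.
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride,
                                 padding="same")(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)

    # H(x) = f(x) + x
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, filters=128, stride=2)  # projection shortcut used here
model = tf.keras.Model(inputs, outputs)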

How ResNet helps

The skip connections in ResNet address the problem of vanishing
gradients in deep neural networks by providing an alternate
shortcut path for the gradient to flow through. These connections
also help by allowing the model to learn identity functions, which
ensures that a higher layer will perform at least as well as a
lower layer, and not worse. Let me explain this further.

Say we have a shallow network and a deep network that map an
input 'x' to output 'y' using the function H(x). We want the deep
network to perform at least as well as the shallow network and not
degrade the performance, as we saw in the case of plain neural
networks (without residual blocks). One way of achieving this is
if the additional layers in the deep network learn the identity
function, so that their output equals their input, which prevents
them from degrading the performance even with extra layers.
It has been seen that residual blocks make it exceptionally easy
for layers to learn identity functions. This is evident from the
formulas above. In plain networks the output is

H(x) = f(x),

so to learn an identity function, f(x) must equal x, which is
harder to attain. In the case of ResNet, the output is

H(x) = f(x) + x,

so if f(x) = 0, then H(x) = x.

All we need is to make f(x) = 0, which is easier, and we get x as
the output, which is also our input.

In the best-case scenario, the additional layers of the deep
neural network can approximate the mapping of 'x' to output 'y'
better than its shallower counterpart and reduce the error by a
significant margin. Thus, we expect ResNet to perform as well as
or better than plain deep neural networks.

Using ResNet has significantly enhanced the performance of neural
networks with more layers; here is the plot of error% comparing it
with neural networks with plain layers.

Clearly, the difference is huge in the networks with 34 layers,
where ResNet-34 has a much lower error% compared to plain-34. We
can also see that the error% for plain-18 and ResNet-18 is almost
the same.
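
The article outline above mentions using ResNet with Keras. As a minimal
sketch, the pre-trained ResNet-50 can be loaded directly from
keras.applications and used to classify a single image; the file name
"elephant.jpg" below is just a placeholder.

# Hedged sketch: classify one image with pre-trained ResNet-50.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import (ResNet50, preprocess_input,
                                                    decode_predictions)
from tensorflow.keras.preprocessing import image

model = ResNet50(weights="imagenet")

img = image.load_img("elephant.jpg", target_size=(224, 224))  # placeholder path
x = image.img_to_array(img)
x = preprocess_input(np.expand_dims(x, axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])   # top-3 ImageNet labels with probabilities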
There are several popular deep learning frameworks available that provide comprehensive tools and
libraries for building and training deep neural networks. These frameworks offer efficient
implementations of various deep learning algorithms and provide high-level abstractions that simplify
the development process. Here are some of the widely used deep learning frameworks:

1. TensorFlow: TensorFlow, developed by Google, is one of the most popular deep learning frameworks.
It provides a flexible and comprehensive ecosystem for building and deploying machine learning models.
TensorFlow supports both high-level APIs (such as Keras) for rapid prototyping and low-level APIs for
advanced customization. It offers excellent support for distributed computing and deployment on
various platforms, including CPUs, GPUs, and TPUs.

2. PyTorch: PyTorch is an open-source deep learning framework developed by Facebook's AI Research (FAIR)
lab. It is known for its dynamic computational graph, which allows for more intuitive and flexible model
development. PyTorch provides an easy-to-use API, supports dynamic neural networks, and offers seamless
integration with Python scientific libraries. It has gained popularity for its developer-friendly
interface and extensive community support.
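
As a tiny illustration of the dynamic ("define-by-run") graph mentioned above, the sketch below builds
the computation graph as ordinary Python code executes, so control flow can depend on the data; the
tensor sizes are arbitrary.

# Toy sketch of PyTorch's dynamic graph and autograd.
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2
while y.norm() < 10:          # data-dependent control flow, recorded on the fly
    y = y * 2
loss = y.sum()
loss.backward()               # gradients flow through whatever graph was built
print(x.grad)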

3. Keras: Keras is a high-level deep learning framework that can run on top of TensorFlow, Theano, or
Microsoft Cognitive Toolkit (CNTK). It provides a user-friendly and intuitive API for building and training
neural networks. Keras allows rapid prototyping and supports both convolutional and recurrent neural
networks. With its focus on simplicity and ease of use, Keras is a popular choice for beginners in deep
learning.

4. Caffe: Caffe (Convolutional Architecture for Fast Feature Embedding) is a deep learning framework
developed by Berkeley AI Research (BAIR). It emphasizes speed and efficiency, particularly in computer
vision tasks. Caffe has a declarative model definition syntax and a strong focus on convolutional neural
networks. It offers a large collection of pre-trained models, making it useful for transfer learning and
feature extraction.

5. MXNet: MXNet is a deep learning framework that provides flexible and efficient tools for building
neural networks. It supports both imperative and symbolic programming models and offers a wide range
of language bindings, including Python, R, and Julia. MXNet emphasizes scalability and distributed
computing, making it suitable for large-scale deep learning applications.

6. Theano: Theano is a deep learning framework that allows developers to define, optimize, and evaluate
mathematical expressions efficiently. It offers symbolic computation capabilities and efficient GPU
utilization. While Theano was widely used in the past, its active development has ended, and frameworks
like TensorFlow and PyTorch have taken its place in recent years.

These are just a few examples of deep learning frameworks, and there are several other frameworks
available, such as Microsoft Cognitive Toolkit (CNTK), Deeplearning4j, and Chainer. The choice of
framework depends on factors such as the specific requirements of your project, the level of flexibility
needed, the size of the community and available resources, and your familiarity with the programming
language and interface.

https://marutitech.com/top-8-deep-learning-frameworks/
