
Convolutional Neural Networks (CNNs / ConvNets)

Convolutional Neural Networks are very similar to ordinary Neural Networks. They are
made up of neurons that have learnable weights and biases. Each neuron receives some
inputs, performs a dot product, and optionally follows it with a non-linearity. The whole
network still expresses a single differentiable score function: from the raw image pixels on
one end to class scores at the other. And they still have a loss function (e.g., SVM/Softmax)
on the last (fully-connected) layer and all the tips/tricks we developed for learning regular
Neural Networks still apply.
ConvNet architectures make the explicit assumption that the inputs are images, which
allows us to encode certain properties into the architecture. These then make the forward
function more efficient to implement and vastly reduce the number of parameters in the
network. A simple ConvNet is a sequence of layers, and every layer of a ConvNet
transforms one volume of activations to another through a differentiable function. We use
three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling
Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks). We will
stack these layers to form a full ConvNet.

Architecture

1. INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of
width 32, height 32, and three color channels R, G, B.
2. CONV layer will compute the output of neurons that are connected to local regions
in the input, each computing a dot product between their weights and a small region they
are connected to in the input volume. This may result in volume such as [32x32x12] if we
decided to use 12 filters.
3. RELU layer will apply an element-wise activation function, such as the max(0,x)
thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]).
4. POOL layer will perform a downsampling operation along the spatial dimensions
(width, height), resulting in volume such as [16x16x12].
5. FC (i.e. fully-connected) layer will compute the class scores, resulting in a volume of
size [1x1x10], where each of the 10 numbers corresponds to a class score, such as among
the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies,
each neuron in this layer will be connected to all the numbers in the previous volume.

In this way, ConvNets transform the original image layer by layer from the original pixel
values to the final class scores. Note that some layers contain parameters and others don't.
In particular, the CONV/FC layers perform transformations that are a function of not only
the activations in the input volume, but also of the parameters (the weights and biases of
the neurons). On the other hand, the RELU/POOL layers will implement a fixed function. The
parameters in the CONV/FC layers will be trained with gradient descent so that the class
scores that the ConvNet computes are consistent with the labels in the training set for each
image.
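
To make these volume transformations concrete, here is a minimal sketch of the INPUT → CONV → RELU → POOL → FC pipeline in tf.keras (the framework choice and the 3×3 kernel size are assumptions; the source does not specify them):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),          # INPUT: [32x32x3] raw pixels
    tf.keras.layers.Conv2D(12, kernel_size=3, padding="same",
                           activation="relu"),  # CONV + RELU: [32x32x12]
    tf.keras.layers.MaxPooling2D(pool_size=2),  # POOL: [16x16x12]
    tf.keras.layers.Flatten(),                  # flatten the volume to a vector
    tf.keras.layers.Dense(10),                  # FC: 10 class scores
])
model.summary()  # prints the volume shape after each layer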

1. Flattening

Flattening converts the data into a 1-dimensional array for input to the next layer.
We flatten the output of the convolutional layers to create a single long feature vector,
which is then connected to the final classification model, called a fully-connected layer. In
other words, we put all the pixel data in one line and make connections with the final layer.

2. Subsampling

Subsampling is a statistical method for selecting a subset of a larger data set. The subset
should be representative of the larger data set. If a data set is too large to fit in memory, a
subsample representing the larger data set must be chosen.

Subsampling accelerates the training of machine learning models: because training a model
on a large data set takes time, a subsample of the data can be used instead. This is
especially useful when the data set is too large to fit in memory. Subsampling can also help
improve the accuracy of machine learning models. It can be done by:

- selecting a representative sample from a larger population

- choosing a random subset of data from a larger dataset

- taking a smaller sample from a larger group in order to make inferences about the larger
group
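
As a minimal sketch of the second option, random subsampling can be done in a few lines of NumPy (the array contents and sample size here are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(seed=0)   # reproducible random generator
data = np.arange(100_000)             # stand-in for a large data set
sample_size = 1_000                   # illustrative subsample size

# Draw indices without replacement so no example is picked twice.
indices = rng.choice(len(data), size=sample_size, replace=False)
subsample = data[indices]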

3. Padding

Padding plays a vital role in building a CNN. After a convolution operation, the original
size of the image shrinks. In an image classification task there are multiple convolution
layers, so the original image shrinks after every step, which we do not want.
Secondly, when the kernel moves over the original image, it passes over the middle pixels
more times than the edge pixels, so the edges contribute less to the output.
To overcome these problems, a concept named padding was introduced. Padding is an
additional border of pixels added around an image so that the convolution preserves the
size of the original picture.
In convolutional neural networks, padding refers to the number of pixels added to an image
when it is processed by the kernel of a CNN. The padding adds elements to the input matrix
before any convolutional filter is applied, and thus it helps prevent information loss,
particularly from the edges of the image. It helps keep important aspects of the image
that would otherwise get lost, and thus reveals useful relationships that may play
significant roles in the output.

Zero padding means every pixel value that you add is zero. If zero padding = 1, there will be
a one-pixel-thick border around the original image with pixel value 0.
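
A minimal NumPy sketch of zero padding with padding = 1 (the 2×2 image values are illustrative):

import numpy as np

image = np.array([[1, 2],
                  [3, 4]])

# Zero padding = 1: a one-pixel-thick border of zeros around the image.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded)
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]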
4. Stride

Stride is a component of convolutional neural networks, i.e. neural networks tuned for
processing image and video data. Stride is a parameter of the neural network's filter
that sets the amount of movement over the image or video. For example, if a neural
network's stride is set to 1, the filter will move one pixel, or unit, at a time. Because the
stride affects the size of the encoded output volume, it is set to a whole integer, rather
than a fraction or decimal.

As the filter slides over the input matrix, the number of pixels it shifts at each step is
known as the stride. When the stride is 1, we move the filter 1 pixel at a time. Similarly,
when the stride is 2, we move the filter 2 pixels at a time, and so on. Strides are essential
because they control how the filter convolves over the input: they determine which
positions of the image the filter visits, and hence which features might be skipped. They
denote the number of steps we move in each convolution. The following figure shows how
the convolution works at different strides.

[Figure: convolution with stride = 1 and stride = 2]
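
The stride enters the standard output-size calculation. For an n × n input, an f × f filter, padding p, and stride s, the output width is (n + 2p − f)/s + 1, rounded down (this formula is standard, though not stated in the source). For example, a 7 × 7 input with a 3 × 3 filter, no padding, and stride 2 gives (7 + 0 − 3)/2 + 1 = 3, i.e. a 3 × 3 output; with stride 1 the output would be 5 × 5.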


5. Convolution Layer

The convolution layers are the initial layers that pull features out of the image. They
maintain the relationship between pixels by learning image features using small squares of
input data. Convolution is a mathematical operation that takes two inputs, an image matrix
and a kernel or filter.

If the image matrix has dimensions h × w × d and the filter has dimensions fh × fw × d,
the output is calculated as (h − fh + 1) × (w − fw + 1) × 1.
Now, let us take an example: a 5×5 image matrix whose pixel values are 0 and 1, and a 3×3
filter matrix.

At each position, the filter is placed over a 3×3 region of the image, the overlapping values
are multiplied element-wise and summed, and the result becomes one entry of the output.

Sliding the 3×3 filter over the 5×5 image in this way produces a 3×3 output matrix, which is
called the "feature map". Convolving the image with different filter values can produce
effects such as a blurred or sharpened image.
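
A minimal NumPy sketch of this computation, using an illustrative 5×5 binary image and 3×3 filter (the specific values are assumptions, since the original figures are not reproduced here):

import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

fh, fw = kernel.shape
h, w = image.shape
feature_map = np.zeros((h - fh + 1, w - fw + 1))

# Slide the filter over the image; each output entry is an
# element-wise multiply-and-sum over a 3x3 region.
for i in range(h - fh + 1):
    for j in range(w - fw + 1):
        feature_map[i, j] = np.sum(image[i:i+fh, j:j+fw] * kernel)

print(feature_map)  # a 3x3 feature map, per (h - fh + 1) x (w - fw + 1)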

6. Pooling Layer

The pooling layer is another building block of a CNN. If the image is too large, it shrinks
the image by reducing the number of parameters. When the picture is shrunk, the pixel
density is also reduced, and a downscaled image is obtained from the previous layers.
Basically, its function is to progressively reduce the spatial size of the image to reduce the
network complexity and computational cost. Spatial pooling, also known as downsampling
or subsampling, reduces the dimensionality of each feature map but retains the essential
features. Pooling is added after the nonlinearity is applied to the feature maps. There are
three types of spatial pooling:

1. Max Pooling

Max pooling is a rule that takes the maximum value of a region, helping the network carry
forward the most crucial features of the image. It is a sample-based discretization process.
Its primary objective is to downscale an input by reducing its dimensionality, making
assumptions about the features contained in the sub-regions that are discarded.

2. Average Pooling

It is different from max pooling: it retains information about the less essential features. It
simply downscales the input by dividing the input matrix into rectangular regions and
calculating the average value of each region.
3. Sum Pooling

It is like max pooling, but instead of taking the maximum value, we calculate the sum of
the values in each sub-region.
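
A minimal NumPy sketch contrasting the three pooling types on a 4×4 input with non-overlapping 2×2 regions (the input values are illustrative):

import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 8, 3],
              [1, 4, 2, 9]])

# Split the 4x4 input into 2x2 regions:
# axes = (region row, row in region, region col, col in region)
regions = x.reshape(2, 2, 2, 2)

max_pool = regions.max(axis=(1, 3))   # [[6 4] [7 9]]
avg_pool = regions.mean(axis=(1, 3))  # [[3.75 2.25] [3.5 5.5]]
sum_pool = regions.sum(axis=(1, 3))   # [[15 9] [14 22]]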

Fully Connected Layer

In the layer we call the FC layer, we flatten our matrix into a vector and feed it into a
fully connected layer, as in an ordinary neural network.

7. 1x1 convolutions

Convolutional layers are lighter than fully connected ones. The 1x1 convolution operation
involves convolving the input with filters of size 1x1, usually with zero padding and a stride
of 1. A 1x1 convolution kernel acts as an embedding: it reduces the size of the input vector
along the channel dimension, making the representation more compact and meaningful.
The 1x1 convolutional layer is also called a pointwise convolution. It offers a channel-wise
pooling, often called feature map pooling or a projection layer. A 1x1 convolution has some
special properties: it can be used for dimensionality reduction, efficient low-dimensional
embeddings, and applying non-linearity after convolutions.
A 1x1 convolution is effectively used to:

1. Reduce or augment dimensionality
2. Reduce the computational load by reducing the number of parameters
3. Add additional non-linearity to the network
4. Create a deeper network through the "bottleneck" layer
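
A minimal Keras sketch of a 1x1 convolution used as a projection layer, reducing 256 channels to 64 (the channel counts and spatial size are illustrative assumptions):

import tensorflow as tf

# A feature map of height 28, width 28, and 256 channels.
x = tf.keras.Input(shape=(28, 28, 256))

# Pointwise (1x1) convolution: projects each pixel's 256-channel
# vector down to 64 channels, adding a non-linearity.
y = tf.keras.layers.Conv2D(filters=64, kernel_size=1,
                           strides=1, activation="relu")(x)

model = tf.keras.Model(x, y)
model.summary()  # output shape: (None, 28, 28, 64)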

8. Inception Network

An Inception network is a deep neural network that consists of repeating blocks, where the
output of one block acts as the input to the next. Each block is called an Inception
block or module. Inception modules are used in convolutional neural networks to allow for
more efficient computation and deeper networks through dimensionality reduction with
stacked 1×1 convolutions. The modules were designed to solve the problem of
computational expense, as well as overfitting, among other issues. The solution, in short, is
to take multiple kernel filter sizes within the CNN and, rather than stacking them
sequentially, arrange them to operate on the same level.
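
A minimal Keras sketch of one such block, with parallel 1x1, 3x3, and 5x5 branches (the larger filters preceded by 1x1 reductions) plus a pooling branch, concatenated along the channel axis. The filter counts are illustrative assumptions, not the published GoogLeNet values:

import tensorflow as tf
from tensorflow.keras import layers

def inception_block(x):
    # Branch 1: plain 1x1 convolution.
    b1 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)
    # Branch 2: 1x1 reduction, then 3x3 convolution.
    b2 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(32, 3, padding="same", activation="relu")(b2)
    # Branch 3: 1x1 reduction, then 5x5 convolution.
    b3 = layers.Conv2D(8, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(16, 5, padding="same", activation="relu")(b3)
    # Branch 4: 3x3 max pooling, then 1x1 projection.
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(16, 1, padding="same", activation="relu")(b4)
    # All branches operate "on the same level" and are concatenated.
    return layers.Concatenate()([b1, b2, b3, b4])

inputs = tf.keras.Input(shape=(28, 28, 64))
outputs = inception_block(inputs)
model = tf.keras.Model(inputs, outputs)  # output: (None, 28, 28, 80)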
9. Input Channels

A convolutional layer receives input, transforms the input in some way, and then outputs
the transformed input to the next layer. The inputs to convolutional layers are called input
channels, and the outputs are called output channels.
Convolutional layers typically involve more than one channel, where each channel of a layer
is associated with the channels of the next layer and vice versa. The basic structure of a
CNN model is composed of convolutional layers and pooling layers:

A convolution layer receives an input image of width w and height h and produces an
output that consists of an activation map. The filter for such a convolution is a tensor of
dimensions f × f × c, where f is the filter size, which is normally 3, 5, 7, or 11, and c is the
number of channels. The depth c of the convolution filters must always match the number
of channels of the input. In typical CNN designs, a greater channel depth goes hand in hand
with a reduced spatial resolution as we move deeper into the network.
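
As a small illustration of this channel bookkeeping (the shapes here are illustrative assumptions), the weight tensor of a Keras convolutional layer has exactly the form f × f × c_in × c_out:

import tensorflow as tf

# Input: a batch of RGB images, so the number of input channels c is 3.
x = tf.keras.Input(shape=(32, 32, 3))
conv = tf.keras.layers.Conv2D(filters=12, kernel_size=5)  # 12 output channels
y = conv(x)

print(conv.kernel.shape)  # (5, 5, 3, 12): f x f x c_in x c_out
print(y.shape)            # (None, 28, 28, 12): 12 output channels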

10. Transfer learning

Transfer learning is a technique in machine learning where a model trained on one task is
used as the starting point for a model on a second task. This can be useful when the
second task is similar to the first, or when there is limited data available for the second
task. By using the learned features from the first task as a starting point, the model can
learn more quickly and effectively on the second task. This can also help to prevent
overfitting, as the model will have already learned general features that are likely to be
useful in the second task.
For example, knowledge gained while learning to recognize cars could be applied when
trying to recognize trucks.
A widely used strategy in transfer learning is to:
 Load the weight matrices of a pre-trained model, except for the weights of the
very last layers near the output.
 Hold those weights fixed, i.e. untrainable.
 Attach new layers suitable for the task at hand, and train the model with new data.
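
A minimal Keras sketch of this strategy, using MobileNetV2 as the pre-trained model (the base model choice, input size, and number of classes are illustrative assumptions):

import tensorflow as tf

# 1. Load a pre-trained model without its final classification layers.
base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False,
                                         weights="imagenet")

# 2. Hold the pre-trained weights fixed, i.e. untrainable.
base.trainable = False

# 3. Attach new layers suitable for the task at hand.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # e.g. 5 new classes
])

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_task_dataset, epochs=...)  # then train on the new data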

11. One-shot learning

Passport checks at the airport and signature verification: these tasks require face or
object recognition. Historically, deep learning algorithms use a lot of labeled training data
for simple tasks like identifying objects in photos or recognizing faces. But in the above
situations, we normally do not possess a large variety of photos for each person for training
the AI. Therefore, we need an ML algorithm that would be able to perform recognition with
a limited number of training examples. This is where one-shot learning comes into play.

One-shot learning is an ML-based object classification algorithm that assesses the similarity
and differences between two images. It is mainly used in computer vision.

The goal of one-shot learning is to teach the model to form its own assumptions about
similarity based on a minimal number of visuals. There can be only one image (or a very
limited number of them, in which case it is often called few-shot learning) for each class.
These examples are used to build a model that can then make predictions about further
unknown visuals.

For instance, to distinguish between apples and pears, a traditional AI model would need
thousands of images taken at various angles, with different lighting, background, etc. In
contrast, one-shot learning does not require many examples of each category. It generalizes
the information it has learned through experience with the same-type tasks by inferring
similar objects and classifying unseen objects into their respective groups.

Apart from one-shot learning, there exist other models that require just several examples
(few-shot learning) or no examples at all (zero-shot learning).

Few-shot learning is simply a variation of the one-shot learning model with several training
images available.

The goal of zero-shot learning is to categorize unknown classes without training data at all.
The learning process here is based on the metadata of the images.

Applications

One-shot learning algorithms have been used for tasks like image classification, object
detection and localization, speech recognition, and more.

The most common applications are face recognition and signature verification. Apart from
airport checks, the former can be used, for example, by law enforcement agencies to detect
terrorists in crowded places and at mass events such as sports games, concerts, and
festivals. Based on surveillance cameras’ input, AI can identify people from police databases
in the crowd. This technology is also applicable in banks and other institutions where they
need to recognize the person from their ID or a photo in their records. The same process
works for signature verification.

One-shot learning is essential for computer vision, notably for drones and self-driving cars
to recognize objects in the environment.

It can also be effectively used for detecting brain activity in brain scans.
12. Dimensionality Reduction

The number of input features, variables, or columns present in a dataset is known as its
dimensionality, and the process of reducing these features is called dimensionality reduction.

In many cases a dataset contains a huge number of input features, which makes the
predictive modeling task more complicated. Because it is very difficult to visualize or make
predictions for a training dataset with a high number of features, in such cases
dimensionality reduction techniques are required.

A dimensionality reduction technique can be defined as "a way of converting a higher-
dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar
information." These techniques are widely used in machine learning to obtain a better-
fit predictive model while solving classification and regression problems.

It is commonly used in fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.

Benefits of applying Dimensionality Reduction


 By reducing the dimensions of the features, the space required to store the dataset
is also reduced.
 Less computation and training time is required with reduced feature dimensions.
 Reduced feature dimensions help in visualizing the data quickly.
 It removes redundant features (if present) by taking care of multicollinearity.

Disadvantages of dimensionality Reduction


 Some information may be lost due to dimensionality reduction.
 In the PCA dimensionality reduction technique, the number of principal components
to retain is sometimes unknown.

Approaches to Dimensionality Reduction: there are two approaches.

Feature Selection: Feature selection is the process of selecting a subset of the
relevant features and leaving out the irrelevant features present in a dataset, in order to
build a model of high accuracy. In other words, it is a way of selecting the optimal features
from the input dataset. Three methods are used for feature selection:

1) Filter Methods: In this method, the dataset is filtered, and a subset that
contains only the relevant features is taken. Some common techniques of the
filter method are:

 Correlation
 Chi-Square Test
 ANOVA
 Information Gain, etc.

2) Wrapper Methods: The wrapper method has the same goal as the filter
method, but it uses a machine learning model for its evaluation. In this
method, some features are fed to the ML model and the performance is
evaluated. The performance decides whether to add those features or
remove them in order to increase the accuracy of the model. This method is
more accurate than the filter method but more complex to work with. Some
common techniques of wrapper methods are:

 Forward Selection
 Backward Selection
 Bi-directional Elimination

3) Embedded Methods: Embedded methods check the different training
iterations of the machine learning model and evaluate the importance of each
feature. Some common techniques of embedded methods are:

 LASSO
 Elastic Net
 Ridge Regression, etc.
Feature Extraction: Feature extraction is the process of transforming a space
containing many dimensions into a space with fewer dimensions. This approach is
useful when we want to keep the whole information but use fewer resources while
processing it. Some common feature extraction techniques are:
 Principal Component Analysis
 Linear Discriminant Analysis
 Kernel PCA
 Quadratic Discriminant Analysis
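
A minimal scikit-learn sketch of the first of these, Principal Component Analysis (the data shape and number of components are illustrative assumptions):

import numpy as np
from sklearn.decomposition import PCA

# Stand-in dataset: 200 samples with 50 features each.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 50))

# Project the 50-dimensional data down to 10 principal components.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (200, 10)
print(pca.explained_variance_ratio_.sum())  # variance retained by 10 components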

Common techniques of Dimensionality Reduction


a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder
13. A TensorFlow-based convolutional neural network

TensorFlow makes it easy to create convolutional neural networks once you understand
some of the nuances of the framework's handling of them. We are going to create a
convolutional neural network with the structure detailed below. The network will perform
MNIST digit classification.

We start with the MNIST 28×28 greyscale images of digits. We then create 32
convolutional filters / channels of size 5×5 each, plus ReLU (rectified linear unit)
activations. After this, we still have a height and width of 28 nodes. We then perform
down-sampling by applying a 2×2 max pooling operation with a stride of 2. Layer 2
consists of the same structure, but now with 64 filters / channels and another stride-2 max
pooling down-sample. We then flatten the output to get a fully connected layer with 3,136
nodes (7 × 7 × 64), followed by another hidden layer of 1,000 nodes. These layers use ReLU
activations. Finally, we use a softmax classification layer to output the 10 digit probabilities.
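
A minimal sketch of this network in tf.keras (the original tutorial may have used the lower-level TensorFlow API; this is an equivalent reconstruction from the description above):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                  # MNIST greyscale input
    tf.keras.layers.Conv2D(32, 5, padding="same",
                           activation="relu"),          # 32 filters, 5x5, ReLU
    tf.keras.layers.MaxPooling2D(2, strides=2),         # 28x28 -> 14x14
    tf.keras.layers.Conv2D(64, 5, padding="same",
                           activation="relu"),          # 64 filters, 5x5, ReLU
    tf.keras.layers.MaxPooling2D(2, strides=2),         # 14x14 -> 7x7
    tf.keras.layers.Flatten(),                          # 7*7*64 = 3136 nodes
    tf.keras.layers.Dense(1000, activation="relu"),     # hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),    # 10 digit probabilities
])
model.summary()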

14. Keras
Convolutional Neural Network Architecture
First, we will define the convolutional neural network architecture as follows:

1- The first hidden layer is a convolutional layer called Convolution2D. We will use 32
filters of size 5×5 each.
2- Then a max pooling layer with a pool size of 2×2.
3- Another convolutional layer with 64 filters of size 5×5 each.
4- Then a max pooling layer with a pool size of 2×2.
5- Next is a Flatten layer that converts the 2D matrix data to a 1D vector before
building the fully connected layers.
6- After that we will use a fully connected layer with 1024 neurons and a relu activation
function.
7- Then we will use a regularization layer called Dropout, configured to randomly
exclude 20% of the neurons in the layer in order to reduce overfitting.
8- Finally, the output layer, which has 10 neurons for the 10 classes and a softmax
activation function to output probability-like predictions for each class (see the code
sketch after this list).
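
A minimal Keras sketch of steps 1-8 (Conv2D is the current name for the Convolution2D layer mentioned above; the 28×28 greyscale input shape and the compile settings are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (5, 5), activation="relu"),  # 1- 32 filters, 5x5
    layers.MaxPooling2D(pool_size=(2, 2)),         # 2- max pooling 2x2
    layers.Conv2D(64, (5, 5), activation="relu"),  # 3- 64 filters, 5x5
    layers.MaxPooling2D(pool_size=(2, 2)),         # 4- max pooling 2x2
    layers.Flatten(),                              # 5- 2D maps -> 1D vector
    layers.Dense(1024, activation="relu"),         # 6- fully connected, 1024
    layers.Dropout(0.2),                           # 7- drop 20% of neurons
    layers.Dense(10, activation="softmax"),        # 8- 10 classes, softmax
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()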
