ON
DEVELOPED BY:
VIKAS ARORA (00255102719)
PRAKHAR GUPTA (03855102719)
2019 - 2023
ACKNOWLEDGEMENT
We would like to express our deep gratitude to our guide Ms. SHRUTI AHUJA, faculty of computer science and engineering, MVSIT, for her valuable guidance and timely suggestions during the entire duration of our dissertation work, without which this work would not have been possible. We would also like to convey our deep regards to all other faculty members of MVSIT, who have bestowed their great effort and guidance at appropriate times, without which it would have been very difficult on our part to finish this work. Finally, we would also like to thank our friends for their advice and for pointing out our mistakes, and our parents and classmates for their encouragement throughout our project period. Last but not least, we thank everyone who supported us directly or indirectly in completing this project successfully.
ABSTRACT
TABLE OF CONTENTS
Declaration
Acknowledgement
Abstract
CHAPTERS
CHAPTER 1. - INTRODUCTION
1.1 Convolutional Neural Network
1.1.1 Convolutional Layer
1.1.2 Pooling Layer
1.1.3 Fully Connected Layer
1.2 AIM & OBJECTIVE
1.3 Conceptual Framework
1.4 Method
CHAPTER 1. INTRODUCTION
Artificial Intelligence has been witnessing monumental growth in bridging the gap between the capabilities of humans and machines. Researchers and enthusiasts alike work on numerous aspects of the field to make amazing things happen. One of many such areas is Computer Vision. The agenda for this field is to enable machines to view the world as humans do, perceive it in a similar manner, and even use the knowledge for a multitude of tasks such as Image & Video Recognition, Image Analysis & Classification, Media Recreation, Recommendation Systems, Natural Language Processing, etc. The advancements in Computer Vision with Deep Learning have been constructed and perfected with time, primarily over one particular algorithm: the Convolutional Neural Network. A Convolutional Neural Network (ConvNet/CNN) can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image, and differentiate one from the other. The pre-processing required in a ConvNet is much lower as compared to other classification algorithms. While in primitive methods filters are hand-engineered, with enough training, ConvNets have the ability to learn these filters/characteristics. The architecture of a ConvNet is analogous to the connectivity pattern of neurons in the Human Brain and was inspired by the organization of the Visual Cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the Receptive Field. A collection of such fields overlaps to cover the entire visual area.
1.1 Convolutional Neural Network
Convolutional neural networks are distinguished from other neural networks by their superior performance with image, speech, or audio signal inputs. They have three main types of layers, which are:
• Convolutional layer
• Pooling layer
• Fully-connected (FC) layer
The convolutional layer is the first layer of a convolutional network. While convolutional
layers can be followed by additional convolutional layers or pooling layers, the fully-
connected layer is the final layer. With each layer, the CNN increases in its complexity,
identifying greater portions of the image. Earlier layers focus on simple features, such as
colors and edges. As the image data progresses through the layers of the CNN, it starts
to recognize larger elements or shapes of the object until it finally identifies the intended
object.
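As an illustrative sketch only (the layer sizes and the tensorflow.keras import style are assumptions, not the exact network used in this project), the three layer types can be stacked in Keras as follows:

# Illustrative sketch: the three CNN layer types stacked in order.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),  # convolutional layer
    MaxPooling2D((2, 2)),                                            # pooling layer
    Flatten(),                                                       # flatten feature maps into a vector
    Dense(1, activation='sigmoid')                                   # fully-connected output layer
])
model.summary()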
1.1.1 Convolutional Layer
The convolutional layer is the core building block of a CNN, and it is where the majority of computation occurs. It requires a few components: input data, a filter, and a feature map. Let's assume that the input is a color image, which is made up of a 3D matrix of pixels. This means that the input has three dimensions, a height, width, and depth, which correspond to RGB in an image. We also have a feature detector, also known as a kernel or a filter, which moves across the receptive fields of the image, checking whether the feature is present. This process is known as a convolution.
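A minimal NumPy sketch of this sliding dot product (the 5×5 input and the hand-made vertical-edge filter are purely illustrative; in a CNN the filter weights are learned):

import numpy as np

image = np.array([[0, 0, 1, 1, 0],
                  [0, 0, 1, 1, 0],
                  [0, 0, 1, 1, 0],
                  [0, 0, 1, 1, 0],
                  [0, 0, 1, 1, 0]], dtype=float)

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)      # responds to vertical edges

h, w = image.shape
kh, kw = kernel.shape
feature_map = np.zeros((h - kh + 1, w - kw + 1))
for i in range(feature_map.shape[0]):
    for j in range(feature_map.shape[1]):
        receptive_field = image[i:i + kh, j:j + kw]   # patch currently covered by the filter
        feature_map[i, j] = np.sum(receptive_field * kernel)

print(feature_map)   # large magnitudes mark where the vertical edge was detected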
1.1.2 Pooling Layer
The pooling layer, also known as downsampling, performs dimensionality reduction, reducing the number of parameters in the input. Like the convolutional layer, it sweeps a filter across the input, but this filter has no weights. There are two main types of pooling:
• Max pooling: As the filter moves across the input, it selects the pixel with the maximum value to send to the output array. As an aside, this approach tends to be used more often than average pooling.
• Average pooling: As the filter moves across the input, it calculates the average value within the receptive field to send to the output array.
1.1.3 Fully Connected Layer
The name of the fully-connected layer aptly describes itself. As mentioned earlier, the pixel values of the input image are not directly connected to the output layer in partially connected layers. However, in the fully-connected layer, each node in the output layer connects directly to a node in the previous layer.
1.2 AIM & OBJECTIVE
• The main aim of this project is to achieve an understanding of data such as images.
• Most large companies use this kind of deep learning at the core of their services. Facebook uses neural nets for its automatic tagging algorithms, Google for its photo search, Amazon for its product recommendations, and Instagram for its search infrastructure.
• The image you give to the system as input will be analyzed and the predicted result will be returned as the output.
1.4 Method:
Step 1: Getting the Dataset
Applications:
This project gives a general idea of how image classification can be done efficiently.
The scope of the project can be extended to various industries where there is huge scope for automation, simply by substituting a dataset relevant to the problem.
CHAPTER 2. STUDY AND ANALYSIS
The input image will be analyzed and then the output is predicted. The model that is implemented can be extended to a website or any mobile device as per the need. The Dogs vs Cats dataset can be downloaded from the Kaggle website. The dataset contains a set of images of cats and dogs. Our main aim here is for the model to learn the various distinctive features of cats and dogs. Once the training of the model is done, it will be able to distinguish images of cats from images of dogs.
5. TensorFlow Keras layers – Every neural network needs layers, and a CNN needs several of them.
A CNN processes images with the help of matrices of weights known as filters. Early filters detect low-level features like vertical and horizontal edges; through each successive layer, these are combined into progressively higher-level features.
Adaptive Moment Estimation (Adam) is a method used for computing individual learning rates for each parameter. For the loss function, we are using binary cross-entropy to compare the predicted probabilities with the true class output; it then penalizes predictions according to how far they are from the expected value.
Image augmentation applies transformations to the original images, resulting in multiple transformed copies of the same image. The images differ from each other in certain aspects because of the shifting, rotating, and flipping techniques. So, we are using the Keras ImageDataGenerator class to augment our images.
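A minimal sketch of how this compilation step might look in Keras (the tiny placeholder model and the learning rate are assumptions, not the project's exact configuration):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),   # placeholder network
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer=Adam(learning_rate=0.001),   # Adam adapts the learning rate per parameter
              loss='binary_crossentropy',            # penalizes predictions far from the true class
              metrics=['accuracy'])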
2.4 Convolution
Convolution is a linear operation involving the multiplication of a set of weights with the input. The set of weights is known as a filter or kernel. The filter is always smaller than the input data, and a dot product is computed between a filter-sized patch of the input and the filter at each position.
2.5 Activation Function
The need for an activation function is to add non-linearity into the neural network.
2.6 Pooling
The pooling operation provides spatial invariance, making the system capable of recognizing an object with a somewhat varied appearance. It involves sliding a 2D filter over each channel of the feature map and summarising the features lying in the region covered by the filter.
So, pooling basically helps reduce the number of parameters and computations present
in the network. It progressively reduces the spatial size of the network and thus controls
overfitting. There are two types of operations in this layer: average pooling and max pooling. Here, we are using max pooling, which, as its name suggests, takes only the maximum value from each pool. This is done with the help of a filter sliding through the input; at each stride, the maximum value is kept and the rest are dropped.
The pooling layer does not modify the depth of the network unlike in the convolution layer.
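A small sketch showing this behaviour (shapes only; the random input is illustrative): a 2×2 max-pooling with stride 2 halves the height and width but leaves the depth unchanged.

import numpy as np
import tensorflow as tf

x = np.random.rand(1, 4, 4, 3).astype('float32')               # one 4x4 feature map with 3 channels
pooled = tf.keras.layers.MaxPooling2D((2, 2), strides=2)(x)    # keep the maximum of each 2x2 pool
print(x.shape, '->', pooled.shape)                             # (1, 4, 4, 3) -> (1, 2, 2, 3)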
2.7 Fully Connected
The flattened output from the final pooling layer is the input to the fully connected layer.
The neurons present in the fully connected layer detect a certain feature and preserve its value, then communicate the value to both the dog and cat classes, which then check the feature and decide whether it is relevant to them.
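A minimal sketch of this fully connected stage (the layer sizes are assumptions): the pooled feature maps are flattened into a vector and passed through dense layers, ending in a single sigmoid unit whose output close to 1 is read as dog and close to 0 as cat.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense

classifier_head = Sequential([
    Flatten(input_shape=(25, 25, 64)),   # e.g. 25x25x64 feature maps become a 40,000-value vector
    Dense(128, activation='relu'),       # combines the detected features
    Dense(1, activation='sigmoid'),      # final dog/cat probability
])
classifier_head.summary()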
Although the problem sounds simple, it was only effectively addressed in the last
few years using deep learning convolutional neural networks. While the dataset is
effectively solved, it can be used as the basis for learning and practicing how to
develop, evaluate, and use convolutional deep learning neural networks for image
classification from scratch.
This includes how to develop a robust test harness for estimating the performance
of the model, how to explore improvements to the model, and how to save the
model and later load it to make predictions on new data.
The dogs vs cats dataset refers to a dataset used for a Kaggle machine learning
competition held in 2013.
The photos are labeled by their filename, with the word “dog” or “cat”. The file naming convention is as follows:
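In the original Kaggle training set the images are named with the class word followed by an index, along the lines of:

cat.0.jpg
cat.1.jpg
cat.2.jpg
...
dog.0.jpg
dog.1.jpg
dog.2.jpg
...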
CHAPTER 3. EXPERIMENTAL ANALYSIS AND RESULTS
For example, let’s load and plot the first nine photos of dogs in a single figure.
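A sketch of such a plotting script, assuming the training images have been unzipped into a local train/ folder with the Kaggle file names:

from matplotlib import pyplot
from matplotlib.image import imread

folder = 'train/'
for i in range(9):
    pyplot.subplot(330 + 1 + i)                    # position in a 3x3 grid of subplots
    filename = folder + 'dog.' + str(i) + '.jpg'   # e.g. train/dog.0.jpg
    image = imread(filename)
    pyplot.imshow(image)
pyplot.show()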
Running the example creates a figure showing the first nine photos of dogs in the dataset.
We can see that some photos are landscape format, some are portrait format, and some
are square.
We can update the example and change it to plot cat photos instead; the complete
example is listed below.
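A sketch of the cat version, continuing from the imports and folder variable of the previous listing; only the filename prefix changes:

for i in range(9):
    pyplot.subplot(330 + 1 + i)
    image = imread(folder + 'cat.' + str(i) + '.jpg')   # e.g. train/cat.0.jpg
    pyplot.imshow(image)
pyplot.show()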
Again, we can see that the photos are all different sizes.
We can also see a photo where the cat is barely visible (bottom left corner) and another
that has two cats (lower right corner). This suggests that any classifier fit on this problem
will have to be robust.
A baseline model will establish a minimum model performance to which all of our other
models can be compared, as well as a model architecture that we can use as the basis
of study and improvement.
The architecture involves stacking convolutional layers with small 3×3 filters followed by
a max pooling layer. Together, these layers form a block, and these blocks can be
repeated where the number of filters in each block is increased with the depth of the
network, such as 32, 64, 128, 256 for the first four blocks of the model. Padding is used on the convolutional layers to ensure the height and width of the output feature maps match the inputs.
We can explore this architecture on the dogs vs cats problem and compare a model with
this architecture with 1, 2, and 3 blocks.
We can create a function named define_model() that will define a model and return it
ready to be fit on the dataset. This function can then be customized to define different
baseline models, e.g. versions of the model with 1, 2, or 3 VGG style blocks.
The model will be fit with stochastic gradient descent and we will start with a conservative
learning rate of 0.001 and a momentum of 0.9.
The problem is a binary classification task, requiring the prediction of one value of either
0 or 1. An output layer with 1 node and a sigmoid activation will be used and the model
will be optimized using the binary cross-entropy loss function.
3.3.1 One Block VGG Model
The one-block VGG model has a single convolutional layer with 32 filters followed by a max pooling layer. The define_model() function for this model was defined in the previous section but is provided again below for completeness.
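A sketch of such a define_model() function, consistent with the description above (the 200×200 input size and the he_uniform initializer are assumptions, not stated in this report):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import SGD

def define_model():
    model = Sequential()
    # one VGG-style block: 32 3x3 filters with 'same' padding, then 2x2 max pooling
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform',
                     padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    # classifier head
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='sigmoid'))
    # conservative SGD with momentum and binary cross-entropy loss, as described above
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model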
3.3.2 Two Block VGG Model
The two-block VGG model extends the one block model and adds a second block with
64 filters.
The define_model() function for this model is provided below for completeness.
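In the sketch above, the second block would be inserted between the first block and Flatten(), for example:

    # second VGG-style block with 64 filters
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))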
3.3.3 Three Block VGG Model
The three-block VGG model extends the two-block model and adds a third block with 128 filters. The define_model() function for this model is provided below for completeness.
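Extending the sketch again, the third block with 128 filters would follow the second block:

    # third VGG-style block with 128 filters
    model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))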
Training deep learning neural network models on more data can result in more skillful
models, and the augmentation techniques can create variations of the images that can
improve the ability of the fit models to generalize what they have learned to new images.
Data augmentation can also act as a regularization technique, adding noise to the training
data, and encouraging the model to learn the same features, invariant to their position in
the input.
Small changes to the input photos of dogs and cats might be useful for this problem, such
as small shifts and horizontal flips. These augmentations can be specified as arguments
to the ImageDataGenerator used for the training dataset. The augmentations should not
be used for the test dataset, as we wish to evaluate the performance of the model on the
unmodified photographs.
This requires that we have a separate ImageDataGenerator instance for the train and test
dataset, then iterators for the train and test sets created from the respective data
generators.
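A sketch of these separate generators and iterators (the directory names, batch size, and target size are assumptions):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# augmentation only on the training data: small shifts and horizontal flips
train_datagen = ImageDataGenerator(rescale=1.0 / 255.0,
                                   width_shift_range=0.1,
                                   height_shift_range=0.1,
                                   horizontal_flip=True)
# no augmentation for the test data
test_datagen = ImageDataGenerator(rescale=1.0 / 255.0)

train_it = train_datagen.flow_from_directory('dataset_dogs_vs_cats/train/',
                                             class_mode='binary', batch_size=64,
                                             target_size=(200, 200))
test_it = test_datagen.flow_from_directory('dataset_dogs_vs_cats/test/',
                                           class_mode='binary', batch_size=64,
                                           target_size=(200, 200))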
In this tutorial, we will demonstrate the final model fit only on the training dataset as we
only have labels for the training dataset.
The first step is to prepare the training dataset so that it can be loaded by
the ImageDataGenerator class via flow_from_directory() function. Specifically, we need
to create a new directory with all training images organized
into dogs/ and cats/ subdirectories without any separation into train/ or test/ directories.
This can be achieved by updating the script we developed at the beginning of the tutorial.
In this case, we will create a new finalize_dogs_vs_cats/ folder
with dogs/ and cats/ subfolders for the entire training dataset.
The structure will look as follows:
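(The file names below are illustrative.)

finalize_dogs_vs_cats/
    cats/
        cat.0.jpg
        cat.1.jpg
        ...
    dogs/
        dog.0.jpg
        dog.1.jpg
        ...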
3.6 Save Final Model
We are now ready to fit a final model on the entire training dataset.
The complete example of fitting the final model on the training dataset and saving it to file
is listed below.
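A minimal sketch of that listing, assuming a define_model() variant whose input size matches the 224×224 preparation described below; the ImageNet channel means used for centering, the batch size, the epoch count, and the saved file name are assumptions:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = define_model()                                  # assumed to expect 224x224x3 inputs here
datagen = ImageDataGenerator(featurewise_center=True)
datagen.mean = [123.68, 116.779, 103.939]               # assumed per-channel centering statistics
train_it = datagen.flow_from_directory('finalize_dogs_vs_cats/', class_mode='binary',
                                        batch_size=64, target_size=(224, 224))
model.fit(train_it, steps_per_epoch=len(train_it), epochs=10, verbose=1)
model.save('final_model.h5')                            # file name is an assumption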
Below is an image extracted from the test dataset for the dogs and cats competition. It
has no label, but we can clearly tell it is a photo of a dog. You can save it in your current
working directory with the filename ‘sample_image.jpg‘.
We will pretend this is an entirely new and unseen image, prepared in the required way,
and see how we might use our saved model to predict the integer that the image
represents. For this example, we expect class “1” for “Dog“.
First, we can load the image and force it to be 224×224 pixels. The loaded image can then be reshaped so that it forms a single sample in a dataset. The pixel values must also be centered to match the way that the data was prepared during the training of the model. The load_image() function implements this and will return the loaded image ready for classification.
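A sketch of load_image() along those lines (the centering statistics are the assumed ImageNet channel means):

from tensorflow.keras.preprocessing.image import load_img, img_to_array

def load_image(filename):
    img = load_img(filename, target_size=(224, 224))   # load and force the size to 224x224
    img = img_to_array(img)                            # convert to a NumPy array
    img = img.reshape(1, 224, 224, 3)                  # a single sample in a dataset
    img = img.astype('float32')
    img -= [123.68, 116.779, 103.939]                  # center pixels as assumed during training
    return img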
Next, we can load the model as in the previous section and call the predict() function to
predict the content in the image as a number between “0” and “1” for “cat” and “dog”
respectively.
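A sketch of the final prediction step, reusing load_image() above and the assumed saved-model file name:

from tensorflow.keras.models import load_model

img = load_image('sample_image.jpg')
model = load_model('final_model.h5')     # file name is an assumption
result = model.predict(img)
print(result[0])                         # close to 1.0 means dog, close to 0.0 means cat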
The data we collected is a subset of the Kaggle dogs vs cats dataset. In total, there are 10,000 images: 80% for the training set and 20% for the test set. The training set contains 4,000 images of dogs, while the test set has 1,000 images of dogs; the rest in each set are cats.
All images are saved in a special folder structure, making it easy for Keras to understand and differentiate the animal category of each image.
CONCLUSION AND FUTURE WORK
This work aims at classifying images using a Convolutional Neural Network (CNN). With the optimization possible with CNNs, it is easier to classify images than with traditional image classification algorithms. With further advances in the study of neural networks, image classification problems will continue to become easier to solve. With image classification finding applications in various spheres of life, neural networks have assumed even more significance. In the future, this work can be extended to real-time image processing in various fields, such as the validation and verification of different real-time images and spoof detection.
BIBLIOGRAPHY
1: analyticsvidhya.com
2: towardsdatascience.com
3: geeksforgeeks.org
4: google.com
5: kaggle.com