
MINOR PROJECT REPORT

ON

CAT AND DOG CLASSIFICATION USING CNN

DEVELOPED BY:
VIKAS ARORA (00255102719)
PRAKHAR GUPTA (03855102719)

Under the Guidance of


Mrs. SHRUTY AHUJA
HOD (CSE)
At

MAHAVIR SWAMI INSTITUTE OF TECHNOLOGY


SONEPAT
AFFILIATED TO GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DWARKA, NEW DELHI

2019 - 2023
DECLARATION

We, VIKAS ARORA (00255102719) and PRAKHAR GUPTA (03855102719), of Fourth Year B.Tech. in
the Department of Computer Science and Engineering, MVSIT, hereby declare that the work
presented in this report entitled “Cat and Dog Classification Using CNN”, in fulfillment of
the requirement for the award of the degree of Bachelor of Technology in Computer Science &
Engineering, submitted to the CSE Department, Mahavir Swami Institute of Technology,
affiliated to Guru Gobind Singh Indraprastha University, New Delhi, during the academic year
2019-2023, is an authentic record of our own work carried out during our degree under the
guidance of Ms. SHRUTI AHUJA. The work reported in this report has not been submitted by us
for the award of any other degree or diploma.

VIKAS ARORA (00255102719)

PRAKHAR GUPTA (03855102719)


ACKNOWLEDGEMENT

We would like to express our deep gratitude to our guide, Ms. SHRUTI AHUJA, Faculty of
Computer Science and Engineering, MVSIT, for her valuable guidance and timely suggestions
during the entire duration of our dissertation work, without which this work would not have
been possible. We would also like to convey our deep regards to all other faculty members of
MVSIT, who offered their effort and guidance at appropriate times, without which it would
have been very difficult for us to finish this work. Finally, we would like to thank our
friends for their advice and for pointing out our mistakes, and our parents and classmates
for their encouragement throughout the project period. Last but not least, we thank everyone
who supported us, directly or indirectly, in completing this project successfully.
ABSTRACT

Image classification is a fundamental problem in computer vision.
Deep learning has produced successful results on many machine
learning problems. Algorithms such as the minimum distance
algorithm, the K-Nearest Neighbor algorithm, the Nearest Clustering
algorithm, the Fuzzy C-Means algorithm, and the Maximum Likelihood
algorithm have been used for image classification. In this report,
image classification is performed using a convolutional neural
network, an approach that became standard after Alex Krizhevsky,
Ilya Sutskever, and Geoffrey Hinton won the ImageNet challenge in
2012. Convolutional neural networks generally rely on GPUs because
of the huge number of computations involved, but in the proposed
method we build a very small network that can also run on a CPU.
The network is trained on a subset of the Kaggle Dog-Cat dataset.
The trained classifier assigns a given image to either the cat or
the dog class. The same network can be trained on any other dataset
to classify images into one of two predefined classes.
CONTENTS

Declaration
Acknowledgement
Abstract
CHAPTERS
CHAPTER 1. INTRODUCTION
1.1 Convolutional Neural Network
1.1.1 Convolutional Layer
1.1.2 Pooling Layer
1.1.3 Fully Connected Layer
1.2 AIM & OBJECTIVE
1.3 Conceptual Framework
1.4 Method

CHAPTER 2. STUDY AND ANALYSIS
2.1 Problem Statement
2.2 Installing Required Packages for Python
2.2.1 NumPy
2.2.2 TensorFlow
2.2.3 Keras
2.3 Import Libraries
2.4 Convolution
2.5 Activation
2.6 Pooling
2.7 Fully Connected
CHAPTER 3. EXPERIMENTAL ANALYSIS AND RESULTS
3.1 Plot Dog and Cat Photos
3.2 Pre-Process Photos into Standard Directories
3.3 Develop a Baseline CNN Model
3.3.1 One Block VGG Model
3.3.2 Two Block VGG Model
3.3.3 Three Block VGG Model
3.4 Image Data Augmentation
3.5 Prepare Final Dataset
3.6 Save Final Model
3.7 Make Prediction
3.8 Data overview
4 CONCLUSION AND FUTURE WORK
5 BIBLIOGRAPHY
CHAPTER 1. INTRODUCTION

1.1. Convolutional Neural Network

Artificial Intelligence has been witnessing monumental growth in bridging the gap between
the capabilities of humans and machines. Researchers and enthusiasts alike work on numerous
aspects of the field to make amazing things happen. One of many such areas is the domain of
Computer Vision.

The agenda for this field is to enable machines to view the world as humans do, perceive it
in a similar manner, and even use the knowledge for a multitude of tasks such as Image &
Video Recognition, Image Analysis & Classification, Media Recreation, Recommendation
Systems, Natural Language Processing, etc. The advancements in Computer Vision with Deep
Learning have been constructed and perfected with time, primarily over one particular
algorithm: the Convolutional Neural Network.

A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm which can take in
an input image, assign importance (learnable weights and biases) to various aspects/objects
in the image, and differentiate one from the other. The pre-processing required in a ConvNet
is much lower than in other classification algorithms. While in primitive methods filters
are hand-engineered, with enough training ConvNets can learn these filters/characteristics.

The architecture of a ConvNet is analogous to the connectivity pattern of neurons in the
human brain and was inspired by the organization of the visual cortex. Individual neurons
respond to stimuli only in a restricted region of the visual field known as the receptive
field. A collection of such fields overlaps to cover the entire visual area.

Convolutional neural networks are distinguished from other neural networks by their
superior performance with image, speech, or audio signal inputs. They have three main
types of layers, which are:

• Convolutional layer
• Pooling layer
• Fully-connected (FC) layer

The convolutional layer is the first layer of a convolutional network. While convolutional
layers can be followed by additional convolutional layers or pooling layers, the fully-
connected layer is the final layer. With each layer, the CNN increases in its complexity,
identifying greater portions of the image. Earlier layers focus on simple features, such as
colors and edges. As the image data progresses through the layers of the CNN, it starts
to recognize larger elements or shapes of the object until it finally identifies the intended
object.

1.1.1 Convolutional Layer

The convolutional layer is the core building block of a CNN, and it is where the majority
of computation occurs. It requires a few components: input data, a filter, and a feature
map. Let’s assume that the input is a color image, which is made up of a 3D matrix of
pixels. This means that the input has three dimensions: a height, a width, and a depth,
where the depth corresponds to the three RGB channels of the image. We also have a feature
detector, also known as a kernel or a filter, which moves across the receptive fields of
the image, checking whether the feature is present. This process is known as a convolution.

1.1.2 Pooling Layer


Pooling layers, also known as downsampling layers, perform dimensionality reduction,
reducing the number of parameters in the input. Similar to the convolutional layer, the
pooling operation sweeps a filter across the entire input, but the difference is that this
filter does not have any weights. Instead, the kernel applies an aggregation function to the
values within the receptive field, populating the output array. There are two main types of
pooling:

• Max pooling: As the filter moves across the input, it selects the pixel with the
maximum value to send to the output array. As an aside, this approach tends to
be used more often compared to average pooling.
• Average pooling: As the filter moves across the input, it calculates the
average value within the receptive field to send to the output array.
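
The two pooling types above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the helper name pool2d, the 2×2 window, and the stride of 2 are choices made here and are not taken from the project code.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D feature map (hypothetical helper)."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=float)
    # The pooling "filter" has no weights, only an aggregation function.
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 0],
                 [1, 8, 3, 4]], dtype=float)

max_pooled = pool2d(fmap, mode="max")      # [[6, 4], [8, 9]]
avg_pooled = pool2d(fmap, mode="average")  # [[3.75, 2.25], [4.5, 4.0]]
```

In both cases a 4×4 input shrinks to 2×2, which is the dimensionality reduction described above.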

1.1.3 Fully-Connected Layer

The name of the fully-connected layer aptly describes itself. In partially connected
layers, the pixel values of the input image are not directly connected to the output layer.
In the fully-connected layer, however, each node in the output layer connects directly to
every node in the previous layer.

1.2 AIM & OBJECTIVE

• The main aim of this project is to achieve an understanding of data such as images.

• Most large companies use this kind of deep learning at the core of their services.
Facebook uses neural nets for its automatic tagging algorithms, Google for its photo
search, Amazon for its product recommendations, and Instagram for its search
infrastructure. However, the classic use case of these networks is image processing.

• To learn multiple levels of representation that correspond to different levels of
abstraction; the levels form a hierarchy of concepts.

• To learn in a supervised and/or unsupervised manner.

• The image given as input to the system is analyzed and the predicted result is given as
output. A machine learning algorithm (Convolutional Neural Network) is used to classify
the image.

1.3 Conceptual Framework:


The project is entirely implemented using Python3. The Conceptual Framework involved
is mainly:
• Keras – TensorFlow backend
• OpenCV – Used to handle image operations

1.4 Method:
Step 1: Getting the Dataset

Step 2: Installing Required Packages [Python 3.6]


1. OpenCV -> Used to handle image operations like reading the image, resizing, and
reshaping
2. NumPy -> The image that is read is stored in a NumPy array
3. TensorFlow -> TensorFlow is the backend for Keras
4. Keras -> Keras is used to implement the CNN
Step 3: How the Model Works?
The dataset contains a large number of images of cats and dogs. Our aim is to make the
model learn the distinguishing features between cats and dogs. Once the model has been
trained, it will be able to classify an input image as either a cat or a dog.
Features Provided:
• Your own images can be tested to check the accuracy of the model
• This code can be integrated directly with your current project or can be
extended as a mobile application or a website.
• To extend the project to classify different entities, all you need to do is find
a suitable dataset, change the dataset accordingly, and train the model

Data structures and Algorithms used in project


• NumPy Array: This powerful and widely used Python data structure is used to
store the pixel values of images.
Tools Used:
• Python Interpreter
• Anaconda Prompt
• Spyder

Applications:
This project gives a general idea of how image classification can be done efficiently.
The project can be extended to various industries where there is considerable scope for
automation, simply by substituting a dataset relevant to the problem.
CHAPTER 2. STUDY AND ANALYSIS

A Convolutional Neural Network (CNN) is an algorithm that takes an image as input, assigns
weights and biases to various aspects of the image, and thus differentiates one image from
another. Neural networks can be trained using batches of images, each of which has a label
identifying the real nature of the image (cat or dog here). A batch can contain a few tens
to hundreds of images. For each image, the network prediction is compared with the
corresponding label, and the distance between the network prediction and the truth is
evaluated for the whole batch. Then the network parameters are modified to reduce this
distance, which improves the prediction capability of the network. The training process
continues similarly for every batch.
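
The batch-wise training loop described above can be sketched in plain NumPy. This is a deliberately simplified illustration and not the project's network: it assumes a single-layer logistic model on synthetic two-feature data, and the names train_in_batches, batch_size, and learning_rate are hypothetical.

```python
import numpy as np

def train_in_batches(x, y, batch_size=32, learning_rate=0.1, epochs=20, seed=0):
    """Mini-batch gradient descent for a tiny logistic model (illustrative only)."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(x.shape[1])
    bias = 0.0
    losses = []
    eps = 1e-7
    for _ in range(epochs):
        order = rng.permutation(len(x))  # shuffle the samples every epoch
        epoch_loss = 0.0
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            # Forward pass: predicted probability of class 1 for each sample.
            pred = 1.0 / (1.0 + np.exp(-(xb @ weights + bias)))
            # Distance between prediction and truth (binary cross-entropy).
            loss = -np.mean(yb * np.log(pred + eps)
                            + (1 - yb) * np.log(1 - pred + eps))
            # Update the parameters in the direction that reduces the batch loss.
            error = pred - yb
            weights -= learning_rate * (xb.T @ error) / len(idx)
            bias -= learning_rate * error.mean()
            epoch_loss += loss * len(idx)
        losses.append(epoch_loss / len(x))
    return weights, bias, losses

# Synthetic stand-in for labeled data: class 1 when the two features sum above zero.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=(200, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)
weights, bias, losses = train_in_batches(x, y)
```

The average loss recorded per epoch falls as training proceeds, which is the "distance" being reduced batch after batch.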
2.1 Dogs vs. Cats Prediction Problem Statement
The main goal is to develop a system that can identify images of cats and dogs. The input
image is analyzed and then the output is predicted. The implemented model can be extended
to a website or a mobile device as needed. The Dogs vs Cats dataset can be downloaded from
the Kaggle website. The dataset contains a set of images of cats and dogs. Our main aim
here is for the model to learn the distinctive features of cats and dogs. Once the model is
trained, it will be able to differentiate images of cats and dogs.

2.2 Installing Required Packages for Python 3.6


2.2.1. NumPy -> [Image is read and stored in a NumPy array]

2.2.2. TensorFlow -> [TensorFlow is the backend for Keras]

2.2.3. Keras -> [Keras is used for implementing the CNN]


2.3 Import Libraries
1. NumPy – For working with arrays and linear algebra

2. Pandas – For reading/writing data

3. Matplotlib – To display images

4. TensorFlow Keras models – A model is needed to make predictions

5. TensorFlow Keras layers – Every neural network needs layers, and a CNN needs a couple of
layer types.

A CNN processes images with the help of matrices of weights known as filters. In the early
layers, the filters detect low-level features such as vertical and horizontal edges. Through
each successive layer, the filters recognize progressively higher-level features.


We first initialize the CNN. For compiling the CNN, we use the Adam optimizer.

Adaptive Moment Estimation (Adam) is a method that computes individual adaptive learning
rates for each parameter. For the loss function, we use binary cross-entropy, which compares
each predicted probability with the actual class label (0 or 1). It then calculates a
penalty score based on the distance from the expected value.
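
The penalty computed by binary cross-entropy can be made concrete with a small NumPy sketch. It covers the loss only, not the Adam optimizer. The function name binary_cross_entropy and the eps clipping value are assumptions made here for numerical safety; they are not taken from the project code.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) can never occur.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_sample = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return float(-np.mean(per_sample))

labels = [1, 0, 1, 0]  # 1 = dog, 0 = cat
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])  # close to the labels
unsure = binary_cross_entropy(labels, [0.6, 0.4, 0.5, 0.5])     # near 0.5 everywhere
wrong = binary_cross_entropy(labels, [0.1, 0.9, 0.2, 0.8])      # far from the labels
```

The further the predicted probabilities are from the true labels, the larger the penalty: confident < unsure < wrong.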

Image augmentation is a method of applying different kinds of transformations to the
original images, resulting in multiple transformed copies of the same image. The copies
differ from each other in certain aspects because of shifting, rotating, and flipping. We
use the Keras ImageDataGenerator class to augment our images.

2.4 Convolution
Convolution is a linear operation involving the multiplication of weights with the input.
The multiplication is performed between an array of input data and a 2D array of weights
known as a filter or kernel. The filter is always smaller than the input data, and a dot
product is computed between the filter and each filter-sized patch of the input.
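
A minimal NumPy sketch of this sliding dot product follows. It assumes a single-channel input, no padding, and a stride of 1; the helper name convolve2d and the example vertical-edge kernel are illustrative choices. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and take a dot product at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1), dtype=float)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # filter-sized patch of the input
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

# A 4x4 image that is dark on the left and bright on the right.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# A 3x3 filter that responds to dark-to-bright transitions from left to right.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

feature_map = convolve2d(image, vertical_edge)  # [[3, 3], [3, 3]]
```

Because the filter is smaller than the input, the 4×4 image produces a 2×2 feature map.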


2.5 Activation
The activation function is added to help an artificial neural network (ANN) learn complex
patterns in the data. Its main purpose is to introduce non-linearity into the neural
network.
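
As a hedged illustration, two common activation functions are sketched below in NumPy. The sigmoid is the activation this report names for the output layer in Section 3.3; ReLU is an assumption added here as a typical hidden-layer choice and is not stated in the text.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: keeps positive values, replaces negatives with 0."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

values = np.array([-2.0, 0.0, 3.0])
relu_out = relu(values)        # negative inputs become 0
sigmoid_out = sigmoid(values)  # every output lies strictly between 0 and 1
```

Neither function is a straight line through its whole input range, which is what lets stacked layers represent non-linear patterns.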
2.6 Pooling
The pooling operation provides spatial invariance, making the system capable of recognizing
an object even when its appearance varies somewhat. It involves sliding a 2D filter over
each channel of the feature map and summarising the features lying within the region covered
by the filter.

Pooling therefore helps reduce the number of parameters and computations in the network. It
progressively reduces the spatial size of the feature maps and thus helps control
overfitting. There are two types of operations in this layer: average pooling and maximum
pooling. Here, we use max pooling, which, as its name suggests, takes only the maximum value
from each pool. A filter slides across the input and, at each stride, the maximum value is
kept and the rest are dropped. Unlike the convolution layer, the pooling layer does not
modify the depth of the feature maps.
2.7 Fully Connected
The output from the final pooling layer is flattened and becomes the input of the fully
connected layer.

The full connection process works, in practice, as follows:

Each neuron in the fully connected layer detects a certain feature and preserves its value,
then communicates that value to both the dog and cat classes, which check the feature and
decide whether it is relevant to them.
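
The flatten-then-connect step can be sketched in NumPy as follows. This is an illustration under assumptions: the 4×4×8 shape of the pooled feature maps, the random weights, and the helper name fully_connected are made up here, and the single sigmoid output follows the one-node output layer described in Section 3.3.

```python
import numpy as np

def fully_connected(pooled, weights, bias):
    """Flatten pooled feature maps and feed them to one sigmoid output node."""
    flat = pooled.reshape(-1)      # flatten (h, w, channels) into a 1D vector
    logit = flat @ weights + bias  # every input value connects to the output node
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
pooled = rng.random((4, 4, 8))  # hypothetical output of the final pooling layer
weights = rng.normal(scale=0.1, size=pooled.size)
bias = 0.0
dog_probability = fully_connected(pooled, weights, bias)
```

The result is a single number between 0 and 1 that can be read as the probability of the "dog" class.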


The Dogs vs. Cats dataset is a standard computer vision dataset that involves
classifying photos as either containing a dog or cat.

Although the problem sounds simple, it was only effectively addressed in the last
few years using deep learning convolutional neural networks. While the dataset is
effectively solved, it can be used as the basis for learning and practicing how to
develop, evaluate, and use convolutional deep learning neural networks for image
classification from scratch.

This includes how to develop a robust test harness for estimating the performance
of the model, how to explore improvements to the model, and how to save the
model and later load it to make predictions on new data.

The dogs vs cats dataset refers to a dataset used for a Kaggle machine learning
competition held in 2013.

The dataset is comprised of photos of dogs and cats provided as a subset of
photos from a much larger dataset of 3 million manually annotated photos.

The photos are labeled by their filename, with the word “dog” or “cat”. The file
naming convention is as follows:
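
The listing of the naming convention is not reproduced in this copy. As a hedged illustration, the sketch below assumes that the Kaggle training files are named like dog.0.jpg and cat.0.jpg (class name, index, extension). The helper label_from_filename is hypothetical, and the 0/1 mapping follows the class numbering used in Section 3.7 (cat = 0, dog = 1).

```python
import os

def label_from_filename(path):
    """Return 1 for a dog photo and 0 for a cat photo, based on the file name."""
    name = os.path.basename(path).lower()
    if name.startswith("dog"):
        return 1
    if name.startswith("cat"):
        return 0
    raise ValueError(f"Cannot infer class from filename: {path}")

# Example paths are illustrative; only the file name prefix matters.
labels = [label_from_filename(p) for p in ["train/dog.0.jpg", "train/cat.12.jpg"]]
```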
CHAPTER 3. EXPERIMENTAL ANALYSIS AND RESULTS

3.1 Plot Dog and Cat Photos


Looking at a few random photos in the directory, we can see that the photos are color
and have different shapes and sizes.

For example, let’s load and plot the first nine photos of dogs in a single figure.

The complete example is listed below.

Running the example creates a figure showing the first nine photos of dogs in the dataset.

We can see that some photos are landscape format, some are portrait format, and some
are square.
We can update the example and change it to plot cat photos instead; the complete
example is listed below.
Again, we can see that the photos are all different sizes.

We can also see a photo where the cat is barely visible (bottom left corner) and another
that has two cats (lower right corner). This suggests that any classifier fit on this problem
will have to be robust.

3.2 Pre-Process Photos into Standard Directories


We can load the images progressively using the Keras ImageDataGenerator class and
flow_from_directory() API. This is slower to execute than loading the whole dataset into
memory at once, but it will run on more machines.
This API prefers data to be divided into separate train/ and test/ directories, and under
each directory to have a subdirectory for each class, e.g. train/dog/ and train/cat/
subdirectories, and the same for test. Images are then organized under the
subdirectories.
We can create directories in Python using the makedirs() function and use a loop to
create the dog/ and cat/ subdirectories for both the train/ and test/ directories.
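
The directory-creation step can be sketched as below. The function name make_dataset_dirs and its default arguments are assumptions made for this illustration; only os.makedirs() and the train/test and dog/cat layout come from the text.

```python
import os

def make_dataset_dirs(root, splits=("train", "test"), classes=("dog", "cat")):
    """Create <root>/<split>/<class>/ for every split and class; return the paths."""
    created = []
    for split in splits:
        for cls in classes:
            path = os.path.join(root, split, cls)
            # exist_ok=True makes the call safe to repeat on an existing layout.
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created
```

Calling make_dataset_dirs("dataset_dogs_vs_cats") (the folder name is a placeholder) would create the four subdirectories that flow_from_directory() expects; copying the photos into them is a separate step not shown here.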

3.3 Develop a Baseline CNN Model


In this section, we can develop a baseline convolutional neural network model for the
dogs vs. cats dataset.

A baseline model will establish a minimum model performance to which all of our other
models can be compared, as well as a model architecture that we can use as the basis
of study and improvement.
The architecture follows the VGG style: it stacks convolutional layers with small 3×3
filters followed by a max pooling layer. Together, these layers form a block, and these
blocks can be repeated, with the number of filters in each block increasing with the depth
of the network, such as 32, 64, 128, 256 for the first four blocks of the model. Padding is
used on the convolutional layers to ensure that the height and width of the output feature
maps match the inputs.

We can explore this architecture on the dogs vs cats problem and compare a model with
this architecture with 1, 2, and 3 blocks.

We can create a function named define_model() that will define a model and return it
ready to be fit on the dataset. This function can then be customized to define different
baseline models, e.g. versions of the model with 1, 2, or 3 VGG style blocks.
The model will be fit with stochastic gradient descent and we will start with a conservative
learning rate of 0.001 and a momentum of 0.9.

The problem is a binary classification task, requiring the prediction of one value of either
0 or 1. An output layer with 1 node and a sigmoid activation will be used and the model
will be optimized using the binary cross-entropy loss function.

Below is an example of the define_model() function for defining a convolutional neural
network model for the dogs vs. cats problem with one VGG-style block.
The complete example of evaluating a one-block baseline model on the dogs and cats
dataset is listed below.
3.3.1 One Block VGG Model
The one-block VGG model has a single convolutional layer with 32 filters followed by a
max pooling layer.

The define_model() function for this model was defined in the previous section but is
provided again below for completeness.
3.3.2 Two Block VGG Model
The two-block VGG model extends the one block model and adds a second block with
64 filters.

The define_model() function for this model is provided below for completeness.

3.3.3 Three Block VGG Model


The three-block VGG model extends the two block model and adds a third block with 128
filters.

The define_model() function for this model is provided below for completeness.
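
The original listings for the one-, two-, and three-block models are not reproduced in this copy. The following is a minimal sketch of how such a function might look, assuming the tensorflow.keras API. The n_blocks and input_shape parameters, the ReLU activations, the 128-unit hidden Dense layer, and the accuracy metric are assumptions of this sketch; the 3×3 filters, the 32/64/128 filter counts, the padding, the sigmoid output node, the binary cross-entropy loss, and the SGD settings (learning rate 0.001, momentum 0.9) come from the text above.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD

def define_model(n_blocks=1, input_shape=(224, 224, 3)):
    """Build and compile a VGG-style CNN with 1, 2, or 3 blocks (sketch)."""
    model = Sequential()
    model.add(Input(shape=input_shape))
    for block in range(n_blocks):
        filters = 32 * (2 ** block)  # 32, 64, 128 for blocks 1, 2, 3
        # "same" padding keeps the feature map height and width equal to the input.
        model.add(Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation="relu"))   # hidden layer size is an assumption
    model.add(Dense(1, activation="sigmoid"))  # one node: probability of class 1
    model.compile(optimizer=SGD(learning_rate=0.001, momentum=0.9),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

With this parametrization, define_model(1), define_model(2), and define_model(3) would correspond to the three baseline variants compared in this section.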

3.4 Image Data Augmentation


Image data augmentation is a technique that can be used to artificially expand the size of
a training dataset by creating modified versions of images in the dataset.

Training deep learning neural network models on more data can result in more skillful
models, and the augmentation techniques can create variations of the images that can
improve the ability of the fit models to generalize what they have learned to new images.

Data augmentation can also act as a regularization technique, adding noise to the training
data, and encouraging the model to learn the same features, invariant to their position in
the input.

Small changes to the input photos of dogs and cats might be useful for this problem, such
as small shifts and horizontal flips. These augmentations can be specified as arguments
to the ImageDataGenerator used for the training dataset. The augmentations should not
be used for the test dataset, as we wish to evaluate the performance of the model on the
unmodified photographs.

This requires that we have a separate ImageDataGenerator instance for the train and test
dataset, then iterators for the train and test sets created from the respective data
generators.
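
To show what small shifts and horizontal flips do to an image array, here is a plain NumPy sketch. It is not the Keras ImageDataGenerator: the function names, the 10% maximum shift, the 50% flip probability, and the zero-filling of the gap left by a shift are all simplifying assumptions made for illustration.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror an (height, width, channels) image left to right."""
    return image[:, ::-1, :]

def shift_width(image, pixels):
    """Shift right (pixels > 0) or left (pixels < 0), zero-filling the gap."""
    shifted = np.zeros_like(image)
    if pixels > 0:
        shifted[:, pixels:, :] = image[:, :-pixels, :]
    elif pixels < 0:
        shifted[:, :pixels, :] = image[:, -pixels:, :]
    else:
        shifted[...] = image
    return shifted

def augment(image, rng, max_shift_fraction=0.1):
    """Randomly flip and shift one training image; never used on test images."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    max_shift = int(image.shape[1] * max_shift_fraction)
    return shift_width(image, int(rng.integers(-max_shift, max_shift + 1)))

photo = np.arange(4 * 10 * 3).reshape(4, 10, 3)  # stand-in for a small RGB photo
augmented = augment(photo, np.random.default_rng(0))
```

The augmented array keeps the shape of the original, so it can be fed to the same network without further changes.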

3.5 Prepare Final Dataset


A final model is typically fit on all available data, such as the combination of all train and
test datasets.

In this report, we demonstrate the final model fit only on the training dataset, as we
only have labels for the training dataset.

The first step is to prepare the training dataset so that it can be loaded by
the ImageDataGenerator class via the flow_from_directory() function. Specifically, we need
to create a new directory with all training images organized
into dogs/ and cats/ subdirectories without any separation into train/ or test/ directories.
This can be achieved by updating the script developed in Section 3.2.
In this case, we create a new finalize_dogs_vs_cats/ folder
with dogs/ and cats/ subfolders for the entire training dataset.
The structure will look as follows:

finalize_dogs_vs_cats/
    cats/
    dogs/
3.6 Save Final Model
We are now ready to fit a final model on the entire training dataset.

The complete example of fitting the final model on the training dataset and saving it to file
is listed below.

3.7 Make Prediction


We can use our saved model to make a prediction on new images.
The model assumes that new images are color and they have been segmented so that
one image contains at least one dog or cat.

Below is an image extracted from the test dataset for the dogs and cats competition. It
has no label, but we can clearly tell it is a photo of a dog. You can save it in your current
working directory with the filename ‘sample_image.jpg’.

We will pretend this is an entirely new and unseen image, prepared in the required way,
and see how we might use our saved model to predict the integer that the image
represents. For this example, we expect class “1” for “Dog”.

First, we can load the image and force it to a size of 224×224 pixels. The loaded
image can then be reshaped to be a single sample in a dataset. The pixel values must
also be centered to match the way that the data was prepared during the training of the
model. The load_image() function implements this and returns the loaded image ready
for classification.
Next, we can load the model as in the previous section and call the predict() function to
predict the content of the image as a number between “0” and “1”, for “cat” and “dog”
respectively.
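
The reshaping, centering, and thresholding steps can be sketched in NumPy as below. This sketch does not read an image file or call the saved model; it starts from an already loaded 224×224×3 pixel array. The helper names prepare_image and to_label, the channel_means placeholder values, and the 0.5 decision threshold are assumptions. In practice the channel means must be the same values that were used when the model was trained.

```python
import numpy as np

def prepare_image(pixels, channel_means):
    """Turn an (h, w, 3) pixel array into a centered batch containing one sample."""
    sample = np.asarray(pixels, dtype="float32")
    sample = sample.reshape((1,) + sample.shape)  # add the batch dimension
    return sample - np.asarray(channel_means, dtype="float32")

def to_label(probability, threshold=0.5):
    """Map the sigmoid output to a class name (0 = cat, 1 = dog)."""
    return "dog" if probability >= threshold else "cat"

# Stand-in for a loaded and resized photo; real code would read sample_image.jpg.
pixels = np.full((224, 224, 3), 128, dtype="uint8")
batch = prepare_image(pixels, channel_means=[128.0, 128.0, 128.0])
```

The resulting batch has shape (1, 224, 224, 3), which is the form a Keras model's predict() function expects for a single image.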

The complete example is listed below.


Running the example first loads and prepares the image, loads the model, and then
correctly predicts that the loaded image represents a ‘dog’ or class ‘1’.

3.8 Data overview

The data we collected is a subset of the Kaggle dog/cat dataset. In total, there are 10,000
images: 80% in the training set and 20% in the test set. The training set contains 4,000
images of dogs and the test set contains 1,000 images of dogs; the rest are cats.

All images are saved in a special folder structure, making it easy for Keras to understand
and differentiate the animal category of each image.
CONCLUSION AND FUTURE WORK

This work aims at classifying images using a Convolutional Neural Network (CNN). With
the optimization possible with a CNN, it is easier to classify images than with
traditional image classification algorithms. With further advances in the study of neural
networks, image classification problems will continue to become easier to solve. With image
classification finding applications in various spheres of life, neural networks have
assumed even more significance. In future, this work can be extended to real-time image
processing in fields such as the validation and verification of real-time images and
spoofing detection.
BIBLIOGRAPHY

1: analyticsvidhya.com

2: towardsdatascience.com

3: geeksforgeeks.org

4: google.com

5: kaggle.com
