Tribhuvan University
A Seminar Report
On
“IMAGE CLASSIFICATION USING CNN”
Submitted To:
Submitted By:
Dhan Bahadur Pun
Roll No.: 18/077
I hereby recommend that this seminar report, prepared under my supervision by Dhan
Bahadur Pun and entitled “IMAGE CLASSIFICATION USING CNN”, be accepted in partial
fulfillment of the requirements for the degree of Master of Science in Computer
Science and Information Technology. To the best of my knowledge, this is an original
work in computer science.
…….…………………………
LETTER OF APPROVAL
This is to certify that the seminar report prepared by Mr. Dhan Bahadur Pun entitled
“IMAGE CLASSIFICATION USING CNN” in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science and Information Technology
has been well studied. In our opinion, it is satisfactory in scope and quality as a
project for the required degree.
Evaluation Committee
…………………………
Asst. Prof. Nawaraj Paudel
(HOD)
Central Department of Computer Science and Information Technology

…………………………
Asst. Prof. Jagdish Bhatta
(Supervisor)
Central Department of Computer Science and Information Technology
………………………………
(Internal)
ACKNOWLEDGEMENT
The seminar entitled “IMAGE CLASSIFICATION USING CNN” has been conducted
in partial fulfillment of the requirements for the degree of Master of Science in
Computer Science and Information Technology, Tribhuvan University.
Firstly, I would like to express my appreciation to all those who made it possible for me
to complete this seminar report. Special gratitude goes to my supervisor, Prof. Jagdish
Bhatta, for his stimulating suggestions and encouragement, which helped me to coordinate
this project, especially in writing this seminar report.
ABSTRACT
In this study, a convolutional neural network is trained on the CIFAR10 dataset. An input
image of size 32 x 32 x 1 is passed to the first convolution layer, and the output of the
first convolution layer is passed to the second convolution layer. The output of the
second layer is flattened into a one-dimensional array and passed through two fully
connected layers, followed by a final layer with the softmax function. The resulting
model achieves an accuracy of about 75%.
TABLE OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Introduction
1.3 Objectives
2.1.2 CNN
CHAPTER 3 METHODOLOGY
3.1 Flowchart
3.4.4 Flatten Layers
CHAPTER 4 IMPLEMENTATION
4.1 Numpy
4.2 Matplotlib
4.3 Keras
4.4 Python
5.1.1 Accuracy
CHAPTER 6 CONCLUSION
References
LIST OF FIGURES
Figure 3.1 Methodology
LIST OF ABBREVIATIONS
1D One-dimensional
2D Two-dimensional
3D Three-dimensional
CHAPTER 1 INTRODUCTION
1.1 Introduction
There are several techniques for image classification, such as supervised and unsupervised
classification, Artificial Neural Networks, SVM, K-Nearest Neighbor, Naïve Bayes, the
Random Forest algorithm, and Convolutional Neural Networks (CNNs). This report describes
the Convolutional Neural Network in detail. CNNs are very similar to ordinary neural
networks: they are composed of artificial neurons that have learnable weights and biases.
Each neuron receives some inputs and performs a dot product [1]. Artificial neurons are
mathematical functions that calculate the weighted sum of multiple inputs and produce an
output. The weights define the behavior of each neuron. The convolutional neural network
is a specialized type of neural network model designed for working with 2D (image) data,
although it can also be used with 1D (text or audio) and 3D (video) data.
A CNN uses different layers to classify an image. The first layer of a convolutional neural
network is the convolution layer, which gives the network its name. This layer performs an
operation called a convolution, which extracts features from the input signal.
When an input image is fed into a convolutional neural network, each of its layers
generates several activation maps. Each neuron takes a patch of pixels as input,
multiplies their color values by its weights, sums them up, and runs the result through
the activation function. The first convolution layer detects basic features such as
horizontal, vertical, and diagonal edges. The output of the first layer is the input of
the next layer [2].
The next layer is called the pooling layer. The pooling operation, which is chosen
according to the application, can be max pooling, min pooling, or average pooling. Pooling
is mainly used to reduce the dimensionality of the feature maps produced by the
convolution operation and to select the most significant features [3]. Due to the
complexity of CNNs, ReLU is the common choice of activation function, because it passes
gradients well during training by back-propagation. Networks trained with
back-propagation are feed-forward networks, in which signals propagate in only one
direction, from the inputs of the input layer to the outputs of the output layer.
The fully connected layers are the final layers in the CNN structure. There can be one or
more of them, and they are placed after a sequence of convolution and pooling layers. This
part of the network is also called the classification layer, and it takes the output of
the final convolution layer as input. Based on the activation map of the final convolution
layer, the classification layer outputs a set of confidence scores that specify how likely
the image is to belong to each class [1]. For example, in a CNN that detects cows,
elephants, and tigers, the output of the final layer is the probability that the input
image contains each of those animals. The last of the fully connected layers is known as
the softmax classifier and determines the probability of each class label over N classes.
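As a rough illustration of the softmax classifier described above, the following pure-Python sketch converts raw class scores into probabilities. The scores and the three class names are made-up values for illustration only; they are not outputs of the model in this report.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # and does not change the resulting probabilities.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes (assumed values).
classes = ["cow", "elephant", "tiger"]
probs = softmax([2.0, 1.0, 0.1])

# The predicted class is the one with the highest probability.
predicted = classes[probs.index(max(probs))]
```

The class with the largest score always receives the largest probability, so the predicted label here is the first class.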
Figure 1.1 Architecture of CNN [2]
1.3 Objectives
The main objective of this study is to classify images according to their categories or
classes using a convolutional neural network and to predict their class labels.
1.4
CHAPTER 2 BACKGROUND STUDY AND LITERATURE REVIEW
A neural network works similarly to the human brain’s neural network. A neuron in a neural
network is a mathematical function that collects and classifies information according to a
specific architecture. The network bears a strong resemblance to statistical methods such
as curve fitting and regression analysis.
2.1.2 CNN
A Convolutional Neural Network is a deep learning algorithm that can take in an image,
assign importance (learnable weights and biases) to various objects in the image, and
differentiate one from the other.
GoogLeNet: GoogLeNet is based on the Inception network, which is also popular among
CNNs. There are three versions of the Inception network, named Inception version 1,
2, and 3. The first version of the Inception network is called GoogLeNet and was developed
by a team at Google in 2014. It has 22 layers (27 when pooling layers are counted), and
the network achieved 93.33% top-5 accuracy on the ImageNet dataset [5].
CHAPTER 3 METHODOLOGY
3.1 Methodology
The main steps in the image classification process are shown in the following diagram.
Figure 3.1 Methodology
In this seminar the CIFAR10 dataset is used. First the dataset is loaded, then the loaded
data is preprocessed, and then a model is trained on the preprocessed data using the
Convolutional Neural Network approach. Finally, the trained model is used to predict the
classes of images.
3.2 Data Set Description
The CIFAR10 dataset is used in this seminar paper. CIFAR10 (Canadian Institute for
Advanced Research) is a collection of images. It contains 60,000 color images of size
32 x 32 in 10 different classes, with 6,000 images per class. The 10 classes represent
airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are
50,000 training images and 10,000 test images.
Name of   No. of    Image     Images per  No. of training  No. of test  Total no.
dataset   classes   size      class       images           images       of images
CIFAR10   10        32 x 32   6,000       50,000           10,000       60,000

Labels  0         1    2     3    4     5    6     7      8     9
Names   Airplane  Car  Bird  Cat  Deer  Dog  Frog  Horse  Ship  Truck
In the data normalization process there are two arrays, training and testing. Every
element of both arrays is divided by 255 so that all values lie between 0 and 1. Before
normalization, the training and testing arrays contain values between 0 and 255, where 0
means completely black and 255 means completely white.
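The normalization step can be sketched in a few lines. This is a minimal pure-Python illustration on a tiny made-up 2 x 2 grayscale image, not the CIFAR10 arrays themselves; with NumPy arrays the same result would come from a single division of the whole array by 255.

```python
def normalize(pixels):
    """Scale 8-bit pixel values (0-255) into the range 0-1."""
    return [[value / 255.0 for value in row] for row in pixels]

# Hypothetical 2 x 2 grayscale image (assumed values):
# 0 is completely black, 255 is completely white.
image = [[0, 51],
         [102, 255]]

normalized = normalize(image)
```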
In the image reshape/rotate process, the reshape function is used to change the
dimensions of the arrays so that the data fits the input size expected by the network.
A CNN consists of different components in its architecture, which are described below.
The input layer is the first layer of the CNN. It takes images as input and passes them on
to further layers for feature extraction. The image size used in this seminar is
(32, 32, 1), which means the image is 32x32 in height and breadth and has 1 channel
(grayscale only). Each image pixel has a value between 0 and 1.
The convolution layer is the first layer used to extract the various features from the
images. The majority of the computation occurs in this layer. In this seminar the Conv2D
function, which is available in the Keras library, is used for the convolution layer. It
takes filters, kernel_size, activation, padding, and input_shape as parameters. The
input_shape parameter takes the shape of the input image, as described for the input layer
above. The functions of the other parameters are described below.
3.4.2.1 Filters:
Filters is a parameter passed to the convolution layer. Here it is set to 32, which means
the layer can detect 32 different features, such as edges, in the input image.
The kernel_size parameter determines the size of each filter. In this seminar the kernel
size is set to 3, which means each of the 32 filters is 3x3 in height and breadth.
3.4.2.3 Stride
Stride is the number of pixels, or the distance, that the kernel moves over the input
matrix at each step. A larger stride value yields a smaller output size. Stride values of
two or more are rarely used. In this seminar a stride of 1 is used.
3.4.2.4 Padding
Padding is the process of adding layers of zeros around the input image, which controls
the size of the output. In this seminar the same padding is used for every image, because
all images have the same size, 32x32.
Figure 3.2 Operation of convolution with kernel 3, no padding, and stride 1 [7]
The figure above describes how the convolution operation is performed. The image is 5x5 in
size, each pixel has its own value, and the kernel is of size 3 with no padding and a
stride of one. The size of the feature map, that is, the output of the convolution with
kernel size 3, no padding, and stride 1, is calculated using the following formula.
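$$ n_h \times n_w := \left( \frac{n_h + 2p - f}{s} + 1 \right) \times \left( \frac{n_w + 2p - f}{s} + 1 \right) \qquad (1) $$
where $n_h$ and $n_w$ are the height and width of the image, $p$ is the padding, $f$ is the
filter size, and $s$ is the stride.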
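To make equation (1) concrete, here is a small sketch that evaluates it along one dimension. The function name `conv_output_size` is chosen for illustration, and the use of integer division is an assumption for the case where the stride does not divide the numerator evenly.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output length along one dimension: (n + 2p - f) / s + 1."""
    # Integer division is an assumption for strides that do not divide evenly.
    return (n + 2 * p - f) // s + 1

# 5 x 5 image, kernel size 3, no padding, stride 1.
out_h = conv_output_size(5, 3)
out_w = conv_output_size(5, 3)

# 32 x 32 image with the same kernel settings.
out_32 = conv_output_size(32, 3)
```

For the 5 x 5 image the feature map is 3 x 3, and for a 32 x 32 input the same settings give 30 x 30.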
The filter is applied to an area of the image, and a dot product is calculated between the
input pixels and the filter. This dot product is then written into an output array whose
size is determined by equation (1).
After each convolution operation, a convolutional neural network applies a Rectified Linear
Unit (ReLU) transformation to the feature map to introduce nonlinearity into the
model.
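The convolution and ReLU steps can be sketched on a toy example. The 5 x 5 image and the 3 x 3 vertical-edge kernel below are made-up values, and the functions `conv2d` and `relu` are illustrative helpers rather than the Keras implementation. As in most CNN libraries, the kernel is applied without flipping.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and take a dot product at each position."""
    f = len(kernel)
    out_h = (len(image) - f) // stride + 1
    out_w = (len(image[0]) - f) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Dot product between the kernel and the patch it currently covers.
            for a in range(f):
                for b in range(f):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output

def relu(feature_map):
    """Replace every negative value with zero."""
    return [[max(0, v) for v in row] for row in feature_map]

# Hypothetical 5 x 5 image with a dark vertical stripe in the middle column.
image = [[1, 1, 0, 1, 1] for _ in range(5)]

# Hypothetical 3 x 3 kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d(image, kernel)   # 3 x 3 output, may contain negatives
activated = relu(feature_map)         # negatives clipped to zero
```

Only the right-hand column of the feature map stays positive after ReLU, which is where the kernel sees a dark-to-bright transition.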
Pooling layers are commonly used immediately after convolution layers. The primary
aim of a pooling layer is to decrease the size of the convolved feature map in order to
reduce the computation and the number of parameters in the network. In this seminar the
max pooling operation is used, which is described below.
In max pooling, the largest element is taken from each region of the feature map. As the
filter moves across the input, it selects the pixel with the maximum value and sends it to
the output array. Max pooling with a filter (pool size) of 2×2 and a stride of 2 is used.
The figure below describes how the max pooling operation is performed.
In Figure 3.7, max pooling works by placing a 2×2 matrix on the feature map and
picking the largest value in that box. The 2×2 matrix is moved from left to right through
the entire feature map, picking the largest value at each position.
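A minimal sketch of 2×2 max pooling with stride 2 follows. The 4 x 4 feature map is a made-up example, and `max_pool2d` is an illustrative helper, not the Keras layer.

```python
def max_pool2d(feature_map, pool=2, stride=2):
    """Keep the largest value in each pool x pool window."""
    pooled = []
    for i in range(0, len(feature_map) - pool + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - pool + 1, stride):
            window = [feature_map[i + a][j + b]
                      for a in range(pool) for b in range(pool)]
            row.append(max(window))
        pooled.append(row)
    return pooled

# Hypothetical 4 x 4 feature map (assumed values).
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 2, 8],
]

pooled = max_pool2d(feature_map)
```

Each 2×2 block contributes one value, so the 4 x 4 map shrinks to 2 x 2.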
After the pooled feature map is obtained, the next step is to flatten it. Flattening
converts the data into a 1D array so that it can be fed to the next layer. The entire
pooled feature map is transformed into a single column, which is then connected to the
classification model, called the fully connected layer.
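Flattening can be sketched as follows, assuming the pooled feature map is stored as a nested list indexed by height, width, and channel. The small 2 x 2 x 2 example uses made-up values, and the zero-filled volume is only there to show the resulting length for a 5 x 5 x 64 input.

```python
def flatten(volume):
    """Flatten a height x width x channels nested list into a 1D list."""
    return [value for row in volume for pixel in row for value in pixel]

# Hypothetical 2 x 2 x 2 pooled feature map (assumed values).
pooled = [[[1, 2], [3, 4]],
          [[5, 6], [7, 8]]]
flat = flatten(pooled)

# A 5 x 5 x 64 volume filled with zeros flattens to 1600 values.
volume = [[[0] * 64 for _ in range(5)] for _ in range(5)]
flat_length = len(flatten(volume))
```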
3.4.5 Fully Connected Layers
The fully connected layers are also called dense layers. There can be one or more of them,
and they are placed after a sequence of convolution and pooling layers. In this seminar
three dense layers are used. The first two dense layers each have 128 neurons and use the
relu activation function. The last dense layer has 10 neurons, one per image class, and
uses the softmax activation function to classify the inputs, producing a probability
between 0 and 1 for each class.
In these layers, information is passed through the network and the prediction error is
determined. The error is then backpropagated through the network to improve the
prediction. The fully connected layers perform the classification task based on the
features extracted by the previous layers.
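The forward pass of a single dense layer can be sketched as a weighted sum plus a bias for each neuron, followed by an activation. The layer below is a toy with 3 inputs and 2 neurons; its weights, biases, and inputs are made-up values, and `dense` is an illustrative helper rather than the Keras layer.

```python
def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: each neuron computes a weighted sum of all inputs plus a bias."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(z)
    if activation == "relu":
        outputs = [max(0.0, z) for z in outputs]
    return outputs

# Hypothetical layer with 3 inputs and 2 neurons (assumed values).
x = [1.0, 2.0, 3.0]
w = [[0.5, -1.0, 0.0],    # weights of neuron 1
     [1.0, 1.0, -0.5]]    # weights of neuron 2
b = [0.0, 0.5]

hidden = dense(x, w, b, activation="relu")
```

The first neuron's weighted sum is negative and is clipped to zero by ReLU, while the second neuron passes its positive value through unchanged.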
In this study the images are classified using the following architecture.
The figure above shows the complete architecture of the CNN and how an input image is
classified. The input image size is 32 x 32 x 1. This image is passed to the Conv 1 layer,
which has 32 filters of size 3, stride 1, and no padding. The output matrix is then
30 x 30 x 32, with its size calculated using equation (1). The same concept is applied in
the Conv 2 layer. The first pooling layer uses a filter of size 2, stride 2, and no
padding, and its output size is also calculated from equation (1). After the pooling layer
come the Conv 3 and Conv 4 layers, which are the same as Conv 1 and Conv 2 but with 64
filters instead of 32. The same pooling layer is then applied again, giving an output of
size 5 x 5 x 64. After this pooling layer, the multi-dimensional output is flattened into
a one-dimensional vector of size 1600 x 1. The flattened values are then passed to two
fully connected layers (FC 3 and FC 4) of size 128 x 1, and finally the image is
classified.
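The layer sizes quoted above can be checked by applying the output-size formula layer by layer. The sketch below assumes the settings stated in the text (kernel 3, stride 1, no padding for convolutions; size 2, stride 2 for pooling). The layer labels "Pool 1" and "Pool 2" are names introduced here for illustration.

```python
def conv_out(n, f=3, p=0, s=1):
    """Output length of a convolution along one dimension."""
    return (n + 2 * p - f) // s + 1

def pool_out(n, f=2, s=2):
    """Output length of a pooling layer along one dimension."""
    return (n - f) // s + 1

# (name, kind, number of filters); pooling keeps the channel count.
layers = [
    ("Conv 1", "conv", 32),
    ("Conv 2", "conv", 32),
    ("Pool 1", "pool", None),
    ("Conv 3", "conv", 64),
    ("Conv 4", "conv", 64),
    ("Pool 2", "pool", None),
]

size, channels = 32, 1          # input image is 32 x 32 x 1
shapes = []
for name, kind, filters in layers:
    if kind == "conv":
        size = conv_out(size)
        channels = filters
    else:
        size = pool_out(size)
    shapes.append((name, size, size, channels))

flattened = size * size * channels   # length of the flattened vector
```

The trace reproduces the 30 x 30 x 32 output of Conv 1, the 5 x 5 x 64 output of the second pooling layer, and the flattened length of 1600.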
CHAPTER 4 IMPLEMENTATION
4.1 Numpy
An image is a two-dimensional data structure with height and width, so a NumPy array is
used to store the image data in a two-dimensional array. NumPy arrays are also used for
faster numerical calculation.
4.2 Matplotlib
Matplotlib is a widely used Python library for plotting graphs such as bar charts,
histograms, and pie charts. Here Matplotlib is used for model evaluation: to plot the
accuracy score and the loss values on the test data, and to analyze the results
graphically.
4.3 Keras
The Keras library is used to implement the whole convolutional neural network, including
the convolution, pooling, and fully connected layers.
4.4 Python
Python is an object-oriented programming language. The whole program is written in
Python.
along the height and width of the input; the default value is (1, 1), and the default
is used. The padding can take one of two values, valid or same. The activation
function is applied after the convolution is performed; the relu activation function
is used.
Keras MaxPool2D is a function used to create a pooling layer in the network. Its
operation is to select the maximum element from the region of the feature map
covered by the filter. This layer takes two parameters, the filter (pool size) and
the stride; filter = (2, 2) and stride = (2, 2) are used.
Keras Flatten is a function used to create a flatten layer in the network. The flatten
function takes no arguments, and it flattens the 2D output matrix of the convolution
into a 1D matrix.
Keras Dense is a function used to create a fully connected layer in the network.
It takes units and activation as arguments. Here units is a positive integer that
gives the output size (number of neurons) of the layer; 128 units are used. Activation
specifies the element-wise activation function applied in the dense layer; the relu
activation function is used.
CHAPTER 5 RESULT AND ANALYSIS
There are 1000 test samples in each class. With a total of 10 classes, the test data
contains 10000 samples overall. After training the model, we obtained the following
predicted values against the actual data, as shown in the table below.
Total 7547 2453
The table above shows the numbers of correctly and incorrectly classified samples in the
test set. There are 10 classes in the table, and each class has its own correct and
incorrect counts. For instance, of the 1000 airplane image samples, 801 are predicted as
airplane and 199 are predicted as other classes. The values for the other classes
(Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship, and Truck) are read in the same way.
5.1.1 Accuracy:
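The accuracy is the number of correctly predicted samples divided by the total number of
test samples. The accuracy score (AS) measures the share of image samples predicted
correctly, where CP is the number of correct predictions and TDs is the total number of
test samples. The accuracy score is calculated using the following formula.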
$$ AS = \frac{CP}{TDs} \times 100 \qquad (3) $$
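As a quick check of the accuracy formula, the sketch below applies it to the totals reported above (7547 correct out of 10000 test samples) and to the airplane class (801 correct out of 1000). The function name `accuracy_score` is chosen for illustration.

```python
def accuracy_score(correct, total):
    """Percentage of samples predicted correctly."""
    return correct / total * 100

# Totals over all 10 classes, taken from the results table.
overall = accuracy_score(7547, 10000)

# Airplane class: 801 of 1000 samples predicted correctly.
airplane = accuracy_score(801, 1000)
```

The overall value comes to roughly 75.47%, which matches the accuracy of about 75% reported for the model.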
CHAPTER 6 CONCLUSION
In this study, image classification using a CNN has been implemented and analyzed. It is
found that image classification using a CNN works through different layers, namely the
Convolution Layer, the Pooling Layer, and the Fully Connected Layer. After training the
Convolutional Neural Network on the input image samples, an accuracy of about 75% is
achieved, and hence a Convolutional Neural Network can be used for image
classification.
References
[8] V. Kurama, "PaperspaceBlog," 2020. [Online]. Available:
https://www.blog.paperspace.com/popular-deep-learning-architecture-alexnet-vgg-googlnet.
[Accessed 19 junly 2021].