
Project Report

On

Handwritten Digit Recognition using Machine Learning

(A Project Report submitted in partial fulfillment of the requirements of Bachelor


of Technology in Information Technology of
the West Bengal University of Technology, West Bengal)

Submitted by

MRINMAY GHOSH (10200217045)


NIRMALYA MUKHERJEE (10200217044)
KISHALAY RAY (10200217049)
VISHAL KUMAR (10200217018)

Under the guidance of

Dr. Malabika Sengupta


(Professor, Dept of Information Technology)

Kalyani Government Engineering College


(Affiliated to West Bengal University of Technology)
Kalyani - 741235, Nadia, WB
2020-2021

1

Certificate of Approval

This is to certify that Mrinmay Ghosh, Nirmalya Mukherjee, Kishalay Ray and Vishal Kumar have
done the final year project work entitled “Handwritten Digit Recognition using Machine Learning”
under my direct supervision and that they have fulfilled all the requirements relating to the Final
Year Project. It is also certified that this project work being submitted fulfills the norms of
academic standard for the B. Tech Degree in Information Technology of The West Bengal University
of Technology and that it has not been submitted for any degree whatsoever by them or anyone else
previously.

………………………………………… ……………………………………………..
Head Supervisor
Department of Information Technology Department of Information Technology
Kalyani Government Engineering College Kalyani Government Engineering College

………………………………………… ………………………………………………
Project Coordinator Examiner
Department of Information Technology
Kalyani Government Engineering College

2
ACKNOWLEDGEMENT

We would like to express our sincere gratitude to Prof. Malabika Sengupta, Professor and Head of
the Department of Information Technology, whose role as project guide was invaluable for the
project. We are extremely thankful for the keen interest she took in advising us, for the books
and reference materials provided, and for the moral support extended to us.

Last but not least, we convey our gratitude to all the teachers for providing us with the technical
skills that will always remain our asset, and to all the non-teaching staff for the cordial support
they offered.

Place: Kalyani
Date: _ _ _ _ _ _ _ _ _ _

Signature

Mrinmay Ghosh

Nirmalya Mukherjee

Kishalay Ray

Vishal Kumar

3
ABSTRACT
Handwritten digit recognition is a practical problem in pattern recognition applications. Its
applications include bank check processing, number plate detection, postal mail sorting and many
others. Users can submit handwritten digits via scanners, tablets or other digital devices. The most
important problem is the formulation of an efficient algorithm that recognizes these handwritten
digits given by users. The main objective of this project is to use efficient and reliable methods
to recognize handwritten digits. Many machine learning algorithms have been applied to this digit
recognition problem, some of them being KNN (K-Nearest Neighbors), SVM (Support Vector Machine),
Random Forest Classifier, CNN (Convolutional Neural Network), Naive Bayes, etc.

4
CONTENTS
Page No.

CHAPTER 1: INTRODUCTION 06

1.1 Background 06
1.2 Motivation 06-07
1.3 Summary of present Work 07-09
1.4 Organization of the Project 09-10
1.5. Required Resources 10

CHAPTER 2: PATTERN RECOGNITION 11

2.1 MNIST Data Set 11-13


2.2 KNN(K nearest neighbors) 13-16
2.3 SVM(Support Vector Machine) 17-21
2.4 CNN(Convolutional Neural Network) 21-25

CHAPTER 3: PROPOSED WORK 26-27

3.1 Dataset retrieval 27-29


3.2 KNN 29-30
3.3 SVM 31-33
3.4 CNN 34-39

CHAPTER 4: ANALYSIS AND COMPARISON OF RESULTS 40-41

CHAPTER 5: CONCLUSION 42-43

REFERENCES 44

5
CHAPTER 1 – INTRODUCTION
Handwritten documents have been in use since ancient times, but with the advent of computers it
became necessary to recognize them via electronic media. Digits are a part of our everyday life,
be it the license plates on vehicles, the speed limit on a road, the price of a product or bank
account details. When a piece of text is unclear, it is easier to guess the digits than the
alphabets. Handwritten digit recognition is the ability of computers to recognize human handwritten
digits: it takes the image of a digit and recognizes the digit present in the image. Handwritten
recognition enables us to convert handwritten documents into digital form.

1.1 Background
Handwritten document recognition is already widely used in the automatic processing of bank
cheques, postal addresses, etc. Many important documents, such as ancient scriptures, old
documents, property papers and many dossiers of historical reference, are in handwritten format.
To preserve those, they have to be converted or copied into a digital format, that's where the
significance of recognition of handwritten documents lies. Some of the existing systems include
computational intelligence techniques such as artificial neural networks or fuzzy logic, whereas
others may just be large lookup tables that contain possible realizations of handwritten digits.
Artificial neural networks have been developed since the 1940s, but only in the past fifteen years
have they been widely applied in a large variety of disciplines.
Originating from the artificial neuron, which is a simple mathematical model of a biological
neuron, many varieties of neural networks exist nowadays. Although some are implemented in
hardware, the majority are simulated in software. Artificial neural nets have successfully been
applied to handwritten digit recognition numerous times, with very small error margins.
The work described in this paper does not have the intention to compete with existing systems, but
merely served to illustrate to the general public how an artificial neural network can be used to
recognize handwritten digits. It was part of NeuroFuzzyRoute in the Euregio, an exposition in the
framework of the world exposition EXPO2000 in Hannover.
A handwritten digit recognition system was used in a demonstration project to visualize artificial
neural networks, in particular Kohonen's self-organizing feature map. The purpose of this
project was to introduce neural networks through a relatively easy-to-understand application to the
general public. Several journals and research works show that handwritten digit recognition has
been a subject of discussion over the years. We have tried to contribute to this ongoing process
through our project work.

6
1.2 Motivation
Handwritten recognition of documents and characters has been around since the 1980s. The task
of handwritten document recognition has great importance and many uses, such as online handwriting
recognition on computer tablets, recognizing zip codes on mail for postal sorting, processing
bank check amounts, reading numeric entries in forms filled in by hand (for example, tax forms) and so
on. There are different challenges faced while attempting to solve this problem: the handwritten
digits are not always of the same size, thickness, or orientation and position relative to the margins.
Our goal was to implement a pattern classification method to recognize the handwritten digits
provided in the MNIST data set of images of handwritten digits (0‐9). The data set used for our
application is composed of 300 training images and 300 testing images, and is a subset of the
MNIST data set [1] (originally composed of 60,000 training images and 10,000 testing images).
Each image is a 28 x 28 grayscale (0‐255) labeled representation of an individual digit. The
general problem we anticipated in this digit classification task was the similarity between digits
such as 1 and 7, 5 and 6, 3 and 8, and 9 and 8. Also, people write the same digit in many different
ways; the strokes, slants and overall shapes of digits such as ‘1’ and ‘7’ vary considerably from
one writer to another. Finally, the uniqueness and variety in the handwriting of different individuals
also influences the formation and appearance of the digits. Our work mainly focuses on the application
of various algorithms to recognize the digits and their accuracies in doing the same.

1.3 Summary of present work

1.3.1 Data Processing:


At first we took the samples provided by the MNIST (Modified National Institute of
Standards and Technology) dataset containing handwritten digits, more precisely a total of 70,000
images consisting of 60,000 examples in the training set and 10,000 examples in the testing set.
Each dataset (training and testing) contains images which are labeled as 0-9 (10 digits).
In the CSV format of the MNIST dataset we have two files.
1. Mnist_train.csv
2. Mnist_test.csv
In the 'mnist_test.csv' file we have our testing data, with which we will test our trained model
and check its accuracy; it has 10000 examples to test our model. We used Pandas to read our
training and testing data, which are in CSV format. We used 20000 training examples and 1000
testing examples with pixel values ranging from 0-255, so we normalized them to the range 0-1.
Using NumPy, we converted the images to fractional arrays where the data are pixel values between
0 and 1.
A) Use of KNN Algorithm
We imported the KNeighborsClassifier and accuracy_score from Scikit-learn in Python.
After loading the testing and training data using Pandas, we split the data into training images,
training labels, testing images and testing labels. After formatting our dataset we fit the data
into our KNN classifier. After training was completed we predicted the labels of our test set and
calculated the accuracy of the model; the accuracy of KNN is 96.88%.

B) Use of SVM Algorithm

In our implementation of the SVM code we used the Pipeline and GridSearchCV tools of
Scikit-learn. Before feeding data to an algorithm the dataset must be well-formatted and noise-free,
and standardization is required. We have used the StandardScaler() function to normalize the pixel
values of our images (28x28). This function standardizes features by removing the mean and scaling
to unit variance.
After that we used SVC (Support Vector Classifier) with a polynomial kernel. With each different
permutation of the pair (C, gamma) we get a different decision boundary. After 5-fold
cross-validation (CV), the best combination is chosen among the different permutations of
parameters. All these computations are done by the GridSearchCV tool.
After all these computations we are ready to train our model using our training dataset.
After training, our model is ready to predict handwritten digits and the accuracy of the model is
obtained; the accuracy of SVM is 97.83%.

C) Use of CNN Algorithm

Next we use the CNN algorithm for quick digit recognition. After loading our dataset with the help
of Keras we initialize it and segregate our dataset into training and testing sets.
There are 60000 training examples and 10000 testing examples.
After segregating our dataset into training and testing sets we need to preprocess our data by
reshaping it. The parameters passed are the number of examples (60000 for training and 10000 for
testing), the size of the images, i.e. 28x28, and lastly 1 to denote a grayscale image.

In the next part of our code we performed one-hot Encoding.

The last layer of our CNN model will contain 10 nodes, each corresponding to an individual digit
(first node -> 0, second node -> 1, and so on). When we feed an image into the model, the model
returns a probability for that image under every node, and the predicted digit is the one whose
node has the highest probability. For instance, if the last node has the highest probability, then
the predicted digit is 9.

Each label should be in the form of an array of 10 elements in which only one element equals 1
(the true class) and all the rest equal 0.
For example, if the image is of the number 8, then instead of the label being 8, it will have the
value 1 in the ninth column and 0 in the rest of the columns, like [0,0,0,0,0,0,0,0,1,0].
This is done by the function 'to_categorical'.
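
For illustration, a minimal sketch of this encoding step is given below; it assumes Keras is installed and that the labels are plain integers from 0 to 9, and it is not the exact listing used in the project.

# One-hot encoding of integer labels with Keras' to_categorical
import numpy as np
from keras.utils import to_categorical   # in newer installs: tensorflow.keras.utils

labels = np.array([8, 3, 0])             # example integer labels
one_hot = to_categorical(labels, num_classes=10)
print(one_hot[0])                        # [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]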

8
Now we will build our model and here we will be using a sequential model. Firstly, layers of the
model are defined and then added one after another to complete the model.
Our first two layers are Conv2D, i.e. convolution layers, that deal with input images taken as a
2-D matrix.
The first layer has 32 nodes and the second layer has 64 nodes.
We are using a 3x3 filter matrix, i.e. kernel_size=(3, 3).
The activation function used in our first two layers is ReLU (Rectified Linear Unit).
ReLU outputs 0 for negative inputs and passes positive inputs through unchanged.

The third layer is the MaxPooling layer that helps in choosing the best feature and helps in
reduction of dimensions of the input.

In the fourth layer we used dropout to reduce overfitting. It randomly turns neurons off during
training, which improves generalization.

In between the Convolutional Layer and the Fully Connected Layer there exists a Flatten Layer
which is our fifth Layer. Flattening converts the 2-D feature matrix to a vector that can act as an
input to a fully connected Neural Network Classifier.

Our sixth layer is the dense layer. It is the regular, densely connected neural network layer. It
takes input from the Flatten layer (i.e. the feature vector) and produces output as follows:
Output = activation (dot (input, kernel) + bias)

In our sixth Layer we have our activation as ‘ReLu’.


This dense layer has 128 units, each with its own bias, which determine our output.

In our seventh layer we again used Dropout to reduce overfitting.

Our last and eighth layer is a dense layer with 10 units and ‘Softmax’ activation.
Softmax makes the outputs sum to 1, so that the output is a series of probabilities.
The model predicts the number which has the highest probability among all 10 numbers.

As all our 8 layers are ready, we add them up to complete the architecture of the model.
After adding the layers we need to compile our model. After compiling the model we fit our training
data to it, passing our training images and training labels to the fit function. Batch_size is the
number of samples per gradient update; if unspecified, batch_size defaults to 32.
One epoch is when the full dataset is passed forward and backward through the neural network once.
One epoch is too big to be fed to the algorithm at once, so we divide it into batches of 128
examples. As we increase the number of epochs the accuracy of the model increases up to a certain
extent and then becomes roughly constant. We will observe this in the Result section.
Verbose just lets us watch the training progress for each epoch. Verbose can take the values 0, 1
and 2: for 0 we see nothing (silent), for 1 we see an animated progress bar, and for 2 it just
mentions the epoch number.

9
At last we move to the final step, evaluating the model on the testing dataset. After evaluation we
get the accuracy of our model and the errors it makes. The accuracy of digit recognition using the
CNN algorithm comes to around 99.02%.

1.4 Organization of Project

Chapter 1 of our project consists of the background for selecting this topic and the requirements
of this work, covered from page 06 to 10.
Chapter 2 contains loading and processing of data from the MNIST dataset and the various algorithms
for pattern recognition, covered in pages 11 to 25. The KNN algorithm is explained on pages 13 to
16, the SVM algorithm on pages 17 to 21, and the CNN algorithm on pages 21 to 25. We have discussed
all of those details below.
Chapter 3 contains all our proposed work, including the code for dataset retrieval, KNN, SVM & CNN,
covered in pages 26 to 39.
Chapter 4 contains comparisons between all the algorithms and the necessary analysis, given in
pages 40 and 41. Various calculation tables and graphs are attached wherever necessary.
Chapters 5 and 6 contain the conclusion and references of the project, given in pages 42 to 44.

1.5 Required resources

1.5.1 Hardware requirements


∙ 2.6 GHz Processor
∙ 8 GB RAM
∙ 128 GB HDD

1.5.2 System Tool Requirements

∙ Python 3.5+
∙ Scikit-Learn
∙ NumPy
∙ Matplotlib
∙ Pandas
∙ Keras
∙ Running Interface: Anaconda Prompt
∙ Operating System: Windows, Linux, Mac
∙ Programming Language: Python
∙ Editor: Notepad++
∙ Software: Chrome, Mozilla

10
CHAPTER 2 - PATTERN RECOGNITION
We have studied many algorithms to recognize handwritten digits. In this chapter we describe the
details of those algorithms which may give good accuracy in predicting the handwritten digits.

2.1 MNIST Dataset:


The MNIST (Modified National Institute of Standards and Technology) dataset provides samples of
handwritten digits, more precisely a total of 70,000 images consisting of 60,000 examples in the
training set and 10,000 examples in the testing set. Each dataset (training and testing) contains
images which are labeled 0-9 (10 digits).
Each handwritten digit is a 28x28-pixel grayscale image. In the dataset, the first column is the
label (0-9) for each and every image. The testing dataset contains 10000 examples, also having
labels from 0-9.

Images of the training and testing digits were taken from different sources. The images are
normalized and centered, which makes the MNIST dataset an excellent database for evaluating models
with very little data cleaning and preprocessing.
The original black and white images from NIST were size-normalized to fit in a 20x20 pixel box
while preserving their aspect ratio. The resulting images contain grey levels due to the
anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28
image by computing the center of mass of the pixels and translating the image so as to position
this point at the center of the 28x28 field.

It has been observed that the error rate improves when the digits are centered by bounding box
rather than center of mass (especially for classification algorithms such as SVM and KNN).

The original format of the MNIST datasets consists of four files.


1. train-images-idx3-ubyte.gz: Training set images (9912422 bytes)
2. train-labels-idx1-ubyte.gz: Training set labels (28881 bytes)
3. t10k-images-idx3-ubyte.gz: Test set images (1648877 bytes)
4. t10k-labels-idx1-ubyte.gz: Test set labels (4542 bytes)
But in this project, we are using the CSV format of the MNIST dataset as the original format
requires a lot of pre-processing and formatting of the data.

In the CSV format of the MNIST dataset we have two files.


1. Mnist_train.csv
2. Mnist_test.csv

CSV stands for "Comma Separated Values". As the name suggests, the data (pixels) are arranged
in tabular format separated by commas.

11
The first column, named label, contains the labels of the training and testing data. The next 784
columns (28x28) hold the pixel values of each individual image, which has a resolution of 28x28
pixels. The 28x28-pixel image is made 1x784 by the process of row flattening.

Fig 5.3 and Fig 5.4 show the mnist_train.csv file.


In the 'mnist_train.csv' file we have our training data with which we will be training our model for
predicting the handwritten digits. It has 60000 examples to train our model.

Fig 5.1 and 5.2 show the mnist_test.csv file.


In the 'mnist_test.csv' file we have our testing data with which we will be testing our trained model
and check its accuracy. It has 10000 examples to test our model.

1. mnist_test.csv

Fig-5.1

12
Fig-5.2

2. mnist_train.csv

Fig-5.3

Fig-5.4

2.2 KNN (K nearest neighbors)

KNN is a non-parametric method or classifier used for both classification and regression problems.
It is a lazy learning classification algorithm in which all of the computation is deferred until
the final stage of classification. As it is among the simplest algorithms and the easiest to
implement, it does not perform any generalization of the training data.

Working of KNN Algorithm:

K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new data
points, which actually means that the new data point will be assigned a value or class based on how
closely it matches the data points in the training set. A stepwise implementation of the KNN
algorithm is given below:

Step 1 − Load the training and testing data.

Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider.
K can be any integer (K is usually taken to be odd, as an even value may lead to ties).

Step 3 – For each point in the test data do the following −

3.1 − Calculate the distance between test data and each row of training data with the help of any
of the methods, namely: Euclidean, Manhattan or Hamming distance. The most commonly used
method to calculate distance is Euclidean.

3.2 − Sort them in ascending order based on their distances.

3.3 − Next, it will choose the top K rows from the sorted array.

3.4 − Now, it will assign a class to the test point based on the most frequent class of these rows.

Step 4 − End
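
For illustration only, a minimal NumPy sketch of these steps is given below (the project itself uses Scikit-learn's KNeighborsClassifier, as shown in Chapter 3). The arrays X_train, y_train and x_test are assumed to hold flattened, normalized images and their labels.

# Minimal sketch of the KNN steps above using Euclidean distance
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Step 3.1: distance from the test point to every training point
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Steps 3.2 and 3.3: sort by distance and take the top K rows
    nearest = np.argsort(distances)[:k]
    # Step 3.4: assign the most frequent class among the K neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]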

Advantages of KNN algorithm:

1. Implementation is easy.
2. It can be effective if the training data is large.

Disadvantages of KNN algorithm:

1. K is the most important factor in the KNN algorithm, and evaluating a good value of K can often
be complex.
2. The cost of computation can sometimes be high due to the calculation of all the distances
between data points.

K is the most Important Factor:

14
Fig-2.1: Importance of K in KNN algorithm

From the above figure 2.1 we can easily see the importance of K in KNN algorithm.
We have 2 classes: Red and Blue. Green is the Test data point.
If we take K=2, then the two nearest neighbors of the green point are red which signifies the green
point belongs to class Red.
If we take K=3, then the three nearest neighbors of the green point are two reds and a blue which
again implies the test point belongs to class Red.
If we take K=5, then the five nearest neighbors of the green point are two reds and three blues
which signifies the test point belongs to class Blue.

Hence we can see that depending upon the choice of the value of K our prediction differs. So the
value of K should be chosen cautiously.

Distance Functions:
Different distance functions used in KNN are-
1. Euclidean function
2. Manhattan function
3. Minkowski function
4. Hamming distance
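
As a small illustration, these distance functions can be written in NumPy as follows; a and b are assumed to be feature vectors of equal length, and the value of p in the Minkowski distance is just an example.

# Sketch of the four distance functions listed above
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=3):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def hamming(a, b):
    # number of positions at which the two vectors differ
    return np.sum(a != b)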

15
Digit 0 1 2 3 4 5 6 7 8 9
0 613 0 0 0 0 1 2 0 0 1
1 0 701 0 0 1 0 1 1 0 0
2 5 9 538 1 1 0 0 7 0 1
3 0 1 4 597 0 6 0 1 1 1
4 0 4 0 0 576 0 0 0 0 13
5 1 1 1 4 3 521 6 0 0 5
6 3 1 0 0 1 1 588 0 0 0
7 0 7 2 0 2 1 0 575 0 3
8 1 10 1 5 0 8 1 2 554 7
9 2 0 0 5 4 4 1 8 0 574

Fig 2.2: Confusion matrix for Trained Data set using KNN classifier

Table 2.1: Precision, Recall and F1 score for KNN on trained data set

16
2.3 SVM (Support Vector Machine)

Support Vector Machine (SVM) is one of the most popular Supervised Learning algorithms, which
is used for Classification as well as Regression problems. Mostly it is used for Classification
problems in Machine Learning.
An SVM model is a representation of the different classes separated by a hyperplane in
multidimensional space. The hyperplane is generated in an iterative manner by SVM so that the error
is minimized. The goal of SVM is to divide the dataset into classes by finding a maximum marginal
hyperplane.
Here in our project we have 10 classes of digits (0-9) and we used SVM for predicting the
handwritten digits.

Components of SVM algorithm:

2.3.1 Hyperplane: There can be many lines/decision boundaries to divide the classes in n-
dimensional space, but the best decision boundary that segregates the classes is known as
Hyperplane.
The dimension of the hyperplane depends on the number of features present in the dataset: if there
are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there
are 3 features, then the hyperplane will be a 2-dimensional plane.
The hyperplane should have the maximum margin, i.e. the maximum distance between the data points.

2.3.2 Support Vector: The data points or vectors that are closest to the hyperplane and affect its
position are termed Support Vectors. As these vectors support the hyperplane, they are called
support vectors.

Working of SVM:
SVMs are of two types:
A) Linear SVM
B) Non-Linear SVM

A) Linear SVM

In figure 2.3 below we can see that the two classes (Green and Blue) are separated by an optimal
hyperplane.
As the classes are in 2-d space, the classes can be divided using a straight line. There can be many
possible lines that can separate the classes.
Out of the infinitely many possible lines, SVM chooses the best one, known as the hyperplane. SVM
finds the points of both classes that are closest to the line (the support vectors). The distance
between these vectors and the hyperplane is known as the margin, and the main goal of the SVM
algorithm is to maximize this margin.

17
Fig-2.3: Linear SVM

B) Non-Linear SVM

In the below figure 2.4 we see an example where the Linear decision boundary (Straight line)
won’t work. Here for non-linear data we need a non-linear SVM algorithm.

Fig-2.4: Nonlinear SVM

For Linear SVM we used two dimensions i.e. X and Y. Here for non-linear SVM let’s add one
more dimension to it.
Say z = x^2 + y^2

18
By adding the third dimension, our data will look as shown in figure 2.5 below.

Fig-2.5: Non-linear SVM data in 3-D

Fig-2.6 is a 3-D plot, so what appears as a line parallel to the x-axis is actually a plane at a
constant value of z, which is our hyperplane for the given example.

Fig-2.6 Hyperplane in nonlinear SVM

19
To visualize the above figure in 2-d we need to take z=1.
So our equation of Hyperplane becomes:

x^2 + y^2 = 1

Fig-2.7 Equation of Hyperplane
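
A tiny illustrative sketch of this mapping is given below; the sample points are made up, but they show that points inside the unit circle get z < 1 and points outside get z > 1, so a plane at a constant value of z separates the two classes.

# Mapping 2-D points to the third dimension z = x^2 + y^2
import numpy as np

points = np.array([[0.2, 0.3],    # inside the circle  -> class A
                   [0.5, -0.4],   # inside the circle  -> class A
                   [1.5, 0.2],    # outside the circle -> class B
                   [-1.0, 1.2]])  # outside the circle -> class B

z = points[:, 0] ** 2 + points[:, 1] ** 2   # the added third dimension
print(z)         # [0.13 0.41 2.29 2.44]
print(z < 1.0)   # [ True  True False False] -- separable by a single threshold on z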

Fig 2.8: Confusion matrix for Trained Data Set

20
Table 2.2 :Precision, Recall and F1 score using SVM classifier

Advantages of the SVM classifier:

1. It provides great accuracy.

2. It works well in high-dimensional spaces.

Disadvantages of SVM classifier:

1. It usually has a high training time and can be time-consuming for large datasets.
2. SVM underperforms if the dataset is noisy, i.e. when target classes overlap.

2.4 CNN (Convolutional Neural Network)


In neural networks, convolutional neural networks (ConvNets or CNNs) are one of the main classes
used for image recognition and image classification. Image recognition, face recognition and so on
are some of the areas where CNNs are widely used.
When a CNN model is trained and tested, each input image passes through a series of convolution
layers with filters (kernels), pooling layers and fully connected (FC) layers, and a Softmax
function is applied to classify the object with probabilistic values between 0 and 1.
Convolution is the first layer used to extract features from an input image. Convolution preserves
the relationship between pixels by learning image features using small squares of input data. It is
a mathematical operation that takes two inputs, an image matrix and a filter or kernel.

21
Fig 2.9: 3D Image Matrix

Consider a 6x6 grayscale image, shown in figure 2.10 below.

3 0 1 2 7 4

1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

Fig –2.10: 6X6 gray scale


Now we will convolve this 6x6 image matrix with the 3x3 filter given in figure 2.11.

6x6 image matrix:
3 0 1 2 7 4
1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

3x3 filter:
1 0 -1
1 0 -1
1 0 -1

Fig-2.11: 6x6 matrix with 3x3 filter

22
We take the first 3 X 3 matrix (filter size) from the 6 X 6 image and multiply it with the filter.
Now, the first element of our 4 X 4 output matrix will be the sum of the element-wise product of
these values, i.e. 3*1 + 0 + 1*-1 + 1*1 + 5*0 + 8*-1 + 2*1 + 7*0 + 2*-1 = -5. To calculate the
second element of our 4 X 4 output, we will shift our filter (3 x 3) one step towards the right and
by the same method as mentioned above get the sum of the element-wise product. Figure 2.12 shows
the resulting output matrix.

-5 -4 0 8
-10 -2 2 3
0 -2 -4 -7
-3 -2 -3 -16

Fig-2.12 : Output matrix
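
The same multiply-and-sum arithmetic can be checked with a few lines of NumPy; this is only an illustrative sketch of a valid convolution (no padding, stride 1), not the code used in the project.

# Verifying the 6x6 * 3x3 convolution example above
import numpy as np

image = np.array([[3, 0, 1, 2, 7, 4],
                  [1, 5, 8, 9, 3, 1],
                  [2, 7, 2, 5, 1, 3],
                  [0, 1, 3, 1, 7, 8],
                  [4, 2, 1, 6, 2, 8],
                  [2, 4, 5, 2, 3, 9]])
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

out = np.zeros((4, 4), dtype=int)
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
print(out[0, 0])   # -5, as computed by hand above
print(out)         # the full 4x4 output matrix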

2.4.1 Padding:
When we have an NxN image matrix and an FxF filter matrix, the output matrix will be of size
(N-F+1) x (N-F+1).

There are basically two drawbacks here:

1. Each time we apply a convolution operation, the size of the image shrinks.

2. Pixels near the edges of the image are used only a few times during convolution compared with
the central pixels, so information near the corners contributes less and can be lost.

To overcome this issue, padding is needed, i.e. adding extra pixels all around the matrix.

After padding the input matrix:

● Input: NxN
● Padding: p
● Filter: FxF
● Output matrix: (N+2p-F+1) x (N+2p-F+1)

2.4.2 Strides:

Stride is the number of pixels by which the filter is shifted over the input matrix. When the
stride is 1 we move the filter by 1 pixel at a time; when the stride is 2 we move the filter by 2
pixels at a time, and so on. Stride helps to reduce the size of the image, which is a particularly
useful property.

After striding the input matrix:

● Input: NxN
● Padding: p
● Stride: s
● Filter: FxF
● Output matrix: [(N+2p-F)/s + 1] x [(N+2p-F)/s + 1]
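
As a quick illustrative check of this formula, using values from this chapter's own examples:

# Output size of a convolution: (N + 2p - F)/s + 1
def conv_output_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))        # 4  -> the 6x6 image with a 3x3 filter above
print(conv_output_size(28, 3))       # 26 -> a 28x28 MNIST image with a 3x3 kernel
print(conv_output_size(28, 3, p=1))  # 28 -> the same convolution with padding of 1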

2.4.3 Pooling Layers:

A pooling layer section reduces the number of parameters when the images are too large. Spatial
pooling, also called subsampling or downsampling, reduces the dimensionality of each feature map
while retaining the important information. Spatial pooling can be of various types:

1. Max Pooling: Takes the largest element from the rectified feature map.
2. Avg Pooling: Takes the avg of all elements.
3. Sum Pooling: Takes the sum of all elements.

Max pooling with 2x2 filters and stride 2:

Input:          Output:
1 1 2 4         6 8
5 6 7 8         3 4
3 2 1 0
1 2 3 4

Fig-2.13: Maximum pooling

Average pooling with 2x2 filters and stride 2:

Input:          Output:
0 0 2 4         1 5
2 2 6 8         6 2
9 3 2 2
7 5 2 2

Fig-2.14: Average pooling

24
2.4.4 Fully Connected Layer (FC) :
In the FC layer, we flatten our matrix into a vector and feed it into a fully connected layer like
a neural network. In other words, the feature map matrix is converted into a vector. With all these
connected layers a model is created, and finally, using an activation function such as Softmax or
sigmoid, the output is classified.

Fig-2.15: Fully Connected Layer

Advantages of CNN:

1. They are very efficient and accurate.

2. They work very well on image recognition problems.

Disadvantages of CNN:

1. If the system's processor is not powerful, training the model takes a lot of time.
2. High computational cost.

25
CHAPTER 3 - PROPOSED WORK

i) At first we took the samples provided by the MNIST (Modified National Institute of
Standards and Technology) dataset containing handwritten digits, more precisely a total of 70,000
images consisting of 60,000 examples in the training set and 10,000 examples in the testing set.
Each dataset (training and testing) contains images which are labeled as 0-9 (10 digits).
ii) We used Pandas to read our training and testing data, which are in CSV format. We used 20000
training examples and 1000 testing examples with pixel values ranging from 0-255, so we normalized
them to the range 0-1. Using NumPy, we converted the images to fractional arrays where the data are
pixel values between 0 and 1.

iii) We imported the KNeighborsClassifier and accuracy_score from Scikit-learn in Python.

After loading the testing and training data using Pandas, we split the data into training images,
training labels, testing images and testing labels. After formatting our dataset we fit the data
into our KNN classifier. After training was completed we predicted the labels of our test sets and
calculated the accuracy of the model.

iv) We used the Pipeline and GridSearchCV tools of Scikit-learn. Before feeding data to an
algorithm the dataset must be well-formatted and noise-free, and standardization is required. We
have used the StandardScaler() function to normalize the pixel values of the images (28x28). This
function standardizes features by removing the mean and scaling to unit variance.
After that we used SVC (Support Vector Classifier) with a polynomial kernel. With each different
permutation of the pair (C, gamma) we get a different decision boundary. After 5-fold
cross-validation (CV) the best combination is chosen among the different permutations of
parameters. All these computations are done by the GridSearchCV tool.

v) We use the CNN algorithm for quick digit recognition. After loading our dataset with the help of
Keras we initialize it and segregate our dataset into training and testing sets.
There are 60000 training examples and 10000 testing examples. After segregating our dataset into
training and testing sets we need to preprocess our data by reshaping it. The parameters passed are
the number of examples (60000 for training and 10000 for testing), the size of the images, i.e.
28x28, and lastly 1 to denote a grayscale image.
In the next part of our code we performed one-hot encoding. After adding the layers we need to
compile our model. After compiling the model we fit our training data to it, passing our training
images and training labels to the fit function. Batch_size is the number of samples per gradient
update; if unspecified, batch_size defaults to 32.
One epoch is when the full dataset is passed forward and backward through the neural network once.
One epoch is too big to be fed to the algorithm at once, so we divide it into batches of 128
examples. As we increase the number of epochs the accuracy of the model increases up to a certain
extent and then becomes roughly constant. We will observe this in the Result section.

26
Verbose just lets us watch the training progress for each epoch. Verbose can take the values 0, 1
and 2: for 0 we see nothing (silent), for 1 we see an animated progress bar, and for 2 it just
mentions the epoch number.

At last we move to the final step, evaluating the model on the testing dataset. After evaluation we
get the accuracy of our model and the errors it makes.

3.1 Dataset Retrieval
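
The original listing for this section appeared in the report as a screenshot. Below is a minimal sketch of the steps it describes, assuming the files mnist_train.csv and mnist_test.csv are in the working directory and contain a header row (pass header=None to read_csv otherwise).

# Reading the MNIST CSV files and normalizing the pixel values to 0-1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("mnist_train.csv")   # first column: label, next 784: pixels
test = pd.read_csv("mnist_test.csv")

fac = 255.0                              # scale pixel values from 0-255 to 0-1
train_images = train.iloc[:20000, 1:].values.astype(np.float64) / fac
train_labels = train.iloc[:20000, 0].values
test_images = test.iloc[:1000, 1:].values.astype(np.float64) / fac
test_labels = test.iloc[:1000, 0].values

# Display one digit to verify that the data was read correctly
plt.imshow(train_images[0].reshape(28, 28), cmap="gray")
plt.title("Label: " + str(train_labels[0]))
plt.show()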

27
OUTPUT

Loading the MNIST Dataset:


In the above code, we use Pandas to read our training and testing data, which are in CSV format. We
are using 20000 training examples and 1000 testing examples. All the pixel values range from 0-255,
so to normalize them to the range 0-1 we used the variable 'fac'. Using NumPy we converted the
images to fractional arrays where the data are pixel values between 0 and 1.
At last, using the Matplotlib library, we have displayed the MNIST images as shown in the figures
below.

3.2 KNN Algorithm:
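
The KNN listing also appeared as a screenshot; a minimal sketch of the described steps is given below. It reuses the arrays from Section 3.1, and the value of n_neighbors is an assumption, since the report does not state which K was used.

# Fitting and evaluating the KNN classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=3)     # assumed K
knn.fit(train_images, train_labels)           # "train" the lazy learner
predictions = knn.predict(test_images)        # predict labels for the test images
print("Accuracy of KNN:", accuracy_score(test_labels, predictions) * 100)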

Result:

1. Trained 20000 cases and tested 1000 cases

29
2. Trained all 60000 and tested upon all 10000 cases:

So after training and testing upon different numbers of cases we get the following results:

1. Trained 10000 cases and tested 1000 cases : 91.6% accuracy

2. Trained 20000 cases and tested 1000 cases : 93.7% accuracy

3. Trained all 60000 cases and tested 1000 cases : 96.1% accuracy

4. Trained all 60000 cases and tested upon all 10000 cases : 96.88% accuracy

Accuracy of KNN: 96.88

Above is the implementation of the KNN algorithm for Handwritten Digit Recognition.

We imported the KNeighborsClassifier and accuracy_score from Scikit-learn in Python.
After loading the testing and training data using Pandas, we split the data into training images,
training labels, testing images and testing labels. After formatting our dataset we fit the data
into our KNN classifier. After training was completed we predicted the labels of our test sets and
calculated the accuracy of the model.

30
3.3 SVM Algorithm:
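
The SVM listing likewise appeared as a screenshot. The sketch below reconstructs the described pipeline under stated assumptions: the parameter grid mirrors the values quoted later in this section, and reading '10e^5' as 1e5 is an assumption.

# StandardScaler + polynomial-kernel SVC tuned with 5-fold GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

pipe = Pipeline([
    ("scaler", StandardScaler()),     # remove the mean, scale to unit variance
    ("svc", SVC(kernel="poly")),      # polynomial kernel, as in the report
])
param_grid = {
    "svc__C": [0.001, 0.1, 100, 1e5],
    "svc__gamma": [10, 1, 0.1, 0.01],
}
search = GridSearchCV(pipe, param_grid, cv=5)   # 5-fold cross-validation
search.fit(train_images, train_labels)
predictions = search.predict(test_images)
print("Accuracy of SVM:", accuracy_score(test_labels, predictions) * 100)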

Result:

31
1. Trained 10000 cases and tested 1000 cases:

2. Trained 20000 cases and tested 1000 cases

3. Trained all 60000 cases and tested all 10000 cases:

32
So after training and testing upon different numbers of cases we get the following results:

1. Trained 10000 cases and tested 1000 cases : 95.5 % accuracy


2. Trained 20000 cases and tested 1000 cases : 96.6 % accuracy
3. Trained all 60000 cases and tested upon all 10000 cases: 97.83 % accuracy

Accuracy of SVM: 97.83 %

Firstly using pandas we fetched our CSV format training and testing dataset.
In the above implementation of SVM code we are using Pipeline and GridSearchCV tools of
Scikit-learn.
Before feeding any data to an algorithm, the dataset must be well-formatted and noise-free.
Standardization of a dataset is a common requirement for many machine learning estimators; they may
behave badly if the individual features do not more or less look like standard normally distributed
data.
We have used the StandardScaler() function to normalize the pixel values of our images (28x28).
This function standardizes features by removing the mean and scaling to unit variance.
After that we used SVC (Support Vector Classifier) with a polynomial kernel.
In a typical machine learning workflow you will need to apply all these transformations (data
normalization, feature scaling etc) at least twice. Firstly, when training the model and secondly on
any new data you want to predict on. Pipeline is a good tool that enforces the implementation and
order of the steps provided to it.
The most important parameters of the SVM algorithm in determining the decision boundary are 'C' and
'gamma'. We set the parameters with values such as
C = 0.001, 0.1, 100, 10e^5
gamma = 10, 1, 0.1, 0.01
With each different permutation of the pair (C, gamma) we get a different decision boundary. After
5-fold cross-validation (CV) of the results, the best combination is chosen among the different
permutations of parameters. All these computations are done by the GridSearchCV tool.
After all these computations we are ready to train our model by using our training dataset.
After training, our model is ready to predict handwritten digits and accuracy of the model is
obtained.

33
3.4 CNN Algorithm:
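
The CNN listing appeared as a screenshot as well. Below is a minimal sketch of the eight-layer architecture described in this report; the dropout rates (0.25 and 0.5) are assumptions borrowed from the standard Keras MNIST example, since the report does not state them.

# Eight-layer CNN: Conv2D(32) -> Conv2D(64) -> MaxPooling -> Dropout -> Flatten
#                  -> Dense(128) -> Dropout -> Dense(10, softmax)
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from keras.utils import to_categorical

# Load and reshape: 60000 training and 10000 testing 28x28 grayscale images
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(10000, 28, 28, 1).astype("float32") / 255.0
y_train = to_categorical(y_train, 10)   # one-hot encode the labels
y_test = to_categorical(y_test, 10)

model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(28, 28, 1)),
    Conv2D(64, kernel_size=(3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(10, activation="softmax"),
])

# The report tried both the 'adam' and 'adadelta' optimizers
model.compile(loss="categorical_crossentropy", optimizer="adadelta",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=128, epochs=10, verbose=1)
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy of CNN:", accuracy * 100)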

34
Result:

A) Using Optimizer as ‘Adam’

1. No of Epoch = 5

2. No of Epoch = 10

B) Using Optimizer as ‘Adadelta’

1. No of Epoch = 5

35
2. No of Epoch = 10

So, after Training 60000 examples and testing 10000 examples we get the following results based
on No. of Epoch and Optimizer used.

A) Optimizer used: Adam

1. No of epoch: 3 - 98.40 %

2. No of epoch: 5 - 98.78 %

3. No of epoch: 10 - 98.98 %

B) Optimizer used: Adadelta

1. No of epoch: 3 - 98.75 %

2. No of epoch: 5 - 99.01 %

3. No of epoch: 10 - 99.02 %

36
Accuracy table based on No. Of Epoch and Optimizer used :

                        Epoch = 3       Epoch = 5       Epoch = 10
                        (Accuracy %)    (Accuracy %)    (Accuracy %)

Optimizer: Adam         98.40 %         98.78 %         98.92 %

Optimizer: Adadelta     98.75 %         99.01 %         99.02 %

Table 3.1: Accuracy table for various optimizers


Accuracy of CNN: 99.02

Table 3.2: Accuracies of CNN depending on the number of epochs

In the implementation of CNN algorithm in Handwritten Digit Recognition we used ‘Keras’ as our
main tool.
Keras itself provides us with a number of datasets that also includes our MNIST dataset.
After loading our dataset with the help of Keras we initialize it and segregate it into training
and testing sets.
There are 60000 training examples with 10000 testing examples.
After segregating our dataset into training and testing sets we need to preprocess our data by
reshaping it. The parameters passed are the number of examples (60000 for training and 10000 for
testing), the size of the images, i.e. 28x28, and lastly 1 to denote a grayscale image.

37
In the next part of our code we performed one-hot Encoding.

The last layer of our CNN model will contain 10 nodes, each corresponding to an individual digit
(first node -> 0, second node -> 1, and so on). When we feed an image into the model, the model
returns a probability for that image under every node, and the predicted digit corresponds to the
node with the highest probability. For instance, if the last node has the highest probability, then
the predicted digit is 9.

Each label should be in the form of an array of 10 elements in which only one element equals 1
(the true class) and all the rest equal 0.
For example, if the image is of the number 8, then instead of the label being 8, it will have the
value 1 in the ninth column and 0 in the rest of the columns, like [0,0,0,0,0,0,0,0,1,0].
This is done by the function 'to_categorical'.

Now we will build our model and here we will be using a sequential model. Firstly, layers of the
model are defined and then added one after another to complete the model.
Our first two layers are Conv2D, i.e. convolution layers, that deal with input images taken as a
2-D matrix.
The first layer has 32 nodes and the second layer has 64 nodes.
We are using a 3x3 filter matrix, i.e. kernel_size=(3, 3).
The activation function used in our first two layers is ReLU (Rectified Linear Unit).
ReLU outputs 0 for negative inputs and passes positive inputs through unchanged.

The third layer is the MaxPooling layer that helps in choosing the best feature and helps in
reduction of dimensions of the input.

In the fourth layer we used dropout to reduce overfitting. It randomly turns neurons off during
training, which improves generalization.

In between the Convolutional Layer and the Fully Connected Layer there exists a Flatten Layer
which is our fifth Layer. Flattening converts the 2-D feature matrix to a vector that can act as an
input to a Fully connected Neural Network Classifier.

Our sixth layer is the Dense layer. It is the regular, densely connected neural network layer. It
takes input from the Flatten layer (i.e. the feature vector) and produces output as follows:
Output = activation (dot (input, kernel) + bias)

In our sixth Layer we have our activation as ‘ReLu’.


This dense layer has 128 units, each with its own bias, which determine our output.

In our seventh layer we again used Dropout to reduce overfitting.

Our last and eighth layer is a dense layer with 10 units and ‘Softmax’ activation.
Softmax makes the outputs sum to 1, so that the output is a series of probabilities.

38
The model predicts the number which has the highest probability among all 10 numbers.

As all our 8 layers are ready we add them up to complete the Architecture of the model.
After adding the layers we need to compile our model. Compiling the Model takes three
parameters:

1. Loss: We have used the categorical cross-entropy loss function here, as there are two or more
label classes. In this loss function, labels are to be provided in a one-hot representation.

2. Optimizer: It is required for compiling the Keras model. We have used two different optimizers
(1. adam, 2. adadelta) and checked the accuracy for both of them. We will see them in the result
section.

3. Metrics: A metric is a function that is used to judge the performance of the model.
Here we have used accuracy metrics that calculate how often our predictions match our labels.

After compiling our model we fit our training data to the model, passing our training images and
training labels to the fit function. Batch_size is the number of samples per gradient update; if
unspecified, batch_size defaults to 32.
One epoch is when the full dataset is passed forward and backward through the neural network once.
One epoch is too big to be fed to the algorithm at once, so we divide it into batches of 128
examples. As we increase the number of epochs the accuracy of the model increases up to a certain
extent and then becomes roughly constant. We will observe this in the Result section.
Verbose just lets us watch the training progress for each epoch. Verbose can take the values 0, 1
and 2: for 0 we see nothing (silent), for 1 we see an animated progress bar, and for 2 it just
mentions the epoch number.

At last we move to the final step, evaluating the model on the testing dataset. After evaluation we
get the accuracy of our model and the errors it makes.

39
CHAPTER 4 - ANALYSIS AND COMPARISON OF
RESULTS:

Fig 4.1 Graph of accuracy percentage of digits in CNN SVM & KNN process

We have plotted the accuracy of prediction for the recognition of the handwritten digits 0-9 with
the KNN, SVM and CNN algorithms in Fig 4.1.

Fig-4.2 Comparison of Accuracies

The graph in fig. 4.2 shows accuracy comparison between trained and test dataset for the three
different algorithms. While KNN & SVM don’t show much difference in accuracy between trained
and test data, CNN shows comparatively greater variation between the two.

40
Fig-4.3 Error Rates of Algorithms

Fig-4.3 shows the error comparison between the trained and test datasets for the three different
algorithms. The error difference between the test and trained datasets is minimal for KNN & SVM but
prominent in the case of CNN.

Fig. 4.4 Accuracy Percentage of Algorithms

4.1 Final Accuracy Table of KNN, SVM, CNN

Algorithm (Trained: 60000, Tested: 10000)    Accuracy

KNN                                          96.88 %
SVM                                          97.83 %
CNN                                          99.02 %

41
Graph presentation of Final Accuracy

By looking at the above table we can clearly state that the best prediction is performed by our CNN
model with 99.02 % accuracy. Our SVM model is also 97.83 % accurate but its training time is
very high compared to the other two mentioned algorithms.

We can also work with handwritten digits of other languages if we can get a proper dataset whose
properties match those of the MNIST dataset. Our algorithms and code fit all the major properties
of the MNIST dataset.

42
CHAPTER 5 – CONCLUSION
In our project entitled “Handwritten Digit Recognition Using Machine Learning” we used three
different algorithms to predict the digits.

The first algorithm we used is KNN i.e. K-Nearest Neighbors. We found out that it is the simplest
and easiest to implement algorithm and hence we did not need to generalize our training data. It
can be noticed from our results that with an increasing number of training examples our accuracy
increases, and the computation time also increases a bit in the case of the KNN algorithm.
After training on 60000 cases and testing on 10000 cases our algorithm reached 96.88 % accuracy.

The second algorithm we used is SVM i.e. Support Vector Machines. It is a relatively more
complex algorithm than KNN and its implementation is also not as easy as KNN. It takes a lot of
time to train our data with the MNIST dataset. As with KNN, its accuracy also increases with an
increase in training examples. It took almost 5 hrs to train 10000 cases and its accuracy is about
95.5 %.
When we trained 20000 cases the time of training almost increased to 10 hrs and its accuracy is
96.6 %. At last the full MNIST dataset was trained, i.e. 60000 cases, and tested upon 10000 cases.
The time taken was almost 45 hrs and the accuracy came to 97.83 %. The time taken is high due to
the complex calculations based on the parameters ‘C’ and ‘gamma’ and their permutations to yield
the best result.

The last algorithm we used is CNN, i.e. Convolutional Neural Networks. It is a totally different
concept from the two algorithms mentioned above. In our CNN we have used 8 layers to predict
the final output. We trained our model on 60000 examples and tested it on 10000 examples. We
used two different optimizers to compile our model. As we increased the number of epochs from
3 to 10 we got a more accurate result and the accuracy was getting almost constant after 10 epochs.
It is not as time consuming as the SVM algorithm is. We got an accuracy of 98.92 % with the
‘adam’ optimizer with 10 epochs and 99.02 % with the ‘adadelta’ optimizer with 10 epochs.

We have computed the accuracy level of every digit in each algorithm from the confusion matrix,
which we obtained by splitting the dataset into training and test sets. The true positive values
are divided by the total number of predictions, and that is the accuracy level for that particular
digit in that algorithm. We have also computed precision, recall and F1-score for every algorithm
to present the prediction quality clearly.

There are some cases where the accuracy level for a digit has dropped, and one of the reasons
behind that may be complexity: as a large number of pixels are used as input, some haziness may
also occur.

From Fig 4.1 we can see that the variance of the prediction accuracy across the digits is lowest
for the CNN algorithm, and the average prediction accuracy is also highest for CNN. Hence CNN is
the best algorithm for handwritten digit recognition.

43
CHAPTER 6 – REFERENCES
1. F. Alimoğlu, "Combining Multiple Classifiers for Pen-Based Handwritten Digit Recognition",
1996. [Online]. Available: http://www.cmpe.boun.edu.tr/theses

2. E. Alpaydin and M. I. Jordan, "Local Linear Perceptrons for Classification", IEEE Trans. on
Neural Networks, vol. 7, no. 3, pp. 788-792, 1996.

3. H. Drucker, R. Schapire and P. Simard, "Improving Performance in Neural Networks Using a
Boosting Algorithm", in Advances in NIPS, Morgan-Kaufmann, vol. 5, pp. 42-49, 1993.

4. L. K. Hansen and P. Salamon, "Neural Network Ensembles", IEEE Trans. on PAMI, vol. 12,
no. 10, pp. 993-1001, 1990.

5. R. A. Jacobs, M. I. Jordan, S. J. Nowlan and G. E. Hinton, "Adaptive Mixtures of Local
Experts", Neural Computation, vol. 3, pp. 79-87, 1991.

6. C. Kaynak, Methods of Combining Multiple Classifiers and Their Application to Handwritten
Digit Recognition, 1995.

7. T. Pavlidis and S. Mori, "Special Issue on Optical Character Recognition", Proceedings of the
IEEE, vol. 80, no. 7, 1992.

8. P. Pudil, J. Novovičová, S. Bláha and J. Kittler, "Multistage Pattern Recognition with Reject
Option", 11th IAPR ICPR, vol. II, pp. 92-95, 1992.

9. C. C. Tappert, C. Y. Suen and T. Wakahara, "The State of the Art in On-line Handwriting
Recognition", IEEE Trans. on PAMI, vol. 12, no. 8, pp. 787-808, 1990.

10. D. H. Wolpert, "Stacked Generalization", Neural Networks, vol. 5, pp. 241-259, 1992.

44
