By
Most. Masura Parvin
Student ID: 1702014
Level: 4 Semester: II
B.Sc. (Engineering) in CSE

Mst. Salma Akter Rani
Student ID: 1702049
Level: 4 Semester: II
B.Sc. (Engineering) in CSE
Session 2017
CERTIFICATE
This is to certify that the work entitled “Bangla Handwritten Characters Recognition
Using Convolutional Neural Network” by Most. Masura Parvin, Salma Akter Rani and
Afrin Naher has been carried out under our supervision. To the best of our knowledge, this work
is original and has not been submitted anywhere for a diploma or a degree.
Co-supervisor
.......................
(Md Rashedul Islam)
Assistant Professor
Department of Computer Science and Engineering
Hajee Mohammad Danesh Science and Technology University, Dinajpur-5200,
Bangladesh.
Supervisor
........................
(Md Abu Marjan)
Lecturer
Department of Computer Science and Engineering
Hajee Mohammad Danesh Science and Technology University,
Dinajpur-5200, Bangladesh.
Department of Computer Science and Engineering
Faculty of Computer Science and Engineering
Hajee Mohammad Danesh Science and Technology University
Dinajpur-5200, Bangladesh.
DECLARATION
We understand the University’s policy on plagiarism and declare that no part of this thesis has
been copied from other sources or been previously submitted elsewhere for the award of any
degree or diploma.
3.1.7 Data Augmentation
3.2 Neural Network
3.2.1 Deep Learning vs Neural Network
3.2.2 Working Process of Neural Networks
3.2.3 Types of Neural Network
3.2.4 Activation Function
3.3 Convolutional Neural Network (CNN)
3.3.1 Introduction
3.3.2 Importance of CNN
3.3.3 Few Definitions
Chapter 4
Methodology
4.1 Dataset Description
4.2 Preparation of dataset
4.3 Preprocessing
4.4 RGB to Gray
4.5 Resizing and Rescaling
4.6 Train, Test, and Validation split
4.7 Proposed Methodology
4.8 Overview of the proposed model’s architecture
4.9 Block diagram of the proposed model’s architecture
Chapter 5
Result Evaluation
5.1 Environmental Setup
5.2 Training the model
5.3 Model performance
5.4 Output Summary of our Proposed Model
5.5 Performance Metrics
5.5.1 Accuracy
5.5.2 Confusion Matrix
5.5.3 Precision
5.5.4 Recall
5.5.5 F1 Score
5.6 Discussion
Chapter 6
Conclusion
References
List of Figures
Figure 3.1: Relationship between AI, ML and DL
Figure 3.2: Neural Network
Figure 3.3: Binary step function
Figure 3.4: Linear Activation function
Figure 3.5: Logistic Regression
Figure 3.6: Tanh
Figure 3.7: ReLU Function
Figure 3.8: Softmax function
Figure 3.9: Down Sampling
Figure 3.10: Convolution Operation
Figure 3.11: Visualization of Convolution
Figure 3.12: Convolution with stride 1
Figure 3.13: Stride 1 with Padding 1
Figure 3.14: After Applying Padding
Figure 3.15: Different Layers of CNN
Figure 3.16: Pooling Layers
Figure 4.1: Sample images of used datasets
Figure 4.2: Block diagram of proposed methodology
Figure 4.3: Architecture of our proposed model
Figure 5.1: Accuracy graph of proposed model
Figure 5.2: Loss graph of proposed model
Figure 5.3: Confusion matrix of proposed model
List of Tables
Table 5.1: Proposed model's performance summary
Table 5.2: Proposed architecture summary
ABSTRACT
Handwriting recognition is one of the most interesting problems at present because of its
varied applications, and it helps to digitize old forms and records reliably. Working with
handwritten scripts is a big challenge because every person has a unique writing style, and
characters differ in shape and size. Therefore, this paper proposes a model that recognizes
the 50 basic Bangla handwritten characters (39 consonants and 11 vowels). The proposed model
was trained and validated with the Ekush dataset and tested with the BanglaLekha-Isolated
dataset. We tuned different parameters to gain the highest accuracy, trying different
optimizers and different learning rates. Of the SGD (Stochastic Gradient Descent) and Adam
optimizers, Adam performed better, with a learning rate of 0.001. We trained for 50 epochs.
After 50 epochs the proposed method showed a satisfactory training accuracy of 99.38% and a
validation accuracy of 95.19% for
Keywords:
HCR, Machine Learning, Deep Learning, CNN, Computer Vision, Pattern Recognition.
Chapter 1
Introduction
1.1 Introduction to Handwritten Character Recognition (HCR)
many kinds of handwritten recognition-based applications, such as Bangla handwritten
character-based OCR (Optical Character Recognition), picture-to-text-to-speech conversion,
Bangla ID card reading, number plate reading, vehicle tracking, post office automation, etc.
1.3 Motivation
Bangla is the mother language of Bangladesh. It is the official language of Bangladesh and
has official status in the Indian states of West Bengal, Tripura, Assam and Jharkhand, as well
as in Sierra Leone, a West African country. Bangla is the 7th most popular language in the
world, with about 250 million speakers, and it is often described as the 2nd most beautiful
language in the world. Considering these circumstances and the growth of technology in
different sectors of these regions, Bangla handwritten recognition plays an important role,
and its challenges should be overcome. However, compared with other writing scripts, only a
few studies address handwritten characters of the Bangla script, whereas sturdy models exist
for scripts such as Latin, Chinese and Japanese, which have achieved great success in machine
learning and deep learning applications.
Besides, Bengali is the fifth most-spoken native language and the seventh most-spoken
language by total number of speakers in the world. Still, efficient handwritten Bangla
character recognition systems are lacking. Thousands of old documents and handwritten
notes in many institutions are still not in computerized format.
1.5 Objectives
Objective of this paper is
1.6 Challenges
Handwriting recognition tends to have problems when it comes to accuracy. People can
struggle to read others' handwriting. How, then, is a computer going to do it? The issue
is that there is a wide range of handwriting, good and bad. This makes it tricky for
programmers to provide enough examples of how every character might look. Besides,
some characters look very similar, making it hard for a computer to recognise them
accurately.
In the case of handwriting recognition from photos, there are also awkward angles to
consider. The angle at which the photo is taken could obscure the character, making it
harder for the computer to identify. So, recognizing Bangla handwritten characters
accurately and efficiently is a challenge.
Working with two different datasets efficiently is a further challenge.
1.7 Contribution
Our contribution will be:
1. Design a new architectural model for Bangla HCR using CNN.
2. Proposed model will be trained and validated with one dataset and tested with another
dataset.
Chapter 1 includes the introduction about Handwritten Character Recognition (HCR) and
various methods for HCR, motivation, problem objectives, challenges, and contribution.
Chapter 5 includes the experimental results and descriptions based on our proposed
methodology.
Chapter 2
Related Work
2.1 Overview
Recognition of handwritten characters has gained significant popularity in the field of pattern
recognition and machine learning because of its use in various fields. Various techniques have
been proposed for handwriting recognition systems, including OCR technology and machine
learning and deep learning algorithms such as SVM, MLP, KNN and CNN.
Many studies and papers describe the techniques used to convert textual content from a
paper document into machine-readable form. Character recognition systems may serve as a key
factor in creating a paperless environment through the digitization and processing of existing
paper documents in the coming days. Here are some reviews of research done by individuals
and groups on HCR.
Past studies on handwritten character recognition in other scripts, such as Latin [1],
Chinese [2] and Japanese [3], have achieved great success. A few works are available for
Bangla handwritten basic character, digit and compound character recognition. Some literature
on printed Bangla character recognition has appeared in past years, such as “A complete
printed Bangla OCR system” [4] and “On the development of an optical character recognition
(OCR) system for printed Bangla script” [5]. There is also some research on handwritten
Bangla numeral recognition that reaches the desired recognition accuracy. Pal et al. have
conducted exploratory work on recognizing handwritten Bangla characters, namely “Automatic
recognition of unconstrained offline Bangla handwritten numerals” [6], “A system towards
Indian postal automation” [7] and “Touching numeral segmentation using water reservoir
concept” [8]. The proposed schemes are mainly based on features extracted from a concept
called the water reservoir. Apart from these, several other Bangla handwritten character
recognition works have achieved fairly good success.
Halima Begum et al., in “Recognition of Handwritten Bangla Characters using Gabor Filter and
Artificial Neural Network” [9], worked with their own dataset collected from 95 volunteers;
their proposed model achieved recognition rates of around 68.9% without feature extraction
and 79.4% with feature extraction. “Recognition of Handwritten Bangla Basic Character and
Digit Using Convex Hull Basic Feature” [10] achieved an accuracy of 76.86% for Bangla
characters and 99.45% for Bangla numerals. “Bangla Handwritten Character Recognition
using Convolutional Neural Network” achieved 85.36% test accuracy using the authors' own
dataset. In “Handwritten Bangla Basic and Compound character recognition using MLP and SVM
classifier” [11], MLP and SVM classifiers were proposed for handwritten Bangla basic and
compound character recognition and achieved recognition rates of around 79.73% and 80.9%,
respectively.
Research contributions relating to OCR of handwritten Bangla script may be categorized into
two major approaches: first, an MLP based single step approach, as proposed by Bhowmik
et al. [12], Roy et al. [13] and Basu et al. [14], and second, a multistage approach, as proposed
by Rehman et al. [15] and Bhattacharya et al. [16–17].
Most of the aforesaid approaches use MLP based classifiers to classify 50 Basic characters of
Bangla script. In the work of Bhowmik et al. [12], the feature set is constructed from the stroke
features of characters. The dataset used for testing recognition performances of 49 different
classes included characters collected from only 20 different writers. In the work of Roy et al.
[13], the authors have used a quadratic discriminant function. In this work, pattern classes are
grouped together intuitively on the basis of observable similarity, to form 35 pattern groups.
For forming the feature vector for this work, each character image is divided into 4×4 = 16 and
7×7 = 49 sub-images and 4-directional chain code techniques are used for computing the
directional frequencies of the contour pixels in each sub-image.
In the work of Basu et al. [14], 24 modified shadow features, 8 pairs of octant
centroid features and 36 longest-run features, computed on 9 overlapped sub-images, are used
for each character image to classify it into one of the 50 character classes using an MLP
based classifier.
The work described in [15] involves a two-stage hierarchical approach for OCR of handwritten
Bangla alphabetic characters, in which multiple experts are employed in the second stage, i.e.,
after coarse classification, for final classification of a pattern of an unknown class. The major
features used for recognition of basic Bangla characters by this approach include the Matra,
the upper part of the character, disjoint sections of the character, the vertical line and the
double vertical line. Classification decisions in the second stage are mainly based on the
consensus among multiple classifiers but, to reach the consensus, sample confidences of the
experts are considered instead of the majority voting method. Sample confidences are certain
probabilistic measures defined for determining class membership of sample patterns by the
experts. Failing to reach a consensus, certain other probabilistic measures, formed from the
past performances of the participating experts, are further considered. A sample pattern is
rejected if all the prescribed confidence measures fail to meet the passing criteria. This is,
in a nutshell, how the classification decisions of multiple experts are finally combined in the
work described in [15].
In the work of Bhattacharya et al. [16], a two-stage approach is adopted to classify 50
handwritten basic characters and 10 numeric digits of Bangla script. In this approach also, a
coarse, group-based classification of an unknown pattern in the first stage is followed by a
finer classification in the second stage. Based on the similarity of shapes, 57 pattern classes
are identified for final classification. These pattern classes are clustered into 11 groups for
coarse classification.
An MLP based classifier is employed in the first stage to decide about the group of an unknown
pattern. In the second stage, the pattern is subjected to another MLP based classifier, specific
to its group, for final classification. In another work, Bhattacharya et al. [17] have proposed a
similar two stage approach for recognition of 50 Basic characters of handwritten Bangla script.
64 chain code-frequency features, as used in [12] and [16], are also used here for classification
through MLP based classifiers.
2.3 Conclusion
According to the review, analysis and comparison of previous work, implementing and
upgrading technology for HCR is always challenging and requires careful consideration and
planning. This review suggests that HCR provides an opportunity to solve some traditional
problems but also introduces new concerns. It also discusses some typical features and
technological solutions of HCR and provides an overview of the weaknesses and strengths of
this technology. Finally, this technology still leaves room for improvement to enhance its
efficiency and usability for future uses.
Chapter 3
Machine learning (ML) is the field of study that gives computers the ability to learn without
being explicitly programmed. ML is a subset of AI. ML is based on the idea that systems
can learn from data, identify patterns and make decisions with negligible human
intervention. Basically, in ML the training data is given to a learning algorithm. Generally,
ML is of three types:
Supervised learning: Data sets are labeled, and the desired output is given so that the
algorithm can detect patterns and label new data sets. For example: insurance underwriting,
fraud detection, etc.
Unsupervised learning: Data sets aren't labeled; the algorithm is asked to identify patterns in
the input data and to sort the data according to similarities and differences. For example:
customer clustering, association rule mining, etc.
Reinforcement learning: Data sets aren't labeled but, after performing an action at each step,
the AI system receives feedback or a reward for its action. For example: game AI, complex
decision problems, reward systems, etc.
Deep learning (DL) is a subset of ML and part of the broader family of ML methods. DL can
learn feature representations directly from data with little human intervention. DL mimics
the workings of the human brain in processing data and can be used for detecting objects,
recognizing speech, translating languages and making decisions. DL is an evolution of ML; the
word “deep” refers to how deeply, that is, through how many layers, the machine learns. DL
uses neural networks such as the artificial neural network (ANN), the convolutional neural
network (CNN) and so on. DL is also known as deep neural learning or deep neural network.
Figure 3.1: Relationship between AI, ML and DL
A gray-scale image represents the brightness of each pixel with a single number. The most
common pixel format is the byte image, where this number is stored as an 8-bit integer, giving
a range of possible values from 0 to 255. Typically, 0 is taken to be black and 255 is taken to
be white.
For many applications, it is required to convert an image from one type to another, such as
RGB to Gray, Gray to RGB, Gray to Binary, RGB to Binary, RGB to Indexed, Gray to Indexed, etc.
In RGB to Gray, the true-color RGB image is converted into a gray-scale image, which discards
a lot of information that is not required for processing. In Gray to RGB, the image must be
generated in three channels, that is, as an m-by-n-by-3 array. For binary image generation,
any type of image is converted into a binary image whose pixels take only two values, 0 and 1,
one assigned to white and the other to black. A number of issues are associated with
converting an image from one format to another. Image conversion also includes the
conversion of a CMY color image to an RGB color image. For this, each CMY value is
subtracted from {1, 1, 1} to give the corresponding normalized RGB value. The RGB model is
obtained by an additive process, whereas the CMY model is obtained by a subtractive process.
CMY is the complementary model of the RGB color model.
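The conversions described above can be sketched in a few lines of NumPy. This is only an illustration, not the code used in this work: the luminance weights (0.299, 0.587, 0.114), the binary threshold of 128 and all function names are assumptions introduced for the example.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Convert an (H, W, 3) uint8 RGB image to an (H, W) uint8 gray-scale image."""
    # Assumed luminance weights (ITU-R BT.601); the thesis does not state which weights it used.
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb.astype(np.float64) @ weights
    return np.round(gray).astype(np.uint8)

def gray_to_binary(gray, threshold=128):
    """Map pixels at or above the (assumed) threshold to 1 and all others to 0."""
    return (gray >= threshold).astype(np.uint8)

def cmy_to_rgb(cmy):
    """Convert normalized CMY values in [0, 1] to normalized RGB by subtracting from 1."""
    return 1.0 - cmy

# A 2x2 test image: white, black, pure red, pure blue.
image = np.array([[[255, 255, 255], [0, 0, 0]],
                  [[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)
gray = rgb_to_gray(image)
binary = gray_to_binary(gray)
```

The gray-scale result has one channel instead of three, which is the information reduction mentioned in the text.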
Optimization algorithms help CNNs to minimize the error. The proposed model uses the
Adam optimizer. Adam is an optimization algorithm that can be used to update network weights
iteratively based on training data. Adam is an extension of the stochastic gradient descent
algorithm. Because of its good performance, it is widely used in computer vision research. The
proposed model uses the Adam optimizer with a learning rate of 0.001.
Adam optimizer: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$
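As a hedged illustration of the Adam update rule, the sketch below applies it to a single parameter. Here the learning rate is η = 0.001 as in the text, while the decay rates (0.9 and 0.999), the small constant ε = 1e-8 and the toy objective f(θ) = θ² are assumptions chosen for the example; m̂ and v̂ are the bias-corrected running averages of the gradient and of the squared gradient.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; beta1, beta2 and eps are commonly used defaults (assumed here)."""
    m = beta1 * m + (1 - beta1) * grad        # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)              # bias correction for the second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 101):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Each step moves the parameter by roughly the learning rate at most, which is why the choice of learning rate matters so much for training time.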
To calculate the error for the optimization algorithm we used the categorical cross-entropy
function. Recent research suggests that cross entropy performs better than other functions
such as classification error and mean squared error.
$$L_i = -\sum_j y_{i,j} \log(\hat{y}_{i,j})$$
where L_i = sample loss value, i = i-th sample in a set, j = label/output index, y = target
values, ŷ = predicted values.
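The categorical cross-entropy loss can be computed per sample as in the following sketch. The clipping constant and the example probabilities are assumptions added for illustration; the targets are one-hot vectors.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Per-sample loss L_i = -sum_j y_ij * log(y_hat_ij) for one-hot targets."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0); eps is an assumed safeguard
    return -np.sum(y_true * np.log(y_pred), axis=1)

y_true = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])
y_pred = np.array([[0.1, 0.8, 0.1],    # confident and correct
                   [0.3, 0.4, 0.3]])   # uncertain and wrong
losses = categorical_cross_entropy(y_true, y_pred)
```

The first sample, where the correct class receives probability 0.8, gets a smaller loss than the second, where the correct class receives only 0.3.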
Learning rate is one of the most important hyper-parameters to tune when training
convolutional neural networks. If the learning rate is low, the optimizer can settle more
precisely on a minimum of the loss, but it takes more time to get there. If the learning rate
is high, the training may fail to converge and may sometimes even diverge. So, choosing the
best learning rate is difficult. To overcome this challenge, we use an automatic learning rate
reduction method. For faster computation, we set a relatively high initial learning rate of
0.001, which is automatically reduced by monitoring the validation accuracy.
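The idea of reducing the learning rate when the validation accuracy stops improving can be sketched in plain Python as follows. The class name, the reduction factor of 0.5, the patience of 3 epochs and the minimum learning rate are hypothetical values chosen for illustration; the thesis only states the initial rate of 0.001 and the monitored quantity.

```python
class ReduceLROnPlateauSketch:
    """Lower the learning rate when validation accuracy stops improving (illustrative)."""

    def __init__(self, lr=0.001, factor=0.5, patience=3, min_lr=1e-5):
        self.lr = lr
        self.factor = factor      # assumed multiplicative reduction
        self.patience = patience  # assumed number of epochs to wait
        self.min_lr = min_lr      # assumed lower bound
        self.best = float("-inf")
        self.wait = 0

    def step(self, val_accuracy):
        """Call once per epoch with the latest validation accuracy; returns the new rate."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

scheduler = ReduceLROnPlateauSketch()
history = [0.80, 0.85, 0.85, 0.84, 0.85, 0.90]  # made-up validation accuracies
rates = [scheduler.step(acc) for acc in history]
```

In this made-up history the accuracy fails to improve for three consecutive epochs, so the rate is halved at the fifth epoch and stays there afterwards.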
Deep learning techniques perform better when more data is available. For this reason, data
augmentation is used to produce more data artificially. For handwritten character recognition,
data augmentation helps even more because a single person can write a character in different
variations. For data augmentation, the images are randomly shifted by up to 20% in height,
width or both, with 20% rotation and 20% zoom also applied.
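A minimal sketch of one of these augmentations, the random shift, is shown below. It is an illustration under stated assumptions rather than the pipeline used in this work: the function name is hypothetical, vacated pixels are filled with zeros, and rotation and zoom are not shown.

```python
import numpy as np

def random_shift(image, max_fraction=0.2, rng=None):
    """Shift a 2-D image by up to max_fraction of its height and width, zero-filling gaps."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    max_dy, max_dx = int(h * max_fraction), int(w * max_fraction)
    dy = int(rng.integers(-max_dy, max_dy + 1))  # vertical offset in pixels
    dx = int(rng.integers(-max_dx, max_dx + 1))  # horizontal offset in pixels
    shifted = np.zeros_like(image)
    # Copy the part of the image that remains visible after the shift.
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x] = image[src_y, src_x]
    return shifted

rng = np.random.default_rng(0)
image = np.ones((10, 10))
augmented = random_shift(image, rng=rng)
```

Calling the function repeatedly on the same character image yields differently positioned copies, which is how augmentation enlarges the training set.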
3.2 Neural Network
Neural networks are among the most significant developments in computing. They can solve
problems that are difficult to solve with conventional, explicitly programmed algorithms, and
they are the essence of deep learning. Neural networks, also known as artificial neural
networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are
at the heart of deep learning algorithms. Their name and structure are inspired by the human
brain, mimicking the way that biological neurons signal to one another.
Artificial neural networks (ANNs) are composed of node layers: an input layer, one or more
hidden layers, and an output layer. Each node, or artificial neuron, connects to others and
has an associated weight and threshold. If the output of any individual node is above the
specified threshold value, that node is activated, sending data to the next layer of the
network. Otherwise, no data is passed along to the next layer of the network. Neural networks
rely on training data to learn and improve their accuracy over time. Once these learning
algorithms are fine-tuned for accuracy, they are powerful tools in computer science and
artificial intelligence, allowing us to classify and cluster data at high velocity. Tasks in
speech recognition or image recognition can take minutes, compared with hours for manual
identification by human experts. One of the most well-known neural networks is Google’s
search algorithm.
3.2.1 Deep Learning vs Neural Network
Deep learning and neural networks tend to be used interchangeably in conversation, which can
be confusing. As a result, it is worth noting that the “deep” in deep learning just refers to
the depth of layers in a neural network. A neural network that consists of more than three
layers, counting the input and output layers, can be considered a deep learning algorithm. A
neural network that has only two or three layers is just a basic neural network.
3.2.2 Working Process of Neural Networks
A neural network has many layers. Each layer performs a specific function, and the more
complex the network is, the more layers it has. That is why a neural network is also called a
multi-layer perceptron. Before getting into how neural networks work, we need to be familiar
with their parts. The simplest form of a neural network has three layers of nodes:
1. Input layer
2. Hidden layer
3. Output layer
As the names suggest, each of these layers has a specific purpose. These layers are made up of
nodes. There can be multiple hidden layers in a neural network according to the requirements.
The input layer gathers data from the outside world: it picks up the input signals and
transfers them to the next layer. The hidden layer performs all the back-end calculation. The
simplest perceptron has no hidden layer, but a multi-layer neural network has at least one.
The output layer transmits the final result of the hidden layer’s calculation. Like other
machine learning applications, a neural network must be trained with some training data
before we give it a particular problem. But before we go more in depth into how a neural
network solves a problem, we should first know how perceptron layers work:
How do Perceptron Layers Work: A neural network is made up of many perceptron layers;
that is why it has the name ‘multi-layer perceptron’. These layers are also called hidden
layers or dense layers. They are made up of many perceptron neurons, the primary units that
work together to form a perceptron layer. These neurons receive information as a set of
inputs. We combine these numerical inputs with a bias and a group of weights, which then
produces a single output. For computation, each neuron considers its weights and bias. Then,
the combination function uses the weights and the bias to give an output (modified input). It
works through the following equation:
combination = bias + weights * inputs
After this, the activation function produces the output with the following equation:
output = activation(combination)
The activation function determines what kind of role each neuron performs, and the neurons
together form the layers of the network. The working process can be summarized in the
following steps:
1. Information is fed into the input layer, which transfers it to the hidden layer.
2. The interconnections between the two layers assign weights to each input randomly.
3. A bias is added to every input after the weights are multiplied with them individually.
5. The activation function determines which nodes it should fire for feature extraction.
6. The model applies an activation function to the output layer to deliver the output.
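The steps listed above amount to a forward pass through the network. A minimal sketch follows, assuming one hidden layer with ReLU activation and a softmax output; the layer sizes, the random weights and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the maximum for numerical stability
    return e / e.sum()

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """combination = bias + weights * inputs, then output = activation(combination)."""
    hidden = relu(w_hidden @ x + b_hidden)   # hidden layer
    return softmax(w_out @ hidden + b_out)   # output layer: class probabilities

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])               # 3 input values
w_hidden = rng.normal(size=(4, 3))           # randomly assigned weights, 4 hidden nodes
b_hidden = np.zeros(4)
w_out = rng.normal(size=(2, 4))              # 2 output classes
b_out = np.zeros(2)
probs = forward(x, w_hidden, b_hidden, w_out, b_out)
```

With untrained random weights the output probabilities are arbitrary; training adjusts the weights and biases so that the correct class receives the highest probability.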
The model uses a cost function to measure the error between its output and the expected
result. The weights are then changed during training to reduce this error. The model repeats
the process, adjusting the weights in every iteration, to improve the accuracy of the output.
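The iterative weight adjustment can be illustrated with a single linear neuron trained by gradient descent on a mean squared error cost. The data, the learning rate of 0.05 and the number of iterations are assumptions made for this toy example and are not taken from the proposed model.

```python
import numpy as np

# Toy data generated from y = 2x + 1; the neuron should recover w = 2 and b = 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

w, b, lr = 0.0, 0.0, 0.05
costs = []
for _ in range(2000):
    pred = w * x + b                    # forward pass
    error = pred - y                    # compare the output with the expected result
    costs.append(np.mean(error ** 2))   # cost function (mean squared error)
    w -= lr * np.mean(2 * error * x)    # adjust the weight along the negative gradient
    b -= lr * np.mean(2 * error)        # adjust the bias along the negative gradient
```

The recorded cost shrinks from iteration to iteration, which is the behaviour the paragraph above describes.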
3.2.3 Types of Neural Network
There are different kinds of deep neural networks, and each has advantages and disadvantages,
depending upon the use. Examples include:
1. Convolutional neural networks (CNNs) contain five types of layers: input, convolution,
pooling, fully connected and output. Each layer has a specific purpose, like summarizing,
connecting or activating. Convolutional neural networks have popularized image classification
and object detection. However, CNNs have also been applied to other areas, such as natural
language processing and forecasting.
2. Recurrent neural networks (RNNs) use sequential information such as time-stamped data
from a sensor device or a spoken sentence, composed of a sequence of terms. Unlike traditional
neural networks, the inputs to a recurrent neural network are not independent of each other,
and the output for each element depends on the computations of its preceding elements. RNNs
are used in forecasting and time series applications, sentiment analysis and other text
applications.
3. Feedforward neural networks, in which each perceptron in one layer is connected to every
perceptron from the next layer. Information is fed forward from one layer to the next in the
forward direction only. There are no feedback loops.
4. Autoencoder neural networks are used to create abstractions called encoders, created from
a given set of inputs. Although similar to more traditional neural networks, autoencoders seek
to model the inputs themselves, and therefore the method is considered unsupervised. The
premise of autoencoders is to desensitize the irrelevant and sensitize the relevant. As layers are
added, further abstractions are formulated at higher layers (layers closest to the point at which
a decoder layer is introduced). These abstractions can then be used by linear or nonlinear
classifiers.
3.2.4 Activation Function
An activation function decides whether a neuron should be activated or not. This means that
it decides, using simple mathematical operations, whether the neuron's input to the network
is important or not in the process of prediction.
Suppose we have a neural network working without activation functions. In that case, every
neuron only performs a linear transformation on the inputs using the weights and biases. It
does not matter how many hidden layers we attach in the neural network; all layers behave in
the same way because the composition of two linear functions is itself a linear function.
Although the neural network becomes simpler, learning any complex task is impossible, and our
model would be just a linear regression model. The three main types of neural network
activation functions are discussed below.
1. Binary Step Function
The binary step function depends on a threshold value that decides whether a neuron should be
activated or not. The input fed to the activation function is compared to a certain threshold;
if the input is greater than the threshold, the neuron is activated; otherwise it is
deactivated, meaning that its output is not passed on to the next hidden layer.
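As a small illustration, the binary step function can be written as below. The default threshold of 0 is an assumption; following the wording above, an input equal to the threshold does not activate the neuron.

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Return 1 where the input is greater than the threshold, otherwise 0."""
    return np.where(x > threshold, 1.0, 0.0)

outputs = binary_step(np.array([-2.0, 0.0, 3.0]))
```

Because the output jumps between only two values, the function cannot express how strongly a neuron should fire.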
Figure 3.3: Binary step function
2. Linear Activation Function
The linear activation function, also known as "no activation" or the "identity function" (the
input multiplied by 1.0), is where the activation is proportional to the input. The function
doesn't do anything to the weighted sum of the input; it simply returns the value it was given.
However, a linear activation function has two major problems:
It is not possible to use backpropagation effectively, as the derivative of the function is a
constant and has no relation to the input x.
All layers of the neural network will collapse into one if a linear activation function is
used. No matter the number of layers in the neural network, the last layer will still be a
linear function of the first layer. So, essentially, a linear activation function turns the
neural network into just one layer.
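The collapse of stacked linear layers into a single layer can be checked numerically. In the sketch below, the layer sizes and random weights are arbitrary assumptions; the point is only that two linear layers give the same output as one layer with combined weights.

```python
import numpy as np

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # first linear layer
w2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # second linear layer
x = rng.normal(size=3)

# Two layers with the identity (linear) activation in between.
two_layers = w2 @ (w1 @ x + b1) + b2

# The equivalent single layer: W = W2 W1 and b = W2 b1 + b2.
w_single = w2 @ w1
b_single = w2 @ b1 + b2
one_layer = w_single @ x + b_single
```

Both computations agree up to floating-point error, so the extra layer adds no representational power without a non-linear activation.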
Figure 3.4: Linear Activation Function
3. Non-Linear Activation Functions
The linear activation function shown above is simply a linear regression model. Because of its
limited power, this does not allow the model to create complex mappings between the
network’s inputs and outputs. Non-linear activation functions solve the following limitations
of linear activation functions:
They allow backpropagation because now the derivative function would be related to
the input, and it’s possible to go back and understand which weights in the input
neurons can provide a better prediction.
They allow the stacking of multiple layers of neurons as the output would now be a
non-linear combination of input passed through multiple layers. Any output can be
represented as a functional computation in a neural network.
Now, let’s have a look at some different non-linear neural networks activation functions:
The sigmoid (logistic) function takes any real value as input and outputs values in the range
of 0 to 1. The larger the input (more positive), the closer the output value will be to 1.0,
whereas the smaller the input (more negative), the closer the output will be to 0.0, as shown
in Figure 3.5. The sigmoid/logistic activation function is one of the most widely used
functions because:
It is commonly used for models where we have to predict a probability as an output. Since
a probability lies only in the range of 0 to 1, sigmoid is a natural choice because of its range.
The function is differentiable and provides a smooth gradient, i.e., it prevents jumps in
output values. This is represented by the S-shape of the sigmoid activation function. The
derivative of the function is f'(x) = sigmoid(x)*(1-sigmoid(x)).
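As a quick check of the derivative formula above, the following sketch implements the sigmoid and compares f'(x) = sigmoid(x)*(1-sigmoid(x)) with a numerical central difference. The helper names are illustrative, and the simple form used here is meant for moderate input values only.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x: float) -> float:
    """Analytic derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numerical_derivative(f, x: float, h: float = 1e-6) -> float:
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25
print(abs(sigmoid_derivative(1.3) - numerical_derivative(sigmoid, 1.3)) < 1e-6)  # True
```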
Tanh function is very similar to the sigmoid/logistic activation function, and even has the same
S-shape with the difference in output range of -1 to 1. In Tanh, the larger the input (more
positive), the closer the output value will be to 1.0, whereas the smaller the input (more
negative), the closer the output will be to -1.0.
Figure 3.6: Tanh
Advantages of using this activation function are:
The output of the tanh activation function is Zero centered; hence we can easily map
the output values as strongly negative, neutral, or strongly positive.
It is usually used in the hidden layers of a neural network because its values lie between -1
and 1; therefore, the mean for the hidden layer comes out to be 0 or very close to it. This
helps in centering the data and makes learning for the next layer much easier.
Mathematically it can be represented as: $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
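The tanh formula can be evaluated directly and compared with Python's built-in `math.tanh`. This is a minimal sketch for moderate input values; the function name `tanh_from_definition` is illustrative.

```python
import math


def tanh_from_definition(x: float) -> float:
    """tanh(x) = (e^x - e^-x) / (e^x + e^-x)."""
    e_pos, e_neg = math.exp(x), math.exp(-x)
    return (e_pos - e_neg) / (e_pos + e_neg)


for value in (-2.0, 0.0, 0.7):
    # The two columns agree up to floating-point rounding.
    print(value, tanh_from_definition(value), math.tanh(value))
```

The output is zero-centered: tanh(0) = 0 and tanh(-x) = -tanh(x).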
3. ReLU Function
ReLU stands for Rectified Linear Unit. Although it gives the impression of a linear function,
ReLU is non-linear: it has a derivative and allows for backpropagation while remaining
computationally efficient. The main catch here is that the ReLU function does not activate
all the neurons at the same time. A neuron is deactivated only if the output of the linear
transformation is less than 0.
Since only a certain number of neurons are activated, the ReLU function is far more
computationally efficient when compared to the sigmoid and tanh functions.
ReLU accelerates the convergence of gradient descent towards the global minimum of the
loss function due to its linear, non-saturating property.
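A minimal sketch of ReLU and its derivative is given below. The pre-activation values are made up for illustration, and setting the derivative at exactly 0 to 0 is a common convention rather than something stated in this thesis.

```python
def relu(x: float) -> float:
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def relu_derivative(x: float) -> float:
    """Gradient is 1 for positive inputs and 0 otherwise (convention at x = 0)."""
    return 1.0 if x > 0 else 0.0


# Hypothetical outputs of the linear transformation for five neurons.
pre_activations = [-3.0, -0.5, 0.0, 2.0, 7.5]
activated = [relu(v) for v in pre_activations]
active_fraction = sum(1 for v in activated if v > 0) / len(activated)

print(activated)        # [0.0, 0.0, 0.0, 2.0, 7.5]
print(active_fraction)  # 0.4
```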
4. Softmax Function
The softmax function, also known as the normalized exponential function, converts a vector of
K real numbers into a probability distribution over K possible outcomes. It is a generalization
of the logistic function to multiple dimensions and is used in multinomial logistic regression.
The Softmax function is described as a combination of multiple sigmoids. It calculates the relative
probabilities. Similar to the sigmoid/logistic activation function, the SoftMax function returns the
probability of each class. It is most commonly used as an activation function for the last layer of the
neural network in the case of multi-class classification.
Mathematically it can be represented as: $\text{Softmax}(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$
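The softmax formula can be sketched as follows. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that leaves the result unchanged; it is an addition of this sketch, not something described in the thesis. The example scores are arbitrary.

```python
import math


def softmax(z):
    """Convert a list of real scores into probabilities that sum to 1."""
    m = max(z)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the largest score gets the largest probability
print(sum(probs))  # 1.0 up to floating-point rounding

# With two classes and scores [x, 0], softmax reduces to sigmoid(x).
two_class = softmax([1.5, 0.0])[0]
print(abs(two_class - 1.0 / (1.0 + math.exp(-1.5))) < 1e-12)  # True
```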
Computer vision is evolving rapidly, and one of the reasons is deep learning. When we talk about
computer vision, the term convolutional neural network (abbreviated as CNN) comes to mind because
CNNs are heavily used here. Examples of CNN applications in computer vision are face recognition,
image classification, etc. A CNN is similar to the basic neural network: it also has learnable
parameters, i.e., weights and biases.
Suppose we are working with the MNIST dataset. Each image in MNIST is 28 x 28 x 1 (a grayscale
image contains only 1 channel). The total number of neurons in the input layer will be 28 x 28 = 784,
which is manageable. If the image size is 1000 x 1000, however, we need 10⁶ neurons in the input
layer, which is computationally inefficient. This is where the Convolutional Neural Network (CNN)
comes in. In simple words, a CNN extracts the features of an image and converts it into a lower
dimension without losing its characteristics. In the following example, the initial size of the
image is 224 x 224 x 3. Without convolution, we would need 224 x 224 x 3 = 150,528 neurons in the
input layer, but after applying convolution the input tensor dimension is reduced to 1 x 1 x 1000.
It means we only need 1000 neurons in the first layer of the feedforward neural network.
Figure: CNN architecture (legend: Convolution + ReLU, Max pooling, Fully connected + ReLU, Softmax)
1. Image Representation
Images are encoded into color channels: the image data is represented as the intensity of each
color channel at a given point. The most common encoding is RGB, which stands for Red, Green and
Blue. The information contained in an image is the intensity of each channel over the width and
height of the image. The intensity of the red channel at each point can therefore be represented
as a matrix; the same goes for the green and blue channels, so we end up with three matrices,
which together form a tensor.
2. Edge Detection
Every image has vertical and horizontal edges which combine to form the image. The
convolution operation is used with filters to detect edges. Suppose we have a grayscale
image of dimension 6 x 6 and a filter of dimension 3 x 3. When the 6 x 6 grayscale image is
convolved with the 3 x 3 filter, we get a 4 x 4 image. First, the 3 x 3 filter matrix is
multiplied element-wise with the first 3 x 3 region of the grayscale image and the products
are summed; then we shift the filter one column to the right until the end of the row, after
which we shift down one row, and so on.
The convolution operation can be visualized in the following way. Here the image dimension
is 4 x 4 and the filter is 3 x 3, hence the output after convolution is 2 x 2.
If we have an N x N image and an F x F filter, then the result after convolution will have dimension:
(N-F+1) x (N-F+1) ---(1)
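The sliding-window operation described above can be sketched in pure Python. The 6 x 6 image and the vertical-edge filter below are illustrative values, not data from this thesis. As is usual in CNNs, the filter is applied without flipping (strictly speaking, a cross-correlation). An N x N input and an F x F filter give an (N-F+1) x (N-F+1) output, which matches the 6 x 6 to 4 x 4 example.

```python
def convolve2d_valid(image, kernel):
    """Slide an F x F kernel over an N x N image with stride 1 and no padding."""
    n, f = len(image), len(kernel)
    out = n - f + 1
    result = []
    for i in range(out):
        row = []
        for j in range(out):
            # Element-wise product of the kernel and the current receptive field.
            acc = 0
            for a in range(f):
                for b in range(f):
                    acc += image[i + a][j + b] * kernel[a][b]
            row.append(acc)
        result.append(row)
    return result


# Bright left half, dark right half: one vertical edge in the middle.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

edges = convolve2d_valid(image, vertical_edge)
for row in edges:
    print(row)  # each row is [0, 30, 30, 0]
```

The non-zero columns in the output mark where the intensity changes from bright to dark.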
Stride denotes how many steps we move the filter at each step of the convolution. By default it
is one. Stride is a parameter that works in conjunction with padding, the feature that adds blank
(empty) pixels to the frame of the image to minimize the reduction of size in the output layer.
Roughly, padding is a way of increasing the size of an image to counteract the fact that
convolution and stride reduce the size.
Padding is added to the outer frame of the image to give the filter more space to cover, so
that pixels near the border are processed as well. Adding padding to an image processed by a
CNN allows for a more accurate analysis of images.
Figure 3.12: Convolution with Stride 1
We can observe that the size of output is smaller than input. To maintain the dimension of
output as in input, we use padding. Padding is a process of adding zeros to the input matrix
symmetrically. In the following example, the extra grey blocks denote the padding. It is used
to make the dimension of output same as input.
Figure 3.14: After applying padding
If we apply an F x F filter to the (N+2p) x (N+2p) padded input matrix, we get an output
matrix of dimension (N+2p-F+1) x (N+2p-F+1). Since we want the output after padding to have
the same dimension as the original input (N x N), we have,
(N+2p-F+1) x (N+2p-F+1) equivalent to N x N
N+2p-F+1 = N ---(2)
p = (F-1)/2 ---(3)
The equation (3) clearly shows that Padding depends on the dimension of filter.
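Equations (2) and (3) can be checked with a small helper. The function names are illustrative, and the stride parameter `s` generalizes the stride-1 formula used above in the usual way, (N+2p-F)/s + 1.

```python
def conv_output_size(n: int, f: int, p: int = 0, s: int = 1) -> int:
    """Output width/height for an n x n input, f x f filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1


def same_padding(f: int) -> int:
    """Padding that keeps the output size equal to the input size (stride 1)."""
    if f % 2 == 0:
        raise ValueError("p = (F-1)/2 is an integer only for odd filter sizes")
    return (f - 1) // 2


print(conv_output_size(6, 3))                       # 4  (no padding)
print(same_padding(3))                              # 1
print(conv_output_size(6, 3, p=same_padding(3)))    # 6  (same as the input)
print(conv_output_size(64, 5, p=same_padding(5)))   # 64
```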
4. Layers in CNN
Input layer
Convo layer (Convo + ReLU)
Pooling layer
Fully connected (FC) layer
Softmax/logistic layer
Output layer
i. Input Layer
The input layer of a CNN contains the image data. Image data is represented by a
three-dimensional matrix, as we saw earlier. We need to reshape it into a single column. Suppose
we have an image of dimension 28 x 28 = 784; we need to convert it into 784 x 1 before feeding it
into the input. If we have “m” training examples, then the dimension of the input will be (784, m).
The Convo layer is sometimes called the feature extractor layer because the features of the image
are extracted within this layer. First, a part of the image is connected to the Convo layer to
perform the convolution operation as we saw earlier, calculating the dot product between the
receptive field (a local region of the input image that has the same size as the filter) and the
filter. The result of the operation is a single number of the output volume. Then we slide the
filter over the next receptive field of the same input image by a stride and do the same
operation again. We repeat this process until we have gone through the whole image. The output
will be the input for the next layer. The Convo layer also contains the ReLU activation, which
sets all negative values to zero.
The pooling layer is used to reduce the spatial volume of the input image after convolution. It
is used between two convolution layers. If we apply an FC layer after the Convo layer without
applying pooling, it will be computationally expensive, which we do not want. Max pooling is
therefore a common way to reduce the spatial volume of the input image. In the above example, we
have applied max pooling to a single depth slice with a stride of 2. We can observe that the
4 x 4 input is reduced to 2 x 2.
There is no parameter in pooling layer but it has two hyperparameters — Filter(F) and Stride(S).
In general, if we have input dimension W1 x H1 x D1, then
W2 = (W1−F)/S+1
H2 = (H1−F)/S+1
D2 = D1
Where W2, H2 and D2 are the width, height and depth of output.
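The pooling formulas can be illustrated with a pure-Python max-pooling sketch over a single depth slice. The 4 x 4 feature map is a made-up example; with F = 2 and S = 2 the output is (4-2)/2+1 = 2 in each dimension.

```python
def max_pool2d(x, f: int = 2, s: int = 2):
    """Max pooling over one depth slice with filter size f and stride s."""
    h, w = len(x), len(x[0])
    out_h = (h - f) // s + 1   # H2 = (H1 - F) / S + 1
    out_w = (w - f) // s + 1   # W2 = (W1 - F) / S + 1
    return [
        [
            max(x[i * s + a][j * s + b] for a in range(f) for b in range(f))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 8], [3, 4]]
```

Pooling has no learnable parameters; only F and S are chosen by the designer.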
The fully connected layer involves weights, biases, and neurons. It connects neurons in one layer
to neurons in another layer. It is used to classify images into different categories by training.
v. Softmax / Logistic Layer
The softmax or logistic layer is the last layer of the CNN. It resides at the end of the FC layer.
Logistic is used for binary classification and softmax for multi-class classification.
The output layer contains the label in one-hot encoded form.
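One-hot encoding of a class label can be sketched as follows. The helper names are illustrative; the value 50 matches the number of basic Bangla character classes used in this work, and mapping characters to integer indices 0 to 49 is an assumption of the sketch.

```python
def one_hot(label: int, num_classes: int):
    """Return a vector with 1 at the label index and 0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    vec = [0] * num_classes
    vec[label] = 1
    return vec


def decode(scores):
    """Return the index of the largest score (the predicted class)."""
    return max(range(len(scores)), key=scores.__getitem__)


encoded = one_hot(3, 50)
print(len(encoded), sum(encoded), encoded[:5])  # 50 1 [0, 0, 0, 1, 0]
print(decode(encoded))                          # 3
```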
Chapter 4
Methodology
4.1. Dataset Description
The proposed model used a dataset named Ekush for training and validation and another dataset
named BanglaLekha-Isolated for testing. The BanglaLekha-Isolated dataset is a collection of
Bangla handwritten isolated character samples. It contains samples of 50 Bangla basic
characters, 10 Bangla numerals and 24 selected compound characters. 2000 handwriting
samples for each of the 84 characters were collected, digitized and pre-processed. After
discarding mistakes and scribbles, 166,105 handwritten character images were included in the
final dataset. The dataset also includes information about the age and gender of the subjects
from whom the handwriting samples were collected. This information is mapped to each
individual image. A separate spreadsheet gives an assessment of the aesthetic quality of the
handwriting samples, collected from three independent assessors. This assessment is done on
groups of 84 characters and not on individual characters.
The Ekush dataset has a total of 368,776 images, comprising 155,570 alphabet characters, 151,607
compound characters, 30,830 digits and 30,769 modifiers. The image resolution in the Ekush
dataset depends on the character size. Most of the images have little padding, with a black
background and the character in white.
Data preparation plays an important role in deep learning. Data is everywhere, however, the
problem is the lack of processed data. Our proposed model used Ekush dataset and
BanglaLekha-Isolated dataset.
In this paper, we propose a model for classifying Bangla handwritten characters, covering the
50 basic Bangla characters (11 vowels and 39 consonants). We kept 65,000 images from the two
datasets, of which 40,000 images are for training, 10,000 images for validation and 15,000
images for testing; we removed the rest of the data to simplify our work and reduce computation
time. There are 65,000 images in total across 50 classes. Each class contains 800 images for
training, 200 images for validation and 300 images for testing the model. We used 50,000 images
from the Ekush dataset and 15,000 images from the BanglaLekha-Isolated dataset.
The Ekush dataset images have a white background and black characters. First, we inverted all
the images to make the background black and the characters white. Black pixels have the value
0, which reduces the amount of computation. The images of the Ekush dataset differ in height and
width, since each image is sized to its character to reduce unnecessary information.
4.3 Preprocessing
Images are preprocessed before being fed to machine learning algorithms. We preprocessed the
Ekush dataset to make it similar to the testing dataset, BanglaLekha-Isolated, because the two
datasets are completely different. BanglaLekha-Isolated has a black background with white
characters, whereas Ekush has a white background with black characters. The original images
were in RGB format, which means each image has three layers: red, green and blue. We first
converted the images of the Ekush dataset into grayscale. A grayscale image has only one layer
of n rows and m columns. The images were then resized to 64 x 64 pixels. After that, we inverted
each image and used the Canny edge detector to make the images of the two datasets similar for
the HCR purpose. During training, the image pixel values are normalized by dividing by 255.
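The steps above can be sketched in pure Python on a toy image. This is only an illustration of the order of operations and not the code used in this work: the luminosity weights (0.299, 0.587, 0.114), the nearest-neighbour resizing and all function names are assumptions, and the Canny edge detection step is left out because it normally relies on an image-processing library.

```python
def to_grayscale(rgb):
    """Collapse (R, G, B) pixels into one intensity using luminosity weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb]


def resize_nearest(img, size):
    """Resize to size x size with nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[i * h // size][j * w // size] for j in range(size)]
            for i in range(size)]


def invert(img):
    """Swap dark and bright: white background becomes black."""
    return [[255 - v for v in row] for row in img]


def scale(img):
    """Normalize pixel values to [0, 1] by dividing by 255."""
    return [[v / 255.0 for v in row] for row in img]


def preprocess(rgb, size=64):
    # Canny edge detection (used in the thesis) is omitted from this sketch.
    return scale(invert(resize_nearest(to_grayscale(rgb), size)))


# Toy 2 x 2 image: white background with one black "character" pixel.
toy = [[(255, 255, 255), (0, 0, 0)],
       [(255, 255, 255), (255, 255, 255)]]
out = preprocess(toy, size=4)
print(out[0])  # [0.0, 0.0, 1.0, 1.0]
print(out[3])  # [0.0, 0.0, 0.0, 0.0]
```

After inversion the background pixels are 0 and the character pixels are 1, which is the black-background convention of the test dataset.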
After preprocessing our datasets, we created a model using a Convolutional Neural
Network. So, before showing our proposed model, we present some important
information about neural networks and Convolutional Neural Networks.
To measure the model performance, training, validation and test splits were created. The
training set is used to train the model with the known outputs. The validation set is used to
check model performance during training and helps to tune the hyper-parameters. The test
data is used to check the final model performance after training. For training and validation
we used the Ekush dataset, from which we took 50,000 character images: 10,000 images
(20% of the total) were used for validation and 40,000 images (80%) to train the model.
For testing, BanglaLekha-Isolated was used, which comes from a completely different
distribution. From BanglaLekha-Isolated we took 15,000 basic character images, all of which
were used to measure model performance.
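An 80% : 20% split that keeps 800 training and 200 validation images per class can be sketched as a stratified split. The function name, the fixed seed and the synthetic label list (50 classes with 1000 images each) are assumptions for illustration; how the split was actually implemented is not described here.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split sample indices per class so every class keeps the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Synthetic labels: 50 classes x 1000 images = 50,000 samples.
labels = [c for c in range(50) for _ in range(1000)]
train_idx, val_idx = stratified_split(labels)

print(len(train_idx), len(val_idx))               # 40000 10000
print(sum(1 for i in val_idx if labels[i] == 0))  # 200
```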
Dataset 01: Preprocessing (RGB to Gray, Resizing, Color inversion, Canny edge detection, Scaling) -> Trained CNN model
Dataset 02: Preprocessing (RGB to Gray, Resizing, Canny edge detection, Scaling) -> Testing
We worked with two datasets. Before applying the CNN algorithm we preprocessed our
datasets. After preprocessing, we trained the CNN on Dataset 01 (Ekush) and obtained a
new model for Bangla handwritten character recognition. We then tested our model using
the second dataset, BanglaLekha-Isolated, and after tuning different parameters we finally
achieved the desired result.
4.8 Overview of the proposed model’s architecture
The proposed model uses a multilayer CNN to classify Bangla handwritten characters. It is built
from convolution, max-pooling, fully connected (dense) and dropout layers. The model has 16
layers: the first two layers are convolutional and the third is a max-pooling layer, and this
arrangement repeats four times. After that, the 13th layer is a flatten layer, the 14th is a
dense layer, the 15th is a dropout layer and the last layer is also a dense layer.
Layers 1 and 2 are convolutional layers with 32 filters and a kernel size of 3; both use ReLU
activation with same padding. Their output is passed to max-pooling layer 3.
The output of layer 3 then goes to layer 4. Layers 4 and 5 are convolutional layers with 64
filters and a kernel size of 3; both use ReLU activation with same padding. Their output is
passed to max-pooling layer 6.
Similarly, layers 7 and 8 are convolutional layers with 128 filters, a dilation rate of 2 and a
kernel size of 3; both use ReLU activation with same padding. Their output is passed to
max-pooling layer 9.
Layers 10 and 11 both have 256 filters; their dilation rate, activation function and padding
are the same as in the previous layers. Their output is passed to max-pooling layer 12. After
these 12 layers, the output is flattened into an array (layer 13) and passed through fully
connected dense layer 14 with 256 hidden units, which is regularized with 25% dropout (layer 15).
The output of layer 15 is connected to fully connected dense layer 16, which has 50 nodes with
softmax activation and is also the output layer of the model. Figure 4.3 shows the proposed
architecture.
4.9 Block diagram of the proposed model’s architecture
Input
Conv2D
Conv2D
MaxPooling2D
Conv2D
Conv2D
MaxPooling2D
Conv2D
Conv2D
MaxPooling2D
Conv2D
Conv2D
MaxPooling2D
Flatten
Dense
Dropout
Dense
Output
Chapter 5
Result Evaluation
5.1 Environmental Setup
Colaboratory, or Colab for short, is a Google Research product, which allows developers to
write and execute Python code through their browser. Google Colab is an excellent tool for
deep learning tasks. It is also good for the basic Machine Learning Models. It is a hosted Jupyter
notebook that requires no setup and has an excellent free version, which gives free access to
Google computing resources such as GPUs and TPUs. We have used GPU for our thesis work.
We used Google Colab to run our experimental model. It provides 12 GB of RAM
and 40 GB to 300 GB of storage.
The proposed model was trained on the Ekush dataset with a batch size of 128. After 50 epochs
the model reached good accuracy. Automatic learning rate reduction helps the optimizer
converge faster by lowering the learning rate during training. By the end of training, the
learning rate had been reduced from 0.001 to 1.5 x 10^-5.
We trained our CNN model several times with different numbers of epochs and observed its
performance. After 10 epochs, the model gives 97.30% training accuracy, 94.21% validation
accuracy and 92.55% test accuracy. After 20 epochs, the model gives 99.04% training accuracy,
94.66% validation accuracy and 93.34% test accuracy. We also measured the performance after
30, 40, and 50 epochs.
After 50 epochs the model gives 99.38% training accuracy, 95.19% validation accuracy and
94.47% test accuracy. All values are listed in the above table, including the precision, recall
and f1-score. We tried different optimizers and learning rates: of SGD and Adam, the Adam
optimizer performed better, with a learning rate of 0.001.
We used categorical cross-entropy as the model’s loss function. To measure performance,
accuracy was selected as the metric.
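Categorical cross-entropy for a single sample can be sketched as below, assuming a one-hot target vector and a vector of predicted class probabilities (for example, the softmax output). The clipping constant `eps` is an assumption of the sketch to avoid log(0), and the three-class example values are made up.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss = -sum_k y_true[k] * log(y_pred[k]) for one sample."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0, 1, 0]          # one-hot label: the sample belongs to class 1
confident = [0.1, 0.8, 0.1]
unsure = [0.4, 0.3, 0.3]

loss_confident = categorical_cross_entropy(y_true, confident)
loss_unsure = categorical_cross_entropy(y_true, unsure)
print(loss_confident)  # equals -log(0.8)
print(loss_unsure)     # equals -log(0.3), a larger loss
```

Only the probability assigned to the true class contributes to the loss, so a confident correct prediction gives a smaller value.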
In the below figure, we have plotted the train accuracy and validation accuracy with respect to
number of epochs. It shows a stable training and validation accuracy.
In this thesis we used a 2D CNN with 5 different layer types: convolution, max-pooling, flatten,
dense and dropout. Figure 5.1 shows the summary of our proposed architecture. Our model has a
total of 2,233,362 parameters, all of which are trainable (0 non-trainable parameters).
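The reported total of 2,233,362 parameters can be cross-checked by counting weights and biases layer by layer. The sketch below assumes a 64 x 64 single-channel input, 3 x 3 kernels with same padding, 2 x 2 max pooling that halves each spatial dimension, and a bias term in every convolutional and dense layer; the pooling size and the biases are assumptions, since they are not stated explicitly in the text.

```python
def conv2d_params(in_channels: int, filters: int, kernel: int = 3) -> int:
    """Weights (kernel x kernel x in_channels per filter) plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs: int, units: int) -> int:
    """One weight per input plus one bias, for every unit."""
    return (inputs + 1) * units


size, channels = 64, 1   # assumed input: 64 x 64 grayscale image
total = 0
for filters in (32, 64, 128, 256):
    total += conv2d_params(channels, filters)  # first conv layer of the block
    total += conv2d_params(filters, filters)   # second conv layer of the block
    channels = filters
    size //= 2                                 # assumed 2 x 2 max pooling

flattened = size * size * channels             # 4 * 4 * 256 = 4096
total += dense_params(flattened, 256)          # dense layer 14
total += dense_params(256, 50)                 # output layer 16 (50 classes)

print(flattened)  # 4096
print(total)      # 2233362
```

Max-pooling, flatten and dropout layers add no parameters, and the dilation rate does not change the number of weights in a convolutional layer.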
5.5.1 Accuracy
The accuracy metric is one of the simplest classification metrics to implement; it is the
ratio of the number of correct predictions to the total number of predictions. It can be
formulated as: Accuracy = Number of correct predictions / Total number of predictions
The accuracy rate can judge the overall classification ability of the model, but it cannot
reflect the specific details. The confusion matrix is the comparison matrix between the
predicted results and the actual values, which clearly shows the prediction details of each
category when the classification model makes predictions.
The confusion matrix shows that the diagonal values are the highest values. The diagonal
entries count the samples whose predicted class matches the actual class, i.e., the correct
predictions for each class. High diagonal values therefore indicate that the model performs
very well.
5.5.3 Precision
Precision is the ability of a classifier not to label an instance positive that is actually negative.
For each class, it is defined as the ratio of true positives to the sum of a true positive and false
positive.
Precision = $\frac{TP}{TP + FP}$
5.5.4 Recall
Recall is the ability of a classifier to find all positive instances. For each class it is defined as
the ratio of true positives to the sum of true positives and false negatives.
Recall = $\frac{TP}{TP + FN}$
5.5.5 F1 score
The F1 score is the harmonic mean of precision and recall, such that the best score is
1.0 and the worst is 0.0. F1 scores are often lower than accuracy measures as they embed
precision and recall into their computation. As a rule of thumb, the weighted average of F1
should be used to compare classifier models, not global accuracy.
F1 Score = $\frac{2 \cdot (\text{Recall} \cdot \text{Precision})}{\text{Recall} + \text{Precision}}$
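The metrics of this section can be computed from a confusion matrix as in the following sketch. The tiny three-class label lists are made-up data for illustration only and are not results of this work; rows of the matrix are actual classes and columns are predicted classes.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[actual][predicted] counts how often each pair occurs."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_metrics(cm, k):
    """Precision, recall and F1 score for class k."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp  # predicted k, actually other
    fn = sum(cm[k]) - tp                             # actually k, predicted other
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))


y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, 3)

print(cm)            # [[2, 1, 0], [0, 2, 0], [1, 0, 2]]
print(accuracy(cm))  # 0.75
print(per_class_metrics(cm, 1))  # precision 2/3, recall 1.0, F1 0.8 (up to rounding)
```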
5.6 Discussion
We evaluated our model with two different datasets. The Ekush dataset is used for training
and validation of our model. We split it 80% : 20% for this purpose, where 80% is used for
training and 20% is used for validation. The BanglaLekha-Isolated dataset is used for testing
our model. We trained for 50 epochs, after which we obtained 99.38% training
accuracy, 95.19% validation accuracy and 94.47% testing accuracy.
Chapter 6
Conclusion
Bangla Handwritten Characters Recognition is the ability of a computer to receive and
intelligently interpret Bangla handwritten characters as input from sources such as paper
documents, photographs, touch-screens and other devices. In this thesis, we used character
images for Bangla HCR. We used two Bangla handwritten character datasets, BanglaLekha-Isolated
and Ekush. The Ekush dataset is used for training and validation of our model; for this purpose
we split it 80% : 20%, where 80% is used for training and 20% for validation. The
BanglaLekha-Isolated dataset is used for testing our model. We designed a CNN model and trained
it on the Ekush dataset so that it can identify handwritten Bangla characters efficiently. For
this purpose, the Bangla character images were collected and passed through several stages:
pre-processing, feature extraction using the CNN, and classification.
We tuned different parameters to achieve the highest accuracy. We tried different optimizers
and learning rates: of SGD (Stochastic Gradient Descent) and Adam, the Adam optimizer performed
better, with a learning rate of 0.001. We trained for 50 epochs, after which we obtained 99.38%
training accuracy, 95.19% validation accuracy and 94.47% testing accuracy, and we are satisfied
with the performance of our model.
References
[1] E. Kavallieratou, N. Liolios, E. Koutsogeorgos, N. Fakotakis, G. Kokkinakis,
"The GRUHD Database of Greek Unconstrained Handwriting",ICDAR, 2001,
pp. 561-565
[2] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu. ICDAR 2013 Chinese
Handwriting Recognition Competition. 2013 12th International Conference on
Document Analysis and Recognition (ICDAR), pages 1464–1470, 2013
[3] B. Zhu, X.-D. Zhou, C.-L. Liu, and M. Nakagawa. A robust model for on-line
handwritten Japanese text recognition. IJDAR, 13(2):121–131,Jan. 2010.
[5] U. Pal, “On the development of an optical character recognition (OCR) system
for printed Bangla script,” Ph.D. Thesis, 1997.
[7] K. Roy, S. Vajda, U. Pal, B.B. Chaudhuri, “A system towards Indian postal
automation,” Proceedings of the Ninth International Workshop on Frontiers in
Handwritten Recognition (IWFHR-9), pp. 580–585, October 2004.
[8] U. Pal, A. Belad, Ch. Choisy, “Touching numeral segmentation using water
reservoir concept,” Pattern Recognition Lett. vol. 24, pp. 261-272, 2003.
[13] K. Roy, U. Pal, F. Kimura, Bangla handwritten character recognition, in:
Proceedings of the Second Indian International Conference on Artificial
Intelligence (IICAI), 2005, pp. 431–443.
[18] AKM Shahariar Azad Rabbya, Sadeka Haquea, Md. Sanzidul Islama, Sheikh
Abujara and Syed Akhter Hossain, On bornoNet: Bangla Handwritten
Characters Recognition Using Convolutional Neural Network, in: ICACC-
2018.