978-1-5386-1150-0/17/$31.00 ©2017 IEEE
Fig. 1: Dissimilarity between patterns in the same handwritten Bangla character written by different individuals.

have the repetition of the same pattern, for example “ ”, “ ”, “ ” and “ ”, “ ”, “ ”, which makes the classification task more challenging. Finally, writing style varies from person to person, and the geometric structure of the characters fluctuates in size and angle; for example, in Fig. 1 the same letter has a different shape depending on the writer. To solve the aforementioned challenges, in this paper we propose a modified ResNet-18 architecture for the Bangla HCR problem. The contribution of this paper is two-fold: 1) we propose a modified ResNet-18 architecture which is capable of end-to-end learning and achieves state-of-the-art classification performance on relatively large datasets; 2) we provide a comparative analysis of the performance of several state-of-the-art deep learning architectures on Bangla HCR, which can be used as a baseline for comparison in the future.

The rest of the paper is organized as follows. In Section II, we discuss the related work on Bangla HCR. The proposed method is described in Section III. The experimental results and relevant analysis are provided in Section IV. Finally, the conclusion is given in Section V.

II. RELATED WORK

In this section, we briefly discuss the related work on Bangla HCR. Roy et al. [9] pioneered Bangla optical character recognition (OCR) and introduced Bangla character recognition research. Following their path, many researchers investigated several methods for improving the performance of Bangla OCR [10], [11], [12], [13], [14], [15], [16], [17]. Hasnat et al. [18] proposed a domain-specific OCR which classifies machine-printed as well as handwritten characters. For feature extraction they apply the Discrete Cosine Transform (DCT) over the input image, and for classification a Hidden Markov Model (HMM) is used. They also used a simple error-correcting module that can correct splitting errors caused by a combination of over-thresholding and segmentation problems. Wen and Lui [19] proposed a Bangla numeral recognition method using Principal Component Analysis (PCA) and a Support Vector Machine (SVM). Liua et al. [20] proposed a method for recognizing Bangla and Farsi numerals. In [21], a local binary pattern based feature extraction method was used together with a K-NN classifier. Nibaran et al. [22] proposed a feature set representation for Bangla handwritten alphabet recognition. Their feature set is a combination of 24 shadow features, 8 distance features, 16 centroid features, and 84 quad-tree based longest-run features, with which they achieved 85.40% accuracy on a 50-class character dataset. The above mentioned methods mainly used handcrafted features extracted from small datasets, which made them impractical for deployment in commercial applications.

With recent methods [23], [24] using CNNs, Bangla handwritten character and digit recognition got a performance boost on relatively large-scale datasets. Sharif et al. [25] proposed a method for Bangla handwritten numeral classification which bridges handcrafted HOG features with a CNN. Das et al. [26] proposed a two-pass soft-computing approach for Bangla HCR; the first pass combines the highly misclassified classes so that they receive finer treatment in the second pass. Sarkhel et al. [27] formulated the problem from a multi-objective perspective, training an SVM classifier on the most informative regions of the characters. Although the recent methods achieve higher accuracies than the earlier approaches, a significantly large margin is left for improving the performance.

III. PROPOSED METHOD

The shapes of Bangla handwritten characters and digits are geometrically horizontal, structurally less rectangular, and more spiral. Additionally, most of the conjunct characters have convoluted edges and similar patterns with small differences, which sometimes makes them hard to distinguish even for the human eye (especially when a character appears in isolation). In order to classify such a widely varied yet strongly similar character set, we need to deploy a classifier which is robust in discriminating similar patterns. As ResNet is a proven architecture for classifying a large number of classes, we propose a modified version of the ResNet-18 architecture which is particularly robust in classifying isolated Bangla handwritten characters.

A. Modified ResNet-18 Architecture

A deep residual network (ResNet) is composed of stacked entities with identity loops, referred to as modules. Each module consists of multiple convolutional layers that learn features from the input space. ResNet is a proven architecture, having won the 2015 edition of the ImageNet challenge. In this paper, we use a ResNet architecture with a modification that makes it particularly robust in classifying Bangla characters. The proposed modified ResNet architecture is as follows. A typical ResNet module takes an input x and generates F(x) through pairs of convolutional and ReLU layers. The generated F(x) is then added to the input x, computed as H(x) = F(x) + x. In the modified ResNet module, we add a Dropout [28] layer after the second convolutional layer. By adding the Dropout layer, each module produces a more generalized output with increased regularization. In the literature, Dropout is used by many architectures and is mainly applied to layers having a large number of parameters, to prevent feature co-adaptation and overfitting. Although it is sometimes used as a substitute for batch normalization [29], some work, such as [30], showed that batch normalization together with Dropout generalizes better. Inspired by their findings, we apply the above mentioned modification in the proposed architecture. We keep the max pooling at 3 × 3
Fig. 2: Proposed modified ResNet-18 architecture for Bangla HCR. In the diagram, conv stands for convolutional layer, Pool stands for max-pooling layer, batch norm stands for batch normalization, Relu stands for the rectified linear unit activation layer, Sum stands for the addition in ResNet, and FC stands for the fully connected hidden layers. In this architecture, we have eight ResNet modules, which are modified by adding a dropout layer after the second convolutional layer.
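To make the modified module concrete, the residual computation H(x) = F(x) + x with dropout on the residual branch can be sketched in plain NumPy. This is an illustrative sketch, not the paper's implementation: the two convolutional layers are stood in for by dense weight matrices, and the function name and shapes are hypothetical. The default drop rate of 0.2 matches the rate selected experimentally later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def modified_residual_block(x, w1, w2, drop_rate=0.2, training=True):
    """One modified ResNet module: two weight layers (dense stand-ins for
    the convolutions), dropout after the second one, then the identity
    shortcut H(x) = F(x) + x."""
    out = relu(x @ w1)          # first conv + ReLU (dense stand-in)
    out = out @ w2              # second conv (dense stand-in)
    if training:
        # inverted dropout: zero out ~drop_rate of activations, rescale the rest
        mask = (rng.random(out.shape) >= drop_rate).astype(out.dtype)
        out = out * mask / (1.0 - drop_rate)
    return relu(out + x)        # add the identity shortcut, final ReLU

# With zero weights the branch F(x) vanishes and the block reduces to ReLU(x),
# illustrating why the shortcut makes very deep stacks easy to optimize.
x = rng.standard_normal((4, 8))
zeros = np.zeros((8, 8))
assert np.allclose(modified_residual_block(x, zeros, zeros), relu(x))
```

Note that the shortcut path is left untouched by dropout; only the learned residual F(x) is regularized before the addition.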
TABLE I: Configuration detail of the convolutional layers in the ResNet-18 architecture.

Layer     Output        Layer Information
Conv1     112 × 112     7 × 7, 64, stride 2
Conv2.1   56 × 56       3 × 3 maxpool, stride 2; 3 × 3, 64
Conv2.2   56 × 56       3 × 3, 64
Conv3.1   28 × 28       3 × 3, 128
Conv3.2   28 × 28       3 × 3, 128
Conv4.1   14 × 14       3 × 3, 256
Conv4.2   14 × 14       3 × 3, 256
Conv5.1   7 × 7         3 × 3, 512
Conv5.2   7 × 7         3 × 3, 512
          1 × 1         Average pooling, 84-d FC, Softmax
FLOPs     1.8 × 10^9

with a stride of 2 × 2, since decreasing the pooling size or stride does not enhance the performance when the input image size is larger than 100 × 100 pixels. We use Softmax after the fully connected layers as default. Figure 2 shows the proposed modified ResNet-18 architecture and Table I shows the configuration detail of the ResNet-18 architecture. We experimented with Root Mean Square Propagation (RMSProp), Adam [31], and Stochastic Gradient Descent (SGD) optimizers to minimize the categorical cross-entropy loss function. The Adam optimizer is a memory-efficient and computationally fast optimization technique based on adaptive estimates of lower-order moments. Experimentally, the Adam optimizer outperformed both RMSProp and SGD.

B. Input Processing

In order to have a wider variation in the input, for the purpose of generalized network performance, we preprocess the input images by inverting them, removing noise with a median filter, applying an edge-thickening filter, and resizing each image to a square shape with appropriate padding, as done by default in [7]. Our input images are also diversified by adding elastic distortions. Inspired by [32], data augmentation was done on the datasets using elastic distortions with width and height shifting; the range was kept at 0.4 for this shifting. Data augmentation adds variety to the datasets, which ensures that the network observes different samples during the training phase.

IV. EXPERIMENTAL RESULTS

We present the experimental results and performance analysis in this section. We conduct experiments on an Ubuntu machine with an Intel Core i3-2120 (3.30 GHz) CPU, 12 GB RAM, and an Nvidia 1050Ti 4GB GPU. The proposed modified ResNet-18 architecture is implemented in Keras [33] with the Tensorflow [34] backend.

A. Datasets

In order to train and measure the performance of the proposed method, we use two recently introduced large datasets, the BanglaLekha-Isolated dataset [7] and the CMATERdb dataset [8]. The BanglaLekha-Isolated dataset is the latest publicly available Bangla handwritten character dataset with 84 classes, where 50 classes are vowels and consonants, 10 classes are numerals, and 24 classes are frequently used conjunct characters. This dataset contains 166,105 images in total, where the training set consists of 132,884 images and the test set of 33,221 images. In particular, it contains 98,950 simple vowels and consonants, 19,748 digits, and 47,407 commonly appearing conjunct consonants. The image sizes in this dataset vary from 110 × 110 to 220 × 220 pixels. The handwriting was collected from the 4 to 27 year age group, and a small portion of the samples was collected from physically disabled individuals. Figure 3 (a) shows a few examples taken from the dataset. The CMATERdb dataset is another large dataset with 231 classes of images. Among the classes, 50 belong to simple vowels and consonants, 10 to numerals, and the remaining 172 to conjunct consonant classes. Figure 3 (b) shows a few examples taken from the CMATERdb dataset. The BanglaLekha-Isolated and CMATERdb datasets are considerably larger than the dataset proposed in [35].

Fig. 3: Example images of Bangla characters taken from a) the BanglaLekha-Isolated dataset, b) the CMATERdb3 dataset.

B. Experiments

As the image size varies significantly in the BanglaLekha-Isolated dataset, selecting the right input image size is crucial for achieving the optimum classification performance. We conduct an experiment to find the optimum input size. The experimental results are given in Fig. 4. As can be seen, for image size 112 × 112 the proposed modified ResNet-18 architecture performs the best.

Fig. 4: Classification performance of the proposed modified ResNet-18 architecture using different input image sizes.

The performance of the proposed method is fine-tuned using two hyperparameters. Firstly, we experiment with the effect of different optimizers on the classification performance. In this experiment, we measure the performance using three state-of-the-art optimizers, namely RMSProp, Adam, and SGD, on 110 × 110 input images. The experimental results are given in Fig. 5 (a). As can be seen, using the Adam optimizer we achieve 0.4% and 0.1% performance boosts over RMSProp and SGD respectively. Based on this analysis, we decided to use the Adam optimizer in the rest of the experiments. Secondly, we investigate the performance of the proposed method with different dropout rates. As can be seen in Fig. 5 (b), the proposed method performs best with a dropout rate of 0.2. The rest of the experiments are done using a dropout rate of 0.2.

Fig. 5: Fine-tuning the performance by selecting the best hyperparameters. a) Classification performance of the proposed method using different optimizers. b) Classification performance with changing dropout rate. In this experiment, we use 112 × 112 input images.

The proposed ResNet-18 architecture is applied to the above mentioned two datasets to measure the performance. The training and validation curves are reported in Fig. 6 and Fig. 7 for the BanglaLekha-Isolated and CMATERdb datasets respectively.

We further investigate the performance of Bangla character recognition using several state-of-the-art CNN models. In this investigation, we use VGGNet-16, VGGNet-19, ResNet-18, ResNet-34, and the proposed method on the BanglaLekha-Isolated dataset. The classification performance is reported in Fig. 8. As can be seen, and as expected, the VGGNets perform worse than the ResNets. Using VGGNet-16 and VGGNet-19 we achieve 91.0% and 92.11% classification accuracies respectively, while we achieve 94.52% and 94.59% classification accuracies using ResNet-18 and ResNet-34 respectively. Even though the ResNet architectures are performing significantly
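As an aside on the width/height-shift augmentation described in the Input Processing subsection, the following is a minimal NumPy sketch, assuming shifts are drawn uniformly from ±40% of each dimension (the range of 0.4 stated in the paper) with zero-filled background; the function name and zero-fill choice are illustrative assumptions, and the elastic distortion step is omitted. In Keras, a comparable effect is obtained with ImageDataGenerator(width_shift_range=0.4, height_shift_range=0.4).

```python
import numpy as np

def random_shift(img, shift_range=0.4, rng=None):
    """Randomly shift a 2-D image along width and height by up to
    shift_range of each dimension; vacated pixels are zero-filled."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape[:2]
    dy = int(rng.uniform(-shift_range, shift_range) * h)
    dx = int(rng.uniform(-shift_range, shift_range) * w)
    out = np.zeros_like(img)
    # source and destination slices for the overlapping region
    ys, yd = (slice(-dy, h), slice(0, h + dy)) if dy < 0 else (slice(0, h - dy), slice(dy, h))
    xs, xd = (slice(-dx, w), slice(0, w + dx)) if dx < 0 else (slice(0, w - dx), slice(dx, w))
    out[yd, xd] = img[ys, xs]
    return out
```

Applying such a transform independently on every epoch means the network rarely sees the exact same pixel layout twice, which is the stated purpose of the augmentation.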
Figure 9 shows the confusion matrix of the classification. A strong spot around the lower right corner signifies the strong confusion between class 61 (“ ”) and class 72 (“ ”). In the experiment, we found that out of 405 test samples, 25 instances of class 61 are misclassified as class 72 (6.2% interclass confusion), and out of 397 test samples, 81 instances
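For completeness, the interclass confusion figure quoted above follows directly from the confusion-matrix counts. The sketch below uses a hypothetical 2 × 2 slice of the matrix restricted to classes 61 and 72, filled in from the reported totals (405 and 397 test samples); the helper name and the on-diagonal counts are illustrative assumptions.

```python
import numpy as np

def interclass_confusion(cm, true_cls, pred_cls):
    """Fraction of samples of true_cls predicted as pred_cls,
    given a confusion matrix cm[true, predicted]."""
    return cm[true_cls, pred_cls] / cm[true_cls].sum()

# Hypothetical slice for classes 61 (row 0) and 72 (row 1):
# 405 samples of class 61, 25 of them predicted as class 72;
# 397 samples of class 72, 81 of them predicted as class 61.
cm = np.array([[380, 25],
               [81, 316]])
print(f"{interclass_confusion(cm, 0, 1):.1%}")  # → 6.2%
```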