
International Journal on Document Analysis and Recognition (IJDAR)

https://doi.org/10.1007/s10032-023-00446-7

ORIGINAL PAPER

A multifaceted evaluation of representation of graphemes for practically effective Bangla OCR

Koushik Roy^2 · Md Sazzad Hossain^1 · Pritom Kumar Saha^1 · Shadman Rohan^2 · Imranul Ashrafi^2 · Ifty Mohammad Rezwan^2 · Fuad Rahman^1 · B. M. Mainul Hossain^3 · Ahmedul Kabir^3 · Nabeel Mohammed^2

Received: 4 August 2022 / Revised: 18 March 2023 / Accepted: 10 June 2023


© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2023

Abstract
Bangla Optical Character Recognition (OCR) poses a unique challenge due to the presence of hundreds of diverse conjunct characters formed by combining two or more letters. In this paper, we propose two novel grapheme representation methods that improve the recognition of these conjunct characters and the overall performance of Bangla OCR. We use the popular Convolutional Recurrent Neural Network architecture and apply our grapheme representation strategies to design the final labels of the model. In the absence of a large-scale Bangla word-level printed dataset, we created a synthetically generated Bangla corpus containing 2 million samples that is representative and sufficiently varied in terms of fonts, domain, and vocabulary size to train our Bangla OCR model. To test the various aspects of our model, we also created six test protocols. Finally, to establish the generalizability of our grapheme representation methods, we performed training and testing on external handwriting datasets. Experimental results demonstrate the effectiveness of our approach. Furthermore, our synthetically generated training dataset and the test protocols are made available to serve as benchmarks for future Bangla OCR research.

Keywords Bangla OCR · Word-level OCR · Synthetic Bangla OCR dataset · OCR benchmark · Neural networks

Koushik Roy, Md Sazzad Hossain, and Pritom Kumar Saha contributed equally to this work.

Corresponding author: Fuad Rahman (fuad@apurbatech.com)

Koushik Roy, koushik.roy@northsouth.edu · Md Sazzad Hossain, sazzad_hossain@apurbatech.com · Pritom Kumar Saha, pritom_saha@apurbatech.com · Shadman Rohan, shadman.rohan@northsouth.edu · Imranul Ashrafi, imranul.ashrafi@northsouth.edu · Ifty Mohammad Rezwan, mohammad.rezwan@northsouth.edu · B. M. Mainul Hossain, mainul@iit.du.ac.bd · Ahmedul Kabir, kabir@iit.du.ac.bd · Nabeel Mohammed, nabeel.mohammed@northsouth.edu

1 Apurba Technologies, Dhaka, Bangladesh
2 Apurba-NSU R&D Lab, North South University, Dhaka 1229, Bangladesh
3 Institute of Information Technology, University of Dhaka, Dhaka, Bangladesh

1 Introduction

Optical character recognition (OCR) is a complex and diverse problem, and many approaches have been taken to address it. These include both rule-based deterministic solutions and heuristic ones. Although rule-based solutions are attractive for their interpretability, it has proven difficult to capture all the applicable rules, which limits a model's adaptability. This has driven the rise of heuristic solutions rooted mostly in statistics and machine learning. Earlier systems mixed rules with popular statistical machine learning models such as Hidden Markov Models [1] and Bayesian models [2]. The recent availability of large datasets and deep learning models has brought significant improvements [3, 4]: deep models learn more useful features and outperform their statistical and rule-based counterparts. During this recent emergence of deep learning in OCR, most solutions have targeted resource-rich languages such as English [5, 6] or Chinese [7]. Languages such as English have a limited number of characters and no special conjunct characters. In contrast, languages such as Bangla, the subject of this paper, have a considerably larger character set, and these characters have special properties that complicate OCR. One is the use of modifiers, where one character is conjoined with another to form a new grapheme. Another is the conjunct characters: a variant form of a character may appear before or after (or both) the base character, chiefly marking a particular sound within the word. Such linguistic flexibility adds a new dimension of difficulty to the OCR problem.

There are multiple approaches to optical character recognition, including character-based and word-based recognition. The character-based approach requires the document to be segmented first into lines, then into words, and finally into characters, and each level of segmentation adds complexity. This is especially true for Bangla, given the morphological flexibility mentioned above. The word-level approach removes the last layer of segmentation and preserves the conjuncts as much as possible. There are, in turn, two word-level approaches in OCR. The first, commonly known as the Fixed method [8, 9], performs OCR within a fixed dictionary of words. The second, known as the Dynamic method [10, 11], models the characters within words and performs OCR over a dictionary of characters or graphemes while being trained on a diverse set of words. At recognition time, a dynamic architecture can therefore potentially recognize words that were not present in the dictionary it was trained on. The latter approach is used in this paper. At the time of conducting our research, no diverse, correct, large-scale dataset for Bangla word-level OCR existed; the datasets that did exist were designed primarily for character-level recognition and, in some cases, contained numerous errors.

The research conducted in this paper is part of building a word-level printed text recognition service for a commercial project. We propose a pipeline to create efficient, diverse, large-scale synthetic datasets with word-level annotations. The popular CRNN [10] word-level OCR model was used as the baseline and was adapted in its final layer according to the grapheme representation methods. The models with the various grapheme representation methods are scored over a number of test sets that also include real, non-synthetic samples. To summarize, the main contributions of this paper are threefold:

1. We present a fully synthetic corpus for Bangla word-level OCR consisting of 2 million (2,074,992 to be exact) samples covering a vocabulary of 691,664 words.
2. We created six test sets to evaluate the OCR models trained on the synthetic corpus on aspects such as real and synthetic seen and unseen words, fonts, formatting, etc.
3. We propose two novel grapheme representation methods suited to the morphological richness of an intricate language like Bangla, which determine the final label design of our models. To confirm the generalizability of the proposed methods, we performed training and testing on external, publicly available Bangla handwriting datasets, which are completely different from our synthetic data and test protocols.

The datasets generated and used during the experiments are available at figshare (https://doi.org/10.6084/m9.figshare.20186825). The code to reproduce the experiments is publicly available on GitHub (https://github.com/apurba-nsu-rnd-lab/bangla-ocr-crnn).

The remainder of this paper is organized as follows. Section 2 reviews the literature relevant to this paper. Section 3 describes the methodology used to generate our synthetic training set and outlines the test protocols. Section 4 elaborates on the methods used to perform the experiments, our proposed grapheme representation methods, and the evaluation metrics. Section 5 reports all the metrics achieved by our models using the different grapheme representation methods and provides an in-depth analysis. Finally, we present our concluding observations in Sect. 7.
2 Related works

Although OCR is a challenging task, it has seen many real-life applications. The difficulty of the problem depends partly on the target language and partly on the level of document content being predicted: paragraphs, sentences, words, or characters. Addressing these levels of challenge, in addition to the inherent complexity of the Bangla language, multiple efforts have been made to ameliorate them [12]. OCR pipelines are commonly divided into two components: a Text Detection module [13-16], responsible for segmenting and localizing each sentence, word, and character, and a Text Recognition module [5, 6, 8-11], which recognizes the segmented word or character images. A further variation uses an end-to-end strategy that performs both text detection and recognition in a single pipeline [6, 17-20]. As this paper addresses the recognition problem, we limit the scope of this review to text recognition, and more specifically to word-level recognition.

In recent years, owing to the prevalence of neural networks, OCR models have achieved outstanding performance [10, 21, 22]. For Bangla, however, the development of word-level OCR has been slower than the growth of similar OCR pipelines in other languages. We discuss the relevant works in two parts: first the state-of-the-art OCR models in other languages, and then the pre-existing OCR models for Bangla.

2.1 OCR in other languages

The advancement of deep neural networks over the last decade has kept pushing the ceiling of OCR research, producing many influential works. The OCR challenge can be divided into two key domains: scene text and printed word recognition on the one hand, and handwriting recognition on the other.

2.1.1 Scene text and printed words recognition

One of the earlier approaches phrased text recognition as a classification problem [9]. Each word was assigned a specific label, and a deep Convolutional Neural Network (CNN) was trained with an English vocabulary of 90,000 words. The major drawback of this model appears in languages with a larger character set, and the model cannot handle words outside the designated vocabulary. For instance, Chinese has such a large number of characters that it is extremely difficult for a deep CNN to learn every pattern. A CNN phrased as an object detector thus fails to generalize in many scenarios and cannot serve as the basis for a better-performing OCR model.

A dynamic word-based recognition approach was proposed by Shi et al. [10]. Prior to this, almost all deep learning models for word-level recognition focused on prediction from a fixed vocabulary, which, as noted above, has its own drawbacks. Shi et al. proposed a novel neural network architecture that does not require any pre-determined word vocabulary. The architecture comprises three chief modules: a feature extractor, a sequence modeling component, and a transcription layer. The feature extraction module is CNN-based and loosely follows the well-known VGG16 [3] architecture. A sequence model implemented with a Bidirectional Long Short-Term Memory (LSTM) [23] based encoder-decoder removed the dependency on text length. The transcription layer predicts a label for each timestep of the model from its maximum probability score. This model is popularly known as the Convolutional Recurrent Neural Network (CRNN). The paper reports a recognition accuracy of 89.6% for English on the ICDAR 2013 [24] dataset.

Baek et al. [25] showed that using an attention-based [26] prediction layer instead of Connectionist Temporal Classification (CTC) [27] yielded 1.7% and 2.6% better accuracy in their experiments; even before that work, using attention instead of CTC had become a popular route to state-of-the-art performance. ASTER [28] uses a rectification network and a recognition network, where the rectification network transforms the images to handle perspective and curved text without any human annotation, and the recognition network uses an attention LSTM to decode the prediction. FAN [29] addresses the misalignment between feature areas and targets, which the authors call attention drift, present in attention-based encoder-decoder models. SCATTER [30] improved the architecture proposed by Baek et al. [25] by introducing a selective decoder that operates on both visual features from the CNN layer and contextual features from the Bidirectional LSTM (BiLSTM) layer, harnessing a two-step 1D attention mechanism; the method recognized cursive text and text on complex backgrounds better than Baek et al. [25].

To combat the shortcomings of Recurrent Neural Network (RNN) based models, Yu et al. [31] proposed an end-to-end framework known as the Semantic Reasoning Network. The paper introduces a Global Semantic Reasoning Module that propagates semantic information in a "multi-way parallel" manner. This sub-network can learn words or characters simultaneously, making it more robust, and it eliminates the need for a time-step-based sequential learner, which can at times pass on erroneous information, resulting in the accumulation of unexpected semantic information. To capture the semantic information, the authors used Transformer blocks [26], whose input is a feature extracted from the Parallel Visual Attention Module. The paper reports a recognition rate of 95.5% on the ICDAR 2013 dataset [24] and 82.7% on the ICDAR 2015 dataset [32].

Transformer-based architectures [33-37] have been widely used recently for their better learning and generalization capability. Aberdam et al. [38] proposed a multimodal semi-supervised contrastive learning-based method utilizing a visual representation learning algorithm for scene text recognition.
Self-supervised contrastive learning and masked image modeling are utilized in [39] to learn discrimination and generalization for text recognition. Chu et al. [40] proposed an Iterative Language Modeling Module (IterLM) to further enhance scene text recognition. A novel single visual model was introduced by Du et al. [41], where recognition is done through simple linear prediction. Zheng et al. [42] proposed a regularization-based method to reduce the domain discrepancy between real and synthetic data.

2.1.2 Handwriting recognition

It is difficult to replicate the success of scene text and printed word recognition in handwriting recognition. Handwriting is far more varied than printed words, and even human recognition accuracy for handwriting can be lower than for printed words. Nevertheless, the deep learning architectures used for printed word recognition are also used for handwriting. Chammas et al. [43] used the CRNN architecture, applied Temporal Dropout to image-level and internal network representations during fine-tuning, and reported better results than the CRNN baseline on Spanish and German benchmark datasets. The CRNN baseline has also been used in [44], where the authors proposed a novel loss function called SoftCTC and demonstrated state-of-the-art results on handwriting recognition benchmark datasets. Yousef et al. [45] and Maillette de Buy Wenniger et al. [46] have also achieved good performance by utilizing CTC loss-based architectures. Encoder-decoder-based sequence-to-sequence models such as AttentionHTR [47] have also seen use in handwriting recognition. Resource scarcity is another issue facing handwriting recognition, and Fogel et al. proposed ScrabbleGAN [48] to generate varying-length handwritten texts. Many researchers have also worked on the resource efficiency of handwriting recognition [43, 45, 46, 49].

2.2 OCR in the Bangla language

Although a significant amount of research has been done on OCR, improvements in word-level OCR for Bangla have been slower than in other languages. The major hindrance to this dormant growth is the unavailability of any standard dataset for word-level OCR. We combat this problem by presenting a Bangla word-level OCR corpus that can be used to train and evaluate a word-level text recognition model. Character recognition for Bangla is challenging because most characters are cursive in nature and there are no well-defined strokes [50]. In the pre-deep learning phase, researchers tried to improve the performance of character-level OCR by combining structural analysis and algorithmic analysis and by using template and feature matching techniques [51-59]. However, since the advent of deep learning, feed-forward neural networks have become the preferred method in OCR studies [60-66].

Due to the absence of an open-source printed word dataset, research on Bangla OCR has mostly advanced in handwriting recognition, specifically character-based classification models. On segmented character recognition tasks, many studies [67-71] have used deep learning to achieve good performance. Sharif et al. proposed a hybrid HOG-CNN model [72] to classify Bangla compound characters and achieved 92.50% accuracy on the CMATERdb3.1.3.3 isolated characters dataset. Hasan et al. [73] proposed a deep CNN with a Bidirectional Long Short-Term Memory model to predict Bangla compound characters; they experimented on the same CMATERdb3.1.3.3 dataset and achieved a new milestone recognition accuracy of 98.50%.

Paul et al. [74] proposed a BiLSTM-CTC-based model trained on 47,720 text lines. Their dataset contained 472,167 words consisting of 2,867,659 characters and was classified into 166 unique labels; the test set comprised 61,582 words containing 369,931 characters. During training, they ignored the peephole connections of the BiLSTM to reduce the CTC loss, and the weights of the model were initialized with the Xavier initializer [75], which helped the loss converge quickly. On the test set, they achieved a character recognition rate of 99.32% and a word recognition rate of 96.65%, which outperforms Google's Tesseract 4.0 [21] (91.79% and 76.31%) and Google Drive OCR (98.54% and 92.86%), respectively. However, despite the reported performance, the dataset is not available for comparative study. Recently, two Bangla handwritten word datasets, BN-HTRd [76] and BanglaWriting [77], have been published, but no baseline word recognition models or results were presented in those papers.

3 Proposed datasets

Neural network-based solutions require a dataset of sufficient size and variation to perform well under current training methods. Although optical character recognition is a well-defined problem with satisfactory models, the requirement of language-specific datasets at scale, in this case a Bangla word-level dataset, still remains. Deriving annotated data is a lengthy process that requires substantial resources. To solve our problem, we propose a synthetic word dataset. Despite earlier steps taken to produce a synthetically generated Bangla character dataset [78], to our knowledge no large synthetic word dataset existed prior to ours. In addition to our synthetic training dataset, we have also created six unique test sets to evaluate the models trained on the proposed synthetic dataset. In this section, we discuss the procedure for creating our proposed training and testing sets.
3.1 Method of synthetic data generation

We generated word images with varying fonts and font sizes from a list of Bangla words in Unicode format. Figure 1 shows the flow of the synthetic data generation process.

Fig. 1 An overview of our synthetic data generation pipeline

In order to achieve good representation, we require a list of words covering the graphemes and grapheme combinations that frequently appear in Bangla in the current context. In addition to the graphemes, special characters also need to be taken into consideration. The word list largely reflects these requirements. After constructing the word list, we need to select multiple fonts that adequately represent the font styles in everyday use. In the following subsections, we describe our word list generation process and outline our font selection criteria.

3.1.1 Word list generation

To prepare our synthetic training set, we curated a list of words for the data generation pipeline, which required a large dictionary of unique words. Several published datasets contain a good amount of Bangla text [79-82], among which we chose the Dakshina Dataset [79] for its vast number of unique words and categories. We also used the SUMono dataset [80], which later served to generate one of our test protocols, as discussed in Sect. 3.2.

For our list, we extracted 691,664 unique words from the Dakshina Dataset [79] after discarding words containing non-Bangla characters. In essence, the final word list only contains characters from the Bengali Unicode block (https://www.unicode.org/charts/PDF/U0980.pdf) and punctuation. The dataset contains articles from distinct categories such as Natural Science, Social Science, Computer and IT, Literature, Mass Media and Blogs, History, and Mythology, including words that are complex in spelling, expansive in length, and that contain special characters.
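The filtering step can be made concrete with a short sketch. The following is our illustration rather than the released implementation; the codepoint range follows the Bengali Unicode chart linked above, and the punctuation set is an assumption:

    # Sketch of the word-list filtering described above: keep a word only
    # if every character is in the Bengali Unicode block (U+0980-U+09FF)
    # or a small punctuation set (illustrative, not the paper's exact list).
    BENGALI_BLOCK = range(0x0980, 0x0A00)
    ALLOWED_PUNCT = set(".,;:!?()'\"-")

    def is_bangla_word(word: str) -> bool:
        return all(ord(ch) in BENGALI_BLOCK or ch in ALLOWED_PUNCT
                   for ch in word)

    def build_word_list(corpus_words):
        # Deduplicate and discard words containing non-Bangla characters.
        return sorted({w for w in corpus_words if is_bangla_word(w)})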
3.1.2 Fonts selection

Font variation allows us to encode different font styles within the word images and helps the model stay robust to font changes. Through careful inspection of a number of real-world Bangla documents and consultation with various government organizations of Bangladesh, we limited the selection to twenty-four different fonts, including bold variants of three of them. Although hundreds of different Bangla fonts exist, only fonts frequently used in computer-composed documents, including the default fonts of the major operating systems, were considered; the fonts also needed to be free to use and readily available online for download. These fonts were thus assumed to capture a variety of the most popular fonts that tend to appear in printed documents. Samples of four of the twenty-four fonts are shown in Fig. 2. We believe this font selection procedure captures the variation required in the styling of the word images.

Fig. 2 Samples of the word "Rahasyaghana" written in Bangla, which translates to "Mysterious" in English, rendered in four of the fonts we used for data generation

3.1.3 Image generation

A modified version of TextRecognitionDataGenerator (https://github.com/Belval/TextRecognitionDataGenerator), a popular open-source synthetic data generation tool, was used to generate the word images from our word list. To recognize text at varying font sizes, we generated the dataset using eight font sizes: 8, 10, 12, 18, 24, 32, 48 and 64. Word images were generated with small padding around the text.
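As an illustration of how such images can be produced, the sketch below uses the stock trdg package's generator interface; it approximates, but does not reproduce, the modified pipeline described above. Parameter names follow the tool's generator API and should be treated as indicative; file paths and margin values are assumptions:

    # Illustrative word-image rendering with the unmodified "trdg" package.
    from trdg.generators import GeneratorFromStrings

    with open("wordlist.txt", encoding="utf-8") as f:   # hypothetical file
        words = [line.strip() for line in f if line.strip()]
    fonts = ["fonts/Kalpurush.ttf"]                     # hypothetical path
    sizes = [8, 10, 12, 18, 24, 32, 48, 64]

    annotations = []
    for size in sizes:
        gen = GeneratorFromStrings(words, count=len(words),
                                   fonts=fonts, size=size,
                                   margins=(2, 2, 2, 2))   # small padding
        for idx, (image, label) in enumerate(gen):
            path = f"out/{size}px_{idx}.png"
            image.save(path)                            # PIL image
            annotations.append((path, label))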

3.2 Test sets creation

We created six test sets representing various sources and types of printed documents to analyze the performance and effectiveness of our models trained on the synthetically generated training dataset. The segmentation pipeline applied to the documents is an integral part of the pre-processing used to segment words when assembling the hybrid and real test sets for evaluating the word-level OCR.


For segmentation, we used the Apurba segmentation pipeline [12], which relies on three significant steps to extract words from a document effectively. To be clear, we used this framework as-is, and it does not constitute a contribution of this paper. The choice of this particular segmentation pipeline does not influence our research, as the words it produced were later double-checked by a human before the word-level test datasets were compiled.

Instead of one big test set, six different protocols were created to rigorously test the OCR model and bring robustness to it. The following sections detail why each protocol was created and the steps taken to create the six test protocols as datasets, each serving a particular purpose. These sections also contain image samples that illustrate the different partitions of the test data. The annotations of the real-world images were performed by the authors, and to improve annotation quality, each annotation was later verified by more than one person.

3.2.1 Protocol I

Evaluation aspect: In order to evaluate the performance of our models on words not encountered in the training set, we establish the first test protocol. This protocol consists of a large number of unique synthetic words to test the performance of our model at a scale that would be difficult to reach with real data, given the resource constraints of creating a sizeable real dataset.

Creation method: For this protocol, from a list of around one hundred and fifty thousand unique words taken from the SUMono dataset [80], we curated a list of 75,138 unique words that were not used to generate the previously defined 2 million training samples or the validation set. We then used the synthetic data generation pipeline to generate text images in the eight size variants defined earlier, using five new fonts, as the test set. These fonts were not used to generate the training or validation sets. Samples of word images for this protocol are shown in Fig. 3.

Fig. 3 Synthetically generated image samples from Protocol I of the test sets

3.2.2 Protocol II

Evaluation aspect: This is a hybrid test set created by producing two documents, each containing a hundred words. These documents were printed, scanned, and finally processed by the Apurba segmentation pipeline. This protocol evaluates text recognition performance on clean, printed and scanned documents. Upon segmentation, the authors annotated each word sample in this protocol. It is cleaner in content than many real-world examples, as can be observed from the samples in Fig. 4.

Fig. 4 Synthetically generated image samples from Protocol II of the test sets

Creation method: For this dataset, we used two fonts, "Lohit" and "Kalpurush", at fixed sizes; the "Lohit" font was not used to generate the 2 million training samples. The words were chosen randomly and may overlap with the training vocabulary.

3.2.3 Protocol III

Evaluation aspect: For this variation, a diverse set of articles was randomly selected from a collection of newspaper articles available online. This protocol was created to test the robustness of the model on styled text, such as bold, underlined and italic text.

Creation method: The fonts used to generate this protocol were "SiyamRupali" and "Kalpurush", which were also present in the training set, and "Lalsalu", "Lohit" and "BenSen", which were not. To introduce further variation, the fonts were used at variable sizes, and we replicated some of the samples by generating images in bold, italic, and underlined variants. Figure 5 shows the variations introduced in this protocol.


Fig. 5 Synthetically generated image samples from Protocol III of the test sets

We printed out the documents and scanned them again to introduce real-world foreground and background noise [83]. We then used the Apurba segmentation pipeline described earlier to convert the documents into single word images, culminating in a total of 3056 test samples for this protocol.

3.2.4 Protocol IV

Evaluation aspect: In this test protocol, we introduced three different types of documents: typeset pages, printed books and old binarized books. This protocol evaluates the recognition model's performance on documents from different domains.

Creation method: The documents were scanned and segmented using the Apurba segmentation pipeline. The resulting test protocol comprises 1105 real-world word image samples: 456 word images from typeset documents, 305 extracted from printed books and 344 retrieved from old binarized books. In Fig. 6, we show segmented word samples from the three categories of documents collected for this test set.

Fig. 6 Real-world image samples from Protocol IV of the test sets

3.2.5 Protocol V

Evaluation aspect: The fifth test protocol has six variations of real-world noise across nine different documents. Moreover, this protocol contains single-character images, unlike the prior test sets; single-character word recognition is known to deepen the challenge for word-level recognition models [10]. This protocol consists of multiple challenging real-world cases with varying noise, as shown in Fig. 7.

Fig. 7 Real-world image samples from Protocol V of the test sets

Creation method: This test set is based on documents collected from various government organizations in Bangladesh. Upon segmenting the documents, we retrieved 1931 real-world images. This test set carries the further complication of the noise that the documents acquire during the scanning process.

3.2.6 Protocol VI

Evaluation aspect: In addition to having real-world noise, all the words in this protocol contain at least one conjunct character. We created this protocol to test the performance of our three grapheme representation methods on conjunct characters. Samples are shown in Fig. 8.

Fig. 8 Real-world image samples of words with at least one conjunct character from Protocol VI of the test sets

Creation method: Our final test protocol is based on various books of stories and poems, religious books, scientific textbooks, etc. After running multiple pages from different documents through the segmentation pipeline, the extracted word images that met the criteria of this protocol were hand-picked.


In Table 1, we provide a summary of the six test protocols.

Table 1 Summary of the test protocols, including their respective data type, font information, sources, and sample counts

    Protocol   Type        Font                                  Sources                                 Samples   Tests
    1          Synthetic   Fonts not in training                 Online articles                         75,138    Performance at large scale
    2          Hybrid      1 of 2 fonts in training              Real documents                          199       Performance on clean scanned text
    3          Hybrid      2 of 5 fonts in training; contains    Online articles                         3056      Performance on stylized text
                           bold, italic, underline
    4          Real        Unknown                               Typeset, printed, and binarized books   1105      Performance on images from different domains
    5          Real        Unknown                               Government documents                    1931      Performance on real-world noise
    6          Real        Unknown                               Textbooks; story, poem and religious    114       Performance on conjunct characters
                           books, etc.

4 Methodology

In this section, we discuss the methods used to conduct the experiments on the aforementioned datasets and describe each component of our model in detail. We first give an overview of our OCR pipeline and then elaborate on our baseline architecture, which is modified with the grapheme representation strategies. Next, we describe our proposed grapheme representation methods, which define the final labels for the models, and finally we elaborate on the evaluation metrics used to measure the performance of our models.

4.1 Overview

Our OCR architecture has four main components: a grapheme extractor, an image feature extractor, sequence modeling, and prediction. First, we use one of the proposed grapheme representation methods to encode the characters. Then a feature extractor produces feature vectors that represent the whole input image. After that, sequence modeling captures the contextual information of the characters. Finally, an alignment-free CTC approach decodes the predictions.

4.2 OCR baseline

To establish our baseline for the OCR task, we leverage the scene text recognition model by Shi et al. [10] which, as mentioned above, is composed of a feature extractor followed by a sequence learner and a CTC prediction layer. Almost all state-of-the-art word-level OCR models have a ResNet [4], VGG [3] or RCNN [84] based feature extractor and some form of sequence modeling such as a BiLSTM, but they differ in the prediction stage: some use CTC-based prediction, others use attention-based sequence prediction, and the attention-based models tend to be more accurate but larger [25]. We also reviewed many other OCR models [25, 28-31] with various feature extractors and attention- or CTC-based prediction. Most of them do not report the number of trainable parameters or the model size, but the ResNet and RCNN extractors have more parameters than the modified VGG extractor used in CRNN, and attention-based prediction is also slower than CTC. The CRNN architecture is the most lightweight among all the work we reviewed, which makes it the most practical for production, so we chose it as our baseline. Also, as our research brings no novelty to the model architecture and only contributes to the final prediction layer, we kept the model architecture the same throughout the experiments.

Fig. 9 Overview of the CRNN architecture

The feature extractor of the CRNN is tasked with producing sequential features from local regions of the input. Each local region, also known as a receptive field, is a fixed area over which convolutions, pooling, and non-linearities are applied, making the features translation invariant. These deep features, shown in Fig. 9 as components b and c, are robust to variation within each local region. Next, the sequence of features extracted from the receptive fields is joined for the sequence learning task. The objective of the sequence learner (component e) is to capture semantic cues using the contextual features (component d) extracted from the input sample (component a).

The sequence learning component comprises recurrent layers that use the sequential features, reshaped from the convolutional receptive fields, to predict a set of probabilities for each local region using an encoding scheme (component f). The inclusion of a sequence learner matters for the prediction task because it is easier to predict characters sequentially, knowing the preceding characters, than to determine each character independently. We employed a bi-directional LSTM as our sequence learner.
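For concreteness, the skeleton below sketches this CNN-to-BiLSTM-to-projection layout in PyTorch. It is a simplified illustration under our own assumptions (layer count and channel sizes are placeholders), not the exact configuration used in the experiments:

    import torch
    import torch.nn as nn

    class CRNN(nn.Module):
        """Minimal CRNN-style skeleton: CNN frames -> BiLSTM -> logits."""

        def __init__(self, num_classes: int, img_h: int = 32):
            super().__init__()
            # VGG-flavoured feature extractor (much shallower than the real one).
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
                nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d((2, 1), (2, 1)),   # pool height only: keep time axis
            )
            feat_h = img_h // 8                  # height remaining after pooling
            self.rnn = nn.LSTM(256 * feat_h, 256, num_layers=2,
                               bidirectional=True, batch_first=True)
            self.head = nn.Linear(2 * 256, num_classes)

        def forward(self, x):                    # x: (B, 1, H, W)
            f = self.cnn(x)                      # (B, C, H/8, W/4)
            b, c, h, w = f.shape
            f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # frames along width
            f, _ = self.rnn(f)                   # contextual features
            return self.head(f)                  # (B, frames, num_classes)

Pooling only along the height in the final stage preserves the width axis, so each output frame corresponds to a narrow vertical slice of the word image.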
Decoding the prediction (component g) from an RNN-based architecture is complicated by its redundant outputs at nearby timestamps. Moreover, the lengths of the ground truth and the prediction can differ, and an exact alignment between them may not be achievable. Bangla in particular is visually complex and very difficult to predict with an alignment-based solution, and alignment information is very expensive to annotate.


Fig. 10 Example of a Bangla word using an alignment-based approach (left) and an alignment-free approach (right)

In Fig. 10 (left), the first row is the predicted character block, and the second row shows a breakdown of characters in those blocks. This demonstrates that a word can be broken down into different levels of alignment.

We use an alignment-free approach for the learning process by employing a Connectionist Temporal Classification (CTC) layer. This eliminates the labor of labeling the position of each character, as we make predictions based on each frame. The CTC loss allows alignment-free training because it optimizes by summing over all possible alignment probabilities across all frames. A caveat of this algorithm is that it repeats the same character over multiple frames, since the CNN feature for each character is spread over a region. To combat this, the CTC algorithm introduces a special blank character to denote the start and end of a new character. The objective of the CTC loss function is shown in Eq. 1:

    L_{CTC} = -ln p(Y|X) = -ln \sum_{A \in A_{X,Y}} \prod_{t=1}^{T} p_t(a_t | X)    (1)

where p(Y|X) represents the probability of the output Y given X, the input CNN feature; p_t denotes the conditional probability of a single alignment element a_t at time step t given the CNN feature X, and the product combines these over all time steps. The summation marginalizes over the set A_{X,Y} of all valid alignments. As computing the sum in Eq. 1 directly would be computationally expensive, it is calculated using dynamic programming.
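In PyTorch terms, training with this objective and greedily decoding the per-frame outputs can be sketched as follows. This is a minimal illustration: model, images, targets (padded to shape (B, S)) and target_lengths are assumed to exist, and class index 0 is assumed to be the blank:

    import torch
    import torch.nn.functional as F

    # One CTC training step per Eq. 1; shapes follow F.ctc_loss conventions.
    logits = model(images)                                  # (B, T, C)
    log_probs = F.log_softmax(logits, 2).permute(1, 0, 2)   # (T, B, C)
    input_lengths = torch.full((logits.size(0),), logits.size(1),
                               dtype=torch.long)
    loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                      blank=0, zero_infinity=True)
    loss.backward()

    def greedy_decode(frame_logits):
        """Collapse repeats, then drop blanks (the standard CTC decode)."""
        ids, out, prev = frame_logits.argmax(1).tolist(), [], None
        for i in ids:
            if i != prev and i != 0:
                out.append(i)
            prev = i
        return out        # label indices; map back to graphemes afterwards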
4.3 Proposed grapheme representations

Table 2 Diacritic forms of some of the vowels

Since our word recognition model performs a sequence classification task, we need to map the graphemes in a word to a finite set of labels. There are several challenges regarding the representation of graphemes in Bangla, and one of the significant problems we address in this paper is this representation, as it is a critical issue and not at all straightforward. Although the Bangla alphabet has a finite set of vowels and consonants, the letters often form conjunct characters. In the Unicode standard, the consonant conjuncts do not have any dedicated codepoints; conjunct characters are formed by writing two or more vowels or consonants together. In Bangla, vowels are used in both their original and diacritic forms, as seen in Table 2, combined with a consonant.
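To make the encoding concrete with a standard example (a property of Unicode itself, not of our method), the conjunct "kṣa" is a single glyph but three codepoints:

    # One visual conjunct, three codepoints:
    # KA (U+0995) + virama/'hoshonto' (U+09CD) + SSA (U+09B7).
    conjunct = "\u0995\u09cd\u09b7"            # renders as a single glyph
    print([hex(ord(ch)) for ch in conjunct])   # ['0x995', '0x9cd', '0x9b7']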


Table 3 A possible representation of a grapheme in multiple combinations, illustrating the difficulty of prediction

Consonants can also form clusters with one or two other consonants connected by a 'hoshonto', an example of which can be seen in Table 3, where we show a grapheme extracted from a word. In all traditional sequential recognition models, Bangla Unicode text is used as the label for a word. In that scheme, the grapheme in Table 3 would be represented by three labels, as shown in the upper right cell. However, observing Table 3, we realize that since we are working with image-level data, it may be harder for the model to understand that the grapheme in the left column is composed of three different characters. While the model can certainly learn this, treating graphemes distinctly based on their visual presentation may be easier. Instead of using three different labels to represent this grapheme, we can combine the codepoints and use them as a single label for the grapheme, as shown in the lower right cell.

In our experiments, we break words down into three different representations. As mentioned above, graphemes may also involve conjuncts or modifiers, which are essentially special structural variants of characters or combined characters. The takeaway is that if they are visually different, they too can be considered separate classes in the vocabulary. This adds a new level of complexity: if we try to include all combinations of consonant clusters along with the vowel diacritics, the number of classes in the vocabulary will run into the thousands. On the other hand, if we separate the graphemes naively as in other languages, such as English and Chinese, where characters are split to their root form, we end up with a small set of characters that poorly represents the consonant clusters and diacritics, and recognition accuracy may fall. So we try to reach an optimal trade-off. In Table 4, we provide examples of representing characters in different ways and discuss the proposed representation approaches.

In the first method, Naive Separation, the graphemes are separated at the root level and broken into their smallest form, similar to the legacy implementation of character extraction from Unicode strings. As a result, the modifiers are separated, consonant clusters are broken down, and each character is treated as a class. In Bangla, a consonant cluster comprises two to three consonants joined by the 'hoshonto' character shown previously in Table 3. In the first row of Possible Representations in Table 3, we show that the grapheme can be represented with multiple characters joined by the 'hoshonto' character in between.

In the second method, Vowel Diacritics Separation (VDS), the vowel diacritics are separated while the consonant clusters are kept in their own form, as demonstrated in Table 4.

Table 4 Samples of our three proposed grapheme representations of Bangla words

Finally, the last method, All Diacritics Separation (ADS), builds on the vowel diacritics separation method. Atop vowel diacritics separation, we also separate a few Brahmi-Sanskrit diacritics: jofola (the grapheme created by 'Hoshonto + Jo'), rofola (created by 'Hoshonto + Ro') and ref (created by 'Ro + Hoshonto'). Like the vowel diacritics, these diacritics are used with almost every consonant and create consonant clusters, but jofola, rofola and ref remain visually identical in every consonant cluster. These diacritics only appear along with base consonant clusters or characters and produce unique clusters. From Table 4, for the first and third examples of the separation method, we find that the presence of the 'Ro + Hoshonto' grapheme gives rise to newer graphemes such as 'Ro + Hoshonto + Bo' and 'Ro + Hoshonto + To', as shown in Table 5. Some graphemes may also coexist inside other graphemes where they do not appear visually. Since these diacritics are not unique by themselves but may produce thousands of unique clusters, we separate them from the clusters and only keep the unique clusters formed by combining different consonants. This ensures that consonant clusters without any diacritics are kept in their original forms, while clusters with vowel diacritics and/or the Brahmi-Sanskrit diacritics are broken down. Thus we get a list of classes that is neither too small to represent the language nor too complex.


Table 5 Composition of multiple conjunct-based characters, reinforcing the examples in Table 4 and showing how one grapheme may co-exist in other graphemes in their conjunct form

Neither our novel Vowel Diacritics Separation (VDS) nor All Diacritics Separation (ADS) breaks down compound characters, even when they combine more than two consonants. The algorithm to extract labels in the VDS and ADS methods is provided in "Appendix B". Our experiments showed that the VDS- and ADS-based methods performed better than the legacy Naive method on consonant conjuncts or consonant clusters. The experimental setup and results regarding the grapheme representation methods are described in the following section.
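While "Appendix B" contains the exact extraction algorithm, the sketch below only conveys the flavor of a VDS-style split: consonants joined through the 'hoshonto' are merged into one label, and vowel diacritics are emitted as separate labels. This is our simplification for illustration, not the appendix algorithm:

    VIRAMA = "\u09cd"                       # 'hoshonto'
    # Bengali dependent vowel signs (diacritics).
    VOWEL_SIGNS = {chr(c) for c in range(0x09BE, 0x09CD)} | {"\u09D7"}

    def vds_split(word: str):
        """Toy VDS-style split: keep virama-joined consonant clusters as
        one label; emit vowel diacritics as separate labels."""
        labels, cluster, i = [], "", 0
        while i < len(word):
            ch = word[i]
            if ch in VOWEL_SIGNS:            # diacritic becomes its own label
                if cluster:
                    labels.append(cluster)
                    cluster = ""
                labels.append(ch)
            elif ch == VIRAMA and i + 1 < len(word):
                cluster += ch + word[i + 1]  # extend the conjunct cluster
                i += 1
            else:
                if cluster:
                    labels.append(cluster)
                cluster = ch
            i += 1
        if cluster:
            labels.append(cluster)
        return labels

Under this sketch, a word containing the conjunct shown earlier (KA + hoshonto + SSA followed by a vowel sign) yields two labels: the intact cluster and the separated diacritic.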

4.4 Evaluation metrics

We perform our evaluation on six different test sets with varied attributes. For performance comparison, we use three metrics. First, we use the Word Recognition Rate (WRR), the percentage of words that exactly match the ground truth (a Levenshtein distance of 0).

We also use the sum of the Normalized Edit Distance (NED) of each word as a second metric, as used in the ICDAR competition [32] for method evaluation. Edit distance, or Levenshtein distance [85], as defined in Eq. 2, measures the number of steps required to transform an incorrectly predicted sequence into the correct one:

    Lev_{a,b}(i, j) = max(i, j)                                  if min(i, j) = 0
    Lev_{a,b}(i, j) = min{ Lev_{a,b}(i-1, j) + 1,
                           Lev_{a,b}(i, j-1) + 1,
                           Lev_{a,b}(i-1, j-1) + 1_{(a_i != b_j)} }   otherwise    (2)

Equation 2 denotes the Levenshtein distance between two strings a and b, where i and j are positions in the two respective strings. The first case, Lev_{a,b}(i, j) = max(i, j) if min(i, j) = 0, gives the number of insertions or deletions required to turn an empty string a into b or vice versa. The second case is a recursive definition whose three terms correspond to insertions, deletions, and substitutions, respectively; the indicator 1_{(a_i != b_j)} charges one unit only when the two characters differ.

    NED = edit_distance(pred, gt) / max(|pred|, |gt|)    (3)

Equation 3 is used to calculate the Normalized Edit Distance (NED), where pred and gt are the prediction and ground-truth strings, respectively, and |pred| and |gt| denote their lengths. The sum of the NED over all words is used as the metric.

    CRR = 1 - ( \sum_{i=1}^{N} ED_i ) / ( \sum_{i=1}^{N} max(|pred_i|, |gt_i|) )    (4)

Our final metric is the Character Recognition Rate (CRR), defined in Eq. 4, which denotes the percentage of characters correctly recognized. The edit distance ED_i counts the characters that were not correctly recognized after aligning pred_i and gt_i. Dividing the total number of incorrectly predicted characters by the total number of characters, computed by summing the length of the longer string between prediction pred_i and ground truth gt_i, gives the fraction of incorrectly recognized characters in a word; subtracting from 1 yields the fraction of correctly recognized characters, the CRR.
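The three metrics follow directly from Eqs. 2-4; a compact reference implementation (ours, for illustration, assuming non-empty strings) is:

    def levenshtein(a: str, b: str) -> int:
        """Edit distance of Eq. 2 (insertions, deletions, substitutions)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def evaluate(preds, gts):
        """WRR, summed NED (Eq. 3) and CRR (Eq. 4) over a test set."""
        eds = [levenshtein(p, g) for p, g in zip(preds, gts)]
        maxlens = [max(len(p), len(g)) for p, g in zip(preds, gts)]
        wrr = sum(e == 0 for e in eds) / len(gts)
        total_ned = sum(e / m for e, m in zip(eds, maxlens))
        crr = 1 - sum(eds) / sum(maxlens)
        return wrr, total_ned, crr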


5 Experiments

In this section, we elaborate on the details of the experiments we conducted and present their results. First, we describe the experimental setup used for all of our experiments. Next, we show the results of the variants of our model on each of the aforementioned test sets when trained on the synthetic training set. Then we train and test on real handwriting data to evaluate our grapheme representation schemes on different real datasets, as well as their applicability to Optical Handwriting Recognition (OHR). Finally, we provide some qualitative analysis based on the results of the prior subsections to determine the shortcomings of the data.

5.1 Experimental setup

We used our synthetically generated dataset of 2 million image samples for training our model. During training, we applied Gauss Noise, Pixel Dropout for salt-and-pepper noise, Elastic Transform for a pencil-like effect, Random Brightness and Contrast, Motion Blur, etc. These augmentations were introduced to mimic the realistic noise found in real data, and Perspective and Rotation transformations were also added. The Albumentations [86] library was used to perform the augmentations. The model is initialized with Kaiming initialization [87] and initially trained with a learning rate of 5e-5. A Cosine Annealing with Warm Restarts [88] learning rate scheduler was used, where the first restart occurs after 15 epochs and the minimum learning rate is set to 1e-6. We trained the models for 30 epochs and optimized them with the Adam optimizer. We used inputs of size 128 x 32, from which a 29-frame feature sequence is generated. This differs from the input size of 100 x 32 used in [10]; we increased the width of the input image because Bangla words tend to be longer, and a bigger input lets the model better capture the visual complexity of the conjunct characters. Initial testing with these wider images gave better accuracy on all tests, so we report results using this size.

To test the robustness of our grapheme representation methods on real data, we chose two OHR datasets, BN-HTRd [76] and BanglaWriting [77]. Handwriting datasets were chosen due to the lack of a proper Bangla scene text or printed word dataset, which is why we created our synthetic dataset in the first place; choosing them also enabled us to test our grapheme representation methods in OHR. When training on the real handwriting datasets, a learning rate of 1e-3 was used, and on the BanglaWriting dataset the models were trained for 60 epochs; all other configurations remained the same as for the models trained on the synthetic dataset. Setting aside some data lost to mislabeling, we were able to extract 108,061 words from BN-HTRd and 21,221 words from BanglaWriting. BanglaWriting provides both raw and noise-reduced versions, and we chose the raw images as they are more natural and harder to recognize. As no prior train-test split was provided with either dataset, we randomly split each and used 90% for training and 5% each for validation and testing. We performed both intra-dataset and inter-dataset testing: intra-dataset testing is performed on the 5% test split, while inter-dataset testing is performed on the entire other dataset, meaning the model trained on BN-HTRd was tested on the entire BanglaWriting dataset and vice versa. All our experiments were conducted on a machine with an NVIDIA V100 GPU using the PyTorch framework [89] with a mini-batch size of 256. We used the weights of the best validation epoch during testing and provide results on the test sets mentioned above.
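Collecting the settings above into code, a configuration along these lines can be sketched with Albumentations and PyTorch. Probabilities and magnitudes are placeholders, since the paper does not list exact values, and CRNN refers to the sketch in Sect. 4.2:

    import albumentations as A
    import torch

    # Augmentations and optimization mirroring Sect. 5.1 (values assumed).
    train_aug = A.Compose([
        A.GaussNoise(p=0.3),
        A.PixelDropout(p=0.3),            # salt-and-pepper-style noise
        A.ElasticTransform(p=0.2),        # pencil-like distortion
        A.RandomBrightnessContrast(p=0.3),
        A.MotionBlur(p=0.2),
        A.Perspective(p=0.2),
        A.Rotate(limit=3, p=0.2),
    ])

    model = CRNN(num_classes=592)         # e.g. 591 VDS labels + CTC blank
    for m in model.modules():             # Kaiming initialization [87]
        if isinstance(m, torch.nn.Conv2d):
            torch.nn.init.kaiming_normal_(m.weight)

    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=15, eta_min=1e-6)  # first restart after 15 epochs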
5.2 Results

In this section, we report the results of our models in two different scenarios. First, the models trained on the synthetic dataset were tested on the six test protocols described earlier. Then we trained on real handwriting datasets and report intra- and inter-dataset results on those datasets. Three different configurations were trained and tested; Table 6 lists all the model configurations explored.

Table 6 Labels for all the variants of our models

    Configuration   Classes   Total params
    CRNN-Naive      115       8.76M
    CRNN-VDS        591       9.01M
    CRNN-ADS        262       8.84M

The labels in Table 6 are also used to present the test results in the following subsections. Each model configuration denotes one of the grapheme representation strategies paired with the CRNN architecture; for example, CRNN-Naive denotes the CRNN architecture with the Naive representation method. For each model and test set, we report the WRR, Total NED, and CRR as indicators of performance.

5.2.1 Results of training on the synthetic dataset and testing on our six test protocols

We report the results of the training experiments performed on the synthetic dataset in Table 7.

Table 7 Performance of our models using our three proposed grapheme representation methods on the six test protocols

    Protocol   Model        WRR (%)   Total NED    CRR (%)
    I          CRNN-Naive   93.43     1029.3625    98.28
               CRNN-VDS     92.11     1340.5018    97.88
               CRNN-ADS     93.08     1131.4137    98.15
    II         CRNN-Naive   81.91     18.9000      95.42
               CRNN-VDS     85.93     14.8988      95.83
               CRNN-ADS     80.90     24.1583      93.88
    III        CRNN-Naive   62.11     373.5325     90.34
               CRNN-VDS     64.01     372.2867     90.36
               CRNN-ADS     63.02     376.7820     90.15
    IV         CRNN-Naive   43.71     269.0556     80.66
               CRNN-VDS     48.69     239.9502     82.31
               CRNN-ADS     42.81     272.4529     79.80
    V          CRNN-Naive   87.52     150.5297     97.17
               CRNN-VDS     87.99     135.8027     97.10
               CRNN-ADS     86.85     148.4270     96.93
    VI         CRNN-Naive   78.07     7.1286       94.25
               CRNN-VDS     78.95     7.1302       94.25
               CRNN-ADS     81.58     6.1917       95.33

Bold indicates that the model has the best result on the protocol among the three models.

As previously mentioned, the purpose of Protocol I is to determine how the models perform on unseen synthetic data, that is, words the model did not encounter during training. For this protocol, Table 7 shows that the model with the Naive method outperformed our proposed VDS and ADS grapheme representation methods, but by a very small margin.


Protocol I is also the only protocol that is completely synthetic, while the rest of the protocols are either hybrid or real.

Protocol II mainly comprises synthetically generated words that were later printed and scanned, and it is thus considered a hybrid dataset. Table 7 shows that for the second protocol, the CRNN-VDS model achieved the highest WRR and CRR, 85.93% and 95.83% respectively. Given also its low Total NED, we consider CRNN-VDS the best-performing model for the second protocol.

Protocol III is an extension of the previous protocol. This test set contains multiple varieties of fonts and font sizes, with bold, italic and underlined samples as well. All of these data were generated at the document level and later segmented for evaluation. From the results for Protocol III in Table 7, our CRNN-VDS model clearly achieves the highest WRR (64.01%) and CRR (90.36%) among the models, a result reinforced by its low Total NED.

The purpose of Protocol IV was to further evaluate the model on real samples from typeset documents and from printed and old binarized books. Upon segmentation of the documents, we retrieved around eleven hundred word samples, on which we tested the model trained on the synthetic 2-million-sample dataset. From the results for Protocol IV in Table 7, our CRNN-VDS model again displayed the best results.

Our fifth test set, Protocol V, mainly consists of documents from various government organizations in Bangladesh and also contains certain variations of real-world noise. From Table 7, the CRNN-VDS model attained the best WRR of 87.99% and a CRR of 97.10%, slightly lower than the model with the Naive representation method; then again, the lowest Total NED among the three models makes CRNN-VDS the better model.

Protocol VI consists of real-world data where each word contains at least one conjunct character; it was specifically designed to test our grapheme representation schemes. Here we again observe the consistency of the CRNN-VDS model, which achieved a better WRR than the CRNN-Naive model. The CRNN-ADS model with the ADS grapheme representation was the best-performing model, with a WRR of 81.58% and a CRR of 95.33%, accompanied by the lowest Total NED. This shows the CRNN-ADS model's competitive performance on real data alongside CRNN-VDS, and its superiority on words with conjunct characters.

5.2.2 Results of training and testing on real handwriting datasets

In order to establish that our proposed grapheme representation methods also work with other real datasets, we chose BN-HTRd [76] and BanglaWriting [77] and demonstrate our methods on them. These published, fairly recent datasets were chosen to test our novel representation methods given the scarcity of good Bangla real or synthetic printed text or scene text datasets. We divided our training and testing on these datasets into four schemes, as presented in Table 8. The first two schemes are intra-dataset training and testing; in the last two schemes, the model trained on BN-HTRd is tested on the entire BanglaWriting dataset and vice versa. The results of these experiments are shown in Table 9.

Table 8 Training and testing schemes on the handwriting datasets

    Trained on            Tested on             Scheme
    BN-HTRd               BN-HTRd               INTRA1
    BanglaWriting (Raw)   BanglaWriting (Raw)   INTRA2
    BN-HTRd               BanglaWriting (Raw)   INTER1
    BanglaWriting (Raw)   BN-HTRd               INTER2

Table 9 Performance of our three proposed grapheme representation methods on real handwriting datasets

    Scheme   Model        WRR (%)   Total NED     CRR (%)
    INTRA1   CRNN-Naive   79.48     831.5276      89.33
             CRNN-VDS     82.91     757.5683      90.27
             CRNN-ADS     83.69     712.8907      90.85
    INTRA2   CRNN-Naive   69.40     116.3189      87.47
             CRNN-VDS     69.30     126.9041      86.39
             CRNN-ADS     72.50     108.6174      88.27
    INTER1   CRNN-Naive   60.58     3647.4147     82.02
             CRNN-VDS     61.96     3692.2762     81.75
             CRNN-ADS     62.94     3516.3509     82.93
    INTER2   CRNN-Naive   50.42     23897.9807    76.64
             CRNN-VDS     49.47     26075.8238    74.13
             CRNN-ADS     52.14     23325.9911    76.99

Bold indicates that the model has the best result on the scheme among the three models.

From the results in Table 9, we observe the consistency of the CRNN-ADS model with the ADS grapheme representation method on the real handwriting datasets: it surpassed the baseline Naive extraction method on all metrics.

6 Discussion

In this section, we take a deeper look into our grapheme representation methods and analyze them from different perspectives.


Fig. 11  Percentage of conjuncts mispredicted by each of the models on the printed words test protocols
Fig. 12  Percentage of conjuncts mispredicted by each model on the handwriting training and testing schemes

6.1 Analysis on conjunct characters

The evaluation metrics used so far are the WRR, which counts absolute matches, and the CRR and Total NED, which depend heavily on the calculation of edit distance. To calculate the edit distance between two words, the words are broken down into single characters. Bangla consonant clusters are formed from two or more consonants with a 'hoshonto' character in between each of the consonants, and thus the edit distance does not reflect how well these consonant clusters are being predicted. For this reason, we have performed a separate analysis to calculate the percentage of mispredicted consonant clusters.
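As a concrete illustration, the sketch below (our own, not code released with the paper) shows how character-level edit distance and a cluster-level error count of the kind described above can be computed; the function names, the normalization convention, and the consonant range used are our assumptions:

import re

HOSHONTO = "\u09cd"  # Bangla hoshonto (virama)

def levenshtein(a: str, b: str) -> int:
    """Edit distance over single Unicode characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def word_metrics(pairs):
    """WRR (exact word matches) plus a summed normalized edit distance.
    Normalizing by the longer string is one common convention; the
    paper's exact NED definition may differ."""
    wrr = 100.0 * sum(gt == pr for gt, pr in pairs) / len(pairs)
    total_ned = sum(levenshtein(gt, pr) / max(len(gt), len(pr), 1)
                    for gt, pr in pairs)
    return wrr, total_ned

def cluster_error_rate(pairs):
    """Share of ground-truth consonant clusters absent from the prediction.
    A simplified stand-in for the paper's cluster analysis: clusters are
    matched as hoshonto-joined runs over the basic consonant block only."""
    cluster = re.compile(rf"[\u0995-\u09b9](?:{HOSHONTO}[\u0995-\u09b9])+")
    total = missed = 0
    for gt, pr in pairs:
        for m in cluster.finditer(gt):
            total += 1
            missed += m.group(0) not in pr
    return 100.0 * missed / total if total else 0.0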
Although the CRNN-Naive model did better in terms of our evaluation metrics on Protocol I, as shown in Table 7, from Fig. 11 we can see that the CRNN-Naive model had a higher error rate than CRNN-VDS and CRNN-ADS. This trend was consistent across all the protocols, as the novel character representation methods performed better on the consonant clusters. Protocols IV, V and VI consist of real data, and since Protocol V is mainly comprised of simple words and single-character words, there are not many complex consonant clusters to predict and thus the error rate is low for all the models. However, on Protocols IV and VI, we can really see the difference. On Protocol IV, the error rate of the CRNN-Naive model is 30.18%, whereas the CRNN-VDS and CRNN-ADS models have considerably lower error rates of 21.52% and 21.2% respectively. Protocol VI, specially built for testing such cases, also demonstrates a visible difference between the novel representation methods and the naive method. Here the CRNN-VDS and CRNN-ADS models have error rates of 10.74% and 11.57% respectively, while the CRNN-Naive model mispredicted 17.36% of the consonant clusters.
Some word samples with conjunct characters are shown in Table 10. We can see that the CRNN-Naive model has a high edit distance on these samples, while the CRNN-VDS and CRNN-ADS models have predicted the words perfectly or with a low edit distance.
Figure 12 exhibits a trend similar to Fig. 11: here too, the CRNN-VDS and CRNN-ADS models with the novel grapheme representation methods perform better than the CRNN-Naive model with the naive method. In all the schemes, there is a noticeable difference. In the first scheme, there is almost a 10 percent difference between the best-performing CRNN-VDS model and the CRNN-Naive model with the naive representation. On the last three schemes, the margins were even greater.
Finally, by observing the performance of the models on Bangla consonant clusters in Figs. 11 and 12, as well as the results of the models in Sects. 5.2.1 and 5.2.2, we can conclude that the VDS grapheme representation is fitting for the Bangla OCR problem. The model achieved better scores on the test protocols as well as the real handwriting datasets, and also recognized the consonant clusters better. The ADS grapheme representation also showed better performance than the naive method on the real handwriting datasets and did better at recognizing consonant clusters. Also, the high WRR and CRR achieved on our test protocols indicate that the models trained on the synthetic dataset perform well on the hybrid and real test sets. As the CRNN-VDS model with the VDS grapheme representation performed better than the rest, we chose it to further test its generalizability over three disparate real-world test sets reported in "Appendix A" and found that the model performs exceptionally well on an unseen real-world printed dataset.

6.2 Analysis on short and long words

In Bangla text, as conjunct characters can be broken down into two or more characters, the length of Bangla words tends to be longer than in other scripts, such as Latin. This is why it is important for a Bangla OCR system to be able to predict longer words effectively. However, the prediction sequence length is usually fixed (29 in our case), and thus it is difficult to predict a long sequence using the naive method. This is where our CRNN-VDS and CRNN-ADS models shine. We have conducted experiments on various word lengths: 1–4, 5–8, 9–12 and 13+ to test the models' performance at different text lengths. As different representation methods can produce different numbers of labels for the same text, we have considered the number of Unicode characters in a text to be the word length, to keep the calculation consistent across all the extraction methods. From Table 11, we can observe that our novel methods perform better than the naive method on both short and long words on all real and hybrid test protocols.
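To make the grouping unambiguous, the following is a small sketch (our own illustration, under our own assumptions) of the length bucketing used in Table 11; words are grouped by Unicode code-point count so that the same buckets apply regardless of the label-extraction method:

from collections import defaultdict

def length_bucket(word: str) -> str:
    """Bucket by number of Unicode code points, not extracted labels."""
    n = len(word)
    if n <= 4:
        return "1-4"
    if n <= 8:
        return "5-8"
    if n <= 12:
        return "9-12"
    return "13+"

def wrr_by_length(pairs):
    """Per-bucket word recognition rate over (ground truth, prediction) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for gt, pred in pairs:
        bucket = length_bucket(gt)
        totals[bucket] += 1
        hits[bucket] += int(gt == pred)
    return {b: 100.0 * hits[b] / totals[b] for b in totals}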


Table 10  Example words with at least one conjunct where the CRNN-VDS and CRNN-ADS models outperform the CRNN-Naive model

Table 11  WRR of the models at different word lengths (in terms of the number of Unicode characters) on the six test protocols

Protocol   Model        Length 1–4   Length 5–8   Length 9–12   Length 13 and above
I          CRNN-Naive   96.71        95.69        92.54         84.61
           CRNN-VDS     96.92        94.58        90.45         83.87
           CRNN-ADS     97.21        95.54        91.66         84.54
           Samples      4224         37,255       25,048        8610
II         CRNN-Naive   71.25        87.50        100.00        0.00
           CRNN-VDS     78.75        89.42        100.00        0.00
           CRNN-ADS     67.50        88.46        100.00        0.00
           Samples      80           104          15            0
III        CRNN-Naive   37.43        61.49        71.55         63.46
           CRNN-VDS     37.15        63.78        72.80         68.44
           CRNN-ADS     35.75        63.29        71.23         67.11
           Samples      358          1441         956           301
IV         CRNN-Naive   30.68        47.89        55.93         52.27
           CRNN-VDS     37.17        52.48        57.63         63.64
           CRNN-ADS     28.61        46.61        55.93         61.36
           Samples      339          545          177           44
V          CRNN-Naive   80.16        93.17        89.68         100.00
           CRNN-VDS     82.52        92.67        88.79         97.50
           CRNN-ADS     79.11        93.30        88.20         100.00
           Samples      761          791          339           40
VI         CRNN-Naive   75.00        74.67        96.55         33.33
           CRNN-VDS     75.00        76.00        93.10         50.00
           CRNN-ADS     100.00       76.00        100.00        50.00
           Samples      4            75           29            6

Bold indicates that the model has the best result on the protocol among the three models

6.3 Limitations and error cases

The novel representation methods do not perform exceptionally well on the large-scale synthetic testing Protocol I. The Protocol I data are visually similar to the synthetic training dataset; samples of the training data with synthetically added noise are shown in Fig. 13. Due to this visual similarity, it is the easiest protocol for our recognition models, and on this protocol the margin between the models is very slim.
The CRNN-VDS method performs best on real-world samples when it has been trained on a large enough dataset. However, as the number of classes for CRNN-VDS is more than five times higher than that of the baseline CRNN-Naive model, it needs a lot of data to achieve good accuracy and distinguish each class properly. This has been made clear from the results on the handwriting datasets, where the training samples were few and the CRNN-ADS model outperformed


the CRNN-VDS model on all the testing schemes. The high number of classes also increases the total trainable parameters of the model, as shown in Table 6, which increases model complexity and inference time. However, the increase in complexity and parameter size is not too large, and thus the CRNN-VDS model remains a viable solution when large datasets are available. When the training set is smaller, our novel CRNN-ADS model is the go-to architecture, achieving better results on real-world handwriting datasets, as shown in Table 9.
One weakness of the CRNN-VDS and CRNN-ADS models is that they occasionally, and falsely, assume a single character to be a consonant cluster. This type of misprediction greatly affects the recognition metrics, such as the NED. As the edit distance is calculated by aligning the prediction and ground-truth strings and counting the number of Unicode characters not correctly recognized, mispredicting a single character as a consonant cluster significantly increases the edit distance, since a single consonant cluster can contain three or more characters; for example, if the single character ক is predicted as the cluster ক্ষ (ক + ্ + ষ in Unicode), two extra characters are counted against the prediction. Table 12 shows some sample cases where CRNN-Naive predicted simple words perfectly, yet CRNN-VDS or CRNN-ADS mispredicted a single character as a consonant cluster and greatly increased the edit distance. Despite this error case, our novel CRNN-VDS and CRNN-ADS models perform better on the real-world printed word testing protocols and handwriting datasets.

Fig. 13  Samples of synthetic words with synthetic noise used to train the models

Table 12  Examples where CRNN-VDS or CRNN-ADS fails with a high edit distance due to conjunct prediction instead of single character

7 Conclusion

We investigated the challenges in Bangla word-level OCR constituted by the conjunct characters present in Bangla script and proposed two novel grapheme representation strategies, Vowel Diacritics Separation (VDS) and All Diacritics Separation (ADS). Due to the scarcity of word image data in Bangla OCR, we created a synthetic corpus containing 2 million samples with Bangla words having multiple variations in the design and size of the fonts, covering a range of domains. To evaluate the models trained on our synthetic corpus, we also proposed six different testing sets, each designed to examine a particular aspect of the Bangla OCR problem.
Both our novel separation methods have shown superior conjunct recognition accuracy on all the test sets and the external datasets. On external handwriting datasets, the best-performing novel method surpassed the conjunct recognition rate of the baseline by a margin of 10% or more. Overall OCR results on the testing protocols demonstrate that our novel VDS separation method outperformed the baseline naive separation on every hybrid and real test dataset, while results on the external handwriting datasets demonstrated the superiority of the ADS separation method. However, results on test protocols III and IV show that, compared to the other protocols, our model struggles with the noisy italic, bold, and underlined data present in Protocol III and the real typeset and binarized data present in Protocol IV. By establishing our research as a baseline, future work can utilize our released synthetic datasets and test protocols to improve upon the shortcomings of our method, and the metrics reported in this paper can be used for comparison in future research.

Acknowledgements This research was supported in part by the Enhancement of Bangla Language in ICT through Research & Development (EBLICT) Project, under the Ministry of ICT, the Government of Bangladesh.


Declarations

Conflict of interest The authors have no competing interests to declare that are relevant to the content of this article.

Appendix A: Results on an external testing partition

To evaluate the generalizability of the models in accordance with the training data, in this section we conducted a testing experiment on extrinsic data. The data we used are primarily intended for use in an OCR project by the Government of Bangladesh. Due to the confidentiality of the assignment, we cannot publish these external testing sets. However, we report the results in Table 13 to stress the performance of the models trained on completely synthetic data. We tested on entirely unseen data that were not seen by the model in training or validation and were accumulated from domains unknown to the model.
The word-level data used for this testing mainly comprise three types of documents. Firstly, the Computer Composed test set consists of data written using computers; examples of Computer Composed documents are government notices, letters, etc. Next, we use the Letterpress test set containing data published from press-based printing, where some common examples are books and posters. The third testing set is the Typewriter dataset, which consists of data composed using a typewriter device, mostly used for legal documents in Bangladesh.
In Fig. 14, we show examples of the data we used for testing. We can observe the natural noise that exists in the samples, making it more challenging for the model to predict correctly. This is mainly because natural noises are difficult to recreate synthetically. In Table 13, we report the results of the evaluation. For this testing, we pick our CRNN-VDS model with the VDS character representation, mainly due to its competitive performance on the other reported test sets with real data (Protocols II–V).

Fig. 14  Samples of words from Computer Composed, Letterpress and Typewriter data used for extrinsic testing

Table 13  Performance of the CRNN-VDS model when tested on external testing sets

Test set            Model      WRR (%)   CRR (%)
Computer Composed   CRNN-VDS   79.03     93.04
Letterpress         CRNN-VDS   57.86     83.61
Typewriter          CRNN-VDS   28.05     70.10

From Table 13, we can observe that the CRNN-VDS model achieved a WRR of 79.03% on the Computer Composed dataset of around five hundred thousand samples, which we believe is a significant accomplishment considering that the model has not seen data from this domain and was trained on completely synthetic data. The success in recognition is due to our synthetic data generation process. We meticulously selected almost all the open-source and popular Bangla fonts and generated images of different lengths. During the training process, we also added noise that reflects some of the real-world noise, such as blur, salt-and-pepper, etc. Figure 13 shows some of the input images with added noise used during training. These images capture the visual diversity present in real-world computer-composed documents and are thus able to train the model well enough to obtain high accuracy on real-world unseen test sets.
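As an aside, the salt-and-pepper corruption mentioned above is straightforward to reproduce; the sketch below is our own illustration (the noise amount is a hypothetical value, not one from the paper), with blur and other effects available through libraries such as Albumentations [86]:

import numpy as np

def salt_and_pepper(img: np.ndarray, amount: float = 0.02,
                    rng: np.random.Generator | None = None) -> np.ndarray:
    """Flip a random fraction of pixels to black (pepper) or white (salt)."""
    rng = rng or np.random.default_rng()
    out = img.copy()
    mask = rng.random(img.shape[:2])
    out[mask < amount / 2] = 0        # pepper: black pixels
    out[mask > 1 - amount / 2] = 255  # salt: white pixels
    return out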
The CRNN-VDS model also showed inadequate performance on the Letterpress and Typewriter testing sets, achieving WRRs of 57.86% and 28.05% respectively. The sub-standard performance is mainly due to the unique noise introduced in these datasets, which in some cases even differs in terms of the color scheme of the background. Also, the synthetic data the model was trained on mostly mimicked computer-composed data rather than letterpress or typewriter data, as texts from those domains usually have different fonts, paper textures, and noise unique to those domains. Finally, this evaluation allows us to estimate the performance of the synthetically trained CRNN-VDS model in production circumstances.

Appendix B: Algorithm for the extraction methods

In traditional sequential models, Unicode texts are broken down into single characters and each character is used as a label. Our novel VDS and ADS methods aim to keep the consonant clusters unbroken by keeping all the participating consonants of a cluster together as a single label instead of using multiple labels to represent it. VDS and ADS are rule-based methods that are used to extract labels from a text for training or testing purposes for the CRNN-VDS and CRNN-ADS architectures. Algorithms 1 and 2 represent the VDS method, and Algorithms 3 and 4 represent the ADS method.


Algorithm 1 VDS Grapheme Extraction (Part 1)

Require: Necessary variables:
1: word ▷ Unicode text from which grapheme labels will be extracted
2: consonants ▷ list of consonant characters
3: formsCluster ▷ a hash table where the keys are consonant characters and values are lists of consonants that form a consonant cluster with that character
4: formsTripleCluster ▷ a hash table where the keys are consonant clusters of two consonants and values are lists of consonants that form a consonant cluster of three consonants with that cluster
5: hoshonto, jo, jofola ▷ variables that hold the Unicode for these
6: chars ⇐ [] ▷ stores the extracted labels of word
7: i ⇐ 0 ▷ loop counter
8: adjust ⇐ 0
9: while i < length(word) do
10:   if i + 1 < length(word) & word[i + 1] == hoshonto then
11:     if i + 2 < length(word) & word[i + 2] == jo then
12:       if word[i] in consonants then
13:         chars.append(word[i − adjust : i + 3])
14:       else
15:         chars.append(word[i − adjust : i + 1])
16:         chars.append(jofola)
17:       end if
18:       adjust ⇐ 0
19:       i ⇐ i + 3
20:     else if i + 2 < length(word) & adjust ≠ 0 & word[i − adjust : i + 1] in formsTripleCluster
        & word[i + 2] in formsTripleCluster[word[i − adjust : i + 1]] then
22:       if i + 3 < length(word) & word[i + 3] == hoshonto then
23:         adjust ⇐ adjust + 2
24:         i ⇐ i + 2
25:       else
26:         chars.append(word[i − adjust : i + 3])
27:         adjust ⇐ 0
28:         i ⇐ i + 3
29:       end if
30:     else if i + 2 < length(word) & adjust == 0 & word[i] in formsCluster
        & word[i + 2] in formsCluster[word[i]] then
32:       if i + 3 < length(word) & word[i + 3] == hoshonto then
33:         adjust ⇐ adjust + 2
34:         i ⇐ i + 2
35:       else
36:         chars.append(word[i − adjust : i + 3])
37:         adjust ⇐ 0
38:         i ⇐ i + 3
39:       end if
40:     else
41:       chars.append(word[i − adjust : i + 1])
42:       chars.append(hoshonto)
43:       adjust ⇐ 0
44:       i ⇐ i + 2
45:     end if

Algorithm 2 VDS Grapheme Extraction (Part 2)

46:   else
47:     chars.append(word[i : i + 1])
48:     i ⇐ i + 1
49:   end if
50: end while
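For readers who prefer code to pseudocode, the following is a minimal, runnable Python sketch of the central idea of the VDS extraction. It covers only the two-consonant cluster case, omitting the jo-fola and triple-cluster handling of Algorithms 1 and 2, and the contents of forms_cluster here are illustrative placeholders rather than the full table used in the paper:

HOSHONTO = "\u09cd"  # Bangla hoshonto (virama)

# Illustrative placeholder entries: keys are first consonants, values are
# consonants that may follow them to form a cluster.
forms_cluster = {
    "\u0995": ["\u09b7", "\u0995"],  # e.g. ka + ssa, ka + ka
    "\u09a8": ["\u09a6", "\u09a8"],  # e.g. na + da, na + na
}

def vds_labels(word: str) -> list:
    """Keep consonant + hoshonto + consonant together as one label."""
    chars, i = [], 0
    while i < len(word):
        if (i + 2 < len(word)
                and word[i + 1] == HOSHONTO
                and word[i + 2] in forms_cluster.get(word[i], [])):
            chars.append(word[i:i + 3])  # whole cluster becomes one label
            i += 3
        else:
            chars.append(word[i])        # single character, single label
            i += 1
    return chars

Under this sketch, a word containing ক্ষ would yield the single label ক্ষ instead of the three naive labels ক, ্ and ষ.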


Algorithm 3 ADS Grapheme Extraction (Part 1)

Require: Necessary variables:
word ▷ Unicode text from which grapheme labels will be extracted
consonants ▷ list of consonant characters
formsCluster ▷ a hash table where the keys are consonant characters and values are lists of consonants that form a consonant cluster with that character
3: formsTripleCluster ▷ a hash table where the keys are consonant clusters of two consonants and values are lists of consonants that form a consonant cluster of three consonants with that cluster
hoshonto, jo, ro, ref, jofola, rofola ▷ variables that hold the Unicode for these
chars ⇐ []
6: i ⇐ 0
adjust ⇐ 0
while i < length(word) do
9: if i + 1 < length(word) & word[i + 1] == hoshonto then
    if word[i] is ro then
      chars.append(ref)
12:     adjust ⇐ 0
      i ⇐ i + 2
    else if i + 2 < length(word) & word[i + 2] == jo then
15:     chars.append(word[i − adjust : i + 1])
      chars.append(jofola)
      adjust ⇐ 0
18:     i ⇐ i + 3
    else if i + 2 < length(word) & word[i + 2] == ro then
      chars.append(word[i − adjust : i + 1])
21:     chars.append(rofola)
      if i + 3 < length(word) & word[i + 3] == hoshonto & i + 4 < length(word) & word[i + 4] == jo then
24:       chars.append(jofola)
        i ⇐ i + 5
      else
27:       i ⇐ i + 3
      end if
      adjust ⇐ 0
30:   else if i + 2 < length(word) & adjust ≠ 0 & word[i − adjust : i + 1] in formsTripleCluster & word[i + 2] in formsTripleCluster[word[i − adjust : i + 1]] then
      if i + 3 < length(word) & word[i + 3] == hoshonto then
33:       adjust ⇐ adjust + 2
        i ⇐ i + 2


Algorithm 4 ADS Grapheme Extraction (Part 2)

      else
36:       chars.append(word[i − adjust : i + 3])
        adjust ⇐ 0
        i ⇐ i + 3
39:     end if
    else if i + 2 < length(word) & adjust == 0 & word[i] in formsCluster & word[i + 2] in formsCluster[word[i]] then
42:     if i + 3 < length(word) & word[i + 3] == hoshonto then
        adjust ⇐ adjust + 2
        i ⇐ i + 2
45:     else
        chars.append(word[i − adjust : i + 3])
        adjust ⇐ 0
48:       i ⇐ i + 3
      end if
    else
51:     chars.append(word[i − adjust : i + 1])
      chars.append(hoshonto)
      adjust ⇐ 0
54:     i ⇐ i + 2
    end if
  else
57:   chars.append(word[i : i + 1])
    i ⇐ i + 1
  end if
60: end while
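As an illustration of the ADS rules above (an example we derived from the algorithm, not one given in the paper): for the word কর্ম (ক, র, ্, ম), the র + hoshonto pair is emitted as the ref label, giving the labels ক, ref, ম; for জ্য (জ, ্, য), the trailing য is emitted as the jo-fola label, giving জ, jo-fola.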

References

1. Rabiner, L., Juang, B.: An introduction to hidden Markov models. IEEE ASSP Mag. 3(1), 4–16 (1986)
2. Congdon, P.: Bayesian Statistical Modelling. Wiley Series in Probability and Statistics, Wiley (2006). https://doi.org/10.1002/9780470035948
3. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
5. Almazán, J., Gordo, A., Fornés, A., Valveny, E.: Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell. 36(12), 2552–2566 (2014)
6. Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227 (2014)
7. Feng, X., Yao, H., Zhang, S.: Focal CTC loss for Chinese optical character recognition on unbalanced datasets. Complexity 2019 (2019)
8. Kang, L., Riba, P., Rusiñol, M., Fornés, A., Villegas, M.: Pay attention to what you read: non-recurrent handwritten text-line recognition. arXiv preprint arXiv:2005.13044 (2020)
9. Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. Int. J. Comput. Vis. 116(1), 1–20 (2016). https://doi.org/10.1007/s11263-015-0823-z
10. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(11), 2298–2304 (2016)
11. Hu, W., Cai, X., Hou, J., Yi, S., Lin, Z.: GTC: guided training of CTC towards efficient and accurate scene text recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11005–11012 (2020)
12. Rifat, M.J.R., Banik, M., Hasan, N., Nahar, J., Rahman, F.: A novel machine annotated balanced Bangla OCR corpus. In: Singh, S.K., Roy, P., Raman, B., Nagabhushan, P. (eds.) Comput. Vis. Image Process., pp. 149–160. Springer, Singapore (2021)
13. Anthimopoulos, M., Gatos, B., Pratikakis, I.: Detection of artificial and scene text in images and video frames. Pattern Anal. Appl. 16(3), 431–446 (2013)
14. Chen, H., Tsai, S.S., Schroth, G., Chen, D.M., Grzeszczuk, R., Girod, B.: Robust text detection in natural images with edge-enhanced maximally stable extremal regions. In: 2011 18th IEEE International Conference on Image Processing, pp. 2609–2612 (2011). IEEE
15. Epshtein, B., Ofek, E., Wexler, Y.: Detecting text in natural scenes with stroke width transform. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2963–2970 (2010). IEEE
16. Huang, W., Qiao, Y., Tang, X.: Robust scene text detection with convolution neural network induced MSER trees. In: European Conference on Computer Vision, pp. 497–511 (2014). Springer
17. Alsharif, O., Pineau, J.: End-to-end text recognition with hybrid HMM maxout models. arXiv preprint arXiv:1310.1811 (2013)
18. Gordo, A.: Supervised mid-level features for word image representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2956–2964 (2015)
19. Neumann, L., Matas, J.: A method for text localization and recognition in real-world images. In: Asian Conference on Computer Vision, pp. 770–783 (2010). Springer
20. Mishra, A., Alahari, K., Jawahar, C.: Image retrieval using textual cues. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3040–3047 (2013)
21. Smith, R.: An overview of the Tesseract OCR engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 629–633 (2007). IEEE


22. Shi, B., Wang, X., Lyu, P., Yao, C., Bai, X.: Robust scene text recognition with automatic rectification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4168–4176 (2016)
23. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
24. Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., i Bigorda, L.G., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., De Las Heras, L.P.: ICDAR 2013 robust reading competition. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 1484–1493 (2013). IEEE
25. Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., Oh, S.J., Lee, H.: What is wrong with scene text recognition model comparisons? Dataset and model analysis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)
26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
27. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376 (2006)
28. Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., Bai, X.: ASTER: an attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell. 41(9), 2035–2048 (2019). https://doi.org/10.1109/TPAMI.2018.2848939
29. Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., Zhou, S.: Focusing attention: towards accurate text recognition in natural images. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5086–5094 (2017). https://doi.org/10.1109/ICCV.2017.543
30. Litman, R., Anschel, O., Tsiper, S., Litman, R., Mazor, S., Manmatha, R.: Scatter: selective context attentional scene text recognizer. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11959–11969 (2020). https://doi.org/10.1109/CVPR42600.2020.01198
31. Yu, D., Li, X., Zhang, C., Liu, T., Han, J., Liu, J., Ding, E.: Towards accurate scene text recognition with semantic reasoning networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12113–12122 (2020)
32. Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., et al.: ICDAR 2015 competition on robust reading. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160 (2015). IEEE
33. Feng, X., Yao, H., Qi, Y., Zhang, J., Zhang, S.: Scene text recognition via transformer. arXiv preprint arXiv:2003.08077 (2020)
34. Atienza, R.: Vision transformer for fast and efficient scene text recognition. In: Document Analysis and Recognition—ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part I, vol. 16, pp. 319–334 (2021). Springer
35. Wu, J., Peng, Y., Zhang, S., Qi, W., Zhang, J.: Masked vision-language transformers for scene text recognition. arXiv preprint arXiv:2211.04785 (2022)
36. Wang, P., Da, C., Yao, C.: Multi-granularity prediction for scene text recognition. In: Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pp. 339–355 (2022). Springer
37. Xie, X., Fu, L., Zhang, Z., Wang, Z., Bai, X.: Toward understanding WordArt: corner-guided transformer for scene text recognition. In: Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pp. 303–321 (2022). Springer
38. Aberdam, A., Ganz, R., Mazor, S., Litman, R.: Multimodal semi-supervised learning for text recognition. arXiv preprint arXiv:2205.03873 (2022)
39. Yang, M., Liao, M., Lu, P., Wang, J., Zhu, S., Luo, H., Tian, Q., Bai, X.: Reading and writing: discriminative and generative modeling for self-supervised text recognition. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 4214–4223 (2022)
40. Chu, X., Wang, Y.: IterVM: iterative vision modeling module for scene text recognition. In: 2022 26th International Conference on Pattern Recognition (ICPR), pp. 1393–1399 (2022). IEEE
41. Du, Y., Chen, Z., Jia, C., Yin, X., Zheng, T., Li, C., Du, Y., Jiang, Y.-G.: SVTR: scene text recognition with a single visual model. arXiv preprint arXiv:2205.00159 (2022)
42. Zheng, C., Li, H., Rhee, S.-M., Han, S., Han, J.-J., Wang, P.: Pushing the performance limit of scene text recognizer without human annotation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14116–14125 (2022)
43. Chammas, E., Mokbel, C., Likforman-Sulem, L.: Handwriting recognition of historical documents with few labeled data. In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 43–48 (2018). IEEE
44. Kišš, M., Hradiš, M., Beneš, K., Buchal, P., Kula, M.: SoftCTC—semi-supervised learning for text recognition using soft pseudo-labels. arXiv (2022). arXiv:2212.02135
45. Yousef, M., Hussain, K.F., Mohammed, U.S.: Accurate, data-efficient, unconstrained text recognition with convolutional neural networks. Pattern Recogn. 108, 107482 (2020). https://doi.org/10.1016/j.patcog.2020.107482
46. Maillette de Buy Wenniger, G., Schomaker, L., Way, A.: No padding please: efficient neural handwriting recognition. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 355–362 (2019). https://doi.org/10.1109/ICDAR.2019.00064
47. Kass, D., Vats, E.: AttentionHTR: handwritten text recognition based on attention encoder–decoder networks. In: Uchida, S., Barney, E., Eglin, V. (eds.) Document Analysis Systems, pp. 507–522. Springer, Cham (2022)
48. Fogel, S., Averbuch-Elor, H., Cohen, S., Mazor, S., Litman, R.: ScrabbleGAN: semi-supervised varying length handwritten text generation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4323–4332 (2020). https://doi.org/10.1109/CVPR42600.2020.00438
49. Souibgui, M.A., Fornés, A., Kessentini, Y., Megyesi, B.: Few shots are all you need: a progressive learning approach for low resource handwritten text recognition. Pattern Recogn. Lett. 160, 43–49 (2022). https://doi.org/10.1016/j.patrec.2022.06.003
50. Rahman, A., Kaykobad, M.: A complete Bengali OCR: a novel hybrid approach to handwritten Bengali character recognition. J. Comput. Inf. Technol. 6(4), 395–413 (1998)
51. Pal, U., Chaudhuri, B.B.: OCR in Bangla: an Indo-Bangladeshi language. In: Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 3—Conference C: Signal Processing (Cat. No.94CH3440-5), vol. 2, pp. 269–273 (1994). https://doi.org/10.1109/ICPR.1994.576917
52. Sattar, M., Rahman, S.: An experimental investigation on Bangla character recognition system. Bangladesh Comput. Soc. J. 4(1), 1–4 (1989)
53. Rahman, A.F.R., Fairhurst, M.: Multi-prototype classification: improved modelling of the variability of handwritten data using statistical clustering algorithms. Electron. Lett. 33(14), 1208–1210 (1997)
54. Pal, U.: On the development of an optical character recognition (OCR) system for printed Bangla script. PhD thesis, Indian Statistical Institute, Calcutta (1997)


55. Chaudhuri, B., Pal, U.: A complete printed Bangla OCR system. Pattern Recogn. 31(5), 531–549 (1998)
56. Rahman, A.F.R., Fairhurst, M.C.: A new hybrid approach in combining multiple experts to recognise handwritten numerals. Pattern Recogn. Lett. 18(8), 781–790 (1997)
57. Rahman, A.F.R., Rahman, R., Fairhurst, M.C.: Recognition of handwritten Bengali characters: a novel multistage approach. Pattern Recogn. 35(5), 997–1006 (2002)
58. Mahmud, J.U., Raihan, M.F., Rahman, C.M.: A complete OCR system for continuous Bengali characters. In: TENCON 2003. Conference on Convergent Technologies for Asia-Pacific Region, vol. 4, pp. 1372–1376 (2003). IEEE
59. Kamruzzaman, J., Aziz, S.: Improved machine recognition for Bangla characters. In: International Conference on Electrical and Computer Engineering 2004, pp. 557–560 (2004). ICECE 2004 Conference Secretariat, Bangladesh University of Engineering and Technology
60. Alam, M.M., Kashem, M.A.: A complete Bangla OCR system for printed characters. JCIT 1(01), 30–35 (2010)
61. Ahmed, S., Kashem, M.A.: Enhancing the character segmentation accuracy of Bangla OCR using BPNN. Int. J. Sci. Res. (IJSR) ISSN (Online), 2319–7064 (2013)
62. Chowdhury, A.A., Ahmed, E., Ahmed, S., Hossain, S., Rahman, C.M.: Optical character recognition of Bangla characters using neural network: a better approach. In: 2nd ICEE (2002)
63. Ahmed, S., Sakib, A.N., Ishtiaque Mahmud, M., Belali, H., Rahman, S.: The anatomy of Bangla OCR system for printed texts using back propagation neural network. Glob. J. Comput. Sci. Technol. (2012)
64. Afroge, S., Ahmed, B., Hossain, A.: Bangla optical character recognition through segmentation using curvature distance and multilayer perceptron algorithm. In: 2017 International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 253–257 (2017). IEEE
65. Hossain, S.A., Tabassum, T.: Neural net based complete character recognition scheme for Bangla printed text books. In: 16th International Conference on Computer and Information Technology, pp. 71–75 (2014). IEEE
66. Pramanik, R., Bag, S.: Shape decomposition-based handwritten compound character recognition for Bangla OCR. J. Vis. Commun. Image Represent. 50, 123–134 (2018)
67. Ghosh, R., Vamshi, C., Kumar, P.: RNN based online handwritten word recognition in Devanagari and Bengali scripts using horizontal zoning. Pattern Recogn. 92, 203–218 (2019)
68. Purkaystha, B., Datta, T., Islam, M.S.: Bengali handwritten character recognition using deep convolutional neural network. In: 2017 20th International Conference of Computer and Information Technology (ICCIT), pp. 1–5 (2017). IEEE
69. Islam, M.S., Rahman, M.M., Rahman, M.H., Rivolta, M.W., Aktaruzzaman, M.: RatNet: a deep learning model for Bengali handwritten characters recognition. Multimed. Tools Appl. 81, 10631–10651 (2022). https://doi.org/10.1007/s11042-022-12070-4
70. Maity, S., Dey, A., Chowdhury, A., Banerjee, A.: Handwritten Bengali character recognition using deep convolution neural network. In: Bhattacharjee, A., Borgohain, S.K., Soni, B., Verma, G., Gao, X.-Z. (eds.) Machine Learning, Image Processing, Network Security and Data Sciences, pp. 84–92. Springer, Singapore (2020)
71. Roy, A.: AKHCRNet: Bengali handwritten character recognition using deep learning (2020)
72. Sharif, S., Mohammed, N., Momen, S., Mansoor, N.: Classification of Bangla compound characters using a HOG-CNN hybrid model. In: Proceedings of the International Conference on Computing and Communication Systems, pp. 403–411 (2018). Springer
73. Hasan, M.J., Wahid, M.F., Alom, M.S.: Bangla compound character recognition by combining deep convolutional neural network with bidirectional long short-term memory. In: 2019 4th International Conference on Electrical Information and Communication Technology (EICT), pp. 1–4 (2019). IEEE
74. Paul, D., Chaudhuri, B.B.: A BLSTM network for printed Bengali OCR system with high accuracy. arXiv preprint arXiv:1908.08674 (2019)
75. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010). JMLR Workshop and Conference Proceedings
76. Rahman, M.A., Tabassum, N., Paul, M., Pal, R., Islam, M.K.: BN-HTRd: a benchmark dataset for document level offline Bangla handwritten text recognition (HTR) and line segmentation. arXiv (2022). https://doi.org/10.48550/ARXIV.2206.08977. https://arxiv.org/abs/2206.08977
77. Mridha, M.F., Ohi, A.Q., Ali, M.A., Emon, M.I., Kabir, M.M.: BanglaWriting: a multi-purpose offline Bangla handwriting dataset. Data Brief. 34, 106633 (2021). https://doi.org/10.1016/j.dib.2020.106633
78. Banik, M., Rifat, M.J.R., Nahar, J., Hasan, N., Rahman, F.: Okkhor: a synthetic corpus of Bangla printed characters. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Proceedings of the Future Technologies Conference (FTC) 2020, vol. 1, pp. 693–711. Springer, Cham (2021)
79. Roark, B., Wolf-Sonkin, L., Kirov, C., Mielke, S.J., Johny, C., Demirsahin, I., Hall, K.: Processing South Asian languages written in the Latin script: the Dakshina dataset. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 2413–2423. European Language Resources Association, Marseille, France (2020). https://aclanthology.org/2020.lrec-1.294
80. Al Mumin, M.A., Shoeb, A.A.M., Selim, M.R., Iqbal, M.Z.: Sumono: a representative modern Bengali corpus. SUST J. Sci. Technol. 21(1), 78–86 (2014)
81. Biswas, E.: Bangla Largest Newspaper Dataset. Kaggle (2021). https://doi.org/10.34740/KAGGLE/DSV/1857507. https://www.kaggle.com/dsv/1857507
82. Ahmed, M.F., Mahmud, Z., Biash, Z.T., Ryen, A.A.N., Hossain, A., Ashraf, F.B.: Bangla Online Comments Dataset. Mendeley Data (2021). https://doi.org/10.17632/9xjx8twk8p.1. https://data.mendeley.com/datasets/9xjx8twk8p/1
83. Farahmand, A., Sarrafzadeh, H., Shanbehzadeh, J.: Document image noises and removal methods (2013)
84. Lee, C.-Y., Osindero, S.: Recursive recurrent nets with attention modeling for OCR in the wild. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2231–2239 (2016). https://doi.org/10.1109/CVPR.2016.245
85. Wagner, R.A., Fischer, M.J.: The string-to-string correction problem. J. ACM (JACM) 21(1), 168–173 (1974)
86. Buslaev, A., Iglovikov, V.I., Khvedchenya, E., Parinov, A., Druzhinin, M., Kalinin, A.A.: Albumentations: fast and flexible image augmentations. Information (2020). https://doi.org/10.3390/info11020125
87. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034. IEEE Computer Society, Los Alamitos, CA, USA (2015). https://doi.org/10.1109/ICCV.2015.123
88. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts (2017)


89. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019). http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
