
Received: 20 November 2019 Revised: 10 March 2020 Accepted: 1 April 2020

DOI: 10.1111/exsy.12565

ORIGINAL ARTICLE

Deep OCR for Arabic script-based language like Pashto

Saeeda Naz1 | Naila H. Khan2 | Shizza Zahoor1 | Muhammad I. Razzak3

1 Computer Science Department, Govt. Girls Postgraduate College No. 1, Abbottabad, KPK, Pakistan
2 Department of Computer Science, Institute of Management Sciences, Peshawar, Pakistan
3 School of Information Technology, Deakin University, Geelong, Australia

Correspondence
Saeeda Naz, Computer Science Department, GGPGC No.1, Abbottabad, KPK, Pakistan.
Email: saeedanaz292@gmail.com

Abstract
Developing cursive script recognition systems has always been a challenging task for researchers. This article proposes a ligature-based recognition system for the cursive Pashto script using four pre-trained CNN models with a fine-tuning approach. The SqueezeNet, ResNet, MobileNet and DenseNet models are evaluated for the classification and recognition of Pashto sub-words (ligatures). Overall, the proposed system is divided into two domains (source and target). The source domain contains the models pre-trained on the ImageNet dataset. These models are later fine-tuned using the transfer learning approach for Pashto ligature recognition. The data augmentation techniques of negative and contour are used to increase the representation of ligature images and the dataset size. The CNN models have been evaluated on the benchmark FAST-NU Pashto ligature dataset. The proposed system achieved the highest recognition rate of up to 99.31% for Pashto ligatures using the DenseNet architecture.

KEYWORDS

DenseNet, ligature recognition, MobileNet, ResNet, SqueezeNet

1 | INTRODUCTION

Optical Character Recognition (OCR) is a technology that enables computers to digitize text with high accuracy. However, several shortcomings of the technology still need to be addressed to increase accuracy further. OCR has proven extremely advantageous to many industries because it enables automatic digitization of large quantities of documents. The technology has advanced to real-time machine conversion and translation on mobile devices, tablets, etc. For many decades, OCR has been a widely explored area for several languages (Naz et al., 2014). Recognition for Latin script-based languages, such as English and French, has reached maturity for printed text, both ligature (sub-word) based and character based. Character-based analytical recognition systems rely on segmentation and process the set of characters, whereas ligature-based holistic recognition systems deal with the high-frequency ligatures for recognition (Naz et al., 2016, 2017). In comparison to the Latin script, cursive script-based languages like Pashto, Urdu, Arabic and Persian, as well as handwritten cursive English, still need further research in the field of OCR.
The Pashto language and its cursive script hold considerable significance for researchers because of their close links to history, society and heritage. The recent OCR literature reports the success of deep learning approaches for character and ligature recognition over traditional machine learning methods. In this study, deep neural network models are presented for the classification and recognition of Pashto ligatures. The proposed models are evaluated and compared with the existing state-of-the-art methods. The primary contributions of this research study are listed below.

• We deployed Convolutional Neural Networks (CNNs) that leverage fine-tuned features through a transfer learning approach with pre-trained models. Because the available data is insufficient, we evaluate the impact of deep transfer learning with fine-tuning across domains for ligature recognition.

Expert Systems. 2020;e12565. wileyonlinelibrary.com/journal/exsy © 2020 John Wiley & Sons, Ltd 1 of 11
https://doi.org/10.1111/exsy.12565

• The FAST-NU dataset does not contain enough instances per class. To handle this scarcity of samples, we performed data augmentation to increase the input space for the classifier.
• In extensive experiments on Pashto ligature recognition, considerable improvement is achieved in the classification of Pashto ligatures using the SqueezeNet, ResNet and DenseNet models on the FAST-NU dataset when compared against the state-of-the-art techniques.

The remainder of the paper is organized as follows. The deep learning models, insights into the Pashto language and the related literature are presented in Section 2. Section 3 describes the proposed methodology for Pashto ligature recognition, along with the dataset used for pre-processing and experimentation. Section 4 summarizes the results and their analysis and provides a comparison to the state-of-the-art methods. Conclusions and future directions are addressed in Section 5.

2 | BACKGROUND AND EXISTING WORK

To build a fully functional OCR system, it is necessary to know the background and the relevant existing methods for the language's script. The sub-sections below discuss the deep learning models in detail, with a primary focus on transfer learning and the Convolutional Neural Network (CNN) models used in this study, that is, SqueezeNet, ResNet, MobileNet and DenseNet. The Pashto language, its foundation, alphabet and complexities are also presented, followed by a detailed analysis of the recent state-of-the-art methods used for the recognition of Pashto script.

2.1 | Deep learning models

The traditional Machine Learning (ML) approaches use statistical models and analysis for the recognition of patterns. Over the years, these ML algorithms have improved and have been successfully applied to imitate human-like decision-making abilities. ML algorithms are divided into three broad categories: (a) supervised learning models, (b) unsupervised learning models and (c) reinforcement learning models. In supervised learning, computers are fed labelled data, whereas unsupervised learning deals with unlabelled data. In reinforcement learning, computers use trial-and-error mechanisms to generate results. Some well-known ML algorithms are Decision Trees, Support Vector Machines (SVM), Artificial Neural Networks (ANN), Naive Bayes, Random Forest and K-nearest neighbour (KNN).
Deep learning, a subset of machine learning, is used to tackle extremely complex research problems. It enables the processing and classification of data using deeper networks, allowing computers to make human-like decisions and predictions. Deep learning has been successfully applied to many applications such as face recognition, speech recognition, text recognition (Naz et al., 2017), writer identification (Rehman, Naz, Razzak, & Hameed, 2019), medical disease prediction (Naseer et al., 2020; Rehman, Naz, Razzak, Akram, & Imran, 2020), machine translation, computer vision and natural language processing. CNNs have proved tremendously successful for pattern recognition applications. Several pre-trained CNN architectures exist (AlexNet, GoogLeNet, VGGNet, ResNet, MobileNet, DenseNet, SqueezeNet, etc.), and their weights trained on the ImageNet dataset are available to the research community. Researchers can use these CNN networks to evaluate their own research problems and compare the results with state-of-the-art studies. Deep learning research has advanced considerably and now also focuses on transferring learned knowledge between tasks using “Transfer Learning (TL)” techniques.
Recently, transfer learning with fine-tuning has achieved tremendous popularity. It enables the use of pre-trained networks to solve a completely new research problem. Fine-tuning a pre-trained deep learning model with transfer learning is found to be much quicker than training a network from scratch. The transfer learning approach can be broadly divided into the Fine-tune approach and the Freeze approach. The Fine-tune approach retains the knowledge gained previously for one task and continues training it on another, related task, such as a different image dataset. The Freeze approach keeps the CNN weights learned from the source dataset frozen (the base convolutional layers, a few layers or all layers) and uses conventional classifiers such as SVM, KNN or LDA to classify the classes of the target dataset. The sub-sections below discuss the CNN models of SqueezeNet, ResNet, MobileNet and DenseNet. The illustrations for these CNN models are provided in Figure 1.

2.1.1 | SqueezeNet

FIGURE 1 The main building blocks of different CNN architectures

SqueezeNet is a CNN that was released to the public in 2016 (Iandola et al., 2016). Its small size provides several advantages over other deep learners: small CNN models require less memory, are much easier to transmit over a network and require less communication, so they can easily be deployed to cloud services. The SqueezeNet model is composed of a total of 18 layers. It consists of an initial convolution layer, fire modules and a final convolution layer. SqueezeNet keeps feature maps large by downsampling late in the network. The main building block of SqueezeNet is the fire module, which is composed of a squeeze layer followed by an expand layer. Both layers maintain the spatial size of the feature map. As its name suggests, the squeeze layer reduces the number of feature-map channels, whereas the expand layer increases it. The squeeze layer uses 1 x 1 filters, which reduce the number of channels passed on to the expand layer and therefore the total number of parameters. The expand layer uses a mix of 1 x 1 and 3 x 3 filters. SqueezeNet reports accuracy equivalent to AlexNet on the ImageNet dataset while using 50 times fewer parameters.
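
The parameter saving of a fire module can be checked with a little arithmetic. The sketch below counts weights (biases ignored) for one fire module and for a plain 3 x 3 convolution with the same number of input and output channels. The function names are hypothetical, and the channel sizes are illustrative values in the range used by Iandola et al. (2016); they are not taken from the experiments of this study.

```python
def conv_weights(in_ch, out_ch, k):
    """Number of weights in a k x k convolution (biases ignored)."""
    return in_ch * out_ch * k * k


def fire_weights(in_ch, squeeze, expand1x1, expand3x3):
    """Weights in a fire module: a 1x1 squeeze layer feeding
    parallel 1x1 and 3x3 expand filters."""
    return (conv_weights(in_ch, squeeze, 1)
            + conv_weights(squeeze, expand1x1, 1)
            + conv_weights(squeeze, expand3x3, 3))


# Illustrative sizes (assumption): 96 input channels, 16 squeeze filters,
# 64 + 64 expand filters, i.e. 128 output channels.
fire = fire_weights(in_ch=96, squeeze=16, expand1x1=64, expand3x3=64)
plain = conv_weights(in_ch=96, out_ch=128, k=3)

print(f"fire module: {fire} weights, plain 3x3 conv: {plain} weights")
print(f"reduction factor: {plain / fire:.1f}x")
```

The saving comes almost entirely from the squeeze layer: the 3 x 3 expand filters see 16 channels instead of 96.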

2.1.2 | ResNet

Over the years, deep learning has revolutionized image classification and recognition tasks. Solving complex tasks calls for deeper networks. However, deeper networks are complex models and are much more difficult to train, and their accuracy may start to degrade as depth increases. Residual learning is used to address these issues. ResNets (Residual Neural Networks) are a type of CNN that uses identity mappings to ease gradient propagation. The ResNet model was introduced in 2015 by He, Zhang, Ren, and Sun (2015). It uses skip connections to leap across several layers, which simplifies the optimization of the network. ResNet is composed of residual blocks. The layers in the middle of a block learn a residual function with respect to the block input, that is, the difference between the desired mapping and the input. The input to the block is then added to the output of the stacked layers through the skip connection. There are several variants of the ResNet model. They are named with the word ResNet followed by two or more digits, which represent the total number of layers in that model, such as 18, 34, 50, 101, 110, 152, 164, 1,202, etc.
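
The skip connection can be written down in a few lines. The following is a minimal sketch of the idea, assuming feature vectors stored as plain Python lists; `residual_block`, `toy_layers` and `zero_layers` are hypothetical names, and the stand-in functions replace the convolutional layers of a real block.

```python
def residual_block(x, stacked_layers):
    """Return F(x) + x, where F is the function computed by the stacked layers."""
    fx = stacked_layers(x)
    if len(fx) != len(x):
        raise ValueError("identity shortcut needs matching input and output sizes")
    return [f + xi for f, xi in zip(fx, x)]


def toy_layers(x):
    """Hypothetical stand-in for the stacked layers: a fixed element-wise scaling."""
    return [0.1 * v for v in x]


def zero_layers(x):
    """Stacked layers that have learned F(x) = 0."""
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.0]
y = residual_block(x, toy_layers)          # F(x) + x
identity = residual_block(x, zero_layers)  # F(x) = 0 leaves the input unchanged
print(y, identity)
```

When the stacked layers output zero, the block reduces to the identity mapping, which is why adding more residual blocks should not make the network harder to fit.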

2.1.3 | MobileNet

MobileNets, as the name suggests, are suited to mobile and embedded vision applications and to systems without a GPU. They are particularly practical for classification and recognition tasks when resources, computation power and space are limited. MobileNet deep learning models are lightweight and are also a good fit for web browsers. Their complexity and size are reduced by using depthwise separable convolutions, in which a depthwise convolution is followed by a pointwise convolution. The MobileNet model is composed of 30 layers containing convolution layers, depthwise layers, pointwise layers, etc. The depthwise convolution applies a single filter to each input channel. The pointwise (1 x 1) convolution then linearly combines the outputs of the depthwise convolution (Sandler, Howard, Zhu, Zhmoginov, & Chen, 2018).
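
The size reduction from separable convolutions follows directly from the weight counts. The sketch below compares a standard k x k convolution with its depthwise separable counterpart; the function names and the channel sizes (64 input, 128 output channels) are illustrative assumptions, and biases are ignored.

```python
def standard_conv_weights(in_ch, out_ch, k):
    """Weights of a standard k x k convolution."""
    return k * k * in_ch * out_ch


def separable_conv_weights(in_ch, out_ch, k):
    """Weights of a depthwise convolution followed by a pointwise convolution."""
    depthwise = k * k * in_ch   # one k x k filter per input channel
    pointwise = in_ch * out_ch  # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


std = standard_conv_weights(64, 128, 3)
sep = separable_conv_weights(64, 128, 3)
ratio = sep / std  # equals 1/out_ch + 1/k**2

print(f"standard: {std}, separable: {sep}, ratio: {ratio:.3f}")
```

For 3 x 3 filters the ratio is dominated by the 1/9 term, so the separable layer needs roughly one eighth to one ninth of the weights.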

2.1.4 | DenseNet

DenseNet deep learning models extend the idea of the skip connections used in the ResNet models. It has been observed that deep learning models with shorter connections between the layers close to the input and those close to the output are more efficient and accurate (Gao, Liu, van der Maaten, & Weinberger, 2016). In a DenseNet block, each layer receives as input the feature maps of all the preceding layers, and its own output is forwarded to all the succeeding layers of the block in a feed-forward manner. The feature maps are combined by concatenation rather than addition, so earlier features are preserved and reused.
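
Because feature maps are concatenated, the number of input channels grows from layer to layer within a block. The sketch below tracks that growth, assuming every layer adds the same number of new feature maps (a quantity usually called the growth rate, a term not used in this article). The function names and the values 64, 32 and 6 are illustrative assumptions.

```python
def dense_block_inputs(in_ch, growth_rate, num_layers):
    """Input channel count seen by each layer of a dense block.

    Layer i receives the block input plus the outputs of the i layers before it.
    """
    return [in_ch + growth_rate * i for i in range(num_layers)]


def dense_block_output(in_ch, growth_rate, num_layers):
    """Channels leaving the block: the block input plus every layer's output."""
    return in_ch + growth_rate * num_layers


inputs = dense_block_inputs(in_ch=64, growth_rate=32, num_layers=6)
out_ch = dense_block_output(in_ch=64, growth_rate=32, num_layers=6)
print(inputs, out_ch)
```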

2.2 | Pashto language

Pashto is the native language of the residents of Khyber Pakhtunkhwa, a province of Pakistan. It is also an official language of Afghanistan. The “Pashto Alphabet” is composed of 44 characters (Ahmad et al., 2015; Ahmad, Naz, Afzal, Amin, & Breuel, 2015) and a total of 4 diacritics. It has adopted 28 characters from the Arabic alphabet and 3 from the Persian alphabet. The 4 diacritic marks are used optionally to differentiate between similar characters. Pashto is officially written in the Naskh calligraphic style, although it is sometimes written in the Nastalique style too. Since Pashto is cursive in nature, the overlap between characters is, regardless of the style, a common challenge during segmentation when developing a recognition system. One way to avoid the overlap of characters and overcome the segmentation problem is to use ligatures as the unit of recognition. A ligature is a collection of two or more characters and is more commonly referred to as a sub-word. The development of ligature-based OCR systems for Pashto script still needs to be explored.

2.3 | Pashto OCR: State-of-the-art

The existing literature on Pashto OCR is notably limited. Existing studies have mostly focused on the recognition of isolated printed text. A few studies have addressed Pashto ligature recognition, but they report lower accuracy or use methodologies that are weak or extremely complex. Ligature-based systems are segmentation free, in contrast to character-based systems, which are segmentation based. It has been reported that a ligature-based system largely avoids the complexities of explicit segmentation, leading to improved recognition. In the literature, both traditional machine learning and deep learning have been used for Pashto script recognition. A summary of the existing state-of-the-art studies is given in Table 1.
Traditional machine learning approaches have been popular among researchers for various pattern recognition tasks such as OCR. Decerbo, MacRostie, and Natarajan (2004) used a Hidden Markov Model (HMM) for the recognition of Pashto in a system named “BBN Byblos.” The proposed method was evaluated on various datasets; however, the sizes of the datasets are not specified in the study and the datasets are not publicly available. Three scenarios were used for experimentation, based on the dataset: a synthetic dataset, scanned pages and fixed pages. An accuracy of 98.4% was achieved on the synthetic dataset, 97.9% on the scanned pages and 96.9% on the fixed pages. Likewise, Wahab et al. (2009) developed a recognition system for Pashto script using Principal Component Analysis (PCA) matrix manipulation. The authors also presented a Pashto ligature dataset named the “FAST-NU dataset,” where FAST-NU is the name of the university where the dataset was formed. The dataset is composed of 4,000 synthetic ligature images, such that each ligature has four versions based on the font size (12, 14, 16 and 18). Hence, the dataset contains 1,000 unique ligatures.
Similarly, a statistical analysis of the Pashto script was provided by Ahmad, Afzal, Rashid, Liwicki, Dengel, and Breuel (2015). To avoid segmentation complexities, the authors used ligatures as the recognizable units. They extracted 19,268 unique Pashto ligatures from different sources. The overall dataset was composed of a total of 2,313,736 ligatures. Their analysis found that 7,000 ligatures cover 91% of the unique ligatures of the Pashto language. In another study, SIFT features were extracted from Pashto ligatures (Ahmad et al., 2010). The study employed a holistic approach and used the FAST-NU Pashto dataset for analysis and experiments. One of its main contributions was the extension of the FAST-NU dataset. The second contribution was to avoid orientation problems by using the SIFT algorithm. The images in the FAST-NU dataset were rotated by ±90°, and scale and rotation invariance were examined. The system was evaluated for PCA-based and SIFT-based classification, with SIFT-S2 reporting a high accuracy of 74%. Similarly, the FAST-NU dataset was extended in another study by Ahmad, Naz, Afzal, Amin, and Breuel (2015). The original dataset contains 4,000 images of 1,000 unique ligatures at four different scales. Because orientation can also affect the recognition system, four rotation variations were introduced, extending the dataset to 8,000 Pashto ligature images. The Scale Invariant Feature Transform (SIFT) was used for the recognition of the ligatures. An accuracy of 81% was achieved under location, scale and orientation variation.

TABLE 1 Summary of State-of-the-art Pashto OCR Studies

Traditional Machine Learning & Other Techniques

Study | Method | Pashto dataset | Recognition rate
Wahab, Amin, and Ahmed (2009) | PCA | FAST-NU | –
Ahmad, Amin, and Khan (2010) | SIFT features based matching (Scale & Rotation) | FAST-NU | 74%
Ahmad, Naz, Afzal, Amin, and Breuel (2015) | SIFT features based matching (Scale, Rotation & Location) | FAST-NU | 81%
Khan et al. (2019) | K-nearest neighbour & neural network | 44 letters x 102 samples = 4,488 images | K-NN (70.05%) & NN (72%)
Ullah, Enayat, Nadeem, Saeed, and Junaid (2019) | Sequential minimal optimization (SMO) | 5,000 images | 92%

Deep learning techniques

Study | Method | Pashto dataset | Recognition rate
Ahmad et al. (2016) | MDLSTM & BLSTM | KPTI | 90.78%
Ahmad, Afzal, Rashid, Liwicki, and Breuel (2015) | MDLSTM & HMM & SIFT | FAST-NU | MDLSTM (98.9%) & HMM (89.9%) & SIFT (94.3%)
In the literature, very few studies have been directed toward handwritten text. Khan et al. (2019) proposed a method for the recognition of handwritten Pashto letters using KNN and neural network (NN) classifiers. A total of 44 Pashto letters were considered, with 102 samples for each letter. Accuracies of 72% and 70.05% were reported for NN and KNN, respectively. Similarly, Local Binary Patterns (LBP) and Sequential Minimal Optimization (SMO) were used for Pashto feature extraction and recognition, respectively, on a dataset of 5,000 images by Ullah et al. (2019).
Recently, deep learning methodologies have gained immense popularity for other cursive scripts such as Arabic and Urdu, and a few studies have also been reported for the Pashto script. An implicit segmentation-based recognition system was proposed by Ahmad et al. (2016). The authors also presented a dataset for the Pashto language called KPTI (Katib's Pashto Text Imagebase). The deep learning approaches of BLSTM and MDLSTM were used for the recognition of Pashto script with an implicit segmentation method, and a recognition accuracy of 90.78% was reported. The KPTI dataset is composed of 17,015 images collected from different Pashto books and contains both printed and handwritten Pashto script images. To make the dataset usable for research and analysis, the authors provide the ground truth along with the images. The dataset contains 11,910 training images, 2,552 validation images and 2,553 test images, that is, a split of 70, 15 and 15%, respectively. In another study, the MDLSTM deep learning architecture and an HMM were used for Pashto ligature recognition by Ahmad, Afzal, Rashid, Liwicki, and Breuel (2015). The traditional method of SIFT-based features was also employed for recognition. The FAST-NU dataset of the Pashto script was used for the evaluation of the classifiers. Recognition accuracies of 98.9, 94.3 and 89.9% were reported for the MDLSTM, SIFT and HMM methods, respectively.

3 | METHODOLOGY CONFIGURATION

The FAST-NU Pashto ligature dataset is subjected to data augmentation during pre-processing. The resultant images are then fed to the pre-trained CNN models of SqueezeNet, ResNet, MobileNet and DenseNet. The architecture of the proposed system is composed of a source domain and a target domain. The transfer learning strategy and fine-tuned deep features are used for multi-class classification and recognition of the Pashto ligatures. Unicode mappings are also used during classification and recognition. This section discusses in detail all the steps of the methodology configuration used in the proposed Pashto ligature recognition system with the different CNN models.

3.1 | Dataset and preprocessing

In the proposed OCR system for Pashto ligatures, the benchmark FAST-NU Pashto ligature dataset is used for experiments and analysis. It is one of the most frequently used datasets for Pashto ligature recognition systems. The dataset was developed at the Pashto Academy of Peshawar University, Khyber Pakhtunkhwa, Pakistan by Mahreen Wahab. It is composed of a collection of ligatures from different novels. The ligature images are of size 100 x 100 and use three colour channels, that is, RGB (a few illustrations are shown in Figure 2). The ligature samples are available at four different scales of 12, 14, 16 and 18, as shown in Figure 3, leading to 4,000 ligature images for a total of 1,000 unique ligatures. These variations of ligature size are necessary to ensure the robustness and efficiency of any OCR system. The dataset is publicly available for use by other researchers.
The dataset is subjected to a set of pre-processing methods before any data is fed to a CNN model. In the pre-processing stage, the dataset is transformed and expanded artificially through data augmentation, which increases the total number of samples in the Pashto ligature dataset available to the deep learning models. The increase in data size enables the extraction of rich features from the ligature images, and the learning models are expected to fit better and to generalize to new images. The data augmentation is performed through the pre-processing methods of contour and negative. The contours of the ligature images are found by eroding the images with a structuring element, which identifies the ligature boundaries. The negative is computed by inverting the pixels of the image, so that the background pixels are black and the foreground ligature strokes and diacritics are white.
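
The two augmentation operations can be sketched in plain Python on a tiny image. This is an illustration under stated assumptions, not the authors' implementation: images are 8-bit grayscale lists of rows with dark ink (0) on a white background (255), the structuring element is a 3 x 3 square, the threshold of 128 is arbitrary, and all function names are hypothetical.

```python
def negative(img):
    """Invert an 8-bit image: white background becomes black, dark ink becomes white."""
    return [[255 - p for p in row] for row in img]


def foreground_mask(img, threshold=128):
    """1 where the pixel is ink (darker than the threshold), 0 elsewhere."""
    return [[1 if p < threshold else 0 for p in row] for row in img]


def erode(mask):
    """Binary erosion with a 3 x 3 square structuring element.

    Border pixels are always removed (outside the image counts as background).
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(mask[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out


def contour(img):
    """Keep only the ink pixels that erosion removes, i.e. the ligature boundary."""
    mask = foreground_mask(img)
    eroded = erode(mask)
    return [[0 if m and not e else 255 for m, e in zip(mrow, erow)]
            for mrow, erow in zip(mask, eroded)]


def augment(images):
    """Return each image together with its contour and negative versions."""
    out = []
    for img in images:
        out.extend([img, contour(img), negative(img)])
    return out


W, B = 255, 0
toy = [[W, W, W, W, W],
       [W, B, B, B, W],
       [W, B, B, B, W],
       [W, B, B, B, W],
       [W, W, W, W, W]]

for row in contour(toy):
    print(row)
print(len(augment([toy] * 4)))  # three images per input
```

Applied to the 4,000 original images, this kind of augmentation yields the original, contour and negative versions of every ligature, that is, three times the original dataset size.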

3.2 | Deep Pashto ligature recognition system

Developing ligature recognition systems for cursive scripts such as Pashto has always been a challenging task for researchers due to the high similarity among the letters as well as the complex nature of the writing. Several complexities are associated with cursive scripts such as Pashto, Urdu, Arabic and Persian. They include the joining of letters, high cursiveness, the number of diacritics, the overlap of ligatures (inter-ligature and intra-ligature), the placement of diacritics, diagonality, small horizontal spacing and stroke width variations. To avoid these complexities, it is more useful to treat ligatures as the basic unit of recognition. However, the number of ligatures is large in comparison to the small set of characters a character-based recognition system has to handle. Traditional machine learning-based techniques do not work satisfactorily for the classification of such a large number of ligatures. Recently, deep transfer learning-based models have shown promising performance in various application areas of pattern recognition in comparison to the traditional machine learning methods. In this study, we propose a deep transfer learning-based ligature recognition system for the complex Pashto script. Figure 4 shows the proposed workflow of the Pashto ligature recognition system. Our system is deployed across two domains, namely, the source domain and the target domain. The proposed architecture allows the pre-trained CNN models to take the knowledge and features learned from the source task in the source domain and use them for the target task (Pashto ligature recognition) in the target domain.

3.2.1 | Architecture description: Source domain

The source domain includes the materials and methods required for solving a source task. The details for each item in the source domain have
been provided below.

• Source Task: The classification of images from the ImageNet dataset is the source task in the proposed architecture.

FIGURE 2 (a)-(j) Ten Ligature Illustrations from the FAST-NU Pashto Dataset

FIGURE 3 FAST-NU Dataset Size Variations

FIGURE 4 Proposed Architecture for Recognition System of Pashto Ligatures

• Source Dataset: The source dataset selected for the transfer learning technique is the ImageNet dataset (Deng et al., 2009). ImageNet consists of images that are well organized and labelled hierarchically. It is one of the most famous datasets used by the deep learning community for various computer vision research problems. ImageNet was selected because it is one of the largest datasets, so the source models learn more accurate representations.
• Pre-trained Source Models: The SqueezeNet, ResNet, MobileNet and DenseNet models pre-trained on the ImageNet dataset are used. To avoid high computational costs, it is common to reuse such pre-trained CNN models for other recognition tasks, such as the Pashto ligature recognition addressed in this study.
• Source Labels: The source labels correspond to the classes/categories of the images in the ImageNet dataset, such as “balloon,” “chain,” “weevil,” “fly,” “bee,” “zebra,” “gazelle,” “baboon,” etc.

3.2.2 | Architecture description: Target domain

The target domain includes the materials and methods required for solving a target task. The details for each item in the target domain have been
provided below.

• Target Task: The target task is to classify and recognize ligature images from the FAST-NU Pashto ligature dataset. The target task in the target domain is different from, but somewhat related to, the source task in the source domain.
• Target Models: The target domain applies the pre-trained SqueezeNet, ResNet, MobileNet and DenseNet models to the FAST-NU dataset using the transfer learning strategy and fine-tuned features for the Pashto ligature recognition problem. The parameters of the CNN models are optimized to improve the learning. All layers except the final output layer are replicated in the target models.
• Fine-Tuned Deep Features: Deep learning models enable the automatic extraction of rich deep features from the augmented dataset. The transfer learning strategy is used to obtain deep features that are fine-tuned on the FAST-NU dataset. To ensure that deep features are learned during transfer learning, the upper layers are fine-tuned. Training is much faster and performance improves when fine-tuned CNN models are used on the target dataset.
• Target Labels: The final output layers are replaced so that their outputs correspond to the classes of the ligature images in the target dataset, that is, the “FAST-NU” Pashto ligature dataset.

• Multi-class Classification: This refers to problems where the total number of classes used for prediction is more than two. Multi-class classification assigns each ligature image to one of the ligature classes. A softmax layer is used for the classification; it transforms the scores into probability values.
• Unicode Mapping: The Unicode character encoding scheme for Pashto characters is used in the study. The Unicode mappings correspond to the combination of characters in the ligature image and use 2 bytes. The main advantage of Unicode mappings is that the recognition system can be reused by prospective researchers for other languages and scripts, as long as character codes are available for them.
• Recognition: The ligature recognition mechanism is applied for each of the models based on the corresponding target labels and the Unicode mapping. The ligature recognition rate of the proposed system is computed from the counts of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN).
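
The last three steps of the target domain can be sketched together as follows. This is a minimal illustration, not the system described in this article: the three-class label table `LIGATURE_CODE_POINTS` is hypothetical, the scores are made up, and the accuracy formula is the standard one, which the text does not spell out. Reading "2 bytes" as two bytes per character in UTF-16 is also an assumption.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label table: class index -> Unicode code points of the ligature.
LIGATURE_CODE_POINTS = {
    0: [0x067E, 0x069A],  # PEH + SEEN WITH DOT BELOW AND DOT ABOVE
    1: [0x067C, 0x0647],  # TEH WITH RING + HEH
    2: [0x0644, 0x0627],  # LAM + ALEF
}


def recognize(scores):
    """Pick the most probable class and map it to its Unicode string."""
    probs = softmax(scores)
    label = max(range(len(probs)), key=probs.__getitem__)
    text = "".join(chr(cp) for cp in LIGATURE_CODE_POINTS[label])
    return label, probs[label], text


def accuracy(tp, tn, fp, fn):
    """Standard accuracy from confusion counts (assumed definition)."""
    return (tp + tn) / (tp + tn + fp + fn)


label, confidence, text = recognize([1.2, 3.4, 0.3])
print(label, round(confidence, 3), [hex(ord(ch)) for ch in text])
print(len(text.encode("utf-16-le")))  # bytes used by the two-character ligature
print(accuracy(tp=90, tn=5, fp=3, fn=2))
```

Storing code points rather than rendered glyphs keeps the label table independent of fonts, so the same mechanism carries over to any script whose characters have Unicode codes.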

4 | EXPERIMENTS

The experiments and results for the Pashto ligature recognition system are discussed in detail in this section. The pre-trained CNN models have been evaluated on two versions of the dataset. The test, validation and training accuracies are computed for each network model. To validate the effectiveness of the proposed networks, we extensively evaluated their performance on the FAST-NU dataset using different parameters. The performance of the proposed Pashto ligature recognition system is also compared with existing state-of-the-art methods.

4.1 | Results analysis

The pre-trained CNN models are evaluated on two sets of samples for learning and recognition: (a) the Original Dataset and (b) the Augmented Dataset. The original dataset contains a total of 4,000 ligature images, whereas the augmented dataset contains the original (4,000), contoured (4,000) and negative (4,000) images, leading to a total of 12,000 ligature images. The test, validation and training accuracies for the eight networks built from the four fine-tuned CNN models are given in Table 2. The following results can be summarized from the table.

• The lowest test accuracy and validation accuracy of 83.2 and 89.23%, respectively, are observed for the SqueezeNet model on the Original Dataset.
• The lowest training accuracy of 98.01% is observed for the DenseNet model on the Original Dataset.
• The highest test accuracy, validation accuracy and training accuracy of 99.31, 99.50 and 99.89%, respectively, are observed for the DenseNet model on the Augmented Dataset.
• The ResNet and MobileNet CNN models report moderate test, validation and training accuracies for both the Original and the Augmented Dataset.

To assess how well the results generalize, the dataset is shuffled and randomly divided into a training set, a validation set and a test set five times, using repeated random sub-sampling. Five runs of experiments are then conducted, training each of the pre-trained models for Pashto ligature recognition. For each model, the recognition accuracy averaged over the runs is taken as the final recognition accuracy. The average accuracies are thus produced through a rigorous approach of repeated random sub-sampling that reflects the overall behavior and performance (accuracy) of the system. This procedure is performed for each model (SqueezeNet, MobileNet, ResNet and
TABLE 2 Recognition rate on different samples obtained using different models of CNN

| Network | Model | Deep learning type | Samples | Test accuracy (%) | Val accuracy (%) | Train accuracy (%) |
|---|---|---|---|---|---|---|
| Network1 | SqueezeNet | Fine-tuned | Original Dataset | 83.2 | 89.23 | 98.80 |
| Network2 | ResNet | Fine-tuned | Original Dataset | 90.26 | 92.23 | 98.95 |
| Network3 | MobileNet | Fine-tuned | Original Dataset | 84.61 | 89.70 | 99.29 |
| Network4 | DenseNet | Fine-tuned | Original Dataset | 93.00 | 94 | 98.01 |
| Network5 | SqueezeNet | Fine-tuned | Augmented Dataset | 97.70 | 98.00 | 99.62 |
| Network6 | ResNet | Fine-tuned | Augmented Dataset | 92.71 | 93.39 | 99.35 |
| Network7 | MobileNet | Fine-tuned | Augmented Dataset | 99.23 | 99.30 | 99.78 |
| Network8 | DenseNet | Fine-tuned | Augmented Dataset | 99.31 | 99.50 | 99.89 |

FIGURE 5 Repeated Random Sub-sampling Validation for Generalization of Results Using Five Experiment Runs

TABLE 3 Comparison with the existing techniques for Pashto ligature recognition

| Reference | Features | Classifier | Accuracy (%) |
|---|---|---|---|
| Wahab et al. (2009) | PCA based | PCA features | Not satisfactory |
| Ahmad et al. (2010) | SIFT features | Distance measures | 74 |
| Ahmad, Naz, Afzal, Amin, and Breuel (2015) | SIFT features | Distance measure | 81 |
| Ahmad, Afzal, Rashid, Liwicki, and Breuel (2015) | Automatic deep features | MDLSTM | 98.9 |
| Zahoor, Naz, Khan, and Razzak (2020) | Automatic deep features | VGGNet | 99.03 |
| Proposed | Automatic deep features | DenseNet | 99.31 |

FIGURE 6 Comparison with the Pashto Ligature Recognition Systems in the Literature using FAST-NU Dataset

DenseNet). Figure 5 shows the average recognition accuracy of 99.31% obtained with the DenseNet model of CNN under repeated random
sub-sampling validation. The proposed architecture for Pashto ligature recognition therefore generalizes well and achieves high
accuracy.
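
The repeated random sub-sampling procedure can be sketched as follows. This is only an illustration under stated assumptions: the 70/15/15 split fractions, the fixed seed and the `evaluate` callback (which in practice would fine-tune a CNN on the training indices and return its test accuracy) are hypothetical and are not taken from the study.

```python
import random
from statistics import mean


def random_split(n_samples, fractions=(0.7, 0.15, 0.15), rng=None):
    """Shuffle sample indices and cut them into train/validation/test index lists.

    The default fractions are an assumption; the split sizes used in the
    study are not stated in this section.
    """
    if rng is None:
        rng = random.Random()
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(fractions[0] * n_samples)
    n_val = round(fractions[1] * n_samples)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


def repeated_subsampling(n_samples, evaluate, n_runs=5, seed=0):
    """Average the accuracy returned by `evaluate` over independent random splits."""
    rng = random.Random(seed)
    scores = [evaluate(*random_split(n_samples, rng=rng)) for _ in range(n_runs)]
    return mean(scores), scores


if __name__ == "__main__":
    # Placeholder evaluation: a real callback would train on `train`,
    # select a model on `val` and return the accuracy measured on `test`.
    def dummy_evaluate(train, val, test):
        return 100.0 * sum(i % 2 == 0 for i in test) / len(test)

    average, runs = repeated_subsampling(12000, dummy_evaluate)
    print(average, runs)
```

Each run draws a fresh shuffle from the same seeded generator, so the five splits differ from one another while the whole experiment stays reproducible.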

4.2 | Comparison to the state-of-the-art studies

The recognition rate of the proposed models for Pashto script recognition is compared directly with the state-of-the-art Pashto ligature
recognition systems on the FAST-NU dataset (as given in Table 3). The proposed Pashto ligature recognition system using DenseNet improves the
classification and recognition rate over the systems of Wahab et al. (2009), Ahmad et al. (2010), Ahmad, Naz, Afzal, Amin, and Breuel (2015),
and Ahmad, Afzal, Rashid, Liwicki, and Breuel (2015). Wahab et al. (2009) employed PCA for Pashto ligature
recognition. Ahmad et al. used SIFT features for Pashto ligature recognition (Ahmad et al., 2010; Ahmad, Naz, Afzal, Amin, &
Breuel, 2015). Ahmad, Afzal, Rashid, Liwicki, and Breuel (2015) deployed SIFT, HMM and MDLSTM and reported the highest accuracy of

98.9% using MDLSTM. The MDLSTM scans the Pashto ligature image in all four directions. Zahoor et al. (2020) employed AlexNet, GoogLeNet
and VGGNet for Pashto ligature identification and reported the highest accuracy of 99.03% using VGGNet. Our proposed system, which applies
the DenseNet model to raw pixels, improves on these results. The DenseNet model of CNN is observed to learn salient and distinctive
patterns accurately from the raw pixels of ligature images. This is pioneering work in applying the DenseNet, MobileNet, ResNet and SqueezeNet
CNN models to raw pixels for Pashto ligature recognition. The average recognition results for five runs are 99.31% (DenseNet), 99.21%
(MobileNet), 92.71% (ResNet) and 97.70% (SqueezeNet). These recognition rates are achieved for the Pashto script despite its rich morphology
and the large number of variations in the shape of each letter, which depends on the letter's position inside the ligature.
A graphical comparison is also provided in Figure 6. For a fair comparison, all the studies outlined evaluated their systems on the same
dataset, the FAST-NU Pashto ligature dataset. The accuracy rates of the different studies are compared with the accuracy achieved
by the DenseNet model of the proposed Pashto ligature recognition system on the FAST-NU dataset. The study of Wahab et al. (2009) could not be
included in the graphical comparison because it did not report a numerical accuracy and its results were described as not satisfactory.
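
To put the comparison in numbers, the gain of the proposed system over each earlier study can be expressed both in percentage points and as a relative reduction of the error rate. The sketch below assumes the accuracies are copied by hand from Table 3; `PRIOR_WORK`, `PROPOSED` and `compare` are illustrative names.

```python
# Accuracies (%) transcribed from Table 3. Wahab et al. (2009) is omitted
# because no numerical accuracy is available for that study.
PRIOR_WORK = {
    "Ahmad et al. (2010), SIFT": 74.0,
    "Ahmad, Naz, et al. (2015), SIFT": 81.0,
    "Ahmad, Afzal, et al. (2015), MDLSTM": 98.9,
    "Zahoor et al. (2020), VGGNet": 99.03,
}
PROPOSED = 99.31  # DenseNet, proposed system


def compare(baseline, proposed=PROPOSED):
    """Return (gain in percentage points, relative error reduction in %)."""
    gain = proposed - baseline
    # Error rate is 100 - accuracy, so the reduction is measured
    # against the baseline's error rate.
    error_reduction = 100.0 * gain / (100.0 - baseline)
    return gain, error_reduction


if __name__ == "__main__":
    for study, accuracy in PRIOR_WORK.items():
        gain, reduction = compare(accuracy)
        print(f"{study}: +{gain:.2f} points, {reduction:.1f}% fewer errors")
```

Against VGGNet, for example, the gain is 0.28 percentage points, which corresponds to roughly 29% fewer misrecognized ligatures. The relative figure is often more informative than the raw difference when both accuracies are close to 100%.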

5 | CONCLUSION

A deep transfer learning-based recognition system for the cursive Pashto script (ligature-based) has been presented in this paper. Pre-trained
CNN models of SqueezeNet, ResNet, MobileNet and DenseNet have been used for recognition of Pashto ligatures on the FAST-NU benchmark dataset.
The deep transfer learning approach results in more accurate and faster models. The data augmentation techniques of negative and contour are
observed to increase the size of the dataset and to improve the performance of the deep learning models. The architecture of the proposed Pashto
ligature recognition system comprises two domains, the source domain and the target domain. In the source domain, the CNN models are
pre-trained on the ImageNet dataset. In the target domain, the weights are fine-tuned and the pre-trained models are deployed on the FAST-NU
dataset using transfer learning. Test, validation and training accuracies have been reported for the original dataset and the
augmented dataset. Recognition accuracies of 97.70%, 92.71%, 99.21% and 99.31% are reported for the SqueezeNet, ResNet, MobileNet and
DenseNet models, respectively, on the augmented dataset.
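
As an illustration of the negative and contour augmentations referred to above, the following minimal sketch operates on grayscale images stored as lists of pixel rows. The contour operator used here (keeping ink pixels that touch the background in a 4-neighbourhood) and the threshold of 128 are assumptions chosen for simplicity; the exact operators used in the study are not specified in this section.

```python
def negative(image, max_value=255):
    """Invert every pixel intensity of a grayscale image (list of rows)."""
    return [[max_value - p for p in row] for row in image]


def contour(image, threshold=128, ink=0, paper=255):
    """Keep only the ink pixels that touch the background (4-neighbourhood).

    Pixels below `threshold` are treated as ink (assumed dark text on a
    light page); positions outside the image count as background.
    """
    h, w = len(image), len(image[0])

    def is_ink(r, c):
        return 0 <= r < h and 0 <= c < w and image[r][c] < threshold

    out = [[paper] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            neighbours = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if is_ink(r, c) and not all(is_ink(nr, nc) for nr, nc in neighbours):
                out[r][c] = ink
    return out


def augment(images):
    """Return the originals followed by their contour and negative versions."""
    images = list(images)
    return images + [contour(i) for i in images] + [negative(i) for i in images]
```

Applied to a list of N ligature images, `augment` returns 3N images, which mirrors the growth from 4,000 to 12,000 samples.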
In the future, we may investigate other recent deep learners such as ShuffleNet and/or EffNet for the recognition of Pashto ligatures using
well-known datasets. Similarly, the proposed system may also be investigated for other cursive and complex scripts.

CONFLICTS OF INTEREST


The authors declare no potential conflict of interest.

ORCID
Saeeda Naz https://orcid.org/0000-0002-5665-4615

REFERENCES
Ahmad, R., Afzal, M. Z., Rashid, S. F., Liwicki, M., Breuel, T., & Dengel, A. (2016). KPTI: Katib's Pashto text imagebase and deep learning
benchmark. Paper presented at: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 453–458. IEEE.
Ahmad, R., Afzal, M. Z., Rashid, S. F., Liwicki, M., & Breuel, T. (2015). Scale and rotation invariant OCR for Pashto cursive script
using MDLSTM network. Paper presented at: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1101–1105.
IEEE.
Ahmad, R., Afzal, M. Z., Rashid, S. F., Liwicki, M., Dengel, A., & Breuel, T. (2015). Recognizable units in Pashto language
for OCR. Paper presented at: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1246–1250. IEEE.
Ahmad, R., Amin, S. H., & Khan, M. A. U. (2010). Scale and rotation invariant recognition of cursive Pashto script using SIFT features.
Paper presented at: 2010 6th International Conference on Emerging Technologies (ICET), pp. 299–303. IEEE.
Ahmad, R., Naz, S., Afzal, M. Z., Amin, S. H., & Breuel, T. (2015). Robust optical recognition of cursive Pashto script using scale, rotation
and location invariant approach. PLoS One, 10(9), e0133648.
Decerbo, M., MacRostie, E., & Natarajan, P. (2004). The BBN Byblos Pashto OCR system. Paper presented at: Proceedings of the 1st ACM
workshop on Hardcopy document processing, pp. 29–32. ACM.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. Paper presented
at: 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE.
Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K. Q. (2016). Densely connected convolutional networks.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition.
Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x
fewer parameters and <0.5MB model size.
Khan, S., Ali, H., Ullah, Z., Minallah, N., Maqsood, S., & Hafeez, A. (2019). Higher accurate recognition of handwritten Pashto letters
through zoning feature by using k-nearest neighbour and artificial neural network. arXiv preprint arXiv:1904.03391.
Naseer, A., Rani, M., Naz, S., Razzak, M. I., Imran, M., & Xu, G. (2020). Refining Parkinson's neurological disorder identification through
deep transfer learning. Neural Computing and Applications, 32(3), 839–854.

Naz, S., Hayat, K., Razzak, M. I., Anwar, M. W., Madani, S. A., & Khan, S. U. (2014). The optical character recognition of Urdu-like cursive
scripts. Pattern Recognition, 47(3), 1229–1248.
Naz, S., Umar, A. I., Ahmad, R., Ahmed, S. B., Shirazi, S. H., & Razzak, M. I. (2017). Urdu Nasta'liq text recognition system based on
multi-dimensional recurrent neural network and statistical features. Neural Computing and Applications, 28(2), 219–231.
Naz, S., Umar, A. I., Ahmad, R., Siddiqi, I., Ahmed, S. B., Razzak, M. I., & Shafait, F. (2017). Urdu Nastaliq recognition using
convolutional–recursive deep learning. Neurocomputing, 243, 80–87.
Naz, S., Umar, A. I., Shirazi, S. H., Ahmed, S. B., Razzak, M. I., & Siddiqi, I. (2016). Segmentation techniques for recognition of Arabic-like
scripts: A comprehensive survey. Education and Information Technologies, 21(5), 1225–1241.
Rehman, A., Naz, S., Razzak, M. I., Akram, F., & Imran, M. (2020). A deep learning-based framework for automatic brain tumors classification
using transfer learning. Circuits, Systems, and Signal Processing, 39(2), 757–775.
Rehman, A., Naz, S., Razzak, M. I., & Hameed, I. A. (2019). Automatic visual features for writer identification: A deep learning approach.
IEEE Access, 7, 17149–17157.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. Paper
presented at: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520.
Ullah, S., Enayat, T., Nadeem, N., ud Din, I., Saeed, Y., & Junaid, M. (2019). Offline Pashto OCR using machine learning. Paper
presented at: 2019 7th International Electrical Engineering Congress (iEECON), pp. 1–4. IEEE.
Wahab, M., Amin, H., & Ahmed, F. (2009). Shape analysis of Pashto script and creation of image database for OCR. Paper presented at: 2009
International Conference on Emerging Technologies, pp. 287–290. IEEE.
Zahoor, S., Naz, S., Khan, N. H., & Razzak, M. I. (2020). Deep optical character recognition: A case of Pashto language. Journal of Electronic
Imaging, 29(2), 023002.

AUTHOR BIOGRAPHIES

Saeeda Naz has been working as an assistant professor and chairperson of the department of computer science at GGPGC No.1, Abbottabad, Higher
Education Department of the Government of Khyber-Pakhtunkhwa, Pakistan, since 2008. She obtained her BS degree from the Computer Science
Department, University of Peshawar, her MS from COMSATS Institute of Information Technology, Abbottabad and her PhD degree (Computer
Science) with distinction from Hazara University, Department of Information Technology, Mansehra, Pakistan. She is the author of seven book
chapters and more than 60 research publications in well-reputed journals and conferences. Her research areas include deep
learning, big data, intelligent systems, medical imaging, and text and image processing.

Naila Habib Khan completed her Ph.D. in Computer Science at the Institute of Management Sciences, Peshawar, Pakistan, with research in
the field of document image understanding. She received her BS (Computer Science) degree from the Institute of Management Sciences,
Peshawar, Pakistan in 2011 and her MS (Information Technology) degree in 2014. She is a double Gold Medalist and has been awarded numerous
merit scholarships during her academic career. Her areas of interest are Document Image Understanding, Pattern Recognition, Image
Segmentation and Computer Vision.

Shizza Zahoor is a BSCS student at GGPGC No.1, Abbottabad, Higher Education Department of the Government of Khyber-Pakhtunkhwa,
Pakistan, and works as a Research Assistant under the supervision of Dr. Saeeda Naz at GGPGC No.1, Abbottabad. Her areas of interest are
Document Image Understanding, Machine Learning and Multimedia.

Imran Razzak is a Senior Lecturer in the School of Information Technology, Deakin University, Australia. Before joining Deakin, he worked at
the University of Technology Sydney, King Saud bin Abdulaziz University for Health Sciences and Air University, Islamabad. Imran has 70+
publications in reputed journals and conferences in the areas of data analytics and machine learning. He is the recipient of a number of awards,
both from academia and industry. His research has received over $1.2M in funding in past years. He has served on editorial
boards and as a guest editor for several international journals, such as Neural Computing and Applications, Journal of Biomedical and Health
Informatics, PLOS ONE, IEEE Access, International Journal of Biometrics and International Journal of Information Processing. He is also active in
organizing and serving dozens of international conferences and workshops such as ICONIP, IJCNN, KES, BESC, ICHIT and ASAR. He has
organized several special sessions at leading conferences such as IJCNN, ICONIP and KES, and has served as co-chair of ICHI-17 and publicity
chair of BESC-20.

How to cite this article: Naz S, Khan NH, Zahoor S, Razzak MI. Deep OCR for Arabic script-based language like Pastho. Expert Systems.
2020;e12565. https://doi.org/10.1111/exsy.12565
