A Holistic Approach For Recognition of Complete Urdu Ligatures Using Hidden Markov Models

2017 International Conference on Frontiers of Information Technology
Israr Uddin∗, Imran Siddiqi∗ and Shehzad Khalid∗
∗ Center of Computer Vision and Pattern Recognition
Bahria University, Islamabad, Pakistan
Email: imran.siddiqi@bahria.edu.pk
Abstract—Optical Character Recognition (OCR) is one of the most continuously explored problems in pattern recognition. Commercial character recognizers now report recognition rates close to 100% on text in a number of scripts. Despite these advancements, OCR systems have yet to mature for cursive scripts like Urdu. This study presents a holistic technique for recognition of Urdu text in Nastaliq font using ‘complete’ ligatures as recognition units. The term ‘complete’ refers to a partial word including its main body and secondary components (dots and diacritic marks). The Discrete Wavelet Transform (DWT) is employed as feature extractor, and a separate Hidden Markov Model (HMM) is trained for each ligature considered in our study. More than 2,000 frequently used unique Urdu ligatures from the standard CLE (Center of Language Engineering) dataset are considered in our evaluations. The system achieves a promising accuracy of 88.87% on more than 10,000 partial words.
0-7695-6347-3/17/$31.00 ©2017 IEEE. DOI 10.1109/FIT.2017.00035
I. INTRODUCTION
Optical Character Recognition (OCR) is a classical pattern recognition problem that has witnessed more than three decades of intensive research. Thanks to these research endeavors, mature recognition systems realizing accuracies of up to 99.9% [10] have been developed for many scripts, for instance those based on the Chinese and Latin alphabets. In spite of these advancements, recognition systems for cursive scripts like Urdu are still in the early days of research and relatively limited literature is available to date. Urdu is mostly printed in the Nastaliq style, which is more complex than the Naskh style of Arabic and Pashto. The bidirectional Nastaliq writing style is highly cursive and is written diagonally from right to left with varying inter- and intra-word spaces. Overlapping strokes, incorrect or filled loops and the absence of a static baseline are a few of the commonly encountered challenges [14], [16], [23]. Figure 1 illustrates the complexities of the Urdu Nastaliq font that make it a challenging recognition problem. With the digital revolution, the importance of automatic recognition systems has increased manifold, enabling huge collections of printed, handwritten or mixed documents to be digitized without manual transcription.
Fig. 1. Text having (a) Intra-ligature overlap, (b) Inter-ligature overlap, (c) False (green) and filled (red) loops, (d) Missing fixed baseline and (e) Variable spacing, indicating the high cursiveness
This paper presents a global transformational method to recognize complete Urdu ligatures. A complete or true ligature is composed of either a single character or a combination of characters (up to 10 [17]) and may correspond to a complete word or a partial word. Moreover, a complete ligature may comprise two types of components. The larger component representing the main body of the ligature is generally termed the base or primary ligature, while the smaller components representing dots and diacritics are known as secondary ligatures. Figure 2 illustrates the primary and secondary ligatures constituting a complete ligature. It is interesting to note that, quite often, a number of complete ligatures share the same base ligature and differ only in their secondary ligatures (their number, shape and position). Consequently, a major proportion of recognition techniques which employ ligatures as units of recognition carry out separate classification of primary and secondary ligatures [6], [14], [16], [28], [29], [30] with the aim of reducing the number of unique recognizable (shape) classes. Such techniques face the challenge of re-associating secondary ligatures with their primary ligature to recognize the complete ligature. While high recognition rates have been reported on individual recognition of base and secondary ligatures, most of the studies [7], [14], [16], [28] ignore the re-association step and do not report the recognition rates on complete ligatures. Consequently, the reported recognition rates are not truly indicative of the maturity of the proposed recognition systems when it comes to acceptability from the viewpoint of commercial recognition engines. This study targets recognition of complete ligatures; rather than separately recognizing the
primary and secondary ligatures, we train models on complete ligatures. The Discrete Wavelet Transform (DWT) is employed as feature extractor while classification is carried out using hidden Markov models. Evaluations on the benchmark CLE dataset report interesting recognition rates.
Fig. 2. (a) A complete ligature, (b) Primary component and (c) Secondary components
The paper is organized as follows. The next section reviews notable contributions to recognition of Urdu text, followed by a detailed discussion of the proposed technique in Section III. Section IV presents the details of the experiments carried out to validate the introduced methodology, with a discussion of the realized results. Finally, Section V concludes the paper with a discussion on the future of Urdu OCR.
II. LITERATURE REVIEW
Initial studies on recognition of Urdu text primarily focused on classification of individual segmented characters [3], [19], [25], [30], [32], [33]. The problem, however, has witnessed significant research endeavors during the last five years and a number of sophisticated ligature and character recognition systems have been proposed [7], [12], [16], [20], [21], [28], [35]. The recognition methods can be grouped into two major classes, analytical and holistic methods, as discussed in the following.
A. Analytical Methods
Analytical (also called segmentation-based) approaches employ characters as recognition units, which are segmented either implicitly [20], [21], [22] or explicitly [2], [3], [18], [25], [32], [33], [35]. Among well-known analytical methods, Hussain et al. [11] adopted an explicit segmentation-based approach for recognition of printed Urdu Nastaliq ligatures. Discrete Cosine Transform (DCT) coefficients extracted from graphemes of ligatures are employed for training individual HMMs for the ligature classes. The primary (segmented into graphemes) and secondary ligatures are individually recognized and then post-processed for association. The system reports an 87.76% complete ligature recognition rate for almost 93,000 test words from the CLE dataset. A recent trend in analytical recognition approaches is implicit segmentation of characters using deep learning techniques. These methods rely on feeding a classifier with text line images and the corresponding character level transcription. Among such techniques, Ahmed et al. [4] employed raw pixels based bidirectional Long Short Term Memory (LSTM) for classification of characters. The system achieves 89% accuracy on the Urdu Printed Text Images (UPTI) dataset. Likewise, Naz et al. [20], [21] investigated the effectiveness of statistical features with multi-dimensional LSTM, reporting a character recognition rate of 96.40% on the UPTI dataset. The work was later extended [22] to employ Convolutional Neural Networks (CNN) for automatic feature extraction. The combination CNN+MDLSTM reported a high character recognition rate of 98%.
A key attraction of analytical methods is the smaller number of unique classes to be recognized, which is the same as the number of characters and their different shapes (as a function of their position in the ligature). A major issue in analytical approaches is segmentation into characters, which is an extremely challenging task for a cursive script like Nastaliq. Though implicit segmentation based techniques [4], [21], [22] overcome this issue and report high recognition rates, these techniques rely on a huge amount of training data for acceptable results.
B. Holistic Methods
Holistic (or segmentation-free) methods consider ligatures as units of recognition and represent a major proportion of research on Urdu OCR systems [5], [6], [12], [14], [16], [19], [28], [29], [30], [31]. Among significant holistic techniques, Javed et al. [12] carry out recognition of primary components using an HMM classifier. A sliding window scans the ligature image for extraction of DCT features which are fed to the HMMs. The system trained over 1282 primary components reported an accuracy of 92% on 3655 query ligatures. The technique ignores secondary components and is also font size dependent. In another study, Sabbour and Shafait [28] employed shape descriptors with a nearest neighbor classifier to recognize ligatures. The system realized 86.7% accuracy for 10,000 primary ligatures extracted from the Urdu Printed Text Images (UPTI) dataset. Akram et al. [7] modified the Tesseract engine to recognize Urdu ligatures in Nastaliq font. The system trained on 1475 primary components realized a recognition rate of around 97% for font sizes 14 and 16. The system, however, ignores secondary components and requires a separate training step for each font size. This issue was addressed in a later study [6] by the same authors realizing 86.15% complete ligature recognition rates on query ligatures extracted from 224 printed Urdu documents. Another segmentation-free and size invariant recognition technique is presented in [16] for individually recognizing main and secondary ligatures. Statistical features extracted using sliding windows are used for training individual HMMs for the ligature classes. The system trained on 2028 unique (primary and secondary) ligatures reports 97.93% accuracy on 6084 test ligatures.
A major advantage of holistic approaches is that segmentation of ligatures into characters is not required. This, however, increases the number of unique classes to be recognized to the total number of unique ligatures, which naturally is much larger than the total number of characters (and their shape variants) in the alphabet. Nevertheless, this
can be addressed by enhancing the feature extraction and classification steps.
A summarized review of well-known contributions towards the development of an Urdu OCR system is presented in Table I. Though high classification rates are reported by the techniques based on implicit segmentation using deep learning [20], [21], [22], these techniques require a large amount of data for system training. Moreover, the realized character recognition accuracies (rather than accuracies on the natural recognition unit, the ligature) are computed by edit distance between the predicted and ground truth transcriptions of text lines. It is important to note that character recognition rates and ligature recognition rates are not directly comparable. An error in a single character within a ligature results in rejection of the complete ligature. As an example, consider two query ligatures each having five characters. If one character is not recognized correctly, the character recognition rate would be 90% (9/10) while the ligature recognition rate would be 50% (1/2). Consequently, from the viewpoint of eventual recognition machines, ligature recognition accuracy is more important than character recognition accuracy. As discussed earlier, the majority of the ligature-based techniques [7], [14], [16] recognize primary and secondary ligatures individually, ignoring the complex problem of associating them. Only the techniques presented in [6], [11], [9] consider recognition of complete ligatures, realizing recognition rates of 87.15%, 87.76% and 92.26% respectively. Moreover, the first two studies are evaluated on the CLE dataset while the last one is tested on the UPTI database. It is also important to note that the CLE database is much more challenging than the UPTI database. The UPTI database comprises synthetically generated Urdu text lines, while the CLE database has been developed by scanning printed Urdu documents and better imitates the documents likely to be encountered in practical recognition problems.
TABLE I
SUMMARY OF URDU TEXT RECOGNITION SYSTEMS
Method | Investigation | Dataset | Classification | Unit of Recognition | Accuracy
Analytical | Pal and Sarkar [25] | Custom | - | Isolated characters | 97.80%
Analytical | Shamsher et al. [32] | Custom | Neural networks | Isolated characters | 98.30%
Analytical | Ahmed et al. [3] | Custom | Neural networks | Segmented characters | 93.40%
Analytical | Hussain et al. [11] | CLE | HMM | Graphemes | 87.76%
Analytical | Hassan et al. [35] | UPTI | Bidirectional LSTM | Characters | 86.40%/95.80%
Analytical | Ahmed et al. [4] | UPTI | Bidirectional LSTM | Characters | 89.00%
Analytical | Naz et al. [20] | UPTI | Multi-dimensional LSTM | Characters | 94.97%
Analytical | Naz et al. [21] | UPTI | Multi-dimensional LSTM | Characters | 96.40%
Analytical | Naz et al. [22] | UPTI | CNN and multi-dimensional LSTM | Characters | 98.12%
Holistic | Sabbour and Shafait [28] | UPTI | KNN | 10,000 non-unique primary components | 86.70%
Holistic | Din et al. [9] | UPTI | HMM | 1,526 unique primary and secondary components | 92.26%
Holistic | Javed et al. [12] | CLE | HMM | 1,282 unique primary components | 92.00%
Holistic | Akram et al. [7] | CLE | HMM | 1,475 unique primary components | 97.87%
Holistic | Akram et al. [6] | CLE | HMM | 1,475 unique primary and secondary components | 86.15%
Holistic | Javed and Hussain [14] | CLE | HMM | 1,692 unique primary components | 92.73%
Holistic | Khattak et al. [16] | CLE | HMM | 2,028 unique primary and secondary components | 97.93%
III. PROPOSED METHODOLOGY
This section discusses the proposed methodology for recognition of complete ligatures. The technique relies on a holistic approach where features extracted from the ligatures (without separating primary and secondary components) are employed to train hidden Markov models for each of the ligature classes. As discussed earlier, a complete ligature may be a single character or a combination of characters connected together via joiner rules. Characters in the Urdu alphabet can be separated into joiners and non-joiners as shown in Figure 3. It should be noted that Urdu has a large number of unique ligatures (26,000 approximately) [17]. From the viewpoint of frequency of usage, a major proportion of the ligatures are very rarely employed. It has been shown that, ranked by probability of usage, around 2,300 unique primary and secondary ligatures can cover more than 99% of the Urdu corpus [17]. The Center of Language Engineering (CLE), Pakistan, has compiled a collection of 2,017 High Frequency Ligatures (HFLs) that covers a major proportion of the Urdu corpus. The present system has been investigated on the same dataset. The training and recognition steps of the system are discussed in the following.
Fig. 3. (a) Joiners and (b) Non-joiners Urdu characters
A. System Training
Training involves making the models learn to discriminate between different ligature classes. In our study, we consider 2,017 high frequency ligature clusters from the CLE database [1]. Each ligature cluster comprises 35 sample images of the respective ligature. 30 images of each class are used in the training set while 5 images of each class constitute the test set, making a total of 10,085 (2017 × 5) query ligatures in the test set.
1) Feature Extraction: Feature extraction transforms the objects under study (ligatures in our case) to a representation space where objects of the same class group together to form clusters. Features are generally categorized into structural and statistical features, which can be extracted at local or global levels. Although structural features are rich in their representation of objects, they are expensive to compute as well as to compare (matching step). Statistical features, which represent certain statistics calculated from the images under study, are efficient to compute, and rich classifiers are available to compare two objects using statistical measures. We, therefore, employ statistical features computed from ligature images to train the HMMs. Traditionally, HMMs are trained using features extracted through sliding windows [16]. Such smaller units of feature extraction can be sensitive to minor shape variations and eventually affect the recognition rates. We have chosen a global feature extraction method where the feature vectors of the complete ligature images are fed to the HMMs. The ligature image is normalized to a fixed size of N × N and two-dimensional Discrete Wavelet Transform (DWT) coefficients are computed, representing the complete ligature as a feature vector of dimension N/2 × N/2. Figure 4 illustrates the application of the DWT to a ligature image for N = 32. The impact of the value of N on the overall classification rates is presented later in the paper. The DWT coefficients extracted from the ligature images are fed to the HMMs for training as discussed in the following.
Fig. 4. DWT computation: (a) Original image (b) DWT image
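The global feature extraction step of Section III-A (normalize the ligature image to N × N, apply a single-level 2-D DWT, keep an N/2 × N/2 block of coefficients) can be illustrated with a minimal sketch. The paper does not state the wavelet family, the sub-band retained or the resizing scheme, so the Haar wavelet, the approximation (LL) sub-band and nearest-neighbour resizing used below are assumptions, and the function names are hypothetical.

```python
from typing import List

Image = List[List[float]]


def resize_nearest(img: Image, n: int) -> Image:
    """Resize a 2-D image to n x n by nearest-neighbour sampling (assumed scheme)."""
    rows, cols = len(img), len(img[0])
    return [[img[r * rows // n][c * cols // n] for c in range(n)] for r in range(n)]


def haar_approximation(img: Image) -> Image:
    """Single-level 2-D Haar DWT of a square, even-sized image (LL sub-band only).

    With orthonormal Haar filters, each LL coefficient is the sum of one
    2 x 2 pixel block divided by 2. The Haar wavelet is an assumption here.
    """
    n = len(img)
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 2.0
         for c in range(0, n, 2)]
        for r in range(0, n, 2)
    ]


def ligature_features(img: Image, n: int = 32) -> List[float]:
    """Normalize to n x n, apply the DWT and flatten the (n/2) x (n/2) band row by row."""
    ll = haar_approximation(resize_nearest(img, n))
    return [value for row in ll for value in row]


# Toy input: a 64 x 48 all-white "image"; a real input would be a ligature bitmap.
toy = [[1.0] * 48 for _ in range(64)]
features = ligature_features(toy, n=32)
print(len(features))  # 256, i.e. 16 x 16 coefficients for N = 32
```

For N = 32 this gives 256 coefficients per ligature, which matches the N/2 × N/2 dimension stated above. In practice a wavelet library would normally replace the hand-written transform.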
2) Training of Models: Hidden Markov models have been successful in a wide variety of pattern classification problems including speech, gesture and handwriting recognition [8], [12], [13], [15], [24], [26], [27], [34]. In our study, we train a separate HMM using the (training) images in each of the 2,017 ligature clusters. The DWT coefficients extracted from the images are used to train 5-state right-to-left HMMs (Figure 5). Since the HMMs are discrete, feature quantization is employed (using a symbol codebook of size 50), and the training is carried out using the standard Baum-Welch algorithm. Once trained, the Unicode of each ligature is associated with the respective model.
Fig. 5. 5-state HMM employed in our study
B. Recognition of Ligatures
For recognition, the DWT coefficients of the query ligature image are fed to each of the trained models. The model that reports the highest probability of producing the observed sequence (feature vector) is chosen and the associated Unicode is written to a text file in the UTF-8 format.
IV. EXPERIMENTS AND RESULTS
This section presents the details of the experiments carried out to study the effectiveness of the proposed recognition technique. We also investigate the evolution of performance with respect to system parameters and compare the realized results with the latest recognition systems proposed for Urdu text. As discussed earlier, the 2,017 high frequency ligatures of the CLE database are used in our experiments. 30 images of each ligature class are used in the training set and 5 in the test set, making a total of 10,085 (complete) query ligatures. Figure 6 illustrates sample ligature images from the database. The ligatures in the database comprise a minimum of one to a maximum of eight characters.
Fig. 6. Sample ligature images in the CLE database
A. Recognition Results
In order to report stable recognition rates, experiments are carried out using 7-fold cross validation with the 30/5 distribution in the training and test sets for each ligature class. The reported recognition rates represent the average of the 7 runs. The system realizes an overall complete ligature recognition rate of 88.87% on the 10,085 query ligatures. We also computed the recognition rates on the basis of the number of characters in the ligature. The realized results are summarized in Table II. With the exception of 8-character ligatures (4 such ligatures in the database), the recognition rates corresponding to different numbers of characters per ligature are more or less stable, with an average recognition rate of around 89%, which is promising considering the fact that the system recognizes complete (primary+secondary) ligatures.
TABLE II
LENGTH-WISE COMPLETE LIGATURE RECOGNITION RATES
Ligature's Constituent Characters | Query Complete Ligatures | Correctly Recognized | Accuracy
Single character ligatures | 165 | 136 | 82.42%
Two character ligatures | 709 | 611 | 86.18%
Three character ligatures | 2834 | 2534 | 89.41%
Four character ligatures | 3798 | 3391 | 89.28%
Five character ligatures | 1914 | 1696 | 88.61%
Six character ligatures | 560 | 510 | 91.07%
Seven character ligatures | 85 | 70 | 82.35%
Eight character ligatures | 20 | 15 | 75.00%
Total ligatures | 10,085 | 8,963 | 88.87%
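The recognition rates above rest on the decision rule of Section III-B: score the query against every ligature model and keep the most likely one. A minimal sketch of that rule for discrete HMMs follows, using the scaled forward algorithm. The class and function names, the labels and the toy parameters (2 states, 2 symbols) are hypothetical; the paper uses 5 states and a codebook of 50 symbols. Baum-Welch training and the quantization that turns DWT coefficients into symbols are not shown, since the paper does not detail them.

```python
import math
from typing import Dict, Sequence


class DiscreteHMM:
    """Discrete-observation HMM scored with the scaled forward algorithm."""

    def __init__(self, start: Sequence[float],
                 trans: Sequence[Sequence[float]],
                 emit: Sequence[Sequence[float]]) -> None:
        self.start = start  # start[i]    = P(first state is i)
        self.trans = trans  # trans[i][j] = P(next state is j | current state is i)
        self.emit = emit    # emit[i][k]  = P(symbol k | state i)

    def log_likelihood(self, symbols: Sequence[int]) -> float:
        """Return log P(symbols | model); -inf if the sequence is impossible."""
        n = len(self.start)
        alpha = [self.start[i] * self.emit[i][symbols[0]] for i in range(n)]
        log_prob = 0.0
        for t in range(len(symbols)):
            if t > 0:
                # Propagate one step through the transitions, then emit symbol t.
                alpha = [
                    sum(alpha[i] * self.trans[i][j] for i in range(n))
                    * self.emit[j][symbols[t]]
                    for j in range(n)
                ]
            scale = sum(alpha)
            if scale == 0.0:
                return float("-inf")
            # Rescale to avoid underflow; the log of the scales sums to log P.
            alpha = [a / scale for a in alpha]
            log_prob += math.log(scale)
        return log_prob


def recognize(symbols: Sequence[int], models: Dict[str, DiscreteHMM]) -> str:
    """Return the label of the model with the highest log-likelihood."""
    return max(models, key=lambda label: models[label].log_likelihood(symbols))


# Toy models: in the paper's setting each label would be a ligature's Unicode string.
chain = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "lig_A": DiscreteHMM([1.0, 0.0], chain, [[0.9, 0.1], [0.8, 0.2]]),
    "lig_B": DiscreteHMM([1.0, 0.0], chain, [[0.1, 0.9], [0.2, 0.8]]),
}
print(recognize([0, 0, 0], models))  # lig_A
```

Working in log space with per-step rescaling keeps the comparison stable for long observation sequences, where raw probabilities would underflow.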
B. Performance Sensitivity to System Parameters
To evaluate the stability of system performance as a function of different parameters, we carried out experiments on the first 500 ligatures of the dataset. These parameters include the number of HMM states, the size of the image and the size of the HMM codebook. Figure 7 illustrates the recognition rates obtained by varying the number of states in the HMMs, while Figure 8 reports recognition rates as a function of image size for computation of the DWT. The performance seems to be stable with respect to these parameters, with the highest recognition rates realized for a ligature size of 32 × 32 with 5 states in the HMMs.
Fig. 7. Performance evolution as a function of HMM states (Image size fixed to 32 × 32)
Fig. 8. Performance evolution with respect to Image Size (HMM states fixed to 5)
Unlike image size and number of HMM states, the recognition performance is more sensitive to the size of the codebook used to quantize the feature vectors. These results are summarized in Figure 9, where a codebook of size 50 reports the highest recognition rate.
Fig. 9. Performance evolution as a function of codebook size (HMM states: 5, Image size: 32 × 32)
C. Comparison with Recent Works
As discussed in Section II, a number of Urdu OCR systems proposed in the literature either work on segmented characters [3], [14], [19], [25], [35] or carry out individual recognition of primary and secondary components [7], [12], [16], ignoring the difficult part of associating the two. The recent deep learning based systems [20], [21], [22], [35] relying on implicit segmentation report character recognition rates (on the synthetic UPTI database) which cannot be directly compared with ligature recognition rates, as already elaborated. Table III provides a comparative overview of the recognition systems evaluated on the (more challenging) CLE database and employing ligatures as units of recognition. While the works reported in [7], [12], [14], [16] report more than 90% ligature recognition rates, none of these studies associates secondary ligatures with their primary ligatures. Only the system proposed in [6] considers recognition of complete ligatures, reporting a recognition rate of 87.15% on 1475 unique ligatures. Our technique reports a better recognition rate of 88.87% using 2,017 unique ligatures. A recent work by Hussain et al. [11] also considers the recognition of complete ligatures, reporting an average accuracy of 87.76%. The system, however, is not evaluated on the same set of ligatures and hence the results are not directly comparable.
TABLE III
PERFORMANCE COMPARISON OF LIGATURE BASED RECOGNITION TECHNIQUES
Study | Unique ligatures | Accuracy | Recognition of Complete Ligatures
Javed et al. [12] | 1282 | 92.00% | No
Akram et al. [7] | 1475 | 97.87% | No
Javed and Hussain [14] | 1692 | 92.73% | No
Khattak et al. [16] | 2028 | 97.93% | No
Akram et al. [6] | 1475 | 87.15% | Yes
Proposed | 2017 | 88.87% | Yes
V. CONCLUSION
This paper presented a holistic technique for recognition of complete Urdu ligatures in Nastaliq font. The technique relies on extracting DWT coefficients from ligature images and training a separate HMM for each ligature class. A total of 2,017 highly frequent unique ligatures from the standard CLE database are considered in our study. Evaluations under a number of interesting experimental settings report high recognition rates, outperforming the existing techniques evaluated using the same protocol. In our further work on this problem, we intend to consider recognition of complete page images of Urdu text, which naturally will require preprocessing and segmentation (into lines and subsequently into ligatures) steps. Although significant research has been carried out on this problem in recent years, end-to-end Urdu text recognition systems realizing acceptable recognition rates for commercial products are still many years down the road. The next step would be to address the more challenging Urdu handwriting recognition problem, which has not yet been explored at all. This, in turn, would also require development of standard handwriting databases similar to the printed UPTI and CLE datasets. In conclusion, the problem of Urdu OCR and Urdu handwriting recognition offers a number of exciting challenges to the document analysis and recognition community for many years to come.
REFERENCES
[1] Text Corpora and Image Corpora. http://www.cle.org.pk/clestore/index.htm. Accessed: 2017-04-20.
[2] Zaheer Ahmad, Jehanzeb Khan Orakzai, and Inam Shamsher. Urdu compound character recognition using feed forward neural networks. In Computer Science and Information Technology, 2009. ICCSIT 2009. 2nd IEEE International Conference on, pages 457–462. IEEE, 2009.
[3] Zaheer Ahmad, Jehanzeb Khan Orakzai, Inam Shamsher, and Awais Adnan. Urdu nastaleeq optical character recognition. In Proceedings of world academy of science, engineering and technology, volume 26, pages 249–252. Citeseer, 2007.
[4] Saad Bin Ahmed, Saeeda Naz, Muhammad Imran Razzak, Shiekh Faisal Rashid, Muhammad Zeeshan Afzal, and Thomas M Breuel. Evaluation of cursive and non-cursive scripts using recurrent neural networks. Neural Computing and Applications, 27(3):603–613, 2016.
[5] Misbah Akram and Sarmad Hussain. Word segmentation for urdu ocr system. In Proceedings of the 8th Workshop on Asian Language Resources, Beijing, China, pages 88–94, 2010.
[6] Qurat-ul-Ain Akram, Sarmad Hussain, Farah Adeeba, Shafiq ur Rehman, and Mehreen Seed. Framework for urdu nastalique optical character recognition, 2014.
[7] Qurat Ul Ain Akram, Shiraz Hussain, Aneta Niazi, Umair Anjum, and Faheem Irfan. Adapting tesseract for complex scripts: an example for urdu nastalique. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 191–195. IEEE, 2014.
[8] Edgard Chammas, Chafic Mokbel, and Laurence Likforman-Sulem. Arabic handwritten document preprocessing and recognition. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 451–455. IEEE, 2015.
[9] Israr Ud Din, Imran Siddiqi, Shehzad Khalid, and Tahir Azam. Segmentation-free optical character recognition for printed urdu text. EURASIP Journal on Image and Video Processing, 2017(1):62, 2017.
[10] VK Govindan and AP Shivaprasad. Character recognition - a review. Pattern recognition, 23(7):671–683, 1990.
[11] Sarmad Hussain, Salman Ali, and Qurat-ul-Ain Akram. Nastalique segmentation-based approach for urdu ocr. International Journal on Document Analysis and Recognition (IJDAR), 18(4):357–374, 2015.
[12] Sobia Javed, Sarmad Hussain, Ameera Maqbool, Samia Asloob, Sehrish Jamil, and Huma Moin. Segmentation free nastalique urdu ocr. World Academy of Science, Engineering and Technology, 46:456–461, 2010.
[13] Sobia T Javed, Sarmad Hussain, Ameera Maqbool, Samia Asloob, Sehrish Jamil, and Huma Moin. Segmentation free nastalique urdu ocr. World Academy of Science, Engineering and Technology, 46:456–461, 2010.
[14] Sobia Tariq Javed and Sarmad Hussain. Segmentation based urdu nastalique ocr. In Iberoamerican Congress on Pattern Recognition, pages 41–49. Springer, 2013.
[15] Ergina Kavallieratou, Efstathios Stamatatos, Nikos Fakotakis, and George Kokkinakis. Handwritten character segmentation using transformation-based learning. In International Conference on Pattern Recognition, volume 15, pages 634–637, 2000.
[16] Israr Uddin Khattak, Imran Siddiqi, Shehzad Khalid, and Chawki Djeddi. Recognition of urdu ligatures-a holistic approach. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 71–75. IEEE, 2015.
[17] Gurpreet Singh Lehal. Choice of recognizable units for urdu ocr. In Proceeding of the workshop on Document Analysis and Recognition, pages 79–85. ACM, 2012.
[18] Hamna Malik and Muhammad Abuzar Fahiem. Segmentation of printed urdu scripts using structural features. In Visualisation, 2009. VIZ’09. Second International Conference in, pages 191–195. IEEE, 2009.
[19] Tabassam Nawaz, S.A.H.S Naqvi, Habib ur Rehman, and Anoshia Faiz. Optical character recognition system for urdu (naskh font) using pattern matching technique. International Journal of Image Processing (IJIP), 3(3):92, 2009.
[20] Saeeda Naz, Arif I Umar, Riaz Ahmad, Saad B Ahmed, Syed H Shirazi, and Muhammad I Razzak. Urdu nastaliq text recognition system based on multi-dimensional recurrent neural network and statistical features. Neural Computing and Applications, pages 1–13, 2015.
[21] Saeeda Naz, Arif I Umar, Riaz Ahmad, Saad B Ahmed, Syed H Shirazi, Imran Siddiqi, and Muhammad I Razzak. Offline cursive urdu-nastaliq script recognition using multidimensional recurrent neural networks. Neurocomputing, 177:228–241, 2016.
[22] Saeeda Naz, Arif I Umar, Riaz Ahmad, Imran Siddiqi, Saad B Ahmed, Muhammad I Razzak, and Faisal Shafait. Urdu nastaliq recognition using convolutional–recursive deep learning. Neurocomputing, 243:80–87, 2017.
[23] Saeeda Naz, Arif Iqbal Umar, Saad Bin Ahmed, Syed Hamad Shirazi, M Imran Razzak, and Imran Siddiqi. An ocr system for printed nasta’liq script: A segmentation based approach. In Multi-Topic Conference (INMIC), 2014 IEEE 17th International, pages 255–259. IEEE, 2014.
[24] Chan Wah Ng and Surendra Ranganath. Real-time gesture recognition system and application. Image and Vision computing, 20(13):993–1007, 2002.
[25] U Pal and Anirban Sarkar. Recognition of printed urdu script. In Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on, pages 1183–1187. IEEE, 2003.
[26] Bryan Pardo and William Birmingham. Modeling form for on-line following of musical performances. In Proceeding of the National Conference on Artificial Intelligence, volume 20, page 1018. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2005.
[27] Thomas Plotz and Gernot A Fink. Markov models for offline handwriting recognition: a survey. International Journal on Document Analysis and Recognition (IJDAR), 12(4):269–298, 2009.
[28] Nazly Sabbour and Faisal Shafait. A segmentation-free approach to arabic and urdu ocr. In IS&T/SPIE Electronic Imaging, pages 86580N–86580N. International Society for Optics and Photonics, 2013.
[29] Malik Waqas Sagheer, Chun Lei He, Nicola Nobile, and Ching Y Suen. Holistic urdu handwritten word recognition using support vector machine. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 1900–1903. IEEE, 2010.
[30] Shuwair Sardar and Abdul Wahab. Optical character recognition system for urdu. In Information and Emerging Technologies (ICIET), 2010 International Conference on, pages 1–5. IEEE, 2010.
[31] Sohail A Sattar, Shamsul Haque, and Mahmood K Pathan. Nastaliq optical character recognition. In Proceedings of the 46th Annual Southeast Regional Conference on XX, pages 329–331. ACM, 2008.
[32] Inam Shamsher, Zaheer Ahmad, Jehanzeb Khan Orakzai, and Awais Adnan. Ocr for printed urdu script using feed forward neural network. the Proceedings of World Academy of Science, Engineering and Technology, 23, 2007.
[33] Junaid Tariq, Umar Nauman, and Muhammad Umair Naru. Softconverter: A novel approach to construct ocr for printed urdu isolated characters. In Computer Engineering and Technology (ICCET), 2010 2nd International Conference on, volume 3, pages V3–495. IEEE, 2010.
[34] Jochen Triesch and Christoph von der Malsburg. Classification of hand postures against complex backgrounds using elastic graph matching. Image and Vision Computing, 20(13):937–943, 2002.
[35] Adnan Ul-Hasan, Saad Bin Ahmed, Faisal Rashid, Faisal Shafait, and Thomas M Breuel. Offline printed urdu nastaleeq script recognition with bidirectional lstm networks. In 2013 12th International Conference on Document Analysis and Recognition, pages 1061–1065. IEEE, 2013.
