I. INTRODUCTION
Optical Character Recognition (OCR) is a classical pattern recognition problem that has witnessed more than three decades of intensive research. Thanks to these research endeavors, mature recognition systems realizing accuracies of up to 99.9% [10] have been developed for many scripts, for instance those based on the Chinese and Latin alphabets. In spite of these tremendous advancements, recognition systems for cursive scripts like Urdu are still in the early days of research and relatively limited literature is available to date. Urdu is mostly printed in the Nastaliq style, which is more complex than the Naskh style of Arabic and Pashto. The bidirectional Nastaliq writing style is highly cursive and is written diagonally from right to left with varying inter- and intra-word spaces. Overlapping strokes, incorrect or filled loops and the absence of a static baseline represent a few of the commonly encountered challenges [14], [16], [23]. Figure 1 illustrates the complexities of the Urdu Nastaliq font that make it a challenging recognition problem. With the digital revolution, the importance of automatic recognition systems has increased manifold, enabling huge collections of printed, handwritten or mixed documents to be digitized without manual transcription.

Fig. 1. Text having (a) intra-ligature overlap, (b) inter-ligature overlap, (c) false (green) and filled (red) loops, (d) missing fixed baseline and (e) variable spacing, indicating the high cursiveness

This paper presents a global transformational method to recognize complete Urdu ligatures. A complete or true ligature is composed of either a single character or a combination of characters (up to 10 [17]) and may correspond to a complete word or a partial word. Moreover, a complete ligature may comprise two types of components. The larger component representing the main body of the ligature is generally termed the base or primary ligature, while the smaller components representing dots and diacritics are known as secondary ligatures. Figure 2 illustrates the primary and secondary ligatures constituting a complete ligature. It is interesting to note that, quite often, a number of complete ligatures share the same base ligature but have different secondary ligatures, depending on their number, shape and position. Consequently, a major proportion of recognition techniques which employ ligatures as units of recognition carry out separate classification of primary and secondary ligatures [6], [14], [16], [28], [29], [30] with the aim of reducing the number of unique recognizable (shape) classes. Such techniques face the challenge of re-associating secondary ligatures with their primary ligature to recognize the complete ligature. While high recognition rates have been reported on individual recognition of base and secondary ligatures, most of the studies [7], [14], [16], [28] ignore the re-association step and do not report the recognition rates on complete ligatures. Consequently, the reported recognition rates are not truly indicative of the maturity of the proposed recognition systems when it comes to acceptability from the viewpoint of commercial recognition engines. This study targets recognition of complete ligatures; rather than separately recognizing the
can be addressed by enhancing the feature extraction and classification steps.
investigated on the same dataset. The training and recognition steps of the system are discussed in the following.
TABLE I
SUMMARY OF URDU TEXT RECOGNITION SYSTEMS
2) Training of Models: Hidden Markov models have been successful in a wide variety of pattern classification problems including speech, gesture and handwriting recognition [8], [12], [13], [15], [24], [26], [27], [34]. In our study, we train a separate HMM using (training) images in each of the 2,017 ligature clusters. The DWT coefficients extracted from the images are used to train the 5 state right-to-left HMMs (Figure 5). Since the HMMs are discrete, feature quantization is employed (using a symbol codebook of size 50) and the training is carried out using the standard Baum-Welch algorithm. Once trained, the Unicode of each ligature is associated with the respective model.

Fig. 5. 5 State HMM employed in our study

B. Recognition of Ligatures
For recognition, the DWT coefficients of the query ligature image are fed to each of the trained models. The model that reports the highest probability of producing the observed sequence (feature vector) is chosen and the associated Unicode is written to a text file in the UTF-8 format.

IV. EXPERIMENTS AND RESULTS
This section presents the details of the experiments carried out to study the effectiveness of the proposed recognition technique. We also investigate the evolution of performance with respect to system parameters and compare the realized results with the latest recognition systems proposed for Urdu text. As discussed earlier, the 2,017 high frequency ligatures
of each ligature class are used in the training set and 5 in the test set making a total of 10,085 (complete) query ligatures. Figure 6 illustrates sample ligature images from the database. The ligatures in the database comprise a minimum of one to a maximum of eight characters.

Fig. 6. Sample ligature images in the CLE database

A. Recognition Results
In order to report stable recognition rates, experiments are carried out using 7-fold cross validation with the 30/5 distribution in the training and test sets for each ligature class. The reported recognition rates represent the average of the 7 runs. The system realizes an overall complete ligature recognition rate of 88.87% on the 10,085 query ligatures. We also computed the recognition rates on the basis of number of characters in the ligature. The realized results are summarized in Table II. With the exception of 8-character ligatures (4 such ligatures in the database), the recognition rates corresponding to different number of characters per ligature are more or less stable with an average recognition rate of around 89% which indeed is promising considering the fact that the system recognizes complete (primary+secondary) ligatures.

TABLE II
LENGTH-WISE RESPECTIVE COMPLETE LIGATURE RECOGNITION RATES

Ligature’s Constituent Characters   Query Complete Ligatures   Correctly Recognized   Accuracy
Single character ligatures          165                        136                    82.42%
Two character ligatures             709                        611                    86.18%
Three character ligatures           2834                       2534                   89.41%
Four character ligatures            3798                       3391                   89.28%
Five character ligatures            1914                       1696                   88.61%
Six character ligatures             560                        510                    91.07%
Seven character ligatures           85                         70                     82.35%
Eight character ligatures           20                         15                     75.00%
Total ligatures                     10,085                     8,963                  88.87%
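
The rates in Table II can be re-derived from the counts alone. The following Python sketch is only an arithmetic check: the names `TABLE_II` and `accuracy` are illustrative and not taken from the paper, and the counts are copied from the table. It recomputes the per-length accuracies and the overall rate.

```python
# (ligature length in characters, query ligatures, correctly recognized),
# copied from Table II.
TABLE_II = [
    (1, 165, 136),
    (2, 709, 611),
    (3, 2834, 2534),
    (4, 3798, 3391),
    (5, 1914, 1696),
    (6, 560, 510),
    (7, 85, 70),
    (8, 20, 15),
]


def accuracy(correct, total):
    """Recognition rate in percent."""
    return 100.0 * correct / total


# Per-length rates, rounded to two decimals as in the table.
per_length = {n: round(accuracy(c, q), 2) for n, q, c in TABLE_II}

# The overall rate pools the counts, so longer-tailed classes weigh less.
total_query = sum(q for _, q, _ in TABLE_II)
total_correct = sum(c for _, _, c in TABLE_II)
overall = round(accuracy(total_correct, total_query), 2)

for n, q, c in TABLE_II:
    print(f"{n} characters: {c}/{q} = {per_length[n]:.2f}%")
print(f"overall: {total_correct}/{total_query} = {overall:.2f}%")
```

The overall figure is the ratio of the pooled counts (8,963 of 10,085), so it is weighted by the number of query ligatures of each length and is not the plain mean of the eight per-length rates. The total of 10,085 queries also agrees with 5 test images for each of the 2,017 ligature classes.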
B. Performance Sensitivity to System Parameters
To evaluate the stability of system performance as a function of different parameters, we carried out experiments on the first 500 ligatures of the dataset. These parameters include HMM states, size of the image and size of the HMM codebook. Figure 7 illustrates the recognition rates by varying the number of states in the HMMs while Figure 8 reports recognition rates as a function of image size for computation of DWT. The performance seems to be stable with respect to these parameters where highest recognition rates are realized for a ligature size of 32 × 32 with 5 states in the HMMs.

Fig. 7. Performance evolution as a function of HMM states (Image size fixed to 32 × 32)

C. Comparison with Recent Works
As discussed in Section II, a number of Urdu OCR systems proposed in the literature either work on segmented characters [3], [14], [19], [25], [35] or carry out individual recognition of primary and secondary components [7], [12], [16] ignoring the difficult part of associating the two. The recent deep learning based systems [20], [21], [22], [35] relying on implicit segmentation report character recognition rates (on synthetic UPTI database) which cannot be directly compared with ligature recognition rates as already elaborated. Table III provides a comparative overview of the recognition systems evaluated on the (more challenging) CLE database and employing ligatures as units of recognition. While the works reported in [7], [12], [14], [16] report more than 90% ligature recognition rates, none of these studies associates secondary ligatures with their primary ligatures. Only the system proposed in [6] considers recognition of complete ligatures reporting a recognition rate of 87.15% on 1475 unique ligatures. Our technique reports a better recognition rate of 88.87% using 2,017 unique ligatures. A recent work by Hussain et al. [11] also considers the recognition of complete ligatures reporting an average accuracy of 87.76%. The system, however, is not evaluated on the same set of ligatures and hence the results are not directly comparable.
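
For readers who wish to set up a comparable baseline, the decision rule described in Section III (one discrete HMM per ligature class, with the query assigned to the class whose model gives the highest likelihood) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `forward_log_likelihood` and `classify` and the toy parameters are hypothetical. The models here are written by hand instead of being trained with Baum-Welch, and they use 2 states and 3 symbols in place of the 5 states and codebook of 50 reported in the paper. The integer observations stand in for quantized DWT coefficients.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) for a discrete HMM (scaled forward algorithm)."""
    if not obs:
        raise ValueError("observation sequence must be non-empty")
    n = len(start)
    # Initialisation: start probability times emission of the first symbol.
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Induction: propagate through transitions, then emit obs[t].
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


def classify(obs, models):
    """Pick the label whose model gives the highest likelihood for obs."""
    return max(models, key=lambda label: forward_log_likelihood(obs, *models[label]))


# Hypothetical toy models: 2 states, 3 codebook symbols, forward-only transitions.
# Each entry is (start, trans, emit), keyed by the Unicode string of the class.
TOY_MODELS = {
    "\u0628": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]],
               [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
    "\u062c": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]],
               [[0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]),
}

query = [0, 0, 1, 1]  # stands in for a quantized feature sequence
predicted = classify(query, TOY_MODELS)
print(predicted.encode("utf-8"))  # UTF-8 bytes of the recognized label
```

Working with scaled forward variables keeps the computation in the log domain, which matters once sequences are long enough for raw probabilities to underflow. With 2,017 classes the same `classify` call would simply iterate over 2,017 trained models.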
Fig. 9. Performance evolution as a function of codebook size (HMM states: 5, Image size: 32 × 32)

TABLE III
PERFORMANCE COMPARISON OF LIGATURE BASED RECOGNITION TECHNIQUES

V. CONCLUSION
REFERENCES
[1] Text Corpora and Image Corpora. http://www.cle.org.pk/clestore/index.htm. Accessed: 2017-04-20.
[2] Zaheer Ahmad, Jehanzeb Khan Orakzai, and Inam Shamsher. Urdu compound character recognition using feed forward neural networks. In Computer Science and Information Technology, 2009. ICCSIT 2009. 2nd IEEE International Conference on, pages 457–462. IEEE, 2009.
[3] Zaheer Ahmad, Jehanzeb Khan Orakzai, Inam Shamsher, and Awais Adnan. Urdu nastaleeq optical character recognition. In Proceedings of world academy of science, engineering and technology, volume 26, pages 249–252. Citeseer, 2007.
[4] Saad Bin Ahmed, Saeeda Naz, Muhammad Imran Razzak, Shiekh Faisal Rashid, Muhammad Zeeshan Afzal, and Thomas M Breuel. Evaluation of cursive and non-cursive scripts using recurrent neural networks. Neural Computing and Applications, 27(3):603–613, 2016.
[5] Misbah Akram and Sarmad Hussain. Word segmentation for urdu ocr system. In Proceedings of the 8th Workshop on Asian Language Resources, Beijing, China, pages 88–94, 2010.
[6] Qurat-ul-Ain Akram, Sarmad Hussain, Farah Adeeba, Shafiq ur Rehman, and Mehreen Seed. Framework for urdu nastalique optical character recognition, 2014.
[7] Qurat Ul Ain Akram, Shiraz Hussain, Aneta Niazi, Umair Anjum, and Faheem Irfan. Adapting tesseract for complex scripts: an example for urdu nastalique. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 191–195. IEEE, 2014.
[8] Edgard Chammas, Chafic Mokbel, and Laurence Likforman-Sulem. Arabic handwritten document preprocessing and recognition. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 451–455. IEEE, 2015.
[9] Israr Ud Din, Imran Siddiqi, Shehzad Khalid, and Tahir Azam. Segmentation-free optical character recognition for printed urdu text. EURASIP Journal on Image and Video Processing, 2017(1):62, 2017.
[10] VK Govindan and AP Shivaprasad. Character recognition - a review. Pattern recognition, 23(7):671–683, 1990.
[11] Sarmad Hussain, Salman Ali, and Qurat-ul-Ain Akram. Nastalique segmentation-based approach for urdu ocr. International Journal on Document Analysis and Recognition (IJDAR), 18(4):357–374, 2015.
[12] Sobia Javed, Sarmad Hussain, Ameera Maqbool, Samia Asloob, Sehrish Jamil, and Huma Moin. Segmentation free nastalique urdu ocr. World Academy of Science, Engineering and Technology, 46:456–461, 2010.
[13] Sobia T Javed, Sarmad Hussain, Ameera Maqbool, Samia Asloob, Sehrish Jamil, and Huma Moin. Segmentation free nastalique urdu ocr. World Academy of Science, Engineering and Technology, 46:456–461, 2010.
[14] Sobia Tariq Javed and Sarmad Hussain. Segmentation based urdu nastalique ocr. In Iberoamerican Congress on Pattern Recognition, pages 41–49. Springer, 2013.
[15] Ergina Kavallieratou, Efstathios Stamatatos, Nikos Fakotakis, and George Kokkinakis. Handwritten character segmentation using transformation-based learning. In International Conference on Pattern Recognition, volume 15, pages 634–637, 2000.
[16] Israr Uddin Khattak, Imran Siddiqi, Shehzad Khalid, and Chawki Djeddi. Recognition of urdu ligatures-a holistic approach. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 71–75. IEEE, 2015.
[17] Gurpreet Singh Lehal. Choice of recognizable units for urdu ocr. In Proceeding of the workshop on Document Analysis and Recognition, pages 79–85. ACM, 2012.
[18] Hamna Malik and Muhammad Abuzar Fahiem. Segmentation of printed urdu scripts using structural features. In Visualisation, 2009. VIZ’09. Second International Conference in, pages 191–195. IEEE, 2009.
[19] Tabassam Nawaz, S.A.H.S Naqvi, Habib ur Rehman, and Anoshia Faiz. Optical character recognition system for urdu (naskh font) using pattern matching technique. International Journal of Image Processing (IJIP), 3(3):92, 2009.
[20] Saeeda Naz, Arif I Umar, Riaz Ahmad, Saad B Ahmed, Syed H Shirazi, and Muhammad I Razzak. Urdu nastaliq text recognition system based on multi-dimensional recurrent neural network and statistical features. Neural Computing and Applications, pages 1–13, 2015.
[21] Saeeda Naz, Arif I Umar, Riaz Ahmad, Saad B Ahmed, Syed H Shirazi, Imran Siddiqi, and Muhammad I Razzak. Offline cursive urdu-nastaliq script recognition using multidimensional recurrent neural networks. Neurocomputing, 177:228–241, 2016.
[22] Saeeda Naz, Arif I Umar, Riaz Ahmad, Imran Siddiqi, Saad B Ahmed, Muhammad I Razzak, and Faisal Shafait. Urdu nastaliq recognition using convolutional–recursive deep learning. Neurocomputing, 243:80–87, 2017.
[23] Saeeda Naz, Arif Iqbal Umar, Saad Bin Ahmed, Syed Hamad Shirazi, M Imran Razzak, and Imran Siddiqi. An ocr system for printed nasta’liq script: A segmentation based approach. In Multi-Topic Conference (INMIC), 2014 IEEE 17th International, pages 255–259. IEEE, 2014.
[24] Chan Wah Ng and Surendra Ranganath. Real-time gesture recognition system and application. Image and Vision computing, 20(13):993–1007, 2002.
[25] U Pal and Anirban Sarkar. Recognition of printed urdu script. In Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on, pages 1183–1187. IEEE, 2003.
[26] Bryan Pardo and William Birmingham. Modeling form for on-line following of musical performances. In Proceeding of the National Conference on Artificial Intelligence, volume 20, page 1018. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2005.
[27] Thomas Plotz and Gernot A Fink. Markov models for offline handwriting recognition: a survey. International Journal on Document Analysis and Recognition (IJDAR), 12(4):269–298, 2009.
[28] Nazly Sabbour and Faisal Shafait. A segmentation-free approach to arabic and urdu ocr. In IS&T/SPIE Electronic Imaging, pages 86580N–86580N. International Society for Optics and Photonics, 2013.
[29] Malik Waqas Sagheer, Chun Lei He, Nicola Nobile, and Ching Y Suen. Holistic urdu handwritten word recognition using support vector machine. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 1900–1903. IEEE, 2010.
[30] Shuwair Sardar and Abdul Wahab. Optical character recognition system for urdu. In Information and Emerging Technologies (ICIET), 2010 International Conference on, pages 1–5. IEEE, 2010.
[31] Sohail A Sattar, Shamsul Haque, and Mahmood K Pathan. Nastaliq optical character recognition. In Proceedings of the 46th Annual Southeast Regional Conference on XX, pages 329–331. ACM, 2008.
[32] Inam Shamsher, Zaheer Ahmad, Jehanzeb Khan Orakzai, and Awais Adnan. Ocr for printed urdu script using feed forward neural network. the Proceedings of World Academy of Science, Engineering and Technology, 23, 2007.
[33] Junaid Tariq, Umar Nauman, and Muhammad Umair Naru. Softconverter: A novel approach to construct ocr for printed urdu isolated characters. In Computer Engineering and Technology (ICCET), 2010 2nd International Conference on, volume 3, pages V3–495. IEEE, 2010.
[34] Jochen Triesch and Christoph von der Malsburg. Classification of hand postures against complex backgrounds using elastic graph matching. Image and Vision Computing, 20(13):937–943, 2002.
[35] Adnan Ul-Hasan, Saad Bin Ahmed, Faisal Rashid, Faisal Shafait, and Thomas M Breuel. Offline printed urdu nastaleeq script recognition with bidirectional lstm networks. In 2013 12th International Conference on Document Analysis and Recognition, pages 1061–1065. IEEE, 2013.