
Available online at www.sciencedirect.com

ScienceDirect
Procedia Technology 11 (2013) 580 – 584

The 4th International Conference on Electrical Engineering and Informatics (ICEEI 2013)

Arabic Handwriting Data Base for Text Recognition


Jabril Ramdan, Khairuddin Omar, Mohammad Faidzul, Ali Mady*
Center of Artificial Intelligent, School of Computer Science, Faculty of Information Science and Technology,
Universiti Kebangsaan Malaysia, 43200 Bangi, Selangor, Malaysia.

Abstract

Standard databases greatly facilitate the comparison of results obtained by different research groups.
This paper introduces a database (AHDB/FTR) of Arabic handwritten text images that supports research on
open-vocabulary Arabic handwritten text recognition, word segmentation and writer identification, and that
can be freely accessed by researchers worldwide. The database consists of four hundred and ninety-seven images
of the names of Libyan cities, handwritten by five Arabic scholars.

© 2013 The Authors. Published by Elsevier Ltd.
Selection and peer-review under responsibility of the Faculty of Information Science & Technology, Universiti Kebangsaan
Malaysia.

Keywords: Arabic handwritten text image; AHDB/FTR database.

1. Introduction

Owing to the availability of numerous databases of printed and handwritten Latin text, the majority of text
recognition research over the years has concentrated on Latin script [1, 2]. Databases for Chinese and Indian
languages are also quite common [3, 4, 5].
Studies on Arabic handwriting recognition are presently scarce, and the absence of freely available Arabic
databases is regarded as the primary reason. Consequently, each research group has conducted its research on data
it collected itself, which has resulted in widely varying recognition rates for Arabic and makes the systems used
by those researchers hard to compare [8]. This scenario clearly shows the need for a common database for text
image recognition and writer identification. Databases do exist for handwritten texts [6], handwritten words [7]
and bank checks [8]. To our knowledge, however, there is just one database that contains Arabic handwritten
text-lines with open vocabulary [9]. That database does not include a dataset of words, and it is not easy to
obtain access to it. To address this problem, this paper proposes an Arabic Handwritten Text Images Database
(AHDB/FTR), which covers all Arabic characters and their forms (beginning, middle, end, and isolated).

* Corresponding author. Tel.: +603 - 8921 ext 6347.
E-mail address: bin_mady@yahoo.com

2212-0173 © 2013 The Authors. Published by Elsevier Ltd.
doi:10.1016/j.protcy.2013.12.231

2. Overview of the AHDB/FTR

Data collection began by asking five Arabic scholars of different ages and educational qualifications to write
text-lines using any pen of their choice (no restriction was placed on the selection of pens). The texts were
then scanned and stored as grayscale BMP images at a resolution of 300 dpi. Fig 1 shows a sample image
from the database:

Fig 1. Sample of image words

The AHDB/FTR database comprises 497 such images of the names of Libyan towns, containing a large number of
characters in total and covering the various character forms (beginning, middle, end, and isolated).
Each binary bitmap image represents a handwritten town name and is accompanied by additional ground truth (GT)
information. Table 1 shows a data set entry of the AHDB/FTR database, and Fig 2 depicts a standard example.

Table 1. A data set entry of the AHDB/FTR database. The symbols M, B, A, E stand for the character shapes used
(middle, beginning, alone, end position in a word).

Image

Ground truth
Global word ‫ﻟﻤﻠﻮﺩﺓ‬
Character sequence B‫ –ﻝ‬M‫ – ﻡ‬M‫ – ﻝ‬E‫ – ﻭ‬B‫ ﺩ‬- B‫ﺓ‬
Baseline 114
Quantity of word 1
Quantity of PAWs 3
Quantity of character 6

Fig 2. Illustration of the baseline
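
The ground truth fields listed in Table 1 can be pictured as a small record. The following is only a minimal sketch of one possible in-memory representation; the paper does not specify a file format, so the class name, the field names, and the image file name are hypothetical. The values are copied from Table 1 as printed.

```python
from dataclasses import dataclass
from typing import List, Tuple

VALID_SHAPES = {"B", "M", "A", "E"}  # begin, middle, alone, end (Table 1 caption)


@dataclass
class GroundTruthEntry:
    """One AHDB/FTR-style entry; all field names are illustrative assumptions."""
    image_file: str                            # hypothetical file name
    global_word: str                           # the town name as text
    character_sequence: List[Tuple[str, str]]  # (shape label, letter) pairs
    baseline: int                              # baseline position, as in Table 1
    word_count: int
    paw_count: int
    character_count: int

    def is_consistent(self) -> bool:
        """Check that the counts agree with the text and the shape labels are valid."""
        letters = self.global_word.replace(" ", "")
        return (
            self.character_count == len(self.character_sequence) == len(letters)
            and self.word_count == len(self.global_word.split())
            and all(shape in VALID_SHAPES for shape, _ in self.character_sequence)
        )


# Example entry with the values of Table 1 (letters given as Unicode escapes).
entry = GroundTruthEntry(
    image_file="example_0001.bmp",  # hypothetical
    global_word="\u0644\u0645\u0644\u0648\u062f\u0629",
    character_sequence=[
        ("B", "\u0644"), ("M", "\u0645"), ("M", "\u0644"),
        ("E", "\u0648"), ("B", "\u062f"), ("B", "\u0629"),
    ],
    baseline=114,
    word_count=1,
    paw_count=3,
    character_count=6,
)
print(entry.is_consistent())
```

A check of this kind may help catch ground truth entries whose counts drift from the transcribed text.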

The names of the towns comprise at most three words, and may include some associated sub-words. The writers
were asked to fill in the forms without any restrictions, other than to avoid lines and boxes. In the AHDB/FTR
database, the majority of the words are either overlapping or touching. AHDB/FTR can be used for Arabic word
recognition, word spotting, and writer identification. Table 2 shows an example data set entry from each of the
AHDB/FTR, IFN/ENIT and AHTID/MW databases.

Table 2. A data set entry of the AHDB/FTR, IFN/ENIT and AHTID/MW databases.

AHDB/FTR IFN/ENIT AHTID/MW



3. Description of Arabic characteristics

Arabic is a widely used language, spoken by more than three hundred million people in more than 20 countries.
Arabic script is distinctive: it is cursive and is written horizontally from right to left. The Arabic alphabet
has twenty-eight letters, and some of them change shape according to their position in the word. Many Arabic
characters have four shapes: isolated, initial, medial and final (Table 1) [8]. Because some letters do not join
to the letter that follows them, an Arabic word can break into more than one sub-word, called a PAW (Piece of
Arabic Word), each of which consists of one or more connected letters. The majority of Arabic letters carry one
to three dots. The presence and position of these dots help readers distinguish letters that share the same basic
shape. In addition, some Arabic letters can be written in several styles, so it is essential to gather samples
of all those styles [9, 10]. Table 3 illustrates the different shapes of an Arabic letter.

Table 3. Example of different shapes of an Arabic letter

Letter label Isolated Begin Middle End


Yaa ‫ﻱ‬ ‫ﻳـ‬ ‫ـﻴـ‬ ‫ـﻲ‬
Nuun ‫ﻥ‬ ‫ﻧـ‬ ‫ـﻨـ‬ ‫ـﻦ‬
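
The decomposition of a word into PAWs described in this section can be sketched in a few lines. This is an illustrative sketch rather than the procedure used to build the database: the set of letters that do not join to the following letter is an assumption chosen for this example, hamza and ligatures are not handled, and the function names are hypothetical. The sample word is the one from Table 1, written with Unicode escapes.

```python
# Letters assumed not to join to the letter that follows them (alif and its
# hamza/madda variants, dal, dhal, ra, zay, waw, waw with hamza, ta marbuta).
NON_CONNECTING = set(
    "\u0627\u0622\u0623\u0625\u062f\u0630\u0631\u0632\u0648\u0624\u0629"
)


def split_paws(word):
    """Split one Arabic word into PAWs (pieces of Arabic word)."""
    paws, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_CONNECTING:  # the cursive stroke ends after this letter
            paws.append(current)
            current = ""
    if current:                   # trailing letters form the last PAW
        paws.append(current)
    return paws


def count_summary(name):
    """Count words, PAWs and characters of a town name."""
    words = name.split()
    paws = [paw for word in words for paw in split_paws(word)]
    return {
        "words": len(words),
        "paws": len(paws),
        "characters": sum(len(word) for word in words),
    }


town = "\u0644\u0645\u0644\u0648\u062f\u0629"  # the word shown in Table 1
summary = count_summary(town)
print(summary)
```

Under these assumptions the sample word yields one word, three PAWs and six characters, which agrees with the quantities listed in Table 1.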

4. Purpose of this paper and dataset

Researchers have recently shown growing interest in the problems of Arabic handwritten text recognition.
Although some studies use their own data sets, the need for a standard database to facilitate studies on Arabic
handwriting recognition has been clearly emphasized; furthermore, such a standardized database must be
available for all purposes.
After the images are collected and stored, fundamental pre-processing tasks such as noise filtering, text block
segmentation, image binarization, and word segmentation must be carried out. These tasks depend heavily on the
quality and nature of the documents used, but they are independent of the subsequent processing phases.
Because the basic pre-processing was done during database development, the AHDB/FTR database contains
pre-processed binary images of single words, which allows work on recognition methods to be independent of
the nature of the particular source documents.
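
As an illustration of the binarization step mentioned above, the sketch below applies Otsu's global threshold to a tiny grayscale image. The paper does not state which binarization method was used for AHDB/FTR, so Otsu's method is an assumption made only for this example, and the function names and pixel values are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level (0..255) that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0  # pixel count and intensity sum of the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows) to 1 for ink (dark) and 0 for paper."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


# Hypothetical 2 x 3 grayscale patch: dark ink pixels on light paper.
patch = [[250, 245, 30],
         [240, 20, 25]]
binary = binarize(patch)
print(binary)
```

A global threshold is the simplest choice; scans with uneven illumination would likely need a local or adaptive method instead.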

5. Future directions

The dataset can be enhanced by adding more images, increasing the number of writers, and including new
decoration images.

6. Conclusions

This paper has stressed the need for a standardized database of Arabic handwritten images and has introduced
one such database. The proposed AHDB/FTR database contains 497 word images of names of Libyan towns, written
by five different writers. The database will be useful for a variety of research applications, such as Arabic
handwritten text recognition and writer identification systems. The AHDB/FTR database will be made freely
available to interested researchers. Ultimately, we believe that the database will benefit the research community
and will be regarded as a significant asset.

7. Acknowledgment

The authors would like to thank all the writers who contributed to the creation of the AHDB/FTR database.

References

[1] J. J. Hull, “A Database for Handwritten Text Recognition Research,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1994,
vol. 16, pp. 550–554.
[2] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 1998,
vol. 86(11), pp. 2278–2324.
[3] T. Saito, H. Yamada and K. Yamamoto, “On the database ETL9 of handprinted characters in JIS Chinese characters and its analysis (in
Japanese),” Transaction of IECEJ, 1985, vol. J.68-D(4), pp. 757–764.
[4] U. Bhattacharya and B. B. Chaudhuri, “Databases for Research on Recognition of Handwritten Characters of Indian Scripts,” International
Conference of Document Analysis and Recognition, 2005, pp. 789–793.
[5] S. Mihov, K. U. Schulz, C. Ringlstetter, V. Dojchinova, V. Nakaova, K. Kalpakchieva, O. Gerasimov, A. Gotscharek and C. Gercke, “A
Corpus for Comparative Evaluation of OCR Software and Postcorrection Techniques,” Proceeding of International Conference of Document
Analysis and Recognition, 2005, pp. 162–166.
[6] S. Al-Ma’adeed, D. Elliman and C. A. Higgins, “A Data Base for Arabic Handwritten Text Recognition Research,” Proceeding of Eighth
International Workshop on Frontiers in Handwriting Recognition, 2002, pp. 485–489.
[7] M. Pechwitz, S. S. Maddouri, V. Maergner, N. Ellouze and H. Amiri, “IFN/ENIT - database of handwritten Arabic words,” Proceeding of
Colloque International Francophone sur l’Écrit et le Document, 2002, pp. 129–136.
[8] A. Abdulkader, “Two-Tier Approach for Arabic Offline Handwriting Recognition,” Tenth International Workshop on Frontiers in
Handwriting Recognition, 2006.
