You are on page 1of 13

This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 1

Line and Ligature Segmentation of Urdu


Nastaleeq Text
Ibrar Ahmad, Xiaojie Wang, Ruifan Li, Manzoor Ahmed, RahatUllah

 languages (weber1) of world. The number of research work is


Abstract— The recognition accuracy of ligature based Urdu ongoing to preserve its literature. Urdu is an Arabic script
language Optical Character Recognition systems highly language, mostly written in Nastaleeq writing style. Nastaleeq
depends on the accuracy of segmentation that converts Urdu writing style is also followed in many other languages
text into lines and ligatures. Generally, Lines and ligatures including Persian, Pashto, Panjabi, Baluchi, and Siraiki [1],
based Urdu language OCRs are more successful as compared to [2]. Successful Optical Character Recognition (OCR) for
the characters based. This paper presents the techniques for
Nastaleeq text therefore becomes imperative.
segmenting Urdu Nastaleeq text images into lines and
subsequently to ligatures. Classical horizontal projection based Generally, Urdu language OCR systems can be divided into
segmentation method is augmented with a propose Curved- three categories based upon text’s segmentation, which
Line-Split algorithm for successfully overcoming the problems includes character based [4], [10-17], ligature based [3], [18],
such as text line split position, overlapping, merged ligatures
[30], [31] and sentence based systems [20-23].
and ligatures crossing line split positions. Ligature
segmentation algorithm extracts connected components from Currently, researchers are focusing more on ligature and
text lines, categorizes them into primary and secondary classes, sentence based OCRs. All these OCR systems mainly rely on
and allocates secondary components to primary by examining of document images’ segmentation into text lines, ligatures
width, height, coordinates, overlapping, centroids and baseline
and characters. Complexities of Urdu Nastaleeq writing style
information. The propose line segmentation algorithm is tested
on 47 pages with 99.33% accuracy. The propose ligature
are the major obstacles for its segmentation into lines and
segmentation algorithm is mainly tested on a large Urdu ligatures. Major complexities of Nastaleeq writing style
Printed Text Images dataset. The propose algorithm segmented include context sensitiveness, compactness, overlapping,
Urdu Printed Text Images dataset to 189 thousand ligatures cursive nature, diagonality, and many others [1] as presented
from 10063 text lines having 332 thousand connected in Fig. 1.
components. A total of about 142 thousand secondary
components have been successfully allocated to more than 189 Nastaleeq accommodates more characters in less space
thousand primary ligatures with accuracy rate of 99.80%. because of its diagonality and cursive nature [1]. Characters
Thus, both the propose segmentation algorithms outperform are stacked from right to left and top to bottom into a word.
the existing algorithms employed for Urdu Nastaleeq text For instance, the starting character of each ligature in Fig. 2 is
segmentation. Moreover, the propose line segmentation the same yet different in shape [4]. In case of Urdu Nastaleeq
algorithm is also tested on Arabic for which it also extracted writing style, the number of possible shapes of certain
lines correctly. characters reach to 60 [5], [6]. Therefore, it is very difficult to
segment text in Nastaleeq writing style due to large number of
Index Terms— Urdu text segmentation, Nastaleeq script character shapes, inter-ligature and intra-ligature overlaps, and
segmentation, line and ligature segmentation, preprocessing context sensitivity of text [2].
I. INTRODUCTION Furthermore, in Nastaleeq writing style, there is no
consistency in spacing among the words, ligatures and
U RDU is the national language of Pakistan, and is widely
spoken and understood in India and other Southern Asian
countries as well. It is one of the top ten influential
sentences. Some ligatures of sentences intrude the boundaries
of other sentences that are written above or below lines, as
presented in Fig. 3
———————————————— In the first line of the paragraph, the character “ ” is
 I. Ahmad is with Center for Intelligence of Science and Technology (CIST), connected with the word “ ” of second line. Besides, ligature
Beijing University of Posts and Telecommunications, China.
Email:toibrar@yahoo.com “ ” of third line is entering in the boundary of second line.
 X. Wang is with Center for Intelligence of Science and Technology (CIST), Also, ligature “ ” of third line is crossing the border of fourth
Beijing University of Posts and Telecommunications, China. Email:
xjwang@bupt.edu.cn
line. Thus, there exists no fixed inter ligature, inter words and
 Ruifan Li is with Center for Intelligence of Science and Technology (CIST), inter sentences gap as demonstrated in Fig. 3.
Beijing University of Posts and Telecommunications, China.
Email:rfli@bupt.edu.cn
 Manzoor Ahmed is with Department of Electronic engineering, FIB Lab,
Tsinghua University, China. Email: manzoor.achakzai@gmail.com. 1
http://www2.ignatius.edu/faculty/turner/languages.htm(accessed
 RahatUllah is with Nanjing University of Information Sciences and
10/05/2016, 4:25am)http: 4:25am)
Technology, China. Email: r.rahatullah@gmail.com

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 2

Fig. 1. Complexities of Nastaleeq writing styles

non-equispaced lines and non-availability of clear


demarcation.
Geometric algorithms for layout [27] were adopted and
tailored for getting the correct layout of Urdu Nastaleeq
documents [28]. The authors achieved accuracy of text line
detection ranging from 72% to 93% for different genres of
document images. Ridge based approach [29], [33], [34], [35]
Fig. 2. Different shapes of same starting character of words in was applied for text line extraction, while the method in [28]
Nastaleeq was adopted for reading order. The achieved accuracy for
Arabic and Urdu document images were 96% and 92%,
A. Previous Work respectively. Although these methods are accurate but lacks
An overview of contributions in the area of Urdu line and proper allocation of misplaced diacritics/dots.
ligature segmentation is briefed in the following subsections. Zone based global horizontal projection method [8] was
1) Line Segmentation proposed for segmenting Nastaleeq’s Urdu text lines. The
Text lines extraction is a very crucial step in the segmenting method segmented document images containing touching
procedure. The algorithms applied for Urdu page ligatures across lines with the accuracy of 99.11%. In this
segmentation into lines include horizontal and vertical method, there was no zone mentioned above the zone 1, as
projection, x-y cut, smearing, geometric, ridge based, and depicted in Fig. 4(a). Therefore, it can miss the diacritics of
zone-based horizontal projection algorithms. the text line that appear above zone 1, as presented in Fig. 4
(b).
The pioneer work [4] on Urdu OCR used horizontal
projection profile method for text lines extraction, followed
by component labeling and vertical projection profile methods
for character extraction. Horizontal and vertical projection
methods applied on scanned document image for text lines
and character segmentation, respectively [7], [25]. Although,
horizontal projection segmented the text lines correctly but it
can’t segment lines with merged ligatures.
Recursive XY cut, whitespace Analysis, constrained text- a) Zone based line b) Diacritics/dots above zone 1
line detection, docstrum, voronoi-diagram based algorithm, segmentation[8]
and smearing algorithm used [26] on five scripts including Fig. 4. Zone based Urdu Nastaleeq text line segmentation
Nastaleeq were segmented into text lines, and the highest In Urdu Nastaleeq, it is possible to have some
accuracy reported for Urdu was 65.4%. The results are not diacritics/dots above the zone 1(i.e; main text). So, there are
promising for Urdu text, because of its overlapping nature, possibilities of assigning diacritics/dots inaccurately, as
depicted in Fig. 4(b). Implicitly, three zones are suggested [7]
as opposed to handling 4 zones methodology for page
segmentation as shown in Fig. 5.

Fig. 3. A paragraph of Nastaleeq writing style and its


complexities Fig. 5. Zone above zone 1 [9]

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 3

This strategy takes into consideration the dots/diacritics information. Relying more on zonal and heuristics
even above zone 1. A recent study [9] also used zone based information for line and ligature segmentation becomes very
method [8] with dilating document images before much context and font dependent. Lines are segmented with
segmentation for getting more correct size of zones, which perfection [28], [29], but misplaced dots/diacritics are not
resulted in 98.7% line segmentation accuracy. This method allocated to their respective lines. Thus, our motivation is to
confronts difficulties for segmenting merged text lines. reduce the complex mechanism of zoning and their merging
decisions for line segmentation. In our approach, we rely
2) Ligature Segmentation
mainly on baseline information with three zonal method for
A ligature is composed of one or more characters joined
dots/diacritics allocation, and quantitative values (width,
together [8], and the combination of one or more ligatures
height, centroids, and overlapping) for ligature segmentation.
form words in Arabic script (Like Urdu language). A ligature
is comprised of two parts; the primary component is made up C. Contribution
of main stroke, while diacritics are nominated as secondary In this paper for Urdu Nastaleeq text, we present two
strokes as presented in Fig. 6. algorithms for line and ligature segmentation. These
algorithms can also be applied to other Nastaleeq based
scripted languages, such as Arabic, Persian, Pashto, and
Sindhi etc. Specifically, our proposed line segmentation
algorithm depends solely on horizontal projection, split
positions (propose in this paper and named as Curved Line
Split (CLS) algorithm) and base line information. In proposed
approach, the allocation of dots/diacritics to proper text line is
properly addressed by employing three-zonal method. For
ligature segmentation, the propose algorithm employs width,
height, centroids, and overlapping quantitative information as
opposed to heuristics [8], [9]. The accuracies achieved by
Fig. 6. Primary and secondary components of two ligatures
both the propose line and ligature segmentation algorithms are
Considerable efforts were devoted for line segmentation 99.33% and 99.08%, respectively, which are better than the
into ligatures like vertical histogram method [20] and [27], previous contributions [8], [9], [29].
horizontal projection method [4], and bounding box method The remaining paper is organized as follows. In Section 2,
[8], [9], [18], [28], [31]. line segmentation algorithm is discussed. While, in Section 3
Individual ligatures extracted by vertical histogram of text we discuss the ligature extraction algorithm. In Section 4,
line in Naskh as described in [20] and [27], but this method is results are analyzed for evaluating the performance of propose
not applicable for Nastaleeq because of inter-ligature algorithms. Finally, Section 5 includes the concluding
overlapping. In another work [4], vertical projection was remarks.
applied for getting ligatures and character on extracted lines
by horizontal projection method. Ligatures were also II. LINE SEGMENTATION
segmented by connected component method supplemented This section details about segmentation algorithm of a page
with baseline and centroid information, while achieving 94% into constituents text lines, which employs global horizontal
segmentation accuracy [31]. Connected components were projection method augmented with propose CLS algorithm.
extracted by bounding box methods from text lines [18] and Additionally, it contains steps for conflicts resolution
[28], and then secondary components were allocated to
regarding components crossing split lines based upon baseline
primary components based on the detected baseline. Text
information.
lines were segmented into connected components that were
categorized into 6 types [8], and then secondary components A. Curved Line Split Algorithm
were merged together with primary components to get the The propose CLS algorithm can even find out a curved (not
resultant ligatures based upon certain heuristics, thus, straight) demarcation line between every two consecutive text
achieving 99.02% accuracy. Another study [9] first segmented lines, when there is no straight demarcation line. The basic
lines into connected components, and then allocated idea of CLS algorithm is as follows:
secondary components to primary components with 92.5%
accuracy. Basic Idea: Algorithm starts a demarcation line from
middle of baselines of consecutive text lines. It traverses a
B. Motivation path from left to right on horizontal path, if it finds text pixels
The most recent line and ligature segmenting algorithms it moves up or down vertically until it finds a text pixel or
[8], [9] proposed the use of 4 zones for line segmentation, and baseline, and then from half of the vertical movement it starts
6 categories of connected components for constructing a final again horizontal traversal in between two lines.
ligature. They did not utilize the baseline information that is
proved to be very much decisive in our propose line The procedure of CLS is presented in Figs. 7 and 8. The
segmentation algorithm. Also, ligature segmentation [8] relies CLS algorithm detail is given as Algorithm 1.
on 6 heuristics, among which only one heuristic used baseline

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 4

B. Line Segmentation Algorithm


Line segmentation algorithm is applied on Urdu Nastaleeq
document image for getting segmented text lines. Classical
horizontal projection based segmentation method is
(a) Working of CLS algorithm (b) Demarcation line, output augmented with a propose CLS algorithm. This augmentation
of CLS algorithm
Fig. 7. Application of CLS algorithm and its output with demarcation
successfully overcomes the problems, such as text line split
line position, overlapping, merged ligatures and ligatures crossing
line split positions. Horizontal line segmentation algorithms
[4], [7], [8], [9] can split two text lines if there is a clear
horizontal demarcation. The CLS algorithm overcomes this
flaw and can find a demarcation boundary in case of non-
equispaced text lines and non-availability of clear
demarcation. The main steps are described as follows:
1) Find Line-Peaks
In this step, an image of Urdu Nastaleeq script 𝐼𝑈 converted
into binary image 𝐼𝑈𝐵 by global thresholding method. This

binarized image is inverted for getting an image 𝐼𝐻×𝑊 with
white text on black background. Then horizontal projection
(HP) is calculated for the entire document.

HP(x) = ∑𝑊 𝑦=1 𝐼 (𝑥, 𝑦), ∀𝑥 ∈ [1, 𝐻] (1)
Find all peaks and troughs of horizontal projection.
Peaks: ∀𝑥 𝑓𝑖𝑛𝑑 𝑥𝑖 𝑠. 𝑡 𝑥𝑖−1 < 𝑥𝑖 > 𝑥𝑖+1 (2)
Troughs ∶ ∀𝑥 𝑓𝑖𝑛𝑑 𝑥′𝑖 𝑠. 𝑡 𝑥′𝑖−1 > 𝑥′𝑖 < 𝑥′𝑖+1 (3)
Fig. 8. Improper segmentations encircled red in (a) [8], (c) [29] First of all peaks are scanned to find out all local highest
are solved by the propose CLS algorithm encircled green in (b) valued peaks. Generally, such peaks start with a lowest peak
and (d).
value after a trough followed by step-wise higher valued

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 5

peaks until we get a local highest valued peak. Then the value For visualization of this concept, four cases are briefed as
of next coming peaks gradually decrease until it reaches follows:
another trough. Such local highest valued peaks are termed as Case 1: Crossing component touching one base line
Line-Peaks, which are accompanied by gradual lower valued Move the crossing component to the text line where it is
peaks on both sides representing one line. Line-Peaks touching the baseline.
represent center/base with maximum pixel intensity of text Case 2: Crossing component not touching any base line
lines in a page image. a) Here, find the top row, bottom row, left most column and
right most column numbers for under consideration crossing
Line − Peaks: ∀𝑥 𝑓𝑖𝑛𝑑 𝑎𝑙𝑙 𝑥 ′′ 𝑐 𝑠. 𝑡 component.
𝑥𝑖−1 < 𝑥 ′′ 𝑐 > 𝑥𝑖+1 (4) b) Next, start search by moving above from the top row and
below from the bottom row while checking every pixel from
2) Average Text Line Height left most column to right most column for finding the nearest
Next, we find the average number of pixels’ rows per text text pixel of any other component.
line. c) Move this crossing component to the corresponding side
|𝑥|
Average Text Line(TL)Height: 𝐴𝑣𝑔𝐻 (𝑇𝐿) = ′′ , where text pixel is found earlier. In case of tie, move
|𝑥 |
component to the side where search found component is
Where, 𝐴𝑣𝑔𝐻 (𝑇𝐿) 𝑖𝑠 Average Text Line(TL)Height, touching base line. If both are not touching the baseline, then
|𝑥| 𝑟𝑒𝑝𝑟𝑒𝑠𝑒𝑛𝑡𝑠 𝑡𝑜𝑡𝑎𝑙 𝑟𝑜𝑤𝑠 𝑜𝑓 𝑝𝑖𝑥𝑒𝑙𝑠, perform step a, b and c for both newly found components, but
search is performed in the same direction where these new
𝑥 ′′ 𝑟𝑒𝑝𝑟𝑒𝑠𝑒𝑛𝑡𝑠 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝐿𝑖𝑛𝑒_𝑃𝑒𝑎𝑘𝑠 , (5) components are found.
3) False Text Lines Case 3: Connected component touching neither base lines
Once again scan the peaks, if lowest troughs of some Line- nor split line
Peak come before ¾th of the average text line height then that Some diacritics/dots might neither crossing any base nor
line is considered as false Line-Peak and excluded from total split line. Such components might be segmented in wrong text
Line-Peaks. line. Therefore, scan every text line for any text component
∀𝑥 ′′ 𝑐 𝑓𝑖𝑛𝑑: 𝑎 = |𝑥 ′′ 𝑐𝑖−1 − 𝑥 ′′ 𝑐𝑖−2 |, measuring approximately ¼th height of each line. These areas
are referred as upper and lower diacritics/dots zones as
b= |𝑥 ′′ 𝑐𝑖 − 𝑥 ′′ 𝑐𝑖−1 |, presented in Fig. 9.
There may be possibilities of attaching diacritics to wrong
c= |𝑥 ′′ 𝑐𝑖+1 − 𝑥 ′′ 𝑐𝑖 |,
lines, most probably in upper and lower diacritics/dots zones
3
𝑖𝑓 𝑐 < 𝐴𝑣𝑔𝐻 (𝑇𝐿) ∧ 𝑎 < 𝑐, 𝑎 = 𝑎 + 𝑏, of every text line except first and last ones. First text line may
4
𝐸𝑙𝑖𝑚𝑖𝑛𝑎𝑡𝑒 𝑥 ′′ 𝑐𝑖 𝑓𝑟𝑜𝑚 Line − Peaks (6) only have lower diacritics/dots zone, while last one may have
only upper diacritics/dots zone.
4) Missed Text Lines In these portions, if some component is found then it is
All Line-Peak values must be greater than the highest moved to the line where it has minimum distance from the
trough value; otherwise the text lines with Line-Peaks less neighboring component. This step can easily be performed as
than the highest trough will be missed. After conforming step follows:
3, we hypothetically increase the value of all such Line-Peaks
more than the highest trough value. In this way, we fix the
problem of missed lines.
Highest Trough ∶ 𝑚𝑎 𝑥 𝑥𝑇′ = H 𝑇 ∀𝑥 𝑓𝑖𝑛𝑑 𝑥𝑖′ 𝑠. 𝑡

𝑥𝑖−1 > 𝑥𝑖′ < 𝑥𝑖+1

] (7)
′′
Missed Text Lines ∶ ∀𝑥 𝑐𝑖 < H𝑇 ,
𝑥 ′′ 𝑐𝑖 = 𝑥 ′′ 𝑐𝑖 + (|H 𝑇 − 𝑥 ′′ 𝑐𝑖 |) (8)
5) Segment Page Image into Text Lines Fig. 9. Upper and lower diacritics/dots zones scanning
Segment page image into text lines: Pass initial split row
numbers to CLS algorithm. The initial candidate split row a) For every component found in upper or lower diacritics/dots
numbers are the lowest trough values in between every two zone, where:
Line-Peaks. Convert the text page into connected components i. The component must not touch baseline.
format, so that every component is represented by a different ii. The component must not be greater in size than its
number. Then split this page according to the coordinates for respective zone.
demarcation of lines returned by CLS algorithm. b) Find the top and bottom row of each component. Also, find
6) Assignment of Components the left and right most column numbers of its maximum
Assignment of misplaced and crossing components to span/width.
proper text line may easily be done with the information of c) Start searching above and below from the component’s top
baseline of every text line. Crossing components are those coordinates and bottom coordinates, respectively, for
connected components that are crossing the demarcation line. finding the nearest text pixel.

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 6

Move this component to line where text pixel is found


earlier. In case of tie, move the component to the line where
search finds the component that is touching base line. If both
are not touching baseline then for breaking the tie perform
step a, b and c for both newly found components but search is
performed in the same direction where these new components
were found.
Case 4: Crossing component touching both base lines
This situation comes along when two ligatures are
connected mistakenly. We need to find the position of the
joint of two ligatures with minimum number of pixel. Start
from the split position of text lines along the crossing
component above and below, and search for minimum pixel (a)
Fig. 11. False Line-Peak represents false line (b)
connectivity nearby split position. This search must not move Fig. 11. False Line-Peak represents false line (Steps 1-3)
more than one fourth of the height of a text line. If otherwise,
Step 4: All Line-Peaks less than the highest trough
split this component from the text line split position.
will be missed as last line of page is missed line.
Examples: Each step of the propose line segmentation
algorithm is explained by applying it on a complex text page
presented in [28]. Also, algorithm is applied other sample
pages in earlier works [8], [9], [29].

Fig. 12. Last line represents missed Line-Peak as its value 61


is less than highest trough value 71

Step 5: Initial lines split after demarcation by Curved


Line Split Algorithm

Fig. 10. Sample Urdu Nastaleeq pages (a) from [28], (b) from
[29], (c) from [8], (d) from [9]

Steps 1-3: Top most Line-Peak/base line in Fig. 11(a)


represents false Line-Peak that may result a false text line. As, (a) (b)
Fig. 11(b) represents that this Line-Peak and its corresponding Fig. 13. Steps 5, Fig. 13(a), odd and even numbered lines are
representing base and split lines, respectively. Initial
troughs last well before ¾th of average text line height, so it is segmented textlines in In Fig. 13(b).
eliminated from Line-Peaks. Furthermore, this eliminated
Line-Peak is merged with the smaller height neighbored line.

Resolution of 4 cases (step 6) is shown in Fig.14-16:

Fig. 15. Red encircled components of previous step are resolved in next step, encircled green, (a) to (b) crossing component touching one base line,
2169-3536 (c) crossing
(b) to (c) 2016 IEEE. Translations
component anddoes
contentnot
mining are permitted
touching for academic
any base research
line, (c) only.
to (d) Personal use isneither
component also permitted,
touchingbut republication/redistribution
base lines nor split lines, requires
(d)IEEE
to permission. See
(e) component
touching 2 baselines. http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 7

component. Our propose ligature segmentation algorithm


therefore has three steps: 1) The sentence images are split into
connected components and information is extracted.
2)Connected components are then categorized into primary
and secondary classes by examining width, height,
coordinates, overlapping and centroids along with the baseline
information of the sentence. 3) Secondary components are
associated to proper primary components to form ligatures.
The detailed explanation of these algorithms is provided as
under:
A. Ligature Segmentation Algorithm

Fig. 14. Four cases for resolution of conflicting components

B. Extract Information
Fig. 16. Example of crossing components, touching both base
lines

The propose line segmentation algorithm extracted all


constituent text lines with only two diacritics out of place
encircled red as indicated in Fig. 17.

C. Decide Primary and Secondary Components


This step classifies the connected components into primary
and secondary on the basis of overlapping, height, width,
Fig. 17. Segmented text lines by propose line algorithm centroids and baseline information. It is depicted in Algorithm
4.
Other results about line segmentations algorithm are
discussed in Section 4. D. Allocate Secondary Components to Primary Components
In this step, pixels of secondary components are moved to
III. LIGATURE SEGMENTATION
their respective primary. Firstly, all the non-conflicting (not
Ligature segmentation algorithm splits text lines into touching baseline) secondary components are moved to
constituent ligatures. A ligature is usually composed of 1 to 8 respective primary components. The conflicting secondary
characters forming one primary and zero or more secondary components (touching the base line) are decided based on
connected components. Main body of a ligature is known as their size and height. If size/height is less than the threshold
primary component that is made up longest stroke, primary size/height of secondary components, then these components
stroke. Secondary components are usually dots and diacritics, are declared as secondary and copied to the respective
as presented in Fig. 6. primary components, otherwise, these components are
The basic idea of our algorithm is to find all components declared as primary. It is depicted in Algorithm 5.
and then allocate secondary components to proper primary

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 8

The propose ligature segmentation algorithm is applied to


the sentences of UPTI dataset and the segmented sentences
presented in Section 2. For example, three sentences and
corresponding ligatures are depicted in Fig. 18.
(a)

(b)

(c)

(d)

(e)

(f)
Fig. 18. Total 27, 12 and 21resultant ligatures of above
displayed sentences taken from UPTI data set, [28] and [29],
respectively. First sentence of UPTI dataset and its segmented
ligatures are represented in (a) and (b). A sentence of a text
page given in [28] and its segmented ligatures are represented
in (c) and (d), respectively. Similarly, a sentence of a text page
given in [29] and its segmented ligatures are represented in (e)
and (f), respectively.

IV. RESULTS AND ANALYSIS


In this section, the results of propose line and ligature
segmentation algorithms are discussed. Results are compared
with the prevalent studies.

A. Results of Propose Line Segmentation Algorithm


The propose line segmentation algorithm is tested on 47
pages. Lines are segmented 100% for 7 pages taken from
earlier paper [8], [9], [28], [29]. Remaining 40 pages are taken
from single columned editorials of an online Urdu newspaper.

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 9

TABLE 1 initial segmentation, therefore, it necessitates to process these


RESULTS OF PROPOSE LINE SEGMENTATION ALGORITHM tasks later on.
Source Script Pages lines Correctly Accuracy Furthermore, red circled tasks in Fig. 20 (a) are still there to
Segmented be resolved after initial segmentation by [8], yet the propose
lines
[8] Urdu 1 11 11 100% algorithm has already resolved more than half of them
encircled as blue shown in Fig. 20(b).
[9] Urdu 2 28 28 100%
A recent line segmentation algorithm [9] also used the zone
[28] Urdu 1 11 11 100% based method as defined by [8] with additional processing of
[29] Urdu 2 16 16 100% dilating for better estimation of zone size.
[29] Arabic 1 9 9 100%
Newspaper Urdu 40 532 509 96%

Average Accuracy 99.33%

The propose line segmentation algorithm segmented line


46 pages with the accuracy of 99.33% that is better than
accuracies reported in prevalent work. Overall accuracy of
propose line segmentation algorithm is 99.33% as
presented in Table 1.
B. Comparison of Line Segmentation Algorithms
Line segmentation [8] draws a straight boundary line
between two text lines that may not always be straight. (a) (b)
Therefore, it needs to place many text components to proper Fig. 20. Final segmented text lines by [8] and propose line
text line later on. The propose line segmentation algorithm segmentation algorithm
can draw a straight as well as curved boundary. The results of Fig. 21(a) represents the initial segmentation process and
initial segmentation [8] on a merged text lines page is there is still need to resolve some overlapping issues later on.
depicted in Fig. 19(a). CLS algorithm of propose line On the other hand, output of the propose algorithm has
segmentation algorithm has applied on the same page and perfectly segmented the page into text lines and there is no
resultant initial segmented page is depicted in Fig. 19(b). need for further processing as depicted in Fig. 21(b).

a) Line segmentation by [9 ] line b) Initial line segmentation by


segmentation algorithm propose line segmentation
a) Line segmentation by [8 ] after b) Initial line segmentation by algorithm
segmentation of merged lines propose line segmentation Fig. 21. Comparison of output of [9] and propose line segmentation
algorithm algorithm
Fig. 19. Initial line segmentation by [8] and propose line segmentation A state of the art contribution [29] for layout of Urdu and
algorithm Arabic script that is actually an improvement of [28], which
The line segmentation [8] segments a text page into zones has depicted two very complex sample Urdu Nastaleeq pages
and then zones are segmented into lines by applying straight as shown in Fig. 22 (a) and (c).
split line. So, whenever CLS algorithms changes its path for The complexity of the pages may be observed by more
drawing boundary, actually it avoids the later processing tasks interlacing and merging of components of subsequent
as compared to line segmentation in [8]. For example, two lines. The average intensity of initial split line positions are
46 and 14 for first and second page, respectively, as
words and in line 9 and 11, respectively, are correctly presented in Table 2.
segmented by CLS algorithm but [8] is incapable of such

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 10

Fig. 22. Urdu Nastaleeq pages taken from [29] and segmented by propose line segmentation algorithm

The highest intensity values at split positions pose C. Results of Propose Ligature Segmentation Algorithm
maximum obstacles in drawing boundary, whereas, a UPTI (Urdu Printed Text Images) dataset [18] is used for
straight line can demarcate a boundary in between two text checking the segmentation accuracy of the proposed
lines in case of minimum intensity value (i.e. 0). The segmentation algorithms. This dataset contains images of
propose line segmentation algorithm is applied on these 10063 sentences. Four degradation techniques that were
two complex pages, having extracted text lines accurately proposed in [32] are applied, which includes jitter, elastic
with only one diacritic out of place encircled red as shown elongation, threshold and sensitivity. Every degradation
in Fig. 22(b). Furthermore, some encircled dots/diacritics technique is applied with four different parameter values.
are misplaced by algorithm of [29] depicted in Fig.22 (e). Overall, UPTI contains 12 degradation versions of original
10063 sentence images.
TABLE 2 The sentence images of UPTI dataset from un-degraded
INITIAL SPLIT ROW NUMBERS version is segmented into ligatures as presented in Table 3.
Table 2(a) Initial split for Fig. Table 2(b) Initial split for Fig. There are total of 189584 ligatures as per split of ground
22(a) 22(c) truths of UPTI dataset. Accuracy of the implemented
Lines Intensity Split Lines Intensity Split
algorithms is even better than the best ligature segmentation
# Value Row # Value Row
Number Number
accuracy reported in [8] i.e. 99.02%.
1&2 42 93 1&2 3 61
2&3 83 164 2&3 19 120
3&4 82 248 3&4 27 178
4&5 79 325 4&5 24 224 Fig. 24. Wrongly connected components of ligatures
5&6 45 390 5&6 14 273
UPTI dataset contains incorrectly merged ligatures that
6&7 36 483 6&7 24 328
badly affect the accuracy, because the corresponding ligatures
7&8 10 383 in ground truths are not properly connected. Therefore, count
8&9 17 433 of image ligatures becomes less as compared to ligatures
segmented from text lines of ground truth, resulting in low
The propose line segmentation algorithm is also applied on accuracy. Examples of these incorrectly connected dots,
an Arabic language page provided in [29]. Our line diacritics and ligatures are depicted in Fig. 24.
segmentation algorithm misplaced only one diacritic, while
line segmentation algorithm of [29] misplaced 5 diacritics as First and second part of Fig. 24
shown in Figs. 23(b) and 23(c), respectively. Lines are represent two ligatures connected incorrectly. Fourth part
segmented by both algorithms with 100% accuracy.
consists of all secondary components of third part

Fig. 23. Arabic Nastaleeq page taken from [29] and segmented by propose line segmentation algorithm

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 11

TABLE 3
STATISTICS ABOUT COMPONENTS OF UPTI DATA SET
Primary Components Secondary Components Total Ligatures Segmentation
Components Segmented Accuracy
Touching Not Touching Not
Base Line Touching Base Touching
Base Line Line Base Line
UPTI Data Set
175431 13754 18253 126387 333825 189185 99.80%
Undegraded

approximately 1.7 connected components. Ligature


segmentation algorithm for sample first sentence of UPTI
data set observed statistics in Table 4.

TABLE 4
STATISTICS ABOUT COMPONENTS OF FIRST SENTENCE
OF UPTI DATA SET
Primary Secondary Total Ligatures
Comp. Comp. Comp. Segmented
Touching Touching
Baseline Baseline
Yes No Yes No
21 6 4 20 51 27

Fig. 25. Percentage statistics about components of UPTI data


Table 5 presents the results of segmentation algorithms and
set their accuracy along with the sizes of datasets used. Primary
components that are touching the baseline and their sizes
and represents six dots connected incorrectly. The greater than the sizes of primary components threshold are
overall size of these connected dots becomes more than the considered as primary components without confusion and the
maximum threshold size of a secondary component, also it is rest are ambiguous. Similarly, the secondary components
touching the baseline. Therefore, ligature segmentation conflicting (confusing) and non-conflicting (clear) are
algorithm declares it as a primary component. explained in Algorithm 5.
Javed et al. [7] used total 3655 ligatures, having 3375
The propose ligature segmentation algorithm is more correctly segmented ligatures with 92% accuracy, as
accurate as compared to the prevalent ligature segmentation presented in first row of Table 5. Israr et al. [9] tested their
algorithms. The propose algorithm is tested for segmenting algorithm on approximately 7364 ligatures with 99.49%
four different versions of the UPTI datasets. It is very accuracy. Lehal [8] tested his algorithm on 23555 ligatures
interesting that the overall secondary components are less than with 99.02% accuracy. Last row of Table 5 depicts the
the primary components. Further, the ratio of primary to evaluation of our algorithm on first version of UPTI dataset,
secondary components is about 7:5. as presented in Table 3. Also, 10063 text lines consisting of
There are total of 189,584 ligatures as per split of ground approximately 333 thousand connected components from
truths of UPTI dataset. Ground truths don’t have any incorrect every subset of data set were extracted. Furthermore, total of
connectivity, so, in this way UPTI dataset contains 189,584 about 143 thousand secondary components were allocated to
ligatures and our algorithm finds the ligatures as per approximately 189 thousand primary ligatures with the
mentioned in Table 3. accuracy rate of 99.80%.
Table 3 indicates that about 189 thousand ligatures are The proposed algorithm is better than the existing state of
successfully extracted from 10063 text lines consisting of the art algorithms. The best ligature segmentation algorithm
approximately 333 thousand connected components from [8], [9] are using six heuristics for segmentation into ligatures.
UPTI data set. Furthermore, total of about 143 thousand Instead of heuristics, we are using quantitative values (width,
secondary components are allocated to approximately 189 height, centroids and baseline touching) for segmenting lines
thousand primary ligatures with an accuracy rate of 99.80%. into ligatures.
Fig. 25 depicts the percentage statistics of ligature
segmentation algorithm for UPTI dataset. The primary V. CONCLUSION
components touching the baseline are approximately 93%. The ongoing research work on line and ligature based Urdu
Similarly, the secondary components that are not touching Nastaleeq recognition systems motivated us to put forward
the baseline are approximately 87%, while the rest are not more accurate segmentation algorithms for Urdu Nastaleeq
touching the base line. Besides, every ligature consists of document images. This proposal mainly introduces two

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 12

TABLE 5
COMPARISON OF URDU LANGUAGE LIGATURE SEGMENTATION ALGORITHMS
Ligature Primary Components Secondary Components
segmentation No With No With Total
Algorithm Total confusion confusion Total confusion confusion Comp. Ligatures Accuracy
[7] Javed et Corr./total
al 3375/3655 92%
[9] Israr et
al 6811/7364 99.49%
[8] Lehal 23555 18886 17565 12805 42441 23555 99.02%
This paper 189232 176353 12879 143260 18801 124459 332492 189232 99.80%

algorithms for line and ligature segmentation of Nastaleeq text [5] D. A. Satti and K. Saleem, “Complexities and implementation challenges
in offline Urdu Nastaleeq OCR,” in Proceedings of the Conference on
images. The proposed line segmentation algorithm places dots
Language & Technology, pp. 85–91, 2012.
and diacritics more accurately as compared to [28], [29].
Prevailing work [8], [9] relied more on zonal information and [6] S. Hussain, “Complexity of Asian writing systems: a case study of nafees
heuristics for line and ligature segmentation, respectively. In nasta’leeq for Urdu,” in Proceedings of the 12th AMIC Annual Conference
on e-Worlds: Governments, Business and Civil Society, Asian Media
our approach, we rely more on baseline for decision making
Information Center, Singapore, 2003.
as opposed to zonal information for line segmentation. Also,
proposed algorithm employs baseline, width, height, [7] S. T. Javed and S. Hussain, “Improving nastalique specific pre-recognition
centroids, and overlapping quantitative information for process for Urdu OCR,” in Multitopic Conference, 2009. INMIC 2009. IEEE
13th International, pp. 1–6, IEEE, 2009.
ligature segmentation algorithms. Furthermore, we tested our
ligature segmentation algorithms on a very large dataset with [8] G. S. Lehal, “Ligature segmentation for Urdu ocr,” in Document Analysis
improved accuracy as presented in Table 3. On average, for and Recognition (ICDAR), 2013 12th International Conference on,pp. 1130–
UPTI dataset, the propose algorithm segmented 189 thousand 1134, IEEE, 2013.
ligatures from 10063 text lines with 332 thousand connected
[9] I. U. Din, Z. Malik, I. Siddiqi, and S. Khalid, “Line and ligature
components. A total of about 142 thousand secondary segmentation in printed urdu document images,” J. Appl. Environ. Biol. Sci,
components have been successfully allocated to more than vol. 6, no. 3S, pp. 114–120, 2016.
189 thousand primary ligatures. The results of our line and
[10] I. Shamsher, Z. Ahmad, J. K. Orakzai, and A. Adnan, “Ocr for printed
ligature segmentation algorithms with accuracy of 99.33%
Urdu script using feed forward neural network,” in Proceedings of World
and 99.80% respectively, are thus, better than the precision Academy of Science, Engineering and Technology, vol. 23, pp. 172–175,
realized by the prevailing Urdu Nastaleeq segmentation 2007.
algorithms.
[11] I. K. Pathan, A. A. Ali, and R. Ramteke, “Recognition of offline
This work can be extended easily for other Nastaleeq script
handwritten isolated Urdu character,” Advances in Computational Research,
based languages like Persian, Pashto, Siraiki, and Panjabi etc. vol. 4, no. 1, 2012.

[12] J. Tariq, U. Nauman, and M. U. Naru, “Softconverter: A novel approach


ACKNOWLEDGMENT to construct OCR for printed Urdu isolated characters,” in Computer
The financial supports from National Natural Science Engineering and Technology (ICCET), 2010 2nd International Conference
on, vol. 3, pp. V3–495, IEEE, 2010.
Foundation of China (Project No. 61273365) and 111 Project
(No. B08004) are gratefully acknowledged. [13] Q. U. A. Akram, S. Hussain, and Z. Habib, “Font size independent ocr
for noori nastaleeq,” 2010.
REFERENCES
[14] K. Khan, R. Ullah, N. A. Khan, and K. Naveed, “Urdu character
[1] A. Daud, W. Khan, and D. Che, “Urdu language processing: a survey, recognition using principal component analysis,” International Journal of
“Artificial Intelligence Review, pp. 1–33, 2016. Computer Applications, vol. 60, no. 11, 2012.

[2] S. Naz, K. Hayat, M. I. Razzak, M. W. Anwar, S. A. Madani, and S. [15] Z. Ahmad, J. K. Orakzai, I. Shamsher, and A. Adnan, “Urdu Nastaleeq
U.Khan, “The optical character recognition of urdu-like cursive scripts,” optical character recognition,” in Proceedings of world academy of science,
Pattern Recognition, vol. 47, no. 3, pp. 1229–1248, 2014. engineering and technology, vol. 26, pp. 249–252, Citeseer, 2007.

[3] G. S. Lehal and A. Rana, “Recognition of nastalique Urdu ligatures, “in [16] T. Nawaz, S. Naqvi, H. ur Rehman, and A. Faiz, “Optical character
Proceedings of the 4th International Workshop on Multilingual OCR, pp. 7, recognition system for urdu (naskh font) using pattern matching technique,”
ACM, 2013 International Journal of Image Processing (IJIP), vol. 3, no. 3, p. 92, 2009.

[4] U. Pal and A. Sarkar, “Recognition of printed Urdu script.,” in ICDAR, [17] S. Hussain, S. Ali, et al., “Nastalique segmentation-based approach for
vol. 2003, pp. 1183–1187, 2003. urdu ocr,” International Journal on Document Analysis and Recognition
(IJDAR), vol. 18, no. 4, pp. 357–374, 2015.

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2703155, IEEE Access

IEEE Access-2017-01632 13

[18] N. Sabbour and F. Shafait, “A segmentation-free approach to Arabic and [27] T. M. Breuel, “Two geometric algorithms for layout analysis,” in
Urdu ocr.” in DRR, 2013. International workshop on document analysis systems, pp. 188–199,
Springer,2002.
[19] S. T. Javed and S. Hussain, “Segmentation based Urdu nastalique OCR,”
in Iberoamerican Congress on Pattern Recognition, pp. 41–49, Springer, [28] F. Shafait, A. Hasan, D. Keysers, and T. M. Breuel, “Layout analysis of
2013. Urdu document images,” in 10th IEEE Int. Multitopic Conference,
INMIC’06, Islamabad, Pakistan, Dec. 2006
[20] A. Ul-Hasan, S. B. Ahmed, F. Rashid, F. Shafait, and T. M. Breuel,
“Offline printed urdu nastaleeq script recognition with bidirectional lstm [29] S. S. Bukhari, F. Shafait, and T. M. Breuel, “High performance layout
networks,” in Document Analysis and Recognition (ICDAR), 2013 12th analysis of arabic and urdu document images,” in Document Analysis and
International Conference on, pp. 1061–1065, IEEE, 2013. Recognition (ICDAR), 2011 International Conference on, pp. 1275–1279,
IEEE, 2011.
[21] S. Naz, A. I. Umar, R. Ahmad, S. B. Ahmed, S. H. Shirazi, and M. I.
Razzak, “Urdu nastaliq text recognition system based on multidimensional [30] S. A. Husain, “A multi-tier holistic approach for urdu nastaliq
recurrent neural network and statistical features,” Neural Computing and recognition,” in Multi Topic Conference, 2002. Abstracts. INMIC 2002.
Applications, pp. 1–13, 2015. International, pp. 84, IEEE, 2002.

[22] S. Naz, A. I. Umar, R. Ahmad, S. B. Ahmed, S. H. Shirazi, I. Siddiqi, [31] S. T. Javed, S. Hussain, A. Maqbool, S. Asloob, S. Jamil, and H. Moin,
and M. I. Razzak, “Offline cursive urdu-nastaliq script recognition using “Segmentation free nastalique urdu ocr,” World Academy of Science,
multidimensional recurrent neural networks,” Neurocomputing, vol. 177,pp. Engineering and Technology, vol. 46, pp. 456–461, 2010.
228–241, 2016.
[32] H. S. Baird, “Document image defect models,” in Structured Document
[23] S. Naz, S. B. Ahmed, R. Ahmad, and M. I. Razzak, “Zoning features Image Analysis, pp. 546–556, Springer, 1992.
and 2DLSTM for Urdu text-line recognition,”. vol. 96, pp. 16-22, 2016.
[33] S. S. Bukhari, F. Shafait, and T. M. Breuel, “Towards generic text-line
[24] I. Ahmad, X. Wang, R. Li, and S. Rasheed, “Offline urdu Nastaleeq extraction,” in Document Analysis and Recognition (ICDAR), 2013 12 th
optical character recognition based on stacked denoising autoencoder,” China International Conference on, pp. 748–752, IEEE, 2013.
Communications, vol. 14, no. 1, pp. 146–157, 2017.
[34] S. S. Bukhari, F. Shafait, and T. M. Breuel, “Text-line extraction using a
[25] H. Malik and M. A. Fahiem, “Segmentation of printed urdu scripts using convolution of isotropic gaussian filter with a set of line filters,” in Document
structural features,” in Visualisation, 2009. VIZ’09. Second International Analysis and Recognition (ICDAR), 2011 Inte national Conference on, pp.
Confe rence in, pp. 191–195, IEEE, 2009. 579–583, IEEE, 2011.

[26] K. S. Kumar, S. Kumar, and C. Jawahar, “On segmentation of [35] S. S. Bukhari, T. M. Breuel, and F. Shafait, “Textline information
documents in complex scripts,” in Document Analysis and Recognition, extraction from grayscale camera-captured document images,” in Image
2007. ICDAR 2007. Ninth International Conference on, vol. 2, pp. 1243– Processing (ICIP), 2009 16th IEEE International Conference on, pp. 2013–
1247, IEEE, 2007. 2016, IEEE, 2009.

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

You might also like