
A New Method for the Separation of Farsi and Arabic Sub-words Using Image Processing Techniques


Parisa Shirvani (Author)
Department of Electrical Engineering
University of Semnan
Semnan, Iran
shirvani.parisa@gmail.com

Mehrdad Vatankhah Khouzani (Author)
Faculty of Arts, Computing, Engineering and Science (ACES)
Sheffield Hallam University
Sheffield, United Kingdom
m.vatankhah@yahoo.com

Abstract—Separating letters and word units is one of the most important parts of text recognition algorithms. In the Farsi language, these units consist of single letters and groups of connected letters, which are called "sub-words". The separation of these units therefore plays a major role in the development of text processing algorithms. In this paper, a highly accurate method based on connected-component labeling techniques is proposed that makes the separation of letters and sub-words possible for Farsi fonts of any size. Experiments show an accuracy of more than 90 percent for this method.

Keywords–recognition, connected-component labeling, sub-word separation, image processing.

I. INTRODUCTION

TEXT recognition is an area that has attracted much attention in recent decades, and sub-word separation is one of its most important stages. In Farsi and Arabic texts, words are built by combining letters. To improve the speed and accuracy of recognition, accuracy in the separation step is therefore critical. Because of the special characteristics of the Farsi and Arabic scripts, separation in these two languages is more difficult than in Latin-script languages such as English.

Text recognition methods can be divided into three categories: 1) recognition based on separation, 2) recognition based on overall shape, and 3) a combination of the two. In the first method, sub-words are divided into their components, which are called letters, and recognition is then performed. In the second method, the whole sub-word is treated as an image and then recognized. The third method uses a combination of the previous two. Separation-based recognition is the more popular approach, and a great deal of research has been done on it for the Farsi and Arabic languages. Recognition based on overall shape has many problems, although in some cases it has advantages over separation-based recognition.

In this paper, a connected-component labeling method is used to separate sub-words for shape-based recognition. Labeling connected components in binary images is one of the most important and fundamental steps in pattern recognition. Labeling algorithms are divided into three categories: 1) one-pass algorithms, 2) two-pass algorithms, and 3) multi-pass algorithms.

In one-pass algorithms, the image is scanned only once to find unlabeled pixels, and the label assigned to such a pixel is then assigned to all other pixels of the same object. This scan is usually carried out with an irregular processing pattern that follows no fixed rule. In practice, this rule-less process results in poor performance, although some solutions to this problem have been suggested.

All two-pass algorithms consist of three steps: 1) scanning and basic labeling, in which the image is scanned to assign preliminary labels; 2) information analysis, in which the information from the first step is analyzed to determine the final labels; and 3) final labeling, in which the image is scanned again to assign each pixel its final label.

Multi-scan algorithms might need more than two scans to determine the final labels, and the number of scans depends on the image content. Most high-performance multi-scan algorithms were introduced by Suzuki, who used a connection table during labeling to reduce the number of scans. Because the number of scans is limited, his method is faster than other well-known algorithms. Different experiments have shown that, for the subject of this paper (sub-word separation), the two-pass method works with adequate accuracy. Therefore, this method is used in this paper.

In the following, a new separation method is presented that uses the steps explained above and resolves the current problems of similar methods, so that sub-words are separated with accurate results. The paper is structured as follows: Section II introduces the method for extracting the sub-words of a text, reviews the existing problems, and presents the correction techniques used for improvement. Section III introduces a final review to overcome the remaining problems. Section IV presents the experimental results, and Section V concludes the paper.

II. SUB-WORDS RECOGNITION, REVIEW AND PROBLEM SOLVING

The steps for sub-word separation and the techniques used to improve the method are as follows.

Step one: scanning the text image and basic labeling. First, the image is scanned pixel by pixel, from left to right and from top to bottom. The pixel 'e' is selected as the current pixel, and a, b,

978-1-4673-6206-1/13/$31.00 ©2013 IEEE


c, d as its four direct neighbors. This arrangement is shown in Fig. 1.

The algorithm works as follows. If pixel 'e' is white, nothing changes and the algorithm moves to the next pixel. If pixel 'e' is black and all four neighboring pixels are white, a new label (a new, unique number) is assigned to 'e'. If exactly one of the neighbors is not white, its label is used to label the current pixel 'e'. Otherwise, more than one neighboring pixel is not white, and the minimum label among the non-white neighboring pixels is assigned to 'e'.

Fig. 1. Current pixel and its neighbors

A decision tree, shown in Fig. 2, is used to execute the steps explained above.

Fig. 2. Decision tree

Step two: information analysis (analysis of the output of the first step). After the labeling process, some sub-words might have more than one label; in other words, in some sub-words not all pixels have the same label. For instance, a sub-word like "سا" might have four labels, because a new label is assigned to each of the three arcs of "س" and another new label to "ا". This problem results from the structure of the sub-word. Therefore, to improve the performance of the method, some connections should be made within sub-words.

Step three: final labeling. To solve the multiple-labeling problem in some sub-words, the two neighbors on the right and below, in other words (x, y+1) and (x+1, y), must be checked. If the label of the current pixel is greater than zero (a non-white pixel) and the label of the right neighbor (x, y+1) is also greater than zero, the label of pixel (x, y+1) is stored, and during the second scan the label of the current pixel is assigned in place of the stored label. If the right pixel is white, the neighbor below is considered instead, and the process repeats until the final labeling is finished.

III. FINAL REVIEW AND CORRECTION TECHNIQUE

In this section, the results of the previous steps are examined, and it is observed that the dots of letters are recognized as distinct sub-words. Also, in some sub-words such as "آ", the hat of "ا" is recognized as a separate sub-word. The dots of letters receive labels different from those of their letters because there is no connection between a letter and its dots that would cause them to gain the same label. To solve this problem, the dots must lie within the area of their sub-word. For this purpose, an area is first defined for each sub-word. Then the vertical extent of each area is increased, because in Farsi script the dots are placed above or below the letters (not to their left or right). Consequently, when scanning within the area of a sub-word, pixels whose labels differ from the label of the current pixel are dots or the hat of "ا", and they must take the same label as the current pixel.

Also, to reduce the error rate for overlapping letters, a non-white pixel in the first column of a sub-word's area is not taken as the current pixel. If the next pixel carries a label different from that of the first-column pixel, it is checked as the current pixel instead.

IV. EXPERIMENTAL RESULTS

The proposed algorithm was applied to several paragraphs with different fonts. Some examples of these paragraphs are shown in Fig. 3.

Fig. 3. Farsi paragraphs

The results of the proposed algorithm on three fonts are shown in Table I.

V. CONCLUSION

In this paper, a fast and accurate method for the separation of Farsi and Arabic sub-words was proposed. First, using two-
pass scanning, the sub-words were separated; then, to solve problems such as the existence of additional components (dots, the hat of "ا", etc.) and the overlapping of some letters, appropriate solutions were presented.

The results of applying the method to English and Persian texts in various fonts show the capability of the proposed method. Its positive and negative characteristics are as follows:

The proposed algorithm has high accuracy and good performance in each step of recognition, owing to the solutions used to overcome the problems.

The method can be run on most Persian fonts.

One limitation of this algorithm in recognizing sub-words arises when sub-words stick together because of small font sizes or noise. This problem can be solved by identifying the letters of the alphabet that do not connect to the left and defining them in the algorithm.

TABLE I. COMPARISON OF THREE DIFFERENT FONTS

Font      285 sub-words                136 sub-words      119 sub-words      104 sub-words              98 sub-words
Nazanin   One error ("وح")             One error ("یف")   One error ("وح")   One error ("یک")           Two errors ("نوع", "یک")
Mitra     Two errors ("وح", "ساکت")    No error           One error ("وح")   No error                   One error ("نوع")
Lotus     One error ("ملالت")          No error           No error           Two errors ("جه", "یک")    One error ("به")

References

Periodicals:
[1] H. Al-Muhtaseb, "Recognition of off-line printed Arabic text using hidden Markov model", Signal Processing, 2902–2912, 2008.
[2] H. El Abed and V. Margner, "Arabic text recognition systems – state of the art and future trends", 692-696, 2008.
[3] H. Khosravi and E. Kabir, "Farsi font recognition based on Sobel-Roberts features", Pattern Recognition Letters, 75–82, 2008.
[4] A. Ebrahimi and E. Kabir, "A pictorial dictionary for printed Farsi sub-words", Pattern Recognition Letters, 29(5):656-663, 2008.
[5] Z. Shaaban, "A new recognition scheme for machine printed Arabic texts based on neural networks", World Academy of Science, Engineering and Technology, 41, 2008.
[6] K. Wu, E. Otoo, and K. Suzuki, "Optimizing two-pass connected-component labeling algorithms", Lawrence Berkeley National Laboratory, University of California, 2008.
[7] F. Chang, C. Chen, and C. Lu, "A linear-time component-labeling algorithm using contour tracing technique", Comput. Vis. Image Underst., 93(2):206-220, 2004.
[8] Q. Hu, G. Qian, and W. Nowinski, "Fast connected-component labeling in three-dimensional binary images based on iterative recursion", Comput. Vis. Image Underst., 99:414-434, 2005.
[9] T. Gotoh, Y. Ohta, M. Yoshida, and Y. Shirai, "Component labeling algorithm for video rate processing", in Proc. SPIE 1987, volume 804 of Advances in Image Processing, pages 217-224, 1987.
[10] R. Lumia, "A new three-dimensional connected components algorithm", Comput. Vision, Graphics, and Image Process., 23(2):207-217, 1983.
[11] K. Suzuki, I. Horiba, and N. S., "Linear-time connected-component labeling based on sequential local operations", Comput. Vis. Image Underst., 89(1):1-23, 2003.

Papers from Conference Proceedings (Published):
[12] C. Jacobs, P. Simard, P. Viola, and J. Rinker, "Text recognition of low-resolution document images", Eighth International Conference on Document Analysis and Recognition, IEEE, 2005.
[13] M. Sarfaraz, N. Nawaz, and A. Al-khuraidly, "Offline Arabic text recognition system", International Conference on Geometric Modeling and Graphics (GMAG'03), 2003.
[14] M.B and M. Adab, "Simultaneous segmentation and recognition of Farsi/Latin printed texts with MLP", International Joint Conference on Neural Networks, 1534-1539, 2002.
[15] A. Broumandnia and J. Shanbehzadeh, "Segmentation of printed Farsi/Arabic words", IEEE/ACS International Conference on Computer Systems and Applications, 761-766, 2007.

Dissertations:
[16] Ebrahimi, "Using the overall shape of printed sub-words in document image retrieval and recognition of Farsi texts" (in Persian), Ph.D. dissertation, Department of Electrical Engineering, Tarbiat Modares University, 1384 (Solar Hijri).
[17] R. Azmi, "Recognition of printed Farsi texts" (in Persian), Department of Electrical Engineering, Tarbiat Modares University, 1378 (Solar Hijri).
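
The two-pass labeling described in Section II can be illustrated with the following Python sketch. It is a minimal sketch under stated assumptions, not the authors' implementation. Fig. 1 is not reproduced in this text, so the neighbors a, b, c and d are assumed to be the three pixels in the row above and the pixel to the left of the current pixel 'e'. The image is assumed to be a list of rows with 1 for black and 0 for white. Label equivalences are resolved with a union-find table, a common substitute for the right/down propagation described in Step three. The function name label_two_pass is hypothetical.

```python
def label_two_pass(image):
    """Label connected black pixels (1) in a binary image given as a list of rows.

    Returns a grid of the same shape: 0 for white pixels, 1..n for components.
    """
    if not image:
        return []
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[k] is the union-find parent of label k; 0 is background

    def find(k):
        # Follow parent links to the representative label, compressing the path.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    # First pass (Step one): provisional labels from already-visited neighbors.
    for y in range(rows):
        for x in range(cols):
            if not image[y][x]:
                continue  # white pixel: nothing changes
            # Assumed mask: a, b, c in the row above, d to the left of 'e'.
            neighbor_labels = []
            for dy, dx in ((-1, -1), (-1, 0), (-1, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < rows and 0 <= nx < cols and labels[ny][nx]:
                    neighbor_labels.append(labels[ny][nx])
            if not neighbor_labels:
                parent.append(len(parent))  # new, unique label
                labels[y][x] = len(parent) - 1
            else:
                roots = [find(k) for k in neighbor_labels]
                smallest = min(roots)
                labels[y][x] = smallest  # minimum label wins
                for r in roots:
                    parent[r] = smallest  # record that these labels are equivalent

    # Second pass (Steps two and three): replace each label by its representative,
    # renumbered consecutively in scan order.
    final = {}
    for y in range(rows):
        for x in range(cols):
            if labels[y][x]:
                root = find(labels[y][x])
                labels[y][x] = final.setdefault(root, len(final) + 1)
    return labels


if __name__ == "__main__":
    # A U-shaped stroke (two arms joined at the bottom) next to a separate stroke.
    sample = [
        [1, 0, 1, 0, 0],
        [1, 0, 1, 0, 1],
        [1, 1, 1, 0, 1],
    ]
    for row in label_two_pass(sample):
        print(row)
```

In the sample, the two arms of the U-shaped stroke receive different provisional labels in the first pass and are merged in the second. This is the same multiple-label situation that Step two describes for the arcs of "س".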
