1370 شیروانی
Abstract—Separating letters and word units is one of the most important
parts of text recognition algorithms. In the Farsi language, these units
consist of single letters and groups of connected letters, which are called
"sub-words". Separating these units therefore plays a central role in
developing text-processing algorithms. This paper proposes a method based on
connected-component labeling techniques that separates letters and sub-words
in Farsi fonts of any size with high accuracy. Experiments show an accuracy
of more than 90 percent for this method.

Keywords – recognition, connected-component labeling, sub-word separation,
image processing.

I. INTRODUCTION

TEXT recognition is an area that has attracted much attention in recent
decades, and sub-word separation is one of its most important stages. In
Farsi and Arabic texts, words are built by combining letters. So, to improve
the speed and accuracy of recognition, accuracy in the separation step is
critical. Because of the special characteristics of the Farsi and Arabic
languages, separation in these two languages is more difficult than in
Latin-script languages such as English.

Text recognition methods fall into three categories: 1) recognition based on
separation, 2) recognition based on overall shape, and 3) a combination of
the two. In the first method, sub-words are divided into their components,
which are called letters, and recognition is then performed. In the second
method, the whole sub-word is treated as one image and then recognized. The
third method combines the first two. Separation-based recognition is the
more popular approach, and much research on it has been done for the Farsi
and Arabic languages. Recognition based on overall shape has many problems,
although in some situations it has advantages over separation-based
recognition.

In this paper, a connected-component labeling method is used to separate
sub-words for shape-based recognition. Labeling the connected components of
a binary image is one of the most important and basic steps in pattern
recognition. Labeling algorithms fall into three categories: 1) one-pass
algorithms, 2) two-pass algorithms, and 3) multi-pass algorithms.

In one-pass algorithms, the image is scanned only once to find unlabeled
pixels, and the label assigned to such a pixel is then assigned to all other
pixels of the same object. Usually, this scan follows a processing pattern
without any fixed rule. In practical experiments, this rule-less process
leads to poor performance, although some solutions to this problem have been
suggested.

All two-pass algorithms consist of three steps: 1) scan and basic labeling,
in which the image is scanned to assign initial labels; 2) information
analysis, in which the collected information is analyzed in order to
determine the labels; 3) final labeling, in which the image is scanned again
to give each pixel its final label.

Multi-scan algorithms may need more than two scans to determine the final
labels, and the number of scans depends on the image content. Most of the
high-performance multi-scan algorithms were introduced by Suzuki, who used a
connection table for labeling to reduce the number of scans. Because the
number of scans is limited, his method is faster than other well-known
algorithms. Different experiments have shown that, for the subject of this
paper (sub-word separation), the two-pass method works with adequate
accuracy. Therefore, this method is used in this paper.

In what follows, a new separation method is presented that uses the steps
explained above and solves the current problems of similar methods, so that
sub-word separation yields accurate results. The paper is structured as
follows: Section 2 introduces the method for extracting the sub-words of a
text, reviews the existing problems, and presents the correction techniques
used for improvement. Section 3 introduces a final review step that
overcomes the remaining problems. Section 4 presents the execution results,
and Section 5 concludes.

II. SUB-WORDS RECOGNITION, REVIEW AND PROBLEM SOLVING

The steps for sub-word separation and the techniques used to improve the
method are:

Step one: scan of the text image and basic labeling. First, the image is
scanned pixel by pixel, from left to right and from top to bottom. The 'e'
pixel is selected as the current pixel and a, b,
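
As a rough illustration of the two-pass labeling outlined in the Introduction (scan and basic labeling, information analysis, final labeling), the following Python sketch labels the connected components of a small binary image. It is a minimal sketch under stated assumptions, not the method of this paper: the function name `two_pass_label`, the use of 8-connectivity, and the union-find table that records label equivalences are all choices made for this example. The analysis step is folded into the first scan here, and the paper's own pixel mask is not reproduced.

```python
from typing import Dict, List


def two_pass_label(image: List[List[int]]) -> List[List[int]]:
    """Label 8-connected foreground pixels (1); background (0) stays 0.

    Hypothetical helper written for illustration only.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[k]: union-find parent of provisional label k

    def find(k: int) -> int:
        # Follow parent links to the representative, shortening the path.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    def union(a: int, b: int) -> None:
        # Record that provisional labels a and b belong to one component.
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Step 1: raster scan (left to right, top to bottom), provisional labels.
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0:
                continue
            # Already-visited neighbours: upper-left, upper, upper-right, left
            # (an assumed 8-connectivity mask).
            neighbours = []
            for dr, dc in ((-1, -1), (-1, 0), (-1, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc]:
                    neighbours.append(labels[rr][cc])
            if not neighbours:
                parent.append(len(parent))  # create a new provisional label
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = min(neighbours)
                # Step 2 (analysis): note which labels are equivalent.
                for n in neighbours:
                    union(labels[r][c], n)

    # Step 3: second scan replaces provisional labels by final ones (1, 2, ...).
    final: Dict[int, int] = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = final.setdefault(root, len(final) + 1)
    return labels
```

Each distinct non-zero label in the result marks one connected component. For instance, on the image `[[1, 0, 1], [1, 1, 1]]` the two upper pixels first receive different provisional labels, and the bottom row then merges them into a single component.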

Font: Nazanin
Number of sub-words:   285 | 136 | 119 | 104 | 98
Number of errors:      One error | One error | One error | One error | Two errors
Erroneous sub-words:   "وح" | "يف" | "وح" | "يک" | "نوع", "يک"