
IEEE International Conference on Information, Communication, Instrumentation and Control (ICICIC)


Segmentation and Extraction of Text from Curved Text Lines using Image Processing Approach
Monika A. Shejwal, Sangita D. Bharkad
Department of Electronics and Telecommunication Engineering,
Government College of Engineering,
Aurangabad, 431001, Maharashtra, India
mshejwal.ms@gmail.com, sangita.bharkad@gmail.com

Abstract—Camera-captured images of text often contain curved text lines because of distortion caused by page curl and the view angle of the camera. Ideally, when a document is scanned, the text is straight and the words are properly aligned. Segmentation of curled text lines, however, is a difficult step for dewarping techniques. This paper presents a method based on image processing algorithms for segmentation and extraction of characters from curled text lines in document images. The algorithm performs curved text segmentation using the x-line and base line. The words in the document image are identified and bounding boxes are plotted around them. The properties of connected components are used for segmentation of words. The algorithm achieved good accuracy for extraction of characters from curved text lines.

Keywords—Segmentation, Curved Text Line, Bounding Box, Optical Character Recognition (OCR), K Nearest Neighbor Classifier.

I. INTRODUCTION
The digitization of documents is an important method for increasing the quality and compatibility of documents. Nowadays, instead of scanning the document, people widely use camera-captured document images, because cameras are readily available at low cost and are built into all mobile gadgets. This gives quick, non-contact document imaging. The quality of document images captured by camera is, however, often poor because of camera perspective distortions, non-uniform shading, image blurring, character smearing (due to low resolution) and lighting variations. Document image analysis therefore plays a vital role in extracting information from document images. Extraction of straight text lines from document images is easy compared with extraction of curved text lines. The text of a document appears curved because of camera perspective and other distortions. In this paper, we try to tackle the obstacles that occur in binarization of camera-captured images. We use image processing algorithms for segmentation of the curved text lines of document images.

II. RELATED WORK
There are a number of algorithms available in the literature for binarization of document images. For degraded hand-held camera-captured document images, an adaptive binarization technique is proposed in [1] which overcomes the shortcomings of local binarization algorithms. The document image can be cleaned up with the help of a geometric matching algorithm, which helps to locate the real page contents region while disregarding trivial noise beside the page border [2]. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise represents a major issue for document analysis systems. Textual noise may occur as undesired text in optical character recognition (OCR) output, so it has to be removed afterwards. Existing document cleanup methods are used for detecting and removing marginal noise; the method proposed in [2] overcomes the limitations of these existing methods. An adaptive active contour (snake) algorithm is used in [3] for segmentation of curled text lines.

Adaptive thresholding, edge detection operators and mathematical morphology are used in [4] for detection of curled text lines in camera-captured document images. Liang et al. [5] and Dhanya and Jayalakshmi [9] presented comprehensive studies of applications of binarization of camera-captured document images (thick books, historical manuscripts, text in scenes etc.), the methodological difficulties and ways around these difficulties. Oliveira et al. [6] introduce a new approach to curved text line segmentation using dewarping and parallel line regression. This approach overcomes the limitations of a few existing curved text line segmentation methods. Roy et al. [7] proposed a method for segmentation of text lines based on foreground and background information of text. Kasar and Ramakrishnan [8] proposed a novel method for extracting text lines of arbitrary curvature and aligning them horizontally. In their work, the spatial characteristics of text are used to group neighboring components into the text lines present in the image.

Detection and identification of text is carried out by Sebastian and Priya [10], which is useful for blind persons accessing unfamiliar environments. A binarization-free clustering approach is introduced in [11] for text line segmentation, which is able to handle touching text lines and complex baseline curvature. Splitting of the image into two parts

is done in [12] for extraction of text from compound document images. The extraction of text and blobs from natural scene images and comic images is performed in [13][14].

The rest of the paper is organized as follows. Section III gives details of the algorithm used for segmentation and extraction of text from curved text lines. Experimental results are presented in Section IV. Finally, the conclusion is given in Section V.

III. PROPOSED METHOD
Fig. 1 shows the functional blocks of the proposed method. The method is divided into two parts:
(a) Segmentation of curled text lines
(b) Recognition of text from segmented curled lines

A. Curled text line segmentation
The database is a collection of color or gray scale images containing curled text lines. Gray scale images are used as they are for further processing, and color images are converted into gray scale images to avoid the large number of computations needed for color image processing. Preprocessing is an important step in image processing which is used to enhance the quality of the image. In preprocessing, the gray scale image is binarized, which aids in segmentation of curled text lines. The conversion of the gray scale image $I_G(x, y)$ to the binary image $I_B(x, y)$ is carried out using equation (1).

$$I_B(x, y) = \begin{cases} 1, & \text{if } I_G(x, y) > TH \\ 0, & \text{if } I_G(x, y) \le TH \end{cases} \tag{1}$$

where $I_G(x, y)$ and $I_B(x, y)$ are the gray scale image and the segmented image respectively, and $TH$ is the threshold value used to convert the gray scale image into a binary image. The threshold $TH$ is calculated using the information in the histogram of the gray scale image [15]. Fig. 2 shows the histograms of images in the database.
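Since the threshold is attributed to the histogram-based method of [15], the binarization step can be sketched as below. This is a minimal illustration, assuming that TH is chosen by Otsu's between-class variance criterion and that the image is a list of rows of 8-bit gray levels; the function names `otsu_threshold` and `binarize` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def otsu_threshold(gray: Sequence[Sequence[int]], levels: int = 256) -> int:
    """Return the gray level TH that maximizes the between-class variance."""
    # Histogram of gray levels over the whole image.
    hist = [0] * levels
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with gray level <= t
    sum0 = 0  # sum of gray levels of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(gray: Sequence[Sequence[int]], th: int) -> List[List[int]]:
    """Apply equation (1): 1 where I_G(x, y) > TH, otherwise 0."""
    return [[1 if value > th else 0 for value in row] for row in gray]
```

With this convention, pixels brighter than TH map to 1. For dark text on a light page the result would need to be inverted if later steps expect text pixels to be 1; that polarity is an assumption of this sketch, not something the text specifies.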

Database → Binarization of image → Remove noise from image using morphological operation → Smoothing of image → Formation of bounding box → Segmentation of curved text line → Slope detection and correction of skew words → Detection of text line → Recognition of character in text line

Fig. 1 Flowchart of proposed algorithm

Fig. 2 Histogram of database images

In the next step, noise present in the binary image is removed using morphological image processing operations. In this step the small elements which are not part of the characters in the curled text line are removed. The following steps are adopted to remove these elements.
a) Find the number of connected components present in the binary image.
b) Compute the area of each component.
c) Remove the components having area greater than 200.
The value of area is set to 200. This value was found experimentally. A value of area greater than 200 does not provide proper segmentation.
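The connected-component noise removal (find the components, compute each area, filter by area) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: text pixels are taken to be 1, components are 8-connected, and, because the step is described as removing small elements, components whose area falls below the threshold are the ones discarded. The comparison direction and the names `connected_components` and `remove_small_components` are assumptions of this sketch.

```python
from collections import deque
from typing import List, Tuple

Pixel = Tuple[int, int]


def connected_components(binary: List[List[int]]) -> List[List[Pixel]]:
    """Return the pixel lists of the 8-connected components of value 1."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components: List[List[Pixel]] = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels: List[Pixel] = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_small_components(binary: List[List[int]],
                            min_area: int = 200) -> List[List[int]]:
    """Keep only components whose area (pixel count) is at least min_area."""
    cleaned = [[0] * len(row) for row in binary]
    for pixels in connected_components(binary):
        if len(pixels) >= min_area:  # assumed direction: small blobs are noise
            for y, x in pixels:
                cleaned[y][x] = 1
    return cleaned
```

The default of 200 mirrors the experimentally chosen area value in the text; flipping the comparison in `remove_small_components` gives the opposite filter if large components are the ones to be dropped.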



Smoothing of text lines is required for proper segmentation of the words. A Gaussian filter is used for smoothing the text lines of the document image. The filtering is performed by computing the convolution of the document image with the filter mask. The Gaussian low pass filter is given by equation (2),

$$H(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{x^{2}}{2\sigma^{2}}} \tag{2}$$

where $H(x)$ is the transfer function of the Gaussian filter and $\sigma$ is the standard deviation of the distribution. The distribution is assumed to have a mean of 0.

B. Segmentation of curved text lines
The bounding box is a rectangle that is used to detect the text lines through the connected components. A bounding box is formed around each word using the properties of connected components. After a bounding box has been formed on every word of the document, curled text line segmentation is performed using the K nearest neighbor technique, which uses the Euclidean distance. The distance between the ending point of a word's bounding box and the starting points of the neighboring words' bounding boxes is evaluated, and the word at the smallest distance is taken as the next word of that line.

This procedure segments the curled text lines of the image. The segmented lines of the document are indicated by different colors of the bounding boxes. After that, the image is processed for text extraction.

The linear regression method is used to estimate the top line and base line. Several areas of text in a document have different skew angles. Adjacent connected components are grouped using nearest neighbor to form words, which are further grouped to form text lines. Then the top line and base line are estimated by the linear regression method. Equation (3) describes a line with slope $\beta$ and y-intercept $\alpha$.

$$y = \alpha + \beta x \tag{3}$$

Skew detection and correction: When a document is captured by a camera, a small skew is inevitable. This affects the accuracy of the algorithm for recognition of characters. The angle between the lower base line and the reference base line is calculated, and the skew is then corrected by rotating in the opposite direction. In this way skew is detected and corrected using the linear regression method.

After the curved text lines are segmented, extraction of text is done. Text extraction from an image is a very challenging task. Extraction of text plays an important role in providing information: the text gives meaningful information which can be used to understand the contents of the image. The documents are in curved form, which means that the text is skewed by some angle. This skew is detected and then corrected to a straight line. The linear regression method is used for classification of curled text lines.

Now we get the document with straight text words in image format. This is done by cropping each skewed word and correcting it by rotating it through its skew angle. The next image consists of one line from the document. Character recognition is then the important step to be performed, and it is done by Optical Character Recognition (OCR), which makes the text readable. OCR is used to classify optical patterns that correspond to alphanumeric or other characters. In OCR, the preprocessing steps are binarization, morphology and segmentation.

The segmented text line image is already a binary image, so morphological features can be used for recognition of characters. In segmentation, the connectivity of shape and label is checked. Segmentation is the most important aspect of preprocessing: it extracts from each individual character the features for character recognition. Moments, which make the process of recognizing an object invariant to scale, translation and rotation, can be used for character recognition. They are represented by equation (4).

$$d_{ij} = \sum_{x=0}^{w-1} x^{i} y^{j} f(x, y) \tag{4}$$

where $d_{ij}$ is the moment of order $(i, j)$, with $i, j = 0, 1, 2, 3, \ldots, k$, and $k$ represents the order of the moment. The OCR then reads all the content of the image and extraction is done. Each character is extracted and saved in the word pad format.

IV. EXPERIMENTAL RESULTS
The database is created by capturing images of text books, and the experiments are performed on these camera-captured images. Fig. 3(a) shows the database image and Fig. 3(b) shows the Gaussian filtered database image. The plotting of bounding boxes on text is shown in Fig. 3(c). The segmented text lines are shown in Fig. 3(d). Fig. 4(a)-(c) shows the results of segmentation of curved text lines from document images. After segmentation of curved text lines, characters are extracted and saved in the word pad. The performance of this algorithm is measured by computing the accuracy, which is defined as follows.

$$\text{Accuracy} = \frac{W_A}{W_D} \tag{5}$$

where $W_A$ represents the number of characters segmented by the algorithm and $W_D$ represents the number of characters present in the document image. Table I shows the results of this algorithm.

TABLE I. PERFORMANCE OF ALGORITHM IN TERMS OF ACCURACY

Sr No.   Database   Total no. of words    Number of words           Accuracy
                    in document image     recognized by algorithm
1        Image 1    83                    55                        66.26%
2        Image 2    34                    24                        70.58%
3        Image 3    66                    50                        75.75%
4        Image 4    60                    45                        75%
5        Image 5    2                     2                         100%
6        Image 6    2                     2                         100%
7        Image 7    8                     7                         87.5%
8        Image 8    7                     3                         42.85%
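The accuracy column can be recomputed from the counts in Table I by applying equation (5) to each image. The short sketch below does this and also takes the unweighted mean over the eight images; the variable and function names are illustrative only, and the averaging rule is an assumption, since the text does not state how the overall figure is obtained.

```python
# (image name, total words in document image, words recognized), from Table I
TABLE_I = [
    ("Image 1", 83, 55),
    ("Image 2", 34, 24),
    ("Image 3", 66, 50),
    ("Image 4", 60, 45),
    ("Image 5", 2, 2),
    ("Image 6", 2, 2),
    ("Image 7", 8, 7),
    ("Image 8", 7, 3),
]


def accuracy_percent(recognized: int, total: int) -> float:
    """Equation (5) expressed as a percentage: 100 * W_A / W_D."""
    return 100.0 * recognized / total


per_image = {name: accuracy_percent(rec, tot) for name, tot, rec in TABLE_I}

# Unweighted mean of the per-image accuracies (assumed averaging rule).
mean_accuracy = sum(per_image.values()) / len(per_image)

for name, value in per_image.items():
    print(f"{name}: {value:.2f}%")
print(f"Mean over {len(per_image)} images: {mean_accuracy:.2f}%")
```

Under this assumption the mean comes out at roughly 77.2%, which is in line with the overall accuracy quoted in the conclusion. The tabulated per-image values appear to be truncated rather than rounded to two decimals, so the printed figures can differ from the table in the last digit.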

Fig. 3 (a) Database image (b) Gaussian filtered image (c) Segmentation of text line image (d) Image after bounding box

Fig. 4 (a)-(c) Database images and their segmented text lines

V. CONCLUSION
In this paper, we extracted the text lines from curled text document images captured by camera. The proposed method is applied to gray scale as well as color images. The curled text lines are detected using the properties of connected components. The text is then extracted from the document image using OCR. The method can be used further for document image analysis, understanding and processing. The average accuracy of this algorithm over the eight database images in Table I is approximately 77.24%.

REFERENCES
[1] S. Bukhari, "Adaptive Binarization of Unconstrained Hand-Held Camera-Captured Document Images", Journal of Universal Computer Science, Vol. 15, No. 18, pp. 3343-3363, 2009.
[2] V. Beusekom, J. Keysers and D. Breuel, "Document cleanup using page frame detection", Int. J. Doc. Anal. Recognition, Vol. 11, No. 2, pp. 81-96, 2008.
[3] S. Bukhari, F. Shafait and T. Breuel, "Curled text-line segmentation from warped document images", Proc. of Conf. on Document Analysis and Recognition, Vol. 16, No. 1, pp. 33-53, 2011.
[4] M. Anushree and D. Dhanalakshmy, "Text line Segmentation of Curved Document Images", IJERA, Vol. 4, pp. 32-36, 2014.
[5] J. Liang, D. Doermann and H. Li, "Camera-based analysis of text and documents: a survey", IJDAR, 2005.
[6] D. Oliveira, R. Lins, G. Torreão, J. Fan and M. Thielo, "A New Method for Text-Line Segmentation for Warped Documents", Proc. of International Conference on Image Analysis and Recognition, Vol. 6112, pp. 398-408, 2010.
[7] P. Roy, U. Pal and J. Lladós, "Text Line Extraction in Graphical Documents using Background and Foreground Information", International Journal on Document Analysis and Recognition (IJDAR), Vol. 15, No. 3, pp. 227-241, 2012.
[8] T. Kasar and A. Ramakrishnan, "Alignment of Curved Text Strings for Enhanced OCR Readability", International Journal of Computer Vision and Signal Processing, Vol. 3, No. 1, pp. 1-9, 2013.
[9] M. Dhanya and Jayalakshmi, "Literature Survey On Dewarping Of Document Images", International Journal of Modern Trends in Engineering and Research (IJMTER), Vol. 02, No. 07, pp. 343-347, 2015.
[10] S. Sebastian and Priya S, "Text detection and recognition from images as an aid to blind persons accessing unfamiliar environments", ARPN Journal of Engineering and Applied Sciences, Vol. 10, No. 17, pp. 7559-7564, 2015.
[11] A. Garz, A. Fischer, H. Bunke and R. Ingold, "A Binarization-Free Clustering Approach to Segment Curved Text Lines in Historical Manuscripts", Proc. of 12th International Conference on Document Analysis and Recognition, pp. 1290-1294, 2013.
[12] H. Song and Dongjian, "A Novel Method to Extract Text from Compound Document Images", Proc. of IEEE Conf. on Intelligent Computing and Intelligent Systems, pp. 143-146, 2016.
[13] N. Maria, P. Damien and C. Yaacoub, "A Robust Algorithm for Text Extraction from Images", Proc. of IEEE Conf. on Telecommunications and Signal Processing (TSP), pp. 493-497, 2010.
[14] M. Sundaresan and S. Ranjini, "Text Extraction from Digital English Comic Image Using Two Blobs Extraction Method", Proc. of IEEE International Conference on Pattern Recognition, Informatics and Medical Engineering, pp. 449-452, 2012.
[15] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 9, No. 1, pp. 62-66, 1979.
