Abstract- Extracting text objects from PDF images is a challenging problem. The text present in PDF images carries useful information for automatic annotation, indexing, etc. However, variations in text style, font, size, orientation, and alignment, as well as complex page structure, make automatic text extraction extremely difficult. This paper presents two techniques under block-based classification. After a brief introduction to the classification methods, the two methods are enhanced and their results evaluated. Segmentation performance and time consumption are measured for both models.

Keywords- Block based segmentation, Histogram based, AC Coefficient based.

I. INTRODUCTION

With the rapid advancement of computer and communication technology, modern society is entering the information age. In place of the traditional document system (paper etc.), people now rely on an electronic document system (PDF format) for communication and storage. For complex documents, however, it is difficult to identify the required information accurately and directly from the document image. In such cases the document is preprocessed first. Image segmentation, as a part of digital image processing, has become an active area of research.

Document image segmentation is an important research topic in image processing. It is the link between document image pre-processing and higher-level character recognition. The relatively effective and commonly used methods for document image segmentation and classification include thresholding, geometric analysis, and other categories.

After segmentation, the text part is detected and extracted for further processing. Earlier text extraction techniques were developed only for monochrome documents [1]. These techniques can be classified as bottom-up, top-down, and hybrid. Later, with the increasing need for color documents, further techniques [2] were proposed.

Segmentation techniques such as block-based image segmentation [3] are used extensively in practice. Within block-based segmentation, this paper compares (i) the AC-coefficient-based technique and (ii) the histogram-based technique.

This paper is organized as follows. Section II gives a brief introduction to the PDF image. Section III reviews block-based segmentation. Section IV discusses in detail text extraction using the proposed techniques. Section V discusses the experimental results of the two models. Finally, Section VI concludes the paper.

II. PDF IMAGE

The PDF format is converted into images using available commercial software, so that each PDF page is converted into an image. From that image the text parts are segmented and extracted for further processing.

III. BLOCK BASED SEGMENTATION

The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain visual characteristics.

Most recent research in this field is either layer based or block based. The block-based segmentation approach divides an image into regions made of blocks (Fig. 1). Each region follows approximate object boundaries and is made of rectangular blocks. The size of the blocks may vary within the same region to better approximate the actual object boundary.
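The division of a page image into blocks can be sketched as follows. This is a minimal illustration only: it assumes a grayscale image stored as a list of rows, fixed 8x8 blocks, and padding by repeating the edge pixel. The function name `split_into_blocks` and these choices are assumptions of the sketch, and variable block sizes are not handled.

```python
def split_into_blocks(image, block=8):
    """Tile a grayscale image (list of equal-length rows) into block x block tiles.

    Rows and columns that do not fill a whole block are padded by repeating
    the last pixel. Returns a grid where grid[r][c] is one tile (list of rows).
    """
    height, width = len(image), len(image[0])
    padded_h = -(-height // block) * block  # round up to a multiple of block
    padded_w = -(-width // block) * block

    def pixel(y, x):
        # Clamp coordinates so out-of-range positions repeat the edge pixel.
        return image[min(y, height - 1)][min(x, width - 1)]

    grid = []
    for top in range(0, padded_h, block):
        row_of_tiles = []
        for left in range(0, padded_w, block):
            tile = [[pixel(top + dy, left + dx) for dx in range(block)]
                    for dy in range(block)]
            row_of_tiles.append(tile)
        grid.append(row_of_tiles)
    return grid
```

Each tile returned by the sketch can then be examined and labelled independently of its neighbours.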
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 10, No. 9, September 2012
Block-based segmentation algorithms have been developed mostly for grayscale or color compound images. For example, in [4] text and line graphics are extracted from check images, [5] proposes a block-based clustering algorithm, and [6] proposes a classification algorithm based on a threshold on the number of colors in each block. Other work includes [7], an approach based on available local texture features [8], detection using a mask [9], and a block classification algorithm for efficient coding by thresholding DCT energy [10].

The following subsections discuss the two techniques in block-based segmentation: (A) the AC-coefficient-based technique and (B) the histogram-based technique.

A. AC Coefficient based technique

The first model uses the AC coefficients produced by the Discrete Cosine Transform (DCT) to segment the image into three kinds of blocks: background, text, and image blocks [11][12][14][15]. A background block covers smooth regions of the image, a text/graphics block has a high density of sharp edges, and an image block covers the non-smooth part of the PDF image (Fig. 2). AC energy is calculated from the AC coefficients and compared with a user-defined threshold to identify the background blocks first. The AC energy of a block s is calculated using Equation (1):

$$E_s = \sum_{i=1}^{63} Y_{s,i}^2 \tag{1}$$

where Ys,i is the estimate of the i-th DCT coefficient of block s, produced by JPEG decompression. When the value of Es is less than a threshold T1, the block is grouped as a smooth region; otherwise it is grouped as a non-smooth region. After much experimentation with different images, the thresholds T1 = 20 and T2 = 70 gave better segmentation and were used in further experimentation.

Konstantinides and Tretter (2000) [17] reported that the code lengths of text blocks after entropy encoding tend to be longer than those of non-text blocks, because of the higher level of high-frequency content in text blocks. Thus the first feature, D1, calculates the encoding length using Equation (2):

$$D_{s,1} = \frac{1}{64}\left[ f(Y_{s,0} - Y_{s-1,0}) + \sum_{i=1}^{63} f(Y_{s,i}) \right] \tag{2}$$

where

$$f(x) = \begin{cases} \log_2(|x|) + 4 & \text{if } |x| \ge 1 \\ 0 & \text{otherwise} \end{cases}$$

The second feature, D2, measures how close a block is to a two-colored block. For each block s, a two-color projection is performed on the luminance channel. The pixels of each block are clustered into two groups using the k-means clustering algorithm, with means denoted by θs,1 and θs,2. The two-color projection is formed by clipping the luminance of each pixel to the mean of the cluster to which it belongs. The l2 distance between the luminance of the block and its two-color projection is then a measure of how closely the block resembles a two-color block. This projection error is normalized by the square of the difference of the two estimated means, |θs,1 − θs,2|², so that a high-contrast block has a higher chance of being classified as a text block. The second feature is then calculated as

$$D_{s,2} = \frac{1}{|\theta_{s,1} - \theta_{s,2}|^2} \sum_{i=0}^{63} |X_{s,i} - X'_{s,i}|^2 \tag{3}$$
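As a rough illustration of the block measurements in this subsection, the sketch below computes the AC energy of an 8x8 block together with the features D1 and D2. It rests on several assumptions that are not taken from the paper: the AC energy of Equation (1) is taken to be the sum of the squared AC coefficients, the coefficients come from an orthonormal DCT of the pixel block rather than from a JPEG decoder, and a simple one-dimensional 2-means loop stands in for the k-means step. All function names are hypothetical, and plain Python is used for clarity rather than speed.

```python
import math

N = 8  # block side; 8x8 blocks as in JPEG (assumption of this sketch)


def dct2(block):
    """Orthonormal 2-D DCT-II of an N x N block, flattened with the DC term first."""
    def scale(k):
        return math.sqrt((1.0 if k == 0 else 2.0) / N)

    def basis(n, k):
        return math.cos((2 * n + 1) * k * math.pi / (2 * N))

    return [scale(u) * scale(v) * sum(block[x][y] * basis(x, u) * basis(y, v)
                                      for x in range(N) for y in range(N))
            for u in range(N) for v in range(N)]


def ac_energy(coeffs):
    """Assumed form of Equation (1): sum of squared AC coefficients (index 1..63)."""
    return sum(c * c for c in coeffs[1:])


def is_smooth(coeffs, t1=20.0):
    """A block is grouped as smooth when its AC energy is below T1."""
    return ac_energy(coeffs) < t1


def code_length(x):
    """f(x) of Equation (2): approximate code length of one coefficient."""
    return math.log2(abs(x)) + 4 if abs(x) >= 1 else 0.0


def feature_d1(coeffs, prev_dc):
    """Equation (2): differential DC term plus all AC terms, averaged over the block."""
    total = code_length(coeffs[0] - prev_dc) + sum(code_length(c) for c in coeffs[1:])
    return total / (N * N)


def feature_d2(pixels, iterations=20):
    """Equation (3): squared distance to the two-colour projection, normalised by contrast."""
    lo, hi = float(min(pixels)), float(max(pixels))
    if lo == hi:
        return 0.0  # single-valued block: both means coincide, so D2 = 0
    for _ in range(iterations):
        near_lo = [p for p in pixels if abs(p - lo) <= abs(p - hi)]
        near_hi = [p for p in pixels if abs(p - lo) > abs(p - hi)]
        lo, hi = sum(near_lo) / len(near_lo), sum(near_hi) / len(near_hi)
    error = sum(min((p - lo) ** 2, (p - hi) ** 2) for p in pixels)
    return error / (lo - hi) ** 2


flat = [[100] * N for _ in range(N)]          # uniform, background-like block
stripes = [[0, 255] * 4 for _ in range(N)]    # sharp vertical edges, text-like block
print(is_smooth(dct2(flat)), is_smooth(dct2(stripes)))
```

The default threshold of 20 mirrors the reported T1, but its scale depends on how the DCT coefficients are normalised, so it would need retuning for any other transform convention.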
In Equation (3), Xs,i is the estimate of the i-th pixel of block s, produced by JPEG decompression, and X's,i is the value of the i-th pixel of the two-color projection. If θs,1 = θs,2, then Ds,2 = 0.

For the norm used in k-means clustering, the contributions of the two features are weighted differently. Specifically, for a vector D = [D1, D2], the norm is calculated by

$$\|D\| = \sqrt{D_1^2 + \gamma D_2^2} \tag{4}$$

where γ is the weight applied to the second feature.

... blocks is typically dominated by one or two intensity values (modes). Separating the text and image blocks from the PDF image is challenging.

Here an intensity value is defined as a mode if its frequency satisfies two conditions:
(i) it is a local maximum, and
(ii) the cumulative probability around it is above a pre-selected threshold, T.
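The two mode conditions can be sketched as below. The text does not say how wide the neighbourhood "around" an intensity value is, so the `radius` parameter, the default `threshold`, and the function name `find_modes` are all assumptions made for illustration.

```python
def find_modes(pixels, threshold=0.2, radius=2, levels=256):
    """Return the intensity values of a block that qualify as modes.

    A value is a mode when (i) its histogram count is a local maximum and
    (ii) the fraction of pixels within `radius` levels of it exceeds `threshold`.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)

    modes = []
    for v in range(levels):
        if hist[v] == 0:
            continue
        left = hist[v - 1] if v > 0 else 0
        right = hist[v + 1] if v < levels - 1 else 0
        # On a plateau of equal counts only the last value is kept.
        is_local_max = hist[v] >= left and hist[v] > right
        lo, hi = max(0, v - radius), min(levels - 1, v + radius)
        mass = sum(hist[lo:hi + 1]) / total  # cumulative probability around v
        if is_local_max and mass > threshold:
            modes.append(v)
    return modes
```

Counting the returned modes gives a simple per-block summary of how many intensity values dominate the block.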
Fig 5: Sample PDF files used for testing: (i) single column PDF file with no figures, (ii) double column PDF file with no figures, (iii) single column PDF file with figures, (iv) double column PDF file with figures.

Total number of PDF files used for testing: 100
20 - Single column files with no figures
20 - Double column files with no figures
30 - Single column files with figures
30 - Double column files with figures
Evaluation method: 10-fold cross validation technique

Table 1 shows the comparison of the two proposed methods.

PDF image type                 Method           Accuracy (%)   False positive (%)   Time (seconds)
Single column, no figures      AC coefficient   94.33          5.67                 20.71
Single column, no figures      Histogram        93.87          6.13                 14.91
Double column, no figures      AC coefficient   92.66          7.34                 22.57
Double column, no figures      Histogram        91.87          8.13                 13.57
Single column, with figures    AC coefficient   93.51          6.49                 26.64
Single column, with figures    Histogram        92.44          7.56                 13.06
Double column, with figures    AC coefficient   90.19          9.81                 21.02
Double column, with figures    Histogram        91.67          8.33                 13.10

Table 1: Comparison of the two proposed methods.

[4] Huang, J., Wang, Y. and Wong, E. K., "Check image compression using a layered coding method," Journal of Electronic Imaging, 7(3):426-442, July 1998.

[5] Bing-Fei Wu, Yen-Lin Chen, Chung-Cheng Chiu and Chorng-Yann Su, "A Novel Image Segmentation Method for complex document images," 16th IPPR Conference on Computer Vision, Graphics and Image Processing (CVGIP 2003).

[6] Lin, T. and Hao, P., "Compound Image Compression for Real Time Computer Screen Image Transmission," IEEE Trans. on Image Processing, Vol. 14, pp. 993-1005, Aug. 2003.

[7] Wong, K.Y., Casey, R.G. and Wahl, F.M., "Document analysis system," IBM J. of Res. and Develop., Vol. 26, No. 6, pp. 647-656, 1982.

[8] Mohammad Faizal Ahmad Fauzi, Paul H. Lewis, "Block-based Against Segmentation-based Texture Image Retrieval," Journal of Universal Computer Science, Vol. 16, No. 3 (2010), pp. 402-423.

[9] Vikas Reddy, Conrad Sanderson, Brian C. Lovell, "Improved Foreground Detection via Block-based Classifier Cascade with Probabilistic Decision Integration," IEEE Transactions on Circuits and Systems for Video Technology, Vol. XX, No. XX (2012).

[10] S. Ebenezer Juliet, D. Jemi Florinabel, Dr. V. Sadasivam, "Simplified DCT based segmentation with efficient coding of computer screen images," ICIMCS'09, November 23-25, 2009, Kunming, Yunnan, China. Copyright 2009 ACM 978-1-60558-840-7/09/11.

[11] K. Veeraswamy, S. Srinivas Kumar, "Adaptive AC-Coefficient Prediction for Image Compression and Blind Watermarking," Journal of Multimedia, Vol. 3, No. 1, May 2008.

[12] Wong, T.S., Bouman, C.A. and Pollak, I., "Improved JPEG decompression of document images based on image segmentation," Proceedings of the 2007 IEEE/SP 14th Workshop on Statistical Signal Processing, IEEE Computer Society, pp. 488-492, 2007.
[13] Li, X. and Lei, S., "Block-based segmentation and adaptive coding for visually lossless compression of scanned documents," Proc. ICIP, Vol. III, pp. 450-453, 2001.
AUTHORS PROFILE