DOCUMENT PRESENTATION ENGINE FOR INDIAN OCR: A DOCUMENT LAYOUT ANALYSIS APPLICATION
ABSTRACT 
Office automation is now under way in every field. Computers are used everywhere for fast data processing and for maintaining large amounts of data. However, previously processed data that exists only as printed documents is still needed. There are two ways to reuse such data. The first is to type it into the computer. The second is to scan the document and use OCR (Optical Character Recognition) to convert the document image into an editable text format. More than 1000 languages and 14 scripts are used by 112 million people in India [1,2], so there is a need for an OCR system for Indian scripts, and such systems are still under development. The OCR process consists of scanning the document, followed by noise cleaning, skew detection and correction, text/non-text classification, text line detection and segmentation, word segmentation, character segmentation and identification, and output file generation. The main contribution of this paper is a way to maintain the layout of the document. At present, OCR systems produce a text file as output without maintaining the layout of the document. OCR is an error-prone process. The errors remaining in OCRed texts can cause serious problems in reading and understanding if they do not refer to the original image representation. This paper presents the use of XML to generate an OpenOffice document file. It follows the OpenDocument standard, which was approved by OASIS (Organization for the Advancement of Structured Information Standards) on Feb. 1, 2007 [3]. The main feature of the proposed solution is that it is script independent, so it can be applied to all Indian scripts.
Keywords:
OCRed Document, Indian Scripts, Document layout.
1.0 INTRODUCTION
In computer science, text and images are the main sources of information. A reader can understand information only if it is presented well. For example, data presented together with images in a well-organized manner is easier to understand, whereas poorly presented information may not be understood at all. The OCR process is error prone, and manually proofreading OCR results is time consuming and expensive. The errors remaining in OCRed texts can cause serious problems in reading and understanding if they do not refer to the original image representation. Document representation after OCR is therefore a very important task: if the output does not maintain the layout of the original image, the document becomes hard to read and to understand. Present systems scan the document image and place the text and images one after another without maintaining the layout [2]. This paper gives a brief discussion of OCR processing and a detailed discussion of the proposed system. The following figure gives an overview of the OCR process.
Figure 1: Flow of the OCR process
As Figure 1 shows, the first step of OCR is two-tone conversion, which converts the image into a binary image; then skew is detected and corrected. Noise cleaning is performed on the skew-corrected image. The text/non-text classification technique classifies the document image into text and image regions. The image part is extracted from the document image for further processing, and the remaining text part is passed to text line detection. After the text lines are detected, each line is processed, individual characters are generated, and the final output file is generated. At this stage, however, there is a problem: the layout of the document image is not maintained in the output file. The main contribution of this paper is a way to maintain the layout of the output document.
©Informatics '09, UM 2009
RDT4-88
Proceedings of the 3rd International Conference on Informatics and Technology, 2009
Umesh Kumar, Jagdish Raheja
2.0 PROCESSING OF OCR
OCR is a combination of multiple processes, as shown in Figure 1. The first process is to acquire the document image using a scanner or a camera. The acquired image may contain many colors, but OCR operates on a binary image, so the image is first converted to binary: a global threshold value is generated and the image is converted to the corresponding binary image. Noise is then removed from the input image using the most commonly used approach, the morphological component removal technique. Figure 2 shows an input image before and after noise removal.
Figure 2: Noisy image, noise-cleaned image, skewed image, and skew-corrected image
After noise removal, the resulting image is passed to skew detection and correction. Skew is caused by improper alignment of the paper during scanning: if the document is not placed properly on the scanner, the scanned image will be skewed. A skewed image may cause text detection to fail, so the skew must be detected and corrected. Figure 2 also shows a skewed image and the result after correction. After preprocessing (noise removal and skew correction), the image is segmented into two categories, text and non-text, by the text/non-text classification module. Since document images contain text, images, and tables, this module treats images and tables as non-text areas and the remaining part as the text area [13, 14]. At this stage the non-text area is extracted from the image and stored, and the remaining part of the image is used to detect the text. Figure 3 shows an input image, its text/non-text classification, and the extracted text area. A red boundary marks the non-text areas of the image.
Figure 3: An input image, text/non-text classification, and extracted text area
Figure 3 shows the complete text part of the document image, which is used to detect the text. After classification, each text block is identified and its text lines are detected, as shown in the following image for a single text block.
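As an illustration of the two-tone conversion described in this section, the following is a minimal Python sketch of global thresholding. The paper does not state how its global threshold value is generated; Otsu's method is used here purely as an assumption. The function names `otsu_threshold` and `binarize` and the list-of-lists image representation are hypothetical and chosen only for this example.

```python
def otsu_threshold(gray):
    """Return one global threshold (0-255) for a 2-D list of grayscale values.

    Assumption: Otsu's method (maximise between-class variance) stands in
    for the unspecified global threshold computation in the paper.
    """
    # Build the grayscale histogram.
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels with value <= t
    sum_bg = 0    # sum of those pixel values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold):
    """Map pixels at or below the threshold to 1 (ink) and the rest to 0."""
    return [[1 if value <= threshold else 0 for value in row] for row in gray]


# Hypothetical 2x2 image: dark ink on the left, light paper on the right.
image = [[20, 200],
         [30, 220]]
t = otsu_threshold(image)
binary = binarize(image, t)
```

A single threshold for the whole page is the simplest choice and matches the "global threshold" wording of the text; unevenly lit scans may need a different strategy, which is outside the scope of this sketch.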
Figure 5: Detected text lines
After the text lines are detected, word segmentation, character segmentation, template matching, and character replacement are performed. These processes use standard techniques that are discussed elsewhere [5, 12], so they are not described in detail here; a brief introduction is given only to complete the description of the OCR process. Word segmentation uses a basic feature of the script, the white space between words. Each word is segmented and passed on to character segmentation, which again uses white space, here the gap between characters. The output of this step is a set of individual character images. For each one, a matching character is searched for and substituted.
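The white-space-based segmentation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary text-line image stored as a list of rows with 1 for ink and 0 for background, and the names `column_profile`, `segment_by_gaps`, and the `min_gap` parameter are hypothetical.

```python
def column_profile(binary):
    """Count ink pixels (value 1) in each column of a binary text-line image."""
    width = len(binary[0]) if binary else 0
    return [sum(row[x] for row in binary) for x in range(width)]


def segment_by_gaps(binary, min_gap):
    """Split a text line into (start, end) column ranges, end exclusive.

    A new segment starts whenever at least `min_gap` consecutive blank
    columns separate two inked columns. A large `min_gap` yields words,
    a small one yields characters (assumed values, tuned per document).
    """
    profile = column_profile(binary)
    segments = []
    start = None      # first inked column of the current segment
    last_ink = None   # most recent inked column seen
    for x, count in enumerate(profile):
        if count == 0:
            continue                      # blank column, part of a gap
        if start is None:
            start = x                     # first ink on the line
        elif x - last_ink - 1 >= min_gap:
            segments.append((start, last_ink + 1))
            start = x                     # gap wide enough: new segment
        last_ink = x
    if start is not None:
        segments.append((start, last_ink + 1))
    return segments


# Hypothetical text line: two "words", the first made of two "characters".
line = [[1, 1, 0, 1, 0, 0, 0, 1, 1],
        [0, 1, 0, 1, 0, 0, 0, 1, 0]]
words = segment_by_gaps(line, min_gap=3)
chars = segment_by_gaps(line, min_gap=1)
```

The same routine serves both steps because, as the text notes, word and character segmentation rely on the same feature and differ only in how wide a gap must be to count as a boundary.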
3.0 OVERVIEW OF PROBLEM
The present OCR system available for Indian scripts is able to convert a document image into editable text and produce a text file, as discussed above. It performs the pre-processing and classifies the text and non-text areas, then detects the text in the segmented text area and generates the equivalent editable text. The editable text and the images are placed in the text file. This works without problems if the document image has a single column. But what happens if the document image is multicolumn, as shown in the following image?
Figure 6: Multicolumn document image (left) and the corresponding disordered block representation (right)
As shown in Figure 6 (left), there are five text blocks and two image blocks. Suppose the 1st block and then blocks 3, 2, 4, and 5 are processed one by one and placed in the output file. But if this sequence is not followed, or an image is not placed properly