DOCUMENT PRESENTATION ENGINE FOR INDIAN OCR: A DOCUMENT LAYOUT ANALYSIS APPLICATION
ABSTRACT
Office automation is now under way in every field. Computers are used everywhere for fast data processing and for maintaining large volumes of data. However, previously processed data that exists only as printed documents is still needed. There are two ways to reuse such data. The first is to retype it into the computer. The second is to scan the document and use OCR (Optical Character Recognition) to convert the document image into editable text. More than 1000 languages and 14 scripts are used by 112 million people in India [1,2], so there is a need for an OCR system for Indian scripts, and such systems are still under development. An OCR system scans the document and then performs noise cleaning, skew detection and correction, text/non-text classification, text line detection and segmentation, word segmentation, character segmentation and identification, and output file generation. The main contribution of this paper is a way to maintain the layout of the document. Present OCR systems produce a text file as output without maintaining the document layout. OCR is an error-prone process, and the errors remaining in OCRed text can cause serious problems in reading and understanding if the text does not refer back to the original image representation. This paper presents the use of XML to generate an OpenOffice document file, following the OpenDocument standard approved by OASIS (Organization for the Advancement of Structured Information Standards) on Feb. 1, 2007 [3]. The main feature of the proposed solution is that it is script independent, so it can be applied to all Indian scripts.
Keywords:
OCRed Document, Indian Scripts, Document layout.
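The idea stated in the abstract, using XML to produce an OpenDocument file that keeps each recognized block at its original position, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that layout analysis has already produced a list of text blocks with page coordinates in centimetres; the `blocks` structure, its keys, and the function name `build_content_xml` are hypothetical. Each block becomes a page-anchored `draw:frame` holding a `draw:text-box`, and because the text is plain Unicode the same code serves any script.

```python
import xml.etree.ElementTree as ET

# OpenDocument namespaces used in content.xml.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
    "svg": "urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, name):
    """Return the namespace-qualified tag or attribute name."""
    return "{%s}%s" % (NS[prefix], name)


def build_content_xml(blocks, page=1):
    """Build a content.xml string with one positioned frame per block.

    `blocks` is a hypothetical layout-analysis result: a list of dicts
    with keys x, y, w, h (centimetres on the page) and text (Unicode).
    """
    root = ET.Element(q("office", "document-content"),
                      {q("office", "version"): "1.1"})
    body = ET.SubElement(root, q("office", "body"))
    doc_text = ET.SubElement(body, q("office", "text"))
    for b in blocks:
        # Anchor the frame to the page so it stays where the block
        # was found in the scanned image.
        frame = ET.SubElement(doc_text, q("draw", "frame"), {
            q("text", "anchor-type"): "page",
            q("text", "anchor-page-number"): str(page),
            q("svg", "x"): "%scm" % b["x"],
            q("svg", "y"): "%scm" % b["y"],
            q("svg", "width"): "%scm" % b["w"],
            q("svg", "height"): "%scm" % b["h"],
        })
        box = ET.SubElement(frame, q("draw", "text-box"))
        para = ET.SubElement(box, q("text", "p"))
        para.text = b["text"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = [
        {"x": 2.0, "y": 3.5, "w": 10.0, "h": 1.2, "text": "नमस्ते"},
        {"x": 2.0, "y": 5.0, "w": 10.0, "h": 1.2, "text": "Second line"},
    ]
    print(build_content_xml(sample))
```

The sketch only produces the `content.xml` part. A complete OpenDocument file is a ZIP package that also needs a `mimetype` entry and a `META-INF/manifest.xml`, which are omitted here for brevity.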
1.0 INTRODUCTION
In the field of computer science, text and images are the main sources of information. A reader can understand information only if it is presented well. For example, if data is presented together with images in a well-organized manner, the reader understands it better; if the presentation is poor, the information will not be understood. The OCR process is error prone, and it is time consuming and expensive to proofread OCR results manually. The errors remaining in OCRed text can cause serious problems in reading and understanding if the text does not refer back to the original image representation. Document representation after OCR is therefore a very important task: reading and understanding the document become difficult if the output does not maintain the layout of the original image. Present systems scan the document image and place the text and images one after another without maintaining the layout [2]. This paper gives a brief discussion of the OCR process and a detailed discussion of the proposed system. The following figure gives an overview of the OCR process.
Figure 1: Flow of the OCR process
As the figure above shows, the first step of OCR is two-tone conversion, which converts the image into a binary image; skew is then detected and corrected. Noise cleaning is performed on the skew-corrected image. Text/non-text classification divides the document image into text and image regions. The image regions are extracted from the document image for further processing, and the remaining text regions are passed to text line detection. After the text lines are detected, each line is processed, the individual characters are generated, and the final output file is produced. But at this
©Informatics '09, UM 2009
Proceedings of the 3rd International Conference on Informatics and Technology, 2009
Umesh Kumar, Jagdish Raheja