
Uvika* et al. / (IJAEST) INTERNATIONAL JOURNAL OF ADVANCED ENGINEERING SCIENCES AND TECHNOLOGIES Vol No. 10, Issue No. 2, 309 - 313

SYMBOL EXTRACTION FROM DOCUMENT IMAGES USING IMAGE SEGMENTATION IN COLOR DOMAIN
Uvika
Student of Master of Technology, Department of Computer Engineering, YCOE, GuruKashi Campus, Punjabi University, Talwandi Sabo, Punjab, India
uvikataneja01@gmail.com

Sumeet Kaur
Assistant Professor, Department of Computer Engineering, YCOE, GuruKashi Campus, Punjabi University, Talwandi Sabo, Punjab, India
purbasumeet@yahoo.co.in

ISSN: 2230-7818
@ 2011 http://www.ijaest.iserp.org. All rights Reserved.

Abstract- Image segmentation is an important application of image processing. In the proposed algorithm we achieve the segmentation and extraction of symbols using a minimum spanning tree based segmentation method. This paper presents the extraction of symbols and characters from document images and reports the number of symbols extracted from the images. Symbols include all characters, and characters include all letters and numbers. The focus is on black and white images. Extraction is achieved by image segmentation in the color domain, which is why every symbol or character in a document image should be disjoint. The proposed algorithm also extracts handwritten symbols and characters from binary images. Images of text can also be taken with a high resolution camera, and symbols (including characters) can be extracted from those images.

Keywords- Image segmentation, Matlab, number of symbols extracted, symbol extraction

I. INTRODUCTION

Image segmentation and extraction is the process of dividing an image into segments and then extracting or recognizing objects from it. Image segmentation has basically two parts: color extraction and texture extraction. In the proposed algorithm, symbol extraction is done by image segmentation in the color domain only: wherever the intensity changes from white to black, the algorithm extracts each symbol and character from the binary image. Each symbol should therefore be disjoint from the others, because connected symbols are taken as one symbol. Segmentation means finding the coordinates of objects that have the same pixel intensity; cutting that part out of the image using image processing commands is called extraction.

The extraction of textual information from document images provides many useful applications in document analysis and understanding, such as optical character recognition, document retrieval, and compression. Document image segmentation is an important component of document image understanding. The extraction of text from an image is a classical problem in computer vision. However, variation of text in size, style, orientation and alignment, together with low image contrast and complex backgrounds, makes automatic text extraction extremely challenging. Sometimes the characters in a text have different shapes and structures. The images may also contain noise and have a complex structure, which makes extraction more difficult [3].

In the proposed algorithm I have used the minimum spanning tree based segmentation method. A Minimum Spanning Tree (MST) is a minimum-weight, cycle-free subset of a graph's edges such that all nodes are connected. The possibility of stitching together independent subimages motivates adding connectivity information to the pixels: the image can be viewed as a graph whose nodes are pixels and whose edges represent connections between pixels. This method is used to show multiple disjoint symbols which collectively cover the entire image [7].

Symbol extraction is the process of extracting each symbol, which includes letters, numbers and other symbols, from document images. The purpose of symbol extraction from document images using image segmentation in the color domain is to extract every symbol from a document image and then count the number of symbols extracted. The focus is on black and white images. These include scanned images and live pictures of documents (which may also be handwritten), from which the symbols are extracted and the number of symbols or characters is counted. For this purpose a high resolution camera must be used; in the proposed algorithm I have used a 16 megapixel camera.

II. PROPOSED SCHEME

In this section the proposed method for extraction is given. In the proposed scheme the work is divided into three parts:
A. Binarization
B. Segmentation
C. Extraction
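The graph view of an image used by minimum spanning tree based segmentation, described in the Introduction above, can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation. The function name `mst_segments`, the 4-neighbour grid graph, the absolute intensity difference as edge weight, and the `max_weight` cut-off are all assumptions made for illustration. Kruskal's algorithm grows a spanning forest and never accepts an edge heavier than `max_weight`, so each remaining tree is one segment.

```python
def mst_segments(image, max_weight=0):
    """Segment a 2-D intensity image with a Kruskal-style spanning forest.

    Edges heavier than max_weight are never added, so every tree of the
    forest is one segment. Returns (label map, number of segments).
    """
    h, w = len(image), len(image[0])
    parent = list(range(h * w))  # union-find forest over pixel indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # 4-neighbour grid graph (assumed): weight = absolute intensity difference
    edges = []
    for r in range(h):
        for c in range(w):
            if c + 1 < w:
                edges.append((abs(image[r][c] - image[r][c + 1]),
                              r * w + c, r * w + c + 1))
            if r + 1 < h:
                edges.append((abs(image[r][c] - image[r + 1][c]),
                              r * w + c, (r + 1) * w + c))

    # Kruskal: visit edges by increasing weight, join trees, stop at the cut-off
    for weight, a, b in sorted(edges):
        if weight > max_weight:
            break
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    roots = {}
    labels = [[roots.setdefault(find(r * w + c), len(roots) + 1)
               for c in range(w)] for r in range(h)]
    return labels, len(roots)


# Inverted binary image: 1 = ink, 0 = paper; two disjoint vertical strokes
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
labels, n_segments = mst_segments(image)
symbol_labels = {labels[r][c] for r in range(4) for c in range(5) if image[r][c] == 1}
print(n_segments, len(symbol_labels))  # background plus two strokes
```

In this sketch the background also forms a segment, so the number of symbols is the number of segments that contain ink pixels.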

Work flow Model of the Proposed Scheme

A. In the first phase we perform binarization of the RGB image, that is, the algorithm converts the RGB image (the document containing symbols and characters) into a binary image.
1. Read the RGB image, i.e. the document.
2. Convert it into a gray image.
3. Convert the gray image into a binary image using a suitable threshold value (100 has been used).
4. Invert this binary image.
5. Remove all connected components (symbols) that have fewer than 30 pixels from the binary image by using the morphological open filter for binary images.
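The binarization steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' MATLAB code: the function names are hypothetical, the gray conversion uses the common ITU-R BT.601 luma weights (the paper does not state which weights were used), and small components are found with 8-connectivity. The threshold of 100 and the 30-pixel limit are the values given in the text.

```python
from collections import deque


def to_gray(rgb):
    """Convert rows of (r, g, b) tuples (0-255) to gray levels (assumed BT.601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb]


def binarize_inverted(gray, threshold=100):
    """Threshold and invert in one pass: dark ink becomes 1, bright paper becomes 0."""
    return [[0 if v > threshold else 1 for v in row] for row in gray]


def remove_small_components(binary, min_pixels=30):
    """Clear every 8-connected foreground component with fewer than min_pixels pixels."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if out[r][c] != 1 or seen[r][c]:
                continue
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and out[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(component) < min_pixels:
                for y, x in component:
                    out[y][x] = 0
    return out


# Toy document: a 36-pixel black square (a "symbol") and a 1-pixel speck of noise
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
rgb = [[WHITE] * 10 for _ in range(8)]
for r in range(1, 7):
    for c in range(1, 7):
        rgb[r][c] = BLACK
rgb[0][9] = BLACK

clean = remove_small_components(binarize_inverted(to_gray(rgb)))
print(sum(map(sum, clean)))  # number of foreground pixels left after filtering
```

Only the square survives the filter, because the speck has fewer than 30 pixels.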

B. The second phase is segmentation, which defines the coordinates of the objects in the document image.
1. Convert the gray image into RGB format.
2. Label the inverted original image:
bwlabel(~(sel_img));
3. Find the pixels of each labeled object. Take two matrices x, y and obtain the minimum and maximum of each using the find function:
[x_mat y_mat]=find(img == i);
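The labeling and coordinate search of the segmentation phase can be sketched in Python as follows. This is a rough analogue of the `bwlabel` and `find` calls quoted above, not a reproduction of them: the function names `label_components` and `bounding_boxes` are hypothetical, 8-connectivity is assumed, and labels are assigned in row-by-row scan order.

```python
from collections import deque


def label_components(binary):
    """Label 8-connected foreground (1) regions 1, 2, ...; background stays 0."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if binary[r][c] != 1 or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # flood fill: give every reachable pixel the same label
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


def bounding_boxes(labels, count):
    """For each label i, return (row_min, row_max, col_min, col_max)."""
    boxes = {}
    for i in range(1, count + 1):
        rows = [r for r, row in enumerate(labels) for v in row if v == i]
        cols = [c for row in labels for c, v in enumerate(row) if v == i]
        boxes[i] = (min(rows), max(rows), min(cols), max(cols))
    return boxes


# Inverted binary image (1 = ink): a 2x2 block and a vertical stroke
binary = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
labels, count = label_components(binary)
boxes = bounding_boxes(labels, count)
print(count, boxes)
```

The minimum and maximum row and column of each label play the role of the min and max taken over the two matrices in step 3.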


C. The final phase is the extraction algorithm, which extracts the symbols and characters from the document image and counts each one.
1. Use the same matrices x, y to plot the bounding box on each symbol.
2. To draw a colored box, convert the binary image into RGB format, then plot a red, green or blue box on the symbols by changing the color value.
3. Show each symbol and character separately by using the matrices: x_min : x_max, y_min : y_max
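The extraction steps above can be sketched in Python as follows. The sketch assumes that the bounding boxes have already been computed as (row_min, row_max, col_min, col_max) tuples; the function names and the box format are hypothetical, and the box is drawn directly on the border pixels of the bounding rectangle rather than plotted as an overlay.

```python
def to_rgb(binary):
    """Convert a binary image to RGB so that a colored box can be drawn on it."""
    return [[(255, 255, 255) if v else (0, 0, 0) for v in row] for row in binary]


def draw_box(rgb, box, color=(255, 0, 0)):
    """Paint the border of the bounding box in the given color (red by default)."""
    r0, r1, c0, c1 = box
    for c in range(c0, c1 + 1):
        rgb[r0][c] = color
        rgb[r1][c] = color
    for r in range(r0, r1 + 1):
        rgb[r][c0] = color
        rgb[r][c1] = color
    return rgb


def crop_symbol(binary, box):
    """Cut one symbol out of the image: rows r0..r1 and columns c0..c1, inclusive."""
    r0, r1, c0, c1 = box
    return [row[c0:c1 + 1] for row in binary[r0:r1 + 1]]


# Inverted binary image (1 = ink) and the boxes assumed from the segmentation phase
binary = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
boxes = [(0, 1, 0, 1), (1, 3, 3, 3)]

rgb = to_rgb(binary)
for box in boxes:
    draw_box(rgb, box)

symbols = [crop_symbol(binary, box) for box in boxes]
print(len(symbols))  # number of symbols extracted
```

Changing the `color` argument to (0, 255, 0) or (0, 0, 255) gives a green or blue box instead of a red one.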

Figure 1 Original Image

Figure 2 Segmented Image

We have applied the proposed algorithm to three standard images, which are presented in the figures; each original image is followed by its segmented image. Figure 1, Figure 3 and Figure 5 are the original document image, the scanned handwritten image and the captured handwritten image respectively. Figure 2, Figure 4 and Figure 6 are the resulting segmented images.

Figure 3 Original Scanned Handwritten Image

Figure 4 Segmented Image

Figure 6 Segmented Image

III. EXPERIMENTAL RESULTS

The algorithm is implemented using MATLAB. We have taken RGB document images of different font sizes. The document images generally use font sizes 24, 26, 28 and 36, and the Arial Black font has been used. The extraction rate of the symbols is calculated by dividing the total number of extracted symbols by the total number of symbols in the image.

Images               Font Size   Font Type     Extraction Rate
Handwritten Images   Large       -             100%
Document Images      28          Arial black   97%
Document Images      26          Arial black   90%
Document Images      24          Arial black   89%

IV. CONCLUSION

In this paper the extraction of symbols and characters from document images and handwritten images is presented, in which all the symbols should be disjoint. The major sources of error were symbols like % and =: because their parts are disconnected, the algorithm takes % as three different symbols and = as two symbols. Black text printed on a white sheet is preferred for a better extraction rate. No extra lighting effects should be present while capturing images with a camera. Our future work is directed towards segmentation and extraction of symbols from RGB images of documents that contain other objects in addition to symbols.
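The error source named in the conclusion, where % is taken as three symbols and = as two, can be reproduced with a small Python sketch. The glyph bitmaps and the function name `count_components` are made up for illustration, and 8-connectivity is assumed; the point is only that a connected-component count treats each disconnected part as its own symbol.

```python
def count_components(binary):
    """Count 8-connected foreground (1) regions in a binary image."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            stack = [(r, c)]
            while stack:  # depth-first flood fill of one component
                y, x = stack.pop()
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if binary[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count


def to_binary(rows):
    """Turn '#'/'.' text art into a 0/1 image ('#' = ink)."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


# Hypothetical glyph bitmaps: each is ONE symbol to a human reader
EQUALS = to_binary([
    "......",
    ".####.",
    "......",
    ".####.",
    "......",
])
PERCENT = to_binary([
    "##....#",
    "##...#.",
    "....#..",
    "...#...",
    "..#....",
    ".#...##",
    "#....##",
])

print(count_components(EQUALS))   # 2: the two bars are disconnected
print(count_components(PERCENT))  # 3: two dots and the slash
```

Any symbol made of disjoint parts is over-counted in the same way, which is why the method requires each symbol to be a single connected shape.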


Figure 5 Original Captured Handwritten Image

Table 1 Experimental Results

V. REFERENCES

1. Phalgun Pandya, Mandeep Singh, "Morphology Based Approach To Recognize Number Plates in India", International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume-1, Issue-3, July 2011.
2. Zhe Wang, Yue Lu, Chew Lim Tan, "Word Extraction Using Area Voronoi Diagram", Department of Computer Science, School of Computing, National University of Singapore, Kent Ridge, Singapore, 2009.
3. G. Rama Mohan Babu, P. Srimaiyee, A. Srikrishna, "Text Extraction from Hetrogenous Images Using Mathematical Morphology", Journal of Theoretical and Applied Information Technology, 2005 - 2010 JATIT.
4. Aryuanto, Koichi Yamada, F. Yudi Limpraptono, "Color Segmentation for Extracting Symbols and Characters of Road Sign Images", Department of Electrical Engineering, Institut Teknologi Nasional (ITN) Malang, Indonesia; Department of Management Information Systems Science, Nagaoka University of Technology, Japan.
5. Satadal Saha, Subhadip Basu, Mita Nasipuri and Dipak Kr. Basu, "A Hough Transform based Technique for Text Segmentation", Journal of Computing, Volume 2, Issue 2, February 2010, ISSN 2151-9617.
6. Character recognition overview, http://www.cs.berkeley.edu/~fateman/kathey/char_recognition.html
7. Minimum spanning tree-based segmentation, http://en.wikipedia.org/wiki/Minimum_spanning_tree-based_segmentation