
Proc. of ICCPOL '99, pp.171-176, March, 1999. ICCPOL.

* This is an extended version of the original conference paper.

Word Extraction in Text/Graphic Mixed Image Using 3-Dimensional Graph Model

Hwan-chul Park, Se-young Ok, Hwan-gue Cho


Graphics Application Lab.,

Department of Computer Science, Pusan National University, Kum-Jung-Ku, Pusan 609-735, Korea. Tel. 82-51-582-5009, Fax. 82-51-515-2208.
E-mail:
{hcpark,seok,hgcho}@pearl.cs.pusan.ac.kr



Abstract
Automatic text location, character recognition, and image understanding of a given paper document are main objectives in the computer vision area. The first stage for these problems is extracting text information and separating graphic symbols from texts. Previous text location algorithms could not extract negative text (e.g., a newspaper headline), which is white-colored text on a solid background color plane. Also, they could extract only horizontal or vertical text in a document, so inclined text or text on a circular arc could not be located by previous works. In this paper, we propose a new method for extracting these negative texts, as well as normal texts, from a text/graphics mixed document image. We also propose a new word grouping method for texts that intersect each other or are placed on a circular arc or an inclined line segment with an arbitrary orientation. The basic strategy of our algorithm is based on frequency analysis of the run-length encoded file of the image segment. Generally, the number of runs in the run-length encoding of a text (character) is smaller than that of a graphic symbol, and the average and variance of the number of runs give a nice characterization of symbols and texts. After isolating each letter in a document file, we need to group the related letters into a word. This procedure is crucial for automatic document processing, since the unit of the final output of document processing should be a word or a statement. For this procedure, we propose a 3-dimensional neighborhood graph for grouping words and statements from the set of isolated letters obtained in the first, letter-isolating phase. This graph maps each letter in a document to a vertex in 3-dimensional space according to the size of that letter. Experimental results show that more than 97% of words were successfully extracted from text/graphics mixed documents including negative texts.
This result shows the usefulness of our character isolating algorithm and our 3-dimensional graph mapping for documents with oriental characters.

Keywords:

Document analysis, Pattern recognition, Text extraction, Image processing

1 Introduction
1.1 Related works
The separation of text strings from a document image is one of the difficult tasks in digital image processing, e.g., in traditional OCR. There is a growing need to convert new and existing newspaper documents into electronic documents for better archiving, retrieval, and maintenance, driven by the development of networks. In general, when documents such as newspapers or magazines are converted into an electronic version, the procedure is carried out manually, but this is an ineffective and time-consuming process. Thus an automated text string extraction system is very desirable. We propose a novel algorithm that automatically separates text/words from non-text elements in a document image file. There have been many previous works on automated analysis of text/graphics mixed document images such as engineering drawings, newspapers, and magazines [1-4]. These works fall into two major areas: one classifies text strings and graphical components in a document image; the other extracts specific symbols from a scene image. Fletcher described a method which uses the structural information of the connected components in a text/graphics mixed document image [1], and the Hough transform was applied in order to group characters into text strings [1]. Other methods regard text information as textured objects and use well-known texture analysis techniques such as Gabor filtering, but this approach is sensitive to font size and style, and is time-consuming [13]. Tan proposed a model, namely the pyramid, to extract text strings by applying various image resolutions. The pyramid model helps to identify and locate words or phrases efficiently and quickly, but when characters must be grouped into words, it was reported that the final efficiency decreases [11]. Recently Kamel and Ohya proposed methods which separate text areas using grey-scale information in document images with mixed background, shadow, and highlight images.
These techniques involve a filtering method, an interpolation scheme, and adaptive thresholds using the gray level of pixels [5, 6].

Figure 1: (a) The structure of a Korean character: boxes for the first, second, and third characters over a base line. (b), (c) Two example letters, /i:m/ and /win/. [I] is the first character, [II] is the second character (vowel), and [III] is the third character.

1.2 Research problem


The text separation procedure becomes very difficult when a document contains various fonts and sizes of text and the text orientation is allowed to be arbitrary. See Fig.3 for a typical intermixed document. There has also been little research on extracting negative text strings such as newspaper headlines. Negative text means text whose inner and boundary colors (the text region) differ from those of general text. Thus, for negative texts mixed with text/graphical components, we need a new approach to solve the problem. Especially for the typical oriental languages such as Korean, Japanese, and Chinese, this task is very hard, since the basic structure of these languages is quite different from western languages; that is, oriental letters are composed of several disconnected strokes. There are three major technical difficulties in extracting text strings from a text/graphics mixed oriental document image. There are various sizes of characters, and the inter-letter and inter-word distances also vary. Text strings can be placed in any direction, so we do not

Figure 2: One example of oriental characters: (a) Korean, (b) Japanese, (c) Chinese; N is the number of disconnected strokes.

assume the orientation of each text in advance. Normally, every oriental letter consists of multiple disconnected strokes. Fig.2 shows examples of oriental letters. Fig.1 shows the general structure of a Korean letter, where each small box contains a consonant or a vowel. Fig.3 shows a document with a negative text (top) and text and graphical components (middle and bottom). In the English alphabet, every letter is one connected segment except "i" and "j", while most Chinese, Japanese, and Korean letters are composed of three or more disconnected strokes. So in order to process oriental letters, the multiple strokes must first be grouped into single letters.

2 Text extraction
The text in brochure images and newspapers is allowed to have various fonts, sizes, styles, and spacing.

2.1 Connected Components Extraction


In order to isolate the character set, we first have to find the connected components in a document image. A connected component is computed using an 8-connected region generation algorithm [1]. Let CC_i denote a connected component (a pixel cluster). After finding all connected components, we compute the following attributes:

Figure 3: A document with negative texts, mixed text image, and graphic symbols

- Box(CC_i) = <W_i, H_i>: the bounding box of connected component CC_i, with width W_i and height H_i.
- Pixel(CC_i): the number of pixels in CC_i, and Pr_i = Pixel(CC_i)/Box(CC_i), the ratio of the number of pixels to the bounding-box area.
- Wr_i: the aspect ratio of the bounding box (Wr_i = W_i/H_i).

If the number of pixels in a connected component is less than a threshold value c_0, we regard it as a noise segment and remove it from the document image. The threshold constant c_0 = 6 was determined by several experiments. We classify all connected components into three classes: text, negative text, and graphic symbol (picture).
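As a concrete illustration of this stage, the component labeling and attribute computation can be sketched in Python (the paper gives no code; the function names and data layout here are ours):

```python
from collections import deque

def connected_components(img):
    """8-connected component labeling on a binary image (list of 0/1 rows).
    Returns a list of components, each a list of (row, col) pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                comp, q = [], deque([(r, c)])
                seen[r][c] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    # visit all 8 neighbours
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                q.append((ny, nx))
                comps.append(comp)
    return comps

def attributes(comp):
    """Bounding-box width/height, pixel ratio Pr_i and aspect ratio Wr_i
    for one connected component."""
    ys = [p[0] for p in comp]
    xs = [p[1] for p in comp]
    W = max(xs) - min(xs) + 1
    H = max(ys) - min(ys) + 1
    Pr = len(comp) / (W * H)   # pixel-to-bounding-box ratio
    Wr = W / H                 # bounding-box aspect ratio
    return W, H, Pr, Wr
```

Components with fewer than c_0 = 6 pixels would then simply be dropped from the list before classification.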

2.2 Classification of a Connected Component


General documents such as brochures and newspapers have many different types of texts, lines, and figures. In this section, we describe an extraction technique to determine the type of a connected component. Without character recognition capabilities, it is not easy to distinguish characters from non-characters by the size of a connected component alone. There

(a) a consonant /giyok/

(c) a text /park/

(b) a vowel /a/

(d) Picture

Figure 4: Comparison of the number of runs in text and picture

are two technical difficulties. One is that some graphic symbols and letters are very similar in terms of bounding-box size. The other is that we cannot tell a graphic symbol from negative letters, since the size and the number of pixels of those components are nearly equal. Without loss of generality, we can assume the connected component of a graphical symbol is larger than that of text in a document image. But in the case of negative text, the connected component of the negative text is similar to that of a picture. The spatial configuration of the bounding boxes of connected components can be obtained by projecting them onto a straight line; the projection onto a vertical straight line is called the horizontal projection. Here we assume that the connected component image is encoded by the run-length method. We define R as a binary sequence and let N(R) denote the number of runs of 1s in R. So if R is "11101101011", N(R) is 4; if R is "1111000011111", N(R) is 2. Fig.4 shows the horizontally projected sample profiles of a consonant, a vowel, a text, and a picture. The right bar of each of Fig.4(a),(b),(c),(d) shows the number of runs for the corresponding horizontal line (binary sequence). N(R) in each of Fig.4(a),(b), and (c) appears regular, but N(R) in Fig.4(d) is irregular. So we can see that the variance of N(R) of graphic symbols is larger than that of texts. In order to separate negative text

from pictures, we can apply the same analysis. First, we check the Wr_i of a connected component: if Wr_i is larger than a threshold, the component is tentatively regarded as a negative text. Then we can classify negative text accurately by N(R), its average N̄(R), and its variance N̂(R). As with positive text, the N̂(R) of a negative text is small. We now describe the classification procedure that determines whether a given component is a positive letter, a negative letter, or a picture.
Algorithm 1: Classification of a connected component

Input:  a connected component CC_i
Output: 1 (normal text), 2 (negative text), 3 (picture)

1. For each connected component, compute the run-length information of
   its X-axis projection profile:
   (a) N(R_j) = the number of runs in the j-th line of Box(CC_i) in the
       X-axis direction.
   (b) Compute the average N̄(R) and the variance N̂(R).
2. IF Wr_i >= t_0 THEN           /* t_0, l_0, k_0 are threshold constants */
       IF Pr_i >= l_0 AND N̂(R) <= k_0 THEN
           type(CC_i) = Negative Text;
           reverse the enclosed area and extract again;
           return 2
       ELSE
           type(CC_i) = Picture; delete CC_i; return 3
   ELSE IF N̂(R) <= k_0 THEN
       type(CC_i) = Normal Text; return 1
   ELSE
       type(CC_i) = Picture; delete CC_i; return 3

Algorithm 1. Text extraction algorithm
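A direct transcription of this decision structure in Python might look as follows. The threshold values t0, l0, k0 below are illustrative placeholders, not the paper's experimentally tuned constants, and the function name is ours:

```python
def classify(Wr, Pr, avg_runs, var_runs, t0=3.0, l0=0.5, k0=5.0):
    """Decision structure of Algorithm 1.
    Wr: bounding-box aspect ratio; Pr: pixel-to-box ratio;
    avg_runs / var_runs: average and variance of N(R) over the rows.
    Returns 1 (normal text), 2 (negative text), or 3 (picture)."""
    if Wr >= t0:                          # wide box: negative text or picture
        if Pr >= l0 and var_runs <= k0:   # dense, regular runs -> negative text
            return 2
        return 3                          # otherwise treat as picture
    if var_runs <= k0:                    # regular run profile -> normal text
        return 1
    return 3                              # irregular run profile -> picture
```

With these placeholder thresholds, the profile numbers quoted for Fig.4 (N̄ = 3, N̂ = 2 for a letter; N̄ = 15, N̂ = 40 for a picture) land in the text and picture branches respectively.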

t_0, l_0, and k_0 are threshold constants which were set by experiment.

(a) Original test image   (b) Text strings extracted

Figure 5: The result of extracted texts using our algorithm

The average and variance of N(R) are calculated as follows:

    N̄(R)_i = (1/H_i) Σ_{j=1}^{H_i} N(R)_j

    N̂(R)_i = (1/H_i) Σ_{j=1}^{H_i} (N(R)_j − N̄(R)_i)²

where N(R)_j is the number of runs in the j-th line of Box(CC_i), 1 <= j <= H_i, and H_i is the height of the connected component CC_i. Fig.4(c) shows one Korean letter. If N̂(R)_i is larger than a given threshold, the connected component is regarded as a graphic component or picture. In Fig.4(c), N̄(R)_i is 3 and N̂(R)_i is 2, but in Fig.4(d), N̄(R)_i is 15 and N̂(R)_i is 40. Also, the N̂(R) of a negative text is small. As the result of profiling,
we could successfully classify components into texts, negative texts, and graphic symbols by N(R), N̄(R), and N̂(R). Fig.5(a) is a text/graphics mixed document image and Fig.5(b) shows the set of letters extracted by our algorithm. In Fig.5(a), three negative text regions (including the "O'REILLY" and "VBScript" headlines) are classified as negative texts, and these negative texts are recovered into general (positive) texts in Fig.5(b) by our algorithm.
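The run statistics N(R), N̄(R), and N̂(R) used above can be sketched as follows. Runs of foreground 1-pixels are counted, consistent with the "1111000011111" → 2 example; the function names are ours:

```python
def n_runs(row):
    """N(R): the number of runs of 1s in a binary row (list of 0/1)."""
    runs, prev = 0, 0
    for v in row:
        if v and not prev:   # a run starts on every 0 -> 1 transition
            runs += 1
        prev = v
    return runs

def run_stats(box_rows):
    """Average and variance of N(R) over all rows of a bounding box,
    matching the formulas for N-bar(R)_i and N-hat(R)_i above."""
    ns = [n_runs(r) for r in box_rows]
    avg = sum(ns) / len(ns)
    var = sum((n - avg) ** 2 for n in ns) / len(ns)
    return avg, var
```

Regular text rows give nearly constant run counts and hence a small variance, while picture rows give widely varying counts, which is exactly the separating signal the classifier relies on.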

3 Word Grouping
Till this section we could isolate basic component(consonant or vowel) from a text/graphics mixed document image. Now in this section, we present a new technique for word grouping from the extracted characters. Since Korean, Japanese, Chinese letter consists of several disconnected strokes, this character grouping is required from the set of disconnected strokes before word grouping. For example, a Korean letter \M" has 6 disconnected segments. The basic assumption of our stroke grouping algorithm is that the circumscribing circle of a stroke is intersecting other strokes in a letter.

3.1 Single Character Grouping


We can easily observe that the distance between strokes within a letter is smaller than the distance between letters in a word. So we propose a circle-based grouping technique for a single character. The procedure works in the following steps. Let s_i denote a single stroke in an oriental character.

1. Compute the centroid point cp_i of each stroke s_i.
2. Compute the minimal enclosing circle C_i for each s_i.
3. For all i, check if C_i intersects another stroke s_k or circle C_k, where 1 <= k <= n. If C_a intersects C_b or s_b, then we put s_a and s_b into the same stroke group for one character.

We repeat this grouping step until there are no more intersections among the C_i. Fig.6 shows one Korean letter which consists of four disconnected strokes. These four strokes can be grouped into a single character using our circle-based procedure; the dotted circles denote the enclosing circles C_i for the s_i. Fig.6(a) shows that strokes II, III, and IV intersect the circle C_1 of stroke s_1, so we can group the four strokes into one letter as shown in Fig.6(b).
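The steps above can be sketched as a small Python routine. For brevity the enclosing circle is approximated here by the centroid and the farthest point, not a true minimal enclosing circle (Welzl's algorithm could be substituted), and circle-circle intersection stands in for the circle-stroke test; all names are ours:

```python
import math

def enclosing_circle(stroke):
    """Approximate enclosing circle of a stroke (list of (x, y) points):
    centroid as center, farthest point as radius."""
    cx = sum(p[0] for p in stroke) / len(stroke)
    cy = sum(p[1] for p in stroke) / len(stroke)
    r = max(math.hypot(p[0] - cx, p[1] - cy) for p in stroke)
    return (cx, cy), r

def group_strokes(strokes):
    """Union strokes whose circles intersect, in the spirit of the
    circle-based single-character grouping. Returns groups of indices."""
    n = len(strokes)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    circles = [enclosing_circle(s) for s in strokes]
    for i in range(n):
        (ci, ri) = circles[i]
        for j in range(i + 1, n):
            (cj, rj) = circles[j]
            # circles intersect when center distance <= sum of radii
            if math.hypot(ci[0] - cj[0], ci[1] - cj[1]) <= ri + rj:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Strokes whose circles overlap end up in one group (one character), while a distant stroke stays in its own group.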


Figure 6: Circle-based grouping procedure: (a) a Korean letter with four disconnected strokes (strokes I-IV, with the centroid and searching region of each stroke shown); (b) the finally grouped letter from (a).

3.2 Word Grouping using 3-D neighborhood graph


Now we are ready to group words from the set of singly isolated characters obtained in the previous subsection. For this we propose a new 3-D neighborhood graph model to group characters which differ in size, font, and orientation. By constructing a 3-D neighborhood graph, we can obtain more contextual and geometrical information about the characters. After the character isolating phase, the extracted characters have many different sizes. When we look at a document, we can identify words easily, since the characters in a word are separated uniformly along a baseline or circle and their orientation is constant. The basic idea of our word grouping procedure is to map each character into a 3-dimensional box: a larger character is mapped to the upper space, keeping it far away from smaller characters in 3-D space. This mapping philosophy forces sets of characters of nearly the same size to lie in a horizontal plane, so big characters rise to an upper plane and smaller ones sink toward the bottom.


(a) Test Image

(b) Single character grouping of (a)

Figure 7: Single character grouping


3.2.1 Generation of 3-D neighborhood graph

As illustrated in Fig.3(a) and Fig.7(a), characters of various sizes are printed in a document, and their orientations also differ from each other. Some words are placed on a curve rather than a straight line, so previous word grouping algorithms could not handle this kind of problem easily. In Fig.7(a), we can see that a statement of six smaller words cuts across a statement of three relatively larger words. The main idea of the 3-D neighborhood graph model, G_{3D}(V, E), is that we map characters of several sizes into 3-dimensional space and add edges by considering the distance between two mapped characters (vertices). Let us define this formally. Let c_i denote a single character in a document. We map c_i to a vertex v_i of G_{3D}(V, E), and v_i has the 3-dimensional coordinates (x_i, y_i, z_i). Since (x_i, y_i) is the centroid of c_i in the document coordinates, we only need to assign the height value z_i for c_i. Briefly stated, if the size of c_i is big (small), then it is placed higher (lower) in (x, y, z) space. This placement strategy prevents interference between the intermixed characters of several sizes. Let Area(c_i) denote the area of the minimal enclosing circle of c_i, that is, the area of C_i; p_0 is an adjusting constant to control the z_i height in 3-D space.

V = {(x_i, y_i, z_i)},    z_i = p_0 · Area(c_i) / ā

where (x_i, y_i) is the centroid of c_i in the document and

    ā = (1/n) Σ_{k=1}^{n} Area(c_k)

    σ² = (1/n) Σ_{i=1}^{n} (Area(c_i) − ā)²

(a) Word grouping for Fig.7(a)   (b) 3-D graph of (a)

Figure 8: Generation of the 3-D graph for Fig.7(a)

Now we need to define the edge set E of G_{3D}(V, E) on the vertex set {(x_i, y_i, z_i)}. For each v_i ∈ V, we add an edge (v_i, v_j) if v_j is the nearest vertex to v_i in (x, y, z) space. In other words, every vertex has an edge connection to its nearest vertex, so the degree of each vertex after this edge assignment is typically 1 or 2. This graph guarantees that every pair of mutually closest vertices is connected by an edge. Fig.7(b) shows an example of circle-based grouping for Fig.7(a). In Fig.8(b), to aid understanding, the higher the z_i value of c_i, the bigger the vertex of G_{3D}(V, E) is drawn.
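The graph construction described above can be sketched as follows. The exact z-scaling in the paper is not fully specified, so normalizing each character's area by the mean area ā is one plausible reading; all names are ours:

```python
import math

def build_3d_graph(chars, p0=1.0):
    """chars: list of (x, y, area) tuples, one per character.
    Maps each character to (x, y, z) with z proportional to its
    enclosing-circle area relative to the mean area, then connects
    each vertex to its nearest neighbour in 3-D.
    Returns (vertices, sorted undirected edge list)."""
    mean_area = sum(a for _, _, a in chars) / len(chars)
    verts = [(x, y, p0 * a / mean_area) for x, y, a in chars]

    edges = set()
    for i, v in enumerate(verts):
        best, best_d = None, float("inf")
        for j, u in enumerate(verts):
            if i != j:
                d = math.dist(v, u)     # Euclidean distance in 3-D
                if d < best_d:
                    best, best_d = j, d
        edges.add(tuple(sorted((i, best))))  # undirected edge to nearest
    return verts, sorted(edges)
```

Because z grows with character size, a headline character and a nearby body-text character end up far apart in 3-D even when their (x, y) centroids are close, which is exactly the interference-prevention effect described above.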
3.2.2 Word grouping in G_{3D}(V, E)

The goal of this procedure is to find the closely related subgraphs (words) so that the members of the character set can be grouped into meaningful character strings (words). First, we traverse a given G_{3D} in depth-first-search manner in order to cluster. Since characters in a word are commonly placed linearly and uniformly in documents, we assume that the centroids of the characters in a word are placed nearly collinearly. Thus we compute the degree of linearity by examining angles along a path (a sequence of adjacent edges). Let θ_i be the angle between two successive edges, namely (v_{i−1}, v_i) and (v_i, v_{i+1}); that is, θ_i = ∠v_{i−1}v_iv_{i+1}.

(a) Top view of 3-D graph of Fig.8(b)

(b) Another view of 3-D graph

Figure 9: The 3-D graph of Fig.8(b) viewed from different directions

Then for v_i, we find the maximal index k satisfying the following constraint:

    (1/k) Σ_{i=1}^{k} (θ_i − θ_avg)² <= ε_0,   where   θ_avg = (1/(k−1)) Σ_{j=1}^{k−1} θ_j

and ε_0 is a threshold constant for linearity. Although the characters in a word are aligned strictly on a base line, they have different types and numbers of strokes, so the centroids cannot be placed on a straight line exactly; therefore ε_0 cannot be exactly 0. Finally, we traverse each subgraph and calculate a local threshold from the lengths at all nodes in the subgraph, removing edges whose length exceeds this local threshold. Fig.8(b) shows the final 3-D neighborhood graph corresponding to Fig.8(a). Fig.9(a) and (b) show the 3-D neighborhood graph of Fig.8(b) from different viewpoints.
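The linearity test can be sketched as follows. For simplicity this sketch averages the angles over the same prefix it scores (the paper averages over the first k−1 angles), and the threshold ε_0 below is an illustrative placeholder; all names are ours:

```python
import math

def angle(a, b, c):
    """Angle at vertex b formed by 3-D points a-b-c, in radians."""
    v1 = (a[0] - b[0], a[1] - b[1], a[2] - b[2])
    v2 = (c[0] - b[0], c[1] - b[1], c[2] - b[2])
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # clamp to [-1, 1] to guard against floating-point drift
    return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))

def linear_prefix(path, eps0=0.05):
    """Length (in angles) of the longest prefix of a vertex path whose
    consecutive-edge angles stay close to their average (variance <= eps0).
    A straight run of characters scores the whole path; a sharp bend
    cuts the prefix short."""
    thetas = [angle(path[i - 1], path[i], path[i + 1])
              for i in range(1, len(path) - 1)]
    best = 0
    for k in range(1, len(thetas) + 1):
        avg = sum(thetas[:k]) / k
        var = sum((t - avg) ** 2 for t in thetas[:k]) / k
        if var <= eps0:
            best = k
    return best
```

On a collinear path every interior angle is π, the variance is 0, and the whole path passes; a 90-degree bend raises the variance past any small ε_0 and stops the prefix at the bend.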

4 Experiments
The performance of the proposed algorithm has been evaluated using 5 sample document images obtained from books, brochures, and magazines. These documents were scanned to raster images at 1300 by 1300 pixel resolution. Every test image has the following characteristic features:

(a)

(b)

Figure 10: (a) A document image (b) Text strings extracted from (a)

- It has various text fonts and sizes.
- It has texts with various orientations.
- It may contain negative texts.

Fig.10(a) shows an original document image. Some texts are placed in a circular form, and texts and symbols (figures) are included in the middle black box in a negative form. The spatial resolution of Fig.10(a) is 760 × 488. Fig.10(b) shows the final result of our character extraction procedure. Note that every character is separated in Fig.10(b), and some symbols and negative text were recovered to the positive form, i.e., a normal text image. Fig.11(a) shows the result of word grouping from Fig.10(b); all words are denoted as streams of characters threaded by solid lines. The statement at the bottom of Fig.11(a), designated by a solid line, was cleanly grouped into a meaningful word. Fig.11(b) shows the corresponding 3-D neighborhood graph of Fig.11(a), viewed from top to bottom (the x-y plane). Fig.11(b) shows that the large letters (denoted as big balls) are placed in relatively higher positions.

(a)

(b)

Figure 11: (a) Grouped words of Fig.10(b) (b) The corresponding 3-D graph of (a)

(a)

(b)

Figure 12: (a) A document image (b) Text strings extracted from (a)

Figure 13: Word grouping from Fig.12: (a) word grouping results from Fig.12(b); (b) 3-D graph of document (a) (top view)

(a) 3-D graph of Fig.11

(b) 3-D graph of Fig.13

Figure 14: 3-D graphs viewed from the z-axis

Name    SIZE         ESRR    EFRR    WSWG    WFWG    Nchar   Nword
DATA1   704 x 961    97.2%   2.8%    97.5%   1.8%    467     107
DATA2   800 x 1000   99.5%   0.3%    98.2%   0.4%    100     40
DATA3   667 x 953    96.9%   1.9%    98.9%   1.09%   512     141
DATA4   700 x 962    97.0%   2.3%    98.5%   1.2%    606     197
DATA5   760 x 980    98.5%   1.2%    99.2%   0.2%    203     47
Table 1: Testing results

Table 1 shows the test results for the 5 sample documents. In this table, ESRR is the Successful Recognition Ratio, calculated as ESRR = (the number of identified characters) / (the number of characters in a document). EFRR denotes the False Recognition Ratio for all characters in a document: EFRR = (the number of image components identified as characters that are not characters) / (the number of characters in a document). Similarly, WSWG is the Successful Word Grouping ratio and WFWG is the False Word Grouping ratio. Nchar (Nword) is the number of characters (words) in a document image. Fig.13 shows that some graphic symbols were removed and negative texts were recovered in positive form. Fig.12(a) shows another test document, which includes some handwritten-font letters with arbitrary orientation. Fig.12(a) is the original image and Fig.12(b) shows the text strings extracted from it. Fig.13(a) shows the word grouping using the 3-D graph and Fig.13(b) shows the corresponding 3-D graph for Fig.13(a). Fig.14 shows two further 3-D graphs viewed from the z-axis.
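The four ratios defined above can be restated as a small sketch (function and parameter names are ours):

```python
def evaluation_ratios(correct_chars, false_chars, n_chars,
                      grouped_words, false_words, n_words):
    """ESRR/EFRR are per-character ratios, WSWG/WFWG per-word ratios,
    following the definitions used for Table 1."""
    return {
        "ESRR": correct_chars / n_chars,   # correctly identified characters
        "EFRR": false_chars / n_chars,     # non-characters taken as characters
        "WSWG": grouped_words / n_words,   # correctly grouped words
        "WFWG": false_words / n_words,     # falsely grouped words
    }
```

For a document like DATA2 (100 characters, 40 words), 97 identified characters would give ESRR = 0.97.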

5 Conclusion
In this paper we proposed a new text extraction algorithm for text/graphics mixed document images. Our algorithm can extract word groups even if a document has various sizes and styles of fonts and arbitrarily oriented text strings. The basic idea for separating texts from graphic symbols is to exploit the characteristics of the number of runs in the run-length encoded file of a given document. If the number of runs encoded in a component is high and its variance is high, then the component is believed to be a symbol rather than a

text, since text has few connected components. In order to group the singly isolated letters into meaningful words, we proposed a multi-stroke grouping technique and a 3-D neighborhood graph modeling scheme. In this 3-D graph model, every letter extracted in the previous phase is mapped to a location in three-dimensional space. We control the height of each mapped letter according to the size of its enclosing circle, so letters of similar size are placed in nearly the same horizontal plane in 3-D space. In order to determine whether some letters are placed on a line or a curve, we compute the linearity of these letters. Experiments show that our algorithm can successfully extract text strings from quite complicated document images with more than 95% accuracy. For the experiments, more than 6 document images were taken from brochures, magazines, and books, and the experimental results were quite satisfactory. In the future we hope to couple this system with an automatic letter recognition system to build a truly fully automatic document processing system for constructing digital libraries that process papers, magazines, and engineering drawings.

References
[1] L.A. Fletcher and R. Kasturi, A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images, IEEE Transactions on PAMI, Vol. 10, No. 6, pp. 910-918, Nov. 1988.
[2] A. Nakamura and O. Shiku, A Method for Recognizing Character Strings from Maps Using Linguistic Knowledge, Proc. of 2nd International Conf. on Document Analysis and Recognition, pp. 561-564, Oct. 1993.
[3] R. Kasturi and S. Bow, A System for Interpretation of Line Drawings, IEEE Transactions on PAMI, Vol. 12, No. 10, pp. 978-991, Oct. 1990.
[4] L. Eikvil and K. Aas, Tools for Interactive Map Conversion and Vectorization, Proc. Int. Conf. on Document Analysis and Recognition, Vol. 2, pp. 14-16, Aug. 1995.


[5] M. Kamel and A. Zhao, Extraction of Binary Character/Graphics Images from Grayscale Document Images, Graphical Models and Image Processing, Vol. 55, No. 3, pp. 203-217, May 1993.
[6] J. Ohya and A. Shio, Recognizing Characters in Scene Images, IEEE Transactions on PAMI, Vol. 16, No. 2, pp. 214-220, Feb. 1994.
[7] T. Taxt and P.J. Flynn, Segmentation of Document Images, IEEE Transactions on PAMI, Vol. 11, No. 12, pp. 1322-1329, Dec. 1989.
[8] G. Monagan and M. Roosli, Appropriate Base Representation Using a Run Graph, Proc. of 2nd International Conf. on Document Analysis and Recognition, pp. 623-626, Oct. 1993.
[9] J. Ding and L. Lam, Differentiation between Oriental and European Scripts, Proc. of the 7th ICCPOL, pp. 35-40, Apr. 1997.
[10] T. Saito, M. Tachikawa and T. Yamaai, Document Image Segmentation and Text Area Ordering, Proc. of ICDAR '93, pp. 323-329, Oct. 1993.
[11] C.L. Tan and P.O. Ng, Text Extraction Using Pyramid, Pattern Recognition, Vol. 31, No. 1, pp. 63-72, Feb. 1997.
[12] A.K. Jain and B. Yu, Page Segmentation Using Document Model, Proc. of ICDAR '97, Vol. 1, pp. 34-37, Apr. 1997.
[13] A. Jain and S. Bhattacharjee, Text Segmentation Using Gabor Filters for Automatic Document Processing, Machine Vision and Applications, Vol. 5, pp. 169-184, 1992.

