International Journal of Document Analysis (2006) 8(2): 172–182 DOI 10.

1007/s10032-005-0006-5

R E G U L A R PA P E R

S. Mandal · S. P. Chowdhury · A. K. Das · Bhabatosh Chanda

A simple and effective table detection system from document images

Received: 23 July 2004 / Revised: 8 January 2005 / Accepted: 15 September 2005 / Published online: 24 March 2006 c Springer-Verlag 2005

Abstract The requirement of detection and identification of tables from document images is crucial to any document image analysis and digital library system. In this paper we report a very simple but extremely powerful approach to detect tables present in document pages. The algorithm relies on the observation that the tables have distinct columns which implies that gaps between the fields are substantially larger than the gaps between the words in text lines. This deceptively simple observation has led to the design of a simple but powerful table detection system with low computation cost. Moreover, mathematical foundation of the approach is also established including formation of a regular expression for ease of implementation. Keywords Table detection · Document image segmentation · Digital document library

1 Introduction Millions of paper documents are being produced everyday, adding a never ending wealth of information to the human society. Practical use of these documents demand indexing, viewing, printing and extracting the intended portions in a fast and flexible way through electronic media. With the maturity of the document image analysis such systems are coming in the market. These include digital document libraries, vectorization of engineering drawings and form processing systems [1, 12, 13, 15, 18] to name a few. Common task for a typical document image analysis (DIA) system starts with skew correction and identification of the constituent parts of the document image to text, graphics, half-tones, etc. The graphics portion may be vectorised and text portion may be put to OCR. However, any table
S. Mandal (B) · S. P. Chowdhury · A. K. Das CST Department, Bengal Engineering College (DU), Sibpur, Howrah E-mail: {sekhar, shyama, amit}@cs.becs.ac.in B. Chanda ECS Unit, Indian Statistical Unit, Calcutta 700 035, India E-mail: chanda@isical.ac.in

that may be present in the document needs to be identified and requires special treatment because the fields are interrelated and individually carry a little sense. It may be noted that the table detection/segmentation step may be followed by table recognition step where the aim is to find out the logical or layout structure of the table. For such recognition systems it is usually assumed that the tables are already segmented out from the document or the whole document is a table [10]. In this paper our aim is limited to detection and segmentation of tables in a simple and efficient manner from the scanned document for subsequent processing. Our approach depends on the observation that the tables have distinct columns which implies that the gaps between fields are substantially larger than the gaps between the words in text lines. Indeed this calls for a reasonably homogeneous contexts with clear rules for the arrangements of the headers and cells of a typical table. Our method works well for typical paper document with Manhattan layout, it may not work for complex document with heterogeneous arrangement of items, which are commonly used by marketing people and magazines that have got very low archival value. This paper is organised as follows. Section 2 describes past works. Proposed work is dealt next in Sect. 3. Section 4 details experimental result while concluding remarks are given in Sect. 5.

2 Past work Table detection and segmentation is done by many researchers [3, 7, 21, 22]. Watanabe et al. [22] have proposed a tree representation to capture the structures of various kinds of tables. Table structure detection is also reported in [3, 11]. Zuyev [24] described a table grid and defined the compound cell and simple cell of a table based on table grid. Node property matrix is used by Tanaka [19] in processing of irregular rule lines and generation of HTML files. Unknown table structure analysis is proposed by Belaid [2]. Tersteegen et al. proposed a system for extraction of tabular structure with the help of predefined reference table [20] and Tsuruoka [21]

A simple and effective table detection system from document images

173

proposed a segmentation method for complex tables including rule lines and omitted rule lines. A technique is described in [7] to separate out tables and headings present in document images. Ramel et al. [17] used a flexible representation scheme based on clear distinction between the physical table and its logical structure. Detection and extraction of tables are done by the analysis of graphics line present in the table in the context of the representation scheme. For tables without the rule lines a multilevel analysis of the layout of text component is carried out to capture the regularities of the text present in a typical table. Only a few approaches for table detection are based on the textual contents of the document, we describe two representative [10, 14] from them. Individual words in [14] are clustered and a block segmentation graph is constructed based on the overlaps of the individual items (words) in consecutive lines. Note that in tables the overlaps are limited to individual columns and the block segmentation graph will be distinctly different from a block of normal text. The authors claim that this approach works fine for ASCII file and may be extended for a scanned document. However, the algorithm contains too many heuristics and no guideline is given regarding the choice of the parameters used for segmentation. A structured approach based on dynamic programming is taken [10] to find out which input line(s) can be taken as a part of a table. This is done by computing some characteristics like score, merit and line correlations to ascertain the gain (or loss) if the candidate line is taken ( or rejected) as a part of the table. This approach has a strong theoretical foundation but the couple of empirical constants used in the characteristic measures need to be fixed without a priory knowledge. As a result, the detection rate is limited to 81% for the scanned image and only 83% for ASCII text. This reflects the difficulty in mapping the theoretical proposition to a practical implementation. We have cited a couple of important approaches for detection and segmentation of tables in this section. However, the reader may see the survey paper by Zanibbi et al. [23] for a comprehensive and up to date information on this topic.

[5] is done leaving only text zones which is used as the input for table detection. 3.1 Observation To formulate the rules for detection of the tables from document images we have scanned 52 pages from books, journals, reports, etc. containing different types of tables. The observations are listed below. • Tables may be bounded by boxes. Rows and columns may have horizontal and vertical rule lines. • Tables without any box and rule lines are also common. • The gap between the fields (columns) is significantly larger than the normal word gap in a text line. This feature is applicable to all tables and distinguishes table rows from normal text lines. 3.2 Steps for table detection The table detection algorithm depends primarily on (A) formation of word blobs in text lines and (B) finding out the set of consecutive text lines which would form a table. As a preprocessing step component labelling is done first and some statistics, namely the medians of the height and width of the components are determined. Using these medians all vertical and horizontal rule lines are identified and removed from the image. However, the horizontal rule lines are marked as H L r (r = 1, 2, . . . , S) and stored for future use. Unlike some works on table detection which rely on the presence of the rule lines our approach is to ignore them to design an algorithm for detecting all kind of tables and to extend the gap between the fields; a characteristic which is primarily exploited in our work. A. Formation of word blobs in text lines This is done by coalescing the words in a line. Normally a text line would be converted to a single rectangular block while a row of a table would consists of multiple smaller blocks. Such a word coalescing depends on the accuracy in finding the normal word gap and finding out consecutive connected components in a single text line. We next mathematically formulate the blob formation. Consider a binary image I P × Q , which consists of connected components Ck (k = 1, 2, . . . , E), as defined in literature [9] with their usual meanings. Let L(Ck ), R(Ck ), T (Ck ) and B(Ck ) be the four extreme points of Ck in four directions (i.e., left, right, top and bottom) respectively. Suppose function F guarantees that the two connected components Ca and Cb are in the same text line. Then function F may be represented as
F(Ca , Cb ) = 1 0

3 Proposed work The objective of the present work is to find out tables that are present in a scanned document using simple tests on the structural properties of the document. This work is continuation of our earlier work on segmentation where the document image containing text, graphics, half-tones are segmented. The present work starts with the gray-scale image of the page. Half-tones are removed first [6]. The image is then binarised [16] and skew corrected [8]. Further processing is done column wise; so multicolumn document is stripped into separate columns. Text containing displayed-math zone; particularly matrices/determinants are structurally very similar to tables. In the next step, we segment displayed-math zones [4] from the document. Finally, extraction of graphics

if (T (Ca ) ≥ B(Cb ) and B(Ca ) ≤ T (Cb )) otherwise

Blob formation requires information on inter-word gap. This may be obtained from the histogram H1 of distance

174

S. Mandal et al.

Fig. 1 Example of a a document page image and b histogram of the gaps between horizontally consecutive connected components (i.e. characters) belonging to the same text line

D between two consecutive connected components Ca and Cb . The distance function D is defined for computing the horizontal distance between any two consecutive connected components Ca and Cb as follows: D(Ca , Cb ) = L(Cb ) − R(Ca ) The consecutive component Cb of Ca is detected by taking b = arg minx {L(C x ) − R(Ca )} where (F(Ca , C x ) = 1 AND L(C x ) > R(Ca )). The histogram H1 registers the intermediate character gap of two consecutive characters. It may be noted that if there is only one font in the document then we will get two distinct humps in H1 ; first one for character gap in a word and the second one for the word gap. An example page and corresponding histogram are shown in Fig. 1a and b. If there is more than one font we may find few other humps however, first hump will be most prominent followed by the second hump. Our intention is to find out the word gap in the normal text in a document page so that we could combine the consecutive words into a single blob. For doing so, we have taken the upper boundary (υ) of the second hump. Morphological closing operation with a line-structuring element of length v forms the blobs denoted as Vw (where w = 1, 2, . . . , Q). The blob formation will be dictated by the following two conditions: 1. If there are two connected components Ca and Cb (1 ≤ a, b ≤ E) having the relations • F(Ca , Cb ) = 1 • D(Ca , Cb ) ≤ υ

then Ca and Cb should belong to the same blob. (Note that E is the number of connected components in a document page.) 2. Vm ∩ Vn = ∅ ∀ m, n | (1 ≤ m, n ≤ Q AN D m = n) Result of typical blob formation using the said structuring element is shown in Fig. 2a. B. Selection of candidate text line for table(s) Blobs formed by coalescing the connected components are also connected components. Thus, we can directly apply the previously defined five functions L , R, T, B and F on these blobs. Let there are total N number of distinct text lines which are represented as T E L a , where a = 1, 2, . . . , N . Each text line is nothing but a set of blobs such that a blob is not be shared by more than one text line. Two blobs are in same text line if they overlap in horizontal projection, otherwise, they are in different text lines. By horizontal projection we mean projection in the horizontal direction on a vertical line. A number index is assigned to each text line in raster scan order. Candidate text line selection is done primarily by taking all lines that has more than one blob (see Fig. 2b). It may be noted that all words in a text line are coalesced to a single blob whereas we get multiple blobs for table rows. Mathematically Q(T E L a ) = 1 0 if (#(T E L a ) > 1 | (1 ≤ a ≤ N ) otherwise

where #(.) counts the number of blobs in a line. Thus, Q = 1 indicates that the candidate text line may belongs to a table.

A simple and effective table detection system from document images

175

Fig. 2 Example of blob formation. a Blobs formed from image shown in Fig. 1a. b Lines containing multiple blobs

It may be noted that the primary selection is not strong enough to detect all potential candidate lines; we may miss some of the rows of the tables. Those lines should be included to prevent splitting errors. Thus, imposing candidature to some of the lines which have not been selected in this step is the next task. Imposing candidature on missed lines depends primarily on a test that whether their immediate neighbours are candidate lines or not. This calls for the computation of vertical gap between any two consecutive lines and a measure of maximum gap gmax that should exist between two consecutive rows of a table. Now considering the possible presence of a horizontal rule line between two consecutive text lines, the gap width between T E L a and T E L a+1 is computed as ⎧ ⎪ T (T E L a+1 )−B(Ta ) if a horizontal rule line HL ⎪ 2 ⎨ lies between TEL a and TEL a+1 g(a) = ⎪ T (T E L a+1 ) ⎪ ⎩ −B(T E L a ) otherwise An example histogram H2 of g(a) is shown in Fig. 3. The histogram H2 exhibits a line spectrum because the size of sample is small. As a result we cannot obtain proper value of gmax from H2 . To convert the line spectrum into an equivalent band spectrum, we have applied Gaussian filter on g(a) with standard deviation σ = 0.5, σ is chosen 0.5 as the span of the hump is small. After applying Gaussian filter,

Fig. 3 Histogram of the text line gap of the preprocessed document page in Fig. 1a

176

S. Mandal et al.

troughs. This is done as a measure against getting multiple consecutive text lines as candidate lines with multiple blobs. This can occasionally happen when the interword gaps are stretched in a couple of consecutive lines to solve the text alignment problem. An example of the extraction of table from the candidate lines is given in Fig. 5. Table detection from the candidate text lines can also be modeled by a regular expression as elaborated below. Let an initial candidate text lines be represented as P, noncandidate text line be represented by O, and intermediate gap between two consecutive text lines be represented by G. If the gap is less than or equals to (gmax ) then the string represented by the regular expression PG((PG)∪(OGPG))+ will correspond to a table in the page.

4 Experimental results The experiments are done on the dataset using document pages from University of Washington’s document image database (UW1 and UW2) and our own collection. About 300 document pages are tested of which ≈48% pages contain table(s). Our database images contain mostly text and tables, or text and math. However, some pages have text, tables and math. We have a couple of pages which have got graphics, text and tables and there is no page with all the four components i.e. text, graphics, math and table. All the programmes are written in C and the tests are carried out in a COMPAQ DS 20E server running digital UNIX. Tables 1 and 2 show the performance figures while Fig. 6 shows a few results with pages containing typical tables. The average time for detection and identification of the table(s) for a page is about 1 s including the pre-processing
Table 1 Overall table detection performance Page with table Table row (%) Detected as table row Detected as text line 97.21 2.79 Text line (%) 0.00 100.00 Page without table Text line (%) 0.09 99.91

Fig. 4 Band spectrum of g(a) after applying Gaussian filter with σ = 0.5 of pre-processed document page in Fig. 1a with a vertical scaling factor of 6

the resultant histogram H3 , is shown in Fig. 4. In histogram H3 , the upper boundary of the first as well as the most prominent hump is taken as the value of gmax . Thus gmax is more than the white gap between any two rows of the table but less than the gap between the (i) heading of a table and the table itself, (ii) gaps between two successive tables and (iii) a text line and a table. Based on the gap g(a) as well as the value of gmax , we infer that the text line T E L a belongs to a table if all of the following conditions are satisfied. 1. 2. 3. 4. Q(T E L a−1 ) = 1 g(a − 1) ≤ gmax Q(T E L a+1 ) = 1 g(a + 1) ≤ gmax

where 1 < a < N . This marks those text lines (belong to a table) which initially have not been marked but their neighbouring lines were marked to be a part of a table. It may also be noted that during primary selection some extra lines which are not actually table rows (e.g. table heading or paragraph heading) are also marked [see Fig 2b]. These are easily eliminated as isolated lines because no table has a single row. Similarly, if a text line is accidentally coalesced into a number of blobs then also it will be eliminated as an isolated text line because its immediate neighbours in the coalesced text are not candidate text lines. Finally, we take a vertical projection of the area which has been detected as a table to check for multiple humps and

Table 2 Errors in table detection Page With table Input page Table present Detected table Insertion error Merging error Deletion error Splitting error 143 218 226 0.00% 0.00% 2.79% 3.67% Without table 157 0 4 0.09% N.A. N.A. N.A.

A simple and effective table detection system from document images

177

Fig. 5 Example of table extraction from candidate lines. a Extracted lines for Fig. 1a. b Table within the preprocessed image

operations. The results we have got are highly encouraging considering the simplicity and low computation cost of our approach. As stated earlier our approach does not rely on rule lines and performance is equally well for both types of the tables. In fact, we have removed the rule lines as a preprocessing step. It may further be noted that two commonly present items; namely displayed math and graphics which are likely to be confused with tables have been segmented [4, 5] in the pre-processing steps. This leads to a clean input to the table detection algorithm minimising the chance of errors. The strength of the approach can also be verified from the results given in Fig. 6. We see that the algorithm has successfully singled out different types of tables from the document pages. There may be multiple tables in a page where text lines are present between the tables [see Fig. 6a and b] and there may not be any text lines between two successive tables [see Fig. 6c and d]. Full page table is also segmented as shown in Fig. 6e and f. In Fig. 6g and h two tables are segmented from a page properly rejecting a near tabular structure in between the detected tables. Next in Fig. 6i and j, a hidden table detection from a page containing a notice is shown. In Fig 6k and l it is notable that the table is properly detected though there were a few rows with a couple of blank fields; a job which is difficult for any table detection algorithm. Finally, in the last example (Fig. 6m and n) we see that the second line from the top, which is not a primary candidate line (as it has a single blob) is correctly included

in the segmented table as its immediate neighbours are table rows. Regarding other characteristics of our algorithm it may be noted that the chance of insertion error is practically nil implying no or extremely low merging error as reflected in the performance figures. Though the performance of the algorithm is very good; it has its blemishes too. It may not be able to separate two tables appearing side by side; though this is rare but in that case merging would be 100% for those tables.If a text block is horizontally adjacent to a table within a column of page layout, then the proposed algorithm merges the two and identifies the whole as a table. Multiline headings may also cause problem as these have bigger fonts with more gaps than the normal one. However, as a discriminating criterion vertical projection profile may help as multiline headings will not have their word gaps in column-like fashion. 5 Conclusion We conclude this paper by reiterating the plus points of our approach. First and foremost it is fully automatic with no manual intervention or parameter entry. It can extract all sorts of tables that have homogeneous arrangement of cells in the table body with high segmentation performance and is not affected by the change in font and page style as the blob generation is done based on parameters computed for each

178

S. Mandal et al.

Fig. 6 Figures in the left are the original page image and in the right are segmented portions within a box on preprocessed image

A simple and effective table detection system from document images

179

Fig. 6 Continued

180

S. Mandal et al.

Fig. 6 Continued

A simple and effective table detection system from document images

181

Fig. 6 Continued

page. Moreover, it is based on very simple observation leading to a low-cost high-performance implementation. Finally, a mathematical treatment is presented and also a regular expression is proposed using which the implementation can be done quickly through compiler tools. Work is going on to improve our algorithm: • To avoid merging error for multiple tables (or text block and table) appearing side by side. • Segmentation of tables with heterogeneous placement of cells commonly used by marketing people.

References
1. Baird, H.S.: Digital libraries and document image analysis. In: Proceedings of 7th Internatiolal Conference on Document Image Anlysis; vol. 1, pp. 2–14. IEEE Computer Society Los Alamitos, California (2003) 2. Belaid, Y., Panchevre, J.L., Belaid A.: Form analysisi by neural classification of cells. In: Proceedings of 3rd IAPR Workshop on Document Analysis Systems (DAS’98), pp. 69–78. Nagano, Japan (1998) 3. Chandran, S., Balasubramanian, S., Gandhi, T., Prasad, A., Kasturi, R., Chhabra, A.: Structure recognition and information extraction from tabular documents. IJIST 7(4), 289–303 (1996) 4. Chowdhury, S.P., Mandal,S., Das, A.K., Chanda, B.: Automated segmentation of math-zones from document images. In: 7th International Conference on Document Analysis and Recognition, vol. 2, pp. 755–759. Edinburgh, UK (2003)

5. Das, A.K.: Document image segmentation: a morphological approach. PhD thesis, Bengal Engineering College (Deemed University), Sibpur, India (1998) 6. Das, A.K., Chanda, B.: Text segmentation from document images: a morphological approach. J Institute Eng. 1(77), 50–56 (1996) 7. Das, A.K., Chanda, B.: Detection of tables and headings from document image: a morphological approach. In: International Conference on Computational linguistics, Speech and Document Processing (ICCLSDP’98), pp. A57–A64. Calcutta, India (1998) 8. Das, A.K., Chanda, B.: A fast algorithm for skew detection of document images using morphology. Int. J. Doc. Anal. Recog. 4, 109– 114 (2001) 9. Gonzalez, R.C., Wood, R.: Digital Image Processing. AddisionWesley, Reading, MA (1992) 10. Hu, J., Kashi, R., Lopresti, D., Wilfong, G.: Medium-independent table detection. In: SPIE Document Recognition and Retrieval VII, pp. 291–302. San Jose, CA (2000) 11. Itonori, K.: Table structure recognition based on textblock arrangement and ruled Line position. In: Proceedings of ICDAR, pp. 765–768 (1993) 12. Joseph, S.H.: Processing of engineering line drawings for automatic input to cad. Pattern Recog. 22, 1–11 (1989) 13. Katsura, E., Takasu, A., Hara, S., Aizawa, A.: Design considerations for capturing an electronic library. Inf. Serv. Use, pp. 99–112 (1992) 14. Kieninger, T.G.: Table structure recognition based on robust block segmentation. In: Proceedings Document Recognition V, SPIE, vol. 3305, pp. 22–32. San Jose, California (1998) 15. Liu, J., Wu, X.: Description and recognition of form and automated form data entry. In: Proceedings of 3rd International Conference on Document Analysis and Recognition (ICDAR’95), pp. 579–582 (1995)

182

S. Mandal et al.

16. Otsu, N.: A threshold selection method from gray-level histogram. IEEE Trans. SMC 9(1), 62–66 (1979) 17. Ramel, J.-Y., Crucianu, M., Vincent, N., Faure, C.: Detection, extraction and representation of tables. In: 7th International Conference on Document Analysis and Recognition, vol. 1, pp. 374–378. Edinburgh, UK (2003) 18. Satoh, S., Takasu, A., Katsura, E.: An automated generation of electronic library based on document image understanding. In: Proceedings of ICDAR 1995, pp. 163–166 (1995) 19. Tanaka, T., Tsuruoka, S.: Table form document understanding using node classification method and html document generation. In: Proceedings of 3rd IAPR Workshop on Document Analysis Systems (DAS ’98), pp. 157–158. Nagano, Japan (1998)

20. Tersteegen, W.T., Wenzel, C.: Scantab: table recognition by reference tables. In: Proceedings of 3rd IAPR workshop on Document Analysis Systems (DAS’98), pp. 356–365. Nagano, Japan (1998) 21. Tsuruoka, S., Takao, K., Tanaka, T., Yoshikawa, T., Shinogi, T.: Region segmentation for table image with unknown complex structure. In: Proceedings of ICDAR’2001, pp. 709–713 (2001) 22. Watanabe, T., Luo, Q.L., Sugie, N.: Layout recognition of multikinds of table-form documents. IEEE Trans. on Pattern Anal. and Machine Intell. 17(4), 432–446 (1995) 23. Zanibbi, R., Blostein, D., Cordy, J.R.: A survey of table recognition: models, observations, transformations, and inferenences. IJDAR 7(1), 1–16 (2004) 24. Zuyev, K.: Table image image segmentation. In: Proceedings of ICDAR’1997, pp. 705–707. Ulm, Germany (1997)

Sign up to vote on this title
UsefulNot useful