
ICGST-GVIP Journal, Volume (5), Issue (7), July 2005

An Efficient Text Segmentation Technique Based on Naive Bayes Classifier


M. M. Haji, S. D. Katebi
Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran
mehdi.haji@gmail.com, katebi@shirazu.ac.ir
http://www.mhaji.com

Abstract
In this paper, the Naive Bayes Classifier (NBC) is introduced for text segmentation. A set of training data is generated from a wide category of document images to train the NBC. The images used for generating the training data include both machine-printed and handwritten text with different fonts, sizes, intensity values and background models. A small subset of the coefficients of a discrete cosine transformed image block is used to classify the block as text or non-text. The NBC decision threshold is optimized on a test set. Experiments carried out on unseen documents show promising results. A comparison with a well-established method for text segmentation indicates the advantages of the proposed method.

Keywords: Document Image Analysis, Text Segmentation, Content Based Image Retrieval, Naive Bayes Classifier, Discrete Cosine Transform, Feature Selection, Morphological Operations.

1. Introduction
A text segmentation algorithm aims at detecting text areas in images, a task with wide applications in document image analysis and understanding, image compression and content-based image retrieval. In document image binarization [1] and skew correction [2] algorithms, it is often desirable to remove non-text items from the input image, because these algorithms usually require a predominant text area to obtain an accurate estimate of text characteristics. Paper is still one of the main sources of textual information, and the huge amount of such valuable data in paper form makes its updating and retrieval difficult. Thus, there is a need to convert text from paper to electronic format. This task is usually done by an Optical Character Recognition (OCR) engine, and text extraction is an essential component of the page segmentation module of such an engine [3]. Text segmentation also has applications in training-based image compression algorithms such as Vector Quantization (VQ), which need to classify the data into statistically consistent parts and then use an appropriate codebook for each part [4]. The text in natural images

and video frames, such as street signs, vehicle license plates, billboards, writing on shirts, sport scores, and time and location stamps, is a powerful source of knowledge for building image and video indexing and retrieval systems [5]. This kind of text also provides useful content information for video understanding and automatic navigation systems.

Due to this wide range of applications, numerous methods for text segmentation (also referred to as text detection) have been proposed. Some of them require binary input images, which restricts their application when the text is embedded in an image with a complex background, because binarization techniques usually produce poor results for complicated images [6]. On the other hand, some methods also use colour information to detect text areas; colour information can be helpful, but it is not available in all situations. Moreover, for a human observer, intensity information is enough to segment the text areas, so most methods perform text segmentation on grey-scale images, i.e., even if a colour input image is available, it is first converted to grey-scale [3, 5].

The main text segmentation methods in the literature can be classified into connected component-based [7], edge-based [8, 9] and texture-based methods [10]. Connected component-based methods are bottom-up approaches that work by grouping small components satisfying several heuristic constraints into successively larger components to form text lines and columns. They are relatively independent of changes in text size and orientation, but have difficulty with complex images with non-uniform backgrounds, because in such cases thresholding techniques cannot produce the expected binary image; for example, if a text string touches a graphical object in the original image, they may form one connected component in the resultant binary image.

The basic idea behind the edge-based algorithms is that the edges of text symbols are typically stronger than those of noise, textured backgrounds and other graphical items [5, 9, 11]. In these top-down techniques, a binary edge image is first generated using an edge detector, and then adjacent edges are connected by applying morphological operations or other algorithms such as run-length smoothing [9]. Connected components of the

ICGST-GVIP Journal, Volume (5), Issue (7), July 2005

resultant image are the candidate text areas, as each one represents either several merged lines or a graphical item. Then, each component is decomposed into smaller regions by analyzing its vertical and horizontal projection profiles, and finally each of the small regions satisfying certain heuristic constraints is labelled as text. Edge-based methods are fast and can detect text in complex backgrounds, but are restricted to detecting only horizontally or vertically aligned text strings.

Text segmentation can also be thought of as a special case of texture segmentation in which characters correspond to texels. By treating text as a distinct texture, a texture segmentation algorithm can be applied to separate it. In texture-based methods, the input image is usually considered as a composite of two (text and non-text) or three (text, picture and background) texture classes. Many segmentation algorithms employ a classification window (block) of a certain size, in the hope that all or the majority of pixels in the window belong to the same class [12]. Thereafter, a classification algorithm can be used to label each window in the feature space. For example, in [13] the number of classes is two, and 2-means classification is used to classify each block of the image as text or non-text according to its local energy in the wavelet transform domain. By using 3-means clustering, in [3] each image pixel is labelled as text, picture or background according to a 9-D feature vector based on Gaussian filtering.

A large number of statistical and geometrical features have been proposed for texture segmentation, such as features of the co-occurrence matrix, the spatial grey-level dependency matrix [14], the Fourier power spectrum, moments of wavelet coefficients [15], Gaussian filters [3], Gabor filters [16] and Voronoi tessellation [17]. Among these, wavelet-based features are of most interest. The wavelet transform has become a very effective tool in texture segmentation and classification due to its multi-resolution properties. It provides a powerful transform domain for modelling images that are well characterized by their edges.

In texture-based methods, irrespective of the employed features, the size of the classification window is crucial. A large window results in robust segmentation in homogeneous regions but poor segmentation along the boundaries between regions. On the other hand, classification using small windows is not reliable, because a small amount of data (pixels) does not provide sufficient statistical information. All of these methods have difficulties with multi-size text strings and text-like texture areas. The former causes false negatives, while the latter results in false positives. The problem of detecting text strings of different sizes can be addressed to some extent by pyramid approaches [6], while reducing false positives needs more sophisticated approaches; for example, in [5] a support vector machine is utilized for this task.

Despite the many efforts spent on the text segmentation problem, there is no general method to detect arbitrary text strings, because in the most general form, detection must be insensitive to noise, background model and lighting conditions. It must also be invariant to text language, colour, size, font and orientation, even within the same image. The literature on text segmentation is extensive, but there appears to be very little literature on using

machine learning techniques for this task. We believe that a text segmentation algorithm should have adaptation and learning capability, but a learner usually needs much time and training data to achieve satisfactory results, which restricts its practicality. To overcome these problems, we give a simple procedure for generating training data from manually segmented images, and then apply a Naive Bayes Classifier (NBC), which is fast in both the training and application phases. It will be shown that very promising results can be obtained by this simple classifier.

This paper is organized as follows. In section 2, an introduction to the NBC is presented. In section 3, the process used to generate training data is described. Section 4 describes feature selection. Sections 5 and 6 describe the NBC training and classification. The postprocessing procedure is described in section 7. Experimental results are presented in section 8. Finally, in section 9, a summary of the work presented in this paper is provided.

2. Naive Bayes Classifier


The NBC is applicable to learning tasks where each instance is described by a conjunction of attribute values and a target function that takes a value from a finite set V. A set of training examples for the target function is provided, a new instance described by the attribute values (a1, a2, ..., an) is then presented, and the learner is asked to predict the target value or classification. The Bayesian approach to classifying the new instance is to assign the most probable, that is, the Maximum A Posteriori (MAP), hypothesis, given the attribute values that describe the instance [18]:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \quad (1)$$

where vMAP is the most probable target value. Using Bayes theorem, Equation (1) can be written as follows:

$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j) \quad (2)$$

Using the training data, the two terms in Equation (2) must be estimated. It is rather easy to estimate each P(vj) by counting the frequency of occurrence of each target value in the training data. However, estimating the different P(a1, a2, ..., an | vj) terms in this way is not possible unless a huge set of training data is available. In order to make the classifier much more practical and computationally efficient, we use the simplifying assumption that the attribute values are conditionally independent given the target value. This independence assumption implies that:

$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j) \quad (3)$$

Substituting Equation (3) into Equation (2) results in the approach used by the NBC, given by Equation (4):

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j) \quad (4)$$

where vNB denotes the target value output by the NBC.
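As a minimal illustration (not part of the original system), the counting-based training and the Equation (4) decision can be sketched in Python; the function and variable names below are our own:

```python
from collections import Counter, defaultdict

def train_nbc(instances, labels):
    """Estimate P(v) and P(a_i | v) by frequency counting over the training set."""
    n = len(labels)
    priors = {v: c / n for v, c in Counter(labels).items()}
    counts = defaultdict(int)        # counts[(i, a, v)]: attribute i = a in class v
    class_counts = Counter(labels)
    for x, v in zip(instances, labels):
        for i, a in enumerate(x):
            counts[(i, a, v)] += 1
    # Raw relative frequencies; zero-probability smoothing is added in section 5.
    def likelihood(i, a, v):
        return counts[(i, a, v)] / class_counts[v]
    return priors, likelihood

def predict_nbc(x, priors, likelihood):
    """Equation (4): v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    scores = dict(priors)
    for v in scores:
        for i, a in enumerate(x):
            scores[v] *= likelihood(i, a, v)
    return max(scores, key=scores.get)
```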


Figure 1. An overview of the proposed text segmentation system

Despite the fact that the independence assumption is often violated in practice, the NBC has shown itself to be a serious competitor to more sophisticated classifiers. It has been shown to be very effective in many practical domains such as text categorization and medical diagnosis [18].

The NBC has several distinctive characteristics which make it suitable for the text segmentation task. First, it is a probabilistic classifier, i.e., it outputs a posterior probability distribution over classes. In our work, text segmentation is treated as a two-class classification task, and a probabilistic classifier is appropriate here since it assigns a score to each instance expressing the degree to which that instance is thought to be positive. The second advantage of the NBC is that the learning task is not sensitive to the relative number of training instances in the positive (text) and negative (non-text) classes; it is only important to have non-zero probability estimates in Equation (4). Third, in naive Bayes methods, learning time is short, and in fact linear in the number of training examples, making them suitable for real-time learning. From Equation (4) it is clear that learning is simply done by counting the frequency of various data combinations within the training examples. The final advantage is that, for two-class problems, the classifier has only one parameter, a decision threshold with a default of 0.5, which has to be tuned experimentally.

Figure 1 shows the block diagram of the proposed text segmentation system. In the training phase, each training image is first divided into small blocks. Then, each block is represented in the Discrete Cosine Transform (DCT) feature space, and the manually generated mask of the training image is used to generate the training dataset. Since the NBC is a discrete classifier, a set of rules is then acquired to discretize the data; these rules are saved for future use. Feature selection is then performed to obtain a relevant subset of features; the selected features are also saved for future use. Finally, the NBC is trained over the discretized (nominal) values of the selected features, and the classifier, or equivalently the computed probabilities, is saved. In the application phase, the input

image is first divided into small blocks of the same size as in the training phase. The selected features are then extracted and discretized according to the discretization rules. Then, the classifier labels each block as text or non-text. Finally, a postprocessing step is applied to enhance the results. In the following sections, each of the above-mentioned steps is described in detail.
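To make the application-phase flow concrete, a high-level sketch follows. The helpers `extract_features`, `discretize`, `nbc_probability` and `postprocess` are hypothetical stand-ins for the steps detailed in sections 3 to 7, and the default threshold anticipates the value tuned in section 8:

```python
import numpy as np

def segment_text(image, extract_features, discretize, nbc_probability,
                 postprocess, block=8, dt=0.98):
    """Application phase: block-wise DCT features -> NBC posterior -> binary mask."""
    h, w = image.shape
    mask = np.zeros((h // block, w // block), dtype=bool)
    for bi in range(h // block):
        for bj in range(w // block):
            patch = image[bi * block:(bi + 1) * block,
                          bj * block:(bj + 1) * block]
            feats = discretize(extract_features(patch))  # saved rules from training
            mask[bi, bj] = nbc_probability(feats) > dt   # dt tuned in section 8
    return postprocess(mask)                             # morphological cleanup
```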

3. Training Data Generation


A large training set facilitates the task of learning, tuning and comparing various classifiers. A simple procedure is employed to generate a large set of training data from a small set of eight hand-segmented images. The images were selected from a wide category, containing both English handwritten and machine-printed text strings with different fonts, sizes, intensity values and background models. Furthermore, since the method is intended to be script-independent, two Farsi (Persian) document images were also included. For each training image, a binary mask is created manually; the mask contains white rectangles corresponding to the text strings (see Figure 2).

The proposed segmentation method is block-based and uses features in the Discrete Cosine Transform (DCT) domain. The classification is performed on 8×8 blocks of the input image, so each block is represented by 64 coefficients (features) in the DCT domain. For small squares such as 8×8, the DCT is more efficiently computed using the DCT transform matrix T, given by Equation (5) for an N×N block; the 2D-DCT of a square matrix A can then be computed as $T A T'$.

$$T_{pq} = \begin{cases} \dfrac{1}{\sqrt{N}} & p = 0,\; 0 \le q \le N-1 \\[1ex] \sqrt{\dfrac{2}{N}} \cos\dfrac{(2q+1)p\pi}{2N} & 1 \le p \le N-1,\; 0 \le q \le N-1 \end{cases} \quad (5)$$

The procedure used to generate the training data file is outlined in Algorithm 1, where the notation I(i1:i2, j1:j2) refers to the sub-image specified by the rectangle with (j1, i1) as the top-left corner and (j2, i2) as the bottom-right corner. The vertical sampling period, denoted by vp, and the horizontal sampling period, denoted by hp, were both set to 4 in this work. Using this procedure, about 100,000 training instances were generated from the eight images, although there may be no need for such a large amount of data, i.e., a small fraction of this dataset may provide reasonable estimates for the P(ai | vj) terms of Equation (4).

The NBC is a discrete classifier, and hence the continuous feature values must be discretized. For this purpose, each continuous value is converted to one of five nominal values: S2, S1, ZE, B1 or B2 (for very small, small, around zero, big and very big, respectively). The choice of five bins was found to yield the best performance. The discretization rules were set in such a way as to have an approximately equal number of instances in each of the 5 bins for each feature, so a different set of rules is used for each of the 64 features.
for each training image I and its corresponding mask M {
    for i = 0 : vp : 8 * floor(rows(I) / 8) - 8 {
        for j = 0 : hp : 8 * floor(columns(I) / 8) - 8 {
            [C1 C2 C3 ... C64] = dct2( I(i:i+7, j:j+7) );   /* 2D-DCT */
            if M(i:i+7, j:j+7) has more white than black pixels {
                /* it is a positive training instance */
                write [C1 C2 C3 ... C64 1] to the output file.
            } else {
                /* it is a negative training instance */
                write [C1 C2 C3 ... C64 0] to the output file.
            }
        }
    }
}
Algorithm 1. The procedure for generating the training data file for the 'IsText' concept.
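A Python rendering of Algorithm 1, offered as a sketch, assumes `image` and `mask` are NumPy arrays with the mask holding ones (white) over text; the DCT matrix is built directly from Equation (5):

```python
import numpy as np

def dct_matrix(n=8):
    """DCT transform matrix T of Equation (5); the 2D-DCT of A is T @ A @ T.T."""
    T = np.full((n, n), 1.0 / np.sqrt(n))
    for p in range(1, n):
        for q in range(n):
            T[p, q] = np.sqrt(2.0 / n) * np.cos((2 * q + 1) * p * np.pi / (2 * n))
    return T

def generate_training_data(image, mask, vp=4, hp=4, n=8):
    """Algorithm 1: one (64 DCT coefficients, label) row per sampled n-by-n block."""
    T = dct_matrix(n)
    rows = []
    for i in range(0, n * (image.shape[0] // n) - n + 1, vp):
        for j in range(0, n * (image.shape[1] // n) - n + 1, hp):
            block = image[i:i + n, j:j + n].astype(float)
            coeffs = (T @ block @ T.T).ravel()            # C1 ... C64, line by line
            label = 1.0 if mask[i:i + n, j:j + n].mean() > 0.5 else 0.0
            rows.append(np.append(coeffs, label))         # majority-white => text
    return np.array(rows)
```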

4. Feature Selection
Feature selection is the process of selecting a subset of features relevant, or eliminating a subset of features irrelevant, to a particular application. A small subset of features not only reduces the amount of time required for feature extraction but, more importantly, may result in higher classification performance; feature selection is therefore indispensable when using machine learning techniques. The effect of selecting various subsets of the 64 features was studied using the software package Weka [19]. The package implements a number of search algorithms, such as exhaustive, greedy and genetic search, to walk through the space of feature subsets. Moreover, it can evaluate the value (worth) of a subset in different ways: for example, by training a classifier with the subset and evaluating the classification performance on a held-out test set, or by calculating the level of consistency in the class values when the training instances are projected onto the subset of features. Since the search space is huge (there are 2^64 subsets), it is not practical to find the optimal solution by exhaustive search. The best subset found by the other search methods of Weka contains 5 elements with indices 1, 16, 21, 39 and 53, counting the coefficients of an 8×8 transformed block from 1, line after line. The classification rate, defined as the number of instances classified correctly over the total number of instances, is used as a criterion to rank the feature subsets. Table 1 shows the classification rate of the NBC trained with different feature subsets and evaluated using 10-fold cross-validation. As expected, there are small subsets that perform better than when all 64 features are used. A sketch of such a wrapper-style search is shown below.
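Purely as an illustration (the paper itself uses Weka), a greedy forward, wrapper-style search scored by 10-fold cross-validation could look as follows; scikit-learn's CategoricalNB stands in for the NBC, and X is assumed to hold the discretized features as integer bin codes 0-4:

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import cross_val_score

def greedy_forward_selection(X, y, max_features=5, folds=10):
    """Grow a feature subset one coefficient at a time, scored by 10-fold CV."""
    selected, remaining, best = [], list(range(X.shape[1])), 0.0
    while remaining and len(selected) < max_features:
        clf = CategoricalNB(min_categories=5)    # guard against bins unseen in a fold
        scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=folds).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:
            break                                # no candidate improves the subset
        best = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best
```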
Feature Subset                          Classification Rate
C1                                      81.0%
C8                                      80.4%
C1, C2                                  82.8%
C1, C2, C3, C4                          79.3%
C1, C5, C60, C64                        85.2%
C1, C16, C21, C39, C53                  87.4%
C2, C21, C28, C37, C52                  85.0%
C3, C15, C20, C30, C42, C50, C51        84.8%
C32                                     83.1%
C64                                     83.2%
C32, C64                                83.1%
C16, C32, C48, C64                      84.3%
C16, C17, C18, C19                      80.9%
C17, C25, C26, C34, C35                 81.2%
C1, ..., C4, C61, ..., C64              85.0%
C1, C2, C3, ..., C63, C64               85.0%

Table 1. The value (worth) of a number of feature subsets

5. Training
For the 'IsText' concept, let v1 = 'Yes' and v2 = 'No'. The evaluation of the conditional probabilities is carried out on the discretized training data. When using the NBC, no conditional probability is allowed to be zero, because a single zero value forces the product in Equation (3) to zero, which is a biased underestimate of the probability. Thus, the m-estimate of probability [18], a simple and effective smoothing method, is used to avoid zero probability estimates.

It must be mentioned that the probability estimates of an NBC can still be acceptable even if some of the underlying independence assumptions are violated. As expected, the NBC is the optimal classifier when the independence assumptions are satisfied, but Rish [20] has shown that the NBC also works well for functionally dependent features. The optimality of the NBC has been proved for some problems that have a high degree of feature dependency, such as disjunctive and conjunctive concepts [21]. By analyzing the impact of distribution entropy on the classification error, Rish has demonstrated that the NBC is a good performer for low-entropy (almost deterministic) feature distributions.
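A minimal sketch of the m-estimate follows; the equivalent sample size m is an illustrative choice (the paper does not report its value), and the uniform prior p = 1/5 reflects the five nominal bins of section 3:

```python
from collections import Counter, defaultdict

def m_estimate_likelihood(X, y, n_bins=5, m=5.0):
    """P(a_i = a | v) = (n_c + m*p) / (n + m), with uniform prior p = 1/n_bins.

    A single unseen (feature value, class) pair can therefore no longer zero
    out the product of conditional probabilities in Equation (4).
    """
    p = 1.0 / n_bins                  # uniform prior over the five nominal bins
    class_counts = Counter(y)
    counts = defaultdict(int)         # counts[(i, a, v)]
    for x, v in zip(X, y):
        for i, a in enumerate(x):
            counts[(i, a, v)] += 1
    def likelihood(i, a, v):
        return (counts[(i, a, v)] + m * p) / (class_counts[v] + m)
    return likelihood
```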



6. Classification
No prior information about the source image is assumed, so P(v1) = P(v2) = 0.5. Therefore, according to Bayes theorem:

$$P(\mathrm{Text}) = \frac{P(a_1 \mid v_1) P(a_2 \mid v_1) \cdots P(a_{18} \mid v_1)}{P(a_1 \mid v_1) P(a_2 \mid v_1) \cdots P(a_{18} \mid v_1) + P(a_1 \mid v_2) P(a_2 \mid v_2) \cdots P(a_{18} \mid v_2)} \quad (6)$$

The usual decision criterion (Equation (4)) suggests selecting the class with the highest posterior probability, i.e., labelling the input block as text if P(Text) exceeds 0.5. However, there is no justification for such a fixed criterion, especially when the probability estimates are inaccurate. In [22] it is shown that significant improvements result if the NBC decision criterion is treated as an additional model parameter, to be learned from the training data, rather than as a fixed threshold. If a high decision threshold DT (higher than 0.5) is selected for the text class, the number of false positives (blocks mistakenly marked as text) is obviously reduced, because only highly confident text blocks are classified as text.

To classify an 8×8 block of the image, the 18 selected DCT features are evaluated first, and their nominal equivalents are then computed. Lastly, P(Text) is evaluated using Equation (6) and the conditional probabilities; if it exceeds DT, the block is classified as text, and otherwise as non-text. In this way a binary image is formed, with white pixels representing text areas and black pixels representing non-text areas. In order to improve the segmentation accuracy, this image is then postprocessed by morphological operations to fill small black (non-text) holes within white (text) areas, which reduces false negatives.
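A sketch of this decision rule, assuming a `likelihood(i, a, v)` function such as the m-estimate sketch of section 5 with v1 = 'Yes' and v2 = 'No', is given below; the products of Equation (6) are computed in log space to avoid numerical underflow:

```python
import math

def p_text(feats, likelihood, dt=0.98):
    """Equation (6) with equal priors, evaluated in log space to avoid underflow."""
    log_yes = sum(math.log(likelihood(i, a, 'Yes')) for i, a in enumerate(feats))
    log_no = sum(math.log(likelihood(i, a, 'No')) for i, a in enumerate(feats))
    ref = max(log_yes, log_no)
    p = math.exp(log_yes - ref) / (math.exp(log_yes - ref) + math.exp(log_no - ref))
    return p, p > dt                  # posterior and the text/non-text decision
```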

7. Postprocessing
The operations performed to postprocess the binary image obtained from the previous step are based on the following assumptions: 1) the image has more false negatives than false positives, and 2) text areas are usually large and do not contain non-text areas (holes). The postprocessing procedure is based on mathematical morphology, a technique well suited to binary image processing. Complex functions on binary images can be decomposed into sequences of the two atomic morphological operations of erosion and dilation. In the first step of postprocessing, all isolated white pixels (those without any white 8-neighbour) are removed; small isolated blocks do not correspond to text regions and should thus be eliminated. Then, morphological closing (dilation followed by erosion) with a 3×3 rectangular structuring element is applied. A closing operation, when applied to a binary image, connects neighbouring white regions, which is equivalent to removing small black holes. The closing operation thus decreases the false negative error rate.
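Using SciPy's morphology routines, the two postprocessing steps can be sketched as follows (a minimal illustration, not the authors' implementation):

```python
import numpy as np
from scipy import ndimage

def postprocess(mask):
    """Drop isolated white pixels, then apply a 3x3 morphological closing."""
    mask = mask.astype(bool)
    # Number of white 8-neighbours: 3x3 box sum minus the centre pixel itself.
    neighbours = ndimage.convolve(mask.astype(int), np.ones((3, 3), dtype=int),
                                  mode='constant') - mask.astype(int)
    mask &= neighbours > 0            # a white pixel with no white 8-neighbour goes
    # Closing = dilation followed by erosion; fills small black (non-text) holes.
    return ndimage.binary_closing(mask, structure=np.ones((3, 3), dtype=bool))
```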

8. Experimental Results
The proposed method was tested on a set of unseen images. For each test image, a binary mask was created

manually. The error between the mask of a test image and the mask obtained by the proposed method is calculated by XORing the two masks and then dividing the number of white pixels (mismatches) by the total number of pixels (area). This error can be used as a criterion to judge the segmentation performance. It should be noted that even if the method performs an optimal segmentation (as judged by a human), there can be a number of mismatches between the two masks, mostly around the borders of text areas, as shown in Figure 5. So, this error measure is somewhat rough (pessimistic), in the sense that a non-zero error does not necessarily mean a non-optimal segmentation. It can, however, properly serve as a criterion to tune the free parameter DT. The value of DT ranges between 0 and 1, and its optimal value is the one that results in the minimum average error over the eight images. The experiments show that the optimal value, yielding the minimum error of 0.0657, is 0.98 (see Figure 3).

The result of applying the proposed method to a portion of a newspaper is shown in Figure 4. The raw classification of the NBC is given in Figure 4(b); each small square shows the probability that the corresponding square in the input image is text. Thresholding this image at 0.5 results in the binary image of Figure 4(c), and at 0.98 in the binary image of Figure 4(d), with fewer false positives. The final mask obtained by the morphological postprocessing is shown in Figure 4(e); as can be seen, some false positives still remain, which can be removed only by introducing heuristics based, for example, on size and proximity.

For the image of Figure 5(a), with clearly separated text areas, an almost optimal segmentation is obtained by the proposed method. However, as shown in Figure 5(d), the evaluated error of 0.053 does not correctly reflect the actual error, which is near zero in this case. The proposed method is compared with Textfinder [3, 6], a well-established method for text segmentation. Textfinder uses a texture segmentation scheme in the first step to locate candidate text regions, and then applies a set of appropriate heuristics to find text strings within or near the segmented regions. The mask obtained by applying the texture segmentation module of Textfinder is shown in Figure 5(e), which has large false positive regions. The proposed method performs much better for this image, with an execution time of about 0.08 seconds, which is about 100 times less than that of Textfinder.

In the experiment of Figure 6, the proposed method is applied to an image that contains two texts of different colours and other textures. Here the text strings are not well separated from the background. The result shows that the proposed method is not very sensitive to a non-uniform background and works well whether the text is darker or lighter than the background. In contrast, many existing approaches assume that the background is uniform [11] and show poor performance when this assumption is not satisfied. Figure 6(e) shows the mask obtained by applying the texture segmentation module of Textfinder; here, the produced mask has both false positive and false negative regions.
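The error criterion used throughout this section reduces to a few lines; a sketch, assuming the two masks are binary NumPy arrays of equal size:

```python
import numpy as np

def segmentation_error(mask_pred, mask_truth):
    """XOR the two masks and normalize the mismatch count by the total area."""
    return np.logical_xor(mask_pred.astype(bool), mask_truth.astype(bool)).mean()
```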


9. Conclusion
A method based on the Naive Bayes Classifier was presented for text segmentation. A large set of training data was generated from small overlapping blocks of eight document images containing text strings with different fonts, sizes, intensity values and background models. The NBC was trained using the generated training data, with a small subset of the DCT coefficients used as the feature vector. The feature selection was performed with a machine learning software package. The NBC decision threshold was optimized on a set of test images, and a simple morphological postprocessing step was applied to enhance the segmentation results. The focus of the paper was to show the possibility of fast text segmentation using a machine learning technique, and it was shown that the naive Bayes approach offers accurate results at a low computational cost. This was illustrated by comparative studies with another text segmentation technique.

10. References

[1] Y. Liu and S. N. Srihari, "Document Image Binarization Based on Texture Features", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19(5), pp. 540-544, May 1997.
[2] Avanindra and S. Chaudhuri, "Robust Detection of Skew in Document Images", IEEE Transactions on Image Processing, vol. 6(2), pp. 344-349, February 1997.
[3] V. Wu, R. Manmatha and E. M. Riseman, "Textfinder: An Automatic System to Detect and Recognize Text in Images", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21(11), pp. 1224-1229, 1999.
[4] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, 1992.
[5] D. Chen, H. Bourlard and J. Thiran, "Text Identification in Complex Backgrounds Using SVM", Proc. of the International Conf. on Computer Vision and Pattern Recognition, pp. 621-626, Dec. 2001.
[6] V. Wu, R. Manmatha and E. M. Riseman, "Finding Text in Images", Proc. of the ACM International Conf. on Digital Libraries, 1997.
[7] L. A. Fletcher and R. Kasturi, "A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10(6), pp. 910-918, Nov. 1988.
[8] M. Pietikäinen and O. Okun, "Text Extraction from Grey Scale Page Images by Simple Edge Detectors", Proc. of the 12th Scandinavian Conf. on Image Analysis, Bergen, Norway, pp. 628-635, June 2001.
[9] J. Xi, X.-S. Hua, X.-R. Chen, et al., "A Video Text Detection and Recognition System", Proc. of ICME 2001, Waseda University, Japan, pp. 1080-1083, August 2001.
[10] J. Li and R. M. Gray, "Text and Picture Segmentation by the Distribution Analysis of Wavelet Coefficients", Proc. of the IEEE International Conf. on Image Processing, Chicago, Illinois, vol. 3, pp. 790-794, Oct. 1998.
[11] Q. Yuan and C. L. Tan, "Page Segmentation and Text Extraction from Grey-Scale Images in Microfilm Format", SPIE Proc. on Document Recognition and Retrieval, vol. 4307, pp. 323-332, 2000.
[12] H. Choi and R. G. Baraniuk, "Multiscale Image Segmentation Using Wavelet-Domain Hidden Markov Models", IEEE Transactions on Image Processing, vol. 10(9), pp. 1309-1321, Sep. 2001.
[13] S. Deng and S. Latifi, "Fast Text Segmentation Using Wavelet for Document Processing", Proc. of the 4th WAC, ISSCI, IFMIP, Maui, Hawaii, USA, pp. 739-744, 2000.
[14] J. Ohya, A. Shio and S. Akamatsu, "Recognizing Characters in Scene Images", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16(2), pp. 214-224, Feb. 1994.
[15] M. Unser, "Texture Classification and Segmentation Using Wavelet Frames", IEEE Transactions on Image Processing, vol. 4(11), pp. 1549-1560, Nov. 1995.
[16] A. K. Jain and F. Farrokhnia, "Unsupervised Texture Segmentation Using Gabor Filters", Pattern Recognition, vol. 24, pp. 1167-1186, 1991.
[17] M. Tuceryan and A. K. Jain, "Texture Segmentation Using Voronoi Polygons", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 211-216, 1990.
[18] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[19] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools with Java Implementations, Morgan Kaufmann, San Francisco, 2000.
[20] I. Rish, "An Empirical Study of the Naive Bayes Classifier", Proc. of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 2001.
[21] P. Domingos and M. Pazzani, "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss", Machine Learning, vol. 29, pp. 103-130, 1997.
[22] N. Lachiche and P. Flach, "Improving Accuracy and Cost of Two-Class and Multi-Class Probabilistic Classifiers Using ROC Curves", Proc. of the 20th International Conf. on Machine Learning (ICML 2003), 2003.



Figure 2. Two document images and their text masks: (a) a portion of an English document image; (b) manually generated mask of (a); (c) a portion of a Farsi document image; (d) manually generated mask of (c).

Figure 3. Segmentation error as a function of the NBC decision threshold



Figure 4. Applying the proposed method to a portion of a newspaper: (a) an input image; (b) text probabilities of the 8×8 blocks of (a); (c) image (b) thresholded at 0.5; (d) image (b) thresholded at 0.98; (e) image (d) after morphological postprocessing; (f) text areas of (a) when using the mask (e).



Figure 5. Applying the proposed method to a document image with a simple layout and comparing with the mask obtained using the texture segmentation module of Textfinder: (a) an input image; (b) manually generated mask of (a); (c) mask produced by the proposed method; (d) result of XORing (b) and (c), error = 0.053; (e) mask produced by the texture segmentation module of Textfinder; (f) text areas of (a) when using the mask (c). Error is defined as the normalized area of the white pixels in the image obtained by XORing the two masks.



Figure 6. Applying the proposed method to an image with a complex background: (a) an input image; (b) text probabilities of the 8×8 blocks of (a); (c) image (b) thresholded at 0.98; (d) image (c) after morphological postprocessing; (e) mask produced by the texture segmentation module of Textfinder.

