2009 10th International Conference on Document Analysis and Recognition

Devanagari and Bangla Text Extraction from Natural Scene Images
U. Bhattacharya, S. K. Parui and S. Mondal
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata – 108, India
{ujjwal, swapan, srikanta_t}@isical.ac.in

Abstract

With the increasing popularity of digital cameras attached to various handheld devices, many new computational challenges have gained significance. One such problem is the extraction of texts from natural scene images captured by such devices. The extracted text can be sent to an OCR or a text-to-speech engine for recognition. In this article, we propose a novel and effective scheme based on analysis of connected components for extraction of Devanagari and Bangla texts from camera captured scene images. A common and unique feature of these two scripts is the presence of the headline, and the proposed scheme uses mathematical morphology operations for its extraction. Additionally, we consider a few criteria for robust filtering of text components from such scene images. Moreover, we studied the problem of binarization of such scene images and observed that there are situations when repeated binarization by a well-known global thresholding approach is effective. We tested our algorithm on a repository of 100 scene images containing texts of Devanagari and/or Bangla.

1. Introduction

Digital cameras have now become very popular and are often attached to various handheld devices like mobile phones, PDAs etc. Manufacturers of these devices are nowadays looking to embed various useful technologies into such devices. Prospective technologies include recognition of texts in scene images, text-to-speech conversion etc. Extraction and recognition of texts in images of natural scenes are useful to the blind and to foreigners facing a language barrier. Furthermore, the ability to automatically detect text in scene images has potential applications in image retrieval, robotics and intelligent transport systems. However, developing a robust scheme for extraction and recognition of texts from camera captured scenes is a great challenge due to several factors which include variations of style, color, spacing, distribution
and alignment of texts, background complexity,
influence of luminance, and so on.
A survey of existing methods for detection,
localization and extraction of texts embedded in
images of natural scenes can be found in [1]. Two
broad categories of available methods are connected
component (CC) based and texture based algorithms.
The first category of methods segments an image into a
set of CCs, and then classifies each CC as either text or
non-text. CC-based algorithms are relatively simple,
but often they fail to be robust. On the other hand,
texture-based methods assume that texts in images
have different textural properties compared to the
background or other non-text regions. Although the algorithms of the latter category are more robust, they usually have higher computational complexity.
Additionally, a few authors studied various
combinations of the above two categories of methods.
Among early works, Zhong et al. [2] located text in images of compact discs, book covers, and traffic scenes
in two steps. In the first step, approximate locations of
text lines were obtained and then text components in
those lines were extracted using color segmentation.
Wu et al. [3] proposed a texture segmentation method to generate candidate text regions. A set of feature components is computed for each pixel, and these are clustered using the K-means algorithm.
Jung et al. [4] employed a multi-layer perceptron classifier to discriminate between text and non-text pixels. A sliding window scans the whole image and its contents serve as the input to the neural network. A probability map is constructed in which high-probability areas are regarded as candidate text regions.
In [5], Li et al. computed features from the wavelet decomposition of the grayscale image and used a neural network classifier for labeling small windows as text or non-text. Gllavata et al. [6] considered wavelet transform based texture analysis for text detection. They used the K-means algorithm to cluster text and non-text regions.
Saoi et al. [7] used a similar but improved method for detection of text in natural scene images. In this method, the wavelet transform is applied to the R, G and B channels of the input color image separately, and the filtered images are binarized to extract connected components. Ezaki, Bulacu and Schomaker [8] studied morphological operations for detection of connected text components in images; they used a disk filter, obtaining the difference between the closing image and the opening image. In a recent work, Liu et al. [9] used a Gaussian mixture distribution to model the occurrence of three neighbouring characters and proposed a scheme under a Bayes framework for discriminating text and non-text components. Pan et al. [10] used a sparse representation based method for the same purpose. Ye et al. [11] proposed a coarse-to-fine strategy using multiscale wavelet features to locate text lines in color images. The text segmentation method described in [12] uses a combination of a CC-based stage and a region filtering stage based on a texture measure.

Devanagari and Bangla are the two most popular Indian scripts, used by more than 500 and 200 million people respectively in the Indian subcontinent. However, to the best of our knowledge, no existing work deals with the same problem for these two scripts, and the present study is possibly the first such attempt. A unique and common characteristic of these two scripts is the existence of certain headlines, as shown in Fig. 1. The focus of the present work is to exploit this fact for extraction of Devanagari and Bangla texts from images of natural scenes.

Figure 1. (a) A piece of text in Devanagari, (b) a piece of text in Bangla; the headline is marked on each sample.

Initially, we use the morphological opening operation along with a set of criteria to extract the headlines of Devanagari or Bangla texts. Next, we use several geometrical properties of the characters of these two scripts to locate the whole text parts in relation to the detected headlines. The only assumption we make is that the characters are sufficiently large and/or thick so that a linear structuring element of a certain fixed length can capture their headlines.

The rest of this article is organized as follows. Section 2 describes the preprocessing operations. The proposed method is described in Section 3. Experimental results are provided in Section 4. Section 5 concludes the paper.

2. Preprocessing

The size of an input image varies depending upon the resolution of the digital camera. Usually, this resolution is 1 MP or more. Initially, we downsample the input image by an integral factor so that its size is reduced to the nearest of 0.25 MP. Next, it is converted to an 8-bit grayscale image using the formula Y = 0.299*R + 0.587*G + 0.114*B. In fact, there is no absolute reference for the weight values of R, G and B; however, the above set of weights is standardized by the NTSC (National Television System Committee) and its usage is common in computer imaging.
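For concreteness, this preprocessing stage may be sketched as follows with OpenCV and NumPy (the function name and the plain decimation scheme are our illustrative choices; the paper does not specify how the integral-factor downsampling is implemented):

    import cv2
    import numpy as np

    def preprocess(bgr):
        # Downsample by an integral factor k so that the image size comes
        # nearest to 0.25 MP (250,000 pixels); plain decimation is assumed here.
        h, w = bgr.shape[:2]
        k = max(1, int(round(np.sqrt(h * w / 250000.0))))
        small = bgr[::k, ::k]
        # 8-bit grayscale with the NTSC weights Y = 0.299*R + 0.587*G + 0.114*B,
        # which is exactly the formula cv2.cvtColor applies for BGR2GRAY.
        return cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)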
A global binarization method like the well-known Otsu technique is usually not suitable for camera captured images since the gray-value histogram of such an image is not bi-modal. Binarization of such an image using a single threshold value often leads to loss of textual information against the background. Two such scene images are shown in Figs. 2(a) and 2(b); their texts are lost during binarization by Otsu's method, as shown in Figs. 2(c) and 2(d).

Figure 2. (a) and (b) Two scene images, (c) and (d) the results of binarizing (a) and (b) by Otsu's method.

On the other hand, local binarization methods are generally window-based, and the choice of window size in such methods severely affects the result, producing broken characters when the characters are thicker than the window. We implemented an adaptive thresholding technique which uses the simple average gray value in a window of size 27×27 around a pixel as the threshold for that pixel. In Fig. 3, we show the results of binarizing the images of Figs. 2(a) and 2(b) by this adaptive method.
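This window-based thresholding can be written directly with OpenCV's mean adaptive threshold (a minimal sketch; the offset constant C = 0 is our assumption, as the paper mentions only the plain window average):

    import cv2

    def adaptive_binarize(gray):
        # Threshold each pixel by the simple average gray value in the
        # 27x27 window centered on it.
        return cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                     cv2.THRESH_BINARY, 27, 0)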

Figure 3. (a) & (b) Results of binarizing the images in Figs. 2(a) & 2(b) by the adaptive method.

However, the example in Fig. 3(b) has text components connected with the background, and similar situations occurred frequently with the scene images used during our experimentations. The latter stages of the proposed method cannot recover from this error. On the other hand, we observed that applying Otsu's method a second time, separately on the sets of foreground and background pixels of the binarized image, often recovers lost texts efficiently. This second application of Otsu's method converts several pixels from foreground to background and also vice versa. The final results of applying Otsu's method twice on the input images of Fig. 2 are shown in Fig. 4. The latter stages of the proposed method, summarized below, are executed separately on the resulting images of the first and the second binarization.

Figure 4. Final results of applying Otsu's method twice on the input images of Fig. 2.
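A minimal sketch of this repeated binarization, under our reading that the second pass re-runs Otsu inside the foreground and background pixel sets of the first result and relabels pixels accordingly:

    import cv2
    import numpy as np

    def binarize_twice(gray):
        # First pass: global Otsu threshold over the whole image.
        t1, b1 = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # Second pass: Otsu again, separately inside the foreground and the
        # background pixel sets (assumed non-empty) of the first binarization.
        fg = gray[gray > t1].reshape(1, -1)
        bg = gray[gray <= t1].reshape(1, -1)
        t_fg, _ = cv2.threshold(fg, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        t_bg, _ = cv2.threshold(bg, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # Relabel: pixels may move from foreground to background and vice versa.
        b2 = np.where(gray > t1,
                      np.where(gray > t_fg, 255, 0),
                      np.where(gray > t_bg, 255, 0)).astype(np.uint8)
        # The later stages of the method run on both b1 and b2 separately.
        return b1, b2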
3. Proposed approach for text extraction

Extraction of Devanagari and/or Bangla texts from binarized images is primarily based on the unique property of these two scripts that they have headlines, as in Fig. 1. Connected components (both black and white) are extracted from the binary image. Here, it should be noted that the characters of Devanagari and Bangla always have a part below the headline, and a possible part above the headline is always smaller than the part below it.

3.1. Algorithm

The basic steps of our approach are summarized below.

Step 1: Obtain the connected components (C) from the binary image (B) corresponding to the gray image (A). These include both white and black components.

Step 2: Compute all horizontal or nearly horizontal line segments by applying the morphological opening operation (Section 3.2) on each C.

Step 3: Obtain connected sets of the above line segments. If multiple connected sets are obtained from the same C, then we consider only the largest one and call it the candidate headline component HC.

Step 4: Let E denote a component C that produces a candidate headline component HC. Replace E by subtracting HC from it. E may now get disconnected, consisting of several connected components.

Step 5: For each E, compute H1 and H2, which are respectively the heights of the parts of E that lie above and below HC.

Step 6: Obtain the height (h) of each connected component F of E that lies below HC. Compute p = the standard deviation of h divided by the mean of h.

Step 7: If both H1 / H2 and p are less than two suitably selected threshold values, call the corresponding HC the true headline component HT.

Step 8: Select all the components C corresponding to each true headline component HT.

Step 9: Revisit all the connected components which have not been selected above. For each such component we examine whether any other component in its immediate neighborhood has already been selected. If so, we compare the gray values of the two concerned components in image A and, if these values are very close, we include the former component into the set of already selected components.

As an example, we consider the binarized image of Fig. 4(a). All the line segments produced after the morphological operations on each component are shown in Fig. 5(a); points on horizontal line segments obtained from white components are represented by gray color, while those obtained from black components are represented by black color. Candidate headlines obtained at the end of Step 3 are shown in Fig. 5(b). The result of subtracting the candidate headline components from their respective parent components is shown in Fig. 5(c). True headline components obtained at the end of Step 7 are shown in Fig. 5(d). Text components selected by Step 8 are shown in Fig. 5(e). Finally, a few other possible text components are selected by the last step, and the final set of selected components is shown in Fig. 5(f). In this particular example, all the text components have been selected; however, one non-text component (at the bottom of the image) has also been selected.

Figure 5. Results of different stages of the algorithm based on the image of Fig. 2(a): (a) all line segments obtained by the morphological operation, (b) the set of candidate headlines, (c) all the components minus the respective candidate headlines, (d) true headlines, (e) components selected corresponding to true headlines, (f) final set of selected components.
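The headline detection of Steps 2-7 above can be sketched as follows (OpenCV/NumPy; ratio_max and p_max stand in for the "suitably selected" thresholds, whose values the paper does not state):

    import cv2
    import numpy as np

    SE = cv2.getStructuringElement(cv2.MORPH_RECT, (21, 1))  # horizontal line of length 21

    def true_headline(comp, ratio_max=0.5, p_max=0.5):
        # comp: 0/255 mask of one connected component C.
        # Step 2: opening keeps only horizontal runs of at least 21 pixels.
        lines = cv2.morphologyEx(comp, cv2.MORPH_OPEN, SE)
        # Step 3: the largest connected set of line segments is the candidate HC.
        n, lab, stats, _ = cv2.connectedComponentsWithStats(lines)
        if n <= 1:
            return None
        hc = np.where(lab == 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA]),
                      255, 0).astype(np.uint8)
        # Step 4: subtract HC from the component; it may split into pieces.
        rest = cv2.subtract(comp, hc)
        if not rest.any():
            return None
        # Step 5: H1 and H2, the heights of the parts above and below HC.
        hc_rows = np.nonzero(hc.any(axis=1))[0]
        top, bottom = hc_rows.min(), hc_rows.max()
        rows = np.nonzero(rest.any(axis=1))[0]
        H1, H2 = max(top - rows.min(), 0), max(rows.max() - bottom, 0)
        if H2 == 0:
            return None
        # Step 6: heights h of the pieces lying below HC; p = std(h) / mean(h).
        m, _, st2, _ = cv2.connectedComponentsWithStats(rest[bottom + 1:])
        if m <= 1:
            return None
        h = st2[1:, cv2.CC_STAT_HEIGHT].astype(float)
        p = h.std() / h.mean()
        # Step 7: HC is the true headline HT only if both statistics are small.
        return hc if (H1 / H2 < ratio_max and p < p_max) else None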

3.2. Morphological operation

We apply mathematical morphology tools, namely erosion followed by dilation, on each connected component to extract possible horizontal line segments. The erosion of an object A by the structuring element B, denoted by A-B, is defined as the set of all pixels P in A such that if B is placed on A with its center at P, B is entirely contained in A. The dilation operation is in some sense the dual of erosion. For each pixel P in the object A, consider the placement B(P) of the structuring element B with its center at P. The dilation of object A by structuring element B, denoted by A+B, is defined as the union of such placements B(P) for all P in A. The opening of A by the element B is (A-B)+B. For illustration, consider the object A and the structuring element B shown in Figs. 6(a) and 6(b) respectively; the eroded object A-B is shown in Fig. 6(c), and the opening of A by B is shown in Fig. 6(d). It is evident that the opening of an object A with a linear structuring element B can effectively identify the horizontal line segments present in a connected component. However, a suitable choice of the length of this structuring element is crucial for the processing of the later stages, and we empirically selected its length as 21 for the present problem.

Figure 6. (a) An object (A), (b) a structuring element (B), (c) the eroded object (C = A-B), (d) the object after opening (D = (A-B)+B).
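These definitions map directly onto OpenCV's primitives; the following toy check (our own example values) confirms that opening with the 21×1 linear element keeps a long horizontal run and removes a short one:

    import cv2
    import numpy as np

    A = np.zeros((5, 30), np.uint8)
    A[1, 2:28] = 255               # a 26-pixel horizontal run (kept)
    A[3, 5:9] = 255                # a 4-pixel run (removed)
    B = cv2.getStructuringElement(cv2.MORPH_RECT, (21, 1))
    eroded = cv2.erode(A, B)       # A - B in the notation above
    opened = cv2.dilate(eroded, B) # (A - B) + B, the opening of A by B
    assert opened[1].any() and not opened[3].any()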

4. Experimental results

The present study is based on a set of 100 outdoor images of signboards, banners, hoardings and nameplates of highways, railway stations, institutions etc., collected using two different cameras. These are focused on names of buildings, shops, railway stations and financial institutions, or on hoardings for advertisements, and they contain Devanagari and Bangla texts of various font styles, sizes and directions. We obtained simulation results based on these 100 test images acquired by (i) a Kodak DX7590 (5.0 MP) still camera and (ii) a SONY DCR-SR85E handycam used in still mode (1.0 MP). The resolutions of images captured by these two cameras are respectively 2576×1932 and 1152×864 pixels; after downsampling, their sizes are reduced to 644×483 and 576×432 pixels respectively.

There are 58 images all of whose relevant text components could be extracted. A few of the images on which the algorithm perfectly extracts all the Bangla and Devanagari text components are shown in Fig. 7; extracted components are shown to the right of each source image. On the rest of the 36 images, the algorithm either partially extracted the relevant text components or extracted text along with a few non-text components. On the other hand, two of the sample images on which the performance of our algorithm is extremely poor are shown in Fig. 9; similarly poor performance occurred with 6 of our sample images. In summary, the precision and recall values of our algorithm obtained on the basis of the present set of 100 images are respectively 68.8% and 71.2%.

Figure 7. A few images on which our algorithm performed perfectly and the respective outputs.

Figure 9. Two sample images on which the performance of our algorithm is very poor.

5. Conclusions

The proposed algorithm works well even on slanted or curved text components of Devanagari and Bangla; one such situation is shown in Fig. 8. However, the proposed algorithm will fail whenever the size of such curved or slanted text is not sufficiently large. In future, we shall study the use of machine learning tools to improve the performance of the proposed algorithm.

Figure 8. Two images consisting of curved or slanted texts.

References

[1] J. Liang, D. Doermann, H. Li, "Camera based analysis of text and documents: a survey", Int. Journ. on Doc. Anal. and Recog. (IJDAR), vol. 7, pp. 84-104, 2005.
[2] Y. Zhong, K. Karu, A. K. Jain, "Locating text in complex color images," Proc. of 3rd International Conference on Document Analysis and Recognition (ICDAR), pp. 146-149, 1995.
[3] V. Wu, R. Manmatha, E. M. Riseman, "TextFinder: an automatic system to detect and recognize text in images," IEEE Transactions on PAMI, vol. 21, pp. 1224-1228, 1999.
[4] K. Jung, K. Kim, T. Kurata, M. Kourogi, J. Han, "Text Scanner with Text Detection Technology on Image Sequences", Proceedings of 16th International Conference on Pattern Recognition (ICPR), vol. 3, pp. 473-476, 2002.
[5] H. Li, D. Doermann, O. Kia, "Automatic text detection and tracking in digital video," IEEE Trans. Image Processing, vol. 9, no. 1, pp. 147-167, 2000.
[6] J. Gllavata, R. Ewerth, B. Freisleben, "Text Detection in Images Based on Unsupervised Classification of High Frequency Wavelet Coefficients", Proc. of 17th Int. Conf. on Pattern Recognition (ICPR), vol. 1, pp. 425-428, 2004.
[7] T. Saoi, H. Goto, H. Kobayashi, "Text Detection in Color Scene Images Based on Unsupervised Clustering of Multichannel Wavelet Features," Proc. of 8th Int. Conf. on Doc. Anal. and Recog. (ICDAR), pp. 690-694, 2005.
[8] N. Ezaki, M. Bulacu, L. Schomaker, "Text detection from natural scene images: towards a system for visually impaired persons," Proc. of 17th Int. Conf. on Pattern Recognition (ICPR), vol. II, pp. 683-686, 2004.
[9] X. Liu, H. Fu, Y. Jia, "Gaussian mixture modeling and learning of neighboring characters for multilingual text extraction in images", Pattern Recognition, vol. 41, pp. 484-493, 2008.
[10] W. Pan, T. D. Bui, C. Y. Suen, "Text Detection from Scene Images Using Sparse Representation", Proc. of the 19th International Conference on Pattern Recognition (ICPR), 2008.
[11] Q. Ye, Q. Huang, W. Gao, D. Zhao, "Fast and robust text detection in images and video frames", Image and Vision Computing, vol. 23, pp. 565-576, 2005.
[12] C. Merino, M. Mirmehdi, "A framework towards realtime detection and tracking of text", Proc. of 2nd Int. Workshop on Camera-Based Doc. Anal. and Recog. (CBDAR), pp. 10-17, 2007.
Conclusions The proposed algorithm works well even on slanted or curved text components of Devanagari and Bangla. Two images consisting of curved or slanted texts. II. 2nd Int. pp. on Pattern Recognition (ICPR). 147-167. 84-104. On the other hand. [7] T. 484 – 493. vol.. J. K. IEEE Transactions on PAMI. References [1] J.