1 1

SIViP (2011) 5:165183
DOI 10.1007/s11760-010-0152-1
ORIGINAL PAPER
Statistical modeling for the detection, localization and extraction

of text from heterogeneous textual images using combined
feature scheme
Chitrakala Gopalan D. Manjula
Received: 29 April 2009 / Revised: 1 January 2010 / Accepted: 1 January 2010 / Published online: 29 January 2010
Springer-Verlag London Limited 2010
Abstract Discriminating between the text and non text feature selection methods with benchmark dataset. The pro-
regions of an image is a complex and challenging task. In con- posed text extraction system is compared with the Edge based
trast to Caption text, Scene text can have any orientation and method, Connected component method and Texture based
may be distorted by the perspective projection. Moreover, it method and shown encouraging result and finds its major
is often affected by variations in scene and camera parameters application in preprocessing for optical character recognition
such as illumination, focus, etc. These variations make the technique and multimedia processing, mobile robot naviga-
design of unified text extraction from various kinds of images tion, vehicle license detection and recognition, page segmen-
extremely difficult. This paper proposes a statistical unified tation and text-based image indexing, etc.
approach for the extraction of text from hybrid textual images
(both Scene text and Caption text in an image) and Document Keywords Text extraction Non sub sampled Contourlet
images with variations in text by using carefully selected fea- Transform Gray level run length matrix Caption text
tures with the help of multi level feature priority (MLFP) Scene text Document image
algorithm. The selected features are combinedly found to be
the good choice of feature vectors and have the efficacy to Abbreviations
discriminate between text and non text regions for Scene text, NSCT Non sub sampled Contourlet Transform
Caption text and Document images and the proposed system NSP Non sub sampled pyramid
is robust to illumination, transformation/perspective projec- NSDFB Non sub sampled directional filter bank
tion, font size and radially changing/angular text. MLFP fea- CC Connected component
ture selection algorithm is evaluated with three common ML NLV Normalized local variance
algorithms: a decision tree inducer (C4.5), a naive Bayes GLRLM Gray level run length matrix
classifier, and an instance based K-nearest neighbour learner GLCM Gray level co-occurrence matrix
and effectiveness of MLFP is shown by comparing with three MLFP Multi level feature priority
VM Closing Vertical morphological closing
Chitrakala Gopalan (B) HM Closing Horizontal morphological closing
Department of Computer Science and Engineering,
Easwari Engineering College, Anna University,
Ramapuram, Chennai, Tamil Nadu, India
e-mail: ckgops@gmail.com 1 Introduction
Chitrakala Gopalan
4/3, Kaliappa Naicker St., Plot no: 80, Nehru Nagar,
The growing popularity of Internet and the World Wide Web
Ramapuram, Chennai 600089, Tamil Nadu, India has resulted in the tremendous growth of multimedia data
containing still images and video, in addition to the textual
D. Manjula information. Text in images/video usually carries important
Department of Computer Science and Engineering,
College of Engineering, Anna University, Guindy,
messages about the content. Indexing, querying and retriev-
Chennai, Tamil Nadu, India ing multimedia information uses the textual information
e-mail: manju@annauniv.edu embedded in the multimedia data.
123
166 SIViP (2011) 5:165183
This has created an overriding need to provide efficient lightness. Character-like components are then extracted as
means of text extraction methodologies from textual images forming text lines in a number of orientations and along
for effective data storage, retrieval, search, querying and curves. Kumar et al. [5] proposed globally matched wavelet
interaction capabilities and preprocessing for OCR. Extract- filters with Fisher classifiers for text extraction from Docu-
ing embedded/inserted text in images often gives an indica- ment images and Scene text images. Liu and Samarabandu
tion of a scenes semantic content. Automatic extraction of [6] proposed Edge based method with edge strength, density
text is a challenging job due to variation in font style, size, and the orientation variance as distinguishing characteristics
orientation, alignment and complexity of background. of text embedded in images which can handle printed doc-
Text in images/videos is classified into Caption text and ument and Scene text images. This method used multi scale
Scene text [1]. Caption text image is the one which is the edge detector for text detection and dilation operator for text
inserted text and is otherwise called as superimposed/artifi- localization stages.
cial text. Natural textual images/embedded texts are called as Gllavata et al. [7] proposed connected component based
Scene texts or graphics text images. Electronic documents, method which uses color reduction technique followed by
images of paper documents, images acquired by scanning horizontal projection profile analysis which can extract text
book covers, CD covers or other multi-colored documents from Caption text images. Li et al. [8] presented algorithms
are called Document images. for detecting and tracking text in digital video and implement
Literature studies so far addressed three different app- a scale-space feature extractor that feeds an artificial neural
roaches to extract the text from images namely Bottom-up, processor to detect text blocks. Lin and Tan [9] proposed a
Top-down and Hybrid approaches. Bottom-up approach method to apply a neural network on canny edges with both
starts with the identification of sub-structures, such as con- spatial and relative features like sizes, color attributes and
nected components (CCs) or edges and then merging these relative alignment features. By making use of the alignment
sub-structures to mark bounding boxes for text. Top-down information, we can identify the text area from the charac-
approach looks for global information in the page and splits ter level rather than the conventional window block level.
the page from column level to word level. Edge based and In [10], the region-of-interests (ROI) probably containing
Connected component (CC) based methods are categorized the overlay texts are decomposed into several hypothetical
under Bottom-up methods and Texture based methods under binary images using color space partitioning. A grouping
Top-down approach. The proposed system employs Non sub- algorithm is then conducted to group the identified character
sampled Contourlet transform and texture analysis to extract blocks into text lines in each binary image.
text from Caption text image, Scene text image and Doc- Jeong et al. [11] classifies text pixels and non text pixels
ument image with variations in illumination, font size, per- using a network that operates as a set of texture discrim-
spective projection and orientation by using MLFP algorithm ination filter to find and locate text regions from Caption
with Neural network classifier. text images using histogram analysis after removing errors
in the classification results. Pan et al. [12] proposes a hybrid
method to localize texts in natural scene images with a Con-
2 Related work ditional Random Field (CRF) model, considering the unary
component property as well as binary neighboring compo-
The Literature has been surveyed to find existing methods for nent relationship. Finally text components are grouped into
text extraction from different kinds of images which are [2] text lines with an energy minimization approach.
based on the combination of connected component and tex- Phan [13] proposes a text detection method from Video
ture feature analysis of unknown text region contours. Each based on the Laplacian operator. K-means is then used to
candidate text region is verified with texture features derived classify all the pixels into two clusters: text and non text.
from wavelet domain followed by expectation maximization This method undergoes projection profile analysis to deter-
algorithm to binarise each text region. Jiang et al. [3] pro- mine the boundary of the text blocks and employ empirical
posed a learning-based method for text detection and text rules to eliminate false positives based on geometrical prop-
segmentation in natural scene images. Here, the input image erties. Experimental results show that the proposed method
is decomposed into multiple CCs by Niblack clustering algo- is able to detect text of different fonts, contrast and back-
rithm. Then all the CCs including text CCs and non text CCs grounds.
are verified on their text features by a two-stage classification Shi et al. [14] presents an algorithm using adaptive local
module. connectivity map for retrieving text lines from the complex
Karatzas and Antonacopoulos [4] follows a split-and- handwritten documents such as handwritten historical manu-
merge strategy based on the Hue-Lightness-Saturation (HLS) scripts. The algorithm is designed for solving the problems
representation of color as a first approximation of an anthro- like fluctuating text lines, touching or crossing text lines and
pocentric expression of the differences in chromaticity and low quality image seen in handwritten documents.
123
SIViP (2011) 5:165183 167
The above mentioned approaches focused on extracting Text extraction has been carried out in two phases, namely,
text from Scene text, Caption text and Document images sep- Offline processing (training phase) and Online processing
arately or on some combinations only. In contrast to Caption (testing phase). Offline processing is carried out to extract
text, Scene text can have any orientation and may be distorted and generate feature vectors for training images from the
by the perspective projection. Moreover, it is often affected image corpus. Experimental analysis on various images
by variations in scene and camera parameters such as illumi- indicated that text regions typically have different texture
nation, focus, etc. These variations make the design of uni- properties than the non text areas. This is analyzed by decom-
fied text extraction from various kinds of images extremely posing the input image with the variation of Contourlet
difficult. transform such as Non sub sampled Contourlet Transform
Recently we [15] proposed an Image analysis based (NSCT) which decomposes the image into a set of direc-
approach called Sub Band Texture Analysis based Text tional 2n sub bands with texture details or edges captured in
Detection/Text Localization (SBTA-TD/TL) technique for different orientations at various scales for n level specified.
text extraction from heterogeneous images which is robust to Each and every decomposed sub band has high intensity tex-
limited orientation (horizontal, vertical) and limited font size ture details in various directions with high prominent values
of images. Comparable performance for Scene text images shown for NSCT coefficients which are merged by adding
was not produced as well. It was also observed that SBTA- eight sub bands to form an image with edges detected in
TD/TL technique suffers and not showing encouraging various directions. The transformed NSCT coefficients in an
results for the following variations edge image are used for the calculation of feature vectors for
text and non text regions of images.
Variation in illumination Five different sets of features are extracted and analyzed
Variation in wide range of font size from the above edge image with MLFP algorithm and best
Variation in skewness features are selected. With this MLFP algorithm, textural
Variation in angularity of text distribution of the spatial frequency components within the
Perspective projection decomposed regions (Normalized Local VarianceNLV)
and the feature based on the run which is a series of pix-
els having the same gray level on a definite direction in the
It is required to produce equal performance for Scene
image (Gray level run length matrixGLRLM-based fea-
text images also with robustness to the variations in textual
tures) are collectively found to be the good choice of feature
images. Consequently, all these observations, drawbacks and
vector and have the efficacy to discriminate between text and
advantages were analyzed and a more complex Image analy-
non text regions for three kinds of images with variations in
sis plus Machine learning based approach called multi level
lighting, orientation and font size.
feature priority (MLFP) technique is proposed for the Text
During Online processing, when the user supplies an input
detection and localization from heterogeneous images which
image, Selective features are extracted from the transformed
takes care of variation in illumination, transformation/per-
and merged contourlet coefficients by capturing the textural
spective projection, font size, angular and radially chang-
distribution of the pixels and produce feature vector.
ing text so as to make the system suitable for heterogeneous
Neural network classifier is used to classify the regions as
images.
candidate text and non text regions from the textual image
by using the extracted feature vector. Candidate text regions
will undergo selected verification rules to eliminate the
3 Approach unwanted non text regions. Then binarization algorithm is
applied to extract text from text regions by eliminating the
The goal is to build a unified extraction of text from hetero- background.
geneous range of textual images and to focus on transformed, System architecture of the proposed system is shown in
illuminated and angular form. The major contributions of the Fig. 1. Proposed approach includes:
proposed text extraction system are as follows:
(a) Candidate text region detection

Extracts text from Scene text images, Caption text images,
Document images and hybrid text (which has both scene Image decomposition using NSCT
and caption text) images with comparable performance Feature extraction and selection using MLFP
for al kinds of images. algorithm
Supports multi lingual text images.
(b) Text region classification
Takes care of variation in font size, skewness, angularity
of text and perspective projection Neural network classifier
123
168 SIViP (2011) 5:165183
Input Training Images

Textual image
Text NSCT
NSCT
Binarization
Feature extraction
Feature extraction Verification of text regions
Feature selection
Candidate Non
Features Representation of
Text text
Features
images
Neural network
Neural network classifier
Training
(Testing )
ONLINE PROCESSING OFFLINE PROCESSING
Fig. 1 System architecture of the proposed system
(c) Text region Localization scales, with flexible aspect ratios. The contourlet transform
effectively captures smooth contour images that are the dom-
Verification rules
inant feature in natural images. The contourlet transform
(d) Binarization [16] is a multi directional and multi scale transform that is
constructed by combining the Laplacian pyramid with the
Directional Filter Bank (DFB). Due to down samplers and
4 Candidate text region detection up samplers present in both the Laplacian pyramid and the
DFB, the contourlet transform is not shift-invariant. An over
Always Text extraction from images starts with the candi- complete transform, the Non sub sampled Contourlet Trans-
date text region detection which involves the detection of the form (NSCT) has been proposed in [17] and applied in the
presence of the text in the image. In this paper, this detection proposed system. The NSCT is a fully shift-invariant, multi
is done by scale and multi direction expansion that have a fast imple-
mentation. Here filters are designed with better frequency
(1) Capturing multi oriented texture components with high selectivity thereby achieving better sub band decomposition.
intensity texture details representing primitively and This NSCT transform [17] can thus be divided into two
roughly identified text regions from the NSCT decom- shift-invariant parts: (1) a non sub sampled pyramid struc-
posed merged sub bands. ture that ensures the multi scale property of the NSCT and is
(2) Extracting best features from the merged sub bands so obtained from a shift-invariant filtering structure that achieves
as to distinguish text and non text regions better with sub band decomposition similar to that of the Laplacian pyr-
the help of the classifier stage. amid and (2) A non sub sampled DFB structure that gives
directionality. A shift-invariant directional expansion is also
These detected candidate text regions are later verified obtained with NSDFB. The NSDFB is constructed by elimi-
with the text localization stage. nating the down samplers and up samplers in the DFB. This is
done by switching off the down samplers/up samplers in each
4.1 Image decomposition using Non sub sampled two-channel filter bank in the DFB tree structure and up sam-
Contourlet Transform pling the filters accordingly. This results in a tree composed
of two-channel NSDFBs. Refer [15] for detailed description
The contourlet transform is an extension of the wavelet trans- of NSCT.
form which uses multi scale and directional filter banks. The NSCT applied on the input image produces 2n sub
Here images are oriented at various directions in multiple bands for n level specified. Here, eight sub bands have been
123
SIViP (2011) 5:165183 169
(1) Spatial distribution of the gray levels in the texture

image with GLCM-based features
Gray level co-occurrence matrix based features
(2) Textures at different scales with Fractal dimension
(3) Textural distribution of the spatial frequency compo-
nents with NLV
Normalized local variance
Fig. 2 NSCT coefficients
(4) Density distribution in the image statistically with
GLRLM-based features in various directions (positive
produced after three level decomposition and they have Multi and negative 0, 45, 90 and 135 directions)
oriented texture components with high intensity texture Gray level run length matrix (GLRLM)
details representing candidate text regions as shown in Fig. 2,
for Scene text images. The NSCT is a fully shift-invariant, (5) Intensity of pixels in the image with Intensity Histogram
multi scale and multi direction expansion that have a fast based features
implementation. A non sub sampled pyramid structure in
NSCT ensures the multi scale property. Large variations in Now feature vectors are calculated for each 8 8 block
scale are captured in candidate text region detection stage from the decomposed merged sub band edge image rather
itself with the help of NSP in NSCT. Each and every decom- than on original input image because these merged sub bands
posed sub band has texture details in various directions with have primitively identified candidate text regions. The
high prominent values shown for NSCT coefficients. These extracted feature vectors will be passed through MLFP algo-
prominent values of NSCT coefficients from decomposed rithm and then submitted to a neural network classifier. The
eight sub band images are merged by adding eight sub bands methodology employed in the approach to represent images
to form an image with edge detected in various directions so is the combination of finally selected NLV and GLRLM
as to facilitate the extraction of features in next stage. based features which are used to extract the primitives of
an image. The above mentioned 5 sets of features compris-
ing 55 features are extracted from the decomposed merged
4.2 Feature extraction sub bands in each block of an image and the list is shown
in Table 1. Features considered for investigation are detailed
Once the image is converted into gray image, its feature vec- below:
tor represents them. Several feature extraction strategies are
proposed in the literature and classification using various 4.2.1 Normalized local variance (NLV)
image features is analyzed [18]. Texture is the term used
to characterize the surface of a given object or region and it Information about textural distribution of the spatial fre-
is one of the main features utilized in image processing and quency components within regions in frequency domain is
pattern recognition. analyzed and can be used as one of the features to dis-
Existing methods for text detection, localization and criminate text and non text areas. This is derived by cal-
extraction can broadly be classified as gradient features culating local energy coefficients of the NSCT of an image
based, color segmentation based and texture features based as in Eq. (3). The derived local energy is normalized using
[19]. The proposed system employs texture feature based global standard deviation of respective energy coefficient as
method. Text is viewed as a unique texture that exhibits a cer- in Eq. (2) to obtain a feature set containing additional infor-
tain regularity that is distinguishable from the background. mation of textures. It is observed from experimentation that
Humans can identify text of foreign languages even when this NLV feature is high for text and low for non text.
they do not understand them, largely due to its distinct tex- Global mean (GM):
ture. Various researchers have exploited this fact to detect
w h
text in images. The texture methods are largely used for text 1
detection. Texture features can be extracted directly from the GMi = Ci (u, v) (1)
wh
pixels spatial relationships or from frequency data. However, u=0 v=0
these methods are often computationally expensive, but lead Global standard deviation (GSD) for each coefficient:
to good results. Herewith five sets of texture features are con-
w h
v=0 (Ci (u, v) GMi )
sidered for investigation to analyze the following properties 2
u=0
of the image: GSDi = (2)
wh
123
170 SIViP (2011) 5:165183
Table 1 Set of features

S. No Feature group Feature name
analyzed
1 Local energy deviation Normalized Local Variance (NLV)
of NSCT coefficients
2 Histogram based features Mean, Variance, Skew ness, Kurtosis,
Energy, Entropy
3 Gray level co-occurrence matrix / GLCM Contrast, Correlation, Energy,
Entropy Homogeneity
4 Fractal dimension
5 Gray Level Run Length Matrix / GLRLM Short run emphasis (SRE)
based features checked in various Long Run Emphasis (LRE)
directions. (positive and negative Gray level non uniformity (GLN)
0, 45, 90 and 135 directions) Run length non uniformity (RLN)
Low gray level run emphasis (LGRE)
High gray level run emphasis (HGRE)
Short run low gray level emphasis (SRLGE)
Short run high gray level emphasis (SRHGE)
Long run low gray level emphasis (LRLGE)
Long run high gray level emphasis (LRHGE)
Local mean in the window of size N N is: From the original run length matrix p(i, j), many numer-
ical texture measures can be computed [20] such as Short
1
N N
Mi = Ci (u, v) (3) Run Emphasis (SRE), Long Run Emphasis (LRE), Gray
N2 Level Non uniformity (GLN), Run Length Non uniformity
u=0 v=0
(RLN), Low Gray Level Run Emphasis (LGRE), High Gray
Local coefficient variance V Ci , Level Run Emphasis (HGRE), Short Run Low Gray Level
N N Emphasis (SRLGE), Short Run High Gray Level Emphasis
(Ci (u, v) Mi )2
V Ci (u, v) = u=0 v=0 2 (4) (SRHGE), Long Run Low Gray Level Emphasis (LRLGE)
N
and Long Run High Gray Level Emphasis (LRHGE).
where, w and h are the number of pixels in horizontal and ver- For a given image, a run-length matrix p(i, j) is defined
tical directions respectively, in the image, i = 0 to 9, N = as the number of runs with pixels of gray level i and run
2n (=16), Ci is ith coefficient and the normalized variance length j. Let M be the number of gray levels and N be the
NV Ci (u, v) for each pixel is calculated as follows: maximum run length. Pg vector represents the sum distribu-
V Ci (u, v) tion of the number of runs with gray level i.nr is the total
NVCi (u, v) = (5) number of runs. The subsets of GLRLM features which have
GSDi
been selected with MLFP algorithm as in Table 2 are only
The relative deviation in local coefficient in the region N N described as follows:
window is computed for each pixel of the image.
(1) Gray level non uniformity (GLN):
4.2.2 GLRLM based features
2
1 1
M N M
Next set of features were extracted from decomposed merged
sub bands using texture analysis based on the GLRLM. Tex- GLN = p(i, j) = Pg (i)2 (6)
nr nr
i=1 j=1 i=1
ture analysis is the method to analyze the density distri-
bution in the image statistically. Texture analysis based on
the GLRLM is one such analysis and is a method based (2) High gray level run emphasis (HGRE):
on the run which is a series of pixels having the same gray
1 1
level on a definite direction in the image [20]. Since M N M
density, length and direction in the run are included, the HGRE = p(i, j) i 2 = pg (i) i 2
nr nr
aggregation of the run from the image represents features i=1 j=1 i=1
of the texture. (7)
123
SIViP (2011) 5:165183 171
Table 2 Selected features with multi level feature priority algorithm

S. No Feature group Feature name Priority Non text (NT) Text (T)
1 Local energy deviation Normalized local variance (NLV) (1) Low High
of NSCT coefficients
2 GLRLM based features High gray level run emphasis HGRE-0 High Low
3 HGRE-90 High Low
4 Short run high gray level emphasis SRHGE-0 High Low
5 SRHGE-90 High Low
6 Long run high gray level emphasis LRHGE-90 High Low
7 Gray level non uniformity GLN-135 Low High
8 Short run high gray level emphasis SRHGE-45 (2) High Low
9 Long run high gray level emphasis LRHGE-45 High Low
10 High gray level run emphasis HGRE-45 High Low
11 Gray level non uniformity GLN-0 Low High
12 GLN-45 Low High
13 Long run high gray level emphasis LRHGE-135 (3) High Low
(3) Short run high gray level emphasis (SRHGE): used to construct the matrix. Various statistics can be derived
from the created GLCM and can provide information about
1 p(i, j) i 2 the texture of an image. Five out of the fourteen Haralick
M N
SRHGE = (8) features [21] have been used in the approach.
nr j2
i=1 j=1 These features are: contrast, homogeneity, energy, entropy
and correlation. To calculate different features, the joint prob-
(4) Long Run High Gray Level Emphasis (LRHGE): ability density of grey level co occurrence computed by
GLCM is weighted differently. The first two features (con-
1
M N trast, homogeneity) can be grouped and named as Contrast
LRHGE = p(i, j) i 2 j 2 (9) Group. They compute quantity of contrast in a window. The
nr
i=1 j=1 second group called Orderliness Group contains features
(energy and entropy) which indicate how regular (orderly)
These features are used to capture apparent properties of the pixel values are within the window.
run length distribution. It is observed from experimentation
that the above selected features have shown distinguishable
difference between text and non text as shown in Fig. 4ae. (a) Contrast measures the local variations in the gray level
co-occurrence matrix. It is calculated as
4.2.3 Gray level co-occurrence matrix based features
(GLCM)
L1
Contrast = Pi j (i j)2 (10)
i, j=0
Gray level co-occurrence matrix (GLCM) can reveal certain
properties about the spatial distribution of the gray levels in
the texture image as proposed by Haralick et al. [21] and Here Pi j = Element i, j of the normalized symmetrical
is also known as the Gray Level spatial dependence matrix. GLCM.
Hence, GLCM-based features were considered for feature N = Number of gray levels in the image.
extraction. GLCM is a tabulation of how often different com- (b) Homogeneity measures the closeness of the distribution
binations of pixel values (grey levels) occur in an image. of elements in the GLCM to the GLCM diagonal. It is
When divided by the total number of neighboring pixels in calculated as
the image, this matrix becomes the estimate of the joint prob-
ability of two pixels, a distance d apart in a direction e having
L1
particular (co-occurring) grey values i and j. The dimension Homogeneity = Pi j /1 + (i j)2 (11)
of GLCM is Gx G where G is the number of grey levels i, j=0
123
172 SIViP (2011) 5:165183
(c) Energy provides the sum of squared elements in the where m 3 is the third moment
GLCM. Also known as uniformity or the angular sec-

L1
ond moment. It is calculated as Kurtosis = 1/4 (x m 4 )4 Pu(x) (19)
x=0

L1
Energy = (Pi j )2 (12) 4.2.5 Fractal dimension
i, j=0
Fractal dimension is a parameter frequently used to analyze

(d) Entropy measures the average, global information con- textures at different scales. Fractal dimension values indicate
tent of an image in terms of average bits per pixel. the complexity of a pattern or the quantity of information
embodied in a pattern in terms of morphological, entropy,

N 1 spectral and variance [22]. FD serves as an index of the mor-
Entropy = Pi j log2 Pi j (13) pho metric complexity and variability of the object being
i, j=0 studied. The fractal dimension indicates the complexity of
images. Because the picture components are more compli-
(e) Correlation measures the linear dependency of grey lev- cated than the text components, the picture components have
els on those of neighboring pixels. It is calculated as higher fractal dimensions than the text components. To cal-
culate fractal dimension using the box dimension method, a
square grid of size s is superimposed over the image. N (s),

L1
Correlation = Pi j (i )( j )/ 2 (14) the number of grids that contain part of the image is counted.
i, j=0 Box dimension D can be calculated using the following
formula:
4.2.4 Intensity histogram based features D = log N (s)/ log 1/s (20)
The intensity of the pixels in an image helps to differentiate

4.3 Feature selection algorithm
the text pixels from that of the background pixels. Hence, the
first order intensity histogram statistical features were con-
Finding an optimal set of features from a large set of can-
sidered for feature extraction. The intensity histogram was
didate features is a problem which occurs in many contexts.
created by quantizing the gray-scale intensity values into the
Several feature selection strategies are proposed in the liter-
range 0255 and then making a 256-bin histogram for these
ature and analyzed in [23].
values. The histogram was then normalized by dividing all
Heuristic algorithms for finding near-optimal solutions
values by the total sum. The mean, second and third moments
[24] require both the definition of a strategy for selecting
are derived from the intensity histogram. The histogram fea-
feature subspaces and the definition of a function, i.e. how
tures are calculated as follows:
well classes result separated in the selected feature subspace.
Let u be a random variable representing a gray level in a
In this paper, proposed MLFP algorithm follows sequential
given region of the image. Define:
backward selection approach, which removes one feature at
Pu (x) = Prob(u = x) = no. of pixels with gray level x/ a time with low discrimination ability until no improvement
in the criterion function is obtained. The criterion function
total no. of pixels in the region is problem specific and can vary from simple performance
(15) measures to complex multistage evaluation procedures.
The evaluation methods for evaluating the goodness of
Now the mean, variance, skewness and kurtosis are defined each selection can be divided into two wide classes: (1) Fil-
as follows: ter method evaluate a feature subset independent of the clas-

L1 sifier and are usually based on some statistical measures of
Mean = E(u) = x Pu(x) (16) distance between the samples belonging to different classes
x=0 and (2) Wrapper method is based on the classification results

L1 achieved by a given classifier. Filter methods are usually
Variance = (x m 1 )2 Pu(x) (17) faster than wrapper ones, but selecting comparatively more
x=0 number of features. The approach here follows filter-based

L1 method for evaluation and aims at producing small number
Skewness = (x m 3 )3 Pu(x) (18) of features at less running time with Difference between clas-
x=0 ses (DBC) based consistency. This algorithm also categorizes
123
SIViP (2011) 5:165183 173
the selected features, meeting the criteria into various priority 4. Maximum relevant features are selected and categorized
levels at the time of the selection of the features itself. into various priority sets such as P1 , P2 , P3 (with P1 as
the highest priority) based on Difference between clas-
4.3.1 Multi level feature priority (MLFP) algorithm ses (DBC) based consistency i.e., based on the differ-
ence between the flag and number of images N . If the
Multi level feature priority algorithm aims for the selection difference between text and non text of a particular fea-
of the most appropriate subset of features that adequately ture is being maintained a value above the threshold for
describes a given classification task. MLFP algorithm is based all images, choose these kinds of features as a candidate
on sequential backward selection method, which considers for first highest priority level P1 .
all features and removes a feature with low discrimination
ability to distinguish text and non text regions of an image. P1 = { f i / |flag| N = 0} (23)
This algorithm also categorizes the selected features, meet- P2 = { f i / |flag| N < t1} (24)
ing the criteria into various priority levels at the time of the P3 = { f i / |flag| N < t2 where t2 t1} (25)
selection of the features itself.
where is a partial ordering such that t1 t2 is
This algorithm selects and categorizes the features as first defined as t1 t2 0.
level (top) features if difference between feature values 5. Selected priority set of features must be maximum
of text and non text regions exceeds the threshold and the relevant and minimum redundant in order to qualify as
same is maintained for all n images/consistent for all significant features. Maximum relevancy is rechecked
n images. by consistency and Minimum redundancy is checked by
The algorithm selects and categorizes the features as sec- correlation among the features.
ond level features if the difference between feature val-
ues of text and non text regions exceeds the threshold but
Consistency: Optimal set of features to distinguish text
consistent only for large subset of images n1, where
and non text regions of an image should be consistent or
n1 n.
homogeneous. It indicates that difference between text and
The algorithm selects and categorizes the features as
non text of a particular feature should be consistent for all
third level features if the difference between feature val-
images irrespective of the types of images such as Caption
ues of text and non text regions exceeds the threshold and
text, Scene text and Document images. These kinds of fea-
consistent not for all images but only for small subset of
tures will become candidature for optimal set of features.
images n2, where n2 n.
Consistency is measured by coefficient of variation (C.V)
and is calculated as follows:
Various steps involved in MLFP algorithm during offline Coefficients of dispersion (C.D) are calculated [25] to
processing are as follows: compare the variability of the two series which differ widely
in their averages or which are measured in different units.
1. For each image of m blocks, compute the following for Coefficient of dispersion (C.D) based upon standard devi-
each feature: ation is given as:
C.D = S.D/Mean = / X (26)
Fi T = f i T 1 + + f i T n /N
Fi N T = f i N T 1 + + f i N T m /M (21) Coefficient of Variation: 100 times the coefficient of disper-
sion based upon standard deviation is called coefficient of
DFi = Fi T Fi N T
variation (C.V) and it is calculated as
where f i T 1 , . . . , f i T n and f i N T 1 , . . . , f i N T n are fea- C.V = 100 (/ X ) (27)

ture values in various text and non text regions of the
According to Professor Karl Pearson who suggested this
image respectively and N and M are the total number
measure, C.V is the percentage variation in the mean,
of text and non text regions in the image. Fi T , Fi N T are
standard deviation being considered as the total variation
the mean feature values for text and non text regions,
in the mean. The series having greater C.V are said to be
respectively, and DFi is the difference between those
more variable than the other and the series having lesser C.V
mean values.
are said to be more consistent (or homogeneous) than the
2. flag = flag + 1; {if DFi > threshold} (22) other.
Correlation: Optimal subset of features should not be
3. Repeat the above two steps for all N images. redundant. Redundancy is checked by the correlation
123
174 SIViP (2011) 5:165183
between the features. Correlation criteria can detect linear Procedure Multi level Feature Priority ( )
dependencies between variable and target. It is a measure Step 1: Start

flag = 0
of the strength of linear association between two variables. Step 2: For i=1 to n features
Correlation coefficient r is calculated as follows: Correlation Step 3: For each j= 1 to x images
Step 4: For each k = 1 to m of Id
Coefficient (r ) = Covariance (x, y)/x y Compute Mean (i) for text region

Compute Mean (i) for non text region
r=N XY X Y / Step 5: Next
Compute dtn betwn text and non text values from Mean

2
2
fix_ threshold()
If dtn > = Th
N X2 X N Y2 Y flag = flag +1
Step 6: Next
If flag = = x
(28) Include i into Priority SET 1
Else if | flag - x | < t1
where N is the number of values. Include i into Priority SET 2
Else if | flag - x | < t2
The value of r lies between 1 and 1, inclusive. If X and Include i into Priority SET 3
Y are completely correlated, r takes the value of 1 or 1. If X Step 7: Next
Step 8: Priority Set Evaluation ()
and Y are independent, r is zero. It is a symmetrical measure {Min_ redundancy check ();
for two variables. If X and Y have a strong positive linear Max_ relevancy check () ;}
{/*Returns Significant
correlation, r is close to +1. A feature with r value of exactly Features*/}
+1 indicates a perfect positive fit which are truly redundant in Step 9: End
the sense that no additional information is gained by adding

Procedure fix_threshold ( )
them. If X and Y have a strong negative linear correlation, Step 1: Start
r is close to 1. If there is no linear correlation or a weak Step 2: For i=1 to n features
Step 3: For j=1 to x images
linear correlation, r is close to 0. A correlation greater than Draw histogram for text and non text regions
0.8 (close to 1) are generally described as strong, whereas a Th = Min (Valley) of histogram
Step 4: Next
correlation less than 0.5 (close to 0) are generally described Step 5: Next
Step 6: End
as weak.
6. Then qualified top level significant features are used to
define a primitive to distinguish text and non text regions Table 3 Selected features with coefficient of variation (CV)
of the image. S. No Feature Name CV
The Procedure for MLFP is summarized as follows:
1 SRHGE-0 10.47
Declare 2 HGRE-0 14.92
n: no. of features 3 LRHGE-90 15.46
4 SRHGE-90 17.30
k: no. of levels
5 (NLV) 19.86
m: no. of blocks in segmented
6 GLN-135 20.48
decomposed image 7 HGRE-90 21.99
x: no. of images 8 GLN-0 25.15
N : Sub bands count 9 GLN-45 25.46
Th: Threshold for difference value 10 SRHGE-45 27.68

11 LRHGE-45 33.13
dm : Difference between the text and non text values
12 HGRE-45 36.08
t1, t2, t3: Thresholds to decide priority, 13 LRHGE-135 40.39
such that t1 t2 t3
4.3.2 Analysis of Features with MLFP algorithm show that they are consistent. There is a marginal rise in C.V
for Priority sets P2 , P3 than P1 as in Table 3.
Above 5 sets of features comprising 55 features are analyzed Correlation coefficient matrix values for all 55 features
and 13 features are selected and categorized into 3 priorities are calculated and observed that r value for P1 set features is
using MLFP algorithm as in Table 2. C.V is calculated for close to 0 ranging between 0.5 to 0 and 0 to 0.5 and shows
the series with difference of value between text and non text that features are weakly correlated and are not redundant.
regions for all 55 features. It is observed that top seven fea- r Value for seven features are listed in Table 4 with self corre-
tures comprising P1 have scored lesser C.V than others and lation shown as 1. Distributions of some features from Fig. 4
123
SIViP (2011) 5:165183 175
Table 4 Selected Priority 1

level features with correlation r NLV HGRE-0 HGRE-90 SRHGE-0 SRHGE-90 LRHGE-90 GLN-135
coefficient (r )
NLV 1.00 0.07 0.43 0.05 0.30 0.21 0.18
HGRE-0 0.07 1.00 0.15 0.22 0.50 0.24 0.22
HGRE-90 0.43 0.15 1.00 0.41 0.38 0.04 0.36
SRHGE-0 0.05 0.22 0.41 1.00 0.50 0.03 0.43
SRHGE-90 0.30 0.50 0.38 0.50 1.00 0.27 0.33
LRHGE-90 0.21 0.24 0.04 0.03 0.27 1.00 0.38
GLN-135 0.18 0.22 0.36 0.43 0.33 0.38 1.00
LRHGELong Run High Gray Level Emphasis

H, V, DHorizontal, Vertical & Diagonal direction
It is observed from experimentation that above selected

features have shown distinguishable difference between text
and non text and two features are shown in Figs. 3 and 4a and
all seven features showing remarkable difference between
text and non text for set of images as shown in Fig. 4b.
Fig. 3 NLV feature differentiating text and non text 4.3.3 Results of feature selection algorithm
and comparison with other methods
may seem to be same but their variance differs which make Our data set for feature selection algorithm is created by
them weakly correlated and not redundant. pooling features from 5 different texture models compris-
Subsequently, out of 13 features, only 7 top level prior- ing 55 features. In addition to this, some more data sets are
ity features are selected based on the filter-based evaluation selected from UCI Machine Learning Repository. WEKAs
method and are used to define a primitive P to distinguish collection (27) is used to implement all the algorithms cho-
text and non text regions and is described as a seven-attribute sen for comparison. In this section, we evaluate the efficiency
tuple where and effectiveness of our method by comparing MLFP with
representative feature selection algorithms such as Genetic
P = (NLV, HGRE-H, HGRE-V, SRHGE-H, search algorithm, ReliefF and Greedy stepwise search
SRHGE-V, LRHGE-V, GLN-D-135) algorithm [27]. For each data set, we first run all the feature
selection algorithms in comparison, and obtain the selected
features for each algorithm as in Table 5. It is observed that
NLVNormalized Local Variance all these algorithms achieve significant reduction of dimen-
GLNGray Level Non uniformity sionality by selecting only a small portion of the original
HGREHigh Gray Level Run Emphasis features. MLFP on average selects the smallest number of
SRHGEShort Run High Gray Level Emphasis features.
Fig. 4 Various features distinguishing text (T) and non text (NT). a One of GLRLM feature- GLN 135, b selected features differentiating
T and NT
123
176 SIViP (2011) 5:165183
Table 5 Number of features

selected by each feature Data set (No. of features) MLFP Genetic search Greedy stepwise ReliefF
selection (FS) algorithm
Our textual dataset 55 7 10 8 8
Splice 60 9 12 11 10
Chemical 150 8 14 12 11
CoIL2000 85 10 18 17 16
Average feature count 350 34 54 48 45
Table 6 Accuracy of various

classifiers on selected features Classifier FS
for each FS algorithm MLFP-FS Genetic search Greedy stepwise ReliefF
F1 RMSE F1 RMSE F1 RMSE F1 RMSE
IBk 0.89 0.3282 0.75 0.5 0.5 0.679 0.69 0.5

NBC 0.85 0.4097 0.67 0.5 0.75 0.5993 0.33 0.7071
C4.5 0.90 0.2236 0.5 0.71 0.5 0.7071 0.73 0.5
F1-measure for FS algorithms 5 Text region classification

IBk NBC C4.5
F1-measure
1
Here neural network classifier is used to classify objects into
0.5 one or two or more groups based on a set of features that
0
describe the objects. The above mentioned seven features
MLFP-FS Genetic search Greedy ReliefF are used as the basis for classification. After extracting the
Method features, a neural network is trained to identify text-like win-
Fig. 5 F1-measure for feature selection algorithm dows. The network used is the general feed-forward network
trained with back propagation. To train the network,
1. Choose the mask for typical text and non text regions.
In addition, three bench mark learning algorithms, Naive 2. Extract features for these regions separately
bayes Classifier [28], lazy instance-based algorithm (k-NN) - 3. Calculate mean of appropriate feature for all text and non
IBk nearest neighbour [29] with k = 5 and eager model text regions separately
based C4.5 Decision tree algorithm [30] are used to evalu- 4. Pass that mean for training
ate the predictive accuracy on the selected subset of features.
F1 measure and Root mean square error (RMSE) calculated Each small window is classified using the trained neural
for all datasets and average of those measures are listed in net on the feature vector associated with that window. If a
Table 6. These measures are estimated with the use of tenfold window is classified as text, all the pixels in this window
stratified cross-validation tests. will be labeled as text. Those pixels which are not covered
It is observed from Table 6 and Fig. 5 that for each of the by any text window are labeled as non text. As the result of
four data sets, the highest accuracy is achieved by applying classification, a label map of the original image is generated
IBk and C4.5 on the feature subset selected by MLFP. as in Fig. 6b.
MLFP achieves higher accuracy, lower RMSE, smallest
number of selected features compared with other FS algo-
rithms. The reason is that importance of the features is 6 Text region localization
checked at two levels in MLFP. Relevant features are selected
based on DBC-based consistency and features are catego- 6.1 Verification of text regions
rized into priority sets with high relevance. Second time of
consistency check is incorporated during priority set eval- The generated label map of the original image after text
uation. So MLFP produces comparatively small number of region classification consists of detected text regions, but
features with high relevance and is faster as it follows filter may include non text as text regions also. These detected
method. regions have to be verified as text regions with the removal
123
SIViP (2011) 5:165183 177
Fig. 6 Various steps in the MLFP method. a Input image, b O/P of tal projection, j vertical projection, k logical AND of horiz verti image,
NN classifier, c VM closing, d HM closing, e morphological opening, l binary image, m O/P text region, n ground truth image
f bounding box region, g superimpose with i/p, h edge image, i horizon-
of unwanted non text regions. This verification stage under where z = {X, Y j }, j = 0, 1, . . . n, X = C1 for VM
text localization stage groups and localizes the text regions closing
identified by the detection stage into text instances. The ver-
ification rules are used for text localization to
(A B) = (A B) B = {z |(B)z A = } (30)
(a) Identify and remove small unwanted non text regions.
(b) Group the unfilled pixels by closing all the gaps.
The following verification rules are shown in Fig. 6. After the where z = {X i , Y }, i = 0, 1, . . . n, Y = C2 for HM
mask region is obtained from neural network, closing
2. Binary area open operation is performed to remove all
1. Perform vertical morphological (VM) closing and hori- connected components (objects) that have fewer than P
zontal morphological (HM) closing operation as in pixels from a binary image.
Eqs. (29) and (30) to smooth the contour, fuses narrow 3. Place the bounding box for every region identified and
breaks and eliminates small holes. Closing is performed corresponding region is taken. Superimpose this mask
with dilation followed by erosion where C1, C2 are region with input image to get candidate text region.
constants. 4. Perform edge detection with sobel operator.
5. Calculate horizontal projection (HP) for every row from
(A B) = (A B) B = {z |(B)z A = } (29) an edge image.
123
178 SIViP (2011) 5:165183
(a) Initialize new binary image with 0 same as size as 7. Perform logical AND of the output images from HP and
input image VP to get the text region
(b) Assign 1 for only Rows with (HP) > Mean (HP) T = i j (35)
x

1| Hi > H
8. This binary image is superimposed with input image to
i=0
i = (31) get text from image.

x
0| Hi > H
i=0
(c) Perform open operation vertically
7 Binarization
(A B) = (A B) B = {(B)z|(B) Z A}
(32) Before the extracted text is parsed and recognized by the
Common OCR, it has to be binarised to extract the text
where z = {X, Y j }, j = 0, 1, . . . n, X = C1 from the identified text region. Well known global thres-
6. Calculate Vertical projection (VP) for every column from holding method described by Otsu [26] is tried for binari-
edge image zation. All intensity values below a selective threshold are
converted to one intensity level and intensities higher than
(a) Initialize new binary image with 0 same as size as this threshold are converted to the other chosen intensity.
input image It segments an image into foreground and background. The
(b) Assign 1 for only columns with VP > mean (VP) foreground contains interested characters and this process

y
generates an output image with white text against a black

1| V j > V
background.
j=0
j = y (33)

<

0| V j V

j=0 8 Results and performance analysis
(c) Perform open operation horizontally
8.1 Experimentation setup and Performance measures
(A B) = (A B) B = {(B)z|(B) Z A}
(34) In the experimental part, data sets for various textual images
with variations in illumination, perspective, font size and
where z = {X i , Y }, i = 0, 1, . . . n, Y = C2 orientation are considered as in Table 7 and tested using
Table 7 Data set for text

extraction experimentation Image group Image type No. of images
Caption text images News telecast & video titles 25

Scene text images Name plate images
Number plate images
Book cover images
General embedded text images 90
Variations:
Skewed images 15
Perspective projection 15
Angular text 20
Radially changing text 20
Illuminated images (day, evening and night light ) 15
Language tested (English, Tamil and Hindi) 15
Document images Scanned document image 50
Web document image 20
Hybrid Scene and Caption text in same image 15
Total images 300
123
SIViP (2011) 5:165183 179
1.79 GHz system having 1GB RAM. These data set images 8.2 Results and comparison with other text extraction
have been gathered from sites of several research groups such methods
as Laboratory for Language and Media Processing (LAMP),
Automatic Movie Content Analysis (MoCA) Project, Com- The proposed text extraction system has been tested with the
puter Vision Lab., Pennsylvania State University. The sig- dataset mentioned in Table 7. The proposed system produces
nificance of testing the algorithms on the above variations is good performance for Scene text, Caption text and Docu-
to ensure the efficiency of this methodology to function as a ment images by producing better recall rate missing relatively
unified technique. weak texts and shown promising precision rate as non text
The performance of the proposed technique has been eval- objects are correctly classified as in Table 8, Figs. 7 and 8. The
uated based on its precision and recall rates and F score results have also been verified against other existing algo-
obtained. In order to have a common method to evaluate and rithms such as Edge-based method [6], Connected compo-
compare the results from each algorithm, in this project each nent (CC) method [7] and our previous Texture-based method
text pixel in ground truth and output image is considered SBTA-TD/TL [15]. Even though, Edge based method is
in the calculation of precision and recall rate. Precision and designed for Scene text and Document images and
recall rates are calculated as follows where CDP (Correctly CC method is designed for Caption text images, proposed
Detected Pixels) is the number of pixels matching between method is compared with the above two methods in order to
output mask image (O) and Ground Truth image (GT). show the ability of the proposed method to extract text from
all three kinds of images in much better way. We imple-
CDP = O GT (36) mented the above mentioned three methods and estimated
Precision rate (PR) = CDP/(CDP + FP) (37) with our data sets and compared with the proposed sys-
Recall rate (RR) = CDP/(CDP + FN) (38) tem. Performance measures for comparison are shown in
Tables 8 and 9.
Precision rate takes into consideration the false positives, The results of the proposed system shows clear improve-
which are the non text pixels in output image and have been ment over [6,7] methods and when compared to texture based
detected by the algorithm as text pixels. They are not text SBTA method [15], proposed system is showing better per-
pixels in ground truth image but detected as text pixels in formance with F-score of 90 and 81.5% for Caption text and
output mask image Scene text images, respectively, as far as the basic three kinds
FP = O GT (39) of images are concerned without any mentioned variations
in text as in Fig. 8 and Table 8.
Recall rate takes into consideration the false negatives, Subsequently, the proposed system is showing apprecia-
which are text pixels in output image, and have not been ble improvement in performance over [6,7,15] for hetero-
detected by the algorithm. They are text pixels in ground geneous images with variations in text such as variation in
truth image but not detected as text pixels in output mask orientation, font size, perspective projection and lighting con-
image ditions, as the proposed system is intended to handle varia-
tions in text so as to qualify as a better unified framework
FN = O GT (40) for text extraction from heterogeneous images than texture
Thus, precision and recall rates are useful as measures to based SBTA method [15].
determine the accuracy of each algorithm in locating correct It is observed from experimentation that MLFP method
text regions and eliminating non text regions. F-score is the produces good and comparable results with F-score of 80,
harmonic mean of recall and precision rates as represented 82.7 and 78.3% for Orientation variation, font size varia-
in Eq. (29) tion, lighting variation, respectively, as shown in Table 9 and
Fig. 9. The reason for this improvement in performance is
F-score = (2 PR RR)/(PR + RR) (41) twofold. (1) NSCT is invariant to scale and orientation and
Table 8 Average F-score for

the proposed and other methods Image type Method
Edge-based CC Texture Proposed

method (%) method (%) method (%) MLFP method (%)
Caption text images 66 75 86 90

Scene text images 63 34 78 81.5
Document images 71.6 75 91.7 91
123
180 SIViP (2011) 5:165183
Fig. 7 Heterogeneous images with variations. a Caption text with small font, b scene text with large font, c scene text with mixed font size,
d document image, e hybrid image with oriented text, f scene image with radially changing text
Table 9 Comparative results

for variations in textual images Methods Variations
Orientation variation Font size variation Lighting variation
P R F-score P R F-score P R F-score
Edge based method 35 80 48.7 46 70.8 55.8 41 87 55.7

CC method 55 82 65.8 50 83 62.4 36 75 48.6
Texture method 70 83 76 67 85 75 60 87 71
MLFP method 75 85 80 78 88 82.7 70 89 78.3
Document images Scene text images Caption text images 76 75 71 80 82.7 78.3
65.8 62.4
55.7 55.8
48.7 48.6
MLFP
Texture
Methods
CC Edge based CC Texture MLFP
Edge based Orientation variance Font size variance Lighting variance
0% 20% 40% 60% 80% 100%

F-score Fig. 9 Comparison of F-score for various variations in images
Fig. 8 Comparison of F-score for heterogeneous images
and results are more robust to the variation in orientation, font

GLRLM features are also extracted in various directions. size, perspective projection and lighting conditions. One such
(2) Optimal set of features suitable for better discrimina- result for Scene text with perspective projection and differ-
tion between text and non text with variations are chosen ent lighting conditions is shown for all above methods and
by MLFP algorithm. It is observed that the proposed system MLFP method in Figs. 10 and 11. Results of various cases
is showing clear improvement over other compared methods have been shown in Figs. 7, 10 and 11 as follows:
123
SIViP (2011) 5:165183 181
Fig. 10 Results of images for various methods at three different lighting conditions. a Images at three different lighting conditions, b results from
Edge based algorithm, c CC based algorithm, d texture based algorithm, e proposed MLFP algorithm
Fig. 11 Scene text with perspective projection. a Input image, b CC method, c edge method, d texture, e MLFP method
123
182 SIViP (2011) 5:165183
1. Caption text with small font in Fig. 7a. 5. Kumar, S., Gupta, R., Khanna, N., Chaudhury, S., Joshi, S.D.: Text
2. Scene text with large font in Fig. 7b. extraction and document image segmentation using matched wave-
lets and MRF model. IEEE Trans. Image Process. 16(8), 2117
3. Scene text with mixed font size in Fig. 7c. 2128 (2007)
4. Document image in Fig. 7d. 6. Liu, X., Samarabandu, J.: Multiscale edge-based text extrac-
5. Hybrid image (with Scene and Caption text) with ori- tion from complex images. IEEE Int. Conf. Multimedia Expo.,
ented text in Fig. 7e. pp. 17211724 (2006)
7. Gllavata, J., Ewerth, R., Freisleben, B.: A Robust algorithm for
6. Scene text with radially changing text in Fig. 7f. text detection in images. In: Proceedings of 3rd international
7. Scene text with different lighting conditions for Edge symposium on image and signal processing and analysis, vol. 2,
based, CC based, texture based and Proposed method in pp. 611616 (2003)
Fig. 10. 8. Li, H., Doermann, D., Kia, O.: Automatic text detection and
tracking in digital video. IEEE Trans. Image Process. 9(1), 147
8. Scene text with perspective projection for Edge based,
156 (2000)
CC based, Texture based and Proposed method in Fig. 11. 9. Lin, L., Tan, C.L.: Text extraction from name cards using neural
network. In: IJCNN 05. Proceedings of IEEE international joint
conference neural networks, vol. 3, pp. 18181823 (2005)
10. Zhang, D., Chang, S.-F.: Accurate Overlay Text Extraction for
9 Conclusion Digital Video Analysis. Int. Conf. on Information Technology:
Research and Education, ITRE2003, pp. 233237
11. Jeong, K.-Y., Jung, K., Kim, E.Y., Kim, H.J.: Neural network-based
In this paper, the objective of extracting text from hetero- text location for news video indexing. IEEE Proc ICIP 3, 319
geneous images is achieved by focusing with the variations 323 (1999)
involved in orientation, font size, perspective projection and 12. Pan, Y.-F., Hou, X., Liu, C.-L.: Text localization in natural scene
different lighting conditions with MLFP algorithm applied images based on conditional random field (ICDAR 09), pp. 610
(2009)
on the transformed Contourlet coefficients. Effectiveness of 13. Phan, T.Q., Shivakumara, P., Tan, C.L.: A Laplacian method for
MLFP feature selection algorithm is shown by producing video text detection (ICDAR 09), pp. 6670 (2009)
optimal set of features with the compared three feature selec- 14. Shi, Z., Setlur, S., Govindaraju, V.: Text extraction from gray
tion methods. Experimental results also show that the scale historical document images using adaptive local connectivity
map. (ICDAR 05) 2, 794798 (2005)
proposed text extraction method using MLFP algorithm out- 15. Gopalan, C., Manjula, D.: Contourlet based approach for text iden-
performs the Edge based method and Connected component tification and extraction from heterogeneous textual images. Int. J.
method and performs marginally well with Texture based Comput. Sci. Eng. 2(4), 202211 (2008)
method for heterogeneous images with appreciable improve- 16. Do, M.N., Vetterli, M.: The contourlet transform: an efficient direc-
tional multiresolution image representation. IEEE Trans. Image
ment shown for Scene text images. It is showing clear Proc. 14(12), 20912106 (2005)
improvement in the performance over all three methods when 17. da Cunha, A.L., Zhou, J., Do, M.N.: The Non sub sampled Con-
the images are considered with variations. The results indi- tourlet Transform: theory, design and applications. IEEE Trans.
cate that our methodology using NSCT and NLV and Image Process. 15(10), 30893101 (2006)
18. Chen, N., Blostein, D.: A survey of document image classification:
GLRLM feature based MLFP algorithm have the efficacy problem statement, classifier architecture and performance evalu-
to discriminate between text and non text for three kinds of ation. Int. J. Document Anal. Recogn. 10, 116 (2007)
images with variations in text. As an extension of this system, 19. Liang, J., Doermann, D., Huiping, L.: Camera-based analysis of
it is planned to tune binarization stage to work equally well text and documents: a survey. Int. J. Document Anal. Recogn.
7, 84104 (2005)
for the three kinds of text images also. 20. Galloway, M.M.: Texture analysis using gray level run
lengths. Comput. Graph. Image Process 4, 172179 (1975)
21. Haralick, R., Shanmugam, K., Dinstein, I.: Textual features for
image classification. IEEE Trans. Syst. Man Cybern. SMC-3(6),
References 610621 (1973)
22. Gnitecki, J., Moussavi, Z.: Classification of lung sounds during
1. Jung, K., Kim, K.I., Jain, A.K.: Text information extraction in bronchial provocation using waveform fractal dimension. In: Pro-
images and video: a survey. J. Pattern Recogn. Soc. 37(5), 977 ceeding 26th conference IEEE engineering in medicine and biology
997 (2004) society (EMBS), pp. 38443847 (2001)
2. Liu, Y., Goto, S., Ikenaga, T.: A contour-based robust algorithm 23. Molina, L.C., Belanche, L., Nebot, A.: Feature selection algo-
for text detection in color images. IEICE Trans. Inf. Syst. E89 rithms: a survey and experimental evaluation data mining. In:
D(3), 12211230 (2006) ICDM 2002. Proceedings of IEEE international conference,
3. Jiang, R., Qi, F., Xu, L., Wu, G., Zhu, K.: A learning-based method pp. 306313 (2002)
to detect and segment text from scene images. J. Zhejiang Univ. 24. Guyon, I., Elisseeff, A.: An introduction to variable and feature
Sci. A 8(4), 568574 (2007) selection. J. Mach. Learn. Res. 3, 11571182 (2003)
4. Karatzas, D., Antonacopoulos, A.: Text extraction from web 25. Gupta, S.C., Kapoor, V.K.: Fundamentals of mathematical statis-
images based on a split-and-merge segmentation method using col- tics, chap. 2. Sultan Chand and Sons, New Delhi, pp. 2.432.45
our perception. In: Proceedings of 17th international conference on (1970)
pattern recognition (ICPR 2004), August 2004. IEEE Computer 26. Otsu, N.: A threshold selection method from grey-level histo-
Society Press, pp. 634637 grams. IEEE Trans. Syst. Man Cybern. SMC-1, 6266 (1979)
123
SIViP (2011) 5:165183 183
27. Witten, I.H., Frank, E.: Data miningpractical machine learn- 29. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning
ing tools and techniques with JAVA implementations, Morgan algorithms. Mach. Learn. 6(1), 3766 (1991)
Kaufmann, Menlo Park (2000) 30. Quinlan, R.: C4.5: Programs for machine learning. Morgan
28. John, G.H., Langley, P.: Estimating continuous distributions in Kaufmann, Menlo Park (1993)
Bayesian classifiers. In: Eleventh conference on uncertainty in
artificial intelligence, pp. 338345 (1995)
123

1 1

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1 1

Uploaded by

Copyright:

Available Formats

SIViP (2011) 5:165183

Statistical modeling for the detection, localization and extraction

(a) Candidate text region detection

Input Training Images

Feature extraction Verification of text regions

ONLINE PROCESSING OFFLINE PROCESSING

Fig. 1 System architecture of the proposed system

(1) Spatial distribution of the gray levels in the texture

Table 1 Set of features

Table 2 Selected features with multi level feature priority algorithm

Fractal dimension is a parameter frequently used to analyze

The intensity of the pixels in an image helps to differentiate

where f i T 1 , . . . , f i T n and f i N T 1 , . . . , f i N T n are fea- C.V = 100 (/ X ) (27)

dependencies between variable and target. It is a measure Step 1: Start

the sense that no additional information is gained by adding

Th: Threshold for difference value 10 SRHGE-45 27.68

Table 4 Selected Priority 1

LRHGELong Run High Gray Level Emphasis

It is observed from experimentation that above selected

Table 5 Number of features

Table 6 Accuracy of various

F1 RMSE F1 RMSE F1 RMSE F1 RMSE

IBk 0.89 0.3282 0.75 0.5 0.5 0.679 0.69 0.5

F1-measure for FS algorithms 5 Text region classification

Table 7 Data set for text

Caption text images News telecast & video titles 25

Table 8 Average F-score for

Edge-based CC Texture Proposed

Caption text images 66 75 86 90

Table 9 Comparative results

Orientation variation Font size variation Lighting variation

P R F-score P R F-score P R F-score

Edge based method 35 80 48.7 46 70.8 55.8 41 87 55.7

Edge based Orientation variance Font size variance Lighting variance

0% 20% 40% 60% 80% 100%

Fig. 8 Comparison of F-score for heterogeneous images

and results are more robust to the variation in orientation, font

You might also like