DOI 10.1007/s11760-010-0152-1
ORIGINAL PAPER
Received: 29 April 2009 / Revised: 1 January 2010 / Accepted: 1 January 2010 / Published online: 29 January 2010
Springer-Verlag London Limited 2010
Abstract  Discriminating between the text and non text regions of an image is a complex and challenging task. In contrast to Caption text, Scene text can have any orientation and may be distorted by the perspective projection. Moreover, it is often affected by variations in scene and camera parameters such as illumination, focus, etc. These variations make the design of unified text extraction from various kinds of images extremely difficult. This paper proposes a statistical unified approach for the extraction of text from hybrid textual images (both Scene text and Caption text in an image) and Document images with variations in text, using carefully selected features with the help of the multi level feature priority (MLFP) algorithm. The selected features are jointly found to be a good choice of feature vector and have the efficacy to discriminate between text and non text regions for Scene text, Caption text and Document images, and the proposed system is robust to illumination, transformation/perspective projection, font size and radially changing/angular text. The MLFP feature selection algorithm is evaluated with three common ML algorithms, a decision tree inducer (C4.5), a naive Bayes classifier and an instance based K-nearest neighbour learner, and the effectiveness of MLFP is shown by comparison with three feature selection methods on benchmark datasets. The proposed text extraction system is compared with the Edge based method, the Connected component method and the Texture based method, shows encouraging results, and finds its major applications in preprocessing for optical character recognition, multimedia processing, mobile robot navigation, vehicle license detection and recognition, page segmentation, text-based image indexing, etc.

Keywords  Text extraction · Non sub sampled Contourlet Transform · Gray level run length matrix · Caption text · Scene text · Document image

Abbreviations
NSCT  Non sub sampled Contourlet Transform
NSP  Non sub sampled pyramid
NSDFB  Non sub sampled directional filter bank
CC  Connected component
NLV  Normalized local variance
GLRLM  Gray level run length matrix
GLCM  Gray level co-occurrence matrix
MLFP  Multi level feature priority
VM closing  Vertical morphological closing
HM closing  Horizontal morphological closing

Chitrakala Gopalan (B)
Department of Computer Science and Engineering, Easwari Engineering College, Anna University, Ramapuram, Chennai, Tamil Nadu, India
e-mail: ckgops@gmail.com

Chitrakala Gopalan
4/3, Kaliappa Naicker St., Plot no: 80, Nehru Nagar, Ramapuram, Chennai 600089, Tamil Nadu, India

D. Manjula
Department of Computer Science and Engineering, College of Engineering, Anna University, Guindy, Chennai, Tamil Nadu, India
e-mail: manju@annauniv.edu

1 Introduction

The growing popularity of the Internet and the World Wide Web has resulted in tremendous growth of multimedia data containing still images and video, in addition to textual information. Text in images/video usually carries important messages about the content. Indexing, querying and retrieving multimedia information uses the textual information embedded in the multimedia data.
SIViP (2011) 5:165–183
This has created an overriding need to provide efficient means of text extraction from textual images for effective data storage, retrieval, search, querying and interaction capabilities, and for preprocessing for OCR. Extracting embedded/inserted text in images often gives an indication of a scene's semantic content. Automatic extraction of text is a challenging job due to variations in font style, size, orientation and alignment, and the complexity of the background.

Text in images/videos is classified into Caption text and Scene text [1]. A Caption text image is one containing inserted text, otherwise called superimposed/artificial text. Natural textual images/embedded texts are called Scene texts or graphics text images. Electronic documents, images of paper documents, and images acquired by scanning book covers, CD covers or other multi-colored documents are called Document images.

Literature studies so far have addressed three different approaches to extract text from images, namely Bottom-up, Top-down and Hybrid approaches. The Bottom-up approach starts with the identification of sub-structures, such as connected components (CCs) or edges, and then merges these sub-structures to mark bounding boxes for text. The Top-down approach looks for global information in the page and splits the page from column level to word level. Edge based and Connected component (CC) based methods are categorized under Bottom-up methods, and Texture based methods under the Top-down approach. The proposed system employs the Non sub sampled Contourlet transform and texture analysis to extract text from Caption text images, Scene text images and Document images with variations in illumination, font size, perspective projection and orientation, using the MLFP algorithm with a Neural network classifier.

2 Related work

The literature has been surveyed to find existing methods for text extraction from different kinds of images. The method in [2] is based on the combination of connected component and texture feature analysis of unknown text region contours. Each candidate text region is verified with texture features derived from the wavelet domain, followed by an expectation maximization algorithm to binarise each text region. Jiang et al. [3] proposed a learning-based method for text detection and text segmentation in natural scene images. Here, the input image is decomposed into multiple CCs by the Niblack clustering algorithm. Then all the CCs, including text CCs and non text CCs, are verified on their text features by a two-stage classification module.

Karatzas and Antonacopoulos [4] follow a split-and-merge strategy based on the Hue-Lightness-Saturation (HLS) representation of color as a first approximation of an anthropocentric expression of the differences in chromaticity and lightness. Character-like components are then extracted as forming text lines in a number of orientations and along curves. Kumar et al. [5] proposed globally matched wavelet filters with Fisher classifiers for text extraction from Document images and Scene text images. Liu and Samarabandu [6] proposed an Edge based method with edge strength, density and orientation variance as distinguishing characteristics of text embedded in images, which can handle printed document and Scene text images. This method used a multi scale edge detector for the text detection stage and a dilation operator for the text localization stage.

Gllavata et al. [7] proposed a connected component based method which uses a color reduction technique followed by horizontal projection profile analysis and can extract text from Caption text images. Li et al. [8] presented algorithms for detecting and tracking text in digital video and implemented a scale-space feature extractor that feeds an artificial neural processor to detect text blocks. Lin and Tan [9] proposed a method that applies a neural network on Canny edges with both spatial and relative features, such as sizes, color attributes and relative alignment features. By making use of the alignment information, the text area can be identified at the character level rather than the conventional window block level. In [10], the regions of interest (ROI) probably containing the overlay texts are decomposed into several hypothetical binary images using color space partitioning. A grouping algorithm is then conducted to group the identified character blocks into text lines in each binary image.

Jeong et al. [11] classify text pixels and non text pixels using a network that operates as a set of texture discrimination filters to find and locate text regions in Caption text images, using histogram analysis after removing errors in the classification results. Pan et al. [12] propose a hybrid method to localize texts in natural scene images with a Conditional Random Field (CRF) model, considering the unary component property as well as the binary neighboring component relationship. Finally, text components are grouped into text lines with an energy minimization approach.

Phan [13] proposes a text detection method for video based on the Laplacian operator. K-means is then used to classify all the pixels into two clusters: text and non text. This method undergoes projection profile analysis to determine the boundaries of the text blocks and employs empirical rules to eliminate false positives based on geometrical properties. Experimental results show that the proposed method is able to detect text of different fonts, contrasts and backgrounds.

Shi et al. [14] present an algorithm using an adaptive local connectivity map for retrieving text lines from complex handwritten documents such as handwritten historical manuscripts. The algorithm is designed to solve problems like fluctuating text lines, touching or crossing text lines and the low quality images seen in handwritten documents.
The above mentioned approaches focused on extracting text from Scene text, Caption text and Document images separately, or on some combinations only. In contrast to Caption text, Scene text can have any orientation and may be distorted by the perspective projection. Moreover, it is often affected by variations in scene and camera parameters such as illumination, focus, etc. These variations make the design of unified text extraction from various kinds of images extremely difficult.

Recently we [15] proposed an image analysis based approach called the Sub Band Texture Analysis based Text Detection/Text Localization (SBTA-TD/TL) technique for text extraction from heterogeneous images, which is robust to limited orientation (horizontal, vertical) and limited font size. Comparable performance for Scene text images was not produced, however. It was also observed that the SBTA-TD/TL technique suffers and does not show encouraging results for the following variations:

- Variation in illumination
- Variation over a wide range of font sizes
- Variation in skewness
- Variation in angularity of text
- Perspective projection

It is required to produce equal performance for Scene text images as well, with robustness to the variations in textual images. Consequently, all these observations, drawbacks and advantages were analyzed, and a more complex image analysis plus machine learning based approach called the multi level feature priority (MLFP) technique is proposed for text detection and localization from heterogeneous images. It takes care of variation in illumination, transformation/perspective projection, font size, and angular and radially changing text, so as to make the system suitable for heterogeneous images.

3 Approach

The goal is to build a unified extraction of text from a heterogeneous range of textual images, with a focus on transformed, illuminated and angular forms. The major contributions of the proposed text extraction system are as follows:

Text extraction is carried out in two phases, namely Offline processing (training phase) and Online processing (testing phase). Offline processing is carried out to extract and generate feature vectors for training images from the image corpus. Experimental analysis on various images indicated that text regions typically have different texture properties than non text areas. This is analyzed by decomposing the input image with a variation of the Contourlet transform, the Non sub sampled Contourlet Transform (NSCT), which decomposes the image into a set of 2^n directional sub bands, with texture details or edges captured in different orientations at various scales for the n levels specified. Each decomposed sub band has high intensity texture details in various directions, with highly prominent values shown for the NSCT coefficients; the eight sub bands are merged by addition to form an image with edges detected in various directions. The transformed NSCT coefficients in this edge image are used for the calculation of feature vectors for the text and non text regions of images.

Five different sets of features are extracted from the above edge image and analyzed with the MLFP algorithm, and the best features are selected. With this MLFP algorithm, the textural distribution of the spatial frequency components within the decomposed regions (Normalized Local Variance, NLV) and features based on the run, a series of pixels having the same gray level in a definite direction in the image (Gray level run length matrix, GLRLM, based features), are collectively found to be a good choice of feature vector and have the efficacy to discriminate between text and non text regions for the three kinds of images with variations in lighting, orientation and font size.

During Online processing, when the user supplies an input image, the selected features are extracted from the transformed and merged contourlet coefficients by capturing the textural distribution of the pixels, producing a feature vector. A neural network classifier is used to classify the regions of the textual image as candidate text and non text regions using the extracted feature vector. Candidate text regions then undergo selected verification rules to eliminate unwanted non text regions. Finally, a binarization algorithm is applied to extract text from the text regions by eliminating the background.

The system architecture of the proposed system is shown in Fig. 1. The proposed approach includes:
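The two-phase flow described above can be sketched in Python. This is only a minimal sketch: the paper uses the NSCT, for which no standard library implementation is assumed here, so a simple directional-difference filter bank stands in for the merged directional sub bands, and `window_features` is a hypothetical placeholder for the five feature sets defined in Sect. 4.2.

```python
import numpy as np

def merged_edge_image(image, directions=((0, 1), (1, 0), (1, 1), (1, -1))):
    """Stand-in for the merged NSCT sub bands: absolute differences in
    several directions are accumulated, so edges of any orientation
    contribute to one combined 'edge image'."""
    img = image.astype(float)
    edge = np.zeros_like(img)
    for dy, dx in directions:
        shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
        edge += np.abs(img - shifted)
    return edge

def window_features(edge, n=16):
    """Hypothetical per-window feature vectors, one row per n x n block.

    The real features (NLV, GLRLM and GLCM statistics, ...) are defined
    in Sect. 4.2; mean/std/max are placeholders with the same shape."""
    h, w = edge.shape
    feats = []
    for y in range(0, h - n + 1, n):
        for x in range(0, w - n + 1, n):
            win = edge[y:y + n, x:x + n]
            feats.append([win.mean(), win.std(), win.max()])
    return np.array(feats)
```

Offline, such feature rows (with text/non text labels) train the classifier; online, the same extraction feeds the trained classifier.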
Fig. 1 System architecture of the proposed system (NSCT decomposition of the input image, feature extraction and selection, representation of features, neural network training and testing classifier, candidate text/non text regions, text region localization with verification rules, and binarization)

(c) Text region localization
    - Verification rules
(d) Binarization

4 Candidate text region detection

Text extraction from images always starts with candidate text region detection, which involves detecting the presence of text in the image. In this paper, this detection is done by

(1) capturing multi oriented texture components with high intensity texture details, representing primitively and roughly identified text regions, from the NSCT decomposed merged sub bands;
(2) extracting the best features from the merged sub bands so as to better distinguish text and non text regions with the help of the classifier stage.

These detected candidate text regions are later verified in the text localization stage.

4.1 Image decomposition using the Non sub sampled Contourlet Transform

The contourlet transform is an extension of the wavelet transform which uses multi scale and directional filter banks. Here images are oriented in various directions at multiple scales, with flexible aspect ratios. The contourlet transform effectively captures the smooth contours that are the dominant feature in natural images. The contourlet transform [16] is a multi directional and multi scale transform that is constructed by combining the Laplacian pyramid with the Directional Filter Bank (DFB). Due to the down samplers and up samplers present in both the Laplacian pyramid and the DFB, the contourlet transform is not shift-invariant. An overcomplete transform, the Non sub sampled Contourlet Transform (NSCT), has been proposed in [17] and is applied in the proposed system. The NSCT is a fully shift-invariant, multi scale and multi direction expansion that has a fast implementation. Here filters are designed with better frequency selectivity, thereby achieving better sub band decomposition.

The NSCT [17] can thus be divided into two shift-invariant parts: (1) a non sub sampled pyramid structure that ensures the multi scale property of the NSCT and is obtained from a shift-invariant filtering structure that achieves sub band decomposition similar to that of the Laplacian pyramid, and (2) a non sub sampled DFB structure that gives directionality. A shift-invariant directional expansion is also obtained with the NSDFB. The NSDFB is constructed by eliminating the down samplers and up samplers in the DFB. This is done by switching off the down samplers/up samplers in each two-channel filter bank in the DFB tree structure and up sampling the filters accordingly. This results in a tree composed of two-channel NSDFBs. Refer to [15] for a detailed description of the NSCT.

The NSCT applied on the input image produces 2^n sub bands for the n levels specified. Here, eight sub bands have been obtained.
these methods are often computationally expensive, but lead to good results. Herewith, five sets of texture features are considered for investigation to analyze the following properties of the image:

Global standard deviation (GSD) for each coefficient:

GSD_i = \sqrt{ \frac{1}{wh} \sum_{u=0}^{w} \sum_{v=0}^{h} (C_i(u,v) - GM_i)^2 }   (2)
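Equation (2), together with the local statistics of Eqs. (3)-(5) that follow, reduces to a few array operations. A minimal sketch, assuming each sub band C_i is a 2-D NumPy array with non-constant values (so GSD > 0) and a window size of N = 16 as in the text:

```python
import numpy as np

def global_sd(coeff):
    """Global mean GM_i and global standard deviation GSD_i (Eq. 2)."""
    gm = coeff.mean()
    return gm, np.sqrt(((coeff - gm) ** 2).mean())

def normalized_local_variance(coeff, n=16):
    """NLV per window (Eqs. 3-5): local variance over each n x n window,
    normalized by the sub band's global standard deviation."""
    _, gsd = global_sd(coeff)          # assumes a non-constant sub band
    h, w = coeff.shape
    nlv = np.zeros((h // n, w // n))
    for bi in range(h // n):
        for bj in range(w // n):
            win = coeff[bi * n:(bi + 1) * n, bj * n:(bj + 1) * n]
            local_mean = win.mean()                       # Eq. (3)
            local_var = ((win - local_mean) ** 2).mean()  # Eq. (4)
            nlv[bi, bj] = local_var / gsd                 # Eq. (5)
    return nlv
```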
The local mean M_i in the window of size N x N is

M_i = \frac{1}{N^2} \sum_{u=0}^{N} \sum_{v=0}^{N} C_i(u,v)   (3)

and the local coefficient variance VC_i is

VC_i(u,v) = \frac{1}{N^2} \sum_{u=0}^{N} \sum_{v=0}^{N} (C_i(u,v) - M_i)^2   (4)

where w and h are the numbers of pixels in the horizontal and vertical directions, respectively, in the image, i = 0 to 9, N = 2^n (= 16), and C_i is the ith coefficient. The normalized variance NVC_i(u,v) for each pixel is calculated as follows:

NVC_i(u,v) = \frac{VC_i(u,v)}{GSD_i}   (5)

The relative deviation of the local coefficient in the N x N window region is computed for each pixel of the image.

4.2.2 GLRLM based features

The next set of features was extracted from the decomposed merged sub bands using texture analysis based on the GLRLM. Texture analysis is a method to analyze the density distribution in an image statistically. Texture analysis based on the GLRLM is one such analysis; it is a method based on the run, a series of pixels having the same gray level in a definite direction in the image [20]. Since the density, length and direction of the runs are included, the aggregation of the runs from the image represents features of the texture.

From the original run length matrix p(i, j), many numerical texture measures can be computed [20], such as Short Run Emphasis (SRE), Long Run Emphasis (LRE), Gray Level Non uniformity (GLN), Run Length Non uniformity (RLN), Low Gray Level Run Emphasis (LGRE), High Gray Level Run Emphasis (HGRE), Short Run Low Gray Level Emphasis (SRLGE), Short Run High Gray Level Emphasis (SRHGE), Long Run Low Gray Level Emphasis (LRLGE) and Long Run High Gray Level Emphasis (LRHGE).

For a given image, a run-length matrix p(i, j) is defined as the number of runs with pixels of gray level i and run length j. Let M be the number of gray levels and N be the maximum run length. The P_g vector represents the sum distribution of the number of runs with gray level i, and n_r is the total number of runs. Only the subsets of GLRLM features which have been selected with the MLFP algorithm, as in Table 2, are described as follows:

(1) Gray level non uniformity (GLN):

GLN = \frac{1}{n_r} \sum_{i=1}^{M} \left( \sum_{j=1}^{N} p(i,j) \right)^2 = \frac{1}{n_r} \sum_{i=1}^{M} P_g(i)^2   (6)

(2) High gray level run emphasis (HGRE):

HGRE = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} p(i,j) \, i^2 = \frac{1}{n_r} \sum_{i=1}^{M} p_g(i) \, i^2   (7)
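Given a run-length matrix p(i, j) as defined above, GLN and HGRE from Eqs. (6) and (7), together with SRHGE and LRHGE from Eqs. (8) and (9), can be computed directly. A small sketch, with the 1-based gray level and run length indices made explicit:

```python
import numpy as np

def glrlm_features(p):
    """GLN, HGRE, SRHGE and LRHGE (Eqs. 6-9) from a run-length matrix p,
    where p[i-1, j-1] counts runs of gray level i with run length j."""
    M, N = p.shape
    i = np.arange(1, M + 1)[:, None]   # gray levels (rows)
    j = np.arange(1, N + 1)[None, :]   # run lengths (columns)
    nr = p.sum()                       # total number of runs
    return {
        "GLN": (p.sum(axis=1) ** 2).sum() / nr,       # Eq. (6)
        "HGRE": (p * i ** 2).sum() / nr,              # Eq. (7)
        "SRHGE": (p * i ** 2 / j ** 2).sum() / nr,    # Eq. (8)
        "LRHGE": (p * (i * j) ** 2).sum() / nr,       # Eq. (9)
    }
```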
Table 2 Features selected and categorized into priority levels by the MLFP algorithm, with their behavior for text and non text regions

No.  Feature set                                   Feature                                       Priority  Text  Non text
1    Local energy deviation of NSCT coefficients   Normalized local variance (NLV)               (1)       Low   High
2    GLRLM based features                          High gray level run emphasis HGRE-0           (1)       High  Low
3                                                  HGRE-90                                       (1)       High  Low
4                                                  Short run high gray level emphasis SRHGE-0    (1)       High  Low
5                                                  SRHGE-90                                      (1)       High  Low
6                                                  Long run high gray level emphasis LRHGE-90    (1)       High  Low
7                                                  Gray level non uniformity GLN-135             (1)       Low   High
8                                                  Short run high gray level emphasis SRHGE-45   (2)       High  Low
9                                                  Long run high gray level emphasis LRHGE-45    (2)       High  Low
10                                                 High gray level run emphasis HGRE-45          (2)       High  Low
11                                                 Gray level non uniformity GLN-0               (2)       Low   High
12                                                 GLN-45                                        (2)       Low   High
13                                                 Long run high gray level emphasis LRHGE-135   (3)       High  Low
(3) Short run high gray level emphasis (SRHGE):

SRHGE = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{p(i,j) \, i^2}{j^2}   (8)

(4) Long run high gray level emphasis (LRHGE):

LRHGE = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} p(i,j) \, i^2 j^2   (9)

These features are used to capture apparent properties of the run length distribution. It is observed from experimentation that the above selected features show a distinguishable difference between text and non text, as shown in Fig. 4a-e.

4.2.3 Gray level co-occurrence matrix (GLCM) based features

The gray level co-occurrence matrix (GLCM) can reveal certain properties about the spatial distribution of the gray levels in a texture image, as proposed by Haralick et al. [21], and is also known as the gray level spatial dependence matrix. Hence, GLCM-based features were considered for feature extraction. The GLCM is a tabulation of how often different combinations of pixel values (grey levels) occur in an image. When divided by the total number of neighboring pixels in the image, this matrix becomes the estimate of the joint probability of two pixels, a distance d apart in a direction e, having particular (co-occurring) grey values i and j. The dimension of the GLCM is G x G, where G is the number of grey levels used to construct the matrix. Various statistics can be derived from the created GLCM and can provide information about the texture of an image. Five of the fourteen Haralick features [21] have been used in the approach.

These features are contrast, homogeneity, energy, entropy and correlation. To calculate the different features, the joint probability density of grey level co-occurrence computed by the GLCM is weighted differently. The first two features (contrast, homogeneity) can be grouped and named the Contrast Group; they compute the quantity of contrast in a window. The second group, called the Orderliness Group, contains the features (energy and entropy) which indicate how regular (orderly) the pixel values are within the window.

(a) Contrast measures the local variations in the gray level co-occurrence matrix. It is calculated as

Contrast = \sum_{i,j=0}^{L-1} P_{ij} (i - j)^2   (10)

Here P_{ij} is element i, j of the normalized symmetrical GLCM, and N is the number of gray levels in the image.

(b) Homogeneity measures the closeness of the distribution of elements in the GLCM to the GLCM diagonal. It is calculated as

Homogeneity = \sum_{i,j=0}^{L-1} \frac{P_{ij}}{1 + (i - j)^2}   (11)
(c) Energy provides the sum of squared elements in the GLCM, and is also known as uniformity or the angular second moment. It is calculated as

Energy = \sum_{i,j=0}^{L-1} (P_{ij})^2   (12)

where m_3 is the third moment.

Kurtosis = \frac{1}{\sigma^4} \sum_{x=0}^{L-1} (x - m)^4 P_u(x)   (19)

4.2.5 Fractal dimension
the selected features meeting the criteria into various priority levels at the time of the selection of the features itself.

4.3.1 Multi level feature priority (MLFP) algorithm

The multi level feature priority algorithm aims at the selection of the most appropriate subset of features that adequately describes a given classification task. The MLFP algorithm is based on the sequential backward selection method, which considers all features and removes features with low discrimination ability to distinguish the text and non text regions of an image. The algorithm also categorizes the selected features meeting the criteria into various priority levels at the time of the selection of the features itself.

- The algorithm selects and categorizes features as first level (top) features if the difference between the feature values of text and non text regions exceeds the threshold and the same is maintained for all n images (consistent for all n images).
- The algorithm selects and categorizes features as second level features if the difference between the feature values of text and non text regions exceeds the threshold but is consistent only for a large subset of images n1, where n1 < n.
- The algorithm selects and categorizes features as third level features if the difference between the feature values of text and non text regions exceeds the threshold and is consistent not for all images but only for a small subset of images n2, where n2 < n1.

The various steps involved in the MLFP algorithm during offline processing are as follows:

1. For each image of m blocks, compute the following for each feature:

   F_{iT} = (f_{iT1} + ... + f_{iTn}) / N
   F_{iNT} = (f_{iNT1} + ... + f_{iNTm}) / M   (21)
   DF_i = F_{iT} - F_{iNT}

4. Maximally relevant features are selected and categorized into various priority sets such as P1, P2, P3 (with P1 as the highest priority) based on Difference between classes (DBC) based consistency, i.e., based on the difference between the flag and the number of images N. If the difference between text and non text for a particular feature maintains a value above the threshold for all images, such features are chosen as candidates for the first (highest) priority level P1.

   P1 = { f_i / |flag - N| = 0 }   (23)
   P2 = { f_i / |flag - N| < t1 }   (24)
   P3 = { f_i / |flag - N| < t2, where t2 ≻ t1 }   (25)

   where ≻ is a partial ordering such that t1 ≻ t2 is defined as t1 - t2 >= 0.

5. The selected priority sets of features must be maximally relevant and minimally redundant in order to qualify as significant features. Maximum relevancy is rechecked by consistency, and minimum redundancy is checked by correlation among the features.

Consistency: An optimal set of features to distinguish the text and non text regions of an image should be consistent, or homogeneous. This indicates that the difference between text and non text for a particular feature should be consistent for all images, irrespective of the type of image (Caption text, Scene text or Document image). Such features become candidates for the optimal set of features. Consistency is measured by the coefficient of variation (C.V), calculated as follows.

Coefficients of dispersion (C.D) are calculated [25] to compare the variability of two series which differ widely in their averages or which are measured in different units. The coefficient of dispersion based upon the standard deviation is given as

   C.D = S.D / Mean = \sigma / \bar{X}   (26)

Coefficient of variation: 100 times the coefficient of dispersion based upon the standard deviation is called the coefficient of variation (C.V), and it is calculated as C.V = 100 \times \sigma / \bar{X}.
between the features. Correlation criteria can detect linear dependencies between the features.

Procedure Multi level Feature Priority ( )

4.3.2 Analysis of features with the MLFP algorithm

The above 5 sets of features, comprising 55 features, are analyzed, and 13 features are selected and categorized into 3 priorities using the MLFP algorithm, as in Table 2. The C.V is calculated for the series of differences of value between text and non text regions for all 55 features. It is observed that the top seven features comprising P1 have scored a lower C.V than the others, showing that they are consistent. There is a marginal rise in C.V for priority sets P2 and P3 compared with P1, as in Table 3.

Correlation coefficient matrix values for all 55 features are calculated, and it is observed that the r value for the P1 set features is close to 0, ranging between -0.5 to 0 and 0 to 0.5, which shows that the features are weakly correlated and not redundant. The r values for the seven features are listed in Table 4, with self correlation shown as 1. Distributions of some features from Fig. 4
may seem to be the same, but their variances differ, which makes them weakly correlated and not redundant.

Fig. 3 NLV feature differentiating text and non text

Subsequently, out of the 13 features, only the 7 top level priority features are selected, based on the filter-based evaluation method, and are used to define a primitive P to distinguish text and non text regions, described as a seven-attribute tuple

P = (NLV, HGRE-H, HGRE-V, SRHGE-H, SRHGE-V, LRHGE-V, GLN-D-135)

where

NLV – Normalized Local Variance
GLN – Gray Level Non uniformity
HGRE – High Gray Level Run Emphasis
SRHGE – Short Run High Gray Level Emphasis

4.3.3 Results of the feature selection algorithm and comparison with other methods

Our data set for the feature selection algorithm is created by pooling features from 5 different texture models, comprising 55 features. In addition to this, some more data sets are selected from the UCI Machine Learning Repository. WEKA's collection [27] is used to implement all the algorithms chosen for comparison. In this section, we evaluate the efficiency and effectiveness of our method by comparing MLFP with representative feature selection algorithms such as the Genetic search algorithm, ReliefF and the Greedy stepwise search algorithm [27]. For each data set, we first run all the feature selection algorithms in comparison and obtain the selected features for each algorithm, as in Table 5. It is observed that all these algorithms achieve a significant reduction of dimensionality by selecting only a small portion of the original features. MLFP on average selects the smallest number of features.

Fig. 4 Various features distinguishing text (T) and non text (NT). a One of the GLRLM features, GLN-135, b selected features differentiating T and NT
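The priority-set construction of Eqs. (23)-(25) and the consistency measure of Eq. (26) can be sketched as below. The reading of the flag/N comparison is an assumption, since the paper's set notation is only partially recoverable here; `threshold`, `t1` and `t2` stand for the paper's thresholds.

```python
import numpy as np

def coefficient_of_variation(series):
    """C.V = 100 * sigma / mean, i.e. 100 times the C.D of Eq. (26)."""
    s = np.asarray(series, dtype=float)
    return 100.0 * s.std() / s.mean()

def priority_level(diffs, threshold, t1, t2):
    """Assign one feature to P1/P2/P3 (Eqs. 23-25) or reject it.

    `diffs` holds DF_i, the per-image text/non-text differences for the
    feature; `flag` counts the images where |DF_i| stays above the
    threshold, so |flag - N| = 0 means consistency on all N images."""
    n = len(diffs)
    flag = sum(1 for d in diffs if abs(d) > threshold)
    gap = abs(flag - n)
    if gap == 0:
        return "P1"       # consistent for all n images
    if gap < t1:
        return "P2"       # consistent for a large subset
    if gap < t2:
        return "P3"       # consistent only for a small subset
    return "rejected"
```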
Fig. 5 F1-measure for the feature selection algorithms (MLFP-FS, Genetic search, Greedy, ReliefF)

In addition, three benchmark learning algorithms, the Naive Bayes classifier [28], the lazy instance-based k-NN algorithm (IBk nearest neighbour [29] with k = 5) and the eager model based C4.5 decision tree algorithm [30], are used to evaluate the predictive accuracy on the selected subsets of features. The F1 measure and root mean square error (RMSE) are calculated for all datasets, and the averages of those measures are listed in Table 6. These measures are estimated with the use of tenfold stratified cross-validation tests.

It is observed from Table 6 and Fig. 5 that for each of the four data sets, the highest accuracy is achieved by applying IBk and C4.5 on the feature subset selected by MLFP. MLFP achieves higher accuracy, lower RMSE and the smallest number of selected features compared with the other FS algorithms. The reason is that the importance of the features is checked at two levels in MLFP. Relevant features are selected based on DBC-based consistency, and the features are categorized into priority sets with high relevance. A second consistency check is incorporated during priority set evaluation. So MLFP produces a comparatively small number of features with high relevance, and it is faster as it follows the filter method.

5 Text region classification

Here a neural network classifier is used to classify objects into two or more groups based on a set of features that describe the objects. The above mentioned seven features are used as the basis for classification. After extracting the features, a neural network is trained to identify text-like windows. The network used is a general feed-forward network trained with back propagation. To train the network,

1. Choose the mask for typical text and non text regions.
2. Extract features for these regions separately.
3. Calculate the mean of the appropriate feature for all text and all non text regions separately.
4. Pass those means for training.

Each small window is classified using the trained neural net on the feature vector associated with that window. If a window is classified as text, all the pixels in this window are labeled as text. Those pixels which are not covered by any text window are labeled as non text. As the result of classification, a label map of the original image is generated, as in Fig. 6b.
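The window-level classification and label-map construction just described can be sketched as follows; `classify` stands in for the trained feed-forward network and may be any callable on a window, e.g. `lambda w: w.mean() > 0.5`.

```python
import numpy as np

def build_label_map(image, classify, win=16):
    """Label every win x win window via the (stand-in) classifier:
    pixels of windows classified as text get 1, all others stay 0."""
    h, w = image.shape
    labels = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            window = image[y:y + win, x:x + win]
            if classify(window):
                labels[y:y + win, x:x + win] = 1
    return labels
```

In the paper the decision comes from the back-propagation-trained network on the seven-attribute tuple P; the sketch only fixes the window-to-label-map bookkeeping.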
Fig. 6 Various steps in the MLFP method. a Input image, b output of the NN classifier, c VM closing, d HM closing, e morphological opening, f bounding box region, g superimposition with the input, h edge image, i horizontal projection, j vertical projection, k logical AND of the horizontal and vertical projection images, l binary image, m output text region, n ground truth image

of unwanted non text regions. This verification stage, under the text localization stage, groups and localizes the text regions identified by the detection stage into text instances. The verification rules are used for text localization to

(a) identify and remove small unwanted non text regions;
(b) group the unfilled pixels by closing all the gaps.

The following verification rules are shown in Fig. 6. After the mask region is obtained from the neural network,

1. Perform the vertical morphological (VM) closing and horizontal morphological (HM) closing operations as in Eqs. (29) and (30) to smooth the contour, fuse narrow breaks and eliminate small holes. Closing is performed with dilation followed by erosion, where C1, C2 are constants.

   A \bullet B = (A \oplus B) \ominus B = \{ z \mid (B)_z \cap A \neq \emptyset \}   (29)

   where z = {X, Y_j}, j = 0, 1, ..., n, X = C1 for VM closing, and

   A \bullet B = (A \oplus B) \ominus B = \{ z \mid (B)_z \cap A \neq \emptyset \}   (30)

   where z = {X_i, Y}, i = 0, 1, ..., n, Y = C2 for HM closing.

2. A binary area open operation is performed to remove all connected components (objects) that have fewer than P pixels from the binary image.
3. Place a bounding box around every region identified, and take the corresponding region. Superimpose this mask region on the input image to get the candidate text region.
4. Perform edge detection with the Sobel operator.
5. Calculate the horizontal projection (HP) for every row of the edge image.
   (a) Initialize a new binary image with 0, of the same size as the input image.
   (b) Assign 1 only for rows with HP > mean(HP):

       α_i = 1 if H_i > mean(HP), else 0, for i = 0, 1, ..., x   (31)

   (c) Perform the open operation vertically:

       A ∘ B = (A ⊖ B) ⊕ B = {(B)_z | (B)_z ⊆ A}   (32)

       where z = {X, Y_j}, j = 0, 1, ..., n, X = C1.

6. Calculate the vertical projection (VP) for every column of the edge image.

   (a) Initialize a new binary image with 0, of the same size as the input image.
   (b) Assign 1 only for columns with VP > mean(VP):

       β_j = 1 if V_j > mean(VP), else 0, for j = 0, 1, ..., y   (33)

   (c) Perform the open operation horizontally:

       A ∘ B = (A ⊖ B) ⊕ B = {(B)_z | (B)_z ⊆ A}   (34)

       where z = {X_i, Y}, i = 0, 1, ..., n, Y = C2.

7. Perform a logical AND of the output images from HP and VP to get the text region:

   T(i, j) = α_i ∧ β_j   (35)

8. This binary image is superimposed with the input image to get the text from the image.

7 Binarization

Before the extracted text is parsed and recognized by a common OCR, it has to be binarised to extract the text from the identified text region. The well known global thresholding method described by Otsu [26] is tried for binarization. All intensity values below a selected threshold are converted to one intensity level, and intensities higher than this threshold are converted to the other chosen intensity. This segments an image into foreground and background; the foreground contains the characters of interest, and the process generates an output image with white text against a black background.
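The morphological closing of Eqs. (29) and (30), the projection rules of Eqs. (31), (33) and (35), and the Otsu thresholding stage might be sketched as follows. This is a minimal numpy sketch; the function names, the brute-force loops and the border-clipped windows are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def dilate(mask, h, w):
    # Dilation with an h x w rectangular structuring element: a pixel is 1
    # if any pixel under the (border-clipped) window is 1.
    H, W = mask.shape
    out = np.zeros_like(mask)
    for i in range(H):
        for j in range(W):
            out[i, j] = mask[max(0, i - h // 2):i + h // 2 + 1,
                             max(0, j - w // 2):j + w // 2 + 1].max()
    return out

def erode(mask, h, w):
    # Erosion: a pixel stays 1 only if every pixel under the window is 1.
    H, W = mask.shape
    out = np.zeros_like(mask)
    for i in range(H):
        for j in range(W):
            out[i, j] = mask[max(0, i - h // 2):i + h // 2 + 1,
                             max(0, j - w // 2):j + w // 2 + 1].min()
    return out

def close(mask, h, w):
    # Closing = dilation followed by erosion (Eqs. 29/30); use (C1, 1)
    # for VM closing and (1, C2) for HM closing.
    return erode(dilate(mask, h, w), h, w)

def projection_region(edges):
    # Eqs. (31), (33) and (35): keep rows whose horizontal projection
    # exceeds its mean (alpha_i) and columns whose vertical projection
    # exceeds its mean (beta_j), then AND the two binary images.
    hp = edges.sum(axis=1)
    vp = edges.sum(axis=0)
    alpha = (hp > hp.mean()).astype(np.uint8)[:, None]
    beta = (vp > vp.mean()).astype(np.uint8)[None, :]
    return alpha & beta

def otsu_threshold(gray):
    # Otsu's method [26]: pick the grey level that maximizes the
    # between-class variance of the histogram.
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    w0 = sum0 = 0.0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]
        if w0 == 0 or w0 == total:
            continue
        sum0 += t * hist[t]
        w1 = total - w0
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

The binary area open of step 2 would additionally remove small connected components; that pass is omitted here for brevity.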
8 Results and performance analysis

8.1 Experimentation setup and performance measures

In the experimental part, data sets for various textual images with variations in illumination, perspective, font size and orientation are considered, as in Table 7, and tested on a 1.79 GHz system having 1 GB RAM. These data set images have been gathered from the sites of several research groups, such as the Laboratory for Language and Media Processing (LAMP), the Automatic Movie Content Analysis (MoCA) Project and the Computer Vision Lab., Pennsylvania State University. The significance of testing the algorithms on the above variations is to ensure the efficiency of this methodology to function as a unified technique.
The performance of the proposed technique has been evaluated based on its precision rate, recall rate and F-score. In order to have a common method to evaluate and compare the results from each algorithm, in this project each text pixel in the ground truth and output image is considered in the calculation of the precision and recall rates. Precision and recall rates are calculated as follows, where CDP (Correctly Detected Pixels) is the number of pixels matching between the output mask image (O) and the ground truth image (GT):

CDP = O ∩ GT   (36)
Precision rate (PR) = CDP/(CDP + FP)   (37)
Recall rate (RR) = CDP/(CDP + FN)   (38)

Precision rate takes into consideration the false positives (FP), the non text pixels that have been detected by the algorithm as text pixels: they are not text pixels in the ground truth image but are detected as text pixels in the output mask image,

FP = O − GT   (39)

Recall rate takes into consideration the false negatives (FN), the text pixels that have not been detected by the algorithm: they are text pixels in the ground truth image but are not detected as text pixels in the output mask image,

FN = GT − O   (40)

Thus, precision and recall rates are useful measures of the accuracy of each algorithm in locating correct text regions and eliminating non text regions. F-score is the harmonic mean of the recall and precision rates, as represented in Eq. (41):

F-score = (2 × PR × RR)/(PR + RR)   (41)

8.2 Results and comparison with other text extraction methods

The proposed text extraction system has been tested with the dataset mentioned in Table 7. The proposed system produces good performance for Scene text, Caption text and Document images, producing a better recall rate, missing only relatively weak texts, and showing a promising precision rate, as non text objects are correctly classified, as in Table 8 and Figs. 7 and 8. The results have also been verified against other existing algorithms such as the Edge based method [6], the Connected component (CC) method [7] and our previous Texture based method SBTA-TD/TL [15]. Even though the Edge based method is designed for Scene text and Document images and the CC method is designed for Caption text images, the proposed method is compared with these two methods in order to show its ability to extract text from all three kinds of images in a much better way. We implemented the three methods mentioned above, evaluated them on our data sets and compared them with the proposed system. Performance measures for comparison are shown in Tables 8 and 9.
The results of the proposed system show a clear improvement over the methods of [6,7]; compared with the texture based SBTA method [15], the proposed system shows better performance, with F-scores of 90 and 81.5% for Caption text and Scene text images, respectively, as far as the basic three kinds of images without the mentioned variations in text are concerned, as in Fig. 8 and Table 8.
Subsequently, the proposed system shows appreciable improvement in performance over [6,7,15] for heterogeneous images with variations in text, such as variations in orientation, font size, perspective projection and lighting conditions, as the proposed system is intended to handle variations in text so as to qualify as a better unified framework for text extraction from heterogeneous images than the texture based SBTA method [15].
It is observed from experimentation that the MLFP method produces good and comparable results, with F-scores of 80, 82.7 and 78.3% for orientation variation, font size variation and lighting variation, respectively, as shown in Table 9 and Fig. 9. The reason for this improvement in performance is twofold. (1) NSCT is invariant to scale and orientation and
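The pixel-level evaluation of Eqs. (36) through (41) can be sketched as follows; `pixel_scores` is an illustrative name, and the masks are assumed to be binary arrays:

```python
import numpy as np

def pixel_scores(out_mask, gt_mask):
    # Pixel-level evaluation: CDP counts pixels that are text in both the
    # output mask O and the ground truth GT (Eq. 36); FP counts text
    # pixels of O absent from GT (Eq. 39); FN counts text pixels of GT
    # missed by O (Eq. 40).
    o = np.asarray(out_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    cdp = np.sum(o & gt)
    fp = np.sum(o & ~gt)
    fn = np.sum(~o & gt)
    pr = cdp / (cdp + fp)            # precision rate, Eq. (37)
    rr = cdp / (cdp + fn)            # recall rate, Eq. (38)
    f = 2 * pr * rr / (pr + rr)      # F-score, Eq. (41)
    return pr, rr, f
```

With one false positive and one false negative against four ground-truth text pixels, precision, recall and F-score all evaluate to 0.75, matching the harmonic-mean definition.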
Fig. 7 Heterogeneous images with variations. a Caption text with small font, b scene text with large font, c scene text with mixed font size,
d document image, e hybrid image with oriented text, f scene image with radially changing text
Fig. 8 F-score comparison of the Edge based, CC, Texture and MLFP methods for Document images, Scene text images and Caption text images

Fig. 9 F-scores of the MLFP method for images with variations (80, 82.7 and 78.3% for orientation, font size and lighting variation, respectively)
Fig. 10 Results of images for various methods at three different lighting conditions. a Images at three different lighting conditions, b results from
Edge based algorithm, c CC based algorithm, d texture based algorithm, e proposed MLFP algorithm
Fig. 11 Scene text with perspective projection. a Input image, b CC method, c edge method, d texture, e MLFP method
1. Caption text with small font in Fig. 7a.
2. Scene text with large font in Fig. 7b.
3. Scene text with mixed font size in Fig. 7c.
4. Document image in Fig. 7d.
5. Hybrid image (with Scene and Caption text) with oriented text in Fig. 7e.
6. Scene text with radially changing text in Fig. 7f.
7. Scene text under different lighting conditions for the Edge based, CC based, Texture based and proposed methods in Fig. 10.
8. Scene text with perspective projection for the Edge based, CC based, Texture based and proposed methods in Fig. 11.

9 Conclusion

In this paper, the objective of extracting text from heterogeneous images is achieved by focusing on the variations involved in orientation, font size, perspective projection and different lighting conditions, with the MLFP algorithm applied on the transformed Contourlet coefficients. The effectiveness of the MLFP feature selection algorithm is shown by its production of an optimal set of features in comparison with the three compared feature selection methods. Experimental results also show that the proposed text extraction method using the MLFP algorithm outperforms the Edge based method and the Connected component method and performs marginally better than the Texture based method for heterogeneous images, with appreciable improvement shown for Scene text images. It shows clear improvement in performance over all three methods when the images are considered with variations. The results indicate that our methodology, using the NSCT and the NLV and GLRLM feature based MLFP algorithm, has the efficacy to discriminate between text and non text for the three kinds of images with variations in text. As an extension of this system, it is planned to tune the binarization stage to work equally well for all three kinds of text images.

References

1. Jung, K., Kim, K.I., Jain, A.K.: Text information extraction in images and video: a survey. J. Pattern Recogn. Soc. 37(5), 977–997 (2004)
2. Liu, Y., Goto, S., Ikenaga, T.: A contour-based robust algorithm for text detection in color images. IEICE Trans. Inf. Syst. E89-D(3), 1221–1230 (2006)
3. Jiang, R., Qi, F., Xu, L., Wu, G., Zhu, K.: A learning-based method to detect and segment text from scene images. J. Zhejiang Univ. Sci. A 8(4), 568–574 (2007)
4. Karatzas, D., Antonacopoulos, A.: Text extraction from web images based on a split-and-merge segmentation method using colour perception. In: Proceedings of the 17th international conference on pattern recognition (ICPR 2004), August 2004. IEEE Computer Society Press, pp. 634–637
5. Kumar, S., Gupta, R., Khanna, N., Chaudhury, S., Joshi, S.D.: Text extraction and document image segmentation using matched wavelets and MRF model. IEEE Trans. Image Process. 16(8), 2117–2128 (2007)
6. Liu, X., Samarabandu, J.: Multiscale edge-based text extraction from complex images. In: IEEE international conference on multimedia and expo, pp. 1721–1724 (2006)
7. Gllavata, J., Ewerth, R., Freisleben, B.: A robust algorithm for text detection in images. In: Proceedings of the 3rd international symposium on image and signal processing and analysis, vol. 2, pp. 611–616 (2003)
8. Li, H., Doermann, D., Kia, O.: Automatic text detection and tracking in digital video. IEEE Trans. Image Process. 9(1), 147–156 (2000)
9. Lin, L., Tan, C.L.: Text extraction from name cards using neural network. In: IJCNN 05, proceedings of the IEEE international joint conference on neural networks, vol. 3, pp. 1818–1823 (2005)
10. Zhang, D., Chang, S.-F.: Accurate overlay text extraction for digital video analysis. In: International conference on information technology: research and education (ITRE 2003), pp. 233–237
11. Jeong, K.-Y., Jung, K., Kim, E.Y., Kim, H.J.: Neural network-based text location for news video indexing. In: Proceedings of IEEE ICIP, vol. 3, pp. 319–323 (1999)
12. Pan, Y.-F., Hou, X., Liu, C.-L.: Text localization in natural scene images based on conditional random field. In: ICDAR 09, pp. 6–10 (2009)
13. Phan, T.Q., Shivakumara, P., Tan, C.L.: A Laplacian method for video text detection. In: ICDAR 09, pp. 66–70 (2009)
14. Shi, Z., Setlur, S., Govindaraju, V.: Text extraction from gray scale historical document images using adaptive local connectivity map. In: ICDAR 05, vol. 2, pp. 794–798 (2005)
15. Gopalan, C., Manjula, D.: Contourlet based approach for text identification and extraction from heterogeneous textual images. Int. J. Comput. Sci. Eng. 2(4), 202–211 (2008)
16. Do, M.N., Vetterli, M.: The contourlet transform: an efficient directional multiresolution image representation. IEEE Trans. Image Process. 14(12), 2091–2106 (2005)
17. da Cunha, A.L., Zhou, J., Do, M.N.: The nonsubsampled contourlet transform: theory, design and applications. IEEE Trans. Image Process. 15(10), 3089–3101 (2006)
18. Chen, N., Blostein, D.: A survey of document image classification: problem statement, classifier architecture and performance evaluation. Int. J. Document Anal. Recogn. 10, 1–16 (2007)
19. Liang, J., Doermann, D., Huiping, L.: Camera-based analysis of text and documents: a survey. Int. J. Document Anal. Recogn. 7, 84–104 (2005)
20. Galloway, M.M.: Texture analysis using gray level run lengths. Comput. Graph. Image Process. 4, 172–179 (1975)
21. Haralick, R., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Trans. Syst. Man Cybern. SMC-3(6), 610–621 (1973)
22. Gnitecki, J., Moussavi, Z.: Classification of lung sounds during bronchial provocation using waveform fractal dimension. In: Proceedings of the 26th conference of the IEEE engineering in medicine and biology society (EMBS), pp. 3844–3847 (2001)
23. Molina, L.C., Belanche, L., Nebot, A.: Feature selection algorithms: a survey and experimental evaluation. In: ICDM 2002, proceedings of the IEEE international conference on data mining, pp. 306–313 (2002)
24. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
25. Gupta, S.C., Kapoor, V.K.: Fundamentals of mathematical statistics, chap. 2, pp. 2.43–2.45. Sultan Chand and Sons, New Delhi (1970)
26. Otsu, N.: A threshold selection method from grey-level histograms. IEEE Trans. Syst. Man Cybern. SMC-9(1), 62–66 (1979)
27. Witten, I.H., Frank, E.: Data mining: practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, Menlo Park (2000)
28. John, G.H., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Eleventh conference on uncertainty in artificial intelligence, pp. 338–345 (1995)
29. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Mach. Learn. 6(1), 37–66 (1991)
30. Quinlan, R.: C4.5: programs for machine learning. Morgan Kaufmann, Menlo Park (1993)