Int J Multimed Info Retr

DOI 10.1007/s13735-017-0130-2

TRENDS AND SURVEYS

Script identification algorithms: a survey


Parul Sahare1 · Sanjay B. Dhok1

Received: 5 May 2017 / Revised: 1 July 2017 / Accepted: 20 July 2017


© Springer-Verlag London Ltd. 2017

Abstract Script identification is widely used for selecting the appropriate script-specific OCR (Optical Character Recognition) engine for multilingual document images. Extensive research has been done in this field, but identification accuracy remains low. This is due to faded document images and to illumination and positioning problems introduced while scanning. Noise is also a major obstacle in the script identification process; it can be minimized only up to a level and cannot be removed completely. In this paper, an attempt is made to analyze and classify various script identification schemes for document images. These schemes are also compared, and their merits and demerits are discussed on a common platform. This will help researchers understand the complexity of the issue and identify possible directions for research in this field.

Keywords Script identification · Feature matching · Classifier · Optical character recognition · Document analysis

✉ Parul Sahare
parulsahare2387@gmail.com
Sanjay B. Dhok
sbdhok@ece.vnit.ac.in
1 Department of Electronics and Communication Engineering, Centre for VLSI and Nanotechnology, Visvesvaraya National Institute of Technology, Nagpur, Maharashtra, India

1 Introduction

Computer science combined with electronics plays a major, unavoidable role in our day-to-day life, ranging from software development to real-time systems. Demand for OCR software and hardware has increased many fold over the past decades. OCR has many applications, such as document indexing and digitization of public records, authentication, security and intelligence; these help in easily editing and searching images within large document collections. The cost and time of developing a universal OCR system for multiscript text are high, and doing so is less feasible than developing an individual OCR engine per script.

Script identification is the process of identifying the scripts (usually two or more) present in document images, and it is usually performed before the OCR process. A script is a graphical representation for writing, expressed through a language and a set of characters (Namboodiri and Jain [1]). This means that language is a subset of script, i.e., a script can be shared by more than one language. For example, the English and German languages share the Latin script, whereas the Hindi and Marathi languages share the Devanagari script. In India, there are 22 official spoken and written languages and 13 scripts, and most Indian scripts have approximately 13 vowels and 35 consonants (Pati and Ramakrishnan [2]).

OCR requires a well-resolved, preprocessed scanned document. In practice, however, document images generally suffer from poor scanning resolution, noise, blurring effects and text of multiple sizes, fonts and orientations (Sharma et al. [3]); sometimes they contain text in more than one script. These are the main problems encountered during script identification. Therefore, to obtain highly accurate OCR results, a good script identification process must be developed first.

Script identification is a sub-field of document image analysis and involves features at multiple scales: features are obtained from text-lines, words and characters at a fine scale, and from text blocks or paragraphs at a large scale. To calculate features at a fine scale, the text-lines, words and characters must be segmented first. This can be done using a simple technique such as projection profile analysis.


This paper surveys previously published work on script identification schemes for document images. Though a number of surveys are available on this topic, none of them analyzes script identification at both large and fine scales. To the best of our knowledge, this is the first paper to carry out a detailed analysis of different script identification algorithms, with a comparative analysis on common parameters at these scales. The rest of the paper is organized as follows: Sect. 2 explains various script identification methods along with their analysis. Section 3 contains observations, discussions and comments. Finally, the conclusion is given in Sect. 4.

2 Script identification analysis for document images

Script identification deals with determining the underlying script within documents (Shijian and Tan [4]). This can be done at the paragraph, text-line, word or character level. Figure 1 shows the overall script identification process. Generally, any identification or recognition process contains two stages: a training stage and a testing stage. In the training stage, the document image is scanned at a particular resolution (300 dpi is a good choice). After that, preprocessing steps such as noise elimination, skew detection and correction, and font size normalization are performed. If features are calculated at the text-line, word or character level, segmentation is necessary before the feature extraction step. Features are calculated in the feature extraction step and are then used to train the classifier. This classifier is used to decide the correct label of query data, i.e., it decides the script of the text during the testing stage. Features are vectors of real numbers obtained from an image and its pattern (Patil and Subbareddy [5]). Earlier research on script identification addressed machine-printed text on uniform backgrounds (Zhu et al. [6]). Script identification becomes more complex when handwritten text is mixed with machine-printed text on nonuniform backgrounds: handwritten text suffers from improper segmentation, so templates for identifying scripts from handwritten text are neither uniform nor rigid. Figure 2a shows an example of the large scale, while Fig. 2b–d shows examples of the fine scales of a document image from which features can be extracted for script identification. The script identification process is classified in the taxonomy shown in Fig. 3 and is discussed in the next section. Large-scale and fine-scale analyses are further investigated based on structural features, texture features, and hybrid or other features.

Structural features depend mainly on strokes, their orientations and their sizes. They provide a well-defined symbolic overview of components based on the spatial arrangement of pixels. Texture features depend on the visual appearance of components and exhibit periodicity; these are generally transform-based methods that change the domain of the components from one to another. The visual system of a human is capable of discriminating between objects through inspection (Joshi et al. [7]).
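The two-stage train/test pipeline described above can be sketched in code. This is a minimal illustrative skeleton, not any surveyed system: every function name, the toy features and the nearest-mean classifier are hypothetical stand-ins for the preprocessing, feature extraction and classification blocks of Fig. 1.

```python
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Binarize with a fixed global threshold, a stand-in for the noise
    removal, skew correction and normalization steps."""
    return (image < 128).astype(np.uint8)  # 1 = ink, 0 = background

def extract_features(block: np.ndarray) -> np.ndarray:
    """Toy feature vector: ink density plus row/column profile spread."""
    density = block.mean()
    row_profile = block.sum(axis=1)
    col_profile = block.sum(axis=0)
    return np.array([density, row_profile.std(), col_profile.std()])

class NearestMeanClassifier:
    """Minimum distance classifier: one mean feature vector per script."""
    def fit(self, X, y):
        self.labels_ = sorted(set(y))
        self.means_ = {c: np.mean([x for x, t in zip(X, y) if t == c], axis=0)
                       for c in self.labels_}
        return self
    def predict(self, x):
        return min(self.labels_,
                   key=lambda c: np.linalg.norm(x - self.means_[c]))

# Training stage: scan -> preprocess -> extract features -> train classifier.
rng = np.random.default_rng(0)
train_images = [rng.integers(0, 256, (32, 32)) for _ in range(6)]
train_labels = ['Latin', 'Devanagari'] * 3
X = [extract_features(preprocess(im)) for im in train_images]
clf = NearestMeanClassifier().fit(X, train_labels)

# Testing stage: the same steps, then feature matching yields the label.
query = extract_features(preprocess(train_images[0]))
print(clf.predict(query))
```

Real systems replace each stand-in with the concrete steps of the surveyed papers (skew correction, segmentation, texture or structural features, SVM or k-NN classifiers).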

Fig. 1 Overall script identification process. Stage I (training): document image scanning, preprocessing and segmentation, feature extraction, and classifier training with the feature vector database. Stage II (testing): the same scanning, preprocessing, segmentation and feature extraction steps, followed by feature matching, the decision/output label and, finally, Optical Character Recognition (OCR)

Fig. 2 a Text block, representing the large scale; b, c and d text-line, word and characters, respectively, representing the fine scale

Fig. 3 Taxonomy of the script identification process: global analysis (paragraph/text-block level) and local analysis (text-line level and word/character level), each further divided into structural, texture and other features


2.1 Large-scale analysis of script identification process

In this analysis, text blocks consisting of more than two lines, or complete paragraphs, are considered for script analysis of document images. This analysis needs no specific segmentation step, unlike fine-scale analysis, and it generally does not require the components of specific scripts to be segmented. It therefore offers a more generic platform for script identification.

2.1.1 Structural feature-based large-scale analysis

These features work well as long as the structures of the components are not distorted by noise and other factors. The advantage of this analysis is that the script identification rate is independent of the segmentation method, i.e., it is free from the errors that occur during the segmentation process (Shivakumara et al. [8]).

Patil and Subbareddy [5] applied a morphological dilation operation to text blocks in the vertical, horizontal, left-diagonal and right-diagonal directions with a 3 × 3 mask. Three neural network classifiers are then trained, one each for the Hindi, English and Kannada scripts, and the decision is made using the maximum output value among the three classifiers.

Zhu et al. [6] worked on shape-based structural features, which require no preprocessing steps and are scale, rotation, segmentation and translation independent. First, edges are detected in the image using the Canny operator. Then, contour segments are formed and extracted using connected components and line-fitting techniques. After that, a shape codebook is created for each script using clustering, with similar features partitioned through a graph-cut technique. For each cluster, the code word closest to the cluster center is selected, called the 'exemplary codeword.' An image descriptor is then formed as a histogram plot. Finally, a multiclass support vector machine is used to identify the Arabic, Chinese, English and Hindi scripts.

Shivakumara et al. [8] worked on identifying the Chinese, Arabic, Japanese, Korean, English and Tamil scripts from videos. First, text components are found using a gradient concept: the centroid of the gradient image is found, the image is divided based on the centroid, histograms of each block are calculated horizontally and vertically, and the results are combined. Then, a modified skeleton operation is applied, which uses an area calculation and k-means clustering for noise reduction; this step determines the intersection, junction and end points. The first feature formed is a gradient spatial feature: junction, end and intersection points are calculated using pixels along with their surrounding pixels for each script, and variances are computed for each point from the proximity matrices of each script using Euclidean distance. The second feature is a gradient structural feature: the total numbers of end, junction and intersection points in each script block are calculated and normalized by the total number of components. In addition, the straightness and cursiveness of the components are calculated, based on the number of straight-line pixels, the centroids of the components and the means of the junction, end and intersection points. This is done at the component level; at the branch level, the straightness and cursiveness of each branch are calculated similarly using straight-line pixels and branch centroids. Templates are then created for each feature and script, and Euclidean distance is used for identification. Experiments are conducted with the features separately and in combination. The gradient spatial feature identified the Chinese, English, Korean and Japanese scripts better, whereas the gradient structural feature better identified the Arabic and Tamil scripts.

Hochberg et al. [9] first performed connected component analysis and removed very small and very large components. Then, features such as relative X and Y centroids, the number of white holes, sphericity and aspect ratios are calculated for the handwritten Chinese, Arabic, Cyrillic, Japanese, Devanagari and Roman scripts. The relative X and Y centroids are the ratios of the vertical and horizontal centroids to the component height and width, respectively. Fisher linear discriminant analysis is then used for identification. Table 1 shows the performance metrics of the algorithms for structural feature-based large-scale analysis.

2.1.2 Texture feature-based large-scale analysis

These features give overall information about an image pattern. It has been observed that the efficiency of identification at this scale is affected by the heights, widths and spacings of lines and words [13,14]. Therefore, text blocks should be normalized before this analysis.

Tan [13] extracted features that are rotation invariant and noise robust. The image is convolved with a Gabor filter, and the periodicity property of the Fourier transform is exploited to calculate the features. A distance classifier, which uses the mean and variance of the classes, then classifies the Chinese, English, Greek, Malayalam, Persian and Russian scripts.

Busch et al. [14] started with skew detection and correction using the Radon and Fourier transforms, followed by scaling and normalization of the text blocks. For feature generation, a gray-level co-occurrence matrix feature is first calculated for each image at particular distances and directions.
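Hochberg et al.'s relative centroid and aspect ratio features lend themselves to a compact sketch. The implementation below is a simplified reading, not their code, and the `fill` ratio is only a rough stand-in for their sphericity measure.

```python
import numpy as np

def component_features(comp: np.ndarray) -> dict:
    """Shape features for one binary connected component (1 = ink),
    loosely following the relative-centroid idea described above."""
    ys, xs = np.nonzero(comp)
    h, w = comp.shape
    return {
        # centroid position normalized by component height/width
        'rel_y_centroid': ys.mean() / h,
        'rel_x_centroid': xs.mean() / w,
        'aspect_ratio': h / w,
        # sphericity proxy: ink pixels over bounding-box area
        'fill': len(ys) / (h * w),
    }

# A component dense in its lower half has a relative Y centroid > 0.5.
comp = np.zeros((10, 4), dtype=np.uint8)
comp[6:10, :] = 1
feats = component_features(comp)
print(feats['rel_y_centroid'])  # → 0.75
```

Such features are cheap to compute per component and were fed to Fisher linear discriminant analysis in the original work.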


Table 1 Algorithms for structural feature-based large-scale analysis

Algorithm                  Identification accuracy (%)   Image database information
Patil and Subbareddy [5]   99.00                         Self-prepared machine-printed database in Microsoft Word
Zhu et al. [6]             95.60                         UMD [10] and IAM [11] databases
Shivakumara et al. [8]     94.30                         Self-prepared machine-printed database from weather news, sports news and entertainment videos, plus the database used in [12]
Hochberg et al. [9]        88.00                         Self-prepared handwritten database using an Agfa scanner
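The four-direction 3 × 3 dilation of Patil and Subbareddy [5] can be illustrated as follows. The masks are a plausible reconstruction and the ink-density response is an illustrative feature of our own, not the paper's; the useful observation is that the mask aligned with a stroke adds the fewest new pixels.

```python
import numpy as np
from scipy import ndimage

# 3x3 structuring elements for the four stroke directions described
# above (horizontal, vertical, left- and right-diagonal).
MASKS = {
    'horizontal':     np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]]),
    'vertical':       np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    'left_diagonal':  np.eye(3, dtype=int),
    'right_diagonal': np.fliplr(np.eye(3, dtype=int)),
}

def directional_dilations(block: np.ndarray) -> dict:
    """Dilate a binary text block with each directional mask and report
    the resulting ink density (a hypothetical per-direction response;
    the original paper's exact feature may differ)."""
    return {name: ndimage.binary_dilation(block, structure=m).mean()
            for name, m in MASKS.items()}

# A single horizontal stroke grows least under the horizontal mask.
block = np.zeros((9, 9), dtype=np.uint8)
block[4, 2:7] = 1
resp = directional_dilations(block)
print(min(resp, key=resp.get))  # → horizontal
```

In the surveyed method, the responses of the four dilated blocks feed three per-script neural networks rather than this direct minimum test.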

Later, Gabor filter energies for different orientations, wavelet sub-band energies and the wavelet log mean deviation are calculated as additional features. Finally, wavelet co-occurrence and scale co-occurrence features are calculated; these co-occurrence features reduce the nonlinearity among the wavelet coefficients and are obtained by finding a quantization function and forming co-occurrence matrices. For identifying scripts, the feature set is reduced using linear discriminant analysis, which maximizes interclass separability and minimizes intraclass separability. A Gaussian mixture model is used to identify the Latin, Chinese, Japanese, Greek, Cyrillic, Hebrew, Sanskrit and Farsi scripts. For identifying multifont scripts, a cluster of each script is first formed using the feature vectors, and then linear discriminant analysis and the Gaussian mixture model are applied. The gray-level co-occurrence matrix feature performed less well at small distances, whereas the scale co-occurrence feature did not perform well on binary scripts.

Hiremath and Shivashankar [15] used the wavelet transform for script identification. They concluded that the interrelation between sub-bands at a particular resolution exhibits a strong correlation, which is further used to describe a texture. In the feature extraction step, they computed co-occurrence histograms between the average image and each detail image, and calculated features such as the mean, deviation and slope. A Gabor filter is also applied to the text blocks to calculate energies in different directions. Finally, a k-nearest neighbor classifier is used to identify the English, Bengali, Hindi, Kannada, Malayalam, Tamil, Telugu and Urdu scripts.

Singh et al. [16] first binarized the image and removed noise using a Gaussian low-pass filter. Then, the gray-level co-occurrence matrix is calculated, which describes the joint probability distribution of gray levels for pairs of pixels at a particular distance and orientation. After forming the gray-level co-occurrence matrix, features such as energy, entropy, inertia, contrast, local homogeneity, cluster shade, cluster prominence and the information measure of correlation are calculated. Finally, scripts such as Bangla, Devanagari, Telugu and Roman are identified, and a multilayer perceptron showed the highest classification rate.

Benjelil et al. [17] identified the Arabic and Latin scripts, in both handwritten and machine-printed form. Steerable pyramids of the text blocks are formed using a combination of low-pass, band-pass and high-pass filters, which are derivatives of Gaussian filters. Sub-bands are obtained at different orientations and scales. Then, the mean, standard deviation, kurtosis, energy, homogeneity and correlation of each sub-band are calculated, and the scripts are classified with a k-nearest neighbor classifier.

Zhou et al. [18] first applied the discrete wavelet transform to the text images and calculated the combined pixelwise energy of the three detail images. To convert the continuous energy values into discrete form, linear quantization is applied. Wavelet energy histograms are then plotted for the Arabic, Chinese, English, Hindi, Thai and Korean scripts, which clearly show the differences between them. Four class-weighted moment features are calculated from each histogram, and the scripts are classified using a support vector machine.

Peake and Tan [19] binarized the document image in the preprocessing step. The text-lines are then located using projection profiles, and text-lines falling outside the range of the mean and standard deviation of the text-line heights are removed. Some normalization is performed, including white space elimination between text-lines, left-side justification and inter-word spacing. With a text block of fixed size (128 × 128), features such as Gabor filter responses (computed via fast Fourier transforms) and the gray-level co-occurrence matrix are calculated. Gabor filtering is done at four frequencies in four directions, whereas the gray-level co-occurrence matrix is calculated for five distances in four directions. A k-nearest neighbor classifier is used to identify the Russian, English, Chinese, Malayalam, Greek, Persian and Korean scripts.

Pan et al. [20] convolved the image with a Gabor filter and calculated the mean and standard deviation for different frequencies and rotations. Then, a one-dimensional discrete Fourier transform is calculated from these means and standard deviations to generate rotation-invariant features.
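The gray-level co-occurrence matrix underlying several of the methods above (Busch et al. [14], Singh et al. [16], Peake and Tan [19]) can be sketched directly. This is the generic textbook construction with a few of the Haralick-style statistics the survey lists; the exact definitions vary between the surveyed papers.

```python
import numpy as np

def glcm(img: np.ndarray, dx: int, dy: int, levels: int) -> np.ndarray:
    """Gray-level co-occurrence matrix for one displacement (dx, dy),
    normalized to a joint probability distribution."""
    h, w = img.shape
    P = np.zeros((levels, levels))
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            P[img[y, x], img[y + dy, x + dx]] += 1
    return P / P.sum()

def glcm_features(P: np.ndarray) -> dict:
    """A few of the co-occurrence statistics mentioned above."""
    i, j = np.indices(P.shape)
    nz = P[P > 0]
    return {
        'energy': float((P ** 2).sum()),
        'entropy': float(-(nz * np.log2(nz)).sum()),
        'contrast': float((P * (i - j) ** 2).sum()),
        'homogeneity': float((P / (1.0 + np.abs(i - j))).sum()),
    }

# A constant image co-occurs only with itself: maximal energy,
# zero entropy and zero contrast.
img = np.zeros((8, 8), dtype=int)
f = glcm_features(glcm(img, dx=1, dy=0, levels=4))
print(f['energy'], f['contrast'])  # → 1.0 0.0
```

In practice, the matrices are computed for several distances and the four principal directions, as in Peake and Tan [19], and the statistics are concatenated into one feature vector.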


In addition, the response of the Gabor steerable filter is calculated in order to save computation time and cost. A feed-forward neural network with back-propagation is used to identify the Chinese, Korean, Japanese and English scripts. Using the steerable property, they saved nearly 40% of the computations, but the script identification performance decreased slightly.

Singhal et al. [21] first performed preprocessing, including denoising to remove salt-and-pepper noise using morphological operations, thinning, pruning, m-connectivity and normalization (text height, inter-word space, and left and right text justification). A Gabor filter is used for feature extraction, and probabilistic clustering is used to classify the English, Hindi, Bangla and Telugu scripts.

Rajput and Anita [22] binarized the text blocks using Otsu's method, removed noise using morphological operations and applied a thinning operation. For feature extraction, the discrete cosine transform and the discrete wavelet transform (Daubechies 9) are applied. Then, the standard deviations of the first and second blocks and of the average, vertical and horizontal bands of the discrete cosine transform and discrete wavelet transform, respectively, are calculated. A k-nearest neighbor classifier is used to classify the Kannada, Hindi, Gujarati, English, Tamil, Telugu, Malayalam and Punjabi scripts, three scripts at a time. The method showed its highest performance for the Malayalam-English-Hindi and Tamil-English-Hindi tri-script identification tasks.

Lee et al. [23] first applied the wavelet transform to the text blocks, which yields approximation and detail sub-bands. Then, a soft thresholding technique is applied to the detail sub-bands to denoise them. After that, the block difference of inverse probabilities, the block variance of local correlation coefficients and the normalized magnitude of the detail sub-bands are calculated, and their means and variances are computed. Using these features, the Greek, English, Hebrew, Russian, Hindi, Persian, Korean, Thai, Japanese and Chinese scripts are classified with a Bayesian classifier. They also compared their results with other texture features and concluded that maximum accuracy is achieved with the combination of the above three features and denoising. Table 2 shows the performance metrics of the algorithms for texture feature-based large-scale analysis.

2.1.3 Hybrid feature or other feature-based large-scale analysis

These features either combine structural and texture features or use neither of them. When the combination is used, it enjoys the advantages of both structural and texture features.

Joshi et al. [7] first removed the nontext regions of the document image, extracted text blocks of fixed size and binarized them.
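Gabor filter bank features of the kind used by Tan [13], Peake and Tan [19], Pan et al. [20] and Singhal et al. [21] reduce to per-frequency, per-orientation response statistics. The kernel parameters below are illustrative, not taken from any of the surveyed papers.

```python
import numpy as np
from scipy import ndimage

def gabor_kernel(freq: float, theta: float, sigma: float = 3.0,
                 size: int = 15) -> np.ndarray:
    """Real part of a Gabor kernel at one frequency and orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # rotate coordinates so the sinusoid runs along direction theta
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * freq * xr)

def gabor_features(img: np.ndarray, freqs, n_orient: int = 4) -> np.ndarray:
    """Mean and standard deviation of the filter response at each
    frequency/orientation pair, the texture descriptor described above."""
    feats = []
    for f in freqs:
        for k in range(n_orient):
            theta = k * np.pi / n_orient
            resp = ndimage.convolve(img.astype(float),
                                    gabor_kernel(f, theta), mode='reflect')
            feats.extend([resp.mean(), resp.std()])
    return np.array(feats)

# Vertical stripes respond much more strongly at 0 than at 90 degrees.
img = np.tile([0.0, 0.0, 1.0, 1.0], (16, 4))  # 16x16 stripe pattern
v = gabor_features(img, freqs=[0.25], n_orient=2)
print(v.shape)  # → (4,)
```

Pan et al.'s rotation-invariant variant would additionally take a one-dimensional discrete Fourier transform across the orientation axis of these statistics.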

Table 2 Algorithms for texture feature-based large-scale analysis

Algorithm                        Identification accuracy (%)   Image database information
Tan [13]                         96.70                         Self-prepared machine-printed database from books, newspapers, magazines, etc.
Busch et al. [14]                96.20                         Self-prepared machine-printed database
Hiremath and Shivashankar [15]   98.00                         The database used in Valkealahti and Oja [24]
Singh et al. [16]                96.24                         Self-prepared handwritten database
Benjelil et al. [17]             97.50                         Self-prepared machine-printed and handwritten databases
Zhou et al. [18]                 97.59                         Self-prepared machine-printed database
Peake and Tan [19]               95.23                         Database prepared from newspapers, journals, magazines and books scanned at 150 dpi
Pan et al. [20]                  99.81                         Database prepared from newspapers, books, magazines and computer printouts scanned at 200 dpi
Singhal et al. [21]              91.60                         Self-prepared handwritten database at 150 dpi
Rajput and Anita [22]            99.20                         Self-prepared handwritten database at 300 dpi
Lee et al. [23]                  98.35                         Self-prepared machine-printed database
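The wavelet detail energy that Zhou et al. [18] histogram per script can be approximated with a single-level Haar transform. This sketch uses simple averaging/differencing filters rather than the orthonormal 1/√2 normalization, and the surveyed papers use other wavelets as well (e.g., Daubechies 9 in Rajput and Anita [22]).

```python
import numpy as np

def haar_dwt2(img: np.ndarray):
    """One level of a 2-D Haar-style wavelet transform, returning the
    approximation and the three detail sub-bands (LH, HL, HH)."""
    a = img.astype(float)
    # pairwise averages/differences along columns, then rows
    lo_r = (a[:, ::2] + a[:, 1::2]) / 2
    hi_r = (a[:, ::2] - a[:, 1::2]) / 2
    ll = (lo_r[::2] + lo_r[1::2]) / 2
    lh = (lo_r[::2] - lo_r[1::2]) / 2
    hl = (hi_r[::2] + hi_r[1::2]) / 2
    hh = (hi_r[::2] - hi_r[1::2]) / 2
    return ll, lh, hl, hh

def detail_energy(img: np.ndarray) -> np.ndarray:
    """Combined pixelwise energy of the three detail images, the
    quantity histogrammed per script (our reading; the original
    normalization and quantization may differ)."""
    _, lh, hl, hh = haar_dwt2(img)
    return lh ** 2 + hl ** 2 + hh ** 2

# A flat image has no detail energy; an edge produces some.
flat = np.ones((8, 8))
edge = np.zeros((8, 8)); edge[:, 3:] = 1
print(detail_energy(flat).max(), detail_energy(edge).max())  # → 0.0 0.25
```

Zhou et al. then quantize these energies linearly, histogram them per script and feed moment features of the histograms to a support vector machine.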


A log-Gabor filter is used for feature extraction, as it is more reliable and informative than the Gabor filter, and the global and oriented local energy responses of each script are calculated. In addition, a horizontal profile run is incorporated as a feature for the Devanagari and Gurumukhi scripts. Finally, a hierarchical tree-based classification scheme is used to identify the Devanagari, Bangla, Tamil, Gurumukhi, Kannada, Malayalam, English, Oriya, Gujarati and Urdu scripts. At its different levels, this classification scheme identifies different sets of scripts using the global and local energy responses and horizontal runs, either individually or in combination.

Brodić et al. [25] determined the distribution of letter types within Cyrillic, Latin and Glagolitic text blocks. The letters are classified as base letters, ascender letters, descender letters and full letters. They concluded that the Glagolitic, Cyrillic and Latin scripts have the highest distribution of base, descending and ascending letters, respectively. In addition, a gray-level co-occurrence matrix feature is calculated, consisting of entropy, energy, dissimilarity, maximum, inverse difference moment, contrast, correlation and homogeneity calculations, along with its normalized probability version, to classify the scripts.

Arabnejad et al. [103] used nonnegative matrix factorization to create a low-dimensional representation of text patches as features. A k-nearest neighbor classifier is then used to classify the Arabic, Hebrew, Cyrillic and Latin scripts. Table 3 shows the performance metrics of the algorithms for hybrid feature or other feature-based large-scale analysis.

2.2 Fine-scale analysis of script identification process

Fine-scale analysis is done at the text-line, word and character levels; text-line, word and character segmentation is therefore necessary beforehand. This analysis depends entirely on the effective removal of noise from these components along with their segmentation. These approaches are generally peculiar in nature.

2.2.1 Structural feature-based fine-scale analysis at the text-line level

Features at this level are efficient and robust if the scripts to be identified are rich in stroke information. For example, the Devanagari script can easily be separated from Roman scripts using the headline structural feature (Pal and Chaudhuri [28]).

Pal and Chaudhuri [28,29] identified the English, Chinese, Bangla, Hindi and Arabic scripts by utilizing their structural properties. First, the Hindi and Bangla scripts are separated from the English, Chinese and Arabic scripts by checking the horizontal runs of rows; Hindi and Bangla usually have longer runs due to the Sirorekha. Hindi and Bangla are then distinguished using stroke-shape features within zones. The Chinese script is separated from the English and Arabic scripts using the vertical runs of black pixels and a run-length smoothing technique. Finally, the English and Arabic scripts are distinguished using the distribution of the lowermost components and a water flow concept, as the Arabic script has a random distribution of these components between the baseline and the lower line.

Gopakumar et al. [30] first performed preprocessing, including binarization, text-line segmentation using histogram techniques and thinning. Then, from each segmented text-line, features such as horizontal, vertical, left-diagonal and right-diagonal lines are calculated. They used feature subset selection as an additional step, as the calculated features did not classify the scripts efficiently. For script classification, k-nearest neighbor and support vector machine classifiers are used. The k-nearest neighbor classifier with Euclidean distance and the support vector machine with a Gaussian kernel in a one-against-all approach classified the English, Hindi, Kannada, Tamil, Telugu and Malayalam scripts.

Table 3 Algorithms for hybrid feature or other feature-based large-scale analysis

Algorithm                Identification accuracy (%)   Image database information
Joshi et al. [7]         95.50                         Database collected from newspapers, books, magazines and computer printouts [26]; from LIFI: Language Identification From Images (http://www.c3.lanl.gov/~kelly/LIFI/); scanned documents from the Digital Library of India (http://dli.iiit.ac.in/); and regional newspapers (http://www.samachar.com/)
Brodić et al. [25]       98.00                         Custom-oriented database collected from [27], http://www.croatianhistory.net/etf/juraj_slovinac_misli.html and http://www.croatianhistory.net/etf/badurina_parcic.html, and from the book 'Le château de virginité' ('The Castle of Virginity') written by George d'Esclavonie (Juraj Slovinac) in 1411 (http://www.croatianhistory.net/etf/juraj_slovinac_misli.html)
Arabnejad et al. [103]   97.60                         Self-created database of ancient manuscripts
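Arabnejad et al. [103] represent text patches with nonnegative matrix factorization. A generic NMF via the classic multiplicative update rules, not their specific variant, looks like this:

```python
import numpy as np

def nmf(V: np.ndarray, rank: int, iters: int = 200, seed: int = 0):
    """Nonnegative matrix factorization V ~ W @ H via multiplicative
    updates (Frobenius objective). Columns of V would be vectorized
    text patches; the columns of H are their low-dimensional codes,
    usable as features for a k-nearest neighbor classifier."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 0.1
    H = rng.random((rank, m)) + 0.1
    eps = 1e-10  # guards against division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Factor a small nonnegative matrix and check the reconstruction.
V = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 2.0, 3.0]])  # rank-1 by construction
W, H = nmf(V, rank=1)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err < 1e-3)
```

The multiplicative updates keep both factors nonnegative by construction, which is what makes the resulting codes interpretable as additive parts of the patches.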


Hindi, Kannada, Tamil, Telugu and Malayalam. Gopakumar Urdu scripts is calculated. Preprocessing like binarization
et al. [31] followed the same steps, but done additional things using Otsu’s technique, skew correction, noise deletion and
of dividing each text-lines into zones. Now by calculating text-line segmentation through projection profile analysis is
this feature on the segmented zones, they concluded that it also done. Classification is based upon entropy values. Bashir
provides more fine information of different scripts. and Quadri [39] discriminated English and Kashmiri scripts
Padma and Vijaya [32] had done same preprocessing steps using horizontal projection profile coefficients and valleys.
like Gopakumar et al. and calculated features like density pro- Bashir and Quadri [40] identified Roman, Devanagari, Urdu
files, coefficient values using standard deviation and mean, and Kashmiri scripts based on density of pixels. For that,
bottom maximum row number and top component density binarization inversion and calculation of density of text-line
from top maximum row and bottom maximum row of top images are done. Peak density percentage for Kashmiri script
and bottom profiles. This is done for Kannada, English and was found high, whereas low for Roman script.
Hindi scripts. Here k-nearest neighbor classifier is used to Ghosh and Chaudhuri [41] employed Hough transform for
identify scripts. skew detection in preprocessing step. For generating a first
Aithal et al. [33] segmented text-lines using horizon- feature, water-filling concept is used, where the capacity of
tal projection profiles. Their feature extraction scheme also the component (reservoir) is measured. This is made from
depends upon the horizontal projection profile concept. Since top, bottom, left and right sides of the components. Other fea-
they identified English, Hindi and Tamil scripts, rule-based tures are white hole area calculation, horizontal and vertical
classifier is sufficient for identification using projection pro- white-black transitions and crossing counts. Classification
file technique. The rule-based classifier depends upon horizontal peaks and histograms with crossing means of the projection profile of each script. Aithal et al. [34] proposed the same projection profile-based feature extraction process and rule-based classifier for identifying Hindi, English and Kannada scripts. Prakash et al. [35] also used similar projection profile concepts and a rule-based classifier for identifying Hindi, Bangla, Telugu and Kannada scripts.

Phan et al. [36] worked on identifying scripts from video frames. For the feature extraction step, the Canny edge map technique is used, and the cursiveness and smoothness of the upper and lower edge lines are calculated. They found that the slope of these edge lines is lower for English script than for Chinese and Tamil scripts. A k-nearest neighbor classifier is then used to classify the above-mentioned scripts.

Tan et al. [37] developed their method in three stages, i.e., a prototype building stage, a document indexing stage and a retrieval stage. During the prototype building stage, preprocessing and text-line extraction are performed, and features related to horizontal and vertical stroke directions, inter-stroke directions, average stroke lengths and density are extracted. In the document indexing stage, term frequency and inverse document frequency are calculated. Term frequency describes the similarity of a document to a particular script family, whereas inverse document frequency tells how frequently the features of a particular script family occur in other script families. At last, in the retrieval stage, the term frequency and inverse document frequency vectors of the documents are compared with those of Arabic, Roman and Tamil scripts using the Chi-square distance.

Bashir and Quadri [38] depended upon entropy calculations for the generation of feature vectors. Here, entropy is the number of transitions from one to zero or from zero to one, counted rowwise or columnwise. Here, columnwise entropy for identifying Kashmiri, Roman, Devanagari and

tree is used, where the first level classified Gujarati, Oriya and Urdu scripts. At the second level, Bangla, Devanagari and Gurmukhi are classified; at the third level, English, Malayalam and Tamil; and at the fourth level, Kannada and Telugu scripts. At each level, they did not use all features but combinations of them, and the classifiers used are the minimum distance classifier, support vector machine and k-nearest neighbor.

Cheng et al. [42] calculated the normalized histogram of Chinese, Japanese, English and Russian scripts and examined the distribution of pixels on the x-line and baseline. They found that Japanese and Chinese scripts had a random distribution, whereas Russian and English scripts had a concentration of pixels near the baseline and x-line. A further calculation of the areas of the histogram near the top and bottom lines, and of their ratios, identified each script.

Moussa et al. [43] calculated two fractal dimensions as features. The first was the Hausdorff–Besicovitch dimension (box count estimator), which requires the number of boxes covering an image at different scales. The second was the Minkowski–Bouligand dimension, based on the level and radius of dilation of the texture surface. The obtained features are used to identify Arabic and Latin scripts in both machine-printed and handwritten form using k-nearest neighbor and radial basis function classifiers.

Padma and Vijaya [44] calculated features like top and bottom portions, tick features, top and bottom pipe densities, coefficients, horizontal lines, and top downward and bottom upward curves. Top and bottom pipes are computed by deleting connected components and black pixels whose values are less than the threshold. Top and bottom coefficients are calculated using connected components of the top and bottom profiles. Top downward and bottom upward curves are calculated from the top and bottom pipes, respectively. Telugu, Hindi and English scripts are identified using these features.
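The transition-based entropy feature of Bashir and Quadri [38] is easy to sketch: binarize the image and count 0/1 changes down each column. The helper below is our own minimal illustration on plain lists of 0/1 pixels, not code from the cited paper.

```python
def columnwise_transitions(img):
    """Count 0->1 and 1->0 transitions down each column of a
    binarized image (a list of rows of 0/1 pixels)."""
    rows, cols = len(img), len(img[0])
    counts = []
    for c in range(cols):
        t = 0
        for r in range(1, rows):
            if img[r][c] != img[r - 1][c]:
                t += 1
        counts.append(t)
    return counts

# A toy 4x3 binary patch: each column's transition count is a feature.
patch = [[0, 1, 0],
         [1, 1, 0],
         [1, 0, 0],
         [0, 0, 1]]
print(columnwise_transitions(patch))  # [2, 1, 1]
```

Rule-based or statistical classifiers then operate on such per-column (or per-row) counts.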

Int J Multimed Info Retr

Table 4 Algorithms for structural feature-based fine-scale analysis at the text-line level

Algorithm | Identification accuracy (%) | Image database information
Pal and Chaudhuri [28] | 97.33 | Self-prepared machine-printed database of books, question papers, journals, magazines, computer printouts, translation books, multilingual operational and service manuals, etc.
Gopakumar et al. [30] | 99.13 | Self-prepared machine-printed database using the web pages of e-newspapers
Padma and Vijaya [32] | 99.75 | Machine-printed and handwritten dataset created using Microsoft Word and Paint software and documents like application forms, manuals, etc.
Aithal et al. [34] | 99.83 | Database prepared from e-newspapers
Prakash et al. [35] | 97.83 | Self-prepared machine-printed database
Tan et al. [37] | 93.30 | Self-prepared handwritten database using digital pen and paper
Bashir and Quadri [38] | 98.50 | Self-prepared machine-printed database at a resolution of 300 dpi
Bashir and Quadri [39] | 96.20 | Self-prepared machine-printed database using Inpage 2009 software
Ghosh and Chaudhuri [41] | 99.44 | Machine-printed text document database scanned at 200/300/400 dpi
Cheng et al. [42] | 96.70 | Scanned images, fax and internet-downloaded printed papers
Moussa et al. [43] | 98.72 | ALPH-REGIM database http://www.regim.org/database/alph.html
Padma and Vijaya [44] | 98.50 | Self-prepared machine-printed database
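Several entries in Table 4 ([28], [34], [35]) build on projection profiles thresholded by hand-crafted rules. A minimal sketch follows; the helper names are our own illustration, not code from the cited papers.

```python
def horizontal_projection(img):
    """Sum of foreground (1) pixels in each row of a binarized image."""
    return [sum(row) for row in img]

def peak_row(profile):
    """Row index of the maximum projection value; scripts such as
    Devanagari tend to peak near the top because of the headline."""
    return max(range(len(profile)), key=lambda r: profile[r])

word = [[0, 0, 0, 0],
        [1, 1, 1, 1],   # headline-like run of ink
        [0, 1, 0, 0],
        [0, 1, 0, 1]]
prof = horizontal_projection(word)
print(prof, peak_row(prof))  # [0, 4, 1, 2] 1
```

A rule-based classifier then compares quantities like the peak position or peak strength against per-script thresholds.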

Table 4 shows the performance metrics of algorithms for structural feature-based fine-scale analysis at the text-line level.

2.2.2 Texture feature-based fine-scale analysis at the text-line level

For calculating these features, at least two words should be present in the text-lines; at this scale, the features can handle variable spacing between words.

Rajput and Anita [45] first preprocessed the image, which includes de-skewing, noise reduction using a median filter and binarization using Otsu's technique; later, a thinning operation is applied. The extracted feature is based on Gabor filtering. In addition, the standard deviation of the sine and cosine parts of the Gabor filter, separately and along with the entire image, is calculated. The k-nearest neighbor classifier is used for classification of Kannada, Malayalam, Punjabi, Tamil, Gujarati, Telugu, Hindi and English scripts. They classified two scripts at a time, of which English was always one.

Jindal and Hemrajani [46] calculated the discrete cosine transform of images and the standard deviation for feature vector generation. This feature set is then reduced using the principal component analysis technique, and English, Hindi, Urdu, Bengali, Tamil, Gujarati, Telugu, Kannada, Malayalam, Odiya and Punjabi scripts are later classified using a k-nearest neighbor classifier. Table 5 shows the performance metrics of algorithms for texture feature-based fine-scale analysis at the text-line level.

2.2.3 Hybrid feature or other feature-based fine-scale analysis at the text-line level

Direct convolution of a filter with an image is slow and inefficient; it is faster to compute texture features by taking the fast Fourier transform of both the filter and the image (Obaidullah et al. [47]).

Obaidullah et al. [47] convolved the image with Gabor filters of fixed frequency in a number of directions and calculated the standard deviation of the real and imaginary parts. Then the combination of dilation and four erosion operations gave the horizontal, vertical, right and left angular details of an image, from which the average and standard deviation are further calculated. A multilayer perceptron classifier is then used to identify Bangla, Hindi, English and Urdu handwritten scripts.

Gomez et al. [101] densely extracted text patches at two different scales, namely 32 × 32 and 40 × 40, from word images using a sliding window. Then, a convolutional neural network classifier is used to identify English, Chinese, Kannada, Korean, Hindi, Bengali, Punjabi, Mongolian, Thai,


Table 5 Algorithms for texture feature-based fine-scale analysis at the text-line level

Algorithm | Identification accuracy (%) | Image database information
Rajput and Anita [45] | 99.98 | Self-prepared handwritten document database
Jindal and Hemrajani [46] | 95.00 | Machine-printed document database prepared from bus question papers, reservation forms, money-order forms and language translation books
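The Gabor standard-deviation features used by Rajput and Anita [45] and related methods can be sketched as below. The kernel construction and all parameter values are our own illustrative choices, not those of the cited papers.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(ksize=9, sigma=2.0, theta=0.0, lam=4.0):
    """Complex Gabor kernel: Gaussian envelope times a complex sinusoid."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.exp(2j * np.pi * xr / lam)

def gabor_std_features(img, thetas=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Std. deviation of the real (cosine) and imaginary (sine) responses
    per orientation, giving two texture features per filter angle."""
    feats = []
    for theta in thetas:
        resp = convolve2d(img, gabor_kernel(theta=theta), mode='same')
        feats += [resp.real.std(), resp.imag.std()]
    return feats

rng = np.random.default_rng(0)
img = (rng.random((32, 32)) > 0.8).astype(float)  # toy binarized word image
print(len(gabor_std_features(img)))  # 8
```

A k-nearest neighbor classifier can then be trained directly on such fixed-length vectors.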

Table 6 Algorithms for hybrid feature or other feature-based fine-scale analysis at the text-line level

Algorithm | Identification accuracy (%) | Image database information
Obaidullah et al. [47] | 94.40 | Self-prepared handwritten database at 300 dpi resolution
Gomez et al. [101] | 94.80 | SIW-13 [37], MLe2e [101] and CVSI [102] databases
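The speed advantage of Fourier-domain filtering noted in Sect. 2.2.3 is the convolution theorem in action. The sketch below, our own illustration rather than code from the cited work, verifies that FFT-based convolution matches naive direct convolution.

```python
import numpy as np

def fft_convolve2d(img, kernel):
    """Linear 2-D convolution computed via FFTs (convolution theorem):
    zero-pad both operands to the full output size, multiply spectra."""
    sh = (img.shape[0] + kernel.shape[0] - 1,
          img.shape[1] + kernel.shape[1] - 1)
    out = np.fft.ifft2(np.fft.fft2(img, sh) * np.fft.fft2(kernel, sh))
    return np.real(out)

def direct_convolve2d(img, kernel):
    """Naive full convolution, for comparison only (O(N^2 k^2))."""
    H, W = img.shape
    h, w = kernel.shape
    out = np.zeros((H + h - 1, W + w - 1))
    for i in range(H):
        for j in range(W):
            out[i:i + h, j:j + w] += img[i, j] * kernel
    return out

rng = np.random.default_rng(1)
a, k = rng.random((16, 16)), rng.random((5, 5))
print(np.allclose(fft_convolve2d(a, k), direct_convolve2d(a, k)))  # True
```

For large filter banks, applied to many document images, the FFT route is substantially cheaper than direct convolution.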

Russian, and Tibetan scripts. The parameters of this classifier, such as the convolutional and fully connected layers, kernel sizes, number of filters per layer and feature map normalization, are optimized to get better classification performance. Table 6 shows the performance metrics of algorithms for hybrid feature or other feature-based fine-scale analysis at the text-line level.

2.2.4 Structural feature-based fine-scale analysis at word/character level

At this level, structural features are computed as either particular strokes or sets of strokes in the form of connected components. Sometimes, dots within the characters create problems for extracting strokes. Four- and eight-connected neighborhood analysis is an important step. If the computed feature vectors are normalized, then robustness with respect to font size, styles and noise can easily be achieved.

Namboodiri and Jain [1] proposed their method for handwritten text documents of Cyrillic, Arabic, Hindi, Han, Hebrew and English scripts. They collected the documents from different individual handwritings, capturing the coordinates of a pen tip at a sampling rate of 132 samples per second; they also collected equidistantly sampled individual strokes. Text-lines and words are segmented through projection profile analysis, and eleven features are calculated. The first feature is the horizontal inter-stroke direction, calculated with the help of the starting x coordinate of a stroke. The second is the average stroke length, which denotes the number of sample points within a stroke. The third and fourth are the Sirorekha strength and its confidence, calculated using the Hough transform and the height and width pattern, respectively. The fifth is the stroke density, calculated as the number of strokes per unit length in the horizontal direction. The sixth is the aspect ratio, whereas the seventh is the reverse distance, measured in a direction opposite to the normal direction. The eighth and ninth are the average horizontal and vertical stroke directions, measured with the help of the starting and ending x and y coordinates of strokes, respectively. The tenth is the vertical inter-stroke direction, calculated with the help of the starting y coordinates of strokes. The eleventh is the variance of stroke length measured within the words. These eleven values are calculated for each word, and scripts are classified using a k-nearest neighbor classifier, a Bayes quadratic classifier, a Bayesian classifier with a mixture of Gaussian densities, a decision tree-based classifier, a neural network-based classifier and a support vector machine classifier.

Shijian and Tan [4] in the preprocessing step removed noise (salt and pepper, using median filtering) and performed connected components labeling; binarization is then done using Otsu's technique. First, Arabic, Chinese, Roman and Korean scripts are identified. For that, the characters of words are divided into zones, and the numbers of vertical cut vectors are found for the words of each script. The vertical cut vectors of each script exhibit similarity; therefore, clusters based on these vectors are formed using k-means clustering, and the scripts are identified using the Bray-Curtis distance. They extended their work to the identification of English, German, French, Spanish, Portuguese, Swedish and Norwegian languages, which are all considered under Latin script. For each language, word shape vectors are formed using the upward and downward text boundaries and extremum points. The frequency of word occurrence is calculated, and vectors consisting of shape code and frequency are formed. Templates are then created through a corpus-based learning approach, and languages are identified using the cosine distance measure. The proposed method does not differentiate between scripts having the same vertical cuts, such as Tamil and Korean scripts. In addition, this method does not work properly with skewed characters and characters of very small sizes. Lu et al. [48] also formed word shape coding for English, French, German and Italian languages using character strokes. Character strokes are categorized into vertical and curved, lying below the baseline, between the baseline and x-line, and above


the x-line. The x-line is found between the middle line and the top line of characters. Document vector and template construction and language identification were done in the same way. Lu et al. [12] found the vertical cut for characters of each script and formed document script vectors for the identification of Arabic, Chinese, Roman and Bengali scripts using the k-nearest neighbor or Bray-Curtis distance between query document and reference document vectors.

Patil and Subbareddy [5] took word images of Hindi, English and Kannada scripts and applied dilation in vertical, horizontal, right-diagonal and left-diagonal directions. Each resulting image is also divided into five horizontal and vertical zones, and the number of pixels in each zone is calculated. Later, a probabilistic neural network is used for identification.

Hochberg et al. [26] first rescaled the connected components of each script and formed clusters of connected components. New connected components are added to existing clusters through the Hamming distance (250 pixels as threshold). Next, centroids are calculated from each cluster, and a template for each cluster is created using a thresholding technique. Template matching is then done based on the Hamming distance. This classification scheme depends upon the number of components (to examine) and a reliability threshold. They worked to identify Arabic, Armenian, Burmese, Chinese, Cyrillic, Devanagari, Ethiopic, Greek, Hebrew, Japanese, Korean, Roman and Thai scripts.

Spitz [49] first performed preprocessing like text-line segmentation, alignment of the baseline using least-squares regression analysis, and word and character segmentation (using connected components analysis). Then, for classifying Han and Latin scripts, the concept of upward concavities and their distribution with respect to the baseline position is used. After that, for language classification of Han-based scripts (Chinese, Japanese and Korean), the optical density of the characters (connected components) of each script is calculated. Optical density depends upon the runs of black pixels. Linear discriminant analysis is used for classifying Han-based scripts. For language classification of Latin-based scripts (English, French and German as the initial set of discriminable languages), character shape codes are made for each character, based upon the relative position of the four lines, namely bottom, top, x-line and baseline, and the morphology of the components. Word shape tokens are then made from the character shape codes, and their frequency of occurrence in each language is calculated. Later, linear discriminant analysis is again used for classification.

Das et al. [50] proposed their method for Telugu, English and Hindi scripts. The structural benefit of each script is used to calculate features. These features are based upon top horizontal and vertical lines, top and bottom holes, rows and tick components, and later, heuristics are developed. The decision criteria are: for Hindi script, the top and bottom rows should be the same and contain more than 60% of the top horizontal line; in Telugu script, top and bottom holes should be present along with tick components; and in English script, vertical lines should be present.

Yeotikar et al. [51] also proposed features like bottom component, bottom-max-row number, top horizontal line, vertical lines, top holes, top-down and bottom-up curves, bottom holes, left curve and right curve for identifying Kannada, English and Hindi scripts.

Dhandra and Hangarge [52] began with line and word segmentation from document images using a projection profile concept. Their feature extraction step depended upon a morphological (erosion and opening) reconstruction operation, where strokes in all directions are found. Then, pixel ratio, pixel density, aspect ratio, eccentricity and extent are calculated. With the help of a k-nearest neighbor classifier, Hindi, English and Kannada scripts are identified. In addition, they checked the potential of their method on handwritten English numerals.

Chanda et al. [53] calculated a headline feature, a chain code histogram-based feature and a gradient feature for each word of English, Devanagari and Bangla scripts. A Roberts filter is used to calculate the gradient feature on the low-pass-filtered image. Then a support vector machine with Gaussian kernel is used for identification.

Chanda et al. [54] discriminated Han script (Chinese, Japanese and Korean) from Roman script (English). First, text-lines and characters are segmented using a histogram technique. Then, from these characters, chain code histograms are calculated in different directions. This is done by dividing the characters into blocks of 7 × 7. In order to reduce the length of the feature vectors, the histograms are downsampled using a Gaussian filter and normalized. Finally, a support vector machine with Gaussian kernel is used to classify the scripts.

Roy et al. [55] worked on fractal dimensions to compute features of handwritten scripts. A fractal is a mathematical set of irregular geometric objects, which exhibits a repeating pattern at all scales (self-similarity). Some more features are also calculated, based upon the height, width, and horizontal and vertical black runs of the components. The above features are tested to classify Persian and Roman scripts on multilayer perceptron with feed-forward network, support vector machine with Gaussian and polynomial kernels, k-nearest neighbor with Euclidean and city block distances, and modified quadratic discriminant function classifiers. They concluded that the support vector machine with Gaussian kernel outperformed the support vector machine with polynomial kernel, and seven nearest neighbors with city block distance achieved the best results among the k-nearest neighbor classifiers. Roy et al. [56] used the same structural features as in Obaidullah et al. [47] along with fractal dimensions as additional features and classified Bangla, Devanagari, Malayalam, Urdu, Oriya and Roman scripts using multilayer


perceptron. Obaidullah et al. [57] used the same set of features as in Roy et al. [56] and tested the logistic model tree, random forest, multilayer perceptron, sequential minimal optimization, liblinear, radial basis function network and fuzzy unordered rule induction algorithm classifiers to classify Bengali, Devanagari, Malayalam, Urdu, Oriya and Roman scripts. They concluded that sequential minimal optimization was fastest, multilayer perceptron was slowest, and the logistic model tree and random forest outperformed all.

Piao and Cui [58] segmented the characters using the valley, width and centroid of the characters. An eigenspace is constructed from eigenvectors for three scripts, i.e., Korean, Chinese and English. These eigenvectors are calculated using the covariance matrix. For identification of characters, the original image is projected on each eigenspace and a reconstructed image is obtained. Then, the eigen distance and relative entropy between the original and reconstructed images are computed.

Chanda et al. [59] segmented text-lines, words and characters using a projection profile concept from English and Thai scripts. Then, features such as a loop feature, water reservoir feature, vertically overlapping component feature, rotated 'J'-like feature and vertical-line-like features are extracted. For the water reservoir-based feature, the top, bottom, left, right, flow level, height of reservoir and reservoir base line are calculated for each character. Later, a support vector machine is used for script identification. It has been observed that vertically flipped 'J'-like components and two or more vertical lines are found more often in Thai script characters. Chanda et al. [60] calculated the water reservoir feature, vertically overlapping component feature and vertical-line-like features for identifying English and Japanese scripts. Additional features like crossing count and character pitch are calculated. Pitch signifies the width of a character, which is definite in Japanese characters. A binary tree classifier is used for identification.

Dhandra and Hangarge [61] calculated vertical and horizontal stroke densities, morphological hat in top and bottom directions, pixel densities, aspect ratio, eccentricity and extent of the components as feature vectors. Later, handwritten Kannada, English and Hindi scripts are identified using a k-nearest neighbor classifier.

Singh et al. [62] segmented the middle zone and deleted the upper and middle zones from the words of Hindi and Gurumukhi scripts. Then, with a horizontal projection profile, the densities of vertical strokes in the middle zone and horizontal strokes in the lower zone are calculated to identify the scripts. Vertical and horizontal strokes are present in Gurumukhi script and absent in Hindi script.

Lin et al. [63] calculated density, crossing count, aspect ratio of horizontal and vertical text-lines, white hole and sphericity, upward concavity and centroid. The aspect ratio is concerned with the ratio between component and text-line heights, widths, and top and bottom gaps. A white hole is concerned with the number of white components. The centroid is the center of either horizontal or vertical masses. Later, a support vector machine with decision tree is used to identify English and Chinese scripts.

Echi et al. [64] worked on machine-printed and handwritten words of Latin and Arabic scripts. From each word, the presence of diacritic points, loop positions and elongated descenders are calculated as feature vectors. Diacritic points are found in Arabic scripts. Loop structures are found in Latin letters, above and below the central bands. Vertical descenders are found in Latin script, whereas elongated descenders are found in Arabic scripts. In addition, they employed techniques for the suitability of the feature vectors, so that maximum efficiency is achieved; techniques like principal component analysis, ranking, genetic algorithm and best first are used. Later, the AODEsr classifier, which is based upon the Bayes rule of conditional probability, is used for script identification.

Haboubi et al. [65] first extracted text-lines and then words using histogram analysis and dilation from Arabic and Latin script document images, respectively. Then features like the determination of baselines, poles and jambs, diacritical dots, loops between base lines and pieces are determined. A multilayer perceptron is used for script identification.

Kacem et al. [105] calculated run lengths of black pixels along four directions, namely vertical, horizontal, right and left diagonals, as features. Bayes, support vector machine and k-nearest neighbor classifiers are used to identify Arabic and Latin scripts.

Obaidullah et al. [106] found fractal dimension, rectangularity, circularity, directional chain code and convexity-based structural features. Stroke directions are also identified using morphological operations. Multilayer perceptron and simple logistic classifiers are used to identify Bangla, Devanagari, Gujarati, Gurumukhi, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu and Urdu scripts. Table 7 shows the performance metrics of algorithms for structural feature-based fine-scale analysis at word/character level.

2.2.5 Texture feature-based fine-scale analysis at word/character level

Computing these features is a bit tricky due to variations in rotation, direction and frequency. Generally, these features are calculated in the form of energy, divergence and spread plots [2].

Pati and Ramakrishnan [2] calculated features based upon the Gabor and discrete cosine transforms. They concluded that the mean of two scripts played a more significant role in identifying scripts than the standard deviation. These are calculated using the Gabor filter. Only low-frequency components are utilized for feature vector generation using the discrete cosine transform. Later, using support vector machine, linear dis-


Table 7 Algorithms for structural feature-based fine-scale analysis at word/character level

Algorithm | Identification accuracy (%) | Image database information
Namboodiri and Jain [1] | 95.00 | A handwritten database created using CrossPad IBM Technologies http://www.research.ibm.com/handwriting/
Shijian and Tan [4] | 97.29 | Self-prepared machine-printed document database at 300 dpi
Patil and Subbareddy [5] | 98.89 | Self-prepared machine-printed database in Microsoft Word
Lu et al. [12] | 91.60 | Machine-printed document database collected from electronic proceedings and the Internet
Hochberg et al. [26] | 98.00 | Machine-printed document database collected from books, magazines, newspapers, and computer printouts
Lu et al. [48] | 97.08 | Self-prepared machine-printed database
Spitz [49] | 100 | Machine-printed document database
Das et al. [50] | 93.00 | Machine-printed document database collected from textbooks, internet-based newspapers, and scanned text documents
Yeotikar and Deshmukh [51] | 98.50 | Machine-printed document database created in Microsoft Word and Paint programs
Dhandra and Hangarge [52] | 98.73 | Machine-printed document database collected from various newspapers, magazines, books, etc.
Chanda et al. [53] | 98.51 | Machine-printed, scanned images from magazines, newspapers, story books, etc.
Chanda et al. [54] | 98.39 | Self-prepared machine-printed database
Roy et al. [55] | 99.20 | Self-prepared machine-printed and handwritten databases
Roy et al. [56] | 89.48 | Self-prepared handwritten database
Obaidullah et al. [57] | 91.20 | University and postal documents
Piao and Cui [58] | 99.78 | Machine-printed, scanned document images
Chanda et al. [59] | 99.36 | Machine-printed documents collected from newspapers, books and literature from the internet
Chanda et al. [60] | 98.79 | Machine-printed documents obtained from advertisements, Windows manuals, Nikkei Byte, ASCII and Interface magazines, etc.
Dhandra and Hangarge [61] | 96.05 | Handwritten document database
Singh et al. [62] | 98.27 | Self-prepared machine-printed database
Lin et al. [63] | 99.60 | Machine-printed documents obtained from newspapers and magazines
Echi et al. [64] | 98.72 | IAM [11] and IFN-ENIT [71] databases
Haboubi et al. [65] | 94.32 | IFN-ENIT [71] database
Kacem and Asma [105] | 99.03 | IAM [11], IFN-ENIT [71], AHTID/MW [98] and APTI [100] databases
Obaidullah et al. [106] | 97.99 | PHDIndic_11 [106] database
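The chain code histogram features used by several methods in Table 7 (e.g., Chanda et al. [53,54]) can be illustrated with a toy helper; this is our own simplification, operating on an already ordered contour rather than on pixels extracted from a real character image.

```python
# 8-direction chain codes: map each unit step (dx, dy) to a code 0..7.
DIRS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
        (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code_histogram(points):
    """Normalized histogram of step directions along an ordered
    contour given as [(x, y), ...]; length-invariant by construction."""
    hist = [0] * 8
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        hist[DIRS[(x1 - x0, y1 - y0)]] += 1
    total = sum(hist)
    return [h / total for h in hist]

# A unit square traced from the origin and back.
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(chain_code_histogram(square))  # [0.25, 0.0, 0.25, 0.0, 0.25, 0.0, 0.25, 0.0]
```

In the cited systems such histograms are computed per block of a character, concatenated and fed to a support vector machine.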

criminant analysis and nearest neighbor classifiers are used to classify the scripts. At first, two scripts are classified at a time, and then the approach is extended to classifying three or more scripts. The scripts are English, Hindi, Urdu, Bengali, Tamil, Gujarati, Telugu, Kannada, Malayalam, Odiya and Punjabi. The support vector machine and nearest neighbor classifiers worked well with the Gabor filter and outperformed the linear discriminant analysis classifier, which shows that the scripts are not linearly separable.

Sharma et al. [3] worked on video frames and calculated features like local binary pattern, histogram of oriented gradients and gradient local auto-correlation. For the local binary pattern, eight neighbors are considered; for the histogram of oriented gradients, the image is divided into 5 × 5 blocks; and a Roberts filter is used in the gradient local auto-correlation. Support vector machine with Gaussian kernel and artificial neural network classifiers are used for classification. The experiment was conducted on each feature using each classifier to classify English, Hindi and Bengali scripts. The local binary pattern is a rotation-invariant texture feature; it gives the spatial structure of an image across angular space and resolution in circular neighborhoods. In another paper, Sharma et al. [66] proposed their algorithm for video frames. Normally, video frames have complex and blurred backgrounds and suffer from low resolution. As a preprocessing step, the image is binarized, the skeleton of the image is formed, and the resolution is increased by 1.5% using the cubic interpolation method. Features are extracted using Gabor filters, Zernike moments and gradient directions. Zernike moments are orthogonal moments and contain more information about an image. Later, a support vector machine classifier is applied to English, Bengali and Hindi scripts for their identification.

Ma and Doermann [67] first deskewed the document image using the histogram profile. Then, certain vertical and horizontal lines are removed using the Hough transform, along with some symbols that do not belong to any script. Word extraction is done using the well-known 'Docstrum' algorithm. They used a standard image size of 64 × 64 and applied the Gabor filter at four different spatial frequencies and orientations, thus producing 16 Gabor channels for feature extraction. Later, for script identification, the performance of three classifiers, namely k-nearest neighbor, support vector machine and Gaussian mixture model, is compared. They concluded that k-nearest neighbor gave good average accuracy, support vector machine gave minimal deviation in results, and Gaussian mixture model gave less accuracy. For experimentation, English with Chinese, Hindi and Arabic scripts as bilingual dictionaries are used.

Ferrer et al. [68] identified handwritten scripts, namely Bangla, Persian and Roman. At first, the text-lines and words are segmented using projection profile concepts and the convex hull. For identifying scripts through text-lines, the text-lines are divided into overlapping text blocks. Then, local binary patterns and histograms are calculated, which describe spatial information like script structure density. For classification, a least-squares support vector machine with radial basis function kernel in one-against-all mode is used. Similar steps are carried out for script identification of words. They introduced a multiple-training-one-test classification scheme: each text-line is divided into overlapping blocks, and a number of classifiers are trained. For training, local binary patterns at different normalized values for each overlapping block are used. For script identification, the normalized local binary pattern value of a query word is calculated, and its script is estimated from the classifier trained with the same normalized value.

Singh et al. [69] also computed the discrete cosine transform over 4 × 4 grids, along with translation-, scale- and rotation-invariant moment features. These are normalized central moments. A multilayer perceptron showed the highest accuracy for classifying Malayalam, Telugu, Tamil, Oriya and Roman scripts.

Angadi and Kodabagi [70] performed some preprocessing, including skew detection and correction, binarization using Otsu's method and noise reduction. A wavelet transform is applied on the processed image, and then features like zonewise energy, log mean deviation and vertical runs at each level are calculated. Here, a zone is made by dividing the image


horizontally and vertically into 3 and 4 parts, respectively. Then, for script identification, a fuzzy-based classification scheme is used. Their method first identified Hindi script from the other scripts, like English, Kannada, Tamil or Malayalam; later, identification among the other scripts is done.

Malemath et al. [72] used a Gaussian steerable filter in different directions for estimating the spectral power and orientation energy. The standard deviation of the energy at different angles is calculated. This feature vector is used to classify English script from Kannada, Hindi, Urdu and Tamil scripts using linear discriminant analysis and k-nearest neighbor techniques.

Rezaee et al. [73] first corrected the skew of the document using the Radon transform and then calculated the horizontal projection profile distribution for Latin and Farsi words. They found that Latin scripts followed a uniform distribution, whereas Farsi scripts followed a Gaussian distribution, which was the criterion for script identification.

Rani et al. [74] computed features using the Gabor filter and Sobel masks for fixed-size character images. A support vector machine with different kernels is used to classify English and Gurumukhi scripts. The support vector machine with radial basis function and polynomial kernels gave better performance than the linear kernel.

Pal et al. [75] also calculated Zernike moments and gradient features as a feature vector. The gradient feature includes Roberts filtering of 9 × 9 blocks and the calculation of histograms in 16 directions for each block. Later, a support vector machine is used to identify English, Hindi and Bengali script signatures.

Obaidullah et al. [76] worked on the identification of handwritten numerals of Bangla, Hindi, English and Urdu scripts. The word numeral image is binarized, and later a wavelet transform (from the Daubechies wavelet family) is applied. Then, entropy, standard deviation and mean are calculated on the decomposed sub-images, whereas log energy, Shannon, sure, threshold and norm are calculated on the approximate sub-band. For classification, nbtree, part, liblinear, random forest, sequential minimal optimization, simple logistic and multi-

images, and the mean and standard deviation are calculated as feature vectors. Now, linear discriminant analysis is used to identify English, Hindi, Kannada, Tamil, Telugu and Malayalam scripts.

Pardeshi et al. [79] first binarized the document and segmented the words using a morphological operation and connected component labeling. Then the Radon and discrete wavelet transforms are applied, and the entropy and standard deviation of the Radon transform projections and of the discrete wavelet sub-band images are calculated. Next, the sum of the first twenty low-frequency coefficients of the discrete cosine transform is calculated. Linear discriminant analysis, support vector machine and k-nearest neighbor classifiers are used for classifying Roman, Devanagari, Urdu, Kannada, Oriya, Gujrati, Bangla, Gurumukhi, Tamil, Telugu and Malayalam scripts in bi-script and tri-script form. The discrete cosine transform showed poor classification capability compared to the Radon and discrete wavelet transforms due to its lack of directionality. Support vector machine and linear discriminant analysis outperformed the k-nearest neighbor classifier.

Singh et al. [107] used modified Gabor filter-based features for the classification of Bangla, Devanagari and Roman scripts.

Brodić et al. [109] mapped characters of Latin and Fraktur scripts into coded text. This coded text is then subjected to co-occurrence analysis to generate features. Later, a genetic algorithm is used for identification. Table 8 shows the performance metrics of algorithms for texture feature-based fine-scale analysis at word/character level.

2.2.6 Hybrid feature or other feature-based fine-scale analysis at word/character level

Apart from structural and texture features, features based on the occurrence frequency of words and letters within document images are calculated.

Selamat and Ng [82] worked for the identification of three languages, which were Arabic, Persian and Urdu
layer perceptron classifiers are used. Multilayer perceptron for web page documents. First, they described about the
had shown highest accuracy. significance and implementation of ARTMAP (Adaptive
Hangarge et al. [77] segmented words from text image Resonance Theory) and DT (Decision Tree) approaches.
using morphological dilation operation. They extracted the ARTMAP approach works on neural network and cluster-
principal diagonals, upper and lower diagonals of word ing mechanisms, i.e., sample is processed by finding its
images and their flipped form. Then, one-dimensional dis- nearest cluster and then updating that cluster. DT approach
crete cosine transform and standard deviations are calcu- depends upon the frequency of occurrence of letters within
lated. The two-dimensional discrete cosine transform is also the document. For that, Unicode of that letter is used. Later
applied on word images, and standard deviations on princi- they developed the technique, which was a hybridization
pal, upper and lower diagonals are calculated. Now, linear of the above two approaches. Here, DT approach is used
discriminant analysis is used to identify English, Hindi, Tel- for feature extraction, which is based upon the character
ugu, Kannada, Malayalam and Tamil scripts. Hangarge et occurring frequency, whereas ARTMAP approach is used
al. [78] applied directional discrete cosine transform in ver- to classify three languages using neural network. Selamat
tical, horizontal, right- and left-diagonal directions of word and Ng [83] also worked on finding letter frequency, docu-

123
Int J Multimed Info Retr

Table 8 Algorithms for texture feature-based fine-scale analysis at word/character level

Algorithm | Identification accuracy (%) | Image database information
Pati and Ramakrishnan [2] | 98.00 | Machine-printed documents collected from newspapers, magazines, books and laser-printed documents
Sharma et al. [3] | 94.25 | Machine-printed text obtained from videos
Ma and Doermann [67] | 97.87 | Machine-printed document database
Ferrer et al. [68] | 97.18 | IAM [11] database and English sentence database [80]
Singh et al. [69] | 93.56 | Handwritten documents scanned at 300 dpi
Angadi and Kodabagi [70] | 94.33 | Low-resolution display board images of Government offices in India
Malemath et al. [72] | 99.13 | Machine-printed documents collected from magazines, various books, journals and newspapers; newspapers and documents from online resources and some documents downloaded from digital libraries
Rezaee et al. [73] | 96.05 | Machine-printed documents from books, scanned at 300 dpi
Rani et al. [74] | 99.18 | Self-prepared machine-printed database at a resolution of 300 dpi
Pal et al. [75] | 92.14 | Self-prepared handwritten database
Obaidullah et al. [76] | 82.20 | Self-prepared handwritten database
Hangarge et al. [77] | 85.77 | Self-prepared handwritten database
Hangarge et al. [78] | 97.35 | Self-prepared handwritten database
Pardeshi et al. [79] | 96.00 | Self-prepared handwritten database and the CMATERdb 1.1 [81] database
Singh et al. [107] | 95.30 | Self-prepared databases and the CMATERdb 1.2.2 and CMATERdb 1.5.1 databases
Brodić et al. [109] | 98.08 | Self-prepared database

ment frequency and letter weight. Document frequency is the number of documents in which a particular letter appears, and letter weight is the product of letter frequency and document frequency. They formed a feature vector for each letter consisting of its weight and Unicode. The language of an unknown document is then identified using the sum of features for each language, the current and previous feature positions, and the current and previous feature frequencies. They identified Arabic, Urdu, Persian and Pashto languages. Selamat and Lee [84] used an entropy-based weighting scheme for words, calculated through term frequency. Global and local weighting terms are calculated for the feature vectors. Independent component analysis is used for feature reduction and refinement; this is done by projecting the clustered data into the independent component analysis space. A neural network with back-propagation is used to identify Arabic, Urdu and Persian scripts.

Shi et al. [85] divided their method into two stages. In the first stage, features are extracted through convolution and max pooling (downsampling). This is done recursively, so a number of levels (hierarchies) are obtained. The result is dense feature vectors, from which discriminative clustering is done (a codebook) along with local patches of different scripts. In the mid-level representation, they form a long vector that contains topological information of characters parameterized by the codebook weights. Finally, at the global fine-tuning stage, optimization of the features and mid-level representations is done using the neural network back-propagation algorithm. For this algorithm, word images of Arabic, Greek, English, Japanese, Korean, Hebrew, Kannada, Tibetan, Cambodian, Russian, Mongolian, Thai and Chinese scripts from natural scenes are used.

Behrad et al. [86] proposed their algorithm for Farsi and English scripts and extended their work to Farsi digit recognition. First, the document image is segmented into text-line, word and connected component (character) levels as a preprocessing step. Then, using the curvature scale space representation, planar curves for each connected component are calculated. This feature becomes robust to rotation, scale and noise when normalized, and is calculated using a Gaussian filter along with derivatives in the X and Y directions. Some structural features are calculated using projection profile analysis, height, width, filled holes and their ratios. The k-nearest neighbor classifier is used for classifying Farsi and English scripts. Based upon the majority of identified characters, scripts are then recognized at higher levels, i.e., at word, text-line and page levels.

Rani et al. [87] computed the Gabor response on English and Punjabi word images and calculated the mean and standard deviation for different frequencies and directions. Apart from that, aspect ratios, eccentricities, stroke densities in the vertical, horizontal, right- and left-diagonal directions, pixel ratios after filling holes, vertical inter-character gaps through the vertical histogram and horizontal breaks using the horizontal histogram are also calculated. The k-nearest neighbor and support vector machine classifiers are used for identification. They concluded that the support vector machine gave the highest accuracy with the polynomial kernel for Gabor features and with the linear kernel for structural features. The k-nearest neighbor classifier achieved its highest accuracy using structural features with k = 3 and using Gabor features with k = 1.

Mezghani et al. [88] used affine moment invariants, extrema (top and bottom) points, the horizontal projection profile obtained by dividing the image into 10 equal parts, and the amplitude difference between the top and bottom profiles as feature vectors. A Gaussian mixture model is then used to identify machine-printed and handwritten Arabic and French scripts.

Abainia et al. [89] used both character frequency and word frequency algorithms for script identification, based upon UTF-8 codes. They also used these two algorithms in hybrid form, either sequentially or in parallel. Some of the identified languages were English, Malay, Indonesian, Dutch, Arabic, Danish, Norwegian, Chinese, Persian, Italian and French.

Yadav and Kaur [90] proposed a method in which a character-based language model is developed for 12 different languages. This language model is built through an N-gram model based upon the maximum likelihood estimation technique. In the text correction step, the best path among the 12 languages is calculated for every given input text and, finally, the decision is made through a Bayesian classifier. Experiments are conducted for the identification of Assamese, Bengali, Manipuri, Hindi, Nepali, English, Kashmiri, Pashto, Urdu, Tamil, Telugu and Punjabi scripts. Hebert et al. [91] first identified Arabic and Latin machine-printed and handwritten scripts using codebooks generated from chain code histograms and a multilayer perceptron classifier. Then, using an N-gram language model (based on character frequency), they identified English and French handwritten and machine-printed languages using the Chi-square distance.

Ablavsky and Stevens [92] calculated Cartesian moments, central moments (robust to character position), normalized moments (robust to scale), Hu moments (robust to rotation) and the co-occurrence matrix as texture features. Other calculated features are related to the perimeter, the lengths of the major and minor axes, elongation, eccentricity and convexity. The RELIEF-F algorithm is used to reduce the feature set, depending upon a weight factor and the nearest hit and miss within each class (script). Finally, k-nearest neighbor is used to identify Cyrillic and Latin scripts.

Obaidullah et al. [93] extracted circularity, rectangularity, Freeman chain codes, component sizes and Gabor filter features. A simple logistic classifier is used for classification of Bangla, Devanagari, Roman, Oriya, Urdu, Gujarati, Telugu, Kannada, Malayalam and Kashmiri scripts. Obaidullah et al. [94] used the same feature vectors, but extended the classification using a number of classifiers such as Bayes net, multilayer perceptron, Liblinear, radial basis function network, simple logistic, PART, FURIA, random tree and NBTree. Among them, the simple logistic classifier showed the highest accuracy.

Dhanya et al. [95] worked on Tamil and English scripts. For that, spatial features based on pixel concentration and density, and directional features using Gabor filters, are calculated. Later, these features are tested and the scripts classified using support vector machine, nearest neighbor and k-nearest neighbor classifiers.

Singh et al. [96] first binarized the word images using Otsu's method, removed noise using Gaussian low-pass filters and detected edges using the Canny algorithm. Then, features like coordinates using centroid values, distances from the centroid, slope and tangent angle, areas and histogram of oriented gradient descriptors for each cell are calculated. A multilayer perceptron classifier showed the highest accuracy for the classification of Bangla, Hindi, Telugu, Malayalam and English scripts. Table 9 shows the algorithms' performance metrics for hybrid feature or other feature-based fine-scale analysis at word/character level.

3 Discussions and observations

It has been observed from Tables 1, 2, 3, 4, 5, 6, 7, 8 and 9 that there is a lack of availability of standard databases.


Table 9 Algorithms for hybrid feature or other feature-based fine-scale analysis at word/character level

Algorithm | Identification accuracy (%) | Image database information
Selamat and Ng [82] | 99.49 | BBC web site http://www.bbc.co.uk/S
Shi et al. [85] | 99.30 | Images collected from Google Street View http://maps.google.com, SIW [97] and CVSI [102] datasets
Behrad et al. [86] | 98.72 | Documents collected from internet at 300 dpi resolution
Rani et al. [87] | 99.75 | Self-prepared machine-printed database
Mezghani et al. [88] | 99.10 | AHTID/MW [98], RIMES [99] and APTI [100] databases
Hebert et al. [91] | 86.78 | Maurdor dataset available at http://www.maurdor-campaign.org/
Ablavsky and Stevens [92] | 97.00 | Documents from books and newsprint at 300 dpi resolution
Obaidullah et al. [93] | 98.20 | Machine-printed documents collected from articles, book pages, etc.
Dhanya et al. [95] | 96.03 | Machine-printed documents collected from newspapers and various magazines, scanned at 300 dpi resolution
Singh et al. [96] | 91.79 | Self-prepared handwritten database
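Several of the frequency-based approaches summarized in Table 9 and Sect. 2.2.6 (e.g., Selamat and Ng [82] and Abainia et al. [89]) reduce, at their core, to comparing letter-occurrence statistics of an unknown text against stored per-language profiles. The sketch below is a deliberately minimal, hypothetical illustration of that idea; the toy sample texts and the cosine-similarity matching are our own simplifications, not the classifiers used in the surveyed methods:

```python
from collections import Counter
import math

def char_profile(text):
    """Normalized letter-occurrence frequencies over Unicode letters."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(ch, 0.0) for ch, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def identify(text, profiles):
    """Pick the language whose stored profile best matches the text."""
    probe = char_profile(text)
    return max(profiles, key=lambda lang: similarity(probe, profiles[lang]))

# Toy per-language profiles built from tiny samples (hypothetical;
# real systems estimate these from large corpora).
profiles = {
    "english": char_profile("the quick brown fox jumps over the lazy dog"),
    "german": char_profile("über den faulen hund springt der flinke fuchs"),
}

print(identify("the dog sleeps over there", profiles))  # -> english
```

Real systems replace the toy profiles with statistics gathered from large corpora and feed them to stronger classifiers (ARTMAP, decision trees or Bayesian models), but the underlying features are the same normalized occurrence counts.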

Authors mostly prepared their handwritten and machine-printed databases by themselves. For that, they downloaded various machine-printed articles, forms and books from internet resources or captured images with a camera. Otherwise, they used hard copies and scanned them, usually at a resolution of 300 dpi. Software like Microsoft Office Word and Paintbrush can also be used to write and save documents and build databases. For preparing handwritten document databases, authors also formed groups of people and asked them to write letters in their own handwriting in different languages (Obaidullah et al. [106] and Singh et al. [107]); scanning and preparation of the databases are done later. During preparation, various annotations are introduced. Therefore, script identification requires better preprocessing techniques for eliminating annotations. These preprocessing techniques involve tasks like line and noise removal from text, contour smoothing, skew and slant removal, reference line detection, scaling and skeletonization (Saba et al. [104]).

From Tables 1, 2 and 3, it has been observed that the methods by Pan et al. [20] and Rajput and Anita [22] achieved the highest identification rates, but the scope of Pan et al. [20] is limited to a machine-printed document database, whereas the scope of Rajput and Anita [22] is limited to a handwritten document database. The method by Zhu et al. [6] is more robust to skew and size normalization, as it works on both types of databases.

Due to the use of spatial information in the calculation of the gray-level co-occurrence matrix feature in Singh et al. [16], efficiency is increased and identification is possible with a smaller amount of text, but, on the other hand, sensitivity to noise is increased. The method by Peake and Tan [19] is robust to foreign and italicized characters and numerals due to the use of Gabor filter features. Padding of blocks generally helps in script identification when only a small amount of text is present. The normalization step cannot be applied efficiently to handwritten text. The log Gabor feature is highly sensitive to the spaces between text regions, horizontal/vertical lines and extraneous characters (Joshi et al. [7]); these can probably distort the energy profile of log Gabor features.

Higher identification rates are achieved with structure-based features at the text-line level (Table 4) than at the block level (Table 1), as finer information is obtained, which helps in differentiating between scripts. The water flow-based feature vector concept explained in Pal and Chaudhuri [28,29] is not affected by touching characters or words within text-lines and can efficiently differentiate two similar-looking scripts. For feature extraction, text-lines can be divided into different zones to get finer details. The projection profile features explained in Aithal et al. [33,34] and Prakash et al. [35] are sensitive to skew. The identification rate of Rajput and Anita [45] is affected in the case of bi-script text-lines.

The length of words is measured in terms of the density of strokes within words. The density of these strokes affects the overall efficiency of Namboodiri and Jain [1], whereas Pati and Ramakrishnan [2] is robust to noise, font style and size due to document vectorization and is one of the fastest techniques. It has been observed that character stroke density changes with slight skew within the text-line. Character sizes are assumed uniform within each word or text-line. At the character level, structure-based features can be used more easily to discriminate between the languages of the same scripts. If character fonts vary by a large amount, the best way to handle such a situation is to form separate clusters rather than grow existing clusters (Hochberg et al. [26]). In the method of Dhandra and Hangarge [52], features are also based upon the direction of strokes along with the stroke density. The run length histogram describes the character stroke lengths, whereas the crossing count histogram describes the complexity of the strokes. From Table 7, the method by Spitz [49] showed full efficiency, but only for a machine-printed database, whereas Echi et al. [64] is applicable to both machine-printed and handwritten databases. The length of words does not affect the features calculated in Pati and Ramakrishnan [2]. The texture-based local binary pattern proposed by Sharma et al. [3] and Ferrer et al. [68] does not effectively recognize the differences between scripts on a large variety of datasets, because different layouts and unconstrained handwriting are difficult to capture using the local binary pattern. The discrete cosine transform feature classified the scripts based on a compact representation (Singh et al. [69]). The directional discrete cosine transform is robust compared to the discrete cosine transform in terms of rotational invariance (Hangarge and Santosh [78]), whereas discrete wavelet transform features outperform discrete cosine transform features in terms of resolution (Pardeshi et al. [79]).

For template preparation with features that depend upon occurrence frequency, as in Selamat and Ng [82] and Ng and Selamat [83], the counts are increased one by one. The convolutional neural network takes a large training time (Shi et al. [85]). From Table 9, it has been observed that the authors in Mezghani et al. [88] tested their method on standard databases and achieved a good identification rate. In the method of Ablavsky and Stevens [92], all features calculated using moments are sensitive to noise, so features based on the shape of components should be included to make the method noise robust. Connected component analysis is necessary for scripts having small components, like Chinese script. Higher accuracy is achieved if text of a similar font is present within the text components. Calculation of the aspect ratio at word level helps in identifying the scripts robustly. The efficiency of the system can be increased if the calculated features are less dependent upon the script's structural properties. Script identification at the structural feature-based text-line level holds well if more than two words are present in each text-line. For texture features at block level, text components of even spacing and height are generally preferred. These features can include text of varying fonts and sizes for improving the identification rates during the learning stage.

From these observations, Table 10 shows a comparative analysis at large scale and fine scale based upon the calculated features. This comparison is made on common parameters, which directly or indirectly affect the performance of features at each level.

Table 10 Feature-based comparative analysis at large-scale and fine-scale on common parameters

Sr. no. | Structural-based features | Texture-based features
1. | High identification accuracy if text components have fragmented characters; more text present in components is desirable | Can be easily applied to components having less amount of text
2. | Do not perform well when there are broken or touching characters, or in the presence of noise and skew | Perform better in the presence of noise and skew
3. | Simple and rigid | Complex but flexible
4. | Computation time is directly proportional to the complexity of features | Generally, computation time is independent of complexity
5. | The domain remains the same | The domain may get changed
6. | Document formatting affects performance | Here also, document formatting affects performance
7. | Sensitive to font variation | Robust to italic and foreign characters and numerals
8. | Spacing does not affect performance | Spacing adversely affects performance at block level (large-scale) but not much at the text-line and word/character levels (fine-scale)

4 Conclusion

In this paper, a comprehensive survey of script identification at large scale and fine scale is presented, and insights are drawn based on the observations. At first, we provided an overview of the taxonomy shown in Fig. 3 and then briefly described the various aforementioned methods. Based on their performances, the databases used and common parameters like skew and font variation, a comparison is also demonstrated. This will help researchers to understand the complicated issues of script identification under document image analysis and direct them in further research.

One can write and prepare documents using various scripts with the help of a medium called language. Script identification is mainly used for language identification and pre-OCR processing. Sometimes, applications of script identification


techniques are extended for language identification purposes (Brodić et al. [108]). OCR has been one of the emerging research fields for the past two decades. Many scripts are still open to OCR research and in its development process.

Although a lot of research has been carried out in the field of script identification over the past two decades, improvement in terms of performance and accuracy is still a vital requisite. It is observed that obtaining good features as well as a potent classifier is quite difficult. Texture features obtained from Gabor filters and wavelet transforms provide effective discrimination between scripts, whereas support vector machines and neural network classifiers have proven accurate for binary and multiscript classification. Text-line and word level script identification techniques are helpful for multiscript documents.

Tables 1, 2, 3, 4, 5, 6, 7, 8 and 9 show that many authors created databases according to their research projects, and these databases are not available publicly. Besides, most of the referred publicly available databases are developed for the research fields of OCR, writer identification, signature verification and handwriting recognition. Therefore, it is necessary to develop standard databases, especially for script identification. Moreover, the available databases contain only specific scripts like Arabic, Latin, Chinese, Devanagari and Bangla, whereas no database is available publicly for other scripts. The main reason for this database void is that many countries in the world generally use and work on a single script; therefore, their predominant research focus is more on developing OCR rather than script identification technologies.

Most of the research in script identification is done for offline script identification, and only a few reports are available for online script identification. In this technological era, the use of smart phones and cheap gadgets is widespread; hence, one can think of a smart phone application for automatic translation of signboards into required scripts. However, this online script identification poses an extraordinary challenge for the research community and requires a great amount of effort.

There is no universally accepted script identification method that identifies every text component of a particular script. Every method has its pros and cons. Structural feature-based methods developed for particular groups of scripts may not work efficiently for another group. In the multiscript environment, some scripts are simple in nature, whereas some are complicated in structure. For example, Latin script, which is based on an alphabetic system, is simple in nature as compared to non-alphabetic scripts like Arabic, Devanagari and Han. Texture feature-based methods (Tan [13], Busch et al. [14] and Hiremath and Shivashankar [15]) are highly accurate and universally proven. Sometimes, these methods cannot be applied reliably at the character level. One possible way to classify such scripts is to target special types of symbols and characters of the scripts (Shi et al. [85]). Script identification from natural scene images is still a less researched area. Identification accuracy generally decreases with an increase in the number of scripts. Video- and camera-based real-time identification is an important research area where more research is required.

With the increase in globalization and transactions in recent business activities, OCR research communities have become more conscious about automatic script recognition. It has been observed that script identification of handwritten text is at an early stage as compared to machine-printed text. In addition, script identification techniques are developed for text of a particular font size and style. Therefore, there is a need for developing handwritten script identification techniques and multifont script recognizers.

The main contribution of this paper is to analyze and classify various script identification methods, with discussions on various parameters like identification rates and databases used, and a comparative analysis on a common platform. A future point of research is to develop an algorithm that is flexible, can easily accommodate a larger number of scripts for accurate identification, and consumes less time with minimum requirements.

References

1. Namboodiri AM, Jain AK (2004) Online handwritten script recognition. IEEE Trans Pattern Anal Mach Intell 26:124–130. doi:10.1109/TPAMI.2004.1261096
2. Pati PB, Ramakrishnan AG (2008) Word level multi-script identification. Pattern Recogn Lett 29:1218–1229. doi:10.1016/j.patrec.2008.01.027
3. Sharma N, Pal U, Blumenstein M (2014) A study on word-level multi-script identification from video frames. In: International joint conference on neural networks, Beijing, pp 1827–1833. doi:10.1109/IJCNN.2014.6889906
4. Shijian L, Tan CL (2008) Script and language identification in noisy and degraded document images. IEEE Trans Pattern Anal Mach Intell 30:14–24. doi:10.1109/TPAMI.2007.1158
5. Patil SB, Subbareddy NV (2002) Neural network based system for script identification in Indian documents. Sadhana 27:83–97. doi:10.1007/BF02703314
6. Zhu G, Yu X, Li Y, Doermann D (2009) Language identification for handwritten document images using a shape codebook. Pattern Recogn 42:3184–3191. doi:10.1016/j.patcog.2008.12.022
7. Joshi GD, Garg S, Sivaswamy J (2007) A generalised framework for script identification. Int J Doc Anal Recogn 10:55–68. doi:10.1007/s10032-007-0043-3
8. Shivakumara P, Yuan Z, Zhao D, Lu T, Tan CL (2015) New gradient-spatial-structural features for video script identification. Comput Vis Image Underst 130:35–53. doi:10.1016/j.cviu.2014.09.003
9. Hochberg J, Bowers K, Cannon M, Kelly P (1999) Script and language identification for handwritten document images. Int J Doc Anal Recogn 2:45–52. doi:10.1007/s100320050036
10. Li Y, Zheng Y, Doermann D, Jaeger S (2008) Script-independent text line segmentation in freestyle handwritten documents. IEEE Trans Pattern Anal Mach Intell 30:1313–1329. doi:10.1109/TPAMI.2007.70792


11. Marti U, Bunke H (2006) The IAM-database: an English sentence database for offline handwriting recognition. Int J Doc Anal Recognit 5:39–46. doi:10.1007/s100320200071
12. Lu S, Li L, Tan CL (2010) Identification of scripts and orientations of degraded document images. Pattern Anal Appl 13:469–475. doi:10.1007/s10044-009-0169-7
13. Tan TT (1998) Rotation invariant texture features and their use in automatic script identification. IEEE Trans Pattern Anal Mach Intell 20:751–756. doi:10.1109/34.689305
14. Busch A, Boles WW, Sridharan S (2005) Texture for script identification. IEEE Trans Pattern Anal Mach Intell 27:1720–1732. doi:10.1109/TPAMI.2005.227
15. Hiremath PS, Shivashankar S (2008) Wavelet based co-occurrence histogram features for texture classification with an application to script identification in a document image. Pattern Recogn Lett 29:1182–1189. doi:10.1016/j.patrec.2008.01.012
16. Singh PK, Dalal SK, Sarkar R, Nasipuri M (2015) Page-level script identification from multi-script handwritten documents. In: 3rd international conference on computer, communication, control and information technology, Hooghly, pp 1–6. doi:10.1109/C3IT.2015.7060113
17. Benjelil M, Kanoun S, Mullot R, Alimi AM (2009) Arabic and Latin script identification in printed and handwritten types based on steerable pyramid features. In: 10th international conference on document analysis and recognition, Barcelona, pp 591–595. doi:10.1109/ICDAR.2009.287
18. Zhou L, Ping XJ, Zheng EG, Guo L (2010) Script identification based on wavelet energy histogram moment features. In: IEEE 10th international conference on signal processing, Beijing, pp 980–983. doi:10.1109/ICOSP.2010.5655843
19. Peake GS, Tan TN (1997) Script and language identification from document images. In: Proceedings of workshop on document image analysis, Washington DC, pp 10–17. doi:10.1109/DIA.1997.627086
20. Pan WM, Suen CY, Bui TD (2005) Script identification using steerable Gabor filters. In: Proceedings of the eighth international conference on document analysis and recognition, Seoul, pp 883–887. doi:10.1109/ICDAR.2005.206
21. Singhal V, Navin N, Ghosh D (2003) Script-based classification of hand-written text documents in a multilingual environment. In: Proceedings of 13th international workshop on research issues in data engineering: multi-lingual information management, Hyderabad, pp 47–54. doi:10.1109/RIDE.2003.1249845
22. Rajput GG, Anita HB (2010) Handwritten script recognition using DCT and wavelet features at block level. IJCA, special issue on RTIPPR 3:158–163
23. Lee WS, Kim NC, Jang IH (2010) Texture feature-based language identification using wavelet-domain BDIP, BVLC, and NRMA features. In: IEEE international workshop on machine learning
28. Pal U, Chaudhuri BB (2002) Identification of different script lines from multi-script documents. Image Vis Comput 20:945–954. doi:10.1016/S0262-8856(02)00101-4
29. Pal U, Chaudhuri BB (2001) Automatic identification of English, Chinese, Arabic, Devnagari and Bangla script line. In: Proceedings of sixth international conference on document analysis and recognition, Seattle, pp 790–794. doi:10.1109/ICDAR.2001.953896
30. Gopakumar R, Subbareddy NV, Makkithaya K, Acharya UD (2010) Script identification from multilingual Indian documents using structural features. J Comput 2:106–111
31. Gopakumar R, Subbareddy NV, Makkithaya K, Acharya UD (2010) Zone-based structural feature extraction for script identification from Indian documents. In: 5th international conference on industrial and information systems, Mangalore, pp 420–425. doi:10.1109/ICIINFS.2010.5578668
32. Padma MC, Vijaya PA (2010) Script identification from trilingual documents using profile based features. Int J Comput Sci Appl 7:16–33
33. Aithal PK, Rajesh G, Acharya DU, Krishnamoorthi M, Subbareddy NV (2011) Script identification for a tri-lingual document. In: 2nd international conference on advances in communication, network, and computing, pp 434–439. doi:10.1007/978-3-642-19542-6_82
34. Aithal PK, Rajesh G, Acharya DU, Krishnamoorthi M, Subbareddy NV (2010) Text line script identification for a tri-lingual document. In: 2nd international conference on computing, communication and networking technologies, Karur, pp 1–3. doi:10.1109/ICCCNT.2010.5592562
35. Prakash O, Shrivastava V, Kumar A (2013) An efficient approach for script identification. Int J Comput Trends Technol 4:1626–1631
36. Phan TQ, Shivakumara P, Ding Z, Lu S, Tan CL (2011) Video script identification based on text lines. In: International conference on document analysis and recognition, Beijing, pp 1240–1244. doi:10.1109/ICDAR.2011.250
37. Tan GX, Gaudin CV, Kot AC (2009) Information retrieval model for online handwritten script identification. In: 10th international conference on document analysis and recognition, Barcelona, pp 336–340. doi:10.1109/ICDAR.2009.162
38. Bashir R, Quadri SMK (2014) Entropy based script identification of a multilingual document image. In: International conference on computing for sustainable global development, New Delhi, pp 19–23. doi:10.1109/IndiaCom.2014.6828005
39. Bashir R, Quadri SMK (2013) Identification of Kashmiri script in a bilingual document image. In: Proceedings of the IEEE second international conference on image information processing, Waknaghat, pp 575–579. doi:10.1109/ICIIP.2013.6707658
40. Bashir R, Quadri SMK (2015) Density based script identifica-
for signal processing, Finland, pp 444–449. doi:10.1109/MLSP. tion of a multilingual document image. Int J Image Graph Signal
2010.5588751 Process 2:8–14. doi:10.5815/ijigsp.2015.02.02
24. Valkealahti K, Oja E (2007) Reduced multidimensional co- 41. Ghosh S, Chaudhuri BB (2011) Composite script identification
occurrence histograms in texture classification. IEEE Trans Pat- and orientation detection for Indian text images. In: International
tern Anal Mach Intell 20:90–95. doi:10.1109/34.655653 conference on document analysis and recognition, Beijing, pp
25. Brodić D, Milivojević ZN, Maluckov CA (2015) An approach to 294–298. doi:10.1109/ICDAR.2011.67
the script discrimination in the Slavic documents. Soft Comput 42. Cheng J, Ping X, Zhou G, Yang Y (2006) Script identification
19:2655–2665. doi:10.1007/s00500-014-1435-1 of document image analysis. In: Proceedings of the 1st inter-
26. Hochberg J, Kelly P, Thomas T, Kerns LL (1997) Automatic script national conference on innovative computing, information and
identification from document images using cluster-based tem- control, Beijing, pp 178–181. doi:10.1109/ICICIC.2006.518
plates. IEEE Trans Pattern Anal Mach Intell 19:176–181. doi:10. 43. Moussa SB, Zahour A, Benabdelhafid A, Alimi AM (2008)
1109/34.574802 Fractal-based system for Arabic/Latin, printed/handwritten script
27. Silva C, Ribeiro B (2007) On text-based mining with active identification. In: 19th international conference on pattern recog-
learning and background knowledge using SVM. Soft Comput nition, Florida, pp 1–4. doi:10.1109/ICPR.2008.4761838
11:519–530. doi:10.1007/s00500-006-0080-8 44. Padma MC, Vijaya PA (2009) Monothetic separation of Telugu,
Hindi and English text lines from a multi script document. In:
Proceedings of the IEEE international conference on systems,

123
Int J Multimed Info Retr

man, and cybernetics, San Antonio, pp 4870–4875. doi:10.1109/ICSMC.2009.5346045
45. Rajput GG, Anita HB (2011) Handwritten script identification from a bi-script document at line level using Gabor filters. In: Proceeding of SCAKD, pp 94–101
46. Jindal M, Hemrajani N (2013) Script identification for printed document images at text-line level using DCT and PCA. IOSR J Comput Eng 12:97–102
47. Obaidullah SM, Nibaran D, Roy K (2014) Gabor filter based technique for offline Indic script identification from handwritten document images. In: International conference on devices, circuits and communications, Ranchi, pp 1–5. doi:10.1109/ICDCCom.2014.7024723
48. Lu S, Li L, Tan CL (2007) Identification of Latin-based languages through character stroke categorization. In: 9th international conference on document analysis and recognition, Brazil, pp 352–356. doi:10.1109/ICDAR.2007.4378731
49. Spitz AL (1997) Determination of the script and language content of document images. IEEE Trans Pattern Anal Mach Intell 19:235–245. doi:10.1109/34.584100
50. Das MS, Rani DS, Reddy CRK (2012) Heuristic based script identification from multilingual text documents. In: 1st international conference on recent advances in information technology, Dhanbad, pp 487–492. doi:10.1109/RAIT.2012.6194627
51. Yeotikar PP, Deshmukh PR (2013) Script identification of text words from multilingual Indian document. Int J Comput Appl 1:22–29
52. Dhandra BV, Hangarge M (2011) Morphological reconstruction for word level script identification. Int J Comput Sci Secur 1:41–51
53. Chanda S, Pal S, Franke K, Pal U (2009) Two-stage approach for word-wise script identification. In: 10th international conference on document analysis and recognition, Barcelona, pp 926–930. doi:10.1109/ICDAR.2009.239
54. Chanda S, Pal U, Franke K, Kimura F (2010) Script identification—a Han and Roman script perspective. In: 20th international conference on pattern recognition, Istanbul, pp 2708–2711. doi:10.1109/ICPR.2010.1127
55. Roy K, Alaei A, Pal U (2010) Word-wise handwritten Persian and Roman script identification. In: International conference on frontiers in handwriting recognition, Kolkata, pp 628–633. doi:10.1109/ICFHR.2010.103
56. Roy K, Das SK, Obaidullah SM (2011) Script identification from handwritten document. In: 3rd national conference on computer vision, pattern recognition, image processing and graphics, Hubli, pp 66–69. doi:10.1109/NCVPRIPG.2011.22
57. Obaidullah SM, Roy K, Das N (2013) Comparison of different classifiers for script identification from handwritten document. In: IEEE international conference on signal processing, computing and control, Waknaghat, pp 1–6. doi:10.1109/ISPCC.2013.6663388
58. Piao M, Cui RR (2013) An approach to script identification in multi-language text image. In: 6th international conference on intelligent networks and intelligent systems, Shenyang, pp 248–251. doi:10.1109/ICINIS.2013.70
59. Chanda S, Terrades OR, Pal U (2007) SVM based scheme for Thai and English script identification. In: 9th international conference on document analysis and recognition, Brazil, pp 551–555. doi:10.1109/ICDAR.2007.4378770
60. Chanda S, Pal U, Kimura F (2007) Identification of Japanese and English script from a single document page. In: 7th IEEE international conference on computer and information technology, Fukushima, pp 656–661. doi:10.1109/CIT.2007.109
61. Dhandra BV, Hangarge M (2007) Global and local features based handwritten text words and numerals script identification. In: International conference on computational intelligence and multimedia applications, Sivakasi, pp 471–475. doi:10.1109/ICCIMA.2007.125
62. Singh S, Kumar A, Shaw DK, Ghosh D (2014) Script separation in machine printed bilingual (Devnagari and Gurumukhi) documents using morphological approach. In: 20th national conference on communications, Kanpur, pp 1–5. doi:10.1109/NCC.2014.6811361
63. Lin XR, Guo CY, Chang F (2011) Classifying textual components of bilingual documents with decision-tree support vector machines. In: International conference on document analysis and recognition, Beijing, pp 498–502. doi:10.1109/ICDAR.2011.106
64. Echi AK, Saidani A, Belaid A (2014) How to separate between machine-printed/handwritten and Arabic/Latin words? Electron Lett Comput Vis Image Anal 13:1–16. doi:10.5565/rev/elcvia.572
65. Haboubi S, Maddouri SS, Amiri H (2011) Separation between Arabic and Latin scripts from bilingual text using structural features. In: 1st international conference on innovative computing technology, Brazil, pp 132–143. doi:10.1007/978-3-642-22247-4_12
66. Sharma N, Chanda S, Pal U, Blumenstein M (2013) Word-wise script identification from video frames. In: 12th international conference on document analysis and recognition, Washington DC, pp 867–871. doi:10.1109/ICDAR.2013.177
67. Ma H, Doermann D (2004) Word level script identification for scanned document images. In: Proceeding of international conference on document recognition and retrieval, San Jose, pp 178–191
68. Ferrer MA, Morales A, Rodríguez N, Pal U (2014) Multiple training—one test methodology for handwritten word-script identification. In: 14th international conference on frontiers in handwriting recognition, Greece, pp 754–759. doi:10.1109/ICFHR.2014.132
69. Singh PK, Khan A, Sarkar R, Nasipuri M (2014) A texture based approach to word-level script identification from multi-script handwritten documents. In: International conference on computational intelligence and communication networks, Udaipur, pp 228–232. doi:10.1109/CICN.2014.60
70. Angadi SA, Kodabagi MM (2013) A fuzzy approach for word level script identification of text in low resolution display board images using wavelet features. In: International conference on advances in computing, communications and informatics, Mysore, pp 1804–1811. doi:10.1109/ICACCI.2013.6637455
71. Pechwitz M, Maddouri SS, Märgner V, Ellouze N, Amiri H (2002) IFN/ENIT-database of handwritten Arabic words. In: 7th colloque international francophone sur l'Ecrit et le Document, Tunis, pp 129–136
72. Malemath VS, Kulkarni AH, Mallikarjun H (2014) Word-wise script identification in document images based on steerable Gaussian filtering technique. Int J Adv Res Comput Commun Eng 3:6844–6848
73. Rezaee H, Geravanchizadeh M, Razzazi F (2009) Automatic language identification of bilingual English and Farsi scripts. In: International conference on application of information and communication technologies, Baku, pp 1–4. doi:10.1109/ICAICT.2009.5372532
74. Rani R, Dhir R, Lehal GS (2013) Script identification of pre-segmented multi-font characters and digits. In: 12th international conference on document analysis and recognition, Washington DC, pp 1150–1154. doi:10.1109/ICDAR.2013.233
75. Pal S, Alireza A, Pal U, Blumenstein M (2012) Multi-script off-line signature identification. In: 12th international conference on hybrid intelligent systems, Pune, pp 236–240. doi:10.1109/HIS.2012.6421340
76. Obaidullah SM, Halder C, Das N, Roy K (2015) Numeral script identification from handwritten document images. In: 11th international multi-conference on information processing, Bangalore, pp 585–594. doi:10.1016/j.procs.2015.06.067
77. Hangarge M, Santosh KC, Pardeshi R (2013) Directional discrete cosine transform for handwritten script identification. In: 12th international conference on document analysis and recognition, Washington DC, pp 344–348. doi:10.1109/ICDAR.2013.76
78. Hangarge M, Santosh KC (2014) Word-level handwritten script identification from multi-script documents. In: Recent advances in information technology, advances in intelligent systems and computing, Dhanbad, pp 49–55. doi:10.1007/978-81-322-1856-2_6
79. Pardeshi R, Chaudhuri BB, Hangarge M, Santosh KC (2014) Automatic handwritten Indian scripts identification. In: 14th international conference on frontiers in handwriting recognition, Greece, pp 375–380. doi:10.1109/ICFHR.2014.69
80. Marti U, Bunke H (1999) A full English sentence database for off-line handwriting recognition. In: Proceedings of the 5th international conference on document analysis and recognition, Bangalore, pp 705–708. doi:10.1109/ICDAR.1999.791885
81. Sarkar R, Das N, Basu S, Kundu M, Nasipuri M, Basu DK (2012) CMATERdb1: a database of unconstrained handwritten Bangla and Bangla English mixed script document image. Int J Doc Anal Recogn 15:71–83. doi:10.1007/s10032-011-0148-6
82. Selamat A, Ng CC (2011) Arabic script web page language identifications using decision tree neural networks. Pattern Recogn 44:133–144. doi:10.1016/j.patcog.2010.07.009
83. Ng CC, Selamat A (2009) Improved letter weighting feature selection on Arabic script language identification. In: 1st Asian conference on intelligent information and database systems, Vietnam, pp 150–154. doi:10.1109/ACIIDS.2009.33
84. Selamat A, Lee ZS (2008) Language identifications of Arabic script web documents using independent component analysis. In: 2nd Asia international conference on modeling and simulation, Kuala Lumpur, pp 427–432. doi:10.1109/AMS.2008.46
85. Shi B, Bai X, Yao C (2016) Script identification in the wild via discriminative convolutional neural network. Pattern Recogn 52:448–458. doi:10.1016/j.patcog.2015.11.005
86. Behrad A, Khoddami M, Salehpour M (2010) A novel framework for Farsi and Latin script identification and Farsi handwritten digit recognition. J Autom Control 20:17–25. doi:10.2298/JAC1001017B
87. Rani R, Dhir R, Lehal GS (2011) Comparative analysis of Gabor and discriminating feature extraction techniques for script identification. In: International conference on information systems for Indian languages, Patiala, pp 174–179. doi:10.1007/978-3-642-19403-0_27
88. Mezghani A, Slimane F, Kanoun S, Margner V (2014) Identification of Arabic/French–handwritten/printed words using GMM-based system. In: Proceedings of CIFED, France, pp 371–374
89. Abainia K, Ouamour S, Sayoud H (2014) Robust language identification of noisy texts: proposal of hybrid approaches. In: 25th international workshop on database and expert systems applications, Munich, pp 228–232. doi:10.1109/DEXA.2014.55
90. Yadav P, Kaur S (2013) Language identification and correction in corrupted texts of regional Indian languages. In: International conference oriental held jointly with conference on Asian spoken language research and evaluation, Gurgaon, pp 1–5. doi:10.1109/ICSDA.2013.6709877
91. Hebert D, Barlas P, Chatelain C, Adam S, Paquet T (2014) Writing type and language identification in heterogeneous and complex documents. In: 14th international conference on frontiers in handwriting recognition, Greece, pp 411–416. doi:10.1109/ICFHR.2014.75
92. Ablavsky V, Stevens MR (2003) Automatic feature selection with applications to script identification of degraded documents. In: Proceedings of 7th international conference on document analysis and recognition, Edinburgh, pp 750–754. doi:10.1109/ICDAR.2003.1227762
93. Obaidullah SM, Mondal A, Roy K (2014) Structural feature based approach for script identification from printed Indian document. In: International conference on signal processing and integrated networks, Noida, pp 120–124. doi:10.1109/SPIN.2014.6776933
94. Obaidullah SM, Mondal A, Das N, Roy K (2014) Script identification from printed Indian document images and performance evaluation using different classifiers. Appl Comput Intell Soft Comput. doi:10.1155/2014/896128
95. Dhanya D, Ramakrishnan AG, Pati PB (2002) Script identification in printed bilingual documents. Sadhana 27:73–82. doi:10.1007/3-540-45869-7_2
96. Singh PK, Mondal A, Bhowmik S, Sarkar R, Nasipuri M (2014) Word-level script identification from handwritten multi-script documents. In: Proceedings of the 3rd international conference on frontiers of intelligent computing: theory and applications, Bhubaneswar, pp 551–558. doi:10.1007/978-3-319-11933-5_62
97. Shi B, Yao C, Zhang C, Guo X, Huang F, Bai X (2015) Automatic script identification in the wild. In: Proceedings of international conference on document analysis and recognition, Nancy
98. Mezghani A, Kanoun S, Khemakhem M, El AH (2012) A database for Arabic handwritten text image recognition and writer identification. In: International conference on frontiers in handwriting recognition, Bari, pp 399–402. doi:10.1109/ICFHR.2012.155
99. Grosicki E, Carré M, Brodin JM, Geoffrois E (2009) Results of the RIMES evaluation campaign for handwritten mail processing. In: International conference on document analysis and recognition, Barcelona, pp 941–945. doi:10.1109/ICDAR.2009.224
100. Slimane F, Ingold R, Kanoun S, Alimi AM, Hennebert J (2009) A new Arabic printed text image database and evaluation protocols. In: International conference on document analysis and recognition, Barcelona, pp 946–950. doi:10.1109/ICDAR.2009.155
101. Gomez L, Nicolaou A, Karatzas D (2017) Improving patch-based scene text script identification with ensembles of conjoined networks. Pattern Recogn 67:85–96. doi:10.1016/j.patcog.2017.01.032
102. Sharma N, Mandal R, Sharma R, Pal U, Blumenstein M (2015) ICDAR2015 competition on video script identification (CVSI 2015). In: IEEE 13th international conference on document analysis and recognition (ICDAR), Tunis, pp 1196–1200. doi:10.1109/ICDAR.2015.7333950
103. Arabnejad E, Moghaddam RF, Cheriet M (2017) PSI: patch-based script identification using non-negative matrix factorization. Pattern Recogn 67:328–339. doi:10.1016/j.patcog.2017.02.020
104. Saba T, Rehman A, Altameem A, Uddin M (2014) Annotated comparisons of proposed preprocessing techniques for script recognition. Neural Comput Appl 25:1337–1347. doi:10.1007/s00521-014-1618-9
105. Kacem A, Asma S (2016) A texture-based approach for word script and nature identification. Pattern Anal Appl. doi:10.1007/s10044-016-0555-x
106. Obaidullah SM, Halder C, Santosh KC, Das N, Roy K (2017) PHDIndic_11: page-level handwritten document image dataset of 11 official Indic scripts for script identification. Multimed Tools Appl. doi:10.1007/s11042-017-4373-y
107. Singh PK, Sarkar R, Das N, Basu S, Kundu M, Nasipuri M (2017) Benchmark databases of handwritten Bangla-Roman and Devanagari-Roman mixed-script document images. Multimed Tools Appl. doi:10.1007/s11042-017-4745-3
108. Brodić D, Amelio A, Milivojević ZN (2016) Language discrimination by texture analysis of the image corresponding to the text. Neural Comput Appl. doi:10.1007/s00521-016-2527-x
109. Brodić D, Amelio A, Milivojević ZN (2016) Identification of Fraktur and Latin scripts in German historical documents using image texture analysis. Appl Artif Intell Int J 30(5):379–395. doi:10.1080/08839514.2016.1185855