ISSN No: 2309-4893
International Journal of Advanced Engineering and Global Technology
Vol-2, Issue-2, February 2014

Script Identification of Telugu, English and Hindi Document Image


G. Srinivas Rao, Mohammed Imanuddin, B. Hari Kumar
Department of Electronics and Communications Engineering, Aurora's Scientific Technological and Research Academy, Hyderabad, Andhra Pradesh, India

Abstract

In a multilingual country like India, a document may contain text words in more than one language. For such a multilingual environment, a multilingual Optical Character Recognition (OCR) system is needed to read the multilingual documents, so it is necessary to identify the different languages. The objective of this project is to propose a visual-clues-based procedure to identify the Telugu, Hindi and English text portions of an Indian multilingual document. Language identification is an important topic in pattern recognition and in image processing based automatic document analysis and recognition. The objective of language identification is to translate human identifiable documents to machine identifiable codes. The world we live in is getting increasingly interconnected; electronic libraries have become more pervasive and at the same time increasingly automated, including the task of presenting a text in any language as automatically translated text in any other language. Identification of the language in a document image is of primary importance for selection of a specific OCR system when processing multilingual documents.

1. Introduction

Language identification is an important topic in pattern recognition and image processing based automatic document analysis and recognition. The objective of language identification is to translate human identifiable documents to machine identifiable codes. The world we live in is getting increasingly interconnected; electronic libraries have become more pervasive and at the same time increasingly automated, including the task of presenting a text in any language as automatically translated text in any other language. Identification of the language in a document image is of primary importance for selection of a specific OCR system when processing multilingual documents.

Language identification may seem to be an elementary and simple issue for humans in the real world, but it is difficult for a machine, primarily because different scripts (a script could be a common medium for different languages) are made up of different shaped patterns to produce different character sets. OCR is of special significance for a multilingual country like India, where the text portion of a document usually contains information in more than one language. A document containing text information in more than one language is called a multilingual document. For such multilingual documents, it is essential to identify the language of each text portion before the analysis of the contents can be made. Although a great number of OCR techniques have been developed over the years, almost all existing works on OCR make an important implicit assumption that the language of the document to be processed is known beforehand. Individual OCR tools have been developed to deal with only one specific language. In such an environment, document processing systems relying on OCR would clearly need human intervention to select the appropriate OCR package, which is certainly inefficient, undesirable and impractical. A pre-OCR language identification system would enable the correct OCR system to be selected in order to achieve the best character interpretation of the document. This area has not been very widely researched to date, despite its growing importance to the document image processing community and the progression towards the "paperless office".

1.1 Motivation:

In this project an attempt has been made to solve a more foundational problem: language identification of a text from a multilingual document, before its contents are automatically read. Language identification is one of the vision application problems. Generally, the human system identifies the language in a document using some visible characteristic features such as texture, horizontal lines and vertical lines, which are visually perceivable and appeal to visual sensation. This human visual perception capability has been the motivator for the development of the proposed system. With this context, in this project, an attempt has been made to simulate the human visual system, to identify the type of the language based on visual clues, without reading the contents of the document.


In a multi-lingual country like India (India has 18 regional languages derived from 12 different scripts; a script could be a common medium for different languages), documents like bus reservation forms, passport application forms, examination question papers, bank challans, language translation books and money-order forms may contain text words in more than one language. For such an environment, a multilingual OCR system is needed to read the multilingual documents. To make a multilingual OCR system successful, it is necessary to separate the different language regions of the document before feeding them to the individual OCR systems. In this direction, multilingual document segmentation has strong direct application potential, especially in a multilingual country like India.

In the context of Indian languages, some amount of research work has been reported. Further, there is a growing demand for automatically processing documents in every state in India, including Andhra Pradesh. Under the three-language formula adopted by most of the Indian states, a document in a state may be printed in its respective official regional language, the national language Hindi and also in English. Accordingly, a document produced in Andhra Pradesh, a state in India, may be printed in its official regional language Telugu, the national language Hindi and also in English.

1.2 Objective:

The objective of language identification is to translate human identifiable documents to machine identifiable codes. A multilingual OCR system is needed to read the multilingual documents. To make a multilingual OCR system successful, it is necessary to develop a multilingual OCR system that would work in two stages:

(i) Identification and separation of the different language portions of the document, and
(ii) Feeding of the individual language regions to the appropriate OCR system.

In this project, we focus on the first stage of the multilingual OCR system and present procedures for identification and separation of the Telugu and English text portions of a multilingual document produced in an Indian state. In the present case, it could also be called script identification as well as language identification, since the two languages Telugu and English belong to two different scripts.

1.3 Scope:

In a multi-script, multi-lingual country like India (India has 18 regional languages derived from 12 different scripts), a document page such as a bus reservation form, question paper, language translation book or money-order form may contain text lines in more than one script/language. One script could be used to write more than one language. For example, languages such as Hindi, Marathi, Rajasthani, Sanskrit and Nepali are written using the Devanagari script; the Assamese and Bangla languages are written using the Bangla script. In order to reach a larger cross section of people, it is necessary that a document be composed of text contents in different languages. However, for a document having text information in different languages, it is necessary to pre-determine the language type of the document before employing a particular OCR on it. With this context, in this project, the problem of recognizing the language type of the text content is addressed. However, it is perhaps impossible to design a single recognizer which can identify a large number of scripts/languages. As a via media, this project proposes to work on the prioritized requirements of a particular region - Andhra Pradesh, a state in India. According to the three-language policy adopted by most of the Indian states, the documents produced in any Indian state are composed of text information in their regional language, the national language - Hindi, and the language of general importance - English. Accordingly, the documents of Andhra Pradesh are generally printed in the Telugu, Hindi and English languages. Consequently, the majority of the documents produced in many of the private and Government sectors, railways, banks and post-offices of Andhra Pradesh are of tri-lingual (a document having text in three languages) type. So, when it comes to automation, assuming that there are three OCRs for the Telugu, Hindi (Devanagari) and English languages, a pre-processor is necessary by which the language type of the different text lines is identified. In this project, a script identification technique to identify the text lines of the Telugu, Hindi and English languages from a tri-lingual document is presented.

2. LITERATURE SURVEY

From the literature survey, it has been revealed that some amount of work has been carried out in language identification. Peake and Tan have proposed a method for language identification from document images using multiple channel (Gabor) filters and gray level co-occurrence matrices for seven languages: Chinese, English, Greek, Korean, Malayalam, Persian and Russian.


Tan has developed a rotation invariant texture feature extraction method for automatic script identification for six languages: Chinese, Greek, English, Russian, Persian and Malayalam. In the context of Indian languages, some amount of research work on language identification has been reported. Pal and Choudhuri have proposed an automatic technique for separating the text lines of 12 Indian scripts (English, Devanagari, Bangla, Gujarati, Kannada, Kashmiri, Malayalam, Oriya, Punjabi, Tamil, Telugu and Urdu) using ten triplets formed by grouping English and Devanagari with any one of the other scripts. Santanu Choudhuri, et al. have proposed a method for identification of Indian languages by combining a Gabor filter based technique and a direction distance histogram classifier, considering Hindi, English, Malayalam, Bengali, Telugu and Urdu. Basavaraj Patil and Subbareddy have developed a character script class identification system for machine printed bilingual documents in English and Kannada scripts using a probabilistic neural network. Pal and Choudhuri have proposed an automatic separation of Bangla, Devanagari and Roman words in multilingual multiscript Indian documents. Nagabhushan et al. have proposed a fuzzy statistical approach to Kannada vowel recognition based on invariant moments. Pal et al. have suggested a word-wise script identification model for a document containing English, Devanagari and Telugu text. Chanda and Pal have proposed an automatic technique for word-wise identification of Devanagari, English and Urdu scripts from a single document. Spitz has proposed a technique for distinguishing Han and Latin based scripts on the basis of spatial relationships of features related to the character structures. Pal et al. have developed a script identification technique for Indian languages by employing new features based on the water reservoir principle, contour tracing, jump discontinuity, and left and right profiles. Ramachandra et al. have proposed a method based on rotation-invariant texture features using a multichannel Gabor filter for identifying six Indian languages (Bengali, Kannada, Malayalam, Oriya, Telugu and Marathi). Hochberg et al. have presented a system that automatically identifies the script form using cluster-based templates. Gopal et al. have presented a scheme to identify different Indian scripts through hierarchical classification which uses features extracted from the responses of a multichannel log-Gabor filter. Our survey of previous research work in the area of document language identification shows that most of it addresses languages of other countries and a few of our country, but hardly any attempts focus on the three languages Telugu, Hindi and English.

In one of our earlier works, it is assumed that a given document contains text lines in one of the three languages Telugu, Hindi and English. In one of our previous researches, the results of detailed investigations were presented related to the study of the applicability of horizontal and vertical projections and segmentation methods to identify the language of a document, considering specifically the three languages Telugu, Hindi and English. It is reasonably natural that the documents produced at the border regions of Karnataka may also be printed in the regional languages of the neighboring states, like Tamil, Malayalam and Urdu. The system was unable to identify the text words for such documents having text words in the Tamil, Malayalam and Urdu languages, and hence these text words were misclassified into one of the three languages, whichever is nearer and more similar in its visual appearance. Keeping the drawback of the previous method in mind, we have proposed a system that would more accurately identify and separate the different language portions of Telugu, Hindi and English documents. As our intention is to identify only Telugu, Hindi and English, the system identifies the three languages in stages: in the first stage Hindi is identified, in the second stage Telugu is identified and in the third stage English is identified.

2.1 Data collection:

A standard database of documents of Indian languages is currently not available. Database construction with respect to the language identification problem seems to be complex, since factors like the font type and font size of each language need to be considered. In this research, it is assumed that the input data set contains documents having text lines of the Telugu, Devanagari and English scripts. Also, it is assumed that the language type, font and size of the text words within a text line are the same. For the experimentation of the proposed model, three sets of databases are constructed, out of which one database was used to train the proposed system and the other two databases were constructed to test the system.

The size of the document images considered was 600x600 pixels, having about six to ten text lines depending upon the font size of the text. The documents of the English language were created using the Microsoft Word software and these text lines were imported into the Microsoft Paint program. In Microsoft Paint, a portion of the text lines was saved as a black and white BitMaP (BMP) image of 600x600 pixels.


The font types Times New Roman, Arial, Bookman Old Style and Tahoma were used for the English language. Font sizes of 12 to 26 were used for the English text lines. The input images of the Telugu and Hindi languages were constructed by clipping only the text portion of documents downloaded from the Internet. The training database is constructed such that 500 text lines were considered from each of the three languages.

To test the proposed model, two different data sets were constructed, out of which one dataset was constructed manually, similar to the dataset constructed for training, and the other data set was constructed from scanned document images. Printed documents like textbooks and magazines were scanned through an optical scanner to obtain the document images. The HP ScanJet 5200c series scanner was used to obtain the digitized images. The scanning was performed in normal 100% view size at 300 dpi resolution. The manually constructed dataset is considered the good quality dataset and the data set constructed from the scanned document images is considered the poor quality data set. The test datasets were constructed such that 300 text lines from each of the three languages - Telugu, Hindi and English - were present in each of the good quality and poor quality datasets.

2.2 Pre-processing:

Any language identification method requires a conditioned image input of the document, which implies that the document should be noise free and skew free. Apart from these, some recognition techniques require that the document image be segmented, thresholded and thinned. All these methods help in obtaining appropriate features for the text language identification process. In this research, pre-processing techniques such as noise removal and skew correction are not necessary for the datasets that are manually constructed by downloading the documents from the Internet. However, for the datasets that are constructed from the scanned document images, pre-processing steps such as removal of non-text regions, skew correction, noise removal and binarization are necessary. In this research, the text portion of the document image was separated from the non-text region manually, though a page segmentation algorithm could readily be employed to perform this automatically. Skew detection and correction was achieved using the technique proposed by Shivakumar. A global thresholding approach was used to binarize the scanned gray scale images, where black pixels having the value 0 correspond to the object and white pixels having the value 1 correspond to the background. The text area is segmented from the document image by removing the upper, lower, left and right blank regions. It should be noted that the text block might contain lines with different font sizes and variable spaces between lines. It is not necessary to homogenize these parameters, as the input to the proposed model is the individual text lines. The document image is segmented into several text lines using the valleys of the horizontal projection profile computed by a row-wise sum of black pixels. The position between two consecutive horizontal projections where the histogram height is least denotes the boundary of a text line. Using these boundary lines, the document image is segmented into several text lines. The segmented text lines might have varying inter-word spacing, so it is necessary to normalize the inter-word spacing to a maximum of 5 pixels. Normalization of the inter-word spacing is achieved by projecting the pixels of each text line vertically, counting the number of white pixels from left to right, and reducing runs of more than 5 white pixels to 5. Due to the varying size of fonts, it is also necessary to normalize the input text lines to a fixed size. Through experimental observation, it was determined to fix the height of the text line at 40 rows, which facilitates extracting the features efficiently. So, an input text line of size m rows and n columns is resized to a fixed size of 40 rows and (40 x n/m) columns, keeping the aspect ratio. Then, a bounding box is fixed for the segmented and resized text line by finding the leftmost, rightmost, topmost and bottommost black pixel of the text line. A sketch of this segmentation and normalization step is given below.
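As a concrete illustration of the line segmentation and height normalization just described, here is a minimal Python sketch. It is not the authors' code: the function names and the numpy image representation (a 2-D array with 0 = black/object and 1 = white/background, following the paper's pixel convention) are assumptions.

import numpy as np

def segment_lines(img):
    """Split a binarized document image (0 = ink, 1 = background) into
    text-line sub-images using the valleys of the horizontal projection
    profile, i.e. rows whose black-pixel count drops to zero."""
    profile = (img == 0).sum(axis=1)      # row-wise count of black pixels
    lines, start = [], None
    for r, count in enumerate(profile):
        if count > 0 and start is None:
            start = r                     # a text line begins
        elif count == 0 and start is not None:
            lines.append(img[start:r])    # valley reached: close the line
            start = None
    if start is not None:                 # a line touching the last row
        lines.append(img[start:])
    return lines

def normalize_height(line, target=40):
    """Resize a text line to 40 rows while keeping the aspect ratio, as in
    Section 2.2. Nearest-neighbour sampling keeps the image binary."""
    m, n = line.shape
    new_n = max(1, round(target * n / m))
    rows = np.arange(target) * m // target    # source row per target row
    cols = np.arange(new_n) * n // new_n      # source column per target column
    return line[np.ix_(rows, cols)]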
2.3 Text line structure:

A text line can be considered as being composed of three zones: the upper zone, the middle zone and the lower zone. These zones are delimited by four virtual lines: the top-line, the upper-line, the base-line and the bottom-line. Each text line has at least a middle zone; the upper zone depends on capital letters and letters with ascenders, like h and k; the lower zone depends on letters with descenders, like g, p and y. This structure allows the definition of four kinds of text line:

1. Full line, with character parts present in all three zones;

2. Ascender line, with character parts present in the upper and middle zones;

3. Descender line, with character parts present in the lower and middle zones;

4. Short line, with character parts present in the middle zone.


3. Evaluation of Telugu, Hindi and English discriminating features

Every script defines a finite set of text patterns called alphabets. Alphabets of one script are grouped together to give meaningful text information in the form of a word, a text line or a paragraph. Thus, when the alphabets of the same script are combined together to yield meaningful text information, the text portion of the individual script exhibits a distinct visual appearance. The distinct visual appearance of every script is due to the presence of segments like horizontal lines, vertical lines, upward curves, downward curves, descenders and so on. The presence of such segments in a particular script is used as a visual clue for a human to identify the type of even an unfamiliar script. This human visual perception capability motivated the proposed model to use the distinct features exhibited by each script. So, the target of this project is to identify the script type of the texts without reading the contents of the document.

By thoroughly observing the structural outline of the characters of the three scripts - Telugu, Devanagari and English - it is observed that the distinct features are present at specific portions of the characters. So, in this project, the discriminating features are extracted from the top-profile and the bottom-profile of each text line. The top-profile (bottom-profile) of a text line represents the set of black pixels obtained by scanning each column of the text line from the top (bottom) until a first black pixel is reached. Thus, a component of width N gets N such pixels. The row at which the first black pixel lies in the top-profile (bottom-profile) is called the top-line (bottom-line). The row number having the maximum number of black pixels (black pixels having the value 0 correspond to the object and white pixels having the value 1 correspond to the background) in the top-profile (bottom-profile) is called the attribute top-max-row (bottom-max-row).

3.1 Properties of the Telugu Language:

Telugu is the official language of the South Indian state of Andhra Pradesh; the Telugu script is derived from the ancient Telugu-Kannada script. It can be seen that most of the Telugu characters have tick shaped structures at the top portion of their characters. Also, it can be observed that the majority of Telugu characters have upward curves present at their bottom portion. These distinct properties of Telugu characters are helpful in separating them from the Hindi and English languages.

3.2 Properties of the Devanagari (Hindi) Language:

It can be noted that many characters of the Devanagari script have a horizontal line at the upper part, called the headline, which is named sirorekha in Devanagari. It can be seen that when two or more basic or compound characters are combined to form a word, the character headline segments mostly join one another and generate one long headline for each text word. These long horizontal lines are present at the top portion of the characters and are used as supporting features in identifying the Devanagari script. Another strong feature that can be noticed in a Devanagari text line is that most of the pixels of the headline happen to be pixels of the bottom profile as well (the profiles are defined in Section 3 above). This results in both the top and bottom profiles of a Hindi text line lying at the top portion of the characters. However, this distinct feature is absent in both Telugu and English text lines, where the densities of the top and bottom profiles occur at different positions. Using these features, a Hindi text line can be strongly separated from the Telugu and English languages.

3.3 Properties of the English Language:

It is observed that the pixel distribution in most of the English characters is symmetric and regular. This uniform distribution of the pixels of English characters results in the density of the top profile being almost the same as the density of the bottom profile. However, such uniformity in the pixel distribution of the top and bottom profiles of an English text line is not found in the other two anticipated languages, Telugu and Hindi. Thus, this characteristic attribute is used as a supporting feature to separate an English text line. Thus, the distinct characteristic structures of each language are used as supporting visual features in the proposed model.

3.4 Feature Extraction

The distinct features used in the proposed model are extracted as explained below.

3.4.1 Feature 1: Bottom-max-row-no:

The feature bottom-max-row-no represents the row number of the bottom-profile at which the maximum number of black pixels lies (black pixels having the value 0 correspond to the object and white pixels having the value 1 correspond to the background).
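To make the profile attributes concrete, the following Python sketch computes the top-profile, the bottom-profile and the max-row attributes; Feature 1 is the bottom variant. This is a minimal reading of the definitions above, not the authors' implementation; the function names and the numpy representation (0 = ink, 1 = background) are assumptions.

import numpy as np

def top_profile(line):
    """Top-profile (Section 3): for each column, the row index of the first
    black pixel met when scanning from the top; -1 for columns with no ink."""
    ink = (line == 0)                     # 0 = object, per the paper
    has_ink = ink.any(axis=0)
    first = ink.argmax(axis=0)            # index of the first True per column
    return np.where(has_ink, first, -1)

def bottom_profile(line):
    """Bottom-profile: the first black pixel per column, scanning upward."""
    m = line.shape[0]
    flipped = top_profile(line[::-1])     # scan the vertically flipped image
    return np.where(flipped >= 0, m - 1 - flipped, -1)

def max_row(profile):
    """The attribute top-max-row / bottom-max-row: the row index holding the
    most profile pixels. Feature 1 (bottom-max-row-no) is
    max_row(bottom_profile(line))."""
    rows = profile[profile >= 0]
    return int(np.bincount(rows).argmax())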


3.4.2 Feature 2: Top-horizontal-line:

(i) Obtain the top-max-row from the top-profile.

(ii) Find the components whose number of black pixels is greater than threshold1 (threshold1 = half of the height of the bounding box) and store the number of such components in the attribute horizontal-lines.

(iii) Compute the feature top-horizontal-line using equation (1) below:

top-horizontal-line = (hlines * 100) / tc        (1)

where hlines represents the number of horizontal lines and tc represents the total number of components of the top-max-row. A sketch of this computation follows.
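The text does not spell out what a "component of the top-max-row" is; the sketch below assumes it is a maximal run of consecutive columns whose top-profile pixel lies on that row, and reuses top_profile and max_row from the sketch in Section 3. The reading is an assumption, not the authors' code.

def top_horizontal_line(line):
    """Feature 2 (Section 3.4.2): percentage of components on the
    top-max-row long enough to count as horizontal lines, equation (1)."""
    prof = top_profile(line)
    row = max_row(prof)
    threshold1 = line.shape[0] / 2        # half the bounding-box height
    runs, length = [], 0
    for p in prof:                        # walk the columns left to right
        if p == row:
            length += 1                   # extend the current run
        elif length:
            runs.append(length)           # a run (component) just ended
            length = 0
    if length:
        runs.append(length)
    tc = len(runs)                        # total components on top-max-row
    if tc == 0:
        return 0.0
    hlines = sum(1 for r in runs if r > threshold1)
    return hlines * 100 / tc              # equation (1)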
3.4.3 Feature 3: Tick-component:

Observation of the characters of the Telugu script motivated the use of tick shaped components as a feature. A component is said to have the shape of a tick-like structure if the pixels of the component are in the sequence (i, j), (i+1, j+1), (i+2, j+2), ..., (i+m1, j+n1), (i+m1-1, j+n1+1), (i+m1-2, j+n1+2), (i+m1-3, j+n1+3), ..., (i+m1-m2, j+n1+n2), where m2 = i+m1 and n2 > n1; that is, a short downward diagonal stroke followed by a longer upward diagonal stroke. The shape of the tick-like structure is shown in Fig 3.1. The component having this shape is named the feature 'tick-component'. Such tick-components extracted from the top portion of a Telugu text line are shown in the sample output of Fig 3.2.

Fig 3.1: A tick shaped component.

3.4.4 Feature 4: Bottom-component:

If more than 50% of a connected component is present below the attribute bottom-max-row, then that connected component is considered a descender. The presence of descenders, or vathaksharas, found at the bottom portion of the Telugu script can be used as a feature called bottom-component. The feature named 'bottom-component' is extracted from the bottom-portion of the input text line. The bottom-portion is computed as bottom-portion = f(x, y), where x = bottom-max-row to m and y = 1 to n, and f(x, y) represents the image of the input text line.

Through experimentation, it is estimated that the number of pixels of a descender is greater than 8 pixels, and hence the threshold value for a connected component is fixed at 8 pixels. Any connected component whose number of pixels is greater than 8 is considered as the feature bottom-component. Such bottom-components extracted from Telugu script are shown in Fig 3.2. A sketch of this extraction is given below.
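The following sketch outlines Feature 4 using a standard connected-component labelling routine (scipy.ndimage.label); the paper does not name a specific library, so that choice, the function names and the reuse of bottom_profile and max_row from the earlier sketch are assumptions.

import numpy as np
from scipy import ndimage

def bottom_components(line, min_pixels=8):
    """Feature 4 (Section 3.4.4): connected components that behave like
    descenders. A component qualifies when more than 50% of its pixels lie
    below the bottom-max-row and it has more than 8 pixels."""
    ink = (line == 0)                      # 0 = object, per the paper
    row = max_row(bottom_profile(line))    # attribute bottom-max-row
    labels, n = ndimage.label(ink)         # default 4-connectivity; an
                                           # 8-connected structure could
                                           # equally be passed in
    found = []
    for comp in range(1, n + 1):
        ys, xs = np.nonzero(labels == comp)
        below = (ys > row).sum()           # pixels below bottom-max-row
        if below > 0.5 * ys.size and ys.size > min_pixels:
            found.append(comp)             # this component is a descender
    return found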
3.4.5 Feature 5: Top-pipe-size:

The attribute top-pipe (bottom-pipe) is obtained by deleting the connected components whose number of pixels is less than threshold2. The value of threshold2 was determined through experimentation and is fixed at 10 pixels. The number of rows comprising the top-pipe is used as the feature top-pipe-size.

3.4.6 Feature 6: Bottom-pipe-size:

The feature bottom-pipe-size is computed in the same way as top-pipe-size.

3.4.7 Feature 7: Top-pipe-density:

The feature top-pipe-density is computed using equation (2):

top-pipe-density = (nbp * 100) / (m * n)        (2)

where nbp corresponds to the number of black pixels present in the top-pipe and (m, n) is the size of the image top-pipe.

3.4.8 Feature 8: Bottom-pipe-density:

The feature bottom-pipe-density is computed in the same way as top-pipe-density. A sketch of the pipe features follows.

Fig 3.2: Sample output image of a Telugu text line.
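The "pipe" is not formally defined in the text. Under one plausible reading (the horizontal band of profile rows that survives deleting small connected components), Features 5 to 8 could be sketched as follows, reusing top_profile from the earlier sketch; the bottom-pipe would be built the same way from bottom_profile. The interpretation and all names here are assumptions, not the authors' definitions.

import numpy as np
from scipy import ndimage

def top_pipe(line, threshold2=10):
    """Features 5 and 7 (Sections 3.4.5-3.4.8): keep only the top-profile
    pixels, delete connected components smaller than threshold2 pixels, and
    return the band of rows that still contains ink (the 'pipe')."""
    prof = top_profile(line)
    img = np.zeros(line.shape, dtype=bool)
    cols = np.nonzero(prof >= 0)[0]
    img[prof[cols], cols] = True           # an image of the top-profile only
    labels, n = ndimage.label(img)
    for comp in range(1, n + 1):
        if (labels == comp).sum() < threshold2:
            img[labels == comp] = False    # drop small components
    rows = np.nonzero(img.any(axis=1))[0]
    if rows.size == 0:
        return None
    return img[rows.min():rows.max() + 1]  # the surviving band of rows

def pipe_size(pipe):
    """Feature 5/6: the number of rows comprising the pipe."""
    return 0 if pipe is None else pipe.shape[0]

def pipe_density(pipe):
    """Feature 7/8, equation (2): percentage of black pixels in the pipe."""
    if pipe is None:
        return 0.0
    m, n = pipe.shape
    return pipe.sum() * 100 / (m * n)      # nbp * 100 / (m * n)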


Fig 3.3: Sample output image of an English text line.

3.5 Feature Selection

Feature selection is a process of minimizing the number of features while maximizing the discriminating property of the feature set. Feature selection aims to identify an optimal subset of relevant features from the large number of features collected in the data set, such that the overall accuracy of classification is increased.
3.6 The Learning Algorithm

Using the optimal features of each script, the proposed model is trained with a data set of 500 text lines from each of the three scripts - Telugu, Devanagari and English. The learning algorithm used in the proposed model is given below.

Algorithm Learning()

Input: Pre-processed text lines of Telugu, Devanagari and English scripts.

Output: Range of feature values.

1. Do for i = 1 to 3 script types
2. { Do for k = 1 to 500 text lines of the i-th script
3.   { Obtain the top profile and bottom profile.
4.     Compute the values of the optimal features of the i-th script. }
5.   Find the minimum, maximum and mean of the optimal features for the n text lines and store them in a knowledge base. }
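Read as pseudocode, Algorithm Learning() simply aggregates per-script feature ranges. A compact Python rendering under that reading might look like the sketch below; the feature extractors and the data layout are assumptions, not the authors' code.

import numpy as np

def learn(training_lines, feature_fns):
    """Build the knowledge base of Algorithm Learning(): for each script,
    the min, max and mean of every feature over its training text lines.

    training_lines: dict mapping script name -> list of binarized line images
    feature_fns:    list of functions, each mapping a line image -> a number
    """
    knowledge = {}
    for script, lines in training_lines.items():         # 3 script types
        values = np.array([[f(line) for f in feature_fns]
                           for line in lines])            # 500 x n_features
        knowledge[script] = {
            "min": values.min(axis=0),
            "max": values.max(axis=0),
            "mean": values.mean(axis=0),
        }
    return knowledge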
3.7 Line-wise Identification Model

Step-1: Pre-processing:

The input document is pre-processed, i.e., noise is removed, smoothing is done, skew is compensated and the image is binarized.

Step-2: Line segmentation:

To segment the document image into several text lines, we use the valleys of the horizontal projection computed by a row-wise sum of black pixels. The position between two consecutive horizontal projections where the histogram height is least denotes one boundary line. Using these boundary lines, the document image is segmented into several text lines.

Step-3: Zonalization:

Each text line is partitioned into three zones - the upper zone, middle zone and lower zone.

Step-4: Block segmentation:

From the zonalized text line, the upper line and lower line are used as two boundary lines for every text line. Then every text line is scanned vertically; wherever the scan passes from the upper line to the lower line without touching any black pixels, a boundary line is obtained. The characters enclosed between consecutive boundary lines lead to a stream of blocks.

Step-5: Feature extraction:

Feature (i): Horizontal line detection: From the input image, the horizontal lines are obtained and the percentage of the presence of these horizontal lines for each text line is computed.

Feature (ii): Vertical line detection: From the input image, the vertical lines are obtained and the percentage of the presence of these vertical lines for each text line is computed.

Feature (iii): Variable sized blocks: The size of the blocks of each text line is calculated by taking the ratio of width to height of each block. Then the percentage of equal and unequal sized blocks of each text line is calculated.

Feature (iv): Blocks with more than one component: The percentage of the number of components present in each block of every text line is computed.

Step-6: Decision making:


(i) Condition-1: If 90% of the horizontal lines on the mean line are greater than two times the X-height of the corresponding text line, if there are 80% of vertical lines in the middle zone, and also if 70% of the blocks have width greater than two times the X-height, then such a portion of the document is recognized as Hindi language.

(ii) Condition-2: If 65% of the horizontal lines on the mean line are greater than half of the X-height of the corresponding text line, and if there are 40% of unequal sized blocks in a text line, then such a portion of the document is recognized as Telugu language.

(iii) Condition-3: If there are 80% of vertical lines in the middle zone greater than half of the text line height, and if 80% of the blocks are equal in size, then such a portion of the document is recognized as English language.

A sketch of these decision rules is given after this list.
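The three conditions translate directly into threshold tests. The sketch below assumes the Step-5 measurements have already been computed per text line and are passed in as plain percentages; the structure, field names and the use of >= comparisons are illustrative assumptions, not the authors' code.

from dataclasses import dataclass

@dataclass
class LineFeatures:
    """Per-line measurements assumed to come from Step-5 (all percentages)."""
    long_horizontal: float   # % of mean-line horizontal lines > 2 * X-height
    short_horizontal: float  # % of mean-line horizontal lines > X-height / 2
    verticals: float         # % of vertical lines in the middle zone
    tall_verticals: float    # % of middle-zone vertical lines > height / 2
    wide_blocks: float       # % of blocks wider than 2 * X-height
    unequal_blocks: float    # % of unequal sized blocks
    equal_blocks: float      # % of equal sized blocks

def classify_line(f: LineFeatures) -> str:
    """Step-6 decision rules, Conditions 1-3, applied in the paper's stated
    order: Hindi first, then Telugu, then English."""
    if f.long_horizontal >= 90 and f.verticals >= 80 and f.wide_blocks >= 70:
        return "Hindi"       # Condition-1
    if f.short_horizontal >= 65 and f.unequal_blocks >= 40:
        return "Telugu"      # Condition-2
    if f.tall_verticals >= 80 and f.equal_blocks >= 80:
        return "English"     # Condition-3
    return "unknown"         # none of the conditions fired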
3.8 Word-level identification model

Stage 1: Pre-processing:

The input document is pre-processed, i.e., noise is removed, smoothing is done, skew is compensated and the image is binarized.

Stage 2: Line segmentation:

To segment the document image into several text lines, we use the valleys of the horizontal projection computed by a row-wise sum of black pixels. The position between two consecutive horizontal projections where the histogram height is least denotes one boundary line. Using these boundary lines, the document image is segmented into several text lines.

Stage 3: Word segmentation:

Every text line is segmented into words by finding the valleys of the vertical projection. If the width of a valley is greater than the threshold value (threshold value = two times the inter-character gap), then a vertical line is drawn at the middle of the columns with zero values (the inter-word gap). Using these vertical lines, the words within a text line are segmented, as in the sketch below.
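A minimal sketch of Stage 3 follows, assuming the inter-character gap is supplied from outside (the paper does not say how it is estimated): the line is cut at the middle of every all-white column run wider than twice that gap. Names and representation (0 = ink, 1 = background) are assumptions.

import numpy as np

def segment_words(line, inter_char_gap):
    """Stage 3: cut a text line into words at the middle of vertical
    projection valleys wider than twice the inter-character gap."""
    profile = (line == 0).sum(axis=0)          # column-wise black-pixel count
    threshold = 2 * inter_char_gap
    cuts, start = [], None
    for c, count in enumerate(profile):
        if count == 0 and start is None:
            start = c                          # a valley begins
        elif count > 0 and start is not None:
            if c - start > threshold:
                cuts.append((start + c) // 2)  # cut at the middle of the gap
            start = None
    pieces, prev = [], 0
    for cut in cuts:                           # slice between successive cuts
        pieces.append(line[:, prev:cut])
        prev = cut
    pieces.append(line[:, prev:])
    return pieces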
Stage 4: Word partitioning:

Each text word is partitioned into three zones - the upper zone, middle zone and lower zone - as explained in Section 2.3, to get the upper line and lower line as two boundary lines for every text word.

Stage 5: Block segmentation:

From the partitioned text word, the upper and lower lines are used as two boundary lines for every text word. Then every text word is scanned vertically; wherever the scan passes from the upper line to the lower line without touching any black pixels, a boundary is obtained, producing a stream of blocks. Thus a block is defined as a rectangular section of the text word that has one or more characters with more than one component.

Stage 6: Block size evaluation:

The size of each block of every text word is calculated by taking the ratio of width to height of each block. Then the percentage of equal and unequal sized blocks of each text word is calculated.

Stage 7: Blocks having more than one component:

The number of components present within each block (a connected component is one in which the pixels are aggregated by an 8-connected points analysis) is calculated using 8-neighbour connectivity. Then the percentage of the occurrence of blocks having more than one component is calculated.

Stage 8: Feature extraction:

For each text word, the four features (i) Horizontal lines, (ii) Vertical lines, (iii) Variable sized blocks and (iv) Blocks with more than one component are obtained.

Stage 9: Decision making:

Level-1:

If the length of the horizontal line on the mean line is greater than two times the X-height of the corresponding text word, if there are vertical lines in the middle zone, if the block has width greater than two times the X-height, and also if the word/block contains only one component, then that text word is identified as a text word in Hindi language.

Level-2:


If the length of the horizontal line on the mean line is equal to the X-height of the corresponding text word, if there are 70% of unequal sized blocks in the output image, and also if 30% of the blocks contain more than one component, then that text word is recognized as a text word in Telugu language.

Level-3:

If there are vertical lines in the middle zone greater than half of the text word height, if a text word contains 70% of blocks equal in size, and also if a text word contains 90% of blocks having only one component, then that text word is identified as a text word in English language.

4. Results

Input Image (Telugu with English): [image not reproduced]

CleanImage(): [image not reproduced]

Input Image (Telugu with Hindi): [image not reproduced]

Output Images: [images not reproduced]


5. Conclusion & Future Enhancement

In this project, a method to identify and separate the text lines of Telugu, Hindi and English scripts from a trilingual document is presented. The approach is based on the analysis of the top and bottom profiles of individual text lines and hence does not require any character or word segmentation. A document may contain text words of different languages within a single text line; the proposed method is bound to fail on such documents, which is a limitation. One possible solution is to identify the script type at word level by segmenting the text line into words, and our future work is to consider identification of the script type at word level. Further, this project could be extended to identify combinations of different fonts with different font sizes and to the identification of handwritten documents. The experimental results show that the two methods are effective and good enough to identify and separate the three language portions of a document, which further helps to feed the individual language regions to a specific OCR system. Our future work is also to develop a system that can identify other Indian languages.

6. References

[1]. M. C. Padma and P. A. Vijaya, "Identification of Telugu, Devanagari and English Scripts Using Discriminating Features", International Journal of Computer Science & Information Technology (IJCSIT), Vol. 1, No. 2, November 2009.

[2]. U. Pal, B. B. Choudhuri, "Script Line Separation from Indian Multi-Script Documents", 5th Int. Conference on Document Analysis and Recognition (IEEE Comput. Soc. Press), 406-409, (1999).

[3]. Ramachandra Manthalkar and P. K. Biswas, "An Automatic Script Identification Scheme for Indian Languages", NCC, 2002.

[4]. S. Basavaraj Patil, N. V. Subba Reddy, "Character script class identification system using probabilistic neural network for multi-script multi-lingual document processing", Proc. National Conference on Document Analysis and Recognition, Mandya, Karnataka, 1-8, (2001).

[5]. P. Nagabhushan, "Identification and separation of text words of Kannada, Hindi and English languages through discriminating features", Proc. 2nd National Conference on Document Analysis and Recognition, Mandya, Karnataka, 252-260, (2003).

[6]. Gopal Datt Joshi, Saurabh Garg, Jayanthi Sivaswamy, "Script Identification from Indian Documents", DAS 2006, LNCS 3872, 255-267, 2006.

[7]. G. S. Peake, T. N. Tan, "Script and Language Identification from Document Images", Proc. Eighth British Mach. Vision Conference, 2, 230-233, (1997).

Table: Mean values of each feature used by the KNN based method. [table not reproduced]

Authors Profile

B. Hari Kumar completed his B.Tech in Electronics and Communications Engineering from Kamareddy Engineering College in 2011. Presently he is pursuing his M.Tech in Electronics and Communications Engineering at Aurora's Scientific Tech & Research Academy, Hyderabad. His areas of interest are Image Processing and Optical Character Recognition.

G. Srinivas Rao is an Associate Professor (ECE) at Aurora's Scientific Tech & Research Academy, with 9 years of teaching experience; M.Tech (DEC), JNTU Ananthapur.

Mohammed Imanuddin is an Associate Professor (ECE) at Aurora's Scientific Tech & Research Academy, with 9 years of teaching experience; M.Tech (DEC), JNTU Ananthapur. His area of interest is Image Processing.
