443 WWW.IJAEGT.COM
ISSN No: 2309-4893
International Journal of Advanced Engineering and Global Technology
Vol-2, Issue-2, February 2014
Tan has developed a rotation-invariant texture feature extraction method for automatic script identification for six languages: Chinese, Greek, English, Russian, Persian and Malayalam. In the context of Indian languages, some amount of research work on language identification has been reported. Pal and Choudhuri have proposed an automatic technique for separating the text lines of 12 Indian scripts (English, Devanagari, Bangla, Gujarati, Kannada, Kashmiri, Malayalam, Oriya, Punjabi, Tamil, Telugu and Urdu) using ten triplets formed by grouping English and Devanagari with any one of the other scripts. Santanu Choudhuri et al. have proposed a method for identification of Indian languages by combining a Gabor-filter-based technique and a direction distance histogram classifier, considering Hindi, English, Malayalam, Bengali, Telugu and Urdu. Basavaraj Patil and Subbareddy have developed a character script class identification system for machine-printed bilingual documents in English and Kannada scripts using a probabilistic neural network. Pal and Choudhuri have proposed an automatic separation of Bangla, Devanagari and Roman words in multilingual, multiscript Indian documents. Nagabhushan et al. have proposed a fuzzy statistical approach to Kannada vowel recognition based on invariant moments. Pal et al. have suggested a word-wise script identification model for a document containing English, Devanagari and Telugu text. Chanda and Pal have proposed an automatic technique for word-wise identification of Devanagari, English and Urdu scripts from a single document. Spitz has proposed a technique for distinguishing Han-based and Latin-based scripts on the basis of spatial relationships of features related to the character structures. Pal et al. have developed a script identification technique for Indian languages employing new features based on the water reservoir principle, contour tracing, jump discontinuity, and left and right profiles. Ramachandra et al. have proposed a method based on rotation-invariant texture features using a multichannel Gabor filter for identifying six Indian languages (Bengali, Kannada, Malayalam, Oriya, Telugu and Marathi). Hochberg et al. have presented a system that automatically identifies the script form using cluster-based templates. Gopal et al. have presented a scheme to identify different Indian scripts through hierarchical classification, which uses features extracted from the responses of a multichannel log-Gabor filter. Our survey of previous research work in the area of document language identification shows that most of it addresses languages of other countries and only a little the languages of our country, with hardly any attempts focusing on the three languages Telugu, Hindi and English.

In one of our earlier works, it is assumed that a given document contains text lines in one of the three languages Telugu, Hindi and English. In another previous work, the results of detailed investigations were presented on the applicability of horizontal and vertical projections and segmentation methods to identify the language of a document, considering specifically the three languages Telugu, Hindi and English. It is quite natural that documents produced in the border regions of Karnataka may also be printed in the regional languages of the neighboring states, like Tamil, Malayalam and Urdu. That system was unable to identify the text words of such documents in the Tamil, Malayalam and Urdu languages, and hence these text words were misclassified into whichever of the three supported languages was nearest and most similar in visual appearance. Keeping this drawback of the previous method in mind, we have proposed a system that more accurately identifies and separates the Telugu, Hindi and English portions of a document, as our intention is to identify only Telugu, Hindi and English. The system identifies the three languages in three stages: in the first stage Hindi is identified, in the second stage Telugu, and in the third stage English.

2.1 Data collection:

A standard database of documents of Indian languages is currently not available. Database construction for the language identification problem is complex, since factors like the font type and font size of each language need to be considered. In this research, it is assumed that the input data set contains documents having text lines of Telugu, Devanagari and English scripts. It is also assumed that the language, font and size of the text words within a text line are the same. For the experimentation of the proposed model, three databases were constructed, of which one was used to train the proposed system and the other two were constructed to test it.

The document images considered were of size 600x600 pixels, having about six to ten text lines depending upon the font size of the text. The English-language documents were created using Microsoft Word, and the text lines were imported into Microsoft Paint, where a portion of the text lines was saved as a black-and-white bitmap (BMP) image of 600x600 pixels. The font types Times New Roman, Arial, Bookman Old
Style and Tahoma were used for the English language. Font sizes of 12 to 26 were used for the English text lines. The input images of the Telugu and Hindi languages were constructed by clipping only the text portion of documents downloaded from the Internet. The training database was constructed such that 500 text lines were considered from each of the three languages.

To test the proposed model, two different data sets were constructed, of which one was constructed manually, similar to the dataset constructed for training, and the other was constructed from scanned document images. Printed documents like textbooks and magazines were scanned through an optical scanner to obtain the document images. An HP ScanJet 5200c series scanner was used to obtain the digitized images. The scanning was performed at normal 100% view size at 300 dpi resolution. The manually constructed dataset is considered the good-quality dataset, and the dataset constructed from the scanned document images is considered the poor-quality dataset. The test datasets were constructed such that 300 text lines from each of the three languages - Telugu, Hindi and English - were present in each of the good-quality and poor-quality datasets.

2.2 Pre-processing:

Any language identification method requires a conditioned image input of the document, which implies that the document should be noise-free and skew-free. Apart from these, some recognition techniques require that the document image be segmented, thresholded and thinned. All these methods help in obtaining appropriate features for the text language identification process. In this research, pre-processing techniques such as noise removal and skew correction are not necessary for the datasets that were manually constructed by downloading documents from the Internet. However, for the datasets constructed from scanned document images, pre-processing steps such as removal of non-text regions, skew correction, noise removal and binarization are necessary. In this research, the text portion of the document image was separated from the non-text region manually, though a page segmentation algorithm could readily be employed to perform this automatically. Skew detection and correction was achieved using the technique proposed by Shivakumar. A global thresholding approach was used to binarize the scanned gray-scale images, where black pixels having the value 0 correspond to objects and white pixels having the value 1 correspond to the background.

The text area is segmented from the document image by removing the upper, lower, left and right blank regions. It should be noted that the text block might contain lines with different font sizes and variable spaces between lines. It is not necessary to homogenize these parameters, as the input to the proposed model is the individual text lines. The document image is segmented into several text lines using the valleys of the horizontal projection profile computed by a row-wise sum of black pixels. The position between two consecutive horizontal projections where the histogram height is least denotes the boundary of a text line. Using these boundary lines, the document image is segmented into several text lines. The segmented text lines might have varying inter-word spacing, so it is necessary to normalize the inter-word spacing to a maximum of 5 pixels. Normalization of the inter-word spacing is achieved by projecting the pixels of each text line vertically, counting the number of white pixels from left to right, and reducing any run of more than 5 white pixels to 5. Due to the varying size of fonts, it is necessary to normalize the input text lines to a fixed size. Through experimental observation, it was determined to fix the height of the text line at 40 rows, which facilitates extracting the features efficiently. So, an input text line of size m rows and n columns is resized to a fixed size of 40 rows and (40 x n/m) columns, keeping the aspect ratio. Then, a bounding box is fixed for the segmented and resized text line by finding the leftmost, rightmost, topmost and bottommost black pixels of the text line.

2.3 Text line structure:

A text line can be considered as being composed of three zones: the upper zone, the middle zone and the lower zone. These zones are delimited by four virtual lines: the top-line, the upper-line, the base-line and the bottom-line. Each text line has at least a middle zone; the upper zone depends on capital letters and letters with ascenders, like h and k; the lower zone depends on letters with descenders, like g, p and y. This structure allows the definition of four kinds of text line:

1. Full line, with character parts present in all three zones;

2. Ascender line, with character parts present in the upper and middle zones;

3. Descender line, with character parts present in the lower and middle zones;

4. Short line, with character parts present in the middle zone.
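As a concrete illustration, the projection-based steps of Section 2.2 can be sketched as follows. This is a minimal sketch, not the paper's implementation: it assumes a binary image stored as a list of rows with 0 for black (object) pixels and 1 for white (background), following the thresholding convention above, and the helper names are our own.

```python
def horizontal_projection(img):
    """Row-wise count of black pixels (value 0), as in Section 2.2."""
    return [row.count(0) for row in img]

def segment_lines(img):
    """Split a page into text lines at the valleys (all-white rows)
    of the horizontal projection profile."""
    profile = horizontal_projection(img)
    lines, start = [], None
    for r, count in enumerate(profile):
        if count > 0 and start is None:
            start = r                      # a text line begins here
        elif count == 0 and start is not None:
            lines.append(img[start:r])     # valley: the text line ends
            start = None
    if start is not None:
        lines.append(img[start:])
    return lines

def normalize_word_gaps(line, max_gap=5):
    """Cap runs of all-white columns (inter-word gaps) at 5 pixels."""
    out, gap = [], 0
    for col in zip(*line):
        if all(px == 1 for px in col):     # blank column
            gap += 1
            if gap <= max_gap:
                out.append(col)
        else:
            gap = 0
            out.append(col)
    return [list(row) for row in zip(*out)]

def resize_line(line, target_h=40):
    """Nearest-neighbour resize of an m x n text line to 40 rows and
    (40 * n / m) columns, keeping the aspect ratio."""
    m, n = len(line), len(line[0])
    w = max(1, round(target_h * n / m))
    return [[line[int(r * m / target_h)][int(c * n / w)] for c in range(w)]
            for r in range(target_h)]
```

A real system would use an image library for the resampling step; the nearest-neighbour loop here only illustrates the 40-row normalization described above.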
3.4.2 Feature 2: Top-horizontal-line:

(i) Obtain the top-max-row from the top-profile.

(ii) Find the components whose number of black pixels is greater than threshold1 (threshold1 = half the height of the bounding box) and store the number of such components in the attribute horizontal-lines.

(iii) Compute the feature top-horizontal-line using equation (1) below:

top-horizontal-line = (hlines * 100) / tc (1)

where hlines represents the number of horizontal lines and tc represents the total number of components of the top-max-row.

3.4.3 Feature 3: Tick-component:

The observation of the characters of Telugu script motivated the use of tick-shaped components as a feature. A component is said to have the shape of a tick-like structure if the pixel positions of the component are in the sequence (i, j), (i+1, j+1), (i+2, j+2), ..., (i+m1, j+n1), (i+m1-1, j+n1+1), (i+m1-2, j+n1+2), (i+m1-3, j+n1+3), ..., (i+m1-m2, j+n1+n2), where m2 = i+m1 and n2 > n1. The shape of the tick-like structure is shown in Figure 2. A component having the shape of the tick-like structure is named the feature 'tick-component'. Such tick-components extracted from the top portion of Telugu script are shown in Figure 3.

3.4.4 Feature 4: Bottom-component:

where f(x, y) represents the image of the input text line.

Through experimentation, it is estimated that the number of pixels of a descender is greater than 8 pixels, and hence the threshold value for a connected component is fixed at 8 pixels. Any connected component whose number of pixels is greater than 8 is considered as the feature bottom-component. Such bottom-components extracted from Telugu script are shown in Figure 2.

3.4.5 Feature 5: Top-pipe-size:

The attribute top-pipe (bottom-pipe) is obtained by deleting the connected components whose number of pixels is less than threshold2. The value of threshold2 was determined through experimentation and is fixed at 10 pixels. The number of rows comprising the top-pipe is used as the feature top-pipe-size.

3.4.6 Feature 6: Bottom-pipe-size:

The feature bottom-pipe-size is computed in the same way as top-pipe-size.

3.4.7 Feature 7: Top-pipe-density:

The feature top-pipe-density is computed using equation (2):

top-pipe-density = (nbp * 100) / (m * n) (2)

where nbp corresponds to the number of black pixels present in the top-pipe and (m, n) is the size of the top-pipe image.
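Equations (1) and (2) can be sketched in code as follows, under the assumption that a row is a list of pixels with 0 for black, and that a "component" of a single row is a maximal run of consecutive black pixels; the function names are illustrative, not taken from the paper.

```python
def row_components(row):
    """Lengths of the maximal runs of black (0) pixels in one row."""
    runs, length = [], 0
    for px in row:
        if px == 0:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)
    return runs

def top_horizontal_line(top_max_row, box_height):
    """Equation (1): percentage of components on the top-max-row whose
    black-pixel count exceeds threshold1 = half the bounding-box height."""
    comps = row_components(top_max_row)
    if not comps:
        return 0.0
    hlines = sum(1 for c in comps if c > box_height / 2)
    return hlines * 100 / len(comps)

def top_pipe_density(top_pipe):
    """Equation (2): percentage of black pixels in the m x n top-pipe image."""
    m, n = len(top_pipe), len(top_pipe[0])
    nbp = sum(row.count(0) for row in top_pipe)
    return nbp * 100 / (m * n)
```

The bottom-pipe-density would follow the same formula applied to the bottom-pipe image.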
2. { Do for k = 1 to 500 text lines of the i-th script

3. { Obtain the top profile and bottom profile.

4. Compute the values of the optimal features of the i-th script. }

5. Find the minimum, maximum and mean of the optimal features for the n text lines and store them in a knowledge base. }

Feature (iii): Variable-sized blocks: The size of the blocks of each text line is calculated by taking the ratio of width to height of each block. Then the percentage of equal- and unequal-sized blocks of each text line is calculated.

Feature (iv): Blocks with more than one component: The percentage of the number of components present in each block of every text line is computed.
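Steps 2-5 above amount to building per-script statistics over the training lines. A minimal sketch, assuming the feature vectors have already been extracted for each training text line (the dictionary layout is our own, not the paper's):

```python
def build_knowledge_base(training_data):
    """training_data: {script_name: [feature_vector, ...]} with one
    feature vector per training text line. For every script, store the
    minimum, maximum and mean of each feature (steps 2-5)."""
    kb = {}
    for script, vectors in training_data.items():
        stats = []
        for feature_values in zip(*vectors):   # one tuple per feature
            stats.append({
                "min": min(feature_values),
                "max": max(feature_values),
                "mean": sum(feature_values) / len(feature_values),
            })
        kb[script] = stats
    return kb
```

At test time, a candidate line's features would be compared against these stored ranges to score each script.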
Step-6: Decision making:
(i) Condition-1: If 90% of the horizontal lines on the mean line are greater than two times the X-height of the corresponding text line, if there are 80% of vertical lines in the middle zone, and also if 70% of the blocks have width greater than two times the X-height, then such a portion of the document is recognized as Hindi language.

(ii) Condition-2: If 65% of the horizontal lines on the mean line are greater than half of the X-height of the corresponding text line and if there are 40% of unequal-sized blocks in a text line, then such a portion of the document is recognized as Telugu language.

(iii) Condition-3: If there are 80% of vertical lines in the middle zone greater than half of the text line height and if 80% of the blocks are equal in size, then such a portion of the document is recognized as English language.

3.8 Word-level identification model

Stage 1: Pre-processing:

The input document is pre-processed, i.e., noise is removed, smoothing is done, and the image is skew-compensated and binarized.

Stage 2: Line segmentation:

To segment the document image into several text lines, we use the valleys of the horizontal projection computed by a row-wise sum of black pixels. The position between two consecutive horizontal projections where the histogram height is least denotes one boundary line. Using these boundary lines, the document image is segmented into several text lines.

Stage 3: Word segmentation:

Every text line is segmented into words by finding the valleys of the vertical projection. If the width of a valley is greater than the threshold value (threshold value = two times the inter-character gap), then a vertical line is drawn at the middle of the columns with zero values (the inter-word gap). Using these vertical lines, the words within a text line are segmented.

Stage 4: Word partitioning:

Each text word is partitioned into three zones - upper zone, middle zone and lower zone, as explained in Section 2 - to get the upper line and lower line as two boundary lines for every text word.

Stage 5: Block segmentation:

From the partitioned text word, the upper and lower lines are used as two boundary lines for every text word. Then, every text word is scanned vertically from the upper line to the lower line without touching any black pixels to get a stream of blocks. Thus a block is defined as a rectangular section of the text word that has one or more characters with more than one component.

Stage 6: Block size evaluation:

The size of each block of every text word is calculated by taking the ratio of width to height of each block. Then the percentage of equal- and unequal-sized blocks of each text word is calculated.

Stage 7: Blocks having more than one component:

The number of components (a connected component is one in which the pixels are aggregated by an 8-connected points analysis) present within each block is calculated using 8-neighbour connectivity. Then the percentage of the occurrence of blocks having more than one component is calculated.

Stage 8: Feature extraction:

For each text word, the four features (i) horizontal lines, (ii) vertical lines, (iii) variable-sized blocks and (iv) blocks with more than one component are obtained.

Stage 9: Decision making:

Level-1:

If the length of the horizontal line on the mean line is greater than two times the X-height of the corresponding text word, if there are vertical lines in the middle zone, if the block has width greater than two times the X-height, and also if the word/block contains only one component, then that text word is identified as a text word in Hindi language.

Level-2:

If the length of the horizontal line on the mean line is equal to the x-height of the corresponding text word, if there are 70% of unequal-sized blocks in the output image, and also if
the blocks contain more than one component, then that text word is recognized as a text word in Telugu language.

Level-3:

The remaining text words are recognized as text words in English language.
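The three-level cascade above can be sketched as a rule chain. This is an illustrative sketch, assuming the word-level measurements have been computed beforehand; the field names are invented for readability, and Level-2's "equal to the x-height" test is relaxed to "at most the x-height" to avoid a brittle exact-equality comparison:

```python
def classify_word(w):
    """w: dict of word-level measurements (lengths in pixels,
    percentages in 0..100)."""
    # Level-1: long headline on the mean line, middle-zone verticals and
    # wide single-component blocks -> Hindi.
    if (w["hline_len"] > 2 * w["x_height"]
            and w["has_middle_zone_vlines"]
            and w["block_width"] > 2 * w["x_height"]
            and w["components_per_block"] == 1):
        return "Hindi"
    # Level-2: headline about one x-height long, mostly unequal-sized
    # blocks, and multi-component blocks -> Telugu.
    if (w["hline_len"] <= w["x_height"]
            and w["pct_unequal_blocks"] >= 70
            and w["components_per_block"] > 1):
        return "Telugu"
    # Level-3: the remaining words -> English.
    return "English"
```

The cascade order matters: Hindi's headline (shirorekha) test is the most distinctive, so it is applied first, with English as the default.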
4. Results

Input Image (Telugu with English): [figure]
CleanImage(): [figure]
Input Image (Telugu with Hindi): [figure]
Output Images: [figure]
5. Conclusion & Future Enhancement

In this project, a method to identify and separate text lines of Telugu, Hindi and English scripts from a trilingual document is presented. The approach is based on the analysis of the top and bottom profiles of individual text lines and hence does not require any character or word segmentation. A document may contain text words of different languages within a single text line; the proposed method is bound to fail on such documents, which is a limitation. One possible solution is to identify the script type at word level by segmenting the text line into words, and our future work is to consider identification of the script type at word level. Further, this project may be extended to identify combinations of different fonts with different font sizes, and to the identification of handwritten documents. The experimental results show that the two methods are effective and accurate enough to identify and separate the three language portions of a document, which further helps to feed the individual language regions to language-specific OCR systems. Our future work is to develop a system that can identify other Indian languages.

6. References

[1]. M C Padma and P A Vijaya, "Identification of Telugu, Devanagari and English Scripts Using Discriminating Features", International Journal of Computer Science & Information Technology (IJCSIT), Vol 1, No 2, November 2009.

[6]. Gopal Datt Joshi, Saurabh Garg and Jayanthi Sivaswamy, "Script Identification from Indian Documents", DAS 2006, LNCS 3872, 255-267, 2006.

[7]. G.S. Peake and T.N. Tan, "Script and Language Identification from Document Images", Proc. Eighth British Machine Vision Conference, 2, 230-233, 1997.

Authors Profile

B. Harikumar completed his B.Tech in Electronics and Communications Engineering from Kamareddy Engineering College in 2011. Presently he is pursuing his M.Tech in Electronics and Communications Engineering at Aurora's Scientific Tech & Research Academy, Hyderabad. His areas of interest are Image Processing and Optical Character Recognition.

G. Srinivas Rao, Associate Professor (ECE), Aurora's Scientific Tech & Research Academy; M.Tech (DEC), JNTU Ananthapur; 9 years of teaching experience.

Mohammed Imanuddin, Associate Professor (ECE), Aurora's Scientific Tech & Research Academy; M.Tech (DEC), JNTU Ananthapur; 9 years of teaching experience. His area of interest is Image Processing.