Multilingual Document Image Analysis

PART IV
Multilingual Document Image Analysis
Dr. Mallikarjun Hangarge
– A script is a set of symbols and rules used to express or convey information in graphic form.
4/8/2018 2
– A script is independent of language
– Different languages may use the same script
– For example, Sanskrit, Marathi, and Hindi use the Devanagari script
Scripts      Languages                         Regions
Devanagari   Hindi, Sanskrit, Marathi, Nepali  North India
Gujarati     Gujarati                          North India
Gurumukhi    Punjabi                           North India
Bangla       Bengali, Assamese                 North India
Oriya        Oriya                             North India
Telugu       Telugu                            South India
Kannada      Kannada                           South India
Tamil        Tamil                             South India
Malayalam    Malayalam                         South India
Urdu         Urdu                              North India
Roman        English                           India
Script Identification
[Figure: an input image with its text labeled by script (Devanagari, Roman)]
Indian Script Character Shape Properties
[Figure: character shape samples for Roman, Devanagari, Telugu, Malayalam, Kannada, and Tamil]
• The primary aim of the proposed system is to identify the script of a word.
INPUT DOCUMENT → Pre-Processing (Binarization, Skew Detection, Segmentation) → Feature Extraction → Classification → OUTPUT
Input Image → Binarization: Otsu’s Method → Line Removal: CC (connected-component) Analysis → Output
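The binarization step can be illustrated with a minimal sketch of Otsu's method, which picks the gray level that maximizes the between-class variance of the two resulting pixel groups. The function names, the tiny sample image, and the convention that dark pixels are ink are assumptions made for illustration; they are not taken from the slides.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # pixel count of the dark class so far
    sum0 = 0    # sum of gray levels in the dark class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break               # no bright pixels left
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(image, threshold):
    """Map a 2D list of gray levels to 0/1 (1 = ink, assuming dark text)."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Made-up 3x4 grayscale patch: four dark "ink" pixels on a bright page.
image = [
    [200, 210, 30, 205],
    [198, 25, 35, 207],
    [202, 28, 199, 204],
]
t = otsu_threshold([p for row in image for p in row])
binary = binarize(image, t)
```

Any threshold between the darkest page pixel and the brightest ink pixel gives the same split here; the sketch returns the first such level.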

• Conventional DCT is not efficient at characterizing images in which directional edges are dominant.
• Directional DCT is efficient at capturing the minute edge information of a shape.
A =
1 1 0 0 1
1 0 1 0 1
1 0 0 1 1
1 1 1 1 1
0 1 1 1 1

2D DCT of A =
 3.6000 -0.2690  0.6865 -0.1663  0.2622
-0.6015  0.5854  0.9511 -0.1382  0.4253
 0.0540  0.1176 -0.4382 -0.8784 -0.2854
 0.3717  0.3618  0.2629 -0.0854 -0.5878
-0.3702 -0.2800  0.3854  0.1902 -0.6618
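The 2D DCT of the example matrix A can be checked with a short sketch. It assumes the orthonormal DCT-II (the scaling under which the DC coefficient of A comes out as 3.6000, matching the slide) and uses only the standard library; the helper names are illustrative.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct2(block):
    """2D DCT of a square block, computed as C * block * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


# The 5x5 binary example from the slide.
A = [
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
]
coeffs = dct2(A)
for row in coeffs:
    print(" ".join(f"{v:8.4f}" for v in row))
```

The top-left entry is the sum of A divided by 5, that is 18 / 5 = 3.6.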
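D1 (columns are the first three diagonals above the main diagonal of the 2D DCT matrix, padded with zeros) =
-0.2690  0.6865 -0.1663
 0.9511 -0.1382  0.4253
-0.8784 -0.2854  0
-0.5878  0       0

STD (per column, padding excluded) = 0.8042  0.5238  0.4183
Mean (of the three STD values) = 0.5821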
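The D1 numbers above can be reproduced with a small sketch: take the first three diagonals above the main diagonal of the 2D DCT coefficient matrix, compute the sample standard deviation of each, and average the results. The coefficients are copied (rounded) from the slide, with the entry at row 0, column 1 taken as -0.2690 as in the coefficient matrix. The function name is illustrative, and the slides do not say which other directions the full method uses.

```python
import statistics

# 2D DCT coefficients of the 5x5 example, copied from the slide (rounded).
coeffs = [
    [3.6000, -0.2690, 0.6865, -0.1663, 0.2622],
    [-0.6015, 0.5854, 0.9511, -0.1382, 0.4253],
    [0.0540, 0.1176, -0.4382, -0.8784, -0.2854],
    [0.3717, 0.3618, 0.2629, -0.0854, -0.5878],
    [-0.3702, -0.2800, 0.3854, 0.1902, -0.6618],
]


def superdiagonal(m, k):
    """Entries m[i][i + k] of a square matrix, for an offset k >= 1."""
    return [m[i][i + k] for i in range(len(m) - k)]


# Diagonals with offsets 1, 2, 3 hold 4, 3, and 2 coefficients respectively.
diagonals = [superdiagonal(coeffs, k) for k in (1, 2, 3)]

# Sample standard deviation (n - 1 in the denominator) of each diagonal.
stds = [statistics.stdev(d) for d in diagonals]

# One scalar feature for this direction: the mean of the three deviations.
feature = statistics.mean(stds)

print([round(s, 4) for s in stds], round(feature, 4))
```

The zeros shown in D1 are only padding; including them in the standard deviation would not reproduce the slide's values.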
1D DCT
4/8/2018 ICECIT_2012@SRIT,ANATAPUR,A.P 12
Classification
• LDA: It preserves class-discriminating information to the greatest possible extent while reducing the dimensionality of the feature space. It also maximizes the separability between classes by maximizing the ratio of between-class variance to within-class variance.
• KNN: To put the performance of LDA in context, another traditional classifier, K-NN, is used. K-NN stores the training data X. It then finds the minimum distance D between the training samples X and the test pattern Y using [distance formula not recovered]
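Both classifier ideas can be sketched in a few lines. The first function evaluates, for a single scalar feature and two classes, the ratio of between-class to within-class scatter that LDA seeks to maximize. The second is a plain k-NN vote. Euclidean distance is an assumption here, since the distance formula is not preserved in the slides, and the feature vectors are made-up values, not results from the study.

```python
import math
from collections import Counter


def fisher_ratio(values_a, values_b):
    """Between-class over within-class scatter for one scalar feature."""
    mean_a = sum(values_a) / len(values_a)
    mean_b = sum(values_b) / len(values_b)
    within = (sum((v - mean_a) ** 2 for v in values_a)
              + sum((v - mean_b) ** 2 for v in values_b))
    between = (mean_a - mean_b) ** 2
    return between / within


def knn_predict(train_x, train_y, query, k=1):
    """Label `query` by majority vote among its k nearest training samples.

    Distance is Euclidean (an assumption; the slide's formula is missing).
    """
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2D feature vectors for two scripts.
train_x = [(0.58, 0.80), (0.61, 0.77), (0.20, 0.31), (0.25, 0.28)]
train_y = ["Roman", "Roman", "Kannada", "Kannada"]

predicted = knn_predict(train_x, train_y, (0.55, 0.75), k=3)
separability = fisher_ratio([1.0, 3.0], [5.0, 7.0])
```

A larger ratio means the class means are far apart relative to the spread inside each class, which is the sense in which LDA "maximizes separability".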
Experiments
• There is no publicly available dataset of Indic scripts at present. Therefore, a dataset of 9000 handwritten text words in six scripts, namely Roman (R), Devanagari (D), Kannada (K), Telugu (TE), Tamil (TA), and Malayalam, was created. Each script was written by a different set of 20 writers, and each writer wrote 75 words (6 scripts × 20 writers × 75 words = 9000 words).
• The writers were asked to write the provided text on A4-size paper. The pages were digitized with a scanner at a resolution of 300 dpi.
Evaluation Protocol
• To evaluate the performance of the method, K-fold cross-validation (CV) is used, unlike traditional dichotomous classification. In K-fold CV, the original sample for every dataset is randomly partitioned into K sub-samples. Of the K sub-samples, a single sub-sample is used for validation, and the remaining K − 1 sub-samples are used for training.
• This process is repeated K times, once per fold, so that each of the K sub-samples is used exactly once for validation. Finally, a single value is obtained by averaging the K results. In our tests, we use K = 10.
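The protocol above can be sketched as follows. The fold-splitting scheme, the function names, and the nearest-class-mean scorer are illustrative assumptions standing in for the actual classifiers and features; only the structure (random partition into K folds, each held out once, scores averaged) follows the slide.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition range(n_samples) into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, k, train_and_score, seed=0):
    """Average the score over k folds; each fold is held out exactly once."""
    scores = []
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        scores.append(train_and_score(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in held_out], [labels[i] for i in held_out]))
    return sum(scores) / len(scores)


def nearest_mean_score(train_x, train_y, test_x, test_y):
    """Hypothetical stand-in classifier: pick the class with the closer mean."""
    means = {}
    for label in set(train_y):
        vals = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(vals) / len(vals)
    hits = sum(1 for x, y in zip(test_x, test_y)
               if min(means, key=lambda c: abs(means[c] - x)) == y)
    return hits / len(test_x)


# Made-up, well-separated 1D data for two classes.
samples = list(range(10)) + list(range(100, 110))
labels = ["R"] * 10 + ["K"] * 10

folds = k_fold_indices(len(samples), 10)
accuracy = cross_validate(samples, labels, 10, nearest_mean_score)
```

With K = 10 and 20 samples, each fold holds out two samples and trains on the other eighteen.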
BI-SCRIPT IDENTIFICATION RESULTS IN % WITH LDA (LOWER
TRIANGLE RESULTS ARE FROM DDI AND UPPER TRIANGLE ARE
FROM D-DCT).
BI-SCRIPT IDENTIFICATION RESULTS IN % WITH KNN (LOWER
TRIANGLE RESULTS ARE FROM DDI AND UPPER TRIANGLE ARE
FROM D-DCT).
TRI-SCRIPT IDENTIFICATION (IN %) WITH LDA.
TRI-SCRIPT IDENTIFICATION (IN %) WITH KNN.
MULTI-SCRIPT IDENTIFICATION (IN %) WITH LDA.
MULTI-SCRIPT IDENTIFICATION (IN %) WITH KNN.
Horizontal features of Indian Roman
and Kannada script
Horizontal features of IAM Roman and Kannada
Script
C-DCT Versus D-DCT
Observations
• A native writer of a specific script carries that script's writing style over when writing non-native scripts. This is experimentally validated with Indian scripts and the Roman script.
• The performance of directional DCT is remarkable compared to traditional DCT.
• The performance of LDA+D-DCT is notable.