You are on page 1of 5

Optical Character Recognition for Devanagari Script

Prof Amol Bhilare, Varad Kulkarni, Prem Chavhan, Adesh Ramgude, Mayur Jadhav

Department of Computer Engineering, Vishwakarma Institute of Technology Pune

Abstract- trigger some acceptance mechanism for the corresponding


character. This was the first documented vision of this type of
Optical Character Recognition (OCR) is the process of extracting technology. The world has come a long way since this
text from an image. The main purpose of an OCR is to make prototype There are many O.C.R. Engines are available
editable documents from existing paper documents or image files
currently in the market such as Google’s cloud vision API
A significant number of algorithms is required to develop an
Microsoft Azure’s Leverage OCR etc.
OCR and basically, it works in two phases such as character and
word detection. In case of a more sophisticated approach, an
OCR also works on sentence detection to preserve a document's III. METHODS
structure.
The idea of OCR technology has been around for a long time
I. INTRODUCTION and even predates electronic computers. The original idea of
OCR is given by Paul.W.Handel in 1931, after that many
We are moving forward to a more digitized world. Computer methods, are introduced for OCR.
and PDA screens are replacing traditional books and
newspapers. Also, the large number of paper archives which 1. Template-Matching Method:-
requires maintenance as paper decays over time lead to the
idea of digitizing them instead of simply scanning them. This In 1956, Kelner and Glauberman used magnetic shift register
requires recognition software that is capable of an ideal to project two-dimensional information. The reason for this is
version of reading as well as humans. There are many O.C.R. to reduce the complexity and make it easier to interpret the
software available for detection of English or other languages information. A printed input character on paper is scanned by
But there is such software also needed which can identify and
a photo detector through a slit. The reflected light on the input
detect the languages supported by Devanagari script. We are
using Python-tesseract which is an optical character paper allows the photo detector to segment the character by
recognition (OCR) tool for python. Python-tesseract is a calculating the proportion of the black portion within the slit.
wrapper for Google’s Tesseract-OCR Engine. It is also useful This proportion value is sent to a register which converts the
as a stand-alone invocation script to tesseract, as it can read all analog values to digital values.
image types supported by the Python Imaging Library. The
Tesseract engine was originally developed as proprietary
software at Hewlett Packard labs in Bristol England and
Greeley Colorado between 1985 and 1994. Tesseract
development has been sponsored by Google since 2006.

II. BACKGROUND OR LITERATURE REVIEW

The idea of OCR technology has been around for a long time
and even predates electronic computers. The original OCR
design proposed by Paul W. Handel in 1931. He applied for a
Illustration of 2-D reduction to 1-D by a slit
patent for a device “in which successive comparisons are made
between a character and a character a photo-electric apparatus
(a). an input numeral “4” and a slit scanned from left to right.
would be used to respond to a coincidence of a character and
an image. This means you would shine a light through a filter
(b) Black area projected onto-axis, the scanning direction of
and, if the light matches up with the correct character of the
the slit.
filter, enough light will come back through the filter and
2. Peephole Method:- For local binarization we choose:
1)Niblack Method
This is the simplest logical template matching method. Pixels 2)Adaptive Method
from different zones of the binarized character are matched to 3)Sauvola Method
template characters. An example would be in the letter A, 4)Bernsen Method
where a pixel would be selected from the white hole in the In Global Binarization methods, The Fixed Thresholding
binarization method uses a fixed threshold value to assign 0’s
center, the black section of the stem, and then some others
and 1’s for all pixel positions in a given image.It is given as
outside of the letter. Each template character would have its
follows:
own mapping of these zones that could be matched with the
character that needs to be recognized. The peephole method
was first executed with a program called Electronic Reading
Automation in 1957. The Otsu’s thresholding method is used for automatic
binarization level decision, based on the shape of the
histogram. It is given by

The kilter method has used a mixture of Gaussian distribution


to find the threshold value. In kilter Method, the t is the
threshold that is used to segment the image into two parts
background and foreground, both of the parts modeled by
Gaussian distribution.

Illustration of the peephole method

3. Structured Analysis Method:-


In Local Binarization method, Adaptive binarization method
is used for local binarization. In this a window of NxN blocks
It is very difficult to create a template for handwritten slide over the entire image and threshold value is computed for
characters. The variations would be too large to have an each local area under the window for binarization.
accurate or functional template. This is where the structure In the Niblack method, the threshold value for the local area
analysis method came into play. This method analyzes the under the window is calculated pixel-wise.
character as a structure that can be broken down into parts. The
features of these parts and the relationship between them are
then observed to determine the correct character. The issue
with this method is how to choose these features and
relationships to properly identify all of the different possible
characters.
If the peephole method is extended to the structured analysis
method, peepholes can be viewed on a larger scale. Instead of Sauvola’s method takes a grayscale image as input and
single pixels, we can now look at a slit or ‘stroke’ of pixels and calculates the threshold value for each pixel.
determine their relationship with other slits. This technique
was first proposed in 1954 with William S. Rohland’s
“Character Sensing System” patent using a single vertical
scan. The features of the slits are the number of black regions
present in each slit. This is called the cross counting technique. Bernsenis local binarization method which computes the
threshold value from the pixel of the image but it works on the
4. Methods For Binarization Process (Grey Scaling Of certain equation for calculation of the threshold value
Images):-
There are many methods available for Image Binarization or
Gray scaling, for global binarization we choose:
1) Fixed thresholding method
2) Otsu Method
3) Kittler Method
IV FACTORS AFFECTING PERFORMANCE OF OCR Architecture of Tesseract

OCR results are mainly attributed to the OCR recognizer VI.COMPOSITION OF CHARACTERS AND WORDS FOR
software, but there are other factors that can have a WRITTEN WORDS
considerable impact on the results. The simplest of these
factors can be the scanning technique and parameters. A horizontal line is drawn on top of all Characters of a
The table below summarizes these factors which affect the word that is referred to as the header line or shi- rorekha.
performance of OCR. It is convenient to visualize a Devanagari word in terms
of three strips: a core strip, a top strip, and a bottom strip.
The core and top strips are separated by the header line.
S.No Factors affecting the performance of OCR The following Figure shows the image of a word that
contains five characters, two lower modifiers, and a top
modifier. The three strips and the header line have been
1 Quality of original source marked.

2 Scanning resolution and file format

 The bit depth of the image


3  Image optimization and a
binarization process
 Quality of source (density of Three strips of a Devanagari word
microfilm)
VII.PROCESS OF OCR WITH RESPECT TO
DEVANAGARI SCRIPT
4  Pages with complex layouts The Process of Extraction of text from an image includes
 Adequate white space between lines, the concept of Segmentation in our case it consists of two
columns and at the edge of the page stages.

A.Preliminary Segmentation of Words(Stage 1);-


5 Unoptimized Image The preliminary segmentation consists of the following
steps:
1. Locate The Header line.
6 Pattern present in provided Image We compute the horizontal projection of the word
image box. The row containing the maximum
number of black pixels is considered to be header
7 Algorithm Used for processing of Image line Let this position is denoted by hLinePos.
2. Separate character /symbol boxes.
To Separate the character and symbol boxes we
V.ARCHITECTURE OF TESSERACT make a vertical projection of image starting from
hLinePos to the bottom row of the word image box.
3. Separate symbols of the top strip.
To Separate the symbol of the last strip, We
compute the vertical projection of the image
starting from the top row of the image to the
hLinePos.The columns that have no black pixels
are used as delimiters for extracting the top
modifier symbol boxes.
Our selection criteria for marking the sub-images that need
further segmentation due to the possible presence of lower
modifiers or shadow characters or conjuncts.

B.Segmentation of Conjunct/Touching Characters(Stage 2)


Sub images marked for further segmentation due to its
width may contain either a conjunct or a shadow character
pair. An outer boundary traversal is initiated from the top-
left-most black pixel of the image. During the traversal, the image. In this paper, we have presented the internal
right most pixel visited will not be on the right boundary of architecture and the working of Pytessarct which is a
the image box if an image box contains the shadow wrapper of Tesseract OCR engine with respect to our
character pair. If the image contains a conjunct, the right Devanagari OCR System. Pytessarct uses two-pass system
most pixel visited will be on the right boundary of the or two-stage system to optically read and recognize
image. A conjunct is segmented vertically at an appropriate Devanagari characters of the present in an image provided
column to yield constituent characters. by the user

VII.CONCLUSION
We tested our Devanagari OCR system on various images VIII.REFERENCES
taken from the internet as well as from other sources. A
performance of approximately 93% of the character level is 1. Sean O’Brien, Dhia Ben Haddej “Optical Character
obtained. The input and output images are given below Recognition”
2. Veena Bansal.,R.M.K.Sinha“A Complete OCR for Printed
Hindi Text in Devanagari Script”
3. “ISauvola: Improved Sauvola’s Algorithm for Document
Image Binarization” Zineb Hadjadj, Abdelkrim Meziane,
Yazid Cherfa,Mohamed Cherietand Insaf Setitra
4. “Tesseract OSCON”Google.

Input Image

Output Image
.
The image output after whole processing is given in output

You might also like