Thanks Google for Hindi OCR Guidelines

Hindi OCR Guidelines

Published by: Brijesh Verma on Jan 20, 2013
Installation guidelines for Hindi/Indic languages OCR (Windows)
Thanks to Google, Ubuntu and all open source resources for the Internet-based Gayatri and Yagya, which helped us to develop Hindi/Indic languages OCR (Optical Character Recognition) for our mega Unicode conversion project of Vedic Literature.
This will help us to propagate and implement our will, our solemn pledge, for everyone to have a life like our P. Gurusatta.
Vandaniya Mataji: "Son! Never separate me and Guruji." Then she said, "Son, in the times to come the world will look for the solution to its problems in my songs and in Guruji's discourses." It is true; how could Shiva and Shakti ever be separated? (... Jhanki, p. 73)
Overview of Hindi/Indic/Multilingual OCR:
1. Scan the document (300 DPI for better output) as an image or PDF file.
2. If the images are not of high quality, save/export the PDF as images (.tif, .png) into one folder for post-processing of the scanned pages.
3. Use Scan Tailor for post-processing of the scanned pages. Detailed user guidelines follow.
4. Make a PDF file from the images by creating a PDF: Files > Create PDF from multiple files > Add files.
5. Install Tesseract, one of the most accurate open source OCR engines available. Restart your computer so that the new system path for Tesseract is assigned.
6. Use gImageReader for OCR. Save the file in the same folder. Detailed user guidelines follow.
7. Check and correct spelling using any spell checker software, for example http://www.awgp.in/spellchecker/ or http://www.bhashagiri.com/ for Hindi.
8. Convert fonts. Use Hindi Lekhak at http://www.awgp.in/hindilekhak/ or download it.
9. Print for manual proofreading. Detailed user guidelines follow.
10. Manually check the document for logical errors.
11. Feedback is always welcome.
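The guide performs the OCR step through the gImageReader interface. For batch work, the same engine can also be driven from the command line. The following is only a minimal sketch, assuming that `tesseract` is on the system path (hence the restart in step 5) and that the Hindi language data is installed. The function names `build_tesseract_command` and `ocr_folder` are hypothetical helpers for illustration, not part of Tesseract.

```python
import subprocess
from pathlib import Path

# Image types mentioned in the overview (.tif, .png).
IMAGE_SUFFIXES = {".tif", ".tiff", ".png"}


def build_tesseract_command(image, lang="hin"):
    """Return the Tesseract command line for one page image."""
    image = Path(image)
    # Tesseract appends ".txt" to the output base name by itself.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_folder(folder, lang="hin"):
    """Run Tesseract on every page image in `folder`, in file-name order.

    Returns the paths of the text files Tesseract is expected to write.
    """
    pages = sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )
    for page in pages:
        # check=True stops at the first page that Tesseract fails on.
        subprocess.run(build_tesseract_command(page, lang), check=True)
    return [p.with_suffix(".txt") for p in pages]
```

Calling `ocr_folder` on the Scan Tailor output folder should leave one `.txt` file next to each page image, which can then go through the spell checking and font conversion steps.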
Google's Tesseract hin.traineddata was found to work well for fonts like Chanakya/Arial Unicode MS.
Nitin's hin.traineddata was found to work well for fonts like Mitra/Mangal/KrutiDev.
Required Installation instructions:
1. We should be connected to the Internet throughout the installation process.
2. gs905w32.exe - GPL Ghostscript
   http://sourceforge.net/projects/ghostscript/
3. vcredist_x86.exe - MS VC++ Redistributable Setup.
   http://www.microsoft.com/en-in/download/details.aspx?id=5555
4. Scan Tailor - An interactive post-processing tool for scanned pages.
   http://sourceforge.net/projects/scantailor/
5. tesseract-ocr-setup-3.02.02.exe
   http://tesseract-ocr.googlecode.com/files/tesseract-ocr-setup-3.02.02.exe
   a. Make Internet connection ON.
   b. Choose Components: Download & Install Hindi Language Data; Download & Install Math / Equation Detect.
   c. Installation completes successfully.
6. Restart immediately. This is important.
7. Copy-paste the hin.traineddata file into the C:\Program Files\Tesseract-OCR\tessdata folder if it was not downloaded there, from
   http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.hin.tar.gz
8. You may try more Indic language traineddata files and paste them into the above folder from http://code.google.com/p/parichit/downloads/list. Thanks to Indu and RKVS Raman (http://code.google.com/p/parichit/) for their Parichit project. Accuracy is low.
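The copy-paste step for traineddata files can also be scripted. This is a minimal sketch, assuming the default tessdata folder named above and a traineddata file that has already been extracted from the downloaded archive. `ensure_traineddata` is a hypothetical helper name, and writing into C:\Program Files may require administrator rights.

```python
import shutil
from pathlib import Path

# Default folder named in the installation instructions.
TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def ensure_traineddata(source_file, tessdata_dir=TESSDATA_DIR):
    """Copy a .traineddata file into tessdata unless it is already there.

    Returns True if a copy was made, False if the file already existed.
    """
    source_file = Path(source_file)
    target = Path(tessdata_dir) / source_file.name
    if target.exists():
        # Keep the file the installer downloaded; do not overwrite it.
        return False
    shutil.copy2(source_file, target)
    return True
```

For example, `ensure_traineddata("hin.traineddata")` copies the Hindi data only if the installer did not already download it. To try a different hin.traineddata, such as one suited to another font family, the existing file would have to be renamed or removed first, since this sketch never overwrites.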
Scan Tailor - An interactive post-processing tool for scanned pages.
http://sourceforge.net/projects/scantailor/
1. Download and install.
2. Put all scanned images / images exported from the PDF file into one folder.
3. Start Scan Tailor. Open a new project. Select the folder.
4. Select all files / required files. Click “Fix DPI even if …..”. Click OK.
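Step 2 expects the PDF pages to be available as images in one folder. The guide does not say which tool performs the export. One option is the Ghostscript package from the installation list, as in this rough sketch. It assumes the 32-bit console executable `gswin32c` is on the system path. The helper names `build_gs_command` and `export_pages`, the `tiff24nc` output device and the `page-%03d.tif` naming pattern are choices made for this illustration.

```python
import subprocess
from pathlib import Path


def build_gs_command(pdf_file, out_dir, dpi=300, gs_exe="gswin32c"):
    """Return a Ghostscript command that writes one TIFF per PDF page."""
    # Ghostscript expands %03d to the page number: page-001.tif, page-002.tif, ...
    pattern = Path(out_dir) / "page-%03d.tif"
    return [
        gs_exe,
        "-dNOPAUSE", "-dBATCH",   # run without prompts and exit when done
        "-sDEVICE=tiff24nc",      # 24-bit colour TIFF output
        f"-r{dpi}",               # resolution in DPI (300 as in the overview)
        f"-sOutputFile={pattern}",
        str(pdf_file),
    ]


def export_pages(pdf_file, out_dir, dpi=300, gs_exe="gswin32c"):
    """Create `out_dir` if needed and export every page of `pdf_file` into it."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_gs_command(pdf_file, out_dir, dpi, gs_exe), check=True)
```

The resulting folder can then be selected in Scan Tailor in step 3.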

