
OCRopus Addons

Internship Report

Submitted to:
Image Understanding and Pattern Recognition Lab
German Research Center for Artificial Intelligence
Kaiserslautern, Germany

Submitted by:
Ambrish Dantrey, B.Tech. III year, E&CE
Indian Institute of Technology, Roorkee
Roorkee, India

Supervisors: Faisal Shafait, Ilya Mezhirov
Reviewer: Prof. Dr. Thomas Breuel

Start Date for Internship: 15th May, 2007
End Date for Internship: 27th July, 2007

Report Date: 27th July, 2007

Preface

This report documents the work done during the summer internship at the Image
Understanding and Pattern Recognition (IUPR) Lab, Deutsches Forschungszentrum
für Künstliche Intelligenz (DFKI), Germany, under the supervision of Prof. Dr.
Thomas Breuel. The report first gives an overview of the tasks completed
during the internship, with technical details. The results obtained are then
discussed and analyzed.
The report also elaborates on future work that can be pursued as an
advancement of the current work.

I have tried my best to keep the report simple yet technically correct. I hope
I have succeeded in my attempt.

Ambrish Dantrey


Acknowledgments

Simply put, I could not have done this work without the great deal of help I
cheerfully received from the whole of IUPR. The work culture at IUPR is really
motivating. Everybody here is such a friendly and cheerful companion that work
stress never gets in the way.

I would especially like to thank Dr. Thomas Breuel and Dr. Daniel Keysers for
providing good ideas to work on. Not only did they advise me on my project,
but listening to their discussions in the IPeT meetings also sparked a strong
interest in image analysis. I am also highly indebted to my supervisors,
Faisal Shafait and Ilya Mezhirov, who seemed to have solutions to all my
problems.

Author


Abstract
The report presents the three tasks completed during the summer internship at
IUPR, which are listed below:
1. Detection of headlines in document images using black run-lengths, and
evaluation of OCRopus performance in detecting headlines
2. Reengineering the zone classification module
3. Evaluation of the performance of different segmentation algorithms
All these tasks were completed successfully and the results were in line with
expectations. Headline detection achieved a low error rate of 2.85%, against
6.52% for the previously used method. During the evaluation of segmentation
algorithms, XY cut was found to gain a lot from noise cleanup, which is an
interesting result because it strengthens the case for XY cut as a suitable
segmentation method for OCRopus. The reengineering and porting of the zone
classification module to OCRopus makes it possible for OCRopus to perform
text/image segmentation if that is required in the future.

Author


OCRopus: Introduction
Though the field of optical character recognition (OCR) is considered to be
widely explored, the development of an efficient system for use in real-world
situations still remains a challenge for developers. OCRopus is a
state-of-the-art document analysis and OCR system being developed at IUPR,
featuring pluggable layout analysis, pluggable character recognition,
statistical natural language modeling, and multilingual capabilities. As this
is a very big project, I was assigned the tasks of developing tools for layout
analysis and evaluation.

The Goals:
The following goals were set as I proceeded in my work:
1. Conversion of ground-truth data in the MARG database from XML format to
the hOCR microformat [1].
2. Development of a rule-based headline detection method using the median
black run-length of the lines.
3. Development of a segmentation-classification module and evaluation of the
performance of different segmentation algorithms in the presence of noise.

1. XML to hOCR:
hOCR is a format for representing OCR output, including layout information,
character confidences, bounding boxes, and style information. It embeds this
information invisibly in standard HTML. By building on standard HTML, it
automatically inherits well-defined support for most scripts, languages, and
common layout options. Furthermore, unlike previous OCR formats, the
recognized text and OCR-related information coexist in the same file and
survive editing and manipulation. hOCR markup is independent of the
presentation.

Because of all these qualities of the hOCR format, it is highly desirable to
have ground truth in this format. I was assigned the task of converting the
MARG database ground truth into hOCR format. For this purpose I wrote the
following script.
Script name: xmltohocr
Language used: Python
Command-line form: xmltohocr FILE.XML

FILE.XML: The file in XML format to be converted into the hOCR microformat.

Note: The script does not handle LaTeX characters yet. Incorporating this
feature would be an improvement.
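
To illustrate the kind of conversion involved, here is a minimal sketch in Python. It is not the xmltohocr script itself. The input element name `zone` and its attributes `x0`, `y0`, `x1`, `y1` are hypothetical stand-ins, since the MARG XML schema is not reproduced in this report. Only the hOCR side (the `ocr_page` and `ocr_carea` classes and the `bbox` entry in the `title` attribute) follows the hOCR convention.

```python
import xml.etree.ElementTree as ET
from html import escape


def zones_to_hocr(xml_text):
    """Convert hypothetical <zone> elements into hOCR ocr_carea divs."""
    root = ET.fromstring(xml_text)
    parts = ["<html><body><div class='ocr_page'>"]
    for zone in root.iter("zone"):  # hypothetical element name
        # Hypothetical attributes holding the bounding-box corners.
        bbox = " ".join(zone.get(key) for key in ("x0", "y0", "x1", "y1"))
        # Re-escape the text so that it stays valid inside HTML.
        text = escape(zone.text or "")
        parts.append(f"<div class='ocr_carea' title='bbox {bbox}'>{text}</div>")
    parts.append("</div></body></html>")
    return "\n".join(parts)


sample = "<page><zone x0='10' y0='20' x1='300' y1='60'>Title &amp; more</zone></page>"
hocr = zones_to_hocr(sample)
print(hocr)
```

Because the layout information sits in the `title` attribute, the resulting file still renders as ordinary HTML.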

2. Headline detection based on black run-length and its integration into
OCRopus:
Detection of headlines in document images is an issue that is mostly
overlooked, yet it is highly desirable for properly formatting OCR output.
Until now, OCRopus used a rule-based method with the space between lines as
the criterion for detecting headlines. Though this method worked for many
images, it also failed many times. We observed that the black run-lengths of
headlines are longer than those of normal lines, and we tried to build upon
this concept. We used the median black run-length of a line as the deciding
criterion. The median was used instead of the mean because the mean run-length
could easily be affected by noise merging with text, which would produce
errors.

The whole approach is simple, as described below:
1. Calculate the median black run-length for each line on the page.
2. Compare this run-length for each line with those of the lines below and
above it.
3. If the median black run-length of a line is found to be K1 (a parameter)
times the median run-length of the line below it, and K2 (another parameter)
times the median run-length of the line above it, mark it as a headline.

The values of the parameters K1 and K2 had to be found experimentally. After
evaluating the performance of the program many times, K1 and K2 were set to
1.5 and 1.1 respectively.
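
The rule above can be sketched in a few lines of Python. This is an illustration only, not the C++ code integrated into OCRopus. It assumes that "K1 times" means "at least K1 times", and that a line with no neighbor on one side (the first or last line of the page) is tested against its single neighbor only. The function name `detect_headlines` is hypothetical.

```python
K1 = 1.5  # factor relative to the line below (value given in the report)
K2 = 1.1  # factor relative to the line above (value given in the report)


def detect_headlines(medians, k1=K1, k2=K2):
    """Return the indices of lines flagged as headlines.

    medians: median black run-length of each text line, top to bottom.
    """
    headlines = []
    for i, value in enumerate(medians):
        below = medians[i + 1] if i + 1 < len(medians) else None
        above = medians[i - 1] if i > 0 else None
        if below is None and above is None:
            continue  # a single line gives nothing to compare against
        # Assumption: a missing neighbor does not block detection.
        ok_below = below is None or value >= k1 * below
        ok_above = above is None or value >= k2 * above
        if ok_below and ok_above:
            headlines.append(i)
    return headlines


medians = [3, 3, 6, 3, 3]
print(detect_headlines(medians))  # the third line (index 2) is flagged
```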

We used a histogram-based method to find the median run-length. A histogram
of the number of occurrences versus run-length was calculated; once we have
such a histogram, we normalize it by the largest occurrence count. We then
calculate the cumulative distribution function of this normalized histogram.
The point at which the cumulative distribution function reaches a value of
0.5 corresponds to the median run-length.
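
A minimal Python sketch of the histogram and cumulative-distribution idea follows. It makes two assumptions that are not stated in the text. First, runs are measured horizontally along each pixel row of a binarized line image, with 1 meaning black. Second, the histogram is normalized by the total number of runs, so that the cumulative distribution ends at 1 and the 0.5 crossing is the median. The function names are hypothetical.

```python
def black_runs(row):
    """Lengths of consecutive runs of black (1) pixels in one pixel row."""
    runs, current = [], 0
    for pixel in row:
        if pixel:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def median_black_runlength(line_image):
    """Median black run-length of a text line via a histogram and its CDF."""
    histogram = {}
    for row in line_image:
        for length in black_runs(row):
            histogram[length] = histogram.get(length, 0) + 1
    total = sum(histogram.values())
    if total == 0:
        return 0  # no black pixels in this line
    cumulative = 0
    for length in sorted(histogram):
        cumulative += histogram[length]
        if cumulative / total >= 0.5:  # the CDF has reached 0.5
            return length


line = [
    [1, 1, 0, 1, 1, 1, 0, 1],  # runs of length 2, 3, 1
    [0, 1, 1, 1, 1, 0, 1, 1],  # runs of length 4, 2
]
print(median_black_runlength(line))
```

Here the runs are 1, 2, 2, 3 and 4, so the cumulative distribution first reaches 0.5 at run-length 2.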

The program for detection of headlines was written in C++ and used standard
OCRopus classes. The program has been successfully integrated into OCRopus and

Evaluation:
We also designed a tool that evaluates the performance of OCRopus in
detecting headlines. In accordance with OCRopus standards, this tool works
with files in the hOCR microformat. The tool consists of two programs:
1. The first program takes the OCRopus output and the corresponding
ground-truth file in hOCR format and outputs the total number of false
positives and false negatives that occurred in detection. It also outputs the
total number of true headlines present in the ground truth. The command-line
form of this program is:
headlineeval hOCRtrue hOCRactual
2. The second program parses the file produced by running the above program
on a large number of files (or on a database), counts the total number of
false positives and false negatives over the whole database, and reports the
error rate of OCRopus on the whole database. The command-line form of this
program is:
count_errors FILE.TXT

Both of the above programs were written in Python.

Criteria for evaluation: For evaluating the performance of OCRopus in
detecting headlines, we define the error rate as:

e = (fp + fn) / T
e = error rate, reported as a percentage
fp = total number of false positives
fn = total number of false negatives
T = total number of text lines
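
As a small worked example, the error rate can be computed as below. The sketch assumes that T is the total number of text lines, and it multiplies by 100 because e is reported as a percentage. The function name and the counts are made up for illustration.

```python
def error_rate(false_positives, false_negatives, total):
    """Percentage error: (fp + fn) / T, expressed in percent."""
    return 100.0 * (false_positives + false_negatives) / total


# Toy numbers: 3 false positives and 2 false negatives over 100 text lines.
e = error_rate(3, 2, 100)
print(e)
```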

We evaluated the performance on the standard University of Washington III
(UW-III) database [2]. The results for the headline detection program showed
clearly that the median black run-length criterion is better than the
space-between-lines criterion, yet errors were still present. While visually
analyzing the output, we observed that the run-length-based and space-based
criteria produced different false negatives and false positives. Hence it was
clear that one method could be used to remove the errors produced by the
other. So we combined the two approaches in such a way that the space-based
criterion is used as a filter to detect false positives produced by the
run-length-based criterion. The rule used to combine them was as follows:
1. Use the run-length-based criterion to find the headlines.
2. Calculate the median black run-length for the whole page.
3. Compare the median black run-length of every line found to be a headline
in step 1 with the median black run-length of the page. Since the median
black run-length of the page represents an ordinary line rather than a
headline, any headline found in step 1 whose run-length is less than or equal
to that of the whole page is a suspicious case. Recheck such a line with the
space-based criterion.
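
The three steps above amount to a simple filter, sketched here in Python. The function name `combine_criteria` and the `space_check` callback are hypothetical. The space-based criterion itself is not reproduced in this report, so it is passed in as a function that returns True when a line also passes that test.

```python
def combine_criteria(candidates, line_medians, page_median, space_check):
    """Filter run-length headline candidates with the space-based criterion.

    candidates:   indices of lines flagged by the run-length rule (step 1)
    line_medians: median black run-length of each line
    page_median:  median black run-length of the whole page (step 2)
    space_check:  callable(index) -> bool, hypothetical stand-in for the
                  space-based criterion
    """
    confirmed = []
    for i in candidates:
        # Step 3: a candidate no bolder than the page as a whole is suspicious.
        suspicious = line_medians[i] <= page_median
        if not suspicious or space_check(i):
            confirmed.append(i)
    return confirmed


line_medians = [3, 6, 3, 3, 3]
candidates = [1, 4]  # suppose the run-length rule flagged these two lines
accepted = combine_criteria(candidates, line_medians, 3, lambda i: False)
print(accepted)  # line 4 is suspicious and fails the recheck, so only 1 stays
```

Only suspicious candidates are rechecked, so a clear headline is never rejected by the space-based test alone.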

Results:
The results were as expected. The run-length-based criterion alone performed
better than the space-based criterion alone, and the combination of both
criteria described above outperformed both. The error rates on the standard
UW-III database for the different approaches are as follows:

Space-based headline detection:

total number of text lines: 138018
total number of false positives: 7356.0
total number of false negatives: 1713.0
% error = 6.52%

Black run-length-based headline detection:

total number of text lines: 138018
total number of false positives: 4341.0
total number of false negatives: 1386.0
% error = 4.14%

Both approaches combined (using the space-based approach as a filter to
remove false positives):

total number of text lines: 138018
total number of false positives: 2452.0
total number of false negatives: 1476.0
% error = 2.85%

Next we show some examples:



3. Text/Image Segmentation and Classification
Document image layout analysis is a crucial step in many applications related
to document images, such as text extraction using optical character
recognition (OCR), reflowing documents, and layout-based document retrieval.
Layout analysis is the process of identifying layout structures by analyzing
page images. Layout structures can be physical (text, graphics, pictures,
...) or logical (titles, paragraphs, captions, headings, ...). The
identification of physical layout structures is called physical or geometric
layout analysis, while assigning different logical roles to the detected
regions is termed logical layout analysis [3]. The task of a geometric layout
analysis system is to segment the document image into homogeneous zones, each
consisting of only one physical layout structure, and to identify their
spatial relationship (e.g. reading order). Therefore, the performance of
layout analysis methods depends heavily on the page segmentation algorithm
used. A detailed explanation of different segmentation algorithms and a
comparison of their performance can be found in [4, 5].

Another important subtask of document image analysis is the classification of
physically segmented blocks into one of a set of predefined classes. In most
cases the classification step follows the segmentation, and it is highly
desirable to evaluate the system performance on the whole
segmentation/classification task. With the help of such an evaluation, it is
easy to decide whether incorporating these steps in OCRopus would result in
improved performance. It would also be easy to decide which segmentation
algorithm to use.

For the classification step we used the method described in [6], as we
considered it the best classification method available. Instead of the eight
classes described in that paper, we used only two classes, text and non-text,
which were relevant to OCRopus.

We already had an implementation of various segmentation algorithms and of
the classification step. The task included reengineering the code of the
classification step and porting the whole segmentation-classification module
into OCRopus, making it use standard OCRopus classes and functions. The task
has been completed successfully, and we now have a version of the whole
segmentation-classification module in the OCR repository; it can be
integrated with OCRopus if the results and experiments turn out positive. The
command-line form of the program is:

ocrclassifyanddisplay -i IMAGE -b BOUNDINGBOXFILE -o OUTPUTIMAGE

IMAGE: The image to be classified
BOUNDINGBOXFILE: The bounding-box file produced by the segmentation
algorithms
OUTPUTIMAGE: The name of the output image to be written

Evaluation
As discussed earlier, evaluating the segmentation and classification steps
together is highly desirable. The purpose of developing an evaluation module
was to decide which segmentation algorithm would best suit the needs of
OCRopus. We developed an evaluation program that evaluates the performance of
the two steps against the ground truth. Our evaluation criterion is the
Hamming distance between the text/non-text zone image produced from the
ground truth and that produced by the zone classification module. The error
rate is defined as follows:

e = HD * 100 / T
e = error rate
HD = Hamming distance between the ground-truth text/non-text image and the
actual text/non-text image
T = total number of pixels in the image

% efficiency = 100 - e
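
The definition above can be illustrated with a short Python sketch. It is not the C++ evaluation program. It assumes the two text/non-text images are given as equally sized lists of rows holding 0 or 1, and the function name `hamming_error_rate` is hypothetical.

```python
def hamming_error_rate(ground_truth, actual):
    """e = HD * 100 / T for two equally sized binary text/non-text masks."""
    same_shape = len(ground_truth) == len(actual) and all(
        len(g_row) == len(a_row) for g_row, a_row in zip(ground_truth, actual)
    )
    if not same_shape:
        raise ValueError("images must have the same size")
    total = sum(len(row) for row in ground_truth)  # T: number of pixels
    distance = sum(  # HD: number of pixels whose labels differ
        g != a
        for g_row, a_row in zip(ground_truth, actual)
        for g, a in zip(g_row, a_row)
    )
    return 100.0 * distance / total


gt = [[1, 1, 0, 0], [1, 1, 0, 0]]
out = [[1, 1, 1, 0], [1, 0, 0, 0]]
e = hamming_error_rate(gt, out)  # 2 of the 8 pixels differ
efficiency = 100.0 - e
print(e, efficiency)
```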

This program was developed in C++. The command-line form of the program is:

ocrevaluate -gt GROUNDTRUTHIMAGE -ai ACTUALIMAGE

GROUNDTRUTHIMAGE: Text/non-text image produced from the ground truth
ACTUALIMAGE: Text/non-text image produced by the actual program

Issue of noise cleanup: Document image noise greatly affects the performance
of segmentation algorithms. We expected the performance of all the algorithms
to improve after noise cleanup. A better explanation can be found in [5]. We
used the noise cleanup system explained in [7]. We also expected the
improvement for simple segmentation algorithms like XY cut to be greater than
for complex algorithms like Voronoi, because XY cut is affected more by noise
than Voronoi is. When we evaluated the performance of these algorithms with
and without noise cleanup, we were proved correct.

Results:
The performance of three segmentation algorithms (Voronoi, Docstrum and XY
cut) was evaluated with our program. The results were as we had expected and
hence quite encouraging. Below are the percentage efficiencies of all these
algorithms with and without noise cleanup.

Algorithm   Percentage efficiency      Percentage efficiency
            without noise cleanup      with noise cleanup
Voronoi     87.03                      87.69
Docstrum    86.88                      86.92
XY cut      80.16                      85.70

As is evident, the performance of all the algorithms increases with noise
cleanup, but the improvement was much greater for XY cut than for the other
algorithms. After noise cleanup, XY cut has an efficiency close to that of
Voronoi, and being a simple algorithm, XY cut can be an optimum choice for
OCRopus.
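
To make the comparison concrete, the gain from noise cleanup can be computed directly from the table above. This is only a small arithmetic sketch in Python, and the variable names are illustrative.

```python
# Percentage efficiency (without noise cleanup, with noise cleanup),
# copied from the table above.
efficiency = {
    "Voronoi": (87.03, 87.69),
    "Docstrum": (86.88, 86.92),
    "XY cut": (80.16, 85.70),
}

# Gain in percentage points for each algorithm.
gains = {
    name: round(after - before, 2)
    for name, (before, after) in efficiency.items()
}

# Remaining gap between Voronoi and XY cut after noise cleanup.
gap_after_cleanup = round(efficiency["Voronoi"][1] - efficiency["XY cut"][1], 2)

print(gains)
print(gap_after_cleanup)
```

XY cut gains 5.54 percentage points, against 0.66 for Voronoi and 0.04 for Docstrum, and ends up 1.99 points behind Voronoi.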

Conclusion:


The whole experience of working at IUPR was great. This organization has a
superb work culture, great minds and a very high quality of work. I learned a
lot about image processing and analysis. The work I was able to complete here
was very satisfying. I tried to develop as many addons as possible for
OCRopus and even got very encouraging results with some of them. I hope my
work on OCRopus helps it meet its goals.


References
1. T. M. Breuel: The hOCR Microformat for OCR Workflow and Results. ICDAR,
2007, accepted for publication
2. I. Guyon, R. M. Haralick, J. J. Hull and I. T. Phillips: Data sets for OCR
and document image understanding research. In: Handbook of Character
Recognition and Document Image Analysis, World Scientific (1997) 779-799
3. R. Cattoni, T. Coianiz, S. Messelodi, C. M. Modena: Geometric layout
analysis techniques for document image understanding: a review. Technical
report, IRST, Trento, Italy (1998)*
4. F. Shafait, D. Keysers, and T. M. Breuel: Performance Comparison of Six
Algorithms for Page Segmentation. 7th IAPR Workshop on Document Analysis
Systems (DAS), pages 368-379
5. F. Shafait, D. Keysers, T. M. Breuel: Pixel-Accurate Representation and
Evaluation of Page Segmentation in Document Images. ICPR 2006, International
Conference on Pattern Recognition, pages 872-875*
6. T. M. Breuel, D. Keysers, F. Shafait: Document Image Zone Classification -
A Simple High-Performance Approach. VISAPP 2007, pages 44-51
7. T. Gupta: OCRopus addons. Tech report, IUPR, 2007
