
OCRopus Addons

Internship Report
Submitted to:

Image Understanding and Pattern Recognition Lab
German Research Center for Artificial Intelligence
Kaiserslautern, Germany

Submitted by:

Ambrish Dantrey, B.Tech. III year, E&CE
Indian Institute of Technology, Roorkee
Roorkee, India
Supervisors: Faisal Shafait, Ilya Mezhirov

Reviewer: Prof. Dr. Thomas Breuel
Start Date for Internship: 15th May, 2007
End Date for Internship: 27th July, 2007

Report Date: 27th July, 2007

Preface
This report documents the work done during my summer internship at the Image
Understanding and Pattern Recognition (IUPR) Lab, Deutsches Forschungszentrum
für Künstliche Intelligenz (DFKI), Germany, under the supervision of Prof. Dr.
Thomas Breuel. The report first gives an overview of the tasks completed
during the internship, with technical details. The results obtained are then
discussed and analyzed.
The report also elaborates on future work that can be pursued as an
advancement of the current work.
I have tried my best to keep the report simple yet technically correct. I hope
I have succeeded in my attempt.
Ambrish Dantrey

Acknowledgments
Simply put, I could not have done this work without the great deal of help I
cheerfully received from the whole of IUPR. The work culture at IUPR is really
motivating. Everybody here is such a friendly and cheerful companion that work
stress never gets in the way.
I would especially like to thank Dr. Thomas Breuel and Dr. Daniel Keysers for
providing the nice ideas to work upon. Not only did they advise me on my
project, but listening to their discussions in the IPeT meetings also evoked a
strong interest in image analysis. I am also highly indebted to my supervisors
Faisal Shafait and Ilya Mezhirov, who seemed to have solutions to all my
problems.
Author

Abstract

The report presents the three tasks completed during the summer internship at
IUPR, which are listed below:
1. Detection of headlines in document images using black run-lengths, and
evaluation of OCRopus performance in detecting headlines
2. Re-engineering the zone classification module
3. Evaluation of the performance of different segmentation algorithms
All these tasks were completed successfully and the results were in line with
expectations. Headline detection achieved a low error rate of 2.85%, against
6.52% for the previously used method. During the evaluation of segmentation
algorithms, X-Y cut was found to gain a lot from noise cleanup, which is an
interesting result as it strengthens the case for the X-Y cut segmentation
algorithm as a suitable method for OCRopus. The re-engineering and porting of
the zone classification module to OCRopus makes it possible for OCRopus to
have text/image segmentation if it is required in future.
Author

OCRopus: Introduction

Though the field of optical character recognition (OCR) is considered to be
widely explored, the development of an efficient system for use in real-world
situations still remains a challenge for developers. OCRopus is a
state-of-the-art document analysis and OCR system being developed at IUPR,
featuring pluggable layout analysis, pluggable character recognition,
statistical natural language modeling, and multilingual capabilities. This
being a very big project, I was assigned the tasks of developing tools for
layout analysis and evaluation.

The Goals:

The following goals were set as I proceeded in my work:
1. Conversion of ground-truth data in the MARG database from XML format
to the hOCR microformat [1].
2. Development of a rule-based headline detection method using the median
black run-length of the lines.
3. Development of a segmentation-classification module and evaluation of
the performance of different segmentation algorithms with respect to noise.

1. XML to hOCR:

hOCR is a format for representing OCR output, including layout information,
character confidences, bounding boxes, and style information. It embeds this
information invisibly in standard HTML. By building on standard HTML, it
automatically inherits well-defined support for most scripts, languages, and
common layout options. Furthermore, unlike previous OCR formats, the
recognized text and OCR-related information co-exist in the same file and
survive editing and manipulation. hOCR markup is independent of the
presentation.
Due to all the above qualities of the hOCR format, it is highly desirable to
have ground truth in this format. I was assigned the task of converting the
MARG database ground truth into hOCR format. For this purpose I have written
the following script.
Script Name: xmltohocr
Language Used: Python
Command line argument form: xmltohocr FILE.XML
FILE.XML: The file in XML format to be converted into hOCR microformat.
Note: The script does not handle LaTeX characters yet. It would be an
improvement to incorporate this feature.
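
To illustrate how hOCR embeds layout information in plain HTML, the following
is a minimal sketch of writing one element whose title attribute carries a
bounding box. It is not the xmltohocr script itself: the helper name
hocr_element, the sample coordinates and the sample text are hypothetical, and
the structure of the MARG XML input is not assumed here at all. Only the
ocr_line class and the "bbox x0 y0 x1 y1" title property are taken from the
hOCR format.

```python
from html import escape


def hocr_element(tag, css_class, bbox, text=""):
    """Return one hOCR element whose title attribute carries the bounding box.

    bbox is (x0, y0, x1, y1) in pixel coordinates (hypothetical sample input).
    """
    x0, y0, x1, y1 = bbox
    title = f"bbox {x0} {y0} {x1} {y1}"
    # Escape the text so that characters such as '&' stay valid HTML.
    return f"<{tag} class='{css_class}' title='{title}'>{escape(text)}</{tag}>"


line = hocr_element("span", "ocr_line", (120, 80, 980, 112), "A Sample Title & More")
print(line)
```

Because the bounding box lives in the title attribute, a browser still renders
the element as ordinary text, which is what lets the recognized text and the
OCR-related information share one file.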

2. Headline detection based on black run-length and its integration
into OCRopus:
Detection of headlines in document images is an issue that is mostly
overlooked, yet it is highly desirable for properly formatting the output of
OCR. Until now, OCRopus had used a rule-based method which used the space
between lines as the criterion for detecting headlines. Though this method
worked for many images, it also failed many times. It was an obvious
observation that the black run-lengths of headlines are larger than the black
run-lengths of normal lines, and we tried to build upon this concept. We used
the median black run-length of a line as the deciding criterion. The median
was used instead of the mean because the mean run-length could easily have
been affected by noise merging with the text and would have produced errors.
The whole approach is simple, as discussed below:
1. Calculate the median black run-length for each line on the page.
2. Compare this run-length for each line with those of the lines below and
above it.
3. If the median black run-length of a line is found to be at least K1 (a
parameter) times the median run-length of the line below it, and at least K2
(another parameter) times the median run-length of the line above it, mark it
as a headline.
The values of the parameters K1 and K2 were to be found experimentally. After
evaluating the performance of the program many times, the values of K1 and K2
were set to 1.5 and 1.1 respectively.
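
The three steps above can be sketched as follows. This is only an
illustration in Python, not the C++ code integrated into OCRopus. The function
name detect_headlines is hypothetical, and two points are assumptions rather
than statements from the report: the comparison is treated as "at least K
times", and a missing neighbour (for the first or last line of the page) is
treated as passing its test.

```python
K1 = 1.5  # required ratio against the line below (value from the report)
K2 = 1.1  # required ratio against the line above (value from the report)


def detect_headlines(medians, k1=K1, k2=K2):
    """Return indices of lines whose median run-length dominates both neighbours.

    medians[i] is the median black run-length of text line i, top to bottom.
    """
    headlines = []
    for i, m in enumerate(medians):
        above = medians[i - 1] if i > 0 else None
        below = medians[i + 1] if i + 1 < len(medians) else None
        if above is None and below is None:
            continue  # a page with a single line offers nothing to compare
        # Assumption: a missing neighbour passes its test automatically.
        ok_below = below is None or m >= k1 * below
        ok_above = above is None or m >= k2 * above
        if ok_below and ok_above:
            headlines.append(i)
    return headlines


# Hypothetical page: the third line has much thicker strokes than the rest.
print(detect_headlines([3, 3, 6, 3, 3]))
```

With these sample values only the third line (index 2) satisfies both ratios,
since 6 is at least 1.5 times the line below and 1.1 times the line above.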
We used a histogram-based method to find the median run-length. A histogram of
the number of occurrences versus run-length was calculated and normalized by
the largest occurrence value. We then calculated the cumulative distribution
function of this normalized histogram. The point at which the cumulative
distribution function reaches a value of 0.5 corresponds to the median
run-length.
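
A minimal sketch of the histogram-based median is given below, under stated
assumptions. The text line is assumed to be a list of pixel rows with 1 for
black and 0 for white, and runs are assumed to be measured horizontally; the
report does not state either detail. The sketch also normalizes by the total
number of runs rather than by the largest histogram value, so that the
cumulative distribution ends at 1 and the 0.5 crossing is the median. The
function names are hypothetical.

```python
def black_runs(row):
    """Lengths of consecutive runs of 1s (black pixels) in one pixel row."""
    runs, current = [], 0
    for pixel in row:
        if pixel:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)  # run that touches the right edge
    return runs


def median_run_length(rows):
    """Median black run-length of a text line given as rows of 0/1 pixels."""
    histogram = {}  # run length -> number of occurrences
    for row in rows:
        for run in black_runs(row):
            histogram[run] = histogram.get(run, 0) + 1
    total = sum(histogram.values())
    if total == 0:
        return 0  # no black pixels at all
    cumulative = 0
    for length in sorted(histogram):
        cumulative += histogram[length]
        # Same as the normalized cumulative distribution reaching 0.5,
        # written with integers to avoid rounding problems.
        if 2 * cumulative >= total:
            return length
    return max(histogram)


sample_line = [
    [1, 1, 0, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 0, 1, 1],
]
print(median_run_length(sample_line))
```

The sample rows contain runs of length 2, 3, 1, 3 and 2, so the cumulative
count passes half of the five runs at length 2.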
The program for detection of headlines was written in C++ and used standard
OCRopus classes. The program has been successfully integrated into OCRopus and

Evaluation:

We also designed a tool which evaluates the performance of OCRopus in
detecting headlines. In accordance with OCRopus standards, this tool has been
developed to work with files in the hOCR microformat. The tool consists of two
programs:
1. The first program takes the OCRopus output and the corresponding
ground-truth file in hOCR format and outputs the total number of false
positives and false negatives which occurred in detection. It also outputs
the total number of true headlines present in the ground truth. The command
line form of this program is:
headlineeval hOCRtrue hOCRactual
2. The second program parses the file produced by running the above program
on a large number of files (or on a database), counts the total number of
false positives and false negatives that occurred in the whole database, and
reports the error rate of OCRopus on the whole database. The command line form
of this program is:
count_errors FILE.TXT
Both of the above programs were written in Python.
Criteria for evaluation: For evaluating the performance of OCRopus in the
detection of headlines we define the error rate as:
e = (fp + fn) / T
e = error rate (reported as a percentage)
fp = total no. of false positives
fn = total no. of false negatives
T = total no. of text lines
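
As a small worked illustration of this definition, the sketch below computes
the percentage error from the two counts. It assumes that T is the total
number of text lines, as the result listings later in this report suggest. The
function name and the sample counts are hypothetical and are not taken from
the UW-III experiments.

```python
def headline_error_rate(false_positives, false_negatives, total_lines):
    """Percentage error e = (fp + fn) / T, scaled to percent."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return 100.0 * (false_positives + false_negatives) / total_lines


# Hypothetical counts: 30 false positives and 20 false negatives on 1000 lines.
rate = headline_error_rate(30, 20, 1000)
print(f"{rate:.2f}%")
```

Both kinds of mistake are weighted equally, so a method can lower its error
rate by removing false positives even if its false negatives rise slightly.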
We evaluated the performance on the standard University of Washington III
(UW-III) database [2]. The results for the headline detection program showed
clearly that the median black run-length criterion is better than the
space-between-lines criterion, yet errors were still present. While visually
analyzing the output, we observed that the run-length-based and space-based
criteria produced different false negatives and false positives. Hence it was
clear that one method could be used to remove the errors produced by the
other. So we combined both approaches in such a way that the space-based
criterion is used as a filter to detect false positives produced by the
run-length-based criterion. The rule used to combine them was as follows:
1. Use the run-length-based criterion to find the headlines.
2. Calculate the median black run-length for the whole page.
3. Compare the median black run-length of every line found to be a headline
in step 1 with the median black run-length of the page. Since the median
black run-length of the page represents an ordinary line rather than a
headline, any headline found in step 1 whose run-length is less than or equal
to the run-length for the whole page is a suspicious case. Recheck such a line
with the space-based criterion.
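
The combination rule can be sketched as a filter over the run-length
candidates. This is an illustration only. The function name combine_criteria
and the space_test callback, which stands in for the existing space-based
check, are hypothetical interfaces. It is also an assumption that a suspicious
line which fails the recheck is simply dropped from the headline list.

```python
def combine_criteria(line_medians, page_median, runlength_hits, space_test):
    """Filter run-length headline candidates with the space-based criterion.

    line_medians   : median black run-length of each text line
    page_median    : median black run-length of the whole page
    runlength_hits : indices flagged as headlines by the run-length rule
    space_test     : callable(index) -> bool, the space-based criterion
                     (hypothetical interface)
    """
    confirmed = []
    for i in runlength_hits:
        suspicious = line_medians[i] <= page_median
        if suspicious and not space_test(i):
            continue  # assumed false positive: rejected by the recheck
        confirmed.append(i)
    return confirmed


# Hypothetical page: line 0 is clearly bold, line 2 is a doubtful candidate.
medians = [6, 3, 3, 3]
print(combine_criteria(medians, 3, [0, 2], lambda i: False))
```

Only the suspicious candidates reach the space-based test, so a clear headline
such as line 0 is never rejected by it.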

Results:

The results were as expected. The run-length-based criterion alone performed
better than the space-based criterion alone, and the combination of both
criteria as described above outperformed both. The error rates on the standard
UW-III database for the different approaches are as follows:
Space-based headline detection:
total no. of text lines: 138018
total no. of false positives: 7356
total no. of false negatives: 1713
% error = 6.52%

Black run-length-based headline detection:
total no. of text lines: 138018
total no. of false positives: 4341
total no. of false negatives: 1386
% error = 4.14%

Both approaches combined (using the space-based approach as a filter to remove
false positives):
total no. of text lines: 138018
total no. of false positives: 2452
total no. of false negatives: 1476
% error = 2.85%
Next we show some of the examples:

3. Text/Image Segmentation and Classification
Document image layout analysis is a crucial step in many applications related
to document images, like text extraction using optical character recognition
(OCR), reflowing documents, and layout-based document retrieval. Layout
analysis is the process of identifying layout structures by analyzing page
images. Layout structures can be physical (text, graphics, pictures, ...) or
logical (titles, paragraphs, captions, headings, ...). The identification of
physical layout structures is called physical or geometric layout analysis,
while assigning different logical roles to the detected regions is termed
logical layout analysis [3]. The task of a geometric layout analysis system is
to segment the document image into homogeneous zones, each consisting of only
one physical layout structure, and to identify their spatial relationship
(e.g. reading order). Therefore, the performance of layout analysis methods
depends heavily on the page segmentation algorithm used. A detailed
explanation of different segmentation algorithms and a comparison of their
performance can be found in [4, 5].
Another important subtask of document image analysis is the classification of
physically segmented blocks into one of a set of predefined classes. In most
cases the classification step follows the segmentation, and it is highly
desirable to evaluate the system performance on the whole
segmentation/classification task. With the help of such an evaluation, it is
easy to decide whether incorporating these steps into OCRopus would result in
improved performance. It also becomes easy to decide which segmentation
algorithm to use.
For the classification step we used the method described in [6], which we
considered the best classification method. We used only two classes, text and
non-text, which were relevant to OCRopus, instead of the eight classes
described in that paper.
We already had an implementation of various segmentation algorithms and of the
classification step. The task included re-engineering the classification
step's code and porting the whole segmentation-classification module into
OCRopus, making it use standard OCRopus classes and functions. The task has
been completed successfully, and we now have a version of the whole
segmentation-classification module in the OCR repository. It can be integrated
with OCRopus if the results and experiments turn out positive. The command
line form of the program is:
ocrclassifyanddisplay -i IMAGE -b BOUNDINGBOXFILE -o OUTPUTIMAGE
IMAGE: The image to be classified
BOUNDINGBOXFILE: The bounding box file produced by segmentation algorithms
OUTPUTIMAGE: The name of the output image to be written
Evaluation

As discussed earlier, evaluating the segmentation and classification steps
combined is highly desirable. The purpose of developing an evaluation module
was to decide which segmentation algorithm would best suit the needs of
OCRopus. We developed an evaluation program which evaluates the performance of
the two steps against the ground truth. Our criterion for the evaluation is
the Hamming distance between the text/non-text zone image produced from the
ground truth and that produced by the zone classification module. The error
rate is defined as follows:
e = HD * 100 / T
e = error rate
HD = Hamming distance between the ground-truth text/non-text image and the
actual text/non-text image
T = total no. of pixels present in the image
% efficiency = 100 - e
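
The definition above can be illustrated with a short sketch. It is not the C++
evaluation program. It assumes that both images are given as equally sized
lists of rows in which each pixel is 1 for text and 0 for non-text, and the
function name and the sample images are hypothetical.

```python
def hamming_error_rate(ground_truth, actual):
    """Return (error rate e, % efficiency) for two equally sized label images.

    Each image is a list of rows; each pixel is 1 for text, 0 for non-text.
    """
    if len(ground_truth) != len(actual):
        raise ValueError("images differ in height")
    hamming = 0  # HD: number of pixels whose labels disagree
    total = 0    # T: number of pixels in the image
    for gt_row, act_row in zip(ground_truth, actual):
        if len(gt_row) != len(act_row):
            raise ValueError("images differ in width")
        hamming += sum(g != a for g, a in zip(gt_row, act_row))
        total += len(gt_row)
    if total == 0:
        raise ValueError("images are empty")
    e = hamming * 100 / total
    return e, 100 - e


# Hypothetical 2 x 4 label images that disagree on two of their eight pixels.
truth = [[1, 1, 0, 0], [1, 1, 0, 0]]
result = [[1, 0, 0, 0], [1, 1, 0, 1]]
print(hamming_error_rate(truth, result))
```

Because every pixel counts equally, this measure reflects the area that is
labelled wrongly and is independent of how many zones the segmentation
produced.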
This program was developed in C++. The command line argument form of the
program is:

ocrevaluate -gt GROUNDTRUTHIMAGE -ai ACTUALIMAGE
GROUNDTRUTHIMAGE: Text/non-text image produced from the ground truth
ACTUALIMAGE: Text/non-text image produced by the actual program
Issue of noise cleanup: Document image noise greatly affects the performance
of segmentation algorithms. It was our view that the performance of all the
algorithms should improve after noise cleanup. A better explanation can be
found in [5]. We used the noise cleanup system explained in [7]. We also
expected the improvement in performance of simple segmentation algorithms like
X-Y cut to be larger than that of complex algorithms like Voronoi, the reason
being that X-Y cut is more affected by noise than Voronoi is. When we
evaluated the performance of these algorithms with and without noise cleanup,
we were proved correct.

Results:

The performance of three segmentation algorithms (Voronoi, Docstrum and X-Y
cut) was evaluated by our program. The results were as we had expected and
hence were quite encouraging. Below are the percentage efficiencies for all
these algorithms with and without noise cleanup.
Algorithm    Percentage efficiency      Percentage efficiency
             without noise cleanup      with noise cleanup

Voronoi      87.03                      87.69
Docstrum     86.88                      86.92
X-Y cut      80.16                      85.70

As is evident, the performance of all the algorithms increases with noise
cleanup, but the improvement was much larger for X-Y cut (from 80.16% to
85.70%) than for the other algorithms. After noise cleanup X-Y cut has an
efficiency close to that of Voronoi and, being a simple algorithm, X-Y cut
can be a good choice for OCRopus.

Conclusion:

The whole experience of working at IUPR was great. This organization has a
superb work culture, great minds and a very high quality of work. I learned a
lot about image processing and analysis. The work I was able to complete here
was very satisfying. I have tried to develop as many add-ons as possible for
OCRopus and even got very encouraging results with some of them. I hope my
work on OCRopus helps it meet its goals.

References

1. T. M. Breuel: The hOCR Microformat for OCR Workflow and Results. ICDAR,
2007, accepted for publication
2. I. Guyon, R. M. Haralick, J. J. Hull and I. T. Phillips: Datasets for OCR
and document image understanding research. In: Handbook of Character
Recognition and Document Image Analysis, World Scientific (1997) 779-799
3. R. Cattoni, T. Coianiz, S. Messelodi, C. M. Modena: Geometric layout
analysis techniques for document image understanding: a review. Technical
report, IRST, Trento, Italy (1998)*
4. F. Shafait, D. Keysers, and T. M. Breuel: Performance Comparison of Six
Algorithms for Page Segmentation. 7th IAPR Workshop on Document Analysis
Systems (DAS), pages 368-379
5. F. Shafait, D. Keysers, T. M. Breuel: Pixel-Accurate Representation and
Evaluation of Page Segmentation in Document Images. ICPR 2006, International
Conference on Pattern Recognition, pages 872-875*
6. T. M. Breuel, D. Keysers, F. Shafait: Document Image Zone Classification:
A Simple High-Performance Approach. VISAPP 2007, pages 44-51
7. T. Gupta: OCRopus addons. Tech report, IUPR, 2007
