A More Technical Redsdport

OCRopus Addons
Internship Report
Submitted to:
Image Understanding and Pattern Recognition Lab
German Research Center for Artificial Intelligence
Kaiserslautern, Germany
Submitted by:
Ambrish Dantrey, B.Tech. III year, E&CE
Indian Institute of Technology, Roorkee
Roorkee, India
Supervisors: Faisal Shafait, Ilya Mezhirov
Reviewer: Prof. Dr. Thomas Breuel
Start Date for Internship: 15th May, 2007
End Date for Internship: 27th July, 2007
Report Date: 27th July, 2007
Preface
This report documents the work done during the summer internship at the Image
Understanding and Pattern Recognition (IUPR) Lab, Deutsches Forschungszentrum
für Künstliche Intelligenz (DFKI), Germany, under the supervision of Prof. Dr.
Thomas Breuel. The report first gives an overview of the tasks completed
during the internship, with technical details. The results obtained are then
discussed and analyzed.
The report also elaborates on future work that can be pursued as an
advancement of the current work.
I have tried my best to keep the report simple yet technically correct. I hope
I have succeeded in my attempt.
Ambrish Dantrey
Acknowledgments
Simply put, I could not have done this work without the great deal of help I
received cheerfully from the whole of IUPR. The work culture in IUPR is really
motivating. Everybody here is such a friendly and cheerful companion that work
stress never comes in the way.
I would especially like to thank Dr. Thomas Breuel and Dr. Daniel Keysers for
providing the nice ideas to work upon. Not only did they advise me about my
project, but listening to their discussions in the IPeT meetings has also
evoked a strong interest in image analysis. I am also highly indebted to my
supervisors Faisal Shafait and Ilya Mezhirov, who seemed to have solutions to
all my problems.
Author
Abstract
The report presents the three tasks completed during the summer internship at
IUPR, which are listed below:
1. Detection of headlines in document images using black run lengths, and
evaluation of OCRopus performance in detecting headlines
2. Re-engineering the zone classification module
3. Evaluation of the performance of different segmentation algorithms
All these tasks have been completed successfully and the results were
according to expectations. The detection of headlines achieved a low error
rate of 2.85%, as against 6.52% for the previously used method. During the
evaluation of segmentation algorithms, X-Y cut was found to gain a lot from
noise cleanup, which is an interesting result as it strengthens the claim of
the X-Y cut segmentation algorithm as a suitable method for OCRopus. The
re-engineering and porting of the zone classification module to OCRopus makes
it possible for OCRopus to have text/image segmentation if it is required in
future.
Author
OCRopus: Introduction
Though the field of optical character recognition (OCR) is considered to be
widely explored, the development of an efficient system for use in real-world
situations still remains a challenge for developers. OCRopus is a
state-of-the-art document analysis and OCR system, featuring pluggable layout
analysis, pluggable character recognition, statistical natural language
modeling and multilingual capabilities, and is being developed at IUPR. This
being a very big project, I was assigned the tasks of developing tools for
layout analysis and evaluation.
The Goals:
The following goals were set as I proceeded in my work:
1. Conversion of ground truth data in the MARG database from XML format
to hOCR microformat [1].
2. Development of a rule-based headline detection method using the median
black run length of the lines.
3. Development of a segmentation-classification module and evaluation of
the performance of different segmentation algorithms against noise.
1. XML to hOCR:
hOCR is a format for representing OCR output, including layout information,
character confidences, bounding boxes, and style information. It embeds this
information invisibly in standard HTML. By building on standard HTML, it
automatically inherits well-defined support for most scripts, languages, and
common layout options. Furthermore, unlike previous OCR formats, the
recognized text and OCR-related information coexist in the same file and
survive editing and manipulation. hOCR markup is independent of the
presentation.
Due to all the above qualities of the hOCR format, it is highly desirable to
have ground truth in this format. I was assigned the task of converting the
MARG database ground truth into hOCR format. For this purpose I have written
the following script.
Script Name: xmltohocr
Language Used: Python
Command line argument form: xmltohocr FILE.XML
FILE.XML: The file in XML format to be converted into hOCR microformat.
Note: The script does not take care of LaTeX characters yet. It would be an
improvement to incorporate this feature.
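To illustrate the kind of conversion involved, here is a minimal sketch in
Python. It is not the xmltohocr script itself. The input schema (a `page`
element holding `zone` elements with `x0`, `y0`, `x1`, `y1` attributes) and
the function name `zones_to_hocr` are invented for illustration and do not
reflect the actual MARG XML layout. The output uses the hOCR convention of
storing geometry as `bbox x0 y0 x1 y1` in the `title` attribute.

```python
import xml.etree.ElementTree as ET
from html import escape


def zones_to_hocr(xml_text):
    """Convert a hypothetical zone-level XML ground truth into hOCR markup."""
    root = ET.fromstring(xml_text)
    parts = ["<html><body>", "<div class='ocr_page'>"]
    for zone in root.iter("zone"):  # 'zone' is an assumed element name
        # hOCR keeps the bounding box in the title attribute.
        bbox = " ".join(zone.get(key) for key in ("x0", "y0", "x1", "y1"))
        text = escape(zone.text or "")
        parts.append(
            "<div class='ocr_carea' title='bbox %s'>%s</div>" % (bbox, text)
        )
    parts += ["</div>", "</body></html>"]
    return "\n".join(parts)


sample = (
    "<page><zone x0='10' y0='20' x1='300' y1='60'>"
    "A Title &amp; More</zone></page>"
)
hocr = zones_to_hocr(sample)
print(hocr)
```

Because the text stays ordinary HTML, the result can be opened in a browser or
edited by hand while the bounding boxes remain attached to each zone.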
2. Headline detection based on black run length and its integration
into OCRopus:
Detection of headlines in document images is an issue that is mostly
overlooked, yet it is highly desirable for properly formatting the output of
OCR. Until now, OCRopus had used a rule-based method which used the space
between lines as the criterion for detection of headlines. Though this method
worked for many images, it also failed many times. It was an obvious
observation that the black run lengths of headlines are larger than those of
normal lines, and we tried to build upon this concept. We used the median
black run length of a line as the deciding criterion. The median was used
instead of the mean because the mean run length could easily have been
affected by noise merging with the text and would have produced errors.
The whole approach is simple, as discussed below:
1. Calculate the median black run length for each line on the page.
2. Compare this run length for each line with the lines below and above it.
3. If the black run length for a line is found to be K1 (a parameter) times
the median run length of the line below it, and K2 (another parameter) times
the median run length of the line above it, set it as a headline.
The values of the parameters K1 and K2 were to be found experimentally. After
evaluating the performance of the program many times, the values of K1 and K2
have been set to 1.5 and 1.1 respectively.
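The three steps above can be sketched as follows. This is a Python
illustration, not the C++ code integrated into OCRopus. It assumes that "K1
times" means "at least K1 times", and that the first and last lines of a page
are skipped because they lack one neighbour; neither detail is stated in the
report. The function name `detect_headlines` is hypothetical.

```python
K1 = 1.5  # factor relative to the line below (value from the report)
K2 = 1.1  # factor relative to the line above (value from the report)


def detect_headlines(medians, k1=K1, k2=K2):
    """Return indices of lines flagged as headlines.

    medians[i] is the median black run length of line i, top to bottom.
    """
    headlines = []
    # Assumption: lines without both neighbours are not tested.
    for i in range(1, len(medians) - 1):
        above, below = medians[i - 1], medians[i + 1]
        # Assumption: "K times" is read as "at least K times".
        if medians[i] >= k1 * below and medians[i] >= k2 * above:
            headlines.append(i)
    return headlines


page = [3, 6, 3, 3, 3]  # made-up median run lengths for five lines
print(detect_headlines(page))
```

With these made-up values only line 1 is flagged, since its median run length
is twice that of both neighbours.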
We used a histogram-based method to find the median run length. A histogram of
the number of occurrences versus run length was calculated; once we have such
a histogram, we normalize it with the largest value of occurrence. Then we
calculated the cumulative distribution function for this normalized histogram.
The point where the cumulative distribution function reaches a value of 0.5
corresponds to the median run length.
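A minimal Python sketch of a histogram-based median follows. It makes two
assumptions that the report does not confirm: runs are measured horizontally
along each pixel row, and the cumulative count is compared against half of the
total number of runs, so that the cumulative distribution ends at 1. The
report describes normalizing by the largest occurrence value instead. The
function names are hypothetical.

```python
from collections import Counter


def black_run_lengths(row):
    """Lengths of consecutive runs of black pixels (value 1) in one row."""
    runs, current = [], 0
    for pixel in row:
        if pixel:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:  # a run that touches the right edge
        runs.append(current)
    return runs


def median_run_length(line_image):
    """Median black run length of a text line given as a list of pixel rows."""
    histogram = Counter()
    for row in line_image:
        histogram.update(black_run_lengths(row))
    total = sum(histogram.values())
    if total == 0:
        return 0  # no black pixels in this line
    cumulative = 0
    for length in sorted(histogram):
        cumulative += histogram[length]
        # Equivalent to the cumulative distribution reaching 0.5.
        if 2 * cumulative >= total:
            return length


line = [
    [0, 1, 1, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 1, 1],
]
print(median_run_length(line))
```

Integer counts are used instead of floating-point fractions so that the 0.5
threshold is not affected by rounding.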
The program for detection of headlines was written in C++ and used standard
OCRopus classes. The program has been successfully integrated into OCRopus.
Evaluation:
We also designed a tool which evaluates the performance of OCRopus in
detecting headlines. In accordance with OCRopus standards, this tool has been
developed to work with files in hOCR microformat. This tool consists of two
programs:
1. The first program takes the OCRopus output and the corresponding ground
truth file in hOCR format and outputs the total number of false positives and
false negatives which occurred in detection. It also outputs the total number
of true headlines which are present in the ground truth. The command line
form of this program is:
headlineeval hOCRtrue hOCRactual
2. The second program parses the file produced by running the above program
on a large number of files (or on a database), counts the total number of
false positives and false negatives that occurred in the whole database, and
reports the error rate of OCRopus on the whole database. The command line
form of this program is:
count_errors FILE.TXT
Both of the above programs were written in Python.
Criteria for evaluation: For evaluating the performance of OCRopus in
detection of headlines we define the error rate as:
e = (fp + fn) / T
e = percentage error
fp = total number of false positives
fn = total number of false negatives
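The error computation can be sketched in Python as below. The report does not
define T, so the sketch treats it as the total number of text lines
evaluated, which is an assumption. It also multiplies by 100 because e is
described as a percentage. The function names and the sample line indices are
hypothetical and are not taken from the headlineeval or count_errors
programs.

```python
def headline_errors(true_headlines, detected_headlines):
    """Count false positives and false negatives between two sets of lines."""
    truth, found = set(true_headlines), set(detected_headlines)
    fp = len(found - truth)  # flagged as headline but not in ground truth
    fn = len(truth - found)  # headline in ground truth but not flagged
    return fp, fn


def error_rate(fp, fn, total):
    """Percentage error e = 100 * (fp + fn) / T."""
    # Assumption: total (T) is the number of text lines evaluated.
    return 100.0 * (fp + fn) / total


fp, fn = headline_errors(true_headlines=[2, 9], detected_headlines=[2, 5])
e = error_rate(fp, fn, total=20)
print(fp, fn, e)
```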
We evaluated the performance on the standard University of Washington III
(UW-III) database [2]. The results for the headline detection program showed
clearly that the median black run length criterion is better than the
space-between-lines criterion, yet errors were still present. While visually
analyzing the output, an observation was made that the run-length-based and
space-based criteria produced different false negatives and false positives.
Hence it was clear that one of the methods can be used to remove the errors
produced by the other. So we tried to combine both approaches in such a way
that the space-based criterion is used as a filter to detect false positives
produced by the run-length-based criterion. The rule which was used to
combine them was as follows:
1. Use the run-length-based criterion to find the headlines.
2. Calculate the median black run length for the whole page.
3. Compare the median black run length of all lines found to be headlines in
step 1 with the median black run length of the page. Since the median black
run length of the page represents just a simple line, not a headline, if any
headline found in step 1 has a run length less than or equal to the run
length for the whole page, it is a suspicious case. Recheck this line with
the space-based criterion.
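The combination rule above can be sketched in Python as follows. The
space-based criterion is not specified in detail in this report, so it is
represented by a caller-supplied function, `passes_space_test`, which is a
hypothetical name. The sketch only shows how the filter is applied to
suspicious candidates.

```python
def combine_criteria(candidates, line_medians, page_median, passes_space_test):
    """Filter run-length headline candidates with a space-based recheck.

    candidates: line indices flagged by the run-length criterion (step 1).
    line_medians: median black run length of every line on the page.
    page_median: median black run length of the whole page (step 2).
    passes_space_test: hypothetical callable, True if the space-based
        criterion also accepts the line as a headline.
    """
    confirmed = []
    for i in candidates:
        suspicious = line_medians[i] <= page_median  # step 3
        if suspicious and not passes_space_test(i):
            continue  # treated as a false positive and dropped
        confirmed.append(i)
    return confirmed


medians = [3, 6, 3, 3, 3, 3]  # made-up values
kept = combine_criteria([1, 4], medians, page_median=3,
                        passes_space_test=lambda i: False)
print(kept)
```

Candidates whose run length clearly exceeds the page median are kept without
consulting the space-based criterion, so that criterion can only remove
detections, never add them.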
Results:
The results were as expected. The run-length-based criterion alone performed
better than the space-based criterion alone, and a combination of both
criteria as described above outperformed both. The error rates on the
standard UW-III database for the different approaches are as follows:
Space-based headline detection:
total number of text lines: 138018
total number of false positives: 7356.0
total number of false negatives: 1713.0
% error = 6.52%
Black run-length-based headline detection:
% error = 4.14%
Both approaches combined (using the space-based approach as a filter to
remove false positives):
% error = 2.85%
Next we show some of the examples:
3. Text/Image Segmentation and Classification
Document image layout analysis is a crucial step in many applications related
to document images, like text extraction using optical character recognition
(OCR), reflowing documents, and layout-based document retrieval. Layout
analysis is the process of identifying layout structures by analyzing page
images. Layout structures can be physical (text, graphics, pictures, ...) or
logical (titles, paragraphs, captions, headings, ...). The identification of
physical layout structures is called physical or geometric layout analysis,
while assigning different logical roles to the detected regions is termed
logical layout analysis [3]. The task of a geometric layout analysis system
is to segment the document image into homogeneous zones, each consisting of
only one physical layout structure, and to identify their spatial
relationship (e.g. reading order). Therefore, the performance of layout
analysis methods depends heavily on the page segmentation algorithm used. A
detailed explanation of different segmentation algorithms and their
performance comparison can be found in [4, 5].
Another important subtask of document image analysis is the classification of
physically segmented blocks into one of the predefined classes. In most cases
the classification step follows the segmentation, and it is highly desirable
to evaluate the system performance on the whole segmentation/classification
task. With the help of such an evaluation, it is easy to decide whether
incorporating these steps in OCRopus would result in improved performance. It
would also be easy to decide which segmentation algorithm to use.
For the classification step we used the method described in [6], this being
the best classification method. We used only two classes, text and non-text,
which were relevant to OCRopus, instead of the eight classes described in
that paper.
We already had an implementation of various segmentation algorithms and of
the classification step. The task included re-engineering the classification
step's code and porting the whole segmentation-classification module into
OCRopus, making it use standard OCRopus classes and functions. The task has
been completed successfully, and we now have a version of the whole
segmentation-classification module in the OCR repository; it can be
integrated with OCRopus if the results of the experiments are positive. The
command line form of the program is:
ocrclassifyanddisplay -i IMAGE -b BOUNDINGBOXFILE -o OUTPUTIMAGE
IMAGE: The image to be classified
BOUNDINGBOXFILE: The bounding box file produced by the segmentation
algorithms
OUTPUTIMAGE: The name of the output image to be written
Evaluation
As discussed earlier, evaluating the segmentation and classification steps
combined is highly desirable. The purpose of developing an evaluation module
was to decide which segmentation algorithm would best suit the needs of
OCRopus. We developed an evaluation program which evaluates the performance
of the two steps against the ground truth. Our criterion for the evaluation
is the Hamming distance between the text/non-text zone image produced from
the ground truth and that from the zone classification module. The error rate
is defined as follows:
e = HD * 100 / T
e = error rate
HD = Hamming distance between the ground truth text/non-text image and the
actual text/non-text image
T = total number of pixels present in the image
% efficiency = 100 - e
This program was developed in C++. The command line argument form of the
program is:
ocrevaluate -gt GROUNDTRUTHIMAGE -ai ACTUALIMAGE
GROUNDTRUTHIMAGE: Text/non-text image produced from the ground truth
ACTUALIMAGE: Text/non-text image produced by the actual program
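The error rate and efficiency defined above can be sketched in Python as
follows. This is an illustration, not the C++ evaluation program. It assumes
both images are given as equally sized lists of rows holding one label per
pixel (for example 1 for text and 0 for non-text). The function name
`hamming_error_rate` is hypothetical.

```python
def hamming_error_rate(ground_truth, actual):
    """Return (e, efficiency) for two equally sized label images."""
    if len(ground_truth) != len(actual) or any(
        len(g_row) != len(a_row) for g_row, a_row in zip(ground_truth, actual)
    ):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in ground_truth)  # T: number of pixels
    if total == 0:
        raise ValueError("images must not be empty")
    # HD: number of pixel positions where the two labels differ.
    hd = sum(
        g != a
        for g_row, a_row in zip(ground_truth, actual)
        for g, a in zip(g_row, a_row)
    )
    e = hd * 100.0 / total
    return e, 100.0 - e


truth = [[1, 1, 0, 0], [1, 1, 0, 0]]
result = [[1, 1, 0, 0], [1, 0, 0, 1]]
e, efficiency = hamming_error_rate(truth, result)
print(e, efficiency)
```

In this made-up example two of the eight pixels disagree, which gives an
error rate of 25% and an efficiency of 75%.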
Issue of noise cleanup: Document image noise greatly affects the performance
of segmentation algorithms. It was our view that the performance of all the
algorithms should improve after noise cleanup. A better explanation can be
found in [5]. We used the noise cleanup system explained in [7]. We also
expected the improvement in performance of simple segmentation algorithms
like X-Y cut to be greater than that of complex algorithms like Voronoi, the
reason being that X-Y cut is more affected by noise than Voronoi is. When we
evaluated the performance of these algorithms with and without noise cleanup,
we were proved correct.
Results:
The performance of three segmentation algorithms (Voronoi, Docstrum and X-Y
cut) was evaluated by our program. The results were as we had expected and
hence were quite encouraging. Below are the percentage efficiencies for all
these algorithms with and without noise cleanup.
Algorithm    Percentage efficiency       Percentage efficiency
             without noise cleanup       with noise cleanup
Voronoi      87.03                       87.69
Docstrum     86.88                       86.92
X-Y cut      80.16                       85.70
As is evident, the performance of all the algorithms increases with noise
cleanup, but the improvement was much greater for X-Y cut than for the other
algorithms. After noise cleanup, X-Y cut has an efficiency close to that of
Voronoi, and being a simple algorithm, X-Y cut can be an optimum choice for
OCRopus.
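The size of the improvement for each algorithm follows directly from the
efficiencies reported above, as this short Python calculation shows. The
variable names are only illustrative.

```python
# (efficiency without noise cleanup, efficiency with noise cleanup), in percent
efficiency = {
    "Voronoi": (87.03, 87.69),
    "Docstrum": (86.88, 86.92),
    "X-Y cut": (80.16, 85.70),
}

# Gain in percentage points from noise cleanup, rounded to two decimals.
gains = {
    name: round(after - before, 2)
    for name, (before, after) in efficiency.items()
}
largest = max(gains, key=gains.get)
print(gains, largest)
```

The gain is 0.66 percentage points for Voronoi, 0.04 for Docstrum and 5.54
for X-Y cut.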
Conclusion:
The whole experience of working at IUPR was great. This organization has a
superb work culture, great minds and a very high quality of work. I learned a
lot about image processing and analysis. The work I could complete here was
very satisfactory. I have tried to develop as many add-ons as possible for
OCRopus and even got very encouraging results with some of them. I hope my
work on OCRopus helps it meet its goals.
References
1. T. M. Breuel: The hOCR Microformat for OCR Workflow and Results. ICDAR,
2007, accepted for publication
2. I. Guyon, R. M. Haralick, J. J. Hull and I. T. Phillips: Data sets for OCR
and document image understanding research. In: Handbook of Character
Recognition and Document Image Analysis, World Scientific (1997) 779-799
3. R. Cattoni, T. Coianiz, S. Messelodi, C. M. Modena: Geometric layout
analysis techniques for document image understanding: a review. Technical
report, IRST, Trento, Italy (1998)
4. F. Shafait, D. Keysers and T. M. Breuel: Performance Comparison of Six
Algorithms for Page Segmentation. 7th IAPR Workshop on Document Analysis
Systems (DAS), pages 368-379
5. F. Shafait, D. Keysers, T. M. Breuel: Pixel-Accurate Representation and
Evaluation of Page Segmentation in Document Images. ICPR 2006, International
Conference on Pattern Recognition, pages 872-875
6. T. M. Breuel, D. Keysers, F. Shafait: Document Image Zone Classification:
A Simple High-Performance Approach. VISAPP 2007, pages 44-51
7. T. Gupta: OCRopus addons. Tech report, IUPR, 2007
