Optical Character Recognition On A Grid Infrastructure
Abstract
1. Introduction
Document image analysis (DIA) is concerned with the conversion of documents in paper form into electronic formats. Optical character recognition (OCR) is a fundamental part of DIA. While DIA addresses the general problem of recognizing the graphical components of an input document and giving them semantics, OCR derives the meaning of the characters from their bitmapped images. Usually, paper documents are scanned, and the resulting document images are processed via document layout analysis and OCR. This transforms them into a structured electronic format similar to that generated by electronic authoring tools. The conversion of paper documents into electronic formats is an ongoing task on a world scale. On the one hand, a large number of documents and books still in paper format need to be converted. On the other hand, many documents, like legacy documents, are still published in paper format for a variety of reasons (see the overview [2]). The range of applications of OCR has
been expanding recently. For example, mobile electronic
devices such as cell phones and digital cameras are capable
of acquiring images at a sufficiently high resolution to faci-
Authorized licensed use limited to: Sri Vasavi Engineering College. Downloaded on December 21, 2009 at 00:25 from IEEE Xplore. Restrictions apply.
The robot analyzes each specification file in XML and updates the database entries. A simple search program picks the OCR servers that match the client's needs from the database and shows the search results. The client gathers the results and performs the recognition based on majority logic. A recent Web-based OCR system is freely available for research purposes at [8]; it assumes that the user is a human. The current trend in information technology is to build service-oriented architectures in which software components can replace the human user. In this context, a Web service for OCR is proposed in [15].
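The majority logic applied by the client can be illustrated with a minimal sketch. It assumes that every OCR server returns exactly one label per character image, so that the outputs can be compared position by position; this assumption, the tie-breaking rule, and the function names `majority_label` and `combine_outputs` are hypothetical and are not taken from the cited systems.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label; ties go to the earliest server."""
    counts = Counter(labels)
    top = max(counts.values())
    # Scan in server order so that ties are resolved deterministically.
    return next(label for label in labels if counts[label] == top)


def combine_outputs(server_outputs):
    """Combine per-character outputs of several OCR servers by voting.

    server_outputs: one sequence of labels per server, all of equal
    length (assumed: one label per character image).
    """
    lengths = {len(output) for output in server_outputs}
    if len(lengths) != 1:
        raise ValueError("need at least one output, all of equal length")
    # zip(*...) groups the labels proposed for the same character image.
    return "".join(majority_label(column) for column in zip(*server_outputs))


if __name__ == "__main__":
    # Three hypothetical servers disagree on two of four characters.
    print(combine_outputs(["grid", "gr1d", "qrid"]))
```

With three servers, a single misrecognition at any position is outvoted by the other two. Outputs of different lengths would first need an alignment step, which this sketch does not attempt.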
Current initiatives such as Google Book Search [11] and the Open Content Alliance [19] are advanced efforts to digitize millions of books. The OCRopus engine recently proposed by Google [10], intended for high-throughput, high-volume document conversion efforts, is based on two research projects: a high-performance handwriting recognizer developed in the mid-90s and novel high-performance layout analysis methods (the code will be available at the end of 2007 or the beginning of 2008).
Since OCR technology has matured to the point where the available recognition software delivers negligible error rates per page for high-quality typewritten text, efforts are now concentrated on more difficult tasks: recognizing important content, including both its semantic and structural aspects; creating flexible and modular document recognition systems that handle a vast diversity of fonts, symbols, tables, and languages; and identifying links to other documents in order to group diverse content. In [2] it is suggested that paper documents should be scanned but kept on-line in image-based form, and that OCR results should be viewed as an annotation of the document image, not as the definitive representation. Extensive work is being carried out on creating standards (like XML or SVG) for semantic and high-level representations of documents.
Test set | Mean value | Min | Max | Mean value | Min | Max | Mean value | Min | Max | Mean value | Min | Max
1 | 4407 | 3624 | 4820 | 97.74 | 97.53 | 97.95 | 746 | 650 | 814 | 93.68 | 92.38 | 95.38
2 | 3007 | 2925 | 3088 | 66.48 | 64.72 | 68.23 | 467 | 457 | 486 | 18.51 | 15.07 | 22.98
3 | 2822 | 2226 | 3196 | 98.10 | 97.76 | 98.61 | 434 | 352 | 487 | 92.91 | 92.19 | 93.75
4 | 2050 | 1707 | 2271 | 96.07 | 93.62 | 97.71 | 352 | 307 | 381 | 83.46 | 74.02 | 89.43
1 | 339.5 | 30.87 | 14 | 432 | 69.66 | 50.84 | 28.31 | 71 | 4.81 | 6.11
2 | 400.9 | 36.45 | 272 | 9913 | 60.05 | 47.21 | 16.42 | 121 | 6.62 | 81.92
3 | 128.79 | 11.71 | 181 | 2119 | 22.65 | 16.14 | 9.96 | 46 | 5.59 | 46.07
4 | 99.52 | 9.05 | 766 | 6929 | 11.75 | 9.77 | 6.79 | 72 | 8.30 | 96.36
References
[1] D. Andrews, R. Brown, C. Caldwell, et al., A Parallel Architecture for Performing Real Time Multi-Line Optical Character Recognition, in Procs. 25th SSST, 1993, pp. 533-536.
[2] T.M. Breuel, The Future of Document Imaging in the Era of Electronic Documents, in Procs. DAS VI, 2005.
[3] K. Chellapilla, M. Shilman, P. Simard, Combining Multiple Classifiers for Faster Optical Character Recognition, in Procs. DAS VII, Springer, LNCS 3872, 2006, pp. 358-367.
[4] G.S. Choudhury, T. DiLauro, R. Ferguson, M. Droettboom, I. Fujinaga, Document Recognition for a Million Books, D-Lib Magazine, Vol. 12, No. 3, 2006, www.dlib.org/dlib/march06/03contents.html
[5] A. Cuhadar, A.C. Downton, Scalable Parallel Processing Design for Real Time Handwritten OCR, in Procs. 12th IAPR Inter. Conf. Pattern Recognition, Vol. 3, 1994, pp. 339-341.
[6] M. Danelutto, S. Pelagatti, R. Ravazzolo, A. Riaudo, Parallel OCR in P3L: a case study, in High-Performance Computing and Networking, LNCS 1067, Springer, 1996, pp. 1017-1019.
[7] M. Forbes, OCHRE-P - Optical Character Recognition in Parallel, Technical report EPCC-SS95-06, 1995.
[8] Free OCR, www.123dox.com
[9] gOCR: Optical Character Recognition, 2006, sourceforge.net/projects/jocr/
[10] Google, Ocropus, 2007, code.google.com/p/ocropus/
[11] Google, Books Library Project, 2007, books.google.com/googlebooks/library.html
[12] H. Goto, OCRGrid: A Platform for Distributed and Cooperative OCR Systems, in Procs. 18th ICPR, Vol. 2, 2006, pp. 982-985.
[13] H. Goto, A Platform for Web-Based OCR Systems with Server Search Function, in Procs. DAS VII, LNCS 3872, 2006, pp. 13-16.
[14] JavaOCR, 2004, www.javaocr.com/
[15] LeadTools, Optical Character Recognition Web Service, 2007, www.leadtools.com/SDK/WEB-SERVICES/OCRSERVICE/default.htm
[16] A.P. Lenaghan, R.R. Malyan, XPEN: An XML Based Format for Distributed Online Handwriting Recognition, in Procs. 7th DAR, 2003, pp. 1270-1274.
[17] R. Mason, H. Schmidt, R. Trott, Down on the OCR Farm: How We Produced Searchable PDFs for 7 Million Documents in a Student Computer Lab, in Procs. 5th JCDL, 2005, pp. 391-391.
[18] Netpbm - graphics tools and converters, 2006, sourceforge.net/projects/netpbm
[19] Open Content Alliance, OCA, 2006, www.opencontentalliance.org
[20] G.J. Rama, A. Ramakrishnan, D. Gupta, Parallel Processing in OCR - A Multithreaded Approach, Procs. Tamil Internet, 2002, pp. 107-110.
[21] South-Eastern European Grid, 2006, www.see-grid.eu