
Third International Conference on Automated Production of Cross Media Content for Multi-channel Distribution

Optical Character Recognition on a Grid Infrastructure


Dana Petcu¹, Silviu Panica¹, Doina Banciu², Viorel Negru¹, Andrei Eckstein¹
¹ Computer Science Department, West University of Timisoara, Romania
² National Institute for Research and Development in Informatics, Romania

Abstract

The current capacity to translate paper documents quickly and accurately into machine-readable form using optical character recognition technology expands the opportunities for document searching and storage, as well as for automated document processing. A fast response in translating large collections of image-based electronic documents into structured electronic documents is still a problem. The availability of a large number of processing units in Grid environments and of free optical character recognition tools can be exploited to produce a fast translation. Following this idea, several experiments concerning optical character recognition were performed on a Grid infrastructure, and their results are reported in this paper. These results encourage further development of systems for document image analysis using Grid technologies.

1. Introduction

Document image analysis (DIA) is concerned with the conversion of documents in paper form into electronic formats. Optical character recognition (OCR) is a fundamental part of DIA. While DIA takes care of the general problem of recognizing and giving semantics to the graphical components of an input document, OCR takes care of deriving the meaning of the characters from their bit-mapped images. Usually, the paper documents are scanned, and the resulting document images are processed via document layout analysis and OCR. This transforms them into a structured electronic format similar to that generated by electronic authoring tools. The conversion of paper documents into electronic formats is an ongoing task worldwide. On the one hand, a large number of documents and books still in paper format need to be converted. On the other hand, many documents, like legacy documents, are still published in paper format for a variety of reasons (see the overview [2]). The range of applications of OCR has been expanding recently. For example, mobile electronic devices such as cell phones and digital cameras are capable of acquiring images at a sufficiently high resolution to facilitate OCR. Some mobile phones with OCR capabilities are already available on the market today. Given the low processing power of these devices, it is desirable to have high-speed OCR systems that can be used on these machines, or to access remote OCR systems.
In this context, the potential and promise of bringing together image collections in open, distributed, flexible document recognition frameworks is immense [4]. A special case is that of large-scale book image collections [11]. Recognizing this fact, the Romanian cooperation project SINRED (National System for the Management of Digital Resources in Science and Technology based on Grid Structures, 2005-2008) intends to: design and implement a virtual digital library based on the facilities of the current public libraries; design the methods and methodologies for creating a uniform system at the national level in the field of documentation based on digital documents; analyze the advantages and opportunities offered by Grid technologies in the field of digital content; adopt international approaches for library networking that are compatible with the national standards; design and implement a unique portal for accessing the digital information; define the procedures for building digital databases according to national and international regulations. This paper describes the first results of SINRED concerning the use of Grid technologies for document image analysis. Section 2 presents a short overview of the previous efforts to improve the response time of OCR systems. The experiments performed in a Grid environment are reported in Section 3.

0-7695-3030-3/07 $25.00 © 2007 IEEE
DOI 10.1109/AXMEDIS.2007.23

2. State-of-the-art concerning OCR systems


The OCR problem is well suited to parallelization or
distribution: the most computationally intensive part of
the problem is recognizing individual characters and this
requires access to particular parts of the scanned bitmap
data. Moreover, the processing associated with characters
or groups of characters can be done independently.
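As a minimal sketch of this independence (an illustration only, not an implementation taken from any of the systems surveyed here), the Python fragment below hands independent text-line units to a pool of workers. The function `recognize_line` is a hypothetical placeholder for a real recognizer; it merely normalizes a string so that the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_line(line_unit: str) -> str:
    """Hypothetical stand-in for recognizing one independent unit.

    A real system would receive the bitmap of a line, region, or
    character; a string is used here only to keep the sketch runnable.
    """
    return line_unit.strip().upper()


def recognize_page(line_units, workers=4):
    """Process every unit independently; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(recognize_line, line_units))


if __name__ == "__main__":
    print(recognize_page([" first line ", " second line "]))
```

Because no unit depends on the result of another, the same structure applies whether the unit of work is a character, a line, a page region, or a whole page.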
Over the last thirty years, the task of recognizing a document has been partitioned for parallel or distributed processing in several ways: by algorithm phase, by scanned page, by page region, by line, or by character. The


Authorized licensed use limited to: Sri Vasavi Engineering College. Downloaded on December 21, 2009 at 00:25 from IEEE Xplore. Restrictions apply.

MLOCR system [1], for example, was designed in the mid-90s for real-time processing of printed characters on letters. The system was configured in master-slave mode, and different phases (such as separation and rotation) are performed by different processors. Similarly, on a transputer machine from the same period, the implementation presented in [5] splits the postcodes up into individual characters and processes them independently within the farm that implements the preprocessing and character classification stages. OCHRE-P [7] was a prototype OCR system modelled on a task farming paradigm, in which each task acted on a line of text. Parallelism at the text line level was discussed more recently in [20]. A modern client-server solution was proposed in [16], starting from the observation that a recognition system can be logically subdivided into three components: one capturing data, one classifying data, and a knowledge base holding models of the words or characters known to the system. In [16] the data capture occurs locally on a client machine and recognition is delegated to a set of remote servers. The exchange of online data requires a protocol, and central to this is the format in which data are sent to a remote server (an XML-based format is currently used).
With the increase of computing power, the need to improve the accuracy of OCR has led to the approach of combining classifiers at the cost of increased processing. Simple classifier fusion methods such as minimum, maximum, average, median, and majority voting have been tested. The paper [3] proposes a cascade architecture for combining classifiers. Both the separate classifiers and the cascade classifiers are good candidates for limited parallelism or distribution. The availability of a large number of idle workstations in computer labs was recently exploited as an opportunity to produce searchable PDF files for material that exists only as image files [17]: using 100 workstations available approximately 12 hours a day, 7 million searchable PDF documents were generated from 42 million TIF page images, with a peak productivity of about 1 million pages per day. The workstation application determined which TIF files were available for processing by querying and updating a server-based SQL database, and processed page images one at a time, returning text and searchable PDF files to the server.
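The simple fusion rules named in this paragraph can be illustrated with a short Python sketch. It is not the method of [3]; the function names, the per-classifier score dictionaries, and the example scores are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median

# Score-level fusion rules; each maps the scores that the individual
# classifiers gave to one candidate label onto a single fused score.
RULES = {"minimum": min, "maximum": max, "average": mean, "median": median}


def fuse_scores(per_classifier_scores, rule="average"):
    """Return the label with the highest fused score.

    per_classifier_scores: one dict (label -> score) per classifier,
    all dicts sharing the same labels (an assumption of this sketch).
    """
    combine = RULES[rule]
    labels = per_classifier_scores[0].keys()
    fused = {
        label: combine([scores[label] for scores in per_classifier_scores])
        for label in labels
    }
    return max(fused, key=fused.get)


def majority_vote(predicted_labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predicted_labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Three hypothetical classifiers scoring the letter O against the digit 0.
    scores = [
        {"O": 0.90, "0": 0.10},
        {"O": 0.40, "0": 0.60},
        {"O": 0.45, "0": 0.55},
    ]
    for rule in RULES:
        print(rule, fuse_scores(scores, rule))
    votes = [max(s, key=s.get) for s in scores]
    print("majority", majority_vote(votes))
```

In this made-up example the rules disagree: one very confident classifier dominates the average, while the median and the majority vote follow the other two classifiers. Each classifier can be evaluated independently, which is why such combinations lend themselves to parallel or distributed execution.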

The OCRGrid [12, 13] is a platform for distributed and cooperative OCR systems that allows end users to search for and use OCR servers over networks. Moreover, a toolkit was built to deploy secure Web-based OCRs. The document image is sent from the client's computer to the toolkit via a Web server. Every OCR server has a specification file in XML, which is written by the server administrator. The file describes the specifications of the server, including the location (URL) of the server, the name of the OCR engine, the supported languages, the document types, etc. The portal server has a robot program for collecting the specification files automatically and periodically from the OCR servers. The robot analyzes each specification file in XML and updates the database entries. A simple search program picks the OCR servers that match the client's needs from the database and shows the search results. The client gathers the results and performs the recognition based on majority logic. A recent Web-based OCR system is freely available for research purposes at [8]. The user is supposed to be a human. The current trend in information technology is to build service-oriented architectures in which software components can replace the human user. In this context, a Web service for OCR is proposed in [15].
Current initiatives such as Google Book Search [11] and the Open Content Alliance [19] are advanced efforts to digitize millions of books. The OCRopus engine recently proposed by Google [10], intended for high-throughput, high-volume document conversion efforts, is based on two research projects: a high-performance handwriting recognizer developed in the mid-90s and novel high-performance layout analysis methods (the code will be available at the end of 2007, beginning of 2008).
Since OCR technology has matured to the point where the available recognition software delivers negligible error rates per page for high-quality typewritten text, the efforts are now concentrated on more difficult tasks: recognizing important content, including both the semantic and structural aspects; creating flexible and modular document recognition systems that recognize a vast diversity of fonts, symbols, tables, and languages; and identifying links to other documents in order to group diverse content. In [2] it is suggested that paper documents should be scanned, but that they should be kept online in image-based form, and that OCR results should be viewed as an annotation of the document image, not as the definitive representation. Extensive work is being carried out on creating standards (like XML or SVG) for semantic and high-level representations of documents.

3. Experiments on a Grid infrastructure

The aim of our experiments is to prove that the response time of a state-of-the-art OCR system applied to a large collection of image-based electronic documents can be considerably improved if Grid technologies are involved. In our first experiments, which are reported here, we consider that the Grid architecture is a computational Grid. In particular, the tests were performed on the European SEE-Grid infrastructure [21] with around 750 CPUs.
While commercial OCR systems provide very low error rates in the character recognition process, their use on a Grid infrastructure is not an option. Free OCR systems should be used. When using a gLite-based infrastructure like SEE-Grid, the selected OCR system should work on Linux. There are several free OCR systems that are available on Linux platforms, like gOCR, Ocrad, ocre, ClaraOCR, and OCRchie. We selected gOCR [9] for several reasons: (a) recently
released version; (b) operation from the command line; (c) an executable that is portable without special libraries; (d) good reviews in the literature about its quality; (e) the availability of libgocr, a library with the functionality needed to develop an OCR engine (for further experiments, like collaborative engines). The test pages, containing only text, were scanned at 600 dpi. Four sets of scanned pages of different qualities were considered. Each set contains 11 pages from a book: (1) a book from 1993 edited in Germany; (2) a book from 1996 edited in Romania by a photocopy technique; (3) a book from 1999 edited in Germany in the LNCS series; (4) a book from 2003 edited in the UK containing Harry Potter stories. The images were stored as gray GIF images. Samples of these pages, as well as the text recognized by gOCR, are shown in Figure 1. Giftopnm, part of Netpbm [18], was used to convert a GIF file into a PNM image, the classical input of gOCR. As expected, the error rate of gOCR depends on the quality of the input. This error rate is high for the second test set and is acceptable for the other test sets (see Table 1). While ten years ago applying an OCR system to a page was a time-consuming task (in [6], for example, for a stream of input pages of 1135 characters, the average time taken to compute a single page was 18160 sec, and a transputer reduced it to 450 sec), currently the text can be obtained in a few seconds. For the test pages, gOCR responded in an interval between 9 and 50 seconds. We noticed that the response times and the error rate of the recognized text are far from optimal: the system from [8], for example, allows faster and more accurate OCR, despite being a Web-based system. But gOCR is still the best candidate for Grid-enabling the OCR of the books. Applying parallelism or distribution at line, region, or character level is not under discussion, since the times obtained by applying current OCR systems to the full page are too short.

Figure 1. Sample from the test pages

Table 1. Rates of recognizing characters

Measure                                 Statistic    Set 1    Set 2    Set 3    Set 4
Number of characters per page           Mean value   4407     3007     2822     2050
                                        Minimum      3624     2925     2226     1707
                                        Maximum      4820     3088     3196     2271
Percent of recognized chars per page    Mean value   97.74    66.48    98.10    96.07
                                        Minimum      97.53    64.72    97.76    93.62
                                        Maximum      97.95    68.23    98.61    97.71
Number of words per page                Mean value   746      467      434      352
                                        Minimum      650      457      352      307
                                        Maximum      814      486      487      381
Percent of recognized words per page    Mean value   93.68    18.51    92.91    83.46
                                        Minimum      92.38    15.07    92.19    74.02
                                        Maximum      95.38    22.98    93.75    89.43

A script was built to launch, in parallel on the SEE-Grid infrastructure, the tasks of applying gOCR to the test images from a set. The packages that were sent remotely using gLite facilities included the gOCR executable, the Netpbm library, and the GIF images. As expected, there is a considerable speedup of the response time of processing each test set (see Table 2), despite the fact that the local workstation on which the test sets were processed sequentially

is more powerful than the individual workstations from the SEE-Grid infrastructure where the tasks were run. As the books were not completely transformed into image-based documents, the times needed for processing them entirely were only estimated. The estimation takes into account a possible temporary limitation of the number of processing units in the SEE-Grid environment (128 CPUs out of the 750 CPUs that are available in SEE-Grid).

Table 2. Response times (s) and speedup

Group                  Measure                         Set 1    Set 2    Set 3    Set 4
Time on user machine   Total time/set                  339.5    400.9    128.79   99.52
                       Mean time/page                  30.87    36.45    11.71    9.05
                       Pages of the book               14       272      181      766
                       Expected time/book              432      9913     2119     6929
Time in Grid           User time/set                   69.66    60.05    22.65    11.75
                       Mean time/page                  50.84    47.21    16.14    9.77
                       Min. time/page                  28.31    16.42    9.96     6.79
                       Expected time/book              71       121      46       72
Speedup                Per set, 11 CPUs                4.81     6.62     5.59     8.30
                       Expected/book, max. 128 CPUs    6.11     81.92    46.07    96.36
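The arithmetic behind Table 2 can be retraced with a few lines of Python. The paper does not state the exact formulas, so the definitions below are assumptions: the mean sequential time per page is the total time of a set divided by its 11 pages, the expected sequential time for a book is that mean multiplied by the number of pages, and each speedup is the ratio of the time on the user machine to the time in the Grid. The expected Grid time per book is taken from the table as given, since the way it was estimated is not described.

```python
PAGES_PER_SET = 11  # each test set holds 11 scanned pages

# Values copied from Table 2 (seconds, except the page counts).
TABLE2 = {
    1: {"seq_set": 339.5, "book_pages": 14, "grid_set": 69.66, "grid_book": 71},
    2: {"seq_set": 400.9, "book_pages": 272, "grid_set": 60.05, "grid_book": 121},
    3: {"seq_set": 128.79, "book_pages": 181, "grid_set": 22.65, "grid_book": 46},
    4: {"seq_set": 99.52, "book_pages": 766, "grid_set": 11.75, "grid_book": 72},
}


def derived_figures(row):
    """Recompute the derived entries of one Table 2 column (assumed formulas)."""
    mean_page = row["seq_set"] / PAGES_PER_SET
    seq_book = mean_page * row["book_pages"]
    return {
        "mean_time_per_page": mean_page,
        "expected_seq_time_book": seq_book,
        "speedup_set": row["seq_set"] / row["grid_set"],
        "speedup_book": seq_book / row["grid_book"],
    }


if __name__ == "__main__":
    for test_set, row in TABLE2.items():
        figures = derived_figures(row)
        print(test_set, {name: round(value, 2) for name, value in figures.items()})
```

Under these assumptions the recomputed values agree with the reported ones to within a few percent; the remaining differences may come from rounding of the printed times or from overheads that the table does not itemize.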
Furthermore, a user interface has been built: the gOCR component of GOC, the Grid Operations Center. GOC is a Grid portal operating on the authors' academic sites which ensures the interface between the Grid and the gOCR component, on one side, and the user (human or code) on the other side. GOC should be seen as a Grid user interface that delivers access to the Grid infrastructure using components specialized in solving one or more types of problems. For example, in our case the gOCR component makes the gOCR program available on the Grid. GOC has three main parts (Figure 2): (1) the authentication component, which ensures user authentication to the portal and also, using digital certificates, authorization to the Grid infrastructure; (2) the Grid engine component, which makes possible the connection to the Grid infrastructure using specific Grid API components (in our case, the gLite API); (3) the gOCR component, which delivers a Grid-enabled interface to the gOCR analyzer.
The authentication component is in charge of user authentication and authorization, using either password authentication (if using a local Grid infrastructure) or certificate-based authentication (if using a global Grid infrastructure, like SEE-Grid). It also provides user authorization to certain components: each user is allowed to use only the specific components that are assigned to them. The Grid engine component makes the connection between the user interface portal and the Grid infrastructure using the API specific to that infrastructure. In our case we developed this portal for the gLite infrastructure used in the SEE-Grid project. To submit a job to a gLite infrastructure the user must do the following: (a) obtain a digitally signed gLite VOMS certificate for access to a gLite infrastructure UI (User Interface); (b) create a JDL (Job Description Language) file for the application in order to launch it on the Grid; (c) submit the JDL file to the Grid using the gLite user utilities available on the gLite UI; (d) check the status of the job; (e) retrieve the job output using the gLite user utilities available on the gLite UI. Through the gOCR component the user can: register new job sessions; upload image files that need to be analyzed; use the gOCR code or their own particular code (the user can upload the executable code, the list of parameters, and other external files). If all the above steps were completed successfully, the gOCR component will proceed to create a package (to adapt the gOCR code or the user's own code for the Grid) and then, at the user's command, will send this package to the Grid engine component. After the Grid job has run successfully, the user can send a retrieve command to obtain the output from the Grid and, finally, download the retrieved output along with the log file (useful in the case of errors). The GOC user interface can be found at http://ui01.info.uvt.ro:8080/nGOC/, and the test sets and scripts at http://web.info.uvt.ro/petcu/Grid-gOCR.zip.

Figure 2. GOC-gOCR: functionality; interface

4. Conclusions and future directions

The first experiments done in the frame of SINRED revealed that a computational Grid infrastructure can be efficiently used to speed up the translation of large sets of image-based documents into structured documents that are easy to discover, search, and process. The next step will consist in experimenting with combined classifiers on a computational Grid infrastructure. Taking into consideration the current evolution of Grid architectures towards service-oriented architectures, the design of Web services wrapping existing OCR systems, other document analysis tools, or combined classifiers, as well as of Web services written using Java OCR [14], is under discussion.
Acknowledgment. This work is supported by RO-CEEXI 729/2005 SINRED and RO-CEEX-II 5919/2006 GRAI.

References
[1] D. Andrews, R. Brown, C. Caldwell, et al., A Parallel Architecture for Performing Real Time Multi-Line Optical Character Recognition, in Procs. 25th SSST, 1993, pp. 533-536.
[2] T.M. Breuel, The Future of Document Imaging in the Era of Electronic Documents, in Procs. DAS VI, 2005.
[3] K. Chellapilla, M. Shilman, P. Simard, Combining Multiple Classifiers for Faster Optical Character Recognition, in Procs. DAS VII, Springer, LNCS 3872, 2006, pp. 358-367.
[4] G.S. Choudhury, T. DiLauro, R. Ferguson, M. Droettboom, I. Fujinaga, Document Recognition for a Million Books, D-Lib Magazine, Vol. 12, No. 3, 2006, www.dlib.org/dlib/march06/03contents.html
[5] A. Cuhadar, A.C. Downton, Scalable Parallel Processing Design for Real Time Handwritten OCR, in Procs. 12th IAPR Inter. Conf. Pattern Recognition, Vol. 3, 1994, pp. 339-341.
[6] M. Danelutto, S. Pelagatti, R. Ravazzolo, A. Riaudo, Parallel OCR in P3L: a case study, in High-Performance Computing and Networking, LNCS 1067, Springer, 1996, pp. 1017-1019.
[7] M. Forbes, OCHRE-P Optical Character Recognition in Parallel, Technical report EPCC-SS95-06, 1995.
[8] Free OCR, www.123dox.com
[9] gOCR: Optical Character Recognition, 2006, sourceforge.net/projects/jocr/
[10] Google, Ocropus, 2007, code.google.com/p/ocropus/
[11] Google, Books Library Project, 2007, books.google.com/googlebooks/library.html
[12] H. Goto, OCRGrid: A Platform for Distributed and Cooperative OCR Systems, in Procs. 18th ICPR, Vol. 2, 2006, pp. 982-985.
[13] H. Goto, A Platform for Web-Based OCR Systems with Server Search Function, in Procs. DAS VII, LNCS 3872, 2006, pp. 13-16.
[14] JavaOCR, 2004, www.javaocr.com/
[15] LeadTools, Optical Character Recognition Web Service, 2007, www.leadtools.com/SDK/WEB-SERVICES/OCRSERVICE/default.htm
[16] A.P. Lenaghan, R.R. Malyan, XPEN: An XML Based Format for Distributed Online Handwriting Recognition, in Procs. 7th DAR, 2003, pp. 1270-1274.
[17] R. Mason, H. Schmidt, R. Trott, Down on the OCR Farm: How We Produced Searchable PDFs for 7 Million Documents in a Student Computer Lab, in Procs. 5th JCDL, 2005, pp. 391-391.
[18] Netpbm - graphics tools and converters, 2006, sourceforge.net/projects/netpbm
[19] Open Content Alliance, OCA, 2006, www.opencontentalliance.org
[20] G.J. Rama, A. Ramakrishnan, D. Gupta, Parallel Processing in OCR A Multithreaded approach, Procs. Tamil Internet, 2002, 107-110.
[21] South-Eastern European Grid, 2006, www.see-grid.eu
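As an illustration of step (b) of the job submission procedure described in Section 3, the Python sketch below generates a JDL description for a single page. It is a sketch under stated assumptions, not the package format produced by GOC: the wrapper script name `run_gocr.sh`, the file names, and the helper `build_jdl` are hypothetical, while `Executable`, `Arguments`, `StdOutput`, `StdError`, `InputSandbox`, and `OutputSandbox` are standard JDL attributes.

```python
def jdl_list(items):
    """Format a Python list as a JDL list of quoted strings."""
    return "{" + ", ".join(f'"{item}"' for item in items) + "}"


def build_jdl(executable, arguments, input_files, output_files):
    """Return the text of a minimal JDL job description.

    The input sandbox carries everything the job needs on the worker
    node; the output sandbox names the files to bring back afterwards.
    """
    returned = ["std.out", "std.err"] + list(output_files)
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f"InputSandbox = {jdl_list(input_files)};",
        f"OutputSandbox = {jdl_list(returned)};",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical wrapper that would convert the GIF with giftopnm and
    # then call gocr on the resulting PNM file.
    print(build_jdl(
        executable="run_gocr.sh",
        arguments="page01.gif",
        input_files=["run_gocr.sh", "gocr", "giftopnm", "page01.gif"],
        output_files=["page01.txt"],
    ))
```

The input sandbox mirrors the package contents mentioned in Section 3 (the gOCR executable, the Netpbm converter, and the image). Steps (a) and (c)-(e) are then carried out with the command-line utilities of the gLite UI; their exact names depend on the gLite release and are not assumed here.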