Professional Documents
Culture Documents
Document Analysis System: Wong G. Casey Wahl
Document Analysis System: Wong G. Casey Wahl
Wong
R. G. Casey
F. M. Wahl
This paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in
encoding printed documents for computer processing. Several critical functions have been investigated and the technical
approaches are discussed. The first is the segmentation and classijication of digitized printed documents into regions of text
and images. A nonlinear, run-length smoothing algorithm has been used for this purpose.By using the regular features of text
lines, a linear adaptive classification scheme discriminates text regions from others. The second technique studied is an
adaptive approach to the recognition of the hundreds of font styles and sizes that can occur on printed documents. A
preclassifier is constructedduring the input process and used to speed up a well-known pattern-matchingmethod f o r clustering
characters from an arbitrary printsource into asmall sampleof prototypes. Experimental results are included.
Introduction
The decreasing cost of hardware will eventually make com- only expensive and time-consuming but also precludes the
monplace thestorageanddistribution of documents by encoding of graphics and images contained in the document.
electronic means. However, today most documents arebeing Current optical character recognition (OCR) machines only
saved, distributed,and presentedon paper.Paper is the recognize certain preprogrammed fonts; they do not accept
primarymedium for books, journals, newspapers, maga- documentscontaininga mixture of text and images. The
zines, and business correspondence. This paper describes an Document Analysis System thusprovides a new capability for
experimentalDocument Analysis System, which assistsa converting information stored on printed documents, includ-
user in the encoding of printed documents for computer ing both images and text, tocomputer-processable form.
processing.
Thispaper is organizedintothree sections. The first
A document analysis system could be applied to extract section gives an overview of the system requirements. Not all
informationfromprinteddocumentstocreatedata bases the components described in the system have been designed
(e.g.,patents,court decisions). Such systemscouldalso and tested yet. The next two sections present the results of
create computer document files from papers submitted to the technical investigations of some of the key components:
journals and assist in revising books and manuals. In addi- the segmentationof a scanned document intoregions of text
tion,a document analysis systemcould provide a general and images, and the recognition of text. Initial experimental
data compression facility, since encoding optically scanned data areincluded to illustrate thesetwo methods.
text from a bit-map into symbol codes greatly reduces the
cost of storing or transmitting the information. System requirements and overall structure
The design of an interactive computer system to read printed
Existing commercial systems cannot adequately convert documents was investigatedby Nagy [ 11. Some of the system
unconstrained printed documents into a format suitable for concepts used by him are reflected in the DocumentAnalysis
computer processing. Data entry by operator keying is not System, but thesolutions in mostareas aredifferent.
8 Copyright 1982 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of
royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on
the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by
computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the
Editor. 647
+"
L"" "" J
i
The system is designed to have the following capabilities: Figure 1 shows the modules of the proposed Document
Analysis System.Theshaded moduleshave been imple-
Manual editing of scanned documents(e.g., functions such mented or investigated in detail. The preliminary version of
as crop,copy, merge, erase, save, etc.). the system permits meaningful experiments but does not yet
Separating a document into regions of text and images provide full interactive capability.
automatically.
Representing text by standard symbols such as EBCDIC, Documentsarescannedasblack/white imageswitha
ASCII, etc., possibly with the assistance of user interac- commerciallaser scanner at 240 pixels per inch into a
tion. Automatic character recognition will be provided for Series/ 1 minicomputer, which then transfers the dataover a
documents printed in certain conventional fonts. network to the host IBM 3033 computer. Thedisplay device
Checking the recognized text against a dictionary. is a Tektronics 618 storage tube attached to an IBM 3277
Enhancing the images and converting graphics from bit- terminal, which communicates with the host. The system can
maps into vector format. automaticallyseparate a documentinto regions of text,
Analyzing the page layoutso that appropriate typographi- graphics, and halftone images. The technique for doing this
cal commands can be generated automatically or interac- is presented in the next section. Alternatively, the user can
tively to reproduce the document with printer subsystems. choose to perform the same function manually by interac-
Allowing the user to customize thesystem for his particu- tively editingtheimage.The user can invoke the Image
648 lar needs. Manipulation Editor (called IEDIT), which reads and stores
Other functions will be provided in future extensions to the The basic RLSA is applied to a binary sequence in which
system. For instance,thegraphicsandhalftone images white pixels are represented by 0’s and black pixels by 1’s.
separated from the document may need further processing in The algorithm transforms a binary sequencex into an output
certain applications. Thus, in a maintenance manual the text sequence y according to thefollowing rules:
labels may need revision without changing the line drawing
1. 0’s in x are changed to1’s in y if the number of adjacent
itself, and vice versa. In such cases, text information must be
0’s is less than or equal to a predefined limit C.
extracted from the image. Also, due to spatial digitization
2 . 1’s in x are unchanged in y .
errors, scanned line drawings and graphics generally have
missing or extra dots along edges. Consequently, after scan- For example, with C = 4 the sequence x is mapped intoy as
ning and thresholding, identical lines may vary in width in follows:
the binary bit-maps. Such errors can be corrected with an
image enhancement function. Scanned halftone images may x:0001000001010000100000001l000
also have moirt patterns [2]. Additional processing functions y:111100000111111110000000111ll
are required to remove the moirt patterns and convert the When applied to pattern arrays, the RLSA has theeffect of
images into gray scale format so that conventional digital linking together neighboring black areas that are separated
halftoning techniques canbe used to reproduce theimage. In by less than C pixels. With an appropriate choice of C,the
another application,existing line drawings and graphics may linked areas will be regions of acommon data type. The
have to be encoded into an engineering data base. Here, degree of linkage dependson C, the distributionof white and
conversion of thedrawingfrombit-mapsinto vector or blackin the document, and the scanning resolution. (The
polynomial representation of objects is necessary. Document Analysis System scansat 240 pixels per inch.)
IBM J. RES. DEVELOP. VOL. 26 NO. 6 NOVEMBER 1982 K. Y.WONG EiT AL.
Figure 2 (a) Block segmentation example of a mixed text/image document, which here is the original digitized document. (b) and (c) Results
of applying the RLSA in the horizontal and vertical directions. (d) Final result of block segmentation. ( e ) Results for blocks considered to be text
data (class 1).
Total number of black pixels in a segmented block ( B C ) . 3. The ratio of the number of block pixels to the areaof the
0 Minimum x-y coordinates of a block and its x , y lengths surrounding rectangle: S = B C / ( A x A y ) . If S is close to
h i " ' Ax, Y,,", AY). one, the block segment is approximately rectangular.
Total number of black pixels in original data from the 4. The meanhorizontallength of the black runs of the
block ( D C ) . original data from eachblock:
,
Horizontal white-black transitions of original data ( T C ) . Rm = DCITC.
These measurements are storedin a table (see Table 1) and
These features areused to classify the block. Because text
are used to calculate thefollowing features:
is the predominating data type in a typical office document
1. The height of each block segment: H = Ay. and text lines are basically textured stripes of approximately
2. The eccentricityof the rectangle surrounding theblock: a constant height H a n d mean length of black runs Rm,text
650 E = Ax/Ay. blocks tend to cluster with respect to these features. This is
Three methods of designing the preclassifierhave been The third approach (called strategy C) is called “critical
investigated. The topextension approach (called strategy A) node extension.” Here, prototype G ( k + 1) is entered into 653
( F 2 4 ) T l 3 I "I
time a class is added, then a network designed usingstrategy
B can have no path longer than K nodes when it has been
S f l L h ski. i n 5 6 extended to accommodate K prototypes, and typically the
growth in average path lengthis approximately logarithmic.
M t i 4 R m v a E;h
However, strategy B calls for modifying the network in a
P k U b 4 * S r t & i number of locations. As the network grows, many terminal
nodes may have to be extended in order to adda single class.
antiirmalmginCHE This effect increases both the time required for adaptation
50 000-
thi rrn
ai ai
E
z'
0 500 I000 IS00
656