
IMAGE SEGMENTATION OF HISTORICAL DOCUMENTS
Carlos A.B. Mello and Rafael D. Lins
Department of Electronics and Systems – UFPE – Brazil
{cabm, rdl}@cin.ufpe.br

ABSTRACT

This paper presents a new entropy-based segmentation algorithm for images of documents. The algorithm is used to eliminate the noise inherent to the paper itself, especially in documents written on both sides. It generates good quality monochromatic images, increasing the hit rate of commercial OCR tools.
I. INTRODUCTION

We are interested in the processing and automatic transcription of historical documents from the nineteenth century onwards. Image segmentation [2] of this kind of document is harder than for more recent documents because, while the paper colour darkens with age, the printed part, either handwritten or typed, tends to fade. These two factors acting simultaneously narrow the discrimination gap between the two predominant colour clusters of documents. If a document is typed or written on both sides and the opacity of the paper is such as to allow the back printing to be visualized on the front side, the degree of difficulty of good segmentation increases enormously: a new set of hues of paper and printing colours appears. Better filtering techniques are needed to filter out those pixels, reducing back-to-front noise.

The segmentation algorithm presented here was applied to documents from Joaquim Nabuco's¹ file [5,12] held by the Joaquim Nabuco Foundation (a research center in Recife, Brazil). The segmentation process is used to generate high quality greyscale or monochromatic images. Figure 1 shows the application of a nearest colour algorithm for decreasing the colours of a sample document from Nabuco's bequest, using Adobe Photoshop [10]. The document is written on both sides; the colour reduction process has not produced satisfactory results, as the ink of one side of the paper interferes with the monochromatic image of the other side.

Figure 1. (left) Original image in 256 greyscale levels and (right) its monochromatic version generated by Photoshop.

This paper introduces a new entropy-based segmentation algorithm and compares it with three of the most important entropy-based segmentation algorithms described in the literature. Two different grounds for comparison are presented: visual inspection of the filtered document and the response of Optical Character Recognition (OCR) tools.

¹ Brazilian statesman, writer, and diplomat, one of the key figures in the campaign for freeing black slaves in Brazil, Brazilian ambassador to London (b. 1861, d. 1910).

II. ENTROPY-BASED SEGMENTATION

The documents of Nabuco's file are digitized at 200 dpi in true colour and then converted to 256-level greyscale format using the equation:

C = 0.3*R + 0.59*G + 0.11*B

where C is the new greyscale colour and R, G and B are, respectively, the Red, Green and Blue components of the palette of the original colour image.
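As a minimal sketch of this conversion step (our own Python/NumPy rendering, not the authors' code; the function name and array layout are assumptions):

```python
import numpy as np

def rgb_to_grey(rgb):
    """Convert an RGB image (H x W x 3, uint8) to 256-level greyscale
    using C = 0.3*R + 0.59*G + 0.11*B."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    grey = 0.3 * r + 0.59 * g + 0.11 * b
    return np.clip(np.round(grey), 0, 255).astype(np.uint8)
```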
Three segmentation algorithms based on the entropy function [1] are applied to the greyscale images and are studied here: Pun [9], Kapur et al. [3] and Johannsen [8].

A. Pun's Algorithm

Pun's algorithm analyses the entropy of the black pixels, Hb, and the entropy of the white pixels, Hw, bounded by the threshold value t. The algorithm suggests that t is such that it maximizes the function H = Hb + Hw, where Hb and Hw are defined by:

Hb = −Σ_{i=0}^{t} p[i] log(p[i])        (Eq. 1)

Hw = −Σ_{i=t+1}^{255} p[i] log(p[i])    (Eq. 2)

where p[i] is the probability that a pixel i with colour colour[i] occurs in the image. The logarithm function is taken in base 256. Figure 2 presents the application of Pun's algorithm to the sample image shown in figure 1-left.

Figure 2. Pun's algorithm applied to the document.
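The criterion above can be sketched as follows (our own Python rendering of Eqs. 1 and 2 with the logarithm taken in base 256; an exhaustive search over t is assumed):

```python
import numpy as np

def pun_threshold(grey):
    """Return the threshold t that maximizes H = Hb + Hw (Eqs. 1 and 2)."""
    hist = np.bincount(grey.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                        # p[i]: probability of grey level i

    def entropy(probs):                          # -sum p log256(p) over non-zero bins
        nz = probs[probs > 0]
        return -np.sum(nz * np.log(nz) / np.log(256))

    best_t, best_h = 0, -np.inf
    for t in range(256):
        h = entropy(p[:t + 1]) + entropy(p[t + 1:])   # Hb + Hw
        if h > best_h:
            best_t, best_h = t, h
    return best_t
```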
B. Kapur et al.'s Algorithm

Reference [3] defines a probability distribution A for the object and a distribution B for the background of the document image, such that:

A: p0/Pt, p1/Pt, ..., pt/Pt
B: p(t+1)/(1 − Pt), p(t+2)/(1 − Pt), ..., p255/(1 − Pt)

where Pt is the cumulative probability of the grey levels up to t. The entropy values Hw and Hb are evaluated using equations (1) and (2) above, with p[i] following the distributions above. The maximization of the function Hw + Hb is analysed to define the threshold value t. The sample image of figure 1-left is segmented with this algorithm and the result is presented in figure 3.

Figure 3. Kapur et al.'s segmentation.
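A corresponding sketch for this criterion (again our own Python rendering; the normalization by Pt and 1 − Pt follows the distributions A and B above):

```python
import numpy as np

def kapur_threshold(grey):
    """Return the t that maximizes Hb + Hw over the normalized
    object (A) and background (B) distributions."""
    hist = np.bincount(grey.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()

    def entropy(probs):
        nz = probs[probs > 0]
        return -np.sum(nz * np.log(nz))

    best_t, best_h = 0, -np.inf
    for t in range(255):
        Pt = p[:t + 1].sum()                     # cumulative probability up to t
        if Pt <= 0.0 or Pt >= 1.0:
            continue
        h = entropy(p[:t + 1] / Pt) + entropy(p[t + 1:] / (1.0 - Pt))
        if h > best_h:
            best_t, best_h = t, h
    return best_t
```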

C. Johannsen’s Algorithm
Another variation of an entropy-based algorithm is proposed by Johannsen; it tries to minimize the function Sb(t) + Sw(t), with:
Sw(t) = log(Σ_{i=t+1}^{255} p_i) + (1 / Σ_{i=t+1}^{255} p_i) [E(p_t) + E(Σ_{i=t+1}^{255} p_i)]

and

Sb(t) = log(Σ_{i=0}^{t} p_i) + (1 / Σ_{i=0}^{t} p_i) [E(p_t) + E(Σ_{i=0}^{t−1} p_i)]
where E(x)=-xlog(x) and t is the threshold value. Figure 4
presents the application of this algorithm to the image of
the document under study.

Figure 4. Johannsen’s segmentation.
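A sketch of this minimization (our own Python rendering of the Sw and Sb expressions as reconstructed above; thresholds that would leave a logarithm undefined are skipped):

```python
import numpy as np

def johannsen_threshold(grey):
    """Return the t that minimizes Sb(t) + Sw(t), with E(x) = -x*log(x)."""
    hist = np.bincount(grey.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()

    def E(x):
        return -x * np.log(x) if x > 0 else 0.0

    best_t, best_s = 0, np.inf
    for t in range(1, 255):
        Pb, Pw = p[:t + 1].sum(), p[t + 1:].sum()
        if Pb <= 0.0 or Pw <= 0.0 or p[t] <= 0.0:
            continue
        s_b = np.log(Pb) + (E(p[t]) + E(p[:t].sum())) / Pb
        s_w = np.log(Pw) + (E(p[t]) + E(Pw)) / Pw
        if s_b + s_w < best_s:
            best_t, best_s = t, s_b + s_w
    return best_t
```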

III. A NEW SEGMENTATION ALGORITHM

The algorithm scans the image looking for the most frequent colour, which is likely to belong to the image background (the paper). This colour is used as the initial threshold value, t, to evaluate Hw and Hb as defined in equations (1) and (2).

The entropy H of the complete histogram of the image is also evaluated. It must be noticed that in this new algorithm the logarithmic function used to evaluate H, Hw and Hb is taken with a base equal to the product of the dimensions of the image. This means that, if the image has dimensions x by y, the logarithmic base is x*y. As can be seen in [4], this does not change the concept of entropy.

Using the value of H, two multiplicative factors, mw and mb, are defined following these rules (transcribed as code in the sketch after the list):

• If 0.25 < H < 0.30, then mw = 1 and mb = 2.6
• If H ≤ 0.25, then mw = 2 and mb = 3
• If 0.30 ≤ H < 0.305, then mw = 1 and mb = 2
• If H ≥ 0.305, then mw = 0.8 and mb = 0.8
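Written as code, the rules read as follows (a direct transcription of the list above; the helper name is ours):

```python
def entropy_factors(H):
    """Multiplicative factors (mw, mb) chosen from the entropy H of the
    complete histogram, following the empirical rules listed above."""
    if H <= 0.25:
        return 2.0, 3.0
    elif H < 0.30:        # 0.25 < H < 0.30
        return 1.0, 2.6
    elif H < 0.305:       # 0.30 <= H < 0.305
        return 1.0, 2.0
    else:                 # H >= 0.305
        return 0.8, 0.8
```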
These values of mw and mb were found empirically after several experiments. For now, they can be applied to images of historical documents only; for any other kind of image, these values must be analysed again. We emphasise that this new algorithm was developed to work with images having the characteristics of historical documents.

The greyscale image is scanned again and each pixel i with colour colour[i] is turned white if:

(colour[i]/256) ≥ (mw*Hw + mb*Hb)

Otherwise, its colour remains the same (to generate a new greyscale image) or it is turned to black (generating a monochromatic image).
This condition can be inverted, generating a new image where the pixels with colour corresponding to the ink are eliminated and only the pixels classified as paper remain.
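Putting the pieces together, the whole procedure can be sketched as follows (our own Python/NumPy reading of the description above, not the authors' implementation; it repeats the rule table for self-containment and uses an `invert` flag for the inverted condition):

```python
import numpy as np

def segment_historical(grey, monochromatic=True, invert=False):
    """Sketch of the proposed algorithm: the most frequent grey level gives
    the initial threshold t; H, Hw and Hb use a logarithm whose base is the
    number of pixels (x*y); a pixel is classified as paper (white) when
    colour[i]/256 >= mw*Hw + mb*Hb."""
    hist = np.bincount(grey.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    base = grey.size                             # logarithm base = x*y

    def entropy(probs):
        nz = probs[probs > 0]
        return -np.sum(nz * np.log(nz) / np.log(base))

    t = int(np.argmax(hist))                     # most frequent colour (paper)
    H = entropy(p)                               # entropy of the whole histogram
    Hb, Hw = entropy(p[:t + 1]), entropy(p[t + 1:])

    # empirical factors mw, mb from the rules in the text
    if H <= 0.25:
        mw, mb = 2.0, 3.0
    elif H < 0.30:
        mw, mb = 1.0, 2.6
    elif H < 0.305:
        mw, mb = 1.0, 2.0
    else:
        mw, mb = 0.8, 0.8

    is_paper = (grey / 256.0) >= (mw * Hw + mb * Hb)
    if invert:                                   # keep the paper, drop the ink
        return np.where(is_paper, grey, 255).astype(np.uint8)
    if monochromatic:
        return np.where(is_paper, 255, 0).astype(np.uint8)
    return np.where(is_paper, 255, grey).astype(np.uint8)
```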
This new segmentation algorithm was used for two kinds of applications: 1) to create high quality monochromatic images of the documents, for minimum storage space and efficient network transmission, and 2) to look for better hit rates from commercial OCR tools. The application of the algorithm to the sample document of figure 1-left can be found in figure 5.

Figure 5. Application of the new segmentation algorithm to the document presented in figure 1-left.

Comparing figures 2, 3, 4 and 5, one can observe that the algorithm proposed in this paper yielded the best quality image, with most of the back-to-front interference removed.

It is also important to notice that the new algorithm presented the lowest processing time amongst the algorithms analysed.
The entropy filtering presented here was used on a set of 40 images of documents and letters from Nabuco's bequest. Unsatisfactory images, which required the intervention of an operator, were produced in only four cases. Figure 6 zooms into one of these documents and the output obtained.

Figure 6. (top left) Original image; (top right) Original image in black-and-white; (center left) Original image segmented by Pun's algorithm; (center right) Application of Kapur et al.'s algorithm; (bottom left) Johannsen's algorithm and (bottom right) our algorithm applied to the original image.

For typed documents (also from Nabuco's file) the segmentation algorithm was applied in search of better responses from commercial OCR tools. In previous tests [6], the OCR tool Omnipage [11] from Caere Corp. achieved the best hit rates² amongst six commercial tools analysed. These rates reached almost 99% in some cases. When applied to historical documents, however, this rate decreased to much lower values. The segmented images for a sample typed document can be seen in detail in figure 7.

² Number of characters correctly transcribed from image to text.
The table below presents the hit rate of Omnipage for four typed documents representative of Nabuco's bequest after segmentation with the four entropy-based algorithms presented here. They are compared with the use of the original image with no pre-processing besides that performed by the software itself (the column labeled Omnipage).

Image   Omnipage   Johannsen   Pun    Kapur et al   New Scheme
D023    80.3       78.3        43.3   91.7          91.4
D064    84.4       84.5        63.7   85.2          80.1
D077    80.1       80.1        71.8   77.3          92.4
D097    75.4       5.1         69.5   73.4          88.0

Table 1. Hit rate (%) of Omnipage for images of typed historical documents.
A little degradation in the hit rate of the software can be seen in one of the cases (the D064 image) when its use after the application of the new segmentation technique is compared with its direct use. This degradation can be justified by a possible loss of part of some characters in the segmentation process, producing errors in the character recognition process. Even so, the segmentation algorithm proposed in this paper reached the best rates on average.

Figure 7-bottom shows another application of the algorithm, as explained before, where the frequencies classified as ink are eliminated, leaving only the background of the image (the paper). This image is used in another part of the system in the generation of paper texture for historical documents [7].

Figure 7. (top left) Original image, (top right) segmented image (ink) and (bottom) negative segmentation (paper).

The algorithm was also tested against other segmentation methods, such as iterative selection, yielding better results in terms of OCR hit rates and of the visual quality of the monochromatic images.

IV. CONCLUSION

This paper introduces a new segmentation algorithm for historical documents, which is particularly suitable for reducing the back-to-front noise of documents written on both sides. Applied to a set of 40 samples from Nabuco's bequest, it worked satisfactorily in 90% of them, producing, under visual inspection, better quality images than the best known algorithms described in the literature.

The automatic image-to-text transcription of those documents using Omnipage 9.0, a commercial OCR tool from Caere Corp. [11], improved after segmentation.

The algorithm presented did not work well with very faded documents. We are currently working on re-tuning the algorithm for this class of documents.

V. REFERENCES

[1] N. Abramson. Information Theory and Coding. McGraw-Hill Book Company, 1963.
[2] R. Gonzalez and P. Wintz. Digital Image Processing. Addison-Wesley, 1987.
[3] J. N. Kapur, P. K. Sahoo and A. K. C. Wong. A New Method for Gray-Level Picture Thresholding using the Entropy of the Histogram. Computer Vision, Graphics and Image Processing, 29(3), 1985.
[4] S. Kullback. Information Theory and Statistics. Dover Publications, Inc., 1997.
[5] R. D. Lins et al. An Environment for Processing Images of Historical Documents. Microprocessing & Microprogramming, pp. 111-121, North-Holland, 1995.
[6] C. A. B. Mello and R. D. Lins. A Comparative Study on Commercial OCR Tools. Vision Interface'99, pp. 224-323, Québec, Canada, 1999.
[7] C. A. B. Mello and R. D. Lins. Generating Paper Texture Using Statistical Moments. IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, Turkey, June 2000.
[8] J. R. Parker. Algorithms for Image Processing and Computer Vision. John Wiley and Sons, 1997.
[9] T. Pun. Entropic Thresholding, A New Approach. Computer Graphics and Image Processing, 16(3), 1981.
[10] Adobe Systems Inc. http://www.adobe.com
[11] Caere Corporation. http://www.caere.com
[12] Nabuco Project. http://www.di.ufpe.br/~nabuco
