Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pages 9–16, Prague, 28 June 2007. © 2007 Association for Computational Linguistics.
Figure 1: Screen-shot of the alignment prototype interface displaying an outlined word (using a box) in the
manuscript (left) and the corresponding highlighted word in the transcript (right).
using dynamic time warping (DTW). In this same direction, (Huang and Srihari, 2006), in addition to the word pre-segmentation, attempt a (rough) recognition of the word images. The resulting word string is then aligned with the transcription using dynamic programming.

The alignment method presented here (henceforward called Viterbi alignment) relies on the Viterbi decoding approach to handwritten text recognition (HTR) based on Hidden Markov Models (HMMs) (Bazzi et al., 1999; Toselli et al., 2004). These techniques are based on methods originally introduced for speech recognition (Jelinek, 1998). In such HTR systems, the alignment is actually a byproduct of the recognition process itself, i.e. an implicit segmentation of each text image line is obtained where each segment successively corresponds to one recognized word. In our case, word recognition is not actually needed, as we already have the correct transcription. Therefore, to obtain the segmentations for the given word sequences, the so-called "forced-recognition" approach is employed (see section 2.2). This idea has been previously explored in (Zimmermann and Bunke, 2002).

Alignments can be computed line by line in cases where the beginning and end positions of lines are known or, in a more general case, for whole pages. We show line-by-line results on a set of 53 pages from the "Cristo-Salvador" handwritten document (see section 5.2). To evaluate the quality of the obtained alignments, two metrics were used which give information at different alignment levels: one measures the accuracy of alignment mark placements and the other measures the amount of erroneous assignments produced between word images and transcriptions (see section 4).

The remainder of this paper is organized as follows. First, the alignment framework is introduced and formalized in section 2. Then, the implemented prototype is described in section 3. The alignment evaluation metrics are presented in section 4. The experiments and results are discussed in section 5. Finally, some conclusions are drawn in section 6.

Figure 2: Example of 5-state HMM modeling of (feature vector sequences of) instances of the character "a" within the Spanish word "cuarenta" (forty). The states are shared among all instances of characters of the same class. The zones modelled by each state graphically show the subsequences of feature vectors (see details in the magnifying-glass view), which are composed by stacking the normalized grey level and both of its derivative features.

2 HMM-based HTR and Viterbi alignment

HMM-based handwritten text recognition is briefly outlined in this section, followed by a more detailed presentation of the Viterbi alignment approach.

2.1 HMM HTR Basics

The traditional handwritten text recognition problem can be formulated as the problem of finding a most likely word sequence ŵ = ⟨w1, w2, . . . , wn⟩ for a given handwritten sentence (or line) image represented by a feature vector sequence x = x_1^p = ⟨x1, x2, . . . , xp⟩, that is:

    ŵ = arg max_w Pr(w|x)
      = arg max_w Pr(x|w) · Pr(w)    (1)

where Pr(x|w) is usually approximated by concatenated character Hidden Markov Models (HMMs) (Jelinek, 1998; Bazzi et al., 1999), whereas Pr(w) is typically approximated by an n-gram word language model (Jelinek, 1998).

Thus, each character class is modeled by a continuous density left-to-right HMM, characterized by a set of states and a Gaussian mixture per state. The Gaussian mixture serves as a probabilistic law to model the emission of feature vectors by each HMM state. Figure 2 shows an example of how an HMM models a feature vector sequence corresponding to
Figure 3: Example of a segmented text line image along with its resulting deslanted and size-normalized image, together with the alignment marks (b0 . . . bn) which delimit each of the words (including word-spaces) over the text image feature vector sequence x.
character "a". The process to obtain feature vector sequences from text images, as well as the training of the HMMs, are explained in section 3.

HMMs as well as n-gram models can be represented by stochastic finite state networks (SFN), which are integrated into a single global SFN by replacing each word character of the n-gram model by the corresponding HMM. The search involved in equation (1), to decode the input feature vector sequence x into the most likely output word sequence ŵ, is performed over this global SFN. This search problem is adequately solved by the Viterbi algorithm (Jelinek, 1998).

2.2 Viterbi Alignment

As a byproduct of the Viterbi solution to (1), the feature vector subsequences of x aligned with each of the recognized words w1, w2, . . . , wn can be obtained. These implicit subsequences can be made explicit in equation (1) as follows:

    ŵ = arg max_w Σ_b Pr(x, b|w) · Pr(w)    (2)

where b is an alignment; that is, an ordered sequence of n + 1 marks ⟨b0, b1, . . . , bn⟩, used to demarcate the subsequences belonging to each recognized word. The marks b0 and bn always point to the first and last components of x (see figure 3). Now, approximating the sum in (2) by the dominant term:

    ŵ ≈ arg max_w max_b Pr(x, b|w) · Pr(w)    (3)

where b̂, the maximizing b in (3), is the optimal alignment. In our case, we are not really interested in proper text recognition because the transcription is known beforehand. Let w̃ be the given transcription. Now, Pr(w) in equation (3) is zero for all w except w̃, for which Pr(w̃) = 1. Therefore,

    b̂ = arg max_b Pr(x, b|w̃)    (4)

which can be expanded to

    b̂ = arg max_b Pr(x, b1|w̃) Pr(x, b2|b1, w̃) · · · Pr(x, bn|b1 b2 . . . bn−1, w̃)    (5)

Assuming independence of each mark bi from b1 b2 . . . bi−1, and assuming that each subsequence x_{bi−1}^{bi} depends only on w̃i, equation (5) can be rewritten as

    b̂ = arg max_b Pr(x_{b0}^{b1}|w̃1) · · · Pr(x_{bn−1}^{bn}|w̃n)    (6)

This simpler Viterbi search problem is known as "forced recognition".

3 Overview of the Alignment Prototype

The implementation of the alignment prototype involved four different parts: document image preprocessing, line image feature extraction, HMM training and alignment map generation.
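To make the forced-recognition search concrete, the following toy sketch (not the authors' implementation; the function name and the per-frame score matrix are illustrative) computes the optimal marks b̂ of equation (6) by dynamic programming, under the simplifying assumption that each word's emission score decomposes into per-frame log-likelihoods:

```python
import numpy as np

def forced_alignment(frame_scores):
    """Toy Viterbi-style forced alignment.

    frame_scores[i][t] = log-likelihood that frame t is emitted by word i
    (in a real HTR system these would come from the concatenated
    character HMMs).  Returns marks b_0..b_n such that word i covers
    frames b_{i-1}..b_i - 1, maximising the summed log-likelihoods.
    """
    n, p = frame_scores.shape                 # n words, p frames
    NEG = -np.inf
    # dp[i][t]: best score assigning frames 0..t-1 to words 0..i-1
    dp = np.full((n + 1, p + 1), NEG)
    back = np.zeros((n + 1, p + 1), dtype=int)
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for t in range(i, p - (n - i) + 1):   # each word gets >= 1 frame
            best, arg = NEG, -1
            acc = 0.0
            # scan candidate start frames s of word i-1, downwards
            for s in range(t - 1, i - 2, -1):
                acc += frame_scores[i - 1, s]
                cand = dp[i - 1][s] + acc
                if cand > best:
                    best, arg = cand, s
            dp[i][t], back[i][t] = best, arg
    # backtrack the alignment marks
    marks = [p]
    t = p
    for i in range(n, 0, -1):
        t = back[i][t]
        marks.append(int(t))
    return marks[::-1]
```

For instance, with two words over five frames where word 0 scores well on frames 0–2 and word 1 on frames 3–4, the sketch returns the marks ⟨0, 3, 5⟩.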
12
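As an illustration of the feature-extraction part, the following simplified sketch (the actual feature computation follows Toselli et al., 2004; the grid averaging and derivative operators here are assumptions of this example) reduces a line image to a sequence of 120-dimensional frame vectors:

```python
import numpy as np

def extract_features(line_img, n_cells=40):
    """Toy sketch: represent each column-frame of a line image by
    stacking normalized grey level plus horizontal and vertical
    grey-level derivatives, computed over a vertical grid of n_cells
    cells (N = 40 in the paper, giving 3 * 40 = 120 components)."""
    h, w = line_img.shape
    # average rows into n_cells vertical cells -> (n_cells, w) grid
    edges = np.linspace(0, h, n_cells + 1).astype(int)
    grid = np.stack([line_img[a:b].mean(axis=0)
                     for a, b in zip(edges[:-1], edges[1:])])
    dx = np.gradient(grid, axis=1)   # horizontal derivative
    dy = np.gradient(grid, axis=0)   # vertical derivative
    feats = np.concatenate([grid, dx, dy], axis=0)   # (3*n_cells, w)
    return feats.T                   # one 120-dim vector per frame
```

On a uniform (constant grey) image both derivative blocks are zero, so each frame vector keeps only its 40 grey-level components.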
Document image preprocessing encompasses the following steps: first, skew correction is carried out on each document page image; then background removal and noise reduction are performed by applying a bi-dimensional median filter (Kavallieratou and Stamatatos, 2006) to the whole page image. Next, a text line extraction process based on local minima of the horizontal projection profile of the page image divides the page into separate line images (Marti and Bunke, 2001). In addition, connected components have been used to resolve the situations where local minimum values are greater than zero, which makes it impossible to obtain a clear text line separation. Finally, slant correction and non-linear size normalization are applied (Toselli et al., 2004; Romero et al., 2006) to each extracted line image.

An example of an extracted text line image is shown in the top panel of figure 3, along with the resulting deslanted and size-normalized image. Note how non-linear normalization leads to reduced sizes of ascenders and descenders, as well as to a thinner underline of the word "ciudadanos".

As our alignment prototype is based on Hidden Markov Models (HMMs), each preprocessed line image is represented as a sequence of feature vectors. To do this, the feature extraction module applies a grid to divide the line image into N × M squared cells. In this work, N = 40 was chosen empirically (using the corpus described further on) and M must satisfy the condition M/N = original image aspect ratio. From each cell, three features are calculated: normalized gray level, horizontal gray level derivative and vertical gray level derivative. The way these three features are determined is described in (Toselli et al., 2004). Columns of cells, or frames, are processed from left to right and a feature vector is constructed for each frame by stacking the three features computed in its constituent cells.

Hence, at the end of this process, a sequence of M 120-dimensional feature vectors (40 normalized gray-level components, 40 horizontal and 40 vertical derivative components) is obtained. An example of a feature vector sequence, representing an image of the Spanish word "cuarenta" (forty), is shown in figure 2.

As explained in section 2.1, characters are modeled by continuous density left-to-right HMMs with 6 states and 64 Gaussian mixture components per state. This topology (number of HMM states and Gaussian densities per state) was determined by empirically tuning the system on the corpus described in section 5.1. Once an HMM "topology" has been adopted, the model parameters can be easily trained from images of continuously handwritten text (without any kind of segmentation), accompanied by the transcription of these images into the corresponding sequence of characters. This training process is carried out using a well-known instance of the EM algorithm called forward-backward or Baum-Welch re-estimation (Jelinek, 1998).

The last phase in the alignment process is the generation of the mapping proper by means of Viterbi "forced recognition", as discussed in section 2.2.

4 Alignment Evaluation Metrics

Two kinds of measures have been adopted to evaluate the quality of alignments. On the one hand, the average value and standard deviation (henceforward called MEAN-STD) of the absolute differences between the system-proposed word alignment marks and their corresponding (correct) references. This gives us an idea of the geometrical accuracy of the obtained alignments. On the other hand, the alignment error rate (AER), which measures the amount of erroneous assignments produced between word images and transcriptions.

Given a reference mark sequence r = ⟨r0, r1, . . . , rn⟩ along with an associated token sequence w = ⟨w1, w2, . . . , wn⟩, and a segmentation mark sequence b = ⟨b0, b1, . . . , bn⟩ (with r0 = b0 ∧ rn = bn), we define the MEAN-STD and AER metrics as follows:

MEAN-STD: The average value and standard deviation of the absolute differences between reference and proposed alignment marks are given by:

    μ = (Σ_{i=1}^{n−1} di) / (n − 1)        σ = sqrt( (Σ_{i=1}^{n−1} (di − μ)²) / (n − 1) )    (7)

where di = |ri − bi|.
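Equation (7) translates directly into code; the following minimal sketch (mark positions are plain numbers here, and the function name is ours) computes μ and σ over the interior marks:

```python
import math

def mean_std(refs, marks):
    """MEAN-STD metric (eq. 7): mean and standard deviation of the
    absolute differences d_i = |r_i - b_i| over the interior marks
    i = 1..n-1 (r_0 = b_0 and r_n = b_n are fixed by definition)."""
    n = len(refs) - 1                      # marks are r_0..r_n
    d = [abs(refs[i] - marks[i]) for i in range(1, n)]
    mu = sum(d) / (n - 1)
    sigma = math.sqrt(sum((di - mu) ** 2 for di in d) / (n - 1))
    return mu, sigma
```

For example, with references ⟨0, 10, 20, 30⟩ and proposed marks ⟨0, 12, 19, 30⟩ the interior differences are 2 and 1, giving μ = 1.5 and σ = 0.5.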
Figure 4: Example of AER computation. In this case N = 4 (only non-word-space tokens are considered: w1, w3, w5, w7) and w5 is erroneously aligned with the subsequence x_{b5}^{b6} (m5 ∉ (b4, b5)). The resulting AER is 25%.
AER: Defined as:

    AER(%) = (100 / N) · Σ_{j : wj ≠ ␣} ej    (8)

    ej = 0 if bj−1 < mj < bj, and ej = 1 otherwise

where ␣ stands for the blank-space token, N < n is the number of real words (i.e., tokens which are not ␣), and mj = (rj−1 + rj)/2.

A good alignment will have a μ value close to 0 and a small σ. Thus, MEAN-STD gives us an idea of how accurate the automatically computed alignment marks are. On the other hand, AER assesses alignments at a higher level; that is, it measures mismatches between word images and ASCII transcriptions (tokens), excluding word-space tokens. This is illustrated in figure 4, where the AER would be 25%.

5 Experiments

In order to test the effectiveness of the presented alignment approach, different experiments were carried out. The corpus used, as well as the experiments carried out and the results obtained, are reported in the following subsections.

5.1 Corpus description

The corpus was compiled from the legacy handwriting document identified as Cristo-Salvador, which was kindly provided by the Biblioteca Valenciana Digital (BIVALDI). It is composed of 53 text page images, scanned at 300 dpi and written by a single writer. Some of these page images are shown in figure 5.

As explained in section 3, the page images have been preprocessed and divided into lines, resulting in a data-set of 1,172 text line images. In this phase, around 4% of the automatically extracted line-separation marks were manually corrected. The transcriptions corresponding to each line image are also available, containing 10,911 running words with a vocabulary of 3,408 different words.

To test the quality of the computed alignments, 12 pages were randomly chosen from the whole corpus pages to be used as references. For these pages the true locations of the alignment marks were set manually. Table 1 summarizes the basic statistics of this corpus and its reference pages.

    Number of:    References   Total    Lexicon
    pages         12           53       –
    text lines    312          1,172    –
    words         2,955        10,911   3,408
    characters    16,893       62,159   78

    Table 1: Basic statistics of the database.

5.2 Experiments and Results

As mentioned above, experiments were carried out computing the alignments line by line. Two different HMM modeling schemes were employed. The first one models each of the 78 character classes using a different HMM per class. The second scheme uses 2 HMMs, one to model all the 77 non-blank character classes, and the other to model only the blank "character" class. The HMM topology was identical for all HMMs in both schemes: left-to-right with 6 states and 64 Gaussian mixture components per state.
Figure 5: Example page images of the corpus "Cristo-Salvador", which show backgrounds with large variations and uneven illumination, spots due to humidity, marks resulting from ink that goes through the paper (called bleed-through), etc.
Figure 6: |ri − bi| distribution histograms (frequency in % versus |ri − bi| in mm) for the 78-HMMs (left) and 2-HMMs (right) modelling schemes.
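The histograms of figure 6 summarize the distribution of the per-mark differences |ri − bi|; a minimal sketch for computing such a distribution (assuming mark positions already converted to millimetres; binning parameters are illustrative) is:

```python
import numpy as np

def mark_difference_histogram(refs, marks, bin_mm=1.0, max_mm=6.0):
    """Histogram (in %) of the interior-mark differences |r_i - b_i|,
    as plotted in figure 6.  Positions are assumed to be in mm."""
    d = np.abs(np.asarray(refs[1:-1], float)
               - np.asarray(marks[1:-1], float))
    bins = np.arange(0.0, max_mm + bin_mm, bin_mm)
    counts, _ = np.histogram(d, bins=bins)
    return 100.0 * counts / len(d)
```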
Figure 7: Word alignment for 6 lines of a particularly noisy part of the corpus. The last four words on the second line, as well as the last line, illustrate some of the over-segmentation and under-segmentation error types.
I. Bazzi, R. Schwartz, and J. Makhoul. 1999. An Omnifont Open-Vocabulary OCR System for English and Arabic. IEEE Trans. on PAMI, 21(6):495–504.

Chen Huang and Sargur N. Srihari. 2006. Mapping Transcripts to Handwritten Text. In Suvisoft Ltd., editor, Tenth International Workshop on Frontiers in Handwriting Recognition, pages 15–20, La Baule, France, October.

F. Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press.

Ergina Kavallieratou and Efstathios Stamatatos. 2006. Improving the Quality of Degraded Document Images. In DIAL '06: Proceedings of the Second International Conference on Document Image Analysis for Libraries (DIAL'06), pages 340–349, Washington, DC, USA. IEEE Computer Society.

E. M. Kornfield, R. Manmatha, and J. Allan. 2004. Text Alignment with Handwritten Documents. In First International Workshop on Document Image Analysis for Libraries (DIAL), pages 195–209, Palo Alto, CA, USA, January.

U.-V. Marti and H. Bunke. 2001. Using a Statistical Language Model to Improve the Performance of an HMM-Based Cursive Handwriting Recognition System. Int. Journal of Pattern Recognition and Artificial Intelligence, 15(1):65–90.

V. Romero, M. Pastor, A. H. Toselli, and E. Vidal. 2006. Criteria for Handwritten Off-line Text Size Normalization. In Proc. of the Sixth IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP 06), Palma de Mallorca, Spain, August.

A. H. Toselli, A. Juan, D. Keysers, J. González, I. Salvador, H. Ney, E. Vidal, and F. Casacuberta. 2004. Integrated Handwriting Recognition and Interpretation using Finite-State Models. Int. Journal of Pattern Recognition and Artificial Intelligence, 18(4):519–539, June.

M. Zimmermann and H. Bunke. 2002. Automatic Segmentation of the IAM Off-Line Database for Handwritten English Text. In ICPR '02: Proceedings of the 16th International Conference on Pattern Recognition (ICPR'02), Volume 4, page 40035, Washington, DC, USA. IEEE Computer Society.