
A deep learning approach to handwritten text recognition in the presence of struck-out text


Hiqmat Nisa, James A. Thom, Vic Ciesielski, Ruwan Tennakoon

School of Science, RMIT University, Melbourne-3000, Australia

hiqmat.nisa@student.rmit.edu.au, {james.thom, vic.ciesielski, ruwan.tennakoon}@rmit.edu.au

Abstract—The accuracy of handwritten text recognition may be affected by the presence of struck-out text in the handwritten manuscript. This paper investigates and improves the performance of a widely used handwritten text recognition approach, the Convolutional Recurrent Neural Network (CRNN), on handwritten lines containing struck-out words. For this purpose, some common types of struck-out strokes were superimposed on words in a text line. A model trained on the IAM line database was tested on lines containing struck-out words. The Character Error Rate (CER) increased from 0.09 to 0.11. This model was then re-trained on a dataset containing struck-out text, and it performed well in terms of struck-out text detection. We found that, after providing an adequate number of training examples, the model can learn struck-out patterns in a way that does not affect the overall recognition accuracy.

Index Terms—Handwritten Text Recognition, Struck-out text, Deep Learning

Fig. 1: Different types of struck-out strokes from the Modified-IAM training set: (a) single line stroke, (b) double line stroke, (c) single diagonal stroke, (d) crossed diagonal strokes.

I. INTRODUCTION

Transcription of offline handwritten text is an essential area of digital document analysis. For the last few decades, researchers have been working on ways to improve the recognition accuracy of handwritten documents. Initially, recognition of isolated characters was considered. However, English handwriting does not consist of isolated characters only, so attention shifted from isolated characters to word recognition. Although word recognition is challenging because of the cursive nature and wide variability of different persons' handwriting, good performance has been achieved using deep learning neural networks [1], [2].

One constraint in most published work on handwritten text recognition is that researchers consider only clean versions of handwritten documents, i.e., documents free from writing errors. However, a free-form handwritten document may not be ideal: it may contain corrections, deletions or insertions made during a second pass. For instance, writers strike out inappropriate words and write appropriate words next to or above the crossed-out ones. Some common examples of crossed-out words are shown in Figure 1.

If these crossed-out words are input into a handwritten text recognizer, arbitrary characters are produced as output. To prevent the recognizer from producing incorrect output, researchers have worked on pre-processing modules that prevent struck-out text from entering the recognizer [3], [4]. In that case, a handwritten document is divided into small chunks (a word or several words) and each chunk is classified as regular text or struck-out text. After removing the struck-out text, the regular text is sent to the recognizer.

The purpose of this paper is to investigate the performance of a widely used handwritten text recognition approach, the Convolutional Recurrent Neural Network (CRNN), on handwritten lines containing struck-out words. This network consists of three components: a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory) and CTC (Connectionist Temporal Classification) [5]. In this work, we generated a synthetic database containing crossed-out text from the well-known line version of the English-language database IAM [6]. We generated some common types of struck-out strokes, depicted in Figure 1: a horizontal straight line through the middle of the word, two strokes crossing the text, a single diagonal stroke, and two diagonal strokes crossing each other in the middle of the word. We train the CRNN on the IAM database and on the synthetically created database and report the results.

Our research questions are:
1) What will be the effect of struck-out text on a handwritten text recognition system based on CRNN?
2) How can the training procedure of a CRNN network be modified to correctly identify struck-out text in a handwritten line without affecting the accuracy of the text recognition?

978-1-7281-4187-9/19/$31.00 ©2019 IEEE
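To make the stroke superimposition concrete, the four stroke types of Figure 1 can be sketched as below. This is a minimal illustration, not the authors' code: the word bounding box is assumed to be already available (the paper obtains it with the segmentation method of Manmatha and Srimal [10]), and the function and parameter names (`superimpose_stroke`, `value`, `width`) are hypothetical.

```python
import numpy as np

def superimpose_stroke(img, box, kind="single", value=60, width=3):
    """Draw a synthetic cross-out stroke over a word region.

    img   : 2D numpy array (greyscale text-line image), modified in place
    box   : (top, left, bottom, right) bounding box of the target word
    kind  : 'single' | 'double' | 'diagonal' | 'crossed' (the styles in Fig. 1)
    value : greyscale value of the stroke
    width : stroke width in pixels
    """
    top, left, bottom, right = box
    mid = (top + bottom) // 2
    h = width // 2 + 1  # half-thickness used to thicken the 1-pixel line

    def hline(y):
        # Horizontal stroke across the full word box.
        img[max(y - h, 0):y + h, left:right] = value

    def diag(rising):
        # Sample points along the box diagonal and thicken them vertically.
        for x in range(left, right):
            t = (x - left) / max(1, right - left - 1)
            y = int(bottom - t * (bottom - top)) if rising else int(top + t * (bottom - top))
            img[max(y - h, 0):y + h, x:x + 1] = value

    if kind == "single":
        hline(mid)
    elif kind == "double":
        hline(mid - 2 * width)
        hline(mid + 2 * width)
    elif kind == "diagonal":
        diag(rising=True)
    else:  # 'crossed': two diagonals crossing in the middle of the word
        diag(rising=True)
        diag(rising=False)
    return img
```

How the stroke greyscale value and width are chosen to match the surrounding handwriting is described in Section III-A.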
The remainder of the paper is organized as follows: Section II presents related work on struck-out text identification, the methodology used in this paper is explained in Section III, and experimental results and conclusions are presented in Sections IV and V, respectively.

Fig. 2: Examples of struck-out text from the IAM database: (a) Ground Truth: the # Ministry of Labour. This scheme is designed; (b) Ground Truth: # winter. In late November, he was 'suffering as; (c) Ground Truth: down # unambiguous alternatives to.

II. RELATED WORK

Several papers have been published on struck-out text processing in recent years. Likforman-Sulem and Vinciarelli [7] assessed the robustness of an HMM-based classifier in the presence of struck-out text by inducing struck-out words. For this purpose, the HMM was trained on clean handwritten words and tested on synthetic data that included struck-out words with wave trajectories and line trajectories. They found a moderate decrease in performance for single-writer data; for multiple writers, the performance was worse. In their study, the authors did not train the model on strike-outs and did not attempt to detect the strike-outs.

Adak et al. [3] investigated the impact of struck-out text on the writer identification task. Their approach was a hybrid of CNN and SVM to detect words as struck-out or not before recognition. The authors concluded that struck-out text degrades the performance of writer identification. This work is not directly related to ours, as its main task is writer identification rather than text recognition; however, it includes a struck-out text identification component. Another study, by Brink et al. [8], automatically identified and removed crossed-out text using two simple features, branching and size. A simple decision tree was used for the classification of crossed-out and normal text.

Studies by Chaudhuri and Adak [4], [9] tried to identify and remove struck-out strokes from handwritten documents. In these studies, an SVM-based classifier was used for struck-out word detection. The approach in [4] is limited to straight lines only. In [9], different types of stroke were considered, e.g., straight and zigzag, for two languages, English and Bengali. In the latter study, the length of connected components is used to detect and distinguish single-word, successive multi-word, or multi-line struck-outs.

All of the above work handles struck-out text as a binary classification problem, i.e., it considers the text as either struck-out or normal. However, there can be situations where punctuation written at the end of a word is crossed out. Apart from that, handwriting recognition approaches have shifted from character to word, then to line and paragraph recognition, yet none of the above considers recognition at the line or paragraph level in the presence of struck-out text. In this paper, we work at the line level to identify struck-out text, which includes words and punctuation.

III. METHODOLOGY

A. Generating Synthetic Database

This section describes the generation of the synthetic database required for training and performance analysis. The publicly available English handwritten text database IAM [6] is widely used for handwriting recognition tasks. We used the image sets appearing in the "Large Writer Independent Text Line Recognition Task" as the basis of the synthetic database. There are 9,862 text lines, divided into 6,161 lines for training, 900 lines for validation1, 940 lines for validation2, and 1,861 lines for testing. There are 79 different symbols, including lowercase and capital letters, digits, punctuation marks and the symbol # for struck-out words.

This database contains only 55 struck-out words, which are labeled with the symbol #; examples of such words can be seen in Figure 2. All these 55 words appear in the training or validation set; none of the words in the test data are labeled with #. We also found a few unlabeled struck-out words in this database. The small number of struck-out words present in the IAM database is not sufficient for training the model. Therefore, we synthetically crossed out random words in a subset of the IAM database using several types of strokes. First, the bounding box of each word in a line was detected using the line segmentation method described by Manmatha and Srimal [10]. The bounding box covers the core area of the word. Then, at random, a word is crossed out within the core area. From the IAM training and validation sets, 30% and 25% of samples, respectively, were used to generate struck-out texts. The annotation of these struck-out words was replaced with the symbol #. The modified version of the IAM database, which we call Modified-IAM, contains a total of 1,900 text lines for training and 280 for validation; each line contains a random struck-out word at a different position. In addition, 480 lines out of the 1,861 in the IAM test set were selected randomly and we superimposed a stroke (or strokes) on a random word within each line. So our Modified-IAM test set contains 1,381 original lines and 480 lines with a crossed-out word. The details of the data sets are shown in Table I.

We considered the following parameters for strike-out stroke generation.

i. Cross-out stroke style: widely used cross-out strokes were considered, namely, single line stroke (Figure 1a), double-line stroke (Figure 1b), single diagonal stroke (Figure 1c), and two diagonal strokes crossing each other (Figure 1d). All these strokes are drawn using straight lines.

ii. Cross-out stroke value: Our model is trained on greyscale images, so it is important to match the value of the stroke
with the written text on the image. For this purpose, the value of a cross-out stroke is calculated from the histogram of the greyscale image. A frequency-based histogram was generated over a selected range of values of the greyscale image, and the value with the highest frequency was selected for the cross-out stroke.

iii. Cross-out stroke width: Cross-out strokes are usually made with the same instruments that are used for writing, and the tip of the instrument determines the width of the writing strokes. To produce more realistic strokes, it is therefore essential to match the stroke width to the written text width. To find the average width of the writing strokes in a handwritten text line, we find all the strokes in terms of connected components in the line, then apply the Euclidean distance transform [11] to every connected component. The distance transform calculates the distance from each stroke point to the nearest boundary point of the stroke. We then find the points with the maximum distance from stroke to boundary; these points represent the middle of the stroke. We take the average of all these distances, which gives the half-width of the stroke; for the full stroke width, we multiply that number by two.

TABLE I: Data Sets

                                    IAM      Modified-IAM
#Lines in Training                  6161     1900
#Lines in Validation                900      280
#Lines in Test                      1861     1861
#Struck-out words in Training       50       1900
#Struck-out words in Test           0        480
#Non-struck-out words in Training   53757    14040
#Non-struck-out words in Test       17560    17080

B. Network Architecture

Prior deep learning network architectures used for handwriting recognition consist of a combination of Multi-Dimensional LSTM layers and convolutional layers [2]. The use of Multi-Dimensional LSTM layers in the early stages has some drawbacks: it requires more memory for the activations and back-propagation, and the runtime needed to train a network is also higher [12]. The network architecture presented by Shi et al. [5] has been used in this work and is shown in Figure 3. The network uses convolutional layers as feature extractors.

Fig. 3: Network architecture used in this paper, based on [5].

This avoids the use of Multi-Dimensional LSTM layers; a 1D bidirectional LSTM is used to perform the classification. Recently, it has been established by Puigcerver [1] that convolutional layers used for feature extraction learn similar features to Multi-Dimensional LSTM layers.

1) Convolutional Layers: The neural network consists of five convolutional blocks. Each convolutional block contains a two-dimensional convolutional layer with a kernel size of 3x3 pixels and a stride of 1x1. The numbers of filters at the five layers are 16, 32, 48, 64, and 80, respectively. To reduce overfitting, dropout regularization is applied at a rate of 0.2 at the output of every convolutional layer except the first. Leaky Rectified Linear Units (LeakyReLU) are used as the activation functions in the convolutional blocks. Finally, after each of the first three convolutional layers, the output of the activation function is fed to a max-pooling layer with kernels of 2x2 pixels; the max-pooling layers reduce the dimensionality of the input images. To convert 3D tensors of size (width x height x depth) into a 2D shape of size (width x (height x depth)), a column-wise concatenation is performed after the 5th convolutional layer. So the number of features passed to the bidirectional LSTMs is equal to 80 x Height, where Height is the height of the image after three stages of max-pooling.

2) Recurrent Layers: The recurrent blocks contain bidirectional 1D-LSTM layers that scan the image column-wise, i.e., in left-to-right and right-to-left order. After the CNN stage, five bidirectional 1D-LSTM recurrent layers with 256 units in each direction are used. A depth-wise concatenation is performed on the outputs of the two LSTM directions. Dropout with probability 0.5 is also applied to the LSTM layers.

3) Fully connected layer: Finally, a fully connected layer with L+1 nodes is applied to the output of the 5th bidirectional LSTM layer. The term L refers to the number of characters in the IAM dataset, which is 79; one additional dimension is required for the blank symbol of CTC [13]. Overall, this CRNN architecture has 9,581,008 learnable parameters.

IV. EXPERIMENTS AND RESULTS

In this section, we present the experimental setup, the results, the struck-out text recognition accuracy, and an analysis of the effect of struck-out text on handwritten text recognition.
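The stroke-parameter estimation of Section III-A (histogram mode for the stroke value, Euclidean distance transform for the stroke width) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the function name, the `ink_threshold` parameter and the use of SciPy are assumptions.

```python
import numpy as np
from scipy import ndimage

def estimate_stroke_params(line_img, ink_threshold=128):
    """Estimate the greyscale value and width of a synthetic cross-out stroke.

    line_img: 2D uint8 greyscale array of a handwritten text line
    (light background, dark ink).
    """
    # Stroke value: the most frequent greyscale value among ink pixels
    # (the mode of the histogram over the ink range, Section III-A ii).
    ink_pixels = line_img[line_img < ink_threshold]
    counts = np.bincount(ink_pixels.ravel(), minlength=256)
    stroke_value = int(np.argmax(counts))

    # Stroke width (Section III-A iii): the distance transform gives, at
    # each ink pixel, the distance to the nearest background pixel; its
    # maximum inside a connected component is the half-width at the
    # stroke centreline.
    ink_mask = line_img < ink_threshold
    dist = ndimage.distance_transform_edt(ink_mask)
    labels, n = ndimage.label(ink_mask)
    if n == 0:
        return stroke_value, 1
    half_widths = ndimage.maximum(dist, labels, index=np.arange(1, n + 1))
    # Average the half-widths over all components, then double for the
    # full stroke width.
    stroke_width = max(1, int(round(2 * float(np.mean(half_widths)))))
    return stroke_value, stroke_width
```

On a synthetic line containing a single 5-pixel-thick bar of value 50 on a white background, this returns value 50 and a width close to the true thickness.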
Fig. 4: Training of the three different models: (a) training of Model-1, (b) training of Model-2, (c) training of Model-3.

A. Training a model on IAM database (Model-1)

We trained the CRNN architecture (as implemented¹ by Aradillas et al. [12]) from scratch on the IAM database to obtain state-of-the-art recognition accuracy. We did not apply any pre-processing except resizing all images to the same height and width, as the CRNN accepts only images of a single fixed size. No language model was used during training. The CRNN was trained using the Adam optimizer, and the parameters were updated using the gradient of the CTC loss on batches of 20 text lines. We set the learning rate hyper-parameter to 0.003. The performance of the handwritten text recognition is measured as Character Error Rate (CER) and Word Error Rate (WER). CER and WER are calculated using the Levenshtein edit distance [14], which computes the number of edit operations required to transform one string into another. We achieved 0.02 and 0.08 CER on training and validation, respectively. The training and validation progress can be seen in Figure 4a.

After obtaining the trained model, we tested it on the IAM test set and achieved 0.09 CER. We then tested this model on the Modified-IAM test set to check its struck-out text recognition accuracy. We observed an increase in CER from 0.09 to 0.11 and in WER from 0.23 to 0.25. To investigate this performance degradation, we analyzed the output predicted by the model for struck-out text. The model tries to predict the struck-out characters from the trained character set, as shown in Figure 5. The performance of struck-out detection is calculated as:

True Positive (TP) = number of actual struck-out words correctly detected.
False Negative (FN) = number of actual struck-out words incorrectly recognized as normal words.
False Positive (FP) = number of normal words incorrectly recognized as struck-out.
True Negative (TN) = number of normal words correctly detected as normal.

On the Modified-IAM test set, we observed 25 TP out of 480 struck-out words. The remaining 455 words were not predicted as struck-out (see Table II); however, they were predicted as a sequence of letters from the training character set. There were no FP.

Fig. 5: Some examples of Modified-IAM lines tested on Model-1:
(a) Ground Truth: they will # give us their hypnotic
    Predicted:    they vili pochably give us their hypnotic
(b) Ground Truth: # how the Thetan machinations had been
    Predicted:    Higg how the Thetan machinations had been

TABLE II: Confusion matrix on the Modified-IAM test set for Model-1

                             Actual
Predicted          Struck-out    Not struck-out
Struck-out             25               0
Not struck-out        455           17080

B. Training from scratch on Struck-out database (Model-2)

From the above results, we observe that the struck-out words appearing in the IAM training dataset were not sufficient to

¹ The code URL: https://github.com/josarajar/HTRTF
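The CER and WER metrics used above can be computed with the standard dynamic-programming edit distance. The sketch below is self-contained and illustrative, not the evaluation code used in the paper; helper names are our own.

```python
def levenshtein(ref, hyp):
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions needed to turn `ref` into `hyp` (works on strings or lists)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

def cer(ref, hyp):
    """Character Error Rate: character-level edit distance / reference length."""
    return levenshtein(ref, hyp) / max(1, len(ref))

def wer(ref, hyp):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return levenshtein(ref_words, hyp_words) / max(1, len(ref_words))
```

For example, `cer("abcd", "abed")` is 0.25 (one substitution over four reference characters).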
learn the pattern of struck-out text. So, we used the Modified-IAM database and trained the model from scratch. The same CRNN architecture and training parameters were used to train Model-2 on the synthetic database; 1,900, 280, and 1,861 lines of text were used for training, validation, and testing, respectively.

During training, we regularly checked the loss and accuracy curves. We stopped training after 160 epochs, when the training CER had reached 0.05 but the validation CER was 0.20 (shown in Figure 4b). Moreover, the loss on the validation set started to increase after 150 epochs, which indicates overfitting. The overfitting was likely due to the small number of training samples.

We then tested the model on the Modified-IAM test set: the test set was fed to the trained model, and we observed a CER of 0.26 and a WER of 0.49. The reason for these higher error rates is the small number of training examples used to train the model. Analysing the output on the Modified-IAM test data, we observe that, despite the low text recognition rate, Model-2 correctly identified 435 out of 480 struck-out words with the symbol #. So Model-2 performed well on struck-out recognition but did not learn the different varieties of handwritten text. It is necessary to provide more training examples to overcome the overfitting, so that a model can learn more variability in handwriting. This conclusion encouraged us to retrain Model-1, which was already trained on the IAM database, on the Modified-IAM database.

TABLE III: CER and WER for the three models

                                            IAM test set    Modified-IAM test set
         Start of     Training data         CER    WER      CER    WER
Model-1  scratch      IAM                   0.09   0.23     0.11   0.25
Model-2  scratch      Modified-IAM          0.26   0.50     0.26   0.49
Model-3  Model-1      Modified-IAM          0.11   0.27     0.11   0.27

C. Training with transfer learning (Model-3)

We have seen that Model-1's performance decreases on the Modified-IAM test set, and that Model-2 overfits. In this experiment, we re-train Model-1 on the Modified-IAM training dataset so that our final model learns the struck-out text pattern along with the variability in handwriting.

We have already achieved state-of-the-art CER and WER with Model-1. To train Model-3, we initialize the whole network with the learned weights of Model-1. We did not freeze any layer; as suggested by Aradillas et al. [12], with few training examples the best strategy is to re-train the whole network. Our transfer strategy was simple: we keep all the learned values of the trained Model-1, because the Modified-IAM database is created from IAM and has the same number of labels, which is 79. The same architecture and training parameters were used to re-train the model. We achieved 0.01 and 0.09 CER on the training and validation sets after only 50 epochs. Usually, fewer epochs are required in transfer learning, as models are initialized with learned values. The training of Model-3 is shown in Figure 4c.

After obtaining the trained model, the Modified-IAM test set was fed to it to assess its struck-out text detection performance and recognition accuracy. We achieved 0.11 CER and 0.27 WER; we also achieved similar CER and WER on the IAM test set. A comparison of the three models on the two test sets is shown in Table III.

TABLE IV: Confusion matrix on the Modified-IAM test set for Model-3

                             Actual
Predicted          Struck-out    Not struck-out
Struck-out            439             144
Not struck-out         41           16936

Fig. 6: Some examples of false positives on Model-3:
(a) Ground Truth: # came to just over six hundred
    Predicted:    # came # just over six hundred
(b) Ground Truth: to # other day, my lord. A being of whom
    Predicted:    to # other day, my lord. # being of whom
(c) Ground Truth: the Avenue from its many tributary streets. It might have
    Predicted:    the Avenue from #s many tirlumary #r.It might have

Fig. 7: Some examples of false negatives on Model-3: (a) ',' not detected as struck-out; (b) crossed-out 'I' detected as 't'; (c) 'the' not detected as struck-out.

D. Discussion

Table III shows the CER and WER of the three different models tested on the IAM and Modified-IAM test sets. The CER of Model-1 increases from 0.09 to 0.11 after introducing struck-out text into the test data. A possible explanation for this higher CER is that the model tries to predict the characters of the struck-out text. We also observe an interesting finding about struck-out text detection. Although Model-1 was trained on only a small number of originally struck-out words, it correctly identified 25 struck-out words. All these 25 words
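The confusion-matrix counts reported in Tables II and IV can be summarised as standard detection rates. The paper reports only the raw counts; the helper below and the derived precision/recall figures are our illustration.

```python
def detection_rates(tp, fn, fp, tn):
    """Precision, recall and accuracy for struck-out word detection,
    from the four confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # predicted struck-out that were real
    recall = tp / (tp + fn) if tp + fn else 0.0     # real struck-out that were found
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, accuracy
```

With the Table IV counts for Model-3 (TP=439, FN=41, FP=144, TN=16936), recall is 439/480 ≈ 0.91; with the Table II counts for Model-1 (TP=25, FN=455, FP=0), recall is only 25/480 ≈ 0.05 despite perfect precision.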
contained crossed-out strokes generated by us. After analysing all the correctly identified struck-outs, we found that most of them have two strokes (diagonal or straight); none of the correctly identified struck-outs contains a single stroke. This is because most of the originally struck-out examples in IAM have more than one stroke.

Our purpose was to detect struck-out text without affecting the recognition accuracy on normal text. From all the above experiments, after providing adequate training examples of struck-out text, our Model-3 correctly identified 439 out of 480 struck-out words while maintaining a 0.11 CER. We also observed a small number of false negatives and quite a few false positives. In both cases, only short words or punctuation marks were detected incorrectly. Most false positives involve the letter 't' or capital 'A'; this may be because both characters contain a straight line (see Figure 6). Regarding false negatives, comma (,), colon (:), and question mark (?) were incorrectly identified. Such symbols are usually found at the end of non-struck words, making them hard to detect (Figure 7a). Apart from that, the false negatives also include letters whose shape changes from one character to another after being struck out, e.g., 'I' to 't' (Figure 7b), and cases where the stroke looks like part of the word (Figure 7c).

V. CONCLUSION

The principal contributions of this paper are investigating: (1) the effect of struck-out text on a handwritten text recognition system based on CRNN, and (2) how the training procedure of a CRNN can be modified to correctly identify struck-out text in a handwritten line without affecting the accuracy of the text recognition. To address these questions, we synthetically superimposed struck-out strokes on a subset of the well-known English database IAM, producing what we call Modified-IAM.

For question (1), we trained a model on the IAM dataset and tested it on both the IAM and Modified-IAM test sets. Our experiment showed that testing a model on lines containing struck-outs decreased the recognition accuracy: we observed an increase in character error rate from 0.09 to 0.11 and in word error rate from 0.23 to 0.25 on the IAM and Modified-IAM test sets, respectively.

For question (2), we re-trained the previous model on Modified-IAM, which contains 1,900 samples of struck-out words. We tested this model on both the IAM and Modified-IAM test sets and observed the same 0.11 character error rate and 0.27 word error rate on both sets. It also performed well in terms of struck-out detection: 439 out of 480 struck-out words were detected correctly. We conclude that our model successfully detected struck-out words without affecting the accuracy of overall text recognition, and that, in this case, 1,900 training samples are enough for the CRNN model to learn the four basic types of struck-out strokes.

In some situations, our model detected normal text as struck-out (false positives) on both test sets, which negatively affected the recognition accuracy. These false positives usually occurred in two situations: (1) a straight line is part of the letter, as in 't' or 'A', and (2) on punctuation marks, especially comma (,), colon (:), and question mark (?). Some improvement is required to prevent the network from producing such false positives.

We have shown that it is possible to modify CRNN training to accommodate some forms of struck-out text without loss of accuracy. In future work we will investigate a wider range of struck-out forms. For example, struck-out strokes do not always start at the left of a word and end at the right; there are also situations where only part of a word is crossed out.

ACKNOWLEDGMENT

This work was partially funded by the government of Pakistan under the project "HDR INITIATIVE-MS LEADING TO PhD PROGRAM OF FACULTY DEVELOPMENT FOR UESTPS/UETS, PHASA-1", and by RMIT University, Australia.

REFERENCES

[1] J. Puigcerver, "Are multidimensional recurrent layers really necessary for handwritten text recognition?" in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 67–72.
[2] P. Voigtlaender, P. Doetsch, and H. Ney, "Handwriting recognition with large multidimensional long short-term memory recurrent neural networks," in 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE, 2016, pp. 228–233.
[3] C. Adak, B. B. Chaudhuri, and M. Blumenstein, "Impact of struck-out text on writer identification," in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 1465–1471.
[4] C. Adak and B. B. Chaudhuri, "An approach of strike-through text identification from handwritten documents," in 2014 14th International Conference on Frontiers in Handwriting Recognition. IEEE, 2014, pp. 643–648.
[5] B. Shi, X. Bai, and C. Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2298–2304, 2016.
[6] U.-V. Marti and H. Bunke, "The IAM-database: an English sentence database for offline handwriting recognition," International Journal on Document Analysis and Recognition, vol. 5, no. 1, pp. 39–46, 2002.
[7] L. Likforman-Sulem and A. Vinciarelli, "HMM-based offline recognition of handwritten words crossed out with different kinds of strokes," 2008.
[8] A. Brink, H. van der Klauw, and L. Schomaker, "Automatic removal of crossed-out handwritten text and the effect on writer verification and identification," in Document Recognition and Retrieval XV, vol. 6815. International Society for Optics and Photonics, 2008, p. 68150A.
[9] B. B. Chaudhuri and C. Adak, "An approach for detecting and cleaning of struck-out handwritten text," Pattern Recognition, vol. 61, pp. 282–294, 2017.
[10] R. Manmatha and N. Srimal, "Scale space technique for word segmentation in handwritten documents," in International Conference on Scale-Space Theories in Computer Vision. Springer, 1999, pp. 22–33.
[11] J. C. Elizondo-Leal and G. Ramirez-Torres, "An exact Euclidean distance transform for universal path planning," in 2010 IEEE Electronics, Robotics and Automotive Mechanics Conference. IEEE, 2010, pp. 62–67.
[12] J. C. Aradillas, J. J. Murillo-Fuentes, and P. M. Olmos, "Boosting handwriting text recognition in small databases with transfer learning," arXiv preprint arXiv:1804.01527, 2018.
[13] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
[14] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," in Soviet Physics Doklady, vol. 10, no. 8, 1966, pp. 707–710.
