I. INTRODUCTION
Transcription of offline handwritten text is an essential area of digital document analysis. For the last few decades, researchers have been working on ways to improve the recognition accuracy of handwritten documents. Initially, recognition of isolated characters was considered. However, English handwriting does not consist of isolated characters only, so attention shifted from isolated characters to word recognition. Although word recognition is challenging because of the cursive nature and wide variability of different persons' handwriting, good performance has been achieved using deep learning neural networks [1], [2].

One constraint in most published work on handwritten text recognition is that researchers consider only clean versions of handwritten documents, i.e., documents free from writing errors. However, a free-form handwritten document may not be ideal: it may contain corrections, deletions or insertions made during a second pass. For instance, writers strike out inappropriate words and write appropriate words next to or above the crossed-out ones. Some common examples of crossed-out words are shown in Figure 1.

If these crossed-out words are input to a handwritten text recognizer, arbitrary characters are produced as output. To prevent the recognizer from producing incorrect output, researchers have worked on pre-processing modules that keep struck-out text from entering the recognizer [3], [4]. In that case, a handwritten document is divided into small chunks (a word or several words) and each chunk is classified as regular text or struck-out text. After removing the struck-out text, the regular text is sent to the recognizer.

The purpose of this paper is to investigate the performance of a widely used handwritten text recognition approach, the Convolutional Recurrent Neural Network (CRNN), on handwritten lines containing struck-out words. This network consists of three components: a CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory) layers and CTC (Connectionist Temporal Classification) [5]. In this work, we generated a synthetic database containing crossed-out text from the well-known line version of the English-language IAM database [6]. We generated some common types of struck-out strokes, depicted in Figure 1: a horizontal straight line through the middle of the word, two strokes crossing the text, a single diagonal stroke, and two diagonal strokes crossing each other in the middle of the word. We train the CRNN on the IAM database and on the synthetically created database and report the results.

Our research questions are:
1) What will be the effect of struck-out text on a handwritten text recognition system based on a CRNN?
2) How can the training procedure of a CRNN be modified to correctly identify struck-out text in a handwritten line without affecting the accuracy of the text recognition?

The remainder of the paper is organized as follows: Section II presents related work on struck-out text identification, the methodology used in this paper is explained in Section III, and experimental results and conclusion are presented in Sections IV and V, respectively.

Fig. 1: Examples of crossed-out words. (a) Ground Truth: the # Ministry of Labour. This scheme is designed (b) Ground Truth: # winter. In late November, he was 'suffering as

978-1-7281-4187-9/19/$31.00 ©2019 IEEE
II. RELATED WORK
with the written text on the image. For this purpose, the value of a cross-out stroke is calculated from the histogram of the greyscale image. A frequency-based histogram is generated over a selected range of greyscale values, and the value with the highest frequency is selected for the cross-out stroke.

iii. Cross-out stroke width: Cross-out strokes are usually made by the same instruments that are used for writing, and the tip of the instrument determines the width of the writing strokes. To produce realistic strokes, it is therefore essential to match the stroke width to the width of the written text. To find the average width of the writing strokes in a handwritten text line, we extract all strokes as connected components of the line and apply the Euclidean distance transform [11] to every connected component. The distance transform calculates the distance from each stroke point to the nearest boundary point of the stroke. We then collect the points with maximum distance to the boundary; these points lie in the middle of the stroke. The average of these distances gives the half-width of the stroke; multiplying by two gives the full stroke width.

TABLE I: Data Sets

                                    IAM     Modified-IAM
 #Lines in Training                 6161    1900
 #Lines in Validation               900     280
 #Lines in Test                     1861    1861
 #Struck-out words in Training      50      1900
 #Struck-out words in Test          0       480
 #Non-struck-out words in Training  53757   14040
 #Non-struck-out words in Test      17560   17080

B. Network Architecture

Prior deep learning network architectures used for handwriting recognition consist of a combination of Multi-Dimensional LSTM layers and convolutional layers [2]. The use of Multi-Dimensional LSTM layers in the early stages has some drawbacks: it requires more memory for the activations and back-propagation, and the runtime needed to train the network is also higher [12]. The network architecture presented by Shi et al. [5] has been used in this work and is shown in Figure 3. The network uses convolutional layers as feature extractors. This avoids the use of Multi-Dimensional LSTM layers; a 1D bidirectional LSTM is used to perform the classification. Recently, Joan Puigcerver [1] established that convolutional layers used for feature extraction learn features similar to those of Multi-Dimensional LSTM layers.

1) Convolutional Layers: The neural network consists of five convolutional blocks. Each block contains a two-dimensional convolutional layer with a kernel size of 3x3 pixels and a stride of 1x1. The numbers of filters in the five blocks are 16, 32, 48, 64, and 80, respectively. To reduce overfitting, dropout regularization is applied at a rate of 0.2 at the output of every convolutional layer except the first. Leaky Rectified Linear Units (LeakyReLU) are used as the activation functions in the convolutional blocks. Finally, after each of the first three convolutional layers, the output of the activation function is fed to a max-pooling layer with a kernel of 2x2 pixels; max-pooling is commonly used to reduce the dimensionality of the input images. To convert the 3D tensors of size (width x height x depth) into a 2D shape of size (width x (height x depth)), a column-wise concatenation is performed after the 5th convolutional layer. The number of features passed to the bidirectional LSTMs is thus 80 x H, where H is the height of the image after the three stages of max-pooling.

2) Recurrent Layers: The recurrent blocks contain bidirectional 1D-LSTM layers that scan the image column-wise, i.e., in left-to-right and right-to-left order. After the CNN stage, five bidirectional 1D-LSTM recurrent layers with 256 units in each direction are used. A depth-wise concatenation is performed on the outputs of the two LSTM directions. Dropout with probability 0.5 is also applied to the LSTM layers.

3) Fully Connected Layer: Finally, a fully connected layer with L+1 nodes is applied to the output of the 5th bidirectional LSTM layer. L refers to the number of characters in the IAM dataset, which is 79; one additional dimension is required for the blank symbol of CTC [13]. Overall, this CRNN architecture has 9,581,008 learnable parameters.

IV. EXPERIMENTS AND RESULTS

In this section, we present the experimental setup, the struck-out text recognition accuracy, and an analysis of the effect of struck-out text on handwritten text recognition.
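The stroke-width estimation via connected components and the Euclidean distance transform can be sketched as follows. This is an illustrative reimplementation using SciPy, not the authors' code; the function name `estimate_stroke_width` and the synthetic test stroke are ours.

```python
import numpy as np
from scipy import ndimage

def estimate_stroke_width(binary_line: np.ndarray) -> float:
    """Estimate the average stroke width of a binarized text line.

    binary_line: 2D array with ink pixels == 1 and background == 0.
    """
    # Each connected component of ink is treated as one stroke.
    labels, n_components = ndimage.label(binary_line)
    half_widths = []
    for i in range(1, n_components + 1):
        component = (labels == i)
        # Distance from every ink pixel to the nearest background pixel.
        dist = ndimage.distance_transform_edt(component)
        # The maximum lies on the stroke's medial axis and equals
        # the local half-width of the stroke.
        half_widths.append(dist.max())
    # Full width = 2 * average half-width.
    return 2.0 * float(np.mean(half_widths))

# Example: a synthetic 3-pixel-thick horizontal stroke.
line = np.zeros((20, 50), dtype=np.uint8)
line[8:11, 5:45] = 1
width = estimate_stroke_width(line)
```

Note that this sketch averages the per-component maxima; averaging over all medial-axis points, as described above, is a straightforward extension.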
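The column-wise concatenation between the convolutional and recurrent stages amounts to a simple reshape, illustrated below with plain NumPy. The line height of 128 and width of 800 pixels are assumptions for illustration only; the factor of 8 comes from the three 2x2 max-pooling stages.

```python
import numpy as np

# Assumed input: a line image of height 128 and width 800, producing
# 80 feature maps after the 5th convolutional block.  Three 2x2
# max-pooling stages divide both spatial dimensions by 8.
H, W, depth = 128 // 8, 800 // 8, 80           # H = 16, W = 100

features = np.random.rand(W, H, depth)         # (width, height, depth)

# Column-wise concatenation: each image column becomes one time step
# whose feature vector stacks all rows of all feature maps.
sequence = features.reshape(W, H * depth)      # (width, height * depth)

assert sequence.shape == (100, 16 * 80)        # 80 * H = 1280 features per step
```

Each of the 100 columns then serves as one input vector to the bidirectional 1D-LSTM stack.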
The network parameters were updated using the gradient of the CTC loss on mini-batches of
20 text lines. We set the learning rate to
0.003. The performance of the handwritten text recognition is
measured as Character Error Rate (CER) and Word Error Rate
(WER). CER and WER are calculated using the Levenshtein
edit distance [14]. This distance computes the number of edit
operations performed to transform one string into another
string. We achieved 0.02 and 0.08 CER on training and
validation, respectively. The training and validation progress
can be seen in Figure 4a.
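The CER computation can be reproduced with a short dynamic-programming Levenshtein routine. The sketch below is a standard single-row implementation with our own naming, not necessarily the authors' tooling; WER is computed analogously over word tokens instead of characters.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn hyp into ref."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))                        # distances for ref prefix of length 0
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i                     # prev holds D[i-1][j-1]
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev, dp[j] = dp[j], min(dp[j] + 1,    # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + cost)    # substitution / match
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return levenshtein(ref, hyp) / len(ref)

# One substitution out of four characters gives a CER of 0.25.
assert levenshtein("word", "ward") == 1
assert cer("word", "ward") == 0.25
```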
Fig. 4: (a) Training of Model-1.

After obtaining the trained model, we tested it on the IAM test set and achieved 0.09 CER. We then tested the model on the Modified-IAM test set to check its struck-out text recognition accuracy. We observe an increase in CER from 0.09 to 0.11 and in WER from 0.24 to 0.25. To investigate this performance degradation, we analyzed the output predicted by the model for struck-out text. The model tries to predict the struck-out characters from the trained character set, as shown in Figure 5. The performance of struck-out detection is calculated as:
True Positive (TP) = number of actual struck-out words correctly detected as struck-out.
False Negative (FN) = number of actual struck-out words incorrectly recognized as normal words.
False Positive (FP) = number of normal words incorrectly recognized as struck-out.
True Negative (TN) = number of normal words correctly detected as normal.
Fig. 4: (b) Training of Model-2.
On the Modified-IAM test set, we observed 25 TP out of
480 struck-out words. The remaining 455 words were not
predicted as struck-out (see Table II). However, they were
predicted as a sequence of letters from the training character
set. There were no FP.
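The counts above (25 TP, 455 FN and no FP over 480 struck-out words) translate directly into detection precision and recall; the sketch below, with our own function name, makes the calculation explicit:

```python
def detection_metrics(tp: int, fn: int, fp: int):
    """Precision and recall of struck-out word detection."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)
    return precision, recall

# Counts observed on the Modified-IAM test set: 25 of the 480
# struck-out words detected, 455 missed, no false positives.
precision, recall = detection_metrics(tp=25, fn=455, fp=0)

assert precision == 1.0          # no normal word flagged as struck-out
assert round(recall, 4) == 0.0521  # only 25/480 struck-out words found
```

The perfect precision but very low recall quantifies the observation that the model almost never identifies struck-out text on its own.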