JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1
arXiv:1812.11894v1 [cs.CV] 31 Dec 2018
Abstract—Unconstrained text recognition is an important computer vision task, featuring a wide variety of different sub-tasks, each
with its own set of challenges. One of the biggest promises of deep neural networks has been the convergence and automation of
feature extractors from input raw signals, allowing for the highest possible performance with minimum required domain knowledge. To
this end, we propose a data-efficient, end-to-end neural network model for generic, unconstrained text recognition. In our proposed
architecture we strive for simplicity and efficiency without sacrificing recognition accuracy. Our proposed architecture is a fully
convolutional network without any recurrent connections trained with the CTC loss function. Thus it operates on arbitrary input sizes
and produces strings of arbitrary length in a very efficient and parallelizable manner. We show the generality and superiority of our
proposed text recognition architecture by achieving state of the art results on seven public benchmark datasets, covering a wide
spectrum of text recognition tasks, namely: Handwriting Recognition, CAPTCHA recognition, OCR, License Plate Recognition, and
Scene Text Recognition. Our proposed architecture has won the ICFHR2018 Competition on Automated Text Recognition on a READ
Dataset.
Index Terms—Text Recognition, Optical Character Recognition, Handwriting Recognition, CAPTCHA Solving, License Plate
Recognition, Convolutional Neural Network, Deep Learning.
1 INTRODUCTION
techniques that are suitable for any text recognition task and show how they affect the
performance of the system. We demonstrate the superior performance of our proposed system
through extensive experimentation on seven public benchmark datasets. We were also able, for the
first time, to demonstrate human-level performance on the reCAPTCHA dataset proposed recently in
a Science paper [13], which is a more than 20% absolute increase in CAPTCHA recognition rate
compared to their proposed RCN system. We also achieve state of the art performance on SVHN [14]
(the full sequence version), the unconstrained settings of the IAM English offline handwriting
dataset [15], the KHATT Arabic offline handwriting dataset [16], the University of Washington
(UW3) OCR dataset [17], and the AOLP license plate recognition dataset [18] (in all divisions).
Our proposed system has also won the ICFHR2018 Competition on Automated Text Recognition on a
READ Dataset [19], achieving more than 25% relative decrease in Character Error Rate (CER)
compared to the entry achieving the second place.

To summarize, we address the unconstrained text recognition problem. In particular, we make the
following contributions:
• We propose a simple, novel neural network architecture that is able to achieve state of the art
  performance, with feed forward connections only (no recurrent connections), and using only the
  highly efficient convolutional primitives.
• We propose a set of data augmentation procedures that are generic to any text recognition task
  and can boost the performance of any neural network architecture on text recognition.
• We conduct an extensive set of experiments on seven benchmark datasets to demonstrate the
  validity of our claims about the generality of our proposed architecture. We also perform an
  extensive ablation study on our proposed model to demonstrate the importance of each of its
  submodules.

Section 2 gives an overview of related work in the field. Section 3 describes our architecture
design and its training process in detail. Section 4 describes our extensive set of experiments
and presents its results.

each symbol of the target sequence alphabet. A CTC layer then computes, for a given input
sequence, a probability distribution over all possible output sequences. An output sequence can
then be predicted either greedily (i.e. taking the most likely output at each time-step) or using
beam-search. A thorough explanation of CTC can be found in [25].

BLSTM [26] + CTC was used to transcribe online handwriting data in [27]; it was also used for OCR
in [5]. Deep BLSTM + CTC was used for handwriting recognition (with careful preprocessing) in
[28]; it was also used on HOG features for scene text recognition in [29]. This architecture was
mostly abandoned in favor of CNN+BLSTM+CTC.

Deep MDLSTM [30] (interleaved with convolutions) + CTC achieved the best results on offline
handwriting recognition from their introduction in [31] until recently [32], [33], [34]. However,
their performance was recently matched by the much faster CNN+BLSTM+CTC architecture [35], [36].
MDLSTMs have also shown good results in speech recognition [37].

Various flavors of CNN+BLSTM+CTC currently deliver state of the art results in many text
recognition tasks. Breuel [38] shows how they improve over BLSTM+CTC for OCR tasks. Li et al. [8]
show state of the art license plate recognition using them. Shi et al. [39] use it for scene text
recognition; however, it is not currently the state of the art. Puigcerver [35] and Dutta et al.
[36] used CNN+BLSTM+CTC to match the performance of MDLSTMs in offline handwriting recognition.

There is a recent shift to recurrence-free architectures in all sequence modeling tasks [40] due
to their much better parallelization performance. We can see the trend of CNN+CTC in sequence
transduction problems as used in [41] for speech recognition, or in [42], [43] for scene text
recognition. The main differences between these latter works and ours are that, while we also
utilize CNN+CTC, (1) we use a completely different CNN architecture, and (2) while previous
CNN+CTC work was mostly demonstrative of the viability/competitiveness of the approach and was
carried out on a specific task, we show strong state of the art results on a wide range of text
recognition tasks.
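
The greedy prediction rule mentioned above (taking the most likely output at each time-step) is
normally followed by the standard CTC collapsing step: merge consecutive repeated labels, then
drop blanks. The following is a minimal Python sketch of that rule, not the authors'
implementation; the function name `ctc_greedy_decode`, the use of class index 0 for the blank,
and the toy per-frame scores are assumptions made for illustration.

```python
BLANK = 0  # assumption: class 0 is the CTC blank


def ctc_greedy_decode(frame_scores, alphabet):
    """Greedy CTC decoding for one sequence.

    frame_scores: list of per-time-step score lists, one score per class.
    alphabet: string whose k-th character belongs to class k + 1.
    """
    # 1) Pick the most likely class at every time-step.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]

    # 2) Merge consecutive repeats and 3) drop blanks.
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != BLANK:
            decoded.append(alphabet[label - 1])
        previous = label
    return "".join(decoded)


# Toy example: 6 time-steps, 3 classes (blank, 'a', 'b').
toy_scores = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.1, 0.8],    # b
    [0.1, 0.1, 0.8],    # b (repeat, merged)
]
print(ctc_greedy_decode(toy_scores, "ab"))
```

The blank between the first two groups of frames is what allows a genuinely doubled character to
survive the merging of repeats.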
[Figure 1: network diagram. Block labels appearing in the diagram: LayerNorm, LayerNorm(0, 1),
Concat(3), Split(3), Tanh, Tanh, Sigmoid, ELU, Softmax, DropOut(0.5), DropOut(0.75), CTC Loss,
GateBlock_1 ... GateBlock_L. Tensor shapes appearing in the diagram: H × W × 3, H × W × 16
(13 × 13), H × W × 19, h_i × w_i × c_{i-1}, h_i × w_i × c_i, h_i × w_i × c_i/2,
h_i × w_i × 3c_i/2, h_{i+1} × w_{i+1} × c_i, h_L × w_L × n_cls, w_L × n_cls. Conditional paths
are marked "if h_i ≠ h_{i+1} or w_i ≠ w_{i+1}" and "if c_i ≠ c_{i-1}". Inputs are labeled
Images, Image & Text Pairs, and Text.]

Legend:
LayerNorm(0, 1): Layer Normalization with static shift and scale of 0 and 1 (if these are
omitted, then they are learned).
Concat(d): Concatenate input tensors over dimension d into one tensor.
H × W × C: tensor of shape H × W × C.
H, W are height and width of original input image.
h_i × w_i × c_i are dimensions at the i-th GateBlock.
n_cls is the size of target alphabet plus 1 (CTC blank).

Fig. 1. Detailed diagram of our proposed network. We explicitly present details of the network, with clear description of each block in the legend
below the diagram. We also highlight each dimensionality change of input tensors as they flow through the network. (a) Our overall architecture.
Here, we assume the network has L stacked layers of the repeated GateBlock. (b) Level i of the repeated GateBlock.
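
The GateBlock of Fig. 1 (b) follows Equations 7 and 8 of Section 3.2: project the input to half
its channels with P1, compute [H1 − H2] · T on the reduced representation, project back with P2,
and add the residual input. Below is a hedged Python sketch of that data flow at a single spatial
position. It is an illustration only: every transformation is reduced to a 1x1 (pointwise) linear
map followed by its activation, so the depthwise spatial filters and the normalization layers of
the real block are omitted, and the names `gate_block`, `pointwise`, and `random_weights`, as
well as the random weights themselves, are assumptions.

```python
import math
import random


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pointwise(weights, vec, activation):
    """1x1 convolution at one pixel: matrix-vector product, then activation."""
    return [activation(sum(w * v for w, v in zip(row, vec))) for row in weights]


def random_weights(n_out, n_in, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def gate_block(y, params):
    """y_next = P2([H1(y') - H2(y')] * T(y')) + y, with y' = P1(y)."""
    y_low = pointwise(params["P1"], y, elu)          # Eq. 7: C -> C/2 channels
    h1 = pointwise(params["H1"], y_low, math.tanh)   # first transformation
    h2 = pointwise(params["H2"], y_low, math.tanh)   # second transformation
    t = pointwise(params["T"], y_low, sigmoid)       # gate in (0, 1)
    gated = [(a - b) * g for a, b, g in zip(h1, h2, t)]
    restored = pointwise(params["P2"], gated, elu)   # C/2 -> C channels
    return [r + skip for r, skip in zip(restored, y)]  # residual connection


rng = random.Random(0)
C = 8  # number of channels at this level (toy value)
params = {
    "P1": random_weights(C // 2, C, rng),
    "H1": random_weights(C // 2, C // 2, rng),
    "H2": random_weights(C // 2, C // 2, rng),
    "T": random_weights(C // 2, C // 2, rng),
    "P2": random_weights(C, C // 2, rng),
}
y = [rng.uniform(-1.0, 1.0) for _ in range(C)]
y_next = gate_block(y, params)
print(len(y), len(y_next))
```

Because the update is residual, a block whose P2 weights are all zero leaves its input unchanged,
which is one way to see why stacking many such blocks remains easy to optimize.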
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4
a good result. Bluche et al. [49] tackle full paragraph recognition (without line segmentation);
they use an MDLSTM as an encoder, an LSTM as a decoder, and another MDLSTM for computing attention
weights.

3 METHODOLOGY

In this section, we present details of our proposed architecture. First, Section 3.1 gives an
overview of the proposed architecture. Then, Section 3.2 presents the derivation of the gating
mechanism utilized by our architecture. After that, Section 3.3 discusses the data augmentation
techniques we use in the course of training. Finally, Section 3.4 describes our model training
and evaluation process.

3.1 Overall Architecture

Figure 1 presents a detailed diagram of the proposed architecture. We opted to use a very
detailed diagram for illustration instead of a generic one, in order to make it straightforward
to implement the proposed architecture in any of the popular deep learning frameworks without
much guessing.

As shown in Figure 1 (a), our proposed architecture is a fully convolutional network (FCN) that
makes heavy use of both Batch Normalization [50] and Layer Normalization [51] to both increase
convergence speed and regularize the training process. We also use Batch Renormalization [52] on
all Batch Normalization layers. This allows us to use batch sizes as low as 2 with minor effects
on final classification performance. Our main computational block is the depthwise separable
convolution. Using depthwise separable convolutions instead of regular convolutions gives the
same or better classification performance, but at a much faster convergence speed and with a
drastic reduction in parameter count and computational requirements. We also found the inductive
bias of depthwise convolutions to have a strong regularization effect on training. For example,
for our original architecture with regular convolutions, we had to use spatial DropOut [53] (i.e.
randomly dropping entire channels completely) after nearly every convolution, in contrast to a
single spatial DropOut layer after the stack of GateBlocks as we currently do. An important point
to note here is that, in contrast to other works that use stacks of depthwise separable
convolutions [54], [55], we found it crucial to use Batch Normalization in between the depthwise
convolution and the 1x1 pointwise convolution [56]. We also note here that we found spatial
DropOut to provide much better regularization than regular unstructured DropOut [57] on the whole
tensor.

First, the input image is Layer Normalized, the input features are then projected to 16 channels
using a 1x1 convolution, and then the input tensor is softmax-normalized, as this was found to
greatly increase convergence speed on most datasets. Next, each channel is preprocessed
independently with a 13x13 filter using a depthwise convolution, then we employ the only dense
connection used in the network by concatenating, on the channel dimension, the result of
preprocessing with the layer-normalized original image. The result is fed to a stack of
GateBlocks that are responsible for the majority of the input sequence modeling and target
sequence generation task through a fixed number of in-place whole-sequence transformation stages.
This was demonstrated to be effectively an unrolled iterative estimation [58] of the target
sequence. Spatial DropOut is applied on the output of the last GateBlock, then the channel
dimension is projected to be equal to the size of the target alphabet plus one (required for the
CTC blank [12]) using a 1x1 convolution. Finally, we perform global average pooling on the height
dimension. This step enables the network to handle inputs of any height. The result is then layer
normalized, softmax-normalized, and then fed to the CTC loss along with its corresponding text
label during training.

3.2 GateBlocks

Stacked GateBlocks are the main computational blocks of our model. Using attention gates to
control the inter-layer information flow, in the sense of filtering out unimportant signals (e.g.
background noise) and sharpening/intensifying important signals, has received a lot of interest
from many researchers lately [59], [60]. However, the original concept is old [61], [62]. Many
patterns of gating mechanisms have been proposed in the literature, first for connecting RNN time
steps as a technique to mitigate the vanishing gradient problem [61], [63]; this idea was then
extended to connecting layers of any feed-forward neural network. This was done either for
training very deep networks [64], [65] or for using the gates, as stated previously, as smart,
content-aware filters.

However, most of the attention gates that were suggested recently for connecting CNN layers use
ad-hoc mechanisms tested only on the target datasets without comparison to other mechanisms. We
opted to base our attention gates on the gating mechanism suggested by Highway Networks [64], as
they are based on the widely used, heavily studied LSTM networks. Also, a number of variants of
Highway networks were studied, compared and contrasted in subsequent studies [58], [66]. We also
validate the superiority of these gates experimentally in Section 4.

Assume a plain feed-forward L-layer neural network, with layers y_1 ... y_L. Each of these layers
applies a non-linear transformation H to its input to get its output, such that

    y_{i+1} = H(y_i)    (1)

In a Highway network, we have two additional non-linear transformations T and C. These act as
gates that can dynamically pass part of their input and suppress the other part, conditioned on
the input itself. Therefore:

    y_{i+1} = H(y_i) \cdot T(y_i) + y_i \cdot C(y_i)    (2)

The authors suggest tying the two gates, such that we have only one gate T and the other is its
complement, C = 1 − T. This way we obtain

    y_{i+1} = H(y_i) \cdot T(y_i) + y_i \cdot [1 - T(y_i)]    (3)

This can be re-factored to

    y_{i+1} = \left[ H(y_i) - y_i \right] \cdot T(y_i) + y_i    (4)

We should note here that the dimensionality of y_i, y_{i+1}, H(y_i), and T(y_i) must be the same
for this equation to be valid. This can be an important limitation from both the computational
and memory usage perspectives for wide
networks (e.g. for CNNs when each of those tensors has a large number of channels) or even small
networks when y_i has big spatial dimensions.

If we were to use the same original plain network but with residual connections, we would have

    y_{i+1} = \left[ H(y_i) \right] + y_i    (5)

Comparing Equations 4 and 5, it is evident that both equations have the same residual connection
as the rightmost term, and thus we can interpret the leftmost term of Equation 4 as the
transformation function applied to the input y_i. This interpretation motivates us to perform two
modifications to Equation 4.

First, we note that the leftmost term of Equation 4 performs a two-stage filtering operation on
its input y_i before adding the residual connection. The input signal y_i is first subtracted
from the output of the transformation function H, and the result is then multiplied by the gating
function T. We argue that all transformations to the input signal should be learned and should be
conditioned on the input signal y_i. So instead of using a single transformation function H, we
propose using two transformation functions H_1 and H_2, such that:

    y_{i+1} = \left[ H_1(y_i) - H_2(y_i) \right] \cdot T(y_i) + y_i    (6)

In all experiments in this paper, H_1, H_2, and T are implemented as depthwise separable
convolutions with tanh, tanh, and sigmoid activation functions respectively.

Second, the residual connection interpretation, along with the dimensionality problem we noted
above, motivates us to make a simple modification to Equation 4. This would allow us to
efficiently utilize Highway gates even for wide and deep networks.

Let P_1 be a transformation mapping x ∈ R^{H×W×C} to x' ∈ R^{H'×W'×C'} and P_2 the opposite
transformation, mapping x' ∈ R^{H'×W'×C'} to x ∈ R^{H×W×C}. We can change Equation 4 such that

    y'_i = P_1(y_i)    (7)

    y_{i+1} = P_2\left( \left[ H_1(y'_i) - H_2(y'_i) \right] \cdot T(y'_i) \right) + y_i    (8)

When using a sub-sampling transformation for P_1, this allows us to retain the optimization
benefit of residual connections (the rightmost term of Equation 8) whilst computing Highway gates
on a lower dimensional representation of y_i, which can be performed much faster and uses less
memory. In all our experiments in this paper, P_1 and P_2 are implemented as depthwise separable
convolutions with the Exponential Linear Unit (ELU) [67] as an activation function. Also, we
always set C' = C/2, H' = H, and W' = W. This means that sub-sampling is only done on the channel
dimension and spatial dimensions are kept the same.

3.3 Data Augmentation

Encoding domain knowledge into data through label-preserving data transformations has always been
very successful and has provided fundamental contributions to the success of data-driven learning
systems, both in the past and lately after the recent revival of neural networks. We here
describe a number of data augmentation techniques that are generic and can benefit most text
recognition problems; these techniques are applied independently to any input data sample during
training.

3.3.1 Projective Transforms

Text lines or words to be recognized can suffer various types of scaling or shearing and, more
generally, projective transforms. We apply a random projective transform to every input training
sample, by randomly sampling four points that correspond to the original four corners of the
original image (i.e. the sampled points represent the new corners of the image) under the
following criteria:
• We either change the x coordinate of all four points and fix the y coordinate or vice versa. We
  found it almost always better to either change the image vertically or horizontally but not
  both simultaneously.
• When changing either vertically or horizontally, the new distance between any connected pair of
  corners must not be less than half the original one, or more than double the original one.

3.3.2 Elastic Distortions

Elastic distortions were first proposed by [68] as a way of emulating uncontrolled oscillations
of hand muscles. They were then used in [69] for augmenting images of handwritten digits. In
[68], the authors apply a displacement field to each pixel in the original image, such that each
pixel moves by a random amount in both the x and y directions. The displacement field is built by
convolving a randomly initialized field with a Gaussian kernel. An obvious problem with this
technique is that it operates at a very fine level (the pixel level), which makes it more likely
that the resulting image is unrealistically distorted.

To counter this problem, we take the approach used by [70] and [71], working at a coarser level
to better preserve the localities of the input image. We generate a regular grid and apply a
random displacement field to the control points of the grid. The original image is then warped
according to the displaced control points. It is worth noting that the use of a coarse grid makes
it unnecessary to apply a smoothing Gaussian to the displacement field. However, unlike [70] we
do not align to the baseline of the input image, which makes it much simpler to implement. We
also impose the following criteria on the randomly generated displacement field, which is
uniformly sampled in our implementation:
• We apply the distortions either to the x direction or the y direction and not to both. As we
  did for the projective transforms, we also found here that it is better to apply distortions to
  one dimension at a time.
• We force the minimum width or height of a distorted grid rectangle to be 1. In other words,
  zero or negative values, although possible, are not allowed.

3.3.3 Sign Flipping

We randomly flip the sign of the input image tensor. This is motivated by the fact that text
content is invariant to its color or texture, and one of the simplest methods to learn this
invariance is sign flipping. It is important to stress that this augmentation improves
performance even if all training and
testing images have the same text and background colors. The fact that color-inversion data
augmentation improves text recognition performance was previously noticed in [72].

3.4 Model Training and Evaluation

We train our model using the CTC loss function. This enables using unsegmented pairs of
line-images and corresponding text transcriptions to train the model without any
character/frame-level alignment. Every epoch, training instances are sampled without replacement
from the set of training examples. The previously described augmentation techniques are applied
during training in a batch-wise manner, i.e. the same randomly generated values are applied to
the whole batch.

In order to generate target sequences, we simply use CTC greedy decoding [12], i.e. taking the
most probable label at each time step, then mapping the result using the CTC sequence mapping
function, which first removes repeated labels, then removes the blanks. In all our experiments,
we do not utilize any form of language modeling or any other non-visual cues.

We evaluate the quality of our models based on the popular Levenshtein edit distance [73]
evaluated on the character level and normalized by the length of the ground truth, which is
commonly known as Character Error Rate (CER). Due to the inherent ambiguity in some of the
datasets used for evaluation (e.g. handwriting datasets), we need another metric to illustrate
how much of this ambiguity has been learned by our model, and by how much the model performance
can increase if another signal (other than the visual one, e.g. a linguistic one) can help rank
the possible output sequence transcriptions. For this task, we use the CER@TopN metric, which
computes the CER as if we were allowed to choose for each output sequence the candidate with the
lowest CER out of those returned from the CTC beam search decoder [12]. This gives us an upper
bound on the maximum possible performance gain from using a secondary re-ranking signal.

We use Adam [74] to optimize our network parameters; the base learning rate is exponentially
decayed during the training process. We also evaluate our models using Polyak averaging [75] at
inference time.

4 EXPERIMENTS

In this section we evaluate and compare the proposed architecture to the previous state of the
art techniques by conducting extensive experiments on seven public benchmark datasets covering a
wide spectrum of text recognition tasks.

4.1 Implementation Details

For all our experiments, we use an initial learning rate of 5 × 10^{-3}, which is exponentially
decayed such that it reaches 0.1 of its value after 9 × 10^4 batches. We use the maximum batch
size allowed by our platform with a minimum size of 2. All our experiments are performed on a
machine equipped with a single Nvidia TITAN Xp GPU. All our models are implemented using
TensorFlow [76].

The variable part of our model is the GateBlock stack, which is (in our experiments)
parameterized by three parameters written as n(c_1, c_2). n is the number of GateBlocks; c_1 and
c_2 are the number of channels in the first and third GateBlock. For all instances of our model
used in the experiments, the number of channels is changed only in the first and third GateBlock.

4.2 Vicarious reCAPTCHA Dataset

TABLE 1
Results of our method compared to state of the art on the Vicarious reCAPTCHA Dataset. SER is the
whole sequence error rate, Aug. is whether data augmentation is applied (True) or not (False).
Case is whether error metrics are case sensitive (True) or not (False). Best results are in bold

Method             Input Scale   SER(%)   CER(%)   Aug.    Case
RCN [13]           2X            33.4     5.7      True    False
Human [13]         1X            12.6     -        -       False
Ours: 8(64,512)    1X            21.70    3.77     False   True
Ours: 8(64,512)    1X            12.86    2.14     True    True
Ours: 16(64,512)   1X            12.30    2.04     True    True

In a recent Science paper [13], Vicarious presented RCN, a system that could break most modern
text-based CAPTCHAs in a very data-efficient manner. They claimed it is multiple orders of
magnitude more data efficient than rivaling deep-learning based systems in breaking CAPTCHAs and
also more accurate. To support those claims they collected a number of datasets from multiple
online CAPTCHA systems and annotated them using humans. We choose to compare against the
reCAPTCHA subset specifically since they have also done an analysis of human performance on this
specific subset only (shown in Table 1). The reCAPTCHA dataset consists of 5500 images, 500 for
training and 5000 for testing, with the critical aspect being generalizing to unseen CAPTCHAs
using only the 500 training images. In our experiments, we split the training images into 475
images used for training and 25 images used for validation.

Before comparing our results directly to RCN's, it is important to note that all of RCN's
evaluations are case-insensitive, while our evaluations use the much harder case-sensitive
evaluation. The reason we did so is that case-sensitive evaluation is the norm for text
recognition and we were already getting accurate results without problem simplification.

As shown in Table 1, without data augmentation, and with half the input spatial resolution, our
system is already more than 10% (absolute) more accurate than RCN in SER (21.70% vs. 33.4%). In
fact, it is a lot more accurate since RCN's results are case-insensitive. When we apply data
augmentations and use a deeper model, our model is more than 20% (absolute) better than RCN
(12.30% vs. 33.4% SER), and more importantly, it is able to surpass the human performance
estimated in their paper.

4.3 SVHN

Street View House Numbers (SVHN) [14] is a challenging real-world dataset released by Google.
This dataset contains around 249k real world images of house numbers divided into 235k images for
training and validation and 13k for testing,
TABLE 7
Results of our method compared to state of the art on the IAM Handwriting Database. Aug. is whether data augmentation is applied (True) or not
(False). Best results are in bold
TABLE 8
Results of our method compared to other methods submitted to the ICFHR2018 Competition on Automated Text Recognition. Here we show the CER
achieved by the system for each respective number of additional pages averaged over all 5 test documents. We also show the average CER for all
submissions for each of the 5 test documents. Imp. is the ratio of CER at 16 pages to CER at 0 pages (in percent). Other methods' results are taken from [19].
Best results are in bold.

                    CER per additional specific training pages          CER per test document
Method              0         1         4         16        Imp.   Konzil C   Schiller   Ricordi   Patzig    Schwerin   total error
Ours: 16(128,512)   25.3474   12.6283   8.2786    5.8248    22.9   6.4899     13.7660    17.3295   14.8451   12.3286    13.0198
Ours: 8(128,512)    25.9067   12.5891   8.3709    6.8175    26.3   6.4113     14.1694    17.7747   15.0185   13.2164    13.4210
OSU                 31.3987   17.7344   13.2672   9.0238    28.7   9.3941     21.0974    23.2664   23.1713   12.9845    17.8560
ParisTech           32.2516   19.7981   16.9794   14.7213   45.6   10.4938    19.0465    35.5964   23.8308   17.0202    20.9376
LITIS               35.2940   22.5078   16.8871   11.3448   32.1   9.1394     25.6926    30.5006   25.1841   18.0407    21.5084
PRHLT               32.7927   22.1470   17.8952   13.3288   40.6   8.6511     18.3928    35.0685   26.2566   18.6527    21.5409
RPPDI               30.8045   28.4038   27.2462   22.8461   74.1   11.9013    21.8799    37.2920   32.7516   28.5525    27.3252
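
The Imp. column can be recomputed from the CER columns: it is the CER at 16 additional pages
divided by the CER at 0 pages, expressed as a percentage. The short Python sketch below does this
for two rows of Table 8. The helper name `imp_percent` is hypothetical, and truncating (rather
than rounding) to one decimal is an assumption, made because it reproduces the tabulated value of
22.9 for the first row.

```python
import math

# CER (%) at 0 and at 16 additional specific training pages, copied from Table 8.
cer_by_pages = {
    "Ours: 16(128,512)": (25.3474, 5.8248),
    "OSU": (31.3987, 9.0238),
}


def imp_percent(cer_at_0, cer_at_16):
    """CER at 16 pages relative to CER at 0 pages, in percent, truncated to one decimal."""
    ratio = 100.0 * cer_at_16 / cer_at_0
    return math.floor(ratio * 10) / 10


for method, (cer0, cer16) in cer_by_pages.items():
    imp = imp_percent(cer0, cer16)
    print(f"{method}: Imp. = {imp}, relative CER decrease = {100 - imp:.1f}%")
```

A lower Imp. means that a larger share of the initial error is removed by the 16 additional
pages; an Imp. of 22.9 corresponds to a relative CER decrease of about 77%.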
TABLE 9
CER per document when using 16 additional specific training pages for methods submitted to the ICFHR2018 Competition on Automated Text
Recognition. We also describe the architecture used by every method and whether they use data augmentation on training and testing or not. <10
is the number of documents on which the CER is less than 10%. Best results are in bold.

Method      Architecture    Konzil C   Schiller   Ricordi   Patzig    Schwerin   <10
Ours        16(128,512)     2.8278     8.1746     11.4441   6.72669   2.2833     4
Ours        8(128,1024)     2.9467     8.5148     14.0927   6.77966   4.1130     4
OSU         (see below)     3.7874     12.4514    15.0412   12.5371   3.4996     2
ParisTech   (see below)     8.0163     14.5801    30.1986   15.5085   9.1846     2
LITIS       (see below)     4.8064     19.5665    16.3744   12.8284   6.6127     2
PRHLT       (see below)     4.9847     12.5486    28.5164   16.3506   7.1178     2
RPPDI       (see below)     9.1797     16.2908    30.4939   28.2998   24.9046    1

Method      Architecture                                                             Training Aug.                                                  Test Aug.
Ours        16(128,512)                                                              Yes                                                            No
Ours        8(128,1024)                                                              Yes                                                            No
OSU         7 Convs. + 2 BLSTMs                                                      Yes (grid distortion [70], rescaling, rotation and shearing)   Yes (37 copies)
ParisTech   13 Convs. + 3 BLSTMs + word LM                                           Yes [90]                                                       Yes [90]
LITIS       11 Convs. + 2 BLSTMs + 9-gram LM                                         Yes (baseline biasing)                                         No
PRHLT       4 Convolutional Recurrent + 4 BLSTMs + 7-gram char-LM                    No                                                             No
RPPDI       4-layer of [Conv. + 2D-LSTM], double input conv. [83] + 4-gram word-LM   No                                                             No
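
The CER figures in Tables 8 and 9 use the metric defined in Section 3.4: the character-level
Levenshtein edit distance between prediction and ground truth, normalized by the ground-truth
length. A minimal Python sketch of that definition follows; the function names and the toy
strings are assumptions for illustration, and this is not the competition's scoring code.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the ground-truth length, in percent."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


print(character_error_rate("kitten", "sitting"))
```

Under the empirical rule quoted from the READ project, a transcription would count as a helpful
starting point when this value falls below 10.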
reported in [19]. We achieve the best result in every single metric reported in [19]. Not only
does our method achieve the best results in the various forms of CER reported in Table 8, but we
also achieve the best improvement in CER from 0 specific training pages to 16 specific training
pages, with a nearly 80% decrease in CER (from 25.35% to 5.82%, Table 8), which means our method
is the most data-efficient of all submissions.

An important measure of a method's quality is its applicability to real-life use case scenarios
and resource constraints: given 16 well-transcribed pages out of a given document, how far can we
go? This is shown in Table 9. To assess the significance of the presented results, we use an
empirical rule developed within the READ project [92] (mentioned in [19]), stating that
transcriptions which achieve a CER below 10% are already helpful as an initial transcription
which can be corrected quickly by a human operator. We can see that while most submitted methods
can manage at most 2 documents with CER less than 10% when using 16 additional pages, our method
raises this number to 4 out of 5 tested documents. For the fifth document we achieve a CER of
11.4% (slightly higher than 10%). It is worth noting that while all training documents and the 4
other testing documents are German (from various dialects and centuries), this document (Ricordi)
is Italian, and it is the only Italian resource in the whole dataset.

4.9 Ablation Study

We here conduct an extensive ablation study to highlight which parts of our model are the most
crucial for the system's final classification performance. All experiments in this section are
performed on the IAM handwriting dataset [15], as it presents sufficient complexity to highlight
the significance of different components of our system. We resize input images to a height of 32
pixels while maintaining the aspect ratio; however, we limit image width to 200 pixels and
down-sample feature maps to a maximum of 4 x 50 pixels to limit the computational requirements
needed for our experiments. We also use a small baseline model, 8(32,64), for the same purpose.
All experimented models use a fixed parameter budget of roughly 71k parameters.

4.9.1 Gating Function

First, we experiment with different gating functions other than the baseline used in Equation 8.
For notational simplification we remove P_1 and P_2 from the equation, such that it becomes

    y_{i+1} = \left[ H_1(y_i) - H_2(y_i) \right] \cdot T(y_i) + y_i    (9)

In Table 10, the second model uses the gating function utilized by most previous work that made
use of inter-layer gating functions in CNNs, like [42], [59], [93], [94]. It can be seen that our
baseline model achieves a roughly 20% relative reduction in CER (22.85% vs. 27.82%). This
superiority may, however, be tied to our architectural choices. The third model uses a
simplification of our proposed gating function with one transformation function, resembling the
T-Only Highway network from [58]. It achieves slightly better CER than the baseline. We opted for
the baseline as it gives roughly the same performance but with two transformation functions, so
it has more expressive power. The fourth and fifth models represent other variations on our
proposed scheme where we try to change the input filtering order by performing multiplication
before subtraction. The last three models illustrate the importance of gating and residual
connections. It is interesting to note that using gates without residual connections is much
worse than removing both of them.

TABLE 10
Results of using different gating functions in our ablation study. CER is the validation CER of
the model. CTC is the validation CTC loss achieved by the model. Best results are in bold.

id   Method                                                 CER(%)   CTC
1    Baseline                                               22.85    33.65
2    y_{i+1} = [T(y_i) + 1] · y_i                           27.82    41.64
3    y_{i+1} = H_1(y_i) · T(y_i) + y_i                      22.46    33.17
4    y_{i+1} = [T(y_i) + 1] · H_1(y_i) − H_2(y_i) + y_i     24.11    35.66
5    y_{i+1} = H_1(y_i) · T(y_i) − H_2(y_i) + y_i           25.65    37.81
6    with residual connections, no gates                    25.53    37.30
7    no residual connections, with gates                    29.85    43.06
8    no residual connections, no gates                      25.89    38.23

4.9.2 Normalization

In Table 11, we show the effect of removing different parts of the normalization mechanisms
utilized by our model. The second row shows that removing Layer Normalization has the most
drastic effect on the model's performance. The third row shows that even just removing the two
Layer Normalization layers at the beginning and end of the model increases CER by almost 30%
(relative). The fourth row shows that Batch Normalization is also crucial for the model's
performance, although not as important as Layer Normalization. The last two rows show the effect
of removing Softmax normalization before GateBlocks or replacing it with just a Tanh activation
function.

TABLE 11
Results of using different normalization schemes in our ablation study. CER is the validation CER
of the model. CTC is the validation CTC loss achieved by the model. Best results are in bold.

Method                                       CER(%)   CTC
Baseline                                     22.85    33.65
no Layer Normalization                       44.92    68.24
no Layer Normalization at start and end      29.29    42.93
no Batch Normalization                       35.59    50.81
no Softmax before GateBlocks                 24.52    36.12
Tanh instead of Softmax before GateBlocks    24.43    36.11

5 CONCLUSION

In this paper, we tackled the problem of general, unconstrained text recognition. We presented a
simple, data- and computation-efficient neural network architecture that can be trained
end-to-end on variable-sized images using variable-sized line level transcriptions. We conducted
an extensive set of experiments on seven public benchmark datasets covering a wide range of text
recognition sub-tasks, and demonstrated state of the art performance on each one of them using
the same architecture and with minimal change in hyper-parameters. The experiments demonstrated
both the generality of our architecture and the simplicity of adapting it to any new text
recognition task. We also conducted an extensive ablation study on our model that highlighted the
importance of each of its submodules.

Our presented architecture is general enough for application to many sequence prediction problems.
A further research direction that we think is worthy of investigation is using it for automated
speech recognition (ASR), especially given its sample efficiency and high noise robustness, as
demonstrated by its performance on handwriting recognition and scene text recognition.

ACKNOWLEDGMENTS

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal
GPU used for this research.

REFERENCES

[1] H. F. Schantz, The history of OCR, optical character recognition. Recognition Technologies Users Association Manchester, VT, 1982.
[2] K. Fukushima and S. Miyake, “Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition,” in Competition and cooperation in neural nets. Springer, 1982, pp. 267–285.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[5] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and F. Shafait, “High-performance ocr for printed english and fraktur using lstm networks,” in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013, pp. 683–687.
[6] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 5, pp. 855–868, 2009.
[7] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Reading text in the wild with convolutional neural networks,” International Journal of Computer Vision, vol. 116, no. 1, pp. 1–20, 2016.
[8] H. Li and C. Shen, “Reading car license plates using deep convolutional neural networks and lstms,” arXiv preprint arXiv:1601.05610, 2016.
[9] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[10] L. Sifre and S. Mallat, “Rigid-motion scattering for image classification,” Ph.D. dissertation, Citeseer, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[12] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.
[13] D. George, W. Lehrach, K. Kansky, M. Lázaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang et al., “A generative vision model that trains with high data efficiency and breaks text-based captchas,” Science, vol. 358, no. 6368, p. eaag2612, 2017.
[14] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, no. 2, 2011, p. 5.
[15] U.-V. Marti and H. Bunke, “The iam-database: an english sentence database for offline handwriting recognition,” International Journal on Document Analysis and Recognition, vol. 5, no. 1, pp. 39–46, 2002.
[16] S. A. Mahmoud, I. Ahmad, W. G. Al-Khatib, M. Alshayeb, M. T. Parvez, V. Märgner, and G. A. Fink, “Khatt: An open arabic offline handwritten text database,” Pattern Recognition, vol. 47, no. 3, pp. 1096–1112, 2014.
[17] I. Phillips, “User’s reference manual for the uw english/technical document image database iii,” UW-III English/Technical Document Image Database Manual, 1996.
[18] G.-S. Hsu, J.-C. Chen, and Y.-Z. Chung, “Application-oriented license plate recognition,” IEEE transactions on vehicular technology, vol. 62, no. 2, pp. 552–561, 2013.
[19] T. Strauß, G. Leifert, R. Labahn, T. Hodel, and G. Mühlberger, “Icfhr2018 competition on automated text recognition on a read dataset,” 2018, pp. 477–482.
[20] Y. Zhu, C. Yao, and X. Bai, “Scene text detection and recognition: Recent advances and future trends,” Frontiers of Computer Science, vol. 10, no. 1, pp. 19–36, 2016.
[21] Q. Ye and D. Doermann, “Text detection and recognition in imagery: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 7, pp. 1480–1500, 2015.
[22] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[23] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
[24] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[25] A. Hannun, “Sequence modeling with ctc,” Distill, vol. 2, no. 11, p. e8, 2017.
[26] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.
[27] A. Graves, M. Liwicki, H. Bunke, J. Schmidhuber, and S. Fernández, “Unconstrained on-line handwriting recognition with recurrent neural networks,” in Advances in neural information processing systems, 2008, pp. 577–584.
[28] T. Bluche, “Deep neural networks for large vocabulary handwritten text recognition,” Ph.D. dissertation, Université Paris Sud-Paris XI, 2015.
[29] B. Su and S. Lu, “Accurate scene text recognition based on recurrent neural network,” in Asian Conference on Computer Vision. Springer, 2014, pp. 35–48.
[30] A. Graves, S. Fernández, and J. Schmidhuber, “Multi-dimensional recurrent neural networks,” in International Conference on Artificial Neural Networks. Springer, 2007, pp. 549–558.
[31] A. Graves and J. Schmidhuber, “Offline handwriting recognition with multidimensional recurrent neural networks,” in Advances in neural information processing systems, 2009, pp. 545–552.
[32] T. Bluche, J. Louradour, M. Knibbe, B. Moysset, M. F. Benzeghiba, and C. Kermorvant, “The a2ia arabic handwritten text recognition system at the open hart2013 evaluation,” in Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on. IEEE, 2014, pp. 161–165.
[33] V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, “Dropout improves recurrent neural networks for handwriting recognition,” in Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014, pp. 285–290.
[34] P. Voigtlaender, P. Doetsch, and H. Ney, “Handwriting recognition with large multidimensional long short-term memory recurrent neural networks,” in Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on. IEEE, 2016, pp. 228–233.
[35] J. Puigcerver, “Are multidimensional recurrent layers really necessary for handwritten text recognition?” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 67–72.
[36] K. Dutta, P. Krishnan, M. Mathew, and C. Jawahar, “Improving cnn-rnn hybrid networks for handwriting recognition,” 2018.
[37] J. Li, A. Mohamed, G. Zweig, and Y. Gong, “Exploring multidimensional lstms for large vocabulary asr,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4940–4944.
[38] T. M. Breuel, “High performance text recognition using a hybrid convolutional-lstm implementation,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 11–16.
[39] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 11, pp. 2298–2304, 2017.
[40] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
[41] Y. Wang, X. Deng, S. Pu, and Z. Huang, “Residual convolutional ctc networks for automatic speech recognition,” arXiv preprint arXiv:1702.07793, 2017.
[42] Y. Gao, Y. Chen, J. Wang, and H. Lu, “Reading scene text with attention convolutional sequence modeling,” arXiv preprint arXiv:1709.04303, 2017.
[43] F. Yin, Y.-C. Wu, X.-Y. Zhang, and C.-L. Liu, “Scene text recognition with sliding convolutional character models,” arXiv preprint arXiv:1709.01727, 2017.
[44] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, “Aster: an attentional scene text recognizer with flexible rectification,” IEEE transactions on pattern analysis and machine intelligence, 2018.
[45] Z. Wojna, A. N. Gorban, D.-S. Lee, K. Murphy, Q. Yu, Y. Li, and J. Ibarz, “Attention-based extraction of structured information from street view imagery,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 844–850.
[46] M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3367–3375.
[47] C.-Y. Lee and S. Osindero, “Recursive recurrent nets with attention modeling for ocr in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2231–2239.
[48] A. Chowdhury and L. Vig, “An efficient end-to-end neural model for handwritten text recognition,” arXiv preprint arXiv:1807.07965, 2018.
[49] T. Bluche, J. Louradour, and R. Messina, “Scan, attend and read: End-to-end handwritten paragraph recognition with mdlstm attention,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 1050–1055.
[50] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
[51] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[52] S. Ioffe, “Batch renormalization: Towards reducing minibatch dependence in batch-normalized models,” in Advances in Neural Information Processing Systems, 2017, pp. 1945–1953.
[53] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient object localization using convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 648–656.
[54] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
[55] L. Kaiser, A. N. Gomez, and F. Chollet, “Depthwise separable convolutions for neural machine translation,” arXiv preprint arXiv:1706.03059, 2017.
[56] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[57] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[58] K. Greff, R. K. Srivastava, and J. Schmidhuber, “Highway and residual networks learn unrolled iterative estimation,” arXiv preprint arXiv:1612.07771, 2016.
[59] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 6450–6458.
[60] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, 2017.
[61] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[62] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural computation, vol. 3, no. 1, pp. 79–87, 1991.
[63] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with lstm,” Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.
[64] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
[65] N. Kalchbrenner, I. Danihelka, and A. Graves, “Grid long short-term memory,” arXiv preprint arXiv:1507.01526, 2015.
[66] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, “Character-aware neural language models,” in AAAI, 2016, pp. 2741–2749.
[67] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.
[68] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., Aug 2003, pp. 958–963.
[69] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, June 2012, pp. 3642–3649.
[70] C. Wigington, S. Stewart, B. Davis, B. Barrett, B. Price, and S. Cohen, “Data augmentation for recognition of handwritten words and lines using a cnn-lstm network,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017, pp. 639–645.
[71] M. D. Bloice, C. Stocker, and A. Holzinger, “Augmentor: an image augmentation library for machine learning,” arXiv preprint arXiv:1708.04680, 2017.
[72] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation policies from data,” arXiv preprint arXiv:1805.09501, 2018.
[73] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet physics doklady, vol. 10, no. 8, 1966, pp. 707–710.
[74] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[75] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approximation by averaging,” SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992.
[76] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
[77] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, “Multi-digit number recognition from street view imagery using deep convolutional neural networks,” arXiv preprint arXiv:1312.6082, 2013.
[78] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.
[79] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
[80] A. Ablavatski, S. Lu, and J. Cai, “Enriched deep recurrent visual attention model for multiple object recognition,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017, pp. 971–978.
[81] C. Wu, S. Xu, G. Song, and S. Zhang, “How many labeled license plates are needed?” in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2018, pp. 334–346.
[82] R. Ahmad, S. Naz, M. Z. Afzal, S. F. Rashid, M. Liwicki, and A. Dengel, “Khatt: A deep learning benchmark on arabic script,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017, pp. 10–14.
[83] D. Castro, B. L. D. Bezerra, and M. Valença, “Boosting the deep multidimensional long-short-term memory network for handwritten recognition systems,” in 16th International Conference on Frontiers in Handwriting Recognition. IEEE, 2018.
[84] M. Villegas, V. Romero, and J. A. Sánchez, “On the modification of binarization algorithms to retain grayscale information for handwritten text recognition,” in Iberian Conference on Pattern Recognition and Image Analysis. Springer, 2015, pp. 208–215.
[85] S. Johansson, “The lob corpus of british english texts: presentation and comments,” ALLC journal, vol. 1, no. 1, pp. 25–36, 1980.
[86] P. Krishnan and C. Jawahar, “Generating synthetic data for text recognition,” arXiv preprint arXiv:1608.04224, 2016.
[87] S. Johansson, “The tagged LOB corpus: User’s manual,” 1986.