

Accurate, Data-Efficient, Unconstrained Text Recognition with Convolutional Neural Networks
Mohamed Yousef, Khaled F. Hussain, and Usama S. Mohammed

Abstract—Unconstrained text recognition is an important computer vision task, featuring a wide variety of different sub-tasks, each
with its own set of challenges. One of the biggest promises of deep neural networks has been the convergence and automation of
feature extractors from input raw signals, allowing for the highest possible performance with minimum required domain knowledge. To
this end, we propose a data-efficient, end-to-end neural network model for generic, unconstrained text recognition. In our proposed
architecture we strive for simplicity and efficiency without sacrificing recognition accuracy. Our proposed architecture is a fully
convolutional network without any recurrent connections trained with the CTC loss function. Thus it operates on arbitrary input sizes
and produces strings of arbitrary length in a very efficient and parallelizable manner. We show the generality and superiority of our
proposed text recognition architecture by achieving state of the art results on seven public benchmark datasets, covering a wide
spectrum of text recognition tasks, namely: Handwriting Recognition, CAPTCHA recognition, OCR, License Plate Recognition, and
Scene Text Recognition. Our proposed architecture has won the ICFHR2018 Competition on Automated Text Recognition on a READ
Dataset.

Index Terms—Text Recognition, Optical Character Recognition, Handwriting Recognition, CAPTCHA Solving, License Plate
Recognition, Convolutional Neural Network, Deep Learning.

• Mohamed Yousef and Khaled F. Hussain are with the Computer Science Department, Faculty of Computers and Information, Assiut University, Asyut 71515, Egypt (e-mail: mohamed.mahdi@fci.au.edu.eg; khussain@aun.edu.eg).
• Usama S. Mohammed is with the Electrical Engineering Department, Faculty of Engineering, Assiut University, Asyut 71515, Egypt (e-mail: usama@aun.edu.eg).

1 INTRODUCTION

TEXT recognition is considered one of the earliest computer vision tasks to be tackled by researchers. For more than a century now [1] since its inception as a field of research, researchers have never stopped working on it. This can be attributed to two important factors.

First, the pervasiveness of text and its importance to our everyday life, as a visual encoding of language that is used extensively in communicating and preserving all kinds of human thought. Second, the necessity of text to humans and its pervasiveness have led to big adequacy requirements on its delivery and reception, which in turn have led to the large variability and ever-increasing visual forms text can appear in. Text can originate either as printed or handwritten, with large possible variability in the handwriting styles, the printing fonts, and the formatting options. Text can be found organized in documents as lines, tables, forms, or cluttered in natural scenes. Text can suffer countless types and degrees of degradations, viewing distortions, occlusions, backgrounds, spacings, slantings, and curvatures. Text from spoken languages alone (as an important subset of human-produced text) is available in dozens of scripts that correspond to thousands of languages. All this has contributed to the long-standing complicated nature of unconstrained text recognition.

Since a deep Convolutional Neural Network (CNN) [2], [3] won the ImageNet image classification challenge [4], deep learning based techniques have invaded most tasks related to computer vision, either equating or surpassing all previous methods, at a fraction of the required domain knowledge and field expertise. Text recognition was no exception, and methods based on CNNs and Recurrent Neural Networks (RNNs) have dominated all text recognition tasks like OCR [5], handwriting recognition [6], scene text recognition [7], and license plate recognition [8], and have become the de facto standard for the task.

Despite their success, one can spot a number of shortcomings in these works. First, for many tasks, an RNN is required for achieving state of the art results, which brings in non-trivial latencies due to the sequential processing nature of RNNs. This is surprising, given the fact that, for purely visual text recognition, long range dependencies have no effect and only the local neighborhood should affect the final frame or character classification results. Second, for each of these tasks, we have a separate model that can, with its own set of tweaks and tricks, achieve state of the art results in a single task or a small number of tasks. Thus, no single model is demonstrated effective on the wide spectrum of text recognition tasks. Choosing a different model or feature-extractor for different input data, even inside a limited problem like text recognition, is a great burden for practitioners and clearly contradicts the idea of automatic or data-driven representation learning promised by deep learning methods.

In this work, we propose a very simple and novel neural network architecture for generic, unconstrained text recognition. Our proposed architecture is a fully convolutional CNN [9] that consists mostly of depthwise separable convolutions [10] with novel inter-layer residual connections [11] and gating, trained on full line or word labels using the CTC loss [12]. We also propose a set of generic data augmentation
techniques that are suitable for any text recognition task and show how they affect the performance of the system. We demonstrate the superior performance of our proposed system through extensive experimentation on seven public benchmark datasets. We were also able, for the first time, to demonstrate human-level performance on the reCAPTCHA dataset proposed recently in a Science paper [13], which is more than a 20% absolute increase in CAPTCHA recognition rate compared to their proposed RCN system. We also achieve state of the art performance on SVHN [14] (the full sequence version), the unconstrained settings of the IAM English offline handwriting dataset [15], the KHATT Arabic offline handwriting dataset [16], the University of Washington (UW3) OCR dataset [17], and the AOLP license plate recognition dataset [18] (in all divisions). Our proposed system has also won the ICFHR2018 Competition on Automated Text Recognition on a READ Dataset [19], achieving more than 25% relative decrease in Character Error Rate (CER) compared to the second place entry.

To summarize, we address the unconstrained text recognition problem. In particular, we make the following contributions:

• We propose a simple, novel neural network architecture that is able to achieve state of the art performance, with feed forward connections only (no recurrent connections), and using only the highly efficient convolutional primitives.
• We propose a set of data augmentation procedures that are generic to any text recognition task and can boost the performance of any neural network architecture on text recognition.
• We conduct an extensive set of experiments on seven benchmark datasets to demonstrate the validity of our claims about the generality of our proposed architecture. We also perform an extensive ablation study on our proposed model to demonstrate the importance of each of its submodules.

Section 2 gives an overview of related work in the field. Section 3 describes our architecture design and its training process in detail. Section 4 describes our extensive set of experiments and presents its results.

2 RELATED WORK

Work in text recognition is enormous and spans over a century; here we focus on methods based on deep learning, as they have been the state of the art for at least five years now. Traditional methods are reviewed in [20], [21], and very old methods are reviewed in [1].

There are two major trends in deep learning based sequence transduction problems, both avoiding the need for fine-grained alignment between source and target sequences. The first is using CTC [12], and the second is using sequence-to-sequence (Encoder-Decoder) models [22], [23], usually with attention [24].

2.1 CTC Models

In these models the neural network is used as an emission model, such that the input signal is divided into frames and the model computes, for each input frame, the likelihood of each symbol of the target sequence alphabet. A CTC layer then computes, for a given input sequence, a probability distribution over all possible output sequences. An output sequence can then be predicted either greedily (i.e. taking the most likely output at each time-step) or using beam search. A thorough explanation of CTC can be found in [25].
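As a concrete illustration of the training side of such models, the following is a minimal sketch of computing the CTC loss on a batch of per-frame emissions with TensorFlow. The tensor names, shapes, and the choice of the last class as the blank are illustrative assumptions, not taken from any specific cited system.

```python
import tensorflow as tf

# logits: per-frame class scores emitted by the network, shape [max_time, batch, num_labels]
# labels: dense int32 target transcriptions, shape [batch, max_label_len]
# label_lengths / logit_lengths: true lengths of each label sequence / frame sequence
def ctc_training_loss(logits, labels, label_lengths, logit_lengths):
    per_example = tf.nn.ctc_loss(
        labels=labels,
        logits=logits,
        label_length=label_lengths,
        logit_length=logit_lengths,
        logits_time_major=True,
        blank_index=-1,  # assumption: reserve the last class for the CTC blank
    )
    return tf.reduce_mean(per_example)
```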

BLSTM [26] + CTC was used to transcribe online handwriting data in [27], and was also used for OCR in [5]. Deep BLSTM + CTC was used for handwriting recognition (with careful preprocessing) in [28], and also on HOG features for scene text recognition in [29]. This architecture was mostly abandoned in favor of CNN+BLSTM+CTC.

Deep MDLSTM [30] (interleaved with convolutions) + CTC has achieved the best results on offline handwriting recognition since its introduction in [31] until recently [32], [33], [34]. However, its performance was recently equated by the much faster CNN+BLSTM+CTC architecture [35], [36]. MDLSTMs have also shown good results in speech recognition [37].

Various flavors of CNN+BLSTM+CTC currently deliver state of the art results in many text recognition tasks. Breuel [38] shows how they improve over BLSTM+CTC for OCR tasks. Li et al. [8] show state of the art license plate recognition using them. Shi et al. [39] use it for scene text recognition; however, it is not currently the state of the art. Puigcerver [35] and Dutta et al. [36] used CNN+BLSTM+CTC to equate the performance of MDLSTMs in offline handwriting recognition.

There is a recent shift to recurrence-free architectures in all sequence modeling tasks [40] due to their much better parallelization performance. We can see the trend of CNN+CTC in sequence transduction problems as used in [41] for speech recognition, or in [42], [43] for scene text recognition. The main difference between these latter works and ours is that (1) while we also utilize CNN+CTC, we use a completely different CNN architecture, and (2) while previous CNN+CTC work was mostly demonstrative of the viability/competitiveness of the approach and was carried out on a specific task, we show strong state of the art results on a wide range of text recognition tasks.

2.2 Encoder-Decoder Models

These models consist of two main parts. First, the encoder reads the whole input sequence, then the decoder produces the target sequence token-by-token using information produced by the encoder. An attention mechanism is usually utilized to allow the decoder to automatically (soft-)search for parts of the source sequence that are relevant to predicting the current target token [22], [23], [24].

In [44] a CNN+BLSTM is used as an encoder and a GRU [22] is used as a decoder for scene text recognition. A CNN encoder and an RNN decoder are used for reading street name signs from multiple views in [45]. Recurrent convolutional layers [46] are used as an encoder and an RNN as a decoder for scene text recognition in [47]. In [48] a CNN+BLSTM is used as an encoder and an LSTM is used as a decoder for handwriting recognition. They achieve competitive results, but their inference time is prohibitive, as they had to use beam search with a very wide beam to achieve a good result.

[Figure 1 — network diagram; the graphical content is not reproducible in this text rendering. Panel (a) shows the overall proposed architecture from the H × W × 3 input image to the CTC loss; panel (b) shows the details of the repeated GateBlock_i. Legend notes recoverable from the figure: H and W are the height and width of the original input image; h_i × w_i × c_i are the dimensions at the i-th GateBlock; n_cls is the size of the target alphabet plus 1 (the CTC blank); Split(d, n) splits the input tensor over dimension d into n tensors; Concat(d) concatenates input tensors over dimension d; DropOut(p) is spatial DropOut with a keep probability of p; LayerNorm(0, 1) denotes Layer Normalization with a static shift of 0 and scale of 1 (these are learned when omitted); depthwise convolutions use an n × m filter (default 3x3) followed by Batch Normalization, and 1x1 convolutions are followed by Batch Normalization; optional blocks are executed only when their condition holds (h_i ≠ h_{i+1} or w_i ≠ w_{i+1}; c_i ≠ c_{i−1}).]
Fig. 1. Detailed diagram of our proposed network. We explicitly present details of the network, with clear description of each block in the legend
below the diagram. We also highlight each dimensionality change of input tensors as they flow through the network. (a) Our overall architecture.
Here, we assume the network has L stacked layers of the repeated GateBlock. (b) Level i of the repeated GateBlock.

Bluche et al. [49] tackle full paragraph recognition (without line segmentation); they use an MDLSTM as an encoder, an LSTM as a decoder, and another MDLSTM for computing attention weights.

3 METHODOLOGY

In this section, we present details of our proposed architecture. First, Section 3.1 gives an overview of the proposed architecture. Then, Section 3.2 presents the derivation of the gating mechanism utilized by our architecture. After that, Section 3.3 discusses the data augmentation techniques we use in the course of training. Finally, Section 3.4 describes our model training and evaluation process.

3.1 Overall Architecture

Figure 1 presents a detailed diagram of the proposed architecture. We opted to use a very detailed diagram for illustration instead of a generic one, in order to make it straightforward to implement the proposed architecture in any of the popular deep learning frameworks without much guessing.

As shown in Figure 1 (a), our proposed architecture is a fully convolutional network (FCN) that makes heavy use of both Batch Normalization [50] and Layer Normalization [51] to both increase convergence speed and regularize the training process. We also use Batch Renormalization [52] on all Batch Normalization layers. This allows us to use batch sizes as low as 2 with minor effects on final classification performance. Our main computational block is the depthwise separable convolution. Using them instead of regular convolutions gives the same or better classification performance, but at a much faster convergence speed and a drastic reduction in parameter count and computational requirements. We also found the inductive bias of depthwise convolutions to have a strong regularization effect on training. For example, for our original architecture with regular convolutions, we had to use spatial DropOut [53] (i.e. randomly dropping entire channels completely) after nearly every convolution, in contrast to a single spatial DropOut layer after the stack of GateBlocks as we currently do. An important point to note here, in contrast to other works that use stacks of depthwise separable convolutions [54], [55], is that we found it crucial to use Batch Normalization in between the depthwise convolution and the 1x1 pointwise convolution [56]. We also note here that we found spatial DropOut to provide much better regularization than regular, un-structured DropOut [57] on the whole tensor.
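The point about placing Batch Normalization between the depthwise and the pointwise stage can be made concrete with a short sketch. This is a hedged illustration in TensorFlow/Keras, not the authors' exact code; the kernel size and the use of Batch Renormalization follow the description above.

```python
import tensorflow as tf

def separable_conv_bn(x, out_channels, kernel_size=3):
    """Depthwise separable convolution with Batch Normalization between the two stages."""
    x = tf.keras.layers.DepthwiseConv2D(kernel_size, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization(renorm=True)(x)   # BN between depthwise and pointwise
    x = tf.keras.layers.Conv2D(out_channels, 1, use_bias=False)(x)  # 1x1 pointwise convolution
    x = tf.keras.layers.BatchNormalization(renorm=True)(x)
    return x
```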
First, the input image is Layer Normalized, the input features are then projected to 16 channels using a 1x1 convolution, and the input tensor is then softmax-normalized, as this was found to greatly increase convergence speed on most datasets. Next, each channel is preprocessed independently with a 13x13 filter using a depthwise convolution, then we employ the only dense connection we use, by concatenating, on the channel dimension, the result of this preprocessing with the layer-normalized original image. The result is fed to a stack of GateBlocks that are responsible for the majority of the input sequence modeling and target sequence generation task, through a fixed number of in-place whole-sequence transformation stages. This was demonstrated to be effectively an unrolled iterative estimation [58] of the target sequence. Spatial DropOut is applied on the output of the last GateBlock, then the channel dimension is projected to be equal to the size of the target alphabet plus one (required for the CTC blank [12]) using a 1x1 convolution. Finally, we perform global average pooling on the height dimension. This step enables the network to handle inputs of any height. The result is then layer normalized, softmax-normalized, and fed to the CTC loss along with its corresponding text label during training.
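The output head described above can be rendered as the following sketch. It is a simplified illustration of the stated design (spatial DropOut, a 1x1 projection to the alphabet size plus the CTC blank, average pooling over height, then Layer Normalization and a softmax); the dropout rate and function name are assumptions.

```python
import tensorflow as tf

def recognition_head(features, num_classes):
    """features: [batch, height, width, channels], the output of the GateBlock stack."""
    x = tf.keras.layers.SpatialDropout2D(0.5)(features)   # drop whole channels
    x = tf.keras.layers.Conv2D(num_classes + 1, 1)(x)     # +1 for the CTC blank
    x = tf.reduce_mean(x, axis=1)                          # global average pool over height
    x = tf.keras.layers.LayerNormalization()(x)            # -> [batch, width, num_classes + 1]
    return tf.nn.log_softmax(x, axis=-1)                   # per-frame distribution fed to CTC
```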
3.2 GateBlocks

Stacked GateBlocks are the main computational blocks of our model. Using attention gates to control the inter-layer information flow, in the sense of filtering out unimportant signals (e.g. background noise) and sharpening/intensifying important signals, has received a lot of interest from many researchers lately [59], [60]. However, the original concept is old [61], [62]. Many patterns of gating mechanisms have been proposed in the literature, first for connecting RNN time steps as a technique to mitigate the vanishing gradient problem [61], [63]; this idea was then extended to connecting the layers of any feed-forward neural network, either for training very deep networks [64], [65] or for use, as stated previously, as smart, content-aware filters.

However, most of the attention gates that were suggested recently for connecting CNN layers use ad-hoc mechanisms tested only on the target datasets without comparison to other mechanisms. We opted to base our attention gates on the gating mechanism suggested by Highway Networks [64], as they are based on the widely used, heavily studied LSTM networks. Also, a number of variants of Highway networks were studied, compared and contrasted in subsequent studies [58], [66]. We also validate the superiority of these gates experimentally in Section 4.

Assume a plain feed-forward L-layer neural network, with layers y_1 ... y_L. Each of these layers applies a non-linear transformation H to its input to get its output, such that

y_{i+1} = H(y_i)    (1)

In a Highway network, we have two additional non-linear transformations T and C. These act as gates that can dynamically pass part of their inputs and suppress the other part, conditioned on the input itself. Therefore:

y_{i+1} = H(y_i) · T(y_i) + y_i · C(y_i)    (2)

The authors suggest tying the two gates, such that we have only one gate T and the other is its opposite, C = 1 − T. This way we obtain

y_{i+1} = H(y_i) · T(y_i) + y_i · [1 − T(y_i)]    (3)

This can be re-factored to

y_{i+1} = [H(y_i) − y_i] · T(y_i) + y_i    (4)

We should note here that the dimensionality of y_i, y_{i+1}, H(y_i), and T(y_i) must be the same for this equation to be valid. This can be an important limitation from both the computational and memory usage perspectives for wide
networks (e.g. for CNNs when each of those tensors has a large number of channels), or even for small networks when y_i has big spatial dimensions.

If we were to use the same original plain network but with residual connections, we would have

y_{i+1} = H(y_i) + y_i    (5)

Comparing Equations 4 and 5, it is evident that both equations have the same residual connection on the right hand-side, and thus we can interpret the left hand-side of Equation 4 as the transformation function applied to the input y_i. This interpretation motivates us to perform two modifications to Equation 4.

First, we note that the left hand-side of Equation 4 performs a two-stage filtering operation on its input y_i before adding the residual connection. The output of the transformation function H is first subtracted from the input signal y_i, then multiplied by the gating function T. We argue that all transformations to the input signal should be learned and should be conditioned on the input signal y_i. So instead of using a single transformation function H, we propose using two transformation functions H_1 and H_2, such that:

y_{i+1} = [H_1(y_i) − H_2(y_i)] · T(y_i) + y_i    (6)

In all experiments in this paper, H_1, H_2, and T are implemented as depthwise separable convolutions with tanh, tanh, and sigmoid activation functions respectively.

Second, the residual connection interpretation, along with the dimensionality problem we noted above, motivates us to make a simple modification to Equation 4. This allows us to efficiently utilize Highway gates even for wide and deep networks.

Let P_1 be a transformation mapping x ∈ R^{H×W×C} to x' ∈ R^{H'×W'×C'}, and P_2 the opposite transformation, mapping x' ∈ R^{H'×W'×C'} to x ∈ R^{H×W×C}. We can change Equation 4 such that

y'_i = P_1(y_i)    (7)

y_{i+1} = P_2([H_1(y'_i) − H_2(y'_i)] · T(y'_i)) + y_i    (8)

When using a sub-sampling transformation for P_1, this allows us to retain the optimization benefit of residual connections (right hand-side of Equation 8) whilst computing Highway gates on a lower dimensional representation of y_i, which can be performed much faster and uses less memory. In all our experiments in this paper, P_1 and P_2 are implemented as depthwise separable convolutions with the Exponential Linear Unit (ELU) [67] as the activation function. Also, we always set C' = C/2, H' = H, and W' = W. This means that sub-sampling is only done on the channel dimension and the spatial dimensions are kept the same.
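A minimal sketch of a GateBlock following Equations 6-8 is given below (TensorFlow/Keras). It is an illustration under the stated choices (tanh/tanh/sigmoid for H_1, H_2, T; ELU for P_1, P_2; C' = C/2), and it omits the exact normalization and Split/Concat layout shown in Figure 1(b).

```python
import tensorflow as tf
L = tf.keras.layers

def sep_conv(x, channels, activation):
    """Depthwise separable convolution with BN between the stages (see Section 3.1)."""
    x = L.DepthwiseConv2D(3, padding='same', use_bias=False)(x)
    x = L.BatchNormalization(renorm=True)(x)
    x = L.Conv2D(channels, 1, use_bias=False)(x)
    x = L.BatchNormalization(renorm=True)(x)
    return L.Activation(activation)(x)

def gate_block(y):
    c = y.shape[-1]
    y_small = sep_conv(y, c // 2, 'elu')        # P1: sub-sample channels, Eq. (7)
    h1 = sep_conv(y_small, c // 2, 'tanh')      # H1
    h2 = sep_conv(y_small, c // 2, 'tanh')      # H2
    t = sep_conv(y_small, c // 2, 'sigmoid')    # gate T
    gated = (h1 - h2) * t                       # [H1 - H2] * T, Eq. (6)
    out = sep_conv(gated, c, 'elu')             # P2: project back to C channels
    return out + y                              # residual connection, Eq. (8)
```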
3.3 Data Augmentation

Encoding domain knowledge into data through label-preserving data transformations has always been very successful and has provided fundamental contributions to the success of data-driven learning systems, both in the past and more recently after the revival of neural networks. We here describe a number of data augmentation techniques that are generic and can benefit most text recognition problems; these techniques are applied independently to any input data sample during training.

3.3.1 Projective Transforms

Text lines or words to be recognized can suffer various types of scaling or shearing, and more generally projective transforms. We apply a random projective transform to every input training sample, by randomly sampling four points that correspond to the original four corners of the image (i.e. the sampled points represent the new corners of the image), under the following criteria (a sketch of this procedure follows the list):

• We either change the x coordinate of all four points and fix the y coordinate, or vice versa. We found it almost always better to change the image either vertically or horizontally, but not both simultaneously.
• When changing either vertically or horizontally, the new distance between any connected pair of corners must not be less than half the original one, or more than double the original one.
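A possible implementation of this augmentation is sketched below using OpenCV; the library choice, helper name, and sampling range are illustrative assumptions, and the half/double distance constraint from the list above is not explicitly enforced here.

```python
import cv2
import numpy as np

def random_projective(img, max_shift=0.25):
    """Randomly move the four corners along one axis only and warp the image."""
    h, w = img.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src.copy()
    axis = np.random.randint(2)                   # 0: move x coordinates, 1: move y coordinates
    span = (w if axis == 0 else h) * max_shift
    dst[:, axis] += np.random.uniform(-span, span, size=4).astype(np.float32)
    M = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(img, M, (w, h), borderMode=cv2.BORDER_REPLICATE)
```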
3.3.2 Elastic Distortions

Elastic distortions were first proposed by [68] as a way of emulating uncontrolled oscillations of hand muscles. They were then used in [69] for augmenting images of handwritten digits. In [68], the authors apply a displacement field to each pixel in the original image, such that each pixel moves by a random amount in both the x and y directions. The displacement field is built by convolving a randomly initialized field with a Gaussian kernel. An obvious problem with this technique is that it operates at a very fine level (the pixel level), which makes it more likely that the resulting image is unrealistically distorted.

To counter this problem, we take the approach used by [70] and [71] of working at a coarser level to better preserve the localities of the input image. We generate a regular grid and apply a random displacement field to the control points of the grid. The original image is then warped according to the displaced control points. It is worth noting that the use of a coarse grid makes it unnecessary to apply a smoothing Gaussian to the displacement field. However, unlike [70], we do not align to the baseline of the input image, which makes it much simpler to implement. We also impose the following criteria on the randomly generated displacement field, which is uniformly sampled in our implementation (a sketch follows the list):

• We apply the distortions either to the x direction or the y direction, and not to both. As we did for the projective transforms, we also found here that it is better to apply distortions to one dimension at a time.
• We force the minimum width or height of a distorted grid rectangle to be 1. In other words, zero or negative values, although possible, are not allowed.
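The grid-based variant can be sketched as follows; here the coarse displacement field is interpolated to full resolution and applied with a remap, which approximates warping by displaced control points. The grid size, displacement range, and the minimum-rectangle constraint from the list above are assumptions or left unenforced in this sketch.

```python
import cv2
import numpy as np

def coarse_elastic(img, grid=(4, 4), max_disp=5.0):
    """Displace the control points of a coarse grid along one axis and warp the image."""
    h, w = img.shape[:2]
    coarse = np.random.uniform(-max_disp, max_disp, size=grid).astype(np.float32)
    disp = cv2.resize(coarse, (w, h), interpolation=cv2.INTER_LINEAR)  # smooth, full-resolution field
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    if np.random.randint(2) == 0:
        xs = xs + disp     # distort along x only
    else:
        ys = ys + disp     # or along y only
    return cv2.remap(img, xs, ys, interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)
```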
3.3.3 Sign Flipping

We randomly flip the sign of the input image tensor. This is motivated by the fact that text content is invariant to its color or texture, and one of the simplest ways to expose the model to this invariance is sign flipping. It is important to stress that this augmentation improves performance even if all training and
testing images have the same text and background colors. The fact that color-inversion data augmentation improves text recognition performance was previously noticed in [72].

3.4 Model Training and Evaluation

We train our model using the CTC loss function. This enables using unsegmented pairs of line-images and corresponding text transcriptions to train the model without any character/frame-level alignment. Every epoch, training instances are sampled without replacement from the set of training examples. The previously described augmentation techniques are applied during training in a batch-wise manner, i.e. the same randomly generated values are applied to the whole batch.

In order to generate target sequences, we simply use CTC greedy decoding [12]: i.e. taking the most probable label at each time step, then mapping the result using the CTC sequence mapping function, which first removes repeated labels and then removes the blanks. In all our experiments, we do not utilize any form of language modeling or any other non-visual cues.
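A minimal sketch of this greedy decoding and mapping step (best label per frame, collapse repeats, then drop blanks) is given below; the function name and the blank-index convention are assumptions for illustration.

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank_id):
    """log_probs: [time, num_labels] per-frame scores for one line image."""
    best_path = np.argmax(log_probs, axis=-1)
    output, prev = [], None
    for label in best_path:
        if label != prev and label != blank_id:   # collapse repeats, then drop blanks
            output.append(int(label))
        prev = label
    return output
```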
We evaluate the quality of our models based on the popular Levenshtein edit distance [73], evaluated on the character level and normalized by the length of the ground truth, which is commonly known as the Character Error Rate (CER). Due to the inherent ambiguity in some of the datasets used for evaluation (e.g. handwriting datasets), we need another metric to illustrate how much of this ambiguity has been learned by our model, and by how much the model performance can increase if another signal (other than the visual one, e.g. a linguistic one) can help rank the possible output sequence transcriptions. For this task, we use the CER@TopN metric, which computes the CER as if we were allowed to choose, for each output sequence, the candidate with the lowest CER out of those returned from the CTC beam search decoder [12]. This gives us an upper bound on the maximum possible performance gain from using a secondary re-ranking signal.
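For reference, CER and CER@TopN can be computed as in the following sketch; the Levenshtein routine is a textbook implementation and the candidate list is assumed to come from the beam search decoder.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis, reference):
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

def cer_at_top_n(candidates, reference):
    """candidates: the Top-N transcriptions returned by the CTC beam search decoder."""
    return min(cer(c, reference) for c in candidates)
```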
We use Adam [74] to optimize our network parameters, and the base learning rate is exponentially decayed during the training process. We also evaluate our models using Polyak averaging [75] at inference time.

4 EXPERIMENTS

In this section we evaluate and compare the proposed architecture to the previous state of the art techniques by conducting extensive experiments on seven public benchmark datasets covering a wide spectrum of text recognition tasks.

4.1 Implementation Details

For all our experiments, we use an initial learning rate of 5 × 10^-3, which is exponentially decayed such that it reaches 0.1 of its value after 9 × 10^4 batches. We use the maximum batch size allowed by our platform, with a minimum size of 2. All our experiments are performed on a machine equipped with a single Nvidia TITAN Xp GPU. All our models are implemented using TensorFlow [76].

The variable part of our model is the GateBlock stack, which is (in our experiments) parameterized by three parameters written as n(c1, c2): n is the number of GateBlocks, and c1 and c2 are the number of channels in the first and third GateBlock. For all instances of our model used in the experiments, the number of channels is changed only in the first and third GateBlock.
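The optimization setup described above corresponds to the following sketch (TensorFlow/Keras); the Polyak averaging of weights used at inference time is not shown.

```python
import tensorflow as tf

# Decay the learning rate exponentially so it reaches 0.1 of its initial value after 9e4 batches.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=5e-3,
    decay_steps=90_000,
    decay_rate=0.1,
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```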
4.2 Vicarious reCAPTCHA Dataset

TABLE 1: Results of our method compared to state of the art on the Vicarious reCAPTCHA dataset. SER is the whole sequence error rate. Aug. is whether data augmentation is applied (True) or not (False). Case is whether error metrics are case sensitive (True) or not (False). Best results are in bold.

Method | Input Scale | SER(%) | CER(%) | Aug. | Case
RCN [13] | 2X | 33.4 | 5.7 | True | False
Human [13] | 1X | 12.6 | - | - | False
Ours: 8(64,512) | 1X | 21.70 | 3.77 | False | True
Ours: 8(64,512) | 1X | 12.86 | 2.14 | True | True
Ours: 16(64,512) | 1X | 12.30 | 2.04 | True | True

In a recent Science paper [13], Vicarious presented RCN, a system that could break most modern text-based CAPTCHAs in a very data-efficient manner; they claimed it is multiple orders of magnitude more data efficient than rivaling deep-learning based systems in breaking CAPTCHAs, and also more accurate. To support those claims they collected a number of datasets from multiple online CAPTCHA systems and annotated them using humans. We choose to compare against the reCAPTCHA subset specifically, since they have also done an analysis of human performance on this specific subset only (shown in Table 1). The reCAPTCHA dataset consists of 5500 images, 500 for training and 5000 for testing, with the critical aspect being generalizing to unseen CAPTCHAs using only the 500 training images. In our experiments, we split the training images into 475 images used for training and 25 images used for validation.

Before comparing our results directly to RCN's, it is important to note that all RCN's evaluations are case-insensitive, while our evaluations use the much harder case-sensitive evaluation; the reason we did so is that case-sensitive evaluation is the norm for text recognition and we were already getting accurate results without problem simplification.

As shown in Table 1, without data augmentation, and with half the input spatial resolution, our system is already 10% more accurate than RCN. In fact, it is a lot more accurate, since RCN's results are case-insensitive. When we apply data augmentations and use a deeper model, our model is 20% better than RCN, and more importantly, it is able to surpass the human performance estimated in their paper.

4.3 SVHN

Street View House Numbers (SVHN) [14] is a challenging real-world dataset released by Google. This dataset contains around 249k real world images of house numbers, divided into 235k images for training and validation and 13k for testing;
the task is to recognize the sequence of numbers in each image. There are between 1 and 5 digits in each image, with a large variability in texture, scale, and spatial arrangement.

Following [77] we formed a validation set of 5k images by randomly sampling images from the training set. We also convert input RGB images to grayscale, as it was noticed by many authors [77], [78] that this did not affect performance. Although most literature resizes the cropped digits image to 64x64 pixels, we found that for our model, 32x32 pixels was sufficient. This leads to much faster processing of input images. We only applied the data augmentations introduced in Section 3.3.

TABLE 2: Results of our method compared to state of the art on SVHN. SER is the whole sequence error rate. Parameters are in millions. Best results are in bold.

Method | Input Scale | SER(%) | Params | Year
11 layer Maxout CNN [77] | 64x64 | 3.96 | 51 | 2013
Single DRAM [78] | 64x64 | 5.1 | 14 | 2014
Single DRAM MC [78] | 64x64 | 4.4 | 14 | 2014
Double DRAM MC [78] | 64x64 | 3.9 | 28 | 2014
ST-CNN Single [79] | 64x64 | 3.7 | 33 | 2015
ST-CNN Multi [79] | 64x64 | 3.6 | 37 | 2015
EDRAM Single [80] | 64x64 | 4.36 | 11 | 2017
EDRAM Double [80] | 64x64 | 3.6 | 22 | 2017
Ours: 4(128,512) | 32x32 | 3.3 | 0.9 | 2018
Ours: 4(128,1024) | 32x32 | 3.1 | 3.4 | 2018

As shown in Table 2, after progress on this benchmark had stalled since 2015 at a full sequence error of 3.6%, we were able to advance this to 3.3% using only 0.9M parameters. That is a massive 25X reduction in parameter count from the second best performing system, at a better classification accuracy. Using a larger model with 3.4M parameters, we are able to advance the full sequence error to 3.1%.

4.4 University of Washington Database III

The University of Washington Database III (UW3) dataset [17] contains 1600 pages of document images from scientific journals and other common sources. We use the text lines extracted by [5], which comprise 100k lines split into 90k for training and 10k for testing. As an OCR benchmark it presents a different set of challenges, like small original input resolutions and the need for extremely low CER.

Geometric line normalization was introduced in [5] to aid text line recognition using 1D LSTMs, which are not translationally invariant along the vertical axis. It was also found essential for CNN-LSTM networks [38]. It was mainly modeled for targeting Latin scripts, which have many character pairs that are nearly identical in shape but are distinguished from each other based on their position and size relative to the text-line's baseline and x-height. It is a fairly complex, script-specific operation.

The only preprocessing we perform on images is to resize them such that the height of the input text line is 32 pixels, while maintaining the aspect ratio of the line image.

TABLE 3: Results of our method compared to state of the art on the University of Washington Database III. Norm. is whether text line geometric normalization is applied (True) or not (False). Epochs is the number of epochs of training the network took to converge. Best results are in bold.

Method | Input Scale | CER(%) | Norm. | Epochs
LSTM [5] | 32 x W | 0.60 | True | 100
LSTM [38] | 48 x W | 0.40 | True | 500
CNN-LSTM [38] | 48 x W | 0.17 | True | 500
CNN-LSTM [38] | 48 x W | 0.43 | False | 500
Ours: 4(128,512) | 32 x W | 0.10 | False | 5
Ours: 8(128,512) | 32 x W | 0.08 | False | 8

As shown in Table 3, even a fairly small model, without any form of input text line normalization or script-specific knowledge, is able to get a state of the art CER of 0.10% on the dataset. Using a deeper model, we are able to get a slightly better accuracy of 0.08% CER. One important thing to also note from the table is that our models take two orders of magnitude fewer epochs to converge to the final classification accuracy, and use smaller input spatial dimensions.

4.5 Application-Oriented License Plate (AOLP) Dataset

The Application-Oriented License Plate (AOLP) dataset [18] has 2049 images of Taiwanese license plates; it is categorized into three subsets with different levels of difficulty for detection and recognition: access control (AC), traffic law enforcement (LE), and road patrol (RP). The evaluation protocol used in all previous work on AOLP is to train on two subsets and test on the third (e.g. for the RP results, we train on AC and LE and test on RP).

TABLE 4: Results of our method compared to state of the art on the AOLP dataset. S is the whole sequence recognition accuracy. C is the character recognition accuracy. AC, LE, and RP are divisions of the AOLP dataset. Best results are in bold.

Method | AC S(%) | AC C(%) | LE S(%) | LE C(%) | RP S(%) | RP C(%)
LBP+LDA [18] | 88.5 | 96 | 86.6 | 94 | 85.7 | 95
CNN+BLSTM [8] | 94.85 | - | 94.19 | - | 88.38 | -
DenseNet [81] | 96.61 | 99.08 | 97.80 | 99.65 | 91.00 | 97.22
Ours: 4(128,512) | 97.63 | 99.56 | 97.65 | 98.95 | 92.93 | 98.27
Ours: 8(128,512) | 97.63 | 99.56 | 97.65 | 98.95 | 94.57 | 98.63

As shown in Table 4, we achieve state of the art results by a small margin on AC and LE, and by a big margin on the challenging RP subset. It is important to note that while [81] uses extensive data augmentation on the shape and color of the input license plate, we use only our described augmentation techniques and convert the input image to grayscale (thus doing away with most color information).

4.6 KHATT Database

KFUPM Handwritten Arabic TexT (KHATT) is a freely available offline handwritten text database of modern Arabic. It consists of scanned handwritten pages with different writers and text content. Segmentation of pages into lines is also provided, to allow the evaluation of line-based recognition systems directly without layout analysis. Forms were scanned at 200, 300, and 600 dpi resolutions. The database consists of 4000 paragraphs written by 1000 distinct writers across 18 countries. The 4000 paragraphs are divided into 2000 unique randomly selected paragraphs with different
text contents, and 2000 fixed paragraphs with the same text content covering all the shapes of Arabic characters.

Following [16], [82] we carry out our experiments on the 2000 unique randomly selected paragraphs, using the database's default split of these paragraphs into a training set with 4825 text-lines, a test set with 966 text-lines, and a validation set with 937 text-line images.

The main preprocessing step we performed on KHATT was removing the large extra white space that is found in some line-images, as was done by [82]. We had to perform this step as the alternative was using line-images with a large height to account for this excessive space, which can be, in some images, many times more than the line height.

TABLE 5: Results of our method compared to state of the art on KHATT. Best results are in bold.

Method | CER(%)
HMM [16] | 53.87%
MDLSTM [82] | 24.20%
Ours: 8(128,512) | 9.9%
Ours: 16(128,512) | 9.5%
Ours: 16(128,1024) | 8.7%

As we do in all our experiments, we compare only with methods that report pure optical results (i.e. without using any language models or syntactic constraints). As shown in Table 5, we achieve a significant reduction in CER compared to the previous state of the art; this is partly due to the complexity of Arabic script and partly due to the fact that little work was previously done on KHATT.

4.7 IAM Handwriting Database

The IAM database [15] (modern English) is probably the most famous offline handwriting benchmark. It is compiled by the FKI-IAM Research Group. The dataset is handwritten by 657 different writers and composed of 1539 scanned text pages that correspond to English texts extracted from the LOB corpus [85]. They are partitioned into writer-independent training, validation and test partitions of 6161, 966 and 2915 lines respectively. The original line images in the training set have an average width of 1751 pixels and an average height of 124 pixels. There are 79 different character classes in the dataset.

Since the primary concern of this work is image-based text recognition and how to get the best CER purely out of visual data, we do not compare to any method that uses additional linguistic data to enhance the recognition rates of their systems (e.g. through the use of language models).

In Table 7 we give an overview and compare to pure optical methods in the literature. As we can see, our method achieves an absolute 1% decrease in CER (a relative 16% improvement) compared to the previous state of the art [35], [36], which used a CNN + BLSTM. This is despite using input that is down-sampled 4 times compared to [35], and despite [36] (which achieves 0.1% better CER than [35]) using pre-training on a big synthetic dataset [86] and test-time data augmentation.

An important question in visual classification problems - and text recognition is no exception - is: given an ambiguous case, can our model disentangle this ambiguity and shortlist the possible output classes? In a trial to answer this question we propose using the CER@TopN metric, which accepts N output sequences for an input image and returns the minimum CER computed between the ground-truth and each of the N output sequences. CTC beam search decoding can output a list of the Top-N candidate output sequences, ordered by their likelihood as computed by the model. We use this list to compute CER@TopN. For all our experiments we use a constant beam width of 10.

TABLE 6: Comparison of our top performing methods on IAM. C@N is the CER@TopN(%) metric.

Method | C@1 | C@2 | C@3 | C@4 | C@5 | C@6
Ours: 8(128,1024) | 5.8 | 5.1 | 4.8 | 4.6 | 4.4 | 4.3
Ours: 16(128,1024) | 4.9 | 4.2 | 3.9 | 3.8 | 3.6 | 3.5

In Table 6 we can see that when our model is allowed to give multiple answers it achieves significant gains; allowing a second choice gives a 15% relative reduction in CER for both models. For our larger model, with only 5 possible output sequences (full lines), it can match the state of the art result on IAM [34], which uses a combination of word and character 10-gram language models, both trained on the combined LOB [87], Brown [88], and Wellington [89] corpora containing around 3.5M running words.

4.8 ICFHR2018 Competition on Automated Text Recognition

The purpose of this open competition is to achieve a minimum character error rate on a previously unknown text corpus from which only a few pages are available for adjusting an already pre-trained recognition engine [19]. This is a very realistic setup, and is much more probable than the usual assumption that we already have thousands of hand-annotated lines, which is usually unrealistic due to the associated cost.

The dataset associated with the competition consists of 22 heterogeneous documents. Each of them was written by only one writer, but in different time periods, in Italian and in modern and medieval German. The training data is divided into a general set (17 documents) and a document-specific set (5 documents) with the same scripts as in the test set. The training set comprises roughly 25 pages per document. The test set contains the same 5 documents as the document-specific training set but uses a different set of pages: 15 pages for each document.

Each participant has to submit 4 transcriptions per test set, based on 0, 1, 4 or 16 additional (specific) training pages, where 0 pages correspond to a baseline system without additional training. To identify the winner of the competition, the CER of all four transcriptions for each of the 5 test documents is calculated (i.e. each submission is comprised of 20 separate files). Our system is currently the best performing system in this open competition, achieving more than 25% relative decrease in CER compared to the second place entry, as shown on the competition website [91].

In Table 8 we can see that our method achieves a significant reduction in CER compared to the other submissions reported in [19].

TABLE 7: Results of our method compared to state of the art on the IAM Handwriting Database. Aug. is whether data augmentation is applied (True) or not (False). Best results are in bold.

Method | Preprocessing | Input Scale | Aug. | Val CER(%) | Test CER(%)
5 layers MLP, 1024 units each [28] | De-skewing, de-slanting, contrast normalization, and region height normalization | 72 x W | No | - | 15.6
7-layers BLSTM, 200 units each [28] | Same as above | 72 x W | No | - | 7.3
3-layers of [Conv. + 2D-LSTM], (4,20,100) units [33] | Mean and variance normalization of the pixel values | Original | No | 7.4 | 10.8
3-layers of [Conv. + 2D-LSTM], wide layers [83] | De-slanting and inversion | Original | No | 5.35 | 7.85
5-layers of [Conv. + 2D-LSTM], wide layers [83] | Same as above | Original | No | 4.42 | 6.64
4-layers of [Conv. + 2D-LSTM], wide layers, double input conv. [83] | Same as above | Original | No | 4.31 | 6.39
5 Convs. + 5 BLSTMs [35] | De-skewing and binarization [84] | 128 x W | No | 5.1 | 8.2
5 Convs. + 5 BLSTMs [35] | Same as above | 128 x W | Yes | 3.8 | 5.8
STN [79] + ResNet18 + BLSTM stack [36] | De-skewing, de-slanting, model pre-training | - | Yes | - | 5.7
Ours: 8(128,1024) | None | 32 x W | Yes | 4.1 | 5.8
Ours: 16(128,1024) | None | 32 x W | Yes | 3.3 | 4.9

TABLE 8: Results of our method compared to other methods submitted to the ICFHR2018 Competition on Automated Text Recognition. We show the CER achieved by each system for each respective number of additional pages, averaged over all 5 test documents. We also show the average CER of each submission for each of the 5 test documents. Imp. is the ratio of CER at 16 pages to CER at 0 pages. Other methods' results are taken from [19]. Best results are in bold.

Method | CER @ 0 pages | CER @ 1 page | CER @ 4 pages | CER @ 16 pages | Imp. | Konzil C | Schiller | Ricordi | Patzig | Schwerin | Total error
Ours: 16(128,512) | 25.3474 | 12.6283 | 8.2786 | 5.8248 | 22.9 | 6.4899 | 13.7660 | 17.3295 | 14.8451 | 12.3286 | 13.0198
Ours: 8(128,512) | 25.9067 | 12.5891 | 8.3709 | 6.8175 | 26.3 | 6.4113 | 14.1694 | 17.7747 | 15.0185 | 13.2164 | 13.4210
OSU | 31.3987 | 17.7344 | 13.2672 | 9.0238 | 28.7 | 9.3941 | 21.0974 | 23.2664 | 23.1713 | 12.9845 | 17.8560
ParisTech | 32.2516 | 19.7981 | 16.9794 | 14.7213 | 45.6 | 10.4938 | 19.0465 | 35.5964 | 23.8308 | 17.0202 | 20.9376
LITIS | 35.2940 | 22.5078 | 16.8871 | 11.3448 | 32.1 | 9.1394 | 25.6926 | 30.5006 | 25.1841 | 18.0407 | 21.5084
PRHLT | 32.7927 | 22.1470 | 17.8952 | 13.3288 | 40.6 | 8.6511 | 18.3928 | 35.0685 | 26.2566 | 18.6527 | 21.5409
RPPDI | 30.8045 | 28.4038 | 27.2462 | 22.8461 | 74.1 | 11.9013 | 21.8799 | 37.2920 | 32.7516 | 28.5525 | 27.3252

TABLE 9: CER per document when using 16 additional specific training pages, for methods submitted to the ICFHR2018 Competition on Automated Text Recognition. We also describe the architecture used by every method and whether data augmentation is used at training and test time. <10 is the number of documents on which the CER is less than 10%. Best results are in bold.

Method | Konzil C | Schiller | Ricordi | Patzig | Schwerin | <10 | Architecture | Training Aug. | Test Aug.
Ours | 2.8278 | 8.1746 | 11.4441 | 6.72669 | 2.2833 | 4 | 16(128,512) | Yes | No
Ours | 2.9467 | 8.5148 | 14.0927 | 6.77966 | 4.1130 | 4 | 8(128,1024) | Yes | No
OSU | 3.7874 | 12.4514 | 15.0412 | 12.5371 | 3.4996 | 2 | 7 Convs. + 2 BLSTMs | Yes (grid distortion [70], rescaling, rotation and shearing) | Yes (37 copies)
ParisTech | 8.0163 | 14.5801 | 30.1986 | 15.5085 | 9.1846 | 2 | 13 Convs. + 3 BLSTMs + word LM | Yes [90] | Yes [90]
LITIS | 4.8064 | 19.5665 | 16.3744 | 12.8284 | 6.6127 | 2 | 11 Convs. + 2 BLSTMs + 9-gram LM | Yes (baseline biasing) | No
PRHLT | 4.9847 | 12.5486 | 28.5164 | 16.3506 | 7.1178 | 2 | 4 Convolutional Recurrent + 4 BLSTMs + 7-gram char-LM | No | No
RPPDI | 9.1797 | 16.2908 | 30.4939 | 28.2998 | 24.9046 | 1 | 4-layer of [Conv. + 2D-LSTM], double input conv. [83] + 4-gram word-LM | No | No

We achieve the best result in every single metric reported in [19]. Not only does our method achieve the best results in the various forms of CER reported in Table 8, but we also achieve the best improvement in CER from 0 specific training pages to 16 specific training pages, with nearly an 80% decrease in CER, which means our method is the most data-efficient of all submissions.

An important measure of a method's quality is its applicability to real-life use case scenarios and resource constraints: given 16 well-transcribed pages of a given document, how far can we go? This is shown in Table 9. To assess the significance of the presented results, we use an empirical rule developed within the READ project [92] (mentioned in [19]), stating that transcriptions which achieve a CER below 10% are already helpful as an initial transcription which can be corrected quickly by a human operator. We can see that while most submitted methods can manage at most 2 documents with CER less than 10% when using 16 additional pages, our method raises this number to 4 out of the 5 tested documents. For the fifth document we achieve a CER of 11.4% (slightly higher than 10%). It is worth noting that while all training documents and the 4 other testing documents are German (from various dialects and centuries), this document (Ricordi) is Italian, and it is the only Italian resource in the whole dataset.

4.9 Ablation Study

We here conduct an extensive ablation study to highlight which parts of our model are the most crucial to the system's final classification performance. All experiments in this section are performed on the IAM handwriting dataset [15], as it presents sufficient complexity to highlight the significance of different components of our system. We resize input images to 32 pixels height while maintaining the aspect ratio; however, we limit the image width to 200 pixels and down-sample feature maps to a maximum of 4 x 50 pixels to limit the computational requirements needed for our experiments. We also use a small baseline model - 8(32,64) - for the same purpose. All experimented models use a fixed parameter budget of roughly 71k parameters.

4.9.1 Gating Function

First, we experiment with gating functions other than the baseline used in Equation 8. For notational simplification we remove P_1 and P_2 from the equation, such that it becomes

y_{i+1} = [H_1(y_i) − H_2(y_i)] · T(y_i) + y_i    (9)

TABLE 10: Results of using different gating functions in our ablation study. CER is the validation CER of the model. CTC is the validation CTC loss achieved by the model. Best results are in bold.

id | Method | CER(%) | CTC
1 | Baseline | 22.85 | 33.65
2 | y_{i+1} = [T(y_i) + 1] · y_i | 27.82 | 41.64
3 | y_{i+1} = H_1(y_i) · T(y_i) + y_i | 22.46 | 33.17
4 | y_{i+1} = [T(y_i) + 1] · H_1(y_i) − H_2(y_i) + y_i | 24.11 | 35.66
5 | y_{i+1} = H_1(y_i) · T(y_i) − H_2(y_i) + y_i | 25.65 | 37.81
6 | with residual connections, no gates | 25.53 | 37.30
7 | no residual connections, with gates | 29.85 | 43.06
8 | no residual connections, no gates | 25.89 | 38.23

In Table 10, the second model uses the gating function utilized by most previous work that made use of inter-layer gating functions in CNNs, like [42], [59], [93], [94]. It can be seen that our baseline model achieves a 20% relative reduction in CER. This superiority may, however, be tied to our architectural choices. The third model uses a simplification of our proposed gating function with one transformation function, resembling the T-Only Highway network from [58]. It achieves slightly better CER than the baseline. We opted for the baseline as it gives roughly the same performance but with two transformation functions, so it has more expressive power. The fourth and fifth models represent other variations on our proposed scheme where we try to change the input filtering order by performing multiplication before subtraction. The last three models illustrate the importance of gating and residual connections. It is interesting to note that using gates without residual connections is much worse than removing both of them.

4.9.2 Normalization

TABLE 11: Results of using different normalization schemes in our ablation study. CER is the validation CER of the model. CTC is the validation CTC loss achieved by the model. Best results are in bold.

Method | CER(%) | CTC
Baseline | 22.85 | 33.65
no Layer Normalization | 44.92 | 68.24
no Layer Normalization at start and end | 29.29 | 42.93
no Batch Normalization | 35.59 | 50.81
no Softmax before GateBlocks | 24.52 | 36.12
Tanh instead of Softmax before GateBlocks | 24.43 | 36.11

In Table 11, we show the effect of removing different parts of the normalization mechanisms utilized by our model. The second row shows that removing Layer Normalization has the most drastic effect on the model's performance. The third row shows that even just removing the two Layer Normalization layers at the beginning and end of the model increases CER by almost 30%. The fourth row shows that Batch Normalization is also crucial for the model's performance, although not as important as Layer Normalization. The last two rows show the effect of removing the Softmax normalization before the GateBlocks, or replacing it with just a Tanh activation function.

5 CONCLUSION

In this paper, we tackled the problem of general, unconstrained text recognition. We presented a simple, data- and computation-efficient neural network architecture that can be trained end-to-end on variable-sized images using variable-sized line level transcriptions. We conducted an extensive set of experiments on seven public benchmark datasets covering a wide range of text recognition sub-tasks, and demonstrated state of the art performance on each one of them using the same architecture and with minimal change in hyper-parameters. The experiments demonstrated
both the generality of our architecture and the simplicity of adopting it to any new text recognition task. We also conducted an extensive ablation study on our model that highlighted the importance of each of its submodules.

Our presented architecture is general enough for application to many sequence prediction problems; a further research direction that we think is worthy of investigation is using it for automated speech recognition (ASR), especially given its sample efficiency and high noise robustness as demonstrated by its performance on handwriting recognition and scene text recognition.

ACKNOWLEDGMENTS

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

REFERENCES

[1] H. F. Schantz, The history of OCR, optical character recognition. Recognition Technologies Users Association, Manchester, VT, 1982.
[2] K. Fukushima and S. Miyake, "Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition," in Competition and cooperation in neural nets. Springer, 1982, pp. 267–285.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097–1105.
[5] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and F. Shafait, "High-performance ocr for printed english and fraktur using lstm networks," in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013, pp. 683–687.
[6] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 5, pp. 855–868, 2009.
[7] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Reading text in the wild with convolutional neural networks," International Journal of Computer Vision, vol. 116, no. 1, pp. 1–20, 2016.
[8] H. Li and C. Shen, "Reading car license plates using deep convolutional neural networks and lstms," arXiv preprint arXiv:1601.05610, 2016.
[9] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[10] L. Sifre and S. Mallat, "Rigid-motion scattering for image classification," Ph.D. dissertation, Citeseer, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[12] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.
[13] D. George, W. Lehrach, K. Kansky, M. Lázaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang et al., "A generative vision model that trains with high data efficiency and breaks text-based captchas," Science, vol. 358, no. 6368, p. eaag2612, 2017.
[14] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature
[16] S. A. Mahmoud, I. Ahmad, W. G. Al-Khatib, M. Alshayeb, M. T. Parvez, V. Märgner, and G. A. Fink, "Khatt: An open arabic offline handwritten text database," Pattern Recognition, vol. 47, no. 3, pp. 1096–1112, 2014.
[17] I. Phillips, "User's reference manual for the uw english/technical document image database iii," UW-III English/Technical Document Image Database Manual, 1996.
[18] G.-S. Hsu, J.-C. Chen, and Y.-Z. Chung, "Application-oriented license plate recognition," IEEE transactions on vehicular technology, vol. 62, no. 2, pp. 552–561, 2013.
[19] T. Strauß, G. Leifert, R. Labahn, T. Hodel, and G. Mühlberger, "Icfhr2018 competition on automated text recognition on a read dataset," 2018, pp. 477–482.
[20] Y. Zhu, C. Yao, and X. Bai, "Scene text detection and recognition: Recent advances and future trends," Frontiers of Computer Science, vol. 10, no. 1, pp. 19–36, 2016.
[21] Q. Ye and D. Doermann, "Text detection and recognition in imagery: A survey," IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 7, pp. 1480–1500, 2015.
[22] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[23] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in neural information processing systems, 2014, pp. 3104–3112.
[24] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[25] A. Hannun, "Sequence modeling with ctc," Distill, vol. 2, no. 11, p. e8, 2017.
[26] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional lstm and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.
[27] A. Graves, M. Liwicki, H. Bunke, J. Schmidhuber, and S. Fernández, "Unconstrained on-line handwriting recognition with recurrent neural networks," in Advances in neural information processing systems, 2008, pp. 577–584.
[28] T. Bluche, "Deep neural networks for large vocabulary handwritten text recognition," Ph.D. dissertation, Université Paris Sud-Paris XI, 2015.
[29] B. Su and S. Lu, "Accurate scene text recognition based on recurrent neural network," in Asian Conference on Computer Vision. Springer, 2014, pp. 35–48.
[30] A. Graves, S. Fernández, and J. Schmidhuber, "Multi-dimensional recurrent neural networks," in International Conference on Artificial Neural Networks. Springer, 2007, pp. 549–558.
[31] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," in Advances in neural information processing systems, 2009, pp. 545–552.
[32] T. Bluche, J. Louradour, M. Knibbe, B. Moysset, M. F. Benzeghiba, and C. Kermorvant, "The a2ia arabic handwritten text recognition system at the open hart2013 evaluation," in Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on. IEEE, 2014, pp. 161–165.
[33] V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, "Dropout improves recurrent neural networks for handwriting recognition," in Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014, pp. 285–290.
[34] P. Voigtlaender, P. Doetsch, and H. Ney, "Handwriting recognition with large multidimensional long short-term memory recurrent neural networks," in Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on. IEEE, 2016, pp. 228–233.
[35] J. Puigcerver, "Are multidimensional recurrent layers really necessary for handwritten text recognition?" in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 67–72.
[36] K. Dutta, P. Krishnan, M. Mathew, and C. Jawahar, "Improving cnn-rnn hybrid networks for handwriting recognition," 2018.
[37] J. Li, A. Mohamed, G. Zweig, and Y. Gong, "Exploring multidimensional lstms for large vocabulary asr," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on.
learning,” in NIPS workshop on deep learning and unsupervised feature IEEE, 2016, pp. 4940–4944.
learning, vol. 2011, no. 2, 2011, p. 5. [38] T. M. Breuel, “High performance text recognition using a hybrid
[15] U.-V. Marti and H. Bunke, “The iam-database: an english sentence convolutional-lstm implementation,” in Document Analysis and
database for offline handwriting recognition,” International Journal Recognition (ICDAR), 2017 14th IAPR International Conference on,
on Document Analysis and Recognition, vol. 5, no. 1, pp. 39–46, 2002. vol. 1. IEEE, 2017, pp. 11–16.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 12

[39] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network [63] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to for-
for image-based sequence recognition and its application to scene get: Continual prediction with lstm,” Neural Computation, vol. 12,
text recognition,” IEEE transactions on pattern analysis and machine no. 10, pp. 2451–2471, 2000.
intelligence, vol. 39, no. 11, pp. 2298–2304, 2017. [64] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway net-
[40] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation works,” arXiv preprint arXiv:1505.00387, 2015.
of generic convolutional and recurrent networks for sequence [65] N. Kalchbrenner, I. Danihelka, and A. Graves, “Grid long short-
modeling,” arXiv preprint arXiv:1803.01271, 2018. term memory,” arXiv preprint arXiv:1507.01526, 2015.
[41] Y. Wang, X. Deng, S. Pu, and Z. Huang, “Residual convolutional [66] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, “Character-aware
ctc networks for automatic speech recognition,” arXiv preprint neural language models.” in AAAI, 2016, pp. 2741–2749.
arXiv:1702.07793, 2017. [67] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate
[42] Y. Gao, Y. Chen, J. Wang, and H. Lu, “Reading scene text deep network learning by exponential linear units (elus),” arXiv
with attention convolutional sequence modeling,” arXiv preprint preprint arXiv:1511.07289, 2015.
arXiv:1709.04303, 2017. [68] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for
[43] F. Yin, Y.-C. Wu, X.-Y. Zhang, and C.-L. Liu, “Scene text recogni- convolutional neural networks applied to visual document anal-
tion with sliding convolutional character models,” arXiv preprint ysis,” in Seventh International Conference on Document Analysis and
arXiv:1709.01727, 2017. Recognition, 2003. Proceedings., Aug 2003, pp. 958–963.
[44] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, “Aster: an [69] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep
attentional scene text recognizer with flexible rectification,” IEEE neural networks for image classification,” in 2012 IEEE Conference
transactions on pattern analysis and machine intelligence, 2018. on Computer Vision and Pattern Recognition, June 2012, pp. 3642–
[45] Z. Wojna, A. N. Gorban, D.-S. Lee, K. Murphy, Q. Yu, Y. Li, 3649.
and J. Ibarz, “Attention-based extraction of structured information [70] C. Wigington, S. Stewart, B. Davis, B. Barrett, B. Price, and S. Co-
from street view imagery,” in Document Analysis and Recognition hen, “Data augmentation for recognition of handwritten words
(ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, and lines using a cnn-lstm network,” in 2017 14th IAPR Interna-
2017, pp. 844–850. tional Conference on Document Analysis and Recognition (ICDAR).
[46] M. Liang and X. Hu, “Recurrent convolutional neural network IEEE, 2017, pp. 639–645.
for object recognition,” in Proceedings of the IEEE Conference on [71] M. D. Bloice, C. Stocker, and A. Holzinger, “Augmentor: an
Computer Vision and Pattern Recognition, 2015, pp. 3367–3375. image augmentation library for machine learning,” arXiv preprint
[47] C.-Y. Lee and S. Osindero, “Recursive recurrent nets with attention arXiv:1708.04680, 2017.
modeling for ocr in the wild,” in Proceedings of the IEEE Conference [72] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le,
on Computer Vision and Pattern Recognition, 2016, pp. 2231–2239. “Autoaugment: Learning augmentation policies from data,” arXiv
[48] A. Chowdhury and L. Vig, “An efficient end-to-end neural model preprint arXiv:1805.09501, 2018.
for handwritten text recognition,” arXiv preprint arXiv:1807.07965, [73] V. I. Levenshtein, “Binary codes capable of correcting deletions,
2018. insertions, and reversals,” in Soviet physics doklady, vol. 10, no. 8,
[49] T. Bluche, J. Louradour, and R. Messina, “Scan, attend and read: 1966, pp. 707–710.
End-to-end handwritten paragraph recognition with mdlstm at- [74] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-
tention,” in Document Analysis and Recognition (ICDAR), 2017 14th tion,” arXiv preprint arXiv:1412.6980, 2014.
IAPR International Conference on, vol. 1. IEEE, 2017, pp. 1050–1055. [75] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approx-
[50] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep imation by averaging,” SIAM Journal on Control and Optimization,
network training by reducing internal covariate shift,” in Interna- vol. 30, no. 4, pp. 838–855, 1992.
tional Conference on Machine Learning, 2015, pp. 448–456. [76] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
[51] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for
preprint arXiv:1607.06450, 2016. large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
[52] S. Ioffe, “Batch renormalization: Towards reducing minibatch [77] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, “Multi-
dependence in batch-normalized models,” in Advances in Neural digit number recognition from street view imagery using deep
Information Processing Systems, 2017, pp. 1945–1953. convolutional neural networks,” arXiv preprint arXiv:1312.6082,
[53] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, 2013.
“Efficient object localization using convolutional networks,” in [78] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition
Proceedings of the IEEE Conference on Computer Vision and Pattern with visual attention,” arXiv preprint arXiv:1412.7755, 2014.
Recognition, 2015, pp. 648–656. [79] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial trans-
[54] F. Chollet, “Xception: Deep learning with depthwise separable former networks,” in Advances in neural information processing
convolutions,” in Proceedings of the IEEE Conference on Computer systems, 2015, pp. 2017–2025.
Vision and Pattern Recognition, 2017, pp. 1251–1258. [80] A. Ablavatski, S. Lu, and J. Cai, “Enriched deep recurrent visual
[55] L. Kaiser, A. N. Gomez, and F. Chollet, “Depthwise separa- attention model for multiple object recognition,” in Applications of
ble convolutions for neural machine translation,” arXiv preprint Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE,
arXiv:1706.03059, 2017. 2017, pp. 971–978.
[56] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint [81] C. Wu, S. Xu, G. Song, and S. Zhang, “How many labeled license
arXiv:1312.4400, 2013. plates are needed?” in Chinese Conference on Pattern Recognition and
[57] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and Computer Vision (PRCV). Springer, 2018, pp. 334–346.
R. Salakhutdinov, “Dropout: a simple way to prevent neural net- [82] R. Ahmad, S. Naz, M. Z. Afzal, S. F. Rashid, M. Liwicki, and
works from overfitting,” The Journal of Machine Learning Research, A. Dengel, “Khatt: A deep learning benchmark on arabic script,”
vol. 15, no. 1, pp. 1929–1958, 2014. in 2017 14th IAPR International Conference on Document Analysis and
[58] K. Greff, R. K. Srivastava, and J. Schmidhuber, “Highway and Recognition (ICDAR). IEEE, 2017, pp. 10–14.
residual networks learn unrolled iterative estimation,” arXiv [83] D. Castro, B. L. D. Bezerra, and M. ValenÃğa, “Boosting the deep
preprint arXiv:1612.07771, 2016. multidimensional long-short-term memory network for handwrit-
[59] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, ten recognition systems,” in 16th International Conference on Fron-
and X. Tang, “Residual attention network for image classification,” tiers in Handwriting Recognition. IEEE, 2018.
in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE [84] M. Villegas, V. Romero, and J. A. Sánchez, “On the modification of
Conference on. IEEE, 2017, pp. 6450–6458. binarization algorithms to retain grayscale information for hand-
[60] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” written tet recognition,” in Iberian Conference on Pattern Recognition
arXiv preprint arXiv:1709.01507, 2017. and Image Analysis. Springer, 2015, pp. 208–215.
[61] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” [85] S. Johansson, “The lob corpus of british english texts: presentation
Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997. and comments,” ALLC journal, vol. 1, no. 1, pp. 25–36, 1980.
[62] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, [86] P. Krishnan and C. Jawahar, “Generating synthetic data for text
“Adaptive mixtures of local experts,” Neural computation, vol. 3, recognition,” arXiv preprint arXiv:1608.04224, 2016.
no. 1, pp. 79–87, 1991. [87] S. Johansson, “The tagged {LOB} corpus: User\’s manual,” 1986.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 13

[88] W. N. Francis, A manual of information to accompany A standard


sample of present-day edited American English, for use with digital
computers. Department of Linguistics, Brown University, 1971.
[89] L. Bauer, Manual of information to accompany the Wellington corpus of
written New Zealand English. Department of Linguistics, Victoria
University of Wellington Wellington, 1993.
[90] E. Chammas, C. Mokbel, and L. Likforman-Sulem, “Handwriting
recognition of historical documents with few labeled data,” in 2018
13th IAPR International Workshop on Document Analysis Systems
(DAS). IEEE, 2018, pp. 43–48.
[91] “ICFHR2018 competition on automated text recognition on a
read dataset,” https://scriptnet.iit.demokritos.gr/competitions/
10/scoreboard/, accessed: 2018-11-20.
[92] “READ: Recognition and enrichment of archival documents,”
https://read.transkribus.eu/, accessed: 2018-11-20.
[93] M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai,
“Scene text recognition from two-dimensional perspective,” arXiv
preprint arXiv:1809.06508, 2018.
[94] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Mi-
sawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al.,
“Attention u-net: Learning where to look for the pancreas,” arXiv
preprint arXiv:1804.03999, 2018.

You might also like