
Word-Spotting approach using transfer deep learning of a CNN network

Ryma Benabdelaziz
Computer Science, Modeling, Optimization, and Electronic Systems Laboratory (LIMOSE), UMBB, Boumerdes, Algeria
r.benabdelaziz@univ-boumerdes.dz

Djamel Gaceb
Computer Science, Modeling, Optimization, and Electronic Systems Laboratory (LIMOSE), UMBB, Boumerdes, Algeria
d.gaceb@univ-boumerdes.dz

Mohammed Haddad
Lab LIRIS, UMR CNRS 5205, University of Claude Bernard Lyon 1, F-69622, Villeurbanne, France
mohammed.haddad@univ-lyon1.fr

Abstract—Convolutional Neural Networks (CNNs) are deep learning models that are trained to automatically extract the most discriminating features directly from an input image for use in visual classification tasks. Recently, CNNs have attracted a lot of interest thanks to their effectiveness in many computer vision applications (medical imaging, video surveillance, biometrics, pattern recognition, OCR, etc.). Transfer learning is an optimization method that reuses a pre-trained network to speed up and improve the training of another related task or application on a new dataset. In this paper, we propose a new approach to handwritten word retrieval based on deep learning and transfer learning. We compare the performance of two types of extracted features based on transfer learning: features from a pre-trained model and features from a fine-tuned network. Experiments are performed using six different CNN architectures and three similarity measures on the pre-segmented Bentham dataset of the ICDAR competition. The obtained results demonstrate the effectiveness of our proposed approach compared to the existing methods evaluated in this competition.

Keywords—CNN, Deep and Transfer Learning, Word Spotting, Word Retrieval, Feature Extraction, Similarity Distances.

I. INTRODUCTION

For natural images, and especially document images, image retrieval is one of the most active topics in the artificial vision community. Technological advances have made it possible to implement a variety of techniques that facilitate and accelerate the navigation and retrieval of important information in a large corpus of scanned documents. Nowadays, word spotting techniques face new challenges due to the ever-increasing complexity of documents. Content-Based Image Retrieval (CBIR) is a retrieval technique designed to retrieve images from a huge image database. This type of technique is based on computing the similarity between visual features (texture, color, and shape). CBIR techniques are suitable for global search among natural images, but not for partial retrieval of local information contained in an image, and hence not suitable for word-level retrieval in document images. To achieve this, other systems better adapted to this type of application have been proposed, called Document Image Retrieval Systems (DIRS). DIRS can be split into two categories: recognition-based techniques (manual or automatic) and word spotting techniques. On the one hand, recognition-based techniques are deemed to be accurate and more suitable for recent handwriting and high-quality printed/handwritten documents. However, these techniques are not effective when applied to ancient and degraded documents. Furthermore, they are usually limited to one language, which makes handwriting recognition systems for old/damaged documents or documents in rare languages hard to conceive. On the other hand, word-spotting techniques, especially those based on feature extraction, are more convenient for this type of document. Nowadays, the current trend in computer vision is related to applications based on deep neural networks. Deep learning is a collection of machine learning algorithms used to characterize advanced abstractions or functions. Deep learning models are built hierarchically in different levels (layers): models usually encode low-level features in their first layers and abstract these features into higher-level ones in subsequent layers. Most recently, light has been shed on different deep learning architectures that have demonstrated their effectiveness, such as the Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN), Deep Belief Network (DBN), Gated Recurrent Units (GRU), and the Auto-Encoder (AE), to name a few. A detailed review of deep learning architectures is presented in [1].

In this article, we study the robustness of CNN networks in the context of handwritten word retrieval (using a visual query) in a large corpus of document images. In particular, we investigate the effectiveness of transfer learning by observing its impact on different CNN architectures. We compare the retrieval performance of models pre-trained on natural images and models re-trained using a reduced dataset of handwritten images. Our experiments were performed on the Bentham dataset used in the ICDAR competition [2] for handwritten word spotting. This dataset is complex and poses several scientific challenges, with degradation and irregularities in the handwriting: it includes old, degraded, and deformed handwritten word images of different sizes, inclination degrees, resolutions, styles, ink visibilities, and qualities. In addition, due to the lack of techniques focusing on ancient manuscript images, this competition is one of the few initiatives we can leverage to evaluate our results against other approaches using the same benchmarking dataset.

The remainder of the article is organized as follows. In the second section, we present a literature review of existing word spotting techniques; additionally, we give details related to CNNs, their architectures, and the different word spotting approaches based on CNNs. In section three, we present the methodology of our word spotting approach and its implementation. This is followed in section four by a description of our experimental setup and our evaluation results. Finally, we draw our conclusions and set future challenges.


II. RELATED WORKS

A. Word spotting techniques
Character and word recognition techniques are often not suitable for word retrieval systems, especially for handwritten documents. Retrieval approaches have been proposed to overcome these issues using methods called word/keyword spotting techniques. The latter can be separated into two families: techniques based on a textual query (QBS: Query By String) and techniques using an image as a query (QBE: Query By Example). In this article, we focus on QBE approaches. Further, we can distinguish between these techniques using multiple criteria: with or without learning (learning-based or learning-free), and with or without segmentation (segmentation-based and segmentation-free approaches). Word spotting was first proposed by the speech recognition community [3] and then propagated to computer vision. In the literature, many works have addressed this context and have been applied in various domains.

Several QBE works have been proposed [4][5][6][7], but the problem with these kinds of approaches is that a query image must be provided by the user and must necessarily be present in the collection. This has pushed the development of techniques that do not require a query sample, which are called QBS. This family uses supervised and unsupervised learning and usually includes the construction of query or word models. In this context, various works can be found. The authors of [8][9] introduced several improvements with a supervised learning stage, such as probabilistic or statistical classifiers (HMM), Artificial Neural Networks (ANNs), Support Vector Machines (SVM), etc.; unsupervised learning uses, for example, the Bag Of Visual Words (BOVW) [10]. Several word spotting approaches, including both QBE and QBS methods, have relied on extracting different features such as Gabor features [11], Histogram of Oriented Gradients features (HOG) [12], texture descriptors [13], SIFT [14], SURF [15], structure or shape descriptors such as the Fourier transform [16], skeleton-based features [17], and moments (Hu moments [18], Zernike moments [19], radial moments [20], etc.). Other approaches attempted to combine multiscale features for word images, such as the work done in [21].
B. Convolutional Neural Network architectures
The CNN model was initially proposed in 1990, but it only demonstrated its full potential recently by achieving state-of-the-art classification performance on the ImageNet dataset. ImageNet is a large image database containing 14 million natural images organized in 1000 categories; it was designed by academics, is dedicated to computer vision research, and has been used to train many CNN architectures. Since their creation, CNNs have undergone multiple improvements (implementation optimization, parameter adjustment, architecture modification, etc.). Restructuring the CNN architecture is one of the most important changes, as it helps increase efficiency; this usually involves depth (adding new layers) and space exploitation. According to these architectural changes, CNNs have been divided into seven categories: spatial exploitation (AlexNet [22], GoogleNet [23], VGG [24]), depth (ResNet [25], Inception V3/V4, Inception-ResNet [26]), multi-path (ResNet [25], DenseNet [27]), width (PyramidalNet [28], Xception [29]), feature-map exploitation (squeeze-and-excitation [30]), channel boosting (channel-boosted CNN using transfer learning [31]), and attention-based CNNs (convolutional block attention [32]).

The strength of CNN training comes down to multiple steps of extracting features through hidden layers that can learn representations from data. CNNs are considered very powerful image processing techniques that have caught the attention of many large companies (Google, Microsoft, etc.) and have demonstrated outstanding performance in several machine-learning tasks: segmentation, object detection and recognition, retrieval, regression, classification, etc. CNNs have three different types of layers: (a) convolutional layers, (b) non-linear processing units, and (c) sub-sampling layers. In the first group (convolutional layers), convolution operations using kernels (filters) divide the images into small parts that participate in the local feature extraction stage. The outputs of these convolutional layers feed directly into the second group, the non-linear processing layers, which allow learning abstractions and introduce non-linearity into the feature space; this non-linearity helps to learn different image semantics. The non-linear group is connected to the sub-sampling layers, which summarize the results of the previous layer and make the input invariant to geometric distortions. CNNs are structured as a successive stacking of convolutional and pooling layers, then aggregated by one or several densely connected layers, which are replaced in some applications by global average pooling. In addition to learning layers, the architecture of a CNN also includes normalization functionality and dropouts intended for optimization and performance improvements.
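To make this layer grouping concrete, the following minimal sketch stacks the three layer types and a densely connected classifier in Keras (the library used later in this paper). It is an illustration only: the layer sizes, the batch-normalization placement, and the 10-class output are our assumptions, not a specific published architecture.

```python
# Minimal sketch of the three CNN layer groups described above:
# (a) convolutional layers, (b) non-linear processing units (ReLU),
# (c) sub-sampling (pooling), followed by densely connected layers.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), padding="same",
                  input_shape=(224, 224, 3)),     # (a) convolution with 3x3 kernels
    layers.BatchNormalization(),                  # normalization functionality
    layers.Activation("relu"),                    # (b) non-linear processing unit
    layers.MaxPooling2D((2, 2)),                  # (c) sub-sampling layer
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.GlobalAveragePooling2D(),              # used instead of flattening in some CNNs
    layers.Dense(128, activation="relu"),         # densely connected (FC) layer
    layers.Dropout(0.5),                          # dropout for regularization
    layers.Dense(10, activation="softmax"),       # classifier over an assumed 10 classes
])
model.summary()
```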
CNN architecture restructuring is one of the most important
changes as it helps increasing efficiency. This usually III. METHODOLOGY
involves depth (adding new layers) and space exploitation. Some of the advantages of deep neural networks are the
According to the architectural changes, CNNs have been fact they can be reused fully or partially to speed up the
divided into seven categories such as: Spatial exploitation training and enhance the performance of a model on a new
(AlexNet [22], GoogleNet [23], VGG [24]), Depth (Resnet related problem. Transfer learning is the process of reusing
[25], Inception V3 / V4, inceptionResnet [26]), multi-path the knowledge (features, weights, etc.) of a pre-trained


This approach is currently very popular in deep learning because it can be used to train deep neural networks with relatively little data. Our proposed approach is mainly based on CNNs using transfer learning for word spotting applications. It consists of three stages: transfer learning (pre-training/fine-tuning), word image description, and matching (see the synoptic diagram of Fig. 1).

Fig. 1. A synoptic diagram of our proposed word spotting technique using a description based on transfer deep learning: (1) feature extraction from the CB output of a pre-trained network (frozen ImageNet weights) and (2) feature extraction using a partially fine-tuned CB (re-trained on the small query-image dataset), followed by a matching step using a distance metric.
A. Transfer learning
The existing datasets for the evaluation of word-spotting approaches are small and insufficient for achieving good generalization with deep learning techniques. Reusing a trained model through transfer learning is a better way to avoid this limitation: it allows us not only to reduce the dependence on a large dataset but also to use high-performance models to respond to the different constraints associated with our word-spotting application (large variability of handwritten words, deformations, ink variation, degradation, etc.). In deep learning, this translates into the reuse of one or several layers extracted from a previously trained model in a new model. In our approach, we propose and compare two ways of extracting features from handwritten word images using transfer learning: through a pre-trained network or through a fine-tuned network. In both cases, the original (initial) CNN model is pre-trained on the ImageNet dataset.
1) Feature extraction using a pre-trained network
Feature extraction from a pre-trained network is a form of transfer learning that reuses the features the network learned when it was initially trained to predict the classes of its original dataset. The architecture of a deep network is composed of two complementary parts. The first one includes a sequence of convolutional and pooling layers, called the convolutional base (CB). The second is based on the densely or fully connected (FC) layers, called the classifier. The extracted features used here are obtained by flattening the convolutional base output of a pre-trained network. The reason for this choice is that the description learned by the convolutional base is likely to be more generic and therefore more reusable. In contrast, the image description learned by the classifier is necessarily specific to the set of classes on which the original model was trained (adapted to ImageNet and not to the ICDAR dataset). The feature extraction requires loading a suited model, then removing the top layers and preserving the convolutional base. At this stage, the model is considered a visual feature extractor. To extract the final descriptor vector, we flatten the output of the last convolutional block (MaxPooling layer) of the pre-trained model.
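As an illustration of this first strategy, the following is a minimal sketch in Keras (the library our implementation uses; the helper name and image path are hypothetical). It loads VGG16 with its ImageNet weights, keeps only the convolutional base, and flattens its output into a descriptor vector; for a 224x224 input, this yields the M = 7 x 7 x 512 = 25088 dimensions reported in section IV.

```python
# Sketch of feature extraction from a pre-trained network (assumed setup).
# include_top=False drops the FC classifier and keeps the convolutional base.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

cb = VGG16(weights="imagenet", include_top=False)  # convolutional base only

def describe(path):
    """Flatten the CB output (last MaxPooling layer) into one descriptor."""
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    features = cb.predict(x)       # shape (1, 7, 7, 512) for a 224x224 input
    return features.flatten()      # descriptor vector of size M = 25088

descriptor = describe("query_word.png")  # hypothetical word-image path
```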

2) Feature extraction using a fine-tuned network
Transfer learning through fine-tuning is applied when the new dataset is smaller than the initial one but has different, particular characteristics. Here, fine-tuning consists of re-training the pre-trained model (initially trained on the ImageNet dataset, composed of natural images) on the new dataset of the ICDAR contest (composed of handwritten word images collected from old scanned documents), which has a different nature (textual), a smaller size, and different elements (words) to classify. Because the number of classes is different in the new dataset, the fine-tuned network needs to adapt the layers of the second network part (the classifier) to recognize the new word classes. First, we remove the second part (the FC layers) of an existing network. Afterward, we add a new adapted set of FC layers on top of the CNN (changing the size of the Softmax layer to correspond to the number of classes of the new dataset) and fine-tune its weights on the new dataset. Thereafter, the weights of the last CB layers are partially or fully fine-tuned. The first layers of the CB can be frozen to take advantage of the generic knowledge the model already acquired during its initial training on a large dataset and to accelerate the re-training process on the new dataset. The genericity or reusability of an image description constructed by a specific layer depends on its depth in the model: the first layers of a model usually encode low-level features, whereas top-level layers encode higher-level features. The differences between the new and initial datasets lead us to freeze (fix) a large part of the first layers of the CB while leaving the remaining layers trainable. Therefore, several layers of the initial model are reused to re-train the new one (some weights of the first model are adapted to train the second one). The resulting network is again used as a visual feature extractor via the flattened CB output (MaxPooling layer); the difference is that in the fine-tuned case we retrain our new model using the query images (some weights of the original model are changed and adjusted to this new data), which ensures that the new network can extract features related to our context (the query dataset).
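A sketch of this fine-tuning setup follows, under the assumptions that VGG16 is the backbone, that only the last convolutional block is left trainable, and that the new FC head has an intermediate layer of 256 units (the paper does not fix these choices).

```python
# Sketch of the fine-tuning strategy (assumed configuration): replace the FC
# head with a Softmax sized to the 21 word classes (20 query words + 1 class
# of other words), freeze the first CB layers, and re-train on the queries.
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications.vgg16 import VGG16

cb = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in cb.layers[:-4]:      # freeze all but the last convolutional block
    layer.trainable = False       # keep the generic ImageNet knowledge

model = models.Sequential([
    cb,
    layers.Flatten(),                             # flattened CB output
    layers.Dense(256, activation="relu"),         # new adapted FC layer (size assumed)
    layers.Dropout(0.5),
    layers.Dense(21, activation="softmax"),       # 21 classes of the new dataset
])
model.compile(optimizer=optimizers.Adam(1e-4),    # small learning rate for fine-tuning
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(query_images, query_labels, epochs=..., batch_size=...)  # re-training
```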
To evaluate our method, we used the query set proposed with the Bentham dataset of the ICDAR15 word spotting contest, which consists of 95 query word images (20 different words, each with 2 to 5 samples) and 3233 candidate (test) images. For our experiments, we divided the query dataset into classes of image samples representing the same word and added a 21st class, in which we put samples of different words extracted from the Bentham candidate dataset. We note that the 21st class does not include any query word of the 20 classes, the aim being to distinguish query images from the rest of the images and improve the similarity measurements. We carried out two tests: the first evaluates word spotting using a pre-trained network (with ImageNet weights) (see Fig. 4.a); the second introduces fine-tuning to observe the influence of transfer learning on word retrieval, using three similarity measures on the feature vectors (see Fig. 4.b).

Fig. 2. Example of the VGG16 network architecture using the new dataset.
We evaluated our technique on six CNN architectures (VGG16, VGG19, Inception V3, ResNet50, ResNet-Inception V2, and Xception); see the example of the VGG16 network in Fig. 2. In our experiments, we used feature vectors of size M; this size depends on the CNN architecture used (M = 25088 in the VGG16 model).
B. Word image description using pre-trained/fine-tuned networks
This step consists of computing the feature vector of each image. Query images and candidate images are passed through the pre-trained/fine-tuned network, where several operations are applied to each pixel using the network's weights. The descriptor vector of each image is generated and used in the next step of word image matching.

C. Matching step
In this last step, we carry out the matching between query images and candidate images by computing the distance between their resulting feature vectors. We compared three types of distances in our matching step to measure the pairwise similarity between two feature vectors corresponding to a query image and a candidate image (see the results in the following section):

Euclidean distance:
$ED(P, Q) = \sqrt{\sum_{i=1}^{M} (P_i - Q_i)^2}$    (1)

Chi2 distance:
$CH2(P, Q) = \sum_{i=1}^{M} \frac{(P_i - Q_i)^2}{P_i + Q_i + \varepsilon}$    (2)

Kullback-Leibler divergence:
$KL(P, Q) = \sum_{i=1}^{M} (P_i \times \log_2(P_i)) - (P_i \times \log_2(Q_i))$    (3)

in which P and Q are two vectors of the same size (M); in the third equation, P and Q are replaced by P + ε and Q + ε.
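These three measures transcribe directly into NumPy; the following sketch is a literal reading of Eqs. (1)-(3), with the value of ε being an assumption.

```python
# Direct NumPy transcription of Eqs. (1)-(3). P and Q are two descriptor
# vectors of the same size M; the epsilon value is an assumption.
import numpy as np

def euclidean(P, Q):
    return np.sqrt(np.sum((P - Q) ** 2))               # Eq. (1)

def chi2(P, Q, eps=1e-10):
    return np.sum((P - Q) ** 2 / (P + Q + eps))        # Eq. (2)

def kullback_leibler(P, Q, eps=1e-10):
    P, Q = P + eps, Q + eps     # shift by epsilon to avoid log2(0), as in the paper
    return np.sum(P * np.log2(P) - P * np.log2(Q))     # Eq. (3)
```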
IV. TESTS AND EXPERIMENTATION

A. Bentham Dataset
The performance of our proposed approach is evaluated following the same evaluation protocol and the same image datasets used in the ICDAR15 Competition on Keyword Spotting for Handwritten Documents [2]. The dataset comprises pre-segmented word images acquired from 10 handwritten pages written by the British philosopher Jeremy Bentham (1748-1832) and his secretary, and consists of 3233 candidate images and 95 query images. In addition, several authors participated in the writing of these words; hence, the words have different font sizes and styles and present artifacts, defects, and irregularities (Fig. 3).

Fig. 3. (a) Example of a Bentham dataset document image; (b) handwriting irregularities in segmented words extracted from the Bentham dataset.

Finally, in the matching step, each query image is matched against all candidate images, and the resulting list is sorted in ascending order according to the similarity values of the best corresponding images. We coded our approach in Python, and the deep learning models were implemented using the Keras deep learning library. The tests were carried out on a machine with an Intel Core i7-7700HQ processor, 16 GB of RAM, and an Nvidia GeForce GTX 1070 GPU with 8 GB of VRAM.
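A sketch of this ranking loop is given below (illustrative names; it reuses the distance functions of the previous sketch and assumes the candidate descriptors are stored in a dictionary keyed by image name).

```python
# Sketch of the matching step: compare one query descriptor to every
# candidate descriptor and rank the candidates by ascending distance.
def rank_candidates(query_vec, candidate_vecs, distance=kullback_leibler):
    scores = [(name, distance(query_vec, vec))
              for name, vec in candidate_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1])   # best matches first

# ranking = rank_candidates(describe("query_word.png"), candidates)
```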
B. Evaluation Metric
In order to test and evaluate our word spotting technique, we used the Mean Average Precision (mAP) metric, which is commonly used in such approaches. The mAP is computed as follows: the evaluation script computes the interpolated average precision for each query image in the set Q, using the interpolated precision at top-k, $\pi'(k)$, and the recall at top-k, $\rho(k)$. Equation (4) outlines the (interpolated) precision and recall scores of the top-k responses, using the set of all relevant items R and the set of top-k results in the solution S(k); it then gives the interpolated average precision AP (where $\Delta\rho(k)$ is the difference in recall between items k and k−1) and defines the mAP metric from the AP of each query q [2]:

$\pi(k) = \frac{|R \cap S(k)|}{|S(k)|}; \quad \rho(k) = \frac{|R \cap S(k)|}{|R|}; \quad \pi'(k) = \max_{j:\, \rho(j) \geq \rho(k)} \pi(j)$

$AP = \sum_{k=1}^{n} \pi'(k) \cdot \Delta\rho(k); \quad mAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q)$    (4)
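The following sketch is our reading of Equation (4) in Python; the data structures (a ranked list per query and a set of relevant items per query) are assumptions about how the evaluation script is organized.

```python
# Sketch of Eq. (4): interpolated precision/recall of the top-k responses
# give the AP per query; the mAP is the mean over all queries.
import numpy as np

def average_precision(ranked, relevant):
    """ranked: ids sorted by ascending distance; relevant: set R (non-empty)."""
    hits, precisions, recalls = 0, [], []
    for k, item in enumerate(ranked, start=1):
        hits += item in relevant
        precisions.append(hits / k)              # pi(k) = |R & S(k)| / |S(k)|
        recalls.append(hits / len(relevant))     # rho(k) = |R & S(k)| / |R|
    # Interpolation: pi'(k) = max of pi(j) over the j with rho(j) >= rho(k),
    # i.e. the running maximum of precision taken from the tail of the list.
    interp = np.maximum.accumulate(np.array(precisions)[::-1])[::-1]
    deltas = np.diff([0.0] + recalls)            # delta rho(k) between k-1 and k
    return float(np.sum(interp * deltas))        # AP = sum pi'(k) * delta rho(k)

def mean_average_precision(results, relevants):
    """results/relevants: dicts mapping each query q to its ranking and set R."""
    return float(np.mean([average_precision(results[q], relevants[q])
                          for q in results]))
```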

C. Results and discussion
According to the experimental results, our approach based on a fine-tuned network and the Kullback-Leibler distance (see Table I) yielded better precision than the existing approaches presented in [5][37][38][39], as well as our previous training-free word spotting method [40]. In Table I, we compare our results with the top two techniques of segmentation-based word spotting (Track I: the PRG group [39] and the CVC group [38]) and three other approaches [5][37][40] evaluated on the Bentham dataset using the same evaluation protocol of the ICDAR15 competition.

The Top 1 method of the competition was proposed by the PRG group [39], which began with a feature extraction step using the SIFT technique and then used these features to construct a codebook. Group 2 is the CVC group [38], which used the integral histograms of oriented gradients (IHOG) technique and then created a bag of visual words. We also compared our results with our previous training-free word-spotting technique [40], which is based on textural local features extracted from the handwriting and an original matching step using the KNN technique in a spatial context.

TABLE I. EXPERIMENTAL RESULTS FOR SEGMENTATION-BASED WORD SPOTTING ON THE BENTHAM DATASET OF THE ICDAR15 COMPETITION.

Methods                                                          | Mean average precision (mAP)
Best result using fine-tuned VGG16 and Kullback-Leibler distance | 0.634
Our previous learning-free word spotting technique [40]          | 0.578
Best result using pre-trained VGG19 and Chi2 distance            | 0.556
Zagoris's method using a dynamic window [6]                      | 0.440
PRG group method [39]                                            | 0.424
CVC group (Ghosh & Valveny, 2015) [38]                           | 0.300
CSPD method [37]                                                 | 0.193

Fig. 4. mAP using three different distances (Euclidean, Chi2, and Kullback-Leibler divergence) on (a) pre-trained and (b) re-trained (fine-tuned) CNN architectures (VGG16, VGG19, ResNet50, Inception V3, Xception, ResNet-Inception V2).
learning from a small dataset of pre-classified word images.
The comparison of the two transfer learning approaches shows that the fine-tuned network gives better precision than the direct re-use of a network pre-trained on a general dataset like ImageNet. All these results are proof that transfer learning can help to develop and accelerate word-spotting applications without using large word-image datasets. The histograms in Fig. 4.a and 4.b compare the accuracy of six deep learning architectures in two contexts (pre-trained and fine-tuned models) using three different distance metrics. This comparison shows that the VGG architecture is more efficient and better adapted to such a network refinement approach on a small, specific dataset. This demonstrates that the features resulting from the fine-tuned model are more relevant for distinguishing handwritten words, and that training on ImageNet alone is insufficient and does not offer the genericity needed to process textual images. We notice that all image samples corresponding to each query image are to some extent similar to the style of the query's writing; the observable differences are in the writing slant, the size of the characters, and the ink visibility.

Thus, we found that a small set of query samples is enough to train the CNN models, which allows retrieving many similar samples in a large dataset. We argue that this small sample set can be formed from only a single query image (without searching for exemplars in the dataset to train the model), for example by making changes in the query image scale, slant, skew, rotation degrees, etc.

V. CONCLUSION

Through this paper, we presented a new word spotting technique that relies on transfer learning from a pre-trained convolutional neural network to extract visual features, combined with a similarity measurement technique, to achieve accurate Query-By-Example word spotting. We compared using CNNs pre-trained on natural images (ImageNet) with the same networks re-trained by transfer learning on a small dataset of pre-classified word images. The visual features extracted by both types of approaches were used to perform the retrieval on the word spotting dataset used within the ICDAR15 conference. We carried out tests on six different CNN architectures using three different distance metrics. Despite the complexity of the handwriting, deep CNNs can perform word-spotting tasks and can effectively be tuned using transfer learning. This work pushes us to go further and, in the future, to test other datasets and administrative document components such as stamps, signatures, logos, etc.

REFERENCES
[1] M. Z. Alom et al., "A State-of-the-Art Survey on Deep Learning Theory and Architectures," Electronics, vol. 8, no. 3, p. 292, Mar. 2019, doi: 10.3390/electronics8030292.
[2] J. Puigcerver, A. H. Toselli, and E. Vidal, "ICDAR2015 Competition on Keyword Spotting for Handwritten Documents," Aug. 2015, pp. 1176–1180, doi: 10.1109/ICDAR.2015.7333946.
[3] J. R. Rohlicek, W. Russell, S. Roukos, and H. Gish, "Continuous hidden Markov modeling for speaker-independent word spotting," in International Conference on Acoustics, Speech, and Signal Processing, Glasgow, UK, 1989, pp. 627–630, doi: 10.1109/ICASSP.1989.266505.
[4] I. P. Gurov, A. S. Potapov, O. V. Scherbakov, and I. N. Zhdanov, "Hough and Fourier Transforms in the Task of Text Lines Detection," p. 7.
[5] N. Aouadi and A. Kacem, "Word Spotting for Arabic Handwritten Historical Document Retrieval using Generalized Hough Transform," p. 5, 2011.
[6] K. Zagoris, I. Pratikakis, and B. Gatos, "Unsupervised Word Spotting in Historical Handwritten Document Images Using Document-Oriented Local Features," IEEE Transactions on Image Processing, vol. 26, no. 8, pp. 4032–4041, Aug. 2017, doi: 10.1109/TIP.2017.2700721.
[7] B. Gatos and I. Pratikakis, "Segmentation-free Word Spotting in Historical Printed Documents," in 2009 10th International Conference on Document Analysis and Recognition, Barcelona, Spain, 2009, pp. 271–275, doi: 10.1109/ICDAR.2009.236.


[8] Y. Liang, M. C. Fairhurst, and R. M. Guest, "A synthesised word approach to word retrieval in handwritten documents," Pattern Recognition, vol. 45, no. 12, pp. 4225–4236, Dec. 2012, doi: 10.1016/j.patcog.2012.05.024.
[9] L. Rothacker and G. A. Fink, "Segmentation-free Query-by-String Word Spotting with Bag-of-Features HMMs," International Conference on Document Analysis and Recognition, p. 5, 2015.
[10] A. Fischer, A. Keller, V. Frinken, and H. Bunke, "Lexicon-free handwritten word spotting using character HMMs," Pattern Recognition Letters, vol. 33, no. 7, pp. 934–942, May 2012, doi: 10.1016/j.patrec.2011.09.009.
[11] D. Gaceb, V. Eglin, S. Bres, and H. Emptoz, "Handwriting Similarities as Features for the Characterization of Writer's Style Invariants and Image Compression," in Image Analysis and Recognition, vol. 4142, A. Campilho and M. Kamel, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 776–789.
[12] K. Terasawa and Y. Tanaka, "Slit style HOG feature for document image word spotting," in Document Analysis and Recognition, 2009. ICDAR'09. 10th International Conference on, 2009, pp. 116–120.
[13] S. Dey, A. Nicolaou, J. Llados, and U. Pal, "Local Binary Pattern for Word Spotting in Handwritten Historical Document," arXiv:1604.05907 [cs], Apr. 2016. [Online]. Available: http://arxiv.org/abs/1604.05907.
[14] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004, doi: 10.1023/B:VISI.0000029664.99615.94.
[15] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," in Computer Vision – ECCV 2006, vol. 3951, A. Leonardis, H. Bischof, and A. Pinz, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 404–417.
[16] C. Jawahar, A. Balasubramanian, and M. Meshesha, "Word-level access to document image datasets," in Proceedings of the workshop on computer vision, graphics and image processing, 2004, pp. 73–76.
[17] J. Lladós and G. Sánchez, "Indexing historical documents by word shape signatures," in Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, 2007, vol. 1, pp. 362–366.
[18] Zhihu Huang and Jinsong Leng, "Analysis of Hu's moment invariants on image scaling and rotation," in 2010 2nd International Conference on Computer Engineering and Technology, Chengdu, China, 2010, pp. V7-476–V7-480, doi: 10.1109/ICCET.2010.5485542.
[19] P. B. Rao, D. V. Prasad, and C. P. Kumar, "Feature Extraction Using Zernike Moments," International Journal of Latest Trends in Engineering and Technology, vol. 2, no. 2, p. 7, 2013.
[20] M. Kassis and J. El-Sana, "Word spotting using radial descriptor graph," in Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on, 2016, pp. 31–35.
[21] M. Mhiri, M. Cheriet, and C. Desrosiers, "Query-by-example word spotting using multiscale features and classification in the space of representation differences," in 2017 IEEE International Conference on Image Processing (ICIP), Beijing, Sep. 2017, pp. 1112–1116, doi: 10.1109/ICIP.2017.8296454.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017, doi: 10.1145/3065386.
[23] C. Szegedy et al., "Going deeper with convolutions," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, Jun. 2015, pp. 1–9, doi: 10.1109/CVPR.2015.7298594.
[24] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv:1409.1556 [cs], Apr. 2015. [Online]. Available: http://arxiv.org/abs/1409.1556.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 2818–2826, doi: 10.1109/CVPR.2016.308.
[27] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely Connected Convolutional Networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, Jul. 2017, pp. 2261–2269, doi: 10.1109/CVPR.2017.243.
[28] D. Han, J. Kim, and J. Kim, "Deep Pyramidal Residual Networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, Jul. 2017, pp. 6307–6315, doi: 10.1109/CVPR.2017.668.
[29] F. Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, Jul. 2017, pp. 1800–1807, doi: 10.1109/CVPR.2017.195.
[30] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-Excitation Networks," arXiv:1709.01507 [cs], May 2019. [Online]. Available: http://arxiv.org/abs/1709.01507.
[31] A. Khan, A. Sohail, and A. Ali, "A New Channel Boosted Convolutional Neural Network using Transfer Learning," p. 23.
[32] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional Block Attention Module," in Computer Vision – ECCV 2018, vol. 11211, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham: Springer International Publishing, 2018, pp. 3–19.
[33] S. Sudholt and G. A. Fink, "PHOCNet: A Deep Convolutional Neural Network for Word Spotting in Handwritten Documents," in 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, Oct. 2016, pp. 277–282, doi: 10.1109/ICFHR.2016.0060.
[34] S. Sudholt and G. A. Fink, "Evaluating Word String Embeddings and Loss Functions for CNN-Based Word Spotting," in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Nov. 2017, pp. 493–498, doi: 10.1109/ICDAR.2017.87.
[35] S. K. Ghosh and E. Valveny, "R-PHOC: Segmentation-Free Word Spotting Using CNN," in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Nov. 2017, pp. 801–806, doi: 10.1109/ICDAR.2017.136.
[36] Z. Zhong, W. Pan, L. Jin, H. Mouchere, and C. Viard-Gaudin, "SpottingNet: Learning the Similarity of Word Images with Convolutional Neural Network for Word Spotting in Handwritten Historical Documents," in 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, Oct. 2016, pp. 295–300, doi: 10.1109/ICFHR.2016.0063.
[37] K. Zagoris, K. Ergina, and N. Papamarkos, "Image retrieval systems based on compact shape descriptor and relevance feedback information," Journal of Visual Communication and Image Representation, vol. 22, no. 5, pp. 378–390, Jul. 2011, doi: 10.1016/j.jvcir.2011.03.002.
[38] S. K. Ghosh and E. Valveny, "A Sliding Window Framework for Word Spotting Based on Word Attributes," in Pattern Recognition and Image Analysis, vol. 9117, R. Paredes, J. S. Cardoso, and X. M. Pardo, Eds. Cham: Springer International Publishing, 2015, pp. 652–661.
[39] S. Sudholt, L. Rothacker, and G. A. Fink, "Learning local image descriptors for word spotting," in 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, Aug. 2015, pp. 651–655, doi: 10.1109/ICDAR.2015.7333842.
[40] R. Benabdelaziz, D. Gaceb, and M. Haddad, "Word Spotting Based on Bispace Similarity for Visual Information Retrieval in Handwritten Document Images," International Journal of Computer Vision and Image Processing, vol. 9, no. 3, pp. 38–58, Jul. 2019, doi: 10.4018/IJCVIP.2019070103.
