Abstract—Named entity recognition is an important topic in the field of natural language processing, whereas in document image processing such recognition is quite challenging without employing any linguistic knowledge. In this paper we propose an approach to detect named entities (NEs) directly from offline handwritten unstructured document images, without explicit character/word recognition and with very little aid from natural language and script rules. At the preprocessing stage, the document image is binarized and the text is segmented into words; slant/skew/baseline corrections of the words are also performed. After preprocessing, the words are sent for NE recognition. We analyze the structural and positional characteristics of NEs and extract some relevant features from the word image. A BLSTM neural network is then used for NE recognition. Our system also contains a post-processing stage to reduce the true NE rejection rate. The proposed approach produces encouraging results on both historical and modern document images, including those from an Australian archive, which are reported here for the very first time.

Keywords: BLSTM neural network, Document image analysis, Dual layer bagging, Information retrieval, Named entity recognition.

1. Introduction

A document may contain text as well as non-text such as maps, drawings, photographic images, etc. In handwritten documents, some special types of non-text (e.g., doodles, struck-out words, annotations) may also appear. Sometimes a document is a mixture (hybrid) of printed and handwritten text, where artifacts from both groups may exist.

Important studies have been undertaken on layout analysis, printed/handwritten text separation and text/non-text segregation in documents. Moreover, good accuracy and efficiency have been achieved in preprocessing modules such as text-line identification and word/character segmentation, as well as in Optical Character Recognition (OCR) engines.

On degraded documents, where OCR engines do not work well, keyword spotting can play a remarkable role in identifying important words. It may also be necessary to find certain keywords, e.g., date fields, named entities, numeric values, currency and time information, in a document image. These are useful for document indexing/retrieval without reading the full text of a document image.

In this paper, we strive to identify word images that represent a Named Entity (NE). An NE usually refers to the name of something; such entities may be living or non-living, e.g., a person, place, company, organic/inorganic chemical compound, currency, time, or month. Named Entity Recognition (NER) [1] has been a popular topic in the fields of Natural Language Processing (NLP) and Information Retrieval (IR) for the last two decades, and a detailed survey of NER research on natural language text is reported in [2]. However, work on NE identification from document images is rare. One approach to NER from a document image was attempted by Zhu et al. [3], who considered semi-structured and unstructured printed documents of a special type, namely "automated expense reimbursement". In their method, a Conditional Random Field (CRF) framework with rich page-layout features was used. They also employed an OCR engine and recognized the NEs with assistance from the OCR output.

Without character/word recognition, NE detection from a document image is very difficult, because NLP-based knowledge can hardly be used in such a situation. However, such detection is essential where linguistic knowledge cannot be applied due to the poor performance of handwritten text recognition engines. In this paper, we attempt to fill this gap by proposing an approach to NE recognition that does not perform OCR on the document image. We extract features based on the structural and positional characteristics of NEs and feed them to a BLSTM neural network for NE recognition. Our method can process unstructured, offline handwritten historical and contemporary English documents. We have not found any published work on NE recognition directly from document images that is similar to the one described here.

The present problem is completely different from the classical keyword spotting [4] problem. In keyword spotting, a template (of a keyword) is given and its matches are to be found in a document image. In our proposed work, no template is provided at all; we use only structural and positional characteristics of an NE for its identification. The rest of the paper is organized as follows. Section 2 describes our proposed method in detail. The results and evaluation of the proposed approach are discussed in Section 3.
middle and last positioned words are generally NEs when the first character is a capital letter. Some exceptions exist, which we handle during post-processing.

Now, the difficulty arises with first-position NEs, since any word at the first position of a sentence starts with a capital letter. To know how frequently such NEs occur, we performed a positional occurrence analysis of NEs in English sentences. For this purpose, we took 300 articles (100 from newspapers, 100 from story books and 100 from online sources). Among these, 10 articles on various topics were chosen from each of 10 English newspapers having the highest circulation in 10 countries. Also, 10 pages were selected from each of 10 different English story books and novels by distinct writers from various countries. Moreover, 100 online articles on dissimilar topics were chosen from different websites (e.g., various blogs, Wikipedia pages). After processing these 300 articles with the Stanford Named Entity Recognizer [1], we found that per article at most 16.66% (minimum: 0%) of NEs reside in the first position of a sentence.

Taking guidance from the SMART project [10], we categorized four special classes of frequently occurring words (in English literature) with respect to their structures and positions in a sentence. These classes are described below.

i) Wh-word class: This class contains Wh-question words starting with the characters "Wh", generally positioned at the beginning of a sentence, e.g., Why, Who, Whom, What, When, Where, Which, Whose.

2.5. Feature Extraction

The features used in our method of detecting NE words are described below. All features are normalized to the range [0, 1].

2.5.1. Object Pixel Distribution.

These features are based on the distribution of foreground object (ink) pixels throughout the word image. We consider a vertical sliding window of size h × 1, where 'h' is the height of the word bounding-box. This window is moved from left to right over a word. Then we calculate the weight (f1), center of gravity (f2) and 2nd-order moment (f3) of the window as in [9]. Sometimes f1 is called the vertical projection profile. For example,

    f3(c) = (1/h^2) · Σ_{r=1}^{h} r^2 · p(r, c)    (3)

Here p(r, c) is the binary image of I(r, c) obtained in Section 2.1; p(r, c) takes the value 0 or 1, where 1 denotes a foreground pixel.

Additionally, we employ a horizontal sliding window of size 1 × w, where 'w' is the width of the word bounding-box, and calculate the features f4, f5 and f6 in an analogous manner.
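As a concrete illustration, the column-wise features f1, f2 and f3 described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: f3 follows Eq. (3) directly, while the normalizations used for f1 and f2 are assumptions chosen by analogy.

```python
import numpy as np

def vertical_window_features(p):
    """Column-wise object-pixel-distribution features of a binary
    word image p (shape h x w, 1 = foreground ink pixel).

    f1: weight of each h x 1 window (the vertical projection profile).
    f2: center of gravity of each window (normalization is an assumption).
    f3: 2nd-order moment of each window, following Eq. (3).
    """
    h, _w = p.shape
    r = np.arange(1, h + 1).reshape(-1, 1)   # row indices r = 1..h
    f1 = p.sum(axis=0) / h                   # fraction of ink per column
    f2 = (r * p).sum(axis=0) / h**2          # assumed 1/h^2 normalization
    f3 = (r**2 * p).sum(axis=0) / h**2       # Eq. (3): (1/h^2) sum r^2 p(r,c)
    return f1, f2, f3
```

Moving the h × 1 window left to right then simply corresponds to reading these arrays column by column.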
black-white transition (f9) [9]. Additionally, we compute horizontal black-white transitions (f10) of a word.

Furthermore, a vertical run-length histogram (f11) and a horizontal run-length histogram (f12) are obtained. The run-length is calculated as the sum of consecutive object pixels in the vertical/horizontal direction.

2.5.3. Special Weighted Features.

In addition to the above twelve features, two weighted features, described below, are employed.

On the upper projection profile of a word image, we count the transitions between peaks and valleys. Since an NE generally starts with a capital letter, we put some weight on the peak-valley transition count. Let n1, n2 and n3 be the counts of peak-valley transitions on the left 1/3rd, middle 1/3rd and right 1/3rd portions (in the x-direction), respectively, of the upper projection profile. This may be used as a feature (f13):

    f13 = k1 · n1 + k2 · n2 + k3 · n3    (7)

Here, k1, k2 and k3 are weight factors. By testing on a training dataset, we obtained optimum results with k1 = 0.8, k2 = 0.1 and k3 = 0.1.

On the horizontal projection profile (f4), we mark the upper, middle and lower zones, as described in Section 2.4.1. Since capital letters normally reside in the upper and middle zones, we apply some weights to this profile feature and obtain f14:

    f14 = k4 · Σ_{c=1}^{w/3} p(r, c) + k5 · Σ_{c>w/3}^{2w/3} p(r, c) + k6 · Σ_{c>2w/3}^{w} p(r, c)    (8)

Here, 'w' is the width of a word. The weight factors k4 = 0.6, k5 = 0.3 and k6 = 0.1 perform well for this problem.

2.6. Neural Network-based Classification

Among a few standard classifiers, we have found that a Neural Network (NN) yields the highest accuracy for our problem. A detailed survey on NN-based classification is reported in [12]. Of the various NN classifiers, we use the "Bidirectional Long Short-Term Memory" (BLSTM) neural network, since BLSTM-NNs have shown promising results in other handwriting analysis/recognition problems [13]. Also, as shown in [14], a BLSTM-NN is time-efficient in comparison with Dynamic Time Warping (DTW) for shape matching.

In our approach, we employ the above 14 features. To feed them into the BLSTM-NNs, the features are categorized into 3 types: (i) the vertical feature set fV: {f1, f2, f3, f7, f8, f9, f11}, (ii) the horizontal feature set fH: {f4, f5, f6, f10, f12}, and (iii) the special feature set fS: {f13, f14}. With these 3 sets of features (fV, fH and fS), we employ 3 different BLSTM-NNs (B1, B2 and B3, respectively). The input layer of a BLSTM-NN has d nodes, where 'd' is the number of features; so B1, B2 and B3 have 7, 5 and 2 input nodes, respectively. The input nodes are connected to 2 distinct recurrent hidden layers, one for the forward and another for the backward sequence. The output layer is also connected to both hidden layers. We primarily want to distinguish NE from non-NE words. Four add-on special classes (Wh-word, Th-word, Small-width and All-caps: described in Section 2.4) are also considered, so the total number of classes is K = 2 + 4 = 6. At the output layer, one extra node is added for a non-word class covering, e.g., noisy components and punctuation marks. Thus, the output layer (for each of B1, B2 and B3) contains K + 1 = 7 nodes. Details of the BLSTM-NN working principles can be found in [13]. The outputs of B1, B2 and B3 are joined with a fusion scheme; combining classifiers in this way, called a bagging strategy, is applied to decrease the error rate. The setup of the BLSTM-NN classifier for our problem is described in Section 3.2.

2.7. Post-processing

After classification, we check the positional occurrence (as described in Section 2.4.2) of the marked NEs. Some non-NEs, including first-position words, may be misclassified as NEs. We check against the special classes (Wh-word, Th-word, Small-width and All-caps: see Section 2.4) to separate out such non-NEs. In this way, we attempt to reduce the false acceptance of first-position non-NE words. We focus on minimizing the genuine NE rejection rate, sacrificing some accuracy: it is more likely that we will find a subset of words that almost certainly contains the NEs. So, we mark some words in this subset as potential NEs with a degree of confidence [15].

In certain writing, we note that NEs with many occurrences are indicative of the writing topic and thus assist in context analysis. So, we also compute the NE-word frequency count using the above BLSTM-NNs to extend our work toward context analysis. Here we employ the strategy of [4], but use the features presented in Section 2.5.

3. Experimental Results and Discussions

Before discussing the experimental results, we describe the datasets used to evaluate our scheme.

3.1. Dataset Employed

For our experiments, we used three offline handwritten datasets: i) the George Washington database (GWdb) [16], ii) the Queensland State Archives database (QSAdb) [17], and iii) the IAM database (IAMdb) [18].

GWdb and QSAdb are historical text databases. The GWdb contains 20 handwritten pages of G. Washington. We took 66 pages from the QSAdb containing handwritten text only. Finally, the IAMdb contains modern data, with 1539 handwritten pages written by 657 writers.
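Looking back at the special weighted feature f13 of Eq. (7) in Section 2.5.3, the computation can be sketched in Python as below. This is a hedged reconstruction: the paper does not spell out exactly how a peak-valley "transition" is counted, so treating it as the number of direction changes in the profile is an assumption of this sketch; the weights are the values reported with Eq. (7).

```python
import numpy as np

K1, K2, K3 = 0.8, 0.1, 0.1   # weight factors reported with Eq. (7)

def peak_valley_transitions(profile):
    """Count direction changes (peak/valley alternations) in a 1-D
    profile; this counting rule is an assumption of the sketch."""
    d = np.sign(np.diff(profile))
    d = d[d != 0]                        # ignore flat segments
    return int(np.sum(d[1:] != d[:-1]))  # each sign change = one extremum

def f13(upper_profile):
    """Eq. (7): weighted peak-valley counts on the left, middle and
    right thirds (in the x-direction) of the upper projection profile."""
    w = len(upper_profile)
    n1 = peak_valley_transitions(upper_profile[: w // 3])
    n2 = peak_valley_transitions(upper_profile[w // 3 : 2 * w // 3])
    n3 = peak_valley_transitions(upper_profile[2 * w // 3 :])
    return K1 * n1 + K2 * n2 + K3 * n3
```

The heavy weight on the left third reflects the observation that NEs tend to start with a capital letter.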
For ground-truth generation of the NEs on such pages, a semi-automatic approach with human intervention was employed. In addition, the manuscripts having publicly available transcripts were fed into the Stanford Named Entity Recognizer [1] to generate the ground truth.
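The transcript-based part of this ground-truth step can be sketched as follows: given character-level entity spans produced by an NER tagger on a page's transcript, each transcript word is labeled NE or non-NE. The span format and the simple whitespace alignment are assumptions of this sketch, not details taken from the paper.

```python
def words_to_labels(transcript_words, entity_spans):
    """Label each transcript word 1 (NE) or 0 (non-NE), given
    character-level (start, end) entity spans from an NER tagger.
    Assumes words are separated by single spaces in the transcript;
    the span format is an assumption of this sketch."""
    labels, pos = [], 0
    for word in transcript_words:
        start, end = pos, pos + len(word)
        overlaps = any(s < end and start < e for s, e in entity_spans)
        labels.append(1 if overlaps else 0)
        pos = end + 1        # skip the separating space
    return labels
```

The resulting word-level labels can then be checked and corrected manually, matching the semi-automatic procedure described above.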
3.2. Results and Evaluation
a document image) only, not its machine-readable output with the aid of any name-dictionary. This will also be the scope of our future work.

References

[1] J. R. Finkel, T. Grenager and C. Manning, "Incorporating non-local information into information extraction systems by Gibbs sampling", Proc. Annual Meeting of the Association for Computational Linguistics (ACL), pp.363-370, 2005.

[2] D. Nadeau and S. Sekine, "A survey of named entity recognition and classification", Lingvisticae Investigationes, vol.30, no.1, John Benjamins Pub. Co., pp.3-26, 2007.

[3] G. Zhu, T. J. Bethea and V. Krishna, "Extracting relevant named entities for automated expense reimbursement", Proc. ACM Conf. on Knowledge Discovery and Data Mining (KDD), pp.1004-1012, 2007.

[4] V. Frinken, A. Fischer, H. Bunke and R. Manmatha, "Adapting BLSTM neural network based keyword spotting trained on modern data to historical documents", Proc. Int. Conf. on Frontiers in Handwriting Recognition (ICFHR), pp.352-357, 2010.

[5] K. Ntirogiannis, B. Gatos and I. Pratikakis, "A combined approach for the binarization of handwritten document images", Pattern Recognition Letters, vol.35, pp.3-15, 2014.

[6] N. Otsu, "A threshold selection method from gray-level histograms", IEEE Trans. on Systems, Man and Cybernetics, vol.9, no.1, pp.62-66, 1979.

[7] E. Kavallieratou, N. Fakotakis and G. K. Kokkinakis, "Slant estimation algorithm for OCR systems", Pattern Recognition, vol.34, no.12, pp.2515-2522, 2001.

[8] E. Wigner, "On the quantum correction for thermodynamic equilibrium", Physical Review, vol.40, pp.749-759, 1932.

[9] U.-V. Marti and H. Bunke, "Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system", Hidden Markov Models: Applications in Computer Vision, World Scientific, ISBN: 981-02-4564-5, pp.65-90, 2002.

[10] G. Salton, "The SMART retrieval system: experiments in automatic document processing", Prentice-Hall Inc., 1971.

[11] T. M. Rath and R. Manmatha, "Word image matching using dynamic time warping", Proc. Computer Vision and Pattern Recognition (CVPR), vol.2, pp.521-527, 2003.

[12] G. P. Zhang, "Neural networks for classification: a survey", IEEE Trans. on SMC, Part C: Applications and Reviews, vol.30, no.4, pp.451-462, 2000.

[13] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition", IEEE Trans. on PAMI, vol.31, no.5, pp.855-868, 2009.

[14] R. Jain, V. Frinken, C. V. Jawahar and R. Manmatha, "BLSTM neural network based word retrieval for Hindi documents", Proc. Int. Conf. on Document Analysis and Recognition (ICDAR), pp.83-87, 2011.

[15] H. Zaragoza and F. d'Alché-Buc, "Confidence measures for neural network classifiers", IPMU, vol.1, pp.886-893, 1998.

[16] "George Washington Papers", The Library of Congress, USA. Web: http://memory.loc.gov/ammem/gwhtml/gwhome.html

[17] "Queensland State Archives", Australia-4113. Online: http://www.archivessearch.qld.gov.au/Search/BasicSearch.aspx

[18] U.-V. Marti and H. Bunke, "The IAM-database: an English sentence database for off-line handwriting recognition", Int. J. on Document Analysis and Recognition, vol.5, pp.39-46, 2002.

[19] J. Kittler, M. Hatef, R. P. W. Duin and J. Matas, "On combining classifiers", IEEE Trans. on PAMI, vol.20, no.3, pp.226-239, 1998.