Forczmański P., Smoliński A., Nowosielski A., Małecki K. (2020) Segmentation of Scanned Documents Using Deep-Learning Approach. In: Burduk R., Kurzynski M., Wozniak M. (eds) Progress in Computer Recognition Systems. Advances in Intelligent Systems and Computing, vol. 977, pp. 141–152. Springer, Cham.
Segmentation of scanned documents using
deep-learning approach
1 Introduction
Paper documents have long been one of the elementary means of human communication. Despite many differences, they often contain typical elements such as blocks of text, tables, stamps, signatures, and logos. In this paper we present an algorithm that makes it possible to segment characteristic visual objects from paper documents, which makes document understanding easier. Such an approach may be used to transform a document into a hierarchical representation in terms of structure and content, allowing for easier exchange, editing, browsing, indexing, filling and retrieval [22].
The developed algorithm, being part of a larger document management system, may help in determining which parts should be processed further or subjected to enhancement or denoising (e.g. pictures, graphics, charts [10]). It could also serve as a pre-processing stage of a sophisticated content-based image retrieval system, or simply as a filter that selects documents containing particular elements [13] or sorts them in terms of importance (colored vs. monochromatic documents [3]). It is important that the presented approach is document-type independent, thus it can be applied to any class of documents.
2 Previous works
The literature survey shows that the problem addressed in this paper has been a subject of study for more than three decades. It is often referred to as page segmentation and zone classification [15]. Page segmentation methods, understood as global approaches, can be divided into three categories: bottom-up, top-down and heuristic. Top-down methods are preferred for documents with an initially known structure. In such a strategy, the input document is decomposed into smaller elements such as blocks and lines of text, single words, characters, etc. On the other hand, the bottom-up strategy begins with a pixel-level analysis, and then pixels with common properties are grouped into bigger structures. Bottom-up techniques are preferred for documents with a complex structure, but are often slower than top-down ones. Heuristic procedures often combine the advantages of the above methods, i.e. the robustness of top-down approaches and the accuracy of bottom-up methods.
The most popular approaches in the group of bottom-up methods are connected component analysis [15] and the sliding-window approach [19] combined with feature extraction based on textural features [12]. Such algorithms are quite robust to skew, but depending on the selected measures, the processing cost may vary. Reported accuracy of text detection reaches 94–99%.
The top-down strategies often rely on run-length analysis performed on binarized, skew-corrected documents. Other solutions include the use of a Gaussian pyramid in combination with low-level features, or Gabor-filtered pixel clustering.
Heuristic methods combine bottom-up and top-down strategies. Good examples are XY-cuts joined with run-length matrix statistics, and quad-tree adaptive split-and-merge operations connected with a fractal signature [15].
Zone classification is associated with a multi-class discrimination problem. A classification of eight typical objects in scanned documents was presented in [10]. This is a difficult task, as demonstrated by the experimental results: the reported error rate is equal to 2.1%, but 72.7% of logos and 31.4% of tables were misclassified. Wang et al. [22] reported a mean accuracy of 98.45%, yet 84.64% of logos and 72.73% of 'other' elements were misclassified.
In order to simplify the problem and increase the accuracy, one can use an individual approach that focuses on single-class detection and recognition. It is based on the detection of typical features, followed by "one versus all" classification [3]. For example, the detection of text blocks can be realized by means of statistical analysis [20], edge extraction [16], or texture analysis [8, 16]. Other authors used stroke filters [9], the cosine transform [23] and the LBP algorithm [14]. It should also be stressed that the intra-class variance of many objects (especially tables and signatures) can be a huge problem [1, 7, 24].
The analysis of the literature shows that many algorithms involve image pre-processing techniques (e.g. document rectification), deal with limited forms of documents (e.g. bank cheques), and employ handcrafted features together with multi-tier approaches [6]. The other observation is that there is a lack of methods aimed at the detection of all possible classes of interesting objects in scanned documents. This is caused by the high variability in the appearance of such objects.
To overcome the above limitations, Convolutional Neural Networks (CNNs) were introduced, which are capable of dealing with all these issues within a single model. This alternative direction in research is aimed at creating effective classifiers without taking care of carefully crafted low-level features [25]. In such a scenario we solve the task of mapping input data such as image features, e.g. the brightness of pixels, to certain outputs, e.g. an abstract object class. The deep learning model, being a hierarchical representation of the perceptual problem, consists of sets of input, transitional and output functions. Such a model encodes low-level, elementary features into high-level abstract concepts. The main advantage of such an approach is that it is highly learnable. It should be stressed that the idea of learning is not completely new, since it comes from classical neural networks. The input of a deep network is fed iteratively, and the training algorithm lets it compute layer by layer in order to generate an output, which is then compared with the correct answer. The output error moves backward through the net by back-propagation in order to correct the weights of each node and reduce the error. Such a strategy continuously improves the model during computations. The key computational problem in a CNN is the convolution of a feature detector with an input signal. Thus, the features flow from elementary pixels to certain primitives such as horizontal and vertical lines, circles, and color areas. In contrast to classical image filters working on single-channel images, CNN filters process all the input channels simultaneously. Since the convolutional filters are applied identically across the whole image, they create a strong response in areas where a specific feature is discovered.
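As a minimal illustration of this multi-channel convolution (all values, sizes, and names below are made up for the example, not taken from the paper's network), a single filter sums its response over every input channel and produces one activation map that peaks where its pattern appears:

```python
# Sketch of one multi-channel convolution step: a single CNN filter spans
# all input channels and sums their responses into one activation map.

def convolve(image, kernel):
    """Valid 2-D convolution of a multi-channel image with one filter.

    image:  list of channels, each a 2-D list (H x W)
    kernel: list of per-channel 2-D kernels (same channel count, k x k)
    """
    channels = len(image)
    h, w = len(image[0]), len(image[0][0])
    k = len(kernel[0])
    out = []
    for y in range(h - k + 1):
        row = []
        for x in range(w - k + 1):
            acc = 0.0
            for c in range(channels):          # all channels at once
                for dy in range(k):
                    for dx in range(k):
                        acc += image[c][y + dy][x + dx] * kernel[c][dy][dx]
            row.append(acc)
        out.append(row)
    return out

# A toy 2-channel 3x3 image and a 2x2 edge-like filter (illustrative values).
img = [
    [[0, 1, 0], [0, 1, 0], [0, 1, 0]],   # channel 0: vertical line
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],   # channel 1: empty
]
flt = [
    [[-1, 1], [-1, 1]],                  # responds to left-to-right edges
    [[0, 0], [0, 0]],
]
response = convolve(img, flt)
```

The response map is strongly positive where the left edge of the line falls under the filter and negative on its right edge, which is exactly the locality-plus-weight-sharing behaviour described above.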
Hence, in the proposed approach, we do not apply any pre-processing but employ a very efficient CNN detector that deals with practically all object types that can be found in documents, a problem which has no significant representation in the literature.
The second element of the addressed detection algorithm is block integration, performed in order to join neighbouring regions of the same class, since CNN-based detectors often mark objects with a regular, often rectangular, box.
In our research we compared traditional feature-based detectors supported by the AdaBoost algorithm with a CNN-based detector in order to confirm the high robustness and accuracy of the latter.
3 Algorithm description
The developed algorithm consists of two subsequent stages. The first one is a rough detection of candidates (classification of detected regions of interest), while the second one is a verification/integration of the found objects.
The first stage of processing is based on the effective YOLOv2 algorithm [17], which yields a significantly high number of true positives. It is then supported by a region integration stage. At the training stage we trained the network on five individual classes of objects, namely: stamps, logos/ornaments, texts, tables, and signatures. Exemplary objects are presented in Tab. 1.
Table 1: Exemplary objects of the five classes: stamp, logo, text, signature, table.
The YOLOv2 algorithm feeds the full image into a single neural network. The image is rescaled to a pre-defined constant size. The net divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities. The scheme of processing is presented in Fig. 1. After the detection, we perform an integration of regions for each class individually, taking into consideration the confidence values returned by YOLOv2, and perform normalization.
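Because detection runs on the rescaled input, predicted boxes must be mapped back to the original document resolution. A minimal sketch of that mapping (the 416 px input size is from the paper; the helper name, box format, and example resolution are our assumptions):

```python
# Map a box predicted on the rescaled network input back to the original
# document resolution (helper name and (x1, y1, x2, y2) format are
# illustrative, not the authors' implementation).

NET_SIZE = 416  # fixed YOLOv2 input resolution used in the paper

def to_original(box, orig_w, orig_h, net_size=NET_SIZE):
    """box = (x1, y1, x2, y2) in network coordinates -> original pixels."""
    sx = orig_w / net_size
    sy = orig_h / net_size
    x1, y1, x2, y2 = box
    return (round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy))

# A box covering the top-left quarter of the network input maps to the
# top-left quarter of a hypothetical 2480 x 3508 (A4 at 300 dpi) scan.
print(to_original((0, 0, 208, 208), 2480, 3508))
```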
In our approach we employed DarkFlow (a Python-based implementation of DarkNet, using GPU-enabled TensorFlow) for implementing the Convolutional Neural Network. The CNN used in our algorithm consists of 28 layers. The input layer (which actually contains the whole input image) is 416 × 416 elements. The net structure is presented in Tab. 2, where Layer type represents the type of a layer (Convolutional computes the output of neurons that are connected to local regions; Max Pooling and Avg Pooling perform a downsampling operation along the spatial dimensions; SoftMax performs SoftMax regression). The No. Filters column describes the number of convolution filters, the Size/Stride is responsible for the window size, and the Output dim. represents the spatial resolution of the map at each layer.
After a successful detection, YOLOv2 returns boxes with confidence values for all classes. We take the boxes and, in the case of overlapping regions, sum the confidences and perform a logical OR on the regions. The confidence values are normalized to one. A comparison of the unaltered result and the integrated/normalized one is presented in Fig. 2. In this way we get the regions pointed out by the detector with a higher (cumulative) confidence. Sometimes this leads to a falsely increased confidence in areas that do not contain the respective objects, yet in most cases it increases the overall clarity of the results.
Fig. 2: The effect of region integration and normalization (from the left: original
image, output from YOLOv2 detector, and processed result, respectively).
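The integration step can be sketched as follows; the box format, helper names, and single-pass merging strategy are our assumptions for illustration, not the authors' implementation:

```python
# Sketch of per-class region integration: overlapping same-class boxes are
# merged (here approximated by their union rectangle), their confidences
# summed, and the cumulative confidence clipped to at most 1.0.
# Box format is (x1, y1, x2, y2, confidence); single-pass merge only,
# so long chains of boxes may need repeated passes in practice.

def overlaps(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def integrate(boxes):
    """Merge overlapping detections of one class into unified regions."""
    merged = []
    for box in boxes:
        for i, m in enumerate(merged):
            if overlaps(box, m):
                merged[i] = (min(box[0], m[0]), min(box[1], m[1]),
                             max(box[2], m[2]), max(box[3], m[3]),
                             box[4] + m[4])      # cumulative confidence
                break
        else:
            merged.append(box)
    # normalize: clip cumulative confidences to the range 0..1
    return [(x1, y1, x2, y2, min(c, 1.0)) for x1, y1, x2, y2, c in merged]

# Two overlapping stamp detections fuse into one region with summed
# confidence; the distant third detection stays separate.
stamps = [(10, 10, 50, 50, 0.4), (40, 40, 90, 90, 0.5), (200, 10, 240, 40, 0.3)]
print(integrate(stamps))
```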
Table 2: A structure of the CNN used in the experiments (based on YOLO v2)
Layer type No. Filters Size / Stride Output dim.
Convolutional 32 3×3 / 1 416 × 416
Max Pooling — 2×2 / 2 208 × 208
Convolutional 64 3×3 / 1 208 × 208
Max Pooling — 2×2 / 2 104 × 104
Convolutional 128 3×3 / 1 104 × 104
Convolutional 64 1×1 / 1 104 × 104
Convolutional 128 3×3 / 1 104 × 104
Max Pooling — 2×2 / 2 52 × 52
Convolutional 256 3×3 / 1 52 × 52
Convolutional 128 3×3 / 1 52 × 52
Convolutional 256 3×3 / 1 52 × 52
Max Pooling — 2×2 / 2 26 × 26
Convolutional 512 3×3 / 1 26 × 26
Convolutional 256 1×1 / 1 26 × 26
Convolutional 512 3×3 / 1 26 × 26
Convolutional 256 1×1 / 1 26 × 26
Convolutional 512 3×3 / 1 26 × 26
Max Pooling — 2×2 / 2 13 × 13
Convolutional 1024 3×3 / 1 13 × 13
Convolutional 512 1×1 / 1 13 × 13
Convolutional 1024 3×3 / 1 13 × 13
Convolutional 512 1×1 / 1 13 × 13
Convolutional 1024 3×3 / 1 13 × 13
Convolutional 1024 3×3 / 1 13 × 13
Convolutional 1024 3×3 / 1 13 × 13
Convolutional 55 1×1 / 1 13 × 13
Avg Pooling 55 Global 1 × 1
SoftMax — — —
4 Experiments
The detector's accuracy is estimated with the help of the Intersection over Union (IoU), considered a reliable detection quality estimate in many challenges, such as the popular PASCAL VOC challenge [2]. An IoU score is in the range 0–1. During the experiments we set an IoU threshold, which is responsible for the false positive and false negative rates. The lower the threshold, the larger the number of falsely detected regions that can appear. Conversely, the larger the threshold, the higher the false negative rate. The default threshold value proposed by the YOLOv2 developers is equal to 0.25 [17]. We investigated its value in the range from 0.02 to 0.93, in order to increase the detection accuracy, and calculated the average IoU values for all classes.
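For reference, IoU for two axis-aligned boxes can be computed as follows (a minimal sketch; the box format is our assumption, while the 0.25 default threshold comes from the paper):

```python
# Intersection over Union (IoU) of two axis-aligned boxes (x1, y1, x2, y2):
# area of overlap divided by area of union, giving a score in 0..1.

def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))   # overlap width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))   # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two half-overlapping 10x10 boxes: intersection 50, union 150 -> 1/3.
score = iou((0, 0, 10, 10), (0, 5, 10, 15))
print(score)

# Such a detection would be accepted at the default 0.25 threshold [17].
print(score >= 0.25)
```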
4.3 Results
Several experiments have been carried out in order to verify the quality of page segmentation in scanned documents. The detection involved testing the detector with different threshold values. The results, compared to our previous method (involving Haar-like features and AdaBoost learning, i.e. the Viola-Jones detector [21] implemented in the OpenCV library), are presented in Tab. 4. The presented average and maximal values are for various thresholds of the detector.
As can be seen, the CNN-based detector is superior to the more classical Haar + AdaBoost detector, except for the text class (for which it is almost equally accurate). A significantly higher detection accuracy can be observed in the case of the highly variable stamp and signature classes. It should be stressed that both algorithms were trained on the same datasets. Several examples of page segmentation are presented in the following figures (Fig. 4-6).
If we take a closer look at the IoU values (see Fig. 7), we discover that the YOLOv2 threshold parameter has only a slight impact on the detection accuracy. It seems that a threshold around 0.5 gives optimal results, since no further increase in the mean IoU can be observed for lower values.
Fig. 4: Segmentation results for document No. 147: original image with bounding
boxes (ground truth) and the maps for signature class, stamp class and text class,
respectively.
Fig. 5: Segmentation results for document No. 168: original image with bounding
boxes (ground truth) and the maps for signature class, stamp class and text class,
respectively.
Fig. 6: Segmentation results for document No. 201: original image with bounding
boxes (ground truth) and the maps for stamp class, table class and text class,
respectively.
96.18%. The weighted value (taking into consideration the different numbers of test samples) is equal to 91.57%. In [10], the authors obtained an average detection accuracy equal to 81.84%; however, when we consider only the classes that are similar to our case (without the stamp class), the accuracy drops to 72.95%. The main problem with that approach is a high number of false rejections in the case of tables. On the other hand, our algorithm detected tables with a very high accuracy. In [22] the mean accuracy for 9 classes is equal to 84.38%. When we reduce the set in order to be similar to the one in our case (also without the stamp class), it is equal to 89.11%. The highest detection rate was obtained for the printed text class, while the most difficult class is the logotype.
As was shown, our approach is superior to the state-of-the-art methods, while featuring a very intuitive computational flow and a high processing speed (thanks to the utilization of GPU-enabled CNN computations). The average processing time for a single document is lower than one second (we used an Nvidia GeForce GTX 1070M). It also takes into consideration classes that are not analysed by many existing algorithms, i.e. stamps and signatures. It should also be stressed that if we extend the learning datasets and perform extra training iterations, the accuracy could be even higher.
5 Summary
We have presented a novel approach to the extraction of visual objects from digitized paper documents. Its main contribution is a two-stage detection/verification idea based on deep learning. As opposed to other known methods, the whole framework is common for various classes of objects. It also detects classes that are not considered in other global approaches, namely signatures and stamps. Extensive experiments showed that the whole idea is valid and may be applied in practice.
References
19. Sauvola, J., Pietikäinen, M.: Page Segmentation and Classification Using Fast Feature Extraction and Connectivity Analysis. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition (ICDAR), pp. 1127–1131 (1995)
20. Su, C., Haralick, R.M., Phillips, I.T.: Extraction of text lines and text blocks on document images based on statistical modeling. International Journal of Imaging Systems and Technology, vol. 7, no. 4, pp. 343–356 (1996)
21. Viola, P., Jones, M.J.: Robust real-time face detection. International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154 (2004)
22. Wang, Y., Phillips, I.T., Haralick, R.M.: Document zone content classification and its performance evaluation. Pattern Recognition, vol. 39, no. 1, pp. 57–73 (2006)
23. Zhong, Y., Zhang, H., Jain, A.K.: Automatic caption localization in compressed video. IEEE TPAMI, vol. 22, no. 4, pp. 385–392 (2000)
24. Zhu, G., Zheng, Y., Doermann, D., Jaeger, S.: Signature Detection and Matching for Document Image Retrieval. IEEE TPAMI, vol. 31, no. 11, pp. 2015–2031 (2009)
25. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1106–1114 (2012)