
2019 International Conference on Document Analysis and Recognition Workshops (ICDARW)

An automated technique to recognize and extract images from scanned archaeological documents

Cindy Roullet∗ , David Fredrick† , John Gauch∗ and Rhodora Vennarucci†


∗ Computer Science and Computer Engineering
Emails: ceroulle@uark.edu, jgauch@uark.edu
† World Languages, Literatures and Cultures
Emails: dfredric@uark.edu, rhodorav@uark.edu
∗† University of Arkansas
Fayetteville, Arkansas USA

Abstract—"Pompei: pitture e mosaici" is a valuable set of volumes containing over 20,000 annotated historical images of the archaeological site of Pompeii, Italy. Our project consists of extracting, archiving, analyzing and classifying all the image data from a digitized version of these books. In this paper, we describe a method that automatically locates and separates graphical elements such as maps, drawings, paintings and photographic images from text. We also introduce our ongoing work on the interpretation of the retrieved data.

Keywords—Scanned documents; Graphics detection; Archaeology; Image processing; Document layout; Text/Graphics separation; Dataset; Image cropping; Pompeii

I. INTRODUCTION

Pompeii contains a wealth of historically significant art. Sadly, since its discovery this artwork has been exposed to harsh conditions, including not only gradual deterioration but also bombing during World War II, an earthquake in 1980, and vandalism, all of which have degraded the quality of the art in Pompeii.

Fortunately, photographs of most of the art of Pompeii in its best state can be found in the volumes of "Pompei: pitture e mosaici". This collection has 10 volumes containing approximately 20,000 photos of Pompeiian art from the 1920s, when excavation of the site was at its peak.

"Pompei: pitture e mosaici" is the most accurately annotated and controlled resource we have. Among other things, it provides thorough descriptions and uses standard room labels that map each artifact to its precise location on the site. One of the goals of our work is to create a robust dataset identifying the art of Pompeii. Digitizing "Pompei: pitture e mosaici" can provide a solid core of data to which we could later add images from other resources. In digital form, the resource will be a great tool for many kinds of data analysis of the art of Pompeii.

Another goal of our research is to classify the images and run various analyses on the data, such as training neural networks to identify the four Roman wall painting styles. In the near future, analyzing art complexity levels and mapping artworks to their locations could help predict where each type of art lies buried on the site.

Each page of the books of "Pompei: pitture e mosaici" contains different types of images, such as maps, photographs, and schematic drawings. Many pages also contain text in Italian describing the artwork. This paper presents our image processing solution for locating and extracting the image portions of each page, and describes how we overcame the technical problems that arise when cropping images from the scans.

II. RELATED WORK

In the past decade, the automatic extraction of text and image elements from scanned documents has been researched and developed, especially for document digitization and document retrieval. It is a challenging problem because documents come in many different formats. Systems that recognize the layout and structure of the page are the most common approach to solving it, because they can adapt to the various existing layouts and extract text regions for Optical Character Recognition (OCR) [3].

Detecting the layout enables us to extract features from the page and classify them into categories such as text, images and tables. One method, called Document Structure Recognition (DSR) [1], applies image skeletonization to the whole page and analyzes each skeleton segment separately. On the hypothesis that a text element is noticeably smaller than a graphic element, it uses the size and height of each skeleton segment to classify it as graphic or non-graphic, then merges the segments to obtain a final bounding box containing the graphic. This method assumes that the documents have a distinct contrast between background and foreground. It also assumes that picture elements are color images and that drawings consist of interconnected groups of lines and curves, which was not a workable assumption for us. In our project,

978-1-7281-5054-3/19/$31.00 ©2019 IEEE


DOI 10.1109/ICDARW.2019.00009

Authorized licensed use limited to: DRDC LIBRARY. Downloaded on June 23,2020 at 17:56:22 UTC from IEEE Xplore. Restrictions apply.
images could be black and white or in color, photographs or diagrams, sketches or paintings; we could not generalize the definition of an image in the same way.

Similarly, the elements of a document can be extracted using SIFT features and then classified into categories with AdaBoost learning [2]. Methods like this one do not always provide bounding-box determination and image extraction. Yet the latter was our main focus, because our goal is to create a set of images from the scanned volumes of "Pompei: pitture e mosaici". Most of the images in the volumes are rectangular, which was a factor in choosing our solution.

One approach that does provide bounding boxes around the different elements extracts the layout without feature extraction, first applying Edge Enhancing Diffusion (EED) to smooth the image and enhance its edges at the same time [3]. This minimizes the effect of scanning artifacts and noise in the image. It then uses an active-contour, level-set method to find the boundaries of the text and image portions. Since this method grows each boundary from an arbitrary starting point, it can adapt to any text or image shape. It does not provide a way to classify the resulting bounding boxes, but it shows interesting results for text extraction.

Alternatively, a multi-resolution approach that characterizes a document by texture features can be used to determine its layout. Comparing the extracted features to the features of documents whose ground truth is known allows the layout to be recognized without worrying about the rotation and skew distortions of scanned documents [4]. Many digitized documents have defects that distort the image and require extra processing to overcome. Scanned pages are often not straight, being rotated or skewed; they can also show several pages at once and/or extra margins. Different factors can cause these problems, but they are usually due to the sub-par quality of the scanner.

An adaptive Wiener filter can remove noise and enhance the contrast between background and foreground, which makes it possible to detect the borders of the page with edge detection using the Prewitt operator [5], [6]. Combined with histogram analysis, this allows the edges of the page to be localized and the relevant content extracted. Furthermore, Principal Component Analysis (PCA) and the Hough transform can be used to correct the skew [6]. This is a common method that has also shown good results in detecting license plate numbers [7]. Another common problem is the varying lighting and shading introduced by scanning. We encountered this problem in our project, and describe our solution in Section III-A; a convex-hull-based algorithm has also been used to remove the drawbacks of binarization in this situation.

All of this background work has addressed serious problems in the digitization of print sources, such as dealing with poor scan quality, understanding page layouts, extracting features, and classifying text and non-text regions. However, it does not focus on accurately extracting the image regions and dealing with the obstacles that this raises, which is the main focus of this paper.

III. PROPOSED METHOD

A. Foreground Background Segmentation

The background of the scans we are working with corresponds to most of the white part of a page. The foreground elements correspond to the text, the images that we want to extract, and the various lines laid out on the page.

Figure 1: Two examples of page layout

Figure 1 shows two examples of page layout. Some pages contain only black and white images (a), some have color images, and some have both (b). The position of the text also changes with the image layout, and some pages have no descriptive text at all because the images take up the entire page. By thresholding the grayscale image, we can create a new binary image where all the pixels corresponding to the background are black and all the pixels corresponding to the foreground are white:

    new pixel value = 0 if pixel value > threshold, 255 otherwise.

Knowing that the background of the page is very close to white, using a threshold value close to 255 (the pixel value for white) could be a solution. The problem is that every scanned image has slightly different global and local lighting, depending on how the page was placed in the scanner, so the ideal global threshold that extracts most of the background differs from page to page. Figure 2 shows the results of thresholding the pages from Figure 1 using a threshold value of 220.

Figure 2: Results of thresholding the pages from Figure 1 using a threshold value of 220.

It can be observed that this works well for page (a) but not as well for page (b): the bottom image of page (b) is detected as mostly background.

One way to find a good threshold is to use an adaptive thresholding method, which calculates a different threshold value for different regions of the image. We used the variant that computes the threshold as a weighted sum over a 3x3 neighborhood of the pixel, minus a constant C. The weights form a Gaussian window, and we chose C = 2, the most commonly used value. According to previous research [7], this might not be the best adaptive method, but in our case it showed good results in bringing homogeneity to the images.

Figure 3: Comparison of thresholding methods for page (b) from Figure 1. (a) Simple thresholding; (b) Adaptive thresholding.

Figure 3 shows that the adaptive method identifies more pixels accurately for the bottom image of the page; it fixes the problem that simple thresholding was having. On the other hand, we can see that the overall accuracy went down, as background pixels that simple thresholding identified correctly are now labeled as foreground. The reverse problem can be observed for the top two images of the page (this is why the pixels appear gray to the eye instead of white in Figure 3 (b)). Since most of the pixels of the images on the page are correctly detected, the noise introduced by the adaptive method does not cause any problem.

The next steps consist of removing the noise from the thresholded image and putting the foreground in the best condition for the connected-component step to work well. To get rid of the noise from the adaptive threshold image, we used a median blur of size 3x3: sliding a 3x3 window over the image, the pixel in the middle of the window becomes the median value of all the pixels contained in the window. We then applied an average blur of size 5x5 with the intent of creating more connected pixels.

Figure 4: Comparison of using the median blur as an additional step (zoom of page (a) from Figure 2 after the threshold step). (a) Without median blur; (b) With median blur.

Figure 4 shows the effect of applying the median blur before the average blur on a portion of the page. It does a good job of removing some of the noise introduced by the adaptive threshold.

We expect the connected-components step to extract the different groups of mutually connected pixels. To make sure that most of the pixels of a single image are grouped together, they need to be touching, and applying the median blur might have removed some of those connections. This is why we used an average blur to attempt to recreate connections between pixels. The average blur works just like the median blur, except that instead of replacing the middle pixel with the median value, it uses the average of all the pixels in the window. We used a window of size 5x5 for the average blur.

B. Image Extraction

OpenCV (https://opencv.org/) provides a connected components function that returns all the connected components of an image for a given connectivity [7]. We chose to use this function with a connectivity of 8, which creates regions where pixels are connected to their 8 nearest neighbors. Figure 5 shows the results of the connected components function.
Figure 5: Connected components results for the pages from Figure 1.

As can be seen, many regions correspond to text and line segments. We can also see that the big images form their own connected components. To remove the small connected components, we counted the size of each region and removed the ones under 16,000 pixels, which cleared most of the unwanted components; see Figure 6 for an illustration.

Figure 6: Connected components left after removing the small ones.

We then collected the extremities (the minimum and maximum along the x axis and along the y axis) of each remaining component and removed the regions that were too small (1), too big (2), or overlapping (3).

For case (1), some page rules were still detected as components but were obviously too short in height to be an image. This can be seen in Figure 6 (a) and (b), where the top line of the page is visible as a component.

For case (2), there were special cases where the scanner had left a border all around the page, causing the whole page to be detected as a single image; removing images that were too big eliminated that problem.

For case (3), some images on the page were detected as several components instead of one big one, which resulted in several detected images overlapping each other. We considered two images to be overlapping if at least one corner of one image was contained inside the rectangle formed by the coordinates of the other. When this was the case, it was necessary to combine all those images and take their combination as the final image. This occurred mainly with the images of maps in the book.

Figure 8: Overlapping problem example. (a) Overlapping problem example; (b) Results after dealing with overlapping.

Figure 8 shows an example of a map and the overlapping problem it exhibited before we ensured that overlapping images are counted as one.

The extremities of the components then become the coordinates of the images to be cropped from the page. The results for the pages from Figure 1 can be seen in Figure 7, where the coordinates of each image have been drawn in red. From those coordinates, we saved the final cropped images with a padding of 10 pixels to make sure they contain the image neatly.

Figure 7: Final results of pages from Figure 1 with images framed in red.

IV. EXPERIMENT AND RESULTS

A. Evaluation

The books of "Pompei: pitture e mosaici" contain and discuss photographs of the art and mosaics of all the regions of Pompeii. The goal of extracting all the images from those books is to digitize them and create a searchable dataset, one that can enable further analysis and investigation of the art of the ancient Roman city.

To date, the described algorithm has been run on 4 of the 10 existing volumes. Each volume has about 1,000 pages, each containing 1 to 6 images. The output consists of two folders: one contains all the cropped images, and the second contains the full pages with the crop coordinates outlined. The latter is useful for evaluating the results manually and checking for errors by going through all of the pages; pages with inaccurate cropping are then cropped by hand. Figure 12 summarizes the results for each volume.

The percentage of error for each volume is less than 2 percent. We are very happy with these results; they are sufficient for the task at hand. We did not focus on the computing time of this algorithm, targeting the precision of the process rather than its speed.

B. Special cases and remaining problems

Some of the errors are unavoidable cases like the one in Figure 9, where one of the images is not rectangular and overlaps the one in the top corner. This results in a single large image, because the algorithm combines overlapping regions.

Figure 9: Result page containing a special case

Most of the images cropped too small are images of ruins with part of the sky missing. This happens because the sky is very bright. Figure 10 (a) shows an example where part of the image is missing. It also happens with very bright and white sketches, as shown in Figure 10 (b), where the left side and some of the top part of the sketch are not extracted. However, most of the errors in volume 04 occur in colored sketch images, as illustrated in Figure 11.

Figure 10: Result pages containing problems with bright sky or bright edges.

Figure 11: Result page containing a problem case with color images.

The scans of the book are high resolution, around 3500x5000 pixels per page; we did not shrink or compress them before running our algorithm. Most of our errors are of the kind shown in Figure 10. These accuracy errors average about 10 percent of the image width or height.

In the cases in Figure 11, the accuracy error is higher, as the cropping usually occurs on the actual drawing part of the image and does not take the background into account, since the background gets erased through the steps described in Section III-A.

Figure 12: Results for each volume

V. CONCLUSION

The goal of our work is to study the artwork in the 10-volume set of "Pompei: pitture e mosaici". This paper presented an effective automated approach for detecting and extracting images from the scans of the books of "Pompei: pitture e mosaici". Thanks to our method, we were able to accomplish our goal of extracting the images quickly: we correctly and automatically extracted 7,250 images from Volumes 1-4 of "Pompei: pitture e mosaici". In fact, very little manual cropping was needed thanks to the efficiency of the algorithm, with only 54 images requiring manual extraction.

With this collection of numerous images of Pompeii, we are now working on training a Convolutional Neural Network (CNN) to categorize the images and to recognize the four Pompeian styles prominent in ancient Roman wall paintings. We will then link the images to their geographic locations to build a broadly searchable and usable database. Running OCR on the text portions of the pages is also part of our plans, as it will help with the analysis of the images and become a searchable element of our database.

ACKNOWLEDGMENT

The authors would like to thank the staff of the library of the University of Arkansas for their hard work, patience and communication during the scanning of the volumes of "Pompei: pitture e mosaici".

REFERENCES

[1] Grzegorz Kamola, Michal Spytkowski, Mariusz Paradowski, Urszula Markowska-Kaczmar, Image-based logical document structure recognition. 2014.
[2] Shumeet Baluja, Michele Covell, Finding Images and Line-Drawings in Document-Scanning Systems. International Conference on Document Analysis and Recognition (ICDAR), 2009.
[3] Sachin Kumar S., Parvathy Rajendran, Prabaharan P., K. P. Soman, Text/Image Region Separation for Document Layout Detection of Old Document Images using Non-linear Diffusion and Level Set. International Conference on Advances in Computing and Communications, 2016.
[4] Nicholas Journet, Jean-Yves Ramel, Rémy Mullot, Véronique Eglin, Document image characterization using a multiresolution analysis of the texture: application to old documents. IJDAR, 2008.
[5] Maryam Shamqoli, Hossein Khosravi, Border Detection of Document Images Scanned From Large Books. 2013.
[6] Rajeev N. Verma, Latesh G. Malik, Review of Illumination and Skew Correction Techniques for Scanned Documents. International Conference on Advanced Computing Technologies and Applications, 2015.
[7] K. M. Sajjad, Automatic License Plate Recognition using Python and OpenCV. 2010.
