Chapter 1
Introduction
Extraction of text from document images has seen rapid progress in the Document Image Analysis domain. Discrete objects such as tables and figures in documents such as catalogues can now be segmented efficiently. However, the text recognized within those tables remains unstructured.
The proposal here is to analyse the text both horizontally and vertically and group it to define a structure for the segmented table.
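The horizontal and vertical grouping proposed above can be sketched as follows. This is a minimal illustration under assumed inputs, not the project's actual implementation: the word bounding boxes and the tolerance value are hypothetical.

```python
# Group word bounding boxes into rows (horizontal grouping); sorting each
# row left-to-right then exposes the column (vertical) structure.
# Boxes are (x, y, width, height) tuples; all coordinates are made up.

def group_into_rows(boxes, y_tolerance=10):
    """Cluster word boxes whose vertical centres lie within y_tolerance."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        cy = box[1] + box[3] / 2
        for row in rows:
            ref = row[0]
            if abs((ref[1] + ref[3] / 2) - cy) <= y_tolerance:
                row.append(box)
                break
        else:
            rows.append([box])
    # Within each row, sort left to right so cells line up as table columns.
    return [sorted(row, key=lambda b: b[0]) for row in rows]

words = [(120, 52, 40, 18), (10, 50, 30, 20), (10, 100, 35, 19), (118, 101, 42, 18)]
table = group_into_rows(words)
```

The two near-equal y-coordinates fall into one row, so the four boxes come back as a 2x2 grid of cells.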
1.1 Aim
The aim of text detection and extraction from a commercial catalogue is to classify the text by grouping it, to map the text elements using spatial knowledge, and to create a heterogeneous data structure to store the data. Text identification is done by segmenting text and non-text regions using edge-feature extraction from the binary image, followed by extracting the text from the non-text regions. After extraction, the data is identified on the basis of five attributes: typeface, slope, size, width and weight of the text.
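One possible shape for such a heterogeneous structure is sketched below: each extracted text element carries the five attributes named above plus its position. The field names and example values are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass

# A record for one extracted text element with the five attributes
# (typeface, slope, size, width, weight). Values shown are hypothetical.

@dataclass
class TextElement:
    text: str
    typeface: str          # e.g. serif / sans-serif
    slope: str             # e.g. roman / italic
    size: float            # font size in points
    width: float           # bounding-box width in pixels
    weight: str            # e.g. regular / bold
    position: tuple = (0, 0)   # (x, y) of the bounding box in the image

cell = TextElement("Price", "serif", "roman", 12.0, 48.0, "bold", (130, 40))
```

Keeping heterogeneous attribute types in one record makes it straightforward to store each table cell, together with its styling, in a database row later.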
1.2 Purpose
The purpose of this project is to find an innovative way of converting images that contain text in tabular form into a structured digital format similar to the catalogue view. Our system offers a unique solution for converting unstructured data into structured form by minimising the manual work of identifying and editing the digital data. This approach makes storing the data in a database easier, and it also addresses one aspect of Big Data, which largely consists of unstructured data.
1.3 Overview
As images contain both text and noise, our goal is to identify and extract the text much as the human eye recognizes and understands it. Images laid out in tabular form are converted into an unstructured output format by Google Lens technology, which extracts the text line by line regardless of the attributes of the text it belongs to. The problem with this existing system is that each piece of text is extracted in isolation, without any relation to the other text being extracted; moreover, we do not get the output in
structured format as we see it in the input image. By applying knowledge of data structures, geometry, classes and objects, and pixel locations from computer graphics, we relate each piece of text to its respective attributes. The solution is to classify the text by grouping it, to map the text elements using spatial knowledge, and to create a heterogeneous data structure to store the data. Initially an image is captured as input, and the document borders are detected by computing the mean for every block and assigning it to the pixel intensity values. The document borders are then removed by performing a logical OR operation. Edge features of the ruling lines are extracted using horizontal and vertical filtering of the table. After the table is identified, each letter of the text is boxed and its attributes are recognized using character- and word-spacing techniques, logically dividing the text into upper, centre and lower zones. The font weight is detected from the pixel density, and the typeface is detected by inspecting the upper and lower zones of the text.
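Two of the steps above, horizontal/vertical filtering for ruling lines and pixel density for font weight, can be sketched on a binary image. This is a simplified illustration with assumed run-length and density thresholds, not the system's actual filter.

```python
import numpy as np

# In a binary image (1 = ink), a horizontal ruling line appears as a long
# run of ink along a row and a vertical line as a long run down a column.
# The 0.8 run-length fractions and 0.35 density threshold are assumptions.

def find_ruling_lines(binary, min_h_run=0.8, min_v_run=0.8):
    h, w = binary.shape
    h_lines = [r for r in range(h) if binary[r].sum() >= min_h_run * w]
    v_lines = [c for c in range(w) if binary[:, c].sum() >= min_v_run * h]
    return h_lines, v_lines

def looks_bold(glyph, density_threshold=0.35):
    """Crude font-weight test: bold glyphs have a higher ink density."""
    return glyph.mean() >= density_threshold

# Toy 6x6 "table" with a full ruling line at row 0 and column 0.
img = np.zeros((6, 6), dtype=int)
img[0, :] = 1
img[:, 0] = 1
rows, cols = find_ruling_lines(img)
heavy = looks_bold(np.ones((4, 4)))   # fully-inked glyph reads as bold
```

In practice the thresholds would have to be tuned to the scan resolution, and broken ruling lines would need morphological closing before the run test.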
In order to get an English word set from an image or catalogue, we randomly select 1,000 English words from the 5,000 most common English words sampled from a large corpus. The selected 1,000 words are then randomly divided into lower case and upper case with equal probability. We collect 447 typefaces in total, each with a different number of variations resulting from combinations of styles, e.g. regular, bold and italic, leading to 2,420 font classes in the end. For each font class, we generate one image per English word, which gives 2.42 million synthetic images for the whole dataset. We then normalize the text size by adding the two lower-case letters “fg” in front of each word to find the ascender and descender lines of the text; “fg” is afterwards removed from the synthesized images. After normalization, all word images have the same height of 105 pixels. Finally, we crop the text from these images with a bounding box so that its size matches approximately the same scale as the synthetic data.
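The arithmetic behind the “fg” normalization step above can be shown without a rendering library: prepending “fg” guarantees the measured ink extent spans from the ascender (top of “f”) to the descender (bottom of “g”), so every word is scaled by the same rule. The pixel values used here are hypothetical.

```python
# Map the measured ascender-to-descender extent onto the target height
# of 105 px; the same scale factor is then applied to the whole word image.

TARGET_HEIGHT = 105  # normalized text height in pixels

def normalization_scale(ascender_y, descender_y):
    """Scale factor that maps the ascender-descender span to 105 px."""
    extent = descender_y - ascender_y   # ink extent measured with "fg" prepended
    return TARGET_HEIGHT / extent

# A word rendered at some arbitrary size: ascender line at y=20,
# descender line at y=90, i.e. an extent of 70 px.
s = normalization_scale(20, 90)
new_word_width = round(280 * s)   # the word bitmap width scales by the same factor
```

Because the scale is derived from the “fg” extent rather than from the word's own bounding box, words without ascenders or descenders (e.g. “we”) still end up on a consistent baseline grid.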
Chapter 2
Literature Survey
A literature survey, or literature review, in a project report is the section that presents the various analyses and research carried out in the field of interest and the results already published, taking into account the parameters and the extent of the project. It is the most important part of the report, as it gives direction to the area of our research. It helps to set a goal for the analysis, thus giving the problem statement.
In the paper Table Benchmark for Image-based Table Detection and Recognition by Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou and Zhoujun Li (2019), the authors presented TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. Existing research on image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labelled examples, which is difficult to generalize to real-world applications. With TableBank, which contains 417K high-quality labelled tables, they built several strong baselines using state-of-the-art deep neural network models. They made TableBank publicly available, which will empower more deep learning approaches to the table detection and recognition task.
In the paper Automatic Detection of Font Size Straight from Run Length Compressed Text Documents by Mohammed Javed, P. Nagabhushan and B. B. Chaudhuri (2014), the authors note that automatic detection of font size finds many applications in the area of
intelligent OCR-ing and document image analysis, which has traditionally been practised over uncompressed documents, although in real life documents exist in compressed form for efficient storage and transmission. It would be novel and intelligent if font size detection could be carried out directly on the compressed data of these documents without decompressing, which would save a considerable amount of processing time and space. The paper therefore presents a novel idea of learning and detecting font size directly from run-length compressed text documents at line level using simple line-height features, which paves the way for intelligent OCR-ing and document analysis directly on compressed documents. In the proposed model, the given mixed-case text documents of different font sizes are segmented into compressed text lines, and features such as line height and ascender height are extracted to capture the pattern of font size in the form of a regression line, which is then used for automatic detection of font size during the recognition stage. The method was evaluated on a dataset of 50 compressed documents consisting of 780 text lines of a single font size and 375 text lines of mixed font sizes, resulting in an overall accuracy of 99.67%.
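The regression idea in that paper can be illustrated in a few lines: learn a linear mapping from line height to font size on labelled lines, then predict the font size of new lines. The training pairs below are made-up toy data, not values from the paper.

```python
import numpy as np

# Fit font size (points) as a linear function of line height (pixels),
# mimicking the regression-line model learned from compressed text lines.
# These five (height, size) pairs are purely illustrative.

line_heights = np.array([14.0, 18.0, 24.0, 30.0, 36.0])  # pixels
font_sizes   = np.array([10.0, 12.0, 16.0, 20.0, 24.0])  # points

slope, intercept = np.polyfit(line_heights, font_sizes, 1)

def predict_font_size(line_height):
    """Estimate the font size of a text line from its height alone."""
    return slope * line_height + intercept
```

The appeal of the approach is that line height is available directly from run-length data, so the regression can be applied without ever decompressing the document.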
In the paper Large-Scale Visual Font Recognition by Guang Chen, Jianchao Yang, Hailin Jin, Jonathan Brandt, Eli Shechtman, Aseem Agarwala and Tony X. Han, the authors address the large-scale visual font recognition (VFR) problem, which aims at automatic identification of the typeface, weight and slope of the text in an image or photo without any knowledge of its content. Although visual font recognition has many practical applications, it has largely been neglected by the vision community. To address the VFR problem, they construct a large-scale dataset containing 2,420 font classes, which easily exceeds the scale of most image categorization datasets in computer vision. As font recognition is inherently dynamic and open-ended, i.e. new classes and data for existing categories are constantly added to the database over time, they propose a scalable solution based on the nearest class mean classifier (NCM). The core algorithm is built on local feature embedding, local feature metric learning and max-margin template selection, which is naturally amenable to NCM and thus to such open-ended classification problems. The new algorithm can generalize to new classes and new data at little added cost. Extensive experiments demonstrate that the approach is very effective on synthetic test images and achieves promising results on real-world test images.
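A minimal nearest class mean classifier, the building block the VFR paper extends, can be sketched as follows. Each class is represented by the mean of its feature vectors and a query is assigned to the nearest mean, so adding a new font class only requires computing one more mean. The two-dimensional features below are toy data, not real font features.

```python
import numpy as np

# Bare-bones NCM: store one mean vector per class, classify by the
# smallest Euclidean distance to a class mean. The paper's full method
# additionally learns a metric and an embedding before this step.

class NCMClassifier:
    def __init__(self):
        self.means = {}   # class label -> mean feature vector

    def add_class(self, label, features):
        """Register (or refresh) a class from its training feature vectors."""
        self.means[label] = np.asarray(features).mean(axis=0)

    def predict(self, x):
        """Return the label whose class mean is closest to x."""
        x = np.asarray(x)
        return min(self.means, key=lambda c: np.linalg.norm(x - self.means[c]))

clf = NCMClassifier()
clf.add_class("regular", [[0.1, 0.2], [0.2, 0.1]])
clf.add_class("bold",    [[0.9, 0.8], [0.8, 0.9]])
label = clf.predict([0.85, 0.85])
```

This open-ended behaviour is exactly why NCM suits font recognition: a newly released typeface becomes classifiable after a single `add_class` call, with no retraining of existing classes.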
Chapter 3
Conclusion
The input to the proposed system is a captured image. Document borders are detected by computing the mean for every block and assigning it to the pixel intensity values, and are then removed by performing a logical OR operation. Edge features of the ruling lines are extracted using horizontal and vertical filtering of the table. After the table is identified, each letter of the text is boxed and its attributes are recognized using character- and word-spacing techniques, logically dividing the text into upper, centre and lower zones. The font weight is detected from the pixel density, and the typeface is detected by inspecting the upper and lower zones of the text. In order to get an English word set from an image or catalogue, we randomly select 1,000 English words from the 5,000 most common English words sampled from a large corpus, and randomly divide them into lower case and upper case with equal probability. We collect 447 typefaces in total, each with a different number of variations resulting from combinations of styles, e.g. regular, bold and italic, leading to 2,420 font classes in the end. For each font class, we generate one image per English word, which gives 2.42 million synthetic images for the whole dataset. We then normalize the text size by adding the two lower-case letters “fg” in front of each word to find the ascender and descender lines of the text; “fg” is afterwards removed from the synthesized images. After normalization, all word images have the same height of 105 pixels, and the text is cropped from these images with a bounding box so that its size matches approximately the same scale as the synthetic data. Finally, the identified and extracted text is mapped to the actual positions at which it appeared visually in the input image.
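The final mapping step can be sketched as follows: each recognized string is stored with the bounding box where it was found, so the structured output can mirror the visual layout of the catalogue. The cell contents, coordinates and the 20-pixel row band are all illustrative assumptions.

```python
# Rebuild the visual layout from (text, bounding box) pairs: cells whose
# y-coordinates fall in the same band form a row; within a row, cells are
# ordered by x. The 20 px band height is an assumed value.

def map_to_layout(cells):
    """cells: list of (text, (x, y, w, h)); returns rows in reading order."""
    by_row = {}
    for text, (x, y, w, h) in cells:
        by_row.setdefault(y // 20, []).append((x, text))
    return [[t for _, t in sorted(row)] for _, row in sorted(by_row.items())]

cells = [("Price", (130, 41, 40, 15)), ("Item", (10, 40, 30, 15)),
         ("Pen",   (10, 80, 25, 15)),  ("20",   (132, 81, 20, 15))]
layout = map_to_layout(cells)
```

The resulting list of rows preserves the table structure of the input image and can be written to a database as-is.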
Dept. of CSE NIE-IT, Mysore, 2019-20 Page 7