
TEXT DETECTION AND EXTRACTION FROM COMMERCIAL CATALOGUE

Chapter 1

INTRODUCTION
Extraction of text from document images has seen rapid progress in the Document Image Analysis domain. Discrete objects such as tables and figures in documents such as catalogues can now be segmented efficiently. However, the text recognized within those tables remains unstructured.
The proposal here is to analyse text both horizontally and vertically and group it to define a structure for the segmented table.

1.1 Aim
The aim of implementing text detection and extraction from commercial catalogues is to classify the text by grouping it, to map the text elements using spatial knowledge, and to create a heterogeneous data structure to store the data. Text identification is performed by segmenting text and non-text regions using edge-feature extraction from the binary image, followed by extracting the text from the non-text regions. After extraction, the data is identified on the basis of five attributes: typeface, slope, size, width and weight of the text.
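As a rough illustration of such a heterogeneous structure, the Python sketch below pairs each extracted word with the five attributes and its position; the field names are ours and purely illustrative, not taken from any specific implementation:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TextElement:
    """One extracted word together with the five attributes named above."""
    text: str
    typeface: str                      # e.g. "Times New Roman"
    slope: str                         # "upright" or "italic"
    size: float                        # font size in points
    width: int                         # bounding-box width in pixels
    weight: str                        # "normal" or "bold"
    bbox: Tuple[int, int, int, int]    # (x, y, w, h) in image coordinates

@dataclass
class TableCell:
    """A table cell holding a heterogeneous mix of text elements."""
    row: int
    col: int
    elements: List[TextElement] = field(default_factory=list)
```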

1.2 Purpose
The purpose of this project is to find an innovative way of converting images that contain text in tabular form into a structured digital format that mirrors the catalogue view. Our system offers a unique way to convert unstructured data into structured data, minimising the manual work of identifying and editing the digital data. This approach makes the data easier to store in a database, and it also addresses one facet of Big Data, which consists largely of unstructured data.

1.3 Overview
Images contain both text and noise; our goal is to identify and extract the text much as the human eye recognizes and understands it. Images laid out in tabular form can be converted into unstructured output using Google Lens technology, which extracts the text line by line regardless of the attributes the text belongs to. The problem with this existing system is that each piece of text is extracted in isolation, without the relationships between the extracted items, and the output does not preserve the
structured format seen in the input image. By applying knowledge of data structures, geometry, classes and objects, and pixel locations from computer graphics, we relate each piece of text to its respective attributes. The proposed solution is to classify the text by grouping it, to map the text elements using spatial knowledge, and to create a heterogeneous data structure to store the data. Initially, an image is captured as input and the document borders are detected by computing the mean for every block and assigning it as the pixel intensity value. The document borders are then removed by performing a logical OR operation. Edge features of the ruling lines are extracted using horizontal and vertical filtering of the table. After the table is identified, each letter of the text is boxed and its attributes are recognized using character- and word-spacing techniques, logically dividing the text into upper, centre and lower zones. The font weight is detected from the pixel density, and the typeface is detected by inspecting the upper and lower zones of the text.
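A minimal sketch of that horizontal and vertical filtering step, using standard OpenCV morphological opening as a stand-in for the exact filter described above (the `scale` parameter and the ink-as-white convention are assumptions of this sketch):

```python
import cv2
import numpy as np

def extract_ruling_lines(binary: np.ndarray, scale: int = 20) -> np.ndarray:
    """Isolate the horizontal and vertical ruling lines of a table.

    `binary` is assumed to hold ink as white (255) on black (0); `scale`
    sets the minimum line length relative to the image size.
    """
    h, w = binary.shape
    # Horizontal filtering: opening with a wide, flat kernel keeps only
    # long horizontal runs of ink (the horizontal ruling lines).
    horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (max(1, w // scale), 1))
    horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horiz_kernel)
    # Vertical filtering: the same idea with a tall, thin kernel.
    vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, max(1, h // scale)))
    vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vert_kernel)
    # The table skeleton is the union (logical OR) of both line sets.
    return cv2.bitwise_or(horizontal, vertical)
```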
To build an English word set for the synthetic data, we randomly select 1,000 English words from the 5,000 most common English words sampled from a large corpus. The selected 1,000 words are then randomly rendered in lower or upper case with equal probability. We collect 447 typefaces in total, each with a different number of variations resulting from combinations of styles (regular, bold, italic, etc.), leading to 2,420 font classes in the end. For each font class, we generate one image per English word, which gives 2.42 million synthetic images for the whole dataset. We normalize the text size by adding the two lower-case letters “fg” in front of each word to find the ascender and descender lines of the text; the “fg” is then removed from the synthesized images. After normalization, all word images have the same height of 105 pixels. We then crop the texts from real images with a bounding box to normalize the text size to approximately the same scale as the synthetic data.
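A minimal sketch of this synthesis and “fg” normalization, assuming Pillow and an arbitrary font path and point size (neither is specified above):

```python
from PIL import Image, ImageDraw, ImageFont, ImageOps

TARGET_HEIGHT = 105  # the normalized word-image height stated above

def render_word(word: str, font_path: str, pt: int = 96) -> Image.Image:
    """Render one synthetic word image for one font class."""
    font = ImageFont.truetype(font_path, pt)
    # "fg" spans the ascender ("f") and descender ("g") lines, so the
    # ink of "fg" + word always covers the font's full vertical extent.
    probe = "fg" + word
    canvas = Image.new("L", (pt * (len(probe) + 1), pt * 3), 255)
    draw = ImageDraw.Draw(canvas)
    draw.text((pt // 2, pt), probe, fill=0, font=font)
    # Bounding box of the ink (invert so text is non-zero for getbbox).
    left, top, right, bottom = ImageOps.invert(canvas).getbbox()
    # Crop the "fg" probe back off (approximate: kerning between the
    # probe and the word is ignored).
    fg_width = int(draw.textlength("fg", font=font))
    word_img = canvas.crop((left + fg_width, top, right, bottom))
    # Scale so every word image ends up 105 pixels tall.
    scale = TARGET_HEIGHT / word_img.height
    return word_img.resize((max(1, int(word_img.width * scale)), TARGET_HEIGHT))
```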


Chapter 2

Literature Survey
A literature survey, or literature review, in a project report is the section that presents the analyses, research and results already published in the field of interest, taking into account the parameters and extent of the project. It is one of the most important parts of the report, as it gives direction to the area of research and helps to set a goal for the analysis, thus yielding the problem statement.

2.1 Survey Papers


In the paper TableBank: Table Benchmark for Image-based Table Detection and Recognition by Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou and Zhoujun Li (2019), the authors present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. Existing research on image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labelled examples, which makes it difficult to generalize to real-world applications. With TableBank, which contains 417K high-quality labelled tables, they build several strong baselines using state-of-the-art deep neural network models. They made TableBank publicly available, and it should empower more deep-learning approaches to the table detection and recognition task.

In the paper A framework for information extraction from tables in biomedical literature by Nikola Milosevic, Cassie Gregson, Robert Hernandez and Goran Nenadic (2019), the authors note that text mining has in the past provided methods to retrieve and extract information from text; however, most of those approaches ignored tables and figures. Research on mining table data still lacks an integrated approach that considers all the complexities and challenges of a table. This work examines methods for extracting numerical and textual information from tables in the clinical literature. It
presents a requirement-analysis template and an integral methodology for information extraction from tables in the clinical domain comprising seven steps: (1) table detection, (2) functional processing, (3) structural processing, (4) semantic tagging, (5) pragmatic processing, (6) cell selection and (7) syntactic processing and extraction. The approach achieved F-measures ranging between 82% and 92%, depending on the variable, the task and its complexity.
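Purely to show how those seven stages compose, here is a skeletal chaining in Python; every stage is a stub, since the actual logic of each step is defined in the paper, not here:

```python
def stub(name):
    """Placeholder for one of the paper's seven processing steps."""
    return lambda data: data  # the real logic is defined in the paper

# Chain the seven steps in order; each stub just passes its input through.
pipeline = [stub(n) for n in (
    "table detection", "functional processing", "structural processing",
    "semantic tagging", "pragmatic processing", "cell selection",
    "syntactic processing and extraction")]

def extract_from_table(table_image):
    data = table_image
    for step in pipeline:
        data = step(data)
    return data
```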

In the paper Automatic localization and extraction of tables from handheld mobile-camera captured handwritten document images by R. Amarnath, G.S. Sindhushree, P. Nagabhushan and Mohammed Javed (2019), the authors observe that a table is a compact, effective and structured way of representing information in any document. Automatic localization of tables in scanned and handwritten document images, and extraction of their information, are critical and challenging tasks for applications such as Optical Character Recognition, handwriting analysis and auto-evaluation systems. The task becomes more complex when the handwritten document images are acquired through handheld mobile cameras, because the captured images are naturally distorted by poor illumination, device vibration, camera angle, camera orientation, camera movement and camera distance. The article proposes a novel technique for automatic localization and segmentation of tables in handwritten document images captured with a handheld mobile camera. Generally, ruling lines are used for structuring tables, sketching figures and scribing scientific equations. In this work, tables are detected and extracted based on the edge features of the ruling lines through three main stages. First, a block-wise mean-computed fuzzy binarization technique is proposed for analysing the distortion in the acquired image, after which the background surface enveloping the document area of the image is removed. Second, a horizontal and vertical granule (strip-based) technique is proposed for fast edge-feature extraction from the ruling lines of the table in the binarized image. Finally, entropy quantifiers are employed for segmenting the table in the image. The performance of the proposed technique is evaluated on a proposed composite handwritten benchmark dataset, and a linear computational cost of O(h×w) is observed in the worst case.
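The first of those stages can be approximated with a plain block-wise mean threshold; the sketch below is our simplification, omitting the paper's fuzzy-membership refinement, and the 32-pixel block size is an assumption:

```python
import numpy as np

def blockwise_mean_binarize(gray: np.ndarray, block: int = 32) -> np.ndarray:
    """Binarize a camera-captured page with per-block mean thresholds."""
    out = np.zeros_like(gray)
    h, w = gray.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            tile = gray[y:y + block, x:x + block]
            # Pixels darker than the local mean are treated as ink.
            out[y:y + block, x:x + block] = np.where(tile < tile.mean(), 255, 0)
    return out
```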

In the paper Automatic Detection of Font Size Straight from Run Length Compressed Text Documents by Mohammed Javed, P. Nagabhushan and B.B. Chaudhuri (2014), the authors note that automatic detection of font size finds many applications in the area of
intelligent OCR-ing and document image analysis, which have traditionally been practised over uncompressed documents, although in real life documents exist in compressed form for efficient storage and transmission. It would be novel and intelligent if font size detection could be carried out directly on the compressed data of these documents without decompression, saving a considerable amount of processing time and space. The paper therefore presents a novel idea for learning and detecting font size directly from run-length compressed text documents at line level using simple line-height features, which paves the way for intelligent OCR-ing and document analysis directly on compressed documents. In the proposed model, the given mixed-case text documents of different font sizes are segmented into compressed text lines, and extracted features such as line height and ascender height are used to capture the pattern of font size in the form of a regression line, from which the font size is detected automatically during the recognition stage. The method is experimented on a dataset of 50 compressed documents consisting of 780 text lines of single font size and 375 text lines of mixed font sizes, resulting in an overall accuracy of 99.67%.
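The regression stage itself is straightforward; below is a toy sketch with invented (line height, font size) pairs, purely to show the line fitting and prediction:

```python
import numpy as np

# Hypothetical training pairs (line height in pixels -> font size in pt);
# the paper learns such pairs from compressed text lines, but these
# numbers are invented purely to illustrate the regression step.
line_heights = np.array([18.0, 24.0, 30.0, 36.0, 48.0])
font_sizes = np.array([8.0, 10.0, 12.0, 14.0, 18.0])

# Fit the regression line: size = a * height + b.
a, b = np.polyfit(line_heights, font_sizes, deg=1)

def detect_font_size(line_height: float) -> float:
    """Predict the font size of a segmented compressed text line."""
    return a * line_height + b

print(round(detect_font_size(27.0)))  # -> 11, interpolated from the fit
```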

In the paper Script Independent Detection of Bold Words in Multi Font-size Documents by Pedamalli Saikrishna and A.G. Ramakrishnan (2010), a script-independent, font-size-independent scheme is proposed for detecting bold words in printed pages. In OCR applications such as minor modification of an existing printed form, it is desirable to reproduce the font size and characteristics such as bold and italics in the OCR-recognized document. In this morphological-opening-based detection of bold (MOBDoB) method, the binarized image is segmented into sub-images of uniform font size using word-height information. A rough estimate of the stroke width of the characters in each sub-image is obtained from the ink density. Each sub-image is then opened with a square structuring element whose size is determined by the respective stroke width. The union of all the opened sub-images is used to determine the locations of the bold words, and extracting all such words from the binarized image gives the final image. A minimum of 98% of bold words were detected from a total of 65 Tamil, Kannada and English pages, with a false-alarm rate of less than 0.4%.
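The core opening operation of MOBDoB might look like the following OpenCV sketch; the kernel-size rule and the ink-as-white convention are our assumptions for illustration:

```python
import cv2
import numpy as np

def bold_word_mask(binary: np.ndarray, stroke_width: int) -> np.ndarray:
    """Keep only bold strokes in one uniform-font-size sub-image.

    `binary` is assumed to hold ink as white (255); `stroke_width` is the
    rough stroke estimate obtained from the ink density.
    """
    # A square element slightly larger than the regular stroke width
    # erases normal-weight strokes but leaves the thicker bold ones.
    k = max(2, stroke_width + 1)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (k, k))
    opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
    return opened  # the union over all sub-images locates the bold words
```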

In the paper Optical Font Recognition Using Typographical Features by Abdelwahab Zramdini and Rolf Ingold (1998), a new statistical approach based on global typographical features is proposed for the widely neglected problem of font recognition. It aims at identifying the typeface, weight, slope and size of the text in an image block without any knowledge of the content of that text. The recognition is based on a multivariate Bayesian classifier and operates on a given set of known fonts. The effectiveness of the approach was evaluated on a set of 280 fonts. Font-recognition accuracies of about 97 percent were reached on high-quality images; in addition, rates higher than 99.9 percent were obtained for weight and slope detection. Experiments also showed the system's robustness to document language and text content, and its sensitivity to text length.
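As a toy illustration of classifying fonts from global typographical features with a Bayes classifier (the feature choices and numbers below are invented, and scikit-learn's Gaussian naive Bayes stands in for the paper's multivariate classifier):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy feature matrix: one row per text block, columns standing for global
# typographical features such as x-height ratio, stroke density and slant
# angle. Both the feature choices and the numbers are invented.
X_train = np.array([[0.48, 0.12, 0.0],    # regular
                    [0.50, 0.25, 0.0],    # bold (denser strokes)
                    [0.47, 0.13, 12.0]])  # italic (noticeable slant)
y_train = ["Times-regular", "Times-bold", "Times-italic"]

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict([[0.50, 0.24, 0.0]])[0])  # -> "Times-bold"
```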

In the paper Large-Scale Visual Font Recognition by Guang Chen, Jianchao Yang, Hailin Jin, Jonathan Brandt, Eli Shechtman, Aseem Agarwala and Tony X. Han, the authors address the large-scale visual font recognition (VFR) problem, which aims at automatic identification of the typeface, weight and slope of the text in an image or photo without any knowledge of its content. Although visual font recognition has many practical applications, it has largely been neglected by the vision community. To address the VFR problem, they construct a large-scale dataset containing 2,420 font classes, which easily exceeds the scale of most image-categorization datasets in computer vision. As font recognition is inherently dynamic and open-ended, i.e., new classes and data for existing categories are constantly added to the database over time, they propose a scalable solution based on the nearest class mean classifier (NCM). The core algorithm is built on local feature embedding, local feature metric learning and max-margin template selection, which is naturally amenable to NCM and thus to such open-ended classification problems. The new algorithm can generalize to new classes and new data at little added cost. Extensive experiments demonstrate that the approach is very effective on synthetic test images and achieves promising results on real-world test images.
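A minimal nearest-class-mean sketch, ours rather than the paper's embedding-based variant, showing why adding classes or data is cheap:

```python
import numpy as np

class NearestClassMean:
    """Minimal NCM classifier: each class is summarized by its mean."""

    def __init__(self):
        self.means, self.counts = {}, {}

    def partial_fit(self, X, y):
        # Adding a new class or new data only updates one mean vector,
        # which is what makes NCM suit open-ended recognition.
        for x, label in zip(np.asarray(X, dtype=float), y):
            n = self.counts.get(label, 0)
            m = self.means.get(label, np.zeros_like(x))
            self.means[label] = (m * n + x) / (n + 1)
            self.counts[label] = n + 1
        return self

    def predict(self, X):
        labels = list(self.means)
        centers = np.stack([self.means[l] for l in labels])
        X = np.asarray(X, dtype=float)
        # Assign each sample to the class with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return [labels[i] for i in dists.argmin(axis=1)]

ncm = NearestClassMean().partial_fit([[0.0, 0.0], [1.0, 1.0]], ["serif", "sans"])
print(ncm.predict([[0.9, 0.8]]))  # -> ['sans']
```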

Chapter 3

Conclusion
The input to the proposed system is a captured image. Document borders are detected by computing the mean for every block and assigning it as the pixel intensity value, and are removed by performing a logical OR operation. Edge features of the ruling lines are extracted using horizontal and vertical filtering of the table. After the table is identified, each letter of the text is boxed and its attributes are recognized using character- and word-spacing techniques, logically dividing the text into upper, centre and lower zones; the font weight is detected from the pixel density, and the typeface by inspecting the upper and lower zones of the text. Font classification relies on the synthetic dataset described in the overview: 1,000 common English words, randomly cased, rendered across 2,420 font classes drawn from 447 typefaces and normalized with the “fg” prefix to a uniform height of 105 pixels, with text cropped from the input scaled to approximately the same size. Finally, the identified and extracted text is mapped to the actual positions at which it appeared visually in the input image.
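As a rough sketch of that spatial mapping, extracted word bounding boxes can be grouped into rows by vertical proximity and ordered left to right; the code below is our illustration, and the tolerance value is an assumption:

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) of one extracted word

def group_into_rows(boxes: List[Box], tol: int = 10) -> List[List[Box]]:
    """Group word boxes into table rows by vertical proximity, then
    order each row left to right. `tol` is the vertical merge tolerance
    in pixels."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        # A box joins the current row if its top edge is close enough.
        if rows and abs(box[1] - rows[-1][0][1]) <= tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[0]) for row in rows]

# Example: two words on one line, one word below.
print(group_into_rows([(120, 52, 40, 18), (10, 50, 60, 20), (12, 90, 55, 19)]))
```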