By: Ahmed Zioudi
ENSIT, 5 avenue Taha Hussein, Montfleury, 1008 Tunis. Website: http://www.ensit.tn. E-mail: direction.stage@ensit.rnu.tn
Tel: (+216) 71 49 68 96 / 71 49 68 80 / 71 39 25 91. Fax: (+216) 71 39 11 66
Dedications
To all my friends,
At the end of this modest work, I would like to express my deep gratitude to all those who gave me their support and helped me accomplish this project in the best conditions.

I would like to thank Mr. Ayad Abdelmajid, team lead at Axe Finance, for allowing me to live, within his team, an experience full of interest, for having guided me throughout my graduation project, and for sharing with me his broad experience. I could not have imagined having a better advisor for this work.

I also thank the Higher National Engineering School of Tunis (ENSIT) for its support and permanent encouragement.

Finally, I address my most devoted thanks to the members of the jury for having honored me by agreeing to evaluate this work, hoping that they will find in it the qualities of clarity and motivation that they expect. From the bottom of my heart: thank you!
General Introduction
Artificial Intelligence (AI) refers to the ability of machines to perform tasks that typically require human intelligence, such as learning, solving complex problems, and making decisions. AI has a wide range of applications across various industries and has the potential to transform the way we live and work.

In recent years, digital transformation has impacted almost every industry. The banking industry in particular has been reshaped by Artificial Intelligence: its capacity to identify and process data across diverse formats has transformed conventional workflows. The work described in this report explores the application of an AI model to table extraction and structure recognition in this context.

This report comprises four intricately structured chapters, each dedicated to a specific facet of table structure recognition. The first chapter presents our host company in depth and precisely delineates the project scope. The second chapter exposes the popular methodologies used in table recognition. The third chapter deals with data annotation and environment setup, and the final chapter undertakes an examination of modeling, evaluation, and deployment.
Chapter 1
1.1 Introduction
This chapter is devoted to presenting in detail the host organization, its characteristics, and the scope of our project.

Axe Finance, founded in 2004, is a global software vendor specializing in loan automation for financial institutions (including traditional and Islamic banking), with over 20,000 users in 20 countries seeking a competitive advantage in efficiency and customer service across all client segments: commercial, retail, corporate, and so on. Axe Finance has offices in Tunis, Amsterdam, Abu Dhabi, and Mumbai, among other places. The figure below depicts some of these locations.
CHAPTER 1. HOST COMPANY PRESENTATION AND PROJECT SCOPE
Axe Finance employs about 200 individuals across four departments, each of which is presented below (Figure 1.2).

The company's flagship solution is offered as a locally hosted deployment. Société Générale, Al Rajhi Bank, Banque
Internationale de Luxembourg, and First Abu Dhabi Bank are among Axe Finance's trusted clients. The solution covers all lending types throughout all of the bank's divisions, including Corporate & Commercial Lending, Treasury, Trade Finance, Specialized Lending, and others, and is also available as SaaS (the method of software delivery and licensing in which software is accessed online via a subscription). Large and medium-sized businesses, retail investment banks, public sector groups (PSGs), SMEs, non-bank financial institutions (NBFIs), and high-net-worth individuals are among the client segments served.

The collateral management module monitors the whole portfolio of collateral by sending out early alerts and notifications about coverage shortfalls, disposals, and other events. The complexity of collateral management is efficiently managed by a number of advanced business rules and adaptable processes across the financial institution and its subsidiaries. Credit Risk Analysts, Relationship Managers, Risk Managers, Credit Administrators, Collateral and Collection Officers, Legal Officers, Sustainability and Environmental Offices, and Portfolio Teams all benefit from Axe Retail Lending: it uses a single cooperative solution to streamline their tasks and build on their input, running all automated processes on a single platform of risk and credit data. Axe Collection and Provisioning supports the recovery and repair process by ensuring consistent data flow and streamlining processes. It is a low-cost, end-to-end solution that provides a high return on investment in the short, medium, and long term, and collects all of the information needed to compute expected losses and individual or group provisions for good, bad, and damaged assets.
1.2.2 Customers
The most well-known customers, those who trust the company's services, are depicted in the figure below.
The project is titled "Table Extraction and Structure Recognition." Its core objective is to detect tables in document images and extract their contents while preserving their structure.
The problem involves the detection, recognition, and precise retention of table contents,
along with maintaining the original layout and structure. This aims to enable the storage of
tabular data in editable formats like DOCX, Excel, and more. This problem holds significant importance for the automation of document processing. The task is commonly divided into three sub-tasks:
• Table Detection (TD): determines the position of the table in the image.
• Table Structure Recognition (TSR): identifies, reconstructs, and stores the relative positions of the cells in the table.
• Table Recognition (TR): similar to TSR, but also includes reading the information, recognizing the characters in the table, and mapping them accurately to each cell.

The following terms are used throughout this report:
• Table: the overall tabular region in the image.
• Grid: the smallest unit representing coordinates in a table, belonging to only one row and one column.
• Cell: larger than the grid; one cell can include multiple sub-grids (a span-cell) or only one grid (a single-cell). It contains location information and text content.
• Span-cell: a cell corresponding to many grids, i.e., a cell stretched across multiple rows or columns.
To better organize the structure and steps of our project, we used the CRISP method,
which is an agile and iterative process.[7] Let’s first clarify what an Agile method is. It is a
strategy that advocates for setting short-term objectives. Therefore, the project is divided
into several sub-projects. Once an objective is reached, we move on to the next one until
the final objective is reached. This method is more flexible. As it is impossible to predict
and anticipate everything, it allows for the acceptance of unexpected events and changes.
CRISP-DM (Cross-Industry Standard Process for Data Mining) is, as the name implies, an open standard process framework specifically for planning data mining projects. It is important to note that the process is highly non-linear, and moving back and forth between steps is the norm rather than the exception. It is divided into six major steps:
• Business Understanding: This first phase consists of understanding the business and determining what problems the company wishes to solve. According to data science project management, the most important tasks in this phase are: determine the business question and goal, and make a detailed plan for each project phase, including deliverables and milestones.

• Data Understanding: This phase focuses on identifying what we know about the data and how it relates to the business question. It may include the following steps: collecting the data, and attempting to describe the data you have at a high level.
• Data Preparation: Once the data and the business are clear and understood, it is
time to prepare the collected data for modeling. In this step, we will select the data
that will be used for analysis, then clean it up and pre-process it. Indeed, this is the
key to a great modeling process that will set your data science project apart even
more.
• Modeling: In this phase, we choose and apply various modeling techniques and algorithms. We then design our test plan by dividing the data into training, validation, and testing sets, and finally we define the technical success measures and select the best viable model(s) to answer the business question.
• Evaluation: In this phase, we further evaluate the model and review the steps taken to build it to ensure that it meets the business objectives. A key goal is to identify any significant business issues that have not been adequately addressed. A decision on how to use the data mining results should be made at the end of this phase.
• Deployment: This is the last step in the procedure. It entails putting the obtained models into production for the final users. Its goal is to present the results in an accessible and usable way.
1.5 Conclusion
In culmination, this chapter laid the foundation for our journey by introducing the host company and delineating the project scope. The company presentation provided a comprehensive overview of our collaborating entity, offering insight into its products, customers, and organization.
Chapter 2
2.1 Introduction
This chapter surveys the main methodologies employed for table structure recognition. Our aim is to provide an insightful analysis of these existing techniques, shedding light on their strengths, limitations, and comparative performance, in a way that aids in understanding the evolving landscape of table structure recognition. This comparison will guide our choice of model for the project.
2.2 LGPMA
LGPMA (Local and Global Pyramid Mask Alignment) is a segmentation-based model that performs pixel-level segmentation on detected objects; when applied to table recognition, it outputs a bounding box and a mask segment for each detected cell. The model consists of four main modules: Aligned Bounding-box Detection, LPMA (Local Pyramid Mask Alignment), GPMA (Global Pyramid Mask Alignment), and Aligned Bounding-box Refinement.
CHAPTER 2. CURRENT METHODS FOR TABLE RECOGNITION
• The Aligned Bounding-box Detection module uses features extracted by Region of Interest alignment (RoIAlign) to detect the aligned bounding boxes of
non-empty cells. However, empty cells cannot be easily recognized. Therefore, the authors of LGPMA propose the workflow below for the alignment and recovery of these empty cells.
• LPMA is applied at the scale of each single cell and consists of two sub-branches. The first performs binary segmentation to identify text regions. The second performs a local pyramid mask regression task: it creates masks with a descending gradient from the center of the text, defined for both a vertical mask and a horizontal mask. In Figure 2.2, (a) shows the original aligned bounding box (blue) and the text region box (red), while (b) shows the pyramid mask labels in the horizontal and vertical directions, respectively.
• GPMA consists of two parts: global binary segmentation and global pyramid mask regression. The first simply performs a binary segmentation to identify aligned non-empty cells and empty cells. The second identifies the entire set of non-empty cells with only two outputs: a global horizontal pyramid mask and a global vertical pyramid mask.
• The last module is Aligned Bounding-box Refinement, which refines the cell boundaries. It combines the local and global pyramid mask predictions of the previous modules to create a voting area. The process then applies cell matching to identify cells of the same row and same column, identifies the empty cells, and merges them together if needed.

Figure 2.3 visualizes an example that is successfully refined: (a) shows the aligned bounding boxes before refinement; (b) gives the LPMA output (horizontal); (c) displays the GPMA output (horizontal); (d) presents the global binary segmentation; (e) shows the final result.
2.3 Split-Embed-Merge
This model uses a divide-and-conquer approach: the authors divide the model into three sub-models.

• Split-model: using a segmentation approach, the output is two corresponding masks for the separated rows and columns.
• Embed-model: with the grid specified by the Split-model, it uses RoI-Align to extract the features of each grid area of the image (the Vision Module). At the same time, it uses a Text Module with BERT as a feature extractor to obtain additional text features. The two modules are then combined to form the input of the third sub-model,
• Merge-model: uses a GRU model with attention; at each timestep it outputs a merged map of dimension M×N (M×N being the grid size from the Split-model), indicating which grids need to be merged together at that timestep. The overall pipeline is sometimes denoted Deep-Split-Merge.
2.4 GraphTSR
In GraphTSR, the modeling part is designed in graph form and modeled with a Graph Neural Network (GNN).
Graph Neural Networks (GNNs): GNNs are a type of neural network architecture designed
to work with graph-structured data. In a graph, you have nodes (vertices) and edges
connecting them. GNNs are used to process and analyze data in this graph format, making
them suitable for tasks that involve relationships between entities, such as nodes and edges.
Figure 2.7 presents an overview of the method: (a) Preprocessing: obtaining cell contents
and their corresponding bounding box from the image; (b) Graph construction: building
an undirected graph on these cells; (c) Relation prediction: predicting adjacent relations by
our proposed GraphTSR; (d) Post-processing: recovering table structure from the labeled
graph.
Utilizing a Graph Neural Network (GNN), the network's nodes represent text-containing bounding boxes. For instance, text labels like "Method," "D1," "D2," "P," "R," and "F1" correspond to nodes in the graph, while edges represent relationships between nodes.

In the study, the Table Structure Recognition (TSR) issue is approached as an edge classification problem: given the graph's vertices and edges, the goal is to assign each edge one of three labels:

• Label "1": horizontal relationship, indicating a side-by-side connection between two cells in the same row.
• Label "2": vertical relationship, indicating a hierarchical link between two cells in the same column.
• Label "0": no direct adjacency relationship between the two cells.

By defining the relationship between each pair of cells (0/1/2) in this way, we can identify the span-cells and perform table restructuring as shown above.
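To make the three edge labels concrete, here is a small illustrative sketch (not GraphTSR itself): it labels the relation between two cells from bounding-box geometry alone, whereas GraphTSR predicts these labels with a GNN. All cell names and coordinates are made up for the example.

```python
# Illustrative sketch: cells are nodes carrying a bounding box
# [x1, y1, x2, y2]; an edge between two cells is labeled 1 (horizontal
# neighbours), 2 (vertical neighbours) or 0 (no adjacency), based on
# simple overlap of the boxes' vertical or horizontal spans.

def overlap(a1, a2, b1, b2):
    """Length of the overlap of intervals [a1, a2] and [b1, b2]."""
    return max(0, min(a2, b2) - max(a1, b1))

def edge_label(box_a, box_b):
    """Label the relation between two cell bounding boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    if overlap(ay1, ay2, by1, by2) > 0:
        return 1  # shared vertical span: same-row (horizontal) relation
    if overlap(ax1, ax2, bx1, bx2) > 0:
        return 2  # shared horizontal span: same-column (vertical) relation
    return 0      # no direct adjacency

# Three cells: A and B side by side, C below A.
cells = {"A": [0, 0, 50, 20], "B": [60, 0, 110, 20], "C": [0, 30, 50, 50]}
print(edge_label(cells["A"], cells["B"]))  # 1
print(edge_label(cells["A"], cells["C"]))  # 2
print(edge_label(cells["B"], cells["C"]))  # 0
```

A span-cell then shows up as a node connected horizontally (or vertically) to several cells at once.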
2.5 Evaluation Metric: TEDS

Tree edit distance is a measure of the similarity between two tree structures: it measures the minimum number of edit operations required to transform one tree into another. These edit operations typically include inserting, deleting, or substituting nodes in one tree to make it match the structure of the other tree.
The idea is to use the tree edit distance as a basis to determine how similar or dissimilar two tree structures are. Smaller tree edit distances indicate higher similarity, while larger distances indicate lower similarity. The procedure involves the following steps:

• Tree Representation: convert the input trees or hierarchical structures into representations appropriate for tree edit distance calculations; this representation typically encodes node labels and parent-child relationships.
• Tree Edit Distance Calculation: calculate the tree edit distance between the two tree structures. This involves finding the minimum sequence of edit operations (insertions, deletions, and substitutions) transforming one tree into the other.
• Similarity Calculation: convert the tree edit distance into a similarity score, typically by normalizing by the size of the larger tree.
• Interpretation: the resulting similarity score indicates how similar the two trees are; higher similarity scores indicate greater structural resemblance.
LGPMA, GraphTSR, and Split-Embed-Merge were put to the test in the ICDAR 2021 Competition and were evaluated using the TEDS metric with the following settings:
• Substitution Cost:
  – If both nodes are "td", the substitution cost depends on whether the column span or row span of the nodes differ. If they differ, the cost is 1; if the column span and row span of both nodes are the same, the substitution cost is 0.
• TEDS Calculation: The TEDS similarity between two trees (or tables) is calculated as

  TEDS(Ta, Tb) = 1 − EditDist(Ta, Tb) / max(|Ta|, |Tb|)

where EditDist is the tree edit distance between Ta and Tb, and |Ta| and |Tb| are the numbers of nodes in Ta and Tb. The overall score of a method is defined as the mean TEDS score between the recognition results produced by the
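The TEDS formula can be written directly as a small helper. Computing the tree edit distance itself (e.g. with the Zhang-Shasha algorithm) is omitted here, so `edit_dist` is assumed to be given; the sample sizes are illustrative.

```python
# A minimal sketch of the TEDS score given a precomputed tree edit
# distance; the names edit_dist, size_a, size_b are illustrative.

def teds(edit_dist, size_a, size_b):
    """TEDS(Ta, Tb) = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|)."""
    return 1.0 - edit_dist / max(size_a, size_b)

# A predicted table tree with 25 nodes, a ground truth with 20 nodes,
# and 5 edit operations needed to transform one into the other:
print(teds(5, 25, 20))  # 0.8
```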
The table displays the results of the different methods on three evaluation subsets: TEDS simple (tables with a simple structure), TEDS complex (tables with a complex structure), and TEDS all (the combination of the two previous subsets). Notably, LGPMA maintained its lead on the TEDS all subset with a score of 96.36%.
• The LGPMA method consistently performed well across all three subsets, making it a strong candidate for various table recognition applications. The Split-Embed-Merge method performed particularly well on complex tables.
• The GraphTSR method, while still achieving reasonable performance, had slightly lower scores than the other methods, particularly on the more complex subset.

In the table presented below, we delineate the strengths and limitations of the examined models:
The table presents an evaluation of three different methods for Tabular Structure Recogni-
tion: LGPMA, Split-Embed-Merge, and GraphTSR. These methods are assessed based
on their respective advantages and limitations, providing insights into their practical
applicability.[1]
LGPMA:
spanning cells, which are common in complex tables. - The availability of pretrained
21
CHAPTER 2. CURRENT METHODS FOR TABLE RECOGNITION
its recognition capabilities. - LGPMA’s notable achievement of holding the first place
• Limitations: - The method is designed for distributed training, which places substan-
Split-Embed-Merge:

• Advantages: Split-Embed-Merge achieves the highest TEDS score for complex table recognition, ranking among the top three in the ICDAR 2021 Competition.
• Limitations: It requires substantial training data and GPU resources.
GraphTSR:

• Advantages: It leverages graph models to capture relations between cells, improving predictions and insights.[9]
• Limitations: Graph neural networks can become computationally intensive and memory-hungry as the graph size increases, potentially limiting scalability. The risk of overfitting also exists, particularly if the dataset used for training is small.
In summary, LGPMA stands out in terms of empty and spanning cell detection. Split-Embed-Merge performs well on complex structures but requires substantial data and GPU resources. GraphTSR leverages graph models for improved predictions but may face scalability limits on large tables.
2.6 Conclusion
After this comparative study, a clear path has emerged. Our rigorous investigation has led us to the resolute decision of
adopting the LGPMA model for our table structure recognition task. The comprehensive
evaluation of existing techniques has affirmed the suitability of this model for our specific
objectives.
Chapter 3
3.1 Introduction
This chapter is dedicated to introducing our image dataset, aiming to provide a concise overview of the data through the generation of exploratory insights and visualizations. Additionally, we will delve into the process of data annotation, a crucial step to adapt the data to the format required by the model. Furthermore, we will present the configuration of the environment, detailing the hardware and software setup used throughout the project.
CHAPTER 3. DATA PROFICIENCY AND ENVIRONMENT SETUP: NAVIGATING
UNDERSTANDING, PREPARATION, AND CONFIGURATION
3.2 Data Understanding

Data understanding is the first step in the process of working with any form of data, including the data processed in computer vision tasks, and involves a systematic and in-depth examination of the dataset. Our dataset comprises 250 tabular images featuring a diverse array of tables, each possessing distinct structures. These images are sourced directly from authentic employee documents. During the curation process, the dataset was refined to 230 images. This reduction was necessitated by the exclusion of certain images that were deemed unsuitable for model training: these images exhibited notable noise and irregularities, such as dotted table lines or unclear words, making them less conducive to effective learning. The careful curation ensures that the dataset maintains higher quality and relevance, thereby enhancing the reliability and performance of the model during training and subsequent inference.
Data annotation, also referred to as data labeling, tagging, or classification, involves the essential task of assigning pertinent labels (such as tags, annotations, or classes) to individual data samples. This procedure holds significant influence over a model's performance. In the context of our project, image annotation was meticulously undertaken using the LabelImg tool. This process serves as the foundation for generating the training dataset, enabling supervised AI models to acquire knowledge. The manually annotated images establish a baseline dataset crucial for training the LGPMA model effectively. This preparatory step plays a pivotal role in facilitating the model's ability to comprehend and make predictions accurately.

The annotation process is a fundamental step in preparing our dataset for effective model training. In this process, each image is labeled using bounding boxes, where each bounding box represents an individual cell within the table, which is made practical with the help of the annotation tool.
Image annotation tools are software applications used for adding labels, shapes, text, or other metadata to images. These tools are commonly used in computer vision, machine learning, and data annotation tasks, and help in creating labeled datasets for training and testing machine learning models. In our project, we used LabelImg, a free, open-source tool that allows us to draw bounding boxes around objects in images. It is commonly used for object detection tasks and supports both the PASCAL VOC and YOLO annotation formats.
This approach allows us to accurately map the tabular structure present in the images. The resulting annotations are compiled into XML files, which encapsulate essential information including the height, width, and coordinates of each bounding box. This structured XML representation forms the cornerstone of our annotated dataset. Subsequently, these XML annotations are transformed into the format required by the model. Figure 3.5 shows the structure of the JSON file used as input for the model:
– bboxes: A list of coordinates outlining the text area within a cell. The for-
mat employed is [x1, y1, x2, y2], representing the upper-left and lower-right
coordinates.
– cells: A list indicating the row and column information for each cell. The format is [Start Row Index, Start Column Index, End Row Index, End Column Index].
– labels: A list of labels for each cell, with a value of 0 indicating a header cell and 1 indicating a body cell.

It is important to pay attention to the following nuances in the above data format:

• For cells with no text area, the bboxes list should be represented as an empty list.
• The row and column indexes in the cells parameter begin from 0.
• It is imperative to maintain the order of the three lists (bboxes, cells, and labels) consistent, so that entries at the same index describe the same cell.
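As a hedged sketch of the conversion step, the snippet below parses the kind of PASCAL VOC XML that LabelImg produces and assembles a dictionary in the spirit of the format above. The `cells` and `labels` entries are shown as placeholders, since they come from the table layout rather than from the XML alone, and the sample XML is made up.

```python
# Sketch: extract [x1, y1, x2, y2] boxes from a VOC-style annotation
# and assemble a dict matching the bboxes/cells/labels layout.
import xml.etree.ElementTree as ET

def voc_to_bboxes(xml_text):
    """Collect the bndbox coordinates of every annotated object."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        b = obj.find("bndbox")
        boxes.append([int(b.find(t).text)
                      for t in ("xmin", "ymin", "xmax", "ymax")])
    return boxes

sample = """<annotation>
  <object><name>cell</name>
    <bndbox><xmin>10</xmin><ymin>5</ymin><xmax>80</xmax><ymax>25</ymax></bndbox>
  </object>
</annotation>"""

annotation = {
    "bboxes": voc_to_bboxes(sample),
    "cells": [[0, 0, 0, 0]],  # placeholder row/column spans
    "labels": [0],            # placeholder: 0 = header cell
}
print(annotation["bboxes"])  # [[10, 5, 80, 25]]
```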
3.3 Environment Setup

For training, we moved to a workstation equipped with hardware capabilities that align with our computational requirements. This new setup boasts 32 GB of RAM, coupled with a powerful GP102 GPU, specifically the GeForce GTX 1080 Ti, featuring 11 GB of dedicated graphics memory. This shift ensures that our computational infrastructure is robust enough to handle the complexities and scale of the training workload.
3.3.1 Software
In this section, we will discuss the different components of the software environment used in our project.

• Python: invented in 1991 by Guido van Rossum and developed by the Python Software Foundation. The majority of data scientists rank it as their preferred programming language because it provides a large collection of open-source libraries that help to easily solve complex business problems, build robust systems, and develop applications such as
table structure recognition systems.

• Visual Studio Code: a powerful code editor that runs on any operating system (Windows, Linux, macOS). It comes with built-in support for JavaScript, TypeScript, and Node.js, and has a rich ecosystem of extensions for other languages.
• Google Colaboratory, or Colab: a free Google tool for developing data science notebooks. Code is executed on Google's cloud servers, allowing the user to leverage backend hardware like GPUs and TPUs.

• Jupyter: an open-source interactive computing environment, particularly popular in the field of data science and scientific computing. The key component of Jupyter is the Jupyter Notebook, which allows users to create and share documents that contain live code, equations, visualizations, and narrative text. These notebooks are a versatile tool for data analysis, scientific research, machine learning, and more. Users can write and execute code in a notebook, see the results immediately, and document their work in a coherent and interactive manner. Jupyter supports a wide range of programming languages beyond the original three, using language-specific kernels to execute the code within the notebooks. This flexibility makes Jupyter an invaluable tool for researchers, data scientists, and educators working with various programming languages.
• PyTorch: an open-source deep learning framework developed by Facebook's AI Research lab (FAIR). It provides a flexible and dynamic computational graph, making it widely adopted in both research and production for various machine learning tasks. Main features:
  – Dynamic Computation Graph: PyTorch builds the graph on the fly, which allows for more flexible and intuitive model design and debugging.
  – GPU Support: it has native support for GPUs, making it efficient for training large models.
  – Ecosystem: PyTorch has a rich ecosystem of libraries and tools, including torchvision for computer vision tasks and PyTorch Lightning for streamlining the training process.
3.3.2 Libraries
In this section, we list the packages incorporated in our project.

• OpenCV: an open-source library for computer vision and image processing. It provides more than 2,500 computer vision algorithms that can be used to process images, primarily based on mathematical operations over pixel matrices.

• spaCy: a Natural Language Processing (NLP) library with many built-in features such as NER, POS tagging, dependency analysis, entity linking, and more. Because of its cutting-edge speed and rigorously tested accuracy, it is becoming increasingly popular for NLP processing and analysis.

• TensorFlow: an open-source symbolic mathematical library used for various tasks, including machine learning applications such as neural networks. It was created by the "Google Brain" team.

• MMCV: a foundational library providing image processing utilities, model architecture components, and evaluation tools specifically designed for computer vision tasks.
  – Use Cases: MMCV is commonly used in computer vision research and applications such as detection, segmentation, and more.

• MMDetection: an open-source toolbox for object detection, instance segmentation, and other related computer vision tasks. It is built on top of the PyTorch deep learning framework, and users can easily configure and extend the framework to suit their specific needs.[2] It offers state-of-the-art, pre-implemented object detection and instance segmentation models that can be easily used for various tasks, and provides utilities for training, data augmentation, anchor generation, and more, which are crucial for building detection pipelines.
3.4 Conclusion

In this chapter, we presented our dataset, the annotation process used to prepare it, and the hardware and software environment configured for training.
Chapter 4
4.1 Introduction
In this chapter, we will address the principles of modeling, evaluation, and deployment
in the field of data science, providing a scientific analysis of these essential processes in our
project.
4.2 Modeling
The model configuration is organized as Python dictionaries. 1. The lgpma_base.py file configures the model training parameters, backbone, and neck, among which:
CHAPTER 4. MODELING, EVALUATION, AND DEPLOYMENT
• batch size: the number of samples processed each time the model is updated.
• backbone: the part of the network that performs feature extraction on the input data and transforms it into a certain representation.
• neck: a set of parameters that includes, among others, the numbers of input and output channels. In our case, we chose to process grayscale images, which output only a single (black) channel.
2. In the lgpma_pub.py file, we configure the training data path, model storage path, and log storage path.
Figure 4.2: training data path, model storage path, and log storage path
3. In the lgpma_pub.py file, we can also configure the number of epochs as well as the number of GPUs used for training.

Figures 4.2 and 4.3 represent our configuration of the model.
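The two configuration files (lgpma_base.py and lgpma_pub.py) hold dictionary-style settings. The fragment below is a hedged sketch in the spirit of an MMDetection-style config; the exact keys in the real files differ, so treat all names and values here as illustrative only.

```python
# Illustrative, MMDetection-style configuration sketch (not the actual
# lgpma_base.py / lgpma_pub.py contents).
data = dict(
    samples_per_gpu=2,                                  # batch size per GPU
    train=dict(ann_file="train.json", img_prefix="imgs/"),  # training data path
)
model = dict(
    backbone=dict(type="ResNet", depth=50),             # feature extractor
    neck=dict(in_channels=[256, 512, 1024, 2048], out_channels=256),
)
runner = dict(type="EpochBasedRunner", max_epochs=12)   # epoch number
work_dir = "./work_dirs/lgpma"                          # model/log storage path
```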
4.2.2 LPMA
For the pyramid mask regression, we assign the pixels in the proposal bounding box regions a soft label in both the horizontal and vertical directions, as illustrated below. The middle point of the text has the largest regression target, "1", which is the darkest level. Specifically, we assume the proposed aligned bounding box has shape H × W. The top-left and bottom-right points of the text region are denoted (x1, y1) and (x2, y2), respectively, where 0 < x1 < x2 ≤ W and 0 < y1 < y2 ≤ H. Therefore, the target of the pyramid mask has shape 2 × H × W with values in [0, 1], in which the two channels represent the target maps of the horizontal mask and the vertical mask, respectively. For every pixel (h, w), these two values ramp linearly from 0 at the box border to 1 at the midpoint of the text region.
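The soft-label scheme can be sketched as follows. The linear-ramp formula is our reading of the description above (an assumption, not the paper's exact implementation), and the sizes are illustrative.

```python
# Sketch of the 2 x H x W pyramid soft labels: each target ramps
# linearly from 0 at the aligned-box border to 1 at the midpoint of
# the text region (assumed interpretation of the LGPMA labels).

def pyramid_targets(H, W, x1, y1, x2, y2):
    """Return [horizontal, vertical] masks as nested lists."""
    x_mid, y_mid = (x1 + x2) / 2, (y1 + y2) / 2
    hor = [[w / x_mid if w <= x_mid else (W - w) / (W - x_mid)
            for w in range(W)] for _ in range(H)]
    ver = [[h / y_mid if h <= y_mid else (H - h) / (H - y_mid)
            for _ in range(W)] for h in range(H)]
    return [hor, ver]

hor, ver = pyramid_targets(H=10, W=20, x1=4, y1=2, x2=16, y2=8)
print(hor[0][10])  # 1.0 at the horizontal text midpoint (x_mid = 10)
print(hor[0][0])   # 0.0 at the left border
print(ver[5][0])   # 1.0 at the vertical text midpoint (y_mid = 5)
```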
4.2.3 GPMA
Although LPMA allows the predicted mask to break through the proposal bounding boxes, the local region's receptive fields are limited. To determine the accurate coverage area of a cell, the global feature might also provide some visual clues: inspired by prior work, learning the offsets of each pixel from a global view could help locate more accurate boundaries. However, bounding boxes at the cell level can vary widely in width-height ratio, which leads to an imbalance problem in regression learning. Therefore, we use the pyramid labels as the regression targets for each pixel, an approach named Global Pyramid Mask Alignment (GPMA).

The ground truth for empty cells is generated according to the maximum height/width of the non-empty cells in the same row/column. Only this task learns the empty-cell division information, since empty cells have no visible text texture, which might otherwise influence the region proposal network to some extent. We want the model to capture the most reasonable cell division pattern during the global boundary segmentation, according to human reading habits, as reflected by the manually labeled annotations. For the global pyramid mask regression, since only the text region can provide information about distinct cells, all non-empty cells are assigned the soft labels, and the ground truths of the aligned bounding boxes are generated accordingly.
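The empty-cell rule stated above can be sketched as a small helper, assuming cells are given as (x1, y1, x2, y2) boxes grouped by row and column; all coordinates are illustrative.

```python
# Simplified sketch of the empty-cell ground-truth rule: an empty cell
# inherits the maximum height of the non-empty cells in its row and the
# maximum width of the non-empty cells in its column.

def empty_cell_size(row_boxes, col_boxes):
    """(width, height) for an empty cell, from its row and column peers."""
    max_h = max(y2 - y1 for (x1, y1, x2, y2) in row_boxes)
    max_w = max(x2 - x1 for (x1, y1, x2, y2) in col_boxes)
    return max_w, max_h

row = [(0, 0, 40, 20), (50, 0, 90, 25)]        # non-empty cells in the same row
col = [(100, 0, 160, 20), (100, 30, 150, 50)]  # non-empty cells in the same column
print(empty_cell_size(row, col))  # (60, 25)
```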
4.3 Evaluation
Intersection over Union (IoU) is a metric commonly used in computer vision and image segmentation to evaluate the quality of a detection or segmentation algorithm. It measures the overlap between the predicted and ground-truth regions in an image. IoU is particularly useful for tasks where you need to assess how well a model's predictions align with the actual objects:

  IoU = Intersection Area / Union Area
• Precision: Precision measures how accurate the positive predictions of your model are. In the context of object detection or segmentation with IoU, precision is the ratio of true positives (correctly predicted object instances with IoU above a certain threshold) to the total number of positive predictions (both true positives and false positives):

  Precision = True Positives / (True Positives + False Positives)
• Recall:
Recall, also known as sensitivity or true positive rate, measures the model’s ability
to identify all the relevant positive instances in the dataset. In the context of IoU,
recall is the ratio of true positives to the total number of actual positive instances.
  Recall = True Positives / (True Positives + False Negatives)
• F1 score :
The F1 score is a metric commonly used in binary classification tasks to measure the
model's accuracy in terms of both precision and recall. It is the harmonic mean of the two:

  F1 = 2 · Precision · Recall / (Precision + Recall)
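The metrics above can be sketched in a few lines. The box coordinates and counts below are illustrative, not our experimental data; boxes use the [x1, y1, x2, y2] format.

```python
# Sketch of IoU between two axis-aligned boxes, plus precision, recall
# and F1 from true/false positive and false negative counts.

def iou(a, b):
    """Intersection over Union of two boxes [x1, y1, x2, y2]."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(iou([0, 0, 10, 10], [5, 0, 15, 10]))  # 0.333... (50 / 150)
# Illustrative counts giving scores close to ours:
# precision ≈ 0.795, recall = 0.70, F1 ≈ 0.745
print(precision_recall_f1(tp=70, fp=18, fn=30))
```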
For a table structure recognition task, the performance metrics can be interpreted as
follows:
• F1 Score (0.74):
  – An F1 score of 0.74 indicates a reasonable balance between precision and recall.
• Recall (0.70):
  – A recall of 0.70 signifies that the model correctly identifies 70% of the actual tables present in the images.
• Precision (0.79):
  – A precision of 0.79 implies that when the model predicts a table, it is correct 79% of the time.

In summary, these performance metrics suggest that the model is reasonably effective at detecting the tables (recall of 0.70) while maintaining a high level of accuracy in its predictions (precision of 0.79). However, the specific interpretation may depend on the requirements and objectives of the table structure recognition task and on whether certain trade-offs between precision and recall are acceptable in the given application. The obtained results might be more accurate with a bigger dataset; ours is considered very low volume for this type of model.
4.4 Deployment
FastAPI is a modern, fast (high-performance) web framework for building APIs with Python. It is designed to be easy to use while also being highly efficient, and it has gained popularity for its simplicity and performance, making it an excellent choice for building web services. Here are some key features and concepts associated with FastAPI:
• Automatic documentation: FastAPI automatically generates interactive API
documentation using the OpenAPI standard. You can access this documentation
through a web browser, making it easy for developers to understand and test your
API.
• Asynchronous support: FastAPI natively supports Python's async and await
syntax. This allows you to write non-blocking, high-performance code.
• File Uploads: It supports handling file uploads from clients with ease.
To obtain the model's output, the input typically involves sending the image data in
base64 format as part of an HTTP request. Once received, FastAPI can decode this input
and process it. In our case, the JSON output comprises three key components. First, it
includes HTML representing the structure of the table, outlining its rows, columns, and
cells. Second, it contains the coordinates of these cells, providing information about
their precise positioning within the image. Finally, the third component consists of
the content extracted from each cell, enabling users to access the actual data within the
recognized table. By organizing and presenting these three parts within a structured JSON
response, our model is deployed effectively. Figures 4.5 and 4.6 present the input and
output of the API.
4.5 Conclusion
In conclusion, the evaluation of our modeling efforts in this chapter has yielded highly
positive results, aligning closely with the objectives and criteria we set out to achieve. The
comprehensive analysis and assessment of our model have provided valuable insights into
its strengths and limitations.
Throughout this chapter, we have systematically examined various aspects of the model,
ranging from its accuracy and precision to its ability to generalize beyond the training
data. We have also considered its computational efficiency, scalability, and robustness in
real-world scenarios.
The results obtained indicate that our model has met the predefined expectations, and we
consider it ready to successfully address the specific tasks and challenges we set out to
tackle.
General Conclusion
This report presents an end-of-study project carried out at Axe Finance in order to obtain
the national diploma of computer science engineering of the Higher National Engineering
School of Tunis (ENSIT); it aims to implement the LGPMA table structure recognition model.
The initial chapter laid the foundation with a detailed presentation of the host company
and a concise problem definition, establishing the need for an effective solution.
Recognizing the significance of choosing the right model, we evaluated various options and
ultimately selected LGPMA. This decision was driven by its track record of delivering
excellent results in the ICDAR 2021 competition and its remarkable capability to
handle complicated table structures.
Following this, we carried out data annotation and environment setup, crucial steps in
preparing our data and creating a conducive modeling environment. This approach
ensured alignment with our objectives and the compatibility of our dataset with the
LGPMA model. We then moved into the modeling phase, implementing the LGPMA
model and evaluating its performance. The model consistently met the predefined
expectations, affirming the effectiveness of this approach, although it is worth noting that
a larger dataset would likely improve the results further.
Finally, we successfully deployed our solution using FastAPI, making it accessible
through a simple HTTP interface.
Bibliography
[1] Chris Tensmeyer, Vlad I. Morariu, Brian Price, Scott Cohen, and Tony Martinez.
Deep splitting and merging for table structure decomposition. Adobe Research, San Jose,
USA, 2019.
[3] Python Software Foundation. Python documentation, 2021. [Accessed April–June 2023].
[5] Liang Qiao, Zaisheng Li, Zhanzhan Cheng, Peng Zhang, Shiliang Pu, Yi Niu, Wenqi
Ren, Wenming Tan, and Fei Wu. LGPMA: Complicated table structure recognition with
local and global pyramid mask alignment. In ICDAR, 2021.
[9] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling
Mao. Complicated table structure recognition. arXiv preprint, 13 Aug 2019.
Abstract
This report presents an end-of-study project carried out at Axe Finance in order to obtain the national
diploma of computer science engineering of the Higher National Engineering School of Tunis (ENSIT),
and aims to implement the LGPMA table structure recognition model. The results obtained after
systematic work spanning data annotation, environment setup, modeling, evaluation, and deployment
indicate that our model has met the predefined expectations.
Résumé
(French abstract, translated:) This report presents an end-of-study project carried out at Axe Finance
with a view to obtaining the national computer engineering diploma of the Higher National Engineering
School of Tunis, and aims to implement the LGPMA table and structure recognition model. The results
obtained after systematic work on data annotation, environment setup, modeling, evaluation, and
deployment indicate that our model has met the predefined expectations.
الملخص
(Arabic abstract, translated:) This report represents an end-of-study project carried out in order to
obtain the national diploma in computer science engineering of the Higher National Engineering School
of Tunis, and aims to implement a table and structure recognition model. The results obtained after
systematic work, from data annotation and environment setup to modeling, evaluation, and deployment,
indicate that our model has met the predefined expectations.