By: Ahmed Zioudi
ENSIT, 5 avenue Taha Hussein, Montfleury, 1008 Tunis. Website: http://www.ensit.tn. E-mail: direction.stage@ensit.rnu.tn
Tel: (+216) 71 49 68 96 / 71 49 68 80 / 71 39 25 91. Fax: (+216) 71 39 11 66
Dedications
To all my friends,
At the end of this modest work, I would like to express my deep gratitude to all those who gave me their support and helped me accomplish this project in the best conditions.

I would like to thank Mr. Ayad Abdelmajid, team lead at Axe Finance, for allowing me to live, within his team, an experience full of interest, for having guided me throughout my graduation project, and for sharing with me his broad experience. I could not have imagined having a better advisor for this work.

I also thank the Higher National Engineering School of Tunis (ENSIT) for its support and permanent encouragement.

Finally, I address my most devoted thanks to the members of the jury for having honored me by agreeing to evaluate this work, hoping that they will find in it the qualities of clarity and motivation that they expect. From the bottom of my heart: thank you!
General Introduction
Artificial Intelligence (AI) refers to the ability of machines to perform tasks that typically require human intelligence, such as learning, solving complex problems, and making decisions. AI has a wide range of applications across various industries and has the potential to transform the way we live and work.

In recent years, digital transformation has impacted almost every industry. The banking industry in particular has been reshaped by Artificial Intelligence: its capacity to identify and process data across diverse formats has transformed conventional workflows. The work described in this report explores the application of an AI model to table extraction and structure recognition in this context.

This report comprises four intricately structured chapters, each dedicated to a specific facet of table structure recognition. The first chapter presents our host company in depth and precisely delineates the project scope. The second chapter exposes the popular methodologies used in table recognition. The third chapter deals with data annotation and environment setup, and the final chapter undertakes an examination of modeling, evaluation, and deployment.
Chapter 1
1.1 Introduction
This chapter is devoted to presenting in detail the host organization, its characteristics, and the scope of our project.

Axe Finance, founded in 2004, is a global software vendor specializing in loan automation for financial institutions (including traditional and Islamic banking), with over 20,000 users in 20 countries seeking a competitive advantage in efficiency and customer service across all client segments: commercial, retail, corporate, and so on. Axe Finance has offices in Tunis, Amsterdam, Abu Dhabi, and Mumbai, among other places. The figure below depicts some of these locations.
CHAPTER 1. HOST COMPANY PRESENTATION AND PROJECT SCOPE
Axe Finance employs about 200 individuals across four departments, each of which is presented below (Figure 1.2).

The company's flagship solution is offered as a locally hosted deployment. Société Générale, Al Rajhi Bank, Banque
Internationale de Luxembourg, and First Abu Dhabi Bank are among Axe Finance's trusted clients. The solution covers all lending types throughout all of the bank's divisions, including Corporate & Commercial Lending, Treasury, Trade Finance, Specialized Lending, and others, and is also available as SaaS (the method of software delivery and licensing in which software is accessed online via a subscription). Large and medium-sized businesses, retail investment banks, public sector groups (PSGs), SMEs, non-bank financial institutions (NBFIs), and high-net-worth individuals are among the client segments served.

The collateral management module monitors the whole portfolio of collateral by sending out early alerts and notifications about coverage shortfalls, disposals, and other events. The complexity of collateral management is efficiently managed by a number of advanced business rules and adaptable processes across the financial institution and its subsidiaries. Credit Risk Analysts, Relationship Managers, Risk Managers, Credit Administrators, Collateral and Collection Officers, Legal Officers, Sustainability and Environmental Offices, and Portfolio Teams all benefit from Axe Retail Lending: it uses a single cooperative solution to streamline their tasks and build on their input, running all automated processes on a single platform of risk and credit data. Axe Collection and Provisioning supports the recovery and repair process by ensuring consistent data flow and streamlining processes. It is a low-cost, end-to-end solution that provides a high return on investment in the short, medium, and long term, and collects all of the information needed to compute expected losses and individual or group provisions for good, bad, and damaged assets.
1.2.2 Customers
The most well-known customers, those who trust the company's services, are depicted in the figure below.
The project is titled "Table Extraction and Structure Recognition." Its core objective is to detect tables in document images and extract their contents while preserving their structure.
The problem involves the detection, recognition, and precise retention of table contents,
along with maintaining the original layout and structure. This aims to enable the storage of
tabular data in editable formats like DOCX, Excel, and more. This problem holds significant importance for the automation of document processing. The task is commonly divided into three sub-tasks:
• Table Detection (TD): determines the position of the table in the image.
• Table Structure Recognition (TSR): identifies, reconstructs, and stores the relative positions of the cells in the table.
• Table Recognition (TR): similar to TSR, but also includes reading the information, recognizing the characters in the table, and mapping them accurately to each cell.

The following terms are used throughout this report:
• Table: the overall tabular region in the image.
• Grid: the smallest unit representing coordinates in a table, belonging to only one row and one column.
• Cell: larger than the grid; one cell can include multiple sub-grids (a span-cell) or only one grid (a single-cell). It contains location information and text content.
• Span-cell: a cell corresponding to many grids, i.e., a cell stretched across multiple rows or columns.
To better organize the structure and steps of our project, we used the CRISP method,
which is an agile and iterative process.[7] Let’s first clarify what an Agile method is. It is a
strategy that advocates for setting short-term objectives. Therefore, the project is divided
into several sub-projects. Once an objective is reached, we move on to the next one until
the final objective is reached. This method is more flexible. As it is impossible to predict
and anticipate everything, it allows for the acceptance of unexpected events and changes.
CRISP-DM (Cross-Industry Standard Process for Data Mining) is, as the name implies, an open standard process framework specifically for planning data mining projects. It is important to note that the process is highly non-linear, and moving back and forth between steps is the norm rather than the exception. It is divided into six major steps:
• Business Understanding: This first phase consists of understanding the business and determining what problems the company wishes to solve. According to data science project management, the most important tasks in this phase are: determine the business question and goal, and make a detailed plan for each project phase, including deliverables and milestones.

• Data Understanding: This phase focuses on identifying what we know about the data and how it relates to the business question. It may include the following steps: collecting the data, and attempting to describe the data you have at a high level.
• Data Preparation: Once the data and the business are clear and understood, it is
time to prepare the collected data for modeling. In this step, we will select the data
that will be used for analysis, then clean it up and pre-process it. Indeed, this is the
key to a great modeling process that will set your data science project apart even
more.
• Modeling: In this phase, we choose and apply various modeling techniques and algorithms. We then design our test plan by dividing the data into training, validation, and testing sets, and finally we define the technical success measures and select the best viable model(s) to answer the business question.
• Evaluation: In this phase, we further evaluate the model and review the steps taken to build it to ensure that it meets the business objectives. A key goal is to identify any significant business issues that have not been adequately addressed. A decision on how to use the data mining results should be made at the end of this phase.
• Deployment: This is the last step in the procedure. It entails putting the obtained models into production for the final users. Its goal is to present the results in an accessible and usable way.
1.5 Conclusion
In culmination, this chapter laid the foundation for our journey by introducing the host company and delineating the project scope. The company presentation provided a comprehensive overview of our collaborating entity, offering insight into its products, customers, and organization.
Chapter 2
2.1 Introduction
This chapter surveys the main methodologies employed for table structure recognition. Our aim is to provide an insightful analysis of these existing techniques, shedding light on their strengths, limitations, and comparative performance, in a way that aids in understanding the evolving landscape of table structure recognition. This comparison will guide our choice of model for the project.
2.2 LGPMA
LGPMA (Local and Global Pyramid Mask Alignment) is a segmentation-based model that performs pixel-level segmentation on detected objects; when applied to table recognition, it outputs a bounding box and a mask segment for each detected cell. The model consists of four main modules: Aligned Bounding-box Detection, LPMA (Local Pyramid Mask Alignment), GPMA (Global Pyramid Mask Alignment), and Aligned Bounding-box Refinement.
CHAPTER 2. CURRENT METHODS FOR TABLE RECOGNITION
• The Aligned Bounding-box Detection module uses features extracted by Region of Interest alignment (RoIAlign) to detect the aligned bounding boxes of
non-empty cells. However, empty cells cannot be easily recognized. Therefore, the authors of LGPMA propose the workflow below for the alignment and recovery of these empty cells.
• LPMA is applied at the scale of each single cell and consists of two sub-branches. The first performs binary segmentation to identify text regions. The second performs a local pyramid mask regression task: it creates masks with a descending gradient from the center of the text, defined for both a vertical mask and a horizontal mask. In Figure 2.2, (a) shows the original aligned bounding box (blue) and the text region box (red), while (b) shows the pyramid mask labels in the horizontal and vertical directions, respectively.
• GPMA consists of two parts: global binary segmentation and global pyramid mask regression. The first simply performs a binary segmentation to identify aligned non-empty cells and empty cells. The second identifies the entire set of non-empty cells with only two outputs: a global horizontal pyramid mask and a global vertical pyramid mask.
• The last module is Aligned Bounding-box Refinement, which refines the cell boundaries. It combines the local and global pyramid mask predictions of the previous modules to create a voting area. The process then applies cell matching to identify cells of the same row and same column, identifies the empty cells, and merges them together if needed.

Figure 2.3 visualizes an example that is successfully refined: (a) shows the aligned bounding boxes before refinement; (b) gives the LPMA output (horizontal); (c) displays the GPMA output (horizontal); (d) presents the global binary segmentation; (e) shows the final result.
2.3 Split-Embed-Merge
This model uses a divide-and-conquer approach: the authors divide the model into three sub-models.

• Split-model: using a segmentation approach, the output is two corresponding masks for the separated rows and columns.
• Embed-model: with the grid specified by the Split-model, it uses RoI-Align to extract the features of each grid area of the image (the Vision Module). At the same time, it uses a Text Module with BERT as a feature extractor to obtain additional text features. The two modules are then combined to form the input of the third sub-model,
• Merge-model: uses a GRU model with attention; at each timestep it outputs a merged map of dimension M×N (M×N being the grid size from the Split-model), indicating which grids need to be merged together at that timestep. The overall pipeline is sometimes denoted Deep-Split-Merge.
2.4 GraphTSR
In GraphTSR, the modeling part is designed in graph form and modeled with a Graph Neural Network (GNN).
Graph Neural Networks (GNNs): GNNs are a type of neural network architecture designed
to work with graph-structured data. In a graph, you have nodes (vertices) and edges
connecting them. GNNs are used to process and analyze data in this graph format, making
them suitable for tasks that involve relationships between entities, such as nodes and edges.
Figure 2.7 presents an overview of the method: (a) Preprocessing: obtaining cell contents
and their corresponding bounding box from the image; (b) Graph construction: building
an undirected graph on these cells; (c) Relation prediction: predicting adjacent relations by
our proposed GraphTSR; (d) Post-processing: recovering table structure from the labeled
graph.
Utilizing a Graph Neural Network (GNN), the network's nodes represent text-containing bounding boxes. For instance, text labels like "Method," "D1," "D2," "P," "R," and "F1" correspond to nodes in the graph, while edges represent relationships between nodes.

In the study, the Table Structure Recognition (TSR) issue is approached as an edge classification problem: given the graph's vertices and edges, the goal is to assign each edge one of three labels:

• Label "1": horizontal relationship, indicating a side-by-side connection between two cells in the same row.
• Label "2": vertical relationship, indicating a hierarchical link between two cells in the same column.
• Label "0": no direct adjacency relationship between the two cells.

By defining the relationship between each pair of cells (0/1/2) in this way, we can identify the span-cells and perform table restructuring as shown above.
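To make the three edge labels concrete, here is a small illustrative sketch (not GraphTSR itself): it labels the relation between two cells from bounding-box geometry alone, whereas GraphTSR predicts these labels with a GNN. All cell names and coordinates are made up for the example.

```python
# Illustrative sketch: cells are nodes carrying a bounding box
# [x1, y1, x2, y2]; an edge between two cells is labeled 1 (horizontal
# neighbours), 2 (vertical neighbours) or 0 (no adjacency), based on
# simple overlap of the boxes' vertical or horizontal spans.

def overlap(a1, a2, b1, b2):
    """Length of the overlap of intervals [a1, a2] and [b1, b2]."""
    return max(0, min(a2, b2) - max(a1, b1))

def edge_label(box_a, box_b):
    """Label the relation between two cell bounding boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    if overlap(ay1, ay2, by1, by2) > 0:
        return 1  # shared vertical span: same-row (horizontal) relation
    if overlap(ax1, ax2, bx1, bx2) > 0:
        return 2  # shared horizontal span: same-column (vertical) relation
    return 0      # no direct adjacency

# Three cells: A and B side by side, C below A.
cells = {"A": [0, 0, 50, 20], "B": [60, 0, 110, 20], "C": [0, 30, 50, 50]}
print(edge_label(cells["A"], cells["B"]))  # 1
print(edge_label(cells["A"], cells["C"]))  # 2
print(edge_label(cells["B"], cells["C"]))  # 0
```

A span-cell then shows up as a node connected horizontally (or vertically) to several cells at once.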
2.5 Evaluation Metric: TEDS

Tree edit distance is a measure of the similarity between two tree structures: it measures the minimum number of edit operations required to transform one tree into another. These edit operations typically include inserting, deleting, or substituting nodes in one tree to make it match the structure of the other tree.
The idea is to use the tree edit distance as a basis to determine how similar or dissimilar two tree structures are. Smaller tree edit distances indicate higher similarity, while larger distances indicate lower similarity. The procedure involves the following steps:

• Tree Representation: convert the input trees or hierarchical structures into representations appropriate for tree edit distance calculations; this representation typically encodes node labels and parent-child relationships.
• Tree Edit Distance Calculation: calculate the tree edit distance between the two tree structures. This involves finding the minimum sequence of edit operations (insertions, deletions, and substitutions) transforming one tree into the other.
• Similarity Calculation: convert the tree edit distance into a similarity score, typically by normalizing by the size of the larger tree.
• Interpretation: the resulting similarity score indicates how similar the two trees are; higher similarity scores indicate greater structural resemblance.
LGPMA, GraphTSR, and Split-Embed-Merge were put to the test in the ICDAR 2021 Competition and were evaluated using the TEDS metric with the following settings:
• Substitution Cost:
  – If both nodes are "td", the substitution cost depends on whether the column span or row span of the nodes differ. If they differ, the cost is 1; if the column span and row span of both nodes are the same, the substitution cost is 0.
• TEDS Calculation: The TEDS similarity between two trees (or tables) is calculated as

  TEDS(Ta, Tb) = 1 − EditDist(Ta, Tb) / max(|Ta|, |Tb|)

where EditDist is the tree edit distance between Ta and Tb, and |Ta| and |Tb| are the numbers of nodes in Ta and Tb. The overall score of a method is defined as the mean TEDS score between the recognition results produced by the
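The TEDS formula can be written directly as a small helper. Computing the tree edit distance itself (e.g. with the Zhang-Shasha algorithm) is omitted here, so `edit_dist` is assumed to be given; the sample sizes are illustrative.

```python
# A minimal sketch of the TEDS score given a precomputed tree edit
# distance; the names edit_dist, size_a, size_b are illustrative.

def teds(edit_dist, size_a, size_b):
    """TEDS(Ta, Tb) = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|)."""
    return 1.0 - edit_dist / max(size_a, size_b)

# A predicted table tree with 25 nodes, a ground truth with 20 nodes,
# and 5 edit operations needed to transform one into the other:
print(teds(5, 25, 20))  # 0.8
```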
The table displays the results of the different methods on three evaluation subsets: TEDS simple (tables with a simple structure), TEDS complex (tables with a complex structure), and TEDS all (the combination of the two previous subsets). Notably, LGPMA maintained its lead on the TEDS all subset with a score of 96.36%.
• The LGPMA method consistently performed well across all three subsets, making it a strong candidate for various table recognition applications. The Split-Embed-Merge method performed particularly well on complex tables.
• The GraphTSR method, while still achieving reasonable performance, had slightly lower scores than the other methods, particularly on the more complex subset.

In the table presented below, we delineate the strengths and limitations of the examined models:
The table presents an evaluation of three different methods for Tabular Structure Recogni-
tion: LGPMA, Split-Embed-Merge, and GraphTSR. These methods are assessed based
on their respective advantages and limitations, providing insights into their practical
applicability.[1]
LGPMA:
spanning cells, which are common in complex tables. - The availability of pretrained
21
CHAPTER 2. CURRENT METHODS FOR TABLE RECOGNITION
its recognition capabilities. - LGPMA’s notable achievement of holding the first place
• Limitations: - The method is designed for distributed training, which places substan-
Split-Embed-Merge:

• Advantages: Split-Embed-Merge achieves the highest TEDS score for complex table recognition, ranking among the top three in the ICDAR 2021 Competition.
• Limitations: It requires substantial training data and GPU resources.
GraphTSR:

• Advantages: It leverages graph models to capture relations between cells, improving predictions and insights.[9]
• Limitations: Graph neural networks can become computationally intensive and memory-hungry as the graph size increases, potentially limiting scalability. The risk of overfitting also exists, particularly if the dataset used for training is small.
In summary, LGPMA stands out in terms of empty and spanning cell detection. Split-Embed-Merge performs well on complex structures but requires substantial data and GPU resources. GraphTSR leverages graph models for improved predictions but may face scalability limits on large tables.
2.6 Conclusion
After this comparative study, a clear path has emerged. Our rigorous investigation has led us to the resolute decision of
adopting the LGPMA model for our table structure recognition task. The comprehensive
evaluation of existing techniques has affirmed the suitability of this model for our specific
objectives.
Chapter 3
3.1 Introduction
This chapter is dedicated to introducing our image dataset, aiming to provide a concise overview of the data through the generation of exploratory insights and visualizations. Additionally, we will delve into the process of data annotation, a crucial step to adapt the data to the format required by the model. Furthermore, we will present the configuration of the environment, detailing the hardware and software setup used throughout the project.
CHAPTER 3. DATA PROFICIENCY AND ENVIRONMENT SETUP: NAVIGATING
UNDERSTANDING, PREPARATION, AND CONFIGURATION
3.2 Data Understanding

Data understanding is the first step in the process of working with any form of data, including the data processed in computer vision tasks, and involves a systematic and in-depth examination of the dataset. Our dataset comprises 250 tabular images featuring a diverse array of tables, each possessing distinct structures. These images are sourced directly from authentic employee documents. During the curation process, the dataset was refined to 230 images. This reduction was necessitated by the exclusion of certain images that were deemed unsuitable for model training: these images exhibited notable noise and irregularities, such as dotted table lines or unclear words, making them less conducive to effective learning. The careful curation ensures that the dataset maintains higher quality and relevance, thereby enhancing the reliability and performance of the model during training and subsequent inference.
Data annotation, also referred to as data labeling, tagging, or classification, involves the essential task of assigning pertinent labels (such as tags, annotations, or classes) to individual data samples. This procedure holds significant influence over a model's performance. In the context of our project, image annotation was meticulously undertaken using the LabelImg tool. This process serves as the foundation for generating the training dataset, enabling supervised AI models to acquire knowledge. The manually annotated images establish a baseline dataset crucial for training the LGPMA model effectively. This preparatory step plays a pivotal role in facilitating the model's ability to comprehend and make predictions accurately.

The annotation process is a fundamental step in preparing our dataset for effective model training. In this process, each image is labeled using bounding boxes, where each bounding box represents an individual cell within the table, which is made practical with the help of the annotation tool.
Image annotation tools are software applications used for adding labels, shapes, text, or other metadata to images. These tools are commonly used in computer vision, machine learning, and data annotation tasks, and help in creating labeled datasets for training and testing machine learning models. In our project, we used LabelImg, a free, open-source tool that allows us to draw bounding boxes around objects in images. It is commonly used for object detection tasks and supports both the PASCAL VOC and YOLO annotation formats.
This approach allows us to accurately map the tabular structure present in the images. The resulting annotations are compiled into XML files, which encapsulate essential information including the height, width, and coordinates of each bounding box. This structured XML representation forms the cornerstone of our annotated dataset. Subsequently, these XML annotations are transformed into the format required by the model. Figure 3.5 shows the structure of the JSON file used as input for the model:
– bboxes: A list of coordinates outlining the text area within a cell. The for-
mat employed is [x1, y1, x2, y2], representing the upper-left and lower-right
coordinates.
– cells: A list indicating the row and column information for each cell. The format is [Start Row Index, Start Column Index, End Row Index, End Column Index].
– labels: A list of labels for each cell, with a value of 0 indicating a header cell and 1 indicating a body cell.

It is important to pay attention to the following nuances in the above data format:

• For cells with no text area, the bboxes list should be represented as an empty list.
• The row and column indexes in the cells parameter begin from 0.
• It is imperative to maintain the order of the three lists (bboxes, cells, and labels) consistent, so that entries at the same index describe the same cell.
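As a hedged sketch of the conversion step, the snippet below parses the kind of PASCAL VOC XML that LabelImg produces and assembles a dictionary in the spirit of the format above. The `cells` and `labels` entries are shown as placeholders, since they come from the table layout rather than from the XML alone, and the sample XML is made up.

```python
# Sketch: extract [x1, y1, x2, y2] boxes from a VOC-style annotation
# and assemble a dict matching the bboxes/cells/labels layout.
import xml.etree.ElementTree as ET

def voc_to_bboxes(xml_text):
    """Collect the bndbox coordinates of every annotated object."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        b = obj.find("bndbox")
        boxes.append([int(b.find(t).text)
                      for t in ("xmin", "ymin", "xmax", "ymax")])
    return boxes

sample = """<annotation>
  <object><name>cell</name>
    <bndbox><xmin>10</xmin><ymin>5</ymin><xmax>80</xmax><ymax>25</ymax></bndbox>
  </object>
</annotation>"""

annotation = {
    "bboxes": voc_to_bboxes(sample),
    "cells": [[0, 0, 0, 0]],  # placeholder row/column spans
    "labels": [0],            # placeholder: 0 = header cell
}
print(annotation["bboxes"])  # [[10, 5, 80, 25]]
```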
3.3 Environment Setup

For training, we moved to a workstation equipped with hardware capabilities that align with our computational requirements. This new setup boasts 32 GB of RAM, coupled with a powerful GP102 GPU, specifically the GeForce GTX 1080 Ti, featuring 11 GB of dedicated graphics memory. This shift ensures that our computational infrastructure is robust enough to handle the complexities and scale of the training workload.
3.3.1 Software
In this section, we will discuss the different components of the software environment used in our project.

• Python: invented in 1991 by Guido van Rossum and developed by the Python Software Foundation. The majority of data scientists rank it as their preferred programming language because it provides a large collection of open-source libraries that help to easily solve complex business problems, build robust systems, and develop applications such as
table structure recognition systems.

• Visual Studio Code: a powerful code editor that runs on any operating system (Windows, Linux, macOS). It comes with built-in support for JavaScript, TypeScript, and Node.js, and has a rich ecosystem of extensions for other languages.
• Google Colaboratory, or Colab: a free Google tool for developing data science notebooks. Code is executed on Google's cloud servers, allowing the user to leverage backend hardware like GPUs and TPUs.

• Jupyter: an open-source interactive computing environment, particularly popular in the field of data science and scientific computing. The key component of Jupyter is the Jupyter Notebook, which allows users to create and share documents that contain live code, equations, visualizations, and narrative text. These notebooks are a versatile tool for data analysis, scientific research, machine learning, and more. Users can write and execute code in a notebook, see the results immediately, and document their work in a coherent and interactive manner. Jupyter supports a wide range of programming languages beyond the original three, using language-specific kernels to execute the code within the notebooks. This flexibility makes Jupyter an invaluable tool for researchers, data scientists, and educators working with various programming languages.
• PyTorch: an open-source deep learning framework developed by Facebook's AI Research lab (FAIR). It provides a flexible and dynamic computational graph, making it widely adopted in both research and production for various machine learning tasks. Main features:
  – Dynamic Computation Graph: PyTorch builds the graph on the fly, which allows for more flexible and intuitive model design and debugging.
  – GPU Support: it has native support for GPUs, making it efficient for training large models.
  – Ecosystem: PyTorch has a rich ecosystem of libraries and tools, including torchvision for computer vision tasks and PyTorch Lightning for streamlining the training process.
3.3.2 Libraries
In this section, we list the packages incorporated in our project.

• OpenCV: an open-source library for computer vision and image processing. It provides more than 2,500 computer vision algorithms that can be used to process images, primarily based on mathematical operations over pixel matrices.

• spaCy: a Natural Language Processing (NLP) library with many built-in features such as NER, POS tagging, dependency analysis, entity linking, and more. Because of its cutting-edge speed and rigorously tested accuracy, it is becoming increasingly popular for NLP processing and analysis.

• TensorFlow: an open-source symbolic mathematical library used for various tasks, including machine learning applications such as neural networks. It was created by the "Google Brain" team.

• MMCV: a foundational library providing image processing utilities, model architecture components, and evaluation tools specifically designed for computer vision tasks.
  – Use Cases: MMCV is commonly used in computer vision research and applications such as detection, segmentation, and more.

• MMDetection: an open-source toolbox for object detection, instance segmentation, and other related computer vision tasks. It is built on top of the PyTorch deep learning framework, and users can easily configure and extend the framework to suit their specific needs.[2] It offers state-of-the-art, pre-implemented object detection and instance segmentation models that can be easily used for various tasks, and provides utilities for training, data augmentation, anchor generation, and more, which are crucial for building detection pipelines.
3.4 Conclusion

In this chapter, we presented our dataset, the annotation process used to prepare it, and the hardware and software environment configured for training.
Chapter 4
4.1 Introduction
In this chapter, we will address the principles of modeling, evaluation, and deployment
in the field of data science, providing a scientific analysis of these essential processes in our
project.
4.2 Modeling
The model configuration is organized as Python dictionaries. 1. The lgpma_base.py file configures the model training parameters, backbone, and neck, among which:
CHAPTER 4. MODELING, EVALUATION, AND DEPLOYMENT
• batch size: the number of samples processed each time the model is updated.
• backbone: the part of the network that performs feature extraction on the input data and transforms it into a certain representation.
• neck: a set of parameters that includes, among others, the numbers of input and output channels. In our case, we chose to process grayscale images, which output only a single (black) channel.
2. In the lgpma_pub.py file, we configure the training data path, model storage path, and log storage path.
Figure 4.2: training data path, model storage path, and log storage path
3. In the lgpma_pub.py file, we can also configure the number of epochs as well as the number of GPUs used for training.

Figures 4.2 and 4.3 represent our configuration of the model.
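The two configuration files (lgpma_base.py and lgpma_pub.py) hold dictionary-style settings. The fragment below is a hedged sketch in the spirit of an MMDetection-style config; the exact keys in the real files differ, so treat all names and values here as illustrative only.

```python
# Illustrative, MMDetection-style configuration sketch (not the actual
# lgpma_base.py / lgpma_pub.py contents).
data = dict(
    samples_per_gpu=2,                                  # batch size per GPU
    train=dict(ann_file="train.json", img_prefix="imgs/"),  # training data path
)
model = dict(
    backbone=dict(type="ResNet", depth=50),             # feature extractor
    neck=dict(in_channels=[256, 512, 1024, 2048], out_channels=256),
)
runner = dict(type="EpochBasedRunner", max_epochs=12)   # epoch number
work_dir = "./work_dirs/lgpma"                          # model/log storage path
```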
4.2.2 LPMA
For the pyramid mask regression, we assign the pixels in the proposal bounding box regions a soft label in both the horizontal and vertical directions, as illustrated below. The middle point of the text has the largest regression target, "1", which is the darkest level. Specifically, we assume the proposed aligned bounding box has shape H × W. The top-left and bottom-right points of the text region are denoted (x1, y1) and (x2, y2), respectively, where 0 < x1 < x2 ≤ W and 0 < y1 < y2 ≤ H. Therefore, the target of the pyramid mask has shape 2 × H × W with values in [0, 1], in which the two channels represent the target maps of the horizontal mask and the vertical mask, respectively. For every pixel (h, w), these two values ramp linearly from 0 at the box border to 1 at the midpoint of the text region.
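The soft-label scheme can be sketched as follows. The linear-ramp formula is our reading of the description above (an assumption, not the paper's exact implementation), and the sizes are illustrative.

```python
# Sketch of the 2 x H x W pyramid soft labels: each target ramps
# linearly from 0 at the aligned-box border to 1 at the midpoint of
# the text region (assumed interpretation of the LGPMA labels).

def pyramid_targets(H, W, x1, y1, x2, y2):
    """Return [horizontal, vertical] masks as nested lists."""
    x_mid, y_mid = (x1 + x2) / 2, (y1 + y2) / 2
    hor = [[w / x_mid if w <= x_mid else (W - w) / (W - x_mid)
            for w in range(W)] for _ in range(H)]
    ver = [[h / y_mid if h <= y_mid else (H - h) / (H - y_mid)
            for _ in range(W)] for h in range(H)]
    return [hor, ver]

hor, ver = pyramid_targets(H=10, W=20, x1=4, y1=2, x2=16, y2=8)
print(hor[0][10])  # 1.0 at the horizontal text midpoint (x_mid = 10)
print(hor[0][0])   # 0.0 at the left border
print(ver[5][0])   # 1.0 at the vertical text midpoint (y_mid = 5)
```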
4.2.3 GPMA
Although LPMA allows the predicted mask to break through the proposal bounding boxes, the local region's receptive fields are limited. To determine the accurate coverage area of a cell, the global feature might also provide some visual clues: inspired by prior work, learning the offsets of each pixel from a global view could help locate more accurate boundaries. However, bounding boxes at the cell level can vary widely in width-height ratio, which leads to an imbalance problem in regression learning. Therefore, we use the pyramid labels as the regression targets for each pixel, an approach named Global Pyramid Mask Alignment (GPMA).

The ground truth for empty cells is generated according to the maximum height/width of the non-empty cells in the same row/column. Only this task learns the empty-cell division information, since empty cells have no visible text texture, which might otherwise influence the region proposal network to some extent. We want the model to capture the most reasonable cell division pattern during the global boundary segmentation, according to human reading habits, as reflected by the manually labeled annotations. For the global pyramid mask regression, since only the text region can provide information about distinct cells, all non-empty cells are assigned the soft labels, and the ground truths of the aligned bounding boxes are generated accordingly.
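The empty-cell rule stated above can be sketched as a small helper, assuming cells are given as (x1, y1, x2, y2) boxes grouped by row and column; all coordinates are illustrative.

```python
# Simplified sketch of the empty-cell ground-truth rule: an empty cell
# inherits the maximum height of the non-empty cells in its row and the
# maximum width of the non-empty cells in its column.

def empty_cell_size(row_boxes, col_boxes):
    """(width, height) for an empty cell, from its row and column peers."""
    max_h = max(y2 - y1 for (x1, y1, x2, y2) in row_boxes)
    max_w = max(x2 - x1 for (x1, y1, x2, y2) in col_boxes)
    return max_w, max_h

row = [(0, 0, 40, 20), (50, 0, 90, 25)]        # non-empty cells in the same row
col = [(100, 0, 160, 20), (100, 30, 150, 50)]  # non-empty cells in the same column
print(empty_cell_size(row, col))  # (60, 25)
```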
4.3 Evaluation
Intersection over Union (IoU) is a metric commonly used in computer vision and image segmentation to evaluate the quality of a detection or segmentation algorithm. It measures the overlap between the predicted and ground-truth regions in an image. IoU is particularly useful for tasks where you need to assess how well a model's predictions align with the actual objects:

  IoU = Intersection Area / Union Area
• Precision: Precision measures how accurate the positive predictions of your model are. In the context of object detection or segmentation with IoU, precision is the ratio of true positives (correctly predicted object instances with IoU above a certain threshold) to the total number of positive predictions (both true positives and false positives):

  Precision = True Positives / (True Positives + False Positives)
• Recall:
Recall, also known as sensitivity or true positive rate, measures the model’s ability
to identify all the relevant positive instances in the dataset. In the context of IoU,
recall is the ratio of true positives to the total number of actual positive instances.
  Recall = True Positives / (True Positives + False Negatives)
• F1 score :
The F1 score is a metric commonly used in binary classification tasks to measure the
model's accuracy in terms of both precision and recall. It is the harmonic mean of the two:

  F1 = 2 · Precision · Recall / (Precision + Recall)
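The metrics above can be sketched in a few lines. The box coordinates and counts below are illustrative, not our experimental data; boxes use the [x1, y1, x2, y2] format.

```python
# Sketch of IoU between two axis-aligned boxes, plus precision, recall
# and F1 from true/false positive and false negative counts.

def iou(a, b):
    """Intersection over Union of two boxes [x1, y1, x2, y2]."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(iou([0, 0, 10, 10], [5, 0, 15, 10]))  # 0.333... (50 / 150)
# Illustrative counts giving scores close to ours:
# precision ≈ 0.795, recall = 0.70, F1 ≈ 0.745
print(precision_recall_f1(tp=70, fp=18, fn=30))
```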
For a table structure recognition task, the performance metrics can be interpreted as
follows:
• F1 Score (0.74):
  – An F1 score of 0.74 indicates a reasonable balance between precision and recall.
• Recall (0.70):
  – A recall of 0.70 signifies that the model correctly identifies 70% of the actual tables present in the images.
• Precision (0.79):
  – A precision of 0.79 implies that when the model predicts a table, it is correct 79% of the time.

In summary, these performance metrics suggest that the model is reasonably effective at detecting the tables (recall of 0.70) while maintaining a high level of accuracy in its predictions (precision of 0.79). However, the specific interpretation may depend on the requirements and objectives of the table structure recognition task and on whether certain trade-offs between precision and recall are acceptable in the given application. The obtained results might be more accurate with a bigger dataset; ours is considered very low volume for this type of model.
4.4 Deployment
FastAPI is a modern, fast (high-performance) web framework for building APIs with Python. It is designed to be easy to use while also being highly efficient, and it has gained popularity for its simplicity and performance, making it an excellent choice for building web services. Here are some key features and concepts associated with FastAPI:
• Automatic documentation: FastAPI automatically generates interactive API
documentation using the OpenAPI standard. You can access this documentation
through a web browser, making it easy for developers to understand and test your
API.
• Asynchronous support: FastAPI natively supports Python's async and await
syntax. This allows you to write non-blocking, high-performance code.
• File Uploads: It supports handling file uploads from clients with ease.
To obtain the model's output, the input typically involves sending the image data in
base64 format as part of an HTTP request. Once received, FastAPI can decode this input
and process it. In our case, the JSON output comprises three key components. First, it
includes HTML representing the structure of the table, outlining its rows, columns, and
cells. Second, it contains the coordinates of these cells, providing information about
their precise positioning within the image. Finally, the third component consists of
the content extracted from each cell, enabling users to access the actual data within the
recognized table. By organizing and presenting these three parts within a structured JSON
response, our model is deployed effectively. Figures 4.5 and 4.6 present the input and
output of the API.
4.5 Conclusion
In conclusion, the evaluation of our modeling efforts in this chapter has yielded highly
positive results, aligning closely with the objectives and criteria we set out to achieve. The
comprehensive analysis and assessment of our model have provided valuable insights into
its strengths and limitations.
Throughout this chapter, we have systematically examined various aspects of the model,
ranging from its accuracy and precision to its ability to generalize beyond the training
data. We have also considered its computational efficiency, scalability, and robustness in
real-world scenarios.
The results obtained indicate that our model has met the predefined expectations, and we
consider it ready to successfully address the specific tasks and challenges we set out to
tackle.
General Conclusion
This report presents an end-of-study project carried out at Axe Finance in order to obtain
the national diploma of computer science engineering of the Higher National Engineering
School of Tunis (ENSIT); it aims to implement the LGPMA table structure recognition model.
The initial chapter laid the foundation with a detailed presentation of the host company
and a concise problem definition, establishing the need for an effective solution.
Recognizing the significance of choosing the right model, we evaluated various options and
ultimately selected LGPMA. This decision was driven by its track record of delivering
excellent results in the ICDAR 2021 competition and its remarkable capability to
handle complicated table structures.
Following this, we carried out data annotation and environment setup, crucial steps in
preparing our data and creating a conducive modeling environment. This approach
ensured alignment with our objectives and the compatibility of our dataset with the
LGPMA model. We then moved into the modeling phase, implementing the LGPMA
model and evaluating its performance. The model consistently met the predefined
expectations, affirming the effectiveness of this approach, although it is worth noting that
a larger dataset would likely improve the results further.
Finally, we successfully deployed our solution using FastAPI, making it accessible
through a simple HTTP interface.
Bibliography
[1] Chris Tensmeyer, Vlad I. Morariu, Brian Price, Scott Cohen, and Tony Martinez.
Deep splitting and merging for table structure decomposition. Adobe Research, San Jose,
USA, 2019.
[3] Python Software Foundation. Python documentation, 2021. [Accessed April–June 2023].
[5] Liang Qiao, Zaisheng Li, Zhanzhan Cheng, Peng Zhang, Shiliang Pu, Yi Niu, Wenqi
Ren, Wenming Tan, and Fei Wu. LGPMA: Complicated table structure recognition with
local and global pyramid mask alignment. In ICDAR, 2021.
[9] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling
Mao. Complicated table structure recognition. arXiv preprint, 13 Aug 2019.
Abstract
This report presents an end-of-study project carried out at Axe Finance in order to obtain the national
diploma of computer science engineering of the Higher National Engineering School of Tunis (ENSIT),
and aims to implement the LGPMA table structure recognition model. The results obtained after
systematic work spanning data annotation, environment setup, modeling, evaluation, and deployment
indicate that our model has met the predefined expectations.
Résumé
(French abstract, translated:) This report presents an end-of-study project carried out at Axe Finance
with a view to obtaining the national computer engineering diploma of the Higher National Engineering
School of Tunis, and aims to implement the LGPMA table and structure recognition model. The results
obtained after systematic work on data annotation, environment setup, modeling, evaluation, and
deployment indicate that our model has met the predefined expectations.
الملخص
(Arabic abstract, translated:) This report represents an end-of-study project carried out in order to
obtain the national diploma in computer science engineering of the Higher National Engineering School
of Tunis, and aims to implement a table and structure recognition model. The results obtained after
systematic work, from data annotation and environment setup to modeling, evaluation, and deployment,
indicate that our model has met the predefined expectations.