
Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction

Chong Zhang1†∗, Ya Guo2†, Yi Tu2, Huan Chen2, Jinyang Tang2, Huijia Zhu2, Qi Zhang1‡, Tao Gui3‡
1 School of Computer Science, Fudan University, Shanghai, China
2 Ant Info Security Lab, Ant Group Inc., Hangzhou, China
3 Institute of Modern Languages and Linguistics, Fudan University, Shanghai, China
{chongzhang20, qz, tgui}@fudan.edu.cn
{guoya.gy,qianyi.ty,chenhuan.chen,jinyang.tjy,huijia.zhj}@antgroup.com

arXiv:2310.11016v1 [cs.CL] 17 Oct 2023

∗ This work was done when the author was an intern at Ant Group.
† Equal contribution.
‡ Corresponding author.
Abstract

Recent advances in multimodal pre-trained models have significantly improved information extraction from visually-rich documents (VrDs), in which named entity recognition (NER) is treated as a sequence-labeling task of predicting BIO entity tags for tokens, following the typical setting of NLP. However, the BIO-tagging scheme relies on the correct order of model inputs, which is not guaranteed in real-world NER on scanned VrDs, where text is recognized and arranged by OCR systems. Such a reading order issue hinders the accurate marking of entities by the BIO-tagging scheme, making it impossible for sequence-labeling methods to predict correct named entities. To address the reading order issue, we introduce Token Path Prediction (TPP), a simple prediction head that predicts entity mentions as token sequences within documents. As an alternative to token classification, TPP models the document layout as a complete directed graph of tokens, and predicts token paths within the graph as entities. For better evaluation of VrD-NER systems, we also propose two revised benchmark datasets of NER on scanned documents which reflect real-world scenarios. Experiment results demonstrate the effectiveness of our method, and suggest its potential to be a universal solution to various information extraction tasks on documents.

Figure 1: A document image with its layout and entity annotations. (a) Entity annotation by BIO-tags: with disordered inputs, BIO-tags fail to assign a proper label to each word to mark entities clearly, so sequence-labeling methods are not suitable for VrD-NER when model inputs are disordered in real situations. (b) Entity annotation as word sequences within the document: the entity annotations are not affected by the model input order. We address the reading order issue by treating VrD-NER as predicting word sequences within documents.

1 Introduction

Visually-rich documents (VrDs), including forms, receipts and contracts, are essential tools for gathering, carrying, and displaying information in the digital era. The ability to understand and extract information from VrDs is critical for real-world applications. In particular, the recognition and comprehension of scanned VrDs are necessary in various scenarios, including legal, business, and financial fields (Stanisławek et al., 2021; Huang et al., 2019; Stray and Svetlichnaya, 2020). Recently, several transformer-based multimodal pre-trained models (Garncarek et al., 2020; Xu et al., 2021a; Li et al., 2021a; Hong et al., 2022; Huang et al., 2022; Tu et al., 2023) have been proposed. Known as document transformers, these models can encode text, layout, and image inputs into a unified feature representation, typically by marking each input text token with its xy-coordinates on the document layout using an additional 2D positional embedding. Thus, document transformers can adapt to a wide range of VrD tasks, such as Named Entity Recognition (NER) and Entity Linking (EL) (Jaume et al., 2019; Park et al., 2019).
However, in the practical application of information extraction (IE) from scanned VrDs, the reading order issue is known as a pervasive problem that leads to suboptimal performance of current methods. This problem is particularly typical in VrD-NER, a task that aims to identify word sequences in a document as entities of predefined semantic types, such as headers, names and addresses. Following the classic settings of NLP, current document transformers typically treat this task as a sequence-labeling problem, tagging each text token using the BIO-tagging scheme (Ramshaw and Marcus, 1999) and predicting the entity tag for tokens through a token classification head. These sequence-labeling-based methods assume that each entity mention is a continuous and front-to-back word sequence within the inputs, which is always valid in plain texts. However, for scanned VrDs in the real world, where text and layout annotations are recognized and arranged by OCR systems, typically in a top-to-down and left-to-right order, the assumption may fail and the reading order issue arises, rendering the order of model inputs to document transformers incorrect, and making these sequence-labeling methods inapplicable. For instance, as depicted in Figure 1, the document contains two entity mentions, namely "NAME OF ACCOUNT" and "# OF STORES SUPPLIED". However, due to their layout positions, the OCR system would recognize the contents as three segments and arrange them in the following order: (1) "# OF STORES"; (2) "NAME OF ACCOUNT"; (3) "SUPPLIED". Such disordered input leads to significant confusion in the BIO-tagging scheme, making the models unable to assign a proper label to each word to mark the entity mentions clearly. Unfortunately, the reading order issue is particularly severe in documents with complex layouts, such as tables, multi-column contents, and unaligned contents within the same row or column, which are quite common in scanned VrDs. Therefore, we believe that the sequence-labeling paradigm is not a practical approach to address NER on scanned VrDs in real-world scenarios.

Figure 2: The reading order issue of scanned documents in real-world scenarios. Left: The entity "TOTAL 230,000" is disordered when reading from top to down. The entity "CASH CHANGE" lies in multiple rows and is separated into two segments. Right: The entity "TOTAL 30,000" is interrupted by "(Qty 1.00)" when reading from left to right. All these situations result in disordered model inputs and affect the performance of sequence-labeling based VrD-NER methods. More real-world examples of disordered layouts are displayed in Figure 7 in the Appendix.

To address the reading order issue, we introduce Token Path Prediction (TPP), a simple yet strong prediction head for VrD-IE tasks. TPP is compatible with commonly used document transformers, and can be adapted to various VrD-IE and VrD understanding (VrDU) tasks, such as NER, EL, and reading order prediction (ROP). Specifically, when using TPP for VrD-NER, we model the token inputs as a complete directed graph of tokens, and each entity as a token path, which is a group of directed edges within the graph. We adopt a grid label for each entity type to represent the token paths as n × n binary values indicating whether two tokens are linked or not, where n is the number of text tokens. The model learns to predict the grid labels by binary classification in training, and searches for token paths from positively-predicted token pairs in inference. Overall, TPP provides a viable solution for VrD-NER by presenting a suitable label form, modeling VrD-NER as predicting token paths within a graph, and proposing a straightforward method for prediction. This method does not require any prior reading order and is therefore unaffected by the reading order issue. For adaptation to other VrD tasks, TPP is applied to VrD-ROP by predicting a global path of all tokens that denotes the predicted reading order. TPP is also capable of modeling entity linking relations by marking linked token pairs within the grid label, making it suitable for direct adaptation to the VrD-EL task. Our work is related to discontinuous NER in NLP, as the tasks share a similar form, yet current discontinuous NER methods cannot be directly applied to address the reading order issue, since they also require a proper reading order of contents.

For better evaluation of our proposed method, we also propose two revised datasets for VrD-NER. In current benchmarks of NER on scanned VrDs, such as FUNSD (Jaume et al., 2019) and CORD (Park et al., 2019), the reading order is manually corrected, thus failing to reflect the reading order issue in real-world scenarios. To address this limitation, we reannotate the layouts and entity mentions of the above datasets to obtain two revised datasets, FUNSD-r and CORD-r, which accurately reflect the real-world situations and make it possible to evaluate VrD-NER methods in disordered scenarios. We conduct extensive experiments by integrating the TPP head with different document transformer backbones and report quantitative results on multiple VrD tasks. For VrD-NER, experiments on the FUNSD-r and CORD-r datasets demonstrate the effectiveness of TPP, both as an independent VrD-NER model, and as a pre-processing mechanism to reorder inputs for sequence-labeling models. Also, TPP achieves SOTA performance on benchmarks for VrD-EL and VrD-ROP, highlighting its potential as a universal solution for information extraction tasks on VrDs. The main contributions of our work are listed as follows:

1. We identify that sequence-labeling-based VrD-NER methods are unsuitable for real-world scenarios due to the reading order issue, which is not adequately reflected by current benchmarks.
2. We introduce Token Path Prediction, a simple yet strong approach to address the reading order issue in information extraction on VrDs.
3. Our proposed method outperforms SOTA methods in various VrD tasks, including VrD-NER, VrD-EL, and VrD-ROP. We also propose two revised VrD-NER benchmarks reflecting real-world scenarios of NER on scanned VrDs.

2 Related Work

Sequence-labeling based NER with Document Transformers
Recent advances of pre-trained techniques in NLP (Devlin et al., 2019; Zhang et al., 2019) and CV (Dosovitskiy et al., 2020; Li et al., 2022c) have inspired the design of pre-trained representation models in document AI, in which document transformers (Xu et al., 2020, 2021a; Huang et al., 2022; Li et al., 2021a,c; Hong et al., 2022; Wang et al., 2022; Tu et al., 2023) are proposed to act as layout-aware representation models of documents in various VrD tasks. By the term document transformers, we refer to transformer encoders that take the vision (optional), text and layout input of a document as a token sequence, in which each text token is embedded together with its layout. In sequence-labeling based NER with document transformers, the BIO-tagging scheme assigns a BIO-tag to each text token to mark entities: B-ENT/I-ENT indicates the token to be the beginning/inside token of an entity with type ENT, and O indicates the word does not belong to any entity. In this case, the type and boundary of each entity are clearly marked in the tags. The BIO-tags are treated as the classification label of each token and are predicted by a token classification head of the document transformer. As illustrated in the introduction, sequence-labeling methods are typical for NER in NLP and are adopted by current VrD-NER methods, but are not suitable for VrD-NER in real-world scenarios due to the reading order issue (a minimal sketch of this failure mode follows at the end of this section).

Reading-order-aware Methods
Several studies have addressed the reading order issue on VrDs in two directions: (1) Task-specific models that directly predict the reading order, such as LayoutReader, which uses a sequence-to-sequence approach (Wang et al., 2021b). However, this method is limited to the task at hand and cannot be directly applied to VrD-IE tasks. (2) Pre-trained models that learn from supervised signals during pre-training to improve their awareness of reading order. For instance, ERNIE-Layout includes a pre-training objective of reading order prediction (Peng et al., 2022); XYLayoutLM enhances the generalization of reading order pre-training by generating various proper reading orders using an augmented XY Cut algorithm (Gu et al., 2022). However, these methods require a massive amount of labeled data and computational resources during pre-training. Compared with the above methods, our work is applicable to multiple VrD-IE tasks, and can be integrated with various document transformers, without the need for additional supervised data or computational costs.

Discontinuous NER
The form of discontinuous NER is similar to VrD-NER, as it involves identifying discontinuous token sequences from text as named entities. Current methods of discontinuous NER can be broadly categorized into four groups: (1) Sequence-labeling methods with refined BIO-tagging schemes (Tang et al., 2015; Dirkson et al., 2021), (2) 2D grid prediction methods (Wang et al., 2021a; Li et al., 2022b; Liu et al., 2022), (3) Sequence-to-sequence methods (Li et al., 2021b; He and Tang, 2022), and (4) Transition-based methods (Dai et al., 2020). These methods all rely on a proper reading order to predict entities from front to back, and thus cannot be directly applied to address the reading order issue in VrD-NER.
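To make the failure mode concrete, here is a minimal sketch of a standard BIO decoder on the Figure 1 example; the decoder is our illustration, not code from any of the cited systems.

```python
# A minimal sketch of why BIO tags break under OCR reading order.
# The words and tags follow the Figure 1 example; H = header, Q = question.

def decode_bio(words, tags):
    """Collect (type, word string) entities from BIO tags."""
    entities, current = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            current = (tag[2:], [word])
            entities.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)
        else:
            current = None  # "O", or an I- tag without a matching B-
    return [(t, " ".join(ws)) for t, ws in entities]

# Ideal (ordered) input: both entities are recovered.
words = ["NAME", "OF", "ACCOUNT", "#", "OF", "STORES", "SUPPLIED"]
tags  = ["B-H", "I-H", "I-H", "B-Q", "I-Q", "I-Q", "I-Q"]
print(decode_bio(words, tags))
# [('H', 'NAME OF ACCOUNT'), ('Q', '# OF STORES SUPPLIED')]

# Actual (disordered) OCR input: "SUPPLIED" is separated from the rest
# of its entity, so no assignment of BIO tags can mark the gold entities.
words = ["#", "OF", "STORES", "NAME", "OF", "ACCOUNT", "SUPPLIED"]
tags  = ["B-Q", "I-Q", "I-Q", "B-H", "I-H", "I-H", "I-Q"]
print(decode_bio(words, tags))
# [('Q', '# OF STORES'), ('H', 'NAME OF ACCOUNT')] -- "SUPPLIED" is lost
```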
Figure 3: An overview of the training procedure of Token Path Prediction for VrD-NER. A TPP head for each entity type predicts whether an input token links to another. The prediction result is viewed as an adjacency matrix in which token paths are decoded as entity mentions. The overall model is optimized by the class-imbalance loss.
3 Methodology

3.1 VrD-NER

The definition of NER on visually-rich documents with layouts is formalized as follows. A visually-rich document with $N_D$ words is represented as $D = \{(w_i, b_i)\}_{i=1,\dots,N_D}$, where $w_i$ denotes the $i$-th word in the document and $b_i = (x_i^0, y_i^0, x_i^1, y_i^1)$ denotes the position of $w_i$ in the document layout. The coordinates $(x_i^0, y_i^0)$ and $(x_i^1, y_i^1)$ correspond to the bottom-left and top-right vertices of $w_i$'s bounding box, respectively. The objective of VrD-NER is to predict all the entity mentions $\{s_1, \dots, s_J\}$ within document $D$, given the predefined entity types $E = \{e_i\}_{i=1,\dots,N_E}$. Here, the $j$-th entity in $D$ is represented as $s_j = \{e_j, (w_{j_1}, \dots, w_{j_k})\}$, where $e_j \in E$ is the entity type and $(w_{j_1}, \dots, w_{j_k})$ is a word sequence in which the words are pairwise distinct but not necessarily adjacent. It is important to note that sequence-labeling based methods assign a BIO-label to each word and predict adjacent words $(w_j, w_{j+1}, \dots)$ as entities, which is not suitable for real-world VrD-NER where the reading order issue exists.

3.2 Token Path Prediction for VrD-NER

In Token Path Prediction, we model VrD-NER as predicting paths within a graph of tokens. Specifically, for a given document $D$, we construct a complete directed graph with the $N_D$ tokens $\{w_i\}_{i=1,\dots,N_D}$ in $D$ as vertices. This graph consists of $n^2$ directed edges pointing from each token to each token. For each entity $s_j = \{e_j, (w_{j_1}, \dots, w_{j_k})\}$, the word sequence can be represented by a path $(w_{j_1} \to w_{j_2}, \dots, w_{j_{k-1}} \to w_{j_k})$ within the graph, referred to as a token path.

TPP predicts the token paths in a given graph by learning grid labels. For each entity type, the token paths of entities are marked by an $n \times n$ grid label of binary values, where $n$ is the number of text tokens. Specifically, for each edge in every token path, the pair of its beginning and ending tokens is labeled as 1 in the grid label, while the other pairs are labeled as 0. For example, in Figure 3, the token pairs "(NAME, OF)" and "(OF, ACCOUNT)" are marked 1 in the grid label. In this way, entity annotations can be represented as $N_E$ grids of $n \times n$ binary values.

The learning of grid labels is then treated as $N_E$ binary classification tasks on $n^2$ samples, which can be implemented using any classification model. In TPP, we utilize document transformers to represent document inputs as token feature sequences, and employ Global Pointer (Su et al., 2022) as the classification model to predict the grid labels by binary classification. Following Su et al. (2022), the weights are optimized by a class-imbalance loss to overcome the class-imbalance problem, as there are at most $n$ positive labels out of $n^2$ labels in each grid. During evaluation, we collect the $N_E$ predicted grids and filter the token pairs predicted to be positive. If there are multiple token pairs with the same beginning token, we only keep the pair with the highest confidence score. After that, we perform a greedy search to predict token paths as entities. Figure 3 displays the overall procedure of TPP for VrD-NER.

In general, TPP is a simple and easy-to-implement solution to address the reading order issue in real-world VrD-NER.
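As a concrete illustration of the grid labels, the class-imbalance loss, and the greedy decoding described above, the following sketch is our reconstruction of the procedure, not the released implementation; all function and variable names are ours, and the loss is written in its naive form.

```python
import numpy as np

def build_grid_label(token_paths, n):
    """Grid label for one entity type: grid[i, j] = 1 iff token j
    directly follows token i on some token path (entity)."""
    grid = np.zeros((n, n), dtype=np.int64)
    for path in token_paths:
        for i, j in zip(path, path[1:]):
            grid[i, j] = 1
    return grid

def class_imbalance_loss(logits, labels):
    """Naive form of the multilabel loss of Su et al. (2022) over one
    grid: log(1 + sum(exp(s_neg))) + log(1 + sum(exp(-s_pos))).
    A numerically stable version would use logsumexp instead."""
    s = logits.flatten()
    y = labels.flatten().astype(bool)
    return np.log1p(np.exp(s[~y]).sum()) + np.log1p(np.exp(-s[y]).sum())

def decode_token_paths(scores, threshold=0.0):
    """Inference sketch: keep positive links, resolve beginning tokens
    that point to several successors by confidence, then follow the
    links greedily to read off token paths."""
    n = scores.shape[0]
    nxt = {}
    for i in range(n):
        j = int(scores[i].argmax())        # best successor of token i
        if scores[i, j] > threshold:
            nxt[i] = j
    heads = set(nxt) - set(nxt.values())   # path starts: no incoming link
    paths = []
    for head in sorted(heads):
        path, seen = [head], {head}
        while path[-1] in nxt and nxt[path[-1]] not in seen:
            path.append(nxt[path[-1]])
            seen.add(path[-1])
        paths.append(path)
    return paths

# Figure 3 example, tokens w1..w7 -> indices 0..6: the "question"
# entity "# OF STORES SUPPLIED" is the token path (0, 1, 2, 6).
label = build_grid_label([[0, 1, 2, 6]], n=7)
assert label[0, 1] == label[1, 2] == label[2, 6] == 1
```

Note that this sketch only covers entities of two or more tokens; handling single-token entities and conflicting predictions requires additional bookkeeping beyond what the paragraph above specifies.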
Figure 4: The grid label design of TPP for VrD-EL and VrD-ROP. For VrD-EL, "Under 25"/"25-34" is linked to "3.2"/"3.0", so each token pair between the two entities is linked. For VrD-ROP, the reading order is represented as a global path including all tokens, starting from an auxiliary beginning token <s>, in which the linked token pairs are marked.

3.3 Token Path Prediction for Other Tasks

We have explored the capability of TPP to address various VrD tasks, including VrD-EL and VrD-ROP. As depicted in Figure 4, these tasks are addressed by devising task-specific grid labels for TPP training. For the prediction of VrD-EL, we gather all token pairs between every two entities and calculate the mean logit score. Two entities are predicted to be linked if the mean score is greater than 0. In VrD-ROP, we perform a beam search on the logit scores, starting from the auxiliary beginning token, to predict a global path linking all tokens. The feasibility of TPP on these tasks highlights its potential as a universal solution for VrD tasks.
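A minimal sketch of the two task-specific decodings, under our reading of the paragraph above; the aggregation and the search are reconstructions, not the released code, and all names are ours.

```python
import numpy as np

def entity_linking_score(logits, head_tokens, tail_tokens):
    """VrD-EL sketch: average the grid logits over all token pairs
    between two entities; predict a link iff the mean is > 0."""
    pairs = logits[np.ix_(head_tokens, tail_tokens)]
    return float(pairs.mean())

def reading_order_beam_search(logits, beam_size=8):
    """VrD-ROP sketch: beam-search a global path over all n tokens,
    starting from an auxiliary <s> token (index 0 here); returns the
    highest-scoring permutation found under the beam."""
    n = logits.shape[0]
    beams = [((0,), 0.0)]                  # (path, cumulative logit)
    for _ in range(n - 1):
        candidates = []
        for path, score in beams:
            used = set(path)
            for j in range(1, n):
                if j not in used:
                    candidates.append((path + (j,),
                                       score + logits[path[-1], j]))
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0][1:]                 # drop the auxiliary <s>
```

The exhaustive expansion above is quadratic per step and is meant only to show the decoding principle; the paper reports using a beam size of 8 in practice (Appendix C).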
4 Revised Datasets for Scanned VrD-NER

In this section, we introduce FUNSD-r and CORD-r, the revised VrD-NER datasets that reflect the real-world scenarios of NER on scanned VrDs. We first point out the existing problems of popular benchmarks for NER on scanned VrDs, which indicate the necessity for us to build new benchmarks. We then describe the construction of the new datasets, including selecting adequate data resources and the annotating process.

4.1 Motivation

FUNSD (Jaume et al., 2019) and CORD (Park et al., 2019) are the most popular benchmarks of NER on scanned VrDs. However, these benchmarks are biased with respect to real-world scenarios. In FUNSD and CORD, segment layout annotations are aligned with labeled entities. The scope of each entity on the document layout is marked by a bounding box that forms the segment annotation together with each entity word. We argue that there are two problems in their annotations that render them unsuitable for evaluating current methods. First, these benchmarks do not reflect the reading order issue of NER on scanned VrDs, as each entity corresponds to a continuous span in the model input. In these benchmarks, each segment models a continuous semantic unit where words are correctly ordered, and each entity corresponds exactly to one segment. Consequently, each entity corresponds to a continuous and front-to-back token span in the model input. However, in real-world scenarios, entity mentions may span across segments, and segments may be disordered, necessitating the consideration of reading order issues. For instance, as shown in Figure 5, the entity "Sample Requisition [Form 02:02:06]" is located in a chart cell spanning multiple rows, while the entity is recognized as two segments by the OCR system, since OCR-annotated segments are confined to lie in a single row. Second, the segment layout annotations in current benchmarks vary in granularity, which is inconsistent with real-world situations. The scope of segments in these benchmarks ranges from a single word to a multi-row paragraph, whereas OCR-annotated segments always correspond to words within a single row.

Therefore, we argue that a new benchmark should be developed with segment layout annotations aligned with real-world situations and entity mentions labeled on words.

4.2 Dataset Construction

As illustrated above, current benchmarks cannot reflect real-world scenarios, and adequate benchmarks are desired. This motivates us to develop new NER benchmarks of scanned VrDs with real-world layout annotations.
We achieve this by reannotating existing benchmarks, and select FUNSD (Jaume et al., 2019) and CORD (Park et al., 2019) for the following reasons. First, FUNSD and CORD are the most commonly used benchmarks on VrD-NER, as they contain high-quality document images that closely resemble real-world scenarios and clearly reflect the reading order issue. Second, the annotations on FUNSD and CORD are heavily biased due to the spurious correlation between their layout and entity annotations. In these benchmarks, each entity exactly corresponds to a segment, which leaks the ground truth entity labels during training through the 2D positional encoding of document transformers, as tokens of the same entity share the same segment 2D position information. Due to the above issues, we reannotate the layouts and entity mentions on the document images of the two datasets. We automatically reannotate the layouts using an OCR system, and manually reannotate the named entities as word sequences based on the new layout annotations to build the new datasets. The proposed FUNSD-r and CORD-r datasets consist of 199 and 999 document samples, including the image, layout annotations of segments and words, and labeled entities. The detailed annotation pipeline and statistics of these datasets are introduced in Appendix A. We publicly release the two revised datasets at GitHub.¹

¹ https://github.com/chongzhangFDU/TPP

Figure 5: A comparison of the original (left) and revised (right) layout annotation. The revised layout annotation reflects real-world situations, where (1) character-level position boxes are annotated rather than word-level boxes, (2) every segment lies in a single row, (3) spatially-adjacent words are recognized into one segment without considering their semantic relation, and (4) missing annotations exist.

5 Experiments

5.1 Datasets and Evaluation

The experiments in this work involve VrD-NER, VrD-EL, and VrD-ROP. For VrD-NER, the experiments are conducted on the FUNSD-r and CORD-r datasets proposed in our work. The performance of methods is measured by entity-level F1 on these datasets. Noticing that previous works on FUNSD and CORD have used word-level F1 as the primary evaluation metric, we argue that this may not be entirely appropriate, as word-level F1 only evaluates the accuracy of individual words, whereas entity-level F1 assesses the entire entity, including its boundary and type. Compared with word-level F1, entity-level F1 is more suitable for evaluating NER systems in practical applications (see the sketch below). For VrD-EL, the experiments are conducted on FUNSD and the methods are measured by F1. For VrD-ROP, the experiments are conducted on ReadingBank. The methods are measured by Average Page-level BLEU (BLEU for short) and Average Relative Distance (ARD) (Wang et al., 2021b).
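To make the metric distinction concrete, here is a minimal sketch of entity-level matching; this is our illustration, and the matching in the actual evaluation scripts may differ in details such as tokenization.

```python
def entity_level_f1(pred, gold):
    """Entity-level F1 sketch: an entity counts as correct only if its
    type, boundary, and word sequence all match exactly."""
    pred, gold = set(pred), set(gold)   # elements: (type, tuple of words)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("header", ("NAME", "OF", "ACCOUNT"))}
pred = {("header", ("NAME", "OF"))}     # right type, wrong boundary
print(entity_level_f1(pred, gold))      # 0.0 -- word-level F1 would
                                        # still credit the two words
```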
5.2 Baselines and Implementation Details

For VrD-NER, we compare the proposed TPP with sequence-labeling methods that adopt document transformers integrated with token classification heads. As these methods suffer from the reading order issue, we adopt a VrD-ROP model as a pre-processing mechanism to reorder the inputs, mitigating the impact of the reading order issue. We adopt LayoutLMv3-base (Huang et al., 2022) and LayoutMask (Tu et al., 2023) as the backbones to integrate with TPP or token classification heads, as they are the SOTA document transformers at base level that use "vision+text+layout" and "text+layout" modalities, respectively. For VrD-EL and VrD-ROP, we introduce the baseline methods with which we compare the proposed TPP in Appendix B. The implementation details of the experiments are illustrated in Appendix C.

5.3 Evaluation on VrD-NER

In this section, we display and discuss the performance of different methods on VrD-NER in real-world scenarios. In all, the effectiveness of TPP is demonstrated both as an independent VrD-NER model, and as a pre-processing mechanism to reorder inputs for sequence-labeling models. By both means, TPP outperforms the baseline models on the two benchmarks, across both of the integrated backbones. It is concluded from Table 1 that: (1) TPP is effective as an independent VrD-NER model, surpassing the performance of sequence-labeling methods on the two benchmarks for real-world VrD-NER.
(a) VrD-NER on FUNSD-r

Backbone     Method             Pre.    Cont. (%)  F1
LayoutLMv3   Sequence Labeling  None    95.74      78.77
LayoutLMv3   Sequence Labeling  LR      95.53      78.37
LayoutLMv3   Sequence Labeling  TPP_R   97.29      79.72
LayoutLMv3   TPP                None    -          80.40
LayoutMask   Sequence Labeling  None    95.74      77.10
LayoutMask   Sequence Labeling  LR      95.53      77.24
LayoutMask   Sequence Labeling  TPP_R   97.29      80.70
LayoutMask   TPP                None    -          78.19

(b) VrD-NER on CORD-r

Backbone     Method             Pre.    Cont. (%)  F1
LayoutLMv3   Sequence Labeling  None    92.10      82.72
LayoutLMv3   Sequence Labeling  LR_C    82.10      70.33
LayoutLMv3   Sequence Labeling  TPP_C   92.43      83.24
LayoutLMv3   TPP                None    -          91.85
LayoutMask   Sequence Labeling  None    92.10      81.84
LayoutMask   Sequence Labeling  LR_C    82.10      68.05
LayoutMask   Sequence Labeling  TPP_C   92.43      81.90
LayoutMask   TPP                None    -          89.34

Table 1: The VrD-NER performance of different methods on FUNSD-r and CORD-r. Pre. denotes the pre-processing mechanism used to re-arrange the input tokens, where LR*/TPP* denotes that input tokens are reordered by a LayoutReader/TPP-for-VrD-ROP model; LR and TPP_R are trained on ReadingBank, while LR_C and TPP_C are trained on CORD. Cont. denotes the continuous entity rate; higher indicates a better pre-processing mechanism. Note that TPP-for-VrD-NER methods do not leverage any reading order information from ground truth annotations or pre-processing mechanism predictions.

Since sequence-labeling methods are adversely affected by disordered inputs, their performance is constrained by the ordered degree of the datasets. In contrast, TPP is not affected by disordered inputs, as tokens are arranged during its prediction. Experiment results show that TPP outperforms sequence-labeling methods on both benchmarks. Particularly, TPP brings about a +9.13 and +7.50 performance gain with LayoutLMv3 and LayoutMask on the CORD-r benchmark. It is noticeable that the real superiority of TPP for VrD-NER might be greater than what is reflected by the results, as TPP actually learns a more challenging task compared to sequence-labeling. While sequence-labeling solely learns to predict entity boundaries, TPP additionally learns to predict the token order within each entity, and still achieves better performance. (2) TPP is a desirable pre-processing mechanism to reorder inputs for sequence-labeling models, while the other VrD-ROP models are not. According to Table 1, LayoutReader and TPP-for-VrD-ROP are evaluated as pre-processing mechanisms by the continuous entity rate of the rearranged inputs. As a representative of current VrD-ROP models, LayoutReader does not perform well as a pre-processing mechanism, as the continuous entity rate of inputs decreases after the rearrangement by LayoutReader on the two disordered datasets. In contrast, TPP performs satisfactorily. For reordering FUNSD-r, TPP_R brings about a +1.55 gain on the continuous entity rate, thereby enhancing the performance of sequence-labeling methods. For reordering CORD-r, the desired reading order of the documents is to read row-by-row, which conflicts with the reading order of ReadingBank documents, which are read column-by-column. Therefore, for a fair comparison on CORD-r, we train LayoutReader and TPP on the original CORD to be used as pre-processing mechanisms, denoted as LR_C and TPP_C. As illustrated in Table 1, TPP_C also improves the continuous entity rate and the prediction performance of sequence-labeling methods, compared with the LayoutReader alternative. Contrary to TPP, we find that using LayoutReader for pre-processing would result in even worse performance on the two benchmarks. We attribute this to the fact that LayoutReader primarily focuses on optimizing BLEU scores on its benchmark, where the model is only required to accurately order 4 consecutive tokens. However, according to Table 4 in Appendix A, the average entity length is 16 and 7 words in the two datasets, and any disordered token within an entity leads to its discontinuity in sequence-labeling. Consequently, LayoutReader may occasionally mispredict long entities even with a correct input order, due to its seq2seq nature, which makes it possible to generate duplicate tokens or miss tokens during decoding.

For a better understanding of the performance of TPP, we conduct ablation studies to determine the best usage of TPP-for-VrD-NER by configuring the TPP and backbone settings. The results and detailed discussions can be found in Appendix D.
Figure 6: Case study of Token Path Prediction for VrD-NER, where entities of different types are distinguished by color, and different entities of the same type are distinguished by the shade of color. Cases shown (columns: Ground Truth, Sequence-labeling, Token Path Prediction): Multi-row Entity, Multi-column Entity, Long Entity, and Entity Type Identification.

Method                              F1
GNN+MLP (Carbonell et al., 2021)    39
SPADE (Hwang et al., 2021)          41.3
Doc2Graph (Gemelli et al., 2022)    53.36
LayoutXLM (Xu et al., 2021b)        54.83
SERA (Zhang et al., 2021)           65.96
BROS (Hong et al., 2022)            71.46
MSAU-PAF (Dang et al., 2021)        75
TPP                                 79.23

Table 2: The VrD-EL performance of different methods on FUNSD.

(a) The average page-level BLEU (%) of methods (higher is better).

Order  Method        r=100%  r=50%  r=0%
OCR    LayoutReader  97.65   97.88  98.19
OCR    TPP           98.18   98.13  98.16
Shfl.  LayoutReader  97.72   97.70  17.83
Shfl.  TPP           98.16   98.09  98.12

(b) The average relative distance (ARD) of methods (lower is better).

Order  Method        r=100%  r=50%  r=0%
OCR    LayoutReader  2.50    2.24   1.75
OCR    TPP           0.29    0.35   0.37
Shfl.  LayoutReader  2.48    2.46   72.94
Shfl.  TPP           0.37    0.39   0.39

Table 3: The VrD-ROP performance of different methods on ReadingBank. For the Order setting, OCR denotes that inputs are arranged left-to-right and top-to-bottom in evaluation; Shfl. denotes that inputs are shuffled in evaluation. r is the proportion of shuffled samples in training.

5.4 Evaluation on Other Tasks

Tables 2 and 3 display the performance of VrD-EL and VrD-ROP methods, which highlights the potential of TPP as a universal solution for VrD-IE tasks. For VrD-EL, TPP outperforms MSAU-PAF, a competitive method, by a large margin of 4.23 on F1 scores. The result suggests that the current labeling scheme for VrD-EL provides adequate information for models to learn from. For VrD-ROP, TPP shows several advantages compared with the SOTA method LayoutReader: (1) TPP is robust to input shuffling. The performance of TPP remains stable when train/evaluation inputs are shuffled, as TPP is unaware of the token input order. However, LayoutReader is sensitive to the input order, and its performance decreases sharply when evaluation inputs are shuffled. (2) In principle, TPP is more suitable to act as a pre-processing reordering mechanism compared with LayoutReader. TPP surpasses LayoutReader by a significant margin on ARD in all six distinct settings, which is mainly attributed to the paradigm differences. Specifically, TPP-for-VrD-ROP is guaranteed to predict a permutation of all tokens as the reading order, while LayoutReader predicts the reading order by generating a token sequence, which carries the risk of missing some tokens in the document; this negatively impacts the ARD metric, and also makes LayoutReader unsuitable for use as a pre-processing reordering mechanism. (3) TPP surpasses LayoutReader in five out of six settings on BLEU. TPP is unaware of the token input order, and its performance across the different settings is influenced only by random aspects. Although LayoutReader has a higher performance in the setting where train and evaluation samples are both not shuffled, this is attributed to the possible overfitting of the encoder of LayoutReader on global 1D signals under this setting, where global 1D signals strongly hint at the reading order.
5.5 Case Study

For a better understanding of the strengths and weaknesses of TPP, we conduct a case study analyzing the prediction results of TPP on VrD-NER. As discussed, TPP-for-VrD-NER makes predictions by modeling token paths. Intuitively, TPP should be good at recognizing entity boundaries in VrDs. To verify this, we visualize the prediction results of cases of interest by TPP and other methods in Figure 6. According to the visualized results, TPP is good at identifying entity boundaries and makes accurate predictions on multi-row, multi-column and long entities. For instance, in the Multi-row Entity case, TPP accurately identifies the entity "5/3/79 10/6/80, Update" in a complex layout. In the Multi-column Entity case, TPP identifies the entity "TOTAL 120,000" despite the interference of the injected texts within the same row. In the Long Entity case, the long entity as a paragraph is accurately and completely extracted, while the sequence-labeling method predicts the paragraph as two entities. Nevertheless, TPP may occasionally misclassify entities as other types, resulting in suboptimal performance. For example, in the Entity Type Identification case, TPP recognizes the entity boundaries of "SEPT 21" and "NOV 9" but fails to predict their correct entity types. This error can be attributed to the over-reliance of TPP on layout features while neglecting the text signals.

6 Conclusion and Future Work

In this paper, we point out the reading order issue in VrD-NER, which can lead to suboptimal performance of current methods in practical applications. To address this issue, we propose Token Path Prediction, a simple and easy-to-implement method which models the task as predicting token paths from a complete directed graph of tokens. TPP can be applied to various VrD-IE tasks with a unified architecture. We conduct extensive experiments to verify the effectiveness of TPP on real-world VrD-NER, where it serves both as an independent VrD-NER model and as a pre-processing mechanism to reorder model inputs. Also, TPP achieves SOTA performance on VrD-EL and VrD-ROP tasks. We also propose two revised VrD-NER benchmarks to reflect the real situations of NER on scanned VrDs. In the future, we plan to further improve our method and verify its effectiveness on more VrD-IE tasks. We hope that our work will inspire future research to identify and tackle the challenges posed by VrD-IE in practical scenarios.

Limitations

Our method has the following limitations:
1. Relatively High Computation Cost: Token Path Prediction involves the prediction of grid labels, and is therefore more computationally heavy than vanilla token classification, though it is still a simple and generalizable method for VrD-IE tasks. For training on FUNSD-r, LayoutLMv3/LayoutMask with Token Path Prediction requires 2.45x/2.75x longer training time and 1.56x/1.47x higher peak memory occupancy compared to vanilla token classification. The impact is significant when training Token Path Prediction on general-purpose large-scale datasets.
2. Lack of Evaluation on Real-world Benchmarks: To illustrate the effectiveness of Token Path Prediction on information extraction in real-world scenarios, we rely on adequate downstream datasets. However, benchmarks are still missing for IE tasks other than NER, and for languages other than English. We appeal for more works to propose VrD-IE and VrDU benchmarks that reflect real-world scenarios; with them, the demonstration could be more concrete that our method is capable of various VrD tasks, compatible with document transformers of different modalities and multiple languages, and open to future work on layout-aware representation models.

Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful comments. This work was partially funded by the National Natural Science Foundation of China (No. 62206057, 61976056, 62076069), the Shanghai Rising-Star Program (23QA1400200), the Natural Science Foundation of Shanghai (23ZR1403500), and the Program of Shanghai Academic Research Leader under grant 22XD1401100.
References

Manuel Carbonell, Pau Riba, Mauricio Villegas, Alicia Fornés, and Josep Lladós. 2021. Named entity recognition and relation extraction with graph neural networks in semi structured documents. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 9622–9627. IEEE.

Xiang Dai, Sarvnaz Karimi, Ben Hachey, and Cecile Paris. 2020. An effective transition-based model for discontinuous NER. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5860–5870.

Tuan Anh Nguyen Dang, Duc Thanh Hoang, Quang Bach Tran, Chih-Wei Pan, and Thanh Dat Nguyen. 2021. End-to-end hierarchical relation extraction for generic form understanding. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 5238–5245. IEEE.

Tuan-Anh Nguyen Dang and Dat-Thanh Nguyen. 2021. End-to-end information extraction by character-level embedding and multi-stage attentional U-Net. arXiv preprint arXiv:2106.00952.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

A. R. Dirkson, Suzan Verberne, Wessel Kraaij, E. Holderness, A. Jimeno Yepes, A. Lavelli, A. L. Minard, J. Pustejovsky, and F. Rinaldi. 2021. FuzzyBIO: a proposal for fuzzy representation of discontinuous entities. In Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis, pages 77–82. Association for Computational Linguistics.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Lukasz Garncarek, Rafal Powalski, Tomasz Stanislawek, Bartosz Topolski, Piotr Halama, Michał Turski, and Filip Graliński. 2020. LAMBERT: Layout-aware language modeling for information extraction. In IEEE International Conference on Document Analysis and Recognition.

Andrea Gemelli, Sanket Biswas, Enrico Civitelli, Josep Lladós, and Simone Marinai. 2022. Doc2Graph: a task agnostic document understanding framework based on graph neural networks. arXiv preprint arXiv:2208.11168.

Zhangxuan Gu, Changhua Meng, Ke Wang, Jun Lan, Weiqiang Wang, Ming Gu, and Liqing Zhang. 2022. XYLayoutLM: Towards layout-aware multimodal networks for visually-rich document understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4583–4592.

Yuxin He and Buzhou Tang. 2022. SetGNER: General named entity recognition as entity set generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3074–3085.

Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, and Sungrae Park. 2022. BROS: A pre-trained language model focusing on text and layout for better key information extraction from documents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10767–10775.

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. LayoutLMv3: Pre-training for document AI with unified text and image masking. In MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10–14, 2022, pages 4083–4091. ACM.

Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. 2019. ICDAR2019 competition on scanned receipt OCR and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE.

Wonseok Hwang, Jinyeong Yim, Seunghyun Park, Sohee Yang, and Minjoon Seo. 2021. Spatial dependency parsing for semi-structured document information extraction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 330–343.

Hiroshi Inoue. 2019. Multi-sample dropout for accelerated training and better generalization. arXiv preprint arXiv:1905.09788.

Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. FUNSD: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), volume 2, pages 1–6. IEEE.

Sven Kreiss, Lorenzo Bertoni, and Alexandre Alahi. 2019. PifPaf: Composite fields for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11977–11986.

Chenliang Li, Bin Bi, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, and Luo Si. 2021a. StructuralLM: Structural pre-training for form understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6309–6318.

Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, et al. 2022a. PP-OCRv3: More attempts for the improvement of ultra lightweight OCR system. arXiv preprint arXiv:2206.03001.

Fei Li, ZhiChao Lin, Meishan Zhang, and Donghong Ji. 2021b. A span-based model for joint overlapped and discontinuous named entity recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4814–4828.

Jingye Li, Hao Fei, Jiang Liu, Shengqiong Wu, Meishan Zhang, Chong Teng, Donghong Ji, and Fei Li. 2022b. Unified named entity recognition as word-word relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10965–10973.

Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, and Furu Wei. 2022c. DiT: Self-supervised pre-training for document image transformer. In MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10–14, 2022, pages 3530–3539. ACM.

Yulin Li, Yuxi Qian, Yuechen Yu, Xiameng Qin, Chengquan Zhang, Yan Liu, Kun Yao, Junyu Han, Jingtuo Liu, and Errui Ding. 2021c. StrucTexT: Structured text understanding with multi-modal transformers. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1912–1920.

Jiang Liu, Donghong Ji, Jingye Li, Dongdong Xie, Chong Teng, Liang Zhao, and Fei Li. 2022. TOE: A grid-tagging discontinuous NER model enhanced by embedding tag/word relations and more fine-grained tags. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:177–187.

Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. 2019. CORD: a consolidated receipt dataset for post-OCR parsing. In Workshop on Document Intelligence at NeurIPS 2019.

Qiming Peng, Yinxu Pan, Wenjin Wang, Bin Luo, Zhenyu Zhang, Zhengjie Huang, Teng Hu, Weichong Yin, Yongfeng Chen, Yin Zhang, et al. 2022. ERNIE-Layout: Layout knowledge enhanced pre-training for visually-rich document understanding. arXiv preprint arXiv:2210.06155.

Lance A. Ramshaw and Mitchell P. Marcus. 1999. Text chunking using transformation-based learning. Natural Language Processing Using Very Large Corpora, pages 157–176.

Tomasz Stanisławek, Filip Graliński, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, and Przemysław Biecek. 2021. Kleister: key information extraction datasets involving long documents with complex layouts. In Document Analysis and Recognition – ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part I, pages 564–579. Springer.

Jonathan Stray and Stacey Svetlichnaya. 2020. Project DeepForm: Extract information from documents.

Jianlin Su, Ahmed Murtadha, Shengfeng Pan, Jing Hou, Jun Sun, Wanwei Huang, Bo Wen, and Yunfeng Liu. 2022. Global Pointer: Novel efficient span-based approach for named entity recognition. arXiv preprint arXiv:2208.03054.

Buzhou Tang, Qingcai Chen, Xiaolong Wang, Yonghui Wu, Yaoyun Zhang, Min Jiang, Jingqi Wang, and Hua Xu. 2015. Recognizing disjoint clinical concepts in clinical text using machine learning-based methods. In AMIA Annual Symposium Proceedings, volume 2015, page 1184. American Medical Informatics Association.

Yi Tu, Ya Guo, Huan Chen, and Jinyang Tang. 2023. LayoutMask: Enhance text-layout interaction in multi-modal pre-training for document understanding. arXiv preprint arXiv:2305.18721.

Jiapeng Wang, Lianwen Jin, and Kai Ding. 2022. LiLT: A simple yet effective language-independent layout transformer for structured document understanding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7747–7757.

Yucheng Wang, Bowen Yu, Hongsong Zhu, Tingwen Liu, Nan Yu, and Limin Sun. 2021a. Discontinuous named entity recognition as maximal clique discovery. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 764–774.

Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, and Furu Wei. 2021b. LayoutReader: Pre-training of text and layout for reading order detection. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4735–4744.

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. 2021a. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2579–2591.

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192–1200.

Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, and Furu Wei. 2021b. LayoutXLM: Multimodal pre-training for multilingual visually-rich document understanding. arXiv preprint arXiv:2104.08836.

Wenwen Yu, Chengquan Zhang, Haoyu Cao, Wei Hua, Bohan Li, Huang Chen, Mingyu Liu, Mingrui Chen, Jianfeng Kuang, Mengjun Cheng, et al. 2023. ICDAR 2023 competition on structured text extraction from visually-rich document images. arXiv preprint arXiv:2306.03287.

Yue Zhang, Zhang Bo, Rui Wang, Junjie Cao, Chen Li, and Zuyi Bao. 2021. Entity relation extraction as dependency parsing in visually rich documents. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2759–2768.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451.

A The Annotation Pipeline and Statistics of Revised Datasets

The annotation pipeline is introduced as follows. First, we collect the document images of FUNSD and CORD, and re-annotate the layout automatically with an OCR system. We choose the PP-OCRv3 (Li et al., 2022a) OCR engine, as it is the SOTA solution for general-purpose OCR and can extend to multilingual settings. We process each document image of FUNSD and CORD, keeping layout annotations with more than 0.8 confidence, and keeping only document images with more than 20 valid words. After that, we manually annotate entity mentions on the new layouts as word sequences, based on the original entity annotations. Entity mentions that cannot be matched to a word sequence on the new layout are deprecated. After the annotation, we divide the train/validation/test splits according to the original settings to obtain the revised datasets.

In all, the proposed FUNSD-r and CORD-r datasets consist of 199 and 999 document samples, including the image, layout annotations of segments and words, and labeled entities. Table 4 lists the detailed summary statistics of the proposed datasets. The total numbers of segments, words, entities and entity types, the average lengths of segments and entities, and the sample numbers of the data splits are displayed. Additionally, the continuous entity rate is introduced as the rate of entities whose tokens are continuous in the order of segment words. When fed into a model, continuous entities correspond to continuous token spans within the inputs and are able to be predicted by sequence-labeling methods. Therefore, the continuous entity rate indicates the ordered degree of the document layouts in the dataset.
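A sketch of the filtering step as we read it; the `documents` data layout is our assumption, not the released annotation format.

```python
def filter_ocr_annotations(documents, conf_threshold=0.8, min_words=20):
    """Sketch of the re-annotation filtering described above: keep OCR
    words above the confidence threshold, then keep only documents
    that retain more than `min_words` valid words. `documents` maps
    image ids to lists of (word, box, confidence) OCR results; this
    layout is a hypothetical illustration of the pipeline."""
    kept = {}
    for image_id, ocr_words in documents.items():
        valid = [(w, box) for w, box, conf in ocr_words
                 if conf > conf_threshold]
        if len(valid) > min_words:
            kept[image_id] = valid
    return kept
```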

B Baselines for VrD-EL and VrD-ROP

For VrD-EL, we compare the proposed TPP with the following strong baseline models:
• GNN-based document encoders: GNN+MLP (Carbonell et al., 2021) is the pioneer in adopting a GNN as the document encoder for better modeling of document inputs. Doc2Graph (Gemelli et al., 2022) is an enhanced GNN-based document encoder for VrD tasks.
• Transformer-based document encoders: LayoutXLM (Xu et al., 2021b) is a pre-trained document encoder for multilingual document inputs. BROS (Hong et al., 2022) is a pre-trained document encoder that focuses on achieving better performance on key information extraction tasks.
Figure 7: Examples of disordered layouts in real-world scenarios, taken from (Yu et al., 2023). The cases cover multi-row content; seals with askew or overlapped texts; print distortion; designed layouts with various font sizes; skewness due to photography; and semantic-driven reading order, each shown with its disordered layout, desired reading order, and the reading order produced by OCR.

Dataset   # of Segments  # of Words  Avg. Length of Segment  # of Entities  Avg. Length of Entity  Cont. (%)  # of Entity Types  # of Samples (Train/Val/Test)
FUNSD-r   10,091         166,040     16.45                   7,924          15.21                  95.74      3                  149/-/50
CORD-r    12,582         123,153     7.55                    12,582         9.20                   92.10      30                 799/100/100

Table 4: Statistics of the proposed datasets. Cont. denotes the continuous entity rate, i.e., the rate of entities whose tokens are continuous in the order of segment words. A higher continuous entity rate indicates a more orderly layout.

• Parsing-based methods: SPADE (Hwang et al., 2021) treats VrD-EL as a spatial-aware dependency parsing problem and adopts a spatially-enhanced language model to address the task. SERA (Zhang et al., 2021) also treats the task as dependency parsing, and adopts a biaffine parser together with a document encoder to address the task.
• Detection-based methods: MSAU-PAF (Dang et al., 2021) tackles the task by integrating the MSAU architecture for detection (Dang and Nguyen, 2021) and the PIF-PAF mechanism for link prediction (Kreiss et al., 2019).

For VrD-ROP, we compare the proposed TPP with LayoutReader (Wang et al., 2021b), which is a sequence-to-sequence model achieving strong performance on VrD-ROP. We refer to the performance reported in the original papers of these baseline methods in Tables 2 and 3. The backbones of all compared methods are of base level.

C Implementation Details

The experiments in this work are divided into two parts: experiments on VrD-NER, and experiments on the other tasks.

For VrD-NER, we adopt LayoutLMv3-base (Huang et al., 2022) and LayoutMask (Tu et al., 2023) as the backbones to integrate with Token Path Prediction. The maximum sequence length of textual tokens for both of them is 512. We use Global 1D and Segment 2D position information in LayoutLMv3, and Local 1D and Segment 2D in LayoutMask, according to their preferred settings. We adopt positional residual linking and multi-dropout to improve the efficiency and robustness of training TPP. Positional residual linking adds the 1D positional embeddings of each token to the backbone outputs, to enhance the 1D position signals that are crucial to the task. Multi-dropout duplicates the last fully-connected layer of TPP. Following the setting of Inoue (2019), the duplicated layers are trained with different dropout noises and their weights are shared, which enhances model robustness towards randomness. The effectiveness of these mechanisms is discussed in Appendix D. In fine-tuning, we generally follow the original settings of previous VrD-NER works (Huang et al., 2022; Tu et al., 2023). We use an Adam optimizer, with 1% linear warming-up steps, a 0.1 dropout rate, and a 1e-5 weight decay. For Token Path Prediction, the learning rate is searched from {3e-5, 5e-5, 8e-5}. On FUNSD, the best learning rate is 5e-5/5e-5 for LayoutLMv3/LayoutMask, while on CORD the learning rate is 3e-5/8e-5, respectively. In fine-tuning the compared token classification models, we set the learning rate to 5e-5 in all cases. We adopt positional residual linking and multi-dropout in Token Path Prediction. We fine-tune on FUNSD-r and CORD-r for 1,000 and 2,500 steps, respectively, on 8 Tesla A100 GPUs with a batch size of 16. The maximum number of decoded entities is limited to 100.
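A sketch of the two training tricks as we read them; the module structure and names are ours, not the released implementation.

```python
import torch
from torch import nn

class TPPHeadWithTricks(nn.Module):
    """Hypothetical sketch of positional residual linking and
    multi-dropout as described above; not the released code."""

    def __init__(self, hidden_size, num_copies=4, p=0.1):
        super().__init__()
        self.out = nn.Linear(hidden_size, hidden_size)  # weights shared
        self.dropouts = nn.ModuleList(
            nn.Dropout(p) for _ in range(num_copies))

    def forward(self, hidden_states, pos_embeddings):
        # Positional residual linking: re-inject the 1D positional
        # embeddings into the backbone outputs before scoring.
        h = hidden_states + pos_embeddings
        # Multi-dropout: apply the shared output layer under several
        # dropout noises and average; the original multi-sample
        # dropout (Inoue, 2019) averages the per-sample losses, so
        # averaging outputs here is a simplification.
        outs = [self.out(drop(h)) for drop in self.dropouts]
        return torch.stack(outs, dim=0).mean(dim=0)
```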
Besides VrD-NER, we conduct experiments on VrD-EL and VrD-ROP, adopting the LayoutMask backbone. For VrD-EL, TPP-for-VrD-EL is fine-tuned for 1,000 steps with a learning rate of 8e-5 and a batch size of 16, with the learning rate of the Global Pointer weights set to 10x for better convergence. For VrD-ROP, TPP-for-VrD-ROP is fine-tuned for 100 epochs with a learning rate of 5e-5 and a batch size of 16, following the settings of Wang et al. (2021b). The beam size is set to 8 during decoding. For all experiments, we choose the model checkpoint with the best performance on the validation set, and report its performance on the test set.

(a) Ablation study on different 1D and 2D positional encoding choices of backbones.

Backbone     1D      2D       FUNSD-r  CORD-r
LayoutLMv3   Global  Segment  80.40    91.85
LayoutLMv3   None    Segment  74.15    89.91
LayoutLMv3   Global  Word     77.86    90.45
LayoutMask   Local   Segment  78.19    89.34
LayoutMask   Global  Segment  66.24    87.36
LayoutMask   None    Segment  67.81    82.27
LayoutMask   Local   Word     75.96    89.45

(b) Ablation study on the design of Token Path Prediction. Positional residual linking and multi-dropout are abbreviated as res. and mul., respectively.

Backbone     Method             FUNSD-r  CORD-r
LayoutLMv3   TPP                80.40    91.85
LayoutLMv3   (w/o mul.)         79.61    90.99
LayoutLMv3   (w/o mul. & res.)  78.93    91.14
LayoutMask   TPP                78.19    89.34
LayoutMask   (w/o mul.)         75.34    87.96
LayoutMask   (w/o mul. & res.)  65.53    89.55

Table 5: Ablation studies of TPP-for-VrD-NER on the positional encoding choices of backbones and the model design.

D Ablation Studies to TPP on VrD-NER

For a better understanding of TPP on VrD-NER, we pose two questions: (1) from which positional encoding setting of the backbones does TPP benefit most, and (2) by what means are positional residual linking and multi-dropout helpful to TPP. We conduct ablation studies to answer these questions.

For the first question, we alter the choices of 1D and 2D positional encoding of the backbones, and the results are reported in Table 5a. For FUNSD-r, the preferred positional encoding setting of the backbones brings about the best performance. Although TPP is unaware of the order of tokens, the incorporation of better 1D positional signals can enhance the contextualized representation generated by the backbones, leading to improved performance. For CORD-r, we note that on some occasions, using word-level 2D position information brings about better performance. This is because short entities are the majority in CORD-r, especially entities with only one character, such as numbers and symbols. For these entities, word-level 2D position information provides a direct hint to the model in training and prediction, resulting in better performance.

For the second question, we conduct an ablation study by removing the positional residual linking and multi-dropout of TPP. We observe from Table 5b that: (1) Positional residual linking and multi-dropout are both essential, given their effectiveness on the results of FUNSD-r. (2) Positional residual linking enhances the 1D position signals in model inputs. Compared with vanilla TPP, TPP with positional residual linking behaves better on FUNSD-r but slightly worse on CORD-r. This is because the segments in CORD-r are relatively short; when the inputs are disordered, the 1D signals are much noisier and negatively affect model performance, since they do not provide sufficient information. The results verify the function of positional residual linking as an enhancement mechanism for 1D signals. (3) Multi-dropout enhances model robustness to random aspects. TPP outperforms its variant without multi-dropout across different backbones and different benchmarks.
