Dongmei Zhang
Microsoft Research
dongmeiz@microsoft.com
ABSTRACT
We propose TUTA, a unified pre-training architecture for understanding generally structured tables. Noticing that understanding a table requires spatial, hierarchical, and semantic information, we enhance transformers with three novel structure-aware mechanisms. First, we devise a unified tree-based structure, called a bi-dimensional coordinate tree, to describe both the spatial and hierarchical information of generally structured tables. Upon this, we propose tree-based attention and position embedding to better capture spatial and hierarchical information. Moreover, we devise three progressive pre-training objectives to enable representations at the token, cell, and table levels. We pre-train TUTA on a wide range of web and spreadsheet tables and fine-tune it on two critical tasks of table structure understanding: cell type classification and table type classification. Experiments show that TUTA is highly effective, achieving state-of-the-art on five widely-studied datasets.

CCS CONCEPTS
• Information systems → Information retrieval.

KEYWORDS
self-supervision; transformer; generally structured table

ACM Reference Format:
Zhiruo Wang*, Haoyu Dong*†, Ran Jia, Jia Li, Zhiyi Fu, Shi Han, and Dongmei Zhang. 2021. TUTA: Tree-based Transformers for Generally Structured Table Pre-training. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3447548.3467434

1 INTRODUCTION
Tables, as a key structure to organize and present data, are widely used in webpages, spreadsheets, PDFs, and more. Unlike sequential NL text, tables usually arrange cells in bi-dimensional matrices with stylistic formatting such as borders, merges, alignment, and bold fonts. As Figure 1 shows, tables are flexible, with various structures such as relational, entity, and matrix tables [30]. To present data more effectively in the bi-dimensional space, real-world tables often use hierarchical structures [2, 3, 13]. For example, Figure 1 (d) shows a matrix spreadsheet table with both hierarchical top and left headers ("B1:E" and "A3:A2"), shaped by merged cells and indents respectively, to organize data in a compact and hierarchical way for easy look-up or side-by-side comparison.

Tables store a great amount of high-value data and gain increasing attention from the research community. There is a flurry of research on table understanding tasks, including entity linking in tables [1, 8, 35, 47], column type identification of tables [4, 20], answering natural language questions over tables [21, 32, 41, 42], and generating data analysis for tables [48]. However, the vast majority of these works focus only on relational tables, which account for merely 0.9% of commonly crawled web tables¹ and 22.0% of spreadsheet tables [3]. They overlook other widely used table types such as matrix tables and entity tables. This leads to a large gap between cutting-edge table understanding techniques and variously structured real-world tables. It is therefore important to enable table understanding for variously structured tables and take a critical step toward mitigating this gap. There have been several attempts [2, 3, 10, 18, 24] to identify table hierarchies and cell types in order to extract relational data from variously structured tables. However, labeling such structural information is very time-consuming and labor-intensive, which greatly challenges machine learning methods that are rather data-hungry.

Motivated by the success of large-scale pre-trained language models (LMs) [9, 34] on numerous NL tasks, one promising way to mitigate the label-shortage challenge is self-supervised pre-training on large volumes of unlabeled tables. [21, 41] target question answering over relational tables via joint pre-training of tables and their textual descriptions. [8] attempts to pre-train embeddings on

* Equal contribution. Work done during Zhiruo's internship at Microsoft Research.
† Corresponding author.
¹ http://webdatacommons.org/webtables/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
KDD '21, August 14–18, 2021, Virtual Event, Singapore
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8332-5/21/08. . . $15.00
https://doi.org/10.1145/3447548.3467434
Research Track Paper KDD ’21, August 14–18, 2021, Virtual Event, Singapore
Figure 2: An example to illustrate our proposed tree coordinates and tree distance for generally structured tables. In this
example, both the top tree and the left tree contain three levels. Cell “A6” (‘Urinary tract’) is the ‘parent’ node of cells “A7”
and “A8”, and has ‘brothers’ including “A3” and “A9”.
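The parent and sibling relations described in the caption can be made concrete programmatically. Below is a minimal sketch that derives each node's tree coordinate (its path of child indexes from the root) from a parent-to-children mapping; the left tree here is hypothetical, with a sibling order merely assumed for illustration, so the concrete coordinates may differ from those in Figure 2:

```python
def tree_coordinates(children, root="root"):
    """Assign each node a coordinate: the sequence of child indexes
    on the path from the root down to that node."""
    coords = {root: ()}
    stack = [root]
    while stack:
        node = stack.pop()
        for i, child in enumerate(children.get(node, [])):
            coords[child] = coords[node] + (i,)
            stack.append(child)
    return coords

# Hypothetical left tree: "A6" is the parent of "A7" and "A8",
# with "A3" and "A9" as its siblings (order assumed).
left = {"root": ["A3", "A6", "A9"], "A6": ["A7", "A8"]}
```

Under this assumed ordering, `tree_coordinates(left)["A8"]` is `(1, 1)`: the second child of the root's second child.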
are widely used in government, finance, accounting, and medical domains, and hence have large quantity and high quality.

We collect web tables from Wikipedia (WikiTable) and the WDC WebTable Corpus [27], and spreadsheet tables from more than 13.5 million web-crawled spreadsheet files. We end up building a total of 57.9 million tables, a much greater volume than the datasets used by other table pre-training methods. More details of our data collection, filtering, and pre-processing can be found in Appendix A.

2.2 Bi-dimensional coordinate tree
As introduced in Section 1, tables usually contain headers on top or left to describe other cells [40, 43]. For example, vertical/horizontal tables usually contain top/left headers, and matrix tables usually contain both top and left headers [30]. Also, a table is regarded as hierarchical if its header exhibits a multi-layer tree structure [3, 28]. If not, it reduces to a flat table without any hierarchy. To generally model various tables using a unified data structure, we propose a novel bi-dimensional coordinate tree that jointly describes the cell location and header hierarchy.

Tree-based position We define the bi-dimensional coordinate tree to be a directed tree, on which each node has a unique parent and an ordered finite list of children. It has two orthogonal sub-trees: a top tree and a left tree. For a node in the top/left tree, its position can be derived from its path from the top/left tree root. Each table cell maps to respective nodes on the top and left trees. The positions of its top and left nodes combine into a unique bi-dimensional tree coordinate. As shown in Figure 2 (a), the left coordinate of cell "A6" is <2> since it is the third child of the left root; successively, "D8" is the second child of "A6" and thus has a left coordinate of <2,1>. Using similar methods, "D8" has a top coordinate of <2,0>.

Tree-based distance In the top/left tree, each pair of nodes connects through a path, that is, a series of steps, with each step either moving up to the parent or going down to a child. Motivated by [36], we define the top/left tree distance of every node pair to be the number of steps on the shortest path between them. Then, the bi-tree distance of two cells is the sum of their top and left tree distances. Figure 2 (b) exemplifies distances against cell "A6" ('Urinary tract') in the left header. (1) Since "A6" is the parent node of "A8" ('Bladder'), they are only 1 step apart. (2) "A6" is the 'uncle node' of cell "A10" ('Larynx'), so connecting them goes "A6 → root → A9 → A10" on the left tree; thus the distance is 3. (3) "A6" and "C2" ('Females') have an even longer distance of 6, for it requires moving both on the top tree (3 steps) and the left tree (3 steps).

Note that our bi-dimensional coordinate applies generally to both hierarchical tables and flat tables. In flat tables, tree coordinates degenerate to rectangular Cartesian coordinates, and the distance between two cells in the same row or column is 2, and otherwise 4. For texts like titles and surrounding NL descriptions, since they are global information for tables, their distances to cells are set to 0. This tree-based distance enables effective spatial and hierarchical data flow via our proposed tree attention in Section 3.3.

Tree extraction includes two steps: (1) detect header regions in tables and (2) extract hierarchies from headers. Since methods of header detection have already achieved high accuracy [10, 15], we adopt a recent CNN-based method [10]. Next, to extract header hierarchies, we consolidate effective heuristics based on common practices of human formatting, such as merged cells in the top header and indentation levels in the left header. Besides, since formulas containing SUM or AVERAGE are also strong indications of hierarchies, we incorporate them in this algorithm as well. This approach has desirable performance and interpretability when processing large unlabeled corpora. More details can be found in Appendix B. Since TUTA is a general table pre-training framework based on bi-trees, one can also employ other methods introduced by [3, 28, 31] to extract headers and hierarchies for TUTA.
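The tree distance defined above is easy to compute from tree coordinates: the number of steps equals the two nodes' depths minus twice the depth of their lowest common ancestor. A minimal sketch, with coordinates as tuples of child indexes as in Figure 2:

```python
def tree_distance(a, b):
    """Steps on the shortest path between two nodes of one tree,
    where a node is given by its coordinate (child indexes from root)."""
    # Depth of the lowest common ancestor = length of the common prefix.
    lca = 0
    while lca < min(len(a), len(b)) and a[lca] == b[lca]:
        lca += 1
    return (len(a) - lca) + (len(b) - lca)

def bi_tree_distance(cell_a, cell_b):
    """Bi-tree distance: sum of top-tree and left-tree distances.
    Cells are (top_coordinate, left_coordinate) pairs."""
    return (tree_distance(cell_a[0], cell_b[0])
            + tree_distance(cell_a[1], cell_b[1]))
```

On a left tree, `tree_distance((2,), (2, 1))` is 1 (a parent and its child, like "A6" and "A8"), and `tree_distance((2,), (3, 0))` is 3 (one step up to the root and two steps down, like "A6" and "A10" under an assumed coordinate assignment). In a flat table, coordinates degenerate to single row/column indexes, so two cells in the same row or column are 2 apart, and 4 apart otherwise.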
3.2 Embedding layer
Compared with NL text, tables involve much richer information, including text, positions, and formats. Tokens are jointly embedded by their: in-table position E_t_pos, in-cell position E_c_pos, token semantics E_tok, numerical properties E_num, and formatting features E_fmt.

In-table position As introduced in Section 2.2, cells use bi-tree coordinates to store hierarchical information. For trees of different depths to get coordinates of unified length, we extend top/left tree coordinates to a predefined max length L, as detailed in Appendix C.1. Next, we compare two embedding methods for tree positions. For one, we adopt the explicit tree embedding in [36]; it can be explicitly computed and directly extended to our bi-tree structure. For another, we use randomly initialized weights as implicit embeddings and train them jointly with the attention layers, as in [9]. Here, the concatenated L-hot vectors for the top and left tree coordinates lie in R^{∑_{i=0}^{L−1} G_i}, W_r, W_c ∈ R^{d_RC × G_{L−1}} are row and column embedding weights, x_r, x_c ∈ R^{G_{L−1}} are one-hot vectors for row and column indexes, G ∈ N^L gives the maximum node degrees of the L tree layers, and ⊕ represents vector concatenation at the last dimension.

In-cell position In-cell position is the index of a token inside a cell. Each position can simply be assigned a trainable embedding: E_c_pos = W_c_pos · x_c_pos, where W_c_pos ∈ R^{H×I} is the learnable weight, x_c_pos is the one-hot position, and I is predefined to be the maximum number of tokens in a cell.

Token & Number The token vocabulary is a finite set of size V = 30,522, so each token can simply be assigned a trainable embedding in W_tok ∈ R^{H×V}. However, numbers constitute an infinite set, so we extract four discrete features and one-hot encode them: magnitude (x_mag), precision (x_pre), first digit (x_fst), and last
digit (x_lst). They are embedded using weights W_mag, W_pre, W_fst, and W_lst, and concatenated at the last dimension:
E_num = (W_mag · x_mag) ⊕ (W_pre · x_pre) ⊕ (W_fst · x_fst) ⊕ (W_lst · x_lst).

[Figure 4 depicts the sizes of the in-table position embeddings: the top tree positional embedding (levels 1 to L, d_TL each) concatenated with the column index embedding (d_RC), and the left tree positional embedding (levels 1 to L, d_TL each) concatenated with the row index embedding (d_RC).]
Figure 4: Size specification of in-table position embeddings.

Format Formats are very helpful for humans to understand tables, so we use format features to signify whether a cell has a merge, a border, a formula, a bold font, a non-white background color, and a non-black font color. See the detailed feature list in Appendix A. We get the joint formatting embedding by linearly transforming these cell-level features: E_fmt = W_fmt · x_fmt + b_fmt, where W_fmt ∈ R^{H×F}, b_fmt ∈ R^H, and F is the number of formatting features.

Eventually, the embedding of a token is the sum of all components:
E = E_tok + E_num + E_c_pos + E_t_pos + E_fmt
More detailed specifications of the embedding layers can be found in Appendix C.2.

3.3 Tree attention
Based on the bi-tree coordinates and distance (Section 2.2), we now introduce tree-based attention. Self-attention is considered more parallelizable and efficient than structure-specific models like tree-LSTMs and RNNs [29, 36], but it is challenged by lots of 'distraction' in large bi-dimensional tables: in its general formulation, self-attention allows every token to attend to every other token, without any hints about structure (e.g., whether two tokens appear in the same cell, appear in the same row/column, or have a hierarchical relationship). Since spatial and hierarchical information is highly important for local cells to aggregate their structurally related contexts and ignore noisy contexts, we aim to devise a structure-aware attention mechanism.

Motivated by the general idea of graph attention networks [38], we inject the tree structure into the attention mechanism by performing masked attention, using a symmetric binary matrix M to indicate visibility between tokens. Given the i-th token tok_i in the table sequence and its structural neighborhood SN_i, we set M_{i,j} = 1 if tok_j ∈ SN_i, and otherwise M_{i,j} = 0. Based on the tree distance in Section 2.2, neighbor cells are those within a predefined distance threshold D, and neighbor tokens are those of the same or neighboring cells. A smaller D enforces local cells to be more "focused", but when D is large enough, tree attention works the same as global attention. Succeeding the general attention form of [37] with Q, K, V ∈ R^{H×H}, our tree-based masked attention simply reads: BiTreeAttention(Q, K, V) = Attention(Q, K, V) · M. Different from existing methods such as bottom-up tree-based attention [29] and constituent tree attention [39] in the NL domain, our encoder with tree attention enables top-down, bottom-up, and peer-to-peer data flow in the bi-tree structure. Since TUTA's encoder has N stacked layers with attention depth D, each cell can represent information from its neighborhood with nearly arbitrary depth (N × D). In Section 4, we compare different D values in ablation studies.

3.4 Pre-training objectives
Tables naturally have progressive levels: token, cell, and table levels. To capture table information in such a progressive routine, we devise novel cell- and table-level objectives in addition to the common objective at the token level, as shown in Figure 5.

Masked language modeling (MLM) MLM [9, 25] is widely used in NL pre-training. Besides context learning within each cell by randomly masking individual tokens, we also use a whole-cell masking strategy to capture relationships of neighboring cells. Motivated by [21], we train token representations by predictions on both table cells and text segments. MLM is modeled as a multi-class classification problem for each masked token with a cross-entropy loss L_mlm.

Cell-level cloze (CLC) Cells are the basic units to record text, position, and format, so representations at the cell level are crucial for various tasks such as cell type classification, entity linking, and table question answering. Existing methods [8, 21, 41] represent a cell by taking its token average, yet neglect the format and token order. Hence, we design a novel task at the cell level called cell-level cloze. We randomly select cell strings from the table as candidate choices, and at each blanked position, encourage the model to retrieve its corresponding string. As shown in Figure 5, one can view it as a cell-level one-to-one mapping from blanked positions to candidate strings. Given the representations of blanked positions and candidate strings, we compute pair-wise mapping probabilities using a dot-product attention module [37]. Next, we perform at each blanked position a multi-class classification over candidate strings and compute a cross-entropy loss L_clc. Note that, in our attention mechanism, blanked locations can 'see' their structural contexts, while each candidate string can only 'see' its internal tokens.

Table context retrieval (TCR) Table-level representation is important in two aspects. On one hand, all local cells in a table constitute an overall meaning; on the other hand, titles and text descriptions are informative global information for cell understanding. So we propose a novel objective to learn table representations. Each table is provided with text segments (either positive ones from its own context or negative ones from irrelevant contexts), from which we use the [CLS] representation to retrieve the correct ones. At the implementation level, given the similarity logit of each text segment against the table representation, TCR is a binary classification problem with a cross-entropy loss L_tcr.

The final objective is L = L_mlm + L_clc + L_tcr.

3.5 Pre-training details
Data processing To get sequential inputs for transformer-based models, we iterate over table cells by row and tokenize their strings using the WordPiece tokenizer. Each cell has a special leading token [SEP], while the textual description uses [CLS]. During pre-training, Wiki, WDC, and spreadsheet tables are iterated in parallel to provide diverse table structures and data characteristics.

Model configuration TUTA is an N-layer transformer encoder with a configuration aligned with BERT_BASE. To retain BERT's general semantic representations, TUTA initializes with its token embedding and encoder weights. TUTA pre-trains on Tesla V100 GPUs and consumes about 16M tables in 2M steps. Refer to Appendix C.2 for more details about the data and model.
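The tree-based masked attention of Section 3.3 can be sketched in a few lines. The paper states it as Attention(Q, K, V) · M; the sketch below instead applies M to the pre-softmax scores, which is the common way attention masking is realized and likewise confines each token to its structural neighborhood (NumPy, single head, no learned projections):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bi_tree_attention(Q, K, V, M):
    """Self-attention where token i may only attend to tokens j
    with M[i, j] == 1 (i.e., j lies in i's structural neighborhood SN_i)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(M == 1, scores, -1e9)  # hide out-of-neighborhood tokens
    return softmax(scores) @ V
```

With M as the all-ones matrix this reduces to ordinary global attention; with M as the identity, each token attends only to itself and the output equals V.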
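The cell-level cloze objective reduces, for each blanked position, to a softmax over candidate strings scored by dot products. A minimal sketch of this loss in NumPy (`blank_reprs` and `cand_reprs` stand in for the encoder outputs at blanked positions and for candidate strings; the names are illustrative, not from the paper):

```python
import numpy as np

def clc_loss(blank_reprs, cand_reprs, targets):
    """Cell-level cloze: score every (blank, candidate) pair with a dot
    product, then apply per-blank cross-entropy against the index of
    each blank's true candidate string."""
    logits = blank_reprs @ cand_reprs.T                  # (num_blanks, num_cands)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```

When each blank's representation aligns strongly with its true candidate, the loss approaches zero; mismatched targets drive it up.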
[Figure 5 shows an example table, "Cancer incidence and mortality statistics worldwide in 2018", with the three objectives. Masked language modeling: predict randomly masked tokens from the whole vocabulary. Cell-level cloze: fill the randomly masked cell locations from cell strings (candidates such as A. Mortality, B. Urinary tract, C. Mesothelioma, D. Kidney and renal pelvis, E. Colorectum and anus, F. 725,352, …). Table context retrieval: retrieve table titles and descriptions from randomly selected text snippets (e.g., "Family Households by Size, Type, and Age of Householder.", "It includes major cancer types including skin, respiratory, etc…", "Full-time Law Enforcement Officers.").]
Figure 5: An intuitive example of pre-training objectives.

[Figure 6 shows, on the left, a household-size table (columns Age, Household Size, Number, Percent; e.g., "20 to 24 / Two people / 436 / 42.7%") and, on the right, the extracted relational tuples, with the cell types Metadata, Top attribute, Left attribute, Data, Derived, and Notes, and the header roles Index, Index name, and Value name.]
Figure 6: An example to illustrate cell types through relational data extraction. The table on the left is a real case in SAUS (simplified due to space); on the right are the corresponding relational tuples. Different cell types (highlighted by different colors and shapes) play different roles in the process of relational data extraction.

Table 2: Statistics of CTC datasets.

                              WebSheet   SAUS    DeEx    CIUS
Number of labeled tables      3,503      221     284     248
Number of labeled cells       1,075k     192k    711k    216k
Avg. number of rows           33.2       52.5    220.2   68.4
Avg. number of columns        9.9        17.7    12.7    12.7
Prop. of hierarchical tables  53.7%      93.7%   43.7%   72.1%
Prop. of hierarchical top     35.8%      68.8%   28.9%   46.8%
Prop. of hierarchical left    29.3%      76.0%   29.2%   30.2%

4 EXPERIMENTS
Understanding the semantic structures of tables is the initial and critical step for plenty of tasks on tables. Mainly two tasks receive broad attention: (1) identifying the structural type of table cells in the Cell Type Classification (CTC) task; and (2) categorizing the integrated table structure in Table Type Classification (TTC).

In this section, we first evaluate TUTA on CTC and TTC against competitive baselines. Then, we perform ablation studies to show the effectiveness of the critical modules of TUTA. Last, we analyze typical cases to present more insights into the function of TUTA.

4.1 Cell Type Classification (CTC)
To understand table structures, a key step is to identify fine-grained cell types. CTC has been widely studied [10, 18, 19, 24, 33] with several well-annotated datasets. It requires models to capture both semantic and structural information; therefore, we use CTC to validate the effectiveness of TUTA.

Datasets Existing CTC datasets include WebSheet [10], deexcelerator (DeEx) [24], SAUS [18], and CIUS [18]. They are collected from different domains (financial, business, agricultural, healthcare, etc.), and thus contain tables with various structures and semantics. Table 2 shows statistics of table size and structure in each annotated dataset. Note that they adopt two different taxonomies of cell types. DeEx, SAUS, and CIUS categorize cells into general types: metadata (MD), notes (N), data (D), top attribute (TA), left attribute (LA), and derived (B). To enable automatic relational data extraction, WebSheet further defines three fine-grained semantic cell types inside table headers, namely index, index name, and value name [10]. As Figure 6 shows, cells of different types play different roles in the process of relational data extraction.

Baselines We compare TUTA with five strong baselines to verify its effectiveness. CNN_BERT [10], RNN_C+S [18], and PSL_RF [33] are three top existing methods for CTC. CNN_BERT is a CNN-based method for both cell classification and table range detection using pre-trained BERT. RNN_C+S is a bidirectional-LSTM-based method for cell classification using pre-trained cell and format embeddings. PSL_RF is a hybrid neuro-symbolic approach that combines embedded representations with probabilistic constraints. We also include two representative table pre-training methods, TAPAS [21] and TaBERT [41]. To ensure an unbiased comparison, we download the pre-trained TAPAS and TaBERT models and fine-tune them using the same CTC head and loss function as TUTA. Note that PSL_RF is a relevant yet very recent work evaluated on DeEx, SAUS, and CIUS, so we include it as a strong baseline; however, since its code is not publicly available, we have not evaluated it on WebSheet.

Fine-tune Tables in the CTC datasets are tokenized, embedded, and encoded in the same way as in Section 3. We use the leading [SEP] of cells to classify their types. Following [10] on WebSheet and [18] on DeEx, SAUS, and CIUS, we use the same train/validation/test sets for TUTA and the baselines, so no test tables/cells appear during training. Please find more details of data processing and model configuration in Appendix C.3.

Experiment results Results are measured using Macro-F1, as it is a common way to evaluate the overall accuracy over multiple classes. As shown in Table 3, TUTA achieves an averaged macro-F1
Table 3: Results of F1-scores on CTC datasets. IN, I, and VN stand for the IndexName, Index, and ValueName types.

(%)        |          WebSheet           | DeEx     | SAUS     | CIUS     | Total
           | IN    I     VN    Macro-F1  | Macro-F1 | Macro-F1 | Macro-F1 | Avg.
CNN_BERT   | 69.9  86.9  78.4  78.4      | 60.8     | 89.1     | 95.1     | 80.9
RNN_C+S    | 75.0  86.6  77.1  79.6      | 70.5     | 89.8     | 97.2     | 84.3
PSL_RF     | -     -     -     -         | 62.7     | 89.4     | 96.7     | -
TaBERT_L   | 76.8  85.2  75.8  79.3      | 50.0     | 78.9     | 92.9     | 75.3
TAPAS_L    | 74.3  88.1  84.6  82.3      | 68.6     | 83.9     | 94.1     | 82.2
TUTA       | 83.4  91.6  84.8  86.6      | 76.6     | 90.2     | 99.0     | 88.1

Table 4: Evaluation results on WCC. R, E, M, L, ND stand for relational, entity, matrix, list, and non-data table types.

(%)        | R     E     M     L      ND    | Macro-F1
DWTC       | 83.0  91.0  73.0  0.0    79.0  | 79.0
TabVec     | 70.0  53.0  47.0  55.0   71.0  | 63.0
TabNet     | 9.0   41.0  9.0   11.0   25.0  | 27.0
TAPAS_L    | 73.3  77.4  90.5  80.5   78.5  | 80.0
TaBERT_L   | 87.3  87.5  70.9  86.0   86.2  | 83.6
TUTA       | 88.7  77.9  81.7  100.0  89.9  | 87.6
of 88.1% on four datasets, outperforming all baselines by a large margin (3.8%+). We observe that RNN_C+S also outperforms TaBERT and TAPAS, probably because RNN_C+S is better at capturing spatial information than TAPAS and TaBERT. TAPAS takes in spatial information by encoding only the row and column indexes, which can be insufficient for hierarchical tables with complicated headers. TaBERT, without joint coordinates from bi-dimensions, employs row-wise attention and subsequent column-wise attention. Due to such indirect position encoding, it does not perform as well on CTC.

In Table 3, we also dig deeper into the F1-scores for each cell type in WebSheet, where TUTA consistently achieves the highest score. Note that WebSheet, to perform relational data extraction, further categorizes header cells into fine-grained types. This is a quite challenging task, for table headers have diverse semantics under complicated hierarchies. Hence, it significantly demonstrates the superiority of TUTA in discriminating fine-grained cell types.

4.2 Table Type Classification (TTC)
In this section, we use a table-level task, TTC, to further validate the effectiveness of TUTA at grasping table-level information. It is a critical task on table structural type classification and also receives lots of attention [7, 14, 17, 26, 30].

Dataset There are various taxonomies for categorizing table types [7, 14, 26]; however, most datasets are not publicly available, except the July 2015 Common Crawl (WCC) in [17]. Fortunately, this dataset is annotated with one of the most general web table taxonomies, introduced by [7]. Tables are categorized into five types: relational (R), entity (E), matrix (M), list (L), and non-data (ND).

Baselines We compare TUTA with five strong baselines on TTC: (1) DWTC [14], a Random Forest method based on expert-engineered global and local features; (2) TabNet [30], a hybrid method combining LSTM and CNN; (3) TabVec [17], an unsupervised clustering method based on expert-engineered table-level encodings. Due to the absence of public code and data for DWTC and TabNet, we borrow their evaluation results on WCC from [17]. (4)(5) Similarly to CTC, we also include TAPAS-large and TaBERT-large as two strong baselines.

Fine-tune We tokenize, embed, and encode WCC tables as in Section 3 and use [CLS] to do multi-type classification. Following the training and testing pipeline in [17], we use 10-fold cross-validation. Read more details about data and parameters in Appendix C.3.

Experiment results Table 4 lists F1-scores on the five table types and overall macro-F1 scores. TUTA outperforms all baselines by a large margin (4.0%+) on macro-F1. The comparison results also show some interesting insights. (1) Although in the CTC task TaBERT performs worse than TAPAS due to neglecting bi-dimensional cell coordinates, in TTC TaBERT outperforms TAPAS by a large margin. We think the underlying reason is that the sequential row-wise and column-wise attention helps TaBERT better capture global information than TAPAS. (2) DWTC achieves a high macro-F1 score but fails to give correct predictions for the List type. After our study, we find that DWTC uses table type definitions different from WCC's, especially for the List type; thus its expert-engineered features are too dedicated to achieve desirable results on WCC. In contrast, representation-learning methods are more transferable from one taxonomy to another. (3) TabNet fails to get reasonable results on most table types [17]. We believe the reason is that the model size of TabNet (a combination of CNN and RNN) is too large to perform well on tiny datasets (WCC has 386 annotated tables in total). In contrast, TUTA, TAPAS, and TaBERT can generalize well to downstream tasks via self-supervised pre-training.

4.3 Ablation studies
To validate the effectiveness of Tree-based Attention (TA), Position Embeddings (PE), and the three pre-training objectives, we evaluate eight variants of TUTA.

We start with TUTA-base, before adding position embeddings, to probe the effect of using and varying TA distances. One variant without TA and three with visible distances of 8, 4, and 2 are tested.
• TUTA-base, w/o TA: cells are globally visible.
• TUTA-base, TA-8/4/2: cells are visible within a distance of 8/4/2.
Upon this, we keep the distance threshold D = 2 and augment TUTA-base with explicit or implicit positional embeddings.
• TUTA-implicit: embed positions using trainable weights.
• TUTA-explicit: compute positional inputs explicitly as in [36].
Further, to measure the contribution of each objective, we respectively remove each from TUTA-implicit (the best variant so far).
• TUTA, w/o MLM; TUTA, w/o CLC; TUTA, w/o TCR

Experiment results Table 5 shows the ablation results on CTC and TTC. It is clear that smaller attention distances help TUTA-base achieve better results. In cell type classification, TUTA-base gets only 79.9% averaged macro-F1, lower than the 82.2% achieved by TAPAS. But as the distance threshold decreases to 2, TUTA-base,TA-2 improves a lot (5.4%) and then outperforms TAPAS by 3.1%.

Further, tree position embeddings improve accuracy as well. In cell type classification, TUTA-implicit gets an 88.1% macro-F1 on average, which is 2.7% higher than its base variant TA-2, and
neural networks have been explored for table understanding and question answering [23, 44, 46]. [12] uses conditional GANs to recommend good-looking formatting for tables. TUTA is the first transformer-based method for general table structure understanding and achieves state-of-the-art results on two representative tasks.

6 CONCLUSION AND DISCUSSION
In this paper, we propose TUTA, a novel structure-aware pre-training model to understand generally structured tables. TUTA is the first transformer-based method for table structure understanding. It employs two core techniques to capture spatial and hierarchical information in tables: tree attention and tree position embeddings. Moreover, we devise three pre-training objectives to enable representation learning at the token, cell, and table levels. TUTA shows a large margin of improvement over all baselines on five widely-studied datasets in downstream tasks.

TUTA is believed to be effective in other tasks like table question answering (QA) and table entity linking as well. Since current datasets of these tasks only consider relational tables, we aim to extend their scope to variously structured tables later on. Gladly, TUTA's superior performance on CTC and TTC soundly proves the effectiveness of structure-aware pre-training.

REFERENCES
[1] Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey. TabEL: entity linking in web tables. In International Semantic Web Conference. Springer, 2015.
[18] Majid Ghasemi Gol, Jay Pujara, and Pedro Szekely. Tabular cell classification using pre-trained cell embeddings. In 2019 IEEE International Conference on Data Mining (ICDM), pages 230–239. IEEE, 2019.
[19] Julius Gonsior, Josephine Rehak, Maik Thiele, Elvis Koci, Michael Günther, and Wolfgang Lehner. Active learning for spreadsheet cell classification. In EDBT/ICDT Workshops, 2020.
[20] Tong Guo, Derong Shen, Tiezheng Nie, and Yue Kou. Web table column type detection using deep learning and probability graph model. In International Conference on Web Information Systems and Applications, pages 401–414. Springer, 2020.
[21] Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Martin Eisenschlos. TaPas: Weakly supervised table parsing via pre-training. arXiv preprint:2004.02349, 2020.
[22] Marcin Kardas, Piotr Czapla, Pontus Stenetorp, Sebastian Ruder, Sebastian Riedel, Ross Taylor, and Robert Stojnic. AxCell: Automatic extraction of results from machine learning papers. arXiv preprint:2004.14356, 2020.
[23] Elvis Koci, Maik Thiele, Wolfgang Lehner, and Oscar Romero. Table recognition in spreadsheets via a graph representation. In 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pages 139–144. IEEE, 2018.
[24] Elvis Koci, Maik Thiele, Josephine Rehak, Oscar Romero, and Wolfgang Lehner. DECO: A dataset of annotated spreadsheets for layout and table recognition. In International Conference on Document Analysis and Recognition. IEEE, 2019.
[25] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint:1901.07291, 2019.
[26] Larissa R Lautert, Marcelo M Scheidt, and Carina F Dorneles. Web table taxonomy and formalization. ACM SIGMOD Record, 42(3):28–33, 2013.
[27] Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. A large public corpus of web tables containing time and context metadata. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 75–76, 2016.
[28] Seung-Jin Lim and Yiu-Kai Ng. An automated approach for retrieving hierarchical data from HTML tables. In Proceedings of the Eighth International Conference on Information and Knowledge Management, pages 466–474, 1999.
[29] Xuan-Phi Nguyen, Shafiq Joty, Steven CH Hoi, and Richard Socher. Tree-structured attention with hierarchical accumulation. arXiv preprint:2002.08046, 2020.
[30] Kyosuke Nishida, Kugatsu Sadamitsu, Ryuichiro Higashinaka, and Yoshihiro Matsuo. Understanding the semantic structures of tables with a hybrid deep neural network architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[2] Zhe Chen and Michael Cafarella. Automatic web spreadsheet data extraction. In [31] Viacheslav Paramonov, Alexey Shigarov, and Varvara Vetrova. Table header
Proceedings of the 3rd International Workshop on Semantic Search over the Web, 2013. correction algorithm based on heuristics for improving spreadsheet data extraction.
[3] Zhe Chen and Michael Cafarella. Integrating spreadsheet data via accurate and In International Conference on Information and Software Technologies. Springer, 2020.
low-effort extraction. In Proceedings of the 20th ACM SIGKDD international conference [32] Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-
on Knowledge discovery and data mining, pages 1126–1135, 2014. structured tables. arXiv preprint:1508.00305, 2015.
[4] Jiaoyan Chen, Ernesto Jiménez-Ruiz, Ian Horrocks, and Charles Sutton. Colnet: [33] Kexuan Sun Harsha Rayudu Jay Pujara. A hybrid probabilistic approach for table
Embedding the semantics of web tables for column type prediction. In Proceedings of understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
the AAAI Conference on Artificial Intelligence, volume 33, pages 29–36, 2019. [34] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving
[5] Jiaoyan Chen, Ernesto Jiménez-Ruiz, Ian Horrocks, and Charles Sutton. Learning language understanding by generative pre-training, 2018.
semantic annotations for tabular data. arXiv preprint:1906.00781, 2019. [35] Dominique Ritze and Christian Bizer. Matching web tables to dbpedia-a feature
[6] Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang utility study. context, 42(41):19–31, 2017.
Li, Xiyou Zhou, and William Yang Wang. Tabfact: A large-scale dataset for table-based [36] Vighnesh Shiv and Chris Quirk. Novel positional encodings to enable tree-based
fact verification. arXiv preprint:1909.02164, 2019. transformers. In Advances in Neural Information Processing Systems, 2019.
[7] Eric Crestan and Patrick Pantel. Web-scale table census and classification. In [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Proceedings of international conference on Web search and data mining, 2011. Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.
[8] Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. Turl: Table under- In Advances in neural information processing systems, pages 5998–6008, 2017.
standing through representation learning. arXiv preprint:2006.14806, 2020. [38] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint:1710.10903, 2017.
training of deep bidirectional transformers for language understanding. arXiv [39] Yau-Shian Wang, Hung-Yi Lee, and Yun-Nung Chen. Tree transformer: Integrat-
preprint:1810.04805, 2018. ing tree structures into self-attention. arXiv preprint:1909.06639, 2019.
[10] Haoyu Dong, Shijie Liu, Zhouyu Fu, Shi Han, and Dongmei Zhang. Semantic [40] Xinxin Wang. Tabular abstraction, editing, and formatting. 2016.
structure extraction for spreadsheet tables with a multi-task learning architecture. [41] Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. Tabert:
In Workshop on Document Intelligence at NeurIPS 2019, 2019. Pretraining for joint understanding of textual and tabular data. arxiv, 2020.
[11] Haoyu Dong, Shijie Liu, Shi Han, Zhouyu Fu, and Dongmei Zhang. Tablesense: [42] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James
Spreadsheet table detection with convolutional neural networks. In Proceedings of Ma, Irene Li, et al. Spider: A large-scale human-labeled dataset for complex and
the AAAI Conference on Artificial Intelligence, volume 33, pages 69–76, 2019. cross-domain semantic parsing and text-to-sql task. arXiv preprint:1809.08887, 2018.
[12] Haoyu Dong Dong, Jinyu Wang, Zhouyu Fu, Shi Han, and Dongmei Zhang. [43] Richard Zanibbi, Dorothea Blostein, and James R Cordy. A survey of table
Neural formatting for spreadsheet tables. In Proceedings of the 29th ACM International recognition. Document Analysis and Recognition, 7(1):1–16, 2004.
Conference on Information & Knowledge Management, pages 305–314, 2020. [44] Vicky Zayats, Kristina Toutanova, and Mari Ostendorf. Representations for
[13] Wensheng Dou, Shi Han, Liang Xu, Dongmei Zhang, and Jun Wei. Expandable question answering from documents with tables and text. ArXiv:2101.10573, 2021.
group identification in spreadsheets. In Proceedings of the 33rd ACM/IEEE International [45] Li Zhang, Shuo Zhang, and Krisztian Balog. Table2vec: Neural word and entity
Conference on Automated Software Engineering, pages 498–508, 2018. embeddings for table population and retrieval. In Proceedings of International ACM
[14] Julian Eberius, Katrin Braunschweig, and Others. Building the dresden web table SIGIR Conference on Research and Development in Information Retrieval, 2019.
corpus: A classification approach. In 2015 IEEE/ACM 2nd International Symposium on [46] Xingyao Zhang, Linjun Shou, Jian Pei, Ming Gong, Lijie Wen, and Daxin Jiang.
Big Data Computing (BDC), pages 41–50. IEEE, 2015. A graph representation of semi-structured data for web question answering. arXiv
[15] Jing Fang, Prasenjit Mitra, Zhi Tang, and C Lee Giles. Table header detection and preprint:2010.06801, 2020.
classification. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012. [47] Chen Zhao and Yeye He. Auto-em: End-to-end fuzzy entity-matching using
[16] Besnik Fetahu, Avishek Anand, and Maria Koutraki. Tablenet: An approach for pre-trained deep models and transfer learning. In The World Wide Web Conference,
determining fine-grained relations for wikipedia tables. In The World Wide Web pages 2413–2424, 2019.
Conference, pages 2736–2742, 2019. [48] Mengyu Zhou, Wang Tao, Ji Pengxin, and Others. Table2analysis: Modeling
[17] Majid Ghasemi-Gol and Pedro Szekely. Tabvec: Table vectors for classification of and recommendation of common analysis patterns for multi-dimensional data. In
web tables. arXiv preprint:1802.06290, 2018. Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
Research Track Paper KDD ’21, August 14–18, 2021, Virtual Event, Singapore
4 https://github.com/bfetahu/wiki_tables
5 By estimation, there are around 800 million spreadsheet users. https://medium.com/grid-spreadsheets-run-the-world/excel-vs-google-sheets-usage-nature-and-numbers-9dfa5d1cadbd
6 https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/
7 https://github.com/ClosedXML/ClosedXML

C EXPERIMENT DETAILS

C.1 Data pre-processing
General pre-processing In order to control sequences within a reasonable length and prevent unhelpful numerical information,
we: (1) perform a heuristic sampling over cells in the data region, since they often express similar numeric values yet introduce limited semantics; by classifying data cells into text-dominant and value-dominant types, we sample out 50% of the former and 90% of the latter; and (2) take at most 8 tokens from a cell and 64 tokens from a text segment, given empirical studies showing that 99% of cases meet this condition.

Masked Language Model (MLM) When selecting tokens to be masked, we employ a hybrid strategy of token-wise and whole-cell masking. More specifically, we first randomly select 15% of the table cells; then, for each selected cell, we either (1) mask one random token with a 70% probability, or (2) mask all of the tokens in that cell with a 30% probability.

Cell-Level Cloze (CLC) When choosing cells to be dug out, we assign a higher probability to cells in table headers than to those in the data region. Meanwhile, for more attentive training on structural information, we apply two strategies at cell selection: in one, we randomly choose cells on the same sub-tree of our bi-dimensional coordinate tree; in the other, cells are randomly taken from different top header rows and left header columns. In all, around 20% of the cells are selected in each table.

Table Context Retrieval (TCR) We consider table titles and surrounding descriptions to be relevant texts of tables. We split these texts into segments, then provide each table with both positive segments (from its own context) and negative segments (from other tables' contexts). Each table has at most three positive and three negative segments. We randomly choose one positive segment and put it after the leading token [CLS]. The other segments are assigned leading [SEP] tokens and padded after the sequence of table tokens. Note that in our tree-based attention, [CLS] learns table-level representations and can 'see' all table cells, while the [SEP] of each text segment learns text-segment representations and only 'sees' its internal tokens. Please find more details on the computation of the attention visibility matrix in our codebase (footnote 8).

C.2 Model Configuration
Embedding details Here we specify the embedding modules.

For in-table positions, we empirically set upper bounds for table size and header shape. To control the maximum supported tree size, we define a maximum number of tree layers L = 4 and, at deeper levels, an increasing number of degrees (from 32 and 32 to 64 and 256), giving a vector of maximum node degrees G ∈ N^L over tree layers. In other words, a top/left header has at most L = 4 rows/columns, while within the i-th row/column there are at most G_i cells, where G = [32, 32, 64, 256] and i ∈ [1, 4]. Note that the last supported degree, 256, accords with both the maximum number of rows and the maximum number of columns, allowing a compatible conversion from a hierarchical coordinate to a flat setting. When encountering large tables, we can split them into several smaller ones (sharing the same top or left header) based on detected table headers. When transforming an in-table position into embeddings, each index of the tree coordinate is mapped to a dimension of size d_TL = 72, while row and column indices are embedded to size d_RC = 96.

For internal position embeddings, we keep the first I = 8 tokens in each cell and the first I = 64 tokens in each text segment, given empirical studies on our corpus showing that over 99% of cell strings and texts satisfy this condition. Note that though few (< 1%), it is necessary to trim the length-exceeding cases, since an overly long string often introduces noise and causes computational inefficiency.

To embed numbers, we use four features: magnitude x_mag ∈ [0, M], precision x_pre ∈ [0, P], the first digit x_fst ∈ [0, F], and the last digit x_lst ∈ [0, L], where M, P, F, L = 10. Correspondingly, their embedding weights are W_mag ∈ R^(M×H/4), W_pre ∈ R^(P×H/4), W_fst ∈ R^(F×H/4), and W_lst ∈ R^(L×H/4), with hidden size H = 768.

For format embeddings, the F = 11 integer cell features x ∈ N^F (as listed in Table 6) undergo a linear transformation E_fmt = W_fmt · x + b with weight W_fmt ∈ R^(F×H) and bias b ∈ R^H.

Pre-training hyper-parameters Pre-training TUTA has two stages. First, we use table sequences with no more than 256 tokens and a batch size of 12 for the first 1M training steps. Then, we extend the supported sequence length to 512 and continue training for another 1M steps with a batch size of 4. These procedures are implemented with PyTorch distributed training. We train TUTA and its variants on 64 Tesla V100 GPUs, and each TUTA variant takes nine days on four Tesla V100 GPUs.

C.3 Downstream tasks
Cell Type Classification (CTC) We use a fine-tuning head for this multi-class classification problem. Given the encoder's outputs, we use the leading [SEP] of each cell to represent the whole cell, then perform linear transformations to compute logits over cell types: p_c = W_2 · gelu(W_1 · r_c + b_1) + b_2, where r_c ∈ R^H is the cell representation, W_1 ∈ R^(H×H) and W_2 ∈ R^(C×H) are weights, b_1 ∈ R^H and b_2 ∈ R^C are biases, and C is the number of cell types. We compute the cross-entropy loss against labels.

When splitting datasets into train, validation, and test sets, we split in a table-wise rather than cell-wise manner. Since all cells of a table are always considered together, none of the cells in test tables have been used for training.

At fine-tuning, we follow [10] and tune our model for 4 epochs on WebSheet. For DeEx, SAUS, and CIUS, since the number of tables is not as big, we separately tune TUTA for 100 epochs on five randomly split folds and report the average macro F1. All experiments set the batch size to 4 and the learning rate to 8e−6.

Table Type Classification (TTC) Similar to CTC, we use a dedicated head to classify table sequences into types. Taking the leading [CLS] of each table to represent the whole table, we compute logits over table types: p_t = W_2 · gelu(W_1 · r_t + b_1) + b_2, where the table representation r_t ∈ R^H, weights W_1 ∈ R^(H×H) and W_2 ∈ R^(C_t×H), biases b_1 ∈ R^H and b_2 ∈ R^(C_t), and the number of table types C_t = 5. Finally, we compute the cross-entropy loss against ground truths.

To ensure an unbiased comparison, we also fine-tune TAPAS and TaBERT using the same TTC head and loss function as TUTA. Since they do not specify the details of dataset segmentation, we randomly split tables into train/valid/test sets with a proportion of 8/1/1. Meanwhile, following previous considerations on balancing samples across types, we always split within classes and keep the type distribution of train/valid/test comparable to that of the entire dataset.

During fine-tuning, since TUTA, TAPAS, and TaBERT are all transformer-based models, we use identical hyper-parameters: a batch size of 2, a learning rate of 8e−6, and 4 epochs.
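The CTC and TTC fine-tuning heads described above share the same two-layer form p = W_2 · gelu(W_1 · r + b_1) + b_2. A minimal pure-Python sketch of this computation is shown below; the toy dimensions and constant weights are illustrative only (the actual model uses learned parameters with H = 768), and the helper names are ours, not from the TUTA codebase.

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def affine(W, b, v):
    # Compute W @ v + b, with W stored as a list of rows (shape: out x in).
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def head(r, W1, b1, W2, b2):
    # Two-layer head: p = W2 . gelu(W1 . r + b1) + b2 -- logits over classes.
    return affine(W2, b2, [gelu(h) for h in affine(W1, b1, r)])

# Toy sizes: hidden size H = 4 (the paper uses H = 768), C = 3 classes.
H, C = 4, 3
r = [0.1, -0.2, 0.3, 0.4]                        # a cell/table representation
W1 = [[0.1] * H for _ in range(H)]; b1 = [0.0] * H
W2 = [[0.2] * H for _ in range(C)]; b2 = [0.0] * C
logits = head(r, W1, b1, W2, b2)
print(len(logits))  # → 3, one logit per class
```

The logits would then be fed to a cross-entropy loss against the gold cell or table type, exactly as in the CTC and TTC descriptions.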
8 https://github.com/microsoft/TUTA_table_understanding/tuta/model/backbones.py
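The number embeddings in C.2 rest on four per-number features (magnitude, precision, first digit, last digit), each mapped to an H/4-dimensional embedding and concatenated to size H. The extraction below is a sketch under our own reading of those features, since the text only states their value ranges; the function name and exact definitions (e.g., magnitude as the count of integer digits) are assumptions.

```python
def number_features(s, M=10, P=10):
    """Magnitude, precision, first digit, and last digit of a numeric string.

    NOTE: these exact definitions are our interpretation; the paper only
    specifies the value ranges, with upper bounds M = P = 10.
    """
    digits = [c for c in s if c.isdigit()]
    int_part, _, frac_part = s.lstrip("+-").partition(".")
    magnitude = min(len(int_part.lstrip("0")), M)  # digits before the point
    precision = min(len(frac_part), P)             # digits after the point
    first_digit = int(digits[0]) if digits else 0
    last_digit = int(digits[-1]) if digits else 0
    return magnitude, precision, first_digit, last_digit

print(number_features("1788"))  # → (4, 0, 1, 8)
print(number_features("3.14"))  # → (1, 2, 3, 4)
```

Each of the four integers would then index its own embedding table (W_mag, W_pre, W_fst, W_lst), and the four H/4-sized vectors are concatenated into one H = 768 number embedding.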