[...] supervision consisting of denotations instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition, the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation operator to such selection. TAPAS extends BERT’s architecture to encode tables as input, initializes from an effective joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.

1 Introduction

Question answering from semi-structured tables is usually seen as a semantic parsing task where the question is translated to a logical form that can be executed against the table to retrieve the correct denotation (Pasupat and Liang, 2015; Zhong et al., 2017; Dasigi et al., 2019; Agarwal et al., 2019). Semantic parsers rely on supervised training data that pairs natural language questions with logical forms, but such data is expensive to annotate.

In recent years, many attempts aim to reduce [...] 2017). One prominent data collection approach focuses on weak supervision where a training example consists of a question and its denotation instead of the full logical form (Clarke et al., 2010; Liang et al., 2011; Artzi and Zettlemoyer, 2013). Although appealing, training semantic parsers from this input is often difficult due to the abundance of spurious logical forms (Berant et al., 2013; Guu et al., 2017) and reward sparsity (Agarwal et al., 2019; Muhlgay et al., 2019).

In addition, semantic parsing applications only utilize the generated logical form as an intermediate step in retrieving the answer. Generating logical forms, however, introduces difficulties such as maintaining a logical formalism with sufficient expressivity, obeying decoding constraints (e.g. well-formedness), and the label bias problem (Andor et al., 2016; Lafferty et al., 2001).

In this paper we present TAPAS (for Table Parser), a weakly supervised question answering model that reasons over tables without generating logical forms. TAPAS predicts a minimal program by selecting a subset of the table cells and a possible aggregation operation to be executed on top of them. Consequently, TAPAS can learn operations from natural language, without the need to specify them in some formalism. This is implemented by extending BERT’s architecture (Devlin et al., 2019) with additional embeddings that capture tabular structure, and with two classification layers for selecting cells and predicting a corresponding aggregation operator.

Importantly, we introduce a pre-training method for TAPAS, crucial for its success on the end task. We extend BERT’s masked language model objective to structured data, and pre-train the model over
millions of tables and related text segments crawled from Wikipedia. During pre-training, the model masks some tokens from the text segment and from the table itself, where the objective is to predict [...] training recipe that allows TAPAS to train from weak supervision. For examples that only involve selecting a subset of the table cells, we directly train the model to select the gold subset. For examples that involve aggregation, the relevant cells and the aggregation operation are not known from the denotation. In this case, we calculate an expected soft scalar outcome over all aggregation operators given the current model, and train the model with a regression loss against the gold denotation.

Aggregation prediction (top left):

op      p_a(op)   compute(op, p_s, T)
NONE    0         -
COUNT   0.1       .9 + .9 + .2 = 2
SUM     0.8       .9×37 + .9×31 + .2×15 = 64.2
AVG     0.1       64.2 ÷ 2 = 32.1

s_pred = .1×2 + .8×64.2 + .1×32.1 = 54.8

Cell prediction (top right):

Rank   ...   Days   p_s
1      ...   37     0.9
2      ...   31     0.9
3      ...   17     0
4      ...   15     0.2
...    ...   ...    0

Model input (bottom): [CLS] Tok 1 ... Tok N [SEP] Tok 1 ... Tok M (question, then flattened table), with embeddings E[CLS], E1 ... EN, E[SEP], E’1 ... E’M.

Figure 1: TAPAS model (bottom) with example model outputs for the question: “Total number of days for the top two”. Cell prediction (top right) is given for the selected column’s table cells in bold (zero for others) along with aggregation prediction (top left).
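The arithmetic in Figure 1 can be traced with a short sketch. This is only an illustration of the figure's layout, not the released implementation: the function name `expected_scalar`, the dictionary keys, and the guard against a zero soft count are assumptions made here, and NONE is treated as contributing nothing because the figure shows "-" for it.

```python
def expected_scalar(cell_values, cell_probs, op_probs):
    """Expected soft scalar outcome over aggregation operators (Figure 1 layout)."""
    soft_count = sum(cell_probs)
    soft_sum = sum(p * v for p, v in zip(cell_probs, cell_values))
    # Assumption: return 0 for the average when no probability mass is selected.
    soft_avg = soft_sum / soft_count if soft_count > 0 else 0.0
    per_op = {"COUNT": soft_count, "SUM": soft_sum, "AVERAGE": soft_avg}
    # NONE has no scalar outcome ("-" in Figure 1), so it contributes nothing here.
    return sum(op_probs.get(op, 0.0) * value for op, value in per_op.items())


# Values of the "Days" column and the probabilities shown in Figure 1.
days = [37.0, 31.0, 17.0, 15.0]
p_s = [0.9, 0.9, 0.0, 0.2]
p_a = {"NONE": 0.0, "COUNT": 0.1, "SUM": 0.8, "AVERAGE": 0.1}

s_pred = expected_scalar(days, p_s, p_a)
print(round(s_pred, 2))  # 54.77, which Figure 1 reports as 54.8
```

Because every step is a sum or product of probabilities and cell values, the result stays differentiable with respect to both sets of probabilities, which is what allows a regression loss against the gold denotation.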
In comparison to prior attempts to reason over tables without generating logical forms (Neelakantan et al., 2015; Yin et al., 2016; Müller et al., 2019), TAPAS achieves better accuracy, and holds several advantages: its architecture is simpler as it includes a single encoder with no auto-regressive decoding, it enjoys pre-training, tackles more question types such as those that involve aggregation, and directly handles a conversational setting.

We find that on three different semantic parsing datasets, TAPAS performs better or on par in comparison to other semantic parsing and question answering models. On the conversational SQA (Iyyer et al., 2017), TAPAS improves state-of-the-art accuracy from 55.1 to 67.2, and achieves on par performance on WIKISQL (Zhong et al., 2017) and WIKITQ (Pasupat and Liang, 2015). Transfer learning, which is simple in TAPAS, from WIKISQL to WIKITQ achieves 48.7 accuracy, 4.2 points higher than state-of-the-art. Our code and pre-trained model are publicly available at https://github.com/google-research/tapas.

2 TAPAS Model

Our model’s architecture (Figure 1) is based on BERT’s encoder with additional positional embeddings used to encode tabular structure (visualized in Figure 2). We flatten the table into a sequence of words, split words into word pieces (tokens) and concatenate the question tokens before the table tokens. We additionally add two classification layers for selecting table cells and aggregation operators that operate on the cells. We now describe these modifications and how inference is performed.

Additional embeddings We add a separator token between the question and the table, but unlike Hwang et al. (2019) not between cells or rows. Instead, the token embeddings are combined with table-aware positional embeddings before feeding them to the model. We use different kinds of positional embeddings:

• Position ID is the index of the token in the flattened sequence (same as in BERT).
• Segment ID takes two possible values: 0 for the question, and 1 for the table header and cells.
• Column / Row ID is the index of the column/row that this token appears in, or 0 if the token is a part of the question.
• Rank ID if column values can be parsed as floats or dates, we sort them accordingly and assign an embedding based on their numeric rank (0 for not comparable, 1 for the smallest item, i + 1 for an item with rank i). This can assist the model when processing questions that involve superlatives, as word pieces may not represent numbers informatively (Wallace et al., 2019).
• Previous Answer given a conversational setup where the current question might refer to the previous question or its answers (e.g., question 5 in Figure 3), we add a special embedding that marks whether a cell token was the answer to the previous question (1 if the token’s cell was an answer, or 0 otherwise).

Cell selection This classification layer selects a subset of the table cells. Depending on the selected
Table:

col1   col2
0      1
2      3

Token:     [CLS]  query  ?      [SEP]  col    ##1    col    ##2    0      1      2      3
Position:  POS0   POS1   POS2   POS3   POS4   POS5   POS6   POS7   POS8   POS9   POS10  POS11
Segment:   SEG0   SEG0   SEG0   SEG0   SEG1   SEG1   SEG1   SEG1   SEG1   SEG1   SEG1   SEG1
Column:    COL0   COL0   COL0   COL0   COL1   COL1   COL2   COL2   COL1   COL2   COL1   COL2
Row:       ROW0   ROW0   ROW0   ROW0   ROW0   ROW0   ROW0   ROW0   ROW1   ROW1   ROW2   ROW2
Rank:      RANK0  RANK0  RANK0  RANK0  RANK0  RANK0  RANK0  RANK0  RANK1  RANK1  RANK2  RANK2

Figure 2: Encoding of the question “query?” and a simple table using the special embeddings of TAPAS. The previous answer embeddings are omitted for brevity.
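The ID assignment in Figure 2 can be reproduced with a small sketch. It is a hypothetical helper rather than the released tokenizer: the function `encode`, its input layout, and the tie-handling rule are assumptions, dates are not handled, and the previous answer ID is left out as in the figure.

```python
def encode(question_tokens, header, rows):
    """Return one dict of embedding ids per token of the flattened input.

    header: one list of word pieces per column.
    rows:   list of rows; each cell is (word pieces, numeric value or None).
    """
    # Question tokens (including [CLS] and [SEP]) get 0 for every table-aware id.
    ids = [dict(token=t, segment=0, column=0, row=0, rank=0) for t in question_tokens]
    # Header tokens belong to the table segment, row 0, and are never ranked.
    for col, pieces in enumerate(header, start=1):
        ids += [dict(token=t, segment=1, column=col, row=0, rank=0) for t in pieces]

    # Rank per column: 0 if any value is not comparable, else 1 for the smallest.
    ranks = []
    for col in range(len(header)):
        values = [row[col][1] for row in rows]
        if any(v is None for v in values):
            ranks.append([0] * len(rows))
        else:
            ordered = sorted(set(values))  # assumption: ties share a rank
            ranks.append([ordered.index(v) + 1 for v in values])

    for r, row in enumerate(rows, start=1):
        for c, (pieces, _) in enumerate(row):
            ids += [dict(token=t, segment=1, column=c + 1, row=r, rank=ranks[c][r - 1])
                    for t in pieces]
    # Position id is the index in the flattened sequence, as in BERT.
    for position, entry in enumerate(ids):
        entry["position"] = position
    return ids


# The question and table from Figure 2.
question = ["[CLS]", "query", "?", "[SEP]"]
header = [["col", "##1"], ["col", "##2"]]
rows = [[(["0"], 0.0), (["1"], 1.0)], [(["2"], 2.0), (["3"], 3.0)]]
encoded = encode(question, header, rows)
print([e["column"] for e in encoded])  # [0, 0, 0, 0, 1, 1, 2, 2, 1, 2, 1, 2]
```

Each ID would then index its own embedding table, and the resulting vectors are combined with the token embedding before the encoder.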
aggregation operator, these cells can be the final answer or the input used to compute the final answer. Cells are modelled as independent Bernoulli variables. First, we compute the logit for a token using a linear layer on top of its last hidden vector. Cell logits are then computed as the average over logits of tokens in that cell. The output of the layer is the probability $p_s^{(c)}$ to select cell $c$. We additionally found it useful to add an inductive bias to select cells within a single column. We achieve this by introducing a categorical variable to select the correct column. The model computes the logit for a given column by applying a new linear layer to the average embedding for cells appearing in that column. We add an additional column logit that corresponds to selecting no column or cells. We treat this as an extra column with no cells. The output of the layer is the probability $p_{\text{col}}^{(co)}$ to select column $co$, computed using softmax over the column logits. We set cell probabilities $p_s^{(c)}$ outside of the selected column to 0.

Aggregation operator prediction Semantic parsing tasks require discrete reasoning over the table, such as summing numbers or counting cells. To handle these cases without producing logical forms, TAPAS outputs a subset of the table cells together with an optional aggregation operator. The aggregation operator describes an operation to be applied to the selected cells, such as SUM, COUNT, AVERAGE or NONE. The operator is selected by a linear layer followed by a softmax on top of the final hidden vector of the first token (the special [CLS] token). We denote this layer as $p_a(op)$, where $op$ is some aggregation operator.

Inference We predict the most likely aggregation operator together with a subset of the cells (using the cell selection layer). To predict a discrete cell selection we select all table cells whose probability is larger than 0.5. These predictions are then executed against the table to retrieve the answer, by applying the predicted aggregation over the selected cells.

3 Pre-training

Following the recent success of pre-training models on textual data for natural language understanding tasks, we wish to extend this procedure to structured data, as an initialization for our table parsing task. To this end, we pre-train TAPAS on a large number of tables from Wikipedia. This allows the model to learn many interesting correlations between text and the table, and between the cells of a column and their header.

We create pre-training inputs by extracting text-table pairs from Wikipedia. We extract 6.2M tables: 3.3M of class Infobox¹ and 2.9M of class WikiTable. We consider tables with at most 500 cells. All of the end task datasets we experiment with only contain horizontal tables with a header row with column names. Therefore, we only extract Wiki tables of this form using the <th> tag to identify headers. Furthermore, we transpose Infoboxes into a table with a single header and a single data row. The tables created from Infoboxes are arguably not very typical, but we found them to improve performance on the end tasks.

¹ en.wikipedia.org/wiki/Help:Infobox

As a proxy for questions that appear in the end tasks, we extract the table caption, article title, article description, segment title and text of the segment the table occurs in as relevant text snippets. In this way we extract 21.3M snippets.

We convert the extracted text-table pairs to pre-training examples as follows: Following Devlin et al. (2019), we use a masked language model pre-training objective. We also experimented with adding a second objective of predicting whether
the table belongs to the text or is a random table but did not find this to improve the performance on the end tasks. This is aligned with Liu et al. (2019) that similarly did not benefit from a next sentence prediction task.

For pre-training to be efficient, we restrict our word piece sequence length to a certain budget (e.g., we use 128 in our final experiments). That is, the combined length of tokenized text and table cells has to fit into this budget. To achieve this, we randomly select a snippet of 8 to 16 word pieces from the associated text. To fit the table, we start by only adding the first word of each column name and cell. We then keep adding words turn-wise until we reach the word piece budget. For every table we generate 10 different snippets in this way.

We follow the masking procedure introduced by BERT. We use whole word masking² for the text, and we find it beneficial to apply whole cell masking (masking all the word pieces of the cell if any of its pieces is masked) to the table as well.

² https://github.com/google-research/bert/blob/master/README.md

We note that we additionally experimented with data augmentation, which shares a similar goal to pre-training. We generated synthetic pairs of questions and denotations over real tables via a grammar, and augmented these to the end tasks training data. As this did not improve end task performance significantly, we omit these results.

4 Fine-tuning

Overview We formally define table parsing in a weakly supervised setup as follows. Given a training set of $N$ examples $\{(x_i, T_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an utterance, $T_i$ is a table and $y_i$ is a corresponding set of denotations, our goal is to learn a model that maps a new utterance $x$ to a program $z$, such that when $z$ is executed against the corresponding table $T$, it yields the correct denotation $y$. The program $z$ comprises a subset of the table cells and an optional aggregation operator. The table $T$ maps a table cell to its value.

As a pre-processing step described in Section 5.1, we translate the set of denotations $y$ for each example to a tuple $(C, s)$ of cell coordinates $C$ and a scalar $s$, which is only populated when $y$ is a single scalar. We then guide training according to the content of $(C, s)$. For cell selection examples, for which $s$ is not populated, we train the model to select the cells in $C$. For scalar answer examples, where $s$ is populated but $C$ is empty, we train the model to predict an aggregation over the table cells that amounts to $s$. We now describe each of these cases in detail.

Cell selection In this case $y$ is mapped to a subset of the table cell coordinates $C$ (e.g., question 1 in Figure 3). For this type of examples, we use a hierarchical model that first selects a single column and then cells from within that column only.

We directly train the model to select the column $\text{col}$ which has the highest number of cells in $C$. For our datasets cells $C$ are contained in a single column and so this restriction on the model provides a useful inductive bias. If $C$ is empty we select the additional empty column corresponding to empty cell selection. The model is then trained to select cells $C \cap \text{col}$ and not select $(T \setminus C) \cap \text{col}$. The loss is composed of three components: (1) the average binary cross-entropy loss over column selections:

$$J_{\text{columns}} = \frac{1}{|\text{Columns}|} \sum_{co \in \text{Columns}} \text{CE}\left(p_{\text{col}}^{(co)}, \mathbb{1}_{co = \text{col}}\right)$$

where the set of columns $\text{Columns}$ includes the additional empty column, $\text{CE}(\cdot)$ is the cross entropy loss, $\mathbb{1}$ is the indicator function. (2) the average binary cross-entropy loss over column cell selections:

$$J_{\text{cells}} = \frac{1}{|\text{Cells}(\text{col})|} \sum_{c \in \text{Cells}(\text{col})} \text{CE}\left(p_s^{(c)}, \mathbb{1}_{c \in C}\right),$$

where $\text{Cells}(\text{col})$ is the set of cells in the chosen column. (3) As for cell selection examples no aggregation occurs, we define the aggregation supervision to be NONE (assigned to $op_0$), and the aggregation loss is:

$$J_{\text{aggr}} = -\log p_a(op_0).$$

The total loss is then $J_{\text{CS}} = J_{\text{columns}} + J_{\text{cells}} + \alpha J_{\text{aggr}}$, where $\alpha$ is a scaling hyperparameter.

Scalar answer In this case $y$ is a single scalar $s$ which does not appear in the table (i.e. $C = \emptyset$, e.g., question 2 in Figure 3). This usually corresponds to examples that involve an aggregation over one or more table cells. In this work we handle aggregation operators that correspond to SQL, namely COUNT, AVERAGE and SUM, however our model is not restricted to these.

For these examples, the table cells that should be selected and the aggregation operator type are not known, as these cannot be directly inferred from
Table (rows recoverable from the figure):

2   Ric Flair       8   3,103
3   Harley Race     7   1,799
4   Dory Funk Jr.   1   1,563
5   Dan Severn      2   1,559
6   Gene Kiniski    1   1,131

Example questions (number, question, answer, example type):

3   How many world champions are there with only one reign?          COUNT(Dory Funk Jr., Gene Kiniski)=2         Ambiguous answer
4   What is the number of reigns for Harley Race?                    7                                            Ambiguous answer
5   Which of the following wrestlers were ranked in the bottom 3?    {Dory Funk Jr., Dan Severn, Gene Kiniski}    Cell selection
    Out of these, who had more than one reign?                       Dan Severn                                   Cell selection

Figure 3: A table (left) with corresponding example questions (right). The last example is conversational.
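The inference rule from Section 2 (take the most likely aggregation operator, keep every cell whose selection probability exceeds 0.5, then apply the operator) can be sketched on the rows of Figure 3. The function `predict`, its input layout, and all probabilities below are invented for illustration and are not model outputs; returning `None` when SUM or AVERAGE has no numeric cell to work on is likewise an assumption.

```python
def predict(cells, op_probs, threshold=0.5):
    """cells: list of (text, numeric value or None, selection probability)."""
    op = max(op_probs, key=op_probs.get)  # most likely aggregation operator
    selected = [(text, value) for text, value, prob in cells if prob > threshold]
    if op == "NONE":
        return [text for text, _ in selected]  # the cells themselves are the answer
    if op == "COUNT":
        return len(selected)
    numbers = [value for _, value in selected]
    if not numbers or any(v is None for v in numbers):
        return None  # assumption: SUM/AVERAGE need at least one numeric cell
    total = sum(numbers)
    return total if op == "SUM" else total / len(numbers)


# Name column of the table in Figure 3; probabilities are invented for illustration.
name_cells = [
    ("Ric Flair", None, 0.05),
    ("Harley Race", None, 0.10),
    ("Dory Funk Jr.", None, 0.92),
    ("Dan Severn", None, 0.20),
    ("Gene Kiniski", None, 0.88),
]
op_probs = {"NONE": 0.1, "COUNT": 0.8, "SUM": 0.05, "AVERAGE": 0.05}
print(predict(name_cells, op_probs))  # 2, matching COUNT(Dory Funk Jr., Gene Kiniski)=2
```

With NONE as the most likely operator, the same call would return the two selected names, which corresponds to a plain cell selection answer.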
Table 4: WIKITQ denotation accuracy.

5.3 Results

All results report the denotation accuracy for models trained from weak supervision. We follow Niven and Kao (2019) and report the median for 5 independent runs, as BERT-based models can degenerate. We present our results for WIKISQL and WIKITQ in Tables 3 and 4 respectively. Table 3 shows that TAPAS, trained in the weakly supervised setting, achieves close to state-of-the-art performance for WIKISQL (83.6 vs 83.9 (Min et al., 2019)). If given the gold aggregation operators and selected cell as supervision (extracted from the reference SQL), which amounts to full supervision for TAPAS, the model achieves 86.4. Unlike the full SQL queries, this supervision can be annotated by non-experts.

⁴ As explained in Section 5.2, we report TAPAS numbers comparing against our own reference answers. Appendix A contains numbers WRT the official WIKISQL eval script.

For WIKITQ the model trained only from the original training data reaches 42.6, which surpasses similar approaches (Neelakantan et al., 2015). When we pre-train the model on WIKISQL or SQA (which is straightforward in our setup, as we do not rely on a logical formalism), TAPAS achieves 48.7 and 48.8, respectively.

For SQA, Table 5 shows that TAPAS leads to substantial improvements on all metrics: improving all metrics by at least 11 points, sequence accuracy from 28.1 to 40.4 and average question accuracy from 55.1 to 67.2.

Model ablations Table 6 shows an ablation study on our different embeddings. To this end we pre-train and fine-tune models with different features. As pre-training is expensive we limit it to 200,000 steps. For all datasets we see that pre-training on tables and column and row embeddings are the most important. Positional and rank embeddings also improve the quality but to a lesser extent.

We additionally find that when removing the scalar answer and aggregation losses (i.e., setting $J_{\text{SA}} = 0$) from TAPAS, accuracy drops for both datasets. For WIKITQ, we observe a substantial drop in performance from 29.0 to 23.1 when removing aggregation. For WIKISQL performance drops from 84.7 to 82.6. The relatively small decrease for WIKISQL can be explained by the fact that most examples do not need aggregation to be answered. In principle, 17% of the examples of
the dev set have an aggregation (SUM, AVERAGE or COUNT), however, for all types we find that for more than 98% of the examples the aggregation is only applied to one or no cells. In the case of SUM and AVERAGE, this means that most examples can be answered by selecting one or no cells from the table. For COUNT the model without aggregation operators achieves 28.2 accuracy (by selecting 0 or 1 from the table) vs. 66.5 for the model with aggregation. Note that 0 and 1 are often found in a special index column. These properties of WIKISQL make it challenging for the model to decide whether to apply aggregation or not. For WIKITQ on the other hand, we observe a substantial drop in performance from 29.0 to 23.1 when removing aggregation.

Qualitative Analysis on WIKITQ We manually analyze 200 dev set predictions made by TAPAS on WIKITQ. For correct predictions via an aggregation, we inspect the selected cells to see if they match the ground truth. We find that 96% of the correct aggregation predictions were also correct in terms of the cells selected. We further find that 14% of the correct aggregation predictions had only one cell, and could potentially be achieved by cell selection, with no aggregation.

We also perform an error analysis and identify the following exclusive salient phenomena: (i) 12% are ambiguous (“Name at least two labels that released the group’s albums.”), have wrong labels or missing information; (ii) 10% of the cases require complex temporal comparisons which could also not be parsed with a rich formalism such as SQL (“what country had the most cities founded in the 1830’s?”); (iii) in 16% of the cases the gold denotation has a textual value that does not appear in the table, thus it could not be predicted without performing string operations over cell values; (iv) on 10%, the table is too big to fit in 512 tokens; (v) on 13% of the cases TAPAS selected no cells, which suggests introducing penalties for this behaviour; (vi) on 2% of the cases, the answer is the difference between scalars, so it is outside of the model capabilities (“how long did anne churchill/spencer live?”); (vii) the other 37% of the cases could not be classified to a particular phenomenon.

Pre-training Analysis In order to understand what TAPAS learns during pre-training we analyze its performance on 10,000 held-out examples. We split the data such that the tables in the held-out data do not occur in the training data. Table 7 shows the accuracy of masked word pieces of different types and in different locations. We find that average accuracy across position is relatively high (71.4). Predicting tokens in the header of the table is easiest (96.6), probably because many Wikipedia articles use instances of the same kind of table. Predicting word pieces in cells is a bit harder (63.4) than predicting pieces in the text (68.8). The biggest differences can be observed when comparing predicting words (74.1) and numbers (53.9). This is expected since numbers are very specific and often hard to generalize. The soft-accuracy metric and example (Appendix C) demonstrate, however, that the model is relatively good at predicting numbers that are at least close to the target.

          all    text   header   cell
all       71.4   68.8   96.6     63.4
word      74.1   69.7   96.9     66.6
number    53.9   51.7   83.6     53.2

Table 7: Mask LM accuracy on held-out data, when the target word piece is located in the text, table header, cell or anywhere (all) and the target is anything, a word or number.

Limitations TAPAS handles single tables as context, which are able to fit in memory. Thus, our model would fail to capture very large tables, or databases that contain multiple tables. In this case, the table(s) could be compressed or filtered, such that only relevant content would be encoded, which we leave for future work.

In addition, although TAPAS can parse compositional structures (e.g., question 2 in Figure 3), its expressivity is limited to a form of an aggregation over a subset of table cells. Thus, structures with multiple aggregations such as “number of actors with an average rating higher than 4” could not be handled correctly. Despite this limitation, TAPAS succeeds in parsing three different datasets, and we did not encounter this kind of error in Section 5.3. This suggests that the majority of examples in semantic parsing datasets are limited in their compositionality.

6 Related Work

Semantic parsing models are mostly trained to produce gold logical forms using an encoder-decoder approach (Jia and Liang, 2016; Dong and Lapata,
2016). To reduce the burden in collecting full logical forms, models are typically trained from weak supervision in the form of denotations. These are used to guide the search for correct logical forms (Clarke et al., 2010; Liang et al., 2011).

Other works suggested end-to-end differentiable models that train from weak supervision, but do not explicitly generate logical forms. Neelakantan et al. (2015) proposed a complex model that sequentially predicts symbolic operations over table segments that are all explicitly predefined by the authors, while Yin et al. (2016) proposed a similar model where the operations themselves are learned during training. Müller et al. (2019) proposed a model that selects table cells, where the table and question are represented as a Graph Neural Network, however their model cannot predict aggregations over table cells. Cho et al. (2018) proposed a supervised model that predicts the relevant rows, column and aggregation operation sequentially. In our work, we propose a model that follows this line of work, with a simpler architecture than past models (as the model is a single encoder that performs computation for many operations implicitly) and more coverage (as we support aggregation operators over selected cells).

Finally, pre-training methods have been designed with different training objectives, including language modeling (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018) and masked language modeling (Devlin et al., 2019; Lample and Conneau, 2019). These methods dramatically boost the performance of natural language understanding models (Peters et al., 2018, inter alia). Recently, several works extended BERT for visual question answering, by pre-training over text-image pairs while masking different regions in the image (Tan and Bansal, 2019; Lu et al., 2019). As for tables, Chen et al. (2019) experimented with rendering a table into natural language so that it can be handled with a pre-trained BERT model. In our work we extend masked language modeling for table representations, by masking table cells or text segments.

7 Conclusion

In this paper we presented TAPAS, a model for question answering over tables that avoids generating logical forms. We showed that TAPAS effectively pre-trains over large scale data of text-table pairs and successfully restores masked words and table cells. We additionally showed that the model can fine-tune on semantic parsing datasets, only using weak supervision, with an end-to-end differentiable recipe. Results show that TAPAS achieves better or competitive results in comparison to state-of-the-art semantic parsers.

In future work we aim to extend the model to represent a database with multiple tables as context, and to effectively handle large tables.

8 Acknowledgments

We would like to thank Yasemin Altun, Srini Narayanan, Slav Petrov, William Cohen, Massimo Nicosia, Syrine Krichene, Jordan Boyd-Graber and the anonymous reviewers for their constructive feedback, useful comments and suggestions. This work was completed in partial fulfillment for the PhD degree of the first author, which was also supported by a Google PhD fellowship.

References

R. Agarwal, C. Liang, D. Schuurmans, and M. Norouzi. 2019. Learning to generalize from sparse and underspecified rewards. arXiv preprint arXiv:1902.07198.

D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins. 2016. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.

Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. Giving BERT a calculator: Finding operations and arguments with reading comprehension. arXiv preprint arXiv:1909.00109.

Y. Artzi and L. Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics (TACL), 1:49–62.

J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2019. Tabfact: A large-scale dataset for table-based fact verification. ArXiv, abs/1909.02164.

Minseok Cho, Reinald Kim Amplayo, Seung won Hwang, and Jonghyuck Park. 2018. Adversarial tableqa: Attention supervision for question answering on tables. In ACML.

J. Clarke, D. Goldwasser, M. Chang, and D. Roth. 2010. Driving semantic parsing from the world’s response. In Computational Natural Language Learning (CoNLL), pages 18–27.
A. M. Dai and Q. V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (NeurIPS).

Pradeep Dasigi, Matt Gardner, Shikhar Murty, Luke Zettlemoyer, and Eduard Hovy. 2019. Iterative search for weakly supervised semantic parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2669–2680.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

L. Dong and M. Lapata. 2016. Language to logical form with neural attention. In Association for Computational Linguistics (ACL).

D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In North American Association for Computational Linguistics (NAACL).

Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Elliot Karro, and D. Sculley, editors. 2017. Google Vizier: A Service for Black-Box Optimization.

K. Guu, P. Pasupat, E. Z. Liu, and P. Liang. 2017. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In Association for Computational Linguistics (ACL).

T. Haug, O. Ganea, and P. Grnarova. 2018. Neural multi-step reasoning for question answering on semi-structured tables. In European Conference on Information Retrieval.

J. Herzig and J. Berant. 2017. Neural semantic parsing over multiple knowledge-bases. In Association for Computational Linguistics (ACL).

P. J. Huber. 1964. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101.

Wonseok Hwang, Jinyeung Yim, Seunghyun Park, and Minjoon Seo. 2019. A comprehensive exploration on wikisql with table-aware word contextualization. arXiv preprint arXiv:1902.01069.

S. Iyer, I. Konstas, A. Cheung, J. Krishnamurthy, and L. Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. In Association for Computational Linguistics (ACL).

M. Iyyer, W. Yih, and M. Chang. 2017. Search-based neural structured learning for sequential question answering. In Association for Computational Linguistics (ACL).

R. Jia and P. Liang. 2016. Data recombination for neural semantic parsing. In Association for Computational Linguistics (ACL).

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling data. In International Conference on Machine Learning (ICML), pages 282–289.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Carolin Lawrence and Stefan Riezler. 2018. Improving a neural semantic parser by counterfactual learning from human bandit feedback. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1820–1830, Melbourne, Australia. Association for Computational Linguistics.

C. Liang, M. Norouzi, J. Berant, Q. Le, and N. Lao. 2018. Memory augmented policy optimization for program synthesis with generalization. In Advances in Neural Information Processing Systems (NeurIPS).

P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265.

Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. A discrete hard EM approach for weakly supervised question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2851–2864, Hong Kong, China. Association for Computational Linguistics.

D. Muhlgay, J. Herzig, and J. Berant. 2019. Value-based search in execution space for mapping instructions to programs. In North American Association for Computational Linguistics (NAACL).

Thomas Müller, Francesco Piccinno, Massimo Nicosia, Peter Shaw, and Yasemin Altun. 2019. Answering conversational questions on structured data without logical forms. arXiv preprint arXiv:1908.11787.
A. Neelakantan, Q. V. Le, M. Abadi, A. McCallum, and Y. Wang, J. Berant, and P. Liang. 2015. Building a
D. Amodei. 2017. Learning a natural language inter- semantic parser overnight. In Association for Com-
face with neural programmer. In International Con- putational Linguistics (ACL).
ference on Learning Representations (ICLR).
Pengcheng Yin, Zhengdong Lu, Hang Li, and Kao Ben.
Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. 2016. Neural enquirer: Learning to query tables in
2015. Neural programmer: Inducing la- natural language. In Proceedings of the Workshop
tent programs with gradient descent. CoRR, on Human-Computer Question Answering, pages
abs/1511.04834. 29–35, San Diego, California. Association for Com-
putational Linguistics.
Timothy Niven and Hung-Yu Kao. 2019. Probing neu-
ral network comprehension of natural language ar- Y. Zhang, P. Pasupat, and P. Liang. 2017. Macro gram-
guments. In Proceedings of the 57th Annual Meet- mars and holistic triggering for efficient semantic
ing of the Association for Computational Linguis- parsing. In Empirical Methods in Natural Language
tics, pages 4658–4664, Florence, Italy. Association Processing (EMNLP).
for Computational Linguistics.
Victor Zhong, Caiming Xiong, and Richard Socher.
Panupong Pasupat and Percy Liang. 2015. Compo- 2017. Seq2sql: Generating structured queries
sitional semantic parsing on semi-structured tables. from natural language using reinforcement learning.
In Proceedings of the 53rd Annual Meeting of the CoRR, abs/1709.00103.
Association for Computational Linguistics and the
7th International Joint Conference on Natural Lan-
guage Processing (Volume 1: Long Papers), pages
1470–1480, Beijing, China. Association for Compu-
tational Linguistics.
M. E. Peters, M. Neumann, M. Iyyer, M. Gard-
ner, C. Clark, K. Lee, and L. Zettlemoyer. 2018.
Deep contextualized word representations. In North
American Association for Computational Linguistics
(NAACL).
A. Radford, K. Narasimhan, T. Salimans, and
I. Sutskever. 2018. Improving language understand-
ing by generative pre-training. Technical report,
OpenAI.
Y. Su and X. Yan. 2017. Cross-domain semantic pars-
ing via paraphrasing. In Empirical Methods in Nat-
ural Language Processing (EMNLP).
Yibo Sun, Duyu Tang, Nan Duan, Jingjing Xu, Xi-
aocheng Feng, and Bing Qin. 2018. Knowledge-
aware conversational semantic parsing over web ta-
bles. arXiv preprint arXiv:1809.04271.
Hao Tan and Mohit Bansal. 2019. Lxmert: Learning
cross-modality encoder representations from trans-
formers. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Process-
ing.
Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh,
and Matt Gardner. 2019. Do nlp models know num-
bers? probing numeracy in embeddings. arXiv
preprint arXiv:1909.07940.
Bailin Wang, Ivan Titov, and Mirella Lapata. 2019.
Learning semantic parsers from denotations with la-
tent structured alignments and abstract programs. In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages 3772–
3783, Hong Kong, China. Association for Computa-
tional Linguistics.
Parameter                      PRETRAIN    SQA       WIKISQL      WIKITQ
Training steps                 1,000,000   200,000   50,000       50,000
Learning rate                  5e-5        1.25e-5   6.17164e-5   1.93581e-5
Warmup ratio                   0.01        0.2       0.142400     0.128960
Temperature                    -           1.0       0.107515     0.0352513
Answer loss cutoff             -           -         0.185567     0.664694
Huber loss delta               -           -         1265.74      0.121194
Cell selection preference      -           -         0.611754     0.207951
Batch size                     512         128       512          512
Gradient clipping              -           -         10           10
Select one column              -           1         0            1
Reset cell selection weights   -           0         0            1

A WIKISQL Execution Errors

In some tables, WIKISQL contains “REAL” numbers stored in “TEXT” format. This leads to incorrect results for some of the comparison and

Question         What was the away team’s score when the crowd at Arden Street Oval was larger than 31,481?
SQL Query        SELECT col3 AS result FROM table_2_10767641_15
                 WHERE col5 > 31481.0 AND col4 = "arden street oval"
WIKISQL answer   ["14.11 (95)"]
Our answer       []

Table 8: Table “2-10767641-15” from WIKISQL. “col6” was removed. The “Crowd” column is of type “REAL” but the cell values are actually stored as “TEXT”. Below we have two questions from the training set with the answer that is produced by the WIKISQL evaluation script and the answer we derive.
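The mismatch described in Table 8 can be reproduced in a small sketch. It assumes the query is executed by SQLite, where a TEXT value compares as greater than any numeric value, and it uses a hypothetical crowd value of "15,000" and a hypothetical table name `t`; neither is taken from the actual WIKISQL table.

```python
import sqlite3

# Hypothetical miniature of the table: col5 is declared REAL, but the cell
# holds text with a thousands separator, so SQLite keeps it as TEXT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col3 TEXT, col4 TEXT, col5 REAL)")
conn.execute(
    "INSERT INTO t VALUES (?, ?, ?)",
    ("14.11 (95)", "arden street oval", "15,000"),  # assumed cell value
)

# Comparison as written in the SQL query: a TEXT value is treated as
# larger than any number, so the row matches although 15,000 < 31,481.
text_query = (
    "SELECT col3 AS result FROM t "
    "WHERE col5 > 31481.0 AND col4 = 'arden street oval'"
)
text_result = [row[0] for row in conn.execute(text_query)]

# Comparison after converting the cell to a number first.
numeric_query = (
    "SELECT col3 AS result FROM t "
    "WHERE CAST(REPLACE(col5, ',', '') AS REAL) > 31481.0 "
    "AND col4 = 'arden street oval'"
)
numeric_result = [row[0] for row in conn.execute(numeric_query)]

print(text_result)     # ['14.11 (95)']
print(numeric_result)  # []
```

Under these assumptions, the first result corresponds to the "WIKISQL answer" row and the second to the "Our answer" row of Table 8.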
T we can define a random subset $S \subseteq T$ where $p_c = P(c \in S)$ for each cell $c \in T$. The expected value of $\mathrm{COUNT}(S) = \sum_c G_c$ can be computed as $\sum_c p_c$, and that of $\mathrm{SUM}(S) = \sum_c G_c T_c$ as $\sum_c p_c T_c$, as described in Table 1. For the average, however, this is not straightforward. We will see in what follows that the quotient of the expected sum and the expected count, which equals the weighted average of $T$ by $p_c$, is in general not the true expected value, which can be written as:

$$E\left[\frac{\sum_c G_c T_c}{\sum_c G_c}\right]$$

This quantity differs from the weighted average, a key difference being that the weighted average is not sensitive to constants scaling all the output probabilities, which could in theory find optima where all the $p_c$ are below 0.5, for example. By the linearity of the expectation we can write:

$$\sum_c T_c \, E\left[\frac{G_c}{\sum_j G_j}\right] = \sum_c T_c \, p_c \, E\left[\frac{1}{1 + \sum_{j \neq c} G_j}\right]$$

The second step holds because the term inside the expectation is zero when $G_c = 0$, and the $G_j$ are independent Bernoulli variables. So it comes down to computing the quantity

$$Q_c = E\left[\frac{1}{X_c}\right] = E\left[\frac{1}{1 + \sum_{j \neq c} G_j}\right]$$

The key observation is that this is the expectation of a reciprocal of a Poisson Binomial Distribution⁵ (a sum of Bernoulli variables) in the special case where one of the probabilities is 1.

By using the Jensen inequality we get a lower bound on $Q_c$:

$$Q_c \geq \frac{1}{E[X_c]} = \frac{1}{1 + \sum_{j \neq c} p_j}$$

Note that if instead we used $\frac{1}{\sum_j p_j}$ then we recover the weighted average, which is strictly bigger than the lower bound and in general not an upper or lower bound on $Q_c$.

We can get better approximations by computing the Taylor expansion using the moments⁶ of $X_c$ of order $k$:

$$Q_c = E\left[\frac{1}{X_c}\right] \simeq \frac{1}{E[X_c]} + \frac{\mathrm{var}[X_c]}{E[X_c]^3} + \cdots + (-1)^k \frac{E\left[(X_c - E[X_c])^k\right]}{E[X_c]^{k+1}}$$

where $\mathrm{var}[X_c] = \sum_{j \neq c} p_j (1 - p_j)$.

The full forms of the zero and second order Taylor approximations are:

$$\mathrm{AVERAGE}_0(T, p) = \sum_c T_c \, \frac{p_c}{1 + \sum_{j \neq c} p_j}$$

$$\mathrm{AVERAGE}_2(T, p) = \sum_c T_c \, \frac{p_c \, (1 + \epsilon_c)}{1 + \sum_{j \neq c} p_j} \quad \text{with} \quad \epsilon_c = \frac{\sum_{j \neq c} p_j (1 - p_j)}{\left(1 + \sum_{j \neq c} p_j\right)^2}$$

The approximations are then easy to write in any tensor computation language and will be differentiable. In this work we experimented with the zero and second order approximations and found small improvements over the weighted average baseline. It is worth noting that the proportion of average examples in the dataset is very low. We expect this method to be more relevant in the more general setting.

⁵ wikipedia.org/Poisson binomial distribution
⁶ wikipedia.org/Taylor expansions for the moments
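As an illustration of the formulas above, the following is a minimal sketch in plain Python rather than a tensor library. The function names are hypothetical and not from the paper's code. The exact expectation is obtained by enumerating all subsets, assuming independent cell selections and counting the empty selection as contributing 0, which matches the linearity step above.

```python
from itertools import product


def weighted_average(T, p):
    """Quotient of the expected sum and the expected count."""
    return sum(t * q for t, q in zip(T, p)) / sum(p)


def average_taylor(T, p, order=0):
    """Zero (order=0) or second (order=2) order Taylor approximation."""
    total_p = sum(p)
    total_var = sum(q * (1.0 - q) for q in p)
    result = 0.0
    for t_c, p_c in zip(T, p):
        mean = 1.0 + total_p - p_c             # E[X_c] = 1 + sum_{j != c} p_j
        eps = 0.0
        if order == 2:
            var = total_var - p_c * (1.0 - p_c)  # var[X_c], sum over j != c
            eps = var / mean ** 2
        result += t_c * p_c * (1.0 + eps) / mean
    return result


def exact_expected_average(T, p):
    """Brute-force expectation over all 2^n subsets (small tables only)."""
    expected = 0.0
    for mask in product([0, 1], repeat=len(T)):
        prob = 1.0
        for g, q in zip(mask, p):
            prob *= q if g else 1.0 - q
        count = sum(mask)
        if count:  # the empty selection contributes 0
            expected += prob * sum(g * t for g, t in zip(mask, T)) / count
    return expected


T = [1.0, 2.0, 3.0]
p = [0.5, 0.5, 0.5]
exact = exact_expected_average(T, p)      # 1.75
zero_order = average_taylor(T, p, 0)      # 1.5
second_order = average_taylor(T, p, 2)    # 1.6875
baseline = weighted_average(T, p)         # 2.0
```

In this toy case the zero order value stays below the exact expectation, as the Jensen bound predicts for non-negative cell values, and the second order value is closer to it than either the zero order value or the weighted average.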