
Probabilistic Assumptions Matter: Improved Models for

Distantly-Supervised Document-Level Question Answering


Hao Cheng∗ , Ming-Wei Chang† , Kenton Lee† , Kristina Toutanova†

Microsoft Research
chehao@microsoft.com

Google Research
{mingweichang, kentonl, kristout}@google.com

Abstract

We address the problem of extractive question answering using document-level distant supervision, pairing questions and relevant documents with answer strings. We compare previously used probability space and distant supervision assumptions (assumptions on the correspondence between the weak answer string labels and possible answer mention spans). We show that these assumptions interact, and that different configurations provide complementary benefits. We demonstrate that a multi-objective model can efficiently combine the advantages of multiple assumptions and outperform the best individual formulation. Our approach outperforms previous state-of-the-art models by 4.3 points in F1 on TriviaQA-Wiki and 1.7 points in Rouge-L on NarrativeQA summaries.[1]

[1] Based on the TriviaQA-Wiki leaderboard, our approach was the SOTA when this work was submitted on Dec 04, 2019.

TriviaQA
  Question: How is Joan Molinsky better known?
  Answer: Joan Rivers
  A: { Joan Rivers, Diary of a Mad Diva }
  P1: Joan Alexandra Molinsky, known professionally as Joan Rivers, was an American comedian, actress, writer, producer, and television host. ... Joan Rivers was strongly influenced by Lenny Bruce. ...
  P2: ... She received a Grammy Award for Best Spoken Word Album for her book, Diary of a Mad Diva. ...
  P3: Joan Alexandra Molinsky was born on June 8, 1933, in Brooklyn, New York. ... Before entering show business, she chose Joan Rivers as her stage name. ...

NarrativeQA
  Question: Where do the dancers purify themselves?
  Answer: in the spring at mount helicon; mount helicon
  A: { in the spring at mount helicon, mount helicon }
  P1: The play begins with three pages ...
  P2: The courtiers ... She sentences them to make reparation and to purify themselves by bathing in the spring at mount helicon. The figure of Actaeon in the play may represent ...

Figure 1: TriviaQA and NarrativeQA examples. In the TriviaQA example, there are three occurrences of the original answer string "Joan Rivers" and one alternate but incorrect alias "Diary of a Mad Diva". Only two of the "Joan Rivers" mentions support answering the question. In the NarrativeQA example, there are two answer strings in A: "in the spring at mount helicon" and "mount helicon", with the latter being a substring of the former. Both mentions in P2 are correct answer spans.

1 Introduction
to train fine-grained extractive short answer ques-
tion answering (QA) systems. One example is
TriviaQA (Joshi et al., 2017). There the au-
Depending on the data generation process, the
thors utilized a pre-existing set of Trivia question-
properties of the resulting supervision from the
answer string pairs and coupled them with rele-
sets A may differ. For example, the provided an-
vant documents, such that, with high likelihood,
swer sets in TriviaQA include aliases of original
the documents support answering the questions
trivia question answers, aimed at capturing seman-
(see Fig. 1 for an illustration). Another example
tically equivalent answers but liable to introducing
is the NarrativeQA dataset (Kočiský et al., 2018),
semantic drift. In Fig. 1, the possible answer string
where crowd-sourced abstractive answer strings
“Diary of a Mad Diva” is related to “Joan Rivers”,
were used to weakly supervise answer mentions
but is not a valid answer for the given question.
in the text of movie scripts or their summaries. In
this work, we focus on the setting of document- On the other hand, the sets of answer strings in
level extractive QA, where distant supervision is NarrativeQA are mostly valid since they have high
specified as a set A of answer strings for an input overlap with human-generated answers for the
question-document pair. given question/document pair. As shown in Fig. 1,
1
“in the spring at mount helicon” and “mount he-
Based on the TriviaQA-Wiki leaderboard, our approach
was the SOTA when this work was submitted on Dec 04, licon” are both valid answers with relevant men-
2019. tions. In this case, the annotators chose answers
that appear verbatim in the text but in the more String
Probabilities
𝑷𝒂 (“Joan Rivers”) 𝑷𝒂 (“Diary of a Mad Diva”) …
general case, noise may come from partial phrases (𝑷𝒂 )
𝚵 𝚵
and irrelevant mentions.
Span
While distant supervision reduces the annota- Probabilities (“Joan Rivers”| 𝒑𝟏 ) … (“Joan Rivers”| 𝒑𝟑 ) …
(𝑷𝒔 )
tion cost, increased coverage often comes with
increased noise (e.g., expanding entity answer Begin and End
Probabilities … …
strings with aliases improves coverage but also in- (𝑷𝒃 , 𝑷𝒆 )
creases noise). Even for fixed document-level dis-
tant supervision in the form of a set of answers A,
Contextualized
Representation
BERT BERT …
different interpretations of the partial supervision … …
𝒒 𝒑𝟏 𝒒 𝒑𝟑
lead to different points in the coverage/noise space
and their relative performance is not well under- Figure 2: The document-level QA model as used for
stood. test-time inference. The lower part is a BERT-based
This work systematically studies methods for paragraph-level answer scoring component, and the up-
learning and inference with document-level dis- per part illustrates the probability aggregation across
tantly supervised extractive QA models. Using a answer spans sharing the same answer string. Ξ refers
to either a sum or a max operator. In the given example,
BERT (Devlin et al., 2019) joint question-passage
“John Rivers” is derived from two paragraphs.
encoder, we study the compound impact of:
• Probability space (§2): ways to define the
model’s probability space based on independent tasks. Results are further strengthened by transfer
paragraphs or whole documents. learning from fully labeled short-answer extrac-
• Distant supervision assumptions (§3): ways to tion data in SQuAD 2.0 (Rajpurkar et al., 2018),
translate the supervision from possible strings A leading to a final state-of-the-art performance of
to possible locations of answer mentions in the 76.3 F1 on TriviaQA-Wiki and 62.9 on the Narra-
document. tiveQA summaries task.2
• Optimization and inference (§4): ways to
2 Probability Space
define corresponding training objectives (e.g.
Hard EM as in Min et al. (2019) vs. Maxi- Here, we first formalize both paragraph-level and
mum Marginal Likelihood) and make answer document-level models, which have been previ-
string predictions during inference (Viterbi or ously used for document-level extractive QA. Typ-
marginal inference). ically, paragraph-level models consider each para-
We show that the choice of probability space graph in the document independently, whereas
puts constraints on the distant supervision as- document models integrate some dependencies
sumptions that can be captured, and that all three among paragraphs.
choices interact, leading to large differences in To define the model, we need to specify the
performance. Specifically, we provide a frame- probability space, consisting of a set of possible
work for understanding different distant supervi- outcomes and a way to assign probabilities to in-
sion assumptions and the corresponding trade-off dividual outcomes. For extractive QA, the proba-
among the coverage, quality and strength of dis- bility space outcomes consist of token positions of
tant supervision signal. The best configuration de- answer mention spans.
pends on the properties of the possible annotations The overall model architecture is shown in
A and is thus data-dependent. Compared with re- Fig. 2. We use BERT (Devlin et al., 2019) to
cent work also using BERT representations, our derive representations of document tokens. As
study show that the model with most suitable prob- is standard in state-of-the-art extractive QA mod-
abilistic treatment achieves large improvements of els (Devlin et al., 2019; Lee et al., 2019; Min
4.6 F1 on TriviaQA and 1.7 Rouge-L on Narra- et al., 2019), the BERT model is used to encode
tiveQA respectively. Additionally, we design an a pair of a given question with one paragraph
efficient multi-loss objective that can combine the from a given document into neural text represen-
benefits of different formulations, leading to sig- tations. These representations are then used to
nificant improvements in accuracy, surpassing the 2
The code is available at https://github.com/
best previously reported results on the two studied hao-cheng/ds_doc_qa
In the following, we show that paragraph-level and document-level models differ only in the space of possible outcomes and the way of computing answer span probabilities from answer position begin and end scores.

Scoring answer begin and end positions. Given a question q and a document d consisting of K paragraphs p^1, ..., p^K, the BERT encoder produces contextualized representations for each question-paragraph pair (q, p^k). Specifically, for each token position i^k in p^k, the final hidden vector h_{(i,k)} \in R^d is used as the contextualized token embedding, where d is the vector dimension. The span-begin score is computed as s_b(i^k) = w_b^T h_{(i,k)} using a weight vector w_b \in R^d. The span-end score s_e(j^k) is defined in the same way. The probabilities for a start position i^k and an end position j^k are

  P_b(i^k) = \exp(s_b(i^k)) / Z_b,   (1)
  P_e(j^k) = \exp(s_e(j^k)) / Z_e,   (2)

where Z_b, Z_e are normalizing factors, depending on the probability space definition (detailed below). The probability of an answer span from i^k to j^k is defined as P_s(i^k, j^k) = P_b(i^k) P_e(j^k). The partition functions Z_b and Z_e depend on whether we use a paragraph-level or document-level probability space.

Paragraph-level model. In paragraph-level models, we assume that for a given question against a document d, each of its paragraphs p^1, ..., p^K independently selects a pair of answer positions (i^k, j^k), which are the begin and end of the answer from paragraph p^k. In the case that p^k does not support answering the question q, special NULL positions are selected (following the SQuAD 2.0 BERT implementation[3]). Thus, the set of possible outcomes Ω in the paragraph-level probability space is the set of lists of begin/end position pairs, one from each paragraph: {[(i^1, j^1), ..., (i^K, j^K)]}, where i^k and j^k range over positions in the respective paragraphs.

[3] https://github.com/google-research/bert

The answer positions in different paragraphs are independent, and the probability of each paragraph's answer begin and end is computed by normalizing over all possible positions in that paragraph, i.e.,

  Z_b^k = \sum_{i \in I^k \cup \{NULL\}} \exp(s_b(i)),   (3)
  Z_e^k = \sum_{j \in I^k \cup \{NULL\}} \exp(s_e(j)),   (4)

where I^k is the set of all positions in the paragraph p^k. The probability of an answer begin at i^k is P_b(i^k) = \exp(s_b(i^k)) / Z_b^k, and the probability of an end at j^k is defined analogously. The probability of a possible answer position assignment for the document d is then defined as P([(i^1, j^1), ..., (i^K, j^K)]) = \prod_k P_b(i^k) P_e(j^k).

As we can see from the above definition, due to the independence assumption, models using paragraph-level normalization do not learn to directly calibrate candidate answers from different paragraphs against each other.

Document-level model. In document-level models, we assume that for a given question against document d, a single answer span is selected (as opposed to one for each paragraph in the paragraph-level models).[4] Here, the possible positions in all paragraphs are a part of a joint probability space and directly compete against each other. In this case, Ω is the set of token spans {(i, j)}, where i and j are the begin and end positions of the selected answer. The normalizing factors are therefore aggregated over all paragraphs, i.e.,

  Z_b^* = \sum_{k=1}^{K} \sum_{i \in I^k} \exp(s_b(i)),   (5)
  Z_e^* = \sum_{k=1}^{K} \sum_{j \in I^k} \exp(s_e(j)).   (6)

[4] In this paper, we focus on datasets where the document is known to contain a valid answer. It is straightforward to remove this assumption and consider document-level NULL for future work.

Compared with (3) and (4), since there is always a valid answer in the document for the tasks studied here, NULL is not necessary for document-level models and thus can be excluded from the inner summation of (5) and (6). The probability of a possible outcome, i.e. an answer span, is P(i, j) = \exp(s_b(i) + s_e(j)) / (Z_b^* Z_e^*).
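To make the two normalization schemes concrete, the following minimal NumPy sketch (our illustration, not the released implementation) computes paragraph-level and document-level begin/end probabilities from raw position scores. The random score arrays stand in for the BERT-derived values of s_b and s_e, and scoring the NULL outcome with 0 is an arbitrary convention of this sketch.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()
    exp = np.exp(scores)
    return exp / exp.sum()

def paragraph_level_probs(begin_scores, end_scores):
    """Eqs. (1)-(4): normalize within each paragraph, with an extra NULL position (last index)."""
    probs = []
    for s_b, s_e in zip(begin_scores, end_scores):
        p_b = softmax(np.append(s_b, 0.0))  # NULL gets score 0 in this sketch
        p_e = softmax(np.append(s_e, 0.0))
        probs.append((p_b, p_e))
    return probs

def document_level_probs(begin_scores, end_scores):
    """Eqs. (5)-(6): normalize jointly over all paragraphs, with no NULL outcome."""
    p_b = softmax(np.concatenate(begin_scores))
    p_e = softmax(np.concatenate(end_scores))
    return p_b, p_e

def span_prob(p_b, p_e, i, j):
    """P_s(i, j) = P_b(i) * P_e(j), using the begin/end independence assumption."""
    return p_b[i] * p_e[j]

# Toy document with two paragraphs of 4 and 3 tokens.
rng = np.random.default_rng(0)
b_scores = [rng.normal(size=4), rng.normal(size=3)]
e_scores = [rng.normal(size=4), rng.normal(size=3)]
print(paragraph_level_probs(b_scores, e_scores)[0][0])  # begin probs for paragraph 1 (incl. NULL)
print(document_level_probs(b_scores, e_scores)[0])      # begin probs over the whole document
```

Under paragraph-level normalization the probabilities in each paragraph sum to one separately, whereas under document-level normalization a single distribution spans all paragraphs, which is what allows candidate answers from different paragraphs to be calibrated against each other.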
3 Distant Supervision Assumptions

There are multiple ways to interpret the distant supervision signal from A as possible outcomes in our paragraph-level and document-level probability spaces, leading to corresponding training loss functions. Although several different paragraph-level and document-level losses (Chen et al., 2017; Kadlec et al., 2016; Clark and Gardner, 2018; Lin et al., 2018; Min et al., 2019) have been studied in the literature, we want to point out that when interpreting the distant supervision signal, there is a tradeoff among multiple desiderata:
• Coverage: maximize the number of instances of relevant answer spans, which we can use to provide positive examples to our model.
• Quality: maximize the quality of annotations by minimizing noise from irrelevant answer strings or mentions.
• Strength: maximize the strength of the signal by reducing uncertainty and pointing the model more directly at correct answer mentions.
We introduce three assumptions (H1, H2, H3) for how the distant supervision signal should be interpreted, which lead to different tradeoffs among the desiderata above (see Table 1).

        Coverage   Quality   Strength
  H1       ↑          ↓          ↑
  H2       →          →          →
  H3       ↓          ↑          ↓

Table 1: Distant supervision assumptions and their corresponding tradeoffs. (↑) indicates highest value, (→) medium, and (↓) lowest value.

We begin with setting up additional useful notation. Given a document-question pair (d, q) and a set of answer strings A, we define the set of A-consistent token spans Y_A in d as follows: for each paragraph p^k, span (i^k, j^k) \in Y_A^k if and only if the string spanning these positions in the paragraph is in A. For paragraph-level models, if for paragraph p^k the set Y_A^k is empty, we redefine Y_A^k to be {NULL}. Similarly, we define the set of A-consistent begin positions Y_{b,A}^k as the start positions of consistent spans: Y_{b,A}^k = \cup_{(i,j) \in Y_A^k} {i}. The set Y_{e,A}^k of A-consistent end positions is defined analogously. In addition, we term an answer span (i, j) correct for question q if its corresponding answer string is a correct answer to q, and the context of the specific mention of that answer string from positions i to j entails this answer. Similarly, we term an answer begin/end position correct if there exists a correct answer span starting/ending at that position.
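To make the notation concrete, here is a small, simplified sketch (whitespace-level matching for illustration only; the actual preprocessing operates on BERT word pieces) that enumerates the A-consistent span set Y_A^k for each paragraph, together with the derived begin/end position sets.

```python
def a_consistent_spans(paragraphs, answer_strings, max_span_len=10):
    """Return, for each paragraph, the set of (begin, end) token spans whose surface text is in A.

    paragraphs: list of token lists; answer_strings: the distant supervision set A.
    Simplified token-level matching for illustration only.
    """
    normalized = {" ".join(a.lower().split()) for a in answer_strings}
    spans_per_paragraph = []
    for tokens in paragraphs:
        spans = set()
        for i in range(len(tokens)):
            for j in range(i, min(i + max_span_len, len(tokens))):
                if " ".join(t.lower() for t in tokens[i:j + 1]) in normalized:
                    spans.add((i, j))
        spans_per_paragraph.append(spans)
    return spans_per_paragraph

def begin_end_sets(spans):
    """Y_{b,A}^k and Y_{e,A}^k are the projections of Y_A^k onto begin and end positions."""
    return {i for i, _ in spans}, {j for _, j in spans}
```

For paragraph-level models, an empty span set would then be replaced by {NULL}, as described above.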
H1: All A-consistent answer spans are correct. While this assumption is evidently often incorrect (low on the quality dimension, ↓), especially for TriviaQA, as seen from Fig. 1, it provides a large number of positive examples and a strong supervision signal (high on coverage, ↑, and strength, ↑). We include this in our study for completeness.

H1 translates differently into possible outcomes for corresponding models depending on the probability space (paragraph or document). Paragraph-level models select multiple answer spans, one for each paragraph, to form a possible outcome. Thus, multiple A-consistent answer spans can occur in a single outcome, as long as they are in different paragraphs. For multiple A-consistent answer spans in the same paragraph, these can be seen as mentions that can be selected with equal probability (e.g., by different annotators). Document-level models select a single answer span in the document and therefore multiple A-consistent answer spans can be seen as occurring in separate annotation events. Table 2 shows in row one the log-probability of outcomes consistent with H1.

H2: Every positive paragraph has a correct answer in its A-consistent set. Under this assumption, each paragraph with a non-empty set of A-consistent spans (termed a positive paragraph) has a correct answer. As we can see from the TriviaQA example in Fig. 1, this assumption is correct for the first and third paragraph, but not the second one, as it only contains a mention of a noisy answer alias. This assumption has medium coverage (→), as it generates positive examples from multiple paragraphs but does not allow multiple positive mentions in the same paragraph. It also decreases noise (higher quality, →), e.g. it does not claim that all the mentions of "Joan Rivers" in the first paragraph support answering the question. The strength of the supervision signal is weakened (→) relative to H1, as now the model needs to figure out which of the multiple A-consistent mentions in each paragraph is correct.

       Span-Based                                                          Position-Based
  H1   \sum_{k \in K} \sum_{(i^k,j^k) \in Y_A^k} \log P_s(i^k, j^k)        \sum_{k \in K} \sum_{i^k \in Y_{b,A}^k} \log P_b(i^k) + \sum_{k \in K} \sum_{j^k \in Y_{e,A}^k} \log P_e(j^k)
  H2   \sum_{k \in K} \log \Xi_{(i^k,j^k) \in Y_A^k} P_s(i^k, j^k)         \sum_{k \in K} \log \Xi_{i^k \in Y_{b,A}^k} P_b(i^k) + \sum_{k \in K} \log \Xi_{j^k \in Y_{e,A}^k} P_e(j^k)
  H3   \log \Xi_{k \in K} \Xi_{(i^k,j^k) \in Y_A^k} P_s(i^k, j^k)          \log \Xi_{k \in K} \Xi_{i^k \in Y_{b,A}^k} P_b(i^k) + \log \Xi_{k \in K} \Xi_{j^k \in Y_{e,A}^k} P_e(j^k)

Table 2: Objective functions for a document-question pair (d, q) under different distant supervision assumptions. Ξ refers to \sum and max for MML and HardEM, respectively.

H2 has two variations: correct span, assuming that one of the answer spans (i^k, j^k) in Y_A^k is correct, and correct position, assuming that the paragraph has a correct answer begin position from Y_{b,A}^k and a correct answer end position from Y_{e,A}^k, but its selected answer span may not necessarily belong to Y_A^k. For example, if A contains {abcd, bc}, then abc would have correct begin and end, but not be a correct span. It does not make sense for modeling to assume the paragraph has correct begin and end positions instead of a correct answer span (i.e., we don't really want to get inconsistent answers like abc above), but given that our probabilistic model assumes independence of begin and end answer positions, it may not be able to learn well with span-level weak supervision. Some prior work (Clark and Gardner, 2018) uses an H2 position-based distant supervision assumption with a paragraph-pair model akin to our document-level ones. Lin et al. (2018) use an H2 span-based distant supervision assumption. The impact of position- vs. span-based modeling of the distant supervision is not well understood. As we will see in the experiments, for the majority of settings, position-based weak supervision is more effective than span-based for our model.

For paragraph-level and document-level models, H2 corresponds differently to possible outcomes. For paragraph models, one outcome can select answer spans in all positive paragraphs and NULL in negative ones. For document-level models, we view answers in different paragraphs as outcomes of multiple draws from the distribution. The identity of the particular correct span or begin/end position is unknown, but we can compute the probability of the event comprising the consistent outcomes. Table 2 shows the log-probability of the outcomes consistent with H2 in row two (left for span-based and right for position-based interpretation, when plugging in \sum for Ξ).

H3: The document has a correct answer in its A-consistent set Y_A. This assumption posits that the document has a correct answer span (or begin/end positions), but not every positive paragraph needs to have one. It further improves supervision quality (↑), because, for example, it allows the model to filter out the noise in paragraph two in Fig. 1. Since the model is given a choice of any of the A-consistent mentions, it has the capability to assign zero probability mass to the supervision-consistent mentions in that paragraph.

On the other hand, H3 has lower coverage (↓) than H1 and H2, because it provides a single positive example for the whole document, rather than one for each positive paragraph. It also reduces the strength of the supervision signal (↓), as the model now needs to figure out which mention to select from the larger document-level set Y_A.

Note that we can only use H3 coupled with a document-level model, because a paragraph-level model cannot directly trade off answers from different paragraphs against each other to select a single answer span from the document. As with the other distant supervision hypotheses, span-based and position-based definitions of the possible consistent outcomes can be formulated. The log-probabilities of these events are defined in row three of Table 2, when using \sum for Ξ. H3 was used by Kadlec et al. (2016) for cloze-style distantly supervised QA with recurrent neural network models.

The probability space (paragraph vs. document-level) and the distant supervision assumption (H1, H2, and H3, each position- or span-based) together define our interpretation of the distant supervision signal, resulting in definitions of probability space outcomes consistent with the supervision. Next, we define corresponding optimization objectives to train a model based on this supervision and describe the inference methods to make predictions with a trained model.
4 Optimization and Inference Methods
For each distant supervision hypothesis, we maximize either the marginal log-likelihood of A-consistent outcomes (MML) or the log-likelihood of the most likely outcome (HardEM). The latter was found effective for weakly supervised tasks including QA and semantic parsing by Min et al. (2019).

Table 2 shows the objective functions for all distant supervision assumptions, each comprising a pairing of a distant supervision hypothesis (H1, H2, H3) and a position-based vs. span-based interpretation. The probabilities are defined according to the assumed probability space (paragraph or document). In the table, K denotes the set of all paragraphs in the document, and Y^k denotes the set of weakly labeled answer spans for the paragraph p^k (which can be {NULL} for paragraph-level models). Note that span-based and position-based objective functions are equivalent for H1 because of the independence assumption, i.e. P_s(i^k, j^k) = P_b(i^k) P_e(j^k).
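To illustrate Table 2, the sketch below (NumPy; a simplified illustration rather than the released training code) evaluates the position-based H2 and H3 objectives given per-paragraph begin/end log-probabilities (computed with either normalization from §2; H3 should be paired with document-level normalization) and the A-consistent position sets Y_{b,A}^k and Y_{e,A}^k from §3. Ξ is instantiated as a log-sum-exp for MML and as a max for HardEM.

```python
import numpy as np

def xi(log_probs, mode):
    """Ξ over a set of candidate log-probabilities: logsumexp for MML, max for HardEM."""
    log_probs = np.asarray(log_probs, dtype=float)
    if mode == "mml":
        m = log_probs.max()
        return m + np.log(np.exp(log_probs - m).sum())
    return log_probs.max()  # mode == "hard_em"

def h2_position_objective(log_pb, log_pe, yb, ye, mode="mml"):
    """H2, position-based (Table 2, row two): sum over positive paragraphs k of
    log Ξ_{i in Y_{b,A}^k} P_b(i) + log Ξ_{j in Y_{e,A}^k} P_e(j)."""
    total = 0.0
    for k in range(len(log_pb)):
        if yb[k]:  # positive paragraph: it has at least one A-consistent span
            total += xi([log_pb[k][i] for i in yb[k]], mode)
            total += xi([log_pe[k][j] for j in ye[k]], mode)
    return total

def h3_position_objective(log_pb, log_pe, yb, ye, mode="mml"):
    """H3, position-based (Table 2, row three): a single Ξ over all A-consistent
    begin (and, separately, end) positions in the whole document."""
    begin_cands = [log_pb[k][i] for k in range(len(log_pb)) for i in yb[k]]
    end_cands = [log_pe[k][j] for k in range(len(log_pe)) for j in ye[k]]
    return xi(begin_cands, mode) + xi(end_cands, mode)

# Toy example: two paragraphs with 3 and 4 positions; yb/ye are A-consistent position sets.
log_pb = [np.log([0.2, 0.5, 0.3]), np.log([0.1, 0.2, 0.3, 0.4])]
log_pe = [np.log([0.3, 0.3, 0.4]), np.log([0.25, 0.25, 0.25, 0.25])]
yb, ye = [{1}, {0, 2}], [{2}, {1, 3}]
print(h2_position_objective(log_pb, log_pe, yb, ye, mode="mml"))
print(h3_position_objective(log_pb, log_pe, yb, ye, mode="hard_em"))
```

Training maximizes these quantities (equivalently, minimizes their negation); the span-based variants are obtained by replacing the begin/end position sets with the set of A-consistent spans and P_b, P_e with P_s.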
Inference: Since the task is to predict an answer string rather than a particular mention for a given question, it is potentially beneficial to aggregate information across answer spans corresponding to the same string during inference. The score of a candidate answer string can be obtained as P_a(x) = \Xi_{(i,j) \in X} P_s(i, j), where X is the set of spans corresponding to the answer string x, and Ξ can be either \sum or max.[5] It is usually beneficial to match the training objective with the corresponding inference method, i.e. MML with marginal inference (Ξ = \sum), and HardEM with max (Viterbi) inference (Ξ = max). Min et al. (2019) showed HardEM optimization was useful when using an H2 span-level distant supervision assumption coupled with max inference, but it is unclear whether this trend holds when \sum inference is useful or other distant supervision assumptions perform better. We therefore study exhaustive combinations of probability space, distant supervision assumption, and training and inference methods.

[5] For inference with marginal (\sum) scoring, we use an approximate scheme where we only aggregate probabilities of candidate strings generated from a 20-best list of begin/end answer positions for each paragraph.
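A minimal sketch of this string-level aggregation (our illustration; the helper name is ours, not the paper's inference code): predicted spans are grouped by their normalized surface string, and Ξ is a sum or a max over each group's span probabilities.

```python
from collections import defaultdict

def answer_string_scores(span_candidates, use_sum=True):
    """Aggregate span probabilities P_s(i, j) into answer string scores P_a(x).

    span_candidates: iterable of (answer_string, span_probability) pairs, e.g. the
    candidates generated from a k-best list of begin/end positions per paragraph.
    """
    scores = defaultdict(float)
    for text, prob in span_candidates:
        key = " ".join(text.lower().split())
        scores[key] = scores[key] + prob if use_sum else max(scores[key], prob)
    return dict(scores)

# Toy example: one answer string has two mentions with moderate probability each.
spans = [("Joan Rivers", 0.30), ("Joan Rivers", 0.25), ("Diary of a Mad Diva", 0.35)]
scores_sum = answer_string_scores(spans, use_sum=True)   # joan rivers: 0.55, alias: 0.35
scores_max = answer_string_scores(spans, use_sum=False)  # joan rivers: 0.30, alias: 0.35
print(max(scores_sum, key=scores_sum.get), max(scores_max, key=scores_max.get))
```

Sum inference favors strings with several moderately likely mentions, whereas max inference ranks strings by their single best mention.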
5 Experiments

5.1 Data and Implementation

Two datasets are used in this paper: TriviaQA (Joshi et al., 2017) in its Wikipedia formulation, and NarrativeQA (summaries setting) (Kočiský et al., 2018). Using the same preprocessing as Clark and Gardner (2018) for TriviaQA-Wiki[6], we only keep the top 8 ranked paragraphs of up to 400 tokens for each document-question pair for both training and evaluation. Following Min et al. (2019), for NarrativeQA we define the possible answer string sets A using Rouge-L (Lin, 2004) similarity with crowdsourced abstractive answer strings. We use identical data preprocessing and the evaluation script provided by the authors.

[6] https://github.com/allenai/document-qa

In this work, we use the BERT-base model for text encoding and train our model with the default configuration as described in (Devlin et al., 2019), fine-tuning all parameters. We fine-tune for 3 epochs on TriviaQA and 2 epochs on NarrativeQA.
answer positions for each paragraph. https://github.com/allenai/document-qa
Max Sum
0.76 TriviaQA NarrativeQA
0.74 Objective Infer
0.72 F1 EM Rouge-L
0.7
0.68 Paragraph-level Models
0.66
0.64 Max 67.9 63.3 55.3
0.62 H1-P
Sum 70.4 66.0 53.6

MML-Pos

MML-Pos

MML-Pos
HardEM-Span

MML-Span

HardEM-Span

MML-Span

HardEM-Span

MML-Span
HardEM-Pos

HardEM-Pos

HardEM-Pos
Max 71.9 67.7 59.2
H2-P
Sum 73.0 69.0 57.8
Document-level Models
H2-P H2-D H3-D
Max 55.8 51.0 59.4
H1-D
(a) TriviaQA F1 Sum 65.2 61.2 59.1
Max 70.3 66.2 60.1
0.62 Max Sum H2-D
Sum 72.4 68.4 59.9
0.6
0.58 Max 75.1 70.6 59.1
0.56 H3-D
0.54 Sum 75.3 70.8 59.2
0.52
0.5
0.48 Table 3: Comparison of distant supervision hypothe-
MML-Pos

MML-Pos

MML-Pos
HardEM-Span

MML-Span

HardEM-Span

MML-Span

HardEM-Span

MML-Span
HardEM-Pos

HardEM-Pos

HardEM-Pos

ses using MML-Pos objectives on TriviaQA and Nar-


rativeQA dev sets.

H2-P H2-D H3-D ter results than other formulations. Only H3-
(b) NarrativeQA Rouge-L D is capable of “cleaning” noise from positive
paragraphs that don’t have a correct answer (e.g.
Figure 3: Comparison of different optimization and in-
paragraph two in Fig. 1), by deciding which A-
ference choices grouped by distant supervision hypoth-
esis based on dev set results for TriviaQA and Narra- consistent mention to trust. The paragraph-level
tiveQA. models H1-P and H2-P outperform their corre-
sponding document-level counterparts H1-D and
H2-D. This may be due to the fact that without
served that under each distant supervision hypoth- H3, and without predicting NULL, D models do not
esis/probability space combination, the position- learn to detect irrelevant paragraphs.
based MML is always the best among the four ob- Unlike for TriviaQA, H2-D models achieve the
jectives. Position-based objectives may perform best performance for NarrativeQA. We hypothe-
better due to the independence assumptions for be- size this is due to the fact that positive paragraphs
gin/end positions of the model we use and future that don’t have a correct answer are very rare in
work may arrive at different conclusions if posi- NarrativeQA (as summaries are relatively short
tion dependencies are integrated. Based on this and answer strings are human-annotated for the
thorough exploration, we focus on experimenting specific documents). Therefore, H3 is not needed
with position-based objectives with MML for the to clean noisy supervision, and it is not useful
rest of this paper. since it also leads to a reduction in the number of
positive examples (coverage) for the model. Here,
5.3 Probability Space and Distant
document-level models always improve over their
Supervision Assumptions
paragraph counterparts, by learning to calibrate
In this subsection, we compare probability space paragraphs directly against each other.
and distant supervision assumptions. Table 3
shows the dev set results, where the upper sec- 5.4 Multi-Objective Formulations and Clean
tion compares paragraph-level models (H1-P, H2- Supervision
P), and the lower section compares document- Here we study two methods to further improve
level models (H1-D, H2-D, H3-D). The perfor- weakly supervised QA models. First, we com-
mance of models with both Max and Sum infer- bine two distant supervision objectives in a multi-
ence is shown. We report F1 and Exact Match task manner, i.e. H2-P and H3-D for TriviaQA, and
(EM) scores for TriviaQA, and Rouge-L scores for H2-P and H2-D for NarrativeQA, chosen based
NarrativeQA. on the results in §5.3. H2 objectives have higher
For TriviaQA, H3-D achieves significantly bet- coverage than H3 while being more susceptible
5.4 Multi-Objective Formulations and Clean Supervision

Here we study two methods to further improve weakly supervised QA models. First, we combine two distant supervision objectives in a multi-task manner, i.e. H2-P and H3-D for TriviaQA, and H2-P and H2-D for NarrativeQA, chosen based on the results in §5.3. H2 objectives have higher coverage than H3 while being more susceptible to noise. Paragraph-level models have the advantage of learning to score irrelevant paragraphs (via NULL outcomes). Note that we use the same parameters for the two objectives, so the multi-objective formulation does not have more parameters and is no less efficient than the individual models. Second, we use external clean supervision from SQuAD 2.0 (Rajpurkar et al., 2018) to train the BERT-based QA model for 2 epochs. This model matches the P probability space and is able to detect both NULL and extractive answer spans. The resulting network is used to initialize the models for TriviaQA and NarrativeQA. The results are shown in Table 4.

                                   TriviaQA           NarrativeQA
  Objective       Clean   Infer    F1       EM        Rouge-L
  Single-objective
  Par                     Max      71.9     67.7      59.2
                          Sum      73.0     69.0      57.8
  Par             ✓       Max      74.2     70.1      61.7
                          Sum      74.9     70.9      61.7
  Doc                     Max      75.1     70.6      60.1
                          Sum      75.3     70.8      59.9
  Doc             ✓       Max      75.5     70.8      62.8
                          Sum      75.5     70.9      62.9
  Multi-objective
  Par + Doc               Max      75.6     71.2      60.5
                          Sum      75.9     71.6      60.5
  Par + Doc       ✓       Max      75.8     71.2      63.0
                          Sum      76.2     71.7      63.1

Table 4: Dev set results comparing multi-objectives and clean supervision. ✓ indicates the QA model is pre-trained on SQuAD.

It is not surprising that using external clean supervision improves model performance (e.g. (Min et al., 2017)). We note that, interestingly, this external supervision narrows the performance gap between paragraph-level and document-level models, and reduces the difference between the two inference methods.

Compared with their single-objective components, multi-objective formulations improve performance on both TriviaQA and NarrativeQA.
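Because the two objectives share all encoder and output-layer parameters, combining them amounts to summing two losses computed from the same begin/end scores under the two normalizations; a minimal sketch of the TriviaQA combination (our illustration, reusing the h2_position_objective and h3_position_objective helpers from the sketch in §4):

```python
def multi_objective_loss(log_pb_par, log_pe_par, log_pb_doc, log_pe_doc, yb, ye):
    """Multi-task loss for TriviaQA: H2-P (paragraph-normalized) plus H3-D (document-normalized).

    Both terms are negative position-based MML objectives (Table 2), computed from the
    same underlying scores; only the begin/end normalization differs (see Section 2).
    For NarrativeQA, the second term would instead be the H2-D objective.
    """
    loss_h2p = -h2_position_objective(log_pb_par, log_pe_par, yb, ye, mode="mml")
    loss_h3d = -h3_position_objective(log_pb_doc, log_pe_doc, yb, ye, mode="mml")
    return loss_h2p + loss_h3d
```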
5.5 Test Set Evaluation

Table 5 reports test set results on TriviaQA and NarrativeQA for our best models, in comparison to recent state-of-the-art (SOTA) models. For TriviaQA, we report F1 and EM scores on the full test set and the verified subset. For NarrativeQA, Rouge-L scores are reported.

  TriviaQA Wiki                       Full               Verified
                                      F1       EM        F1       EM
  Ours (H2-P+H3-D)                    76.3     72.1      85.5     82.2
    w/o SQuAD                         75.7     71.6      83.6     79.6
  (Wang et al., 2018b)                71.4     66.6      78.7     74.8
  (Clark and Gardner, 2018)           68.9     64.0      72.9     68.0
  (Min et al., 2019)                  67.1     –         –        –

  NarrativeQA Summary                 Rouge-L
  Ours (H2-P+H2-D)                    62.9
    w/o SQuAD                         60.5
  (Nishida et al., 2019)              59.9
    w/o external data                 54.7
  (Min et al., 2019)                  58.8

Table 5: Test set results on TriviaQA Wiki and NarrativeQA Summaries. "w/o SQuAD" refers to our best model without pretraining on SQuAD 2.0. "w/o external data" refers to the model from (Nishida et al., 2019) without using MS MARCO data (Bajaj et al., 2018).

Compared to the recent TriviaQA SOTA (Wang et al., 2018b), our best models achieve 4.9 F1 and 5.5 EM improvement on the full test set, and 6.8 F1 and 7.4 EM improvement on the verified subset. On the NarrativeQA test set, we improve Rouge-L by 3.0 over (Nishida et al., 2019). The large improvement, even without additional fully labeled data, demonstrates the importance of selecting an appropriate probability space and interpreting the distant supervision in a way cognizant of the properties of the data, as well as selecting a strong optimization and inference method. With external fully labeled data to initialize the model, performance is further significantly improved.

5.6 Analysis

In this subsection, we carry out analyses to study the relative performance of paragraph-level and document-level models, depending on the size of the answer string set |A| and the number of A-consistent spans, which are hypothesized to correlate with label noise. We use the TriviaQA dev set and the best performing models, i.e. H2-P and H3-D with Sum inference.

We categorize examples based on the size of their answer string set, |A|, and the size of their corresponding set of A-consistent spans, |I|. Specifically, we divide the data into 4 subsets and report performance separately on each subset, as shown in Table 6.

  Subset   |A|    |I|    Size    H2-P    H3-D    ∆
  Q_ss     =1     ≤5     2585    66.8    67.4    0.6
  Q_ls     >1     ≤5      853    68.7    70.1    1.4
  Q_sl     =1     >5     1149    82.0    84.9    2.9
  Q_ll     >1     >5     3034    86.3    88.4    2.1

Table 6: F1 scores on 4 subsets of TriviaQA dev, grouped by the size of their answer string sets A and corresponding set of possible mentions I. ∆ indicates the improvement from H2-P to H3-D.

In general, we expect Q_sl and Q_ll to be noisier due to the larger I, where Q_sl potentially includes many irrelevant mentions while Q_ll likely contains more incorrect answer strings (false aliases). We can observe that the improvement is more significant for these noisier subsets, suggesting document-level modeling is crucial for handling both types of label noise.
6 Related Work

Distant supervision has been successfully used for decades for information extraction tasks such as entity tagging and relation extraction (Craven and Kumlien, 1999; Mintz et al., 2009). Several ways have been proposed to learn with DS, e.g., multi-label multi-instance learning (Surdeanu et al., 2012), assuming at least one supporting evidence (Hoffmann et al., 2011), integration of label-specific priors (Ritter et al., 2013), and adaptation to shifted label distributions (Ye et al., 2019).

Recent work has started to explore distant supervision to scale up QA systems, particularly for open-domain QA where the evidence has to be retrieved rather than given as input. Reading comprehension (RC) with evidence retrieved from information retrieval systems establishes a weakly-supervised QA setting due to the noise in the heuristics-based span labels (Chen et al., 2017; Joshi et al., 2017; Dunn et al., 2017; Dhingra et al., 2017). One line of work jointly learns RC and evidence ranking using either a pipeline system (Wang et al., 2018a; Lee et al., 2018; Kratzwald and Feuerriegel, 2018) or an end-to-end model (Lee et al., 2019).

Another line of work focuses on improving distantly-supervised RC models by developing learning methods and model architectures that can better use noisy labels. Clark and Gardner (2018) propose a paragraph-pair ranking objective, which has components of both our H2-P and H3-D position-based formulations. They don't explore multiple inference methods or combinations of objectives and use less powerful representations. In (Lin et al., 2018), a coarse-to-fine model is proposed to handle label noise by aggregating information from relevant paragraphs and then extracting answers from selected ones. Min et al. (2019) propose a hard EM learning scheme which we included in our experimental evaluation.

Our work focuses on examining probabilistic assumptions for document-level extractive QA. We provide a unified view of multiple methods in terms of their probability space and distant supervision assumptions and evaluate the impact of their components in combination with optimization and inference methods. To the best of our knowledge, the three DS hypotheses along with position- and span-based interpretations have not been formalized and experimentally compared on multiple datasets. In addition, the multi-objective formulation is new.

7 Conclusions

In this paper, we demonstrated that the choice of probability space and interpretation of the distant supervision signal for document-level QA have a large impact, and that they interact. Depending on the properties of the data, different configurations are best, and a combined multi-objective formulation can reap the benefits of its constituents.

A future direction is to extend this work to question answering tasks that require reasoning over multiple documents, e.g., open-domain QA. In addition, the findings may generalize to other tasks, e.g., corpus-level distantly-supervised relation extraction.

Acknowledgement

Some of the ideas in this work originated from Hao Cheng's internship with Google Research. We would like to thank Ankur Parikh, Michael Collins, and William Cohen for discussion and detailed feedback on this work, as well as other members from the Google Research Language team and the anonymous reviewers for valuable suggestions. We would also like to thank Sewon Min for generously sharing the processed data and evaluation script for NarrativeQA.
References

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A human generated machine reading comprehension dataset. CoRR, abs/1611.09268.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879. Association for Computational Linguistics.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 845–855. Association for Computational Linguistics.

Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Bhuwan Dhingra, Kathryn Mazaitis, and William W. Cohen. 2017. Quasar: Datasets for question answering by search and reading. CoRR, abs/1707.03904.

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR, abs/1704.05179.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 541–550. Association for Computational Linguistics.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611. Association for Computational Linguistics.

Rudolf Kadlec, Martin Schmid, Ondřej Bajgar, and Jan Kleindienst. 2016. Text understanding with the attention sum reader network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 908–918. Association for Computational Linguistics.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.

Bernhard Kratzwald and Stefan Feuerriegel. 2018. Adaptive document retrieval for deep question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 576–581. Association for Computational Linguistics.

Jinhyuk Lee, Seongjun Yun, Hyunjae Kim, Miyoung Ko, and Jaewoo Kang. 2018. Ranking paragraphs for improving answer recall in open-domain question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 565–569. Association for Computational Linguistics.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81. Association for Computational Linguistics.

Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. 2018. Denoising distantly supervised open-domain question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1736–1745.

Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. A discrete hard EM approach for weakly supervised question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2844–2857. Association for Computational Linguistics.

Sewon Min, Minjoon Seo, and Hannaneh Hajishirzi. 2017. Question answering through transfer learning from large fine-grained supervision data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 510–517. Association for Computational Linguistics.
Mike Mintz, Steven Bills, Rion Snow, and Daniel Ju-
rafsky. 2009. Distant supervision for relation ex-
traction without labeled data. In Proceedings of the
Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on
Natural Language Processing of the AFNLP, pages
1003–1011. Association for Computational Linguis-
tics.
Kyosuke Nishida, Itsumi Saito, Kosuke Nishida, Kazu-
toshi Shinoda, Atsushi Otsuka, Hisako Asano, and
Junji Tomita. 2019. Multi-style generative reading
comprehension. In Proceedings of the 57th Annual
Meeting of the Association for Computational Lin-
guistics, pages 2273–2284. Association for Compu-
tational Linguistics.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018.
Know what you don’t know: Unanswerable ques-
tions for SQuAD. In Proceedings of the 56th Annual
Meeting of the Association for Computational Lin-
guistics (Volume 2: Short Papers), pages 784–789.
Association for Computational Linguistics.
Alan Ritter, Luke Zettlemoyer, Oren Etzioni, et al.
2013. Modeling missing data in distant supervision
for information extraction. Transactions of the As-
sociation for Computational Linguistics, 1:367–378.
Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati,
and Christopher D. Manning. 2012. Multi-instance
multi-label learning for relation extraction. In Pro-
ceedings of the 2012 Joint Conference on Empirical
Methods in Natural Language Processing and Com-
putational Natural Language Learning, EMNLP-
CoNLL ’12, pages 455–465, Stroudsburg, PA, USA.
Association for Computational Linguistics.
Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang,
Tim Klinger, Wei Zhang, Shiyu Chang, Gerald
Tesauro, Bowen Zhou, and Jing Jiang. 2018a. R3:
Reinforced reader-ranker for open-domain question
answering. In AAAI Conference on Artificial Intelli-
gence.

Wei Wang, Ming Yan, and Chen Wu. 2018b. Multi-
granularity hierarchical attention fusion networks
for reading comprehension and question answering.
In Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume
1: Long Papers), pages 1705–1714. Association for
Computational Linguistics.
Qinyuan Ye, Liyuan Liu, Maosen Zhang, and Xiang
Ren. 2019. Looking beyond label noise: Shifted la-
bel distribution matters in distantly supervised rela-
tion extraction. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language
Processing and the 9th International Joint Confer-
ence on Natural Language Processing (EMNLP-
IJCNLP), pages 3839–3848. Association for Com-
putational Linguistics.
