Rapidly Bootstrapping a Question Answering Dataset for COVID-19
of-domain fine-tuning has limited effectiveness. Of the models examined, T5 (Raffel et al., 2019) adapted for ranking (Nogueira et al., 2020) achieves the highest effectiveness in identifying sentences from documents containing answers. Furthermore, it appears that, in general, transformer models are more effective when fed well-formed natural language questions than when fed keyword queries.

2 Approach

At a high level, our dataset comprises (question, scientific article, exact answer) triples that have been manually created from the literature review page of Kaggle’s COVID-19 Open Research Dataset Challenge. It is easiest to illustrate our approach by example; see Figure 1.
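To make the shape of these triples concrete, the sketch below shows one plausible representation of a single entry; the field names and the keyword/question phrasings are illustrative only, not the exact schema or wording of the released dataset.

    # One CovidQA-style triple, based on the running example.
    # Field names and query phrasings are illustrative, not the
    # official release schema.
    record = {
        "category": "Asymptomatic shedding",
        "subcategory": "Proportion of patients who were asymptomatic",
        "keyword_query": "proportion of patients asymptomatic",        # hypothetical
        "question": "What proportion of patients were asymptomatic?",  # hypothetical
        "document_id": "56zhxd6e",   # unique article ID in CORD-19
        "exact_answer": "49 (14.89%) were asymptomatic",
    }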
The literature review “products” are organized into categories and subcategories, informed by the “Tasks” defined by the Kaggle organizers [4]. One such example is the category “Asymptomatic shedding” and the subcategory “Proportion of patients who were asymptomatic”. The subcategory may or may not be phrased in the form of a natural language question; in this case, it is not. In the “question” of CovidQA, we preserve this categorization and, based on it, manually created both a query comprising keywords (what a user might type into a search engine) and a well-formed natural language question. For both, we attempt to minimize the changes made to the original formulations; see Figure 1.

In the Kaggle literature review, for each category/subcategory there is an “answers table” that presents evidence relevant to the information need. Each table is different; in our running example, the table has columns containing the title of an article that contains an answer, its date, as well as the asymptomatic proportion, age, study design, and sample size. In this case, according to the site, the entries in the answer table came from notebooks by two data scientists (Ken Miller and David Mezzetti), whose contents were then vetted by two curators (Candler Clawson and Devan Wilkins).

For each row (entry) in the literature review answer table, we began by manually identifying the exact article referenced (in terms of its unique ID) in the COVID-19 Open Research Dataset (CORD-19). To align with ongoing TREC document retrieval efforts [5], we used the version of the corpus from April 10. Finding the exact document required keyword search, as there are sometimes slight differences between the titles in the answer tables and the titles in CORD-19. Once we had located the article, we manually identified the exact answer span—a verbatim extract from the document that serves as the answer. For example, in document 56zhxd6e the exact answer was marked as “49 (14.89%) were asymptomatic”.
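Because the titles differ slightly, this lookup is fuzzy rather than exact. The following is a minimal sketch of how such title matching might be automated, assuming the standard CORD-19 metadata.csv with cord_uid and title columns; the helper name is ours, and this is an illustration of the idea rather than the procedure we actually followed.

    import csv
    from difflib import SequenceMatcher

    def find_candidates(query_title, metadata_path="metadata.csv", k=5):
        """Shortlist CORD-19 articles whose titles best match a Kaggle title."""
        scored = []
        with open(metadata_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                title = row.get("title") or ""
                ratio = SequenceMatcher(None, query_title.lower(), title.lower()).ratio()
                scored.append((ratio, row.get("cord_uid", ""), title))
        scored.sort(reverse=True)
        return scored[:k]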
The annotation of the answer span was not a straightforward text substring match against the raw article contents based on the Kaggle answer, but required human judgment in most cases. In the simpler cases of our running example, the Kaggle answer was merely stated at a different precision. In more complex cases, the Kaggle answer does not match any text span in the article at all. For example, the article may provide the total number of patients and the number of patients who were asymptomatic, but does not explicitly provide a proportion. In these cases, we used our best judgment: if the absolute numbers were clearly stated in close proximity, we annotated (at least a part of) the exact answer span from which the proportion could be computed. See Figure 1 for an example in the case of document rjm1dqk7; here, the total number of patients is stated nearby in the text, from which the proportion can be computed. In some cases, however, it was not feasible to identify the answer in this manner, and we ignored the entry. Thus, not all rows in the Kaggle answer table translated into a question–answer pair.

[4] COVID-19 Open Research Dataset Challenge Tasks
[5] TREC-COVID
A lesson from the QA literature is that there is considerable nuance in defining exact answers and answer contexts, dating back over 20 years (Voorhees and Tice, 1999). For example, in document 56zhxd6e, although we have annotated the answer as “49 (14.89%) were asymptomatic” (Figure 1), an argument could be made that “14.89%” is perhaps a better exact span. We are cognizant of these complexities and sidestep them by using our dataset to evaluate model effectiveness at the sentence level. That is, we consider a model correct if it identifies the sentence that contains the answer. Thus, we only need to ensure that our manually annotated exact answers are (1) proper substrings of the article text, from its raw JSON source provided in CORD-19, and that (2) the substrings do not cross sentence boundaries. In practice, these two assumptions are workable at an intuitive level, and they allow us to avoid the need to articulate complex annotation guidelines that try to define the “exactness” of an answer.
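Both constraints are mechanically checkable. A minimal sketch, where a naive regex-based splitter stands in for whatever sentence segmentation one prefers (function names are ours):

    import re

    def split_sentences(text):
        """Very naive sentence splitter: break after ., ?, or ! plus whitespace."""
        return re.split(r"(?<=[.?!])\s+", text)

    def span_is_valid(article_text, answer):
        """Check that the annotated answer (1) occurs verbatim in the article
        and (2) fits entirely within a single sentence."""
        return answer in article_text and any(
            answer in sentence for sentence in split_sentences(article_text)
        )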
Another challenge that we encountered relates to the scope of some questions. Drawn directly from the literature review, some questions produced too many possible answer spans within a document and thus required rephrasing. As an example, for the topic “decontamination based on physical science”, most sentences in some articles would be marked as relevant. To address this issue, we deconstructed these broad topics into multiple questions, for example, related to “UVGI intensity used for inactivating COVID-19” and “purity of ethanol to inactivate COVID-19”.

Five of the co-authors participated in this annotation effort, applying the aforementioned approach, with one lead annotator responsible for approving topics and answering technical questions from the other annotators. Two annotators are undergraduate students majoring in computer science, one is a science alumna, another is a computer science professor, and the lead annotator is a graduate student in computer science—all affiliated with the University of Waterloo. Overall, the dataset took approximately 23 hours to produce, representing a final tally (for the version 0.1 release) of 124 question–answer pairs, 27 questions (topics), and 85 unique articles. For each question–answer pair, there are 1.6 annotated answer spans on average. We emphasize that this dataset, while too small to train supervised models, should still prove useful for evaluating unsupervised or out-of-domain transfer-based models.

3 Evaluation Design

The creation of the CovidQA dataset was motivated by a multistage design of end-to-end search engines, as exemplified by our Neural Covidex (Zhang et al., 2020) for AI2’s COVID-19 Open Research Dataset and related clinical trials data. This architecture, which is quite standard in both academia (Matveeva et al., 2006; Wang et al., 2011) and industry (Pedersen, 2010; Liu et al., 2017), begins with keyword-based retrieval to identify a set of relevant candidate documents that are then reranked by machine-learned models to bring relevant documents into higher ranks.

In the final stage of this pipeline, a module would take as input the query and the document (in principle, this could be an abstract, the full text, or some combination of paragraphs from the full text), and identify the most salient passages (for example, sentences), which might be presented as highlights in the search interface (Lin et al., 2003).

Functionally, such a “highlighting module” would not be any different from a span-based question answering system, and thus CovidQA can serve as a test set. More formally, in our evaluation design, a model is given the question (either the natural language question or the keyword query) and the full text of the ground-truth article in JSON format (from CORD-19). It then scores each sentence from the full text according to relevance. For evaluation, a sentence is deemed correct if it contains the exact answer, via substring matching.
From these results, we can compute a battery of metrics. Treating a model’s output as a ranked list, in this paper we evaluate effectiveness in terms of mean reciprocal rank (MRR), precision at rank one (P@1), and recall at rank three (R@3).
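Concretely, given a model’s ranked sentences for one question–article pair, each of these metrics reduces to a few lines; a sketch, with function and variable names of our own choosing:

    def evaluate_ranking(ranked_sentences, answers):
        """Score one question-article pair given sentences ranked by the model.
        A sentence is correct if it contains any annotated exact answer."""
        hits = [any(ans in sent for ans in answers) for sent in ranked_sentences]
        mrr = next((1.0 / (i + 1) for i, hit in enumerate(hits) if hit), 0.0)
        p_at_1 = 1.0 if hits and hits[0] else 0.0
        relevant = sum(hits)
        r_at_3 = sum(hits[:3]) / relevant if relevant else 0.0
        return mrr, p_at_1, r_at_3

Corpus-level figures are then averaged over all question–article pairs.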
against all the query tokens, then determine sentence relevance as the maximum contextual similarity at the token level.
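This token-level scoring rule (compare each sentence token against all query tokens, then take the maximum contextual similarity as the sentence score) can be sketched as follows, assuming a Hugging Face encoder and cosine similarity; the model choice and function names here are ours:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def token_embeddings(text):
        """Contextual token embeddings from the encoder's last hidden layer."""
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            return model(**inputs).last_hidden_state[0]  # (num_tokens, hidden)

    def sentence_score(query, sentence):
        """Sentence relevance: the maximum token-level cosine similarity
        between any sentence token and any query token."""
        q = torch.nn.functional.normalize(token_embeddings(query), dim=-1)
        s = torch.nn.functional.normalize(token_embeddings(sentence), dim=-1)
        return (s @ q.T).max().item()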