Reasoning over Public and Private Data in Retrieval-Based Systems

Simran Arora‡, Patrick Lewis†, Angela Fan†, Jacob Kahn†∗, and Christopher Ré‡∗

‡Stanford University
{simran, chrismre}@[Link]

†Facebook AI Research
{plewis, angelafan, jacobkahn}@[Link]

Abstract

Users and organizations are generating ever-increasing amounts of private data from a wide range of sources. Incorporating private context is important to personalize open-domain tasks such as question-answering, fact-checking, and personal assistants. State-of-the-art systems for these tasks explicitly retrieve information that is relevant to an input question from a background corpus before producing an answer. While today's retrieval systems assume relevant corpora are fully (e.g., publicly) accessible, users are often unable or unwilling to expose their private data to entities hosting public data. We define the SPLIT ITERATIVE RETRIEVAL (SPIRAL) problem involving iterative retrieval over multiple privacy scopes. We introduce a foundational benchmark with which to study SPIRAL, as no existing benchmark includes data from a private distribution. Our dataset, CONCURRENTQA, includes data from distinct public and private distributions and is the first textual QA benchmark requiring concurrent retrieval over multiple distributions. Finally, we show that existing retrieval approaches face significant performance degradations when applied to our proposed retrieval setting and investigate approaches with which these tradeoffs can be mitigated. We release the new benchmark and code to reproduce the results.1

∗ Equal contribution.
1 [Link]/concurrentqa

1 Introduction

The world's information is split between publicly and privately accessible scopes, and the ability to simultaneously reason over both scopes is useful to support personalized tasks. However, retrieval-based machine learning (ML) systems, which first retrieve relevant information to a user input from background knowledge sources before providing an output, do not consider retrieving from private data that organizations and individuals aggregate locally. Neural retrieval systems are achieving impressive performance across applications such as language modeling (Borgeaud et al., 2022), question answering (Chen et al., 2017), and dialogue (Dinan et al., 2019), and we focus on the under-explored question of how to personalize these systems while preserving privacy.

Consider the following examples that require retrieving information from both public and private scopes. Individuals could ask "With my GPA and SAT score, which universities should I apply to?" or "Is my blood pressure in the normal range for someone 55+?". In an organization, an ML engineer could ask: "How do I fine-tune a language model, based on public StackOverflow and our internal company documentation?", or a doctor could ask "How are COVID-19 vaccinations affecting patients with type-1 diabetes based on our private hospital records and public PubMed reports?". To answer such questions, users manually cross-reference public and private information sources. We initiate the study of a retrieval setting that enables using public (global) data to enhance our understanding of private (local) data.

Figure 1: Multi-hop retrieval systems use beam search to collect information from a massive corpus: retrieval in hop i+1 is conditioned on the top documents retrieved in hop i. The setting of retrieving from corpora distributed across multiple privacy scopes is unexplored. Here, the content of a private document retrieved in hop i is revealed to the entity hosting public data if used to retrieve public documents in hop i+1.

Modern retrieval systems typically collect documents that are most similar to a user's question from a massive corpus, and provide the resulting documents to a separate model, which reasons over the information to output an answer (Chen et al., 2017). Multi-hop reasoning (Welbl et al., 2018) can be used to answer complex queries over information distributed across multiple documents, e.g., news articles and Wikipedia. For such queries, we observe that using multiple rounds of retrieval (i.e., combining the original query with retrieved documents at round i for use in retrieval at round i+1) provides over 75% improvement in

quality versus using one round of retrieval (Section 5). Iterative retrieval is now common in retrieval (Miller et al., 2016; Feldman and El-Yaniv, 2019; Asai et al., 2020; Xiong et al., 2021; Qi et al., 2021; Khattab et al., 2021, inter alia).

Existing multi-hop systems perform retrieval over a single privacy scope. However, users and organizations often cannot expose data to public entities. Maintaining terabyte-scale and dynamic data is difficult for many private entities, warranting retrieval from multiple distributed corpora.

To understand why distributed multi-hop retrieval implicates privacy concerns, consider two illustrative questions an employee may ask. First, to answer "Of the products our competitors released this month, which are similar to our unreleased upcoming products?", an existing multi-hop system likely (1) retrieves public documents (e.g., news articles) about competitors, and (2) uses these to find private documents (e.g., company emails) about internal products, leaking no private information. Meanwhile, "Have any companies ever released similar products to the one we are designing?" entails (1) retrieving private documents detailing the upcoming product, and (2) performing similarity search for public products using information from the confidential documents. The latter reveals private data to an untrusted entity hosting a public corpus. An effective privacy model will minimize leakage.

We introduce the SPLIT ITERATIVE RETRIEVAL (SPIRAL) problem. Public and private document distributions usually differ, and our first observation is that all existing textual benchmarks require retrieving from one data distribution. To appropriately evaluate SPIRAL, we create the first textual multi-distribution benchmark, CONCURRENTQA, which spans Wikipedia in the public domain and emails in the private domain, enabling the study of two novel real-world retrieval setups: (1) multi-distribution and (2) privacy-preserving retrieval:

• Multi-distribution retrieval: The ability for a model to effectively retrieve over multiple distributions, even in the absence of privacy constraints, is a precursor to effective SPIRAL systems, since it is unlikely for all private distributions to be reflected at train time. However, the typical retrieval setup requires retrieving over a single document distribution with a single query distribution (Thakur et al., 2021). We initiate the study of the real-world multi-distribution setting. We find that the SoTA multi-hop QA model trained on 90.4k Wikipedia examples underperforms the same model trained on the 15.2k CONCURRENTQA (Wikipedia and Email) examples by 20.8 F1 points on questions based on Email passages. Further, we find the performance of the model trained on Wikipedia improves by 4.3% if we retrieve the top k/2 passages from each distribution vs. retrieving the overall top-k passages, which is the standard protocol.

• Privacy-preserving retrieval: We then propose a framework for reasoning about the privacy tradeoffs required for SoTA models to achieve as good performance on public-private QA as is achieved on public QA. We evaluate performance when no private information is revealed, and models trained only on public data (e.g., Wikipedia) are utilized. Under this privacy standard, models sacrifice upwards of 19% performance under SPIRAL constraints to protect document privacy and 57% to protect query privacy when compared to a baseline system with standard, non-privacy-aware retrieval mechanics. We then study how to manage the privacy-performance tradeoff using selective prediction, a popular approach for improving the reliability of QA systems (Kamath et al., 2020; Lewis et al., 2021; Varshney et al., 2022).

In summary: (1) we are the first to report on problems with applying existing neural retrieval systems to the public and private retrieval setting, (2) we create CONCURRENTQA, the first textual multi-distribution benchmark to study the problems, and (3) we provide extensive evaluations of existing retrieval approaches under the proposed real-world retrieval settings. We hope this work encourages further research on private retrieval.

2 Background & Related Work

2.1 Retrieval-Based Systems

Open-domain applications, such as question answering and personal assistants, must support user inputs across a broad range of topics. Implicit-memory approaches for these tasks focus on memorizing the knowledge required to answer questions within model parameters (Roberts et al., 2020). Instead of memorizing massive amounts of knowledge in model parameters, retrieval-based systems introduce a step to retrieve information that is relevant to a user input from a massive corpus of documents (e.g., Wikipedia), and then provide this to a separate task model that produces the output. Retrieval-free approaches have not been shown to work convincingly in multi-hop settings (Xiong et al., 2021).

2.2 Multi-hop Retrieval

We focus on open-domain QA (ODQA), a classic application for retrieval-based systems. ODQA entails providing an answer a to a question q, expressed in natural language and without explicitly provided context from which to find the answer (Voorhees, 1999). A retriever collects relevant documents to the question from a corpus, then a reader model extracts an answer from selected documents.

Our setting is concerned with complex queries where supporting evidence for the answer is distributed across multiple (public and private) documents, termed multi-hop reasoning (Welbl et al., 2018). To collect the distributed evidence, systems use multiple hops of retrieval: representations of the top passages retrieved in hop i are used to retrieve passages in hop i+1 (Miller et al., 2016; Feldman and El-Yaniv, 2019; Asai et al., 2020; Wolfson et al., 2020; Xiong et al., 2021; Qi et al., 2021; Khattab et al., 2021).2 Finally, we discuss the applicability of existing multi-hop benchmarks to our problem setting in Section 4.

2 Note that beyond multi-hop QA, retrieval-augmented language models and dialogue systems also involve iterative retrieval (Guu et al., 2020).

2.3 Privacy Preserving Retrieval

Information retrieval is a long-standing topic spanning the machine learning, databases, and privacy communities. We discuss the prior work and considerations for our setup along three axes: (1) Levels of privacy. Prior private retrieval system designs guarantee privacy for different components across both query and document privacy. Our setting requires both query and document privacy. (2) Relative isolation of document storage and retrieval computation. The degree to which prior retrieval and database systems store or send private data to centralized machines (with or without encryption) varies. Our work structures dataflow to eliminate processing of private documents on public retrieval infrastructure. (3) Updatability and latency. Works make different assumptions about how a user will interact with the system. These include (1) tolerance of high-latency responses and (2) whether corpora are static or changing. Our setting focuses on open-domain questions for interactive applications with massive, temporally changing corpora and requiring low latency.

Isolated systems with document and query privacy but poor updatability. To provide the strongest possible privacy guarantee, i.e., no information about the user questions or passages is revealed, prior work considers when purely local search is possible (Cao et al., 2019), i.e., search performed on systems controlled exclusively by the user. This guarantee provides no threat opportunities, assuming that both data (documents and queries) and computation occur on controlled infrastructure. Scaling the amount of locally hosted data and updating local corpora with quickly changing public data is challenging; we build a system that might meet such demands.
Public, updatable database systems providing query privacy. A distinct line of work explores how to securely perform retrieval such that the user query is not revealed to a public entity that hosts and updates databases. Private information retrieval (PIR) (Chor et al., 1998) in the cryptography community refers to a setup where users know the entry in a remote database that they want to retrieve (Kogan, 2020). The threat model is directly related to the cryptographic scheme used to protect queries and retrieval computation. Here, the document containing the answer is assumed to be known; leaking the particular corpus containing the answer may implicitly leak information about the query. In contrast, we focus on open-domain applications, where users ask about any topic imaginable and do not know which corpus item holds the answer. Our setting also considers document privacy, as discussed in Section 6.

Public, updatable but high-latency secure nearest neighbor search with document and query privacy. The next relevant line of work focuses on secure nearest neighbor search (NNS) (Murugesan et al., 2010; Chen et al., 2020a; Schoppmann et al., 2020; Servan-Schreiber, 2021), where the objective is to securely (1) compute similarity scores between the query and passages, and (2) select the top-k scores. The speed of the cryptographic tools (secure multi-party computation, secret sharing) that are used to perform these steps increases as the sparsity of the query and passage representations increases. Performing the secure protocol over dense embeddings can take hours per query (Schoppmann et al., 2020). As before, threats in this setting are related to vulnerabilities in cryptographic schemes or to actors gaining access to private document indices if not directly encrypted. Prior work relaxes privacy guarantees and computes approximate NNS; speeds, however, are still several seconds per query (Schoppmann et al., 2020; Chen et al., 2020a). This is prohibitive for iterative open-domain retrieval applications.

Partial query privacy via fake query augmentation for high-latency retrieval from public databases. Another class of privacy techniques for hiding the user's intentions is query obfuscation or k-anonymity. The user's query is combined with fake queries or queries from other users to increase the difficulty of linking a particular query to the user's true intentions (Gervais et al., 2014). This multiplies communication costs, since nearest neighbors must be retrieved for each of the k queries; iterative retrieval worsens this cost penalty. Further, the private query is revealed among the full set of k; the threat of identifying the user's true query remains (Haeberlen et al., 2011).
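As a concrete illustration of the obfuscation idea above, here is a minimal sketch in which the true query hides among decoys; decoy_pool and search_public are illustrative placeholders, not a hardened scheme:

```python
# Minimal sketch of query obfuscation (k-anonymity) as described above:
# the true query is hidden among m-1 decoys, all m are sent to the public
# server, and only the true query's results are kept client-side. Note the
# m-fold communication cost, and that the true query is still in the set.
import random


def obfuscated_search(true_query, decoy_pool, search_public, m=8, k=4):
    batch = random.sample(decoy_pool, m - 1) + [true_query]
    random.shuffle(batch)                    # server cannot rank by position
    results = {q: search_public(q, k) for q in batch}  # m retrievals, not 1
    return results[true_query]               # discard decoy results locally
```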
Finally, we note that our primary focus is on inference-time privacy concerns; during training time, federated learning (FL) with differential privacy (DP) is a popular strategy for training models without exposing training data (McMahan et al., 2016; Dwork et al., 2006).

Overall, despite significant interest in IR, there is limited attention towards characterizing the privacy risks, as previously observed (Si and Yang, 2014). Our setting, which focuses on supporting open-domain applications with modern dense retrievers, is not well studied. Further, the prior works do not characterize the privacy concerns associated with iterative retrieval. Studying this setting is increasingly important with the prevalence of API-hosted large language models and services. For instance, users may want to incorporate private knowledge into systems that make multiple calls to OpenAI model endpoints (Brown et al., 2020; Khattab et al., 2022). Code assistants, which may be extended to interact with both private repositories and public resources like Stack Overflow, are also seeing widespread use (Chen et al., 2021).

3 Problem Definition

Objective. Given a multi-hop input q, a set of private documents p ∈ DP, and public documents d ∈ DG, the objective is to provide the user with the correct answer a, which is contained in the documents. Figure 1 (Right) provides an example. Overall, the SPLIT ITERATIVE RETRIEVAL (SPIRAL) problem entails maximizing quality, while protecting query and document privacy.

Standard, Non-Privacy-Aware QA. Standard non-private multi-hop ODQA involves answering q with the help of passages d ∈ DG, using beam search. In the first iteration of retrieval, the k passages from the corpus, d1, ..., dk, that are most relevant to q are retrieved. The text of a retrieved passage is combined with q using a function f (e.g., concatenating the query and passage sequences) to produce qi = f(q, di), for i ∈ [1..k]. Each qi (which contains di) is used to retrieve k more passages in the following iteration.
We now introduce the SPIRAL retrieval problem. The user inputs to the QA system are the private corpus DP and questions q. There are two key properties of the problem setting.

Property 1: Data is likely stored in multiple enclaves, and personal documents p ∈ DP cannot leave the user's enclave. Users and organizations own private data, and untrustworthy (e.g., cloud) services own public data. First, we assume users likely do not want to publicly expose their data to create a single public corpus, nor blindly write personal data to a public location. Next, we also assume it is challenging to store global data locally in many cases. This is because not only are there terabytes of public data and user searches follow a long tail (Bernstein et al., 2012) (i.e., it is challenging to anticipate all of a user's information needs), but public data is also constantly being updated (Zhang and Choi, 2021). Thus, DP and DG are hosted as separate corpora.

Now given q, the system must perform one retrieval over DG and one over DP, rank the results such that the top-k passages include kP private and kG public passages, and use these for the following iteration of retrieval. If the retrieval system stops after a single hop, there is no document privacy risk, since no p ∈ DP is publicly exposed, and no query privacy risk if the system used to retrieve from DP is private, as is assumed. However, for multi-hop questions, if kP > 0 for an initial round of retrieval, meaning there exists some pi ∈ DP which was in the top-k passages, it would sacrifice privacy if f(q, pi) were to be used to perform the next round of retrieval from DG. Thus, for the strongest privacy guarantee, public retrievals should precede private document retrievals. For less privacy-sensitive use cases, this strict ordering can be weakened.
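A minimal sketch of this per-hop ranking over the two indices; the per-scope search functions are illustrative placeholders, and the returned kP count indicates whether a privacy-sensitive expansion would occur in the next hop:

```python
# Minimal sketch of the per-hop ranking described above: retrieve from the
# private and public corpora separately, then rank the union so the top-k
# contains k_P private and k_G public passages (k_P + k_G = k). The search
# functions are assumed to return score-sorted (passage, score, scope) hits.
def rank_across_scopes(q, search_private, search_public, k=8):
    private_hits = search_private(q, k)   # [(passage, score, "private"), ...]
    public_hits = search_public(q, k)     # [(passage, score, "public"), ...]
    merged = sorted(private_hits + public_hits,
                    key=lambda t: t[1], reverse=True)[:k]
    k_p = sum(1 for t in merged if t[2] == "private")  # k_P; k_G = k - k_P
    return merged, k_p  # if k_p > 0, expanding into DG would leak privacy
```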
Property 2: Inputs that entirely rely on private information should not be revealed publicly. Given the multiple indices, DP and DG, q may be entirely answerable using multiple hops over the DP index, in which case q would never need to leave the user's device. For example, the employee query "Does the search team use any infrastructure tools that our personal assistant team does not use?" is fully answerable with company information. Prior work demonstrates that queries are very revealing of user interests, intents, and backgrounds (Xu et al., 2007; Gervais et al., 2014). There is an observable difference in the search behavior of users with privacy concerns (Zimmerman et al., 2019), and an effective system will protect queries.

4 CONCURRENTQA Benchmark

Here we develop a testbed for studying public-private retrieval. We require questions spanning two corpora, DP and DG. First, we consider using existing benchmarks and describe the limitations we encounter, motivating the creation of our new benchmark, CONCURRENTQA. Then we describe the dataset collection process and its contents.

4.1 Adapting Existing Benchmarks

We first adapt the widely used benchmark, HotpotQA (Yang et al., 2018), to study our problem. HotpotQA contains multi-hop questions, which are each answered using two Wikipedia passages. We create HotpotQA-SPIRAL by splitting the Wikipedia corpus into DG and DP. This results in questions entirely reliant on p ∈ DP, entirely on d ∈ DG, or reliant on a mix of one private and one public document, allowing us to evaluate performance under SPIRAL constraints.

Ultimately, however, DP and DG come from a single Wikipedia distribution in HotpotQA-SPIRAL. Private and public data will often reflect different linguistic styles, structures, and topics. We observe all existing textual multi-hop benchmarks require retrieving from a single distribution. We cannot combine two existing benchmarks over two corpora, because this will not yield questions that rely on both corpora simultaneously. To evaluate with a more realistic setup, we create a new benchmark: CONCURRENTQA. We quantitatively demonstrate the limitations of using HotpotQA-SPIRAL in the experiments and analysis.

4.2 CONCURRENTQA Overview

We create and release a new multi-hop QA dataset, CONCURRENTQA, which is designed to more closely resemble a practical use case for SPIRAL. CONCURRENTQA contains questions spanning Wikipedia documents as DG and Enron employee emails (Klimt and Yang, 2004) as DP.3 We propose two unique evaluation settings for CONCURRENTQA: performance (1) conditioned on
the sub-domains in which the question evidence can be found (Section 5), and (2) conditioned on the degree of privacy protection (Section 6).

3 The Enron Corpus includes emails written by 158 employees of Enron Corporation, which are in the public domain.

Question: What was the estimated 2016 population of the city that generates power at the Hetch Hetchy hydroelectric dams?
  Hop 1: An email mentions that San Francisco generates power at the Hetch Hetchy dams.
  Hop 2: The Wikipedia passage about San Francisco reports the 2016 census-estimated population.

Question: Which firm invested in both the 5th round of funding for Extraprise and first round of funding for [Link]?
  Hop 1: An email lists 5th-round Extraprise investors.
  Hop 2: An email lists round-1 investors for JobsOn[Link].

Table 1: Example CONCURRENTQA queries based on Wikipedia passages (DG) and emails (DP).

The corpora contain 47k emails (DP) and 5.2M Wikipedia passages (DG), and the benchmark contains 18,439 examples (Table 2). Questions require three main reasoning patterns: (1) bridge questions require identifying an entity or fact in Hop1 on which the second retrieval is dependent, (2) attribute questions require identifying the entity that satisfies all attributes in the question, where attributes are distributed across passages, and (3) comparison questions require comparing two similar entities, each appearing in a separate passage. We estimate the benchmark is 80% bridge, 12% attribute, and 8% comparison questions. We focus on factoid QA.

Split   Total    EE     EW     WE     WW
Train   15,239   3762   4002   3431   4044
Dev     1,600    400    400    400    400
Test    1,600    400    400    400    400

Table 2: Size statistics. The evaluation splits are balanced between questions with gold passages as emails (E) vs. Wikipedia (W) passages for Hop1 and Hop2.

Benchmark Design. Each benchmark example includes the question, which requires reasoning over multiple documents; the answer, which is a span of text from the supporting documents; and the specific supporting sentences in the documents, which are used to arrive at the answer and can serve as supervision signals.

As discussed in Yang et al. (2018), collecting a high-quality multi-hop QA dataset is challenging because it is important to provide reasonable pairs of supporting context documents to the worker — not all article pairs are conducive to a good multi-hop question. There are four types of pairs we need to collect for the Hop1 and Hop2 passages: Private and Private, Private and Public, Public and Private, and Public and Public. We use the insight that we can obtain meaningful passage pairs by showing workers passages that mention similar or overlapping entities. All crowdworker assignments contain unique passage pairs. A detailed description of how the passage pairs are produced is in Appendix C, and we release all our code for creating the passage pairs.

Benchmark Collection. We used Amazon Turk for collection. The question generation stage began with an onboarding process in which we provided training videos, documents with examples and explanations, and a multiple-choice exam. Workers completing the onboarding phase were given access to pilot assignments, which we manually reviewed to identify individuals with high-quality submissions. We worked with these individuals to collect the full dataset. We manually reviewed over 2.5k queries in the quality-check process and prioritized including the manually-verified examples in the final evaluation splits.

In the manual review, examples of the criteria that led us to discard queries included: the query could be answered using one passage alone, had multiple plausible answers either in or out of the shown passages, or lacked clarity. During the manual review, we developed a multiple-choice questionnaire to streamline the checks along the identified criteria. We then used this to launch a second Turk task to validate the generated queries that we did not manually review. Assembling the cohort of crowdworkers for the validation task again involved onboarding and pilot steps, in which we manually reviewed performance. We shortlisted ∼20 crowdworkers with high-quality submissions who collectively validated examples appearing in the final benchmark.
4.3 Benchmark Analysis

Emails and Wiki passages differ in several ways. Format: Wiki passages for entities of the same type tend to be similarly structured, while emails introduce many formats — for example, certain emails contain portions of forwarded emails, lists of articles, or spam advertisements. Noise: Wiki passages tend to be typo-free, while the emails contain several typos, URLs, and inconsistent capitalization. Entity Distributions: Wiki passages tend to focus on details about one entity, while a single email can cover multiple (possibly unrelated) topics. Information about email entities is also often distributed across passages, whereas public-entity information tends to be localized to one Wiki passage. We observe that a private entity occurs 9× on average in gold training data passages, while a public entity appears 4× on average. There are 22.6k unique private entities in the gold training data passages, and 12.8k unique public entities. Passage Length: Finally, emails are 3× longer than Wiki passages on average.4

4 Since information density is generally lower in emails vs. Wiki passages, this helps crowdworkers generate meaningful questions. Lengths were chosen to fit within the model context window.

Answer Types. CONCURRENTQA is a factoid QA task, so answers tend to be short spans of text containing nouns, or entity names and properties. Figure 2 shows the distribution of NER tags across answers and examples from each category.

Figure 2: NER types for CONCURRENTQA answers.

Limitations. As in HotpotQA, workers see the gold supporting passages when writing questions, which can result in lexical overlap between the questions and passages. We mitigate these effects through validation-task filtering and by limiting the allowed lexical overlap via the Turk interface. Next, our questions are not organic user searches; however, existing search and dialogue logs do not contain questions over public and private data to our knowledge. Finally, Enron was a major public corporation; data encountered during pretraining could impact the distinction between public and private data. We investigate this in Section 5.

Ethics Statement. The Enron Dataset is already widely used in NLP research (Heller, 2017). That said, we acknowledge the origin of this data as collected and made public by the U.S. FERC during their investigation of Enron. We note that many of the individuals whose emails appear in the dataset were not involved in wrongdoing. We defer to using inboxes that are frequently used in prior work.

In the next sections, we evaluate CONCURRENTQA in the SPIRAL setting. We first ask how a range of SoTA retrievers perform in the multi-domain retrieval setting in Section 5, then introduce baselines for CONCURRENTQA under a strong privacy guarantee in which no private information is revealed whatsoever in Section 6.

5 Evaluating Mixed-Domain Retrieval

Here we study SoTA multi-hop model performance on CONCURRENTQA in the novel multi-distribution setting. The ability of models trained on public data to generalize to private distributions, with little or no labeled data, is a precursor to solutions for SPIRAL. In the commonly studied zero-shot retrieval setting (Guo et al., 2021; Thakur et al., 2021), the top-k passages will be from a single distribution; however, users often have diverse questions and documents.

We first evaluate multi-hop retrievers. Then we apply strong single-hop retrievers to the setting, to understand the degree to which iterative retrieval is required in CONCURRENTQA.

5.1 Benchmarking Multi-Hop Retrievers

Retrievers. We evaluate the multi-hop dense retrieval model (MDR) (Xiong et al., 2021), which achieves SoTA on multi-hop QA, and a multi-hop implementation of BM25, a classical bag-of-words method, as prior work indicates its strength in OOD retrieval (Thakur et al., 2021).

MDR is a bi-encoder model consisting of a query encoder and a passage encoder. Passage embeddings are stored in an index designed for efficient retrieval (Johnson et al., 2017). In Hop1, the embedding for query q is used to retrieve the k passages d1, ..., dk with the highest retrieval score, given by the maximum inner product between question and passage encodings. For multi-hop MDR, those retrieved passages are each appended to q and encoded, and each of the k resulting embeddings is
used to collect k more passages in Hop2, yielding k² passages. The top-k of the passages after the final hop are inputs to the reader, ELECTRA-Large (Clark et al., 2020). The reader selects a candidate answer in each passage.5 The candidate with the highest reader score is outputted.

5 Xiong et al. (2021) compares ELECTRA and other readers such as FiD (Izacard and Grave, 2021), finding similar performance. We follow their approach and use ELECTRA.
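A minimal sketch of the index-and-search setup described above, using FAISS inner-product search (Johnson et al., 2017); the dimensionality and random vectors are stand-ins for MDR's encoders:

```python
# Minimal sketch of the bi-encoder + maximum-inner-product setup: passage
# embeddings live in a FAISS inner-product index, and queries are matched
# by maximum inner product. Random vectors stand in for encoded text.
import numpy as np
import faiss

dim = 768                                  # illustrative embedding width
index = faiss.IndexFlatIP(dim)             # exact inner-product search

passage_vecs = np.random.rand(10_000, dim).astype("float32")  # encoded passages
index.add(passage_vecs)

query_vec = np.random.rand(1, dim).astype("float32")  # stand-in for encode(q)
scores, ids = index.search(query_vec, 4)   # top-4 Hop-1 passages by score
```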

Baselines. We evaluate four retrieval baselines: (1) CONCURRENTQA-MDR, a dense retriever trained on the CONCURRENTQA train set (15.2k examples), to understand the value of in-domain training data for the task; (2) HotpotQA-MDR, trained on HotpotQA (90.4k examples), to understand how well a publicly trained model performs on the multi-distribution benchmark; (3) Subsampled HotpotQA-MDR, trained on subsampled HotpotQA data of the same size as the CONCURRENTQA train set, to investigate the effect of dataset size; and (4) BM25 sparse retrieval. Results are in Table 3. Experimental details are in Appendix A.6

6 We check for dataset leakage stemming from the "public" models potentially viewing "private" email information in pretraining. Using the MDR and ELECTRA models fine-tuned on HotpotQA, we evaluate on CONCURRENTQA using a corpus of only Wiki passages. Test scores are 72.0 and 3.3 EM for questions based on two Wiki and two email passages respectively, suggesting explicit access to emails is important.

                           OVERALL        Domain-Conditioned (F1)
Retrieval Method           EM     F1      EE     EW     WE     WW
CONCURRENTQA-MDR           48.9   56.5    49.5   66.4   41.8   68.3
HotpotQA-MDR               45.0   53.0    28.7   61.7   41.1   81.3
Subsampled HotpotQA-MDR    37.2   43.9    23.8   51.1   28.6   72.1
BM25                       33.2   40.8    44.2   30.7   50.2   30.5
Oracle                     74.1   83.4    66.5   87.5   89.4   90.4

Table 3: CONCURRENTQA results using four retrieval approaches, and oracle retrieval. On the right, we show performance (F1 scores) by the domains of the Hop1 and Hop2 gold passages for each question (email is "E", Wikipedia is "W", and "EW" indicates the gold passages are email for Hop1 and Wikipedia for Hop2).

Training Data Size. Strong dense retrieval performance requires a large amount of training data. Comparing CONCURRENTQA-MDR and Subsampled HotpotQA-MDR, the former outperforms by 12.6 F1 points, as it is evaluated in-domain. However, the HotpotQA-MDR baseline, trained on the full HotpotQA training set, performs nearly equal to CONCURRENTQA-MDR. Figure 3 shows the performance as training dataset size varies. Next, we observe the sparse method matches the zero-shot performance of the Subsampled HotpotQA model on CONCURRENTQA. For larger dataset sizes (HotpotQA-MDR) and in-domain training data (CONCURRENTQA-MDR), dense outperforms sparse retrieval. Notably, it may be difficult to obtain training data for all private or temporally arising distributions.

Figure 3: F1 score vs. training data size, training MDR on subsampled HotpotQA (HPQA) and subsampled CONCURRENTQA (CQA) training data. We also show trends by the question domain for CQA (dotted lines).

Domain Specific Performance. Each retriever excels in a different subdomain of the benchmark. Table 3 shows the retrieval performance of each method based on whether the gold supporting passages for Hop1 and Hop2 are email (E) or Wikipedia (W) passages (EW is Email-Wiki for Hop1-Hop2). HotpotQA-MDR performance on WW questions is far better than on questions involving emails. The sparse retriever performs worse than the dense models on questions involving W, but better on questions with E in Hop2. When training on CONCURRENTQA, performance on questions involving E improves significantly, but remains low on W-based questions. Finally, we explicitly provide the gold supporting passages to the reader model (Oracle). EE oracle performance also remains low, indicating room to improve the reader.
How well does the retriever trained on public data perform in the SPIRAL setting? We observe the HotpotQA-MDR model is biased towards retrieving Wikipedia passages. On examples where the gold Hop1 passage is an email, 15% of the time no emails appear in the top-k Hop1 results; meanwhile, this only occurs 4% of the time when Hop1 is Wikipedia. On the slice of EE examples, 64% of Hop2 passages are E, while on the slice of WW examples, 99.9% of Hop2 passages are W. If we simply force equal retrieval (k/2) from each domain on each hop, we observe a 2.3 F1 point (4.3%) improvement in CONCURRENTQA performance, compared to retrieving the overall top-k. Optimally selecting the allocation for each domain is an exciting question for future work.

Performance on WE questions is notably worse than on EW questions. We hypothesize this is because several emails discuss each Wikipedia entity, which may increase the noise in Hop2 (i.e., WE is a one-to-many hop, while for EW, W typically contains one valid entity-specific passage). The latter is intuitively because individuals refer to a narrow set of public entities in private discourse.
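A minimal sketch contrasting the standard overall top-k protocol with the forced equal-allocation heuristic described above; the hit lists are assumed to be score-sorted (passage_id, score) pairs from per-domain indices:

```python
# Minimal sketch of the two merge protocols compared above. Inputs are
# assumed to be score-sorted hit lists from separate email and Wikipedia
# indices, with scores comparable because they come from one encoder.
def overall_top_k(email_hits, wiki_hits, k=8):
    # Standard protocol: one ranked list across both domains.
    return sorted(email_hits + wiki_hits, key=lambda t: t[1], reverse=True)[:k]


def balanced_top_k(email_hits, wiki_hits, k=8):
    # Forced equal allocation: top k/2 per domain, then re-rank the union.
    half = k // 2
    merged = email_hits[:half] + wiki_hits[:half]
    return sorted(merged, key=lambda t: t[1], reverse=True)
```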
from each domain on each hop, we observe 2.3 F1
C ONCURRENT QA. We follow the training pro-
points (4.3%) improvement in C ONCURRENT QA
tocol and use the code released in Izacard et al.
performance, compared to retrieving the overall
(2021), and include these details in Appendix A.
top-k. Optimally selecting the allocation for each
The fine-tuned model achieves 39.7 Recall@10
domain is an exciting question for future work.
and 63.6 Recall@100, while two-hop MDR
Performance on WE questions is notably worse
achieves 55.9 Recall@10 and 73.8 Recall@100
than on EW questions. We hypothesize this is
(Table 9 in the Appendix). We observe Con-
because several emails discuss each Wikipedia-
triever’s one-hop Recall@100 of 63.6 exceeds the
entity, which may increase the noise in Hop2 (i.e.,
two-hop MDR Recall@10 of 55.9, suggesting a
WE is a one-to-many hop, while for EW, W typ-
tradeoff space between the number of passages re-
ically contains one valid entity-specific passage).
trieved per hop (which is correlated with cost) and
The latter is intuitively because individuals refer to
the ability to circumvent iterative retrieval (which
a narrow set of public entities in private discourse.
we identify implicates privacy concerns).
5.2 Benchmarking Single-Hop Retrieval
6 Evaluation under Privacy Constraints
In Section 3, we identify that iterative retrieval im-
plicates document privacy. Therefore, an impor- This section provides baselines for C ONCURREN -
tant preliminary question is to what degree multi- T QA under privacy constraints. We concretely
ple hops are actually required? We investigate this study a baseline in which no private information
question using both HotpotQA and C ONCURREN - is revealed publicly whatsoever. We believe this is
T QA. We evaluate MDR using just the first-hop an informative baseline for two reasons:
results and Contriever (Izacard et al., 2021), the
1. The privacy setting we study is often catego-
SoTA single-hop dense retrieval model.
rized as an access-control framework — dif-
Results In Table 4, we summarize the retrieval ferent parties have different degrees of access
results from using three off-the-shelf models for to different degrees of privileged information.
HotpotQA: (1) the HotpotQA MDR model for While this setting is quite restrictive, this pri-
one-hop, (2) the pretrained Contriever model, and vacy framework is widely used in practice for
(3) the MS-MARCO (Nguyen et al., 2016) fine- instance in the government and medical fields
tuned variant of Contriever. We observe a size- (Bell and LaPadula, 1976; Hu et al., 2006).
2. There are many possible privacy constraints, as users find different types of information to be sensitive (Xu et al., 2007). Studying these is an exciting direction that we hope is facilitated by this work. Because the appropriate privacy relaxations are subjective, we focus on characterizing the upper (Section 5) and lower bounds (Section 6) of retrieval quality in our proposed setting.

Setup. We use models trained on Wikipedia data to evaluate performance under privacy restrictions, both in the in-distribution multi-hop HotpotQA-SPIRAL setting (an adaptation of the HotpotQA benchmark (Yang et al., 2018) to the SPIRAL setting) and the multi-distribution CONCURRENTQA setting. Motivating the latter setup, sufficient training data is seldom available for all private distributions. We use the multi-hop SoTA model, MDR, which is representative of the iterative retrieval procedure that is used across multi-hop solutions (Miller et al., 2016; Feldman and El-Yaniv, 2019; Xiong et al., 2021, inter alia).

We construct HotpotQA-SPIRAL by randomly assigning passages to the private (DP) and public (DG) corpora. To enable a clear comparison, we ensure that the sizes of DP and DG, and the proportions of questions for which the gold documents are public and private in Hop1 and Hop2, match those in CONCURRENTQA.

6.1 Evaluation

We evaluate performance when no private information (neither queries nor documents) is revealed whatsoever. We compare four baselines, shown in Table 6. (1) No Privacy Baseline: We combine all public and private passages in one corpus, ignoring privacy concerns. (2) No Privacy Multi-Index: We create two corpora and retrieve the top-k from each index in each hop, and retain the top-k of these 2k documents for the next hop, without applying any privacy restriction. Note performance should match single-index performance. (3) Document Privacy: We use the process in (2), but cannot use a private passage retrieved in Hop1 to subsequently retrieve from the public DG. (4) Query Privacy: The baseline to keep q entirely private is to only retrieve from DP.

                          HotpotQA-SPIRAL    CONCURRENTQA
Model                     EM      F1         EM      F1
No Privacy Baseline       62.3    75.3       45.0    53.0
No Privacy Multi-Index    62.3    75.3       45.0    53.0
Document Privacy          56.8    68.8       36.1    43.0
Query Privacy             34.3    43.3       19.1    23.8

Table 6: Results on the multi-hop QA datasets using the dense retrieval baseline (MDR) under each privacy setting.
We can answer many complex questions while revealing no private information whatsoever (see Table 5). However, in maintaining document privacy, the end-to-end QA performance degrades by 9% for HotpotQA and 19% for CONCURRENTQA compared to the quality of the non-private
system; degradation is worse under query privacy. We hope the resources we provide facilitate future work under alternate privacy frameworks.

Answered with No Privacy, but not under Document Privacy
  Q1: In which region is the site of a meeting between Dabhol manager Wade Cline and Ministry of Power Secretary A. K. Basu located?
  Q2: What year was the state-owned regulation board that was in conflict with Dabhol Power over the DPC project formed?
Answered with Document Privacy
  Q1: The U.S. Representative from New York who served from 1983 to 2013 requested a summary of what order concerning a price cap complaint?
  Q2: How much of the company known as DirecTV Group does GM own?
Answered with Query Privacy*
  Q1: Which CarrierPoint backer has a partner on SupplySolution's board?
  Q2: At the end of what year did Enron India's managing director responsible for managing operations for Dabhol Power believe it would go online?
*All evidence is in private emails and not in Wikipedia.

Table 5: Examples of queries answered under different privacy restrictions. Bold indicates private information.

6.2 Managing the Privacy-Quality Tradeoff

Alongside improving the retriever's quality, an important area of research for end-to-end QA systems is to avoid providing users with incorrect predictions, given existing retrievers. Significant work focuses on equipping QA systems with this selective-prediction capability (Chow, 1957; El-Yaniv and Wiener, 2010; Kamath et al., 2020; Jones et al., 2021, inter alia). Towards improving the reliability of the QA system, we next evaluate selective prediction in our novel retrieval setting.

Setup. Selective prediction aims to provide the user with an answer only when the model is confident. The goal is to answer as many questions as possible (high coverage) with as high performance as possible (low risk). Given a query q and a model which outputs (â, c), where â is the predicted answer and c ∈ R represents the model's confidence in â, we output â if c ≥ γ for some threshold γ ∈ R, and abstain otherwise. As γ increases, risk and coverage both tend to decrease. The QA model outputs an answer and score for each of the top-k retrieved passages — we compute the softmax over the top-k scores and use the top softmax score as c (Hendrycks and Gimpel, 2017; Varshney et al., 2022). Models are trained on HotpotQA, representing the public domain.
Results. Risk-coverage curves for HotpotQA and CONCURRENTQA are in Figure 4. Under Document Privacy, the "No Privacy" scores of 75.3 F1 for HotpotQA and 53.0 F1 for CONCURRENTQA are achieved at 85.7% and 67.8% coverage, respectively. In the top plots, in the absence of privacy concerns, the risk-coverage trends are worse for CONCURRENTQA vs. HotpotQA (i.e., quality degrades more quickly as coverage increases). Out-of-distribution selective prediction is actively studied (Kamath et al., 2020); however, this setting differs from the standard setup. The bottom plots show on CONCURRENTQA that the risk-coverage trends differ widely based on the sub-domains of the questions; the standard retrieval setup typically has a single distribution (Thakur et al., 2021).

Figure 4: Risk-coverage curves using the model trained on Wikipedia data for HotpotQA-PAIR and multi-distribution CONCURRENTQA retrieval, both under No Privacy and Document Privacy, where privacy is achieved by restricting Private-to-Public retrieval altogether. The left shows the overall test results, and the right is split by the domains of the gold supporting passages for the question at hand, for Hop1 to Hop2.

Next, privacy restrictions correlate with degradations in the risk-coverage curves on both CONCURRENTQA and HotpotQA. Critically, HotpotQA is in-distribution for the retriever. Strategies beyond selective prediction via max-prob, the prevailing approach in NLP (Varshney et al., 2022), may be useful for the SPIRAL setting.
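For reference, risk-coverage curves like those in Figure 4 can be traced by sweeping the abstention threshold over model confidences; a minimal sketch, assuming per-example confidence and F1 arrays:

```python
# Minimal sketch of tracing a risk-coverage curve: answer the top-i most
# confident examples for each i, and record coverage (fraction answered)
# and risk (1 - mean F1 on the answered subset).
import numpy as np


def risk_coverage(confidences, f1s):
    order = np.argsort(-confidences)          # most confident first
    f1_sorted = f1s[order]
    n = len(f1s)
    coverage = np.arange(1, n + 1) / n        # answer the top-i examples
    risk = 1.0 - np.cumsum(f1_sorted) / np.arange(1, n + 1)
    return coverage, risk
```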
7 Conclusion

We ask how to personalize neural retrieval systems in a privacy-preserving way and report
on how arbitrary retrieval over public and private data poses a privacy concern. We define the SPIRAL retrieval problem, present the first textual multi-distribution benchmark to study the novel setting, and empirically characterize the privacy-quality tradeoffs faced by neural retrieval systems.

We motivated the creation of a new benchmark, as opposed to repurposing existing benchmarks. CONCURRENTQA is multi-distributional — we qualitatively identified differences between the public Wikipedia and private emails in Section 4.3, and quantitatively demonstrated the effects of applying models trained on one distribution (e.g., public) to the mixed-distribution (e.g., public and private) setting in Sections 5 and 6. Private iterative retrieval is underexplored, and we hope the benchmark resource and evaluations we provide inspire further research on this topic, for instance under alternate privacy models.

Acknowledgements

We thank Jack Urbaneck, Wenhan Xiong, and Gautier Izacard for their advice and feedback. We gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); US DEVCOM ARL under No. W911NF-21-2-0251 (Interactive Human-AI Teaming); ONR under No. N000141712266 (Unifying Weak Supervision); ONR N00014-20-1-2480: Understanding and Applying Non-Euclidean Geometry in Machine Learning; N000142012275 (NEPTUNE); NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, Google Cloud, Salesforce, Total, the HAI-GCP Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), the Stanford Graduate Fellowship, and members of the Stanford DAWN project: Facebook, Google, and VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.

References

Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over Wikipedia graph for question answering. In International Conference on Learning Representations (ICLR).

David E. Bell and Leonard J. LaPadula. 1976. Secure computer system: Unified exposition and Multics interpretation. The MITRE Corporation.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Michael S. Bernstein, Jaime Teevan, Susan Dumais, Daniel Liebling, and Eric Horvitz. 2012. Direct answers for search queries in the long tail. SIGCHI.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. 2022. Improving language models by retrieving from trillions of tokens. In Proceedings of the 39th International Conference on Machine Learning (PMLR).

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901.
Qingqing Cao, Noah Weber, Niranjan Balasubramanian, and Aruna Balasubramanian. 2019. DeQA: On-device question answering. In The 17th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys).

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).

Hao Chen, Ilaria Chillotti, Yihe Dong, Oxana Poburinnaya, Ilya Razenshteyn, and M. Sadegh Riazi. 2020a. SANNS: Scaling up secure approximate k-nearest neighbors search. In USENIX Security Symposium.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. In arXiv:2107.03374.

Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Wang. 2020b. HybridQA: A dataset of multi-hop question answering over tabular and textual data. In Findings of the Association for Computational Linguistics (EMNLP).

Benny Chor, Eyal Kushilevitz, Oded Goldreich, and Madhu Sudan. 1998. Private information retrieval. Journal of the ACM (JACM), 45(6):965–981.

Chao-Kong Chow. 1957. An optimum character recognition system using decision functions. In IRE Transactions on Electronic Computers.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations (ICLR).

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations (ICLR).

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. Theory of Cryptography Conference (TCC).

Ran El-Yaniv and Yair Wiener. 2010. On the foundations of noise-free selective classification. In Journal of Machine Learning Research (JMLR).

Yair Feldman and Ran El-Yaniv. 2019. Multi-hop paragraph retrieval for open-domain question answering. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL).

Arthur Gervais, Reza Shokri, Adish Singla, Srdjan Capkun, and Vincent Lenders. 2014. Quantifying web-search privacy. In ACM Conference on Computer and Communications Security (SIGSAC).

Mandy Guo, Yinfei Yang, Daniel Cer, Qinlan Shen, and Noah Constant. 2021. MultiReQA: A cross-domain evaluation for retrieval question answering models. In Proceedings of the Second Workshop on Domain Adaptation for NLP.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning (ICML).

Andreas Haeberlen, Benjamin C. Pierce, and Arjun Narayan. 2011. Differential privacy under fire. In USENIX Security Symposium, volume 33, page 236.

Nathan Heller. 2017. What the Enron e-mails say about us.

Dan Hendrycks and Kevin Gimpel. 2017. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations (ICLR).

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. Proceedings of the 28th International Conference on Computational Linguistics (COLING).

Vincent C. Hu, David F. Ferraiolo, and Rick D. Kuhn. 2006. Assessment of access control systems. National Institute of Standards and Technology (NIST).

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning. In Transactions on Machine Learning Research (TMLR).

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.

Erik Jones, Shiori Sagawa, Pang Wei Koh, Ananya Kumar, and Percy Liang. 2021. Selective classification can magnify disparities across groups. In International Conference on Learning Representations (ICLR).

Amita Kamath, Robin Jia, and Percy Liang. 2020. Selective question answering under domain shift. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL).

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Omar Khattab, Christopher Potts, and Matei Zaharia. 2021. Baleen: Robust multi-hop reasoning at scale via condensed retrieval. 35th Conference on Neural Information Processing Systems (NeurIPS).

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024.

B. Klimt and Y. Yang. 2004. Introducing the Enron corpus. In Proceedings of the 1st Conference on Email and Anti-Spam (CEAS).

Henry Corrigan-Gibbs and Dmitry Kogan. 2020. Private information retrieval with sublinear online time. In Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT).

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association of Computational Linguistics (TACL).

Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. 2021. PAQ: 65 million probably-asked questions and what you can do with them. In Transactions of the Association for Computational Linguistics (TACL).

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2016. Communication-efficient learning of deep networks from decentralized data. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS).

Alexander H. Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Mummoorthy Murugesan, Wei Jiang, Chris

Phillipp Schoppmann, Lennart Vogelsang, Adrià Gascón, and Borja Balle. 2020. Secure and scalable document similarity on distributed databases: Differential privacy to the rescue. Proceedings on Privacy Enhancing Technologies (PETS).

Sacha Servan-Schreiber. 2021. Private nearest neighbor search with sublinear communication and malicious security. In 2022 IEEE Symposium on Security and Privacy (SP).

Luo Si and Hui Yang. 2014. Privacy-preserving IR:
Clifton, Luo Si, and Jaideep Vaidya. 2010. Effi- When information retrieval meets privacy and
cient privacy-preserving similar document de- security. Proceedings of the 37th international
tection. The International Journal on Very conference on Research development in infor-
Large Data Bases (VLDB). mation retrieval (ACM SIGIR).

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Alon Talmor and Jonathan Berant. 2018. The
Gao, Saurabh Tiwary, Rangan Majumder, and web as a knowledge-base for answering com-
Li Deng. 2016. Ms marco: A human gener- plex questions. Conference of the North Amer-
ated machine reading comprehension dataset. ican Chapter of the Association for Computa-
CoCo@NIPS. tional Linguistics (NAACL).
Nandan Thakur, Nils Reimers, Andreas Ruckle,
Fabio Petroni, Aleksandra Piktus, Angela Fan, Abhishek Srivastav, and Iryna Gurevych. 2021.
Patrick Lewis, Majid Yazdani, Nicola De Cao, Beir: A heterogeneous benchmark for zero-
James Thorne, Yacine Jernite, Vladimir shot evaluation of information retrieval models.
Karpukhin, Jean Maillard, Vassilis Plachouras, Thirty-fifth Conference on Neural Information
Tim Rocktäschel, and Sebastian Riedel. 2021. Processing Systems Datasets and Benchmarks
"KILT: a benchmark for knowledge intensive Track (NeurIPS).
language tasks". In Proceedings of the 2021
Conference of the North American Chapter of Harsh Trivedi, Niranjan Balasubramanian, Tushar
the Association for Computational Linguistics: Kho, and Ashish Sabharwal. 2021. Musique:
Human Language Technologies (NAACL-HLT), Multi-hop questions via single-hop question
pages 2523–2544. composition. In Transactions of the Association
for Computational Linguistics (TACL).
Peng Qi, Haejun Lee, Oghenetegiri Sido, and
Neeraj Varshney, Swaroop Mishra, and Chitta
Christopher D. Manning. 2021. Retrieve, read,
Baral. 2022. Investigating selective prediction
rerank, then iterate: Answering open-domain
approaches across several tasks in iid, ood, and
questions of varying reasoning steps from text.
adversarial settings. Findings of the Association
Nils Reimers and Iryna Gurevych. 2019. for Computational Linguistics (ACL).
Sentence-bert: Sentence embeddings using Ellen M. Voorhees. 1999. The trec-8 question an-
siamese bert-networks. In Proceedings of the swering track report. In TREC.
2019 Conference on Empirical Methods in
Natural Language Processing (EMNLP). Johannes Welbl, Pontus Stenetorp, and Sebastian
Riedel. 2018. Constructing datasets for multi-
Adam Roberts, Colin Raffel, and Noam Shazeer. hop reading comprehension across documents.
2020. How much knowledge can you pack In Transactions of the Association for Compu-
into the parameters of a language model? In tational Linguistics (TACL).
Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. In Transactions of the Association for Computational Linguistics (TACL).

Wenhan Xiong, Xiang Lorraine Li, Srinivasan Iyer, Jingfei Du, Patrick Lewis, William Wang, Yashar Mehdad, Wen-tau Yih, Sebastian Riedel, Douwe Kiela, and Barlas Oguz. 2021. Answering complex open-domain questions with multi-hop dense retrieval. In International Conference on Learning Representations (ICLR).

Yabo Xu, Benyu Zhang, Zheng Chen, and Ke Wang. 2007. Privacy-enhancing personalized web search. In Proceedings of the 16th International Conference on World Wide Web (WWW).
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).

Jingtao Zhan, Qingyao Ai, Yiqun Liu, Jiaxin Mao, Xiaohui Xie, Min Zhang, and Shaoping Ma. 2022. Disentangled modeling of domain and relevance for adaptable dense retrieval. arXiv:2208.05753.

Michael J.Q. Zhang and Eunsol Choi. 2021. SituatedQA: Incorporating extra-linguistic contexts into QA. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Steven Zimmerman, Alistair Thorpe, Chris Fox, and Udo Kruschwitz. 2019. Investigating the interplay between searchers' privacy concerns and their search behavior. In Proceedings of the 42nd International ACM Conference on Research and Development in Information Retrieval (SIGIR).

A Experimental Details

The MDR retriever is trained with a contrastive loss as in Karpukhin et al. (2020), where each query is paired with a (gold annotated) positive passage and m negative passages to approximate the softmax over all passages. We consider two methods of collecting negative passages: first, we use random passages from the corpus that do not contain the answer (random), and second, we use one top-ranking passage from BM25 that does not contain the answer as a hard negative, paired with the remaining random negatives. We do not observe much difference between the two approaches on ConcurrentQA (as also observed in Xiong et al. (2021)), and thus use random negatives for all experiments.

The number of passages retrieved per hop, k, is an important hyperparameter; increasing k tends to increase recall but sacrifices precision. A larger k is also less efficient at inference time. We use k = 100 for all experiments in the paper, and Table 9 studies the effect of using different values of k. We find the hyperparameters in Table 7 work best, and train on up to 8 NVIDIA A100 GPUs.

Hyperparameter Value
Learning rate 5e-5
Batch size 150
Maximum passage length 300
Maximum query length at initial hop 70
Maximum query length at 2nd hop 350
Warmup ratio 0.1
Gradient clipping norm 2.0
Training epochs 64
Weight decay 0

Table 7: Retrieval hyperparameters for MDR training on ConcurrentQA and Subsampled-HotpotQA experiments.
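To make the objective concrete, the sketch below shows the contrastive loss described above in PyTorch. It is an illustration under our assumptions (tensor shapes and function names are ours), not the released MDR training code.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(q_emb, pos_emb, neg_emb):
        """Contrastive retrieval loss over one positive and m negatives.

        q_emb:   (B, d)    query embeddings
        pos_emb: (B, d)    gold positive passage embedding per query
        neg_emb: (B, m, d) m negative passage embeddings per query
        """
        # Similarity of each query to its gold passage: (B, 1).
        pos_scores = (q_emb * pos_emb).sum(-1, keepdim=True)
        # Similarity of each query to its m negatives: (B, m).
        neg_scores = torch.einsum("bd,bmd->bm", q_emb, neg_emb)
        # Softmax over [positive, negatives]; the gold passage is class 0,
        # approximating the softmax over all passages in the corpus.
        scores = torch.cat([pos_scores, neg_scores], dim=1)
        labels = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
        return F.cross_entropy(scores, labels)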
Sparse Retrieval For the sparse retrieval baseline, we use Pyserini with default parameters.7 We consider different values of k ∈ {1, 10, 25, 50, 100} per retrieval, reported in Table 10 in the Appendix. We generate the second hop query by concatenating the text of the initial query and the first hop passages.

7 [Link]
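For illustration, the two-hop sparse baseline can be sketched with Pyserini as below. The index path is a placeholder, and the snippet assumes a standard Pyserini JSON collection whose stored documents carry a "contents" field; it is a sketch, not the exact experimental script.

    import json
    from pyserini.search.lucene import LuceneSearcher

    # Placeholder path to a Lucene index built over the passage corpus.
    searcher = LuceneSearcher("indexes/passages")

    def two_hop_bm25(question: str, k: int = 10):
        """Retrieve k passages per hop; hop-2 queries concatenate the
        initial query with the text of each first-hop passage."""
        results = []
        for hit in searcher.search(question, k=k):
            passage = json.loads(hit.raw)["contents"]
            hop2_query = question + " " + passage
            hop2_hits = searcher.search(hop2_query, k=k)
            results.append((hit.docid, [h.docid for h in hop2_hits]))
        return results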
QA Model We use the provided ELECTRA-Large reader model checkpoint from Xiong et al. (2021) for all experiments. The model was trained on HotpotQA training data. Using the same reader is useful to understand how retrieval quality affects performance, in the absence of reader modifications.

Contriever Model We use the code released by Izacard et al. (2021) for the zero-shot and fine-tuning implementation and evaluation.8 We perform a hyperparameter search over the learning rate ∈ {1e−4, 1e−5}, temperature ∈ {0.5, 1}, and number of negatives ∈ {5, 10}. We found a learning rate of 1e−5 with a linear schedule and 10 negative passages to be best. These hyperparameters are chosen following the protocol in Izacard et al. (2021).

8 [Link] contriever

Model Recall@10
Two-hop MDR 55.9
Contriever 12.1
Contriever MS-MARCO 36.9

Table 8: Comparison of one-hop baseline models evaluated on the two-hop ConcurrentQA task without finetuning.

k Avg-PR F1
k = 1 41.4 33.5
k = 10 55.9 44.7
k = 25 63.3 48.0
k = 50 68.4 50.4
k = 100 73.8 53.0

Table 9: Retrieval performance (Average Passage-Recall@k, F1) for k ∈ {1, 10, 25, 50, 100} retrieved passages per hop, using the retriever trained on HotpotQA for OOD ConcurrentQA test data.

k F1
k = 1 22.0
k = 10 34.6
k = 25 37.8
k = 50 39.3
k = 100 40.8

Table 10: F1 score on the ConcurrentQA test data for k ∈ {1, 10, 25, 50, 100} per hop using BM25.
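As a reference for how Tables 9 and 10 are read, the Average Passage-Recall@k over the two gold hop passages can be computed roughly as below; this is our paraphrase of the metric, not the official evaluation script.

    def avg_passage_recall_at_k(retrieved_per_hop, gold_per_hop):
        """retrieved_per_hop: per-hop lists of retrieved passage ids.
        gold_per_hop: the gold passage id for each hop.
        Returns the fraction of gold hop passages recovered."""
        hits = sum(
            1.0
            for retrieved, gold in zip(retrieved_per_hop, gold_per_hop)
            if gold in retrieved
        )
        return hits / len(gold_per_hop)

    # Example: the hop-1 gold passage is retrieved, hop-2 is missed -> 0.5.
    print(avg_passage_recall_at_k([["p1", "p7"], ["p9"]], ["p1", "p3"]))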
B Additional Analysis

We include two figures to further characterize the differences between the Wikipedia and Enron distributions. Figure 5 (Left, Middle) shows the UMAP plots of ConcurrentQA questions using BERT-base representations, split by whether the gold hop passages are both from the same domain (e.g., two Wikipedia or two email passages) or require one passage from each domain. The plots reflect a separation between Wiki-based and email-based questions and passages.

C ConcurrentQA Details

Here we compare ConcurrentQA to available textual QA benchmarks and provide additional details on the benchmark collection procedure.

C.1 Overview

ConcurrentQA is the first multi-distribution textual benchmark. Existing benchmarks in this category are summarized in Table 11. We note that HybridQA (Chen et al., 2020b) and similar benchmarks also include multi-modal documents. However, these only contain questions that require one passage from each domain for all questions, i.e., one table and one passage. Our benchmark considers text-only documents, where questions can require arbitrary retrieval patterns across the distributions.

C.2 Benchmark Construction

We need to generate passage pairs for Hop1 and Hop2 covering two Wikipedia documents (Public, Public), an email and a Wikipedia document (Public, Private and Private, Public), and two emails (Private, Private).

Public-Public Pairs For Public-Public Pairs, we use a directed Wikipedia hyperlink graph G, where a node is a Wikipedia article and an edge (a, b) represents a hyperlink from the first paragraph of article a to article b. The entity associated with article b is mentioned in article a and described in article b, so b forms a bridge, or commonality, between the two contexts. Crowdworkers are presented the final public document pairs (a, b) ∈ G.
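A minimal sketch of this pair-generation step follows, assuming first-paragraph hyperlinks have already been extracted per article (e.g., from the KILT resource discussed below); the data structures and names are illustrative.

    from collections import defaultdict

    def build_hyperlink_graph(first_paragraph_links):
        """first_paragraph_links: dict mapping article title a to the list
        of article titles hyperlinked from a's first paragraph."""
        graph = defaultdict(set)
        for a, linked_titles in first_paragraph_links.items():
            for b in linked_titles:
                graph[a].add(b)  # edge (a, b): b is described in its own article
        return graph

    def candidate_public_pairs(graph):
        # Each edge (a, b) yields a candidate passage pair, with b acting
        # as the bridge entity mentioned in a and described in b.
        return [(a, b) for a, targets in graph.items() for b in targets]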
Figure 5: UMAP of BERT-base embeddings, using Reimers and Gurevych (2019), of ConcurrentQA questions based on the domains of the gold passage chain to answer the question (left and middle). I.e., questions that require an Email passage for hop 1 and a Wikipedia passage for hop 2 are shown as "Wiki-Email". Embeddings for all gold passages are also shown, split by domain (right).
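The embedding-and-projection recipe behind Figure 5 can be approximated as follows. The Sentence-BERT encoder matches the caption's reference to Reimers and Gurevych (2019); the specific checkpoint and UMAP settings are our assumptions.

    import umap
    from sentence_transformers import SentenceTransformer

    # A BERT-base sentence encoder; the exact checkpoint is an assumption.
    model = SentenceTransformer("bert-base-nli-mean-tokens")

    def project_to_2d(texts):
        """Embed questions or passages and reduce to 2-D for plotting."""
        embeddings = model.encode(texts)          # shape (n, 768)
        reducer = umap.UMAP(n_components=2, random_state=0)
        return reducer.fit_transform(embeddings)  # shape (n, 2)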

Dataset Size Domain
WebQuestions (Berant et al., 2013) 6.6K Freebase
WebQSP (Yih et al., 2016) 4.7K Freebase
WebQComplex (Talmor and Berant, 2018) 34K Freebase
MuSiQue (Trivedi et al., 2021) 25K Wiki
DROP (Dua et al., 2019) 96K Wiki
HotpotQA (Yang et al., 2018) 112K Wiki
2Wiki2MultiHopQA (Ho et al., 2020) 193K Wiki
Natural-QA (Kwiatkowski et al., 2019) 300K Wiki
ConcurrentQA 18.4K Email & Wiki

Table 11: Existing textual multi-hop benchmarks are designed over a single domain.

We provide the title of b as a hint to the worker, as a potential anchor for the multi-hop question.

To initialize the Wikipedia hyperlink graph, we use the KILT KnowledgeSource resource (Petroni et al., 2021) to identify hyperlinks in each of the Wikipedia passages.9 To collect passages that share enough in common, we eliminate entities b which are too specific or vague, having many plausible correspondences across passages. For example, given a representing a "company", it may be challenging to write a question about its connection to the "business psychology" doctrine the company ascribes to (b is too specific) or to the "country" in which the company is located (b is too general). To determine which Wiki entities to permit for a and b pairings shown to the workers, we ensure that the entities come from a restricted set of entity-categories. The Wikidata knowledge base stores type categories associated with entities (e.g., "Barack Obama" is a "politician" and "lawyer"). We compute the frequency of Wikidata types across the 5.2 million entities and permit entities containing any type that occurs at least 1000 times. We also restrict to Wikipedia documents containing a minimum number of sentences and tokens. The intuition for this is that highly specific types of entities (e.g., a legal code or scientific fact) and highly general types of entities (e.g., countries) occur less frequently.

9 [Link] KILT
10 [Link]

Pairs with Private Emails Unlike Wikipedia, hyperlinks are not readily available for many unstructured data sources including the emails, and the non-Wikipedia data contains both private and public (e.g., Wiki) entities. Thus, we design the following approach to annotate the public and private entity occurrences in the email passages:

1. We collect candidate entities with SpaCy.10

2. We split the full set into candidate public and candidate private entities by identifying Wikipedia linked entities amongst the spans tagged by the NER model. We annotate the text with the open-source SpaCy entity-linker, which links the text to entities in the Wiki knowledge base, to collect candidate occurrences of global entities.11 We use heuristic rules to filter remaining noise in the public entity list (see the sketch following this list).

3. We post-process the private entity lists to improve precision. High precision entity-linking is critical for the quality of the benchmark: a query assumed to require the retrieval of private passages a and b should not be unknowingly answerable by public passages. After curating the private entity list, we restrict to candidates which occur at least 5 times in the deduplicated set of passages.
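Steps 1 and 2 can be sketched as follows. spaCy's NER API is real, but the wikipedia_titles lookup below is a simplified stand-in for the open-source entity linker the paper uses, and the frequency cutoff mirrors step 3.

    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")

    def split_entities(email_texts, wikipedia_titles, min_count=5):
        """Collect candidate entities with spaCy NER (step 1), then split
        them into public (Wikipedia-linkable) and private spans (step 2).
        wikipedia_titles: set of known Wikipedia page titles, standing in
        for the entity linker."""
        public, private = Counter(), Counter()
        for text in email_texts:
            for ent in nlp(text).ents:
                if ent.text in wikipedia_titles:
                    public[ent.text] += 1
                else:
                    private[ent.text] += 1
        # Post-processing (step 3): keep entities seen >= min_count times.
        public_kept = {e for e, n in public.items() if n >= min_count}
        private_kept = {e for e, n in private.items() if n >= min_count}
        return public_kept, private_kept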

A total of 43.4k unique private entities and 8.8k unique public entities appear in the emails, and 1.6k private and 2.3k public entities occur at least 5 times across passages. We present crowdworkers with emails containing at least three total entities to ensure there is sufficient information to write the multi-hop question.
Private-Private Pairs are pairs of emails that mention the same private entity e. Private-Public and Public-Private pairs consist of an email mentioning a public entity e and the Wikipedia passage for e. In both cases, we provide the hint that e is a potential anchor for the multi-hop question.
Comparison Questions For comparison questions, Wikidata types are readily available for public entities, and we use these to present the crowdworker with two passages describing entities of the same type. For private emails, there is no associated knowledge graph, so we heuristically assign types to private entities by determining whether type strings occur frequently alongside the entity in emails (e.g., if "politician" is frequently mentioned in the emails in which an entity occurs, we assign the "politician" type).
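This heuristic could be implemented along the following lines; the type vocabulary and frequency threshold are our assumptions rather than values from the paper.

    from collections import Counter

    def assign_types(emails_with_entity, type_vocab, min_freq=10):
        """Assign type strings (e.g., "politician") to a private entity
        if they appear frequently in the emails mentioning that entity."""
        counts = Counter()
        for email_text in emails_with_entity:
            lowered = email_text.lower()
            for t in type_vocab:
                if t in lowered:
                    counts[t] += 1
        return {t for t, n in counts.items() if n >= min_freq}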
Finally, crowdworkers are presented with a passage pair and asked to write a question that requires information from both passages. We use separate interfaces for bridge vs. comparison questions and guide the crowdworker to form bridge questions by using the passages in the desired order for Hop1 and Hop2.

11 [Link] spaCy-entity-linker
