ABSTRACT

Given a data set consisting of private information about individuals, we consider the online query auditing problem: given a sequence of queries that have already been posed about the data, their corresponding answers – where each answer is either the true answer or "denied" (in the event that revealing the answer compromises privacy) – and given a new query, deny the answer if privacy may be breached or give the true answer otherwise. A related problem is the offline auditing problem where one is given a sequence of queries and all of their true answers and the goal is to determine if a privacy breach has already occurred.

We uncover the fundamental issue that solutions to the offline auditing problem cannot be directly used to solve the online auditing problem since query denials may leak information. Consequently, we introduce a new model called simulatable auditing where query denials provably do not leak information. We demonstrate that max queries may be audited in this simulatable paradigm under the classical definition of privacy where a breach occurs if a sensitive value is fully compromised. We also introduce a probabilistic notion of (partial) compromise. Our privacy definition requires that the a-priori probability that a sensitive value lies within some small interval is not that different from the posterior probability (given the query answers). We demonstrate that sum queries can be audited in a simulatable fashion under this privacy definition.

∗ Department of Computer Science, Stanford University, Stanford, CA 94305. Supported in part by NSF Grants EIA-0137761 and ITR-0331640, and a grant from SNRC. kngk@cs.stanford.edu.
† HP Labs, Palo Alto, CA 94304, and Department of Computer Science, Stanford University, Stanford, CA 94305. nmishra@cs.stanford.edu. Research supported in part by NSF Grant EIA-0137761.
‡ Department of Computer Science, Ben-Gurion University, P.O.B. 653 Be'er Sheva 84105, ISRAEL. Work partially done while at Microsoft Research, SVC. kobbi@cs.bgu.ac.il.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PODS 2005 June 13-15, 2005, Baltimore, Maryland.
Copyright 2005 ACM 1-59593-062-0/05/06 . . . $5.00.

1. INTRODUCTION

Let X be a data set consisting of n entries X = {x1, . . . , xn} taken from some domain (for this work we will typically use xi ∈ R). The data sets X of interest in this paper contain private information about specific individuals. A statistical query q = (Q, f) specifies a subset of the data set entries Q ⊆ [n] and a function f (usually f : 2^R → R). The answer f(Q) to the statistical query is f applied to the subset of entries {xi | i ∈ Q}. Some typical choices for f are sum, max, min, and median. Whenever f is known from the context, we abuse notation and identify q with Q.

We consider the online query auditing problem: Suppose that the queries q1, . . . , qt−1 have already been posed and the answers a1, . . . , at−1 have already been given, where each aj is either the true answer to the query fj(Qj) or "denied", for j = 1, . . . , t − 1. Given a new query qt, deny the answer if privacy may be breached, and provide the true answer otherwise.

Many data sets exhibit a competing (or even conflicting) need to simultaneously preserve individual privacy, yet also allow the computation of large-scale statistical patterns. One motivation for auditing is the monitoring of access to patient medical records, where hospitals have an obligation to report large-scale statistical patterns, e.g., epidemics or flu outbreaks, but are also obligated to hide individual patient records, e.g., whether an individual is HIV+. Numerous other datasets exhibit such a competing need – census bureau microdata is another commonly referenced example.

One may view the online auditing problem as a game between an auditor and an attacker. The auditor monitors an online sequence of queries, and denies queries with the purpose of ensuring that an attacker that issues queries and observes their answers would not be able to 'stitch together' this information in a manner that breaches privacy. A naive and utterly dissatisfying solution is that the auditor would deny the answer to every query. Hence, we modify the game a little, and look for auditors that provide as much information as possible (usually, this would be "large-scale" information, almost insensitive to individual information) while preserving privacy (usually, this would refer to "small-scale" individual information).

To argue about privacy, one should of course define what it means. This is usually done by defining what privacy breaches (or compromises) are (i.e., what should an attacker
achieve in order to win the privacy game). In most previous work on auditing, a 'classical' notion of compromise is used, i.e., a compromise occurs if there exists an index i such that xi is learned in full. In other words, in all datasets consistent with the queries and answers, xi is the same. While very appealing due to its combinatorial nature, we note that this definition of privacy is usually not quite satisfying. In particular, it does not consider cases where a lot of information is leaked about an entry xi (and hence privacy is intuitively breached), as long as this information does not completely pinpoint xi. One of the goals of this work is to extend the construction of auditing algorithms to guarantee more realistic notions of privacy, such as excluding partial compromise.

The first study of the auditing problem that we are aware of is due to Dobkin, Jones, and Lipton [9]. Given a private dataset X, a sequence of queries is said to compromise X if there exists an index i such that in all datasets consistent with the queries and answers, xi is the same (as mentioned, we refer to this kind of compromise as classical compromise). The goal of this work was to design algorithms that never allow a sequence of queries that compromises the data, regardless of the actual data. Essentially, this formulation never requires a denial. We call this type of auditing query monitoring or, simply, monitoring. In terms of utility, monitoring may be too restrictive as it may disallow queries that, in the context of the underlying data set, do not breach privacy. One may try to answer this concern by constructing auditing algorithms where every query is checked with respect to the data set, and a denial occurs only when an 'unsafe' query occurs.

A related variant of the auditing problem is the offline auditing problem: given a collection of queries and true answers to each query, determine if a compromise has already happened, e.g., if some xi can be uniquely determined. Offline algorithms are used in applications where the goal is to "catch" a privacy breach ex post facto.

It is natural to ask the following question: Can an offline auditing algorithm directly solve the online auditing problem? In the traditional algorithms literature, an offline algorithm can always be used to solve the online problem – the only penalty is in the efficiency of the resulting algorithm. To clarify the question for the auditing context, if we want to determine whether to answer qt, can we consult the dataset for the true answer at and then run an offline algorithm to determine if providing at would lead to a compromise?

Somewhat surprisingly, we answer this question negatively. The main reason is that denials leak information. As a simple example, suppose that the underlying data set is real-valued and that a query is denied only if some value is fully compromised. Suppose that the attacker poses the first query sum(x1, x2, x3) and the auditor answers 15. Suppose also that the attacker then poses the second query max(x1, x2, x3) and the auditor denies the answer. The denial tells the attacker that if the true answer to the second query were given then some value could be uniquely determined. Note that max(x1, x2, x3) cannot be less than 5, since then the sum could not be 15. Further, if max(x1, x2, x3) > 5 then the query would not have been denied, since no value could be uniquely determined. Consequently, max(x1, x2, x3) = 5 and the attacker learns that x1 = x2 = x3 = 5 — a privacy breach of all three entries. The issue here is that query denials reduce the space of possible consistent solutions, and this reduction is not explicitly accounted for in existing offline auditing algorithms.

In fact, denials could leak information even if we could pose only SQL-type queries to a table containing multiple attributes (i.e., not explicitly specify the subset of the rows). For example, consider the three queries sum, count, and max (in that order) on the 'salary' attribute, all conditioned on the same predicate involving other attributes. Since the selection condition is the same, all three queries act on the same subset of tuples. Suppose that the first two queries are answered. Then the max query is denied whenever its value is exactly equal to the ratio of the sum and the count values (which happens when all the selected tuples have the same salary value). Hence the attacker learns the salary values of all the selected tuples when a denial occurs.

To work around this problem, we introduce a new model called simulatable auditing where query denials provably do not leak information. Intuitively, an auditor is simulatable if an attacker, knowing the query-answer history, could make the same decision as the simulatable auditor regarding a newly posed query. Hence, the decision to deny a query does not leak any information beyond what is already known to the attacker.

1.1 Classical vs. Probabilistic Compromise

As we already mentioned, the classical definition of compromise has been extensively studied in prior work. This definition is conceptually simple, has an appealing combinatorial structure, and can serve as a starting point for evaluating solution ideas for privacy. However, as has been noted by many others, e.g., [2, 15, 6, 18], this definition is inadequate in many real contexts. For example, if an attacker can deduce that a private data element xi falls in a tiny interval, then the classical definition of privacy is not violated unless xi can be uniquely determined. While some have proposed a privacy definition where each xi can only be deduced to lie in a sufficiently large interval [18], note that the distribution of the values in the interval matters. For example, ensuring that age lies in an interval of length 50 when the user can deduce that age is between [−49, 1] essentially does not preserve privacy.

In order to extend the discussion on auditors to more realistic partial compromise notions of privacy, we introduce a new privacy definition. Our privacy definition assumes that there exists an underlying probability distribution from which the data is drawn. This is a fairly natural assumption since most attributes like age, salary, etc. have a known probability distribution. The essence of the definition is that for each data element xi and small-sized interval I, the auditor ensures that the prior probability that xi falls in the interval I is about the same as the posterior probability that xi falls in I given the queries and answers. This definition overcomes some of the aforementioned problems with classical compromise.
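The denial inference in the sum/max example above can be checked mechanically. The following sketch is our own toy discretization of the real-valued example (not an algorithm from this paper): it enumerates datasets consistent with sum(x1, x2, x3) = 15 and shows that the only max answer whose release would be denied is 5, so the denial itself pins down x1 = x2 = x3 = 5.

```python
from itertools import product

# Toy discretization of the example: sum(x1,x2,x3) was answered 15, and the
# offline auditor denies max(x1,x2,x3) exactly when revealing it would
# uniquely determine some entry.
GRID = [i * 0.5 for i in range(0, 31)]  # candidate values 0.0 .. 15.0

consistent = [t for t in product(GRID, repeat=3) if abs(sum(t) - 15) < 1e-9]

def denied(m):
    """Would revealing max = m be denied (i.e., some entry uniquely determined)?"""
    world = [t for t in consistent if abs(max(t) - m) < 1e-9]
    if not world:
        return None  # m is inconsistent with sum = 15
    return any(len({t[i] for t in world}) == 1 for i in range(3))

# The only consistent max value whose release is denied is 5, so observing a
# denial after the answered sum reveals x1 = x2 = x3 = 5.
denied_values = {m for m in GRID if denied(m)}
print(denied_values)  # -> {5.0}
```

The same enumeration viewpoint underlies the attacks in Section 3: the attacker intersects the denial condition with the already-published answers.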
With this notion of privacy, we demonstrate a simulatable
auditor for sum queries. The new auditing algorithm computes posterior probabilities by utilizing existing randomized algorithms for estimating the volume of a convex polytope, e.g., [11, 19, 16, 20]. To guarantee simulatability, we make sure that the auditing algorithm does not access the data set while deciding whether to allow the newly posed query qt (in particular, it does not compute the true answer to qt). Instead, the auditor draws many data sets according to the underlying distribution, assuming the previous queries and answers, then computes an expected answer a′t and checks whether revealing it would breach privacy for each of the randomly generated data sets. If for most answers the data set is not compromised then the query is answered, and otherwise the query is denied.

1.2 Overview of the paper

The rest of this paper is organized as follows. In Section 2 we discuss related work on auditing. A more general overview can be found in [1]. In Section 3 we give several motivating examples where the application of offline auditing algorithms to solve the online auditing problem leads to massive breaches of the dataset, since query denials leak information. We then introduce simulatable auditing in Section 4 and prove that max queries can be audited under this definition of auditing and the classical definition of privacy in Section 5. Then in Section 6 we introduce our new definition of privacy, and finally in Section 7 we prove that sum queries can be audited under both this definition of privacy and simulatable auditing.

2. RELATED WORK

2.1 Online Auditing

As the initial motivation for work on auditing involves the online auditing problem, we begin with known online auditing results. The earliest are the query monitoring results due to Dobkin, Jones, and Lipton [9] and Reiss [21] for the online sum auditing problem, where the answer to a query Q is Σ_{i∈Q} xi. For queries of size at least k elements, where each pair overlaps in at most r elements, they showed that any data set can be compromised in (2k − (ℓ + 1))/r queries by an attacker that knows ℓ values a priori. For fixed k, r and ℓ, if the auditor denies answers from query (2k − (ℓ + 1))/r onwards, then the data set is definitely not compromised. Here the monitor logs all the queries and disallows Qi if |Qi| < k, or if for some query t < i, |Qi ∩ Qt| > r, or if i ≥ (2k − (ℓ + 1))/r.¹ These results completely ignore the answers to the queries. On the one hand, we will see later that this is desirable in that the auditor is simulatable – the decision itself cannot leak any information about the data set. On the other hand, we will see that because the answers are ignored, sometimes only short query sequences are permitted, whereas longer ones could have been permitted without resulting in a compromise.

In addition, Chin [3] considered the online sum, max, and mixed sum/max auditing problems. Both the online sum and the online max problem were shown to have efficient auditing algorithms. But the mixed sum/max problem was shown to be NP-hard.

¹ Note that this is a fairly negative result. For example, if k = n/c for some constant c and r = 1, then the auditor would have to shut off access to the data after only a constant number of queries, since there are only about c queries where no two overlap in more than one element.

2.2 Offline Auditing

In the offline all problem, the auditor is given an offline set of queries q1, . . . , qt and true answers a1, . . . , at and must determine if a breach of privacy has occurred. In most of the works, a privacy breach is defined to occur whenever some element in the data set can be uniquely determined. If only sum or only max queries are posed, then polynomial-time auditing algorithms are known to exist [4]. However, when sum and max queries are intermingled, then determining whether a specific value can be uniquely determined is known to be NP-hard [3].

Kam and Ullman [15] consider auditing subcube queries, which take the form of a sequence of 0s, 1s, and *s, where the *s represent "don't cares". For example, the query 10**1* matches all entries with a 1 in the first position, 0 in the second, 1 in the fifth, and anything else in the remaining positions. Assuming sum queries over the subcubes, they demonstrate when compromise can occur depending on the number of *s in the queries and also depending on the range of input data values.

Kleinberg, Papadimitriou and Raghavan [17] consider the Boolean auditing problem, where the data elements are Boolean and the queries are sum queries over the integers. They showed that it is coNP-hard to decide whether a set of queries uniquely determines one of the data elements, and suggest an approximate auditor (that may refuse to answer a query even when the answer would not compromise privacy) that could be efficiently implemented. Polynomial-time max auditors are also given.

In the offline maximum version of the auditing problem, the auditor is given a set of queries q1, . . . , qt, and must identify a maximum-sized subset of queries such that all can be answered simultaneously without breaching privacy. Chin [3] proved that the offline maximum sum query auditing problem is NP-hard, as is the offline maximum max query auditing problem.

3. MOTIVATING EXAMPLES

The offline problem is fundamentally different from the online problem. In particular, existing offline algorithms cannot be used to solve the online problem, because offline algorithms assume that all queries were answered. We next illustrate how, if we invoke an offline algorithm on only the previously (truly) answered queries, an attacker may exploit the auditor's responses to breach privacy, since query denials leak information. In all cases, breaches are with respect to the same privacy definition for which the offline auditor was designed.

We assume that the attacker knows the auditor's algorithm for deciding denials. This is a standard assumption employed in the cryptography community, known as Kerckhoffs' Principle – despite the fact that the actual privacy-preserving auditing algorithm is public, an attacker should still not be able to breach privacy.

We first demonstrate how to use a max auditor to breach
the privacy of a significant fraction of a data set. The same attack works also for sum/max auditors. Then we demonstrate how a Boolean auditor may be adversarially used to learn an entire data set. The same attack works also for the (tractable) approximate auditing version of Boolean auditing in [17].

3.1 Max Auditing Breach

We arrange the data in n/4 disjoint 4-tuples and consider each 4-tuple of entries independently. In each of these 4-tuples we will use the offline auditor to learn one of the four entries, with success probability 1/2. Hence, on average we will learn 1/8 of the entries. Let a, b, c, d be the entries in a 4-tuple. We first issue the query max(a, b, c, d). This query is never denied; let m be its answer. For the second query, we drop at random one of the four entries, and query the maximum of the other three. If this query is denied, then we learn that the dropped entry equals m. Otherwise, we let m′ (= m) be the answer, drop another random element, and query the maximum of the remaining two. If this query is denied, we learn that the second dropped entry equals m′. If the query is allowed, we continue with the next 4-tuple.

Note that whenever the second or third query is denied, the above procedure succeeds in revealing one of the four entries. If all the elements are distinct, then the probability that we succeed in choosing the maximum value in the two random drops is 2/4. Consequently, on average, the procedure reveals 1/8 of the data.

3.2 Boolean Auditing Breach

Dinur and Nissim [8] demonstrated how an offline auditing algorithm can be used to mount an attack on a Boolean data set. We give a (slightly modified) version of that attack, that allows an attacker to learn the entire database.

Consider a data set of Boolean entries, with sum queries. Let d = x1, . . . , xn be a random permutation of the original data set. As in [8], we show how to use the auditor to learn d or its complement. We then show that in a typical ('non-degenerate') database one more query is enough for learning the entire data set.

We make the queries qi = sum(xi, xi+1) for i = 1, . . . , n − 1. Note that xi ≠ xi+1 iff the query qi is allowed.² After n − 1 queries we learn two candidates for the data set, namely, d itself and the bit-wise complement of d. To learn which of the candidates is the right one, we choose the last query to consist of three distinct entries xi, xj, xk whose values are not all equal, such that for each entry both queries touching it were denied. The query sum(xi, xj, xk) is never denied.³ The answer to the last query (1 or 2) resolves which of the two candidates for d is the right one.

² When xi ≠ xi+1 the query is allowed, as the answer xi + xi+1 = 1 does not reveal which of the values is 1 and which is 0. When xi = xi+1 the query is denied, as its answer (0 or 2) would reveal both xi and xi+1.
³ Here we use the fact that denials are not taken into account.

4. SIMULATABLE AUDITING

Evidently these examples show that denials leak information. The situation is further worsened by the failure of the auditors to formulate this information leakage and to take it into account along with the answers to allowed queries. Intuitively, denials leak because users can ask why a query was denied, and the reason is in the data. If the decision to allow or deny a query depends on the actual data, it reduces the set of possible consistent solutions for the underlying data.

A naive solution to the leakage problem is to deny whenever the offline algorithm would, and to also randomly deny queries that would normally be answered. While this solution seems appealing, it has its own problems. Most importantly, although it may be that denials leak less information, leakage is not generally prevented. Furthermore, the auditing algorithm would need to remember which queries were randomly denied, since otherwise an attacker could repeatedly pose the same query until it was answered. A difficulty then arises in determining whether two queries are equivalent. The computational hardness of this problem depends on the query language, and may be intractable, or even undecidable.

To work around the leakage problem, we make use of the simulation paradigm, which is widely used in cryptography (starting with the definition of semantic security [14]). The idea is the following: the reason that denials leak information is that the auditor uses information not available to the attacker (the answer to the newly posed query), and consequently performs a computation that the attacker could not perform by himself. A successful attacker capitalizes on this leakage to gain information. We introduce a notion of auditing where the attacker provably cannot gain any new information from the auditor's decision, so that denials do not leak information. This is formalized by requiring that the attacker is able to simulate or mimic the auditor's decisions. In such a case, because the attacker can equivalently decide if a query would be denied, denials do not leak information.

4.1 A Perspective on Auditing
We cast related work on auditing based on two important dimensions: utility and privacy. It is interesting to note the relationship between the information an auditor uses and its utility – the more information used, the longer the query sequences the auditor can decide to answer. That is because an informed auditor need not deny queries that do not actually put privacy at risk. On the other hand, as we saw in Section 3, if the auditor uses too much information, some of this information may be leaked, and privacy may be adversely affected.

We thus assume without loss of generality that q1, . . . , qt−1 were all answered.

Definition 4.1. An auditor is simulatable if the decision to deny or give an answer to the query qt is made based exclusively on q1, . . . , qt and a1, . . . , at−1 (and not at and not the dataset X = {x1, . . . , xn}), and possibly also the underlying probability distribution D from which the data was drawn.
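Definition 4.1 amounts to a type restriction on the auditor's decision procedure: it may read the query/answer history and sample from D, but never the real dataset or the true answer at. A minimal sketch of this shape (the helper names sample_consistent and would_breach are our own illustrations, not the paper's algorithms):

```python
import random

def make_simulatable_auditor(sample_consistent, would_breach, trials=200, seed=0):
    """Build a denial rule in the shape of Definition 4.1: the decision is a
    function of (q_1..q_t, a_1..a_{t-1}) and the distribution D alone; the
    true dataset X and the true answer a_t are never consulted, so an
    attacker can rerun this very code and predict every denial."""
    rng = random.Random(seed)

    def deny(queries, answers, q_t):
        breaches = 0
        for _ in range(trials):
            # Draw a candidate world from D, conditioned on the history.
            world = sample_consistent(queries, answers, rng)
            hypothetical = q_t(world)  # the answer q_t would receive there
            if would_breach(queries, answers, q_t, hypothetical):
                breaches += 1
        # Conservative rule for this sketch: deny if any sampled world is
        # unsafe (the paper's auditors use sharper, setting-specific tests).
        return breaches > 0

    return deny
```

For Boolean data with sum queries, for instance, would_breach could flag any hypothetical answer of 0 or |Q|, since either value fixes every queried bit.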
We now discuss how we obtain an algorithm for max simulatable auditing. Clearly, it is impossible to check all possible answers mt in (−∞, +∞). We show that it is sufficient to test only a finite number of points. Let q′1, . . . , q′l be the previous queries that intersect with the current query qt, ordered according to the corresponding answers, m′1 ≤ . . . ≤ m′l. Let m′lb = m′1 − 1 and m′ub = m′l + 1 be the bounding values. Our algorithm checks only 2l + 1 values: the two bounding values, the above l answers, and the mid-points of the intervals determined by them.

Algorithm 1 Max Simulatable Auditor
1: for mt ∈ {m′lb, m′1, (m′1 + m′2)/2, m′2, (m′2 + m′3)/2, m′3, . . . , m′l−1, (m′l−1 + m′l)/2, m′l, m′ub} do
2:   if mt is consistent with the previous answers m1, . . . , mt−1 AND ∃ 1 ≤ j ≤ n such that xj is uniquely determined {using [17]} then
3:     Output "Deny" and return
4:   end if
5: end for
6: Output "Answer" and return

Next we address the following questions: How can we tell if a value is uniquely determined? How can we tell if an answer for the current query is consistent with previous queries and answers?

In fact, it is enough to check the condition in the lemma for qt and the query sets intersecting it (instead of all the query sets).

Lemma 5.2. For 1 ≤ j ≤ n, xj is uniquely determined for some value of mt in (m′s, m′s+1) if and only if xj is uniquely determined when mt = (m′s + m′s+1)/2.

Proof. Observe that revealing mt can only affect elements in qt and the queries intersecting it. This is because revealing mt can possibly lower the upper bounds of elements in qt, thereby possibly making some element in qt or the queries intersecting it the only extreme element of that query set. Revealing mt does not change the upper bound of any element in a query set disjoint from qt and hence does not affect elements in such sets.

Hence it is enough to consider j ∈ qt ∪ q′1 ∪ . . . ∪ q′l. We consider the following cases:

• qt = {j}: Clearly xj is breached irrespective of the value of mt.

• j is the only extreme element of qt and |qt| > 1: Suppose that xj is uniquely determined for some value
of mt in (m′s, m′s+1). This means that each element indexed in qt \ {j} had an upper bound < mt and hence ≤ m′s (since an upper bound can only be one of the answers given so far). Since this holds even for mt = (m′s + m′s+1)/2, j is still the only extreme element of qt and hence xj is still uniquely determined. A similar argument applies for the converse.

• j is the only extreme element of q′k for some k: Suppose that xj is uniquely determined for some value of mt in (m′s, m′s+1). This means that mt < m′k (and hence m′s+1 ≤ m′k) and revealing mt reduced the upper bound of some element indexed in q′k \ {j}. This would be the case even when mt = (m′s + m′s+1)/2. The converse can be argued similarly.

To complete the proof, we will show that values of mt in (m′s, m′s+1) are either all consistent or all inconsistent (so that our algorithm preserves consistency when considering only the mid-point value). Suppose that some value of mt in (m′s, m′s+1) results in inconsistency. Then, by Lemma 5.1, some query set qα has no extreme element. We consider two cases:

• qα = qt: This means that the upper bound of every element in qt was < mt and hence ≤ m′s. This would be the case even for any value of mt in (m′s, m′s+1).

Thus it suffices to check mt = (m′s + m′s+1)/2 for all 1 ≤ s < l, together with mt = m′s for all 1 ≤ s ≤ l, and also the representative points (m′1 − 1) in (−∞, m′1) and (m′l + 1) in (m′l, ∞). Note that inconsistency for a representative point is equivalent to inconsistency for the corresponding interval.

As noted earlier, the upper bounds of all elements, as well as the extreme elements of all the query sets, and hence each iteration of the for loop in Algorithm 1, can be computed in O(Σ_{i=0}^{t} |qi|) time (which is linear in the input size). As the number of iterations is 2l + 1 ≤ 2t, the running time of the algorithm is O(t · Σ_{i=0}^{t} |qi|), proving Theorem 5.1.

We remark that both conditions in step 2 of Algorithm 1 exhibit monotonicity with respect to the value of mt. For example, if mt = α results in inconsistency, so does any mt < α. By exploiting this observation, the running time of the algorithm can be improved to O((log t) · Σ_{i=0}^{t} |qi|). The details can be found in the full version of the paper.

5.3 Utility of Max Simulatable Auditor vs. Monitoring

While both simulatable auditors and monitoring methods are safe, simulatable auditors potentially have greater utility, as shown by the following example.

Consider the problem of auditing max queries on a database containing 5 elements. We will consider three queries and two possible sets of answers to the queries. We will demonstrate that the simulatable auditor answers the third query in the first case and denies it in the second case, while a query monitor (which makes its decisions based only on the query sets) has to deny the third query in both cases. Let the query sets be q1 = {1, 2, 3, 4, 5}, q2 = {1, 2, 3}, q3 = {3, 4}, in that order. Suppose that the first query is answered as m1 = 10. We consider two scenarios based on m2. (1) If m2 = 10, then every query set has at least two extreme elements, irrespective of the value of m3. Hence the simulatable auditor will answer the third query. (2) Suppose m2 = 8. Whenever m3 < 10, element 5 is the only extreme element of q1, so that x5 = 10 is determined. Hence it is not safe to answer the third query.

While the simulatable auditor provides an answer to q3 in the first scenario, a monitor would have to deny q3 in both, as its decision is oblivious of the answers to the first two queries.

[Figure: the second scenario, with answers m1 = 10, m2 = 8, and a hypothetical m3 < 10, over the elements x1, . . . , x5.]

6. PROBABILISTIC COMPROMISE

We next describe a definition of privacy that arises from some of the previously noted limitations of classical compromise. On the one hand, classical compromise is a weak definition since, if a private value can be deduced to lie in a tiny interval – or even a large interval where the distribution is heavily skewed towards a particular value – it is not considered a privacy breach. On the other hand, classical compromise is a strong definition since there are situations where no query would ever be answered. This problem has been previously noted [17]. For example, if the dataset contains items known to fall in a bounded range, e.g., Age, then no sum query would ever be answered. For instance, the query sum(x1, . . . , xn) would not be answered since there exists a dataset, e.g., xi = 0 for all i, where a value, in fact all values, can be uniquely determined.

To work around these issues, we propose a definition of privacy that bounds the change in the ratio of the posterior probability that a value xi lies in an interval I given the queries and answers to the prior probability that xi ∈ I. This definition is related to definitions suggested in the output perturbation literature including [13, 10].
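The prior-versus-posterior ratio can be explored numerically. A rough rejection-sampling sketch (our own toy: independent uniform entries and a single sum query, with the equality constraint relaxed to a small band — the paper instead conditions exactly and estimates volumes of convex polytopes):

```python
import random

def posterior_prior_ratio(n, query, answer, interval, i,
                          samples=200000, tol=0.05, seed=1):
    """Estimate Pr[x_i in I | sum over `query` ~= answer] / Pr[x_i in I]
    for X uniform on [0,1]^n.  The sum constraint is relaxed to a +/- tol
    band so plain rejection sampling applies; this is only a crude stand-in
    for exact conditioning."""
    rng = random.Random(seed)
    lo, hi = interval
    kept = hits = in_interval = 0
    for _ in range(samples):
        x = [rng.random() for _ in range(n)]
        if lo <= x[i] <= hi:
            in_interval += 1
        if abs(sum(x[j] for j in query) - answer) <= tol:
            kept += 1
            if lo <= x[i] <= hi:
                hits += 1
    if kept == 0:
        raise ValueError("no sample met the constraint; widen tol")
    return (hits / kept) / (in_interval / samples)

# Learning sum(x1, x2, x3) = 2.9 (with tol 0.05) forces every entry above
# 0.85, so the posterior chance of x1 in [0, 0.5] collapses: the ratio is
# far below 1 - lambda for any reasonable lambda, i.e., the answer is unsafe.
ratio = posterior_prior_ratio(n=3, query=[0, 1, 2], answer=2.9,
                              interval=(0.0, 0.5), i=0)
```

Here the ratio is exactly 0 because no dataset in the acceptance band can place x1 below 0.5; milder answers leave the ratio near 1.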
queries qj = (Qj , fj ), for j = 1, . . . t, let aj be the answers to 7.1 Estimating the Predicate S()
qj according to X. Define a Boolean predicate that evaluates We begin by considering the situation when we have the
to 1 if the data entry xi is safe with respect to an interval answer to the last query at . Our auditors cannot use at ,
I ⊆ [0, 1] (i.e., whether given the answers one’s confidence and we will show how to get around this assumption. Given
in xi ∈ I changes significantly): at and all previous queries and answers, Algorithm 2 checks
whether privacy has been breached. The idea here is to
Sλ,i,I (q1 , . . . , qt , a1 , . . . , at ) =
( express the ratio of probabilities in Sλ,i,I as the ratio of
PrD (xi ∈I|∧tj=1 fj (Qj )=aj )
1 if (1 − λ) ≤ ≤ 1/(1 − λ) the volumes of two convex polytopes. We then make use
PrD (xi ∈I)
0 otherwise of randomized, polynomial-time algorithms for estimating
the volume of a convex body (see for example [11, 19, 16,
Let I be the set of intervals [ j−1
α
, αj ] for j = 1, . . . , α. Define 20]). Given a convex polytope P with volume Vol(P ), these
algorithms output a value V such that (1 − ǫ)Vol(P ) < V <
Sλ (q1 , . . . , qt , a1 , . . . , at ) = Vol(P )/(1 − ǫ) with probability at least 1 − η. The running
^
Sλ,i,I (q1 , . . . , qt , a1 , . . . , at ) . time of the algorithm is polynomial in n, 1ǫ , and log(1/η).
i∈[n],I∈I We call such an algorithm Volǫ,η – this algorithm is employed
in steps 5 and 7 of Algorithm 2. Note that Vol(P ) without
i.e., Sλ (q1 , . . . , qt , a1 , . . . , at ) = 1 iff q1 , . . . , qt , a1 , . . . , at is any subscripts is the true volume of P .
λ-safe for all entries, for all intervals in I.
Algorithm 2 Safe
We now turn to our privacy definition. Let X = {x1 , . . . , xn }
be drawn according to a distribution D on [0, 1]n , and con- 1: Input: Queries and Answers qj and aj for j = 1, . . . , t,
sider the following (λ, α, T )-privacy game between an at- and parameters λ, η, α, n.
2: Let safe=true, ǫ = λ/6.
tacker and an auditor, where in each round t (for up to T
rounds): 3: for each xi and for each interval I in I do
4: Let P be the (possibly less than n-dimensional) poly-
tope defined by (∩tj=1 (sum(Qj ) = aj )) ∩ [0, 1]n
1. The attacker (adaptively) poses a query qt = (Qt , ft ). 5: V = Volǫ,η (P )
6: Let Pi,I be the polytope defined by P ∩ (xi ∈ I)
2. The auditor decides whether to allow qt (and then re-
7: Vi,I “
= Volǫ,η (Pi,I )
ply with at = ft (Qt )), or at = “denied”. α·Vi,I
” “
α·Vi,I
”
8: if V
< (1 − 4ǫ) OR V
> 1/(1 − 4ǫ)
3. The attacker wins if Sλ (q1 , . . . , qt , a1 , . . . , at ) = 0. then
9: Let safe=false
Definition 6.1. We say that an auditor is (λ, δ, α, T )- 10: end if
private if for any attacker A 11: end for
12: Return safe.
Pr[A wins the (λ, α, T )-privacy game] ≤ δ .
The probability is taken over the randomness in the distri-
Lemma 7.1. 1. If Sλ (q1 , . . . , qt , a1 , . . . , at ) = 0 then Al-
bution D and the coin tosses of the auditor and attacker.
gorithm Safe returns 0 with probability at least 1 − 2η.
2. If Sλ/3 (q1 , . . . , qt , a1 , . . . , at ) = 1 then Algorithm Safe
Combining Definitions 4.1 and 6.1, we have our new model
returns 1 with probability at least 1 − 2nαη.
of simulatable auditing. In other words, we seek auditors
that are simulatable and (λ, δ, α, T )-private.
Proof. We first note that
Note that, on the one hand, the simulatable definition pre- PrD (xi ∈ I| ∧tj=1 (sum(Qj ) = aj ))
vents the auditor from using at in deciding whether to an- =
PrD (xi ∈ I)
swer or deny the query. On the other hand, the privacy
PrD (xi ∈ I ∧ tj=1 (sum(Qj ) = aj ))
V
definition requires that regardless of what at was, with high
=
probability, each data value xi is still safe (as defined by Si ). PrD (xi ∈ I) PrD ( tj=1 (sum(Qj ) = aj ))
V
Consequently, it is important that the current query qt be
used in deciding whether to deny or answer. Upon making a Vol(Pi,I )
= α·
decision, our auditors ignores the true value at and instead Vol(P )
make guesses about the value of at obtained by randomly
sampling data sets according to the distribution. where P and Pi,I are as defined in steps 4 and 6 of Al-
gorithm Safe, and where Vol(P ) denotes the true volume
7. SIMULATABLE FRACTIONAL SUM AU- of the polytope P in the subspace whose dimension equals
DITING the dimension of P . (The last equality is due to D being
In this section we consider the problem of auditing sum uniform.)
queries, i.e., where each query is of the form sum(Qj ) and Qj
is a subset of dimensions. We assume that the data falls in Using the guarantee on the volume estimation algorithm, we
the unit cube [0, 1]n , although the techniques can be applied get that with probability at least 1 − 2η,
to any bounded real-valued domain. We also assume that Vi,I Vol(Pi,I ) 1 Vol(Pi,I ) 1
the data is drawn from a uniform distribution over [0, 1]n . ≤ · ≤ · .
V Vol(P ) (1 − ǫ)2 Vol(P ) 1 − 2ǫ
The last inequality follows by noting that (1 − ε)² = 1 − 2ε + ε² ≥ 1 − 2ε. Similarly, Vi,I / V ≥ (Vol(Pi,I)/Vol(P)) · (1 − 2ε).

We now prove the two parts of the claim:

1. If α · Vol(Pi,I)/Vol(P) < 1 − λ, then α · Vi,I/V < (1 − λ)/(1 − 2ε), and by our choice of ε = λ/6 we get α · Vi,I/V < 1 − 4ε; hence Algorithm Safe returns 0. The case α · Vol(Pi,I)/Vol(P) > 1/(1 − λ) is treated similarly.

2. If α · Vol(Pi,I)/Vol(P) > 1 − λ/3, then α · Vi,I/V > (1 − λ/3)(1 − 2ε) > 1 − 4ε. The case α · Vol(Pi,I)/Vol(P) < 1/(1 − λ/3) is treated similarly. Using the union bound over all i ∈ [n] and I ∈ I yields the proof.

Algorithm 3 Fractional Sum Simulatable Auditor
1: Input: data set X, (allowed) queries and answers qj and aj for j = 1, . . . , t − 1, a new query qt, and parameters λ, λ′, α, n, δ, T.
2: Let P be the (possibly less than n-dimensional) polytope defined by (∩_{j=1}^{t−1} (sum(Qj) = aj)) ∩ [0,1]^n
3: for O((T/δ) log(T/δ)) times do
4:   Sample a data set X′ from P  {using [11]}
5:   Let a′ = sumX′(Qt)
6:   Evaluate algorithm Safe on input q1, . . . , qt, a1, . . . , at−1, a′ and parameters λ, λ′, α, n
7: end for
8: if the fraction of sampled data sets for which algorithm Safe returned false is more than δ/2T then
9:   Return "denied"
10: else
11:   Return at, the true answer to qt = sumX(Qt)
12: end if
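The interplay between Algorithms 2 and 3 can be sketched in a few dozen lines of Python. This is a minimal illustration under loose assumptions, not the paper's construction: the randomized volume-estimation oracle Vol_{ε,η} is replaced by naive Monte-Carlo rejection sampling, each exact constraint sum(Qj) = aj is thickened into a slab |sum(Qj) − aj| ≤ tol so that rejection sampling has nonzero acceptance probability, and the O((T/δ) log(T/δ)) sample count is replaced by a fixed n_trials. The names estimate_ratio, safe, and sum_simulatable_auditor, and the parameters n_samples, n_trials, and tol, are our own inventions for this sketch.

```python
import random

def estimate_ratio(n, alpha, constraints, i, interval, n_samples=5000, tol=0.05):
    """Monte-Carlo stand-in for the volume oracle: estimate
    alpha * Vol(P_{i,I}) / Vol(P), where P is the set of data sets in
    [0,1]^n consistent (up to `tol`) with every (Q, a) pair in `constraints`,
    and P_{i,I} additionally requires x_i to lie in `interval`."""
    lo, hi = interval
    in_P = in_PiI = 0
    for _ in range(n_samples):
        x = [random.random() for _ in range(n)]
        if all(abs(sum(x[j] for j in Q) - a) <= tol for Q, a in constraints):
            in_P += 1
            if lo <= x[i] <= hi:
                in_PiI += 1
    if in_P == 0:
        return None  # no consistent sample found; caller treats this as unsafe
    return alpha * in_PiI / in_P

def safe(n, alpha, lam, constraints):
    """Algorithm 2 (Safe) with Vol_{eps,eta} replaced by estimate_ratio:
    for every entry x_i and every interval [(j-1)/alpha, j/alpha], the
    posterior/prior ratio must lie in [1 - 4*eps, 1/(1 - 4*eps)]."""
    eps = lam / 6
    for i in range(n):
        for j in range(1, alpha + 1):
            r = estimate_ratio(n, alpha, constraints, i, ((j - 1) / alpha, j / alpha))
            if r is None or r < 1 - 4 * eps or r > 1 / (1 - 4 * eps):
                return False
    return True

def sum_simulatable_auditor(X, past, Qt, lam, delta, T, alpha=2, n_trials=10, tol=0.05):
    """Algorithm 3 sketch: decide on the new query Qt using only the past
    queries/answers `past` (a list of (Q, a) pairs) and guessed answers,
    never the true answer, so a denial leaks nothing about the data."""
    n = len(X)
    denials = 0
    for _ in range(n_trials):
        # step 4: sample a data set consistent with the past answers
        # (naive rejection sampling; adequate only for few, loose constraints)
        while True:
            Xp = [random.random() for _ in range(n)]
            if all(abs(sum(Xp[j] for j in Q) - a) <= tol for Q, a in past):
                break
        a_guess = sum(Xp[j] for j in Qt)  # step 5: a guessed answer
        if not safe(n, alpha, lam, past + [(Qt, a_guess)]):
            denials += 1
    if denials / n_trials > delta / (2 * T):
        return "denied"
    return sum(X[j] for j in Qt)  # true answer, released only if deemed safe
```

For instance, auditing the single-entry query sum(x1) on a two-entry data set is denied for almost every guessed answer, since such an answer pins x1 into a narrow slab and pushes the posterior/prior ratio for some interval outside the allowed band.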