A Distributional Framework For Data Valuation
between the value of z computed within Bs and within Bb. This inconsistency may be especially pronounced in the case when the buyer has significantly less data than the seller.

Our contributions.

Conceptual: Extending prior works on data Shapley, we formulate and develop a notion of distributional Shapley value in Section 2. We define the distributional variant in terms of the original data Shapley: the distributional Shapley value is taken to be the expected data Shapley value, where the data set is drawn i.i.d. from the underlying data distribution. Reformulating this notion of value as a statistical quantity allows us to prove that the notion is stable with respect to perturbations to the inputs as well as the underlying data distribution. Further, we show a mathematical identity that gives an equivalent definition of distributional Shapley as an expected marginal performance increase from adding the point, suggesting an unbiased estimator.

Algorithmic: In Section 3, we develop this estimator into a novel sampling-based algorithm, D-SHAPLEY. In contrast to prior estimation heuristics, D-SHAPLEY comes with strong formal approximation guarantees. Leveraging the stability properties of the distributional Shapley value and the simple nature of our algorithm, we develop theoretically-principled optimizations to D-SHAPLEY. In our experiments across diverse tasks, the optimizations lead to order-of-magnitude reductions in computational costs while maintaining the quality of estimations.

Empirical: Finally, in Section 4, we present a data pricing case study that demonstrates the consistency of values produced by D-SHAPLEY. In particular, we show that a data broker can list distributional Shapley values as "prices," which a collection of buyers all agree are fair (i.e. the data gives each buyer as much value as the seller claims). In all, our results demonstrate that the distributional Shapley framework represents a significant step towards the practical viability of Shapley-based approaches to data valuation.

Related works. The Shapley value, introduced in (Shapley, 1953), has been studied extensively in the literature on cooperative games and economics (Shapley et al., 1988), and has traditionally been used in the valuation of private information and data markets (Kleinberg et al., 2001; Agarwal et al., 2019).

Our work follows recent works that apply the Shapley value to the data valuation problem. (Ghorbani & Zou, 2019) developed the notion of "Data Shapley" and provided two algorithms to efficiently estimate values. Specifically, leveraging the permutation-based characterization of the Shapley value, they developed a "truncated Monte Carlo" sampling scheme (referred to as TMC-SHAPLEY), demonstrating empirical effectiveness across various ML tasks. (Jia et al., 2019b) gave several additional methods for efficient approximation of Shapley values for training data; subsequently, (Jia et al., 2019a) provided an exact algorithm for computation of Shapley values for nearest neighbor classifiers.

Beyond data valuation, the Shapley framework has been used in a variety of ML applications, e.g. as a measure of feature importance (Cohen et al., 2007; Kononenko et al., 2010; Datta et al., 2016; Lundberg & Lee, 2017; Chen et al., 2018). The idea of a distributional Shapley value bears resemblance to the Aumann-Shapley value (Aumann & Shapley, 1974), a measure-theoretic variant of Shapley that quantifies the value of individuals within a continuous "infinite game." Our distributional Shapley value focuses on the tangible setting of finite data sets drawn from a (possibly continuous) distribution.

2. Distributional Data Valuation

Preliminaries. Let D denote a data distribution supported on a universe Z. For supervised learning problems, we often think of Z = X × Y, where X ⊆ R^d and Y is the output space, which can be discrete or continuous. For m ∈ N, let S ∼ D^m denote a collection of m data points sampled i.i.d. from D. Throughout, we use the shorthand [m] = {1, . . . , m} and let k ∼ [m] denote a uniform random sample from [m].

We denote by U : Z∗ → [0, 1] a potential function¹ or performance metric, where for any S ⊆ Z, U(S) represents abstractly the value of the subset. While our analysis applies broadly, in our context, we think of U as capturing both the learning algorithm and the evaluation metric. For instance, in the context of training a logistic regression model, we might think of U(S) as returning the population accuracy of the empirical risk minimizer when S is the training set.

¹ We use Z∗ = ∪_{n∈N} Z^n to indicate any finite Cartesian product of Z with itself; thus, U is well-defined on any natural number of inputs from Z.

2.1. Distributional Shapley Value

Our starting point is the data Shapley value, proposed in (Ghorbani & Zou, 2019; Jia et al., 2019b) as a way to valuate training data equitably.

Definition 2.1 (Data Shapley Value). Given a potential function U and data set B ⊆ Z where |B| = m, the data Shapley value of a point z ∈ B is defined as

φ(z; U, B) ≜ (1/m) Σ_{k=1}^{m} (m−1 choose k−1)^{−1} Σ_{S ⊆ B\{z} : |S|=k−1} (U(S ∪ {z}) − U(S)).

In words, the data Shapley value of a point z ∈ B is a weighted empirical average over subsets S ⊆ B of the
marginal potential contribution of z to each S; the weighting is such that each possible cardinality |S| = k ∈ {0, . . . , m − 1} is weighted equally. The data Shapley value satisfies a number of desirable properties; indeed, it is the unique valuation function that satisfies the Shapley axioms². Note that as the data set size grows, the absolute magnitude of individual data points' values typically scales inversely.

While the data Shapley value is a natural solution concept for data valuation, its formulation leads to several limitations. In particular, the values may be very sensitive to the exact choice of B; given another B′ ≠ B where z ∈ B ∩ B′, the value φ(z; U, B) might be quite different from φ(z; U, B′). At the extreme, if a new point z′ ∉ B is added to B, then in principle, we would have to rerun the procedure to compute the data Shapley values for all points in B ∪ {z′}.

In settings where our data are drawn from an underlying distribution D, a natural extension to the data Shapley approach would parameterize the valuation function by D, rather than the specific draw of the data set. Such a distributional Shapley value should be more stable, by removing the explicit dependence on the draw of the training data set.

Definition 2.2 (Distributional Shapley Value). Given a potential function U : Z∗ → [0, 1], a distribution D supported on Z, and some m ∈ N, the distributional Shapley value of a point z ∈ Z is the expected data Shapley value over data sets of size m containing z:

ν(z; U, D, m) ≜ E_{D∼D^{m−1}} [φ(z; U, D ∪ {z})].

Theorem 2.3. For all z ∈ Z, the distributional Shapley value ν(z; U, D, m) equals the expected marginal contribution in U to a set of i.i.d. samples from D of uniform random cardinality:

ν(z; U, D, m) = E_{k∼[m], S∼D^{k−1}} [U(S ∪ {z}) − U(S)].

The identity holds as a consequence of the definition of the data Shapley value and linearity of expectation.

Proof.

ν(z; U, D, m)
= E_{D∼D^{m−1}} [φ(z; U, D ∪ {z})]
= E_{D∼D^{m−1}} [(1/m) Σ_{k=1}^{m} (m−1 choose k−1)^{−1} Σ_{S⊆D : |S|=k−1} (U(S ∪ {z}) − U(S))]
= (1/m) Σ_{k=1}^{m} (m−1 choose k−1)^{−1} E_{D∼D^{m−1}} [Σ_{S⊆D : |S|=k−1} (U(S ∪ {z}) − U(S))]
= (1/m) Σ_{k=1}^{m} E_{S∼D^{k−1}} [U(S ∪ {z}) − U(S)]    (1)
= E_{k∼[m], S∼D^{k−1}} [U(S ∪ {z}) − U(S)]

where (1) follows by the fact that D ∼ D^{m−1} consists of i.i.d. samples, so each S ⊆ D with |S| = k − 1 is identically distributed according to D^{k−1}.
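To make these definitions concrete, here is a small self-contained sketch. The distribution and potential below are toy stand-ins (illustrative assumptions, not objects from the paper): it computes φ by brute force over subsets per Definition 2.1, averages it over draws of the data set as in Definition 2.2, and compares the result with the direct estimator suggested by the identity proved above.

```python
# Illustrative check of the identity: the expected data Shapley value over
# data sets of size m containing z matches the expected marginal contribution
# to S ~ D^{k-1} with k ~ [m]. D is uniform on {0, 1}; U is a toy potential.
import random
from itertools import combinations
from math import comb

def U(S):                                  # toy potential in [0, 1]
    return min(1.0, sum(S) / 3)

def data_shapley(i, B):
    """phi(B[i]; U, B) by brute force over subsets (Definition 2.1)."""
    rest = [j for j in range(len(B)) if j != i]
    m = len(B)
    total = 0.0
    for k in range(1, m + 1):              # subset sizes |S| = k - 1
        for S in combinations(rest, k - 1):
            pts = [B[j] for j in S]
            total += (U(pts + [B[i]]) - U(pts)) / comb(m - 1, k - 1)
    return total / m

def sample_D():                            # toy D: uniform on {0, 1}
    return random.randint(0, 1)

random.seed(0)
z, m, T = 1, 4, 20_000

# nu via Definition 2.2: expected data Shapley of z within a random size-m set.
nu_def = 0.0
for _ in range(T):
    B = [sample_D() for _ in range(m - 1)] + [z]   # data set containing z
    nu_def += data_shapley(m - 1, B)
nu_def /= T

# nu via the identity: one-sample marginal contributions, k ~ [m], S ~ D^{k-1}.
nu_id = 0.0
for _ in range(T):
    k = random.randint(1, m)
    S = [sample_D() for _ in range(k - 1)]
    nu_id += U(S + [z]) - U(S)
nu_id /= T

print(abs(nu_def - nu_id))   # the two estimates agree up to sampling error
```

The second estimator needs only one subset per sample, which is what makes the Monte Carlo algorithm of Section 3 practical.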
2.2. Stability of distributional Shapley values

Before presenting our algorithm, we discuss stability properties of distributional Shapley, which are interesting in their own right, but also have algorithmic implications. We show that when the potential function U satisfies a natural stability property, the corresponding distributional Shapley value inherits stability under perturbations to the data points and the underlying data distribution. First, we recall a standard notion of deletion stability, often studied in the context of generalization of learning algorithms (Bousquet & Elisseeff, 2002).

Definition 2.5 (Deletion Stability). For potential U : Z∗ → [0, 1] and non-increasing β : N → [0, 1], U is β(k)-deletion stable if for all k ∈ N, S ∈ Z^{k−1}, and all z ∈ Z,

|U(S ∪ {z}) − U(S)| ≤ β(k).

We can similarly discuss the idea of replacement stability, where we bound |U(S ∪ {z}) − U(S ∪ {z′})|; note that by the triangle inequality, β(k)-deletion stability of U implies 2β(k)-replacement stability. To analyze the properties of distributional Shapley, a natural strengthening of replacement stability will be useful, which we call Lipschitz stability. Lipschitz stability is parameterized by a metric d, and requires the degree of robustness under replacement of z with z′ to scale according to the distance d(z, z′).

Definition 2.6 (Lipschitz Stability). Let (Z, d) be a metric space. For potential U : Z∗ → [0, 1] and non-increasing β : N → [0, 1], U is β(k)-Lipschitz stable with respect to d if for all k ∈ N, S ∈ Z^{k−1}, and all z, z′ ∈ Z,

|U(S ∪ {z}) − U(S ∪ {z′})| ≤ β(k) · d(z, z′).

By taking d to be the trivial metric, where d(z, z′) = 1 if z ≠ z′, we see that Lipschitz stability generalizes the idea of replacement stability; still, there are natural learning algorithms that satisfy Lipschitz stability for nontrivial metrics. As one example, we show that regularized empirical risk minimization over a Reproducing Kernel Hilbert Space (RKHS) – a prototypical example of a replacement-stable learning algorithm – also satisfies this stronger notion of Lipschitz stability. We include a formal statement and proof in Appendix C.

Similar distributions yield similar value functions. The distributional Shapley value is naturally parameterized by the underlying data distribution D. For two distributions Ds and Dt, given the value ν(z; U, Ds, m), what can we say about the value ν(z; U, Dt, m)? Intuitively, if Ds and Dt are similar under an appropriate metric, we'd expect that the values should not change too much. Indeed, we can formally quantify how the distributional Shapley value is stable under distributional shift under the Wasserstein distance.³

Theorem 2.7. Fix a metric space (Z, d) and let U : Z∗ → [0, 1] be β(k)-Lipschitz stable with respect to d. Suppose Ds and Dt are two distributions over Z. Then, for all m ∈ N and all z ∈ Z,

|ν(z; U, Ds, m) − ν(z; U, Dt, m)| ≤ (2/m) Σ_{k=1}^{m−1} k·β(k) · W₁(Ds, Dt).

³ Fixing a metric d over Z, the Wasserstein distance between two distributions Ds, Dt is the infimum over all couplings γ ∈ Γst of the expected distance between (s, t) ∼ γ.

The proof of Theorem 2.7 is included in Appendix C. Note that the theorem bounds the difference in values under shifts in distribution holding the potential U fixed. Often in applications, we will take the potential function to depend on the underlying data distribution. For instance, we may take U_D(S) = E_{z∼D}[ℓ_S(z)] to be a measure of population accuracy, where ℓ_S(z) is the loss on a point z ∈ Z achieved by a model trained on the data set S ⊆ Z. In the case where we only have access to samples from Ds, we still may want to guarantee that ν(z; U_Ds, Ds, m) and ν(z; U_Dt, Dt, m) are close. Thankfully, such a result follows by showing that U_Ds is close to U_Dt. For completeness, we formalize this argument in Appendix C.

Similar points receive similar values. As discussed, a key limitation of the data Shapley approach for a fixed data set B is that we can only ascribe values to z ∈ B. Intuitively, however, we would hope that if two points z and z′ are similar according to some appropriate metric, then they would receive similar Shapley values. We confirm this intuition for distributional Shapley values when the potential function U satisfies Lipschitz stability.

Theorem 2.8. Fix a metric space (Z, d) and a distribution D over Z; let U : Z∗ → [0, 1] be β(k)-Lipschitz stable with respect to d. Then for all m ∈ N and all z, z′ ∈ Z,

|ν(z; U, D, m) − ν(z′; U, D, m)| ≤ E_{k∼[m]}[β(k)] · d(z, z′).

Proof. For any data set size m ∈ N, we expand ν(z′; U, D, m) to express it in terms of ν(z; U, D, m).

ν(z′; U, D, m)
= E_{k∼[m], S∼D^{k−1}} [U(S ∪ {z′}) − U(S)]
= E_{k∼[m], S∼D^{k−1}} [U(S ∪ {z}) − U(S)] + E_{k∼[m], S∼D^{k−1}} [U(S ∪ {z′}) − U(S ∪ {z})]
≤ ν(z; U, D, m) + E_{k∼[m]}[β(k)] · d(z, z′)    (2)
where (2) follows by the assumption that U is β(k)-Lipschitz stable and linearity of expectation.

Theorem 2.8 suggests that in many settings of interest, the distributional Shapley value will be Lipschitz in z. This Lipschitz property also suggests that, given the values of a (sufficiently-diverse) set of points Z, we may be able to infer the values of unseen points z′ ∉ Z through interpolation. Concretely, in Section 3.1, we leverage this observation to give an order of magnitude speedup over our baseline estimation algorithm.

3. Efficiently Estimating Distributional Shapley Values

Here, we describe an estimation procedure, D-SHAPLEY, for computing distributional Shapley values. To begin, we assume that we can actually sample from the underlying D. Then, in Section 3.1, we propose techniques to speed up the estimation and look into the practical issues of obtaining samples from the distribution. The result of these considerations is a practically-motivated variant of the estimation procedure, FAST-D-SHAPLEY. In Section 3.2, we investigate how these optimizations perform empirically; we show that the strategies provide a way to smoothly trade off the precision of the valuation for computational cost.

Obtaining unbiased estimates. The formulation from Theorem 2.3 suggests a natural algorithm for estimating the distributional Shapley values of a set of points. In particular, the distributional Shapley value ν(z; U, D, m) is the expectation of the marginal contribution of z to S ⊆ Z on U, drawn from a specific distribution over data sets. Thus, the change in performance when we add a point z to a data set S drawn from the correct distribution will be an unbiased estimate of the distributional Shapley value. Consider Algorithm 1, D-SHAPLEY, which, given a subset Z0 ⊆ Z of data, maintains for each z ∈ Z0 a running average of U(S ∪ {z}) − U(S) over randomly drawn S.

Algorithm 1 D-SHAPLEY
  Fix: potential U : Z∗ → [0, 1]; distribution D; m ∈ N
  Given: data set Z ⊆ Z to valuate; # iterations T ∈ N
  for z ∈ Z do
    ν₁(z) ← 0    // initialize estimates
  end for
  for t = 1, . . . , T do
    Sample St ∼ D^{k−1} for k ∼ [m]
    for z ∈ Z do
      Δz U(St) ← U(St ∪ {z}) − U(St)
      ν_{t+1}(z) ← (1/t) · Δz U(St) + ((t−1)/t) · νt(z)    // update unbiased estimate
    end for
  end for
  return {(z, νT(z)) : z ∈ Z}

In each iteration, Algorithm 1 uses a fixed sample St to estimate the marginal contribution U(St ∪ {z}) − U(St) for each z ∈ Z. This reuse correlates the estimation errors between points in Z, but provides computational savings. Recall that each evaluation of U(S) requires training an ML model using the points in S; thus, using the same S for each z ∈ Z reduces the number of models to be trained by |Z| per iteration. In cases where U(S ∪ {z}) can be derived efficiently from U(S), the savings may be even more dramatic; for instance, given a machine-learned model trained on S, it may be significantly cheaper to derive a model trained on S ∪ {z} than retraining from scratch (Ginart et al., 2019).

The running time of Algorithm 1 can naively be upper bounded by the product of the number of iterations before termination T, the cardinality |Z| of the points to valuate, and the expected time to evaluate U on data sets of size k ∼ [m]. We analyze the iteration complexity necessary to achieve ε-approximations of ν(z; U, D, m) for each z ∈ Z.

Theorem 3.1. Fixing a potential U, a distribution D, and Z ⊆ Z, suppose T ≥ Ω(log(|Z|/δ)/ε²). Algorithm 1 produces unbiased estimates, and with probability at least 1 − δ, |ν(z; U, D, m) − νT(z)| ≤ ε for all z ∈ Z.

Remark. When interpreting this (and each subsequent) formal approximation guarantee, it is important to note that we take ε to be an absolute additive error. Recall, however, that ν(z; U, D, m) is normalized by m; thus, as we take m larger, the relative error incurred by a fixed ε error grows. In this sense, ε should typically scale inversely as O(1/m).

The claim follows by proving uniform convergence of the estimates for each z ∈ Z. Importantly, while the samples in each iteration are correlated across z, z′ ∈ Z, fixing z ∈ Z, the samples Δz U(St) are independent across iterations. We include a formal analysis in Appendix D.

3.1. Speeding up D-Shapley: theoretical and practical considerations

Next, we propose two principled ways to speed up the baseline estimation algorithm. Under stability assumptions, the strategies maintain strong formal guarantees on the quality of the learned valuation. We also develop some guiding theory addressing practical issues that arise from the need to sample from D. Somewhat counterintuitively, we argue that given only a fixed finite data set B ∼ D^M, we can still estimate values ν(z; U, D, m) to high accuracy, for M that grows modestly with m.
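The loop of Algorithm 1 can be transcribed directly into Python. Here `sample_D` and `U` are toy stand-ins (assumptions for illustration): in the paper's setting, evaluating U(S) means training a model on S and scoring it, which is why the same St is reused for every z.

```python
# A minimal sketch of Algorithm 1 (D-SHAPLEY) with a toy distribution and
# potential. Each iteration draws one S_t ~ D^{k-1} with k ~ [m] and updates
# a running average of marginal contributions for every point to valuate.
import random

def d_shapley(Z, U, sample_D, m, T):
    nu = {z: 0.0 for z in Z}                      # initialize estimates
    for t in range(1, T + 1):
        k = random.randint(1, m)                  # k ~ [m]
        S_t = [sample_D() for _ in range(k - 1)]  # S_t ~ D^{k-1}, reused for all z
        for z in Z:
            delta = U(S_t + [z]) - U(S_t)         # marginal contribution of z
            nu[z] = delta / t + (t - 1) / t * nu[z]   # running-average update
    return nu

random.seed(1)
U = lambda S: min(1.0, sum(S) / 5)                # toy potential in [0, 1]
vals = d_shapley(Z=[0, 1], U=U, sample_D=lambda: random.randint(0, 1), m=10, T=5000)
# z = 1 improves the toy potential, while z = 0 never does, so vals[1] > vals[0].
```

Because every sample is an unbiased estimate of ν(z; U, D, m), the running average converges at the usual Monte Carlo rate, matching the T = Ω(log(|Z|/δ)/ε²) iteration bound of Theorem 3.1.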
Subsampling data and interpolation. Theorem 2.8 shows that for sufficiently stable potentials U, similar points have similar distributional Shapley values. This property of distributional Shapley values is not only useful for inferring the values of points z ∈ Z that were not in our original data set, but also suggests an approach for speeding up the computation of values for a fixed Z ⊆ Z. In particular, to estimate the values for z ∈ Z (with respect to a sufficiently Lipschitz-stable potential U) to O(ε)-precision, it suffices to estimate the values for an ε-cover of Z, and interpolate (e.g. via nearest neighbor search). Standard arguments show that random sampling is an effective way to construct an ε-cover (Har-Peled, 2011).

As our first optimization, in Algorithm 2, we reduce the number of points to valuate through subsampling. Given a data set Z to valuate, we first choose a random subset Zp ⊆ Z (where each z ∈ Z is subsampled into Zp i.i.d. with some probability p); then, we run our estimation procedure on the points in Zp; finally, we train a regression model on (z, νT(z)) pairs from Zp to predict the values of the points from Z \ Zp. By varying the choice of p ∈ [0, 1], we can trade off running time for quality of estimation: p ≈ 1 recovers the original D-SHAPLEY scheme, whereas p ≈ 0 will be very fast but likely produce noisy valuations.

Importance sampling for smaller data sets. To understand the running time of Algorithm 1 further, we denote the time to evaluate U on a set of cardinality k ∈ N by R(k).⁴ As such, we can express the asymptotic expected running time as |Z| · T · E_{k∼[m]}[R(k)]. Note that when U(S) corresponds to the accuracy of a model trained on S, the complexity of evaluating U(S) may grow significantly with |S|. At the same time, as the data set size k grows, the marginal effect of adding z ∈ Z to the training set tends to decrease; thus, we should need fewer large samples to accurately estimate the marginal effects. Taken together, intuitively, biasing the sampling of k ∈ [m] towards smaller training sets could result in a faster estimation procedure with similar approximation guarantees.

⁴ We assume that the running time to evaluate U(S) is a function of the cardinality of S (and not other auxiliary parameters).

Concretely, rather than sampling k ∼ [m] uniformly, we can importance sample each k proportional to some non-uniform weights {wk : k ∈ [m]}, where the weights decrease for larger k. More formally, we weight the draw of k based on the stability of U. Algorithm 2 takes as input a set of importance weights w = {wk} and samples k proportionally; without loss of generality, we assume Σk wk = 1 and let k ∼ [m]w denote a sample drawn such that Pr[k] = wk. We show that for the right choice of weights w, sampling k ∼ [m]w improves the overall running time, while maintaining ε-accurate unbiased estimates of the values ν(z; U, D, m).

Theorem 3.2 (Informal). Suppose U is O(1/k)-deletion stable and can be evaluated on sets of cardinality k in time R(k) ≥ Ω(k). For p ∈ [0, 1] and w = {wk ∝ 1/k}, Algorithm 2 produces estimates that with probability 1 − δ are ε-accurate for all z ∈ Zp, and runs in expected time

RT_w(m) ≤ Õ( p · |Z| · log(|Z|/δ) · R(m) / (ε² m²) ).

To interpret this result, note that if the subsampling probability p is large enough that Zp will ε-cover Z, then using a nearest-neighbor predictor as R will produce O(ε)-estimates for all z ∈ Z. Further, if we imagine ε = Θ(1/k), then the computational cost grows as the time it takes to train a model on m points, scaled by a factor logarithmic in |Z| and the failure probability. In fact, Theorem 3.2 is a special case of a more general theorem that provides a recipe for devising an appropriate sampling scheme based on the stability of the potential U. In particular, the general theorem (stated and proved in Appendix D) shows that the more stable the potential, the more we can bias sampling in favor of smaller sample sizes.

Estimating distributional Shapley from data. Estimating distributional Shapley values ν(z; U, D, m) requires samples from the distribution D. In practice, we often want to evaluate the values with respect to a distribution D for which we only have some database B ∼ D^M for some large (but finite) M ∈ N. In such a setting, we need to be careful; indeed, avoiding artifacts from a single draw of data is the principal motivation for introducing the distributional Shapley framework. In fact, the analysis of Theorem 3.2 also reveals an upper bound on how big the database should be in order to obtain accurate estimates with respect to D. As a concrete bound, if U is O(1/k)-deletion stable and we take ε = Θ(1/m) error, then the database need only be

M ≤ Õ( m · log(|Z|/δ) ).

In other words, for a sufficiently stable potential U, the data complexity grows modestly with m. Note that, again, this bound leverages the fact that in every iteration, we reuse the same sample St ∼ D^k for each z ∈ Z. See Appendix D for a more detailed analysis.

In practice, we find that sampling subsets of data from the database with replacement works well; we describe the full procedure in Algorithm 2, where we denote an i.i.d. sample of k points drawn uniformly from the database as S ∼ B^k. Finally, we note that ideally, m should be close to the size of the training sets that model developers will use; in practice, these data set sizes may vary widely. One appealing aspect of both D-SHAPLEY algorithms is that when we estimate values with respect to m, the samples we obtain also allow us to simultaneously estimate ν(z; U, D, m′) for any m′ ≤ m. Indeed, we can simply truncate our estimates to only include samples corresponding to St with |St| ≤ m′.
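Putting the two optimizations together, the recipe can be sketched as follows. Everything here is a toy stand-in (the database, the potential, and the function name `fast_d_shapley` are assumptions for illustration): draw k with Pr[k] = wk ∝ 1/k, reweight each marginal contribution by 1/(m · wk) so the running average remains unbiased, estimate values only on the subsample Zp, and fill in the remaining points by 1-nearest-neighbor interpolation.

```python
# Sketch of the subsampling + importance-sampling pipeline, with a toy
# database B and potential U. The 1/(m * w_k) reweighting keeps each sample
# unbiased: E[delta / (m * w_k)] = (1/m) * sum_k E[delta_k] = nu(z; U, D, m).
import random

def fast_d_shapley(Z, B, U, m, T, p):
    raw = [1.0 / k for k in range((1), m + 1)]
    w = [x / sum(raw) for x in raw]               # w_k proportional to 1/k
    Zp = [z for z in Z if random.random() < p]    # subsample w.p. p (assumed nonempty)
    nu = {z: 0.0 for z in Zp}
    ks = list(range(1, m + 1))
    for t in range(1, T + 1):
        k = random.choices(ks, weights=w)[0]      # k ~ [m]_w
        S_t = random.choices(B, k=k - 1)          # S_t: k-1 draws from B, with replacement
        for z in Zp:
            delta = (U(S_t + [z]) - U(S_t)) / (m * w[k - 1])   # reweighted
            nu[z] = delta / t + (t - 1) / t * nu[z]
    def predict(z):                               # 1-NN interpolation for Z \ Zp
        return nu[min(Zp, key=lambda zp: abs(z - zp))]
    return {z: nu.get(z, predict(z)) for z in Z}

random.seed(0)
U = lambda S: min(1.0, sum(S) / 5)
vals = fast_d_shapley(Z=[0, 1, 2, 3], B=[0, 1] * 50, U=U, m=10, T=20_000, p=0.5)
```

The 1-D metric `abs(z - zp)` stands in for whatever metric makes U Lipschitz stable; in a real deployment the regression step R could be any model trained on the (z, νT(z)) pairs.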
Algorithm 2 FAST-D-SHAPLEY
  Fix: potential U : Z∗ → [0, 1]; database B; m ∈ N
  Given: data set Z ⊆ Z to valuate; # iterations T ∈ N; subsampling rate p ∈ [0, 1]; importance weights {wk}; regression algorithm R
  Subsample Zp ⊆ Z s.t. z ∈ Zp w.p. p for all z ∈ Z
  for z ∈ Zp do
    ν₁(z) ← 0    // initialize estimates
  end for
  for t = 1, . . . , T do
    Sample St ∼ B^{k−1} for k ∼ [m]w
    for z ∈ Zp do
      Δz U(St) ← U(St ∪ {z}) − U(St)
      ν_{t+1}(z) ← (1/t) · Δz U(St)/(wk · m) + ((t−1)/t) · νt(z)    // update unbiased estimate
    end for
  end for
  h ← R({(z, νT(z)) : z ∈ Zp})    // regress on (z, val(z)) pairs
  return {(z, h(z)) : z ∈ Z}

Figure 1. Point removal performance (panels: Adult Income, UK Biobank; curves: Alg. 1 and its speed-ups). Given a data set and task, we iteratively remove a point, retrain the model, and evaluate its performance. Each curve corresponds to a different point removal order, based on the estimated distributional Shapley values (compared to random). For example, the 10% curve corresponds to estimating values with 10% of the baseline computation of Algorithm 1. We plot classification accuracy vs. fraction of data points removed from the training set, for each task and each optimization method.

3.2. Empirical performance

We investigate the empirical effectiveness of the distributional Shapley framework by running experiments in three settings on large real-world data sets.⁵ The first setting uses the UK Biobank data set, containing the genotypic and phenotypic data of individuals in the UK (Sudlow et al., 2015); we evaluate a task of predicting whether the patient will be diagnosed with breast cancer using 120 features. Overall, our data has 10K patients (5K diagnosed positively); we use 9K patients as our database (B), and take classification accuracy on a hold-out set of 500 patients as the performance metric (U). The second data set is Adult Income, where the task is to predict whether income exceeds $50K/yr given 14 personal features (Dua & Graff, 2017). With 50K individuals total, we use 40K as our database, and classification accuracy on 5K individuals as our performance metric. In these two experiments, we take the maximum data set size m = 1K and m = 5K, respectively.

⁵ Code is available on Github at https://github.com/amiratag/DistributionalShapley

For both settings, we first run D-SHAPLEY without optimizations as a baseline. As a point of comparison, in these settings the computational cost of this baseline is on the same order as running the TMC-SHAPLEY algorithm of (Ghorbani & Zou, 2019), which computes the data Shapley values φ(z; U, B) for each z in the data set B.

We evaluate the effectiveness of the proposed optimizations, using importance sampling and interpolation (separately), for different levels of computational savings, by varying the weights {wk} and subsampling probability p. All algorithms are truncated when the average absolute change in value in the past 100 iterations is less than 1%.

To evaluate the quality of the distributional Shapley estimates, we perform a point removal experiment, as proposed by (Ghorbani & Zou, 2019), where given a training set, we iteratively remove points, retrain the model, and observe how the performance changes. In particular, we remove points from most to least valuable (according to our estimates), and compare to the baseline of removing random points. Intuitively, removing high value data points should result in a more significant drop in the model's performance. We report the results of this point removal experiment using the values determined using the baseline Algorithm 1, as well as various factor speed-ups (where t% refers to the computational cost compared to baseline).

As Figure 1 demonstrates, when training a logistic regression model, removing the high distributional Shapley valued points causes a sharp decrease in accuracy on both tasks, even when using the most aggressive weighted sampling and interpolation optimizations. Appendix E reports the results for various other models. As a finer point of investigation, we report the correlation between the estimated values with-
[Figure 3. Panels (a) and (b); panel titles: Diabetes130, UK Biobank.]
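The point removal evaluation described in Section 3.2 can be sketched in a few lines; `removal_curve`, the toy scores, and the stand-in "accuracy" function are all illustrative assumptions, not the paper's experimental code.

```python
# Sketch of the point removal experiment: remove points from most to least
# valuable, retrain after each removal, and record the resulting performance.
# `values` would come from D-SHAPLEY estimates; here they are toy numbers.
def removal_curve(points, values, train_and_score):
    order = sorted(points, key=lambda z: -values[z])   # most valuable first
    curve = []
    remaining = list(order)
    for z in order:
        remaining.remove(z)
        curve.append(train_and_score(remaining))       # "retrain" on the rest
    return curve

# Toy stand-ins: "accuracy" is just the capped sum of the remaining points.
points = [3, 1, 2, 0]
values = {z: z for z in points}                        # pretend the value of z is z
score = lambda S: min(1.0, sum(S) / 6)
print(removal_curve(points, values, score))   # accuracy falls fastest when high-value points leave first
```

Comparing this curve against the curve from a random removal order is the diagnostic used in Figure 1: a good valuation makes the high-value-first curve drop much more steeply.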
a data set S of the same size; the buyers then valuate the points in S by running the TMC-SHAPLEY algorithm of (Ghorbani & Zou, 2019) on B ∪ S. In Figure 3(a), we show that the rank correlation between the broker's distributional estimates ν(z; U, D, m) and the buyer's observed values φ(z; U, B ∪ S) is generally high. Even when the rank correlation is a bit lower (≈ 0.6), the broker and buyer agree on the value of the set as a whole. Specifically, we observe that the seller's estimates are approximately unbiased, and the absolute percentage error is low, where

APE = | Σ_{z∈S} (ν(z; U, D, m) − φ(z; U, B ∪ S)) | / Σ_{z∈S} ν(z; U, D, m).

In Figure 3(b), we show the results of a point addition experiment for the Diabetes130 data set. Here, we consider the effect of adding the points of S to B under three different orderings: according to the broker's estimates ν(z; U, D, m), according to the buyer's estimates φ(z; U, B ∪ S), and under a random ordering. We observe that the performance (classification accuracy) gains from adding the points according to ν(z) and according to φ(z) track one another well; after the addition of all of S, the resulting models achieve essentially the same performance and considerably outperform random. We report results for the other data sets in Appendix F.

5. Discussion

The present work makes significant progress on understanding statistical aspects of determining the value of data. In particular, by reformulating the data Shapley value as a distributional quantity, we obtain a valuation function that does not depend on a fixed data set; reducing the dependence on the specific draw of data eliminates inconsistencies in valuation that can arise due to sampling artifacts. Further, we demonstrate that the distributional Shapley framework provides an avenue to valuate data across a wide variety of tasks, providing stronger theoretical guarantees and orders of magnitude speed-ups over prior estimation schemes. In particular, the stability results that we prove for distributional Shapley (Theorems 2.8 and 2.7) are not generally true for the original data Shapley due to its dependence on a fixed data set.

One outstanding limitation of the present work is the reliance on a known task, algorithm, and performance metric (i.e. taking the potential U to be fixed). We propose reducing the dependence on these assumptions as a direction for future investigations; indeed, very recent work has started to chip away at the assumption that the learning algorithm is fixed in advance (Yona et al., 2019).

The distributional Shapley perspective also raises the thought-provoking research question of whether we can valuate data while protecting the privacy of individuals who contribute their data. One severe limitation of the data Shapley framework is that the value of every point depends nontrivially on every other point in the data set. In a sense, this makes the data Shapley value an inherently non-private value: the estimate of φ(z; U, B) for a point z ∈ B reveals information about the other points in B. By marginalizing out the dependence on the data set, the distributional Shapley framework opens the door to estimating data valuations while satisfying strong notions of privacy, such as differential privacy (Dwork et al., 2006). Such an estimation scheme could serve as a powerful tool amidst increasing calls to ensure the privacy of, and compensate individuals for, their personal data (Ligett et al., 2019).
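The APE metric above reads off directly in code; the numbers below are illustrative stand-ins for a broker's listed values and a buyer's observed values, not results from the paper.

```python
# Absolute percentage error between the broker's listed distributional values
# nu(z; U, D, m) and the buyer's observed data Shapley values phi(z; U, B + S),
# summed over the purchased set S. Toy numbers, for illustration only.
def ape(broker_values, buyer_values):
    return abs(sum(broker_values) - sum(buyer_values)) / sum(broker_values)

broker = [0.10, 0.05, 0.20]   # seller's listed "prices" for the points of S
buyer = [0.11, 0.04, 0.19]    # buyer's observed values on B + S
print(ape(broker, buyer))     # ≈ 0.0286: broker and buyer nearly agree on the total
```

Note that per-point disagreements can cancel in the sums, which is why a moderate rank correlation (≈ 0.6) can coexist with a small APE on the set as a whole.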
Acknowledgements

MPK was supported in part by CISPA Center for Information Security and NSF Award IIS-1908774.

References

Agarwal, A., Dahleh, M., and Sarkar, T. A marketplace for data: An algorithmic solution. In Proceedings of the 2019 ACM Conference on Economics and Computation, pp. 701–726, 2019.

Aumann, R. J. and Shapley, L. S. Values of non-atomic games. Princeton University Press, 1974.

Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. L-Shapley and C-Shapley: Efficient model interpretation for structured data. arXiv preprint arXiv:1808.02610, 2018.

Cohen, S., Dror, G., and Ruppin, E. Feature selection via coalitional game theory. Neural Computation, 19(7):1939–1961, 2007.

Datta, A., Sen, S., and Zick, Y. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pp. 598–617. IEEE, 2016.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284, 2006.

Ghorbani, A. and Zou, J. Data Shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251, 2019.

Ghorbani, A., Abid, A., and Zou, J. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547, 2017.

Ginart, A., Guan, M., Valiant, G., and Zou, J. Y. Making AI forget you: Data deletion in machine learning. In Advances in Neural Information Processing Systems, pp. 3513–3526, 2019.

Har-Peled, S. Geometric approximation algorithms. Number 173. American Mathematical Society, 2011.

Jia, R., Dao, D., Wang, B., Hubis, F. A., Gurel, N. M., Li, B., Zhang, C., Spanos, C., and Song, D. Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment, 12(11):1610–1623, 2019a.

Jia, R., Dao, D., Wang, B., Hubis, F. A., Hynes, N., Gürel, N. M., Li, B., Zhang, C., Song, D., and Spanos, C. J. Towards efficient data valuation based on the Shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1167–1176, 2019b.

Kleinberg, J., Papadimitriou, C. H., and Raghavan, P. On the value of private information. In Theoretical Aspects of Rationality and Knowledge: Proceedings of the 8th Conference on Theoretical Aspects of Rationality and Knowledge, volume 8, pp. 249–257. Citeseer, 2001.

Kononenko, I. et al. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11(Jan):1–18, 2010.

Ligett, K., Nissim, K., and Gordon-Tapiero, A. Data co-ops. https://csrcl.huji.ac.il/book/data-co-ops, 2019.

Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774, 2017.

Shapley, L. S. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.

Shapley, L. S., Roth, A. E., et al. The Shapley value: essays in honor of Lloyd S. Shapley. Cambridge University Press, 1988.

Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779, 2015.

Yona, G., Ghorbani, A., and Zou, J. Who's responsible? Jointly quantifying the contribution of the learning algorithm and training data. arXiv preprint arXiv:1910.04214, 2019.