THE AMERICAN STATISTICIAN
2020, VOL. 74, NO. 3, 243–248: General
https://doi.org/10.1080/00031305.2019.1575771

On Causal Inferences for Personalized Medicine: How Hidden Causal Assumptions Led to Erroneous Causal Claims About the D-Value

Sander Greenlandᵃ, Michael P. Fayᵇ, Erica H. Brittainᵇ, Joanna H. Shihᶜ, Dean A. Follmannᵇ, Erin E. Gabrielᵈ, and James M. Robinsᵉ

ᵃDepartment of Epidemiology and Department of Statistics, University of California, Los Angeles, Los Angeles, CA; ᵇBiostatistics Research Branch, National Institute of Allergy and Infectious Diseases, Bethesda, MD; ᶜBiometric Research Program, National Cancer Institute, Rockville, MD; ᵈDepartment of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden; ᵉDepartment of Epidemiology and Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA

CONTACT Sander Greenland, lesdomes@ucla.edu, Department of Epidemiology and Department of Statistics, University of California, Los Angeles, Los Angeles, CA.

© 2019 American Statistical Association

ABSTRACT

Personalized medicine asks if a new treatment will help a particular patient, rather than if it improves the average response in a population. Without a causal model to distinguish these questions, interpretational mistakes arise. These mistakes are seen in an article by Demidenko that recommends the "D-value," which is the probability that a randomly chosen person from the new-treatment group has a higher value for the outcome than a randomly chosen person from the control-treatment group. The abstract states "The D-value has a clear interpretation as the proportion of patients who get worse after the treatment" with similar assertions appearing later. We show these statements are incorrect because they require assumptions about the potential outcomes which are neither testable in randomized experiments nor plausible in general. The D-value will not equal the proportion of patients who get worse after treatment if (as expected) those outcomes are correlated. Independence of potential outcomes is unrealistic and eliminates any personalized treatment effects; with dependence, the D-value can even imply treatment is better than control even though most patients are harmed by the treatment. Thus, D-values are misleading for personalized medicine. To prevent misunderstandings, we advise incorporating causal models into basic statistics education.

ARTICLE HISTORY
Received June 2017
Accepted January 2019

KEYWORDS
Causality; D-value; Effect size; Individualized treatment; Patient-centered

1. Causal Models for Individual Effects

1.1. Background: The D-Value Is Not the Proportion of Patients Harmed by Treatment

Despite great advances in causal modeling over recent decades, it still seems largely unrecognized that intuitive causal interpretations of statistical parameters can be rendered fallacious via their dependence on hidden assumptions about causal structure. This problem is well illustrated with a recent article (Demidenko 2016) which made a number of remarkable causal claims about a measure of association called the D-value.¹ Over the ensuing two years it became one of the most downloaded articles in The American Statistician, which is alarming in light of the fact that all the causal claims in the article are incorrect. Foremost, the article states in the abstract that "The D-value has a clear interpretation as the proportion of patients who get worse after the treatment." This statement is false under even the simplest plausible causal model, yet is repeated on page 37. Related incorrect assertions appear throughout the article, for example, on page 36: "The D-value is for personalized medicine when the treatment is sought, not on a group, but on an individual level." Worse, under the quoted misinterpretations the D-value can indicate that the new treatment is better than the control (D < 0.5) even though the treatment harms a majority of its recipients (Hand 1992; Fay et al. 2018).

¹The D-value (possibly with tie adjustments) is also known as the Mann–Whitney parameter (Fay et al. 2018), the probabilistic index (Thas et al. 2012), the relative treatment effect (Brunner and Munzel 2000), and for discrimination problems the concordance index or c-index (Harrell, Lee, and Mark 1996), where it equals the area under the ROC curve (Demidenko 2016).

How were such profoundly erroneous claims justified? We will show that the claims can be derived by introducing a statistically nonidentified causal assumption, one which we regard as extremely implausible in every setting we can imagine. Because similar hidden assumptions appear to be behind other common misinterpretations of effect measures, and given the attention received by Demidenko (2016), we provide a detailed review of the core problem: failure to recognize when interpretations are based on strong and often implausible assumptions about the effect of treatment on outcome.

1.2. Experimental Statistics and Modern Causal Modeling Theory

Scientists have been making causal inferences from experiments since the dawn of modern science, and formal causal models for the design and analysis of randomized experiments go back nearly a century (Neyman 1923; Welch 1937). Since then, the topic of causal inference has become a major sub-branch of statistical theory, with the growing recognition that causal claims require additional formal structure to link them deductively to statistical analyses. Starting with the earliest randomization-test literature, by far the most popular structures for this purpose have been in the form of potential outcomes, which have gradually begun to appear in textbooks; for example, see recent examples by Morgan and Winship (2015), Imbens and Rubin (2015), VanderWeele (2015), Pearl, Glymour, and Jewell (2016), Rosenbaum (2017), and Hernán and Robins (2019). Using this model to study effects of treatment, the traditional outcome or response variable is expanded into a list (vector) of potential outcomes that shows what the outcome would be for the same unit under different treatments; conditional independence ("ignorability") assumptions about treatment assignment are then used to deduce tests and estimates of causal effects, with the latter defined as contrasts comparing the marginal distributions of the different potential outcomes in the same study group (e.g., all trial participants).


The resulting causal-inference theory is more complex than classical regression or ANOVA, for there is now a vector variable inserted where only one outcome variable appeared before. Beyond that, however, it is distinct from traditional mathematical statistics to the extent that the latter focuses strictly on deductions derived from assumptions about probability distributions for the data. It has long been known that even complete knowledge of those probability distributions cannot pinpoint the exact causal mechanism (the explanation or "story") generating the data; for example, two mutually inconsistent potential-outcome models arising from distinct mechanisms can lead to identical data distributions, even when treatment is randomized (Robins 1986, sec. 2A; Robins and Greenland 1989a, 1989b; Dawid 2000; Pearl 2009, sec. 11.1.1).

Thus, although experimental designs can identify and contrast various marginal distributions for the component potential outcomes, without further assumptions they cannot identify their joint (vector) distribution. That limitation may be unsurprising: once we apply a treatment to a group, we can only observe its outcome distribution under that treatment; the potential outcomes under the unreceived (counterfactual) treatments are now unobservable. Some critics have labeled this nonidentification a problem with potential-outcome models; others have responded that it is instead a valuable representation of an innate limitation of purely statistical studies of causation; for example, contrast Dawid (2000) with its discussants.² As we will explain in detail, it is precisely these inherent limitations that were overlooked in Demidenko (2016).

²Indeed, one alternative to potential outcomes (Dawid 2015) deals with the problem by using a reduced structure in which only marginal responses to treatments are defined. This causal model reproduces the same observable distributions as those derived from potential-outcome models, and so yields identical inferential implications and limitations of statistical experiments; it thus lacks sufficient structure to correctly represent concepts like "proportion harmed" and "probability of causation," which contrast potential outcomes within single individuals.

To show how the quoted claims fail, we will use a basic potential-outcome model for treatment effects in which Xi is what the outcome of a randomly selected trial subject indexed by i would be if given the control treatment (e.g., placebo or standard of care) and Yi is what the outcome of the same subject would be if given the new treatment (e.g., an experimental drug). In more common causal notation Xi is Yi(0) and Yi is Yi(1), but we here use Xi and Yi to better match Demidenko's notation. Let (Xi, Yi) be the potential-outcome vector for a random subject, that is, the outcomes under control and new treatment for subject i. Randomization splits the subjects into two random samples from the total: a control sample in which only the control response X is observed and Y is missing, and a new-treatment sample in which only the new-treatment response Y is observed and X is missing. The randomization allows estimation (identification) of the marginal X and Y distributions by classical techniques (Rubin 1978).

Nonetheless, as has long been known (see, e.g., Dawid 2000 and discussants), treatment randomization³ does not identify the pairwise (joint) distribution of the potential outcomes (Xi, Yi). Specifically, the randomized-trial design allows identification of only the marginal distributions of Xi and Yi. In particular, if higher outcome values are considered undesirable, the proportion of patients who would be harmed by the treatment is π = Pr(Yi > Xi), which is a statement about the pairs (Xi, Yi). It thus cannot be identified from the trial data alone because it depends on the joint (Xi, Yi) distribution. Because no complete (Xi, Yi) pair is observed, we cannot verify harm (Yi > Xi), benefit (Yi < Xi), or no effect (Yi = Xi) in any individual from the data alone; we only observe marginal averages over these individuals (Senn 2009).

³This means any weaker condition such as ignorability (independence of treatment and potential outcomes across patients) will also not identify the (Xi, Yi) distribution.

2. Why the D-Value Is Not the Proportion Harmed

2.1. D-Value Versus Proportion Harmed Under a Basic Model

Demidenko (2016) introduced the D-value first under normality assumptions for X and Y, and then (in his sec. 3.1) nonparametrically, referring to the D-value as both the sample estimate and the population parameter. We focus on the D-value parameter, first defining it generally and then examining assumptions and their consequences.

Suppose we have N subjects in the study; then the complete unobserved potential outcomes are the N pairs (X1, Y1), …, (XN, YN). The treatment-assignment process picks n of the N subjects to get the control treatment (say subjects 2, 3, 5, 8, …) and N − n subjects to get the new treatment (say 1, 4, 6, 7, 9, …). Then we only observe (., Y1), (X2, .), (X3, .), (., Y4), (X5, .), (., Y6), (., Y7), (X8, .), (., Y9), and so on, where the "." represents the missing part of the pair (Rubin 1978). The D-value population parameter is δ = Pr(Yi > Xj), where Yi is the outcome of a patient randomly drawn from the new-treatment group, and Xj is the outcome of a distinct patient independently and randomly drawn from the control group (cf. Demidenko 2016, sec. 3.1). Note well, δ is defined using different indices i and j to indicate that the two random variables in this sampling cannot be from the same individual, since δ is derived from sampling two distinct groups.
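The nonidentification described above is easy to see numerically. The following sketch (ours, not from the article; all numbers are invented for illustration) constructs two joint distributions for (Xi, Yi) that share the same marginals, so a randomized trial could never distinguish them and both give the same δ, yet their proportions harmed π differ sharply.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000                       # hypothetical population size, for illustration only
x = rng.normal(0.0, 1.0, size=N)  # X_i: potential outcome under control (larger = worse)

# Two hypothetical joint distributions with identical marginals for X and Y:
y_a = x - 0.2    # (a) everyone improves by 0.2, so the proportion harmed is 0
y_b = -x - 0.2   # (b) Y is again N(-0.2, 1), but now nearly half are harmed

for label, y in [("joint (a)", y_a), ("joint (b)", y_b)]:
    pi = np.mean(y > x)                      # Pr(Y_i > X_i): needs the within-patient pairs
    delta = np.mean(y > rng.permutation(x))  # Pr(Y_i > X_j): needs only the two marginals
    print(f"{label}: delta = {delta:.3f}, pi = {pi:.3f}")
# Both joints give delta of about 0.444 and would generate identical randomized-trial
# data, yet pi is 0.000 under (a) and about 0.460 under (b).
```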

When a larger response denotes more harm, one might be tempted to shorten the description of δ to "the proportion harmed," leading to the incorrect causal interpretations in Demidenko (2016). To see why this is wrong, we must contrast the D-value, δ = Pr(Yi > Xj), with the proportion harmed, which is π = Pr(Yi > Xi). The key difference overlooked in Demidenko (2016) is that (in sharp contrast to δ) the indices of X and Y in the proportion harmed π are by definition identical: use of the same index i in π denotes that the two outcomes must come from the same individual. This follows common sense: to claim patient i was harmed by the new treatment is to claim that patient i would have done better on the control treatment, i.e., that Yi > Xi.

To highlight the difference between δ and π, we revisit the example in Demidenko (2016) which compares placebo to a weight-loss drug in a randomized trial. The average ending weight is one pound less in the drug group than in the placebo group; the corresponding D-value is 0.486, which the article states is "the proportion of patients who get worse after the treatment." Suppose however that the 1-pound average loss arose because the diet drug produced some weight loss in every patient and on average that loss was one pound (e.g., as would happen if the treatment causes a one-pound weight loss relative to placebo in everyone: Yi − Xi = −1 for all i); then no one was or would be harmed by treatment (π = 0). Thus the D-value of 0.486 cannot be the proportion harmed.
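One hypothetical data-generating process behind this example can be simulated directly: give every patient exactly a one-pound benefit and assume an ending-weight standard deviation of about 20 lb (our choice, not Demidenko's; it is picked so that the D-value lands near the article's 0.486). The sketch below then shows π = 0 while δ stays close to 1/2.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500_000                              # hypothetical cohort size, for Monte Carlo accuracy
x = rng.normal(180.0, 20.0, size=N)      # X_i: ending weight (lb) under placebo (assumed values)
y = x - 1.0                              # Y_i: ending weight under drug; everyone loses exactly 1 lb

pi = np.mean(y > x)                      # proportion harmed: exactly 0 here
delta = np.mean(y > rng.permutation(x))  # D-value: compares weights of *different* patients
print(f"pi = {pi:.3f}, delta = {delta:.3f}")  # pi = 0.000 while delta is approximately 0.486
```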
If the outcome were transient with no carry-over or time trend, and the trial could be repeated in the same patient (an N-of-1 crossover trial), we could see both Yi and Xi, and thus see directly whether a patient was harmed (Yi > Xi), unaffected (Yi = Xi), or helped (Yi < Xi) by the treatment. Otherwise, one must turn to a causal model to impute the missing potential outcome (Xi missing in the treatment group, Yi missing in the control group) (Greenland, Rothman, and Lash 2008; Morgan and Winship 2015; Imbens and Rubin 2015; Hernán and Robins 2019). The error in Demidenko (2016) can thus be attributed to not holding the individual subscript i constant when claiming to estimate the proportion harmed π or other "personalized" effects of treatment, and thus failing to recognize that the marginal distributions identified by randomization do not determine the joint distribution on which the proportion harmed π is defined. This oversight will invalidate any statistical claim about personalized effects based solely on the treatment-assignment mechanism (as opposed to assumptions about the mechanism connecting treatment to the outcome).

2.2. D-Value With Normality Assumptions

Demidenko (2016, sec. 3) assumed the marginal distributions were normal, but made no explicit assumptions about the bivariate distribution Fxy. Gadbury and Iyer (2000) assumed that the (Xi, Yi) pairs were draws from a bivariate normal distribution; they noted that π is not identifiable from a two-sample randomized experiment because the correlation between Xi and Yi is not identifiable (since, as noted above, for each individual only one of the potential outcomes is observed). Under the bivariate normality assumption, we get δ = π if and only if the correlation is 0, so that Xi and Yi are independent. Although π is not identifiable, Gadbury and Iyer (2000) explored bounds on π under the bivariate normality assumption. Even without bivariate normality, if the outcomes of the same subject under different treatments are independent, Pr(Xi, Yi) = Pr(Xi)Pr(Yi); under randomization the latter product equals Pr(Xi)Pr(Yj), leading to δ = π; however, this independence would mean a subject's control outcome has no predictive value at all for their treatment outcome. As we hope will become obvious to the reader, this is an untenable assumption.

It follows that any statement derived from confusing the D-value with the proportion harmed can be highly misleading, because strong positive correlations of potential outcomes should be expected: regardless of treatment, an individual will retain the same baseline outcome predictors, such as age, sex, baseline weight, their entire genome, their entire biome, and countless unmeasured predictors. Thus, a patient who would have ended with a higher outcome than other patients under control is also likely to end with a higher outcome under the new treatment, simply by virtue of having identical baseline predictors.

Not only can δ and π differ to a large extent, they can even differ in direction in the sense of δ < 1/2 < π, and thus convey a very different impression of the typical direction of effect, so that the new treatment could look better in terms of the D-value (i.e., δ < 1/2) but could actually be worse for most patients (i.e., π > 1/2). Hand (1992) showed how this could happen even if X and Y are marginally normal; specifically, with means μx and μy and the same variance σ² for X and Y, these opposite directional effects can occur whenever |μy − μx|/σ < 1.35, a result sometimes called "Hand's paradox." Under these normality assumptions, the D-value is δ = Φ((μy − μx)/(√2 σ)), where Φ is the standard normal distribution function; it follows that the paradox (i.e., δ and π indicating opposite directions of effect) can occur whenever δ ∈ (0.170, 0.830) (Fay et al. 2018). So even if δ indicates that treatment is worse than control, we cannot rule out the possibility that treatment benefits more than half the population. Basically, with δ, large harms for the minority can outweigh small benefits for the majority (and vice versa).
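Under the Gadbury–Iyer bivariate-normal model the two parameters have closed forms, because the within-patient difference Yi − Xi has variance 2σ²(1 − ρ) while the between-patient difference Yi − Xj has variance 2σ². The sketch below (ours; the particular mean difference and correlations are arbitrary choices) evaluates both and confirms that δ = π only at ρ = 0.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal distribution function

def d_value(mu_x, mu_y, sigma):
    """delta = Pr(Y_i > X_j) for two independently sampled patients i and j."""
    return Phi((mu_y - mu_x) / (sigma * sqrt(2.0)))

def prop_harmed(mu_x, mu_y, sigma, rho):
    """pi = Pr(Y_i > X_i) when (X_i, Y_i) is bivariate normal with equal variances."""
    return Phi((mu_y - mu_x) / (sigma * sqrt(2.0 * (1.0 - rho))))

mu_x, mu_y, sigma = 0.0, -0.5, 1.0   # illustration only: treatment lowers the mean outcome
print(f"delta = {d_value(mu_x, mu_y, sigma):.3f}")   # about 0.362, fixed by the marginals
for rho in (0.0, 0.5, 0.9, 0.99):
    print(f"rho = {rho:4.2f}: pi = {prop_harmed(mu_x, mu_y, sigma, rho):.3f}")
# pi equals delta only at rho = 0; as rho grows toward 1 the proportion harmed
# shrinks toward 0 even though delta never moves from 0.362.
```

Note that with a jointly bivariate-normal, equal-variance pair, δ and π always lie on the same side of 1/2; the directional reversal of Hand's paradox requires a non-normal joint distribution, such as the mixtures of bivariate normals mentioned at the end of Section 2.3.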

2.3. D-Value Without Normality Assumptions

To illustrate how the D-value behaves without normality assumptions, we explore three simple discrete-outcome examples. Let Fxy be the bivariate distribution of (Xi, Yi), and let Fx and Fy be the marginal distributions of Xi and Yi, respectively. Figure 1 shows three bivariate distributions Fxy, all with the same marginal distributions, Fx and Fy. The marginal distribution for X has probability 1/3 on each of the points 2, 4, and 6, while the marginal distribution for Y has probability 1/3 on each of the points 1, 3, and 5. As before, assume larger values are worse, so parameter values less than 1/2 imply that the new treatment is better than the control. Since the marginal distributions are the same in all three panels of Figure 1, the δ values are the same: all have δ = Pr(Yi > Xj) = 1/3.

Figure 1. Three bivariate distributions, all with the same marginal distributions: Fx has equal probabilities on 2, 4, and 6, and Fy has equal probabilities on 1, 3, and 5. Circle areas are proportional to the probabilities.

Figure 1(a) shows the bivariate distribution when X and Y are independent; this is the case when π = Pr(Yi > Xi) = δ = 1/3. Figure 1(b) shows the bivariate distribution with a constant effect Yi − Xi = −1 for all patients, in which case π = 0 since everyone benefits from treatment (the effect is strictly downward monotonic); yet, under Demidenko's misinterpretation of δ, we would falsely infer that δ = 1/3 of patients would do worse under the new treatment than under the control. Finally, Figure 1(c) illustrates Hand's paradox, where π = 2/3 > 1/2, so that the new treatment harms most patients, yet under Demidenko's misinterpretation of δ = 1/3 < 1/2, we would falsely infer that most patients are not harmed by the new treatment. Of course, Hand's paradox can occur with continuous data as well. To see this, imagine mixtures of bivariate normal distributions with their centers at the points in Figure 1(c).
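Because the three panels of Figure 1 are specified by their support points, δ and π can be checked directly. In the sketch below (ours), panel (c) uses one coupling consistent with the stated marginals and with π = 2/3; the exact placement of probability in the published figure is not reproduced here.

```python
from itertools import product

def delta_and_pi(joint):
    """joint: list of equally likely (x, y) potential-outcome pairs for one patient."""
    n = len(joint)
    pi = sum(y > x for x, y in joint) / n                                      # Pr(Y_i > X_i)
    delta = sum(yi > xj for (xj, _), (_, yi) in product(joint, joint)) / n**2  # Pr(Y_i > X_j)
    return delta, pi

panels = {
    "(a) independent":          [(x, y) for x in (2, 4, 6) for y in (1, 3, 5)],
    "(b) constant shift of -1": [(2, 1), (4, 3), (6, 5)],
    "(c) Hand's paradox":       [(2, 3), (4, 5), (6, 1)],  # one coupling giving pi = 2/3
}
for name, joint in panels.items():
    delta, pi = delta_and_pi(joint)
    print(f"{name}: delta = {delta:.3f}, pi = {pi:.3f}")
# All three panels print delta = 0.333, while pi is 0.333, 0.000, and 0.667.
```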
As in the case where Fxy is bivariate normal, in the general case we can identify δ but not π using only the marginal distributions Fx and Fy (which are all that are identified by the randomization indicator alone, as in classical randomization tests and their approximations).⁴ Nevertheless, we can get bounds on π. Fay et al. (2018) reviewed bounds for π given the marginal distributions Fx and Fy under both nonparametric and semiparametric assumptions.

⁴To see this, note that δ is a functional gδ(Fx, Fy) = Pr(Yi > Xj) of the randomization-identified marginal distributions, whereas π is a functional gπ(Fxy) = Pr(Yi > Xi) of the nonidentified bivariate distribution.

For continuously distributed responses, Fx = Fy implies δ = 1/2, but for discrete responses we can have δ < 1/2 when, unlike the examples of Figure 1, there is the possibility of ties (i.e., Pr(Yi = Xj) > 0). We could adjust the D-value for ties using Pr(Yi > Xj) + Pr(Yi = Xj)/2, so that when Fx = Fy the adjusted D-value parameter equals 1/2 even with discrete responses with ties. We could similarly adjust π to Pr(Yi > Xi) + Pr(Yi = Xi)/2, but this adjustment would destroy its interpretation as the proportion harmed; for example, π would then equal 1/2 for a treatment with no harm or benefit, i.e., Pr(Yi = Xi) = 1, and would equal 1/4 for a treatment that harmed no one and benefitted half the patients, that is, Pr(Yi > Xi) = 0, Pr(Yi = Xi) = Pr(Yi < Xi) = 1/2.
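A small sketch (ours; the discrete scores are invented) of the tie adjustment Pr(Yi > Xj) + Pr(Yi = Xj)/2: with two samples drawn from identical discrete distributions, the unadjusted D-value falls below 1/2 purely because of ties, while the adjusted version returns 1/2.

```python
import numpy as np

def d_value(y, x, adjust_ties=False):
    """Estimate Pr(Y > X) from two samples; optionally add Pr(Y = X)/2 for ties."""
    y = np.asarray(y)[:, None]
    x = np.asarray(x)[None, :]
    d = np.mean(y > x)
    if adjust_ties:
        d += np.mean(y == x) / 2.0
    return d

x_obs = [0, 0, 1, 1, 2, 2]   # hypothetical control scores (so Fx = Fy below)
y_obs = [0, 0, 1, 1, 2, 2]   # hypothetical new-treatment scores
print(d_value(y_obs, x_obs))                    # 0.333...: below 1/2 only because of ties
print(d_value(y_obs, x_obs, adjust_ties=True))  # 0.5: the tie-adjusted D-value
```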
3. Personalized Medicine and Statistical Inferences

In order for any measure to be used for personalized medicine, we need baseline covariates to describe different subgroups with different effects. Without subgroup analysis we can only approach individual effects via their group averages. Examples in simple two-sample randomized trials include the average causal effect E(Yi) − E(Xi) = E(Yi − Xi), which shows how the usual mean difference is a simple average of individual effects, a property not shared by other measures; notably, odds ratios are not weighted averages of individual causal effects (Greenland, Robins, and Pearl 1999). In the next section, we show how δ − 1/2 can be reconstructed as an average individual effect, albeit not a simple one. Because these effects are averaged over individuals for each group, it is misleading to state "the D-value is for personalized medicine when the treatment is sought, not on a group, but on an individual level" (Demidenko 2016, p. 36).

If we want to come closer to individual effects we must measure baseline covariates to allow us to "personalize" the effects. Given data sufficient in quantity, quality, and detail (as from large trials and pooling projects), personalized treatment decisions can be aided by estimating potential outcomes as functions of baseline covariates Z available in the trial data (Robins 1986, sec. 3C; 2004; Murphy 2003). Even with such resources, however, we think it probable that Y and X will remain highly correlated within covariate levels, which may leave the covariate-specific D-values far from the actual proportion harmed.
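The remark that odds ratios are not weighted averages of individual causal effects is tied to the non-collapsibility results of Greenland, Robins, and Pearl (1999). A minimal sketch of the standard contrast (our own round numbers, two equal-sized strata): the marginal risk difference is exactly the average of the stratum risk differences, but the marginal odds ratio is not 4 and hence not any average of the stratum odds ratios, both of which are 4.

```python
# Non-collapsibility of the odds ratio versus collapsibility of the risk difference.
def odds(p):
    return p / (1.0 - p)

# (risk under control, risk under treatment) in two equal-sized strata,
# chosen so that the odds ratio is exactly 4 in each stratum (invented numbers).
strata = [(0.20, 0.50), (0.70, 28.0 / 31.0)]

for p0, p1 in strata:
    print(f"stratum: risk difference = {p1 - p0:.3f}, odds ratio = {odds(p1) / odds(p0):.2f}")

p0_m = sum(p0 for p0, _ in strata) / 2   # marginal risk under control
p1_m = sum(p1 for _, p1 in strata) / 2   # marginal risk under treatment
print(f"marginal: risk difference = {p1_m - p0_m:.3f}, odds ratio = {odds(p1_m) / odds(p0_m):.2f}")
# The marginal risk difference (0.252) equals the average of the stratum risk
# differences, but the marginal odds ratio (about 2.87) is neither 4 nor any
# weighted average of the stratum odds ratios.
```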

4. A Causal Interpretation of the D-Value

To summarize to this point: we can interpret δ causally by viewing its departure from its null value of 1/2 as one measure of the change in the marginal outcome distribution produced by the new treatment relative to the control. Nonetheless, as we have seen, this interpretation is not a patient-specific ("individualized") causal effect because it averages over within-patient effects within each group, so the same average effect can come from very different distributions of individual effects. For example, δ = 1/2 if there is no individual effect (Yi = Xi for everyone, in which case π = 0), but δ = 1/2 could also occur if everyone was affected (e.g., if half the patients had Yi − Xi = 1 and half had Yi − Xi = −1, in which case π = δ = 1/2).

To understand the limitations of δ within a causal model, we can imagine that the patients enrolled in the trial define a population of potential outcomes under control treatment (the X marginal distribution) and a population of potential outcomes under the new treatment (the Y marginal distribution). Then δ is the probability that a randomly selected potential outcome under the new treatment is larger than an independently selected potential outcome under the control treatment. In contrast, π refers to pairs (Xi, Yi) of these potential outcomes matched by patient: π is the probability that a randomly selected patient has a potential outcome under new treatment that is larger than that same patient's potential outcome under the control treatment.

While in general δ ≠ π, we may ask if there is at least some sort of precise causal interpretation for δ in terms of the change in outcome distribution produced by treatment. It turns out there is one for the null-centered version δ − 1/2, albeit one much less straightforward than the causal interpretation for π because it depends on averaging explicit quantiles of the marginal distributions of X and Y under the different treatments. Specifically, Fay et al. (2018) derive δ − 1/2 in terms of what they call the expected quantile difference (EQD) causal effect, which should not be confused with the quantile causal effect Fy⁻¹(Yi) − Fx⁻¹(Xi); see Xu, Daniels, and Winterstein (2018). For simplicity, we will only describe the EQD under the continuity assumption; see Fay et al. (2018) for the general tie-adjusted version.

We first define the randomized outcome quantile level for the total patient population in a 1:1 randomized experiment. Let the observed outcome be Wi = Ri Yi + (1 − Ri)Xi, where Ri is the independent random (Bernoulli) treatment-assignment indicator with Pr(Ri = 1) = 1/2. Wi represents randomized observation of one of the two potential outcomes Y and X for the ith individual, which yields the ith individual's observed response. The distribution of Wi is G(w) = (Fx(w) + Fy(w))/2 = Pr(Wi ≤ w). With G(wi) = qi, the observed outcome wi for patient i is then at the qi-th quantile of G, and qi is the quantile level of the observed outcome in the total patient population. The difference in quantile levels for the ith individual due to treatment is Di = G(Yi) − G(Xi), which Fay et al. (2018) called the quantile difference causal effect for the ith individual. Di represents an individual causal treatment effect insofar as it measures the change in the individual's ordinal outcome location in the total patient population produced by the new treatment. For example, if Di is 0.10, then the quantile level for the ith individual would increase by 0.10 on treatment (e.g., go from 0.16 on control to 0.26 on treatment). Fay et al. (2018) showed that the expectation of those individual causal treatment effects is δ − 1/2, that is,

δ − 1/2 = E(Di) = E(G(Yi)) − E(G(Xi)).

Since G is a function of the marginal distributions only, δ is identifiable. So a proper causal interpretation of δ − 1/2 from a randomized experiment is as the expectation of the individual quantile-difference causal effects. Although the causal interpretation of δ is less straightforward than that of π, the lack of identifiability of π limits its practical use, and so it may be unsurprising that most of the modern causal modeling literature has focused on models for treatment effects on marginal distributions.
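The identity δ − 1/2 = E(G(Yi)) − E(G(Xi)) can be checked from the two observed arms alone, because G depends only on the marginal distributions. A quick Monte Carlo sketch (ours; the lognormal arm distributions and sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000                                             # patients per arm in a hypothetical 1:1 trial
x_obs = rng.lognormal(mean=0.0, sigma=0.5, size=n)   # observed control responses
y_obs = rng.lognormal(mean=-0.1, sigma=0.5, size=n)  # observed new-treatment responses

# D-value estimate: proportion of between-arm pairs with Y > X.
delta_hat = np.mean(y_obs[:, None] > x_obs[None, :])

# G estimated by the empirical distribution of the pooled randomized observations W.
w_sorted = np.sort(np.concatenate([x_obs, y_obs]))

def G(values):
    return np.searchsorted(w_sorted, values, side="right") / w_sorted.size

lhs = delta_hat - 0.5
rhs = np.mean(G(y_obs)) - np.mean(G(x_obs))          # E[G(Y)] - E[G(X)]
print(f"delta - 1/2 = {lhs:.4f}, E[G(Y)] - E[G(X)] = {rhs:.4f}")  # agree up to floating point
```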
5. Discussion

The difference between the D-value parameter δ and the proportion harmed π has long been known (see, e.g., Hand 1992) yet has been overlooked repeatedly. One reason for the oversight may be that in a randomized trial the observed outcomes on which δ is defined (Xj and Yi from different patients j and i) are independent, while the potential-outcome pairs on which π is defined (Xi and Yi from the same patient i) are likely to be highly correlated. Confusion thus seems inevitable when (as in Demidenko 2016) there is no underlying causal model that distinguishes the two cases. Once such a model is given, the mistake of confusing δ with π can be seen as a mistake of equating independent marginal sampling to joint-distribution sampling (sampling of patient-specific potential-outcome pairs).

This issue is closely related to other confusions of marginal and individual causal measures. Consider a binary outcome or failure-time outcome. Most textbooks of the past century as well as many legal and policy documents mistakenly equated the attributable fraction ϕ = (Pr(Yi = 1) − Pr(Xi = 1))/Pr(Yi = 1) ("attributable risk" or excess fraction) to the probability of causation Pr(Yi > Xi)/Pr(Yi = 1) = π/Pr(Yi = 1). But as above for δ versus π, the fraction ϕ is defined from the marginal distributions and is thus identified by treatment randomization, whereas π is not (Greenland and Robins 2000). Without strong assumptions about the bivariate distribution of potential outcomes (Xi, Yi) we can only say ϕ ≤ π/Pr(Yi = 1) ≤ 1 (Robins and Greenland 1989a, 1989b).
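For the binary-outcome analogue, the sketch below (ours; the risks are invented) fixes the marginal risks, which pins down the excess fraction ϕ, and then varies the joint distribution of (Xi, Yi); the probability of causation π/Pr(Yi = 1) ranges from ϕ up to 1, exactly as the displayed inequality allows.

```python
# Excess fraction phi is identified by the marginals; the probability of
# causation pi / Pr(Y = 1) additionally depends on the unidentified joint.
p_x1 = 0.20   # Pr(X_i = 1): risk of the (bad) outcome if given control  (invented)
p_y1 = 0.50   # Pr(Y_i = 1): risk of the outcome if given new treatment  (invented)
phi = (p_y1 - p_x1) / p_y1                 # attributable (excess) fraction = 0.60

# With binary outcomes, the joint is pinned down by q = Pr(Y_i = 1, X_i = 1).
q_min = max(0.0, p_x1 + p_y1 - 1.0)
q_max = min(p_x1, p_y1)
for q in (q_max, (q_min + q_max) / 2.0, q_min):   # same marginals, different joints
    pi = p_y1 - q                                 # Pr(Y_i = 1, X_i = 0): proportion harmed
    print(f"q = {q:.2f}: probability of causation = {pi / p_y1:.2f}")
print(f"excess fraction phi = {phi:.2f} (always <= probability of causation <= 1)")
```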
Demidenko (2016) repeats the well-known fact that a p-value does not represent an effect-size estimate, which certainly bears emphasis. Unfortunately, his article concludes that replacing the p-value with the D-value offers a partial solution to "the poor reproducibility of scientific experiments." We strongly disagree with that statement regardless of the chosen measure of effect. Any measure is vulnerable to data-driven or otherwise motivated selection effects by the analyst, including artefacts of subgroup analysis. Even with perfectly honest and valid reporting, in fields like psychology, the social sciences, and medicine, effect sizes can and should be expected to vary dramatically across groups due to differences in the distribution of innumerable modifiers of the effect. It is, for example, often lamented that the effects of drugs seen in randomized trials do not predict well the effects seen in clinical practice, a fact unsurprising when one realizes that trials have numerous admission criteria that could alter effects (e.g., excluding reproductive-age women), whereas clinical prescribers can and do prescribe far beyond those limits and practice in a much less controlled environment.

Those estimation problems aside, there seems to be poor understanding that a valid p-value should vary dramatically across studies (Senn 2001, 2002; Gelman and Stern 2006; Boos and Stefanski 2011; Greenland 2019). Consequently, "replication failures" are often simply an artefact of using an arbitrary and easily crossed threshold such as 0.05 to dichotomize results into "positive" and "negative," and of focusing on only one hypothesis and its p-value (Poole 1987; Rothman, Johnson, and Sugano 1999; Senn 2001, 2002; Greenland 2017a; Amrhein, Trafimow, and Greenland 2019). Such dichotomania and nullism have been condemned in scores if not hundreds of sources, and are addressed simply by reporting precisely and discussing the different p-values that one gets from testing different parameter values, rather than focusing on where those p-values fall relative to some arbitrary if conventional cutoff [in this journal, see recent articles by Wasserstein and Lazar (2016), Greenland et al. (2016), and Greenland (2019)]. Simply adopting a different measure of effect does nothing to address this core source of spurious "nonreproducibility."

The main goal of this note has been to highlight the misinterpretations of statistical parameters that arise when the parameter is not precisely defined by an explicit causal model. This has been illustrated by claims that the D-value (δ) is the proportion harmed (π), when in fact those are two different parameters which may not even be close under realistic assumptions. It is difficult to see that the two parameters are different without using potential outcomes or similar causal formalisms. Although there remains disagreement about the relative merits of formal and informal approaches to causal inference (Greenland 2017b), we strongly advise that basic causal models become part of elementary statistics education so that mistaken intuitive interpretations of effect estimates can be avoided and valid methods for effect prediction can be more widely deployed.

References

Amrhein, V., Trafimow, D., and Greenland, S. (2019), "Inferential Statistics Are Descriptive Statistics," The American Statistician, 73(S1), 262–270.
Boos, D. D., and Stefanski, L. A. (2011), "P-Value Precision and Reproducibility," The American Statistician, 65, 213–221.
Brunner, E., and Munzel, U. (2000), "The Nonparametric Behrens–Fisher Problem: Asymptotic Theory and a Small-Sample Approximation," Biometrical Journal, 42, 17–25.
Dawid, A. P. (2000), "Causal Inference Without Counterfactuals" (with discussion), Journal of the American Statistical Association, 95, 407–448.
Dawid, A. P. (2015), "Statistical Causality From a Decision-Theoretic Perspective," Annual Review of Statistics and Its Application, 2, 273–303.
Demidenko, E. (2016), "The p-Value You Can't Buy," The American Statistician, 70, 33–38.
Fay, M. P., Brittain, E. H., Shih, J. H., Follmann, D. A., and Gabriel, E. E. (2018), "Causal Estimands and Confidence Intervals Associated With Wilcoxon–Mann–Whitney Tests in Randomized Experiments," Statistics in Medicine, 37, 2923–2937.
Gadbury, G. L., and Iyer, H. K. (2000), "Unit-Treatment Interaction and Its Practical Consequences," Biometrics, 56, 882–885.
Gelman, A., and Stern, H. (2006), "The Difference Between 'Significant' and 'Not Significant' Is Not Itself Statistically Significant," The American Statistician, 60, 328–331.
Greenland, S. (2017a), "The Need for Cognitive Science in Methodology," American Journal of Epidemiology, 186, 639–645.
Greenland, S. (2017b), "For and Against Methodology: Some Perspectives on Recent Causal and Statistical Inference Debates," European Journal of Epidemiology, 32, 3–20.
Greenland, S. (2019), "Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values," The American Statistician, 73, 106–114.
Greenland, S., and Robins, J. M. (2000), "Epidemiology, Justice, and the Probability of Causation," Jurimetrics, 40, 321–340.
Greenland, S., Robins, J. M., and Pearl, J. (1999), "Confounding and Collapsibility in Causal Inference," Statistical Science, 14, 29–46.
Greenland, S., Rothman, K. J., and Lash, T. L. (2008), "Measures of Effect and Measures of Association," in Modern Epidemiology (3rd ed.), eds. K. J. Rothman, S. Greenland, and T. L. Lash, Philadelphia: Lippincott-Wolters-Kluwer, Chapter 4.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and Altman, D. G. (2016), "Statistical Tests, P-Values, Confidence Intervals, and Power: A Guide to Misinterpretations," The American Statistician, 70(S1), 1–12.
Hand, D. J. (1992), "On Comparing Two Treatments," The American Statistician, 46, 190–192.
Harrell, F. E., Lee, K. L., and Mark, D. B. (1996), "Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors," Statistics in Medicine, 15, 361–387.
Hernán, M. A., and Robins, J. M. (2019), Causal Inference, Boca Raton: Chapman & Hall/CRC, forthcoming.
Imbens, G., and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, New York: Cambridge University Press.
Morgan, S., and Winship, C. (2015), Counterfactuals and Causal Inference: Methods and Principles for Social Research (2nd ed.), New York: Cambridge University Press.
Murphy, S. A. (2003), "Optimal Dynamic Treatment Regimes," Journal of the Royal Statistical Society, Series B, 65, 331–355.
Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9," translated in Statistical Science (1990), 5, 465–472.
Pearl, J. (2009), Causality: Models, Reasoning, and Inference (2nd ed.), New York: Cambridge University Press.
Pearl, J., Glymour, M., and Jewell, N. P. (2016), Causal Inference in Statistics: A Primer, New York: Wiley.
Poole, C. (1987), "Beyond the Confidence Interval," American Journal of Public Health, 77, 195–199.
Robins, J. M. (1986), "A New Approach to Causal Inference in Mortality Studies With Sustained Exposure Periods–Application to Control of the Healthy Worker Survivor Effect," Mathematical Modelling, 7, 1393–1512.
Robins, J. M. (2004), "Optimal Structural Nested Models for Optimal Sequential Decisions," in Proceedings of the Second Seattle Symposium in Biostatistics, New York: Springer, pp. 189–326.
Robins, J. M., and Greenland, S. (1989a), "Estimability and Estimation of Excess and Etiologic Fractions," Statistics in Medicine, 8, 845–859.
Robins, J. M., and Greenland, S. (1989b), "The Probability of Causation Under a Stochastic Model for Individual Risks," Biometrics, 45, 1125–1138.
Rosenbaum, P. R. (2017), Observation and Experiment: An Introduction to Causal Inference, Cambridge, MA: Harvard University Press.
Rothman, K. J., Johnson, E. S., and Sugano, D. S. (1999), "Is Flutamide Effective in Patients With Bilateral Orchiectomy?," The Lancet, 353, 1184.
Rubin, D. B. (1978), "Bayesian Inference for Causal Effects: The Role of Randomization," Annals of Statistics, 6, 34–58.
Senn, S. J. (2001), "Two Cheers for P-Values," Journal of Epidemiology and Biostatistics, 6, 193–204.
Senn, S. J. (2002), "Letter to the Editor re: Goodman 1992," Statistics in Medicine, 21, 2437–2444.
Senn, S. J. (2009), "Three Things That Every Medical Writer Should Know About Statistics," The Journal of the European Medical Writers Association, 18, 159–162.
Thas, O., Neve, J. D., Clement, L., and Ottoy, J. P. (2012), "Probabilistic Index Models," Journal of the Royal Statistical Society, Series B, 74, 623–671.
VanderWeele, T. (2015), Explanation in Causal Inference: Methods for Mediation and Interaction, New York: Oxford University Press.
Wasserstein, R. L., and Lazar, N. A. (2016), "The ASA's Statement on p-Values: Context, Process, and Purpose," The American Statistician, 70, 129–133.
Welch, B. L. (1937), "On the Z-Test in Randomized Blocks and Latin Squares," Biometrika, 29, 21–52.
Xu, D., Daniels, M. J., and Winterstein, A. G. (2018), "A Bayesian Nonparametric Approach to Causal Inference on Quantiles," Biometrics, 74, 986–996.
