
The McCloskey Critique

Deirdre McCloskey is an outspoken critic of significance testing in economics.


Significance testing focuses on whether something is true (for example, whether
x has an effect on y), without regard for the magnitude of the relationship.
Virtually all academic research in economics focuses primarily on statistical
significance. I describe this as the "p < 0.05" mentality. (McCloskey calls it
"the cult of statistical significance.") Researchers estimate a model, and they report
which explanatory variables have a significant effect. The magnitude of the effect
is of secondary interest. Frequently, researchers do not discuss magnitude at all.
Usually, the magnitude gets attention only if the effect is significant. Almost
never do researchers discuss the magnitude of insignificant results (even if the
effect is potentially large!).
Perhaps some historical perspective helps. One of Ronald Fisher's early
applications of significance testing was in answering the question of whether
genetically identical twins existed. He specifically wanted to test whether more
than 50% of twins were same-sex. Fisher was interested in seeking out truth.
He wanted to know whether something was true, a yes-or-no condition. At
times, this is an interesting question. We might want to know whether
extraterrestrial intelligence exists. We might want to know whether cold fusion is
possible. We might want to know whether monetary policy affects the economy.
This is seeking knowledge for the sake of knowledge. Knowing whether
something is true can guide research. If we know that genetically identical twins
exist, then we will be able to do analyses that separate nature from nurture. (That
was Fisher's intent.) But for this to have a practical application, we need to know
how big the effect is. If identical twins are rather frequent, we could gather a
large enough sample to conduct research. If they are rare, it would be
impractical. Magnitude matters in practical applications. Hence, McCloskey
reminds us to think about practical significance, also called economic
significance.
Let me discuss four variations on the McCloskey critique.
1. Statistical significance versus practical significance.
What if economic theory predicts that purchasing power parity (PPP) should equal 1?
We collect data and estimate PPP to be 0.99990, with a 95% confidence
interval going from 0.99989 to 0.99991. We would reject the hypothesis that PPP
equals 1. Should we toss out the theory entirely, as a complete failure,
based on this tiny deviation? Should we reject any action that is optimal when
PPP = 1? Strict significance testing says the theory is false, and these actions are
less than optimal. A wise person might instead say, "the theory is almost
perfect, except that we haven't figured out one tiny detail; these actions are
almost optimal, off by only a smidgen."
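
To make this concrete, here is a small Python sketch of that test. The standard
error is back-solved from the hypothetical confidence interval above, so treat
all the numbers as illustrative:

from scipy import stats

ppp_hat = 0.99990          # hypothetical estimated PPP
se = 0.00001 / 1.96        # SE implied by the 95% CI (0.99989, 0.99991)
z = (ppp_hat - 1.0) / se   # test H0: PPP = 1
p = 2 * stats.norm.sf(abs(z))
print(f"z = {z:.1f}, p = {p:.1e}")   # z = -19.6: decisively "rejected"
print(f"deviation from theory: {abs(ppp_hat - 1.0):.5f}")   # yet only 0.00010

The test rejects overwhelmingly, yet the economically relevant number is the
0.00010 deviation, not the p-value.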
As economists, we know that the decision to take some action depends on costs
and benefits. Statistical significance is a purely mathematical calculation; it
introduces no values. Practical significance is defined as "does it have any
applications?" or "would you make a decision based on it?"
McCloskey would say that statistical significance is neither a necessary nor
sufficient condition for practical significance. Something can have statistical
significance without practical significance. My example is discrimination in
earnings. We might see that there is a statistically significant difference between
women's and men's earnings. If women earn 70% as much as men, most of us
might agree that it calls for action; we would want to implement policies to
remedy the inequality. What if women earn 90% as much as men? 99% as much?
99.99% as much, but it's statistically significant? At what point do we determine
that the costs of taking action against the inequality are not worth the benefits?
That's a value judgment. And that's what defines practical significance: the point
at which the effect is big enough to be a call to action. A statistically significant
difference might be so small that it has few practical implications.
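
A quick Python sketch shows how this happens with large samples; all of the
numbers here are invented for illustration:

from math import sqrt
from scipy import stats

gap = 5.0           # women earn $5 less on a $50,000 mean: 99.99% as much
sd = 15_000.0       # assumed earnings standard deviation in each group
n = 100_000_000     # per-group sample size (administrative-data scale)

se = sd * sqrt(2 / n)   # standard error of the difference in means
z = gap / se
p = 2 * stats.norm.sf(z)
print(f"z = {z:.2f}, p = {p:.3f}")   # z = 2.36, p = 0.018: "significant"

With enough data, even a $5 gap is statistically significant, but almost no one
would base policy on it.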
Conversely, sometimes a call to action is necessary despite no statistically proven
effect. My example is global warming. There is a lot of debate as to whether
global warming is caused by humans. The statistical evidence is mixed. Some
studies suggest yes, some studies suggest no. In general, evidence is leaning in
the direction of saying that it is caused by humans, but there is no absolute
consensus. Suppose the evidence says it's 85% likely that global warming is
caused by humans and 15% likely that it's caused by other sources. Is that
statistically significant? No, by the p < 0.05 criterion, we don't reject zero human-
made effect. Should we act on it? My opinion is yes (probably), since the
consequences of inaction are catastrophic. The practical implications are huge,
even if we are not statistically confident of the effect.
Another example might be an experimental medical treatment. Even though
some treatment is not proven to be effective ("proven" usually means that we
have done some statistical test to reject zero effect), we might want to go ahead
with the procedure if it offers large potential benefits and it is reasonably likely
to work. We don't need statistical proof to take action; we don't need statistical
significance to have practical significance.
2. Insignificant does not equal zero.
A common fallacy is to ignore effects that are insignificant. For example, a
researcher might estimate a coefficient of 3.88, with a 95% confidence interval
from -0.14 to 7.90. Since this effect is not significant, the researcher acts as if the
conclusion is that it has no effect.
Incorrect. First, we haven't proven zero effect. We simply have not ruled out zero
effect. As always, there are two possibilities when we do not reject a hypothesis.
One is that the hypothesis is true, that the variable indeed has no effect. The
other possibility is that the effect is nonzero, but we don't have solid evidence to
demonstrate this.
Second, if we had to come up with a single best guess of the effect, it would not
be zero. The point estimate of 3.88 remains our best guess, and having zero
inside the confidence interval does not change that.
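
Continuing the example in Python (the standard error is back-solved from the
confidence interval above, and beta_hat is just a label for the 3.88 estimate):

from scipy import stats

beta_hat = 3.88
se = (7.90 - (-0.14)) / (2 * 1.96)   # CI half-width / 1.96, about 2.05
t = beta_hat / se
p = 2 * stats.norm.sf(abs(t))
print(f"t = {t:.2f}, p = {p:.3f}")   # t = 1.89, p = 0.059: not significant
print(f"best guess of the effect: {beta_hat}")   # still 3.88, not 0

"Insignificant" here means only that the data are consistent with effects
anywhere from -0.14 to 7.90, zero included.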
3. Oomph.
McCloskey suggests a thought experiment. Suppose that I want to lose a
substantial amount of weight, and I ask my doctor for help. She tells me about
two pills that I could take. (Medication is much easier than changing lifestyle!)
Research shows that for patients taking medicine Meh, 95% experience weight
loss between 1 and 3 pounds. For patients taking medicine Oomph, 95%
experience weight loss between -5 and 45 pounds.
Which one should I select for substantial weight loss? This depends on my
preferences, of course, but it seems extremely likely that I would prefer Oomph.
The expected weight loss is much greater. In fact, only a small fraction of
patients (around 6%) would do worse with Oomph than with Meh. Unless I am
extremely risk averse, Oomph seems like the better choice.
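
Here is a rough Python check of those numbers, assuming (an assumption the
story does not state) that weight loss is normally distributed, with the 95%
ranges read as mean ± 1.96 standard deviations:

from math import sqrt
from scipy import stats

meh = stats.norm(loc=2, scale=(3 - 1) / (2 * 1.96))         # 95% in [1, 3]
oomph = stats.norm(loc=20, scale=(45 - (-5)) / (2 * 1.96))  # 95% in [-5, 45]

print(f"P(no loss at all on Oomph) = {oomph.cdf(0):.3f}")   # about 0.06

# chance a random Oomph patient loses less than a random Meh patient:
sd_diff = sqrt(meh.std() ** 2 + oomph.std() ** 2)
p_worse = stats.norm.cdf((meh.mean() - oomph.mean()) / sd_diff)
print(f"P(Oomph worse than Meh) = {p_worse:.3f}")           # about 0.08

Under these assumptions, the small fraction comes out in the 6-8% range,
depending on exactly which comparison you make.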
And yet, Oomph has not passed the test of statistical significance, since zero is
contained within the 95% confidence interval for the effects of Oomph. If we
judge medications using the p < 0.05 criterion, we would select Meh over
Oomph. For 94% of patients, this will result in worse outcomes. Do you realize
what a terrible decision this is, this decision based on statistical significance?
In fact, Oomph would probably not even make it to market. The Food and Drug
Administration places a lot of emphasis on "proven effectiveness." This means
showing a statistically significant effect, which Oomph does not. Here's a dirty
secret: whenever you hear someone talk about "proven effectiveness," you are
hearing the fallacy of statistical significance.
4. p < 0.05 as a policy-making criterion.
Many decisions, including in medicine and economics, are based on whether the
treatment or policy offers statistically significant improvement. In other words,
we enact the policy if the probability that it outperforms the alternative is more
than 95%. This virtually never leads to the optimal outcome. (Some
foreshadowing: 95% is completely arbitrary, isn't it? Why not 99%? Or 90%?)
Suppose a regulatory agency is deciding whether to approve a new drug. A
randomized controlled trial demonstrated that P[drug is effective] > 0.95; or, to
put it a different way, P[drug is ineffective] < 0.05. Good, it works! Approve it,
right? Patients should take it, right? (Incidentally, P[drug is ineffective] is
essentially the p-value from the hypothesis test.)
Well, maybe not. If it potentially has terrible side effects, maybe you should wait
until you're more certain that the benefits outweigh the costs. Unless it's really
life-saving; then you might want to adopt it even if the probability of being
effective is lower. Unless it's extremely expensive. And so forth.
Truthfully, a good decision is a cost-benefit analysis. There are no costs or
benefits in the p < 0.05 decision rule! As economists, we would model the
optimal decision as

E[net benefit of drug] > 0
P[effective] × (benefit) − (costs) − (side effects) > 0
P[effective] × (benefit) > (costs) + (side effects)
P[effective] > [(costs) + (side effects)] / (benefit)

That is an economically rational decision rule. Why should we arbitrarily use
P[effective] > 0.95 instead? The p < 0.05 rule is unlikely to come close to
making the right decisions.
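
As a sketch of the difference, here are the two decision rules side by side in
Python, with invented numbers for a cheap, safe, life-saving drug that is only
60% likely to work:

def approve_by_p_value(p_effective):
    # the p < 0.05 rule: approve only when P[ineffective] < 0.05
    return (1 - p_effective) < 0.05

def approve_by_net_benefit(p_effective, benefit, costs, side_effects):
    # the rule derived above: P[effective] > (costs + side effects)/benefit
    return p_effective > (costs + side_effects) / benefit

p_eff, benefit, costs, side = 0.60, 100.0, 5.0, 2.0
print(approve_by_p_value(p_eff))                            # False
print(approve_by_net_benefit(p_eff, benefit, costs, side))  # True: 0.60 > 0.07

The significance rule rejects a drug whose expected net benefit is large; the
cost-benefit rule approves it.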
But remember, p < 0.05 was never intended to guide decisions. Fisher
introduced it as a way of finding truth: knowledge for knowledge's sake. p <
0.05 is intended to tell us whether something is true, without any regard for the
size of the effect. And it certainly doesn't involve any value judgments (costs,
benefits, and moral values). Economists are experts in understanding values. Of
all people, we should be incorporating them into our policy decisions.
And that's the bottom line of the McCloskey critique. Statistical significance
doesn't tell us about practical significance, which can be defined as whether we
should do anything based on the information. Using p < 0.05 as a decision rule
is likely to lead to suboptimal outcomes most of the time.
http://en.wikipedia.org/wiki/McCloskey_critique
http://www.unc.edu/~swlt/lossfunction.pdf (about neglected costs/benefit analyses)
http://www.unc.edu/~swlt/mccloskeycult.pdf
http://www.unc.edu/~swlt/fishergenesis.pdf (about twins)
http://www.amazon.com/The-Cult-Statistical-Significance-Economics/dp/0472050079
http://en.wikipedia.org/wiki/Deirdre_McCloskey
