Deirdre McCloskey is an outspoken critic of significance testing in economics.
Significance testing focuses on whether something is true (for example, whether x has an effect on y), without regard for the magnitude of the relationship. Virtually all academic research in economics focuses primarily on statistical significance. I describe this as the p < 0.05 mentality. (McCloskey calls it "the cult of statistical significance.") Researchers estimate a model and report which explanatory variables have a significant effect. The magnitude of the effect is of secondary interest. Frequently, researchers do not discuss magnitude at all; usually, magnitude gets attention only if the effect is significant. Almost never do researchers discuss the magnitude of insignificant results (even if the effect is potentially large!).

Perhaps some historical perspective helps. One of Ronald Fisher's early applications of significance testing was answering the question of whether genetically identical twins existed. He specifically wanted to test whether more than 50% of twins were same-sex. Fisher was interested in seeking out truth. He wanted to know whether something was true, a yes-or-no condition. At times, this is an interesting question. We might want to know whether extraterrestrial intelligence exists, whether cold fusion is possible, or whether monetary policy affects the economy. This is seeking knowledge for the sake of knowledge.

Knowing whether something is true can guide research. If we know that genetically identical twins exist, then we will be able to do analyses that separate nature from nurture. (That was Fisher's intent.) But for this to have a practical application, we need to know how big the effect is. If identical twins are fairly frequent, we could gather a large enough sample to conduct research; if they are rare, it would be impractical. Magnitude matters in practical applications. Hence, McCloskey reminds us to think about practical significance, also called economic significance.
Let me discuss four variations on the McCloskey critique.

1. Statistical significance versus practical significance. What if economic theory predicts that purchasing power parity should equal 1? We collect data and estimate PPP to equal 0.99990, with a 95% confidence interval going from 0.99989 to 0.99991. We would reject the hypothesis that PPP equals 1. Should we toss out the theory entirely, as a complete failure, based on this tiny deviation? Should we reject any action that is optimal when PPP = 1? Strict significance testing says the theory is false and these actions are less than optimal. A wise person might instead say: the theory is almost perfect, except that we haven't figured out one tiny detail; these actions are almost optimal, off by only a smidgeon.

As economists, we know that the decision to take some action depends on costs and benefits. Statistical significance is a purely mathematical calculation; it introduces no values. Practical significance is defined as "does it have any applications?" or "would you make a decision based on it?" McCloskey would say that statistical significance is neither a necessary nor a sufficient condition for practical significance.

Something can have statistical significance without practical significance. My example is discrimination in earnings. We might see that there is a statistically significant difference between women's and men's earnings. If women earn 70% as much as men, most of us might agree that it calls for action; we would want to implement policies to remedy the inequality. What if women earn 90% as much as men? 99% as much? 99.99% as much, but it's statistically significant? At what point do we determine that the costs of taking action against the inequality are not worth the benefits? That's a value judgment. And that's what defines practical significance: the point at which the effect is big enough to be a call to action.
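The PPP example can be made concrete with a short sketch. This is an illustration only: the standard error below is a hypothetical value implied by the confidence interval in the example (half-width 0.00001 divided by 1.96), not an estimate from real data.

```python
from statistics import NormalDist

# Hypothetical numbers echoing the PPP example: a huge sample pins the
# estimate down very precisely, so even a trivial deviation is "significant."
estimate = 0.99990
se = 0.000005                        # assumed standard error (≈ 0.00001 / 1.96)

z = (estimate - 1.0) / se            # test H0: PPP = 1
p = 2 * NormalDist().cdf(-abs(z))    # two-sided p-value

print(f"z = {z:.1f}, p = {p:.2g}")               # far below 0.05: "reject" the theory
print(f"deviation from theory = {abs(estimate - 1.0):.5f}")  # yet tiny in magnitude
```

The test screams rejection, yet the estimated deviation from the theory is 0.0001, which is exactly the point: the p-value measures precision, not practical importance.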
A statistically significant difference might be so small that it has few practical implications. Conversely, sometimes a call to action is necessary despite no statistically proven effect. My example is global warming. There is a lot of debate as to whether global warming is caused by humans. The statistical evidence is mixed: some studies suggest yes, some suggest no. In general, the evidence leans toward human causation, but there is no absolute consensus. Suppose the evidence says it's 85% likely that global warming is caused by humans and 15% likely that it's caused by other sources. Is that statistically significant? No; by the p < 0.05 criterion, we don't reject zero human-made effect. Should we act on it? My opinion is yes (probably), since the consequences of inaction are catastrophic. The practical implications are huge, even if we are not statistically confident of the effect.

Another example might be an experimental medical treatment. Even though some treatment is not proven to be effective ("proven" usually means that we have done some statistical test to reject zero effect), we might want to go ahead with the procedure if it offers large potential benefits and it is reasonably likely to work. We don't need statistical proof to take action; we don't need statistical significance to have practical significance.

2. Insignificant does not equal zero. A common fallacy is to ignore effects that are insignificant. For example, a researcher might estimate that some
coefficient β̂ = 3.88, with a 95% confidence interval
from -0.14 to 7.90. Since this effect is not significant, the researcher acts as if the conclusion is that the variable has no effect. Incorrect. First, we haven't proven zero effect; we simply have not ruled out zero effect. As always, there are two possibilities when we do not reject a hypothesis. One is that the hypothesis is true, that the variable indeed has no effect. The other possibility is that the effect is nonzero, but we don't have solid evidence to demonstrate this. Second, if we had to come up with a single best guess of the effect, it would not be zero. The point estimate of 3.88 remains our best guess, and having zero inside the confidence interval does not change that.

3. Oomph. McCloskey suggests a thought experiment. Suppose that I want to lose a substantial amount of weight, and I ask my doctor for help. She tells me about two pills that I could take. (Medication is much easier than changing lifestyle!) Research shows that for patients taking medicine Meh, 95% experience weight loss between 1 and 3 pounds. For patients taking medicine Oomph, 95% experience weight loss between -5 and 45 pounds. Which one should I select for substantial weight loss? This depends on my preferences, of course, but it seems extremely likely that I would prefer Oomph. The expected weight loss is much greater. In fact, only a small fraction of patients (around 6%) would do worse with Oomph than with Meh. Unless I am extremely risk averse, Oomph seems like the better choice. And yet, Oomph has not passed the test of statistical significance, since zero is contained within the 95% confidence interval for the effects of Oomph. If we judge medications using the p < 0.05 criterion, we would select Meh over Oomph. For 94% of patients, this will result in worse outcomes. Do you realize what a terrible decision this is, this decision based on statistical significance? In fact, Oomph would probably not even make it to market.
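The arithmetic behind the thought experiment can be sketched in a few lines. Assuming each medicine's weight loss is normally distributed (my assumption; the 95% interval then equals mean ± 1.96 standard deviations, and the helper function below is mine):

```python
from statistics import NormalDist

# Back out each medicine's mean and standard deviation from its 95%
# interval, assuming weight loss is normally distributed:
# interval = mean ± 1.96 × sd.
def dist_from_ci(lo, hi):
    mean = (lo + hi) / 2
    sd = (hi - mean) / 1.96
    return NormalDist(mean, sd)

meh = dist_from_ci(1, 3)      # 95% of patients lose 1 to 3 pounds
oomph = dist_from_ci(-5, 45)  # 95% of patients lose -5 to 45 pounds

print(f"expected loss: Meh {meh.mean:.0f} lb, Oomph {oomph.mean:.0f} lb")
print(f"P[gain weight on Oomph] = {oomph.cdf(0):.3f}")  # roughly 0.06
```

Under this assumption, Oomph's expected weight loss is 20 pounds versus Meh's 2, and only about 6% of Oomph patients gain weight, even though Oomph's interval contains zero and therefore fails the significance test.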
The Food & Drug Administration places a lot of emphasis on proven effectiveness. This means showing a statistically significant effect, which Oomph does not. Here's a dirty secret: whenever you hear someone talk about "proven effectiveness," you are hearing the fallacy of statistical significance.

4. p < 0.05 as a policy-making criterion. Many decisions, including in medicine and economics, are based on whether the treatment or policy offers a statistically significant improvement. In other words, we enact the policy if the probability that it outperforms the alternative is more than 95%. This virtually never leads to the optimal outcome. (Some foreshadowing: 95% is completely arbitrary, isn't it? Why not 99%? Or 90%?) Suppose a regulatory agency is deciding whether to approve a new drug. A randomized controlled trial demonstrated that
P[drug is effective] > 0.95; or, to put it a different way,
P[drug is ineffective] < 0.05. Good, it works! Approve it, right? Patients should take it, right? (Incidentally,
P[drug is ineffective] is essentially the p-value from the hypothesis test.) Well, maybe not. If it potentially has terrible side effects, maybe you should wait until you're more certain that the benefits outweigh the costs. Unless it's really life-saving; then you might want to adopt it even if the probability of being effective is lower. Unless it's extremely expensive. And so forth. Truthfully, a good decision is a cost-benefit analysis. There are no costs or benefits in the p < 0.05 decision rule! As economists, we would model the optimal decision as something like: approve the drug if

P[effective] × Benefit > Cost.

That is an economically rational decision rule. Why should we arbitrarily use
P[effective] > 0.95 instead? The p < 0.05 rule is unlikely to come close to making the right decisions. But remember, p < 0.05 was never intended to guide decisions. Fisher introduced it as a way of finding truth, knowledge for knowledge's sake. p < 0.05 is intended to tell us whether something is true, without any regard for the size of the effect. And it certainly doesn't involve any value judgments (costs, benefits, and moral values). Economists are experts in understanding values. Of all people, we should be incorporating them into our policy decisions.

And that's the bottom line of the McCloskey critique. Statistical significance doesn't tell us about practical significance, which can be defined as whether we should do anything based on the information. Using p < 0.05 as a decision rule is likely to lead to non-optimal outcomes most of the time.

Further reading:
http://en.wikipedia.org/wiki/McCloskey_critique
http://www.unc.edu/~swlt/lossfunction.pdf (about neglected cost/benefit analyses)
http://www.unc.edu/~swlt/mccloskeycult.pdf
http://www.unc.edu/~swlt/fishergenesis.pdf (about twins)
http://www.amazon.com/The-Cult-Statistical-Significance-Economics/dp/0472050079
http://en.wikipedia.org/wiki/Deirdre_McCloskey
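As a coda to point 4, the two decision rules can be put side by side in code. Everything here is hypothetical: the function names are mine, and the benefit and cost figures are invented purely to show how the two rules can disagree.

```python
# Compare the p < 0.05 approval rule with a cost-benefit rule.
# All numbers are hypothetical, for illustration only.

def approve_by_pvalue(p_effective):
    # Approve only when the effect is "statistically significant."
    return p_effective > 0.95            # i.e., p-value < 0.05

def approve_by_cost_benefit(p_effective, benefit, cost):
    # Approve when the expected benefit exceeds the cost:
    # P[effective] × Benefit > Cost.
    return p_effective * benefit > cost

# A cheap, potentially life-saving drug that is 80% likely to work:
p_eff, benefit, cost = 0.80, 1_000_000, 5_000

print(approve_by_pvalue(p_eff))                       # False: fails p < 0.05
print(approve_by_cost_benefit(p_eff, benefit, cost))  # True: expected benefit >> cost
```

The significance rule rejects the drug because 0.80 < 0.95; the cost-benefit rule approves it because the expected benefit dwarfs the cost. Only the second rule ever looks at the stakes.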