
The Journal of Socio-Economics 33 (2004) 651–663

Statistical significance, path dependency, and the culture of journal publication
Morris Altman∗
Department of Economics, University of Saskatchewan, 9 Campus Drive,
Saskatoon, Saskatchewan, Canada S7N 5A5

Abstract

A brief introduction to the misuse and abuse of tests of statistical significance is presented. This
is followed by an analysis of why such an inappropriate, socially sub-optimal, and inefficient practice
can persist over time in the face of a multiplicity of competing peer-reviewed journals. It is argued
that this practice is path dependent and represents a market failure, often resulting in misleading
research findings and misguided public policy. This can only be corrected by changes to institutional
parameters related to publication.
© 2004 Elsevier Inc. All rights reserved.

Econ lit code: B400; C100; D700; L000

Keywords: Path dependency; Peer-reviewed journals; Institutional parameters; Statistical significance

1. Introduction

It is now well established that the use of tests of statistical significance as the key criterion
to establish the analytical importance of empirical results has been a mainstay across
disciplines for many decades. In spite of the severe criticism to which this practice has been
subject, there has been barely any improvement over time. Typically, little evidence apart from
statistical significance is provided to establish the importance of one’s empirical findings,
and if additional evidence is provided, such as coefficient size or intervals, such information
is not much discussed. This procedure unites empirical practitioners from economics,
ecology, sociology, psychology, and medicine. It unites empiricists across the political spectrum.

∗ Tel.: +1 306 966 5198; fax: +1 306 966 5232.


E-mail address: altman@sask.usask.ca.

1053-5357/$ – see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.socec.2004.09.037

Within disciplines it unites practitioners across the methodological divide as well. The
practice of statistical significance has produced a powerful consensus in use. Thus, the focus
of scholarly debate is rarely about sampling issues, the size of coefficients, coefficient
intervals, result replicability, or missing variables (possible by-products of misconstrued
theories), which speak to the analytical importance of one’s empirical results. Rather, the
focus is on whether or not the results are statistically significant. Needless to say, given
the nature of statistical significance, few are convinced by such a discourse. Both analytically
insignificant and significant results can be statistically significant, and statistically
insignificant results might nevertheless be suggestive of analytical significance. There is no
necessary causal relationship between statistical and analytical significance.
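
To illustrate the distinction, consider the following sketch, a hypothetical simulation of my own
construction rather than results drawn from any study cited here. With a sufficiently large sample,
a practically trivial coefficient is estimated as highly statistically significant, while a substantively
large coefficient estimated from a small, noisy sample may well fail a conventional significance test.

```python
# Illustrative simulation with hypothetical numbers; not data from any study cited in this paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ols_slope(x, y):
    """Return the OLS slope, its standard error, and the two-sided p-value."""
    n = len(x)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    sigma2 = resid @ resid / (n - 2)                      # residual variance
    se = np.sqrt(sigma2 / ((x - x.mean()) @ (x - x.mean())))
    t = slope / se
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return slope, se, p

# Case 1: a practically trivial effect (true slope 0.01) estimated from a huge sample.
x1 = rng.normal(size=1_000_000)
y1 = 0.01 * x1 + rng.normal(size=x1.size)
print("tiny effect,  n = 1,000,000:", ols_slope(x1, y1))  # p-value far below 0.05

# Case 2: a substantively large effect (true slope 0.8) estimated from a small, noisy sample.
x2 = rng.normal(size=10)
y2 = 0.8 * x2 + rng.normal(scale=2.0, size=x2.size)
print("large effect, n = 10       :", ols_slope(x2, y2))  # p-value will often exceed 0.05
```

Neither p-value, of course, answers the analytical question of whether a coefficient of the estimated
size matters for the scientific conversation at hand.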
What does this imply for scientific practice if, as so many leading statisticians agree, the
use of tests of statistical significance to determine analytical significance is flat out wrong
and an abuse of a technique which, at best, can only determine the probability that the results
generated from a random sample are a product of chance?1 How is it possible for sub-optimal
analytical techniques to persist over time? These questions must be addressed if one is to
determine policy which might provoke a shift away from tests of statistical significance when
one wishes to address the analytical importance of estimated coefficients. I argue that the
current practice of statistical significance represents a market failure in the sense that the market
for published and refereed articles has failed to drive out a substandard product: the use of
tests of statistical significance for the wrong (unscientific) reasons. Moreover, the persistence of
this sub-optimal practice is path dependent, a product of the type of market which exists
for journal articles and the economic and psychological costs of producing the product.
The current structure of incentives is such that one cannot expect that the current wrong
practices will be easily abandoned or significantly modified. We are locked into a path of
empirical practice which yields unscientific results with regards to analytical significance.
Scientific criticism and moral suasion have not worked. For the path to shift, the incentive
structure as well as constraints, especially informational constraints, must be changed. This
need not imply that the only hope for change lies in changing journals’ editorial policy.
Other methods are suggested which might tip the balance of forces towards focusing upon
analytical significance when this is indeed the question which one wishes to address.

2. There is a problem and it’s not getting better

The prevalence of the misuse of tests of statistical significance—its use to determine
analytical significance—is well documented across fields, although we focus our attention
upon economics and psychology. For example, Anderson, Burnham, and Thompson (2000)
document the problem in the journals Ecology and the Journal of Wildlife Management,
premier journals in ecological studies. Morrison and Henkel (1970) review the sociology

1 There are many published papers which discuss the scientific problem with current practice of using statistical
significance. Moreover, there exists an array of papers which discuss best practice methodology in the construction
of empirical papers which make use of statistical methodologies (Thompson, 2004). More specifically, see for
example, Carver (1978), Cohen (1994), McCloskey (1985a,b, 1992, 1995), Leamer (1983), Morrison and Henkel
(1970), Zellner (1984), Arrow et al. (1959), Granger et al. (1995), Thompson (1998). See also the contributions
to this special issue.

and psychology literature. Fidler et al. (2004) review the psychology and medical literature.
See also the contributions of Thompson and of Fidler, Cumming, Burgman, and Thomason
in this issue. The only major reforms with regards to the use of statistical significance tests
have been instituted in Psychology and Medicine.
In economics, McCloskey and Ziliak (1996, 2004) have documented the past and current
practice of statistical significance tests in the American Economic Review. They conclude
(2004), of the most recent period, during which time economists have been more exposed
to the substantive errors of using statistical significance tests for analytical purposes, that
of the 137 full-length papers published in the 1990s in the American Economic Review,
87 papers use only statistical significance as a criterion of analytical importance at first
use. They suggest that if this is what the prestigious American Economic Review is all
about, it is unlikely to be much better elsewhere. Zellner (1984) reveals similar results for
leading econometric journals for an earlier period. A less scientific perusal of the literature
as a journal editor would suggest that McCloskey and Ziliak are probably quite correct
and that statistical significance is the analytical mainstay even in economic history and
experimental economics, which rely heavily on samples of convenience, where such tests
have no place at all. The critique of the misuse of statistical significance has never been
very pervasive in economics, and that which exists has had little impact on any of the major
professional economics organizations (see Appendix B for the Editorial Policy statements of
two economics journals which demand that their authors focus on analytical significance).
In psychology, decades of criticism eventually resulted in the American Psychological
Association (APA) revising its publication manual in 1994, encouraging authors to go
beyond the use of statistical significance tests. This is of importance given that the APA is
responsible for 27 primary journals and, moreover, the APA manual is the primary publication
guide for no less than 1000 other journals in psychology, the behavioral sciences,
nursing, and personnel administration (Fidler, 2002). Also, “In the light of continuing debate
over the applications of significance testing in psychology journals and following the pub-
lication of Cohen’s (1994) article, the Board of Scientific Affairs (BSA) of the American
Psychological Association (APA) convened a committee called the Task Force on Statistical
Inference (TFSI) whose charge was ‘to elucidate some of the controversial issues surround-
ing applications of statistics including significance testing and its alternatives; alternative
underlying models and data transformation; and newer methods made possible by powerful
computers’ (BSA, personal communication, February 28, 1996)” (Wilkinson et al., 1999).
The findings of this Task Force were published in 1999 (Wilkinson et al., 1999)
and recommended further reforms, encouraging journal editors to be ever more receptive
to, and to encourage, other measures of significance, such as effect sizes and confidence intervals. The
Task Force backed off from going beyond recommending what is required for a quality piece of
empirical research (Wilkinson et al., 1999): “Some had hoped that this task force would
vote to recommend an outright ban on the use of significance tests in psychology journals.
Although this might eliminate some abuses, the committee thought that there were enough
counterexamples . . . to justify forbearance. Furthermore, the committee believed that the
problems raised in its charge went beyond the simple question of whether to ban signifi-
cance tests. The task force hopes instead that this report will induce editors, reviewers, and
authors to recognize practices that institutionalize the thoughtless application of statistical
methods.”

Although some might argue that the Task Force did not go far enough in its recommendations,
the subsequent revision to the APA Publication Manual went less far still. “The
manual has been criticized for not following through on its own recommendations to include
effect sizes, CIs [confidence intervals], and statistical power with examples of how to report
these measures. This perceived failure has been described as sending an overall message
of NHST [the null hypothesis statistical test], business as usual. Some reformers claim that
specific recommendations are so poorly integrated in the fifth edition that they are unlikely
to have any effect on practice” (Fidler, 2002, p. 752). Moreover (Fidler, 2002, p. 761), “The
heart of the issue, for most critics, is not that the manual did not ban NHST, or mandate
effect sizes, or prescribe any other particular methods. The heart of the criticisms is that the
decision not to provide explicit requirements seems to have also excluded presenting the
reasons for, and implications of, the recommendations. Following through with examples,
offering general advice on good practice, and providing explanations and education are
all things the manual could have conceivably done without taking an official position on
NHST.” Nevertheless, with all its problems the APA’s revised Publication Manual strongly
signals the utility of going beyond significance tests in empirical analyses.
Although the professional psychology community has gone furthest in confronting and
dealing with the misuse of statistical significance, and in spite of its various efforts at reform
and moral suasion, not much has changed in the realm of praxis, in the world of
publishing—the misuse of significance tests has never been banned or monitored, nor the
use of alternatives mandated. Currently, only 20 of well over 1000 psychology-related
journals actually mandate the reporting of effect sizes. However, of these, two are flagship
journals of associations (the American Counseling Association and the Council for Exceptional
Children), with over 55,000 members apiece (Appendix A; Fidler, 2002, p. 754). A
leading critic of the misuse of tests of statistical significance in psychology, Bruce Thomp-
son (1999; see also McLean and Ernest, 1998; Fidler et al., 2004; Sedlmeier and Gigerenzer,
1989), concludes that oft-reported misuse of statistical significance tests persists and has
remained unabated over time. The psychology literature’s empirics across journals remain
dominated by significance tests which are in effect used as tests of analytical significance.
Writing in the same vein DeVaney (2001, p. 311) concludes: “Despite continued criticism
and the availability of alternative or supplemental procedures, statistical significance test-
ing techniques continue to be commonplace in social science research and the reporting of
effect sizes as recommended by the APA [American Psychological Association] continues
to be neglected.”
As detailed by Fidler et al. (2004), there have been various attempts made in Medical
research to encourage the movement away from significance tests as the core determinant
of analytical significance since the late 1970s, even prior to the efforts made in Psychology,
beginning with the New England Journal of Medicine in 1977. In 1986, the British Medical
Journal implemented a policy encouraging the reporting of confidence intervals. And in
1988, The International Committee of Medical Journal Editors revised their manuscript
submission requirements for the biomedical journals to discourage the traditional focus of
statistical significance tests and redirect energies to alternative measures. More than 300
Medical and Biomedical journals agreed to comply with the revised guidelines. Nevertheless,
how effectively the new standard for excellence was applied was left to the Editors’ discretion,
and the recommendations amounted, in essence, to moral suasion.

Moral suasion has not been enough to displace the use of statistical significance tests as
the dominant determinant of analytical significance. Nevertheless, some changes for the
better with regards to the reporting of results have occurred. However, Fidler, Thomason,
Cumming, Finch and Lee (2004, p. 124) conclude that even when measures other than
statistical significance tests, such as confidence intervals, are reported, statistical significance
tests remained the bedrock of the analytical narrative. Confidence intervals, for example, became
much more noticeable on paper, but more as window dressing required or encouraged by
editors or editorial policy than anything else. Reporting relevant measures beyond statistical
significance tests is, of course, quite important, for it provides the reader with the means
to better interpret the results of a paper in a scientifically meaningful fashion. But this is
only the first step. And even this first step has not been taken in most disciplines; in
psychology and medicine, where some movement has been afoot, that first step appears
to be neither pervasive nor solidly grounded.
The problems with the application of statistical significance tests do not simply dominate
the universe of article publications; they also pervade the typical statistics textbook,
from which the students who go on to publish derive so many of their rules and norms
for excellence in applied research. McCloskey and Ziliak (1996, 2004) find that in
economics it is only the exceptional statistics textbook which discusses in any detail, and
places much importance on, the distinction between statistical significance tests and indicators
of analytical significance. In psychology and related social science disciplines, all
reviewed textbooks published from 1994 onward deal in some depth with statistical significance.
Where both statistical significance and effect sizes are discussed, statistical significance easily
overwhelms effect sizes in page count (Capraro and Capraro, 2002). The content of textbooks
in some sense reflects the importance which scholars place upon understanding and determining
analytical significance. It also informs the methodology adopted in applied work
done by students, professors, and non-academic researchers. Textbooks represent one important
information constraint faced by practitioners. They also reflect the standard culture
of empirical methodology of future and current practitioners.

3. Explaining the problem

The persistence of the misuse of statistical significance tests and the severe neglect of
measures of analytical significance, representativeness, and the like over the long run in the
face of strong critiques and even efforts at moral suasion to alter current practices raises
serious intellectual questions. How can such a set of bad scholarly practices persist in the
long run? What characterizes the ‘market’ for academic articles which would allow for the
persistence of such inefficient and sub-optimal practices? To the extent that such a market
exists, we appear to be experiencing a severe market failure where negative externalities are
being generated to public and private consumers of statistical research findings, findings
which are misleading, incorrect, or irrelevant. Does the solution to this possible market
failure lie with interventions of regulatory authorities outside of the disciplines or perhaps
within? Sedlmeier and Gigerenzer (1989, p. 315) argue that: “We believe there is only one
force that can effect a change, and that is the same force that helped to institutionalize null
hypothesis testing as the sine qua non for publication, namely, the editors of the major
journals.” What will push such editors to change their past practices? Why should they
change their preferences?
There have been various efforts to explain why we appear to be in a steady-state low-
level sub-optimal equilibrium with regards to empirical analyses. There are three sets of
informative commentaries which can contribute to a more general understanding of this
problem. Cumming and Fidler (2002) argue that to a large extent the current problems are a
product of poor information. The practitioners do not clearly understand what the problem is
and what alternatives exist, in concrete terms, to the standard statistical significance tests. But
this is far from a simple process. They write: “Reforming psychologists’ statistical practices
will require attitude change and the acquisition of new understandings and skills by those
teaching statistics in psychology, as well as by researchers, practitioners and students. It is,
therefore, even more challenging than overcoming students’ naïve statistics beliefs.”
Given that there exists a self-sustaining culture of statistical significance testing, the
question remains of how one triggers and maintains a process of cultural change. But
even if one begins to resolve the information problem, would this be the end of the story?
Ziliak and McCloskey (2004; see also McCloskey, 1996, pp. 45–46) argue that the
misuse of statistical significance tests and the neglect of analytical significance and other
such important issues are, in the words of William Kruskal, a former president of the American
Statistical Association, “I guess a cheap way to get marketable results.” This is a product of
the advent of rapid technological changes in both the hardware and software components
of computing statistical results. They continue (Ziliak and McCloskey, 2004): “Finding
statistical significance is simple, and publishing statistically significant coefficients survives
at least that market test.” Social scientists purchase the product which is cheapest, and it has
become increasingly cheap to compute statistical as opposed to analytical significance.
I would argue that the new computer technology cheaply produces various estimates
which help tackle issues of analytical significance, such as the size and variance of coefficients,
as well as issues of statistical significance. There is no evidence that the relative
costs of generating estimates related to statistical significance and estimates related to analytical
significance have altered as a result of technical change. However, an important component of the cost
of incorporating estimates related to analytical significance into one’s narrative is not significantly
affected by technical change. The process of discussing analytical significance,
sampling and representativeness, reproducibility, effect sizes, variability, and modeling, for example,
is something which computer technology cannot address. Computers cannot cut the
costs of thinking analytically about one’s model and one’s estimates, and the time costs of
such analysis are large, requiring significant prior investment in human capital as well as
current expenditures of time and effort. By avoiding these private costs, researchers can
increase their productivity, albeit at much lower quality and while possibly generating
some serious negative externalities.2
Thompson (1999) suggests that bad practice in empirical research persists for three
key reasons: (1) atavism, (2) “is/ought” logic fallacies, and (3) confusion or desperation.
Atavism speaks to the need of individuals to conform to standard practice or the fear of
deviating from the current norms of behavior. The “is/ought” logic fallacy suggests that any

2 See Walker and Smith (1993) who build upon Siegel (1961) on the importance of time costs in the decision
making process.

practice which survives for such a long period of time (it exists) must be the best practice
(ought to exist). This is analogous to the economics argument that market forces should
drive out inefficient players, at least in the long run. Confusion or desperation suggests that
practitioners might simply be confused as to what statistical significance tests are all about
and they desperately want to believe that statistical analysis yields the type of results which
they desire (analytical importance). The latter two are similar to or overlap with Cumming
and Fidler’s emphasis on poor information as the driving force underlying bad practice in
empirical work.
I would argue that the persistence of bad practice in empirical analyses can be best
framed in the context of path dependency and cultural embeddedness. The traditional path
dependency literature suggests that inefficient economic regimes can survive in the long
run because of a first-mover advantage which they would have over a more efficient system
which comes into play second (Arthur, 1989, 1990; David, 1985). This theoretical frame-
work has been severely criticized since a more efficient system should be able to dominate
the less efficient higher cost system in the long run (Liebowitz and Margolis, 1990, 1994).
However, the persistence of an inefficient regime can be explained when the inefficient
regime is protected from market forces or even in the absence of protection when there ex-
ists no competitive advantage in shifting to a more efficient regime (Altman, 2000). Cultural
embeddedness refers to the social context in which decision making is embedded. Thus,
if an individual finds herself or himself in a network wherein behaving in one fashion is most
highly regarded (utility augmenting) or behaving alternatively is negatively viewed (utility
reducing) the individual will tend to choose the former course of action. The individual
can choose to go against the flow if her or his preferences are such that utility is increased
in spite of the negative social repercussions of such behavior—doing the right or moral
thing in spite of social pressures. These thoughts relate to Becker’s concept of social capital
where social capital (1998, p. 4), “. . . incorporates the influence of past actions by peers
and others in an individual’s social network and control system.” Culture is that component
of social capital which changes at a relatively slow pace and is largely given to individuals
over their lifetimes. Individuals have some choice over their social capital and even over its
cultural component to the extent that they can choose the social network or control system
of which they are part (Becker, 1998, ch. 1).
There are clear supply and demand dimensions plus a moral dimension to the question
of why researchers persist in using statistical significance tests as indicators and proofs
of analytical significance, given the overwhelming scientific evidence that this approach
to empirical work is incorrect. As McCloskey (1996) and McCloskey and Ziliak (1996)
point out, it has become increasingly cheaper and more productive to focus upon statis-
tical significance tests when producing empirical articles. I would argue that this is true
especially when one incorporates the overall cognitive costs (including time and human
capital) into the equation. Although technological change has not cut the cost of
generating coefficients of statistical significance relative to estimates related to analytical
significance, the latter still go largely unreported. Given that computer technology has not
affected the cost of integrating estimates related to analytical significance into a comprehensive
narrative which speaks to this issue, one reason why the analytical significance
related estimates go unreported or underreported is that the relative cost of constructing a
paper based simply upon statistical significance related coefficients has fallen. Why report
empirical findings related to analytical significance when they are not part of one’s discourse?
It is eminently rational from the private agent’s perspective to increase the supply of non-
analytical empirical papers as the relative costs of so doing diminish. Moreover, given that
tenure and promotion require more and more publications, researchers have the incentive to
adopt the least cost high productivity strategy towards publishing. This strategy is linked to
the less time and thought intensive focus on statistical significance. Choosing the relatively
high cost strategy yields either higher cognitive costs or lower productivity. The opportunity
cost of the latter, especially when there is a significant upper-bound constraint on cognitive
costs—one reaches a maximum prior to achieving target productivity levels—could be the
failure to achieve tenure or promotion. The falling relative cost of producing empirical work
improperly yields a Gresham’s Law of journal publication wherein poor quality publications
drive out good quality publications. Unless there is a moral imperative forcing the
author to do otherwise, there is an unequivocal cost incentive for empirical economists to
focus upon statistical significance.
However, increasing the supply of statistical significance-based output requires that
there exists a market for a product which is of increasingly poor quality and which also
generates negative externalities. Poor quality is a function of the focus on statistical as
opposed to analytical significance and related questions. Externalities are generated by
results which can be false or misleading and which affect private and public decision making.
Obviously such a market does exist in abundance in journals across many disciplines and
appears to be limited only by the number of journals that publish empirical papers and the
number of pages allocated to these journals. With respect to academic publishing the demand
is perfectly inelastic at the market price of zero. Journals pay nothing or close to nothing for
published papers and the suppliers of articles ask for no financial remuneration in return for
their paper being published. There are minor variations to this rule as some journals request
that authors pay to have their papers reviewed or published, but authors receive no payment
for the finished product. There is a negative market price for the published piece from the
perspective of the authors. Nevertheless, the income effect of publishing is typically quite
high given the positive and substantive effect which publishing has upon tenure, promotion,
and levels and rates of changes in real income.3 There is no evidence that journal circulation
is negatively affected by the dominance of empirical articles which build upon statistical
as opposed to analytical significance. Such substandard output has not resulted in a fall in
demand and the demise of the distributors of this output. Thus, the demand by journals for
statistical significance-based papers has grown apace with output.
To the extent that producers know that they are producing shoddy and possibly damaging
products, there must not be any moral imperative constraining them from behaving
in this fashion; otherwise such producers might find it utility maximizing to produce an
empirical piece in a fashion consistent with scientific, as opposed to culturally appropriate,
practice. It might very well be possible for the moral agent to be as productive as the
less moral agent or the producer who is simply unaware that what she or he is doing is
scientifically inappropriate, but this would require more time and effort and more up-front

3 The price of the article is more complex than this since the academic authors’ tenure, promotion, pay level
and rate of pay increase are often tied to their publication rate. Thus, although journals pay nothing for articles,
authors receive compensation for publications through their universities.

and even long term investment in human capital—increasing productivity comes at a cost.
Nevertheless, even given the existing set of incentives one would expect there to be quality
empirical publications as the moral imperative of some would preclude them from adopting
the least cost method of producing empirical papers. There does appear to be a market for
both properly done and other types—statistically significance-based—papers (DeVaney,
2001). But the incentive exists to encourage the production of the latter. The percentage of
empirical papers which are scientifically constructed, thus, depends on the demand side,
the distribution of a moral imperative across the scholarly population, the relative costs of
producing scientifically constructed papers, and the distribution of knowledge as to what
constitutes a scientifically constructed empirical contribution.
It should be noted that if prospective authors perceive that publishing as well as tenure
and promotion requires that one adapt to current cultural statistical practices, they will
engage in the bad practices irrespective of cost and state of knowledge—this relates to
Thompson’s atavism. Falling costs will simply further increase the supply. In this case,
doing the right thing, it is perceived, comes at a very high cost which goes well beyond the
time costs of producing well done empirical pieces. If this perception about the requirements
for tenure and promotion is incorrect, approaching empirical work scientifically becomes
more a matter of trying to fit into a social network (cultural embeddedness) and the relative
cost of producing such work than being pressured into producing a substandard product by
the professional rules of the game.
It should also be noted that if practitioners believe that what they are doing is correct, cost
would not be the only major determinant of adopting bad statistical methods—ignorance
or imperfect information would also play a determining role. If the practitioner believes
that objectively bad and good practices are equally good (scientifically valid),
the falling costs of doing statistical significance focused research will increase the relative
supply of articles which build upon incorrect statistical methods. If such practitioners
become enlightened as to the errors of their ways, their behavior may still not change,
given the above constraints (cultural embeddedness and other costs), unless their moral
imperatives move them to do so. Thus, it is not clear that more and better information,
including educating practitioners, would fix the problem, although it could help. In
this scenario, the moral imperative might come into play, overwhelming relative cost
considerations.
As long as there is a market for the product one remains at a steady-state low-level
equilibrium with some moral exceptions to the rule. A steady-state low-level equilibrium
is a product of the fact that practitioners invest in current incorrect statistical practices and
have reputations which are based on what they produced in the past. Once one invests in
the culture of statistical significance, there are significant costs involved in reversing one’s
approach to producing empirical papers. There exists path dependency in the persistence of
incorrect statistical methodology in the construction of empirical papers. Moreover, to the
extent that doing the wrong thing—focusing upon statistical significance tests as opposed
to analytical significance—is the cultural norm, the social costs of deviating from the norm
could be large irrespective of past behavior. Thus, significant costs will be incurred if the
practitioner shifts methodology. Both past investments in the construction of empirical
papers and cultural embeddedness contribute towards locking the agent into a sub-optimal
path of statistical methodology.

The demand side of the equation is critical to overturning the current state of affairs.
A change in the demand side would shock the system into a higher scientific path of
statistical practice. However, editors, journal referees, and organizational leaders who play
a determining role on the demand side suffer from many of the same constraints faced by
the producers, of which they form a subset. They are characterized by many of the same
incentives faced by the producers of articles to preserve the current culture of statistical
significance. Even if editors are open to scientifically constructed empirical papers, if any
one referee has a veto over what is published, this increases the probability of accepting
statistical significance-based papers and rejecting papers which are not so based. Moreover,
the costs of enforcing the use of analytical significance can be quite high. These costs involve
the costs of screening out statistical significance-based papers and monitoring revisions. The
incentives (in terms of costs) favor preserving the status quo.
Events in Psychology clearly show that dissent in the rank-and-file and amongst leading
scholars can have a large impact on the demand side. In this case, pressure from below has
resulted in a movement to legislate encouragement of the use of correct statistical procedures.
Moreover, on the demand side one has public and private consumers of empirical research,
which include granting agencies, who are outside of the realm of the immediate academic
producers. These consumers are also not imbued with the constraints which characterize
these producers and which contribute towards maintaining bad statistical methods. If these
consumers change their preferences—it is not at all clear that they appreciate the poor quality
of the product which they purchase or otherwise fund—this could force a downward shift in
the demand function for the current empirical output which focuses on statistical as opposed
to analytical significance. To the extent that the demand function for statistical significance-type
papers is perfectly elastic, such preference change would serve to enhance the market for
papers which focus on analytical significance (a separate demand function) and to limit
the demand for papers based upon statistical significance. In this fashion, the provision of
better information on why statistical significance is not analytical significance can impact
indirectly on the preferences of agents who play a critical role on the demand-side, such as
editors and organizations and publishing houses which sponsor academic research.

4. Conclusion

The current incentive structure encourages the misuse and abuse of statistical significance
tests. This problem has been well documented across disciplines and has persisted over time
even in the face of severe scientific criticism. We are in a low-level equilibrium trap of sorts.
Neither market forces nor moral suasion have been able to force a way out of this dilemma.
Producers of empirical papers appear to be locked into unscientific statistical practices for
reasons of economic and psychological costs where the latter is affected by the current
culture of statistical significance. Nevertheless, potential solutions to this particular type
of market failure exist and are located on the demand side. Where change has occurred,
it was a product of persistent pressure for change amongst rank-and-file and key practi-
tioners. Moreover, where such pressure fails to effect change amongst journal editors and
organizations which sponsor academic research, changing the preferences of non-academic
consumers of empirical research can play a critical role in changing the incentive structure
in the empirical sciences towards a high quality scientific product. In the absence of demand
side changes, we are left with and locked into the steady state production of substandard
output with significant negative externalities to end users of empirical papers in both the
private and public spheres. Given that empirical results can affect decisions made in the
private and public spheres, thereby impacting on our state of wellbeing, the current practice
of statistical significance has important repercussions which go well beyond mere scientific
concerns, however important they might be.

Acknowledgement

The author thanks Louise Lamontagne for her comments and suggestions.

Appendix A. Psychology journals requiring size effect reporting (Thompson, 2003)

1. Career Development Quarterly
2. Contemporary Educational Psychology
3. Early Childhood Research Quarterly
4. Educational and Psychological Measurement
5. Exceptional Children
6. Journal of Agricultural Education
7. Journal of Applied Psychology
8. Journal of Community Psychology
9. Journal of Consulting & Clinical Psychology
10. Journal of Counseling and Development
11. Journal of Early Intervention
12. Journal of Educational and Psychological Consultation
13. Journal of Experimental Education
14. Journal of Learning Disabilities
15. Journal of Personality Assessment
16. Language Learning
17. Measurement and Evaluation in Counseling and Development
18. The Professional Educator
19. Reading and Writing
20. Research in the Schools

Appendix B. Economics journals with editorial policies on statistical significance

B.1. Journal of Socio-Economics (http://www.elsevier.com/homepage/sae/econworld/econbase/soceco/frame.htm)

The Journal of Socio-Economics welcomes submissions that are empirical in orientation.
However, authors should carefully distinguish in their analysis between the use of statistical
and substantive significance. We are most interested in the substantive or analytical
significance of estimated coefficients. As Deirdre McCloskey often asks, how big is your
coefficient in terms of the scientific conversation at hand? Statistical significance only provides
us with some information on the probability that coefficients estimated from a sample are
a matter of chance. It provides us with no information on the analytical importance of the
coefficient. With respect to samples, we are interested in how the sample is constructed and
the probable representativeness of the sample. When the population of a data set is used in
one’s analysis, tests of statistical significance provide us with no useful information. Overall,
please pay particular attention to the substantive or analytical significance of your statistical
analyses. For further information on this matter see D.N. McCloskey, “The Loss Function
Has Been Mislaid,” American Economic Review 75, 201–205, and D.N. McCloskey and S.T.
Ziliak, “The Standard Error of Regressions,” Journal of Economic Literature 34, 97–114.

B.2. Feminist Economics (http://www.ruf.rice.edu/femec/edpolicies.html)

Feminist Economics editorial policy requires that discussions of statistical results report
standard errors rather than t-statistics. This policy is to make it easier for readers to construct
confidence intervals. The policy accords with the view that statistical significance cannot
be interpreted without information on sample size.
The policy also requires that comments on statistical results address the economic im-
portance of results. Statistical significance should therefore be addressed only in the context
of sample size and the economic meaningfulness of a coefficient. For example, as is well
known, a coefficient can be statistically significantly different from zero but so close to zero
that the statistical significance may be of little relevance. Articles should therefore empha-
size the economic importance of variables in the context of confidence intervals rather than
statistical significance.
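
As an illustration of the rationale behind this policy, the sketch below, using hypothetical
numbers rather than figures from any study cited above, constructs an approximate 95% confidence
interval from a reported coefficient and its standard error. The interval makes the negligible
magnitude of a “statistically significant” coefficient immediately visible.

```python
# Illustrative sketch with hypothetical numbers; not taken from any study cited in this paper.

def confidence_interval(coef, se, z=1.96):
    """Approximate 95% confidence interval for a coefficient, given its standard error
    (large-sample normal approximation)."""
    return coef - z * se, coef + z * se

# Hypothetical example: a coefficient of 0.0004 with a standard error of 0.0001.
coef, se = 0.0004, 0.0001
t_stat = coef / se                      # t = 4.0: "significant" at any conventional level
low, high = confidence_interval(coef, se)
print(f"t = {t_stat:.1f}, 95% CI = [{low:.4f}, {high:.4f}]")
# The interval [0.0002, 0.0006] shows that, however statistically significant,
# an effect of this magnitude may be economically negligible.
```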

References

Altman, M., 2000. A behavioral model of path dependency: the economics of profitable inefficiency and market
failure. Journal of Socio-Economics 29, 127–145.
Arrow, K.J., et al., 1959. Decision theory and the choice of a level of significance for the t-test. In: Olkin, I. et al.
(Eds.), Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Stanford University
Press, Stanford, CA.
Anderson, D.R., Burnham, K.P., Thompson, W.L., 2000. Null hypothesis testing: problems, prevalence, and an
alternative. Journal of Wildlife Management 64, 912–923.
Arthur, B.W., 1989. Competing technologies, increasing returns, and lock-in by historical events. Economic Journal
99, 116–131.
Arthur, B.W., 1990. Positive feedbacks in the economy. Scientific American 204, 92–99.
Becker, G.S., 1998. Accounting for Tastes. Harvard University Press, Cambridge, Mass.
Carver, R.P., 1978. The case against statistical significance testing. Harvard Educational Review 48, 378–399.
Capraro, R.M., Capraro, M.M., 2002. Treatments of effect sizes and statistical significance tests in textbooks.
Educational and Psychological Measurement 62, 771–782.
Cohen, J., 1994. The earth is round (p < 0.05). American Psychologist 49, 997–1003.
Cumming, G., Fidler, F., 2002. The statistical re-education of psychology. ICOTS6.
David, P.A., 1985. Clio and the economics of QWERTY. American Economic Review 75, 332–337.
DeVaney, T.A., 2001. Statistical significance, effect size, and replication: what do the journals say? Journal of
Experimental Education 69, 310–320.
Fidler, F., 2002. The fifth edition of the APA Publication Manual: why its statistics recommendations are so
controversial. Educational and Psychological Measurement 62, 749–770.
Fidler, F., Thomason, N., Cumming, G., Finch, S., Lee, J., 2004. Editors can lead researchers to confidence intervals,
but can’t make them think: statistical reform lessons from medicine. Psychological Science 15, 119–126.
Granger, C.W.J., King, M.L., White, H., 1995. Comments on testing economic theories and the use of model
selection criteria. Journal of Econometrics 67, 173–187.
Leamer, E.E., 1983. Let’s take the con out of econometrics. American Economic Review 73, 31–43.
Liebowitz, S.J., Margolis, S.E., 1990. The fable of the keys. Journal of Law and Economics 33, 1–25.
Liebowitz, S.J., Margolis, S.E., 1994. Network externalities: an uncommon tragedy. Journal of Economic Per-
spectives 8, 133–150.
McCloskey, D., 1985a. The loss function has been mislaid: the rhetoric of significance tests. American Economic
Review 75, 201–205.
McCloskey, D., 1985b. The Rhetoric of Economics. University of Wisconsin Press, Madison.
McCloskey, D., 1992. The bankruptcy of statistical significance. Eastern Economic Journal 18, 359–361.
McCloskey, D., 1995. The insignificance of statistical significance. Scientific American 72, 32–33.
McCloskey, D.N., 1996. The Vices of Economists: The Virtues of the Bourgeoisie. University of Amsterdam
Press, Amsterdam.
McCloskey, D.N., Ziliak, S., 1996. The standard error of regressions. Journal of Economic Literature 34, 97–114.
McLean, J.E., Ernest, J.M., 1998. The role of statistical significance testing in educational research. Research in
the Schools 5, 15–22.
Morrison, D.E., Henkel, R.E., 1970. The Significance Test Controversy: A Reader. Aldine, Chicago.
Sedlmeier, P., Gigerenzer, G., 1989. Do studies of statistical power have an effect on the power of studies?
Psychological Bulletin 105, 309–316.
Siegel, S., 1961. Decision making and learning under varying conditions of reinforcement. Annals of the New
York Academy of Science 89, 766–783.
Thompson, B., 1998. Statistical significance and effect size reporting: portrait of a possible future. Research in
the Schools 5, 33–38.
Thompson, B., 1999. Why ‘encouraging’ effect size reporting is not working: the etiology of researcher resistance
to changing practices. The Journal of Psychology 133, 133–141.
Thompson, B. 2003. Psychology journals requiring size effect reporting. http://www.coe.tamu.edu/∼bthompson/
index.htm.
Thompson, W.L. 2004. 402 Citations Questioning the Indiscriminate Use of Null Hypothesis Significance Tests
in Observational Studies. http://biology.uark.edu/coop/Courses/thompson5.html.
Walker, J., Smith, V.L., 1993. Monetary rewards and decision cost in experimental economics. Economic Inquiry
31, 245–261.
Wilkinson, L., Task Force on Statistical Inference, APA Board of Scientific Affairs, 1999. Statistical methods
in psychology journals: guidelines and explanations. American Psychologist 54, 594–604.
Zellner, A., 1984. Basic Issues in Econometrics. University of Chicago Press, Chicago.
Ziliak, S.T., McCloskey, D.N., 2004. Size matters: the standard error of regressions in the American Economic
Review. Journal of Socio-Economics 33 (this issue).
