Comment on Significance Testing
BRUCE H. BISKIN

This article comments on, and extends, Vacha-Haase and Nilsson's (1998)
discussion of significance testing. The author recommends policies for re-
porting significance tests.

Vacha-Haase and Nilsson (1998) have added their voices to those who have
questioned the use of significance tests in behavioral science research. They
have laid out some elements of statistical significance, focused on improper
uses of significance testing, and summarized the recent use of significance
tests in Measurement and Evaluation in Counseling and Development, the jour-
nal of the Association for Assessment in Counseling. They concluded, as many
thoughtful critics have, that significance tests are often used inappropriately
in counseling research, resulting in a distorted view of research findings.
Although I differ with some statements in the article, I agree with Vacha-
Haase and Nilsson on their thrust. In my comments, I briefly mention the
wider historical context of the significance test controversy and I relate some
personal experiences I have had with statistical significance. I follow with com-
ments on the Vacha-Haase and Nilsson article, and finally have some com-
ments and recommendations for using significance tests.

HISTORICAL PERSPECTIVE

Vacha-Haase and Nilsson acknowledge that the debate on the use of signifi-
cance tests has been around for almost three-quarters of a century, but I be-
lieve they overstate historical facts when they suggest that Carver's 1978 ar-
ticle was "seminal." Though much of the early literature on significance test
use was published in statistical journals, articles criticizing significance tests
began appearing both in psychology and sociology journals as early as the late
1950s. So extensive was the criticism that much of it was compiled by Morrison
and Henkel (1970) 8 years before Carver's article appeared. More recent at-
tacks on significance testing (e.g., Cohen, 1994; Schmidt, 1996) extend, but
also echo, the arguments of many other thoughtful critics (cf. Bakan, 1967/
1970; Berkson, 1942/1970; Meehl, 1967/1970; Rozeboom, 1960/1970).

PERSONAL NOTES

My first contact with the significance testing controversy came in early 1973
during a guest lecture to a graduate psychometrics seminar given by David

Bruce H. Biskin is the senior psychometrician at the American Institute of Certified Public
Accountants, Jersey City, New Jersey. The views expressed are the author's and do not
represent official positions or policies of the American Institute of Certified Public Accoun-
tants. Correspondence regarding this article should be sent to Bruce H. Biskin, AICPA,
Harborside Financial Center, 201 Plaza III, Jersey City, NJ 07311 (e-mail: bbiskin@aicpa.org).

Campbell on the Strong Vocational Interest Blank (SVIB). I recall Dr. Campbell
declaring (somewhat proudly, I believe) to a group of psychometrics students
that no significance tests were used in developing SVIB Occupational Scales.
I remember my first reaction was thinking, "Hey, good joke!" and I recall my
reaction turning to puzzlement when I realized he was not being funny.
After a bit of thought, I realized two reasons for not bothering with signifi-
cance tests. First, because constructing Strong occupational scales required
both large samples and large item-response differences between general refer-
ence and criterion samples, statistical significance (at conventional levels) was
practically guaranteed. More important, but less obvious to me at the time,
tests of statistical significance were effectively moot because the reference groups
and occupational groups were not randomly drawn from the same population!
The main focus in SVIB occupational scale construction, therefore, was how
well the scales differentiated groups that were neither randomly sampled nor
drawn from the same population. Because Type I error-concluding the groups
were not from the same population when, in fact, they were-could not occur,
traditional significance tests were irrelevant. The lesson I learned was to think
critically about significance testing when evaluating a set of data.
I learned another lesson several years later. I was collaborating on an experi-
mental study in computer administration of the Minnesota Multiphasic Per-
sonality Inventory (MMPI), which included a replication. Statistical procedures
included multivariate analysis of variance with associated significance testing
(Biskin & Kolotkin, 1977). We randomly assigned participants to experimental
conditions. Essentially, profile differences between paper-and-pencil and com-
puter administration in both studies at the multivariate level were not signifi-
cantly different (p was greater than .50 in both experiments); however, two
clinical scales showed small, but consistent, differences in both studies (see
Endnote). This presented a dilemma: believe the replicated multivariate re-
sults, which suggested there were no differences, or trust the replicated
univariate results, which suggested that differences did exist on two MMPI
scales due to mode of administration.
The source of this dilemma is that the probability of Type I and Type II errors
is not cumulated across replications. That is, although researchers may ask,
"What is the probability of getting a difference as large as the one we got in this
study by chance?" we never seem to ask, "What is the probability of getting
differences as large as the ones we got in this replication and previous ones by
chance?" Significance testing as currently practiced is not a cumulative enter-
prise. Another lesson for me was the following: Significance tests may be ap-
propriate in the early stages of research, but replication is much more impor-
tant to the scientific enterprise. (By the way, the statistically significant MMPI
scale differences were small and considered to be clinically unimportant, so
the dilemma for us was short-lived. Next time might not be so easy.)
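
As an aside, one long-standing way to cumulate evidence across replications is Fisher's combined probability test. It is not part of the Biskin and Kolotkin study or of Vacha-Haase and Nilsson's article; the sketch below, with made-up p values, simply illustrates the idea.

```python
# Illustrative sketch only: Fisher's method for combining p values from
# independent replications (chi-square = -2 * sum(ln p), with 2k df).
# The p values below are hypothetical, not from the MMPI study.
import math
from scipy import stats

p_values = [0.06, 0.04]  # hypothetical p values from two replications
chi_sq = -2 * sum(math.log(p) for p in p_values)
combined_p = stats.chi2.sf(chi_sq, df=2 * len(p_values))
print(f"Combined p across replications: {combined_p:.3f}")
```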

COMMENTS ON THE VACHA-HAASE AND NILSSON ARTICLE

Vacha-Haase and Nilsson make a small error in stating that "a statistically signifi-
cant result indicates the maximum probability [emphasis added] of obtaining
the sample statistics assuming the sample came from a population in which
the null hypothesis was true" (p. 47). In fact, the p value expresses the prob-
ability of obtaining the observed result, or a more extreme one, when the null
hypothesis is true.
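
In symbols (notation added here for clarity; T denotes the test statistic and t_obs its observed value, assuming a two-tailed test):

$$p = \Pr\!\left(\,|T| \ge |t_{\mathrm{obs}}|\;\middle|\;H_0\ \text{is true}\right)$$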

In the same paragraph, Vacha-Haase and Nilsson make a more serious error
when they confuse sampling error and systematic error with random error
(which they call "chance"). Their quote about chance from Bergin and Garfield
is accurate, but their example about nonrandom convenience samples does
not follow: accurate significance testing requires randomization (random sam-
pling or assignment) to be interpretable. I have reviewed many manuscripts in
which the researchers apply significance tests to data that are based neither
on random sampling nor on random assignment. If sampling or assignment is
nonrandom, a large enough sample will usually result-correctly-in rejection
of the null hypothesis because the groups compared do not, in fact, belong to
the same population. In such a case, significance tests are irrelevant because
the null hypothesis must be false. When there is no randomization, Type I
errors are irrelevant and the researcher's prime consideration should be effect
size.
A related error, made by many others, is the notion that statistical significance
predominantly depends on sample size. As noted by Hagen (1997), this is true
only when the null hypothesis is, in fact, false, because larger samples result in
more power to correctly detect a false null hypothesis. However, when the null
hypothesis is true, sample size does not affect the probability of an incorrect
rejection. (See Hagen for an elegant discussion of this point.)
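
A small simulation sketch (added here for illustration; the code and parameters are hypothetical, not drawn from Hagen or from the article) shows the point: when the null hypothesis is true, the rejection rate hovers near the alpha level regardless of sample size.

```python
# Illustrative simulation: with a true null hypothesis, larger samples do not
# inflate the Type I error rate. Parameters are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_reps = 0.05, 2000

for n in (20, 200, 2000):
    # Both groups are drawn from the same population, so the null is true.
    rejections = sum(
        stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
        for _ in range(n_reps)
    )
    print(f"n = {n:4d}: rejection rate = {rejections / n_reps:.3f}")  # stays near .05
```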
Vacha-Haase and Nilsson make another related error, also made by many crit-
ics of significance testing, when they say "taken literally, the null hypothesis is
always false" (p. 48). Again, Hagen (1997) clarified this issue, which typically
results from the correct belief that two samples, especially if they are large, will
almost never be identical even if they are drawn from the same population.
However, tests of significance are intended to make inferences about popula-
tions, not samples. If the samples are, in fact, randomly selected or assigned
from a single population, the null hypothesis may be true. If the groups are not
randomly sampled or assigned, tests of significance are moot because the null
hypothesis is false, a priori.
One omission by Vacha-Haase and Nilsson, often ignored in the criticisms of
significance tests, is worth mentioning briefly. Significance tests are based on
statistical assumptions-for example, equal variances and normal distributions-
that behavioral science data rarely meet. As a result, the sensitivity of the se-
lected test statistic (e.g., t, F, or χ²) to violations of these assumptions, and to
combinations of violations, should be considered when interpreting the results
of significance tests. That is, the probability associated with the test statistic
will rarely be exact. How inexact depends on the robustness of the statistic to
the particular combination of violations. The more complicated the statistical
design, the more likely there are to be serious effects of those violations on the
accuracy of the probability associated with the value of the test statistic. There-
fore, researchers should investigate potential violations of assumptions under-
lying their significance tests and report any violations that could affect the in-
terpretation of the results. Researchers should also be cautious about putting
too much faith in the precision of the probability values spewed out by their
statistical programs when they compute their test statistics.
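
What such an investigation might look like in practice is sketched below; the data, the tests chosen (Levene's test for equal variances, Shapiro-Wilk for normality, Welch's t test as a fallback), and the thresholds are illustrative assumptions, not procedures prescribed by the article.

```python
# Illustrative sketch: examining assumptions before trusting a t test's p value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(50, 10, 40)
group_b = rng.normal(55, 20, 40)  # deliberately unequal spread

print("Levene (equal variances) p:", stats.levene(group_a, group_b).pvalue)
print("Shapiro-Wilk (normality) p:", stats.shapiro(group_a).pvalue,
      stats.shapiro(group_b).pvalue)

# If the variances look unequal, Welch's t test (equal_var=False) is more
# robust than the classical pooled-variance test.
print("Welch t test p:", stats.ttest_ind(group_a, group_b, equal_var=False).pvalue)
```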

SHOULD WE, AS RESEARCHERS, USE SIGNIFICANCE TESTS?

Some critics believe that significance testing should be discontinued. Others
suggest researchers should learn how to use significance tests better. Even if

we continue to use significance tests, can we realistically hope to learn to use
them appropriately? After all, we have been unable to stop abusing, misusing,
and misinterpreting them after 70 years of argument, protest, and warning.
I believe researchers should continue to use significance tests because they
can be a valuable part of well-designed, quantitative research. Under the best
circumstances, we can apply significance tests in single studies to make ten-
tative determinations about the likelihood of group differences or effects of
interventions. These determinations can help us decide whether, and how, to
plan later studies. If some creative statistician develops a practical way to test
the cumulative significance of a series of replications, researchers could apply
significance tests more usefully in programmatic research (but see Hagen, 1997,
for an example of how prior information might be used). Still, significance
tests become less useful as a body of research grows. As a research area ma-
tures, effect size estimates should have greater clinical, scientific, and practi-
cal interest than statistical significance. However, new questions are always
being raised, so null and alternative hypotheses are always being generated.
Tests of statistical significance can play a limited, but important, role in help-
ing us to make judgments about our research findings.
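
As a concrete illustration of reporting an effect size alongside the test of significance (a hypothetical sketch using Cohen's d, a common standardized mean difference that the article does not single out):

```python
# Illustrative sketch: report an effect size (Cohen's d) with the p value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
treatment = rng.normal(52, 10, 60)
control = rng.normal(50, 10, 60)

p_value = stats.ttest_ind(treatment, control).pvalue

# Cohen's d: mean difference divided by the pooled standard deviation.
n1, n2 = len(treatment), len(control)
pooled_var = ((n1 - 1) * treatment.var(ddof=1)
              + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
d = (treatment.mean() - control.mean()) / np.sqrt(pooled_var)

print(f"p = {p_value:.3f}, Cohen's d = {d:.2f}")
```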
When are significance tests not needed? I see three cases that probably cover
most situations. First, when the sampling design implies that the null hypoth-
esis is false, as when two groups are known to come from different popula-
tions, significance tests are moot (but so may be the study). Second, when
sample size is so large that differences small enough to reject the null hypoth-
esis are trivial, statistical significance is irrelevant. Third, when the body of
prior research and other knowledge, such as a meta-analysis suggesting a
moderate effect size, convinces us, a priori, that the null hypothesis is false,
significance testing may be unnecessary. On this last point, however, I would
err on the side of reporting significance tests if many readers are likely to be
unconvinced.

RECOMMENDATIONS

While awaiting the outcome of the American Psychological Association Task
Force on Statistical Inference's deliberations, I share my personal recommen-
dations about significance testing. These recommendations are for research-
ers, reviewers, editors, and, of course, consumers of research. The main con-
sideration underlying these recommendations is that all parties involved
in the research enterprise-including those involved in measurement and evalu-
ation in counseling and development-should focus on using good judgment
in producing and consuming published research.

1. The rationale for reporting-or not reporting-significance tests should be
implicit in the research design or else made explicit by the researchers.
2. Significance tests need not be reported when the sampling plan results in
a false null condition.
3. When sample size is so large that trivial differences are statistically sig-
nificant, significance tests need not be reported.
4. If possible, significance tests in replications should be cumulative, incorpo-
rating the results of the replicated studies.
5. When a body of knowledge has grown sufficiently large to obtain a consen-
sus that the null hypothesis is false, significance tests are unnecessary.

6. When significance tests are reported, violations of the statistical assumptions
underlying the tests should be evaluated by the researcher and reported
either in the manuscript or in a letter to the editor. In the latter case, the
evaluation should be made available to interested readers.
7. Whenever statistical assumptions of test statistics are violated, the asso-
ciated probability values should be evaluated skeptically.

REFERENCES
Bakan, D. (1970). On method. In D. E. Morrison & R. E. Henkel (Eds.), The significance
test controversy (pp. 231-251). Chicago: Aldine. (Reprinted from On method, pp. 1-
29, by D. Bakan, 1967, San Francisco: Jossey-Bass)
Berkson, J. (1970). Tests of significance considered as evidence. In D. E. Morrison & R.
E. Henkel (Eds.), The significance test controversy (pp. 285-294). Chicago: Aldine.
(Reprinted from Journal of the American Statistical Association, 37, 325-335, 1942)
Biskin, B. H., & Kolotkin, R. L. (1977). Effects of computer administration on scores on
the Minnesota Multiphasic Personality Inventory. Applied Psychological Measurement,
1, 543-549.
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educa-
tional Review, 48, 378-399.
Cohen, J. (1994). The earth is round (p &lt; .05). American Psychologist, 49, 997-1003.
Hagen, R. L. (1997). In praise of the null hypothesis statistical test. American Psycholo-
gist, 52, 15-24.
Meehl, P. E. (1970). Theory testing in psychology and physics: A methodological para-
dox. In D. E. Morrison & R. E. Henkel (Eds.), The significance test controversy (pp.
252-266). Chicago: Aldine. (Reprinted from Philosophy of Science, 34, 103-115, 1967)
Morrison, D. E., & Henkel, R. E. (1970). The significance test controversy. Chicago: Aldine.
Rozeboom, W. W. (1970). The fallacy of the null hypothesis significance test. In D. E.
Morrison & R. E. Henkel (Eds.), The significance test controversy (pp. 216-230). Chicago:
Aldine. (Reprinted from Psychological Bulletin, 57, 416-428, 1960)
Schmidt. F. L. (1996). Statistical significance testing and cumulative knowledge in psy-
chology: Implications for training researchers. Psychological Methods, 1, 115-129.
Vacha-Haase, T., & Nilsson, J. E. (1998). Statistical significance reporting: Current trends
and uses in MECD. Measurement and Evaluation in Counseling and Development,
31, 46-57.

ENDNOTE
Parenthetically, we argued in the manuscript that p = .10 should be used as
the criterion for statistical significance because we saw the study as explor-
atory and we were more willing to make Type I than Type II errors. Because we
hoped to find no differences between groups, lack of statistical significance
was desirable and we wished to "stack the deck" against ourselves. Appar-
ently, even in the 1970s, the editor and reviewers accepted the logical develop-
ment of our argument because the study was published, no questions asked!

