
Psychological Reports, 1962, 11, 639-645.

© Southern Universities Press 1962

THE DIFFERENCE BETWEEN STATISTICAL HYPOTHESES AND SCIENTIFIC HYPOTHESES


ROBERT C. BOLLES
Hollins College

When a professional statistician runs a statistical test he is usually concerned only with the mathematical properties of certain sets of numbers, but when a scientist runs a statistical test he is usually trying to understand some natural phenomenon. The hypotheses the statistician tests exist in a world of black and white, where the alternatives are clear, simple, and few in number, whereas the scientist works in a vast gray area in which the alternative hypotheses are often confusing, complex, and limited in number only by the scientist's ingenuity.

The present paper is concerned with just one feature of this distinction, namely, that when a statistician rejects the null hypothesis at a certain level of confidence, say .05, he may then be fairly well assured (p = .95) that the alternative statistical hypothesis is correct. However, when a scientist runs the same test, using the same numbers, rejecting the same null hypothesis, he cannot in general conclude with p = .95 that his scientific hypothesis is correct. In assessing the probability of his hypothesis he is also obliged to consider the probability that the statistical model he assumed for purposes of the test is really applicable. The statistician can say "if the distribution is normal," or "if we assume the parent population is distributed exponentially." These ifs cost the statistician nothing, but they can prove to be quite a burden on the poor E whose numbers represent controlled observations, not just symbols written on paper.

The scientist also has the burden of judging whether his hypothesis has a greater probability of being correct than other hypotheses that could also explain his data. The statistician is confronted with just two hypotheses, and the decision which he makes is only between these two. Suppose he has two samples and is concerned with whether the two means differ. The observed difference can be attributed either to random variation (the null hypothesis) or to the alternative hypothesis that the samples have been drawn from two populations with different means. Ordinarily these two alternatives exhaust the statistician's universe. The scientist, on the other hand, being ultimately concerned with the nature of natural phenomena, has only started his work when he rejects the null hypothesis.

An example may help to illustrate these two points. Consider the following situation. Two groups of rats are tested for water consumption after one, the experimental group, has been subjected to a particular treatment. Suppose the collected data should appear as shown in Table 1.¹ After the data are collected, E can pretend he is a statistician for a while and say to himself, "Let's assume the populations are normal and try a t test." The t statistic is encouragingly large but not large enough (which is just as well
TABLE 1

                   N    0 cc.   1 cc.   2 cc.   3 cc.   4 cc.   ...   12 cc.
Control Ss        20     15       3       2       0       0     ...      0
Experimental Ss   20     12       3       0       0       4     ...      1

because of the difficulty that would arise in attempting to justify the assumption of normality). Several transformations of the data are tried, but they don't help. Our E recognizes that he needs another statistical model. Perhaps a nonparametric test would work, one which is not sensitive to the great skewness of the data, and one which makes no assumption about the underlying distribution. A Mann-Whitney test is discouraging. A chi-square test (tried even though the expected frequencies in the important cells at the tails of the distribution are really too small) does not even approach a significant value. By this time E, weary of being an amateur statistician, consults a professional one who tells him that, if we can assume the populations have exponential distributions, then we can use Festinger's test (1943). This works (F = 4.83). In due time E publishes a report in which a highly significant (p < .01) increase in water consumption is attributed to the experimental treatment.

Inspection of the table, however, should lead to some skepticism. If our E has actually discovered anything about nature, he has found that (1) most animals under these test conditions don't drink more than 2 cc., and (2) his experimental treatment may make a few animals drink a good deal more than normal, 4 cc. or more.

It is necessary to digress a moment in order to notice that throughout this whole scientific episode our E's behavior has been above reproach. Even when he was acting like a statistician he wasn't just hunting for a test that would work; he was also searching for a test, and a model, that would fit his particular problem. The t test will not work, not just because it is inappropriate from a statistical point of view, but also because it tests the wrong thing. The t test tests whether the means are different. The Mann-Whitney test, for another example, is appropriate in the statistical sense, but it too is primarily sensitive to differences in means, and so is likely to pick up only the first phenomenon our E has discovered (that most Ss drink very little) and not the second (that some Ss respond to treatment). E should use a test which is sensitive to his special problem such as the Kolmogorov-Smirnov test which is highly sensitive to differences in the shapes of distributions, or a test specifically for the difference in skewness between the two distributions.²

¹The problem, the data, and the claim of significance are those of Siegel and Siegel (1949); what follows is my construction.

Enough statistical digression. The point about which our E, as a scientist, should be concerned is the discrepancy between the high level of statistical significance obtained (p < .01) and the lingering doubts he must have whether there may be some explanation for his data other than the experimental treatment (subjective p = ?). There are two very good but often ignored reasons for the discrepancy, as I suggested above.

One source of doubt is that the probability of correctness of the scientist's hypothesis depends not only upon the probability of rejecting the null hypothesis, but also upon the probability that the statistical model is appropriate. Now, with the non-parametric tests there is little problem here, and in fact, that is their great virtue.³ Our E with the thirsty rats wisely eschewed the t test, and he probably would have even if it had yielded significance, not because it was "wrong," but because it would not have given him any assurance that his scientific hypothesis had been confirmed. But what, we may ask, is the probability that the populations which his two groups represent are actually distributed exponentially? I must say they don't look exponential. The samples are much too small to give us any assurance that the model might be appropriate. Let us be generous and say that the model has a .50 chance of being applicable. What becomes of E's claimed high significance level?

Poor E has a more serious matter to worry about, a more vexing source of doubt.
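Before turning to that second source of doubt, the first one can be put in rough numbers. The line below is only a sketch of the informal arithmetic the question invites, not a calculation the paper performs. It assumes that the two figures may be treated as independent probabilities and multiplied, and it reads 1 − α as the assurance conveyed by the test, as the opening paragraphs do.

```latex
\[
\underbrace{P(\text{model applicable})}_{.50}
\;\times\;
\underbrace{(1 - \alpha)}_{.99}
\;=\; .495 \;\approx\; .50
\]
```

On this reading, the assurance E can draw from the test drops from .99 to about .50 before any rival explanation of the data has even been considered.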
His whole case hinges on the performance of one animal, the one that drank 12 cc. The remaining 39 Ss don't give him a thing, with any test. Now suppose that the true state of affairs is this: The experimental treatment really has no appreciable effect upon water consumption. But let us suppose, however, that there occurs, every once in a while, a bubble in the animal's home-cage drinking tube, which prevents normal drinking. If this were the case, and if bubbles occur, say, 2½% of the time, then about 50% of the time a bubble S will appear in the experimental group, so E will get just the results he got about 50% of the time.⁴ Moreover, if he continues to use the Festinger test, we can expect him half the time to get highly significant differences in support of his hypothesis, whatever his hypothesis may be!

The problem here, basically, is that statistical rejection of the null hypothesis tells the scientist only what he was already quite sure of: the animals are not behaving randomly. The fact that the null hypothesis can be rejected with a p of .99 does not give E an assurance of .99 that his particular hypothesis is true, but only that some alternative to the null hypothesis is true. He may not like the bubble hypothesis because it is ad hoc. But that is quite irrelevant. What is crucial is that the bubble hypothesis, or some other hypothesis, may be more probable than his own. The final confidence he can have in his scientific hypothesis is not dependent upon statistical significance levels; it is ultimately determined by his ability to reject alternatives.

Consider another illustration. Suppose we are interested in whether a certain stimulus will have reinforcing power for a certain group of 20 animals. After pretraining on a straightaway, we run them 15 trials in a T-maze which has the stimulus in question on one side and not on the other. We collect our data and graph them in the hope of seeing a typical learning curve. But what we find is Fig. 1. All is not lost, though, apparently, because it is still possible to conclude from the data that the stimulus did have a reinforcing effect, and that the associated p value is less than .002!⁵

The first thing to look for is whether there is a rising trend in the points of Fig. 1. It turns out that the best fit line does rise, but that an F test for the significance of the trend shows it to be less than would be expected on the basis of the day-to-day variation. To find a significant difference anywhere here, we have to ignore the data of Fig. 1 and turn our attention to the number of responses made by each S to the "reinforced" side, during its 15 trials. The 20 such scores have a mean of 8.8 which proves to be highly significantly different from the expected value of 7.5 (p = .002). (The learning curve, correspondingly, runs along at about 59% instead of 50%.) What this significance test tells us is that the animals probably weren't running randomly.⁶ But it is a long way from that to the inference that learning has occurred because of the special stimulus.

²Or he may replicate the study to get a larger n so that other tests will have more power. This is probably the best approach, especially if he varies the experimental conditions in order to find out how to get better control over the rare event.

³The power of the Mann-Whitney, for example, is usually cited as approaching .95 that of the t test, in those conditions where the latter is appropriate. Considering that there is usually at least a .05 chance that any given set of data does not come from a normal population, the Mann-Whitney emerges as perhaps the more powerful test for testing scientific (as against statistical) hypotheses. Moreover, its loss of statistical power with small samples may be more than offset by the scientist's gain in assurance that the underlying model is appropriate.

⁴The other half of the time the experiment will be disastrous for E's hypothesis, however. He had better just do the study once; that way he has at least a .50 chance of getting high significance in the favorable direction, and no more than a .50 chance of discovering that the real world is full of bubbles.

⁵The problem, the data, and the conclusions are D'Amato's (1955). Note that performance on the first trial was at 50%. This was pre-arranged by putting the reinforcer on both sides for half the Ss and on neither side for the other half.

⁶No one who has ever run animals would seriously consider that they might run randomly. What the null hypothesis implies in empirical terms is that the different animals were doing different things at different times so that the total set of scores looks as if it were random.
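The bubble scenario from the water-intake example can be checked with a few lines of arithmetic. The sketch below is not from the paper: it assumes that bubbles strike animals independently, takes the 2½% rate from the text, and uses group sizes of 20 and 40 as illustrative values (the text implies 40 Ss in all). The function names are hypothetical.

```python
def prob_at_least_one(n_animals: int, rate: float) -> float:
    """Chance that one or more of n independent animals meets the rare event."""
    return 1.0 - (1.0 - rate) ** n_animals


def expected_count(n_animals: int, rate: float) -> float:
    """Average number of affected animals in a group of n."""
    return n_animals * rate


if __name__ == "__main__":
    BUBBLE_RATE = 0.025  # the 2.5% figure used in the text
    for n in (20, 40):   # assumed sizes: one group, then both groups together
        print(
            f"n={n}: expected bubble animals = {expected_count(n, BUBBLE_RATE):.2f}, "
            f"P(at least one) = {prob_at_least_one(n, BUBBLE_RATE):.3f}"
        )
```

Under these assumptions a group of 20 averages half a bubble animal per experiment, and the chance of at least one is roughly .40 for 20 animals and .64 for 40, the same order as the round "about 50%" in the text. The exact figure matters less than the point it supports: a rare event unrelated to the treatment can reproduce the critical observation in a sizable share of experiments.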

FIG. 1. The performance of rats alleged to have learned to go to one side of a T-maze (horizontal axis: trials)

One hypothesis with a high a priori probability is that most of the animals gave scores that lay quite close to 7.5 but that one or two animals, with strong unlearned position habits, continued going to the side to which they had gone on the first trial. The probability that in the sampling process, just those one or two animals with strong position habits should be placed under the same condition, and under the particular condition that the "reinforcer" was on their preferred side, is fairly small, but still a great deal larger than the reported significance level. This hypothesis can be ruled out, however, by certain features of the data and not on grounds of its a priori probability. We can deduce from the small SD (1.47) that few of the animals could have had position habits of appreciable strength. In fact, we can deduce (with a little effort) from the size of the SD that no more than one S could have had an extreme position habit, and that even this was not actually the case, since one S could not have moved the mean from 7.5 to 8.8. Hence, we must conclude that the distribution represents a tightly bunched set of scores, whose mean is indeed significantly larger than 7.5.

But this suggests another hypothesis, which does account for the data, and which also has a high a priori probability. The hypothesis is simply that most animals have slight position habits. According to this hypothesis, any particular animal could be expected to go to one side 8 or 9 or 10 times out of 15. (We have already noted the high significance level indicating how consistent this is.) The setting of the performance level at 50% on the first trial is a red herring; it does not set the expected percentage correct at 50% (that would be true in any case before it was known which was the preferred side for a given animal). Now that the data are in, we can see (according to this hypothesis) that the preferred side happened to be predominantly on the same side for which S was "reinforced."

To assess the probability that a significantly high proportion of the animals had the "reinforcement" on their preferred side (which would be good evidence that the reinforcement was effective), we must go back to the data of Fig. 1. Performance over the last 14 trials was at 59%, while the performance on Trial 1 was 50%. The question is whether E's performance on Trial 1, when he was selecting the side to put the critical stimulus, is significantly different from the 59% baseline for the animal's performance?⁷ The answer is that it is well within the trial by trial variation. There was that much or more variation from the mean on 6 of the trials.

So, what it comes down to is that the animals did show slight but consistent preferences for one side or the other; the p figure of .002 shows this. The important question is whether these slight and consistent preferences are due to a slight but consistent effect of the experimental treatment, or whether they would have occurred without the treatment and E was just a little unlucky in trying to counterbalance them. Which is the more probable?

⁷Or put another way, what is the probability that 20 animals, all with slight position habits, will distribute themselves on one particular trial so that the group will depart 9 percentage points from its mean value?

The point of this message is not that it is futile to do experiments (although it might be wise to be cautious of some of the statistician's favorite designs). Rather, the emphasis should be upon the distinction between why scientists run statistical tests and why statisticians do it. The former run tests for the same reason they run experiments, in the attempt to understand natural physical phenomena. The latter do it in the attempt to understand mathematical phenomena. The scientist gains his understanding through the rejection or confirmation of scientific hypotheses, but this depends upon much more than merely rejecting or failing to reject the null hypothesis. It depends partly upon the confirmation from other investigators (e.g., Amsel & Maltzman, 1950; Wike & Casey, 1954), particularly as the experimental conditions are varied (e.g., Siegel & Brantley, 1951; Amsel & Cole, 1953). Confirmation of scientific hypotheses also depends in part upon whether they can be incorporated into a larger theoretical framework (e.g., Hull, 1943). Final confirmation of scientific hypotheses and the larger theories they support depends upon whether they can stand the test of time.

These processes have to move slowly. As Bakan (1953) has observed, the development of a scientific idea is gradual, like learning itself; its probability of being correct increases gradually from one experimental verification to the next, as response probability increases from one trial to the next. The effect of any single experimental verification is not to confirm a scientific hypothesis but only to make its a posteriori probability a little higher than its a priori probability. Our present day over-reliance upon statistical hypothesis testing is apt to obscure this feature of the scientific enterprise. We have almost come to believe that an assertion about the nature of the empirical world can be validated (at least with a probability level such as .95 or .99) in one stroke if the data demonstrate statistical significance. Is it any wonder then that our use of statistical hypothesis testing is rapidly passing from routine to ritual?
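Bakan's point that confidence accrues gradually can be illustrated with Bayes' rule applied once per replication. This is a minimal sketch, not anything the paper computes: the starting probability (.20) and the two likelihoods (.60 and .40) are invented for illustration, and `update` and `run_replications` are hypothetical names.

```python
def update(prior: float, p_data_if_true: float, p_data_if_false: float) -> float:
    """Bayes' rule: probability of the hypothesis after one confirming result."""
    numerator = prior * p_data_if_true
    return numerator / (numerator + (1.0 - prior) * p_data_if_false)


def run_replications(prior: float, p_data_if_true: float,
                     p_data_if_false: float, n: int) -> list[float]:
    """Apply the same update n times; return the probability after each step."""
    history = [prior]
    for _ in range(n):
        history.append(update(history[-1], p_data_if_true, p_data_if_false))
    return history


if __name__ == "__main__":
    # All three numbers are assumptions chosen only to show the shape of the curve.
    for k, p in enumerate(run_replications(0.20, 0.60, 0.40, 5)):
        print(f"after {k} confirmations: {p:.3f}")
```

With these made-up numbers the probability climbs from .20 to about .27 after one confirmation and passes .50 only after the fourth, so no single result settles the matter.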

SUMMARY
One of the chief differences between the hypotheses of the statistician and those of the scientist is that, when the statistician has rejected the null hypothesis, his job is virtually finished. The scientist, however, has only just begun his task. He must also be able to show that the statistical model underlying the test is applicable to his empirical situation because whatever significance level he obtained for the test, his confidence in his scientific hypothesis must be reduced below that by any lack of confidence in the model. Furthermore, confidence in his scientific hypothesis is reduced by the plausibility of alternative hypotheses. Hence the scientist's ultimate confidence in his hypothesis may be far lower than the significance level he can report.

REFERENCES
AMSEL, A., & COLE, K. F. Generalization of fear motivated interference with water intake. J. exp. Psychol., 1953, 46, 243-247.
AMSEL, A., & MALTZMAN, I. The effect upon generalized drive strength of emotionality as inferred from the level of consummatory response. J. exp. Psychol., 1950, 40, 563-569.
BAKAN, D. Learning and the principle of inverse probability. Psychol. Rev., 1953, 60, 360-370.
D'AMATO, M. R. Transfer of secondary reinforcement across the hunger and thirst drives. J. exp. Psychol., 1955, 49, 352-356.
FESTINGER, L. An exact test of significance for means of samples drawn from populations with an exponential frequency distribution. Psychometrika, 1943, 8, 153-160.
HULL, C. L. Principles of behavior. New York: Appleton-Century, 1943.
SIEGEL, P. S., & BRANTLEY, J. J. The relationship of emotionality to the consummatory response of eating. J. exp. Psychol., 1951, 42, 304-306.
SIEGEL, P. S., & SIEGEL, H. S. The effect of emotionality on the water intake of the rat. J. comp. physiol. Psychol., 1949, 42, 12-16.
WIKE, E. L., & CASEY, A. The secondary reinforcing value of food for thirsty animals. J. comp. physiol. Psychol., 1954, 47, 240-243.

Accepted September 25, 1962.