
JOURNAL OF APPLIED BEHAVIOR ANALYSIS 1974, 7, 647-653 NUMBER 4 (WINTER 1974)


STATISTICAL INFERENCE FOR INDIVIDUAL ORGANISM
RESEARCH: MIXED BLESSING OR CURSE?1
JACK MICHAEL
WESTERN MICHIGAN UNIVERSITY

Descriptive and inferential statistics are described as judgemental aids, stimuli to which
the scientist can more easily react than to his raw experimental results. The increasing
emphasis on the significance test as the main judgemental aid utilized in experimental
psychology is credited with several harmful effects on experimental practice. The area
known as "the experimental analysis of behavior" has so far escaped most of these harmful
effects, but now we see an increased interest in the development of appropriate significance
tests for individual organism research. This interest is based on the view that it is not
possible to effect adequate levels of experimental control with much human applied research,
and that in such cases a significance test would be quite valuable as a judgemental aid.
Both of these points are considered to be essentially incorrect and, if accepted,
potentially harmful.

1This is one in a series of articles available for $1.50 from the Business Manager, Journal of Applied Behavior Analysis, Department of Human Development, University of Kansas, Lawrence, Kansas 66045. Ask for Monograph #4.

Descriptive and Inferential Statistics
As Judgemental Aids

The observations resulting from scientific experiments are stimuli that hopefully affect the scientist and his colleagues by producing better practical behavior, more sophisticated follow-up experiments, or better verbal behavior regarding the subject matter. These stimuli, however, may not result in any effective reaction, a fairly common reason being their complexity. Repeated observation of the same experimental condition, for example, may give rise to a set of numbers, all differing considerably from one another. This situation has occurred quite often and methods have been discovered for simplifying it to some degree. Some of the methods generate two-dimensional visual stimuli where the values of each dimension stand in a point-to-point relation to some feature of the data; a frequency polygon is such a stimulus. Another stimulus-simplifying technique results in a smaller set of numbers, each related to some important characteristic of the larger set, such as the mean and range of the raw data.

Using the term "judgement" to refer to any of the various kinds of reactions that a scientist could make to the data of his experiment, it is useful to refer to these stimulus-simplifying techniques and their products as "judgemental aids". In this sense, then, the graphing devices and the measures of central tendency, variability, etc. of the field of descriptive statistics are all judgemental aids. In some way they produce a stimulus to which the experimenter can react more easily than to his raw data.

The various judgemental aids do not achieve their simplifying effects without some cost, however. In the first place, they are easier to react to in part because they are abbreviations. Some stimulus aspects of the raw data are simply absent from the aid, and if one's entire reaction is based on the abbreviation, the missing feature cannot affect behavior at all. Further, the scientist must spend some time learning about them, time he might be spending in other activities relevant to his subject matter. Statistics courses displace other topics from the curriculum.

A more complex type of cost consists of the time and effort that must be expended determining the extent to which some particular aid is
appropriate to the circumstances and data of a particular experiment. Finally, just as he must accumulate experience with his subject matter by reacting to it in various ways and being affected by the relatively long-term consequences of his reactions, he must now accumulate experience in reacting to the judgemental aid and feel the long-term effects of this behavior.

With such devices as frequency polygons, means, percentages, there seems to be a relatively clear net gain. The time required to learn how to use such techniques and the time spent in determining which one to use in a particular situation is relatively small compared with the simplifying effect achieved. Furthermore, the circumstances where they apply occur often enough that the individual scientist has some chance of acquiring the necessary experience regarding the long-range effects of his reliance on such judgemental aids.

Inferential statistics are also no more nor less than techniques for simplifying a complex stimulus situation. When an experiment results in two sets of numbers, one from a control and another from an experimental condition, the comparison may be quite difficult to make. It is usually easier to compare frequency polygons and means, and one's reaction to this state of affairs may be further aided in some way by performing what is called a statistical significance test or computing a confidence interval. The former is the most common inference procedure used in experimental psychology and results in a statement that the probability of such a difference (or a larger one) arising by chance when the population means are actually equal is less than or equal to some specified value.

Significance tests and confidence intervals are more expensive judgemental aids than descriptive procedures. The abbreviation is more extreme, and the time required to learn how to obtain and interpret them is much greater. Determining to what degree the judgemental aid is appropriate to the particular experiment - whether the assumptions underlying the significance test are met - is likely to require reaction to features of the situation that are fully as complex as the features that the aid is supposed to simplify.

Whether there is net gain, even with the most widely used and simplest inferential procedures, depends upon the extent to which the scientist and his colleagues react more effectively with than without such judgemental aids. And although these techniques have been widely used in experimental psychology for over 30 yr it is not at all clear that this particular field is in any way more effective because of them. From an empirical point of view, it would be desirable to have some data comparing the scientific or practical results achieved when significance tests are used with those when judgements are otherwise based. I know of no information of this sort. From a rational point of view, the incorporation of statistical inference into the broader field of decision theory clarifies considerably the possible role of the significance level as a guide to action. When combined with estimates of prior probability values for null and alternative hypotheses, and with quantitative estimates of the utility to the decision maker of correct and incorrect decisions, the significance of a treatment effect may be seen as a part of a very reasonable system for making decisions (as described, for example, by Raiffa, 1968, or Schmitt, 1969). There is some hope that developments within this area may eventually prove useful to the psychologist, but at present the assignment of prior probabilities and utilities seems possible in only a few applied research situations and not at all in basic research. From this decision-theoretic point of view, the significance test by itself is a very incomplete basis for any kind of judgement, and there certainly seems to be no rationale for the widespread use of any particular level of significance (0.05 is the most common) as a basis for distinguishing "real" from "chance" effects. If and when these better rationalized inference procedures become available to the psychologist, however, they will be even more expensive in terms of time spent dealing with the details of the judgemental aid itself, and a net gain will be realized only if they produce a
considerable improvement in experimental effectiveness.

Some Detrimental Effects Arising from an
Emphasis on Statistical Inference

Although it is not at all clear how statistical inference has helped the field of experimental psychology, it does seem closely linked with some undesirable changes in experimental practice. By the early 1930s, professional statisticians had developed significance test procedures appropriate to experiments of considerable complexity. If one was willing to rely on the result of the significance test as the main basis for reacting to an experiment, it then became possible to "control" statistically for unwanted sources of variation in a dependent variable, especially using the analysis of variance as developed by R. A. Fisher (1925). Before this development, an investigator had to discover techniques for experimentally controlling sources of irrelevant variation before he could even carry out his experiment. In the process he was likely to acquire a very valuable form of knowledge, irrespective of the ultimate value of the specific experiment. He was learning how to control his subject matter - in the case of psychology, the behavior of organisms - and even if the original reason for conducting the experiment was a poor one, something useful was likely to come of it. In addition to the reportable knowledge resulting from the effort to develop experimental control, this activity usually required a good deal of time, and so the experimenter was repeatedly exposed to the relevant contingencies of his problem area. He thus had a chance of being shaped into more effective forms of behavior regarding this subject matter even before his verbal repertoire regarding it was well developed. Also, since most problems concern more than one important independent variable, yet only one could generally be studied at a time, an investigator would usually conduct a series of separate experiments to tease out the various relationships, and was thus further exposed to the contingencies of his problem area.

The possibility of "statistical control" greatly reduced the necessity for developing experimental control. An experimenter could ask his experimental question irrespective of considerable uncontrolled variation in his dependent variable, if he could simply identify the sources of this variation. The study could then be designed in such a way that these sources were "balanced" across the various groups constituting the main comparison, and a satisfactory significance test could be computed with respect to this main comparison. These same methods of experimental design and statistical analysis also made possible the simultaneous investigation of more than one independent variable, thus further reducing experimental time and labor, but also reducing the duration and intensity of the experimenter's contact with his problem area.

At least five harmful effects of this general trend can be discerned.

1. The prolonged and intense interaction with the subject matter undertaken in order to experimentally control irrelevant sources of variation probably constituted a rich source of ideas for further experimentation. The use of statistical control deprives the experimenter of this source and he becomes more dependent upon theory, other researchers' experiments, and a form of commonsense analysis not necessarily related to his problem area as the basis for directing his research.

2. The knowledge developed in order to identify sources of variation and to select subjects in such a way as to "balance" for these sources is considerably less useful to other experimenters or for practical purposes than the knowledge required actually to control such variation.

3. Statistical control in complex experiments is easiest to accomplish by obtaining data from a large number of relatively independent behaving organisms, and such numbers generally preclude prolonged study of any one organism. Experimental situations, then, are designed to maximize the efficiency with which they provide exactly the type of information relevant to the particular experimental question being asked,
and become increasingly unlike any other situations, either inside or outside of the laboratory. The results from such experiments are thus less useful for any purpose other than answering the specific question being asked in that experiment, which has the further disadvantage that they are less likely to be verified by another experimenter using the same situation to study a different problem.

4. Reliance on the significance test leading to the extensive use of statistical control and multiple-factor experiments produces an excessive dependency on the significance test, since such experiments cannot be reacted to in any other way. What started out as a supplement to other bases of judgement has become, in the minds of many researchers, an essential aspect of scientific method. Yet, as Skinner points out, "We owe most of our scientific knowledge to methods of inquiry which have never been formally analyzed or expressed in normative rules" (1972, p. 319).

5. Since extensive preliminary study of an area is seemingly rendered unnecessary if one designs his experiment properly, and since such properly designed experiments cannot be interpreted until all the data are in and the significance tests have been performed, experiments tend to be carried out in a somewhat inflexible manner. In the type of research emphasizing experimental control, and thereby often involving prolonged study of a small number of organisms using relatively simple experimental designs, it is usually possible to change the procedure while the experiment is under way. If it appears that some previously unrecognized source of variation is causing trouble, the main manipulations can be postponed until means for controlling the interfering factor are developed. Or, if some aspect of the incoming results suggests an interesting variation, the experiment can be redirected immediately.

All in all, it seems possible to argue that what might have been a moderately useful judgemental aid has ultimately had the unfortunate effect of moving psychological research methodology out of the main stream of experimental science.

Statistical Inference for
Individual Organism Research:
A Weak Solution to an Artificial Problem

Not all areas within experimental psychology have adopted the research methodology deplored above. One that has been relatively unaffected is the area referred to as "the experimental analysis of behavior", "operant conditioning", "Skinnerian psychology", etc. The shunning of significance testing by researchers with this orientation may be due, as Gentile et al. suggested (1972), to the unavailability of inferential techniques appropriate to typical "single subject" data. On the other hand, this type of individual organism research has been going on for well over 30 yr and it is reasonable to assume that if any strong need for such techniques was felt there would have been some concerted effort to develop them. It seems to me that the relative indifference to statistical inference is more accurately attributable to the strong emphasis on effective experimental control as a major scientific goal and as the main evidence of the scientist's "understanding" of his problem area.2 The situation where a significance test might seem helpful is typically one involving sufficient uncontrolled variability in the dependent variable that neither the experimenter nor his readers can be sure that there is an interpretable relationship. This is evidence that the relevant behavior is not under good experimental control, a situation calling for more effective experimentation, not a more complex judgemental aid.

2This emphasis, of course, predisposes investigators toward prolonged study of a small number of organisms, and within-subject comparisons where possible. Between-subject comparisons, however, can also be quite meaningful if behavior is under good experimental control.

In any case, whether by necessity, scientific cunning, or prejudice, operant researchers, basic and applied, have made little use of statistical inference and do not seem to have suffered as a
result. Increasingly sophisticated methods of experimental control have developed within the area of basic research, and applied researchers have generally been able to make use of the same technology, or develop methods of experimental control appropriate to their own problem areas.

As the applied area expands, however, there seems to be an increasing tendency to present experimental results that are not easily interpreted when simply displayed in graphical form, or as a table of means or per cents. This is said to be due to the practical difficulties that the applied researcher encounters in his efforts to obtain human data in the nonlaboratory environment. It is argued that he does not have the luxury of discontinuing the experiment until he discovers and experimentally controls various sources of irrelevant or confusing variation in his data. He cannot, like the basic researcher, simply discard that pigeon and start over again with another. The opportunity for experimenting may no longer be present, a number of people may have been inconvenienced, a good deal of experimenter time may have been spent, and considerable financial as well as other resources may have been expended. One must, in a sense, make the best of the data as they stand, and this is where the significance test comes in. Faced with data that do not constitute an effective stimulus for judgement, the experimenter and his readers must do whatever is possible, and perhaps they will be able to behave somewhat more effectively if they have the judgemental aid offered by some statistical inference procedure.

This, of course, is what Gentile et al. are offering, and although critical of that specific solution, the other authors (Hartmann, 1974; Keselman, 1974; Kratochwill et al., 1974; Thoresen and Elashoff, 1974) offer their own solutions of the same type. It is probably never appropriate to be critical of any valid knowledge-seeking activity per se, but one can criticize its rationale. The present interest in obtaining a proper significance test procedure for single-subject data seems based on two faulty premises.

First is the belief that applied data are taken under conditions where effective experimental control cannot be expected. While workers in the field of applied behavior analysis have not been as badly affected by the experimental design and statistical significance enthusiasts as some other kinds of psychologists, they may not have escaped entirely. Peaceful coexistence with those who emphasize statistical control and multiple-factor experiments seems to have resulted in an increased tendency to plan, carry out, and then analyze the experiment all as a relatively inflexible unit of behavior - the fifth harmful effect listed earlier. When a dependent variable is not under good control - when there is considerable unexplained variability even though the independent variable being studied is at a constant value - it is not usually necessary to go ahead with the other planned manipulations. Further efforts can be made to obtain a more stable dependent variable, or to discover and eliminate some of the sources of uncontrolled variation.

If these efforts are unsuccessful and if the experiment is an expensive one in terms of time and other resources it is probably wise to abandon it at this point or recognize it as a gamble with a low probability of payoff. There are, of course, a number of "nonscientific" reasons for continuing an apparently unprofitable experiment, such as the necessity of completing a thesis or dissertation requirement, or the belief that if one does not carry out the research project that he spoke so highly of in the grant request he may have trouble getting another grant. That the significance test might be of aid in such situations and could actually further such purposes is certainly no recommendation.

The second faulty premise is that the significance test is an especially helpful judgemental aid, and therefore worth a good deal of time and inconvenience. When experimental control is emphasized and results can be portrayed in relatively simple graphical form, the probability of those results or more extreme ones given the
null hypothesis is a very crude form of information, compared with the other stimulus features available to the experimenter, and is likely to be ignored if it is not consistent with the interpretation arrived at otherwise.3 In the typical multiple-factor experiment relying heavily on statistical control, the significance value is no more informative in an absolute sense, but since the results cannot generally be reacted to in any other way, it seems more useful. This means only that one should avoid experimenting in such a way that he is forced to rely on such a weak tool.

3It is possible that there are still some researchers who overvalue the significance test because they believe that the significance value reached in any particular test is equivalent to the probability that the null hypothesis is true. We cannot blame the professional statisticians for this misinterpretation, however, except that in warning us to avoid this error they have not often substituted a plausible alternative.

An overvaluation of the significance test by itself is a relatively harmless misunderstanding, but it is likely to cause other changes in experimental practice that are more serious. If the significance test is valued above all other judgemental aids, experimenters are likely to try to design their experiments so that a significance test can be computed, an obvious loss in terms of experimental flexibility. Note in this connection Hartmann's (1974) suggestion that "... at least 12 and preferably more stable data points should be available for each condition." Another undesirable possibility is that a good deal of time will be spent in learning about and interacting with the judgemental aid, rather than in contact with the experimental area itself. In the operant area we already have a powerful source of distraction from our primary "target", in that many experimenters often find it at least temporarily more satisfying to experiment with their behavior control equipment - electromechanical, solid state, and more recently on-line computer - than to experiment with behavior. In the case of the autoregressive techniques that seem to be "just around the corner", their understanding will surely require a good deal of graduate instruction time and their proper usage could easily become a main concern from the point of view of data analysis - clearly a case of the tail wagging the dog.4

4It is often pointed out that the time spent dealing with the statistical judgemental aid can now be minimized by utilizing computer programs developed for this purpose. The experimenter can simply "plug in" his data and read out the significance value as well as some indication of the appropriateness of the particular technique to those data. This would seem to represent even further dependency upon an expertise which is beyond one's own critical scrutiny, an essentially undesirable direction to take. It can be argued, of course, that we all depend upon experts in other areas - an example is the biologist's dependency upon the optical specialists who design and construct his microscopes. We do it, however, on the basis of earned confidence, and the statisticians' contribution to experimental psychology seems quite uncertain when compared with the optical specialists' contribution to biology.

If the decreased experimental flexibility and the distraction from our primary subject of interest is not sufficient reason to be unenthusiastic about this development, there is the further distinct possibility that editors confronted with results that are in an obvious sense relatively meaningless may be induced to foist these results off on the readers if they are accompanied by an appropriate significance test that reaches the 5% value.

What Gentile, Roden, and Klein, and the other authors as well, are offering researchers in the area of behavior analysis is an opportunity to adopt a practice that has had a 30-yr trial period and is still of uncertain value. It is a practice, furthermore, that seems historically almost incompatible with the emphasis on experimental control that has characterized the operant research orientation. This would seem to be an offer we can afford to refuse.

REFERENCES

Fisher, R. A. Statistical methods for research workers. Edinburgh: Oliver and Boyd, 1925.
Gentile, R. R., Roden, A. H., and Klein, R. D. An analysis-of-variance model for the intrasubject replication design. Journal of Applied Behavior Analysis, 1972, 5, 193-198.
Hartmann, D. P. Forcing square pegs into round holes: some comments on 'An analysis-of-variance model for the intrasubject replication design.' Journal of Applied Behavior Analysis, 1974, 7, 635-638.
Keselman, H. J. Concerning the statistical procedures enumerated by Gentile et al.: another perspective. Journal of Applied Behavior Analysis, 1974, 7, 643-645.
Kratochwill, T., Alden, K., Demuth, D., Dawson, D., Panicucci, C., Arnston, P., McMurray, N., Hempstead, J. and Levin, J. A further consideration in the application of an analysis of variance model for the intrasubject replication design. Journal of Applied Behavior Analysis, 1974, 7, 629-633.
Raiffa, H. Decision analysis. Reading, Mass.: Addison-Wesley, 1968.
Schmitt, S. A. Measuring uncertainty. Reading, Mass.: Addison-Wesley, 1969.
Skinner, B. F. Cumulative record. 3d ed. New York: Appleton-Century-Crofts, 1972.
Thoresen, C. E. and Elashoff, J. D. 'An analysis-of-variance model for intrasubject replication design:' some additional comments. Journal of Applied Behavior Analysis, 1974, 7, 639-641.

Received 20 August 1974.
(Published without revision.)
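
The significance test discussed in the first section yields a statement about the probability of a difference at least as large as the one observed arising by chance when the population means are actually equal. A minimal sketch of how such a value might be computed for two small sets of numbers is given here. It uses an exact randomization (permutation) test, which is an assumption chosen for simplicity and is not one of the procedures proposed by the authors cited above. The function name `permutation_p_value` and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(control, experimental):
    """Exact two-sided randomization test for a difference in means.

    Every way of splitting the pooled scores into groups of the original
    sizes is enumerated, and the returned value is the fraction of splits
    whose absolute mean difference is at least as large as the observed one.
    """
    pooled = list(control) + list(experimental)
    n_control = len(control)
    n_experimental = len(pooled) - n_control
    observed = abs(mean(experimental) - mean(control))
    total = sum(pooled)

    at_least_as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_control):
        control_sum = sum(pooled[i] for i in idx)
        diff = abs((total - control_sum) / n_experimental
                   - control_sum / n_control)
        # Small tolerance so that ties survive floating-point rounding.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
        n_splits += 1
    return at_least_as_extreme / n_splits


# Hypothetical scores from a control and an experimental condition.
control = [12, 15, 11, 14, 13]
experimental = [18, 21, 17, 20, 19]
p = permutation_p_value(control, experimental)
print(f"p = {p:.4f}")
```

With these hypothetical scores the two conditions do not overlap at all, a fact that is already plain from the raw numbers or a simple graph; the computed value adds only the single abbreviation that the text describes.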
