You are on page 1of 9

JOURNAL OF THE EXPERIMENTAL ANALYSIS OF BEHAVIOR 1978, 29, 17-25 NUMBER I (JANUARY)

STABILITY CRITERIA
PETER R. KILLEEN1
ARIZONA STATE UNIVERSITY

Three approaches to the determination of behavioral stability were examined. In the first,
a learning curve was fit to acquisition data (from Cumining and Schoenfeld, 1960), and
the "experiment" stopped when the data approached sufficiently close to the theoretical
asymptote. In the second, the data were analyzed for variability and linear and quadratic
trend. In the third, the experiment was stopped when the magnitude of the daily changes
in the data fell below a criterion. Accuracy was measured as deviation between the average
value of the dependent variable when the experiment was stopped, and the average value
over the last 100 sessions. The first approach was most accurate, but at the cost of requir-
ing the most sessions and being the most difficult to apply. Both the second and third
approaches provided acceptable criteria with a reasonable cost-accuracy tradeoff. The
second approach permits a continuous adjustment of the criteria to accommodate the
variability intrinsic in the experimental paradigm. The third, nomothetic, approach also
takes into account the decreasing marginal utility of extended training sessions.
Key words: learning curves, trend analysis, nomothetic criteria, optional stopping

Cumming and Schoenfeld's words intro- The issue is important, for it is the custom
duce this research as well as they did their of behavior analysts to continue an experi-
own: ment until the data "appear stable", rather
In the literature, the term "stability" than terminate after a fixed number of ses-
appears to refer to one or both of two sions. Yet, to this date, there are no widely
things. In some places, it means that be- accepted criteria for stopping. The criterion
havior is no longer changing significantly tested by Cumming and Schoenfeld (1960) did
because it is close to its asymptotic value not work:
under the given conditions. In other con-
texts, the term seemingly refers to be- A 6-day period was considered to have
havior . . . that shows minimal variabil- met the stability criterion if the difference
ity. . . The concern of the worker is car- between the mean rate for the first three
ried by several concrete questions. . .: days and the rate for the second three days
When could the experiment have been was no greater than 5% of the overall
stopped with any desired probability that 6-day mean. . . Although the criterion
no further change in the dependent varia- itself tended on repeated application to
ble would have been observed? . . .What select 6-day means at random, the use of
is a satisfactory rationale for defining "sta- the first occasion on which the criterion
bility," and what is a reasonable criterion is met proves a bad choice in practice
to set for accepting behavior as "stable"? (pp. 78, 79).
(1960, p. 71). Lacking algorithms for stopping experi-
'This work was supported in part by grant BMS 74-
mental sessions, experimenters have usually
23566 from the National Science Foundation. Listings relied on visual inspection to determine sta-
of Basic computer programs relevant to this article may bility. Visual stability tests are presumably one
be obtained from the author. I am indebted to the of the contingency-shaped behaviors that are
Glass Bead Group for their cooperation in Experiment acquired during graduate education. Because
II and to Stephen Hanson for comments on the manu-
script. Listings and reprints may be obtained from such learning is often several stages removed
Peter R. Killeen, Department of Psychology, Arizona from the relevant contingencies, we seek here
State University, Tempe, Arizona 85281. to replace it with rule-governed behavior.
17
18 PETER R. KILLEEN
In their experiment, Cumming and Schoen- which change is to be measured is often not
feld (1960) trained six pigeons on a t-T sched- zero. This may be easily taken into account
ule (analogous to an Fl 28.5-sec LH 1.5-sec by adding a parameter "B" to the right side
schedule) for 200 sessions. These data, similar of the equation, which sets the dependent
in appearance to other learning curves pub- variable at level B (i.e., baseline level) on the
lished in this journal, provide a good testing zeroth session.
ground for alternate types of stopping criteria.
The present paper evaluates three strategies METHOD
for generating such criteria. Subjects
The data were response rates collected by
Cumming and Schoenfeld (1960) and are avail-
EXPERIMENT I able from American Documentation Institute
"Asymptote" is a line that a tangent to a as Document No. 6244.
curve approaches, as the curve is extended to
infinity. To presume that behavior may be at Apparatus
some point asymptotic is to presume that the A PDP- 11 computer was used.
"learning curve" has a single asymptote, and
that it will approach reasonably close to it Procedure
within the lifetime of the subjects. Experi- After each session, a simple iterative pro-
mental contingencies may, however, establish gram searched for the best values of A and C.
an undamped negative feedback loop, so that "Best" was taken as that value which mini-
the only stability is a stable oscillation be- mized the sum of squared deviations between
tween two asymptotes ("metastability"). Bar- the obtained data and the theoretical learning
ring such a complication-or given indices curve. B (response rate of Session "0") was
such as frequency of oscillation rather than fiixed at a value extrapolated from the first
raw data-we may ask whether any simple two sessions. A number of stopping criteria
quantitative model of the learning process were evaluated, with their merit decided by
provides useful estimates of asymptote and the smallness of the deviation between the
rate of approach to it. average rate during the six days before stop-
There are a number of different models of ping and the average rate over Sessions 80
learning curves to choose from: autocatalytic to 180 (the last stable 100 sessions, labelled
(Robertson, 1920), cumulative normal (Culler "Asymptote" in Table 1).
and Girden, 1951), power functions (Stevens
and Savin, 1962), and exponential-integral RESULTS
(Anderson, 1963; Estes, 1950). There should The data were considered stable when two
be few important practical differences among conditions were satisfied: the learning process
the models' estimates of proximity to asmyp- had to be 99% of the way to completion (i.e.,
tote, so we choose one of the simplest, the ex- J: 5C) and the average of the last six ses-
ponential learning curve: sions had to be within 5%7O of the predicted
asymptote. Table 1 shows the session in which
R = A(l - e-J/c) (1), the criterion was reached by each of the ani-
where R is the dependent variable, A is the mals, the- average rate for the six days preced-
asymptote, e the base of the natural loga- ing that session, and the per cent by which
rithms, J the number of sessions, and C a rate this average deviates from that of the last 100
constant. The parenthetical expression ranges sessions.
between 0 (at Session 0) and 1.0 (as J ap-
proaches infinity). The rate of approach to DISCUSSION
asymptote is governed by C, called the time- The results hardly justify the effort put into
constant of the system. When J = C, the system curve-fitting and updating that fit every ses-
is 63% of the way to asymptote; at J = 3C, sion. The average error was 14% (the average
95%, and at J = 5C, 99%. Here is a simple error between the theoretical asymptotes and
model that provides direct measure of asymp- the last 100 sessions was also 14%/). While this
tote and rate of approach. But two parameters is considerably better than the performance of
are seldom adequate, since the baseline from Cumming and Schoenfeld's criterion (average
STABILITY CRITERIA 19
deviation: 25%), the average number of ses- the data, which, in this context, becomes as
sions required was also considerably greater. complicated as the control model. Other learn-
If we had indiscriminately stopped all animals ing curves (e.g., logistic) with other criteria for
on the forty-third session, the average individ- the elimination of data and the judgement of
ual deviation would have been 12%. System- stability might yet prove viable, but we now
atic evaluation of other criteria (J -
2C proceed to other types of criteria.
through J :
6C, deviation from theoretical
asymptote from 2%0 through 10%, X2, etc.)
yielded none that fared better. EXPERIMENT II
The reason for the failure of this technique What people look at when they "eyeball"
do
is the failure of the pigeons to approach an data for stability? In the following experi-
asymptote in a smooth, monotonic fashion. A ment, we attempted to distill an algorithm
fast start, as was the case with the first two from a number of judgements of "stable by
pigeons, would cause an overestimation of visual inspection".
asymptote during the early sessions. This bias
would be eliminated as the data continued to Subjects
accumulate, but it was most likely that the Five sophisticated laboratory researchers
criterion would be satisfied by data that were served.
anomalously high and met the descending the-
oretical asymptote before it had stabilized at Apparatus
a more representative level. Fifty-five graphs of "data points" were gen-
Control theory might provide a better model erated by computer from random variables
for the stabilization of data than does learn- with a mean of 50 and a standard deviation of
ing theory, for it provides explicit representa- five (scale: 25 units to the inch). Each graph
tion for data that overshoot their asymptotic consisted of six unconnected points, with no
level. But that requires three additional pa- numerical ordinates visible.
rameters: the damping ratio, and the fre-
quency and phase of oscillation. These make Procedure
computer convergence difficult, and complicate Each graph was displayed through a window
the model to the point of impracticality. Alter- in a file folder. Subjects were told to suppose
nately, if data from the earliest part of the that these were the last six sessions from an
experiment are expendable, we may improve experiment that had been in progress for 25
our estimate of asymptote at the cost of an sessions. Only one of the subjects knew the
inferior account of the first few sessions. But true nature of the data. They were asked
the issue of how much data to discard itself whether they would stop the experiment at
stands in need of a criterion. That decision this point, and asked to rate their confidence
leads into the issue of exponential weights for in that decision on a scale of one to three.

Table 1
A comparison of various stability criteria. "Dev" is absolute per cent deviation between
rates and asymptotes. The two values for "Avg Dev" are the average of the column and
the deviation of the average rates from the average asymptotes.
Criterion
Nomothetic
Learning Curve Trend Analysis Criterion
Subject Asymptote Day Rate Dev Day Rate Dev Day Rate Dev
26 30.7 60 40.7 33 20 45.2 47 22 43.0 40
27 76.0 56 83.8 10 14 54.4 28 18 59.7 21
28 36.7 36 40.3 10 29 38.3 4 32 36.0 2
33 42.3 57 38.5 9 13 38.3 9 16 44.8 6
42 46.1 26 54.5 18 8 58.0 26 14 57.0 24
45 44.6 21 45.7 2 43 48.0 8 35 51.8 16
Avg 46.1 43 50.6 14/9.7 21 47.0 20/2.0 23 48.7 18/5.7
20 PETER R. KILLEEN

Two subjects looked at the graphs in one variation exceeds 0.14, the coefficient of linear
order, three in the opposite order. The first trend 0.12, or the coefficient of quadratic
five ratings were not included in the analysis. trend 0.20. These cut-points would have
stopped the experiment every time the aver-
RESULTS AND DISCUSSION age rater stopped it, and continued the ex-
Although the population from which these periment 93% of the time the rater continued
random variables were sampled had a 10% it.
coefficient of variation and zero linear and Next, I applied the criterion to Cumming
quadratic trend, random sampling of only six and Schoenfeld's data, stopping a subject
items from the population generated a range whenever all three coefficients over a six-day
of data configurations that resembled typical period were less than 14%. Given that cri-
data, with reasonable disparity in both disper- terion, on the average, 21 sessions are required
sion and trend from one graph to another. for stability, with an average deviation be-
Most subjects lamented their inability to tween rates over the last six sessions and
see data from sessions previous to the six on asymptotic rates of 20% (see Table 1). This
display. In this sense, the experiment did not error falls between the 25% average deviation
permit the full range of observing behavior of Cumming and Schoenfeld's criterion, and
typically involved in visual estimates of sta- the 14% average deviation of the "learning
bility. The subjects reported that they at- curve" criterion. Thus, while the "trend analy-
tended to both variability and trend in mak- sis" is better than Cumming and Schoenfeld's
ing their decisions. A number of indices of criterion, it leaves room for improvement.
variability and trend were therefore exam- Some of the error arises from shifts in response
ined in an attempt to capture the essence of rates late in training, and it would be difficult
the subject's behavior. The most successful for any criterion imposed earlier in training
indices were the coefficient of variation, the to get past periods of relative stability that
amount of linear trend and the amount of preceded the later changes in level. Another
quadratic trend, with both of the latter mea- source of inaccuracy is sampling error: the
sured by weighting the data with appropriate average individual standard deviation during
orthogonal polynomials. the last 100 sessions was 6.6 responses per
The coefficient of variation was calculated minute. The standard error of the mean for
by dividing the sample standard deviation by samples of six sessions was therefore 2.7 re-
the mean of the data, and the indices of trend sponses per minute. It follows that the average
were calculated by multiplying each of the deviation of sample means from population
scores by the appropriate weight (-5, -3, -1, means (i.e., rates over the last 100 sessions)
1, 3, 5, for linear trend, 5, -1, -4, -4, -1, 5 will be about 4%. Response rates during Ses-
for quadratic trend), and dividing the weighted sions 75 to 80, which Cumming and Schoen-
sum by the root mean square average of the feld considered asymptotic, deviated by an
weights (8.37 for linear, 9.17 for quadratic), average of 9%. Four to 9% is therefore about
and by the mean of scores. The correlations the best we can hope for from these data.
between the average confidence rating for each Is it possible to approach closer to the
graph and each of the indices were: coefficient floor of 4 to 9% with trend analyses? Tests
of variation, 0.73; linear trend, 0.58; quadratic employing larger numbers of sessions (10) and
trend, 0.53. The multiple correlation was 0.81. a range of different cutpoints failed to improve
Despite its success, the multiple regression substantially on the 20% error. As Sidman
may not be a good analog of the behavior of (1960) noted, increasing the stringency of our
the raters. Rather than estimating a weighted criteria will not guarantee an increase in the
sum of the indices, the subjects seemed to stability of the data when those criteria are
employ a noncompensatory approach: too met; it may just cause the researchers "to
much variability, or too much linear or quad- spend a lifetime, if they are that stubborn,
ratic trend would preclude stopping, no mat- on the same uncompleted experiment. . ..
ter how close to zero the other indices were. Even if the criterion were occasionally met by
We can reconstruct the binary decisions from chance, in the course of uncontrolled varia-
the data with the following post-hoc analysis: bility, the data would be chaotic. As a result,
continue running whenever the coefficient of either the experiment will be abandoned (with
STABILITY CRITERIA 21
an attendant loss of time and effort) or the pected deviation between the sample mean
data will be invalid" (p. 260). and the population mean will decrease as the
Sidman's words call our attention to two square root of N - 1. Similarly, as the number
factors involved in every reasonable decision of sessions is increased, the deviation between
concerning stability: the levels of variability the subjects' measured performance and their
generally expected within a paradigm and asymptotic performance will decrease. Given
within a laboratory, and the cost-effectiveness fixed resources, we can maximize the precision
of additional training sessions. Wlhereas the of our estimate of the population mean by
cut-points of the trend analyses and the num- judiciously allocating those resources between
ber of sessions it encompasses may be easily and sessions and subjects. Conversely, given a de-
properly adjusted by each user to allow for cision about the number of subjects to be
intrinsic variability,2 it is only with the nomo- employed, we can stipulate when the decreas-
thetic approach that the costs of experimenta- ing marginal utility of additional sessions
tion are explicitly taken into account. passes a threshold beyond which that decision
should be revised-or, standing by that de-
cision, when it becomes reasonable to termi-
EXPERIMENT III nate the experiment. This strategy does not
"N should refer to the number of observa- assume that we have any particular interest in
tions, not to the number of subjects", I was between-group comparisons; it does assume
recently reminded. This sentiment, uttered in that we are interested in generalizing the re-
defense of "small N" researclh, implies a possi- sults obtained with a sample to the popula-
ble trade-off between the number of subjects tion as a whole.
and the number of sessions involved in an
experiment. Different types of information are METHOD
derived by conducting numerous sessions in- To effect the analysis, we must estimate
volving a few subjects, versus conducting a the probable magnitude of error that might
few sessions involving numerous subjects. In arise both from sampling error and from fail-
the former case, we achieve a good specifica- ure of the subjects to reach asymptote. In
tion of the behavior of a few organisms, but the data reported by Cumming and Schoen-
little information about how representative feld, the mean of response rates for the six
they are of their species. In the latter case, we subjects over the last 100 sessions was 46.1,
achieve a good sample of the population, but witlh an estimated a- of 15.7. Assuming a nor-
the behavior measured may be far from mal distribution of error, we may infer that
asymptotic. approximately 95% of the replications of their
It is our thesis that these two types of in- experiment with six subjects will generate
formation are both necessary, and further- asymptotic mean response rates within two
more, that they are commensurable. We as- standard errors of the mean, that is, between
sume a nomothetic philosophy, according to 33 and 59 responses per minute. The more
which data from individual organisms are to subjects that are run, the more tightly these
be evaluated in terms of their contribution to limits may be drawn. The standard error of
our knowledge of population characteristics the mean, calculated by dividing the sample
(Falk, 1956). Both sources of errors-deviation estimate of the population standard deviation
of individuals from their asymptotic perform- (s) by the square root of N, thus provides the
ance, and deviation of that asymptotic per- needed relationship between sample size and
formance from the performance of the typical expected sampling error.
animal of the species-are to be minimized. Let us estimate the change in error that oc-
The nomothetic assumption is invoked, not curs with each additional session (dE/dJ) by
only because of its inherent reasonableness, subtracting the response rate during one ses-
but because it provides the needed standard sion from that measured during the previous
for our next stability criterion. We know that sessions. A provisional criterion for stability
as the sample size (N) is increased, the ex- is the following: when the overall decrease in
error derived from additional sessions becomes
2We now routinely employ eight-session trend tests less than the decrease in error derivable from
with cut-points around 15%. additional subjects (dE/dN), it becomes rea-
22"D 22 PETER R. KILLEEN
sonable to terminate those subjects and ini- where s, the sample estimate of a-,
tiate new subjects in their place. Alternatively,
taking the original number of subjects as an I (Xi -M)2
index of the tolerable error-and this will be i (4)
S= V
N-I
the usual course of action-it becomes reason-
able to terminate the experiment. This for- is presumed constant with respect to N (i.e., is
mulation may be improved by selective appli- presumed unbiased; cf. Dixon and Massey,
cation of the criterion: whenever the daily 1957). We specify that a subject be terminated
decrease in error for an individual subject be- when the change in the dependent variable is
comes less than the criterion, terminate it and less than the above quantity times the num-
allocate the saved resources either to new ber of sessions to date. Since N can change
subjects, or to more sessions for the slower only in integral steps, the critical point will
subjects. occur halfway between two values of N. We
The above statement ignores the cost in- therefore add 0.5 to N, to get the upper cate-
volved in conducting additional sessions or gory boundary of which N is the midpoint.
running additional subjects. If we take the Our criterion becomes:
unit cost to be one subject-session, the cost
of each additional session is 1 (or N for the _= sJ
group). The cost of each additional subject - (5a)
2 (N + 0.5)1.5
must take into account the likelihood that it
will be necessary to run the subject as long For treatment of groups as a whole, it is:
as the one it is replacing to obtain adequate
stability-let us say J sessions. Our criterion yI= sJ (5b)
then becomes: terminate a subject whenever 2N (N + 0.5)1 5
Let us see how this works for the data of
-dE _dE_2a
dj dJ/J-ddN /1. ~~(2a)
Cumming
stantiates:
and Schoenfeld. Equation 5a in-
That is, terminate whenever the error reduc- = 15.7 J
tion expected from one additional session, di- =
(6)
vided by the number of sessions to date, be- 2 (6.5)1.5
comes less than the error reduction expected Y =.47J (7)
from one additional subject. This explicit The criterion may be represented as a
introduction of the cost for additional sessions straight line that changes in error from one
saves us from the Sisyphean fate of Sidman's
session
stubborn researcher, for the accumulating The changes to the next (dE/dJ) must cross below.
costs of additional sessions progressively re- tude to changes in error are equivalent in magni-
laxes the criterion. The ad-hoc decision "long but since it is notin known the dependent variable,
enough" (i.e., visual stability) is replaced by a such changes represent an a increase priori whether
or de-
continuous and specifiable adjustment of the crease in error, we conservatively require that
criterion. the absolute value of the change be less than
If the structure or economics of the labora- the criterion. The number of sessions required
tory dictate termination of the group as a for each of Cumming and Schoenfeld's sub-
whole, Equation 2a becomes
jects to pass this criterion for six days in a
row ranged from 14 to 35, and averaged 23.
dJ J- dE IN (2b) The average rates over those days deviated
from the asymptotic rates by 18%-a per-
where the group is treated as a single organism, formance within the range established by the
and the dE/dJ is based on the change in the first two approaches (see Table 1).
dependent variable averaged over the group. Our criterion was generated in reference to
The derivative of the standard error of the group error, not individual error, and may be
mean with respect to N is estimated by evaluated on that basis. The average response
dE 2 S rate over animals during the stable conditions
(3) was 48.7
a-IV2N1.5'3 responses per minute. This deviates
STABILITY CRITERIA 23
by less than 67% from the average asymptotic vi
(8)
rate of 46.1 responses per minute, and is easily
included in the 95% confidence interval for
2(N + 0.5).5 (N-1)05
or, approximately
the mean.
The nomothetic criterion requires an esti- (9a)
mate of the standard deviation of asymptotic 2N1.5.
response rates, but this is not a severe prob- When the group will be terminated as a
lem. Let us assume a large error in estimating unit, Equation 5b transforms to approxi-
the standard deviation. Had we presumed it mately:
to be 7.9 instead of 15.7, we would have found vi
ourselves conducting an average of 34 sessions, yI (9b)
11 more than previously. An underestimation -2N2.5 Y
of the sampling variability thus biases us DISCUSSION
toward running additional sessions, since the It may be argued that sampling error over
marginal utility of an extra animal becomes subjects is somehow less important than fail-
less than the marginal utility of an extra ses- ure to reach asymptote. For instance, the for-
sion, until additional sessions with their de- mer may be expected to be random, whereas
creasing marginal returns and cumulating the latter may be biased, with all individuals
costs have again balanced the scales. Con- approaching asymptote from the same direc-
versely, had we presumed the standard devia- tion. In fact, this was not a problem in the
tion to be 31.4 instead of 15.7, we would have analysis of Cumming and Schoenfeld's data,
run an average of 17 sessions instead of 23. where our estimate of the terminal rate was
The subjects' average response rate when 6% too high, even though all animals ap-
stopped would have been 45.7-an error no proached asymptote from below. But the argu-
greater than that occurring with the stricter ment has some merit, and is applicable to any
criterion, although the average deviation for type of stability criterion. If experimenters do
individual animals increases to 21%. not sequence conditions according to a Latin
Problems of estimation may be further al- Square design, or pretrain animals to a range
leviated by use of the coefficient of variation, of response rates, they may wish to adjust the
rather than the standard deviation, in the stability criterion. If sampling error is con-
above calculations. The coefficient of variation sidered to be only half as detrimental as de-
is the standard deviation divided by the mean; viation of individual animals from asymptote,
it is less variable over subjects-and a fortiori the experimenter merely need halve the cri-
over experiments-than is the standard de- terion value. Similar proportional changes in
viation (cf. Cumming and Schoenfeld, 1960, the criterion will accommodate different eval-
P. 73). Cumming and Schoenfeld's coefficient uations of the relative cost of subjects and
of variation for average asymptotic response sessions. Whatever the experimenter's deci-
rates was 31%, while their within-subject co- sion, the value of Y chosen provides an ex-
efficient of variation averaged 14.3%/ over the plicit statement of the probable error in the
last 100 sessions. It is unlikely that the be- data, which is an important advance over
tween-subject variability will ever be less than visual stability criteria.
the within-subject variability, so that when To employ Equations 9a or 9b, we need
the former is lacking, the latter may be taken merely plot a straight line on graph paper,
as a conservative estimate of it. If the coefficient with origin at zero and a slope of Y or Y'.
of variation is used in Equation 5a, the left The abscissa will measure sessions, and the
side of the equation must also be divided by ordinate will measure the proportional rate
the mean, so that we test proportional change of change in error, as estimated by the dif-
in error, rather than absolute change, against ferences in the dependent variable from one
the criterion. Since the coefficient of variation day to the next divided by the mean value of
(V) is traditionally based on the sample stan- the dependent variable on those days. To in-
dard deviation, rather than the estimate of crease the smoothness of these "operating
population standard deviation (i.e., it employs characteristics", I graph the average of the
a divisor of N, rather than N - 1), our cri- previous day's entry and the current change
terion becomes index (this latter tactic generates an exponen-
24 PETER R. KILLEEN

tially-weighted moving average of the change 1977). The within-subject design is appealing,
indices). Thus, if Di is the current value of when appropriate, because we can usually de-
the dependent variable (for each subject when crease the variability in our dependent varia-
using 9a or averaged over subjects when using bles by choosing as our dependent variable
9b), Di-, the value of the dependent varia- the difference or ratio of each individuals' per-
ble on the previous session, and Yi the value formance in the various experimental condi-
to be graphed, then tions, rather than the absolute values of their
performance. This decrease in V translates
Yi = 0.5(Yi-1 + Di-1 - Di /Di). (10) into decreases in the number of subjects
Inspection of Equations 9a and 9b reveals needed, and, given the constraint of fixed re-
that the criterion line becomes steeper, and sources, permits each to be run for a greater
thus easier to pass, as the number of subjects number of sessions.
is decreased. This seems to penalize the re-
searcher who employs a large number of sub- GENERAL DISCUSSION
jects, by requiring that they be run for a
greater number of sessions than is required Three approaches to the evaluation of a
of a researcher who employs fewer subjects. stability have been formulated. The learning-
This outcome follows from the logic of the curve approach was rejected as unwieldy, even
nomothetic approach, where the choice of N though it performed about as well as the sub-
is not arbitrary, but is presumed to be based sequent techniques. The trend analysis was
both on the variability expected between sub- simpler to apply, and provides researchers
jects, and on the speed with which the de- the flexibility of experimenting with their own
pendent variable is expected to approach cut-points and number of sessions to be tested.
asymptote. Increases in N betoken, in a bal- The nomothetic approach addressed the issue
anced allocation, increased resources, and of behavioral stability at a more fundamental
Equations 9a and 9b automatically apportion level. In this approach, the importance of
some of those resources to an increased num- minimizing both learning error and subject
ber of sessions. If a researcher finds the present sampling error was assumed. Some such as-
Y criterion too lenient, it merely implies that sumption is implicit in most research, whose
the researcher is using too few subjects for results are inevitably generalized beyond the
the amount of resources that he or she is three or four subjects employed. In the nomo-
willing to invest. thetic approach, representativeness is mea-
It is usually necessary to shape the animals sured, not assumed. As number of subjects is
to respond, or pretrain them in other ways, increased, representatives of the sample is in-
before one can even begin to measure the creased, with rate of increase being a function
dependent variable. These initial costs may of the number of subjects already in the
be taken into account by allocating them to sample, and the intersubject variability typical
each of the conditions to be run, and then of the paradigm. Just as there is a decrease in
appropriately offsetting the origin of the the marginal utility of each extra subject, so
change indices. If, for example, 12 days of there is a decrease in the marginal utility of
pretraining are necessary for a proposed ABA each extra session of training. When the latter
design. we should begin graphing the change exceeds the former, it is time to add new
indices four days to the right of the origin. subjects-or to stop the experiment, if N is
At the heart of the nomothetic approach is large enough. This logic formed the basis for
our desire to optimize the information that we the nomothetic approach, and led to the
get from any experiment about population criterion lines of Equations 9a and 9b.
characteristics. This goal is equally valid Some researchers may insist that they are
whether we employ within-group or between- interested only in the idiosyncracies of the few
groups designs. In both cases, we take as N the animals in their experiment. Such idiographic
number of subjects in each condition-whether research needs no normative criteria, and
or not the same subjects have appeared in may indeed be terminated at the discretion
other conditions. The logic of when to use of the investigator. Other researchers may wish
each type of design has recently been reviewed to discriminate between two or more the-
by Greenwald (1976; see also Erlebacher, ories, and the stopping criteria we proposed,
STABILITY CRITERIA 25
while optimal, may not be adequate. But it cies (Polya, 1954; Skinner, 1956, 1958). The
is short-sighted to attend to only one source nomothetic analysis questions those contingen-
of variability (deviation from asymptote) while cies. It suggests that the issues of precision,
ignoring other sources; the essence of theories cost-effectiveness, and representativeness are
is their generality over a population, and re- properly involved in every scientific decision.
searchers interested in testing theories must They deserve explicit consideration in experi-
be willing to invest the necessary resources, in mental analyses of behavior.
both subjects and sessions. Conversely, for
simpler questions-such as, "is there any ef- REFERENCES
fect from treatment B"-we may employ fewer Anderson, N. H. Comparison of different populations:
subjects, and stop them short of asymptote. Resistance to extinction and transfer. Psychological
The estimation of parameters has been a Review, 1963, 70, 162-179.
central problem of statistics for some time. Bush, R. R. Estimation and evaluation. In R. D. Luce,
The approaches evaluated in the present paper R. R. Bush, and E. Galanter (Eds), Handbook of
are not new. Bush (1963) estimated the num- mathematical psychology. New York: Wiley, 1963.
Pp. 429-469.
ber of sessions necessary to maximize the pre- Culler, E. and Girden, E. The learning curve in rela-
cision of estimates of the parameters for tion to other psychometric functions. American
single-operator linear learning models. Al- Journal of Psychblogy, 1951, 64, 327-349.
thouglh restricted to a single model (similar to Cummning, W. W. and Schoenfeld, W. N. Behavior
stability under extended exposure to a time-cor-
Equation 1), his apportioning of resources be- related reinforcement contingency. Journal of the
tween subjects and sessions adumbrates the Experimental Analysis of Behavior, 1960, 3, 71-82.
nomothetic philosophy. Kazdin (in Hersen and Dixon, W. J. and Massey, F. J., Jr. Introduction to
Barlow, 1976) evaluated the statistics availa- statistical analysis. New York: McGraw-Hill, 1957.
Erlebacher, A. Design and analysis of experiments
ble for "N of 1" researclh. These statistics are contrasting the within- and between-subjects manip-
essentially trend analyses, with major trends ulation of the independent variable. Psychological
removed, so that the residuals are uncor- Bulletin, 1977, 84, 212-219.
related and become amenable to analyses of Estes, W. K. Toward a statistical theory of learning.
variance. A provocative discussion of such ap Psychological Review, 1950, 57, 94-107.
Falk, J. L. Issues distinguishing idiographic from
proaches is provided by Michael (1974). Wald nomothetic approaches to personality theory. Psy-
(1947) and others have refined the theory of chological Review, 1956, 63, 53-62.
sequential statistical tests, in which sampling Fisher, R. A. The design of experiments. New York:
is continued until the ratio of positive to nega- Hafner, 1935/1953.
Greenwald, A. G. Within-subjects designs: To use or
tive outcomes passes one of two or more not to use? Psychological Bulletin, 1976, 83, 314-320.
boundaries. These "sequential probability ra- Hersen, M. and Barlow, D. H. Single case experi-
tio tests" are primarily useful for data whose mental designs: Strategies for studying behavior
parameters do not change as a function of change. New York: Pergammon, 1976.
Michael, J. Statistical inference for individual organ-
trials, but the associated graphical technique, ism research: mixed blessing or curse? Journal of
wherein the sampling is stopped when an op- Applied Behavior Analysis, 1974, 7, 647-653.
erating characteristic curve passes a boundary Polya, G. Patterns of plausible inference. Princeton:
line, is similar to the nomothetic tests outlined University Press, 1954.
above. Finally, R. A. Fisher's (1935/1953) em- Robertson, T. B. Principles of biochemistry. Ncw
York: Lea & Ferbiger, 1920.
phasis on maximizing the precision (defined Sidman, M. Tactics of scientific research. New York:
as the reciprocal of the variance) of a parame- Basic Books, 1960.
ter estimate for a fixed experimental cost indi- Skinner, B. F. A case history in scientific method.
cates a concern for efficiency that is funda- American Psychologist, 1956, 11, 221-233.
Skinner, B. F. The flight from the laboratory. In J. T.
mental to the present analyses. Wilson et al. (Eds), Current trends in psychological
In Experiment I, we were concerned with theory. Pittsburgh: University Press, 1958. Reprinted
a model of the learning curve, in Experiment in A. C. Catania's (Ed), Contemporary research in
II with a model of "visual stability tests", and operant behavior. Glenview: Scott, Foresman, 1968.
in Experiment III with a model of an efficient Stevens, J. C. and Savin, H. B. On the form-i of learn-
ing curves. Journal of the Experimlental Analysis of
experimenter, one committed to representative- Behavior, 1962, 5, 15-18.
ness as well as stability. The trend test formu- Wald, A. Sequential analysis. New York: Wiley, 1947.
lated in Experiment II captures the criteria Received 6 A ugust 1976.
that derive from current laboratory contingen- (Final acceptance I July 1977.)

You might also like