
JOURNAL OF THE EXPERIMENTAL ANALYSIS OF BEHAVIOR 1993, 60, 17-40 NUMBER 1 (JULY)

A SELECTIONIST APPROACH TO REINFORCEMENT


JOHN W. DONAHOE, JOSE E. BURGOS, AND DAVID C. PALMER
UNIVERSITY OF MASSACHUSETTS AT AMHERST AND SMITH COLLEGE

We describe a principle of reinforcement that draws upon experimental analyses of both behavior and the neurosciences. Some of the implications of this principle for the interpretation of behavior are explored using computer simulations of adaptive neural networks. The simulations indicate that a single reinforcement principle, implemented in a biologically plausible neural network, is competent to produce as its cumulative product networks that can mediate a substantial number of the phenomena generated by respondent and operant contingencies. These include acquisition, extinction, reacquisition, conditioned reinforcement, and stimulus-control phenomena such as blocking and stimulus discrimination. The characteristics of the environment-behavior relations selected by the action of reinforcement on the connectivity of the network are consistent with behavior-analytic formulations: Operants are not elicited but, instead, the network permits them to be guided by the environment. Moreover, the guidance of behavior is context dependent, with the pathways activated by a stimulus determined in part by what other stimuli are acting on the network at that moment. In keeping with a selectionist approach to complexity, the cumulative effects of relatively simple reinforcement processes give promise of simulating the complex behavior of living organisms when acting upon adaptive neural networks.
Key words: reinforcement, selectionism, neural networks, evolution, interpretation

This research was supported in part by a grant from the National Science Foundation, BNS-8409948, and a Biomedical Research Support Grant to the University of Massachusetts at Amherst. The authors thank Michael Bushe for assistance with some of the computer programming and Maria Morgan and Jay Alexander for preparation of some of the figures. The Appendix was prepared by the second author. Correspondence and requests for reprints may be addressed to John W. Donahoe, Department of Psychology, Program in Neuroscience and Behavior, University of Massachusetts, Amherst, Massachusetts 01003.

Within evolutionary biology, Darwin's great insight was that the complexity of species arose as the cumulative product of the repeated action of relatively simple biobehavioral processes, most notably those whose effects are functionally described by the principle of natural selection. More broadly conceived, Darwin's account was the first comprehensive proposal whereby higher level complexity could be interpreted as the "unintended" effect of the repeated action of lower level processes. Complexity was unintended in the sense that no external agencies or order-imposing principles were needed to oversee its production. Instead, complexity emerged as a by-product of the three-step sequence of variation of behavioral and morphological characteristics, selection by the environment of those characteristics that affected reproductive fitness, and retention of the selected variations via the mechanisms of heredity (cf. Campbell, 1974; Mayr, 1982). These retained characteristics were then available to contribute to the variation upon which subsequent selections acted, with complexity as a possible cumulative outcome.

Darwin's approach to complexity, selectionism, has since been explicitly pursued throughout biology and is implicit in accounts of complex phenomena in other natural sciences (cf. Campbell, 1974). For example, the formation of planetary systems as the result of gravitational and other processes acting on a swirling cloud of interstellar dust particles exemplifies selectionism (cf. Gehrz, Black, & Solomon, 1984). For a planet to orbit the sun, it must have just enough velocity tangent to its orbit to compensate for the tendency to fall toward the sun. If its velocity is too great, it escapes from orbit; if too low, it spirals into the sun. The planets achieve the velocities required to maintain their orbits, and those bodies with the requisite velocities are all that remain to be observed. Gravitational force produces organized complexity as a by-product, with most of the primordial matter collapsing into the sun and planets or escaping the solar system altogether.

Note, however, that although gravitation, together with other physical processes, may be sufficient to account for the formation of planets, the specific arrangement of planets that characterizes our solar system is not a necessary consequence of their action. The same processes, acting on different initial conditions and, hence, in different sequences, are competent to produce planetary systems of many different configurations. Moreover, the order that we now observe may not be stable, as, over time, matter assumes new orbits about the body that it orbits or reveals itself to be moving chaotically. Whether concerned with planetary systems or species, the cumulative product of selection processes may be not only complex but also diverse (cf. Donahoe & Palmer, 1989). The variation among species is particularly eloquent testimony to the diversity as well as the complexity of which selection processes are capable. (See Palmer & Donahoe, 1992, for a discussion of other characteristics of the products of selection.)

TOWARD A SELECTIONIST ACCOUNT OF BEHAVIORAL COMPLEXITY

From a selectionist perspective, a principle of reinforcement is central to an account of behavioral complexity. A well-formulated principle of reinforcement should bear the same relation to the emergence of complex behavior as the principle of natural selection bears to the emergence of complex morphology. That is, a principle of reinforcement should prove to be as fundamental and as fruitful to understanding the origins of complex behavior in individual organisms as the principle of natural selection has proved to be in understanding the origins of complex morphologies in species (Donahoe, Crowley, Millard, & Stickney, 1982).

On the behavioral level, experimental analysis leading to a principle of reinforcement seeks to identify the conditions under which the behavior of the individual organism comes to be guided by the environment. Indeed, experimental analysis has identified such conditions: the brief temporal intervals between the environmental, behavioral, and reinforcing events that define three-term contingencies (e.g., Ferster & Skinner, 1957; Skinner, 1938) and the evocation by the reinforcing stimulus of behavior that would not otherwise occur in that environment (Kamin, 1968, 1969; cf. Rescorla, 1969; Rescorla & Wagner, 1972). When appropriate temporal relations occur between these events in proximity to a reinforcer-induced behavioral change (i.e., a behavioral discrepancy), the environmental guidance of behavior is modified (Donahoe et al., 1982; Stickney & Donahoe, 1983).

Historical Reception of a Selectionist Approach

Environmental conditions have been identified under which selection by reinforcement occurs, and plausible interpretations based on those findings have been provided for a wide range of complex environment-behavior relations. Nevertheless, few outside the behavior-analytic community accept the proposition that complex behavior can be understood as the cumulative product of relatively simple reinforcement processes. Why has selection by reinforcement not been widely accepted as the best extant account of behavioral complexity, whereas natural selection has triumphed in its domain? The answer to this question is potentially important because an appreciation of the scope of a principle of reinforcement may depend upon the existence of circumstances analogous to those that preceded the acceptance of the principle of natural selection. What were those circumstances?

Darwin (and, independently, Alfred Wallace) proposed the principle of natural selection in 1859 as the central insight into what he called "that mystery of mysteries," the origin of species. What is insufficiently appreciated is that, although the notion of evolution was generally accepted by the scientific community, natural selection as the process whereby evolution occurred was not embraced until the 1930s (e.g., Dobzhansky, 1937), some 70 years later! This period of "the eclipse of Darwinism" has been discussed elsewhere (e.g., Bowler, 1983; Hull, 1973) and has been commented upon in this journal (Catania, 1987). Although a number of circumstances contributed to the acceptance of Darwinism, two are especially important. First was the rediscovery of Mendel's work, which led to the identification of the biological mechanisms, the genetic bases of heredity, through which Darwin's functional account could be realized. Second was the development of population genetics, with its more formal techniques (statistics as developed by Ronald Fisher, J. B. S. Haldane, and Sewall Wright and, much later, computer simulation; e.g., Maynard Smith, 1982) for tracing the course of selection. These more rigorous techniques provided a means for exploring the implications of natural selection that were more compelling than Darwin's verbal interpretations. The integration of the mechanisms of heredity with population genetics provided a persuasive account of evolution through natural selection and formed what is now known as the "modern synthesis" or the "synthetic theory" of evolution.

What lessons bearing on the acceptance of a principle of selection by reinforcement may be drawn from the record of the acceptance of natural selection? If a historical parallel holds, then the acceptance of a principle of selection by reinforcement awaits the identification of its biological mechanisms and the development of techniques for interpreting its implications that are more rigorous than verbal interpretation (cf. Donahoe & Palmer, 1989). This is not to say that either the identification of biological mechanisms or the development of more formal interpretive techniques is logically necessary in order for selection by reinforcement to be preferred to alternative accounts of behavioral complexity. Rather, the historical record suggests that both may be "psychologically" necessary for the general acceptance of reinforcement as the key insight into the origins of behavioral complexity. Accordingly, the experimental analysis of behavior should be supplemented (not replaced) by experimental analyses of the neurosciences, and the resulting synthesis should be interpreted using more formal techniques. The integration of behavioral and neuroscientific findings would constitute a new modern synthesis that might claim behavioral complexity as its domain just as the synthetic theory of evolution now claims morphological complexity. We refer to the synthesis of behavior analysis and the neurosciences as the biobehavioral approach (see Figure 1).

Some may regard a call for the integration of behavior analysis with the neurosciences as questioning one of Skinner's major accomplishments, the establishment of an independent science of behavior. This would be a misperception of our position and, more importantly, of Skinner's. As has been noted elsewhere (Donahoe & Palmer, 1989), to argue for an integration of the experimental analysis of behavior and physiology in no way undermines the independence of behavior analysis. Behavior analysis remains as independent of physiology as physiology is of biochemistry, and as interdependent as well.

Although Skinner was educated at least as much in the biology as the psychology of the time, he engaged in almost no experimental work at the physiological level. As someone committed to the study of behavior, he regarded the coordination of behavior and physiology as dependent on the prior establishment of a science of behavior (Skinner, 1938, pp. 423-424). Also, Skinner had little more confidence in the science of physiology available when he began his work than he did in the extant science of behavior. He viewed much of physiology and almost all of psychology as concerned with what he called the "Conceptual Nervous System" (Skinner, 1938, p. 421), that is, a "nervous system" whose structures and processes were inferred from observations at other levels of analysis, not the real nervous system that could be directly subjected to experimental analysis. In short, Skinner's lack of concern for things physiological was grounded in strategic and pragmatic considerations, not in principled reservations about the potential relevance of physiology to behavior, and, equally important, of behavior to physiology.

TOWARD A BIOBEHAVIORAL PRINCIPLE OF REINFORCEMENT

In keeping with the foregoing overview of the circumstances that preceded the acceptance of the principle of natural selection, the twin goals of the present paper are to formulate a principle of reinforcement informed by both behavioral and neuroscientific research and, then, to explore some of its implications using a more formal interpretive technique than verbal interpretation (i.e., computer simulation via adaptive neural networks).

A Unified Principle of Reinforcement

As already noted, the experimental analysis of behavior has identified two sets of conditions that are required for selection by reinforcement: (a) brief temporal intervals between the environmental, behavioral, and reinforcing events of the three-term contingency and (b) a reinforcing stimulus that evokes a behavioral change or discrepancy. On this view, the sensitivity of organisms to relations defined over
[Figure 1 appears here: a diagram, titled "Biobehavioral Approach," linking Environment, Intra-Organismic Events, and Observed Behavior, with arrows labeled "Inferences" connecting the levels of analysis.]

Fig. 1. A biobehavioral approach to the analysis and interpretation of complex behavior. The experimental analysis of behavior, which is concerned with the effects of environmental manipulations on behavior, is supplemented by the experimental analysis of physiology, which is concerned with the effects of intraorganismic manipulations on intraorganismic events and behavior.

longer time intervals (so-called molar relations or correlations) is the cumulative effect of moment-to-moment relations among environmental, behavioral, and reinforcing events (Donahoe et al., 1982; Rescorla & Wagner, 1972). That is, sensitivity to correlation is the emergent product of sensitivity to contiguity.

This molecular view is consistent with Skinner's approach to selection by reinforcement. For example, in Schedules of Reinforcement (Ferster & Skinner, 1957; cf. Skinner, 1981), it states:

    A more general analysis is possible which ... answers the question of why (emphasis in original) a given schedule generates a given performance.... It does this by a closer analysis of the actual contingencies of reinforcement prevailing under any given schedule.... The only contact between [the schedule] and the organism occurs at the moment (emphasis added) of reinforcement.... Under a given schedule of reinforcement, it can be shown that at the moment of reinforcement a given set of stimuli will usually prevail. A schedule is simply a convenient way of arranging this. (pp. 2-3)

The phrase "moment of reinforcement" appears at several other points in the introduction to the study of reinforcement schedules, and many examples of moment-to-moment analyses appear in discussions of the behavioral effects of various schedules (see also Morse, 1966; Skinner, 1938, 1948).

Again, none of this denies the existence of molar regularities (e.g., between mean response rate and mean reinforcement rate, as in the matching law; Herrnstein, 1970). Nor
should the emphasis upon moment-to-moment contingencies be seen as reducing the contribution that an appreciation of molar regularities makes to the verbal interpretation of behavior (e.g., the implication of the matching law that dysfunctional behavior can be reduced by reinforcing alternative responses rather than punishing the dysfunctional behavior). Instead, as already noted, the present view affirms that these molar regularities are the cumulative and emergent products of moment-to-moment relations among events defined over brief time intervals. Consistent with Skinner's advocacy of an analysis of molecular contingencies, recent theoretical work has progressively moved toward moment-to-moment interpretations of previously uncovered molar regularities (e.g., Herrnstein, 1982; Heth, 1992; Hinson & Staddon, 1983; Shimp, 1969; Silberberg, Hamilton, Ziriax, & Casey, 1978; Staddon & Hinson, 1983; Vaughan, 1981). Although future work must determine whether molecular accounts can encompass all molar regularities (cf. Nevin, 1979; B. Williams, 1990), current research suggests that many molar regularities fall within the reach of moment-to-moment analyses.

Operant-respondent distinction. Paradoxically, the consistent application of a moment-to-moment analysis undermines the cogency of the distinction between operant and respondent conditioning as fundamentally different types of conditioning requiring different principles for their understanding (cf. Skinner, 1935b, 1937). Consider the prototypical respondent conditioning procedure of Pavlov, in which the ticking of a metronome was paired with the introduction of meat powder into the mouth of a dog. Although the reinforcing stimulus (here, meat powder) occurred independently of behavior, it is inescapably true that some behavior must necessarily have preceded the reinforcing stimulus. For example, the dog's ears might prick up or its head turn toward the sound of the metronome immediately before receiving the meat powder. As Schoenfeld has emphasized, reinforcers are necessarily introduced into an ongoing "stream" of behavior (e.g., Schoenfeld et al., 1972). Thus, although Pavlov's dogs need not have behaved in any particular manner prior to receiving the reinforcing stimulus, they were nevertheless behaving in some manner, even if they were standing perfectly still. Responses as well as stimuli necessarily precede reinforcers in a respondent procedure although, over time, only stimuli reliably precede the reinforcer.

Similarly, in the prototypical operant procedure of Skinner, lever pressing preceded food, but some stimulus must necessarily have been sensed immediately prior to the reinforcer. (For an insightful examination of Skinner's abandonment of the stimulus in his treatment of operant conditioning, see Coleman, 1981, 1984.) For example, just prior to lever pressing the rat might have seen the lever, or smelled some odor within the chamber, or sighted the houselight while attempting to climb out of the chamber and, in so doing, "inadvertently" emitted the criterion response. Although the rat need not have sensed any particular stimulus prior to pressing the lever and receiving the food, some stimulus must have been sensed immediately prior to the response and the reinforcer. Thus stimulus events necessarily precede the reinforcer in the operant procedure although, over time, only responses reliably precede the reinforcer. (Of course, specific stimuli may be scheduled to precede the reinforcer in a discriminated operant procedure.)

What is crucial to note in the foregoing description of the procedures that implement respondent and operant contingencies is that at the moment when food occurs (i.e., at the moment of reinforcement), both procedures necessarily contain a sequence of the same types of events: stimulus, response, and reinforcer. In both procedures some stimulus must have been sensed and some response must have occurred prior to the reinforcer (see Figure 2). Although the experimenter manipulates different stimuli and measures different responses in respondent and operant procedures, the learner is exposed to a similar sequence of events in either case (Donahoe et al., 1982). If comparable momentary sequences of events occur in the two procedures and if selection by reinforcement is governed by moment-to-moment relations between events, then the basis upon which a principle of reinforcement may differentiate between the procedures is eliminated. The two procedures cannot require different "laws of learning" because, even if different laws existed, no basis would exist at the moment of selection with which the organism could "decide" which set of laws to invoke. This does not mean that the two procedures may not have very different cumulative effects
[Figure 2 appears here: a diagram of two parallel event streams, ENVIRONMENT (S1, S2, S3, ..., Si, ..., Sm, ...) and BEHAVIOR (..., R3, ..., Rj, ..., Rn, ...), with wavy arrows marking the classical (US-UR) and operant contingencies between the streams.]

Fig. 2. An organism is immersed in a stream of environmental events, or stimuli (S), in whose presence the organism is continuously behaving, or responding (R). With a respondent, or classical, contingency the occurrence of an eliciting, or unconditioned, stimulus (US) is contingent on an environmental event, but some behavioral event must necessarily precede the US. With an operant, or instrumental, contingency the occurrence of the US is contingent on a behavioral event, but some environmental event must necessarily precede the US. In both cases, the US functions as the putative reinforcing stimulus. (The two types of contingencies are indicated by wavy lines with arrows.) In the operant procedure, the responses that are candidates for control by the environment include both the operant (R) and the elicited, or unconditioned, response (UR).
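The event streams depicted in Figure 2 can be sketched in code. The following toy example is our own illustration, not the authors' simulation; the event names are hypothetical stand-ins for the metronome and lever-press examples discussed in the text. It shows the structural point of the figure: at the moment of reinforcement, the respondent and operant procedures both present the same sequence of event types.

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str   # "S" (stimulus), "R" (response), or "US" (reinforcer)
    name: str

# Respondent (classical) procedure: the US is contingent on a stimulus
# (metronome), yet some response (e.g., an ear prick) still precedes it.
respondent_stream = [
    Event("S", "metronome"), Event("R", "ear_prick"), Event("US", "food"),
]

# Operant procedure: the US is contingent on a response (lever press),
# yet some stimulus (e.g., sight of the lever) still precedes it.
operant_stream = [
    Event("S", "sight_of_lever"), Event("R", "lever_press"), Event("US", "food"),
]

def kinds_at_reinforcement(stream):
    """Return the sequence of event kinds up to and including the US."""
    us_index = next(i for i, e in enumerate(stream) if e.kind == "US")
    return [e.kind for e in stream[: us_index + 1]]

# At the moment of reinforcement, both procedures contain the same
# sequence of event types: stimulus, response, reinforcer.
assert kinds_at_reinforcement(respondent_stream) == ["S", "R", "US"]
assert kinds_at_reinforcement(operant_stream) == ["S", "R", "US"]
```

Only the experimenter-arranged contingency differs between the two streams; the learner's momentary sequence of event types is the same in either case, which is the premise of the unified reinforcement principle developed in the text.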

on behavior because of the differences in which events reliably precede the reinforcer in the two procedures. However, it does mean that a commitment to a moment-to-moment analysis requires the formulation of a common principle of reinforcement that is competent to produce the different behavioral outcomes of respondent and operant contingencies as its emergent product.

To repeat, a commitment to a moment-to-moment analysis does not imply that the cumulative effect of selection by reinforcement is the same for the two contingencies. Quite the contrary, as Skinner consistently pointed out: the net effects of operant and respondent contingencies differ in ways that have profound implications for the emergence of complex behavior, and we concur with that view. Most importantly, responses from the full behavioral repertoire of the organism are candidates for selection by reinforcement with an operant contingency. By contrast, with a respondent contingency the eligible responses are confined to those that are already elicited by specific environmental stimuli.

A single reinforcement principle, the unified reinforcement principle, has been proposed to accommodate conditioning with both respondent and operant contingencies while also yielding, as its cumulative product, the emergence of the different behavioral outcomes that typify the two contingencies (Donahoe et al., 1982). The unified reinforcement principle holds that whenever a behavioral discrepancy occurs, other things being equal, all those stimuli preceding the discrepancy will acquire control over all those responses occurring immediately prior to and contemporaneous with the discrepancy. In a respondent contingency, the most reliably occurring stimulus is the conditioned stimulus (e.g., the sound of the metronome) and the most reliably occurring responses are orienting responses (e.g., turning toward the source of the tone) and unconditioned responses (e.g., salivation). In a discriminated operant contingency, the most reliably occurring stimulus is the discriminative stimulus (e.g., the presence of a tone only during periods when lever pressing produces food) and the most reliably occurring response is, in addition to the orienting response to the discriminative stimulus and the unconditioned response to the reinforcing stimulus, the operant itself (Donahoe et al., 1982; see also Mackintosh, 1983; Staddon, 1983). The net outcome of either contingency depends on the interactions, if any, among the various responses and whatever constraints arise from natural selection (so-called "biological constraints on learning"). The unified reinforcement principle is based upon the experimental analysis of behavior and provides the conceptual framework that we shall supplement with findings from the experimental analysis of the neurosciences.

Behavioral Constraints on a Principle of Reinforcement

In addition to the requirement that a single principle of reinforcement accommodate conditioning with both respondent and operant contingencies, behavioral considerations impose several additional constraints upon a biobehavioral account of reinforcement. First, a selection principle must, in general, permit the stimuli and responses that enter into selected environment-behavior relations to include as candidates a wide range of the stimuli and responses preceding the reinforcer. Without maintaining the eligibility for selection of a wide range of stimuli and responses, relatively arbitrary environment-behavior relations could not be acquired, with a corresponding limitation on the potential complexity of relations that could be selected. Although maintaining a potentially large pool of candidate stimuli and responses facilitates the selection of complex environment-behavior relations, the effect of this constraint also allows the acquisition of superstitions of both the first (Skinner, 1948) and second kinds (Morse & Skinner, 1957). That is, the selected relation may include response topographies containing responses that are not necessary to produce the reinforcer and stimulus complexes containing stimuli that do not necessarily precede the reinforcer. Selection by reinforcement no more ensures the isolation of the necessary and sufficient conditions for reinforcement in any given situation than the principle of natural selection ensures the reproductive fitness of any given individual. It is not in the nature of selection processes to guarantee the parsing of events into necessary and sufficient relations.

A second and countervailing behavioral constraint is that the cumulative effect of selection by reinforcement must, more often than not, converge on stimulus and response classes whose members are most reliably correlated with the reinforcer. That is, in general, the cumulative effect of selection must yield a valid theory of the world in which the organism lives. If this constraint were not satisfied, natural selection could not have favored the neural mechanisms mediating selection by reinforcement. Thus a principle of reinforcement may not be immune to superstitions but, at the same time, it must, more often than not, have the cumulative effect of veridically extracting the correlations between environmental and behavioral events that precede reinforcers (cf. Stone, 1986). Note that, given these boundary constraints, the product of selection by reinforcement is a relation between classes of stimuli and classes of responses, as Skinner had anticipated (Skinner, 1935a). Moreover, these classes are "fuzzy" classes, in that no one member of either class may be a necessary component of the selected environment-behavior relation. (See Palmer & Donahoe, 1992, for a discussion of the far-reaching implications of Skinner's insight regarding reinforcement as the selection of relations between classes.)

Acquired reinforcement. The final behavioral constraint considered here is that a reinforcement principle must accommodate the selecting effect of acquired as well as unconditioned reinforcers. The term acquired reinforcer is used in preference to either conditioned (secondary) reinforcer for operant contingencies or higher order conditioning for respondent contingencies because these latter terms imply a fundamental distinction between the theoretical treatments of the two contingencies, a distinction that is not honored by the unified reinforcement principle. Procedurally, respondent and operant conditioning are undeniably distinguishable; conceptually, the same reinforcement principle seeks to encompass both.

From the present perspective, any stimulus that evokes a change in behavior (i.e., a behavioral discrepancy) can potentially function as a reinforcer to select environment-behavior relations. No fundamental distinction is made between stimuli that evoke behavior because of prior selection by the ancestral environment, as with unconditioned stimuli, and stimuli that evoke behavior because of prior selection by the individual environment, as with conditioned and discriminative stimuli. Regardless of the origin of their ability to evoke responses, stimuli function as reinforcers to the extent that their occurrence produces a change in ongoing behavior.

The central observation affirming a critical
role for behavioral discrepancy in selection by reinforcement is the phenomenon of blocking. In blocking, a stimulus standing in a favorable temporal relation to a putative reinforcer will not acquire the capacity to function as either a conditioned stimulus (Kamin, 1968, 1969) or a discriminative stimulus (vom Saal & Jenkins, 1970) if that stimulus is paired with the reinforcer in the presence of a second stimulus that already evokes the reinforcer-elicited response. For example, when a light is presented in simultaneous compound with a tone and both stimuli are followed by food, the light will not come to evoke salivation if the tone has previously been paired with food. Because the tone already evokes salivation as a result of prior conditioning, no change in behavior occurs when salivation is also evoked by food in the presence of the light-tone compound stimulus. Consequently, selection by the putative reinforcer cannot occur, and the light does not acquire control over salivation.

Because the blocking procedure prevents a stimulus from functioning as either a conditioned or a discriminative stimulus, the blocked stimulus should also be prevented from functioning as an acquired reinforcer. Experimental analysis confirms that when the discriminative function of a stimulus has been blocked, its ability to function as an acquired reinforcer is also blocked, even though the stimulus has been paired with an unconditioned reinforcer many times (Palmer, 1987). Thus both acquired and unconditioned reinforcers must evoke behavioral change if they are to function as reinforcers. This view is consistent with earlier behavior-analytic work indicating that acquired (conditioned) reinforcers must first have the status of discriminative stimuli (e.g., Keller & Schoenfeld, 1950; cf. Dinsmoor, 1950; Thomas & Caronite, 1964). Whether a stimulus that evokes behavior functions as a reinforcer for a particular environment-behavior relation depends, for both acquired and unconditioned reinforcers, upon any interactions between the response evoked by the putative reinforcing stimulus and the operant upon which it is contingent (e.g., Long, 1966; cf. Donahoe et al., 1982).

Neuroscientific Constraints on a Principle of Reinforcement

[...] be constrained by relevant neuroscientific findings. A moment-to-moment account of reinforcement at the behavioral level is congenial to supplementation by a neuroscientific account because, whatever their nature, the neuronal changes mediating reinforcement necessarily occur on a moment-to-moment basis. As Skinner (1938) stated, "I agree with Carmichael [1936] that 'those concepts which do not make physiological formulation impossible and which are amenable to growing physiological knowledge are preferable, other things being equal, to those that are not so amenable'" (p. 440).

What is known about the neural systems and cellular processes mediating reinforcement? Although a comprehensive answer to this question is beyond the scope of the present paper and some important questions remain unanswered (cf. Donahoe & Palmer, in press; Krieckhaus, Donahoe, & Morgan, 1992), the major outlines may be given here.

Neural selection of environment-behavior relations. When neurons whose cell bodies are in the ventral tegmental area (VTA, see Figure 3) are electrically stimulated after an operant has occurred, the strength of the operant is increased. Thus VTA stimulation functions as a reinforcer. Moreover, experimental work indicates that VTA neurons are activated by environmental stimuli that commonly function as unconditioned reinforcers, such as the smell and taste of food (see Hoebel, 1988; Trowill, Panksepp, & Gandelman, 1969, for reviews). As shown schematically in Figure 3, axons from VTA neurons diffusely project throughout the motor association areas of the frontal lobes (Fallon & Loughlin, 1987; Swanson, 1982). The neuromodulator dopamine (DA) is released by VTA fibers, and, because of their widespread projections, dopamine is positioned to affect synaptic efficacies throughout the frontal lobes. The affected synapses are (among others) between presynaptic neurons carrying sensory information from the parietal-temporal-occipital lobes of the cortex that are activated by environmental events (e.g., S_VISUAL in Figure 3) and postsynaptic neurons in the frontal lobes that lead ultimately to behavior. Cellular research has shown that the introduction of dopamine into synapses im-
Principle of Reinforcement mediately after a postsynaptic neuron has been
In addition to behavioral constraints, a bio- activated by a presynaptic neuron produces
behavioral principle of reinforcement must also long-lasting changes in synaptic efficacies. That
SELECTION BY REINFORCEMENT 25

is, the ability of the presynaptic neuron to ini-


tiate activity in the postsynaptic neuron is in- BEHAVIOR 4-
creased (Stein & Belluzzi, 1988, 1989; see also
Beninger, 1983; Bliss & Lomo, 1973; Hoebel,
1988; Irikl, Pavlides, Keller & Asanuma, 1989;
Kety, 1970; Levy & Desmond, 1985; Wise,
1989; Wise & Bozarth, 1987). Because pre- $VISUAL
synaptic neurons may be activated by sensory
input and postsynaptic neurons may lead to
behavior, the cumulative effect of the intro- AUDITORY
duction of dopamine into synapses between
coactive neurons in the motor association cor-
tex is to select a neural system that mediates
environment-behavior relations scheduled for
reinforcement.' Fig. 3. A schematic diagram of the proposed cortical
Because dopamine is released into synapses systems of the brain involved in modifying synaptic effi-
between many coactive neurons, some of the cacies in sensory and motor cortices. Stimuli from various
affected synapses may mediate environment- sensory channels (e.g., SVISUAL and S.AU,)DITORY) activate neu-
behavior relations other than those necessary mary rons in the primary sensory cortex. Axons from the pri-
cortex ultimately end in synapses on neurons
for the reinforcer. However, among the co- in thesensory motor association cortex (e.g., A and V) and on
active neurons must be those that mediated the polysensory neurons in the sensory association cortex. Ax-
reinforced relation, because, without the oc- ons from polysensory cells ultimately end in synapses on
currence of the criterion response and the stim- neurons in the motor association cortex (e.g., AV) and also
uli present when it was emitted, the reinforcing on cells within the hippocampus (HIPP). Hippocampal
CAl neurons project diffusely via multisynaptic pathways
stimulus that activated the VTA neurons would to the sensory association cortex. Within the motor asso-
never have occurred. And, because it is syn- ciation cortex, the axons of some neurons ultimately ac-
apses between these neurons into which do- tivate effectors that lead to behavior. Axons from some
pamine most reliably diffuses, these synapses to neurons within the motor association cortex project back
the ventral tegmental area (VTA) by means of the
are, cumulatively, most substantially modified. medial forebrain bundle (MFB). VTA neurons are acti-
In short, the proposed neural system for re- vated by unconditioned reinforcing stimuli and project
inforcement operates on a moment-to-moment diffusely to the motor association cortex and to the synapses
basis but has the cumulative effect of modi- of CAl hippocampal neurons. (See the text for the func-
fying those synaptic efficacies that most reli- tioning of these neural systems in the selection of pathways
mediating environment-environment and environment-
ably mediate reinforced environment-behavior behavior relations.)
relations. Such a neural system of reinforce-
ment can fall prey both to stimulus and re-
sponse superstitions (as when irrelevant sen-
The discussion has focused on the strengthening of sory inputs
or response outputs occur prior to
connections in the cerebral cortex. In the cortex and some the reinforcer), but it cumulatively selects those
subcortical areas (such as the nucleus accumbens), the environment-behavior relations that most re-
levels of dopamine are phasically increased when rein- liably precede reinforcers.
forcers occur. In other subcortical areas, and in the evo- Acquired reinforcement. Together, the neu-
lutionarily older regions of the brain generally, dopamine ral systems and cellular mechanisms just de-
appears to be tonically present. Most neuroscientific anal-
yses of conditioning in mammals have been conducted with scribed provide an account whereby uncon-
preparations in which simple conditioning does not require ditioned stimuli that activate VTA neurons
an intact cortex (e.g., nictitating membrane conditioning function as reinforcers to select environment-
in the rabbit; Thompson, 1990). Accordingly, the contri- behavior relations. However, such a neural
bution of dopamine to changes in synaptic efficacies is less
obvious because levels are more stable (cf. Lowel & Singer, system cannot, by itself, accommodate ac-
1992; Singer & Rauschecker, 1982). However, when the quired reinforcers. Acquired reinforcers vary
action of dopamine is blocked by appropriate receptor with the particulars of the organism's selection
antagonists in these preparations, its effects are consistent history and, therefore, natural selection cannot
with the proposal presented here (e.g., Gimpl, Gormezano, provide pathways to the VTA that are "ap-
& Harvey, 1979; Marshall-Goodell & Gormezano, 1991).
That is, the acquisition of conditioned responses is im- propriate" for these diverse and changing his-
paired. tories. Natural selection can provide pathways
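The cellular account of reinforcement sketched in the preceding paragraphs (coactive pre- and postsynaptic units plus a diffusely broadcast dopamine signal) lends itself to a compact illustration. The following is an interpretive sketch of such a three-factor update, not the learning rule used in the paper (the authors' equations appear in the Appendix); the function name, learning rate, and array shapes are illustrative assumptions.

```python
import numpy as np

def three_factor_update(w, pre, post, da, lr=0.1):
    """Reinforcement-modulated ("three-factor") Hebbian update.

    w    : array of connection weights, shape (n_post, n_pre)
    pre  : presynaptic activations in [0, 1], shape (n_pre,)
    post : postsynaptic activations in [0, 1], shape (n_post,)
    da   : scalar broadcast reinforcement signal (simulated dopamine);
           positive when a reinforcer occurs, zero otherwise
    """
    # A weight changes only where its pre- and postsynaptic units are
    # coactive AND the diffuse signal is nonzero: da * post * pre.
    return w + lr * da * np.outer(post, pre)

# Ten reinforced pairings: only the synapse whose input unit was active
# when the reinforcer occurred is strengthened.
w = np.zeros((1, 2))
pre = np.array([1.0, 0.0])   # first input active, second inactive
post = np.array([1.0])       # output unit active
for _ in range(10):
    w = three_factor_update(w, pre, post, da=1.0)
```

Because the broadcast signal multiplies the whole update, synapses between coactive units are eligible for change only at moments when the reinforcer occurs, which is the selective character of the proposed system.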
Natural selection can provide pathways that have the potential, when acted upon by selection by unconditioned reinforcers, to strengthen differentially the synaptic efficacies along pathways that implement acquired reinforcement. As shown in Figure 3, some of the neurons in the frontal association cortex have axons that project back to the VTA via the medial forebrain bundle (MFB) (Shizgal, Bielajew, & Rompre, 1988; Yeomans, 1988, 1989). If synaptic efficacies to these "feedback" neurons are increased by the action of unconditioned (or previously acquired) reinforcers, then discriminative stimuli become able to function as acquired reinforcers. These stimuli function as acquired reinforcers because they activate VTA neurons through feedback from pathways arising from the frontal lobes rather than through the phylogenetically selected pathways used by unconditioned reinforcers.

As a result of the foregoing process, stimuli that are constituents of previously selected environment-behavior relations lead not only to the emission of behavior but also to the engagement of the neural mechanisms of acquired reinforcement. In this manner, acquired reinforcement facilitates the acquisition of new environment-behavior relations. For example, if backward chaining is used to establish a component of a long behavioral sequence, the stimuli in the first component activate the acquired reinforcement system and thereby function as reinforcers for responses in the second component. The proposed neural system for acquired reinforcement is consistent with findings from the experimental analysis of behavior: Stimuli that function as acquired reinforcers also function as discriminative stimuli.

Neural selection of environment-environment relations. The account of the neural mechanisms of unconditioned and acquired reinforcement of environment-behavior relations assumes that activity arising from sensory regions of the brain is sufficient to distinguish environments in which a given activity is reinforced from those in which it is not reinforced (or differently reinforced). For reasons more fully described elsewhere (Donahoe & Palmer, in press), this may not always be the case. For example, consider a rat for which lever pressing is reinforced during the co-occurrence of an auditory and a visual stimulus, but is not reinforced when either stimulus occurs alone or when both are absent. Such situations define stimulus patterning or configural conditioning in behavioral research (e.g., Woodbury, 1943; cf. Kehoe, 1988) and exclusive-OR problems (the simplest nonlinearly separable problem) in artificial intelligence research (Rumelhart, Hinton, & Williams, 1986). Some neural mechanisms must exist whereby moment-to-moment processes may cumulatively allow the organism to be sensitive to the correlations among environmental events (e.g., between lights and tones) as well as between environmental and behavioral events. That is, both environment-behavior relations and the environment-environment relations upon which some environment-behavior relations depend must be selected. What neural systems implement the selection of environment-environment relations?

As shown schematically in Figure 3, neuroanatomical studies indicate that axons from neurons in the sensory association cortex, in addition to innervating motor areas of the brain, initiate activity in pathways that provide inputs to the hippocampus. Upon entering the hippocampus, activity is initiated among hippocampal neurons, and ultimately the CA1 hippocampal neurons are activated. Axons from CA1 neurons constitute the major output of the hippocampus and are the origins of multisynaptic pathways that project diffusely back to the sensory association cortex from which the inputs to the hippocampus arose (Amaral, 1987). Because of this arrangement, the output of CA1 neurons is positioned to modulate the functioning of cells throughout the sensory association cortex (Donahoe & Palmer, in press; Krieckhaus et al., 1992).

We propose that diffuse feedback from the hippocampus exerts a neuromodulatory effect on synaptic efficacies in the sensory association cortex that is analogous to the effect of the diffuse VTA-derived reinforcing system on the motor association cortex. That is, diffuse hippocampal feedback strengthens synaptic efficacies between coactive pre- and postsynaptic neurons. If this is the case, then the most reliably affected synapses are those whose activity is correlated with the output of CA1 cells. For example, suppose that an auditory and a visual stimulus occur together, and that their co-occurrence causes a hippocampal output (see Figure 3).
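The patterning problem described in this section can be illustrated numerically: neither stimulus alone is perfectly correlated with reinforcement, but a polysensory unit that is active only when the two stimuli co-occur is. The snippet below is a deliberately simplified stand-in for the hippocampally mediated selection process, not the authors' simulation; all names and values are illustrative.

```python
import numpy as np

# Four trial types as (A, V) activations; the reinforcer occurs only
# when the auditory and visual stimuli co-occur (positive patterning).
trials = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
reinforced = np.array([0.0, 0.0, 0.0, 1.0])

# A polysensory "AV" unit, strengthened whenever its two inputs are
# coactive, comes to fire only on compound trials.
av = trials[:, 0] * trials[:, 1]

def correlation(unit):
    """Correlation between a unit's activity and reinforcement."""
    return float(np.corrcoef(unit, reinforced)[0, 1])

corr_a = correlation(trials[:, 0])   # auditory stimulus alone
corr_v = correlation(trials[:, 1])   # visual stimulus alone
corr_av = correlation(av)            # polysensory co-occurrence unit
```

Only the polysensory unit's activity tracks the reinforcer perfectly, which is why a response-selection component supplied with such a unit can mediate the patterned discrimination.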
The diffusely projected output of the hippocampus would increase synaptic efficacies to polysensory "audio-visual" cells in the sensory association cortex. As a result, these polysensory cells would become more strongly polysensory (i.e., more reliably activated by the co-occurrence of stimuli from different sensory channels). The cumulative effect of this process is that synaptic efficacies of polysensory cells are modified to reflect the correlations between environmental events. In short, the connectivity of the sensory association cortex is altered to permit the mediation of environment-environment relations.

Thus far, the hippocampal-derived diffuse projection system selects connections in the sensory association cortex to mediate environment-environment relations, and the VTA-derived diffuse projection system selects connections in the motor association cortex to mediate environment-behavior relations. The selection of these two types of relations must be coordinated so that the correlation between environmental events is most appreciated at those times when responses are followed by reinforcers. How is this accomplished? At the neural level, coordination is implemented by axons from VTA neurons that project to synapses of CA1 hippocampal neurons (Swanson, 1982). Dopamine from these axons is known to increase the ability of CA1 cells to be driven by their inputs (Stein & Belluzzi, 1988, 1989). Thus when responses are followed by reinforcers, the diffuse hippocampus-derived neuromodulatory signal is strongest and environment-environment relations preceding the responses are most rapidly selected. For example, if the co-occurrence of auditory and visual stimuli reliably precedes a reinforced response, the motor association cortex would be provided with inputs whose activity signaled the co-occurrence of auditory and visual stimuli (AV) as well as the separate occurrences of auditory (A) and visual (V) stimuli (Figure 3). The implications of this account for the interpretation of phenomena such as "perceptual learning," "latent learning," the formation of equivalence classes, and phoneme-grapheme correspondences in verbal behavior are described elsewhere (Donahoe & Palmer, in press).

In summary, experimental analyses of behavior and the neurosciences are converging upon a powerful conception of a principle of reinforcement. By means of widely broadcast neural systems for implementing reinforcement, the eligibility for selection is maintained for the full range of stimuli that can be sensed and responses that can be emitted. Organisms equipped with diffuse neural systems for implementing selection by reinforcement appear to be well equipped to acquire complex environment-behavior relations and the environment-environment relations upon which such relations sometimes depend. Tracing the implications of a principle of reinforcement is not a simple task, however, and requires equally powerful techniques of interpretation. We now turn to one such technique, adaptive neural networks (Donahoe & Palmer, 1989).

INTERPRETATION OF REINFORCEMENT VIA SELECTION NETWORKS

Complex behavior is the product of such a prolonged history of selection that experimental analysis is often precluded. Faced with impediments to experimental analysis, other historical sciences have supplemented experimental analysis with interpretation. Scientific interpretation differs from mere speculation in that interpretation makes use only of principles that are derived from independent experimental analyses. New principles are never uncovered through interpretation, although interpretation may reveal new implications of existing principles (cf. Donahoe & Palmer, 1989, in press).

The interpretive technique used here to explore the implications of a reinforcement principle is computer simulation via adaptive neural networks (cf. Donahoe & Palmer, 1989; Kehoe, 1989). An adaptive neural network is an interconnected set of units whose characteristics are constrained by findings from experimental analyses of the neurosciences. If biobehaviorally constrained computer simulations yield results that are consistent with empirical observations and do not yield inconsistent results, then the principles that inform the simulations are accepted (with the tentativeness accompanying all conclusions in science) as explanations of the phenomena. Adaptive neural networks are not "conceptual nervous systems" in Skinner's sense because their characteristics are constrained by experimental analyses of directly observed structures and processes. Their characteristics are not the result of inferences from observations at one level (the behavioral level) to processes at another level (the neural level).

The adaptive neural networks used in the following simulations contain input units that simulate the state of the environment, interior (often called "hidden") units, and output units that simulate the behavior of the network. When environmental events occur, they activate input units; this activation probabilistically activates other interior or output units to which they are connected. The units and their connections define the architecture of the network. In the simulations described here, the architecture and functioning of the reinforcement system reflect what is known about diffuse neural systems in modifying synaptic efficacies, or connection weights, between units.2 Because the cumulative action of the diffuse reinforcement systems selects those connections that most reliably mediate reinforced input-output relations, we call such networks selection networks (Donahoe & Palmer, in press). A selection network is composed of two subnetworks: (a) a response-selection component that mediates environment-behavior relations and simulates the VTA-motor cortex system and (b) a stimulus-selection component that mediates environment-environment relations and simulates the hippocampal-sensory cortex system. The output of the stimulus-selection component serves as input to the response-selection component and thereby enriches the stimulus patterns guiding behavior. The main architectural features of a selection network are shown in Figure 4.

The functioning of a selection network may be summarized as follows. Simulated environmental events activate the input units of the stimulus-selection component. Within the stimulus-selection component, co-occurrences of environmental events selectively strengthen connections between simultaneously activated input and polysensory units through simulation of the effect of the diffusely projecting hippocampal system. The selection of pathways that mediate environment-environment relations proceeds more rapidly when co-occurrences are followed by reinforcers that activate the simulated VTA system and amplify the diffuse hippocampal signal. Within the response-selection component, reinforcers selectively strengthen connections between inputs coming from the stimulus-selection component and simultaneously active units in the response-selection component through simulation of the diffusely projecting VTA system. The cumulative effect of these coordinated selection processes forges functional pathways through the selection network that mediate relations between events that are reliably present prior to reinforcers. (Equations defining the activation rule, whereby one unit activates another, and the learning rules, whereby the connection weights between units are modified, are given in the Appendix.)

2 The remaining architectural features are simplified in these simulations because the focus is upon the reinforcement system. In more complete simulations, the architecture is constrained by other neuroanatomical research or is the product of the simulation of neural development by means of what are called genetic algorithms (e.g., Davis, 1991). A genetic algorithm simulates the effect of natural selection on network architecture, whereas a learning algorithm, which is the focus here, simulates the effect of selection by reinforcement on synaptic efficacies.

Simulation of Classical and Operant Contingencies

The unified reinforcement principle, when simulated in a selection network, should enable a network to mediate the environment-behavior relations characteristically produced by respondent and operant contingencies. Figure 5 depicts two response-selection components of identical architecture but with different contingencies for the presentation of the reinforcer. (The simulated procedures do not require the selection of environment-environment relations because the inputs to the stimulus-selection component are sufficient to distinguish environments in which reinforcers occur from those in which they do not. Accordingly, except where indicated, the simulations that follow use the response-selection component only.) In the upper network, a respondent (or classical) procedure is illustrated: The reinforcing stimulus (the unconditioned stimulus, or US) is contingent upon an environmental event (the conditioned stimulus, or S1). The respondent contingency is indicated by the broken line. In the lower network, an operant procedure is illustrated: In the presence of a given environment, S1, the reinforcing
stimulus (here also symbolized by US) is contingent on a behavioral event (activation of the operant unit, or R). The broken line now indicates an operant contingency.

Fig. 4. The architecture of a selection network, an adaptive neural network whose structure simulates the neural systems mediating reinforcement in the brain. Shown are the diffusely projecting lateral hypothalamic (LH) and ventral tegmental area (VTA) systems that select pathways mediating environment-behavior relations and the diffusely projecting hippocampal system that selects pathways mediating environment-environment relations.

In both simulations, whenever a reinforcing stimulus evoked an increase in the elicited response (an increase in the activation of the UR/CR unit), a reinforcing signal that was proportional to the increase was broadcast throughout the network. Momentary increases in the activation of the UR/CR unit strengthened connection weights between all active units, whether in the respondent or operant procedure. (Simulations in which connection weights are modified, or "updated," several times within a "trial" are known as "real-time" simulations. All simulations reported here are real-time simulations in which the connection weights were updated 10 times per trial and any reinforcer occurred on the 10th time step. Multiple updates within a trial simulate the continuously changing activity that ensues upon the presentation of a stimulus and the propagation of that activity through an interconnected set of neurons.)

In the classical procedure, the coactivated units, whose connections are the only ones eligible for strengthening at a given moment, necessarily include those interior units activated by the conditioned stimulus (S1) input unit and the output unit simulating the unconditioned (and conditioned) response (UR/CR unit). In the operant procedure, in addition to connections from S1 to interior units, some of the strengthened connections are necessarily from interior units to the output unit simulating the operant, R.
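The real-time procedure just described, with connection weights updated at each of 10 time steps and any reinforcer arriving on the final step, can be sketched for a single S1-to-R pathway. This reduction to one weight is an illustration only, as are the particular update rule and parameter values; the paper's actual activation and learning rules are given in its Appendix.

```python
def run_trial(w, reinforced, lr=0.2, decay=0.005, steps=10):
    """One real-time trial of a minimal S1 -> R pathway.

    The S1 input is on throughout the trial, the output activation is
    recomputed at every time step, and the weight is updated at each of
    `steps` time steps; any reinforcer occurs on the final step.
    """
    r = 0.0
    for t in range(steps):
        s1 = 1.0                 # simulated environmental stimulus
        r = s1 * w               # activation of the output (R) unit
        if reinforced and t == steps - 1:
            w += lr * s1 * r * (1.0 - w)   # broadcast reinforcing signal
        else:
            w -= decay * s1 * r * w        # coactivity without the signal
    return r, w

# Acquisition: reinforced trials strengthen the pathway from a weak
# initial ("operant-level") connection; extinction then weakens it.
w = 0.05
for _ in range(100):
    _, w = run_trial(w, reinforced=True)
acquired = w
for _ in range(100):
    _, w = run_trial(w, reinforced=False)
extinguished = w
```

With these illustrative values the pathway climbs from its weak initial strength toward a stable asymptote during reinforced trials and weakens gradually when reinforcement is withheld; the qualitative shape of the curves, not the particular numbers, is the point.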
The R unit must be activated on any trial in which a reinforcer occurs, because activation of the operant unit is required by the operant contingency.

Fig. 5. Response-selection component of a selection network showing networks having three input units, three interior units, and two output units. The input units can be activated by either of two environmental stimuli (S1 and S2) and the putative reinforcing stimulus (US). Activation of the output units simulates the occurrence of the operant response (R) and the unconditioned response (UR) or, when the UR unit is activated by a stimulus other than the US, the conditioned response (CR). Activation of the UR/CR unit also engages the diffuse reinforcing system that modifies connection weights between all coactive units. The upper panel illustrates a classical, or respondent, contingency in that the occurrence of the US is dependent on the prior activation of an input unit, S1. The lower panel depicts an operant, or instrumental, contingency in that activation of the US unit is dependent on the prior activation of the output unit, R, by the presentation of S1.

Acquisition. Computer simulations of classical and operant procedures were conducted in which selection occurred as described by the unified reinforcement principle. The same parameter values (see Appendix) were used for all simulations. (Qualitative aspects of the findings reported here are not restricted to either the specific network architectures or parameter values used in the simulations, although both affect quantitative aspects of the findings and are critical in the simulation of some phenomena.) Examples of the outcomes of these simulations are shown in Figure 6. The leftmost upper panel depicts acquisition of the CR in the classical procedure, where the measure of conditioning is the level of activation of the UR/CR, or respondent, unit by the S1. As can be seen, the respondent unit was initially unreliably activated at very low levels by the S1 but, after some 30 US presentations, the broadcast reinforcement signal had sufficiently strengthened connections along pathways between the S1 and UR/CR units for the network to mediate the S1-CR relation strongly and reliably.

Fig. 6. Simulation of acquisition, extinction, and reacquisition with a respondent contingency (upper panel) and an operant contingency (lower panel). The activation levels of the CR output unit are shown for the classical contingency, and the activation levels of the R and CR output units are shown for the operant contingency over a number of simulated conditioning trials. The simulations were real-time simulations (see text) in which the connection weights were adjusted at 10 time steps (t) within each trial. The activation levels at Time-Step 9, which is the time step before the occurrence of any reinforcer, are shown.

Acquisition with an operant contingency is shown in the leftmost lower panel of Figure 6. Here, the occurrence of the reinforcer (i.e., activation of the US unit) was contingent on the prior activation of the operant (R) unit. At first, the R unit was rarely and weakly activated by the environmental stimulus (S1). (Concomitant changes in activation of the CR unit are discussed shortly.) This initial level of activation corresponds to the baseline or operant level of the R unit. Whenever the R unit was activated, the reinforcer was presented and the diffuse reinforcement system was activated. The connection weights between all coactive units were then strengthened. As shown in Figure 6, the cumulative effect of this process was that connection weights were strengthened along pathways that permitted the network to mediate the S1-R relation after some 50 presentations of the reinforcer. Thus the unified reinforcement principle implemented in a selection network is competent to simulate acquisition with an operant as well as with a respondent contingency. Respondent and operant procedures do not necessarily require different principles for their interpretations. Instead, the different outcomes of the two procedures may be viewed as the emergent and cumulative products of the same principle.

Extinction and reacquisition. The middle sections of Figure 6 show the simulated effects of extinction in the respondent (upper panel) and operant (lower panel) procedures. Activation of a unit without the concurrent activation of the diffuse reinforcement system decreased the connection weights between all coactive units. As the cumulative effect of this process, stimulation of the S1 unit gradually lost its ability to activate the CR unit in the respondent procedure and the R unit in the operant procedure. An emergent effect of the simulation of extinction is illustrated in the rightmost sections of both panels: When the US (reinforcer) unit was again activated in accordance with respondent (upper panel) or operant (lower panel) contingencies, S1 more rapidly reacquired its ability to activate the CR and R units, respectively, than in original acquisition. Reacquisition was facilitated by the following process: During extinction, the repeated activation of S1 without reinforcement weakened connection weights from the S1 unit to interior units most rapidly, because these units were most frequently activated. Once connections from S1 to interior units had weakened sufficiently for the interior units to be no longer activated, connection weights from interior units to the output units were "protected" from further weakening. (Only connection weights of activated units may change.) Thus, during the simulation of extinction, connection weights to units "deep" within the network remained relatively unchanged. Then, when reacquisition began, these intact connection weights were available to facilitate reconditioning (cf. Kehoe, 1988).

Simulation of Acquired Reinforcement

Some of the potential contributions of acquired reinforcement to complex behavioral chains were noted earlier. Computer simulation also reveals important emergent contributions of the neural mechanisms of acquired reinforcement even during simple conditioning. The lower panels of Figure 6 depict the activation levels of the respondent (CR) unit as well as the operant (R) unit during operant conditioning, extinction, and reconditioning. According to the unified reinforcement principle, respondents and operants are concurrently acquired when an operant contingency is implemented.

First, note that activation of the CR unit by S1 was acquired before activation of the R unit. This is a general result and occurs for two primary reasons: (a) The CR unit is more strongly activated by the US from the outset of conditioning than is the R unit by S1 (i.e., the respondent is elicited by the US, whereas the operant is emitted in the presence of S1). (b) The delay in reinforcement after activating the R unit is necessarily greater than after activating the CR/UR unit. (The reinforcer, activation of the US unit, necessarily occurs after activation of the R unit in an operant procedure, whereas the diffuse reinforcing signal occurs immediately upon activation of the CR/UR unit.) Because changes in connection weights are directly related to the activation levels of the coactive units and because activation levels decay over time, changes in connection weights to the CR/UR unit occur more rapidly than to the R unit.

The more rapid acquisition of the CR than R in operant procedures has implications for the interpretation of a number of behavioral phenomena. Only two are considered here. First, when the conditioned response is incompatible with the operant, the putative reinforcer will be relatively ineffective for that operant, because the CR gains strength before the R. In this way, many so-called biological constraints on learning may be seen as emergent outcomes of the reinforcement principle itself (Breland & Breland, 1961; cf. Donahoe et al., 1982). Second, the same CR-R interactions provide insight into the punishment procedure. If the putative punisher elicits responses (e.g., withdrawal) that are incompatible with the operant, then the more rapid acquisition of the CR prevents R from gaining strength. If, however, an aversive stimulus elicits responses that are compatible with the operant, then the aversive stimulus will function as a reinforcer for that operant (cf. Kelleher & Morse, 1968).
SELECTION BY REINFORCEMENT 33

discussed elsewhere (Donahoe & Palmer, in behavioral changes produced by an operant


press) .3 contingency.
A second important consequence of the con-
ditioning of CR activity in the operant pro- Simulation of Stimulus-Control Phenomena
cedure is the effect on acquired reinforcement: The final set of simulations uses selection
Because activation of the CR/UR unit engages networks to demonstrate three phenomena of
the diffuse reinforcement system, activation of stimulus control-blocking, stimulus discrim-
the CR/UR unit by Sl-initiated activity im- ination, and stimulus configuring.
plements acquired reinforcement within a trial Blocking. Blocking was simulated using a
during simple conditioning. Some sense of the respondent contingency in which activation of
effect of acquired reinforcement on condition- the CS, unit was first paired with activation
ing is shown during acquisition in the lower of the US unit until CS, controlled the CR/
left panel of Figure 6. Trials during which the UR unit (see the left panel of Figure 7). Then
CR unit was strongly activated by Si were simulation of CS,-US pairing continued, but
generally followed by increases in the ability with the CS2 unit now activated concurrently
of S, to initiate activity in the R unit. The with the CS, unit. As shown in the middle
operation of acquired reinforcement within the panel of Figure 7, the CR unit was reliably
network strengthened the S-R relation. Thus activated by the co-occurrence of the compound
the acquired reinforcement system mediates CS of CS1 and CS2. To assess the ability of
what might loosely be called "self-reinforce- CS2 alone to control the CR unit after 100
ment" (cf. Catania, 1975). During extinction pairings with the compound CS, the CS2 unit
(see lower middle panel of Figure 6), periods was activated by itself to simulate probe tests.
of transiently increased activation of the R unit As shown in the right panel of Figure 7, the
following increases in activation of the CR unit ability of CS2 to activate the CR unit was
may reveal the mechanism responsible for blocked. Blocking occurred because (a) the re-
changes in response strength that are labeled inforcer-induced behavioral discrepancy was
as "emotional" (Skinner, 1938). The effects of reduced at the outset of the second phase of
acquired reinforcement facilitate acquisition, conditioning (CS1 already activated the CR/
particularly in deeply layered networks (cf. UR unit prior to presentation of the US) and
Donahoe & Palmer, in press), and retard ex- (b) the pathways activated by CS1 and CS2
tinction in operant conditioning. In short, ac- were in competition with one another for con-
quired reinforcement operating within a se- trol of the CR/UR unit. (The relative contri-
lection network has emergent effects on the bution of these two processes to blocking de-
pends upon the architecture of the network.)
3 It would be a mistake to view the foregoing discussion Thus blocking of the acquisition of stimulus
of the acquisition of CR and R with an operant contin- control may be simulated by a selection net-
gency as indicating that the unified reinforcement principle work.
is some variant of two-factor theory (Mowrer, 1947; Res- Stimulus discrimination. The acquisition of
corla & Solomon, 1967). To the contrary, the present view
argues against multifactor accounts of conditioning and an operant intradimensional discrimination
for a single reinforcement principle that has different cu- (Honig, 1970) was simulated using a selection
mulative effects in respondent and operant procedures. network with three input units-Sl, S2, and
That is, there is only one factor-conditioning. Similarly, S3. (An intradimensional procedure is one in
it would be a mistake to view the present formulation as which the stimuli to be discriminated differ
an attempt to "reduce" operant conditioning to respondent
conditioning (e.g., Commons, Bing, Griffy, & Trudeau, within the same sensory dimension, e.g., dif-
1991; Trapold & Overmier, 1972; cf. Pear & Eldridge, ferent wavelengths of light or frequencies of
1984). Again, there is only one fundamental process tone.) The stimulus during which activation
conditioning. Further, conditioning in whatever procedure of the R unit was reinforced (i.e., S+) consisted
it occurs can be accommodated by a single principle that
produces quite different behavioral outcomes depending of the simultaneous activation of the S, and S2
upon which stimuli and responses are reliably contiguous input units. The stimulus during which acti-
with a behavioral discrepancy and the compatibility of vation of the R unit was nonreinforced (i.e.,
those responses. Historically, the view that is closest to the S-) consisted of the simultaneous activation
present conception is that of Hilgard and Marquis (1940), of the S2 and S3 input units. Thus the S2 input
who tentatively regarded the operant-respondent distinc-
tion as procedural and capable, perhaps, of being inter- unit was activated during both S + and S - to
preted by a common principle of conditioning. simulate the common receptors activated by
34 JOHN W. DONAHOE et al.

[Figure 7: x-axis, Trials (t = 9).]
Fig. 7. Simulation of blocking in a three-phase classical conditioning experiment. During the first phase, activation
of the US input unit was contingent on the prior activation of the CS1 input unit. During Phase 2, activation of the
US unit was contingent on the activation of both the CS1 and CS2 units. During the test phase, only the CS2 unit
was activated, with the learning algorithm disengaged for "probe" tests to determine whether CS2 had acquired control
over the CR unit.
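The discrepancy-reduction process probed in Figure 7 can be caricatured with a delta-rule update in the spirit of Rescorla and Wagner (1972), which the text cites: weight change is proportional to the reinforcer-induced discrepancy. The learning rate, asymptote, and trial counts below are illustrative assumptions, and this sketch omits the network's second process, competition between pathways.

```python
# Delta-rule caricature of blocking (illustrative parameters, not the
# selection network): weight changes track the reinforcer-induced
# discrepancy, lam - (current prediction).

alpha, lam = 0.3, 1.0
w = {"CS1": 0.0, "CS2": 0.0}

def trial(present):
    discrepancy = lam - sum(w[cs] for cs in present)
    for cs in present:
        w[cs] += alpha * discrepancy

for _ in range(100):           # Phase 1: CS1 -> US
    trial(["CS1"])
for _ in range(100):           # Phase 2: (CS1 + CS2) -> US
    trial(["CS1", "CS2"])

# Probe: CS2 alone controls almost no responding, because the discrepancy
# was already near zero when CS2 was introduced.
print(round(w["CS1"], 3), round(w["CS2"], 3))  # → 1.0 0.0
```

Because CS1 alone already reduces the discrepancy to near zero by the start of Phase 2, essentially no strength accrues to CS2, which is the blocked probe result in the right panel of Figure 7.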

similar stimuli within the same sensory dimension. (In these simulations, S2 was more strongly activated than either S1 or S3 to increase the similarity between S+ and S-.) As shown in Figure 8, S+ acquired strong control over the R unit, whereas S- only weakly controlled the R unit after a period during which the activation level increased (cf. Hanson, 1959).

Stimulus configuring. The last simulation illustrates the functioning of the stimulus-selection component of a selection network. Suppose that a response is reinforced following the coactivation of input units S1 and S2 but is nonreinforced when either S1 or S2 is activated alone. Under these conditions, the stimulus-selection component is capable of differentially strengthening the connection weights from the S1 and S2 units to an interior polysensory unit, with the result that the co-occurrence of S1 and S2 reliably activates the polysensory unit. That is, the stimulus-selection component "constructs" a polysensory S12 unit that can then control activation of the R unit.

Figure 9 depicts the activation levels of a polysensory unit within the stimulus-selection component when reinforcers occurred only after the S1 and S2 units were simultaneously activated. One simulation shows the activation levels of the polysensory unit when the simulated reinforcement signal from the response-selection component was large (US = 0.9); the

[Figure 8: activation level of the R unit (y-axis, approximately 0.1 to 0.5) as a function of Trials (x-axis, 0 to 100; t = 9).]
Fig. 8. Simulation of the acquisition of an intradimensional operant stimulus discrimination by a selection network. Shown are the activation levels of the operant (R) unit initiated by the positive stimulus (S+) and the negative stimulus (S-). S+ activated two input units (S1 and S2); if the R output unit became active, then the US input unit, which functioned as the reinforcer, was activated. S- activated two input units (S2 and S3) but, whether or not the R unit became active, the reinforcer was never presented.
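The S+/S- procedure described for Figure 8 can be caricatured with a linear delta rule over the three input units; the shared S2 element is what makes the discrimination nontrivial. The learning rate and trial count are assumed values, and unlike the selection network this sketch permits a negative weight on S3.

```python
# Linear delta-rule caricature of the intradimensional discrimination:
# S+ = {S1, S2} is reinforced (target 1.0); S- = {S2, S3} is not (target 0.0).
# S2 is activated on both trial types, as in the simulated procedure.

alpha = 0.2
w = {"S1": 0.0, "S2": 0.0, "S3": 0.0}

def trial(present, target):
    out = sum(w[s] for s in present)        # response strength to the compound
    for s in present:
        w[s] += alpha * (target - out)      # discrepancy-modulated change

for _ in range(200):                        # interleaved S+ and S- trials
    trial(["S1", "S2"], 1.0)                # S+: R reinforced
    trial(["S2", "S3"], 0.0)                # S-: R nonreinforced

r_plus = w["S1"] + w["S2"]                  # converges toward 1.0
r_minus = w["S2"] + w["S3"]                 # converges toward 0.0
```

The shared S2 weight settles at an intermediate value, so strong responding to S+ and weak responding to S- emerge jointly rather than element by element.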

other shows the activation levels when the reinforcement signal was small (US = 0.1). Initially, the polysensory unit was only weakly activated by the co-occurrence of S1 and S2 but ultimately became strongly activated, with the increase occurring more rapidly for the larger reinforcement signal. Thus the stimulus-selection component has the ability to strengthen connections, on-line and "as needed," to polysensory units whose activity reflects the environment-environment relations detected by the input units, particularly when those relations occur prior to reinforcers. The implications of the stimulus-selection component for the interpretation of such complex phenomena as place learning (O'Keefe & Nadel, 1978), latent learning (Tolman, 1932), declarative versus procedural "memory" (Squire, 1992), and the formation of equivalence classes (Sidman & Tailby, 1982) are discussed elsewhere (Donahoe & Palmer, in press).

CONCLUDING COMMENTS

A principle of reinforcement has been described whose formulation is constrained by experimental analyses of both behavior and the neurosciences. This principle, the unified reinforcement principle, is competent to yield many of the basic phenomena produced by operant and respondent contingencies, including acquired reinforcement and stimulus control, when implemented in a class of adaptive neural networks known as selection networks.

From a selectionist perspective, complex phenomena are emergent outcomes of the cumulative action of relatively simple processes. Whether the unified reinforcement principle

[Figure 9: activation level of the polysensory unit (y-axis, 0.1 to 1.0) as a function of Trials (x-axis, 0 to 200; t = 9); one curve for US = 0.9 and one for US = 0.1.]
Fig. 9. Simulations of the strengthening of connections to a polysensory unit within the stimulus-selection component of a selection network. Shown are the activation levels of a polysensory unit as a function of the number of concurrent activations of two input units, S1 and S2. In one simulation, the magnitude of the simulated VTA reinforcing signal was low (US = 0.1). In the other simulation, the magnitude was high (US = 0.9).
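The dependence of configuring speed on reinforcement magnitude shown in Figure 9 can be sketched with a single bounded weight whose growth is gated by the reinforcing signal. The growth rule, learning rate, and initial weight below are illustrative assumptions, not Equation 5 of the Appendix.

```python
# Growth of a single connection weight to a "constructed" polysensory unit.
# Each trial is a reinforced co-occurrence of S1 and S2; the gain is gated
# by the magnitude us of the reinforcing signal (alpha and w are assumptions).

def weight_after(trials, us, alpha=0.1, w=0.01):
    for _ in range(trials):
        w += alpha * us * (1.0 - w)   # bounded growth toward 1.0
    return w

strong = weight_after(20, us=0.9)     # large reinforcing signal
weak = weight_after(20, us=0.1)       # small reinforcing signal
# strong rises much faster than weak, as in the two curves of Figure 9
```

Because each increment is scaled by both the reinforcing signal and the remaining capacity (1 - w), the large-signal curve approaches its asymptote within a few trials while the small-signal curve climbs slowly, mirroring the two simulations.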

will fulfill its promise as a general formulation for describing the selecting effect of the individual environment on behavior awaits further experimental analyses at the behavioral and neural levels. Whether it will prove sufficiently powerful to yield truly complex behavior when implemented in neural networks of more complex architectures awaits further simulation research. No principled impediments appear to exist to either enterprise.

REFERENCES

Amaral, D. G. (1987). Memory: Anatomical organization of candidate brain regions. In F. Plum (Ed.), Handbook of physiology: Sec. 1. Neurophysiology: Vol. 5. Higher functions of the brain (pp. 211-294). Bethesda, MD: American Physiological Society.
Beninger, R. J. (1983). The role of dopamine activity in locomotor activity and learning. Brain Research Reviews, 6, 173-196.
Bliss, T. V. P., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. Journal of Physiology (London), 232, 331-356.
Bowler, P. J. (1983). The eclipse of Darwinism: Anti-Darwinian evolution theories in the decades around 1900. Baltimore, MD: Johns Hopkins University Press.
Breland, K., & Breland, M. (1961). The misbehavior of organisms. American Psychologist, 16, 681-684.
Campbell, D. T. (1974). Evolutionary epistemology. In P. A. Schilpp (Ed.), The library of living philosophers: Vol. 14-1. The philosophy of Karl Popper (pp. 413-463). LaSalle, IL: Open Court Publishing Co.
Catania, A. C. (1975). The myth of self-reinforcement. Behaviorism, 3, 192-199.
Catania, A. C. (1987). Some Darwinian lessons for behavior analysis: A review of Bowler's The Eclipse of Darwinism. Journal of the Experimental Analysis of Behavior, 47, 249-257.
Coleman, S. R. (1981). Historical context and systematic functions of the concept of the operant. Behaviorism, 9, 207-226.
Coleman, S. R. (1984). Background and change in B. F. Skinner's metatheory from 1930 to 1938. Journal of Mind and Behavior, 5, 471-500.

Commons, M. L., Bing, E. W., Griffy, C. C., & Trudeau, E. J. (1991). Models of acquisition and preference. In M. L. Commons, S. Grossberg, & J. E. R. Staddon (Eds.), Neural network models of conditioning and action (pp. 201-223). Hillsdale, NJ: Erlbaum.
Davis, L. (Ed.). (1991). Handbook of genetic algorithms. New York: Van Nostrand Reinhold.
Dinsmoor, J. A. (1950). A quantitative comparison of the discriminative and reinforcing functions of a stimulus. Journal of Experimental Psychology, 40, 458-472.
Dobzhansky, T. G. (1937). Genetics and the origin of species. New York: Columbia University Press.
Donahoe, J. W., Crowley, M. A., Millard, W. J., & Stickney, K. A. (1982). A unified principle of reinforcement: Some implications for matching. In M. L. Commons, R. J. Herrnstein, & H. Rachlin (Eds.), Quantitative analyses of behavior: Vol. 2. Matching and maximizing accounts (pp. 493-521). Cambridge, MA: Ballinger.
Donahoe, J. W., & Palmer, D. C. (1989). The interpretation of complex human behavior: Some reactions to Parallel Distributed Processing, edited by J. L. McClelland, D. E. Rumelhart, and the PDP Research Group. Journal of the Experimental Analysis of Behavior, 51, 399-416.
Donahoe, J. W., & Palmer, D. C. (in press). Learning and complex behavior. Boston: Allyn & Bacon.
Fallon, J. H., & Loughlin, S. E. (1987). Monoamine innervation of cerebral cortex and a theory of the role of monoamines in cerebral cortex and basal ganglia. In E. G. Jones & A. Peters (Eds.), Cerebral cortex: Vol. 6. Further aspects of cortical function, including hippocampus (pp. 41-127). New York: Plenum.
Ferster, C. B., & Skinner, B. F. (1957). Schedules of reinforcement. New York: Appleton-Century-Crofts.
Gehrz, R. D., Black, D. C., & Solomon, P. M. (1984). The formation of stellar systems from interstellar molecular clouds. Science, 224, 823-830.
Gimpl, M. P., Gormezano, I., & Harvey, J. A. (1979). Effect of haloperidol and pimozide (PIM) on Pavlovian conditioning of the rabbit nictitating membrane response. In E. Usdin, I. J. Kopin, & J. Barchas (Eds.), Catecholamines: Basic and clinical frontiers (Vol. 2, pp. 1711-1713). New York: Pergamon Press.
Hanson, H. M. (1959). Effects of discrimination training on stimulus generalization. Journal of Experimental Psychology, 58, 321-334.
Herrnstein, R. J. (1970). On the law of effect. Journal of the Experimental Analysis of Behavior, 13, 243-266.
Herrnstein, R. J. (1982). Melioration as behavioral dynamism. In M. L. Commons, R. J. Herrnstein, & H. Rachlin (Eds.), Quantitative analyses of behavior: Vol. 2. Matching and maximizing accounts (pp. 433-458). Cambridge, MA: Ballinger.
Heth, C. D. (1992). Levels of aggregation and the generalized matching law. Psychological Review, 99, 306-321.
Hilgard, E. R., & Marquis, D. G. (1940). Conditioning and learning. New York: Appleton-Century.
Hinson, J. M., & Staddon, J. E. R. (1983). Hill-climbing by pigeons. Journal of the Experimental Analysis of Behavior, 39, 25-47.
Hoebel, B. G. (1988). Neuroscience and motivation: Pathways and peptides that define motivational systems. In R. C. Atkinson, R. J. Herrnstein, G. Lindzey, & R. D. Luce (Eds.), Stevens' handbook of experimental psychology: Vol. 1. Perception and motivation (pp. 547-625). New York: Wiley.
Honig, W. K. (1970). Attention and the modulation of stimulus control. In D. I. Mostofsky (Ed.), Attention: Contemporary theory and analysis (pp. 193-238). New York: Appleton-Century-Crofts.
Hull, D. L. (1973). Darwin and his critics: The reception of Darwin's theory of evolution by the scientific community. Cambridge, MA: Harvard University Press.
Iriki, A., Pavlides, C., Keller, A., & Asanuma, H. (1989). Long-term potentiation in the motor cortex. Science, 245, 1385-1387.
Kamin, L. J. (1968). "Attention-like" processes in classical conditioning. In M. R. Jones (Ed.), Miami symposium on the prediction of behavior (pp. 9-31). Coral Gables, FL: University of Miami Press.
Kamin, L. J. (1969). Predictability, surprise, attention and conditioning. In B. A. Campbell & R. M. Church (Eds.), Punishment and aversive behavior (pp. 279-296). New York: Appleton-Century-Crofts.
Kehoe, E. J. (1988). A layered network model of associative learning: Learning to learn and configuration. Psychological Review, 95, 411-433.
Kehoe, E. J. (1989). Connectionist models of conditioning: A tutorial. Journal of the Experimental Analysis of Behavior, 52, 427-440.
Kelleher, R. T., & Morse, W. H. (1968). Schedules using noxious stimuli: III. Responding maintained by response-produced electric shocks. Journal of the Experimental Analysis of Behavior, 11, 819-838.
Keller, F. S., & Schoenfeld, W. N. (1950). Principles of psychology: A systematic text in the science of behavior. New York: Appleton-Century-Crofts.
Kety, S. S. (1970). The biogenic amines in the central nervous system: Their possible role in arousal, emotion, and learning. In F. O. Schmitt (Ed.), Neuroscience second study program (pp. 324-336). New York: Rockefeller University Press.
Krieckhaus, E. E., Donahoe, J. W., & Morgan, M. A. (1992). Paranoid schizophrenia may be caused by dopamine hyperactivity of CA1 hippocampus. Biological Psychiatry, 31, 560-570.
Levy, W. B., & Desmond, N. L. (1985). The rules of elemental synaptic plasticity. In W. B. Levy, J. A. Anderson, & S. Lehmkuhle (Eds.), Synaptic modification, neuron selectivity, and nervous system organization (pp. 105-121). Hillsdale, NJ: Erlbaum.
Long, J. B. (1966). Elicitation and reinforcement as separate stimulus functions. Psychological Reports, 19, 759-764.
Lowel, S., & Singer, W. (1992). Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science, 255, 209-212.
Mackintosh, N. J. (1983). Conditioning and associative learning. New York: Oxford University Press.
Marshall-Goodell, B., & Gormezano, I. (1991). Effects of cocaine on conditioning of the rabbit nictitating membrane response. Pharmacology Biochemistry and Behavior, 39, 503-507.
Maynard Smith, J. (1982). Evolution and the theory of games. Cambridge: Cambridge University Press.
Mayr, E. (1982). The growth of biological thought: Diversity, evolution, and inheritance. Cambridge, MA: Belknap Press.
Morse, W. H. (1966). Intermittent reinforcement. In W. K. Honig (Ed.), Operant behavior: Areas of research
and application (pp. 52-108). New York: Appleton-Century-Crofts.
Morse, W. H., & Skinner, B. F. (1957). A second type of superstition in the pigeon. American Journal of Psychology, 70, 308-311.
Mowrer, O. H. (1947). On the dual nature of learning: A reinterpretation of "conditioning" and "problem-solving." Harvard Educational Review, 17, 102-150.
Nevin, J. A. (1979). Overall matching versus momentary maximizing: Nevin (1969) revisited. Journal of Experimental Psychology: Animal Behavior Processes, 5, 300-306.
O'Keefe, J., & Nadel, L. (1978). The hippocampus as a cognitive map. Oxford: Clarendon Press.
Palmer, D. C. (1987). The blocking of conditioned reinforcement. Unpublished doctoral dissertation, University of Massachusetts, Amherst.
Palmer, D. C., & Donahoe, J. W. (1992). Essentialism and selection in cognitive science and behavior analysis. American Psychologist, 47, 1344-1358.
Pear, J. J., & Eldridge, G. D. (1984). The operant-respondent distinction: Future directions. Journal of the Experimental Analysis of Behavior, 42, 453-467.
Rescorla, R. A. (1969). Conditioned inhibition of fear. In N. J. Mackintosh & W. K. Honig (Eds.), Fundamental issues in associative learning (pp. 65-89). Halifax, Nova Scotia: Dalhousie University Press.
Rescorla, R. A. (1991). Associative relations in instrumental learning: The eighteenth Bartlett memorial lecture. Quarterly Journal of Experimental Psychology, 43B, 1-23.
Rescorla, R. A., & Solomon, R. L. (1967). Two-process learning theory: Relationships between Pavlovian conditioning and instrumental learning. Psychological Review, 74, 151-182.
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning: Current research and theory (pp. 64-99). New York: Appleton-Century-Crofts.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations (pp. 318-362). Cambridge, MA: MIT Press.
Schoenfeld, W. N., Cole, B. K., Blaustein, J., Lachter, G. D., Martin, J. M., & Vickery, C. (1972). Stimulus schedules: The t-r systems. New York: Harper & Row.
Shimp, C. P. (1969). Optimal behavior in free-operant experiments. Psychological Review, 76, 97-112.
Shizgal, P., Bielajew, C., & Rompre, P.-P. (1988). Quantitative characteristics of the directly stimulated neurons subserving self-stimulation of the medial forebrain bundle: Psychophysical inference and electrophysiological measurement. In M. L. Commons, R. M. Church, J. R. Stellar, & A. R. Wagner (Eds.), Quantitative analyses of behavior: Vol. 7. Biological determinants of reinforcement (pp. 59-85). Hillsdale, NJ: Erlbaum.
Sidman, M., & Tailby, W. (1982). Conditional discrimination vs. matching to sample: An expansion of the testing paradigm. Journal of the Experimental Analysis of Behavior, 37, 5-22.
Silberberg, A., Hamilton, B., Ziriax, J. M., & Casey, J. (1978). The structure of choice. Journal of Experimental Psychology: Animal Behavior Processes, 4, 368-398.
Singer, W., & Rauschecker, J. P. (1982). Central core control of developmental plasticity in the kitten visual cortex: II. Electrical activation of mesencephalic and diencephalic projections. Experimental Brain Research, 47, 223-233.
Skinner, B. F. (1935a). The generic nature of the concepts of stimulus and response. Journal of General Psychology, 12, 40-65.
Skinner, B. F. (1935b). Two types of conditioned reflex and a pseudo type. Journal of General Psychology, 12, 66-77.
Skinner, B. F. (1937). Two types of conditioned reflex: A reply to Konorski and Miller. Journal of General Psychology, 16, 272-279.
Skinner, B. F. (1938). The behavior of organisms. New York: Appleton-Century.
Skinner, B. F. (1948). "Superstition" in the pigeon. Journal of Experimental Psychology, 38, 168-172.
Skinner, B. F. (1981). Selection by consequences. Science, 213, 501-504.
Squire, L. R. (1992). Memory and the hippocampus: A synthesis from findings with rats, monkeys, and humans. Psychological Review, 99, 195-231.
Staddon, J. E. R. (1983). Adaptive behavior and learning. Cambridge: Cambridge University Press.
Staddon, J. E. R., & Hinson, J. M. (1983). Optimization: A result or a mechanism? Science, 221, 976-977.
Stein, L., & Belluzzi, J. D. (1988). Operant conditioning of individual neurons. In M. L. Commons, R. M. Church, J. R. Stellar, & A. R. Wagner (Eds.), Quantitative analyses of behavior (Vol. 7, pp. 249-264). Hillsdale, NJ: Erlbaum.
Stein, L., & Belluzzi, J. D. (1989). Cellular investigations of behavioral reinforcement. Neuroscience and Biobehavioral Reviews, 13, 69-80.
Stickney, K., & Donahoe, J. W. (1983). Attenuation of blocking by a change in US locus. Animal Learning & Behavior, 11, 60-66.
Stone, G. O. (1986). An analysis of the delta rule and the learning of statistical associations. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations (pp. 444-459). Cambridge, MA: MIT Press.
Swanson, L. W. (1982). The projections of the ventral tegmental area and adjacent regions: A combined fluorescent retrograde tracer and immunofluorescence study in the rat. Brain Research Bulletin, 9, 321-353.
Thomas, D. R., & Caronite, S. C. (1964). Stimulus generalization of a positive conditioned reinforcer: II. Effects of discrimination training. Journal of Experimental Psychology, 68, 402-406.
Thompson, R. F. (1990). Neural mechanisms of classical conditioning in mammals. Philosophical Transactions of the Royal Society, London, 161-170.
Tolman, E. C. (1932). Purposive behavior in animals and men. New York: Century.
Trapold, M. A., & Overmier, J. B. (1972). The second learning process in instrumental learning. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning

II: Current research and theory (pp. 427-452). New York: Appleton-Century-Crofts.
Trowill, J. A., Panksepp, J., & Gandelman, R. (1969). An incentive model of rewarding brain stimulation. Psychological Review, 76, 264-281.
Vaughan, W., Jr. (1981). Melioration, matching, and maximizing. Journal of the Experimental Analysis of Behavior, 36, 141-149.
vom Saal, W., & Jenkins, H. M. (1970). Blocking the development of stimulus control. Learning and Motivation, 1, 52-64.
Williams, B. A. (1990). Enduring problems for molecular accounts of operant behavior. Journal of Experimental Psychology: Animal Behavior Processes, 16, 213-216.
Williams, D. R., & Williams, H. (1969). Auto-maintenance in the pigeon: Sustained pecking despite contingent nonreinforcement. Journal of the Experimental Analysis of Behavior, 12, 511-520.
Wise, R. A. (1989). The brain and reward. In J. M. Liebman & S. J. Cooper (Eds.), The neuropharmacological basis of reward (pp. 377-424). New York: Oxford University Press.
Wise, R. A., & Bozarth, M. A. (1987). A psychomotor stimulant theory of addiction. Psychological Review, 94, 469-492.
Woodbury, C. B. (1943). The learning of stimulus patterns by dogs. Journal of Comparative Psychology, 35, 29-40.
Yeomans, J. (1988). Mechanisms of brain stimulation reward. In A. N. Epstein & A. R. Morrison (Eds.), Progress in psychobiology and physiological psychology (Vol. 13, pp. 227-266). San Diego: Academic Press.
Yeomans, J. S. (1989). Two substrates for medial forebrain bundle self-stimulation: Myelinated axons and dopamine neurons. Neuroscience and Biobehavioral Reviews, 13, 91-98.

Received October 23, 1992
Final acceptance February 23, 1993

APPENDIX

Activation Function

Let N = {x ∈ ℕ | 1 ≤ x ≤ n} and P = {j ∈ ℕ | m < j ≤ n} be the sets of units and neural processing elements (NPEs) in an artificial neural network, respectively, where ℕ is the set of positive integers, n is the number of units in the network, and m is the number of input units. Let R = {x ∈ ℝ+ | 0.0 ≤ x ≤ 1.0} be the set of possible activation and connection-weight values, where ℝ+ is the set of positive real numbers. a: P × T → R is the activation function, where T ⊆ ℕ, the elements of T representing time steps.

The rule for implementing function a in the neural-network simulations is defined as follows. Let e(j,t) be the vector of activations of the excitatory inputs to postsynaptic element j at t, i(j,t) be the vector of activations of the inhibitory inputs to j at t, w(j,t) be the vector of excitatory weights associated with j at t, and w'(j,t) be the vector of inhibitory weights associated with j at t, where j ∈ P and t ∈ T. Assuming that e(j,t), i(j,t), w(j,t), w'(j,t) ∈ R^n, the amounts of excitation (exc) and inhibition (inh) produced at j during t are given, respectively, by

    exc(j,t) = e(j,t) · w(j,t)     (1)
    inh(j,t) = i(j,t) · w'(j,t).   (2)

The activation a of j at t is conceptualized as the probability of firing, defined as follows:

    a(j,t) =
        p(epsp,j,t) + r(j)p(epsp,j,t-1)[1 - p(epsp,j,t)] - p(ipsp,j,t),
            if exc(j,t) ≥ θ(j,t) and exc(j,t) > inh(j,t);
        a(j,t-1) - k(j)a(j,t-1)[1 - a(j,t-1)],
            if exc(j,t) < θ(j,t) and exc(j,t) > inh(j,t);
        0.0,
            if exc(j,t) ≤ inh(j,t),     (3)

where p(epsp,j,t) = L[exc(j,t)], r(j) is a temporal-summation parameter (0.0 < r(j) < 1.0), p(ipsp,j,t) = L[inh(j,t)], θ(j,t) is a random excitatory threshold generated according to a Gaussian distribution with parameters μ and σ, and k(j) is an activation-decay parameter (0.0 < k(j) < 1.0). The term p(epsp,j,t) is interpreted as the probability of occurrence of an excitatory postsynaptic potential (epsp) at j during t, whereas p(ipsp,j,t) is interpreted as the probability of occurrence of an inhibitory postsynaptic potential (ipsp) at j during t. Function L is the logistic probability distribution with parameters γ and δ:

    L(x) = 1/(1 + exp[(-x + γ)/δ]).     (4)

In the simulations, r(j) = .1, k(j) = .05, μ = 0.0, σ = 1.0, γ = .5, and δ = .1.

Learning Function

Let c: N × N → {0,1} be the predicate "connected" (0 signifies "is not connected to" and 1 signifies "is connected to"), and let E = {i ∈ N | s(i) = 1} and I = {i ∈ N | s(i) = 0} be the sets of excitatory units and inhibitory units, respectively, where s: N → {0,1} is the function "releases" [0 signifies an inhibitory neurotransmitter (e.g., gamma-aminobutyric acid, or GABA) and 1 signifies an excitatory neurotransmitter (e.g., glutamate)]. Let 𝐄 = {(i,j) ∈ E × N | c(i,j) = 1} and 𝐈 = {(i,j) ∈ I × N | c(i,j) = 1} be the sets of possible excitatory and inhibitory connections in a neural network, respectively. The mapping w: 𝐄 × T → R is the learning function for excitatory presynaptic inputs, whereas w': 𝐈 × T → R is the learning function for inhibitory presynaptic inputs. The networks trained in the simulations did not have inhibitory connections, so the learning function for inhibitory presynaptic inputs was not used.

The rules for implementing functions w and w' in the simulations are defined as follows. The connection weight w between a presynaptic element i ∈ E and a postsynaptic element j ∈ N at t is given by

    w(i,j,t) =
        w(i,j,t-1) + α(j)a(j,t)d(t)p(i,t)r(j,t),    if d(t) > 0;
        w(i,j,t-1) - β(j)w(i,j,t-1)a(i,t)a(j,t),    if d(t) ≤ 0,     (5)

where α(j) is the acquisition rate for excitatory connections, β(j) is the extinction rate for excitatory connections [in the simulations, α(j) = .5, β(j) = .035, and w(i,j,t = 0) = .01], d(t) is the reinforcing signal, p(i,t) = [a(i,t)w(i,j,t-1)]/exc(j,t), and r(j,t) = 1 - sum[w(j,t)]. If j is a hidden NPE in the stimulus-selection subnetwork (i.e., a polysensory NPE), then d(t) = φ(t) + ω(t)[1 - φ(t)], where φ(t) = |avg[h(t) - h(t-1)]| and ω(t) = avg[v(t) - v(t-1)]. In the last two equations, h is a vector of activations of all the hippocampal NPEs receiving inputs from the polysensory NPEs of the stimulus-selection subnetwork, whereas v is a vector of activations of all the VTA NPEs receiving inputs from the motor association NPEs of the response-selection subnetwork. On the other hand, if j is a motor NPE (i.e., an NPE in the response-selection subnetwork), then d(t) = ω(t). The use of the absolute value in function φ is based on the assumption that both the onset and offset of hippocampal NPEs influence sensory learning (i.e., the output of these NPEs is sensitive to sensory stimulus change). Because the networks trained in the simulations did not include any sensory NPEs, only ω(t) was used as the reinforcing signal.

The connection weight w' between a presynaptic element i ∈ I and a postsynaptic element j ∈ N at t is given by

    w'(i,j,t) =
        w'(i,j,t-1) + α'(j)a(j,t)d(t)p(i,t)r(j,t),     if d(t) > 0;
        w'(i,j,t-1) - β'(j)w'(i,j,t-1)a(i,t)a(j,t),    if d(t) ≤ 0,     (6)

where α'(j) is the acquisition rate for inhibitory connections, β'(j) is the extinction rate for inhibitory connections, p(i,t) = [a(i,t)w'(i,j,t-1)]/inh(j,t), and r(j,t) = 1 - sum[w'(j,t)]. The term d(t) is defined as in the excitatory learning rule (see above).
