SCIENCE CHINA Technological Sciences
Progress of Projects Supported by NSFC
November 2013 Vol.56 No.11: 2745–2761
doi: 10.1007/s11431-013-5369-0
Received August 6, 2013; accepted September 13, 2013; published online September 30, 2013
Operant conditioning is one of the fundamental mechanisms of animal learning, which suggests that the behavior of all animals, from protists to humans, is guided by its consequences. We present a new stochastic learning automaton called a Skinner automaton, a psychological model that formalizes the theory of operant conditioning. We identify animal operant learning with a thermodynamic process, and derive a so-called Skinner algorithm from the Monte Carlo method as well as the Metropolis algorithm and simulated annealing. Under certain conditions, we prove that the Skinner automaton is expedient, ε-optimal and optimal, and that the operant probabilities converge to the set of stable roots with probability 1. The Skinner automaton enables machines to autonomously learn in an animal-like way.
Learning automata, Boltzmann distribution, operant conditioning, operant learning, simulated annealing
Citation: Ruan X G, Wu X. The Skinner automaton: A psychological model formalizing the theory of operant conditioning. Sci China Tech Sci, 2013, 56: 2745–2761, doi: 10.1007/s11431-013-5369-0
© Science China Press and Springer-Verlag Berlin Heidelberg 2013 tech.scichina.com www.springerlink.com
…need to express emotions, behaviors and personality in a human-like manner. They presented a behavior model with operant conditioning and applied it to an emotion expression humanoid robot (WE-4RII) so that the machine could autonomously select suitable behavior [13, 14].

Reinforcement learning is an important form of machine learning, which is inspired by behaviorist psychology. More exactly, operant conditioning theory underlies reinforcement learning. In fact, the concept of reinforcement originates from the theory of operant conditioning, in which reinforcement plays a key role. Sutton and Barto established the basic principles of reinforcement learning [15]. However, reinforcement learning can be traced back to the work on learning automata (LA). The term "learning automata" was first used by Narendra and Thathachar [16], who pointed out that the first learning automata were developed in mathematical psychology. The original notion of learning automata corresponds to the so-called finite action-set learning automata (FALA) [16]. An FALA has a given set of actions and a specific reinforcement scheme; it generally operates in an unknown random environment and updates its action probabilities in accordance with the responses from the environment. It is obvious that learning automata are also inspired by behaviorist psychology. Learning automata are models of operant learning as well as formal frameworks for reinforcement learning.

The notion of learning automata has been extended over the years [17]. FALA ensure convergence only to a local maximum of the reinforcement signal [18, 19]. To obtain convergence to the global maximum, parameterized learning automata (PLA) were proposed [20]. To handle associative reinforcement learning problems, generalized learning automata (GLA) were introduced [21, 22]. The common theme among them is the reinforcement scheme, which serves as the basis of the learning process for learning automata [23–27]. According to their linearity, reinforcement schemes can be categorized into linear, nonlinear, and hybrid schemes. The reinforcement scheme in a learning automaton is the crucial factor that affects the performance of the learning automaton. According to the properties exhibited by learning automata using the schemes, reinforcement schemes can be classified as expedient, ε-optimal, optimal, and absolutely expedient.

As has been pointed out by Touretzky and Saksida [10], however, animals learn much more complicated behaviors through operant conditioning than machines acquire through reinforcement learning or learning automata. In this paper, we address a so-called Skinner automaton as a new psychological model for formalizing the theory of operant conditioning. We identify animal operant learning with a thermodynamic process, and derive the so-called Skinner algorithm from the Metropolis algorithm [28, 29] and simulated annealing [30, 31]. The Skinner algorithm is the core of the Skinner automaton and serves as its reinforcement scheme. With simulated annealing, the Skinner automaton ensures convergence to global optimization. Our theoretical analysis shows that the Skinner automaton governed by the Skinner algorithm is not only expedient and ε-optimal, but also optimal and absolutely expedient. At the end of this paper, some optimal computing and simulated experiments are presented with the Skinner automaton to test its performance. Our work suggests that operant conditioning is not only psychological and biological, but can also be computational, psychodynamical and thermodynamical. The Skinner automaton enables autonomous agents, including robots, to autonomously learn in an animal-like way.

The rest of this paper is organized as follows. Section 2 unscrambles operant conditioning and extracts its fundamental elements. Section 3 configures the Skinner automaton for synthesizing operant conditioning and describes the Skinner algorithm, which is the core of the Skinner automaton. Section 4 discusses the self-organizing property of the Skinner automaton. Sections 5 and 6 reproduce Skinner's famous pigeon experiments with the Skinner automaton. In Section 7, an animated 3D Wienerian worm is built, and the Skinner automaton serves as its photosensory-motor system to shape its behavior of negative phototropism by operant learning (i.e., learning in an operant conditioning way). In Section 8, the Skinner automaton serves as the sensorimotor system related to the posture and movement of a physical flexible two-wheeled robot, and enables the robot to gradually develop its motor skills and learn to balance in autonomous operant conditioning. Finally, our conclusions are given in Section 9.

2 Analysis of operant conditioning

A natural question about machine learning is whether it is compatible with animal learning. We aim at formalizing the theory of operant conditioning (OC) in a form analogous to learning automata (LA) so that artificial systems learn in an operant conditioning way like animals. From psychodynamics [32], operant conditioning is a dynamical process. A learning automaton is a dynamical system. In this section, we try to unscramble the dynamical process of OC in terms of LA so that our OC model is compatible with OC in animals.

2.1 Learning automata

Here the LA theory is briefly described to provide a background for understanding the learning process of LA and for being compared and contrasted with OC.

Generally, a learning automaton (LA) is a stochastic dynamical system operating in a random environment, and can be represented with a 7-tuple:

LA = {t, S, α, β, F, G, A},  (1)

where t ∈ {0, 1, 2, ⋯} represents the discrete time for each
Figure 1 Learning automaton and its dynamical process, where α(t), β(t), and p(t) respectively represent the action of the LA, the response of the environment, and the action probabilities at stage t.

Figure 2 The closed loop of alternation of behavior and consequence in OC.
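The action-probability loop pictured in Figure 1 can be sketched with the classical linear reward-inaction (L_R-I) scheme from the LA literature surveyed in the introduction. This is a minimal illustration rather than the paper's Skinner algorithm, and the environment's reward probabilities below are invented for the example.

```python
import random

def lri_update(p, chosen, rewarded, a=0.05):
    """Linear reward-inaction (L_R-I) scheme: on reward, move probability
    mass toward the chosen action; on penalty, leave p unchanged."""
    if rewarded:
        return [pi + a * (1.0 - pi) if i == chosen else (1.0 - a) * pi
                for i, pi in enumerate(p)]
    return p

def run_fala(steps=2000, seed=1):
    """A two-action FALA in an unknown random environment (invented)."""
    rng = random.Random(seed)
    p = [0.5, 0.5]                 # action probabilities
    reward_prob = [0.9, 0.2]       # hypothetical environment responses
    for _ in range(steps):
        i = 0 if rng.random() < p[0] else 1       # sample an action from p
        rewarded = rng.random() < reward_prob[i]  # environment's response
        p = lri_update(p, i, rewarded)            # reinforcement scheme
    return p
```

With these settings the automaton typically absorbs near the action with the higher reward probability; since L_R-I is only ε-optimal, a smaller step size `a` trades learning speed for a higher chance of converging to the better action, which is exactly the local-convergence limitation of FALA noted above.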
…consequences induced by actions, such as reinforcers, either rewards or punishments for actions. However, we argue that, in the process of OC, the responses from the environment are not the real consequences of operants, but only the mediate ones. The real consequences induced by operants are the state changes of the organism in OC. For instance, a reward for an operant is not the final consequence of the operant, but a medium between the operant and its consequence, which tends towards bringing the organism to a certain satisfied internal state that is the real consequence induced by the operant. Accordingly, in the process of OC, operant consequences can be supposed to be the state changes of organisms. To some extent, the current state of the organism represents the consequence of the last operant.

As a consequence, operant selection in the S→R feed-forward process of OC can be represented with the function G: S→α in LA (1). Some learning automata formulate behavior selection with a function G: S×β→α, which means that behavior selection in LA depends on both the state of the LA and the input from the environment. As stated above, however, the effect of the input from the environment can be supposed to be contained in the state of the organism in OC. Accordingly, at each stage t, operant selection in OC can be generalized computationally with

S→R: s(t) → o(t), with probability p(o(t)|s(t)),  (2)

where s(t) and o(t) are the current state and operant, respectively, and p(o(t)|s(t)) is the occurrence probability of the operant o(t) at the state s(t), i.e., the probability of the operant o(t) being selected at the state s(t). Eq. (2) formulates the S→R feed-forward process from consequence (state) to operant in OC, which states that the organism in OC selects its current operant totally based on its current state.

The state transition of LA as represented with eq. (1) follows the function F: S×β→S, which means that the state transition of the LA is driven by the inputs (i.e., the responses to the actions) from the environment. As we have argued above, however, the state transitions of an organism in OC can be supposed to be the consequences induced by operants. Thereupon, the state transition in OC does not really depend on the responses from the environment, but on the operants of the organism itself. Accordingly, at each stage t, the state transition of the organism in OC can be generalized computationally with

R→S: s(t) × o(t) → s(t+1),  (3)

where s(t) and o(t) are the current state and operant, respectively, and s(t+1) is the next state. Eq. (3) formulates the R→S feedback process from operant to consequence (state) in OC, which states that the operant is the power that drives the state transition of organisms, and the future state of the organism in OC depends on both the current state and the current operant.

Actually, in OC, the state transition process of an organism is a black box. The organism can sense its own state or the state change, but cannot tell the process. We need to notice some differences between eqs. (1) and (3). The state transition in OC is pushed directly by operants, whereas that in LA is pushed by the responses from the environment to actions.

2.3 Motivation and reinforcement in OC

Reinforcement is a key concept in operant conditioning [34]. It is defined as an event that increases the occurrence probability of an operant by following the operant with a certain reinforcer, that is, something strengthening or rewarding the operant. The opposite is punishment, which lowers the probability. Reinforcement plays an important role in shaping or modifying the behaviors of human beings and animals. If reinforcement is supposed to be either positive or negative, then both reward and punishment can be regarded as some reinforcement. Computationally, in that case, positive reinforcers can stand for reward, and negative ones for punishment.

In the process of OC, organisms seem inclined to exhibit more frequently the behaviors that lead to reward, and less frequently those that lead to punishment. However, we argue that organisms seek not so much rewards as certain satisfied psychological and physiological feelings or states. Without doubt, the psychological and physiological feelings or states of organisms are the most radical reinforcers in the process of OC or operant learning. Actually, in the process of OC, the effect of any reinforcement would finally be reflected in the change of the psychological and physiological feelings or states of organisms. Happiness, joviality, comfortability, etc. are positive reinforcers, which incline to increase the occurrence probabilities of operants, whereas sadness, disappointment, dejection, melancholy, etc. are negative ones, which incline to decrease the occurrence probabilities of operants.

Reinforcement in OC is related to the motivation [35, 36] that causes organisms to take an action or operant. Motivation is a kind of force, which initiates, guides and maintains goal-oriented behaviors or operants in terms of OC. Operant conditioning or operant learning tends towards giving rise to positive reinforcement when the psychological and physiological feeling of the organism is consistent with the motivation, or when the state transition of the organism is guided towards the goal that the motivation means, and otherwise to negative reinforcement.

The mechanism of operant conditioning can be characterized with psychodynamics [32], which is a blend of psychology and thermodynamics. Brucke, the founder of psychodynamics, supposed that all organisms were energy systems governed by the first law of thermodynamics [37]. It can also be characterized with biological thermodynamics [38], which refers to bioenergetics [39] and investigates the energy transductions in living organisms. Coincidentally, both psychodynamics and biological thermodynamics are inspired by thermodynamics and associated with energy,
which suggests that, in the process of OC, biological or psychological and physiological states require energy to hold, and operants require energy to operate. Accordingly, an organism in OC can be supposed to be an energy system E_SYS that consists of two functions: one is the internal energy function E_S, which assigns each biological state a nonnegative energy value; the other is the operant energy function E_O, which assigns each operant a nonnegative energy value, that is,

E_SYS(t) = ΔE_S(t) + E_O(t),
E_S: s(t) → R+, E_O: o(t) → R+,  (4)

where ΔE_S(t) = E_S(t+1) − E_S(t) is the increment of the internal energy from t to t+1. The internal energy E_S is potential energy for organisms to hold the biological states, and the operant energy E_O is kinetic energy for organisms to operate operants. Eq. (4) implies that the state transition of an organism in OC driven by an operant is accompanied by a change of the internal energy and a consumption of the operant energy. At each stage t, E_SYS estimates the consequence (effect) of the operant o(t) at the state s(t) by a nonnegative energy value.

It is usually supposed that low-energy states are coincidental with the propensity or tropism of organisms [40]. The motivation of the organism in OC can be supposed to be the propensity or tropism of the organism to hold the lowest-energy state by the lowest-energy operant. Thus, the motivation of an organism in OC can be characterized by a motivation system M_SYS with an objective function that is based on the energy system E_SYS:

…positive reinforcement (reward) with Δp(o(t)|s(t)) > 0, or negative reinforcement (punishment) with Δp(o(t)|s(t)) < 0. Δp(o(t)|s(t)) is the reinforcement at the stage t that follows the operant o(t). Reinforcement in OC, depicted in Figure 3 and in eq. (6), obeys the principle of the simultaneity of cause and effect, although there is a lapse of time between an operant and its reinforcer.

From the above analysis, there are three fundamental elements in operant conditioning: (i) operant, which is active or voluntary behavior; (ii) motivation, which is embodied as an energy system; and (iii) reinforcement, which can computationally be formulated as the update of the occurrence probabilities of operants guided by the motivation system.

Eqs. (2)–(6) generalize the behavior selection, state transition, motivation system, and reinforcement of organisms in OC, respectively, and become the principles for us to design the Skinner automaton.

3 Synthesis of Skinner automaton

Inspired by the ideas of psychodynamics and biological thermodynamics [32, 37–39], a so-called Skinner automaton (SAUTO) is built in this section as a machine learning model for artificial systems (autonomous agents) to learn in an operant conditioning way like animals.

3.1 Definition of the SAUTO

From Figures 1–3 and eqs. (2)–(6), the Skinner automaton can be defined with a 7-tuple as
Ē_SYS(o|s) ← λ E_SYS(o(t)|s(t)) + (1−λ) Ē_SYS(o|s),  (10)

where λ ∈ (0, 1). Eq. (10) is actually a low-pass filter, which is twofold: (i) |Ē_SYS(o|s)| gradually expands from zero to a relatively stable statistic with iterative operant conditioning, and (ii) it is somewhat adaptive to changing environments.

3.3 Skinner algorithm

The Skinner algorithm is the core of the SAUTO, which simulates the mechanism of OC and is employed as a reinforcement scheme by the SAUTO. The Skinner algorithm derives its philosophy from the Metropolis Monte Carlo method [28, 29] and simulated annealing [30, 31]. With the Skinner algorithm, the SAUTO runs like a thermodynamic system and tries to gain some optimal learning.

The Skinner algorithm employs the Monte Carlo method to repeatedly take samples of operant-consequence from the state transition process of the SAUTO at each stage t and sum up the operant effect. It defines the occurrence probability of an operant with an energy function, which is analogous to eq. (8) of the Metropolis algorithm. Moreover, it establishes the conditioned operants of the SAUTO in the way of the Metropolis algorithm. It implies some establishment of conditioned operants in OC that the SAUTO reaches thermal equilibrium at a specified temperature T. However, the conditioned operants are probably not globally optimal.

Inspired by annealing in metallurgy, simulated annealing was introduced for global optimization by Kirkpatrick, Gelatt, and Vecchi in 1983 [30], and by Černý in 1985 [31]; it generalizes the Metropolis algorithm through introducing a temperature scheme. Simulated annealing starts the Metropolis algorithm with a high temperature, and then slowly cools so that the search space gradually shrinks, and eventually reduces to a small set of states with the global energy minimum. As a matter of fact, both animal learning and annealing are gradual optimization processes. Operant conditioning is a process of behavior selection, or in other words, a process of behavior optimization. At the beginning, …

1) Initializing: ∀s ∈ S and ∀o ∈ O, set Ē_SYS(o|s) = 0. Let the initial time t be zero, the initial state be s(t) = s0 (∈ S), and the artificial temperature T_A (= K_B T) be large enough.

2) Behavior selecting: Compute the occurrence probabilities of the operants (in O) at the s(t) from eq. (9) with Ē_SYS(o|s(t)), and select an operant o from O using the probabilities.

3) Operating: Put the operant o(t) = o on the state transition process F to get the state s(t+1).

4) Summing up operant effect: Get the energy value E_SYS(o(t)|s(t)) induced by the o(t) at the s(t), and take the statistic Ē_SYS(o(t)|s(t)) of the energy system. Replace t by t+1, and repeat from step 2 until thermal equilibrium at the T_A or enough repetition is reached.

5) Reinforcing by cooling: Decrease the T_A according to a specified cooling schedule, and repeat from step 2 until the T_A is low enough or F reaches the frozen state.

At the start of the Skinner algorithm, all the operants in O have the same occurrence probabilities for any given state in S at time t = 0. In the end, however, there will be rare operants fired to hold the lowest-energy states.

The Skinner algorithm is able to simulate operant conditioning and behavior selection, and it runs in the way of the Metropolis Monte Carlo method and simulated annealing.

4 Thermal equilibrium and operant entropy

The Skinner automaton is a thermodynamic system as well as a self-organizing psychodynamic system. Its self-organizing property is to a certain extent reflected by its thermal equilibrium phenomenon.

To demonstrate the thermal equilibrium of the Skinner automaton, we assume a simpler case that F is determinative, and Ē^(S)(o|s) is the mean of the internal energy increment. Eq. (5) then has the following equivalent transformation:

p(o|s) = (1/Z(s)) e^(−Ē^(S)(o|s)/(K_B T)),
Z(s) = Σ_(o∈O) e^(−Ē^(S)(o|s)/(K_B T)),  (11)
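Steps 1)–5), together with the statistic of eq. (10) and the Boltzmann selection of eqs. (9)/(11), can be condensed into one loop. The sketch below is hedged: the function names, the geometric cooling schedule, and the toy environment at the end are assumptions for illustration, not the authors' exact implementation.

```python
import math
import random

def skinner_algorithm(states, operants, F, E_S, E_O,
                      T0=100.0, T_min=1e-3, cool=0.9,
                      samples=25, lam=0.1, seed=0):
    """Sketch of the Skinner algorithm: Boltzmann operant selection over
    running-mean operant effects (eq. (10)) under a cooling schedule."""
    rng = random.Random(seed)
    Ebar = {(s, o): 0.0 for s in states for o in operants}  # step 1: initialize
    s, T = states[0], T0
    while T > T_min:                                        # step 5: anneal
        for _ in range(samples):                            # equilibrate at T
            # step 2: occurrence probabilities, cf. eq. (11);
            # the exponent is clamped to guard against overflow at low T
            w = [math.exp(min(700.0, -Ebar[(s, o)] / T)) for o in operants]
            z = sum(w)
            r, acc, chosen = rng.random(), 0.0, operants[-1]
            for o, wi in zip(operants, w):
                acc += wi / z
                if r < acc:
                    chosen = o
                    break
            s_next = F(s, chosen)                           # step 3: operate
            # step 4: operant effect, cf. eq. (4), and its statistic, eq. (10)
            e = (E_S(s_next) - E_S(s)) + E_O(chosen)
            key = (s, chosen)
            Ebar[key] = lam * e + (1.0 - lam) * Ebar[key]
            s = s_next
        T *= cool                                           # cooling schedule
    return Ebar

# toy check (invented): operant k drives the system to state k,
# state 0 is the lowest-energy state, operants cost nothing
Ebar = skinner_algorithm(states=[0, 1], operants=[0, 1],
                         F=lambda s, o: o,
                         E_S=lambda s: float(s),
                         E_O=lambda o: 0.0)
```

After cooling, the learned statistic makes the state-raising operant look expensive at the low-energy state, so the Boltzmann probabilities concentrate on the operant that holds the lowest-energy state, as the frozen-state description above suggests.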
Operant \ State             s(0): being satisfied   s(1): longing for food   s(2): being in pain
o(0): to do nothing         s(1)                    –                        s(1)
o(1): to peck at the red    –                       s(0)                     –
o(2): to peck at the yellow –                       s(1)                     –
o(3): to peck at the blue   –                       s(2)                     –
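The state table above can be encoded directly as a lookup table, which also realizes the R→S transition of eq. (3) for this example. The identifiers are hypothetical, and the dashes are encoded as None (the operant has no defined consequence at that state).

```python
# (state, operant) -> next state; None marks a dash in the table
STATE_TABLE = {
    ("s0", "do_nothing"):  "s1",
    ("s1", "do_nothing"):  None,
    ("s2", "do_nothing"):  "s1",
    ("s1", "peck_red"):    "s0",
    ("s1", "peck_yellow"): "s1",
    ("s1", "peck_blue"):   "s2",
}

def available_operants(table, state):
    """Operants with a defined consequence at the given state."""
    return [o for (s, o), nxt in table.items() if s == state and nxt is not None]

def transition(table, state, operant):
    """R->S: the next state induced by an operant, cf. eq. (3)."""
    return table[(state, operant)]
```

For instance, pecking at the blue key while longing for food leads to the pain state, while pecking at the red key leads back to satisfaction.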
…with Skinner automata. Like Skinnerian superstitious pigeons, however, the Skinner automata may also reinforce behaviors that are accidentally coincidental with the occurrences of food pellets. We call such a superstitious action a pseudo operant, and call reinforcing pseudo operants pseudo operant conditioning (POC).

The state transition process of the Skinner automata simulating superstitious pigeons can be illustrated with the state diagram in Figure 9 and the state table in Table 2. In the state diagram, there are only two states: (i) s(0): being satisfied, and (ii) s(1): longing for food. The simulated birds have one operant o(0), which means to do nothing, at the states s(0) and s(1), and six pseudo operants that are likely to occur at the s(1): (i) o(1): to flap wings, (ii) o(2): to spin, (iii) o(3): to twist, (iv) o(4): to turn the neck, (v) o(5): to sing, and (vi) o(6): to bow.

The simulated Skinnerian birds have the same state diagram but different internal energy functions, operant energy functions, and samplings, which represent different psychological and physiological states, biological propensities, and extents of cognition and learning.

…through sufficient operant learning. It has an internal energy function of E^(S)(s^(i)) = i (i = 0, 1) and an operant energy function of E^(O)(o^(k)) = 0.5k (0 ≤ k ≤ 6). In the process of operant conditioning, it takes 25 sample paired data of operant-consequence (response-stimulus) at each fixed temperature.

As depicted in Figure 10(a), with well-balanced operant learning, the simulated bird has learned to do nothing and to wait for the food pellets occurring automatically. This means that well-balanced pigeons are apt to learn the truth.

6.2 POC due to being hyperactive

The second simulated pigeon is hyperactive. It has an internal energy function of E^(S)(s^(i)) = i (i = 0, 1) and an operant energy function of E^(O)(o^(k)) = −0.5k (0 ≤ k ≤ 6), which means that the simulated bird is too energetic.

As depicted in Figure 10(b), the simulated bird being hyperactive gets into a POC process and becomes superstitious. To pray for food, it bows. It seems that pigeons being too energetic or hyperactive are apt to be superstitious.
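The two birds' operant-energy profiles translate directly into eq. (11)-style occurrence probabilities. A small worked example follows; the negative-slope energies for the hyperactive bird are an assumption reflecting its description as "too energetic", and k = 0 is the do-nothing operant while k = 6 is bowing.

```python
import math

def boltzmann(energies, T):
    """Occurrence probabilities proportional to exp(-E/T), cf. eq. (11)."""
    w = [math.exp(-e / T) for e in energies]
    z = sum(w)
    return [wi / z for wi in w]

# well-balanced bird: E_O(o_k) = 0.5*k, k = 0..6 (doing nothing is cheapest)
calm = [0.5 * k for k in range(7)]
# hyperactive bird (assumed sign): E_O(o_k) = -0.5*k (energetic operants are cheap)
hyper = [-0.5 * k for k in range(7)]

p_hot   = boltzmann(calm, 100.0)  # high temperature: nearly uniform, exploratory
p_calm  = boltzmann(calm, 0.1)    # low temperature: mass on o_0 (do nothing)
p_hyper = boltzmann(hyper, 0.1)   # low temperature: mass on o_6 (bow)
```

At high temperature every operant is almost equally likely, which matches the exploratory start of the Skinner algorithm; after cooling, the well-balanced profile concentrates on doing nothing while the assumed hyperactive profile concentrates on bowing, mirroring Figures 10(a) and 10(b).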
Figure 9 The state diagram of the Skinner automata simulating the superstitious pigeons.
…operant-consequence at each fixed temperature, which means that they are biased and lack sufficient cognition and learning.

However, the simulated birds have different behaviors because the simulated experiments of operant conditioning are computationally stochastic processes. The simulated results are somewhat accidental and unpredictable. As shown in Figure 11, the simulated birds exhibit different behaviors. Among the eight simulated birds, six have become superstitious owing to POC. To pray for food, two flap their wings (see Figures 11(a), (b)), two spin (see Figures 11(c), (d)), one turns its neck (see Figure 11(e)), one bows (see Figure 11(f)), and the remaining two do nothing (see Figures 11(g), (h)). As shown in the simulated experiments, it seems that the higher the internal energy of the state s(1) and the fewer the sample paired data of operant-consequence at each fixed temperature, the more the simulated bird is apt to be superstitious.

We acquire three significant suggestions from the superstitious Skinner automata: (i) being hyperactive makes one apt to be superstitious; (ii) hyperepithymia makes one apt to be superstitious; (iii) being biased or lacking sufficient cognition and learning makes one apt to be superstitious.

7 A 3D Wienerian worm

Norbert Wiener, the father of cybernetics, was very interested in how machines can work like animals. He described a thought experiment in his book "Cybernetics" [43] and conceived a mechanism for imitating an aphototropic worm, which we name the Wienerian worm. Wiener supposed that an electric bridge could serve as a photosensory-motor system so that the Wienerian worm would be able to search for darkness as a real aphototropic worm does. Influenced by Wiener, Braitenberg also designed a series of thought experiments in which a vehicle with a simple internal structure acts in unexpectedly complex ways [44], including the phototropic and the aphototropic. Both the Wienerian worm and Braitenberg's vehicle are simple but most significant, because they demonstrate that it is possible for machines to behave like animals. However, the mechanisms of Wiener and Braitenberg act like animals just externally, not internally. Their behaviors are designed, not shaped by autonomous development, cognition or learning. The Skinner automaton is able to serve as the sensorimotor
Figure 11 The operant probability graphs: results from the POC of the simulated birds being biased and hyperepithymic.
systems of autonomous agents and robots so that they can gradually develop their motor skills like animals. We program a 3D Wienerian worm, shown in Figure 12(a), in a virtual reality environment. It is 4 cm in length and 2.4 cm in width, and is equipped with the receptors of two photosensors PSL and PSR, and the effectors of two motors ML and MR that respectively drive the left wheel WL and the right wheel WR.

As depicted in Figure 12(b), a Skinner automaton (SAUTO) is used to serve as the photosensory-motor system of the 3D Wienerian worm, which receives state signals from the receptors PSL and PSR, and sends operant signals to the effectors ML and MR. Each state of the SAUTO is a vector s = (Δρ, ρ̄), where Δρ = 1 if ρL > ρR, or else Δρ = −1; ρL and ρR are the luminances measured by PSL and PSR, respectively; and ρ̄ = (ρL + ρR)/2, which is divided into 9 levels. There are altogether 18 states in the SAUTO. The internal energy function of the SAUTO is defined by

E^(S)(s) = 250 ρ̄².  (17)

Each operant of the SAUTO is a vector o = (ν, ω), where ν is the instructive signal of velocity and ω is the instructive signal of direction; ν is divided into 4 levels, i.e., 0, 1, 2, and 3. If ν = 0, we have ω = 0; else ω = ±0.157. Accordingly, there are altogether 7 optional operants for the 3D Wienerian worm. The operant energy function is defined by

E^(O)(o) = 0.025ν, (ν ∈ {0, 1, 2, 3}).  (18)

The 3D Wienerian worm is confined to a small ring of radius 26 cm where a light lies in the center, so that the centre is bright and the edge is dark (see Figure 13). The operant learning process of the 3D Wienerian worm is shown in Figure 14, in which an obstacle avoidance scheme starts up when the 3D Wienerian worm is too close to one of the walls of the ring. By operant learning, the 3D Wienerian worm gradually develops its motor skills of negative phototropism to search for darkness. Figure 14 depicts the footprints of the 3D Wienerian worm in the ring at different temperatures, in which the case at the artificial temperature TA = 10000 °C takes 3000 discrete time steps, and the others take 1000 discrete time steps. Figure 14 implies that the behaviors of the 3D Wienerian worm get organized more and more as the SAUTO cools. The 3D Wienerian worm is totally unorganized at the artificial temperature TA = 10000 °C, but gets highly organized at TA = 0.001 °C. After having finished operant learning, the 3D worm is able to adroitly search for darkness in practice, and walk from any jumping-off point to the darkness. Lastly it moves very slowly or stays at the foot of the walls.

It is most interesting that, by operant learning, the higher the luminance of the environment, the faster the 3D Wienerian worm moves. The Skinner automaton makes the Wienerian worm behave like an animal not only externally but internally as well.

8 A robot learning to balance

Two-wheeled robots are a class of balancing robots, which are absorbing due to their inherently unstable dynamics. These robots are characterized by the ability to balance on two wheels and spin on the spot [45], whereas such ability is generally man-made and designed in advance.

In a sense, two-wheeled robots are a class of bionic systems, which imitate the human upright posture and attempt to exhibit some balancing skills to balance their postures. The posture balancing skills of a human are developed and shaped during operant learning, in which operant conditioning plays an important role. Without a doubt, it is significant for balancing robots to learn balancing skills like humans and animals.

We have built a physical two-wheeled robot [46] called Hominid 3 that is 58 cm in height and 22.5 kg in weight (see Figure 15(a)). Hominid 3 is a flexible balancing robot with complicated dynamics. It has a flexible lumbar made of a spring, so that it is more bionic and more challenging to posture balancing. As shown in Figure 15(b), Hominid 3
Figure 12 (a) A 3D Wienerian worm. (b) The sensorimotor system of the 3D Wienerian worm.
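The worm's state and operant spaces, together with the energy functions of eqs. (17) and (18), can be enumerated in a few lines. This is a sketch: treating ρ̄ directly as its quantized level index, and the ordering of the ±0.157 direction signals, are assumptions.

```python
def worm_spaces():
    # 18 states: s = (delta_rho, rho_bar), delta_rho in {-1, +1},
    # rho_bar quantized into 9 levels
    states = [(d, lvl) for d in (-1, +1) for lvl in range(9)]
    # 7 operants: o = (nu, omega); omega = 0 when nu = 0, else +/-0.157
    operants = [(0, 0.0)] + [(nu, w) for nu in (1, 2, 3) for w in (0.157, -0.157)]
    return states, operants

def E_S(s):
    d, rho_bar = s
    return 250.0 * rho_bar ** 2   # eq. (17): darker (small rho_bar) = lower energy

def E_O(o):
    nu, omega = o
    return 0.025 * nu             # eq. (18): slower motion = lower operant energy
```

Under the low-energy propensity of Section 2.3, these energies reward reaching dark states while keeping fast operants available to escape bright regions, which is consistent with the observation that the worm moves faster in higher luminance.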
Figure 14 The footprints of the 3D worm searching the darkness during the process of operant learning.
Table 3 The state levels of the tilt angle and the angular velocity
Table 4 The operant levels of the PWM value

Level k    U_PWM
−7         −2500
−6         −1850
−5         −1250
−4         −750
−3         −450
−2         −250
−1         −100
0          0
+1         +100
+2         +250
+3         +450
+4         +750
+5         +1250
+6         +1850
+7         +2500

Figure 17 The operant entropy and the mean of the internal energy versus log10(TA(°C)), where TA (= K_B T) is the artificial temperature.

9 Conclusion

The Skinner automaton makes the psychological theory of operant conditioning 'take the form of computer programs'. The Skinner automaton develops the Monte Carlo method from the Metropolis algorithm to the Skinner algorithm, in which the operant (active and voluntary behavior) is introduced as an essential element. As we state in Section 3, although the Monte Carlo method can be used to investigate the relation between operants (behaviors) and consequences in the process of response-stimulus (R→S) conditioning of an organism, it can select neither operants (behaviors) nor consequences. The Metropolis algorithm has modified the Monte Carlo method so that it is able to select (accept or refuse) the states or configurations of a system.
However, the Metropolis algorithm cannot be directly applied to biological systems. For an organism, the states or configurations imply its behavioral consequences. During the R→S conditioning process of an organism, once a consequence comes into being, it is impossible to refuse it. Now, the Skinner automaton expands the algorithm to biological systems, and moreover, it is able to represent the behavior selection mechanism in an operant conditioning way. The Skinner algorithm selects operants (behaviors) via operant conditioning, and it selects consequences via selecting the operants…
18 Thathachar M A L, Sastry P S. A new approach to designing reinforcement schemes for learning automata. IEEE Trans Syst Man Cybern, 1985, SMC-15: 168–175
19 Lanctot J K, Oommen B J. Discretized estimator learning automata. IEEE Trans Syst Man Cybern, 1992, 22: 1473–1483
20 Thathachar M A L, Phansalkar V V. Learning the global maximum with parameterized learning automata. IEEE Trans Neural Netw, 1995, 6: 398–406
21 Phansalkar V V, Thathachar M A L. Local and global optimization algorithms for generalized learning automata. Neural Comput, 1995, 7: 950–973
22 De Hauwere Y-M, Vrancx P, Nowé A. Generalized learning automata for multi-agent reinforcement learning. AI Commun, 2010, 23: 311–324
23 Viswanathan R, Narendra K S. A note on the linear reinforcement scheme for variable-structure stochastic automata. IEEE Trans Syst Man Cybern, 1972, SMC-2: 292–294
24 Poznyak S, Najim K. On nonlinear reinforcement schemes. IEEE Trans Automat Contr, 1997, 42: 1002–1004
25 Stoica F, Popa E M. An absolutely expedient learning algorithm for stochastic automata. WSEAS Trans Computers, 2007, 6: 229–235
26 Stoica F, Popa E M. A new evolutionary reinforcement scheme for stochastic learning automata. In: Mastorakis N E, Mladenov V, Bojkovic Z, et al., eds. Proceedings of the 12th WSEAS International Conference on Computers, Stevens Point, Wisconsin, USA, 2008. 268–273
27 Simian D, Stoica F. A new nonlinear reinforcement scheme for stochastic learning automata. In: Proceedings of the 12th WSEAS International Conference on Automatic Control, Modeling & Simulation, Catania, Sicily, Italy, 2010. 450–454
28 Metropolis N, Rosenbluth A W, Rosenbluth M N, et al. Equation of state calculations by fast computing machines. J Chem Phys, 1953, 21: 1087–1092
29 Jorgensen W L. Perspective on 'Equation of state calculations by fast computing machines'. Theor Chem Acc, 2000, 103: 225–227
30 Kirkpatrick S, Gelatt C D, Vecchi M P. Optimization by simulated annealing. Science, 1983, 220: 671–680
31 Černý V. A thermodynamical approach to the travelling salesman problem: An efficient simulation algorithm. J Optim Theory Appl, 1985, 45: 41–51
32 Horowitz M J. Introduction to Psychodynamics: A New Synthesis. New York: Basic Books, 1988. 17–243
33 Palm W J. System Dynamics. 2nd ed. London: McGraw-Hill Science/Engineering/Math, 2009. 172–283
34 Kiese-Himmel C. Verstärkungslernen: Operante Konditionierung (Reinforcement learning: operant conditioning). Sprache-Stimme-Gehör, 2010, 34: 1
35 Dayan P, Balleine B W. Reward, motivation and reinforcement learning. Neuron, 2002, 36: 285–298
36 Oudeyer P Y, Kaplan F, Hafner V V. Intrinsic motivation systems for autonomous mental development. IEEE Trans Evolut Comput, 2007, 11: 265–286
37 Brucke E W. Lectures on Physiology. Vienna: Braumuller, 1874
38 Haynie D. Biological Thermodynamics. Cambridge: Cambridge University Press, 2001. 293–330
39 Nicholls D G, Ferguson S J. Bioenergetics. 4th ed. Europe: Academic Press, 2013. 1–52
40 Hopfield J J. Networks, computations, logic, and noise. In: Proceedings of the IEEE First International Conference on Neural Networks, California, USA, 1987. 109–141
41 Neumann J von. Various techniques used in connection with random digits. In: Monte Carlo Method. Applied Mathematics Series, vol. 12. Washington D.C.: U.S. Department of Commerce, National Bureau of Standards, 1951. 36–38
42 Skinner B F. 'Superstition' in the pigeon. J Exp Psychol, 1948, 38(2): 168–172
43 Wiener N. Cybernetics: Or Control and Communication in the Animal and the Machine. New York: J. Wiley, 1948. 60–132
44 Braitenberg V. Vehicles: Experiments in Synthetic Psychology. USA: The MIT Press, 1986. 95–144
45 Ooi R C. Balancing a two-wheeled autonomous robot. Master's dissertation. Perth: University of Western Australia, 2003. 1–7
46 Ruan X G, Li X Y, Zhao J W, et al. A flexible two-wheeled self-balancing robot system and its motion control method. China Patent 200910084259.8, 2010-10-9
47 Asada M, Hosoda K, Kuniyoshi Y, et al. Cognitive developmental robotics: A survey. IEEE Trans Auton Ment Dev, 2009, 1: 12–34
48 Wood S E, Wood E G, Boyd D. Mastering the World of Psychology. Boston: Allyn & Bacon, 2004. 333–354
49 Baranès A, Oudeyer P Y. R-IAC: Robust intrinsically motivated exploration and active learning. IEEE Trans Auton Ment Dev, 2009, 1: 155–169
50 Oudeyer P Y, Kaplan F. What is intrinsic motivation? A typology of computational approaches. Front Neurorobot, 2007, 1: 1–14