You are on page 1of 17

SCIENCE CHINA

Technological Sciences
Progress of Projects Supported by NSFC November 2013 Vol.56 No.11: 2745–2761
doi: 10.1007/s11431-013-5369-0

The skinner automaton: A psychological model formalizing


the theory of operant conditioning
RUAN XiaoGang & WU Xuan*
Institute of Artificial Intelligence and Robots, School of Electronic Information and Control Engineering,
Beijing University of Technology, Beijing 100124, China

Received August 6, 2013; accepted September 13, 2013; published online September 30, 2013

Operant conditioning is one of the fundamental mechanisms of animal learning, which suggests that the behavior of all animals,
from protists to humans, is guided by its consequences. We present a new stochastic learning automaton called a Skinner au-
tomaton that is a psychological model for formalizing the theory of operant conditioning. We identify animal operant learning
with a thermodynamic process, and derive a so-called Skinner algorithm from Monte Carlo method as well as Metropolis algo-
rithm and simulated annealing. Under certain conditions, we prove that the Skinner automaton is expedient, -optimal, optimal,
and that the operant probabilities converge to the set of stable roots with probability of 1. The Skinner automaton enables ma-
chines to autonomously learn in an animal-like way.

Learning automata, Boltzmann distribution, operant conditioning, operant learning, simulated annealing

Citation: Ruan X G, Wu X. The skinner automaton: A psychological model formalizing the theory of operant conditioning. Sci China Tech Sci, 2013, 56:
27452761, doi: 10.1007/s11431-013-5369-0

1 Introduction and involuntary in the process of Pavlovian conditioning.


He believed that animals engaged with their environments
actively and voluntarily [1]. Skinner termed an active be-
Operant conditioning is one of the fundamental concepts in
havior as an operant [2], from which operant conditioning
behavioral psychology, and sometimes it is called the Skin-
took its name.
nerian conditioning due to Skinner’s pioneering work [1, 2].
Operant conditioning is an important form of psycholog-
Earlier work dates from American psychologist Thorndike
ical learning and sometimes called operant learning, during
who investigated trial-and-error learning in cats, dogs,
which humans and animals learn to associate their behaviors
chicks, and monkeys. He developed his famous Law of Ef-
with the consequences. Operant learning inspires the study
fect [3] that formed the conceptual starting point for Skin-
of machine learning. There are some computational models
ner’s work. Watson, the founder of behaviorism [4], argued
trying to simulate operant conditioning. The earlier work
that behavior could be researched scientifically without re-
was contributed by Grossberg [7, 8] who proposed an ide-
course to inner mental states [5]. Skinner was a radical be-
alized neural network modelling operant conditioning for
haviorist and much influenced by Watson. However, Skin-
explaining behavioral data on learning in vertebrates.
ner thought that there existed serious shortcomings in Wat-
Grossberg’s model was later used for robots to learn the
son’s theory that was totally based on Pavlovian condition-
operant behavior of obstacle avoidance by Chang and
ing [6]. In Skinner’s opinion, organisms were too passive
Gaudiano [9]. Touretzky and his team studied the operant
conditioning model for making the physical robot Skin-
*Corresponding author (email: abc0767@live.com) nerbot trainable [10–12]. Itoh et al. considered that robots

© Science China Press and Springer-Verlag Berlin Heidelberg 2013 tech.scichina.com www.springerlink.com
2746 Ruan X G, et al. Sci China Tech Sci November (2013) Vol.56 No.11

need to express emotions, behaviors and personality in a convergence to global optimization. Our theoretical analysis
human-like manner. They presented a behavior model with shows that the Skinner automaton governed by the Skinner
operant conditioning and applied it to an emotion expres- algorithm is not only expedient and -optimal, but also op-
sion humanoid robot (WE-4RII) so that the machine could timal and absolutely expedient. At the end of this paper,
autonomously select suitable behavior [13, 14]. some optimal computing and simulated experiments are
Reinforcement learning is an important form of machine presented with the Skinner automaton to test the perfor-
learning, which is inspired by behaviorist psychology. More mance of the Skinner automaton. Our work suggests that
exactly, operant conditioning theory underlies reinforce- operant conditioning not only is psychological and biologi-
ment learning. In fact, the concept of reinforcement origi- cal, but also can be computational, psychodynamical and
nates from the theory of operant conditioning in which re- thermodynamical. The Skinner automaton enables autono-
inforcement plays a key role. Sutton and Barto established mous agents including robots to autonomously learn in an
the basic principles of reinforcement learning [15]. Howev- animal- like way.
er, reinforcement learning can be traced back to the work on The rest of this paper is organized as follows. Section 2
learning automata (LA). The term “learning automata” was unscrambles operant conditioning and extracts its funda-
first used by Narendra and Thathachar [16], who pointed mental elements. Section 3 configurates the Skinner autom-
out that the first learning automata were developed in aton for synthesizing operant conditioning and describes the
mathematical psychology. The original notion of learning Skinner algorithm that is the core of the Skinner automaton.
automata corresponds to the so-called finite action-set Section 4 discusses the self-organizing property of the
learning automata (FALA) [16]. An FALA has a given set Skinner automaton. Sections 5 and 6 reproduce famous
of actions and a specific reinforcement scheme, which gen- Skinner’s experiments of pigeons with the Skinner automa-
erally operates in an unknown random environment and ton. In Section 7, an animated 3D Wienerian worm is built
updates its action probabilities in accordance with the re- and the Skinner automaton serves as its photosensory-motor
sponses from the environment. It is obvious that learning system to shape its behavior of negative phototropism by
automata are also inspired by behaviorist psychology. operant learning (i.e. learning in operant conditioning way).
Learning automata are the models of operant learning as In Section 8, the Skinner automaton serves as the sen-
well as the formal frameworks for reinforcement learning. sorimotor system related posture and movement of a physi-
The notion of learning automata has been extended over cal flexible two-wheeled robot, and enables the robot to
the years [17]. FALA ensure convergence only to a local gradually develop its motor skills and to learn to balance in
maximum of the reinforcement signal [18, 19]. To get con- autonomous operant conditioning. Finally, our conclusions
vergence to global maximum, parameterized learning au- are given in Section 9.
tomata (PAL) were proposed [20]. To handle associative
reinforcement learning problems, generalized learning au-
tomata (GLA) were introduced [21, 22]. The common 2 Analysis of operant conditioning
theme among them is reinforcement schemes that serve as
the basis of the learning process for learning automata A natural question about machine learning is whether it is
[23–27]. According to their linearity, reinforcement compatible with animal learning. We aim at formalizing the
schemes can be categorized into linear, nonlinear, and hy- theory of operant conditioning (OC) in a form analogous to
brid schemes. A reinforcement scheme in a learning autom- learning automata (LA) so that artificial systems learn in an
aton is the crucial factor that affects the performance of the operant conditioning way like animals. From psychodynam-
learning automaton. According to the properties exhibited ics [32], operant conditioning is a dynamical process. A
by learning automata using the schemes, reinforcement learning automaton is a dynamical system. In this section, we
schemes can be classified as expedient, -optimal, optimal, try to unscramble the dynamical process of OC in terms of
and absolutely expedient. LA so that our OC model is compatible with OC in animals.
As has been pointed out by Touretzky and Saksida [10],
however, animals learn much more complicated behaviors
through operant conditioning than machines acquiring 2.1 Learning automata
through reinforcement learning or learning automata. In this Here the LA theory is briefly described to provide a back-
paper, we address a so-called Skinner automaton as a new ground for understanding the learning process of LA and for
psychological model for formalizing the theory of operant being compared and contrasted with OC.
conditioning. We identify animal operant learning with a Generally, a learning automaton (LA) is a stochastic dy-
thermodynamic process, and derive the so-called Skinner namical system operating in a random environment, and can
algorithm from Metropolis algorithm [28, 29] and simulated be represented with a 7-tuple:
annealing [30, 31]. The Skinner algorithm is the core of the
Skinner automaton and serves as its reinforcement scheme. LA  t , S ,  ,  , F , G, A , (1)
With simulated annealing, the Skinner automaton ensures where t{0, 1, 2, ⋯} represents the discrete time for each
Ruan X G, et al. Sci China Tech Sci November (2013) Vol.56 No.11 2747

stage in the learning process of the LA, S a set of internal in animals.


states,  a set of actions that are the outputs of the LA,  a
set of responses that are the inputs of the LA from the envi- 2.2 Association between behaviors and consequences in
ronment, F: SS a function that maps the current state OC
and input into the next state, which can be deterministic or
stochastic, G: S (p) a function that maps the current All living organisms are dynamical systems. In terms of
state into the current output (an action in ) with the occur- system dynamics [33], at any given time an organism has an
rence probability p of the action , and A: pp an internal state, which can be psychological and physiological,
algorithm (also called an updating scheme or reinforcement and always evolving over time.
scheme) that maps the current action probabilities, action Operant conditioning (OC) is a form of psychological
and input into the next (new) action probabilities. and physiological dynamic process of organisms, which
The dynamical process of the LA is depicted in Figure 1, makes an association between behaviors and consequences.
showing that the interaction of the LA and its environment In OC, behaviors are supposed to be voluntary and active,
and called operants. Operant, termed by Skinner [2], is a
conduce to a closed-loop feedback connection. Action (t),
core concept in the theory of OC, which reflects the basic
the output of the LA at stage t, operates on the environment
characteristics of behaviorism and has twofold important
and induces it to give the response (t) that is fed back to
implications: (i) It is a voluntary and active behavior of or-
the LA and serves as the input at stage t to induce the LA to
ganisms, and (ii) it will generate a certain consequence.
take the next action (t+1).
In a sense, the concept of operant in OC is identified with
The LA in eq. (1) has two dynamical characteristics: One
the term action in the theory of LA.
is the state transition, and the other is the learning update.
Any operant of organisms must result in a certain conse-
At any stage, the next state of the LA depends not only on
quence that, in turn, can cause a new operant. In the process
the current inputs of the LA but also on the current state.
of OC, the alternation of operant and consequence conduces
The LA in eq. (1) learns through updating the action proba-
to an internal feedback loop of organisms through the envi-
bilities on the base of the responses from the environment.
ronment as depicted in Figure 2, which contains both the
The next (new) action probabilities depend not only on the
feed-forward and the feedback associations between operant
current action and response from the environment (i.e., the
and consequence. In the process of OC, consequence is
current input of the LA), but also on the current action
sometimes regarded as stimulus, and behavior (operant) as
probabilities.
response, and therefore, operant conditioning is also called
The LA in eq. (1) has three basic elements: (i) A set of
the response-stimulus (RS) conditioning. The feed-forward
actions (), (ii) a set of reinforcers (), and (iii) a rein- in the closed loop of OC is an SR process from stimulus
forcement scheme (A). It appears that the LA can make re- (consequence) to response (operant), which refers to be-
inforcement learning in an operant conditioning way, in havior selection of organisms, or operant selection in terms
which the reinforcement scheme A always tries to reward of OC. Whereas the feedback is an RS process from re-
good actions and to punish bad ones. In terms of psychody- sponse (operant) to stimulus (consequence), which refers to
namics [32], however, there is much difference between the the state transition of organisms.
action probabilities updating in LA and operant conditioning In the theory of LA, the inputs (i.e., the responses to ac-
tions) from the environment are regarded as the conse-

Figure 1 Learning automaton and its dynamical process, where (t), (t),
and p(t) respectively represent the action of the LA, the response of the Figure 2 The closed loop of alternation of behavior and consequence in
environment, and the action probabilities at stage t. OC.
2748 Ruan X G, et al. Sci China Tech Sci November (2013) Vol.56 No.11

quences induced by actions, such as reinforcers, either re- the state change, but cannot tell the process. We need to
wards or punishments for actions. However, we argue that, notice some differences between eqs. (1) and (3). The state
in the process of OC, the responses from the environment transition in OC is pushed directly by operants, whereas that
are not the real consequences of operants, but only the me- in LA by the responses from the environment to actions.
diate ones. The real consequences induced by operants are
the state changes of the organism in OC. For instance, a 2.3 Motivation and reinforcement in OC
reward for an operant is not the final consequence of the
operant, but a medium between the operant and its conse- Reinforcement is a key concept in operant conditioning [34],
quence, which tends towards bringing the organism to a which is defined as an event that increases the occurrence
certain satisfied internal state that is the real consequence probability of an operant by following the operant with a
induced by the operant. Accordingly, in the process of OC, certain reinforcer that is something strengthening or re-
operant consequences can be supposed to be the state warding the operant. The opposite is punishment, which
changes of organisms. To some extent, the current state of lowers the probability. Reinforcement plays an important
the organism represents the consequence of the last operant. role in shaping or modifying behaviors of human beings and
As a consequence, operant selection in the SR feed- animals. If reinforcement is supposed to be either positive
forward process of OC can be represented with the function or negative, then both reward and punishment can be re-
G(p): S in LA (1). Some learning automata formulate garded as some reinforcement. Computationally, in that
behavior selection with function G: S (p) that means case, positive reinforcers can stand for reward, and negative
behavior selection in LA depends on both the state of LA ones for punishment.
and the input from the environment. As stated above, how- In the process of OC, organisms seem inclined to exhibit
ever, the effect of the input from the environment can be more frequently behaviors that lead to reward, and less fre-
supposed to be contain in the state of the organism in OC. quently lead to punishment. However, we argue that organ-
Accordingly, at each stage t, operant selection in OC can be isms seek not so much for rewards for certain satisfied psy-
generalized computationally with chological and physiological feelings or states. Without
p  o ( t )| s ( t ) 
doubt, the psychological and physiological feelings or states
SR : s (t )  o(t ), (2) of organisms are the most radical reinforcers in the process
of OC or operant learning. Actually, in the process of OC,
where s(t) and o(t) are the current state and operant, respec- the effect of any reinforcement would be finally reflected in
tively, and p(o(t)|s(t)) is the occurrence probability of the the change of psychological and physiological feelings or
operant o(t) at the state s(t), i.e., the probability of the oper- states of organisms. Happiness, joviality, and comfortability,
ant o(t) selected at the state s(t). Eq. (2) formulates the SR etc. are positive reinforcers which inclined to increase the
feed-forward process from consequence (state) to operant in occurrence probabilities of operants, whereas sadness, dis-
OC, which states that the organism in OC selects its current appointment, dejection, and melancholy, etc. are negative
operant totally based on its current state. ones which inclined to decrease the occurrence probabilities
The state transition of LA as represented with eq. (1) fol- of operants.
lows the function F: SS, which means that the state Reinforcement in OC is related to the motivation [35, 36]
transition of the LA is driven by the inputs (i.e., the re- that causes organisms to take action or operant. Motivation
sponses to the actions) from the environment. As we have is a kind of force, which initiates, guides and maintains
argued above, however, the state transition of an organism goal-oriented behaviors or operants in terms of OC. Operant
in OC can be supposed to be consequences induced by op- conditioning or operant learning tends towards giving rise to
erants. Thereupon, the state transition of OC does not really positive reinforcement when the psychological and physio-
depend on the responses from the environment, but on the logical feeling of the organism is consistent with the moti-
operants of the organism itself. Accordingly, at each stage t, vation, or when the state transition of the organism is guid-
the state transition of the organism in OC can be generalized ed towards the goal that the motivation means, otherwise to
computationally with negative reinforcement.
The mechanism of operant conditioning can be charac-
RS : s (t )  o(t )  s (t  1), (3)
terized with psychodynamics [32] that is a blend of psy-
where s(t) and o(t) are the current state and operant, respec- chology and thermodynamics. Brucke, the founder of psy-
tively, and s(t+1) is the next state. Eq. (3) formulates the RS chodynamics, supposed that all organisms were ener-
feedback process from operant to consequence (state) in OC, gy-systems governed by the first law of thermodynamics
which states that operant is the power that drives the state [37]. It can also be characterized with biological thermody-
transition of organisms, and the future state of the organism namics [38] that refers to bioenergetics [39] and investigates
in OC depends on both the current state and operant. the energy transductions in living organisms. Coincidently,
Actually, in OC, the state transition process of an organ- both psychodynamics and biological thermodynamics are
ism is a black box. The organism can sense its own state or inspired by thermodynamics and associated with energy,
Ruan X G, et al. Sci China Tech Sci November (2013) Vol.56 No.11 2749

which suggests that, in the process of OC, biological or forcement (reward) that p(o(t)|s(t))>0, or negative rein-
psychological and physiological states require energy to forcement (punishment) that p(o(t)|s(t))<0. p(o(t)|s(t)) is
hold, and operants require energy to operate. Accordingly, the reinforce at the stage t that follows the operant o(t). Re-
an organism in OC can be supposed to be an energy system inforcement in OC depicted in Figure 3 and in eq. (6) obeys
ESYS that consists of two functions: one is the internal ener- the principle of the simultaneity of cause and effect, alt-
gy function ES that assigns each biological state a nonnega- hough there is a lapse of time between an operant and its
tive energy value; the other is the operant energy function reinforcer.
EO that assigns each operant a nonnegative energy value, From the above analysis, there are three fundamental
that is, elements in operant conditioning: (i) Operant that is active
or voluntary behavior, (ii) motivation that is embodied as an
E SYS (t )  ∆E S (t )  E O (t ), energy system, and (iii) reinforcement that is computation-
(4)
E S : s (t )  R  , E O : o(t )  R  , ally be formulated as the update of the occurrence probabil-
ities of operants guided by the motivation system.
where ES(t)=ES(t1)ES(t) is the increment of the internal Eqs. (2)(6) generalize the behavior selection, state tran-
energy from t to t+1. The internal energy ES is potential en- sition, motivation system, and reinforcement of organisms
ergy for organisms to hold the biological states, and the in OC, respectively, and become the principles for us to
operant energy EO is kinetic energy for organisms to operate design the Skinner automaton.
operants. Eq. (4) implies that the state transition of an or-
ganism in OC driven by an operant is accompanied with the
change of the internal energy and the consumption of the 3 Synthesis of Skinner automaton
operant energy. At each stage t, ESYS estimates the conse-
quence (effect) of the operant o(t) at the state s(t) by a Inspired with the ideas of psychodynamics and biological
nonnegative energy value. thermodynamics [32, 37–39], a so-called Skinner automaton
It is usually supposed that low-energy states are coinci- (SAUTO) is built in this section as a machine learning mod-
dental with the propensity or tropism of organisms [40]. The el for artificial systems (autonomous agents) to learn in op-
motivation of the organism in OC can be supposed to be the erant conditioning way like animals.
propensity or tropism of the organism to hold the low-
est-energy state by the lowest-energy operant. Thus, the 3.1 Definition of the SAUTO
motivation of an organism in OC can be characterized by a
motivation system MSYS with an objective function that is From Figures 13 and eqs. (2)(6), the Skinner automaton
based on the energy system ESYS: can be defined with a 7-tuple as

M SYS : J  o(t )    t  0 E SYS (t ),



(5) SAUTO  (t , S , O, M , F , G, A), (7)

where t{0, 1, 2, ⋯} represents the discrete time, S a set of


where the objective function J is also called an operant in-
internal states, O a set of operants, M: S(t)O(t)S(t+1)
dex. Eq. (5) states that the motivation of the subject of OC
computationally implies the desired goal to minimize the R+([0,+)) the motivated unit, F: S(t)O(t)S(t+1) the
operant index and to acquire an optimal sequence of oper- state transition process, G: S(t)O(t) (p) the operant selec-
ants {o*(t)}. Therewith, reinforcement (RF) of OC can be tion process, and A the Skinner algorithm.
formulated computationally as the update of the occurrence In the SAUTO, the discrete time t is twofold: (i) to dis-
probabilities of operants: cretize the process of operant conditioning for digital com-
puting, and (ii) to mark the simultaneity of cause (operant)
 0, E SYS (t )  0,
RF : ∆p  o(t ) | s (t )   (6)
 0, E SYS (t )  0,
where p(o(t)|s(t)) is the occurrence probability of the oper-
ant o(t) at the state s(t), p(o(t)|s(t)) is the increment of the
occurrence probabilities p(o(t)|s(t)), and ESYS(t) is the ener-
gy value from the energy system (4).
The motivation and reinforcement of OC can be depicted
with Figure 3, showing that the process of OC can be illus-
trated by a probability updating process. In the reinforce-
ment scheme (6), the operant o(t) at the state s(t) is rein-
forced by giving the increment p(o(t)|s(t)) to the occur-
rence probabilities p(o(t)|s(t)). This implies positive rein- Figure 3 Motivation and reinforcement in OC.
2750 Ruan X G, et al. Sci China Tech Sci November (2013) Vol.56 No.11

and effect (consequence or reinforcer). Each discrete time  1, ∆E  0,


can stand for a stage in the process of OC. p( s  | s )    ∆E / KBT (8)
 e , ∆E  0,
F is the state transition process of the SAUTO that, at
each stage t, maps the current state s(t)S and operant where T is temperature, KB is Boltzmann’s constant, and
o(t)O into the next state s(t+1)S. It is generally a black E=E(s)E(s) is the energy increment. Eq. (8) implies that
box and can be deterministic or stochastic. the lower the internal energy of the mutations s, the larger
G is the operant selection process of the SAUTO that, at its opportunity accepted. The mutations s will absolutely be
each stage t, maps the current state s(t)S into the current accepted if E0. The Metropolis algorithm is guaranteed to
operant o(t)O with the occurrence probability p of the produce a population of states governed by the Boltz-
operant o(t) at s(t). In other words, at each stage t, the mann-Gibbs distribution.
SAUTO selects the current operant (action) based on the In the Metropolis algorithm, the random mutations s can
current state s(t). be rejected if E > 0. In OC, however, the state s(t+1) cannot
M is the motivated unit of the SAUTO, which is an ener-
be rejected in Metropolis’ way, as it has already become the
gy system with the internal energy function ES and the op-
consequence generated by the operant o(t). An organism or
erant energy function EO. Based on eq. (4), at each stage t, it an animal cannot refuse any consequence induced by its
maps the internal energy increment ES(t) from s(t) to s(t+1) behavior. It can only attempt to avoid the consequence dis-
and the operant energy EO(t) of the operant o(t) that con- satisfied by exhibiting less frequently the correlative oper-
duces to ES(t) into R+ ([0,+)). The motivated unit M de- ant. Like organisms or animals, the SAUTO regulates the
sires the SAUTO to hold the lowest-energy state by the occurrence probability of the operant o(t) at the state s(t) in
lowest-energy operant. operant conditioning way so that the opportunity that the
state s(t+1) occurs following the state s(t) decreases if it has
3.2 Occurrence probabilities of operants a higher internal energy value than the state s(t) and the
operant o(t) has a higher operant energy value, otherwise,
From the theory of OC, organisms or animals inclined to increases if it has a lower internal energy value than the
exhibit more frequently behaviors that lead to reward, and state s(t) and the operant o(t) has a lower operant energy
less frequently lead to punishment. In other words, operant value. Here we define the occurrence probability p(o|s) of
conditioning is inclined to reward an operant so that its oc- an operant o(O) at a given state s(S) as
currence probability increases if its consequence is coinci-
dental with the biological propensity or motivation, other-  1 ESYS ( o|s ) / KBT
wise, to punish it so that the probability decreases.  p(o | s )  Z ( s ) e ,
 (9)
From psychodynamics and biological thermodynamics,  Z ( s )   e ESYS ( o |s ) / KBT ,
operant conditioning is a thermodynamic process, and  o O
therefore, can be investigated in the way of thermodynamics.
This makes us associate with the Metropolis algorithm and where ESYS(o|s)=ES(o|s)+EO(o) is the energy value of the
simulated annealing in which the Monte Carlo method is the energy system in eq. (4), EO(o) is the operant energy of the
most fundamental. operant oO, ES(o|s) is the internal energy increment in-
The Monte Carlo method relies on repeated random duced by the operant oO at the state sS, and  is the
sampling to compute its results [41]. It can be used to inves- statistic of ESYS(o|s) as the o iteratively emerges at the s in
tigate the relation between operants and consequences of the the state transition process of OC. Eq. (9) is consistent with
SAUTO in OC. A state s(t) and an operant o(t) can be se- eq. (6) that generalizes the reinforcement mechanism of OC.
lected randomly at each stage t to generate the state s(t+1). This implies that an operant that has lower operant energy
Summing up the results of all times, one can obtain the sta- and induces the internal energy of the SAUTO to descend
tistical properties of F. However, there is no conditioning tends towards a larger occurrence probability. This is also
between operants and consequences in the Monte Carlo analogous to eq. (8) employed in the Metropolis algorithm.
method. Generally, the state transition process F can be a proba-
In statistical thermodynamics applications, Metropolis et bilistic model, a stochastic process, or a physical system
al. presented a modified Monte Carlo procedure [28, 29] to that is rather non-determinative and with uncertainty. We
obtain a practical solution for the configurational integrals, can take the statistic of ESYS(o|s) in eq. (9) by collecting
in which an internal energy function E is defined for ther- the energy value ESYS(o|s) at each discrete time t in various
modynamic systems. Their key contribution is that, instead different ways. For example, ESYS(o|s) can be the arithme-
of unconditionally accepting the random mutations s gen- tic mean or weighted arithmetic mean of the energy values.
erated from the current configuration s, they accept the s One of the optional statistical methods is that, at stage t,
with a probability  o  O and s  S , if o  o(t ) and s  s (t ) , then
Ruan X G, et al. Sci China Tech Sci November (2013) Vol.56 No.11 2751

E SYS (o | s )   E SYS  o(t ) | s (t )   (1   )E SYS (o | s ) , (10) 1) Initializing: sS and oO, set ESYS(o|s)=0. Let the
initial time t be zero, the initial state be s(t)=s0 (S), the
where   (0,1) . Eq. (10) is actually a low pass filter, artificial temperature TA (=KBT) be large enough.
which is twofold: (i) | E SYS (o | s ) | gradually expands 2) Behavior selecting: Compute the occurrence probabil-
ities of the operants (in O) at the s(t) from eq. (9) with
from zero to a relative stable statistic with iterative operant
ESYS(o|s(t)), and select an operant o from O using the
conditioning, and (ii) it is some adaptive to the changing
environments. probabilities.
3) Operating: Put the operant o(t)=o on the state transition
process F to get the state s(t+1).
3.3 Skinner algorithm 4) Summing up operant effect: Get the energy value
The Skinner algorithm is the core of the SAUTO, which ESYS(o(t)|s(t)) induced by the o(t) at the s(t), take the statistic
simulates the mechanism of OC and is employed as a rein- ESYS(o(t)|s(t)) of the energy system. Replace t by t+1, and
forcement scheme by the SAUTO. The Skinner algorithm repeat from step 2 until thermal equilibrium at the TA or
derives its philosophy from the Metropolis Monte Carlo enough repetition is reached.
Method [28, 29] and simulated annealing [30, 31]. With the 5) Reinforcing by cooling: Decrease the TA according to
Skinner algorithm, the SAUTO runs like a thermodynamic a specified cooling schedule, and repeat from step 2 until
system and tries to gain some optimal learning. the TA is low enough or F reaches the frozen state.
The Skinner algorithm employs the Monte Carlo method At the start of the Skinner algorithm, all the operants in
to repeatedly take the sample of operant-consequence from O have the same occurrence probabilities for any given state
the state transition process of the SAUTO at each stage t and in S at time t=0. In the end, however, there will be rare op-
sum up the operant effect. It defines the occurrence proba- erants fired to hold the lowest-energy states.
bility of operant with energy function, which is analogous to The Skinner algorithm is able to simulate operant condi-
eq. (8) of Metropolis algorithm. Moreover, it establishes tioning and behavior selection, which runs in the way of the
conditioned operant of the SAUTO in the way of the Me- Metropolis Monte Carlo Method and simulated annealing.
tropolis algorithm. It implies some establishment of condi-
tioned operant in OC that the SAUTO reaches thermal equi-
librium at a specified temperature T. However, the condi-
4 Thermal equilibrium and operant entropy
tioned operants are probably not globally optimal.
Inspired by annealing in metallurgy, simulated annealing The Skinner automaton is a thermodynamic system as well
was introduced for globe optimization by Kirkpatrick, Ge- as a self-organizing psychodynamic system. Its self-organ-
latt, and Vecchi in 1983 [30], and by Černý in 1985 [31], izing property is to a certain extent reflected by its thermal
which generalizes the Metropolis algorithm through intro- equilibrium phenomenon.
ducing a temperature scheme. Simulated annealing starts the To demonstrate the thermal equilibrium of the Skinner
Metropolis algorithm with a high temperature, and then, automaton, we assume a simpler case that F is determina-
slowly cools so that the search space gradually shrinks, and tive, and E(S)(o|s) is the mean of the internal energy in-
eventually to a small set of states with the global energy crement. Eq. (5) then has the following equivalent trans-
minimum. As a matter of fact, both animal learning and formation:
annealing are gradual optimization processes. Operant con-  1  E ( S ) ( o| s )/ KBT
ditioning is a process of behavior selection, or in other  p(ο | s )  Z ( s ) e ,
words, a process of behavior optimization. At the beginning,  (11)
 Z ( s )   e  E ( o | s )/ KBT ,
(S )

operant learning has a large set of optional operants for each


 o O
state, which may induce a great many states including
high-energy ones to emerge. This is analogous to the high where E(S)(o|s) = E(S)(s) is the internal energy of the state s
temperature situation in annealing. Gradually, some oper- induced by the o at the s. In eq. (11), the occurring probabil-
ants are no longer selected owing to their violating the mo- ity of any operant at a given state is proportional to the
tivation or biological propensity, and lastly, the set of op- Boltzmann factor [40], that is,
tional operants shrinks to a small one of optimal behaviors. (S )
p (o | s )  e  E ( o| s )/ K BT
. (12)
Obviously, both the simulated annealing and operant condi-
tioning has the analogous characteristics. It is most noticea- Consider an ideal case that, s and sS, there exists an
ble that cooling in simulated annealing implies reinforcing exclusive oO to bring F from s to s. Then from eq. (11),
in operant conditioning. According to eq. (9), a ‘good’ op- s and sS, there exists o and oO such that
erant that induces the internal energy of F to descend will
obtain more opportunity to occur as the temperature cools. p( sβ ) p( sα  sβ ) p(oα | sα )   E ( sβ )  E ( sα )  / K BT
  e , (13)
The Skinner algorithm is with five fundamental steps. p( sα ) p( sβ  sα ) p(oβ | sβ )
2752 Ruan X G, et al. Sci China Tech Sci November (2013) Vol.56 No.11

where ss represents the state transition from s to s.


Eq. (13) means that, in the case of [36, 40], the states of the
Skinner automaton obey the Boltzmann-Gibbs distribution
and will reach thermal equilibrium at a specified tempera-
ture T.
Generally, the state transition process F can be a proba-
bilistic model, a stochastic process, or a physical system
that is rather non-determinative and with uncertainty. We
can take the statistic of the E(S) in eqs. (5) and (6) by
Figure 4 A Skinner box with a hungry Skinnerian pigeon.
collecting the energy increment E(S) at each discrete time t
in various different ways. One of the optional statistical
methods is first, the bird pecked at the disks randomly. With operant
conditioning, however, it gradually learned to peck at the
∆E (S )
t 1 (ot | st )   ∆E(S )
(ot | st )  (1   )∆E (S )
(ot | st ), (14)
t t
right disk.
where   (0,1) ,  ∆Et( S )  indicates  ∆E ( S )  at time t, and Like the hungry Skinnerian pigeon, a Skinner automaton
is also able to learn to peck at the right disk to get what it
∆Et( S ) (ot | st )  E ( S ) ( st 1 )  E ( S ) ( st ) is the internal energy
wants. Here we demonstrate how a Skinner automaton re-
increment induced by the ot at the time t. Eq. (14) is actually produces the experiment of the hungry Skinnerian pigeon.
a low pass filter, which is twofold: (i) |  ∆Et( S )  | gradually A Skinner automaton (SAUTO) is built to simulate the hun-
expands from zero to a relative stable statistic with iterative gry pigeon in the Skinner box with colored disks. The sim-
operant conditioning, and (ii) it is some adaptive to chang- ulated bird will gain a bit of grain if it pecks at the red disk,
ing environments. or suffer an electric shock if at the blue. Nothing will hap-
The concept of entropy is originally defined by the sec- pen if it pecks at the yellow.
ond law of thermodynamics and is generalized to be a The SAUTO simulating the hungry pigeon is rather small
measure of the disorganization of various systems. Here we and simple to serve as an illustration. Its state transition
introduce the concept of operant entropy as a measure of the process F can be illustrated with the state diagram in Figure
disorganization of the Skinner automaton to describe the 5 and the state table in Table 1, in which there are only three
operant uncertainty in the Skinner automaton, which is de- states: (i) s(0): being satisfied, (ii) s(1): longing for food, and
fined as (iii) s(2): being in pain. It has three optional operants for the
state s(1): (i) o(1): to peak at the red disk in the box, (ii) o(2):
H   p (o | s ) ln p(o | s ). (15)
sS oO
to peak at the yellow, and (iii) o(3): to peak at the blue. It has
only an operant o(0) that means to do nothing for the states
As a matter of fact, operant entropy is just behavioral en- s(0) and s(2) so that it spontaneously return to the state s(1).
tropy [41] that is able to describe the behavioral uncertainty The internal energy function is defined with E(S)(s(k))=k2, i.e.,
of humans and animals and autonomous agents. (i) E(S)(s(0))=0, (ii) E(S)(s(1))=1, and (iii) E(S)(s(2))=4. The state
It is obvious that the operant entropy of the Skinner au- s(0) is the lowest-energy one the simulated bird expects. No
tomaton always decreases as it cools and will lastly con- operant energy is considered, i.e., oO E(O)(o)=0.
verge to zero. This implies that operant conditioning in the The simulated results are depicted in Figures 6 and 7.
Skinner automaton is a self-organization process. The initial artificial temperature TA is 10000C, and the
cooling schedule of the SAUTO is exponential, i.e.,
5 A hungry Skinnerian pigeon TA (k  1)  σTA (k ), (16)

In a sense, operant conditioning comes from a series of pi-


geon experiments conducted by Skinner. Skinner ever con-
ducted lots of experiments with pigeons to demonstrate his
operant conditioning theory [4]. He designed a soundproof
apparatus, commonly called a Skinner box, with which he
conducted the experiments of operant conditioning. One
version of the box was equipped with a food tray and a dis-
penser, and there were three colored (red, blue and yellow)
disks for a hungry pigeon to peck at (see Figure 4). The bird
would gain a bit of grain from the food dispenser if it
pecked at the red disk, or suffered an electric shock if at the
blue. Nothing would happen if it pecked at the yellow. At Figure 5 The state diagram of the SAUTO simulating the bird.
Ruan X G, et al. Sci China Tech Sci November (2013) Vol.56 No.11 2753

Table 1 The state transition process of the Simulated Hungry Bird

s(0): being satisfied s(1): longing for food s(2): being in pain
o(0) : to do nothing s(1) – s(1)
(1) (0)
o : to peck at the red – s –
o(2) : to peck at the yellow – s(1) –
o(3) : to peck at the blue – s(2) –

simulating the bird is self-organizing.


In each simulated experiment, the simulated bird pecks at
the colored disks about 2000 times to carry out operant
learning from the high temperature of 10000C to the low of
0.01C. During one of the simulated experiments, the simu-
lated bird pecks at the colored disks 1885 times in all: 1115
times at the red, 445 times at the yellow, and 325 times at
the blue. We assume that the discrete time interval of the
SAUTO is 15 s. It then takes the simulated bird about eight
hours to carry out operant learning for food.
Figure 6 The operant probability graph: the occurring probabilities of the
three optional operants at the state s(1) versus log10(TA(C)) where TA(=KBT)
is the artificial temperature.
6 Why Skinnerian pigeons superstitious

Skinner’s work on superstition in pigeons [42] is much fa-


mous and interesting. He found that pigeons might become
superstitious. To pray food, Skinnerian pigeons tended to
exhibit strange behaviors, such as flapping wings, spinning,
and twisting, which were so-called superstitious actions. It
seemed that Skinnerian pigeons were apt to reinforce their
behaviors, which were coincidental with the occurrence of
food pellet, whereas the food was not related with their be-
haviors at all.
Figure 7 The operant entropy and the mean of the internal energy versus Why might pigeons become superstitious? The Skinner
log10(TA(C)) where TA(=KBT) is the artificial temperature. automaton may be able to reproduce the experiment and
conduce to our understanding of it. Like Skinnerian pigeons,
the Skinner automaton may also become superstitious.
where =0.9.
To demonstrate why Skinnerian pigeons become super-
At first, as shown in Figure 6, the bird simulated with the
stitious, we programs a few simulate Skinnerian pigeons
SAUTO pecks at the colored disks at the state s(1) with the
with Skinner automata. We design an imitating Skinner box
equal probability to one-third. As cooling in the process of
depicted in Figure 8, which is equipped with a food tray, a
operant learning, the SAUTO pecks at the red disk more and
food dispenser, and a timer. The dispenser releases a food
more frequently to get food. With operant conditioning, the
pellet every 15 min on schedule. Whether a food pellet oc-
occurring probability of the operant o(1) of the SAUTO
curs is not related to the behaviors of those birds simulated
gradually rises from 1/3 to 1, and that of the operants o(2)
and o(3) gradually descend from 1/3 to zero. It is very inter-
esting that, to avoid being shocked by electricity, the proba-
bility of its pecking at the blue descends much faster than
that of its pecking at the yellow. Through operant condi-
tioning in the simulated experiment, the simulated Skinner-
ian pigeon with the SAUTO has learned to peck at the right
disk to get food from the food dispenser and not to peck at
the wrong disk.
Figure 7 shows that the mean of the internal energy of
the SAUTO becomes lower and lower as the temperature
descends. Meanwhile its operant entropy gradually de- Figure 8 A simulating Skinner box for reproducing the experiments on
creases. This implies that operant learning of the SAUTO superstition in pigeons.
2754 Ruan X G, et al. Sci China Tech Sci November (2013) Vol.56 No.11

with Skinner automata. Like Skinnerian superstitious pi- through sufficient operant learning. It has an internal energy
geons, however, the Skinner automata may also reinforce function of E(S)(s(i))=i (i=0,1) and an operant energy func-
their behaviors that are accidentally coincidental with the tion of E(O)(o(k))=0.5k (0k6). In the process of operant
occurrences of food pellets. We call a superstitious action a conditioning, it takes 25 sample paired data of oper-
pseudo operant, and call it the pseudo operant conditioning ant-consequence (response-stimulus) at each fixed temper-
(POC) to reinforce pseudo operants. ature.
The state transition process of the Skinner automata sim- As depicted with Figure 10(a), with well-balanced oper-
ulating superstitious pigeons can be illustrated with the state ant learning, the simulated bird has learned to do nothing to
diagram in Figure 9 and the state table in Table 2. In the wait for food pellets occurring automatically. This means
state diagram, there are only two states: (i) s(0): being satis- that well-balanced pigeons are apt to learn the truth.
fied, and (ii) s(1): longing for food. The simulated birds have
one operant o(0) that means to do nothing at the states s(0) 6.2 POC due to being hyperactive
and s(1), and six pseudo operants that are likely to occur at
the s(1): (i) o(1): to flap wings, (ii) o(2): to spin, (iii) o(3): to The second simulated pigeon is hyperactive. It has an inter-
twist, (iv) o(4): to turn neck, (v) o(5): to sing, and (vi) o(6): to nal energy function of E(S)(s(i))=i (i=0,1) and an operant en-
bow. ergy function of E(O)(o(k))=0.5k (0k6), which means that
The simulated Skinnerian birds have the same state dia- the simulated bird is too energetic.
gram but different internal energy functions, operant energy As depicted with Figure 10(b), the simulated bird being
functions, and samplings, which represents different psy- hyperactive gets into a POC process and becomes supersti-
chological and physiological states, biological propensities, tious. To pray food, it bows. It seems that pigeons being too
and extents of cognition and learning. energetic or hyperactive are apt to be superstitious.

6.1 Well-balanced operant conditioning 6.3 POC due to being hyperepithymia


The first simulated pigeon is well-balanced and carries Skinner fed eight pigeons less than normal so that they
would be hungry. Then he put them in the Skinner box as
depicted in Figure 8. A few days later, he found that six of
them became superstitious.
We programmed eight simulated pigeon with Skinner
automata. They have the same internal energy function as
E(S)(s(i))=25i (i=0,1) that means they are hyperepithymia,
and the same operant energy function as E(O)(o(k))=0.5k (0k
6). They take only 1 sample paired data of operant-con-

Figure 9 The state diagram of the Skinner Automata simulating the su-
perstitious pigeons.

Table 2 The state transition process of the Simulated Superstitious Birds

s(0): being satisfied s(1): longing for food


(0) (1)
o : to do nothing s s(1)
o(1) : to flap wings – s(1)
o(2) : to spin – s(1)
(3)
o : to twist – s(1)
(4)
o : to turn the neck – s(1)
(5)
o : to sing – s(3)
(6)
o : to bow – s(1)
The outside event # – s(0)
Note: The dispenser releases a food pellet every 15 min, which may Figure 10 The operant probability graphs: (a) Results from a well-bal-
occur at the same time when an o(k) (0k6) occurs. anced simulated bird, (b) results from a hyperactive simulated bird.
Ruan X G, et al. Sci China Tech Sci November (2013) Vol.56 No.11 2755

sequence at each fixed temperature, which means that they 7 A 3D Wienerian worm
are biased, and lack sufficient cognition and learning.
However, the simulated birds have different behaviors Norbert Wiener, the father of Cybernetics, was very inter-
because the simulated experiments of operant conditioning ested in how machines work like animals. He described a
are computationally stochastic processes. The simulated thought experiment in his book “Cybernetics” [43] and
results are some accidental and unpredictable. As shown in conceived a mechanism for imitating aphototropic worm,
Figure 11, the simulated birds exhibit different behaviors. which we later name as the Wienerian worm. Wiener sup-
Among eight simulated birds, six of them have become su- posed that an electric bridge could serve as a photosenso-
perstitious owing to POC. To pray food, two flaps its wings ry-motor system so that the Wienerian worm was able to
(see Figures 11(a), (b)), two spins (see Figures 11(c), (d)), search the darkness as a real aphototropic worm was. Influ-
one turns its neck (see Figure 11(e)), one bows (see Figure enced by Wiener, Braitenberg also designed a series of
11(f)), and the rest two do nothing (see Figures 11(g), (h)). thought experiments in which a vehicle with simple internal
As shown in the simulated experiments, it seems that the structure acts in unexpectedly complex ways [44], including
higher the state s(1) and the fewer the sample paired data of the phototropic and the aphototropic. Both Wienerian worm
operant-consequence at each fixed temperature, the more and Braitenberg’s vehicle are simple but most significant
the simulated bird is apt to be superstitious. because it demonstrates that it is possible for machines to
We acquire three significant suggestions from the super- behave like animals. However, the mechanisms of Wiener
stitious Skinner automata: (i) Being hyperactive is apt to be and Braitenberg act like animals just externally but not in-
superstitious; (ii) Hyperepithymia is apt to be superstitious; ternally. Their behaviors are being designed, but not being
(iii) Being biased or lack of sufficient cognition and learn- shaped by autonomous development, cognition or learning.
ing is apt to be superstitious. The Skinner automaton is able to serve as sensorimotor

Figure 11 The operant probability graphs: results from the POC of the simulated birds being biased and hyperepithymia.
2756 Ruan X G, et al. Sci China Tech Sci November (2013) Vol.56 No.11

systems of autonomous agents and robots so that they can temperatures, in which the case at the artificial temperature
gradually develop their motor skills like animals. We pro- TA=10000C takes 3000 discrete times, the others takes
gram a 3D Wienerian worm shown in Figure 12(a) in a vir- 1000 discrete times. Figure 14 implies that the behaviors of
tual reality environment. It is 4cm in length and 2.4 cm in the 3D Winerian worm gets organized more and more as the
width, and equipped with the receptors of two photosensors SAUTO cools. The 3D Wienerian worm is totally unor-
PSL and PSR, and the effectors of two motors ML and MR ganized at the artificial temperature TA=10000C, but gets
that respectively drive the left wheel WL and the right wheel highly organized at TA=0.001C. After having finished op-
WR. erant learning, the 3D worm is able to adroitly search the
As depicted in Figure 12(b), a Skinner automaton darkness in practice, and walk from any jumping-off point
(SAUTO) is used to serve as the photosensory-motor system to the darkness. Lastly it moves very slowly or stays at the
of the 3D Wienerian worm, which receives state signals foot of the walls.
from the receptors PSL and PSR, and sends operant signals It is most interesting that, by operant learning, the higher
to the effectors ML and MR. Each state of the SAUTO is a the luminance of the environment, the faster the 3D Wie-
vector of s  (∆ρ, ρ ) where =1 if L>R or else =1, nerian worm moves. The Skinner automaton makes the
L and R are the luminances measured by PSL and PSR, Wienerian worm behave like animals not only externally
respectively, and ρ  ( ρL  ρR ) / 2 that is divided into 9 but internally as well.
levels. There are altogether 18 states in the SAUTO. The
internal energy function of the SAUTO is defined by 8 A robot learning to balancing
E (S )
( s )  250  ρ .
2
(17)
Two-wheeled robots are a class of balancing robots, which
Each operant of the SAUTO is a vector of o=(,), where are much absorbing due to their inherent unstable dynamics.
 is the instructive signal of velocity and  is the instruc- These robots are characterized by the ability to balance on
tive signal of direction;  is divided into 4 levels, i.e., 0, 1, its two wheels and spin on the spot [45] whereas such abil-
2, and 3. If =0, we have =0, else =0.157. Accord- ity is generally man-made and designed in advance.
ingly, there are altogether 7 optional operants for the 3D In a sense, two-wheeled robots are a class of bionic sys-
Wienerian worm. The operant energy function is defined by tems, which imitate human upright posture and attempt to
exhibit some balancing skills to balance their postures. The
E (O ) (ο)  0.025  , (  {0,1,2,3}). (18)
posture balancing skills of a human are developed and
The 3D Wienerian worm is confined to a small ring of shaped during operant learning, in which operant condi-
radius 26 cm where a light lies in the center so that the cen- tioning plays an important role. Without a doubt, it is sig-
tre is bright and the edge is dark (see Figure 13). The oper- nificant for balancing robots to learn balancing skills like
ant learning process of the 3D Wienerian worm is shown in human and animals.
Figure 14, in which an obstacle avoidance scheme will start We have built a physical two-wheeled robot [46] called
up when the 3D Wienerian worm is too close to one of the Hominid 3 that is 58 cm in height and 22.5 kg in weight
walls of the ring. By operant learning, the 3D Wienerian (see Figure 15(a)). Hominid 3 is a flexible balancing robot
worm gradually develops its motor skills of negative photo- with complicated dynamics. It has a flexible lumbar made
tropism to search the darkness. Figure 14 depicts the foot- of a spring so that it is more bionic and more challenging to
prints of the 3D Winerian worm in the ring at different posture balancing. As shown in Figure 15(b), Hominid 3

Figure 12 (a) A 3D Winerian worm. (b) The sensorimotor system of the 3D Wienerian worm.
Ruan X G, et al. Sci China Tech Sci November (2013) Vol.56 No.11 2757

has the receptor of an inertial measurement unit (IMU), and the effectors of two motors ML and MR that respectively drive the left wheel WL and the right wheel WR. The IMU is used for measuring the tilt angle φ of the robot as well as the angular velocity dφ/dt.

A Skinner automaton (SAUTO) serves as a sensorimotor system related to posture and movement for Hominid 3 to develop its balancing skills (see Figure 15(b)); it is programmed and embodied in a digital signal processor (DSP). The SAUTO's state is a vector s = (φ, dφ/dt) measured by the IMU. The tilt angle and the angular velocity are, respectively, divided into 11 levels as shown in Table 3, and therefore there are 121 states in all in the SAUTO's state set. Here s = 0 is the upright state of the robot of Hominid 3. The internal energy function that represents the propensity or tropism of Hominid 3 is defined by

E^(S)(s) = (150φ)^2 + 1.8 × (4500 φ dφ/dt) + (30 dφ/dt)^2,  (19)

where the first term represents the propensity to regulate the tilt angle, the last term that of the tilt velocity, and the middle term the coupling between the tilt angle and the tilt velocity.

The SAUTO's operant that will operate on the effectors of the two motors ML and MR is a scalar o = UPWM ∈ [-3000, +3000], the value of Pulse Width Modulation (PWM). The PWM value is divided into 15 levels as listed in Table 4, and therefore there are 15 different operants in all in the SAUTO's operant set. The operant being selected is applied to the servos to control the speed and direction of the wheels. No operant energy is considered in Hominid 3, i.e., ∀o ∈ O, E^(O)(o) = 0.

We have conducted physical experiments with the robot of Hominid 3 to demonstrate how a balancing robot equipped with the Skinner automaton learns balancing skills in an autonomous operant conditioning way. The initial artificial temperature TA of the SAUTO is 10000°C, the cooling schedule of the SAUTO follows eq. (16), and the cooling coefficient is set to a relatively small value of 0.75 so that the physical experiment proceeds a little faster. There is a protector fixed on the chassis of Hominid 3 to prevent it from tumbling. At the beginning of each fixed temperature, the robot of Hominid 3 is at rest with a tilt angle of about 28°. The discrete time interval Δt of the SAUTO is 10 ms. It takes the robot 180 s (18000 intervals) to do operant learning at each given temperature.

During the physical experiment, if the absolute value of the tilt angle is larger than 25°, operant learning stops and a Proportional-Integral-Derivative (PID) controller starts up to make the robot upright. Then, if the absolute value of the tilt angle is less than 5°, the PID control stops and the robot performs operant learning again.

The physical experiment results of the robot's operant learning are shown in Figures 16–18. The sensorimotor system of the robot of Hominid 3 and its motor skills develop through operant conditioning. Figure 17 shows that, as the SAUTO cools during the process of operant conditioning, the robot of Hominid 3 gradually acquires balancing skills and finally becomes the first two-wheeled robot that has autonomously learned to balance. As depicted in Figure 17, the mean of the internal energy and the operant entropy at each temperature TA decrease as the SAUTO cools, which implies the self-organizing property of Hominid 3. We have recorded a video of the physical experiment, and only some captured frames are shown in Figure 18.
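To make the experimental protocol above concrete, the following minimal Python sketch restates eq. (19) and the switching rule between operant learning and PID recovery. It is an illustration rather than the authors' DSP implementation: the object interfaces (robot, sauto, pid) are hypothetical placeholders, and the geometric cooling step is only an assumption, since the actual schedule is the one defined by eq. (16).

# Minimal sketch of the Hominid 3 experiment logic (assumptions noted above).

def internal_energy(phi, phi_dot):
    """Eq. (19): internal energy of a state (phi in degrees, phi_dot in degrees/s)."""
    return (150.0 * phi) ** 2 + 1.8 * (4500.0 * phi * phi_dot) + (30.0 * phi_dot) ** 2

def cooled(T_A, coeff=0.75):
    """Assumed geometric cooling step with the 0.75 coefficient; eq. (16) gives the actual schedule."""
    return coeff * T_A

DT = 0.010                      # discrete time interval of the SAUTO: 10 ms
STEPS_PER_TEMPERATURE = 18000   # 180 s of operant learning at each fixed temperature

def run_episode(T_A, robot, sauto, pid):
    """One fixed-temperature episode with the 25-degree / 5-degree safety switching."""
    learning = True
    for _ in range(STEPS_PER_TEMPERATURE):
        phi, phi_dot = robot.read_imu()          # receptor: tilt angle and angular velocity
        if learning and abs(phi) > 25.0:         # tilted too far: hand over to the PID controller
            learning = False
        elif not learning and abs(phi) < 5.0:    # upright enough: resume operant learning
            learning = True
        if learning:
            pwm = sauto.select_operant((phi, phi_dot), T_A)   # the SAUTO picks one of the 15 PWM operants
            sauto.learn(internal_energy(phi, phi_dot))        # the consequence is evaluated through eq. (19)
        else:
            pwm = pid.control(phi, phi_dot)                   # PID recovery rights the robot
        robot.drive(pwm)                                      # effectors: motors ML and MR

Under these assumptions, the artificial temperature would start at 10000°C and be cooled after each 180 s episode until the robot has learned to balance.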
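The two quantities plotted in Figure 17 can be estimated from data logged at each temperature in a few lines; a hedged sketch follows (the logging format and the use of the natural logarithm are our assumptions, since this section does not state them).

import math

def operant_entropy(probabilities):
    """Shannon entropy of the operant probability distribution at one fixed temperature."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

def mean_internal_energy(energies):
    """Mean of the internal energy E^(S)(s) over the states visited at one fixed temperature."""
    return sum(energies) / len(energies)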

Figure 15 The robot of Hominid 3: (a) the physical configuration of the robot; (b) the sensorimotor system made with the Skinner automaton for the robot to develop its balancing skills.

Figure 16 The tilt angle graphs at different temperatures TA: the tilt angle versus time.

Table 3 The state levels of the tilt angle and the angular velocity

Level k    φ (°)            dφ/dt (°/s)
-5         (-∞, -15]        (-∞, -60]
-4         (-15, -10]       (-60, -35]
-3         (-10, -6]        (-35, -20]
-2         (-6, -3]         (-20, -10]
-1         (-3, -0.75]      (-10, -2.5]
0          (-0.75, +0.75)   (-2.5, +2.5)
+1         [+0.75, +3)      [+2.5, +10)
+2         [+3, +6)         [+10, +20)
+3         [+6, +10)        [+20, +35)
+4         [+10, +15)       [+35, +60)
+5         [+15, +∞)        [+60, +∞)
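As an illustration of how the level boundaries in Table 3 map raw IMU readings onto the 11 × 11 = 121 discrete states, a small helper of our own (not the authors' code) might look as follows; the operant levels of Table 4 below can be encoded in the same way.

import bisect

PHI_BOUNDS    = [-15.0, -10.0, -6.0, -3.0, -0.75, 0.75, 3.0, 6.0, 10.0, 15.0]    # tilt angle (°)
PHIDOT_BOUNDS = [-60.0, -35.0, -20.0, -10.0, -2.5, 2.5, 10.0, 20.0, 35.0, 60.0]  # angular velocity (°/s)

def level(x, bounds):
    """Table 3 level in -5..+5: negative levels are right-closed, positive levels left-closed."""
    if x <= 0.0:
        i = bisect.bisect_left(bounds, x)   # a value on a boundary falls into the more negative level
    else:
        i = bisect.bisect_right(bounds, x)  # a value on a boundary falls into the more positive level
    return i - 5                            # bins 0..10 correspond to levels -5..+5

def state(phi, phi_dot):
    """One of the 121 SAUTO states, expressed as a pair of Table 3 levels."""
    return level(phi, PHI_BOUNDS), level(phi_dot, PHIDOT_BOUNDS)

For example, state(0.3, -12.0) returns (0, -2): the robot is within the upright tilt band while its tilt angle is decreasing at a moderate rate.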

Table 4 The operant levels of the PWM value

Level k    UPWM
-7         -2500
-6         -1850
-5         -1250
-4         -750
-3         -450
-2         -250
-1         -100
0          0
+1         +100
+2         +250
+3         +450
+4         +750
+5         +1250
+6         +1850
+7         +2500

Figure 17 The operant entropy and the mean of the internal energy versus log10(TA (°C)), where TA (= kBT) is the artificial temperature.

9 Conclusion

The Skinner automaton makes the psychological theory of operant conditioning 'take the form of computer programs'. The Skinner automaton develops the Monte Carlo method from the Metropolis algorithm to the Skinner algorithm, in which the operant (active and voluntary behavior) is introduced as an essential element. As we state in Section 3, although the Monte Carlo method can be used to investigate the relation between operants (behaviors) and consequences in the process of response-stimulus (RS) conditioning of an organism, it can select neither operants (behaviors) nor consequences. The Metropolis algorithm has modified the Monte Carlo method so that it is able to select (accept or refuse) the states or configurations of a system.

Figure 18 The robot of Hominid 3 learning balancing skills as it sways.

However, the Metropolis algorithm cannot be directly applied to biological systems. For an organism, the states or configurations imply its behavioral consequence. During the RS conditioning process of an organism, once a consequence comes into being, it is impossible to refuse it. Now, the Skinner automaton expands the algorithm to biological systems, and moreover, it is able to represent the behavior selection mechanism in an operant conditioning way. The Skinner algorithm selects operants (behaviors) via operant conditioning, and it selects consequences via selecting the operants that induce the consequences.
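To make this contrast concrete, the schematic sketch below juxtaposes the Metropolis rule [28], which may refuse a proposed configuration, with an operant-style selection that can only choose among behaviors (here illustrated with Boltzmann-weighted probabilities) and must accept whatever consequence follows. The softmax shown is an illustration of the distinction, not the exact selection rule of the Skinner algorithm given earlier in the paper.

import math
import random

def metropolis_accept(delta_E, T):
    """Metropolis rule: a proposed configuration with energy change delta_E may be refused."""
    return delta_E <= 0.0 or random.random() < math.exp(-delta_E / T)

def choose_operant(energies, T):
    """Operant-style selection: pick one behavior with Boltzmann-weighted probabilities
    over per-operant energies; the consequence it produces cannot be refused."""
    weights = [math.exp(-e / T) for e in energies]
    r, acc = random.uniform(0.0, sum(weights)), 0.0
    for index, w in enumerate(weights):
        acc += w
        if r <= acc:
            return index
    return len(weights) - 1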


The work on the Skinner automaton and the Skinner algorithm advances our understanding of animal cognition and learning. The study of why Skinnerian pigeons become superstitious provides a significant instance for understanding animal behaviors in the way of psychology computing and thermodynamics computing. As demonstrated with the 3D Wienerian worm and the robot Hominid 3, this work also contributes to robotics, especially to cognitive developmental robotics (CDR). The aim of CDR is to provide understanding of how animals' higher cognitive functions develop and how machines developmentally acquire cognitive functions in animals' way [47]. A model of cognitive development starts from a sensorimotor system, in which operant learning plays an important role. As demonstrated in the simulated and physical experiments, the Skinner automaton is able to serve as the sensorimotor system of autonomous agents. Moreover, this work is beneficial to machine learning, especially to autonomous machine learning. Operant conditioning is one of the fundamental mechanisms of animal learning. The Skinner automaton provides artificial lives and autonomous agents, including mobile robots, with a replicated neural mechanism of operant conditioning and makes them learn in animals' way, so that machines act like animals not only externally but internally as well. As a behavioral psychologist, Skinner believed that the causes of behavior are in the environment and do not result from inner mental events such as thoughts, feelings, or perceptions. Rather, Skinner claimed that these inner mental events are themselves behaviors and, like any other behaviors, are shaped and determined by environmental forces [48]. Of course, cognition and learning, including self-learning and autonomous learning, are also behaviors.

The behaviors of an organism may be divided into two classes: the first-order and the second-order. The ones we call actions may belong to the first-order behaviors, and the sort of thoughts, feelings, and perceptions may belong to the second-order behaviors. The second-order behaviors are the behaviors of the first-order behaviors, and the causes of the first-order behaviors. The environmental forces shape the second-order behaviors, and the second-order behaviors shape the first-order behaviors. We claim that the second-order behaviors play important roles in self-learning and autonomous learning. It is most significant and necessary for autonomous systems, such as artificial lives, autonomous agents, and mobile robots, to learn autonomously or self-directedly, and to become autonomous learners.

The Skinner automaton could make autonomous agents, including robots, become autonomous learners, in which the operants are the first-order behaviors and the energy function represents a certain biological propensity or tropism related to the second-order behaviors. We call the replicated mechanism of operant conditioning in the Skinner automaton autonomous operant conditioning.

The second-order behaviors and propensity of an organism are related to the problem of intrinsic motivation [49, 50]. In the Skinner automaton, the energy function related to the second-order behaviors is designed in advance and is unalterable, rather than shaped by 'environmental forces'. How the internal energy function or propensity, as a second-order behavior, can be developed in the Skinner automaton way will be our future work.

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61075110, 60774077, 61375086), the National Basic Research Program of China ("973" Project) (Grant No. 2012CB720000), the National High-Tech Research and Development Program of China ("863" Project) (Grant No. 2007AA04Z226), the Beijing Natural Science Foundation (Grant No. 4102011), the Key Project of S&T Plan of Beijing Municipal Commission of Education (Grant Nos. KM200810005016, KZ201210005001), and the Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20101103110007).

1 Skinner B F. The Behavior of Organisms. New York: Appleton-Century-Crofts, 1938. 61–116
2 Skinner B F. Science and Human Behavior. New York: Macmillan, 1953. 45–128
3 Thorndike E L. Animal Intelligence: Experimental Studies. Edison: Transaction Publishers, 1911. 241–282
4 Watson J B. Behaviorism. New York: People's Institute, 1924. 141–232
5 Watson J B. Psychology as the behaviorist views it. Psychol Rev, 1913, 20: 158–177
6 Pavlov I P. Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex. London: Oxford University Press, 1927. 219–300
7 Grossberg S. On the dynamics of operant conditioning. J Theor Biol, 1971, 33: 225–255
8 Grossberg S. Classical and instrumental learning by neural networks. In: Rosen R, Snell F, eds. Progress in Theoretical Biology. New York: Academic Press, 1974. 51–141
9 Chang C, Gaudiano P. Application of biological learning theories to mobile robot avoidance and approach behaviors. Advs Complex Syst, 1998, 1: 79–114
10 Touretzky D S, Saksida L M. Operant conditioning in Skinnerbots. Adapt Behav, 1997, 5: 219–247
11 Saksida L M, Raymond S M, Touretzky D S. Shaping robot behavior using principles from instrumental conditioning. Rob Auton Syst, 1997, 22: 231–249
12 Daw N D, Touretzky D S. Operant behavior suggests attentional gating of dopamine system inputs. Neurocomputing, 2001, 38: 1161–1167
13 Itoh K, Miwa H, Matsumoto M, et al. Behavior model of humanoid robots based on operant conditioning. In: Proceedings of the 5th IEEE-RAS International Conference on Humanoid Robots, Tsukuba, Japan, 2005. 220–225
14 Itoh K, Onishi Y, Takahashi S, et al. Development of face robot to express various face shapes by moving the parts and outline. In: Proceedings of the 2nd Biennial IEEE/RAS-EMBS International Conference on Biomedical Robotics and Biomechatronics, Scottsdale, AZ, USA, 2008. 439–444
15 Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998. 1–86
16 Narendra K S, Thathachar M A L. Learning automata: A survey. IEEE Trans Syst Man Cybern, 1974, SMC-14: 323–334
17 Thathachar M A L, Sastry P S. Varieties of learning automata: An overview. IEEE Trans Syst Man Cybern B Cybern, 2002, 32: 711–722

18 Thathachar M A L, Sastry P S. A new approach to designing reinforcement schemes for learning automata. IEEE Trans Syst Man Cybern, 1985, SMC-15: 168–175
19 Lanctot J K, Oommen B J. Discretized estimator learning automata. IEEE Trans Syst Man Cybern, 1992, 22: 1473–1483
20 Thathachar M A L, Phansalkar V V. Learning the global maximum with parameterized learning automata. IEEE Trans Neural Netw, 1995, 6: 398–406
21 Phansalkar V V, Thathachar M A L. Local and global optimization algorithms for generalized learning automata. Neural Comput, 1995, 7: 950–973
22 Hauwere Y-M De, Vrancx P, Nowé A. Generalized learning automata for multi-agent reinforcement learning. AI Commun, 2010, 23: 311–324
23 Viswanathan R, Narendra K S. A note on the linear reinforcement scheme for variable-structure stochastic automata. IEEE Trans Syst Man Cybern, 1972, SMC-2: 292–294
24 Poznyak S, Najim K. On nonlinear reinforcement schemes. IEEE Trans Automat Contr, 1997, 42: 1002–1004
25 Stoica F, Popa E M. An absolutely expedient learning algorithm for stochastic automata. WSEAS Trans Computers, 2007, 6: 229–235
26 Stoica F, Popa E M. A new evolutionary reinforcement scheme for stochastic learning automata. In: Mastorakis N E, Mladenov V, Bojkovic Z, et al., eds. Proceedings of the 12th WSEAS International Conference on Computers, Stevens Point, Wisconsin, USA, 2008. 268–273
27 Simian D, Stoica F. A new nonlinear reinforcement scheme for stochastic learning automata. In: Proceedings of the 12th WSEAS International Conference on Automatic Control, Modeling & Simulation, Catania, Sicily, Italy, 2010. 450–454
28 Metropolis N, Rosenbluth A W, Rosenbluth M N, et al. Equation of state calculations by fast computing machines. J Chem Phys, 1953, 21: 1087–1092
29 Jorgensen W L. Perspective on 'Equation of state calculations by fast computing machines'. Theor Chem Acc, 2000, 103: 225–227
30 Kirkpatrick S, Gelatt C D, Vecchi M P. Optimization by simulated annealing. Science, 1983, 220: 671–680
31 Černý V. Thermodynamical approach to the travelling salesman problem: An efficient simulation algorithm. J Optim Theory Appl, 1985, 45: 41–51
32 Horowitz M J. Introduction to Psychodynamics: A New Synthesis. New York: Basic Books, 1988. 17–243
33 Palm W J. System Dynamics. 2nd ed. London: McGraw-Hill Science/Engineering/Math, 2009. 172–283
34 Kiese-Himmel C. Verstärkungslernen: Operante Konditionierung. Sprache-Stimme-Gehör, 2010, 34: 1
35 Dayan P, Balleine B W. Reward, motivation and reinforcement learning. Neuron, 2002, 36: 285–298
36 Oudeyer P Y, Kaplan F, Hafner V V. Intrinsic motivation systems for autonomous mental development. IEEE Trans Evolut Comput, 2007, 11: 265–286
37 Brucke E W. Lectures on Physiology. Vienna: Braumuller, 1874
38 Haynie D. Biological Thermodynamics. Cambridge: Cambridge University Press, 2001. 293–330
39 Nicholls D G, Ferguson S J. Bioenergetics. 4th ed. Europe: Academic Press, 2013. 1–52
40 Hopfield J J. Networks, computations, logic, and noise. In: Proceedings of the IEEE First International Conference on Neural Networks, California, USA, 1987. 109–141
41 Neumann J von. Various techniques used in connection with random digits. In: Monte Carlo Method. Applied Mathematics Series, Vol. 12. Washington D.C.: U.S. Department of Commerce, National Bureau of Standards, 1951. 36–38
42 Skinner B F. 'Superstition' in the pigeon. J Exp Psychol, 1948, 38(2): 168–172
43 Wiener N. Cybernetics: Or Control and Communication in the Animal and the Machine. New York: J. Wiley, 1948. 60–132
44 Braitenberg V. Vehicles: Experiments in Synthetic Psychology. USA: The MIT Press, 1986. 95–144
45 Ooi R C. Balancing a two-wheeled autonomous robot. Master's Dissertation. Perth: University of Western Australia, 2003. 1–7
46 Ruan X G, Li X Y, Zhao J W, et al. A flexible two-wheeled self-balancing robot system and its motion control method. China Patent 200910084259.8, 2010-10-09
47 Asada M, Hosoda K, Kuniyoshi Y, et al. Cognitive developmental robotics: A survey. IEEE Trans Auton Ment Dev, 2009, 1: 12–34
48 Wood S E, Wood E G, Boyd D. Mastering the World of Psychology. Boston: Allyn & Bacon, 2004. 333–354
49 Baranès A, Oudeyer P Y. R-IAC: Robust intrinsically motivated exploration and active learning. IEEE Trans Auton Ment Dev, 2009, 1: 155–169
50 Oudeyer P Y, Kaplan F. What is intrinsic motivation? A typology of computational approaches. Front Neurorobot, 2007, 1: 1–14
