This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TAMD.2015.2494460, IEEE Transactions on Autonomous Mental Development
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015

Interplay of Rhythmic and Discrete Manipulation Movements During Development: A Policy-Search Reinforcement-Learning Robot Model
Valentina Cristina Meola, Daniele Caligiore, Valerio Sperati, Loredana Zollo,
Anna Lisa Ciancio, Fabrizio Taffoni, Eugenio Guglielmelli and Gianluca Baldassarre

Abstract: The flexibility of human motor behaviour strongly
relies on rhythmic and discrete movements. Developmental psychology has shown how these movements closely interplay during
development, but the dynamics of this interplay are largely unknown and
we currently lack computational models suitable to investigate
such interaction. This work initially presents an analysis of the
problem from a computational and empirical perspective and
then proposes a novel computational model to start to investigate
it. The model is based on a movement primitive capable of
producing both rhythmic and end-point discrete movements, and
on a policy search reinforcement learning algorithm capable of
mimicking trial-and-error learning processes involved in development and efficient enough to work on real robots. The model
is tested with hand manipulation tasks (touching, tapping,
and rotating an object). The results show how the system
progressively shapes the initial rhythmic exploration into refined
rhythmic or discrete movements depending on the task demand.
The tests on the real robot also show how the system exploits
the specific hand-object physical properties, some possibly shared
with developing infants, to find effective solutions to the tasks.
The results show that the model represents a useful tool to
investigate the interplay of rhythmic and discrete movements
during development.
Index Terms: Developmental robotics, motor system and development, iCub robot hand, motor babbling, rhythmic and
discrete movement primitives, central pattern generators, policy
search reinforcement learning.

I. INTRODUCTION

RHYTHMIC and discrete movements play an important role in supporting the flexibility and complexity
of human motor behaviour. These two types of movements
are produced by motor primitives (MPs) involving partially
overlapping neural structures of the brain [1], [2].
This research received funds from the European Commission under the
7th Framework Programme (FP7/2007-2013), ICT Challenge 2 Cognitive
Systems and Robotics, project IM-CLeVeR - Intrinsically Motivated Cumulative Learning Versatile Robots (grant agreement no. ICT-IP-231722), from
the Italian Ministry of Instruction, University and Research, PRIN project
HANDBOT - biomechatronic hand prostheses endowed with bio-inspired
tactile perception, bidirectional neural interfaces and distributed sensorimotor
control (CUP: B81J12002680008), and from the Italian Institute for Labour
Accidents (INAIL) with PPR 2 project (CUP: E58C13000990001). V.C. Meola, L. Zollo, A.L. Ciancio, F. Taffoni and E. Guglielmelli are with Biomedical
Robotics and Biomicrosystem Lab, Universita Campus Bio-Medico di Roma,
Via Alvaro del Portillo 21, I-00128 Roma, Italy, vc.meola@gmail.com,
{l.zollo, a.ciancio, f.taffoni, e.guglielmelli}@unicampus.it; D. Caligiore, V.
Sperati and G. Baldassarre are with the Laboratory of Computational Embodied Neuroscience, Istituto di Scienze e Tecnologie della Cognizione,
Consiglio Nazionale delle Ricerche (LOCEN-ISTC-CNR), Roma, Italy
{daniele.caligiore, valerio.sperati, gianluca.baldassarre}@istc.cnr.it .

This work has a threefold objective. First, it aims to
highlight a relevant problem of developmental theory so far
largely overlooked, namely the interplay between rhythmic
and discrete movements during development. Empirical data,
computationally-informed analyses, and robotic experiments
will be used to address the questions: How are the two types
of movements related to exploration and learning processes, in
particular with a focus on trial-and-error learning processes?
How do they influence each other while they are being
learned? Second, the work proposes a computational model
that is capable of mixing and learning together rhythmic
and discrete movements thus allowing the study of how they
interplay in motor development. The model was also designed
to be able to directly learn within a real robot on the basis
of an efficient policy search reinforcement learning (RL)
algorithm and rather simple motor primitives which generate
rhythmic movements based on cosine-based central pattern
generators (CPGs) and discrete movements based on endpoints and proportional-derivative (PD) modules. The capacity
of the model to learn behaviours in a real robot was pursued
as it has the potential to uncover sensorimotor aspects also
playing a relevant role in infant development, a strength of
developmental robotics [3], [4], [5], [6]. Third, we used the
model to present some initial results on how rhythmic and
discrete movements interact while they are being learned by
trial-and-error processes. The results are based on two simple
one-finger tasks (touching and tapping an object) and
a more complex task involving the rotation of a cylinder with
the thumb and index fingers.
Rhythmic motor primitives produce oscillatory, repetitive
movements, such as sucking, chewing, crawling, walking,
swimming, and sweeping. Rhythmic movements heavily rely
on spinal-cord neural circuits implementing Central Pattern
Generators (CPGs), neural systems capable of producing oscillatory signals when suitably activated by the central nervous
system [7], [8], [9]. CPGs can generate rhythmic movements
by alternating the activation of flexor/extensor muscles [9],
[10]. Arms and hands also produce rhythmic movements,
like waving, scratching, hitting, and rotating objects, possibly
involving CPGs [11], [12], [13], [14]. In addition to the spinal
cord, the basal ganglia, a set of sub-cortical nuclei playing a
fundamental role in the acquisition of the capacity to select
different movements on the basis of trial-and-error processes
[15], [16], are involved in the selection and timing of rhythmic
movements [17], [18]. Brain imaging data show that rhythmic

1943-0604 (c) 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

movements involve the activation of a network of brain areas
encompassing parietal, primary, premotor, and supplementary
cortical areas [2].
Discrete motor primitives produce point-to-point movements such as reaching, grasping, throwing, and pressing.
The production of discrete motor primitives recruits cortical
areas [19], [20], [21], [22], [23] which are also recruited by
rhythmic movements [2], as well as additional areas, such as
prefrontal and premotor cortices important for motor planning,
that are not involved in rhythmic movements [2]. Subcortical regions
are also involved in the production of discrete movements
[24], [25]. In particular, the basal ganglia, in concert with
areas of frontal cortex, are critical for the initiation and
termination of discrete movements [26], [27] and for the
selection of different movements depending on context [28],
[16]. The cerebellum contributes to the fine control of movements
through internal models [29], [30].
Complex motor skills often require the combination of
rhythmic and discrete movements. For example, playing the piano
[1] and handwriting [31] may be regarded as activities involving a mixture of rhythmic and discrete movements. Empirical
data suggest that the basal ganglia could be important for managing
the arbitration between rhythmic and discrete movements
[32], while the cerebellum could contribute to managing the timing
aspects related to it [33], [34].
Alongside their neural underpinnings, a critical question for
developmental psychology and developmental robotics is how
rhythmic and discrete movements interact while they develop
in infancy. As noted above, a first contribution of this work
is to stress the importance and interest of
this problem, so far largely overlooked by the literature on
motor development. This literature tends to focus on the development of either rhythmic or discrete movements [35], [36],
[37]. At present, rhythmic and discrete movements are indeed
investigated by rather distinct research communities using
different experimental paradigms and theoretical models [31],
[38]. A similar separation tends to also involve robotic communities where the discrete movements literature for reaching
[39], [40], [41], [42] (see [43] for a recent overview), grasping
[44], [45], robust manipulation [46], and throwing [47], is quite
separated from the literature on rhythmic movements related
to locomotion, swimming, crawling [8], [9].
Some researchers have recently underlined the importance
of integrating the studies on rhythmic and discrete movements
to understand their interactions [48], [31], [49]. A key fact
about these interactions, which also represents an important
motivation of this work, is that a large proportion of the movements
produced during the first year of life are rhythmic rather
than discrete, including arm and hand movements. The long-term
goal of the research agenda that led to this work is to
understand why (adaptive advantages) and how (mechanisms)
this is the case. This work contributes to this goal by first
theoretically framing the problem based on empirical and
computational considerations (in this and the next section),
and then by presenting a model suitable to study the issue and
by illustrating its behaviour in some experiments involving
the interplay of rhythmic and discrete movements during
development.

Evidence on the importance of rhythmic movements at an early
age comes from the classic study of Thelen [10], [50] showing
how rhythmic movements produced by infants during the first
year of life are very rich and take up a considerable share of their
waking time, in particular between 5% and 10% of it (with
peaks of 40% for some babies) from the 16th to the 52nd
week.
The amount and richness of these rhythmic movements have
intrigued students of infancy for years, leading them to wonder
about their possible functional value or lack thereof. Thelen
herself [10] proposed that rhythmic movements might be a
by-product of the maturation of motor control circuits, but
she recognised that they could also play a role in supporting
further development. Piaget [51] proposed that repeated movements play a pivotal role in development. Thus, he identified
primary circular reactions as the initial repeated movements
directed to learning the properties of one's own body, and secondary
circular reactions as the subsequent repeated movements driven
by the interest in the effects they produce on the environment.
A direct demonstration of how rhythmic movements might
lead to the development of focused movements producing a reward for the
infant comes from the important study of Rovee and colleagues
[52], [53]. This study showed how three-month-old infants rapidly
refine (initially rhythmic) leg movements if a ribbon is attached
to the ankle and can activate a pleasurable overhead
crib mobile. The important function of rhythmic movements
in bootstrapping subsequent motor development is also
recognised in recent computationally-informed perspectives on
development [54].
Why is the initial behaviour of infants largely based on rhythmic
movements? We argue there might be various reasons for this,
based on computational considerations and empirical evidence.
First, from a computational perspective the preparation and
production of rhythmic movements might be simpler than that of
discrete movements. Rhythmic movements, for example,
can be produced through circuits formed by few mutually
inhibiting leaky-integrators [55]. Instead, although discrete
movements with a simple stereotyped shape can be directly
produced by simple PD-like devices (as done here, see section
I-B), this still requires the setting of the movement endpoint. The preparation and production of more refined discrete
movements even requires a fine control of forces to produce
the needed transient accelerations and decelerations [43], [56].
Moreover, rhythmic movements are often open-loop or use
little feedback [8], whereas at least more sophisticated discrete
movements can be closed-loop (especially in their terminal
part following an initial ballistic part, [57]), thus requiring a
more complex information processing. These features might
facilitate an earlier development of rhythmic movements with
respect to discrete ones.
Second, rhythmic movements and CPGs emerged early
during the course of animal evolution as they can serve the
fundamental function of displacing the body in water, earth,
and air [58]. Instead, discrete movements tend to be useful for
specific interactions with the environment, in particular for manipulation behaviours, and thus became highly sophisticated
in later stages of evolution, e.g. in primates [59]. This agrees
with data mentioned above suggesting that many brain areas

underlying the production of rhythmic movements are a subset
of those underlying the production of discrete movements and
that the latter also involve higher-level brain areas that emerged later
in the course of evolution [2]. As often happens (ontogeny
recapitulates phylogeny), during the child's life these higher
brain areas, e.g. those of the prefrontal cortex, have a slower
maturation [60]. This tends to generate a temporal order in
the development of rhythmic and discrete movements, with
the former preceding the latter (cf. Thelen's proposal
mentioned above, [10]).
Third, ecological considerations, related to the interactions
of the child with the environment, suggest that rhythmic
movements, being repetitive, facilitate learning (cf. [10]). In
particular, rhythmic movements tend to automatically reset
the environment (including the infant's own body), i.e. to bring it back
to its initial state, thus facilitating the repetition of a certain
experience several times (as if the child self-generates
several learning trials, e.g. when banging, swaying, and
twirling an object). Instead, producing a repeated experience
through discrete movements usually requires the performance
of a sequence of two or more movements to first affect the
environment and then to bring it back to its initial state (unless
the environment self-resets, e.g. when the child interacts
with an elastic object anchored to the environment). These
sequences are harder to learn than
the triggering of only one rhythmic movement. Moreover,
circular/elliptic rhythmic movements allow the infant to more
easily explore a larger area of space than segment-like discrete movements, and this increases the chances of
obtaining feedback for learning (e.g. by contacting objects).
Alongside the functional relations between the development
of rhythmic and discrete movements, it is also important
to consider the mechanisms behind such development. A
first concept useful to explain motor development is motor
babbling, for which infants initially produce unstructured
movements and then progressively mould them to produce
highly efficient functional motor behaviours through learning
processes guided by environmental feedback [61]. This idea
underlies prominent interpretations of empirical research on
motor development [62], [63]. Motor babbling has also become a key ingredient of several computational models of
motor development [64], [65], [44]. Motor babbling might
however be a simplification with respect to the actual learning
processes of infants as these might employ goal-directed exploration mechanisms from the beginning of life, for example
reaching behaviours that, although inaccurate, are targeted to
specific regions or objects in space [66]. This view has been
captured by models that generate exploratory movements by
setting random end-point postures rather than force commands
[43], [67] (end postures can be interpreted as proximal goals in
the posture space) and by models performing goal babbling
[68], [69]. In these models random exploration involves goals
set in terms of points in the working space to reach with limbs
independently of the posture used.
A second concept useful to explain motor development,
closely related to motor babbling, is trial-and-error learning [62]. This process explains how the production of exploratory movements through motor/goal babbling can lead
the child to progressively refine movements on the basis of
the retention of the movement features that allow a more
efficient/effective attainment of valuable outcomes. Trial-and-error learning processes relevant for developmental studies
have been successfully captured in models through RL algorithms. For example, RL models have been used to accurately
reproduce and account for the progressive development of
reaching movements from sub-movements [43], [70], [71] and
the complex evolution of their kinematic and dynamic features
[43]. In addition, several developmental-robotics models have
used RL to successfully account for the development of
reaching [71], [43], grasping [45], [44], manipulation [72],
and overt attention [73], [74], [75].
At the moment, we lack a computational model that allows
an easy study of the interplay of rhythmic and discrete
movements during development directly within a humanoid
robot. We propose that this model should have these minimal
features. First, the model should be capable of producing both
rhythmic and discrete movements and of using them at
the same time, so as to allow the study of their interdependencies while they are learned. Second, the model should acquire
the movements through trial-and-error learning processes, thus
allowing the study of developmental phenomena that depend
on such processes. Third, the learning process of the model
should be efficient enough to directly work on real robots so
as to allow the model to capture some of the complexities and
opportunities generated by the sensorimotor interaction of the
infant with the physical environment.
The rest of the paper is organised as follows. The following
two sub-sections review the existing models relevant for this
study and introduce and justify the key ingredients of our
model. Sec. II presents the robot and the tasks used to test the
model, and explains the functioning and learning mechanisms
of the model. Sec. III illustrates the results of the tests of the
model, and also analyses in detail the type of solutions found
by the model with a focus on the interplay of rhythmic and
discrete movements. Sec. IV discusses the results with respect
to the issues introduced above. Sec. V draws the conclusions
and suggests possible future developments of the model.
A. Related models
The literature has already proposed some models capable of
simultaneously producing rhythmic and discrete movements.
The model presented in [76] is based on a sophisticated
dynamical system that can produce both rhythmic and discrete
movements as limit-cycle and fixed-point attractors, respectively. The model presented in [77], instead, is based on
a Matsuoka oscillator [55] relying on two coupled leaky
units. The rhythmic and discrete movements are produced
by respectively sending tonic or pulse inputs to the oscillator
units, and mixed rhythmic/discrete movements are obtained by
sending sums of such types of signals to the units. Neither of these
models includes RL, a necessary step to reproduce the trial-and-error learning processes relevant to the study of development,
and neither has been developed to work on robots.
To the best of our knowledge, only four robotic systems
have been proposed to combine rhythmic and discrete movements. The first system [78] could control a humanoid robot

engaged in a drumming task. The model used two different
pattern generators to produce such movements, i.e. a limit-cycle pattern generator to generate rhythmic movements and
a fixed-point pattern generator to produce discrete movements.
The system used the two pattern generators independently at
different stages of behaviour and the behaviour was hardwired
rather than learned: these two features do not allow using this
model to investigate the coupled development of rhythmic and
discrete movements.
Similarly, the model proposed in [49] used different systems to produce rhythmic and discrete movements: a set of
differential equations based on the VITE (Vector Integration
To Endpoint) model (originally developed in [79]) to simulate
active and passive arm movements, and a modified Hopf
oscillator (originally proposed in [80]) to produce rhythmic
movements. The study of the model, which did not include
learning processes, focused on the capacity of the system to
switch between rhythmic and discrete behaviors on the fly
rather than on their interplay during development.
The work in [81] proposed a Dynamic Movement Primitive
(DMP) to encode both a rhythmic motion and the initial
transient movement needed to start it. The system produced
the transient behaviour required to start the rhythmic motion
without resorting to the ad-hoc procedures typically used to get
the robot into the periodic motion. In particular, the proposed DMP used a dynamical system approach to produce
an asymptotically stable limit cycle generating the rhythmic
behaviour. The initial transient behaviour was obtained through
trajectories varying on the basis of the initial conditions and
converging towards the limit cycle. The work focused on the
transient movement needed to start a rhythmic movement and
did not include learning, but it might be relevant for future
improvements of the model presented here.
A last system [82] used the discrete and rhythmic DMPs
proposed in [83]. In their typical form, DMPs generate movement trajectories based on a dynamical system controlling
each joint (e.g., in terms of force or desired position) on the
basis of two additive components [38]. The first component
is a spring-damping system. In discrete DMPs, the second
component is based on a linear combination of Gaussian
functions, depending on a phase variable that exponentially
decays from one to zero, that transiently perturbs the trajectory.
In rhythmic DMPs, the second component is based on a
linear combination of cosine functions depending on a phase
variable that linearly increases starting from zero. In both
cases, the coefficients of the linear combination represent the
DMP parameters generating the movement shape. In [82] a
rhythmic DMP was used to learn to paddle a ball and a
discrete DMP to manage the initial part of the task. The
DMPs had several parameters that had to first be initialised
with a supervised learning algorithm (based on imitation of
a human demonstrator) and then refined with an RL algorithm.
The system was applied neither to learning manipulation
skills nor to the study of rhythmic/discrete movement co-development, as done here.
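The two additive components of a discrete DMP described above can be illustrated with a minimal single-joint sketch. It follows a commonly used textbook formulation rather than the specific implementation of [82], [83]: the function name `discrete_dmp`, the gains `alpha_z`, `beta_z`, `alpha_x`, the placement of the Gaussian centres, and their common width are all assumptions made for illustration.

```python
import math

def discrete_dmp(weights, y0=0.0, goal=1.0, tau=1.0, dt=0.001, duration=1.5,
                 alpha_z=25.0, beta_z=6.25, alpha_x=4.0):
    """Integrate a one-joint discrete DMP with Euler steps; return positions."""
    n = len(weights)
    # Basis-function centres placed along the exponentially decaying phase.
    centres = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    width = float(n * n)  # illustrative common width of the Gaussians
    x, y, z = 1.0, y0, 0.0  # phase starts at one, joint starts at rest
    trajectory = []
    for _ in range(int(duration / dt)):
        psi = [math.exp(-width * (x - c) ** 2) for c in centres]
        # Second component: normalised Gaussian mixture, gated by the phase.
        forcing = sum(p * w for p, w in zip(psi, weights)) / sum(psi)
        forcing *= x * (goal - y0)
        # First component: spring-damper pulling the joint towards the goal.
        dz = (alpha_z * (beta_z * (goal - y) - z) + forcing) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z += dz * dt
        y += dy * dt
        x += dx * dt
        trajectory.append(y)
    return trajectory
```

With all weights set to zero the joint simply relaxes to the goal; non-zero weights reshape the transient, while the decaying phase ensures that the perturbation vanishes and the goal is still reached. The number of weights to be set is what makes the initialisation of such primitives more demanding than that of the simpler primitive used in this work.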
The model we proposed in [72], [84], a predecessor of
the model presented here, was directed to study the role of
rhythmic movements during development with a focus on
manipulation skills. The model generated rhythmic movements
on the basis of the CPG model proposed in [8]. The whole
system had a two-level hierarchical architecture [85], [86]:
the lower level was based on multiple CPGs with different
complexity and capable of setting the desired positions of the
hand-joints; the higher level searched the parameters of the
CPGs, and their mixed contribution to movement, through an
RL actor-critic model [87]. The results of the tests of the model
showed that the manipulation tasks were best solved either by
a mixed use of all the CPGs (the hierarchical CPG, or CPG-H) or by a CPG that controlled the hand degrees of freedom
in an independent fashion (the complex CPG, or CPG-C),
whereas simpler CPGs controlling the joints in a coupled fashion
were less efficient. The work presented here represents a
substantial advancement with respect to its predecessor model.
First, it proposes the use of the Policy Improvement Black Box
(PIBB) algorithm: this is more efficient than the actor-critic
RL method used in previous works, whose inefficiency prevented
its use on real robots. Second, it presents results obtained
with a real robot (iCub) and set-up (objects to be touched
or rotated) rather than in simulation. Third, it presents and
analyses robotic tests focused on the interplay of rhythmic and
discrete movements during development whereas the previous
works focussed on different topics (manipulation skills and
comparison of different robotic hands).
B. Ingredients of the proposed model
The model proposed here has the three desirable features
illustrated in Sec. I. First, the model allows the study of the
interplay of rhythmic and discrete movements during development. To this purpose, following the approach we proposed in
[72], [84], we used here a motor primitive where the rhythmic
component is based on the CPG model described in [8]. This
is an abstract CPG (its core oscillation is based on a cosine
function) that is capable of producing coupled oscillations of
different motor joints. It has various elements tunable with the
following parameters: (a) the frequency of oscillation, equal
for all joints; (b) the amplitude of oscillation, different for
different joints; (c) the phase difference between each couple
of joints (to this purpose the joints are best organised as a tree
hierarchy). This CPG was preferred to the rhythmic DMPs illustrated in the previous section [38], [83] to avoid the complex
initialisation of DMPs typically done with supervised learning
and imitation [82]. Instead, although less flexible, the CPG
employed here produces rhythmic movements by default for
almost all values of the parameters. This facilitated this study
directed to evaluate the implications of an initial exploratory
behaviour based on rhythmic movements, entering the model
as an assumption, on the subsequent motor development.
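The parametrisation listed above (shared frequency, per-joint amplitude, phase differences organised along a tree of joints) can be sketched as follows. This is only an illustration under stated assumptions: the function name `cpg_angles`, the parent-index encoding of the tree, and the argument layout are hypothetical and are not taken from [8] or from the implementation of the model.

```python
import math

def cpg_angles(t, freq, amplitudes, parents, phase_offsets):
    """Return one oscillation value per joint at time t (seconds).

    freq          -- oscillation frequency in Hz, shared by all joints
    amplitudes    -- one oscillation amplitude per joint
    parents       -- parent index of each joint in the tree (-1 for the root);
                     a parent must be listed before its children
    phase_offsets -- phase difference (radians) of each joint from its parent
    """
    phases = []
    for j, parent in enumerate(parents):
        # The absolute phase of a joint accumulates offsets down the tree.
        base = 0.0 if parent < 0 else phases[parent]
        phases.append(base + phase_offsets[j])
    return [a * math.cos(2.0 * math.pi * freq * t + p)
            for a, p in zip(amplitudes, phases)]
```

For example, `cpg_angles(0.0, 1.0, [1.0, 2.0], [-1, 0], [0.0, math.pi])` describes two joints oscillating in anti-phase with different amplitudes. Any finite choice of the parameters yields an oscillation (possibly of zero amplitude), which is the property that makes such a CPG a convenient starting point for exploration.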
To have a motor primitive also capable of producing discrete
movements, we added to each oscillator, controlling a robot
joint, a further parameter representing its desired centre of
oscillation and a proportional derivative controller (PD; [88])
progressively leading to it. This PD is similar to the spring-damping component of rhythmic DMPs illustrated in Sec. I-A
[83], [38]. The PD captures in a simple way the basic idea of
the equilibrium point (EP) hypothesis [89] for which the central nervous system produces movements by issuing suitable

reference points to muscles and then these try to achieve them
by generating torques on the basis of their spring-damping
properties. In this way, the shape of the trajectories followed
by the limbs is automatically generated by the PD that moves
the CPG oscillation centre and by the PID of the robot. We
decided to use this simple version of discrete movements to
focus on the interdependent overall development of rhythmic
and discrete movements while keeping simple other motor
control issues related to the generation of complex trajectories
(e.g., see [44], [90]). This choice also allowed us to keep the
MPs simple enough to be directly searched with RL on the
real robot, whereas, as seen in the previous sub-section, more
complex DMPs need to be initialised with imitation and
supervised learning, thus making it more difficult to employ
them to study how rhythmic and discrete movements interplay
while they develop. In Sec. V we discuss how to develop
more sophisticated versions of the model to generate more
complex discrete movement trajectories.
The rhythmic and discrete components of the MP were
aggregated by summing their output to compute the desired
value of the controlled variables (here joint angles). This
solution has been used in other models (see Sec. I-A) and
seems a natural way to aggregate the two components as the
rhythmic component needs a centre (reference frame) for
the oscillation that can be given by the output of the discrete
component. The solution also has the advantage of allowing
a simple weighting of the contribution of the two components
to movement by respectively tuning the oscillation amplitude
of the rhythmic component and the initial/final points of the
discrete component.
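The summation of the two components can be illustrated for a single joint with the sketch below. It assumes a simple Euler discretisation in which a PD law drives the oscillation centre towards its desired value and a cosine term is added on top; the function name `run_primitive`, the gains `kp` and `kd`, the time step, and the default values are hypothetical choices made for illustration and are not the values used in the model.

```python
import math

def run_primitive(target, amplitude, freq, phase=0.0, centre=0.0,
                  kp=25.0, kd=10.0, dt=0.01, steps=500):
    """Simulate one joint: a PD-driven centre plus a cosine oscillation.

    Returns the list of desired joint angles, one per time step.
    kp and kd are illustrative gains (critically damped: kd = 2*sqrt(kp)).
    """
    velocity = 0.0
    desired = []
    for k in range(steps):
        t = k * dt
        # Discrete component: PD dynamics move the centre towards the target.
        accel = kp * (target - centre) - kd * velocity
        velocity += accel * dt
        centre += velocity * dt
        # Rhythmic component: cosine oscillation around the current centre.
        desired.append(centre + amplitude * math.cos(2.0 * math.pi * freq * t + phase))
    return desired
```

Setting `amplitude` to zero yields a purely discrete movement ending at `target`, whereas keeping `target` equal to the starting centre yields a purely rhythmic movement; intermediate settings give mixed movements, which is the weighting mechanism described above.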
To investigate the interplay of rhythmic/discrete movements
during learning we tested the model with the same robot
and set-up, i.e. an iCub humanoid robot touching a target
object with one finger, and two different reward functions, one
for touching and one for tapping the target object. These
two tasks show how the robot progressively shapes the initial
rhythmic movements to produce either a final rhythmic or a
final discrete movement depending on the reward function.
Regarding the second desirable feature, the model searches
its parameters on the basis of an RL algorithm capable of
reproducing the trial-and-error learning processes of the child
[4]. The reproduction of trial-and-error processes involving
the acquisition of motor control is important as it can reveal
the possible developmental motor patterns followed by infants
(e.g., see [43], [75], [91]), in this case the rhythmic-movement
improvement or the rhythmic-to-discrete movement transition.
Regarding the third desirable feature, the model was designed to be able to learn simple manipulation tasks directly
on a real robot on the basis of RL. This ensures the possibility
of studying some of the difficulties faced by infants during
motor development and also the opportunities stemming from
the sensorimotor interactions with the environment typical of
embodied systems [6]. Not all the difficulties and opportunities
encountered by the robot reflect those encountered by infants,
because their bodies are substantially different
(e.g., in terms of motor plant dynamics, body consistency,
friction, etc.), but some do (e.g., overall ranges of kinematic
features, force ranges, properties of manipulated objects,
gravity). To investigate how these aspects might affect the
development of movements we tested the system with a third,
more difficult task where the robot had to rotate a cylinder
with the index finger and the thumb (similarly to unscrewing
a jar cap).
To be able to study trial-and-error learning directly on
real robots, we used an efficient algorithm called Policy
Improvement Black Box (PIBB) [92], [93]. PIBB belongs to
the class of policy search methods, which includes other algorithms
such as PoWER [94] and PI2 [95]. These algorithms do not
search action policies (i.e., sensorimotor input-output mappings) by using evaluation functions (i.e., functions estimating
the reward-related evaluation of states, or state-action pairs,
based on gradients) as in value-function-based methods [87],
[96]. Indeed, the computation of such evaluation functions
can encounter problems in robotic tasks involving noisy,
discontinuous reward functions [94]. Rather, policy search
methods search for the policy parameters without the need for
an evaluation function by (a) randomly perturbing such parameters to obtain different policies similar to the original one,
and then (b) computing the new policy parameters as the
weighted average of such policies with weights depending on
their reward performance (reward-weighted averaging). When
used to control robots with DMPs, policy search methods are
very efficient compared with value-function-based methods
in terms of both convergence speed and quality of the final
solution [97], [95], [94], [98].
This work used PIBB [92], [93] for its simplicity, which
facilitates the interpretation of results, and its efficiency, which
allows its application to real robots. Note that here we do
not claim that PIBB processes are actually implemented in
the brain (e.g., the parameter update after several roll-outs,
illustrated below, is biologically rather implausible). Rather,
the algorithm is intended to capture, at a functional level,
the effects on motor development of children's trial-and-error learning processes. PIBB (which can be obtained from
PI2 by setting some parameters to values that simplify its
equations [92]) is simple because: (a) like other policy search
methods, such as PoWER and PI2, it directly searches for the
policy parameters without using evaluation functions or policy
gradients; (b) unlike other policy search methods,
(b1) it perturbs the policy parameters only at the beginning
of each trial (roll-out) and (b2) it updates the policy on the
basis of the aggregated trial reward rather than on the basis of
the step-by-step rewards. In this respect, the algorithm belongs
to the family of Black Box optimisation techniques, from
which it gets its name [92]. In terms of efficiency, PIBB is
shown in [92], [93] to perform similarly to or better than
PI2 in the same tasks with which PI2 was initially tested
and shown to outperform other state-of-the-art algorithms [95].
II. METHODS
A. iCub hand and fingers
The model was tested with the fingers of the hand of
the iCub humanoid robot (http://wiki.icub.org/wiki/ICub_joints), a robot built to study cognitive
development in children [99], [100]. Each arm of the iCub has
16 joints: three for the shoulder (J0-J2), one for the elbow (J3),
three for the wrist (J4-J6) and nine for the hand (J7-J15). During the experiments 2 DOFs of the index and 2 DOFs of the
thumb were controlled, while all other DOFs were kept at fixed
values to keep the fingers in a straight posture. The 4 DOFs
controlled were (Fig. 1): adduction/abduction (AAI) of the first
index joint J10, ranging in [20°; 55°]; flexion/extension (FEI)
of the second index joint J12, ranging in [20°; 70°]; opposition
(OT) of the first thumb joint J7, ranging in
[22.5°; 54°]; flexion/extension (FET) of the second
thumb joint J8, ranging in [15°; 47.5°].

Fig. 1. The iCub hand (a) and the kinematic structure (b) of index and thumb
fingers used in the manipulation tasks.

B. Manipulation tasks and manipulated objects


The robot was tested with three independent manipulation
tasks, each involving a learning phase divided into roll-outs.
The first two tasks required learning to perform, respectively, a
discrete movement (contact-task) and a rhythmic movement
(tapping-task) using the two joints of the thumb (OT and
FET). The object manipulated by the robot during both tasks
was a small cylindrical sponge covered with aluminium
foil. The sponge was anchored through a spring to another
sponge inserted in a rigid plastic cylinder base, in turn located
on a table (Fig. 2). The sponges and the spring ensured a
safe compliant contact between the iCub thumb and the object,
as the robot's fingers are very fragile and easily breakable. The
aluminium foil was used as the iCub touch sensors are more
sensitive with metallic surfaces (the touch sensor plastic cover
was removed for this purpose). The touch signal detected by the
robot touch sensors was used to train the robot to accomplish
both tasks (cf. Sec. II-C3) as well as to collect data during
the test phases (cf. Sec. III). The contact-task required the robot to gain
contact with the object and to maintain it for the rest of the
roll-out (cf. [101]). The tapping-task required the robot to touch the
object as many times as possible during the roll-out.
The third task (rotation-task) required the robot to rotate a cylindrical object around its vertical axis as much as possible during
the roll-out. Fig. 3 shows the device used during this task. The
device consisted of two parts: a cylindrical object, built with
a soft polyurethane resin, and a module (Circular Tap) used
to detect the angle of rotation of the object on the basis of an
infrared encoder [102], [103].

Fig. 2. (a) The iCub robot and the object used during the contact-task and
the tapping-task. (b) A zoomed picture of the robot hand and object.

Fig. 3. The iCub hand and the device used for the rotation-task, formed by
a manipulated cylinder (left side) and the Circular Tap module (right side)
measuring the cylinder rotation.

C. Model architecture, functioning, and learning


Fig. 4 shows the architecture of the model. The PIBB
algorithm was used to search for the parameters of the CPGs.
The outputs of the CPGs represented the desired joint angles
(d ) of the thumb joints (in the contact-task and tapping-task)
or of the thumb and index joints (in the rotation-task). The
current (r ) and desired joint angles (d ) were used by the
Proportional-Integral-Derivative (PID) controller embedded
in the iCub control software (this PID should not be confused
with the PD system of the MP illustrated below) to compute
the motor commands sent to the robot.
Fig. 4. The overall architecture of the model. The main components of
the architecture are: the PIBB learning component, the MPs, and the PID
controller. Solid arrows represent information flows taking place at each time
step whereas the dashed arrow represents an information flow taking place at
the end of each roll-out.

1) Motor primitive (MP): The MPs acting on the hand
joint DOFs were composed of coupled oscillators with tunable
centres, each controlling a different joint. Each MP was based
on the CPG model proposed in [8] with an additional PD-based component (Eq. 3) enabling the regulation of the centre
of oscillation of each oscillator (cf. [84]). The equations
regulating the dynamics of the MP were as follows:
$$\dot{\phi}_i = 2\pi\nu_i + \sum_j r_j w_{ij} \sin(\phi_j - \phi_i - \psi_{ij}) \tag{1}$$

$$\ddot{r}_i = a_i \left( \frac{a_i}{4} (R_i - r_i) - \dot{r}_i \right) \tag{2}$$

$$\ddot{c}_i = b_i \left( \frac{b_i}{4} (C_i - c_i) - \dot{c}_i \right) \tag{3}$$

$$z_i = c_i + r_i \cos(\phi_i) \tag{4}$$

where $\phi_i$ is the phase (in radians) of the oscillation of oscillator
$i$, $\nu_i$ is the intrinsic oscillation frequency, $\psi_{ij}$ is the desired
phase difference between oscillator $i$ and oscillator $j$ of the
CPG, $w_{ij}$ represents the strength of the coupling between the
oscillations of oscillators $i$ and $j$, $r_i$ is the actual oscillation
amplitude, $R_i$ is the desired oscillation amplitude, $a_i$ is a
positive constant determining the convergence rate of $r_i$ to
$R_i$, $c_i$ is the actual oscillation centre, $C_i$ is the desired
oscillation centre, $b_i$ is a positive constant determining the
convergence rate of $c_i$ to $C_i$, and $z_i$ is the controlled variable
(here a joint angle). According to Eq. 1, the evolution of
the phase $\phi_i$ depends on the intrinsic frequency $\nu_i$, on the
coupling strengths $w_{ij}$, and on the desired phase differences
$\psi_{ij}$. The amplitude variable $r_i$ smoothly converges to $R_i$
following a damped second-order differential law (Eq. 2). Similarly,
the oscillation centre $c_i$ smoothly converges to $C_i$, again with
a damped second-order differential law (Eq. 3). Each CPG oscillator
thus involved the following parameters searched by PIBB: $R_i$
(desired amplitude); $C_i$ (desired oscillation centre); $\nu_i$ (desired
oscillation frequency, here unique for all oscillators of the
CPG). Additionally, for each pair of oscillators the CPG
required one parameter $\psi_{ij}$ (desired phase difference) and one
parameter $w_{ij}$ (coupling between oscillators). Note that the
parameters $w_{ij}$ allow the configuration (here by design) of
the dependencies between oscillation centres, in particular by
setting them to one or zero when the dependencies are respectively present
or absent. The values chosen here reflect the tree structure
of the hand joints used. Dependencies and independences
between oscillators can be easily represented with a directed
graph showing the oscillators as nodes and their dependencies
as directed links (Fig. 5).
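The dynamics of Eqs. 1-4 can be illustrated with a minimal Python sketch that integrates the phase, amplitude, and centre of each oscillator with a simple Euler scheme. The function name `simulate_mp`, the gains `a` and `b`, the integration step `dt`, and the zero initial state are illustrative assumptions and are not taken from the paper; the sketch only aims to show how a null desired amplitude yields a discrete movement towards the desired centre, while a non-null amplitude yields an oscillation around it.

```python
import math


def simulate_mp(nu, R, C, w, psi, a=20.0, b=20.0, dt=0.01, steps=2000):
    """Euler integration of the oscillator equations (Eqs. 1-4).

    nu: common intrinsic frequency; R, C: desired amplitudes and centres;
    w, psi: coupling weights and desired phase differences (n x n lists).
    a, b, dt, steps and the zero initial state are assumptions of this sketch.
    Returns the list of output vectors z, one per integration step.
    """
    n = len(R)
    phi = [0.0] * n
    r, dr = [0.0] * n, [0.0] * n
    c, dc = [0.0] * n, [0.0] * n
    trajectory = []
    for _ in range(steps):
        # Eq. 1: phase velocity from intrinsic frequency plus coupling terms
        dphi = [
            2.0 * math.pi * nu
            + sum(r[j] * w[i][j] * math.sin(phi[j] - phi[i] - psi[i][j])
                  for j in range(n))
            for i in range(n)
        ]
        for i in range(n):
            # Eq. 2: amplitude converges to R[i]
            ddr = a * (a / 4.0 * (R[i] - r[i]) - dr[i])
            # Eq. 3: centre converges to C[i]
            ddc = b * (b / 4.0 * (C[i] - c[i]) - dc[i])
            dr[i] += ddr * dt
            r[i] += dr[i] * dt
            dc[i] += ddc * dt
            c[i] += dc[i] * dt
            phi[i] += dphi[i] * dt
        # Eq. 4: controlled variable (desired joint angle)
        trajectory.append([c[i] + r[i] * math.cos(phi[i]) for i in range(n)])
    return trajectory
```

With a single uncoupled oscillator, `simulate_mp(1.0, [0.0], [1.0], [[0.0]], [[0.0]])` produces a purely discrete movement that settles at the centre, whereas setting the desired amplitude to 0.5 produces an oscillation between roughly 0.5 and 1.5 once the transient has died out.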

We tested MPs (CPGs) having different complexity to verify
whether they had a different performance as found in [84] (the
latter work used different CPGs to compare different kinematic
properties of two robotic hands). The complexity and the
number of parameters of the CPGs used here depended on the
number of oscillators used and the number of DOFs controlled
by each oscillator (see Table I and Table II):
• Fig. 5a shows the simple CPG (CPG-S). This had
one oscillator, N1, generating the desired angles of all
DOFs: FEI, FET, AAI, OT. The CPG-S is not used in
isolation but together with other oscillators within the
more sophisticated CPG-H.
• Fig. 5b shows the medium-complexity CPG (CPG-M).
This was formed by two oscillators: N1 generating the
desired angle for both FEI and FET; N2 generating the
desired angle for both AAI and OT. The CPG-M is not
used in isolation but together with other oscillators within
the more sophisticated CPG-H.
• Fig. 5c shows the complex CPG (CPG-C) having four
oscillators: N1 generating the desired angle of FEI; N2
generating the desired angle of AAI; N3 generating the
desired angle of FET; and N4 generating the desired angle
of OT.
• Finally, a hierarchical CPG (CPG-H) (so called for coherence with our previous work [84] where it was part of
a two-level hierarchical RL architecture; here the CPG-H
is more precisely a compound CPG) was formed by all
the three CPGs considered above. In particular, the output
signals of the CPG-H were the weighted average of the
three CPGs' output signals computed using the weights
{ws, wm, wc}: these weights represented three additional
parameters searched by PIBB.
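The combination performed by the CPG-H can be sketched as follows. This is only an illustration: the function name `cpg_h_output` is hypothetical, and the normalisation by the sum of the weights is an assumption (the text states that the output is a weighted average but does not give the formula). Each input list holds the desired angles produced by one CPG for the same ordered set of DOFs.

```python
def cpg_h_output(z_s, z_m, z_c, w_s, w_m, w_c):
    """Weighted average of the per-DOF outputs of CPG-S, CPG-M and CPG-C.

    Assumption of this sketch: weights are normalised by their sum.
    """
    total = w_s + w_m + w_c
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return [
        (w_s * s + w_m * m + w_c * c) / total
        for s, m, c in zip(z_s, z_m, z_c)
    ]
```

For example, with weights 1, 1 and 2 the CPG-C output counts twice as much as each of the other two, so per-DOF outputs 1, 2 and 3 combine to 2.25.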
2) The CPGs used in the three tasks: In the rotation-task
we compared the performance obtained by the system when
using the CPG-C and the CPG-H as in our previous work [84].
The parameters of all CPGs are indicated in Table I.

TABLE I
PARAMETERS OF DIFFERENT CPGS: THE CPG-C AND THE CPG-H WERE
USED IN THE ROTATION-TASK.
CPG      Input parameters
CPG-S    $\theta_k = [\nu, R, C]$
CPG-M    $\theta_k = [\nu, R_1, R_2, C_1, C_2, \psi_{12}]$
CPG-C    $\theta_k = [\nu, R_1, R_2, R_3, R_4, C_1, C_2, C_3, C_4, \psi_{34}, \psi_{31}, \psi_{12}]$
CPG-H    $\theta_k = [\nu, R_1, R_2, R_3, R_4, R_5, R_6, R_7, C_1, C_2, C_3, C_4, C_5, C_6, C_7, \psi_{12}, \psi_{12}, \psi_{13}, \psi_{34}, w_s, w_m, w_c]$

For the contact-task and the tapping-task we used only
the CPG-C as its comparison with the other CPGs in the
more complex rotation-task revealed this was the best one
(cf. Sec. III). For these tasks we used the flexion/extension
of the thumb (FET) and the opposition of the thumb (OT):
the parameters of the CPG controlling them are indicated in
Table II.
Fig. 5. The CPGs used in the model, having different levels of complexity.
(a) CPG-S formed by one single oscillator (N1). (b) CPG-M formed by two
oscillators (N1, N2). (c) CPG-C formed by four oscillators (N1, N2, N3,
N4). The CPG-H was a combination of CPG-S, CPG-M, and CPG-C. In each
graph, a node represents one oscillator, a directed link (arrowhead) represents
the dependence between two oscillators. The absence of a link between two
oscillators indicates that they are independent (i.e., their $w_{ij} = 0$).

TABLE II
CPG PARAMETERS USED IN THE CONTACT-TASK AND THE TAPPING-TASK
(HENCEFORTH, IN THESE TASKS THE INDEXES OF $\psi_{12}$ ARE DROPPED AS
THIS PARAMETER IS UNIQUE).
CPG      CPG input parameters
CPG-C    $\nu, R_3, R_4, C_3, C_4, \psi$

3) Policy Improvement Black Box (PIBB): We now provide
an overview of PIBB and then explain it in detail. Each roll-out
(trial) lasted 300 time steps. Each step lasted 0.2 s. The roll-outs of the training process used for each task were clustered in
groups (epochs) of K roll-outs each. At the beginning of each
training, PIBB was given a certain initial parameter set of the
CPGs, $\theta_0$. At the beginning of each epoch, PIBB generated
K samples $\theta_k$ of the CPG parameters on the basis of the
current parameter set $\theta$. During one roll-out, at each time step
the CPG corresponding to a certain parameter sample of the
K samples produced desired joint angles that were issued to
the iCub hand. The performance of each parameter sample
was measured during one roll-out on the basis of a certain
cost function. PIBB then produced a new parameter set $\tilde{\theta}$
as an average of the K parameter samples $\theta_k$ with weights
depending on their performance. This process was repeated
for several succeeding epochs. We now explain the steps of
PIBB in more detail.
a) Sampling for one epoch: Initially, the parameters were
set to values $\theta_0$ suitable for the purposes of the three tasks (see
Sec. III). For each epoch, K parameter samples $\theta_k$ (K = 6 in
the contact-task and tapping-task; K = 8 in the rotation-task)
were generated from the mean parameter set $\theta$ (parameters
were scaled to [0, 1] for this purpose). The random sampling
was based on a multivariate Gaussian distribution:

$$\theta_k \sim \mathcal{N}(\theta, \Sigma) \tag{5}$$

with $\theta$ and $\Sigma$ representing the mean vector and covariance
matrix of the distribution, respectively. The matrix $\Sigma$ had
0.1 elements along the principal diagonal, and 0 elements
elsewhere, thus implying a constant exploration noise used
to explore the parameter space.
b) Testing the samples: Each sample $\theta_k$ represented a
possible parameter set for the CPGs. Based on one parameter
set the CPGs generated a certain trajectory (desired values) of
the robot joints during one roll-out. During each roll-out the
performance of the parameter sample was evaluated through
a cost function depending on the task. In the contact-task,
the cost function $J_C$ was based on the difference between the
maximum number of roll-out steps that the robot could ideally
spend in contact with the object ($C_{max} = 300$, equal to the
duration of the roll-out) and the number of actual steps in
contact with the object, $C$:

$$J_C = C_{max} - C \tag{6}$$

In the tapping-task, the cost function $J_T$ was computed by
considering the difference between the maximum number of
times the thumb could intermittently hit the object within a
roll-out ($T_{max} = 100$) and the number of actual intermittent
touches of the object, $T$:

$$J_T = T_{max} - T \tag{7}$$

In the rotation-task, the cost function $J_R$ was computed
as the difference between the maximum rotation angle of the
cylindrical object in one roll-out ($R_{max} = 200°$, estimated on
the basis of some pilot experiments) and the actual rotation
angle during the roll-out, $R$:

$$J_R = R_{max} - R \tag{8}$$
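The three task costs (contact, tapping, and rotation) can be sketched in Python as follows. The function names are hypothetical, and two points are assumptions of this sketch rather than details given in the text: the touch signal is treated as one boolean per time step, and a tap is counted as a transition from no contact to contact.

```python
def contact_cost(touch, c_max=300):
    """Cost for the contact-task: maximum steps minus steps spent in contact."""
    return c_max - sum(1 for t in touch if t)


def count_taps(touch):
    """Count taps as no-contact to contact transitions (assumption of this sketch)."""
    taps = 0
    previous = False
    for t in touch:
        if t and not previous:
            taps += 1
        previous = bool(t)
    return taps


def tapping_cost(touch, t_max=100):
    """Cost for the tapping-task: maximum taps minus taps actually performed."""
    return t_max - count_taps(touch)


def rotation_cost(angle_deg, r_max=200.0):
    """Cost for the rotation-task: maximum rotation minus rotation achieved (degrees)."""
    return r_max - angle_deg
```

A roll-out in which the finger reaches the object after 100 steps and stays on it has a contact cost of 100 but a tapping cost of 99, which illustrates how the same behaviour is scored very differently by the two tasks.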

c) Reward-weighted average of samples to update $\theta$:
Cost values $J_{dk}$ obtained by the samples $\theta_k$ in task $d$ were
compared and transformed into the probabilities $p_{dk}$ of contributing to the new mean parameter set through a soft-max
function:

$$p_{dk} = \frac{e^{-\frac{1}{\lambda} J_{dk}}}{\sum_{l=1}^{K} e^{-\frac{1}{\lambda} J_{dl}}} \tag{9}$$

where $\lambda$ ($\lambda = 10$) was a temperature parameter regulating
the effect on the probabilities of different performance levels
(chosen based on some pilot tests). The new mean parameter
set, $\tilde{\theta}$, was then computed as the weighted average of the
samples $\theta_k$, using their probabilities $p_{dk}$ as weights (reward-weighted averaging):

$$\tilde{\theta} = \sum_{k=1}^{K} p_{dk} \theta_k \tag{10}$$

The whole process was iterated for M successive epochs
(M = 10 for the contact-task and tapping-task, and M = 35
for the rotation-task). The idea behind the algorithm is that
since Eq. 9 tends to assign higher probabilities to samples
with lower costs, the new mean parameters (Eq. 10) tend to
improve during the epochs.
Note that, due to Eq. 9 used to compare the performance of
different parameter samples, the maximum values considered
in the cost formulas ($C_{max}$, $T_{max}$, and $R_{max}$) do not affect
the algorithm (see footnote 2).
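One epoch of the procedure described above (Gaussian sampling around the current mean, evaluation of each sample, soft-max weighting, and reward-weighted averaging) can be sketched in Python as follows. The names `softmax_weights` and `pibb_epoch` and the toy cost used in the usage note are hypothetical. Two choices are assumptions of this sketch: the value 0.1 on the diagonal of the covariance matrix is interpreted as a variance, and the minimum cost is subtracted before exponentiation for numerical stability, which leaves the probabilities unchanged for the same reason that the maximum values in the cost formulas do not matter.

```python
import math
import random


def softmax_weights(costs, lam=10.0):
    """Soft-max probabilities: lower costs receive higher probabilities."""
    lowest = min(costs)  # shift for numerical stability; result is unchanged
    exps = [math.exp(-(cost - lowest) / lam) for cost in costs]
    total = sum(exps)
    return [e / total for e in exps]


def pibb_epoch(theta, cost_fn, K=6, variance=0.1, lam=10.0, rng=random):
    """One epoch: sample K parameter sets, evaluate them, average by probability.

    cost_fn stands in for one roll-out on the robot (hypothetical callable).
    """
    std = math.sqrt(variance)  # diagonal covariance, same variance on each parameter
    samples = [[rng.gauss(mean, std) for mean in theta] for _ in range(K)]
    costs = [cost_fn(sample) for sample in samples]
    probs = softmax_weights(costs, lam)
    new_theta = [
        sum(probs[k] * samples[k][d] for k in range(K))
        for d in range(len(theta))
    ]
    return new_theta, samples, costs
```

For instance, calling `pibb_epoch([0.5, 0.5], cost_fn)` repeatedly, each time passing the returned mean as the new `theta`, reproduces the epoch loop; on the robot, `cost_fn` would run one roll-out with the sampled CPG parameters and return the task cost.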
2 Indeed:
$$\frac{e^{-(constant - x_i)}}{\sum_k e^{-(constant - x_k)}} = \frac{e^{-constant} \, e^{x_i}}{e^{-constant} \sum_k e^{x_k}} = \frac{e^{x_i}}{\sum_k e^{x_k}}$$


III. RESULTS
This section presents the results of the tests of the model.
The capacity of the model to learn to solve each of the three
tasks was tested three times for each task. Pilot experiments
revealed that the initial values of some parameters (vector $\theta_0$)
were very important as they affected exploration, so they were
set on the basis of the following criteria (such parameters were
set equal in all repetitions of the experiments). The parameters
related to the centres of oscillation of joints were set to initial
values that corresponded to fingers far from the object and the
initial finger posture was set out of contact with the object.
This was done to represent a challenge which is common
in ecological conditions, where infants would benefit, in
terms of learning, from interacting with objects but their hands
are not in contact with them. Recall that the cost (reward)
used in each task was only related to an actual interaction with
the object (touching, rotating, etc.) and not to other elements
such as the visual hand-object distance, which is difficult to evaluate
for the robot and for infants at the beginning of development
[104]. In some pilot experiments, we set the initial values
of the oscillation amplitude to low/null values and the initial
oscillation centres to values different from those of the initial
posture, thus causing a discrete movement. This produced a
contact with the object only for some values of the initial
oscillation centres, whereas for other values the finger moved
far from the object and the learning process did not take off
for lack of feedback (reward). Instead, setting the oscillation
parameters to intermediate values, involving relatively high
initial oscillations of the joints, led the system to explore a larger area
of the working space, thus leading to the successful results
reported below. Notice how these outcomes rely on the basic
fact that exploring the work space through a
circular/ellipsoid trajectory (rhythmic movement) tends to have
a higher probability of encountering objects than exploring
the work space through a straight/bent discrete trajectory if
the latter is not very sophisticated.
The other parameters were set as follows. The initial frequency of oscillation was set to relatively low values to allow
us to detect and avoid potentially dangerous interactions of the
robot with the objects, given that the iCub fingers are
not compliant. For similar safety reasons we set the duration
of learning (ten epochs for the contact and tapping tasks, and
35 epochs for the object rotation task) to values that on the one
hand ensured a satisfying performance (possibly not reaching
the maximum) but on the other hand did not allow the
robot to optimise performance at the cost of safety, for
example by setting oscillation centres inside the objects or by
producing oscillations that were too ample or too fast, implying too strong a
contact with the objects. In addition to this, we searched for initial
hand-object configurations that lowered the risk of incurring
these behaviours and we interrupted experiment repetitions
when such risks arose. The other parameters were set to
intermediate values of their range. The specific initial values
and ranges of the parameters are reported in the sections below.
The initial configuration of the hand was the one after the
automatic initialisation of the robot, corresponding to a fully
open hand (see Fig. 1a).

A. Contact task
Fig. 6 shows the performance of the model (based on the
CPG-C and PD equations setting the oscillation centres) in the
contact task, related to three repetitions of the test, measured
as the number of steps in which the thumb was in
contact with the target object during a roll-out (C in Eq. 6).
In all three repetitions, the performance increases with the
number of epochs and reaches the maximum after ten epochs,
corresponding to a stable contact of the finger with the target
object.

Fig. 6. Contact task: sixth-order polynomial regression of the performance
of the model (CPG-C and PD setting the oscillation centres) in the three
repetitions of the task during ten epochs (x-axis: number of epochs; y-axis:
number of steps in contact). The stars indicate the values of the
robot's performance in the three repetitions. In all three repetitions, after
learning the robot achieves maximum performance corresponding to a stable
contact of the finger with the object for the whole roll-out duration.

Fig. 7 shows how before learning the controlled joints of the
thumb exhibit oscillatory movements, which result in irregular
contacts with the object, whereas after learning they exhibit a
stable posture corresponding to a continuous contact with the
object.
Table III summarizes the CPG-C parameter values before
and after learning in the best repetition of the experiment.

TABLE III
CONTACT TASK: VALUES OF THE MODEL (CPG-C AND PD) PARAMETERS
BEFORE LEARNING AND AFTER LEARNING (AVERAGE FOR THE THREE
REPETITIONS OF THE EXPERIMENT).
Parameter   Range            Initial value   Final value
$\nu$       [2.00; 6.00]     3.50            3.99
$\psi$      [-3.14; 3.14]    0.00            0.35
R1          [0.00; 1.00]     0.50            0.62
R2          [0.00; 1.00]     0.50            0.28
C1          [0.00; 2.00]     1.00            1.97
C2          [0.00; 2.00]     1.00            2.00

Fig. 7. Contact task: trajectories performed by the thumb joints FET (thin
line) and OT (thick line) in the best of the three experiment repetitions with
the CPG-C/PD model (x-axis: number of time steps; y-axis: joint angle in
degrees). (a) Before learning the task. (b) After learning the task.

Fig. 8 shows the dynamics of the parameter values tried
out by PIBB during learning. The dynamics are quite similar
for the three repetitions of the experiment, suggesting that
the evolution of the parameters is driven by consistent, task-related
pressures rather than by chance.
The figure shows an interesting evolution of the parameters
during learning. The desired oscillation centres C1 and C2
progressively change during the epochs so that, after learning,
during one roll-out the thumb moves from its initial position
towards the object in order to gain a stable contact with it. The
desired oscillation amplitude R2 (the thumb opposition)
initially increases to a high value, ensuring that the finger
repeatedly comes into contact with the object and gets some
reward feedback even though the centre of the related
oscillation, C2, is still not on the object. With the progression
of learning, R2 moves towards lower values, ensuring that
during the roll-out the finger reaches a stable contact with
the object. The residual amplitude R1 (thumb
flexion/extension) at the end of learning does not affect the
stability of the contact, as the related movement tends to be
prevented because the finger is strongly pressed against the object
(see Fig. 7). Once the system has gained a stable contact with
the object by reducing the main oscillation (R2), the values
of the oscillation frequency ($\nu$) and phase difference between
the joints ($\psi$) do not affect behaviour anymore.

Fig. 8. Contact task: dynamics of the CPG-C/PD parameter values found
by PIBB (x-axis of each panel: number of epochs). For each parameter, the curve represents a sixth-order polynomial
regression of the values of the parameters (marked by stars) during the three
repetitions of the experiment. (a) Dynamics of the oscillation frequency ($\nu$).
(b) Dynamics of the phase difference between the joints ($\psi$). (c) Desired
amplitude of oscillation of the thumb joint OT (R1). (d) Desired amplitude
of oscillation of the thumb joint FET (R2). (e) Desired oscillation centre of
the thumb joint OT (C1). (f) Desired oscillation centre of the thumb joint
FET (C2).

B. Tapping task
Fig. 9 shows the performance of the model (CPG-C and
PD) in the tapping-task, related to three repetitions of the test
measured as the number of taps on the object (T in Eq. 7). In
all the three repetitions, after the training the model exhibits
a dynamics of the finger that allows it to perform a reliable
and fast tapping of the object.
Fig. 10 shows how the thumb joints, OT and FET, are
controlled before and after learning. The figure shows that,
similarly to the contact-task, the initial exploration starts with
a rhythmic movement. However, now the task demand requires
a final rhythmic behaviour rather than a discrete one. As a
consequence, although the set-up is identical for the two tasks
and the sole difference between them is the cost function, now
the model continues to produce a rhythmic movement until
the end of learning, as shown in Fig. 10b. However, the
figure also shows that the rhythmic behaviour is progressively
changed to produce an effective tapping behaviour yielding a
higher performance in the task.

Fig. 9. Tapping task: performance of the system (CPG-C and PD) in three
repetitions of the test (x-axis: number of epochs; y-axis: number of contacts).
The curve represents a sixth-order polynomial regression
of the performance of the model in the three repetitions. The stars indicate
the performance values of the system in the three repetitions.

Fig. 10. Tapping task: trajectories performed by the thumb joints OT and
FET in the best of the three experiment repetitions with the CPG-C/PD model
(x-axis: number of time steps; y-axis: joint angle in degrees).
(a) Before learning the task. (b) After learning the task.
Table IV indicates the initial and final values of the CPG-C
parameters and the range within which they could change, in
the best repetition of the experiment.

TABLE IV
TAPPING TASK: VALUES OF THE MODEL (CPG-C AND PD) PARAMETERS
BEFORE LEARNING AND AFTER LEARNING (AVERAGE FOR THE THREE
REPETITIONS OF THE EXPERIMENT).
Parameter   Range            Initial value   Final value
$\nu$       [2.00; 6.00]     3.50            3.64
$\psi$      [-3.14; 3.14]    0.00            -0.05
R1          [0.00; 1.00]     0.50            0.68
R2          [0.00; 1.00]     0.50            0.93
C1          [0.00; 2.00]     1.00            1.48
C2          [0.00; 2.00]     1.00            1.37

Fig. 11 shows the dynamics of the parameters during
learning for the three repetitions of the test. The evolution of
the parameters is again quite stable in the three repetitions. The
figure shows that the model learned to accomplish the task by
increasing the amplitude of the oscillatory trajectories followed
by OT (R2) and FET (R1). In particular, the oscillation of
the critical thumb opposition joint (R2) was raised to the
maximum value to guarantee an ample oscillation of the finger
towards and away from the object. Interestingly, the model
also shifted the centres of oscillation of the two joints (C1
and C2), likely to settle them at a distance from the object
that facilitated its tapping and a reliable contact with the object
sufficient to trigger the touch sensor activation, as revealed by
the regularity of the acquired behaviour (see Fig. 10).
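To make the role of the amplitude and centre parameters in Table IV more concrete, the following minimal sketch generates a desired joint trajectory from a frequency, an amplitude R and an oscillation centre C. It assumes a plain sinusoid in normalised units, which is a simplification and not the CPG-C/PD equations of the model; the names `desired_angle`, `trajectory` and `swing`, as well as the sampling step, are hypothetical and introduced only for illustration.

```python
import math

def desired_angle(t, freq, amplitude, centre, phase=0.0):
    """Desired joint position at time t (normalised units).

    Simplifying assumption: a plain sinusoid around a fixed centre,
    not the coupled-oscillator dynamics of the CPG-C/PD model.
    """
    return centre + amplitude * math.cos(2.0 * math.pi * freq * t + phase)

def trajectory(freq, amplitude, centre, phase=0.0, duration=1.0, dt=0.001):
    """Sample the desired trajectory over `duration` with step `dt`."""
    steps = int(round(duration / dt))
    return [desired_angle(k * dt, freq, amplitude, centre, phase)
            for k in range(steps + 1)]

def swing(samples):
    """Peak-to-peak excursion of a sampled trajectory."""
    return max(samples) - min(samples)

# Frequency, R2 and C2 values taken from Table IV (before and after learning).
before = trajectory(freq=3.50, amplitude=0.50, centre=1.00)
after = trajectory(freq=3.64, amplitude=0.93, centre=1.37)

print(f"swing before: {swing(before):.2f}, after: {swing(after):.2f}")
```

Under this simplification, a larger R widens the excursion of the joint while a shifted C moves the whole oscillation towards or away from the object, which is the qualitative effect described above.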
[Fig. 11 plots: panels (a) to (f), each showing the value of one parameter versus number of epochs.]

Fig. 11. Tapping task: dynamics of the CPG-C/PD parameter values found
by PI^BB. For each parameter, the curve represents a sixth-order polynomial
regression of the values of the parameter (marked by stars) during the three
repetitions of the experiment. (a) Dynamics of the oscillation frequency.
(b) Dynamics of the phase difference between the joints. (c) Desired
amplitude of oscillation of the thumb joint OT (R1). (d) Desired amplitude
of oscillation of the thumb joint FET (R2). (e) Desired oscillation centre of
the thumb joint OT (C1). (f) Desired oscillation centre of the thumb joint
FET (C2).

C. Rotation-task
Fig. 12 shows the performance of the robot, measured in
terms of degrees of rotation of the cylindrical object during
one roll-out (R in Eq. 8), in the rotation-task. In particular,
the data shown in the figure refer to the model employing
either the complex CPG (CPG-C) or the hierarchical CPG
(CPG-H). The figure shows that in both cases the performance
increases with the number of epochs. The results show that
CPG-C rather than CPG-H reaches the highest performance
at the end of the simulation, thus indicating that the results
obtained in [84] do not hold with the domain and algorithm
(PI^BB) used here (see Sec. V for a discussion).
Given the higher performance of the CPG-C, we now focus
on the analysis of the behaviour and functioning of this version
of the model. Fig. 13 shows the typical contacts of the thumb
and index with the object, causing the object rotation, at the
beginning and end of training. These data show that the model
has discovered an unexpected way of exploiting the set-up
features to improve performance. Indeed, the thumb and index
gain contact with the cylinder, and rotate it, in an alternating
fashion. The solution found by the robot is possible as the
object is firmly anchored to the environment, so the rotation of
the object does not require simultaneous pressure on the object


[Fig. 12 plot: performance (rotation angle, in degrees) versus number of epochs.]

Fig. 12. Second-order polynomial regressions of the performance of the
model in three repetitions of the rotation-task when using the CPG-H/PD
(repetitions indicated with stars) and the CPG-C/PD (repetitions indicated
with circles).

by both fingers to compensate for the opposing forces and keep
the object in place (as in the case of unscrewing a jar cap in
everyday conditions). In the conditions of the experiment, the
alternation between the two fingers allows a faster rotation of
the object; indeed, while one finger is regaining contact with
the object to start a new rotation, the other finger can rotate
the object, thus making better use of the roll-out time. Fig. 14
shows a sequence of pictures illustrating this
behaviour in detail. This outcome illustrates the capacity of the
model to discover and exploit the possibilities offered by the
actual physical interaction with the environment, thus giving
an idea of its potential to capture the type of opportunities that
infants might encounter while learning in realistic physical set-ups
(although some opportunities are necessarily different due to the
different body hardware).

[Fig. 13 plot: contacts of the thumb and of the index with the cylinder versus number of time steps; one panel before learning and one panel after learning.]

Fig. 13. Contacts of the thumb and index fingers with the cylinder in the
rotation task, before and after learning (CPG-C/PD model). Notice how the
initial unstructured interaction of the fingers with the object becomes a
well-coordinated behaviour of the fingers after learning.

Fig. 14. Six phases of the object rotation caused by the thumb/index
coordinated movements after learning (CPG-C/PD model). (a) the thumb gains
a reliable contact with the cylinder; (b) the thumb starts rotating the cylinder
while the index does not touch the cylinder; (c) while the thumb rotates the
cylinder the index starts approaching it; (d) the thumb reaches its movement
limit (maximum values for its DOFs) and the index gains contact with the
cylinder; (e) the thumb has left the cylinder and starts going back to the
starting position while the index rotates the cylinder; (f) the thumb is in the
starting position and the index leaves the cylinder.

Fig. 15 shows the desired trajectories generated by the
model and the actual trajectories measured by the robot
encoders for all four controlled joints. The figure indicates
that the final behaviour involves well-coordinated rhythmic
dynamics of the four robot joints (e.g., notice how the
oscillations have different phases). The figure also shows an
imperfect match between the desired and actual trajectories
followed by the fingers, in particular for the two joints of
the index (this is likely due to hardware calibration or delay
issues related to the index finger). The adaptive algorithm is
capable of taking these features of the hardware response into
consideration as it searches for the motor behaviour on the
basis of its final performance measured by the cylinder rotation
(task demand), rather than in terms of proximal measures (e.g.,
response of the hardware).
Table V shows the values of the CPG-C parameters at
the beginning and after learning, and the range within which
they could change, for the best of the three repetitions of the
test. The table shows that the learning algorithm substantially
increases the frequency of oscillation to perform faster
movements. The phase differences found between the
oscillators of the joints ensure a suitable coordination between
fingers, leading to the coordinated alternation of the thumb
and index described above (Fig. 14). Finally, the centres of
the oscillations (C) changed with learning even though
the task requires a final rhythmic behaviour: this probably
happened to ensure a reliable and timely contact of the fingers
with the object (Fig. 13).

TABLE V
INITIAL (BEFORE LEARNING) AND FINAL (AFTER LEARNING) VALUES OF
THE PARAMETERS OF THE CPG-C/PD MODEL, AND THEIR RANGES, IN
THE ROTATION-TASK (BEST SIMULATION).

CPG-C parameter          Range            Initial value   Final value
Oscillation frequency    [2.00; 6.00]     2.00            5.57
Phase difference 1-2     [-3.14; 3.14]    0.00            0.34
Phase difference 1-3     [-3.14; 3.14]    0.00            1.36
Phase difference 3-4     [-3.14; 3.14]    0.00            0.82
R1                       [0.00; 1.00]     0.50            1.00
R2                       [0.00; 1.00]     0.50            0.72
R3                       [0.00; 1.00]     0.50            1.00
R4                       [0.00; 1.00]     0.50            0.97
C1                       [0.00; 2.00]     1.00            0.86
C2                       [0.00; 2.00]     1.00            0.80
C3                       [0.00; 2.00]     1.00            1.35
C4                       [0.00; 2.00]     1.00            0.54


[Fig. 15 plots: four panels (Joint 7, Joint 8, Joint 10, Joint 12), each showing the desired (CPG) and measured (Encoder) joint angle in degrees versus number of time steps (x 10^4).]

Fig. 15. Desired thumb and index trajectories generated by the CPG-C/PD
model and actual trajectories measured by the robot's encoders (Joint
7: thumb opposition joint; Joint 8: thumb flexion/extension joint; Joint 10:
index adduction/abduction joint; Joint 12: flexion/extension index joint). The
encoders follow the desired postures accurately.

IV. DISCUSSION
Although reproducing detailed behaviours of learning infants was
outside the scope of this work, the results of the
model in the three tests show some patterns that might reflect
some general aspects of the organisation and development of
motor behaviour in children. First, by design the model's initial
exploration is not based on low-level motor babbling (e.g.,
muscle micro-movements affected by white noise) but rather
on a structured behaviour based on motor primitives potentially
capable of producing meaningful outcomes [66]. This
strategy might speed up learning as learning can mould the
whole dynamical patterns generated by the motor primitives,
rather than building adaptive behaviour from scratch. In this
respect, the experiments reported here showed that the model
could reach a satisfying performance in the
touching-task and tapping-task in about ten epochs (hence
10 × 6 roll-outs in total) and in the more complex rotation-task,
which requires a fine coordination between fingers, in 35 epochs
(hence 35 × 8 roll-outs in total). This strategy based on searching
parameters of motor primitives agrees with an important
current trend through which the RL robotic community is
trying to overcome the low-speed learning limitation of RL
algorithms: the algorithms are not directly applied to search
fine movement commands (e.g., forces or angle variations of
joints) but rather to search for the parameters of movement
primitives generating the step-by-step fine movements [94],
[105].
Second, the model incorporates the assumption, based on
empirical data from infants, that when it initially faces a
task it first explores the environment through rhythmic movements.
These might facilitate exploration and ensure an important
initial feedback that allows learning processes to take off,
as suggested by the fact that some oscillations are initially
increased by the learning algorithm even when the requested
final movement is discrete. Oscillations also allow the system
to automatically reset the environment (here the relation
between the fingers and the object), so giving the system
the opportunity of repeatedly experiencing the conditions that
produce the reward and that allow the learning process to take
off. Starting from this initial feedback, the system gradually
refines the rhythmic movements, or transforms them into
discrete movements, depending on the task to be solved. Thus,
for example, in the touching-task the rhythmic movement
allows the robot to come in contact with the object and
experience the reward due to such contact: with learning this
movement evolves into a discrete movement towards the object
ensuring a stable contact with it. In the tapping task the
initial rhythmic movement allows the robot to experience the
reward related to the touch-release interaction with the object:
with learning this movement evolves into a refined tapping
behaviour. In the rotation-task the initial rhythmic movement
allows the robot to experience the reward related to the rotation
of the cylinder: with learning this movement evolves into
a sophisticated finger-alternation movement exploiting some
properties of the physical set-up. These results give initial
indications of the advantages that rhythmic movements might
give in the initial stages of human development [10], [50].

V. CONCLUSION AND FUTURE WORK

A key element of the flexibility of human motor behaviour
is its capacity to produce rhythmic and discrete
movements. Neuroscience indicates that the two types of
movements involve different brain systems; in particular, the
network of areas supporting rhythmic movements tends to be
included in the one supporting discrete movements. Developmental
psychology experiments indicate the importance of
rhythmic movements in the first year of life. These movements
progressively develop into refined discrete or rhythmic movements
depending on the different task demands encountered
by infants. This and other relations between the two types of
movements during development are still unclear and a first
contribution of this work has been to highlight the importance
of this problem for developmental robotics studies on motor
development.
An obstacle to the investigation of this problem is that
current computational models are not suited for it. A second


contribution of this work has been the proposal of a computational
model usable for this purpose. The model is based on
a simple, transparent motor primitive that can produce both
rhythmic and discrete movements. Moreover, the model is
capable of searching the parameters of the motor primitive on
the basis of reinforcement learning algorithms, thus making
the model suitable to study developmental issues depending
on trial-and-error learning processes. The architecture of the
model, encompassing an efficient policy search reinforcement
learning algorithm applied to the search of the parameters of
the motor primitive, learns within a few tens of epochs (Sec. IV)
and can thus be directly
used in real robots. This allows the model to capture some of
the difficulties and opportunities posed by the embodied interactions with the environment faced by infants (but some others
are necessarily different due to their different embodiment).
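The idea of searching the parameters of a motor primitive by trial and error can be illustrated with a minimal sketch of a black-box, reward-weighted averaging update of the kind PI^BB belongs to. The sketch rests on several assumptions that are not taken from this section: the reward function `toy_reward` is a hypothetical stand-in for a robot roll-out, and the exploration noise `sigma`, the eliteness constant `h` and the number of roll-outs are illustrative values rather than the settings used in the experiments.

```python
import math
import random

def reward_weighted_update(theta, sigma, rollout_reward, n_rollouts,
                           h=10.0, rng=random):
    """One epoch: perturb parameters, evaluate roll-outs, average by reward."""
    samples = [[p + rng.gauss(0.0, sigma) for p in theta]
               for _ in range(n_rollouts)]
    rewards = [rollout_reward(s) for s in samples]
    r_min, r_max = min(rewards), max(rewards)
    span = (r_max - r_min) or 1.0  # avoid division by zero if all rewards tie
    # Exponentiated, range-normalised rewards: the best roll-out gets weight e^h.
    weights = [math.exp(h * (r - r_min) / span) for r in rewards]
    total = sum(weights)
    new_theta = [sum(w * s[i] for w, s in zip(weights, samples)) / total
                 for i in range(len(theta))]
    return new_theta, sum(rewards) / len(rewards)

def toy_reward(params):
    """Hypothetical stand-in for a robot roll-out: peak reward at (0.9, 1.4)."""
    target = (0.9, 1.4)
    return -sum((p - t) ** 2 for p, t in zip(params, target))

rng = random.Random(0)
theta = [0.5, 1.0]  # an amplitude and a centre, started at the initial table values
for epoch in range(30):
    theta, mean_reward = reward_weighted_update(
        theta, sigma=0.1, rollout_reward=toy_reward, n_rollouts=8, rng=rng)
print([round(p, 2) for p in theta])
```

Because only the scalar outcome of each roll-out is used, such an update needs no model of the hand or of the object, which is what makes this family of methods practical on real hardware.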
A third contribution of this work has been a first application
of the model to the computational study of the relation between
rhythmic and discrete movements during the development of
manipulation tasks. We have already analysed these results in
Sec. IV. Here we stress that, to our knowledge, this is the
first work showing possible mechanisms through which
rhythmic hand movements can be refined or progressively
evolve into discrete movements. In this respect, the model
also contributes to the issue, much debated in the
literature, of whether different mechanisms produce
discrete and rhythmic movements [2]. Indeed, models like the
one presented here can show the advantages of using one
or the other type of mechanism to produce a higher performance
or a faster learning process (the latter is particularly
important as it is often overlooked in the literature). Although
this cannot settle the issue, which requires further empirical
investigations, it can show the existence of adaptive advantages
to evolve/develop one or the other type of mechanism, or both,
in different conditions.
We are aware that the model and the results presented here
are only a first step towards a systematic study of the rhythmic/discrete movement interplay during development. The
current limitations of the model suggest possible directions
for future research on this topic. From a computational point
of view, a key functionality to add to the model is the capacity
of producing discrete movement trajectories. Indeed, now the
model can only set the final equilibrium points (i.e., the final
desired postures) of the movement, whereas the trajectories
followed by the limbs are automatically generated by the PD
moving the oscillation centre of the CPG and by the robot PID.
This choice allowed us to keep things as simple as possible
in this initial work, but in future work it would be useful to
modify the motor primitive so that it can control the trajectory of
discrete movements, e.g. to move the limb around an obstacle
[44], [90]. To this purpose, one could test the use of the
non-linear forcing component of discrete DMPs [83], which
could be added to the PD-based equation used here to set the
oscillation centres of the CPGs. This solution would still allow
a mix of discrete and rhythmic movements as done here
while at the same time allowing the encoding of more complex
trajectories. This approach, however, would involve a higher
number of parameters, so suitable solutions would be needed
to use it with real robots directed to capture early trial-and-error
learning processes in infants. Another related possibility
would be using the more flexible rhythmic DMPs proposed in
[83] to implement the rhythmic component of the movement
primitive in place of the simple CPG used here. However,
again in this case one should face the problem of how to obtain
a learning process directly applicable to real robots without
necessarily relying on imitation.
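The proposal of adding a non-linear forcing component to the dynamics of the oscillation centre can be sketched as follows. Since the PD-based equation of the model is not reproduced in this section, the sketch uses the common critically damped attractor of discrete DMPs as a stand-in, with a time constant of one; the gains `alpha` and `alpha_x`, the basis-function width, the number of basis functions and the weights are all illustrative assumptions.

```python
import math

def forcing(x, weights, centres, width, scale):
    """Normalised Gaussian basis functions, gated by the phase variable x."""
    psi = [math.exp(-width * (x - c) ** 2) for c in centres]
    mix = sum(p * w for p, w in zip(psi, weights)) / (sum(psi) + 1e-10)
    return mix * x * scale

def integrate_centre(start, goal, weights, alpha=25.0, alpha_x=4.0,
                     duration=2.0, dt=0.001):
    """Euler integration of a PD-like attractor towards `goal` plus forcing."""
    beta = alpha / 4.0  # critical damping
    n = len(weights)
    # Basis centres spaced along the exponentially decaying phase variable.
    centres = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    y, z, x = start, 0.0, 1.0
    path = [y]
    for _ in range(int(round(duration / dt))):
        f = forcing(x, weights, centres, width=50.0, scale=goal - start)
        z += dt * (alpha * (beta * (goal - y) - z) + f)
        y += dt * z
        x += dt * (-alpha_x * x)  # phase decays, so the forcing term vanishes
        path.append(y)
    return path

# Zero weights reduce the primitive to the plain attractor (end point only).
plain = integrate_centre(1.0, 1.5, weights=[0.0] * 5)
# Non-zero weights reshape the path while keeping the same end point.
shaped = integrate_centre(1.0, 1.5, weights=[0.0, 300.0, 0.0, -300.0, 0.0])
print(round(plain[-1], 3), round(shaped[-1], 3))
```

The weights are the additional parameters mentioned above: each basis function adds one value to be found by the learning algorithm, which is the source of the scaling concern for learning on real robots.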
Another very different approach to produce and study the
relation between rhythmic and discrete movements during
development could rely on reservoir computing, e.g. on
echo-state neural networks [106]. In this respect it has been
shown how these networks have the potential to learn to
produce both rhythmic and discrete movements without the
need of predefining specific rhythmic or discrete dynamical
components to produce one or the other [107], although this
has been done only with supervised learning algorithms.
The model presented here was not intended to capture the
detail of the brain organisation underlying the production
of rhythmic and discrete movements. However, the system
architecture in part captures, at least at a functional level, some
aspects of such an organisation. In particular, the system motor
primitive is based on a CPG capable of producing rhythmic
dynamic movement when activated. The information about
the centre of oscillation reaches such a CPG adding a further
dimension to the oscillation. This is reminiscent of the brain
organisation illustrated in Sec. I for which discrete movements
involve additional brain areas that might contribute to modulate the activity of areas generating rhythmic movements [2].
An architecture with a more biologically plausible structure
might produce specific predictions on this topic, which might
be tested against available or new empirical evidence (e.g.,
see [108], [109], [110], [111]). This topic, however, deserves
further investigation.
The model has been used here to solve tasks involving
a relatively small number of degrees of freedom, so one
might wonder how the system would scale to learn more
complex, adult-like tasks involving several possibly redundant degrees of freedom. This problem is also important
for robotic applications. Possible solutions to this problem
might be based on direct inverse modelling approaches [112]
introducing mechanisms to face the redundancy problem (e.g.,
[113]), possibly enhanced by performing babbling in the
goal space (work space) rather than in the joint space [68],
[69]. Some of these approaches have also been extended to
work with reinforcement learning [114], so they could be
integrated with the system proposed here.
Another important issue to tackle in future work is the
learning of multiple tasks, not possible with the current system
but needed, for example, for open-ended/life-long learning
[115], [116]. To this purpose, we might endow the model with
a hierarchical architecture capable of training and selecting
multiple experts [117], [118], e.g. multiple motor primitives,
to solve different tasks requiring different sensorimotor mappings
or sequences of movements [46], [86]. Sequences
of actions might also be mentally assembled on the basis
of forward models [119], [120], [121]. If the system has to
perform multiple possibly interfering tasks, the work of [122]
has proposed solutions based on DMPs and PI^BB to resolve


such interference prior to task execution while removing the
need for task prioritization. Alongside hierarchical architectures,
life-long learning also requires the capability of the
system to self-generate its own goals and motivations to drive
learning processes in an open-ended fashion. Here we have
not simulated internal components of the system producing
the rewards guiding learning, but in the future these might
be implemented on the basis of intrinsic motivations [91], [115],
[123], [124], [125], [126], [127]. A relevant issue to face to
use the model for life-long learning involves the parameter
establishing the number of roll-outs per epoch. This parameter
has to be higher with a higher stochasticity of the world so that
noise is averaged out by multiple roll-outs. One might think to
tune this parameter on the basis of a sampling of such noise.
Another parameter that should be tuned automatically to have
life-long learning and to increase the efficiency of the model
would be the exploration parameter of PI^BB. This could be
regulated on the basis of the reward-weighted variance of the
samples within each epoch [92], [98].
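As a rough illustration of regulating exploration through the reward-weighted variance of the samples, the sketch below recomputes a per-parameter exploration magnitude from the roll-outs of one epoch. The weighting scheme, the constant `h` and the clipping bounds `sigma_min` and `sigma_max` are assumptions made for the example and are not taken from [92], [98] or from the experiments reported here.

```python
import math

def softmax_weights(rewards, h=10.0):
    """Range-normalised exponential weights that sum to one."""
    r_min, r_max = min(rewards), max(rewards)
    span = (r_max - r_min) or 1.0
    raw = [math.exp(h * (r - r_min) / span) for r in rewards]
    total = sum(raw)
    return [w / total for w in raw]

def adapt_exploration(theta, samples, rewards, sigma_min=0.01, sigma_max=0.5):
    """Per-parameter exploration std from the reward-weighted sample variance."""
    weights = softmax_weights(rewards)
    sigmas = []
    for i, centre in enumerate(theta):
        var = sum(w * (s[i] - centre) ** 2 for w, s in zip(weights, samples))
        # Clip so that exploration neither collapses nor explodes.
        sigmas.append(min(max(math.sqrt(var), sigma_min), sigma_max))
    return sigmas

theta = [0.5, 1.0]                       # current parameter estimate
samples = [[0.4, 1.0], [0.6, 1.0], [0.8, 1.0], [0.5, 1.1]]
rewards = [0.1, 0.4, 1.0, 0.2]           # hypothetical roll-out rewards
print(adapt_exploration(theta, samples, rewards))
```

In this example the best roll-out lies far from the current estimate along the first parameter only, so exploration is widened along that parameter and shrunk to the lower bound along the other.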
With respect to the use of the model to investigate developmental phenomena involving rhythmic and discrete movements, future work should address specific empirical data, for
example data on the various classes of rhythmic behaviours
that infants exhibit in the first year of life [10], on how
infants exploit rhythmic and discrete movements to learn to
cause interesting effects in the environment [53], or how they
use rhythmic and discrete movements to solve more complex
problems [128]. We think the results presented here show
that our model represents a valuable starting point to develop
systems usable to carry out these investigations.
REFERENCES
[1] D. Sternad, "Towards a unified theory of rhythmic and discrete movement - behavioral, modeling and imaging results," in Coordination:
Neural, behavioral and social dynamics. Springer, 2008, pp. 105-133.
[2] S. Schaal, D. Sternad, R. Osu, and M. Kawato, "Rhythmic arm
movement is not discrete," Nature Neuroscience, vol. 7, pp. 1136-1143,
2004.
[3] M. Lungarella, G. Metta, R. Pfeifer, and G. Sandini, "Developmental
robotics: a survey," Connection Science, vol. 15, pp. 151-190, 2003.
[4] A. Cangelosi and M. Schlesinger, Developmental robotics: From babies
to robots. Boston, MA: The MIT Press, 2014.
[5] M. Asada, K. F. MacDorman, and Y. Kuniyoshi, "Cognitive developmental
robotics as a new paradigm for design of humanoid robots,"
Robotics and Autonomous Systems, vol. 37, pp. 185-193, 2001.
[6] S. Nolfi and D. Floreano, Evolutionary Robotics: The Biology, Intelligence,
and Technology of Self-Organizing Machines. Cambridge,
MA: MIT Press, 2000.
[7] S. Grillner and P. Wallen, "Central pattern generators for locomotion,
with special reference to vertebrates," Annual Review of Neuroscience,
vol. 8, pp. 233-261, 1985.
[8] A. J. Ijspeert, "Central pattern generators for locomotion control in
animals and robots: a review," Neural Networks, vol. 21, pp. 642-653,
2008.
[9] A. Ijspeert, A. Crespi, D. Ryczko, and J. Cabelguen, "From swimming
to walking with a salamander robot driven by a spinal cord model,"
Science, vol. 315, pp. 1416-1420, 2007.
[10] E. Thelen, "Rhythmical stereotypies in normal human infants," Animal
Behaviour, vol. 27, pp. 699-715, 1979.
[11] E. Zehr, T. Carroll, R. Chua, D. Collins, A. Frigon, C. Haridas,
S. Hundza, and A. Thompson, "Possible contributions of CPG activity
to the control of rhythmic human arm movement," Canadian Journal
of Physiology and Pharmacology, vol. 82, pp. 556-568, 2004.


[12] G. N. Orlovsky, T. G. Deliagina, and S. Grillner, Neuronal control
of locomotion: from mollusc to man. New York: Oxford University
Press, 1999.
[13] H. Taguchi, K. Hase, and T. Maeno, "Analysis of the motion pattern and
the learning mechanism for manipulation objects by human fingers (in
Japanese)," Transactions of the Japan Society of Mechanical Engineers,
vol. 68, pp. 1647-1654, 2002.
[14] Y. Kurita, J. Ueda, Y. Matsumoto, and T. Ogasawara, "CPG-based manipulation:
Generation of rhythmic finger gaits from human observation,"
in Proceedings of the 2004 IEEE International Conference on Robotics
and Automation, New Orleans, 2004.
[15] G. Alexander, M. DeLong, and P. Strick, "Parallel organization of
functionally segregated circuits linking basal ganglia and cortex,"
Annual Review of Neuroscience, vol. 9, pp. 357-381, 1986.
[16] K. Gurney, T. J. Prescott, and P. Redgrave, "A computational model
of action selection in the basal ganglia. I. A new functional anatomy,"
Biological Cybernetics, vol. 84, pp. 401-410, 2001.
[17] S. Grillner, O. Ekeberg, A. El Manira, A. Lansner, D. Parker, J. Tegner,
and P. Wallen, "Intrinsic function of a neuronal network - a vertebrate
central pattern generator," Brain Research Reviews, vol. 26, pp. 184-197,
1998.
[18] M. MacKay-Lyons, "Central pattern generation of locomotion: a review
of the evidence," Physical Therapy, vol. 82, pp. 69-83, 2002.
[19] J. D. Meier, T. N. Aflalo, S. Kastner, and M. S. Graziano, "Complex
organization of human primary motor cortex: a high-resolution fMRI
study," Journal of Neurophysiology, vol. 100, pp. 1800-1812, 2008.
[20] D. Caligiore, A. Borghi, D. Parisi, and G. Baldassarre, "Affordances
and compatibility effects: a neural-network computational model," in
Connectionist models of behaviour and cognition II: Proceedings of
the 11th Neural Computation and Psychology Workshop, J. Mayor,
N. Ruh, and K. Plunkett, Eds. Singapore: World Scientific, 2009, pp.
15-26.
[21] D. Caligiore, A. M. Borghi, D. Parisi, and G. Baldassarre, "TRoPICALS:
A computational embodied neuroscience model of compatibility effects,"
Psychological Review, vol. 117, pp. 1188-1228, 2010.
[22] D. Caligiore, A. M. Borghi, D. Parisi, R. Ellis, A. Cangelosi, and
G. Baldassarre, "How affordances associated with a distractor object
affect compatibility effects: A study with the computational model
TRoPICALS," Psychological Research, vol. 77, pp. 7-19, 2013.
[23] M. S. A. Graziano, C. S. R. Taylor, and T. Moore, "Complex
movements evoked by microstimulation of precentral cortex," Neuron,
vol. 34, pp. 841-851, 2002.
[24] R. Shadmehr and S. P. Wise, The computational neurobiology of
reaching and pointing: a foundation for motor learning. Cambridge,
MA: The MIT Press, 2005.
[25] C. Prablanc, M. Desmurget, and H. Grea, "Neural control of on-line
guidance of hand reaching movements," Progress in Brain Research,
pp. 155-170, 2003.
[26] J. C. Houk, C. Bastianen, D. Fansler, A. Fishbach, D. Fraser, P. J.
Reber, S. A. Roy, and L. S. Simo, "Action selection and refinement in
subcortical loops through basal ganglia and cerebellum," Philosophical
Transactions of the Royal Society B Biological Sciences, vol. 362, pp.
1573-1583, 2007.
[27] J. C. Houk, "Action selection and refinement in subcortical loops
through basal ganglia and cerebellum," in Modelling Natural Action
Selection, A. K. Seth, T. J. Prescott, and J. J. Bryson, Eds. Cambridge:
Cambridge University Press, 2011.
[28] H. H. Yin and B. J. Knowlton, "The role of the basal ganglia in habit
formation," Nature Reviews Neuroscience, vol. 7, pp. 464-476, 2006.
[29] D. M. Wolpert, R. C. Miall, and M. Kawato, "Internal models in the
cerebellum," Trends in Cognitive Sciences, vol. 2, pp. 338-347, 1998.
[30] D. Caligiore, G. Pezzulo, R. C. Miall, and G. Baldassarre, "The contribution
of brain sub-cortical loops in the expression and acquisition
of action understanding abilities," Neuroscience and Biobehavioral
Reviews, vol. 37, pp. 2504-2515, 2013.
[31] N. Hogan and D. Sternad, "On rhythmic and discrete movements: reflections,
definitions and implications for motor control," Experimental
Brain Research, vol. 181, pp. 13-30, 2007.
[32] J. Mink and W. Thach, "Basal ganglia motor control I: Nonexclusive
relation of pallidal discharge to five movement modes," Journal of
Neurophysiology, vol. 65, pp. 273-300, 1991.
[33] R. Ivry and R. Spencer, "The neural representation of time," Current
Opinion in Neurobiology, vol. 14, pp. 225-232, 2004.
[34] R. B. Ivry, R. M. Spencer, H. N. Zelaznik, and J. Diedrichsen, "The
cerebellum and event timing," Annals of the New York Academy of
Sciences, vol. 978, pp. 302-317, 2002.


[35] S. A. Fontenelle, B. A. Kahrs, S. A. Neal, A. T. Newton, and J. J.


Lockman, Infant manual exploration of composite substrates, Journal
of Experimental Child Psychology, vol. 98, pp. 153167, 2007.
[36] J. J. Gibson, The ecological approach to visual perception. Boston:
Houghton Mifflin, 1979.
[37] E. J. Gibson and A. D. Pick, An ecological approach to perceptual
learning and development, New York: Oxford University Press, 2000.
[38] S. Schaal, P. Mohajerian, and A. Ijspeert, Dynamics systems vs.
optimal control - a unifying view, Progress in Brain Research, vol.
165, pp. 425–445, 2007.
[39] C. Laschi, G. Asuni, E. Guglielmelli, G. Teti, R. Johansson, H. Konosu,
Z. Wasik, M. C. Carrozza, and P. Dario, A bio-inspired predictive
sensory-motor coordination scheme for robot reaching and preshaping,
Autonomous Robots, vol. 25, pp. 85–101, 2008.
[40] R. F. Reinhart and J. J. Steil, Reaching movement generation with
a recurrent neural network based on learning inverse kinematics for
the humanoid robot iCub, in IEEE-RAS International Conference on
Humanoid Robots, 2009, pp. 323–330.
[41] D. Caligiore, E. Guglielmelli, D. Parisi, and G. Baldassarre, A
reinforcement learning model of reaching integrating kinematic and
dynamic control in a simulated arm robot, in IEEE International
Conference on Development and Learning (ICDL2010). Piscataway,
NJ: IEEE, 2010, pp. 211–218.
[42] R. F. Reinhart and J. J. Steil, Neural learning and dynamical selection
of redundant solutions for inverse kinematic control, in IEEE-RAS
International Conference on Humanoid Robots, 2011, pp. 564–569.
[43] D. Caligiore, D. Parisi, and G. Baldassarre, Integrating reinforcement
learning, equilibrium points and minimum variance to understand
the development of reaching: a computational model, Psychological
Review, vol. 121, pp. 389–421, 2014.
[44] D. Caligiore, T. Ferrauto, D. Parisi, N. Accornero, M. Capozza, and
G. Baldassarre, Using motor babbling and Hebb rules for modeling the
development of reaching with obstacles and grasping, in Proceedings
of COGSYS 2008. Karlsruhe, Germany: Springer, 2008, pp. E1–8.
[45] E. Oztop, N. S. Bradley, and M. A. Arbib, Infant grasp learning:
a computational model, Experimental Brain Research, vol. 158, pp.
480–503, 2004.
[46] F. Stulp, E. A. Theodorou, and S. Schaal, Reinforcement learning
with sequences of motion primitives for robust manipulation, IEEE
Transactions on robotics, vol. 28, no. 6, 2012.
[47] B. Castro da Silva, G. Baldassarre, G. Konidaris, and A. Barto,
Learning Parameterized Motor Skills on a Humanoid Robot, 2014, ch.
Proceedings of the 2014 IEEE International Conference on Robotics
and Automation (ICRA 2014). Hong Kong, China.
[48] D. Sternad, H. Marino, S. Charles, M. Duarte, L. Dipietro, and
N. Hogan, Transitions between discrete and rhythmic primitives in a
unimanual task, Frontiers in Computational Neuroscience, 2013.
[Online]. Available:
http://journal.frontiersin.org/article/10.3389/fncom.2013.00090/full
[49] S. Degallier, L. Righetti, S. Gay, and A. Ijspeert, Toward simple control for complex, autonomous robotic applications: combining discrete
and rhythmic motor primitives, Autonomous Robots, vol. 31, pp.
155–181, 2011.
[50] E. Thelen, Kicking, rocking, and waving: contextual analysis of
rhythmical stereotypies in normal human infants, Animal Behavior,
vol. 29, pp. 3–11, 1981.
[51] J. Piaget, The Origins of Intelligence in Children. London: Routledge
and Kegan Paul, 1953.
[52] C. K. Rovee and D. T. Rovee, Conjugate reinforcement of infant
exploratory behavior. Journal of Experimental Child Psychology,
vol. 8, pp. 33–39, 1969.
[53] C. K. Rovee-Collier, M. W. Sullivan, M. Enright, D. Lucas, and J. W.
Fagen, Reactivation of infant memory, Science, vol. 208, pp.
1159–1161, 1980.
[54] F. Guerin, N. Kruger, and D. Kraft, A survey of the ontogeny of tool
use: from sensorimotor experience to planning, IEEE Transactions on
Autonomous Mental Development, vol. 5, pp. 18–45, 2012.
[55] K. Matsuoka, Sustained oscillations generated by mutually inhibiting
neurons with adaptation, Biological Cybernetics, vol. 52, pp. 367–376,
1985.
[56] T. Flash and N. Hogan, The coordination of arm movements: an experimentally confirmed mathematical model. Journal of Neuroscience,
vol. 5, pp. 1688–1703, 1985.
[57] M. A. Smith, J. Brandt, and R. Shadmehr, Motor disorder in Huntington's disease begins as a dysfunction in error feedback control,
Nature, vol. 403, no. 6769, pp. 544–549, 2000.

[58] A. J. Ijspeert, A. Crespi, D. Ryczko, and J.-M. Cabelguen, From
swimming to walking with a salamander robot driven by a spinal cord
model. Science, vol. 315, pp. 1416–1420, 2007.
[59] R. E. Passingham and S. P. Wise, The neurobiology of the prefrontal
cortex: anatomy, evolution, and the origin of insight. Oxford: Oxford
University Press, 2012, no. 50.
[60] S. Tsujimoto, The prefrontal cortex: functional neural development
during early childhood, Neuroscientist, vol. 14, pp. 345–358, 2008.
[61] C. von Hofsten, Eye-hand coordination in newborns, Developmental
Psychology, no. 18, pp. 450–461, 1982.
[62] N. E. Berthier and R. Keen, Development of reaching in infancy.
Experimental Brain Research, vol. 169, pp. 507–518, 2006.
[63] N. Berthier, The syntax of human infant reaching, in Proceedings of
the Eighth International Conference on Complex Systems. NECSI,
2011, pp. 1477–1487.
[64] D. Bullock, S. Grossberg, and F. Guenther, A self-organizing neural
model of motor equivalent reaching and tool use by a multijoint arm,
Journal of Cognitive Neuroscience, vol. 5, pp. 408–435, 1993.
[65] R. Der and G. Martius, From motor babbling to purposive actions:
Emerging self-exploration in a dynamical systems approach to early
robot development, in From Animals to Animats 9. Berlin: Springer,
2006, pp. 406–421.
[66] C. von Hofsten, An action perspective on motor development, Trends
in Cognitive Sciences, vol. 8, pp. 266–272, 2004.
[67] H. Miyamoto, J. Morimoto, K. Doya, and M. Kawato, Reinforcement
learning with via-point representation, Neural Networks, vol. 17, no. 3,
pp. 299–305, 2004.
[68] M. Rolf, J. J. Steil, and M. Gienger, Goal babbling permits direct
learning of inverse kinematics, IEEE Transactions on Autonomous
Mental Development, vol. 2, no. 3, pp. 216–229, Sep 2010.
[69] A. Baranes and P. Oudeyer, Active learning of inverse models with
intrinsically motivated goal exploration in robots, Robotics and Autonomous Systems, vol. 61, no. 1, pp. 49–73, 2013.
[70] N. E. Berthier, Learning to reach: A mathematical model, Developmental Psychology, vol. 32, pp. 811–823, 1996.
[71] N. E. Berthier, M. T. Rosenstein, and A. G. Barto, Approximate
optimal control as a model for motor learning, Psychological Review,
vol. 112, pp. 329–346, 2005.
[72] A. L. Ciancio, L. Zollo, E. Guglielmelli, D. Caligiore, and G. Baldassarre, Hierarchical reinforcement learning and central pattern generators for modeling the development of rhythmic manipulation skills,
IEEE International Conference on Development and Learning and
Epigenetic Robotics, 2011.
[73] D. Ognibene, C. Balkenius, and G. Baldassarre, Integrating epistemic
action (active vision) and pragmatic action (reaching): a neural architecture for camera-arm robots, in From Animals to Animats 10:
Proceedings of the Tenth International Conference on the Simulation
of Adaptive Behavior (SAB2008), M. Asada, J. C. Hallam, J.-A. Meyer,
and J. Tani, Eds. Berlin: Springer Verlag, 2008, pp. 220–229.
[74] R. Marraffa, V. Sperati, D. Caligiore, J. Triesch, and G. Baldassarre,
A bio-inspired attention model of anticipation in gaze-contingency
experiments with infants, in Proceedings of the IEEE International
Conference on Development and Learning and Epigenetic Robotics
(ICDL-EpiRob-2012), J. Movellan and M. Schlesinger, Eds. New
York, NY: IEEE, 2012, pp. e1–8.
[75] D. Ognibene and G. Baldassarre, Ecological active vision: four bioinspired principles to integrate bottom-up and adaptive top-down attention tested with a simple camera-arm robot, IEEE Transactions on
Autonomous Mental Development, vol. 7, no. 1, pp. 3–25, 2015.
[76] G. Schoner, A dynamic theory of coordination of discrete movement,
Biological Cybernetics, vol. 63, pp. 257–270, 1990.
[77] A. De Rugy and D. Sternad, Interaction between discrete and rhythmic
movements: reaction time and phase of discrete movement initiation
during oscillatory movements, Brain Research, vol. 994, pp. 160–174,
2003.
[78] S. Schaal, S. Kotosaka, and D. Sternad, Nonlinear dynamical systems as movement primitives, in IEEE International Conference on
Humanoid Robotics, 2000, pp. 1–11.
[79] D. Bullock and S. Grossberg, VITE and FLETE: neural modules for
trajectory formation and postural control. in Volitional Action, W. Hershberger, Ed. Amsterdam: Elsevier, 1989, pp. 253–298.
[80] S. H. Strogatz, Nonlinear Dynamics and Chaos. Addison-Wesley, 1994.
[81] J. Ernesti, L. Righetti, M. Do, T. Asfour, and S. Schaal, Encoding
of periodic and their transient motions by a single dynamic movement primitive, in IEEE-RAS International Conference on Humanoid
Robots, 2012, pp. 57–64.

[82] J. Kober and J. Peters, Learning motor primitives for robotics, in
IEEE International Conference on Robotics and Automation 2009
(ICRA09). IEEE, 2009, pp. 2112–2118.
[83] A. Ijspeert, J. Nakanishi, and S. Schaal, Learning attractor landscapes
for learning motor primitives, 2002, vol. 15, pp. 1523–1530.
[84] A. L. Ciancio, L. Zollo, G. Baldassarre, D. Caligiore, and
E. Guglielmelli, The role of learning and kinematic features in
dexterous manipulation: a comparative study with two robotic hands,
International Journal of Advanced Robotic Systems, vol. 10, pp. E1–21,
2013.
[85] D. Caligiore, M. Mirolli, D. Parisi, and G. Baldassarre, A bioinspired
hierarchical reinforcement learning architecture for modeling learning
of multiple skills with continuous state and actions, in Proceedings of the Tenth International Conference on Epigenetic Robotics
(EpiRob2010), B. Johansson, E. Sahin, and C. Balkenius, Eds. Lund,
Sweden: Lund University Cognitive Studies, 2010, pp. 27–34.
[86] D. Caligiore, P. Tommasino, V. Sperati, and G. Baldassarre, Modular
and hierarchical brain organization to understand assimilation, accommodation and their relation to autism in reaching tasks: a developmental
robotics hypothesis, Adaptive Behavior, vol. 22, pp. 304–329, 2014.
[87] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.
Cambridge MA, USA: The MIT Press, 1998.
[88] L. Sciavicco and B. Siciliano, Modelling and control of robot manipulators. Berlin, Germany: Springer-Verlag, 2000.
[89] A. G. Feldman and M. F. Levin, The origin and use of positional
frames of reference in motor control, Behavioral and Brain Sciences,
vol. 18, pp. 723–744, 1995.
[90] P. Bendahan and P. Gorce, A neural network architecture to learn arm
motion planning in grasping tasks with obstacles avoidance, Robotica,
vol. 24, pp. 197–204, 2006.
[91] P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner, Intrinsic motivation
systems for autonomous mental development, IEEE Transactions on
Evolutionary Computation, vol. 11, pp. 265–286, 2007.
[92] F. Stulp and O. Sigaud, Policy improvement methods: between
black-box optimization and episodic reinforcement learning, Robotics
and Computer Vision, ENSTA-Paris Tech, Paris, Hal Archive hal-00738463, 2012.
[93] F. Stulp and O. Sigaud, Robot skill learning: From reinforcement
learning to evolution strategies, Paladyn, Journal of Behavioral
Robotics, vol. 4, no. 1, pp. 49–61, 2013.
[94] J. Kober and J. Peters, Policy search for motor primitives in robotics,
Machine Learning, vol. 84, pp. 171–203, 2011.
[95] E. Theodorou, J. Buchli, and S. Schaal, A generalized path integral
control approach to reinforcement learning, Journal of Machine Learning Research, vol. 11, pp. 3137–3181, 2010.
[96] R. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient
methods for reinforcement learning with function approximation, in
Advances in Neural Information Processing Systems, 2000, no. 12, pp.
1057–1063.
[97] J. Peters and S. Schaal, Natural actor-critic, Neurocomputing, vol. 71,
pp. 1180–1190, 2008.
[98] F. Stulp and O. Sigaud, Path integral policy improvement with
covariance matrix adaptation, in Proceedings of the 29th International
Conference on Machine Learning, Edinburgh, UK, 26/06–01/07/2012,
2012.
[99] G. Sandini, G. Metta, and D. Vernon, The iCub cognitive humanoid
robot: an open-system research platform for enactive cognition, in 50
Years of Artificial Intelligence. Essays Dedicated to the 50th Anniversary of Artificial Intelligence, M. Lungarella, F. Iida, J. Bongard, and
R. Pfeifer, Eds. Berlin: Springer-Verlag, 2007, vol. 4850.
[100] V. Tikhanoff, A. Cangelosi, P. Fitzpatrick, G. Metta, L. Natale, and
F. Nori, An open-source simulator for cognitive robotics research:
the prototype of the iCub humanoid robot simulator, in Proceedings
of IEEE Workshop on Performance Metrics for Intelligent Systems
Workshop, R. Madhavan and E. R. Messina, Eds. Washington, DC:
IEEE, 2008, pp. 57–61.
[101] V. Sukhoy, J. Sinapov, L. Wu, and A. Stoytchev, Learning to press
doorbell buttons. in IEEE 9th International Conference on Development and Learning (ICDL-2010), B. Kuipers, T. Shultz, A. Stoytchev,
and C. Yu, Eds. New York, NY: IEEE, 2010, pp. 132–139.
[102] F. Taffoni, M. Vespignani, D. Formica, G. Cavallo, E. Polizzi Di Sorrentino, G. Sabbatini, V. Truppa, M. Mirolli, G. Baldassarre, E. Visalberghi, and F. Keller, A mechatronic platform for behavioral analysis
on nonhuman primates, Journal of Integrative Neuroscience, vol. 11,
no. 1, 2012.
[103] E. Polizzi di Sorrentino, G. Sabbatini, V. Truppa, A. Bordonali,
F. Taffoni, D. Formica, G. Baldassarre, M. Mirolli, E. Guglielmelli, and
E. Visalberghi, Exploration and learning in capuchin monkeys (sapajus
spp.): the role of action-outcome contingencies, Animal Cognition,
2014.
[104] D. Corbetta, S. L. Thurman, R. F. Wiener, Y. Guan, and J. L. Williams,
Mapping the feel of the arm with the sight of the object: on the
embodied origins of infant reaching, Frontiers in Psychology, vol. 5,
p. 576, 2014.
[105] J. Peters and S. Schaal, Reinforcement learning of motor skills with
policy gradients, Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.
[106] M. Lukosevicius and H. Jaeger, Reservoir computing approaches
to recurrent neural network training, Computer Science Review,
vol. 3, no. 3, pp. 127–149, Aug 2009. [Online]. Available:
http://dx.doi.org/10.1016/j.cosrev.2009.03.005
[107] F. Mannella and G. Baldassarre, Selection of cortical dynamics for
motor behaviour by the basal ganglia, Biological Cybernetics, in press.
[108] D. Ognibene, A. Rega, and G. Baldassarre, A model of reaching that
integrates reinforcement learning and population encoding of postures,
in From Animals to Animats 9: Proceedings of the Ninth International
Conference on the Simulation of Adaptive Behavior (SAB2006), ser.
Lecture Notes in Artificial Intelligence, S. Nolfi, G. Baldassarre,
R. Calabretta, J. Hallam, D. Marocco, J.-A. Meyer, O. Miglino, and
D. Parisi, Eds. Berlin: Springer Verlag, 2006, vol. 4095, pp. 381–393,
Rome, Italy, 25-29 September 2006.
[109] F. Mannella and G. Baldassarre, A neural-network reinforcement-learning
model of domestic chicks that learn to localize the centre
of closed arenas, Philosophical Transactions of the Royal Society B:
Biological Sciences, vol. 362, no. 1479, pp. 383–401, 2007.
[110] G. Baldassarre, F. Mannella, V. G. Fiore, P. Redgrave, K. Gurney,
and M. Mirolli, Intrinsically motivated action-outcome learning and
goal-based action recall: A system-level bio-constrained computational
model. Neural Networks, vol. 41, pp. 168–187, 2013.
[111] V. G. Fiore, V. Sperati, F. Mannella, M. Mirolli, K. Gurney, K. Friston,
R. J. Dolan, and G. Baldassarre, Keep focussing: striatal dopamine
multiple functions resolved in a single mechanism tested in a simulated
humanoid robot, Frontiers in Psychology Cognitive Science, vol. 5,
no. 124, pp. e1–17, 2014.
[112] M. Kuperstein, Neural model of adaptive hand-eye coordination for
single postures. Science, vol. 239, no. 4845, pp. 1308–1311, Mar 1988.
[113] M. Butz, O. Herbort, and J. Hoffmann, Exploiting redundancy for
flexible behavior: unsupervised learning in a modular sensorimotor
control architecture, Psychological Review, vol. 114, pp. 1015–1046,
2007.
[114] O. Herbort, D. Ognibene, M. Butz, and G. Baldassarre, Learning
to select targets within targets in reaching tasks, in The 6th IEEE
International Conference on Development and Learning (ICDL2007),
M. D. Demiris Yiannis, Scassellati Brian, Ed. London: Imperial
College, 2007, pp. e1–6.
[115] G. Baldassarre and M. Mirolli, Eds., Intrinsically motivated learning
in natural and artificial systems. Berlin: Springer, 2013.
[116] S. Thrun and T. M. Mitchell, Lifelong robot learning. Springer, 1995.
[117] G. Baldassarre and M. Mirolli, Eds., Computational and Robotic
Models of the Hierarchical Organisation of Behaviour. Berlin:
Springer-Verlag, 2013.
[118] A. G. Barto and S. Mahadevan, Recent advances in hierarchical
reinforcement learning, Discrete Event Dynamic Systems, vol. 13,
no. 4, pp. 341–379, 2003.
[119] G. Baldassarre, Planning with neural networks and reinforcement
learning, PhD Thesis, Computer Science Department, University of
Essex, Colchester, UK, 2002.
[120] G. Baldassarre, Forward and bidirectional planning based on reinforcement
learning and neural networks in a simulated robot, in Anticipatory
behaviour in adaptive learning systems, ser. Lecture Notes in Artificial
Intelligence, M. Butz, O. Sigaud, and P. Gérard, Eds. Berlin: Springer
Verlag, 2003, vol. 2684, pp. 179–200.
[121] K. Seepanomwan, D. Caligiore, G. Baldassarre, and A. Cangelosi,
Modelling mental rotation in cognitive robots, Adaptive Behavior,
vol. 21, no. 4, pp. 299–312, 2013.
[122] R. Lober, V. Padois, and O. Sigaud, Multiple task optimization using
dynamical movement primitives for whole-body reactive control, in
IEEE-RAS International Conference on Humanoid Robots, Madrid,
Spain, 18-20/11/2014, 2014.
[123] J. Schmidhuber, Formal theory of creativity, fun, and intrinsic motivation (1990-2010), IEEE Transactions on Autonomous Mental Development, vol. 2, no. 3, pp. 230–247, 2010.
[124] G. Baldassarre, What are intrinsic motivations? a biological perspective, in Proceedings of the International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob-2011),
A. Cangelosi, J. Triesch, I. Fasel, K. Rohlfing, F. Nori, P.-Y. Oudeyer,
M. Schlesinger, and Y. Nagai, Eds. New York, NY: IEEE, 2011, pp.
E1–8, Frankfurt am Main, Germany, 24–27/08/11.
[125] V. G. Santucci, G. Baldassarre, and M. Mirolli, Which is the best
intrinsic motivation signal for learning multiple skills? Frontiers in
Neurorobotics, vol. 7, pp. e1–14, 2013.
[126] A. Barto, M. Mirolli, and G. Baldassarre, Novelty or surprise?
Frontiers in Psychology Cognitive Science, vol. 4, no. 907, pp. e1–15,
2013.
[127] M. Mirolli, V. G. Santucci, and G. Baldassarre, Phasic dopamine
as a prediction error of intrinsic and extrinsic reinforcement driving
both action acquisition and reward maximization: A simulated robotic
study, Neural Networks, vol. 39, pp. 40–51, 2013.
[128] R. Keen, The development of problem solving in young children: a
critical cognitive skill. Annual Review of Psychology, vol. 62, pp.
1–21, 2011.

Valerio Sperati (valerio.sperati@istc.cnr.it;
http://www.istc.cnr.it/people/valerio-sperati): In
2006 he received a Master Degree in Psychology
at the University of Rome La Sapienza, Italy. Since
2007 he has been a Research Fellow with the Institute
of Cognitive Sciences and Technologies, Italian
National Research Council (ISTC-CNR), Rome,
Italy. Since 2012 he has been a PhD student in Computer
Science with the University of Plymouth, UK (with
the ISTC-CNR Rome node). He has participated in
various European projects in the field of embodied
cognition and developmental robotics: ECAgents – Embodied Cognitive
Agents; Swarmanoid – Towards Humanoid Robotic Swarms; IM-CLeVeR –
Intrinsically Motivated Cumulative Learning Versatile Robots. His research
interests are swarm robotics, bio-inspired artificial vision and attention
control, and intrinsic motivations applied to developmental robotics. He has
produced about 10 international peer-reviewed publications.

Valentina Cristina Meola
(valentyna1390@hotmail.it): In 2011 and 2013 she
received the BS and MS degrees in Biomedical
Engineering at the University Campus Bio-Medico,
Roma, Italy. During 2013-2014 she was a Research
Fellow with the Institute of Cognitive Sciences and
Technologies, National Research Council (ISTC-CNR),
Rome, Italy, contributing to the research
presented in this paper. Since 2014 she has been a Quality
Engineer at the Flextronics company. Her research
interests include computational neuroscience,
reinforcement learning, and bio-inspired robotics and control.

Loredana Zollo (l.zollo@unicampus.it;
http://www.biorobotics): Loredana Zollo received
the Laurea degree in Electronic Engineering from
the University of Naples in October 2000 and
the Research Doctorate degree in Bioengineering
from the Scuola Superiore Sant'Anna of Pisa in
May 2004. In 2008, she was appointed assistant
professor of Bio-engineering at Università Campus
Bio-Medico di Roma, Laboratory of Biomedical
Robotics and Biomicrosystems (led by Prof.
Eugenio Guglielmelli). She has a faculty position
(tenured) in Biomedical Robotics and Rehabilitation Bio-engineering in the
same university. Her research interests are in the fields of rehabilitation
robotics, assistive robotics and neuro-robotics on the following research
topics: kinematic and dynamic analysis of robot manipulators, design and
development of control schemes for robot manipulators with elastic joints
and flexible links, design and development of control schemes of interaction
robotic machines for rehabilitation robotics, biological motor control and
neurophysiological models of sensory-motor coordination, design and
development of motor control schemes for bio-inspired robotic systems,
multi-sensory integration and sensory-motor coordination of anthropomorphic
robotic systems. She is a member of the editorial board as Associate Editor
of the IEEE Robotics and Automation Magazine. She has been co-chair of
the IEEE Robotics and Automation's Technical Committee on Rehabilitation
and Assistive Robotics. She is an expert and reviewer for the EU FP7 research
program, and has been involved in many EU-funded and national projects in
her application fields. She has authored/co-authored about 80 peer-reviewed
publications that appeared in international journals, books and conference
proceedings.

Daniele Caligiore (daniele.caligiore@istc.cnr.it;
http://www.istc.cnr.it/people/daniele-caligiore):
He received a Master Degree in Electronics
Engineering at the University of Catania (Italy) in
2003, and a PhD in Biomedical Engineering at the
University Campus Bio-Medico di Roma (Italy)
in 2011. During his PhD he was a visiting scholar
at the Centre for Robotics and Neural Systems
and at the School of Psychology (University of
Plymouth, UK) to collaborate with Prof. Angelo
Cangelosi and with Prof. Rob Ellis, and at the
Embodied Cognition Lab (University of Bologna, Italy) to collaborate
with Prof. Anna M. Borghi. Since 2004 he has been a Research Fellow with
the Institute of Cognitive Sciences and Technologies, National Research
Council (ISTC-CNR, Rome, Italy). He has participated in several European
projects in the field of embodied cognition and developmental robotics:
MindRACES – from Reactive to Anticipatory Cognitive Embodied Systems;
ROSSI – Emergence of communication in RObots through Sensorimotor
and Social Interaction; IM-CLeVeR – Intrinsically Motivated Cumulative
Learning Versatile Robots. His research interests include motor development,
embodied cognition, system-level computational neuroscience, brain cortical
and sub-cortical hierarchies, reinforcement learning and Hebbian learning.
He has authored/co-authored about 70 peer-reviewed publications that appeared
in international journals, books and conference proceedings. Recently, he
was guest editor for a special issue of the journal Psychological Research
titled "Vision, action and language unified through embodiment" and for a
consensus paper of the journal Cerebellum titled "Towards a systems-level
view of cerebellar function: the interplay between cerebellum, basal ganglia
and cortex".

Anna Lisa Ciancio (a.ciancio@unicampus.it;
http://www.biorobotics.it): In 2010 she received
a Master Degree in Biomedical Engineering at
Università Campus Bio-Medico di Roma (Rome,
Italy). In April 2014 she received the Ph.D. degree
in Biomedical Engineering at the same university.
During her PhD she was a visiting student at the
Institut des Neurosciences et de la Cognition
(Université Paris Descartes, FR) and collaborated
with Prof. Marc Maier. Since April 2014 she
has been a Postdoctoral fellow with the Laboratory of
Biomedical Robotics and Biomicrosystems. Her research interests are
focused on the bio-inspired control of anthropomorphic robotic hands.

Fabrizio Taffoni (f.taffoni@unicampus.it;
http://www.biorobotics.it): In 2009 he received
the Ph.D. degree in biomedical engineering from
the Università Campus Bio-Medico di Roma
(Rome, Italy). From 2009 to 2012, he was
a Postdoctoral fellow with the Laboratory of
Biomedical Robotics and Biomicrosystems at
Università Campus Bio-Medico di Roma, where
he has served as tenure-track assistant professor of
Bioengineering since 2012. His research interests,
at the intersection of developmental neuroscience,
bioengineering and mechatronics, are focused on the design and development
of new platforms, tools, and methods for ecological assessment of motor
development.

Gianluca Baldassarre (gianluca.baldassarre@istc.cnr.it;
http://www.istc.cnr.it/people/gianlucabaldassarre):
In 1998 he received a BA and MA in Economics
at the University of Rome La Sapienza. In
1999 he received an MSc in Cognitive Psychology
and Neural Networks at the same university. In
2003 he received a PhD in Computer Science at the
University of Essex, UK (research on Planning
with Neural Networks). He then did a postdoc
at the Italian Institute of Cognitive Sciences and
Technologies, National Research Council, Rome, working on swarm robotics.
Since 2006 he has been a Researcher at the same Institute, where he coordinates the
Research Group called LOCEN – Laboratory of Computational Embodied
Neuroscience. In 2006-2009 he was Team Leader of the EU project ICEA –
Integrating Cognition Emotion and Autonomy, and in 2009-2013 he
was the Coordinator of the European Integrated Project IM-CLeVeR –
Intrinsically-Motivated Cumulative-Learning Versatile Robots. His research
interests are cumulative learning of multiple sensorimotor skills driven
by extrinsic and intrinsic motivations in animals, humans, and robots,
and the brain/robot architectures that support it. He studies these topics with
LOCEN by following two synergistic approaches: (a) with computational
models constrained by data on brain and behaviour, aiming to understand
the latter; (b) with machine-learning/robotic approaches, aiming to
produce technologically useful robots. In his research these two approaches
have a strong interdisciplinary cross-talk, as animals' embodiment and their
interaction with the world, which can be studied with robots, are seen
as critical elements for the evolution and development of their brain and
intelligence, and as animals' brain and behaviour are invaluable sources of
novel ideas and solutions for autonomous robotics problems. He has over
100 international peer-reviewed publications.

Eugenio Guglielmelli (e.guglielmelli@unicampus.it;
http://www.biorobotics.it): He received the M.Sc. in
Electronic Engineering and the PhD in Electronics,
Computer Science and Telecommunications from
the University of Pisa, Italy, in 1991 and in 1995
respectively. He is Professor of Bioengineering at
Università Campus Bio-Medico di Roma (Roma,
Italy) where he serves as Pro-Rector for Research,
and as the Director and Founder of the Laboratory
of Biomedical Robotics and Biomicrosystems. From
2009 to 2013, he served as Director of Studies at the School of Engineering
of the same university. From 1991 to 2004 he was at the Scuola Superiore
Sant'Anna (Pisa, Italy), where he served (2002-2004) as the Coordinator of
the Advanced Robotics Technology & Systems Laboratory (ARTS Lab). His
main current research interests are in the fields of human-centred robotics,
biomechatronic design and biomorphic control of robotic systems, and in
their application to rehabilitation, personal assistance and neurorobotics.
He is principal investigator/partner of several national, European and
international projects in the area of biomedical robotics, rehabilitation
engineering and prosthetics. He is author/co-author of more than 250 papers
that appeared in peer-reviewed international journals, conference proceedings
and books. He is co-inventor of 4 patents and co-founder of 4 start-up
companies in the field of biomedical technologies. He currently serves as
Editor-in-Chief of the IEEE Robotics and Automation Magazine (RAM), and
as Editor-in-Chief of the Springer Series on Biosystems and Biorobotics. He
served (2009-2013) as Associate Editor of the IEEE Transactions on Robotics
(T-RO), and of the CRC International Journal on Applied Bionics and
Biomechanics. He is a Senior Member of the IEEE Robotics & Automation
Society, where he served as Associate Vice-President for Technical Activities
and for Membership activities. He is Emeritus Co-chair and Founder of
the IEEE-RAS TC on Rehabilitation and Assistive Robotics. He serves as
reviewer and evaluator of scientific projects for the European Commission,
and for several other international research agencies. He currently serves
as the Delegate for the European FET Flagships projects on behalf of the
Italian Ministry of Education, University and Research.
