
Two Examples of Active Categorisation Processes Distributed Over Time

Tomassino Ferrauto

Elio Tuci

Marco Mirolli

Gianluca Massera

Stefano Nolfi ISTC-CNR, Via San Martino della Battaglia, 44, 00185 Rome, Italy {tomassino.ferrauto, elio.tuci, marco.mirolli, gianluca.massera, stefano.nolfi}@istc.cnr.it

Abstract

Active perception refers to a theoretical approach grounded on the idea that perception is an active process in which the actions performed by the agent play a constitutive role. In this paper we present two different scenarios in which we test active perception principles using an evolutionary robotics approach. In the first experiment, a robotic arm equipped with coarse-grained tactile sensors is required to perceptually categorize spherical and ellipsoid objects. In the second experiment, an active vision system has to distinguish between five different kinds of images of different sizes. In both situations the best individuals develop a close to optimal ability to discriminate different objects/images as well as an excellent ability to generalize their skills in new circumstances. Analyses of the evolved behaviours show that agents solve their tasks by actively selecting relevant information and by integrating this information over time.

1 Introduction

Traditionally, Cognitive Science and Artificial Intelligence tended to view intelligence as the result of a chain of three information processing systems, constituted by perception, cognition, and action. According to this view, the perception system operates by transforming the information gathered from the external world (sensations) into internal representations of the environment itself. The cognitive system operates by transforming these internal representations into plans (i.e. strategies for achieving certain goals in certain contexts). Finally, the action system transforms plans into sequences of motor acts. This is what Susan Hurley has labelled the “Cognitive Sandwich” view of intelligence (Hurley, 1998), according to which perception and action are considered peripheral processes, separated from each other and from cognition, which represents the central core of intelligence.

The severe criticisms raised against this general view during the last two decades, however, led to the development of an alternative framework according to which perception, action, and cognition are deeply intermingled processes that cannot be studied in isolation (Clark, 1997; Pfeifer and Scheier, 1999). According to this view, behaviour and cognition should be conceptualised as dynamical processes that arise from the continuous interactions occurring between the agent and the environment (van Gelder, 1998; Beer, 2000).

This new view of cognition also led to a new approach to categorisation. Categorisation is one of the most fundamental cognitive capacities displayed by natural organisms, being an important prerequisite for the exhibition of several other cognitive skills (Harnad, 1987): for example, it is involved in any task that calls for differential responding, from operant discrimination to pattern recognition to naming and describing objects and states-of-affairs. The “Cognitive Sandwich” view of intelligence tends to look at categorisation by focusing on processes that are passive (i.e., the agents cannot influence their sensory states through their actions) and instantaneous (i.e., the agents are required to categorise their current sensory state rather than a sequence of sensory states distributed over a certain time period). The new paradigm for the study of cognition mentioned above demands that we look at categorisation processes that are “active” and possibly distributed over time.

Active perception can be studied by exploiting the properties of autonomous embodied and situated agents, in which perception is strongly influenced by the agent's actions (on this issue, see also Gibson, 1977; Noë, 2004). Nevertheless, our ability to build artificial systems that are able to exploit sensory-motor coordination is still very limited. This can be explained by considering that, from the point of view of the designer of the robot, identifying the way in which the robot should interact with the environment in order to sense sensory states that might facilitate perception is extremely difficult. One promising approach, in this respect, is

constituted by adaptive methods in which the robots are left free to determine how they interact with the environment (i.e. how they behave in order to solve their task). Several works have successfully employed such methods for the control of embodied agents in categorisation tasks. For example, the works described in (Nolfi, 2002) and in (Beer, 2003) demonstrate how categorisation can emerge from the dynamical interaction between the agent and the environment. Other works have shown how an active perception system can act in order to perceive discriminating stimuli that greatly simplify the discrimination task (see, for example, Scheier et al., 1998; Nolfi and Marocco, 2002). In some cases, however, sensory-motor coordination is not sufficient to experience well differentiated sensory patterns for different categories. In these circumstances the agents are thus required to integrate “ambiguous” sensory-motor states over time. So far, only a few studies have shown evolved agents that are able to cope with this kind of problem (e.g. Gigliotta and Nolfi, 2008; Tuci et al., 2004).

This paper presents two experiments that aim to extend the current state of the art to more complex scenarios. We designed both experiments so as to make active perception necessary: in particular, the perceptual systems of our agents are too poor and the categorisation tasks too complex for a passive perception strategy to work. Indeed, both categorization tasks require not only that the agent actively selects its own perceptual stimuli, but also that it integrates perceptual information over time, since no single discriminative stimulus can be found. The first experiment consists of a simulated anthropomorphic robotic arm with coarse-grained tactile sensors that is asked to discriminate between spherical and ellipsoid objects. The high number of Degrees of Freedom (DoFs), the necessity to master the effects of gravity, inertia, and collisions, and the high similarity between the two objects make this task significantly more difficult than those in the literature. The second experiment consists of an active vision system that has to correctly recognise five different letters of different sizes. In this case the difficulty lies in the number of categories (almost all previous works use only two classes) and in the variability within elements of the same category. Although the two setups are quite different, we show that the principles that underlie the behaviour of successful agents are the same in both cases. In particular, successful agents are able to obtain close to optimal performance by (a) actively selecting sensory stimuli so as to reduce perceptual ambiguities as much as possible, and (b) integrating perceived sensory-motor states over time.


Figure 1: The simulated robotic arm (a) in position A, and (b) in position B. The kinematic chain (c) of the arm, and (d) of the hand. In (c) and (d), cylinders represent rotational DoFs; the axes of cylinders indicate the corresponding axis of rotation; the links among cylinders represent the rigid connections that make up the arm structure. T_i with i = 1, ..., 10 are the tactile sensors.

2 Experiment 1

2.1 Methods

The first experimental setup consists of a simulated anthropomorphic robotic arm and hand with tactile sensors which is asked to discriminate between spherical and ellipsoid objects (see Fig. 1a and 1b). The experiment presented here is a continuation of the work described in Tuci et al. (2009): please refer to that paper for additional information. The robot and the robot/environment interactions are simulated using Newton Game Dynamics (NGD), a library for accurately simulating rigid body dynamics and collisions (www.newtondynamics.com). The arm has 7 actuated DoFs while the hand has 20 actuated DoFs. Fig. 1c shows the kinematic chain for the arm, the forearm and the wrist, with labels from J1 to J7 indicating rotational joints with the rotation axis along the axis of the corresponding cylinder. The robotic hand is composed of a palm and fourteen phalangeal segments that make up the digits (two for the thumb and three for each of the other four fingers), connected through 15 joints with 20 DoFs (see Fig. 1d). (See Massera et al., 2007, for a detailed description of the structural properties of the arm.) Tactile sensors (indicated by the labels T1 to T10 in Fig. 1d) return 1 if the corresponding part of the hand is in contact with any other body (e.g., the table, the sphere, the ellipsoid, or other parts of the

arm), 0 otherwise.

The agent controller consists of a continuous time recurrent neural network (CTRNN, see Beer and Gallagher, 1992) with 22 sensory neurons, 8 internal neurons, 16 motor neurons, and 2 categorization neurons. The first 7 input neurons are updated on the basis of the state of the proprioceptive sensors on joints J1 to J7 respectively (angles are linearly scaled into the range [−1, 1]), 10 further input neurons are updated according to the state of tactile sensors T1 to T10 respectively, and the remaining 5 input neurons are updated on the basis of the state of the hand proprioceptive sensors on joints J8 to J12 respectively (angles are linearly scaled into the range [0, 1], with 0 for a fully extended and 1 for a fully flexed finger). In order to take into account the fact that sensors are noisy, 5% uniform noise is added to the proprioceptive sensors, while tactile sensors have a 5% probability of returning the wrong value. For all input neurons the activation value is computed by multiplying the corresponding sensory input by a gain factor g.

Internal neurons are fully connected to each other, and each receives one incoming synapse from each sensory neuron. Each motor and categorization neuron receives one incoming synapse from each internal neuron, while there are no direct connections between sensory and motor neurons. The state of internal, motor and categorization neurons is updated using the following equations:

\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (1)

\tau_i \dot{y}_i = -y_i + \sum_{j \in N_i} \omega_{ji}\, \sigma(y_j + \beta_j) \qquad (2)

where y_i is the state of neuron i, σ(y_j + β_j) is the output of neuron j, and N_i is the set of indices of the neurons with a connection to neuron i. All time constants τ_i, biases β_i, network connection weights ω_ij, and all the input gains are genetically specified network parameters. There is one single bias for all the sensory neurons.

The activation values of the motor neurons determine the state of the simulated muscles of the arm. Each joint in the arm is moved by an antagonistic pair of muscles, so two neural outputs are associated with each joint (14 neurons in total). For a complete description of the muscle model used in this work, see Massera et al. (2007). The joints of the hand are actuated by a limited number of independent variables through velocity-proportional controllers: the neural network has 2 output neurons for hand movements, one to set all desired thumb angles, the other to set the desired angles for all other fingers. The DoFs relative to joints J9 to J12 are not actuated. Finally, the activation values of the two categorization neurons are used to categorize the shape of the object (see below).
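As an illustration, the following is a minimal sketch (in Python, not the authors' code) of how the internal-neuron states of Equation 2 could be integrated with a forward-Euler step; the function names, the way the sensory drive enters the sum, and the 0.01 s step size (4 s over 400 steps) are our assumptions.

```python
import numpy as np

def sigmoid(x):
    # Equation 1: logistic activation function
    return 1.0 / (1.0 + np.exp(-x))

def ctrnn_step(y, sensors, W_in, W_rec, tau, bias, gains, dt=0.01):
    """One forward-Euler integration step of Equation 2 (hypothetical helper).

    y       : states of the internal neurons
    sensors : raw sensor values (proprioceptive and tactile)
    W_in    : sensory-to-internal weights; W_rec : internal-to-internal weights
    tau     : time constants; bias : biases; gains : input gain factors
    """
    # Sensory activations: sensor values multiplied by the genetically
    # specified gain factors (how they enter the sum is simplified here).
    s = gains * sensors
    # tau_i * dy_i/dt = -y_i + sum_j w_ji * sigma(y_j + beta_j) + sensory drive
    dy = (-y + W_rec @ sigmoid(y + bias) + W_in @ s) / tau
    return y + dt * dy
```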


Figure 2: (a) The sphere and the ellipsoid of the first experiment viewed from above and (b) from west. The radius of the sphere is 2.5 cm. The radii of the ellipsoid are 2.5, 3.0 and 2.5 cm. In (a) the arrows indicate the intervals within which the initial rotation of the ellipsoid is set in different trials.

A generational genetic algorithm is employed to set the parameters of the networks (see Goldberg, 1989; Nolfi and Floreano, 2000). The initial population contains 100 genotypes, represented as vectors of 420 parameters, each encoded with 16 bits. Generations following the first one are produced by a combination of selection with elitism and mutation: for each new generation, the 20 highest scoring individuals (“the elite”) from the previous generation are retained unchanged, while the remainder of the new population is generated by making 4 mutated copies of each of the 20 highest scoring individuals, with a 1.5% mutation probability per bit.
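The selection scheme just described can be summarised with a short sketch; this is a minimal illustration under the parameters given above (100 genotypes of 420 × 16 bits, 20 elites, 4 mutated copies each, 1.5% per-bit mutation), not the authors' implementation, and `evaluate` is a hypothetical placeholder.

```python
import random

POP_SIZE, N_PARAMS, BITS = 100, 420, 16
GENOME_LEN = N_PARAMS * BITS
N_ELITE, N_COPIES, MUT_RATE = 20, 4, 0.015

def mutate(genome):
    # Flip each bit independently with 1.5% probability
    return [b ^ (random.random() < MUT_RATE) for b in genome]

def next_generation(population, fitnesses):
    # Keep the 20 best genotypes unchanged and fill the rest of the
    # population with 4 mutated copies of each of them (20 + 20*4 = 100).
    ranked = [g for _, g in sorted(zip(fitnesses, population),
                                   key=lambda p: p[0], reverse=True)]
    elite = ranked[:N_ELITE]
    offspring = [mutate(g) for g in elite for _ in range(N_COPIES)]
    return elite + offspring

# Hypothetical usage, assuming an evaluate() that decodes the 420
# 16-bit parameters into a controller and returns its fitness:
# population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
#               for _ in range(POP_SIZE)]
# for generation in range(500):
#     fitnesses = [evaluate(g) for g in population]
#     population = next_generation(population, fitnesses)
```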

During evolution, each genotype is translated into an arm controller and evaluated 8 times in position A and 8 times in position B (see Fig. 1); for each position, the arm experiences the ellipsoid 4 times and the sphere 4 times. Moreover, the rotation of the ellipsoid with respect to the z-axis is randomly set within different ranges in different trials (see Fig. 2a). At the beginning of each trial, the arm is located in the corresponding initial position (i.e., A or B), and the state of the neural controller is reset. It is then left free to interact with the object (e.g. by sliding the hand above it so as to make it slightly roll) for 4 simulated seconds (400 time steps), but the trial is terminated earlier in case the object falls off the table.

In each trial, an agent is rewarded by an evaluation function that seeks to assess its ability to recognise and distinguish the ellipsoid from the sphere. Rather than imposing a representation scheme in which different categories are associated with a priori determined states of the categorization neurons, we leave the robot free to determine how to communicate the result of its decision, while requiring that the object categories are well represented in the categorization-output space. More precisely, at each time step, the output of the two categorization neurons is a point in the two-dimensional Cartesian space C = [0, 1] × [0, 1]. Given a set of such points, one can build the AABB (Axis-Aligned Bounding Box), which is the minimum rectangle containing all points in the set such that its edges are parallel to the coordinate axes. The idea is to score agents on the basis of the extent to which the AABBs associated with different categories are non-overlapping. During each trial, we collect the categorization output produced by the agent during the last 20 steps. We define the sphere category (referred to as C_S) as the minimum bounding box of all the categorization outputs collected while the agent was interacting with the sphere, and the ellipsoid category (referred to as C_E) as the minimum bounding box of all the categorization outputs collected while the agent was interacting with the ellipsoid. The final fitness F_F attributed to an agent is the sum of two fitness components: F_1 rewards the robots for touching the objects, and is based on the average distance over a set of 16 trials between the hand and the experienced object; F_2 rewards the robots for developing an unambiguous category representation scheme on the basis of the positions of C_S and C_E in the two-dimensional output space. F_1 and F_2 are computed as follows:

F_1 = \frac{1}{16} \sum_{k=1}^{16} \left( 1 - \frac{d_k}{d_{max}} \right) \qquad (3)

F_2 = \begin{cases} 0 & \text{if } F_1 \neq 1 \\ 1 - \dfrac{area(C_S \cap C_E)}{\min\{area(C_S),\, area(C_E)\}} & \text{if } F_1 = 1 \end{cases} \qquad (4)

where d_k is the Euclidean distance between the object and the centre of the palm at the end of trial k, and d_max is the maximum possible distance between the palm and the object when the latter is located on the table. Note that F_2 = 1 if C_S and C_E do not overlap (i.e., if C_S ∩ C_E = ∅).
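As an illustration, the F_2 component of Equation 4 can be sketched as follows (assuming F_1 = 1); the helper names are ours, and the handling of degenerate (zero-area) boxes is an assumption not specified in the text.

```python
def aabb(points):
    # Axis-aligned bounding box of a set of 2D categorization outputs
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def overlap_area(a, b):
    # Area of the intersection of two AABBs (0 if they do not overlap)
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def f2(sphere_outputs, ellipsoid_outputs):
    # Equation 4 (assuming F1 = 1): 1 minus the overlap of C_S and C_E
    # normalized by the smaller of the two boxes; 1 when they are disjoint.
    c_s, c_e = aabb(sphere_outputs), aabb(ellipsoid_outputs)
    denom = min(area(c_s), area(c_e))
    if denom == 0.0:
        # Degenerate box: our own convention, not stated in the paper.
        return 1.0 if overlap_area(c_s, c_e) == 0.0 else 0.0
    return 1.0 - overlap_area(c_s, c_e) / denom
```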

2.2 Results

Eight evolutionary simulations, each using a different random initialisation, were run for 500 generations. The results of the post-evaluation tests illustrated in (Tuci et al., 2009) show that the best evolved agent (hereafter, A_1) possesses a close to optimal ability to discriminate the shape of the objects as well as an excellent ability to generalize its skill in new circumstances. Moreover, in (Tuci et al., 2009) it is shown that A_1, for one of the two positions experienced during evolution (i.e., position A, in which the angles of joints J_1, ..., J_7 are {−50°, 20°, 20°, 100°, 30°, 0°, 10°}), exploits only tactile sensation to categorise the objects. In this Section, we take advantage of this latter result by running tests that further explore the dynamics of the decision of A_1 in position A, beyond the qualitative description illustrated in (Tuci et al., 2009).

Figure 3: (a) The Geometric Separability Index GSI(t) and (b) the number of tactile ambiguities, both plotted as a function of the time step t.

In particular, our interest is in finding out whether the discrimination process occurs at a specific moment, as a response to a sensory pattern that encodes the regularities which are necessary for discriminating, or whether it occurs over time by integrating the information contained in several successive sensory states. Movies of the best evolved strategies can be found at http://laral.istc.cnr.it/esm/active_perception.

To answer this question we use a slightly modified version of the Geometric Separability Index (hereafter referred to as GSI) originally proposed in (Thornton, 1997). The GSI provides an estimate of the degree to which the tactile sensor readings experienced during the interactions with the sphere and those experienced during the interactions with the ellipsoid are separated in sensory space. We built four hundred data sets, one for each time step with the ellipsoid (i.e., {Ĩ^E_k(t)}, k = 1, ..., 180), and four hundred data sets, one for each time step with the sphere (i.e., {Ĩ^S_k(t)}, k = 1, ..., 180), where Ĩ^E_k(t) is the tactile sensor reading experienced by A_1 while interacting with the ellipsoid at time step t of trial k, and Ĩ^S_k(t) is the tactile sensor reading experienced by A_1 while interacting with the sphere at time step t of trial k. Trial after trial, the initial rotation of the ellipsoid around the z-axis changes by 1°, from 0° in the first trial to 179° in the last trial. Each trial is differently seeded to guarantee random variations in the noise added to the sensor readings. At each time step t, the GSI is computed as follows:

GSI(t) = \frac{1}{180} \sum_{k=1}^{180} z_k(t) \qquad (5)

z_k(t) = \begin{cases} 1 & \text{if } m^{EE}_k(t) < m^{ES}_k(t) \\ 0 & \text{if } m^{EE}_k(t) > m^{ES}_k(t) \\ \dfrac{u_k(t)}{u_k(t) + v_k(t)} & \text{otherwise} \end{cases}

m^{EE}_k(t) = \min_{j \neq k} H\!\left(\tilde{I}^{E}_{k}(t), \tilde{I}^{E}_{j}(t)\right), \qquad m^{ES}_k(t) = \min_{j} H\!\left(\tilde{I}^{E}_{k}(t), \tilde{I}^{S}_{j}(t)\right)

u_k(t) = \left|\left\{ \tilde{I}^{E}_{j}(t) : H\!\left(\tilde{I}^{E}_{k}(t), \tilde{I}^{E}_{j}(t)\right) = m^{EE}_k(t) \right\}\right|, \qquad v_k(t) = \left|\left\{ \tilde{I}^{S}_{j}(t) : H\!\left(\tilde{I}^{E}_{k}(t), \tilde{I}^{S}_{j}(t)\right) = m^{ES}_k(t) \right\}\right|

where H(x, y) is the Hamming distance between two tactile sensor readings and |x| denotes the cardinality of the set x. GSI(t) = 1 means that at time step t the closest neighbours of each Ĩ^E_k(t) are one or more ellipsoid readings Ĩ^E_j(t); GSI(t) = 0 means that at time step t the closest neighbours of each Ĩ^E_k(t) are one or more sphere readings Ĩ^S_j(t).
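Under this reading of Equation 5, GSI(t) and the number of tactile ambiguities at a time step could be computed along the following lines; this is a sketch rather than the code used for the analyses, and the way ambiguities are counted (patterns occurring in both sets) is our interpretation of the definition given in the next paragraph.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def gsi_and_ambiguities(ellipsoid_patterns, sphere_patterns):
    """GSI(t) of Equation 5 plus the number of tactile ambiguities (sketch).

    Both arguments are lists of binary tactile patterns (one per trial)
    recorded at the same time step t.
    """
    total = 0.0
    for k, p in enumerate(ellipsoid_patterns):
        # Nearest-neighbour distance within the ellipsoid set (excluding p itself)
        m_ee = min(hamming(p, q) for j, q in enumerate(ellipsoid_patterns) if j != k)
        # Nearest-neighbour distance to the sphere set
        m_es = min(hamming(p, q) for q in sphere_patterns)
        if m_ee < m_es:
            total += 1.0
        elif m_ee > m_es:
            total += 0.0
        else:
            # Tie: fraction of nearest neighbours that are ellipsoid patterns
            u = sum(1 for j, q in enumerate(ellipsoid_patterns)
                    if j != k and hamming(p, q) == m_ee)
            v = sum(1 for q in sphere_patterns if hamming(p, q) == m_es)
            total += u / (u + v)
    gsi = total / len(ellipsoid_patterns)
    # A tactile ambiguity: a pattern experienced with both objects at this step
    ambiguities = sum(1 for p in ellipsoid_patterns if p in sphere_patterns)
    return gsi, ambiguities
```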

As shown in Fig. 3a, the GSI(t) tends to increase from about 0.5 at time step 1 to about 0.9 at time step 200, and to remain around 0.9 until time step 400. This trend suggests that during the first 200 time steps the agent acts in a way that brings forth those tactile sensor readings which facilitate the object identification and classification task. In other words, the behaviour exhibited by the agent allows it to experience two classes of sensory states, rather well separated in the sensory space, which correspond to objects belonging to the two different categories. However, the fact that the GSI does not reach the value of 1.0 indicates that the two groups of sensory patterns belonging to the two objects are not fully separated in the sensory space. In other words, some of the sensory patterns experienced during the interactions with the ellipsoid are very similar or identical to sensory patterns experienced during interactions with the sphere, and vice versa. This is confirmed by the graph shown in Fig. 3b, which reports the number of tactile ambiguities at each time step.

A tactile ambiguity is defined as a sensory pattern that is experienced both during interactions with the ellipsoid and during interactions with the sphere. If there are tactile ambiguities, the agent cannot determine the category of the object solely on the basis of a single sensory stimulus. The fact that the number of tactile ambiguities never reaches zero while the agent nevertheless achieves almost optimal performance seems to imply that the agent's categorization strategy involves an ability to integrate sequences of experienced sensory states over time.

3 Experiment 2

3.1 Methods

The second experimental scenario involves a simulated agent provided with a moving eye located in front of a screen that is used to display the images to be categorized (one at a time). The eye includes a fovea constituted by 5 × 5 photoreceptors distributed uniformly over a square area located at the centre of the eye's ‘retina’, and a periphery constituted by 5 × 5 photoreceptors distributed uniformly over a square area that covers the entire retina of the eye. Each photoreceptor detects the average grey level of an area corresponding to 1 × 1 pixel or to 10 × 10 pixels of the image displayed on the screen, for foveal and peripheral photoreceptors, respectively (see Fig. 4b). The activation of each photoreceptor ranges between 0 and 1 and is given by the average grey level of the pixels spanned by its receptive field (where 0 and 1 represent a fully white and a fully black visual field, respectively). The eye can explore the image by moving along the up-down and left-right axes up to a maximum distance corresponding to 25 pixels of the image.

The screen, located in front of the agent's eye, is used to display five types of italic letters (‘l’, ‘u’, ‘n’, ‘o’, ‘j’), each of which can be of 5 different sizes (with a variation of ±10% and ±20% with respect to the intermediate size: see Fig. 4a for the letter ‘l’). The letters are displayed in black/grey over a white background. As shown in Fig. 4b, the eye can perceive only a tiny part of a letter with its foveal vision and a much larger, but still incomplete, part of the letter with its peripheral vision. It is important to clarify that this set-up is not intended to model how humans actually recognize letters; rather, the characteristics of the set-up have been chosen so as to allow us to study how an active vision system can categorize stimuli through the exploitation of its eye movements and, possibly, through the integration of the perceived information over time.

Agents are provided with a neural network controller with 57 sensory neurons, 5 internal neurons, and 7 output neurons: see Fig. 4c for the network architecture. Notice that the sensory neurons relative to the eye periphery are connected only to the two movement output neurons. This connection pattern represents a very crude abstraction of the functional organization of the human visual system, in which eye movements seem to be driven primarily by the periphery while recognition seems to be based primarily on the information provided by the fovea (Findlay and Gilchrist, 2003; Wong, 2008). To take into account the fact that sensors are noisy, a random value with a uniform distribution in the range [−0.05, 0.05] is added to the activation state of each photoreceptor of the fovea at each time step.


Figure 4: (a) Letter ‘l’ shown in the 5 different sizes used in the experiment. (b) The screen displaying the letter ‘l’ in its intermediate size, with an exemplification of the field of view of the foveal and peripheral vision (smaller and larger squares, respectively). (c) The architecture of the neural controller. The number inside each rectangle indicates the number of neurons; the letter L in a box indicates that these neurons are leaky integrators. Solid arrows between two boxes indicate all-to-all connections between the neurons of those boxes, while dashed arrows indicate that the activation of the output units at time t is copied into the respective input units at time t + 1.

The output of each of the 5 leaky internal neurons depends on the input received from the sensory and internal neurons through the weighted connections and on its own activation at the previous time step, and is calculated as follows:

O^{t}_{i} = \tau_i O^{t-1}_{i} + (1 - \tau_i)\, \sigma\!\left( \sum_{j \in N_i} O^{t-1}_{j} w_{ji} + b_i \right) \qquad (6)

where O^t_i is the output of unit i at time t, τ_i is the time constant of unit i, in [0, 1], w_ji is the weight of the connection from unit j to unit i, b_i is the unit's bias, and σ(x) is calculated as in Equation 1. The output of the output units is calculated as in Equation 6 but with the time constant fixed to 0 (i.e. output neurons do not depend on their previous state). The output of the motor units is then linearly normalized into the range [−25, 25] and used to vary the position of the eye along the x and y axes of the image, respectively.

The free network parameters are learned using a genetic algorithm similar to the one described for the previous experiment. Agents are evaluated for 50 trials lasting 100 time steps each. At the beginning of each trial the screen is set so as to display one of the five letters in one of the five sizes (each letter of each size is presented twice to each individual), the state of the internal neurons of the agent's neural controller is initialized to 0, and the eye is initialized in a random position within the central third of the screen (so that the agent can always perceive some part of the letter, at least with its peripheral vision). During the 100 time steps of each trial the agent is left free to visually explore the screen. Trials, however, are terminated earlier if the agent does not perceive any part of the letter through its peripheral vision for three consecutive time steps. The task of the agent consists in labelling the category of the current letter correctly during the second half of the trial. More specifically, agents are evaluated on the basis of the following fitness function F_F, which comprises two components: the first one measures the agents' ability to activate the categorization unit corresponding to the current category more than the other units; the second one measures the ability to maximize the activation of the right unit while minimizing those of the other units:

F_1(t, c) = \frac{1}{2^{\,rank(t,c)}} \qquad (7)

F_2(t, c) = \frac{1}{2}\, O^{t,c}_{r} + \frac{1}{8} \sum_{O \in O^{t,c}_{w}} (1 - O) \qquad (8)

F_F = \frac{1}{50 \cdot 50} \sum_{t=1}^{50} \sum_{c=51}^{100} \left( \frac{F_1(t, c)}{2} + \frac{F_2(t, c)}{2} \right) \qquad (9)

where F_1(t, c) and F_2(t, c) are the values of the two fitness components at step c of trial t, rank(t, c) is the ranking of the activation of the categorization unit corresponding to the correct letter (from 0, meaning the most activated, to 4, meaning the least activated), O_r^{t,c} is the activation of the output unit corresponding to the right letter at step c of trial t, and O_w^{t,c} is the set of activations of the units corresponding to the wrong letters at step c of trial t. Notice that, as in the previous setup, individuals are not rewarded for moving their eyes or for producing a certain type of exploration behaviour, but only for the ability to categorize the current letter.
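For illustration, the per-step components of Equations 7–8 can be sketched as follows; the tie-breaking in the ranking and the function name are our assumptions.

```python
def step_fitness(outputs, correct):
    """Per-step fitness components F1 and F2 of Equations 7-8 (sketch).

    outputs : activations of the 5 categorization units, each in [0, 1]
    correct : index of the unit corresponding to the displayed letter
    """
    # rank(t, c): 0 if the correct unit is the most activated, 4 if the least
    # (strict comparison is an assumption about how ties are handled).
    rank = sum(1 for i, o in enumerate(outputs)
               if i != correct and o > outputs[correct])
    f1 = 1.0 / (2 ** rank)

    # F2: half of the credit for activating the right unit, the other half
    # (split over the four wrong units) for keeping them close to zero.
    wrong = [o for i, o in enumerate(outputs) if i != correct]
    f2 = 0.5 * outputs[correct] + sum((1.0 - o) / 8.0 for o in wrong)
    return f1, f2

# Example: a confident, correct answer scores close to 1 on both components,
# e.g. step_fitness([0.05, 0.9, 0.1, 0.0, 0.05], correct=1) -> (1.0, 0.925)
```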

3.2 Results

Twenty evolutionary simulations were run, each lasting 3000 generations. The best agents of all simulations obtained on average a good performance, with the best agent of the best replication reaching close to optimal performance. In order to better quantify the ability of the adapted agents to categorize the letters, we measured the percentage of times in which, during the second half of each trial, the categorization unit corresponding to the current letter is the most activated. We evaluated the best individuals of each of the 20 replications of the experiment for 10000 trials, during which they are exposed to all possible combinations of the 5 letters with 50 sizes (uniformly distributed over the range [−20%, +20%] of the intermediate size), 40 times for each combination. As a result, we obtained that the average performance over all replications is 76.92% and the performance of the best individual of the best replication is 94.32%. In the remaining part of this section we focus our analysis on the best evolved agent, that is, the best individual of replication 12.

By analysing the behaviour displayed by the best individual we can see how, after an initial phase lasting typically from 5 to 30 time steps (in which the behaviour varies significantly for different initial positions of the eye and for different letter sizes), the behaviour of the agent converges either on a fixed point attractor (i.e. the eye stops moving after having reached a particular position on the letter) or on a limit cycle attractor (i.e. the eye keeps moving by periodically foveating, in sequence, 2-6 different specific areas of the image). Interestingly, the agent displays the same type of behaviour in interaction with letters belonging to the same category even if they are of different sizes, and different behaviours for letters of different categories.

As for the previous experimental setup, we wanted to quantitatively ascertain the capacity of evolved individuals to actively select discriminating inputs. Apart from the efferent copies that provide as input the categorization output produced by the agent at the previous time step, the categorization answer of our system depends on two sources of information:

the visual information provided by the photoreceptors of the fovea and the motor information provided by the efferent copies of the motor neurons controlling the eye movements. Starting from the GSI index introduced in the previous experiment, we adapted it to the new setup and then observed the evolution of the values of this index for both kinds of input (visual and motor) during the interaction of the agent with the images. For this particular experiment we wanted a more demanding index, which takes into account not only the nearest neighbour but all the input vectors belonging to each category. Hence, we devised what we call the Modified Geometric Separability Index (MGSI), which is defined as the average, over all patterns, of the proportion of the patterns belonging to the same category that are among the |C_x| nearest patterns (using the Euclidean distance), with |C_x| representing the total number of patterns in the same category as pattern x. More formally, the MGSI is calculated as follows:

MGSI(P) = \frac{1}{|P|} \sum_{x \in P} \frac{\sum_{n \in N_x} C_x(n)}{|C_x|} \qquad (10)

where |S| indicates the cardinality of the set S, P is the set comprising all the patterns, C_x is the set of all the patterns belonging to the same category as pattern x (x itself does not belong to C_x), N_x is the set of the |C_x| patterns nearest to pattern x,


Figure 5: Evolution of the MGSI of the input coming from the fovea and from the efferent copy of the eye movements during the 100 cycles of the trials. Each point along the x axis represents the value of the MGSI calculated by taking all the inputs recorded in 250 trials (5 letters × 5 dimensions × 10 repetitions) during one of the 100 cycles of each trial.

and C_x(n) is the indicator function of the set C_x: it returns 1 if n is in C_x, and 0 otherwise.
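A minimal sketch of Equation 10, assuming each pattern is a real-valued vector and that ties in distance are broken arbitrarily:

```python
import numpy as np

def mgsi(patterns, labels):
    """Modified Geometric Separability Index of Equation 10 (sketch).

    patterns : array of shape (n, d), one input vector per row
    labels   : array of n category labels
    """
    patterns = np.asarray(patterns, dtype=float)
    labels = np.asarray(labels)
    n = len(patterns)
    total = 0.0
    for x in range(n):
        same = (labels == labels[x])
        same[x] = False                       # x does not belong to C_x
        cx_size = int(same.sum())             # |C_x|
        dist = np.linalg.norm(patterns - patterns[x], axis=1)
        dist[x] = np.inf                      # exclude x from its own neighbours
        nearest = np.argsort(dist)[:cx_size]  # the |C_x| nearest patterns (N_x)
        total += same[nearest].sum() / cx_size
    return total / n
```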

We calculated the MGSI of both the visual and motor-copy patterns experienced by the best evolved agent during 250 test trials, ten replications (with different initial positions) for each of the 5 × 5 letter-dimension pairings. More specifically, the two MGSIs were calculated for each of the 100 cycles composing the trials, so that we could observe their evolution during the agent's interactions with the images. The results are shown in Fig. 5. They show three things. First, the separability of the input patterns in both sensory channels (visual and motor) significantly increases throughout the trials, in particular during the first 20 cycles, meaning that the agent's sensory-motor behaviour has evolved so as to facilitate the categorization process. Second, the geometric separability of the inputs in the two channels reaches very similar values (with the motor-copy channel being slightly better), meaning that the categorization behaviour might be based on both kinds of sensory information. Third, the geometric separability of neither of the two channels reaches very high values, meaning that, as in the previous experiment, to successfully solve the task the system has to integrate the information collected during different time steps, because the sensory pattern collected at a single time step does not provide enough information for a correct discrimination.

4 Conclusions

In this paper we presented two different experimental setups in which embodied agents are asked to categorize various objects by actively selecting their own inputs. In the first scenario, an anthropomorphic robotic arm equipped with coarse-grained tactile sensors has been asked to distinguish between spherical and ellipsoidal objects. The setup is significantly more complex than those in the active perception literature due to the high similarity between the objects to be discriminated, the difficulty of controlling a system with so many degrees of freedom, and the need to master the effects produced by gravity, inertia, collisions, etc. Nevertheless, the evolved system is able to solve the task and reach close to optimal performance. The second scenario involves an agent with a simulated moving eye that has to recognize different letters. Whereas work in the literature has mainly focused on experiments comprising only two different categories, this setup is more challenging as there are significantly more categories with more variability (five letters of different sizes). Also in this case the system is able to successfully solve the task with close to optimal performance.

Both experiments show that active perception systems are indeed able to cope with complex scenarios. The ability to actively select one's own input is exploited by the agents by looking for perceptions that are as unambiguous as possible, as the modified GSI trend shows. In spite of the effectiveness of their actions, however, the agents often encounter input patterns that are not uniquely associated with one category. Thus, the best evolved agents also show a complementary ability to integrate partially conflicting information over time.

Acknowledgements

This research work was supported by the ITALK project (EU, ICT, Cognitive Systems and Robotics Integrating Project, grant no. 214668).

References

Beer, R. (2000). Dynamical approaches to cognitive science. Trends in Cognitive Sciences, 4:91–99.

Beer, R. and Gallagher, J. (1992). Evolving dynamic neural networks for adaptive behavior. Adaptive Behavior, 1(1):91–122.

Beer, R. D. (2003). The dynamics of active categori- cal perception in an evolved model agent. Adaptive Behavior, 11(4):209–243.

Clark, A. (1997). Being There: putting brain, body and world together again. Oxford University Press, Oxford.

Findlay, J. M. and Gilchrist, I. D. (2003). Active Vi- sion. The Psychology of Looking and Seeing. Ox- ford University Press, Oxford.

Gibson, J. J. (1977). The theory of affordances. In Shaw, R. and Bransford, J., (Eds.), Perceiving, Acting and Knowing. Toward an Ecological Psychology, chapter 3, pages 67–82. Lawrence Erlbaum Associates, Hillsdale, NJ.

Gigliotta, O. and Nolfi, S. (2008). On the coupling between agent internal and agent/environmental dynamics: Development of spatial representations in evolving autonomous robots. Adaptive Behavior, 16:148–165.

Goldberg, D. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.

Harnad, S., (Ed.) (1987). Categorical Perception: The Groundwork of Cognition. Cambridge University Press.

Hurley, S. (1998). Consciousness in Action. Harvard University Press, Cambridge, MA.

Massera, G., Cangelosi, A., and Nolfi, S. (2007). Evo- lution of prehension ability in an anthropomorphic neurorobotic arm. Front. Neurorobot., 1.

Noë, A. (2004). Action in Perception. MIT Press, Cambridge, MA.

Nolfi, S. (2002). Power and limits of reactive agents. Neurocomputing, 49:119–145.

Nolfi, S. and Floreano, D. (2000). Evolutionary robotics. The biology, intelligence, and technology of self-organizing machines. MIT Press, Cam- bridge, MA.

Nolfi, S. and Marocco, D. (2002). Active perception: A sensorimotor account of object categorisation. In Hallam, B., Floreano, D., Hallam, J., Hayes, G., and Meyer, J.-A., (Eds.), Proc. of the 7th International Conference on Simulation of Adaptive Behavior (SAB ’02), pages 266–271. MIT Press, Cambridge, MA.

Pfeifer, R. and Scheier, C. (1999). Understanding Intelligence. MIT Press, Cambridge, MA.

Scheier, C., Pfeifer, R., and Kuniyoshi, Y. (1998). Embedded neural networks: exploiting constraints. Neural Networks, 11(7-8):1551–1596.

Thornton, C. (1997). Separability is a learner’s best friend. In Bullinaria, J., Glasspool, D., and Houghton, G., (Eds.), Proc. of the 4 th Neural Computation and Psychology Workshop: Connec- tionist Representations, pages 40–47. Springer Ver- lag, London, UK.

Tuci, E., Massera, G., and Nolfi, S. (2009). Active categorical perception in an evolved anthropomorphic robotic arm. In Proc. of the IEEE Conference on Evolutionary Computation (CEC ’09), Special Session on Evolutionary Robotics, ISBN: 978-1-4244-2959-2. Draft available at http://laral.istc.cnr.it/elio.tuci/pagn/pubb.html.

Tuci, E., Trianni, V., and Dorigo, M. (2004). Feeling the flow of time through sensory-motor coordina- tion. Connection Science, 16(4):301–324.

van Gelder, T. J. (1998). The dynamical hypothesis in cognitive science. Behavioral and Brain Sci- ences, 21:615–665.

Wong, A. M. (2008). Eye Movement Disorders. Ox- ford University Press, Oxford.