
2013 BRICS Congress on Computational Intelligence & 11th Brazilian Congress on Computational Intelligence

A Biologically Inspired Architecture for Multiagent Games

Fernanda M. Eliott
Computer Science Division
Technological Institute of Aeronautics (ITA)
São José dos Campos, Brazil

Carlos H. C. Ribeiro
Computer Science Division
Technological Institute of Aeronautics (ITA)
São José dos Campos, Brazil

Abstract—This paper reports modifications on a biologically inspired robotic architecture originally designed to work in single-agent contexts. Several adaptations have been applied to the architecture, seeking as a result a model-free artificial agent able to accomplish shared goals in a multiagent environment, from sensorial information translated into homeostatic variable values and a rule database that play roles, respectively, in temporal credit assignment and action-state space exploration. The new architecture was tested in a well-known benchmark game, and the results were compared to the ones from the multiagent RL algorithm WoLF-PHC. We verified that the proposed architecture can produce coordinated behaviour equivalent to WoLF-PHC in stationary domains, and is also able to learn cooperation in non-stationary domains. The proposal is a first step towards an artificial agent that cooperates as a result of a biologically plausible computational model of morality.

Keywords: biologically inspired architectures, multiagent systems, game theory, reinforcement learning.

I. INTRODUCTION

Important engineering applications have resulted from the embodiment of a biological concept. Besides, biologically inspired artifices may trigger conceivable analogies to areas such as Philosophy and Psychology. We consider in this paper a way of achieving cooperative behaviour by modifying a biologically inspired computing architecture and making it able to fulfil tasks in multiagent (MA) environments so as to maximize the average reinforcement, rather than the agent's own reinforcement. In [9], a behaviour-based control architecture was instantiated over simulated primary emotions and hormonal processes maintained by a homeostatic system. The latter was inspired by the Somatic Marker hypothesis from [5]: what would assist us to make fast decisions under low time and computational burden, supporting predictions of what might occur hereafter. In [11] [12] the computational architecture was improved, giving rise to ALEC (Asynchronous Learning by Emotion and Cognition). It was influenced by the Clarion model [18], an architecture intended to model cognitive processes from a psychological perspective. ALEC was designed to use data from the sensors of a Khepera robot [14], and its multi-task abilities were constructed in the context of a single robot. ALEC is based on a homeostatic system and two levels participating in the decision-making process. The bottom level is a backpropagation [23] feedforward artificial neural network (ANN) employing the Q-Learning algorithm [21] to learn and calculate the utility values of behaviour-state pairs. The top level is composed of independently kept rules that counterbalance the other level by providing different suggestions for action selection.

We propose in this paper modifications to ALEC to make it suitable for MA coordination tasks, hence the name Multi-A for the proposed architecture. Multi-A is intended to learn through its actions, reinforcements and environment, including other agents. Although in its validating experiments ALEC was considered as having a homeostatic system fed by the sensors of a Khepera robot, the idea was to have it built in such a way that it is independent of specific robotic sensors; as a result, it should be easier to customize the system to different environments.

A. Related Work

Application of model-free techniques such as Q-learning in multiagent games is an open area of research. An important issue is how to accomplish cooperative behaviour in general-sum games when the Pareto-optimal solution is not a Nash equilibrium [13] [15]. As a matter of fact, [13] contrasts the performance of classic MA learning algorithms when facing aspects that can be problematic to handle, such as: the number of states, agents and actions per agent; single, several and shadowed optimal equilibria; and deterministic versus stochastic games. With the aim of conceiving an investigative foundation and establishing an algorithm to be contrasted with Multi-A, we analyzed some classic MA algorithms: Correlated-Q [8]; Awesome [2]; CMLeS [7]; Manipulator [17]; M-Qubed [3] [4]; and WoLF-PHC [1]. The essential prerequisite for choosing a benchmarking MA learning algorithm was its ability to handle stochastic and repeated general-sum games; moreover, it was important to have published results that illustrate different kinds of difficulties, as those results would be compared to those of Multi-A. We then decided to adopt WoLF-PHC as a benchmark. It was implemented according to the description in [1] and follows the "Win or Learn Fast" principle: learn fast when losing and carefully when winning. It uses a variable learning rate to help make it robust against the alter-exploration problem, i.e. the perturbation caused to an agent's learning process by environmental exploration carried out by other agents [13].

Reference [1] verified the convergence of WoLF-PHC to optimal policies in several general-sum stochastic games; likewise, [3] and [13] present different analyses of WoLF-PHC in self-play.
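As an illustration of the WoLF-PHC update, the following minimal tabular Python sketch combines Q-learning with policy hill-climbing under two step sizes (delta_win < delta_lose). The clipped renormalization of the policy is a simplification of the exact constrained update in [1], and all names and parameter values are ours, not those of the original implementation.

import random
from collections import defaultdict

class WoLFPHC:
    """Tabular WoLF-PHC sketch: Q-learning plus policy hill-climbing with a
    variable learning rate (learn fast when losing, carefully when winning)."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9, delta_win=0.01, delta_lose=0.04):
        self.nA, self.alpha, self.gamma = n_actions, alpha, gamma
        self.d_win, self.d_lose = delta_win, delta_lose
        self.Q = defaultdict(lambda: [0.0] * n_actions)
        self.pi = defaultdict(lambda: [1.0 / n_actions] * n_actions)      # current policy
        self.pi_avg = defaultdict(lambda: [1.0 / n_actions] * n_actions)  # average policy
        self.visits = defaultdict(int)

    def act(self, s, epsilon=0.05):
        if random.random() < epsilon:
            return random.randrange(self.nA)
        return random.choices(range(self.nA), weights=self.pi[s])[0]

    def update(self, s, a, r, s_next):
        # Standard Q-learning step.
        self.Q[s][a] += self.alpha * (r + self.gamma * max(self.Q[s_next]) - self.Q[s][a])
        # Incrementally track the average policy.
        self.visits[s] += 1
        for i in range(self.nA):
            self.pi_avg[s][i] += (self.pi[s][i] - self.pi_avg[s][i]) / self.visits[s]
        # "Win or Learn Fast": winning means the current policy does at least
        # as well as the average policy under the current Q estimates.
        winning = sum(p * q for p, q in zip(self.pi[s], self.Q[s])) >= \
                  sum(p * q for p, q in zip(self.pi_avg[s], self.Q[s]))
        delta = self.d_win if winning else self.d_lose
        # Hill-climb towards the greedy action, then renormalize (simplified projection).
        best = max(range(self.nA), key=lambda i: self.Q[s][i])
        for i in range(self.nA):
            self.pi[s][i] = max(0.0, self.pi[s][i] + (delta if i == best else -delta / (self.nA - 1)))
        total = sum(self.pi[s])
        self.pi[s] = [p / total for p in self.pi[s]]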
Using WoLF-PHC as reference, we tested Multi-A in a
benchmark game. We verified that the proposed architecture
can produce coordinated behaviour equivalent to WoLF-PHC
in stationary domains, and is also able to learn cooperation in
non-stationary domains. The proposal is a first step towards an artificial agent that cooperates as a result of a biologically plausible computational model of morality.

II. MULTI-A ARCHITECTURE


The general scheme of Multi-A is illustrated in Figure 1. It is based on four major modules that operate as follows:
1) Sensory Module: stores the information the agent has about its environment. After the update of the sensory and homeostatic variables, a well-being index is calculated from an equation that qualitatively estimates the current situation of the agent. When the sensory variables present a certain configuration, an action is performed, and right after the execution of the action, sensory and homeostatic variables are updated and the well-being is recalculated.
2) Cognitive Module: stores and manipulates rules that consist of interval-based specifications of the sensory space and a mapped recommended action to be performed in face of such specifications. The range for rule specification may be, for example, quantified in intervals of 0.2 over the sensory variable values. The set of rules is supposed to assist in cases of excessive generalization produced by the Learning Module.
3) Learning Module (adaptive system): uses an artificial neural network (ANN) for each available action, and the Q-learning algorithm [21] to estimate the utility value for the current sensory data and action. Learning or correction of the ANN weights is made through the Backpropagation algorithm [11], employing the well-being as the target value.
4) Action Selection Module (AS): receives from the Learning Module the Q-values for each action, and then gathers the actions suggested by the rules that match the current sensory data (if there is any rule that contemplates the current values of the sensory variables). During the beginning of a simulation, AS uses a high exploration rate for the action space.

Figure 1. General scheme of Multi-A architecture.
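For illustration, the decision cycle implied by the four modules can be sketched as below. This is a hypothetical Python skeleton written by us, not the authors' implementation: the module internals are reduced to toy stubs, and the combination of Q-values and rule votes anticipates Equation 2 of Section II-C.

import random

M_SENSORS, N_ACTIONS, CR = 6, 4, 0.2

def read_sensors(env_state):
    # Sensory Module stub: m sensory values in [0.0, 1.0].
    return [random.random() for _ in range(M_SENSORS)]

def q_values(sensory):
    # Learning Module stub: Multi-A uses one ANN per action trained by Q-learning.
    return [random.uniform(-1.0, 1.0) for _ in range(N_ACTIONS)]

def rule_votes(sensory, rules):
    # Cognitive Module stub: count the matching rules that suggest each action.
    votes = [0] * N_ACTIONS
    for rule in rules:
        if all(lo <= v < hi for v, (lo, hi) in zip(sensory, rule["intervals"])):
            votes[rule["action"]] += 1
    return votes

def select_action(q, votes, epsilon=0.1):
    # Action Selection Module: combine both sources (cf. Equation 2), epsilon-greedy.
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    fv = [qi + CR * vi for qi, vi in zip(q, votes)]
    return max(range(N_ACTIONS), key=lambda i: fv[i])

def step(env_state, rules):
    sensory = read_sensors(env_state)
    action = select_action(q_values(sensory), rule_votes(sensory, rules))
    # ...execute the action, update sensory and homeostatic variables,
    # recompute the well-being (Equation 1), train the ANNs with the
    # well-being as target, and create/update/delete rules.
    return action

print(step(env_state=None, rules=[]))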
The total number of homeostatic variables and sensory inputs must be defined according to the domain. For the sake of generality, let m be the number of sensory variables (values in [0.0, 1.0]). Added to a bias, these variables feed the adaptive (ANN) and rule systems. There are also n homeostatic variables H_i, with values in [−1.0, 1.0], supplied by the sensory data. The values of the homeostatic variables are the result of an operation involving reinforcements and values of sensory variables, and the application domain where the architecture operates will determine the nature of such operation. They provide input values to the equation that indicates the valence (well-being) of choices and environment, and are updated at each iteration step.

A. Well-being

The well-being W represents the current situation of the agent w.r.t. its interaction with the environment and other agents. It is calculated from the homeostatic variables, but with normalizing weights so that the final value falls in the range [−1.0, 1.0]. It thus produces the target value brought to the Learning Module for correcting the ANN weights through an updating algorithm (in our case, standard backpropagation). Additionally, W supports the cognitive system in calculating the likelihood of success or failure of a rule: if it is greater than or equal to a parameter RV, a rule is created (and if a rule that fits the current sensory data already exists, its success rate is updated).

More specifically, RV is a threshold on W; consequently, it is set within the same range [−1.0, 1.0]. All sensory data/action pairs that produce a well-being equal to or above RV will be added to the rule set. Thereby, RV is a parameter that indirectly determines the influence of rules on the decision process: higher values of RV induce a lesser effect of rules on the decision process, since only the pairs that resulted in high values of well-being will be initially kept in the set of rules.

W is calculated according to Equation 1:

W = Σ_{i=1}^{n} a_i H_i    (1)

where n is the number of homeostatic variables H. The weights a_i are set according to the relevance of each homeostatic variable to the task.

The reinforcements are normalized to the range [−1.0, 1.0], since the homeostatic variables fit the range [−1.0, 1.0]: as homeostatic variables are created from reinforcements and sensory variables, the reinforcements are incorporated into the well-being via them.
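A minimal numerical sketch of Equation 1 and of the RV test follows. The four variables, their readings and the equal weights are illustrative assumptions, not values used in the experiments.

def well_being(h, a):
    # Equation 1: W = sum_i a_i * H_i, with weights chosen so that W stays in [-1.0, 1.0].
    return sum(ai * hi for ai, hi in zip(a, h))

H = {"HM": 0.8, "HC": 1.0, "HD": -0.2, "HN": 0.0}     # homeostatic readings (assumed)
A = {"HM": 0.25, "HC": 0.25, "HD": 0.25, "HN": 0.25}  # normalizing weights (assumed)

W = well_being([H[k] for k in H], [A[k] for k in H])
print(W)          # ~0.4
RV = 0.6
print(W >= RV)    # False: this sensory data/action pair would not create a rule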

B. Cognitive Module

Homeostatic variables provide guidelines about the environment. For example, if there is a homeostatic variable related to positive reinforcements, high values of that variable indicate the agent has been through a situation that gives a positive reinforcement. Thus, the rule system can store the set of associated sensory data/actions that led to a goal state, in domains where a positive reinforcement is only associated with such states. More specifically, during learning, the action/state pair (indicated via sensory variables) that leads to a faster drive towards the goal state (by supplying levels of well-being equal to or above RV) will be stored via the rules, increasing the count of success for that sensory data/action pair. Stored rules from past positive situations that became inadequate (e.g., by leading to frequent collisions against other agents) will be deleted through updates of the success rates. In the early stages of exploration, the agent may produce sensory values that will not come about again; stored rules fitting those situations become useless and are deleted as the maximum allowed number of rules in the set is reached.

When an action is successful in a state (well-being greater than or equal to RV), a rule consistent with the current sensorial values of Multi-A and the action taken will be created (if not already existent) and added to the set of rules. The existence of conflicting rules is allowed: the same description of the input sensorial values but with different actions. If a rule is used and the outcome is a well-being below RV, its failure rate is increased; otherwise the success rate is increased. Whenever a stored rule matches the sensory variable values, its recency is updated. Once in the set of rules, a rule can be manipulated: reduced, expanded or deleted. However, manipulations of a rule are only allowed if the rule fits the current sensory values and its suggested action is performed at least MEx times. If a rule is not enforced a minimum number of times MEx, it will be deleted only if the set of rules is complete (the cardinality of this set being a design parameter). New rules replace the ones that were applied more remotely in time.
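The rule bookkeeping described above can be sketched in Python as follows; the dictionary layout, the capacity of the rule set and the deletion criterion once MEx uses are reached are our assumptions, not the paper's exact implementation.

BIN, N_BINS = 0.2, 5     # interval width for rule descriptions
MAX_RULES = 100          # cardinality of the rule set (a design parameter)

def describe(sensory):
    # Quantize sensory values into intervals of width BIN (epsilon guards float edges).
    return tuple(min(int(v * N_BINS + 1e-9), N_BINS - 1) for v in sensory)

def update_rules(rules, sensory, action, w, rv, m_ex, step):
    desc = describe(sensory)
    for rule in list(rules):
        if rule["desc"] != desc:
            continue
        rule["recency"] = step                      # matching rule: refresh its recency
        if rule["action"] == action:
            rule["uses"] += 1
            if w >= rv:
                rule["success"] += 1
            else:
                rule["failure"] += 1
            # Manipulation is only allowed after at least MEx applications;
            # the exact deletion criterion is an assumption.
            if rule["uses"] >= m_ex and rule["failure"] > rule["success"]:
                rules.remove(rule)
    if w >= rv and not any(r["desc"] == desc and r["action"] == action for r in rules):
        rules.append({"desc": desc, "action": action, "success": 1,
                      "failure": 0, "uses": 0, "recency": step})
    # When the set is full, a rule not yet applied MEx times can be dropped;
    # here the least recently matched rule gives way to the new one.
    if len(rules) > MAX_RULES:
        rules.remove(min(rules, key=lambda r: r["recency"]))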

C. Action Selection Module (AS)

The Action Selection Module receives values from both the Cognitive and Learning Modules. The Learning Module, through the ANN, delivers the utility values for the pairs (current sensory data, action). If there is a rule that fits the current state, the Cognitive Module provides suggestions of action selection: each rule will have the same weight CR. From the data sent by both Modules, the AS Module assigns values FV_i to the available actions through Equation 2:

FV_i = Q_i + CR × AC_i    (2)

where FV_i is the value of action i; CR is the weight of a rule that suggests action i; AC_i is the number of rules fitting the current sensory values that have i as action suggestion; and Q_i is the Q-value of action i. The action with maximum FV_i is selected with higher probability (e.g. using an ε-greedy strategy).

The value of the constant CR determines the importance given to the recommendation of a rule. There may be several rules that indicate the same action, and also conflicting rules. If a prominent number of rules support the same action, the Q-values provided by the ANN may turn out to be irrelevant in Equation 2, and thereby the same action will be performed continuously. The value of the multiplicative constant CR should be low enough to allow for a balance between the application of actions driven by rules (more specific) and by the Q-values (more general). The values of CR, MEx and RV must be defined according to the domain while respecting the observations above.
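A small numerical illustration (with made-up values) of Equation 2 and of the balancing role of CR: three matching rules vote for action 0 and can outweigh a higher Q-value for action 1 unless CR is kept small.

Q  = [0.30, 0.50, 0.10, 0.00]   # Q-values delivered by the per-action ANNs
AC = [3, 0, 1, 0]               # number of matching rules suggesting each action

print([q + 0.20 * ac for q, ac in zip(Q, AC)])   # CR = 0.20 -> action 0 wins (~0.90)
print([q + 0.05 * ac for q, ac in zip(Q, AC)])   # CR = 0.05 -> action 1 wins (0.50)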

III. EXPERIMENTAL SETUP AND RESULTS

Each simulation covered a total number of times (trials) that a game was played. Each trial had the duration of one game: from its beginning to its end, when a goal was reached. Thus the total number of steps or iterations in each trial was not always the same: during learning, agents can spend more time steps until solving the game. The total number of trials for each simulation was 50,000. The number of simulations was 50. The tested games were those already reported in [1] and [3]: the Coordination Game and the Gridworld Game. The data depicted in the figures are the mean reinforcement taken at intervals of 100 trials. The mean reinforcement corresponds to two agents in self-play (i.e., playing against each other and using the same architecture). Multi-A was compared to the WoLF-PHC algorithm, and as it operates on reinforcements within the interval [−1, 1], the original values of the game scores were normalized to this range. The state of a Multi-A agent is determined by the values of its sensory entries, suggesting a fair number of trials to train the homeostatic system.

The number of sensory variables used in the experiments reported herein was 6 and the number of homeostatic variables was 4. The 6 sensory variables are:
• Clearance: takes its maximum value when there was no collision, and a low value otherwise.
• Obstacle Density: high when there was a collision (either against an obstacle or another agent); zero otherwise. Depending on the application, Obstacle Density may be used to differentiate kinds of collision, such as against an obstacle or against another agent.
• Movement: represents the number of steps the agent has been moving around during a trial. It is decreased when the agent stops (since that is usually a bad option: with the agent standing still, other agents might have time to take advantage of the environment) and increased otherwise. The agent stops if there is a collision, be it against an obstacle or another agent.
• Energy: reflects whether the agent has been finding its goal often. It starts in the 1st trial of the 1st simulation with the maximum value, but is decreased step by step. It only grows when the agent receives positive reinforcement: the latter is added to the Energy value.
• Target Proximity and Target Direction: both simulate the light intensity sensors from ALEC [12], with the light source replaced by the target state. These sensory variables provide high values in the goal state and (with a smaller value) in the neighboring states (but not diagonal to the target); otherwise the variables are zeroed. In the 1st trial of the 1st simulation these variables are always zeroed, as the agent still does not know where the target is: the sensory data and actions that lead to a target state have to be learned. Once the first positive reinforcement is achieved, those variables change their values. Notice that the target state and neighbor state discrimination provide incomplete environmental information about the global localization of the agent in the environment.

The 4 homeostatic variables are (a sketch of a possible update is given after this list):
• HM: related to the sensory variable Movement. In a multiagent task a decision should be taken fast, as there are other agents who can take advantage of any delay, so this variable is expected to reach its maximum and lowest values quickly. It is decreased at any collision and increased otherwise.
• HC: related to the sensory variable Clearance. It equals −1 when there is a collision and is 1 otherwise.
• HD: related to the sensory variable Energy; it reflects for how long the agent has not received positive reinforcements.
• HN: fed by negative reinforcements; in their absence it equals zero.
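The sketch below shows one possible way of deriving the four homeostatic variables from the sensory readings and the reinforcement received; the update magnitudes and the mapping of Energy onto HD are illustrative assumptions, not the values used in the experiments.

def clamp(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))

def update_homeostasis(h, collided, reinforcement, energy):
    h["HC"] = -1.0 if collided else 1.0                      # Clearance: collision indicator
    h["HM"] = clamp(h["HM"] + (-0.5 if collided else 0.5))   # Movement: saturates quickly
    h["HD"] = clamp(2.0 * energy - 1.0)                      # Energy in [0, 1] mapped to [-1, 1]
    h["HN"] = reinforcement if reinforcement < 0 else 0.0    # negative reinforcements only
    return h

h = {"HM": 0.0, "HC": 1.0, "HD": 0.0, "HN": 0.0}
print(update_homeostasis(h, collided=True, reinforcement=-0.01, energy=0.4))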
A. The Coordination Game

Each of two agents has 4 action options in a grid world with 3 × 3 = 9 states. The game ends when at least one agent reaches its target position, receiving reinforcement R = 1. When both agents try to go to the same state, they remain still and both receive R = −0.01. The agents have to learn how to coordinate their paths to the target position so that they both get reinforcement R = 1.

Figure 2 shows that both WoLF-PHC and Multi-A learn to coordinate their paths to the target state in self-play. The mean reinforcement of Multi-A is slightly lower since, in contrast with WoLF-PHC, Multi-A does not use complete global state information, and there is perceptual ambiguity caused by the internal sensory readings and by the adopted range for rule descriptions (sensory variable values quantified in intervals of 0.2: {[0; 0.2), [0.2; 0.4), [0.4; 0.6), [0.6; 0.8), [0.8; 1.0]}). The parameters set for Multi-A were: MEx = 10, CR = 0.2, RV = 0.6.

Figure 2. The Coordination Game: mean reinforcement of Multi-A and WoLF-PHC in self-play.

Figure 3 shows the mean reinforcement for two agents under different architectures: Multi-A and WoLF-PHC. Both learn to coordinate their actions in order to achieve their own goals. It is interesting to note that the two learning architectures managed to play together despite using different learning strategies. The parameters set in Multi-A were MEx = 10, CR = 0.15, RV = 0.6.

Figure 3. The Coordination Game: mean reinforcement of Multi-A and WoLF-PHC playing as colleagues in the same simulation.
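A toy reproduction of the Coordination Game dynamics is sketched below for reference; the grid layout, starting positions and targets are assumed for illustration and may differ from the exact configuration in [1].

import random

MOVES = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}   # the 4 available actions

def step(positions, targets, actions):
    proposed = []
    for (x, y), a in zip(positions, actions):
        dx, dy = MOVES[a]
        proposed.append((min(2, max(0, x + dx)), min(2, max(0, y + dy))))
    if proposed[0] == proposed[1]:                # both try to enter the same state
        return positions, [-0.01, -0.01], False   # they remain still, R = -0.01 each
    rewards = [1.0 if p == t else 0.0 for p, t in zip(proposed, targets)]
    return proposed, rewards, any(r == 1.0 for r in rewards)

positions, targets = [(0, 0), (2, 0)], [(2, 2), (0, 2)]   # assumed layout
done = False
while not done:   # random play, just to exercise the dynamics
    positions, rewards, done = step(positions, targets,
                                    [random.randrange(4) for _ in range(2)])
print(positions, rewards)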
B. The Gridworld Game

The difference from the previous game is that the only state that allows multiple simultaneous occupation is the target, since both agents have the same state as target. Figure 4 illustrates the game and the initial positions, which have a barrier with a 50% chance of closing. One of the players must learn to leave the starting position through the free state numbered 8, and the other must try to go northbound (with the risk of colliding against the barrier); they must then coordinate their paths to the target, otherwise they will be stuck trying to go to the same place (state 8), colliding against each other. Both algorithms learned to manage the task in 3 steps (the minimum required quantity) and settled upon the same strategy. As learning goes by, the average reinforcement converges to 75%. The agent which continually leaves the starting position via the free state always wins, whereas the agent that repeatedly tries the barrier reaches the target in approximately 50% of the trials (that is, only when the barrier opens). Figure 5 shows that, in fact, the average reinforcement of both algorithms reaches 75%. The parameters for Multi-A were MEx = 10, CR = 0.2 and RV = 0.2.

C. The Gridworld Game - Second Version

A second version of the grid world game was created to test the algorithms under a non-stationary condition. Now the barriers have a 50% closing probability only in the 1st step of each trial; during all the remaining steps of the trial, the barriers remain open. For the games previously
described, both algorithms reached very similar performance; however, this game originates different outcomes.

Figure 4. The Gridworld game. The barrier is illustrated in red colour.

Figure 5. The Gridworld game: mean reinforcement of Multi-A and WoLF-PHC in self-play.

WoLF-PHC has the same behaviour and performance as in the original game – see Figure 6, WoLF-PHC(2000). Different exploration rates were set, aiming to detect whether it could perform differently; the final conclusion concerning different exploration rates was that either the algorithm behaves the same as in the original game (the Gridworld Game) or there is no effective learning. WoLF-PHC(2000) corresponds to a linearly decreasing exploration rate from trial 1 to trial 2000, starting at 0.5 and going down to 0.0001 in each trial. Several other exploration strategies were tested and generated similar results. WoLF-PHC(−) had a zero exploration rate.

Figure 6. The Gridworld Game - Version 2: mean reinforcement of WoLF-PHC in self-play with different exploration rates.

The results for Multi-A in the Gridworld game, second version, are shown in Figure 7. The parameters were CR = 0.15 and RV = 0.2. As explained in the Cognitive Module, manipulations of a rule are only allowed if the rule fits the current sensory values and its suggested action is performed at least MEx times. With the intention of evaluating the impact of the set of rules on the decision process, we tested different values of MEx and obtained different outcomes. In general, the rules eventually played an exploratory role: keeping the set of rules dynamic (rules being reduced, expanded or deleted) is likely to impact the selection of actions. Consequently, the agent might not maintain its action policy for too long, thus frustrating the expectation of action selection from another agent and subsequently impacting its best-response policy (when the strategy for solving the task is opponent-dependent [1]).

The lower the maximum allowed number of rules in the set of rules, the greater the likelihood that operations are applied to the set (since rules can be deleted all the time to open space for new rules), consequently resulting in changes on the action selection. When there is a high rate of exploration it may be convenient to keep a small value of MEx so that the set of rules adapts quickly to learning. However, depending on the application, at some point it may be appropriate to increase the value of MEx or even prohibit any change on the set of rules, so the agent can keep an action policy and 'commit' itself, in response getting the same from another agent and enabling the emergence and maintenance of cooperation. In Figure 7, MA(MEx10) stands for Multi-A with MEx = 10, whereas MEx(−) indicates no usage of the Cognitive Module (rules are never created). Different MEx values cause diverse outcomes, summarized as follows:

1- The only tested MEx that produced consistent outcomes regarding the game result was MEx = 10. Although receiving negative reinforcement because of colliding with each other, the two agents learned to always try to go to the free state in the 1st step. Thus both will be delayed, but able to get to the target state at the same time. As a result they both always win the game, and a trial lasts 4 steps, not 3 anymore. Thus, the average reinforcement converges to the maximum value minus one collision penalty, resulting in 0.99. All simulations converged to the very same ending (actually, only two failed), but some of them took longer: from all simulations, the one that converged first did so by trial 1100 and the last one by trial 34900. As MEx is small, the set of rules of both agents can be quite different from simulation to simulation, producing that variance.

2- MEx = 50, MEx = 100 and MEx(−): all produced agents that do not know how to deal with collisions. As they change their paths during the simulation (because of perceptual ambiguity caused by the internal sensory readings and by the adopted range for rule descriptions), and as there are collisions, they have difficulties trying to achieve the target state.

3- MEx = 15, MEx = 20, MEx = 25 and MEx = 40: we observed here that there were 3 kinds of action policy as outcome: first, similar to the ones observed for MEx = 50, MEx = 100 and MEx(−); second, the same as
for MEx = 10; and a third where both agents first collide into a wall and then go to the target. For the latter, the final mean reinforcement was maximum, since there were no further negative reinforcements provided by collisions against a wall – that strategy is a good one only if there is certainty and cooperative expectation about the other agents. MEx = 40 and MEx = 20 were omitted from the graph for better visualization.

Figure 7. The Gridworld Game - Version 2: mean reinforcement of Multi-A with different values of MEx.

With the aim of ensuring that the Cognitive System is working together with the Learning Module, instead of determining the selected actions all by itself, another simulation was made without the activation of the Learning Module. In this case the agent performance was unsatisfactory.

IV. FINAL REMARKS AND FUTURE WORK

We proposed in this paper Multi-A, a multiagent version of a biologically-inspired computational agent for multiagent games. Multi-A was tested in two benchmark coordination games, producing results similar to those of the WoLF-PHC algorithm but without using complete global localization information. In a modified non-stationary version of a coordination game, Multi-A produced a higher mean reinforcement. Such a behaviour is especially encouraging because the original idea was just to adapt the ALEC computational model for single agents to a general MA context via the inclusion of an ad hoc module designed to bring up cooperation or to model knowledge about other agents; such a module has not yet been fully implemented.

Issues still unanswered about Multi-A are: a) will the agent be mixed (cooperative and competitive, acting according to a certain pattern of reciprocity) or purely cooperative? b) will it work with agents operating under different learning algorithms (not just in self-play) in other games? For this second issue, the results we had in the Coordination Game suggest that Multi-A and WoLF-PHC can coordinate their action selections. Consequently, depending on the task, Multi-A is not restricted to self-play; this is an interesting finding already achieved for the proposed architecture.

Together with answering those questions and improving Multi-A, our original purpose was to move towards an artificial agent that cooperates as a result of a biologically plausible computational model of morality. This will be achieved through the inclusion of the additional module mentioned above, yet to be devised and implemented in the project.

ACKNOWLEDGMENTS

The authors thank CNPq and FAPESP for the financial support.

REFERENCES

[1] M. Bowling, M. Veloso, "Multiagent learning using a variable learning rate", Artificial Intelligence, Vol. 136, 2002, pp. 215-250.
[2] V. Conitzer, T. Sandholm, "AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents", Machine Learning, Vol. 67 (1-2), 2007, pp. 23-43.
[3] J. Crandall, Learning Successful Strategies in Repeated General-Sum Games, Ph.D. thesis, Brigham Young University, 2005.
[4] J. Crandall, M. Goodrich, "Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning", Machine Learning, Vol. 82 (3), 2011, pp. 281-314.
[5] A. Damásio, Descartes' Error: Emotion, Reason and the Human Brain (O Erro de Descartes: emoção, razão e cérebro humano), Portugal, Fórum da Ciência, Publicações Europa-América, 1995.
[6] A. Damasio, H. Damasio and A. Bechara, "Emotion, decision making and the orbitofrontal cortex", Cerebral Cortex, Oxford University Press, Oxford, Vol. 10 (3), 2000, pp. 295-307.
[7] D. Chakraborty, P. Stone, "Convergence, targeted optimality, and safety in multiagent learning", ICML 2010, pp. 191-198.
[8] A. Greenwald, K. Hall, "Correlated-Q learning", Proceedings of the International Conference on Machine Learning (ICML), 2003.
[9] S. Gadanho, Reinforcement Learning in Autonomous Robots: an Empirical Investigation of the Role of Emotions, PhD Thesis, Edinburgh University, 1999.
[10] S. Gadanho, L. Custódio, "Learning behavior-selection in a multi-goal robot task", Technical Report RT-701-02, Instituto de Sistemas e Robótica, IST, Lisbon, 2002a.
[11] S. Gadanho, L. Custódio, "Asynchronous learning by emotions and cognition", Proceedings of the Seventh International Conference on Simulation of Adaptive Behavior, From Animals to Animats, 2002b.
[12] S. Gadanho, "Learning behavior-selection by emotions and cognition in a multi-goal robot task", Journal of Machine Learning Research (JMLR), (4), 2003, pp. 385-412.
[13] L. Matignon, G. Laurent and N. Le Fort-Piat, "Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems", The Knowledge Engineering Review, Cambridge University Press, 27(1), 2012, pp. 1-31.
[14] F. Mondada, E. Franzi and P. Ienne, "Mobile robot miniaturization: A tool for investigation in control algorithms", Yoshikawa and Miyazaki (eds.), Experimental Robotics III, Lecture Notes in Control and Information Sciences, London, Springer-Verlag, 1994.
[15] J. Nash, "Equilibrium points in n-person games", Proceedings of the National Academy of Sciences, 36 (1), 1950, pp. 48-49.
[16] M. Osborne, A. Rubinstein, A Course in Game Theory, Cambridge, MA: MIT Press, 1994.
[17] R. Powers, Y. Shoham, "Learning against opponents with bounded memory", Proceedings of IJCAI 2005.
[18] R. Sun, T. Peterson, "Autonomous learning of sequential tasks: experiments and analysis", IEEE Transactions on Neural Networks, Vol. 9 (6), 1998, pp. 1217-1234.
[19] R. Sun, "The CLARION cognitive architecture: extending cognitive modeling to social simulation", Ron Sun (ed.), Cognition and Multi-Agent Interaction, Cambridge University Press, 2006.
[20] R. Sutton, A. Barto, Reinforcement Learning, The MIT Press, 1998.
[21] C. Watkins, Learning from Delayed Rewards, PhD Thesis, Cambridge University, 1989.
[22] C. Watkins, P. Dayan, "Technical note: Q-Learning", Machine Learning, (8), 1992, pp. 279-292.
[23] P. Werbos, "Beyond regression: new tools for prediction and analysis in the behavioral sciences", Harvard, 1974.

