Figure 3: Samples of various types of mazes we want to evaluate an agent on.

Figure 3(a-d) shows samples of different types of mazes that we would like to evaluate a trained agent on in order to test its navigation skills. The experimenter needs to design a twofold training procedure made of a trainable agent (similar to the classic RL paradigm) and a set of arena configurations to train the agent on. For a given algorithm, several training sets will lead to agents acquiring various skills.
For this example, we compare two agents trained using PPO (same hyper-parameters for both, see appendix) (Schulman et al. 2017). One agent is trained to solve a randomised 2 × 2 maze (fig 3a); the other is trained on a curriculum of mazes of increasing difficulty.
We are interested in evaluating the two trained agents on the set of experiments defined above (Figure 3). In Figure 5 we show that both agents reached equilibrium, maximising the rewards they can obtain on their respective training sets. Note that the red curve shows the agent learning over a curriculum, so the difficulty of the training tasks increases with the number of timesteps (hence the initial spike in rewards obtained). We then select the best agents: the best one overall for the agent trained on the 2 × 2 maze, and the best performer on the hardest task of the curriculum. In Table 1 we show both the average rewards and the success rates of the two best agents, evaluated on the four types of mazes defined in Figure 3.

             2 × 2 maze (fig 3a)   3 × 3 maze (fig 3b)   Scrambled maze (fig 3c)   Circular maze (fig 3d)
             A.R.    S.R.          A.R.    S.R.          A.R.    S.R.              A.R.    S.R.
Curriculum   0.70    64%           0.92    70%           0.40    53%               0.03    39%
2 × 2 maze   1.82    99%           1.18    78%           0.51    55%               0.12    41%

Table 1: Average Reward (A.R.) and Success Rate (S.R.) for the two agents, trained on a curriculum and on a randomised 2 × 2 maze respectively, tested on four different types of mazes. The maximum reward an agent can obtain is 2.
As expected, the agent trained on the 2 × 2 maze outperforms the other agent on the same problem, reaching a 99% success rate. This represents a standard reinforcement learning problem and solution.
The agent trained on 2 × 2 mazes shows a significant drop in performance as we move through tests a-d, away from similarity to its training set. Even though it has seen many variations of 2 × 2 mazes, it loses considerable performance when moving to 3 × 3 mazes. The curriculum agent, whilst not achieving the same level of performance on the 2 × 2 maze, shows much less of a drop in performance on the harder problems, achieving near parity with the standard agent on the circular maze.
These results give an idea of how the environment can be used to train and test for animal-like cognitive skills, and of how those skills can transfer across different problem instances. However, introducing variability into training sets adds new considerations for AI research, and also requires more extensive testbeds in order to draw valid conclusions.
Artificial Cognition
Drawing conclusions about the cognitive abilities of animals requires caution, even from the most well-designed experiments. Moving to AI makes this even harder. We have seen that the Animal-AI environment introduces a new paradigm in the field of AI research by dissociating the training and testing environments, making the design of a training curriculum part of the entire training process. In the previous toy example we could, for example, train an agent using PPO on a training curriculum A, and then another agent using Rainbow (Hessel et al. 2017) on a training curriculum B, yielding success rates s1 and s2 on the same test set. However, even if s1 > s2, we could not conclude that PPO is better than Rainbow, as we cannot isolate whether the algorithm used or the training curriculum is the source of the better results. Therefore, instead of comparing training algorithms, one now needs to compare the entire package of algorithm and curriculum combined.
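As a sketch of what this means in practice, the unit of evaluation is the (algorithm, curriculum) pair; the function and attribute names below are placeholders for illustration, not part of the Animal-AI API:

def evaluate_package(train_fn, curriculum, test_set):
    # The score attaches to the whole package: the training algorithm
    # (e.g. PPO or Rainbow) together with the curriculum it was fed.
    agent = train_fn(curriculum)
    solved = sum(agent.solves(task) for task in test_set)
    return solved / len(test_set)

# s1 = evaluate_package(train_ppo, curriculum_a, hidden_test_set)
# s2 = evaluate_package(train_rainbow, curriculum_b, hidden_test_set)
# Even if s1 > s2, the comparison is between packages, not algorithms.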
The methods of comparative psychology provide us with tools for controlling confounds like the above. In order to interpret the results of a test trial in comparative cognition, one must consider a participant's evolutionary and ontogenetic history, previous training, and familiarity with the experimental setup. As noted above, when a participant performs successfully on a cognitive task, researchers carefully rule out plausible alternative explanations - ones in which the participant might have learned to solve the task using capacities other than the targeted skill. Ruling out such alternatives requires proper experimental design and statistical analysis (Bausman and Halina 2018).
One partial solution to dealing with this number of confounds is to define test batteries (Herrmann et al. 2007; Shaw and Schmelz 2017) that increase the number of skills an individual agent is tested on. Compared to experiments with animals, experiments with AI systems are less costly and time intensive, so each skill can also require solving a wide variety of tests. An agent scoring well on only a few tests would likely have overfitted, or found a lucky solution. With a full test battery, an agent's successes can be compared across tests to build a profile of its behaviour and capabilities.

The First Animal-AI Test Battery
The Animal-AI test battery has been released as part of a competition alongside the Animal-AI Environment. The competition format allows us to assess performance on completely hidden tasks. If we had first released the full details of the test battery, we would have lost the opportunity to know how well AI systems can do with zero prior knowledge of the tests. Once the competition has ended we will release the full details of the tests. At that point, it will be even more important for anyone working with the test battery to disclose their training setup, so that it can be determined how much they are training to solve the specific tests, and how much to learn robust food-retrieval behaviour.
We designed 300 experiment configurations split over categories testing for 10 different cognitive skills. We then challenged the AI community to design both training environments and learning algorithms, and to submit agents that they believe display these cognitive skills. We briefly introduce the ten categories here:
1. Basic food retrieval: Most animals are motivated by food, and this is exploited in animal cognition tests. This category tests the agent's ability to reliably retrieve food in the presence of only food items. It is necessary to make sure agents have the same motivation as animals for subsequent tests.
2. Preferences: This category tests an agent's ability to choose the most rewarding course of action. Almost all animals will display preferences for more food, or for food that is easier to obtain, although exact details differ between species. Tests are designed to be unambiguous as to the correct course of action based on the rewards in our environment.
3. Obstacles: This category contains objects that might impede the agent's navigation. To succeed in this category the agent will have to explore its environment, a key component of animal behaviour.
4. Avoidance: This category identifies an agent's ability to detect and avoid negative stimuli. This is a critical capacity shared by biological organisms, and is also an important prerequisite for subsequent tests.
5. Spatial Reasoning: This category tests an agent's ability to understand the spatial affordances of its environment. It tests knowledge of some of the simple physics by which the environment operates, and also the ability to remember previously visited locations.
6. Robustness: This category includes variations of the environment that look superficially different, but for which the affordances and solutions to problems remain the same.
7. Internal Models: In these tests, the lights may turn off and the agent must remember the layout of the environment to navigate in the dark. Most tests are simple in nature, with the light either alternating or only going off after some time.
8. Object Permanence: This category checks whether the agent understands that objects persist even when they are out of sight, as they do in the real world and in our environment.
9. Numerosity: This category tests the agent's ability to make more complex decisions to ensure it gets the highest possible reward. These include situations where decisions must be made between different sets of rewards.
10. Causal Reasoning: This category tests the agent's ability to plan ahead, and contains situations where the consequences of actions need to be considered. All the tests in this category have been passed by some non-human animals, and they include some of the most striking examples of intelligence from across the animal kingdom.
By organising these problems into categories, we can get a profile for each agent, which is more informative than a simple overall test score. Creating agents that can solve such a wide range of problems is a challenging task. We expect it to require sustained effort from the AI community, and that solving it will lead to many innovations in reinforcement learning.

Current State-of-the-art
Human performance on the tests is close to 100%, and most of the tests have been successfully completed by some animals. Some of the easier tests are solvable by nematodes, whilst the most complex tests in the competition are only solved by animals such as corvids and chimps. Each category contains a variety of test difficulties. For example, the first tests in the basic food category spawn a single food item in front of the agent, whereas the final tests contain multiple moving food objects.
The current state of the art in AI scores around 40%. Figure 6 shows the results of the top four competitors at the midway point of the competition. The methods used were all variations of reinforcement learning. We can see that certain categories, corresponding to those closer to traditional problems in reinforcement learning, are easier to solve than others. These include navigation-based problems and problems involving choosing between types of reward. On the other hand, those that require planning, or that necessarily involve keeping accurate models of the world, such as for object permanence, are much harder to solve and will be an important challenge for ongoing research.

Figure 6: Radar plot showing the performance of the top four entries to the Animal-AI Olympics at the half-way point.

Conclusions
The Animal-AI environment is a new AI experimentation and evaluation platform that implements ideas from animal cognition in order to better train AI agents that possess cognitive skills. It is designed to contain only the necessary ingredients for performing cognitive testing, built up from perception and navigation. We start with simple yet crucial tasks that many animals are able to solve. These include reward maximisation, intuitive physics, object permanence, and simple causal reasoning. By mimicking tasks and protocols from animal cognition we can better evaluate the progress of domain-general AI.
The Animal-AI platform is currently in its first iteration, and will continue to be developed as AI progresses and solves the original testbed. There are many tasks that some animals can solve that were considered too complex (for AI to solve) for the first iteration. Once it becomes possible to train agents to solve the first tasks, we will introduce more complex tests. In future versions we also plan to include more complicated substrates and objects, such as liquids, soft bodies and ropes, which are common in animal experiments. A more complex physics engine would allow for embodied interactions between the agent and its environment, creating many more opportunities for animal-like tests. We will also add the possibility of multi-agent training, which could allow for social interactions and social learning, and also include tests that require online learning to solve correctly.
Having said this, we expect the current testbed to remain an open challenge and provide a pathway for AI research for years to come. We have seen incredible results in game-like environments in recent years, and we hope this momentum can translate to solving the kind of problems that animals solve on a daily basis when navigating their environment or foraging for food. The Animal-AI environment is set up to make this transition as easy as possible, and to provide a series of increasingly difficult challenges towards animal-like artificial cognition. Finally, this platform helps bring the communities of AI and comparative cognition closer together, with the potential to disentangle capabilities and behaviours that are associated in animals, but can be disassociated in AI.
References
[Bausman and Halina 2018] Bausman, W., and Halina, M. 2018. Not null enough: pseudo-null hypotheses in community ecology and comparative psychology. Biology & Philosophy 33(3-4):30.
[Beattie et al. 2016] Beattie, C.; Leibo, J. Z.; Teplyashin, D.; Ward, T.; Wainwright, M.; Küttler, H.; Lefrancq, A.; Green, S.; Valdés, V.; Sadik, A.; et al. 2016. DeepMind Lab. arXiv preprint arXiv:1612.03801.
[Bellemare et al. 2012] Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2012. The arcade learning environment: An evaluation platform for general agents. CoRR abs/1207.4708.
[Bellemare et al. 2015] Bellemare, M.; Naddaf, Y.; Veness, J.; and Bowling, M. 2015. The arcade learning environment: An evaluation platform for general agents. In 24th Int. Joint Conf. on Artificial Intelligence.
[Bengio et al. 2009] Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 41–48. New York, NY, USA: ACM.
[Brockman et al. 2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.
[Castelvecchi 2016] Castelvecchi, D. 2016. Tech giants open virtual worlds to bevy of AI programs. Nature News 540(7633):323.
[Cobbe et al. 2018] Cobbe, K.; Klimov, O.; Hesse, C.; Kim, T.; and Schulman, J. 2018. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341.
[Espeholt et al. 2018] Espeholt, L.; Soyer, H.; Munos, R.; Simonyan, K.; Mnih, V.; Ward, T.; Doron, Y.; Firoiu, V.; Harley, T.; Dunning, I.; et al. 2018. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, 1406–1415.
[Ha and Schmidhuber 2018] Ha, D., and Schmidhuber, J. 2018. World models. arXiv preprint arXiv:1803.10122.
[Hernández-Orallo et al. 2017] Hernández-Orallo, J.; Baroni, M.; Bieger, J.; Chmait, N.; Dowe, D. L.; Hofmann, K.; Martínez-Plumed, F.; et al. 2017. A new AI evaluation cosmos: Ready to play the game? AI Magazine 38(3):66–69.
[Hernández-Orallo 2017] Hernández-Orallo, J. 2017. Evaluation in artificial intelligence: From task-oriented to ability-oriented measurement. Artificial Intelligence Review 48(3):397–447.
[Herrmann et al. 2007] Herrmann, E.; Call, J.; Hernández-Lloreda, M. V.; Hare, B.; and Tomasello, M. 2007. Humans have evolved specialized skills of social cognition: The cultural intelligence hypothesis. Science 317(5843):1360–1366.
[Hessel et al. 2017] Hessel, M.; Modayil, J.; van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M. G.; and Silver, D. 2017. Rainbow: Combining improvements in deep reinforcement learning. CoRR abs/1710.02298.
[Jaderberg et al. 2018] Jaderberg, M.; Czarnecki, W. M.; Dunning, I.; Marris, L.; Lever, G.; Castaneda, A. G.; Beattie, C.; et al. 2018. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281.
[Johnson et al. 2016] Johnson, M.; Hofmann, K.; Hutton, T.; and Bignell, D. 2016. The Malmo platform for artificial intelligence experimentation. In IJCAI, 4246–4247.
[Juliani et al. 2018] Juliani, A.; Berges, V.; Vckay, E.; Gao, Y.; Henry, H.; Mattar, M.; and Lange, D. 2018. Unity: A general platform for intelligent agents. CoRR abs/1809.02627.
[Juliani et al. 2019] Juliani, A.; Khalifa, A.; Berges, V.-P.; Harper, J.; Henry, H.; Crespi, A.; Togelius, J.; and Lange, D. 2019. Obstacle Tower: A generalization challenge in vision, control, and planning. arXiv preprint arXiv:1902.01378.
[Kabadayi, Bobrowicz, and Osvath 2018] Kabadayi, C.; Bobrowicz, K.; and Osvath, M. 2018. The detour paradigm in animal cognition. Animal Cognition 21(1):21–35.
[Kempka et al. 2016] Kempka, M.; Wydmuch, M.; Runc, G.; Toczek, J.; and Jaśkowski, W. 2016. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In Conference on Computational Intelligence and Games (CIG), 1–8. IEEE.
[Machado et al. 2018] Machado, M. C.; Bellemare, M. G.; Talvitie, E.; Veness, J.; Hausknecht, M.; and Bowling, M. 2018. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. J. of Artificial Intelligence Research 61:523–562.
[Olton 1979] Olton, D. S. 1979. Mazes, maps, and memory. American Psychologist 34(7):583.
[OpenAI et al. 2018] OpenAI; Andrychowicz, M.; Baker, B.; Chociej, M.; Józefowicz, R.; McGrew, B.; Pachocki, J. W.; Pachocki, J.; Petron, A.; Plappert, M.; Powell, G.; Ray, A.; Schneider, J.; Sidor, S.; et al. 2018. Learning dexterous in-hand manipulation. CoRR abs/1808.00177.
[Osband et al. 2019] Osband, I.; Doron, Y.; Hessel, M.; Aslanides, J.; Sezener, E.; Saraiva, A.; McKinney, K.; Lattimore, T.; Szepesvari, C.; et al. 2019. Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568.
[Packer et al. 2018] Packer, C.; Gao, K.; Kos, J.; Krähenbühl, P.; Koltun, V.; and Song, D. 2018. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282.
[Pellow et al. 1985] Pellow, S.; Chopin, P.; File, S. E.; and Briley, M. 1985. Validation of open:closed arm entries in an elevated plus-maze as a measure of anxiety in the rat. Journal of Neuroscience Methods 14(3):149–167.
[Perez-Liebana et al. 2016] Perez-Liebana, D.; Samothrakis, S.; Togelius, J.; Schaul, T.; Lucas, S. M.; Couëtoux, A.; Lee, J.; Lim, C.-U.; and Thompson, T. 2016. The 2014 general video game playing competition. IEEE Trans. on Computational Intelligence and AI in Games 8(3):229–243.
[Schulman et al. 2017] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.
[Shaw and Schmelz 2017] Shaw, R. C., and Schmelz, M. 2017. Cognitive test batteries in animal cognition research: evaluating the past, present and future of comparative psychometrics. Animal Cognition 20(6):1003–1018.
[Shettleworth 2009] Shettleworth, S. J. 2009. Cognition, Evolution, and Behavior. Oxford University Press.
[Silver et al. 2016] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529:484–503.
[Silver et al. 2017] Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR abs/1712.01815.
[Thorndike 1911] Thorndike, E. 1911. Animal Intelligence: Experimental Studies. The Macmillan Company.
[Todorov, Erez, and Tassa 2012] Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026–5033.
[Vinyals et al. 2017] Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J.; Quan, J.; Gaffney, S.; Petersen, S.; et al. 2017. StarCraft II: A new challenge for reinforcement learning. CoRR abs/1708.04782.
[Wasserman and Zentall 2006] Wasserman, E. A., and Zentall, T. R. 2006. Comparative Cognition: Experimental Explorations of Animal Intelligence. Oxford University Press, USA.
Appendix
In this appendix, we provide further details about the environment. We first describe the characteristics of the Unity-based setup, before showing how one can use the Animal-AI environment to define experiments, configure arenas, and train/test agents.
All source code is available online, along with details of the competition:
• Competition details: http://animalaiolympics.com/
• API source code, environment and instructions: https://github.com/beyretb/AnimalAI-Olympics
• Environment source code: https://github.com/beyretb/AnimalAI-Environment

Figure 8: Basic elements of an arena. (a) Empty arena (from above); (b) Agent: the black dot marks forward.
Basics of the Environment
The Animal-AI environment is built on top of Unity ML-Agents, which itself relies on the game engine Unity. We therefore use the internal physics engine to model the control and motion of the agent and the objects. The coordinate system in Unity is set as shown in figure 7, where z is forward, y is up and x is right. A unit on each of these axes can be seen as a meter: mass is in kg, and time is proportional to wall time and is machine independent (therefore velocity is uniform across machines).

Rewards As mentioned earlier, all the animal experiments we are interested in contain at least one piece of food. In our setup this is represented by three types of spheres, which bear positive or negative rewards and might or might not terminate an episode:
• GoodGoal (fig 9a): a positive reward represented by a green sphere of diameter d ∈ [1, 5] and mass 1, bearing a reward equal to d. It is always green and always terminates an episode if touched by the agent.
• BadGoal (fig 9b): a negative reward represented by a red sphere of diameter d ∈ [1, 5] and mass 1, bearing a reward equal to −d. It is always red and always terminates an episode if touched by the agent.
• GoodGoalMulti (fig 9c): a positive reward represented by a golden sphere of diameter d ∈ [1, 5] and mass 1, bearing a reward equal to d. It is always gold and will terminate an episode only if all the GoodGoalMulti have been collected and there is no GoodGoal in the arena.
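For illustration, the reward and termination rules above can be condensed into a short sketch; the helper below is hypothetical and not part of the Animal-AI API:

def on_goal_touched(goal_type, diameter, multis_remaining, good_goal_present):
    # Return (reward, episode_done) when the agent touches a sphere.
    if goal_type == "GoodGoal":
        return diameter, True            # reward d, always ends the episode
    if goal_type == "BadGoal":
        return -diameter, True           # reward -d, always ends the episode
    if goal_type == "GoodGoalMulti":
        # Ends the episode only once this is the last remaining multi
        # and no GoodGoal is present in the arena.
        done = multis_remaining == 1 and not good_goal_present
        return diameter, done
    raise ValueError("unknown goal type: " + goal_type)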
Figure 10: Zones on the ground. (a) Death zone; (b) Hot zone.

• CylinderTunnelTransparent (fig 12e): same as above, but the tunnel is transparent and its colour can be neither set nor randomised
• Ramp (fig 12c): a ramp the agent can climb on, of size (x, y, z) ∈ [0.5, 40] × [0.1, 10] × [0.5, 40]; its colour can be set or randomised
Figure 12: Immovable objects. (a) Opaque wall; (b) Transparent wall; (c) Ramp.

Movable Objects In order to assess the cognitive capabilities of an animal, experimenters often require it to interact with objects, pushing, pulling or lifting various types of objects in order to obtain a reward. We provide such objects in different forms:
• Cardbox1 (fig 11a): a light box made of cardboard, of mass 1 and of size (x, y, z) ∈ [0.5, 10]³
• Cardbox2 (fig 11b): a heavier box made of cardboard, of mass 2 and of size (x, y, z) ∈ [0.5, 10]³
• LObject (fig 11d): a stick-like object of mass 3 and of size (x, y, z) ∈ [1, 5] × [0.3, 2] × [3, 20]
• LObject2 (fig 11e): a stick-like object of mass 3 and of size (x, y, z) ∈ [1, 5] × [0.3, 2] × [3, 20], symmetric of the LObject
• UObject (fig 11c): a stick-like object of mass 3 and of size (x, y, z) ∈ [1, 5] × [0.3, 2] × [3, 20]
!ArenaConfig
arenas:
  0: !Arena
    t: 600
    blackouts: [5,10,15,20,25]
    items:
    - !Item
      name: Wall
      positions:
      - !Vector3 {x: 10, y: 0, z: 10}
      - !Vector3 {x: -1, y: 0, z: 30}
      colors:
      - !RGB {r: 204, g: 0, b: 204}
      rotations: [45]
      sizes:
      - !Vector3 {x: -1, y: 5, z: -1}
    - !Item
      name: CylinderTunnel
      colors:
      - !RGB {r: 204, g: 0, b: 204}
      - !RGB {r: 204, g: 0, b: 204}
      - !RGB {r: 204, g: 0, b: 204}
    - !Item
      name: GoodGoal

Config. 1: YAML file for figure 13a

First of all, this arena has a time limit defined by t: 600 timesteps. If this was set to 0, the episode would only end once a reward is collected. We then define a list of items, three in this case: Wall, CylinderTunnel and GoodGoal. For each item we can specify lists of values for positions, rotations, colors and sizes. If none of these is specified, all the characteristics of the item that can be randomised will be. For example, the GoodGoal above will have a random location, size and rotation, but will be green, as this is always true.
For a given item, the number of instances we will attempt to spawn is equal to max(|positions|, |colors|, |sizes|, |rotations|), where |·| is the length of a list. For example, above we will attempt to spawn two Wall and three CylinderTunnel. We use a simple Vector3 to provide values for positions and sizes, which can contain floats, and we use a similar !RGB object to provide values for colors.
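For illustration, this spawn-count rule can be written as a few lines of Python; the dictionary encoding of items is our own simplification:

def num_spawn_attempts(item):
    # Longest parameter list wins; an item with no lists spawns once,
    # fully randomised.
    keys = ("positions", "colors", "sizes", "rotations")
    return max(len(item.get(k, [])) for k in keys) or 1

wall = {"positions": [(10, 0, 10), (-1, 0, 30)],
        "colors": [(204, 0, 204)], "rotations": [45],
        "sizes": [(-1, 5, -1)]}
tunnel = {"colors": [(204, 0, 204)] * 3}
assert num_spawn_attempts(wall) == 2      # two Wall instances
assert num_spawn_attempts(tunnel) == 3    # three CylinderTunnel instances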
Items Spawn
When provided with an arena configuration, the Animal-AI environment will attempt to spawn as many items as possible. However, it does not allow items to collide at inception. This means that if item1 and item2 are defined one after the other in the configuration file and they overlap, only item1 will be spawned. One can, however, spawn objects on top of each other as long as they do not collide at t = 0; they can of course fall onto each other.
When spawning items, the environment will start with the items listed first in the configuration file and move down the list. If any parameter is randomised, the environment will create the object with a certain value for the randomised parameter. It will then attempt to spawn the item where it should appear: if the spot is free, the object will spawn there; otherwise the item is deleted, a new one is created with a new value for the randomised parameter, and the environment tries again. This loop runs for 20 iterations in total; if at the end no spot has been found for the object, it is ignored and the environment moves on to the next item in the list. This means that for long configuration files that include randomness, the last items in the list are less likely to spawn than the first.
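The retry logic can be sketched as follows; collision detection is reduced to grid-cell occupancy purely for illustration, and the 40 × 40 floor is an assumption based on the size ranges quoted above:

import random

MAX_ATTEMPTS = 20
ARENA_SIDE = 40   # assumed floor size

def spawn_all(n_items):
    placed = []                                   # occupied (x, z) cells
    for _ in range(n_items):                      # first items get priority
        for _ in range(MAX_ATTEMPTS):
            spot = (random.randrange(ARENA_SIDE),
                    random.randrange(ARENA_SIDE))
            if spot not in placed:                # spot is free: spawn here
                placed.append(spot)
                break
        # after 20 failed attempts the item is silently ignored
    return placed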
Example: Maze Curriculum
An example application of such configuration files during training is the PPO agent we trained earlier. We can define a series of YAML files, each of increasing difficulty, and train the agent on each one after the other.
Configurations 2, 3 and 4 are the configuration files used for the arenas shown in figures 4a, 4b and 4d respectively. In our example we created arenas with walls randomly placed along the x and z axes at given values, adding one or two walls per level. We switch from one level of difficulty to the next once the agent reaches a certain success rate.
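A minimal sketch of such a curriculum loop is shown below; the file names, threshold and agent interface are illustrative assumptions, not the training code used in our experiments:

LEVELS = ["maze_one_wall.yaml", "maze_three_walls.yaml", "maze_full.yaml"]
SUCCESS_THRESHOLD = 0.9          # assumed switching criterion

def train_curriculum(agent, make_env, episodes_per_eval=100):
    for config in LEVELS:                # levels of increasing difficulty
        env = make_env(config)
        while True:
            successes = sum(agent.train_episode(env)    # 1 if food reached
                            for _ in range(episodes_per_eval))
            if successes / episodes_per_eval >= SUCCESS_THRESHOLD:
                break                    # move on to the next level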
!ArenaConfig
arenas:
  0: !Arena
    t: 250
    items:
    - !Item
      name: Wall
      positions:
      - !Vector3 {x: -1, y: 0, z: 10}
      rotations: [90]
      sizes:
      - !Vector3 {x: 1, y: 5, z: 9}
    - !Item
      name: GoodGoal
      positions:
      - !Vector3 {x: -1, y: 0, z: 35}
      sizes:
      - !Vector3 {x: 2, y: 2, z: 2}
    - !Item
      name: Agent
      positions:
      - !Vector3 {x: -1, y: 1, z: 5}

Config. 2: YAML file for figure 4a

!ArenaConfig
arenas:
  0: !Arena
    t: 400
    items:
    - !Item
      name: Wall
      positions:
      - !Vector3 {x: -1, y: 0, z: 10}
      - !Vector3 {x: -1, y: 0, z: 20}
      - !Vector3 {x: -1, y: 0, z: 30}
      rotations: [90,90,90]
      sizes:
      - !Vector3 {x: 1, y: 5, z: 9}
      - !Vector3 {x: 1, y: 5, z: 9}
      - !Vector3 {x: 1, y: 5, z: 9}
    - !Item
      name: GoodGoal
      positions:
      - !Vector3 {x: -1, y: 0, z: 35}
      sizes:
      - !Vector3 {x: 2, y: 2, z: 2}
    - !Item
      name: Agent
      positions:
      - !Vector3 {x: -1, y: 1, z: 5}

Config. 3: YAML file for figure 4b

!ArenaConfig
arenas:
  0: !Arena
    t: 500
    items:
    - !Item
      name: GoodGoal
      sizes:
      - !Vector3 {x: 2, y: 2, z: 2}
    - !Item
      name: Wall
      positions:
      - !Vector3 {x: -1, y: 0, z: 5}
      - !Vector3 {x: -1, y: 0, z: 10}
      - !Vector3 {x: -1, y: 0, z: 15}
      - !Vector3 {x: -1, y: 0, z: 20}
      - !Vector3 {x: -1, y: 0, z: 25}
      - !Vector3 {x: -1, y: 0, z: 30}
      - !Vector3 {x: -1, y: 0, z: 35}
      - !Vector3 {x: 5, y: 0, z: -1}
      - !Vector3 {x: 10, y: 0, z: -1}
      - !Vector3 {x: 15, y: 0, z: -1}
      - !Vector3 {x: 20, y: 0, z: -1}
      - !Vector3 {x: 25, y: 0, z: -1}
      - !Vector3 {x: 30, y: 0, z: -1}
      - !Vector3 {x: 35, y: 0, z: -1}
      rotations: [90,90,90,90,90,90,90,0,0,0,0,0,0,0]
      sizes:
      - !Vector3 {x: 1, y: 5, z: 9}
      - ... # idem x6
      - !Vector3 {x: 1, y: 5, z: 9}

Config. 4: YAML file for figure 4d
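As a side note, these files use custom YAML tags, so a plain parser needs constructors registered before it can read them. The animalai package provides its own configuration loading; the PyYAML stand-in below, with a hypothetical file name, is only for inspecting the structure:

import yaml

# Turn every tagged node into a plain dict, just to inspect the structure.
for tag in ("!ArenaConfig", "!Arena", "!Item", "!Vector3", "!RGB"):
    yaml.SafeLoader.add_constructor(
        tag, lambda loader, node: loader.construct_mapping(node, deep=True))

with open("config2.yaml") as f:          # hypothetical file name
    config = yaml.safe_load(f)
print(config["arenas"][0]["t"])          # -> 250 for Config. 2 above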
Running the Environment
The Animal-AI environment can be used in two different ways: play mode or training/inference mode. Both modes accept configurations as described in the previous parts as inputs. Play mode allows participants to visualise an arena after defining configuration files, in order to assess their correctness. It also allows the user to control the agent, in order to explore and solve the problem at hand.

Play Mode Running the executable will start an arena containing all the objects cited above. As shown in figure 14, the user can switch between a bird's-eye view (fig. 14a) and a first-person view (fig. 14b). In first person the agent can be controlled using W, A, S, D on the keyboard. Pressing C changes the view and R resets the arena. The scores of the previous and current episodes are shown in the top left corner.
Training and Inference Mode To interface with the environment we provide two similar ways: (i) a classic OpenAI Gym API and (ii) a Unity ML-Agents API. Both contain the usual functions: reset, which takes as an optional argument a configuration as defined earlier, and step, which sends actions to the environment and returns the observations, rewards and additional information needed for training and inference.
The PPO hyper-parameters used for the agents in the main text include:
• max steps: 10e6
• num epoch: 3
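A minimal sketch of the resulting interaction loop, assuming the Gym-style API described above (the environment constructor and import path are omitted; see the repository for the exact interface):

def run_episode(env, arena_config):
    obs = env.reset(arena_config)          # reset takes an optional config
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()           # random policy
        obs, reward, done, info = env.step(action)   # classic Gym step
        total_reward += reward
    return total_reward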