
NEWS & VIEWS


Learning to see and act

An artificial-intelligence system uses machine learning from massive training sets to teach itself to play 49 classic computer
games, demonstrating that it can adapt to a variety of tasks. See Letter p.529

BERNHARD SCHÖLKOPF

Improvements in our ability to process large amounts of data have led to progress in many areas of science, not least artificial intelligence (AI). With advances in machine learning has come the development of machines that can learn intelligent behaviour directly from data, rather than being explicitly programmed to exhibit such behaviour. For instance, the advent of big data has resulted in systems that can recognize objects or sounds with considerable precision. On page 529 of this issue, Mnih et al.¹ describe an agent that uses large data sets to teach itself how to play 49 classic Atari 2600 computer games by looking at the pixels and learning actions that increase the game score. It beat a professional games player in many instances, a remarkable example of the progress being made in AI.

In machine learning, systems are trained to infer patterns from observational data. A particularly simple type of pattern, a mapping between input and output, can be learnt through a process called supervised learning. A supervised-learning system is given training data consisting of example inputs and the corresponding outputs, and comes up with a model to explain those data (a process called function approximation). It does this by choosing from a class of models specified by the system's designer. Designing this class is an art: its size and complexity should reflect the amount of training data available, and its content should reflect prior knowledge that the designer of the system considers useful for the problem at hand. If all this is done well, the inferred model will then apply not only to the training set, but also to other data that adhere to the same underlying pattern.

The rapid growth of data sets means that machine learning can now use complex model classes and tackle highly non-trivial inference problems. Such problems are usually characterized by several factors: the data are multidimensional; the underlying pattern is complex (for instance, it might be nonlinear or changeable); and the designer has only weak prior knowledge about the problem; in particular, a mechanistic understanding is lacking.

The human brain repeatedly solves non-trivial inference problems as we go about our daily lives, interpreting high-dimensional sensory data to determine how best to control all the muscles of the body. Simple supervised learning is clearly not the whole story, because we often learn without a supervisor telling us the outputs of a hypothetical input–output function. Here, reinforcement has a central role in learning behaviours from weaker supervision. Machine learning adopted this idea to develop reinforcement-learning algorithms, in which supervision takes the form of a numerical reward signal², and the goal is for the system to learn a policy that, given the current state, determines which action to pick to maximize an accumulated future reward.

Mnih et al. use a form of reinforcement learning known as Q-learning³ to teach systems to play a set of 49 vintage video games, learning how to increase the game score as a numerical reward. In Q-learning, Q*(s, a) represents the accumulated future reward if, in state s, the system first performs action a and subsequently follows an optimal policy. The system tries to approximate Q* by using an artificial neural network (a function approximator loosely inspired by biological neural networks) called a deep Q-network (DQN).
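The quantity Q*(s, a) described above can be illustrated with a minimal tabular sketch. The toy chain environment and the hyperparameters below are illustrative assumptions of this sketch, not taken from the Letter; Mnih et al. replace the value table with a deep network over raw pixels.

```python
import random

# Tabular Q-learning on a toy 5-state chain (illustrative assumptions:
# the environment, learning rate, discount and exploration rate are
# invented for this sketch). Actions: 0 = left, 1 = right; reaching
# state 4 gives reward 1 and ends the episode, so the learnt Q values
# should favour moving right.
N_STATES = 5
ACTIONS = (0, 1)
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

random.seed(0)
for _ in range(200):                                  # training episodes
    state = 0
    while state != N_STATES - 1:
        # Epsilon-greedy: mostly exploit the current estimate of Q*,
        # occasionally explore with a random action.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward = step(state, action)
        # One-step Q-learning update towards r + gamma * max_a' Q(s', a').
        target = reward + GAMMA * max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = nxt

print(round(Q[(3, 1)], 3), round(Q[(3, 0)], 3))
```

In the Atari setting the state is a stack of game screens and the table is replaced by the DQN, but the update target has the same form.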



[Figure 1 schematic: video-game emulator → image convolutions → hidden layers → game controller action values]
Figure 1 | Computer gamer. Mnih et al.¹ have designed an artificial-intelligence system, using a deep Q-network (DQN), that learns how to play 49 video games. The DQN analyses a sequence of four game screens simultaneously and approximates, for each possible action it can make, the consequences on the future game score if that action is taken and followed by the best possible course of subsequent actions. The first layers of the DQN analyse the pixels of the game screen and extract information from more and more specialized visual features (image convolutions). Subsequent, fully connected hidden layers predict the value of actions from these features. The last layer is the output: the action taken by the DQN. The possible outputs depend on the specific game the system is playing; everything else is the same in each of the 49 games.
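The caption's progression from raw pixels to increasingly specialized features can be made concrete by checking how each convolution shrinks the screen. The kernel, stride and filter counts below are those reported in Mnih and colleagues' Letter (an assumption here, since the commentary describes the layers only qualitatively):

```python
# A "valid" convolution maps spatial width W to floor((W - K) / S) + 1
# for kernel size K and stride S. Layer parameters below (assumed, from
# Mnih et al.'s Letter): 8x8 stride 4, 4x4 stride 2, 3x3 stride 1.
def conv_out(width, kernel, stride):
    return (width - kernel) // stride + 1

width = 84                       # screens downsampled to 84 x 84 pixels
for kernel, stride, filters in [(8, 4, 32), (4, 2, 64), (3, 1, 64)]:
    width = conv_out(width, kernel, stride)
    print(f"feature map: {width} x {width} x {filters}")
# The final feature map is flattened and passed to the fully connected
# layers, which output one estimated action value per controller action.
```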

486 | NATURE | VOL 518 | 26 FEBRUARY 2015
© 2015 Macmillan Publishers Limited. All rights reserved.

The DQN's input (the pixels from four consecutive game screens) is processed by connected hidden layers of computations, which extract more and more specialized visual features to help approximate the complex nonlinear mapping between inputs and the value of possible actions, for instance the value of a move in each possible direction when playing Space Invaders (Fig. 1).

The system picks output actions on the basis of its current estimate of Q*, thereby exploiting its knowledge of a game's reward structure, and intersperses the predicted best action with random actions to explore uncharted territory. The game then responds with the next game screen and a reward signal equal to the change in the game score. Periodically, the network uses inputs and rewards to update the DQN parameters, attempting to move closer to Q*. Much thought went into how exactly to do this, given that the agent collects its own training data over time. As such, the data are not independent from a statistical point of view, implying that most of statistical theory does not apply. The authors store past experiences in the system's memory and subsequently re-train on them, a procedure they liken to hippocampal processes during sleep. They also report that the system benefits from randomly permuting these experiences.
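The two data tricks just described, interspersing random actions and re-training on randomly sampled past experiences, can be sketched as follows. The buffer size, batch size, exploration rate and stand-in environment are illustrative assumptions; the actual DQN updates a neural network rather than relying on placeholder value estimates.

```python
import random
from collections import deque

# Sketch of epsilon-greedy exploration plus experience replay
# (sizes and rates below are illustrative assumptions).
random.seed(1)
ACTIONS = (0, 1, 2, 3)                   # e.g. joystick directions
EPSILON = 0.1
replay_buffer = deque(maxlen=10_000)     # finite memory of past experiences

def choose_action(q_values):
    """Mostly pick the predicted best action, sometimes a random one."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_values[a])

def sample_batch(batch_size=32):
    """Random minibatch: permuting experiences breaks up the temporal
    correlations in the agent's self-collected data."""
    return random.sample(list(replay_buffer), min(batch_size, len(replay_buffer)))

# Toy interaction loop with dummy dynamics standing in for the emulator.
state = 0
for t in range(100):
    q_values = [0.0, 0.0, 0.0, 0.0]      # a real DQN would predict these
    action = choose_action(q_values)
    next_state, reward = state + 1, float(action == 2)
    replay_buffer.append((state, action, reward, next_state))
    state = next_state
    batch = sample_batch()               # transitions used for the update

print(len(replay_buffer), len(batch))
```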
There are several interesting aspects of Mnih and colleagues' paper. First, the system's performances are comparable to those of a human games tester. Second, the approach displays impressive adaptability. Although each system was trained using data from one game, the prior knowledge that went into the system design was essentially the same for all 49 games; the systems essentially differed only in the data they had been trained on. Finally, the main methods used have been around for several decades, making Mnih and colleagues' engineering feat all the more commendable.

What is responsible for the impressive performance of Mnih and colleagues' system, also reported for another DQN⁴? It may be largely down to improved function approximation using deep networks. Even though the size of the game screens produced by the emulator is reduced by the system to 84 × 84 pixels, the problem's dimensionality is much higher than that of most previous applications of reinforcement learning. Also, Q* is highly nonlinear, which calls for a rich nonlinear function class to be used as an approximator. This type of approximation can be accurately made only using huge data sets (which the game emulator can produce), state-of-the-art function learning and considerable computing power.

Some fundamental issues remain open, however. Can we mathematically understand reinforcement learning from dependent data, and develop algorithms that provably work? Is it sufficient to learn statistical associations, or do we need to take into account the underlying causal structure, describing, say, which pixels causally influence others? This may help in finding relevant parts of the state space (for example, identifying which sets of pixels form a relevant entity, such as an alien in Space Invaders); in avoiding superstitious behaviour, in which statistical associations may be misinterpreted as causal; and in making systems more robust with respect to data-set shifts, such as changes in the behaviours or visual appearance of game characters³,⁵,⁶. And how should we handle latent learning, the fact that biological systems also learn when no rewards are present? Could this help us to handle cases in which the dimensionality is even higher and the key quantities are hidden in a sea of irrelevant information?

In the early days of AI, beating a professional chess player was held by some to be the gold standard. This has now been achieved, and the target has shifted as we have grown to understand that other problems are much harder for computers, in particular problems involving high dimensionalities and noisy inputs. These are real-world problems, at which biological perception–action systems excel and machine learning outperforms conventional engineering methods. Mnih and colleagues may have chosen the right tools for this job, and a set of video games may be a better model of the real world than chess, at least as far as AI is concerned.

Bernhard Schölkopf is at the Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany.
e-mail:

1. Mnih, V. et al. Nature 518, 529–533 (2015).
2. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 1998).
3. Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, Univ. Cambridge (1989).
4. Guo, X., Singh, S., Lee, H., Lewis, R. L. & Wang, X. Adv. Neural Inf. Process. Syst. 27 (2014).
5. Bareinboim, E. & Pearl, J. in Proc. 25th AAAI Conf. on Artificial Intelligence 100–108 (2011).
6. Schölkopf, B. et al. in Proc. 29th Int. Conf. on Machine Learning 1255–1262 (Omnipress, 2012).
BIODIVERSITY

The benefits of traditional knowledge

A study of two Balkan ethnic groups living in close proximity finds that traditional knowledge about local plant resources helps communities to cope with periods of famine, and can promote the conservation of biodiversity.

MANUEL PARDO-DE-SANTAYANA & MANUEL J. MACÍA

Understanding how human groups obtain, manage and perceive their local resources, particularly the plants they use as food and medicine, is crucial for ensuring that those communities can continue to live and benefit from their local ecosystems in a sustainable way. The study of these complex interactions between plants and people is the aim of an integrative discipline known as ethnobotany, which is based on methods derived mainly from botany and anthropology¹. Most ethnobotanical research reveals that traditional knowledge about local edible and healing resources is suffering an alarming decline², especially in Europe³. However, writing in Nature Plants, Quave and Pieroni⁴ suggest that wild plants still have an essential role for communities living in the mountains of Kukës, one of the poorest districts of Albania. Their results also show how preserving local knowledge is linked to maintaining biodiversity.

The mountains of Kukës lie in the Balkans, a hotspot of cultural and biological diversity that has suffered major political and economic shifts over the past three decades. Quave and Pieroni studied two culturally and linguistically distinct rural Islamic ethnic groups (the Gorani and Albanians) that, despite living in close proximity in this region and facing similar environmental and economic conditions, have remained relatively isolated from one another. The two groups use wild plants in different ways, giving the authors an opportunity to investigate the role of cultural factors in shaping how the local flora is understood and used in daily life, health practices and, ultimately, survival. Among the various quantitative techniques used, the authors designed a simple but innovative tool to compare the cultural similarities and differences between the two groups' use of plant species.

The researchers report significant variation in the plant species used for medicinal purposes by the two ethnic groups. A plausible explanation for this is that the spread of health-related lore requires a high degree of affinity, because trying a new remedy requires a great deal of trust⁵. Health is a sensitive topic, so people accept advice mainly from knowledgeable relatives or friends belonging to the same ethnic group⁶. Moreover, many traditional remedies have a highly symbolic component, and the mechanisms by which they are believed to bring about healing can lie totally or partially in the remedy's cultural meaning⁷. Quave and Pieroni find only two species,
