
Neuron

Previews

How to Perfect a Chocolate Soufflé and Other Important Problems
Timothy E.J. Behrens1,2,* and Gerhard Jocham1,*
1FMRIB Centre, University of Oxford, John Radcliffe Hospital, Oxford OX3 9DU, UK
2Wellcome Trust Centre for Neuroimaging, 12 Queen Square, London WC1N 3BG, UK
*Correspondence: behrens@fmrib.ox.ac.uk (T.E.J.B.), gjocham@fmrib.ox.ac.uk (G.J.)
DOI 10.1016/j.neuron.2011.07.004

When learning to achieve a goal through a complex series of actions, humans often group several actions into
a subroutine and evaluate whether the subroutine achieved a specific subgoal. A new study reports brain
responses consistent with such hierarchical reinforcement learning.

To culinary novices like ourselves, it seems something of a miracle that the chocolate soufflé came into existence. Baking a good soufflé requires so many complex steps and processes (http://www.bbcgoodfood.com/recipes/2922/hot-chocolate-souffl-) that, at first glance, it would seem to be an impossible art to perfect. When the first soufflé failed to rise, how did the chef know, for example, whether the ganache was under-velvety, or the crème pâtissière over-floury? Current theories of how the brain learns from its successes and failures offer scant advice to the budding soufflist. However, in this issue of Neuron, Ribas-Fernandes and colleagues (2011) demonstrate neural correlates of a learning strategy that dramatically simplifies not only this important problem, but also nearly every real-world example of human learning.

Reinforcement learning (RL) is a central feature of human and animal behavior. Actions that result in good outcomes (termed rewards or reinforcers) are repeated more often than those that do not, increasing the likely number of future rewards. This simplistic form of learning can be ameliorated by keeping an estimate of precisely how much reward can be expected from any given action (an action's value). Now, high-value actions may be repeated more frequently than low-value ones, and, when outcomes are different from what was expected, action values may be updated to drive future behavior. This difference between received and expected reward is termed the reward prediction error (RPE) and is thought to be a major neural substrate for learning and behavioral control. Dopamine neurons in the primate and rodent midbrain show firing rate changes that appear remarkably consistent with prediction error signaling: firing rates increase when a reward is better than expected and decrease when worse than expected (Schultz, 2007). In rodents, causal interference with these neurons induces artificial learning (Tsai et al., 2009). In human imaging studies, it is also possible to find midbrain prediction-error signals (D'Ardenne et al., 2008), but, for technical reasons, such signals are more commonly found in dopaminoceptive regions in the striatum (O'Doherty, 2004) and prefrontal cortex (Rushworth and Behrens, 2008).

RL has had a tremendous impact on cognitive neuroscience due to its power in explaining behavioral and neural data. However, in the real world, simple actions rarely lead directly to rewards. Instead, the pursuit of reward (or soufflé) often requires many actions to be taken, each depending on the last. In such a world, it is a complex problem to understand how learning should occur when an outcome is different from expected (the soufflé won't rise), as it is not clear which actions or combinations of actions should be held responsible for a prediction error, and therefore which should be adjusted for the next attempt. Solving this problem using a standard RL approach becomes exponentially more difficult as the number of actions increases. Learning to cook a soufflé would seem an intractable problem!

In a complex world, then, standard RL approaches suffer because it is difficult to evaluate intermediate actions with respect to the final outcome, because they cannot distinguish one type of error from another, and because the number of possible actions they might choose from is immense. It is clear, however, that humans have more sophisticated strategies in their learning armory. One such strategy, well known to both computer scientists and chefs, is termed hierarchical reinforcement learning (HRL; Botvinick et al., 2009). Here, sequences of actions may be grouped together into subroutines (make a ganache or whip some egg whites). Each of these subroutines may be evaluated according to its own subgoals, and if these subgoals are not met, they will generate their own prediction errors. These pseudo-reward prediction errors (PPEs) are distinct from reward prediction errors because they are not associated with eventual reward, but with an internally set subgoal that is a stepping stone toward the eventual outcome. Hence, in a hierarchical framework, RPEs are used to learn which combinations of subroutines lead to rewarding outcomes, whereas PPEs are used to learn which combinations of actions (and sub-subroutines!) lead to a subgoal. Because they may only be attributed to the small number of actions in the subroutine, PPEs substantially reduce the complexity of learning (Figure 1): if the egg whites are droopy, it cannot be the chocolate's fault!

It is the neural correlates of these PPEs that form the focus of Ribas-Fernandes et al. (2011). Here, we suspect mainly for practical reasons, subjects were not asked to bake soufflés in the MRI scanner. Instead, they performed a task devised in the world of robotics to probe HRL. Using a joystick, participants navigated a lorry to collect a package and

Neuron 71, July 28, 2011 © 2011 Elsevier Inc. 203



Figure 1. Conventional Chef Is Confused and Has No Soufflé, but Fortunately Hierarchical Chef Has Enough Soufflé for Everybody
In conventional reinforcement learning (A), the agent goes through all steps until the final goal is reached. If the soufflé is worse than expected, any of the actions may be to blame. The learning problem can be drastically simplified by hierarchical reinforcement learning (HRL, B). In this example, the agent learns three subroutines (SR1–SR3). Each of these subroutines leads to its associated subgoal (SG1–SG3). If one of the subgoals is not achieved, only the three candidate actions of the corresponding subroutine need to be evaluated.

deliver it to a target location. In this task, there is one final goal (delivery of the package to the target), which can be split into two subroutines (driving to collect the package and transporting the package to the target). Ingeniously, in some trials the experimenter moves the package such that the distance to the subgoal (the package) will change but the overall distance to the eventual target will remain the same. This causes a PPE with no associated RPE (as the subject may be further from the package but is equally far from eventual reward). In other trials, the experimenter again moves the package, but now to a spot selected such that distances to both subgoal and target remain the same, eliciting neither type of prediction error. Hence, by comparing neural activity between these trial types, the authors are able to isolate responses caused by PPEs.

How, then, would the brain respond to a pseudo-reward prediction error? A number of possibilities seemed reasonable. Hierarchical organization is already thought to exist in the lateral prefrontal cortex, with more rostral regions representing more abstract and temporally extended plans (make ganache) and more caudal regions executing more concrete and immediate actions (snap chocolate bar) (Koechlin et al., 2003). Might hierarchical PPE mechanisms utilize this existing hierarchy? Alternatively, representations of specific goals and outcomes can be found in the ventromedial prefrontal and orbitofrontal (Burke et al., 2008) cortices. Might these same regions update subgoal representations? In a series of three experiments, the authors demonstrate activity that is instead consistent with a third hypothesis: neural responses to pseudo-reward prediction errors show remarkable similarity to familiar RPE responses.

Using EEG, previous studies have shown RPE correlations in a characteristic midline voltage wave termed the feedback-related negativity (FRN; Holroyd and Krigolson, 2007). In the current study, this same negative deflection can be seen in response to a PPE. The source of the FRN is often assumed to lie in the dorsal anterior cingulate cortex (ACC), and, when the hierarchical task is taken into the MRI scanner, PPE-related activity is indeed found in the ACC BOLD signal (Ribas-Fernandes et al., 2011). While reward prediction errors can be found in single-unit activity in the ACC (Matsumoto et al., 2007), the current observation by Ribas-Fernandes et al. (2011) that pseudo-rewards, as well as fictive rewards (Hayden et al., 2009), cause similar activity requires a theory of ACC processing that goes beyond simple reward-and-error processing. One suggestion is that activity in the region is more concerned with behavioral update caused by the outcome than caused by the reward prediction error per se (Rushworth and Behrens, 2008).

Further similarities can be found in subcortical structures. PPEs, like RPEs, are coded positively in the ventral striatum and negatively in the habenular complex. Although it is not yet clear whether the reported PPE activity recruits the dopaminergic mechanisms famous for coding RPEs, this latter finding makes it a likely possibility. Cells in the monkey lateral habenula not only code RPEs negatively, but they also causally inhibit the firing of dopamine cells in the ventral tegmental area (Matsumoto and Hikosaka, 2007). The data presented in Ribas-Fernandes et al. (2011) therefore raise the possibility that prediction error responses at different levels of a hierarchical learning problem recruit the same neuronal mechanisms. Previous theories have considered the role of dopamine in learning from rewarding events. It is now likely that these same mechanisms can control the learning of complex internal goals and subgoals. As we move to more complex models of learning, the potential for common prediction error mechanisms places strong constraints on the types of models that should be considered. However, this idea immediately raises a new problem. How

does the brain know which level of the hierarchy has generated the error? Theoretically, RPEs and PPEs can be generated by the same event, even in opposite directions. Should the value of the action or the value of the subroutine be updated? This question is left unaddressed in the current study, but an intriguing possibility is that the hierarchical organization in the prefrontal cortex can solve this problem in concert with the striatum. Striatal circuits may gate error signals to the appropriate prefrontal cells (Badre and Frank, 2011).

By arranging actions and combinations of actions into a hierarchy, and by introducing intermediate subgoals, HRL can explain complex behaviors that cannot be explained by more traditional learning theories. Not only is learning dramatically simplified, but also subroutines can be transferred between learning problems. Egg-whisking skills perfected during soufflé baking may prove useful for tomorrow night's lemon mousse. More prosaically, the complex sequence of muscle commands required, for example, to move a limb may be combined into a single subroutine (or action!) and used in a wide variety of situations. However, humans also exhibit behavioral flexibility that cannot be explained by HRL strategies. For example, if an apple falls from a tree on a windy day, the next day we might shake the tree and expect another to fall, even if we have never shaken a tree before. If the soufflé is burnt, it is more likely due to too much time in the oven than to too much chocolate in the ganache. This type of learning relies on a causal understanding (or model) of the world and our interactions with it and is also a major recent focus in behavioral neuroscience (Daw et al., 2011). It is hoped that by studying such strategies both separately and in combination, modern neuroscientists will make big strides toward understanding the determinants of human behavior.

REFERENCES

Badre, D., and Frank, M.J. (2011). Cereb. Cortex, in press. Published online June 21, 2011.

Botvinick, M.M., Niv, Y., and Barto, A.C. (2009). Cognition 113, 262–280.

Burke, K.A., Franz, T.M., Miller, D.N., and Schoenbaum, G. (2008). Nature 454, 340–344.

D'Ardenne, K., McClure, S.M., Nystrom, L.E., and Cohen, J.D. (2008). Science 319, 1264–1267.

Daw, N.D., Gershman, S.J., Seymour, B., Dayan, P., and Dolan, R.J. (2011). Neuron 69, 1204–1215.

Hayden, B.Y., Pearson, J.M., and Platt, M.L. (2009). Science 324, 948–950.

Holroyd, C.B., and Krigolson, O.E. (2007). Psychophysiology 44, 913–917.

Koechlin, E., Ody, C., and Kouneiher, F. (2003). Science 302, 1181–1185.

Matsumoto, M., and Hikosaka, O. (2007). Nature 447, 1111–1115.

Matsumoto, M., Matsumoto, K., Abe, H., and Tanaka, K. (2007). Nat. Neurosci. 10, 647–656.

O'Doherty, J.P. (2004). Curr. Opin. Neurobiol. 14, 769–776.

Ribas-Fernandes, J.J.F., Solway, A., Diuk, C., McGuire, J.T., Barto, A.G., Niv, Y., and Botvinick, M.M. (2011). Neuron 71, this issue, 370–379.

Rushworth, M.F., and Behrens, T.E. (2008). Nat. Neurosci. 11, 389–397.

Schultz, W. (2007). Annu. Rev. Neurosci. 30, 259–288.

Tsai, H.C., Zhang, F., Adamantidis, A., Stuber, G.D., Bonci, A., de Lecea, L., and Deisseroth, K. (2009). Science 324, 1080–1084.
