www.nature.com/nrn
Perspectives
a cached value estimate that can be derived using simple computations that rely only on easily accessible information (Box 2) signalling how 'off' the current estimate is. However, the computational efficiency of a MF approach causes it to be relatively inflexible, as it can only look to the past to inform its choices, whereas the prospective capacity of the MB agent9 allows it to flexibly adapt to changes in the environment or its own goals.

The scientific progress resulting from applying a RL computational framework is plainly apparent through the rapid advances in cognitive neuroscience4,17,18. RL has been pivotal in providing a sound quantitative theory of learning, and a normative framework through which we can understand the brain and behaviour. As an explanatory framework, RL advances our understanding beyond phenomenology in ascribing functional structure to observed data. Here, we highlight some of the key findings.

MF RL and the brain
Early research into the principles that govern learning likened behaviour to the output of a stimulus–response association machine that forms links between stimuli and motor responses through reinforcement19. Various models described the relationships between stimuli, response and reward, with nearly all sharing a common theme of an associative process driven by a surprise signal20–22. Computational RL theory built on the principles that animal behaviourists had distilled through experimentation, to develop the method of temporal difference (TD) learning (a MF algorithm), which offers general-purpose learning rules while also formalizing the RL problem23.

The TD RL algorithm sparked a turning point in our understanding of DA function in the brain. In a seminal set of studies, the phasic firing patterns of DA neurons in the ventral tegmental area were shown to mirror the characteristics of a TD RL reward prediction error (see Eq. (1) in Box 1), offering a bridge between behaviourally descriptive models and a functional understanding of learning algorithms embodied by the brain17,24,25. Continued work along this line of research has probed DA activity in greater detail, linking it to various flavours of MF RL26,27. Importantly, this work has shifted the conceptualization of stimulus–response instrumental learning away from inflexible reflex-like behaviour towards one of adaptable, value-based learning.

Box 1 | Formal RL algorithms
Most commonly, reinforcement learning (RL) problems are formalized as a Markov decision process, which is defined as: a set of states, S; a set of actions, A; a function R(s, a) that defines the reward delivered after taking action a ∈ A while in state s ∈ S; and a function T(s′|s, a) that defines which state, s′ ∈ S, the agent will transition into if action a ∈ A is performed while in state s ∈ S.

Model-free RL algorithms
One approach to solving a RL problem is to redistribute reward information in a way that reflects the environment's structure. Model-free (MF) RL methods make no attempt to represent the dynamics of the environment; rather, they store a set of state or action values that estimate the value of what is expected without explicitly representing the identity of what is to come. This implies that learned values reflect a blend of both the reward structure and the transition structure of the environment, as encountered reward values are propagated back to be aggregated with preceding state values or action values. For example, having chosen to visit the cafeteria (action a1) while hungry in their office (state s1), a student encounters the new cafe's booth (state s2) and samples their food (reward r1). In one variant of MF RL, the agent learns about the circumstances that led to reward using a reward prediction error. Specifically, the difference between the predicted value of going to the cafeteria for lunch, Q(a1, s1), and the actual value, r1 + γ · Q(a2, s2), where γ discounts future value relative to immediate reward, is quantified as a temporal difference reward prediction error (δ):

δ = (r1 + γ · Q(a2, s2)) − Q(a1, s1)   (1)

The mismatch between the expected outcome and the experienced outcome is then used to improve the agent's prediction according to learning rate α:

Q(a1, s1) ← Q(a1, s1) + α · δ   (2)

Note that both the reward value (r1) and the discounted expected value of subsequent events (γ · Q(a2, s2)) are considered as part of the prediction error calculation, offering a path through which rewards can be propagated back to their antecedents.

Model-based RL algorithms
As implied by their name, model-based (MB) algorithms tackle RL problems using a model of the environment to plan a course of action by predicting how the environment will respond to its interventions. Although the word 'model' can have very different meanings, the model used in MB RL is very specifically defined as the transition function, T(s′|a, s), and the reward function, R(a, s), of the environment. Commonly referenced MB RL methods either attempt to learn, or are endowed with, the model of the task. With a model of the environment, the agent can estimate cumulative state–action values online by planning forward from the current state or backward from a terminal state. The optimal policy can be computed using the Bellman equation, in which the value of each action available in the current state, QMB(a1, s1), takes into account the expected reward R(a1, s1), and the discounted expected value of taking the best action at the subsequent state, γ · max_a′ [Q(a′, s′)], weighted by the probability of actually transitioning into that state, T(s′|s1, a1):

QMB(a1, s1) = R(a1, s1) + ∑_s′ T(s′|s1, a1) · γ · max_a′ [QMB(a′, s′)]   (3)

This approach can be recursively rolled out to subsequent states, deepening the plan under consideration. Thus, when faced with a choice of what to do for lunch, a MB strategy can flexibly consider the value of going back to the cafeteria or of visiting the new cafe by dynamically solving the Bellman equation describing the choice problem.

Fig. 2 | Contrast between MB and MF algorithms in response to environmental changes. (Left) A student has learned that the cafeteria is to the east of their laboratory and the coffee shop is to the west. Having visited both several times in the past, they have also learned that the lunch offerings at the cafeteria are passable (reward (r) = +1), whereas the coffee shop does not offer food (r = 0). (Middle) On day n, the student opts to visit the cafeteria (with value V(east) = 1 and V(west) = 0, both model-based (MB) and model-free (MF) strategies agree going east to the cafeteria is the best option). However, the student encounters a stand in front of the cafeteria offering delicious items from a new menu at the coffee shop (r = +10). (Right) The next day, the student must decide which direction to take for lunch. A MB strategy will consult its model of the environment to identify the path towards the best lunch option, which is now at the coffee shop (go west). A MF strategy, by contrast, will consult its value estimates and, owing to the unexpectedly good lunch the previous day, will repeat the action of heading east (towards the cafeteria).

The role of DA as a MF RL teaching signal is supported by work in both humans and non-human animals showing that DA affects corticostriatal plasticity, as theoretically predicted28. Subsequent research has focused on the causal importance of DAergic input to show that systematic modulation of DA cell activity is sufficient for the development of cue-induced reward-seeking behaviour29,30. Work in humans using functional MRI has implicated the ventral and dorsal striatal targets of DA in learning about state values and learning about action policies, respectively31,32, suggesting that DAergic signals support both instrumental (action–value) and non-instrumental (state–value) learning in the striatum. Consistent with MF value learning, additional research has shown that DAergic targets such as the dorsal striatum seem to track MF cached value representations33,34. Pharmacological and genetic studies involving humans have shown that variation in DAergic function and the manipulation of striatal DA sensitivity foster altered learning from positive and negative reward prediction errors35–37. Furthermore, DA signals need
not be limited to learning outwardly observable 'actions', as projections to the cortex have also been suggested to be involved in learning cognitive 'actions', such as determining which items should be held in working memory36,38–40, thus implicating the DA learning signal as a general-purpose learning signal. In sum, a broad set of methodologies and experimental protocols have shown consistent links between brain, behaviour and computationally defined MF signals associated with the predictive value of the current state and/or actions according to motivationally salient events such as reward. Although some work challenges the DA-dependent TD RL framework41–43, a broad corpus supports it, and the computational RL theory has driven very rich, new understanding of learning in the brain.

A mixture of MB and MF RL
Additional research has built on the successes of using MF RL algorithms to explain brain and behaviour by including MB RL as a mechanism through which a broader spectrum of phenomena may be understood. It has long been recognized that animal behaviour is not solely determined by reinforcement history but also exhibits planning characteristics that depend on a cognitive representation of the task at hand44. MB RL presents a useful computational framework through which this aspect of behaviour may be captured.

Attention to MB RL has increased considerably since the creation of the two-step task, in which the behavioural signatures of MF responses and MB planning can be dissociated7. In this task, a choice between two available options stochastically leads to one of two second-stage states, at which a second choice can lead to reward. Each first-level option typically moves the participant into a specific second-stage state (such as a1 → s1 and a2 → s2). However, on rare occasions, the participant's choice will lead to the alternative state (for example, a1 → s2). Choices following rare transitions can dissociate MB RL from MF RL: MF RL agents credit reward for the option that was chosen, irrespective of the path that led to that reward, and will thus be more likely to repeat a rewarded first-stage choice after a rare transition. By contrast, a MB strategy will plan to reach the rewarded second-stage state once more9, and thus will be less likely to repeat the first-stage choice, favouring the alternative option that most reliably returns it to the reward state (Fig. 2).

Investigations into the relationship between MB and MF RL and other cognitive or psychological processes have identified links with MB RL45–49 more readily than with MF processes50. There are several potential explanations for this, one being that the experimental protocols used to probe MB and MF processes, such as the two-step task, are more sensitive to MB control. In addition, MB RL could broadly relate to multiple processes that are highly dependent on a single mechanism, such as attention, which offers an easily manipulable channel through which many subservient processes may be disrupted. Alternatively, the imbalance tilted in favour of MB cohesion across theoretical boundaries may highlight a problem in the strict MB–MF dichotomization of learning, as we develop in the next section.

Risks
Like any conceptual framework, the MB–MF theory of learning and decision-making has intrinsic limitations. Ironically, its increasing popularity and scope of application could erode its potential by advancing a misinterpretation that data must be described along this singular dimension10. Indeed, researchers may be led to force a square peg through a round hole when analysing separable components of their data through the lens of a coarse-grained MB–MF dichotomy. Here, we detail some of the more important limitations this presents and how much richer learning theory should become.

Challenge of disambiguation
MF behaviour can look MB, and vice versa. Despite the apparent ubiquity of MB control in human behaviour51, labelling behaviour as uniquely MB has been surprisingly difficult52. Notably, there are several channels through which behaviour that depends on a MF cached valuation may seem to reflect planning, and thus be labelled MB. For example, a MF strategy can flexibly adapt in a MB-like way when learners form compound representations using previously observed stimuli and outcomes in conjunction with current stimuli14, a process that has been offered as a means of transforming a partially observable Markov decision process (where task dynamics are determined by the current state of the environment, but the agent cannot directly observe the underlying state) into a more
tractable Markov decision process, where the state is fully observed53. In similar fashion, MB-like behaviour can emerge from a MF controller when contextual information is used to segregate circumstances in which similar stimuli require different actions54, or when a model is used retrospectively to identify a previously ambiguous choice13. Furthermore, applying a MF learning algorithm to representations that capture features of trajectories in the environment (for example, successor representations that track the frequency with which the agent arrives at a given state55) mimics some aspects of MB behaviour (yet makes separate predictions). In sum, coupling additional computational machinery such as working memory with standard MF algorithms can mimic a MB planning strategy.

Similarly, there are several paths through which a MB controller may produce behaviour that looks MF. For example, one important indication of MB control is sensitivity to devaluation, whereby an outcome that had been previously desired is rendered aversive (for example, through association with illness). However, it is not always clear which aspect of MB control has been disrupted if the agent remains devaluation-insensitive (and, thus, seems MF). For MB control to materialize, the agent must identify its goal, search its model for a path leading to that goal and then act on its plan. Should any of these processes fail (for example, using the wrong model, neglecting to update the goal or planning errors), then the agent could seem to act more like a MF agent — if that is the only alternative under consideration12,56,57.

Further contributing to the risk of strategy misattribution, non-RL strategies can masquerade as RL when behaviour is dichotomized across a singular MB–MF dimension. Simple strategies that rely only on working memory, such as 'win–stay/lose–shift', can mimic — or, at the very least, be difficult to distinguish from — MF control. Although simple strategies such as 'win–stay/lose–shift' can be readily identified in tasks explicitly designed to do so58, more complex non-RL strategies, such as superstitious behaviour (for example, gambler's fallacy, in which losing in the past is believed to predict a better chance of winning in the future) or intricate inter-trial patterns of responding (for instance, switch after two wins or four losses), can be more difficult to identify59. Unfortunately, when behavioural response patterns are analysed within a limited scope along a continuum of being either MB or MF, non-RL strategies are necessarily pressed into the singular axis of MF–MB.

Model use in MF RL. More generally, other theories of learning assume that agents use a model of the environment but do not adopt a MB planning strategy for decision-making. For example, the specific type of model used by classic MB algorithms for planning (the transition function) can also be used to apply MF RL updates on retrospectively inferred latent states13. This combination of a MB feature in otherwise MF learning constitutes an example of a class of model-dependent MF RL algorithms. Models of the environment in this class can include knowledge other than transition functions and reward functions. For example, a model of the relationship between the outcome of two choices facilitates counterfactual MF value updates60,61, whereas a model of the environment's volatility can be used to dynamically adjust and optimize MF RL learning rates62. Learning using MF RL updates in conjunction with models of the environment also occurs in the context of identifying hidden states, such as non-directly observable rules54,63–65. MF learning with model use is thus involved in a rich set of learning experiences, meaning that a strict segregation between MB and MF learning and decision-making is not easily justified or helpful.

MB and MF learning are not primitive. MB and MF learning are often treated as a singular learning primitive (for example, 'manipulation X increases MB-ness'). However, the measurable output of either a MF or a MB algorithm relies on many computational mechanisms that need not be considered as unique components associated with a singular system. Indeed, MB and MF learning and decision-making is arguably better understood as a high-level process that emerges through the coordination of many separable sub-computations, some of which may be shared.

Box 2 | Learning as a mixture of MB and MF RL
The original paper reporting the two-step task showed that human behaviour exhibits both model-based (MB) and model-free (MF) components7. Since then, many have used versions of this task to replicate and expand on these findings in what has become a rich and productive line of research, highlighting the relevance of MB versus MF reinforcement learning (RL) in understanding learning across many different domains. We do not provide an exhaustive review of this here (see ref.145) but, instead, highlight the impact of this theoretical framework on studies of neural systems, studies of individual differences and non-human research to show the breadth of the framework's impact on the field of the computational cognitive neuroscience of learning, and beyond.

Separable neural systems in humans
The dual systems identified by the two-step task and the MB–MF mixture model were shown to largely map to separable systems, either by identifying separate neural correlates48 or by identifying causal manipulations that taxed the systems independently. Causal manipulations have typically targeted executive functions and, as such, the majority (if not all) of research using this paradigm has been found to modulate the MB, but not the MF, component of behaviour. Successful manipulations that reduced the influence of the MB component included taxing attention via multitask interference45 or task-switching72, inducing stress46, disrupting regions associated with executive function146 and pharmacological treatments47. Manipulations targeting the MF system are largely absent, potentially reflecting that system's primacy or heterogeneity.

Individual differences
Individuals vary in their decision-making processes and how they learn from feedback. The MB–MF theoretical framework, along with the two-step task, was successfully used to capture such individual differences and relate them to predictive factors147. For example, a study of a developmental cohort96 showed that the MB component increases from age 8 through 25 years, whereas the MF component of learning remains stable. This framework has also been used to identify specific learning deficits in psychiatric populations, such as people with obsessive–compulsive disorders148 or repetitive disorders149, addiction150, schizophrenia151 and other psychiatric constructs49,152.

Non-human studies
Early models of animal behaviour described a causal relationship between stimuli and responses153, which was expanded upon to show that some behaviour was better accounted for by models that included a cognitive map of the environment44. However, more refined investigations suggested that both strategies, a stimulus-driven response and an outcome-motivated action, can emerge from the same animals2. Anatomical work in rats has dissociated these strategies, indicating that prelimbic regions are involved in goal-directed learning98,154, whereas the infralimbic cortex has been associated with stimulus–response control155. This dissociation mirrors a functional segregation between the dorsolateral and dorsomedial striatum, with the former implicated in stimulus–response behaviour and the latter being associated with goal-directed planning156–158.
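The 'model use in MF RL' idea can be made concrete with a toy sketch: if the agent's model of a two-option task says the options' outcomes are anticorrelated, a single observed outcome licenses a counterfactual MF update to the unchosen option as well, without any forward planning. This is only an illustrative sketch under that assumed anticorrelation; the option names, starting values and learning rate are invented for the example, not taken from any published algorithm.

```python
ALPHA = 0.5  # learning rate (illustrative)

def counterfactual_mf_update(Q, chosen, unchosen, r):
    """Apply delta-rule (MF) updates in which a simple task model supplies
    the counterfactual outcome for the option that was not chosen."""
    Q[chosen] += ALPHA * (r - Q[chosen])
    # Assumed model knowledge: outcomes are anticorrelated, so if the chosen
    # option paid r, the unchosen option would have paid 1 - r.
    Q[unchosen] += ALPHA * ((1.0 - r) - Q[unchosen])

Q = {'A': 0.5, 'B': 0.5}             # uninformative starting values
counterfactual_mf_update(Q, 'A', 'B', r=1.0)
print(Q)  # {'A': 0.75, 'B': 0.25}: both values move after one outcome
```

A purely MF learner would have updated only the chosen option's value; the extra update to the unchosen option is what the model buys, even though no planning takes place.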
The cognitive processes are more richly interconnected than they are distinct. But more importantly for our understanding of brain function and its applicable import (for example, in the treatment of mental disease), we suggest that these computations themselves may not map cleanly onto singular underlying neural mechanisms (Fig. 3c). For example, learning a model of the environment, as contrasted with using that model to plan a course of action, may rely on a common use of working memory resources67,72, suggesting some functional overlap at the level of implementation in the brain.

An important, but often overlooked, detail is that the primitive functions of RL algorithms assume a predefined state and action space23. When humans and animals learn, the task space must be discovered, even if MF RL learning mechanisms then operate over them54,64,65,73–76. Shaping a state space to represent the relevant components of the environment for the task at hand probably involves separate networks, such as the medial prefrontal cortex77, lateral prefrontal cortex74, orbitofrontal cortex78,79 and hippocampus80. Furthermore, a state-identification process probably depends on complex, multipurpose functions such as categorization, generalization or causal inference54,63,64,81. Critically, the process through which a state space comes to be defined can have dramatic effects on behavioural output. For example, animals can rapidly reinstate response rates following extinction82,83. A learning and decision mechanism that relies on a singular cached value (as is commonly implemented in MF RL) has difficulty capturing this response pattern as, in MF, value is learned and relearned symmetrically. However, some implementations of MF RL can readily elicit reinstatement by learning new representational values for the devalued option and, as such, return to prior response rates rapidly, not as a result of learning per se but as a result of state identification81,84,85.

Finally, MF value updates may not in all cases be a relevant computational primitive representative of a unique underlying mechanism, despite the fact that it seems to account for behavioural variance and be reflected in a set of underlying neural mechanisms. The family of MF algorithms is extremely broad, and can underlie extremely slow learning (as is used to train deep Q-nets over millions of trials86) or very fast learning (as often observed in human bandit tasks with high learning rates87). It is unlikely that the functions embodied by a singular DA-dependent brain network that implements a form of MF RL are solely responsible for such a broad range of phenomena. Instead, it is more likely that the DA-dependent neural MF RL process is fairly slow (as reflected in the comparably slow learning of many non-human animals), and that faster learning, even when it seemingly can be captured by MF RL algorithms, actually reflects additional underlying memory mechanisms, such as working memory88–90 and/or episodic memory91–95.

In summary, it is important to remember that MB RL and MF RL are not atomic principal components of learning and decision-making that map to unique and separable underlying neural mechanisms. The MB–MF dichotomy should be considered a convenient description of some aspects of learning, including forward planning, knowledge of transitions and outcome valuation, but one that depends on multiple independent subcomponents.

The challenge of isomorphism
The computational MB–MF RL framework has drawn attention as a promising formal lens through which some of the many dichotomous psychological frameworks of decision-making may be reinterpreted and unified11, offering a potential successor to the commonly used, but vaguely defined, System1/System2 rubric5,6. However, hybrid MB–MF RL cannot be the sole basis of a solid theoretical framework for modelling the breadth of learning behaviour. In this section, we highlight separable components of learning that do not cleanly align with a MB–MF dichotomization (Fig. 3d), focusing primarily on the habitual versus goal-directed dichotomy, as this is often treated as synonymous with MB and MF RL96.

A substantial body of evidence points to two distinguishable modes of behaviour: a goal-directed strategy that guides actions according to the outcomes they bring about; and habitual control, through which responses are induced by external cues2. The principal sources of evidence supporting this dichotomy come from devaluation and contingency-degradation protocols aimed at probing outcome-directed planning, with the former indexing behavioural adaptations to changes in outcome values, and the latter manipulating the causal relationship between action and outcome (see refs97,98 for a review). Behaviour is considered habitual if there is no detectable change in performance despite devalued outcomes or degraded action–outcome contingencies. The outcome-seeking and stimulus-driven characteristics of goal-directed and habitual behaviour mirror the response patterns associated with MB and MF RL, respectively99. However, as pertinent experimental variables have been probed in more detail, growing evidence suggests that these constructs are not interchangeable. Studies have investigated individual differences in measures across the goal-directed–habitual dimension in attempts to relate those to indices of MB–MF control49,100. These studies have demonstrated the predicted correspondence between goal-directed response and MB control, but establishing a relationship between habits and MF control has proved more elusive. Indeed, eliciting robust habits is challenging101, more so than would be expected if habits related to in-laboratory measures of MF RL.

Additional facets of learning and decision-making have fallen along the emotional axis, with a 'hot' system driving emotionally motivated behaviour and a 'cold' system guiding rational decision-making1,102,103. Similarly, others have contrasted decisions based on an associative system rooted in similarity-based judgements with decisions derived from a rule-based system that guides choice in a logical manner3,5,6. Studies and theory have further segregated strategic planning, whereby one can describe why and how they acted, from implicit 'gut-feeling' choice104,105. Although it is tempting to map these various dichotomies to MF–MB RL along something akin to a common 'thoughtfulness' dimension, they are theoretically distinct. The MF–MB distinction makes no accommodation for the emotional state of the agent. Similarity-based judgements and rule creation are not generally addressed by RL algorithms, and the MB–MF dichotomy has not been cleanly mapped to a contrast between explicit and implicit decision-making.

In summary, many dual-system frameworks share common themes, thus motivating the more general reference of System1/System2 (refs5,6). Although many of the phenomena explained by these dual-system frameworks mirror the gist of the MB–MF dichotomy, none can be fully reduced to it. Contrasting some of these dichotomies highlights the fact that the MB–MF dichotomy is not simply a quantitative formalism of those more qualitative theories but is, indeed, theoretically distinct from most (such as the hot–cold emotional dimension) and offers patchy coverage of others (such as the habitual–goal-directed framework).
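The point that the MF family spans extremely slow and extremely fast learning comes down largely to the learning rate α in the delta rule of Eq. (2) in Box 1. A minimal sketch makes this concrete; the target value, convergence threshold and the two α values below are arbitrary illustrations, not parameters from any cited study.

```python
def trials_to_learn(alpha, true_value=1.0, threshold=0.9, max_trials=100_000):
    """Count delta-rule updates until a cached value estimate reaches `threshold`."""
    v = 0.0
    for trial in range(1, max_trials + 1):
        v += alpha * (true_value - v)  # the same update as Eq. (2) of Box 1
        if v >= threshold:
            return trial
    return max_trials

print(trials_to_learn(alpha=0.9))    # bandit-like fast learning: 1 trial
print(trials_to_learn(alpha=0.001))  # deep-Q-net-like slow learning: ~2,300 trials
```

The identical update rule thus produces behaviour three orders of magnitude apart in learning speed, which is one reason a single 'MF' label can conceal very different underlying mechanisms.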
What is lost
Considering other dichotomous frameworks highlights the multifaceted nature of learning and decision-making by highlighting the many independent axes along which behaviour can be described. Although aligning cognitive, neural and behavioural data across various dualities offers the means to expose and examine key variables, something is necessarily lost when a system as complex as the brain is scrutinized through a dichotomous lens. Indeed, broad dichotomous descriptions (such as System1/System2) often lack predictive precision, and a proliferation of isolated contrastive frameworks stands to impair a coherent understanding of brain and behaviour106–108. The application of RL in studying the brain has facilitated notable progress by offering a formal framework through which theorems may be proved109, axiomatic patterns may be described110, brain function can be probed29 and theories may be falsified. However, distilling learning and decision-making into a single MB–MF dimension risks conflating many other sources of variance, and, more importantly, threatens to dilute the formal merits of the computational RL framework to that of a verbal theory (for example, the agent 'uses' a 'model').

Paths forward
Identifying the computational primitives that support learning is an essential question at the core of cognitive (neuro)science, but also has implications for all domains that rely on learning — including education, public health, software design and so on. Characterizing these primitives will be an important step towards gaining deeper insight into learning differences between and across populations, including differences with developmental trajectories111, differences depending on environmental factors and differences associated with psychiatric or neurological diseases112. Here, we highlight ways in which past research has successfully identified learning primitives.

The apparent gaps between MB–MF RL and goal-directed–habitual behaviour could promote both theoretical and experimental advances. Failure to elicit a detectable change in the post-devaluation response rate using a devaluation protocol (that is, habits) could be caused by various investigable mechanisms (including degradation of the transition model, compromised goal maintenance or engagement of a MF controller). These gaps point to the importance of considering other dimensions of learning and decision-making (such as the potential role of value-free response113), and other facets of behaviour such as exploration that may be interpreted as being more MB or MF14,56.

Computer science research (see Fig. 1) can aid us in the identification of additional relevant dimensions of learning. For example, algorithms have used hierarchical organization as a means of embedding task abstraction, whereby the agent can focus its decision-making machinery on high-level actions, such as 'make coffee', without needing to explicitly consider all of the subordinate steps involved. In hierarchical RL, information is learned and decisions are made at multiple levels of abstraction in parallel. Thus, hierarchical RL offers potentially beneficial task abstractions that can span across time114–116 or the state or action space, and has been observed in humans54,63,74,117,118. Notably, hierarchical RL may be implemented using either MB planning or MF responses, which offers a rich set of computational tools that research (and the brain) may draw upon, but also compounds the risk of misattribution when a singular MB–MF dimension is considered. Benefit can also come from considering the classic AI partition between supervised learning, whereby explicit teaching signals are used to shape system output, and unsupervised learning, in which the system relies on properties of the input to drive response. Research has

whereas offline learners can use information at a later point for 'batch' updating, relying heavily on information storage and the ability to draw from it23. Offline learning has been suggested to occur between learning trials that involve working memory or hippocampal replay121,122, or during consolidation in sleep123, and may contribute to both model and reward learning (for example, the Dyna learning algorithm in which past experiences are virtually replayed to accelerate learning23).

Insights garnered from neuroscience should also continue contributing to enrich our understanding of the dimensions of learning and decision-making, as studies of regional specificity have implicated separable aspects of behaviour in different brain regions. For example, studies in which memory load was systematically manipulated exposed separable roles of MF RL and working memory in learning88–90,124, with the two processes mapping to expected underlying neural systems88,125,126. Further examples of using insights from neuroscience to illuminate the computations underlying learning behaviour follow from a long history of research into hippocampal function. Previous work has fostered a dichotomy between the hippocampus and the basal ganglia, with the former being implicated in declarative learning and the latter in implicit procedural learning127–129. More recent work has begun to probe how these two systems may compete for control91 or collaborate130. Such collaboration may emerge through relational associations maintained in the hippocampus upon which value may be learned131,132, or through developing a representation that captures a transition structure in the environment133. Further strengthening a functional relationship between MB and MF RL, research has also offered evidence of a cooperative computational role between systems during reward learning as a means of actively sampling previous events to improve value estimates93–95.

It is important to note that identifying separable components of learning and
that go beyond the MB–MF RL dichotomy, through an environment39. Proposals have decision-making is complicated by the
covering many separable dimensions of outlined frameworks in which supervised, existence of interactions between different
learning and decision-making. These unsupervised and RL systems interact neural systems. Most theoretical frameworks
successful approaches offer explicit paths to build and act on representations of treat separable components as independent
forward in the endeavour of deconstructing the environment119,120, further bending the strategies in competition for control.
learning into its interpretable, neurally notion that a singular spectrum of MB–MF However, these components often interact
implementable basic primitives. control can sufficiently explain behaviour. in complex ways beyond competition for
Disparities or inconsistencies A third algorithmic dimension that warrants choice134. For example, in the MB–MF
between classic psychological theoretical consideration, as it may compound worries framework4, striatal signals show that MB
frameworks offer opportunities to refine of misattribution, is the distinction between information is not segregated from MF
our understanding of their underlying offline and online learning. Online-learning reward prediction error. Similar findings
computational primitives. For example, agents integrate observations as they arrive, have also been observed in recordings from
DA neurons135,136. Even functions known to stem from largely separable neural systems exhibit such interactions; for example, information in working memory seems to influence the reward prediction error computations of MF RL89,124–126. Going beyond simple dichotomies will necessitate not only increasing the dimensionality of the space of learning processes we consider but also considering how different dimensions interact.

In summary, there are numerous axes along which learning and decision-making vary, as identified through various traditions of research (including psychology, AI and neuroscience). Future research should continue the progress of recent work in identifying many additional dimensions of learning that capture other important sources of variance in how we learn, such as meta-learning mechanisms137,138, learning to use attention73,139,140, strategic learning59 and uncertainty-dependent parameter changes62,141,142. This is evidence that learning and decision-making vary along numerous dimensions that cannot be reduced to a simple two-dimensional principal component space, whether that axis is labelled as MB–MF, hot–cold, goal-directed versus habitual or otherwise.

Conclusions
We have attempted to show the importance of identifying the primitive components that support learning and decision-making, as well as the risks inherent to compressing complex and multifaceted processes into a two-dimensional space. Although dual-system theories offer a means through which unique and dissociable components of learning and decision-making may be highlighted, key aspects could be fundamentally misattributed to unrelated computations, and scientific debate could become counterproductive when different subfields use the same label, even one as computationally well defined as MB and MF RL, to mean different things.

We also propose ways forward. One is to renew a commitment to being precise in our vocabulary and conceptual definitions. The success of the MB–MF RL framework has begun to transition clearly defined computational algorithms towards a range of terms that are synonymous, to many individuals, with various dichotomous approximations that may or may not touch on shared functional or neural mechanisms. We have argued that such a shift leads to a dangerous approximation of a much higher-dimensional space. The rigour of computationally defined theories should not hide their limitations: the equations of a computational model are defined in a precise environment and do not necessarily expand seamlessly to capture neighbouring concepts.

Most importantly, we should remember David Marr's advice and consider our goal when attempting to find primitives of learning8. The MB and MF family of algorithms, as defined by computer scientists, offers a high-level theory of what information is incorporated, how it is used during decision-making and how learning is shaped. This may be satisfactory for research concerned with applying learning science to other domains, such as AI or education. However, for research that aims to understand matters that are dependent on the mechanisms of learning (that is, the brain's implementation), such as the study of individual differences in learning, it is particularly important to ensure that the high-level theory proposes computational primitives that precisely relate to the underlying circuits. This effort may benefit from a renewed enthusiasm from computational modellers for the basic building blocks of psychology and neuroscience143,144, and a better appreciation for the functional building blocks formalized by a rich computational theory.

Anne G. E. Collins1 ✉ and Jeffrey Cockburn2
1Department of Psychology and the Helen Wills Neuroscience Institute, University of California, Berkeley, Berkeley, CA, USA.
2Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, CA, USA.
✉e-mail: annecollins@berkeley.edu
https://doi.org/10.1038/s41583-020-0355-6
Published online xx xx xxxx

1. Roiser, J. P. & Sahakian, B. J. Hot and cold cognition in depression. CNS Spectr. 18, 139–149 (2013).
2. Dickinson, A. Actions and habits: the development of behavioural autonomy. Philos. Trans. R. Soc. Lond. B Biol. Sci. 308, 67–78 (1985).
3. Sloman, S. A. The empirical case for two systems of reasoning. Psychol. Bull. 119, 3 (1996).
4. Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P. & Dolan, R. J. Model-based influences on humans' choices and striatal prediction errors. Neuron 69, 1204–1215 (2011).
5. Stanovich, K. E. & West, R. F. Individual differences in reasoning: implications for the rationality debate? Behav. Brain Sci. 23, 645–665 (2000).
6. Kahneman, D. & Frederick, S. in Heuristics and Biases: The Psychology of Intuitive Judgment Ch. 2 (eds Gilovich, T., Griffin, D. & Kahneman, D.) 49–81 (Cambridge Univ. Press, 2002).
7. Daw, N. in Decision Making, Affect, and Learning: Attention and Performance XXIII Ch. 1 (eds Delgado, M. R., Phelps, E. A. & Robbins, T. W.) 1–26 (Oxford Univ. Press, 2011).
8. Marr, D. & Poggio, T. A computational theory of human stereo vision. Proc. R. Soc. Lond. B Biol. Sci. 204, 301–328 (1979).
9. Doll, B. B., Duncan, K. D., Simon, D. A., Shohamy, D. & Daw, N. D. Model-based choices involve prospective neural activity. Nat. Neurosci. 18, 767–772 (2015).
10. Daw, N. D. Are we of two minds? Nat. Neurosci. 21, 1497–1499 (2018).
11. Dayan, P. Goal-directed control and its antipodes. Neural Netw. 22, 213–219 (2009).
12. da Silva, C. F. & Hare, T. A. A note on the analysis of two-stage task results: how changes in task structure affect what model-free and model-based strategies predict about the effects of reward and transition on the stay probability. PLoS ONE 13, e0195328 (2018).
13. Moran, R., Keramati, M., Dayan, P. & Dolan, R. J. Retrospective model-based inference guides model-free credit assignment. Nat. Commun. 10, 750 (2019).
14. Akam, T., Costa, R. & Dayan, P. Simple plans or sophisticated habits? State, transition and learning interactions in the two-step task. PLoS Comput. Biol. 11, e1004648 (2015).
15. Shahar, N. et al. Credit assignment to state-independent task representations and its relationship with model-based decision making. Proc. Natl Acad. Sci. USA 116, 15871–15876 (2019).
16. Deserno, L. & Hauser, T. U. Beyond a cognitive dichotomy: can multiple decision systems prove useful to distinguish compulsive and impulsive symptom dimensions? Biol. Psychiatry https://doi.org/10.1016/j.biopsych.2020.03.004 (2020).
17. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
18. Dabney, W. et al. A distributional code for value in dopamine-based reinforcement learning. Nature 577, 671–675 (2020).
19. Thorndike, E. L. Animal Intelligence: Experimental Studies (Transaction, 1965).
20. Bush, R. R. & Mosteller, F. Stochastic Models for Learning (John Wiley & Sons, 1955).
21. Pearce, J. M. & Hall, G. A model for Pavlovian learning: variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychol. Rev. 87, 532–552 (1980).
22. Rescorla, R. A. & Wagner, A. R. in Classical Conditioning II: Current Research and Theory Ch. 3 (eds Black, A. H. & Prokasy, W. F.) 64–99 (Appleton-Century-Crofts, 1972).
23. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 2018).
24. Montague, P. R., Dayan, P. & Sejnowski, T. J. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, 1936–1947 (1996).
25. Bayer, H. M. & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron 47, 129–141 (2005).
26. Morris, G., Nevet, A., Arkadir, D., Vaadia, E. & Bergman, H. Midbrain dopamine neurons encode decisions for future action. Nat. Neurosci. 9, 1057–1063 (2006).
27. Roesch, M. R., Calu, D. J. & Schoenbaum, G. Dopamine neurons encode the better option in rats deciding between differently delayed or sized rewards. Nat. Neurosci. 10, 1615–1624 (2007).
28. Shen, W., Flajolet, M., Greengard, P. & Surmeier, D. J. Dichotomous dopaminergic control of striatal synaptic plasticity. Science 321, 848–851 (2008).
29. Steinberg, E. E. et al. A causal link between prediction errors, dopamine neurons and learning. Nat. Neurosci. 16, 966–973 (2013).
30. Kim, K. M. et al. Optogenetic mimicry of the transient activation of dopamine neurons by natural reward is sufficient for operant reinforcement. PLoS ONE 7, e33612 (2012).
31. O'Doherty, J. et al. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304, 452–454 (2004).
32. McClure, S. M., Berns, G. S. & Montague, P. R. Temporal prediction errors in a passive learning task activate human striatum. Neuron 38, 339–346 (2003).
33. Samejima, K., Ueda, Y., Doya, K. & Kimura, M. Representation of action-specific reward values in the striatum. Science 310, 1337–1340 (2005).
34. Lau, B. & Glimcher, P. W. Value representations in the primate striatum during matching behavior. Neuron 58, 451–463 (2008).
35. Frank, M. J., Seeberger, L. C. & O'Reilly, R. C. By carrot or by stick: cognitive reinforcement learning in Parkinsonism. Science 306, 1940–1943 (2004).
36. Frank, M. J., Moustafa, A. A., Haughey, H. M., Curran, T. & Hutchison, K. E. Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proc. Natl Acad. Sci. USA 104, 16311–16316 (2007).
37. Cockburn, J., Collins, A. G. & Frank, M. J. A reinforcement learning mechanism responsible for the valuation of free choice. Neuron 83, 551–557 (2014).
38. Frank, M. J., O'Reilly, R. C. & Curran, T. When memory fails, intuition reigns: midazolam enhances implicit inference in humans. Psychol. Sci. 17, 700–707 (2006).
39. Doll, B. B., Hutchison, K. E. & Frank, M. J. Dopaminergic genes predict individual differences in susceptibility to confirmation bias. J. Neurosci. 31, 6188–6198 (2011).
40. Doll, B. B. et al. Reduced susceptibility to confirmation bias in schizophrenia. Cogn. Affect. Behav. Neurosci. 14, 715–728 (2014).
41. Berridge, K. C. The debate over dopamine's role in reward: the case for incentive salience. Psychopharmacology 191, 391–431 (2007).
42. Hamid, A. A. et al. Mesolimbic dopamine signals the value of work. Nat. Neurosci. 19, 117–126 (2016).
43. Sharpe, M. J. et al. Dopamine transients are sufficient and necessary for acquisition of model-based associations. Nat. Neurosci. 20, 735–742 (2017).
44. Tolman, E. C. Cognitive maps in rats and men. Psychol. Rev. 55, 189–208 (1948).
45. Economides, M., Kurth-Nelson, Z., Lübbert, A., Guitart-Masip, M. & Dolan, R. J. Model-based reasoning in humans becomes automatic with training. PLoS Comput. Biol. 11, e1004463 (2015).
46. Otto, A. R., Raio, C. M., Chiang, A., Phelps, E. A. & Daw, N. D. Working-memory capacity protects model-based learning from stress. Proc. Natl Acad. Sci. USA 110, 20941–20946 (2013).
47. Wunderlich, K., Smittenaar, P. & Dolan, R. J. Dopamine enhances model-based over model-free choice behavior. Neuron 75, 418–424 (2012).
48. Deserno, L. et al. Ventral striatal dopamine reflects behavioral and neural signatures of model-based control during sequential decision making. Proc. Natl Acad. Sci. USA 112, 1595–1600 (2015).
49. Gillan, C. M., Otto, A. R., Phelps, E. A. & Daw, N. D. Model-based learning protects against forming habits. Cogn. Affect. Behav. Neurosci. 15, 523–536 (2015).
50. Groman, S. M., Massi, B., Mathias, S. R., Lee, D. & Taylor, J. R. Model-free and model-based influences in addiction-related behaviors. Biol. Psychiatry 85, 936–945 (2019).
51. Doll, B. B., Simon, D. A. & Daw, N. D. The ubiquity of model-based reinforcement learning. Curr. Opin. Neurobiol. 22, 1075–1081 (2012).
52. Cushman, F. & Morris, A. Habitual control of goal selection in humans. Proc. Natl Acad. Sci. USA 112, 201506367 (2015).
53. O'Reilly, R. C. & Frank, M. J. Making working memory work: a computational model of learning in the prefrontal cortex and basal ganglia. Neural Comput. 18, 283–328 (2006).
54. Collins, A. G. & Frank, M. J. Cognitive control over learning: creating, clustering, and generalizing task-set structure. Psychol. Rev. 120, 190–229 (2013).
55. Momennejad, I. et al. The successor representation in human reinforcement learning. Nat. Hum. Behav. 1, 680–692 (2017).
56. da Silva, C. F. & Hare, T. A. Humans are primarily model-based and not model-free learners in the two-stage task. bioRxiv https://doi.org/10.1101/682922 (2019).
57. Toyama, A., Katahira, K. & Ohira, H. Biases in estimating the balance between model-free and model-based learning systems due to model misspecification. J. Math. Psychol. 91, 88–102 (2019).
58. Iigaya, K., Fonseca, M. S., Murakami, M., Mainen, Z. F. & Dayan, P. An effect of serotonergic stimulation on learning rates for rewards apparent after long intertrial intervals. Nat. Commun. 9, 2477 (2018).
59. Mohr, H. et al. Deterministic response strategies in a trial-and-error learning task. PLoS Comput. Biol. 14, e1006621 (2018).
60. Hampton, A. N., Bossaerts, P. & O'Doherty, J. P. The role of the ventromedial prefrontal cortex in abstract state-based inference during decision making in humans. J. Neurosci. 26, 8360–8367 (2006).
61. Boorman, E. D., Behrens, T. E. & Rushworth, M. F. Counterfactual choice and learning in a neural network centered on human lateral frontopolar cortex. PLoS Biol. 9, e1001093 (2011).
62. Behrens, T. E., Woolrich, M. W., Walton, M. E. & Rushworth, M. F. Learning the value of information in an uncertain world. Nat. Neurosci. 10, 1214–1221 (2007).
63. Collins, A. G. E. & Koechlin, E. Reasoning, learning, and creativity: frontal lobe function and human decision-making. PLoS Biol. 10, e1001293 (2012).
64. Gershman, S. J., Norman, K. A. & Niv, Y. Discovering latent causes in reinforcement learning. Curr. Opin. Behav. Sci. 5, 43–50 (2015).
65. Badre, D., Kayser, A. S. & D'Esposito, M. Frontal cortex and the discovery of abstract action rules. Neuron 66, 315–326 (2010).
66. Konovalov, A. & Krajbich, I. Mouse tracking reveals structure knowledge in the absence of model-based choice. Nat. Commun. 11, 1893 (2020).
67. Gläscher, J., Daw, N., Dayan, P. & O'Doherty, J. P. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron 66, 585–595 (2010).
68. Huys, Q. J. et al. Interplay of approximate planning strategies. Proc. Natl Acad. Sci. USA 112, 3098–3103 (2015).
69. Suzuki, S., Cross, L. & O'Doherty, J. P. Elucidating the underlying components of food valuation in the human orbitofrontal cortex. Nat. Neurosci. 20, 1786 (2017).
70. Badre, D., Doll, B. B., Long, N. M. & Frank, M. J. Rostrolateral prefrontal cortex and individual differences in uncertainty-driven exploration. Neuron 73, 595–607 (2012).
71. Wilson, R. C., Geana, A., White, J. M., Ludvig, E. A. & Cohen, J. D. Humans use directed and random exploration to solve the explore–exploit dilemma. J. Exp. Psychol. Gen. 143, 2074 (2014).
72. Otto, A. R., Gershman, S. J., Markman, A. B. & Daw, N. D. The curse of planning: dissecting multiple reinforcement-learning systems by taxing the central executive. Psychol. Sci. 24, 751–761 (2013).
73. Niv, Y. et al. Reinforcement learning in multidimensional environments relies on attention mechanisms. J. Neurosci. 35, 8145–8157 (2015).
74. Badre, D. & Frank, M. J. Mechanisms of hierarchical reinforcement learning in cortico-striatal circuits 2: evidence from fMRI. Cereb. Cortex 22, 527–536 (2012).
75. Collins, A. G. E. Reinforcement learning: bringing together computation and cognition. Curr. Opin. Behav. Sci. 29, 63–68 (2019).
76. Collins, A. G. in Goal-directed Decision Making (eds Morris, R., Bornstein, A. & Shenhav, A.) 105–123 (Elsevier, 2018).
77. Donoso, M., Collins, A. G. E. & Koechlin, E. Foundations of human reasoning in the prefrontal cortex. Science 344, 1481–1486 (2014).
78. Wilson, R. C., Takahashi, Y. K., Schoenbaum, G. & Niv, Y. Orbitofrontal cortex as a cognitive map of task space. Neuron 81, 267–278 (2014).
79. Schuck, N. W., Wilson, R. & Niv, Y. in Goal-directed Decision Making (eds Morris, R., Bornstein, A. & Shenhav, A.) 259–278 (Elsevier, 2018).
80. Ballard, I. C., Wagner, A. D. & McClure, S. M. Hippocampal pattern separation supports reinforcement learning. Nat. Commun. 10, 1073 (2019).
81. Redish, A. D., Jensen, S., Johnson, A. & Kurth-Nelson, Z. Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling. Psychol. Rev. 114, 784 (2007).
82. Bouton, M. E. Context and behavioral processes in extinction. Learn. Mem. 11, 485–494 (2004).
83. Rescorla, R. A. Spontaneous recovery. Learn. Mem. 11, 501–509 (2004).
84. O'Reilly, R. C., Frank, M. J., Hazy, T. E. & Watz, B. PVLV: the primary value and learned value Pavlovian learning algorithm. Behav. Neurosci. 121, 31 (2007).
85. Gershman, S. J., Blei, D. M. & Niv, Y. Context, learning, and extinction. Psychol. Rev. 117, 197–209 (2010).
86. Wang, J. X. et al. Prefrontal cortex as a meta-reinforcement learning system. Nat. Neurosci. 21, 860–868 (2018).
87. Iigaya, K. et al. Deviation from the matching law reflects an optimal strategy involving learning over multiple timescales. Nat. Commun. 10, 1466 (2019).
88. Collins, A. G. E. & Frank, M. J. How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. Eur. J. Neurosci. 35, 1024–1035 (2012).
89. Collins, A. G. E. The tortoise and the hare: interactions between reinforcement learning and working memory. J. Cogn. Neurosci. 30, 1422–1432 (2017).
90. Viejo, G., Girard, B., Procyk, E. & Khamassi, M. Adaptive coordination of working-memory and reinforcement learning in non-human primates performing a trial-and-error problem solving task. Behav. Brain Res. 355, 76–89 (2017).
91. Poldrack, R. A. et al. Interactive memory systems in the human brain. Nature 414, 546–550 (2001).
92. Foerde, K. & Shohamy, D. Feedback timing modulates brain systems for learning in humans. J. Neurosci. 31, 13157–13167 (2011).
93. Bornstein, A. M., Khaw, M. W., Shohamy, D. & Daw, N. D. Reminders of past choices bias decisions for reward in humans. Nat. Commun. 8, 15958 (2017).
94. Bornstein, A. M. & Norman, K. A. Reinstated episodic context guides sampling-based decisions for reward. Nat. Neurosci. 20, 997–1003 (2017).
95. Vikbladh, O. M. et al. Hippocampal contributions to model-based planning and spatial memory. Neuron 102, 683–693 (2019).
96. Decker, J. H., Otto, A. R., Daw, N. D. & Hartley, C. A. From creatures of habit to goal-directed learners: tracking the developmental emergence of model-based reinforcement learning. Psychol. Sci. 27, 848–858 (2016).
97. Dickinson, A. & Balleine, B. Motivational control of goal-directed action. Anim. Learn. Behav. 22, 1–18 (1994).
98. Balleine, B. W. & Dickinson, A. Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology 37, 407–419 (1998).
99. Daw, N. D. & Doya, K. The computational neurobiology of learning and reward. Curr. Opin. Neurobiol. 16, 199–204 (2006).
100. Friedel, E. et al. Devaluation and sequential decisions: linking goal-directed and model-based behavior. Front. Hum. Neurosci. 8, 587 (2014).
101. de Wit, S. et al. Shifting the balance between goals and habits: five failures in experimental habit induction. J. Exp. Psychol. Gen. 147, 1043–1065 (2018).
102. Madrigal, R. Hot vs. cold cognitions and consumers' reactions to sporting event outcomes. J. Consum. Psychol. 18, 304–319 (2008).
103. Peterson, E. & Welsh, M. C. in Handbook of Executive Functioning (eds Goldstein, S. & Naglieri, J. A.) 45–65 (Springer, 2014).
104. Barch, D. M. et al. Explicit and implicit reinforcement learning across the psychosis spectrum. J. Abnorm. Psychol. 126, 694–711 (2017).
105. Taylor, J. A., Krakauer, J. W. & Ivry, R. B. Explicit and implicit contributions to learning in a sensorimotor adaptation task. J. Neurosci. 34, 3023–3032 (2014).
106. Sloman, S. A. in Heuristics and Biases: The Psychology of Intuitive Judgment Ch. 22 (eds Gilovich, T., Griffin, D. & Kahneman, D.) 379–396 (Cambridge Univ. Press, 2002).
107. Evans, J. S. B. T. in In Two Minds: Dual Processes and Beyond (eds Evans, J. S. B. T. & Frankish, K.) 33–54 (Oxford Univ. Press, 2009).
108. Stanovich, K. Rationality and the Reflective Mind (Oxford Univ. Press, 2011).
109. Dayan, P. The convergence of TD(λ) for general λ. Mach. Learn. 8, 341–362 (1992).
110. Caplin, A. & Dean, M. Axiomatic methods, dopamine and reward prediction error. Curr. Opin. Neurobiol. 18, 197–202 (2008).
111. van den Bos, W., Bruckner, R., Nassar, M. R., Mata, R. & Eppinger, B. Computational neuroscience across the lifespan: promises and pitfalls. Dev. Cogn. Neurosci. 33, 42–53 (2018).
112. Adams, R. A., Huys, Q. J. & Roiser, J. P. Computational psychiatry: towards a mathematically informed understanding of mental illness. J. Neurol. Neurosurg. Psychiatry 87, 53–63 (2016).
113. Miller, K. J., Shenhav, A. & Ludvig, E. A. Habits without values. Psychol. Rev. 126, 292–311 (2019).
114. Botvinick, M. M., Niv, Y. & Barto, A. Hierarchically organized behavior and its neural foundations: a reinforcement-learning perspective. Cognition 113, 262–280 (2009).
115. Konidaris, G. & Barto, A. G. in Advances in Neural Information Processing Systems 22 (eds Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I. & Culotta, A.) 1015–1023 (NIPS, 2009).
116. Konidaris, G. On the necessity of abstraction. Curr. Opin. Behav. Sci. 29, 1–7 (2019).
117. Frank, M. J. & Fossella, J. A. Neurogenetics and pharmacology of learning, motivation, and cognition. Neuropsychopharmacology 36, 133–152 (2010).
118. Collins, A. G. E., Cavanagh, J. F. & Frank, M. J. Human EEG uncovers latent generalizable rule structure during learning. J. Neurosci. 34, 4677–4685 (2014).
119. Doya, K. What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Netw. 12, 961–974 (1999).
120. Fermin, A. S. et al. Model-based action planning involves cortico-cerebellar and basal ganglia networks. Sci. Rep. 6, 31378 (2016).
121. Gershman, S. J., Markman, A. B. & Otto, A. R. Retrospective revaluation in sequential decision making: a tale of two systems. J. Exp. Psychol. Gen. 143, 182 (2014).
122. Pfeiffer, B. E. & Foster, D. J. Hippocampal place-cell sequences depict future paths to remembered goals. Nature 497, 74–79 (2013).
123. Peyrache, A., Khamassi, M., Benchenane, K., Wiener, S. I. & Battaglia, F. P. Replay of rule-learning related neural patterns in the prefrontal cortex during sleep. Nat. Neurosci. 12, 919–926 (2009).
124. Collins, A. G. E., Albrecht, M. A., Waltz, J. A., Gold, J. M. & Frank, M. J. Interactions among working memory, reinforcement learning, and effort in value-based choice: a new paradigm and selective deficits in schizophrenia. Biol. Psychiatry 82, 431–439 (2017).
125. Collins, A. G. E., Ciullo, B., Frank, M. J. & Badre, D. Working memory load strengthens reward prediction errors. J. Neurosci. 37, 2700–2716 (2017).
126. Collins, A. G. E. & Frank, M. J. Within- and across-trial dynamics of human EEG reveal cooperative interplay between reinforcement learning and working memory. Proc. Natl Acad. Sci. USA 115, 2502–2507 (2018).
127. Knowlton, B. J., Mangels, J. A. & Squire, L. R. A neostriatal habit learning system in humans. Science 273, 1399–1402 (1996).
128. Squire, L. R. & Zola, S. M. Structure and function of declarative and nondeclarative memory systems. Proc. Natl Acad. Sci. USA 93, 13515–13522 (1996).
129. Eichenbaum, H. et al. Memory, Amnesia, and the Hippocampal System (MIT Press, 1993).
130. Foerde, K. & Shohamy, D. The role of the basal ganglia in learning and memory: insight from Parkinson's disease. Neurobiol. Learn. Mem. 96, 624–636 (2011).
131. Wimmer, G. E., Daw, N. D. & Shohamy, D. Generalization of value in reinforcement learning by humans. Eur. J. Neurosci. 35, 1092–1104 (2012).
132. Wimmer, G. E., Braun, E. K., Daw, N. D. & Shohamy, D. Episodic memory encoding interferes with reward learning and decreases striatal prediction errors. J. Neurosci. 34, 14901–14912 (2014).
133. Gershman, S. J. The successor representation: its computational logic and neural substrates. J. Neurosci. 38, 7193–7200 (2018).
134. Kool, W., Cushman, F. A. & Gershman, S. J. in Goal-directed Decision Making Ch. 7 (eds Morris, R. W. & Bornstein, A.) 153–178 (Elsevier, 2018).
135. Langdon, A. J., Sharpe, M. J., Schoenbaum, G. & Niv, Y. Model-based predictions for dopamine. Curr. Opin. Neurobiol. 49, 1–7 (2018).
136. Starkweather, C. K., Babayan, B. M., Uchida, N. & Gershman, S. J. Dopamine reward prediction errors reflect hidden-state inference across time. Nat. Neurosci. 20, 581–589 (2017).
137. Krueger, K. A. & Dayan, P. Flexible shaping: how learning in small steps helps. Cognition 110, 380–394 (2009).
138. Bhandari, A. & Badre, D. Learning and transfer of working memory gating policies. Cognition 172, 89–100 (2018).
139. Leong, Y. C. et al. Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron 93, 451–463 (2017).
140. Farashahi, S., Rowe, K., Aslami, Z., Lee, D. & Soltani, A. Feature-based learning improves adaptability without compromising precision. Nat. Commun. 8, 1768 (2017).
141. Bach, D. R. & Dolan, R. J. Knowing how much you don't know: a neural organization of uncertainty estimates. Nat. Rev. Neurosci. 13, 572–586 (2012).
142. Pulcu, E. & Browning, M. The misestimation of uncertainty in affective disorders. Trends Cogn. Sci. 23, 865–875 (2019).
143. Badre, D., Frank, M. J. & Moore, C. I. Interactionist neuroscience. Neuron 88, 855–860 (2015).
144. Krakauer, J. W., Ghazanfar, A. A., Gomez-Marin, A., MacIver, M. A. & Poeppel, D. Neuroscience needs behavior: correcting a reductionist bias. Neuron 93, 480–490 (2017).
145. Doll, B. B., Shohamy, D. & Daw, N. D. Multiple memory systems as substrates for multiple decision systems. Neurobiol. Learn. Mem. 117, 4–13 (2014).
146. Smittenaar, P., FitzGerald, T. H., Romei, V., Wright, N. D. & Dolan, R. J. Disruption of dorsolateral prefrontal cortex decreases model-based in favor of model-free control in humans. Neuron 80, 914–919 (2013).
147. Doll, B. B., Bath, K. G., Daw, N. D. & Frank, M. J. Variability in dopamine genes dissociates model-based and model-free reinforcement learning. J. Neurosci. 36, 1211–1222 (2016).
148. Voon, V. et al. Motivation and value influences in the relative balance of goal-directed and habitual behaviours in obsessive-compulsive disorder. Transl. Psychiatry 5, e670 (2015).
149. Voon, V., Reiter, A., Sebold, M. & Groman, S. Model-based control in dimensional psychiatry. Biol. Psychiatry 82, 391–400 (2017).
150. Gillan, C. M., Kosinski, M., Whelan, R., Phelps, E. A. & Daw, N. D. Characterizing a psychiatric symptom dimension related to deficits in goal-directed control. eLife 5, e11305 (2016).
151. Culbreth, A. J., Westbrook, A., Daw, N. D., Botvinick, M. & Barch, D. M. Reduced model-based decision-making in schizophrenia. J. Abnorm. Psychol. 125, 777–787 (2016).
152. Patzelt, E. H., Kool, W., Millner, A. J. & Gershman, S. J. Incentives boost model-based control across a range of severity on several psychiatric constructs. Biol. Psychiatry 85, 425–433 (2019).
153. Skinner, B. F. The Selection of Behavior: The Operant Behaviorism of B. F. Skinner: Comments and Consequences (CUP Archive, 1988).
154. Corbit, L. H., Muir, J. L. & Balleine, B. W. Lesions of mediodorsal thalamus and anterior thalamic nuclei produce dissociable effects on instrumental conditioning in rats. Eur. J. Neurosci. 18, 1286–1294 (2003).
155. Coutureau, E. & Killcross, S. Inactivation of the infralimbic prefrontal cortex reinstates goal-directed responding in overtrained rats. Behav. Brain Res. 146, 167–174 (2003).
156. Yin, H. H., Knowlton, B. J. & Balleine, B. W. Lesions of dorsolateral striatum preserve outcome expectancy but disrupt habit formation in instrumental learning. Eur. J. Neurosci. 19, 181–189 (2004).
157. Yin, H. H., Knowlton, B. J. & Balleine, B. W. Inactivation of dorsolateral striatum enhances sensitivity to changes in the action–outcome contingency in instrumental conditioning. Behav. Brain Res. 166, 189–196 (2006).
158. Ito, M. & Doya, K. Distinct neural representation in the dorsolateral, dorsomedial, and ventral parts of the striatum during fixed- and free-choice tasks. J. Neurosci. 35, 3499–3514 (2015).

Author contributions
The authors contributed equally to all aspects of the article.

Competing interests
The authors declare no competing interests.

Peer review information
Nature Reviews Neuroscience thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© Springer Nature Limited 2020