
Self-Tuning Deep Reinforcement Learning

Tom Zahavy 1 Zhongwen Xu 1 Vivek Veeriah 1 Matteo Hessel 1 Junhyuk Oh 1 Hado van Hasselt 1
David Silver 1 Satinder Singh 1 2

¹Work done at DeepMind, London. ²University of Michigan. Correspondence to: Tom Zahavy <tomzahavy@gmail.com>.

arXiv:2002.12928v2 [stat.ML] 2 Mar 2020

Abstract

Reinforcement learning (RL) algorithms often require expensive manual or automated hyperparameter searches in order to perform well on a new domain. This need is particularly acute in modern deep RL architectures, which often incorporate many modules and multiple loss functions. In this paper, we take a step towards addressing this issue by using metagradients (Xu et al., 2018) to tune these hyperparameters via differentiable cross validation, whilst the agent interacts with and learns from the environment. We present the Self-Tuning Actor Critic (STAC), which uses this process to tune the hyperparameters of the usual loss function of the IMPALA actor critic agent (Espeholt et al., 2018), to learn the hyperparameters that define auxiliary loss functions, and to balance trade-offs in off-policy learning by introducing and adapting the hyperparameters of a novel leaky V-trace operator. The method is simple to use, sample efficient, and does not require a significant increase in compute. Ablative studies show that the overall performance of STAC improves as we adapt more hyperparameters. When applied to 57 games of the Atari 2600 environment over 200 million frames, our algorithm improves the median human normalized score of the baseline from 243% to 364%.

1. Introduction

Reinforcement Learning (RL) algorithms often have multiple hyperparameters that require careful tuning; this is especially true for modern deep RL architectures, which often incorporate many modules and many loss functions. Training a deep RL agent is thus typically performed in two nested optimization loops. In the inner training loop, we fix a set of hyperparameters and optimize the agent parameters with respect to these fixed hyperparameters. In the outer (manual or automated) tuning loop, we search for good hyperparameters, evaluating them either in terms of their inner loop performance or in terms of their performance on some special validation data. Inner loop training is typically differentiable and can be efficiently performed using back propagation, while the optimization of the outer loop is typically performed via gradient-free optimization. Since only the hyperparameters (but not the agent policy itself) are transferred between outer loop iterations, we refer to this as a "multiple lifetime" approach. For example, random search for hyperparameters (Bergstra & Bengio, 2012) falls under this category, and population based training (Jaderberg et al., 2017) also shares many of its properties. The cost of relying on multiple lifetimes to tune hyperparameters is often not accounted for when the performance of algorithms is reported. The impact of this is mild when users can rely on hyperparameters established in the literature, but the cost of hyperparameter tuning across multiple lifetimes manifests itself when algorithms are applied to new domains.

This motivates a significant body of work on tuning hyperparameters online, within a single agent lifetime. Previous work in this area has often focused on solutions tailored to specific hyperparameters. For instance, Schaul et al. (2019) proposed a non-stationary bandit algorithm to adapt the exploration-exploitation trade-off. Mann et al. (2016) and White & White (2016) proposed algorithms to adapt λ (the eligibility trace coefficient). Rowland et al. (2019) introduced an α coefficient into V-trace to account for the variance-contraction trade-off in off-policy learning. In another line of work, metagradients were used to adapt the optimiser's parameters (Sutton, 1992; Snoek et al., 2012; Maclaurin et al., 2015; Pedregosa, 2016; Franceschi et al., 2017; Young et al., 2018).

In this paper we build on the metagradient approach to tune hyperparameters. In previous work, online metagradients have been used to learn the discount factor or the λ coefficient (Xu et al., 2018), to discover intrinsic rewards (Zheng et al., 2018) and auxiliary tasks (Veeriah et al., 2019). This is achieved by representing an inner (training) loss function as a function of both the agent parameters and a set of hyperparameters. In the inner loop, the parameters of the agent are trained to minimize this inner loss function w.r.t. the current values of the hyperparameters; in the outer loop, the hyperparameters are adapted via back propagation to minimize the outer (validation) loss.
It is perhaps surprising that we may choose to optimize a different loss function in the inner loop, instead of the outer loss we ultimately care about. However, this is not a new idea. Regularization, for example, is a technique that changes the objective in the inner loop to balance the bias-variance trade-off and avoid the risk of overfitting. In model-based RL, it was shown that the policy found using a smaller discount factor can actually be better than a policy learned with the true discount factor (Jiang et al., 2015). Auxiliary tasks (Jaderberg et al., 2016) are another example, where gradients are taken w.r.t. unsupervised loss functions in order to improve the agent's representation. Finally, it is well known that, in order to maximize the long term cumulative reward efficiently (the objective of the outer loop), RL agents must explore, i.e., act according to a different objective in the inner loop (accounting, for instance, for uncertainty).

This paper makes the following contributions. First, we show that it is feasible to use metagradients to simultaneously tune many critical hyperparameters (controlling important trade-offs in a reinforcement learning agent), as long as they are differentiable w.r.t. a validation/outer loss. Importantly, we show that this can be done online, within a single lifetime, requiring only 30% additional compute. We demonstrate this by introducing two novel deep RL architectures that extend IMPALA (Espeholt et al., 2018), a distributed actor-critic, by adding additional components with many more new hyperparameters to be tuned. The first agent, referred to as a Self-Tuning Actor Critic (STAC), introduces a leaky V-trace operator that mixes importance sampling (IS) weights with truncated IS weights. The mixing coefficient in leaky V-trace is differentiable (unlike the original V-trace) but similarly balances the variance-contraction trade-off in off-policy learning. The second architecture is STACX (STAC with auXiliary tasks). Inspired by Fedus et al. (2019), STACX augments STAC with parametric auxiliary loss functions (each with its own hyperparameters).

These agents allow us to show empirically, through extensive ablation studies, that performance consistently improves as we expose more hyperparameters to metagradients. In particular, when applied to 57 Atari games (Bellemare et al., 2013), STACX achieved a normalized median score of 364%, a new state-of-the-art for online model free agents.

2. Background

In the following, we consider three types of parameters:

1. θ – the agent parameters.
2. ζ – the hyperparameters.
3. η ⊂ ζ – the metaparameters.

θ denotes the parameters of the agent and parameterises, for example, the value function and the policy; these parameters are randomly initialised at the beginning of an agent's lifetime, and updated using back-propagation on a suitable inner loss function. ζ denotes the hyperparameters, including, for example, the parameters of the optimizer (e.g. the learning rate) or the parameters of the loss function (e.g. the discount factor); these may be tuned over the course of many lifetimes (for instance via random search) to optimize an outer (validation) loss function. In a typical deep RL setup, only these first two types of parameters need to be considered. In metagradient algorithms a third set of parameters must be specified: the metaparameters, denoted η; these are a subset of the differentiable parameters in ζ that start with some initial value (itself a hyperparameter), but that are then adapted during the course of training.

2.1. The metagradient approach

Metagradient RL (Xu et al., 2018) is a general framework for adapting, online, within a single lifetime, the differentiable hyperparameters η. Consider an inner loss that is a function of both the parameters θ and the metaparameters η: L_inner(θ; η). On each step of an inner loop, θ can be optimized with a fixed η to minimize the inner loss L_inner(θ; η):

θ̃(η_t) ≐ θ_{t+1} = θ_t − ∇_θ L_inner(θ_t; η_t).   (1)

In an outer loop, η can then be optimized to minimize the outer loss by taking a metagradient step. As θ̃(η) is a function of η, this corresponds to updating the η parameters by differentiating the outer loss w.r.t. η,

η_{t+1} = η_t − ∇_η L_outer(θ̃(η_t)).   (2)

The algorithm is general, as it implements a specific case of online cross-validation, and can be applied, in principle, to any differentiable metaparameter η used by the inner loss.
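The two nested updates in Eqs. (1) and (2) map directly onto automatic differentiation. Below is a minimal sketch in JAX (the language of our code base, see Section 8); the quadratic toy losses are illustrative placeholders rather than the agent's actual inner and outer losses.

```python
import jax
import jax.numpy as jnp

# Toy stand-ins for L_inner(theta; eta) and L_outer(theta); in the agent these
# would be the parameterised loss of Eq. (5) and the validation loss of Eq. (6).
def inner_loss(theta, eta):
    return eta * jnp.sum(theta ** 2)

def outer_loss(theta):
    return jnp.sum((theta - 1.0) ** 2)

def metagradient_step(theta, eta, lr_theta=0.1, lr_eta=0.01):
    # Inner step (Eq. 1): update theta with eta held fixed.
    def inner_update(eta_):
        return theta - lr_theta * jax.grad(inner_loss)(theta, eta_)  # theta_tilde(eta)

    # Outer step (Eq. 2): differentiate the outer loss through theta_tilde(eta).
    g_eta = jax.grad(lambda eta_: outer_loss(inner_update(eta_)))(eta)
    return inner_update(eta), eta - lr_eta * g_eta

theta, eta = jnp.ones(3), jnp.asarray(0.5)
for _ in range(100):
    theta, eta = metagradient_step(theta, eta)
```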
2.2. IMPALA

Specific instantiations of the metagradient RL framework require specification of the inner and outer loss functions. Since our agent builds on the IMPALA actor critic agent (Espeholt et al., 2018), we now provide a brief introduction.

IMPALA maintains a policy π_θ(a_t|x_t) and a value function v_θ(x_t) that are parameterized with parameters θ. The policy and the value function are trained via an actor-critic update with entropy regularization; such an update is often represented (with slight abuse of notation) as the gradient of the following pseudo-loss function

L(θ) = g_v · (v_s − V_θ(x_s))² − g_p · ρ_s log π_θ(a_s|x_s)(r_s + γ v_{s+1} − V_θ(x_s)) − g_e · Σ_a π_θ(a|x_s) log π_θ(a|x_s),   (3)

where g_v, g_p, g_e are suitable loss coefficients. We refer to the policy that generates the data for these updates as the behaviour policy µ(a_t|x_t). In the on-policy case, where µ(a_t|x_t) = π(a_t|x_t) and ρ_s = 1, v_s is the n-step bootstrapped return v_s = Σ_{t=s}^{s+n−1} γ^{t−s} r_t + γ^n V(x_{s+n}).

IMPALA uses a distributed actor-critic architecture that assigns copies of the policy parameters to multiple actors on different machines to achieve higher sample throughput. As a result, the target policy π on the learner machine can be several updates ahead of the actor's policy µ that generated the data used in an update. Such off-policy discrepancy can lead to biased updates, requiring the updates to be weighted with IS weights for stable learning. Specifically, IMPALA (Espeholt et al., 2018) uses truncated IS weights to balance the variance-contraction trade-off on these off-policy updates. This corresponds to instantiating Eq. (3) with

v_s = V(x_s) + Σ_{t=s}^{s+n−1} γ^{t−s} (Π_{i=s}^{t−1} c_i) δ_t V,   (4)

where we define δ_t V = ρ_t (r_t + γ V(x_{t+1}) − V(x_t)), and we set ρ_t = min(ρ̄, π(a_t|x_t)/µ(a_t|x_t)), c_i = λ min(c̄, π(a_i|x_i)/µ(a_i|x_i)) for suitable truncation levels ρ̄ and c̄.

2.3. Metagradient IMPALA

The metagradient agent in (Xu et al., 2018) uses the metagradient update rules from the previous section with the actor-critic loss function of Espeholt et al. (2018). More specifically, the inner loss is a parameterised version of the IMPALA loss with metaparameters η, based on Eq. (3),

L(θ; η) = 0.5 · (v_s − V_θ(x_s))² − ρ_s · log π_θ(a_s|x_s)(r_s + γ v_{s+1} − V_θ(x_s)) − 0.01 · Σ_a π_θ(a|x_s) log π_θ(a|x_s),

where η = {γ, λ}. Notice that γ and λ also affect the inner loss through the definition of v_s (Eq. (4)).

The outer loss is defined to be the policy gradient loss

L_outer(η) = L_outer(θ̃(η)) = −ρ_s log π_{θ̃(η)}(a_s|x_s)(r_s + γ v_{s+1} − V_{θ̃(η)}(x_s)).

3. Self-Tuning actor-critic agents

We first consider a slightly extended version of the metagradient IMPALA agent (Xu et al., 2018). Specifically, we allow the metagradient to adapt learning rates for individual components in the loss,

L(θ; η) = g_v · (v_s − V_θ(x_s))² − g_p · ρ_s log π_θ(a_s|x_s)(r_s + γ v_{s+1} − V_θ(x_s)) − g_e · Σ_a π_θ(a|x_s) log π_θ(a|x_s).   (5)

The outer loss is defined using Eq. (3),

L_outer(η) = L_outer(θ̃(η)) = g_v^outer · (v_s − V_{θ̃(η)}(x_s))² − g_p^outer · ρ_s log π_{θ̃(η)}(a_s|x_s)(r_s + γ v_{s+1} − V_{θ̃(η)}(x_s)) − g_e^outer · Σ_a π_{θ̃(η)}(a|x_s) log π_{θ̃(η)}(a|x_s) + g_kl^outer · KL(π_{θ̃(η)}, π_θ).   (6)

Notice the new Kullback–Leibler (KL) term in Eq. (6), motivating the η-update not to change the policy too much. Compared to the work by Xu et al. (2018) that only self-tuned the inner-loss γ and λ (hence η = {γ, λ}), also self-tuning the inner-loss coefficients corresponds to setting η = {γ, λ, g_v, g_p, g_e}. These metaparameters allow for loss-specific learning rates and support dynamically balancing exploration with exploitation by adapting the entropy loss weight¹. The hyperparameters of STAC include the initialisations of the metaparameters, the hyperparameters of the outer loss {γ^outer, λ^outer, g_v^outer, g_p^outer, g_e^outer}, the KL coefficient g_kl^outer (set to 1), and the learning rate of the ADAM meta optimizer (set to 1e−3).

        IMPALA                  Self-Tuning IMPALA
θ       v_θ, π_θ                v_θ, π_θ
ζ       {γ, λ, g_v, g_p, g_e}   {γ^outer, λ^outer, g_v^outer, g_p^outer, g_e^outer},
                                metaparameter initialisations, ADAM parameters, g_kl^outer
η       –                       {γ, λ, g_v, g_p, g_e}

Table 1. Parameters in IMPALA and in self-tuning IMPALA.

To set the initial values of the metaparameters of self-tuning IMPALA, we use a simple "rule of thumb" and set them to the values of the corresponding parameters in the outer loss (e.g., the initial value of the inner-loss λ is set equal to the outer-loss λ). For outer-loss hyperparameters that are common to IMPALA we default to the IMPALA settings.

In the next two sections, we show how embracing self-tuning via metagradients enables us to augment this agent with a parameterised Leaky V-trace operator and with self-tuned auxiliary loss functions. These ideas are examples of how the ability to self-tune metaparameters via metagradients can be used to introduce novel ideas into RL algorithms without requiring extensive tuning of the new hyperparameters.

¹There are a few additional subtle differences between this self-tuning IMPALA agent and the metagradient agent from Xu et al. (2018). For example, we do not use the γ embedding used in (Xu et al., 2018). These differences are further discussed in the supplementary material, where we also reproduce the results of Xu et al. (2018) in our code base.
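To make the role of the self-tuned loss coefficients concrete, the sketch below writes out Eq. (5) for a single trajectory, assuming the V-trace targets v_s of Eq. (4) and the truncated weights ρ_s have already been computed; all names and shapes are illustrative.

```python
import jax
import jax.numpy as jnp

def parameterised_inner_loss(values, log_pi_taken, log_pi_all, rewards, vs, rho, eta):
    """Sketch of the parameterised pseudo-loss of Eq. (5).

    values        V_theta(x_s),                           shape [T]
    log_pi_taken  log pi_theta(a_s | x_s),                shape [T]
    log_pi_all    log pi_theta(. | x_s) over all actions, shape [T, A]
    rewards       r_s,                                    shape [T]
    vs            V-trace targets v_s of Eq. (4),         shape [T + 1]
    rho           truncated importance weights rho_s,     shape [T]
    eta           metaparameters {"gamma", "g_v", "g_p", "g_e"};
                  lambda enters only through the targets vs.
    """
    v_s = jax.lax.stop_gradient(vs[:-1])
    v_next = jax.lax.stop_gradient(vs[1:])
    # Critic term: g_v * (v_s - V_theta(x_s))^2.
    critic = eta["g_v"] * jnp.sum((v_s - values) ** 2)
    # Policy term: -g_p * rho_s * log pi(a_s|x_s) * (r_s + gamma * v_{s+1} - V_theta(x_s)).
    adv = jax.lax.stop_gradient(rewards + eta["gamma"] * v_next - values)
    policy = -eta["g_p"] * jnp.sum(rho * log_pi_taken * adv)
    # Entropy term: -g_e * sum_a pi(a|x_s) log pi(a|x_s), as written in Eq. (5).
    entropy_term = -eta["g_e"] * jnp.sum(jnp.exp(log_pi_all) * log_pi_all)
    return critic + policy + entropy_term
```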
3.1. STAC

All the hyperparameters that we have considered for self-tuning so far have the property that they are explicitly defined in the loss function and can be directly differentiated. The truncation levels in the V-trace operator within IMPALA, on the other hand, are equivalent to applying a ReLU activation and are non-differentiable.

Motivated by the study of non-linear activations in Deep Learning (Xu et al., 2015), we now introduce an agent based on a variant of the V-trace operator that we call leaky V-trace. We will refer to this agent as the Self-Tuning Actor Critic (STAC). Leaky V-trace uses a leaky rectifier (Maas et al., 2013) to truncate the importance sampling weights, which allows for a small non-zero gradient when the unit is saturated. We show that the degree of leakiness can control certain trade-offs in off-policy learning, similarly to V-trace, but in a manner that is differentiable.

Before we introduce Leaky V-trace, let us first recall how the off-policy trade-offs are represented in V-trace using the coefficients ρ̄, c̄. The weight ρ_t = min(ρ̄, π(a_t|x_t)/µ(a_t|x_t)) appears in the definition of the temporal difference δ_t V and defines the fixed point of this update rule. The fixed point of this update is the value function V^{π_ρ̄} of the policy π_ρ̄ that is somewhere between the behaviour policy µ and the target policy π, controlled by the hyperparameter ρ̄,

π_ρ̄(a|x) = min(ρ̄ µ(a|x), π(a|x)) / Σ_b min(ρ̄ µ(b|x), π(b|x)).   (7)

The product of the weights c_s, ..., c_{t−1} in Eq. (4) measures how much a temporal difference δ_t V observed at time t impacts the update of the value function. The truncation level c̄ is used to control the speed of convergence by trading off the update variance for a larger contraction rate, similar to Retrace(λ) (Munos et al., 2016). By clipping the importance weights, the variance associated with the update rule is reduced relative to importance-weighted returns. On the other hand, the clipping of the importance weights effectively cuts the traces in the update, resulting in the update placing less weight on later TD errors, and thus worsening the contraction rate of the corresponding operator.

Following this interpretation of the off-policy coefficients, we now propose a variation of V-trace, which we call leaky V-trace, with new parameters αρ ≥ αc:

IS_t = π(a_t|x_t) / µ(a_t|x_t),
ρ_t = αρ min(ρ̄, IS_t) + (1 − αρ) IS_t,
c_i = λ (αc min(c̄, IS_t) + (1 − αc) IS_t),
v_s = V(x_s) + Σ_{t=s}^{s+n−1} γ^{t−s} (Π_{i=s}^{t−1} c_i) δ_t V,
δ_t V = ρ_t (r_t + γ V(x_{t+1}) − V(x_t)).   (8)

We highlight that for αρ = 1, αc = 1, Leaky V-trace is exactly equivalent to V-trace, while for αρ = 0, αc = 0, it is equivalent to canonical importance sampling. For other values we get a mixture of the truncated and non-truncated importance sampling weights.

Theorem 1 below states that Leaky V-trace is a contraction mapping, and that the value function it converges to is given by V^{π_{ρ̄,αρ}}, where

π_{ρ̄,αρ}(a|x) = [αρ min(ρ̄ µ(a|x), π(a|x)) + (1 − αρ) π(a|x)] / [αρ Σ_b min(ρ̄ µ(b|x), π(b|x)) + 1 − αρ]   (9)

is a policy that mixes (and then re-normalizes) the target policy with the V-trace policy of Eq. (7)². A more formal statement of Theorem 1 and a detailed proof (which closely follows that of Espeholt et al. (2018) for the original V-trace operator) can be found in the supplementary material.

Theorem 1. The leaky V-trace operator defined by Eq. (8) is a contraction operator, and it converges to the value function of the policy defined by Eq. (9).

Similar to ρ̄, the new parameter αρ controls the fixed point of the update rule, and defines a value function that interpolates between the value function of the target policy π and that of the behaviour policy µ. Specifically, the parameter αc allows the importance weights to "leak back", creating the opposite effect to clipping.

Since Theorem 1 requires us to have αρ ≥ αc, our main STAC implementation parametrises the loss with a single parameter α = αρ = αc. In addition, we also experimented with a version of STAC that learns both αρ and αc. Quite interestingly, this variation of STAC learns the rule αρ ≥ αc on its own (see the experiments section for more details).

Note that low values of αc lead to importance sampling, which has high contraction but high variance. On the other hand, high values of αc lead to V-trace, which has lower contraction and lower variance than importance sampling. Thus, exposing αc to meta-learning enables STAC to directly control the contraction/variance trade-off.

In summary, the metaparameters for STAC are {γ, λ, g_v, g_p, g_e, α}. To keep things simple, when using Leaky V-trace we make two simplifications w.r.t. the hyperparameters. First, we use V-trace to initialise Leaky V-trace, i.e., we initialise α = 1. Second, we fix the outer loss to be V-trace, i.e., we set α^outer = 1.

²Note that α-trace (Rowland et al., 2019), another adaptive algorithm for off-policy learning, mixes the V-trace policy with the behaviour policy; Leaky V-trace mixes it with the target policy.
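The following is a minimal sketch of Leaky V-trace (Eq. 8) for a single trajectory, combining the leaky importance weights with the backward recursion that is equivalent to the sum in Eq. (4); all names, shapes, and default truncation levels are illustrative.

```python
import jax.numpy as jnp

def leaky_vtrace_targets(log_pi, log_mu, values, rewards, gamma, lam,
                         alpha_rho, alpha_c, rho_bar=1.0, c_bar=1.0):
    """Minimal sketch of Leaky V-trace (Eq. 8).

    log_pi, log_mu  log-probabilities of the taken actions under pi and mu, shape [T]
    values          V(x_s) for s = 0..T (last entry is the bootstrap value), shape [T+1]
    rewards         r_s,                                                     shape [T]
    alpha_rho = alpha_c = 1 recovers V-trace; alpha_rho = alpha_c = 0 recovers
    untruncated importance sampling.
    """
    is_t = jnp.exp(log_pi - log_mu)                                          # IS_t
    rho = alpha_rho * jnp.minimum(rho_bar, is_t) + (1.0 - alpha_rho) * is_t
    c = lam * (alpha_c * jnp.minimum(c_bar, is_t) + (1.0 - alpha_c) * is_t)
    deltas = rho * (rewards + gamma * values[1:] - values[:-1])              # delta_t V
    # Backward recursion equivalent to the sum in Eq. (8)/(4):
    #   v_s - V(x_s) = delta_s V + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    corrections = [jnp.zeros(())]                                            # v_T - V(x_T) = 0
    for s in reversed(range(rewards.shape[0])):
        corrections.append(deltas[s] + gamma * c[s] * corrections[-1])
    return values + jnp.stack(corrections[::-1])                             # v_s
```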
3.2. STAC with auxiliary tasks (STACX)

Next, we introduce a new agent that extends STAC with auxiliary policies, value functions, and respective auxiliary loss functions; this is new because the parameters that define the auxiliary tasks (the discount factors in this case) are self-tuned. As this agent has a new architecture in addition to an extended set of metaparameters, we give it a different acronym and denote it by STACX (STAC with auXiliary tasks). The auxiliary losses have the same parametric form as the main objective and can be used to regularize the main objective and improve its representations.

[Figure 1: block diagrams of STAC and STACX; a deep residual block processes the observation, followed by one MLP head (STAC) or several MLP heads (STACX).]

Figure 1. Block diagrams of STAC and STACX.

STACX's architecture has a shared representation layer θ_shared, from which it splits into n different heads (Fig. 1). For the shared representation layer we use the deep residual net from (Espeholt et al., 2018). Each head has a policy and a corresponding value function that are represented using a 2-layer MLP with parameters {θ_i}_{i=1}^n. Each one of these heads is trained in the inner loop to minimize a loss function L(θ_i; η_i) parametrised by its own set of metaparameters η_i.

The policy of the STACX agent is defined to be the policy of a specific head (i = 1). The hyperparameters {η_i}_{i=1}^n are trained in the outer loop to improve the performance of this single head. Thus, the role of the auxiliary heads is to act as auxiliary tasks (Jaderberg et al., 2016) and improve the shared representation θ_shared. Finally, notice that each head has its own policy π^i, but the behaviour policy is fixed to be π^1. Thus, to optimize the auxiliary heads we use (Leaky) V-trace for off-policy corrections³.

The metaparameters for STACX are {γ^i, λ^i, g_v^i, g_p^i, g_e^i, α^i}_{i=1}^3. Since the outer loss is defined only w.r.t. head 1, introducing the auxiliary tasks into STACX does not require new hyperparameters for the outer loss. In addition, we use the same initialisation values for all the auxiliary tasks. Thus, STACX has exactly the same hyperparameters as STAC.

[Figure 2: median human-normalized scores over 200M frames for IMPALA, STAC, STACX, the metagradient agent of Xu et al., and a fixed-auxiliary-tasks baseline.]

Figure 2. Median human-normalized scores during training.

4. Experiments

In all of our experiments (with the exception of the robustness experiments in Section 4.3) we use the IMPALA hyperparameters both for the IMPALA baseline and for the outer loss of the STAC agent, i.e., λ^outer = 1, α^outer = 1, g_{v,p,e}^outer = 1. We use γ = 0.995 as it was found to improve the performance of IMPALA considerably (Xu et al., 2018).

4.1. Atari learning curves

We start by evaluating STAC and STACX in the Arcade Learning Environment (Bellemare et al., 2013, ALE). Fig. 2 presents the normalized median scores⁴ during training. We found STACX to learn faster and achieve higher final performance than STAC. We also compare these agents with versions of them without self-tuning (fixing the metaparameters). The version of STACX with fixed unsupervised auxiliary tasks achieved a normalized median score of 247%, similar to that of UNREAL (Jaderberg et al., 2016) but not much better than IMPALA. In Fig. 3 we report the relative improvement of STACX over IMPALA in the individual levels (an equivalent figure for STAC may be found in the supplementary material).

³We also considered two extensions of this approach. (1) Random ensemble: the policy head is chosen at random from [1, .., n], and the hyperparameters are differentiated w.r.t. the performance of each one of the heads in the outer loop. (2) Average ensemble: the actor policy is defined to be the average logits of the heads, and we learn one additional head for the value function of this policy. The metagradient in the outer loop is taken with respect to the actor policy, and/or each one of the heads individually. While these extensions seem interesting, in all of our experiments they always led to a small decrease in performance when compared to our auxiliary task agent without these extensions. Similar findings were reported in (Fedus et al., 2019).

⁴Normalized median scores are computed as follows. For each Atari game, we compute the human normalized score after 200M frames of training and average this over 3 different seeds; we then report the overall median score over the 57 Atari domains.
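Returning to the architecture described in Section 3.2, the following is a minimal sketch of the shared-torso, multi-head structure; layer sizes, initialisers, and names are illustrative rather than the agent's actual network.

```python
import jax
import jax.numpy as jnp

def init_head(key, feature_dim, num_actions, hidden=256):
    """Illustrative 2-layer policy/value head; the real agent's sizes differ."""
    k1, k2, k3 = jax.random.split(key, 3)
    return {
        "w1":   jax.random.normal(k1, (feature_dim, hidden)) / jnp.sqrt(feature_dim),
        "w_pi": jax.random.normal(k2, (hidden, num_actions)) / jnp.sqrt(hidden),
        "w_v":  jax.random.normal(k3, (hidden, 1)) / jnp.sqrt(hidden),
    }

def head_forward(params, features):
    h = jax.nn.relu(features @ params["w1"])
    log_pi = jax.nn.log_softmax(h @ params["w_pi"])   # pi^i(.|x)
    value = jnp.squeeze(h @ params["w_v"], axis=-1)   # V^i(x)
    return log_pi, value

# Three heads on top of a shared representation; head 0 acts (and defines the
# behaviour policy), heads 1 and 2 are auxiliary. Each head i would be trained
# with its own metaparameters eta_i via the parameterised loss of Eq. (5).
key = jax.random.PRNGKey(0)
heads = [init_head(k, feature_dim=512, num_actions=18)
         for k in jax.random.split(key, 3)]
shared_features = jnp.zeros((32, 512))     # stand-in for the residual torso output
outputs = [head_forward(p, shared_features) for p in heads]
behaviour_log_pi, _ = outputs[0]
```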
[Figure 3: per-game bar chart over the 57 Atari games (Road runner, Wizard of wor, Chopper command, ..., Phoenix); x-axis: relative human normalized scores.]

Figure 3. Mean human-normalized scores after 200M frames, relative improvement in percents of STACX over IMPALA.

STACX achieved a median score of 364%, a new state of the art result in the ALE benchmark for training online model-free agents for 200M frames. In fact, there are only two agents that reported better performance after 200M frames: LASER (Schmitt et al., 2019) achieved a normalized median score of 431% and MuZero (Schrittwieser et al., 2019) achieved 731%. These papers propose algorithmic modifications that are orthogonal to our approach and can be combined with it in future work; LASER combines IMPALA with a uniform large-scale experience replay; MuZero uses replay and a tree-based search with a learned model.

4.2. Ablative analysis

Next, we perform an ablative study of our approach by training different variations of STAC and STACX. The results are summarized in Fig. 4. In red, we can see different baselines⁵, and in green and blue, we can see different ablations of STAC and STACX respectively. In these ablations η corresponds to the subset of the hyperparameters that are being self-tuned, e.g., η = {γ} corresponds to self-tuning a single loss function where only γ is self-tuned (and the other hyperparameters are fixed).

Inspecting Fig. 4 we observe that the performance of STAC and STACX consistently improves as they self-tune more metaparameters. These metaparameters control different trade-offs in reinforcement learning: the discount factor controls the effective horizon, the loss coefficients affect learning rates, and the Leaky V-trace coefficient controls the variance-contraction-bias trade-off in off-policy RL.

We have also experimented with adding more auxiliary losses, i.e., having 5 or 8 auxiliary loss functions. These variations performed better than having a single loss function but slightly worse than having 3. This can be further explained by Fig. 9 (Section 4.4), which shows that the auxiliary heads are self-tuned to similar metaparameters.

[Figure 4: horizontal bar chart of median normalized scores for STACX ablations (e.g., {γⁱ, λⁱ, g_vⁱ, g_eⁱ, g_pⁱ, αⁱ}³ᵢ₌₁ at 364%, {γⁱ, λⁱ, g_vⁱ, g_eⁱ, g_pⁱ}³ᵢ₌₁ at 341%), STAC ablations (e.g., {γ, λ, g_v, g_e, g_p, α} at 292%, {γ, λ, g_v, g_e, g_p} at 257%), and baselines (Xu et al. 287%, IMPALA 191%, Nature DQN 95%).]

Figure 4. Ablative studies of STAC (green) and STACX (blue) compared to some baselines (red). Median normalized scores across 57 Atari games after training for 200M frames.

Finally, we experimented with a few more versions of STACX. One variation allows STACX to self-tune both αρ and αc without enforcing the relation αρ ≥ αc. This version performed slightly worse than STACX (achieved a median score of 353%) and we further discuss it in Section 4.4 (Fig. 10). In another variation we self-tune α together with a single truncation parameter ρ̄ = c̄. This variation performed much worse, achieving a median score of 301%, which may be explained by ρ̄ not being differentiable.

⁵We further discuss reproducing the results of Xu et al. (2018) in the supplementary material.
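The median statistic used throughout these comparisons (footnote 4) amounts to the following computation; the reference-score tables and names are illustrative, and the usual (score − random) / (human − random) normalization is assumed.

```python
import numpy as np

def median_human_normalized_score(agent_scores, random_scores, human_scores):
    """agent_scores: {game: [per-seed scores after 200M frames]};
    random_scores, human_scores: {game: reference score}."""
    per_game = []
    for game, seeds in agent_scores.items():
        mean_score = np.mean(seeds)                    # average over the 3 seeds
        per_game.append((mean_score - random_scores[game]) /
                        (human_scores[game] - random_scores[game]))
    return 100.0 * np.median(per_game)                 # median over the 57 games
```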
[Figure 5: line plot; x-axis: percentage of relative improvement (0%-100%); y-axis: number of games; one line per ablation of STAC and STACX.]

Figure 5. The number of games (y-axis) in which ablative variations of STAC and STACX improve the IMPALA baseline by at least x percents (x-axis).

In Fig. 5, we further summarize the relative improvement of ablative variations of STAC (green, yellow, and blue; the bottom three lines) and STACX (light blue and red; the top two lines) over the IMPALA baseline. For each value of x ∈ [0, 100] (the x axis), we measure the number of games in which an ablative version of the STAC(X) agent is better than the IMPALA agent by at least x percent, and subtract from it the number of games in which the IMPALA agent is better than STAC(X). Clearly, we can see that STACX (light blue) improves the performance of IMPALA by a large margin. Moreover, we observe that allowing the STAC(X) agents to self-tune more metaparameters consistently improves their performance in more games.

4.3. Robustness

It is important to note that our algorithm is not hyperparameter free. For instance, we still need to choose hyperparameter settings for the outer loss. Additionally, each hyperparameter in the inner loss that we expose to metagradients still requires an initialization (itself a hyperparameter). Therefore, in this section, we investigate the robustness of STACX to its hyperparameters.

We begin with the hyperparameters of the outer loss. In these experiments we compare the robustness of STACX with that of IMPALA in the following manner. For each hyperparameter (γ, g_v) we select 5 perturbations. For STACX we perturb the hyperparameter in the outer loss (γ^outer, g_v^outer) and for IMPALA we perturb the corresponding hyperparameter (γ, g_v). We randomly selected 5 Atari levels and present the mean and standard deviation across 3 random seeds after 200M frames of training.

Fig. 6 presents the results for the discount factor. We can see that overall (in 72% of the configurations measured by mean), STACX indeed performs better than IMPALA. Similarly, Fig. 7 shows the robustness of STACX to the critic weight (g_v), where STACX improves over IMPALA in 80% of the configurations.

[Figure 6: bar plots of normalized mean per game (Chopper_command, Jamesbond, Defender, Beam_rider, Crazy_climber) for γ ∈ {0.99, 0.992, 0.995, 0.997, 0.999}.]

Figure 6. Robustness to the discount factor. Mean and confidence intervals (over 3 seeds), after 200M frames of training. Left (blue) bars correspond to STACX and right (red) bars to IMPALA. STACX is better than IMPALA in 72% of the runs measured by mean.

[Figure 7: bar plots of normalized mean per game (Chopper_command, Jamesbond, Defender, Beam_rider, Crazy_climber) for g_v ∈ {0.25, 0.35, 0.5, 0.75, 1}.]

Figure 7. Robustness to the critic weight g_v. Mean and confidence intervals (over 3 seeds), after 200M frames of training. Left (blue) bars correspond to STACX and right (red) bars correspond to IMPALA. STACX is better than IMPALA in 80% of the runs measured by mean.

Next, we investigate the robustness of STACX to the initialisation of the metaparameters in Fig. 8. We selected values that are close to 1, as our design principle is to initialise the metaparameters to be similar to the hyperparameters in the outer loss. We observe that overall, the method is quite robust to different initialisations.
[Figure 8: grid of bar plots of the normalized mean for Chopper_command, Jamesbond, Defender, Beam_rider, and Crazy_climber under different metaparameter initialisations.]

Figure 8. Robustness to the initialisation of the metaparameters. Mean and confidence intervals (over 6 seeds), after 200M frames of training. Columns correspond to different games. Bottom: perturbing γ^init ∈ {0.99, 0.992, 0.995, 0.997, 0.999}. Top: perturbing all the metaparameter initialisations, i.e., setting all the hyperparameters {γ^init, λ^init, g_v^init, g_e^init, g_p^init, α^init}³ᵢ₌₁ to a single fixed value in {0.99, 0.992, 0.995, 0.997, 0.999}.

4.4. Adaptivity

In Fig. 9 we visualize the metaparameters of STACX during training. As there are many metaparameters, seeds and levels, we restrict ourselves to a single seed (chosen arbitrarily to be 1) and a single game (Jamesbond). More examples can be found in the supplementary material. For each hyperparameter we plot the value of the hyperparameters associated with the three different heads, where the policy head (head number 1) is presented in blue and the auxiliary heads (2 and 3) are presented in orange and magenta.

Inspecting Fig. 9 we can see that the two auxiliary heads self-tuned their metaparameters to have relatively similar values, but different than those of the main head. The discount factor of the main head, for example, converges to the value of the discount factor in the outer loss (0.995), while the discount factors of the auxiliary heads change quite a lot during training and learn about horizons that differ from that of the main head.

We also observe non-trivial behaviour in self-tuning the coefficients of the loss functions (g_e, g_p, g_v), in the λ coefficient and in the off-policy coefficient α. For instance, we found that at the beginning of training α is self-tuned to a high value (close to 1), so it is quite similar to V-trace; instead, towards the end of training STACX self-tunes it to lower values, which makes it closer to importance sampling.

[Figure 9: learning curves of the metaparameters (γ, λ, g_p, g_v, g_e, α) over 200M frames for heads 1, 2, and 3 in Jamesbond.]

Figure 9. Adaptivity in Jamesbond.

Finally, we also noticed an interesting behaviour in the version of STACX where we expose to self-tuning both the αρ and αc coefficients, without imposing αρ ≥ αc (Theorem 1). This variation of STACX achieved a median score of 353%. Quite interestingly, the metagradient discovered αρ ≥ αc on its own, i.e., it self-tunes αρ so that it is greater than or equal to αc in 91.2% of the time (averaged over time, seeds, and levels), and so that αρ ≥ 0.99αc in 99.2% of the time. Fig. 10 shows an example of this in Jamesbond.

[Figure 10: learning curves of αρ and αc for the three heads in Jamesbond over 200M frames.]

Figure 10. Discovery of αρ ≥ αc.

5. Summary

In this work we demonstrate that it is feasible to use metagradients to simultaneously tune many critical hyperparameters (controlling important trade-offs in a reinforcement learning agent), as long as they are differentiable; we show that this can be done online, within a single lifetime. We do so by presenting STAC and STACX, actor-critic algorithms that self-tune a large number of hyperparameters of very different nature. We showed that the performance of these agents improves as they self-tune more hyperparameters, and we demonstrated that STAC and STACX are computationally efficient and robust to their own hyperparameters.

References

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1):281–305, February 2012. ISSN 1532-4435.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G., and Larochelle, H. Hyperbolic discounting and learning over multiple horizons. arXiv preprint arXiv:1902.06865, 2019.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1165–1173. JMLR.org, 2017.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181–1189, 2015.

Maas, A. L., Hannun, A. Y., and Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.

Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122, 2015.

Mann, T. A., Penedones, H., Mannor, S., and Hester, T. Adaptive lambda least-squares temporal difference learning. arXiv preprint arXiv:1612.09465, 2016.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.

Pedregosa, F. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pp. 737–746, 2016.

Rowland, M., Dabney, W., and Munos, R. Adaptive trade-offs in off-policy learning. arXiv preprint arXiv:1910.07478, 2019.

Schaul, T., Borsa, D., Ding, D., Szepesvari, D., Ostrovski, G., Dabney, W., and Osindero, S. Adapting behaviour for learning progress. arXiv preprint arXiv:1912.06910, 2019.

Schmitt, S., Hessel, M., and Simonyan, K. Off-policy actor-critic with shared experience replay. arXiv preprint arXiv:1909.11583, 2019.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.

Sutton, R. S. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, pp. 171–176, 1992.

Veeriah, V., Hessel, M., Xu, Z., Rajendran, J., Lewis, R. L., Oh, J., van Hasselt, H. P., Silver, D., and Singh, S. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pp. 9306–9317, 2019.

White, M. and White, A. A greedy approach to adapting the trace parameter for temporal difference learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 557–565, 2016.

Xu, B., Wang, N., Chen, T., and Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.

Xu, Z., van Hasselt, H. P., and Silver, D. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2396–2407, 2018.

Young, K., Wang, B., and Taylor, M. E. Metatrace: Online step-size tuning by meta-gradient descent for reinforcement learning control. arXiv preprint arXiv:1805.04514, 2018.

Zheng, Z., Oh, J., and Singh, S. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 4644–4654, 2018.

6. Additional results

[Figure 11: per-game bar chart over the 57 Atari games (Jamesbond, Wizard of wor, Chopper command, ..., Skiing); x-axis: relative human normalized scores.]

Figure 11. Mean human-normalized scores after 200M frames, relative improvement in percents of STAC over the IMPALA baseline.

7. Analysis of Leaky V-trace


Define the Leaky V-trace operator R̃:

R̃V(x) = V(x) + E_µ[ Σ_{t≥0} γ^t (Π_{i=0}^{t−1} c̃_i) ρ̃_t (r_t + γV(x_{t+1}) − V(x_t)) | x_0 = x, µ ],   (10)

where the expectation E_µ is with respect to the behaviour policy µ which has generated the trajectory (x_t)_{t≥0}, i.e., x_0 = x, x_{t+1} ∼ P(·|x_t, a_t), a_t ∼ µ(·|x_t). Similar to (Espeholt et al., 2018), we consider the infinite-horizon operator, but very similar results hold for the n-step truncated operator.

Let

IS(x_t) = π(a_t|x_t) / µ(a_t|x_t)

be the importance sampling weights, let

ρ_t = min(ρ̄, IS(x_t)),   c_t = min(c̄, IS(x_t))

be the truncated importance sampling weights with ρ̄ ≥ c̄, and let

ρ̃_t = αρ ρ_t + (1 − αρ) IS(x_t),   c̃_t = αc c_t + (1 − αc) IS(x_t)

be the Leaky importance sampling weights with leaky coefficients αρ ≥ αc.

Theorem 2 (Restatement of Theorem 1). Assume that there exists β ∈ (0, 1] such that E_µ ρ_0 ≥ β. Then the operator R̃ defined by Eq. (10) has a unique fixed point V^{π_{ρ̄,αρ}}, which is the value function of the policy π_{ρ̄,αρ} defined by

π_{ρ̄,αρ}(a|x) = [αρ min(ρ̄ µ(a|x), π(a|x)) + (1 − αρ) π(a|x)] / Σ_b [αρ min(ρ̄ µ(b|x), π(b|x)) + (1 − αρ) π(b|x)].

Furthermore, R̃ is an η̃-contraction mapping in sup-norm, with

η̃ = γ⁻¹ − (γ⁻¹ − 1) E_µ[ Σ_{t≥0} γ^t (Π_{i=0}^{t−2} c̃_i) ρ̃_{t−1} ] ≤ 1 − (1 − γ)(αρ β + 1 − αρ) < 1,

where c̃_{−1} = 1, ρ̃_{−1} = 1, and Π_{s=0}^{t−2} c̃_s = 1 for t = 0, 1.

Proof. The proof follows the proof of V-trace from (Espeholt et al., 2018), with adaptations for the leaky V-trace coefficients. We have that

R̃V_1(x) − R̃V_2(x) = E_µ Σ_{t≥0} γ^t (Π_{s=0}^{t−2} c̃_s) [ρ̃_{t−1} − c̃_{t−1} ρ̃_t] (V_1(x_t) − V_2(x_t)).

Denote by κ̃_t = ρ̃_{t−1} − c̃_{t−1} ρ̃_t, and notice that

E_µ ρ̃_t = αρ E_µ ρ_t + (1 − αρ) E_µ IS(x_t) ≤ 1,

since E_µ IS(x_t) = 1 and therefore E_µ ρ_t ≤ 1. Furthermore, since ρ̄ ≥ c̄ and αρ ≥ αc, we have that ρ̃_t ≥ c̃_t for all t. Thus, the coefficients κ̃_t are non-negative in expectation, since

E_µ κ̃_t = E_µ [ρ̃_{t−1} − c̃_{t−1} ρ̃_t] ≥ E_µ c̃_{t−1} (1 − ρ̃_t) ≥ 0.

Thus, R̃V_1(x) − R̃V_2(x) is a linear combination of the differences V_1 − V_2 at the other states, weighted by non-negative coefficients whose sum is

Σ_{t≥0} γ^t E_µ (Π_{s=0}^{t−2} c̃_s) [ρ̃_{t−1} − c̃_{t−1} ρ̃_t]
  = Σ_{t≥0} γ^t E_µ (Π_{s=0}^{t−2} c̃_s) ρ̃_{t−1} − Σ_{t≥0} γ^t E_µ (Π_{s=0}^{t−1} c̃_s) ρ̃_t
  = γ⁻¹ − (γ⁻¹ − 1) Σ_{t≥0} γ^t E_µ (Π_{s=0}^{t−2} c̃_s) ρ̃_{t−1}
  ≤ γ⁻¹ − (γ⁻¹ − 1)(1 + γ E_µ ρ̃_0)   (11)
  = 1 − (1 − γ) E_µ ρ̃_0
  = 1 − (1 − γ) E_µ (αρ ρ_0 + (1 − αρ) IS(x_0))
  ≤ 1 − (1 − γ)(αρ β + 1 − αρ) < 1,   (12)

where Eq. (11) holds since we expanded only the first two elements in the sum, and all the elements in this sum are positive, and Eq. (12) holds by the assumption E_µ ρ_0 ≥ β.

We deduce that ‖R̃V_1 − R̃V_2‖_∞ ≤ η̃ ‖V_1 − V_2‖_∞, with η̃ = 1 − (1 − γ)(αρ β + 1 − αρ) < 1, so R̃ is a contraction mapping. Furthermore, we can see that the parameter αρ controls the contraction rate: for αρ = 1 we get the contraction rate of V-trace, η̃ = 1 − (1 − γ)β, and as αρ gets smaller we get better contraction, until with αρ = 0 we get η̃ = γ.

Thus R̃ possesses a unique fixed point. Let us now prove that this fixed point is V^{π_{ρ̄,αρ}}, where

π_{ρ̄,αρ}(a|x) = [αρ min(ρ̄ µ(a|x), π(a|x)) + (1 − αρ) π(a|x)] / Σ_b [αρ min(ρ̄ µ(b|x), π(b|x)) + (1 − αρ) π(b|x)]   (13)

is a policy that mixes the target policy with the V-trace policy. We have:

E_µ [ ρ̃_t (r_t + γV^{π_{ρ̄,αρ}}(x_{t+1}) − V^{π_{ρ̄,αρ}}(x_t)) | x_t ]
  = E_µ [ (αρ ρ_t + (1 − αρ) IS(x_t)) (r_t + γV^{π_{ρ̄,αρ}}(x_{t+1}) − V^{π_{ρ̄,αρ}}(x_t)) | x_t ]
  = Σ_a µ(a|x_t) [ αρ min(ρ̄, π(a|x_t)/µ(a|x_t)) + (1 − αρ) π(a|x_t)/µ(a|x_t) ] (r_t + γV^{π_{ρ̄,αρ}}(x_{t+1}) − V^{π_{ρ̄,αρ}}(x_t))
  = Σ_a [ αρ min(ρ̄ µ(a|x_t), π(a|x_t)) + (1 − αρ) π(a|x_t) ] (r_t + γV^{π_{ρ̄,αρ}}(x_{t+1}) − V^{π_{ρ̄,αρ}}(x_t))
  = [ Σ_a π_{ρ̄,αρ}(a|x_t) (r_t + γV^{π_{ρ̄,αρ}}(x_{t+1}) − V^{π_{ρ̄,αρ}}(x_t)) ] · [ Σ_b (αρ min(ρ̄ µ(b|x_t), π(b|x_t)) + (1 − αρ) π(b|x_t)) ] = 0,

where the first factor of the last expression (i.e., the expression up to the summation on b) equals zero since this is the Bellman equation for V^{π_{ρ̄,αρ}}. We deduce that R̃V^{π_{ρ̄,αρ}} = V^{π_{ρ̄,αρ}}, and thus V^{π_{ρ̄,αρ}} is the unique fixed point of R̃.
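As a quick sanity check of Eq. (13), the following sketch verifies numerically, on an arbitrary single-state example, that the leaky fixed-point policy reduces to the V-trace policy of Eq. (7) at αρ = 1 and to the target policy at αρ = 0; all values are illustrative.

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.2])      # behaviour policy at one state
pi = np.array([0.1, 0.2, 0.7])      # target policy at the same state
rho_bar = 1.0

def leaky_policy(alpha_rho):
    unnorm = alpha_rho * np.minimum(rho_bar * mu, pi) + (1.0 - alpha_rho) * pi
    return unnorm / unnorm.sum()     # Eq. (13)

vtrace_policy = np.minimum(rho_bar * mu, pi) / np.minimum(rho_bar * mu, pi).sum()  # Eq. (7)
assert np.allclose(leaky_policy(1.0), vtrace_policy)   # alpha_rho = 1: V-trace policy
assert np.allclose(leaky_policy(0.0), pi)              # alpha_rho = 0: target policy
```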

8. Reproducibility

Inspecting the results in Fig. 4, one may notice that there are small differences between the results of IMPALA, and of using metagradients to tune only λ, γ, compared to the results that were reported in (Xu et al., 2018).

We investigated the possible reasons for these differences. First, our method was implemented in a different code base. Our code is written in JAX, compared to the implementation in (Xu et al., 2018) that was written in TensorFlow. This may explain the small difference in final performance between our IMPALA baseline (Fig. 4) and the result of Xu et al., which is slightly higher (257.1).

Second, Xu et al. observed that embedding the γ hyperparameter into the θ network improved their results significantly, reaching a final performance (when learning γ, λ) of 287.7 (see section 1.4 in (Xu et al., 2018) for more details). Our method, on the other hand, only achieved a score of 240 in this ablative study. We further investigated this difference by introducing the γ embedding into our architecture. With γ embedding, our method achieved a score of 280.6, which almost reproduces the results in (Xu et al., 2018). We then introduced the same embedding mechanism into our model with auxiliary loss functions. In this case, for auxiliary loss i we embed γ_i. We experimented with two variants, one that shares the embedding weights across the auxiliary tasks and one that learns a specific embedding for each auxiliary task. Both of these variants performed similarly (306.8 and 307.7, respectively), which is better than our previous result with embedding and without auxiliary losses, which achieved 280.6. The auxiliary loss architecture, however, actually performed better without the embedding (353.4), and we therefore ended up not using the γ embedding in our architecture. We leave it to future work to further investigate methods of combining the embedding mechanisms with the auxiliary loss functions.

9. Individual level learning curves


1.00 1.00 1.0
gp 1.0
gv 1.00
ge 1.00 25000
reward
0.98
alien, seed=1

0.98 0.98 0.8 20000


0.8 0.98
0.96 0.96
0.96 15000
0.6 0.6 0.96
0.94 0.94
0.94
0.4 0.4 0.92 10000
0.92 0.94
0.92 0.90
0.90 0.2 0.2 5000
0.92
0.90 0.88
0.88 0.0 0.0 0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.0 0.5 1.0 1.5 2.0
1e8 1e8 1e8 1e8 1e8 1e8 1e8

1.000 1.0 1.0 1.0 1.0 1.00


30000
0.998
alien, seed=2

0.8 0.8 0.9 0.98 25000


0.996 0.8
0.8 0.96 20000
0.994 0.6 0.6
0.6 0.94
0.992 0.7 15000
0.4 0.4
0.990 0.4 0.6 0.92 10000
0.988 0.2 0.2
0.90 5000
0.5
0.986 0.2
0.0 0.0 0.88 0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.0 0.5 1.0 1.5 2.0
1e8 1e8 1e8 1e8 1e8 1e8 1e8

1.000 1.0 1.0 1.0 1.00 1.00


0.998 0.9 20000
alien, seed=3

0.8 0.8 0.95 0.98


0.996 0.8
0.96 15000
0.994 0.7 0.6 0.6 0.90
0.992 0.6 0.94 10000
0.4 0.4 0.85
0.990
0.5 0.92
0.988 0.2 0.2 0.80 5000
0.4
0.986 0.90
0.3 0.0 0.0 0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.75 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.0 0.5 1.0 1.5 2.0
1e8 1e8 1e8 1e8 1e8 1e8 1e8

0.998 1.000 1.0 1.000 1.0 1.0


1400
amidar, seed=1

0.996 0.9
0.995 0.995 1200
0.8 0.8 0.8
0.994 0.990 1000
0.7 0.990
0.6
0.992 0.985 0.6 0.6 800
0.985
0.990 0.980 0.5 0.4 600
0.980 0.4
0.988 0.975 0.4 400
0.2
0.3 0.975 200
0.986 0.970 0.2
0.2 0.0 0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.0 0.5 1.0 1.5 2.0
1e8 1e8 1e8 1e8 1e8 1e8 1e8

0.996 1.00 1.0 1.00 1.0 1.0


amidar, seed=2

0.98 800
0.994 0.98 0.8
0.8 0.8
0.96
0.992 0.96 600
0.94 0.6
0.990 0.6 0.94 0.6
0.92 400
0.4
0.988 0.92
0.90 0.4 0.4
0.986 0.88 0.90 0.2 200
0.984 0.86 0.2 0.88 0.2
0.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.0 0.5 1.0 1.5 2.0
1e8 1e8 1e8 1e8 1e8 1e8 1e8

0.996 1.00 1.0 1.00 1.0 1.0


amidar, seed=3

0.994 0.9 0.9 2000


0.98 0.8 0.98 0.8
0.8
0.992 0.7 1500
0.7 0.6
0.96 0.6 0.96
0.990 0.6
0.5 1000
0.5 0.4
0.94 0.94
0.988 0.4
0.4 500
0.3 0.2
0.986 0.92 0.92 0.3
0.2
0.0 0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.0 0.5 1.0 1.5 2.0
1e8 1e8 1e8 1e8 1e8 1e8 1e8

1.00 1.0 1.000 1.0 1.0


0.996 30000
assault, seed=1

0.98 0.975
0.994 0.8 0.8 0.8 25000
0.950
0.96
0.992 0.6 0.925 0.6 0.6 20000
0.94 0.900 15000
0.990 0.4 0.875 0.4 0.4
0.92
10000
0.988 0.850
0.90 0.2 0.2 0.2
0.825 5000
0.986 0.88 0.800 0.0 0
0.0 0.0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.0 0.5 1.0 1.5 2.0
1e8 1e8 1e8 1e8 1e8 1e8 1e8

1.00 1.0 1.0 1.0 1.0


0.998 30000
assault, seed=2

0.99 0.8 0.9


0.996 0.8 0.8 25000
0.994 0.98
0.8 20000
0.6 0.6 0.6
0.992 0.97
0.7 15000
0.990 0.96 0.4 0.4 0.4
0.6 10000
0.988 0.95 0.2
0.2 0.2 5000
0.986 0.5
0.94
0.0 0.0 0.0 0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.0 0.5 1.0 1.5 2.0
1e8 1e8 1e8 1e8 1e8 1e8 1e8

0.998 1.000 1.0 1.000 1.0 1.0


30000
assault, seed=3

0.996 0.995
0.995 0.8 0.8 0.8 25000
0.994 0.990
0.990 0.6 0.985 0.6 0.6 20000
0.992
0.980 15000
0.990 0.985 0.4 0.4 0.4
0.975 10000
0.988
0.2 0.970 0.2 0.2
0.980 5000
0.986
0.965
0.0 0.0 0
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.0 0.5 1.0 1.5 2.0
1e8 1e8 1e8 1e8 1e8 1e8 1e8

1.00 1.00 1.0 1.0 1.0 1.00 1000000


asterix, seed=1

0.9
0.95 0.9 0.95 800000
0.99 0.8 0.8
0.90 0.8 0.7 0.90 600000
0.98 0.6
0.85 0.7 0.6 0.85
0.5 400000
0.80 0.97 0.4 0.6 0.80
0.4
200000
0.75 0.5 0.3 0.75
0.96 0.2
0.4 0.2 0
0.70 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.0 0.5 1.0 1.5 2.0
1e8 1e8 1e8 1e8 1e8 1e8 1e8

1.00 1.0 1.0


1.00 1.00 1.00
asterix, seed=2

0.9 0.95 800000


0.98 0.99 0.99
0.8 0.8 0.90
0.96 0.98 0.7 0.85 600000
0.98 0.6
0.6 0.80
0.94 0.97 400000
0.97 0.5 0.75
0.4
0.96 0.4 0.70 200000
0.92 0.96
0.2 0.3 0.65
0.95 0
0.90 0.95
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.0 0.5 1.0 1.5 2.0
[Figure 12, continued: per-game curves of the meta-parameters gp, gv, ge and of reward over 200 million frames (x-axis in units of 1e8 frames). Panels in this portion: asterix (seed 3); asteroids (seeds 1–3); atlantis (seeds 1–3).]
Figure 12. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.

[Figure 13 panels: curves of the meta-parameters gp, gv, ge and of reward over 200 million frames (x-axis in units of 1e8 frames) for bank_heist, battle_zone, beam_rider, berzerk, bowling, and boxing (seeds 1–3 each).]
Figure 13. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.

[Figure 14 panels: curves of the meta-parameters gp, gv, ge and of reward over 200 million frames (x-axis in units of 1e8 frames) for breakout, centipede, chopper_command, crazy_climber, defender, and demon_attack (seeds 1–3 each).]
Figure 14. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.

[Figure 15 panels: curves of the meta-parameters gp, gv, ge and of reward over 200 million frames (x-axis in units of 1e8 frames) for double_dunk, enduro, fishing_derby, freeway, frostbite, and gopher (seeds 1–3 each).]
Figure 15. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.

[Figure 16 panels: curves of the meta-parameters gp, gv, ge and of reward over 200 million frames (x-axis in units of 1e8 frames) for gravitar, hero, ice_hockey, jamesbond, kangaroo, and krull (seeds 1–3 each).]
Figure 16. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.
[Figure 17 panels: curves of the meta-parameters gp, gv, ge and of reward over 200 million frames (x-axis in units of 1e8 frames) for kung_fu_master, montezuma_revenge, ms_pacman, name_this_game, phoenix, and pitfall (seeds 1–3 each).]
Figure 17. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.

[Figure 18 panels: curves of the meta-parameters gp, gv, ge and of reward over 200 million frames (x-axis in units of 1e8 frames) for pong, private_eye, qbert, riverraid, road_runner, and robotank (seeds 1–3 each).]
Figure 18. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.
[Plot panels: per-head meta-parameters gp, gv, ge and episodic reward during training for seaquest, skiing, solaris, space_invaders, star_gunner, and surround (seeds 1-3 each); x-axis is environment frames (x1e8).]

Figure 19. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.
[Plot panels: per-head meta-parameters gp, gv, ge and episodic reward during training for tennis, time_pilot, tutankham, up_n_down, venture, and video_pinball (seeds 1-3 each); x-axis is environment frames (x1e8).]

Figure 20. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.
[Plot panels: per-head meta-parameters gp, gv, ge and episodic reward during training for wizard_of_wor, yars_revenge, and zaxxon (seeds 1-3 each); x-axis is environment frames (x1e8).]

Figure 21. Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads; blue is the main (policy) head.
